/ ai
Stop renting tokens.
Own the model.
AI built so the bill stops growing with usage. Open-weight models on hardware you control: sized to your workload, benchmarked against your actual work, deployed inside your perimeter. Same outcomes as the frontier hosted APIs on the work that matters to you. None of the per-token tax.
01 economics
The token economy
is breaking.
For two years, hosted AI felt like a flat-rate utility. That window is closing. Per-token billing is replacing open-ended subscriptions, agentic workloads are burning tokens faster than anyone priced for, and the margin math behind the data-centre buildout assumes consumption growth nobody is on track to deliver. The cost of "let's just call the API" is going up, and it's going to keep going up.
Pricing shift
GitHub Copilot moves to per-token billing on June 1. Anthropic ended open-ended subscription limits and pushed heavy users to pay-as-you-go through the API. The pattern is consistent across vendors: flat-rate gives way to metered consumption as soon as agentic usage proves unsustainable at the subscription tier.
The agentic tax
Agents don't just use tokens. They burn them. Every backtrack reloads the full conversation. Every retry replays the whole context. A single multi-hour autonomous task can consume what a year of chat conversations would. The chat-era pricing intuition does not survive contact with production agents.
The margin math
Gartner's working number, via The Verge: AI providers need consumption to grow fifty- to a hundred-thousand-fold by 2030 to sustain a ten-percent margin on the trillions being poured into data centres. Some of that growth will materialise. Some of it will come out of your operating budget.
A quick chat question and a multi-hour autonomous coding session can cost the user the same amount.
mario rodriguez · cpo, github
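A rough sketch of why agents burn tokens the way the agentic tax describes. It assumes the simplest possible billing model: every turn resends the whole conversation so far, every retry replays it again, and no prompt caching applies. The function name and all the figures are illustrative, not measurements.

```python
def replayed_input_tokens(turns: int, tokens_per_turn: int, retry_rate: float = 0.0) -> int:
    """Input tokens billed when each turn resends the full conversation so far.

    Assumption: turn k carries k turns' worth of context, so the total is
    tokens_per_turn * (1 + 2 + ... + turns). Retries replay the same context,
    modelled here as a flat multiplier on the total.
    """
    base = tokens_per_turn * turns * (turns + 1) // 2
    return round(base * (1 + retry_rate))


# Hypothetical workloads: a short chat versus a long autonomous session.
chat = replayed_input_tokens(turns=10, tokens_per_turn=500)
agent = replayed_input_tokens(turns=400, tokens_per_turn=2000, retry_rate=0.25)

print(f"chat:  {chat:,} input tokens")
print(f"agent: {agent:,} input tokens")
print(f"ratio: {agent / chat:,.0f}x")
```

The growth is quadratic in the number of turns, which is why chat-era intuition undershoots. Real bills depend on caching, context truncation, and output tokens, none of which this sketch models.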
02 the flip
Same equation.
Different answer.
/ self-hosted
Open weights caught up.
Hardware caught down.
Eighteen months ago, self-hosting an LLM that could compete with frontier APIs meant a server rack and a research team. Today it means commodity hardware running open-weight models (Llama, Mistral, Qwen, DeepSeek) that match or beat hosted frontier models on narrow, well-defined work. The marginal cost of a token drops to near zero: power and upkeep, not a metered rate. The bill becomes the hardware, and the hardware is finite, predictable, and yours.
Laptop-class
Mac mini or MacBook Pro with unified memory. Runs 7B–32B parameter models locally. Right for internal tooling, dev workflows, content authoring agents, low-volume retrieval. Hardware you already have or budget in single-thousands.
Workstation-class
Single high-VRAM GPU or Apple Mac Studio. Runs 30B–70B models with production headroom. Right for site-wide retrieval, mid-traffic on-page assistants, scheduled content workflows. Capital cost in low-tens-of-thousands, paid off against a single quarter of hosted spend.
Server-class
Multi-GPU server, on-prem rack or colo. Runs frontier-tier open-weight models with multi-tenant throughput. Right for high-volume agentic workloads, regulated industries, or any workload where data cannot leave the perimeter. Sized against your traffic, not against a vendor's pricing tier.
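The sizing question behind all three tiers is a payback calculation. A minimal sketch, assuming a one-off hardware cost, a flat monthly running cost (power, hosting, upkeep), and a flat monthly hosted bill. The function name and every figure below are hypothetical; plug in your own.

```python
import math
from typing import Optional


def payback_months(hardware_cost: float,
                   monthly_running_cost: float,
                   monthly_hosted_spend: float) -> Optional[int]:
    """Whole months until the hardware has paid for itself against hosted spend.

    Returns None when self-hosting never pays back, i.e. when running the
    hardware costs at least as much per month as the hosted bill it replaces.
    """
    monthly_saving = monthly_hosted_spend - monthly_running_cost
    if monthly_saving <= 0:
        return None
    return math.ceil(hardware_cost / monthly_saving)


# Hypothetical workstation-class example.
months = payback_months(hardware_cost=24_000,
                        monthly_running_cost=500,
                        monthly_hosted_spend=8_500)
print(f"payback in {months} months")
```

If the function returns None, the hosted bill is already cheaper than running the box, and the honest answer is to stay hosted.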
/ proof
Predictable bill.
Private by default.
03 method
Four-week benchmark.
Then production.
The right model for your workload is the one you have actually measured on your data. We run a fixed four-week engagement that puts frontier hosted models head-to-head against two or three open-weight candidates on the work you actually do, then produces a decision document with the numbers behind it. If self-hosted loses, you've validated the hosted spend. If it wins, we ship the production path.
Pick the workflow
Narrow, measurable, business-critical. One workflow per benchmark, not "AI strategy" in the abstract. The workflow becomes the rubric: success looks like this, failure looks like that, and everyone agrees before we start.
Define the bar
What today's frontier hosted model does on this workflow becomes the baseline. We measure speed, quality, refusal behaviour, and cost on your data, not on a public eval. The bar is what you would ship to production with confidence.
Run side-by-side
Two or three open-weight candidates against the frontier baseline. Same prompts, same data, same evaluation harness. Quantised, fine-tuned, or retrieval-augmented variants where the workload calls for them. End of week three: a decision document with the numbers.
Ship the winner
If self-hosted clears the bar, we build the production path on the winning candidate: inference hardware sized to traffic, evals running continuously, guardrails calibrated to your risk profile, observability wired into the tools your team already watches. We operate it ourselves for the first quarter, then hand it over to your team.
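The four steps above reduce to a small loop: same cases, same scoring, every candidate. The sketch below is an illustration of that shape, not our production harness. The `Result` fields, `run_benchmark`, `clears_bar`, the tolerance, and the stub models are all hypothetical stand-ins; cost tracking is left out for brevity.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Result:
    name: str
    pass_rate: float       # share of cases that met the rubric
    refusal_rate: float    # share of cases the model declined
    mean_latency_s: float  # average wall-clock seconds per case


def run_benchmark(candidates: Dict[str, Callable[[str], str]],
                  cases: List[Tuple[str, str]],
                  passes: Callable[[str, str], bool],
                  is_refusal: Callable[[str], bool]) -> List[Result]:
    """Run every candidate over the same cases with the same scoring."""
    if not cases:
        raise ValueError("need at least one benchmark case")
    results = []
    for name, generate in candidates.items():
        passed = refused = 0
        elapsed = 0.0
        for prompt, expected in cases:
            start = time.perf_counter()
            output = generate(prompt)
            elapsed += time.perf_counter() - start
            if is_refusal(output):
                refused += 1
            elif passes(output, expected):
                passed += 1
        n = len(cases)
        results.append(Result(name, passed / n, refused / n, elapsed / n))
    return results


def clears_bar(candidate: Result, baseline: Result, tolerance: float = 0.02) -> bool:
    """A candidate clears the bar if its pass rate is within tolerance of the baseline."""
    return candidate.pass_rate >= baseline.pass_rate - tolerance


# Stub models standing in for a hosted baseline and one open-weight candidate.
cases = [("2+2", "4"), ("capital of France", "Paris")]
candidates = {
    "hosted-baseline": lambda p: {"2+2": "4", "capital of France": "Paris"}[p],
    "open-weight-a": lambda p: {"2+2": "4", "capital of France": "I can't help with that"}[p],
}
results = run_benchmark(
    candidates,
    cases,
    passes=lambda out, exp: exp in out,
    is_refusal=lambda out: out.startswith("I can't"),
)
baseline, challenger = results
print(challenger.name, "clears the bar:", clears_bar(challenger, baseline))
```

In a real engagement the `passes` function is the rubric agreed in step one, and the cases come from your data rather than a public eval.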
/ next opening · june
Run the
benchmark.
Four weeks. Fixed scope. A decision document with the numbers, signed by everyone who has to live with the answer. Tell us the workflow. We'll tell you within 48 hours whether it's a good benchmark candidate and what the right comparison set looks like.