The Token Economy Is Breaking
by Josh Oransky
For two years, hosted AI felt like a flat-rate utility. You bought a Copilot seat, an Anthropic Pro subscription, an OpenAI Team plan, and you stopped thinking about what your developers were actually consuming. That window is closing. The pricing model that got the industry to a billion users is not the pricing model that will get it to economic sustainability, and the gap is starting to show up in your bill.
What changed in spring
GitHub announced Copilot moves to per-token billing on June 1. Anthropic ended open-ended subscription limits and pushed heavy users to pay-as-you-go through the API. The pattern is consistent across vendors. Flat-rate gives way to metered consumption as soon as agentic usage proves unsustainable at the subscription tier.
The interesting number sits underneath. Gartner's working estimate, via The Verge: AI providers need consumption to grow fifty- to a hundred-thousand fold by 2030 to sustain a ten-percent margin on the trillions being poured into data centres. Some of that growth will materialise. Some of it will come out of your operating budget.
The agentic tax
Agents do not just use tokens. They burn them. Every backtrack reloads the full conversation. Every retry replays the whole context. A single multi-hour autonomous task can consume what a year of chat-style usage would. The cost intuition formed during the chat era does not survive contact with production agents.
Mario Rodriguez, CPO of GitHub, put it plainly: "A quick chat question and a multi-hour autonomous coding session can cost the user the same amount." The flat-rate accounting that hid this is going away. The cost is about to become legible to your CFO.
What "ownership" buys you
Self-hosted open-weight inference is no longer a fringe option. Llama, Mistral, Qwen, and DeepSeek models match or beat frontier hosted models on narrow, well-defined work. The hardware to run them is commodity. A workstation-class GPU pays for itself against a single quarter of hosted spend at scale.
The thing you actually buy when you own your inference is not "free tokens." It is the right to model your bill before you commit to a workflow. Hosted token billing gives you a metered cost that scales with usage in ways nobody priced for in 2024. Owned compute gives you a fixed cost that scales with how many machines you bought. Those are very different financial postures.
The right way to decide
Do not migrate everything off hosted models because the bill scares you. Run a four-week benchmark on one workflow. Measure speed, quality, refusal behaviour, and cost on YOUR data, not a public eval. The winner tells you which side of the line that workflow belongs on.
If self-hosted clears the bar, ship it. If it doesn't, you've validated the hosted spend with numbers, not hope. Either way, you stop guessing.