Stop Paying for Compute You Don't Use

by Josh Oransky

Every enterprise AI bill I've seen in the past year has three lines on it that should not exist. They're not malicious. They're the boring kind of waste that accumulates when a team adopts hosted inference faster than they audit it. They're also worth real money. Stop paying for them.

Three line items to delete

One: idle subscription seats. A Copilot seat costs around twenty dollars a month. A team of three hundred where forty percent log in less than once a week is wasting twenty-four hundred dollars a month on seats that aren't generating output. Audit usage. Reclaim the unused seats. Make people request them back if they want them. The seats that get reclaimed quietly are the ones that were never needed.
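
The seat arithmetic is simple enough to script. What follows is a minimal sketch, not a real integration: the Seat record and its logins_last_4_weeks field are hypothetical stand-ins for whatever your vendor's usage export provides, and the twenty-dollar price is the rough figure used above.

```python
from dataclasses import dataclass


@dataclass
class Seat:
    user: str
    logins_last_4_weeks: int  # hypothetical field from a usage export


def idle_seats(seats, min_logins_per_week=1.0, weeks=4):
    """Return seats whose average weekly logins fall below the threshold."""
    return [s for s in seats if s.logins_last_4_weeks / weeks < min_logins_per_week]


def monthly_waste(seats, seat_price=20.0):
    """Monthly spend on seats that count as idle (price is an assumption)."""
    return len(idle_seats(seats)) * seat_price


# Illustrative data only: 300 seats, 40 percent of them never logged in.
seats = [Seat(f"user{i}", 0 if i < 120 else 12) for i in range(300)]
print(monthly_waste(seats))  # 2400.0
```

The threshold is the part worth arguing about. Once a week is the cut-off used here; pick your own and rerun the numbers.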

Two: prompt-tax on low-value tasks. Calling Claude to fix a typo in a YAML file is a real habit among engineers. It costs almost nothing per call. It costs real money at scale. Build a small wrapper that routes obvious local tasks to a local model or a regex, and only escalates to the hosted API when the wrapper can't handle it. The first time a team plots their token spend by task type, the result is usually shocking.
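
A wrapper like that can be very small. The sketch below assumes each local rule returns a result when it can handle the input and None when it can't. The Router class, the two regex rules, and fake_hosted are all illustrative names, and fake_hosted is a stub where your real hosted API call would go.

```python
import re
from collections import Counter


def fix_yaml_tabs(text):
    """Replace leading tabs with two spaces each; YAML forbids tab indentation."""
    if not re.search(r"^\t+", text, flags=re.MULTILINE):
        return None  # nothing for this rule to do
    return re.sub(r"^\t+", lambda m: "  " * len(m.group()), text, flags=re.MULTILINE)


def strip_trailing_whitespace(text):
    """Remove spaces and tabs at the end of each line."""
    if not re.search(r"[ \t]+$", text, flags=re.MULTILINE):
        return None
    return re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)


def fake_hosted(text):
    """Stand-in for the hosted API call; replace with your own client."""
    return text


class Router:
    """Try cheap local rules in order; escalate only when none applies."""

    def __init__(self, local_rules, hosted):
        self.local_rules = local_rules
        self.hosted = hosted
        self.counts = Counter()  # calls per route, for plotting by task type

    def run(self, text):
        for rule in self.local_rules:
            result = rule(text)
            if result is not None:
                self.counts[rule.__name__] += 1
                return result
        self.counts["hosted"] += 1
        return self.hosted(text)


router = Router([fix_yaml_tabs, strip_trailing_whitespace], hosted=fake_hosted)
router.run("key:\n\tvalue: 1\n")   # handled by fix_yaml_tabs
router.run("name: x   \n")         # handled by strip_trailing_whitespace
router.run("rename this service")  # no rule matches, goes to hosted
print(dict(router.counts))
# {'fix_yaml_tabs': 1, 'strip_trailing_whitespace': 1, 'hosted': 1}
```

The counter matters as much as the routing. It is the cheapest way to get the spend-by-task-type plot without touching the billing export.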

Three: re-sending the same context. Agents loop. They backtrack. Every backtrack re-sends the full conversation, including the parts that already worked. If your tooling doesn't cache, your bill grows with every wrong turn the agent takes, and each re-send is larger than the last. Use prompt caching aggressively. Anthropic and OpenAI both support it. Measure your cache hit rate and treat it as a first-class metric.
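
Cache hit rate is easy to define once you have per-call token counts. The sketch below uses a hypothetical CallUsage record; you would fill it from whatever cached and uncached input-token figures your provider reports per request. The simulate_agent helper is an idealised toy: it assumes every earlier turn is served from cache and ignores cache-write surcharges and expiry.

```python
from dataclasses import dataclass


@dataclass
class CallUsage:
    cached_input_tokens: int    # input tokens served from the prompt cache
    uncached_input_tokens: int  # input tokens processed at full price


def cache_hit_rate(calls):
    """Share of all input tokens that were served from cache."""
    cached = sum(c.cached_input_tokens for c in calls)
    total = cached + sum(c.uncached_input_tokens for c in calls)
    return cached / total if total else 0.0


def simulate_agent(turns, tokens_per_turn, caching):
    """Toy model: every turn re-sends the full history plus one new chunk."""
    calls = []
    for turn in range(turns):
        history = turn * tokens_per_turn
        if caching:
            # Idealised: the whole history is a cache read.
            calls.append(CallUsage(history, tokens_per_turn))
        else:
            calls.append(CallUsage(0, history + tokens_per_turn))
    return calls


without = simulate_agent(turns=10, tokens_per_turn=1000, caching=False)
with_cache = simulate_agent(turns=10, tokens_per_turn=1000, caching=True)

print(sum(c.uncached_input_tokens for c in without))     # 55000
print(sum(c.uncached_input_tokens for c in with_cache))  # 10000
print(round(cache_hit_rate(with_cache), 3))              # 0.818
```

Real hit rates will come in lower than the toy's. That gap is the thing to watch week over week.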

The wider point

None of this requires you to migrate off hosted models. None of it requires self-hosted inference. It requires the same thing any other cloud bill requires: somebody on the team who actually looks at the numbers and asks "what is this paying for?"

The first time you do that audit, the savings will be embarrassing. The second time will be smaller. By the third quarter you'll have a healthy baseline and the AI spend will become a normal line item instead of a Q4 surprise.

When self-hosted becomes the answer

If your audit shows that the workflows driving cost are repetitive, high-volume, and narrow (content tagging, alt-text generation, internal search, code review of small diffs), then self-hosted is the right next step. Those are exactly the workflows where an open-weight model on your own hardware can match frontier quality at close to zero marginal cost per call.

If your audit shows the cost is concentrated in low-volume, open-ended, hard problems (strategy, codebase-spanning refactors, document generation), keep paying. That's where frontier hosted models earn their margin.

The mistake is assuming one model fits all workloads. It doesn't. Your bill knows it.