YOLO pricing

ChatGPT Plus launched in 2023 with a 20 USD subscription that had no grounding in the actual cost of inference. Like a lot of things in this AI boom, the pricing was mostly based on vibes. That initial price sounded like what most users would be willing to pay for a new product, but a 20 USD flat rate makes no sense considering what it costs to train and run these models.

OpenAI set the trend and everyone followed suit: Microsoft, Google, and Anthropic each created their own Pro plan at the same price. When agentic coding became a thing, it was clear those plans were not sustainable at all, so prices ramped up. The new tiers now cost triple digits, and even that is not enough. Someone is paying that bill, and it's not us, the users.

The end of cheap inference

Of the most popular labs, Anthropic looks to be the one least eager to eat those losses, and they have started to signal that.

In early 2026 they captured a huge amount of mindshare among developers. Claude Opus 4.5 was already strong; then Opus 4.6 had half the internet talking like AGI was around the corner. An amazing model running on a solid harness showed everyone what was possible, and the AI Twitter bubble was raving about it.

A month later the tone changed.

Peak-hour limits are the first visible sign that demand is colliding with serving costs.

Anthropic started to suffer from success, and unlike OpenAI or Google, they do not have such deep pockets. Even if the viral claim that heavy users cost them $5,000 on a $200 Max plan is false (read more here), the economics can still be ugly enough to hurt. That helps explain why they started sending this email to some users:

Starting April 4 at 12pm PT / 8pm BST, you'll no longer be able to use your Claude subscription limits for third-party harnesses including OpenClaw. You can still use them with your Claude account, but they will require extra usage, a pay-as-you-go option billed separately from your subscription. source

Running inference at frontier quality is brutally expensive. Scaling the hardware behind it was always going to be hard. It gets even harder when the physical buildout itself starts slowing down because of force majeure events.

Frontier demand still runs into a very physical bottleneck: data centers do not appear by magic.

If the supply side is constrained and the best models are the most expensive to serve, the market eventually sorts itself.

Cheap frontier models: a myth

This is also why I am skeptical whenever a lab says its next frontier model is too dangerous to release.

Anthropic's Mythos is supposedly too capable, too risky, maybe too good at finding zero-days. Maybe. I buy that risk matters. I also think there is a simpler explanation worth keeping on the table: frontier labs can afford spectacular training runs more easily than mass deployment.


Benchmarks are one thing. Serving the best model to millions of users at sane prices is another.

So yes, safety may be part of the story. Economics may be too. The two explanations can coexist. But if you are building products or workflows around permanent access to models that keep getting better, you are betting on a future where frontier inference stays cheap enough and abundant enough. I would not make that bet now.

From AI-rich to AI-poor

In 2025, a lot of companies went full tokenmaxxing. Token usage leaderboards became a thing and maximizing model usage became a weird badge of honor, sometimes even part of developer KPIs.

It worked, at least for adoption. People got hooked on cheap AI, but now the bills are showing up.

Everyone is looking for ways to keep the productivity gains while cutting the inference bill. That is where this starts to matter for the rest of us, because the patterns that emerge inside big orgs usually trickle down into product packaging.

The likely outcome is simple: more routing, tighter limits, and a clearer separation between everyday models and premium ones.

The advisor pattern

Anthropic more or less said this part out loud. Their own advisor pattern says the quiet part plainly: use Opus sparingly, keep a cheaper executor in the loop.

Pair Opus as an advisor with Sonnet or Haiku as an executor, and get near Opus-level intelligence in your agents at a fraction of the cost.

That is a pretty clear signal. You are not supposed to run Opus for everything.

Use the expensive model when the thinking really matters. Use the cheaper model for execution, iteration, and volume. This is the same logic companies will use in their pricing, and the same logic developers should use in their own stack.
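The split above can be sketched in a few lines. This is an illustrative sketch, not Anthropic's implementation: `call_model` is a placeholder stub standing in for a real API client, and the model names are just labels for the two tiers.

```python
# Advisor/executor split: the expensive "advisor" model is called once per
# task to produce a plan, and the cheap "executor" model handles every step.
# call_model is a placeholder stub, not a real API call.

ADVISOR = "opus"    # expensive, used sparingly for planning
EXECUTOR = "haiku"  # cheap, used for execution and volume

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real API call; returns canned text for the sketch.
    if model == ADVISOR:
        return "1. read the file\n2. apply the fix\n3. run the tests"
    return f"[{model}] done: {prompt}"

def run_task(task: str) -> list[str]:
    # One advisor call to plan...
    plan = call_model(ADVISOR, f"Plan the steps for: {task}")
    steps = [s for s in plan.splitlines() if s.strip()]
    # ...then every step goes to the executor, so cost scales at the cheap rate.
    return [call_model(EXECUTOR, step) for step in steps]

results = run_task("fix the failing test")
print(len(results))  # 3 executor calls for a single advisor call
```

The ratio is the whole point: one expensive call buys a plan, and the N follow-up calls are billed at the cheap tier.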

You probably don't need SOTA to get work done

For most workflows, good mid-tier models are already good enough.

Kimi K2.5, GLM, and a bunch of others have closed a lot of the gap, especially on the work that dominates real usage: writing code, search, summarization, rewriting, extraction, and agent loops that need decent judgment more than absolute peak intelligence.

They deliver most of the value for a fraction of the price.

An Open Secret

Silicon Valley is already running on Chinese open source AI models. Here are some of the receipts:

Kimi 2.5 vs others

Kimi K2.5 sits close enough on intelligence to be interesting, while staying much cheaper than the frontier tier.

Intelligence vs price

Once you plot intelligence against price, the good-enough zone looks a lot more attractive than benchmark prestige.

If your entire workflow depends on permanent, cheap access to frontier APIs, you are tying your lifeline to someone else's subsidy schedule. That is shaky.

Design around quality per dollar, not benchmark prestige. A more durable setup could look like this:

  • Now:

    • use frontier models sparingly

    • route the bulk of work to strong mid-tier models

    • experiment with local models

  • Soon:

    • add local models to the mix where they are good enough
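One way to operationalize "quality per dollar" is a simple cost-aware router: try the cheapest model that clears the task's quality bar, and only reach for the frontier tier when nothing cheaper qualifies. The model names, quality scores, and prices below are invented for the sketch.

```python
# Cost-aware routing sketch: pick the cheapest model whose quality score
# clears the required bar. All names and numbers here are made up.

MODELS = [
    # (name, quality score 0-100, $ per 1M output tokens) -- illustrative values
    ("local-small", 55, 0.0),
    ("mid-tier", 78, 1.5),
    ("frontier", 95, 25.0),
]

def route(required_quality: int) -> str:
    # Walk models from cheapest to priciest; first one that qualifies wins.
    for name, quality, price in sorted(MODELS, key=lambda m: m[2]):
        if quality >= required_quality:
            return name
    # Nothing clears the bar: fall back to the most capable model.
    return MODELS[-1][0]

print(route(50))  # local-small
print(route(70))  # mid-tier
print(route(90))  # frontier
```

The thresholds are the hard part in practice, but even a crude version of this keeps frontier calls reserved for the tasks that actually need them.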

This does not mean smaller models are always enough. Sometimes they still miss obvious things, fall apart on longer reasoning chains, or just feel noticeably worse.

Frontier models still win by far on harder tasks

Conclusion

The point is that SOTA is becoming a luxury tier. That changes how we should think about tooling, product design, and personal workflows.

Andrej Karpathy said it well: "when state-of-the-art LLMs go down, it's an intelligence brownout. The more the world relies on these models, the dumber the outage makes us."

This is especially relevant if you are wary of having your entire organization cut off from a provider, as the big labs have been doing lately. Build for optionality. Mix providers, keep local models in the loop, and design systems that degrade gracefully.
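"Degrade gracefully" can be as simple as a fallback chain: try providers in order of preference and drop down to a local model when one is unavailable. The provider names and the simulated outage below are invented for illustration.

```python
# Provider fallback sketch: one outage should not become a brownout.
# try_provider is a stub that simulates provider-a being down.

def try_provider(name: str, prompt: str) -> str:
    if name == "provider-a":
        raise ConnectionError("provider-a is down")  # simulated outage
    return f"[{name}] {prompt}"

def generate(prompt: str) -> str:
    # Preference order: primary provider, backup provider, then local model.
    for provider in ("provider-a", "provider-b", "local-model"):
        try:
            return try_provider(provider, prompt)
        except ConnectionError:
            continue  # degrade gracefully to the next option
    raise RuntimeError("all providers failed")

print(generate("summarize this doc"))  # served by provider-b
```

The answer from the backup may be a bit worse, but a slightly dumber response beats no response at all.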

Last minute Edit

As I pressed the publish button, Moonshot released Kimi K2.6!
And the vibes are very positive, as it seems the gap is closing even further.
I'll report back once I get more time with it.

Elsewhere in the Latent Space 🌌

  • 🎙️ Our talk with legendary programmer Mitchell Hashimoto went live on the pod last week!

  • Interesting interview with Simon Willison on the state of AI.

  • The Hermes agent is an incredible contender in the personal agent story, with a massive surge in popularity recently.

  • Anthropic is teasing Mythos, a frontier model they are not releasing publicly. The safety issues aside, this fits the broader point of this piece: training a monster and serving a monster are two very different economic problems.

  • Andrej Karpathy's LLM Knowledge Bases post keeps pushing in the same direction we covered in Fixing Agentic Memory. Durable notes, retrieval, and file-based memory keep looking a lot more practical than convoluted memory architectures.

  • TurboQuant is one of the more important recent efficiency stories. Better KV-cache compression means longer practical context windows and cheaper local inference.

  • Gemma 4 is part of the same trend. Smaller open models keep getting more capable, more multimodal, and more realistic to run on consumer hardware. That is exactly the kind of progress that makes the luxury-vs-good-enough split more obvious.

  • AWS just announced Amazon S3 Files. Filesystem-native cloud storage is an underrated agent infrastructure story, because agents are weirdly good when the world looks like files, folders, and paths they can traverse. More agent infrastructure.

  • Mario Zechner's Thoughts on slowing the fuck down should be mandatory reading for anyone going full agent-maxxing. More speed and more automation just means faster mistakes if the supervision layer is sloppy.
