Two headlines hit the engineering community in May 2026 that confirmed what many of us already suspected:
Microsoft cancelled thousands of internal Claude Code licenses across its Experiences & Devices division — the team behind Windows, Office, and Teams — effective June 30, 2026. The experiment had run for barely six months. Engineers loved it. It was too popular. Token-based costs spiralled beyond budget, and now developers are being migrated to GitHub Copilot CLI instead.
Uber’s CTO, Praveen Neppalli Naga, admitted the company burned its entire 2026 AI budget in just four months. After Uber deployed Claude Code to 5,000 engineers, monthly per-engineer API costs hit USD 500-2,000, and Naga said he personally spent USD 1,200 in a single two-hour demo session. Usage exploded (95% of Uber engineers were using AI tools by April), but the economics buckled.
Meanwhile, GitHub announced that Copilot is moving to usage-based billing starting June 1, 2026, shifting from flat-rate plans to metered AI Credits because “a quick chat question and a multi-hour autonomous coding session can cost the same — and that’s no longer sustainable.”
I’ve been thinking about this problem for a while. My thesis: LLMs are expensive not because they’re stupid, but because we’re forcing them to do tasks that a CPU with a deterministic algorithm handles a million times better. This post is about understanding why — technically — and what we should do about it.
The Strawberry Problem
Let me start with a famous example that illustrates the absurdity of using LLMs for everything.
Task: Count the number of ‘r’ characters in the word “strawberry.”
The CPU approach
    char word[] = "strawberry";
    int count = 0;
    /* Walk the string until the terminating '\0', counting matches. */
    for (int i = 0; word[i] != '\0'; i++) {
        if (word[i] == 'r') count++;
    }
    // count = 3
- Steps: ~10 iterations
- Time: ~20 nanoseconds on a modern CPU
- Memory: 11 bytes for the string (10 characters plus the terminating null) + 8 bytes for the two int variables (4 bytes each on typical platforms) = 19 bytes total
- Accuracy: 100%, deterministic, always correct
The LLM approach
To ask an LLM to do this, you first construct an input token sequence:
    System prompt: "You are an expert language processing assistant..."
    Task description: "You will receive a string X and a character c. Count how many times c appears in X."
    Input: "X = strawberry, c = r"
    Output format: "Respond in JSON: {"count": n}"
This prompt gets sent to an inference server. Here’s what happens next:
- Tokenization: “strawberry” becomes approximately 5 tokens, for example str, aw, b, er, ry (the exact split depends on the tokenizer). Notice that the individual ‘r’ characters are hidden inside multi-character tokens. This is why early LLMs got this wrong!
- Embedding: Each token becomes a vector of ~10,000 floating-point numbers. In FP8 format (1 byte per number), that’s 10,000 bytes per token x 5 tokens = 50,000 bytes just to represent the word “strawberry”, versus 11 bytes in C.
- But the prompt is bigger: Claude Code’s actual input includes at minimum 6,000 tokens for the system prompt and 8,000 tokens for tool descriptions. That’s 14,000 tokens of fixed overhead before your actual task even starts.
- KV cache memory: For each token in the input, the transformer needs to store Key (K) and Value (V) vectors for every layer. A large model with 100 layers, processing 100,000 tokens (a typical Claude Code session), needs 2 x 100 layers x 100,000 tokens x 10,000 dimensions = 2 x 10^11 values, or ~200 GB of KV cache memory at 1 byte per value. A single H100 GPU has 80 GB of VRAM. This is why LLM inference is typically memory-bound, not compute-bound.
- Attention is quadratic: Every new token the model generates must “attend to” all previous tokens. With a KV cache, the work per generated token grows linearly with the context length, so the total attention work over a sequence of n tokens is O(n^2): double the context, quadruple the total compute. This is why inference gets progressively slower as the conversation grows.
- Sequential decoding: Tokens are generated one at a time, each requiring a full forward pass through the neural network. A dense transformer costs roughly two floating-point operations per parameter per token, so a frontier model such as Claude Opus (whose parameter count is not public) plausibly needs hundreds of billions of floating-point operations, or more, just to generate a single token.
The result? The LLM might get the answer wrong (many models historically failed this test), it’ll use gigabytes of VRAM on a cluster of H100s, and the whole inference call will take seconds. All to count three letters.
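To make the memory and attention arithmetic above concrete, here is a minimal sketch in C. The function names are mine, and the inputs are the round numbers used in this post, not measurements of any real model.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of KV cache: one K and one V vector per layer per token.
   All parameters are illustrative round numbers, not real model specs. */
uint64_t kv_cache_bytes(uint64_t layers, uint64_t tokens,
                        uint64_t dims, uint64_t bytes_per_value) {
    assert(bytes_per_value > 0);  /* a zero-width value makes no sense */
    return 2ULL * layers * tokens * dims * bytes_per_value;
}

/* Total token-to-token attention lookups when decoding n tokens one at a
   time: token i attends to i tokens (itself and everything before it),
   so the sum is 1 + 2 + ... + n = n(n+1)/2, which is O(n^2). */
uint64_t total_attention_lookups(uint64_t n) {
    return n * (n + 1) / 2;
}
```

Calling `kv_cache_bytes(100, 100000, 10000, 1)` reproduces the 200 GB figure (2 x 10^11 bytes). Switching `bytes_per_value` to 2 for a 16-bit format would double it.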
Why Agentic Coding Compounds the Problem
When you use Claude Code or GitHub Copilot as an agentic coding assistant, the cost structure is far worse than a single LLM call.
Context Window Inflation
Here’s what a typical Claude Code session context actually looks like:
    System prompt:            ~6,000 tokens (fixed)
    Tool definitions:         ~8,000 tokens (fixed)
    CLAUDE.md / project:        ~500 tokens (fixed)
    ----------------------------------------------
    Fixed overhead:          ~14,500 tokens
    Conversation history:    grows with every turn
    Tool results (bash, read, grep, find): grows fast
    Reasoning chains:        very long, especially in "thinking" mode
    ----------------------------------------------
    Typical session:         50,000–200,000 tokens
The longer the conversation, the slower and more expensive each response becomes — both because attention is quadratic and because the model struggles to maintain coherence over very long contexts, which is why Claude Code “forgets” things and needs to summarize history.
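A rough way to see why long sessions get expensive: if every turn re-sends the fixed overhead plus the whole history, billed input tokens grow much faster than the conversation itself. The sketch below assumes, purely for illustration, that each turn adds the same number of tokens and that nothing is cached or compacted. The function name is hypothetical.

```c
#include <stdint.h>

/* Input tokens billed across a session when every turn re-sends the fixed
   overhead plus all history so far. Assumes each turn appends the same
   number of tokens and that nothing is cached or compacted. */
uint64_t cumulative_input_tokens(uint64_t fixed_overhead,
                                 uint64_t tokens_per_turn,
                                 uint64_t turns) {
    uint64_t total = 0;
    uint64_t context = fixed_overhead;
    for (uint64_t t = 0; t < turns; t++) {
        context += tokens_per_turn;  /* history grows by one turn */
        total += context;            /* the whole context is sent again */
    }
    return total;
}
```

With 14,500 tokens of overhead and a hypothetical 2,000 tokens added per turn, 10 turns bill 255,000 input tokens and 20 turns bill 710,000: twice the turns, nearly three times the tokens.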
Tool Result Inflation
When an agent explores a codebase looking for where function X is called, it might:
- `find . -name "*.java"`: returns a list of 200 files
- `grep -r "functionX" --include="*.java"`: finds 15 matches
- `read` each match with 100–200 lines of surrounding context: adds ~3,000 tokens of code per file
- LLM reasons about each file to determine if it’s a real call

Result: roughly 45,000 tokens (15 files x ~3,000 tokens each) consumed to answer a question that a call graph, computed once by a static analysis tool, answers in microseconds.
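For contrast, here is roughly what the deterministic version of that query looks like once a call graph exists. This is a minimal sketch: the `call_edge` struct and `find_callers` function are hypothetical names, and the edge list is assumed to have been produced earlier by a static analysis pass.

```c
#include <stddef.h>
#include <string.h>

/* One edge of a precomputed call graph: `caller` invokes `callee`.
   The struct and function names here are hypothetical. */
struct call_edge {
    const char *caller;
    const char *callee;
};

/* Writes up to `max_out` callers of `target` into `out` and returns how
   many callers exist in total. A linear scan keeps the sketch short; a real
   tool would index edges by callee (for example in a hash map). */
size_t find_callers(const struct call_edge *edges, size_t n_edges,
                    const char *target, const char **out, size_t max_out) {
    size_t found = 0;
    for (size_t i = 0; i < n_edges; i++) {
        if (strcmp(edges[i].callee, target) == 0) {
            if (found < max_out) out[found] = edges[i].caller;
            found++;
        }
    }
    return found;
}
```

No tokens, no sampling, and the answer is the same every time you ask.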
Reasoning Chains
Modern LLMs use chain-of-thought, tree-of-thought, and reflexion techniques to improve accuracy. These generate verbose “thinking” output before the actual answer — often 10x longer than the final response. All of that thinking is billed as output tokens, typically at 3–5x the price of input tokens.
Multi-Agent Loops
In multi-agent architectures, agents can spawn sub-agents, each with its own context window. A single high-level task might trigger dozens of LLM calls across the chain, and unless context is deliberately trimmed, each call carries forward much of what came before.
The Unit Economics Are Brutal
Let’s ground this in real numbers:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude Opus 4 | ~USD 15 | ~USD 75 |
| Claude Sonnet 4 | ~USD 3 | ~USD 15 |
| GPT-4o | ~USD 2.50 | ~USD 10 |
| Claude Haiku | ~USD 0.25 | ~USD 1.25 |
A single agentic coding session with 100,000 input tokens + 10,000 output tokens using Claude Opus costs:
    Input:  100,000 x (USD 15 / 1,000,000) = USD 1.50
    Output:  10,000 x (USD 75 / 1,000,000) = USD 0.75
    Total per session: ~USD 2.25
At 8 sessions per day, that’s USD 18/day, or USD 540 per engineer over a 30-day month. Scale to 5,000 engineers and you’re looking at USD 2.7 million per month.
That’s the math that burned Uber’s budget.
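If you want to plug in your own numbers, the calculation above fits in two small functions. This is a sketch with hypothetical function names; the sessions-per-day and days-per-month inputs are the round figures from this post, not measured usage.

```c
#include <assert.h>

/* Cost in USD of one session, given token counts and per-million-token prices. */
double session_cost_usd(double input_tokens, double output_tokens,
                        double input_price_per_m, double output_price_per_m) {
    assert(input_tokens >= 0 && output_tokens >= 0);
    return input_tokens * input_price_per_m / 1e6
         + output_tokens * output_price_per_m / 1e6;
}

/* Fleet-wide monthly spend; sessions per day and days per month are
   round illustrative numbers, not measured usage. */
double monthly_fleet_cost_usd(double cost_per_session, int sessions_per_day,
                              int days_per_month, int engineers) {
    return cost_per_session * sessions_per_day * days_per_month * engineers;
}
```

`session_cost_usd(100000, 10000, 15, 75)` gives USD 2.25, and feeding that into `monthly_fleet_cost_usd` with 8 sessions, 30 days, and 5,000 engineers gives USD 2.7 million.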
The Right Mental Model: LLMs Are Not CPUs
Here’s the core insight every engineering leader needs to internalize:
An LLM is a reasoning engine that generates probabilistic text. A CPU is a deterministic computation engine. Use each for what it’s good at.
| Task Type | Best Tool | Why |
|---|---|---|
| Count characters in a string | CPU + algorithm | Exact, deterministic, O(n), nanoseconds |
| Find all callers of function X | Static analysis / call graph | Exact, fast, computed once |
| Parse JSON/XML | Standard parser | Zero tokens, microseconds |
| Validate email format | Regex | Deterministic, zero token cost |
| Sort/filter a list | Algorithm | Deterministic, scalable |
| Understand ambiguous requirements | LLM | Probabilistic reasoning is the right tool |
| Generate code from intent | LLM | Natural language → structured output |
| Semantic code search | LLM + embeddings | Meaning-based, not keyword-based |
| Summarize a long document | LLM | Compression and synthesis |
| Detect subtle bugs in context | LLM | Pattern recognition across large context |
The problem is that many agentic coding tools treat everything as an LLM task — including navigating file systems, parsing syntax trees, resolving imports, and finding function references — when these are exactly the problems compilers and static analysis tools have solved perfectly for decades.
The Metis Philosophy: Hybrid is the Future
The right architecture for AI-assisted development isn’t “LLM for everything.” It’s a hybrid stack that uses:
- Compiler and program analysis tools for deterministic, structural questions:
  - “Where is function X called?” → Call graph (microseconds)
  - “What type does this variable have?” → Type inference
  - “What are all the dependencies of module Y?” → Dependency graph
  - “Does this code compile?” → Compiler
- LLMs for semantic and generative tasks where probabilistic reasoning adds real value:
  - “What does this function probably do?” → Semantic understanding
  - “Write a unit test for this behaviour” → Code generation
  - “Find code that has similar intent to this query” → Semantic search over embeddings
  - “Why might this be causing a bug?” → Reasoning under uncertainty
This is the philosophy behind tools like Metis: use static analysis for the structural layer (AST parsing, call graphs, type inference, binding resolution), and reserve LLMs for the semantic layer where exact algorithms don’t exist.
The result: fewer tokens consumed, faster results, lower cost, higher accuracy for structural queries.
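The split between the two layers can be expressed as a simple dispatch step. The sketch below is my own illustration of the idea, not how Metis or any particular tool is implemented. The enum values and `choose_backend` are hypothetical names, and it assumes the query has already been classified.

```c
/* Kinds of developer question, split into structural and semantic.
   The enum and function names are hypothetical, for illustration only. */
enum query_kind {
    QUERY_FIND_CALLERS,      /* structural */
    QUERY_TYPE_OF_SYMBOL,    /* structural */
    QUERY_DEPENDENCIES,      /* structural */
    QUERY_DOES_IT_COMPILE,   /* structural */
    QUERY_EXPLAIN_FUNCTION,  /* semantic */
    QUERY_WRITE_TEST,        /* semantic */
    QUERY_SEMANTIC_SEARCH,   /* semantic */
    QUERY_DIAGNOSE_BUG       /* semantic */
};

enum backend { BACKEND_STATIC_ANALYSIS, BACKEND_LLM };

/* Route structural questions to deterministic tooling and everything else
   to the LLM. Kinds not listed as structural default to the LLM. */
enum backend choose_backend(enum query_kind kind) {
    switch (kind) {
    case QUERY_FIND_CALLERS:
    case QUERY_TYPE_OF_SYMBOL:
    case QUERY_DEPENDENCIES:
    case QUERY_DOES_IT_COMPILE:
        return BACKEND_STATIC_ANALYSIS;
    default:
        return BACKEND_LLM;
    }
}
```

The hard part in practice is the classification that happens before this function is called. The point of the sketch is only that once a question is known to be structural, no tokens need to be spent on it.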
Practical Cost Optimization for Engineering Teams
If you can’t switch architectures immediately, here are four levers you can pull today:
1. Model Routing
Use the smallest model capable of the task. Haiku for autocomplete and simple questions. Sonnet for most coding tasks. Opus only for the hardest reasoning. Depending on the workload mix, this alone can cut costs by 40–60%.
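To estimate what routing might save, compute a blended price from the share of traffic sent to each tier. This is a sketch with a hypothetical function name. The traffic shares in the note below are made up for illustration, and the prices are the input prices from the table earlier in this post.

```c
#include <assert.h>

/* Blended input price (USD per 1M tokens) when traffic is split across
   three model tiers. The shares are supplied by the caller and must sum
   to 1; nothing here is measured data. */
double blended_price_per_m(double share_small, double share_mid,
                           double share_large, double price_small,
                           double price_mid, double price_large) {
    double total = share_small + share_mid + share_large;
    assert(total > 0.999 && total < 1.001);  /* shares must sum to 1 */
    return share_small * price_small + share_mid * price_mid
         + share_large * price_large;
}
```

For example, sending 60% of traffic to Sonnet (USD 3) and keeping 40% on Opus (USD 15) gives a blended input price of about USD 7.80 per million tokens, roughly 48% below an all-Opus setup.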
2. Prompt Caching
Repeated system prompts and context (tools, project files, guidelines) can be cached. AWS Bedrock and Anthropic’s API both support this. Cost reduction: up to 90% for cached tokens.
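Here is a small sketch of what that discount does to the input bill. The function name is hypothetical, and the model is deliberately simplified: it applies a flat discount to cached tokens and ignores any cache-write surcharge or cache expiry, both of which vary by provider.

```c
#include <assert.h>

/* Input cost in USD when `cached_tokens` of the prompt are served from a
   prompt cache. `cached_discount` is the fraction saved on cached tokens
   (0.90 corresponds to the "up to 90%" figure). Cache-write surcharges
   and cache expiry are ignored in this sketch. */
double input_cost_with_cache_usd(double total_tokens, double cached_tokens,
                                 double price_per_m, double cached_discount) {
    assert(cached_tokens >= 0 && cached_tokens <= total_tokens);
    assert(cached_discount >= 0 && cached_discount <= 1);
    double uncached = total_tokens - cached_tokens;
    double cached_price_per_m = price_per_m * (1.0 - cached_discount);
    return (uncached * price_per_m + cached_tokens * cached_price_per_m) / 1e6;
}
```

For a 100,000-token Opus request where only the 14,500-token fixed overhead is cached, the input cost drops from USD 1.50 to about USD 1.30. The bigger wins come when conversation history and project files are cached as well.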
3. Semantic Caching
Cache responses to semantically similar queries. A question about “how to handle null checks” asked 50 different ways can return a cached response without hitting the LLM.
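A minimal sketch of the lookup side of a semantic cache is below. It assumes the query and the cached questions have already been turned into embedding vectors by some embedding model, which is not shown. `EMBED_DIM`, `cache_entry`, and `semantic_cache_lookup` are hypothetical names, and the 4-dimensional vectors are only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define EMBED_DIM 4  /* tiny for illustration; real embeddings are far larger */

struct cache_entry {
    float embedding[EMBED_DIM];
    const char *response;
};

static float dot(const float *a, const float *b) {
    float s = 0.0f;
    for (size_t i = 0; i < EMBED_DIM; i++) s += a[i] * b[i];
    return s;
}

/* Returns the cached response whose embedding has the highest cosine
   similarity with `query`, provided that similarity is at least `threshold`;
   returns NULL on a miss. Similarity is compared in squared form,
   dot^2 / (|a|^2 |b|^2), to avoid a square root. */
const char *semantic_cache_lookup(const struct cache_entry *entries, size_t n,
                                  const float *query, float threshold) {
    assert(threshold > 0.0f && threshold <= 1.0f);
    const char *best = NULL;
    float best_sq = threshold * threshold;
    float qq = dot(query, query);
    for (size_t i = 0; i < n; i++) {
        float d = dot(entries[i].embedding, query);
        float ee = dot(entries[i].embedding, entries[i].embedding);
        /* Skip zero vectors and non-positive similarity. */
        if (d <= 0.0f || qq == 0.0f || ee == 0.0f) continue;
        float cos_sq = (d * d) / (qq * ee);
        if (cos_sq >= best_sq) {
            best_sq = cos_sq;
            best = entries[i].response;
        }
    }
    return best;
}
```

The threshold is the sensitive design choice: too low and users get answers to a different question, too high and the cache rarely hits.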
4. Context Window Hygiene
Periodically summarise and compress conversation history. Don’t pass full file contents when only a snippet is needed. Keep tool results minimal.
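One simple form of history compression is a token budget: keep the most recent turns that fit, and summarise or drop everything older. The sketch below shows only the bookkeeping. The function name is hypothetical, and the per-turn token counts are assumed to come from whatever tokenizer you already use.

```c
#include <stddef.h>

/* Returns the index of the oldest turn to keep so that the kept turns
   (from that index to the end) fit within `budget` tokens. Turns before
   the returned index are candidates for summarising or dropping.
   Returns n_turns if even the latest turn alone exceeds the budget. */
size_t first_turn_to_keep(const unsigned *turn_tokens, size_t n_turns,
                          unsigned long budget) {
    unsigned long used = 0;
    size_t keep_from = n_turns;
    /* Walk backwards from the newest turn while the budget allows. */
    while (keep_from > 0 && used + turn_tokens[keep_from - 1] <= budget) {
        used += turn_tokens[keep_from - 1];
        keep_from--;
    }
    return keep_from;
}
```

With turns of 3,000, 5,000, 2,000 and 4,000 tokens and a 7,000-token budget, the function returns index 2: the last two turns are kept verbatim and the first two are left for summarisation.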
What This Means for the Industry
The Microsoft and Uber situations aren’t anomalies — they’re early signals of a structural problem that every enterprise will face as AI tool adoption matures:
- Token-based pricing is metered infrastructure — not software licensing. Per-seat pricing hid this reality, but GitHub’s shift to AI Credits (June 1, 2026) makes it explicit.
- LLM inference costs have dropped 10x in three years, but token consumption has grown 100x with agentic workloads. Net spend: roughly 10x higher.
- 85% of companies miss AI cost forecasts by more than 10% (Mavvrik, 2025). The budgeting models don’t exist yet.
The teams that thrive will be the ones who treat AI cost as infrastructure cost — not headcount — and build the measurement, governance, and architectural discipline to manage it properly.
The strawberry problem isn’t just a party trick. It’s a parable about using the right tool for the right job. LLMs are powerful precisely because they can handle ambiguity and language that deterministic algorithms can’t. Use them for that. Let the CPU count the letters.
References
- Microsoft cancels Claude Code licenses, shifting developers to GitHub Copilot CLI — Windows Central: Full reporting on Microsoft’s decision and financial motives
- Microsoft Drops Claude Code After Budget Overrun | AI Weekly — AI Weekly summary of the Microsoft budget situation
- Microsoft pulls Claude Code licenses and pushes developers back toward its own AI tool — The Decoder’s analysis of Microsoft’s strategic shift
- Uber Spends Full 2026 AI Budget in 4 Months — Briefs.co report on Uber’s AI cost overrun
- Uber’s Anthropic AI Push Hits A Wall — CTO Says Budget Struggles — Yahoo Finance report with CTO quotes and cost figures
- AI Cost Crisis Emerges as Claude Usage and Agentic Coding Bills Spiral — Comprehensive analysis of the broader enterprise AI cost crisis
- GitHub Copilot is moving to usage-based billing — Official GitHub blog announcement on AI Credits billing model
- GitHub Copilot AI Credits: Usage-Based Billing Starts June 1, 2026 — Detailed breakdown of the new GitHub billing model
- Devs Sound Off on Usage-Based Copilot Pricing Change — Developer community reactions to Copilot pricing changes
- LLM Inference Cost 2026: Complete Pricing Guide — Comprehensive 2026 pricing comparison across all major LLM providers
- AI Inference Cost Crisis 2026: Why Your AI Bill Is Exploding — Analysis of why inference costs are rising despite per-token price drops
- KV Cache Explained: Efficient Attention for LLM Generation — Hugging Face technical explanation of KV cache and memory requirements
- Understanding KV Cache and Paged Attention in LLMs — Deep dive into KV cache mechanics and memory scaling
- LLM Inference Series: KV Caching Explained — Technical walkthrough of transformer attention and caching strategy
- Inference Unit Economics: The True Cost Per Million Tokens — GPU hardware economics and per-token cost breakdown