Why AI Costs So Much: The LLM Tax on Things CPUs Do for Free

Two headlines hit the engineering community in May 2026 that confirmed what many of us already suspected:

Microsoft cancelled thousands of internal Claude Code licenses across its Experiences & Devices division — the team behind Windows, Office, and Teams — effective June 30, 2026. The experiment had run for barely six months. Engineers loved it. It was too popular. Token-based costs spiralled beyond budget, and now developers are being migrated to GitHub Copilot CLI instead.

Uber’s CTO, Praveen Neppalli Naga, admitted the company burned its entire 2026 AI budget in just four months. After Uber deployed Claude Code to 5,000 engineers, monthly API costs hit USD 500-2,000 per engineer, and Naga said he personally spent USD 1,200 in a single two-hour demo session. Usage exploded (95% of Uber engineers were using AI tools by April), but the economics buckled.

Meanwhile, GitHub announced that Copilot is moving to usage-based billing starting June 1, 2026, shifting from flat-rate plans to metered AI Credits because “a quick chat question and a multi-hour autonomous coding session can cost the same — and that’s no longer sustainable.”

I’ve been thinking about this problem for a while. My thesis: LLMs are expensive not because they’re stupid, but because we’re forcing them to do tasks that a CPU with a deterministic algorithm handles a million times better. This post is about understanding why — technically — and what we should do about it.

The Strawberry Problem

Let me start with a famous example that illustrates the absurdity of using LLMs for everything.

Task: Count the number of ‘r’ characters in the word “strawberry.”

The CPU approach

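#include <stdio.h>

int main(void) {
    const char word[] = "strawberry";
    int count = 0;
    // Scan up to the terminating NUL, counting each 'r'.
    for (int i = 0; word[i] != '\0'; i++) {
        if (word[i] == 'r') count++;
    }
    printf("count = %d\n", count);  // count = 3
    return 0;
}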

Steps: ~10 iterations
Time: ~20 nanoseconds on a modern CPU
Memory: 11 bytes for the string (10 characters plus the terminating NUL) + 8 bytes for the two int variables (assuming 4-byte int) = 19 bytes total
Accuracy: 100%, deterministic, always correct

The LLM approach

To ask an LLM to do this, you first construct an input token sequence:

System prompt: "You are an expert language processing assistant..."
Task description: "You will receive a string X and a character c. Count how many times c appears in X."
Input: "X = strawberry, c = r"
Output format: "Respond in JSON: {"count": n}"

This prompt gets sent to an inference server. Here’s what happens next:

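Tokenization: “strawberry” is split into a handful of sub-word tokens, for example str, aw, b, er, ry (the exact split depends on the tokenizer). Notice that the individual ‘r’ characters are hidden inside multi-character tokens, so the model never sees letters directly. This is why early LLMs got this wrong!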
Embedding: Each token becomes a vector of ~10,000 floating-point numbers. In FP8 format (1 byte per number), that’s 10,000 bytes per token x 5 tokens = 50,000 bytes just to represent the word “strawberry” — versus 11 bytes in C.
But the prompt is bigger: Claude Code’s actual input includes at minimum 6,000 tokens for the system prompt and 8,000 tokens for tool descriptions. That’s 14,000 tokens of fixed overhead before your actual task even starts.
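KV Cache memory: For each token in the input, the transformer stores Key (K) and Value (V) vectors for every layer. A large model with 100 layers, processing 100,000 tokens (a typical Claude Code session), needs 2 x 100 layers x 100,000 tokens x 10,000 dimensions = 200 billion cached values, or ~200 GB at 1 byte per value in FP8 (double that in FP16). Architectures that share K and V across attention heads, such as grouped-query attention, need less, but the scaling is the same. A single H100 GPU has 80 GB of VRAM. This is a large part of why LLM decoding tends to be memory-bound rather than compute-bound.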
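Attention is quadratic: Every new token the model generates must “attend to” all previous tokens. With a KV cache, the attention work per generated token grows linearly with context length, so the total over a sequence of n tokens is O(n^2): double the context, roughly quadruple the attention compute. This is why inference gets progressively slower as the conversation grows.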
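Sequential decoding: Tokens are generated one at a time, each requiring a full forward pass through the neural network. Parameter counts for models like Claude Opus are not public, but as a rule of thumb a dense transformer performs about two floating-point operations per parameter per token, so a model with hundreds of billions of parameters needs hundreds of billions of floating-point operations just to generate a single token.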
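To make the tokenization step concrete, here is a minimal sketch of greedy longest-match tokenization over a tiny, made-up vocabulary. The vocabulary, the token IDs and the `tokenize` function are all hypothetical; real tokenizers use learned merge rules and vocabularies with many thousands of entries. The only point is that the model receives integer IDs, not letters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy vocabulary; the index of each entry is its token ID. */
static const char *const VOCAB[] = {"str", "aw", "b", "er", "ry", "r", "s", "t"};
#define VOCAB_SIZE (sizeof VOCAB / sizeof VOCAB[0])

/*
 * Greedy longest-match tokenization: at each position, pick the longest
 * vocabulary entry that matches the remaining text. Writes token IDs to
 * `ids` and returns how many were written, or -1 if some position matches
 * no entry or the output buffer is full.
 */
int tokenize(const char *text, int *ids, int max_ids) {
    assert(text != NULL && ids != NULL);
    int n = 0;
    while (*text != '\0') {
        int best = -1;
        size_t best_len = 0;
        for (size_t v = 0; v < VOCAB_SIZE; v++) {
            size_t len = strlen(VOCAB[v]);
            if (len > best_len && strncmp(text, VOCAB[v], len) == 0) {
                best = (int)v;
                best_len = len;
            }
        }
        if (best < 0 || n >= max_ids) return -1;
        ids[n++] = best;
        text += best_len;
    }
    return n;
}
```

With this toy vocabulary, tokenizing "strawberry" yields the five IDs 0, 1, 2, 3, 4. The standalone `r` entry (ID 5) is never emitted, so counting letters requires the model to have learned what is inside each token.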

The result? The LLM might get the answer wrong (many models historically failed this test), it’ll use gigabytes of VRAM on a cluster of H100s, and the whole inference call will take seconds. All to count three letters.
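The KV cache figure quoted above is plain arithmetic, sketched here. The function name and parameters are illustrative, and the formula assumes full multi-head attention with one K and one V vector of the full hidden width per layer per token, which overstates the memory for models that share K and V across heads.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Approximate KV cache size in bytes:
 *   2 (K and V) x layers x tokens x hidden_dim x bytes_per_value
 * Assumes no grouped-query or multi-query sharing of K and V.
 */
uint64_t kv_cache_bytes(uint64_t layers, uint64_t tokens,
                        uint64_t hidden_dim, uint64_t bytes_per_value) {
    assert(bytes_per_value > 0);
    return 2u * layers * tokens * hidden_dim * bytes_per_value;
}
```

Calling `kv_cache_bytes(100, 100000, 10000, 1)` gives 200,000,000,000 bytes, the ~200 GB quoted above for 1-byte FP8 values. With 2-byte FP16 values the same session would need twice that.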

Why Agentic Coding Compounds the Problem

When you use Claude Code or GitHub Copilot as an agentic coding assistant, the cost structure is far worse than a single LLM call.

Context Window Inflation

Here’s what a typical Claude Code session context actually looks like:

System prompt:          ~6,000 tokens  (fixed)
Tool definitions:       ~8,000 tokens  (fixed)
CLAUDE.md / project:    ~500 tokens    (fixed)
----------------------------------------------
Fixed overhead:         ~14,500 tokens

Conversation history:   grows with every turn
Tool results (bash, read, grep, find): grows fast
Reasoning chains:       very long, especially "thinking" mode
----------------------------------------------
Typical session:        50,000–200,000 tokens

The longer the conversation, the slower and more expensive each response becomes. Attention cost grows quadratically with context length, and the model struggles to maintain coherence over very long contexts, which is why Claude Code “forgets” things and needs to summarize history.
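A rough sketch of why long sessions get expensive on the billing side as well: each request re-sends the whole context, so the input tokens billed over a session grow roughly quadratically with the number of turns. The function below is illustrative only. The fixed overhead and the tokens added per turn are assumed constants, and it ignores prompt caching and history summarization.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Total input tokens billed over `turns` requests when every request
 * re-sends the fixed overhead plus all history accumulated so far.
 * Request k (1-based) sends: fixed_overhead + k * tokens_per_turn.
 */
uint64_t cumulative_input_tokens(uint64_t fixed_overhead,
                                 uint64_t tokens_per_turn,
                                 uint64_t turns) {
    /* Guard against overflow in k * tokens_per_turn. */
    assert(tokens_per_turn == 0 || turns <= UINT64_MAX / tokens_per_turn);
    uint64_t total = 0;
    for (uint64_t k = 1; k <= turns; k++) {
        total += fixed_overhead + k * tokens_per_turn;
    }
    return total;
}
```

With the ~14,500-token fixed overhead from the breakdown above and an assumed 2,000 new tokens per turn, 50 turns bill 3,275,000 input tokens and 100 turns bill 11,550,000: twice the turns, roughly 3.5 times the tokens.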

Tool Result Inflation

When an agent explores a codebase looking for where function X is called, it might:

find . -name "*.java" — returns a list of 200 files
grep -r "functionX" --include="*.java" — finds 15 matches
read each match with 100–200 lines of surrounding context — adds ~3,000 tokens of code per file
LLM reasons about each file to determine if it’s a real call

Result: with 15 matches at ~3,000 tokens each, up to ~45,000 tokens are consumed to answer a question that a call graph, computed once by a static analysis tool, answers in microseconds.
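For contrast, here is a minimal sketch of the call-graph side. Once the edges have been extracted by a compiler front end or static analyzer (not shown), finding the callers of a function is an ordinary lookup with no tokens involved. The `CallEdge` type and `find_callers` are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One edge of a precomputed call graph: `caller` invokes `callee`. */
typedef struct {
    const char *caller;
    const char *callee;
} CallEdge;

/*
 * Stores the callers of `target` in `out` (at most `max_out` entries) and
 * returns the number stored. A linear scan keeps the sketch short; a real
 * tool would index edges by callee so that each lookup is a hash probe.
 */
size_t find_callers(const CallEdge *edges, size_t n_edges, const char *target,
                    const char **out, size_t max_out) {
    assert(edges != NULL && target != NULL && out != NULL);
    size_t found = 0;
    for (size_t i = 0; i < n_edges && found < max_out; i++) {
        if (strcmp(edges[i].callee, target) == 0) {
            out[found++] = edges[i].caller;
        }
    }
    return found;
}
```

Even the naive scan touches each edge once and returns an exact answer, which is the property the agent loop above spends tens of thousands of tokens approximating.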

Reasoning Chains

Modern LLMs use chain-of-thought, tree-of-thought, and reflexion techniques to improve accuracy. These generate verbose “thinking” output before the actual answer — often 10x longer than the final response. All of that thinking is billed as output tokens, typically at 3–5x the price of input tokens.

Multi-Agent Loops

In multi-agent architectures, agents can spawn sub-agents, each with their own context windows. A single high-level task might trigger dozens of LLM calls across the chain, with each call often inheriting much of the context that came before.

The Unit Economics Are Brutal

Let’s ground this in real numbers:

Model	Input (per 1M tokens)	Output (per 1M tokens)
Claude Opus 4	~USD 15	~USD 75
Claude Sonnet 4	~USD 3	~USD 15
GPT-4o	~USD 2.50	~USD 10
Claude 3 Haiku	~USD 0.25	~USD 1.25

A single agentic coding session with 100,000 input tokens + 10,000 output tokens using Claude Opus costs:

Input:  100,000 x ($15/1,000,000)  = $1.50
Output:  10,000 x ($75/1,000,000)  = $0.75
Total per session: ~$2.25

At 8 sessions per day, that’s USD 18/day, or USD 540/month per engineer if usage runs 30 days a month. Scale to 5,000 engineers and you’re looking at USD 2.7 million per month.

That’s the kind of math that burned Uber’s budget.
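The arithmetic in this section can be reproduced with two small helpers. This is a sketch: the function names are made up, prices are passed in as USD per million tokens, and the 8 sessions per day and 30 billable days per month are the assumptions behind the figures in the text, not measured values.

```c
#include <assert.h>

/* Cost in USD of one session, with prices given in USD per million tokens. */
double session_cost_usd(double input_tokens, double output_tokens,
                        double input_usd_per_mtok, double output_usd_per_mtok) {
    assert(input_tokens >= 0 && output_tokens >= 0);
    return (input_tokens * input_usd_per_mtok +
            output_tokens * output_usd_per_mtok) / 1e6;
}

/*
 * Monthly cost for a whole team, assuming every engineer runs the same
 * number of identical sessions on each billable day.
 */
double monthly_fleet_cost_usd(double cost_per_session, int sessions_per_day,
                              int days_per_month, int engineers) {
    assert(sessions_per_day >= 0 && days_per_month >= 0 && engineers >= 0);
    return cost_per_session * sessions_per_day * days_per_month * engineers;
}
```

`session_cost_usd(100000, 10000, 15.0, 75.0)` returns 2.25, and `monthly_fleet_cost_usd(2.25, 8, 30, 5000)` returns 2,700,000, matching the figures above. With about 22 working days per month instead of 30, the same usage comes to roughly USD 396 per engineer.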

The Right Mental Model: LLMs Are Not CPUs

Here’s the core insight every engineering leader needs to internalize:

An LLM is a reasoning engine that generates probabilistic text. A CPU is a deterministic computation engine. Use each for what it’s good at.

Task Type	Best Tool	Why
Count characters in a string	CPU + algorithm	Exact, deterministic, O(n), nanoseconds
Find all callers of function X	Static analysis / call graph	Exact, fast, computed once
Parse JSON/XML	Standard parser	Zero tokens, microseconds
Validate email format	Regex	Deterministic, zero token cost
Sort/filter a list	Algorithm	Deterministic, scalable
Understand ambiguous requirements	LLM	Probabilistic reasoning is the right tool
Generate code from intent	LLM	Natural language → structured output
Semantic code search	LLM + embeddings	Meaning-based, not keyword-based
Summarize a long document	LLM	Compression and synthesis
Detect subtle bugs in context	LLM	Pattern recognition across large context

The problem is that many agentic coding tools treat everything as an LLM task, including navigating file systems, parsing syntax trees, resolving imports, and finding function references, when these are exactly the problems compilers and static analysis tools have solved reliably for decades.
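The table above amounts to a routing decision made before any tokens are spent. A minimal sketch of that decision follows; the `TaskKind` and `Backend` enums and `choose_backend` are hypothetical, and a real tool would need some way to classify incoming requests in the first place.

```c
#include <assert.h>

/* A few of the task types from the table, as a hypothetical enum. */
typedef enum {
    TASK_COUNT_CHARS,
    TASK_FIND_CALLERS,
    TASK_PARSE_JSON,
    TASK_SORT_LIST,
    TASK_AMBIGUOUS_REQUIREMENTS,
    TASK_GENERATE_CODE,
    TASK_SUMMARIZE_DOCUMENT
} TaskKind;

typedef enum { BACKEND_DETERMINISTIC, BACKEND_LLM } Backend;

/*
 * Route structural, exactly answerable tasks to deterministic code and
 * send only ambiguous or generative tasks to the model.
 */
Backend choose_backend(TaskKind kind) {
    switch (kind) {
    case TASK_COUNT_CHARS:
    case TASK_FIND_CALLERS:
    case TASK_PARSE_JSON:
    case TASK_SORT_LIST:
        return BACKEND_DETERMINISTIC;
    case TASK_AMBIGUOUS_REQUIREMENTS:
    case TASK_GENERATE_CODE:
    case TASK_SUMMARIZE_DOCUMENT:
        return BACKEND_LLM;
    }
    /* Unreachable for valid input; fall back to the LLM in release builds. */
    assert(0 && "unknown TaskKind");
    return BACKEND_LLM;
}
```

The design choice is that the default path for anything structural never reaches the model, so the expensive backend is an explicit opt-in rather than the fallback for every request.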

The Metis Philosophy: Hybrid is the Future

The right architecture for AI-assisted development isn’t “LLM for everything.” It’s a hybrid stack that uses:

Compiler and program analysis tools for deterministic, structural questions:
- “Where is function X called?” → Call graph (microseconds)
- “What type does this variable have?” → Type inference
- “What are all the dependencies of module Y?” → Dependency graph
- “Does this code compile?” → Compiler
LLMs for semantic and generative tasks where probabilistic reasoning adds real value:
- “What does this function probably do?” → Semantic understanding
- “Write a unit test for this behaviour” → Code generation
- “Find code that has similar intent to this query” → Semantic search over embeddings
- “Why might this be causing a bug?” → Reasoning under uncertainty

This is the philosophy behind tools like Metis: use static analysis for the structural layer (AST parsing, call graphs, type inference, binding resolution), and reserve LLMs for the semantic layer where exact algorithms don’t exist.

The result: fewer tokens consumed, faster results, lower cost, higher accuracy for structural queries.

Practical Cost Optimization for Engineering Teams

If you can’t switch architectures immediately, here are four levers you can pull today:

1. Model Routing

Use the smallest model capable of the task. Haiku for autocomplete and simple questions. Sonnet for most coding tasks. Opus only for the hardest reasoning. This alone can cut costs 40–60%.
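To get a feel for what routing is worth, the sketch below computes a blended price per million tokens for a given traffic mix. The function name and the mix are assumptions for illustration; real savings depend on how much of the workload a smaller model can actually handle.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Blended price per million tokens when a fraction of traffic is served by
 * each model tier. `share` and `price` are parallel arrays of length n;
 * the shares are fractions that the caller is expected to sum to 1.
 */
double blended_price(const double *share, const double *price, int n) {
    assert(share != NULL && price != NULL && n > 0);
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        total += share[i] * price[i];
    }
    return total;
}
```

Using the input prices from the pricing table, sending half of the traffic to Opus (USD 15) and half to Sonnet (USD 3) gives a blended USD 9 per million input tokens, 40% below an all-Opus baseline.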

2. Prompt Caching

Repeated system prompts and context (tools, project files, guidelines) can be cached. Amazon Bedrock and Anthropic’s API both support this. Cached input tokens can be billed at up to 90% less than uncached ones.
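A minimal sketch of the effect on a single request, assuming cached tokens are billed at a flat discount and ignoring any surcharge for writing the cache. The function name and parameters are illustrative, not part of any provider’s API.

```c
#include <assert.h>

/*
 * Input cost in USD for one request when `cached_tokens` of the prompt are
 * served from cache at `cache_discount` (0.9 means 90% off) and the rest is
 * billed at the full price of `usd_per_mtok` per million tokens.
 */
double input_cost_with_cache(double total_tokens, double cached_tokens,
                             double usd_per_mtok, double cache_discount) {
    assert(cached_tokens >= 0 && cached_tokens <= total_tokens);
    assert(cache_discount >= 0.0 && cache_discount <= 1.0);
    double uncached = total_tokens - cached_tokens;
    double cached_billed = cached_tokens * (1.0 - cache_discount);
    return (uncached + cached_billed) * usd_per_mtok / 1e6;
}
```

For a 100,000-token request at USD 15 per million input tokens, caching a 14,500-token fixed prefix at a 90% discount gives about USD 1.30 instead of USD 1.50. The saving grows when the conversation history is cached as well.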

3. Semantic Caching

Cache responses to semantically similar queries. A question about “how to handle null checks” asked 50 different ways can return a cached response without hitting the LLM.
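A minimal sketch of a semantic cache lookup, under stated assumptions: embeddings are produced elsewhere by some embedding model (not shown) and are already L2-normalized, so cosine similarity reduces to a dot product. `CacheEntry`, `semantic_lookup`, the 4-dimensional vectors and the threshold are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define EMBED_DIM 4 /* toy dimension; real embeddings are far larger */

typedef struct {
    float embedding[EMBED_DIM]; /* assumed to be unit length */
    const char *response;       /* previously generated answer */
} CacheEntry;

/* Dot product; equals cosine similarity for unit-length vectors. */
static float dot_product(const float *a, const float *b) {
    float dot = 0.0f;
    for (int i = 0; i < EMBED_DIM; i++) {
        dot += a[i] * b[i];
    }
    return dot;
}

/*
 * Returns the cached response most similar to `query` if its similarity is
 * at least `threshold`; otherwise returns NULL, meaning a cache miss and a
 * real LLM call.
 */
const char *semantic_lookup(const CacheEntry *entries, size_t n,
                            const float *query, float threshold) {
    assert(entries != NULL || n == 0);
    assert(query != NULL);
    const char *best = NULL;
    float best_sim = threshold;
    for (size_t i = 0; i < n; i++) {
        float sim = dot_product(entries[i].embedding, query);
        if (sim >= best_sim) {
            best_sim = sim;
            best = entries[i].response;
        }
    }
    return best;
}
```

The threshold is the main design choice: set it too low and users get answers to questions they did not ask, too high and the cache rarely hits. A production cache would also use an approximate nearest-neighbor index instead of a linear scan.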

4. Context Window Hygiene

Periodically summarise and compress conversation history. Don’t pass full file contents when only a snippet is needed. Keep tool results minimal.
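One simple form of context hygiene is a sliding window: keep only the most recent messages that fit a token budget. This is a stand-in for summarization, not the same thing, since dropped messages are lost rather than compressed. The function name and the per-message token counts are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given per-message token counts in chronological order, returns the index
 * of the oldest message to keep so that messages [index, n) fit within
 * `budget` tokens. Returns n if even the newest message does not fit.
 */
size_t first_message_to_keep(const size_t *tokens, size_t n, size_t budget) {
    assert(tokens != NULL || n == 0);
    size_t used = 0;
    size_t start = n;
    /* Walk backwards from the newest message while the budget allows. */
    while (start > 0 && used + tokens[start - 1] <= budget) {
        used += tokens[start - 1];
        start--;
    }
    return start;
}
```

For message sizes of 5,000, 3,000, 1,200, 800 and 400 tokens and a budget of 2,500, the function returns index 2, keeping the last three messages (2,400 tokens) and dropping the two large early ones.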

What This Means for the Industry

The Microsoft and Uber situations aren’t anomalies — they’re early signals of a structural problem that every enterprise will face as AI tool adoption matures:

Token-based pricing is metered infrastructure — not software licensing. Per-seat pricing hid this reality, but GitHub’s shift to AI Credits (June 1, 2026) makes it explicit.
LLM inference costs have dropped 10x in three years, but token consumption has grown 100x with agentic workloads. Net cost: higher.
85% of companies miss AI cost forecasts by more than 10% (Mavvrik, 2025). The budgeting models don’t exist yet.

The teams that thrive will be the ones who treat AI cost as infrastructure cost — not headcount — and build the measurement, governance, and architectural discipline to manage it properly.

The strawberry problem isn’t just a party trick. It’s a parable about using the right tool for the right job. LLMs are powerful precisely because they can handle ambiguity and language that deterministic algorithms can’t. Use them for that. Let the CPU count the letters.

References

Microsoft cancels Claude Code licenses, shifting developers to GitHub Copilot CLI — Windows Central: Full reporting on Microsoft’s decision and financial motives
Microsoft Drops Claude Code After Budget Overrun | AI Weekly — AI Weekly summary of the Microsoft budget situation
Microsoft pulls Claude Code licenses and pushes developers back toward its own AI tool — The Decoder’s analysis of Microsoft’s strategic shift
Uber Spends Full 2026 AI Budget in 4 Months — Briefs.co report on Uber’s AI cost overrun
Uber’s Anthropic AI Push Hits A Wall — CTO Says Budget Struggles — Yahoo Finance report with CTO quotes and cost figures
AI Cost Crisis Emerges as Claude Usage and Agentic Coding Bills Spiral — Comprehensive analysis of the broader enterprise AI cost crisis
GitHub Copilot is moving to usage-based billing — Official GitHub blog announcement on AI Credits billing model
GitHub Copilot AI Credits: Usage-Based Billing Starts June 1, 2026 — Detailed breakdown of the new GitHub billing model
Devs Sound Off on Usage-Based Copilot Pricing Change — Developer community reactions to Copilot pricing changes
LLM Inference Cost 2026: Complete Pricing Guide — Comprehensive 2026 pricing comparison across all major LLM providers
AI Inference Cost Crisis 2026: Why Your AI Bill Is Exploding — Analysis of why inference costs are rising despite per-token price drops
KV Cache Explained: Efficient Attention for LLM Generation — Hugging Face technical explanation of KV cache and memory requirements
Understanding KV Cache and Paged Attention in LLMs — Deep dive into KV cache mechanics and memory scaling
LLM Inference Series: KV Caching Explained — Technical walkthrough of transformer attention and caching strategy
Inference Unit Economics: The True Cost Per Million Tokens — GPU hardware economics and per-token cost breakdown
