I have watched hundreds of developers adopt AI coding tools over the past year. Some ship production-grade software in hours. Others burn through $200 in API credits debugging a button color.

The difference is not talent. It is not the model they use. It is where they sit on the agentic coding maturity curve — and whether they know it.

Andrej Karpathy coined “vibe coding” in February 2025. One year later, he declared it “passé” and rebranded the practice as agentic engineering — “because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight.”

That single sentence captures a tectonic shift. But it skips the messy middle. The part where you mass-produce tech trash before learning how to build something real.

Here is the maturity model I use to evaluate where an engineer (or founder) actually stands. Five levels. No gatekeeping — just an honest map of the terrain.


Level 1: Vibe Coding — The Sugar Rush

“There’s a new kind of coding where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.” — Andrej Karpathy

This is where everyone starts. You type a prompt. AI generates a website. You see pixels on screen. It feels like magic.

The tools: Cursor IDE, Lovable, Replit Agent, Bolt.new. Point-and-click interfaces that turn natural language into running applications.

The loop: Write prompt -> wait for code -> see UI -> repeat N times.

The numbers tell the story:

  • Lovable hit $100M ARR in 8 months — potentially the fastest-growing startup in history
  • Replit’s revenue jumped from $10M to $100M in 9 months after launching Replit Agent
  • 84% of developers now use or plan to use AI tools (Stack Overflow 2025)

And yet:

  • 77% of developers say vibe coding is not part of their professional work (Stack Overflow 2025)
  • 45% say debugging AI-generated code takes longer than writing it themselves
  • Only 29% trust AI output to be accurate — down from 40% in 2024

The pattern is always the same. Build something small: amazing. Build something real: it works for two days, then collapses into a spaghetti codebase that nobody — including the AI — can untangle.

You are not building software. You are generating demos.

💡 Level 1 signal: Your main activity is testing whether the AI got it right, not designing what “right” means.


Level 2: Vibe Engineering — The CLI Upgrade

Level 2 is where technically literate users graduate from IDE copilots to terminal-based agents: Claude Code, OpenAI Codex CLI, OpenCode.

The difference is significant. Instead of auto-completing one file, these tools operate across your entire codebase. They grep, read, plan, edit multiple files, run tests, and iterate.

The toolkit:

| Tool | What it does | Stars/Users |
|---|---|---|
| Claude Code | Anthropic’s agentic CLI: sub-agents, skills, plan mode, memory | 10% dev adoption |
| Codex CLI | OpenAI’s terminal agent: cloud sandboxes, multimodal input | New entrant |
| OpenCode | Open-source, 75+ model providers, no vendor lock-in | 120K+ GitHub stars |

New techniques unlocked:

  • Plan mode — the agent reads your codebase before touching it. Research first, code second.
  • Sub-agents — spawn specialized workers for parallel tasks, each in isolated context.
  • MCP (Model Context Protocol) — connect agents to browsers, databases, Docker, and 5,800+ external tool servers. 97M monthly SDK downloads.
  • Skills and frameworks — reusable instruction sets (like Super Claude Framework) that enrich prompts with coding conventions, project rules, and workflow habits.

The reality check:

You can handle larger codebases now. But when bugs hit, the debugging loop can stretch for hours. The agent tries fix after fix, each one making the codebase slightly worse. You are burning Opus credits like jet fuel.

If you are a developer, you eventually open VS Code, set a breakpoint, find the bug in 3 minutes, and curse the AI for wasting your afternoon.

If you are not a developer, you are stuck. You wait for the next frontier model release hoping it will magically fix everything.

💡 Level 2 signal: You know what plan mode is. You have strong opinions about Claude Code vs Cursor. You have rage-quit at least one session after hitting rate limits mid-debug.


Level 3: Fullstack Builder — One Person, Infinite Leverage

At Level 3, the agent is not just writing code. It is your designer, copywriter, DevOps engineer, QA tester, and database admin — all at once.

The unlock: Instead of asking the agent to “fix this bug,” you are feeding it entire workflows:

  • UI/UX design systems
  • Copywriting guidelines
  • CI/CD pipeline configs
  • Database migration scripts
  • Deployment automation

The key skill is finding the right MCP server, skill, or plugin for each domain. Want the agent to see your browser? Chrome DevTools MCP. Want it to understand your Docker logs? Docker MCP. Want it to manage your database? Supabase MCP.

What changed in 2025-2026:

MCP server downloads grew from 100,000 in November 2024 to 8 million by April 2025. By Q1 2025, 28% of Fortune 500 companies had implemented MCP in their AI stacks. In December 2025, Anthropic donated MCP to the Linux Foundation’s Agentic AI Foundation — signaling it is now infrastructure, not a product feature.

The painful truth:

You are powerful but fragile. When the agent hits a wall, you escalate to the most expensive model available — Opus 4.7, GPT-5.5 — praying it can brute-force through the problem. If you hit rate limits before the bug is fixed… well, there is no polite way to describe that frustration.

Your actual job at Level 3 is testing. You are a full-time QA engineer for an AI developer that writes code at the speed of light and debugs at the speed of a drunk snail.

💡 Level 3 signal: You can build almost anything. You just cannot guarantee it works. Your credit card statement looks like a small SaaS company’s cloud bill.


Level 4: Agentic Engineering — Building the Machine That Builds

This is where the game changes fundamentally.

You stop fighting the agent’s limitations and start engineering around them. You build frameworks, harness layers, hooks, triggers, and feedback loops that make agents reliable by design.

The core insight:

“The agent isn’t the hard part — the harness is.” — Anthropic, 2026 Agentic Coding Trends Report

What is a harness?

Martin Fowler defines it as: Agent = Model + Harness. The harness is everything that wraps the core reasoning loop:

  • Tool execution dispatch
  • Context management and compaction
  • Safety enforcement and permission gates
  • Session persistence across context windows
  • Structured handoff artifacts between phases

The agent writes the code. The harness ensures the code is correct, consistent, and production-ready.
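
To make the decomposition concrete, here is a minimal, hypothetical harness sketch in Python. None of these names come from a real SDK; the point is where the responsibilities live: the model proposes actions, the harness dispatches tools, compacts context, and enforces permission gates.

```python
# Hypothetical harness skeleton: the model proposes actions, the harness
# decides what is allowed to run and what the model gets to see next.

from dataclasses import dataclass, field

@dataclass
class Action:
    tool: str              # e.g. "read_file", "edit_file", "run_tests"
    args: dict

@dataclass
class Harness:
    allowed_tools: set
    history: list = field(default_factory=list)
    max_context_items: int = 50

    def permitted(self, action: Action) -> bool:
        # Safety enforcement: writes and shell commands need an explicit allowlist.
        return action.tool in self.allowed_tools

    def compact(self) -> list:
        # Context management: keep only the most recent observations
        # (a real harness would summarize instead of truncating).
        return self.history[-self.max_context_items:]

    def run(self, model, tools: dict, goal: str, max_steps: int = 20):
        for _ in range(max_steps):
            action = model.propose(goal, self.compact())   # core reasoning loop (hypothetical API)
            if action is None:                             # model signals it is done
                return self.history
            if not self.permitted(action):
                self.history.append(("denied", action.tool))
                continue
            result = tools[action.tool](**action.args)     # tool execution dispatch
            self.history.append((action.tool, result))     # session persistence
        return self.history
```

Real harnesses are vastly more sophisticated than this, but they wrap the same loop: propose, gate, execute, record.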

Level 4 engineers build:

  1. Custom evaluation pipelines — Every agent output passes through automated quality gates before it touches production code.
  2. Modular architecture — Clean separation of concerns so the agent can work on one module without breaking others. This is not optional — agents cannot reason about 50,000-line monoliths.
  3. Lifecycle hooks — Pre-commit checks, post-generation tests, smoke tests on deploy. The agent’s work is verified at every stage.
  4. Custom tools — MCP servers, CLI plugins, or scripts that give the agent exactly the visibility and capability it needs for your specific project.
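
Item 4 is easier to picture with code. Below is a minimal custom MCP server sketched against the official MCP Python SDK’s FastMCP helper (check the SDK docs for the current API; the tool itself and its name are invented for illustration):

```python
# Minimal custom MCP server: expose the last N lines of a service's Docker logs
# so the agent can debug from evidence instead of guesses.
# Assumes the official MCP Python SDK (`pip install mcp`); the API may differ by version.

import subprocess
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("project-visibility")

@mcp.tool()
def docker_logs(container: str, tail: int = 100) -> str:
    """Return the last `tail` lines of logs for a running container."""
    result = subprocess.run(
        ["docker", "logs", "--tail", str(tail), container],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout + result.stderr

if __name__ == "__main__":
    mcp.run()  # speaks MCP over stdio; register it in your agent's MCP config
```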

The telltale sign of a Level 4 engineer: They build their own tools. They publish open-source frameworks. They collect GitHub stars like trading cards.

💡 Level 4 signal: You have a CLAUDE.md file longer than most people’s READMEs. You have opinions about harness architecture. You have built at least one custom MCP server.


Level 5: Software Craftsmanship — The Grandmaster

Level 5 is mastery. The “ten thousand swords return to one” moment. Everything clicks.

At this level, you are not using agents. You are designing agent systems.

What Level 5 looks like:

Going deep: You build frameworks that compete on the global stage. You evaluate Palantir’s Ontology and think “I can architect something better for my domain.” You fork open-source agents and modify their reasoning loops to match your engineering standards.

Going wide: You run coding farms. Multiple Mac Minis, dozens of tmux sessions, agents working 24/7 on parallel tasks. Each agent has:

  • A well-defined pipeline with clear input/output contracts
  • Evaluation checkpoints (“evals”) at every stage
  • Automated rollback on quality regression (a hypothetical sketch follows this list)
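
One such stage might look like the sketch below. All names are invented: the agent works on a branch, an eval checkpoint scores the result, and anything below baseline gets rolled back automatically.

```python
# Hypothetical pipeline stage: every agent task has an input/output contract,
# an eval checkpoint, and an automated rollback on quality regression.

import subprocess

def run_evals() -> float:
    """Return a 0-1 quality score, e.g. the pass/fail outcome of the test suite."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return 1.0 if result.returncode == 0 else 0.0

def stage(agent_task, baseline_score: float, branch: str = "agent/example-task"):
    subprocess.run(["git", "checkout", "-b", branch], check=True)
    agent_task()                          # the agent does its work on the branch
    score = run_evals()                   # evaluation checkpoint
    if score < baseline_score:            # quality regression
        subprocess.run(["git", "checkout", "main"], check=True)
        subprocess.run(["git", "branch", "-D", branch], check=True)   # rollback
        return False
    return True                           # hand off to the next stage / review
```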

The QA revolution:

Here is where it gets interesting. I invested in building a dedicated QA agent with a rigorous methodology — borrowing from multiple established testing frameworks and adapting them for AI-generated code.

The agent takes nearly 10x longer to complete a task compared to raw code generation. But the output is battle-tested.

Here is what the QA agent actually does:

  1. Plan review — Before the coding agent writes a single line, the QA agent tears the plan apart. It checks for gaps, stubs, unhandled edge cases, and architectural blind spots. Plans that look “good enough” to a human get rejected 4-5 times before passing.

  2. Post-code audit — After generation, the QA agent runs:

    • Test coverage analysis
    • Wiring verification — does every function actually get called in the application flow? Or did the agent write beautiful code that nothing ever invokes? (A rough sketch of this check follows the list.)
    • Convention compliance checks
    • A structured checklist against quality criteria
  3. Code review layer — Tools like CodeRabbit (2M+ repos, 13M+ PRs reviewed, highest F1 score at 60.1%) catch bugs that slip through. Combined with strict coding conventions, this creates defense in depth.

  4. CI/CD integration — GitHub CI tests, smoke tests on deploy, Docker log verification. Then — and only then — production.
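
Wiring verification (step 2) sounds exotic, but the simplest version is mechanical. This rough sketch, not the actual QA agent, flags top-level functions that are defined but never referenced anywhere in the project:

```python
# Rough wiring check: flag functions that are defined but never referenced.
# This catches the classic failure mode of beautiful code that nothing calls.
# A real QA agent would also follow dynamic dispatch, decorators, and entry points.

import ast
import pathlib

def unreferenced_functions(root: str = ".") -> set:
    defined, referenced = set(), set()
    for path in pathlib.Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                defined.add(node.name)            # a definition is not a use
            elif isinstance(node, ast.Name):
                referenced.add(node.id)           # direct calls and references
            elif isinstance(node, ast.Attribute):
                referenced.add(node.attr)         # method-style references
    return defined - referenced - {"main"}

if __name__ == "__main__":
    for name in sorted(unreferenced_functions()):
        print(f"possibly dead wiring: {name}()")
```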

The result: Agents can push code to production autonomously. For real. Not “vibe deploy and pray” — actually verified, tested, production-grade code.

💡 Level 5 signal: Your agents have a pipeline. You sleep while they ship. The code that reaches production has been reviewed by more automated checks than most human-written code gets in a typical startup.


Understanding the Beast: Why Agents Struggle

To level up in agentic engineering, you need to understand why agents fail. The core limitations have not changed — but the workarounds have gotten dramatically better.

The Blind Man and the Elephant

A coding agent exploring your codebase is the proverbial blind man touching an elephant. It cannot see the whole picture. It uses grep and read to touch one small piece at a time, then infers (guesses) the rest.

The numbers:

  • Effective context at 200K tokens? The agent can “feel” the elephant’s knee. Maybe part of a leg.
  • At 1M-2M tokens? It can feel more — but research shows accuracy drops 30%+ when relevant information sits in middle positions (the “lost in the middle” problem).
  • 60-80% of an agent’s token budget goes to orientation — figuring out where things are — not solving the actual problem.

The Goldfish Brain

Close a session. Open a new one. The agent greets you like a stranger: “Hello! How can I help you today?”

All context from the previous session? Gone. Every debugging insight, every architectural decision, every “do not touch this file” warning — erased.

This is session amnesia, and it was the most complained-about limitation in 2024-2025.

The Hungry Mind

LLMs “think” within a fixed token budget. When the budget runs low mid-reasoning, the model does not gracefully stop and say “I need more resources.” It does one of three things:

  1. Stubs — writes empty functions with # TODO comments
  2. Hallucinations — generates plausible-looking but incorrect code
  3. Hardcodes — replaces dynamic logic with static values to reduce complexity
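
An invented but typical example of what that looks like in the diff:

```python
# Invented example of budget exhaustion in practice.
# The task was "apply the user's regional tax rules"; the model ran out of room.

def calculate_tax(order):
    # TODO: implement regional tax lookup          <- stub left behind
    return order.total * 0.2                       # <- hardcoded rate instead of dynamic logic
```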

This is not laziness. It is a fundamental constraint: the model literally does not have enough compute tokens to think through the problem properly.


The Antidotes: What Actually Works

Understanding the beast reveals the cures. Here is the toolkit that transforms agentic coding from gambling into engineering:

For Amnesia: Memory Systems

| Solution | How it works |
|---|---|
| CLAUDE.md | Manually maintained project instructions loaded every session |
| Auto Memory | Claude Code watches conversations, extracts insights, saves structured summaries to disk automatically |
| claude-mem | Cross-session memory plugin that compresses session history and injects relevant context into future sessions |

For Blindness: Visibility Tools

| Solution | What it gives the agent |
|---|---|
| GitNexus | Knowledge graph of your codebase — symbols, relationships, call chains, blast radius analysis. 28K+ GitHub stars |
| Serena MCP | IDE-quality semantic understanding across 30+ languages. Symbol-level navigation, not token-based grep |
| Chrome DevTools MCP | Full browser visibility — DOM, computed styles, performance metrics, ARIA roles. Evidence-based debugging |
| Docker MCP | Container logs, runtime behavior observation |

For Hunger: Context Efficiency

  • Plan mode — research and map the codebase before consuming tokens on code generation
  • Sub-agents — isolate tasks into separate context windows so no single agent runs out of space
  • Strict conventions — well-structured, consistently formatted code is easier for agents to parse and less likely to cause hallucinations
  • Modular architecture — small, focused modules that fit within a single context window

For Quality: Automated Verification

  • CodeRabbit — AI code review with 40+ static analysis tools integrated
  • QA agents — custom evaluation pipelines that verify plans before coding and code before deploying
  • Test-driven spec development — define the specification first, let agents implement against it, verify with automated test suites
  • Smoke tests on deploy — verify the deployed application actually works before routing traffic
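
A smoke test does not need to be elaborate. A minimal sketch, with a hypothetical URL and thresholds, is just a health-check loop whose exit code gates the deploy:

```python
# Minimal smoke test: verify the freshly deployed app actually responds
# before routing traffic to it. URL, endpoint, and retry settings are examples.

import sys
import time
import urllib.request

def smoke_test(base_url: str, attempts: int = 10, delay: float = 3.0) -> bool:
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass                          # not up yet; retry
        time.sleep(delay)
    return False

if __name__ == "__main__":
    ok = smoke_test("https://staging.example.com")
    sys.exit(0 if ok else 1)              # non-zero exit blocks the deploy pipeline
```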

Where Are You?

Be honest.

| Level | Title | You are here if… |
|---|---|---|
| 1 | Vibe Coding | You prompt and pray. AI is magic until it is not. |
| 2 | Vibe Engineering | You use CLI agents and plan mode. You burn credits on hard bugs. |
| 3 | Fullstack Builder | You build complete products solo. Testing is your main job. |
| 4 | Agentic Engineering | You build the harness. Agents work within your system. |
| 5 | Software Craftsmanship | Agents have pipelines and evals. You sleep while they ship. |

Most developers I work with are somewhere between Level 2 and 3. The gap between Level 3 and Level 4 is the hardest jump — it requires a fundamental mindset shift from “using AI to code” to “engineering systems that make AI code reliably.”

The industry is moving fast. 84% of developers use AI tools. The market is $7.37 billion and growing 35-40% annually. MCP has 5,800+ servers and 97M monthly SDK downloads.

But the trust problem remains. Only 29% trust the output. 45% say debugging AI code is harder than writing it themselves.

The engineers who will thrive are not the ones who prompt the hardest. They are the ones who build the harness that makes prompting unnecessary.

“2025 was the year AI agents proved they could write code. 2026 is the year we learned that the agent isn’t the hard part — the harness is.”

The harness is where the craft lives.


Built with agentic engineering. Verified by a QA agent. Reviewed by a human. Shipped to production.
