I have watched hundreds of developers adopt AI coding tools over the past year. Some ship production-grade software in hours. Others burn through $200 in API credits debugging a button color.

The difference is not talent. It is not the model they use. It is where they sit on the agentic coding maturity curve — and whether they know it.

Andrej Karpathy coined “vibe coding” in February 2025. One year later, he declared it “passé” and rebranded the practice as agentic engineering — “because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight.”

That single sentence captures a tectonic shift. But it skips the messy middle. The part where you mass-produce tech trash before learning how to build something real.

Here is the maturity model I use to evaluate where an engineer (or founder) actually stands. Five levels. No gatekeeping — just an honest map of the terrain.


Level 1: Vibe Coding — The Sugar Rush

“There’s a new kind of coding where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.” — Andrej Karpathy

This is where everyone starts. You type a prompt. AI generates a website. You see pixels on screen. It feels like magic.

The tools: Cursor IDE, Lovable, Replit Agent, Bolt.new. Point-and-click interfaces that turn natural language into running applications.

The loop: Write prompt -> wait for code -> see UI -> repeat N times.

The numbers tell the story:

  • Lovable hit $100M ARR in 8 months — potentially the fastest-growing startup in history
  • Replit’s revenue jumped from $10M to $100M in 9 months after launching Replit Agent
  • 84% of developers now use or plan to use AI tools (Stack Overflow 2025)

And yet:

  • 77% of developers say vibe coding is not part of their professional work (Stack Overflow 2025)
  • 45% say debugging AI-generated code takes longer than writing it themselves
  • Only 29% trust AI output to be accurate — down from 40% in 2024

The pattern is always the same. Build something small: amazing. Build something real: it works for two days, then collapses into a spaghetti codebase that nobody — including the AI — can untangle.

You are not building software. You are generating demos.

💡 Level 1 signal: Your main activity is testing whether the AI got it right, not designing what “right” means.


Level 2: Vibe Engineering — The CLI Upgrade

Level 2 is where technically literate users graduate from IDE copilots to terminal-based agents: Claude Code, OpenAI Codex CLI, OpenCode.

The difference is significant. Instead of auto-completing one file, these tools operate across your entire codebase. They grep, read, plan, edit multiple files, run tests, and iterate.

The toolkit:

| Tool | What it does | Stars/Users |
|---|---|---|
| Claude Code | Anthropic’s agentic CLI: sub-agents, skills, plan mode, memory | 10% dev adoption |
| Codex CLI | OpenAI’s terminal agent: cloud sandboxes, multimodal input | New entrant |
| OpenCode | Open-source, 75+ model providers, no vendor lock-in | 120K+ GitHub stars |

New techniques unlocked:

  • Plan mode — the agent reads your codebase before touching it. Research first, code second.
  • Sub-agents — spawn specialized workers for parallel tasks, each in isolated context.
  • MCP (Model Context Protocol) — connect agents to browsers, databases, Docker, and 5,800+ external tool servers. 97M monthly SDK downloads.
  • Skills and frameworks — reusable instruction sets (like Super Claude Framework) that enrich prompts with coding conventions, project rules, and workflow habits.

The reality check:

You can handle larger codebases now. But when bugs hit, the debugging loop can stretch for hours. The agent tries fix after fix, each one making the codebase slightly worse. You are burning Opus credits like jet fuel.

If you are a developer, you eventually open VS Code, set a breakpoint, find the bug in 3 minutes, and curse the AI for wasting your afternoon.

If you are not a developer, you are stuck. You wait for the next frontier model release hoping it will magically fix everything.

💡 Level 2 signal: You know what plan mode is. You have strong opinions about Claude Code vs Cursor. You have rage-quit at least one session after hitting rate limits mid-debug.


Level 3: Fullstack Builder — One Person, Infinite Leverage

At Level 3, the agent is not just writing code. It is your designer, copywriter, DevOps engineer, QA tester, and database admin — all at once.

The unlock: Instead of asking the agent to “fix this bug,” you are feeding it entire workflows:

  • UI/UX design systems
  • Copywriting guidelines
  • CI/CD pipeline configs
  • Database migration scripts
  • Deployment automation

The key skill is finding the right MCP server, skill, or plugin for each domain. Want the agent to see your browser? Chrome DevTools MCP. Want it to understand your Docker logs? Docker MCP. Want it to manage your database? Supabase MCP.

What changed in 2025-2026:

MCP server downloads grew from 100,000 in November 2024 to 8 million by April 2025. By Q1 2025, 28% of Fortune 500 companies had implemented MCP in their AI stacks. In December 2025, Anthropic donated MCP to the Linux Foundation’s Agentic AI Foundation — signaling it is now infrastructure, not a product feature.

The painful truth:

You are powerful but fragile. When the agent hits a wall, you escalate to the most expensive model available — Opus 4.7, GPT-5.5 — praying it can brute-force through the problem. If you hit rate limits before the bug is fixed… well, there is no polite way to describe that frustration.

Your actual job at Level 3 is testing. You are a full-time QA engineer for an AI developer that writes code at the speed of light and debugs at the speed of a drunk snail.

💡 Level 3 signal: You can build almost anything. You just cannot guarantee it works. Your credit card statement looks like a small SaaS company’s cloud bill.


Level 4: Agentic Engineering — Building the Machine That Builds

This is where the game changes fundamentally.

You stop fighting the agent’s limitations and start engineering around them. You build frameworks, harness layers, hooks, triggers, and feedback loops that make agents reliable by design.

The core insight:

“The agent isn’t the hard part — the harness is.” — Anthropic, 2026 Agentic Coding Trends Report

What is a harness?

Martin Fowler defines it as: Agent = Model + Harness. The harness is everything that wraps the core reasoning loop:

  • Tool execution dispatch
  • Context management and compaction
  • Safety enforcement and permission gates
  • Session persistence across context windows
  • Structured handoff artifacts between phases

The agent writes the code. The harness ensures the code is correct, consistent, and production-ready.
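
To make the decomposition concrete, here is a minimal, hypothetical harness sketch in Python. None of these names come from a real SDK; the point is where the responsibilities live: the model proposes actions, the harness dispatches tools, compacts context, and enforces permission gates.

```python
# Hypothetical harness skeleton: the model proposes actions, the harness
# decides what is allowed to run and what the model gets to see next.

from dataclasses import dataclass, field

@dataclass
class Action:
    tool: str              # e.g. "read_file", "edit_file", "run_tests"
    args: dict

@dataclass
class Harness:
    allowed_tools: set
    history: list = field(default_factory=list)
    max_context_items: int = 50

    def permitted(self, action: Action) -> bool:
        # Safety enforcement: writes and shell commands need an explicit allowlist.
        return action.tool in self.allowed_tools

    def compact(self) -> list:
        # Context management: keep only the most recent observations
        # (a real harness would summarize instead of truncating).
        return self.history[-self.max_context_items:]

    def run(self, model, tools: dict, goal: str, max_steps: int = 20):
        for _ in range(max_steps):
            action = model.propose(goal, self.compact())   # core reasoning loop (hypothetical API)
            if action is None:                             # model signals it is done
                return self.history
            if not self.permitted(action):
                self.history.append(("denied", action.tool))
                continue
            result = tools[action.tool](**action.args)     # tool execution dispatch
            self.history.append((action.tool, result))     # session persistence
        return self.history
```

Real harnesses are vastly more sophisticated than this, but they wrap the same loop: propose, gate, execute, record.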

Level 4 engineers build:

  1. Custom evaluation pipelines — Every agent output passes through automated quality gates before it touches production code.
  2. Modular architecture — Clean separation of concerns so the agent can work on one module without breaking others. This is not optional — agents cannot reason about 50,000-line monoliths.
  3. Lifecycle hooks — Pre-commit checks, post-generation tests, smoke tests on deploy. The agent’s work is verified at every stage.
  4. Custom tools — MCP servers, CLI plugins, or scripts that give the agent exactly the visibility and capability it needs for your specific project.
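
Item 4 is easier to picture with code. Below is a minimal custom MCP server sketched against the official MCP Python SDK’s FastMCP helper (check the SDK docs for the current API; the tool itself and its name are invented for illustration):

```python
# Minimal custom MCP server: expose the last N lines of a service's Docker logs
# so the agent can debug from evidence instead of guesses.
# Assumes the official MCP Python SDK (`pip install mcp`); the API may differ by version.

import subprocess
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("project-visibility")

@mcp.tool()
def docker_logs(container: str, tail: int = 100) -> str:
    """Return the last `tail` lines of logs for a running container."""
    result = subprocess.run(
        ["docker", "logs", "--tail", str(tail), container],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout + result.stderr

if __name__ == "__main__":
    mcp.run()  # speaks MCP over stdio; register it in your agent's MCP config
```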

The telltale sign of a Level 4 engineer: They build their own tools. They publish open-source frameworks. They collect GitHub stars like trading cards.

💡 Level 4 signal: You have a CLAUDE.md file longer than most people’s READMEs. You have opinions about harness architecture. You have built at least one custom MCP server.


Level 5: Software Craftsmanship — The Grandmaster

Level 5 is mastery. The “ten thousand swords return to one” moment. Everything clicks.

At this level, you are not using agents. You are designing agent systems.

What Level 5 looks like:

Going deep: You build frameworks that compete on the global stage. You evaluate Palantir’s Ontology and think “I can architect something better for my domain.” You fork open-source agents and modify their reasoning loops to match your engineering standards.

Going wide: You run coding farms. Multiple Mac Minis, dozens of tmux sessions, agents working 24/7 on parallel tasks. Each agent has:

  • A well-defined pipeline with clear input/output contracts
  • Evaluation checkpoints (“evals”) at every stage
  • Automated rollback on quality regression (a hypothetical sketch follows this list)
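
One such stage might look like the sketch below. All names are invented: the agent works on a branch, an eval checkpoint scores the result, and anything below baseline gets rolled back automatically.

```python
# Hypothetical pipeline stage: every agent task has an input/output contract,
# an eval checkpoint, and an automated rollback on quality regression.

import subprocess

def run_evals() -> float:
    """Return a 0-1 quality score, e.g. the pass/fail outcome of the test suite."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return 1.0 if result.returncode == 0 else 0.0

def stage(agent_task, baseline_score: float, branch: str = "agent/example-task"):
    subprocess.run(["git", "checkout", "-b", branch], check=True)
    agent_task()                          # the agent does its work on the branch
    score = run_evals()                   # evaluation checkpoint
    if score < baseline_score:            # quality regression
        subprocess.run(["git", "checkout", "main"], check=True)
        subprocess.run(["git", "branch", "-D", branch], check=True)   # rollback
        return False
    return True                           # hand off to the next stage / review
```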

The QA revolution:

Here is where it gets interesting. I invested in building a dedicated QA agent with a rigorous methodology — borrowing from multiple established testing frameworks and adapting them for AI-generated code.

The agent takes nearly 10x longer to complete a task compared to raw code generation. But the output is battle-tested.

Here is what the QA agent actually does:

  1. Plan review — Before the coding agent writes a single line, the QA agent tears the plan apart. It checks for gaps, stubs, unhandled edge cases, and architectural blind spots. Plans that look “good enough” to a human get rejected 4-5 times before passing.

  2. Post-code audit — After generation, the QA agent runs:

    • Test coverage analysis
    • Wiring verification — does every function actually get called in the application flow? Or did the agent write beautiful code that nothing ever invokes? (A rough sketch of this check follows the list.)
    • Convention compliance checks
    • A structured checklist against quality criteria
  3. Code review layer — Tools like CodeRabbit (2M+ repos, 13M+ PRs reviewed, highest F1 score at 60.1%) catch bugs that slip through. Combined with strict coding conventions, this creates defense in depth.

  4. CI/CD integration — GitHub CI tests, smoke tests on deploy, Docker log verification. Then — and only then — production.
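
Wiring verification (step 2) sounds exotic, but the simplest version is mechanical. This rough sketch, not the actual QA agent, flags top-level functions that are defined but never referenced anywhere in the project:

```python
# Rough wiring check: flag functions that are defined but never referenced.
# This catches the classic failure mode of beautiful code that nothing calls.
# A real QA agent would also follow dynamic dispatch, decorators, and entry points.

import ast
import pathlib

def unreferenced_functions(root: str = ".") -> set:
    defined, referenced = set(), set()
    for path in pathlib.Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                defined.add(node.name)            # a definition is not a use
            elif isinstance(node, ast.Name):
                referenced.add(node.id)           # direct calls and references
            elif isinstance(node, ast.Attribute):
                referenced.add(node.attr)         # method-style references
    return defined - referenced - {"main"}

if __name__ == "__main__":
    for name in sorted(unreferenced_functions()):
        print(f"possibly dead wiring: {name}()")
```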

The result: Agents can push code to production autonomously. For real. Not “vibe deploy and pray” — actually verified, tested, production-grade code.

💡 Level 5 signal: Your agents have a pipeline. You sleep while they ship. The code that reaches production has been reviewed by more automated checks than most human-written code gets in a typical startup.


Understanding the Beast: Why Agents Struggle

To level up in agentic engineering, you need to understand why agents fail. The core limitations have not changed — but the workarounds have gotten dramatically better.

The Blind Man and the Elephant

A coding agent exploring your codebase is the proverbial blind man touching an elephant. It cannot see the whole picture. It uses grep and read to touch one small piece at a time, then infers (guesses) the rest.

The numbers:

  • Effective context at 200K tokens? The agent can “feel” the elephant’s knee. Maybe part of a leg.
  • At 1M-2M tokens? It can feel more — but research shows accuracy drops 30%+ when relevant information sits in middle positions (the “lost in the middle” problem).
  • 60-80% of an agent’s token budget goes to orientation — figuring out where things are — not solving the actual problem.

The Goldfish Brain

Close a session. Open a new one. The agent greets you like a stranger: “Hello! How can I help you today?”

All context from the previous session? Gone. Every debugging insight, every architectural decision, every “do not touch this file” warning — erased.

This is session amnesia, and it was the most complained-about limitation in 2024-2025.

The Hungry Mind

LLMs “think” within a fixed token budget. When the budget runs low mid-reasoning, the model does not gracefully stop and say “I need more resources.” It does one of three things:

  1. Stubs — writes empty functions with # TODO comments
  2. Hallucinations — generates plausible-looking but incorrect code
  3. Hardcodes — replaces dynamic logic with static values to reduce complexity
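
An invented but typical example of what that looks like in the diff:

```python
# Invented example of budget exhaustion in practice.
# The task was "apply the user's regional tax rules"; the model ran out of room.

def calculate_tax(order):
    # TODO: implement regional tax lookup          <- stub left behind
    return order.total * 0.2                       # <- hardcoded rate instead of dynamic logic
```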

This is not laziness. It is a fundamental constraint: the model literally does not have enough compute tokens to think through the problem properly.


The Antidotes: What Actually Works

Understanding the beast reveals the cures. Here is the toolkit that transforms agentic coding from gambling into engineering:

For Amnesia: Memory Systems

| Solution | How it works |
|---|---|
| CLAUDE.md | Manually maintained project instructions loaded every session |
| Auto Memory | Claude Code watches conversations, extracts insights, saves structured summaries to disk automatically |
| claude-mem | Cross-session memory plugin that compresses session history and injects relevant context into future sessions |

For Blindness: Visibility Tools

| Solution | What it gives the agent |
|---|---|
| GitNexus | Knowledge graph of your codebase — symbols, relationships, call chains, blast radius analysis. 28K+ GitHub stars |
| Serena MCP | IDE-quality semantic understanding across 30+ languages. Symbol-level navigation, not token-based grep |
| Chrome DevTools MCP | Full browser visibility — DOM, computed styles, performance metrics, ARIA roles. Evidence-based debugging |
| Docker MCP | Container logs, runtime behavior observation |

For Hunger: Context Efficiency

  • Plan mode — research and map the codebase before consuming tokens on code generation
  • Sub-agents — isolate tasks into separate context windows so no single agent runs out of space
  • Strict conventions — well-structured, consistently formatted code is easier for agents to parse and less likely to cause hallucinations
  • Modular architecture — small, focused modules that fit within a single context window

For Quality: Automated Verification

  • CodeRabbit — AI code review with 40+ static analysis tools integrated
  • QA agents — custom evaluation pipelines that verify plans before coding and code before deploying
  • Test-driven spec development — define the specification first, let agents implement against it, verify with automated test suites
  • Smoke tests on deploy — verify the deployed application actually works before routing traffic
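
A smoke test does not need to be elaborate. A minimal sketch, with a hypothetical URL and thresholds, is just a health-check loop whose exit code gates the deploy:

```python
# Minimal smoke test: verify the freshly deployed app actually responds
# before routing traffic to it. URL, endpoint, and retry settings are examples.

import sys
import time
import urllib.request

def smoke_test(base_url: str, attempts: int = 10, delay: float = 3.0) -> bool:
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass                          # not up yet; retry
        time.sleep(delay)
    return False

if __name__ == "__main__":
    ok = smoke_test("https://staging.example.com")
    sys.exit(0 if ok else 1)              # non-zero exit blocks the deploy pipeline
```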

Where Are You?

Be honest.

| Level | Title | You are here if… |
|---|---|---|
| 1 | Vibe Coding | You prompt and pray. AI is magic until it is not. |
| 2 | Vibe Engineering | You use CLI agents and plan mode. You burn credits on hard bugs. |
| 3 | Fullstack Builder | You build complete products solo. Testing is your main job. |
| 4 | Agentic Engineering | You build the harness. Agents work within your system. |
| 5 | Software Craftsmanship | Agents have pipelines and evals. You sleep while they ship. |

Most developers I work with are somewhere between Level 2 and 3. The gap between Level 3 and Level 4 is the hardest jump — it requires a fundamental mindset shift from “using AI to code” to “engineering systems that make AI code reliably.”

The industry is moving fast. 84% of developers use AI tools. The market is $7.37 billion and growing 35-40% annually. MCP has 5,800+ servers and 97M monthly SDK downloads.

But the trust problem remains. Only 29% trust the output. 45% say debugging AI code is harder than writing it themselves.

The engineers who will thrive are not the ones who prompt the hardest. They are the ones who build the harness that makes prompting unnecessary.

“2025 was the year AI agents proved they could write code. 2026 is the year we learned that the agent isn’t the hard part — the harness is.”

The harness is where the craft lives.


Built with agentic engineering. Verified by a QA agent. Reviewed by a human. Shipped to production.
