The biggest complaint about AI-generated code is that it looks right but does not work right. Functions get written that no code path ever calls. Tests get written that test the wrong thing. Edge cases get handled in comments but not in code.

I built a QA agent specifically to catch these problems. It runs 10x slower than raw code generation. It rejects more plans than it approves on the first try. And it has all but eliminated production bugs from my agentic workflow.

Here is how it works.


The Problem: Why AI Code Looks Right But Is Not

When a coding agent produces output, you are seeing the result of probabilistic pattern completion. The model has seen millions of code files. It knows what correct code looks like. It can reproduce the patterns convincingly.

But pattern recognition is not understanding. An agent does not know that userService.fetchById() is called from 3 places and all 3 callers expect a User | null return type. It sees the function signature and writes what looks right.

The three failure modes I see most often:

1. The Orphan Function

The agent writes a complete, well-documented function. It compiles. It has proper types. Nobody calls it.

This happens when the agent implements a feature “bottom up” — writes the implementation first, then forgets to wire it into the application flow. The code exists, tests pass on the function itself, and the feature simply does not appear in the UI.

Real example from a CubLearn session:

// Agent wrote this complete function
export function getYLEExercisesByLevel(level: 'starters' | 'movers' | 'flyers') {
  return PRONUNCIATION_EXERCISES.filter(ex => ex.id.startsWith(`en-yle-${level}`));
}

// But nobody called it -- the game component still used a hardcoded array
// The new exercises were invisible to users

2. The Happy-Path Test

The agent writes tests. The tests pass. The tests only cover the case where everything goes right.

// Agent's test
it('should fetch lesson by id', async () => {
  const lesson = await getLessonById('yle-starters-numbers');
  expect(lesson.title).toBe('Numbers and Counting');
});

// Missing tests:
// - What if id does not exist?
// - What if id is empty string?
// - What if database is unreachable?
// - What if lesson has no vocabulary?

The happy-path test gives 100% line coverage on the success branch and zero coverage on the failure branches that will definitely be hit in production.
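
For contrast, here is a sketch of what the missing tests could look like, assuming a Vitest suite. The import path, the lesson ids, and the null-versus-throw contract are assumptions -- pinning those down is exactly what writing these tests forces you to do:

// Hypothetical failure-path tests. Import path, lesson ids, and the
// null-vs-throw contract are placeholders, not the real API.
import { describe, it, expect } from 'vitest';
import { getLessonById } from './lessons';

describe('getLessonById failure paths', () => {
  it('resolves to null for an id that does not exist', async () => {
    await expect(getLessonById('no-such-lesson')).resolves.toBeNull();
  });

  it('rejects an empty-string id instead of querying', async () => {
    await expect(getLessonById('')).rejects.toThrow();
  });

  it('handles a lesson with no vocabulary without crashing', async () => {
    const lesson = await getLessonById('yle-starters-bare');
    expect(lesson?.vocabulary ?? []).toBeInstanceOf(Array);
  });
});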

3. The Stub That Ships

This one comes from mid-task token exhaustion. The agent writes:

async function syncToCloudflare(posts: BlogPost[]): Promise<void> {
  // TODO: implement chunked upload for large post arrays
  // For now, handle single posts only
  if (posts.length === 1) {
    await uploadSingle(posts[0]);
  }
}

The comment says “for now.” It ships. Six months later, someone tries to sync 50 posts and nothing happens.
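
For contrast, a finished version might look like the sketch below. It keeps uploadSingle and adds the promised chunking; CHUNK_SIZE is an assumed limit, not a documented one:

// Hypothetical completion of the stub: every post gets uploaded,
// parallel within a chunk, sequential across chunks.
const CHUNK_SIZE = 10; // assumed batch cap -- tune to the real API

async function syncToCloudflare(posts: BlogPost[]): Promise<void> {
  for (let i = 0; i < posts.length; i += CHUNK_SIZE) {
    const chunk = posts.slice(i, i + CHUNK_SIZE);
    await Promise.all(chunk.map((post) => uploadSingle(post)));
  }
}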


The QA Agent Methodology

My QA agent runs four verification passes in sequence. Each pass can reject the work and send it back for revision.

Pass 1: Plan Audit

Before any code is written, the plan gets reviewed.

Checklist:

  • Every changed file is listed with specific line ranges
  • Every new function is named explicitly (no “add helper functions as needed”)
  • Type interfaces for new data structures are specified
  • The “do not touch” list is explicit
  • Risk assessment includes what could break
  • Every external dependency is identified (APIs, env vars, third-party libs)

Rejection trigger: Any item is vague, missing, or uses language like “appropriate handling” or “necessary changes.”

Real rejection from a recent session:

PLAN REJECTED -- Pass 1

Item 3: "Update the blog rendering pipeline as needed"

This is not a plan. "As needed" is not a file path.
List every file in the rendering pipeline and specify
exactly which lines will change and why.

Item 7: Missing -- what happens when the Cloudflare R2
upload fails? The plan has no error handling specification.
Is this a hard fail or a silent skip?

Return with specific file paths and error handling spec.
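
The vague-language half of that trigger is mechanical enough to automate. A minimal sketch, assuming the plan arrives as plain text; the phrase list starts from the examples above and grows as new evasions appear:

// Pass 1 vagueness trigger -- structural checks (file paths, line
// ranges, named functions) are separate and not shown here.
const VAGUE_PHRASES = [
  'as needed',
  'appropriate handling',
  'necessary changes',
  'add helper functions as needed',
];

function findVagueLanguage(plan: string): string[] {
  const lower = plan.toLowerCase();
  return VAGUE_PHRASES.filter((phrase) => lower.includes(phrase));
}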

Pass 2: Wiring Verification

After code is generated, the QA agent traces every new function from definition to call site.

The question is simple: Is this code reachable?

Wiring check: getYLEExercisesByLevel()

Definition: packages/domain/src/rules/pronunciation.ts:147
Expected call sites: game components that render pronunciation exercises

Search results:
- apps/web/src/app/[locale]/games/pronunciation/page.tsx -- NOT FOUND
- apps/web/src/components/PronunciationGame.tsx -- NOT FOUND
- apps/web/src/data/english-lessons.ts -- NOT FOUND

Result: ORPHAN FUNCTION -- not callable from any code path
Status: FAIL -- add call site or remove function

This check alone has caught 6 orphan functions across my projects in the last month.

Pass 3: Test Coverage Audit

For each new function or changed behavior, the QA agent verifies:

  1. Success path test exists — the happy path is tested
  2. Failure path tests exist — at least: empty input, null/undefined, network failure, unexpected type
  3. Integration test exists — the function works when called from its actual call site (not just in isolation)
  4. No dead tests — every test references a function that actually exists

Coverage threshold: 80% line coverage minimum, 100% branch coverage on critical paths (auth, data mutation, external API calls).
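
Thresholds like these can be enforced by the test runner rather than by the agent. A sketch assuming Vitest with the v8 coverage provider -- the glob-scoped thresholds and the paths are assumptions to check against your setup:

// vitest.config.ts -- globs and numbers mirror the thresholds above
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    coverage: {
      provider: 'v8',
      thresholds: {
        lines: 80, // global minimum
        // critical paths: 100% branch coverage
        'src/auth/**': { branches: 100 },
        'src/api/**': { branches: 100 },
      },
    },
  },
});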

Test audit: tts.ts changes

Functions added/changed: 4
- sha256Hex() -- 0 tests -- FAIL
- langCodeFromVoice() -- 0 tests -- FAIL
- isChirp3HD() -- 0 tests -- WARN (pure function, low risk)
- synthesizeChunk() -- 1 test (happy path only) -- WARN

Missing test cases:
- sha256Hex('') -- empty string behavior
- langCodeFromVoice('invalid') -- malformed voice name
- synthesizeChunk() when Google API returns 429 (rate limit)
- synthesizeChunk() when audioContent is malformed base64

Add tests or document why these cases are acceptable to skip.

Pass 4: Convention Compliance

The QA agent runs our coding conventions checklist:

For TypeScript:

  • No any types (use unknown with type guards)
  • No non-null assertions (!) on external data
  • All async functions have try/catch or explicit error propagation
  • No console.log in production code (use structured logging)
  • All exported interfaces documented with JSDoc
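
Three of those five items map directly onto stock lint rules, so the QA agent can delegate them to the linter. A sketch of the relevant fragment; the try/catch and JSDoc checks still need custom tooling:

// .eslintrc.cjs fragment -- rule names are standard ESLint and
// typescript-eslint rules.
module.exports = {
  rules: {
    '@typescript-eslint/no-explicit-any': 'error',
    '@typescript-eslint/no-non-null-assertion': 'error',
    'no-console': 'error',
  },
};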

For Markdown (blog posts):

  • No em-dashes (U+2014) — use --
  • No smart quotes — use "straight quotes"
  • Vietnamese content has full diacritics (checked by counting non-ASCII characters)
  • All required YAML frontmatter fields are present

For Cloudflare Workers:

  • No Date.now() (use performance.now() or pass timestamp)
  • All Response objects include CORS headers
  • ctx.waitUntil() used for non-blocking background tasks

This last check is what caught the Cloudflare Workers 1101 error pattern early in my workflow. Em-dashes and smart quotes cause silent failures in CF Workers. The convention check runs on every markdown file now.
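
The character scan itself is small. A minimal sketch of the detector described above, covering the two offenders named in the checklist:

// Flags the characters that have caused CF Workers 1101 errors in
// this pipeline; extend the table as new offenders show up.
const FORBIDDEN: Array<[RegExp, string]> = [
  [/\u2014/, 'em-dash (use --)'],
  [/[\u2018\u2019\u201C\u201D]/, 'smart quote (use straight quotes)'],
];

function checkMarkdown(source: string): string[] {
  const issues: string[] = [];
  source.split('\n').forEach((line, i) => {
    for (const [pattern, message] of FORBIDDEN) {
      if (pattern.test(line)) issues.push(`line ${i + 1}: ${message}`);
    }
  });
  return issues;
}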


The 10x Time Investment: Is It Worth It?

The QA agent runs 10x slower than raw code generation. A task that takes the coding agent 5 minutes takes 50 minutes with QA review.

Is it worth it?

Let me compare two months of data from my own projects:

Before QA agent (June-July 2025):

  • 23 deployments
  • 8 production bugs found after deploy
  • 3 rollbacks required
  • Average time to fix post-deploy bug: 2.5 hours

After QA agent (August-September 2025):

  • 31 deployments
  • 1 production bug found after deploy (CORS header on new endpoint — caught by smoke test, not QA)
  • 0 rollbacks
  • Time to fix: 15 minutes (smoke test caught it immediately)

The QA agent costs 10x in time per task. It saved approximately 20 hours of post-deploy debugging in two months (8 bugs at an average of 2.5 hours each).

More importantly: it changed the nature of my work. I stopped being a debugger. I became a specification writer and reviewer.


Real Code: The QA Agent’s Core Loop

Here is a simplified version of the verification loop I run:

from dataclasses import dataclass, field

# audit_plan, check_wiring, audit_tests, check_conventions, and
# generate_summary are the per-pass checkers, elided here.

@dataclass
class QAResult:
    status: str                        # 'PASS' | 'WARN' | 'REJECT'
    pass_num: int | None = None        # which pass produced the verdict
    issues: list[str] = field(default_factory=list)
    summary: str = ''

def run_qa_review(plan: str, code_diff: str, context: dict) -> QAResult:
    # Pass 1: Plan audit
    plan_issues = audit_plan(plan, context['spec'])
    if plan_issues:
        return QAResult(status='REJECT', pass_num=1, issues=plan_issues)

    # Pass 2: Wiring check
    wiring_issues = check_wiring(code_diff, context['codebase_graph'])
    if wiring_issues:
        return QAResult(status='REJECT', pass_num=2, issues=wiring_issues)

    # Pass 3: Test coverage -- a warning must not short-circuit
    # Pass 4, so hold the issues and keep going
    coverage_issues = audit_tests(code_diff, threshold=0.8)

    # Pass 4: Convention compliance
    convention_issues = check_conventions(code_diff, context['conventions'])
    if convention_issues:
        return QAResult(status='REJECT', pass_num=4, issues=convention_issues)

    if coverage_issues:
        return QAResult(status='WARN', pass_num=3, issues=coverage_issues)

    return QAResult(status='PASS', summary=generate_summary(code_diff))

The check_wiring() function is the heart of it. It builds a call graph from the diff, then checks each new function against the graph. Any function with no incoming edges is flagged as a potential orphan.
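
The production version builds a real call graph, but the core idea fits in a sketch. Here it is reduced to a text-level scan in TypeScript -- crude enough to miss dynamic dispatch and legitimate same-file calls, but it would have flagged getYLEExercisesByLevel:

// Naive orphan detector: a function counts as "called" if any file
// other than its defining file contains `name(`. A real implementation
// should use the compiler API for an actual call graph.
import { readdirSync, readFileSync, statSync } from 'node:fs';
import { join } from 'node:path';

function listSourceFiles(dir: string, out: string[] = []): string[] {
  for (const entry of readdirSync(dir)) {
    const full = join(dir, entry);
    if (statSync(full).isDirectory()) {
      if (entry !== 'node_modules') listSourceFiles(full, out);
    } else if (/\.tsx?$/.test(full)) {
      out.push(full);
    }
  }
  return out;
}

function findOrphans(
  newFns: Array<{ name: string; definedIn: string }>,
  repoRoot: string,
): string[] {
  const files = listSourceFiles(repoRoot);
  return newFns
    .filter(({ name, definedIn }) => {
      const callSite = new RegExp(`\\b${name}\\s*\\(`);
      return !files.some(
        (f) => !f.endsWith(definedIn) && callSite.test(readFileSync(f, 'utf8')),
      );
    })
    .map((fn) => fn.name);
}

Anything this returns is either dead code or a feature waiting to be wired in -- both count as a FAIL in Pass 2.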


What the QA Agent Cannot Catch

Being honest about limitations:

  1. Business logic errors — if the spec is wrong, the QA agent cannot know. It verifies “does the code do what was planned” not “was the plan right.”
  2. Performance regressions — the QA agent does not run load tests.
  3. Security vulnerabilities — convention checks catch common patterns (SQL injection, unvalidated input) but are not a substitute for a security audit.
  4. UI/UX problems — the agent cannot see the rendered interface.
  5. Data migration correctness — it can check that a migration runs, not that it migrates the right data.

For these, I use additional tools: CodeRabbit for security/quality checks, Chrome DevTools MCP for UI visibility, manual testing for UX flows.

The QA agent is one layer of a defense-in-depth stack, not the only layer.


Getting Started: Build Your Own QA Agent

If you want to implement this pattern, start simple:

Minimum viable QA agent:

  1. After each code generation, prompt the agent: “List every new function in this diff. For each function, find all places in the codebase that call it. Return any functions with zero call sites.”
  2. Review the list. Delete or wire in every orphan.

That single check — wiring verification — will catch more bugs than any other single step.

Add test auditing next. Then convention checks. Build up incrementally.

The methodology matters more than the implementation. Start verifying, iterate on what you find.


This is Part 3 of the “Built with Agentic Engineering” series. Previous: Plan Mode or Bust. Next: Evidence-Based Debugging — Giving Your Agent Eyes.
