Every blog post on this site was built by an agent. The TTS system that reads this to you was upgraded by an agent. The Cambridge YLE curriculum for CubLearn — 22 files, 8 game lessons, 14 pronunciation exercises — shipped in a single afternoon, driven by agents.

The footer of each post says:

Built with agentic engineering. Verified by a QA agent. Reviewed by a human. Shipped to production.

That is not marketing copy. That is the actual workflow. This series documents it from the ground up — what it means, how it works, and what it takes to make agents ship code you can trust.


Why “Vibe Coding” Is Not the Destination

Vibe coding works. For demos. For MVPs nobody will maintain. For the first three days of a side project before you realize the codebase is held together with hallucinated APIs and hardcoded values.

The Stack Overflow 2025 survey asked 49,000 developers about AI coding tools. The numbers tell the story:

  • 84% use or plan to use AI coding tools
  • 45% say debugging AI code takes longer than writing it themselves
  • 29% actually trust the output

That gap — 84% adoption, 29% trust — is the problem vibe coding creates. Developers are using AI, but nearly half report spending more time debugging its output than they would have spent writing the code themselves.

The answer is not to use AI less. It is to use it differently.


What Agentic Engineering Actually Is

Agentic engineering means engineering the system that the agent operates within — not just prompting the agent and hoping.

The shift is from:

  • “AI, write me code” (vibe coding)

To:

  • “AI, here is the specification. Here are the constraints. Here is the verification suite. Now build it — and prove it works.”

The difference is not the prompt. It is everything around the prompt: the harness, the plan mode, the QA loop, the automated tests, the deployment gates.

The agent writes the code. You design the system that makes the code worth writing.


The Workflow in Practice

Here is exactly how this portfolio site gets built and updated:

1. Specification First

Before any agent touches code, the task is defined:

  • What feature are we adding?
  • What files will be modified?
  • What does success look like (tests, behavior, output)?
  • What must NOT break?

This lives in CLAUDE.md, the persistent project memory that loads into every session: key decisions, architecture choices, naming conventions, “do not touch” warnings. 400 lines of context that keep the agent from re-learning the codebase each time.
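
The exact contents are project specific, but a trimmed, hypothetical excerpt shows the shape. The section names and rules below are illustrative, not the real file:

```markdown
# CLAUDE.md (illustrative excerpt)

## Architecture
- Astro site on Cloudflare Pages; API endpoints live in functions/api/
- TTS endpoint: functions/api/tts.ts (keep the response shape stable)

## Conventions
- TypeScript, strict mode; follow the existing naming and async patterns
- No em-dashes or smart quotes in published content

## Do not touch
- The audio player's public interface in TTSPlayer.astro
```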

2. Plan Mode — Research Before Code

Claude Code’s plan mode is non-negotiable. Before writing a single line, the agent:

  1. Reads all relevant files (grep, read, knowledge graph traversal)
  2. Maps the call chain (what calls what, what breaks if X changes)
  3. Produces a written plan with specific file paths and function names
  4. Lists everything that could go wrong

The plan gets reviewed. Not always by me — more on that in Part 3.

3. Code Generation — Constrained and Monitored

With an approved plan, the coding agent executes. Key constraints in our workflow:

  • Modular architecture — each module is small enough to fit in one context window. The agent edits one module at a time.
  • Coding conventions — enforced via CLAUDE.md and custom linting hooks. The agent cannot ship code that violates our style (snake_case vs camelCase, async patterns, error handling conventions).
  • Hook system — pre-commit hooks run automatically. The agent cannot bypass them.
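
As one illustration of the hook idea, a minimal pre-commit script can enforce mechanical rules like the no-smart-quotes convention before a commit is accepted. This is a sketch under assumptions, not the project's actual hook set:

```ts
// pre-commit-check.ts (illustrative sketch, not the project's real hook)
// Run from a git pre-commit hook, e.g. `npx tsx pre-commit-check.ts`.
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

const FORBIDDEN = [
  { name: "em-dash", pattern: /\u2014/ },
  { name: "smart quote", pattern: /[\u2018\u2019\u201C\u201D]/ },
];

// Files staged for this commit.
const staged = execSync("git diff --cached --name-only --diff-filter=ACM")
  .toString()
  .split("\n")
  .filter((f) => /\.(ts|astro|md)$/.test(f));

let failed = false;
for (const file of staged) {
  const text = readFileSync(file, "utf8");
  for (const rule of FORBIDDEN) {
    if (rule.pattern.test(text)) {
      console.error(`${file}: contains ${rule.name}`);
      failed = true;
    }
  }
}

// A non-zero exit aborts the commit, so the agent cannot ship the violation.
process.exit(failed ? 1 : 0);
```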

4. QA Agent Review — The Brutal Filter

After code generation, a dedicated QA agent reviews the output against a structured checklist:

  • Test coverage — does the code have tests? Do they actually test the behavior?
  • Wiring verification — is every new function actually called somewhere in the application flow?
  • Convention compliance — does the code follow our established patterns?
  • Edge cases — what happens with empty input? Null values? Network failure?
  • Plan fidelity — does the implementation match what was planned?

The QA agent has rejected the same plan 4-5 times before finally approving it. The full review loop runs roughly 10x slower than raw code generation. The quality difference is dramatic.
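
The wiring check in particular is mechanical enough to sketch. The QA agent does this with its own tooling; a rough standalone approximation of the idea (with names and file globs assumed) would be:

```ts
// wiring-check.ts: a rough approximation of the "is it actually called?" idea,
// not the QA agent's real implementation. Function names are passed on the CLI.
import { execSync } from "node:child_process";

const newFunctions = process.argv.slice(2);

for (const fn of newFunctions) {
  // Find call sites across the repo, excluding the `function foo(` declaration itself.
  // (Arrow-function definitions would need a smarter filter; this is only a sketch.)
  const hits = execSync(`git grep -n "${fn}(" -- "*.ts" "*.astro" || true`)
    .toString()
    .split("\n")
    .filter((line) => line.trim() !== "" && !line.includes(`function ${fn}(`));

  if (hits.length === 0) {
    console.error(`UNWIRED: ${fn} is defined but never called`);
    process.exitCode = 1;
  }
}
```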

5. Human Review — The 5-Minute Scan

After QA approval, I review the diff. Not line by line — I trust the QA agent for that. I am looking for:

  • Does this make architectural sense?
  • Is anything surprising in the diff?
  • Are there any security or data concerns the QA agent might miss?

This takes 5 minutes for most changes. Sometimes 30 seconds.

6. CI/CD — The Automated Gate

GitHub CI runs on every push:

  • TypeScript compilation
  • Unit tests
  • Integration tests (where applicable)
  • Build validation

If any step fails, nothing deploys. The agent’s code must pass the same gates as human code.
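
A minimal GitHub Actions workflow expressing those gates could look like the sketch below. The job layout and npm script names are assumptions, not the repository's actual configuration:

```yaml
# .github/workflows/ci.yml (illustrative sketch, not the real workflow)
name: CI
on: [push]

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx tsc --noEmit   # TypeScript compilation
      - run: npm test           # unit and integration tests
      - run: npm run build      # build validation
      # Any failing step fails the job, and a failed job blocks deployment.
```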

7. Smoke Tests on Deploy — The Final Check

After Cloudflare Pages deploys:

  1. Automated smoke tests hit key endpoints
  2. Docker logs are checked for runtime errors
  3. Core user journeys are verified (blog post renders, TTS works, newsletter subscribe works)

If smoke tests fail: automatic rollback.

If they pass: traffic routes to the new deployment.
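
A smoke test at this stage stays deliberately shallow: hit a handful of URLs on the new deployment and fail loudly if anything basic is broken. A minimal sketch, with the base URL and paths as assumptions:

```ts
// smoke-test.ts (illustrative sketch; paths and env var name are assumptions)
const BASE = process.env.DEPLOY_URL ?? "https://example.com";
const paths = ["/", "/blog/", "/api/tts-config"];

let failed = false;
for (const path of paths) {
  const res = await fetch(`${BASE}${path}`);
  console.log(`${res.ok ? "PASS" : "FAIL"} ${path} (${res.status})`);
  if (!res.ok) failed = true;
}

// A non-zero exit tells the deploy script to roll back instead of routing traffic.
if (failed) process.exit(1);
```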


A Real Example: Upgrading TTS in 30 Minutes

Here is what this workflow looks like in practice. Last night, I asked an agent to upgrade the site’s Text-to-Speech from Journey voices to Chirp3-HD voices with 30-day caching.

The spec I provided:

  • Upgrade to Google Chirp3-HD voices (en-US-Chirp3-HD-Charon, vi-VN-Chirp3-HD-Charon)
  • Add Cloudflare Cache API with SHA-256 hash keys and 30-day TTL
  • Auto-detect page language (document.documentElement.lang) for voice selection
  • Update TTSPlayer.astro with 4 targeted patches

What the agent did:

  1. Read the existing functions/api/tts.ts (108 lines)
  2. Read functions/api/tts-config.ts (37 lines)
  3. Read the relevant sections of TTSPlayer.astro (2328 lines, searched for Journey, voice =, /api/tts)
  4. Generated 3 new files and applied 4 surgical patches
  5. Verified: no em-dashes, no smart quotes, no breaking changes to the audio player interface

Total wall-clock time: 28 minutes from spec to pushed commit.

The old approach (pre-agentic workflow): Read docs, write code, test locally, debug CORS headers, fix the base64 decode, fix the language code extraction, finally push. Probably 3-4 hours.
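
For a sense of what the caching part of that spec amounts to, here is a rough sketch of the pattern in a Cloudflare Pages Function: hash the voice and text into a cache key, serve from the Cache API when possible, and store synthesized audio for 30 days. The handler shape and the synthesize helper are assumptions, not the deployed functions/api/tts.ts:

```ts
// Illustrative sketch of the cache-then-synthesize pattern, not the deployed tts.ts.
// Assumes @cloudflare/workers-types for the PagesFunction type.
// `synthesize` is a hypothetical stand-in for the Google TTS call and base64 decode.
declare function synthesize(text: string, voice: string): Promise<ArrayBuffer>;

export const onRequestPost: PagesFunction = async ({ request }) => {
  // The client sends the page language it read from document.documentElement.lang.
  const { text, lang } = (await request.json()) as { text: string; lang: string };
  const voice = lang.startsWith("vi")
    ? "vi-VN-Chirp3-HD-Charon"
    : "en-US-Chirp3-HD-Charon";

  // SHA-256 of voice + text becomes the cache key.
  const digest = await crypto.subtle.digest(
    "SHA-256",
    new TextEncoder().encode(`${voice}:${text}`),
  );
  const hash = [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
  const cacheKey = new Request(`https://tts-cache.internal/${hash}`);

  // Serve from the Cloudflare Cache API when this audio already exists.
  const cache = caches.default;
  const cached = await cache.match(cacheKey);
  if (cached) return cached;

  // Otherwise synthesize, then cache for 30 days (2,592,000 seconds).
  const audio = await synthesize(text, voice);
  const response = new Response(audio, {
    headers: {
      "Content-Type": "audio/mpeg",
      "Cache-Control": "public, max-age=2592000",
    },
  });
  await cache.put(cacheKey, response.clone());
  return response;
};
```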


What This Series Covers

This is Part 1 of a 5-part series. Here is what is coming:

Part 2 — “Plan Mode or Bust”: Why every coding session must start with a research phase. How plan mode works in Claude Code. Real examples of plans that got rejected and why.

Part 3 — “The QA Agent”: The methodology behind automated code verification. What the QA agent checks that humans miss. The wiring verification problem (code that gets written but never called). Real rejection logs.

Part 4 — “Evidence-Based Debugging”: How to give your agent eyes. Chrome DevTools MCP, Docker log streaming, Serena semantic code understanding. The difference between an agent that guesses and an agent that measures.

Part 5 — “From Agent to Production”: The full CI/CD pipeline for agentic code. Smoke tests, rollback strategy, deployment gates. How to sleep while your agents ship.


The Real Metric

I used to spend 60-70% of my AI coding time on testing and debugging.

Now I spend it on specification and review.

The agents do more. I direct more. The software is better.

That is the promise of agentic engineering. Not magic — engineering.


This is Part 1 of the “Built with Agentic Engineering” series. Next: Plan Mode or Bust — Why Agents Must Think Before They Code
