Here is a debugging session I watched play out last year.

The developer noticed the blog’s dark mode was flickering on page load. They asked Claude to fix it. Claude read the CSS. It found a likely cause. It made a change. The developer refreshed. Still flickering. Claude made another guess. Refresh. Still flickering. Third guess. Refresh. The flickering was gone — but now there was a white flash instead.

After 45 minutes and 6 attempts, the developer found the root cause themselves: a localStorage.getItem() call was happening after the initial paint, causing a double-render. The fix was 3 lines.
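For reference, the usual shape of that three-line fix is to read the stored preference synchronously, before the first paint, rather than after hydration. A minimal sketch with hypothetical names (the blog's actual storage key and class may differ):

// Inline in <head>, so it runs before the browser paints anything.
// Hypothetical names: 'theme' storage key, 'dark' class.
const saved = localStorage.getItem('theme');
if (saved === 'dark') {
  document.documentElement.classList.add('dark');
}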

Why did the agent fail? It was not the model. Claude Opus can reason about race conditions. The problem was that the agent was guessing without evidence. It could not see the rendered page. It could not see the browser console. It could not observe the actual painting behavior.

It was debugging in the dark.


What “Debugging in the Dark” Costs

When an agent cannot observe the actual system behavior, it falls back on pattern matching. It finds code that looks like the problem and makes changes that look like solutions.

This produces a specific failure pattern:

  1. Agent makes a plausible guess
  2. Change is applied
  3. Issue persists (or transforms into a different issue)
  4. Agent makes another plausible guess
  5. Repeat until budget exhausted or developer gives up

Every loop costs:

  • API tokens (money)
  • Agent context (the context window fills with failed attempts)
  • Developer attention (watching, reviewing, approving each attempt)

The pattern-guessing loop is the most expensive thing in agentic development.


The Three Blind Spots

Coding agents operating in the terminal have three fundamental visibility gaps:

Blind Spot 1: The Rendered UI

The agent can read HTML files. It cannot see what a browser renders from that HTML.

The gap is larger than it seems. A browser does not just parse HTML — it:

  • Computes CSS specificity across all loaded stylesheets
  • Applies animations and transitions
  • Evaluates JavaScript that modifies the DOM
  • Handles z-index stacking across layers
  • Renders fonts with subpixel anti-aliasing

An element that looks correctly positioned in the source HTML might be invisible in the render because another element with z-index: 9999 sits on top of it. The agent cannot know that without seeing the render.

Taking screenshots helps but does not fully solve this. A screenshot gives the agent pixels but no element IDs, no computed styles, no ARIA roles. “The button looks off” is a different kind of problem to debug than “the button never receives clicks because an invisible overlay with a higher z-index is intercepting pointer events.”
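That structured information is exactly what the browser can report directly. A quick sketch of reading the computed values for a single element (the selector is hypothetical):

// What the browser actually computed, which may differ from what the stylesheet declares.
const el = document.querySelector<HTMLElement>('#buy-button'); // hypothetical selector
if (el) {
  const computed = getComputedStyle(el);
  console.log(computed.zIndex, computed.position, computed.pointerEvents);
}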

Blind Spot 2: Runtime Logs

When code runs in production, the agent cannot see what is happening unless you explicitly pipe logs to it.

A Cloudflare Worker that silently fails after 10ms because of a character encoding issue — the agent cannot know. It can read the source code. It cannot observe the execution.

This is why the Cloudflare Workers 1101 error was so hard to debug without evidence. The worker was receiving the markdown, processing it, and then — at some point — failing. The error showed in Cloudflare’s dashboard but not in the source code.
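The simplest way to close that gap for a Worker is to stream its runtime logs while reproducing the failure, for example with Wrangler's tail command (the worker name here is hypothetical):

# Stream live logs from the deployed Worker while reproducing the failure
npx wrangler tail blog-tts-worker --format pretty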

Blind Spot 3: The Semantic Structure

Even within the code it can read, the agent has limited semantic understanding. It reads text. It does not understand the meaning.

Ask an agent to add a feature to a 50,000-line codebase, and it will search for relevant files, find some of them, miss others, and make changes that are locally correct but globally inconsistent.


The Evidence-Based Stack

The solution is to give the agent evidence — actual observations of the system — not just source code.

These are the tools I use:

Chrome DevTools MCP

Google released Chrome DevTools MCP in public preview on September 22, 2025. It gives AI coding agents direct access to the Chrome DevTools Protocol.

What this means in practice: the agent can now observe the browser instead of guessing at it.

# Start Chrome DevTools MCP server
npx @google/chrome-devtools-mcp

What the agent can do with DevTools MCP:

DOM inspection:

agent: "What is the computed z-index of element #modal-overlay?"
devtools: "Element #modal-overlay: z-index computed value is 'auto' (not 9999 as specified in source).
           Parent .modal-wrapper has overflow: hidden which resets stacking context.
           Fix: move modal-overlay outside .modal-wrapper in the DOM."

Style debugging:

agent: "Why is the dark mode toggle not applying the dark class?"
devtools: "Event listener on #dark-toggle is firing correctly.
           document.documentElement.classList.add('dark') IS being called.
           But a MutationObserver on .theme-wrapper is removing the 'dark' class 12ms later.
           Stack trace: theme.js:47 -> initThemeObserver -> removeAttribute"

Performance profiling:

agent: "Why is the page loading slowly on mobile?"
devtools: "Largest Contentful Paint: 4.2s.
           Blocking resource: /fonts/Inter-var.woff2 (1.2MB, no preload hint).
           Recommendation: add <link rel='preload'> for font, serve subset variant."

The agent moves from “I think the problem might be…” to “The problem is X, at line Y, caused by Z.”

I used Chrome DevTools MCP to debug the blog’s reading mode. The highlight sync was off by one paragraph. With DevTools, the agent could observe the scrollY value, the paragraph bounding boxes, and the highlight CSS state simultaneously. Fix in 8 minutes. Without DevTools: I estimate 2+ hours of guessing.
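The observations behind that fix are plain DOM reads once the agent can execute them in the live page. A rough sketch of the evidence gathered (the highlight class name is hypothetical):

// Snapshot scroll position, paragraph geometry, and highlight state in one pass.
const paragraphs = Array.from(document.querySelectorAll('article p'));
const snapshot = paragraphs.map((p, i) => ({
  index: i,
  top: p.getBoundingClientRect().top + window.scrollY, // absolute top of each paragraph
  highlighted: p.classList.contains('reading-highlight'), // hypothetical highlight class
}));
console.log({ scrollY: window.scrollY, snapshot });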

Docker Logs MCP

For server-side debugging, Docker Logs MCP gives the agent access to container stdout/stderr in real time.

Setup:

// .claude/settings.json
{
  "mcpServers": {
    "docker": {
      "command": "npx",
      "args": ["@anthropic/mcp-docker"],
      "type": "stdio"
    }
  }
}
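The underlying evidence is the same stream you would read by hand with the Docker CLI; the MCP server just lets the agent query it without leaving the session. For reference (container name hypothetical):

# Follow the last 100 lines of a container's stdout/stderr
docker logs --follow --tail 100 blog-api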

Real example: The TTS caching upgrade last night. After deploying, I wanted to verify the Cloudflare Cache API was actually caching (not just returning headers claiming it was). The agent:

  1. Made a TTS request for a test phrase
  2. Checked the X-TTS-Cache response header (MISS on first request)
  3. Made the same request again
  4. Checked the header (HIT — proving edge cache worked)
  5. Checked Wrangler logs to confirm the function was not executing on the second request

Without log access, we would have to trust the headers. With logs, we confirmed the cache was working at the infrastructure level.
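That check is also easy to script. A minimal sketch of the two-request verification, assuming a hypothetical endpoint URL and the X-TTS-Cache header named above:

// Two identical requests: the first should MISS the edge cache, the second should HIT it.
const url = 'https://example.com/api/tts?text=test-phrase'; // hypothetical endpoint
const first = await fetch(url);
console.log('first:', first.headers.get('X-TTS-Cache'));   // expected: MISS
const second = await fetch(url);
console.log('second:', second.headers.get('X-TTS-Cache')); // expected: HIT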

Serena MCP — Semantic Code Understanding

Serena is an open-source MCP server that provides IDE-quality semantic understanding of code. It is not text search — it is actual symbol resolution.

pip install serena-mcp
serena-mcp init

What this changes:

Without Serena:

agent: grep for "getLessonById"
result: 23 matches across 12 files
agent: reads all 12 files
agent: runs out of context
agent: makes educated guess about which match is relevant

With Serena:

agent: "Find all callers of getLessonById and their expected return type"
serena: "getLessonById called from 3 locations:
         1. apps/web/src/app/[locale]/lessons/[id]/page.tsx:47
            -- expects: Lesson | null (handles null case with redirect)
         2. apps/web/src/components/LessonCard.tsx:23
            -- expects: Lesson (no null check -- potential crash if lesson missing)
         3. apps/web/src/api/lesson-export.ts:89
            -- expects: Lesson[] (WRONG TYPE -- will fail silently)"

Serena found the type mismatch in lesson-export.ts that grep never would have found — because it understands types, not just text.

I use Serena for all cross-module changes. Before touching any shared utility, I check the call graph. Every caller, every expected type. The QA agent’s wiring verification runs on top of Serena’s output.

GitNexus — Codebase Knowledge Graph

GitNexus indexes the repository into a knowledge graph: nodes are code entities (functions, classes, interfaces), edges are relationships (calls, imports, inheritance).

Key use case: blast radius analysis. Before changing a shared function, the agent asks GitNexus: “If I change the return type of synthesizeChunk(), what else breaks?”

GitNexus blast radius: synthesizeChunk()

Direct callers: onRequestPost() in tts.ts
Indirect callers: none (edge function, no imports)

Impact: LOW -- change is isolated to this file.
Safe to modify return type.

Compare to changing a shared utility:

GitNexus blast radius: formatDate()

Direct callers:
- BlogPostLayout.astro (3 uses)
- sitemap.ts (1 use)
- rss.xml.ts (1 use)

Indirect callers via re-export:
- BaseLayout.astro
- OpenGraph.astro

Impact: HIGH -- changing return format will break 6 render paths.
Consider: new function or optional parameter.

This single piece of information — the blast radius — prevents the most common breaking change pattern in agentic coding.


A Full Evidence-Based Debug Session

Here is how a debugging session looks with full evidence access.

Problem: The Vietnamese blog post TTS was playing in English despite being on the /vi/ URL path.

Without evidence:

  • The agent would search for language detection code
  • Find multiple candidate implementations
  • Guess which one is wrong
  • Make a change
  • Hope it works

With evidence:

  1. DevTools MCP: Agent checks the document.documentElement.lang value on the VI page
     • Result: "en" — the lang attribute is English despite the VI URL
  2. Source investigation: Agent reads BaseLayout.astro
     • Finds: lang={locale} is set correctly — the HTML element should have lang="vi"
  3. DevTools again: Agent checks whether the page has finished hydrating
     • Finds: lang is initially "en", then changes to "vi" after hydration — a race condition
  4. Root cause: TTSPlayer.astro initializes before Astro View Transitions updates the lang attribute
  5. Fix: Read document.documentElement.lang at the moment TTS is triggered (when the user clicks play) rather than at initialization (see the sketch below)
  6. Verify: DevTools confirms lang is "vi" at play-click time — the fix is correct
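A minimal sketch of what step 5 looks like in practice (startTTS and the button selector are hypothetical stand-ins for the TTSPlayer internals):

// Hypothetical stub for the player entry point.
const startTTS = (lang: string) => console.log('starting TTS in', lang);

const playButton = document.querySelector<HTMLButtonElement>('#tts-play'); // hypothetical selector
playButton?.addEventListener('click', () => {
  // Read the language at click time: by now View Transitions has set lang="vi" on /vi/ pages.
  const lang = document.documentElement.lang || 'en';
  startTTS(lang);
});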

Total time: 11 minutes. No guessing. Every step grounded in observed evidence.


The Principle: Measure, Then Fix

The rule for evidence-based debugging is simple:

Before proposing a fix, observe the problem.

  1. What is the actual value of the state that seems wrong?
  2. Where is that value being set?
  3. What should it be set to?
  4. When does the wrong value appear?

Four questions. Four observations. One fix.

The agent that answers these four questions with actual measurements will fix the bug correctly in one try. The agent that guesses will need six.

Give your agent eyes. The tools exist. Use them.


This is Part 4 of the “Built with Agentic Engineering” series. Previous: The QA Agent. Next: From Agent to Production — The Full Deployment Pipeline.
