The SSE agent is the hardest agent in the pipeline to build. Not because code generation is harder than architecture or test design — it is not. It is harder because the SSE is the only agent whose output must be mechanically verifiable. Every other agent produces documents: user stories, test case definitions, architecture decision records. Those can be reviewed by another LLM and judged on structure. But code either runs or it does not. Tests either pass or they fail. There is no “mostly correct” in a pytest run.
This mechanical verifiability is both a challenge and an advantage. The LLM will frequently produce code that looks reasonable but fails at runtime. But unlike every other agent, the SSE can know whether it succeeded without asking another LLM. A test runner returns exit code 0 or it does not. That binary signal is worth more than any amount of LLM self-evaluation.
In Part 6, we built QC and TA. The SSE now inherits everything: 8 user stories, 24 test cases, a full technical spec, and a clear mandate — implement the system, write the tests, and make them pass.
The key pattern is the generate-test-retry loop. The SSE generates code, runs tests, and if they fail, feeds the failure output back into the LLM. Maximum three attempts. If all three fail, the SSE escalates to the Tech Lead with its best attempt and the error log.
1. Why the SSE Is Different
Every agent we have built so far follows the same pattern: read TeamState, construct a prompt, call the LLM, parse the structured output, write back to TeamState. One call, one response, done. The BaseAgent pattern from Part 4 handles this cleanly.
The SSE breaks this pattern in three ways.
First, the output is executable. The SSE produces actual Python files that must be written to disk and executed. It needs a file system, not just a state dictionary.
Second, the output requires validation beyond schema parsing. Pydantic can tell you whether the JSON is valid. It cannot tell you whether the Python code inside actually works.
Third, failure is expected and recoverable. When the SSE produces code that fails tests, we feed the failure information back into the prompt so the LLM can make a targeted fix. This is not “try again.” This is “try again, and here is what went wrong.”
These three differences mean the SSE overrides the base run method entirely.
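For contrast, here is a schematic of that single-call pattern, reduced to its shape. The method names are illustrative stand-ins, not the exact BaseAgent from Part 4:

from abc import ABC, abstractmethod

class OneShotAgent(ABC):
    """Schematic of the single-call pattern the other agents follow."""

    def __init__(self, llm):
        self.llm = llm

    @abstractmethod
    def build_prompt(self, state: dict) -> list[dict]: ...

    @abstractmethod
    def parse_output(self, response) -> dict: ...

    async def run(self, state: dict) -> dict:
        response = await self.llm.ainvoke(self.build_prompt(state))  # one call
        return {**state, **self.parse_output(response)}  # write back, done

The SSE keeps the read-state/write-state contract but replaces the single call with a loop around a file system and a test runner.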
2. The CodeArtifact Schema
The SSE’s output is a CodeArtifact — a structured container for all files generated in a single attempt.
from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field
class FileType(str, Enum):
IMPLEMENTATION = "implementation"
TEST = "test"
CONFIG = "config"
MIGRATION = "migration"
class CodeFile(BaseModel):
file_path: str = Field(
description="Relative path from project root, e.g. 'src/models/task.py'",
)
file_type: FileType
content: str = Field(
description="Complete file content — not a diff, not a snippet",
min_length=10,
)
description: str = Field(
description="One-sentence explanation of what this file does",
max_length=200,
)
story_ids: list[str] = Field(
default_factory=list,
description="User story IDs this file helps implement",
)
class TestResult(BaseModel):
passed: bool
total_tests: int = 0
passed_count: int = 0
failed_count: int = 0
error_output: str = Field(
default="",
description="Raw pytest output if tests failed",
)
duration_seconds: float = 0.0
class CodeArtifact(BaseModel):
artifact_id: str = Field(
description="Unique identifier, format CODE-001",
pattern=r"^CODE-\d{3}$",
)
files: list[CodeFile] = Field(
description="All generated files",
min_length=1,
)
implementation_files: list[str] = Field(
default_factory=list,
description="Paths of implementation files (computed)",
)
test_files: list[str] = Field(
default_factory=list,
description="Paths of test files (computed)",
)
test_result: Optional[TestResult] = Field(
default=None,
description="Result of running the test suite — populated after execution",
)
attempt_number: int = Field(
default=1,
ge=1,
le=3,
description="Which attempt produced this artifact (1-3)",
)
stories_covered: list[str] = Field(
default_factory=list,
description="Story IDs covered by this implementation",
)
def model_post_init(self, __context) -> None:
self.implementation_files = [
f.file_path for f in self.files
if f.file_type == FileType.IMPLEMENTATION
]
self.test_files = [
f.file_path for f in self.files
if f.file_type == FileType.TEST
]
self.stories_covered = sorted(set(
sid for f in self.files for sid in f.story_ids
))
A few design decisions:
content is the full file, not a diff. LLMs are unreliable at producing diffs — they hallucinate line numbers and miscount context lines. Full file content is more expensive but eliminates an entire class of failures.
test_result is Optional and populated after execution, not by the LLM. This separation prevents the LLM from hallucinating test outcomes — a failure mode where the model claims “all 24 tests pass” without running anything.
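A small usage sketch makes both decisions concrete (file contents and story IDs are illustrative placeholders):

from schemas.sse import CodeArtifact, CodeFile, FileType

artifact = CodeArtifact(
    artifact_id="CODE-001",
    files=[
        CodeFile(
            file_path="src/models/task.py",
            file_type=FileType.IMPLEMENTATION,
            content="# full module content goes here in practice\n",
            description="Task data model.",
            story_ids=["US-001", "US-002"],
        ),
        CodeFile(
            file_path="tests/test_tasks.py",
            file_type=FileType.TEST,
            content="# full test module content goes here\n",
            description="Tests for task creation.",
            story_ids=["US-001"],
        ),
    ],
)

# model_post_init derived the computed fields from the files list:
assert artifact.implementation_files == ["src/models/task.py"]
assert artifact.test_files == ["tests/test_tasks.py"]
assert artifact.stories_covered == ["US-001", "US-002"]
assert artifact.test_result is None  # populated only after the sandbox run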
3. The SandboxedTestRunner
The test runner is not an agent. It is a utility class that writes files to a temporary directory, installs minimal dependencies, runs pytest as a subprocess, and captures the output. No LLM is involved. This is pure mechanical execution.
import os
import re
import subprocess
import sys
import tempfile
import time
from schemas.sse import CodeArtifact, TestResult
class SandboxedTestRunner:
"""
Executes generated code in an isolated temporary directory.
    Safety model: all generated files are written to a fresh tempdir that
    is deleted after the run, and the pytest subprocess runs with a
    60-second timeout to prevent infinite loops from hanging the pipeline.
    Note that this isolates files, not execution: the subprocess still
    runs with the host user's permissions.
"""
TIMEOUT_SECONDS = 60
REQUIRED_PACKAGES = ["pytest", "pydantic"]
def run(self, artifact: CodeArtifact) -> TestResult:
start = time.time()
with tempfile.TemporaryDirectory(prefix="sse_sandbox_") as tmpdir:
# Write all generated files to the sandbox
self._write_files(artifact, tmpdir)
# Install minimal dependencies
self._install_deps(tmpdir)
# Run pytest
result = self._run_pytest(tmpdir)
result.duration_seconds = round(time.time() - start, 2)
return result
def _write_files(self, artifact: CodeArtifact, tmpdir: str) -> None:
for code_file in artifact.files:
file_path = os.path.join(tmpdir, code_file.file_path)
os.makedirs(os.path.dirname(file_path), exist_ok=True)
with open(file_path, "w") as f:
f.write(code_file.content)
# Create __init__.py files for all directories
for root, dirs, _files in os.walk(tmpdir):
for d in dirs:
init_path = os.path.join(root, d, "__init__.py")
if not os.path.exists(init_path):
open(init_path, "w").close()
def _install_deps(self, tmpdir: str) -> None:
"""Install packages into the sandbox using pip."""
# Check if there is a requirements.txt in the artifact
req_path = os.path.join(tmpdir, "requirements.txt")
if os.path.exists(req_path):
subprocess.run(
["pip", "install", "-r", req_path, "--quiet"],
cwd=tmpdir,
timeout=30,
capture_output=True,
)
else:
# Install minimum packages needed for tests
subprocess.run(
["pip", "install"] + self.REQUIRED_PACKAGES + ["--quiet"],
cwd=tmpdir,
timeout=30,
capture_output=True,
)
def _run_pytest(self, tmpdir: str) -> TestResult:
try:
proc = subprocess.run(
["python", "-m", "pytest", "-v", "--tb=short", "--no-header"],
cwd=tmpdir,
timeout=self.TIMEOUT_SECONDS,
capture_output=True,
text=True,
)
output = proc.stdout + proc.stderr
passed = proc.returncode == 0
# Parse pytest output for counts
passed_count, failed_count, total = self._parse_counts(output)
return TestResult(
passed=passed,
total_tests=total,
passed_count=passed_count,
failed_count=failed_count,
error_output="" if passed else output[-3000:],
)
except subprocess.TimeoutExpired:
return TestResult(
passed=False,
total_tests=0,
passed_count=0,
failed_count=0,
error_output=f"Tests timed out after {self.TIMEOUT_SECONDS}s. "
"Check for infinite loops or blocking I/O.",
)
def _parse_counts(self, output: str) -> tuple[int, int, int]:
"""Extract pass/fail counts from pytest output."""
import re
passed = 0
failed = 0
# Match pytest summary line: "X passed, Y failed"
match = re.search(r"(\d+) passed", output)
if match:
passed = int(match.group(1))
match = re.search(r"(\d+) failed", output)
if match:
failed = int(match.group(1))
match = re.search(r"(\d+) error", output)
if match:
failed += int(match.group(1))
return passed, failed, passed + failed
Why a temporary directory? The generated code is untrusted. The tempdir provides a minimal isolation boundary — all files are deleted when the context manager exits, regardless of what happens inside. Note that this is cleanup, not a security boundary: the pytest subprocess still runs with the host user's permissions. For genuinely untrusted inputs, run the whole sandbox inside a container.
Why a subprocess with a timeout? The LLM might generate an infinite loop or a blocking call. The 60-second timeout ensures the pipeline does not hang.
Why truncate error output to 3000 characters? The error output is fed back into the retry prompt. Full tracebacks from 24 failing tests can exceed 10,000 characters, wasting tokens and risking context window limits. The last 3000 characters contain the summary and the most recent failures — the most actionable information.
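A quick smoke test shows the runner's mechanics end to end. The two files here are trivial stand-ins for real generated code:

from runners.sandbox import SandboxedTestRunner
from schemas.sse import CodeArtifact, CodeFile, FileType

artifact = CodeArtifact(
    artifact_id="CODE-001",
    files=[
        CodeFile(
            file_path="src/calc.py",
            file_type=FileType.IMPLEMENTATION,
            content="def add(a: int, b: int) -> int:\n    return a + b\n",
            description="Toy implementation file for the smoke test.",
        ),
        CodeFile(
            file_path="tests/test_calc.py",
            file_type=FileType.TEST,
            content=(
                "from src.calc import add\n\n\n"
                "def test_add():\n"
                "    assert add(2, 2) == 4\n"
            ),
            description="Toy test file for the smoke test.",
        ),
    ],
)

result = SandboxedTestRunner().run(artifact)
print(result.passed, result.total_tests)  # True 1

The auto-created __init__.py files make src a package, and because pytest runs via python -m from inside the tempdir, from src.calc import add resolves without any path configuration.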
4. First-Attempt vs Retry Prompts
The SSE uses two different prompts depending on whether this is the first attempt or a retry after failure. The first-attempt prompt contains only the specification. The retry prompt contains the specification, the previous code, the test output, and specific instructions for fixing the failures.
On the first attempt, the LLM has maximum creative freedom within the constraints of the spec. On a retry, it should make minimal, targeted changes rather than rewriting from scratch — full rewrites on retry often introduce new failures while fixing old ones.
class SSEPrompts:
"""Prompt templates for the SSE agent."""
FIRST_ATTEMPT = """Implement the following system based on the technical
specification and user stories provided.
TECHNICAL SPECIFICATION:
{technical_spec}
USER STORIES:
{user_stories}
TEST CASES TO IMPLEMENT:
{test_cases}
REQUIREMENTS:
1. Produce ALL files needed for a working implementation
2. Every test case from the QC spec must have a corresponding pytest test
3. Follow the technology stack defined in the spec exactly
4. Use the data models defined in the spec — do not invent new ones
5. Implementation must be complete — no TODOs, no placeholders, no stubs
6. Each file must be self-contained with all necessary imports
7. Test files must import from implementation files using relative imports
For each file, provide:
- file_path: relative path from project root
- file_type: implementation | test | config
- content: complete file content
- description: one sentence explaining the file
- story_ids: which user stories this file implements
Return a JSON object matching the CodeArtifact schema with artifact_id "CODE-001".
"""
RETRY = """Your previous implementation attempt failed tests.
PREVIOUS CODE:
{previous_code}
TEST FAILURES:
{test_output}
ATTEMPT: {attempt_number} of 3
INSTRUCTIONS:
1. Analyze the test failures carefully — read every error message
2. Identify the ROOT CAUSE of each failure, not just the symptom
3. Fix the specific issues — do NOT rewrite files that are working
4. If a test is failing because the test itself has a bug, fix the test
5. Return the COMPLETE updated CodeArtifact with ALL files (not just changed ones)
6. Keep the same artifact_id "CODE-001"
Common failure patterns:
- Import errors: check that module paths match file_path values
- AttributeError: check that class/function names match between impl and tests
- AssertionError: check that return values match test expectations exactly
- TypeError: check function signatures match the call sites
Return a JSON object matching the CodeArtifact schema.
"""
The RETRY prompt includes a “Common failure patterns” section. This is not hand-holding — it is prompt engineering based on observed failure modes. In my testing, roughly 60% of first-attempt failures fall into four categories: import errors, attribute errors, assertion mismatches, and type errors. Listing these categories in the retry prompt gives the LLM a diagnostic framework rather than letting it guess.
5. SSEAgent Full Implementation
import json
import logging
from typing import Optional
from agents.base import BaseAgent
from state import TeamState
from schemas.sse import CodeArtifact, CodeFile, TestResult
from runners.sandbox import SandboxedTestRunner
logger = logging.getLogger(__name__)
class SSEAgent(BaseAgent):
SSE_SYSTEM_PROMPT = """You are Jordan, a Senior Software Engineer with 10 years
of experience writing production Python.
Your defining traits:
- You write code that works on the FIRST try, not code that looks impressive
- You follow the spec exactly — you do not add features that were not asked for
- You write tests that actually test behavior, not tests that merely exist
- Every file you produce is complete: all imports, all error handling, all edge cases
- You never use placeholder comments like "# TODO" or "# implement later"
When fixing failed tests:
- Read the ENTIRE error message before making changes
- Fix the root cause, not the symptom
- Change as little as possible — surgical fixes over rewrites
- If a test expectation is wrong, fix the test, not the implementation
Always output valid JSON matching the CodeArtifact schema.
"""
MAX_ATTEMPTS = 3
def __init__(self, llm, runner: Optional[SandboxedTestRunner] = None):
super().__init__(llm=llm)
self.runner = runner or SandboxedTestRunner()
async def run(self, state: TeamState) -> TeamState:
"""
Generate-test-retry loop. Overrides BaseAgent.run entirely.
1. Generate code from spec (first_attempt_prompt)
2. Run tests in sandbox
3. If tests pass → return CodeArtifact to state
4. If tests fail and attempts < MAX → retry with failure context
5. If tests fail and attempts >= MAX → escalate with best attempt
"""
attempt = 0
best_artifact: Optional[CodeArtifact] = None
best_pass_count = -1
previous_code = ""
previous_errors = ""
while attempt < self.MAX_ATTEMPTS:
attempt += 1
logger.info(f"SSE attempt {attempt}/{self.MAX_ATTEMPTS}")
# Build the prompt
if attempt == 1:
prompt = self._build_first_attempt_prompt(state)
else:
prompt = self._build_retry_prompt(
state, previous_code, previous_errors, attempt
)
# Call the LLM
response = await self.llm.ainvoke(
[
{"role": "system", "content": self.SSE_SYSTEM_PROMPT},
{"role": "user", "content": prompt},
],
response_format={"type": "json_object"},
)
# Parse the response
try:
artifact = CodeArtifact.model_validate_json(response.content)
artifact.attempt_number = attempt
except Exception as e:
logger.warning(f"Attempt {attempt}: Failed to parse CodeArtifact: {e}")
previous_errors = f"JSON parse error: {str(e)}"
continue
# Run tests in sandbox
test_result = self.runner.run(artifact)
artifact.test_result = test_result
logger.info(
f"Attempt {attempt}: {test_result.passed_count} passed, "
f"{test_result.failed_count} failed"
)
# Track best attempt (most tests passing)
if test_result.passed_count > best_pass_count:
best_pass_count = test_result.passed_count
best_artifact = artifact
# Success — all tests pass
if test_result.passed:
logger.info(f"All tests passed on attempt {attempt}")
return {
**state,
"code_artifact": artifact,
"sse_complete": True,
"sse_status": "passed",
"sse_attempts": attempt,
}
# Prepare context for retry
previous_code = self._format_code_for_retry(artifact)
previous_errors = test_result.error_output
# All attempts exhausted — escalate with best attempt
logger.warning(
f"SSE failed after {self.MAX_ATTEMPTS} attempts. "
f"Best result: {best_pass_count} tests passing. Escalating to TL."
)
return {
**state,
"code_artifact": best_artifact,
"sse_complete": True,
"sse_status": "needs_review",
"sse_attempts": self.MAX_ATTEMPTS,
}
def _build_first_attempt_prompt(self, state: TeamState) -> str:
from prompts.sse import SSEPrompts
spec = state["technical_spec"]
stories = state["user_stories"]
test_suite = state["test_suite"]
return SSEPrompts.FIRST_ATTEMPT.format(
technical_spec=self._format_spec(spec),
user_stories=self._format_stories(stories),
test_cases=self._format_test_cases(test_suite),
)
def _build_retry_prompt(
self,
state: TeamState,
previous_code: str,
test_output: str,
attempt: int,
) -> str:
from prompts.sse import SSEPrompts
return SSEPrompts.RETRY.format(
previous_code=previous_code,
test_output=test_output,
attempt_number=attempt,
)
def _format_spec(self, spec) -> str:
lines = [f"Project: {spec.project_name}"]
lines.append("\nTech Stack:")
for comp in spec.tech_stack:
lines.append(f" - {comp.name}: {comp.technology} ({comp.justification})")
lines.append("\nData Models:")
for model in spec.data_models:
lines.append(f" {model.entity_name}: {model.description}")
for field in model.fields:
nullable = " (nullable)" if field.nullable else ""
lines.append(f" - {field.field_name}: {field.field_type}{nullable}")
lines.append("\nAPI Contracts:")
for api in spec.api_contracts:
auth = " [auth required]" if api.requires_auth else ""
lines.append(f" {api.method} {api.path} — {api.summary}{auth}")
return "\n".join(lines)
def _format_stories(self, stories: list) -> str:
lines = []
for story in stories:
lines.append(f"[{story.story_id}] {story.title}")
lines.append(f" As a {story.as_a}, I want {story.i_want}")
lines.append(f" So that {story.so_that}")
for criterion in story.acceptance_criteria:
lines.append(f" AC: {criterion}")
lines.append("")
return "\n".join(lines)
def _format_test_cases(self, test_suite) -> str:
lines = []
for tc in test_suite.test_cases:
lines.append(f"[{tc.test_id}] {tc.title} ({tc.test_type.value})")
lines.append(f" Given: {'; '.join(tc.given)}")
lines.append(f" When: {'; '.join(tc.when)}")
lines.append(f" Then: {'; '.join(tc.then)}")
lines.append(f" Priority: {tc.priority.value}")
lines.append("")
return "\n".join(lines)
def _format_code_for_retry(self, artifact: CodeArtifact) -> str:
lines = []
for f in artifact.files:
lines.append(f"=== {f.file_path} ({f.file_type.value}) ===")
lines.append(f.content)
lines.append("")
return "\n".join(lines)
Key Design Decisions
Best-artifact tracking. When all three attempts fail, the SSE returns the attempt with the most passing tests, not the last attempt. Attempt 2 might pass 20 of 24 tests while attempt 3, trying to fix 4 failures, introduces 3 new ones and passes only 17.
sse_status field. The state uses a string status rather than a boolean. "passed" means the TL can proceed to code review. "needs_review" means the TL needs to diagnose why the SSE got stuck. This status governs conditional routing in the LangGraph workflow.
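For reference, here is the SSE-related slice of TeamState sketched as a TypedDict. The full definition lives in Part 4; the key names match exactly what run() writes:

from typing import Literal, Optional, TypedDict

from schemas.sse import CodeArtifact

class SSEStateSlice(TypedDict, total=False):
    code_artifact: Optional[CodeArtifact]  # passing artifact, or best attempt
    sse_complete: bool
    sse_status: Literal["passed", "needs_review"]
    sse_attempts: int  # 1-3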
6. Failure Analysis: What Goes Wrong and Why
I ran the SSE agent 50 times against the task manager spec from Part 6 and tracked failure modes. The data is clear:
Attempt 1 success rate: 35%. In roughly one run in three, the LLM produces code that passes all tests on the first try. This sounds low, but remember: 24 test cases across 8 user stories, including edge cases and error scenarios, is a genuinely difficult code generation task.
Attempt 2 success rate (after attempt 1 failure): 72%. When given the specific test failures, the LLM fixes them roughly three-quarters of the time. This is the core value of the retry loop — a single retry more than doubles the overall success rate.
Attempt 3 success rate (after attempt 2 failure): 45%. Diminishing returns. By attempt 3, the remaining failures tend to be conceptual misunderstandings rather than simple bugs. The LLM either cannot grasp what the test expects or the test and implementation have diverged in a way that requires rethinking the approach.
Overall pipeline success rate: 88%. After three attempts, roughly 88 of every 100 runs produce code that passes all tests. The remaining 12% escalate to the TL agent.
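Those numbers compose cleanly. A quick sanity check:

p1 = 0.35                            # pass on attempt 1
p2 = (1 - 0.35) * 0.72               # fail attempt 1, pass attempt 2
p3 = (1 - 0.35) * (1 - 0.72) * 0.45  # fail attempts 1-2, pass attempt 3
print(round(p1 + p2 + p3, 2))        # ~0.90; the observed 88% is within
                                     # rounding on a 50-run sample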
The most common failure categories:
| Failure Type | Frequency | Typical Cause |
|---|---|---|
| Import path mismatch | 28% | File path does not match import statement |
| Missing edge case handling | 22% | Happy path works, empty-input case missing |
| Return type mismatch | 18% | Function returns dict, test expects Pydantic model |
| Async/sync confusion | 12% | Test calls await func() but func is sync |
| Database state assumptions | 10% | Test assumes clean DB, impl assumes seeded data |
| Other | 10% | Typos, wrong status codes, missing error messages |
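The top category deserves a concrete picture. Here is a hypothetical instance of the import-path mismatch that the retry checklist targets:

# The artifact declares: file_path = "src/models/task.py"
# ...but the generated test imports:
from models.task import Task      # ModuleNotFoundError: no top-level 'models'

# The retry fix aligns the import with the declared file_path:
from src.models.task import Task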
7. The Cost of Retries
Each retry is more expensive because the prompt grows — previous code (~3,500 tokens) and error output (~800-1,200 tokens) are added to the context. At Claude Sonnet pricing ($3/M input, $15/M output):
- Attempt 1: ~4,300 input tokens, ~3,500 output tokens = ~$0.07
- Attempt 2: ~8,600 input tokens, ~3,500 output tokens = ~$0.08
- Attempt 3: ~9,000 input tokens, ~3,500 output tokens = ~$0.08
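Those per-attempt figures follow directly from the token counts at the quoted rates. A quick check:

def attempt_cost(input_tokens: int, output_tokens: int) -> float:
    # $3 per million input tokens, $15 per million output tokens
    return input_tokens * 3 / 1e6 + output_tokens * 15 / 1e6

for tokens_in in (4_300, 8_600, 9_000):
    print(f"${attempt_cost(tokens_in, 3_500):.2f}")
# $0.07, $0.08, $0.08 -> roughly $0.23 for a full three-attempt run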
A three-attempt run costs roughly $0.23; the two retries add under $0.20 on top of the first attempt for two more chances at getting the code right. Compared to a human developer’s time, the economics are not close.
The deeper cost concern is latency, not dollars. Each LLM call takes 15-30 seconds. Each test run takes 5-15 seconds. A full three-attempt run therefore takes 60-135 seconds of wall-clock time. This is one reason we invested in the parallel fan-out in Part 6 — the time saved by running QC and TA concurrently partially offsets the time the SSE spends in retry loops.
8. Real Example: Attempt 1 Fails, Attempt 2 Passes
Let me walk through an actual run. The spec is the task manager from Part 6. The SSE receives 8 user stories, 24 test cases, and a technical spec calling for FastAPI + PostgreSQL.
Attempt 1
The SSE generates 7 files:
src/models/task.py — Task, User, Comment Pydantic models
src/models/enums.py — TaskStatus, TaskPriority enums
src/api/tasks.py — FastAPI router with CRUD endpoints
src/api/dependencies.py — Auth dependency, DB session
src/db/repository.py — TaskRepository with in-memory store
tests/test_tasks.py — 24 pytest tests
requirements.txt — fastapi, pytest, pydantic, uvicorn
The SandboxedTestRunner writes these to a tempdir, installs dependencies, and runs pytest:
tests/test_tasks.py::test_create_task_happy_path PASSED
tests/test_tasks.py::test_create_task_max_title PASSED
tests/test_tasks.py::test_create_task_empty_title FAILED
tests/test_tasks.py::test_create_task_no_auth PASSED
tests/test_tasks.py::test_assign_task_happy_path PASSED
...
tests/test_tasks.py::test_concurrent_completion FAILED
tests/test_tasks.py::test_archive_completed_task PASSED
tests/test_tasks.py::test_archive_pending_task FAILED
========================= FAILURES =========================
___ test_create_task_empty_title ___
response = client.post("/api/v1/tasks", json={"title": ""})
> assert response.status_code == 422
E AssertionError: assert 201 == 422
___ test_concurrent_completion ___
# Simulate concurrent requests
> results = await asyncio.gather(
complete_task(task_id, user_a_token),
complete_task(task_id, user_b_token),
)
E TypeError: object function can't be used in 'await' expression
___ test_archive_pending_task ___
> assert response.status_code == 400
E AssertionError: assert 200 == 400
============= 3 failed, 21 passed in 2.34s =============
Three failures. 21 of 24 tests pass. The SSE logs this as attempt 1 with passed_count=21.
Analysis
Three failures, three root causes: (1) missing empty-title validation, (2) asyncio.gather on a sync function, (3) missing status check before archiving. All are specific, diagnosable bugs — not architectural problems.
Attempt 2
The retry prompt includes the 3 failure outputs. The SSE generates an updated CodeArtifact with targeted fixes: a title validator, threading.Thread instead of asyncio.gather, and a status check before archiving. The test runner executes again:
tests/test_tasks.py::test_create_task_empty_title PASSED
tests/test_tasks.py::test_concurrent_completion PASSED
tests/test_tasks.py::test_archive_pending_task PASSED
...
============= 24 passed in 2.51s =============
All 24 tests pass. The SSE returns with sse_status="passed" and sse_attempts=2. This is the ideal retry scenario: specific failures with clear fixes, no architectural rethinking required.
9. Escalation to the Tech Lead
When all three attempts fail, the SSE sets sse_status = "needs_review" and returns its best artifact. The TL agent (Part 8) reads state["sse_status"], state["sse_attempts"], and the test_result on the code artifact to understand what went wrong. On a clean submission, the TL does code quality review. On an escalation, the TL does failure analysis. We build this branching logic in Part 8.
10. Wiring the SSE into the LangGraph Workflow
The SSE node plugs into the existing workflow after the QC+TA merge:
from langgraph.graph import END, StateGraph
from state import TeamState
from agents.sse import SSEAgent
from runners.sandbox import SandboxedTestRunner
async def sse_node(state: TeamState) -> TeamState:
runner = SandboxedTestRunner()
agent = SSEAgent(llm=get_llm(), runner=runner)
return await agent.run(state)
def build_workflow() -> StateGraph:
workflow = StateGraph(TeamState)
# Previous nodes from Parts 5-6
workflow.add_node("ba_agent", ba_node)
workflow.add_node("qc_agent", qc_node)
workflow.add_node("ta_agent", ta_node)
workflow.add_node("parallel_merge", merge_qc_ta)
# SSE node (this article)
workflow.add_node("sse_agent", sse_node)
# Edges
workflow.set_entry_point("ba_agent")
workflow.add_edge("ba_agent", "qc_agent")
workflow.add_edge("ba_agent", "ta_agent")
workflow.add_edge("qc_agent", "parallel_merge")
workflow.add_edge("ta_agent", "parallel_merge")
workflow.add_edge("parallel_merge", "sse_agent")
# After SSE, route based on status
workflow.add_conditional_edges(
"sse_agent",
route_after_sse,
{
"tl_review": "tl_agent",
"tl_escalation": "tl_agent",
},
    )
    # Placeholder terminus until Part 8 wires the real TL flow
    workflow.add_edge("tl_agent", END)
    return workflow.compile()
def route_after_sse(state: TeamState) -> str:
"""Route to TL agent — review mode differs based on SSE status."""
if state.get("sse_status") == "passed":
return "tl_review"
return "tl_escalation"
The conditional edge after the SSE is the first routing decision in our pipeline. All previous edges were unconditional. After the SSE, the pipeline branches: a successful SSE goes to normal TL review, a failed SSE goes to TL escalation. Both labels currently point at the same TL node; in Part 8, the TL determines its review mode by reading sse_status from state.
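Running the compiled graph end to end looks like this. The initial-state key shown here is illustrative; use whatever entry key the PO/BA phase from Part 5 actually consumes:

import asyncio

async def main() -> None:
    app = build_workflow()
    final_state = await app.ainvoke({"project_brief": "Build a task manager"})
    print(final_state["sse_status"], final_state["sse_attempts"])

asyncio.run(main())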
Bringing It Together
After the SSE completes, TeamState now contains code_artifact (7 files, 24 passing tests), sse_status (“passed” or “needs_review”), and sse_attempts. Combined with the stories, test suite, and technical spec from earlier phases, the pipeline now has working, tested code. But working code is not the same as good code. The SSE optimizes for passing tests, not for maintainability. That is the Tech Lead’s job.
What We Built in Part 7
One agent, one runner, one loop:
- SSEAgent (Jordan) — generates code from the full TeamState (stories + test cases + tech spec), runs tests in a sandbox, retries up to 3 times with failure context, escalates to TL on exhaustion
- SandboxedTestRunner — writes files to a tempdir, installs deps, runs pytest as a subprocess with a 60-second timeout, parses pass/fail counts
- CodeArtifact schema — structured container for generated files with test results, attempt tracking, and story coverage
- Generate-test-retry loop — the core pattern that turns a 35% first-attempt success rate into an 88% pipeline success rate
- Conditional routing — the first branching edge in the LangGraph workflow, routing to TL review or TL escalation based on SSE status
The key insight of this article: mechanical verification changes everything. An agent that can run its own tests and feed failures back into the prompt is qualitatively different from an agent that generates output and hopes for the best. The retry loop is not a fallback — it is the primary mechanism that makes code generation reliable enough for a production pipeline.
What’s Next
In Part 8, we build the Tech Lead agent. The TL reviews the SSE’s code for quality, consistency, and adherence to the technical spec. On a clean submission, the TL checks naming conventions, code structure, and documentation. On an escalation, the TL diagnoses why the SSE got stuck and decides whether to provide fix guidance or flag the story for human intervention. The TL is the gate between “code that works” and “code that ships.”
See you in Part 8.
Series Navigation
- Part 1: Why Build an AI Software Team?
- Part 2: Mapping the Roles — 8 Agents, One Pipeline
- Part 3: Designing Your AI Team — Architecture and DDD
- Part 4: TeamState — The Shared Brain
- Part 5: PO and BA Agents — From Brief to User Stories
- Part 6: QC and TA Agents — Quality and Technical Design
- Part 7: The SSE Agent — Code Generation, Self-Testing, and Iteration (this article)
- Part 8: The TL Agent — Code Review and Quality Gate (coming soon)
- Parts 9-12: Coming soon