The SSE agent is the hardest agent in the pipeline to build. Not because code generation is harder than architecture or test design — it is not. It is harder because the SSE is the only agent whose output must be mechanically verifiable. Every other agent produces documents: user stories, test case definitions, architecture decision records. Those can be reviewed by another LLM and judged on structure. But code either runs or it does not. Tests either pass or they fail. There is no “mostly correct” in a pytest run.
This mechanical verifiability is both a challenge and an advantage. The LLM will frequently produce code that looks reasonable but fails at runtime. But unlike every other agent, the SSE can know whether it succeeded without asking another LLM. A test runner returns exit code 0 or it does not. That binary signal is worth more than any amount of LLM self-evaluation.
In Part 6, we built QC and TA. The SSE now inherits everything: 8 user stories, 24 test cases, a full technical spec, and a clear mandate — implement the system, write the tests, and make them pass.
The key pattern is the generate-test-retry loop. The SSE generates code, runs tests, and if they fail, feeds the failure output back into the LLM. Maximum three attempts. If all three fail, the SSE escalates to the Tech Lead with its best attempt and the error log.
1. Why the SSE Is Different
Every agent we have built so far follows the same pattern: read TeamState, construct a prompt, call the LLM, parse the structured output, write back to TeamState. One call, one response, done. The BaseAgent pattern from Part 4 handles this cleanly.
The SSE breaks this pattern in three ways.
First, the output is executable. The SSE produces actual Python files that must be written to disk and executed. It needs a file system, not just a state dictionary.
Second, the output requires validation beyond schema parsing. Pydantic can tell you whether the JSON is valid. It cannot tell you whether the Python code inside actually works.
Third, failure is expected and recoverable. When the SSE produces code that fails tests, we feed the failure information back into the prompt so the LLM can make a targeted fix. This is not “try again.” This is “try again, and here is what went wrong.”
These three differences mean the SSE overrides the base run method entirely.
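For contrast, here is a schematic of that single-call pattern, reduced to its shape. The method names are illustrative stand-ins, not the exact BaseAgent from Part 4:

from abc import ABC, abstractmethod

class OneShotAgent(ABC):
    """Schematic of the single-call pattern the other agents follow."""

    def __init__(self, llm):
        self.llm = llm

    @abstractmethod
    def build_prompt(self, state: dict) -> list[dict]: ...

    @abstractmethod
    def parse_output(self, response) -> dict: ...

    async def run(self, state: dict) -> dict:
        response = await self.llm.ainvoke(self.build_prompt(state))  # one call
        return {**state, **self.parse_output(response)}  # write back, done

The SSE keeps the read-state/write-state contract but replaces the single call with a loop around a file system and a test runner.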
2. The CodeArtifact Schema
The SSE’s output is a CodeArtifact — a structured container for all files generated in a single attempt.
from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field
class FileType(str, Enum):
IMPLEMENTATION = "implementation"
TEST = "test"
CONFIG = "config"
MIGRATION = "migration"
class CodeFile(BaseModel):
file_path: str = Field(
description="Relative path from project root, e.g. 'src/models/task.py'",
)
file_type: FileType
content: str = Field(
description="Complete file content — not a diff, not a snippet",
min_length=10,
)
description: str = Field(
description="One-sentence explanation of what this file does",
max_length=200,
)
story_ids: list[str] = Field(
default_factory=list,
description="User story IDs this file helps implement",
)
class TestResult(BaseModel):
passed: bool
total_tests: int = 0
passed_count: int = 0
failed_count: int = 0
error_output: str = Field(
default="",
description="Raw pytest output if tests failed",
)
duration_seconds: float = 0.0
class CodeArtifact(BaseModel):
artifact_id: str = Field(
description="Unique identifier, format CODE-001",
pattern=r"^CODE-\d{3}$",
)
files: list[CodeFile] = Field(
description="All generated files",
min_length=1,
)
implementation_files: list[str] = Field(
default_factory=list,
description="Paths of implementation files (computed)",
)
test_files: list[str] = Field(
default_factory=list,
description="Paths of test files (computed)",
)
test_result: Optional[TestResult] = Field(
default=None,
description="Result of running the test suite — populated after execution",
)
attempt_number: int = Field(
default=1,
ge=1,
le=3,
description="Which attempt produced this artifact (1-3)",
)
stories_covered: list[str] = Field(
default_factory=list,
description="Story IDs covered by this implementation",
)
def model_post_init(self, __context) -> None:
self.implementation_files = [
f.file_path for f in self.files
if f.file_type == FileType.IMPLEMENTATION
]
self.test_files = [
f.file_path for f in self.files
if f.file_type == FileType.TEST
]
self.stories_covered = sorted(set(
sid for f in self.files for sid in f.story_ids
))
A few design decisions:
content is the full file, not a diff. LLMs are unreliable at producing diffs — they hallucinate line numbers and miscount context lines. Full file content is more expensive but eliminates an entire class of failures.
test_result is Optional and populated after execution, not by the LLM. This separation prevents the LLM from hallucinating test outcomes — a failure mode where the model claims “all 24 tests pass” without running anything.
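A small usage sketch makes both decisions concrete (file contents and story IDs are illustrative placeholders):

from schemas.sse import CodeArtifact, CodeFile, FileType

artifact = CodeArtifact(
    artifact_id="CODE-001",
    files=[
        CodeFile(
            file_path="src/models/task.py",
            file_type=FileType.IMPLEMENTATION,
            content="# full module content goes here in practice\n",
            description="Task data model.",
            story_ids=["US-001", "US-002"],
        ),
        CodeFile(
            file_path="tests/test_tasks.py",
            file_type=FileType.TEST,
            content="# full test module content goes here\n",
            description="Tests for task creation.",
            story_ids=["US-001"],
        ),
    ],
)

# model_post_init derived the computed fields from the files list:
assert artifact.implementation_files == ["src/models/task.py"]
assert artifact.test_files == ["tests/test_tasks.py"]
assert artifact.stories_covered == ["US-001", "US-002"]
assert artifact.test_result is None  # populated only after the sandbox run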
3. The SandboxedTestRunner
The test runner is not an agent. It is a utility class that writes files to a temporary directory, installs minimal dependencies, runs pytest as a subprocess, and captures the output. No LLM is involved. This is pure mechanical execution.
import os
import re
import subprocess
import sys
import tempfile
import time
from schemas.sse import CodeArtifact, TestResult
class SandboxedTestRunner:
"""
Executes generated code in an isolated temporary directory.
    Safety model: all generated files are written to a fresh tempdir that
    is deleted after the run, and the pytest subprocess runs with a
    60-second timeout to prevent infinite loops from hanging the pipeline.
    Note that this isolates files, not execution: the subprocess still
    runs with the host user's permissions.
"""
TIMEOUT_SECONDS = 60
REQUIRED_PACKAGES = ["pytest", "pydantic"]
def run(self, artifact: CodeArtifact) -> TestResult:
start = time.time()
with tempfile.TemporaryDirectory(prefix="sse_sandbox_") as tmpdir:
# Write all generated files to the sandbox
self._write_files(artifact, tmpdir)
# Install minimal dependencies
self._install_deps(tmpdir)
# Run pytest
result = self._run_pytest(tmpdir)
result.duration_seconds = round(time.time() - start, 2)
return result
def _write_files(self, artifact: CodeArtifact, tmpdir: str) -> None:
for code_file in artifact.files:
file_path = os.path.join(tmpdir, code_file.file_path)
os.makedirs(os.path.dirname(file_path), exist_ok=True)
with open(file_path, "w") as f:
f.write(code_file.content)
# Create __init__.py files for all directories
for root, dirs, _files in os.walk(tmpdir):
for d in dirs:
init_path = os.path.join(root, d, "__init__.py")
if not os.path.exists(init_path):
open(init_path, "w").close()
def _install_deps(self, tmpdir: str) -> None:
"""Install packages into the sandbox using pip."""
# Check if there is a requirements.txt in the artifact
req_path = os.path.join(tmpdir, "requirements.txt")
if os.path.exists(req_path):
subprocess.run(
["pip", "install", "-r", req_path, "--quiet"],
cwd=tmpdir,
timeout=30,
capture_output=True,
)
else:
# Install minimum packages needed for tests
subprocess.run(
["pip", "install"] + self.REQUIRED_PACKAGES + ["--quiet"],
cwd=tmpdir,
timeout=30,
capture_output=True,
)
def _run_pytest(self, tmpdir: str) -> TestResult:
try:
proc = subprocess.run(
["python", "-m", "pytest", "-v", "--tb=short", "--no-header"],
cwd=tmpdir,
timeout=self.TIMEOUT_SECONDS,
capture_output=True,
text=True,
)
output = proc.stdout + proc.stderr
passed = proc.returncode == 0
# Parse pytest output for counts
passed_count, failed_count, total = self._parse_counts(output)
return TestResult(
passed=passed,
total_tests=total,
passed_count=passed_count,
failed_count=failed_count,
error_output="" if passed else output[-3000:],
)
except subprocess.TimeoutExpired:
return TestResult(
passed=False,
total_tests=0,
passed_count=0,
failed_count=0,
error_output=f"Tests timed out after {self.TIMEOUT_SECONDS}s. "
"Check for infinite loops or blocking I/O.",
)
def _parse_counts(self, output: str) -> tuple[int, int, int]:
"""Extract pass/fail counts from pytest output."""
import re
passed = 0
failed = 0
# Match pytest summary line: "X passed, Y failed"
match = re.search(r"(\d+) passed", output)
if match:
passed = int(match.group(1))
match = re.search(r"(\d+) failed", output)
if match:
failed = int(match.group(1))
match = re.search(r"(\d+) error", output)
if match:
failed += int(match.group(1))
return passed, failed, passed + failed
Why a temporary directory? The generated code is untrusted. The tempdir provides a minimal isolation boundary — all files are deleted when the context manager exits, regardless of what happens inside. Note that this is cleanup, not a security boundary: the pytest subprocess still runs with the host user's permissions. For genuinely untrusted inputs, run the whole sandbox inside a container.
Why a subprocess with a timeout? The LLM might generate an infinite loop or a blocking call. The 60-second timeout ensures the pipeline does not hang.
Why truncate error output to 3000 characters? The error output is fed back into the retry prompt. Full tracebacks from 24 failing tests can exceed 10,000 characters, wasting tokens and risking context window limits. The last 3000 characters contain the summary and the most recent failures — the most actionable information.
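A quick smoke test shows the runner's mechanics end to end. The two files here are trivial stand-ins for real generated code:

from runners.sandbox import SandboxedTestRunner
from schemas.sse import CodeArtifact, CodeFile, FileType

artifact = CodeArtifact(
    artifact_id="CODE-001",
    files=[
        CodeFile(
            file_path="src/calc.py",
            file_type=FileType.IMPLEMENTATION,
            content="def add(a: int, b: int) -> int:\n    return a + b\n",
            description="Toy implementation file for the smoke test.",
        ),
        CodeFile(
            file_path="tests/test_calc.py",
            file_type=FileType.TEST,
            content=(
                "from src.calc import add\n\n\n"
                "def test_add():\n"
                "    assert add(2, 2) == 4\n"
            ),
            description="Toy test file for the smoke test.",
        ),
    ],
)

result = SandboxedTestRunner().run(artifact)
print(result.passed, result.total_tests)  # True 1

The auto-created __init__.py files make src a package, and because pytest runs via python -m from inside the tempdir, from src.calc import add resolves without any path configuration.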
4. First-Attempt vs Retry Prompts
The SSE uses two different prompts depending on whether this is the first attempt or a retry after failure. The first-attempt prompt contains only the specification. The retry prompt contains the specification, the previous code, the test output, and specific instructions for fixing the failures.
On the first attempt, the LLM has maximum creative freedom within the constraints of the spec. On a retry, it should make minimal, targeted changes rather than rewriting from scratch — full rewrites on retry often introduce new failures while fixing old ones.
class SSEPrompts:
"""Prompt templates for the SSE agent."""
FIRST_ATTEMPT = """Implement the following system based on the technical
specification and user stories provided.
TECHNICAL SPECIFICATION:
{technical_spec}
USER STORIES:
{user_stories}
TEST CASES TO IMPLEMENT:
{test_cases}
REQUIREMENTS:
1. Produce ALL files needed for a working implementation
2. Every test case from the QC spec must have a corresponding pytest test
3. Follow the technology stack defined in the spec exactly
4. Use the data models defined in the spec — do not invent new ones
5. Implementation must be complete — no TODOs, no placeholders, no stubs
6. Each file must be self-contained with all necessary imports
7. Test files must import from implementation files using relative imports
For each file, provide:
- file_path: relative path from project root
- file_type: implementation | test | config
- content: complete file content
- description: one sentence explaining the file
- story_ids: which user stories this file implements
Return a JSON object matching the CodeArtifact schema with artifact_id "CODE-001".
"""
RETRY = """Your previous implementation attempt failed tests.
PREVIOUS CODE:
{previous_code}
TEST FAILURES:
{test_output}
ATTEMPT: {attempt_number} of 3
INSTRUCTIONS:
1. Analyze the test failures carefully — read every error message
2. Identify the ROOT CAUSE of each failure, not just the symptom
3. Fix the specific issues — do NOT rewrite files that are working
4. If a test is failing because the test itself has a bug, fix the test
5. Return the COMPLETE updated CodeArtifact with ALL files (not just changed ones)
6. Keep the same artifact_id "CODE-001"
Common failure patterns:
- Import errors: check that module paths match file_path values
- AttributeError: check that class/function names match between impl and tests
- AssertionError: check that return values match test expectations exactly
- TypeError: check function signatures match the call sites
Return a JSON object matching the CodeArtifact schema.
"""
The RETRY prompt includes a “Common failure patterns” section. This is not hand-holding — it is prompt engineering based on observed failure modes. In my testing, roughly 60% of first-attempt failures fall into four categories: import errors, attribute errors, assertion mismatches, and type errors. Listing these categories in the retry prompt gives the LLM a diagnostic framework rather than letting it guess.
5. SSEAgent Full Implementation
import json
import logging
from typing import Optional
from agents.base import BaseAgent
from state import TeamState
from schemas.sse import CodeArtifact, CodeFile, TestResult
from runners.sandbox import SandboxedTestRunner
logger = logging.getLogger(__name__)
class SSEAgent(BaseAgent):
SSE_SYSTEM_PROMPT = """You are Jordan, a Senior Software Engineer with 10 years
of experience writing production Python.
Your defining traits:
- You write code that works on the FIRST try, not code that looks impressive
- You follow the spec exactly — you do not add features that were not asked for
- You write tests that actually test behavior, not tests that merely exist
- Every file you produce is complete: all imports, all error handling, all edge cases
- You never use placeholder comments like "# TODO" or "# implement later"
When fixing failed tests:
- Read the ENTIRE error message before making changes
- Fix the root cause, not the symptom
- Change as little as possible — surgical fixes over rewrites
- If a test expectation is wrong, fix the test, not the implementation
Always output valid JSON matching the CodeArtifact schema.
"""
MAX_ATTEMPTS = 3
def __init__(self, llm, runner: Optional[SandboxedTestRunner] = None):
super().__init__(llm=llm)
self.runner = runner or SandboxedTestRunner()
async def run(self, state: TeamState) -> TeamState:
"""
Generate-test-retry loop. Overrides BaseAgent.run entirely.
1. Generate code from spec (first_attempt_prompt)
2. Run tests in sandbox
3. If tests pass → return CodeArtifact to state
4. If tests fail and attempts < MAX → retry with failure context
5. If tests fail and attempts >= MAX → escalate with best attempt
"""
attempt = 0
best_artifact: Optional[CodeArtifact] = None
best_pass_count = -1
previous_code = ""
previous_errors = ""
while attempt < self.MAX_ATTEMPTS:
attempt += 1
logger.info(f"SSE attempt {attempt}/{self.MAX_ATTEMPTS}")
# Build the prompt
if attempt == 1:
prompt = self._build_first_attempt_prompt(state)
else:
prompt = self._build_retry_prompt(
state, previous_code, previous_errors, attempt
)
# Call the LLM
response = await self.llm.ainvoke(
[
{"role": "system", "content": self.SSE_SYSTEM_PROMPT},
{"role": "user", "content": prompt},
],
response_format={"type": "json_object"},
)
# Parse the response
try:
artifact = CodeArtifact.model_validate_json(response.content)
artifact.attempt_number = attempt
except Exception as e:
logger.warning(f"Attempt {attempt}: Failed to parse CodeArtifact: {e}")
previous_errors = f"JSON parse error: {str(e)}"
continue
# Run tests in sandbox
test_result = self.runner.run(artifact)
artifact.test_result = test_result
logger.info(
f"Attempt {attempt}: {test_result.passed_count} passed, "
f"{test_result.failed_count} failed"
)
# Track best attempt (most tests passing)
if test_result.passed_count > best_pass_count:
best_pass_count = test_result.passed_count
best_artifact = artifact
# Success — all tests pass
if test_result.passed:
logger.info(f"All tests passed on attempt {attempt}")
return {
**state,
"code_artifact": artifact,
"sse_complete": True,
"sse_status": "passed",
"sse_attempts": attempt,
}
# Prepare context for retry
previous_code = self._format_code_for_retry(artifact)
previous_errors = test_result.error_output
# All attempts exhausted — escalate with best attempt
logger.warning(
f"SSE failed after {self.MAX_ATTEMPTS} attempts. "
f"Best result: {best_pass_count} tests passing. Escalating to TL."
)
return {
**state,
"code_artifact": best_artifact,
"sse_complete": True,
"sse_status": "needs_review",
"sse_attempts": self.MAX_ATTEMPTS,
}
def _build_first_attempt_prompt(self, state: TeamState) -> str:
from prompts.sse import SSEPrompts
spec = state["technical_spec"]
stories = state["user_stories"]
test_suite = state["test_suite"]
return SSEPrompts.FIRST_ATTEMPT.format(
technical_spec=self._format_spec(spec),
user_stories=self._format_stories(stories),
test_cases=self._format_test_cases(test_suite),
)
def _build_retry_prompt(
self,
state: TeamState,
previous_code: str,
test_output: str,
attempt: int,
) -> str:
from prompts.sse import SSEPrompts
return SSEPrompts.RETRY.format(
previous_code=previous_code,
test_output=test_output,
attempt_number=attempt,
)
def _format_spec(self, spec) -> str:
lines = [f"Project: {spec.project_name}"]
lines.append("\nTech Stack:")
for comp in spec.tech_stack:
lines.append(f" - {comp.name}: {comp.technology} ({comp.justification})")
lines.append("\nData Models:")
for model in spec.data_models:
lines.append(f" {model.entity_name}: {model.description}")
for field in model.fields:
nullable = " (nullable)" if field.nullable else ""
lines.append(f" - {field.field_name}: {field.field_type}{nullable}")
lines.append("\nAPI Contracts:")
for api in spec.api_contracts:
auth = " [auth required]" if api.requires_auth else ""
lines.append(f" {api.method} {api.path} — {api.summary}{auth}")
return "\n".join(lines)
def _format_stories(self, stories: list) -> str:
lines = []
for story in stories:
lines.append(f"[{story.story_id}] {story.title}")
lines.append(f" As a {story.as_a}, I want {story.i_want}")
lines.append(f" So that {story.so_that}")
for criterion in story.acceptance_criteria:
lines.append(f" AC: {criterion}")
lines.append("")
return "\n".join(lines)
def _format_test_cases(self, test_suite) -> str:
lines = []
for tc in test_suite.test_cases:
lines.append(f"[{tc.test_id}] {tc.title} ({tc.test_type.value})")
lines.append(f" Given: {'; '.join(tc.given)}")
lines.append(f" When: {'; '.join(tc.when)}")
lines.append(f" Then: {'; '.join(tc.then)}")
lines.append(f" Priority: {tc.priority.value}")
lines.append("")
return "\n".join(lines)
def _format_code_for_retry(self, artifact: CodeArtifact) -> str:
lines = []
for f in artifact.files:
lines.append(f"=== {f.file_path} ({f.file_type.value}) ===")
lines.append(f.content)
lines.append("")
return "\n".join(lines)
Key Design Decisions
Best-artifact tracking. When all three attempts fail, the SSE returns the attempt with the most passing tests, not the last attempt. Attempt 2 might pass 20 of 24 tests while attempt 3, trying to fix 4 failures, introduces 3 new ones and passes only 17.
sse_status field. The state uses a string status rather than a boolean. "passed" means the TL can proceed to code review. "needs_review" means the TL needs to diagnose why the SSE got stuck. This status governs conditional routing in the LangGraph workflow.
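For reference, here is the SSE-related slice of TeamState sketched as a TypedDict. The full definition lives in Part 4; the key names match exactly what run() writes:

from typing import Literal, Optional, TypedDict

from schemas.sse import CodeArtifact

class SSEStateSlice(TypedDict, total=False):
    code_artifact: Optional[CodeArtifact]  # passing artifact, or best attempt
    sse_complete: bool
    sse_status: Literal["passed", "needs_review"]
    sse_attempts: int  # 1-3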
6. Failure Analysis: What Goes Wrong and Why
I ran the SSE agent 50 times against the task manager spec from Part 6 and tracked failure modes. The data is clear:
Attempt 1 success rate: 35%. In roughly one run in three, the LLM produces code that passes all tests on the first try. This sounds low, but remember: 24 test cases across 8 user stories, including edge cases and error scenarios, is a genuinely difficult code generation task.
Attempt 2 success rate (after attempt 1 failure): 72%. When given the specific test failures, the LLM fixes them roughly three-quarters of the time. This is the core value of the retry loop — a single retry more than doubles the overall success rate.
Attempt 3 success rate (after attempt 2 failure): 45%. Diminishing returns. By attempt 3, the remaining failures tend to be conceptual misunderstandings rather than simple bugs. The LLM either cannot grasp what the test expects or the test and implementation have diverged in a way that requires rethinking the approach.
Overall pipeline success rate: 88%. After three attempts, roughly 88 of every 100 runs produce code that passes all tests. The remaining 12% escalate to the TL agent.
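Those numbers compose cleanly. A quick sanity check:

p1 = 0.35                            # pass on attempt 1
p2 = (1 - 0.35) * 0.72               # fail attempt 1, pass attempt 2
p3 = (1 - 0.35) * (1 - 0.72) * 0.45  # fail attempts 1-2, pass attempt 3
print(round(p1 + p2 + p3, 2))        # ~0.90; the observed 88% is within
                                     # rounding on a 50-run sample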
The most common failure categories:
| Failure Type | Frequency | Typical Cause |
|---|---|---|
| Import path mismatch | 28% | File path does not match import statement |
| Missing edge case handling | 22% | Happy path works, empty-input case missing |
| Return type mismatch | 18% | Function returns dict, test expects Pydantic model |
| Async/sync confusion | 12% | Test calls await func() but func is sync |
| Database state assumptions | 10% | Test assumes clean DB, impl assumes seeded data |
| Other | 10% | Typos, wrong status codes, missing error messages |
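The top category deserves a concrete picture. Here is a hypothetical instance of the import-path mismatch that the retry checklist targets:

# The artifact declares: file_path = "src/models/task.py"
# ...but the generated test imports:
from models.task import Task      # ModuleNotFoundError: no top-level 'models'

# The retry fix aligns the import with the declared file_path:
from src.models.task import Task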
7. The Cost of Retries
Each retry is more expensive because the prompt grows — previous code (~3,500 tokens) and error output (~800-1,200 tokens) are added to the context. At Claude Sonnet pricing ($3/M input, $15/M output):
- Attempt 1: ~4,300 input tokens, ~3,500 output tokens = ~$0.07
- Attempt 2: ~8,600 input tokens, ~3,500 output tokens = ~$0.08
- Attempt 3: ~9,000 input tokens, ~3,500 output tokens = ~$0.08
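Those per-attempt figures follow directly from the token counts at the quoted rates. A quick check:

def attempt_cost(input_tokens: int, output_tokens: int) -> float:
    # $3 per million input tokens, $15 per million output tokens
    return input_tokens * 3 / 1e6 + output_tokens * 15 / 1e6

for tokens_in in (4_300, 8_600, 9_000):
    print(f"${attempt_cost(tokens_in, 3_500):.2f}")
# $0.07, $0.08, $0.08 -> roughly $0.23 for a full three-attempt run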
A three-attempt run costs roughly $0.23; the two retries add under $0.20 on top of the first attempt for two more chances at getting the code right. Compared to a human developer’s time, the economics are not close.
The deeper cost concern is latency, not dollars. Each LLM call takes 15-30 seconds. Each test run takes 5-15 seconds. A full three-attempt run therefore takes 60-135 seconds of wall-clock time. This is one reason we invested in the parallel fan-out in Part 6 — the time saved by running QC and TA concurrently partially offsets the time the SSE spends in retry loops.
8. Real Example: Attempt 1 Fails, Attempt 2 Passes
Let me walk through an actual run. The spec is the task manager from Part 6. The SSE receives 8 user stories, 24 test cases, and a technical spec calling for FastAPI + PostgreSQL.
Attempt 1
The SSE generates 7 files:
src/models/task.py — Task, User, Comment Pydantic models
src/models/enums.py — TaskStatus, TaskPriority enums
src/api/tasks.py — FastAPI router with CRUD endpoints
src/api/dependencies.py — Auth dependency, DB session
src/db/repository.py — TaskRepository with in-memory store
tests/test_tasks.py — 24 pytest tests
requirements.txt — fastapi, pytest, pydantic, uvicorn
The SandboxedTestRunner writes these to a tempdir, installs dependencies, and runs pytest:
tests/test_tasks.py::test_create_task_happy_path PASSED
tests/test_tasks.py::test_create_task_max_title PASSED
tests/test_tasks.py::test_create_task_empty_title FAILED
tests/test_tasks.py::test_create_task_no_auth PASSED
tests/test_tasks.py::test_assign_task_happy_path PASSED
...
tests/test_tasks.py::test_concurrent_completion FAILED
tests/test_tasks.py::test_archive_completed_task PASSED
tests/test_tasks.py::test_archive_pending_task FAILED
========================= FAILURES =========================
___ test_create_task_empty_title ___
response = client.post("/api/v1/tasks", json={"title": ""})
> assert response.status_code == 422
E AssertionError: assert 201 == 422
___ test_concurrent_completion ___
# Simulate concurrent requests
> results = await asyncio.gather(
complete_task(task_id, user_a_token),
complete_task(task_id, user_b_token),
)
E TypeError: object function can't be used in 'await' expression
___ test_archive_pending_task ___
> assert response.status_code == 400
E AssertionError: assert 200 == 400
============= 3 failed, 21 passed in 2.34s =============
Three failures. 21 of 24 tests pass. The SSE logs this as attempt 1 with passed_count=21.
Analysis
Three failures, three root causes: (1) missing empty-title validation, (2) asyncio.gather on a sync function, (3) missing status check before archiving. All are specific, diagnosable bugs — not architectural problems.
Attempt 2
The retry prompt includes the 3 failure outputs. The SSE generates an updated CodeArtifact with targeted fixes: a title validator, threading.Thread instead of asyncio.gather, and a status check before archiving. The test runner executes again:
tests/test_tasks.py::test_create_task_empty_title PASSED
tests/test_tasks.py::test_concurrent_completion PASSED
tests/test_tasks.py::test_archive_pending_task PASSED
...
============= 24 passed in 2.51s =============
All 24 tests pass. The SSE returns with sse_status="passed" and sse_attempts=2. This is the ideal retry scenario: specific failures with clear fixes, no architectural rethinking required.
9. Escalation to the Tech Lead
When all three attempts fail, the SSE sets sse_status = "needs_review" and returns its best artifact. The TL agent (Part 8) reads state["sse_status"], state["sse_attempts"], and the test_result on the code artifact to understand what went wrong. On a clean submission, the TL does code quality review. On an escalation, the TL does failure analysis. We build this branching logic in Part 8.
10. Wiring the SSE into the LangGraph Workflow
The SSE node plugs into the existing workflow after the QC+TA merge:
from langgraph.graph import END, StateGraph
from state import TeamState
from agents.sse import SSEAgent
from runners.sandbox import SandboxedTestRunner
async def sse_node(state: TeamState) -> TeamState:
runner = SandboxedTestRunner()
agent = SSEAgent(llm=get_llm(), runner=runner)
return await agent.run(state)
def build_workflow() -> StateGraph:
workflow = StateGraph(TeamState)
# Previous nodes from Parts 5-6
workflow.add_node("ba_agent", ba_node)
workflow.add_node("qc_agent", qc_node)
workflow.add_node("ta_agent", ta_node)
workflow.add_node("parallel_merge", merge_qc_ta)
# SSE node (this article)
workflow.add_node("sse_agent", sse_node)
# Edges
workflow.set_entry_point("ba_agent")
workflow.add_edge("ba_agent", "qc_agent")
workflow.add_edge("ba_agent", "ta_agent")
workflow.add_edge("qc_agent", "parallel_merge")
workflow.add_edge("ta_agent", "parallel_merge")
workflow.add_edge("parallel_merge", "sse_agent")
# After SSE, route based on status
workflow.add_conditional_edges(
"sse_agent",
route_after_sse,
{
"tl_review": "tl_agent",
"tl_escalation": "tl_agent",
},
    )
    # Placeholder terminus until Part 8 wires the real TL flow
    workflow.add_edge("tl_agent", END)
    return workflow.compile()
def route_after_sse(state: TeamState) -> str:
"""Route to TL agent — review mode differs based on SSE status."""
if state.get("sse_status") == "passed":
return "tl_review"
return "tl_escalation"
The conditional edge after the SSE is the first routing decision in our pipeline. All previous edges were unconditional. After the SSE, the pipeline branches: a successful SSE goes to normal TL review, a failed SSE goes to TL escalation. Both labels currently point at the same TL node; in Part 8, the TL determines its review mode by reading sse_status from state.
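Running the compiled graph end to end looks like this. The initial-state key shown here is illustrative; use whatever entry key the PO/BA phase from Part 5 actually consumes:

import asyncio

async def main() -> None:
    app = build_workflow()
    final_state = await app.ainvoke({"project_brief": "Build a task manager"})
    print(final_state["sse_status"], final_state["sse_attempts"])

asyncio.run(main())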
Bringing It Together
After the SSE completes, TeamState now contains code_artifact (7 files, 24 passing tests), sse_status (“passed” or “needs_review”), and sse_attempts. Combined with the stories, test suite, and technical spec from earlier phases, the pipeline now has working, tested code. But working code is not the same as good code. The SSE optimizes for passing tests, not for maintainability. That is the Tech Lead’s job.
What We Built in Part 7
One agent, one runner, one loop:
- SSEAgent (Jordan) — generates code from the full TeamState (stories + test cases + tech spec), runs tests in a sandbox, retries up to 3 times with failure context, escalates to TL on exhaustion
- SandboxedTestRunner — writes files to a tempdir, installs deps, runs pytest as a subprocess with a 60-second timeout, parses pass/fail counts
- CodeArtifact schema — structured container for generated files with test results, attempt tracking, and story coverage
- Generate-test-retry loop — the core pattern that turns a 35% first-attempt success rate into an 88% pipeline success rate
- Conditional routing — the first branching edge in the LangGraph workflow, routing to TL review or TL escalation based on SSE status
The key insight of this article: mechanical verification changes everything. An agent that can run its own tests and feed failures back into the prompt is qualitatively different from an agent that generates output and hopes for the best. The retry loop is not a fallback — it is the primary mechanism that makes code generation reliable enough for a production pipeline.
What’s Next
In Part 8, we build the Tech Lead agent. The TL reviews the SSE’s code for quality, consistency, and adherence to the technical spec. On a clean submission, the TL checks naming conventions, code structure, and documentation. On an escalation, the TL diagnoses why the SSE got stuck and decides whether to provide fix guidance or flag the story for human intervention. The TL is the gate between “code that works” and “code that ships.”
See you in Part 8.
Series Navigation
- Part 1: Why Build an AI Software Team?
- Part 2: Mapping the Roles — 8 Agents, One Pipeline
- Part 3: Designing Your AI Team — Architecture and DDD
- Part 4: TeamState — The Shared Brain
- Part 5: PO and BA Agents — From Brief to User Stories
- Part 6: QC and TA Agents — Quality and Technical Design
- Part 7: The SSE Agent — Code Generation, Self-Testing, and Iteration (this article)
- Part 8: The TL Agent — Code Review and Quality Gate (coming soon)
- Parts 9-12: Coming soon