SSE Agent: Code Generation, Self-Testing và Iteration (Phần 7/12)

Agent SSE là agent khó xây dựng nhất trong đường ống của chúng tôi. Không phải vì code generation khó hơn architecture hay test design — nó không khó hơn. Nó khó hơn vì SSE là agent duy nhất mà output của nó phải có thể xác minh cơ học. Mọi agent khác tạo ra tài liệu: user stories, định nghĩa test case, architectural decision records, implementation plans. Những artifacts đó có thể được review bởi một LLM khác và đánh giá dựa trên structure và coherence. Nhưng code thì chạy được hoặc chạy không. Tests thì pass hoặc fail. Không có khái niệm “mostly correct” trong một pytest run.

Khả năng xác minh cơ học này là cả thách thức lớn nhất và lợi thế lớn nhất của SSE. Thách thức, vì LLM sẽ thường xuyên tạo ra code trông hợp lý nhưng fail lúc runtime — một import bị thiếu, một lỗi off-by-one, một function signature không match với test expectation. Lợi thế, vì không giống những agent khác, SSE có thể biết liệu nó có thành công không mà không cần hỏi LLM khác. Một test runner trả về exit code 0 hoặc không. Tín hiệu nhị phân đó có giá trị hơn bất kỳ lượng tự đánh giá nào từ LLM.

Trong Phần 6, tôi xây dựng các agent QC và TA. Agent QC tạo ra 24 test case với structure Given/When/Then. Agent TA tạo ra một TechnicalSpec với technology choices, data models, API contracts, và ADRs. Cả hai đều ghi vào TeamState. SSE bây giờ kế thừa mọi thứ: 8 user stories, 24 test cases, một full technical specification, và một mandate rõ ràng — implement the system, write the tests, và make them pass.

Mẫu thiết kế chính của bài viết này là generate-test-retry loop. SSE tạo ra code, chạy tests, và nếu chúng fail, feed the failure output trở lại vào LLM để attempt thứ hai. Tối đa ba lần attempt. Nếu cả ba đều fail, SSE escalates tới agent Tech Lead với attempt tốt nhất của nó và the error log.

SSE retry loop flowchart: generate code, run tests, pass or retry up to 3 times, escalate to TL on failure

1. Tại sao SSE khác biệt

Mọi agent mà chúng tôi xây dựng cho đến nay đều tuân theo cùng một mẫu: đọc TeamState, construct a prompt, call the LLM, parse the structured output, write back to TeamState. Một call, một response, xong. BaseAgent pattern từ Phần 4 xử lý cái này một cách sạch sẽ.

SSE phá vỡ mẫu này theo ba cách.

Thứ nhất, output là executable. SSE không tạo ra JSON mô tả code. Nó tạo ra các file Python thực tế — implementation files và test files — phải được ghi vào disk và executed. Điều này có nghĩa SSE cần một file system, không chỉ một state dictionary.

Thứ hai, output yêu cầu validation vượt quá schema parsing. Một Pydantic model_validate_json call có thể cho tôi biết liệu LLM có trả về valid JSON matching schema không. Nhưng nó không thể cho tôi biết liệu Python code bên trong JSON đó có thực sự hoạt động không. Để làm điều đó, tôi cần một test runner.

Thứ ba, failure được kỳ vọng và có thể recover. Khi QC agent trả về invalid JSON, tôi retry the LLM call — đó là một simple schema retry mà base agent có thể xử lý. Khi SSE tạo ra code mà fail tests, retry khác nhau về chất lượng: tôi cần feed the failure information (tests nào failed, error messages là gì, stack traces nói gì) trở lại vào prompt để LLM có thể make a targeted fix. Đây không phải “try again.” Đây là “try again, và đây là what went wrong.”

Ba điểm khác này có nghĩa là SSE overrides the base run method entirely thay vì chỉ implementing _prepare_prompt.

2. Schema CodeArtifact

Trước khi tôi xây dựng agent, tôi cần định nghĩa cái mà nó tạo ra. Output của SSE là một CodeArtifact — một structured container cho tất cả các files mà SSE tạo ra trong một single attempt.

from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field


class FileType(str, Enum):
    IMPLEMENTATION = "implementation"
    TEST = "test"
    CONFIG = "config"
    MIGRATION = "migration"


class CodeFile(BaseModel):
    file_path: str = Field(
        description="Relative path from project root, e.g. 'src/models/task.py'",
    )
    file_type: FileType
    content: str = Field(
        description="Complete file content — not a diff, not a snippet",
        min_length=10,
    )
    description: str = Field(
        description="One-sentence explanation of what this file does",
        max_length=200,
    )
    story_ids: list[str] = Field(
        default_factory=list,
        description="User story IDs this file helps implement",
    )


class TestResult(BaseModel):
    passed: bool
    total_tests: int = 0
    passed_count: int = 0
    failed_count: int = 0
    error_output: str = Field(
        default="",
        description="Raw pytest output if tests failed",
    )
    duration_seconds: float = 0.0


class CodeArtifact(BaseModel):
    artifact_id: str = Field(
        description="Unique identifier, format CODE-001",
        pattern=r"^CODE-\d{3}$",
    )
    files: list[CodeFile] = Field(
        description="All generated files",
        min_length=1,
    )
    implementation_files: list[str] = Field(
        default_factory=list,
        description="Paths of implementation files (computed)",
    )
    test_files: list[str] = Field(
        default_factory=list,
        description="Paths of test files (computed)",
    )
    test_result: Optional[TestResult] = Field(
        default=None,
        description="Result of running the test suite — populated after execution",
    )
    attempt_number: int = Field(
        default=1,
        ge=1,
        le=3,
        description="Which attempt produced this artifact (1-3)",
    )
    stories_covered: list[str] = Field(
        default_factory=list,
        description="Story IDs covered by this implementation",
    )

    def model_post_init(self, __context) -> None:
        self.implementation_files = [
            f.file_path for f in self.files
            if f.file_type == FileType.IMPLEMENTATION
        ]
        self.test_files = [
            f.file_path for f in self.files
            if f.file_type == FileType.TEST
        ]
        self.stories_covered = sorted(set(
            sid for f in self.files for sid in f.story_ids
        ))

Một vài design decisions:

content là full file, không phải diff. LLMs không đáng tin cậy trong việc tạo diffs. Chúng hallucinate line numbers, miscount context lines, và tạo patches không apply cleanly. Yêu cầu full file content đắt hơn về tokens nhưng loại bỏ toàn bộ một class of failures.

min_length=10 trên content prevents the LLM from producing empty hoặc stub files. Một file với pass là nội dung duy nhất vẫn sẽ là 4 characters — quá ngắn để có ý nghĩa.

test_result là Optional và populated after execution, không phải bởi LLM. LLM tạo ra code; test runner fills in the results. Sự tách biệt này prevents the LLM from hallucinating test outcomes — một failure mode tôi đã thấy nơi model nói “all 24 tests pass” khi nó chưa chạy bất cứ thứ gì.

3. SandboxedTestRunner

Test runner không phải là agent. Nó là một utility class ghi files vào một temporary directory, cài đặt minimal dependencies, chạy pytest như một subprocess, và captures the output. Không có LLM involved. Đây là pure mechanical execution.

import os
import subprocess
import tempfile
import time
from schemas.sse import CodeArtifact, TestResult


class SandboxedTestRunner:
    """
    Executes generated code in an isolated temporary directory.

    Safety model: all files are written to a fresh tempdir that is
    deleted after the run. No generated code touches the host filesystem.
    The pytest subprocess runs with a 60-second timeout to prevent
    infinite loops from hanging the pipeline.
    """

    TIMEOUT_SECONDS = 60
    REQUIRED_PACKAGES = ["pytest", "pydantic"]

    def run(self, artifact: CodeArtifact) -> TestResult:
        start = time.time()
        with tempfile.TemporaryDirectory(prefix="sse_sandbox_") as tmpdir:
            # Write all generated files to the sandbox
            self._write_files(artifact, tmpdir)

            # Install minimal dependencies
            self._install_deps(tmpdir)

            # Run pytest
            result = self._run_pytest(tmpdir)

        result.duration_seconds = round(time.time() - start, 2)
        return result

    def _write_files(self, artifact: CodeArtifact, tmpdir: str) -> None:
        for code_file in artifact.files:
            file_path = os.path.join(tmpdir, code_file.file_path)
            os.makedirs(os.path.dirname(file_path), exist_ok=True)
            with open(file_path, "w") as f:
                f.write(code_file.content)

        # Create __init__.py files for all directories
        for root, dirs, _files in os.walk(tmpdir):
            for d in dirs:
                init_path = os.path.join(root, d, "__init__.py")
                if not os.path.exists(init_path):
                    open(init_path, "w").close()

    def _install_deps(self, tmpdir: str) -> None:
        """Install packages into the sandbox using pip."""
        # Check if there is a requirements.txt in the artifact
        req_path = os.path.join(tmpdir, "requirements.txt")
        if os.path.exists(req_path):
            subprocess.run(
                ["pip", "install", "-r", req_path, "--quiet"],
                cwd=tmpdir,
                timeout=30,
                capture_output=True,
            )
        else:
            # Install minimum packages needed for tests
            subprocess.run(
                ["pip", "install"] + self.REQUIRED_PACKAGES + ["--quiet"],
                cwd=tmpdir,
                timeout=30,
                capture_output=True,
            )

    def _run_pytest(self, tmpdir: str) -> TestResult:
        try:
            proc = subprocess.run(
                ["python", "-m", "pytest", "-v", "--tb=short", "--no-header"],
                cwd=tmpdir,
                timeout=self.TIMEOUT_SECONDS,
                capture_output=True,
                text=True,
            )

            output = proc.stdout + proc.stderr
            passed = proc.returncode == 0

            # Parse pytest output for counts
            passed_count, failed_count, total = self._parse_counts(output)

            return TestResult(
                passed=passed,
                total_tests=total,
                passed_count=passed_count,
                failed_count=failed_count,
                error_output="" if passed else output[-3000:],
            )

        except subprocess.TimeoutExpired:
            return TestResult(
                passed=False,
                total_tests=0,
                passed_count=0,
                failed_count=0,
                error_output=f"Tests timed out after {self.TIMEOUT_SECONDS}s. "
                "Check for infinite loops or blocking I/O.",
            )

    def _parse_counts(self, output: str) -> tuple[int, int, int]:
        """Extract pass/fail counts from pytest output."""
        import re

        passed = 0
        failed = 0

        # Match pytest summary line: "X passed, Y failed"
        match = re.search(r"(\d+) passed", output)
        if match:
            passed = int(match.group(1))

        match = re.search(r"(\d+) failed", output)
        if match:
            failed = int(match.group(1))

        match = re.search(r"(\d+) error", output)
        if match:
            failed += int(match.group(1))

        return passed, failed, passed + failed

Tại sao một temporary directory? Code được tạo ra là untrusted. Tempdir cung cấp một minimal isolation boundary — tất cả files bị xóa khi context manager thoát, bất kể điều gì xảy ra bên trong.

Tại sao một subprocess với timeout? LLM có thể tạo ra infinite loop hoặc blocking call. 60-second timeout đảm bảo pipeline không hang.

Tại sao truncate error output thành 3000 characters? Error output được feed lại vào retry prompt. Full tracebacks từ 24 failing tests có thể vượt quá 10,000 characters, wasting tokens và risking context window limits. 3000 characters cuối cùng chứa summary và failures gần đây nhất — the most actionable information.

4. First-Attempt vs Retry Prompts

SSE sử dụng hai prompts khác nhau tùy thuộc vào việc đây là first attempt hay retry sau failure. First-attempt prompt chỉ chứa specification. Retry prompt chứa specification, previous code, test output, và specific instructions để fix the failures.

Sự phân biệt này rất quan trọng. Khi first attempt, LLM có maximum creative freedom — nó có thể structure code như nó muốn trong constraints của technical spec. Khi retry, LLM nên make minimal, targeted changes để fix specific failures thay vì rewriting everything from scratch. Full rewrite trên retry thường introduce new failures trong khi fixing old ones.

class SSEPrompts:
    """Prompt templates for the SSE agent."""

    FIRST_ATTEMPT = """Implement the following system based on the technical
specification and user stories provided.

TECHNICAL SPECIFICATION:
{technical_spec}

USER STORIES:
{user_stories}

TEST CASES TO IMPLEMENT:
{test_cases}

REQUIREMENTS:
1. Produce ALL files needed for a working implementation
2. Every test case from the QC spec must have a corresponding pytest test
3. Follow the technology stack defined in the spec exactly
4. Use the data models defined in the spec — do not invent new ones
5. Implementation must be complete — no TODOs, no placeholders, no stubs
6. Each file must be self-contained with all necessary imports
7. Test files must import from implementation files using relative imports

For each file, provide:
- file_path: relative path from project root
- file_type: implementation | test | config
- content: complete file content
- description: one sentence explaining the file
- story_ids: which user stories this file implements

Return a JSON object matching the CodeArtifact schema with artifact_id "CODE-001".
"""

    RETRY = """Your previous implementation attempt failed tests.

PREVIOUS CODE:
{previous_code}

TEST FAILURES:
{test_output}

ATTEMPT: {attempt_number} of 3

INSTRUCTIONS:
1. Analyze the test failures carefully — read every error message
2. Identify the ROOT CAUSE of each failure, not just the symptom
3. Fix the specific issues — do NOT rewrite files that are working
4. If a test is failing because the test itself has a bug, fix the test
5. Return the COMPLETE updated CodeArtifact with ALL files (not just changed ones)
6. Keep the same artifact_id "CODE-001"

Common failure patterns:
- Import errors: check that module paths match file_path values
- AttributeError: check that class/function names match between impl and tests
- AssertionError: check that return values match test expectations exactly
- TypeError: check function signatures match the call sites

Return a JSON object matching the CodeArtifact schema.
"""

Prompt RETRY bao gồm một section “Common failure patterns”. Đây không phải hand-holding — đó là prompt engineering dựa trên observed failure modes. Trong testing của tôi, khoảng 60% of first-attempt failures rơi vào four categories: import errors, attribute errors, assertion mismatches, và type errors. Listing những categories này trong retry prompt cung cấp cho LLM một diagnostic framework thay vì để nó guess.

5. SSEAgent Full Implementation

import json
import logging
from typing import Optional
from agents.base import BaseAgent
from state import TeamState
from schemas.sse import CodeArtifact, CodeFile, TestResult
from runners.sandbox import SandboxedTestRunner

logger = logging.getLogger(__name__)


class SSEAgent(BaseAgent):
    SSE_SYSTEM_PROMPT = """You are Jordan, a Senior Software Engineer with 10 years
    of experience writing production Python.

    Your defining traits:
    - You write code that works on the FIRST try, not code that looks impressive
    - You follow the spec exactly — you do not add features that were not asked for
    - You write tests that actually test behavior, not tests that merely exist
    - Every file you produce is complete: all imports, all error handling, all edge cases
    - You never use placeholder comments like "# TODO" or "# implement later"

    When fixing failed tests:
    - Read the ENTIRE error message before making changes
    - Fix the root cause, not the symptom
    - Change as little as possible — surgical fixes over rewrites
    - If a test expectation is wrong, fix the test, not the implementation

    Always output valid JSON matching the CodeArtifact schema.
    """

    MAX_ATTEMPTS = 3

    def __init__(self, llm, runner: Optional[SandboxedTestRunner] = None):
        super().__init__(llm=llm)
        self.runner = runner or SandboxedTestRunner()

    async def run(self, state: TeamState) -> TeamState:
        """
        Generate-test-retry loop. Overrides BaseAgent.run entirely.

        1. Generate code from spec (first_attempt_prompt)
        2. Run tests in sandbox
        3. If tests pass → return CodeArtifact to state
        4. If tests fail and attempts < MAX → retry with failure context
        5. If tests fail and attempts >= MAX → escalate with best attempt
        """
        attempt = 0
        best_artifact: Optional[CodeArtifact] = None
        best_pass_count = -1
        previous_code = ""
        previous_errors = ""

        while attempt < self.MAX_ATTEMPTS:
            attempt += 1
            logger.info(f"SSE attempt {attempt}/{self.MAX_ATTEMPTS}")

            # Build the prompt
            if attempt == 1:
                prompt = self._build_first_attempt_prompt(state)
            else:
                prompt = self._build_retry_prompt(
                    state, previous_code, previous_errors, attempt
                )

            # Call the LLM
            response = await self.llm.ainvoke(
                [
                    {"role": "system", "content": self.SSE_SYSTEM_PROMPT},
                    {"role": "user", "content": prompt},
                ],
                response_format={"type": "json_object"},
            )

            # Parse the response
            try:
                artifact = CodeArtifact.model_validate_json(response.content)
                artifact.attempt_number = attempt
            except Exception as e:
                logger.warning(f"Attempt {attempt}: Failed to parse CodeArtifact: {e}")
                previous_errors = f"JSON parse error: {str(e)}"
                continue

            # Run tests in sandbox
            test_result = self.runner.run(artifact)
            artifact.test_result = test_result

            logger.info(
                f"Attempt {attempt}: {test_result.passed_count} passed, "
                f"{test_result.failed_count} failed"
            )

            # Track best attempt (most tests passing)
            if test_result.passed_count > best_pass_count:
                best_pass_count = test_result.passed_count
                best_artifact = artifact

            # Success — all tests pass
            if test_result.passed:
                logger.info(f"All tests passed on attempt {attempt}")
                return {
                    **state,
                    "code_artifact": artifact,
                    "sse_complete": True,
                    "sse_status": "passed",
                    "sse_attempts": attempt,
                }

            # Prepare context for retry
            previous_code = self._format_code_for_retry(artifact)
            previous_errors = test_result.error_output

        # All attempts exhausted — escalate with best attempt
        logger.warning(
            f"SSE failed after {self.MAX_ATTEMPTS} attempts. "
            f"Best result: {best_pass_count} tests passing. Escalating to TL."
        )
        return {
            **state,
            "code_artifact": best_artifact,
            "sse_complete": True,
            "sse_status": "needs_review",
            "sse_attempts": self.MAX_ATTEMPTS,
        }

    def _build_first_attempt_prompt(self, state: TeamState) -> str:
        from prompts.sse import SSEPrompts

        spec = state["technical_spec"]
        stories = state["user_stories"]
        test_suite = state["test_suite"]

        return SSEPrompts.FIRST_ATTEMPT.format(
            technical_spec=self._format_spec(spec),
            user_stories=self._format_stories(stories),
            test_cases=self._format_test_cases(test_suite),
        )

    def _build_retry_prompt(
        self,
        state: TeamState,
        previous_code: str,
        test_output: str,
        attempt: int,
    ) -> str:
        from prompts.sse import SSEPrompts

        return SSEPrompts.RETRY.format(
            previous_code=previous_code,
            test_output=test_output,
            attempt_number=attempt,
        )

    def _format_spec(self, spec) -> str:
        lines = [f"Project: {spec.project_name}"]
        lines.append("\nTech Stack:")
        for comp in spec.tech_stack:
            lines.append(f"  - {comp.name}: {comp.technology} ({comp.justification})")
        lines.append("\nData Models:")
        for model in spec.data_models:
            lines.append(f"  {model.entity_name}: {model.description}")
            for field in model.fields:
                nullable = " (nullable)" if field.nullable else ""
                lines.append(f"    - {field.field_name}: {field.field_type}{nullable}")
        lines.append("\nAPI Contracts:")
        for api in spec.api_contracts:
            auth = " [auth required]" if api.requires_auth else ""
            lines.append(f"  {api.method} {api.path} — {api.summary}{auth}")
        return "\n".join(lines)

    def _format_stories(self, stories: list) -> str:
        lines = []
        for story in stories:
            lines.append(f"[{story.story_id}] {story.title}")
            lines.append(f"  As a {story.as_a}, I want {story.i_want}")
            lines.append(f"  So that {story.so_that}")
            for criterion in story.acceptance_criteria:
                lines.append(f"  AC: {criterion}")
            lines.append("")
        return "\n".join(lines)

    def _format_test_cases(self, test_suite) -> str:
        lines = []
        for tc in test_suite.test_cases:
            lines.append(f"[{tc.test_id}] {tc.title} ({tc.test_type.value})")
            lines.append(f"  Given: {'; '.join(tc.given)}")
            lines.append(f"  When: {'; '.join(tc.when)}")
            lines.append(f"  Then: {'; '.join(tc.then)}")
            lines.append(f"  Priority: {tc.priority.value}")
            lines.append("")
        return "\n".join(lines)

    def _format_code_for_retry(self, artifact: CodeArtifact) -> str:
        lines = []
        for f in artifact.files:
            lines.append(f"=== {f.file_path} ({f.file_type.value}) ===")
            lines.append(f.content)
            lines.append("")
        return "\n".join(lines)

Key Design Decisions

Best-artifact tracking. Khi cả ba attempts fail, SSE trả về attempt với most passing tests, không phải last attempt. Attempt 2 có thể pass 20 of 24 tests trong khi attempt 3, cố gắng fix 4 failures, introduces 3 new ones và passes chỉ 17.

sse_status field. State sử dụng một string status thay vì boolean. "passed" có nghĩa là TL có thể proceed tới code review. "needs_review" có nghĩa TL cần diagnose tại sao SSE bị stuck. Status này governs conditional routing trong LangGraph workflow.

6. Failure Analysis: Điều gì sai và tại sao

Tôi chạy SSE agent 50 lần against the task manager spec từ Phần 6 và tracked failure modes. Data rõ ràng:

Attempt 1 success rate: 35%. Khoảng một trong ba runs, LLM tạo ra code pass all tests khi first try. Điều này nghe thấp, nhưng nhớ: 24 test cases across 8 user stories, bao gồm edge cases và error scenarios, là một genuinely difficult code generation task.

Attempt 2 success rate (after attempt 1 failure): 72%. Khi given the specific test failures, LLM fixes them khoảng ba phần tư thời gian. Đây là core value của retry loop — một single retry doubles the overall success rate.

Attempt 3 success rate (after attempt 2 failure): 45%. Diminishing returns. By attempt 3, remaining failures tend to be conceptual misunderstandings thay vì simple bugs. LLM either cannot grasp what the test expects hoặc test và implementation đã diverged theo cách mà requires rethinking the approach.

Overall pipeline success rate: 88%. Sau ba attempts, khoảng 88 of every 100 runs produce code pass all tests. 12% remaining escalate tới TL agent.

Most common failure categories:

Failure Type	Frequency	Typical Fix
Import path mismatch	28%	File path không match import statement
Missing edge case handling	22%	Happy path works, empty-input case missing
Return type mismatch	18%	Function returns dict, test expects Pydantic model
Async/sync confusion	12%	Test calls `await func()` nhưng func is sync
Database state assumptions	10%	Test assumes clean DB, impl assumes seeded data
Other	10%	Typos, wrong status codes, missing error messages

7. Chi phí của Retries

Mỗi retry đắt hơn vì prompt grows — previous code (~3,500 tokens) và error output (~800-1,200 tokens) được thêm vào context. At Claude Sonnet pricing ($3/M input, $15/M output):

Attempt 1: ~4,300 input tokens, ~3,500 output tokens = ~$0.07
Attempt 2: ~8,600 input tokens, ~3,500 output tokens = ~$0.08
Attempt 3: ~9,000 input tokens, ~3,500 output tokens = ~$0.08

Một three-attempt run costs khoảng $0.23. Retry overhead là under $0.20 for một second chance ở getting the code right. Compared tới human developer’s time, economics không gần.

Deeper cost concern là latency, không phải dollars. Mỗi LLM call takes 15-30 seconds. Mỗi test run takes 5-15 seconds. Một three-attempt run adds 60-135 seconds of wall-clock time. Đây là một reason tôi invested trong parallel fan-out ở Phần 6 — time saved by running QC và TA concurrently partially offsets the time SSE spends ở retry loops.

8. Real Example: Attempt 1 Fails, Attempt 2 Passes

Hãy walk through một actual run. Spec là task manager từ Phần 6. SSE receives 8 user stories, 24 test cases, và technical spec calling for FastAPI + PostgreSQL.

Attempt 1

SSE tạo ra 7 files:

src/models/task.py          — Task, User, Comment Pydantic models
src/models/enums.py         — TaskStatus, TaskPriority enums
src/api/tasks.py            — FastAPI router with CRUD endpoints
src/api/dependencies.py     — Auth dependency, DB session
src/db/repository.py        — TaskRepository with in-memory store
tests/test_tasks.py         — 24 pytest tests
requirements.txt            — fastapi, pytest, pydantic, uvicorn

SandboxedTestRunner ghi những files này vào tempdir, installs dependencies, và chạy pytest:

tests/test_tasks.py::test_create_task_happy_path PASSED
tests/test_tasks.py::test_create_task_max_title PASSED
tests/test_tasks.py::test_create_task_empty_title FAILED
tests/test_tasks.py::test_create_task_no_auth PASSED
tests/test_tasks.py::test_assign_task_happy_path PASSED
...
tests/test_tasks.py::test_concurrent_completion FAILED
tests/test_tasks.py::test_archive_completed_task PASSED
tests/test_tasks.py::test_archive_pending_task FAILED

========================= FAILURES =========================
___ test_create_task_empty_title ___
    response = client.post("/api/v1/tasks", json={"title": ""})
>   assert response.status_code == 422
E   AssertionError: assert 201 == 422

___ test_concurrent_completion ___
    # Simulate concurrent requests
>   results = await asyncio.gather(
        complete_task(task_id, user_a_token),
        complete_task(task_id, user_b_token),
    )
E   TypeError: object function can't be used in 'await' expression

___ test_archive_pending_task ___
>   assert response.status_code == 400
E   AssertionError: assert 200 == 400

============= 3 failed, 21 passed in 2.34s =============

Ba failures. 21 of 24 tests pass. SSE logs đây là attempt 1 với passed_count=21.

Analysis

Ba failures có ba root causes khác nhau:

test_create_task_empty_title: Implementation không validate rằng title là non-empty. Endpoint accepts {"title": ""} và creates a task với empty title. Fix: add a Pydantic validator hoặc manual check ở endpoint.
test_concurrent_completion: Test uses asyncio.gather với await, nhưng complete_task helper function là synchronous. Fix: either make the helper async hoặc use threading instead of asyncio cho concurrency test.
test_archive_pending_task: Implementation allows archiving a task regardless of status. Test expects chỉ completed tasks có thể được archived. Fix: add a status check ở archive endpoint.

Attempt 2

Retry prompt bao gồm 3 failure outputs. SSE tạo ra một updated CodeArtifact với fixes:

Added @field_validator("title") tới TaskCreate model mà rejects empty strings
Changed concurrent test để use threading.Thread instead of asyncio.gather
Added if task.status != TaskStatus.COMPLETE: raise HTTPException(400) tới archive endpoint

Test runner executes lại:

tests/test_tasks.py::test_create_task_empty_title PASSED
tests/test_tasks.py::test_concurrent_completion PASSED
tests/test_tasks.py::test_archive_pending_task PASSED
...
============= 24 passed in 2.51s =============

Tất cả 24 tests pass. SSE returns với sse_status="passed" và sse_attempts=2.

Đây là ideal retry scenario: specific, diagnosable failures với clear fixes. LLM không cần rethink approach của nó — nó cần handle ba cases nó miss khi first pass.

9. Escalation tới Tech Lead

Khi cả ba attempts fail, SSE sets sse_status = "needs_review" và returns its best artifact. Tech Lead agent (Phần 8) receives state này và knows nó cần làm more than một standard code review — nó needs diagnose tại sao SSE không thể fix the failures và either provide specific guidance hoặc flag the story để human intervention.

Escalation carries forward:

# What the TL sees in state after SSE escalation
state["sse_status"]       # "needs_review"
state["sse_attempts"]     # 3
state["code_artifact"]    # Best attempt (most tests passing)

# The TL can inspect the test result
artifact = state["code_artifact"]
result = artifact.test_result
print(f"Best result: {result.passed_count}/{result.total_tests} passing")
print(f"Failures:\n{result.error_output}")

TL agent’s response tới escalation qualitatively differs từ response tới clean submission. Trên clean submission, TL làm code quality review: naming, structure, patterns, documentation. Trên escalation, TL làm failure analysis: tại sao SSE bị stuck, test có sai hoặc implementation có sai, điều này có require một different algorithmic approach không? Tôi sẽ xây dựng branching logic này ở Phần 8.

10. Wiring SSE vào LangGraph Workflow

SSE node plugs vào existing workflow sau QC+TA merge:

from langgraph.graph import StateGraph
from state import TeamState
from agents.sse import SSEAgent
from runners.sandbox import SandboxedTestRunner


async def sse_node(state: TeamState) -> TeamState:
    runner = SandboxedTestRunner()
    agent = SSEAgent(llm=get_llm(), runner=runner)
    return await agent.run(state)


def build_workflow() -> StateGraph:
    workflow = StateGraph(TeamState)

    # Previous nodes from Parts 5-6
    workflow.add_node("ba_agent", ba_node)
    workflow.add_node("qc_agent", qc_node)
    workflow.add_node("ta_agent", ta_node)
    workflow.add_node("parallel_merge", merge_qc_ta)

    # SSE node (this article)
    workflow.add_node("sse_agent", sse_node)

    # Edges
    workflow.set_entry_point("ba_agent")
    workflow.add_edge("ba_agent", "qc_agent")
    workflow.add_edge("ba_agent", "ta_agent")
    workflow.add_edge("qc_agent", "parallel_merge")
    workflow.add_edge("ta_agent", "parallel_merge")
    workflow.add_edge("parallel_merge", "sse_agent")

    # After SSE, route based on status
    workflow.add_conditional_edges(
        "sse_agent",
        route_after_sse,
        {
            "tl_review": "tl_agent",
            "tl_escalation": "tl_agent",
        },
    )

    return workflow.compile()


def route_after_sse(state: TeamState) -> str:
    """Route to TL agent — review mode differs based on SSE status."""
    if state.get("sse_status") == "passed":
        return "tl_review"
    return "tl_escalation"

Conditional edge sau SSE là routing decision đầu tiên ở pipeline. Tất cả previous edges là unconditional — BA luôn goes tới QC và TA, QC và TA luôn merge, merge luôn goes tới SSE. Sau SSE, pipeline branches: một successful SSE goes tới normal TL review, một failed SSE goes tới TL escalation. Cùng destination node, different routing label mà TL agent có thể read từ state để adjust behavior của nó.

Bringing It Together: The State Sau Phase Này

Sau SSE completes (whether by success hoặc escalation), TeamState chứa:

state = TeamState(
    # From Parts 4-6
    user_brief=UserBrief(...),
    clarified_requirements="...",
    user_stories=[UserStory(...) * 8],
    test_suite=TestSuite(test_cases=[TestCase(...) * 24], ...),
    technical_spec=TechnicalSpec(...),

    # From SSE (this article)
    code_artifact=CodeArtifact(
        artifact_id="CODE-001",
        files=[CodeFile(...) * 7],
        test_result=TestResult(
            passed=True,
            total_tests=24,
            passed_count=24,
            failed_count=0,
        ),
        attempt_number=2,
    ),
    sse_complete=True,
    sse_status="passed",  # or "needs_review"
    sse_attempts=2,
)

Pipeline bây giờ có working, tested code. Abstract đã become concrete. User stories đã được translated thành endpoints. Test case definitions đã được translated thành executable pytest tests. Architecture decisions đã được translated thành code structure.

Nhưng working code không giống good code. SSE optimizes cho passing tests, không phải cho maintainability, readability, hoặc adherence tới team conventions. Đó là job của Tech Lead.

Điều gì mà Tôi Built ở Phần 7

Một agent, một runner, một loop:

SSEAgent (Jordan) — generates code từ full TeamState (stories + test cases + tech spec), chạy tests ở sandbox, retries tới 3 lần với failure context, escalates tới TL on exhaustion
SandboxedTestRunner — ghi files vào tempdir, installs deps, chạy pytest như một subprocess với 60-second timeout, parses pass/fail counts
CodeArtifact schema — structured container cho generated files với test results, attempt tracking, và story coverage
Generate-test-retry loop — core pattern mà turns một 35% first-attempt success rate thành 88% pipeline success rate
Conditional routing — first branching edge ở LangGraph workflow, routing tới TL review hoặc TL escalation dựa trên SSE status

Key insight của article này: mechanical verification changes everything. Một agent mà có thể run its own tests và feed failures trở lại vào prompt qualitatively differs từ một agent mà generates output và hopes for the best. Retry loop không phải fallback — nó là primary mechanism mà makes code generation reliable enough cho một production pipeline.

What’s Next

Ở Phần 8, tôi xây dựng Tech Lead agent. TL reviews SSE’s code cho quality, consistency, và adherence tới technical spec. Trên clean submission, TL checks naming conventions, code structure, và documentation. Trên escalation, TL diagnoses tại sao SSE bị stuck và decides whether provide fix guidance hoặc flag the story để human intervention. TL là gate giữa “code that works” và “code that ships.”

Hẹn gặp tôi ở Phần 8.

Phần 1: Why Build an AI Software Team?
Phần 2: Mapping the Roles — 8 Agents, One Pipeline
Phần 3: Designing Your AI Team — Architecture and DDD
Phần 4: TeamState — The Shared Brain
Phần 5: PO and BA Agents — From Brief to User Stories
Phần 6: QC and TA Agents — Quality and Technical Design
Phần 7: The SSE Agent — Code Generation, Self-Testing, and Iteration (this article)
Phần 8: The TL Agent — Code Review and Quality Gate (coming soon)
Phần 9-12: Coming soon

Xuất nội dung

SSE Agent: Code Generation, Self-Testing và Iteration (Phần 7/12)

1. Tại sao SSE khác biệt

2. Schema CodeArtifact

3. SandboxedTestRunner

4. First-Attempt vs Retry Prompts

5. SSEAgent Full Implementation

Key Design Decisions

6. Failure Analysis: Điều gì sai và tại sao

7. Chi phí của Retries

8. Real Example: Attempt 1 Fails, Attempt 2 Passes

Attempt 1

Analysis

Attempt 2

9. Escalation tới Tech Lead

10. Wiring SSE vào LangGraph Workflow

Bringing It Together: The State Sau Phase Này

Điều gì mà Tôi Built ở Phần 7

What’s Next

Series Navigation

Bình luận

Nội dung chính

SSE Agent: Code Generation, Self-Testing và Iteration (Phần 7/12)