There is a mistake almost every engineering team makes at least once. You write the code first, ship it to QA, and then find out the tester has a completely different mental model of what “done” means. The developer thought “user can log in” meant a working form. The tester thought it meant SSO, two-factor authentication, session expiry, and graceful degradation when the identity provider goes down. Two weeks of rework follow, usually at the worst possible time.

The fix is not a better QA process. The fix is a better sequence. Quality must be defined before code exists, not after.

And while QC is drawing those quality boundaries, your Technical Architect should be making the hard technology decisions — not as an afterthought, not inside the pull request, but upfront, with explicit reasoning, before a single line of implementation is written.

In Part 5, we built the Business Analyst agent. The BA produced user stories: structured, prioritized, and rich with acceptance criteria. Now we have two questions to answer simultaneously. The QC agent asks: “What does success look like, and what does failure look like?” The TA agent asks: “What technology choices and design constraints govern the implementation?” Both questions are independent of each other. Both need to be answered before the SSE agent touches a keyboard.

That independence is the key insight of this article. QC and TA run in parallel.


1. Test-First, Design-First Thinking

Let me say something that sounds obvious but is widely ignored in practice: the best time to write a test case is when you have zero lines of code to look at.

When code exists, test authors unconsciously test what the code does, not what the requirements say it should do. You find the happy path and write a test for it. You see the error handling the developer chose and write a test for that. You miss the error handling the developer did not choose — the case they forgot to imagine. This is called confirmation bias in test writing, and it is endemic.

The QC agent avoids this entirely because it has no code to look at. It reads user stories and asks: “What would a user experience if this goes perfectly? What would they experience if it goes badly? What are the boundary conditions? What happens under load?” The answers become test cases — formal, numbered, structured — before implementation has started.

This is the spirit of Behavior-Driven Development and Test-Driven Development, but applied at the team architecture level, not just the developer level.

Why Architecture Decisions Made Upfront Save 10x Refactor Time

The TA agent’s job is equally front-loaded for the same reason: decisions made late are expensive.

The canonical example is database choice. If your Technical Architect decides at week three that PostgreSQL is the right call — after the SSE spent two weeks designing around MongoDB’s document model — you have a full data layer rewrite ahead of you. More commonly, decisions that arrive late create a hidden tax: the developer makes a local decision (I’ll use this library, I’ll structure this API this way) that is inconsistent with what the architect would have chosen. These inconsistencies accumulate. They become the “technical debt” that makes every subsequent feature twice as hard to add.

Architecture Decision Records (ADRs) are the artifact the TA agent produces. An ADR is a short document that records a decision, the context that motivated it, the alternatives considered, and the consequences of the choice. It is not a design document that explains everything. It is a decision record — a snapshot of reasoning at a moment in time. ADRs are valuable because they answer the question every new team member eventually asks: “Why did we build it this way?”

The Parallel Fan-Out Pattern

Here is the key structural insight. Look at the two questions:

  • QC: “What does success look like for each user story?”
  • TA: “What is the technology stack, data model, and architecture?”

These questions do not depend on each other. The QC agent does not need to know the technology stack to write test cases. The TA agent does not need to know what edge cases QC will test to make architecture decisions. They both need the same input — the user stories from the BA — and they produce different outputs that will both be needed by the agents that come later.

This is a parallel fan-out pattern. A single upstream output fans out to multiple downstream processes that run concurrently, then their results merge before the next stage begins.

In LangGraph, this is a first-class concept. The graph explicitly models which nodes can run in parallel and which must wait for specific predecessors. The sequential alternative — run QC, wait, then run TA — would take twice as long for no benefit. This is the first significant time optimization in our pipeline, and it will not be the last.

[Figure: Parallel fan-out pattern. BA output feeds QC and TA simultaneously; both merge into combined state.]

2. QC Agent Design

Sam is our QC Engineer. Eight years of experience means Sam has seen every failure mode that shipping to production can expose. Sam knows that “it works on my machine” is not a test result. Sam knows that “we’ll add error handling later” is a lie.

Sam’s job in our pipeline is specific and bounded: transform user stories into test cases and quality gates. Sam does not run tests. Sam does not write test code. Sam does not configure CI pipelines. Sam defines what must be tested and what the pass/fail criteria are. The SSE agent will write the actual test code later. DevOps will configure the pipeline. Sam defines the contract.

Role Definition

The QC agent’s formal role:

  • Input: UserStory objects from TeamState
  • Output: TestSuite containing TestCase objects
  • Tools: read_document, create_test_file
  • Does NOT: run tests, write implementation code, configure CI

The distinction between “defining tests” and “running tests” is critical for agent design. If Sam tried to run tests, Sam would need access to a running environment, a build system, and live code — none of which exist at this stage of the pipeline. By restricting Sam to test definition only, we keep the agent’s scope clean and its output purely logical rather than environmental.

What Sam Produces for Each Story

For every user story, Sam generates:

  1. Happy path test cases — the expected successful flow, end to end
  2. Edge cases — boundary conditions: empty input, maximum length, zero values, single-item lists, concurrent users
  3. Error scenarios — what happens when external dependencies fail: network timeout, database unavailable, invalid authentication token, malformed request body
  4. Performance acceptance criteria — concrete, measurable: response time under load, throughput requirements, memory limits

The output is not prose. It is structured data — TestCase objects with formal Given/When/Then steps, expected results, and metadata linking each test case back to the story it validates.


3. QCAgent Full Implementation

from agents.base import BaseAgent
from state import TeamState
from schemas.qc import TestSuite


class QCAgent(BaseAgent):
    QC_SYSTEM_PROMPT = """You are Sam, a QC Engineer with 8 years of experience.
    Your job is to think about what can go wrong BEFORE the code is written.

    For every user story, you create:
    1. Happy path test cases — the successful flow a real user would follow
    2. Edge cases — empty input, max length strings, zero values,
       concurrent users, single-item collections, unicode characters
    3. Error scenarios — network failure, invalid data, auth failures,
       timeout conditions, third-party service outages
    4. Performance acceptance criteria — response time < 200ms at p95,
       throughput targets, memory ceiling

    You are NOT writing test code. You are defining test contracts.
    Every test case must be actionable by a developer who has not read
    the requirements — it must stand alone as a specification.

    Your test IDs follow the format TC-XXX where XXX is a zero-padded
    three-digit integer. Start from TC-001 for each new suite.

    Always output valid JSON matching the TestSuite schema.
    """

    def _prepare_prompt(self, state: TeamState) -> str:
        stories = state["user_stories"]
        formatted = self._format_stories(stories)

        return f"""Create comprehensive test cases for these user stories:

{formatted}

For each story produce test cases covering:

HAPPY PATH:
- Normal user flow from start to finish
- All required fields present and valid
- Expected system response and state change

EDGE CASES (at minimum):
- Empty / null / whitespace-only inputs
- Maximum length inputs (assume 255 chars for strings)
- Numeric boundary values (0, 1, max int, negative)
- Concurrent requests (at least 2 simultaneous users)
- Unicode and special character inputs

ERROR SCENARIOS:
- Missing required fields
- Invalid data types
- Authentication / authorization failure
- Downstream service unavailable (500 / timeout)
- Rate limit exceeded

PERFORMANCE:
- Define p95 response time ceiling
- Define acceptable throughput (requests/second)
- Define memory growth ceiling if applicable

For each test case provide:
- test_id: TC-001, TC-002, ...
- story_id: matches the user story ID
- title: short descriptive name
- test_type: unit | integration | e2e | performance
- given: list of preconditions
- when: list of actions / inputs
- then: list of expected outcomes
- expected_result: single-sentence summary
- priority: critical | high | medium | low

Return a JSON object matching this structure:
{{
  "suite_id": "TS-001",
  "story_coverage": ["US-001", "US-002", ...],
  "test_cases": [ ... ],
  "quality_gates": {{
    "min_unit_coverage": 80,
    "max_p95_response_ms": 200,
    "max_error_rate_percent": 0.1
  }}
}}
"""

    def _format_stories(self, stories: list) -> str:
        lines = []
        for story in stories:
            lines.append(f"Story {story.story_id}: {story.title}")
            lines.append(f"  As a {story.as_a}")
            lines.append(f"  I want to {story.i_want}")
            lines.append(f"  So that {story.so_that}")
            lines.append(f"  Acceptance criteria:")
            for criterion in story.acceptance_criteria:
                lines.append(f"    - {criterion}")
            lines.append(f"  Priority: {story.priority}")
            lines.append("")
        return "\n".join(lines)

    async def run(self, state: TeamState) -> TeamState:
        prompt = self._prepare_prompt(state)
        response = await self.llm.ainvoke(
            [
                {"role": "system", "content": self.QC_SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
            response_format={"type": "json_object"},
        )
        raw = response.content
        suite = TestSuite.model_validate_json(raw)

        # Return only the keys this agent owns. During the parallel phase,
        # LangGraph merges each branch's partial update into the shared state;
        # returning the full state would write every key concurrently and
        # collide with the TA branch's writes in the same step.
        return {
            "test_suite": suite,
            "qc_complete": True,
        }

The _format_stories helper is the same pattern used in the BA agent from Part 5: convert structured Pydantic objects into readable prose that gives the LLM sufficient context without overwhelming it with JSON syntax. The LLM reads English; give it English.
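For a single story, the formatted block the model sees looks like this (the story content is illustrative, not output from the real pipeline):

Story US-001: Create a task
  As a project member
  I want to create a task with a title and priority
  So that work items are tracked in one place
  Acceptance criteria:
    - Title is required and limited to 255 characters
    - New tasks start in "pending" status
  Priority: high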

The response_format={"type": "json_object"} parameter is OpenAI’s structured output mode. It instructs the model to return only valid JSON. Combined with TestSuite.model_validate_json(raw), this gives us a hard guarantee: if the model output cannot be parsed into a TestSuite, the agent raises a validation error that the orchestrator can retry or escalate.
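The retry loop itself lives in the orchestrator, not the agent. A minimal sketch of what that handling could look like (the run_with_retry helper and the validation_feedback key are illustrative, not part of the pipeline shown so far):

from pydantic import ValidationError


async def run_with_retry(agent, state, max_attempts=3):
    # Hypothetical orchestrator helper: re-run the agent when the LLM
    # output fails schema validation, feeding the error back so the
    # model can correct itself on the next attempt.
    for attempt in range(1, max_attempts + 1):
        try:
            return await agent.run(state)
        except ValidationError as exc:
            if attempt == max_attempts:
                raise
            # "validation_feedback" is a hypothetical key; the agent's
            # prompt builder would need to read it for this to help.
            state = {**state, "validation_feedback": str(exc)}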


4. TestCase and TestSuite Schemas

from enum import Enum
from pydantic import BaseModel, Field


class TestType(str, Enum):
    UNIT = "unit"
    INTEGRATION = "integration"
    E2E = "e2e"
    PERFORMANCE = "performance"


class TestPriority(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class TestCase(BaseModel):
    test_id: str = Field(
        description="Unique identifier, format TC-001",
        pattern=r"^TC-\d{3}$",
    )
    story_id: str = Field(
        description="References the user story this test validates",
    )
    title: str = Field(
        description="Short descriptive name for the test case",
        max_length=120,
    )
    test_type: TestType
    given: list[str] = Field(
        description="Preconditions that must be true before the test runs",
        min_length=1,
    )
    when: list[str] = Field(
        description="Actions or inputs that trigger the behavior under test",
        min_length=1,
    )
    then: list[str] = Field(
        description="Observable outcomes that must be true after the action",
        min_length=1,
    )
    expected_result: str = Field(
        description="Single-sentence summary of the expected outcome",
        max_length=200,
    )
    priority: TestPriority
    tags: list[str] = Field(
        default_factory=list,
        description="Optional labels: happy-path, edge-case, error-scenario, performance",
    )


class QualityGates(BaseModel):
    min_unit_coverage: int = Field(
        default=80,
        ge=0,
        le=100,
        description="Minimum percentage of lines/branches covered by unit tests",
    )
    max_p95_response_ms: int = Field(
        default=200,
        gt=0,
        description="Maximum acceptable p95 response time in milliseconds",
    )
    max_error_rate_percent: float = Field(
        default=0.1,
        ge=0.0,
        le=100.0,
        description="Maximum acceptable error rate percentage under normal load",
    )
    min_test_pass_rate: float = Field(
        default=100.0,
        ge=0.0,
        le=100.0,
        description="Minimum percentage of tests that must pass to gate deployment",
    )


class TestSuite(BaseModel):
    suite_id: str = Field(
        description="Unique identifier for this test suite, format TS-001",
    )
    story_coverage: list[str] = Field(
        description="List of story IDs covered by this test suite",
    )
    test_cases: list[TestCase] = Field(
        description="All test cases in this suite",
        min_length=1,
    )
    quality_gates: QualityGates = Field(
        default_factory=QualityGates,
        description="Measurable thresholds that gate deployment",
    )
    total_cases: int = Field(default=0)
    critical_cases: int = Field(default=0)

    def model_post_init(self, __context) -> None:
        self.total_cases = len(self.test_cases)
        self.critical_cases = sum(
            1 for tc in self.test_cases if tc.priority == TestPriority.CRITICAL
        )

A few design decisions worth noting:

The pattern=r"^TC-\d{3}$" validator on test_id enforces the naming convention at the data layer. If the LLM hallucinates an ID like “TEST_1” or “tc001”, Pydantic rejects it immediately. The agent framework catches the validation error and can retry with a more specific error message in the prompt.
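A quick demonstration of that rejection, using placeholder field values:

from pydantic import ValidationError
from schemas.qc import TestCase

case = dict(
    story_id="US-001",
    title="Create task happy path",
    test_type="unit",
    given=["authenticated user"],
    when=["POST /api/v1/tasks with a valid body"],
    then=["response is 201 Created"],
    expected_result="Task is created and persisted.",
    priority="critical",
)

TestCase(test_id="TC-001", **case)      # passes validation

try:
    TestCase(test_id="TEST_1", **case)  # violates ^TC-\d{3}$
except ValidationError as exc:
    # The error names the field and constraint, which is exactly
    # what the retry prompt needs.
    print(exc.errors()[0]["loc"], exc.errors()[0]["msg"])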

QualityGates is a first-class model, not a loose dictionary. These numbers will be referenced by the DevOps agent when configuring CI/CD pipeline gates in Part 10. Having them as a typed model means the DevOps agent can read suite.quality_gates.min_unit_coverage with full type safety rather than parsing strings.

model_post_init computes derived fields after validation. This keeps the LLM from having to count its own output (which it does unreliably) while ensuring the TeamState always has accurate counts.


5. TA Agent Design

Morgan is our Technical Architect. Morgan’s defining characteristic — and the defining characteristic we encode in the system prompt — is a preference for simplicity. “Boring technology over clever technology” is not a joke or a dismissal of innovation. It is a hard-won engineering philosophy: the system you ship is the system you maintain. Systems built on exotic choices require experts in exotic choices. Systems built on PostgreSQL, HTTP, and boring file I/O require senior developers — a much larger talent pool.

Morgan’s formal role:

  • Input: clarified_requirements and user_stories from TeamState
  • Output: TechnicalSpec containing ArchitectureDecision objects, data models, API contracts, and implementation phases
  • Does NOT: write implementation code, define test cases, configure CI/CD

The TA agent’s output is the design envelope within which all subsequent agents operate. The SSE will implement against the tech stack Morgan chose. The DevOps agent will configure infrastructure for the architecture Morgan specified. The TL agent will enforce the coding standards Morgan defined in the ADRs. Getting this right upfront is not optional.

What Morgan Produces

  1. Technology stack with justification — not just “we’ll use FastAPI” but “we use FastAPI because the team has Python expertise, the async model fits our I/O pattern, and the automatic OpenAPI generation reduces documentation overhead compared to Flask”
  2. Data models — key entities with attributes and relationships, documented as entity definitions rather than SQL DDL (the SSE will write the actual DDL)
  3. API contracts — key endpoints with HTTP methods, request/response shapes, and status codes
  4. Architecture Decision Records — one ADR per significant choice, with context, decision, alternatives considered, and consequences
  5. Implementation phases — what gets built in what order, with dependency reasoning
  6. Technical risks and mitigations — what could go wrong architecturally, and what the plan is if it does

6. TAAgent Full Implementation

from agents.base import BaseAgent
from state import TeamState
from schemas.ta import TechnicalSpec


class TAAgent(BaseAgent):
    TA_SYSTEM_PROMPT = """You are Morgan, a Technical Architect with 12 years of
    experience building production systems at scale.

    Your core philosophy:
    - Prefer boring technology over clever technology
    - Simple systems that work beat complex systems that might work
    - Document decisions with explicit reasoning — future Morgan will thank you
    - "It depends" is a valid answer, but you must always say what it depends ON

    You design systems that are:
    - Simple enough for a mid-level developer to understand
    - Maintainable by a team that does not include you
    - Appropriately scalable — not over-engineered for traffic that doesn't exist
    - Explicit about trade-offs, never pretending trade-offs don't exist

    You document every significant decision as an ADR (Architecture Decision Record).
    An ADR is short: title, status, context (why this decision was needed),
    decision (what you chose), alternatives considered, consequences (good and bad).

    Always output valid JSON matching the TechnicalSpec schema.
    """

    def _prepare_prompt(self, state: TeamState) -> str:
        requirements = state.get("clarified_requirements", "")
        stories = state.get("user_stories", [])
        formatted_stories = self._format_stories(stories)

        return f"""Based on these requirements, produce a complete technical specification.

CLARIFIED REQUIREMENTS:
{requirements}

USER STORIES:
{formatted_stories}

Produce the following sections:

1. TECHNOLOGY STACK
   For each component (backend framework, database, cache, message queue if needed,
   frontend if applicable), specify:
   - The chosen technology
   - The version or version range
   - The justification (2-3 sentences minimum)
   - Alternatives you considered and why you rejected them

2. DATA MODELS
   For each key entity:
   - Entity name and description
   - Fields with types and constraints
   - Relationships to other entities
   - Any important indexes or unique constraints

3. API CONTRACTS
   For each significant endpoint:
   - HTTP method and path
   - Request body schema (if applicable)
   - Response schema
   - Key HTTP status codes and their meaning
   - Authentication requirement

4. ARCHITECTURE DECISION RECORDS
   Create one ADR for each significant technical decision. At minimum, address:
   - Primary database choice
   - Caching strategy (if applicable)
   - Authentication approach
   - API style (REST vs GraphQL vs other)
   Each ADR must include: title, status (Proposed/Accepted/Deprecated),
   context, decision, alternatives_considered (list), consequences (list)

5. IMPLEMENTATION PHASES
   Define 3-4 phases of implementation in dependency order:
   - Phase name
   - What gets built
   - What it depends on from previous phases
   - Estimated relative complexity (low/medium/high)

6. TECHNICAL RISKS
   For each identified risk:
   - Risk description
   - Likelihood (low/medium/high)
   - Impact (low/medium/high)
   - Mitigation strategy

Return a JSON object matching the TechnicalSpec schema.
"""

    def _format_stories(self, stories: list) -> str:
        if not stories:
            return "No user stories provided."
        lines = []
        for story in stories:
            lines.append(f"[{story.story_id}] {story.title}")
            lines.append(f"  As a {story.as_a}, I want {story.i_want}")
        return "\n".join(lines)

    async def run(self, state: TeamState) -> TeamState:
        prompt = self._prepare_prompt(state)
        response = await self.llm.ainvoke(
            [
                {"role": "system", "content": self.TA_SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
            response_format={"type": "json_object"},
        )
        raw = response.content
        spec = TechnicalSpec.model_validate_json(raw)

        # As with the QC agent, return only the keys this agent owns so the
        # two parallel branches never write the same state key in one step.
        return {
            "technical_spec": spec,
            "ta_complete": True,
        }

The prompt structure is explicit about the minimum requirements for each section. “2-3 sentences minimum” for justification is not politeness — it prevents the LLM from producing one-word answers like “FastAPI: fast.” “Alternatives you considered” forces the ADR to be a genuine decision record, not a post-hoc rationalization of the first option that came to mind.


7. TechnicalSpec and ArchitectureDecision Schemas

from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field


class AdrStatus(str, Enum):
    PROPOSED = "Proposed"
    ACCEPTED = "Accepted"
    DEPRECATED = "Deprecated"
    SUPERSEDED = "Superseded"


class Complexity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class RiskLevel(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class ArchitectureDecision(BaseModel):
    adr_id: str = Field(
        description="Unique identifier, format ADR-001",
        pattern=r"^ADR-\d{3}$",
    )
    title: str = Field(
        description="Short noun phrase describing what was decided",
        max_length=100,
    )
    status: AdrStatus
    context: str = Field(
        description="Why this decision was needed — what problem it addresses",
        min_length=50,
    )
    decision: str = Field(
        description="What was chosen and the primary reasoning",
        min_length=50,
    )
    alternatives_considered: list[str] = Field(
        description="Technologies or approaches that were evaluated and not chosen",
        min_length=1,
    )
    consequences: list[str] = Field(
        description="Trade-offs, both positive and negative, of this decision",
        min_length=1,
    )


class TechComponent(BaseModel):
    name: str
    technology: str
    version: Optional[str] = None
    justification: str = Field(min_length=30)
    alternatives_rejected: list[str] = Field(default_factory=list)


class EntityField(BaseModel):
    field_name: str
    field_type: str
    nullable: bool = False
    description: Optional[str] = None
    constraints: list[str] = Field(default_factory=list)


class DataModel(BaseModel):
    entity_name: str
    description: str
    fields: list[EntityField]
    relationships: list[str] = Field(
        default_factory=list,
        description="Plaintext descriptions: 'User has many Tasks', etc.",
    )
    indexes: list[str] = Field(
        default_factory=list,
        description="Key indexes and unique constraints",
    )


class ApiContract(BaseModel):
    method: str = Field(pattern=r"^(GET|POST|PUT|PATCH|DELETE|OPTIONS|HEAD)$")
    path: str = Field(description="URL path, e.g. /api/v1/tasks/{task_id}")
    summary: str
    request_body: Optional[dict] = None
    response_schema: dict = Field(default_factory=dict)
    status_codes: dict[str, str] = Field(
        description="Map of status code string to meaning, e.g. {'200': 'Success', '404': 'Task not found'}",
    )
    requires_auth: bool = True


class ImplementationPhase(BaseModel):
    phase_number: int
    name: str
    deliverables: list[str]
    depends_on: list[str] = Field(
        default_factory=list,
        description="Phase names this phase depends on",
    )
    complexity: Complexity


class TechnicalRisk(BaseModel):
    description: str
    likelihood: RiskLevel
    impact: RiskLevel
    mitigation: str


class TechnicalSpec(BaseModel):
    spec_id: str = Field(description="Format SPEC-001")
    project_name: str
    tech_stack: list[TechComponent]
    data_models: list[DataModel]
    api_contracts: list[ApiContract]
    adrs: list[ArchitectureDecision]
    implementation_phases: list[ImplementationPhase]
    technical_risks: list[TechnicalRisk] = Field(default_factory=list)
    total_adrs: int = Field(default=0)

    def model_post_init(self, __context) -> None:
        self.total_adrs = len(self.adrs)

The schema enforces the minimum quality bar. min_length=50 on context and decision fields in ArchitectureDecision prevents the LLM from producing ADRs that say “We chose Postgres. It is good.” The regex on method ensures no typos in API contracts. The model_post_init count again removes a task the LLM handles unreliably.
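To see that bar in action, here is exactly the degenerate ADR the constraints reject (field values are illustrative):

from pydantic import ValidationError
from schemas.ta import ArchitectureDecision

try:
    ArchitectureDecision(
        adr_id="ADR-001",
        title="Primary database choice",
        status="Accepted",
        context="We chose Postgres.",  # under the 50-character minimum
        decision="It is good.",        # also under the minimum
        alternatives_considered=["MongoDB", "MySQL"],
        consequences=["ACID transactions out of the box"],
    )
except ValidationError as exc:
    print(len(exc.errors()), "validation errors")  # context and decision both fail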


8. The Parallel Fan-Out Pattern in LangGraph

Here is how we wire QC and TA to run simultaneously in the LangGraph workflow:

from langgraph.graph import StateGraph
from state import TeamState
from agents.ba import BAAgent
from agents.qc import QCAgent
from agents.ta import TAAgent
from nodes.merge import merge_qc_ta


def build_workflow():
    workflow = StateGraph(TeamState)

    # --- Node registration ---
    workflow.add_node("ba_agent", ba_node)
    workflow.add_node("qc_agent", qc_node)
    workflow.add_node("ta_agent", ta_node)
    workflow.add_node("parallel_merge", merge_qc_ta)

    # --- Entry point ---
    workflow.set_entry_point("ba_agent")

    # --- BA completes, fans out to both QC and TA ---
    workflow.add_edge("ba_agent", "qc_agent")
    workflow.add_edge("ba_agent", "ta_agent")

    # --- Both feed into the merge node ---
    workflow.add_edge("qc_agent", "parallel_merge")
    workflow.add_edge("ta_agent", "parallel_merge")

    # --- After merge, the SSE agent has everything it needs. ---
    # The "sse_agent" node itself is registered in Part 7; compile() rejects
    # edges to unknown nodes, so comment this edge out if you run the
    # pipeline only up to this phase.
    workflow.add_edge("parallel_merge", "sse_agent")

    # compile() returns a runnable compiled graph, not the mutable StateGraph
    return workflow.compile()

LangGraph recognizes that when a node has multiple outgoing edges, those target nodes can be dispatched concurrently. The runtime will not start parallel_merge until both qc_agent and ta_agent have completed and written their outputs to TeamState.

The node wrappers are thin:

async def ba_node(state: TeamState) -> TeamState:
    agent = BAAgent(llm=get_llm())
    return await agent.run(state)


async def qc_node(state: TeamState) -> TeamState:
    agent = QCAgent(llm=get_llm())
    return await agent.run(state)


async def ta_node(state: TeamState) -> TeamState:
    agent = TAAgent(llm=get_llm())
    return await agent.run(state)

The get_llm() factory creates a new LLM client per invocation. This is intentional: in a concurrent execution, sharing a single LLM client across agents can cause state leakage (some client libraries maintain internal conversation history). Isolation is worth the small overhead.
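The factory itself can be a few lines. A minimal sketch assuming langchain-openai; the model name and temperature are placeholders, not choices made anywhere in this series:

from langchain_openai import ChatOpenAI


def get_llm() -> ChatOpenAI:
    # A fresh client per call keeps the parallel QC and TA branches isolated.
    return ChatOpenAI(model="gpt-4o", temperature=0.2)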


9. The Merge Node

The merge node has one job: confirm that the outputs of QC and TA have landed in a single coherent TeamState.

The potential conflict is worth spelling out. Both QC and TA start from the same TeamState snapshot and run in the same step. If each agent returned the full state ({**state, ...}), both branches would write every key concurrently, and LangGraph’s default channels accept only one write per key per step, so the update would be rejected. That is why the run methods above return only the keys each agent owns. Because those keys are disjoint (test_suite and qc_complete versus technical_spec and ta_complete), the runtime merges the two partial updates cleanly before the merge node ever runs. The merge node’s job is to verify that this actually happened.

from state import TeamState


def merge_qc_ta(state: TeamState) -> TeamState:
    """
    Verify that the QC and TA branches both completed and produced
    their expected outputs before the pipeline moves on.

    By the time this node runs, LangGraph has already applied both
    branches' partial updates to the shared state. The agents write
    disjoint keys:
    - QC writes: test_suite, qc_complete
    - TA writes: technical_spec, ta_complete

    We assert this invariant explicitly rather than hoping it stays true.
    """
    if not state.get("qc_complete"):
        raise ValueError("merge_qc_ta reached before QC agent completed")
    if not state.get("ta_complete"):
        raise ValueError("merge_qc_ta reached before TA agent completed")

    if state.get("test_suite") is None:
        raise ValueError("qc_complete is set but test_suite is missing (unexpected state)")
    if state.get("technical_spec") is None:
        raise ValueError("ta_complete is set but technical_spec is missing (unexpected state)")

    # Nothing to combine: the runtime already merged the branch updates.
    return state

The explicit invariant checks are not paranoia. In a system where LLMs produce outputs and agents write to shared state, “this should not happen” scenarios happen regularly during development. Making the invariant explicit and loud means you find violations immediately rather than hours later when a downstream agent produces confusing output.


10. Real Example: Task Manager Application

Let us trace through a concrete run using the task manager stories introduced in Part 5. The BA produced 8 user stories for a task management application: create task, assign task, set due date, mark complete, filter by status, add comment, set priority, and archive task.

QC Output: 24 Test Cases

Sam reviewed all 8 stories and produced 24 test cases distributed as follows:

Create Task (US-001) — 4 test cases:

TC-001 | Happy Path | Create task with all required fields
  Given: authenticated user, valid session token
  When:  POST /api/v1/tasks with {"title": "Deploy to prod", "priority": "high"}
  Then:  Response 201 Created
         Response body contains task_id (UUID format)
         task.status equals "pending"
         task.created_at within last 5 seconds
  Priority: critical

TC-002 | Edge Case | Create task with maximum-length title (255 chars)
  Given: authenticated user
  When:  POST /api/v1/tasks with title of exactly 255 characters
  Then:  Response 201 Created
         task.title stored without truncation
  Priority: medium

TC-003 | Error Scenario | Create task with empty title
  Given: authenticated user
  When:  POST /api/v1/tasks with {"title": ""}
  Then:  Response 422 Unprocessable Entity
         Error body contains field "title" with message about minimum length
  Priority: high

TC-004 | Error Scenario | Create task without authentication
  Given: no session token in request headers
  When:  POST /api/v1/tasks with valid body
  Then:  Response 401 Unauthorized
         Response body does not leak system information
  Priority: critical

Mark Complete (US-005) — 4 test cases including a concurrency edge case:

TC-017 | Edge Case | Concurrent completion — two users mark same task simultaneously
  Given: task in "pending" status, two authenticated users (user_a, user_b)
  When:  user_a and user_b both send PATCH /api/v1/tasks/{id}
         with {"status": "complete"} within 50ms of each other
  Then:  Exactly one request performs the state transition
         The other receives 409 Conflict (or 200 OK if completion is idempotent)
         Task appears in completed state exactly once
         No duplicate completion events emitted
  Priority: high
  Tags: concurrency, edge-case

The quality_gates block Sam produced:

{
  "suite_id": "TS-001",
  "quality_gates": {
    "min_unit_coverage": 85,
    "max_p95_response_ms": 150,
    "max_error_rate_percent": 0.05,
    "min_test_pass_rate": 100.0
  }
}

Note that Sam set the error rate ceiling at 0.05% — tighter than the schema default of 0.1%. This reflects the requirement Sam read in the user stories: the PM explicitly asked for “reliable” task tracking.

TA Output: FastAPI + PostgreSQL + Redis, 4 ADRs, 3 Data Models

Morgan reviewed the same 8 stories and produced a TechnicalSpec with the following key decisions:

Technology Stack:

Component           Choice           Key Justification
Backend Framework   FastAPI 0.111    Async-native, auto-generated OpenAPI docs, Python type hints throughout
Primary Database    PostgreSQL 16    ACID transactions for task state machines, mature ecosystem, excellent JSON support for metadata
Cache Layer         Redis 7          Session storage, rate limiting, real-time task status broadcasting via pub/sub
Task Queue          None (Phase 1)   Current scale does not justify async worker complexity — revisit at 10k users

The “Task Queue: None” decision is an example of Morgan’s philosophy in practice. The requirement does not call for background processing. Adding Celery or RQ at this stage would be premature complexity. The ADR for this decision reads:

ADR-004: Defer Async Task Queue to Phase 2
Status: Accepted
Context: Several team members suggested adding an async task queue (Celery/RQ)
  upfront for operations like email notifications and status webhooks. The current
  requirements specify no background operations. Adding queue infrastructure adds
  a Redis dependency (partially mitigated by the cache decision), a worker process
  to manage, a dead-letter queue strategy, and monitoring for queue depth.
Decision: No task queue in Phase 1. Use synchronous processing. Add queue in Phase 2
  if response time SLAs cannot be met synchronously.
Alternatives Considered:
  - Celery with Redis broker — well-understood but heavy for current scale
  - RQ (Redis Queue) — lighter than Celery, still adds operational complexity
  - AWS SQS — vendor lock-in not justified for single-region deployment
Consequences:
  + Simpler deployment and monitoring in Phase 1
  + Fewer failure modes (no queue backpressure, no worker health checks)
  - Email notifications are sent synchronously (adds ~200ms to the request path)
  - Refactor required if async processing becomes necessary

Three Data Models:

Task:
  - task_id (UUID, primary key)
  - title (varchar 255, not null)
  - description (text, nullable)
  - status (enum: pending | in_progress | complete | archived)
  - priority (enum: low | medium | high | critical)
  - created_by (UUID, foreign key → User.user_id)
  - assigned_to (UUID, nullable, foreign key → User.user_id)
  - due_date (timestamp with timezone, nullable)
  - created_at (timestamp with timezone, default now())
  - updated_at (timestamp with timezone, auto-updated)
  Indexes: (status, created_by), (assigned_to, status), (due_date) where not null

User:
  - user_id (UUID, primary key)
  - email (varchar 320, unique, not null)
  - display_name (varchar 100, not null)
  - created_at (timestamp with timezone)

Comment:
  - comment_id (UUID, primary key)
  - task_id (UUID, foreign key → Task.task_id, cascade delete)
  - author_id (UUID, foreign key → User.user_id)
  - body (text, not null)
  - created_at (timestamp with timezone)
  Indexes: (task_id, created_at)

The three data models and four ADRs give the SSE agent an unambiguous technical envelope. There is no need for the SSE to decide between MySQL and PostgreSQL, no need to choose between sync and async, no need to design the data schema from scratch. These decisions are made, documented, and frozen.

[Figure: QC TestCase card and TA Architecture Decision Record side by side, showing how they connect.]

11. Quality Gates Concept

The QualityGates object that Sam produces is not just documentation. It is an operational contract that travels forward through the entire pipeline and eventually governs deployment.

Here is how quality gates work across the lifecycle:

At QC time (now): Sam defines the gates based on the requirements. These are the minimum bars that the shipping system must clear. They are set before any code exists, which means they reflect the requirements, not the implementation. An 85% unit coverage gate set before code is written is a genuine requirement. An 85% coverage gate set after code is written is a threshold that the existing code happens to pass.

At SSE time (Part 7): When the SSE agent writes code, it also writes the test implementations that correspond to Sam’s test case definitions. The SSE reads TestSuite and implements each TestCase as an actual pytest test. The QualityGates thresholds are embedded as comments in the test configuration.
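As a hedged preview of Part 7, the SSE might realize TC-001 along these lines (the client and auth_headers fixtures and the pytest-asyncio plugin are assumptions, not pipeline output):

import pytest


@pytest.mark.asyncio
async def test_tc_001_create_task_happy_path(client, auth_headers):
    # Given: authenticated user with a valid session token (auth_headers)
    # When: POST /api/v1/tasks with all required fields
    resp = await client.post(
        "/api/v1/tasks",
        json={"title": "Deploy to prod", "priority": "high"},
        headers=auth_headers,
    )
    # Then: 201 Created, a task_id is returned, status starts as "pending"
    assert resp.status_code == 201
    body = resp.json()
    assert "task_id" in body
    assert body["status"] == "pending"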

At TL time (Part 8): When the Tech Lead agent reviews the SSE’s output, one of the review criteria is whether the test implementation actually covers the test cases Sam defined. An SSE that implemented 18 of 24 test cases fails TL review regardless of whether the code looks clean.

At CI/CD time (Part 10): The DevOps agent reads the QualityGates object and configures the pipeline accordingly:

# Generated by DevOps agent from QualityGates object
coverage:
  minimum: 85  # From QualityGates.min_unit_coverage
  fail_under: true

performance:
  p95_threshold_ms: 150  # From QualityGates.max_p95_response_ms
  test_environment: staging

reliability:
  max_error_rate: 0.05  # From QualityGates.max_error_rate_percent
  measurement_window: 5m

This chain from Sam’s initial definition to the live CI/CD configuration is what makes quality gates meaningful. They are not suggestions. They are gates — the deployment cannot proceed if they are not met. And critically, they were defined by the QC agent before any code existed, which means they were defined by the requirements, not by what the code happens to do.
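A minimal sketch of that translation on the DevOps side (the render_ci_gates helper and the exact YAML layout are assumptions here; Part 10 builds the real version):

import yaml  # PyYAML

from schemas.qc import QualityGates


def render_ci_gates(gates: QualityGates) -> str:
    # Map the typed gate thresholds onto the pipeline config shown above.
    config = {
        "coverage": {"minimum": gates.min_unit_coverage, "fail_under": True},
        "performance": {"p95_threshold_ms": gates.max_p95_response_ms},
        "reliability": {"max_error_rate": gates.max_error_rate_percent},
    }
    return yaml.safe_dump(config, sort_keys=False)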

What Quality Gates Guard Against

There are three failure modes quality gates specifically prevent:

Coverage Drift: Without a coverage gate, unit test coverage tends to decline over time as features are added faster than tests. The gate makes coverage a hard requirement for every merge, not a metric you look at quarterly.

Performance Regression: Without a performance gate, response times tend to creep upward. An endpoint that took 80ms on day one takes 300ms six months later because six features have each added 30-50ms. The p95 gate catches regressions immediately rather than letting them accumulate.

Silent Error Rate Increase: Error rates in production often rise gradually through small changes that each seem harmless. A 0.05% error rate ceiling means the team is alerted to increases that might not even be visible in a dashboard with default scale settings.

All three failure modes share a root cause: they are invisible unless you are explicitly measuring them against defined thresholds. Quality gates make them visible. Defining them before code exists ensures the thresholds are honest.


Bringing It Together: The State After This Phase

After QC and TA complete and the merge node fires, TeamState contains:

state = TeamState(
    # From Part 4 (PO)
    user_brief=UserBrief(...),

    # From Part 5 (BA)
    clarified_requirements="...",
    user_stories=[UserStory(...), ...],               # 8 stories

    # From QC (this article)
    test_suite=TestSuite(
        suite_id="TS-001",
        test_cases=[TestCase(...), ...],              # 24 test cases
        quality_gates=QualityGates(
            min_unit_coverage=85,
            max_p95_response_ms=150,
            max_error_rate_percent=0.05,
        ),
    ),
    qc_complete=True,

    # From TA (this article)
    technical_spec=TechnicalSpec(
        spec_id="SPEC-001",
        tech_stack=[TechComponent(...), ...],         # 4 components
        data_models=[DataModel(...), ...],            # 3 entities
        api_contracts=[ApiContract(...), ...],        # 12 endpoints
        adrs=[ArchitectureDecision(...), ...],        # 4 ADRs
        implementation_phases=[ImplementationPhase(...), ...],  # 3 phases
    ),
    ta_complete=True,
)

The SSE agent, which comes next, inherits this state. It has:

  • 8 user stories telling it what to build
  • 24 test cases telling it what success and failure look like
  • A technology stack telling it what tools to use
  • 3 data models telling it what entities to create
  • 12 API contracts telling it what endpoints to implement
  • 4 ADRs explaining the reasoning behind key decisions

The SSE is not making architecture decisions. The SSE is not defining quality criteria. The SSE is implementing a well-specified system. This is, by design, the narrowest job in the pipeline. The more we specify upfront, the less ambiguity the SSE faces — and the less ambiguity the SSE faces, the better the output quality.


What We Built in Part 6

Two agents, one structural optimization:

  • QCAgent (Sam) — reads user stories, produces 24 test cases with Given/When/Then structure, TestPriority classification, and QualityGates thresholds
  • TAAgent (Morgan) — reads requirements and user stories, produces TechnicalSpec with tech stack choices, 3 data models, 12 API contracts, 4 ADRs with full reasoning, and 3 implementation phases
  • Parallel fan-out — both agents run concurrently after BA completes, halving the wall-clock time for this phase
  • Merge node — safely combines both outputs into a single TeamState with explicit invariant checks
  • Quality gates — formal thresholds defined before code exists that will govern CI/CD pipeline behavior in Part 10

The key principle connecting both agents is that definition must precede implementation. Quality is defined before code is written. Architecture is decided before the SSE picks a library. This sequencing is not bureaucracy — it is the engineering discipline that prevents the most expensive category of rework.


What’s Next

In Part 7, we build the SSE agent — the Senior Software Engineer. The SSE is the implementation engine of the pipeline. It reads the full TeamState (user stories + test cases + technical spec) and produces working code: Python files, test files, a requirements manifest, and a brief implementation log. The SSE is where the abstract becomes concrete.

The SSE faces an interesting challenge: how do you prompt an LLM to produce production-quality code across multiple files without letting it drift off-spec? The answer involves strict output schemas, file-level granularity, and a pattern I call “spec-first prompting” — giving the model its constraints before its freedom.

See you in Part 7.

