In Part 7, the SSE agent produced code. Files, tests, a working implementation. The temptation is to ship it. The code compiles. The tests pass locally. What could go wrong?
I once watched a developer push a login endpoint that stored passwords in plaintext because “bcrypt was slow and the tests all passed.” The tests checked whether the password was stored, not whether it was hashed. No one reviewed the code. The breach happened eleven days later.
Code review exists because the person who wrote the code is the worst person to evaluate it. This is true for human developers. It is doubly true for LLMs, which produce statistically plausible code, not necessarily correct code.
This article builds the three agents that close the loop: the Tech Lead who reviews code, the DevOps agent who generates deployment infrastructure, and the PM agent who coordinates everything. Together with the five agents from Parts 5-7, these complete the eight-agent roster.
1. The Tech Lead Agent
Jordan is our Tech Lead. Fifteen years of production experience, which mostly means fifteen years of watching things break in ways nobody anticipated. Jordan’s job is not to write code. Jordan’s job is to prevent bad code from reaching production.
What Jordan Reviews
The TL agent performs a structured review across four dimensions:
Security — SQL injection, hardcoded credentials, missing input validation, insecure deserialization, exposed stack traces, missing auth checks, deprecated crypto.
Performance — N+1 queries, unbounded list operations, missing pagination, synchronous calls that should be async, missing indexes, unclosed resources.
Code Quality — naming conventions, function length (over 50 lines gets flagged), dead code, duplication, missing error handling, architecture drift from the TA’s decisions.
Test Coverage — does the SSE’s test code actually cover the test cases Sam defined in Part 6? Implementing 18 of 24 test cases fails review regardless of code quality.
The Decision: APPROVE or REJECT
Jordan produces exactly one of two verdicts. APPROVE means proceed to DevOps. REJECT means return to SSE with specific feedback. There is no “approve with comments.” In an automated pipeline, ambiguity is a bug.
2. TL Review Schema
from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field
class ReviewVerdict(str, Enum):
APPROVE = "APPROVE"
REJECT = "REJECT"
class Severity(str, Enum):
CRITICAL = "critical"
HIGH = "high"
MEDIUM = "medium"
LOW = "low"
class ReviewCategory(str, Enum):
SECURITY = "security"
PERFORMANCE = "performance"
CODE_QUALITY = "code_quality"
TEST_COVERAGE = "test_coverage"
class ReviewFinding(BaseModel):
finding_id: str = Field(
description="Unique identifier, format FND-001",
pattern=r"^FND-\d{3}$",
)
category: ReviewCategory
severity: Severity
file_path: str = Field(
description="File where the issue was found",
)
line_range: Optional[str] = Field(
default=None,
description="Line range, e.g. '42-48'",
)
description: str = Field(
description="What the issue is",
min_length=20,
)
suggestion: str = Field(
description="How to fix the issue — actionable, specific",
min_length=20,
)
class CodeReview(BaseModel):
review_id: str = Field(
description="Unique identifier, format CR-001",
)
verdict: ReviewVerdict
findings: list[ReviewFinding] = Field(default_factory=list)
summary: str = Field(
description="2-3 sentence overall assessment",
min_length=30,
)
test_coverage_assessment: str = Field(
description="Assessment of whether SSE tests cover QC test cases",
)
security_passed: bool = Field(
description="True if no critical/high security findings",
)
performance_passed: bool = Field(
description="True if no critical/high performance findings",
)
approve_reason: Optional[str] = Field(
default=None,
description="If APPROVE: why the code is ready",
)
reject_reason: Optional[str] = Field(
default=None,
description="If REJECT: the primary blocker",
)
Every finding requires a concrete suggestion. “This is bad” is not a review comment. “Replace the string concatenation on line 47 with a parameterized query” is. The security_passed and performance_passed booleans exist so downstream logic can branch on a flag instead of parsing prose.
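To make that concrete, here is the kind of finding Jordan should emit. This is an illustrative instance only: the file path, line range, and wording are invented, and the import assumes the schema above lives in schemas/tl.py alongside CodeReview.

```python
from schemas.tl import ReviewCategory, ReviewFinding, Severity

# Illustrative finding only: the path and line range are hypothetical.
example_finding = ReviewFinding(
    finding_id="FND-001",
    category=ReviewCategory.SECURITY,
    severity=Severity.CRITICAL,
    file_path="app/routers/auth.py",
    line_range="42-48",
    description=(
        "The login query is built by concatenating the raw username into the "
        "SQL string, which allows SQL injection."
    ),
    suggestion=(
        "Replace the string concatenation with a parameterized query using bound "
        "parameters, and add a regression test that submits a malicious username."
    ),
)
```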
3. TLAgent Implementation
from agents.base import BaseAgent
from state import TeamState
from schemas.tl import CodeReview, ReviewVerdict
class TLAgent(BaseAgent):
TL_SYSTEM_PROMPT = """You are Jordan, a Tech Lead with 15 years of experience
reviewing production code. You have seen every failure mode. You are thorough
but fair — you do not reject code for style preferences, only for real issues.
Your review covers four dimensions:
1. SECURITY — injection, auth bypass, credential exposure, input validation
2. PERFORMANCE — N+1 queries, unbounded operations, missing indexes, memory leaks
3. CODE QUALITY — naming, dead code, duplication, error handling, architecture drift
4. TEST COVERAGE — do the implemented tests cover the QC agent's test cases?
Verdict rules:
- Any CRITICAL security finding → REJECT
- Any CRITICAL performance finding → REJECT
- More than 3 HIGH findings across any category → REJECT
- Test coverage below 90% of QC test cases → REJECT
- Otherwise → APPROVE
When you REJECT, your findings must include specific file paths, line ranges,
and actionable suggestions. The SSE agent will read your findings and must be
able to fix every issue without asking clarifying questions.
When you APPROVE, briefly note what the code does well.
Always output valid JSON matching the CodeReview schema.
"""
def _prepare_prompt(self, state: TeamState) -> str:
code_output = state.get("code_output", {})
test_suite = state.get("test_suite")
tech_spec = state.get("technical_spec")
files_text = self._format_code_files(code_output)
test_cases_text = self._format_test_cases(test_suite)
adrs_text = self._format_adrs(tech_spec)
return f"""Review the following code produced by the SSE agent.
CODE FILES:
{files_text}
QC TEST CASES (defined before code was written):
{test_cases_text}
ARCHITECTURE DECISIONS (from TA):
{adrs_text}
Review each file against the four dimensions: security, performance,
code quality, and test coverage. For test coverage, check whether the
SSE's test files actually implement the test cases defined by QC.
Produce a CodeReview JSON with your verdict and findings.
"""
def _format_code_files(self, code_output: dict) -> str:
lines = []
files = code_output.get("files", [])
for f in files:
lines.append(f"--- {f['path']} ---")
lines.append(f["content"])
lines.append("")
return "\n".join(lines) if lines else "No files provided."
def _format_test_cases(self, test_suite) -> str:
if not test_suite:
return "No test suite provided."
lines = []
for tc in test_suite.test_cases:
lines.append(f"{tc.test_id}: {tc.title} [{tc.priority.value}]")
return "\n".join(lines)
def _format_adrs(self, tech_spec) -> str:
if not tech_spec:
return "No technical spec provided."
lines = []
for adr in tech_spec.adrs:
lines.append(f"{adr.adr_id}: {adr.title} — {adr.decision[:80]}...")
return "\n".join(lines)
async def run(self, state: TeamState) -> TeamState:
prompt = self._prepare_prompt(state)
response = await self.llm.ainvoke(
[
{"role": "system", "content": self.TL_SYSTEM_PROMPT},
{"role": "user", "content": prompt},
],
response_format={"type": "json_object"},
)
review = CodeReview.model_validate_json(response.content)
retry_count = state.get("tl_retry_count", 0)
if review.verdict == ReviewVerdict.REJECT:
retry_count += 1
return {
**state,
"code_review": review,
"tl_verdict": review.verdict.value,
"tl_retry_count": retry_count,
"tl_complete": True,
}
Every rejection increments tl_retry_count. Without this counter, a stubborn bug causes an infinite SSE-TL loop.
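The verdict rules in Jordan's prompt are also mechanical enough to double-check in code. A hedged sketch of such a guard, simplified to "any critical finding or more than three highs means REJECT" and not part of the agent above, could run right after schema validation:

```python
from schemas.tl import CodeReview, ReviewVerdict, Severity

def verdict_consistent(review: CodeReview) -> bool:
    """Illustrative guard: flag a review whose APPROVE verdict contradicts
    its own findings (a simplified version of the prompt's reject rules)."""
    criticals = [f for f in review.findings if f.severity == Severity.CRITICAL]
    highs = [f for f in review.findings if f.severity == Severity.HIGH]
    must_reject = bool(criticals) or len(highs) > 3
    return not (must_reject and review.verdict == ReviewVerdict.APPROVE)
```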
4. The Conditional Edge: route_after_tl_review
This is the most important routing function in the pipeline. It determines what happens after the TL renders a verdict.
from state import TeamState
MAX_TL_RETRIES = 2
def route_after_tl_review(state: TeamState) -> str:
"""
Conditional edge after TL review.
Returns the name of the next node:
- "devops_agent" → code approved, generate CI/CD
- "sse_agent" → code rejected, SSE gets another attempt
- "human_escalation" → max retries exceeded, needs human
"""
verdict = state.get("tl_verdict", "REJECT")
retry_count = state.get("tl_retry_count", 0)
if verdict == "APPROVE":
return "devops_agent"
# Rejected — can the SSE try again?
if retry_count <= MAX_TL_RETRIES:
return "sse_agent"
# Out of retries — escalate
return "human_escalation"
Three exits, no ambiguity. The SSE gets two revision attempts. After that, the system stops and asks for help — a system that loops forever is worse than one that admits it is stuck.
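Because the router is a pure function of state, its three exits can be sanity-checked without running the graph. A minimal sketch, assuming the function lives in routing.py as it is imported later:

```python
from routing import route_after_tl_review

# Plain dicts are enough here because the router only calls state.get().
assert route_after_tl_review({"tl_verdict": "APPROVE", "tl_retry_count": 0}) == "devops_agent"
assert route_after_tl_review({"tl_verdict": "REJECT", "tl_retry_count": 1}) == "sse_agent"
assert route_after_tl_review({"tl_verdict": "REJECT", "tl_retry_count": 3}) == "human_escalation"
```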
Wiring It Into LangGraph
from langgraph.graph import StateGraph
from state import TeamState
workflow = StateGraph(TeamState)
# ... (previous nodes from Parts 5-7) ...
workflow.add_node("tl_agent", tl_node)
workflow.add_node("devops_agent", devops_node)
workflow.add_node("pm_agent", pm_node)
workflow.add_node("human_escalation", escalation_node)
# SSE feeds into TL
workflow.add_edge("sse_agent", "tl_agent")
# TL has a conditional edge — three possible destinations
workflow.add_conditional_edges(
"tl_agent",
route_after_tl_review,
{
"devops_agent": "devops_agent",
"sse_agent": "sse_agent",
"human_escalation": "human_escalation",
},
)
# DevOps feeds into PM
workflow.add_edge("devops_agent", "pm_agent")
The routing function examines state and returns a string; the dictionary maps each possible return value to a node name. When the graph is compiled, LangGraph checks that every destination in that mapping refers to a registered node.
5. The DevOps Agent
Casey is our DevOps engineer. Casey does not deploy code. Casey generates deployment infrastructure — the files that CI/CD systems consume. The distinction matters: Casey produces artifacts, not side effects.
Given an approved codebase and technical spec, Casey produces three artifacts: a GitHub Actions workflow (.github/workflows/ci.yml) with lint, test, build, and deploy stages; a Dockerfile with multi-stage build; and a docker-compose.yml for local development. Casey does not invent infrastructure decisions. If Morgan chose PostgreSQL 16, Casey uses postgres:16. The TA decides; Casey implements.
6. DevOps Schema
from pydantic import BaseModel, Field
class GeneratedFile(BaseModel):
path: str = Field(
description="File path relative to project root",
)
content: str = Field(
description="Full file content",
)
description: str = Field(
description="What this file does and why",
)
class DeploymentConfig(BaseModel):
config_id: str = Field(
description="Unique identifier, format DEPLOY-001",
)
generated_files: list[GeneratedFile] = Field(
min_length=1,
description="CI/CD, Dockerfile, docker-compose files",
)
environment_variables: list[str] = Field(
default_factory=list,
description="Required env vars (names only, not values)",
)
deployment_notes: str = Field(
description="Instructions for running the generated configuration",
)
quality_gate_integration: str = Field(
description="How QC quality gates are enforced in the pipeline",
)
environment_variables captures names only, never values — the DevOps agent knows the app needs DATABASE_URL but never generates actual credentials.
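For a sense of the shape Casey returns, here is an illustrative instance with trimmed file content and invented values, assuming GeneratedFile sits in schemas/devops.py next to DeploymentConfig:

```python
from schemas.devops import DeploymentConfig, GeneratedFile

# Illustrative only: content is trimmed and the values are invented.
example_config = DeploymentConfig(
    config_id="DEPLOY-001",
    generated_files=[
        GeneratedFile(
            path=".github/workflows/ci.yml",
            content="name: CI\non: [push, pull_request]\n# ...lint, test, build, deploy jobs...",
            description="GitHub Actions pipeline that enforces the QC coverage gate",
        ),
    ],
    environment_variables=["DATABASE_URL", "REDIS_URL", "SECRET_KEY"],
    deployment_notes="Run docker compose up --build for local development.",
    quality_gate_integration="The test job fails if coverage drops below the QC threshold.",
)
```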
7. DevOpsAgent Implementation
from agents.base import BaseAgent
from state import TeamState
from schemas.devops import DeploymentConfig
class DevOpsAgent(BaseAgent):
DEVOPS_SYSTEM_PROMPT = """You are Casey, a DevOps Engineer with 10 years of
experience building CI/CD pipelines and containerized deployments.
Your philosophy:
- Infrastructure as code, always
- Build once, deploy anywhere
- Quality gates are non-negotiable — if the QC agent set a threshold, your
pipeline enforces it
- Secrets never appear in generated files — use environment variable references
You generate three files:
1. .github/workflows/ci.yml
- Triggered on push to main and pull requests
- Jobs: lint, test (with coverage), build Docker image, deploy (staging only)
- Coverage threshold from QualityGates.min_unit_coverage
- Fail the pipeline if coverage drops below threshold
2. Dockerfile
- Multi-stage build: builder stage + runtime stage
- Pin base image versions
- Non-root user in runtime stage
- HEALTHCHECK instruction
3. docker-compose.yml
- Application service built from Dockerfile
- Database service matching TA spec (Postgres version, etc.)
- Cache service if TA specified one (Redis version, etc.)
- Named volumes for data persistence
- Health checks on all services
Always output valid JSON matching the DeploymentConfig schema.
"""
def _prepare_prompt(self, state: TeamState) -> str:
tech_spec = state.get("technical_spec")
quality_gates = state.get("test_suite", {})
code_review = state.get("code_review")
stack_text = self._format_stack(tech_spec)
gates_text = self._format_gates(quality_gates)
return f"""Generate deployment configuration for the approved codebase.
TECHNOLOGY STACK (from TA):
{stack_text}
QUALITY GATES (from QC):
{gates_text}
TL REVIEW STATUS: {code_review.verdict.value if code_review else 'N/A'}
Generate:
1. .github/workflows/ci.yml — full GitHub Actions workflow
2. Dockerfile — multi-stage build
3. docker-compose.yml — local development environment
Requirements:
- Pin ALL image versions (no :latest tags)
- Coverage threshold must match QualityGates.min_unit_coverage
- Database and cache versions must match TA specification
- All services must have health checks
- Use environment variables for secrets (DATABASE_URL, etc.)
- Include comments explaining non-obvious configuration choices
Return a DeploymentConfig JSON.
"""
def _format_stack(self, tech_spec) -> str:
if not tech_spec:
return "No tech spec available."
lines = []
for comp in tech_spec.tech_stack:
version = comp.version or "latest"
lines.append(f"- {comp.name}: {comp.technology} {version}")
return "\n".join(lines)
def _format_gates(self, test_suite) -> str:
if not test_suite or not hasattr(test_suite, "quality_gates"):
return "No quality gates defined."
gates = test_suite.quality_gates
return (
f"- Min unit coverage: {gates.min_unit_coverage}%\n"
f"- Max p95 response: {gates.max_p95_response_ms}ms\n"
f"- Max error rate: {gates.max_error_rate_percent}%\n"
f"- Min test pass rate: {gates.min_test_pass_rate}%"
)
async def run(self, state: TeamState) -> TeamState:
prompt = self._prepare_prompt(state)
response = await self.llm.ainvoke(
[
{"role": "system", "content": self.DEVOPS_SYSTEM_PROMPT},
{"role": "user", "content": prompt},
],
response_format={"type": "json_object"},
)
config = DeploymentConfig.model_validate_json(response.content)
return {
**state,
"deployment_config": config,
"devops_complete": True,
}
The key line in any generated workflow is pytest --cov-fail-under=85 — that 85 comes directly from QualityGates.min_unit_coverage. The Postgres version in the CI services block comes from the TA’s tech stack. Casey invents nothing. Casey translates.
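Since that threshold is the one contract that must survive generation, it is cheap to verify after the fact. A sketch of such a guard (the helper is ours, not part of the pipeline above):

```python
from schemas.devops import DeploymentConfig

def ci_enforces_coverage_gate(config: DeploymentConfig, min_coverage: int) -> bool:
    """Illustrative check: does the generated workflow pin the QC threshold,
    e.g. '--cov-fail-under=85' when min_coverage is 85?"""
    needle = f"--cov-fail-under={min_coverage}"
    return any(
        f.path.endswith("ci.yml") and needle in f.content
        for f in config.generated_files
    )
```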
8. The PM Agent
Riley is our Project Manager. Riley does not write code, review code, or generate infrastructure. Riley’s job is to know what is happening, what is blocked, and what needs human attention.
The PM agent runs after DevOps completes and performs three functions: status reporting (what each agent produced, anomalies, pipeline health), blocker tracking (rejections, retries, unresolved findings), and escalation decisions (does a human need to intervene?).
9. PM Schema and Implementation
from enum import Enum
from pydantic import BaseModel, Field
class PipelineStatus(str, Enum):
SUCCESS = "success"
PARTIAL = "partial"
BLOCKED = "blocked"
FAILED = "failed"
class AgentStatus(BaseModel):
agent_name: str
completed: bool
output_summary: str = Field(
description="1-2 sentence summary of what this agent produced",
)
issues: list[str] = Field(default_factory=list)
class Blocker(BaseModel):
blocker_id: str
description: str
blocking_agent: str
suggested_resolution: str
requires_human: bool = False
class StatusReport(BaseModel):
report_id: str = Field(description="Format RPT-001")
pipeline_status: PipelineStatus
agent_statuses: list[AgentStatus]
blockers: list[Blocker] = Field(default_factory=list)
summary: str = Field(
description="Executive summary of pipeline execution",
min_length=50,
)
human_action_required: bool
human_action_description: str = Field(
default="",
description="What the human needs to do, if anything",
)
total_retries: int = Field(default=0)
total_findings: int = Field(default=0)
from agents.base import BaseAgent
from state import TeamState
from schemas.pm import StatusReport, PipelineStatus
class PMAgent(BaseAgent):
PM_SYSTEM_PROMPT = """You are Riley, a Project Manager with 10 years of
experience coordinating software delivery teams.
Your job is to observe, summarize, and escalate — never to make technical
decisions. You produce a StatusReport that tells stakeholders:
1. What happened — which agents ran, what they produced
2. What went wrong — any rejections, retries, or missing outputs
3. What needs attention — blockers requiring human intervention
4. Overall status — success, partial, blocked, or failed
Status rules:
- SUCCESS: all agents completed, TL approved, DevOps generated configs
- PARTIAL: pipeline completed but with warnings or non-critical findings
- BLOCKED: pipeline stopped due to unresolvable issue, needs human
- FAILED: critical error — agent crashed, schema validation failed, etc.
Be concise. Stakeholders read your reports at 7am with coffee. Respect
their time.
Always output valid JSON matching the StatusReport schema.
"""
def _prepare_prompt(self, state: TeamState) -> str:
return f"""Produce a status report for the current pipeline execution.
PIPELINE STATE:
- PO complete: {state.get('po_complete', False)}
- BA complete: {state.get('ba_complete', False)}
- QC complete: {state.get('qc_complete', False)}
- TA complete: {state.get('ta_complete', False)}
- SSE complete: {state.get('sse_complete', False)}
- TL verdict: {state.get('tl_verdict', 'pending')}
- TL retries: {state.get('tl_retry_count', 0)}
- DevOps complete: {state.get('devops_complete', False)}
TL REVIEW SUMMARY:
{self._format_review(state)}
USER STORIES COUNT: {len(state.get('user_stories', []))}
TEST CASES COUNT: {len(state.get('test_suite', {}).test_cases) if state.get('test_suite') else 0}
CODE FILES COUNT: {len(state.get('code_output', {}).get('files', []))}
DEPLOYMENT FILES COUNT: {len(state.get('deployment_config', {}).generated_files) if state.get('deployment_config') else 0}
Produce a StatusReport JSON.
"""
def _format_review(self, state: TeamState) -> str:
review = state.get("code_review")
if not review:
return "No code review performed."
lines = [
f"Verdict: {review.verdict.value}",
f"Findings: {len(review.findings)}",
f"Security passed: {review.security_passed}",
f"Performance passed: {review.performance_passed}",
]
for f in review.findings[:5]:
lines.append(
f" {f.finding_id} [{f.severity.value}] {f.category.value}: "
f"{f.description[:60]}..."
)
return "\n".join(lines)
async def run(self, state: TeamState) -> TeamState:
prompt = self._prepare_prompt(state)
response = await self.llm.ainvoke(
[
{"role": "system", "content": self.PM_SYSTEM_PROMPT},
{"role": "user", "content": prompt},
],
response_format={"type": "json_object"},
)
report = StatusReport.model_validate_json(response.content)
return {
**state,
"status_report": report,
"pm_complete": True,
"pipeline_status": report.pipeline_status.value,
}
10. Human Escalation Node
The escalation node is not an agent. It is a simple function that halts the pipeline when the SSE cannot resolve the TL’s findings.
from state import TeamState
def escalation_node(state: TeamState) -> TeamState:
review = state.get("code_review")
retry_count = state.get("tl_retry_count", 0)
findings_summary = []
if review:
for f in review.findings:
findings_summary.append(
f"[{f.severity.value}] {f.category.value} in "
f"{f.file_path}: {f.description}"
)
escalation_message = (
f"PIPELINE HALTED — Human review required.\n\n"
f"SSE failed TL review {retry_count} time(s).\n"
f"Unresolved findings:\n\n"
+ "\n".join(findings_summary)
)
return {
**state,
"pipeline_status": "blocked",
"escalation_message": escalation_message,
"human_intervention_required": True,
}
The escalation node does not make decisions. It stops, reports, and waits. The most dangerous thing an automated system can do is continue operating when it should have stopped.
11. The Complete Agent Roster
With the TL, DevOps, and PM agents built, we now have all eight agents in the pipeline. Here is the complete roster:
| # | Agent | Name | Model | Role | Input | Output |
|---|---|---|---|---|---|---|
| 1 | PO | Alex | GPT-4o | Clarify requirements, ask questions, define scope | Raw client brief | RequirementDoc |
| 2 | BA | Jamie | GPT-4o | Break requirements into user stories | RequirementDoc | UserStory[] |
| 3 | QC | Sam | GPT-4o | Define test cases and quality gates before code | UserStory[] | TestSuite |
| 4 | TA | Morgan | GPT-4o | Architecture decisions, tech stack, data models | RequirementDoc, UserStory[] | TechnicalSpec |
| 5 | SSE | Chris | Claude 3.5 Sonnet | Write implementation code and tests | Full TeamState | CodeOutput |
| 6 | TL | Jordan | GPT-4o | Review code for security, perf, quality | CodeOutput, TestSuite, TechnicalSpec | CodeReview |
| 7 | DevOps | Casey | GPT-4o | Generate CI/CD, Dockerfile, docker-compose | TechnicalSpec, QualityGates | DeploymentConfig |
| 8 | PM | Riley | GPT-4o | Monitor pipeline, track blockers, status report | Full TeamState | StatusReport |
The SSE uses Claude 3.5 Sonnet for code generation; the other agents use GPT-4o for analytical and structured reasoning tasks. Swap models freely based on your own benchmarks.
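One way to keep that choice swappable is a single per-agent model map. A minimal sketch using LangChain's chat model classes; the mapping, names, and defaults are illustrative rather than the series' actual configuration:

```python
from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI

# Illustrative per-agent overrides; everything else falls back to GPT-4o.
AGENT_MODEL_OVERRIDES = {
    "sse_agent": lambda: ChatAnthropic(model="claude-3-5-sonnet-latest", temperature=0),
}

def llm_for(agent_name: str):
    factory = AGENT_MODEL_OVERRIDES.get(
        agent_name, lambda: ChatOpenAI(model="gpt-4o", temperature=0)
    )
    return factory()
```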
12. Updated TeamState
The TeamState now carries output from all eight agents. Each agent added its own keys in previous parts; the complete type now includes code_review (TL), deployment_config (DevOps), status_report (PM), plus pipeline-level fields like tl_retry_count, pipeline_status, escalation_message, and human_intervention_required. Every field is Optional or has a default, so the state is valid at every point in the pipeline — from the initial user_brief to the final StatusReport.
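A sketch of just the Part 8 additions, assuming the total=False TypedDict pattern from Part 4 (earlier fields are elided; see Part 4 for the full definition):

```python
from typing import Optional, TypedDict

from schemas.devops import DeploymentConfig
from schemas.pm import StatusReport
from schemas.tl import CodeReview

class TeamState(TypedDict, total=False):
    # ...fields from Parts 4-7: user_brief, user_stories, test_suite, code_output, ...
    code_review: Optional[CodeReview]               # TL
    tl_verdict: str                                 # "APPROVE" | "REJECT"
    tl_retry_count: int
    tl_complete: bool
    deployment_config: Optional[DeploymentConfig]   # DevOps
    devops_complete: bool
    status_report: Optional[StatusReport]           # PM
    pm_complete: bool
    pipeline_status: str                            # success | partial | blocked | failed
    escalation_message: str
    human_intervention_required: bool
```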
13. The Complete Graph
from langgraph.graph import StateGraph, END
from state import TeamState
from nodes.po import po_node
from nodes.ba import ba_node
from nodes.qc import qc_node
from nodes.ta import ta_node
from nodes.merge import merge_qc_ta
from nodes.sse import sse_node
from nodes.tl import tl_node
from nodes.devops import devops_node
from nodes.pm import pm_node
from nodes.escalation import escalation_node
from routing import route_after_tl_review
def build_full_pipeline():
workflow = StateGraph(TeamState)
# --- Register all nodes ---
workflow.add_node("po_agent", po_node)
workflow.add_node("ba_agent", ba_node)
workflow.add_node("qc_agent", qc_node)
workflow.add_node("ta_agent", ta_node)
workflow.add_node("parallel_merge", merge_qc_ta)
workflow.add_node("sse_agent", sse_node)
workflow.add_node("tl_agent", tl_node)
workflow.add_node("devops_agent", devops_node)
workflow.add_node("pm_agent", pm_node)
workflow.add_node("human_escalation", escalation_node)
# --- Sequential: PO → BA ---
workflow.set_entry_point("po_agent")
workflow.add_edge("po_agent", "ba_agent")
# --- Parallel fan-out: BA → QC + TA ---
workflow.add_edge("ba_agent", "qc_agent")
workflow.add_edge("ba_agent", "ta_agent")
# --- Merge: QC + TA → merge ---
workflow.add_edge("qc_agent", "parallel_merge")
workflow.add_edge("ta_agent", "parallel_merge")
# --- Sequential: merge → SSE → TL ---
workflow.add_edge("parallel_merge", "sse_agent")
workflow.add_edge("sse_agent", "tl_agent")
# --- Conditional: TL → DevOps | SSE (retry) | Human ---
workflow.add_conditional_edges(
"tl_agent",
route_after_tl_review,
{
"devops_agent": "devops_agent",
"sse_agent": "sse_agent",
"human_escalation": "human_escalation",
},
)
# --- DevOps → PM → END ---
workflow.add_edge("devops_agent", "pm_agent")
workflow.add_edge("pm_agent", END)
# --- Escalation → END ---
workflow.add_edge("human_escalation", END)
return workflow.compile()
About thirty lines of graph definition wire up eight agents with parallel execution, conditional routing, and human escalation.
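Running it end to end is a single call on the compiled graph. A minimal sketch; the module path, the brief, and the printed fields are illustrative:

```python
import asyncio

from graph import build_full_pipeline  # module path assumed

async def main() -> None:
    app = build_full_pipeline()
    final_state = await app.ainvoke({"user_brief": "Build a URL shortener with auth."})
    print(final_state["pipeline_status"])
    if final_state.get("human_intervention_required"):
        print(final_state["escalation_message"])

asyncio.run(main())
```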
What We Built in Part 8
- TLAgent (Jordan) — binary APPROVE/REJECT code review across security, performance, quality, and test coverage
- DevOpsAgent (Casey) — generates CI/CD workflow, Dockerfile, and docker-compose from the TA’s spec with QC quality gates baked in
- PMAgent (Riley) — status reporting, blocker tracking, and escalation decisions
- Conditional routing — route_after_tl_review implements retry-or-escalate with configurable max retries
- Human escalation — clean pipeline halt when automated resolution fails
- Complete graph — all eight agents wired with parallel execution, conditional edges, and two terminal states
The eight agents are built. The graph is complete.
What’s Next
In Part 9, we tackle tool integration. Our agents currently operate in a closed loop — reading from and writing to TeamState. Real-world pipelines need to read files, call APIs, and interact with version control. We will build the tool layer that gives agents controlled access to the outside world without compromising determinism.
Series Navigation
- Part 1: Why Build an AI Software Team? | Part 2: Mapping the Roles | Part 3: Architecture and DDD | Part 4: TeamState
- Part 5: PO and BA Agents | Part 6: QC and TA Agents | Part 7: SSE Agent | Part 8 (this article)
- Parts 9-12: Coming soon