Your agents have a problem you might not have noticed yet. They are goldfish.
Every time the pipeline runs, the SSE agent writes code as if it has never written code before. The QC agent invents test strategies from scratch. The TA agent makes architecture decisions without remembering that it made the exact same decision last Tuesday — and that decision turned out to be wrong.
This is not a model limitation. It is architectural. We built a pipeline where state lives in TeamState, flows through the graph, and disappears when the run completes. The agents have working memory but no long-term memory. They have intelligence but no experience.
In human teams, experience separates a junior from a senior. The senior developer has seen a hundred failed deployments, a thousand code review comments. They pattern-match against history and get faster with every project.
This article gives your agents the same capability. We build three types of memory, a reflection system that learns from every completed task, a skills library that accumulates proven patterns, and an inter-agent knowledge sharing protocol. By the end, your agents will get measurably better with every run.
1. The Three Types of Agent Memory
Cognitive science distinguishes several kinds of human memory; three of them map remarkably well onto agent systems.
Working memory is what you are thinking about right now. Fast, limited, temporary. In our system: TeamState — messages, task context, iteration counts. It exists for one pipeline execution and vanishes.
Episodic memory is your record of past experiences. “Last time I used MongoDB for relational queries, it was a disaster.” In our system: a ChromaDB vector store where agents write reflections after tasks. When a similar task arrives later, the agent retrieves relevant past experiences.
Semantic memory is general knowledge abstracted from specific experiences. “Relational data belongs in PostgreSQL.” In our system: a skills library with confidence scores that increase as evidence accumulates.
The flow is directional. Working memory produces raw material during a run. Reflection distills it into episodic memories. Recurring patterns get promoted to semantic memory. Each layer is slower to write but longer-lasting and more valuable.
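To make the tiering concrete, here is a toy, stdlib-only sketch of that promotion flow. The strings and the three-occurrence threshold are invented for illustration; the real pipeline matches by embedding similarity, not exact strings.

```python
from collections import Counter

# Working memory: raw per-run material (vanishes after the run).
run_outputs = [
    "forgot to mock the database connection in integration tests",
    "forgot to mock the database connection in integration tests",
    "chose PostgreSQL for relational data; worked well",
    "forgot to mock the database connection in integration tests",
]

# Reflection distills each run's output into an episodic entry.
episodic = [f"lesson: {o}" for o in run_outputs]

# Recurring episodic patterns get promoted to semantic memory (skills).
counts = Counter(episodic)
semantic = [lesson for lesson, n in counts.items() if n >= 3]

print(semantic)  # only the thrice-seen mocking lesson graduates
```

A pattern seen three times graduates to the semantic tier; one-off observations stay episodic, which is exactly the shape of the skills-promotion logic built in section 4.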
2. The Memory Store: ChromaDB for Episodic Memory
ChromaDB is an open-source vector database that stores documents alongside embeddings. You give it text, it vectorizes it, and later you query by similarity. Exactly what episodic memory needs.
The data model:
# memory/models.py
from pydantic import BaseModel, Field
from datetime import datetime
from enum import Enum
from uuid import uuid4
class MemoryType(str, Enum):
REFLECTION = "reflection"
ERROR_PATTERN = "error_pattern"
DECISION_OUTCOME = "decision_outcome"
REVIEW_FEEDBACK = "review_feedback"
class AgentMemory(BaseModel):
"""A single episodic memory entry."""
memory_id: str = Field(default_factory=lambda: str(uuid4()))
agent_role: str # "sse", "qc", "ta", etc.
memory_type: MemoryType
task_id: str # which pipeline run produced this
content: str # the actual memory text
context: dict = Field(default_factory=dict) # structured metadata
created_at: datetime = Field(default_factory=datetime.utcnow)
relevance_score: float = 0.0 # updated on retrieval
@property
def search_text(self) -> str:
"""Combined text used for embedding and similarity search."""
return f"[{self.agent_role}] [{self.memory_type.value}] {self.content}"
Now the store itself:
# memory/store.py
from datetime import datetime
import chromadb
from chromadb.config import Settings
from memory.models import AgentMemory, MemoryType
class AgentMemoryStore:
"""Persistent episodic memory backed by ChromaDB."""
def __init__(self, persist_dir: str = "./data/memory"):
self.client = chromadb.PersistentClient(
path=persist_dir,
settings=Settings(anonymized_telemetry=False),
)
self.collection = self.client.get_or_create_collection(
name="agent_memories",
metadata={"hnsw:space": "cosine"},
)
    def store(self, memory: AgentMemory) -> str:
        """Store a memory. Returns the memory ID."""
        self.collection.add(
            ids=[memory.memory_id],
            documents=[memory.search_text],
            metadatas=[{
                "agent_role": memory.agent_role,
                "memory_type": memory.memory_type.value,
                "task_id": memory.task_id,
                "created_at": memory.created_at.isoformat(),
                # Numeric copy for range filters: Chroma's $lt/$gt
                # compare numbers, not ISO strings.
                "created_at_ts": memory.created_at.timestamp(),
            }],
        )
        return memory.memory_id
    def recall(
        self,
        query: str,
        agent_role: str | None = None,
        memory_type: MemoryType | None = None,
        n_results: int = 5,
    ) -> list[dict]:
        """Retrieve memories similar to the query."""
        # Chroma requires an explicit $and when combining conditions.
        conditions = []
        if agent_role:
            conditions.append({"agent_role": agent_role})
        if memory_type:
            conditions.append({"memory_type": memory_type.value})
        where_filter = None
        if len(conditions) == 1:
            where_filter = conditions[0]
        elif conditions:
            where_filter = {"$and": conditions}
        results = self.collection.query(
            query_texts=[query],
            n_results=n_results,
            where=where_filter,
        )
        return [
            {
                "content": results["documents"][0][i],
                "distance": results["distances"][0][i],
                "metadata": results["metadatas"][0][i],
            }
            for i in range(len(results["documents"][0]))
        ]
    def forget_before(self, cutoff: datetime) -> int:
        """Remove memories older than cutoff. Returns count removed."""
        # Filter on the numeric timestamp; $lt does not work on ISO strings.
        old = self.collection.get(
            where={"created_at_ts": {"$lt": cutoff.timestamp()}},
        )
        if old["ids"]:
            self.collection.delete(ids=old["ids"])
        return len(old["ids"])
The recall method takes a natural-language query and returns the most similar memories. When the SSE agent is about to implement a FastAPI endpoint, it queries “implementing REST API endpoint with authentication” and gets back memories from previous runs where it did something similar — including what went wrong.
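One detail worth internalizing before acting on recall results: with `hnsw:space` set to cosine, Chroma returns cosine *distance* (1 − similarity), so smaller numbers mean closer matches. A small standalone helper makes the thresholding explicit — the 0.3 cutoff is a tunable assumption, and the hit dicts below are fabricated examples:

```python
def cosine_similarity_from_distance(distance: float) -> float:
    # Chroma's cosine space reports distance = 1 - cosine_similarity.
    return 1.0 - distance

def is_relevant(distance: float, threshold: float = 0.3) -> bool:
    # Distance below 0.3 ~= similarity above 0.7 -- a tunable cutoff.
    return distance < threshold

# Fabricated recall results for illustration.
hits = [
    {"content": "REST endpoint auth bug", "distance": 0.12},
    {"content": "CSS layout note", "distance": 0.85},
]
relevant = [h for h in hits if is_relevant(h["distance"])]
print([h["content"] for h in relevant])  # only the close match survives
```

The same cutoff shows up again in section 4, where pattern promotion only counts memories within the relevance threshold.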
3. Task Reflection: Learning After Every Run
Memory without reflection is just logging. A log says “test failed with exit code 1.” A memory says “the test failed because I forgot to mock the database connection, and this is the third time I have made that mistake with integration tests.”
Reflection happens after a task completes. Each agent examines what it produced, what feedback it received, and what the outcome was. Then it writes a structured reflection into episodic memory.
# memory/reflection.py
from memory.models import AgentMemory, MemoryType
from memory.store import AgentMemoryStore
from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel
class TaskReflection(BaseModel):
"""Structured reflection output from an agent."""
what_went_well: str
what_went_wrong: str
key_decision: str
decision_outcome: str # "positive", "negative", "neutral"
pattern_noticed: str | None = None
improvement_for_next_time: str
REFLECTION_PROMPT = """You are {agent_role}, reflecting on a completed task.
## Task Summary
{task_summary}
## Your Output
{agent_output}
## Feedback Received
{feedback}
## Final Outcome
{outcome}
Reflect on this task honestly. What went well? What went wrong?
What was the key decision you made, and how did it turn out?
Do you notice any recurring pattern? What would you do differently next time?
Respond as JSON with exactly these keys: what_went_well, what_went_wrong,
key_decision, decision_outcome ("positive" | "negative" | "neutral"),
pattern_noticed (a string, or null), improvement_for_next_time."""
class ReflectionEngine:
"""Generates and stores post-task reflections."""
def __init__(self, memory_store: AgentMemoryStore):
self.memory = memory_store
self.llm = ChatAnthropic(
model="claude-sonnet-4-20250514",
temperature=0.3,
)
async def reflect(
self,
agent_role: str,
task_id: str,
task_summary: str,
agent_output: str,
feedback: str,
outcome: str,
) -> TaskReflection:
"""Generate a reflection and store it as episodic memory."""
prompt = REFLECTION_PROMPT.format(
agent_role=agent_role,
task_summary=task_summary,
agent_output=agent_output[:2000], # truncate to fit context
feedback=feedback,
outcome=outcome,
)
response = await self.llm.ainvoke([{"role": "user", "content": prompt}])
reflection = TaskReflection.model_validate_json(response.content)
# Store the main reflection
self.memory.store(AgentMemory(
agent_role=agent_role,
memory_type=MemoryType.REFLECTION,
task_id=task_id,
content=(
f"Task: {task_summary}\n"
f"Well: {reflection.what_went_well}\n"
f"Wrong: {reflection.what_went_wrong}\n"
f"Improve: {reflection.improvement_for_next_time}"
),
context={
"decision": reflection.key_decision,
"decision_outcome": reflection.decision_outcome,
},
))
# If a pattern was noticed, store it separately
if reflection.pattern_noticed:
self.memory.store(AgentMemory(
agent_role=agent_role,
memory_type=MemoryType.ERROR_PATTERN,
task_id=task_id,
content=reflection.pattern_noticed,
))
return reflection
The reflection step runs after the PM marks the task complete. Every participating agent reflects:
# graph/reflection_node.py
async def reflection_node(state: TeamState) -> dict:
engine = ReflectionEngine(memory_store=get_memory_store())
for role in ["po", "ba", "ta", "qc", "sse", "tl"]:
output_key = f"{role}_output"
if not state.get(output_key):
continue
await engine.reflect(
agent_role=role, task_id=state["task_id"],
task_summary=state.get("clarified_requirement", ""),
agent_output=str(state[output_key])[:2000],
feedback=state.get("qc_feedback", "No feedback"),
outcome=state.get("phase", "unknown"),
)
return {"reflection_complete": True}
4. The Skills Library: Semantic Memory
Episodic memories are raw experiences — useful but noisy. The skills library is the curated layer above them: proven patterns validated across multiple experiences.
# memory/skills.py
from pydantic import BaseModel, Field
from datetime import datetime
from uuid import uuid4
from memory.models import MemoryType
from memory.store import AgentMemoryStore
class Skill(BaseModel):
"""A learned skill with confidence tracking."""
skill_id: str = Field(default_factory=lambda: str(uuid4()))
name: str # "fastapi_auth_endpoint"
description: str # what this skill does
agent_role: str # which agent owns this
category: str # "api_design", "testing", etc.
template: str # the actual pattern/template
confidence: float = 0.5 # 0.0 to 1.0
times_used: int = 0
times_succeeded: int = 0
created_at: datetime = Field(default_factory=datetime.utcnow)
updated_at: datetime = Field(default_factory=datetime.utcnow)
source_task_ids: list[str] = Field(default_factory=list)
approved: bool = False # requires human approval
@property
def success_rate(self) -> float:
if self.times_used == 0:
return 0.0
return self.times_succeeded / self.times_used
    def record_usage(self, succeeded: bool, task_id: str):
        """Update stats after the skill is used."""
        self.times_used += 1
        if succeeded:
            self.times_succeeded += 1
        self.source_task_ids.append(task_id)
        # Smoothed success rate with a Beta(1,1) pseudo-count prior: an
        # unused skill stays at 0.5, and a single outcome nudges the
        # estimate instead of snapping it to 0.0 or 1.0.
        self.confidence = (self.times_succeeded + 1) / (self.times_used + 2)
        self.updated_at = datetime.utcnow()
class SkillsLibrary:
"""Persistent store for learned skills."""
def __init__(self, db_path: str = "./data/skills.json"):
self.db_path = db_path
self.skills: dict[str, Skill] = {}
self._load()
    def _load(self):
        import json
        from pathlib import Path
        path = Path(self.db_path)
        if path.exists():
            data = json.loads(path.read_text())
            self.skills = {k: Skill.model_validate(v) for k, v in data.items()}
    def _save(self):
        import json
        from pathlib import Path
        data = {k: v.model_dump(mode="json") for k, v in self.skills.items()}
        Path(self.db_path).parent.mkdir(parents=True, exist_ok=True)
        Path(self.db_path).write_text(json.dumps(data, indent=2))
def add_skill(self, skill: Skill) -> str:
"""Add a new skill. Returns skill ID."""
self.skills[skill.skill_id] = skill
self._save()
return skill.skill_id
def find_skills(self, agent_role: str | None = None,
category: str | None = None,
min_confidence: float = 0.0,
approved_only: bool = True) -> list[Skill]:
"""Find skills matching criteria, sorted by confidence."""
results = [
s for s in self.skills.values()
if (not agent_role or s.agent_role == agent_role)
and (not category or s.category == category)
and s.confidence >= min_confidence
and (not approved_only or s.approved)
]
return sorted(results, key=lambda s: s.confidence, reverse=True)
def record_usage(self, skill_id: str, succeeded: bool, task_id: str):
if skill_id in self.skills:
self.skills[skill_id].record_usage(succeeded, task_id)
self._save()
def promote_from_reflections(
self,
memory_store: AgentMemoryStore,
agent_role: str,
pattern_query: str,
min_occurrences: int = 3,
) -> Skill | None:
"""
Check if a pattern appears frequently enough in episodic memory
to be promoted to a skill.
"""
memories = memory_store.recall(
query=pattern_query,
agent_role=agent_role,
memory_type=MemoryType.ERROR_PATTERN,
n_results=10,
)
# Filter by similarity threshold
relevant = [m for m in memories if m["distance"] < 0.3]
if len(relevant) >= min_occurrences:
skill = Skill(
name=f"auto_{agent_role}_{len(self.skills)}",
description=f"Pattern detected from {len(relevant)} similar experiences",
agent_role=agent_role,
category="auto_detected",
template=pattern_query,
confidence=0.5,
approved=False, # needs human review
source_task_ids=[
m["metadata"]["task_id"] for m in relevant
],
)
self.add_skill(skill)
return skill
return None
Skills start at 0.5 confidence. Every successful use pushes confidence up; every failure pushes it down. Over enough runs, reliable skills float to the top and unreliable ones sink. An agent looking for guidance gets the highest-confidence relevant skill first.
The approved flag prevents the system from learning bad habits. Auto-detected skills start unapproved. A human reviews them before they enter active rotation. A pattern that appeared three times might exist because the agent made the same mistake three times, not because it found a good approach.
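The confidence arithmetic is worth sanity-checking in isolation. One robust way to track it is a pseudo-count (Beta-prior) update, sketched here independently of the Skill class — the function below is my illustration, not part of the pipeline's modules:

```python
def smoothed_confidence(successes: int, uses: int) -> float:
    # Beta(1,1) prior: one phantom success and one phantom failure,
    # so an unused skill sits at 0.5 and single outcomes move it gently.
    return (successes + 1) / (uses + 2)

print(smoothed_confidence(0, 0))   # 0.5  -- no evidence yet
print(smoothed_confidence(1, 1))   # ~0.67 -- one success
print(smoothed_confidence(0, 1))   # ~0.33 -- one failure
print(smoothed_confidence(9, 10))  # ~0.83 -- strong track record
```

A plain running average would snap to 0.0 or 1.0 after the very first use; the pseudo-counts keep early estimates anchored near the neutral 0.5 starting point, which is exactly the behavior the prose above describes.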
5. Injecting Memory into Agent Prompts
Memory that is never consulted is useless. Before each agent runs, we query both episodic memory and the skills library, then inject relevant context into the prompt.
# memory/injection.py
class MemoryInjector:
"""Retrieves and formats memories for agent prompt injection."""
def __init__(
self,
memory_store: AgentMemoryStore,
skills_library: SkillsLibrary,
):
self.memory = memory_store
self.skills = skills_library
def build_memory_context(
self,
agent_role: str,
task_description: str,
max_memories: int = 3,
max_skills: int = 2,
) -> str:
"""Build a memory context block for prompt injection."""
sections = []
# 1. Retrieve relevant episodic memories
memories = self.memory.recall(
query=task_description,
agent_role=agent_role,
n_results=max_memories,
)
if memories:
sections.append("## Relevant Past Experiences")
for i, mem in enumerate(memories, 1):
sections.append(f"{i}. {mem['content']}")
# 2. Retrieve relevant skills
skills = self.skills.find_skills(
agent_role=agent_role,
min_confidence=0.6,
approved_only=True,
)[:max_skills]
if skills:
sections.append("\n## Proven Skills Available")
for skill in skills:
sections.append(
f"- **{skill.name}** (confidence: {skill.confidence:.0%}): "
f"{skill.description}\n"
f" Template: {skill.template[:300]}"
)
if not sections:
return ""
return (
"\n---\n"
"# Memory Context (from past runs)\n"
"Use these past experiences and skills to inform your work. "
"Do not follow them blindly — adapt to the current task.\n\n"
+ "\n".join(sections)
+ "\n---\n"
)
def inject_memory_into_prompt(
base_prompt: str,
agent_role: str,
task_description: str,
injector: MemoryInjector,
) -> str:
"""Append memory context to an agent's base prompt."""
memory_context = injector.build_memory_context(
agent_role=agent_role,
task_description=task_description,
)
if memory_context:
return base_prompt + "\n" + memory_context
return base_prompt
The injection happens in each agent node function, right before the LLM call. The SSE agent’s node, for example, creates a MemoryInjector, calls inject_memory_into_prompt with the base system prompt and current task description, then passes the enhanced prompt to the LLM. The agent now sees not just its role definition and the current task, but also “last time you built a similar API, you forgot input validation on the PATCH endpoint” and “proven skill: always add request body validation middleware (confidence: 87%).”
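The final prompt shape matters more than the plumbing. Here is a stub of what the enhanced prompt looks like, with the assembly logic reduced to plain strings — the memory and skill texts are invented for illustration:

```python
BASE_PROMPT = "You are the SSE agent. Implement the assigned task."

def assemble_prompt(base: str, memories: list[str], skills: list[str]) -> str:
    """Append a memory-context block only when there is something to say."""
    if not memories and not skills:
        return base
    parts = ["\n---\n# Memory Context (from past runs)"]
    if memories:
        parts.append("## Relevant Past Experiences")
        parts += [f"{i}. {m}" for i, m in enumerate(memories, 1)]
    if skills:
        parts.append("## Proven Skills Available")
        parts += [f"- {s}" for s in skills]
    return base + "\n".join(parts) + "\n---\n"

prompt = assemble_prompt(
    BASE_PROMPT,
    memories=["Last API task: forgot input validation on the PATCH endpoint"],
    skills=["api_input_validation (confidence: 87%)"],
)
print(prompt)
```

Note the empty-context early return: on a cold start with no memories, agents receive exactly the prompt they would have gotten without the memory system.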
6. Inter-Agent Knowledge Sharing
Some lessons apply across the team. When the QC agent discovers that “endpoints without rate limiting always fail load testing,” that knowledge should reach the TA and SSE agents. The sharing mechanism uses a proposal-and-approval pattern:
# memory/sharing.py
from uuid import uuid4
from pydantic import BaseModel
from enum import Enum
from memory.skills import Skill, SkillsLibrary
class ShareStatus(str, Enum):
PROPOSED = "proposed"
APPROVED = "approved"
REJECTED = "rejected"
class KnowledgeShare(BaseModel):
"""A knowledge item proposed for cross-agent sharing."""
share_id: str
source_agent: str
target_agents: list[str]
knowledge: str
evidence_task_ids: list[str]
status: ShareStatus = ShareStatus.PROPOSED
class KnowledgeBroker:
"""Manages cross-agent knowledge sharing with approval gates."""
def __init__(self, skills_library: SkillsLibrary):
self.skills = skills_library
self.pending: list[KnowledgeShare] = []
def propose(self, source_agent: str, target_agents: list[str],
knowledge: str, evidence_task_ids: list[str]) -> KnowledgeShare:
share = KnowledgeShare(
share_id=str(uuid4()), source_agent=source_agent,
target_agents=target_agents, knowledge=knowledge,
evidence_task_ids=evidence_task_ids,
)
self.pending.append(share)
return share
def approve(self, share_id: str) -> bool:
"""Approve a share and create skills for target agents."""
share = next((s for s in self.pending if s.share_id == share_id), None)
if not share:
return False
share.status = ShareStatus.APPROVED
for target in share.target_agents:
self.skills.add_skill(Skill(
name=f"shared_from_{share.source_agent}",
description=share.knowledge, agent_role=target,
category="shared_knowledge", template=share.knowledge,
confidence=0.6, approved=True,
source_task_ids=share.evidence_task_ids,
))
return True
After reflection, agents propose shares automatically based on a role-to-beneficiary map:
# Role -> who benefits from their discoveries
SHARE_TARGETS = {
"qc": ["sse", "ta"], # QC findings help SSE and TA
"tl": ["sse", "qc"], # TL review patterns help SSE and QC
"sse": ["ta", "qc"], # SSE impl patterns help TA and QC
"ta": ["sse", "devops"], # TA arch decisions help SSE and DevOps
}
async def propose_cross_agent_shares(
reflection: TaskReflection, agent_role: str,
task_id: str, broker: KnowledgeBroker,
):
if not reflection.pattern_noticed:
return
targets = SHARE_TARGETS.get(agent_role, [])
if targets:
broker.propose(agent_role, targets,
reflection.pattern_noticed, [task_id])
The human reviews pending shares in the dashboard. “QC agent noticed that all APIs without input validation fail test cases. Share with SSE and TA?” Approve, and both agents get a new skill.
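The propose-then-approve lifecycle is easier to verify without the pydantic and persistence layers. Here is a stripped-down, stdlib-only version of the state transition — class and field names are simplified stand-ins for the article's KnowledgeShare and KnowledgeBroker:

```python
from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class Share:
    source: str
    targets: list
    knowledge: str
    status: str = "proposed"
    share_id: str = field(default_factory=lambda: str(uuid4()))

class MiniBroker:
    def __init__(self):
        self.pending: list = []
        self.granted: dict = {}  # agent role -> list of shared knowledge

    def propose(self, source: str, targets: list, knowledge: str) -> Share:
        share = Share(source, targets, knowledge)
        self.pending.append(share)
        return share

    def approve(self, share_id: str) -> bool:
        share = next((s for s in self.pending if s.share_id == share_id), None)
        if share is None:
            return False
        share.status = "approved"
        for target in share.targets:
            self.granted.setdefault(target, []).append(share.knowledge)
        return True

broker = MiniBroker()
s = broker.propose("qc", ["sse", "ta"],
                   "endpoints without rate limiting fail load tests")
broker.approve(s.share_id)
print(broker.granted["sse"])  # the QC lesson is now visible to SSE
```

The essential property is that knowledge never reaches a target agent while the share is still in the `proposed` state — the approval call is the only path into `granted`.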
7. Learning from External Sources
Agents should not only learn from their own experience. A ResearchTool lets agents query external documentation and best-practice guides during execution.
# tools/research.py
from langchain_core.tools import tool
from langchain_community.utilities import GoogleSearchAPIWrapper

# Note: @tool cannot decorate an instance method -- it would expose
# `self` as a tool argument -- so the tool is a module-level function
# closing over a shared search wrapper.
_search = GoogleSearchAPIWrapper(k=3)

@tool
async def research_best_practice(query: str) -> str:
    """Search for best practices on a topic. Use when encountering
    unfamiliar problems or verifying your approach."""
    results = _search.results(query, num_results=3)
    formatted = [
        f"**{r['title']}**\n{r['snippet']}\nSource: {r['link']}"
        for r in results
    ]
    return "\n\n---\n\n".join(formatted) if formatted else "No results found."
The TA and SSE agents get this tool. The key constraint: research results are ephemeral. They inform the current run but do not enter the memory store. Only reflections and approved skills persist. This prevents the system from memorizing random Stack Overflow answers. The agent uses research to make a decision, the outcome is reflected on, and only proven patterns graduate to long-term memory.
8. Wiring It All Into the Graph
Here is how memory integrates with the LangGraph pipeline:
# graph/builder.py (updated with memory nodes)
from langgraph.graph import StateGraph, END
def build_team_graph() -> StateGraph:
graph = StateGraph(TeamState)
# ── Existing agent nodes (from Parts 5-9) ──
graph.add_node("po", po_node)
graph.add_node("ba", ba_node)
graph.add_node("ta", ta_node)
graph.add_node("qc", qc_node)
graph.add_node("sse", sse_node) # now uses memory injection
graph.add_node("tl", tl_node)
graph.add_node("devops", devops_node)
graph.add_node("pm", pm_node)
# ── New memory nodes ──
graph.add_node("reflect", reflection_node)
graph.add_node("skill_check", skill_promotion_node)
# ── Existing edges (abbreviated) ──
graph.set_entry_point("po")
graph.add_edge("po", "ba")
graph.add_edge("ba", "ta"); graph.add_edge("ba", "qc") # parallel
graph.add_edge("ta", "sse"); graph.add_edge("qc", "sse")
graph.add_edge("sse", "tl")
graph.add_conditional_edges("tl", tl_router)
graph.add_edge("devops", "pm")
# ── Memory edges (post-pipeline) ──
graph.add_edge("pm", "reflect")
graph.add_edge("reflect", "skill_check")
graph.add_edge("skill_check", END)
return graph.compile()
async def skill_promotion_node(state: TeamState) -> dict:
"""Check if any episodic patterns should be promoted to skills."""
memory_store = get_memory_store()
skills_lib = get_skills_library()
# Check common pattern categories
patterns_to_check = [
("sse", "input validation on API endpoints"),
("sse", "error handling in async functions"),
("qc", "edge cases in authentication flows"),
("ta", "database selection for relational data"),
]
promoted = []
for agent_role, pattern in patterns_to_check:
skill = skills_lib.promote_from_reflections(
memory_store=memory_store,
agent_role=agent_role,
pattern_query=pattern,
)
if skill:
promoted.append(skill.name)
return {"promoted_skills": promoted}
The pipeline now has a tail: after the PM declares the task complete, the reflection node runs, every agent writes its memories, and the skill promotion node checks whether any patterns have occurred often enough to become skills.
9. Measuring Improvement
“The agents are getting better” is not a useful claim without metrics. Here is how we track whether the memory system works.
# metrics/improvement.py
from dataclasses import dataclass
@dataclass
class ImprovementMetrics:
"""Track agent improvement across pipeline runs."""
task_id: str
run_number: int
qc_pass_on_first_try: bool = False
tl_review_pass_on_first: bool = False
iteration_count: int = 0
memories_recalled: int = 0
skills_applied: int = 0
total_tokens_used: int = 0
def compute_improvement_trend(history: list[ImprovementMetrics]) -> dict:
"""Compare early runs vs recent runs."""
if len(history) < 4:
return {"status": "insufficient_data"}
mid = len(history) // 2
early, recent = history[:mid], history[mid:]
def avg(items, attr):
vals = [getattr(i, attr) for i in items]
return sum(vals) / len(vals) if vals else 0
return {
"first_try_pass_rate": {"early": avg(early, "qc_pass_on_first_try"),
"recent": avg(recent, "qc_pass_on_first_try")},
"avg_iterations": {"early": avg(early, "iteration_count"),
"recent": avg(recent, "iteration_count")},
"avg_tokens": {"early": avg(early, "total_tokens_used"),
"recent": avg(recent, "total_tokens_used")},
}
The metrics you want trending in the right direction:
- First-try pass rate should increase. Agents learning from past failures should pass QC and TL review on the first attempt more often.
- Iteration count should decrease. Fewer rejection cycles means higher-quality output from the start.
- Token usage should stabilize or decrease. Better skills mean less fumbling, fewer retry tokens wasted.
Track these across runs. If the lines are flat, your memory system is not working. If first-try pass rate is climbing and iteration count is falling, your agents are genuinely learning.
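A quick synthetic run shows the trend computation end to end. The history values below are fabricated to simulate an improving team, and the dataclass is trimmed to the fields the trend actually uses:

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    qc_pass_on_first_try: bool
    iteration_count: int
    total_tokens_used: int

def avg(items, attr):
    # Summing booleans yields the pass *rate* for free.
    vals = [getattr(i, attr) for i in items]
    return sum(vals) / len(vals) if vals else 0

# Fabricated history: early runs struggle, recent runs improve.
history = [
    Metrics(False, 3, 52_000), Metrics(False, 2, 47_000),
    Metrics(True, 1, 38_000), Metrics(True, 1, 35_000),
]
mid = len(history) // 2
early, recent = history[:mid], history[mid:]

print("first-try pass:", avg(early, "qc_pass_on_first_try"), "->",
      avg(recent, "qc_pass_on_first_try"))   # 0.0 -> 1.0
print("avg iterations:", avg(early, "iteration_count"), "->",
      avg(recent, "iteration_count"))        # 2.5 -> 1.0
```

Rising pass rate, falling iteration count: that is the signature of a memory system doing its job.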
10. Practical Considerations
Memory Hygiene
Not all memories are worth keeping. Filter before storing — reflections that say “everything went fine” add noise without signal. Reject reflections where what_went_wrong == "Nothing" and improvement_for_next_time is shorter than 20 characters.
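That filter is only a few lines. A sketch — the "Nothing" variants and the 20-character cutoff are the heuristics from above, tune them to taste:

```python
def worth_storing(what_went_wrong: str, improvement: str) -> bool:
    """Reject content-free reflections before they reach the memory store."""
    no_failure = what_went_wrong.strip().lower() in {"nothing", "n/a", "none", ""}
    no_lesson = len(improvement.strip()) < 20
    return not (no_failure and no_lesson)

print(worth_storing("Nothing", "Keep it up"))  # False -- pure noise
print(worth_storing("Forgot to mock the DB connection",
                    "Add a conftest fixture that mocks the DB"))  # True
```

Run this gate inside ReflectionEngine.reflect before the store call and the vector store stays free of "everything went fine" filler.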
Memory Decay
Old memories become less relevant. Run forget_before() monthly with a 90-day cutoff to prevent stale patterns from polluting retrieval.
Cost Control
Six agents reflecting per run means 6 extra LLM calls. Use a cheaper model for reflections (Claude Haiku or GPT-4o-mini) and batch embedding calls where possible.
The Cold Start Problem
On the first run, there are no memories. The pipeline works identically to the non-memory version from previous parts. This is fine — memory is additive, not a dependency. To accelerate cold start, seed the skills library before the first run:
def seed_initial_skills(library: SkillsLibrary):
"""Pre-load common skills to avoid cold start."""
library.add_skill(Skill(
name="api_input_validation",
description="Always validate request bodies with Pydantic models",
agent_role="sse",
category="api_design",
template="Use Pydantic BaseModel for all request/response schemas",
confidence=0.8,
approved=True,
))
library.add_skill(Skill(
name="test_isolation",
description="Each test case must be independent and idempotent",
agent_role="qc",
category="testing",
template="Use fixtures for setup/teardown. Never depend on test order.",
confidence=0.9,
approved=True,
))
What We Built
This article added three layers to the agent system:
- Episodic memory via ChromaDB — agents store reflections after every task and retrieve similar experiences before future tasks.
- Semantic memory via the Skills Library — recurring patterns are promoted to skills with confidence scores and human approval gates.
- Knowledge sharing via the KnowledgeBroker — one agent’s lessons can be shared with the rest of the team through a proposal-and-approval workflow.
The result is a system that genuinely improves over time — not through fine-tuning, but through accumulated experience, the same mechanism that makes human teams better with practice.
In Part 11, we tackle deployment: containerizing the multi-agent system, orchestration infrastructure, and running the pipeline in production with proper observability. The agents are smart. Now we need them to be reliable.