The Discipline That Didn’t Exist Four Years Ago

In late 2022, building with AI was deceptively simple: you typed a question, you marveled at the answer. There was no discipline. No tooling. No infrastructure. Just a text box and a model.

That simplicity was the beginning of something that would grow — over the next four years — into a full engineering practice with its own patterns, its own failure modes, and its own infrastructure layer. What started as the art of clever phrasing has become system architecture. The craft has accumulated depth, and the question at its center has taken three distinct forms.

2022: “What should I say?” — and prompt engineering was the answer.

2023: “What information should I feed the model?” — and context engineering emerged.

2025: “What system do I need to build?” — and the answer is harness engineering.

*The three eras of AI engineering: Prompt Engineering (2022), Context Engineering (2023), Harness Engineering (2025)*

Each shift happened because the previous generation hit a structural wall that better technique alone could not overcome. The bottleneck kept moving — from expression, to information, to reliability. Understanding that progression is the fastest way to understand where the field is today and why production agentic systems look the way they do.


Part 1 — Prompt Engineering: “What Should I Say?” (2022–2024)

The foundational assumption of prompt engineering was elegant: LLMs are trained on billions of tokens of human knowledge, compressed into billions of parameters. That knowledge is already inside the model. The only variable left is how you phrase the request. Write the right words, in the right structure, and you unlock the right answer.

For two years, this assumption held up remarkably well.

The hottest new programming language is English. — Andrej Karpathy, 2023

A rich vocabulary of techniques emerged:

| Technique | How It Works | Best For |
| --- | --- | --- |
| Zero-shot | Direct instruction, no examples | Simple, factual tasks |
| Few-shot | 2–5 examples embedded in the prompt | Format-sensitive outputs (JSON, SQL) |
| Chain-of-Thought | "Think step by step" prefix | Multi-step reasoning, math |
| Role prompting | "You are a senior engineer…" | Domain-specific behavior |
| Self-consistency | Sample N responses, majority vote | High-stakes decisions |

```python
# The complete "stack" in 2023 — everything lived in the prompt
import openai

system_prompt = """
You are a senior backend engineer.
Rules:
- Return JSON only
- Handle all edge cases explicitly
- Use snake_case for identifiers
"""

query = "Write an endpoint that lists a user's orders."  # the user's request

response = openai.ChatCompletion.create(  # the 2023-era API
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ],
)
# No memory. No tools. Every call started fresh.
```

The Wall: Knowledge Cutoff

Prompt engineering hit a structural wall it couldn’t reason its way past. LLMs are frozen in time — their knowledge ends at a training cutoff. They don’t know what your team shipped last sprint, what your internal API does, or what error your system threw five minutes ago.

No amount of prompt refinement can fix this. The bottleneck had shifted from expression to information. The question was no longer “what do I say?” — it was “what can the model actually see?”


Part 2 — Context Engineering: “What Info Should I Feed?” (2023–2025)

Context engineering started from a simple realization: the prompt is not the whole story. The full object that determines model behavior is the context window — everything the model reads before generating a response, assembled dynamically on every turn.

Building with language models is becoming less about finding the right words for your prompts, and more about answering the question: “What configuration of context is most likely to produce the desired behavior?” — Anthropic Engineering, 2025

*Anatomy of the context window: System Prompt, Conversation History, RAG results, Tool schemas, Memory files*

In an agentic loop, that context is never static. At each turn, more information accumulates:

  • Conversation history — previous turns in the dialogue
  • Retrieved knowledge — documents pulled from external sources via RAG
  • Tool schemas — descriptions of available actions and APIs
  • Tool results — outputs from tools the agent has already called
  • Memory files — state persisted from earlier sessions

The prompt is no longer something you write once. It becomes a dynamic assembly that gets rebuilt on every single turn — growing, shifting, and accumulating state. And context windows have a hard limit.

Managing the Finite Window

Three strategies emerged, with fundamentally different trade-offs:

| Strategy | Mechanism | Trade-off | Use When |
| --- | --- | --- | --- |
| Raw context | Include everything as-is | Highest fidelity, most tokens | Short sessions, critical detail |
| Compaction (reversible) | Replace content with references (e.g., a file path instead of the full file) | Lossless if re-readable | Long sessions with many edits |
| Summarization (lossy) | LLM rewrites history into a shorter form | High compression, permanently lossy | Last resort only |

The hierarchy is strict: Raw context > Compaction > Summarization. Use lossy compression only as a last resort — a detail summarized away is gone forever.

Context Rot: More Isn’t Always Better

A critical finding from 2024 research: context rot. As the token count in a context window grows, the model’s ability to accurately recall information from that context decreases. Packing the window isn’t always the right move. The goal of good context engineering is finding the smallest possible set of high-signal tokens that maximize the probability of the desired output.

Progressive Disclosure

The most effective pattern for tool-heavy agents: load tool names and descriptions into context first, then hydrate the full schema on demand when a specific tool is needed.

```
Step 1  →  Agent sees 50 tool names + short descriptions
Step 2  →  Agent identifies it needs "database_query" tool
Step 3  →  Full schema for "database_query" loaded into context
Step 4  →  Tool called, result enters context, schema can be evicted
```

This pattern — pioneered in agentic frameworks — has since been adopted across OpenAI Agents SDK (deferLoading: true), CrewAI (v1.10.2+), and LangGraph. If you have more than ~20 tools and aren’t doing this, you’re burning tokens on every call.
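
A minimal sketch of the pattern in plain Python, framework-agnostic, with a hypothetical in-memory registry standing in for a real tool server:

```python
# Progressive disclosure sketch: hypothetical registry, not a specific SDK.
TOOLS = {
    "database_query": {
        "description": "Run a read-only SQL query against the warehouse",
        "schema": {  # full JSON Schema: expensive, loaded on demand
            "type": "object",
            "properties": {"sql": {"type": "string"}},
            "required": ["sql"],
        },
    },
    # ... imagine ~50 entries: each summary costs a few tokens,
    # each full schema costs hundreds
}

def tool_summaries() -> list[dict]:
    """Step 1: only names and one-line descriptions enter the context."""
    return [{"name": n, "description": t["description"]} for n, t in TOOLS.items()]

def hydrate(name: str) -> dict:
    """Step 3: the full schema is loaded only once the agent picks a tool."""
    return TOOLS[name]["schema"]

context = {"tools": tool_summaries()}                      # small on every turn
context["active_tool_schema"] = hydrate("database_query")  # Step 2's pick
# Step 4: after the call returns, the schema can be evicted again.
```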

The Wall: Reliability

Context engineering made agents smarter. It didn’t make them reliable. Knowing more and seeing more don’t solve the fundamental problem of acting correctly, recovering from failures, or maintaining coherent behavior across sessions. For that, you need something beyond a well-crafted window — you need a system built around the model.


Part 3 — Harness Engineering: “What System Should I Build?” (2025–2026)

Harness engineering is the discipline of designing the full infrastructure that surrounds a model and makes its intelligence reliably useful in production.

The key formulation, from LangChain’s March 2026 anatomy post:

Agent = Model + Harness

The model provides stateless token prediction. The harness provides everything that transforms that prediction into something useful: memory that persists, tools with permission gates, orchestration for complex tasks, sandboxes for safe execution, and evaluation loops that catch failures before they compound.

*Harness architecture: six layers — Serving, Orchestration, Sandbox, Context Engine, Memory, Tools — wrapping the Model*

A harness has six distinct layers:

Layer 1: Serving — Channels

The serving layer is how the agent receives input and delivers output. The architectural insight is the Channel abstraction: deploy one agent, connect it to many surfaces through a unified gateway. The same brain, the same memory, accessible from a messaging app, a web UI, or an IDE plugin — because the context is shared behind the gateway.

The richness of context varies significantly by channel:

| Channel Type | Context Available | Examples |
| --- | --- | --- |
| IDE / TUI | Files, cursor position, active errors, build output | VS Code plugin, terminal |
| Web app | Session state, user preferences | Custom dashboard |
| Messaging | Text only | Slack, Telegram, WhatsApp |

The model is identical in all cases. The gateway normalizes inputs before the agent ever sees them.
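
A sketch of that normalization step. The payload shapes here are hypothetical; the point is that channel differences end at the gateway:

```python
from dataclasses import dataclass, field

@dataclass
class AgentEvent:
    """The one shape the agent ever sees, regardless of channel."""
    user_id: str
    text: str
    context: dict = field(default_factory=dict)  # richer channels fill this in

def from_messaging(payload: dict) -> AgentEvent:
    # Messaging: text only
    return AgentEvent(payload["user"], payload["text"])

def from_ide(payload: dict) -> AgentEvent:
    # IDE: files, cursor position, and active errors all ride along
    return AgentEvent(
        payload["user"],
        payload["query"],
        context={
            "open_file": payload["file"],
            "cursor": payload["cursor"],
            "diagnostics": payload["errors"],
        },
    )
```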

Layer 2: Orchestration — Routing, Not Entry

Orchestration is the logic that decides, given a task, how to split it and who handles what. It’s a router — not the entry point for requests (that’s the serving layer).

*Two orchestration patterns: ReAct Loop (Observe-Reason-Act cycle) vs Orchestrator-Worker (decompose, delegate, verify)*

Two patterns dominate production:

Subagent Spawning: The parent agent creates a child agent. It writes a new prompt, assigns a specific tool set, and copies its context window into the child. The child works independently with enough context to complete its task. This pattern is how Claude Code’s Agent Teams feature works internally — one session as team lead, spawning independent sub-agents that work in parallel, coordinated via a shared task list with dependency tracking.

Multi-agent Coordination: The task is decomposed and the context window is divided across specialized agents. Each sees only its relevant slice. The orchestrator collects and aggregates results. Unlike spawning, context is not replicated — it is partitioned.

The key difference: spawning copies context and delegates one task. Coordination partitions context and distributes the problem.
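
The contrast is easy to state in code. In this schematic, `run_agent` stands in for a full agent loop (model calls plus tools); nothing here is a real framework API:

```python
import copy

def run_agent(context: dict) -> str:
    """Stand-in for a full agent loop (model call + tool use)."""
    return f"result for: {context['task']}"

def spawn_subagent(parent_context: dict, subtask: str, tools: list) -> str:
    """Spawning: the child gets a COPY of the parent's context,
    a new prompt, and a narrowed tool set. Context is replicated."""
    child = copy.deepcopy(parent_context)
    child.update(task=subtask, tools=tools)
    return run_agent(child)

def coordinate(slices: list[dict]) -> str:
    """Coordination: context is PARTITIONED. Each specialist sees only
    its slice; the orchestrator aggregates the results."""
    results = [run_agent(s) for s in slices]
    return "\n".join(results)
```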

Layer 3: Sandbox — Isolated Execution

Agents run code. That code can delete files, crash systems, or leak credentials. Sandboxes ensure failures stay contained.

| Isolation Level | Speed | Safety | Use Case |
| --- | --- | --- | --- |
| Local subprocess | Fastest | Lowest | Development, experimentation |
| Local Docker container | Moderate | Moderate | CI/CD, staging |
| Remote cloud container | Slowest | Highest | Production, untrusted inputs |

The credential isolation rule (learned from real production incidents): credentials must never be reachable from the sandbox where agent-generated code runs. One successful prompt injection in a coupled design is all it takes to expose an entire environment. This is a structural fix, not a prompt fix.
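
Even at the weakest isolation tier, the credential rule can be enforced structurally. A sketch using only the standard library; real sandboxes add filesystem and network isolation on top of this:

```python
import subprocess
import sys

def run_untrusted(code: str, timeout: int = 30) -> subprocess.CompletedProcess:
    """Execute agent-generated code with NO inherited environment.
    Even if the code is prompt-injected into dumping os.environ,
    there are no credentials in that environment to leak."""
    return subprocess.run(
        [sys.executable, "-c", code],
        env={},              # the structural fix: no API keys reachable
        capture_output=True,
        text=True,
        timeout=timeout,
    )

result = run_untrusted("import os; print(dict(os.environ))")
print(result.stdout)  # empty (or near-empty): nothing to exfiltrate
```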

Layer 4: Context Engine

At the harness level, context management is not a craft — it’s an engineered component that runs on every invocation, deciding what enters the context window and when. The three techniques (compaction, progressive disclosure, tool offloading) are not manual practices — they are pipeline stages with defined trigger conditions.

Claude Code implements five distinct compaction strategies with different latency/fidelity profiles: “Snip” is fast but lossy; “Microcompact” targets tool outputs specifically. Two additional strategies remain behind feature flags — evidence of how genuinely unsolved this is at scale.
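
What "a pipeline stage with a defined trigger condition" means concretely, as a toy sketch with a made-up threshold and a reversible compaction pass (not any product's actual implementation):

```python
COMPACT_THRESHOLD = 120_000  # illustrative trigger, not a real product's number

def context_engine(messages: list[dict], count_tokens) -> list[dict]:
    """Runs on every invocation, before the model call."""
    if sum(count_tokens(m) for m in messages) < COMPACT_THRESHOLD:
        return messages  # raw context: highest fidelity, keep it

    compacted = []
    for m in messages:
        if m.get("kind") == "file_content":
            # Reversible compaction: keep the reference, drop the payload.
            # Lossless as long as the agent can re-read the file on demand.
            compacted.append({"kind": "file_ref", "path": m["path"]})
        else:
            compacted.append(m)
    return compacted
```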

Layer 5: Memory — The Filesystem, Not a Vector Database

Production agent memory operates across three tiers:

| Tier | What It Is | Lifecycle |
| --- | --- | --- |
| Context window | What the model is literally reading | Cleared after each request |
| Session RAM | Conversation history in-process | Lost when the process exits |
| Filesystem | Markdown files written to disk | Survives all sessions |

The instinct when hearing “agent memory” is to reach for a vector database and a RAG pipeline. The reality for most production systems is far simpler. Plain Markdown files — human-readable, human-editable, inspectable, and durable — serve as the memory layer. A MEMORY.md index pointing to topic files, a CLAUDE.md or AGENTS.md injected at session start, a progress file updated after each work session.

No embedding pipeline. No retrieval infrastructure. Memory that you can read, edit, and version alongside your code.

The long-running agent pattern: An initializer agent runs on the first session — writes a structured progress file, describes the current state, makes an initial commit. Every subsequent session reads that file first. This is durable knowledge transfer across sessions with zero infrastructure.
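
The whole pattern fits in a few lines. A sketch: the file name and contents follow the conventions above but are otherwise illustrative:

```python
from datetime import date
from pathlib import Path

PROGRESS = Path("PROGRESS.md")  # plain Markdown, versioned with the code

def start_session(task_summary: str) -> str:
    """First session initializes the progress file; every later session
    reads it before doing anything else. Zero infrastructure."""
    if not PROGRESS.exists():
        PROGRESS.write_text(
            f"# Progress\n\n## {date.today()}\n"
            f"- State: initialized\n"
            f"- Current task: {task_summary}\n"
        )
    return PROGRESS.read_text()  # injected into the fresh context

context_seed = start_session("Migrate the orders API to v2")
```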

Layer 6: Tools — MCP with Permission Gating

The reason for wrapping capabilities as MCP tools rather than exposing a raw shell is permission gating. A bash shell gives the agent everything. An MCP tool gives the agent exactly what it is allowed to have.

Every tool in a well-designed harness declares:

  • isReadOnly (defaults to false — assume it writes)
  • A risk level (low / medium / high)
  • Conditions under which it requires explicit human approval

The result is a permission system built on default-deny: nothing is allowed unless explicitly granted. This is what users see as the approval flow in tools like Claude Code — the “auto-approve” vs. “ask me first” toggle is the if-else inside the MCP tool wrapper.
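
That if-else, sketched as a default-deny toy built from the declarations above (not the actual MCP wrapper):

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

@dataclass
class GatedTool:
    name: str
    run: Callable[..., str]
    is_read_only: bool = False   # default: assume it writes
    risk: str = "high"           # default-deny posture: assume the worst

def invoke(tool: GatedTool, ask_user: Callable[[str], bool], **kwargs) -> str:
    """Nothing is allowed unless explicitly granted."""
    if tool.is_read_only and tool.risk == "low":
        return tool.run(**kwargs)            # the "auto-approve" path
    if not ask_user(f"Allow {tool.name}({kwargs})?"):
        return f"BLOCKED: {tool.name} requires explicit approval"
    return tool.run(**kwargs)                # the "ask me first" path

read_file = GatedTool(
    "Read",
    run=lambda path: Path(path).read_text(),
    is_read_only=True,
    risk="low",
)
```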

| Tool | Purpose | Risk |
| --- | --- | --- |
| Read / Glob / Grep | File inspection, search | Low — read-only |
| Edit | Targeted in-place text replacement | Medium |
| Write | Create or overwrite files | Medium-high |
| Bash | Execute arbitrary scripts | High |
| WebFetch | Fetch external URLs | Medium |
| Agent / Task | Spawn a sub-agent | High — recursive |

As of mid-2026, MCP has surpassed 97 million monthly SDK downloads and is supported natively by every major AI vendor — a sign that permission-gated tooling has become the infrastructure standard for agentic systems.


The Seven Agentic Design Patterns

*Seven agentic design patterns: ReAct, Reflection, Tool Use, Planning, Multi-Agent, Orchestrator-Worker, Evaluator-Optimizer*

Building on Andrew Ng’s foundational four patterns (2024), the field has converged on seven composable patterns by 2026:

| Pattern | What It Does | When to Use | Readiness |
| --- | --- | --- | --- |
| ReAct | Interleave reasoning + tool calls | Sequential, multi-step tasks | 🟢 Battle-tested |
| Reflection | Self-critique and iterate on output | Quality-critical: code review, writing | 🟢 Mature |
| Tool Use | Call external APIs and services | Any task needing live data | 🟢 Universal |
| Planning | Break a goal into a sub-task dependency graph | Long-horizon goals, complex workflows | 🟡 Maturing |
| Multi-Agent | Specialized agents collaborate | Large parallelizable workloads | 🟡 Maturing |
| Orchestrator-Worker | Decompose → delegate → verify independently | Tasks with separable verification | 🟡 Maturing |
| Evaluator-Optimizer | Score output against criteria, loop until quality is met | Output quality is the bottleneck | 🟡 Emerging |

These patterns compose. A production system is rarely just one:

```
Production Research Agent
  Orchestrator-Worker          ← task decomposition
    Worker 1: ReAct            ← sequential research
      Tool Use                 ← web search, retrieval
      Reflection               ← self-check findings
    Worker 2: ReAct            ← parallel data analysis
      Tool Use                 ← database queries
    Verifier: Evaluator        ← independent quality gate
```

The Accuracy Cascade

Here is the math that changes how you think about per-step reliability:

If each action succeeds with 85% probability, a 10-action workflow succeeds roughly 20% of the time (0.85¹⁰ ≈ 0.197).

This is why Reflection and Evaluator-Optimizer patterns exist — they push per-step accuracy from ~85% to ~95%, which transforms a 20% end-to-end success rate into 60%+. The investment in quality-checking patterns pays off exponentially in multi-step workflows.
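
The cascade in two lines of Python:

```python
def workflow_success(per_step: float, steps: int) -> float:
    return per_step ** steps

print(workflow_success(0.85, 10))  # ≈ 0.197: fails four times out of five
print(workflow_success(0.95, 10))  # ≈ 0.599: same workflow, checked steps
```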


Production Failure Modes

Across real production deployments, four failure patterns recur:

| Failure | Symptom | Root Cause | Structural Fix |
| --- | --- | --- | --- |
| Silent failure | System looks healthy; outputs are wrong | No independent evaluation layer | Separate verifier agent with scoring criteria |
| Context rot | Agent "forgets" important details mid-task | Token count has exceeded the useful recall range | Compaction strategies, progressive disclosure |
| Permission creep | Agent accumulates excessive system access | Flat permission model, no scoping | MCP default-deny, per-tool risk levels |
| Runaway execution | Uncontrolled tool calls, spiraling cost | No iteration caps, no budget tracking | Hard caps on iterations, token budget per cycle |

Real incidents: Slack AI (August 2024) — indirect prompt injection allowing data exfiltration from private channels. Salesforce Agentforce (September 2025) — malicious inputs used to leak CRM data. These are not theoretical risks; they are documented production failures that shaped the security architecture of later harnesses.

Minimum viable observability for any agent in production:

  1. Trace every tool call with full input and output
  2. Log reasoning steps at each iteration
  3. Track token usage per turn and cumulative
  4. Alert when iteration count exceeds defined threshold
  5. Record task completion time and success/failure outcome
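
Items 1 and 4 above fit in a single decorator. A minimal standard-library sketch; the cap and logger name are placeholders:

```python
import functools, json, logging, time

log = logging.getLogger("agent.trace")   # placeholder logger name
MAX_ITERATIONS = 25                      # placeholder threshold

def traced(state: dict):
    """Wraps a tool so every call is logged with full input and output."""
    def decorator(tool_fn):
        @functools.wraps(tool_fn)
        def wrapper(**kwargs):
            state["iterations"] = state.get("iterations", 0) + 1
            if state["iterations"] > MAX_ITERATIONS:
                log.error("iteration cap exceeded: %d", state["iterations"])
                raise RuntimeError("runaway execution halted")
            start = time.monotonic()
            result = tool_fn(**kwargs)
            log.info(json.dumps({
                "tool": tool_fn.__name__,
                "input": kwargs,
                "output": str(result)[:2000],
                "elapsed_s": round(time.monotonic() - start, 3),
                "iteration": state["iterations"],
            }))
            return result
        return wrapper
    return decorator
```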

The AGENTS.md Discipline

The most practically impactful idea in harness engineering requires no new infrastructure. Mitchell Hashimoto described it in his public notes on agentic development: a file called AGENTS.md (or CLAUDE.md) sitting at the root of a repository.

Every line in that file represents a real failure that was observed and encoded as a permanent constraint:

```markdown
# AGENTS.md

## Code Standards
- Never use `any` type in TypeScript — define proper interfaces
- All API responses must be wrapped in a Result<T, E> type
- Database queries must use parameterized statements only

## Architecture
- Never modify migration files after initial commit
- All database writes go through the service layer — no direct ORM calls
- New API endpoints require corresponding integration tests before merge

## Agent Behavior
- Run the full test suite before marking any task complete
- Never delete files without explicit user confirmation
- When design intent is unclear, ask — do not guess

## Project Context
- The payments service is in a separate monorepo — do not import directly
- Feature flags are managed through LaunchDarkly, not environment variables
- API v1 is frozen — all new work goes to v2
```

This file is injected into the agent’s context at session start. It is not a prompt. It is not documentation. It is a system constraint — accumulated institutional knowledge encoded as durable rules the agent cannot bypass.

Start one today. Every time an agent makes a mistake in your codebase, add a rule. Within a month, you’ll have a constraint file that materially improves agent reliability — with zero infrastructure cost.


The Anthropic Three-Agent Harness

Anthropic published their production harness for long-running development tasks in March 2026. The pattern addresses a fundamental problem with single-agent loops over long tasks: models overestimate the quality of their own output, particularly on subjective tasks like UI design.

Their solution: separate planning, execution, and evaluation into three distinct agents.

```
PLANNER
  Reads requirements
  Writes structured feature list + implementation plan
  Makes initial git commit (sets baseline)
        |
        v
GENERATOR
  Reads plan + progress file
  Implements incrementally
  Updates progress file at each context reset
  (does NOT summarize — resets cleanly with structured state)
        |
        v
EVALUATOR
  Completely independent — never sees generator's reasoning
  Calibrated with few-shot scoring examples
  Grades output against explicit criteria
  Sends pass/fail + specific feedback back to generator
```

The evaluator’s independence is critical. When the same agent that generates also evaluates, it is biased toward approving what it built. A separate evaluator, calibrated with explicit scoring criteria and examples, catches failures the generator would rationalize away.
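
In code, independence is simply what you leave out of the evaluator's input. A schematic sketch, with `llm_call` standing in for any model API:

```python
def evaluate(artifact: str, criteria: list[str], examples: list[dict],
             llm_call) -> dict:
    """The evaluator's context holds the artifact, the rubric, and
    calibration examples, never the generator's reasoning trace."""
    prompt = {
        "calibration_examples": examples,  # few-shot scoring anchors
        "criteria": criteria,              # explicit pass/fail rubric
        "artifact": artifact,              # the output itself, nothing more
    }
    return llm_call(prompt)  # -> {"pass": bool, "feedback": str}
```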

The context reset pattern is also notable: rather than compressing history when approaching the token limit, the generator starts a fresh context each session using only the structured progress file. This avoids the “caution near the context limit” behavior that compaction sometimes produces.


Framework Landscape: 2026

| Framework | Best For | Language | Multi-Agent | MCP | Learning Curve |
| --- | --- | --- | --- | --- | --- |
| LangGraph | Complex stateful workflows, production | Python | Graph-based | | High (1–2 weeks) |
| CrewAI | Rapid prototyping, role-based collaboration | Python | Role-based DSL | | Low (20 lines) |
| OpenAI Agents SDK | OpenAI-native production systems | Python | Handoff-based | | Moderate |
| Anthropic Agent SDK | Claude-native, permission-first | TypeScript | Subagent spawn | ✓ Native | Moderate |
| Google ADK | Google Cloud / Gemini ecosystem | Python | A2A protocol | | Moderate |
| Mastra | TypeScript-first teams | TypeScript | | | Moderate |

Teams commonly start with CrewAI for prototyping and migrate to LangGraph when they need production-grade state management and conditional routing. If production is the goal from day one, starting with LangGraph or the OpenAI Agents SDK avoids the migration cost.

The meta-lesson: Mastering a handful of composable design patterns matters far more than mastering any single framework. Frameworks change; patterns persist.


The Stack View

The three engineering disciplines don’t replace each other — they layer:

```
┌──────────────────────────────────────────────────────┐
│               HARNESS ENGINEERING                    │
│   Tools · Memory · Sandbox · Orchestration · Eval    │
│         "What system should I build?"                │
├──────────────────────────────────────────────────────┤
│               CONTEXT ENGINEERING                    │
│    RAG · Compaction · Progressive Disclosure         │
│           "What info should I feed?"                 │
├──────────────────────────────────────────────────────┤
│               PROMPT ENGINEERING                     │
│          CoT · Few-shot · Role prompting             │
│              "What should I say?"                    │
├──────────────────────────────────────────────────────┤
│                     MODEL                            │
│           GPT-4o · Claude · Gemini · Llama           │
└──────────────────────────────────────────────────────┘
```

Each layer builds on the one below it. You still need good prompts. You still need smart context management. But leverage has shifted upward — the largest returns in 2026 come from getting the harness right.

The discipline has traveled a long distance from “type a question, marvel at the answer.” What we have now is not a more refined chat interface — it is a class of distributed systems with its own failure modes, its own infrastructure patterns, and its own engineering discipline. The question is no longer “how do I talk to AI?” It is: “what do I build around it?”

