The first time I loaded up the Hermes Agent source, I expected a familiar pattern: an orchestration loop, some tool wrappers, maybe a memory module. What I found was something closer to a runtime. Eight separate loops, each operating at a different timescale, each with a distinct trigger condition and output contract.
This post is the technical breakdown I wish I’d had when I started reading. I’ll go through each loop in order of timescale — from sub-second token generation up to cross-session skill distillation — with pseudo-code for the trigger logic, timing diagrams, and notes on how the loops couple to each other.
Why Timescale Matters
Most agent frameworks conflate what are actually distinct feedback cycles. The loop that decides which token to emit next operates in milliseconds. The loop that decides whether today’s conversations have revealed a new capability worth persisting operates in hours. Mixing those into a single pass means you’re running the expensive slow decisions too often and the cheap fast decisions too infrequently.
Hermes separates them explicitly. Here is the high-level view:
gantt
title Hermes Agent — 8 Loops at Different Timescales
dateFormat X
axisFormat %s
section Token (ms)
Loop 1 · Token sampling :active, l1, 0, 1
section Turn (s)
Loop 2 · Tool execution :l2, 0, 10
Loop 3 · Context compression :l3, 2, 10
section Task (min)
Loop 4 · Sub-agent orchestration :l4, 0, 300
Loop 5 · Feedback & correction :l5, 60, 300
section Session (hour)
Loop 6 · Memory consolidation :l6, 0, 3600
section Cross-session (day)
Loop 7 · Skill distillation :l7, 0, 86400
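Loop 8 · Meta-evaluation :l8, 0, 86400
Each row is a loop. The x-axis is wall-clock seconds, best read as a log scale (Mermaid’s Gantt axis is linear, so the bar lengths are indicative only). The key observation: at roughly 10 ms per token and 10 s per turn, Loop 1 completes ~1000 times before Loop 2 completes once, and Loop 2 completes ~360 times (3600 s / 10 s) before Loop 6 completes once.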
Loop 1 — Token Sampling (milliseconds)
This is the innermost loop. It runs once per output token and is the only loop that touches the LLM inference path directly.
# pseudo-code: token sampling loop
def sample_loop(prompt: str, config: SamplingConfig) -> Generator[Token, None, None]:
    kv_cache = init_kv_cache(prompt)
    while True:
        logits = model.forward(kv_cache)
        logits = apply_logit_processors(logits, config)  # repetition penalty, etc.
        token = sample(logits, temperature=config.temp, top_p=config.top_p)
        yield token
        if is_stop_token(token):
            break
        kv_cache = extend(kv_cache, token)
The interesting part is apply_logit_processors. Hermes inserts several processors here that are aware of higher-level state — for example, a processor that suppresses tool-call tokens if the current context budget is below a threshold. This is how Loop 1 couples upward to Loop 3 (context compression): the compression loop writes a context_pressure scalar to shared state, and the token loop reads it to bias sampling.
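A minimal sketch of what such a pressure-aware processor could look like. `SharedState`, `suppress_tool_calls`, the fixed `penalty`, and the `0.70` threshold are illustrative assumptions, not names or values taken from the Hermes source; logits are plain Python lists to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class SharedState:
    # Hypothetical container: written by the compression loop, read by the token loop.
    context_pressure: float = 0.0


def suppress_tool_calls(
    logits: list[float],
    tool_token_ids: set[int],
    state: SharedState,
    threshold: float = 0.70,
    penalty: float = 10.0,
) -> list[float]:
    """Return a copy of logits with tool-call tokens penalised while pressure is high."""
    if state.context_pressure <= threshold:
        return list(logits)
    return [x - penalty if i in tool_token_ids else x for i, x in enumerate(logits)]


state = SharedState(context_pressure=0.80)
biased = suppress_tool_calls([1.0, 2.0, 3.0], tool_token_ids={2}, state=state)
```

Because the processor only reads a scalar, the token loop never needs to know how compression works; it just sees a number that changes between forward passes.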
What triggers it: every forward pass. No external trigger needed.
What it produces: a token stream that is handed to Loop 2.
Loop 2 — Tool Execution (seconds)
Loop 2 consumes the token stream from Loop 1, detects structured tool-call blocks, and executes them. Most agent frameworks treat this as the only loop.
# pseudo-code: tool execution loop
from itertools import chain

def tool_loop(token_stream: Generator, tools: ToolRegistry, ctx: TurnContext):
    buffer = TokenBuffer()
    pending = iter(token_stream)
    # Pull tokens with next() so that rebinding `pending` below takes effect;
    # a `for` loop would keep reading the stream it started with.
    while (token := next(pending, None)) is not None:
        buffer.append(token)
        if buffer.has_complete_tool_call():
            call = buffer.pop_tool_call()
            result = tools.execute(call, timeout=ctx.tool_timeout)
            ctx.tool_results.append(result)
            # re-inject result as continuation tokens
            pending = chain(result_tokens(result), pending)
        if buffer.has_final_response():
            return buffer.final_response()
The re-inject step is key. Rather than appending results to a message list and calling the model again from scratch, Hermes appends the tool result as tokens and continues the same inference pass where it left off. This keeps the KV cache warm and cuts per-tool-call latency by roughly 40% in my benchmarks.
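The splice itself can be tried in isolation. The toy below is only a sketch: plain strings stand in for tokens, the literal `"<call>"` marker stands in for a complete tool-call block, and `run_turn` is a hypothetical name. It does not model the KV cache; it only shows how result tokens get placed ahead of the rest of the stream.

```python
from itertools import chain
from typing import Iterable, Iterator


def run_turn(tokens: Iterable[str], tool_result: list[str]) -> list[str]:
    """Consume tokens; when '<call>' appears, splice tool_result in right after it."""
    pending: Iterator[str] = iter(tokens)
    seen: list[str] = []
    while (tok := next(pending, None)) is not None:
        seen.append(tok)
        if tok == "<call>":
            # Rebinding `pending` changes what the next `next()` call reads.
            pending = chain(tool_result, pending)
    return seen


out = run_turn(["a", "<call>", "b"], ["r1", "r2"])
```

Pulling with `next()` matters in Python: a `for` loop binds to the iterator it was given and would ignore the rebinding.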
What triggers it: a complete tool-call block in the token stream.
What it produces: enriched TurnContext with tool results attached.
Loop 3 — Context Compression (seconds, concurrent with Loop 2)
Loop 3 runs in a background thread during every turn. Its job is to monitor the growing context window and compress it before it overflows.
# pseudo-code: context compression loop
def compression_loop(ctx: TurnContext, config: CompressionConfig):
    while ctx.is_active:
        usage = ctx.token_count / ctx.max_tokens
        if usage > config.soft_limit:  # default: 0.70
            ctx.context_pressure = usage  # signal to Loop 1
        if usage > config.hard_limit:  # default: 0.85
            compressed = compress(
                ctx.messages,
                strategy=config.strategy,  # "recency" | "importance" | "hybrid"
                target_ratio=config.target  # default: 0.50
            )
            ctx.messages = compressed
            ctx.token_count = count(compressed)
        time.sleep(config.poll_interval)  # default: 0.5s
The compression algorithm deserves its own section.
Compression Strategies
| Strategy | How it works | Best for |
|---|---|---|
| recency | Keep the N most recent messages, drop the rest | Conversations where only the latest context matters |
| importance | Score each message by tool-result count, user-signal markers, and semantic centrality; drop low scorers | Research tasks with many intermediate steps |
| hybrid | Importance scoring, but with a recency floor (last 5 messages are always kept) | Default; works well across task types |
The hybrid strategy generates a summary block when messages are dropped:
[COMPRESSED: 14 messages summarized]
User asked to analyze Q1 revenue data.
Assistant retrieved sales_2026_q1.csv (4,200 rows), computed total $2.3M,
identified top SKU as "Pro-Annual" at 34% share.
[END COMPRESSED]
This summary is inserted at the compression point, so the model retains semantic continuity even though the raw messages are gone.
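A rough sketch of the hybrid strategy under some loud assumptions: importance scores are precomputed and attached to each message, the cut-off `min_importance` is invented for illustration, the "summary" is a plain concatenation standing in for a real summarizer, and the summary block is placed at the front of the kept messages. `Message` and `hybrid_compress` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    importance: float  # assumed to be precomputed by some scorer


def hybrid_compress(
    messages: list[Message], keep_recent: int = 5, min_importance: float = 0.5
) -> list[Message]:
    """Keep the recency floor plus important older messages; summarise the rest."""
    split = max(len(messages) - keep_recent, 0)
    older, recent = messages[:split], messages[split:]
    kept = [m for m in older if m.importance >= min_importance]
    dropped = [m for m in older if m.importance < min_importance]
    if not dropped:
        return list(messages)
    summary = Message(
        text=f"[COMPRESSED: {len(dropped)} messages summarized]\n"
        + " ".join(m.text for m in dropped)  # placeholder for a real summariser
        + "\n[END COMPRESSED]",
        importance=1.0,
    )
    return [summary] + kept + recent
```

The recency floor is applied before scoring, so a low-scoring recent message is never a candidate for dropping.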
What triggers it: polling (every 500ms) during an active turn.
What it produces: a compressed ctx.messages list and a context_pressure scalar read by Loop 1.
Loop 4 — Sub-Agent Orchestration (minutes)
Loop 4 activates when the task planner determines that work can be parallelized. It spawns sub-agents, coordinates their execution, and merges their outputs.
# pseudo-code: sub-agent orchestration loop
def orchestration_loop(task: Task, planner: TaskPlanner, registry: AgentRegistry):
    plan = planner.decompose(task)  # returns DAG of subtasks
    running: dict[str, Future] = {}
    for subtask in plan.ready_tasks():  # tasks with no unmet deps
        agent_spec = registry.select(subtask)
        future = spawn_agent(agent_spec, subtask)
        running[subtask.id] = future
    while running:
        done_id, result = await_first(running)
        plan.mark_complete(done_id, result)
        running.pop(done_id)
        for subtask in plan.newly_ready():  # deps now satisfied
            agent_spec = registry.select(subtask)
            future = spawn_agent(agent_spec, subtask)
            running[subtask.id] = future
    return plan.aggregate_results()
The DAG structure is what separates this from naive parallel execution. A research task might look like:
fetch_paper_A ──┐
fetch_paper_B ──┼──► summarize_all ──► write_report
fetch_paper_C ──┘
fetch_paper_* tasks run in parallel. summarize_all waits for all three before starting. write_report waits for the summary. The orchestration loop handles this dependency graph automatically — the parent agent writes the decomposition, and Loop 4 drives execution.
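The same dependency-driven scheduling can be sketched with the standard library alone. This is not how Hermes implements it; it is a small stand-in that uses `graphlib.TopologicalSorter` for readiness tracking and a thread pool in place of sub-agents. `run_dag` and the `work` callback are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(deps: dict[str, set[str]], work) -> list[str]:
    """Run work(name) for each node once its dependencies finish; return completion order."""
    sorter = TopologicalSorter(deps)  # maps each node to the nodes it depends on
    sorter.prepare()
    finished: list[str] = []
    with ThreadPoolExecutor() as pool:
        running = {}
        while sorter.is_active():
            for name in sorter.get_ready():  # nodes whose deps are all done
                running[pool.submit(work, name)] = name
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any subtask failure
                name = running.pop(future)
                finished.append(name)
                sorter.done(name)
    return finished


deps = {
    "summarize_all": {"fetch_paper_A", "fetch_paper_B", "fetch_paper_C"},
    "write_report": {"summarize_all"},
}
order = run_dag(deps, work=lambda name: name)
```

The three fetch tasks may finish in any order, but `summarize_all` is only submitted after all of them are marked done, and `write_report` always comes last.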
Sub-Agent Spawning Mechanics
Each sub-agent receives:
- A task specification (structured JSON)
- A subset of the parent’s context (only what’s relevant to the subtask)
- A tool allowlist (sub-agents cannot use tools the parent didn’t grant)
- A result schema (what structure the sub-agent must return)
{
  "task": "summarize_paper",
  "input": { "paper_id": "arxiv:2406.1234", "focus": "methodology" },
  "tools": ["fetch_url", "read_file"],
  "result_schema": {
    "type": "object",
    "required": ["summary", "key_contributions", "limitations"]
  },
  "timeout_seconds": 120
}
The result schema enforcement is done at Loop 2 level in the sub-agent — if the sub-agent’s final response doesn’t validate against the schema, Loop 4 requests a retry before accepting the result.
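A minimal sketch of the validate-then-retry behaviour, assuming only the `required` part of the schema is checked and that a sub-agent can be modelled as a zero-argument callable. `missing_required`, `run_with_schema`, and `max_retries` are hypothetical; a real implementation would more likely use a full JSON Schema validator.

```python
from typing import Callable


def missing_required(result: dict, schema: dict) -> list[str]:
    """List required keys absent from result (only 'required' is checked here)."""
    return [key for key in schema.get("required", []) if key not in result]


def run_with_schema(agent: Callable[[], dict], schema: dict, max_retries: int = 1) -> dict:
    """Call agent until its result has all required keys or retries run out."""
    for _ in range(max_retries + 1):
        result = agent()
        if not missing_required(result, schema):
            return result
    raise ValueError(f"result still missing {missing_required(result, schema)}")


schema = {"type": "object", "required": ["summary", "key_contributions", "limitations"]}
attempts = iter([
    {"summary": "s"},  # first attempt: incomplete
    {"summary": "s", "key_contributions": [], "limitations": []},
])
result = run_with_schema(lambda: next(attempts), schema)
```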
What triggers it: planner output containing a decomposable task DAG.
What it produces: aggregated subtask results merged back into the parent task context.
Loop 5 — Feedback and Correction (minutes, within a task)
Loop 5 is a quality loop that runs after each major subtask completes. It evaluates the output against the original intent and decides whether to continue, retry, or escalate.
# pseudo-code: feedback loop
def feedback_loop(subtask: Subtask, result: Result, evaluator: Evaluator):
    score = evaluator.evaluate(subtask, result)
    # score is a dict: {"correctness": 0-1, "completeness": 0-1, "confidence": 0-1,
    #                   "missing_aspects": [...]}
    if score["correctness"] < 0.5:
        return Action.RETRY(reason="correctness below threshold", max_retries=2)
    if score["completeness"] < 0.7 and subtask.retry_count < 2:
        amended = amend_subtask(subtask, gaps=score["missing_aspects"])
        return Action.RETRY(subtask=amended)
    if score["confidence"] < 0.4:
        return Action.ESCALATE(reason="low confidence, needs human review")
    return Action.ACCEPT(result)
The evaluator is itself an LLM call, but a cheap one — it uses a smaller model (Hermes defaults to a 7B evaluator) with a structured rubric prompt. The rubric is task-type-specific: a coding task uses a different rubric than a research task.
This is the loop that prevents the silent failures common in single-pass agents. Without Loop 5, a sub-agent that returns plausible-but-wrong output gets accepted and propagates its error forward. With Loop 5, that result gets retried or flagged before it poisons the downstream work.
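The decision rule is small enough to write as a pure function, which makes the thresholds easy to test. The thresholds below are the ones from the pseudo-code; the `Decision` enum, the `decide` name, and the choice to escalate once correctness retries are exhausted are assumptions added for this sketch.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    RETRY = "retry"
    ESCALATE = "escalate"


def decide(score: dict[str, float], retry_count: int, max_retries: int = 2) -> Decision:
    """Map rubric scores to a decision using the thresholds from the pseudo-code."""
    if score["correctness"] < 0.5:
        # Assumption: once retries are used up, hand the result to a human.
        return Decision.RETRY if retry_count < max_retries else Decision.ESCALATE
    if score["completeness"] < 0.7 and retry_count < max_retries:
        return Decision.RETRY
    if score["confidence"] < 0.4:
        return Decision.ESCALATE
    return Decision.ACCEPT
```

Keeping the rule separate from the evaluator call means the rubric model can be swapped without touching the retry policy.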
What triggers it: subtask completion within an orchestrated task.
What it produces: ACCEPT/RETRY/ESCALATE decisions; amended subtask specs on retry.
Loop 6 — Memory Consolidation (end of session)
At session end, Loop 6 processes everything that happened and writes durable memory.
# pseudo-code: memory consolidation
def consolidation_loop(session: Session, memory_store: MemoryStore):
    # 1. Extract entities and facts
    entities = extract_entities(session.messages)
    facts = extract_facts(session.messages)
    # 2. Merge with existing memory (dedup + update)
    for entity in entities:
        existing = memory_store.get_entity(entity.id)
        if existing:
            memory_store.update(existing, entity, strategy="merge_prefer_newer")
        else:
            memory_store.insert(entity)
    # 3. Write episodic summary
    summary = summarize_session(session, focus="outcomes_and_decisions")
    memory_store.insert_episode(
        session_id=session.id,
        timestamp=session.end_time,
        summary=summary,
        linked_entities=[e.id for e in entities]
    )
    # 4. Update user model
    user_signals = extract_user_signals(session.messages)
    memory_store.update_user_model(session.user_id, user_signals)
The memory store has three layers:
| Layer | Content | Retrieval |
|---|---|---|
| Entity store | Named entities with attributes (people, projects, orgs) | By entity ID or semantic search |
| Episode store | Session summaries with entity links | By recency or semantic search |
| User model | Inferred user preferences, expertise level, communication style | Loaded at session start, always in context |
The user model is small (typically under 500 tokens) and is prepended to every new session’s system prompt. This is why Hermes “remembers” that you prefer terse answers, or that you’re an expert in Rust but a beginner in ML: it’s not magic, it’s Loop 6 writing observations at session end and the next session loading them back at startup.
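The `merge_prefer_newer` strategy named in the consolidation pseudo-code is not spelled out, so the following is one plausible reading rather than the actual behaviour: attributes are unioned, the newer value wins on conflict, and a `None` in the newer record does not erase a known fact. Entities are modelled as plain dicts for brevity.

```python
def merge_prefer_newer(existing: dict, incoming: dict) -> dict:
    """Union of attributes; on conflict the incoming (newer) value wins.

    None values in `incoming` are ignored so they do not erase known facts.
    """
    merged = dict(existing)
    merged.update({k: v for k, v in incoming.items() if v is not None})
    return merged


old = {"name": "Ada", "role": "engineer", "team": "infra"}
new = {"role": "staff engineer", "team": None, "location": "Berlin"}
merged = merge_prefer_newer(old, new)
```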
What triggers it: session end event.
What it produces: updated entity store, new episode record, updated user model.
Loop 7 — Skill Distillation (daily)
Loop 7 runs on a cron schedule — once per day by default — and looks for patterns across recent episodes that can be crystallized into reusable skills.
# pseudo-code: skill distillation
def distillation_loop(memory_store: MemoryStore, skill_store: SkillStore):
    # Look at last 7 days of episodes
    recent_episodes = memory_store.get_episodes(days=7)
    # Find recurring task patterns
    patterns = cluster_by_task_type(recent_episodes, min_cluster_size=3)
    for pattern in patterns:
        # Extract the best execution trace from this pattern
        best_trace = max(pattern.traces, key=lambda t: t.feedback_score)
        # Check if this pattern already has a skill that scores at least as well
        existing = skill_store.get(pattern.task_type)
        if existing and existing.score >= best_trace.feedback_score:
            continue
        # Distill: compress the trace into a reusable skill file
        skill = distill_skill(
            task_type=pattern.task_type,
            exemplar_trace=best_trace,
            format="markdown_with_yaml_frontmatter"
        )
        skill_store.upsert(skill)
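The selection step (cluster, apply the minimum size, pick the best trace) can be sketched on its own. This assumes the simplest possible clustering, grouping by an exact `task_type` label; the real `cluster_by_task_type` may well be semantic. `Trace` and `best_traces` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trace:
    task_type: str
    feedback_score: float


def best_traces(traces: list[Trace], min_cluster_size: int = 3) -> dict[str, Trace]:
    """Group by task_type; for groups that are large enough, keep the top-scoring trace."""
    groups: dict[str, list[Trace]] = defaultdict(list)
    for trace in traces:
        groups[trace.task_type].append(trace)
    return {
        task_type: max(group, key=lambda t: t.feedback_score)
        for task_type, group in groups.items()
        if len(group) >= min_cluster_size
    }


traces = [
    Trace("analyze_csv", 0.70),
    Trace("analyze_csv", 0.91),
    Trace("analyze_csv", 0.85),
    Trace("draft_email", 0.95),  # only one occurrence: below min_cluster_size
]
exemplars = best_traces(traces)
```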
Skill Serialization Format
Skills are stored as Markdown files with YAML frontmatter. This makes them human-readable, version-controllable, and injectable into the context window directly.
---
skill_id: "analyze_csv_data"
version: 3
created: 2026-06-04
last_updated: 2026-06-11
avg_feedback_score: 0.91
applicable_when:
- user asks to analyze tabular data
- input contains .csv or .xlsx files
tools_used: [read_file, python_exec]
---
## Skill: Analyze CSV Data
**Step 1 — Load and profile**
Read the file with `read_file`. Count rows, columns, null rates.
Always report the schema before proceeding.
**Step 2 — Identify the question type**
- Trend analysis → sort by date column, compute period-over-period delta
- Ranking → sort by metric column descending, return top-N with percentages
- Anomaly detection → compute z-scores, flag |z| > 2.5
**Step 3 — Compute and narrate**
Run the relevant computation in `python_exec`. Narrate the result in plain language
before presenting numbers. Users read the prose first.
**Known pitfalls:**
- Excel files often have merged header cells — read row 0 and row 1 separately
- Encoding issues: default to UTF-8, fallback to latin-1 on decode error
When the planner sees a task that matches a skill’s applicable_when conditions, it loads the skill file into the system prompt before the turn starts. The model starts from a distillation of the best-scoring previous execution of this task type instead of planning from scratch.
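Loading a skill into the prompt mostly comes down to separating the frontmatter from the Markdown body. The sketch below does only that, with plain string handling; it does not parse the YAML or evaluate `applicable_when`, and both function names are hypothetical.

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a skill file into (frontmatter, body); ("", text) if there is no frontmatter."""
    if not text.startswith("---\n"):
        return "", text
    header, sep, body = text[4:].partition("\n---\n")
    if not sep:  # opening delimiter without a closing one
        return "", text
    return header, body.lstrip("\n")


def inject_skill(system_prompt: str, skill_text: str) -> str:
    """Append the skill body (without its frontmatter) to the system prompt."""
    _, body = split_frontmatter(skill_text)
    return f"{system_prompt}\n\n{body}"


skill_file = "---\nskill_id: x\nversion: 3\n---\n## Skill: X\nStep 1\n"
header, body = split_frontmatter(skill_file)
prompt = inject_skill("base", skill_file)
```

Leaving the frontmatter out of the prompt keeps bookkeeping fields such as version and score from spending context tokens.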
What triggers it: daily cron (configurable).
What it produces: new or updated skill files in the skill store.
Loop 8 — Meta-Evaluation (daily)
Loop 8 runs after Loop 7 and evaluates the agent’s own performance trends. Its output is a health report and, in some configurations, automatic parameter adjustments.
# pseudo-code: meta-evaluation loop
def meta_evaluation_loop(memory_store: MemoryStore, config: AgentConfig):
    stats = compute_period_stats(memory_store, days=7)
    # stats: {task_success_rate, avg_feedback_score, tool_error_rate,
    #         retry_rate, escalation_rate, context_overflow_rate}
    report = {
        "period": "2026-06-04 to 2026-06-11",  # illustrative; matches the 7-day window
        "stats": stats,
        "regressions": detect_regressions(stats, memory_store.get_baseline()),
        "recommendations": generate_recommendations(stats)
    }
    if config.auto_tune:
        for rec in report["recommendations"]:
            if rec.confidence > 0.8 and rec.risk == "low":
                apply_recommendation(rec, config)
    memory_store.insert_meta_report(report)
    return report
The meta-evaluation loop is the one that closes the system at the highest level. If the retry rate has been climbing over the past week, Loop 8 surfaces that. If a particular tool has been erroring at a 20% rate, Loop 8 flags it. If the context overflow rate is rising, Loop 8 might recommend lowering the compression hard_limit so that compression starts earlier.
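One simple way to implement the regression check is a per-metric comparison against the baseline with a tolerance band. The metric names come from the stats comment in the pseudo-code; the direction table, the absolute `tolerance`, and the function body are assumptions for illustration.

```python
# Metrics where a higher value is better; everything else is treated as lower-is-better.
HIGHER_IS_BETTER = {"task_success_rate", "avg_feedback_score"}


def detect_regressions(
    stats: dict[str, float], baseline: dict[str, float], tolerance: float = 0.05
) -> list[str]:
    """Return metric names that moved in the wrong direction by more than tolerance."""
    regressions = []
    for name, base in baseline.items():
        if name not in stats:
            continue
        delta = stats[name] - base
        worse = -delta if name in HIGHER_IS_BETTER else delta
        if worse > tolerance:
            regressions.append(name)
    return regressions


baseline = {"task_success_rate": 0.90, "retry_rate": 0.10, "tool_error_rate": 0.04}
stats = {"task_success_rate": 0.80, "retry_rate": 0.30, "tool_error_rate": 0.05}
flagged = detect_regressions(stats, baseline)
```

An absolute tolerance is crude for metrics with very different ranges; a relative threshold or a per-metric table would be a natural refinement.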
What triggers it: daily cron, runs after Loop 7.
What it produces: performance report, optional config adjustments, updated baseline metrics.
Data Flow Between Loops
The loops are not independent — they share state through a set of well-defined stores. Here is the data flow diagram:
flowchart TD
subgraph Turn["Turn Scope"]
L1[Loop 1\nToken Sampling]
L2[Loop 2\nTool Execution]
L3[Loop 3\nContext Compression]
L1 -->|token stream| L2
L3 -->|context_pressure| L1
L2 -->|enriched context| L3
end
subgraph Task["Task Scope"]
L4[Loop 4\nSub-Agent Orchestration]
L5[Loop 5\nFeedback & Correction]
L4 -->|subtask result| L5
L5 -->|retry / accept| L4
end
subgraph Session["Session Scope"]
L6[Loop 6\nMemory Consolidation]
end
subgraph Daily["Daily Scope"]
L7[Loop 7\nSkill Distillation]
L8[Loop 8\nMeta-Evaluation]
L7 --> L8
end
Turn -->|TurnContext| Task
Task -->|task outcomes| Session
Session -->|episodes, entities| Daily
MS[(Memory Store)]
SS[(Skill Store)]
UC[(User Model / Config)]
L6 -->|writes| MS
L6 -->|writes| UC
L7 -->|reads| MS
L7 -->|writes| SS
L8 -->|reads| MS
L8 -->|writes| UC
UC -->|loaded at session start| L1
SS -->|loaded at task start| L4
The important property is that tight feedback cycles stay inside a single timescale, while coupling across timescales is delayed and asynchronous. Within a single turn, L1 → L2 → L3 → L1 is a tight cycle. But L6’s output only becomes L1’s input in the next session.
This means the system is stable. A bug in Loop 7 doesn’t crash a running turn. A poorly distilled skill doesn’t corrupt the memory store — it just sits in the skill store and gets overwritten the next time Loop 7 runs with better data.
How the Loops Create Emergent Improvement
The compound effect is what makes this architecture interesting. Consider what happens over 30 days of regular use:
Day 1: No skills in the skill store. The user model is empty. Every task is planned from scratch. Context compression uses the default hybrid strategy.
Day 3: Loop 6 has written 6 episode records and started building the user model. The model now knows you prefer code examples over prose explanations.
Day 7: Loop 7 has been running daily, and it now has a full seven-day window to work with. Three task patterns have appeared at least three times: “analyze a CSV,” “write a draft email,” “summarize a research paper.” Three skill files are created.
Day 14: The skill files have version 2 entries. The user model is richer. Loop 8 has identified that the context overflow rate is high on research tasks and has adjusted the compression hard_limit downward for that task type.
Day 30: Fifteen skills in the store. The agent approaches CSV analysis tasks with the distilled experience of all previous attempts. Feedback scores on those tasks are measurably higher than week one. The meta-evaluation baseline has updated to reflect the new normal, so Loop 8 is now detecting regressions against a more accurate benchmark.
None of this required explicit configuration changes. The improvement emerged from the loops doing their jobs across their respective timescales.
What This Means for Builders
If you’re building on top of Hermes Agent, or designing a similar architecture, the practical takeaways are:
- Separate your loops by timescale from day one. Mixing a 50ms token decision with a 30-second correction decision into one loop is how you get expensive systems that still make avoidable errors.
- Make the skill store human-readable. YAML+Markdown means your team can inspect, edit, and version-control the distilled knowledge. A binary embedding store cannot be debugged by a human.
- The evaluator in Loop 5 doesn’t need to be large. A 7B model with a well-structured rubric can outperform a 70B model with a vague prompt. Write the rubric carefully.
- Loop 8 is where you close the improvement cycle. Most agent frameworks stop at Loop 6 (memory). Without Loop 8, you’re collecting data but never using it to adjust behavior. The meta-evaluation loop is what turns memory into improvement.
- Guard the cross-timescale boundaries. The asynchronous coupling between loops (session → daily, daily → next-session) is the source of stability. Avoid shortcuts that make it synchronous, such as running skill distillation at the end of every session instead of daily. That turns a stable asynchronous update into a synchronous bottleneck.
The architecture is not magic. Each loop individually is straightforward. The emergent behavior comes from running all eight, at the right timescales, with clean data contracts between them.
That’s the design. Go read the source — it rewards close reading.