Software quality degrades in predictable ways: coverage gaps that no one tracks, security patterns that slip through review, documentation that falls behind the code. Every team knows these failure modes. Most teams handle them the same way — periodic manual audits, ad-hoc checklist reviews, the occasional panic before a release. The problem isn’t awareness. It’s that quality enforcement requires sustained, consistent attention that humans are bad at providing at scale.

A QC agent built on Hermes changes that equation. It runs the same checks every time, remembers what your team cares about, and gets better at catching your specific failure patterns the longer it operates. This post walks through how to build one — architecture, the five core skills you need pre-built, CI/CD integration, and the feedback loop that makes the agent genuinely learn your standards.

What a Hermes-Based QC Agent Looks Like

Hermes agents are built around three primitives: skills (atomic capabilities), loops (execution patterns), and memory (persistent context that accumulates across runs). A QC agent uses all three.

The agent receives a trigger — a pull request, a scheduled run, a CI/CD event — and fans out across five quality dimensions in parallel. It collects results, applies learned thresholds, makes pass/fail decisions, and writes structured findings back to the appropriate surfaces (PR comments, dashboards, Slack).

┌─────────────────────────────────────────────────────────┐
│                     QC Agent Core                        │
│                                                         │
│  trigger → router → [skill dispatch via Loop 6]        │
│                ↓                                        │
│         result aggregator                               │
│                ↓                                        │
│    threshold evaluator ← quality memory                 │
│                ↓                                        │
│         report emitter → [PR / Slack / dashboard]      │
└─────────────────────────────────────────────────────────┘

The agent’s configuration lives in a single YAML file that defines which skills run, what thresholds trigger failures, and where results go:

# hermes-qc.yml
agent:
  name: qc-agent
  version: "1.0"

skills:
  - code-review
  - test-coverage
  - security-scan
  - performance-regression
  - documentation-check

execution:
  loop: kanban-parallel   # Loop 6
  max_concurrency: 5
  timeout_per_skill: 120s

thresholds:
  test_coverage_min: 80
  security_critical_max: 0
  security_high_max: 2
  perf_regression_max_pct: 10
  doc_coverage_min: 70

outputs:
  - type: github-pr-comment
    on: [failure, warning]
  - type: slack
    channel: "#engineering-quality"
    on: [failure]
  - type: dashboard
    endpoint: "${QC_DASHBOARD_URL}"
    on: [always]
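
To make the thresholds block concrete, here is a minimal sketch of how a threshold evaluator might apply it. The `Thresholds` keys mirror the YAML above; the `Metrics` shape and the `violatedThresholds` function are illustrative assumptions, not a documented Hermes API.

```typescript
// Mirrors the `thresholds` keys in hermes-qc.yml (assumed shape).
interface Thresholds {
  test_coverage_min: number;       // percent
  security_critical_max: number;   // finding count
  security_high_max: number;       // finding count
  perf_regression_max_pct: number; // percent
  doc_coverage_min: number;        // percent
}

// Hypothetical aggregate of what the five skills report.
interface Metrics {
  testCoverage: number;
  securityCritical: number;
  securityHigh: number;
  worstPerfRegressionPct: number;
  docCoverage: number;
}

// Returns the names of every threshold the metrics violate.
function violatedThresholds(m: Metrics, t: Thresholds): string[] {
  const violations: string[] = [];
  if (m.testCoverage < t.test_coverage_min) violations.push("test_coverage_min");
  if (m.securityCritical > t.security_critical_max) violations.push("security_critical_max");
  if (m.securityHigh > t.security_high_max) violations.push("security_high_max");
  if (m.worstPerfRegressionPct > t.perf_regression_max_pct) violations.push("perf_regression_max_pct");
  if (m.docCoverage < t.doc_coverage_min) violations.push("doc_coverage_min");
  return violations;
}

const qcThresholds: Thresholds = {
  test_coverage_min: 80,
  security_critical_max: 0,
  security_high_max: 2,
  perf_regression_max_pct: 10,
  doc_coverage_min: 70,
};

const qcViolations = violatedThresholds(
  { testCoverage: 74.2, securityCritical: 0, securityHigh: 1, worstPerfRegressionPct: 4, docCoverage: 72 },
  qcThresholds,
);
console.log(qcViolations); // only "test_coverage_min" is violated
```

Keeping the comparison logic in one pure function like this makes it easy to unit test threshold changes before they reach the pipeline.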

The Five Core QC Skills

Each skill is a self-contained module with a defined input schema, execution logic, and structured output. Pre-building these as reusable skills means the QC agent can compose them without duplication across projects.

1. Code Review Skill

The code review skill goes beyond style linting. It analyzes diffs for logic errors, identifies anti-patterns specific to your codebase, and flags deviations from team conventions the agent has learned.

// skills/code-review.ts
export const codeReviewSkill: HermesSkill = {
  name: "code-review",
  input: z.object({
    diff: z.string(),
    context: z.object({
      repo: z.string(),
      base_branch: z.string(),
      pr_number: z.number().optional(),
    }),
  }),
  async execute({ diff, context }, memory) {
    const teamPatterns = await memory.get("team-anti-patterns") ?? [];
    const recentFailures = await memory.get("recent-review-failures") ?? [];

    const findings = await llmReview(diff, {
      focus: ["logic-errors", "security", "performance", "team-patterns"],
      knownPatterns: teamPatterns,
      recentContext: recentFailures.slice(-10),
    });

    // Update memory with new patterns found
    if (findings.newPatterns.length > 0) {
      await memory.append("team-anti-patterns", findings.newPatterns);
    }

    return {
      severity: findings.maxSeverity,
      findings: findings.items,
      suggestions: findings.suggestions,
    };
  },
};

The key detail here is the memory integration. Every review run reads from accumulated team patterns and writes back new ones. The agent learns that your team uses Result<T> types instead of throwing exceptions, or that certain API endpoints require explicit rate-limit handling — and it enforces those conventions automatically in future reviews.
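
The skills in this post call `memory.get`, `memory.set`, and `memory.append`. The following is a minimal in-memory sketch of that contract, useful for local testing. The `SkillMemory` interface is inferred from how the skills use it and is an assumption, not the framework's actual type; a real deployment would need a persistent store.

```typescript
// Hypothetical memory contract, inferred from the skill code in this post.
interface SkillMemory {
  get<T>(key: string): Promise<T | undefined>;
  set<T>(key: string, value: T): Promise<void>;
  append<T>(key: string, items: T[]): Promise<void>;
}

// In-process implementation: contents are lost when the process exits.
class InMemorySkillMemory implements SkillMemory {
  private store = new Map<string, unknown>();

  async get<T>(key: string): Promise<T | undefined> {
    return this.store.get(key) as T | undefined;
  }

  async set<T>(key: string, value: T): Promise<void> {
    this.store.set(key, value);
  }

  // Appends to the array stored under `key`, creating it if absent.
  async append<T>(key: string, items: T[]): Promise<void> {
    const existing = (this.store.get(key) as T[] | undefined) ?? [];
    this.store.set(key, [...existing, ...items]);
  }
}
```

With this in place, a test can seed "team-anti-patterns" before running the code review skill and then assert on what the skill wrote back.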

2. Test Coverage Check Skill

Coverage analysis without context is noise. The test coverage skill compares against baseline metrics, identifies which new lines are uncovered, and distinguishes between coverage gaps that matter (core logic, public APIs) and those that don’t (generated code, test utilities).

// skills/test-coverage.ts
export const testCoverageSkill: HermesSkill = {
  name: "test-coverage",
  input: z.object({
    coverage_report: z.string(),  // lcov or JSON format
    changed_files: z.array(z.string()),
    config: z.object({ min_coverage: z.number() }),
  }),
  async execute({ coverage_report, changed_files, config }, memory) {
    const baseline = await memory.get("coverage-baseline") ?? {};
    const parsed = parseCoverage(coverage_report);

    // Treat a changed file with no coverage data as 0% covered, so new
    // untested files count as regressions instead of being silently skipped.
    const linePct = (f: string) => parsed[f]?.lines ?? 0;

    const regressions = changed_files
      .filter(f => linePct(f) < (baseline[f]?.lines ?? config.min_coverage))
      .map(f => ({
        file: f,
        current: linePct(f),
        baseline: baseline[f]?.lines ?? config.min_coverage,
        uncovered: parsed[f]?.uncoveredLines ?? [],
      }));

    // Raise the stored baseline for files that improved, then write it back
    // under the same key it was read from.
    for (const f of changed_files) {
      if (linePct(f) > (baseline[f]?.lines ?? 0)) {
        baseline[f] = parsed[f];
      }
    }
    await memory.set("coverage-baseline", baseline);

    return {
      overall: parsed.__total,
      regressions,
      passed: regressions.length === 0 && parsed.__total >= config.min_coverage,
    };
  },
};
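
The skill above relies on a `parseCoverage` helper that isn't shown. As a rough sketch of what it could look like for lcov input, the function below reads the `SF:`, `DA:`, and `end_of_record` lines of an lcov tracefile and produces per-file line percentages plus a `__total` entry. The `FileCoverage` shape and the name `parseLcov` are assumptions chosen to match the fields the skill reads.

```typescript
interface FileCoverage {
  lines: number;            // percent of instrumented lines hit, 0 to 100
  uncoveredLines: number[]; // line numbers with zero hits
}

function parseLcov(report: string): Record<string, FileCoverage> {
  const result: Record<string, FileCoverage> = {};
  let file: string | null = null;
  let found = 0;
  let hit = 0;
  let totalFound = 0;
  let totalHit = 0;
  let uncovered: number[] = [];

  for (const raw of report.split(/\r?\n/)) {
    const line = raw.trim();
    if (line.startsWith("SF:")) {
      // Start of a new source file record.
      file = line.slice(3);
      found = 0;
      hit = 0;
      uncovered = [];
    } else if (line.startsWith("DA:") && file !== null) {
      // DA:<line number>,<execution count>
      const parts = line.slice(3).split(",");
      const lineNo = Number(parts[0]);
      const hits = Number(parts[1]);
      found += 1;
      if (hits > 0) hit += 1;
      else uncovered.push(lineNo);
    } else if (line === "end_of_record" && file !== null) {
      // A file with no instrumented lines is treated as fully covered.
      result[file] = {
        lines: found === 0 ? 100 : (hit / found) * 100,
        uncoveredLines: uncovered,
      };
      totalFound += found;
      totalHit += hit;
      file = null;
    }
  }

  result["__total"] = {
    lines: totalFound === 0 ? 100 : (totalHit / totalFound) * 100,
    uncoveredLines: [],
  };
  return result;
}

const sampleLcov = [
  "SF:src/payments/refund.ts",
  "DA:1,1",
  "DA:2,0",
  "DA:3,0",
  "DA:4,1",
  "end_of_record",
  "SF:src/api/routes.ts",
  "DA:1,3",
  "DA:2,1",
  "end_of_record",
].join("\n");

const parsedSample = parseLcov(sampleLcov);
console.log(parsedSample["src/payments/refund.ts"]); // lines: 50, uncoveredLines: [2, 3]
```

Note that in this sketch `__total` is an object with a `lines` field, so a caller comparing it against a minimum would use `parsed.__total.lines`.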

3. Security Scan Skill

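Security scanning integrates with existing tools (Semgrep, Snyk, Trivy) but adds a filtering layer to reduce false positives: confirmed false positives and time-boxed suppressions stored in agent memory are removed before thresholds are applied. An LLM pass over the remaining findings, for example to draft remediation guidance from the diff, can be layered on top; the skill below shows only the filtering and threshold logic.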

// skills/security-scan.ts
export const securityScanSkill: HermesSkill = {
  name: "security-scan",
  input: z.object({
    scan_results: z.array(ScanFinding),
    diff: z.string(),
    config: z.object({
      critical_max: z.number(),
      high_max: z.number(),
    }),
  }),
  async execute({ scan_results, diff, config }, memory) {
    const suppressions = await memory.get("security-suppressions") ?? [];
    const knownFPs = await memory.get("known-false-positives") ?? [];

    // Filter known false positives
    const filtered = scan_results.filter(
      r => !knownFPs.some(fp => fp.ruleId === r.ruleId && fp.pattern === r.snippet)
    );

    // Apply active suppressions
    const active = filtered.filter(
      r => !suppressions.some(s => s.ruleId === r.ruleId && !isExpired(s))
    );

    const bySeverity = groupBy(active, "severity");

    const passed =
      (bySeverity.critical?.length ?? 0) <= config.critical_max &&
      (bySeverity.high?.length ?? 0) <= config.high_max;

    return {
      findings: active,
      summary: bySeverity,
      passed,
      suppressed: filtered.length - active.length,
    };
  },
};

The suppression system is important for adoption. Teams abandon automated security scanning when it generates too many irrelevant findings. By recording confirmed false positives in agent memory, the QC agent learns your codebase’s specific patterns and stops crying wolf.
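
The security skill calls an `isExpired` helper that isn't defined in the post. Here is one way it might look, together with a small filter that separates active findings from suppressed ones. The `Suppression` shape follows the `expiresAt` field used elsewhere in this post; `Finding` and `applySuppressions` are hypothetical names for illustration.

```typescript
interface Suppression {
  ruleId: string;
  reason: string;
  expiresAt: Date;
}

// Hypothetical, simplified finding shape.
interface Finding {
  ruleId: string;
  severity: "critical" | "high" | "medium" | "low";
  file: string;
}

// A suppression stops applying at its expiry instant.
function isExpired(s: Suppression, now: Date = new Date()): boolean {
  return s.expiresAt.getTime() <= now.getTime();
}

function applySuppressions(
  findings: Finding[],
  suppressions: Suppression[],
  now: Date = new Date(),
): { active: Finding[]; suppressedCount: number } {
  const active = findings.filter(
    (f) => !suppressions.some((s) => s.ruleId === f.ruleId && !isExpired(s, now)),
  );
  return { active, suppressedCount: findings.length - active.length };
}

const checkDate = new Date("2026-06-08T09:00:00Z");
const sampleSuppressions: Suppression[] = [
  { ruleId: "sql-injection", reason: "legacy query builder", expiresAt: new Date("2026-06-01T00:00:00Z") },
  { ruleId: "hardcoded-secret", reason: "test fixture", expiresAt: new Date("2026-12-31T00:00:00Z") },
];
const sampleFindings: Finding[] = [
  { ruleId: "sql-injection", severity: "high", file: "billing/queries.ts" },
  { ruleId: "hardcoded-secret", severity: "medium", file: "tests/fixtures.ts" },
];

const suppressionResult = applySuppressions(sampleFindings, sampleSuppressions, checkDate);
console.log(suppressionResult.active.map((f) => f.ruleId)); // only "sql-injection" remains active
```

Passing `now` explicitly keeps the expiry logic deterministic in tests. Expiry dates also mean a suppression has to be re-justified periodically rather than living forever.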

4. Performance Regression Skill

Detecting performance regressions requires a baseline to compare against. This skill tracks benchmark results over time and flags any benchmark whose p95 regresses by more than a configured percentage against the stored baseline.

// skills/performance-regression.ts
export const perfRegressionSkill: HermesSkill = {
  name: "performance-regression",
  input: z.object({
    benchmarks: z.array(BenchmarkResult),
    config: z.object({ max_regression_pct: z.number() }),
  }),
  async execute({ benchmarks, config }, memory) {
    const history = await memory.get("perf-history") ?? {};

    const regressions = benchmarks
      .filter(b => {
        const baseline = history[b.name];
        // Skip benchmarks with no usable baseline (also avoids dividing by zero)
        if (!baseline || baseline.p95 <= 0) return false;
        const changePct = ((b.p95 - baseline.p95) / baseline.p95) * 100;
        return changePct > config.max_regression_pct;
      })
      .map(b => ({
        benchmark: b.name,
        baseline_p95: history[b.name].p95,
        current_p95: b.p95,
        change_pct: ((b.p95 - history[b.name].p95) / history[b.name].p95) * 100,
      }));

    // Update baselines for non-regression runs
    for (const b of benchmarks) {
      if (!regressions.find(r => r.benchmark === b.name)) {
        history[b.name] = { p95: b.p95, p50: b.p50, timestamp: Date.now() };
      }
    }
    await memory.set("perf-history", history);

    return { regressions, passed: regressions.length === 0 };
  },
};
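
A fixed percentage threshold against a single baseline can misfire on noisy benchmarks. One possible refinement, sketched below, is to keep a short window of recent p95 values and flag a run only when it exceeds both the percentage threshold and the window's mean plus a few standard deviations. This is a simple noise filter rather than a formal significance test, and the function names, the minimum of five samples, and the default of three sigmas are all assumptions.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Sample standard deviation (n - 1 in the denominator).
function sampleStdDev(xs: number[]): number {
  const m = mean(xs);
  const variance = xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
  return Math.sqrt(variance);
}

function isLikelyRegression(
  recentP95: number[],
  currentP95: number,
  maxRegressionPct: number,
  sigmas: number = 3,
): boolean {
  // Too little history to estimate noise: do not flag.
  if (recentP95.length < 5) return false;
  const m = mean(recentP95);
  if (m <= 0) return false;
  const changePct = ((currentP95 - m) / m) * 100;
  const noiseCeiling = m + sigmas * sampleStdDev(recentP95);
  return changePct > maxRegressionPct && currentP95 > noiseCeiling;
}

const steadyRuns = [100, 102, 98, 101, 99];  // mean 100, low variance
const noisyRuns = [100, 140, 80, 120, 60];   // mean 100, high variance

console.log(isLikelyRegression(steadyRuns, 115, 10)); // true
console.log(isLikelyRegression(steadyRuns, 106, 10)); // false
console.log(isLikelyRegression(noisyRuns, 115, 10));  // false
```

The same 15% jump is flagged on the steady benchmark and ignored on the noisy one, which is the behavior you want if flaky benchmarks are eroding trust in the check.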

5. Documentation Check Skill

Documentation drift is the quietest quality killer. This skill checks that public APIs, exported functions, and changed interfaces have corresponding documentation, and that existing docs still match the code.

// skills/documentation-check.ts
export const docCheckSkill: HermesSkill = {
  name: "documentation-check",
  input: z.object({
    changed_files: z.array(z.string()),
    diff: z.string(),
    config: z.object({ min_doc_coverage: z.number() }),
  }),
  async execute({ changed_files, diff, config }, memory) {
    const docStandards = await memory.get("doc-standards") ?? defaultDocStandards;

    const analysis = await analyzeDocCoverage(changed_files, diff, docStandards);

    return {
      missing_docs: analysis.undocumented,
      stale_docs: analysis.stale,
      coverage_pct: analysis.coveragePct,
      passed: analysis.coveragePct >= config.min_doc_coverage,
    };
  },
};
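
The `analyzeDocCoverage` helper is not shown in the post. As a deliberately naive sketch of the "missing docs" half of it, the function below scans TypeScript source text for exported function declarations and checks whether the nearest preceding non-blank line closes a block comment. The name `analyzeExportedDocs` is hypothetical, and a production version would more likely use the TypeScript compiler API than a regular expression.

```typescript
interface DocAnalysis {
  undocumented: string[];
  coveragePct: number;
}

function analyzeExportedDocs(source: string): DocAnalysis {
  const lines = source.split(/\r?\n/);
  const undocumented: string[] = [];
  let exported = 0;

  for (let i = 0; i < lines.length; i++) {
    const match = /^export\s+(?:async\s+)?function\s+(\w+)/.exec(lines[i] ?? "");
    if (!match) continue;
    exported += 1;

    // Walk back to the nearest non-blank line above the declaration.
    let j = i - 1;
    while (j >= 0 && (lines[j] ?? "").trim() === "") j--;

    // Heuristic: a line ending in "*/" is assumed to close a JSDoc block.
    const documented = j >= 0 && (lines[j] ?? "").trim().endsWith("*/");
    if (!documented) undocumented.push(match[1] ?? "");
  }

  const coveragePct =
    exported === 0 ? 100 : ((exported - undocumented.length) / exported) * 100;
  return { undocumented, coveragePct };
}

const sampleSource = [
  "/** Issues a refund for a captured payment. */",
  "export function issueRefund(id: string) {}",
  "",
  "export async function cancelRefund(id: string) {}",
  "// plain comment, not JSDoc",
  "export function listRefunds() {}",
].join("\n");

const docAnalysis = analyzeExportedDocs(sampleSource);
console.log(docAnalysis.undocumented); // cancelRefund and listRefunds
```

The heuristic has known blind spots: it ignores exported classes, constants, and arrow functions, and it accepts any block comment as documentation. Detecting stale docs is a harder problem that this sketch does not attempt.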

Integrating the QC Agent into CI/CD Pipelines

The QC agent plugs into existing pipelines as a standard step. Here’s the full integration pattern for a GitHub Actions workflow:

# .github/workflows/qc.yml
name: QC Agent

on:
  pull_request:
    types: [opened, synchronize]
  push:
    branches: [main, release/*]

jobs:
  qc-agent:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Run tests with coverage
        run: |
          npm test -- --coverage --coverageReporters=lcov
          npm run benchmark -- --output=json > benchmark-results.json

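      - name: Run security scan
        run: |
          # Semgrep is distributed as a Python package rather than via npm
          python3 -m pip install semgrep
          semgrep --config=auto --json > semgrep-results.json || true

      # The dispatch step below reads steps.diff and steps.changed,
      # so both ids must be defined before it runs.
      - name: Collect diff
        id: diff
        env:
          BASE_SHA: ${{ github.event.pull_request.base.sha || github.event.before }}
        run: |
          # Multi-line outputs need the delimiter form of GITHUB_OUTPUT
          {
            echo "content<<QC_DIFF_EOF"
            git diff "$BASE_SHA" HEAD
            echo "QC_DIFF_EOF"
          } >> "$GITHUB_OUTPUT"

      - name: List changed files
        id: changed
        env:
          BASE_SHA: ${{ github.event.pull_request.base.sha || github.event.before }}
        run: |
          echo "files=$(git diff --name-only "$BASE_SHA" HEAD | tr '\n' ' ')" >> "$GITHUB_OUTPUT"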

      - name: Dispatch QC Agent
        uses: hermes-agent/dispatch@v2
        with:
          config: hermes-qc.yml
          inputs: |
            diff: ${{ steps.diff.outputs.content }}
            coverage_report: coverage/lcov.info
            scan_results: semgrep-results.json
            benchmarks: benchmark-results.json
            changed_files: ${{ steps.changed.outputs.files }}
          outputs:
            pr_comment: true
            status_check: true
        env:
          HERMES_API_KEY: ${{ secrets.HERMES_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

The agent returns a structured result that the CI runner uses to set the commit status:

{
  "overall": "failure",
  "skills": {
    "code-review": { "status": "warning", "findings": 3 },
    "test-coverage": { "status": "failure", "coverage": 74.2, "required": 80 },
    "security-scan": { "status": "pass", "findings": 0 },
    "performance-regression": { "status": "pass" },
    "documentation-check": { "status": "warning", "missing": 2 }
  },
  "blocking": ["test-coverage"],
  "pr_comment_id": 1847392
}

The blocking field is separate from overall — the agent can report warnings from non-blocking checks while only failing the pipeline on hard requirements. Your team configures which skills are blocking in hermes-qc.yml.
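
The relationship between per-skill status, `blocking`, and `overall` can be sketched as a small pure function. The post doesn't show how blocking skills are declared in hermes-qc.yml, so the sketch simply takes them as a `blockingSkills` parameter; the type names and the rule that a non-blocking failure downgrades to an overall warning are assumptions.

```typescript
type SkillStatus = "pass" | "warning" | "failure";

interface SkillOutcome {
  status: SkillStatus;
}

interface QcSummary {
  overall: SkillStatus;
  blocking: string[];
}

function summarize(
  outcomes: Record<string, SkillOutcome>,
  blockingSkills: string[],
): QcSummary {
  // Only failures from skills marked as blocking fail the pipeline.
  const blocking = Object.entries(outcomes)
    .filter(([skillName, o]) => o.status === "failure" && blockingSkills.includes(skillName))
    .map(([skillName]) => skillName);

  const anyIssue = Object.values(outcomes).some((o) => o.status !== "pass");
  const overall: SkillStatus =
    blocking.length > 0 ? "failure" : anyIssue ? "warning" : "pass";

  return { overall, blocking };
}

const qcSummary = summarize(
  {
    "code-review": { status: "warning" },
    "test-coverage": { status: "failure" },
    "security-scan": { status: "pass" },
    "performance-regression": { status: "pass" },
    "documentation-check": { status: "warning" },
  },
  ["test-coverage", "security-scan"],
);
console.log(qcSummary); // overall is "failure"; blocking contains only "test-coverage"
```

A CI wrapper would then exit non-zero only when `blocking` is non-empty, while still surfacing warnings in the PR comment.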

Running QC Checks in Parallel with Loop 6 (Kanban)

The five skills run in parallel using Hermes’s Loop 6 (Kanban) pattern. This keeps total QC time close to the duration of the slowest skill rather than the sum of all five.

sequenceDiagram
    participant CI as CI/CD Pipeline
    participant Router as QC Agent Router
    participant L6 as Loop 6 (Kanban)
    participant CR as Code Review
    participant TC as Test Coverage
    participant SS as Security Scan
    participant PR as Perf Regression
    participant DC as Doc Check
    participant Agg as Aggregator

    CI->>Router: trigger(diff, artifacts)
    Router->>L6: dispatch(all skills)
    
    par Parallel execution
        L6->>CR: execute(diff)
        L6->>TC: execute(coverage)
        L6->>SS: execute(scan_results)
        L6->>PR: execute(benchmarks)
        L6->>DC: execute(diff, files)
    end

    CR-->>Agg: findings
    TC-->>Agg: metrics
    SS-->>Agg: vulnerabilities
    PR-->>Agg: regressions
    DC-->>Agg: gaps

    Agg->>CI: structured_result
    Agg->>CI: pr_comment
    Agg->>CI: status_check

The Kanban loop manages the work queue — skills that finish early can start processing the next available work item if you’ve configured batch analysis. For single PR checks, the five skills run concurrently and the aggregator waits for all to complete.

// agent/loop.ts
const qcKanban = new HermesLoop({
  type: "kanban",
  workers: 5,
  skills: [
    codeReviewSkill,
    testCoverageSkill,
    securityScanSkill,
    perfRegressionSkill,
    docCheckSkill,
  ],
  onComplete: aggregateResults,
  onSkillComplete: (skill, result) => {
    // Stream intermediate results — fast skills post findings before slow ones finish
    emitPartialResult(skill.name, result);
  },
});

One underrated benefit of parallel execution: the QC agent can post partial results to the PR as skills complete. A developer can start reading the code review findings without waiting for the performance benchmarks to finish.
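
To show the idea without depending on `HermesLoop`, here is a framework-free sketch of the same pattern using plain promises: every skill starts at once, each has its own timeout, partial results are emitted as each skill settles, and one failure doesn't sink the rest. `SkillRun`, `withTimeout`, and `runSkillsInParallel` are hypothetical names, and skills are reduced to functions that return a summary string.

```typescript
// Hypothetical stand-in for a skill: an async function returning a summary.
type SkillRun = () => Promise<string>;

interface SkillResult {
  ok: boolean;
  value: string; // the summary on success, the error text on failure
}

// Rejects if the wrapped promise has not settled within `ms` milliseconds.
// Note: the underlying work is not cancelled, only abandoned.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
    work.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

async function runSkillsInParallel(
  skills: Record<string, SkillRun>,
  timeoutMs: number,
  onPartial: (skillName: string, summary: string) => void,
): Promise<Record<string, SkillResult>> {
  const results: Record<string, SkillResult> = {};
  await Promise.all(
    Object.entries(skills).map(async ([skillName, run]) => {
      try {
        const summary = await withTimeout(run(), timeoutMs, skillName);
        onPartial(skillName, summary); // fires as soon as this skill finishes
        results[skillName] = { ok: true, value: summary };
      } catch (err) {
        // A failed or timed-out skill is recorded instead of rejecting the batch.
        results[skillName] = { ok: false, value: String(err) };
      }
    }),
  );
  return results;
}
```

Because each task catches its own errors, `Promise.all` always resolves, and the aggregator can decide how to treat a timed-out skill (for example, as a warning rather than a pass).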

How the QC Agent Learns Your Quality Standards

Static thresholds and rule sets are a starting point, not a destination. The QC agent builds a quality profile for your codebase through three feedback mechanisms.

Explicit feedback: when a developer marks a finding as a false positive, that signal goes into agent memory. The security scan skill filters known false positives out of its results, and the code review skill reads the team patterns accumulated in memory before classifying findings. After roughly 20 feedback cycles, the agent has a meaningful model of what your team considers a real issue versus noise.

Outcome tracking: the agent records which PRs passed QC and then introduced bugs or regressions post-merge. This creates a training signal for tuning classifier sensitivity. If PRs that passed code review with 0 findings later caused incidents, the agent can retrospectively flag the pattern it missed and update its review heuristics.

Threshold evolution: rather than relying only on static min/max values, the agent tracks rolling percentiles of the values it actually observes. If your team’s test coverage has been consistently above 85% for six months, the agent can suggest raising the minimum threshold to match the de facto standard.

// The quality profile that accumulates over time
interface QualityProfile {
  // Per-rule FP rates, updated from developer feedback
  falsePositiveRates: Record<string, number>;

  // Patterns that historically preceded bugs
  riskPatterns: Array<{ pattern: string; bugRate: number; lastSeen: Date }>;

  // Observed thresholds (rolling 90th percentile of actual values)
  observedThresholds: {
    coverage: RollingStats;
    docCoverage: RollingStats;
    reviewFindings: RollingStats;
  };

  // Suppression rules with expiry
  suppressions: Array<{ ruleId: string; reason: string; expiresAt: Date }>;
}
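
`RollingStats` is referenced in the profile but not defined. A minimal sketch of one possible implementation follows: a fixed-size window with a nearest-rank percentile, plus a helper that proposes a higher coverage minimum when recent runs clear the current one comfortably. The class name `RollingWindow`, the helper `suggestCoverageMin`, and the choice of the 10th percentile as a floor that most runs already meet are assumptions for illustration.

```typescript
class RollingWindow {
  private values: number[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Adds a value and evicts the oldest one once the window is full.
  push(v: number): void {
    this.values.push(v);
    if (this.values.length > this.capacity) this.values.shift();
  }

  // Nearest-rank percentile; undefined while the window is empty.
  percentile(p: number): number | undefined {
    if (this.values.length === 0) return undefined;
    const sorted = [...this.values].sort((a, b) => a - b);
    const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
    return sorted[rank - 1];
  }
}

// Suggests a new minimum only if it would be stricter than the current one.
function suggestCoverageMin(
  window: RollingWindow,
  currentMin: number,
  floorPercentile: number = 10,
): number | undefined {
  const floor = window.percentile(floorPercentile);
  if (floor === undefined) return undefined;
  const candidate = Math.floor(floor);
  return candidate > currentMin ? candidate : undefined;
}

const coverageWindow = new RollingWindow(10);
for (const pct of [86, 88, 85, 87, 90, 86, 89, 85, 88, 87]) {
  coverageWindow.push(pct);
}

console.log(suggestCoverageMin(coverageWindow, 80)); // 85
console.log(suggestCoverageMin(coverageWindow, 90)); // undefined
```

The helper only suggests a change. Keeping a human in the loop for threshold increases avoids the agent ratcheting standards up after an unusually good month.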

Real Example QC Workflows

To make this concrete, here are two workflows the QC agent handles end-to-end.

Workflow 1: Pre-merge gate on a feature branch

A developer opens a PR with 12 changed files. The QC agent triggers automatically, fans out the five skills in parallel, and posts a structured comment within 90 seconds:

QC Agent Results — 3 issues found

✗ Test Coverage: 74.2% (required: 80%)
  └─ src/payments/refund.ts — 0% coverage (new file)
  └─ src/payments/validation.ts — 61% → 58% (regression)

⚠ Code Review: 2 warnings
  └─ src/payments/refund.ts:47 — Missing error handling for network timeout
  └─ src/api/routes.ts:92 — Direct DB query in route handler (use service layer)

⚠ Documentation: 1 warning
  └─ src/payments/refund.ts — 3 exported functions missing JSDoc

✓ Security: Pass
✓ Performance: Pass

Blocking: test-coverage — pipeline will not pass until coverage meets threshold.

Workflow 2: Post-merge quality trend report

Every Monday at 9am, the QC agent generates a quality trend report for the past week:

Weekly Quality Report — Week 23, 2026

Coverage: 83.4% (↑ 1.2% from last week)
  Top regressions introduced: payments module (3 PRs)
  Top improvements: auth module (2 PRs)

Security: 14 findings resolved, 3 new (all low severity)
  Notable: SQL injection pattern suppression expired — 2 previously-suppressed
  findings are now active in billing/queries.ts

Code Review patterns this week:
  Most common finding: missing null checks (8 occurrences)
  Suggested rule addition: "explicit-null-handling" — 6/8 were confirmed bugs

Performance: All benchmarks within baseline. payment-process p95 trending up
  (+6% over 4 weeks) — not yet at threshold but worth monitoring.

Putting It Together

The QC agent is most valuable when it stops feeling like a gate and starts feeling like a collaborator — something the team actually consults because its findings are accurate and its memory of what matters is reliable. That transition happens through the feedback loop: every false positive dismissed, every suppression added, every threshold adjusted makes the next run more precise.

The architecture described here is intentionally modular. You don’t need all five skills on day one. Start with code review and test coverage, establish the feedback loop, let the agent build its quality profile for a few weeks, then add security scanning and performance regression as the team builds trust in the system.

The Hermes framework’s Loop 6 parallel execution means adding a skill costs little in QC time: a sixth skill runs concurrently with the other five, as long as you raise max_concurrency (and the loop’s workers count) above 5 and the new skill isn’t slower than the ones already running. The agent grows with your quality standards rather than requiring a fixed upfront investment in what “quality” means for your codebase.

Quality control at this level of automation isn’t about replacing human judgment — it’s about making human judgment available where it’s actually needed, rather than spending it on checks a machine can run in 90 seconds every time.
