Code reviews were our biggest bottleneck. PRs sat for hours waiting for reviewers. When reviews did happen, quality varied wildly — sometimes thorough feedback, sometimes just “LGTM.” I figured I’d build something to at least do the first pass automatically.

The first version was terrible. It flagged everything. “Consider adding a comment here.” “This function could be renamed for clarity.” “You might want to add error handling.” Developers hated it. It felt like being nitpicked by a robot. Within a week, people were ignoring the bot entirely.

What I Changed

The breakthrough was realizing the bot shouldn’t try to be a thorough reviewer. It should catch the stuff humans miss because they’re skimming — null reference risks, missing error handling, potential security issues, test coverage gaps. The stuff where a machine’s thoroughness actually adds value.

I rewrote the system prompt to be extremely focused:

  • Only flag issues you’re confident about (> 80% certainty)
  • Categorize as: bug, security, performance, or suggestion
  • Never comment on style or naming (that’s what linters are for)
  • If you’re not sure, don’t flag it — false positives destroy trust
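Since version-controlling prompts matters (more on that below), the focused prompt lives in code as a constant. A sketch of what it might look like; the exact wording here is illustrative, not the production prompt:

```typescript
// Hypothetical reconstruction of the focused system prompt described above.
// Keeping it as a checked-in constant means every change is diffable.
const SYSTEM_PROMPT = `You are a code review bot. Your job is a focused first pass, not a thorough review.

Rules:
- Only flag issues you are confident about (>80% certainty).
- Categorize every issue as exactly one of: bug, security, performance, suggestion.
- Never comment on style or naming; linters handle those.
- If you are not sure, do not flag it. False positives destroy trust.

Respond with a JSON array of issues. If there are no issues, respond with [].`;
```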

How It Works

A GitHub Action triggers on every PR:

import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment (set via Action secrets).
const client = new Anthropic();

async function reviewFile(diff: string, fileContent: string, filePath: string) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 2048,
    messages: [{
      role: "user",
      // Full file plus diff: the model needs surrounding context, not just the change.
      content: `Review this code change in ${filePath}.
Full file:\n\`\`\`\n${fileContent}\n\`\`\`
Diff:\n\`\`\`diff\n${diff}\n\`\`\`
Focus on: bugs, security issues, performance problems.
Only flag issues you're confident about. Skip style and naming.`
    }],
  });
  return response;
}
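The Action wiring around this is a standard pull-request workflow. A sketch, where the script path and secret names are assumptions, not the real repo layout:

```yaml
# .github/workflows/ai-review.yml (hypothetical)
name: AI first-pass review
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write   # needed to post review comments
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx tsx scripts/review.ts   # calls reviewFile for each changed file
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```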

The key technical decisions:

Structured outputs. The bot returns JSON with line numbers, severity, and message. This maps directly to GitHub’s review comment API — no parsing free-form text.
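A sketch of that structured shape, with a validating parser so malformed model output never reaches GitHub. The field names here are assumptions, not the production schema:

```typescript
// Hypothetical schema for the bot's structured output.
type Severity = "bug" | "security" | "performance" | "suggestion";

interface ReviewIssue {
  path: string;     // file the comment applies to
  line: number;     // line number on the new side of the diff
  severity: Severity;
  message: string;  // the review comment body
}

// Parse the model's JSON and drop anything that doesn't fit the schema.
function parseReview(raw: string): ReviewIssue[] {
  const severities = new Set(["bug", "security", "performance", "suggestion"]);
  const parsed = JSON.parse(raw);
  if (!Array.isArray(parsed)) throw new Error("expected a JSON array of issues");
  return parsed.filter(
    (i: any) =>
      typeof i.path === "string" &&
      Number.isInteger(i.line) &&
      severities.has(i.severity) &&
      typeof i.message === "string"
  );
}
```

Each validated issue maps onto the `comments` array of GitHub's create-review endpoint (`POST /repos/{owner}/{repo}/pulls/{pull_number}/reviews`), which accepts `path`, `line`, and `body` per comment.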

Full file context, not just diff. Sending just the diff produces mediocre reviews. The model needs to see the surrounding code to understand what the change is actually doing.

Prompt caching. The system prompt and review guidelines are the same for every review. Caching these reduces API costs by about 60%. Without caching, we’d be paying $0.15 per review. With caching, it’s $0.06.

Sonnet, not Opus. I tested both. Opus produces marginally better reviews at 5x the cost. For a bot that runs on every PR, Sonnet is the right trade-off.

Results After 6 Months

  • Catches about 30% of issues before human review
  • False positive rate is around 15% — low enough that developers read the comments
  • Review turnaround for the first pass went from hours to minutes
  • Costs less than $50/month for a team of 10 developers
  • My team’s production bug rate actually went down

The last point surprised me. I expected the bot to catch things in review but didn’t expect it to change developer behavior. But it did — developers started writing more careful code because they knew the bot would catch common mistakes. It raised the baseline.

The Unexpected Lessons

Developers warm up to AI reviews slowly. First week: “this is annoying.” First month: “okay, it caught something useful.” After three months: “wait, the bot didn’t flag anything on my PR? Let me double-check my code.”

Business context is the limit. The bot is great at catching technical issues but terrible at understanding business logic. “This function returns early when the user has no orders” — is that a bug or intended behavior? The bot can’t know without business context I haven’t given it.

Prompt engineering is real engineering. I spent more time tuning the system prompt than writing the actual integration code. Small changes to the instructions have outsized effects on review quality. Version-controlling your prompts is essential.

The bot isn’t replacing human reviewers. It’s making human reviewers faster by handling the mechanical stuff so they can focus on architecture, design, and business logic — the things that actually require human judgment.
