I want to talk about a number I keep thinking about: 1,300 pull requests per week. That’s what Stripe’s internal engineering team is generating with their autonomous coding agents, called “Minions.” Not drafts. Not suggestions. Production-ready pull requests originating from Slack messages, bug reports, and feature requests — processed by LLMs using blueprints and CI/CD pipelines, with human review as the final gate.

This is the shift I’ve been watching for three years. It’s no longer theoretical. The agentic AI era is happening inside production engineering organizations right now.

What “Agentic” Actually Means in Production

The term gets thrown around loosely. Here’s the definition I use: an agent is an AI system that can take sequences of actions, use tools, and adapt its behavior based on intermediate results — without a human in the loop at each step.

The progression looks like this:

  1. Chatbot: You ask a question, it answers once
  2. Copilot: It suggests completions or short snippets inline
  3. Assistant with tools: It can call APIs, search, write files — but one task at a time
  4. Agent: It decomposes goals into subtasks, executes them across multiple systems, handles failures, and surfaces the final result

Most engineering teams are somewhere between steps 2 and 3 right now. Stripe is at step 4, at scale.

Stripe’s Minions: What 1,300 PRs/Week Actually Requires

The engineering challenge isn’t the LLM — it’s everything around it. From what’s been described publicly, Stripe’s Minion system works like this:

Task ingestion: A Slack message or bug report arrives. The system classifies it, determines if it’s automatable, and routes it to an agent.
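Stripe hasn't published the internals of this stage, but the shape is easy to sketch. Here's a minimal, purely illustrative classification-and-routing step — the `Task` schema, category names, and queue names are all hypothetical:

```python
from dataclasses import dataclass

# Hypothetical task categories an ingestion stage might treat as automatable
AUTOMATABLE = {"dependency_bump", "flaky_test_fix", "doc_update"}

@dataclass
class Task:
    source: str    # "slack", "bug_tracker", ...
    text: str
    category: str  # assigned upstream by a classifier (LLM or rules)

def route(task: Task) -> str:
    """Decide whether a task goes to an agent queue or to human triage."""
    if task.category in AUTOMATABLE:
        return "agent_queue"
    return "human_triage"

print(route(Task("slack", "bump lodash to 4.17.21", "dependency_bump")))
# -> agent_queue
```

The interesting engineering lives in the classifier that fills in `category` — the router itself is trivial once the taxonomy exists.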

Blueprint execution: Agents follow “blueprints” — structured playbooks for common task types (fix this type of bug, add this type of test, implement this type of endpoint). These are essentially system prompts with specialized context about Stripe’s codebase conventions.

Tool use: Agents read code, write code, run tests, check CI results. They iterate — if tests fail, they fix and retry.

Human review: The final PR goes through normal code review. Humans don’t direct the agent during execution; they validate the output.

# Conceptual structure of an agentic coding loop
async def minion_agent(task_description: str, codebase_context: dict):
    plan = await llm.plan_task(task_description, codebase_context)
    changed_files = []

    for step in plan.steps:
        result = await execute_tool(step.tool, step.params)

        if step.requires_verification:
            test_result = await run_tests(result.changed_files)
            if not test_result.passed:
                # Agent retries with error context
                result = await llm.fix_error(test_result.error, result)

        # Accumulate changes across steps for the final PR
        changed_files.extend(result.changed_files)

    pr = await create_pull_request(changed_files, plan.description)
    return pr  # Human reviews from here

The key insight: agents are most valuable when the task space is well-defined but the execution is tedious. Fixing a known category of bug, adding tests to an existing pattern, updating API calls when a dependency changes — these are exactly the kinds of work that drain senior engineers.

OpenAI’s Responses API: Infrastructure for Agents

While Stripe built their own, OpenAI is trying to give everyone the infrastructure to build similar systems. The recent Responses API expansion is significant for developers:

Shell tool: Agents can now execute shell commands in a sandboxed environment. This means a coding agent can not only write code but also run it, see the output, and iterate.

Built-in agent execution loop: Instead of you managing the tool-call/response/tool-call cycle, the API handles it. You define available tools and the model runs until it has a final answer — or hits a configured limit.

Hosted container workspace: Persistent file system for agent runs. The agent can write files, read them back, install packages, compile code — all in a managed sandbox.

Context compaction: Long-running agents generate long context. The API now handles compaction automatically, keeping costs manageable without losing important history.

Reusable agent skills: Pre-built capabilities (web search, code execution, file I/O) that you compose rather than implement.

from openai import OpenAI

client = OpenAI()

# New: The API handles the tool loop for you
response = client.responses.create(
    model="gpt-5.4",
    tools=[
        {"type": "shell"},           # Can execute commands
        {"type": "file_system"},     # Persistent workspace
        {"type": "web_search"}       # Live information
    ],
    instructions="You are a senior code reviewer. Analyze the provided PR diff and generate a review.",
    input="<PR_DIFF_CONTENT>",
    max_turns=10  # Maximum iterations
)

print(response.output)  # Final answer after tool use

This matters because the biggest barrier to building agents isn’t the model — it’s the plumbing. Managing tool call loops, handling retries, dealing with context limits — these are genuinely hard engineering problems. The Responses API abstracts most of them.
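To appreciate what's being abstracted, here's roughly what that plumbing looks like when you manage it yourself. This is a sketch, not a real SDK: `call_model` is a stand-in for an LLM API call, and the message format and tool registry are invented for illustration:

```python
def call_model(messages: list[dict]) -> dict:
    """Stand-in for an LLM call that may request a tool or give a final answer.

    A real implementation would send `messages` to a model API and parse
    the response. This stub always asks for the shell tool, so the loop
    below exhausts its turn budget.
    """
    return {"tool": "shell", "args": {"cmd": "pytest"}, "final": None}

# Tool registry: name -> callable
TOOLS = {"shell": lambda cmd: f"ran: {cmd}"}

def agent_loop(task: str, max_turns: int = 10) -> str:
    """The hand-rolled tool-call cycle the Responses API now manages."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_model(messages)
        if reply["final"] is not None:   # model is done
            return reply["final"]
        # Execute the requested tool and feed the result back as context
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": result})
    return "gave up after max_turns"     # turn budget exhausted
```

Every branch here — retries, turn limits, feeding tool output back into context — is code you no longer have to own when the API runs the loop for you.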

AWS Strands Labs: The Open-Source Play

Amazon Web Services took a different approach with Strands Labs — a new GitHub organization for experimental agent-related projects. This positions AWS as a contributor to the open-source agent ecosystem rather than just a cloud host.

The strategic angle: if you’re experimenting with Strands Labs projects and they work well, you’re likely deploying on AWS infrastructure. It’s the cloud-provider version of developer relations through tooling.

For teams already using AWS Bedrock and Lambda, this creates an interesting path:

# Conceptual Strands Labs-style agent pattern
from strands import Agent, tool

@tool
def query_database(sql: str) -> str:
    """Execute a read-only database query"""
    # Your implementation
    pass

@tool
def analyze_logs(service_name: str, hours: int) -> str:
    """Pull and analyze CloudWatch logs"""
    # Your implementation
    pass

agent = Agent(
    model="anthropic.claude-sonnet-4-6-v1",
    tools=[query_database, analyze_logs],
    system_prompt="You are an infrastructure analyst. Diagnose the reported issue."
)

result = agent.run("Production latency spiked 3 hours ago. Find the root cause.")

The appeal: you define tools as Python functions, the framework handles the orchestration, and you run it on AWS infrastructure you already operate.

What Engineering Leaders Should Take From This

Three concrete observations from watching this space closely:

1. The evaluation problem is now critical. When agents generate 1,300 PRs per week, how do you know they’re good? Human review helps but doesn’t scale infinitely. Teams need automated evaluation pipelines — not just test suites but semantic checks, security scans, and style enforcement that can catch problems before human review.
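What such a pipeline might look like, reduced to its skeleton — the check functions here are toy stand-ins (a real gate would apply the diff, run the suite, and call proper scanners), and the schema is invented for illustration:

```python
def run_tests(diff: str) -> bool:
    """Stand-in: a real gate would apply the diff and run the test suite."""
    return True

def lint(diff: str) -> bool:
    """Stand-in for style enforcement. Toy rule: no hard tabs."""
    return "\t" not in diff

def evaluate_pr(diff: str) -> dict:
    """Aggregate automated checks; only passing PRs reach a human reviewer."""
    checks = {
        "tests_pass": run_tests(diff),
        "no_secrets": "AWS_SECRET" not in diff,  # crude secret scan
        "style_ok": lint(diff),
    }
    checks["ready_for_human_review"] = all(checks.values())
    return checks
```

The point of the structure: every check is machine-decidable, so the gate scales with PR volume while humans only see output that already cleared it.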

2. Blueprint quality determines agent quality. The LLM doesn’t know your codebase conventions unless you tell it. The teams seeing the best results from coding agents invest heavily in writing detailed, structured playbooks — essentially documentation that humans also benefit from. This is leverage: one well-written blueprint enables hundreds of automated implementations.
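One way to make "structured playbook" concrete — this shape and the `dep_bump` example are illustrative, not Stripe's actual format:

```python
from dataclasses import dataclass

@dataclass
class Blueprint:
    """Illustrative shape for a structured task playbook."""
    task_type: str
    conventions: list[str]  # codebase rules the agent must follow
    steps: list[str]        # ordered plan the agent executes
    verification: str       # how success is checked

dep_bump = Blueprint(
    task_type="dependency_update",
    conventions=["pin exact versions", "one dependency per PR"],
    steps=["update manifest", "run test suite", "update changelog"],
    verification="CI green and no lockfile drift",
)

def render_prompt(bp: Blueprint) -> str:
    """Turn a blueprint into the system-prompt context an agent receives."""
    return (
        f"Task type: {bp.task_type}\n"
        + "Conventions:\n" + "\n".join(f"- {c}" for c in bp.conventions)
        + "\nSteps:\n" + "\n".join(f"{i+1}. {s}" for i, s in enumerate(bp.steps))
        + f"\nVerify: {bp.verification}"
    )
```

Note that `render_prompt` output is readable documentation in its own right — which is exactly the dual-use leverage described above.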

3. The human review gate cannot be a rubber stamp. As agents become more capable and generate more output, the temptation is to review PRs more quickly. This is the failure mode. Maintain or raise review rigor; the volume should be handled by better tooling (AI-assisted review, automated checks) not by cutting corners.

The Practical Path Forward

If you want to introduce agentic capabilities to your engineering team in 2026, here’s how I’d sequence it:

Month 1-2: Automate a single, high-frequency, well-defined task. Test generation is ideal — you have clear correctness criteria (tests pass), clear value (coverage goes up), and bounded risk (tests don’t deploy to production themselves).
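The bounded-risk property can itself be automated: accept a generated test only if it actually passes. A minimal sketch (a real pipeline would also verify the test increases coverage and run it in a sandbox rather than the host interpreter):

```python
import os
import subprocess
import sys
import tempfile

def accept_generated_test(test_code: str) -> bool:
    """Keep an agent-generated test only if it runs cleanly.

    Rejected tests are simply discarded — a bad generation costs
    nothing, which is what makes test generation a safe first target.
    """
    with tempfile.NamedTemporaryFile(
        "w", suffix="_test.py", delete=False
    ) as f:
        f.write(test_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True)
        return proc.returncode == 0
    finally:
        os.remove(path)
```

A failing assertion makes the interpreter exit non-zero, so the acceptance criterion is purely mechanical — no human judgment needed until the test is already known to pass.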

Month 3-4: Extend to code review assistance. Not replacement, but augmentation — an agent that surfaces potential issues before human review. This builds trust in agent output with low stakes.

Month 5-6: Introduce bounded PR generation for well-understood task categories. Start with dependency updates, typo fixes, or documentation changes. Measure the revision rate (what percentage of agent PRs need significant changes before merging).
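The revision-rate metric is simple to compute once you track how much humans had to add on top of agent output. The `human_commits` field here is a hypothetical schema, for illustration:

```python
def revision_rate(prs: list[dict]) -> float:
    """Fraction of agent PRs that needed human changes before merging.

    Each PR record is assumed to carry a `human_commits` count: commits
    a reviewer added on top of the agent's work (hypothetical schema).
    """
    if not prs:
        return 0.0
    revised = sum(1 for pr in prs if pr["human_commits"] > 0)
    return revised / len(prs)

prs = [
    {"human_commits": 0},
    {"human_commits": 2},
    {"human_commits": 0},
    {"human_commits": 1},
]
print(revision_rate(prs))  # 0.5
```

Tracked per task category, this number tells you which blueprints are ready to scale and which still need work.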

Month 7+: Expand to more complex tasks based on data from previous phases. By this point you understand your agent’s error modes and have evaluation tooling to catch them.

The Stripe result — 1,300 PRs per week — didn’t happen overnight. It’s the output of years of iteration on task definition, blueprint quality, tool integration, and evaluation. The organizations that start building these capabilities now will have a significant head start when the tooling matures further.

Start with one task. Make it work well. Then expand.
