On March 5, 2026, OpenAI released GPT-5.4 — and buried in the benchmark results was a number that caught my attention: 75.0% on OSWorld-Verified, beating the human baseline of 72.4%.

For context: OSWorld-Verified measures whether an AI can navigate a real desktop environment — clicking buttons, reading screens, filling forms, using actual software — through screenshots and keyboard/mouse actions. Not an abstracted API. Not a sandboxed browser. Real computer use, the way a human does it.

This is the first time any AI model has cleared that bar. And if you’re building agentic systems, it changes the architecture calculus significantly.

What “Computer Use” Actually Means

Before GPT-5.4, computer use in AI agents came in two frustrating flavors:

Flavor 1: Brittle browser automation. Playwright scripts that break the moment the UI changes. Hardcoded selectors that stop working after a SaaS product updates its CSS. You’ve written these. I’ve written these. They’re technically computer use, but they’re fragile by design.

Flavor 2: Expensive specialized models. Models specifically fine-tuned for UI interaction that lived outside your main reasoning stack. You’d call your reasoning model, it would decide to take an action, it would hand off to the specialized computer-use model, you’d get a result back, and you’d feed it to the reasoning model again. Double the latency, double the cost, double the failure surface.

GPT-5.4 collapses this gap. Computer use is native — the same model that reasons about your problem also operates the computer. No handoffs.

Under the hood, it runs a screenshot-observation loop: the model receives a screenshot, issues Playwright-compatible actions (clicks, keystrokes, scrolls), observes the resulting state in a fresh screenshot, and repeats until the task completes:

from openai import OpenAI

client = OpenAI()

# Computer use requires the Responses API (not Chat Completions)
response = client.responses.create(
    model="gpt-5.4",
    tools=[{"type": "computer_use"}],
    input=[  # Responses API takes `input`, not the Chat Completions `messages`
        {
            "role": "user",
            "content": "Open GitHub, create a new repository called 'demo-project', set it to private, and add a README"
        }
    ]
)

Note the important detail: computer use requires the Responses API. You can’t use chat.completions.create for this. If you’re migrating from GPT-5.2, this is the main friction point — you need to update your client code, not just the model name.

The Benchmark Numbers Worth Understanding

The OSWorld number (75.0%) gets the headlines, but three other benchmarks matter more for specific use cases:

WebArena-Verified (67.3%): Browser-based navigation using both DOM and screenshot-driven interaction. This is your typical web automation scenario — filling forms, extracting data from sites, navigating multi-step workflows in web apps. GPT-5.4 improved over GPT-5.2’s 65.4%, but this is the benchmark where it’s weakest relative to human performance.

Online-Mind2Web (92.8%): Browser use with screenshot observations only — no DOM access. This is closer to how a human actually uses a browser. 92.8% is remarkably high and suggests strong visual understanding of web interfaces. For comparison, the previous generation Atlas Agent Mode scored 70.9%.

Toolathlon (54.6%): Real-world tool and API use across multi-step tasks like reading emails, extracting attachments, uploading files, and recording results in spreadsheets. The improvement over GPT-5.2 (46.3%) is meaningful: 8.3 percentage points on tasks that are genuinely hard.

Tool Search: The Quiet Feature That Matters for Scale

Alongside computer use, GPT-5.4 shipped a feature called Tool Search that most coverage glossed over, but it’s architecturally important.

The traditional approach: you define all your tools upfront, and the full tool definition (with schema, description, examples) gets included in every prompt. At 10 tools, this is fine. At 50+ tools — which is where real enterprise agent systems live — you’re burning tokens on tool descriptions for tools the model will never call in that specific request.

Tool Search flips this. You pass a lightweight manifest (just names and brief descriptions), and GPT-5.4 looks up the full definition only when it decides to use a specific tool. On Scale’s MCP Atlas benchmark with 36 MCP servers enabled, OpenAI measured a 47% token reduction with identical accuracy.

For high-volume agent workflows, this isn’t a minor optimization — it’s the difference between $0.50/session and $0.90/session at scale.

# Instead of passing full tool schemas upfront:
tools = [
    {
        "type": "tool_search",
        "tool_catalog": "your-mcp-server-url",  # lazy-loaded
        "available_tools": ["send_email", "create_issue", "query_db", ...]
    }
]
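The per-session arithmetic behind that claim is easy to check. A quick sketch, using only the figures quoted above (the ~47% reduction and the ~$0.90/session baseline); the function itself is illustrative, not part of any API:

```python
# Rough per-session cost check for Tool Search savings, using the article's
# figures: ~47% token reduction (Scale's MCP Atlas run) and a ~$0.90/session
# baseline for a 50+ tool agent workflow.

def session_cost_with_tool_search(baseline_cost: float, token_reduction: float) -> float:
    """Estimate per-session cost after lazy-loading tool definitions."""
    return baseline_cost * (1.0 - token_reduction)

cost = session_cost_with_tool_search(0.90, 0.47)
print(f"${cost:.2f}/session")  # $0.48/session, in line with the ~$0.50 figure
```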

Practical Architecture for Computer Use Agents

After experimenting with GPT-5.4’s computer use over the past few weeks, here’s the pattern that works reliably in production:

┌─────────────────────────────────────────────────┐
│                   Task Input                    │
│         (natural language instruction)          │
└─────────────────────┬───────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────┐
│              GPT-5.4 Orchestrator               │
│   Breaks task into verifiable sub-steps         │
│   Maintains state across screenshot cycles      │
└─────────────────────┬───────────────────────────┘
                      │
          ┌───────────┼───────────┐
          ▼           ▼           ▼
    Screenshot    Keyboard     Mouse
     Capture       Input      Actions
          │           │           │
          └───────────┴───────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────┐
│              Verification Step                  │
│   Did the action produce expected state?        │
└─────────────────────┬───────────────────────────┘
                      │
              ┌───────┴───────┐
              ▼               ▼
          Continue          Retry /
          to next step      Escalate
The key design decision is the verification step. Don’t assume an action succeeded because the model executed it. Have the model take a screenshot after each significant action and confirm the expected state before proceeding. This catches UI race conditions, loading states, and the occasional model error.
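The act-verify cycle is simple to express in code. A minimal sketch, with the action executor, screenshot capture, and verification call passed in as plain functions; the names here are placeholders for your own plumbing, not any real API:

```python
# Act-verify loop sketch. `execute`, `take_screenshot`, and `verify_state`
# are placeholders: your action executor, your screenshot capture, and a
# model call that checks the screenshot against the expected post-action state.

MAX_RETRIES = 2

def run_step(step, take_screenshot, execute, verify_state):
    """Execute one sub-step, confirming the expected state before returning."""
    for attempt in range(MAX_RETRIES + 1):
        execute(step["action"])
        screenshot = take_screenshot()
        if verify_state(screenshot, step["expected_state"]):
            return True   # verified: continue to the next step
    return False          # repeated failures: escalate to a human
```

The retry cap matters: without it, a loading spinner or UI race condition turns into an infinite loop of screenshots, each one burning input tokens.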

Risk controls are essential. GPT-5.4 allows configuring custom confirmation policies — different risk tolerance for different action categories. I use three tiers:

confirmation_policy = {
    "read_only": "auto",       # Screenshots, page reads — no confirmation
    "low_risk_write": "auto",  # Form fills, text input — no confirmation
    "high_risk_write": "confirm",  # File deletion, emails, purchases — confirm
    "destructive": "confirm"   # Anything irreversible — always confirm
}
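In practice I put a small gate in front of the action executor that consults this table. A sketch, repeating the policy dict so it stands alone; the action categories are labels you assign yourself, not something the API provides:

```python
# Confirmation gate in front of the action executor. The policy table repeats
# the one above so this sketch stands alone; categories are your own labels,
# not part of the OpenAI API.
confirmation_policy = {
    "read_only": "auto",
    "low_risk_write": "auto",
    "high_risk_write": "confirm",
    "destructive": "confirm",
}

def requires_confirmation(action_category: str) -> bool:
    # Unknown categories fail closed: require confirmation.
    return confirmation_policy.get(action_category, "confirm") == "confirm"

print(requires_confirmation("low_risk_write"))  # False
print(requires_confirmation("destructive"))     # True
```

The fail-closed default is the important design choice: an action the classifier has never seen should pause for a human, not sail through.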

The Context Window Question

GPT-5.4 defaults to 272K context but supports up to 1M tokens experimentally. For computer use agents, this matters because screenshot sequences accumulate quickly.

A realistic agent session doing a 20-step workflow will generate 15-25 screenshots. At medium screenshot quality, you’re looking at 5-15K tokens per screenshot. A 20-step workflow can easily consume 200-300K tokens on screenshots alone.
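Those ranges compound quickly, and a back-of-envelope helper makes the budget explicit before you commit to a context tier (the per-screenshot token figures are the estimates quoted above, not measured values):

```python
# Screenshot token budget for a workflow, using the 5-15K tokens/screenshot
# range quoted above for medium screenshot quality.

def screenshot_token_budget(n_screenshots: int,
                            tokens_low: int = 5_000,
                            tokens_high: int = 15_000) -> tuple[int, int]:
    """Return the (low, high) token estimate for a run's screenshots."""
    return n_screenshots * tokens_low, n_screenshots * tokens_high

low, high = screenshot_token_budget(20)
print(f"{low:,} - {high:,} tokens")  # 100,000 - 300,000 tokens
```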

Practical recommendation: use the 272K default unless your workflow is genuinely long-running. Requests that cross into the 1M context tier are billed at 2× on input tokens, which adds up fast if you’re not careful.

Pricing Reality Check

GPT-5.4 is priced at $2.50/1M input tokens and $20.00/1M output tokens. For computer use sessions specifically:

  • Simple 5-step task (fill a form, submit): ~30-50K tokens → $0.07-0.12
  • Medium task (15-step with verification): ~150-200K tokens → $0.38-0.50
  • Complex workflow (50+ steps, multiple apps): 500K+ tokens → $1.25+
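You can reproduce those figures from the posted rates. A sketch that treats sessions as input-dominated, since screenshots land on the input meter; output tokens are a small fraction, and the bullet figures above appear to ignore them too:

```python
# Session cost from the posted GPT-5.4 rates: $2.50/1M input, $20.00/1M output.
INPUT_RATE = 2.50 / 1_000_000    # dollars per input token
OUTPUT_RATE = 20.00 / 1_000_000  # dollars per output token

def session_cost(input_tokens: int, output_tokens: int = 0) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

print(f"${session_cost(30_000):.3f}")   # $0.075: low end of the simple task
print(f"${session_cost(200_000):.2f}")  # $0.50: top of the medium task
print(f"${session_cost(500_000):.2f}")  # $1.25: floor of the complex workflow
```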

For comparison, doing this with GPT-5.2 + specialized computer-use model was easily 2-3× more expensive per task due to the dual-model architecture.

My Take: What Actually Changed

The OSWorld benchmark is a useful headline, but what I find more practically significant is the consolidation. Before GPT-5.4, building a production-grade agent that could “use a computer” meant managing a reasoning model, a computer-use model, and the plumbing between them.

Now it’s one model. One API endpoint. One context window that holds the entire task history. One billing meter. One point of failure to monitor.

For teams that have been waiting to invest in computer use automation because the multi-model complexity wasn’t worth the payoff — that calculus has changed. The infrastructure got simpler exactly as the capability got stronger.

The real test isn’t the benchmark. It’s whether your actual workflows transfer. Start with a task your team does manually 10+ times a day, instrument it well, and measure against the benchmark. In my experience, real-world performance tends to land 10-15 points below benchmark numbers — which still puts GPT-5.4 comfortably above its predecessors.


GPT-5.4 is available now via the OpenAI API. Computer use requires the Responses API endpoint. For migration from Chat Completions, see OpenAI’s migration guide. GPT-5.2 Thinking retires June 5, 2026.
