Something quietly significant happened in late March 2026: OpenAI’s GPT-5.4 scored 75% on OSWorld-V — a benchmark that simulates real desktop productivity tasks — slightly above the human baseline of 72.4%. On paper, that’s just a number. In practice, it’s a milestone that changes how I think about what AI agents can actually do in production.
I’ve been building AI-assisted systems since GPT-3. I’ve seen the chat era, the RAG era, the tool-calling era. This feels different. Let me break down why GPT-5.4 and the extended Responses API represent a genuine architectural shift — and what it means for engineers building real systems.
The Shell Tool Changes Everything
Previous models had a “code interpreter” — essentially a sandboxed Python REPL. Useful, but limited. GPT-5.4’s shell tool is something else entirely.
With the Responses API’s new shell support, an agent can now:
- Run Go or Java programs
- Start a Node.js server
- Execute bash scripts
- Query databases directly
- Install packages and manage dependencies
This is not just a wider code execution surface. It’s a fundamentally different execution model. Here’s what a basic agentic workflow looks like now:
```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.4",
    input="Analyze the performance bottleneck in this Node.js app, fix it, and run the benchmarks",
    tools=[{"type": "shell"}],
    # Files live in the hosted container
    container={"type": "hosted", "files": ["app.js", "package.json"]},
)
```
The model doesn’t just suggest a fix — it runs it, verifies it works, and iterates. That’s the build-run-verify-fix loop that used to require a human in the middle.
Context Compaction: The Unsung Hero
One of my biggest frustrations with long-running agents has always been context overflow. The agent starts a complex task, burns through 100k tokens in intermediate steps, then hits the limit and fails or loses important context.
GPT-5.4 introduces native compaction — the first mainline model trained specifically for this. During a long agent trajectory, the model compresses previous steps into a shorter representation while preserving critical context. Think of it like a skilled project manager who can summarize a two-hour meeting into the three decisions that actually matter.
This is a massive win for production deployments. I’ve seen agents on GPT-4 fail on tasks that are essentially trivial for a human but require many iterative steps. Compaction makes these tractable.
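GPT-5.4 does this natively, but the underlying idea is easy to sketch client-side for agents running on models without it. Everything below is my own illustration, not an SDK feature: `summarize` stands in for a call to a cheap summarization model, and the token estimate is deliberately crude.

```python
# Minimal sketch of trajectory compaction for a long-running agent.
# Older steps are collapsed into short summaries until the trajectory
# fits the token budget; the most recent steps are kept verbatim.

def rough_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return len(text) // 4

def summarize(step: str) -> str:
    # Placeholder for a real model call; keep only the step's first line.
    return step.splitlines()[0]

def compact_trajectory(steps: list[str], budget: int, keep_recent: int = 3) -> list[str]:
    """Compress steps oldest-first until the trajectory fits the budget,
    never touching the last `keep_recent` steps."""
    steps = list(steps)
    i = 0
    while sum(rough_tokens(s) for s in steps) > budget and i < len(steps) - keep_recent:
        steps[i] = summarize(steps[i])
        i += 1
    return steps
```

The key design point is the same one the model presumably learned: compress oldest-first, and never lossily rewrite the steps the agent is actively working from.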
Tool Search: Solving the Discovery Problem
Here’s a real problem I face at work: we have 200+ internal tools registered in our system. When an agent needs to select the right one, feeding all 200 definitions into context is wasteful and hurts accuracy.
GPT-5.4 handles this with deferred tool loading — tools become searchable rather than pre-loaded. The model queries for relevant tools based on the current task, then loads only those definitions. In practice:
- Reduced token usage by ~40% for large tool registries
- Improved tool selection accuracy
- Faster inference for agentic workloads
For .NET developers building enterprise tooling, this is the API design pattern to adopt: register your tools with semantic descriptions and let the model discover them.
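The deferred-loading pattern can be approximated client-side even without the native support. A minimal sketch, using names of my own invention (`ToolRegistry` is not part of the SDK), with keyword overlap standing in for the embedding search a production version would use:

```python
# Sketch of a searchable tool registry: only matching tool
# definitions get loaded into the request, not the full registry.

from dataclasses import dataclass, field

@dataclass
class ToolRegistry:
    tools: dict[str, dict] = field(default_factory=dict)      # name -> full definition
    keywords: dict[str, set[str]] = field(default_factory=dict)

    def register(self, name: str, definition: dict, keywords: set[str]) -> None:
        self.tools[name] = definition
        self.keywords[name] = {k.lower() for k in keywords}

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        """Return definitions for the tools whose keywords best match the query,
        dropping tools with no overlap at all."""
        words = set(query.lower().split())
        ranked = sorted(
            self.tools,
            key=lambda name: len(self.keywords[name] & words),
            reverse=True,
        )
        return [self.tools[n] for n in ranked[:top_k] if self.keywords[n] & words]
```

Only the handful of matching definitions go into the model's context, which is where the token savings for large registries come from.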
What 1M Context + Autonomous Execution Actually Means
Let me be concrete. With GPT-5.4’s 1M token context window and built-in computer use:
Before: “Analyze this codebase and suggest improvements” → Agent reads files one by one, misses cross-file dependencies, suggests generic improvements
Now: Entire codebase in context, agent can run the tests, identify failing ones, trace the bug across files, apply the fix, verify it passes — in a single request.
I tested this internally with a 150,000-token .NET solution (about 80 files). GPT-5.4 caught a subtle async deadlock that our team had missed for two sprints. It traced the call chain through 6 layers of abstraction, identified the root cause in SemaphoreSlim misuse, and generated a working fix with unit tests.
The Hard Parts Nobody Is Talking About
This power comes with real engineering challenges:
1. Cost at Scale. A single complex agentic task can easily consume 50k-200k tokens across multiple iterations. At production scale, you need serious cost modeling. Cache aggressively, use GPT-5.4-nano for routing and classification, and reserve full 5.4 for the actual execution.
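A minimal sketch of that routing-plus-cost-modeling idea. The per-million-token prices are placeholders I made up, not real rates, and the keyword heuristic stands in for a real classifier:

```python
# Sketch of a two-tier model router with rough cost accounting.
# Prices are illustrative placeholders, not published rates.

PRICE_PER_M = {"gpt-5.4": 10.00, "gpt-5.4-nano": 0.25}  # USD per 1M tokens, hypothetical

def route(task: str) -> str:
    """Send execution-heavy tasks to the full model,
    short routing/classification work to the nano tier."""
    heavy = any(w in task.lower() for w in ("fix", "refactor", "debug", "implement"))
    return "gpt-5.4" if heavy else "gpt-5.4-nano"

def estimate_cost(model: str, tokens: int) -> float:
    """Rough pre-flight cost estimate for budgeting and alerting."""
    return tokens / 1_000_000 * PRICE_PER_M[model]
```

Even a crude router like this pays for itself quickly when most traffic turns out to be classification-shaped.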
2. Safety in the Shell. The hosted container has network policy controls and allow-lists, but if you're running agents against production systems, you need additional safeguards. Treat a shell-enabled agent like a junior engineer with sudo: capable, but in need of guardrails.
```csharp
// .NET example: scoped tool permissions
var agentConfig = new AgentConfiguration
{
    Model = "gpt-5.4",
    ShellPolicy = new ShellPolicy
    {
        AllowedCommands = ["dotnet", "git", "curl"],
        NetworkAllowList = ["api.internal.company.com"],
        MaxExecutionTime = TimeSpan.FromMinutes(5)
    }
};
```
3. Observability. Long-running agents are notoriously hard to debug. Instrument every tool call, log compaction events, and alert on agents that are spinning in loops. The Responses API provides event streaming; use it.
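The loop-detection part is simple to sketch. The class and threshold below are my own illustration, not an SDK feature; a repeated identical tool call is one of the cheapest and most reliable loop signals:

```python
# Sketch of agent observability: log every tool call and flag
# an agent that keeps issuing the same call, which usually means a loop.

from collections import Counter

class ToolCallMonitor:
    def __init__(self, loop_threshold: int = 3):
        self.calls: list[tuple[str, str]] = []   # full audit log
        self.counts: Counter = Counter()
        self.loop_threshold = loop_threshold

    def record(self, tool: str, args: str) -> bool:
        """Log a call; return True if this exact call now looks like a loop."""
        key = (tool, args)
        self.calls.append(key)
        self.counts[key] += 1
        return self.counts[key] >= self.loop_threshold
```

Wire `record` into whatever handles the streamed tool-call events, and route a `True` return value to your alerting.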
4. Idempotency. Agents can and will retry operations. Design your tools to be idempotent. This seems obvious, but I've seen production incidents caused by an agent retrying a "create record" call three times because a timeout made the first attempt look like a failure.
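The standard fix is an idempotency key: the caller supplies a unique key per logical operation, and retries with the same key return the original result instead of creating duplicates. A minimal sketch (the `RecordStore` class is illustrative, not a real API):

```python
# Sketch of an idempotent "create record" tool. Retries that reuse
# the same idempotency key get the original record id back.

import uuid

class RecordStore:
    def __init__(self):
        self.records: dict[str, dict] = {}
        self.seen_keys: dict[str, str] = {}  # idempotency key -> record id

    def create_record(self, data: dict, idempotency_key: str) -> str:
        # A retry with a previously seen key is a no-op that returns
        # the id from the first successful attempt.
        if idempotency_key in self.seen_keys:
            return self.seen_keys[idempotency_key]
        record_id = str(uuid.uuid4())
        self.records[record_id] = data
        self.seen_keys[idempotency_key] = record_id
        return record_id
```

Have the agent (or your tool-dispatch layer) generate the key once per logical operation, before the first attempt, so every retry carries the same key.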
My Take: What to Build Right Now
After testing GPT-5.4 for two weeks, here’s where I see the highest-value opportunities:
- Automated code review pipelines — Not just style checks, but actual reasoning about correctness and performance across large codebases
- Self-healing infrastructure agents — Detect anomalies, diagnose root cause, apply known fixes, verify resolution
- Document intelligence workflows — Process hundreds of documents with full cross-reference reasoning, not just individual extractions
- Developer tooling assistants — Context-aware agents that understand your entire codebase and CI/CD pipeline
The OSWorld-V benchmark crossing human baseline isn’t just a PR moment for OpenAI. It’s a signal that the planning-execution-verification loop is now something we can reliably delegate to AI in constrained domains. The constraint is still there — you need to define the domain carefully. But within well-defined domains, GPT-5.4 is genuinely capable.
Practical Next Steps
If you want to start experimenting today:
```bash
# Install the latest OpenAI SDK
npm install openai@latest
# or
pip install openai --upgrade
```

```python
# Test the shell tool (requires API access to gpt-5.4)
# Minimal working example
from openai import OpenAI

client = OpenAI()

# Simple agent with shell access
response = client.responses.create(
    model="gpt-5.4",
    input="Write a fibonacci function in Go, test it, and show me the output",
    tools=[{"type": "shell"}],
)

print(response.output_text)
```
The shift from “AI that advises” to “AI that executes” is happening now. The engineers who understand how to architect systems around this capability — with appropriate controls, cost management, and observability — will have a meaningful advantage. Start small, build trust incrementally, and measure everything.
GPT-5.4 is currently rolling out across ChatGPT and the OpenAI API. The shell tool requires the Responses API (not the Chat Completions API). Check the OpenAI changelog for the latest availability updates.