GPT-5.4: AI That Builds Your PPT, Excel Models, and Financial Reports

Category: AI News · Tags: GPT-5.4, OpenAI, Computer Use, AI Agents

On March 5, 2026, OpenAI released GPT-5.4.[1] This was no ordinary version bump. The model became the first general-purpose AI to ship with native Computer Use built in, and it introduced OpenAI’s largest-ever 1M-token context window. GPT-5.4 launched in three variants: standard, Thinking (with visible reasoning traces), and the top-tier Pro version — all available simultaneously via ChatGPT, the API, and Codex.

“GPT-5.4 is our most capable and efficient frontier model for professional work.” — OpenAI, March 5, 2026

OpenAI called it “the best model ever for professional work.”[1] To see whether that claim held up, let’s walk through the capabilities and the numbers.


Core Feature: Native Computer Use

The biggest change in GPT-5.4 was Computer Use. Anthropic’s Claude had offered computer-use capabilities before, but as a separate, specialized mode. GPT-5.4 took a different approach: Computer Use was natively integrated into the general-purpose model itself.

How It Works

GPT-5.4’s Computer Use activates through the API and Codex. The model accepts screenshots as input, analyzes the current state of the screen, and directly executes mouse clicks, drags, and keyboard inputs. In parallel, it generates Playwright code that scripts the same actions — meaning a single run produces both the completed task and an automation script for future use.[2]

Cross-app workflows were also supported. The model could read data from an email client, transfer it into a spreadsheet, and send the result via Slack — all without human intervention. Because it checked a screenshot after each step, it could adapt on the fly when a UI rendered differently than expected.
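The loop the article describes (plan, screenshot, act, verify, and emit a Playwright script in parallel) can be sketched in Python. Everything below is a hypothetical illustration: GPT-5.4’s real action schema and endpoints are not documented here, so the model call is a hard-coded stub and the element names (`email_row_0`, `spreadsheet_cell_A1`) are invented.

```python
# Hypothetical sketch of the screenshot -> action -> verify loop described
# above. The "model" is a stub returning a canned plan; a real integration
# would send each screenshot to the API and receive the next action.

from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    target: str = ""
    text: str = ""

@dataclass
class Screen:
    """Minimal stand-in for a screenshot: which UI element is focused."""
    focused: str = "desktop"

def call_model(screen: Screen, goal: str, step: int) -> Action:
    # Stub plan for a cross-app task: read a figure from email, paste it
    # into a spreadsheet. Invented element IDs, for illustration only.
    plan = [
        Action("click", target="email_row_0"),
        Action("type", target="spreadsheet_cell_A1", text="Q1 revenue: $1.2M"),
        Action("done"),
    ]
    return plan[min(step, len(plan) - 1)]

def to_playwright(action: Action) -> str:
    # The article notes the model emits a Playwright script alongside its actions.
    if action.kind == "click":
        return f'await page.click("#{action.target}");'
    if action.kind == "type":
        return f'await page.fill("#{action.target}", "{action.text}");'
    return "// task complete"

def run(goal: str, max_steps: int = 10) -> list[str]:
    screen, script = Screen(), []
    for step in range(max_steps):
        action = call_model(screen, goal, step)  # analyze current screenshot
        script.append(to_playwright(action))     # parallel script generation
        if action.kind == "done":
            break
        screen.focused = action.target           # execute, then re-screenshot
    return script

script = run("Copy the Q1 revenue figure from email into the spreadsheet")
print("\n".join(script))
```

The per-step re-screenshot is what lets the loop adapt when a UI renders differently than expected, while the accumulated script captures the run for replay.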

Custom Confirmation Policies for Developers

On the API side, developers could define custom confirmation policies. These rules told the model to automatically pause and request human approval before executing sensitive actions — file deletions, payments, external data transfers, and so on. This gave developers direct control over where to draw the line between fully autonomous and human-in-the-loop execution.[3]
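In spirit, such a policy is a gate between the planner and the executor. The sketch below is an assumption about how a developer-side policy could look; OpenAI’s actual policy schema is not described in the article, so the rule format and action names here are invented.

```python
# Hypothetical confirmation-policy gate: sensitive actions pause for human
# sign-off, everything else runs autonomously. Action names are invented.

SENSITIVE = {"delete_file", "make_payment", "send_external"}

def requires_approval(action: str, policy: set[str] = SENSITIVE) -> bool:
    """True when the agent must pause for human approval."""
    return action in policy

def execute(action: str, approve) -> str:
    if requires_approval(action):
        if not approve(action):            # human-in-the-loop gate
            return f"halted: {action} rejected"
        return f"executed (approved): {action}"
    return f"executed: {action}"           # standard actions run autonomously

# Demo: auto-reject every sensitive action.
print(execute("click_button", approve=lambda a: False))
print(execute("make_payment", approve=lambda a: False))
```

Where a developer draws the `SENSITIVE` line is exactly the autonomous-versus-supervised trade-off the article describes.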

```mermaid
flowchart TD
    A["User Instruction Input"] --> B["GPT-5.4 Plans the Task"]
    B --> C["Capture and Analyze Screenshot"]
    C --> D{"Check Confirmation Policy"}
    D -->|Sensitive Action| E["Request User Approval"]
    D -->|Standard Action| F["Execute Mouse/Keyboard Commands"]
    E -->|Approved| F
    E -->|Rejected| G["Halt Task or Explore Alternatives"]
    F --> H["Generate Playwright Code in Parallel"]
    F --> I["Verify with Next Screenshot"]
    I --> J{"Task Complete?"}
    J -->|Not Yet| C
    J -->|Done| K["Return Results and Save Script"]
```

Benchmark Breakdown: What the Numbers Said

OpenAI published benchmark results across several evaluations. Here’s how GPT-5.4 compared to its predecessors, GPT-5.3-Codex and GPT-5.2.[1]

| Benchmark | GPT-5.4 | GPT-5.3-Codex | GPT-5.2 |
| --- | --- | --- | --- |
| GDPval | 83.0% | 70.9% | 70.9% |
| SWE-Bench Pro | 57.7% | 56.8% | 55.6% |
| OSWorld-Verified | 75.0% | 74.0% | 47.3% |
| BrowseComp | 82.7% | 77.3% | 65.8% |
| Toolathlon | 54.6% | 51.9% | 46.3% |
| WebArena-Verified | 67.3% | 65.4% | — |
| Online-Mind2Web | 92.8% | — | — |
| MMMU-Pro | 81.2% | 79.5% | — |

GDPval: A New Bar for Professional Competence

GDPval was a benchmark developed by OpenAI that measured how well AI could handle real-world tasks across 44 professions.[4] GPT-5.4 scored 83.0% — a jump of more than 12 percentage points over GPT-5.2’s 70.9%. The score represented the share of professions in which the model matched or exceeded the performance of human professionals in that field. It was the highest figure ever recorded on this benchmark.[1]

OSWorld-Verified: Computer Control Surpassed Human Average

OSWorld-Verified measured a model’s ability to operate a real operating system environment. GPT-5.4’s score of 75.0% cleared the human average of 72.4%.[2] Given that GPT-5.2 had scored just 47.3%, the scale of improvement in Computer Use between the two generations was hard to miss.

[!KEY] GPT-5.4’s computer control ability already exceeded the human average (72.4%) on OSWorld-Verified. The gap versus its predecessor GPT-5.2 (47.3%) was 27.7 percentage points.


Real-World Task Performance: Spreadsheets, Presentations, and Documents

Beyond formal benchmarks, OpenAI ran separate evaluations on practical workplace tasks. The results were notable.[1][5]

On spreadsheet analysis, GPT-5.4 performed investment-banking analyst-level tasks with 87.3% accuracy — up sharply from GPT-5.2’s 68.4%. The evaluation included complex financial modeling, pivot table construction, and conditional formatting tasks.

On presentations, 68% of human evaluators preferred slides produced by GPT-5.4. Reviewers noted that outputs went beyond simply listing information as text — the model demonstrated awareness of visual hierarchy and information structure.[5]

Image handling capabilities also improved. The model accepted inputs up to 10.24M pixels and 6,000px resolution, making it viable for high-resolution engineering drawings and medical imaging analysis.[1]
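Those two limits interact: an image can be under the 6,000px side cap and still exceed the total pixel budget. A quick arithmetic check, using only the figures stated above:

```python
# Quick check of the stated input limits: at most 10.24M total pixels and
# (assumed here) at most 6,000 px on the longest side.

MAX_PIXELS = 10_240_000
MAX_SIDE = 6_000

def fits(width: int, height: int) -> bool:
    """True if an image of this size stays within both stated limits."""
    return width * height <= MAX_PIXELS and max(width, height) <= MAX_SIDE

print(fits(4096, 2160))  # 4K engineering drawing, ~8.85M px -> True
print(fits(6000, 2000))  # side is at the cap, but 12M px total -> False
```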


Tool Search and 1M Context: What Changed for API Developers

Tool Search: Rethinking How Models Find Their Tools

One persistent bottleneck in AI API development had been system prompt length. Every tool available to the model had to be defined upfront in the system prompt — the more tools, the longer the prompt, and the higher the token cost and latency.

Tool Search solved this. Instead of preloading all tool definitions into the system prompt, the model dynamically searched and retrieved the tools it needed at the moment it needed them.[2] VentureBeat reported that this feature reduced token usage by up to 47% on certain tasks compared to the previous generation.[3]
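The token savings come straight from the arithmetic: only the retrieved definitions enter the prompt. The sketch below simulates the idea with a naive keyword-overlap retriever; the actual Tool Search mechanism and tool schema are not public, so the registry, the scoring, and the token estimate are all illustrative assumptions.

```python
# Hypothetical simulation of Tool Search: retrieve only the tool definitions
# relevant to the task instead of preloading all of them into the prompt.

import re

TOOLS = {
    "send_email":     "Send an email. Args: to, subject, body.",
    "create_invoice": "Create an invoice PDF. Args: client, amount, due_date.",
    "query_database": "Run a read-only SQL query. Args: sql.",
    "post_slack":     "Post a message to a Slack channel. Args: channel, text.",
    "resize_image":   "Resize an image. Args: path, width, height.",
}

def words_of(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def rough_tokens(text: str) -> int:
    return len(text.split())  # crude whitespace token estimate

def search_tools(task: str, k: int = 2) -> dict[str, str]:
    """Rank tools by keyword overlap with the task; keep the top k."""
    tw = words_of(task)
    scored = sorted(TOOLS.items(),
                    key=lambda kv: -len(tw & words_of(kv[0] + " " + kv[1])))
    return dict(scored[:k])

task = "query the database for overdue clients and post the list to slack"
selected = search_tools(task)

preload = sum(rough_tokens(n + " " + d) for n, d in TOOLS.items())
dynamic = sum(rough_tokens(n + " " + d) for n, d in selected.items())
print(sorted(selected))
print(f"prompt tokens: {preload} preloaded vs {dynamic} retrieved")
```

With five tools the saving is modest; with hundreds of tools in the registry, the gap between preloading everything and retrieving a handful is where reductions like the reported 47% would come from.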

1M-Token Context Window

The 1M-token context window was the largest in OpenAI’s history. It was enough to hold multiple novels, hundreds of code files, or extensive meeting transcripts in a single session. Refactoring an entire large codebase in one pass, or reviewing dozens of contracts simultaneously, became realistic options.[1]
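Those capacity claims can be sanity-checked with back-of-envelope math. The figures below are assumptions: the common rough heuristic of ~4 characters per English token, ~500k characters for an average novel, and ~8k characters for a typical source file. Real tokenizer counts will differ.

```python
# Back-of-envelope sizing for a 1M-token window. All constants are rough
# assumptions (~4 chars/token heuristic, average novel and file sizes).

CONTEXT_TOKENS = 1_000_000
CHARS_PER_TOKEN = 4  # heuristic, not a tokenizer measurement

def tokens_for(chars: int) -> int:
    return chars // CHARS_PER_TOKEN

novel = tokens_for(500_000)    # ~500k characters per average novel
code_file = tokens_for(8_000)  # ~8k characters per source file

print(f"novels that fit in one window:     ~{CONTEXT_TOKENS // novel}")
print(f"code files that fit in one window: ~{CONTEXT_TOKENS // code_file}")
```

Under these assumptions, roughly eight novels or five hundred source files fit in one session, consistent with the "multiple novels, hundreds of code files" framing.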


Reduced Hallucinations and Safety: A Question of Trust

Hallucination Reduction

GPT-5.4 showed a 33% reduction in individual claim error rate and an 18% reduction in overall response error rate compared to GPT-5.2.[1] In practical terms, the model was less likely to get factual claims wrong — numbers, dates, names, sources. As AI-generated output increasingly flows into professional workflows without being manually reviewed, improvements in this area carry weight beyond simple benchmark gains.

CoT Transparency: Making Reasoning Visible

The Thinking variant showed the model’s Chain-of-Thought reasoning in real time. Users could see what plan the model was forming mid-task and redirect it if the approach was heading in the wrong direction. This created an early-warning mechanism for catching errors during long autonomous runs.[6]
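That early-warning mechanism amounts to watching the reasoning stream and interrupting on a red flag. The sketch below hard-codes the trace; a real integration would read chunks from the Thinking variant's streaming output, whose exact shape is an assumption here.

```python
# Hypothetical monitor for a visible reasoning stream: pass chunks through,
# but stop the run early if a red-flag step appears. Trace is hard-coded.

def watch_reasoning(chunks, red_flags=("delete", "purchase")):
    """Yield each reasoning chunk; halt when a red-flag step appears."""
    for chunk in chunks:
        if any(flag in chunk.lower() for flag in red_flags):
            yield f"[INTERVENE] model planned: {chunk}"
            return
        yield chunk

trace = [
    "Open the vendor invoices folder",
    "Cross-check totals against the ledger",
    "Delete last year's archive to free space",
]
for line in watch_reasoning(trace):
    print(line)
```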

According to OpenAI’s internal safety evaluations, the reasoning traces in the Thinking variant were difficult to suppress. This aligned with earlier research from Anthropic warning about the potential for reasoning models to conceal their thinking.[7] By making the reasoning process transparent, the design reduced the risk of a model hiding its intent or misleading users.

“The thinking trace is hard to suppress.” — OpenAI Internal Safety Evaluation, March 2026


Conclusion: The Door to the Agent Era Has Opened

GPT-5.4 represented a qualitative departure from its predecessors on several fronts. Computer control exceeded the human average. GDPval hit 83% on a measure of professional-level task performance. The 1M-token context window and Tool Search lowered the practical barriers to building large-scale agent systems.

Until now, AI agents had mostly been implemented in narrowly specialized forms: a coding agent, a search agent, a document summarization agent. GPT-5.4’s native Computer Use blurred those lines. A single model that could autonomously execute complex, multi-app workflows signaled that AI was moving from a simple assistance tool to a genuine task executor.

Caveats remained. Computer Use was available only through the API and Codex — it wasn’t yet fully accessible to general ChatGPT users. Practical questions around security policy, error recovery, and accountability still needed answers. But the benchmark GPT-5.4 set was clear. The age of AI agents had already begun.[5][6]


Footnotes

  1. OpenAI. (2026-03-05). “Introducing GPT-5.4”. OpenAI.

  2. OpenAI. (2026-03-05). “GDPval”. OpenAI.

  3. VentureBeat. (2026-03-05). “OpenAI launches GPT-5.4 with native computer use mode, financial plugins”. VentureBeat.

  4. TechCrunch. (2026-03-05). “OpenAI launches GPT-5.4 with Pro and Thinking versions”. TechCrunch.

  5. Ars Technica. (2026-03-05). “OpenAI introduces GPT-5.4 with more knowledge work capability”. Ars Technica.

  6. The Verge. (2026-03-05). “OpenAI GPT-5.4 model release: AI agents”. The Verge.

  7. Anthropic. (2026). “Reasoning models don’t always say what they think”. Anthropic Research.
