GPT-5.4: OpenAI’s New Model Brings Native Computer Use & 1M Token Context Window

by Chief Editor

OpenAI’s Rapid Fire Releases: A Sign of AI’s New Battleground

The artificial intelligence landscape is moving at breakneck speed. Just days after releasing GPT-5.3 Instant, OpenAI has launched GPT-5.4, a substantial upgrade arriving amidst a turbulent period for the company. This follows a wave of user cancellations and a public disagreement with Anthropic’s CEO, sparked by OpenAI’s deal with the U.S. Department of Defense.

GPT-5.4: Key Improvements and Capabilities

OpenAI is positioning GPT-5.4 as its most capable and efficient model for professional work, offering three configurations: a standard version, GPT-5.4 Thinking for complex reasoning, and GPT-5.4 Pro for demanding workloads. GPT-5.4 Thinking is now available to ChatGPT Plus, Team, and Pro subscribers, replacing GPT-5.2 Thinking. The Pro version is exclusive to the $200-per-month ChatGPT Pro and Enterprise tiers.

Benchmark results are impressive. On GDPval, OpenAI’s internal metric for knowledge work, GPT-5.4 matched or exceeded industry professionals in 83% of comparisons, an increase from 70.9% for GPT-5.2. On OSWorld-Verified, assessing desktop environment navigation, GPT-5.4 achieved a 75% success rate, surpassing the 72.4% human benchmark and significantly improving upon GPT-5.2’s 47.3%. It also leads on Mercor’s APEX-Agents benchmark, evaluating sustained professional tasks.

Hallucinations, or incorrect factual claims, are reportedly 33% less frequent in GPT-5.4 compared to GPT-5.2, with overall response errors down by 18%.

Native Computer Use and Expanded Context Window

Perhaps the most significant advancement is GPT-5.4’s native computer use capability within Codex and the API. This allows the model to operate software, navigate file systems, and execute multi-step workflows – functionality previously requiring specialized agentic frameworks. This simplifies automation pipeline development by reducing integration complexity.

The API now supports context windows of up to 1 million tokens, two and a half times the 400,000 offered by GPT-5.3 and the largest OpenAI has released. This is a major benefit for organizations processing large datasets, codebases, or financial records, allowing more complete context without relying on retrieval workarounds. However, OpenAI charges double the standard per-million-token rate for tokens beyond 272,000. Google’s Gemini 3.1 Pro offers a 2-million-token context at a lower base price.
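The two-tier billing can be sketched as simple arithmetic. Only the 272,000-token threshold and the doubling beyond it come from the announcement; the base rate in the example below is a placeholder, so check current pricing before relying on any figure:

```python
def long_context_cost(input_tokens: int,
                      base_rate_per_m: float,
                      threshold: int = 272_000) -> float:
    """Estimate input cost in dollars when tokens beyond `threshold`
    are billed at double the base per-million-token rate.
    `base_rate_per_m` is a hypothetical placeholder value."""
    standard = min(input_tokens, threshold)      # tokens at the base rate
    premium = max(input_tokens - threshold, 0)   # tokens at the doubled rate
    return (standard * base_rate_per_m
            + premium * 2 * base_rate_per_m) / 1_000_000

# With a hypothetical $1.25 per-million base rate, an 800K-token prompt
# splits into 272K standard tokens plus 528K doubled-rate tokens.
print(long_context_cost(800_000, 1.25))  # → 1.66
```

The doubled tier means cost grows faster than linearly once a prompt crosses the threshold, which is worth modeling before routing very large documents through the API.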

A new Tool Search system improves API efficiency. Previously, each call included full tool specifications, adding potentially thousands of tokens. Now, the model retrieves tool definitions on demand, reducing token usage by 47% in internal testing. This translates to lower costs and faster responses for developers.
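The idea behind on-demand tool retrieval can be illustrated with a minimal sketch. This is not OpenAI’s actual API; the registry, tool names, and schemas below are invented for illustration. The point is the payload difference between shipping every full specification up front and resolving only the definitions the model asks for:

```python
import json

# Hypothetical registry of full tool specifications, invented for
# illustration. In the on-demand pattern, full specs are resolved only
# when requested, rather than being attached to every API call.
TOOL_REGISTRY = {
    "get_weather": {
        "name": "get_weather",
        "description": "Fetch current weather for a city.",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"}}},
    },
    "search_files": {
        "name": "search_files",
        "description": "Search the workspace for a filename pattern.",
        "parameters": {"type": "object",
                       "properties": {"pattern": {"type": "string"}}},
    },
}

def eager_payload() -> str:
    """Old approach: every full tool spec travels with each request."""
    return json.dumps(list(TOOL_REGISTRY.values()))

def lazy_payload(requested: list[str]) -> str:
    """On-demand approach: only the specs the model asked for are sent."""
    return json.dumps([TOOL_REGISTRY[name] for name in requested])

eager = len(eager_payload())
lazy = len(lazy_payload(["get_weather"]))  # model requested one tool
print(f"eager: {eager} chars, lazy: {lazy} chars")
```

With dozens of tools, most of which go unused in any given call, the savings compound across every request in a pipeline, which is consistent with the 47% reduction OpenAI reports from internal testing.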

The Evolving Benchmark Landscape

While GPT-5.4 leads on the Mercor APEX-Agents benchmark, it’s important to note that current models are not yet at professional-grade reliability for long-horizon tasks. Mercor’s CEO described current models as being akin to “an intern that gets it right a quarter of the time.”

OpenAI’s internal GDPval benchmark, measuring individual deliverables, and APEX-Agents, testing sustained workflows, assess different aspects of performance. Both are valuable, but neither provides a complete picture.

Safety and Controllability

OpenAI has introduced CoT Controllability, an open-source evaluation to assess whether reasoning models can deliberately obscure their chain-of-thought to evade monitoring. GPT-5.4 Thinking demonstrates a low ability to control its reasoning, which OpenAI views as a positive safety signal. This aligns with research from Anthropic, which has observed similar behavior in its own models.

The Competitive AI Arena

GPT-5.4’s release occurs during a highly competitive period. Anthropic’s Claude Opus 4.6 remains a leader in coding benchmarks, while Google’s Gemini 3.1 Pro excels in abstract reasoning and offers a larger context window at a lower price. GPT-5.4 appears to lead in desktop computer use and professional knowledge work, based on OpenAI’s highlighted benchmarks.

The rapid release cadence – GPT-5.3 Instant on Monday, GPT-5.4 on Thursday – suggests OpenAI is prioritizing visibility and maintaining momentum. Whether this strategy will drive sustained enterprise adoption or simply accelerate benchmark turnover remains to be seen.

FAQ

Q: What is a “token” in the context of AI models?
A: A token is a unit of text that the model processes. It can be a whole word, part of a word, or a single character. The number of tokens in a prompt and response affects processing time and cost.
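For a rough sense of scale, a common rule of thumb is that English text averages around four characters per token. This is only an approximation; exact counts require the model’s own tokenizer (for example, OpenAI’s tiktoken library):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters-per-token rule of
    thumb for English text. Exact counts depend on the model's actual
    tokenizer; this heuristic is only for ballpark budgeting."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("Hello, world!"))  # 13 chars → ~3 tokens
```

By this heuristic, a 1-million-token context window corresponds to roughly 4 million characters of English text, on the order of several long novels.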

Q: What is a context window?
A: The context window is the amount of text the model can consider at once when generating a response. A larger context window lets the model work with longer documents and conversations without losing track of earlier content.

Q: What is “hallucination” in AI?
A: Hallucination refers to instances where an AI model confidently generates information that is false or unsupported, often in a plausible-sounding form.

Q: What is the difference between GPT-5.4 Thinking and GPT-5.4 Pro?
A: GPT-5.4 Thinking is optimized for tasks requiring extended reasoning, while GPT-5.4 Pro is designed for the most demanding workloads.

Q: Is OpenAI’s deal with the Department of Defense controversial?
A: Yes, the deal has sparked debate and protests, with concerns raised about the military use of AI and potential ethical implications.

Did you know? Anthropic refused to allow its technology to be used for mass surveillance or fully autonomous weapons, leading to the Department of Defense blacklisting its models.

Pro Tip: When evaluating AI models, consider the specific benchmarks used and understand their limitations. No single benchmark provides a complete picture of a model’s capabilities.

Stay informed about the latest developments in AI. Explore more articles and share your thoughts in the comments below!
