OpenAI Launches Three Realtime Voice Models in API

By Artūras Malašauskas, May 08, 2026

OpenAI has released GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper via its Realtime API, targeting voice agents that can reason, translate, and transcribe during live conversations.

OpenAI has expanded its voice intelligence capabilities with three new models now available through its Realtime API. The announcement marks a shift from simple call-and-response interactions toward voice agents capable of listening, reasoning, translating, and taking action while conversations unfold.

The three models serve distinct purposes. GPT-Realtime-2 brings GPT-5-class reasoning to voice interactions, handling complex requests and maintaining natural conversation flow. GPT-Realtime-Translate enables live speech translation across 70+ input languages into 13 output languages. GPT-Realtime-Whisper provides streaming speech-to-text transcription as speakers talk.

The official OpenAI Community announcement describes the same shift and adds two capabilities to the list: the models are aimed at voice agents that can also transcribe and use tools while the conversation is still unfolding.

For developers building voice applications, the pricing is as follows. GPT-Realtime-2 costs $32 per million audio input tokens ($0.40 per million for cached input tokens) and $64 per million audio output tokens. GPT-Realtime-Translate costs $0.034 per minute, and GPT-Realtime-Whisper costs $0.017 per minute.
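To make the token rates concrete, here is a minimal Python sketch that prices a single GPT-Realtime-2 session at the figures quoted above. The token counts are hypothetical, not measured usage, and the sketch assumes cached input tokens are billed separately from uncached input tokens. The names `Usage` and `realtime2_cost` are illustrative and not part of any OpenAI SDK.

```python
from dataclasses import dataclass

# Prices quoted in the article, in USD per million audio tokens.
INPUT_PER_M = 32.00
CACHED_INPUT_PER_M = 0.40
OUTPUT_PER_M = 64.00


@dataclass
class Usage:
    """Token counts for one session (hypothetical figures, not measured)."""
    input_tokens: int          # uncached audio input tokens (assumption)
    cached_input_tokens: int   # cached audio input tokens (assumption)
    output_tokens: int         # audio output tokens


def realtime2_cost(usage: Usage) -> float:
    """Return the USD cost of one session at the quoted GPT-Realtime-2 rates."""
    total = (
        usage.input_tokens * INPUT_PER_M
        + usage.cached_input_tokens * CACHED_INPUT_PER_M
        + usage.output_tokens * OUTPUT_PER_M
    )
    return total / 1_000_000


if __name__ == "__main__":
    session = Usage(input_tokens=50_000, cached_input_tokens=100_000, output_tokens=20_000)
    print(f"${realtime2_cost(session):.2f}")  # prints $2.92
```

In this made-up session, output audio accounts for $1.28 of the $2.92 total even though it is the smallest token count, which suggests that how much the agent speaks may matter more to the bill than how much it hears.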

Independent reporting from 9to5Mac confirms the model specifications and pricing details. The outlet notes that all three new voice models are included in OpenAI's Realtime API, with developers able to test them in the Playground immediately.

The translation model's language support is particularly notable. Supporting more than 70 input languages and 13 output languages means developers can build applications that handle global communication without requiring separate translation pipelines. The model keeps pace with the speaker, which matters when you're trying to maintain natural conversation rhythm (nobody wants to wait three seconds for their French colleague's words to appear in English).

From a technical standpoint, the streaming nature of GPT-Realtime-Whisper changes how developers approach transcription. Instead of waiting for a speaker to finish before processing audio, the model transcribes as people speak. This reduces perceived latency in applications like live captions, meeting notes, or real-time transcription services.
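The difference between the two approaches can be sketched in a few lines of Python. This is a toy illustration, not the Realtime API: the "audio chunks" are plain strings, and `recognize` is a hypothetical stand-in for whatever turns a chunk into text. A real streaming model may also revise earlier words, which this sketch does not attempt.

```python
from typing import Callable, Iterable, Iterator, List


def batch_transcribe(chunks: Iterable[str], recognize: Callable[[str], str]) -> str:
    """Wait for the whole utterance, then return one final transcript."""
    return " ".join(recognize(chunk) for chunk in chunks)


def streaming_transcribe(
    chunks: Iterable[str], recognize: Callable[[str], str]
) -> Iterator[str]:
    """Yield the running transcript as soon as each chunk has been processed."""
    words: List[str] = []
    for chunk in chunks:
        words.append(recognize(chunk))
        yield " ".join(words)


if __name__ == "__main__":
    # Strings stand in for audio chunks; str.lower stands in for recognition.
    fake_audio = ["HELLO", "AND", "WELCOME"]

    for partial in streaming_transcribe(fake_audio, str.lower):
        print(partial)
    # hello
    # hello and
    # hello and welcome

    print(batch_transcribe(fake_audio, str.lower))  # hello and welcome
```

Both functions end with the same text. The streaming version simply makes the first words available after one chunk instead of after the last one, which is where the lower perceived latency comes from.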

The experience of using these models also differs from previous voice APIs. Users won't experience the awkward pause that typically occurs when a system waits for silence before processing. The models handle interruptions and corrections mid-conversation, which feels more like talking to another person than issuing commands to a machine.

For enterprise applications, the pricing model introduces some complexity. The token-based pricing for GPT-Realtime-2 means costs scale with conversation length and complexity. The per-minute pricing for translation and transcription is more predictable but still requires careful budgeting for high-volume deployments.
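For the per-minute models, a rough monthly projection is simple arithmetic. The Python sketch below uses the quoted rates. The call volume and average call length are invented for illustration, and `monthly_cost` is a hypothetical helper, not an OpenAI billing tool.

```python
# Per-minute prices quoted in the article, in USD.
TRANSLATE_PER_MIN = 0.034
WHISPER_PER_MIN = 0.017


def monthly_cost(
    rate_per_minute: float,
    calls_per_day: int,
    avg_minutes_per_call: float,
    days: int = 30,
) -> float:
    """Project the monthly USD spend for a per-minute priced model."""
    total_minutes = calls_per_day * avg_minutes_per_call * days
    return total_minutes * rate_per_minute


if __name__ == "__main__":
    # Hypothetical deployment: 10,000 calls a day, 4 minutes each.
    volume = dict(calls_per_day=10_000, avg_minutes_per_call=4.0)
    print(f"Translate: ${monthly_cost(TRANSLATE_PER_MIN, **volume):,.2f}")  # $40,800.00
    print(f"Whisper:   ${monthly_cost(WHISPER_PER_MIN, **volume):,.2f}")    # $20,400.00
```

At that assumed volume the deployment consumes 1.2 million minutes a month, so a rate that looks negligible per call becomes a five-figure line item.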

OpenAI positions these models as unlocking a new class of voice apps for developers. Use cases include voice-to-action workflows, live spoken guidance from software, and voice-to-voice conversations across languages. The ability to call tools and handle corrections during conversation opens possibilities for customer service bots, technical support agents, and interactive voice assistants.

Developers with Codex installed can add GPT-Realtime-2 to existing apps or create new applications through the Playground. The integration process appears streamlined, though the actual implementation will depend on each developer's infrastructure and use case requirements.

Whether these models actually deliver on their promises in production environments remains to be seen. The difference between a demo and a deployed voice agent handling thousands of concurrent conversations is substantial. Latency, accuracy under varying audio conditions, and cost management will all factor into real-world adoption.

The competitive landscape for voice AI is intensifying. Other companies are developing similar capabilities, and the margin between good and great voice experiences is narrowing. OpenAI's advantage here is the integration with its broader model ecosystem and the established API infrastructure.

For now, developers have access to test these capabilities. The question isn't whether the technology works—it demonstrably does in controlled environments. The question is whether it works reliably enough at scale to justify the investment for businesses building voice-first applications.

Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, and cost-efficient inference, with no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. Technical editor at: muza-ai.eu, ai-verslas.lt, ai-naujinos.lt
