OpenAI has moved its Realtime API out of beta, making it production-ready for building advanced voice agents. Alongside this launch, the company introduced gpt-realtime, its most advanced speech-to-speech (S2S) model to date, delivering faster, more natural, and more expressive voice interactions than its predecessors. The gpt-realtime-2025-08-28
version enhances instruction-following, complex tool calling, and produces speech that sounds highly natural and emotionally expressive. It supports multilingual conversations with seamless mid-sentence language switching and more accurate handling of alphanumeric content.
Pricing has also improved: gpt-realtime costs $32 per million audio input tokens and $64 per million audio output tokens, a 20% reduction compared to the prior model.
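Those per-million-token rates make back-of-the-envelope budgeting straightforward. A minimal helper, using only the prices quoted above (actual audio-token consumption per minute depends on the encoding and is not specified here):

```python
# Rough cost estimate for gpt-realtime audio, using the per-million-token
# prices quoted above: $32 input / $64 output.

INPUT_PRICE_PER_M = 32.00   # USD per 1M audio input tokens
OUTPUT_PRICE_PER_M = 64.00  # USD per 1M audio output tokens

def audio_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost for the given audio token usage."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a session that consumed 500k input and 250k output audio tokens.
print(f"${audio_cost_usd(500_000, 250_000):.2f}")  # → $32.00
```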
Feature-wise, the Realtime API now supports several powerful new capabilities:
– Remote MCP (Model Context Protocol) servers, enabling voice agents to access additional tools and richer contextual information.
– Image input, allowing voice agents to process and refer to visual information within conversations.
– SIP (Session Initiation Protocol) phone calling, which empowers agents to make and receive real phone calls, expanding their utility in business, customer support, and education domains.
– Asynchronous function/tool calling and reusable prompts to enable more complex and flexible dialogue flows.
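Features like remote MCP servers and the new voices are configured per session. A sketch of what such a session configuration payload might look like over the Realtime API's WebSocket interface, following the shape of the published beta `session.update` event; the `"mcp"` tool type and the server URL are illustrative assumptions, not a verified schema:

```python
import json

# Hypothetical session.update payload for a Realtime API WebSocket session.
# The event name and nesting follow the beta schema; the MCP tool entry is
# an assumed shape for attaching a remote Model Context Protocol server.
session_update = {
    "type": "session.update",
    "session": {
        "instructions": "You are a concise, friendly voice assistant.",
        "voice": "marin",  # one of the two new voices (cedar / marin)
        "tools": [
            {
                "type": "mcp",                         # assumed tool type
                "server_url": "https://example.com/mcp",  # placeholder URL
            }
        ],
    },
}

# The event is sent as a JSON text frame over the WebSocket connection.
payload = json.dumps(session_update)
print(payload[:25])
```

Keeping tool wiring server-side (on the MCP server) rather than in the client payload is what lets the agent gain capabilities without redeploying the voice application itself.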
Two new synthetic voices, named Cedar and Marin, have been introduced exclusively for this API, offering developers fresh, high-quality voice options. Additionally, existing voices have been updated to improve quality and expressiveness further.
Compliance and performance improvements accompany the release:
The Realtime API fully supports EU Data Residency requirements, ensuring compliance for applications deployed within the European Union. On benchmark testing, such as the Big Bench Audio evaluation for reasoning tasks, the gpt-realtime
model achieves 82.8% accuracy—substantially higher than the previous generation’s 65.6% from December 2024.
The technology behind the Realtime API exhibits strong real-time interaction attributes:
It implements semantic Voice Activity Detection (VAD) for precise turn-taking, minimizing interruptions and reducing latency. The model does not yet differentiate between multiple speakers, which would help in multi-user or noisy environments, but built-in noise reduction improves robustness against background speech.
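Turn-taking behavior of this kind is typically set in the session configuration. A sketch under assumed field names (`semantic_vad` and the noise-reduction block mirror the capabilities described above; verify the exact names against the API reference):

```python
# Sketch of turn-detection and noise-reduction settings for a Realtime
# session. "turn_detection" existed in the beta schema; the "semantic_vad"
# type and the noise-reduction field names here are assumptions based on
# the capabilities described in the text.
session = {
    "turn_detection": {
        "type": "semantic_vad",   # semantic turn-taking, not raw silence gaps
    },
    "input_audio_noise_reduction": {
        "type": "near_field",     # assumed mode for close-mic audio
    },
}

print(sorted(session))
```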
Developers and industry watchers recognize this launch as a major step forward for voice AI applications. The API’s design facilitates low-complexity integration without heavy server-side overhead, allowing voice mode to be added to applications quickly. Use cases span customer support, personal assistants, education, real estate, and other domains where natural, context-aware voice interaction is paramount. Integration previews with large enterprises such as T-Mobile have already been shared publicly, demonstrating promising production-ready capabilities.
In addition to the Realtime API and gpt-realtime
speech-to-speech model, OpenAI also released gpt-audio
(version 2025-08-28), their first generally available audio model designed for the Chat Completions REST API. It targets audio understanding and generation with pricing set at $40 per million audio input tokens and $80 per million output tokens.
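Because gpt-audio lives in the Chat Completions REST API rather than the WebSocket Realtime API, requests follow the familiar messages format with audio content parts. A hedged sketch of a request body, modeled on the shape OpenAI documented for earlier Chat Completions audio models; treat the exact field names as assumptions to verify against the official docs:

```python
import base64
import json

# Placeholder bytes; real code would read an actual WAV file from disk.
fake_wav_bytes = b"RIFF....WAVE"

# Hypothetical Chat Completions request body for gpt-audio. "modalities",
# "audio", and the "input_audio" content part mirror earlier audio models.
request_body = {
    "model": "gpt-audio",
    "modalities": ["text", "audio"],
    "audio": {"voice": "alloy", "format": "wav"},  # assumed voice name
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe and answer this clip."},
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": base64.b64encode(fake_wav_bytes).decode(),
                        "format": "wav",
                    },
                },
            ],
        }
    ],
}

# The body would be POSTed as JSON to the Chat Completions endpoint.
print(json.dumps(request_body)[:20])
```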
Overall, OpenAI’s Realtime API and gpt-realtime
represent a leap in creating production-grade voice agents— with expressive, contextually intelligent speech capabilities, multimodal inputs, telephony integration, and improved affordability. The updates highlight the platform’s readiness to power real-world voice applications with enhanced user experience and developer flexibility.