Vercel AI Gateway adds real-time voice agents, speech-to-text

Vercel AI Blog · 15h ago · 4 min read

Key takeaway

Vercel's AI Gateway now supports real-time voice conversations, text-to-speech, and transcription alongside its existing text and image capabilities. Real-time voice agents can hold natural conversations where users interrupt and talk over the model, unlike traditional pipelines that chain separate speech-to-text, language model, and text-to-speech steps. All audio calls route through the same unified API gateway with shared observability and spend controls.

3 Key Points

What happened
Vercel's AI Gateway now supports audio and voice capabilities including real-time voice conversations, text-to-speech, and speech-to-text transcription. These features launch with models from OpenAI and xAI, available in beta through AI SDK 7.
Why it matters
Real-time voice agents let applications hold live conversations where users can interrupt and talk over the model as they would with a person, making it practical for voice assistants, customer support, and hands-free tools. Unlike chaining separate speech-to-text, language model, and text-to-speech steps, a single real-time model hears and produces audio directly.
What to watch
Audio calls route through the same AI Gateway infrastructure as text and image requests, meaning developers can manage all modalities—voice, text, video—with one API key, unified observability, and shared spend controls. A browser-based playground lets developers test audio models without writing code.
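
The contrast under "Why it matters" can be sketched in a few lines. This is a conceptual illustration only: every name below (AudioClip, chainedPipeline, realtimeModel, RealtimeSessionSketch) is hypothetical and stubbed, and none of it is an AI SDK or AI Gateway API. It shows the chained approach making three separate model calls, the real-time approach making one, and barge-in modeled as user speech discarding whatever the model had left to say.

```typescript
// Hypothetical stand-in for a chunk of audio; not a real SDK type.
type AudioClip = { kind: "audio"; content: string };

// Chained pipeline: three stubbed stages, each standing in for a separate model call.
const speechToText = (clip: AudioClip): string => clip.content;
const languageModel = (prompt: string): string => `reply to: ${prompt}`;
const textToSpeech = (text: string): AudioClip => ({ kind: "audio", content: text });

function chainedPipeline(input: AudioClip): { output: AudioClip; modelCalls: number } {
  const transcript = speechToText(input);
  const reply = languageModel(transcript);
  const output = textToSpeech(reply);
  return { output, modelCalls: 3 };
}

// Real-time: a single stubbed model takes audio in and returns audio out.
function realtimeModel(input: AudioClip): { output: AudioClip; modelCalls: number } {
  const output: AudioClip = { kind: "audio", content: `reply to: ${input.content}` };
  return { output, modelCalls: 1 };
}

// Barge-in, modeled as a queue of reply chunks that user speech clears.
class RealtimeSessionSketch {
  private pending: string[] = [];

  startReply(chunks: string[]): void {
    this.pending = [...chunks];
  }

  // Returns the next chunk to play, or undefined when nothing is left.
  nextChunk(): string | undefined {
    return this.pending.shift();
  }

  // Incoming user speech drops the rest of the model's reply.
  onUserSpeech(): void {
    this.pending = [];
  }

  get isSpeaking(): boolean {
    return this.pending.length > 0;
  }
}
```

In this sketch the client never runs a silence timer: it only forwards the fact that the user started speaking, and the session stops the reply. How the actual service detects turns is not described in the article.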

FAQ

Which AI models power the voice features?
Audio capabilities launch with models from OpenAI and xAI. Real-time voice uses OpenAI's gpt-realtime-2, text-to-speech uses xAI's grok-tts, and transcription uses OpenAI's whisper-1.
What makes real-time voice different from a chained speech-to-text pipeline?
A single real-time model hears audio and produces audio directly. A traditional pipeline instead runs speech-to-text, then a language model, then text-to-speech as separate steps. Real-time voice also lets users interrupt and talk over the model (barge-in) without client-side silence timers.
Is this available now, and which SDK is required?
The audio features are in beta and available in AI SDK 7. Developers can also test audio models in a browser-based playground without writing code.
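
The model assignments named in the FAQ can be written down as a small lookup table. The sketch below is illustrative: the capability keys, the audioModels table, and the modelSlug helper are hypothetical names, and the "provider/model" string format is an assumption about how a gateway might identify models, not something the article confirms.

```typescript
// Capability names are hypothetical labels for the three features in the article.
type AudioCapability = "realtime" | "speech" | "transcription";

// Provider and model names as reported in the article's FAQ.
const audioModels: Record<AudioCapability, { provider: string; model: string }> = {
  realtime: { provider: "openai", model: "gpt-realtime-2" },
  speech: { provider: "xai", model: "grok-tts" },
  transcription: { provider: "openai", model: "whisper-1" },
};

// Assumed "provider/model" identifier format; adjust to whatever the gateway expects.
function modelSlug(capability: AudioCapability): string {
  const { provider, model } = audioModels[capability];
  return `${provider}/${model}`;
}
```

Keeping the mapping in one place means a later model swap touches a single table entry rather than every call site.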
