
Vercel's AI Gateway now supports real-time voice conversations, text-to-speech, and transcription alongside its existing text and image capabilities. Real-time voice agents can hold natural conversations where users interrupt and talk over the model, unlike traditional pipelines that chain separate speech-to-text, language model, and text-to-speech steps. All audio calls route through the same unified API gateway with shared observability and spend controls.
What happened
Vercel's AI Gateway now supports audio and voice capabilities including real-time voice conversations, text-to-speech, and speech-to-text transcription. These features launch with models from OpenAI and xAI, available in beta through AI SDK 7.
Why it matters
Real-time voice agents let applications hold live conversations in which users can interrupt and talk over the model as they would with a person, which makes them practical for voice assistants, customer support, and hands-free tools. Instead of chaining separate speech-to-text, language model, and text-to-speech steps, a single real-time model hears and produces audio directly.
What to watch
Audio calls route through the same AI Gateway infrastructure as text and image requests, so developers can manage all modalities (voice, text, video) with one API key, unified observability, and shared spend controls. A browser-based playground lets developers test audio models without writing code.