Microsoft VibeVoice ASR integrated into Hugging Face Transformers; voice AI framework now supports 60-minute speech recognition in a single pass

Hacker News · Apr 28, 2026 · 2 min read

3 Key Points

VibeVoice-ASR, a speech-to-text model, is now included in a Hugging Face Transformers release and can be used directly through the library. The model handles 60-minute long-form audio in a single pass and generates structured transcriptions that identify the speaker, timestamps, and content, with support for customized context.
The model is natively multilingual, supporting over 50 languages. vLLM is now supported for faster inference. Fine-tuning code is available, and a technical report has been published.
VibeVoice uses continuous speech tokenizers (Acoustic and Semantic) operating at a 7.5 Hz frame rate to preserve audio fidelity while improving computational efficiency. It employs a next-token diffusion framework in which an LLM (an AI model that understands and generates text) interprets the textual context and a diffusion head generates high-fidelity acoustic details.
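
To make the efficiency point concrete, the following is a rough back-of-the-envelope sketch of how many tokenizer frames a 60-minute recording produces at the 7.5 Hz frame rate quoted above. The helper name `frames_for_audio` and the 50 Hz comparison rate are illustrative assumptions, not values from the VibeVoice release, and the count ignores any padding or special tokens a real pipeline might add.

```python
def frames_for_audio(duration_minutes: float, frame_rate_hz: float) -> int:
    """Return the number of tokenizer frames for a clip of the given length."""
    return round(duration_minutes * 60 * frame_rate_hz)


# Frame rate stated in the summary for the VibeVoice speech tokenizers.
VIBEVOICE_FRAME_RATE_HZ = 7.5

# Hypothetical higher frame rate, used only as a point of comparison.
COMPARISON_FRAME_RATE_HZ = 50.0

vibevoice_frames = frames_for_audio(60, VIBEVOICE_FRAME_RATE_HZ)
comparison_frames = frames_for_audio(60, COMPARISON_FRAME_RATE_HZ)

print(vibevoice_frames)   # 27000
print(comparison_frames)  # 180000
```

Under these assumptions, an hour of audio comes to 27,000 frames, which suggests why a low frame rate helps a full recording fit in one pass.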
