Microsoft VibeVoice ASR integrated into Hugging Face Transformers; voice AI framework now supports 60-minute speech recognition in a single pass
Hacker News · April 28, 2026
AI Summary
• VibeVoice-ASR, a speech-to-text model, is now included in a Hugging Face Transformers release and can be used directly through the library. The model handles 60-minute long-form audio in a single pass and generates structured transcriptions that record the speaker, timestamps, and content, with support for customized context.
• The model is natively multilingual, supporting over 50 languages. vLLM is now supported for faster inference. Finetuning code is available, and a technical report has been published.
• VibeVoice uses continuous speech tokenizers (Acoustic and Semantic) operating at a 7.5 Hz frame rate to preserve audio fidelity while improving computational efficiency. It employs a next-token diffusion framework in which an LLM (an AI model that understands and generates text) interprets the textual context and a diffusion head generates high-fidelity acoustic details.
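
To put the 7.5 Hz frame rate in perspective, the short sketch below works out how many tokenizer frames a 60-minute recording would occupy. The helper name frames_for and the 50 Hz comparison rate are illustrative assumptions, not figures from the summary or from VibeVoice itself, and the count covers a single tokenizer stream only (any text tokens or prompt overhead are ignored).

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate stated in the summary


def frames_for(duration_minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of tokenizer frames covering the given duration.

    Hypothetical helper for illustration: frames = minutes * 60 s * frames/s.
    """
    return round(duration_minutes * 60 * frame_rate_hz)


# One hour of audio at 7.5 Hz: 60 * 60 * 7.5 frames.
one_hour = frames_for(60)

# Same hour at an assumed 50 Hz rate, chosen only for comparison.
comparison = frames_for(60, frame_rate_hz=50.0)

print(f"7.5 Hz: {one_hour} frames")
print(f"50 Hz (assumed): {comparison} frames")
print(f"ratio: {comparison / one_hour:.2f}x")
```

Under these assumptions an hour of audio comes to 27,000 frames per tokenizer stream, which suggests why a low frame rate makes a single pass over an hour of audio feasible within an LLM's context.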