Microsoft VibeVoice ASR integrated into Hugging Face Transformers; voice AI framework now supports 60-minute speech recognition in single pass

Hacker News · April 28, 2026

AI Summary

  • VibeVoice-ASR, a speech-to-text model, is now included in a Hugging Face Transformers release and can be used directly through the library. The model handles up to 60 minutes of long-form audio in a single pass and generates structured transcriptions that identify the speaker, the timestamps, and the spoken content, with support for customized context.
  • The model is natively multilingual, supporting over 50 languages. vLLM is now supported for faster inference. Fine-tuning code is available, and a technical report has been published.
  • VibeVoice uses continuous speech tokenizers (Acoustic and Semantic) operating at a 7.5 Hz frame rate to preserve audio fidelity while improving computational efficiency. It employs a next-token diffusion framework in which an LLM (an AI model that understands and generates text) handles the textual context and a diffusion head generates high-fidelity acoustic details.
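
To make the 7.5 Hz figure concrete, the short sketch below works out how many tokenizer frames a recording of a given length produces. Only the 7.5 Hz rate and the 60-minute duration come from the summary; the function name `frames_for_duration` is hypothetical, and the 50 Hz comparison rate is an illustrative assumption, not a figure attributed to any specific model.

```python
import math

FRAME_RATE_HZ = 7.5  # frame rate stated in the summary


def frames_for_duration(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of frames needed to cover `seconds` of audio, rounded up."""
    return math.ceil(seconds * frame_rate_hz)


one_hour_seconds = 60 * 60

# Frames for a 60-minute recording at the stated 7.5 Hz rate.
frames_at_7_5_hz = frames_for_duration(one_hour_seconds)

# Same recording at a hypothetical 50 Hz rate, for comparison only.
frames_at_50_hz = frames_for_duration(one_hour_seconds, frame_rate_hz=50.0)

print(frames_at_7_5_hz)  # 27000
print(frames_at_50_hz)   # 180000
```

At 7.5 Hz, an hour of audio comes to 27,000 frames per tokenizer stream, which suggests why a low frame rate matters for fitting 60 minutes into a single pass.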