
Audio-Interaction, created by researchers from China, Hong Kong, and Singapore, processes continuous audio streams and outputs either a <silent> or a <response> token after each 0.4-second chunk, so the model can stay quiet or begin speaking based on context. The system handles translation, transcription, dialog, and reactions to everyday noises in a single model.
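The per-chunk decision described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: only the 0.4-second chunk length and the two token names come from the summary. The function names, the `decide` callback, and the energy-threshold policy standing in for the model are all hypothetical.

```python
from typing import Callable, Iterator, List, Sequence

CHUNK_SECONDS = 0.4      # chunk length reported in the summary
SILENT = "<silent>"      # token names reported in the summary
RESPONSE = "<response>"

Chunk = Sequence[float]


def split_into_chunks(samples: Sequence[float], sample_rate: int) -> Iterator[Chunk]:
    """Yield consecutive 0.4-second slices of a mono sample sequence."""
    chunk_size = int(round(CHUNK_SECONDS * sample_rate))
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def stream_decisions(
    samples: Sequence[float],
    sample_rate: int,
    decide: Callable[[List[Chunk]], str],
) -> List[str]:
    """Ask `decide` for one token after every chunk, given all chunks heard so far."""
    history: List[Chunk] = []
    tokens: List[str] = []
    for chunk in split_into_chunks(samples, sample_rate):
        history.append(chunk)
        token = decide(history)
        if token not in (SILENT, RESPONSE):
            raise ValueError(f"unexpected token: {token!r}")
        tokens.append(token)
    return tokens


def energy_policy(history: List[Chunk], threshold: float = 0.1) -> str:
    """Hypothetical stand-in for the model: respond when the newest chunk is loud."""
    latest = history[-1]
    mean_abs = sum(abs(x) for x in latest) / len(latest)
    return RESPONSE if mean_abs > threshold else SILENT


if __name__ == "__main__":
    # Toy signal at 10 samples per second: quiet, loud, then quiet again.
    toy = [0.0] * 4 + [1.0] * 4 + [0.0] * 2
    print(stream_decisions(toy, sample_rate=10, decide=energy_policy))
```

In the real system the decision would come from the model itself, conditioned on the audio context; the threshold rule here only makes the control flow concrete.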
The team built StreamAudio-2M, a training dataset with 2.6 million units and about 302,000 hours of audio across seven skill areas and 28 subtasks. A language model designed realistic scenarios; matching clips were sourced from a database, and missing sounds were generated with audio models like AudioX or ElevenLabs. The recordings were then smoothed for naturalness.
Audio-Interaction scored 58.15 points on the audio benchmark MMAU, narrowly beating its base model Qwen2.5-Omni-3B and coming close to much larger 7B models. On ProactiveSound Bench, which contains 644 human-curated events, the model outperforms Gemini 3 Flash, Kimi-Audio-Instruct, and Step-Audio 2.
Code and weights are available on GitHub under the Apache 2.0 license with no restrictions on commercial use; the full training dataset is set to follow later.