Researchers develop three-billion-parameter model that listens continuously to audio and decides every 0.4 seconds whether to speak, handling translation, transcription, and sound recognition simultaneously

THE DECODER · Jun 6, 2026

3 Key Points

Audio-Interaction, created by researchers from China, Hong Kong, and Singapore, processes continuous audio streams and outputs either <silent> or <response> tokens after each 0.4-second chunk, enabling the model to stay quiet or begin speaking based on context. The system handles translation, transcription, dialog, and reactions to everyday noises in a single model.
The team built StreamAudio-2M, a training dataset with 2.6 million units and about 302,000 hours of audio across seven skill areas and 28 subtasks. To create it, a language model designed realistic scenarios, matching clips were sourced from a database, missing sounds were generated with audio tools such as AudioX or ElevenLabs, and the recordings were then smoothed for naturalness.
Audio-Interaction scored 58.15 points on the audio benchmark MMAU, narrowly beating its base model Qwen2.5-Omni-3B and coming close to much larger 7B models. On the ProactiveSound Bench with 644 human-curated events, the model outperforms Gemini 3 Flash, Kimi-Audio-Instruct, and Step-Audio 2.
Code and weights are available on GitHub under the Apache 2.0 license with no restrictions on commercial use; the full training dataset is set to follow later.
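
The per-chunk decision described in the first key point can be pictured as a simple loop. The following is a minimal sketch, not the released implementation: the function names (`split_into_chunks`, `run_stream`, `energy_policy`) are hypothetical, and the loudness-based policy is only a toy stand-in for the model. Only the 0.4-second interval and the `<silent>` / `<response>` tokens come from the article.

```python
from typing import Callable, Iterator, List, Sequence

CHUNK_SECONDS = 0.4  # decision interval reported in the article
SILENT, RESPONSE = "<silent>", "<response>"  # control tokens named in the article


def split_into_chunks(samples: Sequence[float], sample_rate: int) -> Iterator[Sequence[float]]:
    """Yield consecutive 0.4-second slices of an audio buffer (hypothetical helper)."""
    chunk_len = int(round(sample_rate * CHUNK_SECONDS))
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


def run_stream(
    samples: Sequence[float],
    sample_rate: int,
    decide: Callable[[List[Sequence[float]]], str],
) -> List[float]:
    """Return the timestamps (seconds) at which the policy chose to respond."""
    history: List[Sequence[float]] = []
    speak_times: List[float] = []
    for index, chunk in enumerate(split_into_chunks(samples, sample_rate)):
        history.append(chunk)    # the policy sees everything heard so far
        token = decide(history)  # one decision after every chunk
        if token == RESPONSE:
            speak_times.append(round((index + 1) * CHUNK_SECONDS, 1))
    return speak_times


def energy_policy(history: List[Sequence[float]]) -> str:
    """Toy stand-in for the model (an assumption): respond when the latest chunk is loud."""
    latest = history[-1]
    mean_abs = sum(abs(x) for x in latest) / len(latest)
    return RESPONSE if mean_abs > 0.5 else SILENT


if __name__ == "__main__":
    # 0.4 s of silence, 0.4 s of loud signal, 0.4 s of silence at a toy 10 Hz rate.
    audio = [0.0] * 4 + [1.0] * 4 + [0.0] * 4
    print(run_stream(audio, 10, energy_policy))
```

In the real system the decision would come from the model itself rather than a loudness threshold; the sketch only illustrates that silence is an explicit output chosen after every chunk.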
