How do DFlash models differ from standard drafters?

Standard drafters are autoregressive: they generate one token at a time, so the drafter itself becomes the bottleneck. DFlash replaces that with a block diffusion model that denoises a full block of tokens in one parallel pass, and it conditions on hidden-state representations pulled from the target model's layers rather than guessing from raw tokens alone.
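The difference in control flow can be sketched in a few lines of Python. This is only an illustration of sequential versus parallel drafting, not DFlash's implementation: `toy_next_token` and `toy_denoise_block` are hypothetical stand-ins for a drafter forward pass, and the sketch does not model the conditioning on the target model's hidden states.

```python
from typing import Callable, List


class CallCounter:
    """Wraps a function and counts how many times it is invoked."""

    def __init__(self, fn: Callable):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


def toy_next_token(tokens: List[int]) -> int:
    # Hypothetical stand-in for one forward pass of an autoregressive drafter.
    return (tokens[-1] + 1) % 100


def toy_denoise_block(context: List[int], block_size: int) -> List[int]:
    # Hypothetical stand-in for one parallel denoising pass over a whole block.
    return [(context[-1] + i + 1) % 100 for i in range(block_size)]


def draft_autoregressive(step: Callable, context: List[int], block_size: int) -> List[int]:
    # One drafter pass per proposed token; each pass depends on the previous one.
    tokens = list(context)
    drafted = []
    for _ in range(block_size):
        tok = step(tokens)
        drafted.append(tok)
        tokens.append(tok)
    return drafted


def draft_block(denoise: Callable, context: List[int], block_size: int) -> List[int]:
    # A single pass proposes every position in the block at once.
    return denoise(list(context), block_size)


ar_step = CallCounter(toy_next_token)
block_step = CallCounter(toy_denoise_block)
ar_draft = draft_autoregressive(ar_step, [5], 8)
block_draft = draft_block(block_step, [5], 8)
print(ar_step.calls, block_step.calls)  # 8 1
```

Both toy drafters propose the same eight tokens, but the autoregressive one needs eight dependent passes while the block drafter needs one. That sequential dependency is what the answer above means by the drafter becoming the bottleneck.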

Why does acceptance length matter more than draft length?

Decode is memory-bound rather than compute-bound, so reading the model weights takes the same time whether the pass checks one token or a block. Every token accepted in a pass is nearly free throughput, while every rejected token still costs drafting time. As a result, acceptance length maps almost linearly to speedup.
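A back-of-envelope cost model makes the near-linear relationship concrete. It is a simplification under stated assumptions, not a formula from the article: one target pass costs one time unit no matter how many tokens it verifies, drafting a block costs a fixed fraction of that (`draft_overhead`, a hypothetical parameter), and each cycle emits `acceptance_length` tokens.

```python
def estimated_speedup(acceptance_length: float, draft_overhead: float) -> float:
    """Tokens emitted per draft-and-verify cycle, divided by the cycle's cost.

    Cost is measured in target-pass units. Plain decoding emits one token per
    unit, so this ratio is the speedup over decoding without speculation.
    """
    if acceptance_length <= 0:
        raise ValueError("acceptance_length must be positive")
    if draft_overhead < 0:
        raise ValueError("draft_overhead must be non-negative")
    # The drafting cost is paid every cycle, whether or not the draft is accepted.
    return acceptance_length / (1.0 + draft_overhead)


# Illustrative value only: drafting assumed to cost 25% of a target pass.
overhead = 0.25
for length in (2, 4, 8):
    print(length, estimated_speedup(length, overhead))
```

With the overhead held fixed, doubling the acceptance length doubles the estimated speedup. A longer draft helps only if more of it is accepted, because the denominator is paid either way.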

Where can I access the DFlash models?

The DFlash draft models are available on HuggingFace for Qwen and several other model families, and are already integrated with vLLM, SGLang, and Transformers.


Large Language Models

Modal releases faster AI draft models that overcome a long-standing speed bottleneck in AI inference, reaching over 1000 tokens per second on large models.

Daily Dose of Data Science · 1d ago · 5 min read

Key takeaway

Modal released new DFlash draft models that accelerate AI inference by denoising entire blocks of tokens in parallel rather than one at a time, and by leveraging the target model's internal context representations. This overcomes a historical 2–3× speedup ceiling in speculative decoding; Qwen 3.5 122B-A10B now reaches over 1000 tokens/sec, up from 250 without speculation.


3 Key Points

What happened
Modal released DFlash draft models for Qwen AI models that use block diffusion instead of traditional sequential token generation. Running Qwen 3.5 122B-A10B with these drafters reached over 1000 tokens/sec on a B200, compared with 250 tokens/sec without speculation.
Why it matters
Speculative decoding has been capped at around 2–3× speedup because the small draft model that proposes tokens becomes the bottleneck. DFlash breaks past this cap by denoising an entire block of tokens in parallel and by using the target model's own internal representations to guide drafting, which raises acceptance length from a baseline of 3 to over 9.
What to watch
The DFlash draft models are already integrated with vLLM, SGLang, and Transformers, with draft models available on HuggingFace for Qwen and several other model families. Acceptance length maps nearly linearly to speedup: at an acceptance length of 8, Qwen 3.5 27B achieved a 5.62× speedup on one B200.
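The figures quoted in these three points can be checked with a few lines of arithmetic. The sketch below uses only numbers stated above. The throughput figures are for Qwen 3.5 122B-A10B and the speedup figure is for Qwen 3.5 27B, so the two results should not be combined. Since the summary says "over 1000" and "over 9", the ratios built on those values are lower bounds.

```python
# Qwen 3.5 122B-A10B on a B200: throughput with and without speculation.
baseline_tps = 250.0
speculative_tps = 1000.0  # reported as "over 1000", so treat as a lower bound
throughput_ratio = speculative_tps / baseline_tps

# Qwen 3.5 27B on one B200: reported speedup at an acceptance length of 8.
speedup_27b = 5.62
acceptance_length_27b = 8
speedup_per_accepted_token = speedup_27b / acceptance_length_27b

# Acceptance length: baseline of 3 versus "over 9" with DFlash.
acceptance_gain = 9 / 3

print(f"throughput ratio: at least {throughput_ratio:.1f}x")
print(f"speedup per accepted token: {speedup_per_accepted_token:.2f}")
print(f"acceptance length gain: at least {acceptance_gain:.1f}x")
```

The 27B figure works out to about 0.70× of speedup per accepted token, which fits the "nearly linear" description: most, but not all, of each accepted token turns into throughput.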



