Diffusion LLMs
Generate the whole sentence at once, not one token at a time
The Problem: Your product needs an LLM that streams a 400-token answer in under 200ms for a live voice agent. Your best autoregressive model is accurate but generates one token per forward pass — 400 sequential steps — and you simply cannot hit the latency budget no matter how you batch. The generation is fundamentally serial. Is there a way to generate the whole answer in parallel instead?
The Solution: Diffusion LLMs — Denoising the Whole Sequence in Parallel
Almost every LLM you have used (GPT, Claude, Llama) is autoregressive (AR): it generates one token at a time, left to right, each conditioned on all previous tokens. That is inherently sequential: a 200-token answer needs ~200 forward passes that cannot run in parallel, because each pass depends on the token produced by the one before it. Diffusion language models take a different route borrowed from image diffusion. They start from a fully masked or noised sequence and run a few denoising steps. At each step the model predicts all positions at once (parallel decoding), keeps the tokens it is confident about, and re-masks the uncertain ones for the next pass. Because every token attends to both sides, diffusion has bidirectional context rather than the causal, left-only context of AR. The number of denoising steps is a dial: more steps means cleaner text but more latency; fewer steps means faster but rougher output. In 2025 this went mainstream with Mercury (Inception Labs) and Gemini Diffusion, both advertising substantially higher throughput than comparable AR models.
Think of it like a sculptor versus a writer. An autoregressive model is a writer producing text word by word, left to right, never going back. A diffusion model is a sculptor: it starts from a rough marble block (a fully masked, noisy sequence) and chisels the whole text into shape over a few passes — revealing the entire statue at once, refining the parts that are still rough:
1. Start from a masked / noised sequence: Instead of appending tokens to an initially empty output, the model begins with a target-length sequence of mask tokens (the "noise"). Nothing is decided yet; every position is a blank to be filled.
2. Predict all tokens simultaneously: In a single forward pass the model predicts a distribution for every masked position at once (parallel decoding). Because attention is bidirectional, each prediction is informed by the whole sequence, not just the tokens to its left.
3. Keep confident tokens, re-mask the rest: The model commits the highest-confidence predictions and re-masks the positions it is still unsure about. This is the built-in revision: uncertain tokens get another chance with more context now fixed around them.
4. Repeat for K denoising steps: Steps 2-3 repeat for a small fixed K (often 4-30) until the text stabilizes. K is the speed/quality dial: fewer steps finish faster but leave rougher text; more steps polish it. Total latency tracks K rather than the number of output tokens, although each step still processes the full sequence, so longer outputs make each individual step somewhat more expensive.
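The four steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not how Mercury or Gemini Diffusion are implemented: `MASK`, `denoise`, and `toy_predict` are hypothetical names, the "model" is a toy that already knows the target sentence, and the rule of committing an even share of the remaining positions per step is just one possible unmasking schedule.

```python
import math
from typing import Callable, List, Optional, Tuple

MASK = None  # hypothetical stand-in for a mask token

# A "model" maps the current sequence to one (token, confidence) guess per position.
Predictor = Callable[[List[Optional[str]]], List[Tuple[str, float]]]


def denoise(predict: Predictor, length: int, num_steps: int) -> List[Optional[str]]:
    """Confidence-based unmasking over at most num_steps parallel passes."""
    seq: List[Optional[str]] = [MASK] * length       # step 1: everything masked
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        guesses = predict(seq)                       # step 2: predict all positions at once
        # Assumed schedule: commit an even share of what is left, most confident first.
        quota = math.ceil(len(masked) / (num_steps - step))
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:quota]:                     # step 3: keep the confident tokens
            seq[i] = guesses[i][0]
        # Positions not committed stay masked and are predicted again next pass.
    return seq                                       # step 4: done after K passes


TARGET = "the cat sat on the mat".split()


def toy_predict(seq: List[Optional[str]]) -> List[Tuple[str, float]]:
    """Toy predictor: confidence rises when neighbouring tokens are already fixed."""
    out = []
    for i in range(len(seq)):
        neighbours = [seq[j] for j in (i - 1, i + 1) if 0 <= j < len(seq)]
        fixed = sum(n is not MASK for n in neighbours)
        out.append((TARGET[i], 0.5 + 0.2 * fixed - 0.01 * i))
    return out


print(denoise(toy_predict, len(TARGET), num_steps=3))
```

Note that the loop runs `num_steps` times regardless of `length`; only the work inside each call to `predict` grows with the sequence.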
Diffusion LLMs in Practice
- Low-Latency Inference: Real-time agents, autocomplete, and voice assistants need answers in tens to hundreds of milliseconds. Because a diffusion LLM converges in a small fixed number of denoising steps instead of one step per token, models like Mercury and Gemini Diffusion report tokens-per-second throughput several times higher than comparable autoregressive models
- Whole-Function Code Generation: Generating a complete function or block at once fits diffusion well: every token sees both the signature above and the return below, so the model plans the structure holistically instead of committing to early tokens it later regrets. This is a strong fit for code completion and refactoring
- Text Editing and Infilling: Filling a gap in the middle of a document is awkward for a left-to-right model but natural for diffusion: bidirectional context means the inserted span is conditioned on both the text before and the text after, producing edits that fit seamlessly
- Common Pitfall: Do not confuse text diffusion with image diffusion models (DALL-E, Stable Diffusion). The shared idea is iterative denoising, but text diffusion operates over discrete tokens (usually mask-and-unmask) rather than continuous pixels. Also, fewer denoising steps are not free — cutting steps too aggressively yields rougher, less coherent text
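To make the infilling point concrete, here is a toy comparison of left-only versus two-sided conditioning for a one-token gap in `return ___ / count`. Everything in it is hypothetical: `BIGRAMS` is a hand-written stand-in for what a model has learned, and `score` is a simple count of matching neighbours rather than a real model's probability.

```python
from typing import List, Set, Tuple

# Hypothetical toy "knowledge": word pairs the model has seen together.
BIGRAMS: Set[Tuple[str, str]] = {
    ("return", "total"),
    ("return", "None"),
    ("total", "/"),
    ("/", "count"),
}
VOCAB = ["total", "None", "count"]


def score(left: str, cand: str, right: str, bidirectional: bool) -> int:
    """Count how many neighbours the candidate is compatible with."""
    s = int((left, cand) in BIGRAMS)           # causal context: the token before the gap
    if bidirectional:
        s += int((cand, right) in BIGRAMS)     # extra context: the token after the gap
    return s


def best_fills(left: str, right: str, bidirectional: bool) -> List[str]:
    """Return every candidate tied for the top score."""
    scores = {c: score(left, c, right, bidirectional) for c in VOCAB}
    top = max(scores.values())
    return [c for c in VOCAB if scores[c] == top]


# Gap: "return ___ / count"
print(best_fills("return", "/", bidirectional=False))  # left context only
print(best_fills("return", "/", bidirectional=True))   # both sides
```

With only the left neighbour, `total` and `None` tie; adding the right neighbour leaves `total` as the single best fill, which is the effect bidirectional context has on an inserted span.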
Fun Fact: Mercury (Inception Labs, 2025) reported generating over 1000 tokens per second on H100-class GPUs, roughly 5-10x the throughput of similarly sized autoregressive models. The speedup comes almost entirely from replacing ~N sequential token steps (one per output token) with a handful of parallel denoising passes.
Frequently asked questions
What is a diffusion LLM and how does it differ from an autoregressive model?
A diffusion LLM is a text generation model that produces the whole sequence at once by iterative denoising, rather than predicting tokens strictly left-to-right. An autoregressive (AR) model generates one token at a time, each conditioned on all previous tokens — inherently sequential. A diffusion model starts from a fully masked or noised sequence and, over a few denoising steps, predicts all positions in parallel, keeping confident tokens and re-masking uncertain ones. This makes generation parallelizable and gives every token bidirectional context.
Why can diffusion LLMs be lower latency than autoregressive ones?
An autoregressive model needs one forward pass per output token, so a 200-token answer needs ~200 sequential steps. A diffusion LLM predicts all positions at once each step and converges in a fixed, usually small number of denoising steps (often 4-30) that does not grow token by token with the output. Because the per-step work is parallel across the whole sequence, total latency depends mainly on the step count rather than the token count, which is why models like Mercury (Inception Labs) and Gemini Diffusion, both introduced in 2025, advertise substantially higher tokens-per-second.
What is the tradeoff when you reduce the number of denoising steps?
Denoising steps are the main quality/speed dial. More steps let the model revise uncertain tokens more times, producing cleaner, more coherent text but taking longer. Fewer steps are faster but rougher — you may get more grammatical slips or incoherence because tokens were committed before the context fully stabilized. The art is choosing the smallest step count that still meets your quality bar for the task, the same way image diffusion trades sampler steps for fidelity.
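Choosing the smallest step count that meets a quality bar amounts to a simple sweep. The sketch below assumes you already have some way to score output quality at a given step count; `smallest_passing_steps` is a hypothetical helper, and `toy_quality` is a made-up placeholder curve, not measured data from any model.

```python
from typing import Callable, Iterable, Optional


def smallest_passing_steps(
    quality_at: Callable[[int], float],
    candidates: Iterable[int],
    quality_bar: float,
) -> Optional[int]:
    """Return the smallest candidate step count whose quality meets the bar."""
    for k in sorted(candidates):
        if quality_at(k) >= quality_bar:
            return k
    return None  # no candidate is good enough; widen the sweep or lower the bar


def toy_quality(k: int) -> float:
    """Placeholder curve with diminishing returns; replace with a real evaluation."""
    return 1.0 - 1.0 / (k + 1)


print(smallest_passing_steps(toy_quality, [4, 8, 16, 32], quality_bar=0.85))
```

In practice `quality_at` would run the model at `k` steps on a held-out set and return whatever metric matters for the task, so each call is expensive and the candidate list is worth keeping short.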
Worked Example: Produce a long answer (~400 tokens) for a real-time voice agent with a 200ms latency budget.
The AR model runs one pass for each of ~400 tokens — ~400 sequential steps that cannot be parallelized. Even on fast hardware the total latency is ~1900ms, nearly 10x over the 200ms budget. Text quality is high, but for a live voice agent this is unacceptable: the user hears a long pause. Batching does not help — it raises throughput but does not cut the latency of a single long answer.
The diffusion LLM denoises the whole sequence in 8 parallel steps instead of ~400 sequential ones. Because latency tracks the step count rather than output length, the same ~400-token answer is produced in ~170ms — inside the 200ms budget. Bidirectional context and re-masking keep coherence close to AR at this step count. For a real-time voice agent this is decisive: the long answer arrives with no noticeable pause.
By replacing ~N sequential per-token passes with a handful of parallel denoising steps, a diffusion LLM makes latency depend on step count rather than output length — the longer the answer, the bigger the win.
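The arithmetic behind the voice-agent comparison can be written as a back-of-the-envelope model. The per-pass costs below are not measurements: they are back-derived from this lesson's illustrative figures (~1900ms for ~400 AR tokens, ~170ms for 8 diffusion steps), and the model assumes each pass has a constant cost, which ignores how per-step cost grows with sequence length. All function names are hypothetical.

```python
def ar_latency_ms(num_tokens: int, ms_per_pass: float) -> float:
    """Autoregressive: one sequential forward pass per output token."""
    return num_tokens * ms_per_pass


def diffusion_latency_ms(num_steps: int, ms_per_step: float) -> float:
    """Diffusion: one parallel pass per denoising step, independent of token count here."""
    return num_steps * ms_per_step


def break_even_tokens(num_steps: int, ms_per_step: float, ms_per_pass: float) -> float:
    """Output length above which diffusion is faster under this simple model."""
    return num_steps * ms_per_step / ms_per_pass


# Assumed per-pass costs, back-derived from the lesson's illustrative numbers.
AR_MS_PER_PASS = 1900 / 400      # 4.75 ms per token pass
DIFF_MS_PER_STEP = 170 / 8       # 21.25 ms per denoising step

ar = ar_latency_ms(400, AR_MS_PER_PASS)
diff = diffusion_latency_ms(8, DIFF_MS_PER_STEP)
print(f"AR: {ar:.0f} ms, diffusion: {diff:.0f} ms, speedup: {ar / diff:.1f}x")
print(f"break-even length: {break_even_tokens(8, DIFF_MS_PER_STEP, AR_MS_PER_PASS):.0f} tokens")
```

Each diffusion step is several times more expensive than one AR pass in this model, so the advantage only appears once the answer is longer than the break-even length, and it widens as the answer grows.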