d1: Scaling Reasoning in Diffusion Large Language Models
Learn what's great about diffusion LLMs and how they differ from standard autoregressive models
TL;DR — Why this paper matters
Large language models that reason well are usually trained and fine‑tuned in an autoregressive (left‑to‑right) way. d1 shows, for the first time, that the same reinforcement‑learning tricks that lifted autoregressive models to GPT‑4‑class reasoning can also lift diffusion language models, which generate text in a coarse‑to‑fine, non‑sequential fashion. The authors introduce two key ingredients, masked supervised fine‑tuning (SFT) and a new RL algorithm called diffu‑GRPO, and demonstrate substantial gains on math and logic benchmarks without changing the base model size.
Reasoning beyond left‑to‑right text
Most large language models we use today - GPT, Gemini, Claude - generate text one token after another in a single left‑to‑right pass. That autoregressive habit feels intuitive to us humans, but it is not the only way to produce language. A fast‑growing line of research borrows ideas from image diffusion models: start with a noisy, fully masked sentence and denoise it over several sweeps, filling in blanks until a coherent text emerges. The resulting diffusion LLMs (dLLMs), such as LLaDA and Dream, can look at future context while writing the present token, often need fewer decoding steps than there are tokens, and open the door to generating many tokens in parallel.
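To make the coarse‑to‑fine idea concrete, here is a toy sketch of iterative unmasking in Python. Everything in it is illustrative: the score_tokens stub returns random guesses where a real dLLM such as LLaDA would score every masked position conditioned on the whole partially filled sequence, and the commit‑the‑most‑confident‑tokens schedule is just one common decoding heuristic, not the exact procedure of any particular model.

```python
import random

MASK = "<mask>"
VOCAB = ["the", "answer", "is", "42", "so", "we", "return", "it", "."]

def score_tokens(sequence):
    """Stand-in for a real diffusion LLM: propose a (token, confidence)
    guess for every masked position. A real model would condition on all
    unmasked tokens, both left and right of each blank."""
    return {
        i: (random.choice(VOCAB), random.random())
        for i, tok in enumerate(sequence)
        if tok == MASK
    }

def diffusion_decode(length=10, sweeps=4):
    """Start fully masked and denoise in a few sweeps, committing the most
    confident guesses each time, until no masks remain."""
    seq = [MASK] * length
    per_sweep = max(1, length // sweeps)
    while MASK in seq:
        guesses = score_tokens(seq)
        most_confident = sorted(guesses.items(),
                                key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _) in most_confident[:per_sweep]:
            seq[pos] = tok
        print(" ".join(seq))  # watch the text sharpen sweep by sweep
    return seq

if __name__ == "__main__":
    diffusion_decode()
```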
Yet all the spectacular reasoning progress you might have seen in DeepSeek‑R1 or Kimi K1.5 was achieved with reinforcement learning (RL) algorithms - PPO, GRPO, and other policy‑gradient variants - tailored to left‑to‑right models. Those methods rely on computing token‑by‑token log‑probabilities, and in dLLMs the log‑probability of a sequence does not factorize that way, so you cannot simply drop PPO into a diffusion model and hope it works.
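To see the mismatch concretely: an autoregressive model factorizes the sequence log‑probability exactly, which is what PPO/GRPO importance ratios are built on, while a masked diffusion model is trained against a sampled lower bound. The notation below is schematic (mine, not the paper's exact formulation).

```latex
% Autoregressive LM: exact token-by-token factorization.
\log \pi_\theta(y \mid x) \;=\; \sum_{t=1}^{T} \log \pi_\theta\!\bigl(y_t \mid x,\, y_{<t}\bigr)

% Masked diffusion LM (schematic): only a variational lower bound,
% estimated by sampling a masking level t and a partially masked \tilde{y}.
\log \pi_\theta(y \mid x) \;\ge\;
  \mathbb{E}_{\,t,\ \tilde{y} \sim q_t(\cdot \mid y)}
  \Bigl[\tfrac{1}{t} \sum_{i:\, \tilde{y}_i = \texttt{[MASK]}}
        \log \pi_\theta\!\bigl(y_i \mid x,\, \tilde{y}\bigr)\Bigr]
```

Getting useful per‑token log‑probabilities out of the second expression cheaply is exactly the problem diffu‑GRPO has to solve.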
That gap is exactly what d1 tries to close. The authors show that with just an 8‑billion‑parameter backbone and careful post‑training, a diffusion model can rival or beat comparable autoregressive models on GSM8K math, MATH500 competition problems, mini‑Sudoku, and the classic Countdown numbers game.
The d1 recipe at a glance
The proposed pipeline has only two stages:

1. Masked SFT: fine‑tune the base diffusion model on high‑quality, long chain‑of‑thought reasoning traces, using the same masked‑token objective it was pre‑trained with.
2. diffu‑GRPO: run reinforcement learning with verifiable rewards (correct final answer, well‑formed output), using a new GRPO variant whose policy‑gradient update works with the diffusion model's estimated per‑token log‑probabilities.
Why not merge the stages? In practice, SFT gives the model a basic “chain‑of‑thought grammar” - self‑checks, backtracking, tidy XML tags - making the subsequent RL much more stable. The recipe is therefore called d1: “diffusion + one‑two punch of SFT then RL.”
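For orientation, here is a minimal sketch of the GRPO‑style objective the second stage builds on, assuming the per‑token log‑probabilities are already available as tensors. The function name and toy numbers are mine, and the diffusion‑specific part of diffu‑GRPO, i.e. how those log‑probabilities are estimated efficiently for a dLLM, is deliberately left out.

```python
import torch

def grpo_style_loss(logp_new, logp_old, rewards, eps=0.2):
    """GRPO-style clipped policy-gradient loss over a group of G completions
    sampled for the same prompt.

    logp_new: (G, T) per-token log-probs under the current policy
    logp_old: (G, T) per-token log-probs under the policy that sampled them
    rewards:  (G,)   scalar reward per completion, e.g. 1.0 if the final
                     answer is correct and well formatted, else 0.0
    """
    # Group-relative advantage: normalize each reward against its siblings,
    # which replaces the learned value function of PPO.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # (G,)
    adv = adv.unsqueeze(1)                                     # (G, 1)

    # Clipped importance-weighted objective, averaged over all tokens.
    ratio = torch.exp(logp_new - logp_old)                     # (G, T)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.min(unclipped, clipped).mean()

# Toy usage: random numbers stand in for (estimated) model log-probs.
G, T = 4, 8
logp_old = -torch.rand(G, T)                       # fake "old" log-probs
logp_new = (logp_old + 0.05 * torch.randn(G, T)).requires_grad_()
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])       # verifiable rewards

loss = grpo_style_loss(logp_new, logp_old, rewards)
loss.backward()                                    # gradients flow to logp_new
print(float(loss))
```

The design point to notice is the group‑relative advantage: each completion is scored against its siblings sampled from the same prompt, which removes the need for a learned value function.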