Scaling Beyond Masked Diffusion Language Models

¹NVIDIA ²Cornell Tech ³EPFL ⁴Cornell University
^†Joint Second Authors
Pre-print
Model checkpoints will be released on March 1, 2026.

We show that masked diffusion is not the future paradigm for diffusion LLMs!

Key Contributions

We present the first systematic IsoFLOP scaling study for a state-of-the-art Uniform-state diffusion model (Duo) and an interpolating diffusion model (Eso-LM).

We improve the scaling laws of MDLM by training with a simple cross-entropy loss, which makes these models 12% more compute-efficient.

We show that perplexity can be a misleading metric when comparing diffusion families: models with worse likelihood scaling may still be preferable because they sample faster, as reflected by the speed-quality Pareto frontier (see Figure 1).

We scale all methods to 1.7B parameters and show that Duo beats MDLM on GSM8K (see Table 1), despite worse validation perplexity.
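To make the IsoFLOP idea in the contributions above concrete, here is a minimal sketch of one common way such an analysis is done: at a fixed training-FLOPs budget, fit a parabola to validation loss as a function of log parameter count and read off its minimum as the compute-optimal model size. This is not necessarily the exact procedure used in the paper. The function names (`fit_parabola`, `isoflop_optimum`) are hypothetical, and the data at the bottom is synthetic, generated from a known parabola purely for illustration.

```python
import math


def fit_parabola(xs, ys):
    """Least-squares fit of y ~ a*x^2 + b*x + c. Returns (a, b, c)."""
    # Power sums for the normal equations of a quadratic fit.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x3 system for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def isoflop_optimum(param_counts, losses):
    """Estimate the compute-optimal parameter count on one IsoFLOP curve.

    param_counts: model sizes trained at the SAME FLOPs budget.
    losses: validation loss of each model.
    Returns (optimal_param_count, predicted_loss_at_optimum).
    """
    xs = [math.log10(n) for n in param_counts]
    center = sum(xs) / len(xs)  # centering improves numerical conditioning
    a, b, c = fit_parabola([x - center for x in xs], losses)
    if a <= 0:
        raise ValueError("fitted curve has no interior minimum")
    x_star = -b / (2 * a)  # vertex of the parabola (in centered log10 units)
    return 10 ** (x_star + center), c - b * b / (4 * a)


# Synthetic example (NOT data from the paper): losses drawn from
# loss = 3.0 + 0.5 * (log10(N) - 8)^2, so the minimum sits at N = 1e8.
example_params = [1e7, 3e7, 1e8, 3e8, 1e9]
example_losses = [3.0 + 0.5 * (math.log10(n) - 8.0) ** 2 for n in example_params]
n_opt, loss_opt = isoflop_optimum(example_params, example_losses)
```

Repeating this fit for several FLOPs budgets gives one (budget, optimal size, optimal loss) triple per curve, which is the raw material for fitting a scaling law.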

Experiments

Speed-Quality Pareto Frontier

Figure 1: We report the highest throughput achieved by compute-optimal models across a range of training FLOPs budgets. AR produces the highest-quality samples but is slow. Sample diversity (measured by entropy) remains broadly similar across algorithms, with Duo exhibiting slightly reduced diversity. Duo dominates in the throughput ranges [200, 400] ∪ [600, ∞), while Eso-LM dominates in the range [400, 600].
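As a rough illustration of how a speed-quality Pareto frontier like the one in Figure 1 can be extracted from measurements, the sketch below keeps only the configurations that no other configuration beats on both axes. It assumes higher throughput is better and that quality is reported as a cost where lower is better (for example, a generative perplexity); negate the score if your quality metric is higher-is-better. The function name `pareto_frontier` and the sample points are hypothetical and do not come from the paper.

```python
def pareto_frontier(points):
    """Return the non-dominated points, sorted by throughput.

    points: list of (name, throughput, quality_cost) tuples, where higher
    throughput is better and lower quality_cost is better.
    """
    frontier = []
    for name, thr, cost in points:
        # A point is dominated if another point is at least as good on both
        # axes and strictly better on at least one of them.
        dominated = any(
            (t2 >= thr and c2 <= cost) and (t2 > thr or c2 < cost)
            for _, t2, c2 in points
        )
        if not dominated:
            frontier.append((name, thr, cost))
    return sorted(frontier, key=lambda p: p[1])


# Hypothetical measurements (NOT results from the paper).
example_points = [
    ("config_a", 100.0, 20.0),  # slow but best quality
    ("config_b", 300.0, 30.0),
    ("config_c", 250.0, 35.0),  # slower AND worse than config_b
    ("config_d", 500.0, 40.0),  # fastest, lowest quality
]
example_frontier = pareto_frontier(example_points)
```

A method "dominates" a throughput range when its points are the ones that survive on the frontier within that range.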

GSM8K @ 1.7B scale

**Table 1:** SMDM (Nie et al., 2024) is trained on SlimPajama, while LLaDa (Nie et al., 2025) is trained on a proprietary dataset. Our models are trained on the Nemotron-Pre-Training-Dataset. **Duo beats MDLM on GSM8K at the 1.7B scale**, despite worse validation perplexity.

| Models | SMDM | LLaDa | AR (Ours) | MDLM (Ours) | Eso-LM (Ours) | Duo (Ours) |
|---|---|---|---|---|---|---|
| Params | 1B | 8B | 1.7B | 1.7B | 1.7B | 1.7B |
| GSM8K (↑) | 58.5 | 70.7 | 62.9 | 58.8 | 33.4 | 65.8 |

BibTeX

@misc{sahoo2026scalingmaskeddiffusionlanguage,
      title={Scaling Beyond Masked Diffusion Language Models}, 
      author={Subham Sekhar Sahoo and Jean-Marie Lemercier and Zhihan Yang and Justin Deschenaux and Jingyu Liu and John Thickstun and Ante Jukic},
      year={2026},
      eprint={2602.15014},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.15014}, 
}