Scaling Beyond Masked Diffusion Language Models

1NVIDIA     2Cornell Tech     3EPFL     4Cornell University
Joint Second Authors
Pre-print
Model checkpoints will be released on March 1st, 2026.

We perform a scaling-law analysis of the state-of-the-art masked diffusion language model (MDLM), a uniform-state discrete diffusion model (Duo), an AR-diffusion interpolating model (Eso-LM), and autoregressive (AR) models.
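
As background on the methodology, the sketch below shows how an IsoFLOP analysis is typically carried out: at a fixed training-FLOPs budget, validation loss is fit as a parabola in log model size, and the compute-optimal size is read off at the minimum. The data points are hypothetical placeholders, not results from this work.

# Minimal IsoFLOP-style fit: at one fixed training-FLOPs budget, fit a
# parabola of validation loss vs. log(parameters) and take the vertex as
# the compute-optimal model size. All numbers below are hypothetical.
import numpy as np

# (model size in parameters, validation loss) at a single FLOPs budget
params = np.array([85e6, 170e6, 340e6, 680e6, 1.3e9])
losses = np.array([3.42, 3.28, 3.21, 3.24, 3.35])

log_n = np.log10(params)
a, b, c = np.polyfit(log_n, losses, deg=2)   # loss ≈ a*logN^2 + b*logN + c
opt_log_n = -b / (2 * a)                     # vertex of the parabola
print(f"compute-optimal size ≈ {10 ** opt_log_n / 1e6:.0f}M parameters")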


Key Contributions

  1. We present the first systematic IsoFLOP scaling study for a state-of-the-art uniform-state diffusion model (Duo) and an interpolating diffusion model (Eso-LM).
  2. We improve the scaling laws of MDLM by training with a simple cross-entropy loss, which makes these models 12% more compute efficient (see the sketch after this list).
  3. We show that perplexity can be a misleading metric when comparing diffusion families: models with worse likelihood scaling may still be preferable because they sample faster, as reflected by the speed-quality Pareto frontier (see Figure 1).
  4. We scale all methods to 1.7B parameters and show that Duo beats MDLM on GSM8K (see Table 1), despite its worse validation perplexity.
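
To make the second contribution concrete, below is a minimal PyTorch sketch contrasting a schedule-weighted masked-diffusion objective with a plain cross-entropy over masked tokens. The 1/t weighting, the linear masking schedule, and the model() interface are illustrative assumptions, not the paper's implementation.

# Hedged sketch of masked-diffusion training. Standard practice weights the
# per-token cross-entropy on masked positions by a schedule-dependent factor
# (here 1/t); the "simple cross-entropy" variant drops that weighting.
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, x, mask_id, simple_ce=False):
    B, L = x.shape
    t = torch.rand(B, 1, device=x.device).clamp_min(1e-3)  # diffusion time per sequence
    mask = torch.rand(B, L, device=x.device) < t            # mask each token w.p. t
    x_noisy = torch.where(mask, torch.full_like(x, mask_id), x)

    logits = model(x_noisy)                                  # (B, L, vocab), hypothetical interface
    ce = F.cross_entropy(logits.transpose(1, 2), x, reduction="none")  # (B, L)
    ce = ce * mask                                           # only masked positions contribute

    if simple_ce:
        # plain average cross-entropy over masked tokens
        return ce.sum() / mask.sum().clamp_min(1)
    # ELBO-style weighting: upweight samples with few masked tokens (small t)
    return (ce.sum(dim=1) / t.squeeze(1)).mean() / L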

Experiments

Speed-Quality Pareto Frontier

Figure 1: We report the highest throughput achieved by compute-optimal models across a range of training FLOPs budgets. AR produces the highest-quality samples but is slow. Sample diversity (measured by entropy) remains broadly similar across algorithms, with Duo exhibiting slightly reduced diversity. Duo dominates in the throughput ranges [200, 400] ∪ [600, ∞], while Eso-LM dominates in the range [400, 600].
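
As an illustration of how such a frontier is extracted, the sketch below keeps only the models that are not dominated in both throughput and sample quality. The measurements are hypothetical placeholders, not the numbers behind Figure 1.

# Sketch: compute a speed-quality Pareto frontier from (throughput, quality)
# measurements, where higher is better on both axes. Entries are hypothetical.
def pareto_frontier(points):
    """Return the points not dominated by any faster, higher-quality point."""
    frontier = []
    for name, throughput, quality in sorted(points, key=lambda p: -p[1]):
        # sorted by descending throughput, so a point is on the frontier iff
        # its quality exceeds that of every faster point seen so far
        if not frontier or quality > frontier[-1][2]:
            frontier.append((name, throughput, quality))
    return frontier

measurements = [
    ("AR",     120.0, 0.92),
    ("MDLM",   450.0, 0.71),
    ("Duo",    800.0, 0.78),
    ("Eso-LM", 500.0, 0.80),
]
for name, tps, q in pareto_frontier(measurements):
    print(f"{name}: {tps:.0f} tok/s, quality {q:.2f}")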

GSM8K @ 1.7B scale

Table 1: SMDM (Nie et al., 2024) is trained on SlimPajama, while LLaDa (Nie et al., 2025) is trained on a proprietary dataset. Our models are trained on the Nemotron-Pre-Training-Dataset. Duo beats MDLM on GSM8K at the 1.7B scale, despite its worse validation perplexity.
Models       SMDM   LLaDa   AR (Ours)   MDLM (Ours)   Eso-LM (Ours)   Duo (Ours)
Params       1B     8B      1.7B        1.7B          1.7B            1.7B
GSM8K (↑)    58.5   70.7    62.9        58.8          33.4            65.8

BibTeX

@misc{sahoo2026scalingmaskeddiffusionlanguage,
      title={Scaling Beyond Masked Diffusion Language Models}, 
      author={Subham Sekhar Sahoo and Jean-Marie Lemercier and Zhihan Yang and Justin Deschenaux and Jingyu Liu and John Thickstun and Ante Jukic},
      year={2026},
      eprint={2602.15014},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.15014}, 
}