The Diffusion Duality, Chapter II:
\( \Psi \)-Samplers and Efficient Curriculum

1EPFL, Lausanne     2Cornell Tech, NY
ICLR 2026
Psi-Samplers schematic: Predictor step followed by Corrector step for discrete diffusion
Duo++ generation animation showing parallel token refinement
MDLM generation animation showing masked token unmasking
GPT-2 autoregressive generation animation

Key Contributions

  1. \(\Psi\)-Samplers: A family of Predictor-Corrector samplers for discrete diffusion that generalize prior methods and apply to any noise process. With these, we outperform MDLM on both text and image generation.
  2. Inference-time scaling: Unlike ancestral sampling which plateaus, \(\Psi\)-samplers continue to improve with more sampling steps.
  3. Efficient Curriculum: 33% less memory and 25% faster training by exploiting softmax sparsity: only \(k\) embeddings needed (as few as 2).

Concurrently, Sahoo et al. (2026) show that Duo surpasses autoregressive models at the 1.7B scale on math reasoning (GSM8K).

\(\Psi\)-Samplers

Standard discrete diffusion uses ancestral sampling: at each step, tokens are updated using the reverse posterior.

  • For Masked diffusion, once a token is unmasked it can never be corrected.
  • For Uniform-state diffusion, quality plateaus in the high NFE regime.

We introduce \(\Psi\)-posteriors, superpositions of a Predictor (the reverse posterior) and a Corrector (the forward process), which preserve the diffusion marginals while enabling error correction.
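To make the predictor-corrector alternation concrete, here is a minimal toy sketch for uniform-state discrete diffusion. Everything in it is an illustrative stand-in rather than the paper's \(\Psi\)-posterior: `toy_denoiser` replaces a trained network, and the keep probability \((t - \Delta t)/t\) and corrector `strength` are hypothetical choices. The point is the structure: a predictor step toward the data, followed by a forward-process corrector that re-noises a few tokens so earlier errors can be revisited.

```python
import numpy as np

rng = np.random.default_rng(0)
V, L = 8, 16          # toy vocabulary size and sequence length

def toy_denoiser(x_t, t):
    """Stand-in for a trained denoiser: per-token logits over the
    vocabulary. A real model would condition on x_t and t."""
    return rng.normal(size=(len(x_t), V))

def predictor_step(x_t, t, dt):
    """Predictor: sample a clean-token estimate from the model, and
    keep each token noisy with probability ~ remaining noise level."""
    logits = toy_denoiser(x_t, t)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    x0_hat = np.array([rng.choice(V, p=p) for p in probs])
    keep_noisy = rng.random(len(x_t)) < (t - dt) / t
    return np.where(keep_noisy, x_t, x0_hat)

def corrector_step(x, t, strength=0.1):
    """Corrector: apply the forward (uniform) process to a small
    fraction of tokens so the predictor can revisit earlier errors."""
    renoise = rng.random(len(x)) < strength * t
    return np.where(renoise, rng.integers(0, V, size=len(x)), x)

# Alternate predictor and corrector from pure noise (t = 1) down to t = 0.
steps = 8
x = rng.integers(0, V, size=L)
for i in range(steps):
    t, dt = 1.0 - i / steps, 1.0 / steps
    x = predictor_step(x, t, dt)
    x = corrector_step(x, t - dt)
```

In pure ancestral sampling the corrector step is absent, which is why a token finalized early can never be revisited; the corrector is what lets extra sampling steps keep paying off.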

As shown in Fig. 1, \(\Psi\)-samplers consistently improve with more sampling steps (unlike ancestral sampling which plateaus), and Duo++ outperforms MDLM on both text (OpenWebText) and image (CIFAR-10) generation.

Efficient Curriculum

Efficient curriculum diagram
(Top) Duo uses linear combinations of all \(K\) embeddings. (Bottom) Duo++ exploits softmax sparsity, simulating only the top-\(k\) entries.

Duo++ exploits the sparsity of the low-temperature softmax used by Duo. By simulating only the top-\(k\) entries using order statistics (\(k\) as small as 2), we reduce peak memory by 33% (94 GiB → 63 GiB) and end-to-end training time by 25%, while matching the perplexity and downstream accuracy of Duo.
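The sparsity being exploited can be sketched numerically. This toy example (not the paper's implementation: the paper samples the top-\(k\) entries via order statistics, whereas here a couple of dominant logits are simply hand-set) shows that at low temperature, mixing only the top-\(k\) embeddings reproduces the full \(K\)-way softmax combination almost exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, tau, k = 1000, 64, 0.05, 2   # vocab size, embed dim, temperature, top-k

E = rng.normal(size=(K, d))        # toy token-embedding table
logits = rng.normal(scale=0.1, size=K)
logits[42], logits[7] = 3.0, 2.8   # a few entries dominate, as under light noise

# Dense path (Duo-style): low-temperature softmax over all K entries,
# then a convex combination of all K embeddings.
z = logits / tau
p = np.exp(z - z.max())
p /= p.sum()
x_dense = p @ E

# Sparse path (Duo++-style): at low temperature nearly all softmax mass
# sits on the top-k logits, so mix only those k embeddings.
top = np.argpartition(logits, -k)[-k:]
zk = logits[top] / tau
pk = np.exp(zk - zk.max())
pk /= pk.sum()
x_sparse = pk @ E[top]

err = np.linalg.norm(x_dense - x_sparse)   # negligible at this temperature
```

Since only \(k\) embedding rows are ever gathered and mixed, the memory and compute of the curriculum scale with \(k\) rather than the vocabulary size \(K\), which is the source of the reported savings.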

BibTeX

@inproceedings{
  deschenaux2026the,
  title={The Diffusion Duality, Chapter {II}: $\Psi$-Samplers and Efficient Curriculum},
  author={Justin Deschenaux and Caglar Gulcehre and Subham Sekhar Sahoo},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=RSIoYWIzaP}
}