Discrete diffusion models have been demonstrated to be surprisingly strong language models. In this work, we show that discrete diffusion language models can be further improved by adapting methods from continuous-state diffusion models. We establish a core property of uniform-state diffusion: it stems from an underlying Gaussian diffusion process. This property allows us to improve both training, via a curriculum learning strategy that reduces training variance and leads to \(\mathbf{2\times}\) faster convergence, and sampling, by adapting efficient distillation methods from continuous-state diffusion models. As a result, our models surpass an autoregressive model's zero-shot perplexity on 3 out of 7 benchmarks, and we reduce the number of sampling steps by \(\textbf{two orders}\) of magnitude while preserving sample quality.
An eternal theme in mathematics is that discreteness emerges from underlying continuity. From quantum mechanics, where the quantized energy states of electrons arise as solutions to continuous wave equations, to the Fourier decomposition of the Heaviside function, which results in a trigonometric series, and to the binary logic of digital circuits, fundamentally driven by smooth analog currents, discreteness has repeatedly and naturally emerged from an underlying continuum. Our work continues this tradition by demonstrating that a discrete diffusion process is, in fact, an emergent phenomenon of an underlying continuous Gaussian diffusion process. This perspective enables the design of faster training and sampling algorithms for discrete diffusion models.
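The emergence claim can be checked numerically. The following is a minimal simulation sketch (all variable names, the vocabulary size, and the noise levels are illustrative assumptions, not taken from this work): applying \(\arg\max\) to a Gaussian-perturbed one-hot vector yields a discrete token whose conditional distribution places some mass \(\beta\) on the clean token and, by symmetry, spreads the rest uniformly over the other tokens, exactly the shape of a uniform-state forward marginal.

```python
import numpy as np

# Illustrative sketch: discretize a Gaussian-diffused one-hot vector via argmax
# and verify the induced marginal has the uniform-state form
#   q(x_t | x_0) = beta * delta_{x_t = x_0} + (1 - beta)/(K - 1) * (1 - delta).
rng = np.random.default_rng(0)
K = 8            # vocabulary size (illustrative)
x0 = 3           # clean token
alpha = 0.7      # Gaussian signal level (illustrative schedule value)
sigma = 1.0      # Gaussian noise level (illustrative)
n = 200_000      # number of Monte Carlo samples

e = np.zeros(K)
e[x0] = 1.0
z = rng.standard_normal((n, K))
xt = np.argmax(alpha * e + sigma * z, axis=1)   # discrete latent via argmax

counts = np.bincount(xt, minlength=K) / n
beta = counts[x0]                  # empirical mass on the clean token
others = np.delete(counts, x0)     # empirical mass on the K-1 other tokens
# By symmetry of the Gaussian noise, all non-x0 tokens are equally likely,
# so `others` should be (nearly) constant across its entries.
print(beta, others.std())
```

As the Gaussian noise level grows relative to the signal, \(\beta\) decays toward \(1/K\) and the discrete marginal approaches the uniform distribution, mirroring the terminal state of a uniform-state diffusion.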
[TODO]
[TODO]
[TODO] Equivalence of Marginals, ELBO relation.
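The marginal equivalence flagged above can be sketched as follows. This is an illustrative derivation in our own notation (\(\tilde{\alpha}_t, \sigma_t, \beta_t, K\) are assumptions for the sketch, not necessarily this paper's symbols), assuming the standard uniform-state forward process:

```latex
% Gaussian diffusion on the one-hot embedding of a token x, discretized by argmax:
\[
  \tilde{w}_t = \tilde{\alpha}_t\, e_{x} + \sigma_t\, \varepsilon,
  \qquad \varepsilon \sim \mathcal{N}(0, I_K),
  \qquad x_t = \arg\max_{k} \,[\tilde{w}_t]_k .
\]
% By symmetry of the noise across the K-1 incorrect coordinates,
\[
  q(x_t \mid x)
  = \beta_t\, \delta_{x_t = x}
  + \frac{1-\beta_t}{K-1}\,\bigl(1-\delta_{x_t = x}\bigr),
  \qquad
  \beta_t = \int \varphi(z)\,
  \Phi\!\bigl(z + \tilde{\alpha}_t/\sigma_t\bigr)^{K-1}\, dz ,
\]
% which matches the uniform-state forward marginal
\[
  q(x_t \mid x) = \alpha_t\, \delta_{x_t = x} + \frac{1-\alpha_t}{K},
  \qquad \text{under} \quad \alpha_t = \frac{K\beta_t - 1}{K-1}.
\]
```

In words: the argmax of a Gaussian-diffused one-hot vector is distributed exactly as a uniform-state discrete diffusion latent, with the discrete noise schedule \(\alpha_t\) a deterministic function of the Gaussian signal-to-noise ratio \(\tilde{\alpha}_t/\sigma_t\).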
TODO
TODO
TODO
TODO