The Diffusion Duality

Cornell Tech, NY.
[The website will be ready by March 15, 2025.]

An illustration of uniform-state discrete diffusion (top) and the underlying Gaussian diffusion (bottom). Although the two are separate Markov processes, applying \(\texttt{arg max}\) maps Gaussian latents \(\mathbf{w}_t \in \mathbb{R}^n\) to discrete latents \(\mathbf{z}_t \in \mathcal{V}\). This transforms the marginals from \(\tilde{q}_t(\cdot \mid \mathbf{x}; \tilde{\alpha}_t)\) to \(q_t(\cdot \mid \mathbf{x}; \mathcal{T}(\tilde{\alpha}_t))\) and maps the diffusion parameter \(\tilde{\alpha}_t\) to \(\alpha_t = \mathcal{T}(\tilde{\alpha}_t)\).
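The \(\texttt{arg max}\) map in the figure can be checked numerically with a small Monte Carlo sketch. It rests on assumptions that this page does not state: the Gaussian marginal is taken to be \(\mathcal{N}(\tilde{\alpha}_t \mathbf{x}, (1 - \tilde{\alpha}_t^2)\mathbf{I})\) for a one-hot \(\mathbf{x}\), and the uniform-state marginal is taken to put mass \(\alpha_t + (1 - \alpha_t)/n\) on the clean token and \((1 - \alpha_t)/n\) on every other token. The function names `argmax_marginal` and `implied_alpha` are hypothetical, and the second one only estimates \(\mathcal{T}(\tilde{\alpha}_t)\) empirically rather than evaluating a closed form.

```python
import numpy as np


def argmax_marginal(alpha_tilde, n=8, target=0, num_samples=200_000, seed=0):
    """Empirical distribution of z_t = argmax(w_t) for a one-hot x.

    Assumption (not taken from the page): w_t ~ N(alpha_tilde * x, (1 - alpha_tilde^2) I).
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    x[target] = 1.0
    sigma = np.sqrt(1.0 - alpha_tilde**2)
    # Gaussian latents: one row per sample, one column per vocabulary entry.
    w = alpha_tilde * x + sigma * rng.standard_normal((num_samples, n))
    # Discrete latents obtained by taking the arg max over the vocabulary axis.
    z = np.argmax(w, axis=1)
    return np.bincount(z, minlength=n) / num_samples


def implied_alpha(probs, target=0):
    """Solve p_target = alpha + (1 - alpha) / n for alpha, using the
    average off-target mass as the estimate of (1 - alpha) / n."""
    n = probs.shape[0]
    p_other = (1.0 - probs[target]) / (n - 1)
    return probs[target] - p_other


if __name__ == "__main__":
    for alpha_tilde in (0.0, 0.5, 0.9, 1.0):
        probs = argmax_marginal(alpha_tilde)
        print(alpha_tilde, np.round(probs, 3), round(implied_alpha(probs), 3))
```

By symmetry, every non-target token receives the same mass up to sampling noise, so the arg-max marginal has the uniform-state form, and the implied discrete parameter grows with \(\tilde{\alpha}_t\) from 0 (pure noise) to 1 (clean data).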

Abstract

Discrete diffusion models have been shown to be surprisingly strong language models. In this work, we show that discrete diffusion language models can be further improved by adapting methods from continuous-state diffusion models. We establish a core property of uniform-state diffusion: it stems from an underlying Gaussian diffusion process. This property lets us improve training, through a curriculum learning strategy that reduces training variance and leads to \(\mathbf{2\times}\) faster convergence, and sampling, by adapting efficient distillation methods from continuous-state diffusion models. As a result, our models surpass an autoregressive model's zero-shot perplexity on 3 out of 7 benchmarks, and we reduce the number of sampling steps by \(\textbf{two orders}\) of magnitude while preserving sample quality.

Introduction

An eternal theme in mathematics is that discreteness emerges from underlying continuity. From quantum mechanics, where the quantized energy states of electrons arise as solutions to continuous wave equations, to the Fourier decomposition of the Heaviside function, which results in a trigonometric series, and to the binary logic of digital circuits, fundamentally driven by smooth analog currents, discreteness has repeatedly and naturally emerged from an underlying continuum. Our work continues this tradition by demonstrating that a discrete diffusion process is, in fact, an emergent phenomenon of an underlying continuous Gaussian diffusion process. This perspective enables the design of faster training and sampling algorithms for discrete diffusion models.

Gaussian Diffusion

[TODO]

Discrete Diffusion

[TODO]

The Diffusion Duality

[TODO] Equivalence of Marginals, ELBO relation.

Marginals

TODO

ELBO

TODO

Experiments

Curriculum Learning

TODO

Distillation

TODO