First paper to propose KV-caching for diffusion language models while retaining parallel generation.
Masked diffusion models (MDMs), such as MDLM, are a compelling alternative to autoregressive (AR) models. However, they suffer from two key limitations:
The recently proposed BD3-LMs address the speed issue by introducing a semi-autoregressive generation strategy: diffusion is performed over fixed-length blocks of text, one block at a time. Because previously denoised blocks can be cached, BD3-LMs partially support KV caching and are faster than standard MDMs. However, we identify two key shortcomings in BD3-LMs:
To address these challenges, we propose a new language modeling paradigm that fuses autoregressive and masked diffusion approaches. Our model is trained with a hybrid loss—a combination of AR and MDM objectives—which allows it to interpolate smoothly between the two paradigms in terms of perplexity and sample quality. This requires two key innovations:
In Eso-LMs, some tokens are generated in parallel by the MDM component and the rest sequentially, left to right, by the AR component. Here we introduce the variant Eso-LM (B); refer to our paper for the other variant, Eso-LM (A).
Let \( \mathcal{V} \) be the set of one-hot vectors corresponding to the tokens, so that \( |\mathcal{V}| \) is the vocabulary size, and let \( L \) denote the sequence length. Let \( \mathbf{x} \in \mathcal{V}^L \), \( \mathbf{x} \sim q_{\text{data}}(\mathbf{x}) \), be a sample from the data distribution, and let \( p_\theta \) be our model distribution parameterized by \( \theta \). Eso-LMs decompose \( p_\theta \) into two components: an AR model \( p_\theta^{\text{AR}} \) and an MDM \( p_\theta^{\text{MDM}} \). The MDM generates a partially masked sequence \( \mathbf{z}_0 \in \mathcal{V}^L \), \( \mathbf{z}_0 \sim p_\theta^{\text{MDM}}(\mathbf{z}_0) \), and the AR model performs the remaining unmasking steps autoregressively, left to right: \( p_\theta^{\text{AR}}(\mathbf{x} \mid \mathbf{z}_0) \).
The marginal likelihood of such a hybrid generative process is:
\( p_\theta(\mathbf{x}) = \sum_{\mathbf{z}_0 \in \mathcal{V}^L} p_\theta^{\text{AR}}(\mathbf{x} \mid \mathbf{z}_0) \, p_\theta^{\text{MDM}}(\mathbf{z}_0) \).
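Conceptually, this decomposition corresponds to a two-phase sampler: the MDM first unmasks a subset of positions in parallel, and the AR model then fills in the remaining masks left to right. The sketch below is only a toy illustration of that control flow, not the authors' implementation: `toy_denoiser`, `MASK_ID`, and `VOCAB_SIZE` are placeholders, greedy decoding stands in for proper sampling, and the diffusion phase here omits the ordering \( \sigma \) and the reverse-diffusion schedule used in the paper.

```python
# Toy sketch of the two-phase (MDM then AR) generation implied by the decomposition.
import torch

MASK_ID = 0          # hypothetical id reserved for the mask token
VOCAB_SIZE = 16      # toy vocabulary size
L = 8                # toy sequence length


def toy_denoiser(z: torch.Tensor) -> torch.Tensor:
    """Stand-in for the shared denoising transformer: returns logits of shape (L, V)."""
    return torch.randn(z.shape[0], VOCAB_SIZE)


def generate(alpha_0: float = 0.5, num_mdm_steps: int = 4) -> torch.Tensor:
    z = torch.full((L,), MASK_ID)                  # start from an all-mask sequence
    # MDM phase: positions left masked for the AR phase are chosen with prob. 1 - alpha_0.
    for_ar = torch.rand(L) < (1.0 - alpha_0)
    to_diffuse = (~for_ar).nonzero().squeeze(-1)
    if to_diffuse.numel() > 0:
        # Unmask these positions in a few parallel steps (the paper uses a random
        # ordering sigma and a proper schedule; omitted here for brevity).
        for step_positions in to_diffuse.chunk(num_mdm_steps):
            logits = toy_denoiser(z)
            z[step_positions] = logits[step_positions].argmax(-1)   # greedy, for brevity
    # AR phase: fill the remaining masks strictly left to right.
    for pos in for_ar.nonzero().squeeze(-1):
        logits = toy_denoiser(z)
        z[pos] = logits[pos].argmax(-1)
    return z


print(generate())
```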
Although this sum is intractable, we can compute a variational bound on the true likelihood using a posterior \( q_0(\mathbf{z}_0 \mid \mathbf{x}) \). Since \( p_\theta^{\text{MDM}} \) models masked sequences, we choose \( q_0 \) to be a simple masking distribution: it independently masks each token \( (\mathbf{x}^\ell)_{\ell \in [L]} \) with probability \( 1 - \alpha_0 \), where \( \alpha_0 \in [0, 1] \). This leads to the following variational bound:
\[ -\log p_\theta(\mathbf{x}) \leq -\mathbb{E}_{\mathbf{z}_0 \sim q_0(\cdot \mid \mathbf{x})} \left[ \log p_\theta^{\text{AR}}(\mathbf{x} \mid \mathbf{z}_0) \right] + D_{\text{KL}}\left(q_0(\mathbf{z}_0 \mid \mathbf{x}) \,\|\, p_\theta^{\text{MDM}}(\mathbf{z}_0)\right) \]
\[ = -\mathbb{E}_{\mathbf{z}_0 \sim q_0(\cdot \mid \mathbf{x})} \left[ \sum_{\ell \in \mathcal{M}(\mathbf{z}_0)} \log p_\theta^{\text{AR}}\left(\mathbf{x}^\ell \mid \mathbf{z}_0, \mathbf{x}^{<\ell}\right) \right] + D_{\text{KL}}\left(q_0(\mathbf{z}_0 \mid \mathbf{x}) \,\|\, p_\theta^{\text{MDM}}(\mathbf{z}_0)\right). \]
Here \( \mathcal{M}(\mathbf{z}) \) denotes the set of masked positions in \( \mathbf{z} \).
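For concreteness, here is a minimal sketch of the masking posterior described above: each token of \( \mathbf{x} \) is independently replaced by the mask token with probability \( 1 - \alpha_0 \). The token ids and `MASK_ID` below are illustrative, not taken from the released code.

```python
# Minimal sketch of q_0: independent masking with probability 1 - alpha_0.
import torch

MASK_ID = 0   # illustrative id reserved for the mask token


def sample_q0(x: torch.Tensor, alpha_0: float) -> torch.Tensor:
    """Corrupt x ~ q_data into a partially masked z_0 ~ q_0(. | x)."""
    masked = torch.rand_like(x, dtype=torch.float) < (1.0 - alpha_0)
    return torch.where(masked, torch.full_like(x, MASK_ID), x)


x = torch.randint(1, 100, (8,))        # toy sequence of token ids (0 is reserved)
print(sample_q0(x, alpha_0=0.75))      # on average ~25% of positions become MASK_ID
```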
Given a denoising model \( \mathbf{x}_\theta : \mathcal{V}^L \to (\Delta^K)^L \), where \( \Delta^K \) denotes the probability simplex with \( K \) categories, that parameterizes both \( p_\theta^{\text{AR}} \) and \( p_\theta^{\text{MDM}} \), we show that the Negative Evidence Lower Bound (NELBO) factors into a sum of AR and MDM losses over masked positions:
\[ \mathcal{L}_{\text{NELBO}}(\mathbf{x}) = \mathbb{E}_{\mathbf{z}_0 \sim q_0} \underbrace{\left[ - \sum_{\ell \in \mathcal{M}(\mathbf{z}_0)} \log \left\langle \mathbf{x}_\theta^\ell(\mathbf{z}_0 \odot \mathbf{x}^{<\ell}), \mathbf{x}^\ell \right\rangle \right]}_{\text{AR loss}} + \int_{t=0}^{t=1} \frac{\alpha_t'}{1 - \alpha_t} \underbrace{ \mathbb{E}_{\mathbf{z}_t \sim q_t} \left[ \sum_{\ell \in \mathcal{M}(\mathbf{z}_t)} \log \left\langle \mathbf{x}_\theta^\ell(\mathbf{z}_t), \mathbf{x}^\ell \right\rangle \right] }_{\text{MDM loss}} dt \]
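As a rough illustration of how the two terms above could be estimated for a single sequence, the sketch below computes a cross-entropy over the positions masked in \( \mathbf{z}_0 \) (AR term) and a schedule-weighted cross-entropy over the positions masked in \( \mathbf{z}_t \) (MDM term), assuming a linear schedule \( \alpha_t = 1 - t \) so that \( -\alpha_t' / (1 - \alpha_t) = 1/t \). It glosses over the per-position conditioning on \( \mathbf{x}^{<\ell} \), which in the paper is realized through the attention patterns described next; `denoiser_ar` and `denoiser_mdm` are hypothetical stand-ins for the shared denoising transformer under the two attention patterns, not names from the released code.

```python
# Schematic single-sample estimate of the AR and MDM loss terms in the NELBO.
import torch
import torch.nn.functional as F

MASK_ID = 0


def nelbo_estimate(x, z0, zt, t, denoiser_ar, denoiser_mdm):
    """Sum of the AR and (schedule-weighted) MDM cross-entropies over masked positions."""
    # AR term: cross-entropy only at positions still masked in z_0.
    ar_logits = denoiser_ar(z0)                        # (L, V)
    ar_positions = z0 == MASK_ID
    ar_loss = F.cross_entropy(ar_logits[ar_positions], x[ar_positions], reduction="sum")

    # MDM term: cross-entropy at positions masked in z_t, weighted by
    # -alpha_t' / (1 - alpha_t) = 1 / t under the assumed schedule alpha_t = 1 - t.
    mdm_logits = denoiser_mdm(zt)                      # (L, V)
    mdm_positions = zt == MASK_ID
    mdm_loss = (1.0 / t) * F.cross_entropy(
        mdm_logits[mdm_positions], x[mdm_positions], reduction="sum"
    )
    return ar_loss + mdm_loss


# Toy usage with random "logits" standing in for the denoising transformer.
L, V = 8, 16
toy = lambda z: torch.randn(z.shape[0], V)
x = torch.randint(1, V, (L,))
z0 = torch.where(torch.rand(L) < 0.25, torch.full_like(x, MASK_ID), x)   # lightly masked
zt = torch.where(torch.rand(L) < 0.80, torch.full_like(x, MASK_ID), x)   # heavily masked
print(nelbo_estimate(x, z0, zt, t=0.8, denoiser_ar=toy, denoiser_mdm=toy))
```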
Diffusion Phase. The denoising transformer receives \( \mathbf{z}_t \sim q_t(\cdot \mid \mathbf{x}) \), which contains the mask tokens to denoise, with \( \mathbf{x} \) as the target. A random ordering \( \sigma \sim \mathcal{P}_L \) is sampled, with the natural constraint that clean tokens in \( \mathbf{z}_t \) precede mask tokens in \( \mathbf{z}_t \) under \( \sigma \). Below are the example attention mask and its sorted version used for implementation when \( \mathbf{x} = (A, B, C, D, E, F) \), \( \mathbf{z}_t = (A, M, C, M, M, F) \), and \( \sigma = (3, 1, 6, 4, 5, 2) \):
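One simple way to materialize an ordering-based mask for this example in code, and to see why sorting by \( \sigma \) is convenient, is the sketch below. It is only a generic "causal in \( \sigma \)" pattern (position \( i \) attends to position \( j \) iff \( j \) does not come after \( i \) in \( \sigma \)), reading \( \sigma \) as listing token positions in processing order; the exact attention pattern used by Eso-LM (B) is the one depicted in the figure. Permuting rows and columns by \( \sigma \) turns the mask into an ordinary lower-triangular (causal) mask, which is the "sorted version" used for implementation.

```python
# Sketch: build a "causal in sigma" attention mask for the diffusion-phase example
# and verify that sorting rows/columns by sigma gives a standard causal mask.
import torch

MASK = "M"
z_t = ["A", MASK, "C", MASK, MASK, "F"]
sigma = [3, 1, 6, 4, 5, 2]                            # 1-indexed positions, as in the text
rank = {pos - 1: r for r, pos in enumerate(sigma)}    # 0-indexed position -> rank in sigma

L = len(z_t)
# Position i may attend to position j iff j is not later than i in sigma.
attn = torch.zeros(L, L, dtype=torch.bool)
for i in range(L):
    for j in range(L):
        attn[i, j] = rank[j] <= rank[i]

# "Sorted version for implementation": permuting rows and columns by sigma
# yields an ordinary lower-triangular (causal) mask.
perm = torch.tensor([p - 1 for p in sigma])
sorted_attn = attn[perm][:, perm]
assert torch.equal(sorted_attn, torch.tril(torch.ones(L, L, dtype=torch.bool)))
print(attn.int())
```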
Sequential Phase. The denoising transformer receives \( \mathbf{z}_0 \oplus \mathbf{x} \in \mathcal{V}^{2L} \), where \( \mathbf{z}_0 \sim q_0(\cdot \mid \mathbf{x}) \) contains the mask tokens to denoise, and computes the loss by comparing the transformer output over \( \mathbf{z}_0 \) against the target \( \mathbf{x} \). Concatenating \( \mathbf{z}_0 \) and \( \mathbf{x} \) at the input is required during training because we do not use the shift-by-one at the output that AR models use. A random ordering \( \sigma \sim \mathcal{P}_L \) is sampled with the constraints that (i) clean tokens in \( \mathbf{z}_0 \) precede mask tokens in \( \mathbf{z}_0 \) under \( \sigma \) and (ii) mask tokens appear in their natural order in \( \sigma \). Below are the example attention mask and its sorted version used for implementation when \( \mathbf{x} = (A, B, C, D, E, F) \), \( \mathbf{z}_0 = (A, M, C, M, M, F) \), and \( \sigma = (3, 1, 6, 2, 4, 5) \):
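The following small sketch shows how \( \sigma \) can be sampled for the sequential phase under the two constraints above, reproducing orderings like the example \( \sigma = (3, 1, 6, 2, 4, 5) \); positions are 1-indexed to match the text, and the helper name is hypothetical.

```python
# Sketch: sample sigma for the sequential phase subject to constraints (i) and (ii).
import random

MASK = "M"
z_0 = ["A", MASK, "C", MASK, MASK, "F"]


def sample_sigma_sequential(z, rng=random):
    clean = [i + 1 for i, tok in enumerate(z) if tok != MASK]    # 1-indexed positions
    masked = [i + 1 for i, tok in enumerate(z) if tok == MASK]
    rng.shuffle(clean)           # constraint (i): clean positions first, in any order
    return clean + masked        # constraint (ii): masked positions in natural order


print(sample_sigma_sequential(z_0))   # e.g., [3, 1, 6, 2, 4, 5]
# At training time, the transformer input is the length-2L concatenation z_0 ⊕ x.
```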
Efficient generation of an example sequence:
We train and evaluate on the One Billion Words (LM1B) dataset and OpenWebText (OWT).
When generating a sequence of length 8192 using the maximum possible number of function evaluations (NFEs = 8192), Eso-LMs achieve up to 65× faster inference than MDLM and 3-4× faster inference than BD3-LMs:
We use Generative Perplexity (Gen. PPL) to evaluate the quality of samples generated by models trained on OWT. Lower Gen. PPL indicates higher sample quality. The sequence length is 1024.
To compare sampling efficiency, we also record the median sampling duration in seconds (across 5 trials) taken by each method to generate a single sample (i.e., batch size is 1).
Eso-LMs set a new SOTA on the sampling speed–quality Pareto frontier, redefining what’s possible:
@misc{sahoo2025esotericlanguagemodels,
title={Esoteric Language Models},
author={Subham Sekhar Sahoo and Zhihan Yang and Yash Akhauri and Johnna Liu and Deepansha Singh and Zhoujun Cheng and Zhengzhong Liu and Eric Xing and John Thickstun and Arash Vahdat},
year={2025},
eprint={2506.01928},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.01928},
}