Enabling Approximate Joint Sampling in Diffusion LMs

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

In autoregressive language models, each token is sampled conditioned on all past tokens, so the overall string is sampled from the correct joint distribution represented by the model. In contrast, masked diffusion language models generate text by unmasking tokens out of order and potentially in parallel. Sampling the overall string from the model's true joint distribution would (again) require unmasking exactly one token per full-model forward pass. The more tokens unmasked in parallel, the further the string drifts from the true joint; this shows up as a drop in accuracy (but an increase in speed). In this paper we devise a way to *approximately* sample multiple tokens from the joint distribution in a single full-model forward pass; we do so by developing a new lightweight single-layer "sampler" on top of an existing large diffusion LM. One forward pass of the full model can now be followed by multiple forward passes of only this sampler layer, yielding multiple unmasked tokens. Our sampler is trained to mimic exact joint sampling from the (frozen) full model. We show the effectiveness of our approximate joint sampling for both pretrained-only (Dream-7B-Base, Llada-7B-Base) and instruction-tuned (Dream-7B-Instruct, Dream-7B-Coder) models on language modeling and math & coding tasks. When four tokens are unmasked per full-model denoising step, our sampling algorithm achieves a MAUVE score of 0.87 (vs. a marginal baseline of 0.31) with respect to the true joint distribution.


💡 Research Summary

This paper tackles a fundamental limitation of masked diffusion language models (Diffusion LMs): when multiple tokens are unmasked in parallel during a single denoising step, the model samples from the product of per‑token marginal distributions rather than from the true joint distribution encoded by the model. Because each forward pass of a diffusion model only provides a marginal p_i(·|x) for every position, generating K tokens simultaneously amounts to independent draws, which dramatically degrades quality on tasks that require strong token‑to‑token dependencies (e.g., mathematics, code).
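A toy example (not from the paper) makes the gap concrete. Take a joint distribution over two perfectly correlated binary tokens: the per-position marginals are uniform, so independent parallel draws put half their mass on strings the joint assigns zero probability.

```python
# Toy illustration: parallel unmasking samples from a product of
# per-position marginals, not from the true joint distribution.
# Joint over two binary tokens with perfect correlation:
joint = {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.0, (1, 0): 0.0}

# Per-position marginals, as a single forward pass would expose them.
p1 = {t: sum(p for (a, b), p in joint.items() if a == t) for t in (0, 1)}
p2 = {t: sum(p for (a, b), p in joint.items() if b == t) for t in (0, 1)}

# Unmasking both positions in parallel draws them independently:
product = {(a, b): p1[a] * p2[b] for a in (0, 1) for b in (0, 1)}

# Total variation distance between the joint and the product of marginals.
tv = 0.5 * sum(abs(joint[k] - product[k]) for k in joint)
print(tv)  # 0.5: half the product's mass lands on impossible strings
```

Here the marginals are individually exact, yet the sampled pair is wrong half the time, which is exactly the failure mode the paper targets for dependency-heavy outputs like math and code.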

The authors propose a novel “approximate joint sampling” technique called ADJUST. The key idea is to keep the large diffusion model f frozen and stack a lightweight single‑layer transformer g on top of it. Generation proceeds as follows: (1) a single forward pass through f produces the current embeddings h₀ for the fully masked input; (2) the first token to be unmasked is drawn directly from f’s marginal distribution p₁; (3) for each subsequent token, the current embeddings hₖ and the partially filled sequence are fed into g, which updates the embeddings to hₖ₊₁ and yields a conditional distribution qₖ that is conditioned on all previously sampled tokens. By iteratively feeding back each newly sampled token, ADJUST ensures that every token is sampled with awareness of the others, approximating the true chain‑rule factorisation of the joint distribution.
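The generation loop described above can be sketched as follows. This is a structural sketch only: `full_model`, `sampler_layer`, the toy vocabulary, and the uniform distributions are illustrative stand-ins for the paper's f and g, not its actual code.

```python
import random

random.seed(0)
VOCAB = ["a", "b", "c"]
MASK = "<mask>"

def full_model(seq):
    """Stand-in for the frozen diffusion model f: one expensive forward
    pass returning (embeddings, per-position marginal distributions)."""
    embeds = [hash((i, t)) % 100 / 100.0 for i, t in enumerate(seq)]
    marginals = [[1 / len(VOCAB)] * len(VOCAB) for _ in seq]
    return embeds, marginals

def sampler_layer(embeds, seq):
    """Stand-in for the lightweight single-layer sampler g: updates the
    embeddings given the partially filled sequence and returns a
    conditional distribution aware of all previously sampled tokens."""
    embeds = [(e + 0.01) % 1.0 for e in embeds]
    cond = [1 / len(VOCAB)] * len(VOCAB)
    return embeds, cond

def unmask_step(seq, positions):
    """Unmask len(positions) tokens with ONE full-model pass followed
    by cheap sampler-layer passes, one per remaining token."""
    embeds, marginals = full_model(seq)           # (1) one pass through f
    first = positions[0]
    seq[first] = random.choices(VOCAB, marginals[first])[0]  # (2) f's marginal
    for pos in positions[1:]:                     # (3) remaining tokens via g
        embeds, cond = sampler_layer(embeds, seq)
        seq[pos] = random.choices(VOCAB, cond)[0]
    return seq

out = unmask_step([MASK] * 8, [2, 5, 0, 7])
print(out)
```

The point of the structure is the cost profile: step (1) runs the large model once, while steps (2)–(3) produce K tokens with only K−1 passes through the single sampler layer, each conditioned on everything sampled so far.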

Training ADJUST requires realistic inputs that mimic the state of the diffusion model at various noise levels. The authors generate offline data by running the frozen f on many masked patterns, collecting embeddings and marginal logits. ADJUST is then trained to minimise a KL‑divergence between its conditional distribution qₖ and the true conditional pₖ derived from the frozen model, using a specially designed loss that respects the sequential nature of the sampling process. No external autoregressive verifier is needed; the final output distribution is a new approximation that is closer to the original joint distribution than naïve parallel sampling.
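The per-position training signal can be sketched as a KL divergence between teacher and student conditionals. The distributions below are made-up numbers and the variable names are illustrative; in the actual pipeline the teacher conditional would come from re-running the frozen model with previously sampled tokens filled in.

```python
import math

def kl(p, q):
    """KL(p || q) between two distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical training signal at one position k:
# p_k: the frozen model's true conditional, obtained by feeding the
#      k-1 already-sampled tokens back through f.
# q_k: the lightweight sampler layer's cheap conditional for the same
#      position, produced without another full-model pass.
p_k = [0.7, 0.2, 0.1]   # teacher: conditional from the frozen model
q_k = [0.5, 0.3, 0.2]   # student: sampler layer's conditional

loss = kl(p_k, q_k)     # the sampler is trained to drive this toward 0
print(round(loss, 4))
```

When the sampler matches the teacher exactly, the loss is zero, so minimizing it over many masking patterns and prefixes pushes the sampler's sequential conditionals toward the frozen model's own chain-rule factorization.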

Experiments cover both pretrained‑only models (Dream‑7B‑Base, Llada‑7B‑Base) and instruction‑tuned variants (Dream‑7B‑Instruct, Dream‑7B‑Coder). The authors evaluate unconditional generation (NLL, MAUVE) and downstream benchmarks (GSM8K, MBPP, HumanEval). When unmasking four tokens per diffusion step, ADJUST improves MAUVE from a marginal baseline of 0.31 to 0.87 and raises GSM8K accuracy by roughly 16 percentage points. Even with eight tokens per step, MAUVE remains at 0.84 versus 0.19 for naïve parallel sampling. Throughput drops only modestly: ADJUST is about 20–25% slower than pure parallel decoding while delivering substantially higher quality.

The paper’s contributions are: (1) a clear probabilistic analysis showing why parallel unmasking samples from a product‑of‑marginals distribution; (2) the design of a single‑layer “draft” model that enables sequential, conditioned sampling after a single diffusion forward pass; (3) a training pipeline that supplies realistic noisy inputs and aligns the draft model’s conditionals with the frozen diffusion model; (4) extensive empirical validation demonstrating consistent gains across models and tasks.

Limitations include the simplicity of the draft model—being only one transformer layer, it may not fully capture complex long‑range dependencies, especially when K (tokens per step) becomes large. Future work could explore multi‑layer draft networks, adaptive selection of K, or integration with alternative masking schedules. Nonetheless, ADJUST offers a practical and effective route to bridge the speed‑quality gap in diffusion‑based language generation, bringing parallel decoding closer to the true joint distribution without the overhead of full autoregressive verification.

