Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Diffusion language models (DLMs) generate text through iterative denoising, but inference requires full-sequence attention at every iteration, resulting in substantial redundant computation on masked tokens. Block-wise diffusion can reduce this cost, yet it typically relies on retraining and constrained update orders, limiting its direct applicability to pretrained DLMs. Our token-level analysis reveals pronounced structural locality in DLM inference. Decoding is driven by a small set of prefix-localized active tokens; the influence of distant undecoded context diminishes rapidly, and decoded tokens exhibit stage-wise temporal stability, enabling reuse of intermediate representations except for a brief post-decode transient. Motivated by these observations, we propose Window-Diffusion (source code available at https://github.com/vhicrgit/Window-Diffusion), a window-based token pruning and caching method for inference. We maintain a local computation window that slides rightward as denoising progresses, and partition undecoded tokens into: (i) active tokens that are computed online, (ii) buffer tokens whose KV states are cached and periodically refreshed, and (iii) far-field tokens that are pruned outside the window. Computation is restricted to active and buffer tokens within the window, while far-field tokens are omitted at each stage. Experiments on LLaDA and Dream show that, under matched compute budgets, our method achieves up to 99× inference speedup while largely preserving generation performance.


💡 Research Summary

Diffusion Language Models (DLMs) have emerged as a powerful alternative to autoregressive generation, offering parallel decoding and global context modeling by iteratively denoising a sequence of masked tokens. However, each denoising step traditionally computes full‑sequence bidirectional attention, leading to massive redundancy because only a tiny fraction of tokens actually change at any given step. Prior acceleration attempts—quantization, KV‑caching, and block‑wise diffusion—either reduce arithmetic cost without shortening the sequence, require costly retraining, or constrain the flexible update order intrinsic to diffusion models.

The authors begin by conducting a systematic token‑level analysis on two large pretrained DLMs (LLaDA and Dream) across diverse benchmarks. Three consistent observations emerge: (1) Prefix locality – tokens selected for update (the “active set”) are overwhelmingly concentrated near the beginning of the undecoded region; distant positions rarely receive updates. (2) Diminishing returns from distant context – active‑token predictions quickly saturate as the length of retained undecoded context (W) grows; beyond a modest prefix, KL divergence to the full‑context baseline plateaus. Moreover, caching the key/value (KV) states of non‑active masked tokens further reduces this divergence, indicating that their intermediate representations are highly reusable. (3) Temporal stability of decoded tokens – newly decoded tokens experience a brief post‑decode transient where their value vectors change rapidly, but tokens decoded earlier remain KV‑stationary across many subsequent diffusion steps.
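Observation (2) amounts to a saturation curve: the KL divergence between an active token's prediction under a truncated undecoded context and its full-context prediction flattens out once the retained width W passes a modest threshold. A minimal sketch of that measurement follows; `predict` is a hypothetical hook returning the active token's distribution under a given context length, not the paper's code:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def context_saturation_curve(predict, full_len, widths):
    """For each retained undecoded-context width W, measure how far the
    active token's distribution drifts from the full-context baseline."""
    baseline = predict(context_len=full_len)
    return {w: kl_divergence(predict(context_len=w), baseline) for w in widths}
```

If the paper's locality claim holds, the returned dictionary decreases rapidly in W and then plateaus, which is what motivates a fixed small window.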

These findings suggest that full‑sequence recomputation is unnecessary: only a sliding window of tokens around the active frontier needs fresh computation, while the rest can be either pruned or cached. Building on this insight, the paper introduces Window‑Diffusion, a training‑free inference acceleration technique. At each diffusion step a local computation window is defined and moves rightward as decoding proceeds. Within the window, tokens are partitioned into three categories:

  • Active tokens – computed online each step because they are likely to be updated.
  • Buffer tokens – their KV states are cached and refreshed periodically (e.g., every few steps), providing context without full recomputation.
  • Far‑field tokens – located outside the window and completely pruned for the current step.
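The three-way split above can be implemented as a small helper over undecoded positions. The role names match the list; the function itself is an illustrative sketch (window placement and the active count are assumptions, not the authors' implementation):

```python
def partition_tokens(undecoded, window_start, num_active, window_size):
    """Split undecoded token positions (sorted ascending) into the three
    roles used at the current diffusion step."""
    in_window = [p for p in undecoded
                 if window_start <= p < window_start + window_size]
    active = in_window[:num_active]       # recomputed online this step
    buffer = in_window[num_active:]       # served from the KV cache
    far_field = [p for p in undecoded
                 if p >= window_start + window_size]  # pruned this step
    return active, buffer, far_field
```

As decoding proceeds, `window_start` advances rightward, so tokens migrate from far-field to buffer to active before being decoded.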

The method therefore reduces per‑step attention complexity from O(S²) (full self‑attention over sequence length S) to O(W²), where W ≪ S is the window size, since active queries attend only to tokens inside the window. The authors empirically set W to a small constant (e.g., 32–64) and refresh the buffer every 5–10 steps, achieving a good trade‑off between speed and accuracy.
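The saving comes from restricting attention to the window: active queries attend only to freshly computed active keys/values plus cached buffer keys/values, and far-field tokens are simply absent. A minimal single-head NumPy sketch, with shapes and names chosen for illustration rather than taken from the paper's code:

```python
import numpy as np

def windowed_attention(q_active, fresh_kv, cached_kv):
    """Single-head attention for active queries over the window only:
    fresh (active) K/V concatenated with cached (buffer) K/V.
    Far-field tokens never enter, so cost scales with the window size."""
    K = np.concatenate([fresh_kv[0], cached_kv[0]], axis=0)
    V = np.concatenate([fresh_kv[1], cached_kv[1]], axis=0)
    scores = q_active @ K.T / np.sqrt(q_active.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # softmax over window positions
    return w @ V
```

A driver loop would recompute `cached_kv` only every few steps (the refresh interval above) and slide the window rightward after each decode.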

Experiments on LLaDA (7B) and Dream (13B) demonstrate that, under matched FLOP budgets, Window‑Diffusion yields 2.3×–6.6× speedups over the baseline DLM while preserving generation quality (BLEU/ROUGE drops < 0.3%). When combined with adaptive‑length inference—terminating diffusion early once the model’s confidence stabilizes—the approach reaches up to 99× acceleration. Importantly, the technique requires no model architecture changes or additional training, making it directly applicable to existing pretrained diffusion models.
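The adaptive-length component can be approximated by a confidence-plateau stopping rule: run denoising steps and stop once a confidence statistic stops moving. Everything below (`step_fn`, the stability window, the tolerance) is an assumed interface for illustration, not the paper's implementation:

```python
def adaptive_length_decode(step_fn, max_steps, stable_window=3, tol=1e-3):
    """Run diffusion steps, stopping early once the per-step confidence
    varies by less than `tol` over `stable_window` consecutive steps."""
    history, state = [], None
    for t in range(max_steps):
        state, confidence = step_fn(state, t)
        history.append(confidence)
        recent = history[-stable_window:]
        if len(history) > stable_window and max(recent) - min(recent) < tol:
            return state, t + 1   # converged early
    return state, max_steps
```

Combined with windowed pruning, this early exit is what pushes the reported speedup from the 2.3×–6.6× range toward the 99× figure.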

Ablation studies confirm each component’s contribution: removing the buffer cache increases KL divergence and slows inference; enlarging the window beyond a modest size yields diminishing returns; and aggressive pruning of far‑field tokens beyond the observed locality threshold does not harm performance. The method also composes well with other orthogonal optimizations such as quantization or mixed‑precision inference, suggesting further gains are possible.

Limitations include sensitivity to hyper‑parameters (window size, cache refresh interval) which may need tuning per model or task, and potential memory overhead for storing cached KV states in very long sequences. Future work could explore dynamic window adaptation, learned caching schedules, and distributed implementations for multi‑GPU settings.

In summary, Window‑Diffusion leverages empirically validated structural locality in diffusion language model inference to prune unnecessary computation and reuse intermediate representations. By introducing a sliding active window with selective caching, it achieves orders‑of‑magnitude speedups without retraining, opening the door for real‑time deployment of large‑scale diffusion language models.

