Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing
Diffusion Large Language Models (dLLMs) deliver strong long-context processing capability in a non-autoregressive decoding paradigm. However, the considerable computational cost of bidirectional full attention limits the inference efficiency. Although sparse attention is promising, existing methods remain ineffective. This stems from the need to estimate attention importance for tokens yet to be decoded, while the unmasked token positions are unknown during diffusion. In this paper, we present Focus-dLLM, a novel training-free attention sparsification framework tailored for accurate and efficient long-context dLLM inference. Based on the finding that token confidence strongly correlates across adjacent steps, we first design a past confidence-guided indicator to predict unmasked regions. Built upon this, we propose a sink-aware pruning strategy to accurately estimate and remove redundant attention computation, while preserving highly influential attention sinks. To further reduce overhead, this strategy reuses identified sink locations across layers, leveraging the observed cross-layer consistency. Experimental results show that our method offers more than $29\times$ lossless speedup under $32K$ context length. The code is publicly available at: https://github.com/Longxmas/Focus-dLLM
💡 Research Summary
Focus‑dLLM addresses the computational bottleneck of diffusion large language models (dLLMs), which require bidirectional full attention over the entire context at every denoising step. Existing acceleration techniques, namely approximate KV caching and sparse attention, struggle because dLLMs do not know in advance which tokens will be unmasked in the next step, making it hard to decide where to focus computation.

The authors make two key empirical observations that drive their solution. First, token confidence scores (the maximum softmax probability for each masked token) are highly correlated across consecutive denoising steps; moreover, the set of tokens that will be unmasked at step t already exhibits high confidence at step t−1. Ranking the previous‑step confidences and taking the top‑k tokens predicts the future unmasked positions with a recall of over 96%. Second, attention "sinks" (tokens that receive disproportionately large attention weights) appear consistently across layers: visualizing attention maps shows that the same sink indices recur from shallow to deep layers, suggesting that identifying sinks once is sufficient for the whole network.

Building on these findings, Focus‑dLLM introduces a training‑free sparsification pipeline that combines past‑confidence‑guided query selection with sink‑aware pruning and an approximate KV cache. At each denoising step, the algorithm ranks the previous‑step confidence scores of all still‑masked positions. The top ρ·n(t) indices (where n(t) is the number of tokens to be unmasked in the current block and ρ is a small expansion factor) form a candidate set I_focus. To respect the locality of language, each candidate is expanded to a small window of size w, yielding an active query set I_active. Only tokens in I_active serve as queries in the attention computation; all other positions are skipped, dramatically reducing the query‑key multiplication cost.
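The query-selection step described above can be sketched as follows. This is a minimal illustration, not the authors' reference implementation: the function name, array layout, and the exact window-expansion convention are assumptions for clarity.

```python
import numpy as np

def select_active_queries(prev_confidence, masked_positions, n_t, rho=2.0, w=4):
    """Illustrative sketch of past-confidence-guided query selection.

    prev_confidence: per-position confidence (max softmax prob) from the
      previous denoising step, shape (seq_len,).
    masked_positions: indices of still-masked positions.
    n_t: number of tokens to be unmasked in the current block.
    rho: small expansion factor; w: local window size.
    """
    seq_len = len(prev_confidence)
    masked = np.asarray(masked_positions)
    # Rank still-masked positions by their previous-step confidence.
    conf = prev_confidence[masked]
    k = min(len(masked), int(np.ceil(rho * n_t)))
    top = np.argsort(conf)[::-1][:k]
    i_focus = masked[top]                       # candidate set I_focus
    # Expand each candidate to a local window of size ~w (locality of language).
    i_active = set()
    for p in i_focus:
        lo = max(0, p - w // 2)
        hi = min(seq_len, p + w // 2 + 1)
        i_active.update(range(lo, hi))
    return sorted(i_active)                     # active query set I_active
```

Only the returned positions would then be used as attention queries; all other rows of the query matrix are skipped for this step.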
On the key/value side, the method treats the first l layers as dense, computing full attention. The attention distribution at layer l is then used to extract a global set of sink tokens S (e.g., the top‑m keys with the highest aggregate attention). In all subsequent layers, the key/value matrix is pruned to retain only the union of S, the active query windows, and the already‑generated response tokens. This sink‑aware pruning preserves the most influential historical context while discarding the vast majority of irrelevant keys.

The KV cache is refreshed only at block boundaries; within a block, only the selected sparse tokens have their KV states recomputed, while the rest are reused from the previous step. This hybrid of sparse attention and approximate caching yields a dramatic reduction in FLOPs without any additional training or model modification.

Experiments on multiple dLLM backbones (LLaDA‑8B‑Instruct, UltraLLaMA‑13B, etc.) and benchmark datasets (GSM‑8K, WikiText‑103, LongBench) show that Focus‑dLLM achieves over a 29× speedup compared with full‑attention inference at a 32K context length, while maintaining or slightly improving generation quality (BLEU, ROUGE, GPT‑4 evaluations). Compared with the recent Fast‑dLLM baseline, it provides a 2.05× speedup at identical accuracy, and memory consumption drops by roughly 40% thanks to the smaller key/value matrices.

Ablation studies confirm that both components, past‑confidence‑guided query selection and sink‑aware pruning, are essential; removing either causes noticeable drops in speed or quality. The paper discusses limitations such as the fixed hyper‑parameters (ρ, w, l) and the dependence on early‑layer attention quality for sink detection, suggesting future work on adaptive parameter tuning and on extending the method to other diffusion‑based generative models.
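The sink extraction and KV pruning step can be illustrated with a small sketch. The helper name and the exact aggregation (summing attention mass over heads and queries) are assumptions; the paper only specifies that sinks are the keys receiving disproportionately large aggregate attention at the last dense layer, and that the identified sink set is reused across deeper layers.

```python
import numpy as np

def select_kept_keys(attn_probs, active_positions, response_positions, m=8):
    """Illustrative sketch of sink-aware KV pruning.

    attn_probs: attention map from the last dense layer l,
      shape (num_heads, num_queries, seq_len).
    active_positions: the active query windows (I_active).
    response_positions: already-generated response tokens.
    m: number of attention sinks to keep.
    Returns the key/value indices retained in all deeper layers.
    """
    # Aggregate attention mass each key receives across heads and queries.
    key_mass = attn_probs.sum(axis=(0, 1))      # shape (seq_len,)
    sinks = np.argsort(key_mass)[::-1][:m]      # top-m attention sinks S
    # Keep the union of sinks, active windows, and response tokens.
    kept = set(sinks.tolist()) | set(active_positions) | set(response_positions)
    return sorted(kept)
```

All other key/value rows would be dropped from the attention computation in layers after layer l, which is where the bulk of the FLOP savings comes from.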
In summary, Focus‑dLLM offers a practical, training‑free framework that leverages temporal confidence consistency and cross‑layer sink stability to accelerate long‑context diffusion LLM inference dramatically while preserving generation fidelity.