Hierarchical Adaptive Eviction for KV Cache Management in Multimodal Language Models
The integration of visual information into Large Language Models (LLMs) has enabled Multimodal LLMs (MLLMs), but the quadratic memory and computational costs of Transformer architectures remain a bottleneck. Existing KV cache eviction strategies fail to address the heterogeneous attention distributions of visual and text tokens, leading to suboptimal efficiency or degraded performance. In this paper, we propose Hierarchical Adaptive Eviction (HAE), a KV cache eviction framework that optimizes text-visual token interaction in MLLMs by applying Dual-Attention Pruning during pre-filling (leveraging visual token sparsity and attention variance) and a Dynamic Decoding Eviction Strategy (inspired by OS recycle bins) during decoding. HAE minimizes KV cache usage across layers, reduces computational overhead via index broadcasting, and theoretically ensures superior information integrity and tighter error bounds than greedy strategies, enhancing efficiency in both comprehension and generation tasks. Empirically, HAE reduces KV-cache memory by 41% with minimal accuracy loss (a 0.3% drop) on image understanding tasks and accelerates story generation inference by 1.5x while maintaining output quality on the Phi-3.5-Vision-Instruct model.
💡 Research Summary
The paper tackles the growing memory and compute burden of key‑value (KV) caches in multimodal large language models (MLLMs) that process both visual and textual tokens. Existing KV‑cache eviction methods either focus on a single modality or apply a uniform greedy policy, which ignores the distinct attention characteristics of visual versus textual tokens and leads to either inefficient memory use or degraded model performance.
Observations
Through extensive analysis of Phi‑3.5‑Vision‑Instruct, the authors find that (1) the variance of cumulative attention scores differs markedly between visual and textual tokens, and (2) visual tokens exhibit higher sparsity—especially in the first two transformer layers—while textual tokens are relatively dense in those early layers. This suggests that visual information is often less critical early on, providing an opportunity for selective pruning.
Hierarchical Adaptive Eviction (HAE)
HAE introduces a two‑stage eviction framework:
1. **Dual‑Attention Pruning (DAP)** – pre‑filling stage
- For each visual token $V_j$, the method aggregates the attention it receives from all text tokens, yielding a global attention score $A_j$.
- Tokens whose $A_j$ falls below a dynamic threshold (a fraction $r$ of the average visual attention score) are marked as low‑importance.
- An additional per‑token check requires that the maximum attention from any single text token to $V_j$ is below a second threshold $\alpha$. Only tokens satisfying both criteria are evicted.
- Eviction is performed only in the first transformer layer; the indices of the removed visual tokens are broadcast to all subsequent layers, eliminating per‑layer decision overhead while uniformly reducing KV storage throughout the network.
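The DAP steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the threshold values for `r` and `alpha`, the head‑averaged attention layout, and the function name are all assumptions made for the example.

```python
import numpy as np

def dual_attention_prune(attn, r=0.5, alpha=0.05):
    """Sketch of Dual-Attention Pruning (DAP) on first-layer attention.

    attn: array of shape [num_text, num_visual] holding attention weights
    from text tokens to visual tokens (assumed averaged over heads).
    r, alpha: illustrative thresholds, not values from the paper.

    Returns the indices of visual tokens to KEEP; in HAE these indices
    would be broadcast to all subsequent layers, so the pruning decision
    is made only once, in the first transformer layer.
    """
    # Global score A_j: attention aggregated over all text tokens.
    global_score = attn.sum(axis=0)                      # [num_visual]
    low_global = global_score < r * global_score.mean()  # below fraction r of mean

    # Per-token check: peak attention from any single text token to V_j.
    low_peak = attn.max(axis=0) < alpha

    # Evict only tokens that satisfy BOTH criteria.
    evict = low_global & low_peak
    return np.flatnonzero(~evict)
```

A token with one strongly attending text token survives even if its aggregate score is low, which is the point of the second check.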
2. **Dynamic Decoding Eviction Strategy (DDES)** – decoding stage
- Inspired by operating‑system recycle bins, DDES maintains a buffer (the “recycling bin”) that temporarily stores KV entries with the lowest cumulative attention scores.
- When the buffer reaches a predefined capacity $D$, the stored entries are evicted in bulk rather than one by one as in greedy approaches.
- The scoring function combines the softmax similarity between the current query and each KV with the cumulative attention accumulated so far, allowing a balanced consideration of both visual and textual importance.
- This dynamic retention reduces the risk of prematurely discarding KV pairs that might become relevant later in a long generation, thereby preserving model accuracy while still cutting memory usage.
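The recycle‑bin mechanism can be sketched with a small buffer class. The class name, the additive combination of query similarity and cumulative attention, and the `rescue` method are illustrative assumptions; the paper's exact scoring function and buffer policy may differ.

```python
class RecycleBin:
    """Sketch of the Dynamic Decoding Eviction Strategy (DDES).

    Low-scoring KV entries are parked in a buffer (the "recycling bin")
    instead of being evicted immediately. Only when the buffer reaches
    capacity D are they evicted in bulk; until then, a token that becomes
    relevant again can be rescued, avoiding the premature, irreversible
    evictions of a greedy one-by-one policy.
    """

    def __init__(self, capacity):
        self.capacity = capacity   # bulk-eviction threshold D
        self.bin = {}              # token index -> combined score

    def consider(self, idx, query_sim, cum_attn):
        """Park a low-importance KV entry; return indices evicted in bulk.

        query_sim: softmax similarity between the current query and this KV.
        cum_attn:  attention this KV has accumulated so far.
        The additive mix below is an assumed stand-in for the paper's
        balanced scoring function.
        """
        self.bin[idx] = query_sim + cum_attn
        if len(self.bin) >= self.capacity:
            evicted = sorted(self.bin, key=self.bin.get)  # lowest score first
            self.bin.clear()
            return evicted
        return []

    def rescue(self, idx):
        # A binned token that turns out to be relevant again is retained.
        self.bin.pop(idx, None)
```

Bulk eviction amortizes the bookkeeping cost over $D$ decoding steps, and the rescue path is what gives DDES its retention advantage over greedy eviction.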
Theoretical Guarantees
The authors prove two key results. Theorem 2.1 shows that if the eviction threshold $k$ satisfies a logarithmic bound involving the allowable loss $\epsilon$, the maximum attention among evicted tokens, and a decay rate $\lambda$, then the total information loss stays below $\epsilon$. Corollary 2.1 shows that DDES yields a tighter upper bound on cumulative attention loss than a greedy eviction policy, confirming that the proposed strategy is theoretically more faithful to the original KV information.
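One plausible form of such a logarithmic bound, assuming the attention mass of evicted tokens decays exponentially at rate $\lambda$ from a maximum $a_{\max}$ (the exact statement and constants are in the paper; this is only a consistency sketch):

```latex
% Assumed decay model: evicted token t contributes at most a_{\max} e^{-\lambda t}.
% Bounding the tail of the discarded attention mass by an integral:
\sum_{t > k} a_{\max} e^{-\lambda t}
  \;\le\; \int_{k}^{\infty} a_{\max} e^{-\lambda t}\, dt
  \;=\; \frac{a_{\max}}{\lambda}\, e^{-\lambda k} .
% Requiring this tail to stay below \epsilon gives the logarithmic threshold:
\frac{a_{\max}}{\lambda}\, e^{-\lambda k} \le \epsilon
  \quad\Longleftrightarrow\quad
  k \;\ge\; \frac{1}{\lambda}\,\log\!\frac{a_{\max}}{\epsilon\,\lambda} .
```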
Empirical Evaluation
Experiments are conducted on two open‑source MLLMs—LLaVA‑1.5‑7B and Phi‑3.5‑Vision‑Instruct—across two families of tasks:
Image‑based Question Answering (GQA, ScienceQA, MMMU, etc.)
- HAE reduces KV‑cache memory by roughly 47% while retaining about 97% of the baseline accuracy.
- Ablation shows that DAP alone accounts for most of the memory savings, whereas DDES contributes modestly to accuracy preservation.
Long‑form Story Generation (Seed‑Story)
- HAE accelerates inference by 1.5×, with BLEU/ROUGE scores dropping by less than 0.2 points, indicating negligible quality loss.
- Human evaluations confirm that the generated narratives remain fluent and coherent.
Parameter sweeps over the sparsity thresholds $r$ and $\alpha$ and the recycling‑bin size $D$ reveal that HAE is robust: modest variations do not substantially affect the memory‑efficiency‑accuracy trade‑off.
Significance and Limitations
HAE offers a principled, modality‑aware approach to KV‑cache management, achieving substantial memory reduction and speedup without sacrificing performance. Its two‑stage design—static visual pruning followed by dynamic, cross‑modal eviction—addresses the heterogeneity of attention patterns that previous single‑modal or greedy methods overlook. However, the current work focuses on visual‑textual modalities; extending the framework to audio, video, or more complex multimodal pipelines remains an open direction. Additionally, broadcasting eviction indices and maintaining a recycling bin introduce implementation complexity that may affect integration with existing inference engines.
Conclusion
Hierarchical Adaptive Eviction (HAE) demonstrates that careful analysis of cross‑modal attention distributions can guide effective KV‑cache pruning. By combining Dual‑Attention Pruning during pre‑filling with a recycling‑bin‑style Dynamic Decoding Eviction Strategy, HAE cuts KV‑cache memory by over 40 % on average, speeds up generation by 1.5×, and keeps accuracy loss under 0.3 %. Theoretical error bounds and extensive empirical validation substantiate its superiority over prior greedy or single‑modal eviction schemes, marking a notable advancement in efficient multimodal LLM inference.