SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching

Notice: This research summary and analysis were generated automatically with AI. For full accuracy, please consult the original arXiv paper.

As large language models (LLMs) continue to scale, the memory footprint of key-value (KV) caches during inference has become a significant bottleneck. Existing approaches primarily focus on compressing KV caches within a single prompt or on reusing shared prefixes and frequently occurring text segments across prompts. However, such strategies are limited in scenarios where prompts are semantically similar but lexically different, which frequently occurs in tasks such as multi-document summarization and conversational agents. We propose SemShareKV, a KV cache sharing and compression framework that accelerates LLM inference by reusing the KV cache of semantically similar prompts. Instead of relying on exact token matches, SemShareKV applies fuzzy token matching using locality-sensitive hashing (LSH) on token embeddings and incorporates Rotary Position Embedding (RoPE) to better preserve positional information. By selectively reusing relevant key-value pairs from a reference prompt's cache, SemShareKV reduces redundant computation while maintaining output quality. Experiments on diverse summarization datasets show up to a 6.25× speedup and 42% lower GPU memory usage with 5k-token inputs, with negligible quality degradation. These results highlight the potential of semantic-aware cache sharing for efficient LLM inference.


💡 Research Summary

SemShareKV addresses the growing memory and compute bottleneck of key‑value (KV) caches in large language model (LLM) inference, especially when handling long prompts. Existing cache‑optimization techniques focus on compressing or reusing exact token matches within a single prompt or on frequently occurring text chunks. However, many real‑world scenarios—such as multi‑document summarization or conversational agents—feature prompts that are semantically similar but lexically different, limiting the applicability of prior methods.

The authors first report three empirical observations across three LLMs (Mistral‑7B, LLaMA‑3.1‑8B, and MPT‑7B). (1) High‑deviation (HD) tokens—those whose KV representations differ most between a target and a reference prompt—show strong consistency across layers, suggesting they can be reliably identified for selective recomputation. (2) Deeper transformer layers attend to progressively fewer tokens, as measured by the Attention Recovery (AR) metric, indicating that full recomputation in deep layers is unnecessary. (3) Redundant information accumulates in deeper layers; an exponential‑decay token‑retention pattern (more tokens kept in shallow layers, fewer in deep layers) yields the lowest perplexity, confirming that aggressive pruning in deep layers does not harm generation quality.
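The exponential-decay retention pattern from observation (3) can be sketched as a simple per-layer token budget. This is an illustrative reconstruction, not the paper's implementation; the decay rate and the floor on retained tokens (`decay_rate`, `min_keep`) are assumed hyperparameters.

```python
import math

def retention_schedule(num_layers, num_tokens, decay_rate=0.1, min_keep=0.2):
    """Exponential-decay retention: keep more KV pairs in shallow layers,
    progressively fewer in deep layers, never dropping below a floor.
    decay_rate and min_keep are illustrative, not the paper's values."""
    keep = []
    for layer in range(num_layers):
        frac = max(min_keep, math.exp(-decay_rate * layer))
        keep.append(int(num_tokens * frac))
    return keep

# Example: a 32-layer model with a 5000-token prompt keeps all tokens
# at layer 0 and decays toward the 20% floor in the deepest layers.
schedule = retention_schedule(32, 5000)
```

A uniform schedule (same budget at every layer) would either waste memory in deep layers or starve shallow ones, which matches the paper's ablation showing exponential decay gives the best memory/perplexity trade-off.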

Building on these insights, SemShareKV introduces a two‑stage pipeline. First, each incoming prompt’s contextualized embedding cache (E‑Cache) is stored on CPU. When a new target prompt arrives, the system computes an LSH‑based similarity score between its E‑Cache and all stored caches, selecting the most semantically similar reference prompt. Second, the reference’s KV cache is loaded onto the GPU. To enable accurate token alignment despite positional differences, Rotary Position Embedding (RoPE) is applied to both the target and reference E‑Caches before matching. LSH then performs fuzzy token‑level matching, mapping each target token to its most similar reference token. The reference KV cache is reordered according to this mapping and injected into the transformer layers.
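The fuzzy token-level matching step can be approximated with random-hyperplane (sign-of-projection) LSH: each token embedding is hashed to a bit signature, and each target token is mapped to the reference token with the smallest Hamming distance. This is a minimal sketch under those assumptions; it omits the RoPE pre-processing step and all function names (`lsh_signatures`, `fuzzy_match`) are illustrative, not the paper's API.

```python
import numpy as np

def lsh_signatures(embeddings, planes):
    """Random-hyperplane LSH: each token embedding maps to a bit
    signature; similar embeddings agree on more bits."""
    return (embeddings @ planes.T) > 0  # (num_tokens, num_bits) bool

def fuzzy_match(target_emb, ref_emb, num_bits=64, seed=0):
    """Map each target token to the reference token whose signature is
    closest in Hamming distance (a brute-force sketch; a real system
    would bucket signatures instead of comparing all pairs)."""
    rng = np.random.default_rng(seed)
    dim = target_emb.shape[1]
    planes = rng.standard_normal((num_bits, dim))
    sig_t = lsh_signatures(target_emb, planes)
    sig_r = lsh_signatures(ref_emb, planes)
    # Hamming distance between every target/reference signature pair
    dists = (sig_t[:, None, :] != sig_r[None, :, :]).sum(-1)
    return dists.argmin(axis=1)  # best reference index per target token
```

The resulting index array is what would drive the reordering of the reference KV cache before it is injected into the transformer layers.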

In the first transformer layer, all tokens are fully recomputed; the L2 distance between recomputed outputs and the reordered KV values identifies HD tokens. Subsequent layers only recompute these HD tokens, while tokens with low attention scores are evicted from the cache, achieving dynamic memory reduction. This “recomputation strategy” respects the layer‑wise importance revealed by the observations, while the “retention strategy” preserves more KV pairs in shallow layers and discards them in deep layers.
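The HD-token selection described above amounts to ranking tokens by the per-token L2 distance between recomputed first-layer outputs and the reused (reordered) reference values. A minimal sketch, assuming a top-k selection rule; the `ratio` hyperparameter and function name are assumptions, not taken from the paper.

```python
import numpy as np

def select_hd_tokens(recomputed, reused, ratio=0.1):
    """Flag high-deviation (HD) tokens: those whose recomputed
    first-layer representations differ most (L2) from the reordered
    reference cache. Only these are recomputed in later layers."""
    dev = np.linalg.norm(recomputed - reused, axis=-1)  # per-token L2
    k = max(1, int(len(dev) * ratio))
    return np.argsort(dev)[-k:]  # indices of the k most-deviating tokens
```

Because observation (1) found HD tokens to be consistent across layers, identifying them once at the first layer is enough to drive selective recomputation everywhere else.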

Experiments on multiple summarization benchmarks (MultiNews, XSum, CNN/DailyMail) demonstrate that, for 5k-token inputs, SemShareKV achieves up to a 6.25× inference speedup and a 42% reduction in GPU memory usage, with negligible degradation in ROUGE‑L (≤0.2% drop) and comparable human preference scores. Ablation studies confirm that removing RoPE reduces matching accuracy by ~15%, and using uniform token retention instead of exponential decay diminishes memory savings by over 30%. The method also generalizes across the three tested LLM architectures, though models that rely on ALiBi positional encoding (e.g., MPT‑7B) require additional handling because RoPE cannot be simply omitted from the KV cache.

Limitations include the overhead of maintaining a CPU‑resident repository of reference embeddings and the cost of LSH queries for very long sequences, which could become a bottleneck in ultra‑long contexts. Moreover, the approach currently assumes that a sufficiently similar reference prompt exists; in low‑coverage domains the hit rate may drop. Outlined future work includes distributed cache management across multiple GPUs, adaptive LSH parameter tuning, unsupervised clustering of prompts for automatic reference selection, and compatibility with alternative positional encodings.

In summary, SemShareKV pioneers “semantic‑aware KV cache sharing” by combining token‑level LSH matching with RoPE‑enhanced positional awareness. It demonstrates that substantial inference acceleration and memory savings are achievable without sacrificing output quality, offering a practical pathway for deploying LLMs in production environments that routinely process long, semantically overlapping inputs.

