FreqKV: Key-Value Compression in Frequency Domain for Context Window Extension

Notice: This research summary and analysis were generated automatically using AI. For complete accuracy, please refer to the original arXiv source.

Existing key-value (KV) cache compression methods for large language models (LLMs) often rely on token eviction, which risks losing critical local information in both long prefilling and decoding scenarios. When extrapolating beyond the pretrained context length, their performance degrades sharply on long-context benchmarks. Motivated by the observation that, in the frequency domain, context information concentrates in the low-frequency components, we propose FreqKV, a parameter-free and architecture-agnostic approach that iteratively compresses the growing KV cache in the frequency domain, allowing models to process lengthy contexts efficiently. With minimal training at 8K length, FreqKV extends the context window of LLaMA-2-7B up to 256K tokens while maintaining stable perplexity. Extensive experiments across prefilling and decoding demonstrate that FreqKV enables robust context window extension and consistently outperforms existing KV cache compression methods on LLaMA-2 and LLaMA-3, highlighting its effectiveness for both understanding and generation in long contexts.


💡 Research Summary

FreqKV introduces a novel, parameter‑free method for extending the context window of large language models (LLMs) by compressing the key‑value (KV) cache in the frequency domain. The authors begin with an empirical observation: the energy of KV states in LLaMA‑2‑7B concentrates increasingly in low‑frequency components as depth grows. By applying a discrete cosine transform (DCT) along the sequence dimension, they map the KV cache into the frequency domain, retain only the low‑frequency coefficients (a configurable retention ratio γ, default 0.5), discard the high‑frequency part, and then reconstruct the compressed KV via inverse DCT (IDCT) with appropriate scaling.
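The transform-truncate-reconstruct step can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: `dct_matrix` and `compress_kv` are our names, the DCT is built explicitly as an orthonormal matrix for clarity, and the √(L/N) rescaling is our assumption about what "appropriate scaling" means.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis: rows are frequency components, so the
    # inverse transform (IDCT) is simply the transpose.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0] *= 1.0 / np.sqrt(2.0)
    return m

def compress_kv(kv, gamma=0.5):
    """Compress KV states of shape (seq_len, head_dim) along the sequence axis."""
    n = kv.shape[0]
    L = int(gamma * n)
    coeffs = dct_matrix(n) @ kv      # DCT along the sequence dimension
    low = coeffs[:L]                 # retain the L lowest-frequency coefficients
    # Reconstruct at the shorter length L; the sqrt(L/n) factor (our assumption)
    # keeps amplitudes comparable across the two orthonormal transforms.
    return dct_matrix(L).T @ low * np.sqrt(L / n)
```

With this scaling a constant sequence survives compression exactly, since all of its energy sits in the DC (k = 0) coefficient.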

The compression is performed iteratively. When the cache reaches its maximum size N (e.g., the original 4,096-token window), the KV matrices are transformed, filtered, and reconstructed, reducing the effective cache size to L = γ · N. The compressed KV is then concatenated with newly arriving tokens; once the cache fills again, the process repeats. Early "sink" tokens (S = 4 in experiments) are never compressed, preserving the most important initial context, a design motivated by the "attention sink" phenomenon reported in recent work. Because the compression operates on the KV matrices before RoPE is applied, positional embeddings are re-applied after reconstruction using the new indices, eliminating the need for position extrapolation.
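The fill-compress-refill schedule above can be sketched as a small cache manager. The class and parameter names here are ours, the stand-in `compress_fn` merely truncates rows where the paper's method would apply the DCT-based compression, and RoPE re-application after reconstruction is omitted for brevity.

```python
import numpy as np

class FreqKVCache:
    """Sketch of FreqKV's iterative compression schedule (simplified).

    The first `sink` tokens are never compressed; whenever the cache holds
    `max_n` entries, the non-sink part is shrunk to a `gamma` fraction and
    decoding continues filling the freed slots until the cache is full again.
    """
    def __init__(self, max_n, gamma=0.5, sink=4, compress_fn=None):
        self.max_n, self.gamma, self.sink = max_n, gamma, sink
        # Placeholder compressor: keep the first L rows. FreqKV instead
        # DCTs along the sequence axis and keeps low-frequency components.
        self.compress_fn = compress_fn or (lambda kv, L: kv[:L])
        self.kv = np.empty((0, 0))

    def append(self, token_kv):
        row = np.atleast_2d(token_kv)
        self.kv = row if self.kv.size == 0 else np.vstack([self.kv, row])
        if len(self.kv) >= self.max_n:
            head, tail = self.kv[:self.sink], self.kv[self.sink:]
            L = int(self.gamma * len(tail))
            self.kv = np.vstack([head, self.compress_fn(tail, L)])
        return len(self.kv)
```

For example, with `max_n=8` and `sink=2`, appending the 8th token triggers one compression: the 6 non-sink rows shrink to 3, leaving a 5-row cache with room for 3 more tokens before the next compression.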

Training is minimal: a short fine-tuning phase of 1–2 epochs on 8K-length sequences suffices to adapt the model to the compressed representation. Experiments span LLaMA-2-7B/13B and LLaMA-3-8B, evaluating both prefilling (understanding) and decoding (generation) scenarios. Results show that extending the context window up to 256K tokens incurs only a tiny perplexity increase (≈0.2–0.4), outperforming token-eviction baselines such as SnapKV, PyramidKV, and FastKV by 10–15% in perplexity. On long-context benchmarks (LongBench, RULER, Needle-in-a-Haystack), the low-frequency-only version yields substantially higher ROUGE and accuracy scores, confirming that low-frequency components encode global semantics while high-frequency components capture finer, less critical details. In generation tests (LongGenBench), FreqKV maintains token quality while halving memory consumption compared to the uncompressed baseline.

Complexity analysis indicates that each compression step costs O(N log N) due to the DCT/IDCT operations, and it is triggered only once every N − L tokens, making the overhead negligible relative to the overall self‑attention cost. The method requires no additional parameters, no architectural changes, and works with any decoder‑only Transformer that uses KV caching, making it broadly applicable.
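A back-of-envelope check makes the amortized claim concrete. Assuming N = 4096 and γ = 0.5 (the constant factors in the N log N transform cost are of course implementation-dependent):

```python
import math

N, gamma = 4096, 0.5
L = int(gamma * N)                 # cache size after compression: 2048
step_cost = N * math.log2(N)       # DCT/IDCT work per compression, up to constants
tokens_between = N - L             # compression fires once per N - L new tokens
amortized = step_cost / tokens_between
print(amortized)                   # 24.0 "operations" per token, negligible next
                                   # to attention's O(N) work per decoded token
```
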

The paper also discusses potential extensions: adaptive retention ratios, alternative transforms (FFT, wavelets), and dynamic scheduling of compression based on token importance. By grounding the approach in the Frequency Principle—low frequencies converge faster and dominate learned representations—the authors provide both theoretical justification and practical evidence that frequency‑domain KV compression is a viable path toward scalable, long‑context LLM inference.

