RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse
The increasing complexity of AI tasks has shifted the paradigm from monolithic models toward multi-agent large language model (LLM) systems. However, these collaborative architectures introduce a critical bottleneck: redundant prefill computation for shared content generated by previous agents, which significantly increases KV cache memory usage and time-to-first-token (TTFT). While various KV cache methods have been proposed to mitigate prefill redundancy, they either fail to maintain accuracy on agent-generated outputs or exhibit low reuse rates due to rigid constraints. We present RelayCaching, a training-free inference method that directly reuses decoding-phase KV caches from previous agents in subsequent prefill phases. Our key insight is that KV caches for identical content are highly consistent across phases, while prefix-induced deviations are sparse and localized within a limited range of layers and token positions. By selectively recomputing KV caches at these positions, RelayCaching preserves model accuracy with minimal overhead, yielding a superior accuracy-efficiency trade-off over existing methods. Experiments on diverse collaborative LLM tasks spanning mathematical reasoning, general knowledge, and code generation demonstrate that RelayCaching achieves over 80% KV cache reuse and reduces TTFT by up to $4.7\times$ compared to the standard pipeline, all with negligible accuracy degradation.
💡 Research Summary
RelayCaching addresses a critical inefficiency in multi‑agent large language model (LLM) pipelines: the repeated pre‑fill of identical text that appears as input to downstream agents but with different preceding contexts (prefix variation). Traditional KV‑cache optimizations fall into two categories. Prefix caching reuses KV entries only when the exact prefix aligns, which rarely holds in dynamic agent‑generated workflows. Pre‑computed caching assumes static, offline‑encodable content (e.g., documents in retrieval‑augmented generation) and cannot handle on‑the‑fly generated outputs. Consequently, existing systems fall back to full pre‑fill for each agent, leading to quadratic growth of KV‑cache memory and time‑to‑first‑token (TTFT) with the number of interaction turns.
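The prefix-caching limitation described above can be made concrete with a small sketch. The helper below is illustrative (the function and cache-index structure are assumptions, not part of any real serving system): exact-prefix reuse finds zero reusable tokens as soon as a downstream agent prepends its own prompt to the shared text.

```python
# Hypothetical sketch of exact-prefix KV-cache lookup, the condition
# traditional prefix caching requires (names here are illustrative).
def longest_cached_prefix(token_ids, cache_index):
    """Return the number of leading tokens whose KV entries can be reused.

    cache_index maps a tuple of token ids to a stored KV-cache handle;
    reuse is only valid for an exact leading match, so it stops at the
    first position where the prefixes diverge."""
    for n in range(len(token_ids), 0, -1):
        if tuple(token_ids[:n]) in cache_index:
            return n
    return 0

# Agent B receives Agent A's output, but behind a different system prompt,
# so the shared text never aligns as a prefix and nothing is reused:
cache_index = {(1, 2, 3, 4): "kv_handle"}   # cached: prompt_A + shared text
query = [9, 9, 3, 4]                        # prompt_B + the same shared text
print(longest_cached_prefix(query, cache_index))  # 0 -> full prefill again
```

This is why dynamic agent workflows fall back to full prefill: the shared content sits mid-sequence, not at the prefix.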
The authors first conduct a systematic empirical study comparing KV caches obtained during decoding (when a token is generated with a shifted prefix) with those obtained by a full pre‑fill of the same token sequence. Three key observations emerge:
- Macro‑level alignment – Across layers and tokens, decoding KV caches retain high cosine similarity (≈0.9) with full‑pre‑fill KV caches. Keys are almost identical in direction and magnitude; values differ mainly in direction, making value‑cosine similarity the most informative deviation metric.
- U‑shaped layer‑wise deviation – Middle layers exhibit the lowest value‑cosine similarity, while shallow and deep layers remain relatively stable. An oracle experiment that substitutes decoding KV with full‑pre‑fill KV for different layer ranges shows that correcting the middle‑layer range yields the steepest recovery in downstream similarity, confirming that these layers dominate generation quality.
- Sparse token‑wise deviation with inter‑layer correlation – Only a small fraction of token positions show large deviations; these high‑deviation tokens tend to persist across adjacent layers, as evidenced by a rapidly rising Spearman rank correlation. This sparsity suggests that selective rectification can be highly effective.
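The deviation metric behind these observations can be sketched in a few lines. The code below is a minimal illustration, not the paper's implementation: it computes per-token value-cosine similarity between a decoding-phase value cache and a full-prefill one, then flags the sparse positions below a threshold (the 0.8 cutoff and tensor shapes are assumptions).

```python
import numpy as np

def value_cosine_similarity(v_decode, v_prefill):
    """Per-token cosine similarity between decoding-phase and full-prefill
    value caches for one layer; both arrays have shape (tokens, head_dim)."""
    num = (v_decode * v_prefill).sum(-1)
    den = np.linalg.norm(v_decode, axis=-1) * np.linalg.norm(v_prefill, axis=-1)
    return num / np.maximum(den, 1e-12)

def high_deviation_tokens(v_decode, v_prefill, threshold=0.8):
    """Indices of the sparse token positions whose value cache deviates
    enough to warrant selective recomputation (threshold is illustrative)."""
    sim = value_cosine_similarity(v_decode, v_prefill)
    return np.where(sim < threshold)[0]

# Toy demo: caches agree everywhere except two strongly perturbed tokens.
rng = np.random.default_rng(0)
v_prefill = rng.normal(size=(16, 8))
v_decode = v_prefill.copy()
v_decode[[3, 10]] *= -1.0        # flip direction -> cosine similarity of -1
print(high_deviation_tokens(v_decode, v_prefill))  # -> token indices 3 and 10
```

Because high-deviation positions persist across adjacent layers, a mask like this computed at one layer is a useful predictor for its neighbors, which is what makes selective rectification cheap.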
Guided by these findings, RelayCaching is introduced as a training‑free inference technique that reuses decoding KV caches in subsequent pre‑fill phases, while selectively recomputing only the portions that cause significant deviation. The method consists of two components:
- Layer‑range profiler – Analyzes the U‑shaped similarity profile to automatically select a critical layer interval (
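One plausible reading of the layer-range profiler is a sliding-window search over the U-shaped per-layer similarity profile: pick the contiguous band of layers with the lowest mean value-cosine similarity and recompute only those. The sketch below is a hedged guess at that behavior; the profile values, window width, and function name are all illustrative, not taken from the paper.

```python
# Assumed sketch of a layer-range profiler: given a U-shaped per-layer
# similarity profile, select the contiguous window of layers with the
# lowest mean value-cosine similarity as the critical recompute range.
def critical_layer_range(layer_sims, width):
    """Return (start, end) layer indices, inclusive, of the `width`-layer
    window that minimizes mean similarity."""
    best_start = min(
        range(len(layer_sims) - width + 1),
        key=lambda s: sum(layer_sims[s:s + width]) / width,
    )
    return best_start, best_start + width - 1

# Toy U-shaped profile over 12 layers: the dip sits in the middle layers.
profile = [0.97, 0.95, 0.90, 0.82, 0.74, 0.70,
           0.71, 0.78, 0.88, 0.93, 0.96, 0.97]
print(critical_layer_range(profile, 4))  # (4, 7): the middle-layer dip
```

Restricting recomputation to this band matches the oracle finding that correcting the middle layers yields the steepest quality recovery, while shallow and deep layers can keep their reused decoding-phase caches.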