LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents
Role specialization in multi-LLM agent systems is often realized via multi-LoRA, where agents share a pretrained backbone and differ only through lightweight adapters. Despite sharing base model weights, each agent independently builds and stores its own KV cache for the same long, tool-augmented trajectories, incurring substantial memory and compute overhead. Existing KV cache sharing methods largely overlook this multi-LoRA setting. We observe that, across agents, cache differences are dominated by adapter outputs, while activations from the shared pretrained backbone remain highly similar. Based on this observation, we propose LRAgent, a KV cache sharing framework for multi-LoRA agents that decomposes the cache into a shared base component from the pretrained weights and an adapter-dependent component from LoRA weights. LRAgent reduces memory overhead by sharing the base component and storing the adapter component in its inherent low-rank form. It further reduces compute overhead in shared-$A$ multi-LoRA architectures, where the low-rank cache itself can also be shared, avoiding redundant computation for contexts already processed by other agents. To efficiently reconstruct adapter contributions at runtime, we introduce Flash-LoRA-Attention, a kernel that reorders attention computation to avoid materializing the low-rank cache to full dimension. LRAgent achieves throughput and time-to-first-token latency close to fully shared caching, while preserving accuracy near the non-shared caching baseline across agentic question-answering benchmarks.
💡 Research Summary
The paper tackles a practical bottleneck in multi‑LLM agent systems that employ multi‑LoRA fine‑tuning. While agents share a large pretrained backbone, each role‑specific LoRA adapter forces every agent to maintain its own key‑value (KV) cache during inference. For long, tool‑augmented trajectories this leads to substantial memory consumption and redundant computation, as the same context is processed repeatedly across agents.
The authors first empirically show that, for identical input sequences, the KV values generated by the frozen backbone (the “base cache”) are highly similar across agents (average cosine similarity > 0.95). In contrast, the contributions from the LoRA adapters (the “adapter output”) are low‑magnitude but largely decorrelated, accounting for most of the inter‑agent cache variance. This observation suggests that sharing only the base cache while handling the adapter‑specific part separately could preserve accuracy while cutting memory usage.
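This observation can be reproduced in miniature. The sketch below builds two agents' value caches from a shared input and frozen base projection, differing only in their LoRA factors, and compares average row-wise cosine similarity. All dimensions, scales, and names are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T = 64, 4, 16                # hidden dim, LoRA rank, sequence length (illustrative)

X = rng.standard_normal((T, d))                   # shared context activations
W0 = rng.standard_normal((d, d)) / np.sqrt(d)     # frozen base value projection

def lora_pair():
    # Per-agent LoRA factors; B is scaled down, mimicking the small-magnitude
    # adapter outputs reported in the paper.
    A = rng.standard_normal((d, r)) / np.sqrt(d)
    B = 0.1 * rng.standard_normal((r, d)) / np.sqrt(r)
    return A, B

A1, B1 = lora_pair()
A2, B2 = lora_pair()

V1 = X @ W0 + X @ A1 @ B1          # agent 1 value cache
V2 = X @ W0 + X @ A2 @ B2          # agent 2 value cache

def avg_cos(U, V):
    """Average cosine similarity between corresponding rows."""
    num = (U * V).sum(axis=1)
    den = np.linalg.norm(U, axis=1) * np.linalg.norm(V, axis=1)
    return float((num / den).mean())

# Full caches are dominated by the identical base component, so they are
# highly similar; the adapter outputs alone are largely decorrelated.
print(avg_cos(V1, V2))                       # close to 1
print(avg_cos(X @ A1 @ B1, X @ A2 @ B2))     # near 0
```

The same base term dominating both caches is exactly why sharing it is safe, while the decorrelated adapter terms must be kept per agent.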
Building on this, LRAgent decomposes the value cache into two components:
- Base Cache – computed solely from the pretrained weight matrix $W_0$. This component is identical (or nearly so) for all agents and can be stored once and reused.
- LR Cache – the intermediate activation after the LoRA down-projection, $X A_i$, which lives in a low-rank space (rank $r \ll d$). The full adapter contribution is recovered at runtime by multiplying this low-rank cache with the up-projection matrix $B_i$.
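The decomposition and its memory accounting can be sketched in a few lines. Names and dimensions below are illustrative assumptions; the reconstruction identity itself follows directly from $V_i = X W_0 + X A_i B_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, T, N = 64, 8, 512, 3   # hidden dim, LoRA rank, context length, number of agents

X = rng.standard_normal((T, d))                     # shared trajectory activations
W0 = rng.standard_normal((d, d))                    # frozen base value projection
A = [rng.standard_normal((d, r)) for _ in range(N)] # per-agent down-projections
B = [rng.standard_normal((r, d)) for _ in range(N)] # per-agent up-projections

base_cache = X @ W0                     # (T, d): stored once, shared by all agents
lr_caches = [X @ A_i for A_i in A]      # one small (T, r) cache per agent

# Runtime reconstruction of agent 2's full value cache from the two parts:
V_2 = base_cache + lr_caches[2] @ B[2]
assert np.allclose(V_2, X @ W0 + X @ A[2] @ B[2])

# Per-token cache floats: non-shared stores N*d, the decomposition stores d + N*r.
print(N * d, d + N * r)   # prints "192 88" for d=64, r=8, N=3
```

Because $r \ll d$, the $N$ low-rank caches add little on top of the single shared base cache.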
Two sharing schemes are introduced:
- BaseShared: All agents share the base cache; each agent keeps its own LR cache. This reduces total KV memory from $N$ full caches (the non-shared baseline) to roughly one full base cache plus $N$ low-rank caches, where $N$ is the number of agents.
- BaseLRShared: Recent "Shared-$A$" multi-LoRA designs show that the down-projection matrix $A$ can be common across tasks without hurting performance. By enforcing a shared $A$, the LR cache itself becomes identical for all agents, allowing a single LR cache to be reused. Consequently, both memory and the amount of recomputed matrix-vector work are further reduced.
To avoid the naïve cost of materializing the full‑dimensional adapter contribution at each forward pass, the authors design Flash‑LoRA‑Attention, a custom kernel built on top of FlashAttention. The kernel reorders the attention computation so that the low‑rank cache is multiplied by the up‑projection on‑the‑fly, never expanding to full dimension. This eliminates extra memory traffic and leverages the highly optimized block‑sparse kernels of FlashAttention, yielding near‑optimal GPU utilization.
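The algebraic identity the kernel exploits can be checked directly: since attention is linear in the values, the attention weights can be applied to the $(T, r)$ LR cache first and the up-projection $B$ applied afterwards to a vector of size $r$, so the $(T, d)$ adapter contribution is never materialized. The sketch below verifies this reordering in numpy (a model of the computation, not the actual fused kernel).

```python
import numpy as np

rng = np.random.default_rng(3)
d, r, T = 64, 8, 256      # hidden dim, LoRA rank, cached context length (illustrative)

Q = rng.standard_normal((1, d))        # single decode-step query
K = rng.standard_normal((T, d))        # key cache
V_base = rng.standard_normal((T, d))   # shared base value cache
L = rng.standard_normal((T, r))        # shared low-rank (LR) value cache
B = rng.standard_normal((r, d))        # this agent's up-projection

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

P = softmax(Q @ K.T / np.sqrt(d))      # (1, T) attention weights

# Naive: materialize the full (T, d) adapter contribution before attending.
out_naive = P @ (V_base + L @ B)

# Reordered: attend over the (T, r) LR cache, then up-project a (1, r) vector.
out_reordered = P @ V_base + (P @ L) @ B

assert np.allclose(out_naive, out_reordered)
```

In the fused kernel, the `P @ L` term is accumulated block by block alongside the ordinary FlashAttention output, so the reordering costs no extra passes over the cache.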
Experiments are conducted on LLaMA‑3.1‑8B‑Instruct and MiniSTR‑8B‑Instruct backbones, with three role‑specific agents (planning, action, reflection) fine‑tuned via LoRA on a publicly released trajectory dataset. Benchmarks include HotpotQA, GSM‑8K, and other long‑context tasks. Results show:
- Memory reduction: BaseShared cuts KV memory by ~38 %; BaseLRShared by ~55 % compared to the naïve per‑agent cache.
- Throughput & latency: Tokens-per-second throughput is within 0.92–0.97× of a fully shared cache baseline, and first-token latency is virtually unchanged.
- Accuracy: The full‑cache (non‑shared) and the shared schemes differ by less than 0.3 % absolute on most metrics. The Shared‑A variant (BaseLRShared) even matches the non‑shared baseline within 0.1 % due to the reduced variance in adapter outputs.
Additional analyses confirm that key‑cache similarity remains > 0.98 across agents, indicating that focusing on the value cache is sufficient. Ablations on applying LoRA to query/key projections show similar trends, but the paper’s primary implementation applies LoRA only to the value projection for best accuracy.
In summary, LRAgent provides a principled, architecture‑aware solution for KV cache sharing in multi‑LoRA agent systems. By decoupling the cache, exploiting the inherent low‑rank nature of LoRA adapters, and introducing an efficient attention kernel, it achieves substantial memory and compute savings while preserving the performance of individually cached agents. The code and models are released publicly, paving the way for broader adoption in complex, multi‑agent LLM applications.