Efficient LLM Inference with Activation Checkpointing and Hybrid Caching


Recent large language models (LLMs) with enormous model sizes use many GPUs to meet memory capacity requirements, incurring substantial costs for token generation. To provide cost-effective LLM inference with relaxed latency constraints, extensive research has focused on expanding GPU memory by leveraging host memory. However, LLM inference engines that utilize host memory often suffer from underutilized GPU compute units, as a considerable portion of inference time is spent loading the model onto the GPU over the host-GPU interconnect. To tackle these challenges of host-memory offloading for LLMs, we introduce HybridServe, an LLM inference system with activation checkpointing based on activation caching. The activation cache stores activation checkpoints generated during intermediate inference stages, allowing fast recomputation of the KV cache while model parameters are transferred from host memory to the GPU. Unlike conventional methods that recompute the KV cache from scratch using token IDs, the activation cache allows bypassing projection and FFN operations. To balance the activation recomputation and parameter loading overheads, this study proposes a KV-activation hybrid caching scheme that finds the best ratio of key-value and activation caches to adjust the recomputation time. Our system achieves a 2.19x throughput improvement over the state-of-the-art prior work for offloading both model weights and KV cache.


💡 Research Summary

Large language models (LLMs) have grown to tens of billions of parameters, and their KV (key‑value) cache can easily exceed the memory capacity of a single high‑end GPU. Prior work such as FlexGen mitigates this by offloading both model weights and the KV cache to host memory and streaming the needed data over PCIe. While this enables a single‑GPU deployment, the KV cache transfer volume grows linearly with batch size and sequence length, quickly saturating the PCIe bandwidth. As a result, the GPU spends most of its time idle, with utilization often below 10 % for realistic batch sizes (e.g., 128).
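The scale of this bottleneck is easy to see with a back-of-envelope calculation. The sketch below uses assumed OPT-30B-like dimensions (48 layers, hidden size 7168, fp16) and a nominal PCIe 4.0 x16 bandwidth; these are illustrative numbers, not figures reported in the paper.

```python
# Rough KV-cache transfer estimate (illustrative, not from the paper).
# Bytes per cached token = 2 (K and V) * n_layers * hidden_dim * dtype_bytes.
def kv_bytes_per_token(n_layers: int, hidden_dim: int, dtype_bytes: int = 2) -> int:
    return 2 * n_layers * hidden_dim * dtype_bytes

per_token = kv_bytes_per_token(48, 7168)      # ~1.3 MiB per token (OPT-30B-like)
batch, seq_len = 128, 512
total_gb = per_token * batch * seq_len / 1e9  # ~90 GB of KV state

pcie_gbps = 32                                # nominal PCIe 4.0 x16 bandwidth
transfer_s = total_gb / pcie_gbps             # ~2.8 s just to stream the KV cache
print(f"KV cache: {total_gb:.1f} GB, transfer: {transfer_s:.2f} s")
```

At these sizes a single pass over the KV cache takes seconds of pure PCIe time, during which the GPU has little to compute, which is consistent with the sub-10% utilization cited above.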

The paper introduces HybridServe, a system that tackles the communication‑compute imbalance through two complementary techniques: (1) activation checkpointing with an activation cache, and (2) a KV‑Activation hybrid caching scheme that dynamically balances the proportion of KV entries and activation checkpoints stored in host memory.

Activation checkpointing stores the intermediate activations A_i of each transformer decoder layer on the host. When a new token is generated, the system recomputes the required Q, K, and V matrices directly from these activations, bypassing the projection and feed‑forward network (FFN) stages that are necessary in a naïve token‑ID‑only recomputation. This reduces the FLOP count for KV reconstruction by more than 50 % and halves the amount of data that must be transferred, while preserving exact numerical results (no approximation).
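The key point is that the cached activation is exactly the tensor feeding the attention block, so rebuilding K and V costs one matmul each. Below is a minimal NumPy sketch of that cheap path; the weight names and toy dimensions are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq = 64, 8                            # toy hidden size and sequence length
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def qkv_from_activation(A: np.ndarray):
    """Rebuild Q, K, V from a cached layer-input activation A (seq x d).

    Because A already is the attention-block input, recomputation is one
    matmul per matrix. A from-scratch replay from token IDs would also
    have to run the output projection and FFN of every preceding stage,
    which this path skips entirely -- and the result is bit-identical,
    since no approximation is involved.
    """
    return A @ W_q, A @ W_k, A @ W_v

A = rng.standard_normal((seq, d))         # stand-in for a host-cached checkpoint
Q, K, V = qkv_from_activation(A)
```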

Because keeping only activations still incurs recomputation cost that grows with batch size and sequence length, HybridServe does not rely exclusively on the activation cache. Instead, it stores a mixture of KV entries (as traditional key‑value tensors) and activation checkpoints in host memory. The GPU memory holds only activation checkpoints, which are smaller than full KV blocks. An algorithm searches for the optimal KV‑to‑activation ratio given the current batch size, prompt length, and GPU memory budget. The scheduler then packs incoming requests into mini‑batches that respect this optimal ratio, ensuring that data transfer time and recomputation time are balanced.
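The ratio search can be pictured as a one-dimensional optimization: a larger activation fraction shifts work from the PCIe link to the GPU, and the best split is where the overlapped transfer and recomputation times meet. The toy search below is a stand-in for HybridServe's algorithm under a deliberately simplified cost model (linear times, full overlap); the function name and parameters are illustrative assumptions.

```python
def best_activation_ratio(kv_transfer_s: float, act_recompute_s: float,
                          weight_load_s: float, steps: int = 100):
    """Grid-search the fraction r of cached context kept as activations
    (recomputed on GPU) vs. KV tensors (streamed over PCIe), minimizing
    the per-step time max(transfer, compute) under full overlap.

    kv_transfer_s:   time to stream the entire context as KV tensors
    act_recompute_s: time to recompute the entire context from activations
    weight_load_s:   time to load model weights from host memory
    """
    best_r, best_t = 0.0, float("inf")
    for i in range(steps + 1):
        r = i / steps
        transfer = weight_load_s + (1 - r) * kv_transfer_s  # PCIe side
        compute = r * act_recompute_s                       # GPU side
        t = max(transfer, compute)                          # overlapped step time
        if t < best_t:
            best_r, best_t = r, t
    return best_r, best_t

# Example: KV streaming (4 s) is slower than recomputation (2 s), so the
# search keeps most of the context as activations.
r, t = best_activation_ratio(kv_transfer_s=4.0, act_recompute_s=2.0,
                             weight_load_s=1.0)
```

In this simplified model the optimum sits where `weight_load_s + (1 - r) * kv_transfer_s = r * act_recompute_s`; the real system additionally accounts for batch packing and the GPU memory budget, as described above.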

The authors implemented the approach on top of vLLM 0.4.3, extending its host‑offloading capabilities. Experiments were conducted on four variants of the OPT family (30B, 66B, 175B, etc.) under a variety of batch sizes (1–256) and prompt lengths (128–512 tokens). HybridServe achieved a geometric‑mean throughput improvement of 2.19× over FlexGen and 1.35× over a system that uses only the activation cache. GPU utilization rose from roughly 7 % to over 15 % in the high‑batch regime, demonstrating that the PCIe bottleneck was effectively mitigated. Importantly, the method incurs no loss of model accuracy because it never discards or approximates data; it merely reorganizes how the data are stored and recomputed.

In summary, HybridServe provides a practical solution for cost‑effective, high‑throughput LLM inference on a single GPU by (i) reducing the amount of data that must be moved between host and device, (ii) lowering the recomputation workload through activation checkpointing, and (iii) automatically finding the best mix of KV and activation caches for any workload. The work opens avenues for further research on multi‑GPU scaling, integration with faster interconnects (e.g., NVLink), and compression of activation checkpoints to push throughput even higher while keeping latency within acceptable bounds for throughput‑oriented applications.
