PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Attention efficiency is critical to large language model (LLM) inference. While prior advances optimize attention execution for individual requests (e.g., FlashAttention), production LLM serving relies on batching requests with highly heterogeneous sequence lengths to achieve high serving throughput. This mismatch induces severe computation and I/O imbalance, exacerbates stragglers, and underutilizes GPU resources. We present PackInfer, a kernel-level attention framework that enables compute- and I/O-aware execution for heterogeneous batched inference. PackInfer orchestrates batched requests into load-balanced execution groups, effectively saturating GPU utilization by packing multiple requests into unified kernel launches. By constructing attention kernels directly over packed query-key regions, PackInfer eliminates redundant computation and balances thread-block execution. It then incorporates I/O-aware grouping that co-locates shared-prefix requests and reorganizes KV caches into group-contiguous layouts, reducing memory fragmentation and redundant data movement as generation evolves. Evaluations on real-world workloads show that PackInfer reduces inference latency by 13.0-20.1% and improves throughput by 20% compared to the state-of-the-art FlashAttention.


💡 Research Summary

PackInfer addresses a critical inefficiency in modern large‑language‑model (LLM) serving: the mismatch between heterogeneous request lengths in a batch and the fixed‑size tiling strategies used by state‑of‑the‑art attention kernels such as FlashAttention. In production systems, short prompts (often only a few dozen tokens) are mixed with long‑context requests (hundreds or thousands of tokens). When these are padded to a common block size, most compute tiles contain padding or masked entries, leading to severe load imbalance across streaming multiprocessors (SMs). Short requests finish quickly, leaving SMs idle while long requests dominate the critical path, a phenomenon the authors term the “straggler effect.”

PackInfer introduces a kernel‑level framework that simultaneously balances computation and I/O. The system first groups incoming requests into length‑balanced execution groups. A lightweight greedy bin‑packing algorithm, guided by a pre‑profiled group capacity C (the maximum total token length a group can handle efficiently), sorts requests by descending length and assigns each to the currently least‑loaded group, creating a new group only when capacity would be exceeded. This yields a set of groups {S₁,…,S_G} where each group’s total length L(S_g) is roughly equal, minimizing the variance that causes SM idle time.
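The grouping step described above can be sketched in a few lines. This is a hypothetical illustration of the greedy, capacity-bounded bin-packing strategy; the function name, heap-based least-loaded lookup, and return format are assumptions for the sketch, not the paper's actual API.

```python
import heapq

def pack_groups(lengths, capacity):
    """Assign request lengths (descending) to the least-loaded group whose
    total stays within `capacity`; open a new group when none fits.

    Sketch of PackInfer's length-balanced grouping, assuming the
    pre-profiled group capacity C is given as `capacity`.
    """
    heap = []    # min-heap of (group_load, group_index)
    groups = []  # groups[i] holds the request lengths assigned to group i
    for length in sorted(lengths, reverse=True):
        if heap and heap[0][0] + length <= capacity:
            # The least-loaded group can absorb this request.
            load, idx = heapq.heappop(heap)
            groups[idx].append(length)
            heapq.heappush(heap, (load + length, idx))
        else:
            # Capacity would be exceeded everywhere: open a new group.
            groups.append([length])
            heapq.heappush(heap, (length, len(groups) - 1))
    return groups
```

Because requests are sorted by descending length first, long requests seed groups early and short ones fill the remaining slack, which keeps per-group totals L(S_g) close to one another.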

Within each group, PackInfer constructs a packed attention kernel that operates directly on the union of all valid query‑key regions, rather than on per‑request padded tiles. By doing so, the kernel’s tiled capacity (G · T²) is fully utilized, and the effective utilization η = Σ_i L_i² / (G · T²) is dramatically higher than in naïve per‑request execution.
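To make the utilization formula concrete, the snippet below evaluates η numerically. It assumes (as one reading of the summary) that the denominator counts the T×T tiles launched over the packed query-key regions; the helper name and this tile-count interpretation are illustrative, not the paper's exact definitions.

```python
import math

def utilization(lengths, tile):
    """Effective utilization eta = sum_i L_i^2 / (num_tiles * T^2),
    where num_tiles covers each request's L_i x L_i query-key region
    with T x T tiles (an assumed interpretation of G * T^2)."""
    useful = sum(L * L for L in lengths)                    # valid query-key pairs
    tiles = sum(math.ceil(L / tile) ** 2 for L in lengths)  # T x T tiles launched
    return useful / (tiles * tile * tile)
```

For example, a request whose length is an exact multiple of the tile size wastes nothing, while a request half a tile long leaves three quarters of its single tile masked.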

The second pillar is I/O‑aware packing of the key‑value (KV) cache. The authors observe that many requests share a common prefix (e.g., the same system prompt). PackInfer builds a trie over the KV cache of a group, extracts shared prefixes P_k, copies them once into a contiguous buffer B_g, and then appends each request’s unique suffix Q_i,k. Offsets O_g record where each request’s data resides within the packed buffer, so the kernel can address the group‑contiguous layout directly.
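The prefix-sharing layout can be sketched as follows. This is a deliberately simplified, single-shared-prefix version operating on token lists rather than KV blocks; the real system builds a trie over the group's KV cache, so the function name, the `(start, length)` offset encoding, and the flat list buffer are all assumptions for illustration.

```python
def pack_shared_prefix(sequences):
    """Store the group's longest common prefix once, append each request's
    unique suffix, and record per-request (suffix_start, suffix_len) offsets.

    Simplified sketch of the B_g / O_g construction described above.
    """
    # Longest prefix common to every sequence in the group.
    prefix = []
    for column in zip(*sequences):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])

    buffer = list(prefix)  # B_g: group-contiguous buffer, prefix stored once
    offsets = []           # O_g: where each request's suffix lives in B_g
    for seq in sequences:
        suffix = seq[len(prefix):]
        offsets.append((len(buffer), len(suffix)))
        buffer.extend(suffix)
    return buffer, len(prefix), offsets
```

Deduplicating the prefix this way means the shared tokens' KV entries are read from memory once per group rather than once per request, which is where the I/O savings come from.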

