Parallelization Strategies for Dense LLM Deployment: Navigating Through Application-Specific Tradeoffs and Bottlenecks

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Breakthroughs in the generative AI domain have fueled an explosion of large language model (LLM)-powered applications, whose workloads fundamentally consist of sequences of inferences through transformer architectures. Within this rapidly expanding ecosystem, dense LLMs (those that activate all model parameters for each generated token) form the foundation for advanced expert-based variants. Dense models continue to dominate because of their strong generalization ability, scalability, ease of fine-tuning, and versatility across diverse tasks. In LLM inference systems, performance is mainly characterized by latency, response time, and throughput (i.e., tokens generated per unit of time). Latency and throughput are inherently coupled: optimizing for one often comes at the expense of the other. Moreover, batching strategies and parallelism configurations, which become essential once dense model parameters exceed device memory capacity, can significantly affect both latency and overall system throughput. This paper (i) investigates the workloads of two representative dense LLMs, Llama-3.1-70B and Llama-3.1-405B, focusing in particular on intra-node parallelization schemes, (ii) analyzes how input characteristics, batching, and parallelism strategies influence latency flexibility and the latency-throughput tradeoff, and (iii) identifies key performance bottlenecks that inform design choices for meeting service-level agreements (SLAs) and sustaining inference quality. Our empirical evaluations reveal that Tensor Parallelism (TP) improves latency objectives, while Pipeline Parallelism (PP) is better suited for throughput-oriented applications. We further show that hybrid configurations, obtained by controlling the TP and PP degrees, provide fine-grained control over the latency-throughput interplay.


💡 Research Summary

The paper investigates intra‑node parallelization strategies for deploying dense large language models (LLMs) that activate all parameters for each generated token. The authors focus on two representative models, Llama‑3.1‑70B and Llama‑3.1‑405B, and evaluate how Tensor Parallelism (TP), Pipeline Parallelism (PP), and hybrid combinations affect the fundamental trade‑off between latency (response time) and throughput (tokens per second).

First, the authors describe the memory and compute challenges of dense LLM inference. Even with aggressive FP8 quantization, the 405‑billion‑parameter model requires roughly 405 GB of memory for weights alone, far exceeding the capacity of a single GPU (e.g., AMD MI250, NVIDIA H100). Consequently, model parallelism is mandatory to distribute both weights and the growing KV cache across multiple accelerators.
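The weight figure follows directly from FP8 storing roughly one byte per parameter. A back-of-envelope sketch of the memory pressure, using the publicly documented Llama-3.1-405B architecture (126 layers, 8 grouped-query KV heads, head dimension 128) rather than anything specific to the paper's simulator:

```python
# Back-of-envelope memory estimate for dense LLM serving.
# Formulas are generic; architecture numbers are from the public
# Llama-3.1-405B model card, not from the paper.

def weight_memory_gb(n_params_billions: float, bytes_per_param: float) -> float:
    """Weights only: billions of parameters x bytes per parameter = GB."""
    return n_params_billions * bytes_per_param

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 (K and V) x layers x KV heads x head_dim x seq x batch."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / 1e9

# Llama-3.1-405B in FP8: ~405 GB of weights alone.
w = weight_memory_gb(405, 1.0)
# KV cache (FP16) for an 8K context at batch size 8 adds tens of GB more.
kv = kv_cache_gb(n_layers=126, n_kv_heads=8, head_dim=128,
                 seq_len=8192, batch=8)
print(f"weights ~{w:.0f} GB, KV cache ~{kv:.1f} GB")  # ~405 GB, ~33.8 GB
```

Since no single accelerator offers hundreds of GB of HBM, the weights alone force a multi-GPU split, and the KV cache term grows linearly with both batch size and context length.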

TP shards the internal matrix multiplications of each transformer layer across GPUs, allowing all devices to work on the same layer simultaneously and then aggregating partial results via an All‑Reduce operation. This approach increases per‑token compute parallelism, reduces the critical path of both the pre‑fill and decode phases, and therefore improves latency. The downside is the communication overhead of All‑Reduce and reduced batch‑size flexibility because each token must wait for the reduction to complete.
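The sharding and reduction can be illustrated with a toy NumPy model of one MLP block, using the familiar Megatron-style column/row split; the shapes and the 2-way split below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

# Toy sketch of tensor parallelism for one MLP block: W1 is split by
# columns, W2 by rows, and partial outputs are summed (the All-Reduce).
rng = np.random.default_rng(0)
d_model, d_ff, tp = 8, 16, 2

x = rng.standard_normal((1, d_model))       # one token's activations
W1 = rng.standard_normal((d_model, d_ff))   # up-projection
W2 = rng.standard_normal((d_ff, d_model))   # down-projection

# Each "GPU" holds a column shard of W1 and the matching row shard of W2.
W1_shards = np.split(W1, tp, axis=1)
W2_shards = np.split(W2, tp, axis=0)

partials = []
for rank in range(tp):
    h = np.maximum(x @ W1_shards[rank], 0)  # local matmul + ReLU
    partials.append(h @ W2_shards[rank])    # local partial output

# All-Reduce: sum partials across ranks. This is the step that puts
# interconnect bandwidth on the critical path of every token.
y_tp = sum(partials)

y_ref = np.maximum(x @ W1, 0) @ W2          # single-device reference
assert np.allclose(y_tp, y_ref)             # sharded result matches
```

The equivalence holds because the column split of W1 partitions the hidden dimension, so each rank's partial product through its row shard of W2 is one term of the full sum; the All-Reduce merely adds them back together.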

PP, by contrast, partitions whole transformer blocks along the depth dimension, assigning each GPU a distinct pipeline stage. Multiple batches can be in flight simultaneously, each occupying a different stage, which dramatically raises hardware utilization and overall throughput. However, the pipeline introduces stage‑wise waiting times, so the end‑to‑end latency for a single request typically grows with the number of pipeline stages.
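The latency/throughput behavior of the pipeline can be captured by a standard fill-and-drain model; the numbers below are illustrative assumptions, not measurements from the paper:

```python
# Analytic fill-drain model of pipeline parallelism. With S stages and
# M microbatches, draining everything takes (S + M - 1) steps: a single
# request's latency grows with S, while steady-state throughput
# approaches one microbatch per step.

def pipeline_metrics(stages: int, work_ms: float, comm_ms: float,
                     microbatches: int):
    """Each stage runs work_ms/stages of compute, plus comm_ms of
    inter-stage transfer per microbatch when there is more than one
    stage (ideal pipeline, no stalls)."""
    step = work_ms / stages + (comm_ms if stages > 1 else 0.0)
    latency_ms = stages * step                     # one request, empty pipe
    total_ms = (stages + microbatches - 1) * step  # fill + steady + drain
    throughput = microbatches / total_ms * 1000    # microbatches per second
    return latency_ms, throughput

lat1, thr1 = pipeline_metrics(stages=1, work_ms=40.0, comm_ms=2.0, microbatches=16)
lat4, thr4 = pipeline_metrics(stages=4, work_ms=40.0, comm_ms=2.0, microbatches=16)
print(f"1 stage:  {lat1:.0f} ms/request, {thr1:.1f} microbatches/s")
print(f"4 stages: {lat4:.0f} ms/request, {thr4:.1f} microbatches/s")
```

Even in this idealized model, the 4-stage pipeline raises throughput severalfold while per-request latency increases, matching the stage-wise waiting effect described above.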

The authors built an in‑house simulator calibrated against silicon measurements on an 8‑GPU AMD MI250 node (Infinity Fabric interconnect). The simulator reproduces real‑world latencies within a 3 % error margin, enabling systematic exploration of many parallelism configurations. Experiments cover TP degrees of 2 and 4, PP degrees of 2 and 4, and hybrid settings (e.g., TP‑2 + PP‑2). They also vary batch size (1‑32) and context length (512, 2 K, 8 K, 128 K tokens).
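The search space the simulator sweeps can be sketched as a simple grid enumeration. The parameter values come from the setup above; the validity filter (TP × PP must fit in the 8-GPU node) is an assumption of this sketch, not a rule stated in the paper:

```python
from itertools import product

# Enumerate candidate intra-node parallelism configurations.
TP_DEGREES = (1, 2, 4)
PP_DEGREES = (1, 2, 4)
BATCH_SIZES = (1, 4, 16, 32)
CONTEXT_LENS = (512, 2_048, 8_192, 131_072)
GPUS_PER_NODE = 8

configs = [
    {"tp": tp, "pp": pp, "batch": b, "ctx": c}
    for tp, pp, b, c in product(TP_DEGREES, PP_DEGREES,
                                BATCH_SIZES, CONTEXT_LENS)
    # parallelized (more than one GPU) and fits in the node
    if 1 < tp * pp <= GPUS_PER_NODE
]
print(len(configs))  # 7 valid (TP, PP) pairs x 4 batch sizes x 4 context lengths = 112
```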

Key findings:

  • For the 70 B model, TP‑4 reduces per‑token latency by 18‑22 % compared with a baseline data‑parallel run, but improves throughput by less than 6 %. This makes TP‑heavy configurations ideal for latency‑sensitive services such as real‑time chat or code completion.
  • The 405 B model cannot fit on a single GPU even after quantization; the smallest viable TP is TP‑2, which yields only modest latency gains. PP‑4, however, boosts throughput by a factor of 2.3 while keeping latency within acceptable bounds for batch‑oriented workloads (e.g., document summarization pipelines).
  • Hybrid configurations (TP‑2 + PP‑2) strike a balance: latency improves by roughly 10‑12 % and throughput rises by 1.8‑2.0× for both models, demonstrating that fine‑grained control over TP and PP degrees can be used to meet specific Service‑Level Agreements (SLAs).
  • Input characteristics matter. Short sequences (≤512 tokens) and small batches (≤4) favor TP because the All‑Reduce cost is amortized over fewer tokens. Long contexts (≥8 K tokens) and large batches (≥16) benefit from PP, as the pipeline can hide the KV‑cache growth and keep GPUs busy.
  • Interconnect bandwidth is decisive. NVLink’s high bandwidth reduces All‑Reduce latency, making TP more attractive on such systems. On PCIe‑based nodes, PP’s stage‑wise communication is less penalized, so PP often outperforms TP.
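The findings above can be condensed into a rule of thumb. The function below is purely illustrative: the thresholds are the ones quoted in the summary, while the tie-breaking order is our own assumption:

```python
def choose_parallelism(batch_size: int, context_len: int,
                       high_bw_interconnect: bool) -> str:
    """Illustrative selection heuristic distilled from the findings
    above; not an algorithm proposed by the paper."""
    if not high_bw_interconnect:
        return "PP"           # PCIe-class links penalize All-Reduce heavily
    if context_len <= 512 and batch_size <= 4:
        return "TP"           # short inputs, small batches: latency-bound
    if context_len >= 8_192 and batch_size >= 16:
        return "PP"           # long contexts, big batches: throughput-bound
    return "TP+PP hybrid"     # otherwise trade latency against throughput

print(choose_parallelism(batch_size=1, context_len=256,
                         high_bw_interconnect=True))   # TP
```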

The paper also situates these findings within the broader ecosystem of LLM serving optimizations, such as dynamic batching, ranking‑based scheduling, and disaggregated pre‑fill/decode (e.g., DistServe). It argues that combining TP with static batching yields predictable latency, whereas PP pairs naturally with dynamic batching to maximize throughput under bursty traffic.

Finally, the authors discuss extensions to multi‑node clusters and to expert‑based (Mixture‑of‑Experts) models. In multi‑node settings, inter‑node network latency becomes the dominant bottleneck, so the TP/PP ratio must be tuned to match available bandwidth. For expert models, expert parallelism can be layered on top of TP/PP, offering additional memory savings while preserving the latency‑throughput control demonstrated for dense models.

In summary, the study provides a comprehensive, empirically validated framework for selecting and tuning parallelization strategies in dense LLM inference. By quantifying how TP, PP, and their hybrids affect latency, throughput, memory usage, and communication overhead, the work offers concrete guidance for system architects aiming to meet diverse SLA requirements without sacrificing model quality.

