D-Legion: A Scalable Many-Core Architecture for Accelerating Matrix Multiplication in Quantized LLMs

The performance gains of large language models (LLMs) are closely linked to their substantial computational and memory requirements. Extremely quantized LLMs offer significant efficiency advantages, motivating the development of specialized architectures to accelerate their workloads. This paper proposes D-Legion, a novel scalable many-core architecture built from many adaptive-precision systolic array cores, designed to accelerate matrix multiplication in quantized LLMs. The proposed architecture consists of a set of Legions, where each Legion contains a group of adaptive-precision systolic arrays. D-Legion supports multiple computation modes, including quantized sparse and dense matrix multiplication. Block-structured sparsity is exploited within fully-sparse or partially-sparse windows. In addition, memory accesses of partial summations (psums) are spatially reduced through parallel accumulators. Furthermore, data reuse is maximized through optimized scheduling techniques that multicast matrix tiles across the Legions. A comprehensive design space exploration is performed in terms of Legion/core granularity to determine the optimal Legion configuration. D-Legion is evaluated on attention workloads from two BitNet models, delivering up to 8.2$\times$ lower latency, up to 3.8$\times$ higher memory savings, and up to 3$\times$ higher psum memory savings compared to state-of-the-art work. With eight Legions and 64 total cores, D-Legion achieves a peak throughput of 135.68 TOPS at a frequency of 1 GHz. A scaled version with 32 Legions is compared to Google TPUv4i, achieving up to 2.5$\times$ lower total latency, up to 2.3$\times$ higher total throughput, and up to 2.7$\times$ higher total memory savings.


💡 Research Summary

The paper introduces D‑Legion, a novel many‑core accelerator architecture specifically designed to speed up matrix multiplication in highly quantized large language models (LLMs), such as BitNet, which use 1‑bit or 2‑bit weights. The authors observe that while GPUs and Google’s TPUs have driven AI progress, they become inefficient when models are aggressively quantized and exhibit block‑structured sparsity. To address this gap, D‑Legion combines three key ideas: (1) adaptive‑precision systolic array (ADiP) cores that can process 8‑bit × 2‑bit operations four times faster than conventional 8‑bit × 8‑bit multiplications; (2) a hierarchical grouping of cores into “Legions,” each equipped with parallel accumulators that spatially reduce partial‑sum (psum) memory traffic; and (3) a flexible network‑on‑chip (NoC) that multicasts matrix tiles across Legions, maximizing data reuse and exploiting block‑structured sparsity (both fully‑sparse and partially‑sparse windows).
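To make the 4× density claim concrete, here is a minimal Python sketch of the general packing idea (an illustration only, not the authors' ADiP datapath): four signed 2-bit weights occupy the same storage as one 8-bit weight, so a multiplier lane sized for 8-bit × 8-bit products can instead service four 8-bit × 2-bit products per cycle.

```python
# Illustrative sketch only: models the packing idea behind adaptive precision,
# not the authors' actual ADiP hardware.

def pack_w2(weights):
    """Pack four signed 2-bit weights (values in -2..1) into one byte."""
    assert len(weights) == 4 and all(-2 <= w <= 1 for w in weights)
    packed = 0
    for i, w in enumerate(weights):
        packed |= (w & 0b11) << (2 * i)
    return packed

def unpack_w2(packed):
    """Sign-extend the four 2-bit fields of a packed byte."""
    out = []
    for i in range(4):
        bits = (packed >> (2 * i)) & 0b11
        out.append(bits - 4 if bits >= 2 else bits)
    return out

def adip_lane_cycle(activation, packed_weights):
    """One cycle of a hypothetical 8b x 2b lane: four products, not one."""
    return [activation * w for w in unpack_w2(packed_weights)]

print(adip_lane_cycle(7, pack_w2([-2, -1, 0, 1])))  # four products per cycle
```

In 2-bit mode the lane produces four results in the time an 8-bit × 8-bit lane produces one, which is where the stated 4× computational density comes from.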

The architecture is parameterized by three variables: L (number of Legions), C (cores per Legion), and D (size of each systolic array, D × D PEs). The paper provides analytical models for tile sizing, latency, and "time‑to‑full‑utilization" (TFU). A design‑space exploration compares a single large 64 × 64 systolic array against many smaller 16 × 16 arrays that together contain the same number of PEs. The smaller‑core configuration reduces psum memory bandwidth by a factor of four, cuts TFU by a factor of four, and yields lower overall latency for quantized QKV projection workloads, while dense INT8 workloads see comparable latency.
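A back-of-the-envelope sketch of this trade-off can be written in a few lines. The assumptions are mine, not the paper's exact analytical models: a D × D systolic array takes roughly 2D − 1 cycles to fill its pipeline, the GEMM inner dimension K is mapped along array rows, and parallel accumulators chain all cores in a Legion so the effective K per pass grows with the core count.

```python
import math

# Hedged analytical sketch of the core-size trade-off discussed above;
# the constants and mapping are simplifying assumptions, not the paper's model.

def tfu_cycles(D):
    """Approximate pipeline-fill time of a D x D systolic array."""
    return 2 * D - 1

def psum_roundtrips(K, D, accumulated_cores=1):
    """Partial-sum spills to memory for inner dimension K: one per pass.
    Spatial accumulation across cores extends the effective K per pass."""
    effective_k = D * accumulated_cores
    return math.ceil(K / effective_k)

K = 4096  # illustrative inner dimension of an attention projection

one_big = (tfu_cycles(64), psum_roundtrips(K, 64))         # single 64x64 core
many_small = (tfu_cycles(16), psum_roundtrips(K, 16, 16))  # sixteen 16x16 cores
print(one_big, many_small)  # small-core point wins ~4x on both metrics
```

Under these assumptions the sixteen accumulator-chained 16 × 16 cores show both a ~4× shorter fill time and 4× fewer psum round trips than one 64 × 64 core, consistent with the factors reported in the summary.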

Four Legion granularity configurations are examined (2 × 64 × 64, 4 × 32 × 32, 8 × 16 × 16, 16 × 8 × 8). The 8 × 16 × 16 layout emerges as optimal: it keeps Legion‑level input bandwidth constant, minimizes psum memory bandwidth, and offers the best TFU‑latency trade‑off for the BitNet attention kernels. The authors also discuss how the parallel accumulators within each Legion reduce the number of psum writes, which is especially beneficial for the sparsity‑heavy QKV projection phase.
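The constant Legion-level input bandwidth can be sanity-checked with a small script. Note that the cores-per-Legion value C is inferred here from an assumed fixed budget of 16384 PEs (the baseline's 64 cores of 16 × 16), since the summary lists the configurations only as L × D × D; that inference is an assumption on my part.

```python
# Sanity check of the four granularity points; C is derived from an assumed
# fixed 16384-PE budget, not stated explicitly in the summary.

PE_BUDGET = 16384

def legion_config(L, D):
    C = PE_BUDGET // (L * D * D)  # cores per Legion under the fixed budget
    return {
        "L": L, "C": C, "D": D,
        "total_pes": L * C * D * D,   # should stay at the budget
        "legion_input_width": C * D,  # input rows fed per Legion per cycle
        "core_tfu": 2 * D - 1,        # approximate fill cycles per core
    }

for L, D in [(2, 64), (4, 32), (8, 16), (16, 8)]:
    print(legion_config(L, D))  # legion_input_width stays constant at 128
```

Under this budget, every configuration feeds 128 input rows per Legion per cycle, matching the summary's observation that Legion-level input bandwidth stays constant, while per-core fill time shrinks as D drops.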

Evaluation uses two BitNet models with extreme quantization. The baseline D‑Legion configuration (8 Legions, 64 total ADiP cores) runs at 1 GHz and achieves a peak throughput of 135.68 TOPS. Compared with state‑of‑the‑art accelerators, it delivers up to 8.2× lower latency, 3.8× higher memory savings, and 3× better psum memory efficiency. A scaled‑up version with 32 Legions (256 cores) is benchmarked against Google’s TPUv4i. D‑Legion shows up to 2.5× lower total latency, 2.3× higher total throughput, and 2.7× higher total memory savings, while maintaining comparable power‑area efficiency thanks to the adaptive‑precision and sparsity‑aware design.

In summary, D‑Legion’s contributions are: (i) a many‑core, adaptive‑precision systolic array fabric that delivers up to 4× higher computational density for quantized workloads; (ii) spatial reduction of psum traffic via per‑Legion parallel accumulators; (iii) a multicast‑friendly NoC and scheduling scheme that maximizes tile reuse across Legions; (iv) a thorough design‑space analysis that identifies the optimal core‑size‑to‑Legion‑count ratio; and (v) extensive experimental validation showing competitive or superior performance to leading TPU designs. The work demonstrates that a carefully co‑designed hardware‑software stack can fully exploit the computational and memory advantages of ultra‑low‑precision LLMs, paving the way for scalable, energy‑efficient inference at the scale of modern AI deployments.
