Communication-Avoiding SpGEMM via Trident Partitioning on Hierarchical GPU Interconnects
The multiplication of two sparse matrices, known as SpGEMM, is a key kernel in scientific computing and large-scale data analytics, underpinning graph algorithms, machine learning, simulations, and computational biology, where sparsity is often highly unstructured. This unstructured sparsity makes high performance difficult to achieve because it limits both memory efficiency and scalability. In distributed memory, the cost of exchanging and merging partial products across nodes further constrains performance. These issues are exacerbated on modern heterogeneous supercomputers with deep, hierarchical GPU interconnects. Current SpGEMM implementations overlook the gap between intra-node and inter-node bandwidth, resulting in unnecessary data movement and synchronization that fail to fully exploit the fast intra-node interconnect. To address these challenges, we introduce Trident, a hierarchy-aware 2D distributed SpGEMM algorithm that uses communication-avoiding techniques and asynchronous communication to exploit the hierarchical, heterogeneous interconnects of modern supercomputers. Central to Trident is the novel trident partitioning scheme, which enables hierarchy-aware decomposition and reduces inter-node communication by leveraging the higher bandwidth between GPUs within a node compared to across nodes. We evaluate Trident on unstructured matrices, achieving up to $2.38\times$ speedup over a 2D SpGEMM baseline, with a geometric mean speedup of $1.54\times$. Trident reduces inter-node communication volume by up to $2\times$ on NERSC’s Perlmutter supercomputer. Furthermore, we demonstrate the effectiveness of Trident in speeding up Markov Clustering, achieving up to $2\times$ speedup compared to competing strategies.
💡 Research Summary
The paper addresses the performance challenges of sparse matrix‑matrix multiplication (SpGEMM) on modern heterogeneous supercomputers that feature deep hierarchical GPU interconnects. While 2‑D algorithms such as Sparse SUMMA reduce communication volume compared to 1‑D schemes, they still treat intra‑node and inter‑node communication uniformly and incur √P synchronization steps, which become a bottleneck on large systems where intra‑node bandwidth (NVLink/Infinity Fabric) can be an order of magnitude higher than inter‑node bandwidth (Infiniband/Slingshot).
Trident is introduced as a hierarchy‑aware distributed SpGEMM algorithm that explicitly exploits this bandwidth gap. The core idea is a hybrid 2‑D/1‑D partitioning: the global process grid is split into a √P × √P 2‑D layout across nodes, while within each node the GPUs (λ per node) are partitioned in a 1‑D fashion. This “trident” scheme ensures that each node receives only the necessary tiles of the operands once (inter‑node peer‑to‑peer transfers) and then reuses them locally via high‑speed NCCL collectives.
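The hybrid 2-D/1-D layout above can be sketched as a small index-mapping function. This is a minimal illustration, not the paper's code: the function name, the row-major node ordering, and the row-wise 1-D split of GPUs inside a node are all assumptions made for the sake of the example.

```python
import math

def trident_coords(rank, num_nodes, gpus_per_node):
    """Map a global GPU rank to hierarchy-aware coordinates under a
    trident-style hybrid partitioning: nodes form a sqrt(P) x sqrt(P)
    2-D grid, and the lambda GPUs inside each node split that node's
    tile 1-D.  Illustrative sketch only -- the ordering conventions
    here are assumptions, not taken from the paper."""
    side = int(math.isqrt(num_nodes))
    assert side * side == num_nodes, "node count must be a perfect square"
    node = rank // gpus_per_node          # which node owns this GPU
    local_gpu = rank % gpus_per_node      # 1-D slot inside the node
    node_row, node_col = divmod(node, side)
    return node_row, node_col, local_gpu

# 16 nodes in a 4x4 grid, 4 GPUs per node (Perlmutter-like), 64 GPUs total
print(trident_coords(0, 16, 4))   # -> (0, 0, 0)
print(trident_coords(37, 16, 4))  # -> (2, 1, 1)
```

The point of the mapping is that all GPUs sharing a `(node_row, node_col)` pair also share the node's fast NVLink domain, so any tile fetched once over the inter-node network can be redistributed among them cheaply.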
The communication schedule consists of two phases. First, each node asynchronously sends and receives the required A and B tiles using non‑blocking MPI point‑to‑point calls. Second, the received tiles are broadcast, gathered, or reduced across the GPUs inside the node using NCCL AllGather/ReduceScatter, after which each GPU performs its local CSR‑based multiplication with KokkosKernels. By overlapping the inter‑node transfers with intra‑node collectives, Trident hides latency and reduces the number of inter‑node communication rounds from O(√P) to O(1). The algorithm therefore cuts inter‑node traffic by up to a factor of two and eliminates the √P synchronization barrier that limits traditional 2‑D approaches.
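The O(√P)-to-O(1) round reduction can be made concrete with a toy cost model. This is a deliberately simplified sketch under stated assumptions, not a measurement: it counts only per-node inter-node communication rounds, charging Sparse SUMMA one A-broadcast and one B-broadcast per grid stage, and charging a Trident-style schedule a single bulk exchange per operand.

```python
import math

def internode_rounds_summa(num_nodes):
    """Classic 2-D Sparse SUMMA: sqrt(P) synchronized stages, each
    broadcasting one A tile along the process row and one B tile along
    the process column.  Simplified cost model for illustration."""
    return 2 * int(math.isqrt(num_nodes))

def internode_rounds_trident(num_nodes):
    """Trident-style schedule (as summarized): each node fetches every
    A and B tile it will need in one asynchronous exchange per operand,
    then reuses the data via intra-node NCCL collectives.  Constant in
    sqrt(P) under this simplified model."""
    return 2  # one exchange for A, one for B

for nodes in (16, 64, 256):
    print(nodes, internode_rounds_summa(nodes), internode_rounds_trident(nodes))
```

Under this model the gap widens with scale: at 256 nodes the 2-D baseline takes 32 inter-node rounds per node while the hierarchy-aware schedule still takes two, which is why hiding the remaining transfers behind intra-node collectives pays off.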
Performance evaluation is carried out on NERSC’s Perlmutter system (four NVIDIA A100 GPUs per node connected by NVLink, nodes linked by Slingshot‑11). A suite of unstructured matrices—protein‑similarity, genome‑assembly, and matrices generated by Markov Clustering (MCL)—is used to benchmark the implementation at scales up to 256 GPUs (64 nodes). Compared with the state‑of‑the‑art 2‑D Sparse SUMMA implementation, Trident achieves an average speed‑up of 1.54× and a peak of 2.38×, while cutting inter‑node communication volume by up to 2×. Against a sparsity‑aware 1‑D SpGEMM in Trilinos, Trident delivers an average 2.96× speed‑up and a maximum of 5.95×. When integrated into an MCL pipeline, the overall clustering runtime is reduced by roughly a factor of two, demonstrating tangible benefits for real scientific workloads.
The contributions of the work are: (1) a novel 2‑D/1‑D hybrid partitioning that aligns computation with the hierarchical network topology; (2) a communication‑avoiding strategy that limits inter‑node data movement to a single exchange per operand; (3) an asynchronous MPI + NCCL pipeline that overlaps communication and computation; and (4) extensive empirical evidence of scalability and performance gains on a leading GPU‑accelerated supercomputer.
The authors discuss future directions, including extending the approach to full 3‑D process grids for further communication reduction, incorporating dynamic load balancing to handle highly irregular sparsity patterns, and adapting the method to emerging CPU‑GPU heterogeneous nodes. They also suggest exploring integration with newer GPU communication libraries (e.g., SHARP, UCX) and automated generation of hierarchical collectives. Overall, the paper demonstrates that accounting for the hierarchical nature of modern interconnects can unlock significant performance improvements for SpGEMM, a kernel central to many graph, machine‑learning, and scientific‑simulation workloads.