Accelerating Sparse Matrix-Matrix Multiplication on GPUs with Processing Near HBMs

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Sparse General Matrix-Matrix Multiplication (SpGEMM) is a fundamental operation in numerous scientific computing and data analytics applications, and is often bottlenecked by irregular memory access patterns. This paper presents a hash-based multi-phase SpGEMM algorithm for GPUs together with the Acceleration of Indirect Memory Access (AIA) technique, a novel near-memory processing approach that optimizes SpGEMM on GPU HBM. The hardware-software co-designed framework demonstrates significant performance improvements over state-of-the-art methods, particularly on complex, application-specific workloads. The approach is evaluated on various graph workloads, including graph contraction, Markov clustering, and Graph Neural Networks (GNNs), showcasing its practical applicability. For graph analytics applications, AIA achieves up to a 17.3% time reduction over the software-only implementation, and time reductions of 76.5% for graph contraction and 58.4% for Markov clustering compared to cuSPARSE. For GNN training with structured global pruning, the hybrid approach delivers an average 1.43x speedup over the software-only implementation across six benchmark datasets and three architectures (GCN, GIN, GraphSAGE), and a 1.95x speedup over cuSPARSE on GNN workloads, with gains of up to 4.18x on large-scale datasets.


💡 Research Summary

This paper tackles the long‑standing performance bottlenecks of Sparse General Matrix‑Matrix multiplication (SpGEMM) on modern GPUs, namely irregular memory accesses, unknown output sparsity, and severe load imbalance. The authors propose a co‑designed hardware‑software solution that combines a hash‑based multi‑phase SpGEMM algorithm with a novel near‑HBM processing unit called Acceleration of Indirect Memory Access (AIA).

The algorithm follows Gustavson's row-wise product method but augments it with three distinct phases:

1. Row-grouping – rows of the left matrix A are classified into four groups according to the number of intermediate products (IP) they generate, using logarithmic binning. A mapping between original and grouped row IDs is kept, allowing the kernel to process rows in order of increasing workload without physically permuting the matrix.
2. Allocation – without computing actual numerical values, the kernel discovers the unique column indices that will appear in each output row. Two thread-assignment strategies are provided: Partial-Warp-Per-Row (PWPR) for moderate workloads and Thread-Block-Per-Row (TBPR) for heavy rows. Each group receives a dedicated shared-memory hash table sized proportionally to its workload, enabling fast insertion of intermediate column indices.
3. Accumulation – the previously built hash tables are used to accumulate the partial products into the final output matrix C. Collision resolution and atomic updates are handled entirely within the hash tables, eliminating costly random writes to global memory.
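To make the three phases concrete, here is a minimal CPU-side sketch in Python operating on CSR matrices. This is not the authors' CUDA implementation: the function names, the four-group bin boundaries, and the use of Python dicts/sets in place of shared-memory hash tables are all illustrative assumptions.

```python
import math
from collections import defaultdict

def group_rows_by_ip(A_rowptr, A_colidx, B_rowptr):
    """Phase 1 (row-grouping): bin rows of A into four groups by the
    number of intermediate products (IP) they generate, via logarithmic
    binning. The exact bin boundaries here are illustrative, not the
    paper's. Returns the original->grouped row-ID mapping."""
    groups = defaultdict(list)
    for row in range(len(A_rowptr) - 1):
        ip = sum(B_rowptr[c + 1] - B_rowptr[c]
                 for c in A_colidx[A_rowptr[row]:A_rowptr[row + 1]])
        bin_id = 0 if ip == 0 else min(3, int(math.log2(ip)) // 4)
        groups[bin_id].append(row)
    return groups

def symbolic_row(row, A_rowptr, A_colidx, B_rowptr, B_colidx):
    """Phase 2 (allocation): discover the unique output column indices
    of one row of C without computing numerical values. A Python set
    stands in for the per-group shared-memory hash table."""
    table = set()
    for j in range(A_rowptr[row], A_rowptr[row + 1]):
        c = A_colidx[j]
        for k in range(B_rowptr[c], B_rowptr[c + 1]):
            table.add(B_colidx[k])
    return table

def numeric_row(row, A, B):
    """Phase 3 (accumulation): accumulate partial products for one row
    of C into a hash table keyed by output column index, so no random
    writes to 'global memory' (the output array) are needed."""
    A_rowptr, A_colidx, A_vals = A
    B_rowptr, B_colidx, B_vals = B
    acc = {}
    for j in range(A_rowptr[row], A_rowptr[row + 1]):
        c, a = A_colidx[j], A_vals[j]
        for k in range(B_rowptr[c], B_rowptr[c + 1]):
            col = B_colidx[k]
            acc[col] = acc.get(col, 0.0) + a * B_vals[k]
    return acc
```

On the GPU, each row (or row group) of this loop nest is mapped to a partial warp or a thread block, and the `table`/`acc` structures live in shared memory with atomic collision handling.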

The second pillar, AIA, is a lightweight compute block placed between the GPU streaming multiprocessors and the stacked High-Bandwidth Memory (HBM). Its purpose is to accelerate the two-level indirect memory access pattern typical of SpGEMM, in which the address of each load depends on the result of a previous load.
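The access pattern AIA targets can be illustrated with a small sketch (names hypothetical): each element of B is reached only through a column index first loaded from A, so the two loads form a dependent chain that, on a conventional GPU, serializes round-trips to HBM.

```python
def gather_two_level(A_colidx, B_rowptr, B_colidx, start, end):
    """Two-level indirection typical of SpGEMM (x[y[i]]-style access):
    the first load fetches a column index of A, which then indexes into
    B's row pointers and column indices. A near-memory unit can resolve
    this dependent chain inside the HBM stack instead of shuttling each
    level of the lookup across the memory bus."""
    touched = []
    for j in range(start, end):
        c = A_colidx[j]                 # level 1: index loaded from A
        for k in range(B_rowptr[c], B_rowptr[c + 1]):
            touched.append(B_colidx[k]) # level 2: address depends on c
    return touched
```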

