LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme
Graph Neural Networks (GNNs) are widely used today in recommendation systems, fraud detection, and node/link classification tasks. Real-world GNNs continue to scale in size and require a large memory footprint for storing graphs and embeddings, which often exceeds the memory capacities of the target GPUs used for training. To address limited memory capacities, traditional GNN training approaches use graph partitioning and sharding techniques to scale up across multiple GPUs within a node and/or scale out across multiple nodes. However, this approach suffers from the high computational costs of graph partitioning algorithms and inefficient communication across GPUs. To address these overheads, we propose the Large-scale Storage-based Multi-GPU GNN framework (LSM-GNN), a storage-based approach to train GNN models that utilizes a novel communication layer enabling GPU software caches to function as a system-wide shared cache with low overheads. LSM-GNN incorporates a hybrid eviction policy that intelligently manages cache space by using both static and dynamic node information to significantly enhance cache performance. Furthermore, we introduce the Preemptive Victim-buffer Prefetcher (PVP), a mechanism for prefetching node feature data from a Victim Buffer located in CPU pinned memory to further reduce the pressure on the storage devices. Experimental results show that, despite lower compute capabilities and memory capacities, LSM-GNN on a single node with two GPUs offers superior performance over a two-node, four-GPU Dist-DGL baseline and provides up to a 3.75× speedup in end-to-end epoch time when running large-scale GNN training.
💡 Research Summary
The paper introduces LSM‑GNN, a storage‑centric framework for training Graph Neural Networks (GNNs) on multiple GPUs without relying on graph partitioning. Traditional multi‑GPU or distributed GNN training first partitions the graph (often with METIS), which incurs heavy preprocessing time, large memory overhead, and substantial inter‑GPU communication during sampling and feature aggregation. These drawbacks become prohibitive for real‑world graphs that can reach billions of nodes and tens of terabytes of data.
LSM‑GNN tackles the problem by keeping the entire graph and node features on SSDs and fetching data on‑demand directly from the storage devices. To mitigate the limited SSD bandwidth, the system builds a 32‑way set‑associative software cache on each GPU and orchestrates these independent caches into a system‑wide shared cache through a novel communication layer. This layer leverages NVIDIA’s Scoped Memory Consistency model, using lightweight acquire/release primitives to keep cache metadata coherent across GPUs without resorting to costly system‑wide synchronization.
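To make the set-associative cache concrete, here is a minimal Python sketch of how a 32-way set-associative software cache maps node IDs to sets and bounds each set's occupancy. All names and the modulo set-indexing scheme are illustrative assumptions, not LSM-GNN's actual GPU implementation.

```python
# Illustrative sketch of a 32-way set-associative software cache.
# In LSM-GNN this structure lives in GPU memory; here it is simulated
# on the host. Names and the eviction hook are assumptions.
WAYS = 32

class SetAssociativeCache:
    def __init__(self, num_sets):
        self.num_sets = num_sets
        # Each set maps node_id -> feature vector, capped at WAYS entries.
        self.sets = [dict() for _ in range(num_sets)]

    def lookup(self, node_id):
        """Return cached features, or None on a miss (triggering an SSD read)."""
        return self.sets[node_id % self.num_sets].get(node_id)

    def insert(self, node_id, features, evict_fn):
        """Insert a line; if the set is full, evict one chosen by evict_fn."""
        s = self.sets[node_id % self.num_sets]
        evicted = None
        if len(s) >= WAYS:
            victim = evict_fn(s)               # policy picks the victim line
            evicted = (victim, s.pop(victim))
        s[node_id] = features
        return evicted  # caller may forward this to the CPU victim buffer

cache = SetAssociativeCache(num_sets=4)
cache.insert(7, [0.1, 0.2], evict_fn=lambda s: next(iter(s)))
assert cache.lookup(7) == [0.1, 0.2]   # hit
assert cache.lookup(11) is None        # miss
```

The `evict_fn` hook is where the hybrid eviction policy described below would plug in; a simple FIFO lambda is used here only to keep the sketch self-contained.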
Cache management is enhanced by a hybrid eviction policy that combines static graph information (normalized reverse PageRank, which predicts globally "hot" nodes) with dynamic access patterns (the next reuse iteration estimated from pre-executed sampling rounds). By weighting both signals, LSM-GNN chooses cache lines to evict more intelligently than pure LRU or LFU policies, achieving a 15–20% increase in hit rate in the authors' experiments.
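One way to picture the hybrid policy is as a weighted score per cache line, where a low static rank and a distant next reuse both make a line a better eviction candidate. The weights `alpha`/`beta` and the exact combination below are assumptions for illustration; the paper only states that both signals are weighted.

```python
# Hedged sketch of a hybrid eviction score: static signal (normalized
# reverse PageRank) plus dynamic signal (estimated next reuse iteration).
# The formula and weights are illustrative, not the paper's exact policy.
def eviction_score(static_rank, next_reuse_iter, current_iter,
                   alpha=0.5, beta=0.5):
    """Lower score => better eviction candidate."""
    # Lines reused sooner get a larger dynamic value (more worth keeping).
    reuse_distance = max(next_reuse_iter - current_iter, 1)
    dynamic_value = 1.0 / reuse_distance
    return alpha * static_rank + beta * dynamic_value

def pick_victim(lines, current_iter):
    """lines: {node_id: (static_rank, next_reuse_iter)} -> victim node_id."""
    return min(lines, key=lambda n: eviction_score(*lines[n], current_iter))

lines = {1: (0.9, 12), 2: (0.1, 30), 3: (0.4, 11)}
# Node 2 has both a low PageRank and a distant reuse, so it is evicted.
assert pick_victim(lines, current_iter=10) == 2
```

Note how this differs from LRU (which would ignore the static rank) and LFU (which would ignore the predicted future reuse): the combined score can keep a rarely-touched but globally hot node resident.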
When a cache line is evicted, its feature vector is moved to a Victim Buffer (VB) allocated in CPU pinned memory. The Preemptive Victim‑buffer Prefetcher (PVP) monitors the dynamic reuse information attached to each evicted line and asynchronously prefetches likely‑to‑be‑reused data back to the GPU during the training phase, when PCIe ingress bandwidth is under‑utilized. Because PVP runs on the CPU, it does not consume GPU resources and can overlap with computation, further reducing storage contention.
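The eviction-to-prefetch path can be sketched as follows. This host-side Python model stands in for CPU pinned memory and the PVP; the `lookahead` threshold and all function names are assumptions, and the real prefetcher runs asynchronously on the CPU, overlapped with GPU training.

```python
# Illustrative model of the Victim Buffer (VB) and a preemptive prefetcher.
# In LSM-GNN the VB lives in CPU pinned memory and the PVP overlaps with
# training; here the prefetch pass is invoked synchronously for clarity.
import queue

victim_buffer = {}               # stands in for CPU pinned memory
prefetch_queue = queue.Queue()   # reuse hints attached to evicted lines

def on_evict(node_id, features, next_reuse_iter):
    """Move an evicted cache line to the CPU-resident victim buffer."""
    victim_buffer[node_id] = features
    prefetch_queue.put((next_reuse_iter, node_id))

def run_prefetcher(gpu_cache, current_iter, lookahead=2):
    """Pull lines expected to be reused within `lookahead` iterations back
    into the GPU cache, using PCIe ingress bandwidth that is otherwise
    idle during the training phase."""
    deferred = []
    while not prefetch_queue.empty():
        reuse_iter, node_id = prefetch_queue.get()
        if reuse_iter <= current_iter + lookahead and node_id in victim_buffer:
            gpu_cache[node_id] = victim_buffer.pop(node_id)
        else:
            deferred.append((reuse_iter, node_id))   # not needed yet
    for item in deferred:        # keep far-future lines queued in the VB
        prefetch_queue.put(item)

gpu_cache = {}
on_evict(5, [0.3, 0.7], next_reuse_iter=11)   # reused soon
on_evict(9, [0.5, 0.5], next_reuse_iter=40)   # reused much later
run_prefetcher(gpu_cache, current_iter=10)
assert 5 in gpu_cache and 9 not in gpu_cache
```

The payoff of this design is that a near-term reuse hits pinned host memory over PCIe instead of issuing another SSD read, which is how the VB reduces pressure on the storage devices.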
The implementation targets NVIDIA A100 GPUs and NVMe SSDs. Experiments on two large graphs—IGBH-medium (≈26 M nodes) and a Pinterest-scale graph (≈2 B nodes)—using a GraphSAGE-GAT model demonstrate that a single node with two GPUs running LSM-GNN outperforms a four-GPU, two-node Dist-DGL baseline. End-to-end epoch time improves by up to 3.75×, cache hit rates rise from ~68% to ~84%, and SSD read bandwidth usage drops to roughly one-quarter of the baseline. Moreover, the need for graph partitioning disappears, eliminating the multi-minute preprocessing step required by METIS.
Limitations include the current focus on NVIDIA hardware and PCIe‑NVMe storage; performance on AMD GPUs, CXL‑based high‑bandwidth memory, or alternative interconnects remains untested. The hybrid eviction policy also requires tuning of static versus dynamic weight parameters, which may vary across graph topologies, suggesting a need for automated parameter optimization.
In summary, LSM‑GNN presents a compelling alternative to partition‑based multi‑GPU GNN training by turning the GPU software cache into a shared, intelligently managed resource and by prefetching evicted data from a CPU‑resident victim buffer. This design substantially reduces reliance on storage bandwidth, improves GPU utilization, and shortens overall training time for massive graphs. Future work could extend the framework to heterogeneous hardware stacks and explore learning‑based cache management to further generalize the approach.