CaPGNN: Optimizing Parallel Graph Neural Network Training with Joint Caching and Resource-Aware Graph Partitioning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Graph-structured data is ubiquitous in the real world, and Graph Neural Networks (GNNs) have become increasingly popular in various fields due to their ability to process such irregular data directly. However, as graph data scales, GNN training becomes inefficient. Although parallel training offers performance improvements, increased communication costs often offset these advantages. To address this, this paper introduces CaPGNN, a novel parallel full-batch GNN training framework for a single server with multiple GPUs. First, observing that the number of remote vertices in a partition is often greater than or equal to the number of local vertices, and that many of them are duplicates, we propose a joint adaptive caching algorithm that leverages both CPU and GPU memory, integrating lightweight cache-update and prefetch techniques to effectively reduce redundant communication costs. Furthermore, taking into account the varying computational and communication capabilities among GPUs, we propose a communication- and computation-aware heuristic graph partitioning algorithm inspired by graph sparsification. Additionally, we implement a pipeline to overlap computation and communication. Extensive experiments show that CaPGNN improves training efficiency by up to 18.98x and reduces communication costs by up to 99%, with minimal accuracy loss or even accuracy improvement in some cases. Finally, we extend CaPGNN to multi-machine multi-GPU environments. The code is available at https://github.com/songxf1024/CaPGNN.


💡 Research Summary

CaPGNN addresses the scalability bottlenecks of full‑batch Graph Neural Network (GNN) training on a single‑server multi‑GPU platform. The authors first identify two major sources of inefficiency: (1) a large number of halo (boundary) vertices that are duplicated across partitions, leading to redundant feature exchanges, and (2) heterogeneous GPU capabilities that cause load imbalance when traditional partitioners assume equal resources. To tackle these issues, the paper introduces two tightly coupled techniques.
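
To make the first inefficiency concrete, here is a minimal sketch (not the paper's code) that counts local versus halo vertices for one partition of a toy edge-cut partitioning; the vertex IDs, edges, and partition assignment are made up for illustration:

```python
# Illustrative sketch: for a toy 2-way edge-cut partition, count the
# vertices a partition owns (local) versus the boundary vertices it must
# fetch from other partitions (halo). All inputs here are hypothetical.

def halo_vertices(edges, assignment, part):
    """Return (local, halo) vertex sets for one partition.

    edges      : iterable of (u, v) undirected edges
    assignment : dict mapping vertex -> partition id
    part       : partition id to inspect
    """
    local = {v for v, p in assignment.items() if p == part}
    halo = set()
    for u, v in edges:
        if u in local and v not in local:
            halo.add(v)
        if v in local and u not in local:
            halo.add(u)
    return local, halo

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
assignment = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}
local, halo = halo_vertices(edges, assignment, 0)
print(sorted(local), sorted(halo))  # [0, 1] [2, 3, 4]
```

Even in this tiny example, partition 0 owns two vertices but must fetch three halo vertices, mirroring the paper's observation that remote vertices often outnumber local ones.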

The Joint Adaptive Caching Algorithm (JACA) treats CPU memory and GPU memory as a two‑level cache. An analytical model assigns an “importance score” to each vertex based on access frequency, feature dimensionality, and recency. High‑importance vertices are placed in the limited GPU cache, while the rest reside in the larger CPU cache. Cache updates are performed incrementally, using pinned memory and asynchronous CUDA streams to overlap data movement with computation. A lightweight prefetcher and a staleness‑tolerant pipeline allow the system to reuse slightly outdated vertex features, dramatically reducing the number of cross‑GPU communications required for halo vertices.
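
The two-level placement can be sketched as follows; this is a simplified illustration, not JACA itself, and the scoring weights, capacities, and vertex statistics are all assumed values:

```python
# Hypothetical sketch of two-level cache placement: score vertices by a
# weighted combination of access frequency, feature dimensionality, and
# recency, fill the small GPU cache with the highest-scoring vertices,
# and spill the rest into the larger CPU cache. Weights are illustrative.

def place_in_caches(vertices, gpu_capacity, cpu_capacity,
                    alpha=0.6, beta=0.3, gamma=0.1):
    """vertices: dict id -> (access_freq, feat_dim, recency in [0, 1]).

    Returns (gpu_cache_ids, cpu_cache_ids) ordered by importance.
    """
    scored = sorted(
        vertices.items(),
        key=lambda kv: alpha * kv[1][0] + beta * kv[1][1] + gamma * kv[1][2],
        reverse=True,
    )
    gpu = [vid for vid, _ in scored[:gpu_capacity]]
    cpu = [vid for vid, _ in scored[gpu_capacity:gpu_capacity + cpu_capacity]]
    return gpu, cpu

stats = {10: (5, 1.0, 0.9), 11: (1, 1.0, 0.1), 12: (3, 1.0, 0.5)}
gpu, cpu = place_in_caches(stats, gpu_capacity=1, cpu_capacity=2)
print(gpu, cpu)  # [10] [12, 11]
```

In a real system the placement would be recomputed incrementally as access statistics drift, with the actual feature transfers done through pinned memory on asynchronous CUDA streams as the summary describes.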

The Resource‑Aware Partitioning Algorithm (RAPA) extends conventional graph partitioners by incorporating per‑GPU resource profiles (FLOPS, memory capacity, NVLink bandwidth). Inspired by graph sparsification, RAPA reduces the number of halo vertices by selectively pruning low‑degree vertices at partition boundaries and by adjusting the replication factor according to each GPU’s performance. Consequently, powerful GPUs receive larger sub‑graphs, while weaker GPUs are assigned smaller workloads, achieving both communication reduction and load balance.
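
The resource-proportional sizing step can be illustrated with a small sketch; this is an assumed simplification that collapses each GPU's profile into a single throughput scalar, whereas RAPA also weighs memory capacity and link bandwidth:

```python
# Illustrative sketch: split a vertex budget across GPUs in proportion
# to a per-GPU throughput weight, using largest-remainder rounding so
# the partition sizes sum exactly to the vertex count. The weights and
# the single-scalar profile are assumptions for demonstration.

def target_sizes(num_vertices, gpu_weights):
    """Return one target partition size per GPU, proportional to weight."""
    total = sum(gpu_weights)
    raw = [num_vertices * w / total for w in gpu_weights]
    sizes = [int(r) for r in raw]
    # Hand leftover vertices to the largest fractional remainders.
    leftover = num_vertices - sum(sizes)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes

# A GPU three times as fast receives three times the vertices.
print(target_sizes(100, [3, 1]))  # [75, 25]
```

A real partitioner would feed these target sizes into the edge-cut objective, trading a slightly larger cut for partitions matched to each GPU's speed.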

A pipelined execution model further overlaps local aggregation/combine operations with halo feature exchanges. The pipeline tolerates bounded staleness, ensuring that the computation stage never stalls waiting for remote data.
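
The bounded-staleness policy can be simulated in a few lines; this toy model (entirely an assumption, with no real communication) only shows the decision rule, i.e. when compute reuses a cached halo version versus when it must wait:

```python
# Toy simulation of the staleness-tolerant pipeline: at each epoch, use
# freshly exchanged halo features if the async exchange finished in
# time; otherwise reuse the cached version while it is within
# `max_staleness` epochs old, and only stall for fresh data beyond that.

def pipelined_epochs(num_epochs, fresh_ready, max_staleness=1):
    """fresh_ready: epochs at which the async halo exchange completed
    in time. Returns the feature 'version' (epoch of origin) used at
    each epoch."""
    used = []
    version = 0
    for epoch in range(num_epochs):
        if epoch in fresh_ready:
            version = epoch  # fresh features arrived; update the cache
        elif epoch - version > max_staleness:
            version = epoch  # bound exceeded: stall until the exchange lands
        used.append(version)
    return used

# Exchanges complete in time only at epochs 0 and 3; epochs 1 and 4
# reuse slightly stale features, epoch 2 hits the staleness bound.
print(pipelined_epochs(5, {0, 3}))  # [0, 0, 2, 3, 3]
```

Bounding staleness this way keeps the compute stage busy in the common case while guaranteeing that no update ever uses features more than `max_staleness` epochs old, which is why convergence is preserved.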

Experiments on seven public benchmarks (e.g., Reddit, ogbn‑products, Amazon) and three GNN architectures (GCN, GraphSAGE, GAT) demonstrate that CaPGNN achieves up to 18.98× speed‑up over state‑of‑the‑art systems such as DGL‑Dist, NeuGraph, and HongTu, while cutting inter‑GPU communication by up to 99%. Accuracy loss is negligible (≤0.1%) or even slightly improved, confirming that the staleness‑tolerant cache does not harm convergence. The authors also extend the framework to multi‑machine multi‑GPU settings, showing that the same caching and partitioning principles apply without modification.

Limitations include potential cache‑management overhead for extremely large graphs (hundreds of millions of vertices) and the current focus on CUDA‑based single‑node environments, which may require additional cost modeling for Ethernet‑based clusters. Future work is suggested on hierarchical caching (SSD/NVMe), dynamic repartitioning, and broader network topology awareness. Overall, CaPGNN offers a practical, resource‑aware solution that makes full‑batch GNN training feasible and efficient on heterogeneous multi‑GPU servers.

