Accelerating Storage-Based Training for Graph Neural Networks

Reading time: 5 minutes
...

📝 Original Info

  • Title: Accelerating Storage-Based Training for Graph Neural Networks
  • ArXiv ID: 2601.01473
  • Date: 2026-01-04
  • Authors: Myung-Hwan Jang, Jeong-Min Park, Yunyong Ko, Sang-Wook Kim

📝 Abstract

Graph neural networks (GNNs) have achieved breakthroughs in various real-world downstream tasks due to their powerful expressiveness. As the scale of real-world graphs continues to grow, a storage-based approach to GNN training has been studied, which leverages external storage (e.g., NVMe SSDs) to handle such web-scale graphs on a single machine. Although such storage-based GNN training methods have shown promising potential for large-scale GNN training, we observed that they suffer from a severe bottleneck in data preparation because they overlook a critical challenge: how to handle a large number of small storage I/Os. To address this challenge, in this paper we propose a novel storage-based GNN training framework, named AGNES, that employs block-wise storage I/O processing to fully utilize the I/O bandwidth of high-performance storage devices. Moreover, to further enhance the efficiency of each storage I/O, AGNES employs a simple yet effective strategy, hyperbatch-based processing, based on the characteristics of real-world graphs. Comprehensive experiments on five real-world graphs reveal that AGNES consistently outperforms four state-of-the-art methods, running up to 4.1× faster than the best competitor.

CCS Concepts: • Information systems → Data management systems; • Computing methodologies → Machine learning.

💡 Deep Analysis

Figure 1: Overview of the two stages of storage-based GNN training (data preparation and computation).

📄 Full Content

Graphs are prevalent in many applications [5,7,9,14,25,41] to represent a variety of real-world networks, such as social networks and the web, where objects and their relationships are modeled as nodes and edges, respectively. Recently, graph neural networks (GNNs), a class of deep neural networks specially designed to learn from such graph-structured data, have achieved breakthroughs in various downstream tasks, including node classification [24,43], link prediction [20,33-36], and community detection [3,19,38].

Although existing works have designed model architectures to learn the structural information of graphs by considering not only node features but also graph topology [1,16,18,23,37], they rely on a simple assumption: the entire input data, including node features and graph topology, resides in GPU or main memory during GNN training [2,30,31,39]. However, as the scale of real-world graphs continues to grow, this assumption is no longer practical: the size of real-world graphs often exceeds the capacity of GPU memory (e.g., 80 GB for an NVIDIA H100) or even that of main memory in a single machine (e.g., 256 GB). For instance, training a 3-layer GAT [28] on the yahoo-web graph [32], which consists of 1.4B nodes and 6.6B edges, requires about 1.5 TB of memory, including node features, graph topology, and intermediate results.
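To see why the graph alone overwhelms a single machine, a back-of-envelope estimate suffices. The sketch below uses the yahoo-web node and edge counts from the paragraph above; the feature dimension, dtype, and adjacency encoding are illustrative assumptions, not figures stated in the paper.

```python
# Back-of-envelope storage footprint for storage-based GNN training.
# Node/edge counts: yahoo-web (from the text). Feature dimension, dtype,
# and 64-bit CSR adjacency are hypothetical assumptions for illustration.

NUM_NODES = 1_400_000_000          # ~1.4B nodes
NUM_EDGES = 6_600_000_000          # ~6.6B edges
FEAT_DIM = 256                     # assumed feature dimension
BYTES_PER_FLOAT = 4                # float32 features
BYTES_PER_ID = 8                   # 64-bit node IDs / offsets

feature_bytes = NUM_NODES * FEAT_DIM * BYTES_PER_FLOAT
topology_bytes = NUM_EDGES * BYTES_PER_ID + NUM_NODES * BYTES_PER_ID

print(f"node features : {feature_bytes / 1e12:.2f} TB")   # ~1.43 TB
print(f"graph topology: {topology_bytes / 1e12:.2f} TB")  # ~0.06 TB
# Even before intermediate activations, the total far exceeds an H100's
# 80 GB of GPU memory or a 256 GB main-memory budget.
```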

To address this challenge, a storage-based approach to GNN training has been studied [8,22,26,29], which leverages recent high-performance external storage devices (e.g., NVMe SSDs) [8,22,29]. This approach stores the entire graph topology and node features in external storage and loads parts of them into main memory only when they are required for GNN training. Storage-based GNN training proceeds in two stages, as illustrated in Figure 1: (1) Data preparation: it (i) traverses the graph stored in storage (by loading it into main memory) to find the neighboring nodes of the target nodes needed for training, (ii) gathers their associated features from storage into main memory, and (iii) transfers both to the GPU.

(2) Computation: it performs (iv) forward propagation (i.e., prediction) and (v) backward propagation (i.e., loss and gradient computations) over the transferred data in the GPU.
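The two-stage pipeline can be summarized in a minimal training-loop sketch. The storage accessors (`sample_neighbors`, `gather_features`, `gather_labels`) and their parameters are hypothetical placeholders introduced here for illustration; this is a generic sketch of the pipeline described above, not AGNES's or any existing system's actual API.

```python
import torch

def train_one_batch(model, optimizer, loss_fn, target_nodes, storage):
    # --- (1) Data preparation, on the CPU side ---
    # (i) traverse the on-storage graph to find neighbors of the target nodes
    sub_nodes, sub_edges = storage.sample_neighbors(target_nodes, fanout=[10, 10])
    # (ii) gather the sampled nodes' features from storage into host memory
    feats = storage.gather_features(sub_nodes)        # CPU tensor
    labels = storage.gather_labels(target_nodes)      # CPU tensor
    # (iii) transfer the mini-batch to the GPU
    feats, sub_edges, labels = (t.cuda(non_blocking=True)
                                for t in (feats, sub_edges, labels))

    # --- (2) Computation, on the GPU ---
    # (iv) forward propagation; assume target nodes come first in sub_nodes
    out = model(feats, sub_edges)
    loss = loss_fn(out[:len(target_nodes)], labels)
    # (v) backward propagation and parameter update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Step (ii) is where the small, scattered storage reads discussed below originate: each sampled neighborhood typically touches only a handful of feature records.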

Although the advanced computational power of modern GPUs has significantly accelerated the computation stage, the data preparation stage can be a major bottleneck in the overall process of storage-based GNN training, as it can incur a large number of I/Os between storage and main memory (simply, storage I/Os hereafter). Existing works [8,22,26,29] have focused on improving the data preparation stage and have shown promising potential.

Despite their success, we observed that there is still large room for further improvement in storage-based GNN training. We conducted a preliminary experiment to analyze the ratio of the time spent in the data preparation stage to the total execution time in Ginex [22] and GNNDrive [8], state-of-the-art methods for storage-based GNN training. Specifically, we trained two GNN models, GCN [13] and GraphSAGE [4] (SAGE in short), on three real-world graph datasets: twitter-2010 (TW) [11], ogbn-papers100M (PA) [6], and com-friendster (FR) [11]. As shown in Figure 2(a), the data preparation stage dominates the entire training process (up to 96% of the total execution time). For an in-depth analysis, we also measured the size of each individual I/O that occurs during training. Figure 2(b) shows the distribution of storage I/O sizes: a large number of storage I/Os are small, while only a few are very large. Such a large number of small I/Os significantly degrades the utilization of computing resources (e.g., GPU utilization) during GNN training, as shown in Figure 2(c).
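The style of measurement behind Figure 2(a) can be reproduced with a simple timing harness around the two stages. The sketch below is an illustration under stated assumptions (the loader, model, and optimizer are hypothetical placeholders); it is not the authors' instrumentation code.

```python
import time
import torch

def profile_epoch(loader, model, optimizer, loss_fn):
    """Measure the share of epoch time spent in data preparation vs. computation."""
    prep_time, compute_time = 0.0, 0.0
    it = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            feats, edges, labels = next(it)   # sampling + feature I/O on the CPU
        except StopIteration:
            break
        feats, edges, labels = (t.cuda(non_blocking=True)
                                for t in (feats, edges, labels))
        torch.cuda.synchronize()
        t1 = time.perf_counter()              # end of data preparation

        out = model(feats, edges)
        loss = loss_fn(out, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()
        t2 = time.perf_counter()              # end of computation

        prep_time += t1 - t0
        compute_time += t2 - t1

    total = prep_time + compute_time
    print(f"data preparation: {100 * prep_time / total:.1f}% of epoch time")
```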

We posit that this phenomenon arises because real-world graphs tend to have a power-law degree distribution [15], meaning that the majority of nodes have only a few edges (i.e., neighbors) while a small number of nodes have a huge number of edges. That is, the number of neighboring nodes required for GNN training is highly likely to be very small in most cases. Existing storage-based GNN training methods [8,22,26,29], however, have overlooked this important characteristic. They focus only on how to increase the possibility of reusing cached data in main memory (i.e., the cache hit ratio) and simply read a few nodes from storage whenever they are required for GNN training, thereby generating a significant number of small storage I/Os. For example, Sheng et al. [26] aim to enhance the locality of sampled nodes for a better cache hit ratio by partitioning the entire graph and selecting target nodes within the same partition. These approaches, however, do not fundamentally address the challenge of handling a large number of small I/Os, which remains under-explored.
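The connection between a power-law degree distribution and tiny storage reads can be made concrete with a small simulation. The sketch below is illustrative only: the Zipf exponent, fanout cap, and feature size are assumptions, not parameters from the paper.

```python
import numpy as np

# Illustrative simulation: under a heavy-tailed degree distribution, most
# sampled neighborhoods are tiny, so fetching each neighborhood's features
# with its own storage request yields mostly very small I/Os.

rng = np.random.default_rng(0)
FEAT_BYTES = 256 * 4                     # assumed 256-dim float32 features

# Draw node degrees from a Zipf-like (power-law) distribution.
degrees = rng.zipf(a=2.1, size=1_000_000)

# Per-target read size if each target's neighbor features are fetched in one
# request, with the sampling fanout capped at 10 neighbors.
read_sizes = np.minimum(degrees, 10) * FEAT_BYTES

small = np.mean(read_sizes < 4096)       # smaller than a typical 4 KB page
print(f"{100 * small:.1f}% of reads are under 4 KB")
print(f"median read size: {np.median(read_sizes) / 1024:.1f} KB")
```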

We may tackle this challenge by merging small storage I/Os ...
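The general idea of merging small storage I/Os into larger ones can be sketched as follows. The snippet assumes a flat on-disk layout in which node features are stored contiguously as fixed-size records; the record size, block size, and helper names are assumptions for illustration, and this is not AGNES's block-wise I/O mechanism.

```python
import numpy as np

FEAT_BYTES = 256 * 4        # assumed per-node feature record size
BLOCK_BYTES = 512 * 1024    # assumed maximum size of one coalesced read

def coalesce_reads(node_ids):
    """Group sorted node IDs into (offset, length) block reads."""
    offsets = np.sort(np.asarray(node_ids)) * FEAT_BYTES
    reads = []
    start = end = offsets[0]
    for off in offsets[1:]:
        # Extend the current block read while the next record still fits.
        if off + FEAT_BYTES - start <= BLOCK_BYTES:
            end = off
        else:
            reads.append((start, end + FEAT_BYTES - start))
            start = end = off
    reads.append((start, end + FEAT_BYTES - start))
    return reads

# Example: 10,000 scattered per-node reads collapse into ~100 block reads.
ids = np.random.default_rng(0).choice(50_000, size=10_000, replace=False)
print(f"{len(ids)} per-node reads -> {len(coalesce_reads(ids))} block reads")
```

A coalesced block may include records that were not requested, so this trades a little extra transferred data for far fewer, larger I/O requests, which is what high-bandwidth NVMe devices need to be fully utilized.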


Reference

This content is AI-processed based on open access ArXiv data.
