Beyond BFS: A Comparative Study of Rooted Spanning Tree Algorithms on GPUs

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv source.

Rooted spanning trees (RSTs) are a core primitive in parallel graph analytics, underpinning algorithms such as biconnected components and planarity testing. On GPUs, RST construction has traditionally relied on breadth-first search (BFS) due to its simplicity and work efficiency. However, BFS incurs an O(D) step complexity, which severely limits parallelism on high-diameter and power-law graphs. We present a comparative study of alternative RST construction strategies on modern GPUs. We introduce a GPU adaptation of the Path Reversal RST (PR-RST) algorithm, optimizing its pointer-jumping and broadcast operations for modern GPU architectures. In addition, we evaluate an integrated approach that combines a state-of-the-art connectivity framework (GConn) with Eulerian tour-based rooting. Across more than 10 real-world graphs, our results show that the GConn-based approach achieves up to 300x speedup over optimized BFS on high-diameter graphs. These findings indicate that the O(log n) step complexity of connectivity-based methods can outweigh their structural overhead on modern hardware, motivating a rethinking of RST construction in GPU graph analytics.


💡 Research Summary

The paper addresses the problem of constructing rooted spanning trees (RSTs) on modern GPUs, a primitive that underlies many higher‑level graph analytics such as biconnected components, planarity testing, and ear decomposition. Historically, breadth‑first search (BFS) has been the default method for RST construction because of its simplicity and work‑optimality (Θ(V + E)). However, BFS proceeds level‑by‑level, requiring Θ(D) synchronization steps, where D is the graph diameter. On high‑diameter graphs—such as road networks and certain biological graphs—this leads to thousands of kernel launches and global synchronizations, severely limiting the exploitation of GPU parallelism.
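The level-by-level dependency is easiest to see in a minimal sequential sketch of BFS-based RST construction (illustrative Python, not the paper's CUDA code): each iteration of the outer loop corresponds to one kernel launch plus a global synchronization on a GPU, so the loop body runs Θ(D) times.

```python
def bfs_spanning_tree(adj, root):
    """Level-synchronous BFS rooted-spanning-tree construction.
    Each outer-loop iteration models one GPU kernel launch followed
    by a global synchronization."""
    n = len(adj)
    parent = [-1] * n
    parent[root] = root
    frontier = [root]
    rounds = 0
    while frontier:
        rounds += 1
        next_frontier = []
        for u in frontier:            # frontier vertices run in parallel on a GPU
            for v in adj[u]:
                if parent[v] == -1:   # an atomic compare-and-swap on the GPU
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent, rounds

# A 6-vertex path graph has diameter 5, forcing one round per level:
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
parent, rounds = bfs_spanning_tree(path, 0)
# parent == [0, 0, 1, 2, 3, 4]; rounds == 6
```

On a road network with diameter in the hundreds of thousands, this loop structure translates directly into a matching number of kernel launches, which is the bottleneck the paper targets.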

The authors evaluate two alternative, connectivity‑based approaches that have logarithmic parallel depth: (1) a combination of a state‑of‑the‑art connected‑components framework (GConn) with an Euler‑tour rooting phase, and (2) a GPU adaptation of the Path‑Reversal RST (PR‑RST) algorithm originally proposed for multicore CPUs. Both methods decouple connectivity from rooting, allowing the construction to avoid the strict level synchrony of BFS.
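Both pipelines start from a hook-and-compress connectivity loop in the Shiloach–Vishkin tradition. A sequential sketch of the idea (illustrative Python, not the paper's implementation; the real kernels process one edge or vertex per thread) shows why its round count is independent of the diameter:

```python
def connected_components(n, edges):
    """Hook-and-compress connectivity sketch (Shiloach-Vishkin style):
    hooking links the root of one endpoint's tree under the other's,
    and compression pointer-jumps every vertex toward its root, giving
    O(log n) rounds independent of the graph diameter."""
    parent = list(range(n))
    changed = True
    while changed:
        changed = False
        # Hooking: one edge per GPU thread in the real kernels.
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if ru != rv:
                hi, lo = max(ru, rv), min(ru, rv)
                if parent[hi] == hi:      # only a root may be hooked
                    parent[hi] = lo
                    changed = True
        # Compression: jump until every vertex points at its root
        # (on the GPU, several jumps are batched per synchronization).
        for v in range(n):
            while parent[v] != parent[parent[v]]:
                parent[v] = parent[parent[v]]
    return parent

# Two components: {0, 1, 2} and {3, 4}.
labels = connected_components(5, [(0, 1), (1, 2), (3, 4)])
# labels == [0, 0, 0, 3, 3]
```

The output is a forest of star-shaped trees (one root label per component); the two methods then differ in how they turn that forest into a properly rooted spanning tree.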

GConn + Euler Tour
The GConn framework implements Shiloach‑Vishkin style hooking (linking) and compression (pointer‑jumping) on the GPU. Hooking is performed with alternating max‑ and min‑hook strategies to improve load balance. After each hooking round, several pointer‑jump steps are executed per thread before a global synchronization, reducing kernel launch overhead. Once a forest of unrooted trees is obtained, the authors generate a directed edge list (2 × |E|), sort it lexicographically using CUB, and implicitly build adjacency structures (first/last indices and next pointers). An Euler tour is then constructed by defining a successor for each directed edge; cycles are broken at each root to obtain linear lists. Parallel list ranking assigns a rank to each edge, and the parent array is derived from the relative ranks of an edge and its reverse. This pipeline eliminates the need for a separate rooting pass and leverages the high memory bandwidth of modern GPUs.
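The Euler-tour rooting phase can be modeled sequentially as follows (a hedged Python sketch of the pipeline described above, not the authors' code; the real implementation sorts the arc list with CUB and ranks the list with a parallel kernel):

```python
def euler_tour_root(edges, n, root=0):
    """Root an unrooted spanning tree via an Euler tour: build the 2|E|
    directed arc list, define a successor per arc, rank the resulting
    circuit starting from the root, and compare each arc's rank with
    that of its reverse to recover parent pointers."""
    arcs = sorted([(u, v) for u, v in edges] + [(v, u) for u, v in edges])
    index = {a: i for i, a in enumerate(arcs)}
    # first[v]: first arc leaving v; next_arc[i]: next arc with the same tail.
    first, next_arc = {}, {}
    for i, (u, v) in enumerate(arcs):
        first.setdefault(u, i)
        if i + 1 < len(arcs) and arcs[i + 1][0] == u:
            next_arc[i] = i + 1
    # Successor of (u, v): the arc after its twin (v, u) in v's circular list.
    succ = {i: next_arc.get(index[(v, u)], first[v]) for (u, v), i in index.items()}
    # Traverse the circuit once, starting at the root's first outgoing arc
    # (equivalently, break the cycle there); record each arc's rank.
    rank, cur = {}, first[root]
    for r in range(len(arcs)):
        rank[cur] = r
        cur = succ[cur]
    # An arc (u, v) ranked before its reverse means v is a child of u.
    parent = [root] * n
    for u, v in arcs:
        if rank[index[(u, v)]] < rank[index[(v, u)]]:
            parent[v] = u
    return parent

# Path 0 - 1 - 2 rooted at 0:
# euler_tour_root([(0, 1), (1, 2)], 3) == [0, 0, 1]
```

Every step here maps to a bulk data-parallel primitive (sort, scatter, list ranking), which is why the pipeline's depth is O(log n) rather than O(D).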

PR‑RST
PR‑RST also follows a hook‑compress loop but integrates rooting by reversing the parent‑child direction along the path between a newly attached vertex u and the root r of the component it joins. Each vertex stores a “special‑ancestor” array of size O(log n) that contains ancestors at powers‑of‑two distances, built during pointer‑jumping. Using these ancestors, the algorithm can identify all vertices on the path from u to r in O(log n) parallel iterations. The authors introduce an onPath flag array to record which vertices lie on the path during the current iteration; a second kernel then flips parent pointers along the flagged edges. This design avoids serial traversals, maintains coalesced memory accesses, and keeps the entire reversal phase data‑parallel. As with GConn, multiple pointer‑jump steps are batched before synchronization.
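A sequential model of the reversal phase (hedged Python sketch; the names and the log_n parameter are illustrative, and each inner loop is one data-parallel GPU kernel in the actual algorithm):

```python
def build_special_ancestors(parent, log_n):
    """anc[k][v] is v's ancestor at distance 2**k (the "special-ancestor"
    table), built by log_n pointer-jumping/doubling rounds."""
    anc = [parent[:]]
    for _ in range(1, log_n):
        prev = anc[-1]
        anc.append([prev[prev[v]] for v in range(len(parent))])
    return anc

def reverse_path(parent, u, log_n):
    """PR-RST-style path reversal sketch: flag every vertex on the path
    from u to the root (the onPath array), then flip the parent pointers
    along the flagged edges so that u becomes the new root."""
    n = len(parent)
    anc = build_special_ancestors(parent, log_n)
    # Doubling-based path marking: in round k every flagged vertex flags
    # its 2**k-th ancestor, covering the whole path in O(log n) rounds.
    on_path = [False] * n
    on_path[u] = True
    for k in range(len(anc)):
        for v in [w for w in range(n) if on_path[w]]:  # parallel on a GPU
            on_path[anc[k][v]] = True
    # One data-parallel kernel flips each flagged edge (v -> parent[v]).
    new_parent = parent[:]
    for v in range(n):
        p = parent[v]
        if on_path[v] and p != v:
            new_parent[p] = v
    new_parent[u] = u
    return new_parent

# Chain 0 <- 1 <- 2 <- 3 rerooted at 3:
# reverse_path([0, 0, 1, 2], 3, 2) == [1, 2, 3, 3]
```

Because the path forms a single ancestor chain, each flipped pointer has exactly one writer, so the flip kernel needs no atomics.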

Experimental Methodology
All implementations were compiled with NVCC 12.9 (‑O3, ‑arch=sm_89) and run on an NVIDIA L40S GPU (Ada Lovelace, SM 89). The benchmark suite comprises more than 30 real‑world graphs, ranging from small web graphs (≈0.7 M vertices) to massive social networks (≈18 M vertices, 523 M edges) and synthetic Kronecker graphs with diameters up to 553 k. For each dataset, one warm‑up run was performed, followed by five timed runs; the median runtime is reported. The authors measure total execution time, the depth of the resulting spanning tree, and the number of kernel launches.

Results
The key findings are:

  1. Performance – GConn + Euler consistently outperforms both BFS and PR‑RST. On average it is more than ten times faster than BFS, and on the worst‑case high‑diameter road network it achieves a 300× speedup. PR‑RST is competitive on medium‑size graphs but lags behind GConn on the largest instances.

  2. Diameter Sensitivity – BFS runtime grows roughly linearly with the depth of the spanning tree, confirming the theoretical Θ(D) step cost. In contrast, GConn’s runtime remains stable across graphs with diameters ranging from 14 to over 500 k, because its hooking/compression phases expose parallelism independent of graph depth.

  3. Tree Depth Trade‑off – While GConn produces deeper trees (often an order of magnitude deeper than BFS), the depth does not affect its construction time. However, downstream algorithms that rely on shallow trees (e.g., certain parallel DFS‑based procedures) may need additional re‑balancing.

  4. Euler‑Tour Overhead – The authors demonstrate that the perceived overhead of constructing Euler tours—historically considered a bottleneck on PRAM models—is negligible on modern GPUs thanks to high‑throughput sorting (CUB) and efficient list‑ranking kernels.
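The list-ranking kernel mentioned in finding 4 is typically Wyllie-style pointer jumping. A sequential model of the doubling rounds (illustrative sketch, not the paper's kernel) shows why it needs only O(log n) launches:

```python
def list_rank(succ):
    """Wyllie-style list ranking sketch: each doubling round adds the
    rank of the node's current successor and jumps the successor
    pointer, so a list of length n is ranked in O(log n) rounds (one
    GPU kernel launch per round). The tail has succ[i] == i."""
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    nxt = succ[:]
    for _ in range(max(1, (n - 1).bit_length())):
        new_rank, new_nxt = rank[:], nxt[:]
        for i in range(n):            # data-parallel on the GPU
            if nxt[i] != i:
                new_rank[i] = rank[i] + rank[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]
        rank, nxt = new_rank, new_nxt
    return rank  # distance from each node to the tail

# List 0 -> 1 -> 2 -> 3 (tail): list_rank([1, 2, 3, 3]) == [3, 2, 1, 0]
```

Each round is a memory-bound scatter/gather, which is exactly the kind of operation that modern GPU bandwidth makes cheap.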

Discussion
The paper argues that the long‑standing dominance of BFS for RST construction on GPUs is more a matter of historical convenience than of performance optimality. Connectivity‑based methods, despite higher theoretical work complexity (O((V + E) log V)), achieve superior wall‑clock times because their logarithmic depth aligns well with the massive parallelism of GPUs and because the extra work can be overlapped with memory‑bound operations. The deeper trees produced by GConn raise interesting questions about the impact on subsequent graph algorithms; the authors suggest that post‑processing steps such as tree compression or re‑rooting could mitigate any adverse effects.

Future Work
The authors outline several directions: (i) hybrid schemes that combine the fast convergence of PR‑RST’s path reversal with GConn’s robust hooking, (ii) GPU‑friendly tree‑depth reduction techniques (e.g., parallel tree contraction), (iii) evaluation on alternative GPU architectures (AMD, Intel) and on emerging heterogeneous platforms, and (iv) integration of RST construction into full graph‑analytics pipelines to assess end‑to‑end benefits.

Conclusion
By systematically implementing and benchmarking both a modern connectivity framework with Euler‑tour rooting and a GPU‑adapted PR‑RST algorithm, the paper provides strong empirical evidence that O(log n) depth algorithms can dramatically outperform traditional BFS on GPUs, especially for high‑diameter graphs. This work encourages the graph‑processing community to reconsider the default reliance on BFS for rooted spanning tree construction and to explore connectivity‑centric designs that better exploit contemporary GPU hardware.

