HopFormer: Sparse Graph Transformers with Explicit Receptive Field Control


Graph Transformers typically rely on explicit positional or structural encodings and dense global attention to incorporate graph topology. In this work, we show that neither is essential. We introduce HopFormer, a graph Transformer that injects structure exclusively through head-specific n-hop masked sparse attention, without the use of positional encodings or architectural modifications. This design provides explicit and interpretable control over receptive fields while enabling genuinely sparse attention whose computational cost scales linearly with mask sparsity. Through extensive experiments on both node-level and graph-level benchmarks, we demonstrate that our approach achieves competitive or superior performance across diverse graph structures. Our results further reveal that dense global attention is often unnecessary: on graphs with strong small-world properties, localized attention yields more stable and consistently high performance, while on graphs with weaker small-world effects, global attention offers diminishing returns. Together, these findings challenge prevailing assumptions in graph Transformer design and highlight sparsity-controlled attention as a principled and efficient alternative.


💡 Research Summary

The paper “HopFormer: Sparse Graph Transformers with Explicit Receptive Field Control” challenges two prevailing assumptions in graph‑transformer design: that explicit positional or structural encodings are required to inject graph topology, and that dense global self‑attention is essential for high performance. HopFormer demonstrates that neither is necessary. The authors propose a graph transformer that relies solely on head‑specific n‑hop masked sparse attention, preserving the vanilla transformer architecture without any positional embeddings, structural encodings, or architectural modifications.

Core Design

  1. Node‑Edge Tokenization via Augmented Incidence Graph: The original graph G = (V, E) is transformed into an augmented incidence graph G̃ = (Ṽ, Ẽ), in which each edge becomes an additional token. The token set Ṽ = V ∪ E thus contains both node tokens and edge tokens, and Ẽ connects each edge token to its two endpoint nodes. This representation treats nodes and edges uniformly as tokens while preserving the sparsity of the original graph (|Ẽ| = Θ(|E|)).
  2. Lightweight Input Projectors: Separate linear projections W_n and W_e map node features x_v ∈ ℝ^{d_v} and edge features e_e ∈ ℝ^{d_e} into a shared d‑dimensional space, producing token embeddings h_v and h_e. If edge attributes are missing, e_e is set to zero, so edge tokens contribute purely through topology.
  3. Head‑Specific n‑Hop Masks: Each attention head h is assigned a hop budget n_h ∈ ℕ. Using the adjacency matrix Ã of the augmented graph, the n_h‑hop reachability matrix M^{(h)} = 𝟙[I + Ã + Ã^2 + … + Ã^{n_h} > 0] is computed, where 𝟙[·] denotes the element‑wise indicator. Thus M^{(h)}_{ij} = 1 iff token j is reachable from token i within at most n_h incidence hops. During attention, the scaled dot‑product scores QKᵀ are computed only for pairs (i, j) with M^{(h)}_{ij} = 1, and the softmax normalization is restricted to this support. Consequently, the operation is genuinely sparse: its cost scales with nnz(M^{(h)})·d_h rather than (N + M)^2, where N = |V| and M = |E|.
  4. Standard Transformer Encoder: The model stacks L layers of the usual transformer encoder (multi‑head self‑attention, residual connections, layer norm, feed‑forward network). The only deviation is that the multi‑head attention module receives the set of head‑specific masks {M^{(h)}}. All other components remain unchanged, preserving the simplicity and proven training dynamics of vanilla transformers.
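The tokenization and masking steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the helper names and the tiny path graph are invented for the example.

```python
import numpy as np

def augmented_incidence_adjacency(num_nodes, edges):
    """Adjacency of the augmented incidence graph: each edge (u, v)
    becomes an extra token connected to its two endpoint nodes."""
    K = num_nodes + len(edges)              # node tokens + edge tokens
    A = np.zeros((K, K), dtype=np.int64)
    for k, (u, v) in enumerate(edges):
        e = num_nodes + k                   # index of the edge token
        A[u, e] = A[e, u] = 1
        A[v, e] = A[e, v] = 1
    return A

def hop_mask(A, n_hops):
    """Reachability mask: M[i, j] is True iff token j lies within
    n_hops incidence hops of token i (self-pairs always allowed),
    i.e. the element-wise indicator of I + A + ... + A^n > 0."""
    K = A.shape[0]
    reach = np.eye(K, dtype=np.int64)
    power = np.eye(K, dtype=np.int64)
    for _ in range(n_hops):
        power = power @ A
        reach = reach + power
    return reach > 0

# One hop budget per attention head (values chosen for illustration).
hop_budgets = [1, 2, 4]
A = augmented_incidence_adjacency(3, [(0, 1), (1, 2)])  # path 0 - 1 - 2
masks = [hop_mask(A, n) for n in hop_budgets]
```

Note that one hop in the original graph corresponds to two hops in the incidence graph (node → edge token → node), so in the example a 2‑hop mask lets node 0 attend to node 1 but not to node 2.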

Complexity and Interpretability
Because attention is masked before score computation, the computational complexity per head is linear in the number of allowed token pairs, which depends on graph topology and the chosen hop budget. For graphs exhibiting the small‑world property (average shortest‑path length ℓ(G) = O(log |V|) and high clustering coefficient), a small n_h (e.g., 1–3) already covers most informative interactions, yielding a dramatic reduction in FLOPs and memory. Moreover, each head’s receptive field is explicitly defined, making the model’s behavior interpretable: one can directly inspect which hops each head attends to and how information propagates across the graph.
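To make the complexity claim concrete, the sketch below measures nnz(M)/K² — the fraction of attention-score computations a hop mask retains — as the hop budget grows on a toy ring-plus-shortcuts graph. For brevity it masks the original graph directly rather than the augmented incidence graph, and the graph size and shortcut count are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy small-world-style graph: a ring of 100 nodes plus 20 random shortcuts.
N = 100
A = np.zeros((N, N), dtype=np.int64)
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1
for _ in range(20):
    u, v = rng.choice(N, size=2, replace=False)
    A[u, v] = A[v, u] = 1

def mask_density(A, n_hops):
    """nnz(M) / K^2 for the n-hop mask: the fraction of pairwise
    score computations that survive masking."""
    K = len(A)
    reach = np.eye(K, dtype=np.int64)
    power = np.eye(K, dtype=np.int64)
    for _ in range(n_hops):
        power = power @ A
        reach = reach + power
    return float((reach > 0).mean())

densities = {n: mask_density(A, n) for n in (1, 2, 4, 8)}
```

Even at 8 hops the mask stays far below full density on this graph, which is the regime in which the per-head cost reduction is large.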

Theoretical Insight
The authors argue that positional encodings in sequential transformers can be replaced by causal masking, which implicitly encodes order. Analogously, in graphs, the n‑hop mask encodes topological distance, allowing the transformer to learn relational patterns without any handcrafted embeddings. They provide a formal analysis showing that the sparse attention operation is mathematically equivalent to applying a binary mask to the attention matrix before the softmax, while avoiding the O(K^2) cost of that masked‑dense formulation (where K = N + M).
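This equivalence is easy to check numerically: an O(K²) reference that sets disallowed scores to −∞ before the softmax produces the same output as a computation restricted to the mask's support. Both functions below are sketches written for illustration, not the paper's code.

```python
import numpy as np

def masked_dense_attention(Q, K_, V, mask):
    """O(K^2) reference: disallowed scores are set to -inf, so they
    receive zero weight after the softmax."""
    d = Q.shape[1]
    scores = Q @ K_.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ V

def sparse_support_attention(Q, K_, V, mask):
    """Equivalent computation touching only the nnz(mask) allowed pairs.
    Assumes every row of `mask` allows at least one key (e.g. the diagonal)."""
    d = Q.shape[1]
    out = np.zeros((Q.shape[0], V.shape[1]))
    for i in range(Q.shape[0]):
        js = np.nonzero(mask[i])[0]          # keys visible to query i
        s = Q[i] @ K_[js].T / np.sqrt(d)
        w = np.exp(s - s.max())
        w = w / w.sum()
        out[i] = w @ V[js]
    return out
```

Because the softmax of a −∞ score is exactly zero, restricting both the score computation and the normalization to the support changes nothing but the cost.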

Experimental Evaluation
Extensive experiments are conducted on node‑level tasks (Cora, Citeseer, PubMed, ogbn‑arxiv), graph‑level classification (MUTAG, PROTEINS, ogbg‑molpcba), and link prediction. Baselines include Graphormer, GraphGPS, Exphormer, and other recent graph transformers that use structural encodings and/or dense attention. Key findings:

  • Performance: HopFormer matches or exceeds baseline accuracy on the majority of datasets. On strongly small‑world datasets (e.g., citation networks), a 2‑hop mask achieves the best results, outperforming dense attention by up to 2 % absolute accuracy.
  • Efficiency: Memory consumption drops by 30‑70 % and training speed improves by 1.5‑3×, directly attributable to the reduced number of attention score calculations.
  • Stability: Performance variance across random seeds is lower for HopFormer, indicating that the explicit receptive‑field control mitigates sensitivity to hyper‑parameters such as the number of heads or layers.
  • Ablation: Removing edge tokens degrades performance on heterophilic graphs, confirming the benefit of unified node‑edge tokenization. Varying n_h shows diminishing returns beyond 3 hops for small‑world graphs, while for graphs with weaker small‑world characteristics (e.g., random geometric graphs) larger n_h yields modest gains, yet still far below the cost of full global attention.

Limitations and Future Work
The current approach requires pre‑defining hop budgets per head, which may need dataset‑specific tuning. The masks are static; adapting them dynamically during training or learning the hop budgets could further improve flexibility, especially for dynamic or evolving graphs. Additionally, the implementation relies on sparse matrix libraries; optimizing for GPU‑accelerated sparse kernels could yield further speedups.

Conclusion
HopFormer establishes that graph structure can be injected into a transformer solely through topology‑aligned, head‑specific n‑hop masks, eliminating the need for explicit positional/structural encodings and dense global attention. This yields a model that is simpler, more interpretable, and computationally efficient while achieving state‑of‑the‑art results across diverse graph benchmarks. The work suggests a paradigm shift toward sparsity‑controlled attention as a principled alternative to increasingly complex graph‑transformer architectures.

