Robust Node Affinities via Jaccard-Biased Random Walks and Rank Aggregation
Estimating node similarity is a fundamental task in network analysis and graph-based machine learning, with applications in clustering, community detection, classification, and recommendation. We propose TopKGraphs, a method based on start-node-anchored random walks that bias transitions toward nodes with structurally similar neighborhoods, measured via Jaccard similarity. Rather than computing stationary distributions, walks are treated as stochastic neighborhood samplers, producing partial node rankings that are aggregated using robust rank aggregation to construct interpretable node-to-node affinity matrices. TopKGraphs provides a non-parametric, interpretable, and general-purpose representation of node similarity that can be applied in both network analysis and machine learning workflows. We evaluate the method on synthetic graphs (stochastic block models, Lancichinetti-Fortunato-Radicchi benchmark graphs), k-nearest-neighbor graphs from tabular datasets, and a curated high-confidence protein-protein interaction network. Across all scenarios, TopKGraphs achieves competitive or superior performance compared to standard similarity measures (Jaccard, Dice), a diffusion-based method (personalized PageRank), and an embedding-based approach (Node2Vec), demonstrating robustness in sparse, noisy, or heterogeneous networks. These results suggest that TopKGraphs is a versatile and interpretable tool for bridging simple local similarity measures with more complex embedding-based approaches, facilitating both data mining and network analysis applications.
💡 Research Summary
The paper introduces TopKGraphs, a method for estimating node‑to‑node affinity in undirected, unweighted graphs. The core idea is to run multiple start‑node‑anchored random walks whose transition probabilities are biased by the Jaccard similarity between each candidate neighbor and the fixed start node. For a given start node s, the Jaccard similarity Jₛ(v)=|N(s)∩N(v)|/|N(s)∪N(v)| is computed once, and at every step from the current node u a neighbor v is chosen with probability proportional to Jₛ(v)+ε, that is, P(u→v) ∝ Jₛ(v)+ε with ε>0, normalized over the neighbors of u. Because ε is positive, every neighbor remains reachable even when its overlap with s is zero. This “Jaccard‑anchored” walk preferentially visits nodes whose local neighborhoods overlap with that of the start node, thereby propagating local similarity information across multiple hops.
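The walk described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `jaccard` and `anchored_walk`, the adjacency-dict representation, the default value of `eps`, and the use of open neighborhoods N(v) (excluding v itself) are all assumptions made for the example.

```python
import random

def jaccard(adj, s, v):
    """Jaccard similarity of the (open) neighbor sets of s and v."""
    union = adj[s] | adj[v]
    return len(adj[s] & adj[v]) / len(union) if union else 0.0

def anchored_walk(adj, s, T, eps=1e-3, rng=None):
    """One walk of at most T steps starting at s.

    At each step a neighbor v of the current node is drawn with
    probability proportional to J_s(v) + eps (assumed normalization).
    """
    rng = rng or random.Random()
    # J_s(v) depends only on the start node, so compute it once.
    j_s = {v: jaccard(adj, s, v) for v in adj}
    path, u = [s], s
    for _ in range(T):
        nbrs = sorted(adj[u])
        if not nbrs:  # isolated node: the walk cannot continue
            break
        weights = [j_s[v] + eps for v in nbrs]
        u = rng.choices(nbrs, weights=weights, k=1)[0]
        path.append(u)
    return path

# Toy graph (hypothetical): two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(anchored_walk(adj, s=0, T=5, rng=random.Random(0)))
```

With a small `eps`, walks started in one triangle tend to stay there, because nodes in the other triangle share few or no neighbors with the start node.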
Each walk of length T yields a first‑visit time tₖ(v) for every node visited in the k‑th walk. Nodes are ranked by these first‑visit times, with earlier visits receiving better (numerically smaller) ranks. Nodes never visited are appended in random order after all visited nodes, producing a complete ranking τ̃ₖ(v) for each walk. After K independent walks from the same start node, the rankings are combined by Borda aggregation into the score Bₛ(v) = (1/K)∑ₖ τ̃ₖ(v), the mean rank of v over the K walks. Smaller scores indicate stronger affinity to the start node. Collecting Bₛ(v) for all start nodes builds an asymmetric affinity matrix A, which can be row‑normalized, symmetrized, and optionally embedded into low‑dimensional Euclidean space via classical multidimensional scaling (MDS) for visualization or downstream tasks.
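As a rough sketch of the ranking and aggregation step, the snippet below turns walk paths into complete rankings and averages them. The names `walk_ranking` and `borda_scores` are hypothetical, and including the start node itself at rank 1 is an assumption that the summary does not settle.

```python
import random

def walk_ranking(path, nodes, rng):
    """Complete ranking from one walk, as {node: rank} with rank 1 = earliest.

    Visited nodes come first, in first-visit order; unvisited nodes
    follow in random order.
    """
    order = list(dict.fromkeys(path))  # keeps first occurrences only
    seen = set(order)
    unvisited = [v for v in nodes if v not in seen]
    rng.shuffle(unvisited)
    return {v: r for r, v in enumerate(order + unvisited, start=1)}

def borda_scores(paths, nodes, rng=None):
    """Mean rank of every node over all walks (smaller = stronger affinity)."""
    rng = rng or random.Random()
    totals = {v: 0.0 for v in nodes}
    for path in paths:
        for v, r in walk_ranking(path, nodes, rng).items():
            totals[v] += r
    return {v: t / len(paths) for v, t in totals.items()}

# Two hand-written walks from start node 0 on a 4-node graph (toy data).
paths = [[0, 1, 2, 1], [0, 2, 1, 0]]
print(borda_scores(paths, nodes=[0, 1, 2, 3], rng=random.Random(0)))
```

Running this once per start node and stacking the resulting score vectors as rows would give the asymmetric matrix described above.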
The authors provide a theoretical motivation: the observed graph G is modeled as a noisy observation of a latent true similarity graph G⋆, where edges are independently deleted with probability 1−p and spurious edges are added with probability q. While a single observed Jaccard coefficient is a biased estimator of the latent similarity, the random‑walk process aggregates many noisy local estimates along short paths, and the first‑visit order acts as a path‑based estimator of the latent Jaccard proximity. Borda aggregation further reduces variance, yielding a robust ordering even under substantial edge perturbations.
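The noise model can be simulated directly, which is one way to probe how stable a similarity measure is under perturbation. The sketch below assumes the reading given above, where a true edge survives with probability p (is deleted with probability 1−p) and each non-edge becomes a spurious edge with probability q. The function name `perturb` and the edge representation are illustrative choices.

```python
import random
from itertools import combinations

def perturb(nodes, edges, p, q, rng=None):
    """Noisy observation of an undirected graph.

    Each true edge is kept with probability p; each absent pair is
    added as a spurious edge with probability q. Pairs are independent.
    """
    rng = rng or random.Random()
    true_edges = {frozenset(e) for e in edges}
    observed = set()
    for u, v in combinations(nodes, 2):
        e = frozenset((u, v))
        prob = p if e in true_edges else q
        if rng.random() < prob:
            observed.add(e)
    return observed

# Toy latent graph (hypothetical): a 4-cycle.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
noisy = perturb(nodes, edges, p=0.8, q=0.1, rng=random.Random(0))
print(sorted(tuple(sorted(e)) for e in noisy))
```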
Evaluation is extensive. Synthetic benchmarks include stochastic block model (SBM) graphs, where intra‑community edge probability varies, and Lancichinetti‑Fortunato‑Radicchi (LFR) graphs with heterogeneous degree distributions. Real‑world experiments involve (i) k‑nearest‑neighbor (kNN) graphs built from the UCI Breast Cancer Wisconsin dataset, (ii) subgraphs of the Cora citation network, and (iii) a high‑confidence human protein‑protein interaction (PPI) network derived from STRING (edges with combined score ≥ 990). For each graph, affinity matrices from TopKGraphs are compared against the following baselines: Jaccard similarity, Dice similarity, Laplacian embedding, Personalized PageRank (PPR), and Node2Vec embeddings. Community detection is performed via Ward’s hierarchical clustering on each affinity matrix, and performance is measured using Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Adjusted Mutual Information (AMI). Node classification is evaluated with k‑nearest‑neighbor classifiers using the same affinity matrices.
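To make one of the clustering metrics concrete, here is a small standard-library sketch of the Adjusted Rand Index computed from the contingency counts of two labelings. It is meant only as an illustration of the metric; in practice a library routine such as scikit-learn's `adjusted_rand_score` would normally be used, and the function name `adjusted_rand_index` is chosen for this example.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings of the same items (1.0 = identical partitions)."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    # Pair counts within contingency cells, true clusters, predicted clusters.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # chance-level agreement
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate partitions, e.g. all singletons
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Relabeling clusters does not change the score.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The chance correction is what distinguishes ARI from the plain Rand index: random labelings score near zero regardless of the number of clusters.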
Results show that TopKGraphs consistently attains the highest or near‑highest ARI, NMI, and AMI across a wide range of SBM intra‑community densities, and performs competitively on LFR graphs despite their degree heterogeneity. On the kNN breast‑cancer graphs, TopKGraphs improves clustering of the underlying class labels and yields higher kNN classification accuracy than the baselines. In the sparse PPI network (119 nodes, 314 edges, average degree ≈ 5, edge density ≈ 0.045), TopKGraphs achieves the best clustering of disease‑associated gene groups, demonstrating robustness to sparsity and noise. Sensitivity analyses varying walk length T and number of walks K reveal that performance is stable; modest values (e.g., T = 5–10, K = 30–50) already produce strong results, underscoring the method’s low‑parameter nature.
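The reported PPI statistics are mutually consistent, as a quick check shows. The snippet assumes the standard definitions for a simple undirected graph, average degree 2m/n and density 2m/(n(n−1)), which the summary does not state explicitly.

```python
n, m = 119, 314  # nodes and edges of the PPI network, as reported

avg_degree = 2 * m / n            # each edge contributes to two degrees
density = 2 * m / (n * (n - 1))   # fraction of possible node pairs that are edges

print(f"average degree = {avg_degree:.2f}, density = {density:.3f}")
```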
The discussion highlights several advantages: (1) only two intuitive hyperparameters (walk length and number of walks) are required, eliminating the extensive tuning needed for methods like Node2Vec (which also needs p, q, window size, embedding dimension, etc.); (2) biasing transitions by Jaccard similarity directly incorporates interpretable local overlap information while still capturing multi‑hop structure; (3) rank‑based aggregation discards raw visitation frequencies, focusing on relative ordering, which is less sensitive to noisy edge perturbations; (4) the resulting affinity matrix is directly interpretable and can be used in any downstream graph‑learning pipeline without additional training. Limitations include the reliance on undirected, unweighted graphs and the exclusive use of Jaccard similarity for biasing, which may be suboptimal for heterogeneous or directed networks. Future work is suggested on (a) mixing multiple local similarity measures (e.g., Adamic‑Adar, resource allocation), (b) extending the framework to heterogeneous graphs with type‑specific biases, and (c) integrating the affinity matrix as a message‑passing prior in graph neural networks to improve heterophilic learning.
In summary, TopKGraphs offers a simple, non‑parametric approach that bridges elementary set‑based similarity (Jaccard) and more complex diffusion‑ and embedding‑based techniques (PPR, Node2Vec) by treating random walks as stochastic samplers of ranked neighborhoods and aggregating them via Borda rank aggregation. Empirical evidence across synthetic and real datasets indicates competitive performance, interpretability, and resilience to noise, making it a useful addition to the toolbox of network analysts and graph‑machine‑learning practitioners.