Fast and Scalable Complex Network Descriptor Using PageRank and Persistent Homology
The PageRank of a graph is a scalar function defined on the node set of the graph which encodes nodes centrality information of the graph. In this article, we use the PageRank function along with persistent homology to obtain a scalable graph descrip…
Authors: Mustafa Hajij, Elizabeth Munch, Paul Rosen
Mustafa Hajij, Department of Mathematics and Computer Science, Santa Clara, California, mhajij@scu.edu
Elizabeth Munch, Department of Computational Mathematics, Science, and Engineering, Michigan State University, Lansing, Michigan, muncheli@msu.edu
Paul Rosen, Department of Computer Science and Engineering, University of South Florida, Tampa, Florida, prosen@usf.edu

Abstract: The PageRank of a graph is a scalar function defined on the node set of the graph which encodes node centrality information. In this article we use the PageRank function along with persistent homology to obtain a scalable graph descriptor and utilize it to compare the similarities between graphs. For a given graph $G(V, E)$, our descriptor can be computed in $O(|E|\,\alpha(|V|))$, where $\alpha$ is the inverse Ackermann function, which makes it scalable and computable on massive graphs. We show the effectiveness of our method by utilizing it on multiple shape mesh datasets.

Index Terms: PageRank, Complex Networks Similarity, Topological Data Analysis, Graph Similarity

I. INTRODUCTION

The problem of studying similarity between graphs has attracted much attention recently in the pattern recognition and machine learning communities. One of the main challenges is to construct an effective similarity measure between graphs that takes into account the complexity of the underlying structure while still being computed efficiently.

In this work, we utilize the PageRank vector [2] in conjunction with a tool available in persistent homology [10] to define a graph descriptor. More specifically, we view the PageRank as a continuous scalar function [20] defined on the vertices of the graph and utilize this scalar function to induce a filtration as defined traditionally in the context of persistent homology.
We show that the persistence diagram induced by this filtration can be utilized for graph similarity.

Persistent homology provides a robust set of tools for the theoretical and practical capacity to understand the shape of data [4] in any number of dimensions and on multiple scales, placing the concept of shape, as applied to data analysis, on a solid mathematical foundation. On the other hand, the PageRank function of a graph stores information regarding the centrality of the underlying nodes. The filtration induced by the PageRank provides a method to decode the information encoded in this scalar function and store it in the persistence diagram. The latter, when combined with the bottleneck distance, can then be used for the graph similarity task.

Utilizing the PageRank vector has two main advantages. First, PageRank was originally designed to be computed efficiently on very large graphs. The efficiency of the PageRank vector has been studied extensively [14]. The PageRank vector has found many applications, including graph partitioning [1], image search [16], and citation analysis [17], among others. Second, as we will show here, as a function defined on the nodes of the graph the PageRank vector stores rich structural information about the underlying graph that can be utilized to detect the similarity between different graphs effectively.

Graph similarity lies within the realm of pattern recognition and machine learning [21]. Persistent homology provides unique information about graphs, uncovers insights, and helps determine which predictors are more related to the outcome. Persistent homology-based methods have shown excellent performance in several applications including pattern recognition on graphs [5], [8], [18], [19], [25], time-varying data [9], [13], and images [6], [11], [22], among others.

II. BACKGROUND

In this section, we give a brief review of persistent homology and the PageRank vector.
While the work here is concerned with graphs, we choose to introduce persistent homology for simplicial complexes, since our work can be generalized easily to more general domains. We assume the reader is familiar with the basics of simplicial homology.

A. Persistent Homology

Let $K$ be a simplicial complex. We denote the vertices of $K$ by $V(K)$. Let $S$ be an ordered sequence $\sigma_1, \cdots, \sigma_n$ of all simplices in $K$, such that for every simplex $\sigma \in K$ every face of $\sigma$ appears before $\sigma$ in $S$. Then $S$ induces a nested sequence of subcomplexes called a filtration $\phi$:

$$\emptyset = K_0 \subset K_1 \subset \dots \subset K_n = K.$$

A $d$-homology class $\alpha \in H_d(K_i)$ is said to be born at time $i$ if it appears for the first time as a homology class in $H_d(K_i)$. A class $\alpha$ dies at time $j$ if it is trivial in $H_d(K_j)$ but not trivial in $H_d(K_{j-1})$. The persistence of $\alpha$ is defined to be $j - i$. Persistent homology captures the birth and death events in a given filtration and summarizes them in a multi-set structure called the persistence diagram $P^d(\phi)$. Specifically, the persistence diagram of a filtration $\phi$ is a collection of pairs $(i, j)$ in the plane, where each $(i, j)$ indicates a $d$-homology class that is created at time $i$ in the filtration $\phi$ and killed entering time $j$.

Persistent homology can be defined given any filtration. For the purposes of this work, the input is a piecewise linear function $f : |K| \to \mathbb{R}$ defined on the vertices of the complex $K$. Furthermore, we assume the function $f$ has different values on different nodes of $K$. Any such function induces the lower-star filtration as follows. Let $V = \{v_1, \cdots, v_n\}$ be the set of vertices of $K$ sorted in non-decreasing order of their $f$-values, and let $K_i := \{\sigma \in K \mid \max_{v \in \sigma} f(v) \le f(v_i)\}$. The lower-star filtration is defined as:

$$F_f(K) : \emptyset = K_0 \subset K_1 \subset \dots \subset K_n = K. \qquad (1)$$

The lower-star filtration reflects the topology of the function $f$ in the sense that the persistent homology induced by the filtration (1) is identical to the persistent homology of the sublevel sets of the function $f$. We denote by $P_f(K)$ the persistence diagram induced by the lower-star filtration $F_f(K)$. See Figure 1. Furthermore, we denote by $P^k_f(K)$ the $k$-th persistence diagram induced by the lower-star filtration $F_f(K)$. In this work, we will only consider the 0-dimensional persistence diagram.

B. Computing the 0-persistence diagram of a lower-star filtration

For completeness of our treatment, we give a brief description of how to compute the 0-persistence diagram of the PageRank defined on the nodes of a graph $G$. The computation of the zero persistence diagram $P^0_f(G)$ can be done using the union-find data structure. We give the details next.

If $e = (u, v)$ is an edge of the graph $G$, then we extend the PageRank vector to $e$ by defining $PR(e) := \max(PR(u), PR(v))$. Let $V = \{v_1, \cdots, v_m\}$ be the node set of $G$. Let $E = \{e_1, \cdots, e_n\}$ be its edge set ordered with respect to their $PR$-values. The steps of the algorithm to compute the zero persistence diagram associated with the PageRank are given as follows.

The first step in the algorithm creates a connected component $C_i$ for each node $v_i$ in the graph $G$. Here we assume that the connected components are created using the disjoint set data structure. The second step of the algorithm looks at the edges of $G$ in ascending order with respect to their $PR$-values. For each $e = (u, v)$, we check if the nodes $u$ and $v$ of $e$ belong to two different sets.
If this is the case, then we merge the two connected components containing $u$ and $v$. Furthermore, we append to the list bars the pair $(\max(PR(c), PR(d)), PR(e))$, where $c$ and $d$ are the roots of the trees that contain the nodes $u$ and $v$ respectively in the disjoint set data structure. The algorithm returns the list bars representing the birth and death of the 0-features of the graph $G$ with respect to the PageRank function values.

Algorithm 1: Computing the Persistence Diagram induced by the PageRank
 1  Function computePageRankPD(G, PR : G → R)
 2      bars = [ ]
 3      U = ∅
 4      for each Node i in V(G) do
 5          U.make(i)
 6      Sort the edges of the graph G in ascending order using their PR-values.
 7      for each Edge e = (u, v) in E(G) do
 8          c ← U.get(u)
 9          d ← U.get(v)
10          if c ≠ d then
11              U.merge(c, d)
12              bars.append((max(PR(c), PR(d)), PR(e)))
13      return bars
14  End Function

The merge operation in line 11 of Algorithm 1 assumes the following merge order on the sub-trees in the disjoint set data structure. The tree with root $c$ is merged with the tree with root $d$ according to the $PR$ values of $c$ and $d$. Namely, if $PR(c) > PR(d)$ then we set $d$ to be the parent of $c$. Otherwise we set $c$ to be the parent of $d$. An illustrative example of running this algorithm on a 1-d function is given in Figure 2.

III. COMPUTING THE DISTANCE BETWEEN THE PERSISTENCE DIAGRAMS

Given two persistence diagrams, we measure the distance between them using the bottleneck distance. Namely, given two persistence diagrams $X$ and $Y$, let $\eta$ be a bijection between points in the diagrams. The bottleneck distance is defined as

$$W_\infty(X, Y) = \inf_{\eta : X \to Y} \sup_{x \in X} \| x - \eta(x) \|_\infty .$$

For technical reasons, we usually add to each persistence diagram the points on the diagonal, each counted with infinite multiplicity.
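For very small diagrams, the bottleneck distance defined above can be evaluated directly from the definition. The following Python sketch is only an illustration and not the implementation used in this work: the function names are hypothetical, and the exhaustive search over bijections is factorial in the diagram size, so practical computations rely on dedicated matching algorithms instead. The diagonal points are modeled by giving every off-diagonal point the option of being matched to the diagonal at the cost of its $L_\infty$ distance to the diagonal.

```python
from itertools import permutations


def linf(p, q):
    """L-infinity distance between two points (birth, death) in the plane."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def dist_to_diagonal(p):
    """L-infinity distance from (birth, death) to the closest diagonal point."""
    return abs(p[1] - p[0]) / 2.0


def bottleneck_bruteforce(X, Y):
    """Bottleneck distance between two tiny diagrams by exhaustive search.

    X and Y are lists of (birth, death) pairs. X is padded with len(Y)
    diagonal slots and Y with len(X) diagonal slots, so every bijection
    between the padded diagrams corresponds to a valid matching.
    """
    n, m = len(X), len(Y)
    size = n + m
    if size == 0:
        return 0.0
    if size > 8:
        # Assumption of this sketch: refuse inputs where (n + m)! is too large.
        raise ValueError("brute force is only meant for tiny diagrams")

    def cost(i, j):
        if i < n and j < m:      # point of X matched to point of Y
            return linf(X[i], Y[j])
        if i < n:                # point of X matched to the diagonal
            return dist_to_diagonal(X[i])
        if j < m:                # point of Y matched to the diagonal
            return dist_to_diagonal(Y[j])
        return 0.0               # diagonal matched to diagonal

    best = float("inf")
    for perm in permutations(range(size)):
        worst = max(cost(i, perm[i]) for i in range(size))
        best = min(best, worst)
    return best
```

In this sketch, a short bar in one diagram that has no counterpart in the other is absorbed by the diagonal, which is why low-persistence features contribute little to the distance.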
In our study we utilize the bottleneck distance to quantify the difference between two PR descriptors. Other distances, such as the Wasserstein distance, can also be employed.

Fig. 1: Left: a graph with a scalar function defined on its nodes. Middle: the star of the node v. Right: the lower-star of the vertex v.

Fig. 2: An example illustrating the computation of the persistence diagram of a scalar function defined on a 1-d simplicial complex $K$. We assume that we have a scalar function $f : V(K) \to \mathbb{R}$ defined on the vertex set $V(K)$ of $K$. We order the nodes and the edges using their $f$-values and process them with respect to this order. The values of the function $f$ hence induce a lower-star filtration, where at every stage in this filtration we introduce a vertex along with the edges that are connected to it and have lower $f$-values, if it has any.

A. PageRank

This work utilizes the lower-star filtration induced by the PageRank function [2]; more specifically, we consider a version applicable to undirected graphs [12]. The PageRank function $PR : V \to \mathbb{R}$ is defined for every vertex $v \in V$ by

$$PR(v) = \frac{1 - d}{|V|} + d \sum_{u \in N(v)} \frac{PR(u)}{|N(u)|}, \qquad (2)$$

where $N(v)$ is the set of neighbors of $v$ and $0 < d < 1$ is the damping factor, typically set at 0.85. Equation (2) can be solved efficiently by the power method [15]. See also [24] for an $O(\sqrt{\log n}/\epsilon)$ distributed algorithm, where $n$ is the number of nodes in the graph and $\epsilon$ is a fixed constant. A high PageRank score at $v$ typically means that $v$ is connected to many nodes which also have high PageRank scores. For our purpose, it is important to notice that the PageRank is a continuous function [20]. For example, Figure 3 illustrates the continuity of the function on the nodes of a random geometric graph.

IV. RUNNING TIME

The proposed descriptor can be computed in almost linear time. Once the graph data is loaded, the 0-dimensional persistence diagram can be computed using disjoint sets, which takes $O(|E|\,\alpha(|V|))$, where $\alpha$ is the inverse Ackermann function [7], an extremely slow growing function. The PageRank can be computed in sub-linear time. For instance, see [24] for an $O(\sqrt{\log n}/\epsilon)$ distributed algorithm, where $n$ is the number of nodes in the graph and $\epsilon$ is a fixed constant.

Fig. 3: Example of the PageRank vector computed on a geometric graph. Higher PageRank values indicate higher node centrality. In this figure the PageRank values are indicated by the size of the nodes as well as by their color (nodes with higher PR values have darker colors).

V. RESULTS

To validate the proposed method, we run experiments on three publicly available datasets. We use mesh datasets to make a visual comparison between similar graphs easier. In our experiments, we compute the persistence diagram of each mesh obtained from the lower-star filtration induced by the PageRank vector defined on that mesh. The pairwise bottleneck distance is then computed between every pair of persistence diagrams. Finally, the resulting discrete metric space is visualized using a 2d t-SNE projection [27].

The first dataset [26] consists of 60 meshes that are divided into 6 categories: cat, elephant, face, head, horse, and lion. Each category contains ten triangulated meshes. The result is reported on the left-hand side of Figure 4. The second dataset [23] consists of 30 meshes that are divided into 2 categories: kid A and kid B. The result is reported on the right-hand side of Figure 4. The third dataset [3] contains a total of 80 objects, including 11 cats, 9 dogs, 3 wolves, 8 horses, 6 centaurs, 4 gorillas, and 12 female figures. The vertex count for each object in this dataset is about 50K.
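The per-mesh descriptor used in these experiments combines Equation (2) with Algorithm 1. A minimal Python sketch of both steps is given below, under several assumptions that are not part of the original text: the graph is stored as a dictionary mapping each node to a list of its neighbors, every node has at least one neighbor, node labels are mutually comparable, and the names `pagerank` and `zero_persistence` are hypothetical. The union-find here follows only the merge order by function value described for Algorithm 1 and omits union by rank, so it is not claimed to attain the $O(|E|\,\alpha(|V|))$ bound.

```python
def pagerank(adj, d=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for Equation (2) on an undirected graph.

    adj maps each node to the list of its neighbors (assumed non-empty).
    """
    n = len(adj)
    pr = {v: 1.0 / n for v in adj}
    for _ in range(max_iter):
        new = {
            v: (1.0 - d) / n + d * sum(pr[u] / len(adj[u]) for u in adj[v])
            for v in adj
        }
        change = sum(abs(new[v] - pr[v]) for v in adj)
        pr = new
        if change < tol:
            break
    return pr


def zero_persistence(adj, f):
    """0-dimensional lower-star persistence pairs, following Algorithm 1.

    f maps each node to its function value (for example its PageRank).
    Returns a list of (birth, death) pairs, including pairs with
    birth == death, exactly as Algorithm 1 records them.
    """
    parent = {v: v for v in adj}

    def find(v):
        # Path halving keeps the trees shallow without changing their roots.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Each undirected edge once, ordered by f(e) = max of its endpoint values.
    edges = {tuple(sorted((u, v))) for u in adj for v in adj[u]}
    edges = sorted(edges, key=lambda e: (max(f[e[0]], f[e[1]]), e))

    bars = []
    for u, v in edges:
        c, d = find(u), find(v)
        if c != d:
            bars.append((max(f[c], f[d]), max(f[u], f[v])))
            # The root with the smaller value stays the root, so each root
            # always carries the minimum of its component.
            if f[c] > f[d]:
                parent[c] = d
            else:
                parent[d] = c
    return bars
```

Calling `zero_persistence(adj, pagerank(adj))` on one mesh graph yields the list of bars for that mesh; the diagrams of two meshes can then be compared with the bottleneck distance.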
In all three of our example datasets, one can clearly observe the effectiveness of the proposed descriptor at capturing the geometry of the underlying meshes. In particular, one can easily see that the meshes within the same category are clustered together. We also notice that meshes with similar topology tend to be closer than those with different topology. Observe for instance the clusters of horses and cats in Figure 5.

VI. CONCLUSION

In this work, we have illustrated how the PageRank can be utilized in conjunction with persistent homology to study graph similarity, and demonstrated our results on small datasets. In future work, we are planning to conduct a more thorough analysis with larger datasets. Moreover, the PageRank is typically defined on directed graphs. This feature of the PageRank vector can be utilized to induce a filtration that is sensitive to the directionality of the edges of a directed graph. We are planning to investigate this direction in the future.

VII. ACKNOWLEDGMENT

This work was supported in part by a grant from the National Science Foundation (IIS-1845204).

REFERENCES

[1] Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using pagerank vectors. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 475–486. IEEE, 2006.
[2] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7):107–117, 1998.
[3] Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Numerical geometry of non-rigid shapes. Springer Science & Business Media, 2008.
[4] Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009.
[5] C. J. Carstens and K. J. Horadam. Persistent homology of collaboration networks. Mathematical Problems in Engineering, 2013, 2013.
[6] James R Clough, Ilkay Oksuz, Nicholas Byrne, Veronika A Zimmer, Julia A Schnabel, and Andrew P King. A topological loss function for deep-learning based image segmentation using persistent homology. arXiv preprint arXiv:1910.01877, 2019.
[7] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT Press, 2009.
[8] Weinan E, Jianfeng Lu, and Yuan Yao. The landscape of complex networks. CoRR, abs/1204.6376, 2012.
[9] Herbert Edelsbrunner, John Harer, Ajith Mascarenhas, and Valerio Pascucci. Time-varying reeb graphs for continuous space-time data. In Proceedings of the twentieth annual symposium on Computational geometry, pages 366–372. ACM, 2004.
[10] Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. Topological persistence and simplification. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 454–463. IEEE, 2000.
[11] Kathryn Garside, Robin Henderson, Irina Makarenko, and Cristina Masoller. Topological data analysis of high resolution diabetic retinopathy images. PloS one, 14(5):e0217413, 2019.
[12] Vince Grolmusz. A note on the pagerank of undirected graphs. arXiv preprint arXiv:1205.1960, 2012.
[13] Mustafa Hajij, Bei Wang, Carlos Scheidegger, and Paul Rosen. Visual detection of structural changes in time-varying graphs using persistent homology. In 2018 IEEE Pacific Visualization Symposium (PacificVis), pages 125–134. IEEE, 2018.
[14] Taher Haveliwala. Efficient computation of pagerank. Technical report, Stanford, 1999.
[15] Joe D Hoffman and Steven Frankel. Numerical methods for engineers and scientists. CRC Press, 2018.
[16] Yushi Jing and Shumeet Baluja. Pagerank for product image search. In Proceedings of the 17th international conference on World Wide Web, pages 307–316, 2008.
[17] Nan Ma, Jiancheng Guan, and Yi Zhao. Bringing pagerank to the citation analysis. Information Processing & Management, 44(2):800–810, 2008.
Fig. 4: In both the left and the right figure, the PageRank vector is computed for each mesh in a dataset. We then utilize this function to compute the 0-persistence diagram associated with the lower-star filtration of the PageRank. Then we compute the bottleneck distance between every pair of persistence diagrams of that dataset. The final distance matrix is then visualized using a 2d t-SNE projection. The left figure shows the application of our method to a dataset consisting of 60 triangulated meshes divided into 6 categories [26]. The right figure shows the application of this method to the kids dataset [23], which consists of 30 meshes: 15 meshes of kid A and 15 meshes of kid B.

Fig. 5: Left: the dataset [3], which consists of a total of 80 objects, including 11 cats, 9 dogs, 3 wolves, 8 horses, 6 centaurs, 4 gorillas, and 12 female figures. The vertex count for this dataset is about 50,000. Right: the t-SNE projection obtained from the distance matrix of the pairwise bottleneck distances between the persistence diagrams associated with the lower-star filtration of the PageRank vectors.

[18] Giovanni Petri, Martina Scolamiero, Irene Donato, and Francesco Vaccarino. Networks and cycles: A persistent homology approach to complex networks. Proceedings European Conference on Complex Systems 2012, Springer Proceedings in Complexity, pages 93–99, 2013.
[19] Giovanni Petri, Martina Scolamiero, Irene Donato, and Francesco Vaccarino. Topological strata of weighted complex networks. PLoS ONE, 8(6):e66506, 2013.
[20] Luca Pretto. Analysis of web link analysis algorithms: The mathematics of ranking. In Maristella Agosti, editor, Information Access through Search Engines and Digital Libraries, pages 97–111. Springer, 2008.
[21] Saif Ur Rehman, Asmat Ullah Khan, and Simon Fong. Graph mining: A survey of graph mining techniques. In Seventh International Conference on Digital Information Management (ICDIM 2012), pages 88–92. IEEE, 2012.
[22] Alejandro Robles, Mustafa Hajij, and Paul Rosen. The shape of an image: A study of mapper on images. arXiv preprint, 2017.
[23] Emanuele Rodolà, Samuel Rota Bulo, Thomas Windheuser, Matthias Vestner, and Daniel Cremers. Dense non-rigid shape correspondence using random forests. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4177–4184, 2014.
[24] Atish Das Sarma, Anisur Rahaman Molla, Gopal Pandurangan, and Eli Upfal. Fast distributed pagerank computation. In International Conference on Distributed Computing and Networking, pages 11–26. Springer, 2013.
[25] Ashley Suh, Mustafa Hajij, Bei Wang, Carlos Scheidegger, and Paul Rosen. Persistent homology guided force-directed graph layouts. IEEE Transactions on Visualization and Computer Graphics, 26(1):697–707, 2019.
[26] Robert W Sumner and Jovan Popović. Deformation transfer for triangle meshes. ACM Transactions on Graphics (TOG), 23(3):399–405, 2004.
[27] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.