Targeted sampling from massive block model graphs with personalized PageRank

Fan Chen¹, Yini Zhang², and Karl Rohe¹

¹ Department of Statistics
² School of Journalism and Mass Communication
University of Wisconsin, Madison, WI 53706, USA

Abstract

This paper provides statistical theory and intuition for Personalized PageRank (PPR), a popular technique that samples a small community from a massive network. We study a setting where the entire network is expensive to thoroughly obtain or maintain, but we can start from a seed node of interest and "crawl" the network to find other nodes through their connections. By crawling the graph in a designed way, the PPR vector can be approximated without querying the entire massive graph, making it an alternative to snowball sampling. Using the degree-corrected stochastic block model, we study whether the PPR vector can select nodes that belong to the same block as the seed node. We provide a simple and interpretable form for the PPR vector, highlighting its biases towards high degree nodes outside of the target block. We examine a simple adjustment based on node degrees and establish consistency results for PPR clustering that allows for directed graphs. These results are enabled by recent technical advances showing the element-wise convergence of eigenvectors. We illustrate the method with the massive Twitter friendship graph, which we crawl using the Twitter API. We find that (i) the adjusted and unadjusted PPR techniques are complementary approaches, where the adjustment makes the results particularly localized around the seed node and (ii) the bias adjustment greatly benefits from degree regularization.
Keywords: Community detection; Degree-corrected stochastic block model; Local clustering; Network sampling; Personalized PageRank

1 Introduction

Much of the literature on graph sampling has treated the entire graph, or all of the people in it, as the target population. However, in many settings, the target population is a small community in the massive graph. For example, a key difficulty in studying social media is to gather data that is sufficiently relevant for the scientific objective. A motivating example for this paper is to sample the Twitter friendship graph for accounts that report and discuss current political events.¹ This corresponds to sampling and identifying multiple different communities, each a potentially small part of the massive network. In such an application, the graph is useful for two primary reasons. First, via link tracing, we can find potential members of the target population. Second, the graph connections are informative for identifying community membership. Throughout, we presume that the sampling is initiated around a "seed node" that belongs to the target community of interest.

* This research is supported in part by NSF Grants DMS-1612456 and DMS-1916378 and ARO Grant W911NF-15-1-0423.
¹ See our website http://murmuration.wisc.edu which does this.

Personalized PageRank (PPR) can be thought of as an alternative to snowball sampling, a popular technique for gathering individuals close to the seed node. For some $d \ge 0$, snowball sampling gathers all individuals who are no more than $d$ friends away from the seed. This process has two competing flaws for our application, both of which are addressed by PPR. First, snowball sampling fails to account for the density of common friendships. For example, perhaps $i$ and $j$ are both one friend removed from the seed, but $i$ has 10 friends in common with the seed, while $j$ only has 1 friend in common.
It seems natural to suppose that $i$ is closer than $j$ to the seed. Hence, the metric for snowball sampling can be misleading. Second, the snowball sample size grows very quickly with $d$. For example, under the "six degrees of separation" phenomenon [Watts and Strogatz, 1998, Newman et al., 2006], snowballing gathers the entire graph if $d \ge 6$. PPR gives a sample that is more localized around the seed node.

The PPR vector is defined as the stationary distribution of what we call a personalized random walk [Page et al., 1998]. At each step of the personalized random walk, the random walker returns to the seed node with probability $\alpha$, called the teleportation constant, and with probability $1 - \alpha$, the random walker goes to an adjacent node that is chosen uniformly at random. Consider the stationary distribution of this process as giving the inclusion probability for a sample of size 1. This is the PPR vector.

PPR naturally leads to a clustering algorithm, where the cluster is made up of the nodes with a large inclusion probability. To quickly approximate the PPR vector, Berkhin [2006] proposed an algorithm that only examines nodes with large inclusion probabilities (i.e., nodes near the seed). As such, PPR is particularly useful for its computational efficiency: the running time and the amount of data it requires are nearly linear in the size of the output cluster, which is typically much smaller than that of the entire graph. Due to the local nature of the algorithm, it can be used to study large graphs such as Twitter where the entire graph is not available, but where one can query to find the connections to any small set of nodes.

One way to conduct local clustering is by exploring and ranking the nearby nodes of a seed node [Andersen and Lang, 2006, Andersen and Peres, 2009, Alamgir and von Luxburg, 2010, Gharan and Trevisan, 2012].
Spielman and Teng [2004] pioneered local clustering by defining nearness as the landing probability of a random walk starting from the seed node. Their algorithm's guarantee was improved in follow-up work by Andersen et al. [2006], which proposed using an approximate PPR vector. Local algorithms can be applied recursively to solve more complicated problems such as graph partitioning ($k$-way partitions) [Spielman and Teng, 1996, Karypis and Kumar, 1998], and have many fruitful applications [Jeh and Widom, 2003, Macropol et al., 2009, Liao et al., 2009, Gupta et al., 2013, Gleich, 2015], particularly when it comes to sampling and studying massive graphs.

Along with the widespread use of PPR, there has been recent work to study its statistical estimation properties under a statistical model with latent community structure. Beyond the scope of local clustering, Kloumann et al. [2017] showed that the PPR vector is asymptotically equivalent to optimal linear discriminant analysis under the stochastic block model (SBM) [Holland et al., 1983], assuming a symmetry condition on the block structure. We add to this statistical understanding of PPR by providing a simple and more general representation for PPR vectors that allows for different block sizes, more than two blocks, degree heterogeneity, and directed edges. In order to understand the effects of heterogeneous node degrees, this paper uses the degree-corrected stochastic block model (DC-SBM) [Karrer and Newman, 2011] and examines when the PPR clustering recovers nodes within the same block as the seed node (local cluster). Breaking the symmetry that is imposed by Kloumann et al. [2017] reveals additional insight. In particular, given a seed node in the first block, we show that PPR is likely to contain high degree nodes outside of that block. We study an adjustment that was previously proposed in Andersen et al. [2006]. We show how this adjustment can correct for the bias. We illustrate these ideas with examples from the Twitter friendship graph.

Table 1: Top 15 handles by PPR clustering. Column names represent seed nodes, and the sampled nodes are ranked by PPR values, with the teleportation constant set to $\alpha = 0.15$ for all seeds.

Rank @CNN                  @BreitbartNews    @dailykos
1    CNN Breaking News     Alex Marlow       Hillary Clinton
2    CNN International     AndrewBreitbart   Stephen Colbert
3    Wolf Blitzer          Big Hollywood     Rachel Maddow MSNBC
4    Anderson Cooper       Big Government    Jake Tapper
5    Christiane Amanpour   James O'Keefe     Joy Reid
6    Pope Francis          Sean Hannity      Chris Hayes
7    Dr. Sanjay Gupta      Raheem            Emma González
8    CNNMoney              Joel B. Pollak    Markos Moulitsas
9    Jake Tapper           Ann Coulter       Maggie Haberman
10   Brian Stelter         Allum Bokhari     Sarah Silverman
11   CNN Newsroom          Ben Kew           Lin-Manuel Miranda
12   Dana Bash             Brandon Darby     Elizabeth Warren
13   CNN Politics          Noah Dulis        Jon Favreau
14   BBC Breaking News     Michelle Malkin   Michelle Obama
15   Brooke Baldwin        Nate Church       Bill Clinton

Through the PPR vector, the top 15 handles returned for each of the three seed nodes fit well with the characteristics of the seed nodes. They are popular/high-status handles that are either directly related to the seed nodes or aligned with their political leanings. This shows the effectiveness of clustering via the PPR vector. It also shows the PPR vector's preference for highly connected nodes.
1.1 An illustrative example in social media

Local clustering using PPR is particularly well suited to studying current political events on Twitter because (i) the accounts that discuss politics or current events are a small part of the entire Twitter graph, (ii) it is reasonable to believe that the accounts in our target population are well connected to one another in the Twitter friendship graph, and (iii) while the entire Twitter graph is not publicly available, the way that PPR (Algorithms 1 and 3) queries the graph matches the Twitter API protocol, which is the primary mode of access for researchers.

While we do not suppose that the Twitter friendship graph is sampled from a DC-SBM, Twitter does have all of the heterogeneities that our results identify as important. The Twitter friendship graph is composed of users who can freely follow others but will not necessarily be followed back, or friended. Such asymmetry between following and friending forms a directed graph where follower count indicates status: some popular/high-status nodes command millions of followers while the majority of nodes are followed by far fewer. The theoretical results in this paper suggest that such degree heterogeneities will make the PPR vector biased for detecting block memberships (Theorem 1). We propose a way to adjust for this bias (Algorithm 2) and show that it is a consistent estimator (Corollary 1).

Not surprisingly, this section demonstrates that PPR with and without the bias adjustment give fundamentally different results on the Twitter graph. However, depending on the application, the biases in the PPR vector might be advantageous. In this way, PPR with and without the bias adjustment are complementary, not competing, approaches.

Table 2: Top 15 handles by adjusted PPR (with regularization) sampling. Column names represent seed nodes, and the sampled nodes are ranked by adjusted PPR values, with the teleportation constant set to $\alpha = 0.15$ for all seeds.

Rank @CNN                  @BreitbartNews        @dailykos
1    PowerZ                Robert Two            Thanks
2    Elissa Weldon         Lee Peace             Catherine Daligga
3    Tess Eastment         Wynn Marlow           exmearden
4    Chris Dawson          Logan Churchwell      Faith Gardner
5    carol kinstle         Peter Schweizer       Andrew Thornton
6    erinmclaughlin        Breitbart Sports      UnreasonableFridays
7    Taylor Ward           Jon Fleischman        DKos Top Comments
8    Jennifer Z. Deaton    Nate Church           2016 relitigator
9    Pam Benson            Daniel Nussbaum       Daily Kos
10   amy entelis           Noah Dulis            Walter Einenkel
11   Grace Bohnhoff        Jon David Kahn        Candelaria Vargas
12   kate lazarus          Breitbart California  Mara Schechter
13   Newstron              Ken Klukowski         Emi Feldman
14   Becky Brittain        pam key               The Soulful Negress
15   CNN Ballot Bowl       Auntie Hollywood      Kim Soffen

After adjustment, PPR returns a more localized cluster. Instead of the highly visible public faces of the three seed organizations, the individuals in this table serve a central role in the internal organization (e.g., editors and writers). Depending on the application, one might prefer the results in Table 1 or Table 2.

To illustrate, Table 1 displays the top 15 handles ranked by the PPR vector (without adjustment) for three different seed nodes: @CNN, @BreitbartNews, and @dailykos, the Twitter accounts of three different types of media outlets that exhibit distinct political leanings (legacy broadcast news, online right-wing and online left-wing). For @CNN, all top 15 handles ranked by the PPR vector are its subsidiary accounts and its celebrity reporters and anchors (like Wolf Blitzer and Anderson Cooper), except for one account, Pope Francis, who enjoys an extremely large following. The top 15 handles for @BreitbartNews are a mixed bag of influential conservatives (like Sean Hannity and Ann Coulter) and Breitbart's editors/writers.
However, the top 15 handles returned for @dailykos by the PPR vector are all famous liberal personalities not directly affiliated with Daily Kos, with one exception, its founder Markos Moulitsas. Those people range from Democratic politicians to liberal media personalities and journalists, such as Hillary Clinton, Stephen Colbert, and Rachel Maddow. All the handles align with the characteristics of their respective media outlets, attesting to the clustering effectiveness. However, it is worth noting that the top handles ranked by the PPR vector tend to be popular handles with millions of followers. This shows the PPR vector's preference for high in-degree nodes.

In contrast, for each of the three seeds, adjusted PPR finds accounts that are more central to the internal functioning of these organizations. Table 2 lists those accounts. The bias adjustment also greatly benefits from a degree regularization [Qin and Rohe, 2013]. For @CNN, those handles include primarily its own staff/producers/journalists (like Elissa Weldon, Chris Dawson, and Grace Bohnhoff) and a freelance journalist (Tess Eastment). The pattern is similar for @BreitbartNews and @dailykos, whose top 15 handles include their own journalists and editors as well as related writers/campaigners/activists. The general pattern is that the adjustment returns editors, journalists and staff working within each media outlet. As such, the adjustment is useful for identifying a more localized cluster.

1.2 Main contributions

The main contributions of the paper are (a) a simple and interpretable form for the PPR vector and (b) a statistical guarantee for clustering with the adjusted PPR vector.

(a) This paper reveals a simple two-stage form of the PPR vector under the population (expectation) DC-SBM. Consider the $v$-th element of the PPR vector as the probability of sampling node $v$ in a sample of size 1 from the stationary distribution of the personalized random walk.
This inclusion probability is akin to stratified sampling:

    The inclusion probability for node $v$ is the product of two separate probabilities. First, the probability that the personalized random walk samples any node in $v$'s block. Second, the probability that the personalized random walk selects node $v$, conditional on sampling that block.

Both of these probabilities have simple expressions. If there are $K$ blocks in the graph, then the block-wise probability comes from the PPR vector of a graph with $K$ vertices, with edge weights specified by the "block connection matrix" in the DC-SBM. The second probability is proportional to the degree of node $v$. In addition to the population results, Theorem 2 demonstrates that when the graph is random, the PPR vector concentrates around its population (expectation) under certain conditions.

(b) This paper identifies two sources of bias of using a PPR vector for local clustering under the DC-SBM: the ancillary effects of heterogeneous node degrees and block degrees. With this finding, the paper examines a simple bias adjustment that remedies the two biases simultaneously and suggests conditions under which the adjusted PPR can be used to return the correct local cluster. In other words,

    PPR clustering with the adjustment achieves the precise identification of the local cluster, provided the graph is sufficiently dense.

These results establish statistical performance (consistency) of PPR clustering under the DC-SBM, in the sparse regime where the minimum expected degree grows logarithmically with the number of nodes in the network. Our results provide an element-wise perturbation bound for PPR vectors that allows the number of clusters to grow with the size of graphs, and they generalize to a directed graph setting as PageRank does.

The rest of the paper proceeds as follows. Section 2 formally introduces the PPR method and some of the known results.
Section 2 also introduces the degree-corrected stochastic block model. Section 3 gives a population analysis of the PPR clustering under directed block model graphs. Section 4 provides concentration results for the PPR vector when the graph is random and provides a statistical guarantee on the PPR local clustering method. Section 5 presents several numerical results showing the effectiveness of the PPR clustering. Section 6 illustrates the PPR clustering through the massive Twitter friendship graph and demonstrates the benefits of a smoothing step in the PPR adjustment.

2 Preliminaries

Throughout this paper, let $G = (V, E)$ denote an unweighted and connected graph, where $E$ is the edge set and $V$ is the set of vertices indexed by $1, \dots, N$. When $G$ is an undirected and unweighted graph, encode $E$ into a binary adjacency matrix $A \in \{0, 1\}^{N \times N}$ with $A_{uv} = A_{vu} = 1$ if and only if edge $(u, v)$ appears in $E$. Define a diagonal matrix $D = \mathrm{diag}(d_1, \dots, d_N)$ and the graph transition matrix $P$ as follows:

$$d_u = \sum_{v \in V} A_{uv} \quad \text{and} \quad P = D^{-1} A.$$

When $G$ is a directed graph, the adjacency matrix $A \in \{0, 1\}^{N \times N}$ accordingly becomes asymmetric with $A_{uv} = 1$ if and only if edge $(u, v) \in E$, and the graph transition matrix is defined as

$$P = [D^{\mathrm{out}}]^{-1} A,$$

where $D^{\mathrm{out}} = \mathrm{diag}(d^{\mathrm{out}}_1, \dots, d^{\mathrm{out}}_N)$ and $d^{\mathrm{out}}_u = \sum_{v \in V} A_{uv}$ is the number of edges leaving node $u$. In addition, define $D^{\mathrm{in}} = \mathrm{diag}(d^{\mathrm{in}}_1, \dots, d^{\mathrm{in}}_N)$, where $d^{\mathrm{in}}_v = \sum_{u \in V} A_{uv}$ is the number of edges pointing to node $v$.

2.1 Personalized PageRank and the local clustering algorithm

The personalized PageRank (PPR) is an extension of Google's PageRank [Brin and Page, 1998, Haveliwala, 2003]. To illustrate, consider a personalized random walk (originally called "surfing") on the graph $G = (V, E)$ with a seed node $v_0 \in V$.
At each step, the random walker either restarts from the seed node $v_0$ with probability $\alpha$ (called the teleportation constant) or continues the random walk from the current node to a neighbor chosen uniformly at random. The personalized PageRank vector $p \in [0, 1]^N$ is the stationary distribution of this process, and thus the solution to the equation

$$p^\top = \alpha \pi^\top + (1 - \alpha) p^\top P, \tag{1}$$

where $P$ is the graph transition matrix, and $\pi$ is the elementary unit vector in the direction of seed node $v_0$. Here $p$ is a column vector normalized by a positive scalar such that its elements sum to 1, and without loss of generality, we set $v_0 = 1$ and thus $\pi = (1, 0, \dots, 0)^\top$. In general, the preference vector $\pi$ does not have to be an elementary unit vector, but can be any probability distribution on $V$. For example, when $\pi = (1/N, \dots, 1/N)^\top$, PPR is equivalent to ordinary PageRank. Moreover, the PPR vector is a linear function of the preference vector. That is, let $p(\pi_1)$ and $p(\pi_2)$ be two PPR vectors corresponding to two preference vectors $\pi_1$ and $\pi_2$ respectively. Then, for a new preference vector that is a convex combination of the $\pi_i$, the resulting PPR vector is the same convex combination of the $p(\pi_i)$,

$$p(w_1 \pi_1 + w_2 \pi_2) = w_1 p(\pi_1) + w_2 p(\pi_2),$$

where $w_i \ge 0$ and $w_1 + w_2 = 1$.

Define $\Pi$ to be an $N \times N$ matrix with repeating rows $\pi^\top$, and let $Q = \alpha \Pi + (1 - \alpha) P$. Then $Q$ is the Markov transition matrix for the stochastic process and Equation (1) becomes $p^\top = p^\top Q$. Below are some useful properties of the PageRank vector (also see Haveliwala [2003], Jeh and Widom [2003] and Appendix A).

Proposition 1. For any fixed $\alpha \in (0, 1]$, the PPR vector $p$ is
(a) the left leading eigenvector of $Q$, associated with the simple eigenvalue 1; and
(b) the infinite sum of landing probabilities $\{(P^s)^\top \pi\}_{s=0}^{\infty}$ with weights $\phi = \{\alpha (1 - \alpha)^s\}_{s=0}^{\infty}$,

$$p^\top = \alpha \sum_{s=0}^{\infty} (1 - \alpha)^s \pi^\top P^s. \tag{2}$$

Berkhin [2006] gives an iterative algorithm based on Proposition 1 to approximate the PPR vector (that scales to large graphs); each update requires only neighborhood information of one visited vertex. A few lines of linear algebra show that the PPR vector is equivalent to the solution to the linear system

$$p^\top = \alpha' \pi^\top + (1 - \alpha') p^\top W,$$

where $W = (I + P)/2$ is the lazy graph transition matrix and $\alpha' = \alpha / (2 - \alpha)$. Using this fact, Algorithm 1 approximates the PPR vector in running time of order $O\left(\frac{1}{\epsilon \alpha}\right)$, by reaching at most $\frac{2}{\epsilon (1 - \alpha)}$ vertices. The following proposition gives a guarantee on the approximation error for this algorithm in terms of the tolerance parameter $\epsilon$ and the degrees of visited nodes.

Proposition 2 (Entrywise approximation error [Andersen et al., 2006]). Let $p$ be a PPR vector, and let $p^{\epsilon} \in [0, 1]^N$ be an approximate PPR vector computed by Algorithm 1 with a tolerance $\epsilon > 0$. For any vertex $u$ that is sampled in Algorithm 1,

$$|p_u - p^{\epsilon}_u| \le \epsilon d_u.$$

Proposition 2 ensures that for any fixed graph, the approximate PPR vector is arbitrarily close to the exact PPR vector, as long as the tolerance $\epsilon > 0$ is sufficiently small. Appendix A contains a proof of this proposition for completeness.

Given a seed node in the graph, Algorithm 2 uses the approximate PPR vector from Algorithm 1 and returns a set of nodes with the largest corresponding values in the adjusted personalized PageRank (aPPR) vector, which is defined as

$$p^*_v = \frac{p_v}{d_v}, \quad \text{for } v = 1, 2, \dots, N.$$

The aPPR vector was previously proposed in Andersen et al. [2006]. Algorithms 1 and 2 operate on undirected graphs. We will generalize them to directed graphs in Section 3 thanks to a simplified and interpretable form for the PPR vector.

Algorithm 1 Approximate PPR Vector (undirected) [Andersen et al., 2006]
Require: Undirected graph $G$, preference vector $\pi$, teleportation constant $\alpha$, and tolerance $\epsilon$.
  Initialize $p \leftarrow 0$, $r \leftarrow \pi$, $\alpha' \leftarrow \alpha / (2 - \alpha)$.
  while $\exists u \in V$ such that $r_u \ge \epsilon d_u$ do
    Uniformly sample a vertex $u$ satisfying $r_u \ge \epsilon d_u$.
    $p_u \leftarrow p_u + \alpha' r_u$.
    for $v : (u, v) \in E$ do
      $r_v \leftarrow r_v + (1 - \alpha') r_u / (2 d_u)$.
    end for
    $r_u \leftarrow (1 - \alpha') r_u / 2$.
  end while
Return: $\epsilon$-approximate PPR vector $p$.

Algorithm 2 PPR Clustering (undirected)
Require: Undirected graph $G$, seed node $v_0$, and the desired size of local cluster $n$.
  1: Calculate the approximate PPR vector $p$ (Algorithm 1).
  2: Adjust the PPR vector $p$ by node degrees, $p^*_v \leftarrow p_v / d_v$.
  3: Rank all vertices according to the adjusted PPR vector $p^*$.
Return: local cluster, consisting of the $n$ top-ranking nodes.

2.2 Stochastic block model

In the stochastic block model (SBM), each node belongs to one of $K$ blocks. The presence of each edge corresponds to an independent Bernoulli random variable, where the probability of an edge between any two nodes depends only on the block memberships of the two nodes [Holland et al., 1983]. The formal definition is as follows.

Definition 1. For a vertex set $V = \{1, 2, \dots, N\}$, let $z : \{1, 2, \dots, N\} \to \{1, 2, \dots, K\}$ partition the $N$ nodes into $K$ blocks, so $z(v)$ is the block membership of vertex $v$. Let $\mathbf{B}$ be a $K \times K$ matrix with all entries in $[0, 1]$. Under the SBM, the probability of an edge between $u$ and $v$ is $\mathbf{B}_{z(u) z(v)}$. That is,

$$A_{uv} \mid z(u), z(v) \overset{\text{ind.}}{\sim} \mathrm{Bernoulli}\left(\mathbf{B}_{z(u) z(v)}\right), \quad \text{for any } u, v \in \{1, 2, \dots, N\}.$$

Under the ordinary SBM, nodes in the same block have the same expected degree. One extension is the degree-corrected stochastic block model (DC-SBM), which adds a series of parameters ($\theta_v > 0$ for every vertex $v$) to create more heterogeneous node degrees [Karrer and Newman, 2011]. Let $\mathbf{B}$ be a $K \times K$ matrix with $\mathbf{B}_{ij} > 0$ for any $i$ and $j$. Then the probability of an edge between $u$ and $v$ is $\theta_u \theta_v \mathbf{B}_{z(u) z(v)}$. That is,

$$A_{uv} \mid z(u), z(v) \overset{\text{ind.}}{\sim} \mathrm{Bernoulli}\left(\theta_u \theta_v \mathbf{B}_{z(u) z(v)}\right), \quad \text{for } u, v \in \{1, 2, \dots, N\}.$$

Since the $\theta_v$'s are arbitrary up to a multiplicative constant which can be absorbed into $\mathbf{B}$, Karrer and Newman [2011] suggest imposing the constraint that the $\theta_v$'s sum to 1 within each block. That is, $\sum_{v : z(v) = i} \theta_v = 1$ for all $i = 1, 2, \dots, K$. With this constraint, $\mathbf{B}_{ij}$ represents the expected number of edges between blocks $i$ and $j$ if $i \ne j$, and twice that number if $i = j$. Throughout this paper, we presume $\mathbf{B}$ is positive definite² and all blocks are connected (we ignore any blocks that are isolated from the seed).

² This prevents scenarios where edges are unlikely within blocks and more likely between blocks. In such scenarios, local clustering needs to be reimagined cautiously. See Supplementary Materials S2 for additional details about generalizations.

The DC-SBM can be generalized to directed graphs by giving each node two parameters, $\theta^{\mathrm{in}}_v$ and $\theta^{\mathrm{out}}_v$, controlling its in-degree and out-degree respectively [Zhu et al., 2013]. Then, the presence of a directed edge from $u$ to $v$, given the block memberships, corresponds to an independent Bernoulli random variable,

$$A_{uv} \mid z(u), z(v) \overset{\text{ind.}}{\sim} \mathrm{Bernoulli}\left(\theta^{\mathrm{out}}_u \theta^{\mathrm{in}}_v \mathbf{B}_{z(u) z(v)}\right).$$

In order to make the model identifiable, we need to impose a structural constraint on the $\theta^{\mathrm{in}}$'s and $\theta^{\mathrm{out}}$'s, namely that both of them sum to 1 within each block,

$$\sum_{v : z(v) = i} \theta^{\mathrm{in}}_v = \sum_{v : z(v) = i} \theta^{\mathrm{out}}_v = 1, \quad \text{for any } i = 1, 2, \dots, K.$$

Because the off-diagonal elements of $\mathbf{B}$ can be interpreted as the expected number of edges between blocks, we define the block in-degree and block out-degree to be the total number of incoming edges and outgoing edges respectively, that is,

$$\mathbf{d}^{\mathrm{in}}_j = \sum_{i=1}^{K} \mathbf{B}_{ij}, \quad \text{and} \quad \mathbf{d}^{\mathrm{out}}_i = \sum_{j=1}^{K} \mathbf{B}_{ij}.$$

3 Population Analysis of PageRank

In this section, we analyze the PPR vector of the expected adjacency matrix under the DC-SBM.
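To make the directed DC-SBM of Section 2.2 concrete, the following is a minimal sketch, assuming NumPy, of how one might form the edge probabilities $\theta^{\mathrm{out}}_u \theta^{\mathrm{in}}_v \mathbf{B}_{z(u) z(v)}$ and draw one random adjacency matrix from them. The function names `sample_directed_dcsbm` and `normalize_within_blocks`, the block sizes, and the values in `B` are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np

def normalize_within_blocks(w, z, K):
    """Rescale positive weights so that they sum to 1 inside every block."""
    sums = np.bincount(z, weights=w, minlength=K)
    return w / sums[z]

def sample_directed_dcsbm(B, z, theta_out, theta_in, rng):
    """Draw a directed DC-SBM adjacency matrix (hypothetical helper).

    B         : (K, K) block connection matrix
    z         : (N,) block labels in {0, ..., K-1}
    theta_out : (N,) out-degree parameters, summing to 1 within each block
    theta_in  : (N,) in-degree parameters, summing to 1 within each block
    Returns the sampled 0/1 matrix A and the matrix of edge probabilities.
    """
    # Probability of the edge u -> v: theta_out[u] * theta_in[v] * B[z[u], z[v]].
    prob = np.outer(theta_out, theta_in) * B[np.ix_(z, z)]
    if prob.max() > 1.0:
        raise ValueError("edge probabilities exceed 1; rescale B")
    A = (rng.random(prob.shape) < prob).astype(int)
    return A, prob

rng = np.random.default_rng(0)
K, n_per_block = 2, 50
z = np.repeat(np.arange(K), n_per_block)          # first 50 nodes in block 0
theta_out = normalize_within_blocks(rng.uniform(0.5, 1.5, z.size), z, K)
theta_in = normalize_within_blocks(rng.uniform(0.5, 1.5, z.size), z, K)
B = np.array([[600.0, 100.0],                     # illustrative values only
              [100.0, 400.0]])
A, prob = sample_directed_dcsbm(B, z, theta_out, theta_in, rng)
```

Because the degree parameters are normalized within blocks, the probabilities in the $(i, j)$ block of `prob` add up to $\mathbf{B}_{ij}$, which is the sense in which $\mathbf{B}$ counts expected edges between blocks.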
The population analysis provides a simple representation of the PPR vector that motivates (1) the bias adjustment and (2) the generalization of Algorithms 1 and 2 to directed graphs.

We use three distinct typefaces to denote three classes of objects. Calligraphic typeface is given to the population version of any observable quantities in random graphs, such as the graph adjacency matrix and node degrees (e.g., Equation (3)). Normal typeface is given to unobserved model parameters, such as block membership and degree parameters $\theta_i$. Bold face is given to all block-level quantities and parameters like $\mathbf{B}$ and $\mathbf{d}^{\mathrm{out}}_i$.

Define the population graph adjacency matrix,

$$\mathcal{A} = \mathbb{E}\left(A \mid z(1), z(2), \dots, z(N)\right), \tag{3}$$

to be the expectation of the random adjacency matrix $A$. Let $Z \in \{0, 1\}^{N \times K}$ be the block membership matrix with $Z_{vi} = 1$ if and only if vertex $v$ belongs to block $i$, and define diagonal matrices $\Theta^{\mathrm{in}}$ and $\Theta^{\mathrm{out}}$ with entries $\theta^{\mathrm{in}}$'s and $\theta^{\mathrm{out}}$'s respectively. Then, under the directed DC-SBM with $K$ blocks and parameters $\{\mathbf{B}, Z, \Theta^{\mathrm{in}}, \Theta^{\mathrm{out}}\}$, $\mathcal{A} \in \mathbb{R}^{N \times N}$ can be compactly expressed as

$$\mathcal{A} = \Theta^{\mathrm{out}} Z \mathbf{B} Z^\top \Theta^{\mathrm{in}}.$$

Accordingly, we define the population node degrees and the population transition matrix,

$$\mathcal{d}^{\mathrm{in}}_v = \sum_{u \in V} \mathcal{A}_{uv}, \quad \mathcal{d}^{\mathrm{out}}_u = \sum_{v \in V} \mathcal{A}_{uv}, \quad \text{and} \quad \mathcal{P} = [\mathcal{D}^{\mathrm{out}}]^{-1} \mathcal{A},$$

where $\mathcal{D}^{\mathrm{in}}$ and $\mathcal{D}^{\mathrm{out}}$ are the diagonal matrices of the population node in-degrees $\mathcal{d}^{\mathrm{in}}_v$'s and out-degrees $\mathcal{d}^{\mathrm{out}}_u$'s respectively. Let $\mathcal{p}$ be the population PPR vector (i.e., the solution to the equation $\mathcal{p}^\top = \alpha \pi^\top + (1 - \alpha) \mathcal{p}^\top \mathcal{P}$) and let $\mathcal{p}^* = [\mathcal{D}^{\mathrm{in}}]^{-1} \mathcal{p}$ be the population aPPR vector. In addition, define the block transition matrix $\mathbf{P} \in \mathbb{R}^{K \times K}$ as

$$\mathbf{P} = [\mathbf{D}^{\mathrm{out}}]^{-1} \mathbf{B}, \tag{4}$$

where $\mathbf{D}^{\mathrm{in}} \in \mathbb{R}^{K \times K}$ and $\mathbf{D}^{\mathrm{out}} \in \mathbb{R}^{K \times K}$ are diagonal matrices of the block in-degrees $\mathbf{d}^{\mathrm{in}}_i$'s and out-degrees $\mathbf{d}^{\mathrm{out}}_i$'s.

3.1 A representation of PPR vectors

This section provides a simple and interpretable form for PPR vectors under the population DC-SBM.
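As a numerical companion to the population quantities defined above, here is a minimal sketch, assuming NumPy and a tiny toy parameter set, that assembles $\mathcal{A} = \Theta^{\mathrm{out}} Z \mathbf{B} Z^\top \Theta^{\mathrm{in}}$, the population transition matrix, and the block transition matrix of Equation (4), and then solves the PPR fixed-point equation as a linear system. The helper name `ppr_exact` and every parameter value are hypothetical. A direct solve is used only because the toy graph has five nodes; it is not the local approximation of Algorithm 1.

```python
import numpy as np

def ppr_exact(P, pi, alpha):
    """Solve p^T = alpha * pi^T + (1 - alpha) * p^T P for p (hypothetical helper)."""
    n = P.shape[0]
    # Transposing the fixed-point equation gives (I - (1 - alpha) P^T) p = alpha * pi.
    return np.linalg.solve(np.eye(n) - (1.0 - alpha) * P.T, alpha * pi)

# Toy directed DC-SBM parameters (illustrative values only).
z = np.array([0, 0, 0, 1, 1])                      # block label of each node
Z = np.eye(2)[z]                                   # N x K membership matrix
theta_out = np.array([0.5, 0.3, 0.2, 0.6, 0.4])    # sums to 1 within each block
theta_in = np.array([0.2, 0.2, 0.6, 0.7, 0.3])     # sums to 1 within each block
B = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Node-level population quantities.
A_pop = np.diag(theta_out) @ Z @ B @ Z.T @ np.diag(theta_in)
d_out = A_pop.sum(axis=1)            # population out-degrees (row sums)
d_in = A_pop.sum(axis=0)             # population in-degrees (column sums)
P_pop = A_pop / d_out[:, None]       # [D_out]^{-1} A

# Block-level quantities.
bd_out = B.sum(axis=1)               # block out-degrees
bd_in = B.sum(axis=0)                # block in-degrees
P_block = B / bd_out[:, None]        # block transition matrix, Equation (4)

alpha = 0.15
pi = np.zeros(z.size)
pi[0] = 1.0                          # seed node is the first node
p_pop = ppr_exact(P_pop, pi, alpha)  # population PPR vector
p_adj = p_pop / d_in                 # population aPPR vector
```

In this toy example each population in-degree equals $\theta^{\mathrm{in}}_v$ times the in-degree of the node's block, which is the link between the node-level degree adjustment and its block-level counterpart.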
To state this form, we define the "block-wise" PPR vector $\mathbf{p} \in \mathbb{R}^K$ to be the unique solution to the linear system

$$\mathbf{p}^\top = \alpha \boldsymbol{\pi}^\top + (1 - \alpha) \mathbf{p}^\top \mathbf{P}, \tag{5}$$

where $\boldsymbol{\pi} = Z^\top \pi \in \mathbb{R}^K$ is the block-wise preference vector and $\mathbf{P}$ is the block transition matrix in Equation (4). This treats the block connectivity matrix $\mathbf{B}$ as a weighted adjacency matrix of blocks and the block of the seed node as a seed block. To build up the relationship between PPR and the block-wise PPR, the next theorem gives an explicit form for PPR vectors, which also reveals the sources of bias for local clustering.

Theorem 1 (Explicit form of PPR vectors). Under the population directed DC-SBM with $K$ blocks and parameters $\{\mathbf{B}, Z, \Theta^{\mathrm{in}}, \Theta^{\mathrm{out}}\}$,
(a) the population PPR vector $\mathcal{p} \in \mathbb{R}^N$ has elements

$$\mathcal{p}_u = \theta^{\mathrm{in}}_u \mathbf{p}_{z(u)},$$

where $\mathbf{p}$ is the block-wise PPR vector in Equation (5), and
(b) the population aPPR vector $\mathcal{p}^* \in \mathbb{R}^N$ has elements

$$\mathcal{p}^*_u = \mathbf{p}^*_{z(u)}, \tag{6}$$

where $\mathbf{p}^* = [\mathbf{D}^{\mathrm{in}}]^{-1} \mathbf{p}$.

Theorem 1 demonstrates that the PPR vector $\mathcal{p}$ decomposes into block-related information ($\mathbf{p}$) and node-specific information ($\Theta$). Within each block, the PPR values are proportional to the node degree parameters $\theta_v$'s and sum up to the block-wise PPR value of the block. The proof of Theorem 1 (Appendix A) relies on a key observation (Appendix A.3) that the powers of the population transition matrix, $\mathcal{P}^s$ for $s = 1, 2, \dots$, have a similarly simple form and the node-specific information components (i.e., $z(v)$ and $\theta_v$) are invariant in $s$.

In order to justify the adjustment (Step 2) in Algorithm 2, we observe that the seed always has the highest population aPPR score. This turns out to be a key feature that enables the aPPR vector to recover a local cluster correctly, so we state it in the following lemma.

Lemma 1 (The largest entry of aPPR vector). Under the population DC-SBM, assume that the minimum expected degree is positive, that is, $\min_{v \in V} \mathcal{d}_v > 0$. Then, for any fixed $\alpha > 0$, the population aPPR vector $\mathcal{p}^*$ has the strictly largest entry corresponding to the seed node,

$$\mathcal{p}^*_{v_0} > \mathcal{p}^*_v, \quad \text{for any } v \ne v_0.$$

On the other hand, this is not generally true for a PPR vector.

When $\alpha = 0$ (i.e., no teleportation), the PPR vector becomes the limiting distribution of a standard random walk and all entries of the aPPR vector are equal (Appendix A).

Lemma 1 (applied to block-wise PPR vectors) and Theorem 1 together identify two sources of bias for PPR vectors and suggest a justification for the degree adjustment, which we discuss in order:

(i) Both node degree heterogeneity ($\Theta$) and block size imbalance ($\mathbf{D}$) confound the identification of the local cluster by the PPR vector. In particular, suppose vertex $v$ belongs to a block $z(v) = i$ other than 1. The PPR vector assigns it a score $\theta_v \mathbf{p}_i$, where $\mathbf{p}_i$ is the block-wise PPR of block $i$, and $\theta_v$ is the parameter specifically controlling the degree of $v$. Then, node $v$ may rank at the top if $\theta_v$ is large enough. Furthermore, Lemma 1 implies that $\mathbf{p}_1$ is not necessarily the largest due to block degree heterogeneity. Specifically, if block $i$ has an exceedingly high block degree, it is likely that $\mathbf{p}$ fails to down-rank node $v$ vis-a-vis those nodes of block 1.

(ii) Adjusted personalized PageRank removes the node and the block degree heterogeneity simultaneously, and perfectly recovers the local cluster. To see this, note that $\mathbf{p}^*$ is the adjusted version of the block-wise PPR vector. From Lemma 1, $\mathbf{p}^*_1$ is the largest entry of $\mathbf{p}^*$. From Equation (6), the aPPR vector assigns any vertex $v$ a score $\mathbf{p}^*_{z(v)}$. Hence, nodes with the highest value of $\mathcal{p}^*$ belong to block 1, which is precisely the desired local cluster.

Note that the PPR vector can still be biased for local clustering even under the classic SBM. To see this, set the matrix $\Theta$ to the identity matrix in Theorem 1.
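The SBM remark can be illustrated with a two-block toy computation. The following is a minimal sketch, assuming NumPy and values for $\mathbf{B}$ that are chosen purely for illustration and are not taken from the paper: it solves the block-wise system in Equation (5) with the seed in the first block, and then divides by the block in-degrees as in Theorem 1(b).

```python
import numpy as np

# Illustrative block connection matrix (an assumption, not from the paper):
# the second block has a much larger block degree than the seed block.
B = np.array([[1.0, 2.0],
              [2.0, 20.0]])
alpha = 0.15
pi_block = np.array([1.0, 0.0])      # the seed belongs to the first block

d_out = B.sum(axis=1)                # block out-degrees
d_in = B.sum(axis=0)                 # block in-degrees
P_block = B / d_out[:, None]         # block transition matrix, Equation (4)

# Solve p^T = alpha * pi^T + (1 - alpha) * p^T P, written as
# (I - (1 - alpha) P^T) p = alpha * pi.
p_block = np.linalg.solve(np.eye(2) - (1.0 - alpha) * P_block.T,
                          alpha * pi_block)
p_block_adj = p_block / d_in         # adjusted block-wise PPR

print("block-wise PPR:         ", p_block)
print("adjusted block-wise PPR:", p_block_adj)
```

In this toy setting the unadjusted block-wise PPR places more mass on the second block than on the seed block (roughly 0.71 versus 0.29), whereas after dividing by the block in-degrees the seed block has the larger score.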
In this case, the heterogeneous block degrees still confound the PPR vector (Section 5.2); there is generally no guarantee that $\bar{p}_1$ is the largest entry (due to Lemma 1), unless further symmetry conditions hold. Kloumann et al. [2017] use one such scenario. As a byproduct of our analysis, we extend their results to the DC-SBM under the symmetry conditions (see Supplementary Materials S3 to the paper).

3.2 Local clustering on directed graphs

In light of the clean form of PPR vectors under the DC-SBM, one can modify Algorithms 1 and 2 to operate on a directed graph. To this end, note that the transition matrix of a directed graph requires node out-degrees, hence Algorithm 1 examines only the edges leaving visited nodes. Consequently, it suffices to replace the $d_u$'s in Algorithm 1 by $d^{out}_u$'s (Algorithm 3). Proposition 2 applies to Algorithm 3 as well, and one can approximate the PPR vector provided the out-degrees of visited nodes can be observed and the tolerance parameter $\epsilon > 0$ is sufficiently small.

To perform local clustering on a directed graph, Algorithm 4 adjusts the approximate PPR vectors from Algorithm 3 by node in-degrees, that is,

$$p^*_v = \frac{p_v}{d^{in}_v}, \quad \text{for } v = 1, 2, \ldots, N.$$

Another option is regularized adjustment, which produces the regularized PPR (rPPR) vector,

$$p^{\tau}_v = \frac{p_v}{d^{in}_v + \tau}, \quad \text{for } v = 1, 2, \ldots, N,$$

where $\tau > 0$ is the regularization parameter. The regularized adjustment greatly stabilizes PPR clustering in practice by removing nodes with extremely low in-degrees (see Section 6 for more details). Adjusted PPR for directed graphs is a local algorithm so long as $d^{in}$ is available with a local query, as is the case for the Twitter friendship graph.

Algorithm 3 Approximate PPR Vector (directed)
Require: Directed graph $G$, preference vector $\pi$, teleportation constant $\alpha$, and tolerance $\epsilon$.
  Initialize $p \leftarrow 0$, $r \leftarrow \pi$, $\alpha' \leftarrow \alpha/(2 - \alpha)$.
  while $\exists\, u \in V$ such that $r_u \geq \epsilon d^{out}_u$ do
    Sample a vertex $u$ uniformly at random, satisfying $r_u \geq \epsilon d^{out}_u$.
    $p_u \leftarrow p_u + \alpha' r_u$.
    for $v : (u, v) \in E$ do
      $r_v \leftarrow r_v + (1 - \alpha') r_u / (2 d^{out}_u)$.
    end for
    $r_u \leftarrow (1 - \alpha') r_u / 2$.
  end while
Return: $\epsilon$-approximate PPR vector $p$.

Algorithm 4 PPR Clustering (directed)
Require: Directed graph $G$, seed node $v_0$, the desired size of the local cluster $n$, and an optional regularization parameter $\tau$.
  1: Calculate the approximate PPR vector $p$ (Algorithm 3).
  2: Adjust the PPR vector $p$ with:
     Option (a): node in-degrees, $p^*_v \leftarrow p_v / d^{in}_v$,
     Option (b): regularized node in-degrees, $p^{\tau}_v \leftarrow p_v / (d^{in}_v + \tau)$.
  3: Rank all vertices according to the aPPR vector $p^*$ or $p^{\tau}$.
Return: local cluster, the $n$ top-ranking nodes.

4 Personalized PageRank in Random Graphs

This section establishes several concentration results for the local clustering algorithm using the adjusted PPR vector (Algorithms 2 and 4) under the DC-SBM. The results show that if the graph is generated from the DC-SBM, then PPR clustering returns the desired local cluster with high probability. Since, in Algorithm 4, the calculation of PPR vectors relies only on node out-degrees and the adjustment step uses only node in-degrees, it is not difficult to distinguish $d^{in}$ and $d^{out}$. Thus, we state the results for undirected graphs for simplicity. One can draw the analogous conclusions for directed graphs by tracing the proof step by step.

We first present a useful tool that controls the entrywise errors of a PPR vector in random graphs. Recall that $p$ is the stationary distribution of the probability transition matrix $Q = \alpha \Pi + (1 - \alpha) P$. For any vector $x \in \mathbb{R}^n$, define the vector infinity norm as $\|x\|_\infty = \max_i |x_i|$. The following theorem bounds the entrywise error of the stationary distribution of $Q$.

Theorem 2 (Concentration of the PPR vectors).
Let $G = (V, E)$ be a graph of $N$ vertices generated from the DC-SBM with $K$ blocks and parameters $\{B, Z, \Theta\}$. Let $\hat{p}$ and $p$ be the PPR vectors corresponding to the random transition matrix $\hat{P}$ and its population version $P$, respectively, with the same teleportation constant $\alpha$. Let $\hat{p}^*, p^* \in [0, 1]^N$ be the adjusted PPR vectors of $\hat{p}$ and $p$. Let $\delta$ be the average expected node degree, that is, $\delta = \frac{1}{N} \sum_{v \in V} d_v$. Assume that $\rho = \max_{v \in V} d_v / \min_{v \in V} d_v$ is bounded by some finite constant and that

$$\delta > c_0 (1 - \alpha)^2 \log N, \tag{7}$$

for some sufficiently large constant $c_0 > 0$. Then, with probability at least $1 - O(N^{-5})$,

$$\frac{\|\hat{p} - p\|_\infty}{\|p\|_\infty} \leq c_1 (1 - \alpha) \sqrt{\frac{\log N}{\delta}}, \quad \text{and} \quad \frac{\|\hat{p}^* - p^*\|_\infty}{\|p^*\|_\infty} \leq c_2 (1 - \alpha) \sqrt{\frac{\log N}{\delta}},$$

for some sufficiently large constants $c_1, c_2 > 0$.

The proof of Theorem 2 invokes the elementary eigenvector perturbation bound for asymmetric matrices, an analog to the celebrated Davis-Kahan $\sin\Theta$ theorem [Davis and Kahan, 1970], and the novel leave-one-out technique due to Chen et al. [2019]. The detailed proof is given in the Supplementary Materials S1 to the paper. Theorem 2 demonstrates that if the expected average degree $\delta$ exceeds $(1 - \alpha)^2 \log N$ by a sufficiently large constant factor, then with high probability, the random aPPR vector concentrates around the population aPPR vector in every entry. In fact, the concentration statement holds for any valid preference vector $\pi$. Hence, the classic PageRank vector and some other variants also enjoy the entrywise error bounds, so long as they can be written as the solution to the linear system (1).

Next, we introduce a separation measure of the DC-SBM. Recall that one can conduct a local clustering task by selecting nodes ranked by the adjusted PPR vector $p^*$.
In the population version, this is equivalent to distinguishing between $\bar{p}^*_1$ and $\bar{p}^*_k$, for all $k = 2, 3, \ldots, K$, which also characterizes the distance from the desired local cluster (block 1) to its complement (the other blocks). The local cluster is identifiable in the sample only if they are sufficiently separated. Due to Lemma 1, we assume without loss of generality that the second block has the second highest value in the "block-wise" aPPR vector, that is, $\bar{p}^*_1 > \bar{p}^*_2 \geq \bar{p}^*_k$ for $k = 3, 4, \ldots, K$. Then, we define the separation measure $\Delta_\alpha \in (0, 1]$,

$$\Delta_\alpha = \frac{\bar{p}^*_1 - \bar{p}^*_2}{\bar{p}^*_1},$$

which turns out to be crucial in determining the sample complexity required to guarantee exact recovery. We remark that $\Delta_\alpha$ is an increasing function of the teleportation constant, hence the subscript $\alpha$.

With Theorem 2 and the separation measure, we give the following corollary, which bounds the accuracy of Algorithm 2 in terms of graph edge density.

Corollary 1 (Exact recovery by adjusted PPR vector). For any seed node, let $\hat{C} \subset V$ be the local cluster of $n$ nodes returned by Algorithm 2 with teleportation constant $\alpha$ and tolerance $\epsilon$, and let $C \subset V$ be the nodes in the seed node's block. Assume that $\rho < c_0$, $\epsilon \leq c_1 (1 - \alpha)\, \bar{p}^*_1 \sqrt{\log N / \delta}$, and that

$$\delta > 16 c_2 \left( \frac{1 - \alpha}{\Delta_\alpha} \right)^2 \log N, \tag{8}$$

for some sufficiently large constants $c_0, c_1, c_2 > 0$. If the desired size of the local cluster is $n = |C|$, then with probability at least $1 - O(N^{-5})$, we have $\hat{C} = C$.

The proof of Corollary 1 is presented in Appendix A. We make a few remarks:

(i) Corollary 1 demonstrates that Algorithm 2 works in a sparse scenario, where the number of edges is exceedingly small in proportion to the number of possible edges in the network.
To reach the entrywise control of the aPPR vector and the sufficient separation of the local cluster from the others, the corollary only requires the expected node degree $\delta$ to grow in proportion to the logarithm of the size of the network, $\log N$, with a factor that is fixed for any fixed teleportation constant $\alpha$. In other words, Algorithm 2 requires a sample complexity (the number of edges) of order $\left( \frac{1 - \alpha}{\Delta_\alpha} \right)^2 N \log N$.

(ii) The results show that $\alpha$ trades off the sampling complexity against the statistical performance of PPR clustering. To see this, rearrange condition (8),

$$\left( \frac{1 - \alpha}{\Delta_\alpha} \right)^2 < c' \frac{\delta}{\log N},$$

for some small enough constant $c' > 0$. As $\alpha$ increases, the left-hand side decreases to zero, making the condition more likely to hold. On the other hand, as $\alpha$ increases, the tolerance $\epsilon$ must decrease at rate $O(1 - \alpha)$ in order to guarantee an entrywise control of $p^{\epsilon}$ analogous to the form in Theorem 2 (Appendix A). More intuitively, if $\epsilon$ does not decrease, then as $\alpha$ goes to one, Algorithm 1 may terminate early without reaching all vertices in the desired local cluster. In sum, Algorithms 1 and 3 need at least $O\left( \frac{1}{\alpha (1 - \alpha)} \right)$ queries (see Supplementary Materials S2 for an example). This implies that one can approach the conditions in Corollary 1 by setting the teleportation constant sufficiently large, while the computational burden can increase as $\alpha \to 1$.

5 Simulation Studies

This section compares the PPR vector and the aPPR vector. The results show the effectiveness and robustness of the aPPR vector in detecting a local cluster. Experiment 1 uses the DC-SBM with a power-law degree distribution and investigates the effects of heterogeneous node degrees. Experiment 2 uses the SBM with unequal block sizes to study the influence of heterogeneous block degrees. Experiment 3 generates networks from the SBM with equal block sizes and varying edge density to examine the efficacy of PPR methods in sparse graphs.
In all simulations, we employ a block connectivity matrix $B$ with homogeneous diagonal elements, $B_{ii} = b_1$, and homogeneous off-diagonal elements, $B_{ij} = b_2$ for any $i \neq j$. Define the signal-to-noise ratio (SNR) to be the expected number of in-block edges divided by the expected number of out-block edges, that is, $b_1 / (b_2 (K - 1))$, where $K$ is the number of blocks. In particular, we set the SNR to 1.5 and choose the teleportation constant $\alpha = 0.15$ throughout the section. Additional simulation results (illustrating Theorem 2) are available in Supplementary Materials S2.

5.1 Experiment 1

This experiment illustrates how node degree heterogeneity affects the discriminant power of a PPR vector or an aPPR vector in identifying a local cluster. The results also illustrate the advantages of having multiple seed nodes. The $\Theta$ parameters of the DC-SBM are drawn from the power-law distribution with lower bound $x_{\min} = 1$ and shape parameter $\beta = 2.5$. A random network was sampled from the DC-SBM with $K = 3$, $N = 1500$ and equal block sampling proportions, $z(v) \overset{\text{i.i.d.}}{\sim} \mathrm{Multinomial}\left( \tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3} \right)$ for vertex $v = 1, 2, \ldots, N$, with the expected average degree ($\delta$) set to 105. The PPR vector is calculated with one or ten seeds randomly chosen from block one.

Figure 1 plots the PPR values (left two panels) and aPPR values (right two panels) of a random graph generated from the DC-SBM, excluding the seed node(s). The upper two panels in Figure 1 contrast PPR and aPPR when there is only one seed node, and the bottom two panels compare the two vectors when ten seed nodes are used. The vertices from the local block are colored blue and the others yellow. The nodes are ordered first by block, then by node degree parameter $\theta$ (larger values on the left). A horizontal line is drawn for each block indicating the median of the aPPR values within that block.
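The setup of Experiment 1 can be sketched in a few lines of standard-library Python. This is only a minimal illustration under stated assumptions, not the simulation code behind Figure 1: the function names (`sample_dcsbm`, `exact_ppr`, `adjusted_ppr`, `accuracy`) are hypothetical, the network is smaller than in the paper to keep the pure-Python loops fast, edge probabilities are capped at one (so the target average degree is only approximate), and the PPR vector is obtained by fixed-point iteration rather than by Algorithm 1.

```python
import random

def sample_dcsbm(n=300, k=3, b_in=3.0, b_out=1.0, avg_degree=30.0,
                 shape=2.5, seed=0):
    """Sample an undirected DC-SBM graph; returns (adjacency lists, block labels).

    b_in / (b_out * (k - 1)) = 1.5 matches the SNR used in this section.
    """
    rng = random.Random(seed)
    z = [rng.randrange(k) for _ in range(n)]  # equal block proportions
    # Power-law degree parameters: density proportional to x^(-shape), x >= 1.
    theta = [rng.paretovariate(shape - 1.0) for _ in range(n)]

    def weight(u, v):
        return theta[u] * theta[v] * (b_in if z[u] == z[v] else b_out)

    total = sum(weight(u, v) for u in range(n) for v in range(u + 1, n))
    scale = avg_degree * n / (2.0 * total)  # targets the expected average degree
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < min(1.0, scale * weight(u, v)):
                adj[u].append(v)
                adj[v].append(u)
    return adj, z

def exact_ppr(adj, seeds, alpha=0.15, iters=300):
    """Fixed-point iteration for p = alpha * pi + (1 - alpha) * P^T p."""
    n = len(adj)
    pi = [0.0] * n
    for s in seeds:
        pi[s] = 1.0 / len(seeds)
    p = pi[:]
    for _ in range(iters):
        nxt = [alpha * x for x in pi]
        for u, nbrs in enumerate(adj):
            if nbrs:
                share = (1.0 - alpha) * p[u] / len(nbrs)
                for v in nbrs:
                    nxt[v] += share
        p = nxt
    return p

def adjusted_ppr(p, adj):
    """aPPR: divide each PPR value by the node degree (0 for isolated nodes)."""
    return [p[v] / len(adj[v]) if adj[v] else 0.0 for v in range(len(adj))]

def accuracy(scores, z, target_block, exclude=()):
    """Share of target-block nodes among the top-n scores, n = block size."""
    cand = [v for v in range(len(z)) if v not in exclude]
    n_target = sum(1 for v in cand if z[v] == target_block)
    top = sorted(cand, key=lambda v: scores[v], reverse=True)[:n_target]
    return sum(1 for v in top if z[v] == target_block) / n_target

adj, z = sample_dcsbm()
seed_node = next(v for v in range(len(z)) if z[v] == 0 and adj[v])
p = exact_ppr(adj, [seed_node])
p_star = adjusted_ppr(p, adj)
print("PPR accuracy :", accuracy(p, z, 0, exclude={seed_node}))
print("aPPR accuracy:", accuracy(p_star, z, 0, exclude={seed_node}))
```

Passing ten seeds from block one to `exact_ppr` gives the multi-seed variant; the preference vector is then uniform over those seeds, which is an assumption of this sketch.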
With one seed node (upper two panels), the scatter plots have two clouds within each block. The upper cloud contains the immediate neighbors of the seed node. This separation disappears when multiple seed nodes are used (bottom two panels). As for the effect of node heterogeneity, the skewed distribution of PPR values in each block demonstrates the bias of PPR towards high-degree nodes, both inside and outside the seed node's block. In contrast, aPPR values are evenly distributed within blocks, verifying that the aPPR vector removes the effects of node degree heterogeneity.

Figure 1: Comparison of PPR (left two panels) and aPPR (right two panels) under the DC-SBM with one seed node (upper two panels) and ten seed nodes (bottom two panels); each panel plots the PPR or aPPR value against the node index. The local cluster is in blue and the other clusters are in yellow. Solid horizontal lines in the right panels indicate the median of aPPR values within each cluster.

5.2 Experiment 2

This experiment compares PPR and aPPR under the SBM with block degree heterogeneity. A number of random networks were sampled from the SBM with $K = 3$, $N = 900$, and geometric block sampling proportions,

$$z(v) \overset{\text{i.i.d.}}{\sim} \mathrm{Multinomial}\left( 1, b, b^2 \right), \tag{9}$$

with the proportions understood up to normalization, where $b \in \{1.0, 1.2, 1.4, 1.6, 1.8, 2.0\}$. When $b$ is larger, the block sizes become more unbalanced, which induces greater block degree heterogeneity. The block connectivity matrix $B$ is configured as described at the beginning of this section. The expected average degree ($\delta$) is set to 70. For each sampled network, the size of the first block is assumed known to Algorithm 2.
The PPR vector is calculated exactly, in place of the approximate PPR vector (Step 1), with one seed randomly chosen from the first block. The top panels of Figure 2 display the PPR and aPPR vectors on an example network with $b = 1.4$; the PPR vector shows a preference for the high-degree block (the third block) over the local cluster. Given the size of the first block, we measure the accuracy by the proportion of vertices belonging to the first block in the returned cluster. The bottom left panel of Figure 2 shows the accuracy of PPR and aPPR for six different values of $b$ (i.e., the geometric ratio in distribution (9)), where each point is the average over 100 sampled networks. The comparison demonstrates that the adjusted PPR vector corrects the bias of PPR caused by block heterogeneity. Moreover, block degree heterogeneity degrades the performance of both PPR and aPPR.

Figure 2: (Top) Simulated network generated from the classic SBM of 3 blocks with block degree heterogeneity; the panels plot the PPR and aPPR values against the node index. Three horizontal lines indicate the median of PPR and aPPR values within each cluster. (Bottom Left) Comparison of performance (accuracy against geometric ratio) for PPR (triangles with solid line) and aPPR (circles with dashed line) under the SBM with different levels of block degree heterogeneity. (Bottom Right) Comparison of performance (accuracy against expected average degree) for PPR and aPPR under the four-parameter SBM with different sparsity. Error bars are drawn using the standard deviation.
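A population-level counterpart of this experiment can be checked numerically. The sketch below is an illustration under assumptions, not the paper's code: it uses the expected adjacency matrix of an SBM with block sizes in ratio 1 : 2 : 4 (that is, $b = 2$), keeps self-pairs on the diagonal for simplicity, and uses hypothetical function names. It verifies three facts: summing the node-level PPR within each block reproduces the block-wise PPR of Equation (5), the aPPR values of non-seed nodes are constant within each block, and the seed attains the largest aPPR value (Lemma 1).

```python
def block_ppr(B, seed_block, alpha=0.15, iters=300):
    """Block-wise PPR: fixed point of p = alpha * e + (1 - alpha) * Pbar^T p,
    where Pbar is B with each row divided by its row sum."""
    k = len(B)
    deg = [sum(row) for row in B]
    p = [1.0 if j == seed_block else 0.0 for j in range(k)]
    for _ in range(iters):
        nxt = [alpha if j == seed_block else 0.0 for j in range(k)]
        for i in range(k):
            for j in range(k):
                nxt[j] += (1.0 - alpha) * p[i] * B[i][j] / deg[i]
        p = nxt
    return p

def population_check(sizes=(10, 20, 40), b_in=0.3, b_out=0.1,
                     alpha=0.15, iters=300):
    k = len(sizes)
    z = [j for j, m in enumerate(sizes) for _ in range(m)]
    n = len(z)
    prob = [[b_in if i == j else b_out for j in range(k)] for i in range(k)]
    # Expected adjacency matrix; self-pairs are kept (a simplifying assumption).
    A = [[prob[z[u]][z[v]] for v in range(n)] for u in range(n)]
    d = [sum(row) for row in A]
    seed = 0  # first node of block 0
    p = [1.0 if v == seed else 0.0 for v in range(n)]
    for _ in range(iters):
        nxt = [alpha if v == seed else 0.0 for v in range(n)]
        for u in range(n):
            w = (1.0 - alpha) * p[u] / d[u]
            row = A[u]
            for v in range(n):
                nxt[v] += w * row[v]
        p = nxt
    appr = [p[v] / d[v] for v in range(n)]
    # Block-level weights: expected number of edges between blocks.
    B = [[sizes[i] * sizes[j] * prob[i][j] for j in range(k)] for i in range(k)]
    p_bar = block_ppr(B, seed_block=0, alpha=alpha, iters=iters)
    block_sums = [sum(p[v] for v in range(n) if z[v] == j) for j in range(k)]
    return z, p, appr, p_bar, block_sums

z, p, appr, p_bar, block_sums = population_check()
print("block-wise PPR      :", [round(x, 4) for x in p_bar])
print("block sums of PPR   :", [round(x, 4) for x in block_sums])
print("seed has max aPPR   :", max(range(len(z)), key=lambda v: appr[v]) == 0)
```

Comparing the entries of `p_bar` across blocks for different size ratios gives a quick way to explore how block degree heterogeneity shifts PPR mass toward larger blocks.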
Note that aPPR outperforms PPR even when $b = 1$; this is likely because, even when nodes have equal expected degrees in the SBM, the actual node degrees are heterogeneous due to the randomness in the sampled graph. In a finite graph, this variability is enough to give aPPR an advantage over PPR. Asymptotically, this advantage should fade away [Kloumann et al., 2017].

5.3 Experiment 3

This experiment investigates the performance of PPR and aPPR under the SBM where there is no heterogeneity in the expected node degrees or block degrees. A number of random networks were sampled from the four-parameter stochastic block model, SBM($K = 3$, $N = 900$, $b_1 = 0.6$, $b_2 = 0.2$) [Rohe et al., 2011]. Under the four-parameter SBM, each of the $K$ blocks has equal size in expectation, $N/K$, and the probability of a connection between two nodes is $b_2$ if they are in two separate blocks, or $b_1$ if they are in the same one. In addition, the expected average degree varies, $\delta \in \{15, 30, 45, 60, 75, 90\}$. For every setting, the results are averaged over 100 samples of the network. The PPR vector is calculated with one seed randomly chosen from block one. The bottom right panel of Figure 2 contrasts the accuracy of PPR and aPPR across the six values of the expected average degree, showing that when the sampled graph has minimal degree heterogeneity, the adjusted PPR vector has only slightly higher accuracy than the PPR vector.

6 A Sample of Twitter

In this section, we provide a more detailed case study to illustrate the properties of different PPR vectors. We obtain a local cluster of nodes around the seed node @NBCPolitics (NBC Politics) in the Twitter friendship graph. In the Twitter graph, the nodes are called handles or accounts (e.g., @NBCPolitics), and if Twitter handle $i$ follows Twitter handle $j$, then we define this as a directed edge $(i, j)$ pointing from $i$ to $j$.
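With this edge convention, Algorithm 3 only needs, for each visited handle, the list of accounts it follows, and Algorithm 4 only needs follower counts. The sketch below illustrates both steps under stated assumptions: `out_neighbors` and `in_degree` are hypothetical callables standing in for local queries (they are not functions of any real API client), a first-in-first-out worklist replaces the uniform random choice of $u$ in Algorithm 3, and nodes with no out-edges are never pushed so that the loop terminates. It requires $\epsilon > 0$.

```python
from collections import deque

def approximate_ppr(out_neighbors, seed, alpha=0.15, eps=1e-7):
    """Push-style approximation of the PPR vector for a single seed node."""
    alpha_lazy = alpha / (2.0 - alpha)  # alpha' in Algorithm 3
    p = {}                # approximate PPR, stored for pushed nodes only
    r = {seed: 1.0}       # residual mass, initially all on the seed
    cache = {}            # out-neighbor lists of visited nodes
    queue, queued = deque([seed]), {seed}
    while queue:
        u = queue.popleft()
        queued.discard(u)
        if u not in cache:
            cache[u] = list(out_neighbors(u))  # one local query per visited node
        nbrs = cache[u]
        r_u = r.get(u, 0.0)
        if not nbrs or r_u < eps * len(nbrs):
            continue  # below the threshold, or no out-edges to push along
        p[u] = p.get(u, 0.0) + alpha_lazy * r_u
        r[u] = (1.0 - alpha_lazy) * r_u / 2.0
        share = (1.0 - alpha_lazy) * r_u / (2.0 * len(nbrs))
        for v in nbrs:
            r[v] = r.get(v, 0.0) + share
        for v in [u] + nbrs:  # re-examine every node whose residual changed
            if v not in queued:
                queued.add(v)
                queue.append(v)
    return p

def ppr_clustering(p, in_degree, n, tau=0.0):
    """Rank visited nodes by p_v / (d_in(v) + tau) and keep the top n."""
    scores = {}
    for v, p_v in p.items():
        denom = in_degree(v) + tau
        if denom > 0:
            scores[v] = p_v / denom
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Toy directed graph (made up for illustration): node -> nodes it points to.
toy = {0: [1, 2], 1: [0, 2], 2: [0, 3], 3: [0]}
toy_in = {0: 3, 1: 1, 2: 2, 3: 1}
p_toy = approximate_ppr(lambda u: toy[u], seed=0, eps=1e-9)
print(ppr_clustering(p_toy, lambda v: toy_in[v], n=2))           # aPPR ranking
print(ppr_clustering(p_toy, lambda v: toy_in[v], n=2, tau=1.0))  # rPPR ranking
```

Each push moves at least $\alpha' \epsilon$ of probability mass from the residual into `p`, so the number of pushes is bounded by $1/(\alpha' \epsilon)$; the worklist order affects which nodes are queried first but not this bound.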
Affiliated with NBC News, NBC Politics specializes in political news coverage and, as of December 2018, has over 470k followers on Twitter (in-degree) and follows 145 handles (out-degree). A brief look through @NBCPolitics' following list reveals that it follows a wide range of accounts, from TV programs, reporters and editors affiliated with NBC, to media accounts and journalists of other news outlets, as well as politicians. Data on following and handle profile information were collected through the Standard Twitter Search API. We queried the Twitter friendship graph starting from the seed node @NBCPolitics, using Algorithm 3 with teleportation constant $\alpha = 0.15$ and termination parameter $\epsilon = 10^{-7}$, ending up with 5840 surrounding handles. Through this exercise, we intend to illustrate the properties and applications of local clustering using PPR, aPPR and rPPR vectors, where we set the regularization parameter $\tau$ to 100.

We first present the results of PPR. As Table 3 shows, the top 30 handles (excluding @NBCPolitics itself) with the highest PPR values are a combination of (i) NBC's news-related programs such as NBC News, TODAY and Meet the Press; (ii) NBC's political reporters, anchors and editors, from well-known figures like Chuck Todd and Andrea Mitchell to less-known ones like Pete Williams (justice correspondent) and Mark Murray (senior political editor); (iii) other mainstream news outlets such as The Wall Street Journal, POLITICO, and TIME; and (iv) prominent public figures and politicians like Melania Trump, Bill Clinton and John McCain. In light of NBC's status as a mainstream news outlet and the political focus of @NBCPolitics, such results make sense. It must also be noted that all the top 30 handles are direct friends of @NBCPolitics and have at least tens of thousands of followers. The median follower count is 1.4 million, suggesting high in-degrees.
In fact, the pattern observed in the top 30 extends to the top 200 handles with the highest PPR values, which include NBC's own programs, journalists, editors and staff; fellow mainstream media outlets and their staff; and prominent public figures, politicians and government institutions (see Supplementary Materials S4). The median in-degree of the top 200 handles is around 184k, though there are four handles with fewer than one thousand followers. One important thing to notice is that among the top 200 handles, the first 139 are all directly followed by @NBCPolitics, with handles having high in-degrees generally ranked higher than those having low in-degrees (although @NBCPolitics follows 145 handles, 6 of them might have privacy protection that has prevented us from accessing their information). The remaining handles on the list, although not directly followed by @NBCPolitics, include five handles associated with NBC, from its news anchor Lester Holt to its News International President. However, the majority of those indirectly followed by @NBCPolitics are high-profile political and public figures (like President Trump, Vice President Pence, Hillary Clinton, and Stephen Colbert), government organizations (like the White House Office of Cabinet Affairs and the National Security Council), mainstream news outlets (like New York Times, CNN and AP) and well-known journalists (like John Dickerson and Anderson Cooper). We can thus conclude that the PPR vector is biased toward popular accounts followed directly by the seed node or indirectly by its friends, reflecting the popular Twitter handles followed by them. This property of the PPR vector can be harnessed by researchers interested in identifying the upstream of a handle, i.e., those Twitter elites who are followed by and might influence the seed node and, by extension, its followers.

Table 3: Top 30 handles of PPR with seed node @NBCPolitics and the teleportation constant α = 0.15 in December 2018.

 #  Name                     Followers  Description
--  -----------------------  ---------  -----------
 1  Melania Trump             11242283  This account is run by the Office of First Lady Melania Trump...
 2  The White House           17625630  Welcome to @WhiteHouse! Follow for the latest from President...
 3  Chuck Todd                 2032038  Moderator of @meetthepress and @nbcnews political director; ...
 4  NBC News                   6280551  The leading source of global news and info for more than 75 ...
 5  NBC Nightly News            962290  Breaking news, in-depth reporting, context on news from ...
 6  Andrea Mitchell            1737764  NBC News Chief Foreign Affairs Correspondent/anchor, Andrea ...
 7  Savannah Guthrie            881669  Mom to Vale & Charley, TODAY Co-Anchor, Georgetown Law. ...
 8  Joe Scarborough            2521215  With Malice Toward None
 9  MSNBC                      2261911  The place for in-depth analysis, political commentary and ...
10  Rachel Maddow MSNBC        9498076  I see political people...
11  Breaking News              9223158
12  NBC News First Read          53847  The first place for news and analysis from the @NBCNews Poli...
13  TODAY                      4276453  America's favorite morning show | Snapchat: todayshow
14  Meet the Press              566713  Meet the Press is the longest-running television show in history ...
15  The Wall Street Journal   16188842  Breaking news and features from the WSJ.
16  Pete Williams                70062  NBC News Justice Correspondent. Covers US Supreme Court, ...
17  Mark Murray                  97571  Mark Murray is the senior political editor for NBC News, ...
18  POLITICO                   3695835  Nobody knows politics like POLITICO. Got a news tip for us? ...
19  Katy Tur                    587474  MSNBC anchor @2pm, NBC News correspondent, author of NYT ...
20  Bill Clinton              10697521  Founder, Clinton Foundation and 42nd President of the United...
21  Kasie Hunt                  381704  @NBCNews Capitol Hill Correspondent. Host, @KasieDC, Sundays...
22  TIME                      15584815  Breaking news and current events from around the globe. Host...
23  Kelly O'Donnell             195765  White House Correspondent @NBCNews Veteran of Cap Hill & ...
24  John McCain                3181773  Memorial account for U.S. Senator John McCain, 1936-2018. To...
25  Peter Alexander             283522  @NBCNews White House Correspondent / Weekend @TODAYshow ...
26  Hallie Jackson              359099  Chief White House Correspondent / @NBCNews / @MSNBC Anchor ...
27  Kristen Welker              182244  @NBCNews White House Correspondent. Links and retweets ...
28  Carrie Dann                  37119  .@NBCNews / @NBCPolitics. RTs not endorsements.
29  Willie Geist                807536  Host @NBC #SundayTODAY, Co-Host @Morning Joe, Sunday ...
30  Morning Joe                 563650  Live tweet during the show! Links to must-read op-eds ...

Through the PPR vector, the top 30 handles returned for @NBCPolitics include NBC's news-related programs and celebrity reporters, comparable mainstream media outlets, as well as prominent political and public figures and institutions. Such results line up with its status as a mainstream political news source, demonstrating clustering effectiveness. Those Twitter handles tend to have millions of followers, showing the PPR vector's bias toward high in-degree.

In contrast, the aPPR vector up-weights handles that are much less popular (i.e., those with low in-degrees). As shown in Table 4, the 30 handles with the highest aPPR values include NBC's reporters, writers, editors, producers, and programs, all of whom have a few hundred to a few thousand followers. The 30 handles also include those unaffiliated with NBC, such as a director at a non-profit (Enroll America), the director of digital programming at National Geographic, and an editor at @CNNPolitics. All of them are professionally related to the seed node. This testifies to the applicability of aPPR for locating an idiosyncratic local cluster around a seed node. However, more than half (17) of the 30 handles are obscure and not directly followed by @NBCPolitics. The reason they appear on the list is probably that they have between one and a dozen followers (recall that aPPR divides by in-degree). In fact, 160 of the top 200 handles are not direct friends of @NBCPolitics; the median in-degree of the top 200 handles is merely 8 (Supplementary Materials S4). Those handles might have ended up on the list due to a combination of luck and, more importantly, their extremely low in-degrees. In this regard, noise can be introduced by the aPPR vector because it prioritizes handles with extremely low in-degrees that are possibly several degrees separated from the seed node.

Table 4: Top 30 handles of aPPR with seed node @NBCPolitics and the teleportation constant α = 0.15 in December 2018.

 #  Name                     Followers  Description
--  -----------------------  ---------  -----------
 1  Stephanie Palla                198  Enroll America National Regional Director...
 2  Jennifer Sizemore              386
 3  Alissa Swango                  441  Director of Digital Programming at @natgeo. All things ...
 4  Making a Difference            670  @NBCNightlyNews' popular feature profiles ordinary ...
 5  Ron Whittemore                   1
 6  Svante Stockselius               3
 7  Greg Martin                   1161  Political Booking Producer at @nbcnews @todayshow
 8  Area Man                         1  I am Area Man. I pwn your news feed.
 9  CELESTIA ROBINSON                2
10  NBC Field Notes               1390  NBC News correspondents and http://t.co/1eSopOQt8s ...
11  rob adams                        2
12  JL                               2
13  David Kelsey                     1
14  Hank Morris                      1
15  Jesse Marks                      1
16  Brayden Rainey                   1
17  child of the tiger               3  yet another activist twitter, fighting all those fun...
18  Julie Swango                     4
19  Author Dianne Kube               7  Dianne Kube is an Author with a passion, for family,...
20  Consider the Source              7
21  Adam Edelman                  2341  Political reporter @nbcnews. Wisconsin native, ...
22  Phil McCausland               2519  @NBCNews Digital reporter focused on the rural-urban...
23  Corky Siemaszko               2538  Senior Writer at NBC News Digital (former NY Daily ...
24  Sam Petulla                   2588  Editor @cnnpolitics Usually looking for datasets. ...
25  Ken Strickland                2693  NBC News Washington Bureau Chief
26  Mike Mullen                      7
27  Elyse PG                      2697  White House producer @nbcnews | @USCAnnenberg alum ...
28  A. Johnson                       2  Change your thoughts & you change your world. -Normal...
29  Steve Fenton                     4
30  Dobe Pitty Mami                 13

Through the aPPR vector, the top 30 handles returned for @NBCPolitics include some relevant handles (NBC's news team and their counterparts in other mainstream news organizations) and many obscure ones (handles with few followers and no profile descriptions). This results from the aPPR vector's bias toward extremely low degrees and introduces noise into the clustering results.

To reduce noise, we applied a regularization step to the aPPR vector to remove those "distant" and small nodes while preserving the close and relevant ones. In Table 5, the majority of the top 30 handles with the highest regularized aPPR (i.e., rPPR) values have three- or four-digit numbers of followers. Similar to the aPPR results, they include NBC's news crew. The difference is that the overwhelming majority (18) of the top 30 handles work at NBC. Some handles who work for other news organizations (e.g., Sam Petulla at @cnnpolitics and Emmanuelle Saliba at @Euronews) might have previously worked at NBC or have close connections with its news team. Even the four handles that are not directly followed by @NBCPolitics are interesting: they are non-profit organizations (NYC Clothing Bank and Voices United) and a news-related individual or organization (James Miklaszewski and Social Headlines). This pattern can also be observed in the top 200 handles, 72 of whom are directly followed by @NBCPolitics. The overwhelming majority of those directly followed by it are affiliated with NBC, comprising its day-to-day news team, who enjoy much less publicity than the celebrity reporters.
The remaining 128 of them, who are not directly followed by @NBCPolitics, also include 20 NBC journalists and staff, such as Ray Farmer (NBC News photographer) and Jim Miklaszewski (chief Pentagon correspondent for NBC News). Others are non-profits like Vets Helping Heroes and professionals from other news organizations or companies such as WSJ, NFL Network, and Microsoft, who might have worked for NBC or have close connections with it. Although there still appear to be obscure handles with few followers, they decrease significantly in number: the median in-degree of the top 200 handles is 340 (Supplementary Materials S4), a precipitous drop from that of the top PPR handles yet not too small compared with that of the top aPPR handles. We thus conclude that the regularized aPPR vector returns a local cluster with little noise, reflecting a seed node's close circles, either directly or indirectly related.

Table 5: Top 30 handles of rPPR with seed node @NBCPolitics and the teleportation constant α = 0.15 in December 2018.

 #  Name                     Followers  Description
--  -----------------------  ---------  -----------
 1  Stephanie Palla                198  Enroll America National Regional Director http://t.co/X6jJIE...
 2  Jennifer Sizemore              386
 3  Alissa Swango                  441  Director of Digital Programming at @natgeo. All things food....
 4  Making a Difference            670  @NBCNightlyNews' popular feature profiles ordinary people do...
 5  Greg Martin                   1161  Political Booking Producer at @nbcnews @todayshow
 6  NBC Field Notes               1390  NBC News correspondents and http://t.co/1eSopOQt8s reporters...
 7  Adam Edelman                  2341  Political reporter @nbcnews. Wisconsin native, Bestchester ...
 8  Phil McCausland               2519  @NBCNews Digital reporter focused on the rural-urban divide....
 9  Corky Siemaszko               2538  Senior Writer at NBC News Digital (former NY Daily News ...
10  Sam Petulla                   2588  Editor @cnnpolitics Usually looking for datasets. You can ...
11  Ken Strickland                2693  NBC News Washington Bureau Chief
12  Elyse PG                      2697  White House producer @nbcnews | @USCAnnenberg alum | LA kid ...
13  Hasani Gittens                3002  Level 29 Mage. Senior News Ed. @NBCNews. Sheriff of Nattahna...
14  Scott Foster                  3464  Senior Producer, Washington @NBCNEWS @TODAYshow
15  Zach Haberman                 3693  Lead Breaking News Editor, @NBCNews. Previously had other jobs...
16  Emmanuelle Saliba             4004  Head of Social Media Strategy @Euronews | Launched #THECUBE ...
17  Alex Johnson                  4371  News, data and analysis for @NBCNews; data geek; ...
18  Savannah Sellers              4637  News junkie. Host of NBC's "Stay Tuned" on Snapchat. Storyte...
19  NYC Clothing Bank              154  We distribute new, never-worn clothing and merchandise...
20  Shaquille Brewster            5362  @NBCNews Producer/Politics | @HowardU Alum | Journalist | Pol...
21  Joey Scarborough              6277  NBC News Social Media Editor. New York Daily News Alum. RTs ...
22  Jane C. Timm                  6478  @nbcnews political reporter and fact checker. More fun than ...
23  Anthony Terrell               6827  Emmy Award winning journalist. Political observer. Covered ...
24  NBC News Videos               7838  The latest video from http://t.co/xPyvMOTEF6
25  Libby Leist                   7946  Executive Producer @todayshow
26  Voices United                  310  Voices United is a non profit educational organization ...
27  Social Headlines               344  Daily roundup of top social media and networking stories.
28  James Miklaszewski             337  Writer, Photographer, Editor, Director, Producer, Newshound ...
29  Courtney Kube                 9494  NBC News National Security & Military Reporter...
30  Bob Corker                   10042  Serving Tennesseans in the U.S. Senate

Through the rPPR vector, the top 30 handles returned for @NBCPolitics include far fewer obscure handles with low in-degree and many more moderately connected nodes that are relevant to @NBCPolitics, including its reporters and editors and media professionals from other organizations.
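The contrast between the three rankings can be seen in a small worked example. The numbers below are made up for illustration and are not taken from the Twitter data; the account labels and the `ranking` helper are hypothetical. With $\tau = 100$, as in this section, the account with a single follower drops from first place under aPPR to last place under rPPR, while the moderately followed account moves to the top.

```python
# Hypothetical PPR values and follower counts (in-degrees) for three accounts.
accounts = {
    "celebrity": {"ppr": 1e-3, "in_degree": 1_000_000},
    "reporter": {"ppr": 1e-5, "in_degree": 2_000},
    "obscure": {"ppr": 5e-8, "in_degree": 1},
}

def ranking(tau=None):
    """Rank by PPR (tau=None) or by PPR / (in-degree + tau)."""
    def score(name):
        a = accounts[name]
        if tau is None:
            return a["ppr"]
        return a["ppr"] / (a["in_degree"] + tau)
    return sorted(accounts, key=score, reverse=True)

print(ranking())         # PPR:  ['celebrity', 'reporter', 'obscure']
print(ranking(tau=0))    # aPPR: ['obscure', 'reporter', 'celebrity']
print(ranking(tau=100))  # rPPR: ['reporter', 'celebrity', 'obscure']
```

Dividing by $d^{in} + \tau$ barely changes the score of an account with thousands of followers but shrinks the score of a one-follower account by a factor of about $\tau$, which is why the regularization mainly affects the low-degree end of the ranking.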
To evaluate the influence of the desired cluster size $n$ on the results based on different PPR vectors, we compare the local clusters of PPR, aPPR, and rPPR by varying the sample size. Define the in-and-out ratio of a local cluster $C \subset V$ as the proportion of edges inside $C$ among all edges connected to $C$,

\[ \frac{2 \times \sum_{u,v \in C} A_{uv}}{\sum_{u \in C} \left( d_u^{in} + d_u^{out} \right)}. \]

A higher in-and-out ratio indicates a more internally connected sample. Figure 3 (Right) shows the effectiveness of aPPR and rPPR in producing a compact local cluster. When the sample size is bigger than 100, the connectedness of the local cluster produced by rPPR stabilizes; for aPPR, the greater the sample size, the more densely connected the cluster it produces. However, PPR is easily susceptible to the inclusion of popular nodes. In this case, the sharp drop of the in-and-out ratio for PPR when the sample size reaches around 140 is caused by the inclusion of the highly popular accounts @POTUS (President Trump) and @realDonaldTrump (Donald J. Trump).

[Figure 3. Left panel: PPR value (y-axis) against number of followers (x-axis), both on log scales, with the top 200 PPR, aPPR, and rPPR boundaries. Right panel: in-and-out ratio (y-axis, log scale) against sample size (x-axis, 0 to 500) for PPR, aPPR, and rPPR.]

Figure 3: Left: an illustration of 5840 Twitter handles examined by Algorithm 3 and three samples of size 200 by PPR, aPPR, and rPPR. Each dot represents a user in Twitter. The blue dashed line delimits the top 200 handles by PPR vector; vertices above the line are PPR's sample. Similarly, the yellow dotdash line determines the sample returned by Algorithm 4 given n = 200; vertices above this boundary correspond to aPPR's sample. In particular, dots in purple stand for the sample of rPPR; the purple solid line shows the boundary of this sample. Right: The in-and-out ratio of local clusters identified by PPR, aPPR, and rPPR, as the sample sizes vary. A higher in-and-out ratio indicates a more internally connected cluster.

The PPR clustering is fairly robust to the choice of teleportation constant, regardless of the size of the local cluster. To illustrate this, we also performed the same pipeline of analysis with the seed @NBCPolitics while varying the value of α (e.g., 0.05, 0.25, and 1/3) in parallel. We observed that the local clusters returned by Algorithm 4 all share a large portion of members. For example, there are 280 (93.3%) overlapping members between two targeted samples of size n = 300, using α = 0.15 and 0.25 respectively. These results suggest a low sensitivity to the teleportation constant (see Supplementary Materials S2).

The left panel of Figure 3 depicts the behaviors of PPR, aPPR, and rPPR. Each handle queried in this sampling is displayed as a dot, with the y-axis representing the PPR value and the x-axis the number of followers (i.e., in-degree). Top handles with the highest PPR values are above the blue dashed line; they tend to concentrate on the right end of the x-axis and thus are biased toward high in-degrees. Top handles with the highest aPPR values are dots to the left of the yellow dotdash line; they gather on the left end of the x-axis and thus favor low in-degrees. Regularized aPPR, shown by purple dots, excludes the very low degree nodes and the very high degree nodes. As the empirical results show, these three vectors can be thought of as lenses through which we view the local structure of a given Twitter handle with varying foci, rendering high, moderate, and low in-degree blocks and serving different needs and purposes.

7 Discussion

This paper studies the PPR vector under the degree-corrected stochastic block model and PPR clustering in massive block model graphs.
We establish some consistency results for this method and examine its performance through an analysis of the Twitter friendship graph. As shown in the results, the PPR vectors with and without adjustment have distinct properties and can be used to effectively sample a massive graph for various purposes. However, there are limitations worthy of future investigation.

In Section 3, we provide a representation of the PPR vector under the DC-SBM and its extension to directed graphs. The result does not impose extra structural restrictions on the model parameters, except that B corresponds to a strongly connected "block-wise" graph. We consider a positive definite connectivity matrix in particular so that the notion of a local cluster is intuitive. In practice (and in many of our experiments, see Supplementary Materials S2), however, a PPR-type algorithm appears to continue working for a broader range of B (e.g., singular or indefinite), provided that the teleportation constant is sufficiently large (e.g., α > 0.1). It is not yet clear what minimum constraint on B is needed for the PPR clustering to function. In addition, the DC-SBM has its limits. For example, the model fails to capture either mixed block membership or popularity features, which are potentially informative in real-world networks. The behavior of a PPR vector under other extensions of the stochastic block model, such as the mixed membership stochastic block model and the popularity-adjusted block model, remains unknown [Airoldi et al., 2008, Sengupta and Chen, 2018]. Future studies on the PPR vector under these models could shed further light on the PPR clustering and offer more practical guidelines on its application.
In Section 4, we proved the consistency of the PPR clustering, requiring the average expected node degree to grow on the order of log N, which sits at the boundary between the theoretical guarantees and realistic observations. In contrast, scale-free networks such as the preferential attachment model [Barabási and Albert, 1999] have finite expected node degrees. Future investigations into variants of PPR that could overcome this limitation yet still ensure fine local cluster discovery would be particularly interesting and useful.

In Section 6, we introduce the regularized version of the adjusted PPR (rPPR) vector, with a series of empirical evidence showing its efficacy in targeted sampling. While the results appear promising, theoretical guarantees for this technique remain unexplored. For a mathematical analysis, one may resort to the techniques used in Le et al. [2016]. It was previously shown that the regularized graph Laplacian (or transition matrix) enjoys "nice" finite sample properties, which facilitate the consistency of many regularized spectral methods. It is thus a reasonable conjecture that rPPR vectors are also suitable for local clustering.

An R implementation of the PPR clustering is available at the authors' GitHub (https://github.com/RoheLab/aPPR).

Acknowledgments

This research is supported by NSF Grants DMS-1612456 and DMS-1916378 and ARO Grant W911NF-15-1-0423. Thank you to Yuling Yan and E. Auden Krauska for the helpful comments. Thank you to Alex Hayes for kindly advising on the software development.

A Technical Proofs

A.1 Proof of Proposition 1

Proof. We apply the Perron-Frobenius theorem for the first part [Perron, 1907, Frobenius et al., 1912], and complete the proof by construction. (a) First, notice that Q is a Markov transition matrix obtained by modifying G = (V, E) a little.
To this end, (i) shrink the weight of every existing edge by a factor of $1 - \alpha$, and (ii) add an edge weighted $\alpha$ between the seed node $v_0$ and every node in the graph. Then $Q$ represents the new graph $G' = (V, E')$, which is strongly connected by construction. Hence $Q$ is irreducible. The PPR vector $p$ is all-positive. To see this, note that the equation $p^\top = p^\top Q$ implies that $p$ is a stationary distribution for the standard random walk on $G'$. Since $G'$ is strongly connected, it follows that the stationary distribution must be all-positive. From the Perron-Frobenius theorem, the only all-positive eigenvector of a non-negative irreducible matrix is associated with the leading eigenvalue, which is 1 in our case. Since the leading eigenvalue of a non-negative irreducible matrix is simple, we conclude that $p$ is unique.

(b) We finish the proof by constructing an explicit form of the PPR vector. Let $R_\alpha = \alpha \sum_{s=0}^{\infty} (1-\alpha)^s P^s$. The infinite sum converges for $\alpha \in (0, 1]$. Then $p = R_\alpha^\top \pi$ satisfies the definition of the personalized PageRank vector,

\begin{align*}
\alpha \pi^\top + (1-\alpha) \pi^\top R_\alpha P
&= \alpha \pi^\top + (1-\alpha) \pi^\top \left( \alpha \sum_{s=0}^{\infty} (1-\alpha)^s P^s \right) P \\
&= \alpha \pi^\top + \alpha \sum_{s=1}^{\infty} (1-\alpha)^s \pi^\top P^s \\
&= \pi^\top R_\alpha .
\end{align*}

Since the solution is unique, we have $p = R_\alpha^\top \pi$.

A.2 Proof of Proposition 2

Proof. Algorithm 1 maintains two vectors, $p^\epsilon$ and $r$, by transporting probability mass from $r$ to $p^\epsilon$ at each updating step. Note that the termination criterion implies that $r_u < \epsilon d_u$ for any $u$ sampled, thus it suffices to prove that $|p_u - p^\epsilon_u| \le r_u$. For a fixed $\alpha$, let $p(x)$ be the PPR vector with preference vector $x \in \mathbb{R}^N$ satisfying $x_i \ge 0$ and $\|x\|_1 \le 1$. Then $p(\pi)$ is the exact PPR vector as in Equation (2). Since $p(x)^\top P = p(P^\top x)^\top$, we have [Jeh and Widom, 2003]

\[ p(x) = \alpha x + (1-\alpha)\, p(P^\top x). \tag{10} \]

We argue that $p^\epsilon + p(r)$ is invariant in updating steps.
To see this, suppose $(p^\epsilon)'$ and $r'$ are the results of performing one update on $p^\epsilon$ and $r$ after sampling node $u$. We have

\[ (p^\epsilon)' = p^\epsilon + \alpha r_u e_u, \qquad r' = r - r_u e_u + (1-\alpha) r_u P^\top e_u, \]

where $e_u$ is the unit vector in the direction of $u$. Then,

\begin{align*}
p(r) &= p(r - r_u e_u) + p(r_u e_u) \\
&\overset{(i)}{=} p(r - r_u e_u) + \alpha r_u e_u + (1-\alpha)\, p\left( r_u P^\top e_u \right) \\
&\overset{(ii)}{=} p\left( r - r_u e_u + (1-\alpha) r_u P^\top e_u \right) + \alpha r_u e_u \\
&= p(r') + (p^\epsilon)' - p^\epsilon,
\end{align*}

where (i) applies Equation (10) at $x = r_u e_u$ and (ii) comes from the linearity of the PPR vector in the preference vector. The desired result follows from recognizing that $p^\epsilon + p(r)$ is initially $\vec{0} + p(\pi)$ and that when the algorithm terminates, $[p(r)]_u \le r_u$ for any sampled $u$.

Remark. If $\epsilon d_1 > 1$, Algorithm 1 terminates after the first round and simply outputs $p^\epsilon = \vec{0}$. Under this circumstance, Proposition 2 still holds, because $|p_u - p^\epsilon_u| \le |p_u| + |p^\epsilon_u| \le 1$.

A.3 Lemmas for the DC-SBM

In the statements below, bold symbols ($\mathbf{D}$, $\mathbf{d}$, $\mathbf{P}$, $\mathbf{p}$) denote block-level quantities indexed by the $K$ blocks, and plain symbols denote node-level quantities.

Lemma 2 (Properties of the DC-SBM). Under the population directed DC-SBM with $K$ blocks and parameters $\{B, Z, \Theta^{in}, \Theta^{out}\}$,
(a) $\mathbf{D}^{in} = Z^\top D^{in} Z$, and $\mathbf{D}^{out} = Z^\top D^{out} Z$, and
(b) $d_v^{in} = \theta_v^{in} \mathbf{d}^{in}_{z(v)}$, and $d_v^{out} = \theta_v^{out} \mathbf{d}^{out}_{z(v)}$.

Proof. (a) is an alternative way of writing the definition. For (b), we prove the first equation. Recall that for any $i$, $\sum_{u : z(u) = i} \theta_u^{out} = 1$; then by definition,

\[ d_v^{in} = \sum_u \theta_u^{out} \theta_v^{in} B_{z(u) z(v)} = \theta_v^{in} \sum_{j=1}^{K} \left( B_{j z(v)} \sum_{u : z(u) = j} \theta_u^{out} \right) = \theta_v^{in} \mathbf{d}^{in}_{z(v)} . \]

Remark. Since $Z^\top \Theta^{in} Z = I_K$, (a) implies $\left( D^{in} \right)^{-1} \Theta^{in} Z = Z \left( \mathbf{D}^{in} \right)^{-1}$.

Lemma 3 (Explicit form of $P$ and its powers). Under the population directed DC-SBM with $K$ blocks and parameters $\{B, Z, \Theta^{in}, \Theta^{out}\}$, the population graph transition matrix is the product

\[ P = Z \mathbf{P} Z^\top \Theta^{in}, \]

and its matrix powers are

\[ P^k = Z \mathbf{P}^k Z^\top \Theta^{in}. \]

Proof.
By definition and Lemma 2(b), for any $u, v \in V$,

\[ P_{uv} = \left( \theta_u^{out} \mathbf{d}^{out}_{z(u)} \right)^{-1} \theta_u^{out} \theta_v^{in} B_{z(u) z(v)} = \theta_v^{in} B_{z(u) z(v)} / \mathbf{d}^{out}_{z(u)} = \theta_v^{in} \mathbf{P}_{z(u) z(v)} . \]

For the powers of $P$, noticing that $Z^\top \Theta^{in} Z = I_K$,

\[ P^2 = Z \mathbf{P} Z^\top \Theta^{in} Z \mathbf{P} Z^\top \Theta^{in} = Z \mathbf{P}^2 Z^\top \Theta^{in} . \]

The desired result follows by induction on the power $k$.

A.4 Proof of Theorem 1

Proof. By Proposition 1 and Lemma 3, and writing $\boldsymbol{\pi} = Z^\top \pi$ for the block-level preference vector, we have

\begin{align*}
p &= \alpha \sum_{s=0}^{\infty} (1-\alpha)^s (P^s)^\top \pi \\
&= \alpha \sum_{s=0}^{\infty} (1-\alpha)^s \Theta^{in} Z (\mathbf{P}^s)^\top Z^\top \pi \\
&= \Theta^{in} Z \left( \alpha \sum_{s=0}^{\infty} (1-\alpha)^s (\mathbf{P}^s)^\top \boldsymbol{\pi} \right) \\
&= \Theta^{in} Z \mathbf{p} .
\end{align*}

In addition, it follows from Lemma 2(a) that

\[ p^* = \left( D^{in} \right)^{-1} p = \left( D^{in} \right)^{-1} \Theta^{in} Z \mathbf{p} = Z \left( \mathbf{D}^{in} \right)^{-1} \mathbf{p} = Z \mathbf{p}^* . \]

This completes the proof.

A.5 Proof of Lemma 1

Proof. For any $\alpha > 0$, the PPR vector with seed node $v_0 = 1$ is the solution to the equation $p^\top = p^\top Q$, where $Q = \alpha \Pi + (1-\alpha) P$. Define a sequence of probability distributions $p^s \in \mathbb{R}^N$ such that $p^s = (Q^s)^\top p^0$, where $p^0$ is an arbitrary initial probability distribution. Then $\lim_{s \to \infty} p^s = p$. For simplicity, we assume $p^0$ is close to $p$, that is, for any $\varepsilon > 0$ and $s \ge 0$,

\[ \| p^s - p \|_\infty < \varepsilon / 2 . \tag{11} \]

This can be achieved by finding an integer $S(\varepsilon)$ large enough and setting $p^0 = p^S$. We first claim that

\[ \max_{u \ne 1} \frac{p_u^{s+1}}{d_u} \le (1-\alpha) \max_{u \in V} \frac{p_u^s}{d_u} . \tag{12} \]

In fact, for any $u \ne 1$,

\[ p_u^{s+1} = \alpha 1_{\{u = 1\}} + (1-\alpha) \sum_{v \in V} \frac{A_{vu}}{d_v} p_v^s \le (1-\alpha) \left( \sum_{v \in V} A_{vu} \right) \max_{v \in V} \frac{p_v^s}{d_v} = (1-\alpha)\, d_u \max_{v \in V} \frac{p_v^s}{d_v} . \]

We then show $\frac{p_1^s}{d_1} > \frac{p_v^s}{d_v}$ for any $v \ne 1$ by contradiction. Suppose otherwise that $\frac{p_1^s}{d_1} \le \max_{u \ne 1} \frac{p_u^s}{d_u}$; then Equation (11) implies that for any $s'$,

\[ \frac{p_1^{s'}}{d_1} \le \frac{p_1^s + \varepsilon}{d_1} \le \max_{u \ne 1} \frac{p_u^s}{d_u} + \frac{\varepsilon}{d_1} \le \max_{u \ne 1} \frac{p_u^{s'} + \varepsilon}{d_u} + \frac{\varepsilon}{d_1} \le \max_{u \ne 1} \frac{p_u^{s'}}{d_u} + \frac{2\varepsilon}{d_{\min}} , \]

where $d_{\min} = \min_{v \in V} d_v$. Hence,

\[ \max_{u \in V} \frac{p_u^{s'}}{d_u} \le \max_{u \ne 1} \frac{p_u^{s'}}{d_u} + \frac{2\varepsilon}{d_{\min}} . \]
In addition, applying Equation (12) recursiv ely we hav e max u ∈ V p s u d u = max u 6 =1 p s u d u ≤ (1 − α ) max u ∈ V p s − 1 u d u ≤ (1 − α )  max u 6 =1 p s − 1 u d u + 2 ε d min  ≤ (1 − α ) s max u ∈ V p 0 u d u + 2 ε d min s − 1 X t =1 (1 − α ) t . The inequalit y means that if d min > 0 is ﬁxed, p s u can be arbitrarily small when s → ∞ , whic h con tradicts the fact that p is a probability distribution. This completes the pro of. Remark. When the teleportation constant is zero, the PPR vector becomes the stationary probabilit y distribution of a standard random w alk,  d 1 P i d i , d 2 P i d i , ..., d N P i d i  . After adjusting by no de degrees, every entry b ecomes identical (1 / P i d i ). The lemma is in tuitive, recog- nizing that the telep ortation introduces a particular fa vor of the seed no de. Remark. When the edges are weigh ted (non-negativ e), the stationary distribution of a random walk is still proportional to no de degrees, if one deﬁnes the degree as sum of edge w eights incident to the no de [Lo v´ asz, 1996]. Note also that the stationary distribution of a random walk in a directed graph is c haracterized b y the in-degree of nodes [Ghoshal and Barab´ asi, 2011, Lu et al., 2013]. The conclusion and a mo diﬁed pro of apply to directed or w eighted graphs. A.6 Pro of of Corollary 1 Pr o of. The algorithm ranks all vertices according to p  ∗ , and the population local cluster can b e explicitly written as C = { v ∈ V : p ∗ v = p ∗ 1 } . It suﬃces to show that p  v ∗ > p  u ∗ , for ∀ v ∈ C , u ∈ V \ C , 26 where p  ∗ v = p  v /d v . T o this end, we apply triangle inequalit y and get p  v ∗ − p  u ∗ k p ∗ k ∞ ≥ p ∗ v − p ∗ u k p ∗ k ∞ − | p ∗ v − p ∗ v | k p ∗ k ∞ − | p ∗ u − p ∗ u | k p ∗ k ∞ − | p  u ∗ − p ∗ u | k p ∗ k ∞ − | p  v ∗ − p ∗ v | k p ∗ k ∞ ≥ ∆ − 2 k p ∗ − p ∗ k ∞ k p ∗ k ∞ − 2 k p  ∗ − p ∗ k ∞ k p ∗ k ∞ . 
Since ∆ α ≤ 1, assumption (8) contains condition (7) in Theorem 2, whic h together with Proposition 2 implies that k p ∗ − p ∗ k ∞ k p ∗ k ∞ < 1 4 ∆ , k p  ∗ − p ∗ k ∞ k p ∗ k ∞ < 1 4 ∆ , if ∆ 2 δ / log N is large enough. These collectively imply p ∗ v > p ∗ u as desired. 27 Supplemen tary Materials Abstract This do cumen t provides sev eral supplemen tary materials to “T argeted sampling from massiv e block mo del graphs with p ersonalized PageRank”. Section S1 con tains a pro of for the entrywise error control (Theorem 2). Section S2 giv es additional information about some model parameters, including B , P , α , and N . Section S3 extends the results in Kloumann et al. [2017] to the DC-SBM from a linear discriminan t analysis p erspective. Section S4 supplies three targeted Twitter samples about the seed @NBCP olitics describ ed in the pap er. The PPR clustering is implemented in R and all source co des are a v ailable at author’s GitHub ( https://github.com/RoheLab/aPPR ). S1 A pro of for the en trywise error con trol W e start with a few lemmas to prepare for the pro of of Theorem 2. F or completeness, Section S1.3 lists a few inequalities that are used throughout the proofs. S1.1 Some deﬁnitions and lemmas In this section, w e in tro duce a few notations used in Lemma S3 and list a few prop erties of vector norm and matrix norm [Br´ emaud, 2013]. F or any strictly positive probability distribution vector p ∈ R N , the inner pro duct space indexed b y p is a real vector space R N endo wed with the inner product h x, y i p = N X v =1 p v x v y v . The corresponding vector norm and the induced matrix norm are deﬁned resp ectiv ely as k x k p = q h x, x i p and k A k p = sup k x k p =1 k A > x k p . Lemma S1. If 0 ≤ p min ≤ p v ≤ p max for al l v = 1 , 2 , ..., N , then the fol lowing ine qualities hold √ p min k x k 2 ≤ k x k p ≤ √ p max k x k 2 and r p min p max k A k 2 ≤ k A k p ≤ r p max p min k A k 2 . 
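As an informal numerical illustration of the vector-norm part of Lemma S1, the following sketch draws a strictly positive probability vector and random test vectors (all hypothetical) and checks the sandwich inequality between the weighted norm and the Euclidean norm. It is a sanity check under these assumptions, not a proof, and it does not cover the matrix-norm inequality.

```python
import math
import random


def norm_p(x, p):
    """Weighted norm ||x||_p = sqrt(sum_v p_v * x_v^2)."""
    return math.sqrt(sum(pv * xv * xv for pv, xv in zip(p, x)))


def norm_2(x):
    """Euclidean norm ||x||_2."""
    return math.sqrt(sum(xv * xv for xv in x))


rng = random.Random(0)  # fixed seed so the check is reproducible
N = 6

# A hypothetical strictly positive probability vector p.
raw = [rng.uniform(0.5, 2.0) for _ in range(N)]
total = sum(raw)
p = [w / total for w in raw]
p_min, p_max = min(p), max(p)

# Check sqrt(p_min) ||x||_2 <= ||x||_p <= sqrt(p_max) ||x||_2 on random x.
tol = 1e-12  # slack for floating-point rounding
checks = []
for _ in range(1000):
    x = [rng.gauss(0.0, 1.0) for _ in range(N)]
    lower = math.sqrt(p_min) * norm_2(x)
    upper = math.sqrt(p_max) * norm_2(x)
    checks.append(lower - tol <= norm_p(x, p) <= upper + tol)

all_hold = all(checks)
```

The inequality holds for every draw because each term $p_v x_v^2$ lies between $p_{\min} x_v^2$ and $p_{\max} x_v^2$.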
The following lemma pro vides concentration of the node degrees in a graph generated from the DC-SBM. Lemma S2 (Degree concentration) . L et G = ( V , E ) b e a gr aph of N vertic es gener ate d fr om the DC-SBM with K blo cks and p ar ameters { B , Z, Θ } . L et d min and d max b e the smal lest and the lar gest no de de gr e e observe d. L et δ b e the aver age exp e cte d no de de gr e e, and deﬁne ρ = d max / d min . If δ ≥ c 0 (1 − α ) log N for some suﬃciently lar ge c onstant c 0 > 0 , then with pr ob ability at le ast 1 − O ( N − 10 ) , it holds that δ 2 ρ ≤ d min ≤ d max ≤ 3 ρδ 2 . (13) Pr o of. Note that the deﬁnition of ρ immediately implies that δ ρ ≤ d min ≤ d max ≤ δ ρ. 28 The lemma follows from the standard Chernoﬀ ’s b ound, hence is omitted. The follo wing useful lem ma concerns the eigenv ector p erturbation for probability transition matrices, promoted from the celebrated Da vis-Kahan sin Θ Theorem [Da vis and Kahan, 1970]. Lemma S3 (Eigen vector p erturbation) . Supp ose that Q , ˆ Q , and Q ar e pr ob ability tr ansition matric es with stationary distributions p , ˆ p , and p r esp e ctively. Assume that Q r epr esents a r eversible Markov chain. Then, k p − ˆ p k p ≤ k ( Q − ˆ Q ) > p k p 1 − max { λ 2 ( Q ) , − λ N ( Q ) } − k ˆ Q − Q k p . The proof the Lemma S3 can b e found in Chen et al. [2019] Section 3, thus omitted. S1.2 Pro of of Theorem 2 Pr o of. The pro of pro cesses as follo ws. W e ﬁrst bound the en trywise error rate of p , k p − p k ∞ k p k ∞ ≤ c 0 r log N δ , b y in voking the no vel leav e-one-out techniques [Chen et al., 2019], The entrywise error b ounds of p ∗ follo ws immediately . Recall that b oth p and p are stationary distribution, which means p = Q > p and p = Q > p . Due to this, for an y w = 1 , 2 , ..., N , we can decompose p w − p w = Q > · w p − Q > · w p = ( Q · w − Q · w ) > p | {z } := I w 1 + Q > · w ( p − p ) | {z } := I w 2 , where Q · w denotes the w -th column of Q . 
(a) W e start with the ﬁrst term I w 1 . Note that I w 1 = (1 − α ) N X v =1  A v w d v − A v w d v  p v = (1 − α ) N X v =1  ( A v w − A v w ) 1 d v  p v | {z } := I w 11 + (1 − α ) N X v =1 A v w  1 d v − 1 d v  p v | {z } := I w 12 . Recall that A v w ’s corresp ond to indep enden t Bernoulli random v ariables, we can easily b ound the ﬁrst 29 term using Bernstein’s inequality (Lemma S6), with probability at least 1 − O ( N − 8 ), | I w 11 | ≤ (1 − α )      N X v =1 ( A v w − A v w )      k p k ∞ δ ≤ (1 − α )   v u u t 16 log N N X v =1 A v w + 16 log N 3   k p k ∞ δ (i) ≤ (1 − α ) 4 r ρ log N δ + 16 log N 3 δ ! k p k ∞ , where (i) follows from the fact that ρδ ≤ d max . Note that the second term is I w 12 = (1 − α ) N X v =1 1 ( v ,w ) ∈ E  1 d v − 1 d v  p v , to whic h w e can apply the Ho eﬀding’s inequalit y (Lemma S4) and obtain P | I w 12 | ≤ ρ (1 − α ) r ρ log N δ k p k ∞ ! ≥ 1 − 2 N − 8 . In sums, we hav e high probabilit y ev ent | I w 1 | ≤ (1 − α ) (4 + ρ ) √ ρ + 3 r log N δ ! r log N δ k p k ∞ . (14) (b) The statistical dependency betw een p and Q in tro duces diﬃculty in sharply bounding I w 2 . Nevertheless, w e can in vok e the leav e-one-out techniques to decouple the dep endency . T o this end, we deﬁne, for eac h w = 1 , 2 , ..., N , a new transition matrix Q ( w ) = α Π + (1 − α ) P ( w ) that bridges betw een Q and Q . P ( w ) has almost the same en tries as P except for replacing those in w -th ro w or column b y their exp ectations; that is, for any u 6 = v , P ( w ) uv = ( P uv , u 6 = w and v 6 = w, P uv , u = w or v = w , and for any u = 1 , 2 , ..., N , P ( w ) uu = 1 − X v : v 6 = u P ( w ) uv , in order to ensure that P ( w ) and Q ( w ) are transition matrices. In addition, deﬁne p ( w ) to be the stationary distribution corresp onding to Q ( w ) . As demonstrated in Chen et al. [2019], p ( w ) helps us w ell approximate p , y et it is statistically indep enden t of Q · w . 
30 No w we decomp ose I w 2 as follo ws: I w 2 = N X v =1 Q v w ( p v − p v ) = N X v =1 Q v w  p v − p ( w ) v  | {z } := I w 21 + N X v =1 Q v w  p ( w ) v − p v  | {z } := I w 22 . (c) In this part, w e fo cus on the ﬁrst term I w 21 , where we would need another in termediate quan tity to facilitate our estimation. T o be sp eciﬁc, consider the lea ve-one-out version of Q conditioning on the graph G = ( V , E ), Q ( w,G ) = α Π + (1 − α ) P ( w,G ) , which is almost the same as Q except for replacing the non-zero entries in w -th row or column b y their exp ectations. Concretely , for and u 6 = v , P ( w,G ) uv = ( P uv , u 6 = w and v 6 = w , 1 ( u,v ) ∈ E P uv , u = w or v = w , and for any u = 1 , 2 , ..., N , deﬁne P ( w,G ) uu = 1 − X v : v 6 = u P ( w,G ) uv , so that P ( w,G ) is a probability transition matrix. With Q ( w,G ) in mind, we now apply Cauch y-Sch warz inequality on I w 21 to reac h | I w 21 | =      N X v =1 Q v w  p v − p ( w ) v       ≤ N X v =1 Q 2 v w ! 1 2    p − p ( w )    2 (i) ≤ r α + 1 d min r p max p min    p − p ( w )    p (ii) ≤ r α + 1 d min r p max p min 1 γ      Q − Q ( w )  > p ( w )     2 (iii) w.h.p. ≤ r α + 2 ρ δ √ κ γ        ( Q − Q ( w,G ) ) > p ( w )    2 | {z } := I w 211 +    ( Q ( w,G ) − Q ( w ) ) > p ( w )    2 | {z } := I w 212     . where (i) follo ws from Lemma S1 and the fact that P v w ≤ 1 d min , (ii) comes from Lemma S3, and (iii) results from Lemma S2 and the triangle inequalit y , and recognizing κ = p max / p min (from the proof of Prop osition 1, it is bounded), and “w.h.p.” is short for “with high probability”. Note that Π adds at most 1 to the rank of Q , and b ecause w e presume B is positive deﬁnite P has exactly K positive eigen v alues among other zeros (Section S2.2). Here, γ = 1 − max { λ 2 ( Q ) , − λ N ( Q ) } − k Q ( w,G ) − Q k p is the sp ectral gap and is low er bounded b y some p ositiv e constant (due to Khanna et al. [2017]). 
Then, it boils down to con trolling I w 211 and I w 212 . 31 F or I w 211 , the w -th en try inside the vector norm is   Q − Q ( w,G )  > p ( w )  w = h ( Q − Q ) > p ( w ) i w = (1 − α ) N X v =1 ( P v w − P v w ) p ( w ) v . Note that p ( w ) v is statistically independent of P · w . Then, b y Ho eﬀding’s inequalit y (Lemma S4) and Lemma S2, we hav e with probabilit y at least 1 − 2 N − 8 ,   Q − Q ( w,G )  > p ( w )  w ≤ 4 ρ (1 − α ) r ρ log N δ    p ( w )    ∞ . (15) As for any u 6 = w , applying Ho eﬀding’s inequality again yields h Q − Q ( w,G )  p ( w ) i u = (1 − α ) N X v =1 ( P v u − P v u ) p ( w ) v = (1 − α )  P uu − P ( w,G ) uu  p ( w ) u +(1 − α )  P uw − P ( w,G ) uw  p ( w ) w = − (1 − α )  P uw − P ( w,G ) uw  p ( w ) u +(1 − α )  P uw − P ( w,G ) uw  p ( w ) w . Recognizing that    P uw − P ( w,G ) uw    = ( A uw d − 1 uu − A uw d − 1 uu , ( u, w ) ∈ E , 0 , ( u, w ) / ∈ E , w e apply again the Ho eﬀding’s inequality (Lemma S4) together with (13), and obtain with probability at least 1 − O  N − 8  ,       Q − Q ( w,G )  > p ( w )  u     ≤ ( 4 ρ (1 − α ) √ log N δ   p ( w )   ∞ , ( u, w ) ∈ E , 0 , ( u, w ) / ∈ E . (16) Com bining (15) and (16) yields I w 211 ≤ 4 ρ (1 − α )   1 + s X u : u 6 = w 1 ( u,w ) ∈ E   r ρ log N δ    p ( w )    ∞ (i) w.h.p. ≤ 8 ρ 2 √ ρ (1 − α ) r log N δ    p ( w )    ∞ , where (i) follows from the high probabilit y ev ent that d max ≤ 3 ρδ / 2. 32 Regarding I w 212 , since ( Q ( w,G ) − Q ( w ) ) p = ~ 0, w e can rewrite this as I w 212 =      Q ( w,G ) − Q ( w )  >  p ( w ) − p      2 . 
Similarly , note that P ( w ) v w − P ( w,G ) v w = A vw d v 1 ( w,v ) / ∈ E , we apply Bernstein’s inequality on w -th term inside the v ector norm to obtain that with probabilit y at least 1 − 2 N − 8 , h Q ( w,G ) − Q ( w )   p ( w ) − p i w = (1 − α ) N X v =1  P ( w,G ) v w − P ( w ) v w   p ( w ) v − p v  = (1 − α ) N X v =1 1 d v  p ( w ) v − p v  1 ( w,v ) / ∈ E w.h.p. ≤ (1 − α ) 4 ρ r ρ log N δ + 16 3 log N δ !    p ( w ) − p    ∞ . F or any u 6 = w , the u -th term inside v ector norm is h Q ( w,G ) − Q ( w )   p ( w ) − p i u = (1 − α ) N X v =1  P ( w,G ) v u − P ( w ) v u   p ( w ) v − p v  = (1 − α )  P ( w,G ) v v − P ( w ) uu   p ( w ) u − p u  +(1 − α )  P ( w,G ) v w − P ( w ) uw   p ( w ) w − p w  = − (1 − α )  P ( w,G ) v w − P ( w ) uw   p ( w ) u − p u  +(1 − α )  P ( w,G ) v w − P ( w ) uw   p ( w ) w − p w  . Recognizing that P ( w ) uw − P ( w,G ) uw = A uw d − 1 u 1 ( u,w ) / ∈ E , w e hav e from (13) that       Q ( w,G ) − Q ( w )  >  p ( w ) − p   u     ≤ 2 A uw d − 1 u 1 ( u,w ) / ∈ E (1 − α )    p ( w ) − p    ∞ . Hence, w e ha ve with probability at least 1 − O  N − 8  , I w 212 ≤ (1 − α )   4 ρ √ ρ r log N δ + 16 3 log N δ + 2 v u u t X u : u 6 = w 1 ( u,w ) / ∈ E D 2 uu      p ( w ) − p    ∞ (i) ≤ (1 − α ) 4 ρ √ ρ r log N δ + 16 3 log N δ + 2 ρ r ρ δ !    p ( w ) − p    ∞ , where (i) follo ws from the high probability even t that d max ≤ 3 ρδ / 2. Com bining the abov e tw o b ounds, 33 w e hav e with probability at least 1 − O  N − 8  that I w 21 ≤ r α + 2 ρ δ √ κ γ ( I w 211 + I w 212 ) ≤ c (1 − α ) 8 ρ 2 r ρ log N δ k p ( w ) k ∞ + 2 ρ r ρ δ + 4 ρ r ρ log N δ + 16 3 log N δ ! k p ( w ) − p k ∞ ! (i) ≤ 8 cρ 2 (1 − α ) r ρ log N δ k p k ∞ + c (1 − α ) 8 ρ 2 r ρ log N δ + 2 ρ r ρ δ + 4 r ρ log N δ + 16 3 log N δ ! 
k p ( w ) − p k ∞ (ii) ≤ 8 cρ 2 (1 − α ) r ρ log N δ k p k ∞ + c 2 k p ( w ) − p k ∞ (iii) ≤ 16 cρ 2 (1 − α ) r ρ log N δ k p k ∞ + c k p − p k ∞ . where c = q α + 2 ρ δ √ κ γ , and (i) follows from the triangle inequality   p ( w )   ∞ ≤   p ( w ) − p   ∞ + k p k ∞ , and (ii) holds as long as δ > c 0 (1 − α ) 2 log N for some c 0 > 0 suﬃciently large, and (iii) comes from the triangle inequality   p ( w ) − p   ∞ ≤   p ( w ) − p   2 + k p − p k ∞ . (d) Now it is left to estimate the last item I w 22 . Note that I w 22 = N X v =1 1 ( v ,w ) ∈ E Q v w  p ( w ) v − p v  = N X v =1  α 1 { w =1 } + (1 − α ) 1 d v 1 ( v ,w ) ∈ E   p ( w ) v − p v  = α N X v =1 1 { w =1 }  p ( w ) v − p v  | {z } := I w 221 + (1 − α ) N X v =1 1 ( v ,w ) ∈ E d v  p ( w ) v − p v  | {z } := I w 222 + (1 − α ) N X v =1  1 d v − 1 d v  1 ( v ,w ) ∈ E  p ( w ) v − p v  | {z } := I w 223 . Since b oth p ( w ) and p are distribution vector, I w 221 = 0. Then, due to Ho eﬀding’s inequalit y (Lemma S4), | I w 222 | ≤ 4 ρ (1 − α ) r ρ log N δ    p ( w ) − p    ∞ , | I w 223 | ≤ 2 ρ (1 − α ) r ρ log N δ    p ( w ) − p    ∞ , 34 with probabilit y at least 1 − O  N − 8  . Thus, we reach the high probabilit y even t | I w 22 | ≤ 6 ρ (1 − α ) r ρ log N δ    p ( w ) − p    ∞ . In sums, we reach with probability at least 1 − O  N − 8  , | I w 2 | ≤ 16 ρ 2 (1 − α ) √ κρ γ r α + 2 ρ δ r log N δ k p k ∞ + √ κ γ r α + 2 ρ δ + 6 ρ (1 − α ) r ρ log N δ ! k p − p k ∞ . (17) (e) Collecting the preceding b ounds (14) and (17) together, we conclude that with high probability k p − p k ∞ = max w | p w − p w | ≤ c 2 (1 − α ) r log N δ k p k ∞ + c 3 k p − p k ∞ , as long as δ / [(1 − α ) log N ] is suﬃcien tly large, whic h con trols the en trywise error of p , k p − p k ∞ k p k ∞ ≤ c 1 (1 − α ) r log N δ , (18) for some suﬃciently large constant c 1 , c 2 , c 3 > 0. Remark. 
c 2 and c 3 are con trolled by constants ρ, κ, γ , which are thereby driven from the model parameters B , Θ, K , and Z . (f ) Finally , we accomplish the proof by observing that k p ∗ − p ∗ k ∞ k p ∗ k ∞ ≤ 2 max  d − 1 min , d − 1 min  d − 1 min k p − p k ∞ k p k ∞ ≤ 4 k p − p k ∞ k p k ∞ . Ab o v e observ ation together with the inequality (18) allow us to control the entrywise error of p ∗ as claimed, with probabilit y at least 1 − O  N − 5  , k p ∗ − p ∗ k ∞ k p ∗ k ∞ ≤ c 2 (1 − α ) r log N δ , for some suﬃciently large constant c 2 > 0. S1.3 Concen tration inequalities The follo wing is a standard concentration inequality used throughout the pap er. Lemma S4 (Ho eﬀding’s inequality) . L et { X i } 1 ≤ i ≤ n b e a se quenc e of indep endent r andom variables wher e 35 X i ∈ [ a i , b i ] for e ach 1 ≤ i ≤ n , and S n = P n i =1 X i . Then, P ( | S n − E S n | ≥ t ) ≤ 2 exp  − t 2 P n i =1 ( b i − a i ) 2  . The next lemma is a special case of Chernoﬀ ’s b ound. Lemma S5 (Chernoﬀ ’s b ounds) . L et { X i } 1 ≤ i ≤ n b e a se quenc e of indep endent r andom variables, whose sum is S n , e ach having pr ob ability p i of b eing e qual to a i , otherwise 0. Deﬁne µ = P i p i a i . Then, for any  > 0 , P ( X i ≥ (1 +  ) µ ) ≤ (1 +  ) − µ , P ( X i ≤ (1 −  ) µ ) ≤ (1 −  ) µ . F or the use of this pap er, we only inv oke a simpler version of Bernstein inequality . Lemma S6 (Bernstein’s inequality) . L et { X i } 1 ≤ i ≤ n b e a se quenc e of indep endent r andom variables with | X i | ≤ B for e ach 1 ≤ i ≤ n , and S n = P n i =1 X i and T n = P n i =1 X 2 i . Then, with pr ob ability at le ast 1 − 2 n − a , | S n − E [ S n ] | ≤ p 2 a log n E [ T n ] + 2 a 3 B log n for any a ≥ 2 . The proofs of Lemma S4, S5, and S6 can b e found in Boucheron et al. [2013], hence are omitted. 
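To give a feel for how such tail bounds behave, the sketch below compares a simulated two-sided tail frequency for a sum of independent Bernoulli variables with Hoeffding's bound written in its common form $2\exp(-2t^2 / \sum_i (b_i - a_i)^2)$. The sample size, success probability, threshold, and number of trials are arbitrary choices made for illustration only.

```python
import math
import random


def hoeffding_bound(t, ranges):
    """Two-sided Hoeffding bound 2 * exp(-2 t^2 / sum_i (b_i - a_i)^2)."""
    width_sq = sum((b - a) ** 2 for a, b in ranges)
    return 2.0 * math.exp(-2.0 * t * t / width_sq)


def empirical_tail(n, q, t, trials, seed=0):
    """Fraction of trials in which |S_n - E S_n| >= t for Bernoulli(q) sums."""
    rng = random.Random(seed)
    mean = n * q
    hits = 0
    for _ in range(trials):
        s = sum(1 for _ in range(n) if rng.random() < q)
        if abs(s - mean) >= t:
            hits += 1
    return hits / trials


# Arbitrary illustrative settings (assumptions of this sketch).
n, q, t = 100, 0.5, 15.0
bound = hoeffding_bound(t, [(0.0, 1.0)] * n)
freq = empirical_tail(n, q, t, trials=2000)
```

With these settings the bound evaluates to $2e^{-4.5}$, and the simulated frequency is expected to fall well below it, since Hoeffding's inequality ignores the variance of the summands.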
S2 Additional information on model parameters

S2.1 Block connectivity matrix B

In the paper, we assume that the block connectivity matrix $B$ corresponds to a strongly connected graph at the block level and is positive definite. These assumptions ensure the efficacy of PPR clustering and are primarily technical conditions sufficient for our theoretical results. In fact, we require $B$ to represent a strongly connected graph because this enables the block-wise PPR vector to have its largest value at the block of the seed(s) (Lemma 1 in the paper). On the other hand, we impose positive definiteness on $B$ because this allows us to intuitively define the notion of a local cluster, yet our statistical theory (i.e., the entrywise control of the sample PPR vector) does not explicitly rely on such positive definiteness per se. It is not yet clear whether these constraints are necessary for PPR clustering to function; possible generalizations of them are of research interest.

We list a few concrete examples showing that (i) if we break the strong-connectivity assumption, the PPR clustering can fail, despite a reasonable teleportation constant, α = 0.15, but (ii) PPR clustering often works as hoped even when $B$ is not positive definite. Throughout, we assume that the first block is targeted and consider directed graphs with three underlying blocks ($K = 3$). The first two instances of $B$ demonstrate the necessity of the strong-connectivity constraint, which ensures that the first element of the block-wise aPPR vector is the largest. The third and fourth instances, on the other hand, indicate that $B$ need not be positive definite.

S2.1.1 Violating the strong-connectivity assumption

Hierarchy case. Let the block connectivity matrix be

\[ B = \begin{pmatrix} p & p & p \\ 0 & p & p \\ 0 & 0 & p \end{pmatrix} \]

for some constant $p > 0$. $B_{ij}$ is the number (or the probability) of edges from the $i$-th block to the $j$-th block in the population.
Then, the directed graph represented by $B$ is not strongly connected, as block 3 has no path to the first block. In fact, this graph (specified by the upper triangular $B$) has a hierarchical structure, where the third block is at the center (or the top of the hierarchy) of the graph, and the members of the first block are essentially satellites on the outside. In particular, edges only go from outsiders to insiders. We now perform PPR clustering on the first cluster. The block-wise transition matrix is
$$\boldsymbol{P} = \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 0 & 1/2 & 1/2 \\ 0 & 0 & 1 \end{pmatrix}.$$
Then, both $B$ and $\boldsymbol{P}$ are positive definite, with eigenvalues $(p, p, p)$ and $(1, 1/2, 1/3)$ respectively. To ease the calculation, we set $p = 3$. Then the block-wise PPR vector is approximately
$$\boldsymbol{p} = (0.209, 0.103, 0.688),$$
and the block-wise aPPR vector is approximately (after adjusting by the column sums of $B$)
$$\boldsymbol{p}^* = (0.0698, 0.0172, 0.0764).$$
As shown, neither the block-wise PPR vector nor the aPPR vector properly recognizes local cluster 1: both assign their largest value to block 3.

Adding a small amount of circulation. If we add a small quantity to the bottom-left element of the above $B$ matrix, then the block connectivity matrix corresponds to a strongly connected graph. To illustrate, we assign a small value to it, $B_{31} = 0.1$, so that the new block connectivity matrix becomes
$$B' = \begin{pmatrix} p & p & p \\ 0 & p & p \\ 0.1 & 0 & p \end{pmatrix}.$$
To explore the PPR vector, we set $p = 3$ once again. In this case, $B'$ has one real eigenvalue ($\approx 4.069$) and two complex eigenvalues. The block-wise PPR vector is approximately
$$\boldsymbol{p} = (0.234, 0.115, 0.650),$$
and the block-wise aPPR vector is approximately (after adjusting by the column sums of $B'$)
$$\boldsymbol{p}^* = (0.0755, 0.0192, 0.0723).$$
In this case, PPR clustering works as intended: the block-wise aPPR vector assigns its largest value to the first block.

S2.1.2 Violating the positive-definite assumption

Consider again the $K = 3$ design with equally distributed block sizes.
We present two examples that break the positive-definite assumption on $B$, in which PPR clustering still operates properly.

Indefinite case. Given some constants $r > p > 0$, define
$$B = \begin{pmatrix} p & r & r \\ r & p & r \\ r & r & p \end{pmatrix}.$$
Random graphs generated from this configuration of $B$ have a distinctive characteristic: two vertices with different block memberships are more likely to connect than two vertices belonging to the same block. Note that the three eigenvalues of $B$ are $p + 2r$, $p - r$, and $p - r$. Hence, $B$ is an indefinite matrix (and so is the block-wise transition matrix $\boldsymbol{P}$). Interestingly, PPR clustering continues to work under this circumstance. For simplicity, we set $p = 3$ and $r = 9$ and compute the block-wise PPR and aPPR vectors. The block-wise PPR vector is approximately
$$\boldsymbol{p} = (0.386, 0.306, 0.306).$$
Since $B$ has homogeneous column sums, it follows that the first element of the block-wise aPPR vector is also the largest, suggesting the effectiveness of PPR clustering. The same conclusion holds when we set $p = 3$ and $r = 99$ (or 999).

Singular case. Suppose $p > 0$ and let
$$B = \begin{pmatrix} 0 & p & 0 \\ p & 0 & p \\ 0 & p & 0 \end{pmatrix}.$$
In this case, nodes in block 1 only connect with nodes in block 2, and nodes in block 3 only have edges with members of block 2. Note that $B$ is singular because one of its eigenvalues is 0 (its eigenvalues are $\sqrt{2}p$, $-\sqrt{2}p$, and 0). So is the block-wise transition matrix, whose eigenvalues are 1, $-1$, and 0. However, PPR clustering remains effective. In fact, the block-wise PPR vector and aPPR vector are
$$\boldsymbol{p} = (0.345, 0.459, 0.195) \quad \text{and} \quad \boldsymbol{p}^* = \frac{1}{p}(0.345, 0.230, 0.195).$$

In both cases (when $B$ is not positive definite), the block-wise aPPR vector correctly assigns the largest value to the first element and thus is still effective for targeted sampling. These examples suggest a potentially greater applicability of PPR clustering under the block model graph.
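The block-wise vectors quoted in these examples can be recomputed with a few lines of linear algebra. The sketch below assumes the standard PPR fixed point $\boldsymbol{p} = \alpha \pi + (1 - \alpha) \boldsymbol{P}^\top \boldsymbol{p}$ with the seed distribution $\pi$ concentrated on block 1, takes $\boldsymbol{P}$ to be the row-normalized $B$, and forms the aPPR vector by dividing by the column sums of $B$, as described in the text. The function names `blockwise_ppr` and `blockwise_appr` are hypothetical, and the singular case is run with $p = 1$ for convenience.

```python
import numpy as np


def blockwise_ppr(B, alpha=0.15, seed_block=0):
    """Solve p = alpha * pi + (1 - alpha) * P^T p, with P the row-normalized B."""
    B = np.asarray(B, dtype=float)
    K = B.shape[0]
    P = B / B.sum(axis=1, keepdims=True)  # block-wise transition matrix
    pi = np.zeros(K)
    pi[seed_block] = 1.0                  # all seed mass on the targeted block
    return alpha * np.linalg.solve(np.eye(K) - (1.0 - alpha) * P.T, pi)


def blockwise_appr(B, alpha=0.15, seed_block=0):
    """Adjust the block-wise PPR vector by the column sums of B."""
    B = np.asarray(B, dtype=float)
    return blockwise_ppr(B, alpha, seed_block) / B.sum(axis=0)


examples = {
    "hierarchy": [[3, 3, 3], [0, 3, 3], [0, 0, 3]],
    "circulation": [[3, 3, 3], [0, 3, 3], [0.1, 0, 3]],
    "singular": [[0, 1, 0], [1, 0, 1], [0, 1, 0]],
}

results = {}
for name, B in examples.items():
    results[name] = (blockwise_ppr(B), blockwise_appr(B))
    ppr, appr = results[name]
    # Blocks are reported 1-indexed to match the text.
    print(name, np.round(ppr, 3), np.round(appr, 4),
          "largest aPPR block:", int(np.argmax(appr)) + 1)
```

Under these assumptions the hierarchy case gives a PPR vector of about (0.209, 0.103, 0.688), with block 3 carrying the largest aPPR value, whereas the circulation and singular cases both place the largest aPPR value on block 1.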
S2.1.3 Comments

Putting together the above demonstrations, we briefly comment on $B$ and PPR clustering. (i) The strong-connectivity assumption is essential for PPR clustering to be consistent. (ii) The efficacy of PPR clustering is conditional on the teleportation constant being sufficiently large. If we assign an extremely small value to it, e.g. $\alpha = 0.001$, PPR clustering collapses. (iii) Beyond community-like graphs (where $B$ is positive definite), PPR clustering has the potential to work on more general block model graphs.

S2.2 Spectral analysis on graph transition P

In this section, we present a spectral analysis of the graph transition matrix, which demonstrates that (1) under the population DC-SBM, the graph transition matrix $\mathscr{P}$ has exactly $K$ positive eigenvalues and $N - K$ zero eigenvalues, and (2) in a random graph generated from the DC-SBM, the graph transition matrix $P$ is close to its population counterpart with respect to the spectral norm.

Lemma S7 (Eigen-decomposition for $\mathscr{P}$ and $\boldsymbol{P}$). Under the population DC-SBM with $K$ blocks and parameters $\{B, Z, \Theta\}$, let $\mathscr{P} \in \mathbb{R}^{N \times N}$ be the population graph transition matrix and $\boldsymbol{P} \in \mathbb{R}^{K \times K}$ be the block-wise transition matrix. Then, $\mathscr{P}$ and $\boldsymbol{P}$ have the same $K$ positive eigenvalues. The remaining $N - K$ eigenvalues of $\mathscr{P}$ are all zero. Denote the $K$ positive eigenvalues of both matrices as $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_K > 0$, and let $X \in \mathbb{R}^{N \times K}$ and $Y \in \mathbb{R}^{K \times K}$ contain the left eigenvectors of $\mathscr{P}$ and $\boldsymbol{P}$ respectively, corresponding to $\lambda_i$ in their $i$-th column. Then, there exists an orthogonal matrix $U \in \mathbb{R}^{K \times K}$ such that
(a) $X^\top = \mathscr{D}^{-1/2} \Theta^{1/2} Z U$; and
(b) $Y^\top = \boldsymbol{D}^{-1/2} U$.

Proof. We follow the proof of Lemma 3.3 in Qin and Rohe [2013]. Define $\boldsymbol{L} = \boldsymbol{D}^{-1/2} B \boldsymbol{D}^{-1/2}$, then $\boldsymbol{P} = \boldsymbol{D}^{-1/2} \boldsymbol{L} \boldsymbol{D}^{1/2}$. By model assumption, $\boldsymbol{P} \succ 0$.
Define the graph Laplacian $\mathscr{L} = \mathscr{D}^{-1/2} \mathscr{A} \mathscr{D}^{-1/2}$, then by Lemma 2(b),
$$\mathscr{L}_{uv} = \frac{\mathscr{A}_{uv}}{\sqrt{\mathscr{d}_u \mathscr{d}_v}} = \frac{\theta_u \theta_v B_{z(u)z(v)}}{\sqrt{\mathscr{d}_u \mathscr{d}_v}} = \frac{B_{z(u)z(v)} \sqrt{\theta_u \theta_v}}{\sqrt{\boldsymbol{d}_{z(u)} \boldsymbol{d}_{z(v)}}} = [\boldsymbol{L}]_{z(u)z(v)} \sqrt{\theta_u \theta_v},$$
or equivalently, $\mathscr{L} = \Theta^{1/2} Z \boldsymbol{L} Z^\top \Theta^{1/2}$. Then
$$X^\top \Lambda X' = \mathscr{D}^{-1/2} \Theta^{1/2} Z U \Lambda U^\top Z^\top \Theta^{1/2} \mathscr{D}^{1/2} = \mathscr{D}^{-1/2} \mathscr{L} \mathscr{D}^{1/2} = \mathscr{D}^{-1} \mathscr{A} = \mathscr{P},$$
and
$$Y^\top \Lambda Y' = \boldsymbol{D}^{-1/2} U \Lambda U^\top \boldsymbol{D}^{1/2} = \boldsymbol{D}^{-1/2} \boldsymbol{L} \boldsymbol{D}^{1/2} = \boldsymbol{P},$$
where $X' = U^\top Z^\top \Theta^{1/2} \mathscr{D}^{1/2}$ and $Y' = U^\top \boldsymbol{D}^{1/2}$ are right eigenvectors of $\mathscr{P}$ and $\boldsymbol{P}$ respectively. Recognizing that $X^\top X' = Y^\top Y' = I_K$ completes the proof.

Lemma S8. Let $L$ be a symmetric matrix, let $D$ be a diagonal matrix, and let $P = D^{-1/2} L D^{1/2}$. If $x$ is an eigenvector of $L$ corresponding to eigenvalue $\lambda$, then
(a) $D^{-1/2} x$ is a right eigenvector of $P$ with eigenvalue $\lambda$,
(b) $\|P^\top P\| = \|L\|^2$.

Proof. Let $y = D^{-1/2} x$, then
$$P y = D^{-1/2} L D^{1/2} y = D^{-1/2} L x = \lambda D^{-1/2} x = \lambda y.$$
Part (a) of the lemma follows. To see (b), observe that $y$ is also an eigenvector of $P^\top P$ with eigenvalue $\lambda^2$.

Lemma S8 implies that $P$ has the same spectral norm as the graph Laplacian $L$. Since $L$ concentrates to $\mathscr{L}$ (see for example Qin and Rohe [2013] for a proof), it follows that, in a random graph generated from the DC-SBM, the graph transition matrix $P$ concentrates to its population counterpart $\mathscr{P}$ with respect to the spectral norm.

Supplementary Table S1: Number of nodes examined/reached by Algorithm 3 with seed node @NBCPolitics and different teleportation constants, and a fixed tolerance parameter $\epsilon = 10^{-7}$, as of August 2019.

α       Examined    Reached
0.1     7,445       342,454
0.15    5,919       272,985
0.25    4,860       228,561
1/3     3,984       193,848

S2.3 Teleportation constant α

In the paper, we state that a sufficiently large teleportation constant $\alpha$ enables the entrywise control of the sample PPR vector, thus facilitating PPR clustering in a random graph.
Here, from a practical perspective, we further illustrate the sensitivity of PPR clustering to $\alpha$, using the Twitter friendship network. To this end, we investigate the targeted samples returned by four configurations of the teleportation constant, $\alpha \in \{0.1, 0.15, 0.25, 1/3\}$, where NBC Politics (@NBCPolitics) is the seed. The tolerance parameter is fixed, $\epsilon = 10^{-7}$, in all four targeted samples. Table S1 lists the number of Twitter users we examined and the total number of users we "reached" (as of August 2019) in the four attempts. Here, we examine a user by retrieving its friend list (after which it gets a positive $p_u$ value in Algorithm 3), and reach a user once it appears in a user's friend list (at which point it possesses a positive $r_v$ value in Algorithm 3). Given the same tolerance parameter, varying the teleportation constant largely affects the number of nodes examined/reached. This demonstrates the role of the teleportation constant in balancing between the seed preference and the standard random walk.

Despite the fact that different $\alpha$'s result in substantial differences in network coverage, when the algorithm stops, the estimated local clusters share a vast majority of their members. To demonstrate this, we inspect the local clusters of size $n = 300$ returned by Algorithm 4 with the four $\alpha$'s and quantify the degree to which they overlap each other. Table S2 shows the percentage of common members between each pair of the four returned local clusters. As shown, most pairs have about 90% overlapping members, indicating that PPR clustering is fairly robust to the teleportation constant.

The stability of PPR clustering persists when we vary the cluster size, $n = 100, 150, ..., 700$. Figure S1 shows the proportion of common members across all four local clusters, returned by PPR, aPPR, and rPPR (with the regularizer $\tau = 10$).
Overall, PPR clustering produces a fairly consistent local cluster, with around 80% of members overlapping across four different strengths of teleportation (see Supplementary Materials). We conclude that in practice, the choice of teleportation constant (i) mainly influences the number of nodes examined in the targeted sampling and (ii) has little effect on the returned local cluster, so PPR clustering is fairly robust with respect to it.

Supplementary Table S2: Percentage of pairwise overlap among the four local clusters around @NBCPolitics with different teleportation constants, $\alpha \in \{0.1, 0.15, 0.25, 1/3\}$, as of August 2019.

α       0.1      0.15     0.25     1/3
0.1     100%     92.7%    89.3%    87.7%
0.15             100%     93.3%    90%
0.25                      100%     92%
1/3                                100%

[Figure S1. Left panel: % of overlaps (across all clusters) versus sample size. Right panel: # of overlaps (across all clusters) versus sample size. Methods: ppr, appr, rppr.]

Supplementary Figure S1: Sensitivity to the teleportation constant, $\alpha \in \{0.1, 0.15, 0.25, 1/3\}$. Shown are the percentage (left) and number (right) of common members across all four local clusters returned by three PPR clustering methods. The targeted sample size increases from 100 to 700 in increments of 50.

[Figure S2. Relative entrywise error versus size of graphs. Vectors: PPR, aPPR.]

Supplementary Figure S2: Entrywise error rate versus graph size. Shown are the relative entrywise errors (REE) corresponding to different underlying graph sizes, averaged over 30 replicates. For each dot, an error bar indicates the standard error. The REE for the aPPR vector is scaled down by a factor of 240 to improve visualization. The ticks on the x-axis are on a natural-logarithm scale.

S2.4 The graph size N

In the paper, we provide an entrywise error control for the PPR vector and the aPPR vector (Theorem 2), assuming the edge density is sufficiently large (i.e., inequality (7) in the main paper).
Simulation 3 in Section 5 demonstrates the relationship between the expected degree ($\delta$) and the error rate of PPR clustering, as promised by the theorem. Here, we provide another simulation to illustrate Theorem 2. Specifically, we further investigate the effect of graph size ($N$) on the relative entrywise error (REE) of the PPR vector, $\|p - \mathscr{p}\|_\infty / \|\mathscr{p}\|_\infty$, given some edge density ($\delta$).

We generate 30 replicates of networks of size $N = e^x$, where $x \in \{6.5, 7, 7.5, 8, 8.5, 9\}$, from the four-parameter stochastic block model, SBM($K = 3$, $N$, $b_1 = 9$, $b_2 = 3$). The average expected degree is set to $\delta = 125$. Both PPR vectors and aPPR vectors are calculated for every network, with teleportation constant $\alpha = 0.15$ and 10 seeds randomly selected from the first block. Figure S2 depicts the REE with respect to different graph sizes (on a natural-logarithm scale). As shown, with $\delta$ fixed (not growing at the rate of $\log N$), the REE increases as the graph expands, and so does the variance of the REE for both PPR and aPPR vectors, matching the results in Theorem 2.

S3 Connection to linear discriminant analysis

In this section, we give another representation of PPR vectors in the landing probability space, which builds upon Kloumann et al. [2017]. This places PPR within a broader family of functions. Then, we extend the previous result that links PPR vectors with linear discriminant functions under the DC-SBM. In particular, when every block has the same degree (volume), so that $\boldsymbol{D}$ becomes a scalar matrix, the PPR vector is asymptotically equivalent to the optimal linear discriminant function.

First, we briefly introduce linear discriminant analysis in the landing probability space, in which the PPR vector also lives. Consider a random walk on the graph starting from a seed node. Define the landing probability $r^v_s$ to be the probability that the random walk ends up at $v \in V$ after exactly $s$ steps. The landing probability space is the space of landing probabilities of all nodes. A linear discriminant (LD) analysis keeps the first $S$ landing probabilities of each node, $r^v = (r^v_0, r^v_1, ..., r^v_{S-1}) \in \mathbb{R}^S$, and divides vertices into two sets by thresholding the linear discriminant score vector $l \in \mathbb{R}^N$, whose $v$-th entry is defined to be the inner product $l_v = \langle \omega, r^v \rangle$ with some weights $\omega \in \mathbb{R}^S$. For example, let $\omega = r^{v_1} - r^{v_2}$, where $v_1$, $v_2$ are the empirical centroids of two node sets. Then $l_v$ increases as $v$ slides from $v_2$ toward $v_1$, and thresholding at $(\|v_1\|^2 - \|v_2\|^2)/2$ allocates each vertex to its nearest centroid.

Remark. The landing probability of the $s$-th step, $r_s = (r^1_s, r^2_s, ..., r^N_s) \in \mathbb{R}^N$, is defined as $(P^s)^\top \pi$. It follows from Proposition 1 that the PPR vector is $p = \sum_{s=0}^\infty \phi_s r_s$ with $\phi_s = \alpha (1 - \alpha)^s$. Keeping the first $S$ terms yields an LD score vector with the weights $\omega_{PPR} = (\phi_0, \phi_1, ..., \phi_{S-1})$.

We then perform a population (expectation) analysis for PPR in the landing probability space. Define the population block landing probability $W^k_s$ to be the probability that a random walk from $v_0$ ends up in block $k$ after exactly $s$ steps, where $k = 1, 2, ..., K$ and $s = 0, 1, ..., S - 1$. Given that $v_0$ is in block 1, $W_{\cdot 0} = (1, 0, ..., 0)^\top$. Using the block landing probabilities of the first $S$ steps, the next lemma gives an explicit form of the LD vectors.

Lemma S9 (Explicit form of LD vectors). Under the population DC-SBM with $K$ blocks and parameters $\{B, Z, \Theta\}$, assume all blocks have the same degrees. Let $l^{(k)}$ be the linear discriminant score vector between block 1 and block $k$. Then,
(a) $W_{\cdot s} = \boldsymbol{P}^\top W_{\cdot, s-1}$, $s = 1, 2, ..., S - 1$; and
(b) $l^{(k)} = \Theta Z \boldsymbol{l}^{(k)}$, $k = 2, ..., K$, where $\boldsymbol{l}^{(k)} = W W^\top (e_1 - e_k)$.
Here, $e_k$ is the elementary unit vector in the direction of the $k$-th block.

Proof. We prove (a) using the following quantities.
Let $E^k_s$ be the number of paths from $v_0$ to block $k$ with exact length $s$, and let $\mathscr{E}^k_s$ be the expected number of paths from $v_0$ to block $k$ with exact length $s$. Recall from 3 that $B_{ij}$ represents the expected number of edges between blocks $i$ and $j$ if $i \ne j$, or twice that if $i = j$. Then,
$$\mathscr{E}^k_s = \sum_{j=1}^K B_{kj} \mathscr{E}^j_{s-1}.$$
To see $W_{\cdot s} = \boldsymbol{P}^\top W_{\cdot, s-1}$, observe that
$$W^k_s = \frac{\mathscr{E}^k_s}{\sum_{i=1}^K \mathscr{E}^i_s} = \frac{\sum_{j=1}^K B_{kj} \mathscr{E}^j_{s-1}}{\sum_{i=1}^K \sum_{j=1}^K B_{ij} \mathscr{E}^j_{s-1}} = \frac{\sum_{j=1}^K B_{kj} \mathscr{E}^j_{s-1}}{\sum_{j=1}^K \boldsymbol{d}_j \mathscr{E}^j_{s-1}} = \sum_{j=1}^K \boldsymbol{P}_{kj} W^j_{s-1}.$$
The last equality comes from the assumption that all blocks have the same degrees, which means $\boldsymbol{d}_i$ is constant.

Now, we prove part (b) of the lemma. Let $R \in \mathbb{R}^{N \times S}$ collect all landing probabilities $r^v_s$ of the first $S$ steps, where $v = 1, 2, ..., N$ and $s = 0, 1, ..., S - 1$. Without loss of generality, assume the seed node corresponds to the first row. Define $\mathscr{R} = E(R) \in [0, 1]^{N \times S}$ to be the population version of $R$. Then the population landing probability is explicitly
$$\mathscr{R}_{vs} = \frac{\mathscr{d}_v}{\boldsymbol{d}_{z(v)}} W^{z(v)}_s = \theta_v W^{z(v)}_s,$$
or compactly, $\mathscr{R} = \Theta Z W$. In linear discriminant analysis, the weight vector $\omega$ is the geometric difference between the centroids of block 1 and block $k$, which can be written as
$$\left( \sum_{v: z(v)=1} \mathscr{R}_{v1} - \sum_{v: z(v)=k} \mathscr{R}_{v1},\ \sum_{v: z(v)=1} \mathscr{R}_{v2} - \sum_{v: z(v)=k} \mathscr{R}_{v2},\ ...,\ \sum_{v: z(v)=1} \mathscr{R}_{vS} - \sum_{v: z(v)=k} \mathscr{R}_{vS} \right),$$
or compactly
$$\omega = \mathscr{R}^\top Z (e_1 - e_k).$$
By Lemma 2, the linear discriminant score vector reads
$$\langle \mathscr{R} \cdot \omega \rangle = \mathscr{R} \mathscr{R}^\top Z (e_1 - e_k) = \Theta Z W W^\top (e_1 - e_k),$$
for $k = 2, ..., K$. Setting $\boldsymbol{l}^{(k)} = W W^\top (e_1 - e_k)$ completes the proof.

Recall from Theorem 1 that $\mathscr{p} = \Theta Z \boldsymbol{p}$. The LD score vector $l$ has a similarly simple form that separates the block-related information ($W$) from the node-specific information ($\Theta$ and $Z$). Lemma S9 provides a population (expectation) representation of PPR in the landing probability space.
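The series representation of the PPR vector from the Remark in this section, $p = \sum_{s \ge 0} \phi_s r_s$ with $\phi_s = \alpha(1-\alpha)^s$ and $r_s = (P^s)^\top \pi$, can be checked numerically on a toy graph. The sketch below is only illustrative: the five-node adjacency matrix, the truncation level $S = 40$, and the helper names are assumptions, and the reference solution is taken to be the fixed point of $p = \alpha \pi + (1-\alpha) P^\top p$.

```python
import numpy as np


def transition_matrix(A):
    """Row-normalize an adjacency matrix (every node needs positive out-degree)."""
    A = np.asarray(A, dtype=float)
    return A / A.sum(axis=1, keepdims=True)


def landing_probabilities(P, pi, S):
    """Return an S x N array whose s-th row is r_s = (P^s)^T pi, s = 0, ..., S-1."""
    r = np.asarray(pi, dtype=float)
    rows = []
    for _ in range(S):
        rows.append(r)
        r = P.T @ r  # one more step of the random walk
    return np.array(rows)


def ppr_truncated(P, pi, alpha, S):
    """LD-style score with weights phi_s = alpha * (1 - alpha)^s on the first S steps."""
    phi = alpha * (1.0 - alpha) ** np.arange(S)
    return phi @ landing_probabilities(P, pi, S)


def ppr_fixed_point(P, pi, alpha):
    """Solve p = alpha * pi + (1 - alpha) * P^T p directly."""
    n = P.shape[0]
    return alpha * np.linalg.solve(np.eye(n) - (1.0 - alpha) * P.T, pi)


# Hypothetical toy graph: a triangle (nodes 0-2) with a two-node tail (nodes 3-4).
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
P = transition_matrix(A)
pi = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # seed at node 0
alpha, S = 0.15, 40

p_trunc = ppr_truncated(P, pi, alpha, S)
p_full = ppr_fixed_point(P, pi, alpha)
gap = np.abs(p_full - p_trunc).sum()
print("L1 gap:", gap, "(1 - alpha)^S:", (1.0 - alpha) ** S)
```

Because every $r_s$ is a probability vector, the omitted tail of the series has total mass $(1-\alpha)^S$, so the L1 gap between the truncated and full vectors should match that quantity up to floating-point error.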
To facilitate the application of Lemma S9 in random graphs, the next lemma provides control of the landing probabilities on a random block model graph.

Lemma S10 (Concentration of landing probabilities). Let $G = (V, E)$ be a graph of $N$ vertices generated from the DC-SBM with $K$ blocks and parameters $\{B, Z, \Theta\}$. Let $R_{\cdot s} \in [0, 1]^N$ be the landing probabilities of the $s$-th step, and $\mathscr{R}_{\cdot s} = E(R_{\cdot s})$ be its expectation. Then, for any $\epsilon > 0$ and any vertex $u = 1, 2, ..., N$,
$$P(R^u_s \ge (1+\epsilon) \mathscr{R}^u_s) \le (1+\epsilon)^{-Nr}, \qquad P(R^u_s \le (1-\epsilon) \mathscr{R}^u_s) \le (1-\epsilon)^{Nr},$$
where $r = \min_{v \in V} \theta_u \theta_v \boldsymbol{P}_{z(u)z(v)} W^{z(v)}_{s-1}$.

Proof. Note that $R^u_s = \sum_{v \in V} X_{uv}$, where
$$X_{uv} = \frac{W^{z(v)}_{s-1}}{\boldsymbol{d}_{z(v)}} \mathbb{1}\{A_{uv} = 1\}$$
are independent random variables having probability $\theta_u \theta_v B_{z(u)z(v)}$ of being equal to $W^{z(v)}_{s-1} / \boldsymbol{D}_{z(v)z(v)}$. Then,
$$E[R^u_s] = \sum_{v \in V} \frac{W^{z(v)}_{s-1}}{\boldsymbol{d}_{z(v)}} \theta_u \theta_v B_{z(u)z(v)} = \theta_u \sum_{k=1}^K \boldsymbol{P}_{z(u)k} W^k_{s-1} = \mathscr{R}^u_s.$$
We can apply Chernoff's bounds (Lemma S5) to $R^u_s$ and obtain, for any fixed $u$,
$$P(R^u_s \ge (1+\epsilon) \mathscr{R}^u_s) \le (1+\epsilon)^{-\mathscr{R}^u_s},$$
and
$$P(R^u_s \le (1-\epsilon) \mathscr{R}^u_s) \le (1-\epsilon)^{\mathscr{R}^u_s}.$$
Recognizing that $\mathscr{R}^u_s \ge Nr$ completes the proof.

Lemma S10 provides an entrywise concentration bound for landing probabilities. The next theorem equates PPR and LD vectors when blocks are equally distributed. Together, they assert the asymptotic equivalence between PPR and LD vectors in symmetric block model graphs.

Theorem S1 (Equivalence between PPR and LD vectors). Under the population DC-SBM with $K$ blocks and parameters $\{B, Z, \Theta\}$, assume $B_{ii} = b_1$ for all $i$, and $B_{ij} = b_2$ for $i \ne j$ ($b_1 > b_2 > 0$). Let $\lambda_2$ be the second largest eigenvalue of $\boldsymbol{P}$. Let $\mathscr{p}$ be the personalized PageRank vector, and let $l^{(k)}$ be the linear discriminant score vector between block 1 and block $k$, $k = 2, ..., K$.
If the teleportation constant $\alpha = 1 - \lambda_2$, then $\mathscr{p} \propto l^{(k)}$.

Proof. From Section S2.2 and Lemma S9(a), the block landing probability is precisely
$$W^k_s = \sum_{j=1}^K \lambda_j^s U_{kj} U_{1j},$$
where $\lambda_k$ is the $k$-th eigenvalue of $\boldsymbol{P}$ and $U$ is the orthogonal matrix used in Lemma S7. Note that $\boldsymbol{P}$ has eigenvalues $\lambda_1 = 1$ and $\lambda_2 = \frac{b_1 - b_2}{b_1 + b_2}$, with multiplicities 1 and $K - 1$ respectively. In addition, we know the orthogonal matrix above precisely as well,
$$U = \begin{pmatrix} \frac{1}{\sqrt{N}} & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & \cdots & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{N}} & -\frac{1}{\sqrt{2}} & 0 & \cdots & 0 \\ \frac{1}{\sqrt{N}} & 0 & -\frac{1}{\sqrt{2}} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \frac{1}{\sqrt{N}} & 0 & 0 & \cdots & -\frac{1}{\sqrt{2}} \end{pmatrix}.$$
Then it follows from Lemma S9(b) that the LD weight vector is
$$\omega_{LD} = \mathscr{R}^\top Z (e_1 - e_k) = W^\top (e_1 - e_k) = W_{1 \cdot} - W_{k \cdot} = \left( \sum_{j=1}^K \lambda_j^s (U_{1j} - U_{kj}) U_{1j} \right)_{s = 0, ..., S-1} = \frac{K}{2} \begin{pmatrix} 1 \\ \lambda_2 \\ \lambda_2^2 \\ \vdots \\ \lambda_2^{S-1} \end{pmatrix}.$$
On the other hand, the weight vector of PPR in the landing probability space is $\omega_{PPR} = (\phi_0, \phi_1, ...)$, where $\phi_s = \alpha (1 - \alpha)^s$. Hence, setting the teleportation constant $\alpha = 1 - \lambda_2$ gives $\phi_s = (1 - \lambda_2) \lambda_2^s$, which asymptotically equates the approximate PPR and LD vectors, up to a scalar factor.

Remark. First, a positive factor that differentiates the PPR and LD vectors does not change the relative ranking of the nodes, because ranking via $p$ or $cp$ is equivalent. Hence, Theorem S1 shows that the PPR vector is equivalent to an optimal LD score vector under the described population DC-SBM. Second, Theorem S1 is an extension of Kloumann et al. [2017]. Combining Theorem S1 and Lemma S10 gives the asymptotic equivalence between PPR and LD vectors under the particular DC-SBM stated.

S4 Lists of top 200 handles

In this section, we supply three lists of handles resulting from sampling using PPR, aPPR, and rPPR vectors with @NBCPolitics as the seed, as of December 2018. We conceal handles with fewer than 200 followers for privacy considerations.
The biographical descriptions are trimmed for unifying displa ys. In addition, w e annotate handles with whether or not they are follo wed (“F ollow ed”) by the seed node. 46 S4.1 A PPR’s sample of 200 Supplemen tary T able S3: T op (selected) handles returned by PPR. Name F ollowed F ollow ers Description 1 Melania T rump Y es 11242283 This account is run b y the Oﬃce of First Lady Melania T rum... 2 The White House Y es 17625630 W elcome to @WhiteHouse! F ollow for the latest from President... 3 Chuc k T o dd Y es 2032038 Moderator of @meetthepress and @nbcnews political director; ... 4 NBC News Y es 6280551 The leading source of global news and info for more than 75 ... 5 NBC Nightly News Y es 962290 Breaking news, in-depth rep orting, con text on news from arou... 6 Andrea Mitchell Y es 1737764 NBC News Chief F oreign Aﬀairs Correspondent/anc hor, Andrea ... 7 Sav annah Guthrie Y es 881669 Mom to V ale & Charley , TODA Y Co-Anc hor, Georgetown Law... 8 Joe Scarborough Y es 2521215 With Malice T ow ard None 9 MSNBC Y es 2261911 The place for in-depth analysis, political commentary and in... 10 Rachel Maddow MSNBC Y es 9498076 I see p olitical p eople... (Retw eets do not imply endorsemen t... 11 Breaking News Y es 9223158 12 NBC News First Read Y es 53847 The ﬁrst place for news and analysis from the @NBCNews Poli... 13 TODA Y Y es 4276453 America’s fa vorite morning show | Snapchat: to da yshow 14 Meet the Press Y es 566713 Meet the Press is the longest-running television sho w in his... 15 The W all Street Journal Y es 16188842 Breaking news and features from the WSJ. 16 Pete Williams Y es 70062 NBC News Justice Correspondent. Cov ers US Supreme Court, ... 17 Mark Murray Y es 97571 Mark Murray is the senior p olitical editor for NBC News, as ... 18 POLITICO Y es 3695835 Nobo dy knows p olitics like POLITICO. Got a news tip for us? ... 19 Katy T ur Y es 587474 MSNBC anchor @2pm, NBC News corresp onden t, author of NYT ... 
20 Bill Clinton Y es 10697521 F ounder, Clinton F oundation and 42nd President of the United ... 21 Kasie Hunt Y es 381704 @NBCNews Capitol Hill Corresp onden t. Host, @KasieDC, Sunda ys... 22 TIME Y es 15584815 Breaking news and curren t even ts from around the globe. Host... 23 Kelly O’Donnell Y es 195765 White House Correspondent @NBCNews V eteran of Cap Hill ... 24 John McCain Y es 3181773 Memorial accoun t for U.S. Senator John McCain, 1936-2018. T o... 25 Peter Alexander Y es 283522 @NBCNews White House Correspondent / W eekend @TODA Yshow ... 26 Hallie Jackson Y es 359099 Chief White House Corresp ondent / @NBCNews / @MSNBC ... 27 Kristen W elker Y es 182244 @NBCNews White House Corresp onden t. Links and ret weets ... 28 Carrie Dann Y es 37119 .@NBCNews / @NBCP olitics. R Ts not endorsements. 29 Willie Geist Y es 807536 Host @NBC #SundayTOD A Y, Co-Host @Morning Jo e, Sunda y ... 30 Morning Jo e Y es 563650 Live tweet during the sho w! Links to m ust-read op-eds and ... 31 F rank Thorp V Y es 58152 Producer & Oﬀ-Air Rep orter co vering Congress at @NBCNews ... 32 Mark Knoller Y es 318923 CBS News White House Correspondent 33 T om Brok aw Y es 308276 Sp ecial correspondent, @NBCNews 34 Mik a Brzezinski Y es 868124 ”Bipartisanship helps to avoid extremes and im balances. It ... 35 Chris Jansing Y es 72375 @msn b c Senior National Correspondent, intrepid traveler and ... 36 John Harwoo d Y es 251246 a Dad who co vers W ashington, the economy and national p oliti... 37 Nicolle W allace Y es 413153 Author of 18 Acres series, mom, dog walk er, wife, gardener. ... 38 NBC News Signal Y es 83715 A new streaming news channel from @NBCNews. Catch us Th ursda... 39 Sam Stein Y es 392003 Daily Beast/MSNBC newsletter: https://t.co/DVURxn tWdL Emai... 40 Chris Matthews Y es 882434 Host of @hardball M-F at 7PM ET on @MSNBC and author of ”Bob... 41 Carol Lee Y es 51240 Reporter for NBC News, former WSJ & POLITICO, Hudson’s mom, ... 
42 Ali Vitali Y es 78839 @NBCnews P olitical Rep orter. Cov ered T rump campaign, WH + ... 43 Ken Dilanian Y es 124635 Intelligence and national security rep orter for the NBC News... 44 Jim Miklaszewski Y es 14196 Jim Miklaszewski is Chief Pen tagon Corresp onden t for NBC New... 45 John Heilemann Y es 247616 @SHO TheCircus host/ep; NBCNews/@MSNBC natl aﬀairs analyst;... 46 Stephanie Ruhle Y es 352895 Mom, MSNBC LIVE Anchor 9AM M-F, VELSHI & RUHLE 1 PM ... 47 Nick Confessore Y es 172359 Reporter for @NYTimes, writer-at-large for @NYTmag, MSNBC ... 48 T alking Poin ts Memo Y es 275692 Breaking news and analysis from the TPM team. Ill leav e ... 49 T om Costello Y es 17268 NBC News Correspondent cov ering Aviation, T ransp ortation, Ec... 50 Post Politics Y es 384611 The latest p olitical news and analysis from The W ashington P ... 51 Alex Mo e Y es 28245 @NBCNews Capitol Hill Producer + Oﬀ-Air Reporter; ’12 & ’16... 52 Benjy Sarlin Y es 100896 Political reporter for @NBCNews. I cov er elections and their... 53 Preet Bharara Y es 945030 Patriotic American & proud immigrant. Movie buﬀ. @Springste... 54 Matthew Miller Y es 229867 Partner at Viano vo. MSNBC Justice & Security Analyst. Recov e... 55 Leigh Ann Caldwell Y es 20714 NBC Capitol Hill reporter. F ormerly at CNN and public radio.... 56 Ken Strickland Y es 2693 NBC News W ashington Bureau Chief 57 Ron F ournier Y es 64356 President: T ruscott Rossman. Best-seller h ttps://t.co/09CdTN... 58 Mike Memoli Y es 39693 National Political Rep orter @nbcnews; @latimes alum mike dot... 59 Miguel Almaguer Y es 14082 Proliﬁc coﬀee drink er. Chronic under sleeper. Raging road ... 47 . . . contin ued Name F ollowed F ollow ers Description 60 Courtney Kub e Y es 9494 NBC News National Security & Military Reporter. Links and ... 61 NBC News W orld Y es 279165 A dynamic lo ok at world events from @NBCNews. 62 Jonathan Martin Y es 241690 Nat’l Political Correspondent, NY Times. Husband of the ... 
63 Steve Schmidt Y es 498812 ”Patriotism means to stand b y the coun try . It does not mean ... 64 Jenna Bush Hager Y es 207106 Mama to M and P , NBC News corresp onden t, Editor-at-Large ... 65 Sean Spicer Y es 406957 President of RigWil, Sr Advisor @AmericaFirstP AC chec k out ... 66 Roll Call Y es 356374 Breaking news, rep orter tweets and analysis from the Source ... 67 POLITICO 45 Y es 8847 0 A daily diary of the 45th president of the United States. 68 Scott F oster Y es 3464 Senior Pro ducer, W ashington @NBCNEWS @TODA Yshow 69 Domenico Montanaro Y es 83999 ”Congress shall mak e no la w resp ecting an est. of religion, ... 70 T om Winter Y es 40777 NBC News In vestigations reporter based in New Y ork fo cusing ... 71 Kailani Ko enig Y es 11416 Producer with @MSNBC & @NBCNews. T eam @MeetThePress ... 72 Capital Journal Y es 131212 WSJs home for politics, p olicy and national security news. ... 73 NBC News Videos Y es 7838 The latest video from http://t.co/xPyvMOTEF6 74 Diane Sawy er Y es 876906 I like my news 24/7, my foo d spicy , my drinks caﬀeinated, ... 75 Jane C. Timm Y es 6478 @nbcnews p olitical reporter and fact c heck er. More fun than ... 76 Elyse PG Y es 2697 White House pro ducer @nbcnews | @USCAnnenberg alum | LA kid ... 77 Libby Leist Y es 7946 Executive Producer @to daysho w 78 Mike Barnicle Y es 116588 Mike Barnicle is an aw ard-winning print and broadcast journa... 79 Reuters Politics Y es 259106 U.S. political co verage, breaking news and sp ecial in vestiga... 80 Beth F ouhy Y es 13684 Senior editor, politics, NBC News and MSNBC 81 HuﬀPost Y es 11401771 Know what’s real. 82 Joey Scarborough Y es 6277 NBC News Social Media Editor. New Y ork Daily News Alum. R Ts ... 83 Marianna Sotomay or Y es 11965 Running around Capitol Hill for @NBCNews. Cov ers p olitics ... 84 Shaquille Brewster Y es 5362 @NBCNews Pro ducer/P olitics | @HowardU Alum | Journalist ... 85 Joyce Alene Y es 185116 U of Alabama Law Professor | @MSNBC Contributor | Obama US ... 
86 Garrett Haake Y es 40714 Correspondent @msn b c T aller than I look on TV Long-suﬀe... 87 Andrew Raﬀerty Y es 16567 Senior p olitical editor for @newsy Before that @NBCNews ... 88 Jacob Sob oroﬀ Y es 144153 @MSNBC corresp ondent. Instagram & Snapchat: jacobsob oroﬀ 89 Perry Bacon Jr. Y es 26853 I write about gov ernment (mostly federal, often state, ... 90 Alex Witt Y es 28126 W eekend host on @MSNBC (9am, no on & 1pm). Tiggers mom ... 91 Mark Halp erin Y es 332564 New Y ork, New Y ork 92 Heidi Przybyla Y es 66489 NBC News, n’tl political rep orter ”Prezbella” Heidi.Przyb... 93 Morgan Radford Y es 20967 @NBCnews Correspondent: @TODA YShow/@NBCNigh tlyNews . 94 Sav annah Sellers Y es 4637 News junkie. Host of NBC’s ”Sta y T uned” on Snap c hat. Storyte... 95 Marist Poll Y es 16030 F ounded in 1978, MIPO is home to the Marist Poll and regular... 96 Jill Wine-Banks Y es 158753 @NBCNews & @MSNBC Con tributor. Sp eaker. W atergate prosecutor... 97 NBC Field Notes Y es 1390 NBC News correspondents and http://t.co/1eSopOQt8s rep orters... 98 Olivia Nuzzi Y es 190919 W ashington Corresp onden t, New Y ork Magazine 99 NBC News THINK Y es 12017 THINK is NBC News’ home for fresh opinion, sharp analysis ... 100 Making a Diﬀerence Y es 670 @NBCNightlyNews’ popular feature proﬁles ordinary people do... 101 adam nagourney Y es 25307 LA Bureau Chief for The New Y ork Times. Story ideas welcome ... 102 Phil McCausland Y es 2519 @NBCNews Digital reporter fo cused on the rural-urban divide.... 103 Katie Couric Y es 1746116 Journalist, p odcaster, @SU2C founder, do c ﬁlmmaker of @F edU... 104 Monica Alba Y es 30034 @NBCNews White House team. Cov ered Hillary Clinton on the ... 105 Vicente F ox Quesada Y es 1244017 Presidente de Mxico de 2000 a 2006 y ahora traba jando p o... 106 Alex Johnson Y es 4371 News, data and analysis for @NBCNews; data geek; non-celebri... 108 Alex Seitz-W ald Y es 50168 P olitical rep orter for @NBCNews cov ering Demo crats | Tips, ... 
109 Anthon y T errell Y es 6827 Emmy Award winning journalist. Political observer. Cov ered ... 110 Sam Petulla Y es 2588 Editor @cnnp olitics Usually lo oking for datasets. Y ou can ... 111 Debra Messing Y es 532941 Actor. Mama. Global Ambassador for HIV/AIDS for PSI. Activis... 112 Corky Siemaszko Y es 2538 Senior W riter at NBC News Digital (former NY Daily News rewr... 114 Zach Hab erman Y es 3693 Lead Breaking News Editor, @NBCNews. Previously had other jobs ... 115 NBC Latino Y es 67920 Elevating the con versation around Latino news in the United ... 116 Vivian Salama Y es 16020 White House reporter for @WSJ. F ormerly AP Baghdad bureau ... 117 Zeke Miller Y es 215054 White House Reporter @AP . Email: zekejmiller@gmail.com Links... 118 V aughn Hillyard Y es 31464 On the Road, Meeting Go od F olk | NBC News | Arizonan | IG: @... 119 Jonathan Allen Y es 44477 p olitical reporter, @NBCNews Digital | co-author, NYT bestse... 121 HuﬀPost Politics Y es 1428870 The latest political news from HuﬀP ost’s politics team. 122 Nick Akerman Y es 14949 Partner in the AmLaw 100 la w ﬁrm of Dorsey & Whitney , W ater... 123 CSP AN Y es 1915821 Capitol Hill. The White House. National P olitics. 124 John McCormack Y es 30688 Senior writer at The W eekly Standard. 48 . . . contin ued Name F ollowed F ollow ers Description 125 Jo Ling Kent Y es 32957 NBC News Correspondent @NBCNigh tlyNews, @TODA Yshow ... 126 PolitiF act Y es 628659 Home of the T ruth-O-Meter and independent fact-c hecking ... 127 Bob Corker Y es 10042 Serving T ennesseans in the U.S. Senate 128 Elise Jordan Y es 58884 Co-host of @WMM p odcast p odcast. @MSNBC/@NBCNews political... 129 Greg Martin Y es 1161 P olitical Bo oking Producer at @nbcnews @todaysho w 130 Education Nation Y es 276468 Hosted by @NBCNews. Creator of Parent T oolkit & moderator of... 131 Micah Grimes Y es 25948 Head of Social, @NBCNews & @MSNBC – F oreign and domestic ... 
132 | Jill Lawrence | Yes | 17282 | Commentary editor and columnist @USATODAY. Author of The Art...
133 | McKay Coppins | Yes | 131623 | Staff writer at @TheAtlantic. Author of THE WILDERNESS. ’Sor...
134 | Emmanuelle Saliba | Yes | 4004 | Head of Social Media Strategy @Euronews | Launched #THECUBE ...
135 | Hasani Gittens | Yes | 3002 | Level 29 Mage. Senior News Ed. @NBCNews. Sheriff of Nattahna...
136 | Rebecca Sinderbrand | Yes | 18691 | Now: @NBCNews Senior Washington Editor, visiting lecturer @Y...
137 | BuzzFeed Politics | Yes | 121646 | News and updates from the politics team @BuzzFeedNews.
138 | Adam Edelman | Yes | 2341 | Political reporter @nbcnews. Wisconsin native, Bestchester r...
139 | Ethan Klapper | Yes | 18292 | Journalist (@YahooNews) and #avgeek.
140 | President Trump | No | 24593638 | 45th President of the United States of America, @realDonaldT...
141 | Vice President Mike ... | No | 6795022 | Vice President Mike Pence. Husband, father, & honored to ...
142 | Donald J. Trump | No | 56050499 | 45th President of the United States of America
143 | Karen Pence | No | 403315 | Educator, mom, wife of @VP Pence. Passionate about art thera...
144 | Sarah Sanders | No | 3522219 | @WhiteHouse Press Secretary. Proudly representing @POTUS ...
145 | Kellyanne Conway | No | 2506546 | Mom. Patriot. Catholic. Counselor.
146 | DRUDGE REPORT | No | 1408129 | The DRUDGE REPORT is a U.S. based news aggregation website ...
147 | White House History | No | 104010 | The White House Historical Association is a non-profit organ...
148 | The New York Times | No | 42412491 | Where the conversation begins. Follow for breaking news, ...
149 | White House Archived | No | 13379715 | This is an archive of an Obama Administration account mainta...
150 | Dan Scavino Jr. | No | 324561 | Assistant to President @realDonaldTrump, Director of Social ...
151 | Drudge Buzz | No | 104111 | Tracking the buzz made by Americas #1 newsmaker Matt Drudge....
152 | David Gregory | No | 1749373 | CNN, Georgetown U
153 | Hillary Clinton | No | 23643522 | 2016 Democratic Nominee, SecState, Senator, hair icon. Mom, ...
154 | CNN Breaking News | No | 54476034 | Breaking news from CNN Digital. Now 54M strong. Check @cnn ...
155 | The Cabinet | No | 123597 | The @WhiteHouse Office of Cabinet Affairs. Tweets may be arc...
156 | Lester Holt | No | 501427 | Anchor @NBCNightlyNews and @datelinenbc, reporting on the to...
157 | John Dickerson | No | 48122 | Co-host CBS This Morning. This account @johndickerson is mos...
158 | CNN | No | 40854429 | Its our job to #GoThere & tell the most difficult stories. ...
159 | J Earnest (Archived) | No | 1182091 | WH Press Secretary. This is an archive of an Obama Administr...
160 | The Washington Post | No | 13117609 | Breaking news, analysis, and opinion. Founded in 1877. Our ...
161 | Adam Liptak | No | 61589 | Supreme Court reporter for The New York Times
162 | NSC | No | 35905 | National Security Council | Tweets may be archived ...
163 | MSNBC video | No | 40669 | Favorite video highlights from @msnbc.
164 | Gorsuch Facts | No | 39143 | Judge Gorsuch will be fair to all regardless of their backgr...
165 | Greg Stohr | No | 11651 | Supreme Court reporter for Bloomberg News. Baseball dad ...
166 | OMB Press | No | 11182 | Office of Management and Budget | Tweets may be archived: ...
167 | Richard Engel | No | 288066 | @NBCNews Chief Foreign Correspondent
168 | Norah O’Donnell | No | 195549 | Wife, mother of 3, Co-Host @cbsthismorning, #1 fan of @chefg...
169 | Robert Barnes | No | 37361 | Robert Barnes covers the Supreme Court for The Washington Po...
170 | Luke Russert | No | 253495 | Sometimes nothing can be a real cool hand. STA’04/BC’08
171 | Stephen Colbert | No | 18269222 | the guy on CBS
172 | Mark Sherman | No | 6336 |
173 | U.S. Attorney EDVA | No | 5709 | Led by U.S. Attorney G. Zachary Terwilliger. 130+ attorneys ...
174 | The Associated Press | No | 13051963 | News from The Associated Press, and a taste of the great jou...
175 | Joe Palazzolo | No | 10938 | WSJ reporter covering legal issues. joe.palazzolo@wsj.com. ...
176 | Natalie Morales | No | 443991 | @TODAYshow Anchor and @AccessOnline Anchor, Author, mom ...
177 | Brent Kendall | No | 5451 | WSJ legal affairs reporter in Washington. Native Tar Heel, ...
178 | Joan Biskupic | No | 11021 | CNN legal analyst & Supreme Court biographer; Chicago native...
179 | Keith Olbermann | No | 1097676 | Dogs. And sports. And whales (Tom Jumbo-Grumbo on BoJack ...
180 | Brian Williams | No | 230947 |
181 | Pope Francis | No | 17791867 | Welcome to the official Twitter page of His Holiness Pope Fr...
182 | Ezra Klein | No | 2500383 | Founder and editor-at-large, https://t.co/5gESirESRH. Why ...
183 | Anderson Cooper | No | 9967099 | tweets by Anderson Cooper. Anchor @AC360 and correspondent...
184 | BBC News (World) | No | 24153838 | News, features and analysis from the World’s newsroom. Break...
185 | Reince Priebus | No | 935431 | President @MichaelBestLaw; Exclusive Speaker @WashSpeakers; ...
186 | Joe Biden | No | 3111675 | Represented Delaware in the Senate for 36 years, 47th Vice P...
187 | Department of State | No | 5149607 | Welcome to the official U.S. Department of State Twitter acc...
188 | Jim Miklaszewski | No | 1956 | Chief Pentagon Correspondent for NBC News
189 | Tony Mauro | No | 20310 | Supreme Court correspondent, https://t.co/571ZdQnzo2 and The...
190 | David Axelrod | No | 1113850 | Director, UChicago Institute of Politics. Senior Political ...
191 | Nate Silver | No | 3176243 | Editor-in-Chief, @FiveThirtyEight. Author, The Signal and ...
192 | George Bush | No | 356042 | A tribute site to the 41st President of the United States of...
193 | CBS News | No | 6537991 | Your source for original reporting and trusted news.
194 | Jonathan Karl | No | 206986 | ABC News Chief White House Correspondent. insta @jonkarl ...
195 | BBC Breaking News | No | 38539186 | Breaking news alerts and updates from the BBC. For news, ...
196 | Mitt Romney | No | 1977201 | Senator-elect from Utah.
197 | ABC News | No | 13985606 | All the news and information you need to see, curated by the...
198 | Deborah Turness | No | 10389 | President of NBC News International
199 | The Hill | No | 3162118 | The Hill is the premier source for policy and political news...
200 | Ann Curry | No | 1536122 | Journalism is an act of faith in the future.

S4.2 An aPPR’s sample of 200

Supplementary Table S4: Top (selected) handles returned by aPPR. The handles with fewer than 200 followers are hidden for privacy considerations.

# | Name | Followed | Followers | Description
1 | | Yes | 198 | Enroll America National Regional Director http://t.co/X6jJIE...
2 | Jennifer Sizemore | Yes | 386 |
3 | Alissa Swango | Yes | 441 | Director of Digital Programming at @natgeo. All things food....
4 | Making a Difference | Yes | 670 | @NBCNightlyNews’ popular feature profiles ordinary people do...
5 | | No | 1 |
6 | | No | 3 |
7 | Greg Martin | Yes | 1161 | Political Booking Producer at @nbcnews @todayshow
8 | | No | 1 | I am Area Man. I pwn your news feed.
9 | | No | 2 |
10 | NBC Field Notes | Yes | 1390 | NBC News correspondents and http://t.co/1eSopOQt8s reporters...
11 | | No | 2 |
12 | | No | 2 |
13 | | No | 1 |
14 | | No | 1 |
15 | | No | 1 |
16 | | No | 1 |
17 | | No | 3 | yet another activist twitter, fighting all those fun -isms ...
18 | | No | 4 |
19 | | No | 7 | Dianne Kube is an Author with a passion, for family, holiday ...
20 | | No | 7 |
21 | Adam Edelman | Yes | 2341 | Political reporter @nbcnews. Wisconsin native, Bestchester ...
22 | Phil McCausland | Yes | 2519 | @NBCNews Digital reporter focused on the rural-urban divide....
23 | Corky Siemaszko | Yes | 2538 | Senior Writer at NBC News Digital (former NY Daily News rewr...
24 | Sam Petulla | Yes | 2588 | Editor @cnnpolitics Usually looking for datasets. You can ...
25 | Ken Strickland | Yes | 2693 | NBC News Washington Bureau Chief
26 | | No | 7 |
27 | Elyse PG | Yes | 2697 | White House producer @nbcnews | @USCAnnenberg alum | LA kid ...
28 | | No | 2 | Change your thoughts & you change your world. -Norman Vincen...
29 | | No | 4 |
30 | | No | 13 |
31 | | No | 6 |
32 | | No | 154 | We distribute new, never-worn clothing and merchandise to ...
33 | | No | 10 |
34 | Hasani Gittens | Yes | 3002 | Level 29 Mage. Senior News Ed. @NBCNews. Sheriff of Nattahna...
35 | | No | 1 |
36 | Scott Foster | Yes | 3464 | Senior Producer, Washington @NBCNEWS @TODAYshow
37 | | No | 2 |
38 | | No | 13 |
39 | | No | 5 |
40 | Zach Haberman | Yes | 3693 | Lead Breaking News Editor, @NBCNews. Previously had other jobs...
41 | | No | 3 | just like to stay in the know :) just like to stay in the ...
42 | | No | 2 |
43 | | No | 5 |
44 | | No | 7 |
45 | | No | 1 |
46 | Emmanuelle Saliba | Yes | 4004 | Head of Social Media Strategy @Euronews | Launched #THECUBE ...
47 | | No | 2 |
48 | Alex Johnson | Yes | 4371 | News, data and analysis for @NBCNews; data geek; non-celebri...
49 | | No | 8 |
50 | Savannah Sellers | Yes | 4637 | News junkie. Host of NBC’s "Stay Tuned" on Snapchat. Storyte...
51 | | No | 21 |
52 | | No | 6 | Anti-money laundering professional with federal law enforcem...
53 | | No | 15 |
54 | Shaquille Brewster | Yes | 5362 | @NBCNews Producer/Politics | @HowardU Alum | Journalist | Pol...
55 | | No | 2 | Just another DIY, punk kid from the black land dirt of NEPA’...
56 | | No | 18 | Cdr Bob Mehal, Public Affairs Office, Office of the Secretar...
57 | | No | 5 |
58 | | No | 4 |
59 | | No | 8 |
60 | | No | 10 |
61 | | No | 2 |
62 | Joey Scarborough | Yes | 6277 | NBC News Social Media Editor. New York Daily News Alum. RTs ...
63 | | No | 5 |
64 | | No | 1 |
65 | Voices United | No | 310 | Voices United is a non profit educational organization for ...
66 | Jane C. Timm | Yes | 6478 | @nbcnews political reporter and fact checker. More fun than ...
67 | Social Headlines | No | 344 | Daily roundup of top social media and networking stories.
68 | James Miklaszewski | No | 337 | Writer, Photographer, Editor, Director, Producer, Newshound ...
69 | | No | 12 |
70 | Anthony Terrell | Yes | 6827 | Emmy Award winning journalist. Political observer. Covered ...
71 | | No | 10 |
72 | | No | 8 |
73 | | No | 8 | I’m the real Charlie Sheen. If you are a Winner, stick aroun...
74 | | No | 9 | Quotes from a nice jewish mom who’s just tryna get some nice...
75 | | No | 2 |
76 | | No | 4 |
77 | | No | 6 | "Rawr!"
78 | NBC News Videos | Yes | 7838 | The latest video from http://t.co/xPyvMOTEF6
79 | | No | 9 |
80 | | No | 4 |
81 | Libby Leist | Yes | 7946 | Executive Producer @todayshow
82 | | No | 8 |
83 | | No | 2 | I’m running for President of the United States of America.
84 | | No | 35 |
85 | | No | 8 |
86 | | No | 2 |
87 | | No | 2 |
88 | | No | 16 |
89 | | No | 4 |
90 | | No | 5 | Happy princess
91 | | No | 1 |
92 | | No | 4 |
93 | Courtney Kube | Yes | 9494 | NBC News National Security & Military Reporter. Links and ...
94 | | No | 5 |
95 | | No | 5 |
96 | | No | 169 |
97 | | No | 5 |
98 | | No | 2 |
99 | Vets Helping Heroes | No | 449 | Raising funds to sponsor the training of assistance dogs for...
100 | | No | 12 |
101 | | No | 4 |
102 | | No | 8 |
103 | Bob Corker | Yes | 10042 | Serving Tennesseans in the U.S. Senate
104 | | No | 4 |
105 | | No | 2 |
106 | | No | 11 | Spécialiste développement produit et marketing des produits ...
107 | | No | 4 |
108 | | No | 8 | Not your average Grandma
109 | | No | 29 |
110 | | No | 2 |
111 | | No | 6 |
112 | Kailani Koenig | Yes | 11416 | Producer with @MSNBC & @NBCNews. Team @MeetThePress alum. 20...
113 | | No | 13 |
114 | | No | 14 |
115 | Gloria Turkin | No | 204 | I am honest and straight to the point. Retired Civilian Fed...
116 | | No | 7 |
117 | | No | 28 | An unconventional appreciation account for @DeadlineWH host,...
118 | | No | 6 |
119 | | No | 10 | Live like Bones
120 | | No | 2 |
121 | Marianna Sotomayor | Yes | 11965 | Running around Capitol Hill for @NBCNews. Covers politics ...
122 | NBC News THINK | Yes | 12017 | THINK is NBC News’ home for fresh opinion, sharp analysis ...
123 | | No | 1 |
124 | | No | 15 |
125 | | No | 2 |
126 | | No | 3 | Photographer, artist, newsletter editor, designer, writer...
127 | | No | 18 |
128 | | No | 5 |
129 | | No | 5 | The Quest for the Denim Jacket
130 | | No | 9 |
131 | | No | 15 |
132 | | No | 16 | Author of A Traumatic History: A Unique Look at PTSD and ...
133 | | No | 5 |
134 | | No | 7 |
135 | | No | 5 |
136 | | No | 7 |
137 | | No | 7 |
138 | Beth Fouhy | Yes | 13684 | Senior editor, politics, NBC News and MSNBC
139 | Jim Miklaszewski | Yes | 14196 | Jim Miklaszewski is Chief Pentagon Correspondent for NBC New...
140 | Miguel Almaguer | Yes | 14082 | Prolific coffee drinker. Chronic under sleeper. Raging road ...
141 | | No | 16 |
142 | | No | 4 |
143 | | No | 3 |
144 | | No | 19 | The Northeast Tennessee Victory program will create a grassr...
145 | | No | 17 |
146 | | No | 14 | Just a dude with a crappy job.
147 | | No | 5 |
148 | Nick Akerman | Yes | 14949 | Partner in the AmLaw 100 law firm of Dorsey & Whitney, Water...
149 | | No | 5 |
150 | | No | 59 |
151 | | No | 8 |
152 | | No | 8 |
153 | | No | 4 | Grad student at JHU
154 | | No | 6 |
155 | Marist Poll | Yes | 16030 | Founded in 1978, MIPO is home to the Marist Poll and regular...
156 | | No | 10 | Sharing the best news from the e-Discovery world. Tweets by ...
157 | | No | 7 |
158 | | No | 4 |
159 | | No | 11 | Workforce and Economic Development Consultant; Employment ...
160 | | No | 7 | We’re the workers of the @villagevoice, trying to get a fair...
161 | Vivian Salama | Yes | 16020 | White House reporter for @WSJ. Formerly AP Baghdad bureau ...
162 | | No | 8 |
163 | | No | 24 |
164 | | No | 19 | I should be the real trix rabbit
165 | | No | 4 |
166 | | No | 24 | Curious food lover always looking for the best food everywhe...
167 | Andrew Rafferty | Yes | 16567 | Senior political editor for @newsy Before that @NBCNews. And...
168 | | No | 5 |
169 | | No | 36 |
170 | Tom Costello | Yes | 17268 | NBC News Correspondent covering Aviation, Transportation, ...
171 | | No | 68 | Wanderlust journalist ... A man is but the product of ...
172 | | No | 6 | Bibliophile, Animal lover, Realtor, Volunteer,
173 | | No | 25 |
174 | | No | 70 | Director or Product Marketing @ Microsoft. My tweets. My li...
175 | | No | 3 | Experienced (and successful) grantwriter, author, wife, moth...
176 | | No | 5 |
177 | Jill Lawrence | Yes | 17282 | Commentary editor and columnist @USATODAY. Author of The Art...
178 | | No | 8 | Howard McKinnon is Town Manager of Havana, Florida.
179 | | No | 136 |
180 | | No | 59 |
181 | | No | 8 |
182 | | No | 12 |
183 | | No | 7 |
184 | | No | 8 |
185 | | No | 8 |
186 | | No | 15 | Old and getting older.
187 | | No | 15 |
188 | | No | 15 | Married
189 | | No | 4 |
190 | | No | 2 | Director of the Essex, Connecticut Public Library aka "Your ...
191 | Ethan Klapper | Yes | 18292 | Journalist (@YahooNews) and #avgeek.
192 | | No | 38 |
193 | | No | 5 |
194 | Rebecca Sinderbrand | Yes | 18691 | Now: @NBCNews Senior Washington Editor, visiting lecturer ...
195 | | No | 3 |
196 | | No | 11 | Tireless trend researcher.
197 | | No | 5 |
198 | | No | 5 |
199 | | No | 11 |
200 | | No | 3 |

S4.3 An rPPR’s sample of 200

Supplementary Table S5: Top (selected) handles returned by rPPR. The handles with fewer than 200 followers are hidden for privacy considerations.

# | Name | Followed | Followers | Description
1 | | Yes | 198 | Enroll America National Regional Director http://t.co/X6jJIE...
2 | Jennifer Sizemore | Yes | 386 |
3 | Alissa Swango | Yes | 441 | Director of Digital Programming at @natgeo. All things food....
4 | Making a Difference | Yes | 670 | @NBCNightlyNews’ popular feature profiles ordinary people do...
5 | Greg Martin | Yes | 1161 | Political Booking Producer at @nbcnews @todayshow
6 | NBC Field Notes | Yes | 1390 | NBC News correspondents and http://t.co/1eSopOQt8s reporters...
7 | Adam Edelman | Yes | 2341 | Political reporter @nbcnews. Wisconsin native, Bestchester ...
8 | Phil McCausland | Yes | 2519 | @NBCNews Digital reporter focused on the rural-urban divide....
9 | Corky Siemaszko | Yes | 2538 | Senior Writer at NBC News Digital (former NY Daily News ...
10 | Sam Petulla | Yes | 2588 | Editor @cnnpolitics Usually looking for datasets. You can ...
11 | Ken Strickland | Yes | 2693 | NBC News Washington Bureau Chief
12 | Elyse PG | Yes | 2697 | White House producer @nbcnews | @USCAnnenberg alum | LA kid ...
13 | Hasani Gittens | Yes | 3002 | Level 29 Mage. Senior News Ed. @NBCNews. Sheriff of Nattahna...
14 | Scott Foster | Yes | 3464 | Senior Producer, Washington @NBCNEWS @TODAYshow
15 | Zach Haberman | Yes | 3693 | Lead Breaking News Editor, @NBCNews. Previously had other jobs ...
16 | Emmanuelle Saliba | Yes | 4004 | Head of Social Media Strategy @Euronews | Launched #THECUBE ...
17 | Alex Johnson | Yes | 4371 | News, data and analysis for @NBCNews; data geek; non-celebri...
18 | Savannah Sellers | Yes | 4637 | News junkie. Host of NBC’s "Stay Tuned" on Snapchat. Storyte...
19 | | No | 154 | We distribute new, never-worn clothing and merchandise to ...
20 | Shaquille Brewster | Yes | 5362 | @NBCNews Producer/Politics | @HowardU Alum | Journalist | Pol...
21 | Joey Scarborough | Yes | 6277 | NBC News Social Media Editor. New York Daily News Alum. RTs ...
22 | Jane C. Timm | Yes | 6478 | @nbcnews political reporter and fact checker. More fun than ...
23 | Anthony Terrell | Yes | 6827 | Emmy Award winning journalist. Political observer. Covered ...
24 | NBC News Videos | Yes | 7838 | The latest video from http://t.co/xPyvMOTEF6
25 | Libby Leist | Yes | 7946 | Executive Producer @todayshow
26 | Voices United | No | 310 | Voices United is a non profit educational organization for ...
27 | Social Headlines | No | 344 | Daily roundup of top social media and networking stories.
28 | James Miklaszewski | No | 337 | Writer, Photographer, Editor, Director, Producer, Newshound ...
29 | Courtney Kube | Yes | 9494 | NBC News National Security & Military Reporter. Links and ...
30 | Bob Corker | Yes | 10042 | Serving Tennesseans in the U.S. Senate
31 | Kailani Koenig | Yes | 11416 | Producer with @MSNBC & @NBCNews. Team @MeetThePress alum...
32 | Vets Helping Heroes | No | 449 | Raising funds to sponsor the training of assistance dogs for...
33 | Marianna Sotomayor | Yes | 11965 | Running around Capitol Hill for @NBCNews. Covers politics ...
34 | NBC News THINK | Yes | 12017 | THINK is NBC News’ home for fresh opinion, sharp analysis ...
35 | Beth Fouhy | Yes | 13684 | Senior editor, politics, NBC News and MSNBC
36 | Jim Miklaszewski | Yes | 14196 | Jim Miklaszewski is Chief Pentagon Correspondent for NBC New...
37 | Miguel Almaguer | Yes | 14082 | Prolific coffee drinker. Chronic under sleeper. Raging road ...
38 | | No | 169 |
39 | Nick Akerman | Yes | 14949 | Partner in the AmLaw 100 law firm of Dorsey & Whitney, Water...
40 | Marist Poll | Yes | 16030 | Founded in 1978, MIPO is home to the Marist Poll and regular...
41 | Vivian Salama | Yes | 16020 | White House reporter for @WSJ. Formerly AP Baghdad bureau ...
42 | Andrew Rafferty | Yes | 16567 | Senior political editor for @newsy Before that @NBCNews. And...
43 | Tom Costello | Yes | 17268 | NBC News Correspondent covering Aviation, Transportation, ...
44 | Gloria Turkin | No | 204 | I am honest and straight to the point. Retired Civilian Fed...
45 | Jill Lawrence | Yes | 17282 | Commentary editor and columnist @USATODAY. Author of The Art...
46 | Ethan Klapper | Yes | 18292 | Journalist (@YahooNews) and #avgeek.
47 | Rebecca Sinderbrand | Yes | 18691 | Now: @NBCNews Senior Washington Editor, visiting lecturer ...
48 | Leigh Ann Caldwell | Yes | 20714 | NBC Capitol Hill reporter. Formerly at CNN and public radio....
49 | Morgan Radford | Yes | 20967 | @NBCnews Correspondent: @TODAYShow/@NBCNightlyNews/...
50 | GuardAnglSolPet | No | 927 | Supporting the Military, our Veterans and their Beloved Pets...
51 | adam nagourney | Yes | 25307 | LA Bureau Chief for The New York Times. Story ideas welcome ...
52 | | No | 13 |
53 | Micah Grimes | Yes | 25948 | Head of Social, @NBCNews & @MSNBC – Foreign and domestic ...
54 | Perry Bacon Jr. | Yes | 26853 | I write about government (mostly federal, often state, occas...
55 | | No | 21 |
56 | Alex Moe | Yes | 28245 | @NBCNews Capitol Hill Producer + Off-Air Reporter; ’12 & ’16...
57 | Ray Farmer | No | 603 | NBC News staff photographer. Colorado based
58 | Alex Witt | Yes | 28126 | Weekend host on @MSNBC (9am, noon & 1pm). Tiggers mom + ...
59 | Monica Alba | Yes | 30034 | @NBCNews White House team. Covered Hillary Clinton on the ...
60 | Jim Miklaszewski | No | 1956 | Chief Pentagon Correspondent for NBC News
61 | | No | 13 |
62 | John McCormack | Yes | 30688 | Senior writer at The Weekly Standard.
63 | | No | 136 |
64 | Vaughn Hillyard | Yes | 31464 | On the Road, Meeting Good Folk | NBC News | Arizonan | IG...
65 | | No | 35 |
66 | Madelyn Monteath | No | 257 | NFL Network, wife, mother.. not necessarily in that order.
67 | Thomas DeFrank | No | 593 | Veteran White House correspondent (every prez since LBJ) and...
68 | Jo Ling Kent | Yes | 32957 | NBC News Correspondent @NBCNightlyNews, @TODAYshow...
69 | | No | 10 |
70 | Carrie Dann | Yes | 37119 | .@NBCNews / @NBCPolitics. RTs not endorsements.
71 | | No | 3 |
72 | | No | 7 | Dianne Kube is an Author with a passion, for family, holiday ...
73 | | No | 18 | Cdr Bob Mehal, Public Affairs Office, Office of the Secretar...
74 | | No | 7 |
75 | Mike Memoli | Yes | 39693 | National Political Reporter @nbcnews; @latimes alum mike dot...
76 | John Boxley | No | 1201 | NBC News Producer...Living life one day at a time.
77 | | No | 15 |
78 | Tom Winter | Yes | 40777 | NBC News Investigations reporter based in New York focusing ...
79 | | No | 7 |
80 | Garrett Haake | Yes | 40714 | Correspondent @msnbc Taller than I look on TV Long-suffe...
81 | | No | 59 |
82 | | No | 70 | Director or Product Marketing @ Microsoft. My tweets. My li...
83 | | No | 68 | Wanderlust journalist ... A man is but the product of ...
84 | | No | 158 | Marketing nerd at Cornerstone OnDemand.
85 | Jonathan Allen | Yes | 44477 | political reporter, @NBCNews Digital | co-author, NYT bestse...
86 | NBC News First Read | Yes | 53847 | The first place for news and analysis from the @NBCNews Poli...
87 | | No | 92 | Smokin Meat & Raising Kids That Raise Hell. Live Every Day ...
88 | Sam Singal | No | 1016 | Executive Producer, @nbcnightlynews
89 | | No | 29 |
90 | | No | 59 |
91 | Carol Lee | Yes | 51240 | Reporter for NBC News, former WSJ & POLITICO, Hudson’s mom, ...
92 | Alex Seitz-Wald | Yes | 50168 | Political reporter for @NBCNews covering Democrats | Tips, ...
93 | | No | 28 | An unconventional appreciation account for @DeadlineWH host,...
94 | | No | 188 | I am a Senior Video Producer at NBCNews.com, as well a few ...
95 | HailYeah63 | No | 483 | #RedskinsTweetTeam #HTTR
96 | Eva’s Heroes | No | 2067 | To enrich the lives of individuals with intellectual special...
97 | | No | 6 |
98 | Chi Omega | No | 278 | Chi Omega Chapter at CU Boulder
99 | Aarne Heikkila | No | 1210 | Coordinating Producer for @JacobSoboroff @MSNBC & @NBCNews, ...
100 | Dani | No | 447 | only here to talk shit & complain
101 | Frank Thorp V | Yes | 58152 | Producer & Off-Air Reporter covering Congress at @NBCNews. ...
102 | Youcef | No | 228 |
103 | | No | 76 | Pentagon correspondent http://t.co/Qo0w3AnYOb
104 | project c.u.r.e. | No | 2260 | delivering donated medical supplies and equipment to develop...
105 | | No | 117 |
106 | | No | 4 |
107 | Elise Jordan | Yes | 58884 | Co-host of @WMM podcast podcast. @MSNBC/@NBCNews political ...
108 | Patrick Burkey | No | 2313 | Executive Producer, @NBCNews, @MSNBC. Former EP, @NBCNightly ...
109 | bill hartnett | No | 2500 | Stripmining the internets for remarkable ephemera Social Mus...
110 | | No | 7 |
111 | | No | 8 |
112 | | No | 16 |
113 | | No | 36 |
114 | Ron Fournier | Yes | 64356 | President: Truscott Rossman. Best-seller https://t.co/09CdTN...
115 | | No | 12 |
116 | Pete Williams | Yes | 70062 | NBC News Justice Correspondent. Covers US Supreme Court, ...
117 | | No | 65 | Wife, Mother. Litigation Specialist. Designer. Activist for ...
118 | | No | 10 |
119 | Heidi Przybyla | Yes | 66489 | NBC News, n’tl political reporter "Prezbella" Heidi.Przyb...
120 | NBC Latino | Yes | 67920 | Elevating the conversation around Latino news in the United ...
121 | | No | 189 |
122 | | No | 38 |
123 | Chris Jansing | Yes | 72375 | @msnbc Senior National Correspondent, intrepid traveler and ...
124 | | No | 1 |
125 | Brent Kendall | No | 5451 | WSJ legal affairs reporter in Washington. Native Tar Heel, ...
126 | | No | 2 |
127 | U.S. Attorney EDVA | No | 5709 | Led by U.S. Attorney G. Zachary Terwilliger. 130+ attorneys ...
128 | | No | 74 | Life long learner Paralegal Arts & Culture Black Community ...
129 | | No | 2 |
130 | | No | 2 |
131 | Tammy Fine | No | 1584 | Corporate Communications by day. Teen Negotiator by night...
132 | Bonnie Optekman | No | 2242 | Digital media strategist. Voice over artist. News junkie, ...
133 | | No | 3 | yet another activist twitter, fighting all those fun -isms ...
134 | | No | 109 | Communicator through an eclectic lens of #healthcare #hospit...
135 | | No | 88 | Earth and Physical Science Teacher, Mom of 2, Self-declared ...
136 | Amy Lynn-Cramer | No | 1590 | Mommy to 2 amazing kiddos, Wife to @tecramer AND Corporate ...
137 | | No | 5 |
138 | prodjay | No | 304 | NBC News producer
139 | | No | 109 |
140 | | No | 10 |
141 | Meghann Ludemann | No | 216 | Stay Tuned Associate Producer @NBCNews on @Snapchat
142 | | No | 4 |
143 | Ali Vitali | Yes | 78839 | @NBCnews Political Reporter. Covered Trump campaign, WH + ...
144 | Doug Adams | No | 1902 | NBC Sr. Political desk editor; Father; Baseball fan; Lover ...
145 | | No | 99 |
146 | Mark Sherman | No | 6336 |
147 | Robin Gradison | No | 272 | NBC News DC Deputy Bureau Chief, politics junkie, road run...
148 | NBC News Signal | Yes | 83715 | A new streaming news channel from @NBCNews. Catch us Thursda...
149 | | No | 45 | Professor at Columbia Journaism School.
150 | | No | 8 |
151 | | No | 18 |
152 | Stacey Klein | No | 914 | @NBCNews White House Producer, Born and raised in BalDimore ...
153 | | No | 97 |
154 | Rich Latour | No | 1883 | From Broadcast News to Digital Storytelling. Dad of 3 Boys ...
155 | Domenico Montanaro | Yes | 83999 | "Congress shall make no law respecting an est. of religion, ...
156 | | No | 5 |
157 | | No | 6 | Anti-money laundering professional with federal law enforcem...
158 | | No | 24 |
159 | | No | 24 | Curious food lover always looking for the best food everywhe...
160 | | No | 25 |
161 | | No | 161 | 1 of 12 U.S.-led PRTs. Improving Panjshir’s stability, incre...
162 | Anna Matthews | No | 230 |
163 | | No | 46 |
164 | POLITICO 45 | Yes | 88470 | A daily diary of the 45th president of the United States.
165 | | No | 9 | Quotes from a nice jewish mom who’s just tryna get some nice...
166 | | No | 19 | The Northeast Tennessee Victory program will create a grassr...
167 | | No | 130 | @NBCNews Producer in London, Links & retweets aren’t endorse...
168 | samgo | No | 1161 | Executive Producer, @MSNBC Digital
169 | Megan Stark | No | 263 | over served Coloradan
170 | | No | 70 |
171 | Katie Yu | No | 484 | NBC News Senior Producer / formerly @Nightline, @NBCNightlyN...
172 | Mark Murray | Yes | 97571 | Mark Murray is the senior political editor for NBC News, as ...
173 | Kevin Thurm | No | 1946 | Chief Executive Officer @ClintonFdn. Dad, sports fan & trivi...
174 | | No | 122 | Mom, wife, grandma, Airedale Terrier lover
175 | | No | 173 | Providing conservatives with breaking news, opinion, blogs ...
176 | | No | 14 |
177 | | No | 12 |
178 | | No | 137 | Celebrate the simple loveliness of every day things, scarves...
179 | | No | 15 |
180 | | No | 8 |
181 | | No | 16 | Author of A Traumatic History: A Unique Look at PTSD and ...
182 | | No | 9 |
183 | | No | 8 | I’m the real Charlie Sheen. If you are a Winner, stick aroun...
184 | David Espo | No | 1308 | Dad, AP Special Correspondent, Dad, Red Sox fan, Dad.
185 | | No | 40 |
186 | matt toder | No | 253 | supervising producer, documentaries/verticals at NBC News ...
187 | | No | 13 |
188 | Benjy Sarlin | Yes | 100896 | Political reporter for @NBCNews. I cover elections and their...
189 | | No | 15 |
190 | | No | 29 |
191 | | No | 17 |
192 | | No | 28 | Director of the Marist Poll, poll obsessed, epistemophilic,...
193 | | No | 16 |
194 | | No | 144 | Vice President, Standards @NBCNews
195 | | No | 108 | trey.daly@gmail.com
196 | Daniella Mayer | No | 314 | DON’T forget the A. I think everything about North Korea is ...
197 | Bill Hatfield | No | 635 | Washington news producer for NBC News TODAY; politics/histor...
198 | | No | 19 | I should be the real trix rabbit
199 | | No | 50 |
200 | Phil Griffin | No | 231 |
