Phase Transitions in Spectral Community Detection of Large Noisy Networks
In this paper, we study the sensitivity of the spectral clustering based community detection algorithm subject to an Erdos-Renyi type random noise model. We prove phase transitions in community detectability as a function of the external edge connecti…
Authors: Pin-Yu Chen, Alfred O. Hero III
Pin-Yu Chen and Alfred O. Hero III, Fellow, IEEE
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, USA
Email: {pinyu,hero}@umich.edu

Abstract: In this paper, we study the sensitivity of the spectral clustering based community detection algorithm subject to an Erdos-Renyi type random noise model. We prove phase transitions in community detectability as a function of the external edge connection probability and the noisy edge presence probability under a general network model where two arbitrarily connected communities are interconnected by random external edges. Specifically, the community detection performance transitions from almost perfect detectability to low detectability as the inter-community edge connection probability exceeds some critical value. We derive upper and lower bounds on the critical value and show that the bounds are identical when the two communities have the same size. The phase transition results are validated using network simulations. Using the derived expressions for the phase transition threshold, we propose a method for estimating this threshold from observed data.

Index Terms: community detectability, noisy graph

I. INTRODUCTION

Community detection is a graph signal processing problem [1]–[9] where the goal is to cluster the nodes on a graph into different communities by inspecting the connectivity structure of the graph. Consider an undirected regular graph consisting of two node-disjoint communities interconnected by some external edges. Let $n$ denote the total number of nodes in the network. The network topology can be characterized by its symmetric adjacency matrix $\mathbf{A}$, where $\mathbf{A}$ is an $n \times n$ matrix, with $\mathbf{A}_{ij} = 1$ if an edge exists between nodes $i$ and $j$, and $\mathbf{A}_{ij} = 0$ otherwise.
Since community detection can be viewed as a graph partitioning problem that can be solved by identifying the graph cut that correctly separates the communities, spectral clustering [10], [11] approaches to community detection are natural [12]–[15]. Spectral clustering specifies a graph cut by inspecting the eigenstructure of the graph. Let $\mathbf{1}_n$ ($\mathbf{0}_n$) be the $n$-dimensional all-one (all-zero) vector. Define $\mathbf{L} = \mathbf{D} - \mathbf{A}$ as the graph Laplacian matrix of the graph, where $\mathbf{D} = \mathrm{diag}(\mathbf{A}\mathbf{1}_n)$ is the diagonal degree matrix. Let $\lambda_i(\mathbf{L})$ denote the $i$-th smallest eigenvalue of $\mathbf{L}$. It is well known that $\lambda_1(\mathbf{L}) = 0$ since $\mathbf{L}\mathbf{1}_n = \mathbf{0}_n$ and $\mathbf{L}$ is a positive semidefinite (PSD) matrix [16], [17]. The second smallest eigenvalue, $\lambda_2(\mathbf{L})$, is known as the algebraic connectivity. The eigenvector associated with $\lambda_2(\mathbf{L})$ is called the Fiedler vector [18]. A mathematical representation of the algebraic connectivity is

$$\lambda_2(\mathbf{L}) = \min_{\|\mathbf{x}\|_2 = 1,\ \mathbf{1}_n^T \mathbf{x} = 0} \mathbf{x}^T \mathbf{L} \mathbf{x}. \tag{1}$$

The principle underlying spectral clustering for community detection [12]–[15] is summarized as follows:
1) Compute the graph Laplacian matrix $\mathbf{L} = \mathbf{D} - \mathbf{A}$.
2) Compute the Fiedler vector $\mathbf{y}$.
3) Perform K-means clustering [19] on the entries of $\mathbf{y}$ to cluster the nodes into two groups.
To detect more than two communities, we can use successive spectral clustering on the discovered communities [1], [20].

Most literature on community detectability [21]–[28] focuses on the noiseless setting where the edges are not subject to random insertions or deletions. However, in practice the network data can be corrupted by incorrect measurements or background noise (e.g., bioinformatics data) that can produce such random insertions and deletions. Consequently, analyzing the sensitivity of community detection algorithms to noise is an important task.

This work has been partially supported by the Army Research Office (ARO), grant number W911NF-12-1-0443.
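The three-step spectral clustering procedure summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fiedler_vector`, `two_means_1d` and `spectral_bisection` are hypothetical, a dense eigendecomposition is used (reasonable only for small graphs), and the K-means step is replaced by a plain one-dimensional Lloyd iteration with $K = 2$.

```python
import numpy as np

def fiedler_vector(A):
    """Return (lambda_2, Fiedler vector) of the graph Laplacian L = D - A."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A           # step 1: L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    return eigvals[1], eigvecs[:, 1]         # step 2: second-smallest eigenpair

def two_means_1d(y, n_iter=100):
    """Hypothetical helper: Lloyd iterations with K = 2 on scalar data."""
    centers = np.array([y.min(), y.max()], dtype=float)
    labels = np.zeros(len(y), dtype=int)
    for _ in range(n_iter):
        # assign each entry to the nearer of the two centers
        labels = (np.abs(y - centers[0]) > np.abs(y - centers[1])).astype(int)
        new_centers = np.array([
            y[labels == k].mean() if np.any(labels == k) else centers[k]
            for k in (0, 1)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_bisection(A):
    """Step 3: cluster the entries of the Fiedler vector into two groups."""
    _, y = fiedler_vector(A)
    return two_means_1d(y)

# Toy graph (an assumption for illustration): two 4-cliques joined by one edge.
A = np.zeros((8, 8))
A[:4, :4] = 1
A[4:, 4:] = 1
np.fill_diagonal(A, 0)
A[3, 4] = A[4, 3] = 1
labels = spectral_bisection(A)
```

On this toy graph the Fiedler vector takes one sign on each clique, so the two groups returned in `labels` coincide with the two cliques. For large sparse networks a sparse eigensolver would be the more natural choice.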
In this paper, we prove the existence of abrupt phase transitions in community detectability for spectral community detection under an Erdos-Renyi type random noise model. Our network model includes the widely used stochastic block model [29] as a special case. We show that at some critical value of the random external edge connection probability the community detection performance transitions from almost perfect detectability to low detectability in the large network limit (large $n$). We provide asymptotic upper and lower bounds on this critical value. The bounds become equal to each other when the two community sizes are identical. This framework can be generalized to community detection on more than two communities by aggregating multiple communities into two larger communities. We use simulated networks to validate the asymptotic expressions for the phase transitions. Using our theory, we propose an empirical estimator of the critical phase transition threshold that can be applied to data. These empirical estimates are used to test whether the detector is operating in a reliable detection regime, i.e., below the phase transition threshold.

II. NETWORK MODEL AND RELATED WORKS

Consider two arbitrarily connected communities with internal adjacency matrices $\mathbf{A}_{S_1}$ and $\mathbf{A}_{S_2}$ and network sizes $n_1$ and $n_2$, respectively. The external connections between these two communities are characterized by an $n_1 \times n_2$ adjacency matrix $\mathbf{C}_S$, where each entry in $\mathbf{C}_S$ is a Bernoulli($p$) random variable. Let $n = n_1 + n_2$. The overall $n \times n$ adjacency matrix of the community structure can be represented as

$$\mathbf{A}_S = \begin{bmatrix} \mathbf{A}_{S_1} & \mathbf{C}_S \\ \mathbf{C}_S^T & \mathbf{A}_{S_2} \end{bmatrix}. \tag{2}$$

The widely used stochastic block model [29] is a special case of (2) when the two community structures are generated by connected Erdos-Renyi random graphs parameterized by the within-community connection probability $p_i$ ($i = 1, 2$).
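As a concrete instance of (2), the stochastic block model special case can be generated as below. This is a sketch under assumptions: the helper names `er_adjacency` and `community_adjacency` and the parameter values are illustrative, and the code does not check that each within-community graph is connected, which the special case described above requires.

```python
import numpy as np

def er_adjacency(n, prob, rng):
    """Symmetric 0/1 adjacency matrix of an Erdos-Renyi graph without self-loops."""
    upper = np.triu(rng.random((n, n)) < prob, k=1)   # strict upper triangle
    return (upper | upper.T).astype(int)

def community_adjacency(n1, n2, p1, p2, p, rng):
    """Assemble A_S = [[A_S1, C_S], [C_S^T, A_S2]] as in (2)."""
    A1 = er_adjacency(n1, p1, rng)
    A2 = er_adjacency(n2, p2, rng)
    C = (rng.random((n1, n2)) < p).astype(int)        # Bernoulli(p) external edges
    return np.block([[A1, C], [C.T, A2]]), C

rng = np.random.default_rng(0)
A_S, C_S = community_adjacency(30, 20, 0.5, 0.4, 0.1, rng)
```

Replacing `er_adjacency` by any other symmetric 0/1 generator gives the more general setting, in which only the external block is random.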
Our network model is more general since we only assume a random connection probability $p$ on the external edges and we allow the within-community adjacency matrices $\mathbf{A}_{S_i}$ to be arbitrary. In this paper we consider the noisy setting in which the adjacency matrix $\mathbf{A}_S$ is corrupted by a random adjacency matrix $\mathbf{A}_N$ such that the observed adjacency matrix is $\mathbf{A} = \mathbf{A}_S + \mathbf{A}_N$. The adjacency matrix $\mathbf{A}_N$ is generated by an Erdos-Renyi random graph with edge connection probability $q$. Note that this model only allows random insertions and not deletions of edges.

Community detectability has been studied under the stochastic block model with restrictive assumptions such as $n_1 = n_2$, $p_1 = p_2$ and fixed average degree as the network size $n$ increases [23]–[26], [30]. The planted clique detection problem in [31] is a special case of the stochastic block model when $p_1 = 1$ and $p_2 = p$. A less restricted stochastic block model is studied in [28], where a universal phase transition in community detectability is established for which the critical value does not depend on the community sizes. A model similar to our network model is studied in [32] for interconnected networks. However, in [32] the subnetworks are of equal size and the external edges are known (i.e., non-random). Phase transitions in spectral community detection under the noiseless network setting are studied in [27].

III. PHASE TRANSITION ANALYSIS

Let $\mathbf{1}_{n_i}$ be the $n_i$-dimensional all-one vector and let $\mathbf{D}_{S_1} = \mathrm{diag}(\mathbf{C}_S \mathbf{1}_{n_2})$ and $\mathbf{D}_{S_2} = \mathrm{diag}(\mathbf{C}_S^T \mathbf{1}_{n_1})$. The graph Laplacian matrix of the noiseless graph can be represented as

$$\mathbf{L}_S = \begin{bmatrix} \mathbf{L}_{S_1} + \mathbf{D}_{S_1} & -\mathbf{C}_S \\ -\mathbf{C}_S^T & \mathbf{L}_{S_2} + \mathbf{D}_{S_2} \end{bmatrix}, \tag{3}$$

where $\mathbf{L}_{S_i}$ is the graph Laplacian matrix of the $i$-th community.
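The block form (3) can be checked numerically on a small random instance. The sketch below is only a sanity check under assumed sizes and probabilities; the helper names `laplacian` and `random_symmetric` are hypothetical.

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian L = D - A of a symmetric adjacency matrix."""
    return np.diag(A.sum(axis=1)) - A

def random_symmetric(n, prob, rng):
    """Random symmetric 0/1 matrix with zero diagonal (illustrative helper)."""
    upper = np.triu(rng.random((n, n)) < prob, k=1).astype(int)
    return upper + upper.T

rng = np.random.default_rng(1)
n1, n2 = 6, 4
A1 = random_symmetric(n1, 0.6, rng)            # plays the role of A_S1
A2 = random_symmetric(n2, 0.6, rng)            # plays the role of A_S2
C = (rng.random((n1, n2)) < 0.5).astype(int)   # external block C_S

A_S = np.block([[A1, C], [C.T, A2]])
D1 = np.diag(C.sum(axis=1))                    # D_S1 = diag(C_S 1_{n2})
D2 = np.diag(C.sum(axis=0))                    # D_S2 = diag(C_S^T 1_{n1})
L_blocks = np.block([[laplacian(A1) + D1, -C],
                     [-C.T, laplacian(A2) + D2]])

# The Laplacian of the full graph should coincide with the block expression (3).
blocks_match = np.array_equal(laplacian(A_S), L_blocks)
```

The same decomposition applies to any symmetric adjacency matrix partitioned into two node sets, which is why the noise Laplacian admits the identical block structure.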
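Similarly, the graph Laplacian matrix of the noise matrix can be represented as

$$\mathbf{L}_N = \begin{bmatrix} \mathbf{L}_{N_1} + \mathbf{D}_{N_1} & -\mathbf{C}_N \\ -\mathbf{C}_N^T & \mathbf{L}_{N_2} + \mathbf{D}_{N_2} \end{bmatrix}, \tag{4}$$

where $\mathbf{L}_{N_i}$ is the graph Laplacian matrix of the noise matrix in the $i$-th community, $\mathbf{C}_N$ is the adjacency matrix of noisy edges between the two communities, $\mathbf{D}_{N_1} = \mathrm{diag}(\mathbf{C}_N \mathbf{1}_{n_2})$ and $\mathbf{D}_{N_2} = \mathrm{diag}(\mathbf{C}_N^T \mathbf{1}_{n_1})$. Therefore the overall graph Laplacian matrix is $\mathbf{L} = \mathbf{L}_S + \mathbf{L}_N$.

Let $\mathbf{x} = [\mathbf{x}_1^T\ \mathbf{x}_2^T]^T$, where $\mathbf{x}_1 \in \mathbb{R}^{n_1}$ and $\mathbf{x}_2 \in \mathbb{R}^{n_2}$. By (1) we have $\lambda_2(\mathbf{L}) = \min_{\mathbf{x}} \mathbf{x}^T \mathbf{L} \mathbf{x}$ subject to the constraints $\mathbf{x}_1^T \mathbf{x}_1 + \mathbf{x}_2^T \mathbf{x}_2 = 1$ and $\mathbf{x}_1^T \mathbf{1}_{n_1} + \mathbf{x}_2^T \mathbf{1}_{n_2} = 0$. Using Lagrange multipliers $\mu$, $\nu$ together with (3) and (4), the Fiedler vector $\mathbf{y} = [\mathbf{y}_1^T\ \mathbf{y}_2^T]^T$ of $\mathbf{L}$, with $\mathbf{y}_1 \in \mathbb{R}^{n_1}$ and $\mathbf{y}_2 \in \mathbb{R}^{n_2}$, satisfies $\mathbf{y} = \arg\min_{\mathbf{x}} \Gamma(\mathbf{x})$, where

$$\Gamma(\mathbf{x}) = \mathbf{x}_1^T (\mathbf{L}_{S_1} + \mathbf{D}_{S_1} + \mathbf{L}_{N_1} + \mathbf{D}_{N_1}) \mathbf{x}_1 - 2 \mathbf{x}_1^T (\mathbf{C}_S + \mathbf{C}_N) \mathbf{x}_2 + \mathbf{x}_2^T (\mathbf{L}_{S_2} + \mathbf{D}_{S_2} + \mathbf{L}_{N_2} + \mathbf{D}_{N_2}) \mathbf{x}_2 - \mu (\mathbf{x}_1^T \mathbf{x}_1 + \mathbf{x}_2^T \mathbf{x}_2 - 1) - \nu (\mathbf{x}_1^T \mathbf{1}_{n_1} + \mathbf{x}_2^T \mathbf{1}_{n_2}). \tag{5}$$

Differentiating (5) with respect to $\mathbf{x}_1$ and $\mathbf{x}_2$ respectively, and substituting $\mathbf{y}$ into the equations, we obtain

$$2 (\mathbf{L}_{S_1} + \mathbf{D}_{S_1} + \mathbf{L}_{N_1} + \mathbf{D}_{N_1}) \mathbf{y}_1 - 2 (\mathbf{C}_S + \mathbf{C}_N) \mathbf{y}_2 - 2 \mu \mathbf{y}_1 - \nu \mathbf{1}_{n_1} = \mathbf{0}_{n_1}, \tag{6}$$

$$2 (\mathbf{L}_{S_2} + \mathbf{D}_{S_2} + \mathbf{L}_{N_2} + \mathbf{D}_{N_2}) \mathbf{y}_2 - 2 (\mathbf{C}_S + \mathbf{C}_N)^T \mathbf{y}_1 - 2 \mu \mathbf{y}_2 - \nu \mathbf{1}_{n_2} = \mathbf{0}_{n_2}. \tag{7}$$

Left-multiplying (6) by $\mathbf{1}_{n_1}^T$ and (7) by $\mathbf{1}_{n_2}^T$, and using $\mathbf{1}_{n_i}^T \mathbf{L}_{S_i} = \mathbf{1}_{n_i}^T \mathbf{L}_{N_i} = \mathbf{0}_{n_i}^T$, we have

$$2 \cdot \mathbf{1}_{n_1}^T (\mathbf{D}_{S_1} + \mathbf{D}_{N_1}) \mathbf{y}_1 - 2 \cdot \mathbf{1}_{n_1}^T (\mathbf{C}_S + \mathbf{C}_N) \mathbf{y}_2 - 2 \mu \mathbf{1}_{n_1}^T \mathbf{y}_1 - \nu n_1 = 0, \tag{8}$$

$$2 \cdot \mathbf{1}_{n_2}^T (\mathbf{D}_{S_2} + \mathbf{D}_{N_2}) \mathbf{y}_2 - 2 \cdot \mathbf{1}_{n_2}^T (\mathbf{C}_S + \mathbf{C}_N)^T \mathbf{y}_1 - 2 \mu \mathbf{1}_{n_2}^T \mathbf{y}_2 - \nu n_2 = 0. \tag{9}$$

Since by definition $\mathbf{1}_{n_1}^T \mathbf{D}_{S_1} = \mathbf{1}_{n_2}^T \mathbf{C}_S^T$, $\mathbf{1}_{n_1}^T \mathbf{C}_S = \mathbf{1}_{n_2}^T \mathbf{D}_{S_2}$, $\mathbf{1}_{n_1}^T \mathbf{D}_{N_1} = \mathbf{1}_{n_2}^T \mathbf{C}_N^T$ and $\mathbf{1}_{n_1}^T \mathbf{C}_N = \mathbf{1}_{n_2}^T \mathbf{D}_{N_2}$, adding (8) and (9) we obtain $\nu = -\frac{2\mu}{n} (\mathbf{y}_1^T \mathbf{1}_{n_1} + \mathbf{y}_2^T \mathbf{1}_{n_2}) = 0$ by the fact that the Fiedler vector $\mathbf{y}$ has the property $\mathbf{y}^T \mathbf{1}_n = 0$.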
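For readers who want the cancellation behind $\nu = 0$ spelled out, the sum of (8) and (9) can be grouped by $\mathbf{y}_1$ and $\mathbf{y}_2$. The grouping below is a worked restatement of the step above and introduces no new symbols.

```latex
\begin{align*}
(8) + (9):\quad
 & 2\big[\mathbf{1}_{n_1}^T(\mathbf{D}_{S_1}+\mathbf{D}_{N_1})
        - \mathbf{1}_{n_2}^T(\mathbf{C}_S+\mathbf{C}_N)^T\big]\mathbf{y}_1
 + 2\big[\mathbf{1}_{n_2}^T(\mathbf{D}_{S_2}+\mathbf{D}_{N_2})
        - \mathbf{1}_{n_1}^T(\mathbf{C}_S+\mathbf{C}_N)\big]\mathbf{y}_2 \\
 & \quad - 2\mu\,(\mathbf{1}_{n_1}^T\mathbf{y}_1 + \mathbf{1}_{n_2}^T\mathbf{y}_2)
   - \nu\,(n_1+n_2) = 0 .
\end{align*}
% Both bracketed row vectors vanish by the four identities relating the
% degree matrices to the row and column sums of C_S and C_N, which leaves
\begin{equation*}
\nu = -\frac{2\mu}{n}\,(\mathbf{1}_{n_1}^T\mathbf{y}_1 + \mathbf{1}_{n_2}^T\mathbf{y}_2) = 0,
\qquad n = n_1 + n_2 ,
\end{equation*}
% because the Fiedler vector is orthogonal to the all-one vector.
```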
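Applying $\nu = 0$, left-multiplying (6) by $\mathbf{y}_1^T$ and left-multiplying (7) by $\mathbf{y}_2^T$, we have

$$\mathbf{y}_1^T (\mathbf{L}_{S_1} + \mathbf{D}_{S_1} + \mathbf{L}_{N_1} + \mathbf{D}_{N_1}) \mathbf{y}_1 - \mathbf{y}_1^T (\mathbf{C}_S + \mathbf{C}_N) \mathbf{y}_2 - \mu \mathbf{y}_1^T \mathbf{y}_1 = 0, \tag{10}$$

$$\mathbf{y}_2^T (\mathbf{L}_{S_2} + \mathbf{D}_{S_2} + \mathbf{L}_{N_2} + \mathbf{D}_{N_2}) \mathbf{y}_2 - \mathbf{y}_2^T (\mathbf{C}_S + \mathbf{C}_N)^T \mathbf{y}_1 - \mu \mathbf{y}_2^T \mathbf{y}_2 = 0. \tag{11}$$

Adding (10) and (11), and by (1) and (3), we obtain $\mu = \lambda_2(\mathbf{L})$.

Let $\bar{\mathbf{C}}_S = p \mathbf{1}_{n_1} \mathbf{1}_{n_2}^T$, a matrix whose elements are the means of the entries in $\mathbf{C}_S$. Let $\sigma_i(\mathbf{M})$ denote the $i$-th largest singular value of a rectangular matrix $\mathbf{M}$ $^1$ and write $\mathbf{C}_S = \bar{\mathbf{C}}_S + \mathbf{\Delta}_S$, where $\mathbf{\Delta}_S = \mathbf{C}_S - \bar{\mathbf{C}}_S$. By Latala's theorem [33], $\mathbb{E}\left[\sigma_1\left(\frac{\mathbf{\Delta}_S}{\sqrt{n_1 n_2}}\right)\right] \to 0$. This is proved in Appendix VII-A of [27]. Furthermore, by Talagrand's concentration inequality [34], almost surely,

$$\sigma_1\left(\frac{\mathbf{C}_S}{\sqrt{n_1 n_2}}\right) \to p; \qquad \sigma_i\left(\frac{\mathbf{C}_S}{\sqrt{n_1 n_2}}\right) \to 0 \quad \forall\, i \geq 2 \tag{12}$$

when $n_1, n_2 \to \infty$ and $\frac{n_1}{n_2} \to c > 0$. This is proved in Appendix VII-B of [27]. Note that the convergence rate is maximal when $n_1 = n_2$ because $n_1 + n_2 \geq 2\sqrt{n_1 n_2}$ and the equality holds if $n_1 = n_2$. Similarly, let $\bar{\mathbf{C}}_N = q \mathbf{1}_{n_1} \mathbf{1}_{n_2}^T$, a matrix whose elements are the means of the entries in $\mathbf{C}_N$. We have $\sigma_1\left(\frac{\mathbf{C}_N}{\sqrt{n_1 n_2}}\right) \to q$ and $\sigma_i\left(\frac{\mathbf{C}_N}{\sqrt{n_1 n_2}}\right) \to 0$ $\forall\, i \geq 2$ when $n_1, n_2 \to \infty$ and $\frac{n_1}{n_2} \to c > 0$.

$^1$ Note that for convenience, we use $\lambda_i(\mathbf{M}_1)$ to denote the $i$-th smallest eigenvalue of a square matrix $\mathbf{M}_1$ and use $\sigma_i(\mathbf{M}_2)$ to denote the $i$-th largest singular value of a rectangular matrix $\mathbf{M}_2$.

As proved in [35], the singular vectors of $\mathbf{C}_S$ ($\mathbf{C}_N$) and $\bar{\mathbf{C}}_S$ ($\bar{\mathbf{C}}_N$) are close to each other in the sense that the squared inner product of their left/right singular vectors converges to 1 almost surely when $\sqrt{n_1 n_2}\, p \to \infty$ ($\sqrt{n_1 n_2}\, q \to \infty$). Consequently, we have, almost surely,

$$\frac{(\mathbf{D}_{S_1} + \mathbf{D}_{N_1}) \mathbf{1}_{n_1}}{n_2} = \frac{(\mathbf{C}_S + \mathbf{C}_N) \mathbf{1}_{n_2}}{n_2} \to (p + q) \mathbf{1}_{n_1}; \tag{13}$$

$$\frac{(\mathbf{D}_{S_2} + \mathbf{D}_{N_2}) \mathbf{1}_{n_2}}{n_1} = \frac{(\mathbf{C}_S + \mathbf{C}_N)^T \mathbf{1}_{n_1}}{n_1} \to (p + q) \mathbf{1}_{n_2}. \tag{14}$$

Applying (12), (13) and (14) to (8) and (9), and recalling that $\nu = 0$ and $\frac{n_1}{n_2} = c > 0$, we have, almost surely,

$$\frac{1}{\sqrt{c}} (p + q) \mathbf{1}_{n_1}^T \mathbf{y}_1 - \sqrt{c}\, (p + q) \mathbf{1}_{n_2}^T \mathbf{y}_2 - \frac{\mu \mathbf{1}_{n_1}^T \mathbf{y}_1}{\sqrt{n_1 n_2}} \to 0; \tag{15}$$

$$\sqrt{c}\, (p + q) \mathbf{1}_{n_2}^T \mathbf{y}_2 - \frac{1}{\sqrt{c}} (p + q) \mathbf{1}_{n_1}^T \mathbf{y}_1 - \frac{\mu \mathbf{1}_{n_2}^T \mathbf{y}_2}{\sqrt{n_1 n_2}} \to 0. \tag{16}$$

By the fact that $\mathbf{1}_{n_1}^T \mathbf{y}_1 + \mathbf{1}_{n_2}^T \mathbf{y}_2 = 0$ and $\frac{n}{\sqrt{n_1 n_2}} = \sqrt{c} + \frac{1}{\sqrt{c}}$, we have, almost surely,

$$\left(\sqrt{c} + \frac{1}{\sqrt{c}}\right) \left(p + q - \frac{\mu}{n}\right) \mathbf{1}_{n_1}^T \mathbf{y}_1 \to 0; \tag{17}$$

$$\left(\sqrt{c} + \frac{1}{\sqrt{c}}\right) \left(p + q - \frac{\mu}{n}\right) \mathbf{1}_{n_2}^T \mathbf{y}_2 \to 0. \tag{18}$$

Consequently, as $\mu = \lambda_2(\mathbf{L})$, at least one of the following two cases has to be satisfied:

$$\text{Case 1:} \quad \frac{\lambda_2(\mathbf{L})}{n} \xrightarrow{\text{a.s.}} p + q =: t, \tag{19}$$

$$\text{Case 2:} \quad \mathbf{1}_{n_1}^T \mathbf{y}_1 \to 0 \ \text{and} \ \mathbf{1}_{n_2}^T \mathbf{y}_2 \to 0 \ \text{almost surely}. \tag{20}$$

We will show that the algebraic connectivity $\lambda_2(\mathbf{L})/n$ and the Fiedler vector $\mathbf{y}$ undergo a phase transition between Case 1 and Case 2 as a function of $t = p + q$. That is, a transition from Case 1 to Case 2 occurs when $p$ exceeds a certain threshold $p^*$. In Case 1, observe that asymptotically $\frac{\lambda_2(\mathbf{L})}{n}$ grows linearly with $t$ while the asymptotic Fiedler vector remains the same (unique up to its sign). Furthermore, from (10), (11), (12), (19), $\mu = \lambda_2(\mathbf{L})$ and $\mathbf{1}_{n_1}^T \mathbf{y}_1 + \mathbf{1}_{n_2}^T \mathbf{y}_2 = 0$, the Fiedler vector $\mathbf{y}$ in Case 1 has the following property. Almost surely,

$$\frac{\mathbf{y}_1^T (\mathbf{L}_{S_1} + \mathbf{L}_{N_1}) \mathbf{y}_1}{\sqrt{n_1 n_2}} + \frac{p + q}{\sqrt{n_1 n_2}} (\mathbf{1}_{n_1}^T \mathbf{y}_1)^2 - \sqrt{c}\, (p + q) \mathbf{y}_1^T \mathbf{y}_1 \to 0, \tag{21}$$

$$\frac{\mathbf{y}_2^T (\mathbf{L}_{S_2} + \mathbf{L}_{N_2}) \mathbf{y}_2}{\sqrt{n_1 n_2}} + \frac{p + q}{\sqrt{n_1 n_2}} (\mathbf{1}_{n_1}^T \mathbf{y}_1)^2 - \frac{1}{\sqrt{c}} (p + q) \mathbf{y}_2^T \mathbf{y}_2 \to 0. \tag{22}$$

Adding (21) and (22), we have

$$\frac{1}{\sqrt{n_1 n_2}} \left[ \mathbf{y}_1^T (\mathbf{L}_{S_1} + \mathbf{L}_{N_1}) \mathbf{y}_1 + \mathbf{y}_2^T (\mathbf{L}_{S_2} + \mathbf{L}_{N_2}) \mathbf{y}_2 \right] + \left[ \frac{2 (\mathbf{1}_{n_1}^T \mathbf{y}_1)^2}{\sqrt{n_1 n_2}} - \left( \sqrt{c}\, \mathbf{y}_1^T \mathbf{y}_1 + \frac{1}{\sqrt{c}} \mathbf{y}_2^T \mathbf{y}_2 \right) \right] (p + q) \xrightarrow{\text{a.s.}} 0. \tag{23}$$

As the two bracketed terms in (23) converge to finite constants for all $t = p + q$ in Case 1, we have, almost surely,

$$\frac{1}{\sqrt{n_1 n_2}} \left[ \mathbf{y}_1^T (\mathbf{L}_{S_1} + \mathbf{L}_{N_1}) \mathbf{y}_1 + \mathbf{y}_2^T (\mathbf{L}_{S_2} + \mathbf{L}_{N_2}) \mathbf{y}_2 \right] \to 0; \tag{24}$$

$$\frac{2 (\mathbf{1}_{n_1}^T \mathbf{y}_1)^2}{\sqrt{n_1 n_2}} - \left( \sqrt{c}\, \mathbf{y}_1^T \mathbf{y}_1 + \frac{1}{\sqrt{c}} \mathbf{y}_2^T \mathbf{y}_2 \right) \to 0. \tag{25}$$

By the PSD property of the graph Laplacian matrix, $\mathbf{y}_1^T (\mathbf{L}_{S_1} + \mathbf{L}_{N_1}) \mathbf{y}_1 + \mathbf{y}_2^T (\mathbf{L}_{S_2} + \mathbf{L}_{N_2}) \mathbf{y}_2 > 0$ if and only if $\mathbf{y}_1$ and $\mathbf{y}_2$ are not constant vectors. Therefore (24) implies that $\mathbf{y}_1$ and $\mathbf{y}_2$ converge to constant vectors. By the constraints $\mathbf{y}_1^T \mathbf{y}_1 + \mathbf{y}_2^T \mathbf{y}_2 = 1$ and $\mathbf{1}_{n_1}^T \mathbf{y}_1 + \mathbf{1}_{n_2}^T \mathbf{y}_2 = 0$, we have, almost surely,

$$\sqrt{\frac{n n_1}{n_2}}\, \mathbf{y}_1 \to \pm \mathbf{1}_{n_1} \quad \text{and} \quad \sqrt{\frac{n n_2}{n_1}}\, \mathbf{y}_2 \to \mp \mathbf{1}_{n_2}. \tag{26}$$

Consequently, in Case 1 $\mathbf{y}_1$ and $\mathbf{y}_2$ tend to be constant vectors with opposite signs. More importantly, (26) suggests a phase transition in spectral community detectability. In Case 1, spectral clustering can almost correctly identify the two communities since $\mathbf{y}_1$ and $\mathbf{y}_2$ are constant vectors with opposite signs. On the other hand, in Case 2, $\mathbf{1}_{n_1}^T \mathbf{y}_1 \to 0$ and $\mathbf{1}_{n_2}^T \mathbf{y}_2 \to 0$ almost surely, so within each of $\mathbf{y}_1$ and $\mathbf{y}_2$ the entries tend to take both positive and negative signs. Therefore in Case 2 spectral clustering results in very poor community detection.

IV. UPPER AND LOWER BOUNDS ON THE CRITICAL VALUE

Next we derive an upper bound on the critical value $p^*$ of the phase transition. From (1) and (3) we know that

$$\lambda_2(\mathbf{L}) = \mathbf{y}_1^T (\mathbf{L}_{S_1} + \mathbf{D}_{S_1} + \mathbf{L}_{N_1} + \mathbf{D}_{N_1}) \mathbf{y}_1 - 2 \mathbf{y}_1^T (\mathbf{C}_S + \mathbf{C}_N) \mathbf{y}_2 + \mathbf{y}_2^T (\mathbf{L}_{S_2} + \mathbf{D}_{S_2} + \mathbf{L}_{N_2} + \mathbf{D}_{N_2}) \mathbf{y}_2 \tag{27}$$

subject to $\mathbf{1}_{n_1}^T \mathbf{y}_1 + \mathbf{1}_{n_2}^T \mathbf{y}_2 = 0$ and $\mathbf{y}_1^T \mathbf{y}_1 + \mathbf{y}_2^T \mathbf{y}_2 = 1$. In Case 2, since $\mathbf{1}_{n_1}^T \mathbf{y}_1 \to 0$ and $\mathbf{1}_{n_2}^T \mathbf{y}_2 \to 0$ almost surely, recalling the definition $\mathbf{\Delta}_S = \mathbf{C}_S - \bar{\mathbf{C}}_S$ and letting $\mathbf{\Delta}_N = \mathbf{C}_N - \bar{\mathbf{C}}_N$,

$$\frac{\mathbf{y}_1^T (\mathbf{C}_S + \mathbf{C}_N) \mathbf{y}_2}{\sqrt{n_1 n_2}} = \frac{\mathbf{y}_1^T (\bar{\mathbf{C}}_S + \bar{\mathbf{C}}_N) \mathbf{y}_2 + \mathbf{y}_1^T \mathbf{\Delta}_S \mathbf{y}_2 + \mathbf{y}_1^T \mathbf{\Delta}_N \mathbf{y}_2}{\sqrt{n_1 n_2}} \leq \frac{\mathbf{y}_1^T (\bar{\mathbf{C}}_S + \bar{\mathbf{C}}_N) \mathbf{y}_2 + \|\mathbf{y}_1\|_2 \|\mathbf{y}_2\|_2 \cdot [\sigma_1(\mathbf{\Delta}_S) + \sigma_1(\mathbf{\Delta}_N)]}{\sqrt{n_1 n_2}} \xrightarrow{\text{a.s.}} 0 \tag{28}$$

by the facts that $\sigma_1\left(\frac{\mathbf{\Delta}_S}{\sqrt{n_1 n_2}}\right) \xrightarrow{\text{a.s.}} 0$ and $\sigma_1\left(\frac{\mathbf{\Delta}_N}{\sqrt{n_1 n_2}}\right) \xrightarrow{\text{a.s.}} 0$, as shown in Appendix VII-B of [27], and that $\bar{\mathbf{C}}_S = p \mathbf{1}_{n_1} \mathbf{1}_{n_2}^T$ and $\bar{\mathbf{C}}_N = q \mathbf{1}_{n_1} \mathbf{1}_{n_2}^T$.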
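The singular value behaviour used in (12) and (28) can be illustrated numerically for one finite-size draw. This is only a sketch: the sizes, the value of $p$ and the random seed are assumptions chosen for illustration, and a single sample says nothing about convergence rates.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p = 400, 400, 0.3                       # illustrative sizes and probability

C = (rng.random((n1, n2)) < p).astype(float)    # Bernoulli(p) external block C_S
Delta = C - p                                   # Delta_S = C_S - p * 1 1^T

scale = np.sqrt(n1 * n2)
# np.linalg.svd returns singular values in descending order
s_C = np.linalg.svd(C, compute_uv=False) / scale
s_Delta = np.linalg.svd(Delta, compute_uv=False) / scale

top_C = s_C[0]          # should be close to p
second_C = s_C[1]       # should be small
top_Delta = s_Delta[0]  # should be small
```

At this size the largest normalized singular value of the sampled block sits near $p$, while the second singular value and the largest singular value of the centered block are both much smaller, which is the qualitative picture that the limits describe.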
Furthermore, since $\mathbf{D}_{S_1} = \mathrm{diag}(\mathbf{C}_S \mathbf{1}_{n_2})$, $\mathbf{D}_{S_2} = \mathrm{diag}(\mathbf{C}_S^T \mathbf{1}_{n_1})$, $\mathbf{D}_{N_1} = \mathrm{diag}(\mathbf{C}_N \mathbf{1}_{n_2})$ and $\mathbf{D}_{N_2} = \mathrm{diag}(\mathbf{C}_N^T \mathbf{1}_{n_1})$, (12) gives, almost surely,

$$\frac{1}{n_2} \mathbf{y}_1^T (\mathbf{D}_{S_1} + \mathbf{D}_{N_1}) \mathbf{y}_1 \to (p + q) \mathbf{y}_1^T \mathbf{y}_1; \tag{29}$$

$$\frac{1}{n_1} \mathbf{y}_2^T (\mathbf{D}_{S_2} + \mathbf{D}_{N_2}) \mathbf{y}_2 \to (p + q) \mathbf{y}_2^T \mathbf{y}_2. \tag{30}$$

Therefore in Case 2 we have

$$\frac{\lambda_2(\mathbf{L})}{n} \xrightarrow{\text{a.s.}} \min_{\mathbf{x} \in \mathcal{S}} \frac{\mathbf{x}_1^T \mathbf{L}_1 \mathbf{x}_1 + \mathbf{x}_2^T \mathbf{L}_2 \mathbf{x}_2 + n_2 t\, \mathbf{x}_1^T \mathbf{x}_1 + n_1 t\, \mathbf{x}_2^T \mathbf{x}_2}{n}, \tag{31}$$

where $\mathbf{L}_i = \mathbf{L}_{S_i} + \mathbf{L}_{N_i}$, $t = p + q$, and

$$\mathcal{S} = \left\{ \mathbf{x} = [\mathbf{x}_1^T\ \mathbf{x}_2^T]^T : \mathbf{1}_{n_1}^T \mathbf{x}_1 = \mathbf{1}_{n_2}^T \mathbf{x}_2 = 0,\ \mathbf{x}_1^T \mathbf{x}_1 + \mathbf{x}_2^T \mathbf{x}_2 = 1 \right\}. \tag{32}$$

Define two sets

$$\mathcal{S}_1 = \left\{ \mathbf{x} : \mathbf{1}_{n_1}^T \mathbf{x}_1 = \mathbf{1}_{n_2}^T \mathbf{x}_2 = 0,\ \mathbf{x}_1^T \mathbf{x}_1 = 1,\ \mathbf{x}_2^T \mathbf{x}_2 = 0 \right\}; \tag{33}$$

$$\mathcal{S}_2 = \left\{ \mathbf{x} : \mathbf{1}_{n_1}^T \mathbf{x}_1 = \mathbf{1}_{n_2}^T \mathbf{x}_2 = 0,\ \mathbf{x}_1^T \mathbf{x}_1 = 0,\ \mathbf{x}_2^T \mathbf{x}_2 = 1 \right\}, \tag{34}$$

and define

$$\mu_i(\mathbf{L}) = \min_{\mathbf{x} \in \mathcal{S}_i} \frac{\mathbf{x}_1^T \mathbf{L}_1 \mathbf{x}_1 + \mathbf{x}_2^T \mathbf{L}_2 \mathbf{x}_2 + n_2 t\, \mathbf{x}_1^T \mathbf{x}_1 + n_1 t\, \mathbf{x}_2^T \mathbf{x}_2}{n}. \tag{35}$$

Since $\mathcal{S}_1, \mathcal{S}_2 \subseteq \mathcal{S}$, we have, almost surely,

$$\begin{aligned} \frac{\lambda_2(\mathbf{L})}{n} &\leq \min\{\mu_1(\mathbf{L}), \mu_2(\mathbf{L})\} = \min\left\{ \frac{\lambda_2(\mathbf{L}_1) + n_2 t}{n}, \frac{\lambda_2(\mathbf{L}_2) + n_1 t}{n} \right\} \\ &= \frac{t}{2} + \frac{\lambda_2(\mathbf{L}_1) + \lambda_2(\mathbf{L}_2) - |\lambda_2(\mathbf{L}_1) - \lambda_2(\mathbf{L}_2) + (n_2 - n_1) t|}{2n} \\ &\leq \frac{t}{2} + \frac{|n_1 - n_2|\, t}{2n} + \frac{\lambda_2(\mathbf{L}_1) + \lambda_2(\mathbf{L}_2) - |\lambda_2(\mathbf{L}_1) - \lambda_2(\mathbf{L}_2)|}{2n}, \end{aligned} \tag{36}$$

where we use the facts that $\min\{a, b\} = \frac{a + b - |a - b|}{2}$ and $|a - b| \geq |a| - |b|$. Note that the last inequality in (36) holds with equality if $n_1 = n_2$. Let $t^* = p^* + q$ be the critical value for the phase transition from Case 1 to Case 2. There is a phase transition in the asymptotic value of $\frac{\lambda_2(\mathbf{L})}{n}$ since its slope with respect to $t$ converges to 1 almost surely when $t \leq t^*$, whereas from (36) $\frac{\lambda_2(\mathbf{L})}{n} - t \leq \frac{(|n_1 - n_2| - n)\, t}{2n} + \frac{\lambda_2(\mathbf{L}_1) + \lambda_2(\mathbf{L}_2) - |\lambda_2(\mathbf{L}_1) - \lambda_2(\mathbf{L}_2)|}{2n}$ when $t \geq t^*$. From (19), we obtain an asymptotic upper bound $p_{\mathrm{UB}}$ on the critical value $p^*$ by substituting $t^* = p^* + q$ into (36):

$$p_{\mathrm{UB}} = \frac{\lambda_2(\mathbf{L}_1) + \lambda_2(\mathbf{L}_2) - |\lambda_2(\mathbf{L}_1) - \lambda_2(\mathbf{L}_2)|}{n - |n_1 - n_2|} - q. \tag{37}$$

To derive a lower bound on $p^*$, we have that in Case 2,

$$\begin{aligned} \frac{\lambda_2(\mathbf{L})}{n} &\xrightarrow{\text{a.s.}} \min_{\mathbf{x} \in \mathcal{S}} \frac{\mathbf{x}_1^T \mathbf{L}_1 \mathbf{x}_1 + \mathbf{x}_2^T \mathbf{L}_2 \mathbf{x}_2 + n_2 t\, \mathbf{x}_1^T \mathbf{x}_1 + n_1 t\, \mathbf{x}_2^T \mathbf{x}_2}{n} \\ &\geq \min_{\mathbf{x} \in \mathcal{S}} \frac{\mathbf{x}_1^T \mathbf{L}_1 \mathbf{x}_1 + \mathbf{x}_2^T \mathbf{L}_2 \mathbf{x}_2}{n} + \min_{\mathbf{x} \in \mathcal{S}} \frac{n_2 t\, \mathbf{x}_1^T \mathbf{x}_1 + n_1 t\, \mathbf{x}_2^T \mathbf{x}_2}{n} \end{aligned} \tag{38}$$

$$= \min\left\{ \frac{\lambda_2(\mathbf{L}_1)}{n}, \frac{\lambda_2(\mathbf{L}_2)}{n} \right\} + \min\left\{ \frac{n_1 t}{n}, \frac{n_2 t}{n} \right\} = \frac{t}{2} - \frac{|n_1 - n_2|\, t}{2n} + \frac{\lambda_2(\mathbf{L}_1) + \lambda_2(\mathbf{L}_2) - |\lambda_2(\mathbf{L}_1) - \lambda_2(\mathbf{L}_2)|}{2n}. \tag{39}$$

Substituting $t^* = p^* + q$ into (39), we obtain an asymptotic lower bound $p_{\mathrm{LB}}$ on the critical value $p^*$:

$$p_{\mathrm{LB}} = \frac{\lambda_2(\mathbf{L}_1) + \lambda_2(\mathbf{L}_2) - |\lambda_2(\mathbf{L}_1) - \lambda_2(\mathbf{L}_2)|}{n + |n_1 - n_2|} - q. \tag{40}$$

Note that when $n_1 = n_2$, the equality in (38) holds. This means that when $n_1 = n_2$, $\frac{\lambda_2(\mathbf{L})}{n} \xrightarrow{\text{a.s.}} \frac{t}{2} + \frac{\lambda_2(\mathbf{L}_1) + \lambda_2(\mathbf{L}_2) - |\lambda_2(\mathbf{L}_1) - \lambda_2(\mathbf{L}_2)|}{2n} =: \frac{t}{2} + c^*$ in Case 2, and the critical value

$$p^* \xrightarrow{\text{a.s.}} \frac{\lambda_2(\mathbf{L}_1) + \lambda_2(\mathbf{L}_2) - |\lambda_2(\mathbf{L}_1) - \lambda_2(\mathbf{L}_2)|}{n} - q. \tag{41}$$

Here we derive the bounds on the critical value $p^*$ for the stochastic block model, where the internal adjacency matrix $\mathbf{A}_{S_i}$ in (2) is generated by an Erdos-Renyi random graph with edge connection probability $p_i$. It is proved in Appendix VII-C of [27] that $\frac{\lambda_2(\mathbf{L}_i)}{n_i} \xrightarrow{\text{a.s.}} p_i + q$. Substituting $\lambda_2(\mathbf{L}_i) = n_i (p_i + q)$ and $n_1 = c\, n_2$ into (37) and (40) gives $p_{\mathrm{UB}} = \frac{c p_1 + p_2 - |c p_1 - p_2 + (c - 1) q| + |1 - c|\, q}{1 + c - |1 - c|}$ and $p_{\mathrm{LB}} = \frac{c p_1 + p_2 - |c p_1 - p_2 + (c - 1) q| - |1 - c|\, q}{1 + c + |1 - c|}$. When $n_1 = n_2$ (i.e., $c = 1$), the critical value $p^* \xrightarrow{\text{a.s.}} \frac{p_1 + p_2 - |p_1 - p_2|}{2}$. This suggests that in the large network limit when $n \to \infty$ and $c = 1$ the performance of spectral community detection is independent of the noise parameter $q$.

[Fig. 1: three panels plotted against $p$ over the range 0 to 0.4. (a) $\frac{\lambda_2(\mathbf{L})}{n}$: simulation, together with the curves $\frac{\lambda_2(\mathbf{L})}{n} = p + q$ and $\frac{\lambda_2(\mathbf{L})}{n} = \frac{p + q}{2} + c^*$. (b) Detectability: spectral clustering and the baseline (0.5). (c) $\mathbf{1}_{n_i}^T \mathbf{y}_i$: $\mathbf{1}_{n_1}^T \mathbf{y}_1$, $\mathbf{1}_{n_2}^T \mathbf{y}_2$, and $\pm\sqrt{\frac{n_1 n_2}{n}}$.]

Fig. 1. Two communities generated by the stochastic block model [29]. The results are averaged over 100 trials. $n_1 = n_2 = 2000$, $p_1 = p_2 = 0.25$, and $q = 0.05$. The theoretical critical value from (41) is $p^* = 0.2229$.

TABLE I
Sensitivity of spectral community detection to noisy edge insertions for Amazon American political books co-purchasement data [36]. The network contains 105 nodes and 441 edges. The oracle detectability is 0.8762. The noisy edges are randomly generated for 100 trials.

noise level ($q$)                                          0        0.002    0.01     0.05     0.1
detectability, mean                                        0.8571   0.8548   0.8004   0.6325   0.5038
detectability, std                                         0        0.006    0.1227   0.1597   0.0823
$\hat{p}_{\mathrm{LB}}$, mean                              0.0127   0.0116   0.0076   0.00016  0
$\hat{p}_{\mathrm{LB}}$, std                               0        0.0021   0.0039   0.001    0
$\hat{p}$, mean                                            0.0073   0.0095   0.0173   0.0513   0.0835
$\hat{p}$, std                                             0        0.001    0.0025   0.011    0.0209
$\hat{p}_{\mathrm{UB}}$, mean                              0.013    0.0124   0.0633   0.1422   0.1494
$\hat{p}_{\mathrm{UB}}$, std                               0        0.0021   0.1493   0.3199   0.3213
fraction of $\hat{p} \leq \hat{p}_{\mathrm{LB}}$           1        0.98     0.01     0        0
fraction of $\hat{p}_{\mathrm{LB}} < \hat{p} < \hat{p}_{\mathrm{UB}}$   0        0.02     0.75     0.2      0.2
fraction of $\hat{p} \geq \hat{p}_{\mathrm{UB}}$           0        0        0.24     0.8      0.8

V. PERFORMANCE EVALUATION

A. Simulated Networks

We use the stochastic block model [29] to generate network graphs for community detection. The detectability is defined as the fraction of nodes that are correctly identified, and the baseline detectability is 0.5 for random guesses. In Fig. 1, when $p_1 = p_2 = 0.25$, $n_1 = n_2 = 2000$ and $q = 0.05$, the theoretical critical value from (41) is $p^* = 0.2229$. Note that $p^*$ will converge to 0.25 as we increase $n$, as predicted in Sec. IV. Fig. 1 (a) verifies the phase transition in $\frac{\lambda_2(\mathbf{L})}{n}$, empirically confirming that $\frac{\lambda_2(\mathbf{L})}{n}$ approaches $p + q$ when $p \leq p^*$ and approaches $\frac{p + q}{2} + c^*$ when $p > p^*$, where $c^* = \frac{\lambda_2(\mathbf{L}_1) + \lambda_2(\mathbf{L}_2) - |\lambda_2(\mathbf{L}_1) - \lambda_2(\mathbf{L}_2)|}{2n}$. Fig. 1 (b) shows that the community detectability transitions from almost perfect detectability when $p < p^*$ to low detectability when $p > p^*$. Moreover, as derived in (26), the Fiedler vector components $\mathbf{y}_1$ and $\mathbf{y}_2$ are constant vectors with opposite signs for $p < p^*$, and $\mathbf{1}_{n_1}^T \mathbf{y}_1 \to 0$ and $\mathbf{1}_{n_2}^T \mathbf{y}_2 \to 0$ for $p > p^*$, as shown in Fig. 1 (c).

B. Empirical Estimators of Phase Transition Bounds on Real-world Dataset

Here we show that the critical phase transition threshold $p^*$ can be estimated empirically to test the reliability of spectral community detection. Let $\hat{\mathbf{L}}_i$ be the graph Laplacian matrix of the estimated community $i$ obtained by applying spectral clustering to the observed adjacency matrix $\mathbf{A}$, and let $\hat{n}_i$ denote the estimated network size of community $i$. Using (37) and (40), the empirical estimators of these parameters are defined as

$$\hat{p} = \frac{\text{number of identified external edges}}{\hat{n}_1 \hat{n}_2}, \tag{42}$$

$$\hat{p}_{\mathrm{LB}} = \frac{\lambda_2(\hat{\mathbf{L}}_1) + \lambda_2(\hat{\mathbf{L}}_2) - |\lambda_2(\hat{\mathbf{L}}_1) - \lambda_2(\hat{\mathbf{L}}_2)|}{n + |\hat{n}_1 - \hat{n}_2|}, \tag{43}$$

$$\hat{p}_{\mathrm{UB}} = \frac{\lambda_2(\hat{\mathbf{L}}_1) + \lambda_2(\hat{\mathbf{L}}_2) - |\lambda_2(\hat{\mathbf{L}}_1) - \lambda_2(\hat{\mathbf{L}}_2)|}{n - |\hat{n}_1 - \hat{n}_2|}. \tag{44}$$

Based on these empirical estimates, the performance of community detection can be classified into three categories. If $\hat{p} \leq \hat{p}_{\mathrm{LB}}$, the network is in the reliable detection region. If $\hat{p}_{\mathrm{LB}} < \hat{p} < \hat{p}_{\mathrm{UB}}$, the network is in the intermediate detection region. If $\hat{p} \geq \hat{p}_{\mathrm{UB}}$, the network is in the unreliable detection region.

The co-purchasement data between 105 American political books sold on Amazon [36] are used to estimate the parameters $p_{\mathrm{LB}}$, $p_{\mathrm{UB}}$ and $p$. In the corresponding network graph, nodes represent political books and edges represent co-purchasements.
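The estimators (42)-(44) and the three-way classification can be written as a few small functions. This is a minimal sketch: the function names and the numbers in the usage lines are hypothetical, and the inputs (the algebraic connectivities of the two estimated communities, their sizes and the count of external edges) are assumed to have been computed beforehand from the spectral clustering output.

```python
def external_edge_density(num_external_edges, n1, n2):
    """Estimator of p in the form of (42): external edges over n1 * n2 node pairs."""
    return num_external_edges / (n1 * n2)

def threshold_bounds(lam2_1, lam2_2, n1, n2):
    """Return (p_LB, p_UB) in the form of (43) and (44).

    Assumes both estimated communities are non-empty, so that
    n - |n1 - n2| = 2 * min(n1, n2) > 0.
    """
    numerator = lam2_1 + lam2_2 - abs(lam2_1 - lam2_2)   # equals 2 * min(lam2_1, lam2_2)
    n = n1 + n2
    gap = abs(n1 - n2)
    return numerator / (n + gap), numerator / (n - gap)

def detection_regime(p_hat, p_lb, p_ub):
    """Classify the operating regime from the empirical estimates."""
    if p_hat <= p_lb:
        return "reliable"
    if p_hat < p_ub:
        return "intermediate"
    return "unreliable"

# Hypothetical inputs, chosen only to exercise the three branches.
p_lb, p_ub = threshold_bounds(3.0, 5.0, 40, 60)
regimes = [detection_regime(p_hat, p_lb, p_ub) for p_hat in (0.04, 0.06, 0.08)]
```

Because the denominator of the upper bound is twice the size of the smaller estimated community, a very imbalanced split makes that estimate large and unstable, so a guard on the minimum community size may be worth adding in practice.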
An edge exists between two books if they are frequently purchased by the same buyer. Three labels, liberal, conservative and neutral, were determined by Newman [36]. We perform community detection by separating the books into two groups since there are only 13 books with neutral labels (i.e., the oracle detectability is 0.8762). To investigate the sensitivity of spectral community detection to noisy edge insertions, for each edge not present in the original graph, an edge is added with probability $q$. The community detection results are summarized in Table I. Observe that for small $q$ ($q = 0$ or $0.002$) the network is mostly in the reliable detection region ($\hat{p} \leq \hat{p}_{\mathrm{LB}}$), which indicates that spectral community detection achieves high detectability. When $q = 0.01$, the network is mostly in the intermediate detection region ($\hat{p}_{\mathrm{LB}} < \hat{p} < \hat{p}_{\mathrm{UB}}$), indicating that the community detectability has large variation. When $q$ is large ($q = 0.05$ or $0.1$), the network is mostly in the unreliable detection region, resulting in low detectability. The large standard deviation of $\hat{p}_{\mathrm{UB}}$ for large $q$ is due to the fact that spectral community detection may mistakenly detect two communities with extremely imbalanced community sizes such that the denominator of the estimator $\hat{p}_{\mathrm{UB}}$ is small.

VI. CONCLUSION

We establish asymptotic phase transition bounds on the critical value $p^*$ under a general network setting corrupted by an Erdos-Renyi type noise model. The communities are proven to be almost perfectly detectable below the phase transition threshold and to be undetectable above the phase transition threshold. The phase transition bounds are used to establish empirical estimators to evaluate the reliability of spectral community detection, where the detector is said to be operating in the reliable, intermediate, or unreliable detection regime based on the empirical estimates.
Simulated networks generated by the stochastic block model validate the phase transition theory for community detectability. An empirical estimator of the phase transition is proposed that can be used to explore the sensitivity of the spectral community detection algorithm on real data.

REFERENCES

[1] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3-5, pp. 75–174, 2010.
[2] B. Miller, N. Bliss, and P. J. Wolfe, “Subgraph detection using eigenvector L1 norms,” in Advances in Neural Information Processing Systems (NIPS), 2010, pp. 1633–1641.
[3] A. Sandryhaila and J. Moura, “Discrete signal processing on graphs,” IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1644–1656, Apr. 2013.
[4] A. Bertrand and M. Moonen, “Seeing the bigger picture: How nodes can learn their place within a complex ad hoc network topology,” IEEE Signal Process. Mag., vol. 30, no. 3, pp. 71–82, May 2013.
[5] D. Shuman, S. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Process. Mag., vol. 30, no. 3, pp. –98, May 2013.
[6] B. Miller, N. Bliss, and P. Wolfe, “Toward signal processing theory for graphs and non-Euclidean data,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2010, pp. 5414–5417.
[7] P.-Y. Chen and A. O. Hero, “Deep community detection,” arXiv:1407.6071, 2014.
[8] S. Chen, A. Sandryhaila, G. Lederman, Z. Wang, J. Moura, P. Rizzo, J. Bielak, J. Garrett, and J. Kovacevic, “Signal inpainting on graphs via total variation minimization,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 8267–8271.
[9] P.-Y. Chen and A. O. Hero, “Local Fiedler vector centrality for detection of deep and overlapping communities in networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 1120–1124.
[10] U. Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, Dec. 2007.
[11] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, 2000.
[12] S. White and P. Smyth, “A spectral clustering approach to finding communities in graph,” in SDM, 2005, pp. 274–285.
[13] Y. van Gennip, H. Hu, B. Hunter, and M. A. Porter, “Geosocial graph-based community detection,” in IEEE International Conference on Data Mining Workshops, 2012, pp. 754–758.
[14] S. Tsironis, M. Sozio, T. Paristech, and M. Vazirgiannis, “Accurate spectral clustering for community detection in MapReduce,” in Advances in Neural Information Processing Systems (NIPS) Workshops, 2013.
[15] L. Huang, R. Li, H. Chen, X. Gu, K. Wen, and Y. Li, “Detecting network communities using regularized spectral clustering algorithm,” Artificial Intelligence Review, vol. 41, no. 4, pp. 579–594, 2014.
[16] R. Merris, “Laplacian matrices of graphs: a survey,” Linear Algebra and its Applications, vol. 197-198, pp. 143–176, 1994.
[17] F. R. K. Chung, Spectral Graph Theory. American Mathematical Society, 1997.
[18] M. Fiedler, “Algebraic connectivity of graphs,” Czechoslovak Mathematical Journal, vol. 23, no. 98, pp. 298–305, 1973.
[19] J. A. Hartigan and M. A. Wong, “A k-means clustering algorithm,” JSTOR: Applied Statistics, vol. 28, no. 1, pp. 100–108, 1979.
[20] M. E. J. Newman, “Finding community structure in networks using the eigenvectors of matrices,” Phys. Rev. E, vol. 74, p. 036104, Sep 2006.
[21] P. J. Bickel and A. Chen, “A nonparametric view of network models and Newman-Girvan and other modularities,” Proceedings of the National Academy of Sciences, vol. 106, no. 50, pp. 21068–21073, 2009.
[22] Y. Zhao, E. Levina, and J. Zhu, “Consistency of community detection in networks under degree-corrected stochastic block models,” The Annals of Statistics, vol. 40, no. 4, pp. 2266–2292, Aug. 2012.
[23] R. R. Nadakuditi and M. E. J. Newman, “Graph spectra and the detectability of community structure in networks,” Phys. Rev. Lett., vol. 108, p. 188701, May 2012.
[24] F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, and P. Zhang, “Spectral redemption in clustering sparse networks,” Proceedings of the National Academy of Sciences, vol. 110, no. 52, pp. 20935–20940, 2013.
[25] F. Radicchi, “Detectability of communities in heterogeneous networks,” Phys. Rev. E, vol. 88, p. 010801, Jul 2013.
[26] F. Radicchi, “A paradox in community detection,” EPL (Europhysics Letters), vol. 106, no. 3, p. 38001, 2014.
[27] P.-Y. Chen and A. O. Hero, “Phase transitions in spectral community detection,” 2014.
[28] P.-Y. Chen and A. O. Hero, “Universal phase transition in community detectability under a stochastic block model,” arXiv:1409.2186, 2014.
[29] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,” Social Networks, vol. 5, no. 2, pp. 109–137, 1983.
[30] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, “Inference and phase transitions in the detection of modules in sparse networks,” Phys. Rev. Lett., vol. 107, p. 065701, Aug 2011.
[31] R. R. Nadakuditi, “On hard limits of eigen-analysis based planted clique detection,” in IEEE Statistical Signal Processing Workshop (SSP), Aug 2012, pp. 129–132.
[32] F. Radicchi and A. Arenas, “Abrupt transition in the structural formation of interconnected networks,” Nature Physics, vol. 9, no. 11, pp. 717–720, Nov. 2013.
[33] R. Latala, “Some estimates of norms of random matrices,” Proc. Am. Math. Soc., vol. 133, no. 5, pp. 1273–1282, 2005.
[34] M. Talagrand, “Concentration of measure and isoperimetric inequalities in product spaces,” Publications Mathématiques de l'Institut des Hautes Études Scientifiques, vol. 81, no. 1, pp. 73–205, 1995.
[35] F. Benaych-Georges and R. R. Nadakuditi, “The singular values and vectors of low rank perturbations of large rectangular random matrices,” Journal of Multivariate Analysis, vol. 111, no. 0, pp. 120–135, 2012.
[36] M. E. J. Newman, “Modularity and community structure in networks,” Proc. National Academy of Sciences, vol. 103, no. 23, pp. 8577–8582, 2006.