Convexified Modularity Maximization for Degree-corrected Stochastic Block Models


Authors: Yudong Chen, Xiaodong Li, Jiaming Xu

January 20, 2016

Y. Chen is with the School of Operations Research and Information Engineering at Cornell University, Ithaca, NY (yudong.chen@cornell.edu). X. Li is with the Statistics Department at the University of California, Davis, CA (xdgli@ucdavis.edu). J. Xu is with the Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA (jiamingx@wharton.upenn.edu).

Abstract

The stochastic block model (SBM) is a popular framework for studying community detection in networks. This model is limited by the assumption that all nodes in the same community are statistically equivalent and have equal expected degrees. The degree-corrected stochastic block model (DCSBM) is a natural extension of SBM that allows for degree heterogeneity within communities. This paper proposes a convexified modularity maximization approach for estimating the hidden communities under DCSBM. Our approach is based on a convex programming relaxation of the classical (generalized) modularity maximization formulation, followed by a novel doubly-weighted $\ell_1$-norm $k$-median procedure. We establish non-asymptotic theoretical guarantees for both approximate clustering and perfect clustering. Our approximate clustering results are insensitive to the minimum degree, and hold even in the sparse regime with bounded average degrees. In the special case of SBM, these theoretical results match the best-known performance guarantees of computationally feasible algorithms. Numerically, we provide an efficient implementation of our algorithm, which is applied to both synthetic and real-world networks. Experimental results show that our method enjoys competitive performance compared to the state of the art in the literature.

1 Introduction

Detecting communities/clusters in networks and graphs is an important subroutine in many applications across computer, social, and natural sciences and engineering. A standard framework for studying community detection in a statistical setting is the stochastic block model (SBM) proposed in Holland et al. [1983]. Also known as the planted partition model in the computer science literature [Condon and Karp, 2001], SBM is a random graph model for generating networks from a set of underlying clusters. The statistical task is to accurately recover the underlying true clusters given a single realization of the random graph.

The versatility and analytic tractability of SBM have made it arguably the most popular model for studying community detection. It however falls short of capturing a key aspect of real-world networks. In particular, an unrealistic assumption of SBM is that within each community, the degree distributions of all nodes are the same. In empirical network datasets, however, the degree distributions are often highly inhomogeneous across nodes, sometimes exhibiting heavy-tail behavior with some nodes having very high degrees (so-called hubs). At the same time, sparsely connected nodes with small degrees are also common in real networks. To overcome this shortcoming of the SBM, the degree-corrected stochastic block model (DCSBM) was introduced in the literature to allow for degree heterogeneity within communities, thereby providing a more flexible and accurate model of real-world networks [Dasgupta et al., 2004; Karrer and Newman, 2011].

A number of community detection methods have been proposed based on DCSBM, such as model-based methods and spectral methods. Model-based methods include profile likelihood maximization and modularity maximization [Newman, 2006; Karrer and Newman, 2011]. Although these methods enjoy certain statistical guarantees [Zhao et al., 2012], they often involve optimization over all possible partitions, which is computationally intractable. Recent work in Amini et al. [2013]; Le et al. [2015+] discusses efficient solvers, but the theoretical guarantees are only established under restricted settings such as those with two communities. Spectral methods, which estimate the communities based on the eigenvectors of the graph adjacency matrix and its variants, are computationally fast. Statistical guarantees have been derived for spectral methods under certain settings (see, e.g., Dasgupta et al. [2004]; Coja-Oghlan and Lanka [2009]; Chaudhuri et al. [2012]; Qin and Rohe [2013]; Lei and Rinaldo [2015]; Jin [2015]; Gulikers et al. [2015]), but numerical validation on synthetic and real data has not been as thorough. One notable exception is the SCORE method proposed in Jin [2015], which achieved one of the best known clustering performances on the political blogs dataset from Adamic and Glance [2005]. Spectral methods are also known to suffer from inconsistency in sparse graphs [Krzakala et al., 2013] as well as sensitivity to outliers [Cai and Li, 2015]. We discuss other related work in detail in Section 5.

In this paper, we seek a clustering algorithm that is computationally feasible, has strong statistical performance guarantees under DCSBM, and provides competitive empirical performance. Our approach makes use of the robustness and computational power of convex optimization. Under the standard SBM, convex optimization has been proven to be statistically efficient under a broad range of model parameters, including the size and number of communities as well as the sparsity of the network; see, e.g., Chen et al. [2012]; Chen and Xu [2014]; Guédon and Vershynin [2015]; Ames and Vavasis [2014]; Oymak and Hassibi [2011].
Moreover, a significant advantage of convex methods is their robustness against arbitrary outlier nodes, as established in the theoretical framework of Cai and Li [2015]. There, it is also observed that their convex optimization approach leads to state-of-the-art misclassification rates on the political blogs dataset, in which the node degrees are highly heterogeneous. These observations motivate us to study whether strong theoretical guarantees under DCSBM can be established for convex optimization-based methods.

Building on the work of Chen et al. [2012] and Cai and Li [2015], we introduce in Section 2 a new community detection approach called Convexified Modularity Maximization (CMM). CMM is based on convexification of the elegant modularity maximization formulation, followed by a novel and computationally tractable weighted $\ell_1$-norm $k$-median clustering procedure. As we show in Section 3 and Section 4, our approach has strong theoretical guarantees, applicable even in the sparse graph regime with bounded average degree, as well as state-of-the-art empirical performance. In both aspects our approach is comparable to or improves upon the best-known results in the literature.

2 Problem setup and algorithms

In this section, we set up the community detection problem under DCSBM, and describe our algorithms based on convexified modularity maximization and weighted $k$-median clustering. Throughout this paper, we use lower-case and upper-case bold letters such as $u$ and $U$ to represent vectors and matrices, respectively, with $u_i$ and $U_{ij}$ denoting their elements. We use $U_{i*}$ to denote the $i$-th row of the matrix $U$. If all coordinates of a vector $v$ are nonnegative, we write $v \ge 0$. The notation $v > 0$, as well as $U \ge 0$ and $U > 0$ for matrices, is defined similarly. For a symmetric matrix $U \in \mathbb{R}^{n \times n}$, we write $U \succ 0$ if $U$ is positive definite, and $U \succeq 0$ if it is positive semidefinite.
For any sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \lesssim b_n$ if there is an absolute constant $c > 0$ such that $a_n / b_n \le c$ for all $n$, and we define $a_n \gtrsim b_n$ similarly.

2.1 The degree-corrected stochastic block model

In DCSBM a graph $G$ is generated randomly as follows. A total of $n$ nodes, which we identify with the set $[n] := \{1, \ldots, n\}$, are partitioned into $r$ fixed but unknown clusters $C_1^*, C_2^*, \ldots, C_r^*$. Each pair of distinct nodes $i \in C_a^*$ and $j \in C_b^*$ is connected by an (undirected) edge with probability $\theta_i \theta_j B_{ab} \in [0, 1]$, independently of all others. Here the vector $\theta = (\theta_1, \ldots, \theta_n)^\top \in \mathbb{R}_+^n$ is referred to as the degree heterogeneity parameters of the nodes, and the symmetric matrix $B \in \mathbb{R}_+^{r \times r}$ is called the connectivity matrix of the clusters. Without loss of generality, we assume that $\max_{1 \le i \le n} \theta_i = 1$ (one can multiply $\theta$ by a scalar $c$ and divide $B$ by $c^2$ without changing the distribution of the graph). Note that if $\theta_i = 1$ for all nodes $i$, DCSBM reduces to the classical SBM. Given a single realization of the resulting random graph $G = ([n], E)$, the statistical goal is to estimate the true clusters $\{C_a^*\}_{a=1}^r$.

Before describing our algorithms, let us first introduce some useful notation. Denote by $A \in \{0, 1\}^{n \times n}$ the adjacency matrix associated with the graph $G$, with $A_{ij} = 1$ if and only if nodes $i$ and $j$ are connected. For each candidate partition of the $n$ nodes into $r$ clusters, we associate it with a partition matrix $Y \in \{0, 1\}^{n \times n}$, such that $Y_{ij} = 1$ if and only if nodes $i$ and $j$ are assigned to the same cluster, with the convention that $Y_{ii} = 1$ for all $i$. Let $\mathcal{P}_{n,r}$ be the set of all such partition matrices, and $Y^*$ the true partition matrix associated with the ground-truth clusters $\{C_a^*\}_{a=1}^r$. The notion of partition matrices plays a crucial role in the subsequent discussion.
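The generative process above is straightforward to simulate. Below is a minimal NumPy sketch; the function name `sample_dcsbm` and the toy parameter values are ours, for illustration only:

```python
import numpy as np

def sample_dcsbm(theta, labels, B, rng):
    """Draw one adjacency matrix A from DCSBM: distinct nodes i, j are
    joined independently with probability theta_i * theta_j * B[a, b],
    where a, b are the cluster labels of i and j."""
    n = len(theta)
    # Edge probabilities for all pairs; the model requires them to lie in [0, 1].
    P = np.outer(theta, theta) * B[np.ix_(labels, labels)]
    assert P.max() <= 1.0
    U = rng.random((n, n))
    # Sample only the upper triangle, then symmetrize (undirected, no self-loops).
    A = (np.triu(U, k=1) < np.triu(P, k=1)).astype(int)
    return A + A.T

rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 1, 1, 1])
theta = np.array([1.0, 0.5, 0.8, 1.0, 0.6, 0.9])   # max theta_i = 1, as assumed above
B = np.array([[0.9, 0.1], [0.1, 0.9]])
A = sample_dcsbm(theta, labels, B, rng)
```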
2.2 Generalized modularity maximization

Our clustering algorithm is based on Newman and Girvan's classical notion of modularity (see, e.g., Newman [2006]). Given the graph adjacency matrix $A$ of $n$ nodes, the modularity of a partition represented by the partition matrix $Y \in \bigcup_r \mathcal{P}_{n,r}$ is defined as
$$Q(Y) := \sum_{1 \le i,j \le n} \Big( A_{ij} - \frac{d_i d_j}{2L} \Big) Y_{ij}, \qquad (2.1)$$
where $d_i := \sum_{j=1}^n A_{ij}$ is the degree of node $i$, and $L = \frac{1}{2} \sum_{i=1}^n d_i$ is the total number of edges in the graph. The modularity maximization approach to community detection is based on finding a partition $Y_m$ that optimizes $Q(Y)$:
$$Y_m \leftarrow \arg\max_{Y \in \bigcup_r \mathcal{P}_{n,r}} Q(Y). \qquad (2.2)$$
This standard form of modularity maximization is known to suffer from a "resolution limit" and cannot detect small clusters [Fortunato and Barthelemy, 2007]. To address this issue, several authors have proposed to replace the normalization factor $\frac{1}{2L}$ by a tuning parameter $\lambda$ [Reichardt and Bornholdt, 2006; Lancichinetti and Fortunato, 2011], giving rise to the following generalized formulation of modularity maximization:
$$Y_m \leftarrow \arg\max_{Y \in \bigcup_r \mathcal{P}_{n,r}} Q_\lambda(Y) := \sum_{1 \le i,j \le n} (A_{ij} - \lambda d_i d_j) Y_{ij}. \qquad (2.3)$$
While modularity maximization enjoys several desirable statistical properties under SBM and DCSBM [Zhao et al., 2012; Amini et al., 2013], the associated optimization problems (2.2) and (2.3) are not computationally feasible due to the combinatorial constraint, which limits the practical applications of these formulations. In practice, modularity maximization is often used as guidance for designing heuristic algorithms.

Here we take a more principled approach to computational feasibility while maintaining provable statistical guarantees: we develop a tractable convex surrogate for the above combinatorial optimization problems, whose solution is then refined by a novel weighted $k$-median algorithm.
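For concreteness, the objectives (2.1) and (2.3) can be evaluated directly from an adjacency matrix and a candidate labeling. A short sketch (the helper name and the toy graph of two disjoint triangles are ours):

```python
import numpy as np

def generalized_modularity(A, labels, lam=None):
    """Q_lambda(Y) = sum_{i,j} (A_ij - lam * d_i * d_j) * Y_ij, eq. (2.3),
    where Y_ij = 1 iff labels[i] == labels[j]; the default lam = 1/(2L)
    recovers the standard modularity (2.1)."""
    d = A.sum(axis=1)
    if lam is None:
        lam = 1.0 / d.sum()                 # d.sum() = 2L, twice the edge count
    Y = (labels[:, None] == labels[None, :]).astype(float)
    return float(((A - lam * np.outer(d, d)) * Y).sum())

# Two disjoint triangles: the planted 2-way split scores higher than one big cluster.
blk = np.ones((3, 3)) - np.eye(3)
A = np.block([[blk, np.zeros((3, 3))], [np.zeros((3, 3)), blk]]).astype(int)
labels = np.array([0, 0, 0, 1, 1, 1])
q_split = generalized_modularity(A, labels)
q_merged = generalized_modularity(A, np.zeros(6, dtype=int))
```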
2.3 Convex relaxation

Introducing the degree vector $d = (d_1, \ldots, d_n)^\top$, we can rewrite the generalized modularity maximization problem (2.3) in matrix form as
$$\max_Y \ \langle Y, A - \lambda dd^\top \rangle \quad \text{subject to } Y \in \bigcup_r \mathcal{P}_{n,r}, \qquad (2.4)$$
where $\langle \cdot, \cdot \rangle$ denotes the trace inner product between matrices. The objective function is linear in the matrix variable $Y$, so it suffices to convexify the combinatorial constraint $Y \in \bigcup_r \mathcal{P}_{n,r}$.

Recall that each matrix $Y$ in $\mathcal{P}_{n,r}$ corresponds to a unique partition of the $n$ nodes into $r$ clusters. There is another algebraic representation of such a partition via a membership matrix $\Psi \in \{0,1\}^{n \times r}$, where $\Psi_{ia} = 1$ if and only if node $i$ belongs to cluster $a$. These two representations are related by the identity
$$Y = \Psi \Psi^\top, \qquad (2.5)$$
which implies that $Y \succeq 0$. The membership matrix of a partition is only unique up to permutation of the cluster labels $1, 2, \ldots, r$, so each partition matrix $Y$ corresponds to multiple membership matrices $\Psi$. We use $\mathcal{M}_{n,r}$ to denote the set of all possible membership matrices of $r$-partitions.

Besides being positive semidefinite, a partition matrix $Y$ also satisfies the linear constraints $0 \le Y_{ij} \le 1$ and $Y_{ii} = 1$ for all $i, j \in [n]$. Using these properties of partition matrices, we obtain the following convexification of the modularity optimization problem (2.4):
$$\widehat{Y} = \arg\max_Y \ \langle Y, A - \lambda dd^\top \rangle \quad \text{subject to } Y \succeq 0, \ 0 \le Y \le J, \ Y_{ii} = 1 \text{ for each } i \in [n]. \qquad (2.6)$$
Here $J$ is the $n \times n$ matrix with all entries equal to 1. Implementation of the formulation (2.6) requires choosing an appropriate tuning parameter $\lambda$. We will discuss the theoretical range for $\lambda$ for consistent clustering in Section 3, and the empirical choice of $\lambda$ in Section 4. As our convexification is based on the generalized version (2.3) of modularity maximization, it is capable of detecting small clusters, even when the number of clusters $r$ grows with $n$, as is shown later.
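The identity (2.5) and the constraints retained in (2.6) are easy to verify numerically on a toy partition; a small sketch (the labels and sizes are arbitrary):

```python
import numpy as np

labels = np.array([0, 1, 0, 2, 1])                    # a 3-way partition of 5 nodes
Psi = np.zeros((5, 3))
Psi[np.arange(5), labels] = 1                         # membership matrix Psi
Y = Psi @ Psi.T                                       # partition matrix via (2.5)

assert np.all(np.diag(Y) == 1)                        # Y_ii = 1
assert np.all((Y >= 0) & (Y <= 1))                    # 0 <= Y <= J
assert np.linalg.eigvalsh(Y).min() >= -1e-10          # Y is positive semidefinite
# Y_ij = 1 exactly when i and j share a cluster:
assert np.all(Y == (labels[:, None] == labels[None, :]))
```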
2.4 Explicit clustering via weighted k-median

Ideally, the optimal solution $\widehat{Y}$ to the convex relaxation (2.6) is a valid partition matrix in $\mathcal{P}_{n,r}$ and recovers the true partition $Y^*$ perfectly; our theoretical results in Section 3 characterize when this is the case. In general, the matrix $\widehat{Y}$ will not lie in $\mathcal{P}_{n,r}$, but we expect it to be close to $Y^*$. To extract an explicit clustering from $\widehat{Y}$, we introduce a novel and tractable weighted $k$-median algorithm.

Recall that by definition, the $i$-th and $j$-th rows of the true partition matrix $Y^*$ are identical if the corresponding nodes $i$ and $j$ belong to the same community, and orthogonal to each other otherwise. If $\widehat{Y}$ is close to $Y^*$, intuitively one can extract a good partition by clustering the row vectors of $\widehat{Y}$ as points in the Euclidean space $\mathbb{R}^n$. While there exist numerous algorithms (e.g., $k$-means) for such a task, our analysis identifies a particularly viable choice, namely a $k$-median procedure appropriately weighted by the node degrees, which is efficient both theoretically and empirically.

Specifically, our weighted $k$-median procedure consists of two steps. First, we multiply the columns of $\widehat{Y}$ by the corresponding degrees to obtain the matrix $\widehat{W} := \widehat{Y} D$, where $D := \mathrm{diag}(d) = \mathrm{diag}(d_1, \ldots, d_n)$ is the diagonal matrix formed by the entries of $d$. Clustering is performed on the row vectors of $\widehat{W}$ instead of $\widehat{Y}$. Note that if we consider the $i$-th row of $\widehat{Y}$ as a vector of $n$ features for node $i$, then the rows of $\widehat{W}$ can be thought of as vectors of weighted features. In the second step, we implement a weighted $k$-median clustering on the row vectors of $\widehat{W}$. Denoting by $\widehat{w}_i$ the $i$-th row of $\widehat{W}$, we search for a partition $C_1, \ldots, C_r$ of $[n]$ and $r$ cluster centers $x_1, \ldots, x_r \in \mathbb{R}^n$ that minimize the sum of the weighted distances in $\ell_1$ norm:
$$\sum_{1 \le a \le r} \sum_{i \in C_a} d_i \, \| \widehat{w}_i - x_a \|_1.$$
Additionally, we require that the center vectors $x_1, \ldots, x_r$ be chosen from the row vectors of $\widehat{W}$ (such centers are sometimes called medoids). Representing the partition $\{C_a\}_{a=1}^r$ by a membership matrix $\Psi \in \mathcal{M}_{n,r}$ and the centers $\{x_a\}$ as the rows of a matrix $X \in \mathbb{R}^{r \times n}$, we may write the above two-step procedure compactly as
$$\min_{\Psi, X} \| D (\Psi X - \widehat{W}) \|_1 \quad \text{s.t. } \Psi \in \mathcal{M}_{n,r}, \ X \in \mathbb{R}^{r \times n}, \ \mathrm{Rows}(X) \subseteq \mathrm{Rows}(\widehat{W}), \qquad (2.7)$$
where $\mathrm{Rows}(Z)$ denotes the collection of row vectors of a matrix $Z$, and $\|Z\|_1$ denotes the sum of the absolute values of all entries of $Z$.

We emphasize that the formulation (2.7) differs from standard clustering algorithms (such as $k$-means) in a number of ways. The objective function is the sum of distances rather than the sum of squared distances (hence $k$-median), and the distances are in $\ell_1$ instead of $\ell_2$ norms. Moreover, our formulation has two levels of weighting: each column of $\widehat{Y}$ is weighted to form $\widehat{W}$, and the distance of each row $\widehat{w}_i$ to its cluster center is further weighted by $d_i$. This doubly-weighted $\ell_1$-norm $k$-median formulation is crucial in obtaining strong and robust statistical bounds, and is significantly different from previous approaches, such as those in Lei and Rinaldo [2015]; Gulikers et al. [2015] (which only use the second weighting, with weights inversely proportional to $d_i$). Our double weighting scheme is motivated by the observation that nodes with larger degrees tend to be clustered more accurately; in particular, our analysis of the convex relaxation (2.4) naturally leads to a doubly weighted error bound on its solution $\widehat{Y}$. On the one hand, for each given $i$, $\widehat{Y}_{ij}$ is expected to be closer to $Y^*_{ij}$ if the degree of node $j$ is larger, so we multiply $\widehat{Y}_{ij}$ by $d_j$ for every $j$ to get the weighted feature vector.
On the other hand, the $i$-th row of $\widehat{Y} D$ is closer to the $i$-th row of $Y^* D$ if the degree of node $i$ is larger, hence we minimize the distances weighted by $d_i$.

With the constraint $\mathrm{Rows}(X) \subseteq \mathrm{Rows}(\widehat{W})$, the optimization problem (2.7) is precisely the weighted $\ell_1$-norm $k$-median (also known as $k$-medoid) problem considered in Charikar et al. [1999]. Computing the exact optimizer of (2.7), denoted by $(\bar{\Psi}, \bar{X})$, is NP-hard. Nevertheless, Charikar et al. [1999] provides a polynomial-time approximation algorithm, which outputs a solution $(\widetilde{\Psi}, \widetilde{X}) \in \mathcal{M}_{n,r} \times \mathbb{R}^{r \times n}$ feasible for (2.7) and provably satisfying
$$\| D (\widetilde{\Psi} \widetilde{X} - \widehat{W}) \|_1 \le \frac{20}{3} \| D (\bar{\Psi} \bar{X} - \widehat{W}) \|_1. \qquad (2.8)$$
As the solution $\widehat{Y}$ to the convex relaxation (2.6) and the approximate solution $\widetilde{\Psi}$ to the $k$-median problem (2.7) can both be computed in polynomial time, our algorithm is computationally tractable. In the next section, we turn to the statistical aspect and show that the clustering induced by $\widehat{Y}$ and $\widetilde{\Psi}$ is close to the true underlying clusters, under mild and interpretable conditions on DCSBM.

3 Theoretical results

In this section, we provide theoretical results characterizing the statistical properties of our algorithm. We show that under mild conditions on DCSBM, the difference between the convex relaxation solution $\widehat{Y}$ and the true partition matrix $Y^*$, and the difference between the approximate $k$-median clustering $\widetilde{\Psi}$ and the true clustering $\Psi^*$, are well bounded. When additional conditions hold, we further show that $\widehat{Y}$ perfectly recovers the true clusters. Our results are non-asymptotic in nature, valid for any scaling of $n$, $r$, $\theta$, and $B$.
3.1 Density gap conditions

In the literature on community detection by convex optimization under the standard SBM, it is often assumed that the minimum within-cluster edge density is greater than the maximum cross-cluster edge density, i.e.,
$$\max_{1 \le a < b \le r} B_{ab} < \min_{1 \le a \le r} B_{aa}. \qquad (3.1)$$
See, for example, Chen et al. [2012]; Oymak and Hassibi [2011]; Ames and Vavasis [2014]; Cai and Li [2015]; Guédon and Vershynin [2015]. This requirement (3.1) can be directly extended to the DCSBM setting, leading to the condition
$$\max_{1 \le a < b \le r} \ \max_{i \in C_a^*, j \in C_b^*} B_{ab} \theta_i \theta_j < \min_{1 \le a \le r} \ \min_{i,j \in C_a^*, i \ne j} B_{aa} \theta_i \theta_j. \qquad (3.2)$$
Under DCSBM, however, this condition would often be overly restrictive, particularly when the degree parameters $\{\theta_i\}$ are imbalanced with some of them being very small. In particular, this condition is highly sensitive to the minimum value $\theta_{\min} := \min_{1 \le i \le n} \theta_i$, which is unnecessary since the community memberships of nodes with larger $\theta_i$ may still be recoverable.

Here we instead consider a version of the density gap condition that is much milder and more appropriate for DCSBM. For each cluster index $1 \le a \le r$, define the quantities
$$G_a := \sum_{i \in C_a^*} \theta_i \quad \text{and} \quad H_a := \sum_{b=1}^r B_{ab} G_b. \qquad (3.3)$$
A simple calculation gives $\mathbb{E}[d_i] = \theta_i H_a - \theta_i^2 B_{aa} \approx \theta_i H_a$. Therefore, the quantity $H_a$ controls the average degree of the nodes in the $a$-th cluster. With this notation, we impose the condition that
$$\max_{1 \le a < b \le r} \frac{B_{ab}}{H_a H_b} < \min_{1 \le a \le r} \frac{B_{aa}}{H_a^2}. \qquad (3.4)$$
We refer to the condition (3.4) as the degree-corrected density gap condition. This condition can be viewed as the "average" version of (3.2), as it depends on the aggregate quantity $H_a$ associated with each cluster $a$ rather than the $\theta_i$'s of individual nodes in the cluster; in particular, the condition (3.4) is robust against small $\theta_{\min}$.
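The quantities in (3.3) and the two sides of condition (3.4) can be computed directly from the model parameters. A sketch (the helper name and toy parameters are ours; the example is balanced, so the condition reduces to $p > q$):

```python
import numpy as np

def degree_corrected_gap(theta, labels, B):
    """Return (LHS, RHS) of the degree-corrected density gap condition (3.4),
    using G_a = sum_{i in C_a} theta_i and H_a = sum_b B_ab * G_b from (3.3)."""
    r = B.shape[0]
    G = np.array([theta[labels == a].sum() for a in range(r)])
    H = B @ G
    cross = max(B[a, b] / (H[a] * H[b]) for a in range(r) for b in range(a + 1, r))
    within = min(B[a, a] / H[a] ** 2 for a in range(r))
    return cross, within

theta = np.ones(6)                                  # SBM special case: theta_i = 1
labels = np.array([0, 0, 0, 1, 1, 1])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
cross, within = degree_corrected_gap(theta, labels, B)
# condition (3.4) holds here: cross < within
```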
This condition plays a key role throughout our theoretical analysis, for both approximate and exact cluster recovery under DCSBM. To gain intuition about the new degree-corrected density gap condition (3.4), consider the following sub-class of DCSBM with symmetric/balanced clusters.

Definition 1. We say that a DCSBM obeys an $F(n, r, p, q, g)$-model if $B_{aa} = p$ for all $a = 1, \ldots, r$, $B_{ab} = q$ for all $1 \le a < b \le r$, and $G_1 = G_2 = \cdots = G_r = g$.

In an $F(n, r, p, q, g)$-model, the true clusters are balanced in terms of the connectivity matrix $B$ and the sum of the degree heterogeneity parameters (rather than the cluster size). Under this model, a straightforward calculation gives $H_a = ((r-1)q + p) g$ for all $a = 1, \ldots, r$. The degree-corrected density gap condition (3.4) then reduces to $p > q$, i.e., the classical density gap condition (3.1).

3.2 Theory of approximate clustering

We now study when the solutions to our convex relaxation (2.6) and weighted $k$-median algorithm (2.7) approximately recover the underlying true clusters. Under DCSBM, nodes with different $\theta_i$'s have varying degrees, and therefore contribute differently to the overall graph and in turn to the clustering quality. Such heterogeneity needs to be taken into account in order to get tight bounds on clustering errors. The following version of the $\ell_1$ norm, corrected by the degree heterogeneity parameters, is the natural notion of an error metric:

Definition 2. For each matrix $Z \in \mathbb{R}^{n \times n}$, its weighted element-wise $\ell_1$ norm is defined as
$$\| Z \|_{1,\theta} := \sum_{1 \le i,j \le n} | \theta_i Z_{ij} \theta_j |.$$

Also recall our definitions of $H_a$ and $G_a$ in equation (3.3). Furthermore, for each $1 \le a \le r$ and $i \in C_a^*$, define the quantity $f_i := \theta_i H_a$, which corresponds approximately to the expected degree of node $i$ and satisfies $\| f \|_1 = \sum_{a,b} B_{ab} G_a G_b$.
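Definition 2 translates into a one-line computation; a short sketch (the helper name is ours):

```python
import numpy as np

def weighted_l1(Z, theta):
    """||Z||_{1,theta} = sum_{i,j} |theta_i * Z_ij * theta_j|  (Definition 2)."""
    theta = np.asarray(theta, dtype=float)
    return float(np.abs(theta[:, None] * Z * theta[None, :]).sum())
```

Note that entries in rows or columns with small $\theta_i$ are down-weighted, which is exactly why bounds in this norm can be insensitive to $\theta_{\min}$.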
With the notation above, our first theorem shows that the convex relaxation solution $\widehat{Y}$ is close to the true partition matrix $Y^*$ in terms of the weighted $\ell_1$ norm.

Theorem 1. Under DCSBM, assume that the degree-corrected density gap condition (3.4) holds. Moreover, suppose that the tuning parameter $\lambda$ in the convex relaxation (2.6) satisfies
$$\max_{1 \le a < b \le r} \frac{B_{ab} + \delta}{H_a H_b} < \lambda < \min_{1 \le a \le r} \frac{B_{aa} - \delta}{H_a^2} \qquad (3.5)$$
for some number $\delta > 0$. Then with probability at least $0.99 - 2 (e/2)^{-2n}$, the solution $\widehat{Y}$ to (2.6) satisfies the bound
$$\| Y^* - \widehat{Y} \|_{1,\theta} \le \frac{C_0}{\delta} \left( 1 + \Big( \min_{1 \le a \le r} \frac{B_{aa}}{H_a^2} \Big) \| f \|_1 \right) \Big( \sqrt{n \| f \|_1} + n \Big), \qquad (3.6)$$
where $C_0 > 0$ is an absolute constant.

We prove this claim in Section 7.1. The bound (3.6) holds with probability close to one. Notably, it is insensitive to $\theta_{\min}$, as should be expected, because the community memberships of nodes with relatively large $\theta_i$ are still recoverable. In contrast, the error bounds of several existing methods, such as that of the SCORE method in [Jin, 2015, eq. (2.15), (2.16)], depend crucially on $\theta_{\min}$.

Under the $F(n, r, p, q, g)$-model, recall that $H_a \equiv (p + (r-1)q) g$ and the density gap condition (3.4) becomes $p > q$. Moreover, the constraint (3.5) on $\delta$ and $\lambda$ becomes $p - q > 2\delta$ and
$$\frac{q + \delta}{(p + (r-1)q)^2 g^2} < \lambda < \frac{p - \delta}{(p + (r-1)q)^2 g^2}. \qquad (3.7)$$
Note that the first inequality above is the same as the standard density gap condition imposed in, for example, Chen et al. [2012]; Chen and Xu [2014]; Cai and Li [2015]. Furthermore, the vector $f$ satisfies $\| f \|_1 = r (p + (r-1)q) g^2 \le r^2 p g^2$. Substituting these expressions into the bound (3.6), we obtain the following corollary for the symmetric DCSBM setting.

Corollary 1. Under the $F(n, r, p, q, g)$-model of DCSBM, if the condition (3.7) holds for the density gap and tuning parameter, then with probability at least $0.99 - 2 (e/2)^{-2n}$, the solution $\widehat{Y}$ to the convex relaxation (2.6) satisfies the bound
$$\| Y^* - \widehat{Y} \|_{1,\theta} \lesssim \frac{1}{\delta} \left( 1 + \frac{r p}{p + (r-1)q} \right) \big( n + r g \sqrt{np} \big) \lesssim \frac{r}{\delta} \big( n + r g \sqrt{np} \big). \qquad (3.8)$$

Note that if $p/q = c$ for an absolute constant $c$, then the bound (3.8) takes the simpler form
$$\| Y^* - \widehat{Y} \|_{1,\theta} \lesssim \frac{n + r g \sqrt{np}}{\delta}.$$
If $\theta_i = 1$ for all nodes $i$, the $F(n, r, p, q, g)$-model reduces to the standard SBM with equal community sizes. If we additionally assume $r = O(1)$, note that $g = n/r$, and let $\delta = \frac{p-q}{4}$, then the error bound (3.8) becomes
$$\| Y^* - \widehat{Y} \|_1 \lesssim \frac{n (1 + \sqrt{np})}{p - q}.$$
This bound matches the error bounds proved in [Guédon and Vershynin, 2015, Theorem 1.3].

The output $\widehat{Y}$ of the convex relaxation need not be a partition matrix corresponding to a clustering; a consequence is that the theoretical results in Guédon and Vershynin [2015] do not provide an explicit guarantee on clustering errors (except for the special case of $r = 2$). We give such a bound below, based on the explicit clustering extracted from $\widehat{Y}$ using the weighted $\ell_1$-norm $k$-median algorithm (2.7). Recall that $\widetilde{\Psi}$ is the membership matrix in the approximate $k$-median solution given in (2.8), and let $\Psi^*$ be the membership matrix corresponding to the true clusters. A membership matrix is unique only up to permutation of its columns (i.e., relabeling of the clusters), so counting the misclassified nodes in $\widetilde{\Psi}$ requires an appropriate minimization over such permutations. The following definition is useful to this end. For a matrix $M$, let $M_{i\bullet}$ denote its $i$-th row vector.

Definition 3. Let $S_r$ denote the set of all $r \times r$ permutation matrices. The set of misclassified nodes with respect to a permutation matrix $\Pi \in S_r$ is defined as
$$E(\Pi) := \big\{ i \in [n] : (\widetilde{\Psi} \Pi)_{i\bullet} \ne \Psi^*_{i\bullet} \big\}.$$
With this definition, we have the following theorem, which quantifies the misclassification rate of the approximate $k$-median solution $\widetilde{\Psi}$.

Theorem 2. Under the $F(n, r, p, q, g)$-model, assume that the parameters $\delta$ and $\lambda$ satisfy (3.7). Then with probability at least $0.99 - 2 (e/2)^{-2n}$, the approximate $k$-median solution $\widetilde{\Psi}$ satisfies the bound
$$\min_{\Pi \in S_r} \Big\{ \sum_{i \in E(\Pi)} \theta_i \Big\} \le \frac{C_0 r}{\delta} \Big( \frac{n}{g} + r \sqrt{np} \Big) \qquad (3.9)$$
for some absolute constant $C_0$.

We prove this claim in Section 7.2. The extension to the general DCSBM setting is straightforward. If we let $\Pi_\theta$ be a minimizer of the left-hand side of (3.9) and $E_\theta := E(\Pi_\theta)$, then the quantity $\sum_{i \in E_\theta} \theta_i$ is the number of misclassified nodes weighted by their degree heterogeneity parameters $\{\theta_i\}$. Theorem 2 controls this weighted quantity, which is natural as nodes with smaller $\theta_i$ are harder to cluster and thus less controlled in (3.9).

Notably, the bound given in (3.9) is applicable even in the sparse graph regime with bounded average degrees, i.e., $p, q = O(1/n)$. For example, suppose that $p = a/n$ and $q = b/n$ for two fixed constants $a > b$, $r = O(1)$ and $g \asymp n$; if $(a - b)/\sqrt{a}$ is sufficiently large, then, with the choice $\delta \asymp (a - b)/n$, the right-hand side of (3.9) can be made an arbitrarily small constant times $n$. In comparison, conventional spectral methods are known to be inconsistent in this sparse regime [Krzakala et al., 2013]. While this difficulty is alleviated under SBM by the use of regularization or non-backtracking matrices (e.g., Le and Vershynin [2015]; Bordenave et al. [2015]), rigorous justification and numerical validation under DCSBM have not been well explored.

It is sometimes desirable to have a direct (unweighted) bound on the number of misclassified nodes. Suppose that $\Pi_0 \in S_r$ is a permutation matrix that minimizes $|E(\Pi)|$, and let $E_0 := E(\Pi_0)$.
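For small $r$, the minimization over permutations in Definition 3 can be carried out by brute force over all $r!$ relabelings; a sketch (the helper name is ours, and cluster assignments are encoded as label vectors rather than full membership matrices, which gives the same count):

```python
import numpy as np
from itertools import permutations

def num_misclassified(labels_hat, labels_true, r):
    """min over Pi in S_r of |E(Pi)|: the number of nodes whose estimated
    cluster disagrees with the truth, up to relabeling of the clusters."""
    labels_true = np.asarray(labels_true)
    best = len(labels_true)
    for perm in permutations(range(r)):
        relabeled = np.array([perm[a] for a in labels_hat])
        best = min(best, int((relabeled != labels_true).sum()))
    return best
```

For example, a clustering that swaps the two label names entirely is counted as zero errors, while a single misplaced node is counted once.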
A b ound on the un weigh ted misclassification error | E 0 | can b e easily deriv ed from the general w eighted b ound ( 3.9 ). F or example, com bining ( 3.9 ) with the AM-HM inequalit y ř i P E θ θ i ě | E θ | 2 ř i P E θ 1 { θ i , we obtain that | E 0 | ď | E θ | À g f f e 1 δ g r p n ` r g ? np q n ÿ i “ 1 1 θ i . (3.10) Another bound on | E 0 | , whic h is applicable even when some θ i ’s are zero, can b e derived as follo ws: we pic k any n umber τ ą 0 and use the inequality ( 3.9 ) to get | E 0 | ď | E θ | ď 1 δ g τ r p n ` r g ? np q ` ˇ ˇ t i : θ i ă τ u ˇ ˇ , @ τ ą 0 . (3.11) This simple b ound is already quite useful, for example in standard SBM with θ i ” 1, p ě 1 n and r equal-sized clusters. In this case, setting τ “ 0 . 9 in ( 3.11 ) yields that the num b er of misclassified no des satisfies | E 0 | À r 2 ? np p ´ q . When r “ 2, this b ound is consistent with those in Gu ´ edon and V ersh ynin [ 2015 ], but our result is more general as it applies to more clusters r ě 3. 3.3 Theory of p erfect clustering In this section, w e show that under an additional condition on the minimum degree het- erogeneit y parameter θ min “ min 1 ď i ď n θ j , the solution p Y to the conv ex relaxation p erfectly reco vers the true partition matrix Y ˚ . In this case the true clusters can b e extracted easily from p Y without using the k -median pro cedure. F or the purpose of studying p erfect clustering, w e consider a setting of DCSBM with B aa “ p for all a “ 1 , . . . , r , and B ab “ q for all 1 ď a ă b ď r . Under this setup, the degree-corrected density gap condition ( 3.5 ) becomes max 1 ď a ă b ď r q ` δ H a H b ă λ ă min 1 ď a ď r p ´ δ H 2 a . (3.12) Recalling the definition of G a in ( 3.3 ), we further define G min : “ min 1 ď a ď r G a . The follo wing theorem characterizes when p erfect clustering is guaran teed. Theorem 3. 
Supp ose that the de gr e e-c orr e cte d density gap c ondition ( 3.12 ) is satisfie d for some numb er δ ą 0 and tuning p ar ameter λ , and that δ ą C 0 ˜ ? q n G min ` c p log n G min θ min ¸ (3.13) for some sufficiently lar ge absolute c onstant C 0 . Then with pr ob ability at le ast 1 ´ 10 n ´ 1 , the solution p Y to the c onvex r elaxation ( 2.6 ) is unique and e quals Y ˚ . 12 The condition ( 3.13 ) dep ends on the minimum v alues G min and θ min . Suc h dep endence is necessary for p erfect clustering, as clusters and no des with ov erly small G a and θ i will ha ve to o few edges and are not recov erable. In comparison, the approximate recov ery results in Theorem 1 are not sensitiv e to either θ min or G min , as should b e exp ected. V alid for the more general DCSBM , Theorem 3 significan tly generalizes the existing theory for standard SBM on p erfect clustering by SDP in the literature (see, e.g., Chen et al. [ 2012 ]; Chen and Xu [ 2014 ]; Cai and Li [ 2015 ]). T aking n Ñ 8 , Theorem 3 guarantees that the probability of p erfect clustering con v erges to one, thereb y implying the conv ex relaxation approac h is str ongly c onsistent in the sense of Zhao et al. [ 2012 ]. In the special case of standard SBM with θ i “ 1 , @ i P r n s , the density gap low er b ound ( 3.13 ) simplifies to δ Á ? q n ` min ` c p log n ` min , where ` min : “ min 1 ď a ď r ` a is the minim um communit y size and ` a : “ | C ˚ a | is the size of comm unity a . This density gap low er b ound is consisten t with b est existing results giv en in Chen et al. 
[2012]; Chen and Xu [2014]; Cai and Li [2015] — as we discussed earlier, our density condition in (3.7) under the $F(n, r, p, q, g)$ model (which encompasses the SBM with equal-sized clusters) is the same as in these previous papers, with the minor difference that in these papers the term $\lambda d_i d_j$ in the convex relaxation (2.6) is replaced by a tuning parameter $\lambda'$ assumed to satisfy the condition $q + \delta < \lambda' < p - \delta$.

4 Numerical results

In this section, we provide numerical results on both synthetic and real datasets, which corroborate our theoretical findings. Our convexified modularity maximization approach is found to empirically outperform state-of-the-art methods in several settings.

The convexified modularity maximization problem (2.6) is a semidefinite program (SDP), and can be solved efficiently by a range of general and specialized algorithms. Here we use the alternating direction method of multipliers (ADMM) suggested in Cai and Li [2015]. To specify the ADMM solver, we need some notation. For any two $n \times n$ matrices $X$ and $Y$, let $\max\{X, Y\}$ be the matrix whose $(i,j)$-th entry is given by $\max\{X_{ij}, Y_{ij}\}$; the matrix $\min\{X, Y\}$ is defined similarly. For a symmetric matrix $X$ with an eigenvalue decomposition $X = U \Sigma U^\top$, let $(X)_+ := U \max\{\Sigma, 0\} U^\top$, and let $(X)_I$ be the matrix obtained by setting all the diagonal entries of $X$ to 1. Recall that $J$ denotes the $n \times n$ all-one matrix. The ADMM algorithm for solving (2.6), with the dual update step size equal to 1, is given as Algorithm 1.

Our choice of the tuning parameter $\lambda = \langle A, J \rangle^{-1}$ is motivated by the following simple

Algorithm 1 ADMM algorithm for solving the SDP (2.6)
1: Input: $A$ and $\lambda = \langle A, J \rangle^{-1}$.
2: Initialization: $Z^{(0)} = \Lambda^{(0)} = 0$, $k = 0$ and MaxIter $= 100$.
3: while $k <$ MaxIter
   1. $Y^{(k+1)} = \left( Z^{(k)} - \Lambda^{(k)} + A - \lambda dd^\top \right)_+$
   2.
$Z^{(k+1)} = \left( \min\left\{ \max\left\{ Y^{(k+1)} + \Lambda^{(k)},\, 0 \right\},\, J \right\} \right)_I$
   3. $\Lambda^{(k+1)} = \Lambda^{(k)} + Y^{(k+1)} - Z^{(k+1)}$
   4. $k = k + 1$
   end while
4: Output the final solution $Y^{(k)}$.

observation. By standard concentration inequalities, the number $\langle A, J \rangle$ is close to its expectation $\sum_i \mathbb{E}[d_i] \approx \|f\|_1$. Under the $F(n, r, p, q, g)$ model, we have $\|f\|_1 = r(p + (r-1)q)g^2$ and $H_a = ((r-1)q + p)g$ for all $a \in [r]$. In this case and with the above choice of $\lambda$, the density gap assumption (3.12) simplifies to
$$\frac{q + \delta}{(r-1)q + p} < \frac{1}{r} < \frac{p - \delta}{(r-1)q + p},$$
which holds with $\delta = (p - q)/r$.

After obtaining the solution $\widehat{Y}$ to the convex relaxation, we extract an explicit clustering using the weighted $k$-median procedure described in (2.7) with $k = r$, where the number of major clusters $r$ is assumed known. Our complete community detection algorithm, Convexified Modularity Maximization (CMM), is summarized in Algorithm 2. In our experiments, the weighted $k$-median problem is solved by an iterative greedy procedure that optimizes alternately over the variables $\Psi$ and $X$ in (2.7), with 100 random initializations.

Algorithm 2 Convexified Modularity Maximization (CMM)
1: Input: $A$, $\lambda = \langle A, J \rangle^{-1}$, and $r \ge 2$.
2: Solve the convex relaxation (2.6) for $\widehat{Y}$ using Algorithm 1.
3: Solve the weighted $k$-median problem (2.7) with $\widehat{W} = \widehat{Y} D$ and $k = r$, and output the resulting $r$-partition of $[n]$.

4.1 Synthetic data experiments

We provide experiment results on synthetic data generated from DCSBM. For each node $i \in [n]$, the degree heterogeneity parameter $\theta_i$ is sampled independently from a Pareto$(\alpha, \beta)$ distribution with density function $f(x \mid \alpha, \beta) = \frac{\alpha \beta^\alpha}{x^{\alpha+1}} \mathbf{1}\{x \ge \beta\}$, where $\alpha$ and $\beta$ are called
W e consider different v alues of the shape parameter, and choose the scale parameter accordingly so that the exp ectation of each θ i is fixed at 1. Note that the v ariability of the θ i ’s decreases with the shape parameter α . Giv en the degree heterogeneity parameters t θ i u and tw o num b ers 0 ď q ă p ď 1, a graph is generated from DCSBM , with the edge probability b et ween no des i P C ˚ a and j P C ˚ b b eing min p 1 , θ i θ j B ab q and B aa “ p, B ab “ q , @ 1 ď a ‰ b ď r . W e applied our CMM approac h in Algorithm 2 to the resulting graph, and recorded the misclassification rate | E p Π 0 q|{ n (cf. the discussion after Theorem 2 ). F or comparison, w e also applied the SCORE algorithm in Jin [ 2015 ] and the OCCAM algorithm in Zhang et al. [ 2014 ], which are rep orted to hav e state-of-the-art empirical p erformance on DCSBM in the existing literature. The SCORE algorithm p erforms k -means on the top-2 to top- r eigen vectors of the adjacency matrix normalized element-wise by the top-1 eigenv ector. OC- CAM is a type of regularized sp ectral k -median algorithm; it can pro duce non-o verlapping clusters and its regularization parameter is given explicitly in Zhang et al. [ 2014 ]. F or all k -means/medians pro cedures used in the exp erimen ts, we to ok k “ r and used 100 random initializations. Fig. 1 sho ws the misclassification rates of CMM (solid lines), SCORE (dash lines) and OCCAM (individual mark ers) for v arious settings of n , p , q , cluster size and the shap e parameter for θ . W e see that the misclassification rate of CMM grows as the degree param- eters t θ i u becomes more heterogeneous (smaller v alues of the shap e parameter), and as the graph b ecomes sparser, which is consisten t with the prediction of Theorem 2 . Moreov er, our approac h has consisten tly lo wer misclassification rates than SCORE and OCCAM, with SCORE and OCCAM exhibiting similar p erformance. 
4.2 Political blog network dataset

We next test the empirical performance of CMM (Algorithm 2), SCORE and OCCAM on the US political blog network dataset from Adamic and Glance [2005]. This dataset consists of 19090 hyperlinks (directed edges) between 1490 political blogs collected in the year 2005. The political leaning (liberal versus conservative) of each blog has been labeled manually based on blog directories, incoming and outgoing links, and posts around the time of the 2004 presidential election. We treat these labels as the true memberships of $r = 2$ communities. We ignore the edge direction, and focus on the largest connected component with $n = 1222$ nodes and 16,714 edges, represented by the adjacency matrix $A$. This graph has high degree variation: the maximum degree is 351 while the mean degree is around 27.

CMM, SCORE and OCCAM misclassify 62, 58 and 65 nodes, respectively, out of 1222 nodes on this dataset. The SCORE method has the best known error rate on the political blogs dataset in the literature [Jin, 2015], and we see that our CMM approach is comparable to the state of the art.

Figure 1: Clustering performance on synthetic datasets versus the variability of $\theta$. Solid lines: our CMM algorithm. Dashed lines: the SCORE algorithm. Individual markers: the regularized spectral algorithm (OCCAM). Panel (a): 400 nodes, 2 clusters of size 200. Panel (b): 600 nodes, 3 clusters of size 200. Panel (c): 800 nodes, 4 clusters of size 200. Panel (d): 900 nodes, 2 clusters of size 450. Within each panel, the curves correspond to $p \in \{0.05, 0.10, 0.15, 0.20\}$ in panels (a) and (d), $p \in \{0.15, 0.25, 0.35, 0.45\}$ in panel (b), and $p \in \{0.10, 0.20, 0.30, 0.40\}$ in panel (c). Each point represents the average of 20 independent runs. In all experiments we set $q = 0.3p$.

Panel (a) in Fig. 2 shows the adjacency matrix $A$ with rows and columns sorted according to the true community labels. The output of the ADMM Algorithm 1 for solving the convex relaxation (2.6) is shown in Fig. 2(b). The partition matrix corresponding to the output of the weighted $k$-median step in Algorithm 2 is shown in Fig. 2(c).

Figure 2: Panel (a): The adjacency matrix of the largest connected component of the political blog network with 1222 nodes. The rows and columns are sorted according to the true community labels; the first 586 rows/columns correspond to the liberal community, and the next 636 to the conservative community. Panel (b): The output matrix $\widehat{Y}$ of the convex relaxation (2.6) solved by ADMM (Algorithm 1), with the entries truncated to the interval $[0, 1]$. Panel (c): The partition matrix corresponding to the output of CMM (Algorithm 2), obtained from the weighted $k$-median algorithm. Matrix entry values are shown in gray scale, with black corresponding to 1 and white to 0.

4.3 Facebook dataset

In this section, we consider the Facebook network dataset from Traud et al. [2011, 2012], and compare the empirical performance of our CMM approach with the SCORE and OCCAM methods. The Facebook network dataset consists of 100 US universities and all the "friendship" links between the users within each university, recorded on one particular day in September 2005. The dataset also contains several node attributes such as the gender, dorm, graduation year and academic major of each user.
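The real-data experiments operate on the largest connected component of each symmetrized network and report degree statistics such as those above. A minimal sketch of this preprocessing (our own helper code, assuming a dense 0/1 adjacency matrix; not part of the CMM algorithm itself):

```python
import numpy as np
from collections import deque

def largest_connected_component(A):
    """Restrict a symmetric 0/1 adjacency matrix to its largest connected
    component, found by breadth-first search; returns the submatrix and
    the (sorted) indices of the selected nodes."""
    n = A.shape[0]
    seen = np.zeros(n, dtype=bool)
    best = []
    for s in range(n):
        if seen[s]:
            continue
        comp, queue = [s], deque([s])
        seen[s] = True
        while queue:
            u = queue.popleft()
            for v in np.flatnonzero(A[u]):
                if not seen[v]:
                    seen[v] = True
                    comp.append(v)
                    queue.append(v)
        if len(comp) > len(best):
            best = comp
    idx = np.sort(np.array(best))
    return A[np.ix_(idx, idx)], idx

def degree_stats(A):
    """Maximum and mean degree of the undirected graph with adjacency A."""
    d = A.sum(axis=1)
    return int(d.max()), float(d.mean())
```

On the political blog component, for instance, `degree_stats` would report the maximum and mean degrees quoted above (351 and roughly 27).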
Here we report results on the friendship networks of two universities: Simmons College and Caltech.

4.3.1 Simmons College network

The Simmons College network contains 1518 nodes and 32988 undirected edges. The subgraph induced by nodes with graduation year between 2006 and 2009 has a largest connected component with 1137 nodes and 24257 undirected edges, which we shall focus on. It is observed in Traud et al. [2011, 2012] that the community structure of the Simmons College network exhibits a strong correlation with the graduation year — students in the same year are more likely to be friends. Panel (a) of Fig. 3 shows this largest component with nodes colored according to their graduation year.

We applied the CMM (Algorithm 2), SCORE and OCCAM methods to partition the largest component into $r = 4$ clusters. In Panels (b)-(d) of Fig. 3 the clustering results of these three methods are shown as the node colors. In Fig. 4 we also provide the confusion matrices of the clustering results against the graduation years; the $(i, j)$-th entry of a confusion matrix represents the number of nodes that are from graduation year $i + 2005$ but assigned to cluster $j$ by the algorithm. We see that our CMM approach produced a partition more correlated with the actual graduation years. In fact, if we treat the graduation years as the ground truth cluster labels, then CMM misclassified 12.04% of the nodes, whereas SCORE and OCCAM have higher misclassification rates of 23.57% and 22.43%, respectively. A closer investigation of Fig. 3 and Fig. 4 shows that CMM was better at distinguishing between the nodes of years 2006 and 2007.

4.3.2 Caltech network

In this section, we provide experiment results on the Caltech network. This network has 769 nodes and 16656 undirected edges.
We consider the subgraph induced by nodes with known dorm attributes, and focus on its largest connected component, which consists of 590 nodes and 12822 edges. The community structure is highly correlated with which of the 8 dorms a user is from, as observed in Traud et al. [2011, 2012]. We applied CMM, SCORE and OCCAM to partition this largest component into $r = 8$ clusters. With the dorms as the ground truth cluster labels, CMM misclassified 21.02% of the nodes, whereas SCORE and OCCAM had higher misclassification rates of 31.02% and 32.03%, respectively. The confusion matrices of these methods are shown in Fig. 5. We see that dorm 3 was difficult to recover and largely missed by all three methods, but our CMM algorithm better identified the other dorms.

Figure 3: The largest component of the Simmons College network. Panel (a): Each node is colored according to its graduation year, with 2006 in green, 2007 in light blue, 2008 in purple and 2009 in red. Panels (b)-(d): Each node is colored according to the clustering result of (b) CMM, (c) SCORE and (d) OCCAM. (These plots are generated using the Gephi package [Bastian et al., 2009] with the ForceAtlas2 layout algorithm [Jacomy et al., 2014].)

Clusters by CMM (rows: graduation years; columns: clusters 1-4):
             1     2     3     4
  2006     185    50     9     0
  2007      36   220     9     1
  2008       4    11   319     0
  2009       0     3    14   276

Clusters by SCORE:
             1     2     3     4
  2006     157   118    28     0
  2007      65   145    10     1
  2008       2    13   291     0
  2009       1     8    22   276

Clusters by OCCAM:
             1     2     3     4
  2006     161   118    34     2
  2007      62   148     7     1
  2008       2    16   299     0
  2009       0     2    11   274

Figure 4: The confusion matrices of CMM, SCORE and OCCAM applied to the largest component of the Simmons College network.
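The misclassification rates quoted in this section can be computed from such confusion matrices by maximizing the number of correctly classified nodes over relabelings of the estimated clusters. A minimal sketch (our own helper functions, not part of the CMM algorithm; brute force over permutations is adequate for the small $r$ used here):

```python
import numpy as np
from itertools import permutations

def confusion_matrix(true_labels, clusters, r):
    """C[i, j] = number of nodes with true label i placed in cluster j."""
    C = np.zeros((r, r), dtype=int)
    for t, c in zip(true_labels, clusters):
        C[t, c] += 1
    return C

def misclassification_rate(true_labels, clusters, r):
    """Fraction of misclassified nodes, minimized over all relabelings of
    the estimated clusters (brute force over the r! permutations)."""
    C = confusion_matrix(true_labels, clusters, r)
    best = max(sum(C[i, perm[i]] for i in range(r))
               for perm in permutations(range(r)))
    return 1.0 - best / len(true_labels)
```

For $r = 8$ (the Caltech experiment) the brute force examines $8! = 40320$ permutations, which is instantaneous; for larger $r$ one would switch to a Hungarian-algorithm assignment solver.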
Clusters by CMM (rows: dorms 1-8; columns: clusters 1-8):
          1    2    3    4    5    6    7    8
  1      26    0    5    3    2    5    0    2
  2       3   68    4    1    0    0    1    1
  3       2    0   11    2    2    9    0    1
  4       1    0    0   60    0    1    1    0
  5       2    0    2    2   91    1    2    2
  6       2    0    1    5    0   69    3    1
  7       1    0   38    0    1    0   60    2
  8       6    1    1    3    0    2    0   81

Clusters by SCORE:
          1    2    3    4    5    6    7    8
  1      27    0    6    5    6   13    4    3
  2       3   61    9    1    0    1    1    1
  3       2    8   14   10   12   14    7   20
  4       1    0    2   56    0    1    1    1
  5       0    0   14    1   78    1    2    0
  6       2    0    0    2    0   55    1    0
  7       1    0   17    0    0    0   51    0
  8       7    0    0    1    0    2    0   65

Clusters by OCCAM:
          1    2    3    4    5    6    7    8
  1      35    5   30   14   10   22   12   17
  2       3   61    3    0    0    0    0    0
  3       2    3    8    7   10    9    2    9
  4       1    0    0   54    0    1    1    0
  5       0    0    1    0   76    1    0    0
  6       0    0    0    1    0   52    0    0
  7       1    0   20    0    0    0   52    1
  8       1    0    0    0    0    2    0   63

Figure 5: The confusion matrices of CMM, SCORE and OCCAM applied to the largest component of the Caltech network against the dorm assignments of the users.

5 Related work

In this section, we discuss prior results that are related to our work. Existing community detection methods for DCSBM include model-based methods and spectral methods. In model-based methods such as profile likelihood and modularity maximization [Newman, 2006], one fits the model parameters to the observed network based on the likelihood functions or modularity functions determined by the statistical structure of DCSBM. In Karrer and Newman [2011], the maximum likelihood estimator is used to infer the unknown model parameters $\theta$ and $B$. These estimates are then plugged into the log likelihood function, which leads to a quality function for community partitions. An estimate of the community structure is obtained by maximizing this quality function using a greedy heuristic algorithm. No provable theoretical guarantee is known for this greedy algorithm, and one usually needs to run the algorithm with many random initial points to achieve good performance. The work in Zhao et al. [2012] discusses profile likelihood methods for DCSBM and the closely related modularity maximization approach.
Under the assumption that the number of clusters is fixed, strong consistency is proved when the average degree is $\Omega(\log n)$, and weak consistency when it is $\Omega(1)$. However, directly solving the maximization problems is computationally infeasible, as it involves searching over all possible partitions. Numerically, these optimization problems are solved heuristically using Tabu search and spectral decomposition, without theoretical guarantees. The algorithm proposed in Amini et al. [2013] involves finding an initial clustering using spectral methods, then iteratively updating the labels by maximizing a conditional pseudo-likelihood, which is done using the EM algorithm in each iteration. After simplifying the iterations into one E-step, they establish guaranteed consistency when there are only two clusters. The work in Le et al. [2015+] proposes to approximate the profile likelihood functions, modularity functions or other criteria using surrogates defined on a 2-dimensional subspace constructed by spectral dimension reduction. Thanks to the convexity of the surrogate functions, the search complexity is polynomial. The method and theory are, however, only applicable when there are two communities.

Spectral methods for community detection have attracted interest from diverse communities including computer science, applied mathematics, statistics, and machine learning; see, e.g., Rohe et al. [2011] and the references therein for results on spectral clustering under the SBM. The seminal work of Dasgupta et al. [2004] on DCSBM (proposed under the name of the Extended Planted Partition model) considered a spectral method similar to that in McSherry [2001]. One major drawback is that knowledge of $\theta$ is required in both the theory and the algorithm.
In the algorithm proposed in Coja-Oghlan and Lanka [2009], the adjacency matrix is first normalized by the node degrees and then thresholded entrywise, after which spectral clustering is applied. Strong consistency is proved for the setting with a fixed number of clusters. In Chaudhuri et al. [2012], a modified spectral clustering method was proposed using a regularized random-walk graph Laplacian, and strong consistency is established under the assumption that the average degree grows at least as $\Omega(\sqrt{n})$. A different spectral clustering approach based on regularized graph Laplacians is considered in Qin and Rohe [2013]. Their theoretical bound on the misclassification rate depends on the eigenvectors of the graph Laplacian, which is still a random object. Spectral clustering based on unmodified adjacency matrices and on degree-normalized adjacency matrices is analyzed in Lei and Rinaldo [2015] and Gulikers et al. [2015], which prove rigorous error rate results but do not provide numerical validation on either synthetic or real data.

It is observed in Jin [2015] that spectral clustering based directly on the adjacency matrix (or its normalized versions) often results in inconsistent clustering on real data, such as the political blogs dataset of Adamic and Glance [2005], a popular benchmark for testing community detection approaches. To address this issue, a new spectral clustering algorithm called SCORE is proposed in Jin [2015]. Specifically, the second to $r$-th leading eigenvectors are divided elementwise by the first leading eigenvector, and spectral clustering is applied to the resulting ratio matrix. In their theoretical results, an implicit assumption is that the number of communities $r$ is bounded by a constant, as implied by condition (2.14) in Jin [2015]. In comparison, our convexified modularity maximization approach works for growing $r$, both theoretically and empirically.
As illustrated in Section 4, our method exhibited better performance on both the synthetic and real datasets considered there, especially when $r \ge 3$.

6 Discussion and future work

In this paper, we studied community detection in networks with possibly highly skewed degree distributions. We introduced a new computationally efficient methodology based on convexification of the modularity maximization formulation and a novel doubly-weighted $\ell_1$-norm $k$-median clustering procedure. Our complete algorithm provably runs in polynomial time and is computationally feasible. Non-asymptotic theoretical performance guarantees were established under DCSBM for both approximate clustering and perfect clustering, which are consistent with the best known rate results in the SBM literature. The proposed method also enjoys good empirical performance, as was demonstrated on both synthetic data and real-world networks. On these datasets our method was observed to have performance comparable to, and sometimes better than, the state-of-the-art spectral clustering methods, particularly when there are more than two communities.

Our work involves several algorithmic and analytical novelties. We provide a tractable solution to the classical modularity maximization formulation via convexification, achieving simultaneously strong theoretical guarantees and competitive empirical performance. The theoretical results are based on an aggregate and degree-corrected version of the density gap condition, which is robust to a small number of outlier nodes and is thus an appropriate condition for approximate clustering. In our algorithms and error bounds we made use of techniques from Guédon and Vershynin [2015]; Jin [2015]; Lei and Rinaldo [2015], but departed from these existing works in several important aspects.
In particular, we proposed a novel $k$-median formulation using doubly-weighted $\ell_1$ norms, which allows for a tight analysis that produces strong non-asymptotic guarantees on approximate recovery. Furthermore, we developed a non-asymptotic theory of perfect clustering, which is based on a divide-and-conquer primal-dual analysis and makes crucial use of certain weighted $\ell_1$ metrics that exploit the structures of DCSBM.

A future direction, important in both theory and practice, is to consider networks with overlapping communities, where a node may belong to multiple communities simultaneously. To accommodate such a setting, several extensions of SBM have been introduced in the literature. For example, Zhang et al. [2014] proposed a spectral algorithm based on the Overlapping Continuous Community Assignment Model (OCCAM). As our CMM method is shown to be an attractive alternative to spectral methods for DCSBM, it will be interesting to extend CMM to allow for both overlapping communities and heterogeneous degrees. Another direction of interest is to develop a general theory of optimal misclassification rates for DCSBM along the lines of Gao et al. [2015]; Zhang and Zhou [2015].

7 Proofs

In this section, we prove the theoretical results in Theorems 1-3. Introducing the convenient shorthand $\Theta := \theta \theta^\top \in \mathbb{R}^{n \times n}_+$, we can write the weighted $\ell_1$ norm of a matrix $Z$ in Definition 2 as
$$\|Z\|_{1,\theta} = \sum_{1 \le i, j \le n} |\theta_i Z_{ij} \theta_j| = \|\Theta \circ Z\|_1,$$
where $\circ$ denotes the Hadamard (element-wise) product. Several standard matrix norms will also be used: the spectral norm $\|Z\|$ (the largest singular value of $Z$); the nuclear norm $\|Z\|_*$ (the sum of the singular values); the $\ell_1$ norm $\|Z\|_1 = \sum_{i,j} |Z_{ij}|$; the $\ell_\infty$ norm $\|Z\|_\infty = \max_{i,j} |Z_{ij}|$; and the $\ell_\infty \to \ell_1$ operator norm $\|Z\|_{\infty \to 1} = \sup_{\|v\|_\infty \le 1} \|Z v\|_1$.
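To make these norms concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper) of the weighted $\ell_1$ norm and of the $\ell_\infty \to \ell_1$ norm. The sign-vector characterization used below, $\|Z\|_{\infty\to 1} = \max_{x, y \in \{\pm 1\}^n} x^\top Z y$, is the same one used in the proof of Theorem 1; the brute-force enumeration is only feasible for small $n$.

```python
import numpy as np
from itertools import product

def weighted_l1_norm(Z, theta):
    """||Z||_{1,theta} = sum_{i,j} |theta_i Z_ij theta_j| = ||Theta o Z||_1,
    where Theta = theta theta^T and 'o' is the Hadamard product."""
    return np.abs(np.outer(theta, theta) * Z).sum()

def inf_to_one_norm(Z):
    """||Z||_{inf->1} = sup_{||v||_inf <= 1} ||Z v||_1, attained at a sign
    vector y in {-1, +1}^n; the optimal x simply matches the signs of Z y,
    giving x^T Z y = ||Z y||_1. Brute force over the 2^n sign vectors."""
    n = Z.shape[1]
    best = 0.0
    for signs in product([-1.0, 1.0], repeat=n):
        y = np.array(signs)
        best = max(best, np.abs(Z @ y).sum())
    return best
```

The supremum is attained at a vertex of the cube $\{\pm 1\}^n$ because $\|Zv\|_1$ is convex in $v$, which is what makes the sign-vector form (and hence the union bound over sign vectors in the proof of Theorem 1) possible.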
For any vector $v \in \mathbb{R}^n$, we denote by $\mathrm{diag}(v)$ the $n \times n$ diagonal matrix whose diagonal entries are the entries of $v$. For any matrix $M \in \mathbb{R}^{n \times n}$, let $\mathrm{diag}(M)$ denote the $n \times n$ diagonal matrix with diagonal entries given by the corresponding diagonal entries of $M$. We denote absolute constants by $C, C_0, c_1$, etc., whose values may change from line to line. Recall that $d$ and $f$ are the vectors of node degrees and their expectations, respectively, where $f_i = \theta_i H_a$ for each $i \in C^*_a$, $a = 1, \ldots, r$.

A key step in our proofs is to appropriately control the deviation of the degrees from their expectations. This is done in the following lemma.

Lemma 1. Under DCSBM, with probability at least 0.99, we have
$$\max\left( \|ff^\top - dd^\top\|_{\infty \to 1},\ \|ff^\top - dd^\top\|_1 \right) \le C \left( n + \sqrt{n \|f\|_1} \right) \|f\|_1$$
for some absolute constant $C > 0$.

Proof. Since $ff^\top - dd^\top = f(f - d)^\top + (f - d)d^\top$, we have
$$\|ff^\top - dd^\top\|_{\infty \to 1} \le \|f(f-d)^\top\|_{\infty \to 1} + \|(f-d)d^\top\|_{\infty \to 1} = \|f\|_1 \|f-d\|_1 + \|f-d\|_1 \|d\|_1 = \|f-d\|_1 \left( \|f\|_1 + \|d\|_1 \right),$$
and
$$\|ff^\top - dd^\top\|_1 \le \|f(f-d)^\top\|_1 + \|(f-d)d^\top\|_1 = \|f\|_1 \|f-d\|_1 + \|f-d\|_1 \|d\|_1 = \|f-d\|_1 \left( \|f\|_1 + \|d\|_1 \right).$$
We bound $\|f - d\|_1$ and $\|d\|_1$ separately. For each $i \in C^*_a$, there holds $\mathbb{E}\, d_i = f_i - \theta_i^2 B_{aa}$ and
$$\mathrm{Var}(d_i) = \sum_{j=1}^n \mathrm{Var}(A_{ij}) \le \sum_{j=1}^n \mathbb{E}(A_{ij}) = \mathbb{E}(d_i) = f_i - \theta_i^2 B_{aa} \le f_i.$$
Therefore, we have
$$\mathbb{E}\left| f_i - \theta_i^2 B_{aa} - d_i \right| \le \sqrt{\mathbb{E}\left| f_i - \theta_i^2 B_{aa} - d_i \right|^2} = \sqrt{\mathrm{Var}(d_i)} \le \sqrt{f_i},$$
which implies that $\mathbb{E}|f_i - d_i| \le 1 + \sqrt{f_i}$ and
$$\mathbb{E} \sum_{i=1}^n |f_i - d_i| \le n + \sum_{i=1}^n \sqrt{f_i} \le n + \sqrt{n \|f\|_1}.$$
By Markov's inequality, with probability 0.995, there holds $\|f - d\|_1 \le C\left( n + \sqrt{n \|f\|_1} \right)$ for an absolute constant $C$. To bound $\|d\|_1$, we observe that since $\mathbb{E}\, d_i \le f_i$ and $d_i \ge 0$, there holds $\mathbb{E}\|d\|_1 \le \|f\|_1$. By Markov's inequality, with probability at least 0.
995, there holds $\|d\|_1 \le C \|f\|_1$ for some absolute constant. Combining these bounds on $\|f - d\|_1$ and $\|d\|_1$ proves Lemma 1.

7.1 Proof of Theorem 1

Recall that the vector $f \in \mathbb{R}^n$ is defined by letting $f_i = \theta_i H_a$ for $i \in C^*_a$, where $H_a$ is defined in (3.3). It follows from the optimality of $\widehat{Y}$ that
$$0 \le \langle \widehat{Y} - Y^*, A - \lambda dd^\top \rangle = \underbrace{\langle \widehat{Y} - Y^*, \mathbb{E}A - \lambda ff^\top \rangle}_{S_1} + \underbrace{\lambda \langle \widehat{Y} - Y^*, ff^\top - dd^\top \rangle}_{S_2} + \underbrace{\langle \widehat{Y} - Y^*, A - \mathbb{E}A \rangle}_{S_3}.$$
We control the terms $S_1$, $S_2$ and $S_3$ separately below.

Upper bound for $S_1$. For each pair $i, j \in C^*_a$ with $i \ne j$, we have $\widehat{Y}_{ij} - Y^*_{ij} \le 0$, $\mathbb{E}(A_{ij}) = \theta_i \theta_j B_{aa}$, and $f_i f_j = \theta_i \theta_j H_a^2$. Hence the condition (3.5) implies that $\mathbb{E}(A_{ij}) - \lambda f_i f_j \ge \delta \theta_i \theta_j$, whence
$$(\widehat{Y}_{ij} - Y^*_{ij})\left( \mathbb{E}(A_{ij}) - \lambda f_i f_j \right) \le -\delta\, \theta_i \theta_j\, |\widehat{Y}_{ij} - Y^*_{ij}|.$$
Similarly, for each pair $i \in C^*_a$ and $j \in C^*_b$ with $1 \le a < b \le r$, we have
$$(\widehat{Y}_{ij} - Y^*_{ij})\left( \mathbb{E}(A_{ij}) - \lambda f_i f_j \right) \le -\delta\, \theta_i \theta_j\, |\widehat{Y}_{ij} - Y^*_{ij}|.$$
Combining the last two inequalities, we obtain the bound
$$S_1 := \langle \widehat{Y} - Y^*, \mathbb{E}A - \lambda ff^\top \rangle \le -\delta \|Y^* - \widehat{Y}\|_{1,\theta}.$$

Upper bound for $S_2$. By Grothendieck's inequality [Grothendieck, 1953; Lindenstrauss and Pełczyński, 1968] we have
$$\langle \widehat{Y} - Y^*, ff^\top - dd^\top \rangle \le 2 \sup_{Y \succeq 0,\ \mathrm{diag}(Y) = I} \left| \langle Y, ff^\top - dd^\top \rangle \right| \le 2 K_G \|ff^\top - dd^\top\|_{\infty \to 1},$$
where $K_G$ is Grothendieck's constant, known to satisfy $K_G \le 1.783$. Since $\lambda \le \min_{1 \le a \le r} \frac{B_{aa}}{H_a^2}$, applying Lemma 1 to $\|ff^\top - dd^\top\|_{\infty \to 1}$ ensures that with probability at least 0.99,
$$S_2 \le C \left( \min_{1 \le a \le r} \frac{B_{aa}}{H_a^2} \right) \|f\|_1 \left( \sqrt{n \|f\|_1} + n \right)$$
for some absolute constant $C$.

Upper bound for $S_3$. Observe that
$$\langle \widehat{Y} - Y^*, A - \mathbb{E}A \rangle \le 2 \sup_{Y \succeq 0,\ \mathrm{diag}(Y) = I} \left| \langle Y, A - \mathbb{E}A \rangle \right|.$$
It follows from Grothendieck's inequality that
$$\sup_{Y \succeq 0,\ \mathrm{diag}(Y) = I} \left| \langle Y, A - \mathbb{E}A \rangle \right| \le K_G \|A - \mathbb{E}A\|_{\infty \to 1}.$$
The norm on the last right-hand side can be expressed as
$$\|A - \mathbb{E}A\|_{\infty \to 1} = \sup_{x : \|x\|_\infty \le 1} \|(A - \mathbb{E}A)x\|_1 = \sup_{x, y \in \{\pm 1\}^n} |x^\top (A - \mathbb{E}A) y|.$$
For each fixed pair of sign vectors $x, y \in \{\pm 1\}^n$, Bernstein's inequality ensures that for each $t > 0$, with probability at most $2e^{-t}$ there holds the inequality
$$|x^\top (A - \mathbb{E}A) y| \ge \sqrt{8 t \sigma^2} + \tfrac{4}{3} t,$$
where $\sigma^2 := \sum_{i < j} \mathrm{Var}(A_{ij}) \le \frac{1}{2} \sum_{a,b=1}^r B_{ab} G_a G_b = \frac{1}{2} \|f\|_1$. Setting $t = 2n$ and applying the union bound over all sign vectors, we obtain that with probability at most $2(e/2)^{-2n}$,
$$\|A - \mathbb{E}A\|_{\infty \to 1} \ge \sqrt{8 n \|f\|_1} + \tfrac{8}{3} n.$$
It follows that with probability at least $1 - 2(e/2)^{-2n}$,
$$S_3 \le 2 K_G \sqrt{8 n \|f\|_1} + \tfrac{16 K_G}{3} n.$$
Putting together the bounds for $S_1$, $S_2$ and $S_3$, we conclude that with probability at least $0.99 - 2(e/2)^{-2n}$, the bound (3.6) holds.

7.2 Proof of Theorem 2

Recall that $(\Psi, X)$ is the exact optimal solution to the weighted $k$-median problem (2.7), $(\check\Psi, \check X)$ is the approximate solution, and $\widehat{W} := \widehat{Y} D$ is the column-weighted version of the solution $\widehat{Y}$ to the convex program (2.6). The last constraint in (2.7) ensures that the row vectors of $X$ and $\check X$ are subsets of the row vectors of $\widehat{W}$. If we define the matrices $W := \Psi X$ and $\check W := \check\Psi \check X$, then the row vectors of $W$ and $\check W$ are also subsets of the row vectors of $\widehat{W}$. For any matrix $M$, we let $M_{i\bullet}$ denote the $i$-th row vector of $M$, and $M_{\bullet j}$ the $j$-th column vector of $M$.

At a high level, we prove Theorem 2 by translating the upper bound on the weighted error $\|\widehat{Y} - Y^*\|_{1,\theta}$, given in (3.8) in Corollary 1, into an upper bound on the weighted misclassification rate defined in Definition 3. This is done in three steps.

Step 1. As shown in Section 2.4, the true partition matrix admits the decomposition $Y^* = \Psi^* (\Psi^*)^\top$, where $\Psi^* \in \mathcal{M}_{n,r}$ is the true membership matrix.
Letting $W^* := Y^* D \in \mathbb{R}^{n \times n}$ and $X^* := (\Psi^*)^\top D \in \mathbb{R}^{r \times n}$, we have the expression $W^* = \Psi^* X^*$. We now define a matrix $\widetilde{X} \in \mathbb{R}^{r \times n}$ by setting its $k$-th row to
$$\widetilde{X}_{k\bullet} := \arg\min_{x \in \{\widehat{W}_{i\bullet} : i \in C^*_k\}} \|x - X^*_{k\bullet}\|_1, \quad \text{for each } k = 1, \ldots, r.$$
Note that $\widetilde{X}$ also satisfies $\mathrm{Rows}(\widetilde{X}) \subseteq \mathrm{Rows}(\widehat{W})$. Set $\widetilde{W} := \Psi^* \widetilde{X} \in \mathbb{R}^{n \times n}$. By definition we have the inequality
$$\|D(\widehat{W} - W^*)\|_1 = \sum_{k=1}^r \sum_{i \in C^*_k} d_i \|\widehat{W}_{i\bullet} - X^*_{k\bullet}\|_1 \ge \sum_{k=1}^r \sum_{i \in C^*_k} d_i \|\widetilde{X}_{k\bullet} - X^*_{k\bullet}\|_1 = \|D(\widetilde{W} - W^*)\|_1,$$
which implies that
$$\|D(\widetilde{W} - \widehat{W})\|_1 \le \|D(\widehat{W} - W^*)\|_1 + \|D(\widetilde{W} - W^*)\|_1 \le 2\|D(\widehat{W} - W^*)\|_1.$$
Since $(\Psi^*, \widetilde{X})$ is feasible for the optimization (2.7), we have
$$\|D(W - \widehat{W})\|_1 \le \|D(\widetilde{W} - \widehat{W})\|_1 \le 2\|D(\widehat{W} - W^*)\|_1,$$
whence
$$\|D(\check{W} - \widehat{W})\|_1 \le \tfrac{20}{3}\|D(W - \widehat{W})\|_1 \le \tfrac{40}{3}\|D(\widehat{W} - W^*)\|_1.$$
Putting together, we obtain that
$$\|D(\check{W} - W^*)\|_1 \le \|D(\widehat{W} - W^*)\|_1 + \|D(\check{W} - \widehat{W})\|_1 \le \tfrac{43}{3}\|D(\widehat{W} - W^*)\|_1. \qquad (7.1)$$
Define a matrix $\check{Y} \in \mathbb{R}^{n \times n}$ by
$$\check{Y}_{ij} = \begin{cases} \check{W}_{ij}/d_j, & \text{if } d_j > 0, \\ 0, & \text{if } d_j = 0. \end{cases} \qquad (7.2)$$
For each $j \in [n]$, if $d_j > 0$, then it follows from the above definition that $\check{W}_{ij} = \check{Y}_{ij} d_j$. Suppose $d_j = 0$; because $\mathrm{Rows}(\check{W}) \subseteq \mathrm{Rows}(\widehat{W})$ and $\widehat{W} = \widehat{Y} D$, for each $i \in [n]$ there exists an index $i' \in [n]$ such that $\check{W}_{ij} = \widehat{W}_{i'j} = \widehat{Y}_{i'j} d_j = 0 = \check{Y}_{ij} d_j$. Putting together, we conclude that $\check{W} = \check{Y} D$. In view of the bound (7.1) and the definitions $\widehat{W} := \widehat{Y} D$ and $W^* := Y^* D$, we get that
$$\|\check{Y} - Y^*\|_{1,d} = \|D(\check{W} - W^*)\|_1 \le \tfrac{43}{3}\|D(\widehat{W} - W^*)\|_1 = \tfrac{43}{3}\|D(\widehat{Y} - Y^*)D\|_1 = \tfrac{43}{3}\|\widehat{Y} - Y^*\|_{1,d}, \qquad (7.3)$$
where the weighted $\ell_1$ norm $\|\cdot\|_{1,d}$ is defined analogously to $\|\cdot\|_{1,\theta}$ in Definition 2.

Step 2. The bound in (7.3) is weighted by the empirical degrees $d$.
Our next step is to convert this bound into one that is weighted by the population quantity $f$. Recall that $\mathrm{Rows}(\check{W}) \subseteq \mathrm{Rows}(\widehat{W})$ and $\widehat{W} = \widehat{Y} D$. If $d_j > 0$, then for any $i \in [n]$, there exists an $i'$ such that $\check{Y}_{ij} = \check{W}_{ij}/d_j = \widehat{W}_{i'j}/d_j = \widehat{Y}_{i'j}$. Since $\widehat{Y}$ is feasible for the convex relaxation (2.6), we have $0 \le \widehat{Y} \le J$. It follows that $0 \le \check{Y} \le J$ and hence $\|\check{Y} - Y^*\|_\infty \le 1$. Setting $M := \|ff^\top - dd^\top\|_1$, we observe that any matrix $Z$ satisfies the bound
$$\left| \|Z\|_{1,f} - \|Z\|_{1,d} \right| \le \|Z \circ (ff^\top - dd^\top)\|_1 \le M \|Z\|_\infty,$$
where $\|Z\|_{1,f}$ and $\|Z\|_{1,d}$ are defined in the same fashion as $\|Z\|_{1,\theta}$ in Definition 2. Therefore, the bound (7.3) implies that
$$\|\check{Y} - Y^*\|_{1,\theta} = \frac{1}{h^2}\|\check{Y} - Y^*\|_{1,f} \le \frac{1}{h^2}\left( \|\check{Y} - Y^*\|_{1,d} + M \right) \lesssim \frac{1}{h^2}\left( \|\widehat{Y} - Y^*\|_{1,d} + M \right) \le \frac{1}{h^2}\left( \|\widehat{Y} - Y^*\|_{1,f} + 2M \right) \lesssim \|\widehat{Y} - Y^*\|_{1,\theta} + \frac{M}{h^2}. \qquad (7.4)$$
To bound the second term above, we apply Lemma 1 to get that with probability at least 0.99, $M \le C\|f\|_1\left( \sqrt{n\|f\|_1} + n \right)$. Also note that under the $F(n, r, p, q, g)$ model, $H_1 = \cdots = H_r = h := (p + (r-1)q)g$, which implies that $f_i = \theta_i h, \forall i \in [n]$, and $\|f\|_1 = r(p + (r-1)q)g^2$. Combining the last three equations gives the following bound on the second term of (7.4):
$$\frac{M}{h^2} \lesssim \frac{r\left( \sqrt{n\|f\|_1} + n \right)}{p + (r-1)q} \le \frac{r\left( rg\sqrt{np} + n \right)}{\delta}. \qquad (7.5)$$
We can control the first term in (7.4) using the bound (3.8) in Corollary 1. Putting together, a straightforward calculation yields the inequality
$$\|\check{Y} - Y^*\|_{1,\theta} \lesssim \frac{1}{\delta}\, r\left( n + rg\sqrt{np} \right). \qquad (7.6)$$

Step 3. Recall that $\mathrm{diag}(\theta)$ is the $n \times n$ diagonal matrix whose diagonal entries are the entries of $\theta$. For each $a = 1, \ldots, r$, define the set of node indices
$$S_a := \left\{ i \in C^*_a : \|(\check{Y}_{i\bullet} - Y^*_{i\bullet})\,\mathrm{diag}(\theta)\|_1 \ge g \right\},$$
and let $S := \bigcup_{a=1}^r S_a$.
It follows from (7.6) that

$\sum_{i \in S} \theta_i \le \sum_{i=1}^n \frac{\theta_i}{g}\|(\check{Y}_{i\bullet} - Y^*_{i\bullet})\,\mathrm{diag}(\theta)\|_1 = \frac{1}{g}\|\check{Y} - Y^*\|_{1,\theta} \lesssim \frac{r\big( n + rg\sqrt{np} \big)}{\delta g}. \quad (7.7)$

Consider the set $T_a := C_a^* \setminus S_a$ for each $a = 1, \dots, r$. There are three cases for each $T_a$. In the first case, $T_a = \emptyset$, and we denote by $R_1$ the collection of all such indices $a$. In the second case, $T_a \ne \emptyset$ and $\check{\Psi}_{i\bullet} = \check{\Psi}_{j\bullet}$ for all $i, j \in T_a$. We say that these $T_a$'s are pure, and denote by $R_2$ the collection of all such indices $a$. Finally, we set $R_3 := \{1, \dots, r\} \setminus (R_1 \cup R_2)$; for each $a \in R_3$, we say that $T_a$ is impure, since there exist $i, j \in T_a$ such that $\check{\Psi}_{i\bullet} \ne \check{\Psi}_{j\bullet}$.

For each $a \in R_1$, we have $S_a = C_a^*$, which implies that

$\sum_{i \in S} \theta_i \ge \sum_{i \in \bigcup_{a \in R_1} C_a^*} \theta_i = |R_1|\, g. \quad (7.8)$

For each pair $a, b \in R_2 \cup R_3$ with $a \ne b$, by definition we know that $T_a \ne \emptyset$ and $T_b \ne \emptyset$. Then for each pair $i \in T_a \subseteq C_a^*$ and $j \in T_b \subseteq C_b^*$, we have

$\|\check{Y}_{i\bullet}\,\mathrm{diag}(\theta) - \check{Y}_{j\bullet}\,\mathrm{diag}(\theta)\|_1 \ge \|(Y^*_{i\bullet} - Y^*_{j\bullet})\,\mathrm{diag}(\theta)\|_1 - \|(Y^*_{i\bullet} - \check{Y}_{i\bullet})\,\mathrm{diag}(\theta)\|_1 - \|(Y^*_{j\bullet} - \check{Y}_{j\bullet})\,\mathrm{diag}(\theta)\|_1 > 2g - g - g = 0,$

whence $\check{Y}_{i\bullet} \ne \check{Y}_{j\bullet}$. We claim that this implies $\check{\Psi}_{i\bullet} \ne \check{\Psi}_{j\bullet}$. Suppose that the claim is not true. For each $k \in [n]$, if $d_k = 0$, then $\check{Y}_{ik} = \check{Y}_{jk} = 0$ in view of the definition of $\check{Y}$ in (7.2); if $d_k > 0$, then the definition $\check{W} = \check{\Psi}\check{X}$ implies that

$\check{Y}_{ik} = \frac{1}{d_k}\check{W}_{ik} = \frac{1}{d_k}\langle \check{\Psi}_{i\bullet}, \check{X}_{\bullet k}\rangle = \frac{1}{d_k}\langle \check{\Psi}_{j\bullet}, \check{X}_{\bullet k}\rangle = \frac{1}{d_k}\check{W}_{jk} = \check{Y}_{jk}.$

Therefore, we have $\check{Y}_{i\bullet} = \check{Y}_{j\bullet}$, which is a contradiction. In conclusion, we have proved that for each pair $a, b \in R_2 \cup R_3$ with $a \ne b$ and each pair $i \in T_a$, $j \in T_b$, we have $\check{\Psi}_{i\bullet} \ne \check{\Psi}_{j\bullet}$. Moreover, since for each $a \in R_2$ the set $T_a$ is pure by definition, there exists a permutation matrix $\Pi \in S_r$ such that for all $i \in \bigcup_{a \in R_2} T_a$, there holds $(\check{\Psi}\Pi)_{i\bullet} = \Psi^*_{i\bullet}$.
Recalling Definition 3, we conclude that the set $(\bigcup_{a \in R_3} T_a) \cup S$ contains the misclassified node set with respect to $\Pi$. It follows that

$\sum_{i \in E(\Pi)} \theta_i \le \sum_{i \in S} \theta_i + \sum_{i \in \bigcup_{a \in R_3} T_a} \theta_i \le \sum_{i \in S} \theta_i + |R_3|\, g. \quad (7.9)$

The matrix $\check{\Psi}$ consists of at most $r$ distinct row vectors. Because $R_2$ is pure and $R_3$ is impure by definition, we have the inequality $|R_2| + 2|R_3| \le r = |R_1| + |R_2| + |R_3|$, which implies that

$|R_3| \le |R_1|. \quad (7.10)$

Applying the bounds (7.9), (7.10), (7.8) and (7.7) in order, we obtain

$\sum_{i \in E(\Pi)} \theta_i \le 2\sum_{i \in S} \theta_i \lesssim \frac{r\big( n + rg\sqrt{np} \big)}{\delta g},$

thereby proving the inequality (3.9) in the theorem.

7.3 Proof of Theorem 3

Let us first recall and introduce some key notation and definitions. For each $a \in [r]$, define $\theta^{(a)} \in \mathbb{R}^n_+$ such that $\theta^{(a)}_i = \theta_i$ if $i \in C_a^*$ and $\theta^{(a)}_i = 0$ otherwise. For any vector $v \in \mathbb{R}^n$ and $1 \le a \le r$, we let $v_{(a)} \in \mathbb{R}^{\ell_a}$ denote the restriction of $v$ to the entries in $C_a^*$. For any matrix $M \ge 0$, let $M^{\frac12}$ denote the matrix whose $(i,j)$-entry is $\sqrt{M_{ij}}$. Similarly, for any matrix $M > 0$, let $M^{-\frac12}$ denote the matrix whose $(i,j)$-entry is $1/\sqrt{M_{ij}}$. Let $G := \sum_{i=1}^n \theta_i = \|\theta\|_1$; then for all $1 \le a \le r$,

$H_a = \sum_{k=1}^r B_{ak} G_k = qG + (p - q)G_a.$

A simple implication is $H_a \ge qG_a + (p - q)G_a = pG_a \ge pG_{\min}$. By the assumption (3.13) and the fact that $\delta < p$,

$C_0^2\, \frac{\log n}{G_{\min}\theta_{\min}} \le \frac{\delta^2}{p} < \delta. \quad (7.11)$

We now turn to the proof of Theorem 3. In the proof we make use of several technical lemmas given in the Appendix.

Proof. Recall that $f_i = \theta_i H_a$ for each $i \in C_a^*$, $a = 1, \dots, r$. The following lemma, complementing Lemma 1, characterizes the relationship between the degrees $d_1, \dots, d_n$ and the population quantities $f_1, \dots, f_n$.

Lemma 2. With probability at least $1 - \frac{2}{n^2}$, for all $i = 1, \dots, n$,

$|d_i - f_i| \le \max\big( \sqrt{12 f_i \log n},\; 4\log n + 1 \big). \quad (7.12)$

If we further assume that the condition (3.13) holds with some large enough numerical constant $C_0$, then for all $i = 1, \dots, n$,

$|d_i - f_i| \le \frac{\delta}{5p}\, f_i. \quad (7.13)$

We prove this lemma in Section 7.3.1 to follow.

Back to the proof of Theorem 3, we assume without loss of generality that the nodes in the same cluster have adjacent indices. Recall that for $1 \le a \le r$, $\ell_a$ is the size of community $a$. Then we have

$Y^* = \mathrm{diag}\big( J_{\ell_1}, \dots, J_{\ell_r} \big) \in \mathcal{P}_{n,r}, \quad (7.14)$

the block-diagonal matrix whose $a$-th diagonal block is the $\ell_a \times \ell_a$ all-ones matrix $J_{\ell_a}$. Recall the decomposition $Y^* = \Psi^*(\Psi^*)^\top$, where

$\Psi^* := [v_1, \dots, v_r] := \mathrm{diag}\big( 1_{\ell_1}, \dots, 1_{\ell_r} \big) \in \mathcal{M}_{n,r}$

is the block-diagonal matrix whose $a$-th block is the all-ones vector $1_{\ell_a}$. To establish the theorem, it suffices to show that for any feasible solution $Y$ with $Y \ne Y^*$,

$\Delta(Y) := \langle Y^* - Y,\; A - \lambda\, d d^\top \rangle > 0.$

For a matrix $X \in \mathbb{R}^{n \times n}$, let $X^w \in \mathbb{R}^{n \times n}$ denote the matrix $X$ restricted to the entries in $\bigcup_{1 \le k \le r} C_k \times C_k$, and $X^b \in \mathbb{R}^{n \times n}$ denote $X$ restricted to the entries in $\bigcup_{k \ne \ell} C_k \times C_\ell$. This yields the decomposition $X = X^w + X^b$. Moreover, for each fixed pair $1 \le a, b \le r$, the submatrix of $X$ with entries in $C_a \times C_b$ is denoted by $X^{(a,b)} \in \mathbb{R}^{\ell_a \times \ell_b}$. Let $\epsilon = \frac{\delta}{10}$. We decompose $\Delta(Y)$ as

$\Delta(Y) = \langle A^w - \lambda (d d^\top)^w - \epsilon (\theta\theta^\top)^w,\; Y^* - Y \rangle + \langle \epsilon (\theta\theta^\top)^w,\; Y^* - Y \rangle + \langle \mathbb{E}A^b - \lambda (d d^\top)^b,\; Y^* - Y \rangle + \langle A^b - \mathbb{E}A^b,\; Y^* - Y \rangle =: S_1 + S_2 + S_3 + S_4. \quad (7.15)$

Below we establish lower bounds for the terms $S_1$, $S_2$, $S_3$ and $S_4$ respectively.

Lower bound of $S_1$. We plan to construct an $n \times n$ diagonal matrix $D$ such that, with high probability,

$\Lambda := D + \epsilon (\theta\theta^\top)^w + \lambda (d d^\top)^w - A^w \succeq 0 \quad \text{and} \quad \Lambda\Psi^* = 0.$
(7.16)

Such a diagonal matrix $D$ implies that with high probability,

$S_1 = \langle A^w - \lambda (d d^\top)^w - \epsilon (\theta\theta^\top)^w,\; Y^* - Y \rangle \overset{(a)}{=} \langle A^w - \lambda (d d^\top)^w - \epsilon (\theta\theta^\top)^w - D,\; Y^* - Y \rangle \overset{(b)}{=} \langle -\Lambda,\; Y^* - Y \rangle \overset{(c)}{\ge} \langle -\Lambda,\; Y^* \rangle = 0,$

where step (a) follows from $\mathrm{diag}(Y^*) = \mathrm{diag}(Y) = I_n$, (b) follows from the definition of $\Lambda$, (c) holds due to $Y \succeq 0$ and $\Lambda \succeq 0$, and the last equality follows because $\Lambda\Psi^* = 0$.

In what follows, we show how to construct explicitly a diagonal matrix $D$ satisfying the condition (7.16) with high probability. Notice that the condition (7.16) is equivalent to requiring that, with high probability, for all $1 \le a \le r$,

$\Lambda^{(a,a)} := D^{(a,a)} + \epsilon\,\theta_{(a)}\theta_{(a)}^\top + \lambda\, d_{(a)}d_{(a)}^\top - A^{(a,a)} \succeq 0 \quad \text{and} \quad \Lambda^{(a,a)} 1_{\ell_a} = 0, \quad (7.17)$

where $1_\ell$ denotes the $\ell$-dimensional vector whose coordinates all equal 1. The equality condition gives

$D^{(a,a)} = \mathrm{diag}\bigg( A^{(a,a)} 1_{\ell_a} - \lambda\Big( \sum_{j \in C_a^*} d_j \Big) d_{(a)} - \epsilon\, G_a\, \theta_{(a)} \bigg). \quad (7.18)$

The equality condition also implies that $\mathrm{rank}(\Lambda^{(a,a)}) \le \ell_a - 1$. Therefore, in order to prove $\Lambda^{(a,a)} \succeq 0$, it suffices to prove $\lambda_{\ell_a - 1}(\Lambda^{(a,a)}) > 0$. By Weyl's inequality [Horn and Johnson, 2013], we have

$\lambda_{\ell_a - 1}(\Lambda^{(a,a)}) = \lambda_{\ell_a - 1}\big( D^{(a,a)} - A^{(a,a)} + \lambda\, d_{(a)}d_{(a)}^\top + \epsilon\,\theta_{(a)}\theta_{(a)}^\top \big) \ge \lambda_{\ell_a}\big( D^{(a,a)} - A^{(a,a)} + p\,\theta_{(a)}\theta_{(a)}^\top \big) + \lambda_{\ell_a - 1}\big( \lambda\, d_{(a)}d_{(a)}^\top + \epsilon\,\theta_{(a)}\theta_{(a)}^\top - p\,\theta_{(a)}\theta_{(a)}^\top \big) \ge \lambda_{\ell_a}\big( D^{(a,a)} - A^{(a,a)} + p\,\theta_{(a)}\theta_{(a)}^\top \big).$

The last inequality is due to the fact that $\lambda\, d_{(a)}d_{(a)}^\top + \epsilon\,\theta_{(a)}\theta_{(a)}^\top - p\,\theta_{(a)}\theta_{(a)}^\top$ has at most one negative eigenvalue. Therefore, to establish (7.16), we only need to prove $D^{(a,a)} - A^{(a,a)} + p\,\theta_{(a)}\theta_{(a)}^\top \succ 0$, or equivalently,

$\Lambda_1 := \mathrm{diag}(\theta_{(a)})^{-\frac12}\big( D^{(a,a)} - A^{(a,a)} + p\,\theta_{(a)}\theta_{(a)}^\top \big)\,\mathrm{diag}(\theta_{(a)})^{-\frac12} \succ 0.$
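The choice of $D^{(a,a)}$ in (7.18) makes the row sums of $\Lambda^{(a,a)}$ vanish by pure algebra, independently of any randomness. A minimal numerical sketch (with hypothetical values for $\theta$, $d$, $p$, $\lambda$ and $\epsilon$; not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
l = 30                                    # block size ell_a (hypothetical)
theta = rng.uniform(0.5, 1.0, size=l)
d = rng.uniform(1.0, 5.0, size=l)         # degrees restricted to the block
p, lam, eps = 0.6, 0.02, 0.05             # hypothetical p, lambda, eps = delta/10
A = (rng.uniform(size=(l, l)) < p * np.outer(theta, theta)).astype(float)
A = np.triu(A, 1)
A = A + A.T                               # symmetric block with zero diagonal
G_a = theta.sum()

# D^{(a,a)} per (7.18): diag( A 1 - lambda * (sum_j d_j) d - eps * G_a * theta )
D_block = np.diag(A @ np.ones(l) - lam * d.sum() * d - eps * G_a * theta)
Lam = D_block + eps * np.outer(theta, theta) + lam * np.outer(d, d) - A

# the equality condition in (7.17): Lambda^{(a,a)} 1 = 0 holds exactly
assert np.allclose(Lam @ np.ones(l), 0.0)
```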
Define the matrices

$\Lambda_{11} := \mathrm{diag}(\theta_{(a)})^{-\frac12} D^{(a,a)}\,\mathrm{diag}(\theta_{(a)})^{-\frac12} = D^{(a,a)}\,\mathrm{diag}(\theta_{(a)})^{-1}, \qquad \Lambda_{12} := \mathrm{diag}(\theta_{(a)})^{-\frac12}\big( p\,\theta_{(a)}\theta_{(a)}^\top - A^{(a,a)} \big)\,\mathrm{diag}(\theta_{(a)})^{-\frac12}.$

Then $\Lambda_1 = \Lambda_{11} + \Lambda_{12}$. By Weyl's inequality [Horn and Johnson, 2013], to prove $\Lambda_1 \succ 0$, we only need to show that

$\lambda_{\ell_a}(\Lambda_{11}) > \|\Lambda_{12}\|. \quad (7.19)$

Applying Lemma 5 in the appendix, we can prove that with probability at least $1 - \frac{1}{n^2}$,

$\|\Lambda_{12}\| \le C_1\bigg( \sqrt{p\,\ell_a \log n} + \frac{\log n}{\theta_{\min}} \bigg) + p, \quad (7.20)$

for some numerical constant $C_1$. Moreover,

$\lambda_{\ell_a}(\Lambda_{11}) = \min_{i \in C_a^*}\Bigg( \frac{1}{\theta_i}\sum_{j \in C_a^*} a_{ij} - \lambda\Big( \sum_{j \in C_a^*} d_j \Big)\frac{d_i}{\theta_i} - \epsilon\, G_a \Bigg), \quad (7.21)$

where $a_{ij}$ denotes the $(i,j)$-th entry of $A$. By Chernoff's inequality, for each $i \in C_a^*$, with probability at least $1 - \frac{1}{n^3}$,

$\sum_{j \in C_a^*} a_{ij} \ge p\,\theta_i\Big( \sum_{j \in C_a^* \setminus \{i\}} \theta_j \Big) - \sqrt{6(\log n)\,\theta_i\, p\Big( \sum_{j \in C_a^* \setminus \{i\}} \theta_j \Big)} \ge p\,\theta_i(G_a - 1) - \sqrt{6(\log n)\,\theta_i\, p\, G_a}.$

By the bound (7.13) and the separation condition (3.12), with probability at least $1 - \frac{2}{n^2}$ there holds the bound

$\lambda\Big( \sum_{j \in C_a^*} d_j \Big)\frac{d_i}{\theta_i} \le \frac{p - \delta}{H_a^2\,\theta_i}\Big( 1 + \frac{\delta}{5p} \Big)^2\Big( \sum_{j \in C_a^*} f_j \Big) f_i = \frac{p - \delta}{H_a^2\,\theta_i}\Big( 1 + \frac{\delta}{5p} \Big)^2\Big( \sum_{j \in C_a^*} \theta_j H_a \Big)\theta_i H_a = (p - \delta)\Big( 1 + \frac{\delta}{5p} \Big)^2 G_a \le \Big( p - \frac{3}{5}\delta \Big) G_a.$

Then by the fact $\epsilon = \frac{\delta}{10}$, the expression (7.21) and the union bound, we conclude that with probability at least $1 - \frac{3}{n^2}$,

$\lambda_{\ell_a}(\Lambda_{11}) \ge \frac{1}{2}\delta\, G_a - p - \sqrt{\frac{6(\log n)\, p\, G_a}{\theta_{\min}}}.$

Therefore, to guarantee (7.19), it suffices to have

$\frac{1}{2}\delta\, G_a - p - \sqrt{\frac{6(\log n)\, p\, G_a}{\theta_{\min}}} > C_1\bigg( \sqrt{p\,\ell_a \log n} + \frac{\log n}{\theta_{\min}} \bigg) + p.$

Since $\ell_a \le G_a/\theta_{\min}$ and $p \le 1$, the above inequality is guaranteed by the assumption (3.13) and its implication (7.11) with sufficiently large $C_0$. Thus for each $1 \le a \le r$, (7.17) holds with probability at least $1 - \frac{4}{n^2}$.
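The rank-one perturbation fact used in the Weyl step above, namely that $\lambda\, d_{(a)}d_{(a)}^\top + (\epsilon - p)\,\theta_{(a)}\theta_{(a)}^\top$ has at most one negative eigenvalue, holds deterministically: it is the sum of a rank-one positive semidefinite matrix and a rank-one negative semidefinite matrix. A quick numerical check with hypothetical values:

```python
import numpy as np

rng = np.random.default_rng(2)
l = 40
theta = rng.uniform(0.3, 1.0, size=l)
d = rng.uniform(1.0, 5.0, size=l)
p, lam, eps = 0.6, 0.02, 0.05   # hypothetical values with eps < p

# lam*dd^T is PSD of rank one, (eps - p)*theta theta^T is NSD of rank one;
# by eigenvalue interlacing, the sum has at most one negative eigenvalue
M = lam * np.outer(d, d) + (eps - p) * np.outer(theta, theta)
eigs = np.linalg.eigvalsh(M)
n_negative = int((eigs < -1e-10).sum())
assert n_negative <= 1
```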
By the union bound, (7.17) holds for all $1 \le a \le r$ with probability at least $1 - \frac{4}{n}$.

Lower bound of $S_2$. For any $(i,j) \in C_a^* \times C_a^*$ for some $a = 1, \dots, r$, we have $Y^*_{ij} = 1 \ge Y_{ij}$. This implies that the matrix $Y^{*w} - Y^w$ is entrywise nonnegative, and thus

$S_2 = \langle \epsilon (\theta\theta^\top)^w,\; Y^* - Y \rangle = \frac{\delta}{10}\|Y^{*w} - Y^w\|_{1,\theta}.$

Lower bound of $S_3$. For any $(i,j) \in C_a^* \times C_b^*$ with $a \ne b$, by the bound (7.13), with probability at least $1 - \frac{2}{n^2}$ there holds

$d_i d_j \ge \Big( 1 - \frac{\delta}{5p} \Big)^2 f_i f_j = \Big( 1 - \frac{\delta}{5p} \Big)^2 \theta_i\theta_j H_a H_b.$

The separation condition (3.12) implies that $\lambda > \frac{q + \delta}{H_a H_b}$, so with probability at least $1 - \frac{2}{n^2}$,

$\lambda d_i d_j > (q + \delta)\Big( 1 - \frac{\delta}{5p} \Big)^2 \theta_i\theta_j.$

The inequalities

$(q + \delta)\Big( 1 - \frac{\delta}{5p} \Big)^2 > q + \delta - \frac{2q\delta}{5p} - \frac{2\delta^2}{5p} > q + \delta - \frac{4\delta}{5} = q + \frac{1}{5}\delta$

imply that with probability at least $1 - \frac{2}{n^2}$,

$\lambda d_i d_j > \Big( q + \frac{1}{5}\delta \Big)\theta_i\theta_j.$

Moreover, we know that $Y^*_{ij} = 0$ for any $(i,j) \in C_a^* \times C_b^*$ with $a \ne b$. Therefore, with probability at least $1 - \frac{2}{n^2}$,

$S_3 = \langle \mathbb{E}A^b - \lambda (d d^\top)^b,\; Y^* - Y \rangle = \langle q(\theta\theta^\top)^b - \lambda (d d^\top)^b,\; -Y^b \rangle \ge \frac{\delta}{5}\|Y^b\|_{1,\theta}. \quad (7.22)$

Lower bound of $S_4$. Define the matrix $W = (A^b - \mathbb{E}[A^b]) \circ \Theta^{-\frac12}$; then

$\langle A^b - \mathbb{E}[A^b],\; Y^* - Y \rangle = \langle W,\; (Y^* - Y) \circ \Theta^{\frac12} \rangle. \quad (7.23)$

Let $U \in \mathbb{R}^{n \times r}$ be the weighted characteristic matrix for the clusters, i.e.,

$U_{ia} = \sqrt{\theta_i}\big/\sqrt{\|\theta^{(a)}\|_1}$ if node $i \in C_a^*$, and $U_{ia} = 0$ otherwise.

Let $\Sigma \in \mathbb{R}^{r \times r}$ be the diagonal matrix with $\Sigma_{aa} = \|\theta^{(a)}\|_1$ for $a \in [r]$. The weighted true cluster matrix $Y^* \circ \Theta^{\frac12}$ has a rank-$r$ singular value decomposition given by $Y^* \circ \Theta^{\frac12} = U\Sigma U^\top$. Define the projections $P_T(M) = UU^\top M + MUU^\top - UU^\top MUU^\top$ and $P_{T^\perp}(M) = M - P_T(M)$. Let $\mathrm{Tr}(M)$ denote the trace of $M$. It follows from Lemma 3 in the Appendix that with probability at least $1 - \frac{1}{n^2}$,

$\|W\| \le c_2\sqrt{qn} + \frac{c_2\sqrt{\log n}}{\theta_{\min}}.$
Notice that $UU^\top + P_{T^\perp}\big( \frac{W}{\|W\|} \big)$ is a subgradient of $\|X\|_*$ at $X = Y^* \circ \Theta^{\frac12}$. Hence, we have

$0 \ge \mathrm{Tr}\big( Y \circ \Theta^{\frac12} \big) - \mathrm{Tr}\big( Y^* \circ \Theta^{\frac12} \big) \ge \big\langle UU^\top + P_{T^\perp}(W/\|W\|),\; (Y - Y^*) \circ \Theta^{\frac12} \big\rangle,$

and as a consequence,

$\big\langle P_{T^\perp}(W),\; (Y^* - Y) \circ \Theta^{\frac12} \big\rangle \ge \|W\|\,\big\langle UU^\top,\; (Y - Y^*) \circ \Theta^{\frac12} \big\rangle.$

Therefore, we get that

$\big\langle W,\; (Y^* - Y) \circ \Theta^{\frac12} \big\rangle = \big\langle P_{T^\perp}(W),\; (Y^* - Y) \circ \Theta^{\frac12} \big\rangle + \big\langle P_T(W),\; (Y^* - Y) \circ \Theta^{\frac12} \big\rangle = \big\langle P_{T^\perp}(W),\; (Y^* - Y) \circ \Theta^{\frac12} \big\rangle - \big\langle P_T(W),\; Y^b \circ \Theta^{\frac12} \big\rangle \ge \|W\|\,\big\langle UU^\top,\; (Y - Y^*) \circ \Theta^{\frac12} \big\rangle - \big\langle P_T(W),\; Y^b \circ \Theta^{\frac12} \big\rangle \ge -\|W\|\,\|UU^\top\|_{\infty,\Theta^{-\frac12}}\,\|Y^* - Y\|_{1,\theta} - \|P_T(W)\|_{\infty,\Theta^{-\frac12}}\,\|Y^b\|_{1,\theta}, \quad (7.24)$

where the second equality holds because $P_T(W) = 0$ on the diagonal blocks $C_k \times C_k$ for $1 \le k \le r$, and the last inequality follows because for any matrix $M \in \mathbb{R}^{n \times n}$ we define $\|M\|_{\infty,\Theta^{-\frac12}} := \|M \circ \Theta^{-\frac12}\|_\infty$. By the definition of $U$, $\|UU^\top\|_{\infty,\Theta^{-\frac12}} = 1/G_{\min}$. It follows from the theorem assumption (3.13) that $\delta > \frac{10\|W\|}{G_{\min}}$. Therefore, we have

$\|W\|\,\|UU^\top\|_{\infty,\Theta^{-\frac12}} = \frac{\|W\|}{G_{\min}} < \frac{\delta}{10}. \quad (7.25)$

Below we bound the term $\|P_T(W)\|_{\infty,\Theta^{-\frac12}}$. From the definition of $P_T$,

$\|P_T(W)\|_{\infty,\Theta^{-\frac12}} \le \|UU^\top W\|_{\infty,\Theta^{-\frac12}} + \|WUU^\top\|_{\infty,\Theta^{-\frac12}} + \|UU^\top WUU^\top\|_{\infty,\Theta^{-\frac12}}.$

We bound $\|UU^\top W\|_{\infty,\Theta^{-\frac12}}$ below. Notice that $(UU^\top W)_{ij} = 0$ if $i$ and $j$ are from the same cluster. Thus, to bound the term $(UU^\top W)_{ij}$, it suffices to consider the case where $i$ belongs to cluster $k$ and $j$ belongs to a different cluster $k'$ with $k' \ne k$. Recall that $C_k^*$ is the set of nodes in cluster $k$. Then

$(UU^\top W)_{ij} = \frac{\sqrt{\theta_i}}{\|\theta^{(k)}\|_1}\sum_{i' \in C_k^*} \sqrt{\theta_{i'}}\, W_{i'j},$

which is a weighted average of independent random variables. By Bernstein's inequality, with probability at least $1 - n^{-3}$,

$\bigg| \sum_{i' \in C_k^*} \sqrt{\theta_{i'}}\, W_{i'j} \bigg| \le \sqrt{6q\|\theta^{(k)}\|_1 \log n} + \frac{2\log n}{\sqrt{\theta_j}}.$
Then with probability at least $1 - n^{-1}$,

$\|UU^\top W\|_{\infty,\Theta^{-\frac12}} \le c_1\Bigg( \sqrt{\frac{q\log n}{G_{\min}\theta_{\min}}} + \frac{\log n}{G_{\min}\theta_{\min}} \Bigg).$

The terms $\|WUU^\top\|_{\infty,\Theta^{-\frac12}}$ and $\|UU^\top WUU^\top\|_{\infty,\Theta^{-\frac12}}$ are bounded similarly. Therefore, with probability at least $1 - 3n^{-1}$,

$\|P_T(W)\|_{\infty,\Theta^{-\frac12}} \le 3c_1\Bigg( \sqrt{\frac{q\log n}{G_{\min}\theta_{\min}}} + \frac{\log n}{G_{\min}\theta_{\min}} \Bigg) < \frac{\delta}{10}, \quad (7.26)$

where the last inequality follows from the theorem assumption (3.13). Substituting the bounds (7.25) and (7.26) into the inequality (7.24), we get that with probability at least $1 - 4n^{-1}$,

$S_4 > -\frac{\delta}{10}\big( \|Y^* - Y\|_{1,\theta} + \|Y^b\|_{1,\theta} \big).$

Putting together. Combining the bounds for $S_i$, $i = 1, 2, 3, 4$, we conclude that with probability at least $1 - 10n^{-1}$,

$S_1 + S_2 + S_3 + S_4 > \frac{\delta}{10}\|Y^{*w} - Y^w\|_{1,\theta} + \frac{\delta}{5}\|Y^b\|_{1,\theta} - \frac{\delta}{10}\big( \|Y^* - Y\|_{1,\theta} + \|Y^b\|_{1,\theta} \big) \ge 0,$

thereby proving that $\Delta(Y) > 0$ for any feasible $Y \ne Y^*$.

7.3.1 Proof of Lemma 2

Proof. The bound (7.12) can be obtained straightforwardly by Chernoff's inequality. To prove (7.13), we only need to establish that for all $i = 1, \dots, n$,

$\delta \ge 5p \max\Bigg( \sqrt{\frac{12\log n}{f_i}},\; \frac{4\log n + 1}{f_i} \Bigg). \quad (7.27)$

For any $i \in C_a^*$, we have $f_i = \theta_i H_a \ge p\,\theta_i G_{\min} \ge p\,\theta_{\min} G_{\min}$. Therefore, the assumption (3.13) implies that

$\delta > C_0\sqrt{\frac{p\log n}{G_{\min}\theta_{\min}}} = C_0\, p\sqrt{\frac{\log n}{p\, G_{\min}\theta_{\min}}} \ge C_0\, p\sqrt{\frac{\log n}{f_i}}.$

Therefore, as long as $C_0$ is large enough, we have

$\delta \ge 5p\sqrt{\frac{12\log n}{f_i}}. \quad (7.28)$

Since $\delta < p$, for sufficiently large $C_0$ we have $\sqrt{\frac{12\log n}{f_i}} \le \frac{1}{5}$, and this implies

$\frac{4\log n + 1}{f_i} \le \frac{12\log n}{f_i} \le \sqrt{\frac{12\log n}{f_i}}.$

The bound (7.27) can then be deduced from (7.28).

Acknowledgment

Y. Chen was supported by the School of Operations Research and Information Engineering at Cornell University. X. Li was supported by a startup fund from the Statistics Department at the University of California, Davis. J.
Xu was supported by the National Science Foundation under Grants CCF 14-09106, IIS-1447879, OIS 13-39388, and CCF 14-23088, the Strategic Research Initiative on Big-Data Analytics of the College of Engineering at the University of Illinois, DOD ONR Grant N00014-14-1-0823, and Simons Foundation Grant 328025.

Appendices

A Supporting lemmas

In this section we state several additional technical lemmas concerning random matrices. These lemmas are used in the proofs of our main theorems. Recall that $\circ$ denotes the element-wise product between matrices.

Lemma 3. Let $W = (A^b - \mathbb{E}[A^b]) \circ \Theta^{-\frac12}$. For any $c > 0$, there exists $c' > 0$ such that with probability at least $1 - n^{-c}$,

$\|W\| \le c'\bigg( \sqrt{nq} + \frac{\sqrt{\log n}}{\theta_{\min}} \bigg). \quad (A.1)$

Proof. Let $W'$ denote an independent copy of $W$. Let $M = (M_{ij})$ denote an $n \times n$ zero-diagonal symmetric matrix whose entries are Rademacher and independent from $W$ and $W'$. We apply the usual symmetrization argument:

$\mathbb{E}[\|W\|] = \mathbb{E}[\|W - \mathbb{E}[W']\|] \overset{(a)}{\le} \mathbb{E}[\|W - W'\|] \overset{(b)}{=} \mathbb{E}[\|(W - W') \circ M\|] \overset{(c)}{\le} 2\,\mathbb{E}[\|W \circ M\|],$

where (a) follows from Jensen's inequality, (b) follows because $W - W'$ has the same distribution as $(W - W') \circ M$, and (c) follows from the triangle inequality.

Next we upper bound $\mathbb{E}[\|W \circ M\|]$. Notice that $W \circ M$ is an $n \times n$ symmetric and zero-diagonal random matrix whose entries $\{(W \circ M)_{ij},\, i < j\}$ are independent. Let $b_{ij} = \sqrt{q(1 - q\theta_i\theta_j)}$ if $i$ and $j$ are in two different clusters, and $b_{ij} = 0$ otherwise. Let $\{\xi_{ij},\, i \le j\}$ denote independent random variables taking the values

$\pm\frac{1 - q\theta_i\theta_j}{\sqrt{q(1 - q\theta_i\theta_j)\,\theta_i\theta_j}}$, each with probability $\frac{1}{2}q\theta_i\theta_j$, and $\pm\frac{q\sqrt{\theta_i\theta_j}}{\sqrt{q(1 - q\theta_i\theta_j)}}$, each with probability $\frac{1}{2}(1 - q\theta_i\theta_j)$.

Notice that $\xi_{ij}$ has a symmetric distribution with zero mean and unit variance.
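The zero-mean, unit-variance claim for $\xi_{ij}$ is a short arithmetic identity and can be verified directly for any admissible parameter values. A quick check with hypothetical values of $q$, $\theta_i$, $\theta_j$:

```python
from math import sqrt

q, ti, tj = 0.2, 0.7, 0.9                   # hypothetical q, theta_i, theta_j
a = q * ti * tj                             # edge probability for the pair (i, j)
v1 = (1 - a) / sqrt(q * (1 - a) * ti * tj)  # value taken w.p. a/2 per sign
v2 = q * sqrt(ti * tj) / sqrt(q * (1 - a))  # value taken w.p. (1-a)/2 per sign

# mean vanishes by the +/- symmetry; the variance works out to exactly one:
# a*v1^2 = (1-a) and (1-a)*v2^2 = a, so their sum is 1
mean = 0.5 * a * (v1 - v1) + 0.5 * (1 - a) * (v2 - v2)
var = a * v1 ** 2 + (1 - a) * v2 ** 2

assert abs(mean) < 1e-12
assert abs(var - 1.0) < 1e-12
```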
Let $X$ denote the random matrix with $X_{ij} = \xi_{ij} b_{ij}$. Then one can check that $W \circ M$ has the same distribution as $X$. Notice that $\|X\|_\infty \le 1/\theta_{\min}$. It follows from [Bandeira and van Handel, 2015+, Corollary 3.6] that there exists some absolute constant $c_1 > 0$ such that

$\mathbb{E}[\|W \circ M\|] = \mathbb{E}[\|X\|] \le c_1\bigg( \sqrt{nq} + \frac{\sqrt{\log n}}{\theta_{\min}} \bigg).$

Since $\|W\|_\infty \le 1/\theta_{\min}$, Talagrand's concentration inequality for 1-Lipschitz convex functions (see, e.g., [Tao, 2012, Theorem 2.1.13]) yields the bound

$\mathbb{P}\big\{ \|W\| \ge \mathbb{E}[\|W\|] + t/\theta_{\min} \big\} \le c_2\exp(-c_3 t^2)$

for some absolute constants $c_2, c_3$, which implies that for any $c > 0$, there exists $c' > 0$ such that

$\mathbb{P}\bigg\{ \|W\| \ge c'\bigg( \sqrt{nq} + \frac{\sqrt{\log n}}{\theta_{\min}} \bigg) \bigg\} \le n^{-c}.$

Lemma 4 (Theorem 6.1 in Tropp [2012]). Consider a finite sequence $\{X_k\}$ of independent, random, self-adjoint matrices with dimension $d$. Assume that $\mathbb{E}X_k = 0$ and $\|X_k\| \le R$. If the norm of the total variance satisfies

$\Big\| \sum_k \mathbb{E}(X_k^2) \Big\| \le M^2,$

then the following inequality holds for all $t \ge 0$:

$\mathbb{P}\Bigg\{ \Big\| \sum_k X_k \Big\| \ge t \Bigg\} \le 2d\exp\bigg( \frac{-t^2/2}{M^2 + Rt/3} \bigg).$

Lemma 5. Let $A = (A_{ij})_{1 \le i,j \le n}$ be a symmetric random matrix whose diagonal entries are all zeros. Moreover, suppose $A_{ij}$, $1 \le i < j \le n$, are independent zero-mean random variables satisfying $|A_{ij}| \le R$ and $\mathrm{Var}(A_{ij}) \le \sigma^2$. Then, with probability at least $1 - \frac{2}{n^4}$, we have

$\|A\| \le C_0\big( \sigma\sqrt{n\log n} + R\log n \big)$

for some numerical constant $C_0$.

Proof. For each pair $(i,j)$ with $1 \le i < j \le n$, let $X_{ij}$ be the matrix whose $(i,j)$ and $(j,i)$ entries are both $A_{ij}$ and whose other entries are zeros. Then we have $A = \sum_{1 \le i < j \le n} X_{ij}$. Moreover, we can easily show that

$\mathbb{E}X_{ij} = 0, \quad \|X_{ij}\| \le R, \quad \text{and} \quad 0 \preceq \sum_{1 \le i < j \le n} \mathbb{E}X_{ij}^2 \preceq (n-1)\sigma^2 I_n.$

Applying Lemma 4 completes the proof.

References

L. A. Adamic and N. Glance.
The political blogosphere and the 2004 US election: divided they blog. Proceedings of the 3rd International Workshop on Link Discovery, ACM, New York, pages 36–43, 2005.

B. P. Ames and S. A. Vavasis. Convex optimization for the planted k-disjoint-clique problem. Mathematical Programming, 143(1–2):299–337, 2014.

A. A. Amini, A. Chen, P. J. Bickel, and E. Levina. Pseudo-likelihood methods for community detection in large sparse networks. Ann. Statist., 41(4):2097–2122, 2013.

A. S. Bandeira and R. van Handel. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. Annals of Probability, to appear, 2015+.

M. Bastian, S. Heymann, and M. Jacomy. Gephi: An open source software for exploring and manipulating networks. In International AAAI Conference on Weblogs and Social Media, 2009.

C. Bordenave, M. Lelarge, and L. Massoulié. Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs. arXiv:1501.06087, January 2015. URL http://arxiv.org/abs/1501.06087.

T. Cai and X. Li. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. Ann. Statist., 43(3):1027–1059, 2015.

M. Charikar, S. Guha, É. Tardos, and D. B. Shmoys. A constant-factor approximation algorithm for the k-median problem (extended abstract). In Proceedings of the Thirty-first Annual ACM Symposium on Theory of Computing, STOC '99, pages 1–10, New York, NY, USA, 1999. ACM.

K. Chaudhuri, F. Chung, and A. Tsiatas. Spectral clustering of graphs with general degrees in the extended planted partition model. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), pages 35.1–35.23, 2012.

Y. Chen and J. Xu. Statistical-computational phase transitions in planted models: The high-dimensional setting. In Proceedings of the 31st International Conference on Machine Learning, pages 244–252, 2014.

Y. Chen, S.
Sanghavi, and H. Xu. Clustering sparse graphs. Advances in Neural Information Processing Systems 25, pages 2213–2221, 2012.

A. Coja-Oghlan and A. Lanka. Finding planted partitions in random graphs with general degree distributions. SIAM Journal on Discrete Mathematics, 23(4):1682–1714, 2009.

A. Condon and R. M. Karp. Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, 18(2):116–140, 2001.

A. Dasgupta, J. Hopcroft, and F. McSherry. Spectral analysis of random graphs with skewed degree distributions. In the 45th IEEE FOCS, pages 602–610, 2004.

S. Fortunato and M. Barthelemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences, 104(1):36–41, 2007.

C. Gao, Z. Ma, A. Y. Zhang, and H. H. Zhou. Achieving optimal misclassification proportion in stochastic block model. arXiv:1505.03772, Mar. 2015.

A. Grothendieck. Résumé de la théorie métrique des produits tensoriels topologiques. Resenhas do Instituto de Matemática e Estatística da Universidade de São Paulo, 2(4):401–481, 1953.

O. Guédon and R. Vershynin. Community detection in sparse networks via Grothendieck's inequality. Probability Theory and Related Fields, pages 1–25, 2015.

L. Gulikers, M. Lelarge, and L. Massoulié. A spectral method for community detection in moderately-sparse degree-corrected stochastic block models. arXiv preprint arXiv:1506.08621, 2015.

P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

R. A. Horn and C. R. Johnson. Matrix Analysis, second edition. Cambridge, 2013.

M. Jacomy, T. Venturini, S. Heymann, and M. Bastian. ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software. PLoS ONE, 9(6):e98679, 06 2014.

J. Jin. Fast network community detection by SCORE. Ann. Statist.
, 43(1):57–89, 2015.

B. Karrer and M. Newman. Stochastic blockmodels and community structure in networks. Phys. Rev. E, 83:016107, 2011.

F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, and P. Zhang. Spectral redemption in clustering sparse networks. Proc. Natl. Acad. Sci. USA, 110(52):20935–20940, 2013.

A. Lancichinetti and S. Fortunato. Limits of modularity maximization in community detection. Phys. Rev. E, 84(066122), 2011.

C. M. Le and R. Vershynin. Concentration and regularization of random graphs. arXiv:1506.00669, June 2015.

C. M. Le, E. Levina, and R. Vershynin. Optimization via low-rank approximation for community detection in networks. Ann. Statist., to appear, 2015+.

J. Lei and A. Rinaldo. Consistency of spectral clustering in sparse stochastic block models. Ann. Statist., 43(1):215–237, 2015.

J. Lindenstrauss and A. Pełczyński. Absolutely summing operators in L_p-spaces and their applications. Studia Mathematica, 3(29):275–326, 1968.

F. McSherry. Spectral partitioning of random graphs. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, pages 529–537, 2001.

M. E. J. Newman. Modularity and community structure in networks. PNAS, 103(23):8577–8582, 2006. doi: 10.1073/pnas.0601602103.

S. Oymak and B. Hassibi. Finding dense clusters via low rank + sparse decomposition. arXiv:1104.5186, 2011.

T. Qin and K. Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Advances in Neural Information Processing Systems, pages 3120–3128, 2013.

J. Reichardt and S. Bornholdt. Statistical mechanics of community detection. Phys. Rev. E, 74(016110), 2006.

K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Statist., 39(4):1878–1915, 2011.

T. Tao. Topics in random matrix theory. American Mathematical Society, Providence, RI, USA, 2012.

A.
L. Traud, E. D. Kelsic, P. J. Mucha, and M. A. Porter. Comparing community structure to characteristics in online collegiate social networks. SIAM Review, 53(3):526–543, 2011.

A. L. Traud, P. J. Mucha, and M. A. Porter. Social structure of Facebook networks. Physica A: Statistical Mechanics and its Applications, 391(16):4165–4180, 2012.

J. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

A. Y. Zhang and H. H. Zhou. Minimax rates of community detection in stochastic block models. arXiv:1507.05313, July 2015.

Y. Zhang, E. Levina, and J. Zhu. Detecting overlapping communities in networks using spectral methods. arXiv preprint arXiv:1412.3432, 2014.

Y. Zhao, E. Levina, and J. Zhu. Consistency of community detection in networks under degree-corrected stochastic block models. Annals of Statistics, 40(4):2266–2292, 2012.
