A Tensor Spectral Approach to Learning Mixed Membership Community Models

Anima Anandkumar¹, Rong Ge², Daniel Hsu³, and Sham M. Kakade³

¹ a.anandkumar@uci.edu, University of California, Irvine
² rongge@cs.princeton.edu, Princeton University
³ dahsu/skakade@microsoft.com, Microsoft Research, New England

October 28, 2013

Abstract

Community detection is the task of detecting hidden communities from observed interactions. Guaranteed community detection has so far been mostly limited to models with non-overlapping communities such as the stochastic block model. In this paper, we remove this restriction, and provide guaranteed community detection for a family of probabilistic network models with overlapping communities, termed as the mixed membership Dirichlet model, first introduced by Airoldi et al. [2008]. This model allows for nodes to have fractional memberships in multiple communities and assumes that the community memberships are drawn from a Dirichlet distribution. Moreover, it contains the stochastic block model as a special case. We propose a unified approach to learning these models via a tensor spectral decomposition method. Our estimator is based on low-order moment tensors of the observed network, consisting of 3-star counts. Our learning method is fast and is based on simple linear algebraic operations, e.g. singular value decomposition and tensor power iterations. We provide guaranteed recovery of community memberships and model parameters, and present a careful finite sample analysis of our learning method. As an important special case, our results match the best known scaling requirements for the (homogeneous) stochastic block model.

Keywords: Community detection, spectral methods, tensor methods, moment-based estimation, mixed membership models.

1 Introduction

Studying communities forms an integral part of social network analysis. A community generally refers to a group of individuals with shared interests (e.g. music, sports), or relationships (e.g. friends, co-workers). Community formation in social networks has been studied by many sociologists, e.g. [Moreno, 1934, Lazarsfeld et al., 1954, McPherson et al., 2001, Currarini et al., 2009], starting with the seminal work of Moreno [1934]. They posit various factors such as homophily (the tendency of individuals belonging to the same community to connect more than individuals in different communities) to be responsible for community formation. Various probabilistic and non-probabilistic network models attempt to explain community formation. In addition, they also attempt to quantify interactions and the extent of overlap between different communities, relative sizes among the communities, and various other network properties. Studying such community models is also of interest in other domains, e.g. in biological networks.

While there exists a vast literature on community models, learning these models is typically challenging, and various heuristics such as Markov Chain Monte Carlo (MCMC) or variational expectation maximization (EM) are employed in practice. Such heuristics tend to scale poorly for large networks. On the other hand, community models with guaranteed learning methods tend to be restrictive.
A popular class of probabilistic models, termed stochastic blockmodels, has been widely studied and enjoys strong theoretical learning guarantees, e.g. [White et al., 1976, Holland et al., 1983, Fienberg et al., 1985, Wang and Wong, 1987, Snijders and Nowicki, 1997, McSherry, 2001]. On the other hand, these models posit that an individual belongs to a single community, which does not hold in most real settings [Palla et al., 2005]. In this paper, we consider a class of mixed membership community models, originally introduced by Airoldi et al. [2008], and recently employed by Xing et al. [2010] and Gopalan et al. [2012]. The model has been shown to be effective in many real-world settings, but so far, no learning approach exists with provable guarantees. In this paper, we provide a novel approach for learning these mixed membership models and prove that it succeeds under a set of sufficient conditions.

The mixed membership community model of Airoldi et al. [2008] has a number of attractive properties. It retains many of the convenient properties of the stochastic block model. For instance, conditional independence of the edges is assumed, given the community memberships of the nodes in the network. At the same time, it allows for communities to overlap, and for every individual to be fractionally involved in different communities. It includes the stochastic block model as a special case (corresponding to zero overlap among the different communities). This enables us to compare our learning guarantees with existing works for stochastic block models and also study how the extent of overlap among different communities affects the learning performance.

1.1 Summary of Results

We now summarize the main contributions of this paper. We propose a novel approach for learning the mixed membership community models of Airoldi et al. [2008]. Our approach is a method of moments estimator and incorporates tensor spectral decomposition. We provide guarantees for our approach under a set of sufficient conditions. Finally, we compare our results to existing ones for the special case of the stochastic block model, where nodes belong to a single community.

Learning Mixed Membership Models: We present a tensor-based approach for learning the mixed membership stochastic block model (MMSB) proposed by Airoldi et al. [2008]. In the MMSB model, the community membership vectors are drawn from the Dirichlet distribution, denoted by Dir(α), where α is known as the Dirichlet concentration vector. Employing the Dirichlet distribution results in sparse community memberships in certain regimes of α, which is realistic. The extent of overlap between different communities under the MMSB model is controlled (roughly) via a single scalar parameter, α_0 := Σ_i α_i, where α := [α_i] is the Dirichlet concentration vector. When α_0 → 0, the mixed membership model degenerates to a stochastic block model and we have non-overlapping communities.

We propose a unified tensor-based learning method for the MMSB model and establish recovery guarantees under a set of sufficient conditions.
These conditions are in terms of the network size n, the number of communities k, the extent of community overlap (through α_0), and the average edge connectivity across various communities. Below, we present an overview of our guarantees for the special case of equal sized communities (each of size n/k) and homogeneous community connectivity: let p be the probability for any intra-community edge to occur, and q be the probability for any inter-community edge. Let Π be the community membership matrix, where Π^i denotes the i-th row, which is the vector of membership weights of the nodes for the i-th community. Let P be the community connectivity matrix such that P(i,i) = p and P(i,j) = q for i ≠ j.

Theorem 1.1 (Main Result). For an MMSB model with network size n, number of communities k, connectivity parameters p, q and community overlap parameter α_0, when (using Ω̃(·), Õ(·) to denote Ω(·), O(·) up to poly-log factors)

    n = Ω̃(k^2 (α_0 + 1)^2),    (p − q)/√p = Ω̃((α_0 + 1) k / n^{1/2}),    (1)

our estimated community membership matrix Π̂ and edge connectivity matrix P̂ satisfy, with high probability (w.h.p.),

    ε_{π,ℓ1} := (1/n) max_{i∈[k]} ‖Π̂^i − Π^i‖_1 = Õ( (α_0 + 1)^{3/2} √p / ((p − q) √n) ),    (2)

    ε_P := max_{i,j∈[k]} |P̂_{i,j} − P_{i,j}| = Õ( (α_0 + 1)^{3/2} k √p / √n ).    (3)

Further, our support estimates Ŝ satisfy, w.h.p.,

    Π(i,j) ≥ ξ ⇒ Ŝ(i,j) = 1  and  Π(i,j) ≤ ξ/2 ⇒ Ŝ(i,j) = 0,    ∀ i ∈ [k], j ∈ [n],    (4)

where Π is the true community membership matrix and the threshold is chosen as ξ = Ω(ε_P).

The complete details are in Section 4. We first provide some intuition behind the sufficient conditions in (1). We require the network size n to be large enough compared to the number of communities k, and the separation p − q to be large enough, so that the learning method can distinguish the different communities. This is natural, since a zero separation (p = q) implies that the communities are indistinguishable. Moreover, we see that the scaling requirements become more stringent as α_0 increases. This is intuitive, since it is harder to learn communities with more overlap, and we quantify this scaling. For the Dirichlet distribution, it can be shown that the number of "significant" entries is roughly O(α_0) with high probability, and in many settings of practical interest, nodes may have significant memberships in only a few communities; thus, α_0 is a constant (or growing slowly) in many instances.

In addition, we quantify the error bounds for estimating various parameters of the mixed membership model in (2) and (3). These errors decay under the sufficient conditions in (1). Lastly, we establish zero-error guarantees for support recovery in (4): our learning method correctly identifies (w.h.p.) all the significant memberships of a node and also identifies the set of communities where a node does not have a strong presence, and we quantify the threshold ξ in Theorem 1.1. Further, we present the results for a general (non-homogeneous) MMSB model in Section 4.2.
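To make the scalings in Theorem 1.1 concrete, the following small sketch evaluates the two sufficient conditions in (1) and the error rates (2)-(3) for a given parameter setting. It is purely illustrative and not part of the paper's algorithm: all absolute constants and poly-log factors, which the theorem does not fix, are set to 1, and the function name and example parameter values are ours.

```python
import math

def mmsb_scaling(n, k, p, q, alpha0):
    # Sufficient conditions (1), with constants and poly-log factors set to 1.
    size_ok = n >= (k * (alpha0 + 1)) ** 2
    sep_ok = (p - q) / math.sqrt(p) >= (alpha0 + 1) * k / math.sqrt(n)
    # Error rates (2) and (3), again up to constants and poly-log factors.
    eps_pi = (alpha0 + 1) ** 1.5 * math.sqrt(p) / ((p - q) * math.sqrt(n))
    eps_P = (alpha0 + 1) ** 1.5 * k * math.sqrt(p) / math.sqrt(n)
    return size_ok, sep_ok, eps_pi, eps_P

# e.g. n = 10**6 nodes, k = 10 communities, p = 0.1, q = 0.02, alpha_0 = 1:
# both conditions hold and the error rates are small.
print(mmsb_scaling(10**6, 10, 0.1, 0.02, 1.0))
```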
Identifiability Result for the MMSB model: A byproduct of our analysis yields novel identifiability results for the MMSB model based on low-order graph moments. We establish that the MMSB model is identifiable, given access to third order moments in the form of counts of 3-star subgraphs, i.e. star subgraphs consisting of three leaves, for each triplet of leaves, when the community connectivity matrix P is full rank. Our learning approach involves decomposition of this third order tensor. Previous identifiability results required access to high order moments and were limited to the stochastic block model setting; see Section 1.3 for details.

Implications on Learning Stochastic Block Models: Our results have implications for learning stochastic block models, which is a special case of the MMSB model with α_0 → 0. In this case, the sufficient conditions in (1) reduce to

    n = Ω̃(k^2),    (p − q)/√p = Ω̃(k / n^{1/2}).    (5)

The scaling requirements in (5) match the best known bounds (up to poly-log factors) for learning uniform stochastic block models, and were previously achieved by Chen et al. [2012] via convex optimization involving semi-definite programming (SDP). (There are many methods which achieve the best known scaling for n in (5), but have worse scaling for the separation p − q; this includes variants of the spectral clustering method, e.g. Chaudhuri et al. [2012]. See Chen et al. [2012] for a detailed comparison of learning guarantees under various methods for learning (homogeneous) stochastic block models.) In contrast, we propose an iterative non-convex approach involving tensor power iterations and linear algebraic techniques, and obtain similar guarantees.

Thus, we establish learning guarantees explicitly in terms of the extent of overlap among the different communities for general MMSB models. Many real-world networks involve sparse community memberships, and the total number of communities is typically much larger than the extent of membership of a single individual, e.g. hobbies/interests of a person, university/company networks that a person belongs to, the set of transcription factors regulating a gene, and so on. Thus, we see that in this regime of practical interest, where α_0 = Θ(1), the scaling requirements in (1) match those for the stochastic block model in (5) (up to poly-log factors) without any degradation in learning performance. Thus, we establish that learning community models with sparse community memberships is akin to learning stochastic block models, and we present a unified approach and analysis for learning these models. To the best of our knowledge, this work is the first to establish polynomial time learning guarantees for probabilistic network models with overlapping communities, and we provide a fast and iterative learning approach through linear algebraic techniques and tensor power iterations.

While the results of this paper are mostly limited to a theoretical analysis of the tensor method for learning overlapping communities, we note recent results which show that this method (with improvements and modifications) is very accurate in practice on real datasets from social networks, and is scalable to graphs with millions of nodes [Huang et al., 2013].

1.2 Overview of Techniques

We now describe the main techniques employed in our learning approach and in establishing the recovery guarantees.
Method of moments and subgraph counts: We propose an efficient learning algorithm based on low order moments, viz., counts of small subgraphs. Specifically, we employ a third-order tensor which counts the number of 3-stars in the observed network. A 3-star is a star graph with three leaves (see Figure 1), and we count the occurrences of such 3-stars across different partitions. We establish that (an adjusted) 3-star count tensor has a simple relationship with the model parameters, when the network is drawn from a mixed membership model. We propose a multi-linear transformation using edge-count matrices (also termed the process of whitening), which reduces the problem of learning mixed membership models to the canonical polyadic (CP) decomposition of an orthogonal symmetric tensor, for which a tractable decomposition exists, as described below. Note that the decomposition of a general tensor into its rank-one components is referred to as its CP decomposition [Kolda and Bader, 2009] and is in general NP-hard [Hillar and Lim, 2012]. However, the decomposition is tractable in the special case of an orthogonal symmetric tensor considered here.

Tensor spectral decomposition via power iterations: Our tensor decomposition method is based on the popular power iterations (e.g. see Anandkumar et al. [2012a]). It is a simple iterative method to compute the stable eigen-pairs of a tensor. In this paper, we propose various modifications to the basic power method to strengthen the recovery guarantees under perturbations. For instance, we introduce adaptive deflation techniques (which involve subtracting out the eigen-pairs previously estimated). Moreover, we initialize the tensor power method with (whitened) neighborhood vectors from the observed network, as opposed to random initialization. In the regime where the community overlaps are small, this leads to improved performance. Additionally, we incorporate thresholding as a post-processing operation, which again leads to improved guarantees for sparse community memberships, i.e., when the overlap among different communities is small. We theoretically establish that all these modifications lead to improvement in performance guarantees, and we discuss comparisons with the basic power method in Section 4.4.

Sample analysis: We establish that our learning approach correctly recovers the model parameters and the community memberships of all nodes under exact moments. We then carry out a careful analysis of the empirical graph moments, computed using the network observations. We establish tensor concentration bounds and also control the perturbation of the various quantities used by our learning algorithm via the matrix Bernstein inequality [Tropp, 2012, thm. 1.4] and other inequalities. We impose the scaling requirements in (1) for the various concentration bounds to hold.

1.3 Related Work

There is extensive work on modeling communities and various algorithms and heuristics for discovering them. We mostly limit our focus to works with theoretical guarantees.

Method of moments: The method of moments approach dates back to Pearson [1894] and has been applied for learning various community models. Here, the moments correspond to counts of various subgraphs in the network.
They typically consist of aggregate quantities, e.g., the number of star subgraphs, triangles etc. in the network. For instance, Bickel et al. [2011] analyze the moments of a stochastic block model and establish that the subgraph counts of certain structures, termed "wheels" (a family of trees), are sufficient for identifiability under some natural non-degeneracy conditions. In contrast, we establish that moments up to third order (corresponding to edge and 3-star counts) are sufficient for identifiability of the stochastic block model, and also more generally, for the mixed membership Dirichlet model. We employ subgraph count tensors, corresponding to the number of subgraphs (such as stars) over a set of labeled vertices, while the work of Bickel et al. [2011] considers only aggregate (i.e. scalar) counts. Considering tensor moments allows us to use simple subgraphs (edges and 3-stars) corresponding to low order moments, rather than more complicated graphs (e.g. the wheels considered by Bickel et al. [2011]) with a larger number of nodes, for learning the community model.

The method of moments is also relevant for the family of random graph models termed exponential random graph models [Holland and Leinhardt, 1981, Frank and Strauss, 1986]. Subgraph counts of fixed graphs such as stars and triangles serve as sufficient statistics for these models. However, parameter estimation given the subgraph counts is in general NP-hard, due to the normalization constant in the likelihood (the partition function), and the model suffers from degeneracy issues; see Rinaldo et al. [2009], Chatterjee and Diaconis [2011] for a detailed discussion. In contrast, we establish in this paper that the mixed membership model is amenable to simple estimation methods through linear algebraic operations and tensor power iterations using subgraph counts of 3-stars.

Stochastic block models: Many algorithms provide learning guarantees for stochastic block models; for a detailed comparison of these methods, see the recent work by Chen et al. [2012]. A popular method is based on spectral clustering [McSherry, 2001], where community memberships are inferred through projection onto the spectrum of the Laplacian matrix (or its variants). This method is fast and easy to implement (via singular value decomposition). There are many variants of this method, e.g. the work of Chaudhuri et al. [2012] employs the normalized Laplacian matrix to handle degree heterogeneities. In contrast, the work of Chen et al. [2012] uses convex optimization techniques via semi-definite programming to learn block models.

Non-probabilistic approaches: The classical approach to community detection tries to directly exploit the properties of the graph to define communities, without assuming a probabilistic model. Girvan and Newman [2002] use betweenness to remove edges until only communities are left. However, Bickel and Chen [2009] show that these algorithms are (asymptotically) biased and that using modularity scores can lead to the discovery of an incorrect community structure, even for large graphs.
Jalali et al. [2011] define community structure as the structure that satisfies the maximum number of edge constraints (whether two individuals like/dislike each other). However, these models assume that every individual belongs to a single community.

Recently, some non-probabilistic approaches have been introduced with overlapping community models by Arora et al. [2012] and Balcan et al. [2012]. The analysis of Arora et al. [2012] is mostly limited to dense graphs (i.e. Θ(n^2) edges for an n node graph), while our analysis provides learning guarantees for much sparser graphs (as seen by the scaling requirements in (1)). Moreover, the running time of the method of Arora et al. [2012] is quasi-polynomial (i.e. O(n^{log n})) for the general case, and it is based on a combinatorial learning approach. In contrast, our learning approach is based on simple linear algebraic techniques, and the running time is a low-order polynomial (roughly O(n^2 k) for an n node network with k communities under a serial computation model, and O(n + k^3) under a parallel computation model). The work of Balcan et al. [2012] assumes endogenously formed communities, obtained by constraining the fraction of edges within a community compared to the outside. They provide a polynomial time algorithm for finding all such "self-determined" communities; the running time is n^{O(log 1/α)}/α, where α is the fraction of edges within a self-determined community, and this bound improves to linear time when α > 1/2. On the other hand, the running time of our algorithm is mostly independent of the parameters of the assumed model (and is roughly O(n^2 k)). Moreover, both these works are limited to homophilic models, where there are more edges within each community than between any two different communities. However, our learning approach is not limited to this setting and also does not assume homogeneity in edge connectivity across different communities (but it instead makes probabilistic assumptions on community formation). In addition, we provide improved guarantees for homophilic models by considering additional post-processing steps in our algorithm. Recently, Abraham et al. [2012] provide an algorithm for approximating the parameters of a Euclidean log-linear model in polynomial time. However, their setting is considerably different from the one in this paper.

Inhomogeneous random graphs, graph limits and the weak regularity lemma: Inhomogeneous random graphs have been analyzed in a variety of settings (e.g., Bollobás et al. [2007], Lovász [2009]) and are generalizations of the stochastic block model. Here, the probability of an edge between any two nodes is characterized by a general function (rather than by a k × k matrix as in the stochastic block model with k blocks). Note that the mixed membership model considered in this work is a special instance of this general framework. These models arise as the limits of convergent (dense) graph sequences, and for this reason, the functions are also termed "graphons" or graph limits [Lovász, 2009]. A deep result in this context is the regularity lemma and its variants. The weak regularity lemma proposed by Frieze and Kannan [1999] showed that any convergent dense graph can be approximated by a stochastic block model.
Moreover, they propose an algorithm to learn such a block model based on the so-called d_2 distance. The d_2 distance between two nodes measures similarity with respect to their "two-hop" neighbors, and the block model is obtained by thresholding the d_2 distances. However, the method is limited to learning block models and not overlapping communities.

Learning Latent Variable Models (Topic Models): The community models considered in this paper are closely related to probabilistic topic models [Blei, 2012], employed for text modeling and document categorization. Topic models posit the occurrence of words in a corpus of documents through the presence of multiple latent topics in each document. Latent Dirichlet allocation (LDA) is perhaps the most popular topic model, where the topic mixtures are assumed to be drawn from the Dirichlet distribution. In each document, a topic mixture is drawn from the Dirichlet distribution, and the words are drawn in a conditionally independent manner, given the topic mixture. The mixed membership community model considered in this paper can be interpreted as a generalization of the LDA model, where a node in the community model can function both as a document and as a word. For instance, in the directed community model, when the outgoing links of a node are considered, the node functions as a document, and its outgoing neighbors can be interpreted as the words occurring in that document. Similarly, when the incoming links of a node in the network are considered, the node can be interpreted as a word, and its incoming links as documents containing that particular word. In particular, we establish that certain graph moments under the mixed membership model have a similar structure as the observed word moments under the LDA model. This allows us to leverage the recent developments of Anandkumar et al. [2012c,a,b] for learning topic models, based on the method of moments. These works establish guaranteed learning using second- and third-order observed moments through linear algebraic and tensor-based techniques. In particular, in this paper, we exploit the tensor power iteration method of Anandkumar et al. [2012b], and propose additional improvements to obtain stronger recovery guarantees. Moreover, the sample analysis is quite different (and more challenging) in the community setting, compared to the topic models analyzed in Anandkumar et al. [2012c,a,b]. We clearly spell out the similarities and differences between the community model and other latent variable models in Section 4.4.

Lower Bounds: The work of Feldman et al. [2012] provides lower bounds on the complexity of statistical algorithms, and shows that for cliques of size O(n^{1/2−δ}), for any constant δ > 0, at least n^{Ω(log log n)} queries are needed to find the cliques. There are works relating the hardness of finding hidden cliques to the use of higher order moment tensors for this purpose. Frieze and Kannan [2008] relate the problem of finding a hidden clique to finding the top eigenvector of the third order tensor, corresponding to the maximum spectral norm.
Brubaker and Vempala [2009] extend the result to arbitrary r-th order tensors; the cliques have to be of size Ω(n^{1/r}) to enable recovery from r-th order moment tensors in an n node network. However, this problem (finding the top eigenvector of a tensor) is known to be NP-hard in general [Hillar and Lim, 2012]. Thus, tensors are useful for finding smaller hidden cliques in networks (albeit by solving a computationally hard problem). In contrast, we consider tractable tensor decomposition through reduction to orthogonal tensors (under the scaling requirements of (1)), and our learning method is a fast and iterative approach based on tensor power iterations and linear algebraic operations. Mossel et al. [2012] provide lower bounds on the separation p − q, the edge connectivity between intra-community and inter-community edges, for identifiability of communities in stochastic block models in the sparse regime (when p, q ∼ n^{−1}), when the number of communities is a constant k = O(1). Our method achieves the lower bounds on the separation of edge connectivity up to poly-log factors.

Likelihood-based Approaches to Learning MMSB: Another class of approaches for learning MMSB models is based on optimizing the observed likelihood. Traditional approaches such as Gibbs sampling or expectation maximization (EM) can be too expensive to apply in practice for MMSB models. Variational approaches which optimize the so-called evidence lower bound [Hoffman et al., 2012, Gopalan et al., 2012], which is a lower bound on the marginal likelihood of the observed data (typically obtained by applying a mean-field approximation), are efficient for practical implementation. Stochastic versions of the variational approach provide even further gains in efficiency and are state-of-the-art practical learning methods for MMSB models [Gopalan et al., 2012]. However, these methods lack theoretical guarantees; since they optimize a bound on the likelihood, they are not guaranteed to recover the underlying communities consistently. A recent work [Celisse et al., 2012] establishes consistency of maximum likelihood and variational estimators for stochastic block models, which are special cases of the MMSB model. However, it is not known if the results extend to general MMSB models. Moreover, the framework of Celisse et al. [2012] assumes a fixed number of communities and a growing network size, and provides only asymptotic consistency guarantees. Thus, they do not allow for high-dimensional settings, where the parameters of the learning problem also grow as the observed dimensionality grows. In contrast, in this paper, we allow for the number of communities to grow, and provide precise constraints on the scaling bounds for consistent estimation under finite samples. It is an open problem to obtain such bounds for maximum likelihood and variational estimators. On the practical side, a recent work by Huang et al. [2013], deploying the tensor approach proposed in this paper, shows that the tensor approach is more than an order of magnitude faster in recovering the communities than the variational approach, is scalable to networks with millions of nodes, and also has better accuracy in recovering the communities.
2 Community Models and Graph Moments

2.1 Community Membership Models

In this section, we describe the mixed membership community model based on Dirichlet priors for the community draws by the individuals. We first introduce the special case of the popular stochastic block model, where each node belongs to a single community.

Notation: We consider networks with n nodes and let [n] := {1, 2, ..., n}. Let G be the {0,1} adjacency matrix for the random network (our analysis can easily be extended to weighted adjacency matrices with bounded entries), and let G_{A,B} be the submatrix of G corresponding to rows A ⊆ [n] and columns B ⊆ [n]. We consider models with k underlying (hidden) communities. For node i, let π_i ∈ R^k denote its community membership vector, i.e., the vector is supported on the communities to which the node belongs. In the special case of the popular stochastic block model described below, π_i is a coordinate basis vector, while the more general mixed membership model relaxes this assumption, and a node can be in multiple communities with fractional memberships. Define Π := [π_1 | π_2 | ··· | π_n] ∈ R^{k×n}, and let Π_A := [π_i : i ∈ A] ∈ R^{k×|A|} denote the set of column vectors restricted to A ⊆ [n]. For a matrix M, let (M)_i and (M)^i denote its i-th column and row respectively. For a matrix M with singular value decomposition (SVD) M = U D V^⊤, let (M)_{k-svd} := U D̃ V^⊤ denote the k-rank SVD of M, where D̃ is limited to the top-k singular values of M. Let M^† denote the Moore-Penrose pseudo-inverse of M. Let I(·) be the indicator function. Let Diag(v) denote a diagonal matrix with diagonal entries given by a vector v. We use the term high probability to mean with probability 1 − n^{−c} for any constant c > 0.

Stochastic block model (special case): In this model, each individual is independently assigned to a single community, chosen at random: each node i chooses community j independently with probability α̂_j, for i ∈ [n], j ∈ [k], and we assign π_i = e_j in this case, where e_j ∈ {0,1}^k is the j-th coordinate basis vector. Given the community assignments Π, every directed edge in the network is independently drawn (we limit our discussion to directed networks in this paper, but note that the results also hold for undirected community models, where P is a symmetric matrix and an edge (u,v) is formed with probability π_u^⊤ P π_v = π_v^⊤ P π_u): if node u is in community i and node v is in community j (and u ≠ v), then the probability of having the edge (u,v) in the network is P_{i,j}. Here, P ∈ [0,1]^{k×k}, and we refer to it as the community connectivity matrix. This implies that given the community membership vectors π_u and π_v, the probability of an edge from u to v is π_u^⊤ P π_v (since when π_u = e_i and π_v = e_j, we have π_u^⊤ P π_v = P_{i,j}). The stochastic block model has been extensively studied and can be learnt efficiently through various methods, e.g. spectral clustering [McSherry, 2001], convex optimization [Chen et al., 2012], and so on. Many of these methods rely on conditional independence assumptions of the edges in the block model for guaranteed learning.
Mixed membership model: We now consider the extension of the stochastic block model which allows for an individual to belong to multiple communities and yet preserves some of the convenient independence assumptions of the block model. In this model, the community membership vector π_u at node u is a probability vector, i.e., Σ_{i∈[k]} π_u(i) = 1, for all u ∈ [n]. Given the community membership vectors, the generation of the edges is identical to the block model: given vectors π_u and π_v, the probability of an edge from u to v is π_u^⊤ P π_v, and the edges are independently drawn. This formulation allows for the nodes to be in multiple communities, and at the same time, preserves the conditional independence of the edges, given the community memberships of the nodes.

Dirichlet prior for community membership: The only aspect left to be specified for the mixed membership model is the distribution from which the community membership vectors Π are drawn. We consider the popular setting of Airoldi et al. [2008], where the community vectors {π_u} are i.i.d. draws from the Dirichlet distribution, denoted by Dir(α), with parameter vector α ∈ R^k_{>0}. The probability density function of the Dirichlet distribution is given by

    P[π] = (Γ(α_0) / Π_{i=1}^k Γ(α_i)) Π_{i=1}^k π_i^{α_i − 1},    π ∼ Dir(α),    α_0 := Σ_i α_i,    (6)

where Γ(·) is the Gamma function and the ratio of Gamma functions serves as the normalization constant. The Dirichlet distribution is widely employed for specifying priors in Bayesian statistics, e.g. latent Dirichlet allocation [Blei et al., 2003]. The Dirichlet distribution is the conjugate prior of the multinomial distribution, which makes it attractive for Bayesian inference.

Let α̂ denote the normalized parameter vector α/α_0, where α_0 := Σ_i α_i. In particular, note that α̂ is a probability vector: Σ_i α̂_i = 1. Intuitively, α̂ denotes the relative expected sizes of the communities (since E[n^{−1} Σ_{u∈[n]} π_u(i)] = α̂_i). Let α̂_max be the largest entry of α̂, and α̂_min the smallest. Our learning guarantees will depend on these parameters.

The stochastic block model is a limiting case of the mixed membership model when the Dirichlet parameter is α = α_0 · α̂, where the probability vector α̂ is held fixed and α_0 → 0. In the other extreme, when α_0 → ∞, the Dirichlet distribution becomes peaked around a single point; for instance, if α_i ≡ c and c → ∞, the Dirichlet distribution is peaked at k^{−1} · 1⃗, where 1⃗ is the all-ones vector. Thus, the parameter α_0 serves as a measure of the average sparsity of the Dirichlet draws or, equivalently, of how concentrated the Dirichlet measure is along the different coordinates. This, in effect, controls the extent of overlap among different communities.

Sparse regime of the Dirichlet distribution: When the Dirichlet parameter vector satisfies α_i < 1, for all i ∈ [k], the Dirichlet distribution Dir(α) generates "sparse" vectors with high probability (roughly, the number of entries in π exceeding a threshold τ is at most O(α_0 log(1/τ)) with high probability, when π ∼ Dir(α)); see Telgarsky [2012]. In the extreme case of the block model where α_0 → 0, it generates 1-sparse vectors. (The assumption that the Dirichlet distribution be in the sparse regime is not strictly needed. Our results can be extended to general Dirichlet distributions, but with worse scaling requirements on the network size n for guaranteed learning.)
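As a concrete illustration of the generative process just described, the following minimal sketch samples a directed MMSB network: i.i.d. Dirichlet memberships followed by independent Bernoulli edges with parameters π_u^⊤ P π_v. All parameter values (k, n, p, q, α_0) are hypothetical choices for illustration, not values prescribed by the paper; with a small α_0 the drawn memberships are nearly 1-sparse, matching the block-model limit.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mmsb(n, P, alpha):
    # Memberships: rows of Pi are i.i.d. Dir(alpha) draws (one per node).
    Pi = rng.dirichlet(alpha, size=n)
    # Edge probabilities: entry (u, v) is pi_u' P pi_v; edges are independent
    # Bernoulli draws given the memberships, as in the model description.
    edge_prob = Pi @ P @ Pi.T
    G = (rng.random((n, n)) < edge_prob).astype(np.int8)
    np.fill_diagonal(G, 0)  # no self-loops
    return G, Pi.T          # Pi.T matches the paper's k x n matrix Π

# Hypothetical parameters: homogeneous connectivity P(i,i) = p, P(i,j) = q,
# and a small alpha_0, i.e. nearly non-overlapping communities.
k, n, p, q, alpha0 = 3, 500, 0.3, 0.05, 0.1
P = q * np.ones((k, k)) + (p - q) * np.eye(k)
G, Pi = sample_mmsb(n, P, alpha0 * np.ones(k) / k)
```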
Many real-world settings involve sparse community memberships, and the total number of communities is typically much larger than the extent of membership of a single individual, e.g. hobbies/interests of a person, university/company networks that a person belongs to, the set of transcription factors regulating a gene, and so on. Our learning guarantees are limited to the sparse regime of the Dirichlet model.

[Figure 1: Our moment-based learning algorithm uses the 3-star count tensor from set X to sets A, B, C (and the roles of the sets are interchanged to get various estimates). Specifically, T is a third order tensor, where T(u, v, w) is the normalized count of the 3-stars with u, v, w as leaves, over all x ∈ X.]

2.2 Graph Moments Under Mixed Membership Models

Our approach for learning a mixed membership community model relies on the form of the graph moments under the mixed membership model. (We interchangeably use the term first order moments for edge counts and third order moments for 3-star counts.) We now describe the specific graph moments used by our learning algorithm (based on 3-star and edge counts) and provide explicit forms for the moments, assuming draws from a mixed membership model.

Notation: Recall that G denotes the adjacency matrix and that G_{X,A} denotes the submatrix corresponding to edges going from X to A. Recall that P ∈ [0,1]^{k×k} denotes the community connectivity matrix. Define

    F := Π^⊤ P^⊤ = [π_1 | π_2 | ··· | π_n]^⊤ P^⊤.    (7)

For a subset A ⊆ [n] of individuals, let F_A ∈ R^{|A|×k} denote the submatrix of F corresponding to nodes in A, i.e., F_A := Π_A^⊤ P^⊤. We will subsequently show that F_A is a linear map which takes any community vector π_i as input and outputs the corresponding neighborhood vector G_{i,A}^⊤ in expectation.

Our learning algorithm uses moments up to the third order, represented as a tensor. A third-order tensor T is a three-dimensional array whose (p,q,r)-th entry is denoted by T_{p,q,r}. The symbol ⊗ denotes the standard Kronecker product: if u, v, w are three vectors, then

    (u ⊗ v ⊗ w)_{p,q,r} := u_p · v_q · w_r.    (8)

A tensor of the form u ⊗ v ⊗ w is referred to as a rank-one tensor. The decomposition of a general tensor into a sum of its rank-one components is referred to as the canonical polyadic (CP) decomposition [Kolda and Bader, 2009]. We will subsequently see that the graph moments can be expressed as a tensor, and that the CP decomposition of the graph-moment tensor yields the model parameters and the community vectors under the mixed membership community model.

2.2.1 Graph Moments under Stochastic Block Model

We first analyze the graph moments in the special case of a stochastic block model (i.e., α_0 = Σ_i α_i → 0 in the Dirichlet prior in (6)) and then extend the analysis to the general mixed membership model. We provide explicit expressions for the graph moments corresponding to edge counts and 3-star counts.
We later establish in Section 3 that these moments are sufficient to learn the community memberships of the nodes and the model parameters of the block model.

3-star counts: The primary quantity of interest is a third-order tensor which counts the number of 3-stars. A 3-star is a star graph with three leaves {a, b, c}, and we refer to the internal node x of the star as its "head", denoting the structure by x → {a, b, c} (see Figure 1). We partition the network into four parts and consider 3-stars such that each node in the 3-star belongs to a different partition. (For the sample complexity analysis, we require dividing the graph into more than four partitions to deal with statistical dependency issues; we outline this in Section 3.) This is necessary to obtain a simple form of the moments, based on the conditional independence assumptions of the block model; see Proposition 2.1. Specifically, consider a partition A, B, C, X of the network. (To establish our theoretical guarantees, we assume that the partitions A, B, C, X are randomly chosen and are of size Θ(n).) We count the number of 3-stars from X to A, B, C, and our quantity of interest is

    T_{X→{A,B,C}} := (1/|X|) Σ_{i∈X} [G_{i,A}^⊤ ⊗ G_{i,B}^⊤ ⊗ G_{i,C}^⊤],    (9)

where ⊗ is the Kronecker product defined in (8) and G_{i,A} is the row vector supported on the set of neighbors of i belonging to set A. T ∈ R^{|A|×|B|×|C|} is a third order tensor, and an element of the tensor is given by

    T_{X→{A,B,C}}(a, b, c) = (1/|X|) Σ_{x∈X} G(x,a) G(x,b) G(x,c),    ∀ a ∈ A, b ∈ B, c ∈ C,    (10)

which is the normalized count of the number of 3-stars with leaves a, b, c such that the "head" is in set X.
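The empirical tensor in (9)-(10) is straightforward to compute from adjacency submatrices. The sketch below does so with a single einsum, reusing the network G, the generator rng, and n from the sampling sketch above; the function name is ours. It materializes the full |A| × |B| × |C| array, which is fine for illustration but not for large networks.

```python
def three_star_tensor(G, X, A, B, C):
    # T(a, b, c) = (1/|X|) * sum_{x in X} G(x,a) G(x,b) G(x,c), as in (10).
    GA = G[np.ix_(X, A)].astype(float)
    GB = G[np.ix_(X, B)].astype(float)
    GC = G[np.ix_(X, C)].astype(float)
    return np.einsum('xa,xb,xc->abc', GA, GB, GC) / len(X)

# A random partition of [n] into four Θ(n)-sized sets, as the analysis assumes.
A, B, C, X = np.array_split(rng.permutation(n), 4)
T = three_star_tensor(G, X, A, B, C)  # shape (|A|, |B|, |C|)
```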
We now relate the tensor T_{X→{A,B,C}} to the parameters of the stochastic block model, viz., the community connectivity matrix P and the community probability vector α̂, where α̂_i is the probability of choosing community i.

Proposition 2.1 (Moments in Stochastic Block Model). Given partitions A, B, C, X, and F := Π^⊤ P^⊤, where P is the community connectivity matrix and Π is the matrix of community membership vectors, we have

    E[G_{X,A}^⊤ | Π_A, Π_X] = F_A Π_X,    (11)

    E[T_{X→{A,B,C}} | Π_A, Π_B, Π_C] = Σ_{i∈[k]} α̂_i (F_A)_i ⊗ (F_B)_i ⊗ (F_C)_i,    (12)

where α̂_i is the probability for a node to select community i.

Remark 1 (Linear model): In Equation (11), we see that the edge generation occurs under a linear model, and more precisely, the matrix F_A ∈ R^{|A|×k} is a linear map which takes a community vector π_i ∈ R^k to a neighborhood vector G_{i,A}^⊤ ∈ R^{|A|} in expectation.

Remark 2 (Identifiability under third order moments): Note the form of the 3-star count tensor T in (12). It provides a CP decomposition of T, since each term in the summation, viz., α̂_i (F_A)_i ⊗ (F_B)_i ⊗ (F_C)_i, is a rank one tensor. Thus, we can learn the matrices F_A, F_B, F_C and the vector α̂ through a CP decomposition of the tensor T. Once these parameters are learnt, learning the communities is straightforward under exact moments: by exploiting (11), we find Π_X as

    Π_X = F_A^† · E[G_{X,A}^⊤ | Π_A, Π_X].

Similarly, we can consider another tensor consisting of 3-stars from A to X, B, C, and obtain the matrices F_X, F_B and F_C through a CP decomposition, and so on. Once we obtain the matrices F and Π for the entire set of nodes in this manner, we can obtain the community connectivity matrix P, since F := Π^⊤ P^⊤. Thus, in principle, we are able to learn all the model parameters (α̂ and P) and the community membership matrix Π under the stochastic block model, given exact moments. This establishes identifiability of the model given moments up to third order and forms a high-level approach for learning the communities. When only samples are available, we establish that the empirical versions are close to the exact moments considered above, and we modify the basic learning approach to obtain robust guarantees. See Section 3 for details.

Remark 3 (Significance of conditional independence relationships): The main property exploited in proving the tensor form in (12) is the conditional-independence assumption under the stochastic block model: the realization of the edges in each 3-star, say in x → {a, b, c}, is conditionally independent given the community membership vector π_x, when x ≠ a ≠ b ≠ c. This is because the community membership vectors Π are assumed to be drawn independently at the different nodes, and the edges are drawn independently given the community vectors. Considering 3-stars from X to A, B, C, where X, A, B, C form a partition, ensures that this conditional independence is satisfied for all the 3-stars in the tensor T.

Proof: Recall that the probability of an edge from u to v given π_u, π_v is

    E[G_{u,v} | π_u, π_v] = π_u^⊤ P π_v = π_v^⊤ P^⊤ π_u = F_v π_u,

and E[G_{X,A} | Π_A, Π_X] = Π_X^⊤ P Π_A = Π_X^⊤ F_A^⊤, and thus (11) holds. For the tensor form, first consider an element of the tensor, with a ∈ A, b ∈ B, c ∈ C:

    E[T_{X→{A,B,C}}(a, b, c) | π_a, π_b, π_c, π_x] = (1/|X|) Σ_{x∈X} F_a π_x · F_b π_x · F_c π_x.

The equation follows from the conditional-independence assumption on the edges (assuming a ≠ b ≠ c). Now, taking expectation over the nodes in X, we have

    E[T_{X→{A,B,C}}(a, b, c) | π_a, π_b, π_c] = (1/|X|) Σ_{x∈X} E[F_a π_x · F_b π_x · F_c π_x | π_a, π_b, π_c]
        = E[F_a π · F_b π · F_c π | π_a, π_b, π_c]
        = Σ_{j∈[k]} α̂_j (F_a)_j · (F_b)_j · (F_c)_j,

where the last step follows from the fact that π = e_j with probability α̂_j, and the result holds when x ≠ a, b, c. Recall that (F_a)_j denotes the j-th column of F_a (since F_a e_j = (F_a)_j). Collecting all the elements of the tensor, we obtain the desired result. □

2.2.2 Graph Moments under Mixed Membership Dirichlet Model

We now analyze the graph moments for the general mixed membership Dirichlet model. Instead of the raw moments (i.e. edge and 3-star counts), we consider modified moments to obtain similar expressions as in the case of the stochastic block model. Let μ_{X→A} ∈ R^{|A|} denote the vector which gives the normalized count of edges from X to A:

    μ_{X→A} := (1/|X|) Σ_{i∈X} [G_{i,A}^⊤].    (13)

We now define a modified adjacency matrix G^{α_0}_{X,A} as

    G^{α_0}_{X,A} := √(α_0 + 1) G_{X,A} − (√(α_0 + 1) − 1) 1⃗ μ_{X→A}^⊤.    (14)

(To compute the modified moments G^{α_0} and T^{α_0}, we need to know the value of the scalar α_0 := Σ_i α_i, which is the concentration parameter of the Dirichlet distribution and is a measure of the extent of overlap between the communities. We assume its knowledge here.) In the special case of the stochastic block model (α_0 → 0), G^{α_0}_{X,A} = G_{X,A} is the submatrix of the adjacency matrix G.
Similarly, we define modified third-order statistics,

    T^{α_0}_{X→{A,B,C}} := (α_0 + 1)(α_0 + 2) T_{X→{A,B,C}} + 2 α_0^2 μ_{X→A} ⊗ μ_{X→B} ⊗ μ_{X→C}
        − (α_0 (α_0 + 1) / |X|) Σ_{i∈X} [ G_{i,A}^⊤ ⊗ G_{i,B}^⊤ ⊗ μ_{X→C} + G_{i,A}^⊤ ⊗ μ_{X→B} ⊗ G_{i,C}^⊤ + μ_{X→A} ⊗ G_{i,B}^⊤ ⊗ G_{i,C}^⊤ ],    (15)

which reduces to (a scaled version of) the 3-star count tensor T_{X→{A,B,C}} defined in (9) for the stochastic block model (α_0 → 0). The modified adjacency matrix and the modified 3-star count tensor can be viewed as a form of "centering" of the raw moments, which simplifies the expressions for the moments.
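The centered statistics (14) and (15) are again simple array operations. Here is a minimal sketch in the same running example, assuming, as the paper does, that the scalar α_0 is known; the helper names are ours.

```python
def modified_adjacency(G, X, A, alpha0):
    # Centered adjacency submatrix G^{alpha0}_{X,A} of equation (14).
    GA = G[np.ix_(X, A)].astype(float)
    mu = GA.mean(axis=0)                    # mu_{X->A}, equation (13)
    s = np.sqrt(alpha0 + 1.0)
    return s * GA - (s - 1.0) * np.outer(np.ones(len(X)), mu)

def modified_tensor(G, X, A, B, C, alpha0):
    # Centered 3-star tensor T^{alpha0}_{X->{A,B,C}} of equation (15); it
    # reduces to (a scaled version of) the raw tensor (9) as alpha0 -> 0.
    GA = G[np.ix_(X, A)].astype(float); muA = GA.mean(axis=0)
    GB = G[np.ix_(X, B)].astype(float); muB = GB.mean(axis=0)
    GC = G[np.ix_(X, C)].astype(float); muC = GC.mean(axis=0)
    T = np.einsum('xa,xb,xc->abc', GA, GB, GC) / len(X)
    cross = (np.einsum('xa,xb,c->abc', GA, GB, muC)
             + np.einsum('xa,b,xc->abc', GA, muB, GC)
             + np.einsum('a,xb,xc->abc', muA, GB, GC)) / len(X)
    return ((alpha0 + 1) * (alpha0 + 2) * T
            + 2 * alpha0**2 * np.einsum('a,b,c->abc', muA, muB, muC)
            - alpha0 * (alpha0 + 1) * cross)
```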
The following relationships hold between the modified graph moments G^{α_0}_{X,A}, T^{α_0} and the model parameters P and α̂ of the mixed membership model.

Proposition 2.2 (Moments in Mixed Membership Model). Given partitions A, B, C, X and G^{α_0}_{X,A} and T^{α_0} as in (14) and (15), the normalized Dirichlet concentration vector α̂, and F := Π^⊤ P^⊤, where P is the community connectivity matrix and Π is the matrix of community memberships, we have

    E[(G^{α_0}_{X,A})^⊤ | Π_A, Π_X] = F_A Diag(α̂^{1/2}) Ψ_X,    (16)

    E[T^{α_0}_{X→{A,B,C}} | Π_A, Π_B, Π_C] = Σ_{i=1}^k α̂_i (F_A)_i ⊗ (F_B)_i ⊗ (F_C)_i,    (17)

where (F_A)_i corresponds to the i-th column of F_A, and Ψ_X relates to the community membership matrix Π_X as

    Ψ_X := Diag(α̂^{−1/2}) ( √(α_0 + 1) Π_X − (√(α_0 + 1) − 1) ((1/|X|) Σ_{i∈X} π_i) 1⃗^⊤ ).

Moreover, we have that

    |X|^{−1} E_{Π_X}[Ψ_X Ψ_X^⊤] = I.    (18)

Remark 1: The 3-star count tensor T^{α_0} is carefully chosen so that the CP decomposition of the tensor directly yields the matrices F_A, F_B, F_C and α̂, as in the case of the stochastic block model. Similarly, the modified adjacency matrix (G^{α_0}_{X,A})^⊤ is carefully chosen to eliminate the second-order correlation in the Dirichlet distribution, and we have that |X|^{−1} E_{Π_X}[Ψ_X Ψ_X^⊤] = I is the identity matrix. These properties will be exploited by our learning algorithm in Section 3.

Remark 2: Recall that α_0 quantifies the extent of overlap among the communities. The computation of the modified moment T^{α_0} requires the knowledge of α_0, which is assumed to be known. Since this is a scalar quantity, in practice, we can easily tune this parameter via cross validation.

Proof: The proof is along the lines of Proposition 2.1 for stochastic block models (α_0 → 0), but more involved due to the form of the Dirichlet moments. Recall that E[G_{i,A}^⊤ | π_i, Π_A] = F_A π_i for a mixed membership model, and μ_{X→A} := (1/|X|) Σ_{i∈X} G_{i,A}^⊤; therefore E[μ_{X→A} | Π_A, Π_X] = F_A ((1/|X|) Σ_{i∈X} π_i). Equation (16) follows directly. For Equation (18), we note the Dirichlet moment E[ππ^⊤] = (1/(α_0 + 1)) Diag(α̂) + (α_0/(α_0 + 1)) α̂ α̂^⊤, when π ∼ Dir(α), and

    |X|^{−1} E[Ψ_X Ψ_X^⊤] = Diag(α̂^{−1/2}) [ (α_0 + 1) E[ππ^⊤] + ( −2√(α_0+1)(√(α_0+1) − 1) + (√(α_0+1) − 1)^2 ) E[π] E[π]^⊤ ] Diag(α̂^{−1/2})
        = Diag(α̂^{−1/2}) [ Diag(α̂) + α_0 α̂ α̂^⊤ + (−α_0) α̂ α̂^⊤ ] Diag(α̂^{−1/2}) = I.

Along the lines of the proof of Proposition 2.1 for the block model, the expectation in (17) involves a multi-linear map of the expectation of the tensor product π ⊗ π ⊗ π, among other terms. Collecting these terms, we have that

    (α_0 + 1)(α_0 + 2) E[π ⊗ π ⊗ π] − α_0 (α_0 + 1) ( E[π ⊗ π ⊗ E[π]] + E[π ⊗ E[π] ⊗ π] + E[E[π] ⊗ π ⊗ π] ) + 2 α_0^2 E[π] ⊗ E[π] ⊗ E[π]

is a diagonal tensor, in the sense that its (p,p,p)-th entry is α̂_p, and its (p,q,r)-th entry is 0 when p, q, r are not all equal. With this, we have (17). □

Note the nearly identical forms of the graph moments for the stochastic block model in (11), (12) and for the general mixed membership model in (16), (17). In other words, the modified moments G^{α_0}_{X,A} and T^{α_0} have similar relationships to the underlying parameters as the raw moments in the case of the stochastic block model. This enables us to use a unified learning approach for the two models, outlined in the next section.

3 Algorithm for Learning Mixed Membership Models

The simple form of the graph moments derived in the previous section is now utilized to recover the community vectors Π and the model parameters P, α̂ of the mixed membership model. The method is based on the so-called tensor power method, used to obtain a tensor decomposition. We first outline the basic tensor decomposition method below and then demonstrate how the method can be adapted to learning using the graph moments at hand. We first analyze the simpler case when exact moments are available in Section 3.2, and then extend the method to handle empirical moments computed from the network observations in Section 3.3.

3.1 Overview of Tensor Decomposition Through Power Iterations

In this section, we review the basic method for tensor decomposition based on power iterations for a special class of tensors, viz., symmetric orthogonal tensors. Subsequently, in Sections 3.2 and 3.3, we modify this method to learn the mixed membership model from the graph moments described in the previous section. For details on the tensor power method, refer to Anandkumar et al. [2012a], Kolda and Mayo [2011].

Recall that a third-order tensor T is a three-dimensional array, and we use T_{p,q,r} to denote its (p,q,r)-th entry. The standard symbol ⊗ is used to denote the Kronecker product, and (u ⊗ v ⊗ w) is a rank one tensor. The decomposition of a tensor into its rank one components is called the CP decomposition.

Multi-linear maps: We can view a tensor T ∈ R^{d×d×d} as a multilinear map in the following sense: for a set of matrices {V_i ∈ R^{d×m_i} : i ∈ [3]}, the (i_1, i_2, i_3)-th entry in the three-way array representation of T(V_1, V_2, V_3) ∈ R^{m_1×m_2×m_3} is

    [T(V_1, V_2, V_3)]_{i_1,i_2,i_3} := Σ_{j_1,j_2,j_3∈[d]} T_{j_1,j_2,j_3} [V_1]_{j_1,i_1} [V_2]_{j_2,i_2} [V_3]_{j_3,i_3}.

The term multilinear map arises from the fact that the above map is linear in each of the coordinates, e.g. if we replace V_1 by aV_1 + bW_1 in the above equation, where W_1 is a matrix of appropriate dimensions and a, b are any scalars, the output is a linear combination of the outputs under V_1 and W_1 respectively. We will use the above notion of multi-linear transforms to describe various tensor operations. For instance, T(I, I, v) yields a matrix, T(I, v, v) a vector, and T(v, v, v) a scalar.
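In code, this multilinear map is a single einsum; the sketch below (function name ours) continues the running example, and the same transform is used for whitening later on.

```python
def multilinear(T, V1, V2, V3):
    # [T(V1, V2, V3)]_{abc} = sum_{jkl} T_{jkl} (V1)_{ja} (V2)_{kb} (V3)_{lc}
    return np.einsum('jkl,ja,kb,lc->abc', T, V1, V2, V3)
```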
Symmetric tensors and orthogonal decomposition: A special class of tensors are the symmetric tensors T ∈ R^{d×d×d}, which are invariant to permutation of the array indices. Symmetric tensors have a CP decomposition of the form

    T = Σ_{i∈[r]} λ_i v_i ⊗ v_i ⊗ v_i = Σ_{i∈[r]} λ_i v_i^{⊗3},    (19)

where r denotes the tensor CP rank, and we use the notation v_i^{⊗3} := v_i ⊗ v_i ⊗ v_i. It is convenient to first analyze methods for the decomposition of symmetric tensors, and we then extend them to the general case of asymmetric tensors. Further, a sub-class of symmetric tensors are those which possess a decomposition into orthogonal components, i.e. the vectors v_i ∈ R^d are orthogonal to one another in the decomposition in (19) (without loss of generality, we assume that the vectors {v_i} are orthonormal in this case). An orthogonal decomposition implies that the tensor rank r ≤ d, and there are tractable methods for recovering the rank-one components in this setting. We limit ourselves to this setting in this paper.

Tensor eigen analysis: For symmetric tensors T possessing an orthogonal decomposition of the form in (19), each pair (λ_i, v_i), for i ∈ [r], can be interpreted as an eigen-pair for the tensor T, since

    T(I, v_i, v_i) = Σ_{j∈[r]} λ_j ⟨v_i, v_j⟩^2 v_j = λ_i v_i,    ∀ i ∈ [r],

due to the fact that ⟨v_i, v_j⟩ = δ_{i,j}. Thus, the vectors {v_i}_{i∈[r]} can be interpreted as fixed points of the map

    v ↦ T(I, v, v) / ‖T(I, v, v)‖,    (20)

where ‖·‖ denotes the spectral norm (here ‖T(I, v, v)‖ is a vector norm), used to normalize the vector v in (20).

Basic tensor power iteration method: A straightforward approach to computing the orthogonal decomposition of a symmetric tensor is to iterate according to the fixed-point map in (20) with an arbitrary initialization vector. This is referred to as the tensor power iteration method. Additionally, it is known that the vectors {v_i}_{i∈[r]} are the only stable fixed points of the map in (20). In other words, the set of initialization vectors which converge to vectors other than {v_i}_{i∈[r]} is of measure zero. This ensures that we obtain the correct set of vectors through power iterations and that no spurious answers are obtained. See [Anandkumar et al., 2012b, Thm. 4.1] for details. Moreover, after an approximate fixed point is obtained (after many power iterations), the estimated eigen-pair can be subtracted out (i.e., deflated), and subsequent vectors can be similarly obtained through power iterations. Thus, we can obtain all the stable eigen-pairs {λ_i, v_i}_{i∈[r]}, which are the components of the orthogonal tensor decomposition. The method needs to be suitably modified when the tensor T is perturbed (e.g. as in the case when empirical moments are used), and we discuss this in Section 3.3.
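The following is a minimal sketch of the basic power iteration with deflation just described, with random initialization, a fixed iteration budget, and no perturbation safeguards; the paper's actual algorithm adds whitened-neighborhood initialization, adaptive deflation, and thresholding, so this is only the textbook variant. It reuses rng from the sampling sketch; the function name is ours.

```python
def tensor_power_method(T, r, n_iter=100):
    # Recover r eigen-pairs of a symmetric, orthogonally decomposable tensor
    # by iterating the map (20) and deflating each recovered rank-one term.
    T = T.copy()
    eigvals, eigvecs = [], []
    for _ in range(r):
        v = rng.standard_normal(T.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(n_iter):
            v = np.einsum('jkl,k,l->j', T, v, v)      # v <- T(I, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('jkl,j,k,l->', T, v, v, v)    # lambda = T(v, v, v)
        eigvals.append(lam)
        eigvecs.append(v)
        T -= lam * np.einsum('j,k,l->jkl', v, v, v)   # deflate
    return np.array(eigvals), np.column_stack(eigvecs)
```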
3.2 Learning Mixed Membership Models Under Exact Moments

We first describe the learning approach when exact moments are available; in Section 3.3, we suitably modify the approach to handle the perturbations introduced when only empirical moments are available. We now employ the tensor power method described above to obtain a CP decomposition of the graph moment tensor $T^{\alpha_0}$ in (15). We first describe a "symmetrization" procedure which converts the graph moment tensor $T^{\alpha_0}$ to a symmetric orthogonal tensor through a multilinear transformation of $T^{\alpha_0}$. We then employ the power method to obtain a symmetric orthogonal decomposition. Finally, the original CP decomposition is obtained by reversing the multilinear transform of the symmetrization procedure. This yields a guaranteed method for obtaining the decomposition of the graph moment tensor $T^{\alpha_0}$ under exact moments. We note that this symmetrization approach has been employed earlier in other contexts, e.g., for learning hidden Markov models [Anandkumar et al., 2012b, Sec. 3.3].

Reduction of the graph-moment tensor to symmetric orthogonal form (whitening): Recall from Proposition 2.2 that the modified 3-star count tensor $T^{\alpha_0}$ has a CP decomposition

$$\mathbb{E}[T^{\alpha_0} \mid \Pi_A, \Pi_B, \Pi_C] = \sum_{i=1}^{k} \hat\alpha_i\, (F_A)_i \otimes (F_B)_i \otimes (F_C)_i.$$

We now describe a symmetrization procedure which converts $T^{\alpha_0}$ to a symmetric orthogonal tensor through a multilinear transformation using the modified adjacency matrix $G^{\alpha_0}$ defined in (14). Consider the singular value decomposition (SVD) of the modified adjacency matrix $G^{\alpha_0}$ under exact moments:

$$|X|^{-1/2}\,\mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi] = U_A D_A V_A^\top.$$

Define $W_A := U_A D_A^{-1}$, and similarly define $W_B$ and $W_C$ using the corresponding matrices $G^{\alpha_0}_{X,B}$ and $G^{\alpha_0}_{X,C}$ respectively. Now define

$$R_{A,B} := \frac{1}{|X|}\, W_B^\top\, \mathbb{E}[(G^{\alpha_0}_{X,B})^\top \mid \Pi]\cdot \mathbb{E}[G^{\alpha_0}_{X,A} \mid \Pi]\, W_A, \qquad \tilde W_B := W_B R_{A,B}, \qquad (21)$$

and similarly define $\tilde W_C$. We establish that a multilinear transformation (as defined in Section 3.1) of the graph-moment tensor $T^{\alpha_0}$ using the matrices $W_A$, $\tilde W_B$, and $\tilde W_C$ results in a symmetric orthogonal form.

Lemma 3.1 (Orthogonal Symmetric Tensor). Assume that the matrices $F_A$, $F_B$, $F_C$ and $\Pi_X$ have rank $k$, where $k$ is the number of communities. Then the modified 3-star count tensor $T^{\alpha_0}$ in (15) has an orthogonal symmetric form under a multilinear transformation using the matrices $W_A$, $\tilde W_B$, and $\tilde W_C$:

$$\mathbb{E}[T^{\alpha_0}(W_A, \tilde W_B, \tilde W_C) \mid \Pi_A, \Pi_B, \Pi_C] = \sum_{i\in[k]} \lambda_i\, \Phi_i^{\otimes 3} \in \mathbb{R}^{k\times k\times k}, \qquad (22)$$

where $\lambda_i := \hat\alpha_i^{-1/2}$ and $\Phi \in \mathbb{R}^{k\times k}$ is an orthogonal matrix, given by

$$\Phi := W_A^\top F_A \operatorname{Diag}(\hat\alpha^{1/2}). \qquad (23)$$

Remark 1: Note that the matrix $W_A$ orthogonalizes $F_A$ under exact moments and is referred to as a whitening matrix. Similarly, the matrices $\tilde W_B = W_B R_{A,B}$ and $\tilde W_C = W_C R_{A,C}$ consist of the whitening matrices $W_B$ and $W_C$, while the matrices $R_{A,B}$ and $R_{A,C}$ additionally serve to symmetrize the tensor. We can interpret $\{(\lambda_i, \Phi_i)\}_{i\in[k]}$ as the stable eigen-pairs of the transformed tensor (henceforth referred to as the whitened and symmetrized tensor).

Remark 2: The full-rank assumption on the matrix $F_A = \Pi_A^\top P^\top \in \mathbb{R}^{|A|\times k}$ implies that $|A| \ge k$, and similarly $|B|, |C|, |X| \ge k$. Moreover, we require the community connectivity matrix $P \in \mathbb{R}^{k\times k}$ to be of full rank, which is a natural non-degeneracy condition. (In the work of McSherry [2001], where spectral clustering for stochastic block models is analyzed, a rank-deficient $P$ is allowed as long as the neighborhood vectors generated by any pair of communities are sufficiently different; our method, in contrast, requires $P$ to be full rank. We argue that this is a mild restriction, since we allow for mixed memberships while McSherry [2001] is limited to the stochastic block model.) In this case, we can reduce the graph-moment tensor $T^{\alpha_0}$ to a rank-$k$ orthogonal symmetric tensor, which has a unique decomposition. This implies that the mixed membership model is identifiable using 3-star and edge count moments when the network size $n = |A| + |B| + |C| + |X| \ge 4k$, the matrix $P$ is full rank, and the community membership matrices $\Pi_A, \Pi_B, \Pi_C, \Pi_X$ each have rank $k$. On the other hand, when only empirical moments are available, we roughly require the network size $n = \Omega(k^2(\alpha_0+1)^2)$ (where $\alpha_0 := \sum_i \alpha_i$ is related to the extent of overlap between the communities) to provide guaranteed learning of the community memberships and model parameters. See Section 4 for a detailed sample analysis.
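For concreteness, the following NumPy sketch (ours) carries out the whitening and symmetrization above. Under exact moments, the inputs would be the conditional expectations of the modified moments in (14)–(15); the same code applies verbatim to their empirical counterparts in Section 3.3:

import numpy as np

def whiten_and_symmetrize(T, G_XA, G_XB, G_XC, k):
    # T: 3-star count tensor (|A| x |B| x |C|); G_X*: modified adjacency
    # submatrices from (14). Returns the k x k x k tensor of (22).
    nX = G_XA.shape[0]

    def whitening_matrix(G):
        # k-rank SVD of |X|^{-1/2} G^T; the whitening matrix is W = U_k D_k^{-1}.
        U, D, _ = np.linalg.svd(G.T / np.sqrt(nX), full_matrices=False)
        return U[:, :k] / D[:k]              # same as U_k @ diag(1 / D_k)

    W_A, W_B, W_C = map(whitening_matrix, (G_XA, G_XB, G_XC))
    R_AB = W_B.T @ G_XB.T @ G_XA @ W_A / nX   # symmetrizing rotations, cf. (21)
    R_AC = W_C.T @ G_XC.T @ G_XA @ W_A / nX
    return np.einsum('jkl,ja,kb,lc->abc', T, W_A, W_B @ R_AB, W_C @ R_AC)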
Proof: Recall that the modified adjacency matrix $G^{\alpha_0}$ satisfies

$$\mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi_A, \Pi_X] = F_A \operatorname{Diag}(\hat\alpha^{1/2})\, \Psi_X, \qquad \Psi_X := \operatorname{Diag}(\hat\alpha^{-1/2})\left(\sqrt{\alpha_0+1}\,\Pi_X - (\sqrt{\alpha_0+1}-1)\,\frac{1}{|X|}\sum_{i\in X}\pi_i\; \vec 1^\top\right).$$

From the definition of $\Psi_X$ above, we see that it has rank $k$ when $\Pi_X$ has rank $k$. Using Sylvester's rank inequality, the rank of $F_A \operatorname{Diag}(\hat\alpha^{1/2})\Psi_X$ is at least $k + k - k = k$. This implies that the whitening matrix $W_A$ also has rank $k$. Notice that

$$|X|^{-1}\, W_A^\top\, \mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi]\cdot \mathbb{E}[G^{\alpha_0}_{X,A} \mid \Pi]\, W_A = D_A^{-1} U_A^\top U_A D_A^2 U_A^\top U_A D_A^{-1} = I \in \mathbb{R}^{k\times k},$$

or in other words, $|X|^{-1} M M^\top = I$, where $M := W_A^\top F_A \operatorname{Diag}(\hat\alpha^{1/2})\, \Psi_X$. We now have

$$I = |X|^{-1}\, \mathbb{E}_{\Pi_X}\!\left[M M^\top\right] = |X|^{-1}\, W_A^\top F_A \operatorname{Diag}(\hat\alpha^{1/2})\, \mathbb{E}[\Psi_X \Psi_X^\top]\, \operatorname{Diag}(\hat\alpha^{1/2})\, F_A^\top W_A = W_A^\top F_A \operatorname{Diag}(\hat\alpha)\, F_A^\top W_A,$$

since $|X|^{-1}\,\mathbb{E}_{\Pi_X}[\Psi_X \Psi_X^\top] = I$ from (18), and we use the fact that the sets $A$ and $X$ do not overlap. Thus, $W_A$ whitens $F_A \operatorname{Diag}(\hat\alpha^{1/2})$ under exact moments (upon taking the expectation over $\Pi_X$), and the columns of $W_A^\top F_A \operatorname{Diag}(\hat\alpha^{1/2})$ are orthonormal. Now note from the definition of $\tilde W_B$ that

$$\tilde W_B^\top\, \mathbb{E}[(G^{\alpha_0}_{X,B})^\top \mid \Pi] = W_A^\top\, \mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi],$$

since $W_B$ satisfies $|X|^{-1}\, W_B^\top\, \mathbb{E}[(G^{\alpha_0}_{X,B})^\top \mid \Pi]\cdot \mathbb{E}[G^{\alpha_0}_{X,B} \mid \Pi]\, W_B = I$, and a similar result holds for $\tilde W_C$. The final result (22) follows by taking the expectation of the tensor $T^{\alpha_0}$ over $\Pi_X$. □

Overview of the learning approach under exact moments: With the above result in place, we are now ready to describe the high-level approach for learning the mixed membership model under exact moments. First, symmetrize the graph-moment tensor $T^{\alpha_0}$ as described above, and then apply the tensor power method of the previous section. This yields the vector of eigenvalues $\lambda := \hat\alpha^{-1/2}$ and the matrix of eigenvectors $\Phi = W_A^\top F_A \operatorname{Diag}(\hat\alpha^{1/2})$. We can then recover the community membership vectors of the set $A^c$ (i.e., the nodes not in $A$) under exact moments as

$$\Pi_{A^c} \leftarrow \operatorname{Diag}(\lambda)^{-1}\, \Phi^\top\, W_A^\top\, \mathbb{E}[G^\top_{A^c,A} \mid \Pi],$$

since $\mathbb{E}[G^\top_{A^c,A} \mid \Pi] = F_A \Pi_{A^c}$ (as $A$ and $A^c$ do not overlap) and $\operatorname{Diag}(\lambda)^{-1}\Phi^\top W_A^\top = \operatorname{Diag}(\hat\alpha)\, F_A^\top W_A W_A^\top$ under exact moments.
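In code, this inversion step is a single chain of matrix products. A schematic sketch (ours), taking the eigen-pairs (lam, Phi) returned by the power method:

import numpy as np

def recover_memberships(lam, Phi, W_A, G_AcA):
    # Pi_{A^c} = Diag(lam)^{-1} Phi^T W_A^T G_{A^c,A}^T, a k x |A^c| matrix;
    # exact under exact moments, an estimate under empirical ones.
    return (Phi / lam).T @ W_A.T @ G_AcA.T

def recover_alpha(lam):
    # The normalized Dirichlet parameters follow from the eigenvalues: alpha_hat_i = lam_i^{-2}.
    return 1.0 / lam**2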
To recover the community membership vectors of the set $A$ itself, viz., $\Pi_A$, we can reverse the direction of the 3-star counts, i.e., consider the 3-stars from set $A$ to $X, B, C$, and obtain $\Pi_A$ in a similar manner. Once all the community membership vectors $\Pi$ are obtained, we can recover the community connectivity matrix $P$ using the relationship $\Pi^\top P \Pi = \mathbb{E}[G \mid \Pi]$ and the fact that $\Pi$ has rank $k$. Thus, we are able to learn the community membership vectors $\Pi$ and the model parameters $\hat\alpha$ and $P$ of the mixed membership model using edge counts and the 3-star count tensor. We now describe modifications to this approach to handle empirical moments.

3.3 Learning Algorithm Under Empirical Moments

In the previous section, we explored a tensor-based approach for learning the mixed membership model under exact moments. In practice, however, we only have samples (i.e., the observed network), and the method needs to be robust to the perturbations that arise when empirical moments are employed.

Algorithm 1 $\{\hat\Pi, \hat P, \hat\alpha\} \leftarrow$ LearnMixedMembership($G, k, \alpha_0, N, \tau$)

Input: Adjacency matrix $G\in\mathbb{R}^{n\times n}$; the number of communities $k$; $\alpha_0 := \sum_i \alpha_i$, where $\alpha$ is the Dirichlet parameter vector; the number of iterations $N$ for the tensor power method; and the threshold $\tau$ for the estimated community membership vectors, specified in (29) in assumption A5. Let $A^c := [n]\setminus A$ denote the set of nodes not in $A$.
Output: Estimates of the community membership vectors $\Pi\in\mathbb{R}^{n\times k}$, the community connectivity matrix $P\in[0,1]^{k\times k}$, and the normalized Dirichlet parameter vector $\hat\alpha$.
  Partition the vertex set $[n]$ into 5 parts $X, Y, A, B, C$.
  Compute the moments $G^{\alpha_0}_{X,A}$, $G^{\alpha_0}_{X,B}$, $G^{\alpha_0}_{X,C}$, $T^{\alpha_0}_{Y\to\{A,B,C\}}$ using (14) and (15).
  $\{\hat\Pi_{A^c}, \hat\alpha\} \leftarrow$ LearnPartitionCommunity($G^{\alpha_0}_{X,A}, G^{\alpha_0}_{X,B}, G^{\alpha_0}_{X,C}, T^{\alpha_0}_{Y\to\{A,B,C\}}, G, N, \tau$).
  Interchange the roles of $Y$ and $A$ to obtain $\hat\Pi_{Y^c}$.
  Define $\hat Q$ such that its $i$-th row is $\hat Q^i := (\alpha_0+1)\,\hat\Pi^i / \|\hat\Pi^i\|_1 - (\alpha_0/n)\,\vec 1^\top$. {We will establish that $\hat Q \approx (\Pi^\dagger)^\top$ under conditions A1–A5.}
  Estimate $\hat P \leftarrow \hat Q G \hat Q^\top$. {Recall that $\mathbb{E}[G] = \Pi^\top P \Pi$ in our model.}
  Return $\hat\Pi, \hat P, \hat\alpha$.

3.3.1 Pre-processing steps

Partitioning: In the previous section, we partitioned the nodes into four sets $A, B, C, X$ for learning under exact moments. Under empirical moments, however, we require more partitions in order to avoid statistical dependency issues and to obtain stronger reconstruction guarantees. We now divide the network into five non-overlapping sets $A, B, C, X, Y$. The set $X$ is employed to compute the whitening matrices $\hat W_A$, $\hat W_B$ and $\hat W_C$ (described in detail subsequently), the set $Y$ is employed to compute the 3-star count tensor $T^{\alpha_0}$, and the sets $A, B, C$ contain the leaves of the 3-stars under consideration. The roles of the sets can be interchanged to obtain the community membership vectors of all the sets.

Whitening: The whitening procedure is along the same lines as described in the previous section, except that empirical moments are now used.
Procedure 1 $\{\hat\Pi_{A^c}, \hat\alpha\} \leftarrow$ LearnPartitionCommunity($G^{\alpha_0}_{X,A}, G^{\alpha_0}_{X,B}, G^{\alpha_0}_{X,C}, T^{\alpha_0}_{Y\to\{A,B,C\}}, G, N, \tau$)

Input: Modified adjacency submatrices $G^{\alpha_0}_{X,A}$, $G^{\alpha_0}_{X,B}$, $G^{\alpha_0}_{X,C}$; the 3-star count tensor $T^{\alpha_0}_{Y\to\{A,B,C\}}$; the adjacency matrix $G$; the number of iterations $N$ for the tensor power method; and the threshold $\tau$ for the estimated community membership vectors. Let Thres($A,\tau$) denote the element-wise thresholding operation with threshold $\tau$, i.e., Thres($A,\tau$)$_{i,j} = A_{i,j}$ if $A_{i,j} \ge \tau$ and $0$ otherwise. Let $e_i$ denote the basis vector along coordinate $i$.
Output: Estimates of $\Pi_{A^c}$ and $\hat\alpha$.
  Compute the rank-$k$ SVD $(|X|^{-1/2} G^{\alpha_0}_{X,A})^\top \overset{k\text{-svd}}{=} U_A D_A V_A^\top$ and the whitening matrix $\hat W_A := U_A D_A^{-1}$. Similarly, compute $\hat W_B$, $\hat W_C$ and $\hat R_{AB}$, $\hat R_{AC}$ using (24).
  Compute the whitened and symmetrized tensor $T \leftarrow T^{\alpha_0}_{Y\to\{A,B,C\}}(\hat W_A, \hat W_B \hat R_{AB}, \hat W_C \hat R_{AC})$.
  $\{\hat\lambda, \hat\Phi\} \leftarrow$ TensorEigen($T, \{\hat W_A^\top G^\top_{i,A}\}_{i\notin A}, N$). {$\hat\Phi$ is a $k\times k$ matrix whose columns are the estimated eigenvectors, and $\hat\lambda$ is the vector of estimated eigenvalues.}
  $\hat\Pi_{A^c} \leftarrow$ Thres($\operatorname{Diag}(\hat\lambda)^{-1}\hat\Phi^\top \hat W_A^\top G^\top_{A^c,A},\; \tau$) and $\hat\alpha_i \leftarrow \hat\lambda_i^{-2}$ for $i\in[k]$.
  Return $\hat\Pi_{A^c}$ and $\hat\alpha$.

Specifically, consider the rank-$k$ singular value decomposition (SVD) of the modified adjacency matrix $G^{\alpha_0}$ defined in (14):

$$(|X|^{-1/2}\, G^{\alpha_0}_{X,A})^\top \overset{k\text{-svd}}{=} U_A D_A V_A^\top.$$

Define $\hat W_A := U_A D_A^{-1}$, and similarly define $\hat W_B$ and $\hat W_C$ using the corresponding matrices $G^{\alpha_0}_{X,B}$ and $G^{\alpha_0}_{X,C}$ respectively. Now define

$$\hat R_{A,B} := \frac{1}{|X|}\, \hat W_B^\top\, (G^{\alpha_0}_{X,B})^\top_{k\text{-svd}}\cdot (G^{\alpha_0}_{X,A})_{k\text{-svd}}\, \hat W_A, \qquad (24)$$

and similarly define $\hat R_{A,C}$. The whitened and symmetrized graph-moment tensor is now computed as $T^{\alpha_0}_{Y\to\{A,B,C\}}(\hat W_A, \hat W_B \hat R_{AB}, \hat W_C \hat R_{AC})$, where $T^{\alpha_0}$ is given by (15) and the multilinear transformation of a tensor is defined in Section 3.1.

3.3.2 Modifications to the tensor power method

Recall that under exact moments, the stable eigen-pairs of a symmetric orthogonal tensor can be computed in a straightforward manner through the basic power iteration method in (20), along with the deflation procedure. However, this is not sufficient to obtain good reconstruction guarantees under empirical moments. We therefore propose a robust tensor method, detailed in Procedure 2. The main modifications are (i) efficient initialization and (ii) adaptive deflation, detailed below. These modifications allow us to tolerate a far greater perturbation of the third-order moment tensor than the basic tensor power procedure employed in Anandkumar et al. [2012b]; see the remarks following Theorem A.1 in Appendix A for a precise comparison.

Efficient initialization: Recall that the basic tensor power method uses generic initialization vectors, and this procedure recovers all the stable eigenvectors correctly (except for initialization vectors in a set of measure zero). Under empirical moments, however, we have a perturbed tensor, and here it is advantageous to employ specific initialization vectors instead. For instance, to obtain one of the eigenvectors $\Phi_i$, it is advantageous to initialize with a vector in the neighborhood of $\Phi_i$.
This not only reduces the number of power iterations required to (approximately) converge but, more importantly, makes the power method more robust to perturbations. See Theorem A.1 in Appendix A.1 for a detailed analysis quantifying the relationship between the initialization vectors, the tensor perturbation, and the resulting guarantees for recovery of the tensor eigenvectors.

For a mixed membership model in the sparse regime, recall that the community membership vectors $\Pi$ are sparse with high probability. In this regime of the model, the whitened neighborhood vectors contain good initializers for the power iterations. Specifically, in Procedure 2, we initialize with the whitened neighborhood vectors $\hat W_A^\top G^\top_{i,A}$, for $i\notin A$. The intuition is as follows: for a suitable choice of parameters (such as the scaling of the network size $n$ with respect to the number of communities $k$), we expect the neighborhood vectors $G^\top_{i,A}$ to concentrate around their mean values, viz., $F_A \pi_i$. Since $\pi_i$ is sparse (w.h.p.) in the model regime under consideration, there exist vectors $\hat W_A^\top F_A \pi_i$, for $i\in A^c$, which concentrate (w.h.p.) along only a few eigen-directions of the whitened tensor, and hence serve as effective initializers.

Adaptive deflation: Recall that in the basic power iteration procedure, the eigen-pairs are obtained one after another through simple deflation: subtracting the estimates of the current eigen-pairs and running the power iterations again to obtain new eigenvectors. It turns out, however, that we can establish better theoretical guarantees (in terms of greater robustness) by adaptively deflating the tensor in each power iteration. In Procedure 2, among the estimated eigen-pairs, we deflate only those which "compete" with the current estimate of the power iteration: if the vector $\theta^{(\tau)}_t$ in the current iteration has a significant projection along the direction of an estimated eigenvector $\phi_j$, i.e., $|\lambda_j \langle \theta^{(\tau)}_t, \phi_j\rangle| > \xi$ for some threshold $\xi$, then the eigen-pair is deflated; otherwise the eigenvector $\phi_j$ is not deflated. This allows us to carefully control the error build-up for each estimated eigen-pair in our analysis. Intuitively, if an eigenvector does not have a good correlation with the current estimate, it does not interfere with the update of the current vector; if it does have a good correlation, then it is pertinent that it be deflated, so as to discourage convergence in the direction of the already estimated eigenvector. See Theorem A.1 in Appendix A.1 for details.

Finally, we note that stabilization, as proposed by Kolda and Mayo [2011] for general tensor eigen-decomposition (as opposed to the orthogonal decomposition in this paper), can be effective in improving convergence, especially on real data; we defer its detailed analysis to future work.

Procedure 2 $\{\lambda, \Phi\} \leftarrow$ TensorEigen($T, \{v_i\}_{i\in[L]}, N$)

Input: Tensor $T\in\mathbb{R}^{k\times k\times k}$, $L$ initialization vectors $\{v_i\}_{i\in[L]}$, number of iterations $N$.
Output: The estimated eigenvalue/eigenvector pairs $\{\lambda, \Phi\}$, where $\lambda$ is the vector of eigenvalues and $\Phi$ is the matrix of eigenvectors.
  for $i = 1$ to $k$ do
    for $\tau = 1$ to $L$ do
      $\theta_0 \leftarrow v_\tau$.
      for $t = 1$ to $N$ do
        $\tilde T \leftarrow T$.
        for $j = 1$ to $i-1$ (when $i > 1$) do
          if $|\lambda_j \langle \theta^{(\tau)}_t, \phi_j\rangle| > \xi$ then
            $\tilde T \leftarrow \tilde T - \lambda_j\, \phi_j^{\otimes 3}$.
          end if
        end for
        Compute the power iteration update $\theta^{(\tau)}_t := \tilde T(I, \theta^{(\tau)}_{t-1}, \theta^{(\tau)}_{t-1}) \,/\, \|\tilde T(I, \theta^{(\tau)}_{t-1}, \theta^{(\tau)}_{t-1})\|$.
      end for
    end for
    Let $\tau^* := \arg\max_{\tau\in[L]} \{\tilde T(\theta^{(\tau)}_N, \theta^{(\tau)}_N, \theta^{(\tau)}_N)\}$.
    Do $N$ power iteration updates starting from $\theta^{(\tau^*)}_N$ to obtain the eigenvector estimate $\phi_i$, and set $\lambda_i := \tilde T(\phi_i, \phi_i, \phi_i)$.
  end for
  return the estimated eigenvalue/eigenvector pairs $(\lambda, \Phi)$.
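A compact Python rendering of Procedure 2 (our sketch, with the deflation threshold xi exposed as a parameter) may help fix the control flow:

import numpy as np

def tensor_eigen(T, inits, N, xi):
    # Procedure 2 (sketch): power method with supplied initializations and
    # adaptive deflation. Returns the estimated eigenvalues and eigenvectors.
    k = T.shape[0]
    lams, Phi = np.zeros(k), np.zeros((k, k))

    def deflate(i, theta):
        # Subtract only those previously estimated eigen-pairs that "compete"
        # with the current iterate theta.
        Td = T.copy()
        for j in range(i):
            if abs(lams[j] * (theta @ Phi[:, j])) > xi:
                Td -= lams[j] * np.einsum('a,b,c->abc', Phi[:, j], Phi[:, j], Phi[:, j])
        return Td

    def update(Td, theta):
        u = np.einsum('abc,b,c->a', Td, theta, theta)
        return u / np.linalg.norm(u)

    for i in range(k):
        candidates = []
        for v in inits:
            theta = v / np.linalg.norm(v)
            for _ in range(N):
                theta = update(deflate(i, theta), theta)
            Td = deflate(i, theta)
            candidates.append((np.einsum('abc,a,b,c->', Td, theta, theta, theta), theta))
        _, theta = max(candidates, key=lambda c: c[0])
        for _ in range(N):                    # N further updates from the best start
            theta = update(deflate(i, theta), theta)
        Td = deflate(i, theta)
        lams[i] = np.einsum('abc,a,b,c->', Td, theta, theta, theta)
        Phi[:, i] = theta
    return lams, Phi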
3.3.3 Reconstruction after the tensor power method

Recall from Section 3.2 that, when exact moments are available, estimating the community membership vectors $\Pi$ is straightforward once we recover all the stable tensor eigen-pairs. In the case of empirical moments, however, we can obtain better guarantees with the following modification: the estimated community membership vectors $\Pi$ are further subjected to thresholding, so that the weak values are set to zero. Since we limit ourselves to the regime of the mixed membership model where the community vectors $\Pi$ are sparse (w.h.p.), this modification strengthens our reconstruction guarantees. This thresholding step is incorporated in Algorithm 1.

Moreover, recall that under exact moments, estimating the community connectivity matrix $P$ is straightforward once we recover the community membership vectors, since $P \leftarrow (\Pi^\top)^\dagger\, \mathbb{E}[G\mid\Pi]\, \Pi^\dagger$. When only empirical moments are available, however, we are able to establish better reconstruction guarantees through a different method, outlined in Algorithm 1. We define $\hat Q$ such that its $i$-th row is

$$\hat Q^i := (\alpha_0+1)\,\frac{\hat\Pi^i}{\|\hat\Pi^i\|_1} - \frac{\alpha_0}{n}\,\vec 1^\top,$$

based on the estimates $\hat\Pi$, and the matrix $\hat P$ is obtained as $\hat P \leftarrow \hat Q G \hat Q^\top$. We subsequently establish that $\hat Q \hat\Pi^\top \approx I$ under a set of sufficient conditions outlined in the next section.
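The post-processing in Algorithm 1 thus amounts to the following two steps (a sketch in our notation, with Pi_hat stored as a k x n matrix whose rows index communities):

import numpy as np

def threshold(Pi_hat, tau):
    # Thres(., tau): zero out the weak entries of the estimated memberships.
    return np.where(Pi_hat >= tau, Pi_hat, 0.0)

def estimate_P(Pi_hat, G, alpha0):
    # Row i of Q_hat is (alpha0 + 1) Pi_hat^i / |Pi_hat^i|_1 - (alpha0 / n) 1^T,
    # so that Q_hat approximates (Pi^dagger)^T; then P_hat = Q_hat G Q_hat^T.
    n = Pi_hat.shape[1]
    Q = (alpha0 + 1) * Pi_hat / np.abs(Pi_hat).sum(axis=1, keepdims=True) - alpha0 / n
    return Q @ G @ Q.T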
Improved support recovery estimates in homophilic models: A sub-class of community models are those satisfying homophily. As discussed in Section 1, homophily, or the tendency to form edges among members of the same community, has been posited as an important factor in community formation, especially in social settings. Many existing learning algorithms, e.g., Chen et al. [2012], require this assumption to provide guarantees in the stochastic block model setting. We describe a post-processing method in Procedure 3 for models with community connectivity matrix $P$ satisfying $P(i,i) \equiv p > P(i,j) \equiv q$ for all $i\ne j$. (The procedure can also easily be modified to work in situations where the order of intra-connectivity and inter-connectivity among communities is reversed, i.e., $P(i,i) \equiv p < P(i,j) \equiv q$ for all $i\ne j$; for instance, in the $k$-coloring model [McSherry, 2001], $p = 0$ and $q > 0$.)

For such models, we can obtain improved estimates by averaging. Specifically, consider the nodes in set $C$ and the edges going from $C$ to nodes in $B$. First, consider the special case of the stochastic block model: for each node $c\in C$, compute the number of neighbors in $B$ belonging to each community (as given by the estimate $\hat\Pi$ from Algorithm 1), and declare the community with the maximum number of such neighbors to be the community of node $c$. Intuitively, this provides a better estimate of $\Pi_C$, since we average over the edges into $B$; this idea has been used before in the context of spectral clustering [McSherry, 2001]. The same idea extends to general mixed membership (homophilic) models: declare a community to be significant if it exceeds a certain threshold, as evaluated by the average number of edges to each community. The correctness of the procedure can be gleaned from the fact that if the true $F$ matrix is input, it satisfies

$$F_{j,i} = q + \Pi_{i,j}\,(p - q), \qquad \forall\, i\in[k],\ j\in[n],$$

and if the true $P$ matrix is input, then $H = p$ and $L = q$. Thus, under a suitable threshold $\xi$, the entries $F_{j,i}$ provide information on whether the corresponding community weight $\Pi_{i,j}$ is significant. In the next section, we establish that in a certain regime of parameters, this support recovery procedure leads to zero-error recovery of the significant community memberships of the nodes, and also rules out the communities where a node does not have a strong presence.

Procedure 3 $\{\hat S\} \leftarrow$ SupportRecoveryHomophilicModels($G, k, \alpha_0, \xi, \hat\Pi$)

Input: Adjacency matrix $G\in\mathbb{R}^{n\times n}$; the number of communities $k$; $\alpha_0 := \sum_i \alpha_i$, where $\alpha$ is the Dirichlet parameter vector; the threshold $\xi$ for support recovery, corresponding to significant community memberships of an individual; and the estimate $\hat\Pi$ from Algorithm 1. Assume the model is homophilic: $P(i,i) \equiv p > P(i,j) \equiv q$ for all $i\ne j$.
Output: $\hat S \in \{0,1\}^{n\times k}$, the estimated support for significant community memberships (see Theorem 4.2 for guarantees).
  Consider the partitions $A, B, C, X, Y$ as in Algorithm 1.
  Define $\hat Q$ along the lines of its definition in Algorithm 1, using the estimates $\hat\Pi$. Let the $i$-th row for set $B$ be $\hat Q^i_B := (\alpha_0+1)\,\hat\Pi^i_B / \|\hat\Pi^i_B\|_1 - (\alpha_0/n)\,\vec 1^\top$. Similarly define $\hat Q^i_C$.
  Estimate $\hat F_C \leftarrow G_{C,B}\,\hat Q_B^\top$ and $\hat P \leftarrow \hat Q_C \hat F_C$.
  if $\alpha_0 = 0$ (stochastic block model) then
    for $x\in C$ do
      Let $i^* \leftarrow \arg\max_{i\in[k]} \hat F_C(x,i)$, and set $\hat S(i^*, x) \leftarrow 1$ and $0$ otherwise.
    end for
  else
    Let $H$ be the average of the diagonal entries of $\hat P$, and $L$ the average of its off-diagonal entries.
    for $x\in C$, $i\in[k]$ do
      $\hat S(i,x) \leftarrow 1$ if $\hat F_C(x,i) \ge L + (H-L)\cdot \frac{3\xi}{4}$, and zero otherwise. {Identify large entries.}
    end for
  end if
  Permute the roles of the sets $A, B, C, X, Y$ to obtain the results for the remaining nodes.
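In NumPy, the core of Procedure 3 reads as follows (our sketch; the estimate F_hat(x, i) plays the role of the average number of edges from node x to community i):

import numpy as np

def support_recovery(G_CB, Q_B, P_hat, alpha0, xi):
    # Procedure 3 (sketch): degree averaging for homophilic models.
    # F_hat(x, i) estimates q + Pi(i, x) (p - q), so its large entries flag
    # the significant community memberships of node x.
    F_hat = G_CB @ Q_B.T                      # |C| x k matrix of average degrees
    if alpha0 == 0:                           # stochastic block model: pick the argmax
        S = np.zeros(F_hat.shape, dtype=int)
        S[np.arange(F_hat.shape[0]), F_hat.argmax(axis=1)] = 1
        return S
    H = np.diag(P_hat).mean()                 # average intra-community connectivity
    k = P_hat.shape[0]
    L = (P_hat.sum() - np.trace(P_hat)) / (k * (k - 1))   # average inter-community connectivity
    return (F_hat >= L + (H - L) * 3 * xi / 4).astype(int)  # identify large entries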
Computational complexity: We note that the computational complexity of the method, implemented naively, is $O(n^2 k + k^{4.43}\hat\alpha_{\min}^{-1})$ when $\alpha_0 > 1$ and $O(n^2 k)$ when $\alpha_0 < 1$. The time for computing the whitening matrices is dominated by the SVD for the top $k$ singular vectors of an $n\times n$ matrix, which takes $O(n^2 k)$ time. We then compute the whitened tensor $T$, which requires $O(n^2 k + k^3 n) = O(n^2 k)$ time: for each $i\in Y$, we multiply $G_{i,A}$, $G_{i,B}$, $G_{i,C}$ with the corresponding whitening matrices, which takes $O(nk)$ time, and we average the resulting $k\times k\times k$ tensors over the nodes $i\in Y$, which takes $O(k^3)$ time per step. For the tensor power method, a single iteration requires $O(k^3)$ time; we need at most $\log n$ iterations per initialization vector and $O(\hat\alpha_{\min}^{-1} k^{0.43})$ initialization vectors (this can be smaller when $\alpha_0 < 1$). Hence the total running time of the tensor power method is $O(k^{4.43}\hat\alpha_{\min}^{-1})$ (when $\alpha_0$ is small this improves to $O(k^4 \hat\alpha_{\min}^{-1})$, which is dominated by $O(n^2 k)$). In the process of estimating $\Pi$ and $P$, the dominant operation is multiplying a $k\times n$ matrix by an $n\times n$ matrix, which takes $O(n^2 k)$ time. For support recovery, the dominant operation is computing the "average degrees", which again takes $O(n^2 k)$ time. Thus, the overall computational time is $O(n^2 k + k^{4.43}\hat\alpha_{\min}^{-1})$ when $\alpha_0 > 1$ and $O(n^2 k)$ when $\alpha_0 < 1$.

Note that the above bound on the complexity of our method nearly matches the bound for the spectral clustering method of McSherry [2001], since computing the rank-$k$ SVD requires $O(n^2 k)$ time. Another method for learning stochastic block models is based on convex optimization involving semi-definite programming (SDP) [Chen et al., 2012], and it provides the best scaling bounds known so far (for both the network size $n$ and the separation $p-q$ of the edge connectivity). The specific convex problem can be solved via the method of augmented Lagrange multipliers [Lin et al., 2010], where each step consists of an SVD operation and q-linear convergence is established by Lin et al. [2010]. This implies that the method has complexity $O(n^3)$, since it involves the SVD of a general $n\times n$ matrix rather than a rank-$k$ SVD. Thus, our method has a significant advantage in terms of computational complexity when the number of communities is much smaller than the network size ($k \ll n$).

Further, a subsequent work provides a more sophisticated implementation of the proposed tensor method through parallelization and the use of stochastic gradient descent for tensor decomposition [Huang et al., 2013]. Additionally, the rank-$k$ SVD operations are approximated via randomized methods such as the Nystrom method, leading to more efficient implementations [Gittens and Mahoney, 2013]. Huang et al. [2013] deploy the tensor approach for community detection and establish that it has a running time of $O(n + k^3)$ using $nk$ cores under a parallel computation model [JáJá, 1992].

4 Sample Analysis for the Proposed Learning Algorithm

4.1 Homogeneous Mixed Membership Models

It is easier to first present the results for our proposed algorithm in the special case where all the communities have the same expected size and the entries of the community connectivity matrix $P$ are equal on the diagonal and off-diagonal locations:

$$\hat\alpha_i \equiv \frac{1}{k}, \qquad P(i,j) = p\cdot \mathbb{I}(i=j) + q\cdot \mathbb{I}(i\ne j), \qquad p \ge q. \qquad (25)$$

In other words, the probability of an edge according to $P$ depends only on whether it is between two individuals of the same community or between individuals of different communities.
This setting is also well studied for stochastic block models ($\alpha_0 = 0$), allowing us to compare our results with existing ones. The results for general mixed membership models are deferred to Section 4.2.

[A1] Sparse regime of Dirichlet parameters: The community membership vectors are drawn from the Dirichlet distribution, Dir($\alpha$), under the mixed membership model. We assume that $\alpha_i < 1$ for $i\in[k]$ (see Section 2.1 for an extended discussion of the sparse regime of the Dirichlet distribution) and that $\alpha_0$ is known.

[A2] Condition on the network size: Given the concentration parameter of the Dirichlet distribution, $\alpha_0 := \sum_i \alpha_i$, we require that

$$n = \tilde\Omega(k^2(\alpha_0+1)^2), \qquad (26)$$

and that the disjoint sets $A, B, C, X, Y$ are chosen randomly and are of size $\Theta(n)$. (The notation $\tilde\Omega(\cdot)$, $\tilde O(\cdot)$ denotes $\Omega(\cdot)$, $O(\cdot)$ up to poly-log factors.) Note that from assumption A1, $\alpha_i < 1$, which implies $\alpha_0 < k$. Thus, in the worst case, when $\alpha_0 = \Theta(k)$, we require $n = \tilde\Omega(k^4)$, and in the best case, when $\alpha_0 = \Theta(1)$, we require $n = \tilde\Omega(k^2)$. The latter case includes the stochastic block model ($\alpha_0 = 0$), and thus our results match the state-of-the-art bounds for learning stochastic block models.

[A3] Condition on edge connectivity: Recall that $p$ is the probability of intra-community connectivity and $q$ is the probability of inter-community connectivity. We require that

$$\frac{p-q}{\sqrt p} = \Omega\left(\frac{(\alpha_0+1)\,k}{n^{1/2}}\right). \qquad (27)$$

This is a condition on the standardized separation between intra-community and inter-community connectivity (note that $\sqrt p$ is on the order of the standard deviation of a Bernoulli random variable). The condition is required to control the perturbation of the whitened tensor (computed using the observed network samples), thereby providing guarantees on the estimated eigen-pairs through the tensor power method.

[A4] Condition on the number of iterations of the power method: We assume that the number of iterations $N$ of the tensor power method in Procedure 2 satisfies

$$N \ge C_2\cdot\left(\log(k) + \log\log\left(\frac{p}{p-q}\right)\right), \qquad (28)$$

for some constant $C_2$.

[A5] Choice of $\tau$ for thresholding community vector estimates: The threshold $\tau$ for obtaining the estimates $\hat\Pi$ of the community membership vectors in Algorithm 1 is chosen as

$$\tau = \Theta\left(\frac{k\sqrt{\alpha_0}}{\sqrt n}\cdot\frac{\sqrt p}{p-q}\right) \ \text{ for } \alpha_0 \ne 0, \qquad (29) \qquad\qquad \tau = 0.5 \ \text{ for } \alpha_0 = 0. \qquad (30)$$

For the stochastic block model ($\alpha_0 = 0$), since $\pi_i$ is a basis vector, we can use a large threshold. For general models ($\alpha_0 \ne 0$), $\tau$ can be viewed as a regularization parameter and decays as $n^{-1/2}$ when the other parameters are held fixed.
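Since the hidden constants in $\Theta(\cdot)$ and $\tilde\Omega(\cdot)$ are unspecified, the assumptions above only admit order-of-magnitude checks. The following helper (ours, with all constants set to one) illustrates how A2, A3 and the threshold in (29)–(30) translate into numbers:

import numpy as np

def check_assumptions(n, k, p, q, alpha0):
    # Heuristic checks of A2 (network size) and A3 (edge separation),
    # ignoring the hidden constants and poly-log factors.
    a2 = n >= (k * (alpha0 + 1)) ** 2                            # (26)
    a3 = (p - q) / np.sqrt(p) >= (alpha0 + 1) * k / np.sqrt(n)   # (27)
    return a2, a3

def threshold_tau(n, k, p, q, alpha0):
    # Choice of tau per (29)-(30), again up to the hidden constant.
    if alpha0 == 0:
        return 0.5
    return (k * np.sqrt(alpha0) / np.sqrt(n)) * np.sqrt(p) / (p - q)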
We are now ready to state the error bounds on the estimates of the community membership vectors $\Pi$ and the block connectivity matrix $P$; $\hat\Pi$ and $\hat P$ are the estimates computed in Algorithm 1. Recall that for a matrix $M$, $(M)^i$ and $(M)_i$ denote the $i$-th row and column respectively. We say that an event holds with high probability if it occurs with probability $1 - n^{-c}$ for some constant $c > 0$.

Theorem 4.1 (Guarantees on Estimating $P$, $\Pi$). Under assumptions A1–A5, we have with high probability

$$\varepsilon_{\pi,\ell_1} := \max_{i\in[n]} \|\hat\Pi_i - \Pi_i\|_1 = \tilde O\left(\frac{(\alpha_0+1)^{3/2}\,\sqrt p}{\sqrt n\,(p-q)}\right), \qquad (31)$$

$$\varepsilon_P := \max_{i,j\in[k]} |\hat P_{i,j} - P_{i,j}| = \tilde O\left(\frac{(\alpha_0+1)^{3/2}\,k\,\sqrt p}{\sqrt n}\right). \qquad (32)$$

The proofs are given in the Appendix, and a proof outline is provided in Section 4.3. The main ingredients in establishing the above result are the tensor concentration bound and, additionally, the recovery guarantees under the tensor power method in Procedure 2. We now provide these results below.

Recall that $F_A := \Pi_A^\top P^\top$ and that $\Phi = W_A^\top F_A \operatorname{Diag}(\hat\alpha^{1/2})$ denotes the set of tensor eigenvectors under exact moments in (23), while $\hat\Phi$ is the set of estimated eigenvectors under empirical moments, obtained using Procedure 1. We establish the following guarantees.

Lemma 4.1 (Perturbation bound for estimated eigen-pairs). Under assumptions A1–A4, the recovered eigenvector-eigenvalue pairs $(\hat\Phi_i, \hat\lambda_i)$ from the tensor power method in Procedure 2 satisfy, with high probability, for some permutation $\theta$,

$$\max_{i\in[k]} \|\hat\Phi_i - \Phi_{\theta(i)}\| \le 8\, k^{-1/2}\,\varepsilon_T, \qquad \max_{i\in[k]} |\hat\lambda_i - \hat\alpha_{\theta(i)}^{-1/2}| \le 5\,\varepsilon_T. \qquad (33)$$

The tensor perturbation bound $\varepsilon_T$ is given by

$$\varepsilon_T := \left\| T^{\alpha_0}_{Y\to\{A,B,C\}}(\hat W_A, \hat W_B \hat R_{AB}, \hat W_C \hat R_{AC}) - \mathbb{E}[T^{\alpha_0}_{Y\to\{A,B,C\}}(W_A, W_B R_{AB}, W_C R_{AC}) \mid \Pi_{A\cup B\cup C}] \right\| = \tilde O\left(\frac{(\alpha_0+1)\,k^{3/2}\,\sqrt p}{(p-q)\,\sqrt n}\right), \qquad (34)$$

where $\|T\|$ for a tensor $T$ refers to its spectral norm.

Stochastic block models ($\alpha_0 = 0$): For stochastic block models, assumptions A2 and A3 reduce to

$$n = \tilde\Omega(k^2), \qquad \zeta = \Theta\left(\frac{\sqrt p}{p-q}\right) = O\left(\frac{n^{1/2}}{k}\right). \qquad (35)$$

This matches the best known scaling (up to poly-log factors), previously achieved via convex optimization by Chen et al. [2012] for stochastic block models. However, our results in Theorem 4.1 do not provide zero-error guarantees as in Chen et al. [2012]. We strengthen our results to provide zero-error guarantees in Section 4.1.1 below, and thus match the scaling of Chen et al. [2012] for stochastic block models. Moreover, we also provide zero-error support recovery guarantees for recovering the significant memberships of nodes in mixed membership models in Section 4.1.1.

Dependence on $\alpha_0$: The guarantees degrade as $\alpha_0$ increases, which is intuitive since the extent of community overlap increases. The required scaling of $n$ also grows with $\alpha_0$. Note that the guarantees on $\varepsilon_\pi$ and $\varepsilon_P$ can be improved by assuming a more stringent scaling of $n$ with respect to $\alpha_0$ than the one specified by A2.

4.1.1 Zero-error guarantees for support recovery

Recall that we proposed Procedure 3 as a post-processing step to provide improved support recovery estimates. We now provide guarantees for this method, and first specify the threshold $\xi$ for support recovery in Procedure 3.

[A6] Choice of $\xi$ for support recovery: We assume that the threshold $\xi$ in Procedure 3 satisfies $\xi = \Omega(\varepsilon_P)$, where $\varepsilon_P$ is specified in Theorem 4.1.

Theorem 4.2 (Support recovery guarantees). Assuming that A1–A6 and (25) hold, the support recovery method in Procedure 3 has the following guarantees on the estimated support set $\hat S$: with high probability,

$$\Pi(i,j) \ge \xi \;\Rightarrow\; \hat S(i,j) = 1 \qquad \text{and} \qquad \Pi(i,j) \le \frac{\xi}{2} \;\Rightarrow\; \hat S(i,j) = 0, \qquad \forall\, i\in[k],\ j\in[n], \qquad (36)$$

where $\Pi$ is the true community membership matrix.

Thus, the above result guarantees that Procedure 3 correctly recovers all the "large" entries of $\Pi$ and also correctly rules out all the "small" entries of $\Pi$.
In other words, we can correctly infer all the significant memberships of each node and also rule out the set of communities where a node does not have a strong presence. The only shortcoming of the above result is that there is a gap between the "large" and "small" values: for an intermediate range of values (in $[\xi/2, \xi]$), we cannot guarantee correct inferences about the community memberships. Note that this gap depends on $\varepsilon_P$, the error in estimating the $P$ matrix. This is intuitive: as the error $\varepsilon_P$ decreases, we can infer the community memberships over a larger range of values. For the special case of stochastic block models (i.e., $\alpha_0 \to 0$), we can improve the above result and give a zero-error guarantee at all nodes (w.h.p.). Note that we no longer require a threshold $\xi$ in this case, and we infer exactly one community for each node.

Corollary 4.1 (Zero-error guarantee for block models). Assuming that A1–A5 and (25) hold, the support recovery method in Procedure 3 correctly identifies the community memberships of all nodes with high probability in the case of stochastic block models ($\alpha_0 \to 0$).

Thus, with the above result, we match the state-of-the-art results of Chen et al. [2012] for stochastic block models in terms of scaling requirements and recovery guarantees.

4.2 General (Non-Homogeneous) Mixed Membership Models

In the previous sections, we provided guarantees for learning homogeneous mixed membership models. Here, we extend the results to learning general, non-homogeneous mixed membership models under a sufficient set of conditions involving the scaling of various parameters, such as the network size $n$, the number of communities $k$, and the concentration parameter $\alpha_0$ of the Dirichlet distribution (which is a measure of the overlap of the communities).

[B1] Sparse regime of Dirichlet parameters: The community membership vectors are drawn from the Dirichlet distribution, Dir($\alpha$), under the mixed membership model. We assume that $\alpha_i < 1$ for $i\in[k]$ (see Section 2.1 for an extended discussion of the sparse regime of the Dirichlet distribution). This assumption is not strictly needed: our results can be extended to general Dirichlet distributions, but with worse scaling requirements on $n$. The dependence of $n$ is still polynomial in $\alpha_0$, i.e., we require $n = \tilde\Omega((\alpha_0+1)^c\,\hat\alpha_{\min}^{-2})$, where $c \ge 2$ is some constant.

[B2] Condition on the network size: Given the concentration parameter of the Dirichlet distribution, $\alpha_0 := \sum_i \alpha_i$, and $\hat\alpha_{\min} := \alpha_{\min}/\alpha_0$, the expected (relative) size of the smallest community, define

$$\rho := \frac{\alpha_0+1}{\hat\alpha_{\min}}. \qquad (37)$$

We require that the network size scale as

$$n = \Omega\left(\rho^2 \log^2 k\right), \qquad (38)$$

and that the sets $A, B, C, X, Y$ are of size $\Theta(n)$. Note that from assumption B1, $\alpha_i < 1$, which implies $\alpha_0 < k$. Thus, in the worst case, when $\alpha_0 = \Theta(k)$, we require $n = \tilde\Omega(k^4)$ (assuming equal sizes, $\hat\alpha_i = 1/k$), and in the best case, when $\alpha_0 = \Theta(1)$, we require $n = \tilde\Omega(k^2)$. The latter case includes the stochastic block model ($\alpha_0 = 0$), and thus our results match the state-of-the-art bounds for learning stochastic block models. See Section 4.1 for an extended discussion.
[B3] Condition on relative community sizes and the block connectivity matrix: Recall that $P\in[0,1]^{k\times k}$ denotes the block connectivity matrix. Define

$$\zeta := \left(\frac{\hat\alpha_{\max}}{\hat\alpha_{\min}}\right)^{1/2}\,\frac{\sqrt{\max_i (P\hat\alpha)_i}}{\sigma_{\min}(P)}, \qquad (39)$$

where $\sigma_{\min}(P)$ is the minimum singular value of $P$. We require that

$$\zeta = O\left(\frac{n^{1/2}}{\rho}\right) \ \text{ for } \alpha_0 < 1, \qquad (40) \qquad\qquad \zeta = O\left(\frac{n^{1/2}}{\rho\, k\, \hat\alpha_{\max}}\right) \ \text{ for } \alpha_0 \ge 1. \qquad (41)$$

Intuitively, the above condition requires that the ratio of the maximum and minimum expected community sizes not be too large, and that the matrix $P$ be well conditioned. The condition is required to control the perturbation of the whitened tensor (computed using the observed network samples), thereby providing guarantees on the estimated eigen-pairs through the tensor power method. It can be interpreted as a separation requirement between intra-community and inter-community connectivity in the special case considered in Section 4.1. Specifically, for the homogeneous mixed membership model, we have

$$\sigma_{\min}(P) = \Theta(p-q), \qquad \max_i (P\hat\alpha)_i = \frac{p}{k} + \frac{(k-1)\,q}{k} \le p.$$

Thus, the assumptions A2 and A3 of Section 4.1, given by

$$n = \tilde\Omega(k^2(\alpha_0+1)^2), \qquad \zeta = \Theta\left(\frac{\sqrt p}{p-q}\right) = O\left(\frac{n^{1/2}}{(\alpha_0+1)\,k}\right),$$

are special cases of the assumptions B2 and B3 above.
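The quantities $\rho$ and $\zeta$ are directly computable from the model parameters. A small helper (ours) makes the definitions in (37) and (39) concrete:

import numpy as np

def rho_zeta(P, alpha):
    # rho per (37) and zeta per (39); alpha is the Dirichlet parameter vector.
    alpha0 = alpha.sum()
    a_hat = alpha / alpha0                          # normalized community sizes
    rho = (alpha0 + 1) / a_hat.min()
    zeta = (np.sqrt(a_hat.max() / a_hat.min())
            * np.sqrt((P @ a_hat).max())
            / np.linalg.svd(P, compute_uv=False).min())
    return rho, zeta

For the homogeneous case (25), this reduces to the quantities discussed above, consistent with the specialization in (35).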
[B4] Condition on the number of iterations of the power method: We assume that the number of iterations $N$ of the tensor power method in Procedure 2 satisfies

$$N \ge C_2\cdot\left(\log(k) + \log\log\left(\frac{\max_i (P\hat\alpha)_i}{\sigma_{\min}(P)}\right)\right), \qquad (42)$$

for some constant $C_2$.

[B5] Choice of $\tau$ for thresholding community vector estimates: The threshold $\tau$ for obtaining the estimates $\hat\Pi$ of the community membership vectors in Algorithm 1 is chosen as

$$\tau = \Theta\left(\frac{\rho^{1/2}\cdot\zeta\cdot\hat\alpha_{\max}^{1/2}}{n^{1/2}\cdot\hat\alpha_{\min}}\right) \ \text{ for } \alpha_0 \ne 0, \qquad (43) \qquad\qquad \tau = 0.5 \ \text{ for } \alpha_0 = 0. \qquad (44)$$

For the stochastic block model ($\alpha_0 = 0$), since $\pi_i$ is a basis vector, we can use a large threshold. For general models ($\alpha_0 \ne 0$), $\tau$ can be viewed as a regularization parameter and decays as $n^{-1/2}$ when the other parameters are held fixed. Moreover, when $n = \tilde\Theta(\rho^2)$, we have $\tau \sim \rho^{-1/2}$ when the other terms are held fixed. Recall that $\rho \propto (\alpha_0+1)$ when the expected community sizes $\hat\alpha_i$ are held fixed; in this case, $\tau \sim \rho^{-1/2}$ allows smaller values to be picked up after thresholding as $\alpha_0$ increases. This is intuitive: as $\alpha_0$ increases, the community vectors $\pi$ are more "spread out" across different communities and have smaller values.

We are now ready to state the error bounds on the estimates of the community membership vectors $\Pi$ and the block connectivity matrix $P$; $\hat\Pi$ and $\hat P$ are the estimates computed in Algorithm 1. As before, an event holds with high probability if it occurs with probability $1 - n^{-c}$ for some constant $c > 0$.

Theorem 4.3 (Guarantees on estimating $P$, $\Pi$). Under assumptions B1–B5, the estimates $\hat P$ and $\hat\Pi$ obtained from Algorithm 1 satisfy, with high probability,

$$\varepsilon_{\pi,\ell_1} := \max_{i\in[n]} \|\hat\Pi_i - \Pi_i\|_1 = \tilde O\left(n^{-1/2}\cdot\rho^{3/2}\cdot\zeta\cdot\hat\alpha_{\max}\right), \qquad (45)$$

$$\varepsilon_P := \max_{i,j\in[k]} |\hat P_{i,j} - P_{i,j}| = \tilde O\left(n^{-1/2}\cdot\rho^{5/2}\cdot\zeta\cdot\hat\alpha_{\max}^{3/2}\cdot(P_{\max} - P_{\min})\right). \qquad (46)$$

The proofs are in Appendix B, and a proof outline is provided in Section 4.3. The main ingredients in establishing the above result are the tensor concentration bound and, additionally, the recovery guarantees under the tensor power method in Procedure 2. We now provide these results below. Recall that $F_A := \Pi_A^\top P^\top$ and that $\Phi = W_A^\top F_A \operatorname{Diag}(\hat\alpha^{1/2})$ denotes the set of tensor eigenvectors under exact moments in (23), while $\hat\Phi$ is the set of estimated eigenvectors under empirical moments, obtained using Procedure 1. We establish the following guarantees.

Lemma 4.2 (Perturbation bound for estimated eigen-pairs). Under assumptions B1–B4, the recovered eigenvector-eigenvalue pairs $(\hat\Phi_i, \hat\lambda_i)$ from the tensor power method in Procedure 2 satisfy, with high probability, for some permutation $\theta$,

$$\max_{i\in[k]} \|\hat\Phi_i - \Phi_{\theta(i)}\| \le 8\,\hat\alpha_{\max}^{1/2}\,\varepsilon_T, \qquad \max_i |\hat\lambda_i - \hat\alpha_{\theta(i)}^{-1/2}| \le 5\,\varepsilon_T. \qquad (47)$$

The tensor perturbation bound $\varepsilon_T$ is given by

$$\varepsilon_T := \left\| T^{\alpha_0}_{Y\to\{A,B,C\}}(\hat W_A, \hat W_B \hat R_{AB}, \hat W_C \hat R_{AC}) - \mathbb{E}[T^{\alpha_0}_{Y\to\{A,B,C\}}(W_A, W_B R_{AB}, W_C R_{AC}) \mid \Pi_{A\cup B\cup C}] \right\| = \tilde O\left(\frac{\rho}{\sqrt n}\cdot\zeta\,\hat\alpha_{\max}^{1/2}\right), \qquad (48)$$

where $\|T\|$ for a tensor $T$ refers to its spectral norm, $\rho$ is defined in (37), and $\zeta$ in (39).

4.2.1 Application to the Planted Clique Problem

The planted clique problem is a special case of the stochastic block model [Condon and Karp, 1999] and is arguably the simplest setting for the community problem. Here, a clique of size $s$ is uniformly planted (or placed) in an Erdős–Rényi graph with edge probability $0.5$. This can be viewed as a stochastic block model with $k = 2$ communities, where $\hat\alpha_{\min} = s/n$ is the probability of a node being in the clique and $\hat\alpha_{\max} = 1 - s/n$. The connectivity matrix is $P = [1, q;\, q, q]$ with $q = 0.5$, since the probability of connectivity within the clique is 1 and the probability of connectivity for any other node pair is $0.5$. Since the planted clique setting has unequal-sized communities, the general result of Section 4.2 is applicable, and we demonstrate how the assumptions B1–B5 simplify in this setting. We have $\alpha_0 = 0$, since the communities are non-overlapping. For assumption B2, we have

$$\rho = \frac{\alpha_0+1}{\hat\alpha_{\min}} = \frac{n}{s}, \qquad n = \tilde\Omega(\rho^2) \ \Rightarrow\ s = \tilde\Omega(\sqrt n). \qquad (49)$$

For assumption B3, we have $\sigma_{\min}(P) = \Theta(1)$ and $\max_i (P\hat\alpha)_i \le s/n + q \le 2$, and thus assumption B3 simplifies to

$$\zeta := \left(\frac{\hat\alpha_{\max}}{\hat\alpha_{\min}}\right)^{1/2}\,\frac{\sqrt{\max_i (P\hat\alpha)_i}}{\sigma_{\min}(P)} = \tilde O\left(\frac{\sqrt n}{\rho}\right) \ \Rightarrow\ s = \tilde\Omega\left(n^{2/3}\right). \qquad (50)$$

The condition in (49) that $s = \tilde\Omega(n^{1/2})$ matches the computational lower bounds for recovering the clique [Feldman et al., 2012]. Unfortunately, the condition in (50) that $s = \tilde\Omega(n^{2/3})$ is worse. It is required for assumption B3 to hold, which in turn is needed to ensure the success of the tensor power method.
The whitening step is particularly sensitive to the condition number of the matrices being whitened (i.e., the matrices $F_A$, $F_B$, $F_C$ in our case, whose condition numbers depend on the ratio of the community sizes), which results in a weaker guarantee. Thus, our method does not perform very well when the community sizes are drastically different. It remains an open question whether our method can be improved in this setting. We conjecture that "peeling" ideas similar to Ailon et al. [2013], where the communities are recovered one by one, can improve our dependence on the ratio of community sizes.

4.3 Proof Outline

We now summarize the main techniques involved in proving Theorem 4.3; the details are in the Appendix. The main ingredient is the concentration of the adjacency matrix: since the edges are drawn independently conditioned on the community memberships, we establish that the adjacency matrix concentrates around its mean under the stated assumptions (see Appendix C.4). With this in hand, we can then establish the concentration of the various quantities used by our learning algorithm.

Step 1: Whitening matrices. We first establish concentration bounds on the whitening matrices $\hat W_A$, $\hat W_B$, $\hat W_C$ computed using empirical moments, described in Section 3.3.1. With this in hand, we can approximately recover the span of the matrix $F_A$, since $\hat W_A^\top F_A \operatorname{Diag}(\hat\alpha)^{1/2}$ is (approximately) a rotation matrix. The main technique employed is the matrix Bernstein inequality [Tropp, 2012, Thm. 1.4]. See Appendix C.2 for details.

Step 2: Tensor concentration bounds. Recall that we use the whitening matrices to obtain a symmetric orthogonal tensor. We establish that the whitened and symmetrized tensor concentrates around its mean. (Note that the empirical third-order tensor $T_{X\to\{A,B,C\}}$ tends to its expectation conditioned on $\Pi_A, \Pi_B, \Pi_C$ as $|X|\to\infty$.) This is done in several stages, and we carefully control the tensor perturbation bounds. See Appendix C.1 for details.

Step 3: Tensor power method analysis. We analyze the performance of Procedure 2 under empirical moments, employing the various improvements detailed in Section 3.3.2 to establish guarantees on the recovered eigen-pairs. This includes deriving a condition on the tensor perturbation bound under which the tensor power method succeeds, and establishing that there exist good initializers for the power method among the (whitened) neighborhood vectors. This allows us to obtain stronger guarantees for the tensor power method compared to the earlier analysis of Anandkumar et al. [2012b], and this analysis is crucial for obtaining state-of-the-art scaling bounds for guaranteed recovery (for the special case of the stochastic block model). See Appendix A for details.

Step 4: Thresholding of estimated community vectors. Step 3 provides guarantees for the recovery of each eigenvector in $\ell_2$ norm. A direct application of this result only allows us to obtain $\ell_2$-norm bounds for row-wise recovery of the community matrix $\Pi$. In order to strengthen the result to an $\ell_1$-norm bound, we threshold the estimated $\Pi$ vectors. Here, we exploit the sparsity of the Dirichlet draws and carefully control the contribution of the weak entries in the vector.
Finally, we establish perturbation bounds on $P$ through rather straightforward concentration-bound arguments. See Appendix B.2 for details.

Step 5: Support recovery guarantees. To simplify the argument, consider the stochastic block model. Recall that Procedure 3 readjusts the community membership estimates based on degree averaging. For each vertex, if we compute the average degree towards these "approximate communities", the result concentrates around the value $p$ for the correct community and around the value $q$ for a wrong community. Therefore, we can correctly identify the community memberships of all the nodes when $p - q$ is sufficiently large, as specified by A3. The argument extends easily to general mixed membership models. See Appendix B.4 for details.

4.4 Comparison with Previous Results

We now compare the results of this paper to our previous work [Anandkumar et al., 2012b] on the use of tensor-based approaches for learning various latent variable models, such as topic models, hidden Markov models (HMMs), and Gaussian mixtures. At a high level, the tensor approach is exploited in a similar manner in all these models (including the community model in this paper): the conditional-independence relationships of the model result in a low-rank tensor constructed from low-order moments under the given model. However, there are several important differences between the community model and the other latent variable models considered by Anandkumar et al. [2012b], which we list below. We also state precisely the algorithmic improvements proposed in this paper with respect to the tensor power method, and how they may be applicable to other latent variable models.

[Figure 2: Casting the community model as a topic model, we obtain conditional independence of the three views. (a) The community model as a topic model; (b) graphical model representation.]

4.4.1 Topic model vs. community model

Among the latent variable models studied by Anandkumar et al. [2012b], the topic model, viz., latent Dirichlet allocation (LDA), bears the closest resemblance to the MMSB model; in fact, the MMSB model was originally inspired by the LDA model. The analogy between the MMSB model and LDA is direct under our framework, and we describe it below. Recall that for learning MMSB models, we consider a partition of the nodes $\{X, A, B, C\}$ and the set of 3-stars from set $X$ to $A, B, C$. We can construct an equivalent topic model as follows: the nodes in $X$ form the "documents", and for each document $x\in X$, the neighborhood vectors $G^\top_{x,A}$, $G^\top_{x,B}$, $G^\top_{x,C}$ form the three "words" or "views" for that document. In each document $x\in X$, the community vector $\pi_x$ corresponds to the "topic vector", and the matrices $F_A$, $F_B$, $F_C$ correspond to the topic-word matrices. Note that the three views $G^\top_{x,A}$, $G^\top_{x,B}$, $G^\top_{x,C}$ are conditionally independent given the topic vector $\pi_x$. Thus, the community model can be cast as a topic model or a multi-view model; see Figure 2.

Although the community model can be viewed as a topic model, it has some important special properties which allow us to provide better guarantees.
The topic-word matrices $F_A$, $F_B$, $F_C$ are not arbitrary matrices: recall that $F_A := \Pi_A^\top P^\top$ (and similarly for $F_B$, $F_C$), so they are random matrices, and we can provide strong concentration bounds for these matrices by appealing to random matrix theory. Moreover, each of the views in the community model has additional structure, viz., the vector $G^\top_{x,A}$ has independent Bernoulli entries conditioned on the community vector $\pi_x$, while in a general multi-view model, only the conditional distribution of each view given the hidden topic vector is specified. This further allows us to provide specialized concentration bounds for the community model. Importantly, we can recover the community memberships (i.e., the topic vectors) accurately, while for a general multi-view model this cannot be guaranteed and we can only hope to recover the model parameters.

4.4.2 Improvements to tensor recovery guarantees in this paper

In this paper, we modify the tensor power method of Anandkumar et al. [2012b] and obtain better guarantees for the community setting. Recall that the two modifications are adaptive deflation and initialization using whitened neighborhood vectors. Adaptive deflation leads to a weaker gap condition for an initialization vector to succeed in efficiently estimating a tensor eigenvector. Initialization using whitened neighborhood vectors allows us to tolerate more noise in the estimated 3-star tensor, thereby improving our sample complexity result. We make this improvement precise below. If we directly applied the tensor power method of Anandkumar et al. [2012b] without these modifications, we would require stronger conditions on the sample complexity and edge connectivity. For simplicity, consider the homogeneous setting of Section 4.1. The conditions A2 and A3 would then need to be replaced with the following stronger conditions:

[A2'] Sample complexity: The number of samples satisfies $n = \tilde\Omega(k^4(\alpha_0+1)^2)$.

[A3'] Edge connectivity: The edge connectivity parameters $p, q$ satisfy

$$\frac{p-q}{\sqrt p} = \Omega\left(\frac{(\alpha_0+1)\,k^2}{\sqrt n}\right).$$

Thus, we obtain significant improvements in the recovery guarantees via algorithmic modifications and a careful analysis of concentration bounds. The guarantees derived in this paper are specific to the community setting, and we outlined above the special properties of the community model compared to a general multi-view model. However, when the documents of a topic model are sufficiently long, the word frequency vector within a document concentrates well, and our modified tensor method has better recovery guarantees in that setting as well. Thus, the improved tensor recovery guarantees derived in this paper are applicable in scenarios where we have access to better initialization vectors than simple random initialization.

5 Conclusion

In this paper, we presented a novel approach for learning overlapping communities based on tensor decomposition. We established that our method is guaranteed to recover the underlying community memberships correctly when the communities are drawn from a mixed membership stochastic block model (MMSB). Our method is also computationally efficient and requires only simple linear algebraic operations and tensor power iterations.
Moreover, our method is tight for the special case of the stochastic block model (up to poly-log factors), both in terms of sample complexity and the separation between edge connectivity within a community and across different communities.

We now note a number of interesting open problems and extensions. While we obtained tight guarantees for MMSB models with uniformly sized communities, our guarantees are weak when the community sizes are drastically different, such as in the planted clique setting, where we do not match the computational lower bound [Feldman et al., 2012]. The whitening step in the tensor decomposition method is particularly sensitive to the ratio of community sizes, and it is interesting to ask whether modifications can be made to our algorithm to provide tight guarantees under unequal community sizes. While this paper mostly dealt with the theoretical analysis of the tensor method for community detection, we note recent experimental results where the tensor method is deployed on graphs with millions of nodes with very good accuracy and running times [Huang et al., 2013]; indeed, the running times are more than an order of magnitude better than the state-of-the-art variational approach for learning MMSB models. The work of Huang et al. [2013] makes an important modification to make the method scalable, viz., the tensor decomposition is carried out through stochastic updates in parallel, unlike the serial batch updates considered here. Establishing theoretical guarantees for stochastic tensor decomposition is an important problem. Moreover, we have limited ourselves to the MMSB models, which assume a linear model of edge formation; this is not universally applicable. For instance, exclusionary relationships, where two nodes cannot be connected because of their memberships in certain communities, cannot be imposed in the MMSB model. Are there other classes of mixed membership models which do not suffer from this restriction, and yet are identifiable and amenable to learning? Moreover, the Dirichlet distribution in the MMSB model imposes constraints on the memberships across different communities. Can we incorporate mixed memberships with arbitrary correlations? The answers to these questions will further push the boundaries of tractable learning of mixed membership community models.

Acknowledgements

We thank the JMLR Action Editor Nathan Srebro and the anonymous reviewers for comments which significantly improved this manuscript. We thank Jure Leskovec for helpful discussions regarding various community models. Part of this work was done when AA and RG were visiting MSR New England. AA is supported in part by the Microsoft faculty fellowship, NSF Career award CCF-1254106, NSF Award CCF-1219234 and the ARO YIP Award W911NF-13-1-0084.

References

Ittai Abraham, Shiri Chechik, David Kempe, and Aleksandrs Slivkins. Low-distortion inference of latent similarities from a multiplex social network. CoRR, abs/1202.0922, 2012.

Nir Ailon, Yudong Chen, and Xu Huan. Breaking the small cluster barrier of graph clustering. arXiv preprint arXiv:1302.4549, 2013.

Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, June 2008.
Anandkumar, D. P . F oster, D . Hsu, S. M. Kak ade, and Y. Liu. Tw o svds suffice: Sp ectral d ecom- p ositions for probabilistic t opic mo deling and laten t dirichlet allo cation, 2012a . arXiv:1204.67 03. A. Anand kumar, R. Ge, D. Hsu, S . M. Kak ad e, and M. T elgarsky . T ensor deco mp ositio ns for laten t v ariable mo dels, 2012b. A. Anandkum ar, D. Hsu , and S.M. Kak ade. A Metho d of Momen ts for Mixtur e Models a nd Hidd en Mark o v Mo dels. In P r o c. of Conf. on L e arning The ory , J une 2012c. 36 Sanjeev Arora, Ron g Ge, Sushant Sac hdev a, and Gran t Sc ho eneb ec k. Finding o ve rlappin g comm u- nities in so cial netw orks: to wa rd a rigorous app r oac h. In Pr o c e e dings of the 13 th A CM Confer enc e on E le ctr onic Commer c e , 2012. Maria-Flo rina Balcan, Christian Borgs, Mark Brav erman, Jennifer T. Chay es, and Shang-Hua T eng. I like h er more than y ou: Self-determined comm unities. CoRR , abs/1201.489 9, 2012. P .J. Bic k el and A. Chen. A nonp arametric view of n et work mo d els and n ewman–girv an an d other mo dularities. Pr o c e e dings of the National A c ademy of Scienc es , 106(50):2 1068–21073, 2009. P .J. Bic kel, A. Chen, and E. Levina. T h e m etho d of momen ts and degree distributions for netw ork mo dels. The Annals of Statistics , 39(5):38– 59, 2011. Da vid M Blei. Probabilistic topic mo dels. Communic ations of the ACM , 55(4):77– 84, 2012. Da vid M. Blei, Andr ew Y. Ng, and Mic hael I . Jord an. Laten t dirichlet allo cation. Journal of Machine L e arning R e se ar ch , 3:993– 1022, Marc h 2003. B. Bollob´ as, S . J anson, and O . Riord an. T he phase transition in inhomogeneous random graphs. R andom Structur es & A lgorithms , 31(1):3–122 , 2007. S. Charles Brubak er and San tosh S. V empala. Random tensors and plan ted cliques. In RANDOM , 2009. Alain Celisse, Jean-Jacques Daudin, and Laurent Pierre. Consistency of maximum-lik eliho o d and v ariational estimators in th e sto chasti c blo c k mo del. Ele ctr onic Journal of Statistics , 6:18 47–1899 , 2012. S. Chatterjee and P . Diaconis. Es timating and unders tand ing exp onential rand om graph mo d els. Arxiv pr e print arxiv:1102.2650 , 2011. Kamalik a Ch audhuri, F an Chung, and Alexander Tsiatas. S p ectral clustering of graphs with general degrees in the extended planted partition mo del. Journal of Machine L e arning R ese ar ch , pages 1–23, 2012. Y udong Chen, S u ja y Sangha vi, and Huan Xu. Clustering spars e graphs. In A dvanc es i n N eur al Information Pr o c essing , 2012. Anne Condon and Ric hard M Karp. Algorithms for graph partitioning on the p lan ted partition mo del. In R andomization, Appr oximation, and Co mbinatorial Optimiza tion. A lgorithms and T e c hniqu e s , p ages 221–232 . Sp ringer, 1999. S. C urrarini, M.O. Jackson, and P . Pin. An economic mo del of friendsh ip: Homophily , m inorities, and segregatio n. E c onometric a , 77(4):1003 –1045, 2009. Vitaly F eldman, Elena Gr igorescu, Lev Reyzin, S an tosh V empala, and Ying Xiao. Statistica l algo- rithms and a lo wer b oun d for plan ted clique. Ele ctr onic Col lo quium on Comp utational Comp lexity (ECCC) , 19:64, 2012. K F erentio s. On tceb ycheff ’s t yp e inequalities. T r ab ajos de estad ´ ıstic a y de investigaci´ on op er ativa , 33(1): 125–132, 1982. 37 S.E. Fienber g, M.M. Mey er, and S.S. W asserman. S tatistic al analysis of m ultiple sociometric relations. Journal of the americ an Statistic al asso ciation , 80(389):51 –67, 1985. O. F rank and D. Strauss. Marko v graphs. 
Journal of the americ an Statistic al asso ciation , 81(395): 832–8 42, 1986. Alan M. F rieze and Ra vi Kannan. Q uic k appr o ximation to m atrices and applications. Combina- toric a , 19(2):175 –220, 1999. Alan M. F rieze and Ra vi Kannan. A new approac h to the plan ted clique problem. In FSTTCS , 2008. M. Girv an and M.E.J. Newman. Comm unit y structure in so cial and b iologi cal net w orks. Pr o c e e dings of the N ational A c ademy of Scienc es , 99(12):782 1–7826, 2002 . Alex Gittens and Mic hael W Mahoney . Revisiting the nystrom metho d for imp ro v ed large-scale mac hine learning. arXiv pr eprint arXiv:1303.1849 , 2013. P . Gopalan, D. Mimno, S. Gerrish, M. F reedman, and D. Blei. Scalable inference of ov erlappin g comm unities. In A dvanc es in N e ur al Information P r o c essing Systems 25 , pages 225 8–2266, 2012. C. Hillar and L.-H. Lim. Most tensor problems are NP h ard , 2012. Matt Hoffman, Da vid M Blei, Chong W an g, and John Paisley . Sto c hastic v ariational in ference. JMLR , 14:1303–13 47, 2012. P .W. Holland and S. Leinhardt. An exp onen tial family of probabilit y distr ibutions for directed graphs. Journal of the americ an Statistic al asso c i ation , 76(373) :33–50, 1981. P .W. Holla nd, K.B. Lask ey , and S. Leinhardt. Sto c hastic blo c kmo dels: firs t steps. So cial networks , 5(2):1 09–137, 1983. F. Hu ang, U.N. Niranjan, M. Hakee m, and A. Anand k u mar. F ast Detection of O v erlapping Com- m unities via Onlin e T ensor Metho ds. ArXiv 1309.07 87 , Sept. 2013. Joseph J´ aJ´ a. An intr o duction to p ar al lel algorithms . Ad dison W esley Longman Pu blishing Co., Inc., 1992. A. Jalali, Y. Chen, S. Sangha vi, and H. Xu. Clustering partially observed graphs via conv ex optimization. arXiv pr eprint arXiv:1104.480 3 , 2011. Mic h ael J. Kearns and Umesh V. V azirani. An Intr o duction to Computationa l L e arning The ory . MIT Pr ess., Cambridge, MA, 1994. T. G. Kolda and B. W. Bader. T ensor decomp ositions and app licatio ns. SIAM r eview , 51(3):4 55, 2009. T. G. Kolda and J. R. Ma yo. S hifted p o w er metho d for computing tensor eigenpairs. SIAM Journal on M atrix Analysis and Applic ations , 32(4):1095– 1124, Octob er 2011. P .F. Lazarsfeld, R.K. Merton, et al. F riendship as a so cial pro cess: A substant iv e and metho d olog- ical analysis. F r e e dom and c ontr ol in mo dern so ciety , 18(1):18–6 6, 1954 . 38 Zhouc hen Lin , Minming Chen , and Yi Ma. T he augmen ted lagrange m ultiplier metho d f or exact reco very of corrup ted lo w-rank matrices. arXiv pr eprint arXiv:1009.5055 , 2010. L. Lov´ asz. V ery large graphs. Curr ent Developments in Mathematics , 2008:67 –128, 2009. M. McPherson, L. Smith-Lo vin, and J.M. C o ok. Birds of a feather: Homophily in so cial netw orks. Annual r e v iew of so c i olo gy , p ages 415–444, 2001. F. McSh erry . S p ectral p artitioning of random graphs. In FOCS , 2001. J.L. Moreno. Who shal l survive?: A new appr o ach to the pr oblem of human interr elations. Nervous and Men tal Disease Publish ing Co, 1934. Elc hanan Mossel, J o e Neeman, and Allan S ly . Sto c hastic blo c k mo dels and reconstruction. arXiv pr eprint arXiv:1202.149 9 , 2012. G. Pa lla, I. Der´ en yi, I. F ark as, and T. Vicsek. Unco v ering the o ve rlappin g communit y structure of complex n et works in n ature and so ciet y . N atur e , 435(7043):8 14–818, 2005. K. Pearson. Contributions to th e mathematical theory of evolutio n. P hilosophic al T r ansactions of the R oyal So ciety, L ondon, A . , page 71, 1894. A. Rinaldo, S.E. 
Fien b erg, an d Y. Zh ou. On the geometry of discrete exp onen tial families with application to exp onen tial rand om graph mo d els. E le ctr onic Journal of Statistics , 3:446–48 4, 2009. T.A.B. S nijders and K. Nowic ki. Estimatio n and prediction f or sto c hastic b lo ckmodels for graphs with latent blo ck structur e. Journal of Classific ation , 14(1): 75–100, 1997. G.W. Stew art and J. Sun . Matrix p erturb ation the ory , v olume 175. Academic press New Y ork, 1990. M. T elgarsky . Diric h let draws are sparse with h igh pr obabilit y . ArXiv:1301.49 17 , 2012. J.A. T ropp. User-friend ly tail b ounds for sums o f random matrices. F oundations of Computational Mathematics , 12(4):3 89–434, 2012. Y.J. W ang and G.Y. W ong. Sto c hastic blo c kmo dels for directed graphs. Journal of the Americ an Statistic al Asso ciation , 82(397) :8–19, 1987. H.C. White, S.A. Bo orman, and R.L. Breiger. So cial structure from m ultiple net wo rks. i. b lock- mo dels of r oles and p ositions. Americ an journal of so ciolo gy , pages 730–78 0, 1976. E.P . Xing, W. F u, and L. Song. A state-space mixed mem b ership blo c k m o d el for d ynamic n et w ork tomograph y . The Annals of Applie d Statistics , 4(2):53 5–566, 2010. 39 A T ensor P o w er Metho d A n alysis In this section, w e lev erage on the p erturbation analysis for tensor p ow er metho d in Anan d kumar et al. [2012b]. As discuss ed in S ection 3.3.2, we prop ose the follo wing mo difications to the tensor p o w er metho d and obtain guarantees b elo w for the mo d ified method . T h e tw o main mo d ifi cations are: (1) w e mo d ify the tensor defl ation pro cess in th e robust p o wer metho d in Pro cedure 2. Rather than a fixed deflation step after obtaining an e stimate of the ei gen v alue-eigen v ector pair, in this pap er, w e deflate adaptiv ely dep endin g on the current estimate, and (2)rather than selecting random initial- ization v ectors, as in Anandkumar et al. [2012b], w e initialize with vec tors obtai ned from adjacency matrix. Belo w in Section A.1, w e establish success of the mo d ifi ed tensor metho d un d er “g o o d” initial- ization vect ors, as defined b elow. This inv olve s impr o v ed error b ound s for the mo dified deflation pro cedure pro vided in Section A.2. In Section C.5, we su bsequent ly establish that under the Diric h- let d istribution (for small α 0 ), we obtain “go o d” initialization ve ctors. A.1 Analysis under goo d initialization v ectors W e no w sh o w that wh en “go o d” initialization vect ors are input to tensor p ow er metho d in Pr o ce- dure 2, we obtain go o d estimates of eigen-pairs under appropr iate c hoice of num b er of iterations N and sp ectral norm ǫ of tensor p ertur b ation. Let T = P i ∈ [ k ] λ i v i , where v i are orthonormal vect ors and λ 1 ≥ λ 2 ≥ . . . λ k . Let ˜ T = T + E b e the p erturb ed tensor with k E k ≤ ǫ . Recall that N denotes the n umb er of iterations of the tensor p o we r metho d. W e call a n initialization v ector u to b e ( γ , R 0 )-goo d if there exists v i suc h that h u, v i i > R 0 and | h u, v i i | − max j γ | h u, v i i | . (51) Cho ose γ = 1 / 100. Theorem A.1. Ther e exists universal c onstants C 1 , C 2 > 0 such that the fol lowing holds. ǫ ≤ C 1 · λ min R 2 0 , N ≥ C 2 ·  log( k ) + log log  λ max ǫ  , (52 ) Assume ther e is at le ast one go o d initialization ve ctor c orr esp onding to e ach v i , i ∈ [ k ] . 
The p ar am- eter ξ for cho osing deflation ve ctors i n e ach iter ation of the tensor p ower metho d in Pr o c e dur e 2 is chosen as ξ ≥ 25 ǫ . We obtain eigenvalue-eige nve ctor p airs ( ˆ λ 1 , ˆ v 1 ) , ( ˆ λ 2 , ˆ v 2 ) , . . . , ( ˆ λ k , ˆ v k ) such that ther e exists a p ermutation π on [ k ] with k v π ( j ) − ˆ v j k ≤ 8 ǫ/λ π ( j ) , | λ π ( j ) − ˆ λ j | ≤ 5 ǫ, ∀ j ∈ [ k ] , and       T − k X j =1 ˆ λ j ˆ v ⊗ 3 j       ≤ 55 ǫ. 40 Remark 1 (need for adapt ive deflation) : W e now compare the ab o v e resu lt w ith the result in [Anandku mar et al., 2012b, Thm. 5.1], wh ere similar guarantee s are obtained for a simpler v ersion o f the tensor p ow er m ethod without an y adap tive deflation and using random initialization. The main difference is in our requirement of the gap γ in (51) for an in itializa tion v ector is wea ke r than th e gap requir ement in [Anandkumar et al., 2012b, Thm. 5.1]. This is du e to the use of adaptiv e deflation in this pap er. Remark 2 (need for non-random initialization): In this pap er, w e emp lo y whitened neigh- b orho o d v ectors generated u n der th e MMSB mo del for initializat ion, wh ile [Anandkumar et al., 2012b, Thm. 5.1] assumes a ran d om initializat ion. Under rand om initializatio n, we obtain R 0 ∼ 1 / √ k (with p oly( k ) trials), wh ile for initializa tion u sing whitened neighborh o o d v ectors, we sub - sequen tly establish that R 0 = Ω(1) is a constan t, when num b er of samples n is large enough. W e also establish that the gap requirement in (51) is satisfied f or the c hoice of γ = 1 / 100 ab o v e. S ee Lemma C.9 for details. Thus, w e can tolerate m uch larger p erturbation ǫ of the t hird order moment tensor, when n on-random initializations are employ ed. Pr o of: The pr o of is on lines of th e p r o of of [Anandku mar et al., 2012b, Thm. 5. 1] but here, we consider the mo dified defl ation pro cedure, which impr ov es the condition on ǫ in (52). W e pr ovide the full pro of b elo w for completeness. W e pro v e b y induction on i , the n umber of eige npairs estimated so far b y Pro cedur e 2. Assu me that th ere exists a p ermutation π on [ k ] such that the follo wing assertions h old. 1. F or all j ≤ i , k v π ( j ) − ˆ v j k ≤ 8 ǫ/λ π ( j ) and | λ π ( j ) − ˆ λ j | ≤ 1 2 ǫ . 2. D ( u, i ) is the set of deflated v ectors giv en curr en t estimate of the p o wer metho d is u ∈ S k − 1 : D ( u, i ; ξ ) := { j : | ˆ λ i ˆ θ i | ≥ ξ } ∩ [ i ] , where ˆ θ i := h u, ˆ v i i . 3. T h e error tensor ˜ E i +1 ,u :=  ˆ T − X j ∈ D ( u,i ; ξ ) ˆ λ j ˆ v ⊗ 3 j  − X j / ∈ D ( u,i ; ξ ) λ π ( j ) v ⊗ 3 π ( j ) = E + X j ∈ D ( u,i ; ξ )  λ π ( j ) v ⊗ 3 π ( j ) − ˆ λ j ˆ v ⊗ 3 j  satisfies k ˜ E i +1 ,u ( I , u, u ) k ≤ 56 ǫ, ∀ u ∈ S k − 1 ; (53) k ˜ E i +1 ,u ( I , u, u ) k ≤ 2 ǫ, ∀ u ∈ S k − 1 s.t. ∃ j ≥ i + 1  ( u ⊤ v π ( j ) ) 2 ≥ 1 − (168 ǫ/λ π ( j ) ) 2 . ( 54) W e tak e i = 0 a s the base case, so w e can ignore t he fi rst assertio n, and just observe that f or i = 0, D ( u, 0; ξ ) = ∅ and thus ˜ E 1 ,u = ˆ T − k X j =1 λ i v ⊗ 3 i = E , ∀ u ∈ S k − 1 . W e ha ve k ˜ E 1 k = k E k = ǫ , and therefore the second assertion holds. 41 No w fix some i ∈ [ k ], and assume as the inductiv e hyp othesis. The p o wer iterations n o w tak e a s ubset of j ∈ [ i ] for deflation, dep ending on the current estimate. S et C 1 := min  (56 · 9 · 102) − 1 , (100 · 168) − 1 , ∆ ′ from L emma A.1 with ∆ = 1 / 50  . 
(55) F or all go o d initializat ion ve ctors which are γ -separated relativ e to π ( j max ), we ha ve (i) | θ ( τ ) j max , 0 | ≥ R 0 , and (ii) that b y [Anandku m ar et al. , 2012b, Lemma B.4] (usin g ˜ ǫ/p := 2 ǫ , κ := 1, and i ∗ := π ( j max ), an d p ro viding C 2 ), | ˜ T i ( θ ( τ ) N , θ ( τ ) N , θ ( τ ) N ) − λ π ( j max ) | ≤ 5 ǫ (notice by d efinition that γ ≥ 1 / 100 implies γ 0 ≥ 1 − 1 / (1 + γ ) ≥ 1 / 101, th us it follo ws f rom the b ounds on th e other qu an tities that ˜ ǫ = 2 pǫ ≤ 56 C 1 · λ min R 2 0 < γ 0 2(1+8 κ ) · ˜ λ min · θ 2 i ∗ , 0 as necessary). Therefore θ N := θ ( τ ∗ ) N m ust satisfy ˜ T i ( θ N , θ N , θ N ) = max τ ∈ [ L ] ˜ T i ( θ ( τ ) N , θ ( τ ) N , θ ( τ ) N ) ≥ max j ≥ i λ π ( j ) − 5 ǫ = λ π ( j max ) − 5 ǫ. On the other h and, by the triangle inequalit y , ˜ T i ( θ N , θ N , θ N ) ≤ X j ≥ i λ π ( j ) θ 3 π ( j ) ,N + | ˜ E i ( θ N , θ N , θ N ) | ≤ X j ≥ i λ π ( j ) | θ π ( j ) ,N | θ 2 π ( j ) ,N + 56 ǫ ≤ λ π ( j ∗ ) | θ π ( j ∗ ) ,N | + 56 ǫ where j ∗ := arg max j ≥ i λ π ( j ) | θ π ( j ) ,N | . Therefore λ π ( j ∗ ) | θ π ( j ∗ ) ,N | ≥ λ π ( j max ) − 5 ǫ − 56 ǫ ≥ 4 5 λ π ( j max ) . Squaring b oth sid es and using the fact that θ 2 π ( j ∗ ) ,N + θ 2 π ( j ) ,N ≤ 1 f or any j 6 = j ∗ ,  λ π ( j ∗ ) θ π ( j ∗ ) ,N  2 ≥ 16 25  λ π ( j max ) θ π ( j ∗ ) ,N  2 + 16 25  λ π ( j max ) θ π ( j ) ,N  2 ≥ 16 25  λ π ( j ∗ ) θ π ( j ∗ ) ,N  2 + 16 25  λ π ( j ) θ π ( j ) ,N  2 whic h in turn implies λ π ( j ) | θ π ( j ) ,N | ≤ 3 4 λ π ( j ∗ ) | θ π ( j ∗ ) ,N | , j 6 = j ∗ . This means th at θ N is (1 / 4)-separated relativ e to π ( j ∗ ). Also, observe that | θ π ( j ∗ ) ,N | ≥ 4 5 · λ π ( j max ) λ π ( j ∗ ) ≥ 4 5 , λ π ( j max ) λ π ( j ∗ ) ≤ 5 4 . Therefore by [Anandkum ar et al., 2012b, L emm a B.4] (using ˜ ǫ/p := 2 ǫ , γ := 1 / 4, and κ := 5 / 4), executing another N p ow er iterations s tarting fr om θ N giv es a vecto r ˆ θ that satisfies k ˆ θ − v π ( j ∗ ) k ≤ 8 ǫ λ π ( j ∗ ) , | ˆ λ − λ π ( j ∗ ) | ≤ 5 ǫ. 42 Since ˆ v i = ˆ θ and ˆ λ i = ˆ λ , the first assertion of the ind uctiv e hypothesis is satisfied, a s w e can mo dify the p erm utation π by swa pp in g π ( i ) and π ( j ∗ ) w ithout affecting the v alues of { π ( j ) : j ≤ i − 1 } (recall j ∗ ≥ i ). W e n o w argue that ˜ E i +1 ,u has the requir ed prop er ties to complete th e indu ctiv e s tep. By Lemma A.1 (u s ing ˜ ǫ := 5 ǫ , ξ = 5˜ ǫ = 25 ǫ and ∆ := 1 / 50 , th e latter pro viding one up p er b ou n d on C 1 as p er (55)), w e ha ve for any unit vect or u ∈ S k − 1 ,       X j ≤ i  λ π ( j ) v ⊗ 3 π ( j ) − ˆ λ j ˆ v ⊗ 3 j   ( I , u, u )      ≤  1 / 50 + 100 i X j =1 ( u ⊤ v π ( j ) ) 2  1 / 2 5 ǫ ≤ 5 5 ǫ. (56) Therefore by the triangle in equalit y , k ˜ E i +1 ( I , u, u ) k ≤ k E ( I , u, u ) k +       X j ≤ i  λ π ( j ) v ⊗ 3 π ( j ) − ˆ λ j ˆ v ⊗ 3 j   ( I , u, u )      ≤ 56 ǫ. Th us the b ound (53) h olds. T o p ro v e that (54) holds, for an y unit vec tor u ∈ S k − 1 suc h that there exists j ′ ≥ i + 1 with ( u ⊤ v π ( j ′ ) ) 2 ≥ 1 − (168 ǫ/λ π ( j ′ ) ) 2 . W e hav e (via the second b ound on C 1 in (55) and th e corresp onding assumed b ound ǫ ≤ C 1 · λ min R 2 0 ) 100 i X j =1 ( u ⊤ v π ( j ) ) 2 ≤ 100  1 − ( u ⊤ v π ( j ′ ) ) 2  ≤ 100  168 ǫ λ π ( j ′ )  2 ≤ 1 50 , and therefore  1 / 50 + 100 i X j =1 ( u ⊤ v π ( j ) ) 2  1 / 2 5 ǫ ≤ (1 / 50 + 1 / 50) 1 / 2 5 ǫ ≤ ǫ. By the triangle inequalit y , w e h a v e k ˜ E i +1 ( I , u, u ) k ≤ 2 ǫ . 
Th erefore (54) holds, so the second asser- tion of the inductive hyp othesis holds. W e conclude that b y the indu ction pr inciple, there exists a p ermutation π su c h that t wo assertions hold for i = k . F rom the last in duction step ( i = k ), it is also clear from (56) that k T − P k j =1 ˆ λ j ˆ v ⊗ 3 j k ≤ 55 ǫ . This completes the p ro of of the theorem.  A.2 Deflation Analysis Lemma A.1 (Deflati on analysis) . L et ˜ ǫ > 0 and let { v 1 , . . . , v k } b e an ortho normal b asis for R k and λ i ≥ 0 for i ∈ [ k ] . L et { ˆ v 1 , . . . , ˆ v k } ∈ R k b e a set of u ni t ve ctors and ˆ λ i ≥ 0 . Define thir d or der tensor E i such that E i := λ i v ⊗ 3 i − ˆ λ i ˆ v ⊗ 3 i , ∀ i ∈ k. F or some t ∈ [ k ] and a unit ve c tor u ∈ S k − 1 such that u = P i ∈ [ k ] θ i v i and ˆ θ i := h u, ˆ v i i , we have for i ∈ [ t ] , | ˆ λ i ˆ θ i | ≥ ξ ≥ 5 ˜ ǫ, | ˆ λ i − λ i | ≤ ˜ ǫ, k ˆ v i − v i k ≤ min { √ 2 , 2˜ ǫ/λ i } , 43 then, the fol lowing holds     t X i =1 E i ( I , u, u )     2 2 ≤  4(5 + 11˜ ǫ /λ min ) 2 + 128 (1 + ˜ ǫ/λ min ) 2 (˜ ǫ /λ min ) 2  ˜ ǫ 2 t X i =1 θ 2 i + 64(1 + ˜ ǫ/λ min ) 2 ˜ ǫ 2 + 204 8(1 + ˜ ǫ/λ min ) 2 ˜ ǫ 2 . In p articular, for any ∆ ∈ (0 , 1) , ther e exists a c onstant ∆ ′ > 0 (dep ending only on ∆ ) such that ˜ ǫ ≤ ∆ ′ λ min implies     t X i =1 E i ( I , u, u )     2 2 ≤  ∆ + 100 t X i =1 θ 2 i  ˜ ǫ 2 . Pr o of: T h e p ro of is on lines of deflation analysis in [Anand kumar et al., 2012 b , Lemma B.5], but we impro ve the b ounds based on additional p rop erties of v ector u . F rom Anandku mar et al. [2012b], we hav e that f or all i ∈ [ t ], and an y unit ve ctor u ,     t X i =1 E i ( I , u, u )     2 2 ≤  4(5 + 11˜ ǫ/λ min ) 2 + 128 (1 + ˜ ǫ/λ min ) 2 (˜ ǫ/λ min ) 2  ˜ ǫ 2 t X i =1 θ 2 i + 64(1 + ˜ ǫ/λ min ) 2 ˜ ǫ 2 t X i =1 (˜ ǫ/λ i ) 2 + 204 8(1 + ˜ ǫ/λ min ) 2 ˜ ǫ 2  t X i =1 (˜ ǫ /λ i ) 3  2 . (57) Let ˆ λ i = λ i + δ i and ˆ θ i = θ i + β i . W e h a ve δ i ≤ ˜ ǫ and β i ≤ 2˜ ǫ/λ i , and that | ˆ λ i ˆ θ i | ≥ ξ . || ˆ λ i ˆ θ i | − | λ i θ i || ≤ | ˆ λ i ˆ θ i − λ i θ i | ≤ | ( λ i + δ i )( θ i + β i ) − λ i θ i | ≤ | δ i θ i + λ i β i + δ i β i | ≤ 4˜ ǫ. Th us, we ha v e that | λ i θ i | ≥ 5˜ ǫ − 4˜ ǫ = ˜ ǫ . Thus P t i =1 ˜ ǫ 2 /λ 2 i ≤ P i θ 2 i ≤ 1. Substituting in (57), w e ha v e the result.  B Pro of of T h eorem 4.3 W e no w prov e the main resu lts on error b oun d s claimed in Th eorem 4.3 for the estimated comm unit y v ectors ˆ Π and estimated b lo c k probabilit y matrix ˆ P in Algorithm 1. Belo w, we first sho w th at the tensor p erturbation b ounds claimed in Lemma 4.2 holds. Notation: Let k T k denote the sp ectral norm for a tensor T (o r in sp ecial cases a matrix or a v ector). Let k M k F denote the F rob en iu s norm. Let | M 1 | denote the op erator ℓ 1 norm, i.e., the maxim um ℓ 1 norm of its columns and k M k ∞ denote the maximum ℓ 1 norm of its r ows. Let κ ( M ) denote the condition num b er, i.e., k M k σ min ( M ) . 44 B.1 Pro of of Lemma 4.2 F rom Theorem A.1 in App endix A, we see t hat the tensor p o wer met ho d returns eigen v alue-v ector pair ( ˆ λ i , ˆ Φ i ) suc h that there exists a p erm utation θ w ith max i ∈ [ k ] k ˆ Φ i − Φ θ ( i ) k ≤ 8 b α 1 / 2 max ε T , (58) and max i | λ i − b α − 1 / 2 θ ( i ) | ≤ 5 ε T , (59) when the p erturbation of the tensor is small enough, according to ε T ≤ C 1 b α − 1 / 2 max r 2 0 , (60) for some constant C 1 , when in itialized w ith a ( γ , r 0 ) go od vecto r. 
With the ab o ve resu lt, t w o asp ects need to b e established: (1) the whitened tens or p ertur b a- tion ǫ T is as claimed, (2) the condition in (60) is satisfied and (3) there exist go o d in itializ ation v ectors wh en wh itened neighborh oo d vect ors are emplo y ed. Th e tensor p erturbation b ound ǫ T is established in Theorem C.1 in App endix C.1. Lemma C.9 establishes that wh en ζ = O ( √ nr 2 0 /ρ ), w e h av e goo d initializa tion v ectors with Recall r 2 0 = Ω (1 / b α max k ) when α 0 > 1 and r 2 0 = Ω (1) for α 0 ≤ 1, and γ = 1 / 100 with p robabilit y 1 − 9 δ u nder Diric hlet distribution, when n = ˜ Ω  α − 1 min k 0 . 43 log( k /δ )  , (61) whic h is satisfied since we assume b α − 2 min < n . W e n o w sh ow that the condition in (60) is satisfied under the assumptions B1-B4. S ince ǫ T is giv en b y ε T = ˜ O ρ √ n · ζ b α 1 / 2 max ! , the condition in (60 ) is equiv alen t to ζ = O ( √ nr 2 0 /ρ ). Therefore w hen ζ = O ( √ nr 2 0 /ρ ), the assumptions of T h eorem A.1 are satisfied. B.2 Reconstruction of Π after tensor p o w er metho d Let ( M ) i and ( M ) i denote t he i th ro w a nd i th column in matrix M resp ectiv ely . Let Z ⊆ A c denote an y sub s et of n o d es not in A , considered in P r o cedure LearnPartitio n Communit y . Define ˜ Π Z := Diag ( λ ) − 1 Φ ⊤ ˆ W ⊤ A G ⊤ Z,A . (62) Recall that the fin al estimate ˆ Π Z is obtained by thresh oldin g ˜ Π Z elemen t-wise with thr eshold τ in Pro cedure 1. W e fi r st analyze p erturbation of ˜ Π Z . Lemma B.1 (Reconstruction Guarantee s for ˜ Π Z ) . Assuming L emma 4.2 holds and the tensor p ower metho d r e c overs eigenve ctors and eigenvalues u p to the guar ante e d err ors, we h ave with 45 pr ob ability 1 − 122 δ , ε π := max i ∈ Z k ( ˜ Π Z ) i − (Π Z ) i k = O ε T b α 1 / 2 max  b α max b α min  1 / 2 k Π Z k ! , = O ρ · ζ · b α 1 / 2 max  b α max b α min  1 / 2 ! wher e ε T is give n by (70) . Pr o of: W e h a v e ( ˜ Π Z ) i = λ − 1 i ((Φ) i ) ⊤ ˆ W ⊤ A G ⊤ Z,A . W e w ill no w u se p erturbation b ounds for eac h of the terms to get the resu lt. The fi rst term is k Diag ( λ i ) − 1 − Diag( b α 1 / 2 i ) k · k Diag ( b α 1 / 2 ) ˜ F ⊤ A k · k ˜ F A k · k Π Z k ≤ 5 ε T b α max b α − 1 / 2 min (1 + ε 1 ) 2 k Π Z k from th e fact that k Diag ( b α 1 / 2 ) ˜ F ⊤ A k ≤ 1 + ε 1 , wh ere ε 1 is giv en by (85). The second term is k Diag ( b α 1 / 2 ) k · k (Φ) i − b α 1 / 2 i ( ˜ F A ) i k · k ˜ F A k · k Π Z k ≤ 8 b α max ε T b α − 1 / 2 min (1 + ε 1 ) k Π Z k The th ird term is k b α 1 / 2 i k · k ( ˆ W ⊤ A − W ⊤ A ) F A Π Z k ≤ b α 1 / 2 max b α − 1 / 2 min k Π Z k ǫ W (63) ≤ O  b α max b α min  1 / 2 ε T b α 1 / 2 min k Π Z k ! , (64) from L emma C.1 and finally , we hav e k b α 1 / 2 i k · k W A k · k G ⊤ Z,A − F A Π Z k ≤ O b α 1 / 2 max √ α 0 + 1 b α min σ min ( P ) r (max i ( P b α ) i )(1 + ε 2 + ε 3 ) log k δ ! (65) ≤ O  b α max b α min  1 / 2 ε T √ α 0 + 1(1 + ε 2 + ε 3 ) r log k δ ! (66) from L emma C.6 and Lemma C.7. The th ird term in (64) dominates the last term in (66) since ( α 0 + 1) log k /δ < n b α min (due to assumption B2 on scaling of n ).  W e no w sho w that if w e threshold the entries of ˜ Π Z , the the resulting matrix ˆ Π Z has ro ws close to those in Π Z in ℓ 1 norm. 46 Lemma B.2 (Guarante es after thresholding) . F or ˆ Π Z := Thres( ˜ Π Z , τ ) , wher e τ is the thr eshold, we have with pr ob ability 1 − 2 δ , that ε π ,ℓ 1 := max i ∈ [ k ] | ( ˆ Π Z ) i − (Π Z ) i | 1 = O √ nη ε π r log 1 2 τ 1 − s 2 log ( k /δ ) nη log(1 / 2 τ ) ! 
+ nη τ + r ( nη + 4 τ 2 ) log k δ + ε 2 π τ ! , wher e η = b α max when α 0 < 1 and η = α max when α 0 ∈ [1 , k ) . Remark 1: The ab o v e guarantee on ˆ Π Z is stronger than for ˜ Π Z in Lemm a B.1 s in ce this is an ℓ 1 guaran tee on the r ows compared to ℓ 2 guaran tee on ro ws for ˜ Π Z . Remark 2: When τ is c hosen as τ = Θ( ε π √ nη ) = Θ ρ 1 / 2 · ζ · b α 1 / 2 max n 1 / 2 · b α min ! , w e ha ve that max i ∈ [ k ] | ( ˆ Π Z ) i − (Π Z ) i | 1 = ˜ O ( √ nη · ε π ) = ˜ O  n 1 / 2 · ρ 3 / 2 · ζ · b α max  Pr o of: Let S i := { j : ˆ Π Z ( i, j ) > 2 τ } . F or a v ector v , let v S denote the su b -v ector by considerin g en tries in set S . W e now hav e | ( ˆ Π Z ) i − (Π Z ) i | 1 ≤ | ( ˆ Π Z ) i S i − (Π Z ) i S i | 1 + | (Π Z ) i S c i | 1 + | ( ˆ Π Z ) i S c i | 1 Case α 0 < 1 : F r om Lemma C .10, we hav e P [Π( i, j ) ≥ 2 τ ] ≤ 8 b α i log(1 / 2 τ ). Since Π( i, j ) are indep end ent for j ∈ Z , we ha v e from multiplic ativ e Chern off b ound [Kearns and V azirani, 1994, Thm 9.2], that with pr obabilit y 1 − δ , max i ∈ [ k ] | S i | < 8 n b α max log  1 2 τ  1 − s 2 log ( k /δ ) n b α i log(1 / 2 τ ) ! . W e ha ve | ( ˜ Π Z ) i S i − (Π Z ) i S i | 1 ≤ ε π | S i | 1 / 2 , and the i th ro ws of ˜ Π Z and ˆ Π Z can differ on S i , we ha ve | ˜ Π Z ( i, j ) − ˆ Π Z ( i, j ) | ≤ τ , for j ∈ S i , and n umb er of suc h terms is at most ε 2 π /τ 2 . Thus, | ( ˜ Π Z ) i S i − ( ˆ Π Z ) i S i | 1 ≤ ε 2 π τ . F or the other term, from L emm a C.10, we ha ve E [Π Z ( i, j ) · δ (Π Z ( i, j ) ≤ 2 τ )] ≤ b α i (2 τ ) . 47 Applying Bernstein’s b oun d w e ha v e with probability 1 − δ max i ∈ [ k ] X j ∈ Z Π Z ( i, j ) · δ (Π Z ( i, j ) ≤ 2 τ ) ≤ n b α max (2 τ ) + r 2( n b α max + 4 τ 2 ) log k δ . F or ˆ Π i S c i , w e fur th er divide S c i in to T i and U i , where T i := { j : τ / 2 < Π Z ( i, j ) ≤ 2 τ } and U i := { j : Π Z ( i, j ) ≤ τ / 2 } . In the set T i , using similar argument we kn ow | (Π Z ) i T i − ( ˜ Π Z ) i T i | 1 ≤ O ( ε π p n b α max log 1 /τ ), therefore | ˆ Π i T i | 1 ≤ | ˜ Π i T i | 1 ≤ | Π i T i − ˜ Π i T i | 1 + | Π i S c i | 1 ≤ O ( ε π p n b α max log 1 /τ ) . Finally , for index j ∈ U i , in order f or ˆ Π Z ( i, j ) b e p ositiv e, it is r equired that ˜ Π Z ( i, j ) − Π Z ( i, j ) ≥ τ / 2. In this case, w e hav e | ( ˆ Π Z ) i U i | 1 ≤ 4 τ    ( ˜ Π Z ) i U i − Π i U i    2 ≤ 4 ε 2 π τ . Case α 0 ∈ [1 , k ) : F rom Lemma C.10, we see that the r esults hold wh en w e replace b α max with α max .  B.3 Reconstruction of P after tensor p ow er metho d Finally we w ould like to use the comm u nit y v ectors Π an d the adjacency matrix G to estimate the P matrix. Recall that in the generativ e mo del, w e ha v e E [ G ] = Π ⊤ P Π. Thus, a straigh tforwa rd estimate is to use ( ˆ Π † ) ⊤ G ˆ Π † . Ho we ve r, our guaran tees on ˆ Π are not strong enough to control the error on ˆ Π † (since we only h a v e ro w-wise ℓ 1 guaran tees). W e p rop ose an alternativ e estimator ˆ Q for ˆ Π † and use it to fi n d ˆ P in Algorithm 1. Recall th at the i -th row of ˆ Q is given by ˆ Q i := ( α 0 + 1) ˆ Π i | ˆ Π i | 1 − α 0 n ~ 1 ⊤ . Define Q u sing exact communities, i.e. Q i := ( α 0 + 1) Π i | Π i | 1 − α 0 n ~ 1 ⊤ . W e show b elo w th at ˆ Q is close to Π † , and therefore, ˆ P := ˆ Q ⊤ G ˆ Q is close to P w.h.p. Lemma B.3 (Reconstruction of P ) . With pr ob ability 1 − 5 δ , ε P := max i,j ∈ [ n ] | ˆ P i,j − P i,j | ≤ O ( α 0 + 1) 3 / 2 ε π ( P max − P min ) √ n b α − 1 min b α 1 / 2 max log nk δ ! 
48 Remark: If we define a new matrix Q ′ as ( Q ′ ) i := α 0 +1 n b α i Π i − α 0 n ~ 1 ⊤ , then E Π [ Q ′ Π ⊤ ] = I . Belo w, w e sh o w that Q ′ is close to Q since E [ | Π i | 1 ] = n b α i and thus the ab o ve r esult holds. W e require Q to b e norm alized by | Π i | 1 in order to ensure that the first term of Q has equal column n orms, whic h will b e used in our p ro ofs su b sequen tly . Pr o of: The pr o of goes in thr ee steps: P ≈ Q Π ⊤ P Π Q ⊤ ≈ QGQ ⊤ ≈ ˆ QG ˆ Q ⊤ . Note th at E Π [Π Q ⊤ ] = I and by Berns tein’s b ound, we can claim that Π Q ⊤ is close to I and can sho w that the i -th r o w of Q Π ⊤ satisfies ∆ i := | ( Q Π ⊤ ) i − e ⊤ i | 1 = O k s log  nk δ  b α max b α min 1 √ n ! with p robabilit y 1 − δ . Moreo v er, | (Π ⊤ P Π Q ⊤ ) i,j − (Π ⊤ P ) i,j | ≤ | (Π ⊤ P ) i (( Q ) j − e j ) | = | (Π ⊤ P ) i ∆ j | ≤ O P max k · p b α max / b α min √ n r log nk δ ! . using the fact th at (Π ⊤ P ) i,j ≤ P max . No w w e claim that ˆ Q is close to Q and it can b e sho wn that | Q i − ˆ Q i | 1 ≤ O  ε P P max − P min  (67) Using (67), we ha ve | (Π ⊤ P Π Q ⊤ ) i,j − (Π ⊤ P Π ˆ Q ⊤ ) i,j | = | (Π ⊤ P Π) i ( Q ⊤ − ˆ Q ⊤ ) j | = ((Π ⊤ P Π) i − P min ~ 1 ⊤ ) | ( Q ⊤ − ˆ Q ⊤ ) j | 1 ≤ O (( P max − P min ) | ( Q ⊤ − ˆ Q ⊤ ) j | 1 ) = O ( ε P ) . using the fact th at ( Q j − ˆ Q j ) ~ 1 = 0, d ue to the normalization. Finally , | ( G ˆ Q ⊤ ) i,j (Π ⊤ P Π ˆ Q ⊤ ) i,j | are small by standard concen tration b oun ds (and the d iffer- ences are of lo wer ord er). Combining these | ˆ P i,j − P i,j | ≤ O ( ε P ).  B.4 Zero-error supp ort recov ery guaran tees Recall t hat w e prop osed Pro cedure 3 to pr o v id e imp ro v ed supp ort reco v ery estimates in the sp ecial case of homophilic mo dels (where there are more edges w ith in a communit y than to any comm un it y outside). W e limit our analysis to the sp ecia l case of u niform sized communities ( α i = 1 /k ) and matrix P such that P ( i, j ) = p I ( i = j ) + q I ( i 6 = j ) and p ≥ q . In p r inciple, the analysis can b e extended to homophilic mo dels with more general P matrix (with suitably c hosen thresh olds for supp ort reco very). 49 W e first consider analysis for the sto chastic bloc k mo del (i.e. α 0 → 0) and prov e the guaran tees claimed in Corollary 4.1 . Pr o of of Cor ol lary 4.1 : Recall the definition of ˜ Π in (6 2) and ˆ Π is obtained b y th r esholding ˜ Π with threshold τ . S in ce the threshold τ for sto c hastic blo ck mo dels is 0 . 5 (assumption B5), w e h av e | ( ˆ Π) i − (Π) i | 1 = O ( ε 2 π ) , (68) where ε π is the ro w-wise ℓ 2 error fo r ˜ Π in Lemma B.1. This is b ecause Π( i, j ) ∈ { 0 , 1 } , and in order for our metho d to make a mistak e, it tak es 1 / 4 in the ℓ 2 2 error. In Pro cedure 3 , for th e sto c hastic b lo ck mo del ( α 0 = 0), for a no de x ∈ [ n ], we ha ve ˆ F ( x, i ) = X y ∈ [ n ] G x,y ˆ Π( i, y ) | ˆ Π i | 1 ≈ X y ∈ [ n ] G x,y ˆ Π( i, y ) | Π i | 1 ≈ k n X y ∈ [ n ] G x,y ˆ Π( i, y ) , using (68) and the fact that th e size of eac h comm unit y on a v erage is n/k . In other words, for eac h v ertex x , w e compute the a ve rage n u m b er of ed ges f r om this vertex to all the estimated comm unities according to ˆ Π, and set it to belong to the one with largest a v erage degree. 
Note that the margin o f error on av erage for eac h node to b e assigned the correct comm unity according to the ab o v e pro cedur e is ( p − q ) n/k , since the size of eac h communit y is n/k and the av erage n umb er of in tra-comm unit y edges at a n o de is pn/k and edges to an y different comm u nit y at a no d e is q n /k . F rom (68), we h a v e that the av erage n umb er o f errors made is O (( p − q ) ε 2 π ). Note that the degrees concen trate around their exp ectations according to Bernstein’s b oun d and the fact that the edges used for a v eraging is indep enden t from the edges used for estimating ˆ Π. T h us, for our metho d to succeed in in ferring th e correct communit y at a no d e, we require, O (( p − q ) ε 2 π ) ≤ ( p − q ) n k , whic h implies p − q ≥ ˜ Ω  √ pk √ n  .  W e no w prov e the general result on sup p ort reco very . Pr o of of The or em 4.2: F rom Lemm a B.3, | ˆ P i,j − P i,j | ≤ O ( ε P ) whic h implies b oun ds for th e av erage of d iagonals H and a v erage of off-diagonals L : | H − p | = O ( ε P ) , | L − q | = O ( ε P ) . On similar lines as the pro of of Lemma B.3 and from ind ep endence of edges used to defin e ˆ F from the edges used to estimate ˆ Π, w e also ha v e | ˆ F ( j, i ) − F ( j, i ) | ≤ O ( ε P ) . Note that F j,i = q + Π i,j ( p − q ). T he threshold ξ satisfies ξ = Ω ( ε P ), therefore, all the entries in F that are larger than q + ( p − q ) ξ , the corresp on d ing en tries in S are declared to b e one, w hile none of the entries that are smaller than q + ( p − q ) ξ / 2 are set to one in S .  50 C Concen tration Bounds C.1 Main Result: T ensor P erturbation Bound W e no w provide the main r esult that the thir d -order whitened tensor compu ted from samples concen trates. Recall that T α 0 Y →{ A,B ,C } denotes the third order momen t computed u sing edges from partition Y to p artitions A, B , C in (15 ). ˆ W A , ˆ W B ˆ R AB , ˆ W C ˆ R AC are the whitening matrices defined in (24). The corresp onding whitening matrices W A , W B R AB , W C R AC for exact momen t third order tensor E [T α 0 Y →{ A,B ,C } | Π] will b e d efined later. Recall that ρ is defin ed in (37) as ρ := α 0 +1 b α min . Giv en δ ∈ (0 , 1), thr oughout assume that n = Ω  ρ 2 log 2 k δ  , (69) as in Ass u mption ( B 2). Theorem C.1 (P erturbation of whitened tensor) . W hen the p artitions A, B , C, X , Y satisfy (69) , we have with pr ob ability 1 − 100 δ , ε T :=    T α 0 Y →{ A,B , C } ( ˆ W A , ˆ W B ˆ R AB , ˆ W C ˆ R AC ) − E [T α 0 Y →{ A,B ,C } ( W A , ˜ W B , ˜ W C ) | Π A , Π B , Π C ]    = O ( α 0 + 1) p (max i ( P b α ) i ) n 1 / 2 b α 3 / 2 min σ min ( P ) · 1 +  ρ 2 n log 2 k δ  1 / 4 ! r log k δ ! = ˜ O ρ √ n · ζ b α 1 / 2 max ! . (70) Pro of Overview: The pr o of of the ab o ve result follo ws. It consists mainly of the follo wing steps: (1) Controlling the p erturbations of the wh itening matrices and (2) Establishing concen tration of the third momen t tens or (b efore whitening). Combining the t w o, w e can then obtain p erturb a- tion of the whitened tensor. P erturbations for the wh itenin g step is established in App endix C.2. Auxiliary concent ration b ounds requ ired for the whitening step, and for the claims b elo w are in App endix C.3 and C.4. Pr o of of The or em C.1: In tensor T α 0 in (15), the fi rst term is ( α 0 + 1)( α 0 + 2) X i ∈ Y  G ⊤ i,A ⊗ G ⊤ i,B ⊗ G ⊤ i,C  . W e claim that this term dominates in the p erturbation analysis since the mean v ector p ertur bation is of low er order. 
W e n o w consider p erturb ation of the whitened tensor Λ 0 = 1 | Y | X i ∈ Y  ( ˆ W ⊤ A G ⊤ i,A ) ⊗ ( ˆ R ⊤ AB ˆ W ⊤ B G ⊤ i,B ) ⊗ ( ˆ R ⊤ AC ˆ W ⊤ C G ⊤ i,C )  . W e show that th is tensor is close to the corresp ondin g term in th e exp ecta tion in thr ee steps. First w e show it is close to Λ 1 = 1 | Y | X i ∈ Y  ( ˆ W ⊤ A F A π i ) ⊗ ( ˆ R ⊤ AB ˆ W ⊤ B F B π i ) ⊗ ( ˆ R ⊤ AC ˆ W ⊤ C F C π i )  . 51 Then this v ector is close to the exp ectation ov er Π Y . Λ 2 = E π ∼ Dir( α )  ( ˆ W ⊤ A F A π ) ⊗ ( ˆ R ⊤ AB ˆ W ⊤ B F B π ) ⊗ ( ˆ R ⊤ AC ˆ W ⊤ C F C π )  . Finally w e replace the estimated wh itening matrix ˆ W A with W A , defined in ( 71), and note that W A whitens the exact moments. Λ 3 = E π ∼ Dir( α )  ( W ⊤ A F A π ) ⊗ ( ˜ W ⊤ B F B π ) ⊗ ( ˜ W ⊤ C F C π )  . F or Λ 0 − Λ 1 , the d omin an t term in the p erturbation b ound (assuming partitions A, B , C , X, Y are of size n ) is (since for an y rank 1 tensor, k u ⊗ v ⊗ w k = k u k · k v k · k w k ), O 1 | Y | k ˜ W ⊤ B F B k 2      X i ∈ Y  ˆ W ⊤ A G ⊤ i,A − ˆ W ⊤ A F A π i       ! O  1 | Y | b α − 1 min · ( α 0 + 1)(max i ( P b α ) i ) b α min σ min ( P ) · (1 + ε 1 + ε 2 + ε 3 ) r log n δ  , with probability 1 − 13 δ (Lemma C.2). Since there are 7 terms in the third order tensor T α 0 , w e ha v e the b oun d with probability 1 − 91 δ . F or Λ 1 − Λ 2 , s in ce ˆ W A F A Diag( b α ) 1 / 2 has sp ectral norm almost 1, b y Lemma C.4 the sp ectral norm of the p erturbation is at most    ˆ W A F A Diag( b α ) 1 / 2    3      1 | Y | X i ∈ Y (Diag( b α ) − 1 / 2 π i ) ⊗ 3 − E π ∼ Dir( α ) (Diag( b α ) − 1 / 2 π i ) ⊗ 3      ≤ O  1 b α min √ n · r log n δ  . F or the final term Λ 2 − Λ 3 , the dominating term is ( ˆ W A − W A ) F A Diag( b α ) 1 / 2 k Λ 3 k ≤ ε W A k Λ 3 k ≤ O ( α 0 + 1) p max i ( P b α ) i n 1 / 2 b α 3 / 2 min σ min ( P ) (1 + ε 1 + ε 2 + ε 3 ) r log n δ ! Putting all these toget her, th e thir d term k Λ 2 − Λ 3 k dominate s. W e know with probabilit y at l east 1 − 100 δ , the p erturbation in the tensor is at most O ( α 0 + 1) p max i ( P b α ) i n 1 / 2 b α 3 / 2 min σ min ( P ) (1 + ε 1 + ε 2 + ε 3 ) r log n δ ! .  C.2 Whitening Matrix Perturbations Consider rank - k SVD of | X | − 1 / 2 ( G α 0 X,A ) ⊤ k − svd = ˆ U A ˆ D A ˆ V ⊤ A , and the w hitening matrix is giv en by ˆ W A := ˆ U A ˆ D − 1 A and th us | X | − 1 ˆ W ⊤ A ( G α 0 X,A ) ⊤ k − svd ( G α 0 X,A ) k − svd ˆ W A = I . Now consider the singular v alue decomp osition of | X | − 1 ˆ W ⊤ A E [( G α 0 X,A ) ⊤ | Π] · E [( G α 0 X,A ) | Π] ˆ W A = Φ ˜ D Φ ⊤ . 52 ˆ W A do es n ot wh iten the exact momen ts in general. On the other hand, consider W A := ˆ W A Φ A ˜ D − 1 / 2 A Φ ⊤ A . ( 71) Observe that W A whitens | X | − 1 / 2 E [( G α 0 X,A ) | Π] | X | − 1 W ⊤ A E [( G α 0 X,A ) ⊤ | Π] E [( G α 0 X,A ) | Π] W A = (Φ A ˜ D − 1 / 2 A Φ ⊤ A ) ⊤ Φ A ˜ D A Φ ⊤ A Φ A ˜ D − 1 / 2 A Φ ⊤ A = I No w the ranges of W A and ˆ W A ma y differ and w e con trol the p erturbations b elo w. Also n ote that ˆ R A,B , ˆ R A,C are giv en b y ˆ R AB := | X | − 1 ˆ W ⊤ B ( G α 0 X,B ) ⊤ k − svd ( G α 0 X,A ) k − svd ˆ W A . (72) R AB := | X | − 1 W ⊤ B E [( G α 0 X,B ) ⊤ | Π] · E [ G α 0 X,A | Π] · W A . (73) Recall ǫ G is giv en by (78), and σ min  E [ G α 0 X,A | Π]  is giv en in (C.7) and | A | = | B | = | X | = n . Lemma C.1 (Whitening m atrix p ertur bations) . 
W ith pr ob ability 1 − δ , ǫ W A := k Diag ( b α ) 1 / 2 F ⊤ A ( ˆ W A − W A ) k = O   (1 − ε 1 ) − 1 / 2 ǫ G σ min  E [ G α 0 X,A | Π]    (74) ǫ ˜ W B := k Diag ( b α ) 1 / 2 F ⊤ B ( ˆ W B ˆ R AB − W B R AB ) k = O   (1 − ε 1 ) − 1 / 2 ǫ G σ min  E [ G α 0 X,B | Π]    (75) Thus, with pr ob ability 1 − 6 δ , ǫ W A = ǫ ˜ W B = O ( α 0 + 1) p max i ( P b α ) i n 1 / 2 b α min σ min ( P ) · (1 + ε 1 + ε 2 + ε 3 ) ! , (76) wher e ε 1 , ε 2 and ε 3 ar e give n by (84) and (85) . Remark: Note that when partitions X , A satisfy (69), ε 1 , ε 2 , ε 3 are small. When P is well conditioned and b α min = b α max = 1 /k , w e hav e ǫ W A , ǫ ˜ W B = O ( k / √ n ). Pr o of: Using the fact that W A = ˆ W A Φ A ˜ D − 1 / 2 A Φ ⊤ A or ˆ W A = W A Φ A ˜ D 1 / 2 A Φ ⊤ A w e ha ve that k Diag ( b α ) 1 / 2 F ⊤ A ( ˆ W A − W A ) k ≤ k Diag ( b α ) 1 / 2 F ⊤ A W A ( I − Φ A ˜ D 1 / 2 A Φ ⊤ A ) k = k Diag ( b α ) 1 / 2 F ⊤ A W A ( I − ˜ D 1 / 2 A ) k ≤ k Diag ( b α ) 1 / 2 F ⊤ A W A ( I − ˜ D 1 / 2 A )( I + ˜ D 1 / 2 A ) k ≤ k Diag ( b α ) 1 / 2 F ⊤ A W A k · k I − ˜ D A k using the fact th at ˜ D A is a diagonal matrix. 53 No w note that W A whitens | X | − 1 / 2 E [ G α 0 X,A | Π] = | X | − 1 / 2 F A Diag( α 1 / 2 )Ψ X , wh ere Ψ X is d efi ned in (83) . F urth er it is sh o w n in Lemma C.7 that Ψ X satisfies with probabilit y 1 − δ that ε 1 := k I − | X | − 1 Ψ X Ψ ⊤ X k ≤ O s ( α 0 + 1) b α min | X | · log k δ ! Since ε 1 ≪ 1 when X, A satisfy (69). W e ha v e that | X | − 1 / 2 Ψ X has singular v alues around 1. S ince W A whitens | X | − 1 / 2 E [ G α 0 X,A | Π], we hav e | X | − 1 W ⊤ A F A Diag( α 1 / 2 )Ψ X Ψ ⊤ X Diag( α 1 / 2 ) F ⊤ A W A = I . Th us, with probab ility 1 − δ , k Diag ( b α ) 1 / 2 F ⊤ A W A k = O ((1 − ε 1 ) − 1 / 2 ) . Let E [( G α 0 X,A ) | Π] = ( G α 0 X,A ) k − svd + ∆. W e h a ve k I − ˜ D A k = k I − Φ A ˜ D A Φ ⊤ A k = k I − | X | − 1 ˆ W ⊤ A E [( G α 0 X,A ) ⊤ | Π] · E [( G α 0 X,A ) | Π] ˆ W A k = O  | X | − 1 k ˆ W ⊤ A  ∆ ⊤ ( G α 0 X,A ) k − svd + ∆( G α 0 X,A ) ⊤ k − svd  ˆ W A k  = O  | X | − 1 / 2 k ˆ W ⊤ A ∆ ⊤ ˆ V A + ˆ V ⊤ A ∆ ˆ W A k  , = O  | X | − 1 / 2 k ˆ W A kk ∆ k  = O  | X | − 1 / 2 k W A k ǫ G  , since k ∆ k ≤ ǫ G + σ k +1 ( G α 0 X,A ) ≤ 2 ǫ G , using W eyl’s th eorem f or singular v alue p erturb ation and the fact that ǫ G · k W A k ≪ 1 and k W A k = | X | 1 / 2 /σ min  E [ G α 0 X,A | Π]  . W e no w consider p ertur bation of W B R AB . By d efi nition, we ha v e that E [ G α 0 X,B | Π] · W B R AB = E [ G α 0 X,A | Π] · W A . and k W B R AB k = | X | 1 / 2 σ min ( E [ G α 0 X,B | Π]) − 1 . Along the lin es of pr evious deriv ation for ǫ W A , let | X | − 1 ( ˆ W B ˆ R AB ) ⊤ · E [( G α 0 X,B ) ⊤ | Π] · E [ G α 0 X,B | Π] ˆ W B ˆ R AB = Φ B ˜ D B Φ ⊤ B . Again u sing the fact that | X | − 1 Ψ X Ψ ⊤ X ≈ I , we ha ve k Diag ( b α ) 1 / 2 F ⊤ B W B R AB k ≈ k Diag ( b α ) 1 / 2 F ⊤ A W A k , and the rest of the pr o of follo ws.  54 C.3 Auxiliary Concen t ration B ounds Lemma C .2 (Concen tration of sum of whitened vect ors) . Assuming al l the p artitions satisfy (69) , with pr ob ability 1 − 7 δ ,      X i ∈ Y  ˆ W ⊤ A G ⊤ i,A − ˆ W ⊤ A F A π i       = O ( p | Y | b α max ǫ W A ) = O p ( α 0 + 1)(max i ( P b α ) i ) b α min σ min ( P ) · (1 + ε 2 + ε 3 ) p log n /δ ! ,      X i ∈ Y  ( ˆ W B ˆ R AB ) ⊤ ( G ⊤ i,B − F B π i )       = O p ( α 0 + 1)(max i ( P b α ) i ) b α min σ min ( P ) · (1 + ε 1 + ε 2 + ε 3 ) p log n /δ ! . 
Remark: Note that when P is well conditioned and b α min = b α max = 1 /k , w e h av e the ab o ve b ounds as O ( k ). Thus, when it is n orm alized with 1 / | Y | = 1 /n , we hav e the b oun d as O ( k /n ). Pr o of: Note that ˆ W A is computed usin g p artition X and G i,A is obtained from i ∈ Y . W e ha v e indep end en ce for edges across differen t partitions X and Y . Let Ξ i := ˆ W ⊤ A ( G ⊤ i,A − F A π i ).Applying matrix Bernstein’s in equalit y to eac h of the v ariables, we hav e k Ξ i k ≤ k ˆ W A k · k G ⊤ i,A − F A π i k ≤ k ˆ W A k p k F A k 1 , from L emma C.6. The v ariances are giv en by k X i ∈ Y E [Ξ i Ξ ⊤ i | Π] k ≤ X i ∈ Y ˆ W ⊤ A Diag( F A π i ) ˆ W A , ≤ k ˆ W A k 2 k F Y k 1 = O  | Y | | A | · ( α 0 + 1)(max i ( P b α ) i ) b α 2 min σ 2 min ( P ) · (1 + ε 2 + ε 3 )  , with probabilit y 1 − 2 δ f rom (81) and (82), and ε 2 , ε 3 are giv en b y (85). Similarly , k P i ∈ Y E [Ξ ⊤ i Ξ i | Π] k ≤ k ˆ W A k 2 k F Y k 1 . Thus, fr om matrix Bernstein’s inequalit y , we ha v e with probability 1 − 3 δ k X i ∈ Y Ξ i k = O ( k ˆ W A k p max( k F A k 1 , k F X k 1 )) . = O p ( α 0 + 1)(max i ( P b α ) i ) b α min σ min ( P ) · (1 + ε 2 + ε 3 ) p log n/δ ! On similar lines, w e hav e the result for B and C , and also use the indep end ence assumption on edges in v arious p artitions.  W e no w sho w that not only the sum of whitened v ectors concent rates, but that eac h ind ividual whitened v ector ˆ W ⊤ A G ⊤ i,A concen trates, wh en A is large enough. 55 Lemma C .3 (Concentratio n of a random whitened vecto r) . Conditione d on π i , with pr ob ability at le ast 1 / 4 ,    ˆ W ⊤ A G ⊤ i,A − W ⊤ A F A π i    ≤ O ( ε W A b α − 1 / 2 min ) = ˜ O p ( α 0 + 1)(max i ( P b α ) i ) n 1 / 2 b α 3 / 2 min σ min ( P ) ! . Remark: The abov e result is not a high pr obabilit y ev en t since w e employ Ch eb yshev’s inequalit y to establish it. How eve r, this is not an issu e for u s , sin ce we will employ it to sho w that out of Θ( n ) whitened vecto rs, there exists at least one goo d initializat ion v ector corresp on d ing to eac h eigen-direction, as r equired in T heorem A.1 in App end ix A. See Lemma C .9 for details. Pr o of. W e ha v e    ˆ W ⊤ A G ⊤ i,A − W ⊤ A F A π i    ≤    ( ˆ W A − W A ) ⊤ F A π i    +    ˆ W ⊤ A ( G ⊤ i,A − F A π i )    . The fi rst term is satisfies satisfies with probability 1 − 3 δ k ( ˆ W ⊤ A − W ⊤ A ) F A π i k ≤ ǫ W A b α − 1 / 2 min = O ( α 0 + 1) b α 1 / 2 max p (max i ( P b α ) i ) n 1 / 2 b α 3 / 2 min σ min ( P ) · (1 + ε 1 + ε 2 + ε 3 ) ! No w w e b oun d the second term . Note that G ⊤ i,A is indep endent of ˆ W ⊤ A , s in ce they are related to disjoin t sub set of ed ges. Th e whitened neighborh oo d ve ctor can b e view ed as a sum of v ectors: ˆ W ⊤ A G ⊤ i,A = X j ∈ A G i,j ( ˆ W ⊤ A ) j = X j ∈ A G i,j ( ˆ D A ˆ U ⊤ A ) j = ˆ D A X j ∈ A G i,j ( ˆ U ⊤ A ) j . Conditioned on π i and F A , G i,j are Bernoulli v ariables with probabilit y ( F A π i ) j . T he goal is to compute the v ariance of th e su m, and then use C heb yshev’s inequalit y n oted in Prop osition C.5. Note that the v ariance is given by k E [( G ⊤ i,A − F A π i ) ⊤ ˆ W A ˆ W ⊤ A ( G ⊤ i,A − F A π i )] k ≤ k ˆ W A k 2 X j ∈ A ( F A π i ) j    ( ˆ U ⊤ A ) j    2 . W e no w b ound the v ariance. By W edin’s theorem, we kno w the sp an of columns of ˆ U A is O ( ǫ G /σ min ( G α 0 X , A )) = O ( ǫ W A ) close to the sp an of columns of F A . The span of columns of F A is the same as the sp an of ro ws in Π A . 
In p articular, let P r oj Π b e th e p ro j ectio n m atrix of the span of ro ws in Π A , w e hav e    ˆ U A ˆ U ⊤ A − P r oj Π    ≤ O ( ǫ W A ) . Using the sp ectral norm b ound , w e ha ve the F rob enius norm    ˆ U A ˆ U ⊤ A − P r oj Π    F ≤ O ( ǫ W A √ k ) since they are r ank k matrices. This implies that X j ∈ A     ( ˆ U ⊤ A ) j    −    P r oj j Π     2 = O ( ǫ 2 W A k ) . 56 No w k P r oj j Π k ≤ k π j k σ min (Π A ) = O   s ( α 0 + 1) n b α min   , from L emma C.7 No w we can b ound th e v ariance of th e vect ors P j ∈ A G i,j ( ˆ U ⊤ A ) j , since the v ariance of G i,j is b ounded by ( F A π i ) j (its probability), and the v ariance of th e vect ors is at most X j ∈ A ( F A π i ) j    ( ˆ U ⊤ A ) j    2 ≤ 2 X j ∈ A ( F A π i ) j    P r oj j Π    2 + 2 X j ∈ A ( F A π i ) j     ( ˆ U ⊤ A ) j    −    P r oj j Π     2 ≤ 2 X j ∈ A ( F A π i ) j max j ∈ A     P r oj j Π    2  + max i,j P i,j X j ∈ A     ( ˆ U ⊤ A ) j    −    P r oj j Π     2 ≤ O  | F A | 1 ( α 0 + 1) n b α min  No w Ch eb yshev’s inequalit y imp lies that w ith pr obabilit y at least 1 / 4 (or any other constant) ,       X j ∈ A ( G i,j − F A π i )( ˆ U ⊤ A ) j       2 ≤ O  | F A | 1 ( α 0 + 1) n b α min  . And thus, we hav e ˆ W ⊤ A ( G i,A − F A π i ) ≤ s | F A | 1 ( α 0 + 1) n b α min ·    ˆ W ⊤ A    ≤ O  ǫ W A b α − 1 / 2 min  . Com bining the t wo terms, we ha v e the resu lt. Finally , we establish the follo wing p erturbation b ou n d b et ween empir ical and exp ected tensor under the Diric hlet distrib u tion, which is u sed in the pro of of T heorem C.1. Lemma C .4 (Concent ration of th ir d moment tensor u nder Diric hlet distribu tion) . With pr ob ability 1 − δ , for π i iid ∼ Dir( α ) ,      1 | Y | X i ∈ Y (Diag( b α ) − 1 / 2 π i ) ⊗ 3 − E π ∼ Dir( α ) (Diag( b α ) − 1 / 2 π ) ⊗ 3      ≤ O  · 1 b α min √ n r log n δ  = ˜ O  1 b α min √ n  Pr o of. The sp ectral n orm of this tensor ca nn ot b e larger than th e sp ectral n orm of a k × k 2 matrix that w e obtain b e “collapsing” the last t wo dimensions (by defin itions of norms). Let φ i := Diag ( ˆ α ) − 1 / 2 π i and the “collapsed” tensor is the m atrix φ i ( φ i ⊗ φ i ) ⊤ (here w e view φ i ⊗ φ i as a vecto r in R k 2 ). W e apply Matrix Bernstein on the matrices Z i = φ i ( φ i ⊗ φ i ) ⊤ . No w      X i ∈ Y E [ Z i Z ⊤ i ]      ≤ | Y | max k φ k 4    E [ φφ ⊤ ]    ≤ | Y | b α − 2 min 57 since   E [ φφ ⊤ ]   ≤ 2. F or the other v ariance term   P i ∈ Y E [ Z ⊤ i Z i ]   , w e hav e      X i ∈ Y E [ Z ⊤ i Z i ]      ≤ | Y | b α min    E [( φ ⊗ φ )( φ ⊗ φ ) ⊤ ]    . It remains to b ound the norm of E [( φ ⊗ φ )( φ ⊗ φ ) ⊤ ]. W e hav e k E [( φ ⊗ φ )( φ ⊗ φ ) ⊤ ] k = su p   k E [ M 2 ] k , s.t. M = X i,j N i,j φ i φ ⊤ j , k N k F = 1   . b y definition. W e no w group the terms of E [ M 2 ] and b ound them separately . M 2 = X i N 2 i,i φ i φ ⊤ i k φ i k 2 + X i 6 = j N 2 i,j φ i φ ⊤ j h φ i , φ j i + X i 6 = j 6 = a N i,i N j,a φ i φ ⊤ a h φ i , φ j i + X i 6 = j 6 = a 6 = b N i,j N a,b φ i φ ⊤ b h φ j , φ a i (77) W e b oun d the terms ind ividually now. k φ ( i ) k 4 terms: By prop erties of Diric hlet distrib u tion we kno w E [ k φ ( i ) k 4 ] = Θ( b α − 1 i ) ≤ O ( b α − 1 min ) . Th us, for the first term in (77 ), we ha v e sup N : k N k F =1 k X i E [ N 2 i,i φ i φ ⊤ i k φ i ] k 2 k = O ( b α − 1 min ) . 
k φ ( i ) k 3 · k φ ( j ) k terms: W e ha ve k E [ X i,j N i,i N i,j φ ( i ) 3 φ ( j )] k ≤ E [ k φ i k 2 · k φ j k ] ≤ O ( s X i,j ( N 2 i,i ˆ α ( j )) X i,j N 2 i,j ˆ α ( i ) − 1 ) ≤ O ( b α − 1 / 2 min ) . k φ ( i ) k 2 · k φ ( j ) k 2 terms: the total num b er of such terms is O ( k 2 ) and we ha ve E [ k φ ( i ) k 2 · k φ ( j ) k 2 ] = Θ(1) , and th us the F rob enius norm of these set of terms is smaller than O ( k ) k φ ( i ) k 2 · k φ ( j ) k · k φ ( a ) k terms: there are O ( k 3 ) suc h terms, and w e ha ve k E [ φ ( i ) k 2 · k φ ( j ) k · k φ ( a )] k = Θ( ˆ α ( i 2 ) 1 / 2 ˆ α ( i 3 ) 1 / 2 ) . The F r ob enius norm of this part of matrix is b oun ded b y O   s X i,j,a ∈ [ k ] ˆ α ( j ) ˆ α ( a )   ≤ O ( √ k ) s X j X a b α j b α a ≤ O ( √ k ) . 58 the rest: the sum is E [ X i 6 = j 6 = a 6 = b N i,j N a,b ˆ α ( i ) 1 / 2 ˆ α ( j ) 1 / 2 ˆ α ( a ) 1 / 2 ˆ α ( b ) 1 / 2 ] . It is easy to br eak the b oun ds in to the p ro duct of t w o sums ( P i,j and P a,b ) and then b ound eac h one b y Cauc hy-Sc hw artz, the result is 1. Hence the v ariance term in Matrix Bernstein’s inequalit y can b e b oun ded by σ 2 ≤ O ( n b α − 2 min ), eac h term has norm at most b α − 3 / 2 min . When b α − 2 min < n we kn ow the v ariance term dominates and the sp ectral norm of the difference is at most O ( b α − 1 min n − 1 / 2 p log n /δ ) w ith pr obabilit y 1 − δ . C.4 Basic Results on Sp ectral Concen tration of A djacency Matrix Let n := max( | A | , | X | ). Lemma C.5 (Concent ration of G α 0 X,A ) . When π i ∼ Dir( α ) , for i ∈ V , with pr ob ability 1 − 4 δ , ǫ G := k G α 0 X,A − E [( G α 0 X,A ) ⊤ | Π] k = O  r ( α 0 + 1) n · (max i ( P b α ) i )(1 + ε 2 ) log n δ  (78) Pr o of: F r om definition of G α 0 X,A , we hav e ǫ G ≤ √ α 0 + 1 k G X,A − E [ G X,A | Π] k + ( √ α 0 + 1 − 1) p | X |k µ X,A − E [ µ X,A | Π] k . W e ha ve concen tration f or µ X,A and adjacency s u bmatrix G X,A from L emma C.6.  W e now pro vide concen tration b ound s for adjacency sub-matrix G X,A from partition X to A and the corresp onding mean vec tor. Recall that E [ µ X → A | F A , π X ] = F A π X and E [ µ X → A | F A ] = F A b α . Lemma C.6 (Concentrat ion of adjacency submatrices) . When π i iid ∼ Dir( α ) for i ∈ V , with pr ob a- bility 1 − 2 δ , k G X,A − E [ G X,A | Π] k = O  r n · (max(max i ( P b α ) i , m ax i ( P ⊤ b α ) i ))(1 + ε 2 ) log n δ  . (79) k µ A − E [ µ A | Π] k = O  1 | X | r n · (max(max i ( P b α ) i , m ax i ( P ⊤ b α ) i ))(1 + ε 2 ) log n δ  , (80 ) wher e ε 2 is given by (85) . Pr o of: Recall E [ G X,A | Π] = F A Π X and G A,X = Ber( F A Π X ) w here Ber( · ) denotes the Bernoulli random matrix with ind ep endent en tries. Let Z i := ( G ⊤ i,A − F A π i ) e ⊤ i . W e ha ve G ⊤ X,A − F A Π X = P i ∈ X Z i . W e apply m atrix Berns tein’s inequalit y . W e compute the v ariances P i E [ Z i Z ⊤ i | Π] and P i E [ Z ⊤ i Z i | Π]. W e ha ve that P i E [ Z i Z ⊤ i | Π] only the diagonal terms are non-zero du e to indep en dence of Bernoulli v ariables, and E [ Z i Z ⊤ i | Π] ≤ Diag ( F A π i ) (81) 59 en try-wise. Th us, k X i ∈ X E [ Z i Z ⊤ i | Π] k ≤ max a ∈ A X i ∈ X,b ∈ [ k ] F A ( a, b ) π i ( b ) = max a ∈ A X i ∈ X,b ∈ [ k ] F A ( a, b )Π X ( b, i ) ≤ max c ∈ [ k ] X i ∈ X,b ∈ [ k ] P ( b, c )Π X ( b, i ) = k P ⊤ Π X k ∞ . (82) Similarly P i ∈ X E [ Z ⊤ i Z i ] = P i ∈ X Diag( E [ k G ⊤ i,A − F A π i k 2 ]) ≤ k P ⊤ Π X k ∞ . On lines of Lemma C.1 1, w e ha ve k P ⊤ Π X k ∞ = O ( | X | · (max i ( P ⊤ b α ) i )) w hen | X | satisfies (69 ). 
W e no w b ound k Z i k . First note th at the en tries in G i,A are in d ep endent and w e can use the v ector Bernstein’s inequalit y to b ound k G i,A − F A π i k . W e hav e max j ∈ A | G i,j − ( F A π i ) j | ≤ 2 and P j E [ G i,j − ( F A π i ) j ] 2 ≤ P j ( F A π i ) j ≤ k F A k 1 . Th us w ith pr obabilit y 1 − δ , we hav e k G i,A − F A π i k ≤ (1 + p 8 log (1 /δ )) p k F A k 1 + 8 / 3 log (1 /δ ) . Th us, we hav e th e b ound th at k P i Z i k = O (max( p k F A k 1 , p k P ⊤ Π X k ∞ )). The concen tration of the mean term follo ws from this r esult.  W e no w provide sp ectral b ound s on E [( G α 0 X,A ) ⊤ | Π]. Defin e ψ i := Diag ( ˆ α ) − 1 / 2 ( √ α 0 + 1 π i − ( √ α 0 + 1 − 1) µ ) . (83) Let Ψ X b e th e matrix with columns ψ i , for i ∈ X . W e ha ve E [( G α 0 X,A ) ⊤ | Π] = F A Diag( ˆ α ) 1 / 2 Ψ X , from d efinition of E [( G α 0 X,A ) ⊤ | Π]. Lemma C.7 (Sp ectral b ound s) . With pr ob ability 1 − δ , ε 1 := k I − | X | − 1 Ψ X Ψ ⊤ X k ≤ O s ( α 0 + 1) b α min | X | · log k δ ! (84) With pr ob ability 1 − 2 δ , k E [( G α 0 X,A ) ⊤ | Π] k = O  k P k b α max p | X || A | (1 + ε 1 + ε 2 )  σ min  E [( G α 0 X,A ) ⊤ | Π]  = Ω   b α min s | A || X | α 0 + 1 (1 − ε 1 − ε 3 ) · σ min ( P ) ·   , wher e ε 2 := O  1 | A | b α 2 max log k δ  1 / 4 ! , ε 3 := O  ( α 0 + 1) 2 | A | b α 2 min log k δ  1 / 4 ! . (85) 60 Remark: When p artitions X , A satisfy (69 ), ε 1 , ε 2 , ε 3 are small. Pr o of: Note that ψ i is a rand om v ector with norm b ounded by O ( p ( α 0 + 1) / b α min ) fr om Lemma C.11 and E [ ψ i ψ ⊤ i ] = I . W e no w prov e (84). using Matrix Bernstein In equalit y . Eac h matrix ψ i ψ ⊤ i / | X | has sp ectral norm at most O (( α 0 + 1) / b α min | X | ). The v ariance σ 2 is b ound ed b y      1 | X | 2 E [ X i ∈ X k ψ i k 2 ψ i ψ ⊤ i ]      ≤      1 | X | 2 max k ψ i k 2 E [ X i ∈ X ψ i ψ ⊤ i ]      ≤ O (( α 0 + 1) / b α min | X | ) . Since O (( α 0 + 1) /α min | X | ) < 1, the v ariance dominates in Matrix Bernstein’s inequ alit y . Let B := | X | − 1 Ψ X Ψ ⊤ X . W e ha v e with prob ab ility 1 − δ , σ min ( E [( G α 0 X,A ) ⊤ | Π]) = q | X | σ min ( F A Diag( ˆ α ) 1 / 2 B Diag ( ˆ α ) 1 / 2 F ⊤ A ) , = Ω( p b α min | X | (1 − ǫ 1 ) · σ min ( F A )) . F rom Lemma C .11, w ith pr obabilit y 1 − δ , σ min ( F A ) ≥   s | A | b α min α 0 + 1 − O (( | A | log k /δ ) 1 / 4 )   · σ min ( P ) . Similarly other r esults follo w.  C.5 Prop erties of Diric hlet Distr ibution In this section, w e list v arious prop erties of Diric hlet distribution. C.5.1 Sparsity Inducing Prop erty W e first n ote that the Diric h let distribution Dir( α ) is sparse d ep ending on v alues of α i , whic h is sho wn in T elgarsky [2012]. Lemma C .8. L et r e als τ ∈ (0 , 1] , α i > 0 , α 0 := P i α i and inte gers 1 ≤ s ≤ k b e given. L et ( X i , . . . , X k ) ∼ Dir ( α ) . Then Pr  |{ i : X i ≥ τ }| ≤ s  ≥ 1 − τ − α 0 e − ( s +1) / 3 − e − 4( s +1) / 9 , when s + 1 < 3 k . W e no w show that we obtain go o d initialization vecto rs u nder Diric hlet distribution. Arrange the b α j ’s in ascending ord er, i.e. b α 1 = b α min ≤ b α 2 . . . ≤ b α k = b α max . Recall that columns v ectors ˆ W ⊤ A G ⊤ i,A , for i / ∈ A , are u sed as initialization vec tors to the tensor p o wer metho d. W e sa y that u i := ˆ W ⊤ A G ⊤ i,A k ˆ W ⊤ A G ⊤ i,A k is a ( γ , R 0 )-goo d initializat ion vec tor corresp onding to j ∈ [ k ] if |h u i , Φ j i| ≥ R 0 , |h u i , Φ j i| − max m 1 . (89) When α 0 < 1 , the b ound c an b e impr ove d for r 0 ∈ (0 . 
5 , ( α 0 + 1) − 1 ) and 1 − γ ≥ 1 − r 0 r 0 as n > (1 + α 0 )(1 − r 0 b α min ) b α min ( α min + 1 − r 0 ( α 0 + 1)) log( k /δ ) . (90) Remark when α 0 ≥ 1 , α 0 = Θ(1) : When r 0 is c hosen as r 0 = α − 1 / 2 max ( √ α 0 + c 1 √ k ) − 1 , the term e r 0 b α 1 / 2 max ( α 0 + c 1 √ k α 0 ) = e , and we require n = ˜ Ω  α − 1 min k 0 . 43 log( k /δ )  , r 0 = α − 1 / 2 max ( √ α 0 + c 1 √ k ) − 1 , (91) b y substituting c 2 /c 1 = 0 . 43. Moreo ver, (89) is satisfied for the ab o ve c hoice of r 0 when γ = Θ(1). In this case we also need ∆ < r 0 / 2, whic h implies ζ = O  √ n ρk b α max  (92) Remark when α 0 < 1 : In this regime, (90 ) im p lies th at w e requir e n = Ω( b α − 1 min ). Also, r 0 is a constan t, we just need ζ = O ( √ n/ρ ). Pr o of: Define ˜ u i := W ⊤ A F A π i / k W ⊤ A F A π i k , when wh itenin g m atrix W A and F A corresp onding to exact statistics are input. W e first observ e that if ˜ u i is ( γ , r 0 ) go od , then u i is ( γ − 2∆ r 0 − ∆ , r 0 − ∆) go o d. When ˜ u i is ( γ , r 0 ) go od , note that W ⊤ A F A π i ≥ b α − 1 / 2 max r 0 since σ min ( W ⊤ A F A ) = b α − 1 / 2 max and k π i k ≥ r 0 . No w with p robabilit y 1 / 4, conditioned on π i , w e hav e the ev ent B ( i ), B ( i ) := { k u i − ˜ u i k ≤ ∆ } , where ∆ is giv en b y ∆ = ˜ O b α 0 . 5 max p ( α 0 + 1)(max i ( P b α ) i ) r 0 n 1 / 2 b α 1 . 5 min σ min ( P ) ! from Lemma C.3. Thus, w e ha ve P [ B ( i ) | π i ] ≥ 1 / 4, i.e. B ( i ) o ccurs with probability 1 / 4 for any realizatio n of π i . 62 If we p erturb a ( γ , r 0 ) go o d v ector by ∆ (while main taining u nit norm ), then it is still ( γ − 2∆ r 0 − ∆ , r 0 − ∆) go o d. W e now sh o w that th e set { ˜ u i } con tains go o d initializ ation v ectors w hen n is large enough. Consider Y i ∼ Γ( α i , 1), wh ere Γ( · , · ) denotes the Gamma d istr ibution and we ha ve Y / P i Y i ∼ Dir( α ). W e first compute the prob ab ility that ˜ u i := W ⊤ A F A π i / k W ⊤ A F A π i k is a ( r 0 , γ )-go o d v ector with r esp ect to j = 1 (recall that b α 1 = b α min ). The desired eve nt is A 1 := ( b α − 1 / 2 1 Y 1 ≥ r 0 s X j b α − 1 j Y 2 j ) ∩ ( b α − 1 / 2 1 Y 1 ≥ 1 1 − γ max j > 1 b α − 1 / 2 j Y j ) (93) W e ha ve P [ A 1 ] ≥ P   ( b α − 1 / 2 min Y 1 ≥ r 0 s X j b α − 1 j Y 2 j ) ∩ ( Y 1 ≥ 1 1 − γ max j > 1 Y j )   ≥ P   ( b α − 1 / 2 min Y 1 > r 0 t ) \ ( X j b α − 1 j Y 2 j ≤ t 2 ) \ j > 1 ( Y 1 ≤ (1 − γ ) r 0 t b α 1 / 2 min )   , for some t ≥ P h b α − 1 / 2 min Y 1 > r 0 t i P   X j b α − 1 j Y 2 j ≤ t 2    b α − 1 / 2 j Y j ≤ (1 − γ ) r 0 t b α 1 / 2 min   P  max j > 1 Y j ≤ (1 − γ ) r 0 t b α 1 / 2 min  ≥ P h b α − 1 / 2 min Y 1 > r 0 t i P   X j b α − 1 j Y 2 j ≤ t 2   P  max j > 1 Y j ≤ (1 − γ ) r 0 t b α 1 / 2 min  When α j ≤ 1, we ha v e P [ ∪ j Y j ≥ log 2 k ] ≤ 0 . 5 , since P ( Y j ≥ t ) ≤ t α j − 1 e − t ≤ e − t when t > 1 and α j ≤ 1. App lying v ector Bernstein’s inequalit y , w e ha ve with prob ab ility 0 . 5 − e − m that k Diag ( b α − 1 / 2 j )( Y − E ( Y )) k 2 ≤ (1 + √ 8 m ) p k α 0 + 4 / 3 m b α − 1 / 2 min log 2 k , since E [ P j b α − 1 j V ar( Y j )] = k α 0 since b α j = α j /α 0 and V ar( Y j ) = α j . Th us, w e ha v e k Diag ( b α − 1 / 2 j ) Y k 2 ≤ α 0 + (1 + √ 8 m ) p k α 0 + 4 / 3 m b α − 1 / 2 min log 2 k , since k Diag ( b α − 1 / 2 j ) E ( Y ) k 2 = q P j b α − 1 j α 2 j = α 0 . 
When α_j ≤ 1, we have P[ ∪_j {Y_j ≥ log 2k} ] ≤ 0.5, since P(Y_j ≥ t) ≤ t^{α_j − 1} e^{−t} ≤ e^{−t} when t > 1 and α_j ≤ 1. Applying the vector Bernstein inequality, we have with probability 0.5 − e^{−m} that

‖Diag(α̂_j^{−1/2}) (Y − E[Y])‖_2 ≤ (1 + √(8m)) √(k α_0) + (4/3) m α̂_min^{−1/2} log 2k,

since Σ_j α̂_j^{−1} Var(Y_j) = k α_0 (as α̂_j = α_j/α_0 and Var(Y_j) = α_j). Thus,

‖Diag(α̂_j^{−1/2}) Y‖_2 ≤ α_0 + (1 + √(8m)) √(k α_0) + (4/3) m α̂_min^{−1/2} log 2k,

since ‖Diag(α̂_j^{−1/2}) E[Y]‖_2 = √( Σ_j α̂_j^{−1} α_j² ) = α_0. Choosing m = log 4, we have with probability 1/4 that

‖Diag(α̂_j^{−1/2}) Y‖_2 ≤ t := α_0 + (1 + √(8 log 4)) √(k α_0) + (4/3)(log 4) α̂_min^{−1/2} log 2k   (94)
= α_0 + c_1 √(k α_0) + c_2 α̂_min^{−1/2} log 2k.   (95)

We now have

P[ α̂_min^{−1/2} Y_1 > r_0 t ] ≥ (α_min / 4C) ( r_0 t α̂_min^{1/2} )^{α_min − 1} e^{ −r_0 t α̂_min^{1/2} },

from Proposition C.1. Similarly,

P[ max_{j≠1} Y_j ≤ (1 − γ) r_0 t α̂_min^{1/2} ] ≥ 1 − Σ_{j≠1} ( (1 − γ) r_0 t α̂_min^{1/2} )^{α_j − 1} e^{ −(1 − γ) r_0 α̂_min^{1/2} t } ≥ 1 − k e^{ −(1 − γ) r_0 α̂_min^{1/2} t },

assuming that (1 − γ) r_0 α̂_min^{1/2} t > 1. Choosing t as in (94), the probability of the event in (93) is greater than

(α_min / 16C) ( 1 − e^{ −(1 − γ) r_0 α̂_min^{1/2} (α_0 + c_1 √(k α_0)) } / ( 2 (2k)^{ (1 − γ) r_0 c_2 − 1 } ) ) · ( e^{ −r_0 α̂_min^{1/2} (α_0 + c_1 √(k α_0)) } / (2k)^{ r_0 c_2 } ) · ( r_0 α̂_min^{1/2} ( α_0 + c_1 √(k α_0) + c_2 α̂_min^{−1/2} log 2k ) )^{α_min − 1}.

Similarly, the (marginal) probability of the event A_2 can be bounded from below by replacing α_min with α_2, and so on. Thus we have

P[A_m] = Ω̃( α_min e^{ −r_0 α̂_max^{1/2} (α_0 + c_1 √(k α_0)) } (2k)^{ −r_0 c_2 } ),   for all m ∈ [k].

Thus, each of the events A_1(i) ∩ B(i), A_2(i) ∩ B(i), …, A_k(i) ∩ B(i) occurs at least once over i ∈ [n] i.i.d. tries with probability

1 − P[ ∪_{j∈[k]} ∩_{i∈[n]} (A_j(i) ∩ B(i))^c ] ≥ 1 − Σ_{j∈[k]} P[ ∩_{i∈[n]} (A_j(i) ∩ B(i))^c ] ≥ 1 − Σ_{j∈[k]} exp[ −n P(A_j ∩ B) ]
≥ 1 − k exp[ −n Ω̃( α_min e^{ −r_0 α̂_max^{1/2} (α_0 + c_1 √(k α_0)) } (2k)^{ −r_0 c_2 } ) ],

where A_j(i) denotes the event that A_j occurs in the i-th trial, and we use that P[B | A_j] ≥ 0.25, since B(i) occurs with probability at least 0.25 for any realization of π_i and the events A_j depend only on π_i. We also use 1 − x ≤ e^{−x} for x ∈ [0, 1]. Thus, for the event to occur with probability 1 − δ, we require

n = Ω̃( α_min^{−1} e^{ r_0 α̂_max^{1/2} (α_0 + c_1 √(k α_0)) } (2k)^{ r_0 c_2 } log(1/δ) ).

Improved bound when α_0 < 1: We can improve the above bound by working directly with the Dirichlet distribution. Let π ∼ Dir(α). The desired event corresponding to j = 1 is given by

A_1 = ( α̂_1^{−1/2} π_1 / ‖Diag(α̂_i^{−1/2}) π‖ ≥ r_0 ) ∩_{i>1} ( π_1 ≥ π_i / (1 − γ) ).

Thus, we have

P[A_1] ≥ P[ (π_1 ≥ r_0) ∩_{i>1} ( π_i ≤ (1 − γ) r_0 ) ] ≥ P[π_1 ≥ r_0] · P[ ∩_{i>1} { π_i ≤ (1 − γ) r_0 } | π_1 ≥ r_0 ],

since P[ ∩_{i>1} {π_i ≤ (1 − γ) r_0} | π_1 ≥ r_0 ] ≥ P[ ∩_{i>1} {π_i ≤ (1 − γ) r_0} ]. By the properties of the Dirichlet distribution, we know E[π_i] = α̂_i and E[π_i²] = α̂_i (α_i + 1)/(α_0 + 1). Let p := Pr[π_1 ≥ r_0]. We have

E[π_1²] = p E[π_1² | π_1 ≥ r_0] + (1 − p) E[π_1² | π_1 < r_0] ≤ p + (1 − p) r_0 E[π_1 | π_1 < r_0] ≤ p + (1 − p) r_0 E[π_1].

Thus,

p ≥ α̂_min ( α_min + 1 − r_0 (α_0 + 1) ) / ( (α_0 + 1)(1 − r_0 α̂_min) ),

which is useful when r_0 (α_0 + 1) < 1. Also, when π_1 ≥ r_0, we have π_i ≤ 1 − r_0 for i > 1, since π_i ≥ 0 and Σ_i π_i = 1. Thus, choosing 1 − γ = (1 − r_0)/r_0, the remaining conditions for A_1 are satisfied. Note that this choice gives γ ∈ (0, 1) when r_0 > 0.5, and r_0 > 0.5 is feasible (given r_0 < (α_0 + 1)^{−1}) when α_0 < 1. □
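The second-moment lower bound on p = Pr[π_1 ≥ r_0] can be sanity-checked by simulation. A minimal sketch, assuming illustrative parameter values with α_0 < 1 and r_0 ∈ (0.5, (α_0 + 1)^{−1}):

```python
import numpy as np

rng = np.random.default_rng(2)

alpha = np.array([0.1, 0.2, 0.3])    # alpha_0 = 0.6 < 1 (illustrative)
a0 = alpha.sum()
a_hat = alpha / a0
r0 = 0.55                            # must lie in (0.5, 1/(a0 + 1)) = (0.5, 0.625)

pi = rng.dirichlet(alpha, size=200_000)
p_emp = (pi[:, 0] >= r0).mean()      # empirical Pr[pi_1 >= r0]

# Lower bound from the second-moment argument above.
p_lb = a_hat[0] * (alpha[0] + 1 - r0 * (a0 + 1)) / ((a0 + 1) * (1 - r0 * a_hat[0]))
print(f"empirical p = {p_emp:.4f}, lower bound = {p_lb:.4f}")
```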
We now prove a result showing that the entries of π_i, which are marginals of the Dirichlet distribution, are likely to be small in the sparse regime of the Dirichlet parameters. Recall that the marginal distribution of π_i is the beta distribution B(α_i, α_0 − α_i), where Z ∼ B(a, b) has density P[Z = z] ∝ z^{a−1} (1 − z)^{b−1}.

Lemma C.10 (Dirichlet marginals in the sparse regime). For Z ∼ B(a, b), the following results hold.

Case b ≤ 1, C ∈ [0, 1/2]:

Pr[Z ≥ C] ≤ 8 log(1/C) · a/(a + b),   (96)
E[Z · δ(Z ≤ C)] ≤ C · E[Z] = C · a/(a + b).   (97)

Case b ≥ 1, C ≤ (b + 1)^{−1}:

Pr[Z ≥ C] ≤ a log(1/C),   (98)
E[Z · δ(Z ≤ C)] ≤ 6aC.   (99)

Remark: The guarantee for b ≥ 1 is worse, which agrees with the intuition that Dirichlet vectors are more spread out (less sparse) when b = α_0 − α_i is large.

Proof. We have

E[Z · δ(Z ≤ C)] = (1/B(a, b)) ∫_0^C x^a (1 − x)^{b−1} dx ≤ ((1 − C)^{b−1}/B(a, b)) ∫_0^C x^a dx = (1 − C)^{b−1} C^{a+1} / ((a + 1) B(a, b)).

For E[Z · δ(Z ≥ C)], we have

E[Z · δ(Z ≥ C)] = (1/B(a, b)) ∫_C^1 x^a (1 − x)^{b−1} dx ≥ (C^a/B(a, b)) ∫_C^1 (1 − x)^{b−1} dx = (1 − C)^b C^a / (b B(a, b)).

The ratio between these two is at least

E[Z · δ(Z ≥ C)] / E[Z · δ(Z ≤ C)] ≥ (1 − C)(a + 1) / (bC) ≥ 1/C,

where the last inequality holds when a, b < 1 and C < 1/2. The sum of the two terms is exactly E[Z], so when C < 1/2 we have E[Z · δ(Z ≤ C)] ≤ C · E[Z].

Next, we bound the probability Pr[Z ≥ C]. Note that Pr[Z ≥ 1/2] ≤ 2 E[Z] = 2a/(a + b) by Markov's inequality. Now we show that Pr[Z ∈ [C, 1/2]] is not much larger than Pr[Z ≥ 1/2] by bounding the integrals

A = ∫_{1/2}^1 x^{a−1} (1 − x)^{b−1} dx ≥ ∫_{1/2}^1 (1 − x)^{b−1} dx = (1/2)^b / b,
B = ∫_C^{1/2} x^{a−1} (1 − x)^{b−1} dx ≤ (1/2)^{b−1} ∫_C^{1/2} x^{a−1} dx ≤ (1/2)^{b−1} (0.5^a − C^a)/a ≤ (1/2)^{b−1} (1 − (1 − a log(1/C)))/a = (1/2)^{b−1} log(1/C).

The last inequality uses the fact that e^x ≥ 1 + x for all x. Now

Pr[Z ≥ C] = (1 + B/A) Pr[Z ≥ 1/2] ≤ (1 + 2b log(1/C)) · 2a/(a + b) ≤ 8 log(1/C) · a/(a + b),

and we have the result.

Case 2: When b ≥ 1, we have an alternative bound. We use the fact that if X ∼ Γ(a, 1) and Y ∼ Γ(b, 1), then Z has the same distribution as X/(X + Y). Since Y ∼ Γ(b, 1), its PDF is (1/Γ(b)) x^{b−1} e^{−x}, which is proportional to the PDF of Γ(1, 1) (namely e^{−x}) multiplied by the increasing function x^{b−1}. Therefore, Pr[Y ≥ t] ≥ Pr_{Y′∼Γ(1,1)}[Y′ ≥ t] = e^{−t}.

We now use this bound to compute the probability that Z ≤ 1/R, for any R ≥ 1:

Pr[ X/(X + Y) ≤ 1/R ] = ∫_0^∞ Pr[X = x] Pr[Y ≥ (R − 1)x] dx ≥ ∫_0^∞ (1/Γ(a)) x^{a−1} e^{−Rx} dx = R^{−a} ∫_0^∞ (1/Γ(a)) y^{a−1} e^{−y} dy = R^{−a}.

In particular, Pr[Z ≤ C] ≥ C^a, which means Pr[Z ≥ C] ≤ 1 − C^a ≤ a log(1/C). For E[Z · δ(Z < C)], the proof is similar to before:

P = E[Z · δ(Z < C)] = (1/B(a, b)) ∫_0^C x^a (1 − x)^b dx ≤ C^{a+1} / (B(a, b)(a + 1)),
Q = E[Z · δ(Z ≥ C)] = (1/B(a, b)) ∫_C^1 x^a (1 − x)^b dx ≥ C^a (1 − C)^{b+1} / (B(a, b)(b + 1)).

Now E[Z · δ(Z ≤ C)] ≤ (P/Q) E[Z] ≤ 6aC when C < 1/(b + 1). □
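A quick Monte Carlo check of the tail bound (96) for case b ≤ 1; the parameter values below are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

# Marginal of a sparse Dirichlet: Z ~ Beta(a, b) with a = alpha_i, b = alpha_0 - alpha_i.
a, b = 0.05, 0.95     # both <= 1, i.e. case 1 of Lemma C.10 (illustrative)
C = 0.4               # threshold in [0, 1/2]

Z = rng.beta(a, b, size=500_000)
tail_emp = (Z >= C).mean()
tail_bound = 8 * np.log(1 / C) * a / (a + b)    # right-hand side of (96)

print(f"empirical Pr[Z >= C] = {tail_emp:.4f}")
print(f"bound 8 log(1/C) a/(a+b) = {tail_bound:.4f}")
```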
C.5.2 Norm Bounds

Lemma C.11 (Norm bounds under the Dirichlet distribution). For π_i ∼ Dir(α) drawn i.i.d. for i ∈ A, with probability 1 − δ, we have

σ_min(Π_A) ≥ √( |A| α̂_min / (α_0 + 1) ) − O( (|A| log(k/δ))^{1/4} ),
‖Π_A‖ ≤ √( |A| α̂_max ) + O( (|A| log(k/δ))^{1/4} ),
κ(Π_A) ≤ √( (α_0 + 1) α̂_max / α̂_min ) + O( (|A| log(k/δ))^{1/4} ).

This implies that

‖F_A‖ ≤ ‖P‖ √( |A| α̂_max ),   κ(F_A) ≤ O( κ(P) √( (α_0 + 1) α̂_max / α̂_min ) ).

Moreover, with probability 1 − δ,

‖F_A‖_1 ≤ |A| · max_i (P α̂)_i + O( ‖P‖ √( |A| log(|A|/δ) ) ).   (100)

Remark: When |A| = Ω( log(k/δ) ( (α_0 + 1)/α̂_min )² ), we have σ_min(Π_A) = Ω( √( |A| α̂_min / (α_0 + 1) ) ) with probability 1 − δ, for any fixed δ ∈ (0, 1).

Proof: Consider Π_A Π_A^⊤ = Σ_{i∈A} π_i π_i^⊤. We have

(1/|A|) E[Π_A Π_A^⊤] = E_{π∼Dir(α)}[π π^⊤] = (α_0/(α_0 + 1)) α̂ α̂^⊤ + (1/(α_0 + 1)) Diag(α̂),

from Proposition C.2. The first term is positive semi-definite, so the eigenvalues of the sum are at least the eigenvalues of the second component; the smallest eigenvalue of the second component thus gives a lower bound on σ_min(E[Π_A Π_A^⊤]). The spectral norm of the first component is bounded by (α_0/(α_0 + 1)) ‖α̂‖ ≤ (α_0/(α_0 + 1)) α̂_max, and the spectral norm of the second component is (1/(α_0 + 1)) α̂_max. Thus ‖E[Π_A Π_A^⊤]‖ ≤ |A| · α̂_max.

Now apply the matrix Bernstein inequality to (1/|A|) Σ_i ( π_i π_i^⊤ − E[π π^⊤] ). The variance is O(1/|A|), so with probability 1 − δ,

‖ (1/|A|) ( Π_A Π_A^⊤ − E[Π_A Π_A^⊤] ) ‖ = O( √( log(k/δ) / |A| ) ).

For the result on F_A, we use the property that for any two matrices A, B, ‖AB‖ ≤ ‖A‖ ‖B‖ and κ(AB) ≤ κ(A) κ(B). To show the bound on ‖F_A‖_1, note that each column of F_A satisfies E[(F_A)^i] = ⟨α̂, (P)^i⟩ 1^⊤, and thus ‖E[F_A]‖_1 ≤ |A| max_i (P α̂)_i. Using Bernstein's inequality, for each column of F_A we have, with probability 1 − δ,

| ‖(F_A)^i‖_1 − |A| ⟨α̂, (P)^i⟩ | = O( ‖P‖ √( |A| log(|A|/δ) ) ),

since |⟨α̂, (P)^i⟩| ≤ ‖P‖, and Σ_{i∈A} ‖E[(P)^j π_i π_i^⊤ ((P)^j)^⊤]‖ and Σ_{i∈A} ‖E[π_i^⊤ ((P)^j)^⊤ (P)^j π_i]‖ are bounded by |A| · ‖P‖². □

C.5.3 Properties of Gamma and Dirichlet Distributions

Recall that the Gamma distribution Γ(α, β) is a distribution on nonnegative reals with density function (β^α / Γ(α)) x^{α−1} e^{−βx}.

Proposition C.1 (Dirichlet and Gamma distributions). The following facts are known for the Dirichlet and Gamma distributions.

1. Let Y_i ∼ Γ(α_i, 1) be independent random variables; then the vector (Y_1, Y_2, …, Y_k) / Σ_{i=1}^k Y_i is distributed as Dir(α).
2. The Γ function satisfies Euler's reflection formula: Γ(1 − z) Γ(z) = π / sin(πz).
3. Γ(z) ≥ 1 when 0 < z < 1.
4. There exists a universal constant C such that Γ(z) ≤ C/z when 0 < z < 1.
5. For Y ∼ Γ(α, 1), t > 0 and α ∈ (0, 1), we have

(α/(4C)) t^{α−1} e^{−t} ≤ Pr[Y ≥ t] ≤ t^{α−1} e^{−t},   (101)

and for any η, c > 1, we have

P[Y > ηt | Y ≥ t] ≥ (cη)^{α−1} e^{−(η−1)t}.   (102)

Proof: The bounds in (101) are derived using the fact that 1 ≤ Γ(α) ≤ C/α when α ∈ (0, 1), together with

∫_t^∞ (1/Γ(α_i)) x^{α_i−1} e^{−x} dx ≤ (1/Γ(α_i)) ∫_t^∞ t^{α_i−1} e^{−x} dx ≤ t^{α_i−1} e^{−t},

and

∫_t^∞ (1/Γ(α_i)) x^{α_i−1} e^{−x} dx ≥ (1/Γ(α_i)) ∫_t^{2t} x^{α_i−1} e^{−x} dx ≥ (α_i/C) ∫_t^{2t} (2t)^{α_i−1} e^{−x} dx ≥ (α_i/(4C)) t^{α_i−1} e^{−t}. □

Proposition C.2 (Moments under the Dirichlet distribution). Suppose v ∼ Dir(α); then the moments of v satisfy

E[v_i] = α_i/α_0,   E[v_i²] = α_i (α_i + 1) / (α_0 (α_0 + 1)),   E[v_i v_j] = α_i α_j / (α_0 (α_0 + 1)),   i ≠ j.

More generally, if a^{(t)} := Π_{i=0}^{t−1} (a + i) denotes the rising factorial, then for nonnegative integers a_i,

E[ Π_{i=1}^k v_i^{a_i} ] = ( Π_{i=1}^k α_i^{(a_i)} ) / α_0^{(Σ_{i=1}^k a_i)}.
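The first- and second-moment formulas in Proposition C.2 are straightforward to verify numerically. A minimal sketch with assumed illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(4)

alpha = np.array([0.5, 1.0, 2.5])    # illustrative Dirichlet parameters
a0 = alpha.sum()
v = rng.dirichlet(alpha, size=1_000_000)

# Closed forms from Proposition C.2 versus Monte Carlo estimates.
print("E[v_1]:    ", v[:, 0].mean(), "vs", alpha[0] / a0)
print("E[v_1^2]:  ", (v[:, 0] ** 2).mean(), "vs", alpha[0] * (alpha[0] + 1) / (a0 * (a0 + 1)))
print("E[v_1 v_2]:", (v[:, 0] * v[:, 1]).mean(), "vs", alpha[0] * alpha[1] / (a0 * (a0 + 1)))
```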
C.6 Standard Results

Bernstein inequalities: One of the key tools we use is the standard matrix Bernstein inequality [Tropp, 2012, Thm. 1.4].

Proposition C.3 (Matrix Bernstein inequality). Suppose Z = Σ_j W_j, where

1. the W_j are independent random matrices of dimension d_1 × d_2,
2. E[W_j] = 0 for all j,
3. ‖W_j‖ ≤ R almost surely.

Let d = d_1 + d_2 and σ² = max{ ‖Σ_j E[W_j W_j^⊤]‖, ‖Σ_j E[W_j^⊤ W_j]‖ }. Then

Pr[ ‖Z‖ ≥ t ] ≤ d · exp( −(t²/2) / (σ² + Rt/3) ).

Proposition C.4 (Vector Bernstein inequality). Let z = (z_1, z_2, …, z_n) ∈ R^n be a random vector with independent entries, E[z_i] = 0, E[z_i²] = σ_i², and Pr[|z_i| ≤ 1] = 1. Let A = [a_1 | a_2 | ⋯ | a_n] ∈ R^{m×n} be a matrix. Then

Pr[ ‖Az‖ ≤ (1 + √(8t)) √( Σ_{i=1}^n ‖a_i‖² σ_i² ) + (4/3) max_{i∈[n]} ‖a_i‖ t ] ≥ 1 − e^{−t}.

Vector Chebyshev inequality: We also require a vector version of the Chebyshev inequality [Ferentios, 1982].

Proposition C.5. Let z = (z_1, z_2, …, z_n) ∈ R^n be a random vector with independent entries, E[z] = µ, and σ² := Σ_i E[(z_i − µ_i)²]. Then

P[ ‖z − µ‖ > tσ ] ≤ t^{−2}.

Wedin's theorem: We make use of Wedin's theorem to control subspace perturbations.

Lemma C.12 (Wedin's theorem; Theorem 4.4, p. 262 of Stewart and Sun [1990]). Let A, E ∈ R^{m×n} with m ≥ n be given. Let A have the singular value decomposition

[ U_1^⊤ ; U_2^⊤ ; U_3^⊤ ] A [ V_1  V_2 ] = [ Σ_1  0 ; 0  Σ_2 ; 0  0 ].

Let Ã := A + E, with the analogous singular value decomposition (Ũ_1, Ũ_2, Ũ_3, Σ̃_1, Σ̃_2, Ṽ_1, Ṽ_2). Let Φ be the matrix of canonical angles between range(U_1) and range(Ũ_1), and let Θ be the matrix of canonical angles between range(V_1) and range(Ṽ_1). If there exist δ, α > 0 such that min_i σ_i(Σ̃_1) ≥ α + δ and max_i σ_i(Σ_2) ≤ α, then

max{ ‖sin Φ‖_2, ‖sin Θ‖_2 } ≤ ‖E‖_2 / δ.
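To make Wedin's bound concrete, the following minimal sketch (not part of the paper) perturbs a rank-r signal matrix and compares the largest sine of a canonical angle against ‖E‖_2/δ; scipy.linalg.subspace_angles computes the canonical angles, and the dimensions and noise level are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(5)
m, n, r = 60, 40, 5    # dimensions and signal rank (illustrative)

A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-r matrix
E = 0.05 * rng.standard_normal((m, n))                          # perturbation
A_tilde = A + E

U, s, _ = np.linalg.svd(A)
U_t, s_t, _ = np.linalg.svd(A_tilde)

# Gap condition of Lemma C.12: min_i sigma_i(Sigma_tilde_1) >= alpha + delta
# and max_i sigma_i(Sigma_2) <= alpha, with Sigma_1 the top-r block.
alpha = s[r]                 # essentially 0 here, since A has rank r
delta = s_t[r - 1] - alpha

sin_max = np.sin(subspace_angles(U[:, :r], U_t[:, :r])).max()
print("max sin(canonical angle):", sin_max)
print("Wedin bound ||E||_2 / delta:", np.linalg.norm(E, 2) / delta)
```

The gap δ here plays the role of the singular-value separation in the lemma; as ‖E‖_2 shrinks, the observed angle and the bound decrease together.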
