A Tensor Spectral Approach to Learning Mixed Membership Community Models

Anima Anandkumar¹, Rong Ge², Daniel Hsu³, and Sham M. Kakade³

¹ a.anandkumar@uci.edu, University of California, Irvine
² rongge@cs.princeton.edu, Princeton University
³ dahsu/skakade@microsoft.com, Microsoft Research, New England

October 28, 2013

Abstract

Community detection is the task of detecting hidden communities from observed interactions. Guaranteed community detection has so far been mostly limited to models with non-overlapping communities such as the stochastic block model. In this paper, we remove this restriction, and provide guaranteed community detection for a family of probabilistic network models with overlapping communities, termed as the mixed membership Dirichlet model, first introduced by Airoldi et al. [2008]. This model allows for nodes to have fractional memberships in multiple communities and assumes that the community memberships are drawn from a Dirichlet distribution. Moreover, it contains the stochastic block model as a special case. We propose a unified approach to learning these models via a tensor spectral decomposition method. Our estimator is based on low-order moment tensors of the observed network, consisting of 3-star counts. Our learning method is fast and is based on simple linear algebraic operations, e.g. singular value decomposition and tensor power iterations. We provide guaranteed recovery of community memberships and model parameters, and present a careful finite sample analysis of our learning method. As an important special case, our results match the best known scaling requirements for the (homogeneous) stochastic block model.

Keywords: Community detection, spectral methods, tensor methods, moment-based estimation, mixed membership models.

1 Introduction

Studying communities forms an integral part of social network analysis. A community generally refers to a group of individuals with shared interests (e.g. music, sports), or relationships (e.g. friends, co-workers). Community formation in social networks has been studied by many sociologists, e.g. [Moreno, 1934, Lazarsfeld et al., 1954, McPherson et al., 2001, Currarini et al., 2009], starting with the seminal work of Moreno [1934]. They posit various factors such as homophily (the tendency of individuals belonging to the same community to connect more than individuals in different communities) to be responsible for community formation. Various probabilistic and non-probabilistic network models attempt to explain community formation. In addition, they also attempt to quantify interactions and the extent of overlap between different communities, relative sizes among the communities, and various other network properties. Studying such community models is also of interest in other domains, e.g. in biological networks.

While there exists a vast literature on community models, learning these models is typically challenging, and various heuristics such as Markov Chain Monte Carlo (MCMC) or variational expectation maximization (EM) are employed in practice. Such heuristics tend to scale poorly for large networks. On the other hand, community models with guaranteed learning methods tend to be restrictive.
A popular class of probabilistic models, termed stochastic blockmodels, has been widely studied and enjoys strong theoretical learning guarantees, e.g. [White et al., 1976, Holland et al., 1983, Fienberg et al., 1985, Wang and Wong, 1987, Snijders and Nowicki, 1997, McSherry, 2001]. On the other hand, these models posit that an individual belongs to a single community, which does not hold in most real settings [Palla et al., 2005]. In this paper, we consider a class of mixed membership community models, originally introduced by Airoldi et al. [2008], and recently employed by Xing et al. [2010] and Gopalan et al. [2012]. The model has been shown to be effective in many real-world settings, but so far, no learning approach exists with provable guarantees. In this paper, we provide a novel approach for learning these mixed membership models and prove that it succeeds under a set of sufficient conditions.

The mixed membership community model of Airoldi et al. [2008] has a number of attractive properties. It retains many of the convenient properties of the stochastic block model. For instance, conditional independence of the edges is assumed, given the community memberships of the nodes in the network. At the same time, it allows for communities to overlap, and for every individual to be fractionally involved in different communities. It includes the stochastic block model as a special case (corresponding to zero overlap among the different communities). This enables us to compare our learning guarantees with existing works for stochastic block models and also study how the extent of overlap among different communities affects the learning performance.

1.1 Summary of Results

We now summarize the main contributions of this paper. We propose a novel approach for learning the mixed membership community models of Airoldi et al. [2008]. Our approach is a method of moments estimator and incorporates tensor spectral decomposition. We provide guarantees for our approach under a set of sufficient conditions. Finally, we compare our results to existing ones for the special case of the stochastic block model, where nodes belong to a single community.

Learning Mixed Membership Models: We present a tensor-based approach for learning the mixed membership stochastic block model (MMSB) proposed by Airoldi et al. [2008]. In the MMSB model, the community membership vectors are drawn from the Dirichlet distribution, denoted by Dir(α), where α is known as the Dirichlet concentration vector. Employing the Dirichlet distribution results in sparse community memberships in certain regimes of α, which is realistic. The extent of overlap between different communities under the MMSB model is controlled (roughly) via a single scalar parameter, α_0 := Σ_i α_i, where α := [α_i] is the Dirichlet concentration vector. When α_0 → 0, the mixed membership model degenerates to a stochastic block model and we have non-overlapping communities.

We propose a unified tensor-based learning method for the MMSB model and establish recovery guarantees under a set of sufficient conditions.
These conditions are in terms of the network size n, the number of communities k, the extent of community overlap (through α_0), and the average edge connectivity across various communities. Below, we present an overview of our guarantees for the special case of equal sized communities (each of size n/k) and homogeneous community connectivity: let p be the probability for any intra-community edge to occur, and q be the probability for any inter-community edge. Let Π be the community membership matrix, where Π^i denotes the i-th row, which is the vector of membership weights of the nodes for the i-th community. Let P be the community connectivity matrix such that P(i,i) = p and P(i,j) = q for i ≠ j.

Theorem 1.1 (Main Result). For an MMSB model with network size n, number of communities k, connectivity parameters p, q and community overlap parameter α_0, when (using Ω̃(·), Õ(·) to denote Ω(·), O(·) up to poly-log factors)

    n = Ω̃(k^2 (α_0 + 1)^2),    (p − q)/√p = Ω̃((α_0 + 1) k / n^{1/2}),    (1)

our estimated community membership matrix Π̂ and edge connectivity matrix P̂ satisfy, with high probability (w.h.p.),

    ε_{π,ℓ1} := (1/n) max_{i∈[k]} ‖Π̂^i − Π^i‖_1 = Õ( (α_0 + 1)^{3/2} √p / ((p − q) √n) ),    (2)

    ε_P := max_{i,j∈[k]} |P̂_{i,j} − P_{i,j}| = Õ( (α_0 + 1)^{3/2} k √p / √n ).    (3)

Further, our support estimates Ŝ satisfy, w.h.p.,

    Π(i,j) ≥ ξ ⇒ Ŝ(i,j) = 1  and  Π(i,j) ≤ ξ/2 ⇒ Ŝ(i,j) = 0,    ∀ i ∈ [k], j ∈ [n],    (4)

where Π is the true community membership matrix and the threshold is chosen as ξ = Ω(ε_P).

The complete details are in Section 4. We first provide some intuition behind the sufficient conditions in (1). We require the network size n to be large enough compared to the number of communities k, and the separation p − q to be large enough, so that the learning method can distinguish the different communities. This is natural, since a zero separation (p = q) implies that the communities are indistinguishable. Moreover, we see that the scaling requirements become more stringent as α_0 increases. This is intuitive, since it is harder to learn communities with more overlap, and we quantify this scaling. For the Dirichlet distribution, it can be shown that the number of "significant" entries is roughly O(α_0) with high probability, and in many settings of practical interest, nodes may have significant memberships in only a few communities; thus, α_0 is a constant (or growing slowly) in many instances.

In addition, we quantify the error bounds for estimating various parameters of the mixed membership model in (2) and (3). These errors decay under the sufficient conditions in (1). Lastly, we establish zero-error guarantees for support recovery in (4): our learning method correctly identifies (w.h.p.) all the significant memberships of a node and also identifies the set of communities where a node does not have a strong presence, and we quantify the threshold ξ in Theorem 1.1. Further, we present the results for a general (non-homogeneous) MMSB model in Section 4.2.
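To make the scalings in Theorem 1.1 concrete, the following small sketch evaluates the two sufficient conditions in (1) and the error rates (2)-(3) for a given parameter setting. It is purely illustrative and not part of the paper's algorithm: all absolute constants and poly-log factors, which the theorem does not fix, are set to 1, and the function name and example parameter values are ours.

```python
import math

def mmsb_scaling(n, k, p, q, alpha0):
    # Sufficient conditions (1), with constants and poly-log factors set to 1.
    size_ok = n >= (k * (alpha0 + 1)) ** 2
    sep_ok = (p - q) / math.sqrt(p) >= (alpha0 + 1) * k / math.sqrt(n)
    # Error rates (2) and (3), again up to constants and poly-log factors.
    eps_pi = (alpha0 + 1) ** 1.5 * math.sqrt(p) / ((p - q) * math.sqrt(n))
    eps_P = (alpha0 + 1) ** 1.5 * k * math.sqrt(p) / math.sqrt(n)
    return size_ok, sep_ok, eps_pi, eps_P

# e.g. n = 10**6 nodes, k = 10 communities, p = 0.1, q = 0.02, alpha_0 = 1:
# both conditions hold and the error rates are small.
print(mmsb_scaling(10**6, 10, 0.1, 0.02, 1.0))
```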
Identifiability Result for the MMSB model: A byproduct of our analysis yields novel identifiability results for the MMSB model based on low-order graph moments. We establish that the MMSB model is identifiable, given access to third order moments in the form of counts of 3-star subgraphs, i.e. star subgraphs consisting of three leaves, for each triplet of leaves, when the community connectivity matrix P is full rank. Our learning approach involves decomposition of this third order tensor. Previous identifiability results required access to high order moments and were limited to the stochastic block model setting; see Section 1.3 for details.

Implications on Learning Stochastic Block Models: Our results have implications for learning stochastic block models, which is a special case of the MMSB model with α_0 → 0. In this case, the sufficient conditions in (1) reduce to

    n = Ω̃(k^2),    (p − q)/√p = Ω̃(k / n^{1/2}).    (5)

The scaling requirements in (5) match the best known bounds (up to poly-log factors) for learning uniform stochastic block models, and were previously achieved by Chen et al. [2012] via convex optimization involving semi-definite programming (SDP). (There are many methods which achieve the best known scaling for n in (5), but have worse scaling for the separation p − q; this includes variants of the spectral clustering method, e.g. Chaudhuri et al. [2012]. See Chen et al. [2012] for a detailed comparison of learning guarantees under various methods for learning (homogeneous) stochastic block models.) In contrast, we propose an iterative non-convex approach involving tensor power iterations and linear algebraic techniques, and obtain similar guarantees.

Thus, we establish learning guarantees explicitly in terms of the extent of overlap among the different communities for general MMSB models. Many real-world networks involve sparse community memberships, and the total number of communities is typically much larger than the extent of membership of a single individual, e.g. hobbies/interests of a person, university/company networks that a person belongs to, the set of transcription factors regulating a gene, and so on. Thus, we see that in this regime of practical interest, where α_0 = Θ(1), the scaling requirements in (1) match those for the stochastic block model in (5) (up to poly-log factors) without any degradation in learning performance. Thus, we establish that learning community models with sparse community memberships is akin to learning stochastic block models, and we present a unified approach and analysis for learning these models. To the best of our knowledge, this work is the first to establish polynomial time learning guarantees for probabilistic network models with overlapping communities, and we provide a fast and iterative learning approach through linear algebraic techniques and tensor power iterations.

While the results of this paper are mostly limited to a theoretical analysis of the tensor method for learning overlapping communities, we note recent results which show that this method (with improvements and modifications) is very accurate in practice on real datasets from social networks, and is scalable to graphs with millions of nodes [Huang et al., 2013].

1.2 Overview of Techniques

We now describe the main techniques employed in our learning approach and in establishing the recovery guarantees.
Method of moments and subgraph counts: We propose an efficient learning algorithm based on low order moments, viz., counts of small subgraphs. Specifically, we employ a third-order tensor which counts the number of 3-stars in the observed network. A 3-star is a star graph with three leaves (see Figure 1), and we count the occurrences of such 3-stars across different partitions. We establish that (an adjusted) 3-star count tensor has a simple relationship with the model parameters, when the network is drawn from a mixed membership model. We propose a multi-linear transformation using edge-count matrices (also termed the process of whitening), which reduces the problem of learning mixed membership models to the canonical polyadic (CP) decomposition of an orthogonal symmetric tensor, for which a tractable decomposition exists, as described below. Note that the decomposition of a general tensor into its rank-one components is referred to as its CP decomposition [Kolda and Bader, 2009] and is in general NP-hard [Hillar and Lim, 2012]. However, the decomposition is tractable in the special case of an orthogonal symmetric tensor considered here.

Tensor spectral decomposition via power iterations: Our tensor decomposition method is based on the popular power iterations (e.g. see Anandkumar et al. [2012a]). It is a simple iterative method to compute the stable eigen-pairs of a tensor. In this paper, we propose various modifications to the basic power method to strengthen the recovery guarantees under perturbations. For instance, we introduce adaptive deflation techniques (which involve subtracting out the eigen-pairs previously estimated). Moreover, we initialize the tensor power method with (whitened) neighborhood vectors from the observed network, as opposed to random initialization. In the regime where the community overlaps are small, this leads to improved performance. Additionally, we incorporate thresholding as a post-processing operation, which again leads to improved guarantees for sparse community memberships, i.e., when the overlap among different communities is small. We theoretically establish that all these modifications lead to improvement in performance guarantees, and we discuss comparisons with the basic power method in Section 4.4.

Sample analysis: We establish that our learning approach correctly recovers the model parameters and the community memberships of all nodes under exact moments. We then carry out a careful analysis of the empirical graph moments, computed using the network observations. We establish tensor concentration bounds and also control the perturbation of the various quantities used by our learning algorithm via the matrix Bernstein inequality [Tropp, 2012, thm. 1.4] and other inequalities. We impose the scaling requirements in (1) for the various concentration bounds to hold.

1.3 Related Work

There is extensive work on modeling communities and various algorithms and heuristics for discovering them. We mostly limit our focus to works with theoretical guarantees.

Method of moments: The method of moments approach dates back to Pearson [1894] and has been applied for learning various community models. Here, the moments correspond to counts of various subgraphs in the network.
They typically consist of aggregate quantities, e.g., the number of star subgraphs, triangles etc. in the network. For instance, Bickel et al. [2011] analyze the moments of a stochastic block model and establish that the subgraph counts of certain structures, termed "wheels" (a family of trees), are sufficient for identifiability under some natural non-degeneracy conditions. In contrast, we establish that moments up to third order (corresponding to edge and 3-star counts) are sufficient for identifiability of the stochastic block model, and also more generally, for the mixed membership Dirichlet model. We employ subgraph count tensors, corresponding to the number of subgraphs (such as stars) over a set of labeled vertices, while the work of Bickel et al. [2011] considers only aggregate (i.e. scalar) counts. Considering tensor moments allows us to use simple subgraphs (edges and 3-stars) corresponding to low order moments, rather than more complicated graphs (e.g. the wheels considered by Bickel et al. [2011]) with a larger number of nodes, for learning the community model.

The method of moments is also relevant for the family of random graph models termed exponential random graph models [Holland and Leinhardt, 1981, Frank and Strauss, 1986]. Subgraph counts of fixed graphs such as stars and triangles serve as sufficient statistics for these models. However, parameter estimation given the subgraph counts is in general NP-hard, due to the normalization constant in the likelihood (the partition function), and the model suffers from degeneracy issues; see Rinaldo et al. [2009], Chatterjee and Diaconis [2011] for a detailed discussion. In contrast, we establish in this paper that the mixed membership model is amenable to simple estimation methods through linear algebraic operations and tensor power iterations using subgraph counts of 3-stars.

Stochastic block models: Many algorithms provide learning guarantees for stochastic block models; for a detailed comparison of these methods, see the recent work by Chen et al. [2012]. A popular method is based on spectral clustering [McSherry, 2001], where community memberships are inferred through projection onto the spectrum of the Laplacian matrix (or its variants). This method is fast and easy to implement (via singular value decomposition). There are many variants of this method, e.g. the work of Chaudhuri et al. [2012] employs the normalized Laplacian matrix to handle degree heterogeneities. In contrast, the work of Chen et al. [2012] uses convex optimization techniques via semi-definite programming to learn block models.

Non-probabilistic approaches: The classical approach to community detection tries to directly exploit the properties of the graph to define communities, without assuming a probabilistic model. Girvan and Newman [2002] use betweenness to remove edges until only communities are left. However, Bickel and Chen [2009] show that these algorithms are (asymptotically) biased and that using modularity scores can lead to the discovery of an incorrect community structure, even for large graphs.
Jalali et al. [2011] define community structure as the structure that satisfies the maximum number of edge constraints (whether two individuals like/dislike each other). However, these models assume that every individual belongs to a single community.

Recently, some non-probabilistic approaches have been introduced with overlapping community models by Arora et al. [2012] and Balcan et al. [2012]. The analysis of Arora et al. [2012] is mostly limited to dense graphs (i.e. Θ(n^2) edges for an n node graph), while our analysis provides learning guarantees for much sparser graphs (as seen by the scaling requirements in (1)). Moreover, the running time of the method of Arora et al. [2012] is quasi-polynomial (i.e. O(n^{log n})) for the general case, and it is based on a combinatorial learning approach. In contrast, our learning approach is based on simple linear algebraic techniques, and the running time is a low-order polynomial (roughly O(n^2 k) for an n node network with k communities under a serial computation model, and O(n + k^3) under a parallel computation model). The work of Balcan et al. [2012] assumes endogenously formed communities, obtained by constraining the fraction of edges within a community compared to the outside. They provide a polynomial time algorithm for finding all such "self-determined" communities; the running time is n^{O(log 1/α)}/α, where α is the fraction of edges within a self-determined community, and this bound improves to linear time when α > 1/2. On the other hand, the running time of our algorithm is mostly independent of the parameters of the assumed model (and is roughly O(n^2 k)). Moreover, both these works are limited to homophilic models, where there are more edges within each community than between any two different communities. However, our learning approach is not limited to this setting and also does not assume homogeneity in edge connectivity across different communities (but it instead makes probabilistic assumptions on community formation). In addition, we provide improved guarantees for homophilic models by considering additional post-processing steps in our algorithm. Recently, Abraham et al. [2012] provide an algorithm for approximating the parameters of a Euclidean log-linear model in polynomial time. However, their setting is considerably different from the one in this paper.

Inhomogeneous random graphs, graph limits and the weak regularity lemma: Inhomogeneous random graphs have been analyzed in a variety of settings (e.g., Bollobás et al. [2007], Lovász [2009]) and are generalizations of the stochastic block model. Here, the probability of an edge between any two nodes is characterized by a general function (rather than by a k × k matrix as in the stochastic block model with k blocks). Note that the mixed membership model considered in this work is a special instance of this general framework. These models arise as the limits of convergent (dense) graph sequences, and for this reason, the functions are also termed "graphons" or graph limits [Lovász, 2009]. A deep result in this context is the regularity lemma and its variants. The weak regularity lemma proposed by Frieze and Kannan [1999] showed that any convergent dense graph can be approximated by a stochastic block model.
Moreover, they propose an algorithm to learn such a block model based on the so-called d_2 distance. The d_2 distance between two nodes measures similarity with respect to their "two-hop" neighbors, and the block model is obtained by thresholding the d_2 distances. However, the method is limited to learning block models and not overlapping communities.

Learning Latent Variable Models (Topic Models): The community models considered in this paper are closely related to probabilistic topic models [Blei, 2012], employed for text modeling and document categorization. Topic models posit the occurrence of words in a corpus of documents through the presence of multiple latent topics in each document. Latent Dirichlet allocation (LDA) is perhaps the most popular topic model, where the topic mixtures are assumed to be drawn from the Dirichlet distribution. In each document, a topic mixture is drawn from the Dirichlet distribution, and the words are drawn in a conditionally independent manner, given the topic mixture. The mixed membership community model considered in this paper can be interpreted as a generalization of the LDA model, where a node in the community model can function both as a document and as a word. For instance, in the directed community model, when the outgoing links of a node are considered, the node functions as a document, and its outgoing neighbors can be interpreted as the words occurring in that document. Similarly, when the incoming links of a node in the network are considered, the node can be interpreted as a word, and its incoming links as documents containing that particular word. In particular, we establish that certain graph moments under the mixed membership model have a similar structure as the observed word moments under the LDA model. This allows us to leverage the recent developments of Anandkumar et al. [2012c,a,b] for learning topic models, based on the method of moments. These works establish guaranteed learning using second- and third-order observed moments through linear algebraic and tensor-based techniques. In particular, in this paper, we exploit the tensor power iteration method of Anandkumar et al. [2012b], and propose additional improvements to obtain stronger recovery guarantees. Moreover, the sample analysis is quite different (and more challenging) in the community setting, compared to the topic models analyzed in Anandkumar et al. [2012c,a,b]. We clearly spell out the similarities and differences between the community model and other latent variable models in Section 4.4.

Lower Bounds: The work of Feldman et al. [2012] provides lower bounds on the complexity of statistical algorithms, and shows that for cliques of size O(n^{1/2−δ}), for any constant δ > 0, at least n^{Ω(log log n)} queries are needed to find the cliques. There are works relating the hardness of finding hidden cliques to the use of higher order moment tensors for this purpose. Frieze and Kannan [2008] relate the problem of finding a hidden clique to finding the top eigenvector of the third order tensor, corresponding to the maximum spectral norm.
Brubaker and Vempala [2009] extend the result to arbitrary r-th order tensors; the cliques have to be of size Ω(n^{1/r}) to enable recovery from r-th order moment tensors in an n node network. However, this problem (finding the top eigenvector of a tensor) is known to be NP-hard in general [Hillar and Lim, 2012]. Thus, tensors are useful for finding smaller hidden cliques in networks (albeit by solving a computationally hard problem). In contrast, we consider tractable tensor decomposition through reduction to orthogonal tensors (under the scaling requirements of (1)), and our learning method is a fast and iterative approach based on tensor power iterations and linear algebraic operations. Mossel et al. [2012] provide lower bounds on the separation p − q, the edge connectivity between intra-community and inter-community edges, for identifiability of communities in stochastic block models in the sparse regime (when p, q ∼ n^{−1}), when the number of communities is a constant k = O(1). Our method achieves the lower bounds on the separation of edge connectivity up to poly-log factors.

Likelihood-based Approaches to Learning MMSB: Another class of approaches for learning MMSB models is based on optimizing the observed likelihood. Traditional approaches such as Gibbs sampling or expectation maximization (EM) can be too expensive to apply in practice for MMSB models. Variational approaches which optimize the so-called evidence lower bound [Hoffman et al., 2012, Gopalan et al., 2012], which is a lower bound on the marginal likelihood of the observed data (typically obtained by applying a mean-field approximation), are efficient for practical implementation. Stochastic versions of the variational approach provide even further gains in efficiency and are state-of-the-art practical learning methods for MMSB models [Gopalan et al., 2012]. However, these methods lack theoretical guarantees; since they optimize a bound on the likelihood, they are not guaranteed to recover the underlying communities consistently. A recent work [Celisse et al., 2012] establishes consistency of maximum likelihood and variational estimators for stochastic block models, which are special cases of the MMSB model. However, it is not known if the results extend to general MMSB models. Moreover, the framework of Celisse et al. [2012] assumes a fixed number of communities and a growing network size, and provides only asymptotic consistency guarantees. Thus, they do not allow for high-dimensional settings, where the parameters of the learning problem also grow as the observed dimensionality grows. In contrast, in this paper, we allow for the number of communities to grow, and provide precise constraints on the scaling bounds for consistent estimation under finite samples. It is an open problem to obtain such bounds for maximum likelihood and variational estimators. On the practical side, a recent work by Huang et al. [2013], deploying the tensor approach proposed in this paper, shows that the tensor approach is more than an order of magnitude faster in recovering the communities than the variational approach, is scalable to networks with millions of nodes, and also has better accuracy in recovering the communities.
2 Community Models and Graph Moments

2.1 Community Membership Models

In this section, we describe the mixed membership community model based on Dirichlet priors for the community draws by the individuals. We first introduce the special case of the popular stochastic block model, where each node belongs to a single community.

Notation: We consider networks with n nodes and let [n] := {1, 2, ..., n}. Let G be the {0,1} adjacency matrix for the random network (our analysis can easily be extended to weighted adjacency matrices with bounded entries), and let G_{A,B} be the submatrix of G corresponding to rows A ⊆ [n] and columns B ⊆ [n]. We consider models with k underlying (hidden) communities. For node i, let π_i ∈ R^k denote its community membership vector, i.e., the vector is supported on the communities to which the node belongs. In the special case of the popular stochastic block model described below, π_i is a coordinate basis vector, while the more general mixed membership model relaxes this assumption, and a node can be in multiple communities with fractional memberships. Define Π := [π_1 | π_2 | ··· | π_n] ∈ R^{k×n}, and let Π_A := [π_i : i ∈ A] ∈ R^{k×|A|} denote the set of column vectors restricted to A ⊆ [n]. For a matrix M, let (M)_i and (M)^i denote its i-th column and row respectively. For a matrix M with singular value decomposition (SVD) M = U D V^⊤, let (M)_{k-svd} := U D̃ V^⊤ denote the k-rank SVD of M, where D̃ is limited to the top-k singular values of M. Let M^† denote the Moore-Penrose pseudo-inverse of M. Let I(·) be the indicator function. Let Diag(v) denote a diagonal matrix with diagonal entries given by a vector v. We use the term high probability to mean with probability 1 − n^{−c} for any constant c > 0.

Stochastic block model (special case): In this model, each individual is independently assigned to a single community, chosen at random: each node i chooses community j independently with probability α̂_j, for i ∈ [n], j ∈ [k], and we assign π_i = e_j in this case, where e_j ∈ {0,1}^k is the j-th coordinate basis vector. Given the community assignments Π, every directed edge in the network is independently drawn (we limit our discussion to directed networks in this paper, but note that the results also hold for undirected community models, where P is a symmetric matrix and an edge (u,v) is formed with probability π_u^⊤ P π_v = π_v^⊤ P π_u): if node u is in community i and node v is in community j (and u ≠ v), then the probability of having the edge (u,v) in the network is P_{i,j}. Here, P ∈ [0,1]^{k×k}, and we refer to it as the community connectivity matrix. This implies that given the community membership vectors π_u and π_v, the probability of an edge from u to v is π_u^⊤ P π_v (since when π_u = e_i and π_v = e_j, we have π_u^⊤ P π_v = P_{i,j}). The stochastic block model has been extensively studied and can be learnt efficiently through various methods, e.g. spectral clustering [McSherry, 2001], convex optimization [Chen et al., 2012], and so on. Many of these methods rely on conditional independence assumptions of the edges in the block model for guaranteed learning.
Mixed membership model: We now consider the extension of the stochastic block model which allows for an individual to belong to multiple communities and yet preserves some of the convenient independence assumptions of the block model. In this model, the community membership vector π_u at node u is a probability vector, i.e., Σ_{i∈[k]} π_u(i) = 1, for all u ∈ [n]. Given the community membership vectors, the generation of the edges is identical to the block model: given vectors π_u and π_v, the probability of an edge from u to v is π_u^⊤ P π_v, and the edges are independently drawn. This formulation allows for the nodes to be in multiple communities, and at the same time, preserves the conditional independence of the edges, given the community memberships of the nodes.

Dirichlet prior for community membership: The only aspect left to be specified for the mixed membership model is the distribution from which the community membership vectors Π are drawn. We consider the popular setting of Airoldi et al. [2008], where the community vectors {π_u} are i.i.d. draws from the Dirichlet distribution, denoted by Dir(α), with parameter vector α ∈ R^k_{>0}. The probability density function of the Dirichlet distribution is given by

    P[π] = (Γ(α_0) / Π_{i=1}^k Γ(α_i)) Π_{i=1}^k π_i^{α_i − 1},    π ∼ Dir(α),    α_0 := Σ_i α_i,    (6)

where Γ(·) is the Gamma function and the ratio of Gamma functions serves as the normalization constant. The Dirichlet distribution is widely employed for specifying priors in Bayesian statistics, e.g. latent Dirichlet allocation [Blei et al., 2003]. The Dirichlet distribution is the conjugate prior of the multinomial distribution, which makes it attractive for Bayesian inference.

Let α̂ denote the normalized parameter vector α/α_0, where α_0 := Σ_i α_i. In particular, note that α̂ is a probability vector: Σ_i α̂_i = 1. Intuitively, α̂ denotes the relative expected sizes of the communities (since E[n^{−1} Σ_{u∈[n]} π_u(i)] = α̂_i). Let α̂_max be the largest entry of α̂, and α̂_min the smallest. Our learning guarantees will depend on these parameters.

The stochastic block model is a limiting case of the mixed membership model when the Dirichlet parameter is α = α_0 · α̂, where the probability vector α̂ is held fixed and α_0 → 0. In the other extreme, when α_0 → ∞, the Dirichlet distribution becomes peaked around a single point; for instance, if α_i ≡ c and c → ∞, the Dirichlet distribution is peaked at k^{−1} · 1⃗, where 1⃗ is the all-ones vector. Thus, the parameter α_0 serves as a measure of the average sparsity of the Dirichlet draws or, equivalently, of how concentrated the Dirichlet measure is along the different coordinates. This, in effect, controls the extent of overlap among different communities.

Sparse regime of the Dirichlet distribution: When the Dirichlet parameter vector satisfies α_i < 1, for all i ∈ [k], the Dirichlet distribution Dir(α) generates "sparse" vectors with high probability (roughly, the number of entries in π exceeding a threshold τ is at most O(α_0 log(1/τ)) with high probability, when π ∼ Dir(α)); see Telgarsky [2012]. In the extreme case of the block model where α_0 → 0, it generates 1-sparse vectors. (The assumption that the Dirichlet distribution be in the sparse regime is not strictly needed. Our results can be extended to general Dirichlet distributions, but with worse scaling requirements on the network size n for guaranteed learning.)
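As a concrete illustration of the generative process just described, the following minimal sketch samples a directed MMSB network: i.i.d. Dirichlet memberships followed by independent Bernoulli edges with parameters π_u^⊤ P π_v. All parameter values (k, n, p, q, α_0) are hypothetical choices for illustration, not values prescribed by the paper; with a small α_0 the drawn memberships are nearly 1-sparse, matching the block-model limit.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mmsb(n, P, alpha):
    # Memberships: rows of Pi are i.i.d. Dir(alpha) draws (one per node).
    Pi = rng.dirichlet(alpha, size=n)
    # Edge probabilities: entry (u, v) is pi_u' P pi_v; edges are independent
    # Bernoulli draws given the memberships, as in the model description.
    edge_prob = Pi @ P @ Pi.T
    G = (rng.random((n, n)) < edge_prob).astype(np.int8)
    np.fill_diagonal(G, 0)  # no self-loops
    return G, Pi.T          # Pi.T matches the paper's k x n matrix Π

# Hypothetical parameters: homogeneous connectivity P(i,i) = p, P(i,j) = q,
# and a small alpha_0, i.e. nearly non-overlapping communities.
k, n, p, q, alpha0 = 3, 500, 0.3, 0.05, 0.1
P = q * np.ones((k, k)) + (p - q) * np.eye(k)
G, Pi = sample_mmsb(n, P, alpha0 * np.ones(k) / k)
```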
Many real-world settings involve sparse community memberships, and the total number of communities is typically much larger than the extent of membership of a single individual, e.g. hobbies/interests of a person, university/company networks that a person belongs to, the set of transcription factors regulating a gene, and so on. Our learning guarantees are limited to the sparse regime of the Dirichlet model.

[Figure 1: Our moment-based learning algorithm uses the 3-star count tensor from set X to sets A, B, C (and the roles of the sets are interchanged to get various estimates). Specifically, T is a third order tensor, where T(u, v, w) is the normalized count of the 3-stars with u, v, w as leaves, over all x ∈ X.]

2.2 Graph Moments Under Mixed Membership Models

Our approach for learning a mixed membership community model relies on the form of the graph moments under the mixed membership model. (We interchangeably use the term first order moments for edge counts and third order moments for 3-star counts.) We now describe the specific graph moments used by our learning algorithm (based on 3-star and edge counts) and provide explicit forms for the moments, assuming draws from a mixed membership model.

Notation: Recall that G denotes the adjacency matrix and that G_{X,A} denotes the submatrix corresponding to edges going from X to A. Recall that P ∈ [0,1]^{k×k} denotes the community connectivity matrix. Define

    F := Π^⊤ P^⊤ = [π_1 | π_2 | ··· | π_n]^⊤ P^⊤.    (7)

For a subset A ⊆ [n] of individuals, let F_A ∈ R^{|A|×k} denote the submatrix of F corresponding to nodes in A, i.e., F_A := Π_A^⊤ P^⊤. We will subsequently show that F_A is a linear map which takes any community vector π_i as input and outputs the corresponding neighborhood vector G_{i,A}^⊤ in expectation.

Our learning algorithm uses moments up to the third order, represented as a tensor. A third-order tensor T is a three-dimensional array whose (p,q,r)-th entry is denoted by T_{p,q,r}. The symbol ⊗ denotes the standard Kronecker product: if u, v, w are three vectors, then

    (u ⊗ v ⊗ w)_{p,q,r} := u_p · v_q · w_r.    (8)

A tensor of the form u ⊗ v ⊗ w is referred to as a rank-one tensor. The decomposition of a general tensor into a sum of its rank-one components is referred to as the canonical polyadic (CP) decomposition [Kolda and Bader, 2009]. We will subsequently see that the graph moments can be expressed as a tensor, and that the CP decomposition of the graph-moment tensor yields the model parameters and the community vectors under the mixed membership community model.

2.2.1 Graph Moments under Stochastic Block Model

We first analyze the graph moments in the special case of a stochastic block model (i.e., α_0 = Σ_i α_i → 0 in the Dirichlet prior in (6)) and then extend the analysis to the general mixed membership model. We provide explicit expressions for the graph moments corresponding to edge counts and 3-star counts.
We later establish in Section 3 that these moments are sufficient to learn the community memberships of the nodes and the model parameters of the block model.

3-star counts: The primary quantity of interest is a third-order tensor which counts the number of 3-stars. A 3-star is a star graph with three leaves {a, b, c}, and we refer to the internal node x of the star as its "head", denoting the structure by x → {a, b, c} (see Figure 1). We partition the network into four parts and consider 3-stars such that each node in the 3-star belongs to a different partition. (For the sample complexity analysis, we require dividing the graph into more than four partitions to deal with statistical dependency issues; we outline this in Section 3.) This is necessary to obtain a simple form of the moments, based on the conditional independence assumptions of the block model; see Proposition 2.1. Specifically, consider a partition A, B, C, X of the network. (To establish our theoretical guarantees, we assume that the partitions A, B, C, X are randomly chosen and are of size Θ(n).) We count the number of 3-stars from X to A, B, C, and our quantity of interest is

    T_{X→{A,B,C}} := (1/|X|) Σ_{i∈X} [G_{i,A}^⊤ ⊗ G_{i,B}^⊤ ⊗ G_{i,C}^⊤],    (9)

where ⊗ is the Kronecker product defined in (8) and G_{i,A} is the row vector supported on the set of neighbors of i belonging to set A. T ∈ R^{|A|×|B|×|C|} is a third order tensor, and an element of the tensor is given by

    T_{X→{A,B,C}}(a, b, c) = (1/|X|) Σ_{x∈X} G(x,a) G(x,b) G(x,c),    ∀ a ∈ A, b ∈ B, c ∈ C,    (10)

which is the normalized count of the number of 3-stars with leaves a, b, c such that the "head" is in set X.
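The empirical tensor in (9)-(10) is straightforward to compute from adjacency submatrices. The sketch below does so with a single einsum, reusing the network G, the generator rng, and n from the sampling sketch above; the function name is ours. It materializes the full |A| × |B| × |C| array, which is fine for illustration but not for large networks.

```python
def three_star_tensor(G, X, A, B, C):
    # T(a, b, c) = (1/|X|) * sum_{x in X} G(x,a) G(x,b) G(x,c), as in (10).
    GA = G[np.ix_(X, A)].astype(float)
    GB = G[np.ix_(X, B)].astype(float)
    GC = G[np.ix_(X, C)].astype(float)
    return np.einsum('xa,xb,xc->abc', GA, GB, GC) / len(X)

# A random partition of [n] into four Θ(n)-sized sets, as the analysis assumes.
A, B, C, X = np.array_split(rng.permutation(n), 4)
T = three_star_tensor(G, X, A, B, C)  # shape (|A|, |B|, |C|)
```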
We now relate the tensor T_{X→{A,B,C}} to the parameters of the stochastic block model, viz., the community connectivity matrix P and the community probability vector α̂, where α̂_i is the probability of choosing community i.

Proposition 2.1 (Moments in Stochastic Block Model). Given partitions A, B, C, X, and F := Π^⊤ P^⊤, where P is the community connectivity matrix and Π is the matrix of community membership vectors, we have

    E[G_{X,A}^⊤ | Π_A, Π_X] = F_A Π_X,    (11)

    E[T_{X→{A,B,C}} | Π_A, Π_B, Π_C] = Σ_{i∈[k]} α̂_i (F_A)_i ⊗ (F_B)_i ⊗ (F_C)_i,    (12)

where α̂_i is the probability for a node to select community i.

Remark 1 (Linear model): In Equation (11), we see that the edge generation occurs under a linear model, and more precisely, the matrix F_A ∈ R^{|A|×k} is a linear map which takes a community vector π_i ∈ R^k to a neighborhood vector G_{i,A}^⊤ ∈ R^{|A|} in expectation.

Remark 2 (Identifiability under third order moments): Note the form of the 3-star count tensor T in (12). It provides a CP decomposition of T, since each term in the summation, viz., α̂_i (F_A)_i ⊗ (F_B)_i ⊗ (F_C)_i, is a rank one tensor. Thus, we can learn the matrices F_A, F_B, F_C and the vector α̂ through a CP decomposition of the tensor T. Once these parameters are learnt, learning the communities is straightforward under exact moments: by exploiting (11), we find Π_X as

    Π_X = F_A^† · E[G_{X,A}^⊤ | Π_A, Π_X].

Similarly, we can consider another tensor consisting of 3-stars from A to X, B, C, and obtain the matrices F_X, F_B and F_C through a CP decomposition, and so on. Once we obtain the matrices F and Π for the entire set of nodes in this manner, we can obtain the community connectivity matrix P, since F := Π^⊤ P^⊤. Thus, in principle, we are able to learn all the model parameters (α̂ and P) and the community membership matrix Π under the stochastic block model, given exact moments. This establishes identifiability of the model given moments up to third order and forms a high-level approach for learning the communities. When only samples are available, we establish that the empirical versions are close to the exact moments considered above, and we modify the basic learning approach to obtain robust guarantees. See Section 3 for details.

Remark 3 (Significance of conditional independence relationships): The main property exploited in proving the tensor form in (12) is the conditional-independence assumption under the stochastic block model: the realization of the edges in each 3-star, say in x → {a, b, c}, is conditionally independent given the community membership vector π_x, when x ≠ a ≠ b ≠ c. This is because the community membership vectors Π are assumed to be drawn independently at the different nodes, and the edges are drawn independently given the community vectors. Considering 3-stars from X to A, B, C, where X, A, B, C form a partition, ensures that this conditional independence is satisfied for all the 3-stars in the tensor T.

Proof: Recall that the probability of an edge from u to v given π_u, π_v is

    E[G_{u,v} | π_u, π_v] = π_u^⊤ P π_v = π_v^⊤ P^⊤ π_u = F_v π_u,

and E[G_{X,A} | Π_A, Π_X] = Π_X^⊤ P Π_A = Π_X^⊤ F_A^⊤, and thus (11) holds. For the tensor form, first consider an element of the tensor, with a ∈ A, b ∈ B, c ∈ C:

    E[T_{X→{A,B,C}}(a, b, c) | π_a, π_b, π_c, π_x] = (1/|X|) Σ_{x∈X} F_a π_x · F_b π_x · F_c π_x.

The equation follows from the conditional-independence assumption on the edges (assuming a ≠ b ≠ c). Now, taking expectation over the nodes in X, we have

    E[T_{X→{A,B,C}}(a, b, c) | π_a, π_b, π_c] = (1/|X|) Σ_{x∈X} E[F_a π_x · F_b π_x · F_c π_x | π_a, π_b, π_c]
        = E[F_a π · F_b π · F_c π | π_a, π_b, π_c]
        = Σ_{j∈[k]} α̂_j (F_a)_j · (F_b)_j · (F_c)_j,

where the last step follows from the fact that π = e_j with probability α̂_j, and the result holds when x ≠ a, b, c. Recall that (F_a)_j denotes the j-th column of F_a (since F_a e_j = (F_a)_j). Collecting all the elements of the tensor, we obtain the desired result. □

2.2.2 Graph Moments under Mixed Membership Dirichlet Model

We now analyze the graph moments for the general mixed membership Dirichlet model. Instead of the raw moments (i.e. edge and 3-star counts), we consider modified moments to obtain similar expressions as in the case of the stochastic block model. Let μ_{X→A} ∈ R^{|A|} denote the vector which gives the normalized count of edges from X to A:

    μ_{X→A} := (1/|X|) Σ_{i∈X} [G_{i,A}^⊤].    (13)

We now define a modified adjacency matrix G^{α_0}_{X,A} as

    G^{α_0}_{X,A} := √(α_0 + 1) G_{X,A} − (√(α_0 + 1) − 1) 1⃗ μ_{X→A}^⊤.    (14)

(To compute the modified moments G^{α_0} and T^{α_0}, we need to know the value of the scalar α_0 := Σ_i α_i, which is the concentration parameter of the Dirichlet distribution and is a measure of the extent of overlap between the communities. We assume its knowledge here.) In the special case of the stochastic block model (α_0 → 0), G^{α_0}_{X,A} = G_{X,A} is the submatrix of the adjacency matrix G.
Similarly, we define modified third-order statistics,

    T^{α_0}_{X→{A,B,C}} := (α_0 + 1)(α_0 + 2) T_{X→{A,B,C}} + 2 α_0^2 μ_{X→A} ⊗ μ_{X→B} ⊗ μ_{X→C}
        − (α_0 (α_0 + 1) / |X|) Σ_{i∈X} [ G_{i,A}^⊤ ⊗ G_{i,B}^⊤ ⊗ μ_{X→C} + G_{i,A}^⊤ ⊗ μ_{X→B} ⊗ G_{i,C}^⊤ + μ_{X→A} ⊗ G_{i,B}^⊤ ⊗ G_{i,C}^⊤ ],    (15)

which reduces to (a scaled version of) the 3-star count tensor T_{X→{A,B,C}} defined in (9) for the stochastic block model (α_0 → 0). The modified adjacency matrix and the modified 3-star count tensor can be viewed as a form of "centering" of the raw moments, which simplifies the expressions for the moments.
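The centered statistics (14) and (15) are again simple array operations. Here is a minimal sketch in the same running example, assuming, as the paper does, that the scalar α_0 is known; the helper names are ours.

```python
def modified_adjacency(G, X, A, alpha0):
    # Centered adjacency submatrix G^{alpha0}_{X,A} of equation (14).
    GA = G[np.ix_(X, A)].astype(float)
    mu = GA.mean(axis=0)                    # mu_{X->A}, equation (13)
    s = np.sqrt(alpha0 + 1.0)
    return s * GA - (s - 1.0) * np.outer(np.ones(len(X)), mu)

def modified_tensor(G, X, A, B, C, alpha0):
    # Centered 3-star tensor T^{alpha0}_{X->{A,B,C}} of equation (15); it
    # reduces to (a scaled version of) the raw tensor (9) as alpha0 -> 0.
    GA = G[np.ix_(X, A)].astype(float); muA = GA.mean(axis=0)
    GB = G[np.ix_(X, B)].astype(float); muB = GB.mean(axis=0)
    GC = G[np.ix_(X, C)].astype(float); muC = GC.mean(axis=0)
    T = np.einsum('xa,xb,xc->abc', GA, GB, GC) / len(X)
    cross = (np.einsum('xa,xb,c->abc', GA, GB, muC)
             + np.einsum('xa,b,xc->abc', GA, muB, GC)
             + np.einsum('a,xb,xc->abc', muA, GB, GC)) / len(X)
    return ((alpha0 + 1) * (alpha0 + 2) * T
            + 2 * alpha0**2 * np.einsum('a,b,c->abc', muA, muB, muC)
            - alpha0 * (alpha0 + 1) * cross)
```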
The following relationships hold between the modified graph moments G^{α_0}_{X,A}, T^{α_0} and the model parameters P and α̂ of the mixed membership model.

Proposition 2.2 (Moments in Mixed Membership Model). Given partitions A, B, C, X and G^{α_0}_{X,A} and T^{α_0} as in (14) and (15), the normalized Dirichlet concentration vector α̂, and F := Π^⊤ P^⊤, where P is the community connectivity matrix and Π is the matrix of community memberships, we have

    E[(G^{α_0}_{X,A})^⊤ | Π_A, Π_X] = F_A Diag(α̂^{1/2}) Ψ_X,    (16)

    E[T^{α_0}_{X→{A,B,C}} | Π_A, Π_B, Π_C] = Σ_{i=1}^k α̂_i (F_A)_i ⊗ (F_B)_i ⊗ (F_C)_i,    (17)

where (F_A)_i corresponds to the i-th column of F_A, and Ψ_X relates to the community membership matrix Π_X as

    Ψ_X := Diag(α̂^{−1/2}) ( √(α_0 + 1) Π_X − (√(α_0 + 1) − 1) ((1/|X|) Σ_{i∈X} π_i) 1⃗^⊤ ).

Moreover, we have that

    |X|^{−1} E_{Π_X}[Ψ_X Ψ_X^⊤] = I.    (18)

Remark 1: The 3-star count tensor T^{α_0} is carefully chosen so that the CP decomposition of the tensor directly yields the matrices F_A, F_B, F_C and α̂, as in the case of the stochastic block model. Similarly, the modified adjacency matrix (G^{α_0}_{X,A})^⊤ is carefully chosen to eliminate the second-order correlation in the Dirichlet distribution, and we have that |X|^{−1} E_{Π_X}[Ψ_X Ψ_X^⊤] = I is the identity matrix. These properties will be exploited by our learning algorithm in Section 3.

Remark 2: Recall that α_0 quantifies the extent of overlap among the communities. The computation of the modified moment T^{α_0} requires the knowledge of α_0, which is assumed to be known. Since this is a scalar quantity, in practice, we can easily tune this parameter via cross validation.

Proof: The proof is along the lines of Proposition 2.1 for stochastic block models (α_0 → 0), but more involved due to the form of the Dirichlet moments. Recall that E[G_{i,A}^⊤ | π_i, Π_A] = F_A π_i for a mixed membership model, and μ_{X→A} := (1/|X|) Σ_{i∈X} G_{i,A}^⊤; therefore E[μ_{X→A} | Π_A, Π_X] = F_A ((1/|X|) Σ_{i∈X} π_i). Equation (16) follows directly. For Equation (18), we note the Dirichlet moment E[ππ^⊤] = (1/(α_0 + 1)) Diag(α̂) + (α_0/(α_0 + 1)) α̂ α̂^⊤, when π ∼ Dir(α), and

    |X|^{−1} E[Ψ_X Ψ_X^⊤] = Diag(α̂^{−1/2}) [ (α_0 + 1) E[ππ^⊤] + ( −2√(α_0+1)(√(α_0+1) − 1) + (√(α_0+1) − 1)^2 ) E[π] E[π]^⊤ ] Diag(α̂^{−1/2})
        = Diag(α̂^{−1/2}) [ Diag(α̂) + α_0 α̂ α̂^⊤ + (−α_0) α̂ α̂^⊤ ] Diag(α̂^{−1/2}) = I.

Along the lines of the proof of Proposition 2.1 for the block model, the expectation in (17) involves a multi-linear map of the expectation of the tensor product π ⊗ π ⊗ π, among other terms. Collecting these terms, we have that

    (α_0 + 1)(α_0 + 2) E[π ⊗ π ⊗ π] − α_0 (α_0 + 1) ( E[π ⊗ π ⊗ E[π]] + E[π ⊗ E[π] ⊗ π] + E[E[π] ⊗ π ⊗ π] ) + 2 α_0^2 E[π] ⊗ E[π] ⊗ E[π]

is a diagonal tensor, in the sense that its (p,p,p)-th entry is α̂_p, and its (p,q,r)-th entry is 0 when p, q, r are not all equal. With this, we have (17). □

Note the nearly identical forms of the graph moments for the stochastic block model in (11), (12) and for the general mixed membership model in (16), (17). In other words, the modified moments G^{α_0}_{X,A} and T^{α_0} have similar relationships to the underlying parameters as the raw moments in the case of the stochastic block model. This enables us to use a unified learning approach for the two models, outlined in the next section.

3 Algorithm for Learning Mixed Membership Models

The simple form of the graph moments derived in the previous section is now utilized to recover the community vectors Π and the model parameters P, α̂ of the mixed membership model. The method is based on the so-called tensor power method, used to obtain a tensor decomposition. We first outline the basic tensor decomposition method below and then demonstrate how the method can be adapted to learning using the graph moments at hand. We first analyze the simpler case when exact moments are available in Section 3.2, and then extend the method to handle empirical moments computed from the network observations in Section 3.3.

3.1 Overview of Tensor Decomposition Through Power Iterations

In this section, we review the basic method for tensor decomposition based on power iterations for a special class of tensors, viz., symmetric orthogonal tensors. Subsequently, in Sections 3.2 and 3.3, we modify this method to learn the mixed membership model from the graph moments described in the previous section. For details on the tensor power method, refer to Anandkumar et al. [2012a], Kolda and Mayo [2011].

Recall that a third-order tensor T is a three-dimensional array, and we use T_{p,q,r} to denote its (p,q,r)-th entry. The standard symbol ⊗ is used to denote the Kronecker product, and (u ⊗ v ⊗ w) is a rank one tensor. The decomposition of a tensor into its rank one components is called the CP decomposition.

Multi-linear maps: We can view a tensor T ∈ R^{d×d×d} as a multilinear map in the following sense: for a set of matrices {V_i ∈ R^{d×m_i} : i ∈ [3]}, the (i_1, i_2, i_3)-th entry in the three-way array representation of T(V_1, V_2, V_3) ∈ R^{m_1×m_2×m_3} is

    [T(V_1, V_2, V_3)]_{i_1,i_2,i_3} := Σ_{j_1,j_2,j_3∈[d]} T_{j_1,j_2,j_3} [V_1]_{j_1,i_1} [V_2]_{j_2,i_2} [V_3]_{j_3,i_3}.

The term multilinear map arises from the fact that the above map is linear in each of the coordinates, e.g. if we replace V_1 by aV_1 + bW_1 in the above equation, where W_1 is a matrix of appropriate dimensions and a, b are any scalars, the output is a linear combination of the outputs under V_1 and W_1 respectively. We will use the above notion of multi-linear transforms to describe various tensor operations. For instance, T(I, I, v) yields a matrix, T(I, v, v) a vector, and T(v, v, v) a scalar.
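In code, this multilinear map is a single einsum; the sketch below (function name ours) continues the running example, and the same transform is used for whitening later on.

```python
def multilinear(T, V1, V2, V3):
    # [T(V1, V2, V3)]_{abc} = sum_{jkl} T_{jkl} (V1)_{ja} (V2)_{kb} (V3)_{lc}
    return np.einsum('jkl,ja,kb,lc->abc', T, V1, V2, V3)
```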
Symmetric tensors and orthogonal decomposition: A special class of tensors are the symmetric tensors T ∈ R^{d×d×d}, which are invariant to permutation of the array indices. Symmetric tensors have a CP decomposition of the form

    T = Σ_{i∈[r]} λ_i v_i ⊗ v_i ⊗ v_i = Σ_{i∈[r]} λ_i v_i^{⊗3},    (19)

where r denotes the tensor CP rank, and we use the notation v_i^{⊗3} := v_i ⊗ v_i ⊗ v_i. It is convenient to first analyze methods for the decomposition of symmetric tensors, and we then extend them to the general case of asymmetric tensors. Further, a sub-class of symmetric tensors are those which possess a decomposition into orthogonal components, i.e. the vectors v_i ∈ R^d are orthogonal to one another in the decomposition in (19) (without loss of generality, we assume that the vectors {v_i} are orthonormal in this case). An orthogonal decomposition implies that the tensor rank r ≤ d, and there are tractable methods for recovering the rank-one components in this setting. We limit ourselves to this setting in this paper.

Tensor eigen analysis: For symmetric tensors T possessing an orthogonal decomposition of the form in (19), each pair (λ_i, v_i), for i ∈ [r], can be interpreted as an eigen-pair for the tensor T, since

    T(I, v_i, v_i) = Σ_{j∈[r]} λ_j ⟨v_i, v_j⟩^2 v_j = λ_i v_i,    ∀ i ∈ [r],

due to the fact that ⟨v_i, v_j⟩ = δ_{i,j}. Thus, the vectors {v_i}_{i∈[r]} can be interpreted as fixed points of the map

    v ↦ T(I, v, v) / ‖T(I, v, v)‖,    (20)

where ‖·‖ denotes the spectral norm (here ‖T(I, v, v)‖ is a vector norm), used to normalize the vector v in (20).

Basic tensor power iteration method: A straightforward approach to computing the orthogonal decomposition of a symmetric tensor is to iterate according to the fixed-point map in (20) with an arbitrary initialization vector. This is referred to as the tensor power iteration method. Additionally, it is known that the vectors {v_i}_{i∈[r]} are the only stable fixed points of the map in (20). In other words, the set of initialization vectors which converge to vectors other than {v_i}_{i∈[r]} is of measure zero. This ensures that we obtain the correct set of vectors through power iterations and that no spurious answers are obtained. See [Anandkumar et al., 2012b, Thm. 4.1] for details. Moreover, after an approximate fixed point is obtained (after many power iterations), the estimated eigen-pair can be subtracted out (i.e., deflated), and subsequent vectors can be similarly obtained through power iterations. Thus, we can obtain all the stable eigen-pairs {λ_i, v_i}_{i∈[r]}, which are the components of the orthogonal tensor decomposition. The method needs to be suitably modified when the tensor T is perturbed (e.g. as in the case when empirical moments are used), and we discuss this in Section 3.3.
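The following is a minimal sketch of the basic power iteration with deflation just described, with random initialization, a fixed iteration budget, and no perturbation safeguards; the paper's actual algorithm adds whitened-neighborhood initialization, adaptive deflation, and thresholding, so this is only the textbook variant. It reuses rng from the sampling sketch; the function name is ours.

```python
def tensor_power_method(T, r, n_iter=100):
    # Recover r eigen-pairs of a symmetric, orthogonally decomposable tensor
    # by iterating the map (20) and deflating each recovered rank-one term.
    T = T.copy()
    eigvals, eigvecs = [], []
    for _ in range(r):
        v = rng.standard_normal(T.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(n_iter):
            v = np.einsum('jkl,k,l->j', T, v, v)      # v <- T(I, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('jkl,j,k,l->', T, v, v, v)    # lambda = T(v, v, v)
        eigvals.append(lam)
        eigvecs.append(v)
        T -= lam * np.einsum('j,k,l->jkl', v, v, v)   # deflate
    return np.array(eigvals), np.column_stack(eigvecs)
```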
3.2 Learning Mixed Membership Models Under Exact Moments

We first describe the learning approach when exact moments are available; in Section 3.3, we suitably modify the approach to handle the perturbations introduced when only empirical moments are available. We now employ the tensor power method described above to obtain a CP decomposition of the graph moment tensor $T^{\alpha_0}$ in (15). We first describe a "symmetrization" procedure which converts the graph moment tensor $T^{\alpha_0}$ to a symmetric orthogonal tensor through a multilinear transformation of $T^{\alpha_0}$. We then employ the power method to obtain a symmetric orthogonal decomposition. Finally, the original CP decomposition is obtained by reversing the multilinear transform of the symmetrization procedure. This yields a guaranteed method for obtaining the decomposition of the graph moment tensor $T^{\alpha_0}$ under exact moments. We note that this symmetrization approach has been employed earlier in other contexts, e.g., for learning hidden Markov models [Anandkumar et al., 2012b, Sec. 3.3].

Reduction of the graph-moment tensor to symmetric orthogonal form (whitening): Recall from Proposition 2.2 that the modified 3-star count tensor $T^{\alpha_0}$ has a CP decomposition

$$\mathbb{E}[T^{\alpha_0} \mid \Pi_A, \Pi_B, \Pi_C] = \sum_{i=1}^{k} \hat\alpha_i\, (F_A)_i \otimes (F_B)_i \otimes (F_C)_i.$$

We now describe a symmetrization procedure which converts $T^{\alpha_0}$ to a symmetric orthogonal tensor through a multilinear transformation using the modified adjacency matrix $G^{\alpha_0}$ defined in (14). Consider the singular value decomposition (SVD) of the modified adjacency matrix $G^{\alpha_0}$ under exact moments:

$$|X|^{-1/2}\,\mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi] = U_A D_A V_A^\top.$$

Define $W_A := U_A D_A^{-1}$, and similarly define $W_B$ and $W_C$ using the corresponding matrices $G^{\alpha_0}_{X,B}$ and $G^{\alpha_0}_{X,C}$ respectively. Now define

$$R_{A,B} := \frac{1}{|X|}\, W_B^\top\, \mathbb{E}[(G^{\alpha_0}_{X,B})^\top \mid \Pi]\cdot \mathbb{E}[G^{\alpha_0}_{X,A} \mid \Pi]\, W_A, \qquad \tilde W_B := W_B R_{A,B}, \qquad (21)$$

and similarly define $\tilde W_C$. We establish that a multilinear transformation (as defined in Section 3.1) of the graph-moment tensor $T^{\alpha_0}$ using the matrices $W_A$, $\tilde W_B$, and $\tilde W_C$ results in a symmetric orthogonal form.

Lemma 3.1 (Orthogonal Symmetric Tensor). Assume that the matrices $F_A$, $F_B$, $F_C$ and $\Pi_X$ have rank $k$, where $k$ is the number of communities. Then the modified 3-star count tensor $T^{\alpha_0}$ in (15) has an orthogonal symmetric form under a multilinear transformation using the matrices $W_A$, $\tilde W_B$, and $\tilde W_C$:

$$\mathbb{E}[T^{\alpha_0}(W_A, \tilde W_B, \tilde W_C) \mid \Pi_A, \Pi_B, \Pi_C] = \sum_{i\in[k]} \lambda_i\, \Phi_i^{\otimes 3} \in \mathbb{R}^{k\times k\times k}, \qquad (22)$$

where $\lambda_i := \hat\alpha_i^{-1/2}$ and $\Phi \in \mathbb{R}^{k\times k}$ is an orthogonal matrix, given by

$$\Phi := W_A^\top F_A \operatorname{Diag}(\hat\alpha^{1/2}). \qquad (23)$$

Remark 1: Note that the matrix $W_A$ orthogonalizes $F_A$ under exact moments and is referred to as a whitening matrix. Similarly, the matrices $\tilde W_B = W_B R_{A,B}$ and $\tilde W_C = W_C R_{A,C}$ consist of the whitening matrices $W_B$ and $W_C$, while the matrices $R_{A,B}$ and $R_{A,C}$ additionally serve to symmetrize the tensor. We can interpret $\{(\lambda_i, \Phi_i)\}_{i\in[k]}$ as the stable eigen-pairs of the transformed tensor (henceforth referred to as the whitened and symmetrized tensor).

Remark 2: The full-rank assumption on the matrix $F_A = \Pi_A^\top P^\top \in \mathbb{R}^{|A|\times k}$ implies that $|A| \ge k$, and similarly $|B|, |C|, |X| \ge k$. Moreover, we require the community connectivity matrix $P \in \mathbb{R}^{k\times k}$ to be of full rank, which is a natural non-degeneracy condition. (In the work of McSherry [2001], where spectral clustering for stochastic block models is analyzed, a rank-deficient $P$ is allowed as long as the neighborhood vectors generated by any pair of communities are sufficiently different; our method, in contrast, requires $P$ to be full rank. We argue that this is a mild restriction, since we allow for mixed memberships while McSherry [2001] is limited to the stochastic block model.) In this case, we can reduce the graph-moment tensor $T^{\alpha_0}$ to a rank-$k$ orthogonal symmetric tensor, which has a unique decomposition. This implies that the mixed membership model is identifiable using 3-star and edge count moments when the network size $n = |A| + |B| + |C| + |X| \ge 4k$, the matrix $P$ is full rank, and the community membership matrices $\Pi_A, \Pi_B, \Pi_C, \Pi_X$ each have rank $k$. On the other hand, when only empirical moments are available, we roughly require the network size $n = \Omega(k^2(\alpha_0+1)^2)$ (where $\alpha_0 := \sum_i \alpha_i$ is related to the extent of overlap between the communities) to provide guaranteed learning of the community memberships and model parameters. See Section 4 for a detailed sample analysis.
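For concreteness, the following NumPy sketch (ours) carries out the whitening and symmetrization above. Under exact moments, the inputs would be the conditional expectations of the modified moments in (14)–(15); the same code applies verbatim to their empirical counterparts in Section 3.3:

import numpy as np

def whiten_and_symmetrize(T, G_XA, G_XB, G_XC, k):
    # T: 3-star count tensor (|A| x |B| x |C|); G_X*: modified adjacency
    # submatrices from (14). Returns the k x k x k tensor of (22).
    nX = G_XA.shape[0]

    def whitening_matrix(G):
        # k-rank SVD of |X|^{-1/2} G^T; the whitening matrix is W = U_k D_k^{-1}.
        U, D, _ = np.linalg.svd(G.T / np.sqrt(nX), full_matrices=False)
        return U[:, :k] / D[:k]              # same as U_k @ diag(1 / D_k)

    W_A, W_B, W_C = map(whitening_matrix, (G_XA, G_XB, G_XC))
    R_AB = W_B.T @ G_XB.T @ G_XA @ W_A / nX   # symmetrizing rotations, cf. (21)
    R_AC = W_C.T @ G_XC.T @ G_XA @ W_A / nX
    return np.einsum('jkl,ja,kb,lc->abc', T, W_A, W_B @ R_AB, W_C @ R_AC)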
Proof: Recall that the modified adjacency matrix $G^{\alpha_0}$ satisfies

$$\mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi_A, \Pi_X] = F_A \operatorname{Diag}(\hat\alpha^{1/2})\, \Psi_X, \qquad \Psi_X := \operatorname{Diag}(\hat\alpha^{-1/2})\left(\sqrt{\alpha_0+1}\,\Pi_X - (\sqrt{\alpha_0+1}-1)\,\frac{1}{|X|}\sum_{i\in X}\pi_i\; \vec 1^\top\right).$$

From the definition of $\Psi_X$ above, we see that it has rank $k$ when $\Pi_X$ has rank $k$. Using Sylvester's rank inequality, the rank of $F_A \operatorname{Diag}(\hat\alpha^{1/2})\Psi_X$ is at least $k + k - k = k$. This implies that the whitening matrix $W_A$ also has rank $k$. Notice that

$$|X|^{-1}\, W_A^\top\, \mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi]\cdot \mathbb{E}[G^{\alpha_0}_{X,A} \mid \Pi]\, W_A = D_A^{-1} U_A^\top U_A D_A^2 U_A^\top U_A D_A^{-1} = I \in \mathbb{R}^{k\times k},$$

or in other words, $|X|^{-1} M M^\top = I$, where $M := W_A^\top F_A \operatorname{Diag}(\hat\alpha^{1/2})\, \Psi_X$. We now have

$$I = |X|^{-1}\, \mathbb{E}_{\Pi_X}\!\left[M M^\top\right] = |X|^{-1}\, W_A^\top F_A \operatorname{Diag}(\hat\alpha^{1/2})\, \mathbb{E}[\Psi_X \Psi_X^\top]\, \operatorname{Diag}(\hat\alpha^{1/2})\, F_A^\top W_A = W_A^\top F_A \operatorname{Diag}(\hat\alpha)\, F_A^\top W_A,$$

since $|X|^{-1}\,\mathbb{E}_{\Pi_X}[\Psi_X \Psi_X^\top] = I$ from (18), and we use the fact that the sets $A$ and $X$ do not overlap. Thus, $W_A$ whitens $F_A \operatorname{Diag}(\hat\alpha^{1/2})$ under exact moments (upon taking the expectation over $\Pi_X$), and the columns of $W_A^\top F_A \operatorname{Diag}(\hat\alpha^{1/2})$ are orthonormal. Now note from the definition of $\tilde W_B$ that

$$\tilde W_B^\top\, \mathbb{E}[(G^{\alpha_0}_{X,B})^\top \mid \Pi] = W_A^\top\, \mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi],$$

since $W_B$ satisfies $|X|^{-1}\, W_B^\top\, \mathbb{E}[(G^{\alpha_0}_{X,B})^\top \mid \Pi]\cdot \mathbb{E}[G^{\alpha_0}_{X,B} \mid \Pi]\, W_B = I$, and a similar result holds for $\tilde W_C$. The final result (22) follows by taking the expectation of the tensor $T^{\alpha_0}$ over $\Pi_X$. □

Overview of the learning approach under exact moments: With the above result in place, we are now ready to describe the high-level approach for learning the mixed membership model under exact moments. First, symmetrize the graph-moment tensor $T^{\alpha_0}$ as described above, and then apply the tensor power method of the previous section. This yields the vector of eigenvalues $\lambda := \hat\alpha^{-1/2}$ and the matrix of eigenvectors $\Phi = W_A^\top F_A \operatorname{Diag}(\hat\alpha^{1/2})$. We can then recover the community membership vectors of the set $A^c$ (i.e., the nodes not in $A$) under exact moments as

$$\Pi_{A^c} \leftarrow \operatorname{Diag}(\lambda)^{-1}\, \Phi^\top\, W_A^\top\, \mathbb{E}[G^\top_{A^c,A} \mid \Pi],$$

since $\mathbb{E}[G^\top_{A^c,A} \mid \Pi] = F_A \Pi_{A^c}$ (as $A$ and $A^c$ do not overlap) and $\operatorname{Diag}(\lambda)^{-1}\Phi^\top W_A^\top = \operatorname{Diag}(\hat\alpha)\, F_A^\top W_A W_A^\top$ under exact moments.
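In code, this inversion step is a single chain of matrix products. A schematic sketch (ours), taking the eigen-pairs (lam, Phi) returned by the power method:

import numpy as np

def recover_memberships(lam, Phi, W_A, G_AcA):
    # Pi_{A^c} = Diag(lam)^{-1} Phi^T W_A^T G_{A^c,A}^T, a k x |A^c| matrix;
    # exact under exact moments, an estimate under empirical ones.
    return (Phi / lam).T @ W_A.T @ G_AcA.T

def recover_alpha(lam):
    # The normalized Dirichlet parameters follow from the eigenvalues: alpha_hat_i = lam_i^{-2}.
    return 1.0 / lam**2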
To recover the community membership vectors of the set $A$ itself, viz., $\Pi_A$, we can reverse the direction of the 3-star counts, i.e., consider the 3-stars from set $A$ to $X, B, C$, and obtain $\Pi_A$ in a similar manner. Once all the community membership vectors $\Pi$ are obtained, we can recover the community connectivity matrix $P$ using the relationship $\Pi^\top P \Pi = \mathbb{E}[G \mid \Pi]$ and the fact that $\Pi$ has rank $k$. Thus, we are able to learn the community membership vectors $\Pi$ and the model parameters $\hat\alpha$ and $P$ of the mixed membership model using edge counts and the 3-star count tensor. We now describe modifications to this approach to handle empirical moments.

3.3 Learning Algorithm Under Empirical Moments

In the previous section, we explored a tensor-based approach for learning the mixed membership model under exact moments. In practice, however, we only have samples (i.e., the observed network), and the method needs to be robust to the perturbations that arise when empirical moments are employed.

Algorithm 1 $\{\hat\Pi, \hat P, \hat\alpha\} \leftarrow$ LearnMixedMembership($G, k, \alpha_0, N, \tau$)

Input: Adjacency matrix $G\in\mathbb{R}^{n\times n}$; the number of communities $k$; $\alpha_0 := \sum_i \alpha_i$, where $\alpha$ is the Dirichlet parameter vector; the number of iterations $N$ for the tensor power method; and the threshold $\tau$ for the estimated community membership vectors, specified in (29) in assumption A5. Let $A^c := [n]\setminus A$ denote the set of nodes not in $A$.
Output: Estimates of the community membership vectors $\Pi\in\mathbb{R}^{n\times k}$, the community connectivity matrix $P\in[0,1]^{k\times k}$, and the normalized Dirichlet parameter vector $\hat\alpha$.
  Partition the vertex set $[n]$ into 5 parts $X, Y, A, B, C$.
  Compute the moments $G^{\alpha_0}_{X,A}$, $G^{\alpha_0}_{X,B}$, $G^{\alpha_0}_{X,C}$, $T^{\alpha_0}_{Y\to\{A,B,C\}}$ using (14) and (15).
  $\{\hat\Pi_{A^c}, \hat\alpha\} \leftarrow$ LearnPartitionCommunity($G^{\alpha_0}_{X,A}, G^{\alpha_0}_{X,B}, G^{\alpha_0}_{X,C}, T^{\alpha_0}_{Y\to\{A,B,C\}}, G, N, \tau$).
  Interchange the roles of $Y$ and $A$ to obtain $\hat\Pi_{Y^c}$.
  Define $\hat Q$ such that its $i$-th row is $\hat Q^i := (\alpha_0+1)\,\hat\Pi^i / \|\hat\Pi^i\|_1 - (\alpha_0/n)\,\vec 1^\top$. {We will establish that $\hat Q \approx (\Pi^\dagger)^\top$ under conditions A1–A5.}
  Estimate $\hat P \leftarrow \hat Q G \hat Q^\top$. {Recall that $\mathbb{E}[G] = \Pi^\top P \Pi$ in our model.}
  Return $\hat\Pi, \hat P, \hat\alpha$.

3.3.1 Pre-processing steps

Partitioning: In the previous section, we partitioned the nodes into four sets $A, B, C, X$ for learning under exact moments. Under empirical moments, however, we require more partitions in order to avoid statistical dependency issues and to obtain stronger reconstruction guarantees. We now divide the network into five non-overlapping sets $A, B, C, X, Y$. The set $X$ is employed to compute the whitening matrices $\hat W_A$, $\hat W_B$ and $\hat W_C$ (described in detail subsequently), the set $Y$ is employed to compute the 3-star count tensor $T^{\alpha_0}$, and the sets $A, B, C$ contain the leaves of the 3-stars under consideration. The roles of the sets can be interchanged to obtain the community membership vectors of all the sets.

Whitening: The whitening procedure is along the same lines as described in the previous section, except that empirical moments are now used.
Procedure 1 $\{\hat\Pi_{A^c}, \hat\alpha\} \leftarrow$ LearnPartitionCommunity($G^{\alpha_0}_{X,A}, G^{\alpha_0}_{X,B}, G^{\alpha_0}_{X,C}, T^{\alpha_0}_{Y\to\{A,B,C\}}, G, N, \tau$)

Input: Modified adjacency submatrices $G^{\alpha_0}_{X,A}$, $G^{\alpha_0}_{X,B}$, $G^{\alpha_0}_{X,C}$; the 3-star count tensor $T^{\alpha_0}_{Y\to\{A,B,C\}}$; the adjacency matrix $G$; the number of iterations $N$ for the tensor power method; and the threshold $\tau$ for the estimated community membership vectors. Let Thres($A,\tau$) denote the element-wise thresholding operation with threshold $\tau$, i.e., Thres($A,\tau$)$_{i,j} = A_{i,j}$ if $A_{i,j} \ge \tau$ and $0$ otherwise. Let $e_i$ denote the basis vector along coordinate $i$.
Output: Estimates of $\Pi_{A^c}$ and $\hat\alpha$.
  Compute the rank-$k$ SVD $(|X|^{-1/2} G^{\alpha_0}_{X,A})^\top \overset{k\text{-svd}}{=} U_A D_A V_A^\top$ and the whitening matrix $\hat W_A := U_A D_A^{-1}$. Similarly, compute $\hat W_B$, $\hat W_C$ and $\hat R_{AB}$, $\hat R_{AC}$ using (24).
  Compute the whitened and symmetrized tensor $T \leftarrow T^{\alpha_0}_{Y\to\{A,B,C\}}(\hat W_A, \hat W_B \hat R_{AB}, \hat W_C \hat R_{AC})$.
  $\{\hat\lambda, \hat\Phi\} \leftarrow$ TensorEigen($T, \{\hat W_A^\top G^\top_{i,A}\}_{i\notin A}, N$). {$\hat\Phi$ is a $k\times k$ matrix whose columns are the estimated eigenvectors, and $\hat\lambda$ is the vector of estimated eigenvalues.}
  $\hat\Pi_{A^c} \leftarrow$ Thres($\operatorname{Diag}(\hat\lambda)^{-1}\hat\Phi^\top \hat W_A^\top G^\top_{A^c,A},\; \tau$) and $\hat\alpha_i \leftarrow \hat\lambda_i^{-2}$ for $i\in[k]$.
  Return $\hat\Pi_{A^c}$ and $\hat\alpha$.

Specifically, consider the rank-$k$ singular value decomposition (SVD) of the modified adjacency matrix $G^{\alpha_0}$ defined in (14):

$$(|X|^{-1/2}\, G^{\alpha_0}_{X,A})^\top \overset{k\text{-svd}}{=} U_A D_A V_A^\top.$$

Define $\hat W_A := U_A D_A^{-1}$, and similarly define $\hat W_B$ and $\hat W_C$ using the corresponding matrices $G^{\alpha_0}_{X,B}$ and $G^{\alpha_0}_{X,C}$ respectively. Now define

$$\hat R_{A,B} := \frac{1}{|X|}\, \hat W_B^\top\, (G^{\alpha_0}_{X,B})^\top_{k\text{-svd}}\cdot (G^{\alpha_0}_{X,A})_{k\text{-svd}}\, \hat W_A, \qquad (24)$$

and similarly define $\hat R_{A,C}$. The whitened and symmetrized graph-moment tensor is now computed as $T^{\alpha_0}_{Y\to\{A,B,C\}}(\hat W_A, \hat W_B \hat R_{AB}, \hat W_C \hat R_{AC})$, where $T^{\alpha_0}$ is given by (15) and the multilinear transformation of a tensor is defined in Section 3.1.

3.3.2 Modifications to the tensor power method

Recall that under exact moments, the stable eigen-pairs of a symmetric orthogonal tensor can be computed in a straightforward manner through the basic power iteration method in (20), along with the deflation procedure. However, this is not sufficient to obtain good reconstruction guarantees under empirical moments. We therefore propose a robust tensor method, detailed in Procedure 2. The main modifications are (i) efficient initialization and (ii) adaptive deflation, detailed below. These modifications allow us to tolerate a far greater perturbation of the third-order moment tensor than the basic tensor power procedure employed in Anandkumar et al. [2012b]; see the remarks following Theorem A.1 in Appendix A for a precise comparison.

Efficient initialization: Recall that the basic tensor power method uses generic initialization vectors, and this procedure recovers all the stable eigenvectors correctly (except for initialization vectors in a set of measure zero). Under empirical moments, however, we have a perturbed tensor, and here it is advantageous to employ specific initialization vectors instead. For instance, to obtain one of the eigenvectors $\Phi_i$, it is advantageous to initialize with a vector in the neighborhood of $\Phi_i$.
This not only reduces the number of power iterations required to (approximately) converge but, more importantly, makes the power method more robust to perturbations. See Theorem A.1 in Appendix A.1 for a detailed analysis quantifying the relationship between the initialization vectors, the tensor perturbation, and the resulting guarantees for recovery of the tensor eigenvectors.

For a mixed membership model in the sparse regime, recall that the community membership vectors $\Pi$ are sparse with high probability. In this regime of the model, the whitened neighborhood vectors contain good initializers for the power iterations. Specifically, in Procedure 2, we initialize with the whitened neighborhood vectors $\hat W_A^\top G^\top_{i,A}$, for $i\notin A$. The intuition is as follows: for a suitable choice of parameters (such as the scaling of the network size $n$ with respect to the number of communities $k$), we expect the neighborhood vectors $G^\top_{i,A}$ to concentrate around their mean values, viz., $F_A \pi_i$. Since $\pi_i$ is sparse (w.h.p.) in the model regime under consideration, there exist vectors $\hat W_A^\top F_A \pi_i$, for $i\in A^c$, which concentrate (w.h.p.) along only a few eigen-directions of the whitened tensor, and hence serve as effective initializers.

Adaptive deflation: Recall that in the basic power iteration procedure, the eigen-pairs are obtained one after another through simple deflation: subtracting the estimates of the current eigen-pairs and running the power iterations again to obtain new eigenvectors. It turns out, however, that we can establish better theoretical guarantees (in terms of greater robustness) by adaptively deflating the tensor in each power iteration. In Procedure 2, among the estimated eigen-pairs, we deflate only those which "compete" with the current estimate of the power iteration: if the vector $\theta^{(\tau)}_t$ in the current iteration has a significant projection along the direction of an estimated eigenvector $\phi_j$, i.e., $|\lambda_j \langle \theta^{(\tau)}_t, \phi_j\rangle| > \xi$ for some threshold $\xi$, then the eigen-pair is deflated; otherwise the eigenvector $\phi_j$ is not deflated. This allows us to carefully control the error build-up for each estimated eigen-pair in our analysis. Intuitively, if an eigenvector does not have a good correlation with the current estimate, it does not interfere with the update of the current vector; if it does have a good correlation, then it is pertinent that it be deflated, so as to discourage convergence in the direction of the already estimated eigenvector. See Theorem A.1 in Appendix A.1 for details.

Finally, we note that stabilization, as proposed by Kolda and Mayo [2011] for general tensor eigen-decomposition (as opposed to the orthogonal decomposition in this paper), can be effective in improving convergence, especially on real data; we defer its detailed analysis to future work.

Procedure 2 $\{\lambda, \Phi\} \leftarrow$ TensorEigen($T, \{v_i\}_{i\in[L]}, N$)

Input: Tensor $T\in\mathbb{R}^{k\times k\times k}$, $L$ initialization vectors $\{v_i\}_{i\in[L]}$, number of iterations $N$.
Output: The estimated eigenvalue/eigenvector pairs $\{\lambda, \Phi\}$, where $\lambda$ is the vector of eigenvalues and $\Phi$ is the matrix of eigenvectors.
  for $i = 1$ to $k$ do
    for $\tau = 1$ to $L$ do
      $\theta_0 \leftarrow v_\tau$.
      for $t = 1$ to $N$ do
        $\tilde T \leftarrow T$.
        for $j = 1$ to $i-1$ (when $i > 1$) do
          if $|\lambda_j \langle \theta^{(\tau)}_t, \phi_j\rangle| > \xi$ then
            $\tilde T \leftarrow \tilde T - \lambda_j\, \phi_j^{\otimes 3}$.
          end if
        end for
        Compute the power iteration update $\theta^{(\tau)}_t := \tilde T(I, \theta^{(\tau)}_{t-1}, \theta^{(\tau)}_{t-1}) \,/\, \|\tilde T(I, \theta^{(\tau)}_{t-1}, \theta^{(\tau)}_{t-1})\|$.
      end for
    end for
    Let $\tau^* := \arg\max_{\tau\in[L]} \{\tilde T(\theta^{(\tau)}_N, \theta^{(\tau)}_N, \theta^{(\tau)}_N)\}$.
    Do $N$ power iteration updates starting from $\theta^{(\tau^*)}_N$ to obtain the eigenvector estimate $\phi_i$, and set $\lambda_i := \tilde T(\phi_i, \phi_i, \phi_i)$.
  end for
  return the estimated eigenvalue/eigenvector pairs $(\lambda, \Phi)$.
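A compact Python rendering of Procedure 2 (our sketch, with the deflation threshold xi exposed as a parameter) may help fix the control flow:

import numpy as np

def tensor_eigen(T, inits, N, xi):
    # Procedure 2 (sketch): power method with supplied initializations and
    # adaptive deflation. Returns the estimated eigenvalues and eigenvectors.
    k = T.shape[0]
    lams, Phi = np.zeros(k), np.zeros((k, k))

    def deflate(i, theta):
        # Subtract only those previously estimated eigen-pairs that "compete"
        # with the current iterate theta.
        Td = T.copy()
        for j in range(i):
            if abs(lams[j] * (theta @ Phi[:, j])) > xi:
                Td -= lams[j] * np.einsum('a,b,c->abc', Phi[:, j], Phi[:, j], Phi[:, j])
        return Td

    def update(Td, theta):
        u = np.einsum('abc,b,c->a', Td, theta, theta)
        return u / np.linalg.norm(u)

    for i in range(k):
        candidates = []
        for v in inits:
            theta = v / np.linalg.norm(v)
            for _ in range(N):
                theta = update(deflate(i, theta), theta)
            Td = deflate(i, theta)
            candidates.append((np.einsum('abc,a,b,c->', Td, theta, theta, theta), theta))
        _, theta = max(candidates, key=lambda c: c[0])
        for _ in range(N):                    # N further updates from the best start
            theta = update(deflate(i, theta), theta)
        Td = deflate(i, theta)
        lams[i] = np.einsum('abc,a,b,c->', Td, theta, theta, theta)
        Phi[:, i] = theta
    return lams, Phi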
3.3.3 Reconstruction after the tensor power method

Recall from Section 3.2 that, when exact moments are available, estimating the community membership vectors $\Pi$ is straightforward once we recover all the stable tensor eigen-pairs. In the case of empirical moments, however, we can obtain better guarantees with the following modification: the estimated community membership vectors $\Pi$ are further subjected to thresholding, so that the weak values are set to zero. Since we limit ourselves to the regime of the mixed membership model where the community vectors $\Pi$ are sparse (w.h.p.), this modification strengthens our reconstruction guarantees. This thresholding step is incorporated in Algorithm 1.

Moreover, recall that under exact moments, estimating the community connectivity matrix $P$ is straightforward once we recover the community membership vectors, since $P \leftarrow (\Pi^\top)^\dagger\, \mathbb{E}[G\mid\Pi]\, \Pi^\dagger$. When only empirical moments are available, however, we are able to establish better reconstruction guarantees through a different method, outlined in Algorithm 1. We define $\hat Q$ such that its $i$-th row is

$$\hat Q^i := (\alpha_0+1)\,\frac{\hat\Pi^i}{\|\hat\Pi^i\|_1} - \frac{\alpha_0}{n}\,\vec 1^\top,$$

based on the estimates $\hat\Pi$, and the matrix $\hat P$ is obtained as $\hat P \leftarrow \hat Q G \hat Q^\top$. We subsequently establish that $\hat Q \hat\Pi^\top \approx I$ under a set of sufficient conditions outlined in the next section.
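The post-processing in Algorithm 1 thus amounts to the following two steps (a sketch in our notation, with Pi_hat stored as a k x n matrix whose rows index communities):

import numpy as np

def threshold(Pi_hat, tau):
    # Thres(., tau): zero out the weak entries of the estimated memberships.
    return np.where(Pi_hat >= tau, Pi_hat, 0.0)

def estimate_P(Pi_hat, G, alpha0):
    # Row i of Q_hat is (alpha0 + 1) Pi_hat^i / |Pi_hat^i|_1 - (alpha0 / n) 1^T,
    # so that Q_hat approximates (Pi^dagger)^T; then P_hat = Q_hat G Q_hat^T.
    n = Pi_hat.shape[1]
    Q = (alpha0 + 1) * Pi_hat / np.abs(Pi_hat).sum(axis=1, keepdims=True) - alpha0 / n
    return Q @ G @ Q.T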
Improved support recovery estimates in homophilic models: A sub-class of community models are those satisfying homophily. As discussed in Section 1, homophily, or the tendency to form edges among members of the same community, has been posited as an important factor in community formation, especially in social settings. Many existing learning algorithms, e.g., Chen et al. [2012], require this assumption to provide guarantees in the stochastic block model setting. We describe a post-processing method in Procedure 3 for models with community connectivity matrix $P$ satisfying $P(i,i) \equiv p > P(i,j) \equiv q$ for all $i\ne j$. (The procedure can also easily be modified to work in situations where the order of intra-connectivity and inter-connectivity among communities is reversed, i.e., $P(i,i) \equiv p < P(i,j) \equiv q$ for all $i\ne j$; for instance, in the $k$-coloring model [McSherry, 2001], $p = 0$ and $q > 0$.)

For such models, we can obtain improved estimates by averaging. Specifically, consider the nodes in set $C$ and the edges going from $C$ to nodes in $B$. First, consider the special case of the stochastic block model: for each node $c\in C$, compute the number of neighbors in $B$ belonging to each community (as given by the estimate $\hat\Pi$ from Algorithm 1), and declare the community with the maximum number of such neighbors to be the community of node $c$. Intuitively, this provides a better estimate of $\Pi_C$, since we average over the edges into $B$; this idea has been used before in the context of spectral clustering [McSherry, 2001]. The same idea extends to general mixed membership (homophilic) models: declare a community to be significant if it exceeds a certain threshold, as evaluated by the average number of edges to each community. The correctness of the procedure can be gleaned from the fact that if the true $F$ matrix is input, it satisfies

$$F_{j,i} = q + \Pi_{i,j}\,(p - q), \qquad \forall\, i\in[k],\ j\in[n],$$

and if the true $P$ matrix is input, then $H = p$ and $L = q$. Thus, under a suitable threshold $\xi$, the entries $F_{j,i}$ provide information on whether the corresponding community weight $\Pi_{i,j}$ is significant. In the next section, we establish that in a certain regime of parameters, this support recovery procedure leads to zero-error recovery of the significant community memberships of the nodes, and also rules out the communities where a node does not have a strong presence.

Procedure 3 $\{\hat S\} \leftarrow$ SupportRecoveryHomophilicModels($G, k, \alpha_0, \xi, \hat\Pi$)

Input: Adjacency matrix $G\in\mathbb{R}^{n\times n}$; the number of communities $k$; $\alpha_0 := \sum_i \alpha_i$, where $\alpha$ is the Dirichlet parameter vector; the threshold $\xi$ for support recovery, corresponding to significant community memberships of an individual; and the estimate $\hat\Pi$ from Algorithm 1. Assume the model is homophilic: $P(i,i) \equiv p > P(i,j) \equiv q$ for all $i\ne j$.
Output: $\hat S \in \{0,1\}^{n\times k}$, the estimated support for significant community memberships (see Theorem 4.2 for guarantees).
  Consider the partitions $A, B, C, X, Y$ as in Algorithm 1.
  Define $\hat Q$ along the lines of its definition in Algorithm 1, using the estimates $\hat\Pi$. Let the $i$-th row for set $B$ be $\hat Q^i_B := (\alpha_0+1)\,\hat\Pi^i_B / \|\hat\Pi^i_B\|_1 - (\alpha_0/n)\,\vec 1^\top$. Similarly define $\hat Q^i_C$.
  Estimate $\hat F_C \leftarrow G_{C,B}\,\hat Q_B^\top$ and $\hat P \leftarrow \hat Q_C \hat F_C$.
  if $\alpha_0 = 0$ (stochastic block model) then
    for $x\in C$ do
      Let $i^* \leftarrow \arg\max_{i\in[k]} \hat F_C(x,i)$, and set $\hat S(i^*, x) \leftarrow 1$ and $0$ otherwise.
    end for
  else
    Let $H$ be the average of the diagonal entries of $\hat P$, and $L$ the average of its off-diagonal entries.
    for $x\in C$, $i\in[k]$ do
      $\hat S(i,x) \leftarrow 1$ if $\hat F_C(x,i) \ge L + (H-L)\cdot \frac{3\xi}{4}$, and zero otherwise. {Identify large entries.}
    end for
  end if
  Permute the roles of the sets $A, B, C, X, Y$ to obtain the results for the remaining nodes.
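In NumPy, the core of Procedure 3 reads as follows (our sketch; the estimate F_hat(x, i) plays the role of the average number of edges from node x to community i):

import numpy as np

def support_recovery(G_CB, Q_B, P_hat, alpha0, xi):
    # Procedure 3 (sketch): degree averaging for homophilic models.
    # F_hat(x, i) estimates q + Pi(i, x) (p - q), so its large entries flag
    # the significant community memberships of node x.
    F_hat = G_CB @ Q_B.T                      # |C| x k matrix of average degrees
    if alpha0 == 0:                           # stochastic block model: pick the argmax
        S = np.zeros(F_hat.shape, dtype=int)
        S[np.arange(F_hat.shape[0]), F_hat.argmax(axis=1)] = 1
        return S
    H = np.diag(P_hat).mean()                 # average intra-community connectivity
    k = P_hat.shape[0]
    L = (P_hat.sum() - np.trace(P_hat)) / (k * (k - 1))   # average inter-community connectivity
    return (F_hat >= L + (H - L) * 3 * xi / 4).astype(int)  # identify large entries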
Computational complexity: We note that the computational complexity of the method, implemented naively, is $O(n^2 k + k^{4.43}\hat\alpha_{\min}^{-1})$ when $\alpha_0 > 1$ and $O(n^2 k)$ when $\alpha_0 < 1$. The time for computing the whitening matrices is dominated by the SVD for the top $k$ singular vectors of an $n\times n$ matrix, which takes $O(n^2 k)$ time. We then compute the whitened tensor $T$, which requires $O(n^2 k + k^3 n) = O(n^2 k)$ time: for each $i\in Y$, we multiply $G_{i,A}$, $G_{i,B}$, $G_{i,C}$ with the corresponding whitening matrices, which takes $O(nk)$ time, and we average the resulting $k\times k\times k$ tensors over the nodes $i\in Y$, which takes $O(k^3)$ time per step. For the tensor power method, a single iteration requires $O(k^3)$ time; we need at most $\log n$ iterations per initialization vector and $O(\hat\alpha_{\min}^{-1} k^{0.43})$ initialization vectors (this can be smaller when $\alpha_0 < 1$). Hence the total running time of the tensor power method is $O(k^{4.43}\hat\alpha_{\min}^{-1})$ (when $\alpha_0$ is small this improves to $O(k^4 \hat\alpha_{\min}^{-1})$, which is dominated by $O(n^2 k)$). In the process of estimating $\Pi$ and $P$, the dominant operation is multiplying a $k\times n$ matrix by an $n\times n$ matrix, which takes $O(n^2 k)$ time. For support recovery, the dominant operation is computing the "average degrees", which again takes $O(n^2 k)$ time. Thus, the overall computational time is $O(n^2 k + k^{4.43}\hat\alpha_{\min}^{-1})$ when $\alpha_0 > 1$ and $O(n^2 k)$ when $\alpha_0 < 1$.

Note that the above bound on the complexity of our method nearly matches the bound for the spectral clustering method of McSherry [2001], since computing the rank-$k$ SVD requires $O(n^2 k)$ time. Another method for learning stochastic block models is based on convex optimization involving semi-definite programming (SDP) [Chen et al., 2012], and it provides the best scaling bounds known so far (for both the network size $n$ and the separation $p-q$ of the edge connectivity). The specific convex problem can be solved via the method of augmented Lagrange multipliers [Lin et al., 2010], where each step consists of an SVD operation and q-linear convergence is established by Lin et al. [2010]. This implies that the method has complexity $O(n^3)$, since it involves the SVD of a general $n\times n$ matrix rather than a rank-$k$ SVD. Thus, our method has a significant advantage in terms of computational complexity when the number of communities is much smaller than the network size ($k \ll n$).

Further, a subsequent work provides a more sophisticated implementation of the proposed tensor method through parallelization and the use of stochastic gradient descent for tensor decomposition [Huang et al., 2013]. Additionally, the rank-$k$ SVD operations are approximated via randomized methods such as the Nystrom method, leading to more efficient implementations [Gittens and Mahoney, 2013]. Huang et al. [2013] deploy the tensor approach for community detection and establish that it has a running time of $O(n + k^3)$ using $nk$ cores under a parallel computation model [JáJá, 1992].

4 Sample Analysis for the Proposed Learning Algorithm

4.1 Homogeneous Mixed Membership Models

It is easier to first present the results for our proposed algorithm in the special case where all the communities have the same expected size and the entries of the community connectivity matrix $P$ are equal on the diagonal and off-diagonal locations:

$$\hat\alpha_i \equiv \frac{1}{k}, \qquad P(i,j) = p\cdot \mathbb{I}(i=j) + q\cdot \mathbb{I}(i\ne j), \qquad p \ge q. \qquad (25)$$

In other words, the probability of an edge according to $P$ depends only on whether it is between two individuals of the same community or between individuals of different communities.
This setting is also well studied for stochastic block models ($\alpha_0 = 0$), allowing us to compare our results with existing ones. The results for general mixed membership models are deferred to Section 4.2.

[A1] Sparse regime of Dirichlet parameters: The community membership vectors are drawn from the Dirichlet distribution, Dir($\alpha$), under the mixed membership model. We assume that $\alpha_i < 1$ for $i\in[k]$ (see Section 2.1 for an extended discussion of the sparse regime of the Dirichlet distribution) and that $\alpha_0$ is known.

[A2] Condition on the network size: Given the concentration parameter of the Dirichlet distribution, $\alpha_0 := \sum_i \alpha_i$, we require that

$$n = \tilde\Omega(k^2(\alpha_0+1)^2), \qquad (26)$$

and that the disjoint sets $A, B, C, X, Y$ are chosen randomly and are of size $\Theta(n)$. (The notation $\tilde\Omega(\cdot)$, $\tilde O(\cdot)$ denotes $\Omega(\cdot)$, $O(\cdot)$ up to poly-log factors.) Note that from assumption A1, $\alpha_i < 1$, which implies $\alpha_0 < k$. Thus, in the worst case, when $\alpha_0 = \Theta(k)$, we require $n = \tilde\Omega(k^4)$, and in the best case, when $\alpha_0 = \Theta(1)$, we require $n = \tilde\Omega(k^2)$. The latter case includes the stochastic block model ($\alpha_0 = 0$), and thus our results match the state-of-the-art bounds for learning stochastic block models.

[A3] Condition on edge connectivity: Recall that $p$ is the probability of intra-community connectivity and $q$ is the probability of inter-community connectivity. We require that

$$\frac{p-q}{\sqrt p} = \Omega\left(\frac{(\alpha_0+1)\,k}{n^{1/2}}\right). \qquad (27)$$

This is a condition on the standardized separation between intra-community and inter-community connectivity (note that $\sqrt p$ is on the order of the standard deviation of a Bernoulli random variable). The condition is required to control the perturbation of the whitened tensor (computed using the observed network samples), thereby providing guarantees on the estimated eigen-pairs through the tensor power method.

[A4] Condition on the number of iterations of the power method: We assume that the number of iterations $N$ of the tensor power method in Procedure 2 satisfies

$$N \ge C_2\cdot\left(\log(k) + \log\log\left(\frac{p}{p-q}\right)\right), \qquad (28)$$

for some constant $C_2$.

[A5] Choice of $\tau$ for thresholding community vector estimates: The threshold $\tau$ for obtaining the estimates $\hat\Pi$ of the community membership vectors in Algorithm 1 is chosen as

$$\tau = \Theta\left(\frac{k\sqrt{\alpha_0}}{\sqrt n}\cdot\frac{\sqrt p}{p-q}\right) \ \text{ for } \alpha_0 \ne 0, \qquad (29) \qquad\qquad \tau = 0.5 \ \text{ for } \alpha_0 = 0. \qquad (30)$$

For the stochastic block model ($\alpha_0 = 0$), since $\pi_i$ is a basis vector, we can use a large threshold. For general models ($\alpha_0 \ne 0$), $\tau$ can be viewed as a regularization parameter and decays as $n^{-1/2}$ when the other parameters are held fixed.
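Since the hidden constants in $\Theta(\cdot)$ and $\tilde\Omega(\cdot)$ are unspecified, the assumptions above only admit order-of-magnitude checks. The following helper (ours, with all constants set to one) illustrates how A2, A3 and the threshold in (29)–(30) translate into numbers:

import numpy as np

def check_assumptions(n, k, p, q, alpha0):
    # Heuristic checks of A2 (network size) and A3 (edge separation),
    # ignoring the hidden constants and poly-log factors.
    a2 = n >= (k * (alpha0 + 1)) ** 2                            # (26)
    a3 = (p - q) / np.sqrt(p) >= (alpha0 + 1) * k / np.sqrt(n)   # (27)
    return a2, a3

def threshold_tau(n, k, p, q, alpha0):
    # Choice of tau per (29)-(30), again up to the hidden constant.
    if alpha0 == 0:
        return 0.5
    return (k * np.sqrt(alpha0) / np.sqrt(n)) * np.sqrt(p) / (p - q)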
We are now ready to state the error bounds on the estimates of the community membership vectors $\Pi$ and the block connectivity matrix $P$; $\hat\Pi$ and $\hat P$ are the estimates computed in Algorithm 1. Recall that for a matrix $M$, $(M)^i$ and $(M)_i$ denote the $i$-th row and column respectively. We say that an event holds with high probability if it occurs with probability $1 - n^{-c}$ for some constant $c > 0$.

Theorem 4.1 (Guarantees on Estimating $P$, $\Pi$). Under assumptions A1–A5, we have with high probability

$$\varepsilon_{\pi,\ell_1} := \max_{i\in[n]} \|\hat\Pi_i - \Pi_i\|_1 = \tilde O\left(\frac{(\alpha_0+1)^{3/2}\,\sqrt p}{\sqrt n\,(p-q)}\right), \qquad (31)$$

$$\varepsilon_P := \max_{i,j\in[k]} |\hat P_{i,j} - P_{i,j}| = \tilde O\left(\frac{(\alpha_0+1)^{3/2}\,k\,\sqrt p}{\sqrt n}\right). \qquad (32)$$

The proofs are given in the Appendix, and a proof outline is provided in Section 4.3. The main ingredients in establishing the above result are the tensor concentration bound and, additionally, the recovery guarantees under the tensor power method in Procedure 2. We now provide these results below.

Recall that $F_A := \Pi_A^\top P^\top$ and that $\Phi = W_A^\top F_A \operatorname{Diag}(\hat\alpha^{1/2})$ denotes the set of tensor eigenvectors under exact moments in (23), while $\hat\Phi$ is the set of estimated eigenvectors under empirical moments, obtained using Procedure 1. We establish the following guarantees.

Lemma 4.1 (Perturbation bound for estimated eigen-pairs). Under assumptions A1–A4, the recovered eigenvector-eigenvalue pairs $(\hat\Phi_i, \hat\lambda_i)$ from the tensor power method in Procedure 2 satisfy, with high probability, for some permutation $\theta$,

$$\max_{i\in[k]} \|\hat\Phi_i - \Phi_{\theta(i)}\| \le 8\, k^{-1/2}\,\varepsilon_T, \qquad \max_{i\in[k]} |\hat\lambda_i - \hat\alpha_{\theta(i)}^{-1/2}| \le 5\,\varepsilon_T. \qquad (33)$$

The tensor perturbation bound $\varepsilon_T$ is given by

$$\varepsilon_T := \left\| T^{\alpha_0}_{Y\to\{A,B,C\}}(\hat W_A, \hat W_B \hat R_{AB}, \hat W_C \hat R_{AC}) - \mathbb{E}[T^{\alpha_0}_{Y\to\{A,B,C\}}(W_A, W_B R_{AB}, W_C R_{AC}) \mid \Pi_{A\cup B\cup C}] \right\| = \tilde O\left(\frac{(\alpha_0+1)\,k^{3/2}\,\sqrt p}{(p-q)\,\sqrt n}\right), \qquad (34)$$

where $\|T\|$ for a tensor $T$ refers to its spectral norm.

Stochastic block models ($\alpha_0 = 0$): For stochastic block models, assumptions A2 and A3 reduce to

$$n = \tilde\Omega(k^2), \qquad \zeta = \Theta\left(\frac{\sqrt p}{p-q}\right) = O\left(\frac{n^{1/2}}{k}\right). \qquad (35)$$

This matches the best known scaling (up to poly-log factors), previously achieved via convex optimization by Chen et al. [2012] for stochastic block models. However, our results in Theorem 4.1 do not provide zero-error guarantees as in Chen et al. [2012]. We strengthen our results to provide zero-error guarantees in Section 4.1.1 below, and thus match the scaling of Chen et al. [2012] for stochastic block models. Moreover, we also provide zero-error support recovery guarantees for recovering the significant memberships of nodes in mixed membership models in Section 4.1.1.

Dependence on $\alpha_0$: The guarantees degrade as $\alpha_0$ increases, which is intuitive since the extent of community overlap increases. The required scaling of $n$ also grows with $\alpha_0$. Note that the guarantees on $\varepsilon_\pi$ and $\varepsilon_P$ can be improved by assuming a more stringent scaling of $n$ with respect to $\alpha_0$ than the one specified by A2.

4.1.1 Zero-error guarantees for support recovery

Recall that we proposed Procedure 3 as a post-processing step to provide improved support recovery estimates. We now provide guarantees for this method, and first specify the threshold $\xi$ for support recovery in Procedure 3.

[A6] Choice of $\xi$ for support recovery: We assume that the threshold $\xi$ in Procedure 3 satisfies $\xi = \Omega(\varepsilon_P)$, where $\varepsilon_P$ is specified in Theorem 4.1.

Theorem 4.2 (Support recovery guarantees). Assuming that A1–A6 and (25) hold, the support recovery method in Procedure 3 has the following guarantees on the estimated support set $\hat S$: with high probability,

$$\Pi(i,j) \ge \xi \;\Rightarrow\; \hat S(i,j) = 1 \qquad \text{and} \qquad \Pi(i,j) \le \frac{\xi}{2} \;\Rightarrow\; \hat S(i,j) = 0, \qquad \forall\, i\in[k],\ j\in[n], \qquad (36)$$

where $\Pi$ is the true community membership matrix.

Thus, the above result guarantees that Procedure 3 correctly recovers all the "large" entries of $\Pi$ and also correctly rules out all the "small" entries of $\Pi$.
In other words, we can correctly infer all the significant memberships of each node and also rule out the set of communities where a node does not have a strong presence. The only shortcoming of the above result is that there is a gap between the "large" and "small" values: for an intermediate range of values (in $[\xi/2, \xi]$), we cannot guarantee correct inferences about the community memberships. Note that this gap depends on $\varepsilon_P$, the error in estimating the $P$ matrix. This is intuitive: as the error $\varepsilon_P$ decreases, we can infer the community memberships over a larger range of values. For the special case of stochastic block models (i.e., $\alpha_0 \to 0$), we can improve the above result and give a zero-error guarantee at all nodes (w.h.p.). Note that we no longer require a threshold $\xi$ in this case, and we infer exactly one community for each node.

Corollary 4.1 (Zero-error guarantee for block models). Assuming that A1–A5 and (25) hold, the support recovery method in Procedure 3 correctly identifies the community memberships of all nodes with high probability in the case of stochastic block models ($\alpha_0 \to 0$).

Thus, with the above result, we match the state-of-the-art results of Chen et al. [2012] for stochastic block models in terms of scaling requirements and recovery guarantees.

4.2 General (Non-Homogeneous) Mixed Membership Models

In the previous sections, we provided guarantees for learning homogeneous mixed membership models. Here, we extend the results to learning general, non-homogeneous mixed membership models under a sufficient set of conditions involving the scaling of various parameters, such as the network size $n$, the number of communities $k$, and the concentration parameter $\alpha_0$ of the Dirichlet distribution (which is a measure of the overlap of the communities).

[B1] Sparse regime of Dirichlet parameters: The community membership vectors are drawn from the Dirichlet distribution, Dir($\alpha$), under the mixed membership model. We assume that $\alpha_i < 1$ for $i\in[k]$ (see Section 2.1 for an extended discussion of the sparse regime of the Dirichlet distribution). This assumption is not strictly needed: our results can be extended to general Dirichlet distributions, but with worse scaling requirements on $n$. The dependence of $n$ is still polynomial in $\alpha_0$, i.e., we require $n = \tilde\Omega((\alpha_0+1)^c\,\hat\alpha_{\min}^{-2})$, where $c \ge 2$ is some constant.

[B2] Condition on the network size: Given the concentration parameter of the Dirichlet distribution, $\alpha_0 := \sum_i \alpha_i$, and $\hat\alpha_{\min} := \alpha_{\min}/\alpha_0$, the expected (relative) size of the smallest community, define

$$\rho := \frac{\alpha_0+1}{\hat\alpha_{\min}}. \qquad (37)$$

We require that the network size scale as

$$n = \Omega\left(\rho^2 \log^2 k\right), \qquad (38)$$

and that the sets $A, B, C, X, Y$ are of size $\Theta(n)$. Note that from assumption B1, $\alpha_i < 1$, which implies $\alpha_0 < k$. Thus, in the worst case, when $\alpha_0 = \Theta(k)$, we require $n = \tilde\Omega(k^4)$ (assuming equal sizes, $\hat\alpha_i = 1/k$), and in the best case, when $\alpha_0 = \Theta(1)$, we require $n = \tilde\Omega(k^2)$. The latter case includes the stochastic block model ($\alpha_0 = 0$), and thus our results match the state-of-the-art bounds for learning stochastic block models. See Section 4.1 for an extended discussion.
[B3] Condition on relative community sizes and the block connectivity matrix: Recall that $P\in[0,1]^{k\times k}$ denotes the block connectivity matrix. Define

$$\zeta := \left(\frac{\hat\alpha_{\max}}{\hat\alpha_{\min}}\right)^{1/2}\,\frac{\sqrt{\max_i (P\hat\alpha)_i}}{\sigma_{\min}(P)}, \qquad (39)$$

where $\sigma_{\min}(P)$ is the minimum singular value of $P$. We require that

$$\zeta = O\left(\frac{n^{1/2}}{\rho}\right) \ \text{ for } \alpha_0 < 1, \qquad (40) \qquad\qquad \zeta = O\left(\frac{n^{1/2}}{\rho\, k\, \hat\alpha_{\max}}\right) \ \text{ for } \alpha_0 \ge 1. \qquad (41)$$

Intuitively, the above condition requires that the ratio of the maximum and minimum expected community sizes not be too large, and that the matrix $P$ be well conditioned. The condition is required to control the perturbation of the whitened tensor (computed using the observed network samples), thereby providing guarantees on the estimated eigen-pairs through the tensor power method. It can be interpreted as a separation requirement between intra-community and inter-community connectivity in the special case considered in Section 4.1. Specifically, for the homogeneous mixed membership model, we have

$$\sigma_{\min}(P) = \Theta(p-q), \qquad \max_i (P\hat\alpha)_i = \frac{p}{k} + \frac{(k-1)\,q}{k} \le p.$$

Thus, the assumptions A2 and A3 of Section 4.1, given by

$$n = \tilde\Omega(k^2(\alpha_0+1)^2), \qquad \zeta = \Theta\left(\frac{\sqrt p}{p-q}\right) = O\left(\frac{n^{1/2}}{(\alpha_0+1)\,k}\right),$$

are special cases of the assumptions B2 and B3 above.
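The quantities $\rho$ and $\zeta$ are directly computable from the model parameters. A small helper (ours) makes the definitions in (37) and (39) concrete:

import numpy as np

def rho_zeta(P, alpha):
    # rho per (37) and zeta per (39); alpha is the Dirichlet parameter vector.
    alpha0 = alpha.sum()
    a_hat = alpha / alpha0                          # normalized community sizes
    rho = (alpha0 + 1) / a_hat.min()
    zeta = (np.sqrt(a_hat.max() / a_hat.min())
            * np.sqrt((P @ a_hat).max())
            / np.linalg.svd(P, compute_uv=False).min())
    return rho, zeta

For the homogeneous case (25), this reduces to the quantities discussed above, consistent with the specialization in (35).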
[B4] Condition on the number of iterations of the power method: We assume that the number of iterations $N$ of the tensor power method in Procedure 2 satisfies

$$N \ge C_2\cdot\left(\log(k) + \log\log\left(\frac{\max_i (P\hat\alpha)_i}{\sigma_{\min}(P)}\right)\right), \qquad (42)$$

for some constant $C_2$.

[B5] Choice of $\tau$ for thresholding community vector estimates: The threshold $\tau$ for obtaining the estimates $\hat\Pi$ of the community membership vectors in Algorithm 1 is chosen as

$$\tau = \Theta\left(\frac{\rho^{1/2}\cdot\zeta\cdot\hat\alpha_{\max}^{1/2}}{n^{1/2}\cdot\hat\alpha_{\min}}\right) \ \text{ for } \alpha_0 \ne 0, \qquad (43) \qquad\qquad \tau = 0.5 \ \text{ for } \alpha_0 = 0. \qquad (44)$$

For the stochastic block model ($\alpha_0 = 0$), since $\pi_i$ is a basis vector, we can use a large threshold. For general models ($\alpha_0 \ne 0$), $\tau$ can be viewed as a regularization parameter and decays as $n^{-1/2}$ when the other parameters are held fixed. Moreover, when $n = \tilde\Theta(\rho^2)$, we have $\tau \sim \rho^{-1/2}$ when the other terms are held fixed. Recall that $\rho \propto (\alpha_0+1)$ when the expected community sizes $\hat\alpha_i$ are held fixed; in this case, $\tau \sim \rho^{-1/2}$ allows smaller values to be picked up after thresholding as $\alpha_0$ increases. This is intuitive: as $\alpha_0$ increases, the community vectors $\pi$ are more "spread out" across different communities and have smaller values.

We are now ready to state the error bounds on the estimates of the community membership vectors $\Pi$ and the block connectivity matrix $P$; $\hat\Pi$ and $\hat P$ are the estimates computed in Algorithm 1. As before, an event holds with high probability if it occurs with probability $1 - n^{-c}$ for some constant $c > 0$.

Theorem 4.3 (Guarantees on estimating $P$, $\Pi$). Under assumptions B1–B5, the estimates $\hat P$ and $\hat\Pi$ obtained from Algorithm 1 satisfy, with high probability,

$$\varepsilon_{\pi,\ell_1} := \max_{i\in[n]} \|\hat\Pi_i - \Pi_i\|_1 = \tilde O\left(n^{-1/2}\cdot\rho^{3/2}\cdot\zeta\cdot\hat\alpha_{\max}\right), \qquad (45)$$

$$\varepsilon_P := \max_{i,j\in[k]} |\hat P_{i,j} - P_{i,j}| = \tilde O\left(n^{-1/2}\cdot\rho^{5/2}\cdot\zeta\cdot\hat\alpha_{\max}^{3/2}\cdot(P_{\max} - P_{\min})\right). \qquad (46)$$

The proofs are in Appendix B, and a proof outline is provided in Section 4.3. The main ingredients in establishing the above result are the tensor concentration bound and, additionally, the recovery guarantees under the tensor power method in Procedure 2. We now provide these results below. Recall that $F_A := \Pi_A^\top P^\top$ and that $\Phi = W_A^\top F_A \operatorname{Diag}(\hat\alpha^{1/2})$ denotes the set of tensor eigenvectors under exact moments in (23), while $\hat\Phi$ is the set of estimated eigenvectors under empirical moments, obtained using Procedure 1. We establish the following guarantees.

Lemma 4.2 (Perturbation bound for estimated eigen-pairs). Under assumptions B1–B4, the recovered eigenvector-eigenvalue pairs $(\hat\Phi_i, \hat\lambda_i)$ from the tensor power method in Procedure 2 satisfy, with high probability, for some permutation $\theta$,

$$\max_{i\in[k]} \|\hat\Phi_i - \Phi_{\theta(i)}\| \le 8\,\hat\alpha_{\max}^{1/2}\,\varepsilon_T, \qquad \max_i |\hat\lambda_i - \hat\alpha_{\theta(i)}^{-1/2}| \le 5\,\varepsilon_T. \qquad (47)$$

The tensor perturbation bound $\varepsilon_T$ is given by

$$\varepsilon_T := \left\| T^{\alpha_0}_{Y\to\{A,B,C\}}(\hat W_A, \hat W_B \hat R_{AB}, \hat W_C \hat R_{AC}) - \mathbb{E}[T^{\alpha_0}_{Y\to\{A,B,C\}}(W_A, W_B R_{AB}, W_C R_{AC}) \mid \Pi_{A\cup B\cup C}] \right\| = \tilde O\left(\frac{\rho}{\sqrt n}\cdot\zeta\,\hat\alpha_{\max}^{1/2}\right), \qquad (48)$$

where $\|T\|$ for a tensor $T$ refers to its spectral norm, $\rho$ is defined in (37), and $\zeta$ in (39).

4.2.1 Application to the Planted Clique Problem

The planted clique problem is a special case of the stochastic block model [Condon and Karp, 1999] and is arguably the simplest setting for the community problem. Here, a clique of size $s$ is uniformly planted (or placed) in an Erdős–Rényi graph with edge probability $0.5$. This can be viewed as a stochastic block model with $k = 2$ communities, where $\hat\alpha_{\min} = s/n$ is the probability of a node being in the clique and $\hat\alpha_{\max} = 1 - s/n$. The connectivity matrix is $P = [1, q;\, q, q]$ with $q = 0.5$, since the probability of connectivity within the clique is 1 and the probability of connectivity for any other node pair is $0.5$. Since the planted clique setting has unequal-sized communities, the general result of Section 4.2 is applicable, and we demonstrate how the assumptions B1–B5 simplify in this setting. We have $\alpha_0 = 0$, since the communities are non-overlapping. For assumption B2, we have

$$\rho = \frac{\alpha_0+1}{\hat\alpha_{\min}} = \frac{n}{s}, \qquad n = \tilde\Omega(\rho^2) \ \Rightarrow\ s = \tilde\Omega(\sqrt n). \qquad (49)$$

For assumption B3, we have $\sigma_{\min}(P) = \Theta(1)$ and $\max_i (P\hat\alpha)_i \le s/n + q \le 2$, and thus assumption B3 simplifies to

$$\zeta := \left(\frac{\hat\alpha_{\max}}{\hat\alpha_{\min}}\right)^{1/2}\,\frac{\sqrt{\max_i (P\hat\alpha)_i}}{\sigma_{\min}(P)} = \tilde O\left(\frac{\sqrt n}{\rho}\right) \ \Rightarrow\ s = \tilde\Omega\left(n^{2/3}\right). \qquad (50)$$

The condition in (49) that $s = \tilde\Omega(n^{1/2})$ matches the computational lower bounds for recovering the clique [Feldman et al., 2012]. Unfortunately, the condition in (50) that $s = \tilde\Omega(n^{2/3})$ is worse. It is required for assumption B3 to hold, which in turn is needed to ensure the success of the tensor power method.
The whitening step is particularly sensitive to the condition number of the matrices being whitened (i.e., the matrices $F_A$, $F_B$, $F_C$ in our case, whose condition numbers depend on the ratio of the community sizes), which results in a weaker guarantee. Thus, our method does not perform very well when the community sizes are drastically different. It remains an open question whether our method can be improved in this setting. We conjecture that "peeling" ideas similar to Ailon et al. [2013], where the communities are recovered one by one, can improve our dependence on the ratio of community sizes.

4.3 Proof Outline

We now summarize the main techniques involved in proving Theorem 4.3; the details are in the Appendix. The main ingredient is the concentration of the adjacency matrix: since the edges are drawn independently conditioned on the community memberships, we establish that the adjacency matrix concentrates around its mean under the stated assumptions (see Appendix C.4). With this in hand, we can then establish the concentration of the various quantities used by our learning algorithm.

Step 1: Whitening matrices. We first establish concentration bounds on the whitening matrices $\hat W_A$, $\hat W_B$, $\hat W_C$ computed using empirical moments, described in Section 3.3.1. With this in hand, we can approximately recover the span of the matrix $F_A$, since $\hat W_A^\top F_A \operatorname{Diag}(\hat\alpha)^{1/2}$ is (approximately) a rotation matrix. The main technique employed is the matrix Bernstein inequality [Tropp, 2012, Thm. 1.4]. See Appendix C.2 for details.

Step 2: Tensor concentration bounds. Recall that we use the whitening matrices to obtain a symmetric orthogonal tensor. We establish that the whitened and symmetrized tensor concentrates around its mean. (Note that the empirical third-order tensor $T_{X\to\{A,B,C\}}$ tends to its expectation conditioned on $\Pi_A, \Pi_B, \Pi_C$ as $|X|\to\infty$.) This is done in several stages, and we carefully control the tensor perturbation bounds. See Appendix C.1 for details.

Step 3: Tensor power method analysis. We analyze the performance of Procedure 2 under empirical moments, employing the various improvements detailed in Section 3.3.2 to establish guarantees on the recovered eigen-pairs. This includes deriving a condition on the tensor perturbation bound under which the tensor power method succeeds, and establishing that there exist good initializers for the power method among the (whitened) neighborhood vectors. This allows us to obtain stronger guarantees for the tensor power method compared to the earlier analysis of Anandkumar et al. [2012b], and this analysis is crucial for obtaining state-of-the-art scaling bounds for guaranteed recovery (for the special case of the stochastic block model). See Appendix A for details.

Step 4: Thresholding of estimated community vectors. Step 3 provides guarantees for the recovery of each eigenvector in $\ell_2$ norm. A direct application of this result only allows us to obtain $\ell_2$-norm bounds for row-wise recovery of the community matrix $\Pi$. In order to strengthen the result to an $\ell_1$-norm bound, we threshold the estimated $\Pi$ vectors. Here, we exploit the sparsity of the Dirichlet draws and carefully control the contribution of the weak entries in the vector.
Finally, we establish perturbation bounds on $P$ through rather straightforward concentration-bound arguments. See Appendix B.2 for details.

Step 5: Support recovery guarantees. To simplify the argument, consider the stochastic block model. Recall that Procedure 3 readjusts the community membership estimates based on degree averaging. For each vertex, if we compute the average degree towards these "approximate communities", the result concentrates around the value $p$ for the correct community and around the value $q$ for a wrong community. Therefore, we can correctly identify the community memberships of all the nodes when $p - q$ is sufficiently large, as specified by A3. The argument extends easily to general mixed membership models. See Appendix B.4 for details.

4.4 Comparison with Previous Results

We now compare the results of this paper to our previous work [Anandkumar et al., 2012b] on the use of tensor-based approaches for learning various latent variable models, such as topic models, hidden Markov models (HMMs), and Gaussian mixtures. At a high level, the tensor approach is exploited in a similar manner in all these models (including the community model in this paper): the conditional-independence relationships of the model result in a low-rank tensor constructed from low-order moments under the given model. However, there are several important differences between the community model and the other latent variable models considered by Anandkumar et al. [2012b], which we list below. We also state precisely the algorithmic improvements proposed in this paper with respect to the tensor power method, and how they may be applicable to other latent variable models.

[Figure 2: Casting the community model as a topic model, we obtain conditional independence of the three views. (a) The community model as a topic model; (b) graphical model representation.]

4.4.1 Topic model vs. community model

Among the latent variable models studied by Anandkumar et al. [2012b], the topic model, viz., latent Dirichlet allocation (LDA), bears the closest resemblance to the MMSB model; in fact, the MMSB model was originally inspired by the LDA model. The analogy between the MMSB model and LDA is direct under our framework, and we describe it below. Recall that for learning MMSB models, we consider a partition of the nodes $\{X, A, B, C\}$ and the set of 3-stars from set $X$ to $A, B, C$. We can construct an equivalent topic model as follows: the nodes in $X$ form the "documents", and for each document $x\in X$, the neighborhood vectors $G^\top_{x,A}$, $G^\top_{x,B}$, $G^\top_{x,C}$ form the three "words" or "views" for that document. In each document $x\in X$, the community vector $\pi_x$ corresponds to the "topic vector", and the matrices $F_A$, $F_B$, $F_C$ correspond to the topic-word matrices. Note that the three views $G^\top_{x,A}$, $G^\top_{x,B}$, $G^\top_{x,C}$ are conditionally independent given the topic vector $\pi_x$. Thus, the community model can be cast as a topic model or a multi-view model; see Figure 2.

Although the community model can be viewed as a topic model, it has some important special properties which allow us to provide better guarantees.
The topic-word matrices $F_A$, $F_B$, $F_C$ are not arbitrary matrices: recall that $F_A := \Pi_A^\top P^\top$ (and similarly for $F_B$, $F_C$), so they are random matrices, and we can provide strong concentration bounds for these matrices by appealing to random matrix theory. Moreover, each of the views in the community model has additional structure, viz., the vector $G^\top_{x,A}$ has independent Bernoulli entries conditioned on the community vector $\pi_x$, while in a general multi-view model, only the conditional distribution of each view given the hidden topic vector is specified. This further allows us to provide specialized concentration bounds for the community model. Importantly, we can recover the community memberships (i.e., the topic vectors) accurately, while for a general multi-view model this cannot be guaranteed and we can only hope to recover the model parameters.

4.4.2 Improvements to tensor recovery guarantees in this paper

In this paper, we modify the tensor power method of Anandkumar et al. [2012b] and obtain better guarantees for the community setting. Recall that the two modifications are adaptive deflation and initialization using whitened neighborhood vectors. Adaptive deflation leads to a weaker gap condition for an initialization vector to succeed in efficiently estimating a tensor eigenvector. Initialization using whitened neighborhood vectors allows us to tolerate more noise in the estimated 3-star tensor, thereby improving our sample complexity result. We make this improvement precise below. If we directly applied the tensor power method of Anandkumar et al. [2012b] without these modifications, we would require stronger conditions on the sample complexity and edge connectivity. For simplicity, consider the homogeneous setting of Section 4.1. The conditions A2 and A3 would then need to be replaced with the following stronger conditions:

[A2'] Sample complexity: The number of samples satisfies $n = \tilde\Omega(k^4(\alpha_0+1)^2)$.

[A3'] Edge connectivity: The edge connectivity parameters $p, q$ satisfy

$$\frac{p-q}{\sqrt p} = \Omega\left(\frac{(\alpha_0+1)\,k^2}{\sqrt n}\right).$$

Thus, we obtain significant improvements in the recovery guarantees via algorithmic modifications and a careful analysis of concentration bounds. The guarantees derived in this paper are specific to the community setting, and we outlined above the special properties of the community model compared to a general multi-view model. However, when the documents of a topic model are sufficiently long, the word frequency vector within a document concentrates well, and our modified tensor method has better recovery guarantees in that setting as well. Thus, the improved tensor recovery guarantees derived in this paper are applicable in scenarios where we have access to better initialization vectors than simple random initialization.

5 Conclusion

In this paper, we presented a novel approach for learning overlapping communities based on tensor decomposition. We established that our method is guaranteed to recover the underlying community memberships correctly when the communities are drawn from a mixed membership stochastic block model (MMSB). Our method is also computationally efficient and requires only simple linear algebraic operations and tensor power iterations.
Moreover, our method is tight for the special case of the stochastic block model (up to poly-log factors), both in terms of sample complexity and the separation between edge connectivity within a community and across different communities.

We now note a number of interesting open problems and extensions. While we obtained tight guarantees for MMSB models with uniformly sized communities, our guarantees are weak when the community sizes are drastically different, such as in the planted clique setting, where we do not match the computational lower bound [Feldman et al., 2012]. The whitening step in the tensor decomposition method is particularly sensitive to the ratio of community sizes, and it is interesting to ask whether modifications can be made to our algorithm to provide tight guarantees under unequal community sizes. While this paper mostly dealt with the theoretical analysis of the tensor method for community detection, we note recent experimental results where the tensor method is deployed on graphs with millions of nodes with very good accuracy and running times [Huang et al., 2013]; indeed, the running times are more than an order of magnitude better than the state-of-the-art variational approach for learning MMSB models. The work of Huang et al. [2013] makes an important modification to make the method scalable, viz., the tensor decomposition is carried out through stochastic updates in parallel, unlike the serial batch updates considered here. Establishing theoretical guarantees for stochastic tensor decomposition is an important problem. Moreover, we have limited ourselves to the MMSB models, which assume a linear model of edge formation; this is not universally applicable. For instance, exclusionary relationships, where two nodes cannot be connected because of their memberships in certain communities, cannot be imposed in the MMSB model. Are there other classes of mixed membership models which do not suffer from this restriction, and yet are identifiable and amenable to learning? Moreover, the Dirichlet distribution in the MMSB model imposes constraints on the memberships across different communities. Can we incorporate mixed memberships with arbitrary correlations? The answers to these questions will further push the boundaries of tractable learning of mixed membership community models.

Acknowledgements

We thank the JMLR Action Editor Nathan Srebro and the anonymous reviewers for comments which significantly improved this manuscript. We thank Jure Leskovec for helpful discussions regarding various community models. Part of this work was done when AA and RG were visiting MSR New England. AA is supported in part by the Microsoft faculty fellowship, NSF Career award CCF-1254106, NSF Award CCF-1219234 and the ARO YIP Award W911NF-13-1-0084.

References

Ittai Abraham, Shiri Chechik, David Kempe, and Aleksandrs Slivkins. Low-distortion inference of latent similarities from a multiplex social network. CoRR, abs/1202.0922, 2012.

Nir Ailon, Yudong Chen, and Xu Huan. Breaking the small cluster barrier of graph clustering. arXiv preprint arXiv:1302.4549, 2013.

Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, June 2008.
Anandkumar, D. P . F oster, D . Hsu, S. M. Kak ade, and Y. Liu. Tw o svds suffice: Sp ectral d ecom- p ositions for probabilistic t opic mo deling and laten t dirichlet allo cation, 2012a . arXiv:1204.67 03. A. Anand kumar, R. Ge, D. Hsu, S . M. Kak ad e, and M. T elgarsky . T ensor deco mp ositio ns for laten t v ariable mo dels, 2012b. A. Anandkum ar, D. Hsu , and S.M. Kak ade. A Metho d of Momen ts for Mixtur e Models a nd Hidd en Mark o v Mo dels. In P r o c. of Conf. on L e arning The ory , J une 2012c. 36 Sanjeev Arora, Ron g Ge, Sushant Sac hdev a, and Gran t Sc ho eneb ec k. Finding o ve rlappin g comm u- nities in so cial netw orks: to wa rd a rigorous app r oac h. In Pr o c e e dings of the 13 th A CM Confer enc e on E le ctr onic Commer c e , 2012. Maria-Flo rina Balcan, Christian Borgs, Mark Brav erman, Jennifer T. Chay es, and Shang-Hua T eng. I like h er more than y ou: Self-determined comm unities. CoRR , abs/1201.489 9, 2012. P .J. Bic k el and A. Chen. A nonp arametric view of n et work mo d els and n ewman–girv an an d other mo dularities. Pr o c e e dings of the National A c ademy of Scienc es , 106(50):2 1068–21073, 2009. P .J. Bic kel, A. Chen, and E. Levina. T h e m etho d of momen ts and degree distributions for netw ork mo dels. The Annals of Statistics , 39(5):38– 59, 2011. Da vid M Blei. Probabilistic topic mo dels. Communic ations of the ACM , 55(4):77– 84, 2012. Da vid M. Blei, Andr ew Y. Ng, and Mic hael I . Jord an. Laten t dirichlet allo cation. Journal of Machine L e arning R e se ar ch , 3:993– 1022, Marc h 2003. B. Bollob´ as, S . J anson, and O . Riord an. T he phase transition in inhomogeneous random graphs. R andom Structur es & A lgorithms , 31(1):3–122 , 2007. S. Charles Brubak er and San tosh S. V empala. Random tensors and plan ted cliques. In RANDOM , 2009. Alain Celisse, Jean-Jacques Daudin, and Laurent Pierre. Consistency of maximum-lik eliho o d and v ariational estimators in th e sto chasti c blo c k mo del. Ele ctr onic Journal of Statistics , 6:18 47–1899 , 2012. S. Chatterjee and P . Diaconis. Es timating and unders tand ing exp onential rand om graph mo d els. Arxiv pr e print arxiv:1102.2650 , 2011. Kamalik a Ch audhuri, F an Chung, and Alexander Tsiatas. S p ectral clustering of graphs with general degrees in the extended planted partition mo del. Journal of Machine L e arning R ese ar ch , pages 1–23, 2012. Y udong Chen, S u ja y Sangha vi, and Huan Xu. Clustering spars e graphs. In A dvanc es i n N eur al Information Pr o c essing , 2012. Anne Condon and Ric hard M Karp. Algorithms for graph partitioning on the p lan ted partition mo del. In R andomization, Appr oximation, and Co mbinatorial Optimiza tion. A lgorithms and T e c hniqu e s , p ages 221–232 . Sp ringer, 1999. S. C urrarini, M.O. Jackson, and P . Pin. An economic mo del of friendsh ip: Homophily , m inorities, and segregatio n. E c onometric a , 77(4):1003 –1045, 2009. Vitaly F eldman, Elena Gr igorescu, Lev Reyzin, S an tosh V empala, and Ying Xiao. Statistica l algo- rithms and a lo wer b oun d for plan ted clique. Ele ctr onic Col lo quium on Comp utational Comp lexity (ECCC) , 19:64, 2012. K F erentio s. On tceb ycheff ’s t yp e inequalities. T r ab ajos de estad ´ ıstic a y de investigaci´ on op er ativa , 33(1): 125–132, 1982. 37 S.E. Fienber g, M.M. Mey er, and S.S. W asserman. S tatistic al analysis of m ultiple sociometric relations. Journal of the americ an Statistic al asso ciation , 80(389):51 –67, 1985. O. F rank and D. Strauss. Marko v graphs. 
Journal of the americ an Statistic al asso ciation , 81(395): 832–8 42, 1986. Alan M. F rieze and Ra vi Kannan. Q uic k appr o ximation to m atrices and applications. Combina- toric a , 19(2):175 –220, 1999. Alan M. F rieze and Ra vi Kannan. A new approac h to the plan ted clique problem. In FSTTCS , 2008. M. Girv an and M.E.J. Newman. Comm unit y structure in so cial and b iologi cal net w orks. Pr o c e e dings of the N ational A c ademy of Scienc es , 99(12):782 1–7826, 2002 . Alex Gittens and Mic hael W Mahoney . Revisiting the nystrom metho d for imp ro v ed large-scale mac hine learning. arXiv pr eprint arXiv:1303.1849 , 2013. P . Gopalan, D. Mimno, S. Gerrish, M. F reedman, and D. Blei. Scalable inference of ov erlappin g comm unities. In A dvanc es in N e ur al Information P r o c essing Systems 25 , pages 225 8–2266, 2012. C. Hillar and L.-H. Lim. Most tensor problems are NP h ard , 2012. Matt Hoffman, Da vid M Blei, Chong W an g, and John Paisley . Sto c hastic v ariational in ference. JMLR , 14:1303–13 47, 2012. P .W. Holland and S. Leinhardt. An exp onen tial family of probabilit y distr ibutions for directed graphs. Journal of the americ an Statistic al asso c i ation , 76(373) :33–50, 1981. P .W. Holla nd, K.B. Lask ey , and S. Leinhardt. Sto c hastic blo c kmo dels: firs t steps. So cial networks , 5(2):1 09–137, 1983. F. Hu ang, U.N. Niranjan, M. Hakee m, and A. Anand k u mar. F ast Detection of O v erlapping Com- m unities via Onlin e T ensor Metho ds. ArXiv 1309.07 87 , Sept. 2013. Joseph J´ aJ´ a. An intr o duction to p ar al lel algorithms . Ad dison W esley Longman Pu blishing Co., Inc., 1992. A. Jalali, Y. Chen, S. Sangha vi, and H. Xu. Clustering partially observed graphs via conv ex optimization. arXiv pr eprint arXiv:1104.480 3 , 2011. Mic h ael J. Kearns and Umesh V. V azirani. An Intr o duction to Computationa l L e arning The ory . MIT Pr ess., Cambridge, MA, 1994. T. G. Kolda and B. W. Bader. T ensor decomp ositions and app licatio ns. SIAM r eview , 51(3):4 55, 2009. T. G. Kolda and J. R. Ma yo. S hifted p o w er metho d for computing tensor eigenpairs. SIAM Journal on M atrix Analysis and Applic ations , 32(4):1095– 1124, Octob er 2011. P .F. Lazarsfeld, R.K. Merton, et al. F riendship as a so cial pro cess: A substant iv e and metho d olog- ical analysis. F r e e dom and c ontr ol in mo dern so ciety , 18(1):18–6 6, 1954 . 38 Zhouc hen Lin , Minming Chen , and Yi Ma. T he augmen ted lagrange m ultiplier metho d f or exact reco very of corrup ted lo w-rank matrices. arXiv pr eprint arXiv:1009.5055 , 2010. L. Lov´ asz. V ery large graphs. Curr ent Developments in Mathematics , 2008:67 –128, 2009. M. McPherson, L. Smith-Lo vin, and J.M. C o ok. Birds of a feather: Homophily in so cial netw orks. Annual r e v iew of so c i olo gy , p ages 415–444, 2001. F. McSh erry . S p ectral p artitioning of random graphs. In FOCS , 2001. J.L. Moreno. Who shal l survive?: A new appr o ach to the pr oblem of human interr elations. Nervous and Men tal Disease Publish ing Co, 1934. Elc hanan Mossel, J o e Neeman, and Allan S ly . Sto c hastic blo c k mo dels and reconstruction. arXiv pr eprint arXiv:1202.149 9 , 2012. G. Pa lla, I. Der´ en yi, I. F ark as, and T. Vicsek. Unco v ering the o ve rlappin g communit y structure of complex n et works in n ature and so ciet y . N atur e , 435(7043):8 14–818, 2005. K. Pearson. Contributions to th e mathematical theory of evolutio n. P hilosophic al T r ansactions of the R oyal So ciety, L ondon, A . , page 71, 1894. A. Rinaldo, S.E. 
Fien b erg, an d Y. Zh ou. On the geometry of discrete exp onen tial families with application to exp onen tial rand om graph mo d els. E le ctr onic Journal of Statistics , 3:446–48 4, 2009. T.A.B. S nijders and K. Nowic ki. Estimatio n and prediction f or sto c hastic b lo ckmodels for graphs with latent blo ck structur e. Journal of Classific ation , 14(1): 75–100, 1997. G.W. Stew art and J. Sun . Matrix p erturb ation the ory , v olume 175. Academic press New Y ork, 1990. M. T elgarsky . Diric h let draws are sparse with h igh pr obabilit y . ArXiv:1301.49 17 , 2012. J.A. T ropp. User-friend ly tail b ounds for sums o f random matrices. F oundations of Computational Mathematics , 12(4):3 89–434, 2012. Y.J. W ang and G.Y. W ong. Sto c hastic blo c kmo dels for directed graphs. Journal of the Americ an Statistic al Asso ciation , 82(397) :8–19, 1987. H.C. White, S.A. Bo orman, and R.L. Breiger. So cial structure from m ultiple net wo rks. i. b lock- mo dels of r oles and p ositions. Americ an journal of so ciolo gy , pages 730–78 0, 1976. E.P . Xing, W. F u, and L. Song. A state-space mixed mem b ership blo c k m o d el for d ynamic n et w ork tomograph y . The Annals of Applie d Statistics , 4(2):53 5–566, 2010. 39 A T ensor P o w er Metho d A n alysis In this section, w e lev erage on the p erturbation analysis for tensor p ow er metho d in Anan d kumar et al. [2012b]. As discuss ed in S ection 3.3.2, we prop ose the follo wing mo difications to the tensor p o w er metho d and obtain guarantees b elo w for the mo d ified method . T h e tw o main mo d ifi cations are: (1) w e mo d ify the tensor defl ation pro cess in th e robust p o wer metho d in Pro cedure 2. Rather than a fixed deflation step after obtaining an e stimate of the ei gen v alue-eigen v ector pair, in this pap er, w e deflate adaptiv ely dep endin g on the current estimate, and (2)rather than selecting random initial- ization v ectors, as in Anandkumar et al. [2012b], w e initialize with vec tors obtai ned from adjacency matrix. Belo w in Section A.1, w e establish success of the mo d ifi ed tensor metho d un d er “g o o d” initial- ization vect ors, as defined b elow. This inv olve s impr o v ed error b ound s for the mo dified deflation pro cedure pro vided in Section A.2. In Section C.5, we su bsequent ly establish that under the Diric h- let d istribution (for small α 0 ), we obtain “go o d” initialization ve ctors. A.1 Analysis under goo d initialization v ectors W e no w sh o w that wh en “go o d” initialization vect ors are input to tensor p ow er metho d in Pr o ce- dure 2, we obtain go o d estimates of eigen-pairs under appropr iate c hoice of num b er of iterations N and sp ectral norm ǫ of tensor p ertur b ation. Let T = P i ∈ [ k ] λ i v i , where v i are orthonormal vect ors and λ 1 ≥ λ 2 ≥ . . . λ k . Let ˜ T = T + E b e the p erturb ed tensor with k E k ≤ ǫ . Recall that N denotes the n umb er of iterations of the tensor p o we r metho d. W e call a n initialization v ector u to b e ( γ , R 0 )-goo d if there exists v i suc h that h u, v i i > R 0 and | h u, v i i | − max j γ | h u, v i i | . (51) Cho ose γ = 1 / 100. Theorem A.1. Ther e exists universal c onstants C 1 , C 2 > 0 such that the fol lowing holds. ǫ ≤ C 1 · λ min R 2 0 , N ≥ C 2 ·  log( k ) + log log  λ max ǫ  , (52 ) Assume ther e is at le ast one go o d initialization ve ctor c orr esp onding to e ach v i , i ∈ [ k ] . 
The p ar am- eter ξ for cho osing deflation ve ctors i n e ach iter ation of the tensor p ower metho d in Pr o c e dur e 2 is chosen as ξ ≥ 25 ǫ . We obtain eigenvalue-eige nve ctor p airs ( ˆ λ 1 , ˆ v 1 ) , ( ˆ λ 2 , ˆ v 2 ) , . . . , ( ˆ λ k , ˆ v k ) such that ther e exists a p ermutation π on [ k ] with k v π ( j ) − ˆ v j k ≤ 8 ǫ/λ π ( j ) , | λ π ( j ) − ˆ λ j | ≤ 5 ǫ, ∀ j ∈ [ k ] , and       T − k X j =1 ˆ λ j ˆ v ⊗ 3 j       ≤ 55 ǫ. 40 Remark 1 (need for adapt ive deflation) : W e now compare the ab o v e resu lt w ith the result in [Anandku mar et al., 2012b, Thm. 5.1], wh ere similar guarantee s are obtained for a simpler v ersion o f the tensor p ow er m ethod without an y adap tive deflation and using random initialization. The main difference is in our requirement of the gap γ in (51) for an in itializa tion v ector is wea ke r than th e gap requir ement in [Anandkumar et al., 2012b, Thm. 5.1]. This is du e to the use of adaptiv e deflation in this pap er. Remark 2 (need for non-random initialization): In this pap er, w e emp lo y whitened neigh- b orho o d v ectors generated u n der th e MMSB mo del for initializat ion, wh ile [Anandkumar et al., 2012b, Thm. 5.1] assumes a ran d om initializat ion. Under rand om initializatio n, we obtain R 0 ∼ 1 / √ k (with p oly( k ) trials), wh ile for initializa tion u sing whitened neighborh o o d v ectors, we sub - sequen tly establish that R 0 = Ω(1) is a constan t, when num b er of samples n is large enough. W e also establish that the gap requirement in (51) is satisfied f or the c hoice of γ = 1 / 100 ab o v e. S ee Lemma C.9 for details. Thus, w e can tolerate m uch larger p erturbation ǫ of the t hird order moment tensor, when n on-random initializations are employ ed. Pr o of: The pr o of is on lines of th e p r o of of [Anandku mar et al., 2012b, Thm. 5. 1] but here, we consider the mo dified defl ation pro cedure, which impr ov es the condition on ǫ in (52). W e pr ovide the full pro of b elo w for completeness. W e pro v e b y induction on i , the n umber of eige npairs estimated so far b y Pro cedur e 2. Assu me that th ere exists a p ermutation π on [ k ] such that the follo wing assertions h old. 1. F or all j ≤ i , k v π ( j ) − ˆ v j k ≤ 8 ǫ/λ π ( j ) and | λ π ( j ) − ˆ λ j | ≤ 1 2 ǫ . 2. D ( u, i ) is the set of deflated v ectors giv en curr en t estimate of the p o wer metho d is u ∈ S k − 1 : D ( u, i ; ξ ) := { j : | ˆ λ i ˆ θ i | ≥ ξ } ∩ [ i ] , where ˆ θ i := h u, ˆ v i i . 3. T h e error tensor ˜ E i +1 ,u :=  ˆ T − X j ∈ D ( u,i ; ξ ) ˆ λ j ˆ v ⊗ 3 j  − X j / ∈ D ( u,i ; ξ ) λ π ( j ) v ⊗ 3 π ( j ) = E + X j ∈ D ( u,i ; ξ )  λ π ( j ) v ⊗ 3 π ( j ) − ˆ λ j ˆ v ⊗ 3 j  satisfies k ˜ E i +1 ,u ( I , u, u ) k ≤ 56 ǫ, ∀ u ∈ S k − 1 ; (53) k ˜ E i +1 ,u ( I , u, u ) k ≤ 2 ǫ, ∀ u ∈ S k − 1 s.t. ∃ j ≥ i + 1  ( u ⊤ v π ( j ) ) 2 ≥ 1 − (168 ǫ/λ π ( j ) ) 2 . ( 54) W e tak e i = 0 a s the base case, so w e can ignore t he fi rst assertio n, and just observe that f or i = 0, D ( u, 0; ξ ) = ∅ and thus ˜ E 1 ,u = ˆ T − k X j =1 λ i v ⊗ 3 i = E , ∀ u ∈ S k − 1 . W e ha ve k ˜ E 1 k = k E k = ǫ , and therefore the second assertion holds. 41 No w fix some i ∈ [ k ], and assume as the inductiv e hyp othesis. The p o wer iterations n o w tak e a s ubset of j ∈ [ i ] for deflation, dep ending on the current estimate. S et C 1 := min  (56 · 9 · 102) − 1 , (100 · 168) − 1 , ∆ ′ from L emma A.1 with ∆ = 1 / 50  . 
(55) F or all go o d initializat ion ve ctors which are γ -separated relativ e to π ( j max ), we ha ve (i) | θ ( τ ) j max , 0 | ≥ R 0 , and (ii) that b y [Anandku m ar et al. , 2012b, Lemma B.4] (usin g ˜ ǫ/p := 2 ǫ , κ := 1, and i ∗ := π ( j max ), an d p ro viding C 2 ), | ˜ T i ( θ ( τ ) N , θ ( τ ) N , θ ( τ ) N ) − λ π ( j max ) | ≤ 5 ǫ (notice by d efinition that γ ≥ 1 / 100 implies γ 0 ≥ 1 − 1 / (1 + γ ) ≥ 1 / 101, th us it follo ws f rom the b ounds on th e other qu an tities that ˜ ǫ = 2 pǫ ≤ 56 C 1 · λ min R 2 0 < γ 0 2(1+8 κ ) · ˜ λ min · θ 2 i ∗ , 0 as necessary). Therefore θ N := θ ( τ ∗ ) N m ust satisfy ˜ T i ( θ N , θ N , θ N ) = max τ ∈ [ L ] ˜ T i ( θ ( τ ) N , θ ( τ ) N , θ ( τ ) N ) ≥ max j ≥ i λ π ( j ) − 5 ǫ = λ π ( j max ) − 5 ǫ. On the other h and, by the triangle inequalit y , ˜ T i ( θ N , θ N , θ N ) ≤ X j ≥ i λ π ( j ) θ 3 π ( j ) ,N + | ˜ E i ( θ N , θ N , θ N ) | ≤ X j ≥ i λ π ( j ) | θ π ( j ) ,N | θ 2 π ( j ) ,N + 56 ǫ ≤ λ π ( j ∗ ) | θ π ( j ∗ ) ,N | + 56 ǫ where j ∗ := arg max j ≥ i λ π ( j ) | θ π ( j ) ,N | . Therefore λ π ( j ∗ ) | θ π ( j ∗ ) ,N | ≥ λ π ( j max ) − 5 ǫ − 56 ǫ ≥ 4 5 λ π ( j max ) . Squaring b oth sid es and using the fact that θ 2 π ( j ∗ ) ,N + θ 2 π ( j ) ,N ≤ 1 f or any j 6 = j ∗ ,  λ π ( j ∗ ) θ π ( j ∗ ) ,N  2 ≥ 16 25  λ π ( j max ) θ π ( j ∗ ) ,N  2 + 16 25  λ π ( j max ) θ π ( j ) ,N  2 ≥ 16 25  λ π ( j ∗ ) θ π ( j ∗ ) ,N  2 + 16 25  λ π ( j ) θ π ( j ) ,N  2 whic h in turn implies λ π ( j ) | θ π ( j ) ,N | ≤ 3 4 λ π ( j ∗ ) | θ π ( j ∗ ) ,N | , j 6 = j ∗ . This means th at θ N is (1 / 4)-separated relativ e to π ( j ∗ ). Also, observe that | θ π ( j ∗ ) ,N | ≥ 4 5 · λ π ( j max ) λ π ( j ∗ ) ≥ 4 5 , λ π ( j max ) λ π ( j ∗ ) ≤ 5 4 . Therefore by [Anandkum ar et al., 2012b, L emm a B.4] (using ˜ ǫ/p := 2 ǫ , γ := 1 / 4, and κ := 5 / 4), executing another N p ow er iterations s tarting fr om θ N giv es a vecto r ˆ θ that satisfies k ˆ θ − v π ( j ∗ ) k ≤ 8 ǫ λ π ( j ∗ ) , | ˆ λ − λ π ( j ∗ ) | ≤ 5 ǫ. 42 Since ˆ v i = ˆ θ and ˆ λ i = ˆ λ , the first assertion of the ind uctiv e hypothesis is satisfied, a s w e can mo dify the p erm utation π by swa pp in g π ( i ) and π ( j ∗ ) w ithout affecting the v alues of { π ( j ) : j ≤ i − 1 } (recall j ∗ ≥ i ). W e n o w argue that ˜ E i +1 ,u has the requir ed prop er ties to complete th e indu ctiv e s tep. By Lemma A.1 (u s ing ˜ ǫ := 5 ǫ , ξ = 5˜ ǫ = 25 ǫ and ∆ := 1 / 50 , th e latter pro viding one up p er b ou n d on C 1 as p er (55)), w e ha ve for any unit vect or u ∈ S k − 1 ,       X j ≤ i  λ π ( j ) v ⊗ 3 π ( j ) − ˆ λ j ˆ v ⊗ 3 j   ( I , u, u )      ≤  1 / 50 + 100 i X j =1 ( u ⊤ v π ( j ) ) 2  1 / 2 5 ǫ ≤ 5 5 ǫ. (56) Therefore by the triangle in equalit y , k ˜ E i +1 ( I , u, u ) k ≤ k E ( I , u, u ) k +       X j ≤ i  λ π ( j ) v ⊗ 3 π ( j ) − ˆ λ j ˆ v ⊗ 3 j   ( I , u, u )      ≤ 56 ǫ. Th us the b ound (53) h olds. T o p ro v e that (54) holds, for an y unit vec tor u ∈ S k − 1 suc h that there exists j ′ ≥ i + 1 with ( u ⊤ v π ( j ′ ) ) 2 ≥ 1 − (168 ǫ/λ π ( j ′ ) ) 2 . W e hav e (via the second b ound on C 1 in (55) and th e corresp onding assumed b ound ǫ ≤ C 1 · λ min R 2 0 ) 100 i X j =1 ( u ⊤ v π ( j ) ) 2 ≤ 100  1 − ( u ⊤ v π ( j ′ ) ) 2  ≤ 100  168 ǫ λ π ( j ′ )  2 ≤ 1 50 , and therefore  1 / 50 + 100 i X j =1 ( u ⊤ v π ( j ) ) 2  1 / 2 5 ǫ ≤ (1 / 50 + 1 / 50) 1 / 2 5 ǫ ≤ ǫ. By the triangle inequalit y , w e h a v e k ˜ E i +1 ( I , u, u ) k ≤ 2 ǫ . 
Th erefore (54) holds, so the second asser- tion of the inductive hyp othesis holds. W e conclude that b y the indu ction pr inciple, there exists a p ermutation π su c h that t wo assertions hold for i = k . F rom the last in duction step ( i = k ), it is also clear from (56) that k T − P k j =1 ˆ λ j ˆ v ⊗ 3 j k ≤ 55 ǫ . This completes the p ro of of the theorem.  A.2 Deflation Analysis Lemma A.1 (Deflati on analysis) . L et ˜ ǫ > 0 and let { v 1 , . . . , v k } b e an ortho normal b asis for R k and λ i ≥ 0 for i ∈ [ k ] . L et { ˆ v 1 , . . . , ˆ v k } ∈ R k b e a set of u ni t ve ctors and ˆ λ i ≥ 0 . Define thir d or der tensor E i such that E i := λ i v ⊗ 3 i − ˆ λ i ˆ v ⊗ 3 i , ∀ i ∈ k. F or some t ∈ [ k ] and a unit ve c tor u ∈ S k − 1 such that u = P i ∈ [ k ] θ i v i and ˆ θ i := h u, ˆ v i i , we have for i ∈ [ t ] , | ˆ λ i ˆ θ i | ≥ ξ ≥ 5 ˜ ǫ, | ˆ λ i − λ i | ≤ ˜ ǫ, k ˆ v i − v i k ≤ min { √ 2 , 2˜ ǫ/λ i } , 43 then, the fol lowing holds     t X i =1 E i ( I , u, u )     2 2 ≤  4(5 + 11˜ ǫ /λ min ) 2 + 128 (1 + ˜ ǫ/λ min ) 2 (˜ ǫ /λ min ) 2  ˜ ǫ 2 t X i =1 θ 2 i + 64(1 + ˜ ǫ/λ min ) 2 ˜ ǫ 2 + 204 8(1 + ˜ ǫ/λ min ) 2 ˜ ǫ 2 . In p articular, for any ∆ ∈ (0 , 1) , ther e exists a c onstant ∆ ′ > 0 (dep ending only on ∆ ) such that ˜ ǫ ≤ ∆ ′ λ min implies     t X i =1 E i ( I , u, u )     2 2 ≤  ∆ + 100 t X i =1 θ 2 i  ˜ ǫ 2 . Pr o of: T h e p ro of is on lines of deflation analysis in [Anand kumar et al., 2012 b , Lemma B.5], but we impro ve the b ounds based on additional p rop erties of v ector u . F rom Anandku mar et al. [2012b], we hav e that f or all i ∈ [ t ], and an y unit ve ctor u ,     t X i =1 E i ( I , u, u )     2 2 ≤  4(5 + 11˜ ǫ/λ min ) 2 + 128 (1 + ˜ ǫ/λ min ) 2 (˜ ǫ/λ min ) 2  ˜ ǫ 2 t X i =1 θ 2 i + 64(1 + ˜ ǫ/λ min ) 2 ˜ ǫ 2 t X i =1 (˜ ǫ/λ i ) 2 + 204 8(1 + ˜ ǫ/λ min ) 2 ˜ ǫ 2  t X i =1 (˜ ǫ /λ i ) 3  2 . (57) Let ˆ λ i = λ i + δ i and ˆ θ i = θ i + β i . W e h a ve δ i ≤ ˜ ǫ and β i ≤ 2˜ ǫ/λ i , and that | ˆ λ i ˆ θ i | ≥ ξ . || ˆ λ i ˆ θ i | − | λ i θ i || ≤ | ˆ λ i ˆ θ i − λ i θ i | ≤ | ( λ i + δ i )( θ i + β i ) − λ i θ i | ≤ | δ i θ i + λ i β i + δ i β i | ≤ 4˜ ǫ. Th us, we ha v e that | λ i θ i | ≥ 5˜ ǫ − 4˜ ǫ = ˜ ǫ . Thus P t i =1 ˜ ǫ 2 /λ 2 i ≤ P i θ 2 i ≤ 1. Substituting in (57), w e ha v e the result.  B Pro of of T h eorem 4.3 W e no w prov e the main resu lts on error b oun d s claimed in Th eorem 4.3 for the estimated comm unit y v ectors ˆ Π and estimated b lo c k probabilit y matrix ˆ P in Algorithm 1. Belo w, we first sho w th at the tensor p erturbation b ounds claimed in Lemma 4.2 holds. Notation: Let k T k denote the sp ectral norm for a tensor T (o r in sp ecial cases a matrix or a v ector). Let k M k F denote the F rob en iu s norm. Let | M 1 | denote the op erator ℓ 1 norm, i.e., the maxim um ℓ 1 norm of its columns and k M k ∞ denote the maximum ℓ 1 norm of its r ows. Let κ ( M ) denote the condition num b er, i.e., k M k σ min ( M ) . 44 B.1 Pro of of Lemma 4.2 F rom Theorem A.1 in App endix A, we see t hat the tensor p o wer met ho d returns eigen v alue-v ector pair ( ˆ λ i , ˆ Φ i ) suc h that there exists a p erm utation θ w ith max i ∈ [ k ] k ˆ Φ i − Φ θ ( i ) k ≤ 8 b α 1 / 2 max ε T , (58) and max i | λ i − b α − 1 / 2 θ ( i ) | ≤ 5 ε T , (59) when the p erturbation of the tensor is small enough, according to ε T ≤ C 1 b α − 1 / 2 max r 2 0 , (60) for some constant C 1 , when in itialized w ith a ( γ , r 0 ) go od vecto r. 
With the ab o ve resu lt, t w o asp ects need to b e established: (1) the whitened tens or p ertur b a- tion ǫ T is as claimed, (2) the condition in (60) is satisfied and (3) there exist go o d in itializ ation v ectors wh en wh itened neighborh oo d vect ors are emplo y ed. Th e tensor p erturbation b ound ǫ T is established in Theorem C.1 in App endix C.1. Lemma C.9 establishes that wh en ζ = O ( √ nr 2 0 /ρ ), w e h av e goo d initializa tion v ectors with Recall r 2 0 = Ω (1 / b α max k ) when α 0 > 1 and r 2 0 = Ω (1) for α 0 ≤ 1, and γ = 1 / 100 with p robabilit y 1 − 9 δ u nder Diric hlet distribution, when n = ˜ Ω  α − 1 min k 0 . 43 log( k /δ )  , (61) whic h is satisfied since we assume b α − 2 min < n . W e n o w sh ow that the condition in (60) is satisfied under the assumptions B1-B4. S ince ǫ T is giv en b y ε T = ˜ O ρ √ n · ζ b α 1 / 2 max ! , the condition in (60 ) is equiv alen t to ζ = O ( √ nr 2 0 /ρ ). Therefore w hen ζ = O ( √ nr 2 0 /ρ ), the assumptions of T h eorem A.1 are satisfied. B.2 Reconstruction of Π after tensor p o w er metho d Let ( M ) i and ( M ) i denote t he i th ro w a nd i th column in matrix M resp ectiv ely . Let Z ⊆ A c denote an y sub s et of n o d es not in A , considered in P r o cedure LearnPartitio n Communit y . Define ˜ Π Z := Diag ( λ ) − 1 Φ ⊤ ˆ W ⊤ A G ⊤ Z,A . (62) Recall that the fin al estimate ˆ Π Z is obtained by thresh oldin g ˜ Π Z elemen t-wise with thr eshold τ in Pro cedure 1. W e fi r st analyze p erturbation of ˜ Π Z . Lemma B.1 (Reconstruction Guarantee s for ˜ Π Z ) . Assuming L emma 4.2 holds and the tensor p ower metho d r e c overs eigenve ctors and eigenvalues u p to the guar ante e d err ors, we h ave with 45 pr ob ability 1 − 122 δ , ε π := max i ∈ Z k ( ˜ Π Z ) i − (Π Z ) i k = O ε T b α 1 / 2 max  b α max b α min  1 / 2 k Π Z k ! , = O ρ · ζ · b α 1 / 2 max  b α max b α min  1 / 2 ! wher e ε T is give n by (70) . Pr o of: W e h a v e ( ˜ Π Z ) i = λ − 1 i ((Φ) i ) ⊤ ˆ W ⊤ A G ⊤ Z,A . W e w ill no w u se p erturbation b ounds for eac h of the terms to get the resu lt. The fi rst term is k Diag ( λ i ) − 1 − Diag( b α 1 / 2 i ) k · k Diag ( b α 1 / 2 ) ˜ F ⊤ A k · k ˜ F A k · k Π Z k ≤ 5 ε T b α max b α − 1 / 2 min (1 + ε 1 ) 2 k Π Z k from th e fact that k Diag ( b α 1 / 2 ) ˜ F ⊤ A k ≤ 1 + ε 1 , wh ere ε 1 is giv en by (85). The second term is k Diag ( b α 1 / 2 ) k · k (Φ) i − b α 1 / 2 i ( ˜ F A ) i k · k ˜ F A k · k Π Z k ≤ 8 b α max ε T b α − 1 / 2 min (1 + ε 1 ) k Π Z k The th ird term is k b α 1 / 2 i k · k ( ˆ W ⊤ A − W ⊤ A ) F A Π Z k ≤ b α 1 / 2 max b α − 1 / 2 min k Π Z k ǫ W (63) ≤ O  b α max b α min  1 / 2 ε T b α 1 / 2 min k Π Z k ! , (64) from L emma C.1 and finally , we hav e k b α 1 / 2 i k · k W A k · k G ⊤ Z,A − F A Π Z k ≤ O b α 1 / 2 max √ α 0 + 1 b α min σ min ( P ) r (max i ( P b α ) i )(1 + ε 2 + ε 3 ) log k δ ! (65) ≤ O  b α max b α min  1 / 2 ε T √ α 0 + 1(1 + ε 2 + ε 3 ) r log k δ ! (66) from L emma C.6 and Lemma C.7. The th ird term in (64) dominates the last term in (66) since ( α 0 + 1) log k /δ < n b α min (due to assumption B2 on scaling of n ).  W e no w sho w that if w e threshold the entries of ˜ Π Z , the the resulting matrix ˆ Π Z has ro ws close to those in Π Z in ℓ 1 norm. 46 Lemma B.2 (Guarante es after thresholding) . F or ˆ Π Z := Thres( ˜ Π Z , τ ) , wher e τ is the thr eshold, we have with pr ob ability 1 − 2 δ , that ε π ,ℓ 1 := max i ∈ [ k ] | ( ˆ Π Z ) i − (Π Z ) i | 1 = O √ nη ε π r log 1 2 τ 1 − s 2 log ( k /δ ) nη log(1 / 2 τ ) ! 
+ nη τ + r ( nη + 4 τ 2 ) log k δ + ε 2 π τ ! , wher e η = b α max when α 0 < 1 and η = α max when α 0 ∈ [1 , k ) . Remark 1: The ab o v e guarantee on ˆ Π Z is stronger than for ˜ Π Z in Lemm a B.1 s in ce this is an ℓ 1 guaran tee on the r ows compared to ℓ 2 guaran tee on ro ws for ˜ Π Z . Remark 2: When τ is c hosen as τ = Θ( ε π √ nη ) = Θ ρ 1 / 2 · ζ · b α 1 / 2 max n 1 / 2 · b α min ! , w e ha ve that max i ∈ [ k ] | ( ˆ Π Z ) i − (Π Z ) i | 1 = ˜ O ( √ nη · ε π ) = ˜ O  n 1 / 2 · ρ 3 / 2 · ζ · b α max  Pr o of: Let S i := { j : ˆ Π Z ( i, j ) > 2 τ } . F or a v ector v , let v S denote the su b -v ector by considerin g en tries in set S . W e now hav e | ( ˆ Π Z ) i − (Π Z ) i | 1 ≤ | ( ˆ Π Z ) i S i − (Π Z ) i S i | 1 + | (Π Z ) i S c i | 1 + | ( ˆ Π Z ) i S c i | 1 Case α 0 < 1 : F r om Lemma C .10, we hav e P [Π( i, j ) ≥ 2 τ ] ≤ 8 b α i log(1 / 2 τ ). Since Π( i, j ) are indep end ent for j ∈ Z , we ha v e from multiplic ativ e Chern off b ound [Kearns and V azirani, 1994, Thm 9.2], that with pr obabilit y 1 − δ , max i ∈ [ k ] | S i | < 8 n b α max log  1 2 τ  1 − s 2 log ( k /δ ) n b α i log(1 / 2 τ ) ! . W e ha ve | ( ˜ Π Z ) i S i − (Π Z ) i S i | 1 ≤ ε π | S i | 1 / 2 , and the i th ro ws of ˜ Π Z and ˆ Π Z can differ on S i , we ha ve | ˜ Π Z ( i, j ) − ˆ Π Z ( i, j ) | ≤ τ , for j ∈ S i , and n umb er of suc h terms is at most ε 2 π /τ 2 . Thus, | ( ˜ Π Z ) i S i − ( ˆ Π Z ) i S i | 1 ≤ ε 2 π τ . F or the other term, from L emm a C.10, we ha ve E [Π Z ( i, j ) · δ (Π Z ( i, j ) ≤ 2 τ )] ≤ b α i (2 τ ) . 47 Applying Bernstein’s b oun d w e ha v e with probability 1 − δ max i ∈ [ k ] X j ∈ Z Π Z ( i, j ) · δ (Π Z ( i, j ) ≤ 2 τ ) ≤ n b α max (2 τ ) + r 2( n b α max + 4 τ 2 ) log k δ . F or ˆ Π i S c i , w e fur th er divide S c i in to T i and U i , where T i := { j : τ / 2 < Π Z ( i, j ) ≤ 2 τ } and U i := { j : Π Z ( i, j ) ≤ τ / 2 } . In the set T i , using similar argument we kn ow | (Π Z ) i T i − ( ˜ Π Z ) i T i | 1 ≤ O ( ε π p n b α max log 1 /τ ), therefore | ˆ Π i T i | 1 ≤ | ˜ Π i T i | 1 ≤ | Π i T i − ˜ Π i T i | 1 + | Π i S c i | 1 ≤ O ( ε π p n b α max log 1 /τ ) . Finally , for index j ∈ U i , in order f or ˆ Π Z ( i, j ) b e p ositiv e, it is r equired that ˜ Π Z ( i, j ) − Π Z ( i, j ) ≥ τ / 2. In this case, w e hav e | ( ˆ Π Z ) i U i | 1 ≤ 4 τ    ( ˜ Π Z ) i U i − Π i U i    2 ≤ 4 ε 2 π τ . Case α 0 ∈ [1 , k ) : F rom Lemma C.10, we see that the r esults hold wh en w e replace b α max with α max .  B.3 Reconstruction of P after tensor p ow er metho d Finally we w ould like to use the comm u nit y v ectors Π an d the adjacency matrix G to estimate the P matrix. Recall that in the generativ e mo del, w e ha v e E [ G ] = Π ⊤ P Π. Thus, a straigh tforwa rd estimate is to use ( ˆ Π † ) ⊤ G ˆ Π † . Ho we ve r, our guaran tees on ˆ Π are not strong enough to control the error on ˆ Π † (since we only h a v e ro w-wise ℓ 1 guaran tees). W e p rop ose an alternativ e estimator ˆ Q for ˆ Π † and use it to fi n d ˆ P in Algorithm 1. Recall th at the i -th row of ˆ Q is given by ˆ Q i := ( α 0 + 1) ˆ Π i | ˆ Π i | 1 − α 0 n ~ 1 ⊤ . Define Q u sing exact communities, i.e. Q i := ( α 0 + 1) Π i | Π i | 1 − α 0 n ~ 1 ⊤ . W e show b elo w th at ˆ Q is close to Π † , and therefore, ˆ P := ˆ Q ⊤ G ˆ Q is close to P w.h.p. Lemma B.3 (Reconstruction of P ) . With pr ob ability 1 − 5 δ , ε P := max i,j ∈ [ n ] | ˆ P i,j − P i,j | ≤ O ( α 0 + 1) 3 / 2 ε π ( P max − P min ) √ n b α − 1 min b α 1 / 2 max log nk δ ! 
48 Remark: If we define a new matrix Q ′ as ( Q ′ ) i := α 0 +1 n b α i Π i − α 0 n ~ 1 ⊤ , then E Π [ Q ′ Π ⊤ ] = I . Belo w, w e sh o w that Q ′ is close to Q since E [ | Π i | 1 ] = n b α i and thus the ab o ve r esult holds. W e require Q to b e norm alized by | Π i | 1 in order to ensure that the first term of Q has equal column n orms, whic h will b e used in our p ro ofs su b sequen tly . Pr o of: The pr o of goes in thr ee steps: P ≈ Q Π ⊤ P Π Q ⊤ ≈ QGQ ⊤ ≈ ˆ QG ˆ Q ⊤ . Note th at E Π [Π Q ⊤ ] = I and by Berns tein’s b ound, we can claim that Π Q ⊤ is close to I and can sho w that the i -th r o w of Q Π ⊤ satisfies ∆ i := | ( Q Π ⊤ ) i − e ⊤ i | 1 = O k s log  nk δ  b α max b α min 1 √ n ! with p robabilit y 1 − δ . Moreo v er, | (Π ⊤ P Π Q ⊤ ) i,j − (Π ⊤ P ) i,j | ≤ | (Π ⊤ P ) i (( Q ) j − e j ) | = | (Π ⊤ P ) i ∆ j | ≤ O P max k · p b α max / b α min √ n r log nk δ ! . using the fact th at (Π ⊤ P ) i,j ≤ P max . No w w e claim that ˆ Q is close to Q and it can b e sho wn that | Q i − ˆ Q i | 1 ≤ O  ε P P max − P min  (67) Using (67), we ha ve | (Π ⊤ P Π Q ⊤ ) i,j − (Π ⊤ P Π ˆ Q ⊤ ) i,j | = | (Π ⊤ P Π) i ( Q ⊤ − ˆ Q ⊤ ) j | = ((Π ⊤ P Π) i − P min ~ 1 ⊤ ) | ( Q ⊤ − ˆ Q ⊤ ) j | 1 ≤ O (( P max − P min ) | ( Q ⊤ − ˆ Q ⊤ ) j | 1 ) = O ( ε P ) . using the fact th at ( Q j − ˆ Q j ) ~ 1 = 0, d ue to the normalization. Finally , | ( G ˆ Q ⊤ ) i,j (Π ⊤ P Π ˆ Q ⊤ ) i,j | are small by standard concen tration b oun ds (and the d iffer- ences are of lo wer ord er). Combining these | ˆ P i,j − P i,j | ≤ O ( ε P ).  B.4 Zero-error supp ort recov ery guaran tees Recall t hat w e prop osed Pro cedure 3 to pr o v id e imp ro v ed supp ort reco v ery estimates in the sp ecial case of homophilic mo dels (where there are more edges w ith in a communit y than to any comm un it y outside). W e limit our analysis to the sp ecia l case of u niform sized communities ( α i = 1 /k ) and matrix P such that P ( i, j ) = p I ( i = j ) + q I ( i 6 = j ) and p ≥ q . In p r inciple, the analysis can b e extended to homophilic mo dels with more general P matrix (with suitably c hosen thresh olds for supp ort reco very). 49 W e first consider analysis for the sto chastic bloc k mo del (i.e. α 0 → 0) and prov e the guaran tees claimed in Corollary 4.1 . Pr o of of Cor ol lary 4.1 : Recall the definition of ˜ Π in (6 2) and ˆ Π is obtained b y th r esholding ˜ Π with threshold τ . S in ce the threshold τ for sto c hastic blo ck mo dels is 0 . 5 (assumption B5), w e h av e | ( ˆ Π) i − (Π) i | 1 = O ( ε 2 π ) , (68) where ε π is the ro w-wise ℓ 2 error fo r ˜ Π in Lemma B.1. This is b ecause Π( i, j ) ∈ { 0 , 1 } , and in order for our metho d to make a mistak e, it tak es 1 / 4 in the ℓ 2 2 error. In Pro cedure 3 , for th e sto c hastic b lo ck mo del ( α 0 = 0), for a no de x ∈ [ n ], we ha ve ˆ F ( x, i ) = X y ∈ [ n ] G x,y ˆ Π( i, y ) | ˆ Π i | 1 ≈ X y ∈ [ n ] G x,y ˆ Π( i, y ) | Π i | 1 ≈ k n X y ∈ [ n ] G x,y ˆ Π( i, y ) , using (68) and the fact that th e size of eac h comm unit y on a v erage is n/k . In other words, for eac h v ertex x , w e compute the a ve rage n u m b er of ed ges f r om this vertex to all the estimated comm unities according to ˆ Π, and set it to belong to the one with largest a v erage degree. 
Note that the margin o f error on av erage for eac h node to b e assigned the correct comm unity according to the ab o v e pro cedur e is ( p − q ) n/k , since the size of eac h communit y is n/k and the av erage n umb er of in tra-comm unit y edges at a n o de is pn/k and edges to an y different comm u nit y at a no d e is q n /k . F rom (68), we h a v e that the av erage n umb er o f errors made is O (( p − q ) ε 2 π ). Note that the degrees concen trate around their exp ectations according to Bernstein’s b oun d and the fact that the edges used for a v eraging is indep enden t from the edges used for estimating ˆ Π. T h us, for our metho d to succeed in in ferring th e correct communit y at a no d e, we require, O (( p − q ) ε 2 π ) ≤ ( p − q ) n k , whic h implies p − q ≥ ˜ Ω  √ pk √ n  .  W e no w prov e the general result on sup p ort reco very . Pr o of of The or em 4.2: F rom Lemm a B.3, | ˆ P i,j − P i,j | ≤ O ( ε P ) whic h implies b oun ds for th e av erage of d iagonals H and a v erage of off-diagonals L : | H − p | = O ( ε P ) , | L − q | = O ( ε P ) . On similar lines as the pro of of Lemma B.3 and from ind ep endence of edges used to defin e ˆ F from the edges used to estimate ˆ Π, w e also ha v e | ˆ F ( j, i ) − F ( j, i ) | ≤ O ( ε P ) . Note that F j,i = q + Π i,j ( p − q ). T he threshold ξ satisfies ξ = Ω ( ε P ), therefore, all the entries in F that are larger than q + ( p − q ) ξ , the corresp on d ing en tries in S are declared to b e one, w hile none of the entries that are smaller than q + ( p − q ) ξ / 2 are set to one in S .  50 C Concen tration Bounds C.1 Main Result: T ensor P erturbation Bound W e no w provide the main r esult that the thir d -order whitened tensor compu ted from samples concen trates. Recall that T α 0 Y →{ A,B ,C } denotes the third order momen t computed u sing edges from partition Y to p artitions A, B , C in (15 ). ˆ W A , ˆ W B ˆ R AB , ˆ W C ˆ R AC are the whitening matrices defined in (24). The corresp onding whitening matrices W A , W B R AB , W C R AC for exact momen t third order tensor E [T α 0 Y →{ A,B ,C } | Π] will b e d efined later. Recall that ρ is defin ed in (37) as ρ := α 0 +1 b α min . Giv en δ ∈ (0 , 1), thr oughout assume that n = Ω  ρ 2 log 2 k δ  , (69) as in Ass u mption ( B 2). Theorem C.1 (P erturbation of whitened tensor) . W hen the p artitions A, B , C, X , Y satisfy (69) , we have with pr ob ability 1 − 100 δ , ε T :=    T α 0 Y →{ A,B , C } ( ˆ W A , ˆ W B ˆ R AB , ˆ W C ˆ R AC ) − E [T α 0 Y →{ A,B ,C } ( W A , ˜ W B , ˜ W C ) | Π A , Π B , Π C ]    = O ( α 0 + 1) p (max i ( P b α ) i ) n 1 / 2 b α 3 / 2 min σ min ( P ) · 1 +  ρ 2 n log 2 k δ  1 / 4 ! r log k δ ! = ˜ O ρ √ n · ζ b α 1 / 2 max ! . (70) Pro of Overview: The pr o of of the ab o ve result follo ws. It consists mainly of the follo wing steps: (1) Controlling the p erturbations of the wh itening matrices and (2) Establishing concen tration of the third momen t tens or (b efore whitening). Combining the t w o, w e can then obtain p erturb a- tion of the whitened tensor. P erturbations for the wh itenin g step is established in App endix C.2. Auxiliary concent ration b ounds requ ired for the whitening step, and for the claims b elo w are in App endix C.3 and C.4. Pr o of of The or em C.1: In tensor T α 0 in (15), the fi rst term is ( α 0 + 1)( α 0 + 2) X i ∈ Y  G ⊤ i,A ⊗ G ⊤ i,B ⊗ G ⊤ i,C  . W e claim that this term dominates in the p erturbation analysis since the mean v ector p ertur bation is of low er order. 
W e n o w consider p erturb ation of the whitened tensor Λ 0 = 1 | Y | X i ∈ Y  ( ˆ W ⊤ A G ⊤ i,A ) ⊗ ( ˆ R ⊤ AB ˆ W ⊤ B G ⊤ i,B ) ⊗ ( ˆ R ⊤ AC ˆ W ⊤ C G ⊤ i,C )  . W e show that th is tensor is close to the corresp ondin g term in th e exp ecta tion in thr ee steps. First w e show it is close to Λ 1 = 1 | Y | X i ∈ Y  ( ˆ W ⊤ A F A π i ) ⊗ ( ˆ R ⊤ AB ˆ W ⊤ B F B π i ) ⊗ ( ˆ R ⊤ AC ˆ W ⊤ C F C π i )  . 51 Then this v ector is close to the exp ectation ov er Π Y . Λ 2 = E π ∼ Dir( α )  ( ˆ W ⊤ A F A π ) ⊗ ( ˆ R ⊤ AB ˆ W ⊤ B F B π ) ⊗ ( ˆ R ⊤ AC ˆ W ⊤ C F C π )  . Finally w e replace the estimated wh itening matrix ˆ W A with W A , defined in ( 71), and note that W A whitens the exact moments. Λ 3 = E π ∼ Dir( α )  ( W ⊤ A F A π ) ⊗ ( ˜ W ⊤ B F B π ) ⊗ ( ˜ W ⊤ C F C π )  . F or Λ 0 − Λ 1 , the d omin an t term in the p erturbation b ound (assuming partitions A, B , C , X, Y are of size n ) is (since for an y rank 1 tensor, k u ⊗ v ⊗ w k = k u k · k v k · k w k ), O 1 | Y | k ˜ W ⊤ B F B k 2      X i ∈ Y  ˆ W ⊤ A G ⊤ i,A − ˆ W ⊤ A F A π i       ! O  1 | Y | b α − 1 min · ( α 0 + 1)(max i ( P b α ) i ) b α min σ min ( P ) · (1 + ε 1 + ε 2 + ε 3 ) r log n δ  , with probability 1 − 13 δ (Lemma C.2). Since there are 7 terms in the third order tensor T α 0 , w e ha v e the b oun d with probability 1 − 91 δ . F or Λ 1 − Λ 2 , s in ce ˆ W A F A Diag( b α ) 1 / 2 has sp ectral norm almost 1, b y Lemma C.4 the sp ectral norm of the p erturbation is at most    ˆ W A F A Diag( b α ) 1 / 2    3      1 | Y | X i ∈ Y (Diag( b α ) − 1 / 2 π i ) ⊗ 3 − E π ∼ Dir( α ) (Diag( b α ) − 1 / 2 π i ) ⊗ 3      ≤ O  1 b α min √ n · r log n δ  . F or the final term Λ 2 − Λ 3 , the dominating term is ( ˆ W A − W A ) F A Diag( b α ) 1 / 2 k Λ 3 k ≤ ε W A k Λ 3 k ≤ O ( α 0 + 1) p max i ( P b α ) i n 1 / 2 b α 3 / 2 min σ min ( P ) (1 + ε 1 + ε 2 + ε 3 ) r log n δ ! Putting all these toget her, th e thir d term k Λ 2 − Λ 3 k dominate s. W e know with probabilit y at l east 1 − 100 δ , the p erturbation in the tensor is at most O ( α 0 + 1) p max i ( P b α ) i n 1 / 2 b α 3 / 2 min σ min ( P ) (1 + ε 1 + ε 2 + ε 3 ) r log n δ ! .  C.2 Whitening Matrix Perturbations Consider rank - k SVD of | X | − 1 / 2 ( G α 0 X,A ) ⊤ k − svd = ˆ U A ˆ D A ˆ V ⊤ A , and the w hitening matrix is giv en by ˆ W A := ˆ U A ˆ D − 1 A and th us | X | − 1 ˆ W ⊤ A ( G α 0 X,A ) ⊤ k − svd ( G α 0 X,A ) k − svd ˆ W A = I . Now consider the singular v alue decomp osition of | X | − 1 ˆ W ⊤ A E [( G α 0 X,A ) ⊤ | Π] · E [( G α 0 X,A ) | Π] ˆ W A = Φ ˜ D Φ ⊤ . 52 ˆ W A do es n ot wh iten the exact momen ts in general. On the other hand, consider W A := ˆ W A Φ A ˜ D − 1 / 2 A Φ ⊤ A . ( 71) Observe that W A whitens | X | − 1 / 2 E [( G α 0 X,A ) | Π] | X | − 1 W ⊤ A E [( G α 0 X,A ) ⊤ | Π] E [( G α 0 X,A ) | Π] W A = (Φ A ˜ D − 1 / 2 A Φ ⊤ A ) ⊤ Φ A ˜ D A Φ ⊤ A Φ A ˜ D − 1 / 2 A Φ ⊤ A = I No w the ranges of W A and ˆ W A ma y differ and w e con trol the p erturbations b elo w. Also n ote that ˆ R A,B , ˆ R A,C are giv en b y ˆ R AB := | X | − 1 ˆ W ⊤ B ( G α 0 X,B ) ⊤ k − svd ( G α 0 X,A ) k − svd ˆ W A . (72) R AB := | X | − 1 W ⊤ B E [( G α 0 X,B ) ⊤ | Π] · E [ G α 0 X,A | Π] · W A . (73) Recall ǫ G is giv en by (78), and σ min  E [ G α 0 X,A | Π]  is giv en in (C.7) and | A | = | B | = | X | = n . Lemma C.1 (Whitening m atrix p ertur bations) . 
W ith pr ob ability 1 − δ , ǫ W A := k Diag ( b α ) 1 / 2 F ⊤ A ( ˆ W A − W A ) k = O   (1 − ε 1 ) − 1 / 2 ǫ G σ min  E [ G α 0 X,A | Π]    (74) ǫ ˜ W B := k Diag ( b α ) 1 / 2 F ⊤ B ( ˆ W B ˆ R AB − W B R AB ) k = O   (1 − ε 1 ) − 1 / 2 ǫ G σ min  E [ G α 0 X,B | Π]    (75) Thus, with pr ob ability 1 − 6 δ , ǫ W A = ǫ ˜ W B = O ( α 0 + 1) p max i ( P b α ) i n 1 / 2 b α min σ min ( P ) · (1 + ε 1 + ε 2 + ε 3 ) ! , (76) wher e ε 1 , ε 2 and ε 3 ar e give n by (84) and (85) . Remark: Note that when partitions X , A satisfy (69), ε 1 , ε 2 , ε 3 are small. When P is well conditioned and b α min = b α max = 1 /k , w e hav e ǫ W A , ǫ ˜ W B = O ( k / √ n ). Pr o of: Using the fact that W A = ˆ W A Φ A ˜ D − 1 / 2 A Φ ⊤ A or ˆ W A = W A Φ A ˜ D 1 / 2 A Φ ⊤ A w e ha ve that k Diag ( b α ) 1 / 2 F ⊤ A ( ˆ W A − W A ) k ≤ k Diag ( b α ) 1 / 2 F ⊤ A W A ( I − Φ A ˜ D 1 / 2 A Φ ⊤ A ) k = k Diag ( b α ) 1 / 2 F ⊤ A W A ( I − ˜ D 1 / 2 A ) k ≤ k Diag ( b α ) 1 / 2 F ⊤ A W A ( I − ˜ D 1 / 2 A )( I + ˜ D 1 / 2 A ) k ≤ k Diag ( b α ) 1 / 2 F ⊤ A W A k · k I − ˜ D A k using the fact th at ˜ D A is a diagonal matrix. 53 No w note that W A whitens | X | − 1 / 2 E [ G α 0 X,A | Π] = | X | − 1 / 2 F A Diag( α 1 / 2 )Ψ X , wh ere Ψ X is d efi ned in (83) . F urth er it is sh o w n in Lemma C.7 that Ψ X satisfies with probabilit y 1 − δ that ε 1 := k I − | X | − 1 Ψ X Ψ ⊤ X k ≤ O s ( α 0 + 1) b α min | X | · log k δ ! Since ε 1 ≪ 1 when X, A satisfy (69). W e ha v e that | X | − 1 / 2 Ψ X has singular v alues around 1. S ince W A whitens | X | − 1 / 2 E [ G α 0 X,A | Π], we hav e | X | − 1 W ⊤ A F A Diag( α 1 / 2 )Ψ X Ψ ⊤ X Diag( α 1 / 2 ) F ⊤ A W A = I . Th us, with probab ility 1 − δ , k Diag ( b α ) 1 / 2 F ⊤ A W A k = O ((1 − ε 1 ) − 1 / 2 ) . Let E [( G α 0 X,A ) | Π] = ( G α 0 X,A ) k − svd + ∆. W e h a ve k I − ˜ D A k = k I − Φ A ˜ D A Φ ⊤ A k = k I − | X | − 1 ˆ W ⊤ A E [( G α 0 X,A ) ⊤ | Π] · E [( G α 0 X,A ) | Π] ˆ W A k = O  | X | − 1 k ˆ W ⊤ A  ∆ ⊤ ( G α 0 X,A ) k − svd + ∆( G α 0 X,A ) ⊤ k − svd  ˆ W A k  = O  | X | − 1 / 2 k ˆ W ⊤ A ∆ ⊤ ˆ V A + ˆ V ⊤ A ∆ ˆ W A k  , = O  | X | − 1 / 2 k ˆ W A kk ∆ k  = O  | X | − 1 / 2 k W A k ǫ G  , since k ∆ k ≤ ǫ G + σ k +1 ( G α 0 X,A ) ≤ 2 ǫ G , using W eyl’s th eorem f or singular v alue p erturb ation and the fact that ǫ G · k W A k ≪ 1 and k W A k = | X | 1 / 2 /σ min  E [ G α 0 X,A | Π]  . W e no w consider p ertur bation of W B R AB . By d efi nition, we ha v e that E [ G α 0 X,B | Π] · W B R AB = E [ G α 0 X,A | Π] · W A . and k W B R AB k = | X | 1 / 2 σ min ( E [ G α 0 X,B | Π]) − 1 . Along the lin es of pr evious deriv ation for ǫ W A , let | X | − 1 ( ˆ W B ˆ R AB ) ⊤ · E [( G α 0 X,B ) ⊤ | Π] · E [ G α 0 X,B | Π] ˆ W B ˆ R AB = Φ B ˜ D B Φ ⊤ B . Again u sing the fact that | X | − 1 Ψ X Ψ ⊤ X ≈ I , we ha ve k Diag ( b α ) 1 / 2 F ⊤ B W B R AB k ≈ k Diag ( b α ) 1 / 2 F ⊤ A W A k , and the rest of the pr o of follo ws.  54 C.3 Auxiliary Concen t ration B ounds Lemma C .2 (Concen tration of sum of whitened vect ors) . Assuming al l the p artitions satisfy (69) , with pr ob ability 1 − 7 δ ,      X i ∈ Y  ˆ W ⊤ A G ⊤ i,A − ˆ W ⊤ A F A π i       = O ( p | Y | b α max ǫ W A ) = O p ( α 0 + 1)(max i ( P b α ) i ) b α min σ min ( P ) · (1 + ε 2 + ε 3 ) p log n /δ ! ,      X i ∈ Y  ( ˆ W B ˆ R AB ) ⊤ ( G ⊤ i,B − F B π i )       = O p ( α 0 + 1)(max i ( P b α ) i ) b α min σ min ( P ) · (1 + ε 1 + ε 2 + ε 3 ) p log n /δ ! . 
Remark: Note that when P is well conditioned and b α min = b α max = 1 /k , w e h av e the ab o ve b ounds as O ( k ). Thus, when it is n orm alized with 1 / | Y | = 1 /n , we hav e the b oun d as O ( k /n ). Pr o of: Note that ˆ W A is computed usin g p artition X and G i,A is obtained from i ∈ Y . W e ha v e indep end en ce for edges across differen t partitions X and Y . Let Ξ i := ˆ W ⊤ A ( G ⊤ i,A − F A π i ).Applying matrix Bernstein’s in equalit y to eac h of the v ariables, we hav e k Ξ i k ≤ k ˆ W A k · k G ⊤ i,A − F A π i k ≤ k ˆ W A k p k F A k 1 , from L emma C.6. The v ariances are giv en by k X i ∈ Y E [Ξ i Ξ ⊤ i | Π] k ≤ X i ∈ Y ˆ W ⊤ A Diag( F A π i ) ˆ W A , ≤ k ˆ W A k 2 k F Y k 1 = O  | Y | | A | · ( α 0 + 1)(max i ( P b α ) i ) b α 2 min σ 2 min ( P ) · (1 + ε 2 + ε 3 )  , with probabilit y 1 − 2 δ f rom (81) and (82), and ε 2 , ε 3 are giv en b y (85). Similarly , k P i ∈ Y E [Ξ ⊤ i Ξ i | Π] k ≤ k ˆ W A k 2 k F Y k 1 . Thus, fr om matrix Bernstein’s inequalit y , we ha v e with probability 1 − 3 δ k X i ∈ Y Ξ i k = O ( k ˆ W A k p max( k F A k 1 , k F X k 1 )) . = O p ( α 0 + 1)(max i ( P b α ) i ) b α min σ min ( P ) · (1 + ε 2 + ε 3 ) p log n/δ ! On similar lines, w e hav e the result for B and C , and also use the indep end ence assumption on edges in v arious p artitions.  W e no w sho w that not only the sum of whitened v ectors concent rates, but that eac h ind ividual whitened v ector ˆ W ⊤ A G ⊤ i,A concen trates, wh en A is large enough. 55 Lemma C .3 (Concentratio n of a random whitened vecto r) . Conditione d on π i , with pr ob ability at le ast 1 / 4 ,    ˆ W ⊤ A G ⊤ i,A − W ⊤ A F A π i    ≤ O ( ε W A b α − 1 / 2 min ) = ˜ O p ( α 0 + 1)(max i ( P b α ) i ) n 1 / 2 b α 3 / 2 min σ min ( P ) ! . Remark: The abov e result is not a high pr obabilit y ev en t since w e employ Ch eb yshev’s inequalit y to establish it. How eve r, this is not an issu e for u s , sin ce we will employ it to sho w that out of Θ( n ) whitened vecto rs, there exists at least one goo d initializat ion v ector corresp on d ing to eac h eigen-direction, as r equired in T heorem A.1 in App end ix A. See Lemma C .9 for details. Pr o of. W e ha v e    ˆ W ⊤ A G ⊤ i,A − W ⊤ A F A π i    ≤    ( ˆ W A − W A ) ⊤ F A π i    +    ˆ W ⊤ A ( G ⊤ i,A − F A π i )    . The fi rst term is satisfies satisfies with probability 1 − 3 δ k ( ˆ W ⊤ A − W ⊤ A ) F A π i k ≤ ǫ W A b α − 1 / 2 min = O ( α 0 + 1) b α 1 / 2 max p (max i ( P b α ) i ) n 1 / 2 b α 3 / 2 min σ min ( P ) · (1 + ε 1 + ε 2 + ε 3 ) ! No w w e b oun d the second term . Note that G ⊤ i,A is indep endent of ˆ W ⊤ A , s in ce they are related to disjoin t sub set of ed ges. Th e whitened neighborh oo d ve ctor can b e view ed as a sum of v ectors: ˆ W ⊤ A G ⊤ i,A = X j ∈ A G i,j ( ˆ W ⊤ A ) j = X j ∈ A G i,j ( ˆ D A ˆ U ⊤ A ) j = ˆ D A X j ∈ A G i,j ( ˆ U ⊤ A ) j . Conditioned on π i and F A , G i,j are Bernoulli v ariables with probabilit y ( F A π i ) j . T he goal is to compute the v ariance of th e su m, and then use C heb yshev’s inequalit y n oted in Prop osition C.5. Note that the v ariance is given by k E [( G ⊤ i,A − F A π i ) ⊤ ˆ W A ˆ W ⊤ A ( G ⊤ i,A − F A π i )] k ≤ k ˆ W A k 2 X j ∈ A ( F A π i ) j    ( ˆ U ⊤ A ) j    2 . W e no w b ound the v ariance. By W edin’s theorem, we kno w the sp an of columns of ˆ U A is O ( ǫ G /σ min ( G α 0 X , A )) = O ( ǫ W A ) close to the sp an of columns of F A . The span of columns of F A is the same as the sp an of ro ws in Π A . 
In p articular, let P r oj Π b e th e p ro j ectio n m atrix of the span of ro ws in Π A , w e hav e    ˆ U A ˆ U ⊤ A − P r oj Π    ≤ O ( ǫ W A ) . Using the sp ectral norm b ound , w e ha ve the F rob enius norm    ˆ U A ˆ U ⊤ A − P r oj Π    F ≤ O ( ǫ W A √ k ) since they are r ank k matrices. This implies that X j ∈ A     ( ˆ U ⊤ A ) j    −    P r oj j Π     2 = O ( ǫ 2 W A k ) . 56 No w k P r oj j Π k ≤ k π j k σ min (Π A ) = O   s ( α 0 + 1) n b α min   , from L emma C.7 No w we can b ound th e v ariance of th e vect ors P j ∈ A G i,j ( ˆ U ⊤ A ) j , since the v ariance of G i,j is b ounded by ( F A π i ) j (its probability), and the v ariance of th e vect ors is at most X j ∈ A ( F A π i ) j    ( ˆ U ⊤ A ) j    2 ≤ 2 X j ∈ A ( F A π i ) j    P r oj j Π    2 + 2 X j ∈ A ( F A π i ) j     ( ˆ U ⊤ A ) j    −    P r oj j Π     2 ≤ 2 X j ∈ A ( F A π i ) j max j ∈ A     P r oj j Π    2  + max i,j P i,j X j ∈ A     ( ˆ U ⊤ A ) j    −    P r oj j Π     2 ≤ O  | F A | 1 ( α 0 + 1) n b α min  No w Ch eb yshev’s inequalit y imp lies that w ith pr obabilit y at least 1 / 4 (or any other constant) ,       X j ∈ A ( G i,j − F A π i )( ˆ U ⊤ A ) j       2 ≤ O  | F A | 1 ( α 0 + 1) n b α min  . And thus, we hav e ˆ W ⊤ A ( G i,A − F A π i ) ≤ s | F A | 1 ( α 0 + 1) n b α min ·    ˆ W ⊤ A    ≤ O  ǫ W A b α − 1 / 2 min  . Com bining the t wo terms, we ha v e the resu lt. Finally , we establish the follo wing p erturbation b ou n d b et ween empir ical and exp ected tensor under the Diric hlet distrib u tion, which is u sed in the pro of of T heorem C.1. Lemma C .4 (Concent ration of th ir d moment tensor u nder Diric hlet distribu tion) . With pr ob ability 1 − δ , for π i iid ∼ Dir( α ) ,      1 | Y | X i ∈ Y (Diag( b α ) − 1 / 2 π i ) ⊗ 3 − E π ∼ Dir( α ) (Diag( b α ) − 1 / 2 π ) ⊗ 3      ≤ O  · 1 b α min √ n r log n δ  = ˜ O  1 b α min √ n  Pr o of. The sp ectral n orm of this tensor ca nn ot b e larger than th e sp ectral n orm of a k × k 2 matrix that w e obtain b e “collapsing” the last t wo dimensions (by defin itions of norms). Let φ i := Diag ( ˆ α ) − 1 / 2 π i and the “collapsed” tensor is the m atrix φ i ( φ i ⊗ φ i ) ⊤ (here w e view φ i ⊗ φ i as a vecto r in R k 2 ). W e apply Matrix Bernstein on the matrices Z i = φ i ( φ i ⊗ φ i ) ⊤ . No w      X i ∈ Y E [ Z i Z ⊤ i ]      ≤ | Y | max k φ k 4    E [ φφ ⊤ ]    ≤ | Y | b α − 2 min 57 since   E [ φφ ⊤ ]   ≤ 2. F or the other v ariance term   P i ∈ Y E [ Z ⊤ i Z i ]   , w e hav e      X i ∈ Y E [ Z ⊤ i Z i ]      ≤ | Y | b α min    E [( φ ⊗ φ )( φ ⊗ φ ) ⊤ ]    . It remains to b ound the norm of E [( φ ⊗ φ )( φ ⊗ φ ) ⊤ ]. W e hav e k E [( φ ⊗ φ )( φ ⊗ φ ) ⊤ ] k = su p   k E [ M 2 ] k , s.t. M = X i,j N i,j φ i φ ⊤ j , k N k F = 1   . b y definition. W e no w group the terms of E [ M 2 ] and b ound them separately . M 2 = X i N 2 i,i φ i φ ⊤ i k φ i k 2 + X i 6 = j N 2 i,j φ i φ ⊤ j h φ i , φ j i + X i 6 = j 6 = a N i,i N j,a φ i φ ⊤ a h φ i , φ j i + X i 6 = j 6 = a 6 = b N i,j N a,b φ i φ ⊤ b h φ j , φ a i (77) W e b oun d the terms ind ividually now. k φ ( i ) k 4 terms: By prop erties of Diric hlet distrib u tion we kno w E [ k φ ( i ) k 4 ] = Θ( b α − 1 i ) ≤ O ( b α − 1 min ) . Th us, for the first term in (77 ), we ha v e sup N : k N k F =1 k X i E [ N 2 i,i φ i φ ⊤ i k φ i ] k 2 k = O ( b α − 1 min ) . 
k φ ( i ) k 3 · k φ ( j ) k terms: W e ha ve k E [ X i,j N i,i N i,j φ ( i ) 3 φ ( j )] k ≤ E [ k φ i k 2 · k φ j k ] ≤ O ( s X i,j ( N 2 i,i ˆ α ( j )) X i,j N 2 i,j ˆ α ( i ) − 1 ) ≤ O ( b α − 1 / 2 min ) . k φ ( i ) k 2 · k φ ( j ) k 2 terms: the total num b er of such terms is O ( k 2 ) and we ha ve E [ k φ ( i ) k 2 · k φ ( j ) k 2 ] = Θ(1) , and th us the F rob enius norm of these set of terms is smaller than O ( k ) k φ ( i ) k 2 · k φ ( j ) k · k φ ( a ) k terms: there are O ( k 3 ) suc h terms, and w e ha ve k E [ φ ( i ) k 2 · k φ ( j ) k · k φ ( a )] k = Θ( ˆ α ( i 2 ) 1 / 2 ˆ α ( i 3 ) 1 / 2 ) . The F r ob enius norm of this part of matrix is b oun ded b y O   s X i,j,a ∈ [ k ] ˆ α ( j ) ˆ α ( a )   ≤ O ( √ k ) s X j X a b α j b α a ≤ O ( √ k ) . 58 the rest: the sum is E [ X i 6 = j 6 = a 6 = b N i,j N a,b ˆ α ( i ) 1 / 2 ˆ α ( j ) 1 / 2 ˆ α ( a ) 1 / 2 ˆ α ( b ) 1 / 2 ] . It is easy to br eak the b oun ds in to the p ro duct of t w o sums ( P i,j and P a,b ) and then b ound eac h one b y Cauc hy-Sc hw artz, the result is 1. Hence the v ariance term in Matrix Bernstein’s inequalit y can b e b oun ded by σ 2 ≤ O ( n b α − 2 min ), eac h term has norm at most b α − 3 / 2 min . When b α − 2 min < n we kn ow the v ariance term dominates and the sp ectral norm of the difference is at most O ( b α − 1 min n − 1 / 2 p log n /δ ) w ith pr obabilit y 1 − δ . C.4 Basic Results on Sp ectral Concen tration of A djacency Matrix Let n := max( | A | , | X | ). Lemma C.5 (Concent ration of G α 0 X,A ) . When π i ∼ Dir( α ) , for i ∈ V , with pr ob ability 1 − 4 δ , ǫ G := k G α 0 X,A − E [( G α 0 X,A ) ⊤ | Π] k = O  r ( α 0 + 1) n · (max i ( P b α ) i )(1 + ε 2 ) log n δ  (78) Pr o of: F r om definition of G α 0 X,A , we hav e ǫ G ≤ √ α 0 + 1 k G X,A − E [ G X,A | Π] k + ( √ α 0 + 1 − 1) p | X |k µ X,A − E [ µ X,A | Π] k . W e ha ve concen tration f or µ X,A and adjacency s u bmatrix G X,A from L emma C.6.  W e now pro vide concen tration b ound s for adjacency sub-matrix G X,A from partition X to A and the corresp onding mean vec tor. Recall that E [ µ X → A | F A , π X ] = F A π X and E [ µ X → A | F A ] = F A b α . Lemma C.6 (Concentrat ion of adjacency submatrices) . When π i iid ∼ Dir( α ) for i ∈ V , with pr ob a- bility 1 − 2 δ , k G X,A − E [ G X,A | Π] k = O  r n · (max(max i ( P b α ) i , m ax i ( P ⊤ b α ) i ))(1 + ε 2 ) log n δ  . (79) k µ A − E [ µ A | Π] k = O  1 | X | r n · (max(max i ( P b α ) i , m ax i ( P ⊤ b α ) i ))(1 + ε 2 ) log n δ  , (80 ) wher e ε 2 is given by (85) . Pr o of: Recall E [ G X,A | Π] = F A Π X and G A,X = Ber( F A Π X ) w here Ber( · ) denotes the Bernoulli random matrix with ind ep endent en tries. Let Z i := ( G ⊤ i,A − F A π i ) e ⊤ i . W e ha ve G ⊤ X,A − F A Π X = P i ∈ X Z i . W e apply m atrix Berns tein’s inequalit y . W e compute the v ariances P i E [ Z i Z ⊤ i | Π] and P i E [ Z ⊤ i Z i | Π]. W e ha ve that P i E [ Z i Z ⊤ i | Π] only the diagonal terms are non-zero du e to indep en dence of Bernoulli v ariables, and E [ Z i Z ⊤ i | Π] ≤ Diag ( F A π i ) (81) 59 en try-wise. Th us, k X i ∈ X E [ Z i Z ⊤ i | Π] k ≤ max a ∈ A X i ∈ X,b ∈ [ k ] F A ( a, b ) π i ( b ) = max a ∈ A X i ∈ X,b ∈ [ k ] F A ( a, b )Π X ( b, i ) ≤ max c ∈ [ k ] X i ∈ X,b ∈ [ k ] P ( b, c )Π X ( b, i ) = k P ⊤ Π X k ∞ . (82) Similarly P i ∈ X E [ Z ⊤ i Z i ] = P i ∈ X Diag( E [ k G ⊤ i,A − F A π i k 2 ]) ≤ k P ⊤ Π X k ∞ . On lines of Lemma C.1 1, w e ha ve k P ⊤ Π X k ∞ = O ( | X | · (max i ( P ⊤ b α ) i )) w hen | X | satisfies (69 ). 
W e no w b ound k Z i k . First note th at the en tries in G i,A are in d ep endent and w e can use the v ector Bernstein’s inequalit y to b ound k G i,A − F A π i k . W e hav e max j ∈ A | G i,j − ( F A π i ) j | ≤ 2 and P j E [ G i,j − ( F A π i ) j ] 2 ≤ P j ( F A π i ) j ≤ k F A k 1 . Th us w ith pr obabilit y 1 − δ , we hav e k G i,A − F A π i k ≤ (1 + p 8 log (1 /δ )) p k F A k 1 + 8 / 3 log (1 /δ ) . Th us, we hav e th e b ound th at k P i Z i k = O (max( p k F A k 1 , p k P ⊤ Π X k ∞ )). The concen tration of the mean term follo ws from this r esult.  W e no w provide sp ectral b ound s on E [( G α 0 X,A ) ⊤ | Π]. Defin e ψ i := Diag ( ˆ α ) − 1 / 2 ( √ α 0 + 1 π i − ( √ α 0 + 1 − 1) µ ) . (83) Let Ψ X b e th e matrix with columns ψ i , for i ∈ X . W e ha ve E [( G α 0 X,A ) ⊤ | Π] = F A Diag( ˆ α ) 1 / 2 Ψ X , from d efinition of E [( G α 0 X,A ) ⊤ | Π]. Lemma C.7 (Sp ectral b ound s) . With pr ob ability 1 − δ , ε 1 := k I − | X | − 1 Ψ X Ψ ⊤ X k ≤ O s ( α 0 + 1) b α min | X | · log k δ ! (84) With pr ob ability 1 − 2 δ , k E [( G α 0 X,A ) ⊤ | Π] k = O  k P k b α max p | X || A | (1 + ε 1 + ε 2 )  σ min  E [( G α 0 X,A ) ⊤ | Π]  = Ω   b α min s | A || X | α 0 + 1 (1 − ε 1 − ε 3 ) · σ min ( P ) ·   , wher e ε 2 := O  1 | A | b α 2 max log k δ  1 / 4 ! , ε 3 := O  ( α 0 + 1) 2 | A | b α 2 min log k δ  1 / 4 ! . (85) 60 Remark: When p artitions X , A satisfy (69 ), ε 1 , ε 2 , ε 3 are small. Pr o of: Note that ψ i is a rand om v ector with norm b ounded by O ( p ( α 0 + 1) / b α min ) fr om Lemma C.11 and E [ ψ i ψ ⊤ i ] = I . W e no w prov e (84). using Matrix Bernstein In equalit y . Eac h matrix ψ i ψ ⊤ i / | X | has sp ectral norm at most O (( α 0 + 1) / b α min | X | ). The v ariance σ 2 is b ound ed b y      1 | X | 2 E [ X i ∈ X k ψ i k 2 ψ i ψ ⊤ i ]      ≤      1 | X | 2 max k ψ i k 2 E [ X i ∈ X ψ i ψ ⊤ i ]      ≤ O (( α 0 + 1) / b α min | X | ) . Since O (( α 0 + 1) /α min | X | ) < 1, the v ariance dominates in Matrix Bernstein’s inequ alit y . Let B := | X | − 1 Ψ X Ψ ⊤ X . W e ha v e with prob ab ility 1 − δ , σ min ( E [( G α 0 X,A ) ⊤ | Π]) = q | X | σ min ( F A Diag( ˆ α ) 1 / 2 B Diag ( ˆ α ) 1 / 2 F ⊤ A ) , = Ω( p b α min | X | (1 − ǫ 1 ) · σ min ( F A )) . F rom Lemma C .11, w ith pr obabilit y 1 − δ , σ min ( F A ) ≥   s | A | b α min α 0 + 1 − O (( | A | log k /δ ) 1 / 4 )   · σ min ( P ) . Similarly other r esults follo w.  C.5 Prop erties of Diric hlet Distr ibution In this section, w e list v arious prop erties of Diric hlet distribution. C.5.1 Sparsity Inducing Prop erty W e first n ote that the Diric h let distribution Dir( α ) is sparse d ep ending on v alues of α i , whic h is sho wn in T elgarsky [2012]. Lemma C .8. L et r e als τ ∈ (0 , 1] , α i > 0 , α 0 := P i α i and inte gers 1 ≤ s ≤ k b e given. L et ( X i , . . . , X k ) ∼ Dir ( α ) . Then Pr  |{ i : X i ≥ τ }| ≤ s  ≥ 1 − τ − α 0 e − ( s +1) / 3 − e − 4( s +1) / 9 , when s + 1 < 3 k . W e no w show that we obtain go o d initialization vecto rs u nder Diric hlet distribution. Arrange the b α j ’s in ascending ord er, i.e. b α 1 = b α min ≤ b α 2 . . . ≤ b α k = b α max . Recall that columns v ectors ˆ W ⊤ A G ⊤ i,A , for i / ∈ A , are u sed as initialization vec tors to the tensor p o wer metho d. W e sa y that u i := ˆ W ⊤ A G ⊤ i,A k ˆ W ⊤ A G ⊤ i,A k is a ( γ , R 0 )-goo d initializat ion vec tor corresp onding to j ∈ [ k ] if |h u i , Φ j i| ≥ R 0 , |h u i , Φ j i| − max m 1 . (89) When α 0 < 1 , the b ound c an b e impr ove d for r 0 ∈ (0 . 
5 , ( α 0 + 1) − 1 ) and 1 − γ ≥ 1 − r 0 r 0 as n > (1 + α 0 )(1 − r 0 b α min ) b α min ( α min + 1 − r 0 ( α 0 + 1)) log( k /δ ) . (90) Remark when α 0 ≥ 1 , α 0 = Θ(1) : When r 0 is c hosen as r 0 = α − 1 / 2 max ( √ α 0 + c 1 √ k ) − 1 , the term e r 0 b α 1 / 2 max ( α 0 + c 1 √ k α 0 ) = e , and we require n = ˜ Ω  α − 1 min k 0 . 43 log( k /δ )  , r 0 = α − 1 / 2 max ( √ α 0 + c 1 √ k ) − 1 , (91) b y substituting c 2 /c 1 = 0 . 43. Moreo ver, (89) is satisfied for the ab o ve c hoice of r 0 when γ = Θ(1). In this case we also need ∆ < r 0 / 2, whic h implies ζ = O  √ n ρk b α max  (92) Remark when α 0 < 1 : In this regime, (90 ) im p lies th at w e requir e n = Ω( b α − 1 min ). Also, r 0 is a constan t, we just need ζ = O ( √ n/ρ ). Pr o of: Define ˜ u i := W ⊤ A F A π i / k W ⊤ A F A π i k , when wh itenin g m atrix W A and F A corresp onding to exact statistics are input. W e first observ e that if ˜ u i is ( γ , r 0 ) go od , then u i is ( γ − 2∆ r 0 − ∆ , r 0 − ∆) go o d. When ˜ u i is ( γ , r 0 ) go od , note that W ⊤ A F A π i ≥ b α − 1 / 2 max r 0 since σ min ( W ⊤ A F A ) = b α − 1 / 2 max and k π i k ≥ r 0 . No w with p robabilit y 1 / 4, conditioned on π i , w e hav e the ev ent B ( i ), B ( i ) := { k u i − ˜ u i k ≤ ∆ } , where ∆ is giv en b y ∆ = ˜ O b α 0 . 5 max p ( α 0 + 1)(max i ( P b α ) i ) r 0 n 1 / 2 b α 1 . 5 min σ min ( P ) ! from Lemma C.3. Thus, w e ha ve P [ B ( i ) | π i ] ≥ 1 / 4, i.e. B ( i ) o ccurs with probability 1 / 4 for any realizatio n of π i . 62 If we p erturb a ( γ , r 0 ) go o d v ector by ∆ (while main taining u nit norm ), then it is still ( γ − 2∆ r 0 − ∆ , r 0 − ∆) go o d. W e now sh o w that th e set { ˜ u i } con tains go o d initializ ation v ectors w hen n is large enough. Consider Y i ∼ Γ( α i , 1), wh ere Γ( · , · ) denotes the Gamma d istr ibution and we ha ve Y / P i Y i ∼ Dir( α ). W e first compute the prob ab ility that ˜ u i := W ⊤ A F A π i / k W ⊤ A F A π i k is a ( r 0 , γ )-go o d v ector with r esp ect to j = 1 (recall that b α 1 = b α min ). The desired eve nt is A 1 := ( b α − 1 / 2 1 Y 1 ≥ r 0 s X j b α − 1 j Y 2 j ) ∩ ( b α − 1 / 2 1 Y 1 ≥ 1 1 − γ max j > 1 b α − 1 / 2 j Y j ) (93) W e ha ve P [ A 1 ] ≥ P   ( b α − 1 / 2 min Y 1 ≥ r 0 s X j b α − 1 j Y 2 j ) ∩ ( Y 1 ≥ 1 1 − γ max j > 1 Y j )   ≥ P   ( b α − 1 / 2 min Y 1 > r 0 t ) \ ( X j b α − 1 j Y 2 j ≤ t 2 ) \ j > 1 ( Y 1 ≤ (1 − γ ) r 0 t b α 1 / 2 min )   , for some t ≥ P h b α − 1 / 2 min Y 1 > r 0 t i P   X j b α − 1 j Y 2 j ≤ t 2    b α − 1 / 2 j Y j ≤ (1 − γ ) r 0 t b α 1 / 2 min   P  max j > 1 Y j ≤ (1 − γ ) r 0 t b α 1 / 2 min  ≥ P h b α − 1 / 2 min Y 1 > r 0 t i P   X j b α − 1 j Y 2 j ≤ t 2   P  max j > 1 Y j ≤ (1 − γ ) r 0 t b α 1 / 2 min  When α j ≤ 1, we ha v e P [ ∪ j Y j ≥ log 2 k ] ≤ 0 . 5 , since P ( Y j ≥ t ) ≤ t α j − 1 e − t ≤ e − t when t > 1 and α j ≤ 1. App lying v ector Bernstein’s inequalit y , w e ha ve with prob ab ility 0 . 5 − e − m that k Diag ( b α − 1 / 2 j )( Y − E ( Y )) k 2 ≤ (1 + √ 8 m ) p k α 0 + 4 / 3 m b α − 1 / 2 min log 2 k , since E [ P j b α − 1 j V ar( Y j )] = k α 0 since b α j = α j /α 0 and V ar( Y j ) = α j . Th us, w e ha v e k Diag ( b α − 1 / 2 j ) Y k 2 ≤ α 0 + (1 + √ 8 m ) p k α 0 + 4 / 3 m b α − 1 / 2 min log 2 k , since k Diag ( b α − 1 / 2 j ) E ( Y ) k 2 = q P j b α − 1 j α 2 j = α 0 . 
When α_j ≤ 1, we have P[ ∪_j {Y_j ≥ log 2k} ] ≤ 0.5, since P(Y_j ≥ t) ≤ t^{α_j − 1} e^{−t} ≤ e^{−t} when t > 1 and α_j ≤ 1. Applying the vector Bernstein inequality, we have with probability 0.5 − e^{−m} that

‖Diag(α̂_j^{−1/2}) (Y − E[Y])‖_2 ≤ (1 + √(8m)) √(k α_0) + (4/3) m α̂_min^{−1/2} log 2k,

since Σ_j α̂_j^{−1} Var(Y_j) = k α_0 (as α̂_j = α_j/α_0 and Var(Y_j) = α_j). Thus,

‖Diag(α̂_j^{−1/2}) Y‖_2 ≤ α_0 + (1 + √(8m)) √(k α_0) + (4/3) m α̂_min^{−1/2} log 2k,

since ‖Diag(α̂_j^{−1/2}) E[Y]‖_2 = √( Σ_j α̂_j^{−1} α_j² ) = α_0. Choosing m = log 4, we have with probability 1/4 that

‖Diag(α̂_j^{−1/2}) Y‖_2 ≤ t := α_0 + (1 + √(8 log 4)) √(k α_0) + (4/3)(log 4) α̂_min^{−1/2} log 2k   (94)
= α_0 + c_1 √(k α_0) + c_2 α̂_min^{−1/2} log 2k.   (95)

We now have

P[ α̂_min^{−1/2} Y_1 > r_0 t ] ≥ (α_min / 4C) ( r_0 t α̂_min^{1/2} )^{α_min − 1} e^{ −r_0 t α̂_min^{1/2} },

from Proposition C.1. Similarly,

P[ max_{j≠1} Y_j ≤ (1 − γ) r_0 t α̂_min^{1/2} ] ≥ 1 − Σ_{j≠1} ( (1 − γ) r_0 t α̂_min^{1/2} )^{α_j − 1} e^{ −(1 − γ) r_0 α̂_min^{1/2} t } ≥ 1 − k e^{ −(1 − γ) r_0 α̂_min^{1/2} t },

assuming that (1 − γ) r_0 α̂_min^{1/2} t > 1. Choosing t as in (94), the probability of the event in (93) is greater than

(α_min / 16C) ( 1 − e^{ −(1 − γ) r_0 α̂_min^{1/2} (α_0 + c_1 √(k α_0)) } / ( 2 (2k)^{ (1 − γ) r_0 c_2 − 1 } ) ) · ( e^{ −r_0 α̂_min^{1/2} (α_0 + c_1 √(k α_0)) } / (2k)^{ r_0 c_2 } ) · ( r_0 α̂_min^{1/2} ( α_0 + c_1 √(k α_0) + c_2 α̂_min^{−1/2} log 2k ) )^{α_min − 1}.

Similarly, the (marginal) probability of the event A_2 can be bounded from below by replacing α_min with α_2, and so on. Thus we have

P[A_m] = Ω̃( α_min e^{ −r_0 α̂_max^{1/2} (α_0 + c_1 √(k α_0)) } (2k)^{ −r_0 c_2 } ),   for all m ∈ [k].

Thus, each of the events A_1(i) ∩ B(i), A_2(i) ∩ B(i), …, A_k(i) ∩ B(i) occurs at least once over i ∈ [n] i.i.d. tries with probability

1 − P[ ∪_{j∈[k]} ∩_{i∈[n]} (A_j(i) ∩ B(i))^c ] ≥ 1 − Σ_{j∈[k]} P[ ∩_{i∈[n]} (A_j(i) ∩ B(i))^c ] ≥ 1 − Σ_{j∈[k]} exp[ −n P(A_j ∩ B) ]
≥ 1 − k exp[ −n Ω̃( α_min e^{ −r_0 α̂_max^{1/2} (α_0 + c_1 √(k α_0)) } (2k)^{ −r_0 c_2 } ) ],

where A_j(i) denotes the event that A_j occurs in the i-th trial, and we use that P[B | A_j] ≥ 0.25, since B(i) occurs with probability at least 0.25 for any realization of π_i and the events A_j depend only on π_i. We also use 1 − x ≤ e^{−x} for x ∈ [0, 1]. Thus, for the event to occur with probability 1 − δ, we require

n = Ω̃( α_min^{−1} e^{ r_0 α̂_max^{1/2} (α_0 + c_1 √(k α_0)) } (2k)^{ r_0 c_2 } log(1/δ) ).

Improved bound when α_0 < 1: We can improve the above bound by working directly with the Dirichlet distribution. Let π ∼ Dir(α). The desired event corresponding to j = 1 is given by

A_1 = ( α̂_1^{−1/2} π_1 / ‖Diag(α̂_i^{−1/2}) π‖ ≥ r_0 ) ∩_{i>1} ( π_1 ≥ π_i / (1 − γ) ).

Thus, we have

P[A_1] ≥ P[ (π_1 ≥ r_0) ∩_{i>1} ( π_i ≤ (1 − γ) r_0 ) ] ≥ P[π_1 ≥ r_0] · P[ ∩_{i>1} { π_i ≤ (1 − γ) r_0 } | π_1 ≥ r_0 ],

since P[ ∩_{i>1} {π_i ≤ (1 − γ) r_0} | π_1 ≥ r_0 ] ≥ P[ ∩_{i>1} {π_i ≤ (1 − γ) r_0} ]. By the properties of the Dirichlet distribution, we know E[π_i] = α̂_i and E[π_i²] = α̂_i (α_i + 1)/(α_0 + 1). Let p := Pr[π_1 ≥ r_0]. We have

E[π_1²] = p E[π_1² | π_1 ≥ r_0] + (1 − p) E[π_1² | π_1 < r_0] ≤ p + (1 − p) r_0 E[π_1 | π_1 < r_0] ≤ p + (1 − p) r_0 E[π_1].

Thus,

p ≥ α̂_min ( α_min + 1 − r_0 (α_0 + 1) ) / ( (α_0 + 1)(1 − r_0 α̂_min) ),

which is useful when r_0 (α_0 + 1) < 1. Also, when π_1 ≥ r_0, we have π_i ≤ 1 − r_0 for i > 1, since π_i ≥ 0 and Σ_i π_i = 1. Thus, choosing 1 − γ = (1 − r_0)/r_0, the remaining conditions for A_1 are satisfied. Note that this choice gives γ ∈ (0, 1) when r_0 > 0.5, and r_0 > 0.5 is feasible (given r_0 < (α_0 + 1)^{−1}) when α_0 < 1. □
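The second-moment lower bound on p = Pr[π_1 ≥ r_0] can be sanity-checked by simulation. A minimal sketch, assuming illustrative parameter values with α_0 < 1 and r_0 ∈ (0.5, (α_0 + 1)^{−1}):

```python
import numpy as np

rng = np.random.default_rng(2)

alpha = np.array([0.1, 0.2, 0.3])    # alpha_0 = 0.6 < 1 (illustrative)
a0 = alpha.sum()
a_hat = alpha / a0
r0 = 0.55                            # must lie in (0.5, 1/(a0 + 1)) = (0.5, 0.625)

pi = rng.dirichlet(alpha, size=200_000)
p_emp = (pi[:, 0] >= r0).mean()      # empirical Pr[pi_1 >= r0]

# Lower bound from the second-moment argument above.
p_lb = a_hat[0] * (alpha[0] + 1 - r0 * (a0 + 1)) / ((a0 + 1) * (1 - r0 * a_hat[0]))
print(f"empirical p = {p_emp:.4f}, lower bound = {p_lb:.4f}")
```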
We now prove a result showing that the entries of π_i, which are marginals of the Dirichlet distribution, are likely to be small in the sparse regime of the Dirichlet parameters. Recall that the marginal distribution of π_i is the beta distribution B(α_i, α_0 − α_i), where Z ∼ B(a, b) has density P[Z = z] ∝ z^{a−1} (1 − z)^{b−1}.

Lemma C.10 (Dirichlet marginals in the sparse regime). For Z ∼ B(a, b), the following results hold.

Case b ≤ 1, C ∈ [0, 1/2]:

Pr[Z ≥ C] ≤ 8 log(1/C) · a/(a + b),   (96)
E[Z · δ(Z ≤ C)] ≤ C · E[Z] = C · a/(a + b).   (97)

Case b ≥ 1, C ≤ (b + 1)^{−1}:

Pr[Z ≥ C] ≤ a log(1/C),   (98)
E[Z · δ(Z ≤ C)] ≤ 6aC.   (99)

Remark: The guarantee for b ≥ 1 is worse, which agrees with the intuition that Dirichlet vectors are more spread out (less sparse) when b = α_0 − α_i is large.

Proof. We have

E[Z · δ(Z ≤ C)] = (1/B(a, b)) ∫_0^C x^a (1 − x)^{b−1} dx ≤ ((1 − C)^{b−1}/B(a, b)) ∫_0^C x^a dx = (1 − C)^{b−1} C^{a+1} / ((a + 1) B(a, b)).

For E[Z · δ(Z ≥ C)], we have

E[Z · δ(Z ≥ C)] = (1/B(a, b)) ∫_C^1 x^a (1 − x)^{b−1} dx ≥ (C^a/B(a, b)) ∫_C^1 (1 − x)^{b−1} dx = (1 − C)^b C^a / (b B(a, b)).

The ratio between these two is at least

E[Z · δ(Z ≥ C)] / E[Z · δ(Z ≤ C)] ≥ (1 − C)(a + 1) / (bC) ≥ 1/C,

where the last inequality holds when a, b < 1 and C < 1/2. The sum of the two terms is exactly E[Z], so when C < 1/2 we have E[Z · δ(Z ≤ C)] ≤ C · E[Z].

Next, we bound the probability Pr[Z ≥ C]. Note that Pr[Z ≥ 1/2] ≤ 2 E[Z] = 2a/(a + b) by Markov's inequality. Now we show that Pr[Z ∈ [C, 1/2]] is not much larger than Pr[Z ≥ 1/2] by bounding the integrals

A = ∫_{1/2}^1 x^{a−1} (1 − x)^{b−1} dx ≥ ∫_{1/2}^1 (1 − x)^{b−1} dx = (1/2)^b / b,
B = ∫_C^{1/2} x^{a−1} (1 − x)^{b−1} dx ≤ (1/2)^{b−1} ∫_C^{1/2} x^{a−1} dx ≤ (1/2)^{b−1} (0.5^a − C^a)/a ≤ (1/2)^{b−1} (1 − (1 − a log(1/C)))/a = (1/2)^{b−1} log(1/C).

The last inequality uses the fact that e^x ≥ 1 + x for all x. Now

Pr[Z ≥ C] = (1 + B/A) Pr[Z ≥ 1/2] ≤ (1 + 2b log(1/C)) · 2a/(a + b) ≤ 8 log(1/C) · a/(a + b),

and we have the result.

Case 2: When b ≥ 1, we have an alternative bound. We use the fact that if X ∼ Γ(a, 1) and Y ∼ Γ(b, 1), then Z has the same distribution as X/(X + Y). Since Y ∼ Γ(b, 1), its PDF is (1/Γ(b)) x^{b−1} e^{−x}, which is proportional to the PDF of Γ(1, 1) (namely e^{−x}) multiplied by the increasing function x^{b−1}. Therefore, Pr[Y ≥ t] ≥ Pr_{Y′∼Γ(1,1)}[Y′ ≥ t] = e^{−t}.

We now use this bound to compute the probability that Z ≤ 1/R, for any R ≥ 1:

Pr[ X/(X + Y) ≤ 1/R ] = ∫_0^∞ Pr[X = x] Pr[Y ≥ (R − 1)x] dx ≥ ∫_0^∞ (1/Γ(a)) x^{a−1} e^{−Rx} dx = R^{−a} ∫_0^∞ (1/Γ(a)) y^{a−1} e^{−y} dy = R^{−a}.

In particular, Pr[Z ≤ C] ≥ C^a, which means Pr[Z ≥ C] ≤ 1 − C^a ≤ a log(1/C). For E[Z · δ(Z < C)], the proof is similar to before:

P = E[Z · δ(Z < C)] = (1/B(a, b)) ∫_0^C x^a (1 − x)^b dx ≤ C^{a+1} / (B(a, b)(a + 1)),
Q = E[Z · δ(Z ≥ C)] = (1/B(a, b)) ∫_C^1 x^a (1 − x)^b dx ≥ C^a (1 − C)^{b+1} / (B(a, b)(b + 1)).

Now E[Z · δ(Z ≤ C)] ≤ (P/Q) E[Z] ≤ 6aC when C < 1/(b + 1). □
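A quick Monte Carlo check of the tail bound (96) for case b ≤ 1; the parameter values below are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

# Marginal of a sparse Dirichlet: Z ~ Beta(a, b) with a = alpha_i, b = alpha_0 - alpha_i.
a, b = 0.05, 0.95     # both <= 1, i.e. case 1 of Lemma C.10 (illustrative)
C = 0.4               # threshold in [0, 1/2]

Z = rng.beta(a, b, size=500_000)
tail_emp = (Z >= C).mean()
tail_bound = 8 * np.log(1 / C) * a / (a + b)    # right-hand side of (96)

print(f"empirical Pr[Z >= C] = {tail_emp:.4f}")
print(f"bound 8 log(1/C) a/(a+b) = {tail_bound:.4f}")
```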
C.5.2 Norm Bounds

Lemma C.11 (Norm bounds under the Dirichlet distribution). For π_i ∼ Dir(α) drawn i.i.d. for i ∈ A, with probability 1 − δ, we have

σ_min(Π_A) ≥ √( |A| α̂_min / (α_0 + 1) ) − O( (|A| log(k/δ))^{1/4} ),
‖Π_A‖ ≤ √( |A| α̂_max ) + O( (|A| log(k/δ))^{1/4} ),
κ(Π_A) ≤ √( (α_0 + 1) α̂_max / α̂_min ) + O( (|A| log(k/δ))^{1/4} ).

This implies that

‖F_A‖ ≤ ‖P‖ √( |A| α̂_max ),   κ(F_A) ≤ O( κ(P) √( (α_0 + 1) α̂_max / α̂_min ) ).

Moreover, with probability 1 − δ,

‖F_A‖_1 ≤ |A| · max_i (P α̂)_i + O( ‖P‖ √( |A| log(|A|/δ) ) ).   (100)

Remark: When |A| = Ω( log(k/δ) ( (α_0 + 1)/α̂_min )² ), we have σ_min(Π_A) = Ω( √( |A| α̂_min / (α_0 + 1) ) ) with probability 1 − δ, for any fixed δ ∈ (0, 1).

Proof: Consider Π_A Π_A^⊤ = Σ_{i∈A} π_i π_i^⊤. We have

(1/|A|) E[Π_A Π_A^⊤] = E_{π∼Dir(α)}[π π^⊤] = (α_0/(α_0 + 1)) α̂ α̂^⊤ + (1/(α_0 + 1)) Diag(α̂),

from Proposition C.2. The first term is positive semi-definite, so the eigenvalues of the sum are at least the eigenvalues of the second component; the smallest eigenvalue of the second component thus gives a lower bound on σ_min(E[Π_A Π_A^⊤]). The spectral norm of the first component is bounded by (α_0/(α_0 + 1)) ‖α̂‖ ≤ (α_0/(α_0 + 1)) α̂_max, and the spectral norm of the second component is (1/(α_0 + 1)) α̂_max. Thus ‖E[Π_A Π_A^⊤]‖ ≤ |A| · α̂_max.

Now apply the matrix Bernstein inequality to (1/|A|) Σ_i ( π_i π_i^⊤ − E[π π^⊤] ). The variance is O(1/|A|), so with probability 1 − δ,

‖ (1/|A|) ( Π_A Π_A^⊤ − E[Π_A Π_A^⊤] ) ‖ = O( √( log(k/δ) / |A| ) ).

For the result on F_A, we use the property that for any two matrices A, B, ‖AB‖ ≤ ‖A‖ ‖B‖ and κ(AB) ≤ κ(A) κ(B). To show the bound on ‖F_A‖_1, note that each column of F_A satisfies E[(F_A)^i] = ⟨α̂, (P)^i⟩ 1^⊤, and thus ‖E[F_A]‖_1 ≤ |A| max_i (P α̂)_i. Using Bernstein's inequality, for each column of F_A we have, with probability 1 − δ,

| ‖(F_A)^i‖_1 − |A| ⟨α̂, (P)^i⟩ | = O( ‖P‖ √( |A| log(|A|/δ) ) ),

since |⟨α̂, (P)^i⟩| ≤ ‖P‖, and Σ_{i∈A} ‖E[(P)^j π_i π_i^⊤ ((P)^j)^⊤]‖ and Σ_{i∈A} ‖E[π_i^⊤ ((P)^j)^⊤ (P)^j π_i]‖ are bounded by |A| · ‖P‖². □

C.5.3 Properties of Gamma and Dirichlet Distributions

Recall that the Gamma distribution Γ(α, β) is a distribution on nonnegative reals with density function (β^α / Γ(α)) x^{α−1} e^{−βx}.

Proposition C.1 (Dirichlet and Gamma distributions). The following facts are known for the Dirichlet and Gamma distributions.

1. Let Y_i ∼ Γ(α_i, 1) be independent random variables; then the vector (Y_1, Y_2, …, Y_k) / Σ_{i=1}^k Y_i is distributed as Dir(α).
2. The Γ function satisfies Euler's reflection formula: Γ(1 − z) Γ(z) = π / sin(πz).
3. Γ(z) ≥ 1 when 0 < z < 1.
4. There exists a universal constant C such that Γ(z) ≤ C/z when 0 < z < 1.
5. For Y ∼ Γ(α, 1), t > 0 and α ∈ (0, 1), we have

(α/(4C)) t^{α−1} e^{−t} ≤ Pr[Y ≥ t] ≤ t^{α−1} e^{−t},   (101)

and for any η, c > 1, we have

P[Y > ηt | Y ≥ t] ≥ (cη)^{α−1} e^{−(η−1)t}.   (102)

Proof: The bounds in (101) are derived using the fact that 1 ≤ Γ(α) ≤ C/α when α ∈ (0, 1), together with

∫_t^∞ (1/Γ(α_i)) x^{α_i−1} e^{−x} dx ≤ (1/Γ(α_i)) ∫_t^∞ t^{α_i−1} e^{−x} dx ≤ t^{α_i−1} e^{−t},

and

∫_t^∞ (1/Γ(α_i)) x^{α_i−1} e^{−x} dx ≥ (1/Γ(α_i)) ∫_t^{2t} x^{α_i−1} e^{−x} dx ≥ (α_i/C) ∫_t^{2t} (2t)^{α_i−1} e^{−x} dx ≥ (α_i/(4C)) t^{α_i−1} e^{−t}. □

Proposition C.2 (Moments under the Dirichlet distribution). Suppose v ∼ Dir(α); then the moments of v satisfy

E[v_i] = α_i/α_0,   E[v_i²] = α_i (α_i + 1) / (α_0 (α_0 + 1)),   E[v_i v_j] = α_i α_j / (α_0 (α_0 + 1)),   i ≠ j.

More generally, if a^{(t)} := Π_{i=0}^{t−1} (a + i) denotes the rising factorial, then for nonnegative integers a_i,

E[ Π_{i=1}^k v_i^{a_i} ] = ( Π_{i=1}^k α_i^{(a_i)} ) / α_0^{(Σ_{i=1}^k a_i)}.
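The first- and second-moment formulas in Proposition C.2 are straightforward to verify numerically. A minimal sketch with assumed illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(4)

alpha = np.array([0.5, 1.0, 2.5])    # illustrative Dirichlet parameters
a0 = alpha.sum()
v = rng.dirichlet(alpha, size=1_000_000)

# Closed forms from Proposition C.2 versus Monte Carlo estimates.
print("E[v_1]:    ", v[:, 0].mean(), "vs", alpha[0] / a0)
print("E[v_1^2]:  ", (v[:, 0] ** 2).mean(), "vs", alpha[0] * (alpha[0] + 1) / (a0 * (a0 + 1)))
print("E[v_1 v_2]:", (v[:, 0] * v[:, 1]).mean(), "vs", alpha[0] * alpha[1] / (a0 * (a0 + 1)))
```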
C.6 Standard Results

Bernstein inequalities: One of the key tools we use is the standard matrix Bernstein inequality [Tropp, 2012, Thm. 1.4].

Proposition C.3 (Matrix Bernstein inequality). Suppose Z = Σ_j W_j, where

1. the W_j are independent random matrices of dimension d_1 × d_2,
2. E[W_j] = 0 for all j,
3. ‖W_j‖ ≤ R almost surely.

Let d = d_1 + d_2 and σ² = max{ ‖Σ_j E[W_j W_j^⊤]‖, ‖Σ_j E[W_j^⊤ W_j]‖ }. Then

Pr[ ‖Z‖ ≥ t ] ≤ d · exp( −(t²/2) / (σ² + Rt/3) ).

Proposition C.4 (Vector Bernstein inequality). Let z = (z_1, z_2, …, z_n) ∈ R^n be a random vector with independent entries, E[z_i] = 0, E[z_i²] = σ_i², and Pr[|z_i| ≤ 1] = 1. Let A = [a_1 | a_2 | ⋯ | a_n] ∈ R^{m×n} be a matrix. Then

Pr[ ‖Az‖ ≤ (1 + √(8t)) √( Σ_{i=1}^n ‖a_i‖² σ_i² ) + (4/3) max_{i∈[n]} ‖a_i‖ t ] ≥ 1 − e^{−t}.

Vector Chebyshev inequality: We also require a vector version of the Chebyshev inequality [Ferentios, 1982].

Proposition C.5. Let z = (z_1, z_2, …, z_n) ∈ R^n be a random vector with independent entries, E[z] = µ, and σ² := Σ_i E[(z_i − µ_i)²]. Then

P[ ‖z − µ‖ > tσ ] ≤ t^{−2}.

Wedin's theorem: We make use of Wedin's theorem to control subspace perturbations.

Lemma C.12 (Wedin's theorem; Theorem 4.4, p. 262 of Stewart and Sun [1990]). Let A, E ∈ R^{m×n} with m ≥ n be given. Let A have the singular value decomposition

[ U_1^⊤ ; U_2^⊤ ; U_3^⊤ ] A [ V_1  V_2 ] = [ Σ_1  0 ; 0  Σ_2 ; 0  0 ].

Let Ã := A + E, with the analogous singular value decomposition (Ũ_1, Ũ_2, Ũ_3, Σ̃_1, Σ̃_2, Ṽ_1, Ṽ_2). Let Φ be the matrix of canonical angles between range(U_1) and range(Ũ_1), and let Θ be the matrix of canonical angles between range(V_1) and range(Ṽ_1). If there exist δ, α > 0 such that min_i σ_i(Σ̃_1) ≥ α + δ and max_i σ_i(Σ_2) ≤ α, then

max{ ‖sin Φ‖_2, ‖sin Θ‖_2 } ≤ ‖E‖_2 / δ.
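To make Wedin's bound concrete, the following minimal sketch (not part of the paper) perturbs a rank-r signal matrix and compares the largest sine of a canonical angle against ‖E‖_2/δ; scipy.linalg.subspace_angles computes the canonical angles, and the dimensions and noise level are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(5)
m, n, r = 60, 40, 5    # dimensions and signal rank (illustrative)

A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-r matrix
E = 0.05 * rng.standard_normal((m, n))                          # perturbation
A_tilde = A + E

U, s, _ = np.linalg.svd(A)
U_t, s_t, _ = np.linalg.svd(A_tilde)

# Gap condition of Lemma C.12: min_i sigma_i(Sigma_tilde_1) >= alpha + delta
# and max_i sigma_i(Sigma_2) <= alpha, with Sigma_1 the top-r block.
alpha = s[r]                 # essentially 0 here, since A has rank r
delta = s_t[r - 1] - alpha

sin_max = np.sin(subspace_angles(U[:, :r], U_t[:, :r])).max()
print("max sin(canonical angle):", sin_max)
print("Wedin bound ||E||_2 / delta:", np.linalg.norm(E, 2) / delta)
```

The gap δ here plays the role of the singular-value separation in the lemma; as ‖E‖_2 shrinks, the observed angle and the bound decrease together.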
