Online Tensor Methods for Learning Latent Variable Models


Authors: Furong Huang, U. N. Niranjan, Mohammad Umar Hakeem, Animashree Anandkumar

Journal of Machine Learning Research (2014). Submitted; Published.

Furong Huang (furongh@uci.edu), U. N. Niranjan (un.niranjan@uci.edu), Mohammad Umar Hakeem (mhakeem@uci.edu), Animashree Anandkumar (a.anandkumar@uci.edu)
Electrical Engineering and Computer Science Dept., University of California, Irvine, Irvine, CA 92697, USA

Editor: Charles Sutton

Abstract

We introduce an online tensor decomposition based approach for two latent variable modeling problems, namely (1) community detection, in which we learn the latent communities that the social actors in social networks belong to, and (2) topic modeling, in which we infer hidden topics of text articles. We consider decomposition of moment tensors using stochastic gradient descent. We carry out optimization of multilinear operations in SGD and avoid directly forming the tensors, to save computational and storage costs. We present optimized algorithms on two platforms. Our GPU-based implementation exploits the parallelism of SIMD architectures to allow for maximum speed-up by careful optimization of storage and data transfer, whereas our CPU-based implementation uses efficient sparse matrix computations and is suitable for large sparse datasets. For the community detection problem, we demonstrate accuracy and computational efficiency on the Facebook, Yelp and DBLP datasets, and for the topic modeling problem, we demonstrate good performance on the New York Times dataset.
We compare our results to state-of-the-art algorithms such as the variational method, and report a gain in accuracy and a speed-up of several orders of magnitude in execution time.

Keywords: mixed membership stochastic blockmodel, topic modeling, tensor method, stochastic gradient descent, parallel implementation, large datasets

1. Introduction

The spectral or moment-based approach involves decomposition of certain empirical moment tensors, estimated from observed data, to obtain the parameters of the proposed probabilistic model. Unsupervised learning for a wide range of latent variable models can be carried out efficiently via tensor-based techniques with low sample and computational complexities (Anandkumar et al., 2012). In contrast, methods usually employed in practice, such as expectation maximization (EM) and variational Bayes, do not have such consistency guarantees. While previous works (Anandkumar et al., 2013b) focused on theoretical guarantees, in this paper we focus on the implementation of the tensor methods and study their performance on several datasets.

1.1 Summary of Contributions

We consider two problems: (1) community detection (wherein we compute the decomposition of a tensor which relates to the count of 3-stars in a graph) and (2) topic modeling (wherein we consider the tensor related to co-occurrence of triplets of words in documents); decomposition of these tensors allows us to learn the hidden communities and topics from observed data.

Community detection: We recover hidden communities in several real datasets with high accuracy. When ground-truth communities are available, we propose a new error score based on the hypothesis testing methodology involving p-values and false discovery rates (Strimmer, 2008) to validate our results.
The use of p-values eliminates the need to carefully tune the number of communities output by our algorithm, and hence we obtain a flexible trade-off between the fraction of communities recovered and their estimation accuracy. We find that our method has very good accuracy on a range of network datasets: Facebook, Yelp and DBLP. We summarize the datasets used in this paper in Table 6. To get an idea of our running times, consider the larger DBLP collaborative dataset for a moment. It consists of 16 million edges, one million nodes and 250 communities. We obtain an error of 10%, and the method runs in about two minutes, excluding the 80 minutes taken to read the edge data from files stored on the hard disk and convert it to sparse matrix format.

Compared to the state-of-the-art method for learning MMSB models using the stochastic variational inference algorithm of (Gopalan et al., 2012), we obtain several orders of magnitude speed-up in running time on multiple real datasets. This is because our method consists of efficient matrix operations which are embarrassingly parallel. Matrix operations are carried out in sparse format, which is especially efficient for social network settings involving large sparse graphs. Moreover, our code is flexible enough to run on a range of graphs, such as directed, undirected and bipartite graphs, while the code of (Gopalan et al., 2012) is designed for homophilic networks and cannot handle bipartite graphs in its present format. Note that bipartite networks occur in the recommendation setting, such as the Yelp dataset. Additionally, the variational implementation in (Gopalan et al., 2012) assumes a homogeneous connectivity model, where any pair of communities connects with the same probability and the probability of intra-community connectivity is also fixed.
Our framework does not suffer from this restriction. We also provide arguments to show that the Normalized Mutual Information (NMI) and other scores, previously used for evaluating the recovery of overlapping communities, can underestimate the errors.

Topic modeling: We also employ the tensor method for topic modeling, and there are many similarities between the topic and community settings. For instance, each document has multiple topics, while in the network setting, each node has membership in multiple communities. The words in a document are generated based on the latent topics in the document, and similarly, edges are generated based on the community memberships of the node pairs. The tensor method is even faster for topic modeling, since the word vocabulary size is typically much smaller than the size of real-world networks. We learn interesting hidden topics in the New York Times corpus from the UCI bag-of-words dataset¹ with around 100,000 words and 300,000 documents in about two minutes. We present the important words for recovered topics, as well as interpret "bridging" words, which occur in many topics.

Implementations: We present two implementations, viz., a GPU-based implementation which exploits the parallelism of SIMD architectures, and a CPU-based implementation for larger datasets, where the GPU memory does not suffice. We discuss the various aspects involved, such as implicit manipulation of tensors, since explicitly forming tensors would be unwieldy for large networks; optimizing for communication bottlenecks in a parallel deployment; the need for sparse matrix and vector operations, since real-world networks tend to be sparse; and a careful statistical approach to validating the results when ground truth is available.
1.2 Related Work

This paper builds on the recent works of Anandkumar et al. (2012, 2013b), which establish the correctness of tensor-based approaches for learning MMSB models (Airoldi et al., 2008) and other latent variable models. While the earlier works provided a theoretical analysis of the method, the current paper considers a careful implementation of the method. Moreover, there are a number of algorithmic improvements in this paper. For instance, while (Anandkumar et al., 2012, 2013b) consider tensor power iterations based on batch data, with deflations performed serially, here we adopt a stochastic gradient descent approach for tensor decomposition, which provides the flexibility to trade off sub-sampling against accuracy. Moreover, we use randomized methods for dimensionality reduction in the preprocessing stage of our method, which enables us to scale to graphs with millions of nodes.

There are other known methods for learning the stochastic block model based on techniques such as spectral clustering (McSherry, 2001) and convex optimization (Chen et al., 2012). However, these methods are not applicable for learning overlapping communities. We note that learning the mixed membership model can be reduced to a matrix factorization problem (Zhang and Yeung, 2012). While collaborative filtering techniques such as (Mnih and Salakhutdinov, 2007; Salakhutdinov and Mnih, 2008) focus on matrix factorization and the prediction accuracy of recommendations on an unseen test set, we recover the underlying latent communities, which helps with interpretability, and the statistical model can be employed for other tasks.

Although there have been other fast implementations for community detection before (Soman and Narang, 2011; Lancichinetti and Fortunato, 2009), these methods are not statistical and do not yield descriptive statistics such as bridging nodes (Nepusz et al., 2008), and cannot perform predictive tasks such as link classification, which are the main strengths of the MMSB model. With the implementation of our tensor-based approach, we record huge speed-ups compared to existing approaches for learning the MMSB model.

To the best of our knowledge, while stochastic methods for matrix decomposition have been considered earlier (Oja and Karhunen, 1985; Arora et al., 2012), this is the first work incorporating stochastic optimization for tensor decomposition, and it paves the way for further investigation of many theoretical and practical issues. We also note that we never explicitly form or store the subgraph count tensor, of size O(n³) where n is the number of nodes, in our implementation, but directly manipulate the neighborhood vectors to obtain tensor decompositions through stochastic updates. This is a crucial departure from other works on tensor decompositions on GPUs (Ballard et al., 2011; Schatz et al., 2013), where the tensor needs to be stored and manipulated directly.

1. https://archive.ics.uci.edu/ml/datasets/Bag+of+Words

2. Tensor Forms for Topic and Community Models

In this section, we briefly recap the topic and community models, as well as the tensor forms for their exact moments, derived in (Anandkumar et al., 2012, 2013b).

2.1 Topic Modeling

In topic modeling, a document is viewed as a bag of words. Each document has a latent set of topics, and h = (h_1, h_2, ..., h_k) represents the proportions of the k topics in a given document. Given the topics h, the words are independently drawn and are exchangeable; hence the term "bag of words" model. We represent the words in the document by d-dimensional random vectors x_1, x_2, ...,
x_l ∈ R^d, where the x_i are coordinate basis vectors in R^d and d is the size of the word vocabulary. Conditioned on h, the words in a document satisfy E[x_i | h] = µh, where µ := [µ_1, ..., µ_k] is the topic-word matrix; thus µ_j is the topic vector satisfying µ_j = Pr(x_i | h_j), ∀j ∈ [k]. Under the Latent Dirichlet Allocation (LDA) topic model (Blei, 2012), h is drawn from a Dirichlet distribution with concentration parameter vector α = [α_1, ..., α_k]. In other words, for each document u, h_u ~iid Dir(α), ∀u ∈ [n], with parameter vector α ∈ R^k_+. We define the Dirichlet concentration (mixing) parameter

\[
\alpha_0 := \sum_{i \in [k]} \alpha_i .
\]

The Dirichlet distribution allows us to specify the extent of overlap among the topics by controlling for sparsity in the topic density function. A larger α_0 results in more overlapped (mixed) topics. A special case of α_0 = 0 is the single topic model. Due to exchangeability, the order of the words does not matter, and it suffices to consider the frequency vector for each document, which counts the number of occurrences of each word in the document. Let c_t := (c_{1,t}, c_{2,t}, ..., c_{d,t}) ∈ R^d denote the frequency vector for the t-th document, and let n be the number of documents.
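To make the generative process above concrete, the following minimal numpy sketch samples toy documents under this model. The vocabulary size d, topic count k, document count n and document length here are illustrative assumptions, not values from the paper, and the code is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n, words_per_doc = 50, 3, 200, 40   # illustrative sizes (assumptions)

alpha = np.full(k, 0.5)                   # Dirichlet concentration vector
mu = rng.dirichlet(np.ones(d), size=k).T  # d x k topic-word matrix, columns sum to 1

# For each document: draw topic proportions h ~ Dir(alpha), then draw
# exchangeable words i.i.d. from the mixture mu @ h, and record the
# word-frequency vector c_t.
C = np.zeros((n, d))
for t in range(n):
    h = rng.dirichlet(alpha)
    word_probs = mu @ h                   # E[x_i | h] = mu h
    words = rng.choice(d, size=words_per_doc, p=word_probs)
    C[t] = np.bincount(words, minlength=d)
```

The rows of `C` are the frequency vectors c_t used to build the empirical moments below.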
We consider the first three empirical moments, given by

\[
M_1^{\mathrm{Top}} := \frac{1}{n}\sum_{t=1}^{n} c_t \tag{1}
\]
\[
M_2^{\mathrm{Top}} := \frac{\alpha_0+1}{n}\sum_{t=1}^{n}\bigl(c_t \otimes c_t - \operatorname{diag}(c_t)\bigr) - \alpha_0\, M_1^{\mathrm{Top}} \otimes M_1^{\mathrm{Top}} \tag{2}
\]
\[
\begin{aligned}
M_3^{\mathrm{Top}} := {}& \frac{(\alpha_0+1)(\alpha_0+2)}{2n}\sum_{t=1}^{n}\Bigl( c_t \otimes c_t \otimes c_t
 - \sum_{i=1}^{d}\sum_{j=1}^{d} c_{i,t} c_{j,t}\, (e_i \otimes e_i \otimes e_j) \\
&\quad - \sum_{i=1}^{d}\sum_{j=1}^{d} c_{i,t} c_{j,t}\, (e_i \otimes e_j \otimes e_i)
 - \sum_{i=1}^{d}\sum_{j=1}^{d} c_{i,t} c_{j,t}\, (e_i \otimes e_j \otimes e_j)
 + 2\sum_{i=1}^{d} c_{i,t}\, (e_i \otimes e_i \otimes e_i) \Bigr) \\
&- \frac{\alpha_0(\alpha_0+1)}{2n}\sum_{t=1}^{n}\Bigl( \sum_{i=1}^{d} c_{i,t}\, (e_i \otimes e_i \otimes M_1^{\mathrm{Top}})
 + \sum_{i=1}^{d} c_{i,t}\, (e_i \otimes M_1^{\mathrm{Top}} \otimes e_i)
 + \sum_{i=1}^{d} c_{i,t}\, (M_1^{\mathrm{Top}} \otimes e_i \otimes e_i) \Bigr) \\
&+ \alpha_0^2\, M_1^{\mathrm{Top}} \otimes M_1^{\mathrm{Top}} \otimes M_1^{\mathrm{Top}} .
\end{aligned} \tag{3}
\]

We recall Theorem 3.5 of (Anandkumar et al., 2012):

Lemma 1. The exact moments can be factorized as

\[
\mathbb{E}[M_1^{\mathrm{Top}}] = \sum_{i=1}^{k} \frac{\alpha_i}{\alpha_0}\, \mu_i \tag{4}
\]
\[
\mathbb{E}[M_2^{\mathrm{Top}}] = \sum_{i=1}^{k} \frac{\alpha_i}{\alpha_0}\, \mu_i \otimes \mu_i \tag{5}
\]
\[
\mathbb{E}[M_3^{\mathrm{Top}}] = \sum_{i=1}^{k} \frac{\alpha_i}{\alpha_0}\, \mu_i \otimes \mu_i \otimes \mu_i , \tag{6}
\]

where µ = [µ_1, ..., µ_k] and µ_i = Pr(x_t | h = i), ∀t ∈ [l]. In other words, µ is the topic-word matrix.

From Lemma 1, we observe that the first three moments of an LDA topic model have a simple form involving the topic-word matrix µ and the Dirichlet parameters α_i. In (Anandkumar et al., 2012), it is shown that these parameters can be recovered under a weak non-degeneracy assumption. We will employ tensor decomposition techniques to learn the parameters.

2.2 Mixed Membership Model

In the mixed membership stochastic block model (MMSB), introduced by (Airoldi et al., 2008), the edges in a social network are related to the hidden communities of the nodes. A batch tensor decomposition technique for learning MMSB was derived in (Anandkumar et al., 2013b). Let n denote the number of nodes, k the number of communities, and G ∈ R^{n×n} the adjacency matrix of the graph.
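Equations (1) and (2) translate directly into matrix operations on the stacked frequency vectors. The sketch below is illustrative only: the helper name `topic_moments` and the toy Poisson counts are our own assumptions, not the paper's code.

```python
import numpy as np

def topic_moments(C, alpha0):
    """Empirical first and second order topic moments, Eqs. (1)-(2).

    C : (n, d) array whose rows are per-document word-frequency vectors c_t.
    """
    n, d = C.shape
    M1 = C.mean(axis=0)                                    # Eq. (1)
    # (1/n) sum_t (c_t (x) c_t - diag(c_t)), computed without an explicit loop
    second = (C.T @ C) / n - np.diag(C.mean(axis=0))
    M2 = (alpha0 + 1.0) * second - alpha0 * np.outer(M1, M1)   # Eq. (2)
    return M1, M2

rng = np.random.default_rng(1)
C = rng.poisson(2.0, size=(500, 20)).astype(float)   # toy frequency vectors (assumption)
M1, M2 = topic_moments(C, alpha0=1.0)
```

For α_0 = 0, Eq. (2) reduces to the plain centered co-occurrence term, which gives a quick sanity check on the implementation.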
Each node i ∈ [n] has an associated community membership vector π_i ∈ R^k, which is a latent variable, and the vectors are contained in a simplex, i.e.,

\[
\sum_{i \in [k]} \pi_u(i) = 1, \quad \forall u \in [n],
\]

where the notation [n] denotes the set {1, ..., n}. Membership vectors are sampled from the Dirichlet distribution, π_u ~iid Dir(α), ∀u ∈ [n], with parameter vector α ∈ R^k_+, where α_0 := Σ_{i∈[k]} α_i. As in the topic modeling setting, the Dirichlet distribution allows us to specify the extent of overlap among the communities by controlling for sparsity in the community membership vectors. A larger α_0 results in more overlapped (mixed) memberships. A special case of α_0 = 0 is the stochastic block model (Anandkumar et al., 2013b). The community connectivity matrix is denoted by P ∈ [0,1]^{k×k}, where P(a,b) measures the connectivity between communities a and b, ∀a,b ∈ [k]. We model the adjacency matrix entries in either of the two settings given below.

Bernoulli model: This models a network with unweighted edges. It is used for the Facebook and DBLP datasets in Section 6 in our experiments:

\[
G_{ij} \overset{\mathrm{iid}}{\sim} \mathrm{Ber}\bigl(\pi_i^{\top} P \pi_j\bigr), \quad \forall i, j \in [n].
\]

Poisson model (Karrer and Newman, 2011): This models a network with weighted edges. It is used for the Yelp dataset in Section 6 to incorporate the review ratings:

\[
G_{ij} \overset{\mathrm{iid}}{\sim} \mathrm{Poi}\bigl(\pi_i^{\top} P \pi_j\bigr), \quad \forall i, j \in [n].
\]

The tensor decomposition approach involves up to third order moments, computed from the observed network. In order to compute the moments, we partition the nodes randomly into sets X, A, B, C. Let F_A := Π_A^⊤ P^⊤, F_B := Π_B^⊤ P^⊤, F_C := Π_C^⊤ P^⊤ (where P is the community connectivity matrix and Π is the membership matrix), and let α̂ := (α_1/α_0, ..., α_k/α_0) denote the normalized Dirichlet concentration parameter.
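For intuition, the Bernoulli model above can be simulated in a few lines of numpy. The network size, concentration parameter and the assortative connectivity matrix P below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 300, 4                                # illustrative sizes (assumptions)

alpha = np.full(k, 0.1)
Pi = rng.dirichlet(alpha, size=n)            # n x k membership vectors on the simplex
P = 0.02 + 0.3 * np.eye(k)                   # toy assortative connectivity matrix

edge_prob = Pi @ P @ Pi.T                    # entry (i, j) equals pi_i^T P pi_j
G = (rng.random((n, n)) < edge_prob).astype(np.int8)   # Bernoulli model
```

Replacing the Bernoulli draw with `rng.poisson(edge_prob)` gives the weighted Poisson variant described above.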
We define Pairs over Y_1 and Y_2 as

\[
\mathrm{Pairs}(Y_1, Y_2) := G_{X,Y_1}^{\top} \otimes G_{X,Y_2}^{\top} .
\]

Define the following matrices:

\[
Z_B := \mathrm{Pairs}(A, C)\, \bigl(\mathrm{Pairs}(B, C)\bigr)^{\dagger}, \tag{7}
\]
\[
Z_C := \mathrm{Pairs}(A, B)\, \bigl(\mathrm{Pairs}(C, B)\bigr)^{\dagger}. \tag{8}
\]

We consider the first three empirical moments, given by

\[
M_1^{\mathrm{Com}} := \frac{1}{n_X}\sum_{x \in X} G_{x,A}^{\top} \tag{9}
\]
\[
M_2^{\mathrm{Com}} := \frac{\alpha_0+1}{n_X}\sum_{x \in X} Z_C\, G_{x,C}^{\top} G_{x,B}\, Z_B^{\top} - \alpha_0\, M_1^{\mathrm{Com}} \bigl(M_1^{\mathrm{Com}}\bigr)^{\top} \tag{10}
\]
\[
\begin{aligned}
M_3^{\mathrm{Com}} := {}& \frac{(\alpha_0+1)(\alpha_0+2)}{2 n_X}\sum_{x \in X} \bigl[ G_{x,A}^{\top} \otimes Z_B G_{x,B}^{\top} \otimes Z_C G_{x,C}^{\top} \bigr]
+ \alpha_0^2\, M_1^{\mathrm{Com}} \otimes M_1^{\mathrm{Com}} \otimes M_1^{\mathrm{Com}} \\
&- \frac{\alpha_0(\alpha_0+1)}{2 n_X}\sum_{x \in X} \bigl[ G_{x,A}^{\top} \otimes Z_B G_{x,B}^{\top} \otimes M_1^{\mathrm{Com}}
+ G_{x,A}^{\top} \otimes M_1^{\mathrm{Com}} \otimes Z_C G_{x,C}^{\top}
+ M_1^{\mathrm{Com}} \otimes Z_B G_{x,B}^{\top} \otimes Z_C G_{x,C}^{\top} \bigr]
\end{aligned} \tag{11}
\]

We now recap Proposition 2.2 of (Anandkumar et al., 2013a), which provides the form of these moments under expectation.

Lemma 2. The exact moments can be factorized as

\[
\mathbb{E}[M_1^{\mathrm{Com}} \mid \Pi_A, \Pi_B, \Pi_C] = \sum_{i \in [k]} \hat\alpha_i\, (F_A)_i \tag{12}
\]
\[
\mathbb{E}[M_2^{\mathrm{Com}} \mid \Pi_A, \Pi_B, \Pi_C] = \sum_{i \in [k]} \hat\alpha_i\, (F_A)_i \otimes (F_A)_i \tag{13}
\]
\[
\mathbb{E}[M_3^{\mathrm{Com}} \mid \Pi_A, \Pi_B, \Pi_C] = \sum_{i \in [k]} \hat\alpha_i\, (F_A)_i \otimes (F_A)_i \otimes (F_A)_i , \tag{14}
\]

where ⊗ denotes the Kronecker product and (F_A)_i corresponds to the i-th column of F_A.

We observe that the moment forms above for the MMSB model are similar to the moments of the topic model in the previous section. Thus, we can employ a unified framework for both topic and community modeling involving decomposition of the third order moment tensors M_3^Top and M_3^Com. The second order moments M_2^Top and M_2^Com are used for preprocessing of the data (i.e., whitening, which is introduced in detail in Section 3.1). For simplicity of notation, in the rest of the paper we will use M_2 to denote the empirical second order moments for both M_2^Top in the topic modeling setting and M_2^Com in the mixed membership model setting.
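The matrices in Equations (7)-(8) reduce to products of adjacency blocks. A toy sketch follows, using random dense blocks instead of a real graph and a dense pseudo-inverse instead of the randomized one developed later; the partition sizes and the helper name `pairs` are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
nX, nA, nB, nC = 200, 70, 80, 90            # illustrative partition sizes (assumptions)
G_XA = rng.random((nX, nA))
G_XB = rng.random((nX, nB))
G_XC = rng.random((nX, nC))

def pairs(G_X1, G_X2):
    # Pairs(Y1, Y2) built from the X-to-Y1 and X-to-Y2 adjacency blocks;
    # any 1/|X| normalization cancels between the two factors in Eqs. (7)-(8).
    return G_X1.T @ G_X2

# Eqs. (7)-(8): Z_B and Z_C via pseudo-inverses of Pairs matrices
Z_B = pairs(G_XA, G_XC) @ np.linalg.pinv(pairs(G_XB, G_XC))
Z_C = pairs(G_XA, G_XB) @ np.linalg.pinv(pairs(G_XC, G_XB))
```

The resulting shapes (n_A × n_B for Z_B and n_A × n_C for Z_C) match the sizes used in the moment expressions (9)-(11).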
Similarly, we will use M_3 to denote the empirical third order moments for both M_3^Top and M_3^Com.

3. Learning using Third Order Moments

Our learning algorithm uses up to the third-order moment to estimate the topic-word matrix µ or the community membership matrix Π. First, we obtain the co-occurrence of triplets of words, or the subgraph counts (implicitly). Then, we perform preprocessing using the second order moment M_2. Then we perform tensor decomposition efficiently using stochastic gradient descent (Kushner and Yin, 2003) on M_3. We note that, in our implementation of the algorithm on the Graphics Processing Unit (GPU), linear algebraic operations are extremely fast. We also implement our algorithm on the CPU for large datasets which exceed the memory capacity of the GPU, and there we use sparse matrix operations, which results in large gains in terms of both the memory and the running time requirements. The overall approach is summarized in Algorithm 1.

3.1 Dimensionality Reduction and Whitening

The whitening step utilizes linear algebraic manipulations to make the tensor symmetric and orthogonal (in expectation). Moreover, it leads to dimensionality reduction, since it (implicitly) reduces the tensor M_3 of size O(n³) to a tensor of size k³, where k is the number of communities. Typically we have k ≪ n. The whitening step also converts the tensor M_3 to a symmetric orthogonal tensor. The whitening matrix W ∈ R^{n_A×k} satisfies W^⊤ M_2 W = I. The idea is that if the bilinear projection of the second order moment onto W results in the identity matrix, then a trilinear projection of the third order moment onto W would

Algorithm 1: Overall approach for learning latent variable models via a moment-based approach.
Input: Observed data: social network graph or document samples.
Output: Learned latent variable model and inferred hidden attributes.
1: Estimate the third order moment tensor M_3 (implicitly). The tensor is not formed explicitly, as we break down the tensor operations into vector and matrix operations.
2: Whiten the data, via SVD of M_2, to reduce dimensionality via symmetrization and orthogonalization. The third order moments M_3 are whitened as T.
3: Use stochastic gradient descent to estimate the spectrum of the whitened (implicit) tensor T.
4: Apply post-processing to obtain the topic-word matrix or the community memberships.
5: If ground truth is known, validate the results using various evaluation measures.

result in an orthogonal tensor. We use multilinear operations to get an orthogonal tensor T := M_3(W, W, W). The whitening matrix W is computed via a truncated k-SVD of the second order moments:

\[
W = U_{M_2} \Sigma_{M_2}^{-1/2},
\]

where U_{M_2} and Σ_{M_2} = diag(σ_{M_2,1}, ..., σ_{M_2,k}) are the top k singular vectors and singular values of M_2, respectively. We then perform multilinear transformations on the triplet data using the whitening matrix. The whitened data is thus

\[
y_A^t := \langle W, c_t \rangle, \quad y_B^t := \langle W, c_t \rangle, \quad y_C^t := \langle W, c_t \rangle
\]

for topic modeling, where t denotes the index of the documents. Note that y_A^t, y_B^t, y_C^t ∈ R^k. Implicitly, the whitened tensor is T = (1/n_X) Σ_{t∈X} y_A^t ⊗ y_B^t ⊗ y_C^t, a k × k × k tensor. Since k ≪ n, the dimensionality reduction is crucial for our speedup.

3.2 Stochastic Tensor Gradient Descent

In (Anandkumar et al., 2013b) and (Anandkumar et al., 2012), the power method with deflation is used for tensor decomposition, where the eigenvectors are recovered by iterating over multiple loops in a serial manner. Furthermore, batch data is used in their iterative power method, which makes that algorithm slower than its stochastic counterpart.
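As a small self-contained check (the toy rank-k matrix below is our own assumption, not the paper's data), the whitening matrix built from the top-k SVD of M_2 indeed satisfies W^⊤ M_2 W = I:

```python
import numpy as np

def whitening_matrix(M2, k):
    """Whitening matrix from the top-k SVD of M2: W = U_k diag(sigma_k)^(-1/2)."""
    U, s, _ = np.linalg.svd(M2, hermitian=True)
    return U[:, :k] / np.sqrt(s[:k])       # scale each of the top-k columns

rng = np.random.default_rng(4)
k, d = 3, 30
A = rng.standard_normal((d, k))
M2 = A @ A.T                               # toy rank-k PSD second moment (assumption)
W = whitening_matrix(M2, k)                # satisfies W^T M2 W = I_k
```

The same construction applies unchanged to the community second moment, with the node-partition dimension n_A in place of d.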
In addition to implementing a stochastic spectral optimization algorithm, we achieve further speed-up by efficiently parallelizing the stochastic updates. Let v = [v_1 | v_2 | ... | v_k] be the true eigenvectors. Denote the cardinality of the sample set as n_X, i.e., n_X := |X|. Now that we have the whitened tensor, we propose the Stochastic Tensor Gradient Descent (STGD) algorithm for tensor decomposition. Consider the tensor T ∈ R^{k×k×k} formed from the whitened samples, i.e.,

\[
T = \sum_{t \in X} T^t
= \frac{(\alpha_0+1)(\alpha_0+2)}{2 n_X}\sum_{t \in X} y_A^t \otimes y_B^t \otimes y_C^t
- \frac{\alpha_0(\alpha_0+1)}{2 n_X}\sum_{t \in X}\bigl( y_A^t \otimes y_B^t \otimes \bar y_C + y_A^t \otimes \bar y_B \otimes y_C^t + \bar y_A \otimes y_B^t \otimes y_C^t \bigr)
+ \alpha_0^2\, \bar y_A \otimes \bar y_B \otimes \bar y_C ,
\]

where t ∈ X denotes the index of the online data and ȳ_A, ȳ_B and ȳ_C denote the means of the whitened data. Our goal is to find a symmetric CP decomposition of the whitened tensor.

Definition 3. Our optimization problem is given by

\[
\arg\min_{v:\ \|v_i\|_F = 1}\ \Bigl\{\ \Bigl\| \sum_{i \in [k]} v_i^{\otimes 3} - \sum_{t \in X} T^t \Bigr\|_F^2 + \theta\, \Bigl\| \sum_{i \in [k]} v_i^{\otimes 3} \Bigr\|_F^2 \Bigr\},
\]

where the v_i are the unknown components to be estimated and θ > 0 is some fixed parameter.

In order to encourage orthogonality between eigenvectors, we have the extra term θ ‖Σ_{i∈[k]} v_i^{⊗3}‖²_F. Since ‖Σ_{t∈X} T^t‖²_F is a constant, the above minimization is the same as minimizing the loss function L(v) := (1/n_X) Σ_t L^t(v), where L^t(v) is the loss function evaluated at node t ∈ X and is given by

\[
L^t(v) := \frac{1+\theta}{2}\,\Bigl\| \sum_{i \in [k]} v_i^{\otimes 3} \Bigr\|_F^2 - \Bigl\langle \sum_{i \in [k]} v_i^{\otimes 3},\, T^t \Bigr\rangle . \tag{15}
\]

The loss function has two terms, viz., the term ‖Σ_{i∈[k]} v_i^{⊗3}‖²_F, which can be interpreted as the orthogonality cost that we need to minimize, and the second term ⟨Σ_{i∈[k]} v_i^{⊗3}, T^t⟩, which can be viewed as the correlation reward to be maximized.
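Equation (15) can be evaluated directly for small k by materializing Σ_i v_i^{⊗3}. The sketch below is illustrative (the function name and toy inputs are our own assumptions); it separates the orthogonality cost from the correlation reward:

```python
import numpy as np

def loss_t(V, Tt, theta=1.0):
    """Per-sample STGD loss, Eq. (15).

    V  : (k, k) array whose columns v_i are the current component estimates.
    Tt : (k, k, k) per-sample tensor T^t.
    """
    # sum_i v_i^{(x)3} as an explicit k x k x k tensor (fine here, since k is small)
    S = np.einsum('ai,bi,ci->abc', V, V, V)
    ortho_cost = np.sum(S * S)          # ||sum_i v_i^{(x)3}||_F^2
    reward = np.sum(S * Tt)             # <sum_i v_i^{(x)3}, T^t>
    return 0.5 * (1 + theta) * ortho_cost - reward

rng = np.random.default_rng(5)
k = 4
V = rng.standard_normal((k, k))
V /= np.linalg.norm(V, axis=0)          # unit-norm columns, per Definition 3
Tt = rng.standard_normal((k, k, k))
val = loss_t(V, Tt)
```

For orthonormal V and T^t = 0, the loss evaluates to (1+θ)k/2, since the cross terms ⟨v_i, v_j⟩³ vanish for i ≠ j; that gives a quick check of the orthogonality term.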
The parameter θ provides additional flexibility for tuning between the two terms. Let Φ^t := [φ_1^t | φ_2^t | ... | φ_k^t] denote the estimate of the eigenvectors using the whitened data point t, where φ_i^t ∈ R^k, i ∈ [k]. Taking the derivative of the loss function leads us to the iterative update equation for the stochastic gradient descent, which is

\[
\phi_i^{t+1} \leftarrow \phi_i^t - \beta^t \left.\frac{\partial L^t}{\partial v_i}\right|_{\phi_i^t}, \quad \forall i \in [k],
\]

where β^t is the learning rate. Computing the derivative of the loss function and substituting the result leads to the following lemma.

Lemma 4. The stochastic updates for the eigenvectors are given by

\[
\begin{aligned}
\phi_i^{t+1} \leftarrow \phi_i^t
&- \frac{1+\theta}{2}\,\beta^t \sum_{j=1}^{k} \bigl\langle \phi_j^t, \phi_i^t \bigr\rangle^2 \phi_j^t
+ \beta^t\,\frac{(\alpha_0+1)(\alpha_0+2)}{2}\,\bigl\langle \phi_i^t, y_A^t \bigr\rangle \bigl\langle \phi_i^t, y_B^t \bigr\rangle\, y_C^t \\
&+ \beta^t \alpha_0^2\, \bigl\langle \phi_i^t, \bar y_A \bigr\rangle \bigl\langle \phi_i^t, \bar y_B \bigr\rangle\, \bar y_C
- \beta^t\,\frac{\alpha_0(\alpha_0+1)}{2}\,\bigl\langle \phi_i^t, y_A^t \bigr\rangle \bigl\langle \phi_i^t, y_B^t \bigr\rangle\, \bar y_C \\
&- \beta^t\,\frac{\alpha_0(\alpha_0+1)}{2}\,\bigl\langle \phi_i^t, y_A^t \bigr\rangle \bigl\langle \phi_i^t, \bar y_B \bigr\rangle\, y_C^t
- \beta^t\,\frac{\alpha_0(\alpha_0+1)}{2}\,\bigl\langle \phi_i^t, \bar y_A \bigr\rangle \bigl\langle \phi_i^t, y_B^t \bigr\rangle\, y_C^t .
\end{aligned} \tag{16}
\]

Figure 1: Schematic representation of the stochastic updates for the spectral estimation. Note that we never form the tensor explicitly, since the gradient involves vector products obtained by collapsing two modes, as shown in Equation (16).

In Equation (16), all our tensor operations are in terms of efficient sample vector inner products, and no tensor is explicitly formed. The multilinear operations are shown in Figure 1. We choose θ = 1 in our experiments to ensure that there is sufficient penalty for non-orthogonality, which prevents us from obtaining degenerate solutions. After learning the decomposition of the third order moment, we perform post-processing to estimate Π̂.

3.3 Post-processing

The eigenvalues Λ := [λ_1, λ_2, ..., λ_k] are estimated as the norms of the eigenvectors: λ_i = ‖φ_i‖³.
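The updates of Lemma 4 involve only inner products of k-dimensional vectors. A plain numpy sketch of one update over all components follows (loop form for clarity; the function name, step size and toy inputs are our own assumptions, and this is not the paper's GPU/CPU code):

```python
import numpy as np

def stgd_step(Phi, yA, yB, yC, ybarA, ybarB, ybarC, beta, alpha0, theta=1.0):
    """One stochastic update of all k eigenvector estimates, following Eq. (16)."""
    k = Phi.shape[1]
    Phi_new = Phi.copy()
    c3 = (alpha0 + 1) * (alpha0 + 2) / 2
    c2 = alpha0 * (alpha0 + 1) / 2
    for i in range(k):
        phi = Phi[:, i]
        # orthogonality penalty: sum_j <phi_j, phi_i>^2 phi_j
        grad = ((1 + theta) / 2) * (Phi @ ((Phi.T @ phi) ** 2))
        upd = -beta * grad
        upd += beta * c3 * (phi @ yA) * (phi @ yB) * yC
        upd += beta * alpha0**2 * (phi @ ybarA) * (phi @ ybarB) * ybarC
        upd -= beta * c2 * (phi @ yA) * (phi @ yB) * ybarC
        upd -= beta * c2 * (phi @ yA) * (phi @ ybarB) * yC
        upd -= beta * c2 * (phi @ ybarA) * (phi @ yB) * yC
        Phi_new[:, i] = phi + upd
    return Phi_new

rng = np.random.default_rng(6)
k = 4
Phi = rng.standard_normal((k, k))
ys = [rng.standard_normal(k) for _ in range(6)]   # y_A, y_B, y_C and their means (toy)
Phi1 = stgd_step(Phi, *ys, beta=0.01, alpha0=1.0)
```

Note that every term is a scalar-weighted k-vector, so each update costs O(k²) and no k × k × k object is ever formed, which is the point of Equation (16).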
Lemma 5. After we obtain Λ and Φ, the estimate for the topic-word matrix is given by

\[
\hat\mu = \bigl(W^{\top}\bigr)^{\dagger} \Phi ,
\]

and in the community setting, the community membership matrix is given by

\[
\hat\Pi_{A^c} = \operatorname{diag}(\gamma)^{1/3}\, \operatorname{diag}(\Lambda)^{-1}\, \Phi^{\top} \hat W^{\top} G_{A,A^c} ,
\]

where A^c := X ∪ B ∪ C. Similarly, we estimate Π̂_A by exchanging the roles of X and A. Next, we obtain the Dirichlet distribution parameters

\[
\hat\alpha_i = \gamma^2 \lambda_i^{-2}, \quad \forall i \in [k],
\]

where γ² is chosen such that we have the normalization Σ_{i∈[k]} α̂_i := Σ_{i∈[k]} α_i/α_0 = 1.

Thus, we perform the STGD method to estimate the eigenvectors and eigenvalues of the whitened tensor, and then use these to estimate the topic-word matrix µ and the community membership matrix Π̂ by thresholding.

4. Implementation Details

4.1 Symmetrization Step to Compute M_2

Note that for the topic model, the second order moment M_2 can be computed easily from the word-frequency vectors. On the other hand, for the community setting, computing M_2 requires additional linear algebraic operations: it requires computation of the matrices Z_B and Z_C in Equations (7)-(8), which in turn requires computation of pseudo-inverses of "Pairs" matrices. Now, note that the pseudo-inverse of Pairs(B, C) in Equation (7) can be computed using a rank-k SVD:

\[
\text{k-SVD}\bigl(\mathrm{Pairs}(B, C)\bigr) = U_B(:, 1{:}k)\, \Sigma_{BC}(1{:}k)\, V_C(:, 1{:}k)^{\top} .
\]

We exploit this low rank property to achieve efficient running times and storage. We first implement the k-SVD of Pairs, given by G_{X,C}^⊤ G_{X,B}. Then the order in which the matrix products are carried out plays a significant role in terms of both memory and speed.
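The impact of multiplication order can be seen in a toy example: multiplying a tall-thin chain right-to-left keeps every intermediate of size O(n × k) or smaller, while the left-to-right order materializes an n × n matrix. The sizes below are illustrative assumptions, not those of our datasets.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 2000, 10
A = rng.standard_normal((n, k))      # think: a tall-thin factor such as G^T U
B = rng.standard_normal((k, n))      # and a thin-wide factor
x = rng.standard_normal(n)

# Good order: A @ (B @ x) only ever forms a k-vector and an n-vector.
y_good = A @ (B @ x)
# Bad order: (A @ B) materializes an n x n intermediate before the final product.
y_bad = (A @ B) @ x                  # same result, O(n^2) memory
```

Both orders give the same vector, but only the first avoids the O(n²) intermediate; this is exactly the manipulation applied to the Pairs products above.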
Note that Z_C involves the multiplication of a sequence of matrices of sizes R^{n_A×n_B}, R^{n_B×k}, R^{k×k}, R^{k×n_C}; G_{x,C}^⊤ G_{x,B} involves products of sizes R^{n_C×k}, R^{k×k}, R^{k×n_B}; and Z_B involves products of sizes R^{n_A×n_C}, R^{n_C×k}, R^{k×k}, R^{k×n_B}. While performing these products, we avoid products of size R^{O(n)×O(n)}. This allows us to have efficient storage requirements. Such manipulations are represented in Figure 2.

Figure 2: By performing the matrix multiplications in an efficient order (Equation (10)), we avoid products involving O(n) × O(n) objects. Instead, we use objects of size O(n) × k, which improves the speed, since k ≪ n. Equation (10) is equivalent to

\[
M_2 = \bigl(\mathrm{Pairs}_{A,B}\, \mathrm{Pairs}_{C,B}^{\dagger}\bigr)\, \mathrm{Pairs}_{C,B}\, \bigl(\mathrm{Pairs}_{B,C}^{\dagger}\bigr)^{\top} \mathrm{Pairs}_{A,C}^{\top} - \mathrm{shift},
\]

where shift = (α_0/(α_0+1)) ( M_1 M_1^⊤ − diag(M_1 M_1^⊤) ).

We do not explicitly calculate the pseudo-inverse but maintain the low rank matrix decomposition form. We then orthogonalize the third order moments to reduce the dimension of its modes to k. We perform linear transformations on the data corresponding to the partitions A, B and C using the whitening matrix. The whitened data is thus

\[
y_A^t := \bigl\langle W, G_{t,A}^{\top} \bigr\rangle, \quad y_B^t := \bigl\langle W, Z_B G_{t,B}^{\top} \bigr\rangle, \quad y_C^t := \bigl\langle W, Z_C G_{t,C}^{\top} \bigr\rangle,
\]

where t ∈ X denotes the index of the online data. Since k ≪ n, the dimensionality reduction is crucial for our speedup.

4.2 Efficient Randomized SVD Computations

When we consider very large-scale data, the whitening matrix is a bottleneck when we aim for fast running times. We obtain the low rank approximation of matrices using random projections.
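The random projection just mentioned can be sketched as follows: project M_2 onto a random matrix S and form the low rank approximation O Ω^† O^⊤ (made precise in the next paragraphs). For an exactly rank-k PSD matrix, this approximation recovers M_2 up to numerical error; the matrix sizes below are our own toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 500, 5
k_tilde = 2 * k                          # sketch size, k_tilde = 2k as in the paper

A = rng.standard_normal((n, k))
M2 = A @ A.T                             # toy rank-k PSD second moment (assumption)

S = rng.standard_normal((n, k_tilde))    # random projection matrix (dense case)
O = M2 @ S                               # n x k_tilde tall-thin projection
Omega = S.T @ M2 @ S                     # k_tilde x k_tilde core
# Omega is numerically rank k, not k_tilde, so truncate noise singular values.
M2_approx = O @ np.linalg.pinv(Omega, rcond=1e-10) @ O.T
```

Everything downstream (SVD or QR) then operates on the n × k̃ matrix O rather than on the n × n matrix M_2.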
In the CPU implementation, we use tall-thin SVD (on a sparse matrix) via the Lanczos algorithm after the projection, and in the GPU implementation, we use tall-thin QR. We give an overview of these methods below. Again, we use the graph community membership model without loss of generality.

Randomized low rank approximation: From (Gittens and Mahoney, 2013), for the rank-k positive semi-definite matrix M_2 ∈ R^{n_A×n_A} with n_A ≫ k, we can perform a random projection to reduce dimensionality. More precisely, if we have a random matrix S ∈ R^{n_A×k̃} with unit norm (rotation matrix), we project M_2 onto this random matrix to get an R^{n×k̃} tall-thin matrix. Note that we choose k̃ = 2k in our implementation. We then obtain a lower dimensional approximation of M_2 in R^{k̃×k̃}. Here we emphasize that S ∈ R^{n×k̃} is a random matrix for dense M_2; however, for sparse M_2, S ∈ {0,1}^{n×k̃} is a column selection matrix with a random sign for each entry. After the projection, one approach we use is an SVD on this tall-thin (R^{n×k̃}) matrix. Define O := M_2 S ∈ R^{n×k̃} and Ω := S^⊤ M_2 S ∈ R^{k̃×k̃}. A low rank approximation of M_2 is given by O Ω^† O^⊤ (Gittens and Mahoney, 2013). Recall that the definition of a whitening matrix W is that W^⊤ M_2 W = I. We can thus obtain the whitening matrix of M_2 without directly doing an SVD on M_2 ∈ R^{n_A×n_A}.

Tall-thin SVD: This is used in the CPU implementation. The whitening matrix can be obtained by

\[
W \approx \bigl(O^{\dagger}\bigr)^{\top} \bigl(\Omega^{1/2}\bigr)^{\top} . \tag{17}
\]

The pseudo-code for computing the whitening matrix W using tall-thin SVD is given in Algorithm 2. Therefore, we only need to compute the SVD of a tall-thin matrix O ∈ R^{n_A×k̃}.

Algorithm 2: Randomized Tall-thin SVD
Input: Second moment matrix M_2.
Output: Whitening matrix W.
1: Generate random matrix S ∈ R^{n×k̃} if M_2 is dense.
2: Generate column selection matrix with random sign S ∈ {0, 1}^{n × k̃} if M2 is sparse.
3: O = M2 S ∈ R^{n × k̃}
4: [U_O, L_O, V_O] = SVD(O)
5: Ω = S⊤ O ∈ R^{k̃ × k̃}
6: [U_Ω, L_Ω, V_Ω] = SVD(Ω)
7: W = U_O L_O^{−1} V_O⊤ V_Ω L_Ω^{1/2} U_Ω⊤

Since Ω ∈ R^{k̃ × k̃}, its square root is easy to compute. Similarly, pseudoinverses can also be obtained without directly performing an SVD. For instance, the pseudoinverse of the Pairs(B, C) matrix is given by

    (Pairs(B, C))† = (J†)⊤ Ψ J†,

where Ψ = S⊤ (Pairs(B, C)) S and J = (Pairs(B, C)) S. The pseudocode for computing pseudoinverses is given in Algorithm 3.

Algorithm 3 Randomized Pseudoinverse
Input: Pairs matrix Pairs(B, C).
Output: Pseudoinverse of the pairs matrix (Pairs(B, C))†.
1: Generate random matrix S ∈ R^{n × k} if Pairs(B, C) is dense.
2: Generate column selection matrix with random sign S ∈ {0, 1}^{n × k} if Pairs(B, C) is sparse.
3: J = (Pairs(B, C)) S
4: Ψ = S⊤ J
5: [U_J, L_J, V_J] = SVD(J)
6: (Pairs(B, C))† = U_J L_J^{−1} V_J⊤ Ψ V_J L_J^{−1} U_J⊤

The sparse representation of the data allows for scalability on a single machine to datasets having millions of nodes. Although the GPU has a SIMD architecture, which makes parallelization efficient, it lacks advanced libraries with sparse SVD operations and out-of-GPU-core implementations. We therefore implement the sparse format on the CPU for sparse datasets. We implement our algorithm using random projection for efficient dimensionality reduction (Clarkson and Woodruff, 2012) along with the sparse matrix operations available in the Eigen toolkit[2], and we use the SVDLIBC (Berry et al., 2002) library to compute sparse SVD via the Lanczos algorithm.
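As a concrete illustration, Algorithm 2 can be sketched in NumPy as below. This is a simplified dense sketch, not the paper's implementation: it uses a Gaussian projection, takes k̃ equal to the target rank so that Ω is invertible, and guards the inverted singular values; the paper's sparse Lanczos and column-selection variants are analogous.

```python
import numpy as np

def randomized_whitening(M2, k_tilde, seed=0):
    """Sketch of Algorithm 2: a whitening matrix W with W.T @ M2 @ W = I,
    computed from the tall-thin sketch O = M2 @ S instead of an SVD of M2."""
    n = M2.shape[0]
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((n, k_tilde))        # dense random projection
    O = M2 @ S                                   # n x k_tilde (tall-thin)
    Omega = S.T @ O                              # k_tilde x k_tilde
    U_O, L_O, Vt_O = np.linalg.svd(O, full_matrices=False)
    U_Om, L_Om, Vt_Om = np.linalg.svd(Omega)
    inv_L_O = np.where(L_O > 1e-10 * L_O[0], 1.0 / L_O, 0.0)  # guarded inverse
    # Line 7 of Algorithm 2: W = U_O L_O^{-1} V_O^T V_Omega L_Omega^{1/2} U_Omega^T
    return U_O @ np.diag(inv_L_O) @ Vt_O @ Vt_Om.T @ np.diag(np.sqrt(L_Om)) @ U_Om.T

# Rank-k PSD test matrix: whitening should give W.T @ M2 @ W close to I.
k = 4
A = np.random.default_rng(1).standard_normal((60, k))
M2 = A @ A.T
W = randomized_whitening(M2, k_tilde=k)
```

The randomized pseudoinverse of Algorithm 3 follows the same pattern, with an SVD of the sketch J = Pairs(B, C) S in place of O.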
Theoretically, the Lanczos algorithm (Golub and Van Loan, 2013) on an n × n matrix takes around (2d + 8)n flops for a single step, where d is the average number of non-zero entries per row.

Tall-thin QR: This is used in the GPU implementation due to the lack of a library for sparse tall-thin SVD. The difference is that we instead implement a tall-thin QR on O, so the whitening matrix is obtained as

    W ≈ Q (R†)⊤ (Ω^{1/2})⊤.

The main bottleneck for our GPU implementation is device storage, since GPU memory is highly limited and not expandable. Random projections help in reducing the dimensionality from O(n × n) to O(n × k), and hence this fits the data in the GPU memory better. Consequently, after the whitening step, we project the data into a k-dimensional space; the STGD step therefore depends only on k and can fit in the GPU memory. So the main bottleneck is the computation of large SVDs. In order to support larger datasets such as the DBLP dataset, which exceed the GPU memory capacity, we extend our implementation with out-of-GPU-core matrix operations and the Nyström method (Gittens and Mahoney, 2013) for the whitening matrix computation and the pseudoinverse computation in the pre-processing module.

4.3 Stochastic updates

STGD can potentially be the most computationally intensive task if carried out naively, since the storage and manipulation of an O(n³)-sized tensor makes the method non-scalable. However, we overcome this problem since we never form the tensor explicitly; instead, we collapse the tensor modes implicitly as shown in Figure 1. We gain a large speedup by optimizing the implementation of STGD. To implement the tensor operations efficiently, we

2. http://eigen.tuxfamily.org/index.php?title=Main_Page
Figure 3: Data transfers in the standard and device interfaces of the GPU implementation (the whitened vectors y_A^t, y_B^t, y_C^t and the iterates v_i^t move between CPU and GPU).

convert them into matrix and vector operations so that they are implemented using BLAS routines. We obtain whitened vectors y_A, y_B and y_C and manipulate these vectors efficiently to obtain tensor eigenvector updates using the gradient scaled by a suitable learning rate.

Efficient STGD via stacked vector operations: We convert the BLAS II operations into BLAS III operations by stacking the vectors to form matrices, leading to more efficient operations. Although the updating equation for the stochastic gradient update is presented serially in Equation (16), we can update the k eigenvectors simultaneously in parallel. The basic idea is to stack the k eigenvectors φ_i ∈ R^k into a matrix Φ, then use the internal parallelism designed for BLAS III operations. Overall, the STGD step involves 1 + k + i(2 + 3k) BLAS II operations over R^k vectors, 7N BLAS III operations over R^{k × k} matrices and 2 QR operations over R^{k × k} matrices, where i denotes the number of iterations. We provide a count of BLAS operations for the various steps in Table 1.

Module   BLAS I   BLAS II   BLAS III   SVD   QR
Pre        0         8         19       3     0
STGD       0        Nk         7N       0     2
Post       0         0          7       0     0

Table 1: Linear algebraic operation counts: N denotes the number of iterations for STGD and k the number of communities.

Reducing communication in the GPU implementation: In STGD, note that the storage needed for the iterative part does not depend on the number of nodes in the dataset; rather, it depends on the parameter k, i.e., the number of communities to be estimated, since the whitening performed before STGD leads to dimensionality reduction.
This makes it suitable for storing the required buffers in the GPU memory and using the CULA device interface for the BLAS operations. In Figure 3, we illustrate the data transfer involved in the GPU standard and device interface codes. While the standard interface involves data transfer (including the whitened neighborhood vectors and the eigenvectors) between the CPU memory and the GPU memory at each stochastic iteration, the device interface allocates and retains the eigenvectors on the device across stochastic iterations, which in turn speeds up the spectral estimation.

Figure 4: Comparison of the running time for STGD under different k for 100 iterations (running time in seconds vs. number of communities k, log-log scale; curves: MATLAB Tensor Toolbox, CULA standard interface, CULA device interface, Eigen sparse).

We compare the running time of the CULA device code with the MATLAB code (using the tensor toolbox (Bader et al., 2012)), the CULA standard code and the Eigen sparse code in Figure 4. As expected, the GPU implementations of matrix operations are much faster and scale much better than the CPU implementations. Among the CPU codes, we notice that the sparsity and optimization offered by the Eigen toolkit give us huge gains. We obtain orders of magnitude of speedup for the GPU device code, as we place the buffers in the GPU memory and transfer a minimal amount of data, involving the whitened vectors, only once at the beginning of each iteration. The running time for the CULA standard code is higher than for the device code because of the CPU-GPU data transfer overhead. For the same reason, the sparse CPU implementation, by avoiding the data transfer overhead, performs better than the GPU standard code for a very small number of communities.
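The stacking idea behind the BLAS II to BLAS III conversion can be illustrated as follows. The gradient used here is a placeholder with the same algebraic shape as a stochastic tensor-power update (a cross-eigenvector term plus a sample term), not the exact Equation (16); the point is only that k per-vector (BLAS II) updates collapse into a few matrix-matrix (BLAS III) products.

```python
import numpy as np

k = 8
rng = np.random.default_rng(2)
Phi = rng.standard_normal((k, k))   # k eigenvectors phi_i stacked as columns
y = rng.standard_normal(k)          # one whitened sample
beta = 0.01                         # learning rate

# Serial (BLAS II style): one gradient per eigenvector, all computed from
# the same iterate, then applied.
grads = np.empty_like(Phi)
for i in range(k):
    phi = Phi[:, i]
    grads[:, i] = Phi @ (Phi.T @ phi) - (phi @ y) ** 2 * y  # placeholder gradient
Phi_serial = Phi - beta * grads

# Stacked (BLAS III style): the same k updates as matrix products.
G = Phi @ (Phi.T @ Phi) - np.outer(y, (Phi.T @ y) ** 2)
Phi_stacked = Phi - beta * G

assert np.allclose(Phi_serial, Phi_stacked)
```

The two variants are algebraically identical; the stacked form simply exposes the work as large matrix products that BLAS III (and the GPU) can parallelize internally.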
We note that there is no performance degradation due to the parallelization of the matrix operations. After whitening, the STGD step requires the most code design and optimization effort, so we convert it into BLAS-like routines.

Module                                    Time                                        Space
Pre-processing (matrix multiplication)    O(max(nsk/c, log s))                        O(max(s², sk))
Pre-processing (CPU SVD)                  O(max(nsk/c, log s) + max(k²/c, k))         O(sk)
Pre-processing (GPU QR)                   O(max(sk²/c, log s) + max(sk²/c, log k))    O(sk)
Pre-processing (short-thin SVD)           O(max(k³/c, log k) + max(k²/c, k))          O(k²)
STGD                                      O(max(k³/c, log k))                         O(k²)
Post-processing                           O(max(nsk/c, log s))                        O(nk)

Table 2: The time and space complexity (number of compute cores required) of our algorithm. Note that k ≪ n; s is the average degree of a node (or equivalently, the average number of non-zeros per row/column in the adjacency sub-matrix); the STGD time is per iteration. We denote the number of cores as c; the time-space trade-off depends on this parameter.

4.4 Computational Complexity

We partition the execution of our algorithm into three main modules, namely pre-processing, STGD and post-processing, whose various matrix operation counts are listed above in Table 1. The theoretical asymptotic complexity of our method is summarized in Table 2 and is best addressed by considering the parallel model of computation (JáJá, 1992), i.e., wherein a number of processors or compute cores operate on the data simultaneously in parallel. This is justified considering that we implement our method on GPUs and matrix products are embarrassingly parallel. Note that this is different from serial computational complexity. We now break down the entries in Table 2.
First, we recall a basic lemma regarding the lower bound on the time complexity for parallel addition, along with the required number of cores to achieve a speed-up.

Lemma 6 (JáJá, 1992) Addition of s numbers in serial takes O(s) time; with Ω(s/log s) cores, this can be improved to O(log s) time in the best case.

Essentially, this speed-up is achieved by recursively adding pairs of numbers in parallel.

Lemma 7 (JáJá, 1992) Consider M ∈ R^{p × q} and N ∈ R^{q × r} with s non-zeros per row/column. Naive serial matrix multiplication requires O(psr) time; with Ω(psr/log s) cores, this can be improved to O(log s) time in the best case.

Lemma 7 follows by simply parallelizing the sparse inner products and applying Lemma 6 for the additions in the inner products. Note that this generalizes to the fact that, given c cores, the multiplication can be performed in O(max(psr/c, log s)) running time.

4.4.1 Pre-processing

Random projection: In pre-processing, given c compute cores, we first perform random projection using matrix multiplication. We multiply an O(n) × O(n) matrix M2 with an O(n) × O(k) random matrix S. This requires O(nsk) serial operations, where s is the number of non-zero elements per row/column of M2. Using Lemma 7, given c = nsk/log s cores, we can achieve O(log s) computational complexity. However, the parallel computational complexity is not further reduced with more than nsk/log s cores. After the multiplication, we use tall-thin SVD for the CPU implementation and tall-thin QR for the GPU implementation.

Tall-thin SVD: We perform Lanczos SVD on the tall-thin sparse O(n) × O(k) matrix, which involves a tri-diagonalization followed by QR on the tri-diagonal matrix.
Given c = nsk/log s cores, the computational complexity of the tri-diagonalization is O(log s). We then perform QR on the tridiagonal matrix, which is as cheap as O(k²) serially. Each orthogonalization requires O(k) inner products of constant-entry vectors, and there are O(k) such orthogonalizations to be done. Therefore, given O(k) cores, the complexity is O(k). More cores do not help, since the degree of parallelism is k.

Tall-thin QR: Alternatively, we perform QR in the GPU implementation, which takes O(sk²). To arrive at the complexity of obtaining Q, we analyze the Gram-Schmidt orthonormalization procedure under sparsity and parallelism conditions. Consider a serial Gram-Schmidt on the k columns (which are s-dense) of an O(n) × O(k) matrix. For each of columns 2 to k, we project onto the previously computed components and subtract the projection. Both the inner product and subtraction operations act on the s-dense columns; these are O(s) operations done O(k²) times serially. The last step is the normalization of k s-dense vectors, which is an O(sk) operation. This leads to a serial complexity of O(sk² + sk) = O(sk²). Using this, we may obtain the parallel complexity in different regimes of the number of cores as follows.

Parallelism for inner products: For each component i, we need i − 1 projections on previous components, which can be done in parallel. Each projection involves scaling and inner product operations on a pair of s-dense vectors. Using Lemma 6, the projection for component i can be performed in O(max(sk/c, log s)) time. The O(log s) complexity is obtained using O(sk/log s) cores.

Parallelism for subtractions: For each component i, we need i − 1 subtractions on an s-dense vector after the projection.
Serially, the subtraction requires O(sk) operations, and this can be reduced to O(log k) with O(sk/log k) cores in the best case. The complexity is O(max(sk/c, log k)). Combining the inner products and subtractions, the complexity is O(max(sk/c, log s) + max(sk/c, log k)) for component i. There are k components in total, which cannot be parallelized over. In total, the complexity for the parallel QR is O(max(sk²/c, log s) + max(sk²/c, log k)).

Short-thin SVD: The SVD of the smaller R^{k × k} matrix requires O(k³) computations serially. We note that this is the bottleneck for the computational complexity, but we emphasize that k is sufficiently small in many applications. Furthermore, this k³ complexity can be reduced by using distributed SVD algorithms, e.g., (Kannan et al., 2014; Feldman et al., 2013). An analysis with respect to parallel Lanczos SVD is similar to the discussion in the tall-thin SVD paragraph. The complexity is O(max(k³/c, log k) + max(k²/c, k)); in the best case, it is reduced to O(log k + k).

The serial time complexity of SVD is O(n²k), but with randomized dimensionality reduction (Gittens and Mahoney, 2013) and parallelization (Constantine and Gleich, 2011), this is significantly reduced.

4.4.2 STGD

In STGD, we perform implicit stochastic updates, consisting of a constant number of matrix-matrix and matrix-vector products, on the set of eigenvectors and whitened samples, which is of size k × k. When c ∈ [1, k³/log k], we obtain a running time of O(k³/c) for computing the inner products in parallel with c compute cores, since each core can perform an inner product to compute one element of the resulting matrix independently of the other cores in linear time. For c ∈ (k³/log k, ∞], using Lemma 6, we obtain a running time of O(log k).
Note that the STGD time complexity is calculated per iteration.

4.4.3 Post-processing

Finally, post-processing consists of sparse matrix products as well. Similar to pre-processing, this consists of multiplications involving the sparse matrices. Given s non-zeros per column of an O(n) × O(k) matrix, the effective number of elements reduces to O(sk). Hence, given c ∈ [1, nks/log s] cores, we need O(nsk/c) time to perform the inner products for each entry of the resultant matrix. For c ∈ (nks/log s, ∞], using Lemma 6, we obtain a running time of O(log s).

Note that nk² is the complexity of computing the exact SVD, and we reduce it to O(k) when there are sufficient cores available. This is meant for the setting where k is small. The k³ complexity of SVD on an O(k × k) matrix can be reduced to O(k) using distributed SVD algorithms, e.g., (Kannan et al., 2014; Feldman et al., 2013).

We note that the variational inference algorithm of Gopalan and Blei (Gopalan and Blei, 2013) has complexity O(mk) per iteration, where m denotes the number of edges in the graph and n < m < n². In the regime that n ≫ k, our algorithm is more efficient. Moreover, a big difference is in the scaling with respect to the size of the network and the ease of parallelization of our method compared to the variational one.

5. Validation methods

5.1 p-value testing

We recover the estimated community membership matrix Π̂ ∈ R^{k̂ × n}, where k̂ is the number of communities specified to our method. Recall that the true community membership matrix is Π, and we consider datasets where ground truth is available. Let the i-th row of Π̂ be denoted by Π̂_i. Our community detection method is unsupervised, which inevitably results in row permutations between Π and Π̂, and k̂ may not be the same as k.
To validate the results, we need to find a good match between the rows of Π̂ and Π. We use the notion of p-values to test for statistically significant dependencies among a set of random variables. The p-value denotes the probability of not rejecting the null hypothesis that the random variables under consideration are independent, and we use the Student's t-test statistic[3] (Fadem, 2012) to compute the p-value. We use multiple hypothesis testing for the different pairs of estimated and ground-truth communities (Π̂_i, Π_j) and adjust the p-values to ensure a small enough false discovery rate (FDR) (Strimmer, 2008).

Figure 5: Bipartite graph G{Pval} induced by p-value testing. Edges represent statistically significant relationships between ground truth and estimated communities.

The test statistic used for the p-value testing of the estimated communities is

    T_ij := ρ(Π̂_i, Π_j) √(n − 2) / √(1 − ρ(Π̂_i, Π_j)²).

The right p-value is the probability of obtaining a value (say t_ij) greater than the test statistic T_ij, i.e.,

    Pval(Π_i, Π̂_j) := P(t_ij > T_ij).

Note that T_ij has Student's t-distribution with n − 2 degrees of freedom (i.e. T_ij ∼ t_{n−2}); thus we obtain the right p-value[4]. In this way, we compute the Pval matrix as

    Pval(i, j) := Pval(Π̂_i, Π_j), ∀ i ∈ [k̂] and j ∈ [k].

5.2 Evaluation metrics

Recovery ratio: Validating the results requires a matching of the true membership Π with the estimated membership Π̂. Let Pval(Π_i, Π̂_j) denote the right p-value under the null

3. Note that Student's t-test is robust to the presence of unequal variances when the sample sizes of the two groups are equal, which is true in our setting.
4.
The right p-value accounts for the fact that when two communities are anti-correlated they are not paired up. Hence, note that in the special case of the block model, in which the estimated communities are just a permuted version of the ground truth communities, the pairing results in an accurate perfect matching.

hypothesis that Π_i and Π̂_j are statistically independent. We use the p-value test to find pairs (Π_i, Π̂_j) which pass a specified p-value threshold, and we denote such pairs using a bipartite graph G{Pval}. Thus, G{Pval} is defined as

    G{Pval} := ({V⁽¹⁾{Pval}, V⁽²⁾{Pval}}, E{Pval}),

where the nodes in the two node sets are V⁽¹⁾{Pval} = {Π_1, ..., Π_k} and V⁽²⁾{Pval} = {Π̂_1, ..., Π̂_k̂}, and the edges of G{Pval} satisfy

    (i, j) ∈ E{Pval} s.t. Pval(Π̂_i, Π_j) ≤ 0.01.

A simple example is shown in Figure 5, in which Π_2 has a statistically significant dependence with Π̂_1, i.e., the probability of not rejecting the null hypothesis is small (recall that the null hypothesis is that they are independent). If no estimated membership vector has a significant overlap with Π_3, then Π_3 is not recovered. There can also be multiple pairings, such as for Π_1 and {Π̂_2, Π̂_3, Π̂_6}: the p-value test between Π_1 and {Π̂_2, Π̂_3, Π̂_6} indicates that the probability of not rejecting the null hypothesis that they are independent is small. We use 0.01 as the threshold. The same holds for Π_2 and {Π̂_1} and for Π_4 and {Π̂_4, Π̂_5}. There can be a perfect one-to-one matching, as for Π_2 and Π̂_1, as well as a multiple matching, such as for Π_1 and {Π̂_2, Π̂_3, Π̂_6}, or another multiple matching such as for {Π_1, Π_2} and Π̂_3.
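A minimal sketch of this matching procedure (using SciPy, and omitting the FDR adjustment mentioned in Section 5.1) might look like:

```python
import numpy as np
from scipy import stats

def pvalue_edges(Pi_hat, Pi, threshold=0.01):
    """Build the edge set of G{Pval}: pair estimated row i with ground-truth
    row j when the right p-value of their correlation t-statistic is small."""
    n = Pi.shape[1]
    edges = []
    for i in range(Pi_hat.shape[0]):
        for j in range(Pi.shape[0]):
            rho = np.corrcoef(Pi_hat[i], Pi[j])[0, 1]
            T = rho * np.sqrt(n - 2) / np.sqrt(1 - rho ** 2)
            if stats.t.sf(T, df=n - 2) <= threshold:  # right p-value P(t > T)
                edges.append((i, j))
    return edges

# Toy check: Pi_hat is a permuted, noisy copy of Pi, so the permuted pairs
# should pass the test.
rng = np.random.default_rng(3)
Pi = rng.random((2, 300))
Pi_hat = Pi[[1, 0]] + 0.01 * rng.standard_normal((2, 300))
edges = pvalue_edges(Pi_hat, Pi)
```

Since the right p-value is one-sided, strongly anti-correlated pairs produce a large p-value and are never paired, matching the footnote's remark.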
Let Degree_i denote the degree of ground truth community i ∈ [k] in G{Pval}. We define the recovery ratio as follows.

Definition 8 The recovery ratio is defined as

    R := (1/k) Σ_i I{Degree_i > 0}, i ∈ [k],

where I(x) is the indicator function whose value equals one if x is true.

The perfect case is that all the memberships have at least one significantly overlapping estimated membership, giving a recovery ratio of 100%.

Error function: For the performance analysis of our learning algorithm, we use an error function given as follows:

Definition 9 The average error function is defined as

    E := (1/k) Σ_{(i,j) ∈ E{Pval}} [ (1/n) Σ_{x ∈ X} | Π̂_i(x) − Π_j(x) | ],

where E{Pval} denotes the set of edges based on thresholding of the p-values.

Hardware / software    Version
CPU                    Dual 8-core Xeon @ 2.0GHz
Memory                 64GB DDR3
GPU                    Nvidia Quadro K5000
CUDA cores             1536
Global memory          4GB GDDR5
CentOS                 Release 6.4 (Final)
GCC                    4.4.7
CUDA                   Release 5.0
CULA-Dense             R16a

Table 3: System specifications.

The error function incorporates two aspects: the l1-norm error between each estimated community and the corresponding paired ground truth community, and the error induced by false pairings between the estimated and ground-truth communities through p-value testing. For the former l1-norm error, we normalize by n, which is reasonable and keeps each term in the range [0, 1]. For the latter, we define the average error function as the summation of all paired membership errors divided by the true number of communities k. In this way, we penalize falsely discovered pairings by summing them up. Our error function can be greater than 1 if there are too many falsely discovered pairings through p-value testing (of which there can be as many as k × k̂).
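Definitions 8 and 9 translate directly into code. This sketch assumes the edge set has already been computed by the p-value test, with edges stored as (estimated index i, ground-truth index j) pairs.

```python
import numpy as np

def recovery_ratio_and_error(edges, Pi_hat, Pi):
    """Definition 8 (recovery ratio R) and Definition 9 (average error E)."""
    k, n = Pi.shape
    matched = {j for _, j in edges}       # ground-truth communities with degree > 0
    R = len(matched) / k
    E = sum(np.abs(Pi_hat[i] - Pi[j]).sum() / n for i, j in edges) / k
    return R, E

# Toy example: two ground-truth communities, one recovered exactly, one
# never paired at all.
Pi = np.array([[1.0, 0.0, 1.0, 0.0],
               [0.0, 1.0, 0.0, 1.0]])
Pi_hat = np.array([[1.0, 0.0, 1.0, 0.0]])
R, E = recovery_ratio_and_error([(0, 0)], Pi_hat, Pi)
# R = 0.5 (one of two communities matched), E = 0.0
```

Note that E sums over all edges, so spurious pairings add to the error even when each individual l1 term is small, exactly as discussed above.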
Bridgeness: Bridgeness in overlapping communities is an interesting measure to evaluate. A bridge is defined as a vertex that crosses structural holes between discrete groups of people, and bridgeness analyzes the extent to which a given vertex is shared among different communities (Nepusz et al., 2008). Formally, the bridgeness of a vertex i is defined as

    b_i := 1 − √( (k̂/(k̂ − 1)) Σ_{j=1}^{k̂} (Π̂_i(j) − 1/k̂)² ).    (18)

Note that centrality measures should be used in conjunction with the bridge score to distinguish outliers from genuine bridge nodes (Nepusz et al., 2008). The degree-corrected bridgeness is used to evaluate our results and is defined as

    B_i := D_i b_i,    (19)

where D_i is the degree of node i.

6. Experimental Results

The specifications of the machine on which we run our code are given in Table 3.

Results on Synthetic Datasets: We perform experiments for both the stochastic block model (α0 = 0) and the mixed membership model. For the mixed membership model, we set the concentration parameter α0 = 1. We note that the error is around 8%-14% and the running times are under a minute when n ≤ 10000 and n ≫ k.[5]

We observe that more samples result in a more accurate recovery of memberships, which matches intuition and theory. Overall, our learning algorithm performs better in the stochastic block model case than in the mixed membership model case, although we note that the accuracy is quite high for practical purposes. Theoretically, this is expected, since a smaller concentration parameter α0 is easier for our algorithm to learn (Anandkumar et al., 2013b).

5. The code is available at https://github.com/FurongHuang/Fast-Detection-of-Overlapping-Communities-via-Online-Tensor-Methods
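Equations (18) and (19) above are straightforward to compute from a node's membership vector; a small sketch:

```python
import numpy as np

def bridgeness(pi_i):
    """Equation (18): bridgeness of a vertex from its membership vector
    pi_i over k_hat communities (assumed to sum to one)."""
    k_hat = len(pi_i)
    return 1.0 - np.sqrt(k_hat / (k_hat - 1) * np.sum((pi_i - 1.0 / k_hat) ** 2))

def degree_corrected_bridgeness(pi_i, degree):
    """Equation (19): B_i = D_i * b_i."""
    return degree * bridgeness(pi_i)

# A vertex shared equally by all communities is a maximal bridge (b = 1);
# a vertex belonging to a single community is not a bridge (b = 0).
assert np.isclose(bridgeness(np.full(4, 0.25)), 1.0)
assert np.isclose(bridgeness(np.array([1.0, 0.0, 0.0, 0.0])), 0.0)
```

The k̂/(k̂ − 1) factor normalizes the score so that these two extremes land exactly at 1 and 0, which is why pairing b_i with a centrality measure (as the text recommends) is needed to separate low-degree outliers from genuine bridges.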
Also, our algorithm is scalable to an order of magnitude more in n, as illustrated by experiments on real-world large-scale datasets. Note that we threshold the estimated memberships to clean the results. There is a tradeoff between match ratio and average error via different thresholds. In synthetic experiments, the tradeoff is not evident, since a perfect matching is always present. However, we need to handle this carefully in experiments involving real data.

Results on Topic Modeling: We perform experiments on the bag-of-words dataset (Bache and Lichman, 2013) for The New York Times. We set the concentration parameter to α0 = 1 and observe the top recovered words in numerous topics. The results are in Table 4. Many of the results are expected; for example, the top words in topic #11 are all related to some bad personality. We also present the words with the most spread membership, i.e., words that belong to many topics, in Table 5. As expected, we see minutes, consumer, human, member and so on. These words can appear in a lot of topics, and we expect them to connect topics.

Results on Real-world Graph Datasets: We describe the results on the real datasets summarized in Table 6 in detail below; the results are presented in Table 7. We note that our method, in both dense and sparse implementations, performs very well compared to the state-of-the-art variational method. For the Yelp dataset, we have a bipartite graph with the business nodes on one side and the user nodes on the other, and we use the review stars as the edge weights. In this bipartite setting, the variational code provided by Gopalan et al. (Gopalan et al., 2012) does not work, since it is not applicable to non-homophilic models. Our approach does not have this restriction.
Note that we use our dense implementation on the GPU to run experiments with a large number of communities k, as the device implementation is much faster in terms of the running time of the STGD step. On the other hand, the sparse implementation on the CPU is fast and memory efficient for sparse graphs with a small number of communities, while the dense implementation on the GPU is faster for denser graphs such as Facebook. Note that the data reading time for DBLP is around 4700 seconds, which is not negligible compared to the other datasets (usually within a few seconds). Effectively, our algorithm, excluding the file I/O time, executes within two minutes for k = 10 and within ten minutes for k = 100.

Interpretation on the Yelp Dataset: The ground truth on business attributes such as location and type of business is available (but not provided to our algorithm), and we show the distribution in Figure 6 on the left side. There is also a natural trade-off between recovery ratio and average error, i.e., between attempting to recover all the business communities and the accuracy of recovery. We can either recover the top significant communities with high accuracy or recover more of them with lower accuracy. We demonstrate this trade-off in Figure 6 on the right side. We select the top ten categories recovered with the lowest error and report the businesses with the highest weights in Π̂.
Among the matched communities, we find the businesses with

Topic #   Top Words
1    prompting complicated eviscerated predetermined lap renegotiating loose entity legalese justice
2    hamstrung airbrushed quasi outsold fargo ennobled tantalize irrelevance noncontroversial untalented
3    scariest pest knowingly causing flub mesmerize dawned millennium ecological ecologist
4    reelection quixotic arthroscopic versatility commanded hyperextended anus precipitating underhand knee
5    believe signing ballcarrier parallel anomalies munching prorated unsettle linebacking bonus
6    gainfully settles narrator considerable articles narrative rosier deviating protagonist deductible
7    faithful betcha corrupted inept retrench martialed winston dowdy islamic corrupting
8    capable misdeed dashboard navigation opportunistically aerodynamic airbag system braking mph
9    apostles oracles believer deliberately loafer gospel apt mobbed manipulate dialogue
10   physique jumping visualizing hedgehog zeitgeist belonged loo mauling postproduction plunk
11   smirky silly bad natured frat thoughtful freaked moron obtuse stink
12   offsetting preparing acknowledgment agree misstating litigator prevented revoked preseason entomology
13   undertaken wilsonian idealism brethren writeoff multipolar hegemonist multilateral enlargement mutating
14   athletically fictitious myer majorleaguebaseball familiarizing resurrect slug backslide superseding artistically
15   dialog files diabolical lion town password list swiss coldblooded outgained
16   recessed phased butyl lowlight balmy redlining prescription marched mischaracterization tertiary
17   sponsor televise sponsorship festival sullied ratification insinuating warhead staged reconstruct
18   trespasses buckle divestment schoolchild refuel ineffectiveness coexisted repentance divvying overexposed

Table 4: Top recovered topic
groups from the New York Times dataset, along with the words present in them.

Keywords: minutes, consumer, human, member, friend, program, board, cell, insurance, shot

Table 5: The top ten words which occur in multiple contexts in the New York Times dataset.

Statistics   Facebook   Yelp            DBLP sub    DBLP
|E|          766,800    672,515         5,066,510   16,221,000
|V|          18,163     10,010+28,588   116,317     1,054,066
GD           0.004649   0.000903        0.000749    0.000029
k            360        159             250         6,003
AB           0.5379     0.4281          0.3779      0.2066
ADCB         47.01      30.75           48.41       6.36

Table 6: Summary of real datasets used in our paper: |V| is the number of nodes in the graph, |E| is the number of edges, GD is the graph density given by 2|E|/(|V|(|V| − 1)), k is the number of communities, AB is the average bridgeness and ADCB is the average degree-corrected bridgeness (explained in Section 5).

Data     Method         k̂    Thre    E       R(%)   Time(s)
FB       Ten(sparse)    10   0.10    0.063   13     35
FB       Ten(sparse)    100  0.08    0.024   62     309
FB       Ten(sparse)    100  0.05    0.118   95     309
FB       Ten(dense)     100  0.100   0.012   39     190
FB       Ten(dense)     100  0.070   0.019   100    190
FB       Variational    100  –       0.070   100    10,795
FB       Ten(dense)     500  0.020   0.014   71     468
FB       Ten(dense)     500  0.015   0.018   100    468
FB       Variational    500  –       0.031   100    86,808
YP       Ten(sparse)    10   0.10    0.271   43     10
YP       Ten(sparse)    100  0.08    0.046   86     287
YP       Ten(dense)     100  0.100   0.023   43     1,127
YP       Ten(dense)     100  0.090   0.061   80     1,127
YP       Ten(dense)     500  0.020   0.064   72     1,706
YP       Ten(dense)     500  0.015   0.336   100    1,706
DB sub   Ten(dense)     100  0.15    0.072   36     7,664
DB sub   Ten(dense)     100  0.09    0.260   80     7,664
DB sub   Variational    100  –       7.453   99     69,156
DB sub   Ten(dense)     500  0.10    0.010   19     10,157
DB sub   Ten(dense)     500  0.04    0.139   89     10,157
DB sub   Variational    500  –       16.38   99     558,723
DB       Ten(sparse)    10   0.30    0.103   73     4716
DB       Ten(sparse)    100  0.08    0.003   57     5407
DB       Ten(sparse)    100  0.05    0.
105  95           5407

Table 7: Yelp, Facebook and DBLP main quantitative evaluation of the tensor method versus the variational method: k^ is the community number specified to our algorithm, and Thre is the threshold for picking significant estimated membership entries. Refer to Table 6 for statistics of the datasets.

We can see that most of the "top" recovered businesses are rated high. Many of the categories in the top ten list are restaurants, as they have a large number of reviewers. Our method can recover the restaurant category with high accuracy, and the specific restaurant in the category is a popular result (with a high number of stars). Our method can also accurately recover many of the categories with low review counts, such as hobby shops, yoga, churches, galleries and religious organizations, which are

Business               RC   Categories
Four Peaks Brewing Co  735  Restaurants, Bars, American (New), Nightlife, Food, Pubs, Tempe
Pizzeria Bianco        803  Restaurants, Pizza, Phoenix
FEZ                    652  Restaurants, Bars, American (New), Nightlife, Mediterranean, Lounges, Phoenix
Matt's Big Breakfast   689  Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Company  580  Restaurants, Bars, Nightlife, Pubs, Tempe
Postino Arcadia        575  Restaurants, Italian, Wine Bars, Bars, Nightlife, Phoenix
Cibo                   594  Restaurants, Italian, Pizza, Sandwiches, Phoenix
Phoenix Airport        862  Hotels & Travel, Phoenix
Gallo Blanco Cafe      549  Restaurants, Mexican, Phoenix
The Parlor             489  Restaurants, Italian, Pizza, Phoenix

Table 8: Top 10 bridging businesses in Yelp and the categories they belong to. "RC" denotes review counts for that particular business.
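The role of the "Thre" column in Table 7 is mechanical: estimated membership entries below the threshold are zeroed out before error and recovery ratio are computed. A minimal sketch of that step (the helper name and matrix layout are ours, for illustration only, not the paper's evaluation code):

```python
import numpy as np

def threshold_membership(Pi_hat, thre):
    """Zero out estimated membership weights below `thre`.

    Pi_hat is a (communities x nodes) matrix of estimated membership weights;
    this mirrors the role of the 'Thre' column in Table 7.
    """
    return np.where(Pi_hat >= thre, Pi_hat, 0.0)

Pi_hat = np.array([[0.70, 0.05],
                   [0.20, 0.85]])
print(threshold_membership(Pi_hat, 0.10))
```

A larger threshold keeps only confident memberships (lower error but lower recovery ratio), which is the tradeoff visible across the rows of Table 7.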
Figure 6: Distribution of business categories (left) and the tradeoff between recovery ratio and average error for Yelp (right).

the "niche" categories with a dedicated set of reviewers, who mostly do not review other categories. The top bridging nodes recovered by our method for the Yelp dataset are given in Table 8. The bridging nodes typically have multiple attributes: the type of business and its location. In addition, the categories may also be hierarchical: within restaurants, different cuisines such as Italian, American or Pizza are recovered by our method. Moreover, restaurants which also function as bars or lounges are likewise recovered as top bridging nodes. Thus, our method can recover multiple attributes for the businesses efficiently.

Among all 11537 businesses, 89.39% are still open, and we only select those businesses which are still open. There are 285 categories in total. After we remove all the categories having no more than 20 businesses, 134 categories remain. We generate the community membership matrix for business categories Π_c ∈ R^{k_c × n}, where k_c := 134 is the number of remaining categories and n := 10141 is the number of businesses remaining after removing all the negligible categories. All the businesses collected in the Yelp data are in AZ except 3 of them (one is in CA, one in CO and the other in

Category        Business      Star(B)  Star(C)  RC(B)  RC(C)
Latin American  Salvadoreno   4.0      3.94     36     93.8
Gluten Free     P.F. Chang's  3.5      3.72     55     50.6
Hobby Shops     Make Meaning  4.5      4.13     14     7.
6
Mass Media      KJZZ 91.5FM         4.0  3.63  13  5.6
Yoga            Sutra Midtown       4.5  4.55  31  12.6
Churches        St Andrew Church    4.5  4.52  3   4.2
Art Galleries   Sette Lisa          4.5  4.48  4   6.6
Libraries       Cholla Branch       4.0  4.00  5   11.2
Religious       St Andrew Church    4.5  4.40  3   4.2
Wickenburg      Taste of Caribbean  4.0  3.66  60  6.7

Table 9: Most accurately recovered categories and businesses with the highest membership weights for the Yelp dataset. "Star(B)" denotes the review stars that the business receives and "Star(C)" the average review stars that businesses in that category receive. "RC(B)" denotes the review counts for that business and "RC(C)" the average review counts in that category.

SC). We remove the three businesses outside AZ. We notice that most of the businesses are spread over 25 cities. The community membership matrix for location is defined as Π_l ∈ R^{k_l × n}, where k_l := 25 is the number of cities and n := 10010 is the number of businesses. The distribution of locations is given in Table 11.

The stars a business receives can vary from 1 (the lowest) to 5 (the highest); the higher the score, the more satisfied the customers are. The average star score is 3.6745, and the distribution is given in Table 10. There are also review counts for each business, namely the number of reviews that the business receives from all the users. The minimum review count is 3, the maximum is 862, and the mean is 20.1929. This preprocessing helps us to pick out the top communities.

There are 5 attributes associated with all the 11537 businesses: "Open", "Categories", "Location", "Review Counts" and "Stars". We model the ground truth communities as a combination of "Categories" and "Location". We select business categories with more than 20 members and remove all businesses which are closed; 10010 businesses remain.
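The preprocessing above (drop categories with too few members, then mark each remaining business-category pair) can be sketched as follows; function and variable names are illustrative and not from the paper's implementation:

```python
from collections import Counter
import numpy as np

def build_membership(business_categories, min_size=20):
    """Build a binary category-membership matrix Pi_c.

    business_categories: one list of category labels per business.
    Categories with at most `min_size` members are dropped, mirroring the
    preprocessing described above.
    """
    counts = Counter(c for cats in business_categories for c in cats)
    kept = sorted(c for c, cnt in counts.items() if cnt > min_size)
    index = {c: i for i, c in enumerate(kept)}
    Pi_c = np.zeros((len(kept), len(business_categories)))
    for j, cats in enumerate(business_categories):
        for c in cats:
            if c in index:
                Pi_c[index[c], j] = 1.0
    return Pi_c, kept

# Toy usage with a small threshold:
Pi_c, kept = build_membership([["Restaurants"], ["Restaurants", "Bars"], ["Bars"]],
                              min_size=1)
print(kept)  # ['Bars', 'Restaurants']
```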
Only 28588 users are involved in reviews of the 10010 businesses. There are 3 attributes associated with all the 28588 users: gender ("Female" or "Male"), "Review Counts" and "Stars". Although we do not directly know the gender information from the dataset, a name-gender guesser⁶ is used to estimate gender from user names.

We provide some sample visualization results in Figure 7 for both the ground truth and the estimates from our algorithm. We sub-sample the users and businesses, group the users into male and female categories, and consider nail salon and tire businesses. Analysis of the ground truth reveals that nail salon and tire businesses are very discriminative of the user genders, and thus we employ them for visualization. We note that both the nail salon and tire businesses are categorized with high accuracy, while users are categorized with poorer accuracy.

Our algorithm can also recover the attributes of users. However, the ground truth available about users is far more limited than for businesses, and we only have information on gender, average review counts and average stars (we infer the gender of the users through their names). Our algorithm can recover all these attributes. We observe that gender

6. https://github.com/amacinho/Name-Gender-Guesser by Amac Herdagdelen.

Star Score  Num of businesses  Percentage
1.0         108                0.94%
1.5         170                1.47%
2.0         403                3.49%
2.5         1011               8.76%
3.0         1511               13.10%
3.5         2639               22.87%
4.0         2674               23.18%
4.5         1748               15.15%
5.0         1273               11.03%

Table 10: Distribution of business star scores.
City             State  Num of businesses
Anthem           AZ     34
Apache Junction  AZ     46
Avondale         AZ     129
Buckeye          AZ     31
Casa Grande      AZ     48
Cave Creek       AZ     65
Chandler         AZ     865
El Mirage        AZ     11
Fountain Hills   AZ     49
Gilbert          AZ     439
Glendale         AZ     611
Goodyear         AZ     126
Laveen           AZ     22
Maricopa         AZ     31
Mesa             AZ     898
Paradise Valley  AZ     57
Peoria           AZ     267
Phoenix          AZ     4155
Queen Creek      AZ     78
Scottsdale       AZ     2026
Sun City         AZ     37
Surprise         AZ     161
Tempe            AZ     1153
Tolleson         AZ     22
Wickenburg       AZ     28

Table 11: Distribution of business locations. Only the top cities with more than 10 businesses are presented.

Figure 7: Ground truth (left) vs estimated business and user categories (right). The error in the estimated graph due to misclassification is shown by the mixed colours.

is the hardest to recover, while review counts are the easiest. We see that the other user attributes recovered by our algorithm correspond to valuable user information such as their interests, location, age, lifestyle, etc. This is useful, for instance, for businesses studying the characteristics of their users, for delivering better personalized advertisements to users, and so on.

Facebook Dataset: A snapshot of the Facebook network of UNC (Traud et al., 2010) is provided with user attributes. The ground truth communities are based on user attributes given in the dataset, which are not exposed to the algorithm. There are 360 top communities with a sufficient number (at least 20) of users. Our algorithm can recover these attributes with high accuracy; see the main paper for our method's results compared with the variational inference result (Gopalan et al., 2012).
We also obtain results for a range of values of α_0 (Figure 8). We observe that the recovery ratio improves with larger α_0, since a larger α_0 can recover overlapping communities more efficiently, while the error score remains relatively the same. For the Facebook dataset, the top ten communities recovered with the lowest error consist of certain high schools, second majors and dorms/houses. We observe that high school attributes are the easiest to recover, and second major and dorm/house are reasonably easy to recover, by looking at the friendship relations in Facebook. This is reasonable: college students from the same high school have a high probability of being friends, and so do college students from the same dorm.

DBLP Dataset: The DBLP data contains bibliographic records⁷ with various publication venues, such as journals and conferences, which we model as communities. We then consider authors who have published at least one paper in a community (publication venue) as a member of it. Co-authorship is thus modeled as a link in the graph in which authors are

7. http://dblp.uni-trier.de/xml/Dblp.xml

Figure 8: Performance analysis of the Facebook dataset under different settings of the concentration parameter α_0 for k^ = 100.
represented as nodes. In this framework, we can recover the top authors in communities as well as the bridging authors.

7. Conclusion

In this paper, we presented a fast and unified moment-based framework for learning overlapping communities as well as topics in a corpus. There are several key insights involved. First, our approach follows a systematic and guaranteed learning procedure, in contrast to several heuristic approaches which may not have strong statistical recovery guarantees. Second, though using a moment-based formulation may seem computationally expensive at first sight, implementing implicit "tensor" operations leads to significant speed-ups of the algorithm. Third, employing randomized methods for spectral computations is promising, since the running time can then be significantly reduced.

This paper paves the way for several interesting directions for further research. While our current deployment incorporates community detection in a single graph, extensions to multi-graphs and hypergraphs are possible in principle. A careful and efficient implementation for such settings will be useful in a number of applications. It is natural to extend the deployment to even larger datasets by using cloud-based systems; the issues of efficient partitioning of data and of reducing communication between the machines become significant there. Combining our approach with other simple community detection approaches to gain even more speedups can also be explored.

Acknowledgement

The first author is supported by NSF BIGDATA IIS-1251267, the second author is supported in part by a UCI graduate fellowship and NSF Award CCF-1219234, and the last author is supported in part by a Microsoft Faculty Fellowship, NSF Career Award CCF-1254106, NSF Award CCF-1219234, and ARO YIP Award W911NF-13-1-0084. The authors acknowledge
insightful discussions with Prem Gopalan, David Mimno, David Blei, Qirong Ho, Eric Xing, Carter Butts, Blake Foster, Rui Wang, Sridhar Mahadevan, and the CULA team. Special thanks to Prem Gopalan and David Mimno for providing the variational code and answering all our questions. The authors also thank Daniel Hsu and Sham Kakade for initial discussions regarding the implementation of the tensor method. We also thank Dan Melzer for helping us with the system-related issues.

References

Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981-2014, June 2008.

A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for latent variable models, 2012.

A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade. A tensor spectral approach to learning mixed membership community models. ArXiv 1302.2684, Feb. 2013a.

A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade. A tensor spectral approach to learning mixed membership community models. In Conference on Learning Theory (COLT), June 2013b.

Raman Arora, Andrew Cotter, Karen Livescu, and Nathan Srebro. Stochastic optimization for PCA and PLS. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pages 861-868, 2012. doi: 10.1109/Allerton.2012.6483308.

K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Brett W. Bader, Tamara G. Kolda, et al. MATLAB Tensor Toolbox version 2.5. Available online, January 2012. URL http://www.sandia.gov/~tgkolda/TensorToolbox/.

Grey Ballard, Tamara Kolda, and Todd Plantenga. Efficiently computing tensor eigenvalues on a GPU.
In Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on, pages 1340-1348. IEEE, 2011.

Arindam Banerjee and John Langford. An objective evaluation criterion for clustering. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 515-520. ACM, 2004.

Michael Berry, Theresa Do, Gavin O'Brien, Vijay Krishna, and Sowmini Varadhan. SVDLIBC version 1.4. Available online, 2002. URL http://tedlab.mit.edu/~dr/SVDLIBC/.

David M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77-84, 2012.

Yudong Chen, Sujay Sanghavi, and Huan Xu. Clustering sparse graphs. arXiv preprint arXiv:1210.3335, 2012.

Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. CoRR, abs/1207.6365, 2012.

Paul G. Constantine and David F. Gleich. Tall and skinny QR factorizations in MapReduce architectures. In Proceedings of the Second International Workshop on MapReduce and its Applications, pages 43-50. ACM, 2011.

Barbara Fadem. High-Yield Behavioral Science. LWW, 2012.

Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1434-1453. SIAM, 2013.

Alex Gittens and Michael W. Mahoney. Revisiting the Nystrom method for improved large-scale machine learning. arXiv preprint arXiv:1303.1849, 2013.

Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD, 4th edition, 2013. ISBN 978-1-4214-0794-4.

P. Gopalan, D. Mimno, S. Gerrish, M.
Freedman, and D. Blei. Scalable inference of overlapping communities. In Advances in Neural Information Processing Systems 25, pages 2258-2266, 2012.

Prem K. Gopalan and David M. Blei. Efficient discovery of overlapping communities in massive networks. Proceedings of the National Academy of Sciences, 110(36):14534-14539, 2013.

Joseph JáJá. An Introduction to Parallel Algorithms. Addison Wesley Longman Publishing Co., Inc., 1992.

Ravindran Kannan, Santosh S. Vempala, and David P. Woodruff. Principal component analysis and higher correlations for distributed data. In Proceedings of The 27th Conference on Learning Theory, pages 1040-1057, 2014.

Brian Karrer and Mark E. J. Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, 2011.

H. J. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Applications of Mathematics Series. Springer, 2003. ISBN 9780387008943. URL http://books.google.com/books?id=_0bIieuUJGkC.

Andrea Lancichinetti and Santo Fortunato. Community detection algorithms: a comparative analysis. Physical Review E, 80(5):056117, 2009.

Andrea Lancichinetti, Santo Fortunato, and János Kertész. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11(3):033015, 2009.

M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, pages 415-444, 2001.

F. McSherry. Spectral partitioning of random graphs. In FOCS, 2001.

Andriy Mnih and Ruslan Salakhutdinov. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pages 1257-1264, 2007.

Tamás Nepusz, Andrea Petróczi, László Négyessy, and Fülöp Bazsó.
Fuzzy communities and the concept of bridgeness in complex networks. Physical Review E, 77(1):016107, 2008.

Erkki Oja and Juha Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106(1):69-84, 1985.

Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, pages 880-887. ACM, 2008.

Martin D. Schatz, Tze Meng Low, Robert A. van de Geijn, and Tamara G. Kolda. Exploiting symmetry in tensors for high performance. arXiv preprint arXiv:1301.7744, 2013.

Robert R. Sokal and F. James Rohlf. The comparison of dendrograms by objective methods. Taxon, 11(2):33-40, 1962.

Jyothish Soman and Ankur Narang. Fast community detection algorithm with GPUs and multicore architectures. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 568-579. IEEE, 2011.

Korbinian Strimmer. fdrtool: a versatile R package for estimating local and tail area-based false discovery rates. Bioinformatics, 24(12):1461-1462, 2008.

Amanda L. Traud, Eric D. Kelsic, Peter J. Mucha, and Mason A. Porter. Comparing community structure to characteristics in online collegiate social networks. SIAM Review, in press (arXiv:0809.0960), 2010.

Jaewon Yang and Jure Leskovec. Defining and evaluating network communities based on ground-truth. In Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, page 3. ACM, 2012.

Yu Zhang and Dit-Yan Yeung. Overlapping community detection via bounded nonnegative matrix tri-factorization. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 606-614, New York, NY, USA, 2012. ACM.
ISBN 978-1-4503-1462-6. doi: 10.1145/2339530.2339629. URL http://doi.acm.org/10.1145/2339530.2339629.

8. Appendix

Appendix A. Stochastic Updates

After obtaining the whitening matrix, we whiten the data G^\top_{x,A}, G^\top_{x,B} and G^\top_{x,C} by linear operations to get y^t_A, y^t_B and y^t_C \in R^k:

    y^t_A := \langle G^\top_{x,A}, W \rangle, \quad y^t_B := \langle Z_B G^\top_{x,B}, W \rangle, \quad y^t_C := \langle Z_C G^\top_{x,C}, W \rangle,

where x \in X and t denotes the index of the online data point. The stochastic gradient descent algorithm is obtained by taking the derivative of the loss function:

    \frac{\partial L^t(v)}{\partial v_i} = \theta \sum_{j=1}^{k} \langle v_j, v_i \rangle^2 v_j
    - \frac{(\alpha_0+1)(\alpha_0+2)}{2} \langle v_i, y^t_A \rangle \langle v_i, y^t_B \rangle y^t_C
    - \alpha_0^2 \langle \phi^t_i, \bar{y}_A \rangle \langle \phi^t_i, \bar{y}_B \rangle \bar{y}_C
    + \frac{\alpha_0(\alpha_0+1)}{2} \langle \phi^t_i, y^t_A \rangle \langle \phi^t_i, y^t_B \rangle \bar{y}_C
    + \frac{\alpha_0(\alpha_0+1)}{2} \langle \phi^t_i, y^t_A \rangle \langle \phi^t_i, \bar{y}_B \rangle y^t_C
    + \frac{\alpha_0(\alpha_0+1)}{2} \langle \phi^t_i, \bar{y}_A \rangle \langle \phi^t_i, y^t_B \rangle y^t_C

for i \in [k], where y^t_A, y^t_B and y^t_C are the online whitened data points as discussed in the whitening step and \theta is a constant factor that we can set. The iterative update equation for the stochastic gradient step is given by

    \phi^{t+1}_i \leftarrow \phi^t_i - \beta^t \left. \frac{\partial L^t}{\partial v_i} \right|_{\phi^t_i}    (20)

for i \in [k], where \beta^t is the learning rate, \phi^t_i is the eigenvector from the previous iteration and \phi^{t+1}_i is the updated eigenvector.
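To make the update concrete, consider the stochastic block model special case α_0 = 0, where every centered (barred) term vanishes and the coefficient (α_0+1)(α_0+2)/2 reduces to 1, leaving the orthogonality penalty plus a single rank-one term. A minimal NumPy sketch under that simplification (an illustration, not the paper's GPU implementation):

```python
import numpy as np

def stgd_step(Phi, yA, yB, yC, beta=0.05, theta=1.0):
    """One stochastic gradient step for alpha_0 = 0.

    Phi: (k, k) matrix whose columns phi_i are the current eigenvector
    estimates; yA, yB, yC: whitened sample vectors in R^k. For alpha_0 = 0
    the shift term reduces to <phi_i, yA><phi_i, yB> yC.
    """
    G = Phi.T @ Phi                       # G[j, i] = <phi_j, phi_i>
    penalty = theta * Phi @ (G ** 2)      # column i: sum_j <phi_j, phi_i>^2 phi_j
    reward = np.outer(yC, (Phi.T @ yA) * (Phi.T @ yB))  # column i: <phi_i,yA><phi_i,yB> yC
    return Phi - beta * penalty + beta * reward

y = np.array([1.0, 0.0])
Phi1 = stgd_step(np.eye(2), y, y, y)
# The component aligned with the sample is reinforced; the orthogonal one shrinks.
```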
We update the eigenvectors through

    \phi^{t+1}_i \leftarrow \phi^t_i - \theta \beta^t \sum_{j=1}^{k} \left[ \langle \phi^t_j, \phi^t_i \rangle^2 \phi^t_j \right] + \mathrm{shift}\!\left[ \beta^t \langle \phi^t_i, y^t_A \rangle \langle \phi^t_i, y^t_B \rangle y^t_C \right]    (21)

Now we shift the updating steps so that they correspond to the centered Dirichlet moment forms, i.e.,

    \mathrm{shift}\!\left[ \beta^t \langle \phi^t_i, y^t_A \rangle \langle \phi^t_i, y^t_B \rangle y^t_C \right] :=
      \beta^t \frac{(\alpha_0+1)(\alpha_0+2)}{2} \langle \phi^t_i, y^t_A \rangle \langle \phi^t_i, y^t_B \rangle y^t_C
    + \beta^t \alpha_0^2 \langle \phi^t_i, \bar{y}_A \rangle \langle \phi^t_i, \bar{y}_B \rangle \bar{y}_C
    - \beta^t \frac{\alpha_0(\alpha_0+1)}{2} \langle \phi^t_i, y^t_A \rangle \langle \phi^t_i, y^t_B \rangle \bar{y}_C
    - \beta^t \frac{\alpha_0(\alpha_0+1)}{2} \langle \phi^t_i, y^t_A \rangle \langle \phi^t_i, \bar{y}_B \rangle y^t_C
    - \beta^t \frac{\alpha_0(\alpha_0+1)}{2} \langle \phi^t_i, \bar{y}_A \rangle \langle \phi^t_i, y^t_B \rangle y^t_C,    (22)

where \bar{y}_A := \mathbb{E}_t[y^t_A] and similarly for \bar{y}_B and \bar{y}_C.

Appendix B. Proof of Correctness of Our Algorithm

We now prove the correctness of our algorithm. First, we compute M_2 as

    M_2 = \mathbb{E}_x\left[ \tilde{G}^\top_{x,C} \otimes \tilde{G}^\top_{x,B} \mid \Pi_A, \Pi_B, \Pi_C \right],

where we define

    \tilde{G}^\top_{x,B} := \mathbb{E}_x\left[ G^\top_{x,A} \otimes G^\top_{x,C} \mid \Pi_A, \Pi_C \right] \left( \mathbb{E}_x\left[ G^\top_{x,B} \otimes G^\top_{x,C} \mid \Pi_B, \Pi_C \right] \right)^{\dagger} G^\top_{x,B},
    \tilde{G}^\top_{x,C} := \mathbb{E}_x\left[ G^\top_{x,A} \otimes G^\top_{x,B} \mid \Pi_A, \Pi_B \right] \left( \mathbb{E}_x\left[ G^\top_{x,C} \otimes G^\top_{x,B} \mid \Pi_B, \Pi_C \right] \right)^{\dagger} G^\top_{x,C}.

Defining F_A := \Pi_A^\top P^\top, we obtain

    M_2 = \mathbb{E}\left[ G^\top_{x,A} \otimes G^\top_{x,A} \right] = \Pi_A^\top P^\top \, \mathbb{E}_x[\pi_x \pi_x^\top] \, P \Pi_A = F_A \, \mathbb{E}_x[\pi_x \pi_x^\top] \, F_A^\top.

Note that P \in [0,1]^{k \times k} is the community connectivity matrix. Now that we know M_2, \mathbb{E}[\pi_i^2] = \frac{\alpha_i(\alpha_i+1)}{\alpha_0(\alpha_0+1)}, and \mathbb{E}[\pi_i \pi_j] = \frac{\alpha_i \alpha_j}{\alpha_0(\alpha_0+1)} for all i \neq j, we can get the centered second-order moments \mathrm{Pairs}_{\mathrm{Com}} as

    \mathrm{Pairs}_{\mathrm{Com}} := F_A \operatorname{diag}\left( \frac{\alpha_1(\alpha_1+1)}{\alpha_0(\alpha_0+1)}, \ldots, \frac{\alpha_k(\alpha_k+1)}{\alpha_0(\alpha_0+1)} \right) F_A^\top    (23)
    = M_2 - \frac{\alpha_0}{\alpha_0+1} F_A \left( \hat{\alpha} \hat{\alpha}^\top - \operatorname{diag}\left( \hat{\alpha} \hat{\alpha}^\top \right) \right) F_A^\top    (24)
    = \frac{1}{n_X} \sum_{x \in X} Z_C G^\top_{x,C} G_{x,B} Z_B^\top - \frac{\alpha_0}{\alpha_0+1} \left( \mu_A \mu_A^\top - \operatorname{diag}\left( \mu_A \mu_A^\top \right) \right).    (25)

Thus, our whitening matrix is computed.
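The whitening matrix W referred to throughout satisfies W^\top \mathrm{Pairs}_{\mathrm{Com}} W = I_k, and for any empirical rank-k positive semidefinite second moment it can be obtained from a truncated SVD. A generic sketch of this step (not tied to the paper's CULA-based routines):

```python
import numpy as np

def whitening_matrix(M2, k):
    """Return W such that W.T @ M2 @ W = I_k, via the top-k SVD of M2."""
    U, s, _ = np.linalg.svd(M2)
    return U[:, :k] / np.sqrt(s[:k])   # scale each of the top-k directions

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
M2 = A @ A.T                           # a rank-3 PSD "second moment"
W = whitening_matrix(M2, 3)
print(np.allclose(W.T @ M2 @ W, np.eye(3)))  # True
```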
Now, our whitened tensor T is given by

    T = T_{\mathrm{Com}}(W, W, W) = \frac{1}{n_X} \sum_{x} \left[ (W^\top F_A \pi^{\alpha_0}_x) \otimes (W^\top F_A \pi^{\alpha_0}_x) \otimes (W^\top F_A \pi^{\alpha_0}_x) \right],

where \pi^{\alpha_0}_x is the centered vector, so that \mathbb{E}[\pi^{\alpha_0}_x \otimes \pi^{\alpha_0}_x \otimes \pi^{\alpha_0}_x] is diagonal. We then apply the stochastic gradient descent technique to decompose this third-order moment.

Appendix C. GPU Architecture

The algorithm we propose is very amenable to parallelization and is scalable, which makes it suitable for implementation on processors with multiple cores. Our method consists of simple linear algebraic operations, enabling us to utilize Basic Linear Algebra Subprograms (BLAS) routines such as BLAS I (vector operations), BLAS II (matrix-vector operations), BLAS III (matrix-matrix operations), Singular Value Decomposition (SVD), and iterative operations such as stochastic gradient descent for tensor decomposition, all of which can easily take advantage of the Single Instruction Multiple Data (SIMD) hardware units present in GPUs. As such, our method is ideal for a GPU-based implementation.

Overview of code design: From a high-level point of view, a typical GPU-based computation is a three-step process involving data transfer from CPU memory to GPU global memory, operations on the data now present in GPU memory, and finally the transfer of results from GPU memory back to CPU memory. We use the CULA library for implementing the linear algebraic operations.

GPU compute architecture: GPUs achieve massive parallelism by having hundreds of homogeneous processing cores integrated on-chip. Massive replication of these cores provides the parallelism needed by the applications that run on GPUs.
These cores, in Nvidia GPUs, are known as CUDA cores, where each core has fully pipelined floating-point and integer arithmetic logic units. In Nvidia's Kepler architecture, these CUDA cores are bunched together to form a Streaming Multiprocessor (SMX), the basic building block of Kepler GPUs. Each GPU contains multiple SMX units, where each SMX unit has 192 single-precision CUDA cores, 64 double-precision units, 32 special function units, and 32 load/store units for data movement between cores and memory. Each SMX has an L1 cache, shared memory and a read-only data cache that are common to all the CUDA cores in that SMX unit; moreover, the programmer can choose between different configurations of the shared memory and L1 cache. Kepler GPUs also have an L2 cache of about 1.5 MB that is common to all the on-chip SMXs. Apart from the above memories, Kepler-based GPU cards come with a large DRAM, also known as the global memory, whose size is usually in gigabytes; this global memory is also visible to all the cores. GPU cards usually do not exist as standalone devices; rather, they are part of a CPU-based system, where the CPU and GPU interact with each other via the PCI (or PCI Express) bus.

To program these massively parallel GPUs, Nvidia provides a framework known as CUDA that enables developers to write programs in languages like C, C++ and Fortran. A CUDA program consists of functions called CUDA kernels that execute across many parallel software threads, where each thread runs on a CUDA core. Thus the GPU's performance and scalability are exploited by simply partitioning the algorithm into fixed-size blocks of parallel threads that run on hundreds of CUDA cores.
The threads running on an SMX can synchronize and cooperate with each other via the shared memory of that SMX unit and can access the global memory. Note that CUDA kernels are launched by the CPU but executed on the GPU; the GPU's compute architecture thus requires the CPU to initiate the CUDA kernels. CUDA enables the programming of Nvidia GPUs by exposing a low-level API. Apart from the CUDA framework, Nvidia provides a wide variety of other tools and also supports third-party libraries that can be used to program Nvidia GPUs. Since a major chunk of scientific computing algorithms is linear-algebra based, it is not surprising that the standard linear algebra solver libraries like BLAS and Linear Algebra PACKage (LAPACK) also have their equivalents for Nvidia GPUs in one form or another. Unlike the CUDA APIs, such libraries expose APIs at a much higher level and mask the architectural details of the underlying GPU hardware to some extent, thus enabling relatively faster development.

Considering the tradeoffs between the algorithm's computational requirements, design flexibility, execution speed and development time, we choose CULA-Dense as our main implementation library. CULA-Dense provides GPU-based implementations of the LAPACK and BLAS libraries for dense linear algebra and contains routines for system solvers, singular value decompositions, and eigen-problems. Along with the rich set of functions that it offers, CULA provides the flexibility needed by the programmer to rapidly implement the algorithm while maintaining performance. It hides most of the GPU-architecture-dependent programming details, thus making rapid prototyping of GPU-intensive routines possible.
Data transfers between the CPU memory and the GPU memory are usually explicitly initiated by the CPU and are carried out via the PCI (or PCI Express) bus interconnecting the CPU and the GPU. The movement of data buffers between CPU and GPU is the most taxing in terms of time; the buffer transaction time is shown in the plot in Figure 9. Newer GPUs, like the Kepler-based ones, also support useful features like direct GPU-GPU data transfers without CPU intervention. Our system and software specifications are given in Table 3.

Figure 9: Experimentally measured time taken for buffer transfer between the CPU and the GPU memory in our system.

CULA exposes two important interfaces for GPU programming, namely standard and device. Using the standard interface, the developer can program without worrying about the underlying architectural details of the GPU, as the standard interface takes care of all the data movements, memory allocations in the GPU and synchronization issues. This, however, comes at a cost: for every standard interface function call, the data is moved in and out of the GPU, even if the output of one operation is directly required by the subsequent operation. This unnecessary movement of intermediate data can dramatically impact the performance of the program. To avoid this, CULA provides the device interface, which we use for STGD: the programmer is responsible for data buffer allocations in GPU memory and for the required data movements between CPU and GPU, and operates only on data already in the GPU. Thus, the subroutines of the program that are iterative in nature are good candidates for a device-interface implementation.
Pre-processing and post-processing: The pre-processing involves matrices whose leading dimension is of the order of the number of nodes. These are implemented using the CULA standard interface BLAS II and BLAS III routines.

Pre-processing requires SVD computations for the Moore-Penrose pseudoinverse calculations. We use CULA SVD routines, since these SVD operations are carried out on matrices of moderate size. We further replaced the CULA SVD routines with more scalable SVD and pseudoinverse routines using random projections (Gittens and Mahoney, 2013) to handle larger datasets, such as the DBLP dataset in our experiments.

Table 12: Synthetic simulation results for different configurations. Running time is the time taken to run to convergence.

n     k    α_0   Error    Time (secs)
1e2   10   0     0.1200   0.5
1e3   10   0     0.1010   1.2
1e4   10   0     0.0841   43.2
1e2   10   1     0.1455   0.5
1e3   10   1     0.1452   1.2
1e4   10   1     0.1259   42.2

After STGD, the community membership matrix estimates are obtained using BLAS III routines provided by the CULA standard interface. The matrices are then used for hypothesis testing to evaluate the algorithm against the ground truth.

Appendix D. Results on Synthetic Datasets

Homophily is an important factor in social interactions (McPherson et al., 2001); the term homophily refers to the tendency of actors in the same community to interact more than across different communities. Therefore, we assume a diagonally dominant community connectivity matrix P with diagonal elements equal to 0.9 and off-diagonal elements equal to 0.1. Note that P need be neither stochastic nor symmetric. Our algorithm allows for a randomly generated community connectivity matrix P with support [0, 1]. In this way, we look at general directed social ties among communities.
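The synthetic setup above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' code; in particular, the symmetric Dirichlet parameterization (α_0/k per coordinate) for the mixed membership model and all function names are our assumptions.

```python
import numpy as np

def synth_communities(n, k, alpha0, seed=0):
    """Generate a membership matrix Pi (k x n) and a homophilic
    community connectivity matrix P, as in the setup described above."""
    rng = np.random.default_rng(seed)
    # Diagonally dominant P: strong within-community ties (0.9),
    # weak cross-community ties (0.1).
    P = np.full((k, k), 0.1)
    np.fill_diagonal(P, 0.9)
    if alpha0 == 0:
        # Stochastic block model: each node belongs to a single community.
        z = rng.integers(k, size=n)
        Pi = np.zeros((k, n))
        Pi[z, np.arange(n)] = 1.0
    else:
        # Mixed membership: symmetric Dirichlet(alpha0/k, ..., alpha0/k)
        # membership vector per node (our assumed parameterization).
        Pi = rng.dirichlet(np.full(k, alpha0 / k), size=n).T
    return Pi, P

# Block model instance matching the first row of Table 12 (n=1e2, k=10).
Pi, P = synth_communities(n=100, k=10, alpha0=0)
```

Under this model, each column of Pi is a (stochastic) membership vector, so the columns sum to one in both the α_0 = 0 and α_0 > 0 regimes.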
We perform experiments for both the stochastic block model (α_0 = 0) and the mixed membership model. For the mixed membership model, we set the concentration parameter α_0 = 1. We note that the error is around 8%–14% and the running times are under a minute when n ≤ 10000 and n ≫ k.

The results are given in Table 12. We observe that more samples result in a more accurate recovery of memberships, which matches intuition and theory. Overall, our learning algorithm performs better in the stochastic block model case than in the mixed membership model case, although we note that the accuracy is quite high for practical purposes. Theoretically, this is expected, since a smaller concentration parameter α_0 is easier for our algorithm to learn (Anandkumar et al., 2013b). Also, our algorithm scales to an order of magnitude more in n, as illustrated by experiments on real-world large-scale datasets.

Appendix E. Comparison of Error Scores

The Normalized Mutual Information (NMI) score (Lancichinetti et al., 2009) is another popular score, which is defined differently for overlapping and non-overlapping community models. For the non-overlapping block model, the ground truth membership for node i is a discrete k-state categorical variable $\Pi^{\text{block}} \in [k]$ and the estimated membership is a discrete $\hat{k}$-state categorical variable $\hat{\Pi}^{\text{block}} \in [\hat{k}]$. The empirical distribution of the ground truth membership variable $\Pi^{\text{block}}$ is easy to obtain, and similarly for the empirical distribution of the estimated membership variable $\hat{\Pi}^{\text{block}}$. The NMI for the block model is defined as
\[
N^{\text{block}}(\hat{\Pi}^{\text{block}} : \Pi^{\text{block}}) := \frac{H(\Pi^{\text{block}}) + H(\hat{\Pi}^{\text{block}}) - H(\Pi^{\text{block}}, \hat{\Pi}^{\text{block}})}{\left[H(\Pi^{\text{block}}) + H(\hat{\Pi}^{\text{block}})\right]/2}.
\]
For overlapping communities, the membership of a node is a binary vector instead of a categorical variable (Lancichinetti et al., 2009).
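The block-model NMI can be computed directly from empirical label distributions. Below is a minimal Python sketch of the formula (entropies in nats; function names are ours, not from the paper):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical entropy (nats) of a categorical label sequence."""
    n = len(labels)
    p = np.array([c / n for c in Counter(labels).values()])
    return float(-(p * np.log(p)).sum())

def nmi_block(est, truth):
    """Block-model NMI:
    [H(truth) + H(est) - H(truth, est)] / ([H(truth) + H(est)] / 2).
    Assumes at least one of the two labelings is non-constant."""
    h_t, h_e = entropy(truth), entropy(est)
    h_joint = entropy(list(zip(truth, est)))  # joint entropy H(truth, est)
    return (h_t + h_e - h_joint) / ((h_t + h_e) / 2.0)

# Identical partitions (up to a label permutation) give NMI = 1.
print(nmi_block([1, 1, 0, 0], [0, 0, 1, 1]))  # → 1.0
```

Since the score is built from entropies of label distributions, it is invariant to relabeling of the communities, which is why estimated labels need not agree with ground-truth label indices.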
The ground truth membership for node i is a binary vector of length k, $\Pi^{\text{mix}}$, while the estimated membership for node i is a binary vector of length $\hat{k}$, $\hat{\Pi}^{\text{mix}}$. This notion coincides with one column of our membership matrices $\Pi \in \mathbb{R}^{k \times n}$ and $\hat{\Pi} \in \mathbb{R}^{\hat{k} \times n}$, except that our membership matrices are stochastic. In other words, if we regard all the nonzero entries of Π as 1's, then each column of our Π is a sample of $\Pi^{\text{mix}}$. The m-th entry of this binary vector is the realization of a random variable $\Pi_m^{\text{mix}} = (\Pi^{\text{mix}})_m$, whose probability distribution is
\[
\mathbb{P}(\Pi_m^{\text{mix}} = 1) = \frac{n_m}{n}, \qquad \mathbb{P}(\Pi_m^{\text{mix}} = 0) = 1 - \frac{n_m}{n},
\]
where n_m is the number of nodes in community m. The same holds for $\hat{\Pi}_m^{\text{mix}}$. The normalized conditional entropy between $\Pi^{\text{mix}}$ and $\hat{\Pi}^{\text{mix}}$ is defined as
\[
H(\hat{\Pi}^{\text{mix}} \mid \Pi^{\text{mix}})_{\text{norm}} := \frac{1}{k} \sum_{j \in [k]} \min_{i \in [\hat{k}]} \frac{H\left(\hat{\Pi}_i^{\text{mix}} \mid \Pi_j^{\text{mix}}\right)}{H(\Pi_j^{\text{mix}})}, \tag{26}
\]
where $\Pi_j^{\text{mix}}$ denotes the j-th entry of $\Pi^{\text{mix}}$ and similarly for $\hat{\Pi}_i^{\text{mix}}$. The NMI for overlapping communities is
\[
N^{\text{mix}}(\hat{\Pi}^{\text{mix}} : \Pi^{\text{mix}}) := 1 - \frac{1}{2}\left[H(\Pi^{\text{mix}} \mid \hat{\Pi}^{\text{mix}})_{\text{norm}} + H(\hat{\Pi}^{\text{mix}} \mid \Pi^{\text{mix}})_{\text{norm}}\right].
\]

There are two aspects in evaluating the error. The first aspect is the ℓ_1 norm error. According to Equation (26), the error function used in the NMI score is $H(\hat{\Pi}_i^{\text{mix}} \mid \Pi_j^{\text{mix}}) / H(\Pi_j^{\text{mix}})$. This makes NMI unsuitable for evaluating recovery of communities of different sizes. In the special case of a pair of extremely sparse and dense membership vectors, depicted in Figure 10, $H(\Pi_j^{\text{mix}})$ is the same for both the dense and the sparse vectors, since they are flipped versions of each other (0s flipped to 1s and vice versa). However, the smaller community (i.e., the sparser community vector), shown in red in Figure 10, is significantly more difficult to recover than the larger community shown in blue in Figure 10.
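Equation (26) and $N^{\text{mix}}$ can be sketched by treating each community's membership as a Bernoulli indicator over the n nodes. The snippet below is a minimal illustration assuming binarized membership matrices of shape k × n (function names are ours); it also checks the flipped dense/sparse entropy equality discussed above.

```python
import numpy as np

def bern_entropy(p):
    """Entropy (nats) of a Bernoulli(p) membership indicator."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def cond_entropy(a, b):
    """Empirical H(a | b) for two binary membership vectors of length n."""
    h = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_b = np.mean(b == vb)
            if p_ab > 0:
                h -= p_ab * np.log(p_ab / p_b)
    return h

def norm_cond_entropy(Pi_hat, Pi):
    """Equation (26): (1/k) * sum_j min_i H(Pi_hat_i | Pi_j) / H(Pi_j).
    Pi_hat is k_hat x n binary; Pi is k x n binary, no community empty or full."""
    k = Pi.shape[0]
    total = 0.0
    for j in range(k):
        hj = bern_entropy(Pi[j].mean())
        total += min(cond_entropy(Pi_hat[i], Pi[j])
                     for i in range(Pi_hat.shape[0])) / hj
    return total / k

def nmi_mix(Pi_hat, Pi):
    """Overlapping NMI: 1 - [H(Pi|Pi_hat)_norm + H(Pi_hat|Pi)_norm] / 2."""
    return 1.0 - 0.5 * (norm_cond_entropy(Pi, Pi_hat)
                        + norm_cond_entropy(Pi_hat, Pi))

# Perfect recovery yields NMI = 1.
Pi = np.array([[1, 1, 1, 0, 0, 0],
               [0, 0, 0, 1, 1, 1]])
print(nmi_mix(Pi, Pi))  # → 1.0

# Flipped dense/sparse vectors have identical entropy, as noted above.
dense = np.array([1] * 9 + [0])
print(np.isclose(bern_entropy(dense.mean()), bern_entropy(1 - dense.mean())))
```

The min over i in `norm_cond_entropy` is what restricts the NMI to the single closest estimated community per ground-truth community, which is the behavior criticized in the second aspect below.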
Although this example is an extreme scenario that is not seen in practice, it illustrates the drawbacks of the NMI; thus, NMI is not suitable for evaluating recovery of communities of different sizes. In contrast, our error function employs a normalized ℓ_1 norm error, which penalizes larger communities more than smaller ones.

Figure 10: A special case of a pair of extremely dense and sparse communities, represented by length-n membership vectors $\Pi_1$ (dense, large community) and $\Pi_2$ (sparse, small community). Theoretically, the sparse community is more difficult to recover than the dense one; however, the NMI score penalizes both of them equally. Note that for the dense $\Pi_1$, $P(\Pi_1^{\text{mix}} = 0) = (\#\text{ of 0s in }\Pi_1)/n$, which is equal to $P(\Pi_2^{\text{mix}} = 1) = (\#\text{ of 1s in }\Pi_2)/n$. Similarly, $P(\Pi_1^{\text{mix}} = 1) = (\#\text{ of 1s in }\Pi_1)/n$, which is equal to $P(\Pi_2^{\text{mix}} = 0) = (\#\text{ of 0s in }\Pi_2)/n$. Therefore, $H(\Pi_1^{\text{mix}}) = H(\Pi_2^{\text{mix}})$.

The second aspect is the error induced by false pairings of estimated and ground-truth communities. The NMI score selects only the closest estimated community through normalized conditional entropy minimization; it does not account for statistically significant dependence between an estimated community and multiple ground truth communities (and vice versa), and therefore it underestimates the error. Our error score, however, is not limited to a one-to-one matching between the estimated and ground truth communities: if an estimated community is found to have statistically significant correlation with multiple ground truth communities (as evaluated by the p-value), we penalize the error over all such ground truth communities. Thus, our error score is a harsher measure of evaluation than NMI.
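The soft-matching penalty can be sketched schematically. The snippet below is our illustrative stand-in, not the paper's exact error score: it uses a correlation threshold in place of the p-value test and a size-normalized ℓ_1 distance as the per-pair error, and all function names and parameters are assumptions.

```python
import numpy as np

def soft_match_error(Pi_hat, Pi, r_thresh=0.3):
    """Schematic soft-matching error (illustrative stand-in, not the
    paper's definition): every (estimate, ground truth) pair whose
    memberships are strongly positively correlated contributes a
    size-normalized l1 error, so one estimated community may be
    penalized against several ground-truth communities."""
    err, pairs = 0.0, 0
    for i in range(Pi_hat.shape[0]):
        for j in range(Pi.shape[0]):
            r = np.corrcoef(Pi_hat[i], Pi[j])[0, 1]
            if r >= r_thresh:  # stand-in for the p-value significance test
                err += np.abs(Pi_hat[i] - Pi[j]).sum() / Pi[j].sum()
                pairs += 1
    return err / max(pairs, 1)

Pi = np.array([[1, 1, 1, 0, 0, 0],
               [0, 0, 0, 1, 1, 1]], dtype=float)
print(soft_match_error(Pi, Pi))  # perfect recovery: error 0.0
```

An estimated community straddling two ground-truth communities correlates significantly with both, so it is penalized against both, which is the sense in which this score is harsher than NMI's single closest match.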
This notion of "soft-matching" between ground-truth and estimated communities also enables validation of recovery of a combinatorial union of communities instead of single ones.

A number of other scores such as "separability", "density", "cohesiveness" and "clustering coefficient" (Yang and Leskovec, 2012) are non-statistical measures of faithful community recovery. The scores of Yang and Leskovec (2012) intrinsically aim to evaluate the level of clustering within a community. However, our goal is to measure the accuracy of recovery of the communities and not how well-clustered the communities are. Banerjee and Langford (2004) proposed an objective evaluation criterion for clustering which uses classification performance as the evaluation measure. In contrast, we look at how well the method performs in recovering the hidden communities, and we are not evaluating predictive performance. Therefore, this measure is not used in our evaluation.

Finally, we note that cophenetic correlation is another statistical score used for evaluating clustering methods, but it is only valid for hierarchical clustering, and it measures how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points (Sokal and Rohlf, 1962). Hence, it is not employed in this paper.
