When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity


Authors: Animashree Anandkumar, Daniel Hsu, Majid Janzamin, Sham Kakade

Animashree Anandkumar, Daniel Hsu, Majid Janzamin and Sham Kakade*

August 13, 2018

Abstract

Overcomplete latent representations have been very popular for unsupervised feature learning in recent years. In this paper, we specify which overcomplete models can be identified given observable moments of a certain order. We consider probabilistic admixture or topic models in the overcomplete regime, where the number of latent topics can greatly exceed the size of the observed word vocabulary. While general overcomplete topic models are not identifiable, we establish generic identifiability under a constraint, referred to as topic persistence. Our sufficient conditions for identifiability involve a novel set of "higher order" expansion conditions on the topic-word matrix or the population structure of the model. This set of higher-order expansion conditions allows for overcomplete models, and requires the existence of a perfect matching from latent topics to higher order observed words. We establish that random structured topic models are identifiable w.h.p. in the overcomplete regime. Our identifiability results allow for general (non-degenerate) distributions for modeling the topic proportions, and thus, we can handle arbitrarily correlated topics in our framework. Our identifiability results imply uniqueness of a class of tensor decompositions with structured sparsity which is contained in the class of Tucker decompositions, but is more general than the Candecomp/Parafac (CP) decomposition.

Keywords: Overcomplete representations, topic models, generic identifiability, tensor decomposition.

1 Introduction

The performance of many machine learning methods is hugely dependent on the choice of data representations or features.
Overcomplete representations, where the number of features can be greater than the dimensionality of the input data, have been extensively employed, and are arguably critical in a number of applications such as speech and computer vision [1]. Overcomplete representations are known to be more robust to noise, and can provide greater flexibility in modeling [2]. Unsupervised estimation of overcomplete representations has been hugely popular due to the availability of large-scale unlabeled samples in many applications. A probabilistic framework for incorporating features posits latent or hidden variables that can provide a good explanation to the observed data. Overcomplete probabilistic models can incorporate a much larger number of latent variables compared to the observed dimensionality.

*A. Anandkumar and M. Janzamin are with the Center for Pervasive Communications and Computing, Electrical Engineering and Computer Science Dept., University of California, Irvine, USA 92697. Email: a.anandkumar@uci.edu, mjanzami@uci.edu. Daniel Hsu and Sham Kakade are with Microsoft Research New England, 1 Memorial Drive, Cambridge, MA 02142. Email: dahsu@microsoft.com, skakade@microsoft.com.

In this paper, we characterize the conditions under which overcomplete latent variable models can be identified from their observed moments. For any parametric statistical model, identifiability is a fundamental question of whether the model parameters can be uniquely recovered given the observed statistics. Identifiability is crucial in a number of applications where the latent variables are the quantities of interest, e.g. inferring diseases (latent variables) through symptoms (observations), inferring communities (latent variables) via the interactions among the actors in a social network (observations), and so on.
Moreover, identifiability can be relevant even in predictive settings, where feature learning is employed for some higher level task such as classification. For instance, non-identifiability can lead to the presence of non-isolated local optima for optimization-based learning methods, and this can affect their convergence properties, e.g. see [3].

In this paper, we characterize identifiability for a popular class of latent variable models, known as the admixture or topic models [4, 5]. These are hierarchical mixture models, which incorporate the presence of multiple latent states (i.e. topics) in each document consisting of a tuple of observed variables (i.e. words). Previous works have established that the model parameters can be estimated efficiently using low order observed moments (second and third order) under some non-degeneracy assumptions, e.g. [6-8]. However, these non-degeneracy conditions imply that the model is undercomplete, i.e., the latent dimensionality (number of topics) cannot exceed the observed dimensionality (word vocabulary size). In this paper, we remove this restriction and consider overcomplete topic models, where the number of topics can far exceed the word vocabulary size.

It is perhaps not surprising that general topic models are not identifiable in the overcomplete regime. To this end, we introduce an additional constraint on the model, referred to as topic persistence. Intuitively, this captures the "locality" effect among the observed words, and is not present in the usual "bag-of-words" or exchangeable topic model. Such local dependencies among observations abound in applications such as text, images and speech, and can lead to a more faithful representation.
In addition, we establish that the presence of topic persistence is central towards obtaining model identifiability in the overcomplete regime, and we provide an in-depth analysis of this phenomenon in this paper.

1.1 Summary of results

In this paper, we provide conditions for generic¹ model identifiability of overcomplete topic models given observable moments of a certain order (i.e., having a certain number of words in each document).

¹A model is generically identifiable if all the parameters in the parameter space are identifiable, almost surely. Refer to Definition 1 for more discussion.

Figure 1: Hierarchical structure of the n-persistent topic model. 2rn words (views) are shown for some integer r ≥ 1. A single topic y_j, j ∈ [2r], is chosen for each n successive views {x_{(j-1)n+1}, . . . , x_{(j-1)n+n}}. Matrix A is the population structure or topic-word matrix.

We introduce the notion of topic persistence, and analyze its effect on identifiability. We establish identifiability in the presence of a novel combinatorial object, referred to as perfect n-gram matching, in the bipartite graph from topics to words. Finally, we prove that random structured topic models satisfy these criteria, and are thus identifiable in the overcomplete regime.

Persistent Topic Model: We first introduce the n-persistent topic model, where the parameter n determines the persistence level of a common topic in a sequence of n successive words. For instance, in Figure 1, the sequence of successive words x_1, . . . , x_n share a common topic y_1, and similarly, the words x_{n+1}, . . . , x_{2n} share topic y_2, and so on.
The n-persistent model reduces to the popular "bag-of-words" model when n = 1, and to the single topic model (i.e. only one topic in each document) when n → ∞. Intuitively, topic persistence aids identifiability since we have multiple views of the common hidden topic generating a sequence of successive words. We establish that the bag-of-words model (with n = 1) is too non-informative about the topics in the overcomplete regime, and is therefore not identifiable. On the other hand, n-persistent overcomplete topic models with n ≥ 2 can become identifiable, and we establish a set of transparent conditions for identifiability.

Deterministic Conditions for Identifiability: Our sufficient conditions for identifiability are in the form of expansion conditions from the latent topic space to the observed word space. In the overcomplete regime, there are more topics than words in the vocabulary, and thus it is impossible to have expansion on the bipartite graph from topics to words, i.e., the graph encoding the sparsity pattern of the topic-word matrix. Instead, we impose an expansion constraint from topics to "higher order" words, which allows us to incorporate overcomplete models. We establish that this condition translates to the presence of a novel combinatorial object, referred to as the perfect n-gram matching, on the topic-word bipartite graph. Intuitively, the perfect n-gram matching condition implies "diversity" among the higher-order word supports for different topics, which leads to identifiability.
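To make the model concrete, the generative process of Figure 1 can be sketched in a few lines of Python. This is a minimal illustration with made-up dimensions; `sample_document` and all parameter values here are our own illustrative choices, not the paper's.

```python
import numpy as np

def sample_document(A, h, n, r, rng):
    """Draw one document from the n-persistent topic model.

    A   : (p, q) topic-word matrix; column i is the word distribution of topic i.
    h   : (q,) topic-proportion vector on the simplex.
    n   : persistence level (one topic per block of n successive words).
    r   : 2*r topic draws, so the document has 2*r*n words.
    """
    p, q = A.shape
    words = []
    for _ in range(2 * r):                              # topics y_1, ..., y_{2r}
        y = rng.choice(q, p=h)                          # one topic per block
        words.extend(rng.choice(p, size=n, p=A[:, y]))  # n words share topic y
    return np.array(words)

rng = np.random.default_rng(0)
p, q, n, r = 6, 4, 2, 1
A = rng.dirichlet(np.ones(p), size=q).T   # columns are word distributions
h = rng.dirichlet(np.ones(q))
doc = sample_document(A, h, n, r, rng)
print(doc.shape)                          # (2*r*n,) = (4,)
```

Setting n = 1 recovers a bag-of-words sampler, while making every block reuse a single draw of y recovers the single-topic model, matching the two extremes described above.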
In addition, we present trade-offs among the following quantities: number of topics, size of the word vocabulary, the topic persistence level, the order of the observed moments at hand, the minimum and maximum degrees of any topic in the topic-word bipartite graph, and the Kruskal rank [9] of the topic-word matrix, under which identifiability holds. To the best of our knowledge, this is the first work to provide conditions for characterizing identifiability of overcomplete topic models with structured sparsity.

Identifiability of Random Structured Topic Models: We explicitly characterize the regime of identifiability for the random setting, where each topic i is randomly supported on a set of d_i words, i.e. the bipartite graph is a random graph. For this random model with q topics, p-dimensional word vocabulary, and topic persistence level n, when q = O(p^n) and Θ(log p) ≤ d_i ≤ Θ(p^{1/n}) for all topics i, the topic-word matrix is identifiable from 2n-th order observed moments with high probability. Intuitively, the upper bound on the degrees d_i is needed to limit the overlap of word supports among different topics in the overcomplete regime: as the number of topics q increases (i.e., n increases in the above degree bound), the degree needs to be correspondingly smaller to ensure identifiability, and we make this dependence explicit. Intuitively, as the extent of overcompleteness increases, we need sparser connections from topics to words to ensure sufficient diversity in the word supports among different topics. The lower bound on the degrees is required so that there are enough edges in the topic-word bipartite graph so that various topics can be distinguished from one another. Furthermore, we establish that the size condition q = O(p^n) for identifiability is tight.
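The random regime above is easy to instantiate. The sketch below builds a random topic-word support with q = p^n topics, each supported on d words at the Θ(p^{1/n}) upper end of the degree window; the specific constants are our own illustrative choices.

```python
import numpy as np

def random_support(p, q, d, rng):
    """Random topic-word bipartite graph: each of the q topics is
    supported on a uniformly random set of d out of p words."""
    S = np.zeros((p, q), dtype=bool)
    for i in range(q):
        S[rng.choice(p, size=d, replace=False), i] = True
    return S

rng = np.random.default_rng(1)
p, n = 10, 2
q = p ** n                         # overcomplete regime: q = p^n topics
d = int(np.ceil(p ** (1.0 / n)))   # degree at the Theta(p^{1/n}) upper bound
S = random_support(p, q, d, rng)
print(S.shape, int(S.sum(axis=0).max()))   # (10, 100) 4
```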
Implications on Uniqueness of Overcomplete Tucker and CP Tensor Decompositions: We establish that identifiability of an overcomplete topic model is equivalent to uniqueness of the decomposition of the observed moment tensor (of a certain order). Our identifiability results for persistent topic models imply uniqueness of a structured class of tensor decompositions, which is contained in the class of Tucker decompositions, but is more general than the Candecomp/Parafac (CP) decomposition [10]. This sub-class of Tucker decompositions involves structured sparsity and symmetry constraints on the core tensor, and sparsity constraints on the inverse factors of the Tucker decomposition. The structural constraints on the Tucker tensor decomposition are related to the topic model as follows: the sparsity and symmetry constraints on the core tensor are related to the persistence property of the topic model, and the sparsity constraints on the inverse factors are equivalent to the sparsity constraints on the topic-word matrix. For an n-persistent topic model with n = 1 (bag-of-words model), the tensor decomposition is a general Tucker decomposition, where the core tensor is fully dense, while for n → ∞ (single-topic model), the tensor decomposition reduces to a CP decomposition, i.e. the core tensor is a diagonal tensor. For a finite persistence level n, in between these two extremes, the core tensor satisfies certain sparsity and symmetry constraints, which become crucial towards establishing identifiability in the overcomplete regime.

1.2 Overview of Techniques

We now provide a short overview of the techniques employed in this paper.
Recap of Identifiability Conditions in the Undercomplete Setting (Expansion Conditions on the Topic-Word Matrix): Our approach is based on the recent results of [7], where conditions for identifiability of topic models are derived given pairwise observed moments (specifically, co-occurrence of word-pairs in documents). Consider a topic model with q topics and an observed word vocabulary of size p. Let A ∈ R^{p×q} denote the topic-word matrix. Expansion conditions are imposed in [7] on the topic-word bipartite graph which imply that (generically) the sparsest vectors in the column span of A, denoted by Col(A), are the columns of A themselves. Thus the topic-word matrix A is identifiable from pairwise moments under expansion constraints. However, these expansion conditions constrain the model to be undercomplete, i.e., the number of topics q ≤ p, the size of the word vocabulary. Therefore, the techniques derived in [7] are not directly applicable here, since we consider overcomplete models.

Identifiability in the Overcomplete Setting and Why Topic Persistence Helps: Pairwise moments are thus not sufficient for identifiability of overcomplete models, and the question is whether higher order moments can yield identifiability. We can view the higher order moments as pairwise moments of another equivalent topic model, which enables us to apply the techniques of [7]. The key question is whether we have expansion in the equivalent topic model, which implies identifiability. For a general topic model (without any topic persistence constraints), it can be shown that for identifiability, we require expansion of the n-th order Kronecker product of the original topic-word matrix A, denoted by A^{⊗n} ∈ R^{p^n × q^n}, when given access to (2n)-th order moments, for any integer n ≥ 1.
In the overcomplete regime where q > p, A^{⊗n} cannot expand, and therefore, overcomplete models are not identifiable in general. On the other hand, we show that imposing the constraint of topic persistence can lead to identifiability. For an n-persistent topic model, given (2n)-th order moments, we establish that identifiability occurs when the n-th order Khatri-Rao product of A, denoted by A^{⊙n} ∈ R^{p^n × q}, expands. Note that the Khatri-Rao product A^{⊙n} is a sub-matrix of the Kronecker product A^{⊗n}, and the Khatri-Rao product A^{⊙n} can expand as long as q ≤ p^n. Thus, the property of topic persistence is central towards achieving identifiability in the overcomplete regime.

First-Order Approach for Identifiability of Overcomplete Models (Expansion of the n-gram Topic-Word Matrix): We refer to A^{⊙n} ∈ R^{p^n × q} as the n-gram topic-word matrix, and intuitively, it relates topics to n-tuples of words. Imposing the expansion conditions derived in [7] on A^{⊙n} implies that (generically) the sparsest vectors in Col(A^{⊙n}) are the columns of A^{⊙n} themselves. Thus, the topic-word matrix A is identifiable from (2n)-th order moments for an n-persistent topic model. We refer to this as the "first-order" approach, since we directly impose the expansion conditions of [7] on A^{⊙n}, without exploiting the additional structure present in A^{⊙n}.

Why the First-Order Approach is not Enough: Note that the matrix A^{⊙n} ∈ R^{p^n × q} relates topics to n-tuples of words. Thus, the entries of A^{⊙n} are highly correlated, even if the original topic-word matrix A is assumed to be randomly generated. It is non-trivial to derive conditions on A so that A^{⊙n} expands. Moreover, we establish that A^{⊙n} fails to expand on "small" sets, as required in [7], when the degrees are sufficiently different².
Thus, the first-order approach is highly restrictive in the overcomplete setting.

²For A^{⊙n} to expand on a set of size s ≥ 2, it is necessary that
$$s \binom{d_{\min}+n-1}{n} \ge s + \binom{d_{\max}+n-1}{n},$$
where d_min and d_max are the minimum and maximum degrees, and n is the extent of overcompleteness: q = Θ(p^n). When the model is highly overcomplete (large n) and we require small set expansion (small s), the degrees need to be nearly the same. Thus, it is desirable to impose expansion only on large sets, since it allows for more degree diversity.

Incorporating Rank Criterion: Note that A^{⊙n} is highly structured: the columns of A^{⊙n} possess a tensor³ rank of 1 when n > 1. This can be incorporated in our identifiability criteria as follows: we provide conditions under which the sparsest vectors in Col(A^{⊙n}) which also possess a tensor rank of 1 are the columns of A^{⊙n} themselves. This implies identifiability of an n-persistent topic model, when given access to (2n)-th order moments. Note that when a small number of columns of A^{⊙n} are combined, the resulting vector cannot possess a tensor rank of 1, and thus, we can rule out such sparse combinations of columns using the rank criterion. The maximum such number is at least the Kruskal rank⁴ of A. Thus, sparse combinations of columns of A (up to the Kruskal rank) can be ruled out using the rank criterion, and we require expansion of A^{⊙n} only on large sets of topics (of size larger than the Kruskal rank). This agrees with the intuition that when the topic-word matrix A has a larger Kruskal rank, it should be easier to identify A, since the Kruskal rank is related to the mutual incoherence⁵ among the columns of A, see [11].
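The Kruskal rank used above (see footnote 4) can be computed by brute force for small matrices. The checker below is our own illustration, not an algorithm from the paper.

```python
import numpy as np
from itertools import combinations

def kruskal_rank(A, tol=1e-10):
    """Largest k such that every k-subset of columns of A is linearly
    independent (exhaustive check; only sensible for small matrices)."""
    p, q = A.shape
    k = 0
    for size in range(1, min(p, q) + 1):
        if all(np.linalg.matrix_rank(A[:, list(c)], tol=tol) == size
               for c in combinations(range(q), size)):
            k = size
        else:
            break
    return k

A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [0., 0., 0.]])
print(kruskal_rank(A))          # 2: every pair is independent, the triple is not
print(kruskal_rank(np.eye(3)))  # 3: full column rank gives krank = rank
```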
Notion of Perfect n-gram Matching and Final Identifiability Conditions: Thus, we establish identifiability of overcomplete topic models subject to expansion conditions on A^{⊙n} on sets of size larger than the Kruskal rank of the topic-word matrix A. However, it is desirable to impose transparent and interpretable conditions directly on A for identifiability. We introduce the notion of perfect n-gram matching on the topic-word bipartite graph, which ensures that each topic can be uniquely matched to an n-tuple of words. This, combined with a lower bound on the Kruskal rank, provides the final set of deterministic conditions for identifiability of the overcomplete topic model. Intuitively, we require that the columns of A be sparse, while still maintaining a large enough Kruskal rank; in other words, the topics have to be sparse and have sufficiently diverse word supports. Thus, we establish identifiability under a set of transparent conditions on the topic-word matrix A, consisting of the perfect n-gram matching condition and a lower bound on the Kruskal rank of A.

Analysis under Random-Structured Topic-Word Matrices: Finally, we establish that the derived deterministic conditions are satisfied when the topic-word bipartite graph is randomly generated, as long as the degrees satisfy certain lower and upper bounds. Intuitively, a lower bound on the degrees of the topics is required to have degree concentration on various subsets so that expansion can occur, while the upper bound is required so that the Kruskal rank of the topic-word matrix is large enough compared to the sparsity level. Here, the main technical result is establishing the presence of a perfect n-gram matching in a random bipartite graph with a wide range of degrees.
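As a toy illustration of the matching condition, the sketch below checks whether each topic can be assigned a distinct n-subset of words inside its own support, via standard augmenting paths. This captures only the basic matching property; the paper's perfect n-gram matching is defined more carefully, so treat this as a simplified stand-in.

```python
import numpy as np
from itertools import combinations

def has_ngram_matching(S, n):
    """Simplified check: can every topic be matched to a distinct
    n-subset of words drawn from its support column in S (p x q, boolean)?
    (The paper's perfect n-gram matching imposes further conditions.)"""
    q = S.shape[1]
    # adjacency: topic i -> all n-subsets of words in its support
    adj = [list(combinations(np.flatnonzero(S[:, i]), n)) for i in range(q)]
    match = {}                                # n-subset -> matched topic

    def augment(i, seen):
        for t in adj[i]:
            if t not in seen:
                seen.add(t)
                if t not in match or augment(match[t], seen):
                    match[t] = i
                    return True
        return False

    return all(augment(i, set()) for i in range(q))

# 5 overcomplete topics on 4 words, persistence n = 2, distinct pair supports:
S = np.zeros((4, 5), dtype=bool)
for i, pair in enumerate([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3)]):
    S[list(pair), i] = True
print(has_ngram_matching(S, 2))   # True: word supports are diverse
S[:, 1] = S[:, 0]                 # duplicate a support column
print(has_ngram_matching(S, 2))   # False: diversity is lost
```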
We present a greedy and a recursive mechanism for constructing such an n-gram matching for overcomplete models, which can be relevant even in other settings. For instance, our results imply the presence of a perfect matching when the edges of a bipartite graph are correlated in a structured manner, as given by the Khatri-Rao product.

³When any column of A^{⊙n} ∈ R^{p^n × q} (of length p^n) is reshaped as an n-th order tensor T ∈ R^{p × p × ··· × p}, the tensor T is rank 1.
⁴The Kruskal rank is the maximum number k such that every k-subset of columns of A is linearly independent. Note that the Kruskal rank is equal to the rank of A when A has full column rank. But this cannot happen in the overcomplete setting.
⁵It is easy to show that krank ≥ (max_{i≠j} |a_i^⊤ a_j|)^{−1}, where a_i, a_j are any pair of columns of A. Thus, higher incoherence leads to a larger Kruskal rank.

1.3 Related works

We now summarize some recent related works in the area of identifiability and learning of latent variable models.

Identifiability, learning and applications of overcomplete latent representations: Many recent works employ unsupervised estimation of overcomplete features for higher level tasks such as classification, e.g. [1, 12-14], and record huge gains over other approaches in a number of applications such as speech recognition and computer vision. However, theoretical understanding regarding learnability or identifiability of overcomplete representations is far more limited.

Overcomplete latent representations have been analyzed in the context of independent components analysis (ICA), where the sources are assumed to be independent, and the mixing matrix is unknown. In the overcomplete or under-determined regime of ICA, there are more sources than sensors.
Identifiability and learning of the overcomplete ICA reduces to the problem of finding an overcomplete Candecomp/Parafac (CP) tensor decomposition. The classical result by Kruskal provides conditions for uniqueness of a CP decomposition [9, 15], with recent extensions to the notion of robust identifiability [16]. These results provide conditions for strict identifiability of the model, and here, the dimensionality of the latent space is required to be of the same order as the observed space dimensionality. In contrast, a number of recent works analyze generic identifiability of overcomplete CP decomposition, which is weaker than strict identifiability, e.g. [17-23]. These works assume that the factors (i.e. the components) of the CP decomposition are generically drawn and provide conditions for uniqueness. They allow for the latent dimensionality to be much larger (polynomially larger) than the observed dimensionality. These results on the uniqueness of CP decompositions also lead to identifiability of other latent variable models, such as latent tree models, e.g. [24, 25], and the single-topic model, or more generally latent Dirichlet allocation (LDA). Recently, Goyal et al. [26] proposed an alternative framework for overcomplete ICA models based on the eigen-decomposition of the reweighted covariance matrix (or higher order moments), where the weights are the Fourier coefficients. However, their approach requires independence of sources (i.e. latent topics in our context), which is not imposed here.

In contrast to the above works dealing with the CP tensor decomposition, we require uniqueness for a more general class of tensor decompositions, in order to establish identifiability of topic models with arbitrarily correlated topics.
We establish that our class of tensor decompositions is contained in the class of Tucker decompositions, which is more general than the CP decomposition. Moreover, we explicitly characterize the effect of the sparsity pattern of the factors (i.e., the topic-word matrix) on model identifiability, while all the previous works based on generic identifiability assume fully dense factors (since sparse factors are not generic). For a general overview of tensor decompositions, see [10, 27].

Identifiability and learning of undercomplete/over-determined latent representations: Most of the theoretical results on identifiability and learning of latent variable models are limited to non-singular models, which implies that the latent space dimensionality is at most the observed dimensionality. We outline some of the recent works below.

The works of Anandkumar et al. [6, 28, 29] provide an efficient moment-based approach for learning topic models, under constraints on the distribution of the topic proportions, e.g. the single topic model, and more generally latent Dirichlet allocation (LDA). In addition, the approach can handle a variety of latent variable models such as Gaussian mixtures, hidden Markov models (HMM) and community models [30]. The high-level idea is to reduce the problem of learning the latent variable model to finding a CP decomposition of the (suitably adjusted) observed moment tensor. Various approaches can then be employed to find the CP decomposition. In [6], a tensor power method approach is analyzed and is shown to be an efficient guaranteed recovery method in the non-degenerate (i.e. undercomplete) setting. Previously, simultaneous diagonalization techniques have been employed for solving the CP decomposition, e.g. [28, 31, 32].
However, these techniques fail when the model is overcomplete, as considered here. We note that some recent techniques, e.g. [20], can be employed instead, albeit at a cost of higher computational complexity for overcomplete CP tensor decomposition. However, it is not clear how the sparsity constraints affect the guarantees of such methods. Moreover, these approaches cannot handle general topic models, where the distribution of the topic proportions is not limited to these classes (i.e. either single topic or Dirichlet distribution), and we require tensor decompositions which are more general than the CP decomposition.

There are many other works which consider learning mixture models when multiple views are available. See [28] for a detailed description of these works. Recently, Rabani et al. [33] consider learning discrete mixtures given a large number of "views", and they refer to the number of views as the sampling aperture. They establish improved recovery results (in terms of ℓ1 bounds) when a sufficient number of views is available (2k − 1 views for a k-component mixture). However, their results are limited to discrete mixtures or single-topic models, while our setting can handle more general topic models. Moreover, our approach is different since we incorporate sparsity constraints in the topic-word distribution.

Another series of recent works by Arora et al. [8, 34] employs approaches based on non-negative matrix factorization (NMF) to recover the topic-word matrix. These works allow models with arbitrarily correlated topics, as considered here. They establish guaranteed learning when every topic has an anchor word, i.e. the word is uniquely generated from that topic, and does not occur under any other topic. Note that the anchor-word assumption cannot be satisfied in the overcomplete setting.
Our work is closely related to the work of Anandkumar et al. [7], which considers identifiability and learning of topic models under expansion conditions on the topic-word matrix. The work of Spielman et al. [35] considers the problem of dictionary learning, which is closely related to the setting of [7], but in addition assumes that the coefficient matrix is random. However, these works [7, 35] can handle only the undercomplete setting, where the number of topics is less than the dimensionality of the word vocabulary (or the number of dictionary atoms is less than the number of observations in [35]). We extend these results to the overcomplete setting by proposing novel higher order expansion conditions on the topic-word matrix, and also incorporate additional rank constraints present in higher order moments.

Dictionary learning/sparse coding: Overcomplete representations have been very popular in the context of dictionary learning or sparse coding. Here, the task is to jointly learn a dictionary as well as a sparse selection of the dictionary atoms to fit the observed data. There have been Bayesian as well as frequentist approaches for dictionary learning [2, 36, 37]. However, the heuristics employed in these works [2, 36, 37] have no performance guarantees. The work of Spielman et al. [35] considers learning (undercomplete) dictionaries and provides guaranteed learning under the assumption that the coefficient matrix is random (distributed as Bernoulli-Gaussian variables). Recent works [38, 39] provide generalization bounds for predictive sparse coding, where the goal of the learned representation is to obtain good performance on some predictive task. This differs from our framework since we do not consider predictive tasks here, but the task of recovering the underlying latent representation.
Hillar and Sommer [40] consider the problem of identifiability of sparse coding and establish that when the dictionary succeeds in reconstructing a certain set of sparse vectors, then there exists a unique sparse coding, up to permutation and scaling. However, our setting here is different, since we do not assume that a sparse set of topics occurs in each document.

2 Model

Notation: The set {1, 2, . . . , n} is denoted by [n] := {1, 2, . . . , n}. Given a set X = {1, . . . , p}, the set X^(n) denotes all ordered n-tuples generated from X. The cardinality of a set S is denoted by |S|. For any vector u (or matrix U), the support is denoted by Supp(u), and the ℓ0 norm is denoted by ‖u‖_0, which corresponds to the number of non-zero entries of u, i.e., ‖u‖_0 := |Supp(u)|. For a vector u ∈ R^q, Diag(u) ∈ R^{q×q} is the diagonal matrix with vector u on its diagonal. The column space of a matrix A is denoted by Col(A). Vector e_i ∈ R^q is the i-th basis vector, with the i-th entry equal to 1 and all the others equal to zero. For A ∈ R^{p×q} and B ∈ R^{m×n}, the Kronecker product A ⊗ B ∈ R^{pm×qn} is defined as [41]

$$A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1q}B \\ a_{21}B & a_{22}B & \cdots & a_{2q}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{p1}B & a_{p2}B & \cdots & a_{pq}B \end{bmatrix},$$

and for A = [a_1 | a_2 | ··· | a_r] ∈ R^{p×r} and B = [b_1 | b_2 | ··· | b_r] ∈ R^{m×r}, the Khatri-Rao product A ⊙ B ∈ R^{pm×r} is defined as

$$A \odot B = [a_1 \otimes b_1 \mid a_2 \otimes b_2 \mid \cdots \mid a_r \otimes b_r].$$

2.1 Persistent topic model

In this section, the n-persistent topic model is introduced; it imposes an additional constraint, known as topic persistence, on the popular admixture model [4, 5, 42]. The n-persistent topic model reduces to the bag-of-words admixture model when n = 1.
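The two products defined in the Notation paragraph above are easy to check numerically; in particular, A ⊙ B consists exactly of the columns a_j ⊗ b_j of A ⊗ B, which is the sub-matrix relation used in Section 1.2. The dimensions below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((2, 3))    # p x r
B = rng.random((4, 3))    # m x r

# Khatri-Rao product: column-wise Kronecker product, shape (p*m, r).
KR = np.column_stack([np.kron(A[:, j], B[:, j]) for j in range(A.shape[1])])

# It sits inside the full Kronecker product A (x) B as the columns a_j (x) b_j:
full = np.kron(A, B)      # shape (p*m, r*r) = (8, 9)
ncols = B.shape[1]
assert np.allclose(full[:, [j * ncols + j for j in range(ncols)]], KR)
print(KR.shape)           # (8, 3)
```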
An admixture model specifies a $q$-dimensional vector of topic proportions $h \in \Delta^{q-1} := \{u \in \mathbb{R}^q : u_i \geq 0, \sum_{i=1}^q u_i = 1\}$, which generates the observed variables $x_l \in \mathbb{R}^p$ through vectors $a_1, \ldots, a_q \in \mathbb{R}^p$. This collection of vectors $a_i$, $i \in [q]$, is referred to as the population structure or topic-word matrix [42]. For instance, $a_i$ is the conditional distribution of words given topic $i$. The latent variable $h$ is a $q$-dimensional random vector $h := [h_1, \ldots, h_q]^\top$ known as the proportion vector. A prior distribution $P(h)$ over the probability simplex $\Delta^{q-1}$ characterizes the prior joint distribution over the latent variables $h_i$, $i \in [q]$. In topic modeling, this is the prior distribution over the $q$ topics.

The $n$-persistent topic model has the three-level multi-view hierarchy shown in Figure 1: $2rn$ words (views) are shown in the model for some integer $r \geq 1$. In this model, a common hidden topic is persistent for a sequence of $n$ words $\{x_{(j-1)n+1}, \ldots, x_{(j-1)n+n}\}$, $j \in [2r]$. Note that the random observed variables (words) are exchangeable within groups of size $n$, where $n$ is the persistence level, but are not globally exchangeable.

We now describe a linear representation of the $n$-persistent topic model, along the lines of [6], but with extensions to incorporate persistence. Each random variable $y_j$, $j \in [2r]$, is a discrete-valued random variable taking one of the $q$ possibilities $\{1, \ldots, q\}$, i.e., $y_j \in [q]$ for $j \in [2r]$. In the $n$-persistent model, a single common topic is chosen for a sequence of $n$ words $\{x_{(j-1)n+1}, \ldots, x_{(j-1)n+n}\}$, $j \in [2r]$, i.e., the topic is persistent for $n$ successive views. For notational purposes, we equivalently assume that the variables $y_j$, $j \in [2r]$, are encoded by the basis vectors $e_i$, $i \in [q]$.
Thus, the variable $y_j$, $j \in [2r]$, satisfies
$$y_j = e_i \in \mathbb{R}^q \iff \text{the topic of the } j\text{-th group of words is } i.$$
Given the proportion vector $h$, the topics $y_j$, $j \in [2r]$, are independently drawn according to the conditional expectation
$$\mathbb{E}[y_j \mid h] = h, \quad j \in [2r],$$
or equivalently $\Pr[y_j = e_i \mid h] = h_i$, for $j \in [2r]$, $i \in [q]$.

Finally, at the bottom layer, each observed variable $x_l$, $l \in [2rn]$, is a discrete-valued $p$-dimensional random variable, where $p$ is the size of the word vocabulary. Again, we assume that the variables $x_l$ are encoded by the basis vectors $e_k$, $k \in [p]$, such that
$$x_l = e_k \in \mathbb{R}^p \iff \text{the } l\text{-th word in the document is } k.$$
Given the corresponding topic $y_j$, $j \in [2r]$, the words $x_l$, $l \in [2rn]$, are independently drawn according to the conditional expectation
$$\mathbb{E}\big[x_{(j-1)n+k} \mid y_j = e_i\big] = a_i, \quad i \in [q],\ j \in [2r],\ k \in [n], \tag{1}$$
where the vectors $a_i \in \mathbb{R}^p$, $i \in [q]$, are the conditional probability distribution vectors. The matrix $A = [a_1 | a_2 | \cdots | a_q] \in \mathbb{R}^{p \times q}$ collecting these vectors is the population structure or topic-word matrix.

The $(2rn)$-th order moment of the observed variables $x_l$, $l \in [2rn]$, for some integer $r \geq 1$, is defined (in matrix form) as
$$M_{2rn}(x) := \mathbb{E}\big[(x_1 \otimes x_2 \otimes \cdots \otimes x_{rn})(x_{rn+1} \otimes x_{rn+2} \otimes \cdots \otimes x_{2rn})^\top\big] \in \mathbb{R}^{p^{rn} \times p^{rn}}. \tag{2}$$
(Here $x$ denotes the vector generated by concatenating all vectors $x_l$, $l \in [2rn]$.) For the $n$-persistent topic model with $2rn$ observations (words) $x_l$, $l \in [2rn]$, the corresponding moment is denoted by $M^{(n)}_{2rn}(x)$. Note that to estimate the $(2rn)$-th moment, we require a minimum of $2rn$ words in each document. We can select the first $2rn$ words in each document, and average over the different documents to obtain a consistent estimate of the moment.
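To make the generative process concrete, the following sketch samples one document from the $n$-persistent model (an illustration only, not the authors' code; the function name `sample_document` and the small example matrix are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_document(A, h, n, num_groups):
    """Sample a document of num_groups * n words from the n-persistent model.

    A : (p, q) topic-word matrix; column i is the word distribution a_i.
    h : (q,) topic proportion vector on the simplex.
    n : persistence level; each drawn topic persists for n successive words.
    num_groups : number of topic draws (2r in the text).
    """
    p, q = A.shape
    words = []
    for _ in range(num_groups):
        y = rng.choice(q, p=h)                     # Pr[y_j = e_i | h] = h_i
        # the same topic y generates the next n words
        words.extend(rng.choice(p, p=A[:, y]) for _ in range(n))
    return np.array(words)

A = np.array([[0.7, 0.1],
              [0.2, 0.1],
              [0.1, 0.8]])                         # p = 3 words, q = 2 topics
doc = sample_document(A, h=np.array([0.5, 0.5]), n=2, num_groups=3)
assert doc.shape == (6,)                           # 2rn = 3 * 2 = 6 words
```

Averaging the outer products of word-encoding vectors over many such documents would yield a consistent estimate of the moment in (2), as described above.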
In this paper, we consider the problem of identifiability when exact moments are available. The moment characterization of the $n$-persistent topic model is provided in Lemma 1 in Section 4.1. Given $M^{(n)}_{2rn}(x)$, what are the sufficient conditions under which the population structure $A$ is identifiable? This is answered in Section 3.

Remark 1. Note that our results are valid for the more general linear model $x_l = A y_j$ (more precisely, $x_{(j-1)n+k} = A y_j$, $j \in [2r]$, $k \in [n]$), i.e., each column of the matrix $A$ does not need to be a valid probability distribution. Furthermore, the observed random variables $x_l$ can be continuous, while the hidden ones $y_j$ are assumed to be discrete.

3 Sufficient Conditions for Generic Identifiability

In this section, the identifiability result for the $n$-persistent topic model with access to the $(2n)$-th order observed moment is provided. First, sufficient deterministic conditions on the population structure $A$ are provided for identifiability in Theorem 1. Next, the deterministic analysis is specialized to a random structured model in Theorem 2.

We now make the notion of identifiability precise. As defined in the literature, (strict) identifiability means that the population structure $A$ can be uniquely recovered up to permutation and scaling for all $A \in \mathbb{R}^{p \times q}$. Instead, we consider a more relaxed notion of identifiability, known as generic identifiability.

Definition 1 (Generic identifiability). We refer to a matrix $A \in \mathbb{R}^{p \times q}$ with a fixed sparsity pattern as generic when the nonzero entries of $A$ are drawn from a distribution which is absolutely continuous with respect to Lebesgue measure.
For a given sparsity pattern, the class of population structure matrices is said to be generically identifiable [25] if all the non-identifiable matrices form a set of Lebesgue measure zero. (An equivalent definition of genericity: if the non-zero entries of an arbitrary sparse matrix are independently perturbed with noise drawn from a continuous distribution to generate $A$, then $A$ is called generic.)

The $(2r)$-th order moment of the hidden variables $h \in \mathbb{R}^q$, denoted by $M_{2r}(h) \in \mathbb{R}^{q^r \times q^r}$, is defined as
$$M_{2r}(h) := \mathbb{E}\Big[\big(\underbrace{h \otimes \cdots \otimes h}_{r \text{ times}}\big)\big(\underbrace{h \otimes \cdots \otimes h}_{r \text{ times}}\big)^\top\Big] \in \mathbb{R}^{q^r \times q^r}. \tag{3}$$

We now provide a set of sufficient conditions for generic identifiability of structured topic models given the $(2rn)$-th order observed moment. We first start with a natural assumption on the hidden variables.

Condition 1 (Non-degeneracy). The $(2r)$-th order moment of the hidden variables $h \in \mathbb{R}^q$, defined in equation (3), is full rank (non-degeneracy of hidden nodes).

Note that there is no hope of distinguishing distinct hidden nodes without this non-degeneracy assumption. We do not impose any other assumption on the hidden variables and can incorporate arbitrarily correlated topics.

Figure 2: A bipartite graph $G(Y, X; E)$ with $|X| = 4$ and $|Y| = 6$, where the edge set $E$ itself is a perfect 2-gram matching.

Furthermore, we can only hope to identify the population structure $A$ up to scaling and permutation. Therefore, we identify $A$ up to a canonical form, defined as follows.

Definition 2 (Canonical form). The population structure $A$ is said to be in canonical form if all of its columns have unit norm.
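Condition 1 can be checked numerically for any concrete prior. The sketch below (our illustration; the Dirichlet prior and the sample size are arbitrary choices) estimates $M_2(h) = \mathbb{E}[h h^\top]$ for $r = 1$ and verifies that it is full rank:

```python
import numpy as np

rng = np.random.default_rng(4)
q, N = 3, 20000

# one arbitrary non-degenerate prior P(h) on the simplex: a Dirichlet
H = rng.dirichlet(alpha=[0.8, 1.0, 1.2], size=N)   # N x q samples

# empirical second-order moment M_2(h) = E[h h^T]  (r = 1 in equation (3))
M2 = H.T @ H / N

assert M2.shape == (q, q)
# Condition 1 (non-degeneracy): the moment matrix is full rank.  This holds
# here even though Cov(h) is rank-deficient on the simplex, since
# M_2(h) = Cov(h) + E[h] E[h]^T.
assert np.linalg.matrix_rank(M2) == q
```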
3.1 Deterministic conditions for generic identifiability

In this section, we consider a fixed sparsity pattern on the population structure $A$ and establish generic identifiability when the non-zero entries of $A$ are drawn from some continuous distribution. Before providing the main result, a generalized notion of (perfect) matching for bipartite graphs is defined. We subsequently impose these conditions on the bipartite graph from topics to words, which encodes the sparsity pattern of the population structure $A$.

Generalized matching for bipartite graphs: A bipartite graph with two disjoint vertex sets $Y$ and $X$ and an edge set $E$ between them is denoted by $G(Y, X; E)$. Given the bi-adjacency matrix $A$, the notation $G(Y, X; A)$ is also used to denote a bipartite graph. Here, the rows and columns of the matrix $A \in \mathbb{R}^{|X| \times |Y|}$ are respectively indexed by the vertex sets $X$ and $Y$. For any subset $S \subseteq Y$, the set of neighbors of vertices in $S$ with respect to $A$ is defined as $N_A(S) := \{i \in X : A_{ij} \neq 0 \text{ for some } j \in S\}$, or equivalently, $N_E(S) := \{i \in X : (j, i) \in E \text{ for some } j \in S\}$ with respect to the edge set $E$. Here, we define a generalized notion of matching for a bipartite graph and refer to it as $n$-gram matching.

Definition 3 ((Perfect) $n$-gram matching). An $n$-gram matching $M$ for a bipartite graph $G(Y, X; E)$ is a subset of edges $M \subseteq E$ which satisfies the following conditions. First, for any $j \in Y$, we have $|N_M(j)| \leq n$. Second, for any $j_1, j_2 \in Y$, $j_1 \neq j_2$, we have $\min\{|N_M(j_1)|, |N_M(j_2)|\} > |N_M(j_1) \cap N_M(j_2)|$. A perfect $n$-gram matching or $Y$-saturating $n$-gram matching for the bipartite graph $G(Y, X; E)$ is an $n$-gram matching $M$ in which each vertex in $Y$ is the end-point of exactly $n$ edges in $M$.
In words, in an $n$-gram matching $M$, each vertex $j \in Y$ is the end-point of at most $n$ edges in $M$, and for any pair of vertices in $Y$ ($j_1, j_2 \in Y$, $j_1 \neq j_2$), there exists at least one non-common neighbor in the set $X$ for each of them ($j_1$ and $j_2$). As an example, a bipartite graph $G(Y, X; E)$ with $|X| = 4$ and $|Y| = 6$ is shown in Figure 2, for which the edge set $E$ itself is a perfect 2-gram matching.

Remark 2 (Relationship to other matchings). The relationship of $n$-gram matching to other types of matchings is discussed below.

• Regular matching: For the special case $n = 1$, the (perfect) $n$-gram matching reduces to the usual (perfect) matching for bipartite graphs.

• $b$-matching: A $b$-matching for a bipartite graph $G(Y, X; E)$ (with equal vertex sizes $|X| = |Y|$) is a subset of edges $M_b \subseteq E$, where each vertex is connected to $b$ edges. Compared with the proposed perfect $n$-gram matching, $b$-matching does not enforce that the sets of neighbors be different, and furthermore, it requires that $|X| = |Y|$, which is not possible in the overcomplete setting.

Remark 3 (Necessary size bound). Consider a bipartite graph $G(Y, X; E)$ with $|Y| = q$ and $|X| = p$ which has a perfect $n$-gram matching. Note that there are $\binom{p}{n}$ $n$-combinations on the $X$ side, and each combination can have at most one neighbor (a node in $Y$ which is connected to all nodes in the combination) through the matching; therefore we necessarily have $q \leq \binom{p}{n}$.

Finally, note that the existence of a perfect $n$-gram matching implies the existence of a perfect $(n+1)$-gram matching (provided the degree of each node on the matching side $Y$ in the original bipartite graph is at least $n + 1$), but the reverse is not true. For example, the bipartite graph $G(Y, X; E)$ with $|X| = 4$ and $|Y| = \binom{4}{2} = 6$ in Figure 2 has a perfect 2-gram matching, but not a perfect (1-gram) matching (since $6 > 4$).
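Definition 3 is straightforward to check directly. The sketch below (our illustration) verifies that an edge set assigning each of the $\binom{4}{2} = 6$ topics a distinct pair of the 4 words, consistent with Figure 2, is a perfect 2-gram matching but admits no ordinary perfect matching:

```python
from itertools import combinations

def is_perfect_ngram_matching(M, Y, n):
    """Check Definition 3: M is a set of edges (j, i) with j in Y, i in X."""
    nbrs = {j: {i for (jj, i) in M if jj == j} for j in Y}
    # perfect: every vertex in Y is the end-point of exactly n edges
    if any(len(nbrs[j]) != n for j in Y):
        return False
    # any two distinct vertices in Y must each have a non-common neighbor
    for j1, j2 in combinations(Y, 2):
        common = nbrs[j1] & nbrs[j2]
        if min(len(nbrs[j1]), len(nbrs[j2])) <= len(common):
            return False
    return True

# |X| = 4 words, |Y| = 6 topics, each topic matched to a distinct word pair
# (a hypothetical edge set consistent with Figure 2)
Y = list(range(6))
word_pairs = list(combinations(range(4), 2))       # the 6 distinct 2-subsets
E = [(j, i) for j, pair in zip(Y, word_pairs) for i in pair]

assert is_perfect_ngram_matching(E, Y, n=2)
# a 1-gram (ordinary perfect) matching is impossible here, since 6 > 4:
one_gram = [(j, min(pair)) for j, pair in zip(Y, word_pairs)]
assert not is_perfect_ngram_matching(one_gram, Y, n=1)
```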
Identifiability conditions based on the existence of a perfect $n$-gram matching in the topic-word graph: We are now ready to state the identifiability conditions and result.

Condition 2 (Perfect $n$-gram matching on $A$). The bipartite graph $G(V_h, V_o; A)$ between hidden and observed variables has a perfect $n$-gram matching.

The above condition implies that the sparsity pattern of the matrix $A$ is appropriately scattered in the mapping from hidden to observed variables to be identifiable. Intuitively, it means that every hidden node can be distinguished from every other hidden node by its unique set of neighbors under the corresponding $n$-gram matching. Furthermore, Condition 2 is the key to establishing identifiability in the overcomplete regime. As stated in the size bound in Remark 3, for $n \geq 2$, the number of hidden variables can exceed the number of observed variables and we can still have a perfect $n$-gram matching.

Definition 4 (Kruskal rank, [15]). The Kruskal rank or krank of a matrix $A$ is defined as the maximum number $k$ such that every subset of $k$ columns of $A$ is linearly independent.

Note that the krank is different from the general notion of matrix rank, and it is a lower bound for the matrix rank, i.e., $\mathrm{Rank}(A) \geq \mathrm{krank}(A)$.

Condition 3 (Krank condition on $A$). The Kruskal rank of the matrix $A$ satisfies the bound $\mathrm{krank}(A) \geq d_{\max}(A)^n$, where $d_{\max}(A)$ is the maximum node degree of any column of $A$.

In the overcomplete regime, it is not possible for $A$ to be full column rank, and $\mathrm{krank}(A) < |V_h| = q$. However, note that a large enough krank ensures that appropriately sized subsets of columns of $A$ are linearly independent. For instance, when $\mathrm{krank}(A) > 1$, any two columns cannot be collinear
and the above condition rules out the collinear case for identifiability. In the above condition, we see that a larger krank can accommodate denser connections between topics and words.

The main identifiability result under a fixed graph structure is stated in the following theorem for $n \geq 2$, where $n$ is the topic persistence level. The identifiability result relies on having access to the $(2rn)$-th order moment of the observed variables $x_l$, $l \in [2rn]$, defined in equation (2) as
$$M_{2rn}(x) := \mathbb{E}\big[(x_1 \otimes x_2 \otimes \cdots \otimes x_{rn})(x_{rn+1} \otimes x_{rn+2} \otimes \cdots \otimes x_{2rn})^\top\big] \in \mathbb{R}^{p^{rn} \times p^{rn}},$$
for some integer $r \geq 1$.

Theorem 1 (Generic identifiability under a deterministic topic-word graph structure). Let $M^{(n)}_{2rn}(x)$ in equation (2) be the $(2rn)$-th order observed moment of the $n$-persistent topic model for some integer $r \geq 1$. If the model satisfies Conditions 1, 2 and 3, then, for any $n \geq 2$, all the columns of the population structure $A$ are generically identifiable from $M^{(n)}_{2rn}(x)$. Furthermore, the $(2r)$-th order moment of the hidden variables, denoted by $M_{2r}(h)$, is also generically identifiable.

The theorem is proved in Appendix A. We see that the population structure $A$ is identifiable given any observed moment of order at least $2n$. Increasing the order of the observed moment results in identifying higher order moments of the hidden variables.

The above theorem does not cover the case where the persistence level is $n = 1$. This is the usual bag-of-words admixture model. Identifiability of this model has been studied earlier [7], and we recall it below.

Remark 4 (Bag-of-words admixture model, [7]).
Given $(2r)$-th order observed moments with $r \geq 1$, the structure of the popular bag-of-words admixture model and the $(2r)$-th order moment of the hidden variables are identifiable when $A$ is full column rank and the following expansion condition holds [7]:
$$|N_A(S)| \geq |S| + d_{\max}(A), \quad \forall S \subseteq V_h,\ |S| \geq 2. \tag{4}$$
Our result for $n \geq 2$ in Theorem 1 provides identifiability in the overcomplete regime with the weaker matching condition 2 and krank condition 3. The matching condition 2 is weaker than the above expansion condition, which is based on the perfect matching and hence does not allow overcomplete models. Furthermore, the above result for the bag-of-words admixture model requires full column rank of $A$, which is more stringent than our krank condition 3.

Remark 5 (Kruskal rank and degree diversity). Condition 3 requires that the Kruskal rank of the topic-word matrix be large enough compared to the maximum degree of the topics. Intuitively, a larger Kruskal rank ensures enough diversity in the word supports among different topics under a higher level of sparsity. This Kruskal rank condition also allows for more degree diversity among the topics when the topic persistence level is $n > 1$. On the other hand, for the bag-of-words model ($n = 1$), using (4) implies that $2 d_{\min} > d_{\max}$, where $d_{\min}, d_{\max}$ are the minimum and maximum degrees of the topics. Thus, we provide identifiability results with more degree diversity when higher order moments are employed.

Remark 6 (Recovery using $\ell_1$ optimization). It turns out that our conditions for identifiability imply that the columns of the $n$-gram matrix $A^{\odot n}$, defined in Definition 6, are the sparsest vectors in $\mathrm{Col}\big(M^{(n)}_{2n}(x)\big)$ having a tensor rank of one. See Appendix A.
This implies recovery of the columns of $A$ through exhaustive search, which is not efficient. Efficient $\ell_1$-based recovery algorithms have been analyzed in [7, 43] for the undercomplete case ($n = 1$). They can be employed here for recovery from higher order moments as well. Exploiting additional structure present in $A^{\odot n}$ for $n > 1$, such as the rank-1 test devices proposed in [20], is an interesting avenue for future investigation.

3.2 Analysis under random topic-word graph structures

In this section, we specialize the identifiability result to the random case. This result is based on more transparent conditions on the size and the degree of the random bipartite graph $G(V_h, V_o; A)$. We consider the random model where, in the bipartite graph $G(V_h, V_o; A)$, each node $i \in V_h$ is randomly connected to $d_i$ different nodes in the set $V_o$. Note that this is a heterogeneous degree model.

Condition 4 (Size condition). The random bipartite graph $G(V_h, V_o; A)$ with $|V_h| = q$, $|V_o| = p$, and $A \in \mathbb{R}^{p \times q}$ satisfies the size condition $q \leq \left(\frac{cp}{n}\right)^n$ for some constant $0 < c < 1$.

This size condition is required to establish that the random bipartite graph has a perfect $n$-gram matching (and hence satisfies the deterministic condition 2). It is shown in Section 5.2.1 that the necessary size constraint $q = O(p^n)$ stated in Remark 3 is achieved in the random case. Thus, the above constraint allows for the overcomplete regime, where $q \gg p$ for $n \geq 2$, and is tight.

Condition 5 (Degree condition). In the random bipartite graph $G(V_h, V_o; A)$ with $|V_h| = q$, $|V_o| = p$, and $A \in \mathbb{R}^{p \times q}$, the degree $d_i$ of the nodes $i \in V_h$ satisfies the following lower and upper bounds ($d_i \in [d_{\min}, d_{\max}]$):

• Lower bound: $d_{\min} \geq \max\{1 + \beta \log p,\ \alpha \log p\}$ for some constants $\beta > \frac{n-1}{\log 1/c}$ and $\alpha > \max\big\{2 n^2 \big(\beta \log \frac{1}{c} + 1\big),\ 2\beta n\big\}$.
• Upper bound: $d_{\max} \leq (cp)^{\frac{1}{n}}$.

Intuitively, the lower bound on the degree is required to show that the corresponding bipartite graph $G(V_h, V_o; A)$ has a sufficient number of random edges to ensure that it has a perfect $n$-gram matching with high probability. The upper bound on the degree is mainly required to satisfy the krank condition 3, where $d_{\max}(A)^n \leq \mathrm{krank}(A)$. It is important to note that, for $n \geq 2$, the above degree condition covers a range of models from sparse to intermediate regimes, and it is reasonable in a number of applications that each topic does not generate a very large number of words.

Definition 5 (whp). A sequence of events $\mathcal{E}_p$ occurs with high probability (whp) if $\Pr(\mathcal{E}_p) = 1 - O(p^{-\epsilon})$ for some $\epsilon > 0$.

The main random identifiability result is stated in the following theorem for $n \geq 2$, while the $n = 1$ case is addressed in Remark 8. The identifiability result relies on having access to the $(2rn)$-th order moment of the observed variables $x_l$, $l \in [2rn]$, defined in equation (2) as
$$M_{2rn}(x) := \mathbb{E}\big[(x_1 \otimes x_2 \otimes \cdots \otimes x_{rn})(x_{rn+1} \otimes x_{rn+2} \otimes \cdots \otimes x_{2rn})^\top\big] \in \mathbb{R}^{p^{rn} \times p^{rn}},$$
for some integer $r \geq 1$.

Probability rate constants: The probability rate of success in the following random identifiability result is specified by the constants $\beta' > 0$ and $\gamma = \gamma_1 + \gamma_2 > 0$ as
$$\beta' = \beta \log \frac{1}{c} - n + 1, \tag{5}$$
$$\gamma_1 = e^{n-1}\left(\frac{c^n}{n-1} + \frac{e^2}{1 - \delta_1}\, n^{\beta'+1}\right), \tag{6}$$
$$\gamma_2 = \frac{c^{n-1} e^2 n^n}{1 - \delta_2}, \tag{7}$$
where $\delta_1$ and $\delta_2$ are constants satisfying $e^2 \binom{p}{n}^{-\beta \log 1/c} < \delta_1 < 1$ and $c^{n-1} e^2 n^n p^{-\beta'} < \delta_2 < 1$.

Theorem 2 (Random identifiability). Let $M^{(n)}_{2rn}(x)$ in equation (2) be the $(2rn)$-th order observed moment of the $n$-persistent topic model for some integer $r \geq 1$.
If the model with random population structure $A$ satisfies Conditions 1, 4 and 5, then whp (with probability at least $1 - \gamma p^{-\beta'}$ for constants $\beta' > 0$ and $\gamma > 0$ specified in (5)–(7)), for any $n \geq 2$, all the columns of the population structure $A$ are identifiable from $M^{(n)}_{2rn}(x)$. Furthermore, the $(2r)$-th order moment of the hidden variables, denoted by $M_{2r}(h)$, is also identifiable, whp.

The theorem is proved in Appendix B. As in the deterministic analysis, the population structure $A$ is identifiable given any observed moment of order at least $2n$. Increasing the order of the observed moment results in identifying higher order moments of the hidden variables.

Remark 7 (Trade-off between topic-word size ratio and degree). When the number of hidden variables increases, i.e., $c$ increases, but the order $n$ is kept fixed, the bounds on the degree in Condition 5 also need to grow. Intuitively, a larger degree is needed to provide more flexibility in choosing the subsets of neighbors for hidden nodes, which ensures the existence of a perfect $n$-gram matching in the bipartite graph, which in turn ensures identifiability. Note that as $c$ grows, the parameter $\beta$, which governs the lower bound on the degree $d$, also grows, and the probability rate (i.e., the term $-\beta \log c$) remains constant. Hence, the probability rate does not change as $c$ increases, since the increase in the degree $d$ compensates for the additional "difficulty" arising from a larger number of hidden variables.

The above identifiability theorem only covers $n \geq 2$; the $n = 1$ case is addressed in the following remark.

Remark 8 (Bag-of-words admixture model). The identifiability result for the random bag-of-words admixture model is comparable to the result in [43], which considers exact recovery of sparsely-used dictionaries.
They assume that $Y = DX$ is given for some unknown arbitrary dictionary $D \in \mathbb{R}^{q \times q}$ and unknown random sparse coefficient matrix $X \in \mathbb{R}^{q \times p}$. They establish that if $D \in \mathbb{R}^{q \times q}$ is full rank and the random sparse coefficient matrix $X \in \mathbb{R}^{q \times p}$ follows the Bernoulli-subgaussian model, with the size constraint $p > C q \log q$ and the degree constraint $O(\log q) < \mathbb{E}[d] < O(\sqrt{q} \log q)$, then the model is identifiable, whp. Comparing the size and degree constraints, our identifiability result for $n \geq 2$ requires a more stringent upper bound on the degree ($d = O(p^{1/n})$), but a more relaxed condition on the size ($q = O(p^n)$), which allows for identifiability in the overcomplete regime.

Remark 9 (The size condition is tight). The size bound $q = O(p^n)$ in the above theorem achieves the necessary condition $q \leq \binom{p}{n} = O(p^n)$ (see Remark 3), and is therefore tight. The sufficiency is argued in Theorem 3, where we show that the matching condition 2 holds under the above size and degree conditions 4 and 5.

4 Identifiability via Uniqueness of Tensor Decompositions

In this section, we characterize the moments of the $n$-persistent topic model in terms of the model parameters, i.e., the topic-word matrix $A$ and the moments of the hidden variables. We relate identifiability of the topic model to uniqueness of a certain class of tensor decompositions, which, in turn, enables us to prove Theorems 1 and 2. We then discuss the special cases of the persistent topic model, viz., the single topic model (infinite-persistent topic model) and the bag-of-words admixture model (1-persistent topic model).

4.1 Moment characterization of the persistent topic model

The moment characterization requires the following definition of an $n$-gram matrix.

Definition 6 ($n$-gram matrix).
Given a matrix $A \in \mathbb{R}^{p \times q}$, its $n$-gram matrix $A^{\odot n} \in \mathbb{R}^{p^n \times q}$ is defined as the matrix whose $(\mathbf{i}, j)$-th entry, for $\mathbf{i} := (i_1, i_2, \ldots, i_n) \in [p]^n$ and $j \in [q]$, is given by
$$A^{\odot n}(\mathbf{i}, j) := A_{i_1, j} A_{i_2, j} \cdots A_{i_n, j}, \quad \text{or} \quad A^{\odot n} := \underbrace{A \odot \cdots \odot A}_{n \text{ times}}.$$
That is, $A^{\odot n}$ is the column-wise $n$-th order Kronecker product of $n$ copies of $A$, known as the Khatri-Rao product [41].

In the following lemma, which is proved in Appendix A.2, we characterize the observed moments of a persistent topic model. Throughout this section, the order of the observed moment is fixed to $2m$.

Lemma 1 ($n$-persistent topic model moment characterization). The $(2m)$-th order moment of the observed variables, defined in equation (2), for the $n$-persistent topic model is characterized as follows. (The cases not covered here are deferred to Appendix A.2; see Remark 12.)

• If $m = rn$ for some integer $r \geq 1$, then
$$M^{(n)}_{2m}(x) = \Big(\underbrace{A^{\odot n} \otimes \cdots \otimes A^{\odot n}}_{r \text{ times}}\Big)\, M_{2r}(h)\, \Big(\underbrace{A^{\odot n} \otimes \cdots \otimes A^{\odot n}}_{r \text{ times}}\Big)^\top, \tag{8}$$
where $M_{2r}(h) \in \mathbb{R}^{q^r \times q^r}$ is the $(2r)$-th order moment of the hidden variables $h \in \mathbb{R}^q$, defined in equation (3).

• If $n \geq 2m$, then
$$M^{(n)}_{2m}(x) = \big(A^{\odot m}\big)\, M_1(h)\, \big(A^{\odot m}\big)^\top, \tag{9}$$
where $M_1(h) := \mathrm{Diag}(\mathbb{E}[h]) \in \mathbb{R}^{q \times q}$ is the first order moment of the hidden variables $h \in \mathbb{R}^q$, stacked in a diagonal matrix.

Thus, we see that the observed moments can be expressed in terms of the hidden moments $M(h)$ and Kronecker products of the $n$-gram matrices. In the special case when the persistence level is large enough compared to the order of the moment ($n \geq 2m$), the moment form reduces to the Khatri-Rao product form in (9). Moreover, in (9) we have a diagonal matrix $M_1(h)$, instead of the general (dense) matrix $M_{2r}(h)$ that appears in (8) when $n < 2m = 2rn$. Thus, we have a more succinct representation of the moments in (9) when the persistence level of the topics is large enough.
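Definition 6 can be implemented directly as a repeated Khatri-Rao product. The sketch below (our illustration; `ngram_matrix` is our helper name) also checks the entry-wise formula $A^{\odot n}(\mathbf{i}, j) = A_{i_1,j} \cdots A_{i_n,j}$ under row-major index flattening:

```python
import numpy as np

def ngram_matrix(A, n):
    """n-gram matrix: n-fold column-wise Kronecker (Khatri-Rao) power of A."""
    out = A
    for _ in range(n - 1):
        out = np.column_stack(
            [np.kron(out[:, j], A[:, j]) for j in range(A.shape[1])])
    return out

rng = np.random.default_rng(1)
p, q, n = 3, 5, 2
A = rng.random((p, q))

An = ngram_matrix(A, n)
assert An.shape == (p**n, q)          # p^n rows, but still only q columns

# entry ((i1, i2), j), with (i1, i2) flattened as i1 * p + i2,
# equals A[i1, j] * A[i2, j]
i1, i2, j = 2, 1, 4
assert np.isclose(An[i1 * p + i2, j], A[i1, j] * A[i2, j])
```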
In the following, we contrast the special cases where the persistence level is $n \to \infty$ (single topic model) and $n = 1$ (bag-of-words admixture model), as shown in Figures 3a and 3b.

Figure 3: Hierarchical structure of (a) the single topic model (infinite-persistent topic model) and (b) the bag-of-words admixture model (1-persistent topic model), shown for $2m$ words (views).

In order to have a fair comparison, the number of observed variables is fixed to $2m$ and the persistence level is varied.

Single topic model ($n \to \infty$): The condition in (9) ($n \geq 2m$) is always satisfied for the single topic model, since $n \to \infty$ in this case, and we have
$$M^{(\infty)}_{2m}(x) = \big(A^{\odot m}\big)\, M_1(h)\, \big(A^{\odot m}\big)^\top. \tag{10}$$
Note that $M_1(h)$ is a diagonal matrix.

Bag-of-words admixture model ($n = 1$): From Lemma 1, the $(2m)$-th order moment of the observed variables $x_l$, $l \in [2m]$, for the bag-of-words admixture model (1-persistent topic model), shown in Figure 3b, is given by
$$M^{(1)}_{2m}(x) = \Big(\underbrace{A \otimes \cdots \otimes A}_{m \text{ times}}\Big)\, M_{2m}(h)\, \Big(\underbrace{A \otimes \cdots \otimes A}_{m \text{ times}}\Big)^\top, \tag{11}$$
where $M_{2m}(h) \in \mathbb{R}^{q^m \times q^m}$ is the $(2m)$-th order moment of the hidden variables $h \in \mathbb{R}^q$, defined in (3). Note that $M_{2m}(h)$ is a full matrix in general.

Contrasting the single topic ($n \to \infty$) and bag-of-words ($n = 1$) models: Comparing equations (10) and (11), it is seen that the moments under the single topic model in (10) are more "structured" than those of the bag-of-words model in (11): in (11) we have Kronecker products of the topic-word matrix $A$, while (10) involves Khatri-Rao products of $A$.
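The contrast between the two moment forms shows up already in the shapes and the generic ranks of the two products. A small NumPy sketch (our illustration, with generic random entries):

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 4, 5                           # overcomplete: q > p topics
A = rng.random((p, q))

# Khatri-Rao product A (.) A: expands only the observed dimension (p^2 x q)
AkrA = np.column_stack([np.kron(A[:, j], A[:, j]) for j in range(q)])
# Kronecker product A (x) A: expands both dimensions (p^2 x q^2)
AkronA = np.kron(A, A)

assert AkrA.shape == (16, 5)
assert AkronA.shape == (16, 25)

# the q columns of A itself cannot be linearly independent (q > p), but the
# q columns of A (.) A generically are -- the "expansion effect" discussed
# in the text
assert np.linalg.matrix_rank(A) == 4
assert np.linalg.matrix_rank(AkrA) == 5
```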
This forms a crucial criterion in determining whether overcomplete models are identifiable, as discussed below.

Why does persistence help in identifiability of overcomplete models? For simplicity, let the order of the moment be $2m = 4$. Equations (10) and (11) reduce to
$$M^{(\infty)}_4(x) = (A \odot A)\, \mathrm{Diag}\big(\mathbb{E}[h]\big)\, (A \odot A)^\top, \tag{12}$$
$$M^{(1)}_4(x) = (A \otimes A)\, \mathbb{E}\big[(h \otimes h)(h \otimes h)^\top\big]\, (A \otimes A)^\top. \tag{13}$$
Note that for the single topic model in (12), the Khatri-Rao product matrix $A \odot A \in \mathbb{R}^{p^2 \times q}$ has the same number of columns (i.e., the latent dimensionality) as the original matrix $A$, while the number of rows (i.e., the observed dimensionality) is increased.

Figure 4: An example of an overcomplete matrix $A$ and the matrices $A \odot A$ and $A \otimes A$; the corresponding bipartite graphs encode the sparsity pattern of each of the matrices. (a) Structure of an overcomplete matrix $A \in \mathbb{R}^{4 \times 5}$ having a perfect 2-gram matching. (b) Structure of $A \odot A \in \mathbb{R}^{16 \times 5}$ having a perfect ($Y$-saturating) matching, highlighted by dashed red edges. (c) Structure of $A \otimes A \in \mathbb{R}^{16 \times 25}$; for simplicity, only a few edges and nodes are shown.

Thus, the Khatri-Rao product
$A \odot A$ "expands" the effect of the hidden variables to higher-order observed variables, which is the key towards identifying overcomplete models. In other words, the original overcomplete representation becomes determined due to the 'expansion effect' of the Khatri-Rao product structure of the higher order observed moments. On the other hand, in the bag-of-words admixture model in (13), this interesting 'expansion property' does not occur, and we have the Kronecker product $A \otimes A \in \mathbb{R}^{p^2 \times q^2}$ in place of the Khatri-Rao product. The Kronecker product operation increases both the number of columns (i.e., the latent dimensionality) and the number of rows (i.e., the observed dimensionality), which implies that higher order moments do not help in identifying overcomplete models. An example is provided in Figure 4, which shows how the matrices $A \odot A$ and $A \otimes A$ behave differently in terms of mapping topics to word tuples.

Note that for the $n$-persistent model with $n = 2$, the 4th order moment reduces to
$$M^{(2)}_4(x) = (A \odot A)\, \mathbb{E}\big[h h^\top\big]\, (A \odot A)^\top. \tag{14}$$
Contrasting the above equation with (12) and (13), we find that the 2-persistent model retains the desirable property of possessing Khatri-Rao products, while being more general than the form for the single topic model in (12). This key property enables us to establish identifiability of topic models with finite persistence levels.

4.2 Tensor algebra of the model

In Section 4.1, we provided a representation of the moment forms in matrix form. We now provide the equivalent tensor representation of the moments.
The tensor representation is more compact and transparent, and allows us to compare the topic models under different levels of persistence. We compare the derived tensor form with the well-known Tucker and CP decompositions. We first introduce some tensor notations and definitions.

4.2.1 Tensor notations and definitions

A real-valued order-$n$ tensor $A \in \bigotimes_{i=1}^n \mathbb{R}^{p_i} := \mathbb{R}^{p_1 \times \cdots \times p_n}$ is an $n$-dimensional array $A(1\!:\!p_1, \ldots, 1\!:\!p_n)$, where the $i$-th mode is indexed from 1 to $p_i$. In this paper, we restrict ourselves to the case $p_1 = \cdots = p_n = p$, and simply write $A \in \bigotimes^n \mathbb{R}^p$. A fiber of a tensor $A$ is a vector obtained by fixing all indices of $A$ except one; e.g., for $A \in \bigotimes^4 \mathbb{R}^3$, the vector $f = A(2, 1\!:\!3, 3, 1)$ is a fiber. For a vector $u \in \mathbb{R}^p$, $\operatorname{Diag}_n(u) \in \bigotimes^n \mathbb{R}^p$ is the $n$-th order diagonal tensor with vector $u$ on its diagonal. The tensor $A \in \bigotimes^n \mathbb{R}^p$ is stacked as a vector $a \in \mathbb{R}^{p^n}$ by the $\operatorname{vec}(\cdot)$ operator, defined as
$$a = \operatorname{vec}(A) \iff a\bigl((i_1 - 1)p^{n-1} + (i_2 - 1)p^{n-2} + \cdots + (i_{n-1} - 1)p + i_n\bigr) = A(i_1, i_2, \ldots, i_n).$$
The inverse of the $a = \operatorname{vec}(A)$ operation is denoted by $A = \operatorname{ten}(a)$. For vectors $a_i \in \mathbb{R}^{p_i}$, $i \in [n]$, the tensor outer product operator "$\circ$" is defined as [41]
$$A = a_1 \circ a_2 \circ \cdots \circ a_n \in \bigotimes_{i=1}^n \mathbb{R}^{p_i} \iff A(i_1, i_2, \ldots, i_n) := a_1(i_1)\,a_2(i_2)\cdots a_n(i_n). \tag{15}$$
The tensor generated above is a rank-1 tensor. The tensor rank is the minimal number of rank-1 tensors into which a tensor can be decomposed; this type of rank is called the CP (Candecomp/Parafac) tensor rank in the literature [41]. According to the above definitions, for any set of vectors $a_i \in \mathbb{R}^{p_i}$, $i \in [n]$, we have the following pair of equalities:
$$\operatorname{vec}(a_1 \circ a_2 \circ \cdots \circ a_n) = a_1 \otimes a_2 \otimes \cdots \otimes a_n, \qquad \operatorname{ten}(a_1 \otimes a_2 \otimes \cdots \otimes a_n) = a_1 \circ a_2 \circ \cdots \circ a_n.
$$
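These two identities can be sanity-checked numerically. The sketch below is an illustration only (the small dimensions $2 \times 3 \times 4$ are arbitrary); with the 1-based index formula above, $\operatorname{vec}(\cdot)$ coincides with NumPy's C-order flattening, and $\operatorname{ten}(\cdot)$ with the inverse reshape.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, c = (rng.standard_normal(d) for d in (2, 3, 4))

# Outer product a ∘ b ∘ c as an order-3 tensor, built via einsum.
T = np.einsum('i,j,k->ijk', a, b, c)

# vec(·) in the convention of the text: index (i1-1)p^{n-1} + ... + i_n,
# which is exactly C-order (row-major) flattening.
vec_T = T.reshape(-1)

# Kronecker product a ⊗ b ⊗ c.
kron = np.kron(np.kron(a, b), c)

print(np.allclose(vec_T, kron))   # True: vec(a∘b∘c) = a⊗b⊗c

# ten(·) inverts vec(·): reshape the Kronecker vector back to (2, 3, 4).
T_back = kron.reshape(2, 3, 4)
print(np.allclose(T_back, T))     # True: ten(a⊗b⊗c) = a∘b∘c
```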
For any vector $a \in \mathbb{R}^p$, the power notations are also defined as
$$a^{\otimes n} := \underbrace{a \otimes a \otimes \cdots \otimes a}_{n \text{ times}} \in \mathbb{R}^{p^n}, \qquad a^{\circ n} := \underbrace{a \circ a \circ \cdots \circ a}_{n \text{ times}} \in \bigotimes^n \mathbb{R}^p.$$
The second power is usually called the $n$-th order tensor power of vector $a$. Finally, the Tucker and CP (Candecomp/Parafac) representations are defined as follows [10, 41].

Definition 7 (Tucker representation). Given a core tensor $S \in \bigotimes_{i=1}^n \mathbb{R}^{r_i}$ and inverse factors $U_i \in \mathbb{R}^{p_i \times r_i}$, $i \in [n]$, the Tucker representation of the $n$-th order tensor $A \in \bigotimes_{i=1}^n \mathbb{R}^{p_i}$ is
$$A = \sum_{i_1=1}^{r_1} \sum_{i_2=1}^{r_2} \cdots \sum_{i_n=1}^{r_n} S(i_1, i_2, \ldots, i_n)\, U_1(:, i_1) \circ U_2(:, i_2) \circ \cdots \circ U_n(:, i_n) =: [[S; U_1, U_2, \ldots, U_n]], \tag{16}$$
where $U_j(:, i_j)$ denotes the $i_j$-th column of matrix $U_j$. The tensor $S$ is referred to as the core tensor.

Definition 8 (CP representation). Given $\lambda \in \mathbb{R}^r$ and $U_i \in \mathbb{R}^{p_i \times r}$, $i \in [n]$, the CP representation of the $n$-th order tensor $A \in \bigotimes_{i=1}^n \mathbb{R}^{p_i}$ is
$$A = \sum_{i=1}^{r} \lambda_i\, U_1(:, i) \circ U_2(:, i) \circ \cdots \circ U_n(:, i) =: [[\operatorname{Diag}_n(\lambda); U_1, U_2, \ldots, U_n]], \tag{17}$$
where $U_j(:, i)$ denotes the $i$-th column of matrix $U_j$.

Note that the CP representation is a special case of the Tucker representation where the core tensor $S$ is square and diagonal.

4.2.2 Tensor representation of moments under the topic model

We now provide a tensor representation of the moments. For the $n$-persistent topic model, the $2m$-th observed moment is denoted by $T_{2m}^{(n)}(x)$, which is the tensor form of the moment matrix $M_{2m}^{(n)}(x)$ characterized in Lemma 1. It is given by
$$T_{2m}(x)_{(i_1, i_2, \ldots, i_{2m})} := \mathbb{E}[x_1(i_1)\,x_2(i_2)\cdots x_{2m}(i_{2m})], \quad i_1, i_2, \ldots, i_{2m} \in [p], \tag{18}$$
where $T_{2m}(x) \in \bigotimes^{2m} \mathbb{R}^p$. This tensor is characterized in the following lemma, and is proved in Appendix A.2.
Lemma 2 ($n$-persistent topic model moment characterization in tensor form). The $(2m)$-th order moment of words, defined in equation (18), for the $n$-persistent topic model is characterized as follows.¹⁰

• If $m = rn$ for some integer $r \geq 1$, then
$$T_{2m}^{(n)}(x) = \sum_{i_1=1}^{q} \sum_{i_2=1}^{q} \cdots \sum_{i_{2r}=1}^{q} \mathbb{E}[h_{i_1} h_{i_2} \cdots h_{i_{2r}}]\; a_{i_1}^{\circ n} \circ a_{i_2}^{\circ n} \circ \cdots \circ a_{i_{2r}}^{\circ n} = [[S_r; \underbrace{A, A, \ldots, A}_{2m \text{ times}}]], \tag{19}$$
where $S_r \in \bigotimes^{2rn} \mathbb{R}^q$ is the core tensor in the above Tucker representation, with sparsity pattern
$$S_r(i) = \begin{cases} M_{2r}(h)_{\bigl((i_n, i_{2n}, \ldots, i_{rn}),\, (i_{(r+1)n}, i_{(r+2)n}, \ldots, i_{2rn})\bigr)}, & i_1 = i_2 = \cdots = i_n,\; i_{n+1} = i_{n+2} = \cdots = i_{2n},\; \ldots, \\ 0, & \text{otherwise,} \end{cases}$$
where $i := (i_1, i_2, \ldots, i_{2rn})$.

• If $n \geq 2m$, then
$$T_{2m}^{(n)}(x) = \sum_{i \in [q]} \mathbb{E}[h_i]\, a_i^{\circ 2m} = [[\operatorname{Diag}_{2m}(\mathbb{E}[h]); \underbrace{A, A, \ldots, A}_{2m \text{ times}}]]. \tag{20}$$

¹⁰ The other cases not covered in Lemma 2 are deferred to Appendix A.2. See Remark 12.

The tensor representation in (19) is a specific type of tensor decomposition: it is a special case of the Tucker representation (since $S_r$ is not fully dense), but more general than the CP representation. The tensor representation in (20) has a CP form.

Comparison with the single topic model and the bag-of-words admixture model. We now provide the tensor form for the special cases of the single topic model and the bag-of-words admixture model. In order to have a fair comparison, the number of observed variables is fixed to $2m$ and the persistence level is varied.

CP representation of the single topic model: The $(2m)$-th order moment of the words for the single topic model (infinite-persistent topic model) is provided in equation (20) as
$$T_{2m}^{(\infty)}(x) = \sum_{i \in [q]} \mathbb{E}[h_i]\, a_i^{\circ 2m} = [[\operatorname{Diag}_{2m}(\mathbb{E}[h]); \underbrace{A, A, \ldots, A}_{2m \text{ times}}]]. \tag{21}$$
This representation is the symmetric CP representation¹¹ of $T_{2m}^{(\infty)}(x)$.

Tucker representation of the bag-of-words admixture model: From Lemma 2, the tensor form of the $(2m)$-th order moment of observed variables $x_l$, $l \in [2m]$, for the bag-of-words admixture model (1-persistent topic model) is given by
$$T_{2m}^{(1)}(x) = \sum_{i_1=1}^{q} \sum_{i_2=1}^{q} \cdots \sum_{i_{2m}=1}^{q} \mathbb{E}[h_{i_1} h_{i_2} \cdots h_{i_{2m}}]\; a_{i_1} \circ a_{i_2} \circ \cdots \circ a_{i_{2m}} = [[\mathbb{E}\bigl[h^{\circ(2m)}\bigr]; \underbrace{A, A, \ldots, A}_{2m \text{ times}}]]. \tag{22}$$
This is the Tucker representation (decomposition) of $T_{2m}^{(1)}(x)$, where the core tensor $S = \mathbb{E}\bigl[h^{\circ(2m)}\bigr]$ is the tensor form of the $(2m)$-th order hidden moment $M_{2m}(h)$ defined in equation (3), and the inverse factors correspond to the population structure $A$.

¹¹ In Appendix C, we provide a more detailed comparison between our approach and some of the previous identifiability results for the (overcomplete) CP decomposition.

Figure 5: Hierarchy among the proposed conditions and results. For the random case, the size and degree conditions 4 and 5 imply (Theorems 3 and 4) the matching and krank conditions 2 and 3 on $A$; these imply (Lemma 5) the rank and expansion conditions 6 and 7 on $A^{\odot n}$, which, together with the non-degeneracy condition 1 on $h$, yield identifiability (Theorem 5).

Comparing the tensor forms for the $n$-persistent topic model (19), the single topic model (21), and the bag-of-words admixture model (22), we find that all of them involve Tucker decompositions, where the inverse factors correspond to the topic-word matrix $A$, and the only difference is in the sparsity level of the core tensor $S$. For the bag-of-words model, with $n = 1$, the core tensor is in general fully dense, while for the single topic model, with $n \to \infty$, the core tensor is diagonal, which reduces to the CP decomposition.
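The contrast between these core tensors can be made concrete with a toy sketch. Here $q = 3$ is an arbitrary choice, and the unit entries merely mark the sparsity patterns of the order-4 core ($2m = 4$), not actual hidden moments: for $n = 1$ every entry may be nonzero, for $n = 2$ only entries with $i_1 = i_2$ and $i_3 = i_4$ survive, and for $n \to \infty$ only the diagonal does.

```python
import numpy as np

q = 3  # number of topics (arbitrary toy size)

# Core tensor S ∈ ⊗^4 R^q for 2m = 4 observed words.
dense_core = np.ones((q, q, q, q))        # n = 1: fully dense (bag of words)

diag_core = np.zeros((q, q, q, q))        # n → ∞: diagonal core (CP form)
for i in range(q):
    diag_core[i, i, i, i] = 1.0

persistent_core = np.zeros((q, q, q, q))  # n = 2: structured sparsity
for i in range(q):
    for j in range(q):
        persistent_core[i, i, j, j] = 1.0 # nonzero iff i1 = i2 and i3 = i4

for name, S in [("n=1", dense_core), ("n=2", persistent_core),
                ("n=inf", diag_core)]:
    print(name, int(np.count_nonzero(S)))
# n=1 81  (q^4), n=2 9  (q^2), n=inf 3  (q)
```

The nonzero counts $q^4$, $q^2$ and $q$ quantify how persistence interpolates between the two extremes discussed in the text.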
For a general topic model with persistence level $n$, the core tensor is between these two extremes and has structured sparsity. This sparsity property of the core tensor is crucial towards establishing identifiability in the overcomplete regime. The bag-of-words model is not identifiable in the overcomplete regime since its core tensor is fully dense, while an overcomplete $n$-persistent topic model can be identified under the constraints provided in Section 3, since its core tensor has structured sparsity and symmetry.

5 Proof Techniques and Auxiliary Results

The main identifiability results are given in Theorems 1 and 2 for the deterministic and random cases of topic-word graph structures. In this section, we provide a proof sketch of these results, and then propose auxiliary results on the existence of a perfect $n$-gram matching for random bipartite graphs and a lower bound on the Kruskal rank of random matrices.

5.1 Proof sketch

Summary of relationships among the different conditions: To summarize, there exists a hierarchy among the proposed conditions, depicted in Figure 5. First, in the random analysis, the size and degree conditions 4 and 5 are sufficient for satisfying the perfect $n$-gram matching and krank conditions 2 and 3, as shown by Theorems 3 and 4. Then, conditions 2 and 3 ensure that the rank and expansion conditions 6 and 7 hold, as shown by Lemma 5. Finally, conditions 6 and 7, together with the non-degeneracy condition 1, yield the primary identifiability result in Theorem 5. Note that the genericity of $A$ is also required for these results to hold.
Primary deterministic analysis in Theorem 5: The deterministic analysis is primarily based on conditions on the $n$-gram matrix $A^{\odot n}$; but since these conditions are opaque (mainly the expansion condition on $A^{\odot n}$, provided in condition 7), this analysis is then related to conditions on the matrix $A$ itself. See Theorem 5 in Appendix A.1 for the identifiability result based on $A^{\odot n}$. We briefly discuss it below for the case when $2n$ words are available under the $n$-persistent topic model. From equation (8), the $(2n)$-th order moment of the observed variables under the $n$-persistent topic model can be written as
$$M_{2n}^{(n)}(x) = \bigl(A^{\odot n}\bigr)\,\mathbb{E}\bigl[hh^\top\bigr]\bigl(A^{\odot n}\bigr)^\top. \tag{23}$$
The question is whether we can recover $A$, given $M_{2n}^{(n)}(x)$. Obviously, the matrix $A$ is not identifiable without further conditions. First, the non-degeneracy and rank conditions (conditions 1 and 6) are required. Assuming these two conditions, we have from (23) that $\operatorname{Col}\bigl(M_{2n}^{(n)}(x)\bigr) = \operatorname{Col}\bigl(A^{\odot n}\bigr)$. Therefore, the problem of recovering $A$ from $M_{2n}^{(n)}(x)$ reduces to finding $A^{\odot n}$ in $\operatorname{Col}\bigl(A^{\odot n}\bigr)$. We then show that, under the following expansion condition on $A^{\odot n}$ and the genericity property, the matrix $A$ is identifiable from $\operatorname{Col}\bigl(A^{\odot n}\bigr)$. The expansion condition (refer to condition 7 for a more detailed statement) imposes the following property on the bipartite graph $G\bigl(V_h, V_o^{(n)}; A^{\odot n}\bigr)$:¹²
$$\bigl|N_{A^{\odot n}_{\mathrm{Rest.}}}(S)\bigr| \geq |S| + d_{\max}\bigl(A^{\odot n}\bigr), \quad \forall S \subseteq V_h, \; |S| > \operatorname{krank}(A), \tag{24}$$
where $d_{\max}\bigl(A^{\odot n}\bigr)$ is the maximum node degree in the set $V_h$, and the restricted version of the $n$-gram matrix, denoted by $A^{\odot n}_{\mathrm{Rest.}}$, is obtained by removing its redundant (identical) rows (see Definition 9).
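For small instances, condition (24) can be verified exhaustively. The sketch below is a toy illustration, not the paper's procedure: the support pattern of $A$ ($p = 6$ words, $q = 5$ topics, two nonzeros per column) is invented, and the threshold $|S| > 2$ is used as a stand-in for $\operatorname{krank}(A)$ in this example.

```python
import numpy as np
from itertools import combinations, product

# Hypothetical sparse topic-word matrix A (p=6 words, q=5 topics).
p, q = 6, 5
A = np.zeros((p, q))
for j, rows in {0: (0, 1), 1: (2, 3), 2: (4, 5),
                3: (0, 2), 4: (1, 4)}.items():
    A[list(rows), j] = 1.0

# 2-gram matrix A ⊙ A and its restricted version (one row per
# unordered pair (i1, i2) with i1 <= i2).
B = np.stack([np.kron(A[:, j], A[:, j]) for j in range(q)], axis=1)
keep = [i1 * p + i2 for i1, i2 in product(range(p), repeat=2) if i1 <= i2]
B_rest = B[keep, :]

d_max = int((B != 0).sum(axis=0).max())  # max degree of a node in V_h

def n_of(S):
    """|N_{A^{⊙2}_Rest}(S)|: restricted word pairs adjacent to topic set S."""
    return int((B_rest[:, list(S)] != 0).any(axis=1).sum())

# Check (24) for every S with |S| > 2 (toy stand-in for krank(A)).
ok = all(n_of(S) >= len(S) + d_max
         for r in range(3, q + 1)
         for S in combinations(range(q), r))
print(ok)   # True for this support pattern
```

With two nonzeros per column of $A$, each topic touches $d_{\max}(A^{\odot 2}) = 4$ ordered pairs, so every subset of more than two topics must reach at least $|S| + 4$ restricted word pairs, which this pattern satisfies.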
The identifiability claim is proved by showing that, under the expansion condition in (24) and genericity conditions, the columns of $A^{\odot n}$ are the sparsest and rank-1 vectors (in tensor form) in $\operatorname{Col}\bigl(A^{\odot n}\bigr)$. Note that since we only require expansion on sets larger than the Kruskal rank, the expansion condition (24) is more relaxed than the expansion condition proposed in [7, 43] for identifiability in the undercomplete regime. For a more detailed comparison, refer to Remark 11 in Appendix A.1.

Deterministic analysis in Theorem 1: The expansion and rank conditions in Theorem 5 are imposed on the $n$-gram matrix $A^{\odot n}$. Using the generalized matching notions defined in Section 3.1, sufficient combinatorial conditions on the matrix $A$ (conditions 2 and 3) are introduced which ensure that the expansion and rank conditions on $A^{\odot n}$ are satisfied. The following lemma is employed to establish these results; it states an interesting property which relates the existence of a perfect matching in $A^{\odot n}$ to the existence of a perfect $n$-gram matching in $A$.

Lemma 3. If $G(Y, X; A)$ has a perfect $n$-gram matching, then $G(Y, X^{(n)}; A^{\odot n})$ has a perfect matching. In the other direction, if $G(Y, X^{(n)}; A^{\odot n})$ has a perfect matching $M^{\odot n}$, then $G(Y, X; A)$ has a perfect $n$-gram matching under the following condition on $M^{\odot n}$: all matching edges $(j, (i_1, \ldots, i_n)) \in M^{\odot n}$ must satisfy $i_1 \neq i_2 \neq \cdots \neq i_n$ for all $j \in Y$. In words, the matching edges must be connected to nodes in $X^{(n)}$ that are indexed by tuples of distinct indices.

See Appendix A.4 for a proof. Using this lemma, condition 2 implies that $G(Y, X^{(n)}; A^{\odot n})$ has a perfect matching. It is then straightforward to argue that the expansion and rank conditions on $A^{\odot n}$ are satisfied, which is shown in Lemma 5 in Appendix A.3.
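The object in Lemma 3, a perfect ($Y$-saturating) matching in $G(Y, X^{(n)}; A^{\odot n})$, can be computed directly for small examples. The sketch below is a toy check only: the $4 \times 5$ support pattern of $A$ is invented (it is not the matrix of Figure 4), and a plain augmenting-path routine stands in for any matching algorithm. With $q > p$ no $Y$-saturating matching can exist in $G(Y, X; A)$ itself, yet $A \odot A$ admits one.

```python
import numpy as np

def max_matching(adj, n_left, n_right):
    """Maximum bipartite matching via augmenting paths.
    adj[u] lists the right-nodes adjacent to left-node u."""
    match_r = [-1] * n_right  # match_r[v] = left node matched to v, or -1

    def augment(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_r[v] == -1 or augment(match_r[v], seen):
                match_r[v] = u
                return True
        return False

    return sum(augment(u, set()) for u in range(n_left))

# Hypothetical sparse A (p=4 words, q=5 topics, q > p).
p, q = 4, 5
A = np.zeros((p, q))
for j, rows in {0: (0, 1), 1: (1, 2), 2: (2, 3),
                3: (3, 0), 4: (0, 2)}.items():
    A[list(rows), j] = 1.0

# Bipartite graph G(Y, X^(2); A ⊙ A): topics vs ordered word pairs.
B = np.stack([np.kron(A[:, j], A[:, j]) for j in range(q)], axis=1)
adj = [np.flatnonzero(B[:, j]).tolist() for j in range(q)]

print(max_matching(adj, q, p * p))   # 5: every topic is matched
```

Each topic here is matched to a distinct ordered word pair, which is exactly the "perfect matching in $A^{\odot n}$" side of Lemma 3 for $n = 2$.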
This leads to the generic identifiability result stated in Theorem 1.

¹² $V_o^{(n)}$ denotes all ordered $n$-tuples generated from the set $V_o := \{1, \ldots, p\}$, which indexes the rows of $A^{\odot n}$.

5.2 Analysis of Random Structures

The identifiability result for a random structured matrix $A$ is provided in Theorem 2. Sufficient size and degree conditions 4 and 5 on the random matrix $A$ are proposed such that the deterministic combinatorial conditions 2 and 3 on $A$ are satisfied. The details of these auxiliary results are provided in the following two subsections. In Section 5.2.1, it is proved in Theorem 3 that a random bipartite graph satisfying reasonable size and degree constraints has a perfect $n$-gram matching (condition 2), whp. Then, a lower bound on the Kruskal rank of a random matrix $A$ under size and degree constraints is provided in Theorem 4 in Section 5.2.2, which implies the krank condition 3. Intuitions on why such size and degree conditions are required are given in Section 3.2, where these conditions are proposed.

5.2.1 Existence of a perfect $n$-gram matching for random bipartite graphs

We show in the following theorem that a random bipartite graph satisfying the reasonable size and degree constraints proposed earlier in conditions 4 and 5 has a perfect $n$-gram matching whp.

Theorem 3 (Existence of a perfect $n$-gram matching for random bipartite graphs). Consider a random bipartite graph $G(Y, X; E)$ with $|Y| = q$ nodes on the left side and $|X| = p$ nodes on the right side, where each node $i \in Y$ is randomly connected to $d_i$ different nodes in $X$. Let $d_{\min} := \min_{i \in Y} d_i$. Assume that the graph satisfies the size condition $q \leq \bigl(\frac{cp}{n}\bigr)^n$ (condition 4) for some constant $0 < c < 1$, and the degree condition $d_{\min} \geq \max\{1 + \beta \log p,\; \alpha \log p\}$ for some constants $\beta > \frac{n-1}{\log 1/c}$ and $\alpha > \max\bigl\{2n^2\bigl(\beta \log \frac{1}{c} + 1\bigr),\; 2\beta n\bigr\}$ (lower bound in condition 5).
Then, there exists a perfect ($Y$-saturating) $n$-gram matching in the random bipartite graph $G(Y, X; E)$, with probability at least $1 - \gamma_1 p^{-\beta'}$ for constants $\beta' > 0$ and $\gamma_1 > 0$ specified in (5) and (6).

Note that the sufficient size bound $q = O(p^n)$ in the above theorem is also necessary (see Remark 3), and is therefore tight.

Remark 10 (Insufficiency of the union bound argument). It is easier to exploit union bound arguments to propose random bipartite graphs which have a perfect $n$-gram matching whp. It is proved in Appendix B.1 that if $d \geq n$ and the size constraint $|Y| = O\bigl(|X|^{\frac{n}{2} - \delta}\bigr)$ for some $\delta > 0$ is satisfied, then whp the random bipartite graph has a perfect $n$-gram matching. Comparing this result with ours in Theorem 3, our approach has better size scaling while the union bound approach has better degree scaling. The size scaling limitation makes the union bound argument unattractive: in order to identify the population structure $A$ in the overcomplete regime, where $|Y| = O(|X|^n)$, we would need access to at least the $(4n)$-th order moment under the union bound argument, while only the $(2n)$-th order moment is required under our argument.

5.2.2 Lower bound on the Kruskal rank of random matrices

In the following theorem, a lower bound on the Kruskal rank of a random matrix $A$ under dimension and degree constraints is provided; it is proved in Appendix B.1.

Theorem 4 (Lower bound on the Kruskal rank of random matrices). Consider a random matrix $A \in \mathbb{R}^{p \times q}$, where for each $i \in [q]$ there are $d_i$ random non-zero entries in column $i$. Let $d_{\min} := \min_{i \in [q]} d_i$.
Assume that $A$ satisfies the size condition $q \leq \bigl(\frac{cp}{n}\bigr)^n$ (condition 4) for some constant $0 < c < 1$ and the degree condition $d_{\min} \geq 1 + \beta \log p$ for some constant $\beta > \frac{n-1}{\log 1/c}$ (lower bound in condition 5), and in addition that $A$ is generic. Then, $\operatorname{krank}(A) \geq cp$, with probability at least $1 - \gamma_2 p^{-\beta'}$ for constants $\beta' > 0$ and $\gamma_2 > 0$ specified in (5) and (7).

Acknowledgements

The authors acknowledge useful discussions with Sina Jafarpour, Adel Javanmard, Alex Dimakis, Moses Charikar, Sanjeev Arora, Ankur Moitra and Kamalika Chaudhuri. A. Anandkumar is supported in part by a Microsoft Faculty Fellowship, NSF Career award CCF-1254106, NSF Award CCF-1219234, ARO Award W911NF-12-1-0404, and ARO YIP Award W911NF-13-1-0084. M. Janzamin is supported by NSF Award CCF-1219234, ARO Award W911NF-12-1-0404 and ARO YIP Award W911NF-13-1-0084.

Appendix

A Proof of the Deterministic Identifiability Result (Theorem 1)

First, we show the identifiability result under an alternative set of conditions on the $n$-gram matrix $A^{\odot n}$, and then we show that the conditions of Theorem 1 are sufficient for these conditions to hold.

A.1 Deterministic analysis based on $A^{\odot n}$

In this section, the deterministic identifiability result based on conditions on the $n$-gram matrix $A^{\odot n}$ is provided. The $n$-gram matrix $A^{\odot n} \in \mathbb{R}^{p^n \times q}$ contains redundant rows: if some row of $A^{\odot n}$ is indexed by the $n$-tuple $(i_1, \ldots, i_n) \in [p]^n$, then any row indexed by a permutation of the tuple $(i_1, \ldots, i_n)$ has the same entries. Therefore, the number of distinct rows of $A^{\odot n}$ is at most $\binom{p + n - 1}{n}$. In the following definition, we define a non-redundant version of the $n$-gram matrix which is restricted to the (potentially) distinct rows.

Definition 9 (Restricted $n$-gram matrix). For any matrix $A \in \mathbb{R}^{p \times q}$, the restricted $n$-gram matrix $A^{\odot n}_{\mathrm{Rest.}}$
$\in \mathbb{R}^{s \times q}$, with $s = \binom{p + n - 1}{n}$, is defined as the restricted version of the $n$-gram matrix $A^{\odot n} \in \mathbb{R}^{p^n \times q}$ in which the redundant rows of $A^{\odot n}$ are removed, as explained above.

Condition 6 (Rank condition). The $n$-gram matrix $A^{\odot n}$ is full column rank.

Condition 7 (Graph expansion). Let $G(V_h, V_o^{(n)}; A^{\odot n})$ denote the bipartite graph with vertex sets $V_h$ corresponding to the hidden variables (indexing the columns of $A^{\odot n}$) and $V_o^{(n)}$ corresponding to the $n$-th order observed variables (indexing the rows of $A^{\odot n}$), and edge matrix $A^{\odot n} \in \mathbb{R}^{|V_o^{(n)}| \times |V_h|}$. The bipartite graph $G(V_h, V_o^{(n)}; A^{\odot n})$ satisfies the following expansion property on the restricted version specified by $A^{\odot n}_{\mathrm{Rest.}}$:
$$\bigl|N_{A^{\odot n}_{\mathrm{Rest.}}}(S)\bigr| \geq |S| + d_{\max}\bigl(A^{\odot n}\bigr), \quad \forall S \subseteq V_h, \; |S| > \operatorname{krank}(A), \tag{25}$$
where $d_{\max}\bigl(A^{\odot n}\bigr)$ is the maximum node degree in the set $V_h$.

Remark 11. The expansion condition for the bag-of-words admixture model is provided in (4), introduced in [7]. The proposed expansion condition in (25) is inherited from (4), with two major modifications. First, the condition is appropriately generalized for our model, which involves a graph with edges specified by the $n$-gram matrix $A^{\odot n}$, as stated in (23). Second, the expansion property (4) proposed in [7] needs to be satisfied for all subsets $S$ of size $|S| \geq 2$, which is a stricter condition than the one proposed here in (25), since we can have $\operatorname{krank}(A) \gg 2$.

The deterministic identifiability result based on the conditions on $A^{\odot n}$ is stated in the following theorem for $n \geq 2$; the $n = 1$ case is addressed in Remarks 4 and 11.
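The row redundancy behind Definition 9 is easy to exhibit numerically. The sketch below is an illustration only (dimensions $p = 3$, $q = 4$, $n = 3$ and the Gaussian $A$ are arbitrary): it builds $A^{\odot 3}$, keeps one representative row per multiset of indices, and confirms that the $\binom{p+n-1}{n}$ restricted rows carry the same rank information as all $p^n$ rows.

```python
import numpy as np
from math import comb
from itertools import product

rng = np.random.default_rng(2)
p, q, n = 3, 4, 3
A = rng.standard_normal((p, q))

# n-gram matrix A^{⊙n}: row (i1, ..., in), column j holds
# A[i1, j] * ... * A[in, j], so permuting the tuple leaves it unchanged.
B = A.copy()
for _ in range(n - 1):
    B = np.stack([np.kron(B[:, j], A[:, j]) for j in range(q)], axis=1)
print(B.shape)   # (27, 4) = (p**n, q)

# Restricted version: one representative per multiset, i1 <= ... <= in.
keep, seen = [], set()
for r, idx in enumerate(product(range(p), repeat=n)):
    key = tuple(sorted(idx))
    if key not in seen:
        seen.add(key)
        keep.append(r)
B_rest = B[keep, :]
print(B_rest.shape[0], comb(p + n - 1, n))   # both 10 = C(p+n-1, n)

# Removing redundant rows loses nothing: equal rank.
print(np.linalg.matrix_rank(B) == np.linalg.matrix_rank(B_rest))  # True
```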
The identifiability result relies on access to the $(2n)$-th order moment of the observed variables $x_l$, $l \in [2n]$, defined in equation (2) as
$$M_{2n}(x) := \mathbb{E}\Bigl[(x_1 \otimes x_2 \otimes \cdots \otimes x_n)(x_{n+1} \otimes x_{n+2} \otimes \cdots \otimes x_{2n})^\top\Bigr] \in \mathbb{R}^{p^n \times p^n}.$$

Theorem 5 (Generic identifiability under deterministic conditions on $A^{\odot n}$). Let $M_{2n}^{(n)}(x)$ (defined in equation (2)) be the $(2n)$-th order moment of the $n$-persistent topic model described in Section 2. If the model satisfies conditions 1, 6 and 7, then, for any $n \geq 2$, all the columns of the population structure $A$ are generically identifiable from $M_{2n}^{(n)}(x)$.

Proof: Define $B := A^{\odot n} \in \mathbb{R}^{p^n \times q}$. Then, the moment characterized in equation (23) can be written as $M_{2n}^{(n)}(x) = B\,\mathbb{E}\bigl[hh^\top\bigr]B^\top$. Since both matrices $\mathbb{E}\bigl[hh^\top\bigr]$ and $B$ have full column rank (from conditions 1 and 6), the rank of $B\,\mathbb{E}\bigl[hh^\top\bigr]B^\top$ is $q$, where $q = O(p^n)$, and furthermore $\operatorname{Col}\bigl(B\,\mathbb{E}[hh^\top]B^\top\bigr) = \operatorname{Col}(B)$. Let $U := \{u_1, \ldots, u_q\} \subset \mathbb{R}^{p^n}$ be any basis of $\operatorname{Col}\bigl(B\,\mathbb{E}[hh^\top]B^\top\bigr)$ satisfying the following two properties: 1) the $u_i$'s have the smallest $\ell_0$ norms; 2) the $u_i$'s have the $q$ smallest (tensor) ranks in $n$-th order tensor form, i.e., $U_i := \operatorname{ten}(u_i)$, $i \in [q]$, have the $q$ smallest ranks. Let the columns of matrix $B$ be $b_i$, $i \in [q]$. Since all the $b_i$'s (which belong to $\operatorname{Col}\bigl(B\,\mathbb{E}[hh^\top]B^\top\bigr)$) are rank-1 in $n$-th order tensor form (since $\operatorname{ten}(b_i) = a_i^{\circ n}$), and the number of non-zero entries in each $b_i$ is at most $d_{\max}(B) = d_{\max}(A)^n$, we conclude that
$$\max_i \operatorname{Rank}\bigl(\operatorname{ten}(u_i)\bigr) = 1 \quad \text{and} \quad \max_i \|u_i\|_0 \leq d_{\max}(B). \tag{26}$$
These bounds follow from the fact that $b_i \in \operatorname{Col}\bigl(B\,\mathbb{E}[hh^\top]B^\top\bigr)$, $i \in [q]$, and therefore the $\ell_0$ norm and rank properties of the $b_i$'s are upper bounds for the corresponding properties of the basis vectors $u_i$ (according to the proposed conditions on the $u_i$'s).
Now, exploiting these observations together with the genericity of $A$ and the expansion condition 7, we show that the basis vectors $u_i$ are scaled columns of $B$. Since $u_i$, $i \in [q]$, is a vector in the column space of $B$, it can be represented as $u_i = Bv_i$ for some vector $v_i \in \mathbb{R}^q$. Equivalently, for any $i \in [q]$, $u_i = \sum_{j=1}^q v_i(j)\, b_j$, where $b_j = a_j^{\otimes n}$ is the $j$-th column of matrix $B$ and $v_i(j)$ is the $j$-th entry of vector $v_i$. Then, the tensor form of $u_i$ can be written as
$$\operatorname{ten}(u_i) = \sum_{j=1}^q v_i(j)\operatorname{ten}(b_j) = \sum_{j=1}^q v_i(j)\operatorname{ten}\bigl(a_j^{\otimes n}\bigr) = \sum_{j=1}^q v_i(j)\, a_j^{\circ n} = [[\operatorname{Diag}_n(v_i); \underbrace{A, \ldots, A}_{n \text{ times}}]], \tag{27}$$
where the last equality is based on the notation defined in Definition 8. We define $\widetilde{v}_i := [v_i(j)]_{j : v_i(j) \neq 0}$ as the vector which contains only the non-zero entries of $v_i$, i.e., $\widetilde{v}_i$ is the restriction of $v_i$ to its support; therefore $\widetilde{v}_i \in \mathbb{R}^r$, where $r := \|v_i\|_0$. Furthermore, the matrix $\widetilde{A}_i := \{a_j : v_i(j) \neq 0\} \in \mathbb{R}^{p \times r}$ is defined as the restriction of $A$ to the columns corresponding to the support of $v_i$. Let $(\widetilde{a}_i)_j$ denote the $j$-th column of $\widetilde{A}_i$. With these definitions, equation (27) reduces to
$$\operatorname{ten}(u_i) = [[\operatorname{Diag}_n(\widetilde{v}_i); \underbrace{\widetilde{A}_i, \ldots, \widetilde{A}_i}_{n \text{ times}}]] = \sum_{j=1}^r \widetilde{v}_i(j)\,\bigl[(\widetilde{a}_i)_j\bigr]^{\circ n}, \tag{28}$$
which is derived by removing the columns of $A$ corresponding to the zero entries of $v_i$. Next, we rule out $\|v_i\|_0 \geq 2$ by considering two cases ($2 \leq \|v_i\|_0 \leq \operatorname{krank}(A)$ and $\operatorname{krank}(A) < \|v_i\|_0 \leq q$), to conclude that the $u_i$'s are scaled columns of $B$.

Case 1: $2 \leq \|v_i\|_0 \leq \operatorname{krank}(A)$. Here, the number of columns of $\widetilde{A}_i \in \mathbb{R}^{p \times \|v_i\|_0}$ is at most $\operatorname{krank}(A)$, and therefore $\widetilde{A}_i$ is full column rank.
Since all the components of the CP representation in equation (28) are full column rank,¹³ for any¹⁴ $n \geq 2$ we have $\operatorname{Rank}\bigl(\operatorname{ten}(u_i)\bigr) = r = \|v_i\|_0 > 1$, which contradicts the fact that $\max_i \operatorname{Rank}\bigl(\operatorname{ten}(u_i)\bigr) = 1$ in (26).

Case 2: $\operatorname{krank}(A) < \|v_i\|_0 \leq q$. Here, we first restrict the $n$-gram matrix $B$ to its distinct rows, denoted by $B_{\mathrm{Rest.}}$, as defined in Definition 9. Let $u_i' = B_{\mathrm{Rest.}} v_i$. Since $u_i'$ is the restricted version of $u_i$, we have
$$\|u_i\|_0 \geq \|u_i'\|_0 = \|B_{\mathrm{Rest.}} v_i\|_0 > \bigl|N_{B_{\mathrm{Rest.}}}\bigl(\operatorname{Supp}(v_i)\bigr)\bigr| - |\operatorname{Supp}(v_i)| \geq d_{\max}(B),$$
where the second inequality is from Lemma 4, and the third inequality follows from the graph expansion property (condition 7). This result contradicts the fact that $\max_i \|u_i\|_0 \leq d_{\max}(B)$ in (26). From the above contradictions, $\|v_i\|_0 = 1$, and hence the columns of $B := A^{\odot n}$ are scaled versions of the $u_i$'s. ∎

¹³ Note that for $n \geq 3$, this full rank condition can be relaxed by Kruskal's condition for uniqueness of the CP decomposition [15] and its generalization to higher order tensors [44]. Precisely, instead of requiring $\operatorname{Rank}(\widetilde{A}_i) = \operatorname{krank}(\widetilde{A}_i) = r$, it suffices to have $\operatorname{krank}(\widetilde{A}_i) \geq (2r + n - 1)/n$ to argue the result of Case 1. This only improves the constants involved in the final result.
¹⁴ Note that for $n = 1$, since the (tensor) rank of any vector is 1, this analysis does not work.

The following lemma is useful in the proof of Theorem 5. The result proposed in this lemma is similar to the parameter genericity condition in [7], but generalized for the $n$-gram matrix $A^{\odot n}$. The lemma is proved along the lines of the proof of Remark 2.2 in [7].

Lemma 4. If $A \in \mathbb{R}^{p \times q}$ is generic, then the $n$-gram matrix $A^{\odot n} \in \mathbb{R}^{p^n \times q}$ satisfies the following property with Lebesgue measure one. For any vector $v \in \mathbb{R}^q$ with $\|v\|_0 \geq 2$, we have $\bigl\|A^{\odot n}_{\mathrm{Rest.}} v\bigr\|_0 > \bigl|N_{A^{\odot n}_{\mathrm{Rest.}}}$
$\bigl(\operatorname{Supp}(v)\bigr)\bigr| - |\operatorname{Supp}(v)|$, where for a set $S \subseteq [q]$, $N_{A^{\odot n}}(S) := \{i \in [p]^n : A^{\odot n}(i, j) \neq 0 \text{ for some } j \in S\}$.

Here, we prove the result for the case $n = 2$; the proof is easily generalized to larger $n$. Let $A := M + Z$ be generic, where $M$ is an arbitrary matrix perturbed by random continuous perturbations $Z$. Consider the 2-gram matrix $B := A \odot A \in \mathbb{R}^{p^2 \times q}$. We show that the restricted version of $B$, denoted by $\widetilde{B} := B_{\mathrm{Rest.}} \in \mathbb{R}^{\frac{p(p+1)}{2} \times q}$, satisfies the above genericity condition. We first establish some definitions.

Definition 10. We call a vector fully dense if all of its entries are non-zero.

Definition 11. We say a matrix has the Null Space Property (NSP) if its null space does not contain any fully dense vector.

Claim 1. Fix any $S \subseteq [q]$ with $|S| \geq 2$, and set $R := N_{M^{(2\text{-gram})}_{\mathrm{Rest.}}}(S)$. Let $\widetilde{C}$ be a $|S| \times |S|$ submatrix of $\widetilde{B}_{R,S}$. Then $\Pr\bigl(\widetilde{C} \text{ has the NSP}\bigr) = 1$.

Proof of Claim 1: First, note that $\widetilde{B}$ can be expanded as
$$\widetilde{B} := (A \odot A)_{\mathrm{Rest.}} = (M \odot M)_{\mathrm{Rest.}} + \underbrace{(M \odot Z + Z \odot M)_{\mathrm{Rest.}} + (Z \odot Z)_{\mathrm{Rest.}}}_{:= U}.$$
Let $s = |S|$ and let $\widetilde{C} = [\widetilde{c}_1 | \widetilde{c}_2 | \cdots | \widetilde{c}_s]^\top$, where $\widetilde{c}_i^\top$ is the $i$-th row of $\widetilde{C}$. Also, let $C := [c_1 | c_2 | \cdots | c_s]^\top$ and $W := [w_1 | w_2 | \cdots | w_s]^\top$ be the corresponding $|S| \times |S|$ submatrices of $M^{(2\text{-gram})}_{\mathrm{Rest.}}$ and $U$, respectively. For each $i \in [s]$, denote by $N_i$ the null space of the matrix $\widetilde{C}_i = [\widetilde{c}_1 | \widetilde{c}_2 | \cdots | \widetilde{c}_i]^\top$, and let $N_0 = \mathbb{R}^s$. Then $N_0 \supseteq N_1 \supseteq \cdots \supseteq N_s$. We need to show that, with probability one, $N_s$ does not contain any fully dense vector. If some $N_i$, $i \in [s]$, does not contain any fully dense vector, the result is proved. Suppose then that $N_i$ contains some fully dense vector $v$. Since $C$ is a submatrix of $M^{(2\text{-gram})}_{R,S}$, every row $c_{i+1}^\top$ of $C$ contains at least one non-zero entry.
Therefore,
$$v^\top \widetilde{c}_{i+1} = \sum_{j \in [s]} v(j)\,\widetilde{c}_{i+1}(j) = \sum_{j \in [s] : c_{i+1}(j) \neq 0} v(j)\bigl(c_{i+1}(j) + w_{i+1}(j)\bigr),$$
where $\{w_{i+1}(j) : j \in [s] \text{ s.t. } c_{i+1}(j) \neq 0\}$ are independent random variables; moreover, they are independent of $\widetilde{c}_1, \ldots, \widetilde{c}_i$ and thus of $v$. By the assumption on the distribution of the $w_{i+1}(j)$,
$$\Pr\Bigl[v \in N_{i+1} \,\Big|\, \widetilde{c}_1, \widetilde{c}_2, \ldots, \widetilde{c}_i\Bigr] = \Pr\Bigl[\textstyle\sum_{j \in [s] : c_{i+1}(j) \neq 0} v(j)\bigl(c_{i+1}(j) + w_{i+1}(j)\bigr) = 0 \,\Big|\, \widetilde{c}_1, \widetilde{c}_2, \ldots, \widetilde{c}_i\Bigr] = 0. \tag{29}$$
Consequently,
$$\Pr\Bigl[\dim(N_{i+1}) < \dim(N_i) \,\Big|\, \widetilde{c}_1, \widetilde{c}_2, \ldots, \widetilde{c}_i\Bigr] = 1 \tag{30}$$
for all $i = 0, \ldots, s - 1$. As a result, with probability one, $\dim(N_s) = 0$. ∎

Now we are ready to prove Lemma 4.

Proof of Lemma 4: It follows from Claim 1 that, with probability one, the following event holds: for every $S \subseteq [q]$ with $|S| \geq 2$, every $|S| \times |S|$ submatrix $\widetilde{C}$ of $\widetilde{B}_{R,S}$, where $R := N_{M^{(2\text{-gram})}_{\mathrm{Rest.}}}(S)$, has the NSP. Now fix $v \in \mathbb{R}^q$ with $\|v\|_0 \geq 2$. Let $S := \operatorname{Supp}(v)$ and $H := \widetilde{B}_{R,S}$. Furthermore, let $u \in (\mathbb{R} \setminus \{0\})^{|S|}$ be the restriction of vector $v$ to $S$; observe that $u$ is fully dense. It is clear that $\|\widetilde{B} v\|_0 = \|Hu\|_0$, so we need to show that
$$\|Hu\|_0 > |R| - |S|. \tag{31}$$
For the sake of contradiction, suppose that $Hu$ has at most $|R| - |S|$ non-zero entries. Since $Hu \in \mathbb{R}^{|R|}$, there is a subset of $|S|$ entries on which $Hu$ is zero. This corresponds to a $|S| \times |S|$ submatrix of $H := \widetilde{B}_{R,S}$ which contains $u$ in its null space. This means that this submatrix does not have the NSP, which is a contradiction. Therefore we conclude that $Hu$ must have more than $|R| - |S|$ non-zero entries, which finishes the proof. ∎

A.2 Proof of moment characterization lemmata

Remark 12. In Lemmata 1 and 2, a specific case of order and persistence ($m = rn$) was considered.
Here, we provide the moment form for a more general case. Assume that $m = rn + s$ for some integers $r \geq 1$ and $1 \leq s \leq \frac{n}{2}$; then
$$M_{2m}^{(n)}(x) = \Bigl(\underbrace{A^{\odot n} \otimes \cdots \otimes A^{\odot n}}_{r \text{ times}} \otimes A^{(s\text{-gram})}\Bigr)\, \widetilde{M}_{2r}(h)\, \Bigl(A^{((n-s)\text{-gram})} \otimes \underbrace{A^{\odot n} \otimes \cdots \otimes A^{\odot n}}_{r-1 \text{ times}} \otimes A^{(2s\text{-gram})}\Bigr)^\top,$$
where $\widetilde{M}_{2r}(h) \in \mathbb{R}^{q^{r+1} \times q^{r+1}}$ is the hidden moment
$$\widetilde{M}_{2r}(h)_{\bigl((i_1, \ldots, i_{r+1}),\, (j_1, \ldots, j_{r+1})\bigr)} := \begin{cases} \mathbb{E}\bigl[h_{i_1} \cdots h_{i_r} h_{i_{r+1}}^2 h_{j_2} \cdots h_{j_{r+1}}\bigr], & \text{if } i_{r+1} = j_1, \\ 0, & \text{otherwise.} \end{cases}$$
The tensor form is also characterized as
$$T_{2m}^{(n)}(x) = [[\widetilde{S}_r; \underbrace{A, A, \ldots, A}_{2m \text{ times}}]],$$
where $\widetilde{S}_r \in \bigotimes^{2m} \mathbb{R}^q$ is the core tensor in the above Tucker representation, with the following sparsity pattern. Let $i := (i_1, i_2, \ldots, i_{2m})$. If
$$i_1 = \cdots = i_n, \quad i_{n+1} = \cdots = i_{2n}, \quad \ldots, \quad i_{(2r-1)n+1} = \cdots = i_{2rn}, \quad i_{2(m-s)+1} = \cdots = i_{2m},$$
then
$$\widetilde{S}_r(i) = \widetilde{M}_{2r}(h)_{\bigl((i_n, i_{2n}, \ldots, i_{rn}, i_m),\, (i_{(r+1)n}, i_{(r+2)n}, \ldots, i_{2rn}, i_{2m})\bigr)}.$$
Otherwise, $\widetilde{S}_r(i) = 0$.

Proof of Lemma 1: To simplify notation, analogously to tensor powers for vectors, the tensor power of a matrix $U \in \mathbb{R}^{p \times q}$ is defined as
$$U^{\otimes r} := \underbrace{U \otimes U \otimes \cdots \otimes U}_{r \text{ times}} \in \mathbb{R}^{p^r \times q^r}. \tag{32}$$
First, consider the case $m = rn$ for some integer $r \geq 1$. One advantage of encoding $y_j$, $j \in [2r]$, by basis vectors appears in characterizing the conditional moments. The first order conditional moment of the words $x_l$, $l \in [2m]$, in the $n$-persistent topic model can be written as
$$\mathbb{E}\bigl[x_{(j-1)n+k} \,\big|\, y_j\bigr] = A y_j, \quad j \in [2r], \; k \in [n],$$
where $A = [a_1 | a_2 | \cdots | a_q] \in \mathbb{R}^{p \times q}$. Next, the $m$-th order conditional moment of the different views $x_l$, $l \in [m]$, in the $n$-persistent topic model can be written as $\mathbb{E}[x_1 \otimes x_2 \otimes \cdots \otimes x_m \,|\, y_1 = e_{i_1}, y_2 = e_{i_2}, \ldots$
, y r = e i r ] = a ⊗ n i 1 ⊗ a ⊗ n i 2 ⊗ · · · ⊗ a ⊗ n i r , whic h is derive d from th e conditional indep endence r el ationships among the observ ations x l , l ∈ [ m ], giv en topics y j , j ∈ [ r ]. Similar to the fi rst order momen ts, since vec tors y j , j ∈ [ r ], are enco ded by the basis vect ors e i ∈ R q , the ab o ve momen t can b e w r itt en as the follo wing matrix m ultiplication E [ x 1 ⊗ x 2 ⊗ · · · ⊗ x m | y 1 , y 2 , . . . , y r ] =  A ⊙ n  ⊗ r ( y 1 ⊗ y 2 ⊗ · · · ⊗ y r ) , (33) where the ( · ) ⊗ r notation is defined in equation (32). No w for the (2 m )-th order momen t, w e ha ve M ( n ) 2 m ( x ) := E h ( x 1 ⊗ x 2 ⊗ · · · ⊗ x m )( x m +1 ⊗ x m +2 ⊗ · · · ⊗ x 2 m ) ⊤ i = E ( y 1 ,y 2 ,...,y 2 r ) h E h ( x 1 ⊗ · · · ⊗ x m )( x m +1 ⊗ · · · ⊗ x 2 m ) ⊤ | y 1 , y 2 , . . . , y 2 r ii ( a ) = E ( y 1 ,y 2 ,...,y 2 r ) h E  ( x 1 ⊗ · · · ⊗ x m ) | y 1 , . . . , y 2 r  E  ( x m +1 ⊗ · · · ⊗ x 2 m ) ⊤ | y 1 , . . . , y 2 r  i ( b ) = E ( y 1 ,y 2 ,...,y 2 r ) h E  ( x 1 ⊗ · · · ⊗ x m ) | y 1 , . . . , y r  E  ( x m +1 ⊗ · · · ⊗ x 2 m ) ⊤ | y r +1 , . . . , y 2 r  i ( c ) = E ( y 1 ,y 2 ,...,y 2 r ) "  h A ⊙ n i ⊗ r  ( y 1 ⊗ · · · ⊗ y r ) ( y r +1 ⊗ · · · ⊗ y 2 r ) ⊤  h A ⊙ n i ⊗ r  ⊤ # 31 =  h A ⊙ n i ⊗ r  E h ( y 1 ⊗ · · · ⊗ y r ) ( y r +1 ⊗ · · · ⊗ y 2 r ) ⊤ i  h A ⊙ n i ⊗ r  ⊤ ( d ) =  h A ⊙ n i ⊗ r  M 2 r ( y )  h A ⊙ n i ⊗ r  ⊤ , (34) where ( a ) results from the indep endence of ( x 1 , . . . , x m ) and ( x m +1 , . . . , x 2 m ) given ( y 1 , y 2 , . . . , y 2 r ) and ( b ) is concluded f rom the indep endence of ( x 1 , . . . , x m ) and ( y r +1 , . . . , y 2 r ) give n ( y 1 , . . . , y r ) and the ind epend ence of ( x m +1 , . . . , x 2 m ) and ( y 1 , . . . , y r ) giv en ( y r +1 , . . . , y 2 r ). Equ at ion (33 ) is used in ( c ) and finally , the (2 r )-th ord er moment of ( y 1 , . . . , y 2 r ) is d efi ned as M 2 r ( y ) := E h ( y 1 ⊗ · · · ⊗ y r ) ( y r +1 ⊗ · · · ⊗ y 2 r ) ⊤ i in ( d ). 
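The matrix identity underlying step $(c)$ — that $(A^{\odot n})^{\otimes r}$ applied to $e_{i_1} \otimes \cdots \otimes e_{i_r}$ returns $a_{i_1}^{\otimes n} \otimes \cdots \otimes a_{i_r}^{\otimes n}$ — can be checked numerically. The following is a minimal sketch (the function names and the small dimensions are ours, chosen for illustration, not from the paper); it builds the $n$-gram matrix $A^{\odot n}$ as a column-wise Kronecker (Khatri-Rao) power and verifies the identity for $n = r = 2$:

```python
import numpy as np

def khatri_rao_power(A, n):
    """n-gram matrix A^{⊙n}: column i is the n-fold Kronecker power of a_i."""
    B = A.copy()
    for _ in range(n - 1):
        # column-wise Kronecker product of the current B with A
        B = np.einsum('ik,jk->ijk', B, A).reshape(-1, A.shape[1])
    return B

def kron_power(U, r):
    """Matrix tensor power U^{⊗r} of equation (32)."""
    V = U.copy()
    for _ in range(r - 1):
        V = np.kron(V, U)
    return V

rng = np.random.default_rng(0)
p, q, n, r = 3, 4, 2, 2
A = rng.standard_normal((p, q))
e = np.eye(q)

An = khatri_rao_power(A, n)   # A^{⊙n}, shape (p^n, q)
B = kron_power(An, r)         # (A^{⊙n})^{⊗r}, shape (p^{rn}, q^r)

# Left-hand side of (33) for y_1 = e_{i1}, y_2 = e_{i2}
i1, i2 = 1, 3
lhs = B @ np.kron(e[:, i1], e[:, i2])
# Right-hand side: a_{i1}^{⊗n} ⊗ a_{i2}^{⊗n}
rhs = np.kron(np.kron(A[:, i1], A[:, i1]), np.kron(A[:, i2], A[:, i2]))
assert np.allclose(lhs, rhs)
```

The check relies only on the mixed-product property of the Kronecker product, $(U \otimes U)(x \otimes y) = (Ux) \otimes (Uy)$, which is exactly what equation (33) exploits.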
For $M_{2r}(y)$, we have by the law of total expectation
$$M_{2r}(y) := \mathbb{E}\Big[ (y_1 \otimes \cdots \otimes y_r)(y_{r+1} \otimes \cdots \otimes y_{2r})^\top \Big] = \mathbb{E}_h\Big[ \mathbb{E}\big[ (y_1 \otimes \cdots \otimes y_r)(y_{r+1} \otimes \cdots \otimes y_{2r})^\top \mid h \big] \Big] = \mathbb{E}_h\Big[ \big( \underbrace{h \otimes \cdots \otimes h}_{r \text{ times}} \big)\big( \underbrace{h \otimes \cdots \otimes h}_{r \text{ times}} \big)^{\!\top} \Big] = M_{2r}(h),$$
where the third equality follows from the conditional independence of the variables $y_j$, $j \in [2r]$, given $h$, and the model assumption that $\mathbb{E}[y_j \mid h] = h$, $j \in [2r]$. Substituting this into equation (34) finishes the proof for the $n$-persistent topic model. The moment of the single topic model (infinite persistence) can be derived similarly. ✷

Proof of Lemma 2: Defining $\Lambda := M_{2r}(h) \in \mathbb{R}^{q^r \times q^r}$ and $B := \big(A^{\odot n}\big)^{\otimes r} \in \mathbb{R}^{p^{rn} \times q^r}$, the $(2rn)$-th order moment $M^{(n)}_{2rn}(x) \in \mathbb{R}^{p^{rn} \times p^{rn}}$ of the $n$-persistent topic model proposed in equation (8) can be written as
$$M^{(n)}_{2rn}(x) = B \Lambda B^\top.$$
Let $b_{(i_1,\dots,i_r)} \in \mathbb{R}^{p^{rn}}$ denote the column of $B$ indexed by the $r$-tuple $(i_1, \dots, i_r)$, $i_k \in [q]$, $k \in [r]$. Then, the above matrix equation can be expanded as
$$M^{(n)}_{2rn}(x) = \sum_{\substack{i_1,\dots,i_r \in [q] \\ j_1,\dots,j_r \in [q]}} \Lambda\big[(i_1,\dots,i_r),(j_1,\dots,j_r)\big]\, b_{(i_1,\dots,i_r)}\, b_{(j_1,\dots,j_r)}^\top = \sum_{\substack{i_1,\dots,i_r \in [q] \\ j_1,\dots,j_r \in [q]}} \Lambda\big[(i_1,\dots,i_r),(j_1,\dots,j_r)\big]\, \big[a_{i_1}^{\otimes n} \otimes \cdots \otimes a_{i_r}^{\otimes n}\big]\big[a_{j_1}^{\otimes n} \otimes \cdots \otimes a_{j_r}^{\otimes n}\big]^\top,$$
where the relation $b_{(i_1,\dots,i_r)} = a_{i_1}^{\otimes n} \otimes \cdots \otimes a_{i_r}^{\otimes n}$, $i_1, \dots, i_r \in [q]$, is used in the last equality. Let $m^{(n)}_{2rn}(x) \in \mathbb{R}^{p^{2rn}}$ denote the vectorized form of the $(2rn)$-th order moment $M^{(n)}_{2rn}(x) \in \mathbb{R}^{p^{rn} \times p^{rn}}$. Therefore, we have
$$m^{(n)}_{2rn}(x) := \operatorname{vec}\Big( M^{(n)}_{2rn}(x) \Big) = \sum_{\substack{i_1,\dots,i_r \in [q] \\ j_1,\dots,j_r \in [q]}} \Lambda\big[(i_1,\dots,i_r),(j_1,\dots,j_r)\big]\, a_{i_1}^{\otimes n} \otimes \cdots \otimes a_{i_r}^{\otimes n} \otimes a_{j_1}^{\otimes n} \otimes \cdots \otimes a_{j_r}^{\otimes n}.$$
Then, we have the following equivalent tensor form for the original model proposed in equation (8):
$$T^{(n)}_{2rn}(x) := \operatorname{ten}\Big( m^{(n)}_{2rn}(x) \Big) = \sum_{\substack{i_1,\dots,i_r \in [q] \\ j_1,\dots,j_r \in [q]}} \Lambda\big[(i_1,\dots,i_r),(j_1,\dots,j_r)\big]\, a_{i_1}^{\circ n} \circ \cdots \circ a_{i_r}^{\circ n} \circ a_{j_1}^{\circ n} \circ \cdots \circ a_{j_r}^{\circ n}.$$
✷

A.3 Sufficient matching properties for satisfying rank and graph expansion conditions

In the following lemma, it is shown that under a perfect $n$-gram matching and additional genericity and krank conditions, the rank and graph expansion conditions 6 and 7 on $A^{\odot n}$ are satisfied.

Lemma 5. Assume that the bipartite graph $G(V_h, V_o; A)$ has a perfect $n$-gram matching (condition 2 is satisfied). Then, the following results hold for the $n$-gram matrix $A^{\odot n}$: 1) If $A$ is generic, $A^{\odot n}$ is full column rank (condition 6) with Lebesgue measure one (almost surely). 2) If krank condition 3 holds, $A^{\odot n}$ satisfies the proposed expansion property in condition 7.

Proof: Let $M$ denote the perfect $n$-gram matching of the bipartite graph $G(V_h, V_o; A)$. From Lemma 3, there exists a perfect matching $M^{\odot n}$ for the bipartite graph $G(V_h, V_o^{(n)}; A^{\odot n})$. Denote by $A_M$ the bi-adjacency matrix corresponding to the edge set $M$; similarly, $B_M$ denotes the bi-adjacency matrix corresponding to the edge set $M^{\odot n}$. Note that $\operatorname{Supp}(A_M) \subseteq \operatorname{Supp}(A)$ and $\operatorname{Supp}(B_M) \subseteq \operatorname{Supp}(A^{\odot n})$. Since $B_M$ is a perfect matching, it contains $q := |V_h|$ rows, each of which has exactly one non-zero entry, and furthermore, these non-zero entries lie in $q$ different columns. Therefore, these rows form $q$ linearly independent vectors. Since the row rank and column rank of a matrix are equal, and the number of columns of $B_M$ is $q$, the column rank of $B_M$ is $q$; in other words, $B_M$ is full column rank.
Since $A$ is generic, from Lemma 6 (with a slight modification in the analysis¹⁵), $A^{\odot n}$ is also full column rank with Lebesgue measure one (almost surely). This completes the proof of part 1.

Next, the second part is proved. From the definition of krank, we have
$$|N_A(S')| \ge |S'| \quad \text{for } S' \subseteq V_h,\ |S'| \le \operatorname{krank}(A),$$
which follows from the fact that the submatrix of $A$ corresponding to $S'$ must be full column rank. From this inequality, we have
$$|N_A(S')| \ge \operatorname{krank}(A) \quad \text{for } S' \subseteq V_h,\ |S'| = \operatorname{krank}(A). \tag{35}$$
Then, we have
$$|N_A(S)| \ge |N_A(S')| \ge \operatorname{krank}(A) \ge d_{\max}(A)^n \quad \text{for } S' \subset S \subseteq V_h,\ |S| > \operatorname{krank}(A),\ |S'| = \operatorname{krank}(A), \tag{36}$$
where (35) is used in the second inequality, and the last inequality follows from krank condition 3. In the restricted $n$-gram matrix $A^{\odot n}_{\mathrm{Rest.}}$, the number of neighbors of a set $S \subseteq V_h$ with $|S| > \operatorname{krank}(A)$ can be bounded as
$$\Big| N_{A^{\odot n}_{\mathrm{Rest.}}}(S) \Big| \ge |N_A(S)| + |S| \ge d_{\max}(A)^n + |S| \quad \text{for } |S| > \operatorname{krank}(A),$$
where the first inequality is due to the fact that the set $N_{A^{\odot n}_{\mathrm{Rest.}}}(S)$ consists of rows indexed by the following two subsets: $n$-tuples $(i, i, \dots, i)$ in which all the indices are equal, and $n$-tuples $(i_1, \dots, i_n)$ with distinct indices, i.e., $i_1 \neq i_2 \neq \cdots \neq i_n$. The former subset is exactly $N_A(S)$, while the size of the latter subset is at least $|S|$ due to the existence of a perfect $n$-gram matching in $A$. The bound (36) is used in the second inequality. Since $d_{\max}\big(A^{\odot n}\big) = d_{\max}(A)^n$, the proof of part 2 is also complete. ✷

¹⁵ The result of Lemma 6 concerns the column rank of $A$ itself, but here it is applied to the column rank of $A^{\odot n}$, for which the same analysis works. Note that the support of $B_M$ (which is full column rank here) is contained in the support of $A^{\odot n}$, and therefore Lemma 6 can still be applied.

Remark 13.
The second result of the above lemma is similar to the necessity argument of (Hall's) Theorem 6 for the existence of a perfect matching in a bipartite graph, but generalized to the case of a perfect $n$-gram matching and with the additional krank condition.

A.4 (Auxiliary) lemmata

Proof of Lemma 3: We show that if $G(Y, X; A)$ has a perfect $n$-gram matching, then $G(Y, X^{(n)}; A^{\odot n})$ has a perfect matching. The reverse direction can be shown immediately by reversing the argument and exploiting the additional condition stated in the lemma. Let $E^{\odot n}$ denote the edge set of the bipartite graph $G(Y, X^{(n)}; A^{\odot n})$. Assume $G(Y, X; A)$ has a perfect $n$-gram matching $M \subseteq E$. For any $j \in Y$, let $N_M(j)$ denote the set of neighbors of vertex $j$ according to the edge set $M$. Since $M$ is a perfect $n$-gram matching, $|N_M(j)| = n$ for all $j \in Y$. It follows immediately from Definition 3 that the sets $N_M(j)$ are all distinct, i.e., $N_M(j_1) \neq N_M(j_2)$ for any $j_1, j_2 \in Y$, $j_1 \neq j_2$. For any $j \in Y$, let $N'_M(j)$ denote an arbitrary ordered $n$-tuple generated from the elements of the set $N_M(j)$. From the definition of the $n$-gram matrix, we have $A^{\odot n}(N'_M(j), j) \neq 0$ for all $j \in Y$. Hence, $(j, N'_M(j)) \in E^{\odot n}$ for all $j \in Y$, which, together with the fact that all the tuples $N'_M(j)$ are distinct, implies that $M^{\odot n} := \{(j, N'_M(j)) \mid j \in Y\} \subseteq E^{\odot n}$ is a perfect matching for $G(Y, X^{(n)}; A^{\odot n})$. ✷

Lemma 6. Consider a generic matrix $C \in \mathbb{R}^{m \times r}$. Let $\widetilde{C} \in \mathbb{R}^{m \times r}$ be such that $\operatorname{Supp}(\widetilde{C}) \subseteq \operatorname{Supp}(C)$ and the non-zero entries of $\widetilde{C}$ are the same as the corresponding non-zero entries of $C$. If $\widetilde{C}$ is full column rank, then $C$ is also full column rank, almost surely.

Proof: Since $\widetilde{C}$ is full column rank, there exists an $r \times r$ submatrix of $\widetilde{C}$, denoted by $\widetilde{C}_S$, with non-zero determinant, i.e., $\det(\widetilde{C}_S) \neq 0$.
Let $C_S$ denote the submatrix of $C$ indexed by the same rows and columns as $\widetilde{C}_S$. The determinant of $C_S$ is a polynomial in the entries of $C_S$. Since $\widetilde{C}_S$ can be derived from $C_S$ by keeping the corresponding non-zero entries, $\det(C_S)$ can be decomposed into two terms as
$$\det(C_S) = \det(\widetilde{C}_S) + f(C_S),$$
where the first term corresponds to the monomials in which all the variables (entries of $C_S$) are also in $\widetilde{C}_S$, and the second term corresponds to the monomials in which at least one variable is not in $\widetilde{C}_S$. The first term is non-zero, as stated earlier. Since $C$ is generic, the polynomial $f(C_S)$ is non-trivial, and therefore its roots have Lebesgue measure zero. It follows that $\det(C_S) \neq 0$ with Lebesgue measure one (almost surely), and hence $C_S$ is full (column) rank. Thus, $C$ is also full column rank, almost surely. ✷

Finally, Theorem 1 is proved by combining the results of Theorem 5 and Lemma 5.

Proof of Theorem 1: Since conditions 2 and 3 hold and $A$ is generic, Lemma 5 can be applied, which implies that rank condition 6 is satisfied almost surely and that expansion condition 7 also holds. Therefore, all the conditions required for Theorem 5 are satisfied almost surely, and this completes the proof. ✷

B Proof of Random Identifiability Result (Theorem 2)

We provide a detailed proof of the steps stated in the proof sketch of the random result in Section 5.2.

B.1 Proof of existence of perfect n-gram matching and Kruskal results

Proof of Theorem 3: The vertex sets $X$ and $Y$ are partitioned as follows (see Figure 6). Define $J := \frac{cp}{n}$. Partition the set $X$ uniformly at random into $n$ sets of (almost) equal size¹⁶, denoted by $X'_l$, $l \in [n]$. Define the sets $X_l := \cup_{i=1}^{l} X'_i$, $l \in [n]$. Furthermore, partition the set $Y$ uniformly at random, hierarchically, as follows.
First, partition it into $J$ sets, each of size at most $\big(\frac{cp}{n}\big)^{n-1}$, and denote them by $Y_i$, $i \in [J]$. Next, partition each of these new smaller sets $Y_i$ further into $J$ sets, each of size at most $\big(\frac{cp}{n}\big)^{n-2}$. Iterate this for up to $n-1$ steps, at the end of which the set $Y$ is partitioned into sets of size at most $\frac{cp}{n}$. The first two steps are shown in Figure 6.

¹⁶ By almost, we mean that the maximum difference in the size of the partitions is 1, which is always possible.

Figure 6: Partitioning of the sets $Y$ and $X$ proposed in the proof of Theorem 3. Set $X$ is randomly (uniformly) partitioned into $n$ sets of (almost) equal size, denoted by $X'_l$, $l \in [n]$. Set $Y$ is also randomly partitioned, in a recursive manner: in each step, it is partitioned into $J = \frac{cp}{n} = O(p)$ sets, and these smaller sets are again partitioned, recursively, until sets of size $O(p)$ are reached. The first two steps are shown in this figure.

Proof by induction: The existence of a perfect $n$-gram matching from the set $Y$ to the set $X$ is proved by an induction argument. Consider one of the intermediate sets in the hierarchical partitioning of $Y$, of size $O(p^l)$, and its further partitioning into $J := \frac{cp}{n}$ sets, each of size $O(p^{l-1})$, for any $l \in \{2, \dots, n\}$. In the induction step, it is shown that if there exists a perfect $(l-1)$-gram matching from each of these subsets of $Y$ of size $O(p^{l-1})$ to $X_{l-1}$, then there exists a perfect $l$-gram matching from the original set of size $O(p^l)$ to the set $X_l$. Specifically, in the last induction step, it is shown that if there exists a perfect $(n-1)$-gram matching from each set $Y_l$, $l \in [J]$, to the set $X_{n-1}$, then there exists a perfect $n$-gram matching from $Y$ to $X_n = X$.
Base case: The base case of the induction argument holds as follows. By applying Lemma 8 and Lemma 7, there exists a perfect matching from each partition in $Y$ of size at most $\frac{cp}{n} = O(p)$ to the set $X_1$, whp.

Induction step: Consider $J$ different bipartite graphs $G_i(Y_i, X_{n-1}; E_i)$, $i \in [J]$, obtained by considering the sets $Y_i$ and $X_{n-1}$ and the corresponding subset of edges $E_i \subset E$ incident to them; see Figure 7a. The induction step is to show that if each of the corresponding $J$ bipartite graphs $G_i(Y_i, X_{n-1}; E_i)$, $i \in [J]$, has a perfect $(n-1)$-gram matching, then, whp, the original bipartite graph $G(Y, X; E)$ has a perfect $n$-gram matching.

Denote the corresponding perfect $(n-1)$-gram matching of $G_i(Y_i, X_{n-1}; E_i)$ by $M_i$. Furthermore, the set of all subsets of $X_{n-1}$ with cardinality $n-1$ is denoted by $P_{n-1}(X_{n-1})$, i.e., $P_{n-1}(X_{n-1})$ consists of the sets with $n-1$ elements in the power set¹⁷ of $X_{n-1}$. For each set $S \in P_{n-1}(X_{n-1})$, take the set of all nodes in $Y$ that are connected to all members of $S$ according to the union of matchings $\cup_{i=1}^{J} M_i$; call this set the parents of $S$, denoted by $\mathrm{Pa}(S)$. According to the definition of a perfect $(n-1)$-gram matching, there is at most one node in each set $Y_i$ that is connected to all members of $S$ through the matching $M_i$, and therefore $|\mathrm{Pa}(S)| \le J = \frac{cp}{n}$. In addition, note that the sets $\mathrm{Pa}(S)$ impose a partitioning on the set $Y$, i.e., each node $j \in Y$ is included in exactly one set $\mathrm{Pa}(S)$ for some $S \in P_{n-1}(X_{n-1})$. This is because of the perfect $(n-1)$-gram matchings considered for the sets $Y_i$, $i \in [J]$.

Now, a perfect $n$-gram matching for the original bipartite graph is constructed as follows. For any $S \in P_{n-1}(X_{n-1})$, consider the set of parents $\mathrm{Pa}(S)$. Create the bipartite graph $G_S(\mathrm{Pa}(S), X'_n; E_S)$, where $E_S \subset E$ is the subset of edges incident to the partitions $\mathrm{Pa}(S) \subset Y$ and $X'_n \subset X$. Denote by $d_S$ the minimum degree of the nodes in the set $\mathrm{Pa}(S)$ in the bipartite graph $G_S(\mathrm{Pa}(S), X'_n; E_S)$. Applying Lemma 8, we have
$$\Pr\big[ d_S \ge 1 + \beta \log(p/n) \big] \ge 1 - J \exp\Big( -\frac{2}{n^2} \cdot \frac{\big( d_{\min} - \beta n \log(p/n) \big)^2}{d_{\min}} \Big) \tag{37}$$
$$\ge 1 - \frac{c}{n}\, p^{-\beta \log 1/c} = 1 - O\big( p^{-\beta \log 1/c} \big),$$
where $\beta \log \frac{1}{c} > n - 1$, and the last inequality follows from the degree bound $d_{\min} \ge \alpha \log p$. Furthermore, we have $|\mathrm{Pa}(S)| \le \frac{cp}{n} = c\,|X'_n|$.

¹⁷ The power set of any set $S$ is the set of all subsets of $S$.

Figure 7: Auxiliary figures for the proof of the induction step. (a) Partitioning of the sets $Y$ and $X$ proposed in the proof: the set $Y$ is partitioned into $J := \frac{cp}{n}$ partitions $Y_1, \dots, Y_J$ of (almost) equal size, for some constant $c < 1$, and the set $X$ is partitioned into two parts $X_{n-1}$ and $X'_n$ with sizes $|X_{n-1}| = \frac{n-1}{n} p$ and $|X'_n| = \frac{p}{n}$. The perfect $(n-1)$-gram matchings $M_i$, $i \in [J]$, in the bipartite graphs $G_i(Y_i, X_{n-1}; E_i)$, $i \in [J]$, are also highlighted. (b) The set $Y$ is partitioned into the subsets $\mathrm{Pa}(S)$, $S \in P_{n-1}(X_{n-1})$, generated through the perfect $(n-1)$-gram matchings $M_i$, $i \in [J]$; $S_1$, $S_2$ and $S_3$ are three different sample sets in $P_{n-1}(X_{n-1})$. The perfect matchings from $\mathrm{Pa}(S)$, $S \in P_{n-1}(X_{n-1})$, to $X'_n$, proposed in the proof, are also highlighted.
Now, we can apply Lemma 7, concluding that there exists a perfect matching from $\mathrm{Pa}(S)$ to $X'_n$ within the bipartite graph $G_S(\mathrm{Pa}(S), X'_n; E_S)$, with probability at least $1 - O(p^{-\beta \log 1/c})$; see Figure 7b for a schematic picture. The edges of this perfect matching are combined with the corresponding edges of the existing perfect $(n-1)$-gram matchings $M_i$, $i \in [J]$, to provide $n$ incident edges to each node $i \in \mathrm{Pa}(S)$. It is easy to see that this provides a perfect $n$-gram matching from $\mathrm{Pa}(S)$ to $X$. We perform the same steps for all sets $S \in P_{n-1}(X_{n-1})$ to obtain a perfect $n$-gram matching from each $\mathrm{Pa}(S)$, $S \in P_{n-1}(X_{n-1})$, to $X$. Finally, by this construction, the union of all of these matchings is a perfect $n$-gram matching from $\cup_{S \in P_{n-1}(X_{n-1})} \mathrm{Pa}(S) = Y$ to $X$. This finishes the proof of the induction step. Note that we analyzed the last induction step, in which the existence of a perfect $n$-gram matching is concluded from the existence of the corresponding perfect $(n-1)$-gram matchings; the earlier induction steps, in which the existence of a perfect $l$-gram matching is concluded from the existence of the corresponding perfect $(l-1)$-gram matchings for any $l \in \{2, \dots, n\}$, can be proven similarly.

Probability rate: We now provide the probability rate of the above events. Let $N^{(\mathrm{hp})}_l$, $l \in [n]$, denote the total number of times the perfect matching result of Lemma 7 is used in step $l$ in order to ensure that, whp, there exists a perfect $l$-gram matching from the corresponding partitions of $Y$ to the set $X_l$. Let $N^{(\mathrm{hp})} = \sum_{l \in [n]} N^{(\mathrm{hp})}_l$. As before, let $P_{l-1}(X_{l-1})$ denote the set of all subsets of $X_{l-1}$ with cardinality $l-1$. We have
$$\big| P_{l-1}(X_{l-1}) \big| = \binom{|X_{l-1}|}{l-1} = \binom{\frac{l-1}{n} p}{l-1}, \quad l \in \{2, \dots, n\}.$$
According to the construction of an $l$-gram matching from $(l-1)$-gram matchings proposed in the induction step, $|P_{l-1}(X_{l-1})|$ is the number of times Lemma 7 is used in order to ensure that there exists a perfect $l$-gram matching for each partition on the $Y$ side. Since at most $J^{n-l}$ such $l$-gram matchings are proposed in step $l$, the number $N^{(\mathrm{hp})}_l$ can be bounded as
$$N^{(\mathrm{hp})}_l \le J^{n-l} \big| P_{l-1}(X_{l-1}) \big| = J^{n-l} \binom{\frac{l-1}{n} p}{l-1}, \quad l \in \{2, \dots, n\}. \tag{38}$$
Since $N^{(\mathrm{hp})}_1 = J^{n-1}$ perfect matchings need to exist in the first step, we have
$$N^{(\mathrm{hp})} = J^{n-1} + \sum_{l=2}^{n} N^{(\mathrm{hp})}_l \le J^{n-1} + \sum_{l=2}^{n} J^{n-l} \binom{\frac{l-1}{n} p}{l-1} \le \Big( \frac{cp}{n} \Big)^{n-1} + \sum_{l=2}^{n} \Big( \frac{cp}{n} \Big)^{n-l} \Big( \frac{ep}{n} \Big)^{l-1} \le n \Big( \frac{ep}{n} \Big)^{n-1} = O\big( p^{n-1} \big),$$
where inequality (38) is used in the first inequality, and $J := \frac{cp}{n}$ together with the inequality $\binom{n}{k} \le \big( \frac{en}{k} \big)^k$ is exploited in the second inequality. Since the result of Lemma 7 holds with probability at least $1 - O(p^{-\beta \log 1/c})$ and it is assumed that $\beta \log \frac{1}{c} > n - 1$, applying the union bound yields the existence of a perfect $n$-gram matching with probability at least $1 - O(p^{-\beta'})$, for $\beta' = \beta \log \frac{1}{c} - (n-1) > 0$. Furthermore, note that the degree concentration bound in (37) is also used $O(p^{n-1})$ times. Since the bound in (37) holds with probability at least $1 - O(p^{-\beta \log 1/c})$ and it is assumed that $\beta \log \frac{1}{c} > n - 1$, this also reduces to the same probability rate. The coefficient of the above polynomial probability rate can also be computed explicitly: the perfect $n$-gram matching exists with probability at least $1 - \gamma_1 p^{-\beta'}$, with
$$\gamma_1 = e^{n-1} \Big( \Big( \frac{c}{n} \Big)^{n-1} + \frac{e^2}{1 - \delta_1}\, n^{\beta'+1} \Big),$$
where $\delta_1$ is a constant satisfying $e^2 \big( \frac{p}{n} \big)^{-\beta \log 1/c} < \delta_1 < 1$.
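The combinatorial engine used throughout this proof is the existence of perfect matchings, certified by Hall's condition (Theorem 6). On small graphs, both sides of Hall's theorem can be checked directly; the following is a toy sketch (the graph and all function names are ours, not from the paper), computing a maximum bipartite matching by augmenting paths and brute-forcing Hall's condition:

```python
from itertools import combinations

def max_matching(adj):
    """Maximum bipartite matching via augmenting paths (Kuhn's algorithm).
    adj[j] is the set of right-nodes adjacent to left-node j."""
    match_right = {}  # right node -> matched left node

    def augment(j, seen):
        for x in adj[j]:
            if x not in seen:
                seen.add(x)
                if x not in match_right or augment(match_right[x], seen):
                    match_right[x] = j
                    return True
        return False

    return sum(augment(j, set()) for j in range(len(adj)))

def hall_condition(adj):
    """Brute-force Hall's condition: |N(S)| >= |S| for every nonempty S."""
    left = range(len(adj))
    return all(
        len(set().union(*(adj[j] for j in S))) >= len(S)
        for k in range(1, len(adj) + 1)
        for S in combinations(left, k)
    )

# Toy graph: 3 left nodes, right nodes {0, 1, 2, 3}.
adj = [{0, 1}, {1, 2}, {0, 3}]
has_perfect = max_matching(adj) == len(adj)
assert has_perfect == hall_condition(adj)  # the two criteria agree (Hall's theorem)
assert has_perfect
```

The brute-force Hall check is exponential in the number of left nodes, so it is only an illustration; the augmenting-path routine scales polynomially and is what one would use in practice.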
✷

Proof of Theorem 4: Let $G(Y, X; A)$ denote the bipartite graph corresponding to the matrix $A$, where the node sets $Y = [q]$ and $X = [p]$ index the columns and rows of $A$, respectively; hence $|Y| = q$ and $|X| = p$. Fix some $S \subseteq Y$ such that $|S| \le p$. Then
$$\Pr\big( |N(S)| \le |S| \big) \le \sum_{T \subseteq X : |T| = |S|} \Pr\big( N(S) \subseteq T \big) = \sum_{T \subseteq X : |T| = |S|} \prod_{i \in S} \binom{|S|}{d_i} \Big/ \binom{p}{d_i} \le \sum_{T \subseteq X : |T| = |S|} \prod_{i \in S} \Big( \frac{|S|}{p} \Big)^{d_i} \le \sum_{T \subseteq X : |T| = |S|} \prod_{i \in S} \Big( \frac{|S|}{p} \Big)^{d_{\min}} = \binom{p}{|S|} \Big( \frac{|S|}{p} \Big)^{d_{\min} |S|}, \tag{39}$$
where the bound $\binom{|S|}{d_i} / \binom{p}{d_i} \le \big( \frac{|S|}{p} \big)^{d_i}$ is used in the second inequality, and the last inequality follows from the fact that $\frac{|S|}{p} \le 1$. Let $E$ denote the event that for any subset $S \subseteq Y$ with $|S| \le r$ we have $|N(S)| \ge |S|$, i.e.,
$$E := \text{``} \forall S \subseteq Y \text{ with } 1 \le |S| \le r : |N(S)| \ge |S| \text{''}.$$
Then, by the union bound and inequality (39), we have
$$\Pr(E^c) = \Pr\big( \exists S \subseteq Y \text{ s.t. } 1 \le |S| \le r \text{ and } |N(S)| < |S| \big) \le \sum_{s=1}^{r} \binom{q}{s} \binom{p}{s} \Big( \frac{s}{p} \Big)^{d_{\min} s} \le \sum_{s=1}^{r} \Big( \frac{eq}{s} \Big)^s \Big( \frac{ep}{s} \Big)^s \Big( \frac{s}{p} \Big)^{d_{\min} s} \le \sum_{s=1}^{r} \Big( \frac{e^2 q\, r^{d_{\min}-2}}{p^{d_{\min}-1}} \Big)^s,$$
where the bound $\binom{n}{k} \le \big( \frac{en}{k} \big)^k$ is used in the second inequality. For $r = cp$, the above inequality reduces to
$$\Pr(E^c) \le \sum_{s=1}^{r} \Big( e^2 c^{d_{\min}-2} \frac{q}{p} \Big)^s \le \sum_{s=1}^{r} \Big( e^2 c' c^{d_{\min}-1} p^{n-1} \Big)^s \le \sum_{s=1}^{r} \Big( e^2 c' c^{\beta \log p} p^{n-1} \Big)^s = \sum_{s=1}^{r} \Big( e^2 c' p^{n-1-\beta \log 1/c} \Big)^s \le \frac{e^2 c'}{p^{\beta'} - e^2 c'} = O\big( p^{-\beta'} \big), \quad \text{for } \beta' = \beta \log \tfrac{1}{c} - (n-1) > 0,$$
where the size condition assumed in the theorem is used in the second inequality with $c' := \frac{1}{c} \big( \frac{c}{n} \big)^n$, and the degree condition is exploited in the third inequality. The last inequality follows from the geometric series sum formula, for large enough $p$.
Then, Lemma 9 can be applied, concluding that $\operatorname{krank}(A) \ge r = cp$ with probability at least $1 - \gamma_2 p^{-\beta'}$, for constants $\beta' = \beta \log \frac{1}{c} - (n-1) > 0$ and $\gamma_2 > 0$ given by
$$\gamma_2 = \frac{c^{n-1} e^2}{n^n (1 - \delta_2)},$$
where $\delta_2$ is a constant satisfying $c' e^2 p^{-\beta'} < \delta_2 < 1$. ✷

Proof of Remark 10: Consider a random bipartite graph $G(Y, X; E)$ where, for each node $i \in Y$:

1. the neighbors $N(i) \subseteq X$ are picked uniformly at random among all size-$d$ subsets of $X$;
2. the matching $M(i) \subseteq N(i)$ is picked uniformly at random among all size-$n$ subsets of $N(i)$.

Note that as long as $n \le d$, the distribution of $M(i)$ is uniform over all size-$n$ subsets of $X$. Fix some pair $i, i' \in Y$. Then
$$\Pr\big( M(i) = M(i') \big) = \binom{|X|}{n}^{-1}.$$
By the union bound,
$$\Pr\big( \exists\, i, i' \in Y,\ i \neq i' \text{ s.t. } M(i) = M(i') \big) \le \binom{|Y|}{2} \binom{|X|}{n}^{-1},$$
which is $\Theta(|Y|^2 / |X|^n)$ when $n$ is constant. Therefore, if $d \ge n$ and the size constraint $|Y| = O(|X|^s)$ for some $s < \frac{n}{2}$ is satisfied, then, whp, there is no pair of nodes in the set $Y$ with the same random $n$-gram matching. This implies that the random bipartite graph has a perfect $n$-gram matching whp, under these size and degree conditions. ✷

B.2 (Auxiliary) lemmata

Lemma 7 (Existence of a perfect matching in random bipartite graphs). Consider a random bipartite graph $G(W, Z; E)$ with $|W| = w$ nodes on the left side and $|Z| = z$ nodes on the right side, where each node $i \in W$ is randomly connected to $d_i$ different nodes in the set $Z$. Let $d_w := \min_{i \in W} d_i$. Assume that the size condition $w \le cz$ holds for some constant $0 < c < 1$, and the degree condition $d_w \ge 1 + \beta \log z$ holds for some constant $\beta > 0$. Then, there exists a perfect matching in the random bipartite graph $G(W, Z; E)$ with probability at least $1 - O(z^{-\beta \log 1/c})$, where $\beta \log \frac{1}{c} > 0$.
Proof: By Hall's theorem (Theorem 6), the existence of a perfect matching in a bipartite graph is equivalent to the occurrence of the event
$$\widetilde{E} := \text{``} \forall S \subseteq W : |N(S)| \ge |S| \text{''}.$$
Similar to the analysis in the proof of Theorem 4, it follows from the union bound that
$$\Pr\big( \widetilde{E}^c \big) = \Pr\big( \exists S \subseteq W \text{ s.t. } |N(S)| < |S| \big) \le \sum_{s=1}^{w} \binom{w}{s} \binom{z}{s} \Big( \frac{s}{z} \Big)^{d_w s} \le \sum_{s=1}^{w} \Big( \frac{ew}{s} \Big)^s \Big( \frac{ez}{s} \Big)^s \Big( \frac{s}{z} \Big)^{d_w s} \le \sum_{s=1}^{w} \Big( \frac{e^2 w^{d_w - 1}}{z^{d_w - 1}} \Big)^s \le \sum_{s=1}^{w} \big( e^2 c^{d_w - 1} \big)^s,$$
where the bound $\binom{n}{k} \le \big( \frac{en}{k} \big)^k$ is used in the second inequality. From the assumed lower bound on the degree $d_w$ and the fact that $0 < c < 1$, we have
$$\Pr\big( \widetilde{E}^c \big) \le \sum_{s=1}^{w} \big( e^2 c^{\beta \log z} \big)^s = \sum_{s=1}^{w} \big( e^2 z^{\beta \log c} \big)^s \le \frac{e^2}{z^{\beta \log \frac{1}{c}} - e^2} \le \frac{e^2}{1 - \delta_1}\, z^{-\beta \log 1/c},$$
where the second inequality follows from the geometric series sum formula for large enough $z$, and $\delta_1$ is a constant satisfying $e^2 z^{-\beta \log 1/c} < \delta_1 < 1$. ✷

Lemma 8 (Degree concentration bound). Consider a random bipartite graph $G(Y, X; E)$ with $|Y| = q$ and $|X| = p$, where each node $i \in Y$ is randomly connected to $d_i$ different nodes in the set $X$. Let $Y' \subset Y$ be any subset¹⁸ of nodes in $Y$ with size $|Y'| = q'$, and let $X' \subset X$ be a random (uniformly chosen) subset of nodes in $X$ with size $|X'| = p'$. Create the new bipartite graph $G(Y', X'; E')$, where the edge set $E' \subset E$ is the subset of edges in $E$ incident to $Y'$ and $X'$. Denote the degree of each node $i \in Y'$ within this new bipartite graph by $d'_i$. Let $d_{\min} := \min_{i \in Y} d_i$ and $d'_{\min} := \min_{i \in Y'} d'_i$. Then, if $d_{\min} > \frac{p}{p'} r$ for a non-negative integer $r$, we have
$$\Pr\big[ d'_{\min} \ge r + 1 \big] \ge 1 - q' \exp\Big( -\frac{2 (p'/p)^2 \big( d_{\min} - (p/p')\, r \big)^2}{d_{\min}} \Big).$$

¹⁸ Note that $Y'$ need not be uniformly chosen; the result is valid for any subset of nodes $Y' \subset Y$.

Proof: For any $i \in Y'$, we have
$$\Pr\big[ d'_i \le r \big] = \sum_{j=0}^{r} \binom{p'}{j} \binom{p - p'}{d_i - j} \Big/ \binom{p}{d_i},$$
where the inner term of the summation is a hypergeometric distribution with parameters $p$ (population size), $p'$ (number of success states in the population) and $d_i$ (number of draws), and $j$ is the hypergeometric random variable denoting the number of successes. The following tail bound for the hypergeometric distribution is provided in [45, 46]:
$$\Pr\big[ d'_i \le r \big] \le \exp\big( -2 t_i^2 d_i \big), \quad \text{for } t_i > 0 \text{ given by } r = \Big( \frac{p'}{p} - t_i \Big) d_i.$$
Note that the assumption $d_{\min} > \frac{p}{p'} r$ in the lemma is equivalent to having $t_i > 0$, $i \in Y$. Considering the minimum degree, for any $i \in Y'$ we have
$$\Pr\big[ d'_i \le r \big] \le \exp\big( -2 t^2 d_{\min} \big), \quad \text{for } t > 0 \text{ given by } r = \Big( \frac{p'}{p} - t \Big) d_{\min}.$$
Substituting $t$ from this equation gives the bound
$$\Pr\big[ d'_i \le r \big] \le \exp\Big( -\frac{2 (p'/p)^2 \big( d_{\min} - (p/p')\, r \big)^2}{d_{\min}} \Big). \tag{40}$$
Finally, applying the union bound, we prove the result as follows:
$$\Pr\big[ d'_{\min} \ge r + 1 \big] = \Pr\Big[ \bigcap_{i=1}^{q'} \{ d'_i \ge r + 1 \} \Big] \ge 1 - \sum_{i=1}^{q'} \Pr\big[ d'_i \le r \big] \ge 1 - \sum_{i=1}^{q'} \exp\Big( -\frac{2 (p'/p)^2 \big( d_{\min} - (p/p')\, r \big)^2}{d_{\min}} \Big) = 1 - q' \exp\Big( -\frac{2 (p'/p)^2 \big( d_{\min} - (p/p')\, r \big)^2}{d_{\min}} \Big),$$
where the union bound is applied in the first inequality, and the second inequality follows from (40). ✷

A lower bound on the Kruskal rank of the matrix $A$, based on a sufficient relaxed expansion property of $A$, is provided in the following lemma.

Lemma 9. If $A$ is generic and the bipartite graph $G(Y, X; A)$ satisfies the relaxed¹⁹ expansion property $|N(S)| \ge |S|$ for any subset $S \subseteq Y$ with $|S| \le r$, then $\operatorname{krank}(A) \ge r$, almost surely.

¹⁹ In contrast to the expansion property proposed in condition 7, there is no $d_{\max}$ term.

Before presenting the proof, we state the marriage (Hall's) theorem, which gives an equivalent condition for the existence of a perfect matching in a bipartite graph.

Theorem 6 (Hall's theorem, [47]).
A bipartite graph $G(Y, X; E)$ has a $Y$-saturating matching if and only if for every subset $S \subseteq Y$, the set of neighbors of $S$ is at least as large as $S$, i.e., $|N(S)| \ge |S|$.

Proof of Lemma 9: Denote the submatrix $A_{N(S),S}$ by $\widetilde{A}_S$, i.e., $\widetilde{A}_S := A_{N(S),S}$. Exploiting the marriage (Hall's) theorem, it follows that the bipartite graph $G(S, N(S); \widetilde{A}_S)$ has a perfect matching $M_S$ for any subset $S \subseteq Y$ such that $|S| \le r$. Denote by $\widetilde{A}_{M_S}$ the matrix corresponding to this perfect matching edge set $M_S$, i.e., $\widetilde{A}_{M_S}$ keeps the non-zero entries of $\widetilde{A}_S$ on the edge set $M_S$ and is zero everywhere else. Note that the support of $\widetilde{A}_{M_S}$ is contained in the support of $\widetilde{A}_S$. By the definition of a perfect matching, the matrix $\widetilde{A}_{M_S}$ is full column rank. From Lemma 6, it follows that $\widetilde{A}_S$ is also full column rank almost surely. This holds for any $\widetilde{A}_S$ with $S \subseteq Y$ and $|S| \le r$, which directly implies that $\operatorname{krank}(A) \ge r$, almost surely. ✷

Finally, Theorem 2 is proved by exploiting the random results on the existence of a perfect $n$-gram matching and the Kruskal rank, provided in Theorems 3 and 4.

Proof of Theorem 2: We claim that if random conditions 4 and 5 are satisfied, then deterministic conditions 2 and 3 hold whp; Theorem 1 can then be applied, and the proof is complete. From the size and degree conditions, Theorem 3 can be applied, which implies that the perfect $n$-gram matching condition 2 is satisfied with probability at least $1 - \gamma_1 p^{-\beta'}$ for $\beta' = \beta \log \frac{1}{c} - (n-1) > 0$. The conditions required for Theorem 4 also hold, and applying this theorem gives the bound $\operatorname{krank}(A) \ge cp$ with probability at least $1 - \gamma_2 p^{-\beta'}$.
Combining this inequality with the upper bound on the degree $d$ in condition 5, we conclude that krank condition 3 is also satisfied whp. Hence, all the conditions required for Theorem 1 are satisfied with probability at least $1 - \gamma p^{-\beta'}$, where
$$\gamma = \gamma_1 + \gamma_2 = e^{n-1} \Big( \Big( \frac{c}{n} \Big)^{n-1} + \frac{e^2}{1 - \delta_1}\, n^{\beta'+1} \Big) + \frac{c^{n-1} e^2}{n^n (1 - \delta_2)},$$
and this completes the proof. ✷

C Relationship to CP Decomposition Uniqueness Results

In this section, we provide a more detailed comparison with some uniqueness results for overcomplete CP decomposition. Here, the following CP decomposition of the third order tensor $T \in \mathbb{R}^{p \times s \times q}$ is considered:
$$T = \sum_{i=1}^{r} a_i \circ b_i \circ c_i, \tag{41}$$
where $A = [a_1 | \cdots | a_r] \in \mathbb{R}^{p \times r}$, $B = [b_1 | \cdots | b_r] \in \mathbb{R}^{s \times r}$ and $C = [c_1 | \cdots | c_r] \in \mathbb{R}^{q \times r}$. The most important and general uniqueness result for CP, known as Kruskal's condition, is provided in [15], where it is guaranteed that the above CP decomposition is unique if
$$\operatorname{krank}(A) + \operatorname{krank}(B) + \operatorname{krank}(C) \ge 2r + 2.$$
Since then, several works have analyzed the uniqueness of CP decomposition. One line of work assumes that one of the components, say $C$, is full column rank [17, 18]. It is shown in [18] that, for generic (fully dense) components $A$, $B$ and $C$, if $r \le q$ and $r(r-1) \le p(p-1)s(s-1)/2$, then the CP decomposition in (41) is generically unique. We now demonstrate how this CP uniqueness result can be adapted to our setting. First, consider the matrix $M \in \mathbb{R}^{ps \times q}$ obtained by stacking the entries of $T$ as $M_{(i-1)s + j,\, k} = T_{ijk}$. Then, we have
$$M = (A \odot B)\, C^\top. \tag{42}$$
On the other hand, for the 2-persistent topic model with 4 words ($n = 2$, $m = 2$), the moment can be written as
$$M^{(2)}_4(x) = (A \odot A)\, \mathbb{E}\big[ h h^\top \big]\, (A \odot A)^\top,$$
for $A \in \mathbb{R}^{p \times q}$. The following matrix has the same column span as $M^{(2)}_4(x)$:
$$M' = (A \odot A)\, C'^\top,$$
for some full rank matrix $C' \in \mathbb{R}^{q \times q}$.
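The flattening identity of equation (42) — that stacking the entries of the CP decomposition (41) yields the Khatri-Rao product $(A \odot B)$ times $C^\top$ — can be verified numerically. A minimal sketch (the small random dimensions are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
p, s, q, r = 4, 3, 5, 2
A = rng.standard_normal((p, r))
B = rng.standard_normal((s, r))
C = rng.standard_normal((q, r))

# CP decomposition of equation (41): T = sum_i a_i ∘ b_i ∘ c_i
T = np.einsum('pi,si,qi->psq', A, B, C)

# Flattening of equation (42): M[i*s + j, k] = T[i, j, k] (0-indexed)
M = T.reshape(p * s, q)

# Khatri-Rao product A ⊙ B: column i is kron(a_i, b_i)
AkrB = np.einsum('pi,si->psi', A, B).reshape(p * s, r)

assert np.allclose(M, AkrB @ C.T)
```

The same reshaping, with $B = A$ and $C$ replaced by $C'$, recovers the matrix $M' = (A \odot A) C'^\top$ discussed above for the 2-persistent topic model.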
Our random identifiability result in Theorem 2 provides the uniqueness of $A$ and $C'$, given $M'$, under the size condition $q \le \left(\frac{cp}{2}\right)^2$ and the additional degree condition 5. Note that, as discussed in the previous section, this identifiability argument is the same as the unique decomposition of the corresponding tensor. Thus, in equation (42), by setting $A = B$ and taking $C$ to be a full-rank square matrix, we obtain the 2-persistent topic model under consideration in this paper. Hence, the identifiability results of [18] are applicable to our setting if we assume a generic (i.e., fully dense) matrix $A$. However, we incorporate a sparse matrix $A$, and therefore require different techniques to provide identifiability results. We note that the size bound specified in [18] is comparable to the size bound derived in this paper (for random structured matrices), but we have additional degree conditions for identifiability. Analyzing the regime where the uniqueness conditions of [18] are satisfied under sparsity constraints is an interesting question for future investigation.

References

[1] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review and new perspectives. arXiv preprint arXiv:1206.5538, 2012.

[2] Michael S. Lewicki, Terrence J. Sejnowski, and Howard Hughes. Learning overcomplete representations. Neural Computation, 12:337–365, 1998.

[3] André Uschmajew. Local convergence of the alternating least squares algorithm for canonical tensor approximation. SIAM Journal on Matrix Analysis and Applications, 33(2):639–652, 2012.

[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[5] J. K. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data.
Genetics, 155:945–959, 2000.

[6] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor methods for learning latent variable models. Under review, Journal of Machine Learning Research. Available at arXiv:1210.7559, Oct. 2012.

[7] A. Anandkumar, D. Hsu, A. Javanmard, and S. M. Kakade. Learning linear Bayesian networks with latent variables. ArXiv e-prints, September 2012.

[8] Sanjeev Arora, Rong Ge, Yoni Halpern, David M. Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. A practical algorithm for topic modeling with provable guarantees. arXiv:1212.4777, 2012.

[9] J. B. Kruskal. More factors than subjects, tests and treatments: an indeterminacy theorem for canonical decomposition and individual differences scaling. Psychometrika, 41(3):281–293, 1976.

[10] Tamara Kolda and Brett Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[11] Silvia Gandy, Benjamin Recht, and Isao Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2):025010, 2011.

[12] Adam Coates, Honglak Lee, and Andrew Y. Ng. An analysis of single-layer networks in unsupervised feature learning. Journal of Machine Learning Research - Proceedings Track, 15:215–223, 2011.

[13] Quoc V. Le, Alexandre Karpenko, Jiquan Ngiam, and Andrew Y. Ng. ICA with reconstruction cost for efficient overcomplete feature learning. In NIPS, pages 1017–1025, 2011.

[14] Li Deng and Dong Yu. Deep Learning for Signal and Information Processing. NOW Publishers, 2013.

[15] J. B. Kruskal. Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95–138, 1977.

[16] A. Bhaskara, M. Charikar, and A. Vijayaraghavan.
Uniqueness of tensor decompositions with applications to polynomial identifiability. arXiv:1304.8087, April 2013.

[17] Tao Jiang and Nicholas D. Sidiropoulos. Kruskal's permutation lemma and the identification of CANDECOMP/PARAFAC and bilinear models with constant modulus constraints. IEEE Transactions on Signal Processing, 52(9):2625–2636, 2004.

[18] Lieven De Lathauwer. A link between the canonical decomposition in multilinear algebra and simultaneous matrix diagonalization. SIAM Journal on Matrix Analysis and Applications, 28(3):642–666, 2006.

[19] Alwin Stegeman, Jos M. F. Ten Berge, and Lieven De Lathauwer. Sufficient conditions for uniqueness in CANDECOMP/PARAFAC and INDSCAL with random component matrices. Psychometrika, 71(2):219–229, June 2006.

[20] L. De Lathauwer, J. Castaing, and J.-F. Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Transactions on Signal Processing, 55:2965–2973, June 2007.

[21] Luca Chiantini and Giorgio Ottaviani. On generic identifiability of 3-tensors of small rank. SIAM Journal on Matrix Analysis and Applications, 33(3):1018–1037, 2012.

[22] Cristiano Bocci, Luca Chiantini, and Giorgio Ottaviani. Refined methods for the identifiability of tensors. arXiv preprint arXiv:1303.6915, 2013.

[23] Luca Chiantini, Massimiliano Mella, and Giorgio Ottaviani. One example of general unidentifiable tensors. arXiv preprint arXiv:1303.6914, 2013.

[24] E. S. Allman, C. Matias, and J. A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37(6A):3099–3132, 2009.

[25] Elizabeth S. Allman, John A. Rhodes, and Amelia Taylor. A semialgebraic description of the general Markov model on phylogenetic trees. arXiv preprint arXiv:1212.1200, Dec. 2012.

[26] Navin Goyal, Santosh Vempala, and Ying Xiao. Fourier PCA.
arXiv:1306.5825, 2013.

[27] Joseph M. Landsberg. Tensors: Geometry and Applications, volume 128. American Mathematical Society, 2012.

[28] A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In Proc. of Conference on Learning Theory, June 2012.

[29] A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y. K. Liu. A spectral algorithm for Latent Dirichlet Allocation. In Proc. of Neural Information Processing Systems (NIPS), Dec. 2012.

[30] A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade. A tensor spectral approach to learning mixed membership community models. In Conference on Learning Theory (COLT), June 2013.

[31] E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. The Annals of Applied Probability, 16(2):583–614, 2006.

[32] J. T. Chang. Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Mathematical Biosciences, 137(1):51–73, 1996.

[33] Yuval Rabani, Leonard Schulman, and Chaitanya Swamy. Learning mixtures of arbitrary distributions over large discrete domains. arXiv preprint arXiv:1212.1527, 2012.

[34] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models – going beyond SVD. In Symposium on Theory of Computing, 2012.

[35] Daniel A. Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. In Proc. of Conference on Learning Theory, 2012.

[36] Kenneth Kreutz-Delgado, Joseph F. Murray, Bhaskar D. Rao, Kjersti Engan, Te-Won Lee, and Terrence J. Sejnowski. Dictionary learning algorithms for sparse representation. Neural Computation, 15:349–396, February 2003.

[37] B. Rao and K. Kreutz-Delgado. An affine scaling methodology for best basis selection. IEEE Transactions on Signal Processing, 47:187–200, January 1999.

[38] Nishant A. Mehta and Alexander G. Gray.
Sparsity-based generalization bounds for predictive sparse coding. In Proc. of the Intl. Conf. on Machine Learning (ICML), Atlanta, USA, June 2013.

[39] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. Sparse coding for multitask and transfer learning. arXiv preprint, abs/1209.0738, 2012.

[40] Christopher J. Hillar and Friedrich T. Sommer. Ramsey theory reveals the conditions when sparse coding on subsampled data is unique. arXiv preprint arXiv:1106.3616, 2011.

[41] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, Maryland, 2012.

[42] XuanLong Nguyen. Posterior contraction of the population polytope in finite admixture models. arXiv preprint arXiv:1206.0068, 2012.

[43] Daniel A. Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. arXiv preprint, abs/1206.5882, 2012.

[44] Nicholas D. Sidiropoulos and Rasmus Bro. On the uniqueness of multilinear decomposition of N-way arrays. Journal of Chemometrics, 14(3):229–239, 2000.

[45] V. Chvátal. The tail of the hypergeometric distribution. Discrete Mathematics, 25(3):285–287, 1979.

[46] Matthew Skala. Hypergeometric tail inequalities: ending the insanity. http://ansuz.sooke.bc.ca/professional/hypergeometric.pdf.

[47] Philip Hall. On representatives of subsets. J. London Math. Soc., 10(1):26–30, 1935.
