Smoothed Analysis of Tensor Decompositions
Low rank tensor decompositions are a powerful tool for learning generative models, and uniqueness results give them a significant advantage over matrix decomposition methods. However, tensors pose significant algorithmic challenges, and tensor analogs…
Authors: Aditya Bhaskara, Moses Charikar, Ankur Moitra, Aravindan Vijayaraghavan
Smoothed Analysis of Tensor Decompositions

Aditya Bhaskara*   Moses Charikar†   Ankur Moitra‡   Aravindan Vijayaraghavan§

Abstract

Low rank decomposition of tensors is a powerful tool for learning generative models. The uniqueness of decomposition gives tensors a significant advantage over matrices. However, tensors pose significant algorithmic challenges and tensor analogs of much of the matrix algebra toolkit are unlikely to exist because of hardness results. Efficient decomposition in the overcomplete case (where rank exceeds dimension) is particularly challenging. We introduce a smoothed analysis model for studying these questions and develop an efficient algorithm for tensor decomposition in the highly overcomplete case (rank polynomial in the dimension). In this setting, we show that our algorithm is robust to inverse polynomial error, a crucial property for applications in learning since we are only allowed a polynomial number of samples. While algorithms are known for exact tensor decomposition in some overcomplete settings, our main contribution is in analyzing their stability in the framework of smoothed analysis.

Our main technical contribution is to show that tensor products of perturbed vectors are linearly independent in a robust sense (i.e. the associated matrix has singular values that are at least an inverse polynomial). This key result paves the way for applying tensor methods to learning problems in the smoothed setting. In particular, we use it to obtain results for learning multi-view models and mixtures of axis-aligned Gaussians where there are many more "components" than dimensions. The assumption here is that the model is not adversarially chosen, formalized by a perturbation of model parameters. We believe this is an appealing way to analyze realistic instances of learning problems, since this framework allows us to overcome many of the usual limitations of using tensor methods.

*Google Research NYC. Email: bhaskara@cs.princeton.edu. Work done while the author was at EPFL, Switzerland.
†Princeton University. Email: moses@cs.princeton.edu. Supported by NSF awards CCF 0832797, AF 1218687 and CCF 1302518.
‡Massachusetts Institute of Technology, Department of Mathematics and CSAIL. Email: moitra@mit.edu. Part of this work was done while the author was a postdoc at the Institute for Advanced Study and was supported in part by NSF grant No. DMS-0835373 and by an NSF Computing and Innovation Fellowship.
§Carnegie Mellon University. Email: aravindv@cs.cmu.edu. Supported by the Simons Postdoctoral Fellowship.

1 Introduction

1.1 Background

Tensor decompositions play a central role in modern statistics (see e.g. [27]). To illustrate their usefulness, suppose we are given a matrix $M = \sum_{i=1}^{R} a_i \otimes b_i$. When can we uniquely recover the factors $\{a_i\}_i$ and $\{b_i\}_i$ of this decomposition given access to $M$? In fact, this decomposition is almost never unique (unless we require that the factors $\{a_i\}_i$ and $\{b_i\}_i$ are orthonormal, or that $M$ has rank one). But given a tensor $T = \sum_{i=1}^{R} a_i \otimes b_i \otimes c_i$, there are general conditions under which $\{a_i\}_i$, $\{b_i\}_i$ and $\{c_i\}_i$ are uniquely determined (up to scaling) given $T$; perhaps the most famous such condition is due to Kruskal [24], which we review in the next section.
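As a quick numerical illustration of the objects involved (a hypothetical NumPy sketch with made-up sizes, not part of the paper), one can build such a rank-$R$ third-order tensor from factor matrices; by contrast, the matrix $M$ only determines the product $AB^T$:

```python
import numpy as np

# Minimal sketch: a rank-R third-order tensor T = sum_i a_i (x) b_i (x) c_i
# built from factor matrices A, B, C whose columns are the a_i, b_i, c_i.
n, R = 10, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((n, R))
B = rng.standard_normal((n, R))
C = rng.standard_normal((n, R))

# T[j, k, l] = sum_i A[j, i] * B[k, i] * C[l, i]
T = np.einsum('ji,ki,li->jkl', A, B, C)

# The second-order analogue M = sum_i a_i (x) b_i only determines A @ B.T,
# which has many different factorizations.
M = A @ B.T
```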
Tensor methods are commonly used to establish that the parameters of a generative model can be identified given third (or higher) order moments. In contrast, given just second-order moments (e.g. $M$) we can only hope to recover the factors up to a rotation. This is called the rotation problem and has been an important issue in statistics since the pioneering work of the psychologist Charles Spearman (1904) [31]. Tensors offer a path around this obstacle precisely because their decompositions are often unique, and consequently have found applications in phylogenetic reconstruction [11], [29], hidden Markov models [29], mixture models [20], topic modeling [5], community detection [3], etc.

However most tensor problems are hard: computing the rank [17], the best rank one approximation [18] and the spectral norm [18] are all NP-hard. Also many of the familiar properties of matrices do not generalize to tensors. For example, subtracting the best rank one approximation from a tensor can actually increase its rank [34], and there are rank three tensors that can be approximated arbitrarily well by a sequence of rank two tensors. One of the few algorithmic results for tensors is an algorithm for computing tensor decompositions in a restricted case. Let $A$, $B$ and $C$ be matrices whose columns are $\{a_i\}_i$, $\{b_i\}_i$ and $\{c_i\}_i$ respectively.

Theorem 1.1 ([25], [11]). If $\mathrm{rank}(A) = \mathrm{rank}(B) = R$ and no pair of columns in $C$ are multiples of each other, then there is a polynomial time algorithm to compute the minimum rank tensor decomposition of $T$. Moreover, the rank one terms in this decomposition are unique (among all decompositions with the same rank).

If $T$ is an $n \times n \times n$ tensor, then $R$ can be at most $n$ in order for the conditions of the theorem to be met. This basic algorithm has been used to design efficient algorithms for phylogenetic reconstruction [11], [29], topic modeling [5], community detection [3] and learning hidden Markov models and mixtures of spherical Gaussians [20]. However, algorithms that make use of tensor decompositions have traditionally been limited to the full-rank case, and our goal is to develop stable algorithms that work for $R = \mathrm{poly}(n)$. Recently Goyal et al [16] gave a robustness analysis for this decomposition, and we give an alternative proof in Appendix A.

In fact, this basic tensor decomposition can be bootstrapped to work even when $R$ is larger than $n$ (if we also increase the order of the tensor). The key parameter that dictates when one can efficiently find a tensor decomposition (or more generally, when it is unique) is the Kruskal rank:

Definition 1.2. The Kruskal rank (or Krank) of a matrix $A$ is the largest $k$ for which every set of $k$ columns is linearly independent. Also, the $\tau$-robust $k$-rank is denoted by $\mathrm{Krank}_{\tau}(A)$, and is the largest $k$ for which every $n \times k$ sub-matrix $A|_S$ of $A$ has $\sigma_k(A|_S) \geq 1/\tau$.

How can we push the above theorem beyond $R = n$? We can instead work with an order $\ell$ tensor. To be concrete, set $\ell = 5$ and suppose $T$ is an $n \times n \times \ldots \times n$ tensor. We can "flatten" $T$ to get an order three tensor

$$T = \sum_{i=1}^{R} \underbrace{A^{(1)}_i \otimes A^{(2)}_i}_{\text{factor}} \otimes \underbrace{A^{(3)}_i \otimes A^{(4)}_i}_{\text{factor}} \otimes \underbrace{A^{(5)}_i}_{\text{factor}}$$

Hence we get an order three tensor $\widehat{T}$ of size $n^2 \times n^2 \times n$.
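In code, this "flattening" is just a reshape of the order-5 tensor (again a hypothetical sketch with arbitrary small sizes, not from the paper):

```python
import numpy as np

# Order-5 rank-R tensor with R > n, flattened into an n^2 x n^2 x n order-3
# tensor whose factors are columns of A1 (.) A2, A3 (.) A4 and A5.
n, R = 6, 20                        # overcomplete: R exceeds the dimension n
rng = np.random.default_rng(1)
A1, A2, A3, A4, A5 = (rng.standard_normal((n, R)) for _ in range(5))

T5 = np.einsum('ai,bi,ci,di,ei->abcde', A1, A2, A3, A4, A5)
T3 = T5.reshape(n * n, n * n, n)    # group modes (1,2) and (3,4) together
```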
Alternatively, we can define this "flattening" using the following operation:

Definition 1.3. The Khatri-Rao product of $U$ and $V$, which are of size $m \times r$ and $n \times r$ respectively, is the $mn \times r$ matrix $U \odot V$ whose $i$th column is $u_i \otimes v_i$.

Our new order three tensor $\widehat{T}$ can be written as:

$$\widehat{T} = \sum_{i=1}^{R} \left(A^{(1)} \odot A^{(2)}\right)_i \otimes \left(A^{(3)} \odot A^{(4)}\right)_i \otimes A^{(5)}_i$$

The factors are the columns of $A^{(1)} \odot A^{(2)}$, the columns of $A^{(3)} \odot A^{(4)}$ and the columns of $A^{(5)}$. The crucial point is that the Kruskal rank of the columns of $A^{(1)} \odot A^{(2)}$ is in fact at least the sum of the Kruskal ranks of the columns of $A^{(1)}$ and $A^{(2)}$ (and similarly for $A^{(3)} \odot A^{(4)}$) [1], [9], but this is tight in the worst case. Consequently, this "flattening" operation allows us to use the above algorithm up to $R = 2n$; since the rank ($R$) is larger than the largest dimension ($n$), this is called the overcomplete case. Our main technical result is that in a natural smoothed analysis model, the Kruskal rank robustly multiplies, and this allows us to give algorithms for computing a tensor decomposition even in the highly overcomplete case, for any $R = \mathrm{poly}(n)$ (provided that the order of the tensor is large, but still a constant). Moreover, our algorithms have immediate applications in learning mixtures of Gaussians and multi-view mixture models.

1.2 Our Results

We introduce the following framework for studying tensor decomposition problems:

- An adversary chooses a tensor $T = \sum_{i=1}^{R} A^{(1)}_i \otimes A^{(2)}_i \otimes \ldots \otimes A^{(\ell)}_i$.
- Each column $A^{(j)}_i$ is $\rho$-perturbed to yield $\widetilde{A}^{(j)}_i$.¹
- We are given $\widetilde{T} = \sum_{i=1}^{R} \widetilde{A}^{(1)}_i \otimes \widetilde{A}^{(2)}_i \otimes \ldots \otimes \widetilde{A}^{(\ell)}_i$ (possibly with noise).

¹An (independent) random Gaussian with zero mean and variance $\rho^2/n$ in each coordinate is added to $A^{(j)}_i$ to obtain $\widetilde{A}^{(j)}_i$. We note that we make the Gaussian assumption for convenience, but our analysis seems to apply to more general perturbations.

Our goal is to recover the factors $\{\widetilde{A}^{(1)}_i\}_i, \{\widetilde{A}^{(2)}_i\}_i, \ldots, \{\widetilde{A}^{(\ell)}_i\}_i$ (up to rescaling). This model is directly inspired by smoothed analysis, which was introduced by Spielman and Teng [32], [33] as a framework in which to understand why certain algorithms perform well on realistic inputs. In applications in learning, tensors are used to encode low-order moments of the distribution. In particular, each factor in the decomposition represents a "component". The intuition is that if these "components" are not chosen in a worst-case configuration, then we can obtain vastly improved learning algorithms in various settings. For example, as a direct consequence of our main result, we will give new algorithms for learning mixtures of spherical Gaussians, again in the framework of smoothed analysis (without any additional separation conditions). There are no known polynomial time algorithms to learn such mixtures if the number of components ($k$) is larger than the dimension ($n$). But if their means are perturbed, we give a polynomial time algorithm for any $k = \mathrm{poly}(n)$ by virtue of our tensor decomposition algorithm.

Our main technical result is the following:

Theorem 1.4. Let $R \leq n^{\ell}/2$ for some constant $\ell \in \mathbb{N}$. Let $A^{(1)}, A^{(2)}, \ldots, A^{(\ell)}$ be $n \times R$ matrices with columns of unit norm, and let $\widetilde{A}^{(1)}, \widetilde{A}^{(2)}, \ldots, \widetilde{A}^{(\ell)} \in \mathbb{R}^{n \times R}$ be their respective $\rho$-perturbations.
Then for $\tau = (n/\rho)^{3^{\ell}}$, the Khatri-Rao product satisfies

$$\mathrm{Krank}_{\tau}\left(\widetilde{A}^{(1)} \odot \widetilde{A}^{(2)} \odot \ldots \odot \widetilde{A}^{(\ell)}\right) = R \quad \text{w.p. at least } 1 - \exp\left(-C n^{1/3^{\ell}}\right) \quad (1)$$

In general the Kruskal rank adds [1, 9], but in the framework of smoothed analysis it robustly multiplies. What is crucial here is that we have a lower bound $\tau$ on how close these vectors are to linearly dependent. In almost all of the applications of tensor methods, we are not given $T$ exactly but rather with some noise. This error could arise, for example, because we are using a finite number of samples to estimate the moments of a distribution. It is the condition number of $\widetilde{A}^{(1)} \odot \widetilde{A}^{(2)} \odot \ldots \odot \widetilde{A}^{(\ell)}$ that will control whether various tensor decomposition algorithms work in the presence of noise.

Another crucial property our method achieves is exponentially small failure probability for any constant $\ell$, for our polynomial bound on $\tau$. In particular, for $\ell = 2$ we show (in Theorem 3.1) that for $\rho$-perturbations of two $n \times n^2/2$ matrices $U$ and $V$, we have $\mathrm{Krank}_{\tau}(\widetilde{U} \odot \widetilde{V}) = n^2/2$ for $\tau = n^{O(1)}/\rho^2$, with probability $1 - \exp(-\sqrt{n})$. We remark that it is fairly straightforward to obtain the above statement (for $\ell = 2$) with failure probability $\delta$ and $\tau = (n/\delta)^{O(1)}$ (see Remark 3.7 for more on the latter); however, this is not desirable since the running time has a polynomial dependence on the minimum singular value $1/\tau$ (and hence on $\delta$).

We obtain the following main theorem from the above result and from analyzing the stability of the algorithm of Leurgans et al [25] (see Theorem 2.3):

Theorem 1.5. Let $R \leq n^{\lfloor \frac{\ell-1}{2} \rfloor}/2$ for some constant $\ell \in \mathbb{N}$. Suppose we are given $\widetilde{T} + E$ where $\widetilde{T}$ and $E$ are order $\ell$ tensors, and $\widetilde{T}$ has rank $R$ and is obtained from the above smoothed analysis model. Moreover, suppose the entries of $E$ are at most $\varepsilon (\rho/n)^{3^{\ell}}$ where $\varepsilon < 1$. Then there is an algorithm to recover the rank one terms $\otimes_{i=1}^{\ell} \widetilde{a}^{(i)}_j$ up to an additive $\varepsilon$ error. The algorithm runs in time $n^{C \cdot 3^{\ell}}$ and succeeds with probability at least $1 - \exp(-C n^{1/3^{\ell}})$.

As we discussed, tensor methods have had numerous applications in learning. However, algorithms that make use of tensor decompositions have traditionally been limited to the full-rank case, and hence can only handle cases when the number of "components" is at most the dimension. By using our main theorem above, we can get new algorithms for some of these problems that work even if there are many more "components" than dimensions.

Multi-view Models (Section 4)

In this setting, each sample is composed of $\ell$ views $x^{(1)}, x^{(2)}, \ldots, x^{(\ell)}$ which are conditionally independent given which component $i \in [R]$ the sample is generated from. Hence such a model is specified by $R$ mixing weights $w_i$ and $R$ discrete distributions $\mu_i^{(1)}, \ldots, \mu_i^{(j)}, \ldots, \mu_i^{(\ell)}$, one for each view. Such models are very expressive and are used as a common abstraction for a number of inference problems. Anandkumar et al [2] gave algorithms in the full rank setting. However, in many practical settings like speech recognition and image classification, the dimension of the feature space is typically much smaller than the number of components.
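To make the generative process concrete, here is a small hypothetical sketch of sampling from such a multi-view model (the parameter names `w`, `mu` and the sizes are my own illustration, not the paper's notation):

```python
import numpy as np

# R components with mixing weights w; given the component i, the ell views are
# drawn independently, view j from the discrete distribution mu[j][:, i] over [n].
n, R, ell = 8, 12, 3
rng = np.random.default_rng(2)
w = rng.dirichlet(np.ones(R))
mu = [rng.dirichlet(np.ones(n), size=R).T for _ in range(ell)]   # each n x R

def sample():
    i = rng.choice(R, p=w)                                       # hidden component
    return [rng.choice(n, p=mu[j][:, i]) for j in range(ell)]    # ell views
```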
If we suppose that the distributions that make up the multi-view model are $\rho$-perturbed (analogously to the tensor setting), then we can give the first known algorithms for the overcomplete setting. Suppose that the means $\mu^{(j)}_i$ are $\rho$-perturbed to obtain $\{\widetilde{\mu}^{(j)}_i\}$. Then:

Theorem 1.6. There is an algorithm to learn the parameters $w_i$ and $\{\widetilde{\mu}^{(j)}_i\}$ of an $\ell$-view multi-view model with $R \leq n^{\lfloor \frac{\ell-1}{2}\rfloor}/2$ components up to an accuracy $\varepsilon$. The running time and sample complexity are at most $\mathrm{poly}_{\ell}(n, 1/\varepsilon, 1/\rho)$, and the algorithm succeeds with probability at least $1 - \exp(-C n^{1/3^{\ell}})$ for some constant $C > 0$.

Mixtures of Axis-Aligned Gaussians (Section 5)

Here we are given samples from a distribution $F = \sum_{i=1}^{k} w_i F_i(\mu_i, \Sigma_i)$ where $F_i(\mu_i, \Sigma_i)$ is a Gaussian with mean $\mu_i$ and covariance $\Sigma_i$, and each $\Sigma_i$ is diagonal. These mixtures are ubiquitous throughout machine learning. Feldman et al [14] gave an algorithm for PAC-learning mixtures of axis-aligned Gaussians, however the running time is exponential in $k$, the number of components. Hsu and Kakade [20] gave a polynomial time algorithm for learning mixtures of spherical Gaussians provided that their means are full rank (hence $k \leq n$). Again, we turn to the framework of smoothed analysis and suppose that the means are $\rho$-perturbed. In this framework, we can give a polynomial time algorithm for learning mixtures of axis-aligned Gaussians for any $k = \mathrm{poly}(n)$. Suppose that the means of a mixture of axis-aligned Gaussians have been $\rho$-perturbed to obtain $\widetilde{\mu}_i$. Then:

Theorem 1.7. There is an algorithm to learn the parameters $w_i$, $\widetilde{\mu}_i$ and $\Sigma_i$ of a mixture of $k \leq n^{\lfloor \frac{\ell-1}{2}\rfloor}/(2\ell)$ axis-aligned Gaussians up to an accuracy $\varepsilon$. The running time and sample complexity are at most $\mathrm{poly}_{\ell}(n, 1/\varepsilon, 1/\rho)$, and the algorithm succeeds with probability at least $1 - \exp(-C n^{1/3^{\ell}})$ for some constant $C > 0$.

We believe that our new algorithms for overcomplete tensor decomposition will have further applications in learning. Additionally, this framework of studying distribution learning when the parameters of the distribution we would like to learn are not chosen adversarially seems quite appealing.

Remark 1.8. Recall, our main technical result is that the Kruskal rank robustly multiplies. In fact, it is easy to see that for a generic set of vectors it multiplies [1]. This observation, in conjunction with the algorithm of Leurgans et al [25], yields an algorithm for tensor decomposition in the overcomplete case. Another approach to overcomplete tensor decomposition was given by [13], which works up to $r \leq n^{\lfloor \ell/2 \rfloor}$. However, these algorithms assume that we know $T$ exactly, and are not known to be stable when we are given $T$ with noise. The main issue is that these algorithms are based on solving a linear system which is full rank if the factors of $T$ are generic, but what controls whether or not these linear systems can handle noise is their condition number. Alternatively, algorithms for overcomplete tensor decomposition that assume we know $T$ exactly would not have any applications in learning because we would need to take too many samples to have a good enough estimate of $T$ (i.e. the low-order moments of the distribution).
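The "generic" multiplication of the Kruskal rank in Remark 1.8 is easy to observe numerically (an illustration, not a proof, with arbitrary sizes; the helper `khatri_rao` is my own):

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product: the i-th column is u_i (x) v_i."""
    n, R = U.shape
    m, _ = V.shape
    return (U[:, None, :] * V[None, :, :]).reshape(n * m, R)

# For random (hence "generic") U, V with R > n, the columns of U (.) V are
# still linearly independent, i.e. the rank multiplies past n.
n, R = 10, 40
rng = np.random.default_rng(3)
U, V = rng.standard_normal((n, R)), rng.standard_normal((n, R))
print(np.linalg.matrix_rank(khatri_rao(U, V)))   # prints 40 almost surely
```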
In recent work, Goyal et al [16] also made use of robust algorithms for overcomplete tensor decomposition, and their main application is underdetermined independent component analysis (ICA). The condition that they need to impose on the tensor holds generically (like ours, see e.g. Corollary 2.4), and they can show in a smoothed analysis model that this condition holds with inverse polynomial failure probability. However, here our focus is on showing a lower bound for the condition number of $M^{\odot \ell}$ that does not depend (polynomially) on the failure probability. We focus on the failure probability being small (in particular, exponentially small), because in smoothed analysis the perturbation is "one-shot", and if it does not result in an easy instance, you cannot ask for a new one!

1.3 Our Approach

Here we give some intuition for how we prove our main technical theorem, at least in the $\ell = 2$ case. Recall, we are given two matrices $U^{(1)}$ and $U^{(2)}$ whose $R$ columns are $\rho$-perturbed to obtain $\widetilde{U}^{(1)}$ and $\widetilde{U}^{(2)}$ respectively. Our goal is to prove that if $R \leq \frac{n^2}{2}$ then the matrix $\widetilde{U}^{(1)} \odot \widetilde{U}^{(2)}$ has smallest singular value at least $\mathrm{poly}(1/n, \rho)$ with high probability. In fact, it will be easier to work with what we call the leave-one-out distance (see Definition 3.4) as a surrogate for the smallest singular value (see Lemma 3.5). Alternatively, if we let $x$ and $y$ be the first columns of $\widetilde{U}^{(1)}$ and $\widetilde{U}^{(2)}$ respectively, and we set $\mathcal{U} = \mathrm{span}(\{\widetilde{U}^{(1)}_i \otimes \widetilde{U}^{(2)}_i,\ 2 \leq i \leq R\})$, then we would like to prove that with high probability $x \otimes y$ has a non-negligible projection on the orthogonal complement of $\mathcal{U}$. This is the core of our approach. Set $\mathcal{V}$ to be the orthogonal complement of $\mathcal{U}$. In fact, we prove that for any subspace $\mathcal{V}$ of dimension at least $\frac{n^2}{2}$, with high probability $x \otimes y$ has a non-negligible projection onto $\mathcal{V}$.

How can we reason about the projection of $x \otimes y$ onto an arbitrary (but large) dimensional subspace? If $\mathcal{V}$ were (say) the set of all low-rank matrices, then this would be straightforward. But what complicates this is that we are looking at the projection of a rank one matrix onto a large dimensional subspace of matrices, and these two spaces can be structured quite differently. A natural approach is to construct matrices $M_1, M_2, \ldots, M_p \in \mathcal{V}$ so that with high probability at least one quadratic form $x^T M_i y$ is non-negligible. Suppose the following condition were met (in which case we would be done): suppose that there is a large set $S$ of indices so that each vector $x^T M_i$ (for $i \in S$) has a large projection onto the orthogonal complement of the span of the others, $\mathrm{span}(\{x^T M_j : j \in S,\ j \neq i\})$. In fact, if such a set $S$ exists with high probability, then this would yield our main technical theorem in the $\ell = 2$ case. Our main step is in constructing a family of matrices $M_1, M_2, \ldots, M_p$ that helps us show that $S$ is large. We call this a $(\theta, \delta)$-orthogonal system (see Definition 3.13). The intuition behind this definition is that if we reveal a column in one of the $M_i$'s that has a significant orthogonal component to all of the columns that we have revealed so far, this is in effect a fresh source of randomness that can help us add another index to the set $S$. See Section 3 for a more complete description of our approach in the $\ell = 2$ case.
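As a rough numerical illustration of this core claim (a hypothetical sketch with arbitrary sizes and a random choice of subspace, not an experiment from the paper), a perturbed $x \otimes y$ typically retains a non-negligible fraction of its norm outside an $n^2/2$-dimensional subspace $\mathcal{U}$:

```python
import numpy as np

n, rho = 20, 0.1
rng = np.random.default_rng(4)
# Orthonormal basis of a random n^2/2-dimensional subspace U of R^{n^2}.
U_basis, _ = np.linalg.qr(rng.standard_normal((n * n, n * n // 2)))

x0 = rng.standard_normal(n); x0 /= np.linalg.norm(x0)
y0 = rng.standard_normal(n); y0 /= np.linalg.norm(y0)
x = x0 + rho * rng.standard_normal(n) / np.sqrt(n)   # rho-perturbations
y = y0 + rho * rng.standard_normal(n) / np.sqrt(n)

xy = np.kron(x, y)
proj_U = U_basis @ (U_basis.T @ xy)
print(np.linalg.norm(xy - proj_U) / np.linalg.norm(xy))  # bounded away from 0
```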
The approach for $\ell > 2$ relies on the same basic strategy, but requires a more delicate induction argument. See Section 3.4.

2 Prior Algorithms

Here we review the algorithm of Leurgans et al [25]. It has been discovered many times in different settings. It is sometimes referred to as "simultaneous diagonalization" or as Chang's lemma [11]. Suppose we are given a third-order tensor $T = \sum_{i=1}^{R} u_i \otimes v_i \otimes w_i$ which is $n \times m \times p$. Let $U$, $V$ and $W$ be matrices whose columns are $u_i$, $v_i$ and $w_i$ respectively. Suppose further that (1) $\mathrm{rank}(U) = \mathrm{rank}(V) = R$ and (2) $\mathrm{Krank}(W) \geq 2$. Then we can efficiently recover the factors of $T$.

We present the algorithm Decompose and its analysis assuming $n = m = R$. Any instance with $\mathrm{rank}(U) = \mathrm{rank}(V) = R$ can be reduced to this case as follows: find the span of the vectors $\{\widetilde{u}_{j,k}\}$, where $\widetilde{u}_{j,k}$ is the $n$ dimensional vector whose $i$th entry is $T_{ijk}$. This span must be precisely the span of the columns of $U$.² Thus we can pick some orthonormal basis for this span, and write $T$ as an $R \times m \times p$ tensor. We can perform this operation again (along the second mode) to move to an $R \times R \times p$ tensor.

Theorem 2.1 ([25], [11]). Given a tensor $T$, there exists an algorithm that runs in polynomial time and recovers the (unique) factors of $T$ provided that (1) $\mathrm{rank}(U) = \mathrm{rank}(V) = R$ and (2) $\mathrm{Krank}(W) \geq 2$.

Proof: The algorithm is to pre-process as above (i.e., obtain $m = n = R$), and then run Decompose stated below. Let us thus analyze Decompose with $m, n$ being $R$. We can write $T_a = U D_a V^T$ where $D_a = \mathrm{diag}(a^T w_1, a^T w_2, \ldots, a^T w_n)$, and similarly $T_b = U D_b V^T$ where $D_b = \mathrm{diag}(b^T w_1, b^T w_2, \ldots, b^T w_n)$. Moreover, we can write $T_a (T_b)^{-1} = U D_a D_b^{-1} U^{-1}$ and $T_a^T (T_b^T)^{-1} = V D_a D_b^{-1} V^{-1}$. So we conclude that $U$ and $V$ diagonalize $T_a (T_b)^{-1}$ and $((T_b)^{-1} T_a)^T$ respectively. Note that almost surely the diagonal entries of $D_a D_b^{-1}$ are distinct (Claim A.4). Hence the eigendecompositions of $T_a (T_b)^{-1}$ and $((T_b)^{-1} T_a)^T$ are unique, and we can pair up columns in $U$ and columns in $V$ based on their eigenvalues (we pair up $u$ and $v$ if their eigenvalues are equal). We can then solve a linear system to find the remaining factors (columns in $W$), and since this is a valid decomposition, we can conclude that these are also the true factors of $T$ by appealing to Kruskal's uniqueness theorem [24].

In fact, this algorithm is also stable, as Goyal et al [16] also recently showed. It is intuitive that if $U$ and $V$ are well-conditioned and each pair of columns in $W$ is well-conditioned, then this algorithm can tolerate some inverse polynomial amount of noise. For completeness, we give a robustness analysis of Decompose in Appendix A.

Condition 2.2.
1. The condition numbers $\kappa(U), \kappa(V) \leq \kappa$.
2. The column vectors of $W$ are not close to parallel: for all $i \neq j$, $\left\| \frac{w_i}{\|w_i\|} - \frac{w_j}{\|w_j\|} \right\|_2 \geq \delta$.
3. The decompositions are bounded: for all $i$, $\|u_i\|_2, \|v_i\|_2, \|w_i\|_2 \leq C$.

Theorem 2.3. Suppose we are given a tensor $T + E \in \mathbb{R}^{m \times n \times p}$ with the entries of $E$ being bounded by $\epsilon \cdot \mathrm{poly}(1/\kappa, 1/n, 1/\delta)$, and moreover $T$ has a decomposition $T = \sum_{i=1}^{R} u_i \otimes v_i \otimes w_i$ that satisfies Condition 2.2.
Then there exists an efficient algorithm that returns each rank one term in the decomposition of $T$ (up to renaming), within an additive error of $\epsilon$.

As before, the algorithm is to preprocess so as to obtain $m = n = R$, and then run Decompose. The preprocessing step is slightly different because of the presence of error: instead of considering the span of the $\{\widetilde{u}_{j,k}\}$ as above, we need to look at the span of the top $R$ singular vectors of the matrix whose columns are the $\widetilde{u}_{j,k}$. If $\|E\|_F$ is small enough (in terms of $\kappa, \delta, n$), the span of these top singular vectors suffices to obtain an approximation to the vectors $u_i$ (see Appendix A).

Note that the algorithm is limited by the condition that $\mathrm{rank}(U) = \mathrm{rank}(V) = R$, since this requires that $R \leq \min(m, n)$. But as we have seen before, by "flattening" a higher order tensor, we can handle overcomplete tensors. The following is an immediate corollary of Theorem 2.3.

²It is easy to see that the span is contained in the span of the columns of $U$. To see equality, we observe that if the span is $R - 1$ dimensional, then projecting each of the $u_i$'s onto the span gives a different decomposition, and this contradicts Kruskal's uniqueness theorem, which holds in this case.

Algorithm 1 Decompose. Input: $T \in \mathbb{R}^{R \times R \times R}$
1. Let $T_a = T(\cdot, \cdot, a)$, $T_b = T(\cdot, \cdot, b)$ where $a, b$ are uniformly random unit vectors in $\mathbb{R}^p$.
2. Set $U$ to be the eigenvectors of $T_a (T_b)^{-1}$.
3. Set $V$ to be the eigenvectors of $((T_b)^{-1} T_a)^T$.
4. Solve the linear system $T = \sum_{i=1}^{n} u_i \otimes v_i \otimes w_i$ for the vectors $w_i$.
5. Output $U, V, W$.

Corollary 2.4. Suppose we are given an order-$\ell$ tensor $T + E \in \mathbb{R}^{n \times n \times \cdots \times n}$ with the entries of $E$ being bounded by $\epsilon \cdot \mathrm{poly}_{\ell}(1/\kappa, 1/n, 1/\delta)$, and matrices $U^{(1)}, U^{(2)}, \ldots, U^{(\ell)} \in \mathbb{R}^{n \times R}$ whose columns give a rank-$R$ decomposition $T = \sum_{i=1}^{R} u^{(1)}_i \otimes u^{(2)}_i \otimes \cdots \otimes u^{(\ell)}_i$. If Condition 2.2 is satisfied by $U = U^{(1)} \odot U^{(2)} \odot \ldots \odot U^{(\lfloor \frac{\ell-1}{2}\rfloor)}$, $V = U^{(\lfloor \frac{\ell-1}{2}\rfloor + 1)} \odot \ldots \odot U^{(2\lfloor \frac{\ell-1}{2}\rfloor)}$ and

$$W = \begin{cases} U^{(\ell)} & \text{if } \ell \text{ is odd} \\ U^{(\ell-1)} \odot U^{(\ell)} & \text{otherwise,} \end{cases}$$

then there exists an efficient algorithm that computes each rank one term in this decomposition up to an additive error of $\epsilon$.

Note that Corollary 2.4 does not require the decomposition to be symmetric. Further, any tri-partition of the $\ell$ modes that satisfies Condition 2.2 would have sufficed. To understand how large a rank we can handle, the key question is: when does the Kruskal rank (or rank) of an $\ell$-wise Khatri-Rao product become $R$? The following lemma is well-known (see [9] for a robust analogue) and is known to be tight in the worst case. It allows us to handle a rank of $R \approx \ell n/2$.

Lemma 2.5. $\mathrm{Krank}(U \odot V) \geq \min\left(\mathrm{Krank}(U) + \mathrm{Krank}(V) - 1,\ R\right)$.

But for a generic set of vectors $U$ and $V$, a much stronger statement is true [1]: $\mathrm{Krank}(U \odot V) \geq \min\left(\mathrm{Krank}(U) \times \mathrm{Krank}(V),\ R\right)$. Hence, given a generic order $\ell$ tensor $T$ with $R \leq n^{\lfloor (\ell-1)/2 \rfloor}$, "flattening" it to order three and appealing to Theorem 2.1 finds the factors uniquely. The algorithm of [13] follows a similar but more involved approach, and works for $R \leq n^{\lfloor \ell/2 \rfloor}$. However, in learning applications we are not given $T$ exactly but rather an approximation to it.
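For intuition, here is a minimal NumPy sketch of Algorithm 1 (Decompose) for the exact, noiseless $R \times R \times R$ case; it is only an illustration of the simultaneous-diagonalization idea under the stated assumptions, not the robust algorithm analyzed in Appendix A:

```python
import numpy as np

def decompose(T, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 1 for exact T = sum_i u_i (x) v_i (x) w_i,
    with U, V invertible and Krank(W) >= 2.  Returns factors up to
    permutation and rescaling of the rank-one terms."""
    R = T.shape[0]
    a = rng.standard_normal(T.shape[2])
    b = rng.standard_normal(T.shape[2])
    Ta = np.einsum('ijk,k->ij', T, a)            # T(.,.,a) = U diag(a^T w_i) V^T
    Tb = np.einsum('ijk,k->ij', T, b)
    # Columns of U (resp. V) are eigenvectors of Ta Tb^{-1} (resp. (Tb^{-1} Ta)^T);
    # in the exact case the eigenvalues are real and distinct almost surely.
    evalU, U = np.linalg.eig(Ta @ np.linalg.inv(Tb))
    evalV, V = np.linalg.eig((np.linalg.inv(Tb) @ Ta).T)
    # Pair up columns of U and V whose eigenvalues coincide.
    V = V[:, [int(np.argmin(np.abs(evalV - lam))) for lam in evalU]]
    # Solve the linear system T = sum_i u_i (x) v_i (x) w_i for the w_i:
    # flattening the first two modes, the coefficient matrix is U (.) V.
    UV = (U[:, None, :] * V[None, :, :]).reshape(R * R, R)
    W = np.linalg.lstsq(UV, T.reshape(R * R, R), rcond=None)[0].T
    return U, V, W
```

For instance, building `T = np.einsum('ji,ki,li->jkl', U0, V0, W0)` from random square `U0, V0, W0` and running `decompose(T)` recovers rank-one terms matching the originals up to permutation and scaling.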
Our goal is to show that the Kruskal rank robustly multiplies typically, so that these types of tensor algorithms not only work in the exact case, but are also stable when we are given $T$ with some noise. In the next section, we show that in the smoothed analysis model, the robust Kruskal rank multiplies on taking Khatri-Rao products. This then establishes our main result, Theorem 1.5, assuming Theorem 3.3, which we prove in the next section.

Proof of Theorem 1.5: As in Corollary 2.4, let $U = \widetilde{U}^{(1)} \odot \ldots \odot \widetilde{U}^{(\lfloor \frac{\ell-1}{2}\rfloor)}$, $V = \widetilde{U}^{(\lfloor \frac{\ell-1}{2}\rfloor+1)} \odot \ldots \odot \widetilde{U}^{(\ell-1)}$ and $W = \widetilde{U}^{(\ell)}$. Theorem 3.3 shows that with probability $1 - \exp\left(-n^{1/3^{O(\ell)}}\right)$ over the random $\rho$-perturbations, $\kappa_R(U), \kappa_R(V) \leq (n/\rho)^{3^{\ell}}$. Further, the columns of $W$ are $\delta = \rho/n$ far from parallel with high probability. Hence, Corollary 2.4 implies Theorem 1.5.

3 The Khatri-Rao Product Robustly Multiplies

In the exact case, it is enough to show that the Kruskal rank almost surely multiplies, and this yields algorithms for overcomplete tensor decomposition if we are given $T$ exactly (see Remark 1.8). But if we want to prove that these algorithms are stable, we need to establish that even the robust Kruskal rank (possibly with a different threshold $\tau$) also multiplies. This ends up being a very natural question in random matrix theory, albeit the Khatri-Rao product of two perturbed vectors in $\mathbb{R}^n$ is far from a perturbed vector in $\mathbb{R}^{n^2}$.

Formally, suppose we have two matrices $U$ and $V$ with columns $u_1, u_2, \ldots, u_R$ and $v_1, v_2, \ldots, v_R$ in $\mathbb{R}^n$. Let $\widetilde{U}, \widetilde{V}$ be $\rho$-perturbations of $U, V$, i.e. for each $i \in [R]$, we perturb $u_i$ with an (independent) random Gaussian perturbation of norm $\rho$ to obtain $\widetilde{u}_i$ (and similarly for $\widetilde{v}_i$). Then we show the following:

Theorem 3.1. Suppose $U, V$ are $n \times R$ matrices and let $\widetilde{U}, \widetilde{V}$ be $\rho$-perturbations of $U, V$ respectively. Then for any constant $\delta \in (0, 1)$, $R \leq \delta n^2$ and $\tau = n^{O(1)}/\rho^2$, the Khatri-Rao product satisfies $\mathrm{Krank}_{\tau}(\widetilde{U} \odot \widetilde{V}) = R$ with probability at least $1 - \exp(-\sqrt{n})$.

Remark 3.2. The natural generalization where the vectors $u_i$ and $v_i$ are in different dimensional spaces also holds. We omit the details here.

In general, a similar result holds for $\ell$-wise Khatri-Rao products, which allows us to handle rank as large as $\delta n^{\lfloor \frac{\ell-1}{2}\rfloor}$ for $\ell = O(1)$. Note that this does not follow by repeatedly applying the above theorem (say, applying the theorem to $U \odot V$ and then taking $\odot\, W$), because perturbing the entries of $(U \odot V)$ is not the same as $\widetilde{U} \odot \widetilde{V}$. In particular, we have only $\ell \cdot nR$ "truly" random bits, which are the perturbations of the columns of the base matrices. The overall structure of the proof is the same, but we need additional ideas followed by a delicate induction.

Theorem 3.3. For any $\delta \in (0, 1)$, let $R = \delta n^{\ell}$ for some constant $\ell \in \mathbb{N}$. Let $U^{(1)}, U^{(2)}, \ldots, U^{(\ell)}$ be $n \times R$ matrices with unit column norm, and let $\widetilde{U}^{(1)}, \widetilde{U}^{(2)}, \ldots, \widetilde{U}^{(\ell)} \in \mathbb{R}^{n \times R}$ be their respective $\rho$-perturbations. Then for $\tau = (n/\rho)^{3^{\ell}}$, the Khatri-Rao product satisfies

$$\mathrm{Krank}_{\tau}\left(\widetilde{U}^{(1)} \odot \widetilde{U}^{(2)} \odot \ldots \odot \widetilde{U}^{(\ell)}\right) = R \quad \text{w.p. at least } 1 - \exp\left(-\delta\, n^{1/3^{\ell}}\right) \quad (2)$$

Let $A$ denote the $n^{\ell} \times R$ matrix $\widetilde{U}^{(1)} \odot \widetilde{U}^{(2)} \odot \ldots \odot \widetilde{U}^{(\ell)}$ for convenience.
The theorem states that the smallest singular value of $A$ is lower bounded by $1/\tau$. How can we lower bound the smallest singular value of $A$? We define a quantity which can be used as a proxy for the least singular value and is simpler to analyze.

Definition 3.4. For any matrix $A$ with columns $A_1, A_2, \ldots, A_R$, the leave-one-out distance is

$$\ell(A) = \min_i\ \mathrm{dist}\left(A_i,\ \mathrm{span}\{A_j\}_{j \neq i}\right).$$

The leave-one-out distance is a good proxy for the least singular value, if we are not particular about losing multiplicative factors that are polynomial in the size of the matrix.

Lemma 3.5. For any matrix $A$ with columns $A_1, A_2, \ldots, A_R$, we have $\frac{\ell(A)}{\sqrt{R}} \leq \sigma_{\min}(A) \leq \ell(A)$.

We will show that each of the vectors $A_i = \widetilde{u}^{(1)}_i \otimes \widetilde{u}^{(2)}_i \otimes \cdots \otimes \widetilde{u}^{(\ell)}_i$ has a reasonable projection (at least $n^{\ell/2}/\tau$) on the space orthogonal to the span of the rest of the vectors, $\mathrm{span}(\{A_j : j \in [R] \setminus \{i\}\})$, with high probability. We do not have a good handle on the space spanned by the rest of the $R - 1$ vectors, so we will prove a more general statement in Theorem 3.6: we will prove that a perturbed vector $\widetilde{x}^{(1)} \otimes \cdots \otimes \widetilde{x}^{(\ell)}$ has a reasonable projection onto any (fixed) subspace $\mathcal{V}$ w.h.p., as long as $\dim(\mathcal{V})$ is $\Omega(n^{\ell})$. To say that a vector $w$ has a reasonable projection onto $\mathcal{V}$, we just need to exhibit a set of vectors in $\mathcal{V}$ such that one of them has a large inner product with $w$. This will imply the required bound on the singular value of $A$ as follows:

1. Fix an $i \in [R]$ and apply Theorem 3.6 with $x^{(t)} = u^{(t)}_i$ for all $t \in [\ell]$, and $\mathcal{V}$ being the space orthogonal to the rest of the vectors $A_j$.
2. Apply a union bound over all the $R$ choices for $i$.

We now state the main technical theorem about projections of perturbed product vectors onto arbitrary subspaces of large dimension.

Theorem 3.6. For any constant $\delta \in (0, 1)$, given any subspace $\mathcal{V}$ of dimension $\delta \cdot n^{\ell}$ in $\mathbb{R}^{n^{\ell}}$, there exist tensors $T_1, T_2, \ldots, T_r$ in $\mathcal{V}$ of unit norm ($\|\cdot\|_F = 1$), such that for random $\rho$-perturbations $\widetilde{x}^{(1)}, \widetilde{x}^{(2)}, \ldots, \widetilde{x}^{(\ell)} \in \mathbb{R}^n$ of any vectors $x^{(1)}, x^{(2)}, \ldots, x^{(\ell)} \in \mathbb{R}^n$, we have

$$\Pr\left[\exists j \in [r] \text{ s.t. } \left\| T_j\left(\widetilde{x}^{(1)}, \widetilde{x}^{(2)}, \ldots, \widetilde{x}^{(\ell)}\right)\right\| \geq \rho^{\ell}\, \frac{1}{n^{3^{\ell}}}\right] \geq 1 - \exp\left(-\delta\, n^{1/(2\ell)^{\ell}}\right) \quad (3)$$

Remark 3.7. Since the squared length of the projection is a degree $2\ell$ polynomial of the (Gaussian) variables $x_i$, we can apply standard anti-concentration results (Carbery-Wright, for instance) to conclude that the smallest singular value (in Theorem 3.6) is at least an inverse polynomial, with failure probability at most an inverse polynomial. This approach can only give a singular value lower bound of $\mathrm{poly}_{\ell}(p/n)$ for a failure probability of $p$, which is not desirable since the running time depends on the smallest singular value.

Remark 3.8. For meaningful guarantees, we will think of $\delta$ as a small constant or $n^{-o(1)}$ (note the dependence of the error probability on $\delta$ in eq. (3)). For instance, as we will see in Section 3.4, we cannot hope for exponentially small failure probability when $\mathcal{V} \subseteq \mathbb{R}^{n^2}$ has dimension $n$.

The following restatement of Theorem 3.6 gives a sufficient condition on the singular values of a matrix $T$ of size $r \times n^{\ell}$ that yields a strong anti-concentration property for values attained by vectors obtained by the tensor product of perturbed vectors.
This alternate view of Theorem 3.6 will be crucial in the inductive proof for higher $\ell$-wise products in Section 3.4.

Theorem 3.9 (Restatement of Theorem 3.6). Given any constant $\delta_{\ell} \in (0, 1)$ and any matrix $T$ of size $r \times n^{\ell}$ such that $\sigma_{\delta_{\ell} n^{\ell}}(T) \geq \eta$, then for random $\rho$-perturbations $\widetilde{x}^{(1)}, \widetilde{x}^{(2)}, \ldots, \widetilde{x}^{(\ell)} \in \mathbb{R}^n$ of any vectors $x^{(1)}, x^{(2)}, \ldots, x^{(\ell)} \in \mathbb{R}^n$, we have

$$\Pr\left[\left\| T\left(\widetilde{x}^{(1)}, \widetilde{x}^{(2)}, \ldots, \widetilde{x}^{(\ell)}\right)\right\| \geq \eta\, \rho^{\ell}\, \frac{1}{n^{3^{O(\ell)}}}\right] \geq 1 - \exp\left(-\delta\, n^{1/3^{\ell}}\right) \quad (4)$$

Remark 3.10. Theorem 3.6 follows from the above theorem by choosing an orthonormal basis of $\mathcal{V}$ as the rows of $T$. The other direction follows by choosing $\mathcal{V}$ as the span of the top $\delta_{\ell} n^{\ell}$ right singular vectors of $T$.

Remark 3.11. Before proceeding, we remark that both forms of Theorem 3.6 could be of independent interest. For instance, it follows from the above (by a small trick involving partitioning the coordinates) that a vector $\widetilde{x}^{\otimes \ell}$ has a non-negligible projection into any $cn^{\ell}$ dimensional subspace of $\mathbb{R}^{n^{\ell}}$ with probability $1 - \exp(-f_{\ell}(n))$. For a vector $x \in \mathbb{R}^{n^{\ell}}$ whose entries are all independent Gaussians, such a claim follows easily, with probability roughly $1 - \exp(-n^{\ell})$. The key difference for us is that $\widetilde{x}^{\otimes \ell}$ has essentially just $n$ bits of randomness, so many of the entries are highly correlated. So the theorem says that even such a correlated perturbation has enough mass in any large enough subspace, with high enough probability. A natural conjecture is that the probability bound can be improved to $1 - \exp(-\Omega(n))$, but it is beyond the reach of our methods.

3.1 Khatri-Rao Product of Two Matrices

We first show Theorem 3.9 for the case $\ell = 2$. This illustrates the main ideas underlying the general proof.

Proposition 3.12. Let $0 < \delta < 1$ and let $M$ be a $\delta n^2 \times n^2$ matrix with $\sigma_{\delta n^2}(M) \geq \tau$. Then for random $\rho$-perturbations $\widetilde{x}, \widetilde{y}$ of any two $x, y \in \mathbb{R}^n$, we have

$$\Pr\left[\|M(\widetilde{x} \otimes \widetilde{y})\| \geq \frac{\tau\rho}{n^{O(1)}}\right] \geq 1 - \exp\left(-\sqrt{\delta n}\right). \quad (5)$$

The high level outline is now the following. Let $\mathcal{U}$ denote the span of the top $\delta n^2$ singular vectors of $M$. We show that for $r = \Omega(\sqrt{n})$, there exist $n \times n$ matrices $M_1, M_2, \ldots, M_r$ whose columns satisfy certain orthogonality properties we define, and additionally $\mathrm{vec}(M_i) \in \mathcal{U}$ for all $i \in [r]$. We use the orthogonality properties to show that $(\widetilde{x} \otimes \widetilde{y})$ has a $\rho/\mathrm{poly}(n)$ dot-product with at least one of the $M_i$ with probability $\geq 1 - \exp(-r)$.

The $\theta$-orthogonality property. In order to motivate this, let us consider some matrix $M_i \in \mathbb{R}^{n \times n}$ and consider $M_i(x \otimes y)$. This is precisely $y^T M_i x$. Now suppose we have $r$ matrices $M_1, M_2, \ldots, M_r$, and we consider the sum $\sum_i (y^T M_i x)^2$. This is also equal to $\|Q(y)\, x\|^2$, where $Q(y)$ is an $r \times n$ matrix whose $(i, j)$th entry is $\langle y, (M_i)_j \rangle$ (here $(M_i)_j$ refers to the $j$th column of $M_i$). Now consider some matrices $M_i$, and suppose we knew that $Q(\widetilde{y})$ has $\Omega(r)$ singular values of magnitude $\geq 1/n^2$. Then a $\rho$-perturbed vector $\widetilde{x}$ has at least $\rho/n$ of its norm in the space spanned by the corresponding right singular vectors, with probability $\geq 1 - \exp(-r)$ (Fact 3.26). Thus we get

$$\Pr\left[\|Q(\widetilde{y})\,\widetilde{x}\| \geq \rho/n^3\right] \geq 1 - \exp(-r).$$
So the key is to prove that the matrix $Q(\widetilde{y})$ has a large number of "non-negligible" singular values with high probability (over the perturbation in $\widetilde{y}$). For this, let us examine the entries of $Q(\widetilde{y})$. For a moment, suppose that $\widetilde{y}$ is a Gaussian random vector $\sim N(0, \rho^2 I)$ (instead of a perturbation). Then the $(i, j)$th entry of $Q(\widetilde{y})$ is precisely $\langle \widetilde{y}, (M_i)_j \rangle$, which is distributed like a one dimensional Gaussian of variance $\rho^2 \|(M_i)_j\|^2$. If the entries for different $i, j$ were independent, standard results from random matrix theory would imply that $Q(\widetilde{y})$ has many non-negligible singular values. However, this could be far from the truth. Consider, for instance, two vectors $(M_i)_j$ and $(M_{i'})_{j'}$ that are parallel. Then their dot products with $\widetilde{y}$ are highly correlated. However, we note that as long as $(M_{i'})_{j'}$ has a reasonable component orthogonal to $(M_i)_j$, the distributions of the $(i, j)$ and $(i', j')$th entries are "somewhat" independent. We will prove that we can roughly achieve such a situation. This motivates the following definition.

Definition 3.13 (Ordered $\theta$-orthogonality). A sequence of vectors $v_1, v_2, \ldots, v_n$ has the ordered $\theta$-orthogonality property if for all $1 \leq i \leq n$, $v_i$ has a component of length $\geq \theta$ orthogonal to $\mathrm{span}\{v_1, v_2, \ldots, v_{i-1}\}$.

Now we define a similar notion for a sequence of matrices $M_1, M_2, \ldots, M_r$, which says that a large enough subset of columns should have a certain $\theta$-orthogonality property. More formally:

Definition 3.14 (Ordered $(\theta, \delta)$-orthogonal system). A set of $n \times m$ matrices $M_1, M_2, \ldots, M_r$ forms an ordered $(\theta, \delta)$-orthogonal system if there exists a permutation $\pi$ on $[m]$ such that the first $\delta m$ columns satisfy the following property: for $i \leq \delta m$ and every $j \in [r]$, the $\pi(i)$th column of $M_j$ has a projection of length $\geq \theta$ orthogonal to the span of all the vectors given by the columns $\pi(1), \pi(2), \ldots, \pi(i-1), \pi(i)$ of all the matrices $M_1, M_2, \ldots, M_r$ other than itself (i.e. the $\pi(i)$th column of $M_j$).

The following lemma shows the use of an ordered $(\theta, \delta)$-orthogonal system: a matrix $Q(\widetilde{y})$ constructed as above starting with these $M_i$ has many non-negligible singular values with high probability.

Lemma 3.15 (Ordered $\theta$-orthogonality and perturbed combinations). Let $M_1, M_2, \ldots, M_r$ be a set of $n \times m$ matrices of bounded norm ($\|\cdot\|_F \leq 1$) that are $(\theta, \delta)$ orthogonal for some parameters $\theta, \delta$, and suppose $r \leq \delta m$. Let $\widetilde{x}$ be a $\rho$-perturbation of $x \in \mathbb{R}^n$. Then the $r \times m$ matrix $Q(\widetilde{x})$, formed with the $j$th row of $Q(\widetilde{x})$ being $\widetilde{x}^T M_j$, satisfies

$$\Pr_{\widetilde{x}}\left[\sigma_{r/2}(Q(\widetilde{x})) \geq \frac{\rho\theta}{n^4}\right] \geq 1 - \exp(-r)$$

We defer the proof of this lemma to Section 3.3. Our focus will now be on constructing such a $(\theta, \delta)$ orthogonal system of matrices, given a subspace $\mathcal{V}$ of $\mathbb{R}^{n^2}$ of dimension $\Omega(n^2)$. The following lemma achieves this.

Lemma 3.16. Let $\mathcal{V}$ be a $\delta \cdot nm$ dimensional subspace of $\mathbb{R}^{nm}$, and suppose $r, \theta, \delta'$ satisfy $\delta' \leq \delta/2$, $r \cdot \delta' m < \delta n/2$ and $\theta = 1/(nm^{3/2})$. Then there exist $r$ matrices $M_1, M_2, \ldots, M_r$ of dimension $n \times m$ with the following properties:
1. $\mathrm{vec}(M_i) \in \mathcal{V}$ for all $i \in [r]$.
2. $M_1, M_2, \ldots, M_r$ form an ordered $(\theta, \delta')$ orthogonal system.
In particular, when $m \leq \sqrt{n}$, they form an ordered $(\theta, \delta/2)$ orthogonal system.

We remark that while $\delta$ is often a constant in our applications, $\delta'$ does not have to be. We will use this in the proof that follows, in which we use these two lemmas regarding the construction and use of an ordered $(\theta, \delta)$-orthogonal system to prove Proposition 3.12.

Proof of Proposition 3.12: The proof follows by combining Lemma 3.16 and Lemma 3.15 in a fairly straightforward way. Let $\mathcal{U}$ be the span of the top $\delta n^2$ singular vectors of $M$. Thus $\mathcal{U}$ is a $\delta n^2$ dimensional subspace of $\mathbb{R}^{n^2}$. There are three steps:

1. We use Lemma 3.16 with $m = n$, $\delta' = \frac{\delta}{n^{1/2}}$, $\theta = \frac{1}{n^{5/2}}$ to obtain $r = \frac{n^{1/2}}{2}$ matrices $M_1, M_2, \ldots, M_r \in \mathbb{R}^{n \times n}$ having the $(\theta, \delta')$-orthogonality property.

2. Now, applying Lemma 3.15, we have that the matrix $Q(\widetilde{x})$, defined as before (given by linear combinations along $\widetilde{x}$), has $\sigma_{r/2}(Q(\widetilde{x})) \geq \frac{\rho\theta}{n^4}$ w.p. $1 - \exp(-\sqrt{n})$.

3. Applying Fact 3.26 along with a simple averaging argument, we have that for one of the terms $M_i$, $|M_i(\widetilde{x} \otimes \widetilde{y})| \geq \rho\theta/n^6$ with probability $\geq 1 - \exp(-r/2)$, as required.

Please refer to Appendix B.2 for the complete details.

The proof for higher order tensors will proceed along similar lines. However, we require an additional pre-processing step and a careful inductive statement (Theorem 3.25), whose proof invokes Lemmas 3.16 and 3.15. The issues and details with higher order products are covered in Section 3.4. The following two sections are devoted to proving the two lemmas, i.e. Lemma 3.16 and Lemma 3.15. These will be key to the general case ($\ell > 2$) as well.

3.2 Constructing the $(\theta, \delta)$-Orthogonal System (Proof of Lemma 3.16)

Recall that $\mathcal{V}$ is a subspace of $\mathbb{R}^{n \cdot m}$ of dimension $\delta nm$ in Lemma 3.16. We will also treat a vector $M \in \mathcal{V}$ as a matrix of size $n \times m$, with its coordinates indexed by $[n] \times [m]$. We want to construct many matrices $M_1, M_2, \ldots, M_r \in \mathbb{R}^{n \times m}$ such that a reasonable fraction of the $m$ columns satisfy the $\theta$-orthogonality property. Intuitively, such columns would have $\Omega(n)$ independent directions in $\mathbb{R}^n$ as choices for the $r$ matrices $M_1, M_2, \ldots, M_r$. Hence, we need to identify columns $i \in [m]$ such that the projection of $\mathcal{V}$ onto these $n$ coordinates (in column $i$) spans a large dimension, in a robust sense. This notion is formalized by defining the robust dimension of column projections, as follows.

Definition 3.17 (Robust dimension of projections). For a subspace $\mathcal{V}$ of $\mathbb{R}^{n \cdot m}$, we define its robust dimension $\dim^{\tau}_{i}(\mathcal{V})$ to be

$$\dim^{\tau}_{i}(\mathcal{V}) = \max d \ \text{ s.t. } \exists \text{ orthonormal } v_1, v_2, \ldots, v_d \in \mathbb{R}^n \text{ and } M_1, M_2, \ldots, M_d \in \mathcal{V} \text{ with } \forall t \in [d],\ \|M_t\| \leq \tau \text{ and } v_t = M_t(i).$$

This definition ensures that we do not take into account those spurious directions in $\mathbb{R}^n$ that are covered to an insignificant extent by projecting (unit) vectors in $\mathcal{V}$ to the $i$th column. Now, we would like to use the large dimension of $\mathcal{V}$ ($\dim = \delta nm$) to conclude that there are many column projections having large robust dimension of around $\delta n$.

Lemma 3.18. For any subspace $\mathcal{V}$ of $\mathbb{R}^{p_1 \cdot p_2}$ of dimension $\dim(\mathcal{V})$, and any $\tau \geq \sqrt{p_2}$, we have

$$\sum_{i \in [p_2]} \dim^{\tau}_{i}(\mathcal{V}) \geq \dim(\mathcal{V}) \quad (6)$$

Remark 3.19.
This lemma will also be used in the first step of the proof of Theorem 3.6 to identify a good block of coordinates which span a large projection of a given subspace $\mathcal{V}$.

The above lemma is easy to prove if the dimension of the column projections used is the usual dimension of a vector space. However, with robust dimension, to carefully avoid spurious or insignificant directions, we identify the robust dimension with the number of large singular values of a certain matrix.

Proof: Let $d = \dim(\mathcal{V})$. Let $B$ be a $(p_1 p_2) \times d$ matrix, with the $d$ columns comprising an orthonormal basis for $\mathcal{V}$. Clearly $\sigma_d(B) = 1$. Now, we split the matrix $B$ into $p_2$ blocks of size $p_1 \times d$ each. For $i \in [p_2]$, let $B_i \in \mathbb{R}^{p_1 \times d}$ be the projection of $B$ onto the rows given by $[p_1] \times \{i\}$. Let $d_i$ be the largest $t$ such that $\sigma_t(B_i) \geq \frac{1}{\sqrt{p_2}}$. We will first show that $\sum_i d_i \geq d$. Then we will show that $\dim^{\tau}_{i}(\mathcal{V}) \geq d_i$ to complete the proof.

Suppose for contradiction that $\sum_{i \in [p_2]} d_i < d$. Let $S_i$ be the $(d - d_i)$-dimensional subspace of $\mathbb{R}^d$ spanned by the last $(d - d_i)$ right singular vectors of $B_i$. Hence, for unit vectors $\alpha \in S_i \subseteq \mathbb{R}^d$, $\|B_i \alpha\| < \frac{1}{\sqrt{p_2}}$. Since $d - \sum_{i \in [p_2]} d_i > 0$, there exists at least one unit vector $\alpha \in \bigcap_i S_i$. Picking this unit vector $\alpha \in \mathbb{R}^d$, we have $\|B\alpha\|^2 = \sum_{i \in [p_2]} \|B_i\alpha\|^2 < p_2 \cdot \frac{1}{p_2} = 1$, which contradicts $\sigma_d(B) \geq 1$.

To establish the second part, consider the $d_i$ top left singular vectors of the matrix $B_i$ (these are vectors in $\mathbb{R}^{p_1}$). These $d_i$ vectors can be expressed as small combinations ($\|\cdot\|_2 \leq \sqrt{p_2}$) of the columns of $B_i$ using Lemma B.1. The corresponding $d_i$ small combinations of the columns of the whole matrix $B$ give vectors in $\mathbb{R}^{p_1 p_2}$ which have length at most $\sqrt{p_2}$, as required (since the columns of $B$ are orthonormal).

We will construct the matrices $M_1, M_2, \ldots, M_r \in \mathbb{R}^{n \times m}$ in multiple stages. In each stage, we will focus on one column $i \in [m]$: we fix this column for all the matrices $M_1, M_2, \ldots, M_r$, so that this column satisfies the ordered $\theta$-orthogonality property w.r.t. previously chosen columns, and then leave this column unchanged in the rest of the stages. In each stage $t$ of this construction, we will be looking at subspaces of $\mathcal{V}$ which are obtained by zeroing out all the columns $J \subseteq [m]$ (i.e. all the coordinates $[n] \times J$) that we have fixed so far.

Definition 3.20 (Subspace projections). For $J \subseteq [m]$, let $\mathcal{V}^*_J \subseteq \mathbb{R}^{n \cdot (m - |J|)}$ represent the subspace obtained by projecting onto the coordinates $[n] \times ([m] - J)$ the subspace of $\mathcal{V}$ having zeros on all the coordinates $[n] \times J$:

$$\mathcal{V}^*_J = \left\{ M' \in \mathbb{R}^{n \cdot (m - |J|)} : \exists M \in \mathcal{V} \text{ s.t. columns } M(i) = M'(i) \text{ for } i \in [m] - J, \text{ and } 0 \text{ otherwise} \right\}.$$

The extension $\mathrm{Ext}^*_J(M')$ for $M' \in \mathcal{V}^*_J$ is the vector $M \in \mathcal{V}$ obtained by padding $M'$ with zeros in the coordinates $[n] \times J$ (columns given by $J$).

The following lemma shows that the dimension of these subspaces remains large as long as $|J|$ is not too large:

Lemma 3.21. For any $J \subseteq [m]$ and any subspace $\mathcal{V}$ of $\mathbb{R}^{n \cdot m}$ of dimension $\delta \cdot nm$, the subspace having zeros in the coordinates $[n] \times J$ has $\dim(\mathcal{V}^*_J) \geq n(\delta m - |J|)$.

Proof of Lemma 3.21: Consider a constraint matrix $C$ of size $(1 - \delta)nm \times nm$ which describes $\mathcal{V}$. $\mathcal{V}^*_J$ is described by the constraint matrix of size $(1 - \delta)nm \times n(m - |J|)$ obtained by removing the columns of $C$ corresponding to $[n] \times J$. Hence we get a subspace of dimension at least $n(m - |J|) - (1 - \delta)nm = n(\delta m - |J|)$.
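The counting step in the proof of Lemma 3.18 above is easy to mimic numerically; the following small helper is my own illustration (it assumes the row-major convention $\mathrm{vec}(M)[a \cdot p_2 + i] = M[a, i]$) of splitting an orthonormal basis of $\mathcal{V}$ into column blocks and counting large singular values:

```python
import numpy as np

def robust_dims(B, p1, p2):
    """B: (p1*p2) x d orthonormal basis of V.  For each column block i,
    count singular values >= 1/sqrt(p2); the proof shows sum >= d."""
    d = B.shape[1]
    blocks = B.reshape(p1, p2, d)
    counts = []
    for i in range(p2):
        svals = np.linalg.svd(blocks[:, i, :], compute_uv=False)
        counts.append(int(np.sum(svals >= 1.0 / np.sqrt(p2))))
    return counts
```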
We now describe the construction more formally.

The iterative construction of ordered $\theta$-orthogonal matrices. Initially set $J_0 = \emptyset$ and $M_j = 0$ for all $j \in [r]$, $\tau = \sqrt{m}$ and $s = \delta m/2$. For $t = 1, \ldots, s$:

1. Pick $i \in [m] - J_{t-1}$ such that $\dim^{\tau}_{i}\left(\mathcal{V}^*_{J_{t-1}}\right) \geq \delta n/2$. If no such $i$ exists, report FAIL.

2. Choose $Z_1, Z_2, \ldots, Z_r \in \mathcal{V}^*_{J_{t-1}}$ of length at most $\sqrt{mn}$ such that the $i$th columns $Z_1(i), Z_2(i), \ldots, Z_r(i) \in \mathbb{R}^n$ are orthonormal, and also orthogonal to the columns $\{M_j(i')\}_{i' \in J_{t-1}, j \in [r]}$. If this is not possible, report FAIL.

3. Set, for all $j \in [r]$, the new $M_j \leftarrow M_j + \mathrm{Ext}^*_J(Z_j)$, where $\mathrm{Ext}^*_J(Z_j)$ is the matrix padded with zeros in the columns corresponding to $J$. Set $J_t \leftarrow J_{t-1} \cup \{i\}$.

Let $J = J_s$ for convenience. We first show that the above process for constructing $M_1, M_2, \ldots, M_r$ completes successfully without reporting FAIL.

Claim 3.22. For $r, s$ such that $s \leq \delta m/2$ and $r \cdot s \leq \delta n/3$, the above process does not FAIL.

Proof: In each stage, we add one column index to $J$. Hence, $|J_t| \leq s$ at all times $t \in [s]$. We first show that Step 1 of each iteration does not FAIL. From Lemma 3.21, we have $\dim\left(\mathcal{V}^*_{J_t}\right) \geq \delta nm/2$. Let $\mathcal{W} = \mathcal{V}^*_{J_t}$. Now, applying Lemma 3.18 to $\mathcal{W}$, we see that there exists $i \in [m] - J_t$ such that $\dim^{\tau}_{i}(\mathcal{W}) \geq \delta n/2$, as required. Hence, Step 1 does not fail.

$\dim^{\tau}_{i}(\mathcal{W}) \geq \delta n/2$ shows that there exist $Z'_1, Z'_2, \ldots, Z'_{\delta n/2}$ with lengths at most $\sqrt{m}$ such that their $i$th columns $\{Z'_t(i)\}_{t \leq \delta n/2}$ are orthonormal. However, we additionally need to impose that the $i$th columns also be orthogonal to the columns $\{M_j(i')\}_{j \in [r], i' \in J_{t-1}}$. Fortunately, the number of such orthogonality constraints is at most $r|J_{t-1}| \leq \delta n/3$. Hence, we can pick the $r < \delta n/6$ orthonormal $i$th columns $\{Z_j(i)\}_{j \in [r]}$ and their respective extensions $Z_j$, by taking linear combinations of the $Z'_t$. Since the linear combinations result again in unit vectors in the $i$th column, the length of each $Z_j$ is at most $\sqrt{mn}$, as required. Hence, Step 2 does not FAIL as well.

Completing the proof of Lemma 3.16. We now show that since the process completes, $M_1, M_2, \ldots, M_r$ have the required ordered $(\theta, \delta')$-orthogonality property for $\delta' = s/m$. We first check that $M_1, M_2, \ldots, M_r$ belong to $\mathcal{V}$. This is true because in each stage $\mathrm{Ext}^*_J(Z_j) \in \mathcal{V}$, and hence $M_j \in \mathcal{V}$ for $j \in [r]$. Further, since we run for $s$ stages, and each of the $Z_j$ is bounded in length by $\sqrt{mn}$, we have $\|M_j\|_F \leq s\sqrt{mn} \leq \sqrt{nm^3}$. Our final matrices $M_j$ will be scaled to $\|\cdot\|_F = 1$.

The $s$ columns that satisfy the ordered $\theta$-orthogonality property are those of $J$, in the order they were chosen (we set this order to be $\pi$, and select an arbitrary order for the rest). Suppose the column $i_t \in [m]$ was chosen at stage $t$. The key invariant of the process is that once a column $i_t$ is chosen at stage $t$, the $i_t$th column remains unchanged for each $M_j$ in all subsequent stages ($t + 1$ onwards). By the construction, $Z_j(i_t) \in \mathbb{R}^n$ is orthogonal to $\{M_{j'}(i)\}_{i \in J_{t-1}, j' \in [r]}$. Since $Z_j(i_t)$ has unit length and $M_j$ is of bounded length, we have the ordered $\theta$-orthogonality property as required, for $\theta = 1/\sqrt{nm^3}$. This concludes the proof.
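For concreteness, here is a small checker (my own helper, directly transcribing Definition 3.14 rather than anything stated in the paper) that tests whether a given family of matrices is ordered $\theta$-orthogonal for a given column order:

```python
import numpy as np

def is_ordered_orthogonal(mats, order, theta):
    """mats: list of n x m arrays; order: the first delta*m column indices pi(1..s).
    Checks that column pi(i) of each M_j has a component of length >= theta
    orthogonal to all previously revealed columns and to the pi(i)-th columns
    of the other matrices."""
    prev = []                                            # columns pi(1..i-1), all matrices
    for i in order:
        for j, M in enumerate(mats):
            others = [Mk[:, i] for k, Mk in enumerate(mats) if k != j]
            cols = prev + others
            col = M[:, i]
            if cols:
                Q, _ = np.linalg.qr(np.column_stack(cols))
                col = col - Q @ (Q.T @ col)              # part orthogonal to the span
            if np.linalg.norm(col) < theta:
                return False
        prev += [M[:, i] for M in mats]
    return True
```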
3.3 $(\theta, \delta)$-Orthogonality and $\rho$-Perturbed Combinations (Proof of Lemma 3.15)

Suppose $M_1, M_2, \ldots, M_r$ is a $(\theta, \delta)$-orthogonal set of matrices (of dimensions $n \times m$). Without loss of generality, suppose that the permutation $\pi$ in the definition of orthogonality is the identity, and let $I$ be the first $\delta m$ columns. Now let us consider a $\rho$-perturbed vector $\widetilde{x}$, and consider the matrix $Q(\widetilde{x})$ defined in the statement: it has dimensions $r \times m$, and its $(i, j)$th entry is $\langle \widetilde{x}, (M_i)_j \rangle$, which is distributed as a translated Gaussian. Now for any column $i \in I$, every entry of the $i$th column of $Q(\widetilde{x})$ has a $(\rho \cdot \theta)$ "component" independent of the entries in the previous columns, and the entries above it. This implies that for a unit Gaussian vector $g$, we have (by anti-concentration and $\theta$-orthogonality) that

$$\Pr\left[(g^T Q(\widetilde{x})_i)^2 < \theta^2/4n\right] < \frac{1}{2n}. \quad (7)$$

Furthermore, the above inequality holds even conditioned on the first $(i - 1)$ columns of $Q(\widetilde{x})$.

Lemma 3.23. Let $Q(\widetilde{x})$ be defined as above, and fix some $i \in I$. Then for $g \sim N(0, 1)^r$, we have

$$\Pr\left[(g^T Q(\widetilde{x})_i)^2 < \frac{\theta^2 \rho^2}{4n^2} \ \Big|\ Q(\widetilde{x})_1, \ldots, Q(\widetilde{x})_{i-1}\right] < \frac{1}{2n},$$

for any given $Q(\widetilde{x})_1, Q(\widetilde{x})_2, \ldots, Q(\widetilde{x})_{i-1}$.

Proof: Let $g = (g_1, g_2, \ldots, g_r)$. Then we have

$$g^T Q(\widetilde{x})_i = g_1 (\widetilde{x}^T (M_1)_i) + g_2 (\widetilde{x}^T (M_2)_i) + \cdots + g_r (\widetilde{x}^T (M_r)_i) = \left\langle \widetilde{x},\ g_1 (M_1)_i + g_2 (M_2)_i + \ldots + g_r (M_r)_i \right\rangle$$

Let us denote the latter vector by $v_i$ for now, so we are interested in $\langle \widetilde{x}, v_i \rangle$. We show that $v_i$ has a non-negligible component orthogonal to the span of $v_1, v_2, \ldots, v_{i-1}$. Let $\Pi$ be the matrix which projects orthogonal to the span of $(M_s)_{i'}$ for all $i' < i$. Thus any vector $\Pi u$ is also orthogonal to the span of the $v_{i'}$ for $i' < i$. Now by hypothesis, every vector $\Pi(M_s)_i$ has length $\geq \theta$. Thus the vector $\Pi\left(\sum_s g_s (M_s)_i\right) = \Pi v_i$ has length $\geq \theta/2$ with probability $\geq 1 - \exp(-r)$ (Lemma B.2). Thus, if we consider the distribution of $\langle \widetilde{x}, v_i \rangle = \langle x, v_i \rangle + \langle e, v_i \rangle$, it is a one-dimensional Gaussian with mean $\langle x, v_i \rangle$ and variance $\rho^2$. From basic anti-concentration properties of a Gaussian (that the mass in any interval of length $\rho \cdot (\text{variance})^{1/2}$ is at most $\rho$), the conclusion follows.

We can now do this for all $i \in I$, and conclude that the probability that the event in Eq. (7) occurs for all $i \in I$ is at most $(1/2n)^{|I|}$. Now what does this imply about the singular values of $Q(\widetilde{x})$? Suppose it has fewer than $r/2$ (which is $< |I|$) non-negligible singular values; then a Gaussian random vector $g$, with probability at least $n^{-r}$, has a negligible component along all the corresponding singular vectors, and thus the length of $g^T Q(\widetilde{x})$ is negligible with at least this probability!

Lemma 3.24. Let $M$ be a $t \times t$ matrix with spectral norm $\leq 1$. Suppose $M$ has at most $r$ singular values of magnitude $> \tau$. Then for $g \sim N(0, 1)^t$, we have

$$\Pr\left[\|Mg\|_2^2 < 4t\tau^2 + \frac{t}{n^{2c}}\right] \geq \frac{1}{n^{cr}} - \frac{1}{2^t}.$$

Proof: Let $u_1, u_2, \ldots, u_r$ be the singular vectors corresponding to singular values $> \tau$. Consider the event that $g$ has a projection of length $< 1/n^c$ onto each of $u_1, u_2, \ldots, u_r$. This has probability $\geq \frac{1}{n^{cr}}$, by anti-concentration properties of the Gaussian (and because $N(0, 1)^t$ is rotationally invariant).
For any such $g$, we have

$$\|Mg\|_2^2 \leq \sum_{i=1}^{r} \langle g, u_i \rangle^2 + \tau^2 \|g\|^2 \leq \frac{r}{n^{2c}} + \tau^2 \|g\|_2^2.$$

This contradicts the earlier anti-concentration bound, and so we conclude that the matrix has at least $r/2$ non-negligible singular values, as required.

3.4 Higher Order Products

We have a subspace $\mathcal{V} \subseteq \mathbb{R}^{n^{\ell}}$ of dimension $\delta n^{\ell}$. The proof for higher order products proceeds by induction on the order $\ell$ of the product. Recall from Remark 3.8 that Proposition 3.12 and Theorem 3.3 do not give good guarantees for small values of $\delta$, like $1/n$. In fact, we cannot hope to get such exponentially small failure probability in that case, since all the $n$ degrees of freedom in $\mathcal{V}$ may be constrained to the first $n$ coordinates of $\mathbb{R}^{n^2}$ (all the independence is in just one mode). Here, it is easy to see that the best we can hope for is an inverse-polynomial failure probability. Hence, to get exponentially small failure probability, we will always need $\mathcal{V}$ to have a large dimension compared to the dimension of the host space in our inductive statements.

To carry out the induction, we will try to reduce this to a statement about $(\ell - 1)$ order products, by taking linear combinations (given by $\widetilde{x}^{(1)} \in \mathbb{R}^n$) along one of the modes. Loosely speaking, Lemma 3.15 serves this function of "order reduction"; however, it needs a set of $r$ matrices in $\mathbb{R}^{n \times m}$ (flattened along all the other modes) which are ordered $(\theta, \delta)$ orthogonal.

Let us consider the case $\ell = 3$, to illustrate some of the issues that arise. We can use Lemma 3.16 to come up with $r$ matrices in $\mathbb{R}^{n \times n^2}$ that are ordered $(\theta, \delta)$ orthogonal. These columns intuitively correspond to independent directions or degrees of freedom, that we can hope to get substantial projections on. However, since these are vectors in $\mathbb{R}^n$, the number of "flattened columns" cannot be comparable to $n^2$ (in fact, $\delta m \ll n$); hence our induction hypothesis for $\ell = 2$ will give no guarantees (due to Remark 3.8). To handle this issue, we will first restrict our attention to a smaller block of coordinates of size $n_1 \times n_2 \times n_3$ (with $n_1 n_2 n_3 \ll n$) that has reasonable size in all the three modes ($n_1, n_2, n_3 = n^{\Omega(1)}$). Additionally, we want $\mathcal{V}$'s projection onto this $n_1 \times n_2 \times n_3$ block to span a large subspace of (robust) dimension at least $\delta n_1 n_2 n_3$ (using Lemma 3.18).

Moreover, choosing the main inductive statement also needs to be done carefully. We need some property for choosing enough candidate "independent" directions $T_1, T_2, \ldots, T_r \in \mathbb{R}^{n^{\ell}}$ (projected on the chosen block), such that our process of "order reduction" (by first finding a $\theta$-orthogonal system and then combining along $\widetilde{x}^{(1)}$) maintains this property for order $\ell - 1$. This is where the alternate interpretation in Theorem 3.9 in terms of singular values helps: it suggests the exact property that we need! We ensure that the matrix formed by the flattened vectors $\mathrm{vec}(T_1), \mathrm{vec}(T_2), \ldots, \mathrm{vec}(T_r)$ (projected onto the $n_1 \times n_2 \times n_3$ block), as rows, has many large singular values.

We now state the main inductive claim. The claim assumes a block of coordinates of reasonable size in each mode that spans many directions in $\mathcal{V}$, and then establishes the anti-concentration bound inductively.

Theorem 3.25 (Main Inductive Claim).
3.4 Higher Order Products

We have a subspace $V \subseteq \mathbb{R}^{n^\ell}$ of dimension $\delta n^\ell$. The proof for higher order products proceeds by induction on the order $\ell$ of the product. Recall from Remark 3.8 that Proposition 3.12 and Theorem 3.3 do not give good guarantees for small values of $\delta$, like $1/n$. In fact, we cannot hope to get such exponentially small failure probability in that case, since all the $n$ degrees of freedom in $V$ may be constrained to the first $n$ co-ordinates of $\mathbb{R}^{n^2}$ (all the independence is in just one mode); there it is easy to see that the best we can hope for is an inverse-polynomial failure probability. Hence, to get exponentially small failure probability, we will always need $V$ to have large dimension compared to the dimension of the host space in our inductive statements.

To carry out the induction, we will try to reduce this to a statement about order-$(\ell-1)$ products, by taking linear combinations (given by $\tilde{x}^{(1)} \in \mathbb{R}^n$) along one of the modes. Loosely speaking, Lemma 3.15 serves this function of "order reduction"; however, it needs a set of $r$ matrices in $\mathbb{R}^{n \times m}$ (flattened along all the other modes) which are ordered $(\theta, \delta)$ orthogonal.

Let us consider the case $\ell = 3$, to illustrate some of the issues that arise. We can use Lemma 3.16 to come up with $r$ matrices in $\mathbb{R}^{n \times n^2}$ that are ordered $(\theta, \delta)$ orthogonal. The columns intuitively correspond to independent directions or degrees of freedom on which we can hope to get substantial projections. However, since these columns are vectors in $\mathbb{R}^n$, the number of "flattened columns" cannot be comparable to $n^2$ (in fact, $\delta m \ll n$) — hence our induction hypothesis for $\ell = 2$ gives no guarantees (due to Remark 3.8). To handle this issue, we first restrict our attention to a smaller block of co-ordinates of size $n_1 \times n_2 \times n_3$ (with $n_1 n_2 n_3 \ll n$) which has reasonable size in all three modes ($n_1, n_2, n_3 = n^{\Omega(1)}$). Additionally, we want $V$'s projection onto this $n_1 \times n_2 \times n_3$ block to span a large subspace, of (robust) dimension at least $\delta n_1 n_2 n_3$ (using Lemma 3.18).

Moreover, the main inductive statement also needs to be chosen carefully. We need some property for choosing enough candidate "independent" directions $T_1, T_2, \ldots, T_r \in \mathbb{R}^{n^\ell}$ (projected onto the chosen block), such that our process of "order reduction" (first finding a $\theta$-orthogonal system and then combining along $\tilde{x}^{(1)}$) maintains this property for order $\ell - 1$. This is where the alternate interpretation of Theorem 3.9 in terms of singular values helps: it suggests the exact property that we need! We ensure that the flattened vectors $\mathrm{vec}(T_1), \mathrm{vec}(T_2), \ldots, \mathrm{vec}(T_r)$ (projected onto the $n_1 \times n_2 \times n_3$ block), as rows, form a matrix with many large singular values.

We now state the main inductive claim. The claim assumes a block of co-ordinates of reasonable size in each mode that spans many directions in $V$, and then establishes the anti-concentration bound inductively.

Theorem 3.25 (Main Inductive Claim). Let $T_1, T_2, \ldots, T_r \in \mathbb{R}^{n^\ell}$ be $r$ tensors with bounded norm ($\|\cdot\|_F \le 1$) and let $I_1, I_2, \ldots, I_\ell \subseteq [n]$ be sets of indices of sizes $n_1, n_2, \ldots, n_\ell$. Let $T$ be the $r \times n^\ell$ matrix with rows $\mathrm{vec}(T_1), \mathrm{vec}(T_2), \ldots, \mathrm{vec}(T_r)$. Suppose
- for every $j \in [r]$, $P_j$ is $T_j$ restricted to the block $I_1 \times \cdots \times I_\ell$, and the matrix $P \in \mathbb{R}^{r \times (n_1 n_2 \cdots n_\ell)}$ has $j$th row $\mathrm{vec}(P_j)$;
- $r \ge \delta_\ell\, n_1 n_2 \cdots n_\ell$, and for every $t \in [\ell-1]$, $n_t \ge (n_{t+1} n_{t+2} \cdots n_\ell)^2$;
- $\sigma_r(P) \ge \eta$.

Then for random $\rho$-perturbations $\tilde{x}^{(1)}, \tilde{x}^{(2)}, \ldots, \tilde{x}^{(\ell)}$ of any $x^{(1)}, x^{(2)}, \ldots, x^{(\ell)} \in \mathbb{R}^n$, we have
$$\Pr_{\tilde{x}^{(1)}, \ldots, \tilde{x}^{(\ell)}}\left[ \big\| T\big(\tilde{x}^{(1)} \otimes \cdots \otimes \tilde{x}^{(\ell)}\big) \big\| \ge \frac{\rho^\ell \eta}{n_1^{3\ell}} \right] \;\ge\; 1 - \exp(-\delta_\ell n_\ell).$$

Before we give a proof of the main inductive claim, we first present a standard fact that relates the singular values of a matrix to anti-concentration properties of randomly perturbed vectors. This will also establish the base case of our main inductive claim.

Fact 3.26. Let $M$ be a matrix of size $m \times n$ with $\sigma_r(M) \ge \eta$. Then for any unit vector $u \in \mathbb{R}^n$ and a random $\rho$-perturbation $\tilde{x}$ of it, we have $\|M\tilde{x}\|_2 \ge \eta\rho/n^2$ with probability $1 - n^{-\Omega(r)}$.

Proof of Theorem 3.25: The proof proceeds by induction. The base case ($\ell = 1$) is handled by Fact 3.26. Let us assume the theorem for $(\ell-1)$-wise products. The inductive proof has two main steps:

1. Suppose we flatten the tensors $\{P_j\}_{j \in [r]}$ along all but the first mode, and view them as matrices of size $n_1 \times (n_2 n_3 \cdots n_\ell)$. We can use Lemma 3.16 to construct an ordered $(\theta, \delta')$ orthogonal system with respect to vectors in $\mathbb{R}^{n_1}$ (the columns correspond to $[m] = [n_2 \cdots n_\ell]$).

2. When we take combinations along $\tilde{x}^{(1)}$ as $T(\tilde{x}^{(1)}, \cdot, \cdot, \ldots, \cdot)$, these tensors satisfy the condition required for $(\ell-1)$-order products in the inductive hypothesis, because of Lemma 3.15.

Unrolling this induction allows us to take combinations along $\tilde{x}^{(1)}, \tilde{x}^{(2)}, \ldots$ as required, until we are left with the base case.

For notational convenience, let $y = \tilde{x}^{(1)}$, $\delta_\ell = \delta$, $r_\ell = r$ and $N = n_1 n_2 \cdots n_\ell$. To carry out the first step, we think of $\{P_j\}_{j \in [r]}$ as matrices of size $n_1 \times (n_2 n_3 \cdots n_\ell)$. We then apply Lemma 3.16 with $n = n_1$ and $m = N/n_1 = n_2 n_3 \cdots n_\ell \le \sqrt{n_1}$; hence there exist $r'_\ell = n_2 \cdots n_\ell$ matrices $\{Q_q\}_{q \in [r'_\ell]}$ with $\|\cdot\|_F \le 1$ which are ordered $(\theta, \delta'_\ell)$-orthogonal, for $\delta'_\ell = \delta_\ell/3$. Further, since the $Q_q$ are in the row-span of $P$, there exists a matrix of coefficients $\alpha = (\alpha(q,j))_{q \in [r'_\ell], j \in [r_\ell]}$ such that
$$\forall q \in [r'_\ell], \quad Q_q = \sum_{j \in [r_\ell]} \alpha(q,j)\, P_j, \tag{8}$$
$$\|\alpha(q)\|_2^2 = \sum_{j \in [r_\ell]} \alpha(q,j)^2 \le 1/\eta^2 \quad \text{(since } \sigma_r(P) \ge \eta \text{ and } \|Q_q\|_F \le 1\text{)}. \tag{9}$$
Further, $Q_q$ is the projection of $\sum_{j \in [r_\ell]} \alpha_{q,j} T_j$ onto the co-ordinates $I_1 \times I_2 \times \cdots \times I_\ell$. Suppose we define a new set of matrices $\{W_q\}_{q \in [r'_\ell]}$ in $\mathbb{R}^{n \times (N/n_1)}$ by flattening the following into a matrix with $n$ rows:
$$W_q = \sum_{j \in [r]} \alpha_{q,j}\, T_j\Big|_{[n] \times (I_2 \times \cdots \times I_\ell)}.$$
In other words, $Q_q$ is obtained by projecting $W_q$ onto the $n_1$ rows given by $I_1$. Note that $\{W_q\}_{q \in [r'_\ell]}$ is also ordered $(\theta'_\ell, \delta'_\ell)$ orthogonal, for $\theta'_\ell = \theta\eta$.
To carry out the second part, we apply Lemma 3.15 with $\{W_q\}$ and infer that the $r'_\ell \times (N/n_1)$ matrix $W(y)$, whose $q$th row is $y^T W_q$, has $\sigma_{r_{\ell-1}}(W(y)) \ge \eta'_\ell = \theta^2\rho^2/n_1^4$ with probability $1 - \exp(-\Omega(r'_\ell))$, where $r_{\ell-1} = r'_\ell/2$. We would like to apply the inductive hypothesis for $(\ell-1)$ with $P$ being $W(y)$; however, $W(y)$ does not have full (robust) row rank. Hence we consider the top $r_{\ell-1}$ right singular vectors of $W(y)$ to construct $r_{\ell-1}$ tensors of order $\ell$, whose projections onto the block $I_2 \times \cdots \times I_\ell$ lead to a well-conditioned $r_{\ell-1} \times (n_2 n_3 \cdots n_\ell)$ matrix for which our inductive hypothesis holds.

Let the top $r_{\ell-1}$ right singular vectors of $W(y)$ be $Z_1, Z_2, \ldots, Z_{r_{\ell-1}}$. From Lemma B.1, we have a coefficient matrix $\beta$ of size $r_{\ell-1} \times r'_\ell$ such that
$$\forall j' \in [r_{\ell-1}], \quad Z_{j'} = \sum_{q \in [r'_\ell]} \beta_{j',q}\, W_q(y) \quad \text{and} \quad \|\beta(j')\|_2 \le 1/\eta'_\ell.$$
Now let us represent these new vectors in terms of the original row-vectors of $P$, to construct the required tensors of order $(\ell-1)$. Consider the $r_{\ell-1} \times r_\ell$ matrix $\Lambda = \beta\alpha$. Clearly,
$$\mathrm{rownorm}(\Lambda) \le \mathrm{rownorm}(\beta)\cdot\|\alpha\|_F \le \sqrt{r'_\ell}\cdot\mathrm{rownorm}(\beta)\cdot\mathrm{rownorm}(\alpha) \le \frac{\sqrt{r'_\ell}}{\eta_\ell\, \eta'_\ell}.$$
Define, for every $j' \in [r_{\ell-1}]$, an order-$\ell$ tensor $T'_{j'} = \sum_{j \in [r]} \lambda_{j',j} T_j$; from the previous equation, $\|T'_{j'}\|_F \le r'_\ell/(\eta_\ell \eta'_\ell)$. We need a normalized order-$(\ell-1)$ tensor: so we consider $\hat{T}_{j'} = T'_{j'}/\|T'_{j'}(y)\|_F$, and let $\hat{T}$ be the $r_{\ell-1} \times n^\ell$ matrix with $j'$th row $\hat{T}_{j'}$. Hence,
$$\sigma_{r_{\ell-1}}\big(\hat{T}(y, \cdot, \cdot, \ldots, \cdot)\big) \ge \frac{\eta_\ell^3}{r'_\ell\, n_1^3}.$$
We also have $r_{\ell-1} \ge \tfrac{1}{2}\, n_2 n_3 \cdots n_\ell$. By the inductive hypothesis,
$$\big\| \hat{T}\big(y, \tilde{x}^{(2)}, \ldots, \tilde{x}^{(\ell)}\big) \big\| \ge \eta' \equiv \frac{\rho^{\ell-1}\eta_\ell^3}{n_1^4\, n_2^{3(\ell-1)}} \quad \text{w.p. } 1 - \exp(-\Omega(n_\ell)). \tag{10}$$
Hence, for at least one $j' \in [r_{\ell-1}]$, $\big| \hat{T}_{j'}\big(\tilde{x}^{(1)}, \tilde{x}^{(2)}, \ldots, \tilde{x}^{(\ell)}\big) \big| \ge \eta'/\sqrt{r_{\ell-1}}$. Finally, since $\hat{T}_{j'}$ is given by a small combination of the $\{T_j\}_{j \in [r]}$, we have from Cauchy-Schwarz
$$\big\| T\big(\tilde{x}^{(1)} \otimes \tilde{x}^{(2)} \otimes \cdots \otimes \tilde{x}^{(\ell)}\big) \big\| \ge \frac{\eta' \cdot \eta^3}{\sqrt{r_\ell^2\, n_1^4}}.$$
The required theorem now follows by showing the existence of an $n_1 \times n_2 \times \cdots \times n_\ell$ block that satisfies the theorem's conditions; this follows from Lemma 3.18.

Proof of Theorem 3.6: First we set $n_1, n_2, \ldots, n_\ell$ by the recurrence $n_t = 2(n_{t+1}\cdot n_{t+2} \cdots n_\ell)^2$ for all $t \in [\ell-1]$, with $n_1 = O(n)$. It is easy to see that this is possible for $n_\ell = n^{1/3^\ell}$. Now we partition the set of co-ordinates $[n]^\ell$ into blocks of size $n_1 \times n_2 \times \cdots \times n_\ell$. Let $p_1 = n_1 \cdot n_2 \cdots n_\ell$ and $p_2 = n^\ell/p_1$. Applying Lemma 3.18, we see that there exist indices $I_1, I_2, \ldots, I_\ell$ of sizes $n_1, n_2, \ldots, n_\ell$ respectively such that the projection $W = V|_{I_1 \times I_2 \times \cdots \times I_\ell}$ onto this block of co-ordinates has dimension $\dim^\tau_I(W) \ge n_1 n_2 \cdots n_\ell/4$. Let $r = n_1 n_2 \cdots n_\ell$. Now we construct $P'$ with the rows of $P'$ being an orthonormal basis for $W$, and let $T'$ be the corresponding vectors in $V$. Note that for every $j \in [r]$, $\|T'_j\| \le n^\ell$. Let $P$ be the re-scaling of the matrix so that for the $j$th row ($j \in [r]$), $P_j = P'_j/\|T'_j\|$ and $T_j = T'_j/\|T'_j\|$. Hence $\sigma_r(P) \ge 1/n^\ell$. Applying Theorem 3.25 with this choice of $P$, $T$, we get the required result. $\square$
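Before turning to applications, the qualitative content of this section — tensor powers of perturbed vectors are robustly linearly independent — is easy to observe numerically. The sketch below is ours, not from the paper; it assumes NumPy, and the dimensions and perturbation size $\rho$ are arbitrary illustrative choices.

```python
# A small numerical sketch (ours, not from the paper): with n = 8 and order 2,
# we take R = 30 > n base vectors that all lie in a 3-dimensional subspace, so
# the columns x_i (x) x_i span at most a 9-dimensional space and the unperturbed
# Khatri-Rao matrix is singular.  After an independent rho-perturbation of each
# x_i, the smallest singular value is bounded away from zero (inverse polynomial
# in n and 1/rho), in line with the robust-independence phenomenon above.
import numpy as np

rng = np.random.default_rng(1)
n, R, rho = 8, 30, 0.01

basis = rng.standard_normal((n, 3))          # a 3-dimensional subspace of R^n
X = basis @ rng.standard_normal((3, R))      # degenerate base vectors
X /= np.linalg.norm(X, axis=0)
X_pert = X + rho * rng.standard_normal((n, R))

def khatri_rao_squares(V):
    """Matrix whose i-th column is vec(v_i (x) v_i); shape (n^2, R)."""
    return np.stack([np.outer(v, v).ravel() for v in V.T], axis=1)

for name, V in [("unperturbed", X), ("rho-perturbed", X_pert)]:
    smin = np.linalg.svd(khatri_rao_squares(V), compute_uv=False)[-1]
    print(f"{name:>14}: sigma_min of the n^2 x R Khatri-Rao matrix = {smin:.2e}")
```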
4 Learning Multi-view Mixture Models

We now show how Theorem 1.5 immediately gives efficient learning algorithms for a broad class of discrete mixture models called multi-view models, in the over-complete setting. In a multi-view mixture model, for each sample we are given a few different observations or views $x^{(1)}, x^{(2)}, \ldots, x^{(\ell)}$ that are conditionally independent given the component $i \in [R]$ the sample comes from. Typically, the $R$ components in the mixture are discrete distributions. Multi-view models are very expressive, and capture many well-studied models like topic models [2], hidden Markov models (HMMs) [29, 1, 2], and random graph mixtures [1]. They are also sometimes referred to as finite mixtures of finite measure products [1] or mixture-learning with multiple snapshots [30].

In this section, we will assume that each of the components in the mixture is a discrete distribution with support of size $n$. We first introduce some notation, along the lines of [2].

Parameters and the model: Let the $\ell$-view mixture model be parameterized by a set of $\ell$ vectors in $\mathbb{R}^n$ for each mixture component, $\{\mu_i^{(1)}, \mu_i^{(2)}, \ldots, \mu_i^{(\ell)}\}_{i \in [R]}$, and mixing weights $\{w_i\}_{i \in [R]}$ that add up to 1. Each of these parameter vectors is normalized: in this work, we will assume that $\|\mu_i^{(j)}\|_1 = 1$ for all $i \in [R]$, $j \in [\ell]$. Finally, for notational convenience we think of the parameters as being represented by $n \times R$ matrices (one per view) $M^{(1)}, M^{(2)}, \ldots, M^{(\ell)}$, with $M^{(j)}$ formed by concatenating the vectors $\mu_i^{(j)}$ ($1 \le i \le R$).

Samples from the multi-view model with $\ell$ views are generated as follows:
1. The mixture component $i$ ($i \in [R]$) is first picked with probability $w_i$.
2. The views $x^{(1)}, \ldots, x^{(j)}, \ldots, x^{(\ell)}$ are indicator vectors in $n$ dimensions, drawn according to the distributions $\mu_i^{(1)}, \ldots, \mu_i^{(j)}, \ldots, \mu_i^{(\ell)}$ respectively.

The state-of-the-art algorithms for learning multi-view mixture models have guarantees that mirror those for mixtures of Gaussians. In the worst case, the best known algorithm for this problem is from a recent work of Rabani et al. [30], who give an algorithm with complexity $R^{O(R^2)} + \mathrm{poly}(n, R)$. In fact, they also show a sample complexity lower bound of $\exp(\tilde{\Omega}(R))$ for learning multi-view models in one dimension ($n = 1$). Polynomial time algorithms were given by Anandkumar et al. [2] in a restricted setting called the non-singular or non-degenerate setting: when each of the matrices $M^{(j)}$, $j \in [\ell]$, has rank $R$ in a robust sense, i.e. $\sigma_R(M^{(j)}) \ge 1/\tau$ for all $j \in [\ell]$, their algorithm runs in just $\mathrm{poly}(R, n, \tau, 1/\varepsilon)$ time to learn the parameters up to error $\varepsilon$. However, their algorithm fails even when $R = n + 1$.

In many practical settings like speech recognition and image classification, the dimension of the feature space is typically much smaller than the number of components or clusters, i.e. $n \ll R$. To the best of our knowledge, there was no efficient algorithm for learning multi-view mixture models in such over-complete settings. We now show how Theorem 1.5 gives a polynomial time algorithm to learn multi-view mixture models in a smoothed sense, even in the over-complete setting $R \gg n$.
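To make the generative process above concrete, here is a minimal sampling sketch (ours, not from the paper). It assumes NumPy; the values of $n$, $R$, and $\ell$ are arbitrary illustrative choices, and the Dirichlet draws are simply a convenient way to produce random component distributions.

```python
# A minimal sketch (ours) of the multi-view generative process: pick a component
# i with probability w_i, then draw the l views independently as one-hot
# indicator vectors from the component's distributions mu_i^(1), ..., mu_i^(l).
import numpy as np

rng = np.random.default_rng(0)
n, R, ell = 10, 25, 3                    # over-complete: R > n

w = rng.dirichlet(np.ones(R))            # mixing weights, summing to 1
# mu[j][:, i] is the distribution of view j under component i (a column of M^(j)).
mu = [rng.dirichlet(np.ones(n), size=R).T for _ in range(ell)]

def sample(N):
    """Return N samples; each sample is a list of l one-hot views in R^n."""
    comps = rng.choice(R, size=N, p=w)
    samples = []
    for i in comps:
        views = []
        for j in range(ell):
            coord = rng.choice(n, p=mu[j][:, i])
            e = np.zeros(n)
            e[coord] = 1.0
            views.append(e)
        samples.append(views)
    return samples

samples = sample(5)
print(len(samples), [int(v.argmax()) for v in samples[0]])
```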
Theorem 4.1. Let $(w_i, \mu_i^{(1)}, \ldots, \mu_i^{(\ell)})$ be a mixture of $R = O(n^{\ell/2 - 1})$ multi-view models with $\ell$ views, and suppose the means $\{\mu_i^{(j)}\}_{i \in [R], j \in [\ell]}$ are perturbed independently by Gaussian noise of magnitude $\rho$. Then there is a polynomial time algorithm that learns the weights $w_i$ and the perturbed parameter vectors $\{\tilde{\mu}_i^{(j)}\}_{j \in [\ell], i \in [R]}$ up to an accuracy $\varepsilon$ when given samples from this distribution. The running time and sample complexity is $\mathrm{poly}_\ell(n, 1/\rho, 1/\varepsilon)$.

The conditional independence property is very useful for obtaining a higher order tensor in terms of the hidden parameter vectors that we need to recover. This allows us to use our results on tensor decompositions from the previous sections.

Lemma 4.2 ([1]). In the notation established above for multi-view models, for every $\ell \in \mathbb{N}$ the $\ell$th moment tensor is
$$\mathrm{Mom}_\ell = \mathbb{E}\big[ x^{(1)} \otimes \cdots \otimes x^{(j)} \otimes \cdots \otimes x^{(\ell)} \big] = \sum_{r \in [R]} w_r\, \mu_r^{(1)} \otimes \mu_r^{(2)} \otimes \cdots \otimes \mu_r^{(j)} \otimes \cdots \otimes \mu_r^{(\ell)}. \tag{11}$$

Our algorithm to learn multi-view models consists of three steps:
1. Obtain a good empirical estimate $\hat{T}$ of the order-$\ell$ tensor $\mathrm{Mom}_\ell$ from $N = \mathrm{poly}_\ell(n, R, 1/\rho, 1/\varepsilon)$ samples (given by Lemma C.3):
$$\hat{T} = \frac{1}{N} \sum_{t=1}^{N} x_t^{(1)} \otimes x_t^{(2)} \otimes \cdots \otimes x_t^{(\ell)}.$$
2. Apply Theorem 1.5 to $\hat{T}$ and recover the parameters $\hat{\mu}_i^{(j)}$ up to scaling.
3. Normalize the parameter vectors $\hat{\mu}_i^{(j)}$ to have $\ell_1$ norm 1, and hence determine the weights $\hat{w}_i$ for $i \in [R]$ (a schematic sketch of this pipeline appears after the proof below).

Proof of Theorem 4.1: The proof follows from a direct application of Theorem 1.5, so we just sketch the details. We first obtain a good empirical estimate of $\mathrm{Mom}_\ell$, given in equation (11), using Lemma C.3. Applying Theorem 1.5 to $\hat{T}$, we recover each rank-1 term in the decomposition $w_i\, \mu_i^{(1)} \otimes \mu_i^{(2)} \otimes \cdots \otimes \mu_i^{(\ell)}$ up to error $\varepsilon$ in Frobenius norm ($\|\cdot\|_F$). However, we know that each of the parameter vectors has unit $\ell_1$ norm. Hence, by scaling all the parameter vectors to unit $\ell_1$ norm, we obtain all the parameters up to the required accuracy. $\square$
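The three-step pipeline can be summarized in code. The sketch below is ours and only schematic: `decompose_rank_one_terms` is a hypothetical stand-in for the decomposition algorithm of Theorem 1.5 (not reproduced here), and the weight computation assumes the returned factors are non-negative and multiply out to the whole rank-one term $w_i\,\mu_i^{(1)}\otimes\cdots\otimes\mu_i^{(\ell)}$.

```python
# A schematic sketch (ours) of the three-step multi-view pipeline above.  The
# decomposition step is the algorithm of Theorem 1.5; `decompose_rank_one_terms`
# below is a hypothetical stand-in for it, passed in by the caller.
import numpy as np

def empirical_moment_tensor(samples):
    """Step 1: average of x^(1) (x) ... (x) x^(l) over the samples.

    `samples` is a list of lists of l one-hot views (as in the sampling sketch)."""
    ell = len(samples[0])
    T_hat = None
    for views in samples:
        term = views[0]
        for j in range(1, ell):
            term = np.multiply.outer(term, views[j])
        T_hat = term if T_hat is None else T_hat + term
    return T_hat / len(samples)

def learn_multiview(samples, R, decompose_rank_one_terms):
    T_hat = empirical_moment_tensor(samples)
    # Step 2: recover R rank-one terms; `decompose_rank_one_terms` is assumed to
    # return, for each i, the list of factor vectors whose outer product equals
    # the i-th recovered rank-one term (with the weight folded in).
    factors = decompose_rank_one_terms(T_hat, R)
    weights, means = [], []
    for vecs in factors:
        # Step 3: each true factor has unit l1 norm and non-negative entries, so
        # the product of the recovered factors' l1 norms estimates the weight w_i.
        scales = [np.sum(np.abs(v)) for v in vecs]
        weights.append(np.prod(scales))
        means.append([v / s for v, s in zip(vecs, scales)])
    return np.array(weights), means
```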
5 Learning Mixtures of Axis-Aligned Gaussians

Let $F$ be a mixture of $k = \mathrm{poly}(n)$ axis-aligned Gaussians in $n$ dimensions, and suppose further that the means of the components are perturbed by Gaussian noise of magnitude $\rho$. We restrict to Gaussian noise not because our results change otherwise, but for notational convenience.

Parameters: The mixture is described by a set of $k$ mixing weights $w_i$, means $\mu_i$ and covariance matrices $\Sigma_i$. Since the mixture is axis-aligned, each covariance $\Sigma_i$ is diagonal, and we denote the $j$th diagonal entry of $\Sigma_i$ by $\sigma^2_{ij}$. Our main result in this section is the following:

Theorem 5.1. Let $(w_i, \mu_i, \Sigma_i)$ be a mixture of $k = n^{\lfloor (\ell-1)/2 \rfloor}/(2\ell)$ axis-aligned Gaussians, and suppose $\{\tilde{\mu}_i\}_{i \in [k]}$ are the $\rho$-perturbations of $\{\mu_i\}_{i \in [k]}$ (which have polynomially bounded length). Then there is a polynomial time algorithm that learns the parameters $(w_i, \tilde{\mu}_i, \Sigma_i)_{i \in [k]}$ up to an accuracy $\varepsilon$ when given samples from this mixture. The running time and sample complexity is $\mathrm{poly}_\ell(n, 1/\rho, 1/\varepsilon)$.

Next we outline the main steps in our learning algorithm:
1. We first pick an appropriate $\ell$, and estimate $M_\ell := \sum_i w_i\, \tilde{\mu}_i^{\otimes \ell}$. (We do not estimate the entire tensor, but only a relevant "block", as we will see.)
2. We run our decomposition algorithm for overcomplete tensors on $M_\ell$ to recover $\tilde{\mu}_i$ and $w_i$.
3. We then set up a system of linear equations and solve for $\sigma^2_{ij}$.

We defer a precise description of the second and third steps to the next subsections (in particular, we need to describe how we obtain $M_\ell$ from the moments of $F$, and we need to describe the linear system that we will use to solve for $\sigma^2_{ij}$).

5.1 Step 2: Recovering the Means and Mixing Weights

Our first goal in this subsection is to construct the tensor $M_\ell$ defined above from random samples. In fact, given many samples we can estimate a related tensor (and our error will be an inverse polynomial in the number of samples we take). Unlike the multi-view mixture model, we do not have $\ell$ independent views in this case. Let us consider the tensor $\mathbb{E}[x^{\otimes \ell}]$:
$$\mathbb{E}[x^{\otimes \ell}] = \sum_i w_i\, (\tilde{\mu}_i + \eta_i)^{\otimes \ell}.$$
Here we have used $\eta_i$ to denote a Gaussian random variable whose mean is zero and whose covariance is $\Sigma_i$. The first term in the expansion is the one we are interested in, so it would be nice if we could "zero out" the other terms. Our observation is that if we restrict attention to $\ell$ distinct indices $(j_1, j_2, \ldots, j_\ell)$, then this coordinate of the tensor only has a contribution from the means. To see this, note that the term of interest is
$$\sum_i w_i \prod_{t=1}^{\ell} \big( \tilde{\mu}_i(j_t) + \eta_i(j_t) \big).$$
Since the Gaussians are axis-aligned, the $\eta_i(j_t)$ are independent for different $t$, and each has zero expectation. Thus the term in the summation is precisely $\sum_i w_i \prod_{t=1}^{\ell} \tilde{\mu}_i(j_t)$.

Our idea for estimating the means is now the following: we partition the indices $[n]$ into $\ell$ roughly equal parts $S_1, S_2, \ldots, S_\ell$, and estimate a tensor of dimension $|S_1| \times |S_2| \times \cdots \times |S_\ell|$.

Definition 5.2 (Coordinate partitions). Let $S_1, S_2, \ldots, S_\ell$ be a partition of $[n]$ into $\ell$ pieces of (roughly) equal size. Let $\tilde{\mu}_i^{(t)}$ denote the vector $\tilde{\mu}_i$ restricted to the coordinates $S_t$, and for a sample $x$, let $x^{(t)}$ denote its restriction to the coordinates $S_t$.

Now, we can estimate the order-$\ell$ tensor $\mathbb{E}[x^{(1)} \otimes x^{(2)} \otimes \cdots \otimes x^{(\ell)}]$ to any inverse polynomial accuracy using polynomially many samples (see Lemma C.3 or [20] for details), where
$$\mathbb{E}[x^{(1)} \otimes x^{(2)} \otimes \cdots \otimes x^{(\ell)}] = \sum_i w_i\, \tilde{\mu}_i^{(1)} \otimes \tilde{\mu}_i^{(2)} \otimes \cdots \otimes \tilde{\mu}_i^{(\ell)}.$$
Now, applying the main tensor decomposition theorem (Theorem 1.5) to this order-$\ell$ tensor, we obtain a set of vectors $\nu_i^{(1)}, \nu_i^{(2)}, \ldots, \nu_i^{(\ell)}$ such that $\nu_i^{(t)} = c_{it}\, \tilde{\mu}_i^{(t)}$, and for each $i$, $c_{i1} c_{i2} \cdots c_{i\ell} = 1/w_i$. Now we show how to recover the means $\tilde{\mu}_i$ and weights $w_i$.

Claim 5.3. The algorithm recovers the perturbed means $\{\tilde{\mu}_i\}_{i \in [k]}$ and weights $w_i$ up to any accuracy $\varepsilon$ in time $\mathrm{poly}_\ell(n, 1/\varepsilon)$.

So far, we have portions of the mean vectors, each scaled differently (up to some $\varepsilon/\mathrm{poly}_\ell(n)$ accuracy). We need to estimate the scalars $c_{i1}, c_{i2}, \ldots, c_{i\ell}$ up to a common scaling (we then need another trick to find $w_i$). To do this, the idea is to take a different partition of the indices $S'_1, S'_2, \ldots, S'_\ell$, and 'match' the coordinates to find the $\tilde{\mu}_i$. In general this is tricky, since some portions of the vector may be zero, but this is another place where the perturbation in $\tilde{\mu}_i$ turns out to be very useful (alternately, we can also apply a random basis change, and a more careful analysis, to do this 'matching').
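The following is a small numerical sketch (ours) of the estimator described in Definition 5.2, on an assumed axis-aligned Gaussian mixture with made-up parameters: because each factor of the tensor uses a distinct block of coordinates, the zero-mean noise cancels in expectation and the empirical tensor approaches $\sum_i w_i\, \tilde{\mu}_i^{(1)} \otimes \cdots \otimes \tilde{\mu}_i^{(\ell)}$.

```python
# A small sketch (ours) of the estimator behind Definition 5.2: partition [n]
# into l blocks and average x|_{S_1} (x) ... (x) x|_{S_l} over samples from an
# axis-aligned Gaussian mixture.  Since each factor uses *distinct* coordinates,
# the independent zero-mean noise cancels in expectation.
import numpy as np

rng = np.random.default_rng(2)
n, k, ell, N = 9, 4, 3, 200_000

w = rng.dirichlet(np.ones(k))
mu = rng.standard_normal((k, n))             # (perturbed) means
sigma = 0.5 + rng.random((k, n))             # per-coordinate standard deviations
blocks = np.array_split(np.arange(n), ell)   # S_1, ..., S_l

comps = rng.choice(k, size=N, p=w)
X = mu[comps] + sigma[comps] * rng.standard_normal((N, n))

# Empirical estimate of E[x^(1) (x) x^(2) (x) x^(3)].
emp = np.einsum('ta,tb,tc->abc',
                X[:, blocks[0]], X[:, blocks[1]], X[:, blocks[2]]) / N
# Population tensor: only the means contribute.
exact = np.einsum('i,ia,ib,ic->abc',
                  w, mu[:, blocks[0]], mu[:, blocks[1]], mu[:, blocks[2]])

print("max entrywise deviation:", np.max(np.abs(emp - exact)))
```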
Claim 5.4. Let $\mu$ be any $d$-dimensional vector. Then a coordinate-wise $\sigma$-perturbation of $\mu$ has squared length $\ge d\sigma^2/10$ with probability $\ge 1 - \exp(-d)$.

The proof follows from basic anti-concentration, together with the observation that the coordinates are perturbed independently, and hence the failure probabilities multiply.

Let us now define the partition $S'_t$. Suppose we divide $S_1$ and $S_2$ into two roughly equal parts each, and call the parts $A_1, B_1$ and $A_2, B_2$ respectively. Now consider the partition with $S'_1 = A_1 \cup A_2$, $S'_2 = B_1 \cup B_2$, and $S'_t = S_t$ for $t > 2$. Consider the solution $\nu'_i$ we obtain using the decomposition algorithm with this new partition, and (fixing a component $i$, whose index we drop) look at the vectors $\nu_1, \nu_2, \nu'_1, \nu'_2$. For the sake of exposition, suppose we did not have any error in computing the decomposition. We can scale $\nu'_1$ so that the sub-vector corresponding to $A_1$ is precisely equal to that in $\nu_1$. Now look at the remaining sub-vector of $\nu'_1$ (its $A_2$ portion), and suppose the "$A_2$ portion" of $\nu_2$ is $\gamma$ times this sub-vector. Then we must have $\gamma = c_2/c_1$. To see this formally, let us fix some $i$ and write $v_{11}$ and $v_{12}$ for the sub-vectors of $\tilde{\mu}_i^{(1)}$ restricted to the coordinates in $A_1$ and $B_1$ respectively, and write $v_{21}$ and $v_{22}$ for the sub-vectors of $\tilde{\mu}_i^{(2)}$ restricted to $A_2$ and $B_2$ respectively. Then $\nu_1$ is $c_1 v_{11} \oplus c_1 v_{12}$ (where $\oplus$ denotes concatenation), and similarly $\nu_2$ is $c_2 v_{21} \oplus c_2 v_{22}$. Now, we scaled $\nu'_1$ so that the $A_1$ portion agrees with $\nu_1$; thus we made $\nu'_1$ equal to $c_1 v_{11} \oplus c_1 v_{21}$. Thus, by the way $\gamma$ is defined, we have $c_1\gamma = c_2$, which is what we claimed.

We can now compute the entire vector $\tilde{\mu}_i$ up to scaling, since we know $c_1/c_2$, $c_1/c_3$, and so on. Thus it remains to find the mixture weights $w_i$; note that these are all non-negative. From the decomposition, note that for each $i$ we can find the quantity $C_\ell := w_i \|\tilde{\mu}_i\|^\ell$. The trick now is to note that by repeating the entire process above with $\ell$ replaced by $\ell+1$, the conditions of the decomposition theorem still hold, and hence we can also compute $C_{\ell+1} := w_i \|\tilde{\mu}_i\|^{\ell+1}$. Thus, taking the ratio $C_{\ell+1}/C_\ell$, we obtain $\|\tilde{\mu}_i\|$. This can be done for each $i$, and thus, using $C_\ell$, we obtain $w_i$. This completes the analysis assuming we can obtain the $\tilde{\mu}_i^{(t)}$ without any error; see Lemma C.4 for details on how to recover the weights $w_i$ in the presence of errors. This establishes the above claim about recovering the means and weights.
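Here is a toy sketch (ours) of the two tricks just described, for a single component with made-up data: matching on the overlapping partition recovers the ratio $c_2/c_1$, and the ratio $C_{\ell+1}/C_\ell$ recovers $\|\tilde{\mu}_i\|$ and hence $w_i$. In the sketch we simply form $C_\ell$ and $C_{\ell+1}$ from the ground truth, standing in for the quantities the decomposition would provide.

```python
# A toy sketch (ours) of the matching and weight-recovery tricks above, for one
# component.  We are handed each block of the mean scaled by an unknown constant
# c_t; an overlapping second partition recovers the ratios c_t/c_1, and the ratio
# C_{l+1}/C_l recovers ||mu|| and hence the weight w.
import numpy as np

rng = np.random.default_rng(3)
n, ell, w_true = 12, 3, 0.3
mu = rng.standard_normal(n)                    # one (perturbed) mean
S = np.array_split(np.arange(n), ell)          # S_1, S_2, S_3
c = rng.uniform(0.5, 2.0, size=ell)            # unknown per-block scalings
nu = [c[t] * mu[S[t]] for t in range(ell)]     # what the decomposition hands us

# Second partition: S'_1 = A_1 u A_2 (front halves of S_1, S_2), with its own scaling.
A1, B1 = np.array_split(S[0], 2)
A2, B2 = np.array_split(S[1], 2)
c1p = rng.uniform(0.5, 2.0)
nu1p = c1p * mu[np.concatenate([A1, A2])]

# Match on A_1: rescale nu'_1 so its A_1 part equals nu_1's A_1 part (least squares).
a = len(A1)
scale = (nu[0][:a] @ nu1p[:a]) / (nu1p[:a] @ nu1p[:a])
nu1p_scaled = scale * nu1p                     # now equals c_1 * mu on A_1 u A_2
# gamma such that c_1 * gamma = c_2:
gamma = (nu[1][:len(A2)] @ nu1p_scaled[a:]) / (nu1p_scaled[a:] @ nu1p_scaled[a:])
print("recovered c_2/c_1:", gamma, " true:", c[1] / c[0])

# Weight recovery from the ratio trick: C_l = w * ||mu||^l.
C_l  = w_true * np.linalg.norm(mu) ** ell
C_l1 = w_true * np.linalg.norm(mu) ** (ell + 1)
norm_mu = C_l1 / C_l
print("recovered w:", C_l / norm_mu ** ell, " true:", w_true)
```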
5.2 Step 3: Recovering the Variances

Now that we know the values of $w_i$ and all the means $\tilde{\mu}_i$, we show how to recover the variances. This can be done in many ways, and we outline one which ends up solving a linear system of equations. Recall that for each Gaussian, the covariance matrix is diagonal (denoted $\Sigma_i$, with $j$th entry equal to $\sigma^2_{ij}$). Let us show how to recover $\sigma^2_{i1}$ for $1 \le i \le k$; the same procedure can be applied to the other dimensions to recover $\sigma^2_{ij}$ for all $j$.

Let us divide the set of indices $\{2, 3, \ldots, n\}$ into $\ell$ (nearly equal) sets $S_1, S_2, \ldots, S_\ell$. Now consider the expression
$$N_1 = \mathbb{E}\big[ x(1)^2 \,\big( x|_{S_1} \otimes x|_{S_2} \otimes \cdots \otimes x|_{S_\ell} \big) \big].$$
This can be evaluated as before. Write $\tilde{\mu}_i^{(t)}$ to denote the portion of $\tilde{\mu}_i$ restricted to $S_t$, and similarly $\eta_i^{(t)}$ to denote the portion of the noise vector $\eta_i$. This gives
$$N_1 = \sum_i w_i\, \big( \tilde{\mu}_i(1)^2 + \sigma^2_{i1} \big)\, \big( \tilde{\mu}_i^{(1)} \otimes \tilde{\mu}_i^{(2)} \otimes \cdots \otimes \tilde{\mu}_i^{(\ell)} \big).$$
Now recall that we know the vectors $\tilde{\mu}_i$ and hence each of the tensors $\tilde{\mu}_i^{(1)} \otimes \tilde{\mu}_i^{(2)} \otimes \cdots \otimes \tilde{\mu}_i^{(\ell)}$. Further, since the $\tilde{\mu}_i$ are the perturbed means, our theorem about the condition number of Khatri-Rao products (Theorem 3.3) implies that the matrix (call it $M$) whose columns are the flattened tensors $\bigotimes_t \tilde{\mu}_i^{(t)}$ for different $i$ is well conditioned, i.e. has $\sigma_k(\cdot) \ge 1/\mathrm{poly}_\ell(n/\rho)$. This implies that a system of linear equations $Mz = z'$ can be solved to recover $z$ up to $1/\mathrm{poly}_\ell(n/\rho)$ accuracy (assuming we know $z'$ up to a similar accuracy). Using this with $z'$ being the flattened $N_1$ allows us to recover the values of $w_i\big( \tilde{\mu}_i(1)^2 + \sigma^2_{i1} \big)$ for $1 \le i \le k$. From this, since we know the values of $w_i$ and $\tilde{\mu}_i(1)$ for each $i$, we can recover the values $\sigma^2_{i1}$ for all $i$. As mentioned before, we can repeat this process for the other dimensions and recover $\sigma^2_{ij}$ for all $i, j$.

6 Acknowledgements

We thank Ryan O'Donnell for suggesting that we extend our techniques for learning mixtures of spherical Gaussians to the more general problem of learning axis-aligned Gaussians.

References

[1] E. Allman, C. Matias and J. Rhodes. Identifiability of Parameters in Latent Structure Models with many Observed Variables. Annals of Statistics, pages 3099-3132, 2009.
[2] A. Anandkumar, D. Hsu and S. Kakade. A Method of Moments for Mixture Models and Hidden Markov Models. In COLT, 2012.
[3] A. Anandkumar, R. Ge, D. Hsu and S. Kakade. A Tensor Spectral Approach to Learning Mixed Membership Community Models. In COLT, 2013.
[4] A. Anandkumar, R. Ge, D. Hsu, S. Kakade and M. Telgarsky. Tensor Decompositions for Learning Latent Variable Models. arXiv:1210.7559, 2012.
[5] A. Anandkumar, D. Foster, D. Hsu, S. Kakade and Y. Liu. A Spectral Algorithm for Latent Dirichlet Allocation. In NIPS, pages 926-934, 2012.
[6] S. Arora, R. Ge, A. Moitra and S. Sachdeva. Provable ICA with Unknown Gaussian Noise, and Implications for Gaussian Mixtures and Autoencoders. In NIPS, pages 2384-2392, 2012.
[7] M. Belkin, L. Rademacher and J. Voss. Blind Signal Separation in the Presence of Gaussian Noise. In COLT, 2013.
[8] M. Belkin and K. Sinha. Polynomial Learning of Distribution Families. In FOCS, pages 103-112, 2010.
[9] A. Bhaskara, M. Charikar and A. Vijayaraghavan. Uniqueness of Tensor Decompositions with Applications to Polynomial Identifiability. 2013.
[10] J. Chang. Full Reconstruction of Markov Models on Evolutionary Trees: Identifiability and Consistency. Mathematical Biosciences, pages 51-73, 1996.
[11] P. Comon. Independent Component Analysis: A New Concept? Signal Processing, pages 287-314, 1994.
[12] S. Dasgupta. Learning Mixtures of Gaussians. In FOCS, pages 634-644, 1999.
[13] L. De Lathauwer, J. Castaing and J. Cardoso. Fourth-order Cumulant-based Blind Identification of Underdetermined Mixtures. IEEE Trans. on Signal Processing, 55(6):2965-2973, 2007.
[14] J. Feldman, R. A. Servedio and R. O'Donnell. PAC Learning Axis-aligned Mixtures of Gaussians with No Separation Assumption. In COLT, pages 20-34, 2006.
[15] A. Frieze, M. Jerrum and R. Kannan. Learning Linear Transformations. In FOCS, pages 359-368, 1996.
[16] N. Goyal, S. Vempala and Y. Xiao. Fourier PCA. arXiv:1306.5825, 2013.
[17] J. Håstad. Tensor Rank is NP-Complete. Journal of Algorithms, pages 644-654, 1990.
[18] C. Hillar and L.-H. Lim. Most Tensor Problems are NP-Hard. arXiv:0911.1393v4, 2013.
[19] R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press, 1990.
[20] D. Hsu and S. Kakade. Learning Mixtures of Spherical Gaussians: Moment Methods and Spectral Decompositions. In ITCS, pages 11-20, 2013.
[21] A. Hyvärinen, J. Karhunen and E. Oja. Independent Component Analysis. Wiley Interscience, 2001.
[22] A. T. Kalai, A. Moitra and G. Valiant. Efficiently Learning Mixtures of Two Gaussians. In STOC, pages 553-562, 2010.
[23] A. T. Kalai, A. Samorodnitsky and S.-H. Teng. Learning and Smoothed Analysis. In FOCS, pages 395-404, 2009.
[24] J. Kruskal. Three-way Arrays: Rank and Uniqueness of Trilinear Decompositions. Linear Algebra and its Applications, 18:95-138, 1977.
[25] S. Leurgans, R. Ross and R. Abel. A Decomposition for Three-way Arrays. SIAM Journal on Matrix Analysis and Applications, 14(4):1064-1083, 1993.
[26] B. Lindsay. Mixture Models: Theory, Geometry and Applications. Institute for Mathematical Statistics, 1995.
[27] P. McCullagh. Tensor Methods in Statistics. Chapman and Hall/CRC, 1987.
[28] A. Moitra and G. Valiant. Settling the Polynomial Learnability of Mixtures of Gaussians. In FOCS, pages 93-102, 2010.
[29] E. Mossel and S. Roch. Learning Nonsingular Phylogenies and Hidden Markov Models. In STOC, pages 366-375, 2005.
[30] Y. Rabani, L. Schulman and C. Swamy. Learning Mixtures of Arbitrary Distributions over Large Discrete Domains. In ITCS, 2014.
[31] C. Spearman. General Intelligence. American Journal of Psychology, pages 201-293, 1904.
[32] D. A. Spielman and S.-H. Teng. Smoothed Analysis of Algorithms: Why the Simplex Algorithm Usually Takes Polynomial Time. Journal of the ACM, pages 385-463, 2004.
[33] D. A. Spielman and S.-H. Teng. Smoothed Analysis: An Attempt to Explain the Behavior of Algorithms in Practice. Communications of the ACM, pages 76-84, 2009.
[34] A. Stegeman and P. Comon. Subtracting a Best Rank-1 Approximation may Increase Tensor Rank. Linear Algebra and its Applications, pages 1276-1300, 2010.
[35] H. Teicher. Identifiability of Mixtures. Annals of Mathematical Statistics, pages 244-248, 1961.
[36] S. Vempala and Y. Xiao. Structure from Local Optima: Learning Subspace Juntas via Higher Order PCA. arXiv:1108.3329, 2011.
[37] P. Wedin. Perturbation Bounds in Connection with Singular Value Decompositions. BIT, 12:99-111, 1972.

A Stability of the recovery algorithm

In this section we prove Theorem 2.3, which shows that the algorithm from Section 2 is in fact robust to errors, under Condition 2.2. This consists of two parts: first, proving that the preprocessing step indeed allows us to recover the $u_i$ (approximately); and second, that Decompose is robust to noise.

Stability of the preprocessing. Suppose we are given $T + E$, where $T = \sum_i u_i \otimes v_i \otimes w_i$, and $E$ is a tensor each of whose entries is $< \epsilon \cdot \mathrm{poly}(1/\kappa, 1/n, 1/\delta)$.
Let e u j,k b e vec tors in ℜ m defined as b efore, and let e U b e the m × np matrix whose columns are e u j,k (for different j, k ). Let u ′ i b e the pr o jection of u i on to the span of the top R singular vec tors of e U . By Claim 4.4 in [ 9 ], we ha v e k T − P i u ′ i ⊗ v i ⊗ w i k F < 2 k E k F , and thus from the robu st ve rsion of Krusk al’s uniqueness theorem [ 9 ], w e must ha ve that u ′ i and u i are ǫ · p oly(1 /κ, 1 / n, 1 /δ ) close. Rep eating the ab ov e along the second mo de allo ws us to mo v e to an R × R × p tensor. Stabilit y of Decompose Next, we establish th at Decompos e is s table (in w hat follo w s, w e hav e m = n = R ). Intuitiv ely , Decompose is s table pr o vided that th e matrices U and V are we ll-conditioned and the eigen v alues of the matrices th at we need to diagonalize are separated. The main step in Decompo se is an eigendecomp osition, so firs t we will establish p ertur bation b ound s. The stand ard p erturbation b oun ds are known as sin θ theorems f ollo win g Da vis-Kahan and W edin. Ho w ev er these b ound s hold most generally for the singular v alue decomp osition of an arbi- trary (not n ecessarily symmetric) matrix. W e require p erturbation b oun d s f or eigen-decomp ositions of general matrice s. Th er e are known b ounds due to E isen stat and Ipsen, h o w ev er the notion of separation required there is difficult to w ork with and for our purp oses it is easier to p ro v e a direct b ound in our setting. Supp ose M = U D U − 1 and c M = M ( I + E ) + F and M and c M are n × n matrices. In ord er to relate the eigendecomp ositions of M and c M resp ectiv ely , we will fir st n eed to establish that the eigen v alues of M are all distinct. W e th ank San tosh V empala f or p oint ing out an err or in an earlier v ersion. W e incorrectly used the Bauer-Fike T h eorem to sho w that c M is diagonaliza ble, but this theorem only sho ws that eac h eigen v alue of c M is close to some eigen v alue of M , bu t do es n ot show that there is a one-to-one mappin g. F ortun ately there is a fix for this that works und er the same conditions (but again see [ 16 ] for an earlier, alternativ e p r o of that uses a “homotop y argumen t”). Definition A.1. Let sep( D ) = min i 6 = j | D i,i − D j,j | . Our fi rst goal is to pro ve that c M is d iagonalizable, and w e will d o this by establishing that its eigen v alues are distinct if the err or matrices E and F are not to o large. Consider U − 1 ( M ( I + E ) + F ) U = D + R 25 where R = U − 1 ( M E + F ) U . W e can b oun d eac h en try in R b y κ ( U )( k M E k 2 + k F k 2 ). Hence if E and F are not too large, the eigenv alues of D + R are close to the eige nv alues of D using Gershgorin’s disk th eorem, and the eigen v alues of D + R are the same as the eigen v alues of c M sin ce these matrices are similar. So we conclude: Lemma A.2. If κ ( U )( k M E k 2 + k F k 2 ) < sep ( D ) / (2 n ) then the eigenv alues of c M ar e distinct and it is diagonalizable. Next we prov e that th e eigen v ectors of c M are also close to those of M (this step will r ely on c M b eing diagonalizable). Th is tec hnique is stand ard in n umerical analysis, but it will b e more con v enien t for us to w ork w ith r elativ e p erturb ations (i.e. c M = M ( I + E ) + F ) so w e include the pro of of su c h a b ound for completeness Consider a r ight eige nv ector b u i of c M with eigen v alue b λ i . 
W e w ill assume th at th e conditions of the ab o v e corollary are met, so that there is a unique eigen v ector u i of M with eig env alue λ i whic h it is paired with. Then since the eigen v ectors { u i } i of M are full rank, we can w r ite b u i = P j c j u j . Then c M b u i = b λ i b u i X j c j λ j u j + ( M E + F ) b u i = b λ i b u i X j c j ( λ j − b λ i ) u j = − ( M E + F ) b u i No w we can left multiply b y the j th ro w of U − 1 ; call this v ector w T j . Since U − 1 U = I , w e ha v e that w T j u i = 1 i = j . Hence c j ( λ j − b λ i ) = − w T j ( M E + F ) b u i So w e conclud e: k b u i − u i k 2 2 = 2 dist ( b u i , s pan ( u i )) 2 ≤ 2 X j 6 = i ( w T j ( M E + F ) b u i ) | λ j − b λ i | 2 ≤ 8 X j 6 = i k U − 1 ( M E + F ) b u i k 2 2 sep ( D ) 2 where we ha v e u sed the condition that κ ( U )( k M E k 2 + k F k 2 ) < sep ( D ) / 2 to lo w er b oun d the denominator. F u rthermore: k U − 1 M E b u i k 2 = k D U − 1 E b u i k 2 ≤ σ max ( E ) λ max ( D ) σ min ( U ) since b u i is a un it v ector. Theorem A.3. If κ ( U )( k M E k 2 + k F k 2 ) < sep ( D ) / 2 , then k b u i − u i k 2 ≤ 3 σ max ( E ) λ max ( D ) + σ max ( F ) σ min ( U ) sep ( D ) No w we are ready to analyze the s tability of Dec ompose : Let T = P n i =1 u i ⊗ v i ⊗ w i b e an n × n × p tensor that satisfies Condition 2.2 . In our settings of in terest w e are not giv en T exactly but rather a go o d approximat ion to it, and h ere let us mo del this n oise as an additive error E that is itself an n × n × p tensor. Claim A.4. W i th high pr ob ability, sep ( D a D − 1 b ) ≥ δ √ p . Pro of: Fix some i, j . Th e ( i, i )th en try of D a D − 1 b is precisely h w i ,a i h w i ,b i . Note that Pr [ |h w i , b i| > n k w i k ] is exp( − n ), thus the denominators are all at least 1 / ( C n ) in magnitude with pr ob ab ility 1 − exp( − n ). 26 No w give n b for which th is happ ens, we hav e h w i ,a i h w i ,b i − h w j ,a i h w j ,b i = c i h w i , a i − c j h w j , a i where c i , c j ha v e magnitude > 1 / ( C n ). Because w i has at least a δ comp onen t orthogonal to w j , an ti-concen tration of Gaussians implies that the difference ab o v e is at least δ /C 2 n 6 with pr ob ab ility at least 1 − 1 /n 4 . Th us we can tak e a union b ound o ver all p airs. W e will mak e crucial use of the follo wing matrix id en tit y: ( A + Z ) − 1 = A − 1 − A − 1 Z ( I + A − 1 Z ) − 1 A − 1 Let N a = T a + E a and N b = T b + E b . Then using the ab ov e id en tit y w e h av e: N a ( N b ) − 1 = T a ( T b ) − 1 ( I + F ) + G where F = − E b ( I + ( T b ) − 1 E b ) − 1 ( T b ) − 1 and G = E a ( T b ) − 1 Claim A.5. σ max ( F ) ≤ σ max ( E b ) σ min ( T b ) − σ max ( E b ) and σ max ( G ) ≤ σ max ( E a ) σ min ( T b ) Pro of: Using W eyl’s Inequalit y w e h a v e σ max ( F ) ≤ σ max ( E b ) 1 − σ max ( E b ) σ min ( T b ) × 1 σ min ( T b ) = σ max ( E b ) σ min ( T b ) − σ max ( E b ) as desired. The s econd b oun d is ob vious. W e can no w use T heorem A.3 to b ound the error in r eco ve ring the factors U and V b y setting e.g. M = T a ( T b ) − 1 . Ad ditionally , the follo wing claim establishes that the linear s y s tem used to solv e for W is we ll-conditioned and hen ce w e can also b oun d the error in r eco vering W . Claim A.6. 
κ ( U ⊙ V ) ≤ min( σ max ( U ) ,σ max ( V )) max( σ min ( U ) ,σ min ( V )) ≤ min ( κ ( U ) , κ ( V )) These b ounds establish what we qualitativ ely asserted: Decompos e is stable pro vided th at the m a- trices U and V are well- conditioned and the eigen v alues of the matrices that w e need to diagonalize are separated. B K-rank of the Khatri-Rao pro duct B.1 Lea v e-One-Out Distance Recall: w e defined the lea ve -one-out distance in Section 3 . Here w e establish that is indeed equiv a- len t to the s m allest singular v alue, up to p olynomial factors. In our main pro of, this quant it y will b e muc h easer to work with since it allo ws us to translate questions ab out a set of v ectors b eing w ell-conditioned to reasoning ab out pr o jection of eac h v ector onto the orthogonal complement of the others. Pro of of Le mma 3.5 : Using the v ariational c haracterization for singular v alues: σ min ( A ) = min u, k u k 2 =1 k Au k 2 . T hen let i = argmax | u i | . Clearly | u i | ≥ 1 / √ m since k u k 2 = 1. Then k A i + P j 6 = i A j u j u i k 2 = σ min ( A ) u i . Hence ℓ ( A ) ≤ dist ( A i , s pan { A j } j 6 = i ) ≤ σ min ( A ) u i ≤ σ min ( A ) √ m 27 Con v ersely , let i = argmin i dist ( A i , s pan { A j } j 6 = i ). Then there are co efficien ts (with u i = 1) suc h that k A i u i + X j 6 = i A j u j k 2 = ℓ ( A ) . Clearly k u k 2 ≥ 1 since u i = 1. And w e conclude that ℓ ( A ) = k A i u i + X j 6 = i A j u j k 2 ≥ k A i u i + P j 6 = i A j u j k 2 k u k 2 ≥ σ min ( A ) . B.2 Pro of of Prop osition 3.12 W e n o w giv e th e complete details of the pro of of Pr op osition 3.12 , that sho ws h ow the Krusk al rank m ultiplies in the smo othed setting for tw o-wise p ro du cts. The p ro of follo ws by just com bining Lemma 3.16 and L emma 3.15 . Let U b e the sp an of the top δ n 2 singular v alues of M . Thus U is a δ n 2 dimensional subsp ace of R n 2 . Using Lemma 3.16 with: r = n 1 / 2 2 , m = n, δ ′ = δ n 1 / 2 , w e obtain n × n matrices M 1 , M 2 , . . . , M r ha ving the ( θ, δ ′ )-orthogonalit y pr op erty . Note that in this setting, δ ′ m = n 1 / 2 2 . Th us by applying Lemma 3.15 , we ha ve that the matrix Q ( e x ), defin ed as b efore, satisfies Pr x σ r / 2 ( Q ( e x )) ≥ ρθ n 4 ≥ 1 − exp( − r ) . (12) No w let us consider X s ( e y T M s e x ) 2 = k e y T Q ( e x ) k 2 . Since Q ( e x ) h as many non-negligible singular v alues (Eq.( 12 )), we ha v e (b y F act 3.26 for details) that an ρ -p erturb ed v ector has a n on -n egligible norm w h en multiplied by Q . More precisely , Pr [ k e y T Q ( e x ) k ≥ ρθ /n 4 ] ≥ 1 − exp( − r / 2). Thus for one of the terms M s , we ha v e | M s ( e x ⊗ e y ) | ≥ ρθ /n 5 with probability ≥ 1 − exp( − r / 2). No w this almost completes the pro of, b ut recall that our aim is to argue ab out M ( e x ⊗ e y ), wh ere M is the giv en matrix. ve c ( M s ) is a vec tor in the s p an of the top δ n 2 (righ t) singular vec tors of M , and σ δn 2 ≥ τ , thus we can write M s as a combinatio n of the ro ws of M , with eac h wei ght in the com bination b eing ≤ n/τ (Lemma B.1 ). This implies that for at least one ro w M ( j ) of the matrix M , we must ha ve k M ( j ) ( e x ⊗ e y k ≥ θ ρτ n 6 = ρτ n O (1) . (Otherwise we ha ve a contradicti on). This completes the pro of. 
Before we giv e the complete pro ofs of the t w o main lemmas regarding ord er ed ( θ , δ ) orthogonal systems (Lemma 3.16 and Lemm a 3.15 ), w e start with a s imple lemma ab out top singular v ectors of matrices, wh ic h is v ery useful to obtain linear com binations of small length. 28 Lemma B.1 (Expressing top singular v ectors as small com binations of column s ) . Supp ose we have a m × n matrix M with σ t ( M ) ≥ η , and let v 1 , v 2 , . . . v t ∈ R m b e the top t left-singu lar ve ctors of M . Then these top t singular ve ctor c an b e expr esse d using small line ar c ombinations of the c olumns { M ( i ) } i ∈ [ n ] i.e. ∀ k ∈ [ t ] , ∃ { α k ,i } i ∈ [ n ] such that v k = X i ∈ [ n ] α k ,i M ( i ) and X i α 2 k ,i ≤ 1 /η 2 Pro of: Let ℓ corresp ond to the num b er of non-zero singular v alues of M . Using the SVD, there exists matrices V ∈ R m × ℓ , U ∈ R n × ℓ with orthonorm al columns (b oth u n itary matrices), and a diagonal matrix Σ ∈ R ℓ × ℓ suc h that M = V Σ U T . Since the n × ℓ matrix V = M ( U Σ − 1 ), the t column s of V corresp onding to the top t s in gular v alues ( σ t ( M ) ≥ η ) corresp ond to linear com binations whic h are small i.e. ∀ k ∈ [ t ] , k α k k ≤ 1 /η . B.3 Constructing the ( θ , δ ) -Ort hogonal System (Pro of of Lemma 3.16 ) Let V b e a su bspace of R n · m , with its co-ordinates indexed by [ n ] × [ m ]. F urther,remember that the vec tors in R n · m are also treated as m atrices of size n × m . W e no w give the complete pr o of of lemma 3.18 that sho ws that th e a v erage robust dimension of column p ro jections is large if the d imension of V is large . Pro of of Lemma 3.18 : Let d = dim( V ). Let B b e a p 1 p 2 × d matrix comp osed of a orthonorm al basis (of d vec tors) for V i.e. the j th column of B is the j th basis vec tor ( j ∈ [ d ]) of V . Clearly σ d ( B ) = 1. F or i ∈ [ p 2 ], let B i b e the p 1 × d matrix obtained by pro j ecting the columns of B on just the rows giv en by [ p 1 ] × i . Hence, B is obtained by just concatenating the column s as B T = B 1 T k B 2 T k . . . k B p T . Finally , let d i = max t suc h that σ t ( B i ) ≥ 1 √ p 2 . W e w ill first show that P i d i ≥ d . Then w e will sh o w that dim τ i ( V ) ≥ d i to complete our pro of. Supp ose for contradicti on that P i ∈ [ p 2 ] d i < d . Let S i b e the ( d − d 1 )-dimensional subs pace of R d spanned by the last ( d − d 1 ) righ t sin gu lar v ectors of B i . Hence, for unit vect ors α ∈ S i ⊆ R d , k B i α k < 1 √ p 2 . Since, d − P i ∈ [ p 2 ] d i > 0, there exists at least one un it ve ctor α ∈ T i S ⊥ i . Picking th is unit vecto r α ∈ R d , w e h av e k B α k 2 2 = P i ∈ [ p 2 ] k B i α k 2 2 < p 2 · ( 1 √ p 2 ) 2 < 1. Th is con tradicts σ d ( B ) ≥ 1 T o establish the second p art, consider some B i ( i ∈ [ p 2 ]). W e p ic k d i orthonormal v ectors ∈ R p 1 corresp ondin g to the top d i left-singular v ectors of B i . By using Lemma B.1 , we kno w that eac h of these j ∈ [ d i ] v ectors can b e expressed as a small combination ~ α j of the columns of B i s.t. k ~ α j k ≤ √ p 2 . F urth er , if we asso ciate with eac h of these j ∈ [ d i ] ve ctors, the v ector w j ∈ R ( p 1 p 2 ) giv en b y the same combinatio n ~ α j of the columns of B , we see that k w j k ≤ √ p 2 since the column s of the matrix B are orthonorm al. B.4 Implications of Ordered ( θ , δ ) -Orthogonalit y: Details of Pr o of of Lemma 3.15 Here we sho w some auxiliary lemmas that are used in th e Pro of of Lemma B.4 . 29 Claim B.2. Supp ose v 1 , v 2 , . . . 
, v m ar e a set of ve ctors in ℜ n of length ≤ 1 , having the θ -ortho gonal pr op erty. Then we have (a) F or g ∼ N (0 , 1) n , we have P i h v i , g i 2 ≥ θ 2 / 2 with pr ob ability ≥ 1 − exp( − Ω( m )) , (b) F or g ∼ N (0 , 1) m , we have k P i g i v i k 2 ≥ θ 2 / 2 with pr ob ability ≥ 1 − exp( − Ω ( m )) . F urthermor e, p art (a) holds even if g is dr awn fr om u + g ′ , for any fixe d ve ctor u and g ′ ∼ N (0 , 1) n . Pro of: F irst note that w e m ust ha v e m ≤ n , b ecause otherwise { v 1 , v 2 , . . . , v m } cannot hav e th e θ -orthogonal prop ert y for θ > 0. F or an y j ∈ [ m ], w e claim that Pr [( h v j , g i 2 < θ 2 / 2) | v 1 , v 2 , . . . , v j − 1 ] < 1 / 2 . (13) T o s ee this, write v j = v ′ j + v ⊥ j , w here v ⊥ j is orthogonal to the span of { v 1 , v 2 , . . . , v j − 1 } . Since j ∈ I , w e ha v e k v ⊥ j k ≥ θ . No w giv en the v ectors v 1 , v 2 , . . . , v j − 1 , th e v alue h v ′ j , g i is fixed, but h v ⊥ j , g i is distributed as a Gaussian with v ariance θ 2 (since g is a Gaussian of u nit v ariance in eac h direction). Th us from a stand ard anti -concen tration p rop erty for the one-dimen s ional Gaussian, h v j , g i cannot h a v e a m ass > 1 / 2 in an y θ 2 length interv al, in particular, it cannot lie in [ − θ 2 / 2 , θ 2 / 2] with probabilit y > 1 / 2. This pro v es Eq. ( 13 ) . Now since this is tru e for an y conditioning v 1 , v 2 , . . . , v j − 1 and for all j , it follo w s (see Lemma B.3 for a f orm al justification) that Pr [ h v j , g i 2 < θ 2 / 2 for all j ] < 1 2 m < exp( − m/ 2) . This completes the pr o of of the claim, part (a). Note that eve n if w e had g replaced by u + g through - out, the anti-c oncen tration prop ert y still holds (w e ha v e a shifted one-dimensional Gaussian), th us the pr o of goes through verbatim. Let us no w pro v e part (b ). First n ote that if w e denote b y M the n × m matrix whose columns are the v i , then part (a) deals with the distrib ution of g T M M T g , wh ere g ∼ N (0 , 1) n . P art (b) deals with the distribution of g T M T M g , where g ∼ N (0 , 1) m . But since the eigen v alues of M M T and M T M are precisely the same, due to the rotational in v ariance of Gaussians, these t wo qu an tities are distribu ted exactly the same wa y . This completes the pro of. Lemma B.3. Su pp ose we have r andom variables X 1 , X 2 , . . . , X r and an event f ( · ) which is define d to o c cur if its ar gu ment lies in a c ertain i nterval (e.g. f ( X ) o c curs iff 0 < X < 1 ). F urther, supp ose we have Pr [ f ( X 1 )] ≤ p , and Pr [ f ( X i ) | X 1 , X 2 , . . . , X i − 1 ] ≤ p for al l X 1 , X 2 , . . . , X i − 1 . Then P r [ f ( X 1 ) ∧ f ( X 2 ) ∧ · · · ∧ f ( X r )] ≤ p r . C Applications to Mixture Mo dels C.1 Sampling Err or Estimates for Multi-view Mo dels In this section, we sho w error estimates for ℓ -ord er tensors obtained b y lo oking at the ℓ th momen t of the multi-vie w mo del. Lemma C.1 (Error estimates for Multiview mixtu r e mo del) . F or eve ry ℓ ∈ N , supp ose we have a multi-v i ew mo del, with p ar ameters { w r } r ∈ [ R ] and { M ( j ) } j ∈ [ ℓ ] , the n dimensional sample v e ctors 30 x ( j ) have k x ( j ) k ∞ ≤ 1 . Then, for e very ε > 0 , ther e exists N = O ( ε − 2 √ ℓ log n ) such that if N samples { x (1 ) ( j ) } j ∈ [ ℓ ] , { x (2) ( j ) } j ∈ [ ℓ ] , . . . , { x ( N ) ( j ) } j ∈ [ ℓ ] ar e ge ner ate d, then with high pr ob ability k E x (1) ⊗ x (2) ⊗ . . . 
x ( ℓ ) − 1 N X t ∈ [ N ] x ( t ) (1) ⊗ x ( t ) (2) ⊗ x ( t ) ( ℓ ) k ∞ < ε (14) Pro of: W e fi rst b ound the k · k ∞ norm of the difference of tensors i.e. we sho w that ∀{ i 1 , i 2 , . . . , i ℓ } ∈ [ n ] ℓ , E Y j ∈ [ ℓ ] x ( j ) i j − 1 N X t ∈ [ N ] Y j ∈ [ ℓ ] x ( t ) ( j ) i j < ε/n ℓ/ 2 . Consider a fixed en try ( i 1 , i 2 , . . . , i ℓ ) of the tensor. Eac h samp le t ∈ [ N ] corresp ond s to an ind ep end ent rand om v ariable with a b ound of 1. Hence, w e ha v e a sum of N b ound ed random v ariables. By Bernstein b ounds , probabilit y f or ( 14 ) to not o ccur exp − ( εn − ℓ/ 2 ) 2 N 2 2 N = exp − ε 2 N/ 2 n ℓ . W e hav e n ℓ ev en ts to union b oun d o v er. Hence N = O ( ε − 2 n ℓ √ ℓ log n ) su ffices. Note that sim ilar b ounds hold when the x ( j ) ∈ R n are generated from a multiv ariate gaussian. C.2 Error Analysis for Multi-view Mo dels Lemma C.2. Su pp ose k u ⊗ v − u ′ ⊗ v ′ k F < δ , and L min ≤ k u k , k v k , k u ′ k , k v ′ k ≤ L max , with δ < min { L 2 min , 1 } (2 max { L max , 1 } ) . If u = α 1 u ′ + β 1 ˜ u ⊥ and v = α 2 v ′ + β 2 ˜ v ⊥ , wher e ˜ u ⊥ and ˜ v ⊥ ar e unit ve ctors ortho g onal to u ′ , v ′ r esp e ctive ly, then we have | 1 − α 1 α 2 | < δ /L 2 min and β 1 < √ δ , β 2 < √ δ . Pro of: W e are giv en that u = α 1 u ′ + β 1 ˜ u ⊥ and v = α 2 v ′ + β 2 ˜ v ⊥ . No w, since the tensored ve ctors are close k u ⊗ v − u ′ ⊗ v ′ k 2 F < δ 2 k (1 − α 1 α 2 ) u ′ ⊗ v ′ + β 1 α 2 ˜ u ⊥ ⊗ v ′ + β 2 α 1 u ′ ⊗ ˜ v ⊥ + β 1 β 2 ˜ u ⊥ ⊗ ˜ v ⊥ k 2 F < δ 2 L 4 min (1 − α 1 α 2 ) 2 + β 2 1 α 2 2 L 2 min + β 2 2 α 2 1 L 2 min + β 2 1 β 2 2 < δ 2 (15) This implies that | 1 − α 1 α 2 | < δ /L 2 min as required. No w , let us assume β 1 > √ δ . T h is at once im p lies that β 2 < √ δ . Also L 2 min ≤ k v k 2 = α 2 2 k v ′ k 2 + β 2 2 L 2 min − δ ≤ α 2 2 L 2 max Hence, α 2 ≥ L min 2 L max No w , us in g ( 15 ), we see that β 1 < √ δ . 31 C.3 Sampling Err or Estimates for Gaussians Lemma C.3 (Error estimates for Gaussians) . Supp ose x is gener ate d fr om a mixtur e of R - gaussians with me ans { µ r } r ∈ [ R ] and c ovarianc e Σ i that is diagonal , with the me ans satisfying k µ r k ≤ B . L et σ = max i σ max (Σ i ) F or e v ery ε > 0 , ℓ ∈ N , ther e exists N = Ω (p oly( 1 ε )) , σ 2 , n , R ) such that if x (1) , x (2) , . . . , x ( N ) ∈ R n wer e the N samples, then ∀{ i 1 , i 2 , . . . , i ℓ } ∈ [ n ] ℓ , E Y j ∈ [ ℓ ] x i j − 1 N X t ∈ [ N ] Y j ∈ [ ℓ ] x ( t ) i j < ε. (16) In other wor ds, k E x ⊗ ℓ − 1 N X t ∈ [ N ] ( x ( t ) ) ⊗ ℓ k ∞ < ε Pro of: Fix an element ( i 1 , i 2 , . . . , i ℓ ) of the ℓ -order tensor. Eac h p oin t t ∈ [ N ] corresp ond s to an i.i.d random v ariable Z t = x ( t ) i 1 x ( t ) i 2 . . . x ( t ) ℓ . W e are interested in the d eviation of the sum S = 1 N P t ∈ [ N ] Z t . Eac h of the i.i.d rvs h as v alue Z = x i 1 x i 2 . . . x ℓ . Since the gaussians are axis- aligned and eac h mean is b ounded b y B , | Z | < ( B + tσ ) ℓ with probabilit y O exp( − t 2 / 2) . Hence, b y usin g s tand ard sub -gaussian tail inequalities, we get Pr | S − E z | > ε < exp − ε 2 N ( M + σ ℓ log n ) ℓ Hence, to u nion b oun d o ve r all n ℓ ev en ts N = O ε − 2 ( ℓ log nM ) ℓ suffices. C.4 Reco v ering W eigh ts in Gaussian Mixtures W e now sh o w ho w w e can appro ximate upto a small error the w eigh t w i of a gaussian comp onen ts in a mixtur e of gaussians, wh en we h a v e go o d approximat ions to w i µ ⊗ ℓ i and w i µ ⊗ ( ℓ − 1) i . Lemma C.4 (Reco v ering W eigh ts) . 
F or every δ ′ > 0 , w > 0 , L min > 0 , ℓ ∈ N , ∃ δ = Ω δ 1 w 1 / ( ℓ − 1) ℓ 2 L min such that, if µ ∈ R n b e a ve ctor with length k µ k ≥ L min , and supp ose k v − w 1 /ℓ µ k < δ and k u − w 1 / ( ℓ − 1) µ k < δ . Then, |h u, v i| k u k ℓ ( ℓ − 1) − w < δ ′ (17) Pro of: F rom ( C.4 ) and triangle inequalit y , we see that k w − 1 /ℓ v − w − 1 / ( ℓ − 1) u k ≤ δ ( w − 1 / ( ℓ ) + w − 1 / ( ℓ − 1) ) = δ 1 . Let α 1 = w − 1 / ( ℓ − 1) and α 2 = w − 1 /ℓ . S u pp ose v = β u + ε ˜ u ⊥ where ˜ u ⊥ is a unit ve ctor p erp end icular to u . Hence β = h v , u i / k u k . k α 1 v − α 2 u k 2 = k ( β α 1 − α 2 ) u + α 1 ε ˜ u ⊥ k < δ 2 1 ( β α 1 − α 2 ) 2 k u k 2 + α 2 1 ε 2 ≤ δ 2 1 β − α 2 α 1 < δ 1 L min 32 No w , su b stituting the v alues for α 1 , α 2 , we see that β − w 1 ( ℓ − 1) − 1 ℓ < δ 1 L min . β − w 1 / ( ℓ ( ℓ − 1)) < δ w 1 / ( ℓ − 1) L min β ℓ ( ℓ − 1) − w ≤ δ ′ when δ ≪ δ ′ w 1 / ( ℓ − 1) ℓ 2 L min 33