Fourier PCA and Robust Tensor Decomposition

Navin Goyal*        Santosh Vempala†        Ying Xiao‡

July 1, 2014

Abstract

Fourier PCA is Principal Component Analysis of a matrix obtained from higher order derivatives of the logarithm of the Fourier transform of a distribution. We make this method algorithmic by developing a tensor decomposition method for a pair of tensors sharing the same vectors in rank-1 decompositions. Our main application is the first provably polynomial-time algorithm for underdetermined ICA, i.e., learning an $n \times m$ matrix $A$ from observations $y = Ax$ where $x$ is drawn from an unknown product distribution with arbitrary non-Gaussian components. The number of component distributions $m$ can be arbitrarily higher than the dimension $n$, and the columns of $A$ only need to satisfy a natural and efficiently verifiable nondegeneracy condition. As a second application, we give an alternative algorithm for learning mixtures of spherical Gaussians with linearly independent means. These results also hold in the presence of Gaussian noise.

*Microsoft Research India
†Schools of Computer Science, Georgia Tech and CMU
‡School of Computer Science, Georgia Tech

1 Introduction

Principal Component Analysis [55] is often an "unreasonably effective" heuristic in practice, and some of its effectiveness can be explained rigorously as well (see, e.g., [43]). It consists of computing the eigenvectors of the empirical covariance matrix formed from the data; the eigenvectors turn out to be directions that locally maximize second moments. The following example illustrates the power and limitations of PCA: given random independent points from a rotated cuboid in $\mathbb{R}^n$ with distinct axis lengths, PCA will identify the axes of the cuboid and their lengths as the eigenvectors and eigenvalues of the covariance matrix.
However, if instead of a rotation, points came from a linear transformation of a cuboid, then PCA does not work. To handle this and similar hurdles, higher moment extensions of PCA have been developed in the literature, e.g., [5, 40, 53, 6, 3, 39, 65], and shown to be provably effective for a wider range of unsupervised learning problems, including special cases of Independent Component Analysis (ICA), Gaussian mixture models, learning latent topic models, etc. ICA is the classic signal recovery problem of learning a linear transformation $A$ from i.i.d. samples $x = As$ where $s \in \mathbb{R}^n$ has an unknown product distribution. The example above, namely learning a linearly transformed cuboid, is a special case of this problem. Although PCA fails, one can use it to first apply a transformation (to a sample) that makes the distribution isotropic, i.e., effectively making the distribution a rotation of a cube. At this point, eigenvectors give no further information, but as observed in the signal processing literature [26, 41], directions that locally maximize fourth moments reveal the axes of the cube, and undoing the isotropic transformation yields the axes of the original cuboid. Using this basic idea, Frieze et al. [35] and subsequent papers give provably efficient algorithms assuming that the linear transformation $A$ is full-dimensional and the components of the product distribution differ from one-dimensional Gaussians in their fourth moment. This leaves open the important general case of underdetermined ICA, namely where $A$ is not necessarily square or full-dimensional, i.e., the observations $x$ are projections to a lower-dimensional space; in the case of the cuboid example, we only see samples from an (unknown) projection of a transformed cuboid.
In this paper, we give a polynomial-time algorithm that (a) works for any transformation $A$ provided the columns of the linear transformation satisfy a natural extension of linear independence, (b) does not need the fourth moment assumption, and (c) is robust to Gaussian noise. As far as we know, this is the first polynomial-time algorithm for underdetermined ICA. The central object of our study is a higher derivative tensor of suitable functions of the Fourier transform of the distribution. Our main algorithmic technique is an efficient tensor decomposition method for pairs of tensors that share the same vectors in their respective rank-1 decompositions. We call our general technique Fourier PCA. This is motivated by the fact that in the base case of second derivatives, it reduces to PCA of a reweighted covariance matrix. As a second application, Fourier PCA gives an alternative algorithm for learning a mixture of spherical Gaussians, under the assumption that the means of the component Gaussians are linearly independent. Hsu and Kakade [39] already gave an algorithm for this problem based on third moments; our algorithm has the benefit of being robust to Gaussian noise. We now discuss these problems and prior work, then present our results in more detail.

ICA. Blind source separation is a fundamental problem in diverse areas ranging from signal processing to neuroscience to machine learning. In this problem, a set of source signals are mixed in an unknown way, and one would like to recover the original signals or understand how they were mixed given the observations of the mixed signals. Perhaps the most influential formalization of this problem is ICA (see the books [26, 41] for comprehensive introductions).
In the basic formulation of ICA, one has a random vector $s \in \mathbb{R}^n$ (the source signal) whose components are independent random variables with unknown distributions. Let $s^{(1)}, s^{(2)}, \ldots$ be independent samples of $s$. One observes mixed samples $As^{(1)}, As^{(2)}, \ldots$ obtained by mixing the components of $s$ by an unknown invertible $n \times n$ mixing matrix $A$. The goal is to recover $A$ to the extent possible, which would then also give us $s^{(1)}, s^{(2)}, \ldots$, or some approximations thereof. One cannot hope to recover $A$ in case more than one of the $s_i$ are Gaussian; in this case any set of orthogonal directions in the corresponding subspace would also be consistent with the model. Necessarily, then, all ICA algorithms must require that the component distributions differ from being Gaussian in some fashion. A number of algorithms have been devised in the ICA community for this problem. The literature is vast and we refer to [26] for a comprehensive account. The ICA problem has been studied rigorously in theoretical computer science in several previous papers [35, 54, 9, 12, 5]. All of these algorithms either assume that the component distribution is a very specific one [54, 9], or assume that its kurtosis (fourth cumulant) is bounded away from 0, in effect assuming that its fourth moment is bounded away from that of a Gaussian. The application of tensor decomposition to ICA has its origins in work by Cardoso [21], and similar ideas were later discovered by Chang [24] in the context of phylogenetic reconstruction and developed further in several works, e.g., Mossel and Roch [53], Anandkumar et al. [3], Hsu and Kakade [39] for various latent variable models. Arora et al. [9] and Belkin et al. [12] show how to make the algorithm resistant to unknown Gaussian noise.
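To fix notation, the generative model can be simulated in a few lines. This is a minimal sketch of our own; the uniform choice of source distribution below is for illustration only (any non-Gaussian product distribution fits the model), and the function name is ours:

```python
import numpy as np

def ica_samples(A, N, seed=0):
    """Draw N samples from the ICA model x = A s, with s a product distribution.

    Here each s_i is uniform on [-sqrt(3), sqrt(3)] (zero mean, unit variance,
    non-Gaussian) -- an illustrative choice, not required by the model."""
    n, m = A.shape
    rng = np.random.default_rng(seed)
    S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(N, m))  # independent sources
    return S @ A.T                                          # each row is x = A s
```

With $A$ orthonormal the observed samples are isotropic, which is exactly the situation where second moments carry no information and higher-order structure must be used.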
Underdetermined ICA, where the transformation matrix $A$ is not square or invertible (i.e., it includes a projection), is an important general problem, and there are a number of algorithms proposed for it in the signal processing literature, many of them quite sophisticated. However, none of them is known to have rigorous guarantees on the sample or time complexity, even for special distributions. See, e.g., Chapter 9 of [26] for a review of existing algorithms and identifiability conditions for underdetermined ICA. For example, FOOBI [22, 32] uses fourth-order correlations, and its analysis is done only for the exact setting, without analyzing the robustness of the algorithm when applied to a sample and without bounding the sample and time complexity for a desired level of error. In addition, the known sufficient condition for the success of FOOBI is stronger than ours (and more elaborate). We mention two other related papers [27, 2].

Gaussian mixtures. Gaussian mixtures are a popular model in statistics. A distribution $F$ in $\mathbb{R}^n$ is modeled as a convex combination of unknown Gaussian components. Given i.i.d. samples from $F$, the goal is to learn all its parameters, i.e., the means, covariances and mixing weights of the components. A classical result in statistics says that Gaussian mixtures with distinct parameters are uniquely identifiable, i.e., as the number of samples goes to infinity, there is a unique decomposition of $F$ into Gaussian components. It has been established that the sample complexity grows exponentially in $m$, the number of components [14, 13, 42, 51]. In a different direction, under assumptions of separable components, a mixture is learnable in time polynomial in all parameters [64, 28, 58, 30, 25, 19].
Our work here is motivated by Hsu and Kakade's algorithm [39], which uses a tensor constructed from the first three moments of the distribution and works for a mixture of spherical Gaussians with linearly independent means.

Robust tensor decomposition. As a core subroutine for all the problems above, we develop a general theory of efficient tensor decompositions for pairs of tensors, which allows us to recover a rank-1 tensor decomposition from two homogeneous tensor relations. As noted in the literature, such a pair of tensor equations can be obtained from one tensor equation by applying two random vectors to the original equation, losing one in the order of the tensor. Our tensor decomposition "flattens" these tensors to matrices and performs an eigenvalue decomposition. The matrices in question are not Hermitian or even normal, and hence we use more general methods for eigendecomposition (in particular, tensor power iterations cannot be used to find the desired decompositions). The algorithm for tensor decomposition via simultaneous tensor diagonalization is essentially due to Leurgans et al. [48]; to the best of our knowledge, ours is the first robust analysis. In subsequent work, Bhaskara et al. [15] have outlined a similar robustness analysis with a different application.

1.1 Results

We begin with fully determined ICA. Unlike most of the literature on ICA, which employs moments, we do not require that our underlying random variables $s_i$ differ from a Gaussian at the fourth moment. In fact, our algorithm can deal with differences from being Gaussian at any moment, though the computational and sample complexities are higher when the differences are at higher moments. We will use cumulants as a notion of difference from being a Gaussian.
The cumulant of a random variable $x$ at order $r$, denoted $\mathrm{cum}_r(x)$, is the $r$th moment with some additional subtractions of polynomials of lower moments. The following is a short statement of our result for fully determined ICA (i.e., the mixing matrix $A$ is invertible); the full statement appears later as Theorem 4.2. The algorithm takes as input the samples generated according to the ICA model and parameters $\epsilon, \Delta, M, k$ and an estimate of $\sigma_n(A)$.

Theorem 1.1. Let $x \in \mathbb{R}^n$ be given by an ICA model $x = As$ where $A \in \mathbb{R}^{n \times n}$, the columns of $A$ have unit 2-norm, and $\sigma_n(A) > 0$. Suppose that for each $s_i$ there exists a $k_i \le k$ such that $|\mathrm{cum}_{k_i}(s_i)| \ge \Delta > 0$ and $\mathbb{E}|s_i|^k \le M$. Then one can recover the columns of $A$ up to signs to accuracy $\epsilon$ in polynomial time using $\mathrm{poly}\left(n^{k^2}, M^k, 1/\Delta^k, 1/\sigma_n(A)^k, 1/\epsilon\right)$ samples, with high probability.

In the simplest setting, roughly speaking, our algorithm computes the covariance matrix of the data reweighted according to a random Fourier coefficient $e^{iu^T x}$, where $u \in \mathbb{R}^n$ is picked according to a Gaussian distribution. Our ideas are inspired by the work of Yeredor [69], who presented such an algorithm for fully determined ICA (without finite sample guarantees). The reweighted covariance matrix can also be viewed as the Hessian of the logarithm of the Fourier transform of the distribution. Using this perspective, we extend the method to underdetermined instances, problems where the apparent number of degrees of freedom seems higher than the measurement system can uniquely fix, by studying higher derivative tensors of the Fourier transform.
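The reweighted covariance matrix can be estimated directly from samples. The following is a minimal sketch of our own illustrating the estimator (not the paper's pseudocode): it computes the empirical Hessian $D^2\psi_x(u)$ of the second characteristic function $\psi_x(u) = \log \mathbb{E}\,e^{iu^T x}$ at a single point $u$:

```python
import numpy as np

def reweighted_covariance(X, u):
    """Empirical Hessian of psi(u) = log E[exp(i u^T x)] at u.

    X: (N, n) array of samples; u: (n,) frequency vector. Using
    phi = E[w], grad psi = i E[x w]/phi (with w = exp(i u^T x)), one gets
    D^2 psi = -(E[x x^T w]/phi - (E[x w]/phi)(E[x w]/phi)^T),
    a complex symmetric n x n matrix (the "Fourier-reweighted" covariance).
    """
    w = np.exp(1j * X @ u)                       # Fourier weights, shape (N,)
    phi = w.mean()                               # empirical phi(u)
    m1 = (w[:, None] * X).mean(axis=0) / phi     # E[x w] / phi
    m2 = (w[:, None, None] * X[:, :, None] * X[:, None, :]).mean(axis=0) / phi
    return -(m2 - np.outer(m1, m1))
```

For a Gaussian $N(0, \Sigma)$ this matrix is $-\Sigma$ regardless of $u$; for an ICA model $x = As$ with independent components it has the form $A\,\mathrm{diag}(\psi_{s_k}''(A_k^T u))\,A^T$, which is what makes its eigendecomposition informative.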
The use of Fourier weights has the added advantage that the resulting quantities always exist (this is the same phenomenon whereby the characteristic function of a probability distribution always exists, while the moment generating function may not), and thus the method works for all random variables, not just those with all moments finite. We then extend this to the setting where the source signal $s$ has more components than the number of measurements (Section 6); recall that in this case the transformation $A$ is a map to a lower-dimensional space. Finding provably efficient algorithms for underdetermined ICA has been an open problem. Tensor decomposition techniques such as power iteration, which are known to work in the fully determined case, cannot possibly generalize to the underdetermined case [4], as they require linear independence of the columns of $A$, which means that they can handle at most $n$ source variables. Our approach is based on tensor decomposition, usually defined as follows: given a tensor $T \in \mathbb{R}^{n \times \cdots \times n}$ which has the rank-1 expansion
\[
T = \sum_{i=1}^m \mu_i\, A_i \otimes \cdots \otimes A_i,
\]
compute the vectors $A_i \in \mathbb{R}^n$. Here $\otimes$ denotes the usual outer product, so that entrywise $[v \otimes \cdots \otimes v]_{i_1,\ldots,i_r} = v_{i_1} \cdots v_{i_r}$. Our main idea here is that we do not attempt to decompose a single tensor into its rank-1 components. This is an NP-hard problem in general, and to make it tractable, previous work uses additional information and structural assumptions, which do not hold in the underdetermined setting or place strong restrictions on how large $m$ can be as a function of $n$. Instead, we consider two tensors which share the same rank-1 components and compose the tensors in a specific way, thereby extracting the desired rank-1 components. In the following, $\mathrm{vec}\left(A_i^{\otimes d/2}\right)$ denotes the tensor $A_i^{\otimes d/2}$ flattened into a vector.
The algorithm's input consists of the tensors $T_\mu, T_\lambda$ and parameters $n, m, d, \Delta, \epsilon$, as explained in the following theorem.

Theorem 1.2 (Tensor decomposition). Let $A$ be an $n \times m$ matrix with $m > n$ and columns of unit norm, and let $T_\mu, T_\lambda \in \mathbb{R}^{n \times \cdots \times n}$ be order-$d$ tensors, with $d \in 2\mathbb{N}$, such that
\[
T_\mu = \sum_{i=1}^m \mu_i A_i^{\otimes d}, \qquad T_\lambda = \sum_{i=1}^m \lambda_i A_i^{\otimes d},
\]
where the vectors $\mathrm{vec}\left(A_i^{\otimes d/2}\right)$ are linearly independent, $\mu_i, \lambda_i \neq 0$, and $\left|\mu_i/\lambda_i - \mu_j/\lambda_j\right| > \Delta > 0$ for all $i \neq j$. Then algorithm TensorDecomposition($T_\mu, T_\lambda$) outputs vectors $A'_1, \ldots, A'_m$ with the following property: there is a permutation $\pi : [m] \to [m]$ and signs $\alpha : [m] \to \{-1, 1\}$ such that for each $i \in [m]$ we have
\[
\left\| \alpha_i A'_{\pi(i)} - A_i \right\|_2 \le \epsilon.
\]
The running time is $\mathrm{poly}\left(n^d, 1/\epsilon, 1/\Delta, 1/\sigma_{\min}(A^{\odot d/2})\right)$.

The polynomial in the running time above can be made explicit. It basically comes from the time complexity of SVD and eigenvector decomposition of diagonalizable matrices. We note that, in contrast to previous work on tensor decompositions [37, 33, 23, 59], our method has provable finite sample guarantees. We give a robust version of the above, stated as Theorem 5.4. To apply this to underdetermined ICA, we form tensors associated with the ICA distribution as inputs to our pairwise tensor decomposition algorithm. The particular tensors that we use are the derivative tensors of the second characteristic function evaluated at random points. Our algorithm can handle unknown Gaussian noise. The ICA model with Gaussian noise is given by $x = As + \eta$, where $\eta \sim N(0, \Sigma)$ is independent Gaussian noise with unknown general covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$. Also, our result does not need full independence of the $s_i$; it is sufficient to have $d$-wise independence. The following is a short statement of our result for underdetermined ICA; the full statement appears later as Theorem 6.8 (but without noise).
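To illustrate the flatten-and-eigendecompose step behind the theorem, here is a minimal sketch of our own for the base matrix case ($d = 2$, $m \le n$, so no flattening is needed); the function name and the pseudoinverse formulation are ours, and the paper's actual TensorDecomposition routine applies this step to the flattened tensors, with $A$ replaced by the Khatri-Rao power $A^{\odot d/2}$:

```python
import numpy as np

def pairwise_diagonalization(Mmu, Mlam, m):
    """Recover, up to sign and permutation, unit-norm columns A_i from a pair
    Mmu = A diag(mu) A^T and Mlam = A diag(lam) A^T, assuming A has full
    column rank and the ratios mu_i/lam_i are distinct and nonzero.
    """
    # Mmu @ pinv(Mlam) = A diag(mu/lam) A^+, whose eigenvectors with
    # eigenvalue mu_i/lam_i are exactly the columns A_i.
    M = Mmu @ np.linalg.pinv(Mlam)
    vals, vecs = np.linalg.eig(M)
    idx = np.argsort(-np.abs(vals))[:m]   # keep the m nontrivial eigenvectors
    cols = vecs[:, idx].real              # eigenvalues are real here; .real guards roundoff
    return cols / np.linalg.norm(cols, axis=0)
```

Note that $M$ is not symmetric, consistent with the remark above that general (non-Hermitian) eigendecomposition is needed; the eigenvalue gap $\Delta$ in the theorem is what keeps this step well-conditioned.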
Its extension to handling Gaussian noise is in Sec. 4.8. The input to the algorithm, apart from the samples generated according to the unknown noisy underdetermined ICA model, consists of several parameters whose meaning will be clear in the theorem statement below: a tensor order parameter $d$, the number of signals $m$, an accuracy parameter $\epsilon$, a confidence parameter $\delta$, bounds $M$ and $\Delta$ on moments and cumulants, an estimate of the conditioning parameter $\sigma_m$, and a moment order $k$. The notation $A^{\odot d}$ used below is explained in the preliminaries section; briefly, it is an $n^d \times m$ matrix with each column obtained by flattening $A_i^{\otimes d}$ into a vector.

Theorem 1.3. Let $x \in \mathbb{R}^n$ be given by an underdetermined ICA model with unknown Gaussian noise, $x = As + \eta$, where $A \in \mathbb{R}^{n \times m}$ with unit norm columns and the covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$ are unknown. Let $d \in 2\mathbb{N}$ be such that $\sigma_m(A^{\odot d/2}) > 0$. Let $M_k, M_d, M_{2d}$ and $k > d$ be such that for each $s_i$ there is a $k_i$ satisfying $d < k_i < k$ and $|\mathrm{cum}_{k_i}(s_i)| \ge \Delta$, and moreover $\mathbb{E}|s_i|^{k_i}, \sigma_1(\Sigma)^k \le M_k$, $\mathbb{E}|s_i|^d \le M_d$, and $\mathbb{E}|s_i|^{2d} \le M_{2d}$. Then one can recover the columns of $A$ up to $\epsilon$ accuracy in 2-norm and up to sign using
\[
\mathrm{poly}\left(m^k,\; M_k^d,\; M_{2d},\; 1/\Delta,\; 1/\sigma_m(A^{\odot d/2})^k,\; 1/\epsilon,\; 1/\sigma^k\right)
\]
samples, with similar polynomial time complexity, with probability at least $3/4$, where $0 < \sigma < \Delta / \left( M_k\, \mathrm{poly}\left(m^k, \sigma_m(A^{\odot d/2})^k, 1/k^k\right) \right)$.

The probability of success of the algorithm can be boosted from $3/4$ to $1 - \delta$ for any $\delta > 0$ by taking $O(\log(1/\delta))$ independent runs of the algorithm and using an adaptation of the "median" trick (see, e.g., Thm 2.8 in [49]). To our knowledge, this is the first polynomial-time algorithm for underdetermined ICA with provable finite sample guarantees. It works under mild assumptions on the input distribution and nondegeneracy assumptions on the mixing matrix $A$.
The assumption says that the columns of the matrix, when tensored up individually, are linearly independent. For example, with $d = 4$, suppose that every $s_i$ differs from a Gaussian in the fifth or higher moment by $\Delta$; then we can recover all the components as long as the vectors $\mathrm{vec}\left(A_i A_i^T\right)$ are linearly independent. Thus the number of components that can be recovered can be as high as $m = n(n+1)/2$. Clearly, this is a weakening of the standard assumption that the columns of $A$ are linearly independent. This assumption can be regarded as a certain incoherence type assumption. Moreover, in a sense it is a necessary and sufficient condition: the ICA problem is solvable for a matrix $A$ if and only if any two columns are linearly independent [26], and this turns out to be equivalent to the existence of a finite $d$ such that $A^{\odot d}$ has full column rank. A well-known condition in the literature on tensor decomposition is Kruskal's condition [45]. Unlike that condition, it is easy to check whether a matrix satisfies our assumption (for a fixed $d$). Our assumption is true generically: for a randomly chosen matrix $A \in \mathbb{R}^{n \times \binom{n}{d}}$ (e.g., each entry being i.i.d. standard Gaussian), we have $\sigma_{\min}(A^{\odot d}) > 0$ with probability 1. In a similar vein, for a randomly chosen matrix $A \in \mathbb{R}^{n \times \binom{n}{d}}$ its condition number is close to 1 with high probability; see Theorem 9.2 for a precise statement and proof. Moreover, our assumption is also robust in the smoothed sense [7]: if we start with an arbitrary matrix $M \in \mathbb{R}^{n \times \binom{n}{2}}$ and perturb it with a noise matrix $N \in \mathbb{R}^{n \times \binom{n}{2}}$ with each entry independently chosen from $N(0, \sigma^2)$, then we have $\sigma_{\min}\left((M + N)^{\odot 2}\right) \ge \sigma^2 / n^{O(1)}$ with probability at least $1 - 1/n^{\Omega(1)}$, and a similar generalization holds for higher powers.
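The condition is indeed easy to check for a given matrix and fixed $d$: form the Khatri-Rao power and look at its smallest relevant singular value. A short sketch under the definitions above (function names are ours):

```python
import numpy as np

def khatri_rao_power(A, d):
    """Columnwise Kronecker (Khatri-Rao) power: column i is vec(A_i^{tensor d})."""
    K = A.copy()
    for _ in range(d - 1):
        # columnwise Kronecker product of K and A
        K = np.einsum('ik,jk->ijk', K, A).reshape(-1, A.shape[1])
    return K

def nondegenerate(A, d, tol=1e-10):
    """Check the nondegeneracy condition: the columns of A^{Khatri-Rao d}
    are linearly independent, i.e. its m-th singular value exceeds tol."""
    s = np.linalg.svd(khatri_rao_power(A, d), compute_uv=False)
    return s[A.shape[1] - 1] > tol
```

A repeated (or parallel) pair of columns makes the check fail for every $d$, matching the identifiability discussion above.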
This follows from a simple application of the anti-concentration properties of polynomials in independent random variables; see [7] for a proof. See also [15].

As in the fully determined ICA setting, we require that our random variables have some cumulant different from a Gaussian. One aspect of our result is that using the $d$th derivative, one loses the ability to detect non-Gaussian cumulants at order $d$ and lower; on the other hand, a theorem of Marcinkiewicz [50] implies that this does not cause a problem.

Theorem 1.4 (Marcinkiewicz). Suppose that a random variable $x \in \mathbb{R}$ has only a finite number of nonzero cumulants. Then $x$ is distributed according to a Gaussian, and every cumulant of order greater than 2 vanishes.

Thus, even if we miss the difference in cumulants at orders $i \le d$, there is some higher order cumulant which is nonzero, and hence non-Gaussian. Note also that for many specific instances of the ICA problem studied in the literature, all cumulants differ from those of a Gaussian [35, 54, 9].

We remark that apart from the direct practical interest of ICA in signal recovery, recently some new applications of ICA as an algorithmic primitive have been discovered. Anderson et al. [8] reduce some special cases of the problem of learning a convex body (coming from a class of convex bodies such as simplices), given uniformly distributed samples from the body, to fully determined ICA. Anderson et al. [7] solve the problem of learning Gaussian mixture models in regimes for which there were previously no efficient algorithms known. This is done by reduction to underdetermined ICA using the results of our paper. Our final result applies the same method to learning mixtures of spherical Gaussians (see the full version).
Using Fourier PCA, we recover the result of Hsu and Kakade [39] and extend it to the setting of noisy mixtures, where the noise itself is an unknown arbitrary Gaussian. Our result can be viewed as saying that reweighted PCA gives an alternative algorithm for learning such mixtures.

Theorem 1.5. Fourier PCA for Mixtures, applied to a mixture of $k < n$ spherical Gaussians $N(\mu_i, \sigma_i^2 I_n)$, recovers the parameters of the mixture to desired accuracy $\epsilon$ using time and samples polynomial in $k, n, 1/\epsilon$, with high probability, assuming that the means $\mu_i$ are linearly independent.

Thus, overall, our contributions can be viewed as twofold. The first part is a robust, efficient tensor decomposition technique. The second is the analysis of the spectra of matrices/tensors arising from Fourier derivatives: in particular, showing that the eigenvalue gaps are significant, based on anticoncentration of polynomials in Gaussian space, and that these matrices, even when obtained from samples, remain diagonalizable.

2 Preliminaries

For a positive integer $n$, the set $\{1, \ldots, n\}$ is denoted by $[n]$. The set of positive even numbers is denoted by $2\mathbb{N}$. We assume for simplicity, and without real loss of generality, that $\mathbb{E}(s_j) = 0$ for all $j$. We can ensure this by working with samples $x_i - \bar{x}$ instead of the original samples $x_i$ (here $\bar{x}$ is the empirical average of the samples). There is a slight loss of generality because using $\bar{x}$ (as opposed to using $\mathbb{E}(x)$) introduces small errors. These errors can be analysed along with the rest of the errors and do not introduce any new difficulties.

Probability. For a random vector $x \in \mathbb{R}^n$ and $u \in \mathbb{R}^n$, its characteristic function $\phi_x : \mathbb{R}^n \to \mathbb{C}$ is defined by $\phi_x(u) = \mathbb{E}_x\left(e^{iu^T x}\right)$. Unlike the moment generating function, the characteristic function is well-defined even for random variables without all moments finite.
The second characteristic function of $x$ is defined by $\psi_x(u) := \log \phi_x(u)$, where we take the branch of the complex logarithm that makes $\psi(0) = 0$. In addition to the random variable $x$ above, we will also consider a random variable $s \in \mathbb{R}^m$ related to $x$ via $x = As$ for $A \in \mathbb{R}^{n \times m}$, and the functions associated with it: the characteristic function $\phi_s(t) = \mathbb{E}_s\left(e^{it^T s}\right)$ and the second characteristic function $\psi_s(t) = \log \phi_s(t)$.

Let $\mu_j := \mathbb{E}(x^j)$. Cumulants of $x$ are polynomials in the moments of $x$, which we now define. For $j \ge 1$, the $j$th cumulant is denoted $\mathrm{cum}_j(x)$. Some examples:
\[
\mathrm{cum}_1(x) = \mu_1, \qquad \mathrm{cum}_2(x) = \mu_2 - \mu_1^2, \qquad \mathrm{cum}_3(x) = \mu_3 - 3\mu_2\mu_1 + 2\mu_1^3.
\]
As can be seen from these examples, the first two cumulants are the same as the expectation and the variance, respectively. Cumulants have the property that for two independent random variables $x, y$ we have $\mathrm{cum}_j(x + y) = \mathrm{cum}_j(x) + \mathrm{cum}_j(y)$ (assuming that the first $j$ moments exist for both $x$ and $y$). The first two cumulants of the standard Gaussian distribution have values 0 and 1, and all subsequent cumulants have value 0. Since ICA is not possible if all the independent component distributions are Gaussians, we need some measure of the distance of the component distributions from Gaussian. A convenient measure turns out to be the distance from 0 (i.e., the absolute value) of the third or higher cumulants. If all the moments of $x$ exist, then the second characteristic function admits a Taylor expansion in terms of cumulants:
\[
\psi_x(u) = \sum_{j \ge 1} \mathrm{cum}_j(x) \frac{(iu)^j}{j!}.
\]
This can also be used to define cumulants of all orders.

Matrices. For a vector $\mu = (\mu_1, \ldots, \mu_n)$, let $\mathrm{diag}(\mu)$ and $\mathrm{diag}(\mu_j)$, where $j$ is an index variable, denote the $n \times n$ diagonal matrix with the diagonal entries given by the components of $\mu$.
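For concreteness, the first few cumulants can be computed from empirical raw moments. A small sketch of our own; the $\mathrm{cum}_4$ formula, not written out in the text above, is the standard moment-to-cumulant expansion stated here for completeness:

```python
import numpy as np

def cumulants(x, upto=4):
    """First `upto` (at most 4) cumulants of a 1-d sample, from raw moments:
    cum1 = m1, cum2 = m2 - m1^2, cum3 = m3 - 3 m2 m1 + 2 m1^3,
    cum4 = m4 - 4 m3 m1 - 3 m2^2 + 12 m2 m1^2 - 6 m1^4.
    """
    m = [np.mean(x**j) for j in range(1, 5)]
    c = [m[0],
         m[1] - m[0]**2,
         m[2] - 3*m[1]*m[0] + 2*m[0]**3,
         m[3] - 4*m[2]*m[0] - 3*m[1]**2 + 12*m[1]*m[0]**2 - 6*m[0]**4]
    return c[:upto]
```

As the text notes, for Gaussian data every cumulant beyond the second is 0, while, e.g., the uniform distribution on $[-1, 1]$ has $\mathrm{cum}_4 = -2/15$; this nonzero higher cumulant is exactly the kind of distance from Gaussianity the algorithms exploit.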
The singular values of an $m \times n$ matrix will always be ordered in decreasing order: $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{\min(m,n)}$. Our matrices will often have rank $m$, and thus the nonzero singular values will often, but not always, be $\sigma_1, \ldots, \sigma_m$. The span of the column vectors of a matrix $A$ will be denoted $\mathrm{colspan}(A)$. The columns of a matrix $A$ are denoted $A_1, A_2, \ldots$. The potentially ambiguous but convenient notation $A_i^T$ means $(A_i)^T$. The condition number of a matrix $A$ is $\kappa(A) := \sigma_{\max}(A)/\sigma_{\min}(A)$, where $\sigma_{\max}(A) := \sigma_1(A)$ and $\sigma_{\min}(A) := \sigma_{\min(m,n)}(A)$.

Tensors and tensor decomposition. Here we introduce the various tensor notions that we need; these are discussed in detail in the review paper [44]. An order-$d$ tensor $T$ is an array indexed by $d$ indices, each taking $n$ values (e.g., when $d = 2$, $T$ is simply a matrix of size $n \times n$). Thus it has $n^d$ entries. Tensors considered in this paper are symmetric, i.e., $T_{i_1,\ldots,i_d}$ is invariant under permutations of $i_1, \ldots, i_d$. In the sequel we will generally not explicitly mention that our tensors are symmetric. We also note that symmetry of tensors is not essential for our results, but for our application to ICA it suffices to look only at symmetric tensors, and the results generalize easily to the general case, at the cost of additional notation. We can also view a tensor as a degree-$d$ homogeneous form over vectors $u \in \mathbb{R}^n$ defined by
\[
T(u, \ldots, u) = \sum_{i_1, \ldots, i_d} T_{i_1,\ldots,i_d}\, u_{i_1} \cdots u_{i_d}.
\]
This is in analogy with matrices, where every matrix $A$ defines a quadratic form $u^T A u = A(u, u) = \sum_{i,j} A_{i,j} u_i u_j$. We use the outer product notation $v^{\otimes d} = v \otimes \cdots \otimes v$ ($d$ copies), where entrywise we have $[v \otimes \cdots \otimes v]_{j_1,\ldots,j_d} = v_{j_1} \cdots v_{j_d}$.
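The homogeneous form can be evaluated by contracting the tensor with $u$ once per index; a short helper of our own, assuming the conventions just introduced:

```python
import numpy as np

def apply_form(T, u):
    """Evaluate T(u, ..., u) = sum over i1..id of T[i1,...,id] * u[i1] ... u[id]
    by contracting the first remaining axis with u, once per tensor order."""
    out = T
    for _ in range(T.ndim):
        out = np.tensordot(out, u, axes=([0], [0]))
    return float(out)
```

For $d = 2$ this reduces to the quadratic form $u^T A u$, and for a rank-1 tensor $v^{\otimes d}$ it evaluates to $(v^T u)^d$, matching the entrywise definition of the outer product above.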
A (symmetric) rank-1 decomposition of a tensor $T_\mu$ is defined by
\[
T_\mu = \sum_{i=1}^m \mu_i A_i^{\otimes d}, \tag{1}
\]
where the $\mu_i \in \mathbb{R}$ are nonzero and the $A_i \in \mathbb{R}^n$ are vectors which are not multiples of each other. Such a decomposition always exists for all symmetric tensors with $m \le n^d$ (better bounds are known, but we won't need them). For example, for a symmetric matrix, by the spectral theorem we have
\[
M = \sum_{i=1}^n \lambda_i\, v_i \otimes v_i.
\]
We will use the notion of flattening of tensors. Instead of giving a formal definition, it is more illuminating to give examples. E.g., for $d = 4$, construct a bijection $\tau : [n^2] \to [n] \times [n]$ as $\tau(k) = \left(\lfloor k/n \rfloor,\; k - n\lfloor k/n \rfloor\right)$ and $\tau^{-1}(i, j) = ni + j$. We then define a packing of a matrix $B \in \mathbb{R}^{n \times n}$ into a vector $p \in \mathbb{R}^{n^2}$ by $B_{\tau(k)} = p_k$. For convenience we will say that $B = \tau(p)$ and $p = \tau^{-1}(B)$. We also define a packing of $T \in \mathbb{R}^{n \times n \times n \times n}$ into a matrix $M \in \mathbb{R}^{n^2 \times n^2}$ by $M_{a,b} = T_{\tau(a),\tau(b)}$ for $a, b \in [n^2]$. Note that $M$ is symmetric because $T$ is symmetric with respect to all permutations of indices: $M_{a,b} = T_{\tau(a),\tau(b)} = T_{\tau(b),\tau(a)} = M_{b,a}$. The definition of $\tau$ depends on the dimensions and order of the tensor and what it is being flattened into; this will be clear from the context and will not be further elaborated upon. Finally, to simplify notation, we will employ the Khatri-Rao power of a matrix:
\[
A^{\odot d} := \left[ \mathrm{vec}\left(A_1^{\otimes d}\right) \,\middle|\, \mathrm{vec}\left(A_2^{\otimes d}\right) \,\middle|\, \cdots \,\middle|\, \mathrm{vec}\left(A_m^{\otimes d}\right) \right],
\]
where recall that $\mathrm{vec}(T)$ for a tensor $T$ is a flattening of $T$, i.e., we arrange the entries of $T$ in a single column vector.

Derivatives. For $g : \mathbb{R}^n \to \mathbb{R}$, we will use the abbreviation $\partial_{u_i} g(u_1, \ldots, u_n)$ for $\frac{\partial g(u_1,\ldots,u_n)}{\partial u_i}$; when the variables are clear from the context, we will further shorten this to $\partial_i g$. Similarly, $\partial_{i_1,\ldots,i_k} g$ denotes $\partial_{i_1}(\ldots(\partial_{i_k} g)\ldots)$, and for a multiset $S = \{i_1,
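The packing $\tau$ above is exactly what a row-major reshape computes; a small sketch of our own for the $d = 4$ example:

```python
import numpy as np

def flatten4(T):
    """Pack an order-4 tensor T (shape n x n x n x n) into the n^2 x n^2 matrix
    M[a, b] = T[tau(a), tau(b)] with tau(k) = (k // n, k - n * (k // n)).
    A row-major (C-order) reshape realizes exactly this bijection."""
    n = T.shape[0]
    return T.reshape(n * n, n * n)
```

As a check, flattening $v^{\otimes 4}$ gives the rank-1 matrix $\mathrm{vec}(vv^T)\,\mathrm{vec}(vv^T)^T$, and the result is symmetric whenever $T$ is, as argued above.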
, i k } , this will also be deno ted b y ∂ S g , which mak es sense because ∂ i 1 ,...,i k g is inv aria nt under reo rderings o f i 1 , . . . , i k . W e will not use any sp ecia l no ta tion for multisets; what is meant will b e clear from the co ntext. D u g ( u ) denotes the gra dient vector ( ∂ u 1 g ( u ) , . . . , ∂ u n g ( u )), and D 2 u g ( u ) denotes the Hessian matrix [ ∂ u i ∂ u j g ( u )] ij . Mor e genera lly , D d u g ( u ) denotes the order d der iv ative tensor given b y [ D d u g ( u )] i 1 ,...,i d = ∂ u i 1 . . . ∂ u i d g ( u ). Deriv ativ e s and li near transforms. W e ar e particularly interested in how the deriv ative tensor changes under linear tra nsform of the arg ument s. W e state things over the real field, but everything ca rries ov er to the complex field as well. Let g : R n → R and f : R m → R b e tw o functions such tha t all the der iv atives that we consider be low exist. Let A ∈ R n × m and let v a riables t ∈ R m and u ∈ R n be related by linea r relation t = A T u , and let the function f and g b e r e lated b y g ( u ) = f ( A T u ) = f ( t ). Then for j ∈ [ n ] ∂ u j g ( u ) = ∂ u j f (( A 1 ) T u, . . . , ( A m ) T u ) = ∂ ( A 1 ) T u ∂ u j ∂ t 1 f ( t ) + . . . + ∂ ( A m ) T u ∂ u j ∂ t m f ( t ) = A j 1 ∂ t 1 f ( t ) + . . . + A j m ∂ t m f ( t ) = X k ∈ [ m ] A j k ∂ t k f ( t ) . Applying the previous equtio n twice for i, j ∈ [ n ] gives ∂ u i ∂ u j g ( u ) = ∂ u i X k ∈ [ m ] A j k ∂ t k f ( t ) = X k ∈ [ m ] A j k ∂ t k ( ∂ u i f ( t )) = X k ∈ [ m ] A j k X ℓ ∈ [ m ] A il ∂ t k ∂ t ℓ f ( t ) = X ℓ,k ∈ [ m ] A iℓ A j k ∂ t ℓ ∂ t k f ( t ) , and applying it four times for i 1 , i 2 , i 3 , i 4 ∈ [ n ] gives ∂ u i 1 ∂ u i 2 ∂ u i 3 ∂ u i 4 g ( u ) = X k 1 ,k 2 ,k 3 ,k 4 ∈ [ m ] A i 1 k 1 A i 2 k 2 A i 3 k 3 A i 4 k 4 ∂ t k 1 ∂ t k 2 ∂ t k 3 ∂ t k 4 f ( t ) . 
(2)

This can be written more compactly as a matrix equation,
$$D_u^4 g(u) = A^{\otimes 2}\, \big(D_t^4 f(t)\big)\, (A^{\otimes 2})^T,$$
where we interpret both $D_u^4 g(u)$ and $D_t^4 f(t)$ as appropriately flattened into matrices. A useful special case occurs when $f$ has the property that $\partial_{t_i} \partial_{t_j} f(t) = 0$ whenever $i \neq j$. In this case (2) can be rewritten as
$$\partial_{u_{i_1}} \partial_{u_{i_2}} \partial_{u_{i_3}} \partial_{u_{i_4}} g(u) = \sum_{k \in [m]} A_{i_1 k} A_{i_2 k} A_{i_3 k} A_{i_4 k}\, \partial_{t_k}^4 f(t),$$
and in matrix notation
$$D_u^4 g(u) = A^{\odot 2}\, \operatorname{diag}\big(\partial_{t_1}^4 f(t), \dots, \partial_{t_m}^4 f(t)\big)\, (A^{\odot 2})^T,$$
where again we interpret $D_u^4 g(u)$ as flattened into a matrix.

The previous equations readily generalize to higher derivatives. For $d \geq 1$, interpreting the tensors $D_u^{2d} g(u)$ and $D_t^{2d} f(t)$ as flattened into matrices, we have
$$D_u^{2d} g(u) = A^{\otimes d}\, \big(D_t^{2d} f(t)\big)\, (A^{\otimes d})^T, \qquad (3)$$
and if $f$ has the property that $\partial_{t_i} \partial_{t_j} f(t) = 0$ whenever $i \neq j$, then
$$D_u^{2d} g(u) = A^{\odot d}\, \operatorname{diag}\big(\partial_{t_1}^{2d} f(t), \dots, \partial_{t_m}^{2d} f(t)\big)\, (A^{\odot d})^T. \qquad (4)$$
In our applications we will use the above equations in the case $g(u) = \psi_x(u)$ and $f(t) = \psi_s(t)$, where these notions were defined at the beginning of this section. The above equations then become
$$D_u^{2d} \psi_x(u) = A^{\otimes d}\, \big(D_t^{2d} \psi_s(t)\big)\, (A^{\otimes d})^T. \qquad (5)$$
In the special case when the components of $s$ are independent, we have $f(t) = \log \mathbb{E}\, e^{i t_1 s_1} + \cdots + \log \mathbb{E}\, e^{i t_m s_m}$, so $\partial_{t_i} \partial_{t_j} \psi_s(t) = 0$ whenever $i \neq j$, and this gives
$$D_u^{2d} \psi_x(u) = A^{\odot d}\, \operatorname{diag}\big(\partial_{t_1}^{2d} \psi_s(t), \dots, \partial_{t_m}^{2d} \psi_s(t)\big)\, (A^{\odot d})^T. \qquad (6)$$

3 Algorithms

In this section, we present our main new algorithms and outline their analysis. For the reader's convenience, we will restate these algorithms in the sections where their analysis appears.
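Before moving on, the flattening machinery above is easy to sanity-check numerically. The sketch below (in numpy; the helper names `vec` and `khatri_rao_power` are ours, not from the paper) takes $f(t) = \sum_k c_k t_k^4$, for which $\partial_{t_i}\partial_{t_j} f = 0$ when $i \neq j$ and $\partial_{t_k}^4 f = 24 c_k$, builds the order-4 derivative tensor of $g(u) = f(A^T u)$ directly as $\sum_k 24 c_k A_k^{\otimes 4}$, and verifies that its row-major flattening agrees with identity (4):

```python
import numpy as np

def vec(T):
    # flatten a tensor into a vector in row-major order (tau^{-1}(i, j) = n*i + j)
    return T.reshape(-1)

def khatri_rao_power(A, d):
    # column-wise Khatri-Rao power: the i-th column is vec(A_i^{tensor power d})
    cols = []
    for i in range(A.shape[1]):
        out = A[:, i]
        for _ in range(d - 1):
            out = np.outer(out, A[:, i]).reshape(-1)
        cols.append(out)
    return np.column_stack(cols)

rng = np.random.default_rng(0)
n, m = 3, 4
A = rng.standard_normal((n, m))
c = rng.standard_normal(m)

# D^4 g for g(u) = sum_k c_k ((A_k)^T u)^4, assembled term by term
D4g = np.zeros((n,) * 4)
for k in range(m):
    a = A[:, k]
    D4g += 24 * c[k] * np.einsum('i,j,k,l->ijkl', a, a, a, a)

M = D4g.reshape(n * n, n * n)                  # flattening of the tensor
K = khatri_rao_power(A, 2)
rhs = K @ np.diag(24 * c) @ K.T                # right-hand side of (4)
```

The agreement of `M` and `rhs` is exact (up to floating point), since both sides are assembled from the same outer products.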
As mentioned in the introduction, our ICA algorithm is based on a certain tensor decomposition algorithm.

3.1 Tensor decomposition

A fundamental result of linear algebra is that every symmetric matrix has a spectral decomposition, which allows us to write it as a sum of outer products of vectors: $A = \sum_{i=1}^n \lambda_i v_i v_i^T$; such representations are efficiently computable. Our goal, in analogy with spectral decomposition for matrices, is to recover (symmetric) rank-1 decompositions of tensors. Unfortunately, there are no known algorithms with provable guarantees when $m > n$, and in fact this problem is NP-hard in general [18, 38]. It is an open research question to characterize, or even give interesting sufficient conditions for, when a rank-1 decomposition of a tensor $T$ as in (1) is unique and computationally tractable. For the case $d = 2$, a necessary and sufficient condition for uniqueness is that the eigenvalues of $T$ are distinct. Indeed, when eigenvalues repeat, rotations of the $A_i$ within the degenerate eigensubspaces with repeated eigenvalues lead to the same matrix $M$. For $d > 2$, if the $A_i$ are orthogonal, then the expansion in (1) is unique and can be computed efficiently. The algorithm is power iteration, which recovers one $A_i$ at a time (see, e.g., [5]). The requirement that the $A_i$ be orthogonal is necessary for this algorithm, but if one also has access to the order-2 tensor (i.e., matrix) $M = \sum_{i=1}^m A_i \otimes A_i$, and the $A_i$ are linearly independent, then one can arrange for the orthogonality of the $A_i$ by a suitable linear transformation. However, the fundamental limitation remains that we must take $m \leq n$, simply because we cannot have more than $n$ orthogonal vectors in $\mathbb{R}^n$.
Here we consider a modified setting where we are allowed some additional information: suppose we have access to two tensors, both of order $d$, which share the same rank-1 components but have different coefficients:
$$T_\mu = \sum_{i=1}^m \mu_i A_i^{\otimes d}, \qquad T_\lambda = \sum_{i=1}^m \lambda_i A_i^{\otimes d}.$$
Given such a pair of tensors $T_\mu$ and $T_\lambda$, can we recover the rank-1 components $A_i$? We answer this question in the affirmative for even orders $d \in 2\mathbb{N}$, and give a provably good algorithm for this problem assuming that the ratios $\mu_i / \lambda_i$ are distinct. Additionally, we assume that the $A_i$ are not scalar multiples of each other, a necessary assumption. We make this quantitative via the $m$-th singular value of the matrix whose columns are the $A_i^{\odot d/2}$.

Our algorithm works by flattening the tensors $T_\mu$ and $T_\lambda$ into matrices $M_\mu$ and $M_\lambda$, which have the following form:
$$M_\mu = (A^{\odot d/2})\, \operatorname{diag}(\mu_i)\, (A^{\odot d/2})^T, \qquad M_\lambda = (A^{\odot d/2})\, \operatorname{diag}(\lambda_i)\, (A^{\odot d/2})^T.$$
Taking the product $M_\mu M_\lambda^{-1}$ yields a matrix whose eigenvectors are the columns of $A^{\odot d/2}$ and whose eigenvalues are the ratios $\mu_i / \lambda_i$:
$$M_\mu M_\lambda^{-1} = (A^{\odot d/2})\, \operatorname{diag}(\mu_i)\, (A^{\odot d/2})^T \big((A^{\odot d/2})^T\big)^{-1} \operatorname{diag}(\lambda_i)^{-1} (A^{\odot d/2})^{-1} = (A^{\odot d/2})\, \operatorname{diag}(\mu_i / \lambda_i)\, (A^{\odot d/2})^{-1}.$$
Strictly speaking, for the last equation to make sense one needs $A^{\odot d/2}$ to be invertible, which will generally not be the case, as $A^{\odot d/2}$ is not even a square matrix in general. We handle this by restricting $M_\mu$ and $M_\lambda$, viewed as linear transformations, to maps from their pre-image to their image. This is the reason for introducing the matrix $W$ in the subroutine Diagonalize$(M_\mu, M_\lambda)$ below. The main algorithm below is Tensor Decomposition$(T_\mu, T_\lambda)$, which flattens the tensors and calls the subroutine Diagonalize$(M_\mu, M_\lambda)$ to get estimates of the $A_i^{\odot d/2}$, and from this information recovers the $A_i$ themselves. In our application it will be the case that $\mu, \lambda \in \mathbb{C}^m$ and $A_i \in \mathbb{R}^n$.
The discussion below is tailored to this situation; the other interesting cases, where everything is real or everything is complex, can be handled with minor modifications.

Diagonalize$(M_\mu, M_\lambda)$

1. Compute the SVD $M_\mu = V \Sigma U^T$, and let $W$ consist of the left singular vectors (columns of $V$) corresponding to the $m$ largest singular values. Compute the matrix $M = (W^T M_\mu W)(W^T M_\lambda W)^{-1}$.
2. Compute the eigenvector decomposition $M = P D P^{-1}$.
3. Output the columns of $W P$.

The columns $C_i = (W P)_i$ are the eigenvectors computed in the subroutine Diagonalize. Ideally, we would like these to equal $A_i^{\odot d/2}$. Errors will be introduced because of sampling; in addition, since we are working over the complex field, we do not have control over the phase of $C_i$ (the output of Diagonalize, obtained in Step 2 of Tensor Decomposition): for any $\rho \in \mathbb{C}$ with $|\rho| = 1$, $\rho C_i$ is also a valid output of Diagonalize. In Step 3 of Tensor Decomposition, we recover the correct phase of $C_i \in \mathbb{C}^n$, which gives us a vector $C_i' \in \mathbb{R}^n$; we do this by choosing the phase maximizing the norm of the real part. In Step 4, we have $v^{\otimes d} + E$, where $E$ is an error tensor, and we want to recover $v$. When $\|E\|_F$ is sufficiently small, we can do this approximately just by reading off a one-dimensional slice of $v^{\otimes d} + E$ (e.g., a column in the case of matrices), say the one with the maximum norm.

Tensor Decomposition$(T_\mu, T_\lambda)$

1. Flatten the tensors to square matrices to obtain $M_\mu = \tau^{-1}(T_\mu)$ and $M_\lambda = \tau^{-1}(T_\lambda)$.
2. $W P = $ Diagonalize$(M_\mu, M_\lambda)$.
3. For each column $C_i$ of $W P$, let $C_i' := \operatorname{Re}\big(e^{i\theta^*} C_i\big) / \big\|\operatorname{Re}\big(e^{i\theta^*} C_i\big)\big\|$, where $\theta^* = \operatorname{argmax}_{\theta \in [0, 2\pi]} \big\|\operatorname{Re}\big(e^{i\theta} C_i\big)\big\|$.
4. For each column $C_i'$, let $v_i \in \mathbb{R}^n$ be such that $v_i^{\otimes d/2}$ is the best rank-1 approximation to $\tau(C_i')$.
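On exact (noiseless) input, the subroutine Diagonalize is a few lines of linear algebra. The sketch below (in numpy; the function name is ours) exercises it in the simplest setting $d = 2$, where the "flattened" tensors are just the matrices $M_\mu = A \operatorname{diag}(\mu) A^T$ and $M_\lambda = A \operatorname{diag}(\lambda) A^T$ and $A^{\odot d/2} = A$ itself, so the output columns should match the columns of $A$ up to scale and order:

```python
import numpy as np

def diagonalize(M_mu, M_lam, m):
    # project onto the top-m left singular subspace of M_mu, then
    # eigendecompose the ratio matrix (W^T M_mu W)(W^T M_lam W)^{-1}
    V, _, _ = np.linalg.svd(M_mu)
    W = V[:, :m]
    M = (W.T @ M_mu @ W) @ np.linalg.inv(W.T @ M_lam @ W)
    _, P = np.linalg.eig(M)
    return W @ P        # columns approximate A_i^{odot d/2}, up to scale/order

# noiseless d = 2 instance: distinct ratios mu_i / lam_i with probability 1
rng = np.random.default_rng(1)
n, m = 6, 4
A = rng.standard_normal((n, m))
mu = rng.uniform(1.0, 2.0, m)
lam = rng.uniform(1.0, 2.0, m)
C = diagonalize(A @ np.diag(mu) @ A.T, A @ np.diag(lam) @ A.T, m)
```

Here the projection by $W$ does the work of restricting the rank-deficient $M_\lambda$ to the common column space, after which the inverse in the ratio matrix is well defined.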
For the computation of eigenvectors of diagonalizable (but not normal) matrices over the complex numbers, we can employ any of several algorithms from the literature (see, for example, [36, 56] for a number of algorithms used in practice). In general, these algorithms employ the same atomic elements as in the normal case (Jacobi iterations, Householder transformations, etc.), but in more sophisticated ways. The perturbation analysis of these algorithms is substantially more involved than for normal matrices; in particular, it is not necessarily the case that a (small) perturbation of a diagonalizable matrix results in another diagonalizable matrix. We contend with all these issues in Section 5.3. In particular, we note that while the exact analysis is relatively straightforward (Theorem 5.3), a robust version that recovers the common decomposition of the given tensors takes considerable additional care (Theorem 5.4).

3.2 Underdetermined ICA

For underdetermined ICA, we compute the higher derivative tensors of the second characteristic function $\psi_x(u) = \log(\phi_x(u))$ at two random points and run the tensor decomposition algorithm from the previous section.

Underdetermined ICA$(\sigma)$

1. (Compute derivative tensors) Pick independent random vectors $\alpha, \beta \sim N(0, \sigma^2 I_n)$. For even $d$, estimate the $d$-th derivative tensors of $\psi_x(u)$ at $\alpha$ and $\beta$ as $T_\alpha = D_u^d \psi_x(\alpha)$ and $T_\beta = D_u^d \psi_x(\beta)$.
2. (Tensor decomposition) Run Tensor Decomposition$(T_\alpha, T_\beta)$.

To estimate the $2d$-th derivative tensor of $\psi_x(u)$ empirically, one simply writes down the expression for the derivative tensor and then estimates each entry from samples using the naive estimator. The analysis roughly proceeds as follows. By (6), for tensors flattened into matrices we have
$$D_u^{2d} \psi_x(\alpha) = A^{\odot d}\, \operatorname{diag}\big(\partial_{t_1}^{2d} \psi_s(A^T\alpha), \dots$$
$$\dots, \partial_{t_m}^{2d} \psi_s(A^T\alpha)\big)\, (A^{\odot d})^T \quad\text{and}\quad D_u^{2d} \psi_x(\beta) = A^{\odot d}\, \operatorname{diag}\big(\partial_{t_1}^{2d} \psi_s(A^T\beta), \dots, \partial_{t_m}^{2d} \psi_s(A^T\beta)\big)\, (A^{\odot d})^T.$$
Thus we have two tensors with shared rank-1 factors, as in the tensor decomposition algorithm above. For our tensor decomposition to work, we require that the ratios $\big(\partial_{t_j}^{2d} \psi_s(A^T\alpha)\big) / \big(\partial_{t_j}^{2d} \psi_s(A^T\beta)\big)$ for $j \in [m]$ all be different from each other; otherwise the eigenspaces in the flattened forms will mix, and we will not be able to uniquely recover the columns $A_i$. To this end, we will express $\partial_{t_j}^{2d} \psi_s(A^T\alpha)$ as a low-degree polynomial plus an error term (which we control by bounding the derivatives of $\psi_s$). With high probability, the low-degree polynomials take sufficiently different values at $A^T\alpha$ and $A^T\beta$, which in turn guarantees that their ratios, even with the error terms, are quite different. Our analysis of both parts might be of interest for other problems.

On the way to doing this in full generality for underdetermined ICA, we first consider the special case $d = 2$, which already involves several of these concepts, and for which the algorithm itself is just PCA reweighted with Fourier weights.

4 Fully determined independent component analysis

We begin with the case of standard or fully determined ICA, where the transformation matrix $A \in \mathbb{R}^{n \times n}$ is full rank. With a slight loss of generality, we assume that $A$ is unitary. If $A$ is not unitary, we can make it approximately so by placing the entire sample in isotropic position. Arguing about this rigorously would require additional error analysis; we omit such details for the sake of clarity. In any case, our algorithm for underdetermined ICA does not (and cannot) make any such assumption. Our algorithm computes the eigenvectors of a covariance matrix reweighted according to random Fourier coefficients.
Fourier PCA$(\sigma)$

1. (Isotropy) Take a sample $S$ from the input distribution and use it to find an isotropic transformation $B^{-1}$, where
$$B^2 = \frac{1}{|S|} \sum_{x \in S} (x - \bar{x})(x - \bar{x})^T.$$
2. (Fourier weights) Pick a random vector $u$ from $N(0, \sigma^2 I_n)$. For every $x$ in a new sample $S$, compute $y = B^{-1} x$ and its Fourier weight
$$w(y) = \frac{e^{i u^T y}}{\sum_{y \in S} e^{i u^T y}}.$$
3. (Reweighted covariance) Compute the covariance matrix of the points $y$ reweighted by $w(y)$:
$$\mu_u = \frac{1}{|S|} \sum_{y \in S} w(y)\, y \quad\text{and}\quad \Sigma_u = \frac{1}{|S|} \sum_{y \in S} w(y)\, (y - \mu_u)(y - \mu_u)^T.$$
4. Compute the eigenmatrix $V$ of $\Sigma_u$ and output $B V$.

Formally, this algorithm is subsumed by our work on underdetermined ICA in Section 6, but both the algorithm and its analysis are substantially simpler than in the general case, while retaining the essential elements of our technique: Fourier transforms, polynomial anti-concentration, and derivative truncation. On the other hand, it does not require the machinery of our tensor decomposition in Section 5.

We make the following comments regarding the efficient realisation of this algorithm. The matrix $\Sigma_u$ in the algorithm is complex and symmetric, and thus not Hermitian; its eigenvalue decomposition is more complicated than in the usual Hermitian/real-symmetric case. It can be computed in one of two ways. One is to compute the SVD of $\Sigma_u$ (i.e., compute the eigenvalue decomposition of $\Sigma_u \Sigma_u^*$, which is a Hermitian matrix). Alternatively, we can exploit the fact that the real and imaginary parts have the same eigenvectors, and hence, by carefully examining the real and imaginary components, we can recover the eigenvectors. We separate $\Sigma_u = \operatorname{Re}(\Sigma_u) + i \operatorname{Im}(\Sigma_u)$ into its real and imaginary parts, use an SVD of $\operatorname{Re}(\Sigma_u)$ to partition its eigenspace into subspaces with close eigenvalues, and then take an SVD of $\operatorname{Im}(\Sigma_u)$ within each of these subspaces.
Both methods need some care to ensure that eigenvalue gaps in the original matrix are preserved, an important aspect of our applications. We complete the algorithm description for ICA by giving a precise method for determining the eigenmatrix $V$ of the reweighted sample covariance matrix $\Sigma_u$. The subroutine below translates a gap in the complex eigenvalues of $\Sigma_u$ into observable gaps in the real part.

1. Write $\Sigma_u = \operatorname{Re}(\Sigma_u) + i \operatorname{Im}(\Sigma_u)$. Note that both component matrices are real and symmetric.
2. Compute the eigendecomposition $\operatorname{Re}(\Sigma_u) = U \operatorname{diag}(r_i)\, U^T$.
3. Partition $r_1, \dots, r_n$ into blocks $R_1, \dots, R_l$ so that each block contains a subsequence of eigenvalues and the gap between consecutive blocks is at least $\epsilon_0$, i.e., $\min_{r \in R_j,\, s \in R_{j+1}} r - s \geq \epsilon_0$. Let $U_j$ be the eigenvectors corresponding to block $R_j$.
4. For each $1 \leq j \leq l$, compute the eigenvectors of $U_j^T \operatorname{Im}(\Sigma_u)\, U_j$, and output $V$ as the full set of eigenvectors (their union).

Lemma 4.1. Suppose $\Sigma_u$ has eigenvalues $\lambda_1, \dots, \lambda_n$, and let
$$\epsilon = \min_{i \neq j} \max\big\{ |\operatorname{Re}(\lambda_i) - \operatorname{Re}(\lambda_j)|,\; |\operatorname{Im}(\lambda_i) - \operatorname{Im}(\lambda_j)| \big\}.$$
Then, with $\epsilon_0 = \epsilon / n$, the above algorithm will recover the eigenvectors of $\Sigma_u$.

Proof. The decomposition of the matrix $\operatorname{Re}(\Sigma_u)$ will accurately recover the eigensubspaces for each block (since their eigenvalues are separated). Moreover, for each block $U_j \operatorname{diag}(r_i)\, U_j^T$, the real eigenvalues $r_i$ lie within a range of less than $\epsilon$ (since each consecutive pair satisfies $r_i - r_{i+1} < \epsilon / n$). Thus, for each pair $i, i+1$ in this block, we must have a separation of at least $\epsilon$ in the imaginary parts of $\lambda_i, \lambda_{i+1}$, by the definition of $\epsilon$. Therefore the eigenvalues of $Q_j = U_j^T \operatorname{Im}(\Sigma_u)\, U_j$ are separated by at least $\epsilon$, and we will recover the original eigenvectors accurately.
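The subroutine transcribes directly into code. Below is a numpy sketch (the function name and the synthetic test input are ours) run on a complex symmetric matrix $\Sigma = Q \operatorname{diag}(\lambda) Q^T$ with $Q$ real orthogonal, where two eigenvalues share the same real part and are separated only in the imaginary part, which is exactly the situation the block partition handles:

```python
import numpy as np

def eigvecs_complex_symmetric(Sigma, eps0):
    # Steps 1-4: eigendecompose Re(Sigma), partition its eigenvalues into
    # blocks separated by at least eps0, then resolve each block via Im(Sigma).
    R, I = Sigma.real, Sigma.imag
    r, U = np.linalg.eigh(R)
    order = np.argsort(-r)
    r, U = r[order], U[:, order]
    blocks, start = [], 0
    for i in range(1, len(r) + 1):
        if i == len(r) or r[i - 1] - r[i] >= eps0:
            blocks.append(slice(start, i))
            start = i
    V = []
    for b in blocks:
        Ub = U[:, b]
        _, Q = np.linalg.eigh(Ub.T @ I @ Ub)
        V.append(Ub @ Q)
    return np.hstack(V)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))      # real orthogonal eigenvectors
lam = np.array([2 + 1j, 2 + 3j, 5 + 0.5j])            # repeated real parts
Sigma = Q @ np.diag(lam) @ Q.T
V = eigvecs_complex_symmetric(Sigma, eps0=1.0)
```

The degenerate real-part eigenvalue 2 puts two directions in one block, and the imaginary parts 1 and 3 then separate them, as in the proof of Lemma 4.1.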
To perform ICA, we simply apply Fourier PCA to samples from the input distribution. We will show that for a suitable choice of $\sigma$ and sample size, this recovers the independent components to any desired accuracy. The main challenge in the analysis is showing that the reweighted covariance matrix has all its eigenvalues spaced sufficiently far apart (in the complex plane). This eigenvalue spacing depends on how far the component distributions are from being Gaussian, as measured by cumulants. Any non-Gaussian distribution has a nonzero higher cumulant, and in that sense this is a complete method. We will quantify the gaps in terms of the cumulants to get an effective bound on the eigenvalue spacings. The number of samples is chosen to ensure that the gaps remain almost the same, so that we can apply eigenvector perturbation theorems (Davis–Kahan or Wedin) to recover the eigenvectors to the desired accuracy.

4.1 Overview of analysis

Our main theorem in the analysis of this algorithm is as follows:

Theorem 4.2. Let $x \in \mathbb{R}^n$ be given by an ICA model $x = As$, where $A \in \mathbb{R}^{n \times n}$ is unitary and the $s_i$ are independent, $\mathbb{E} s_i^4 \leq M_4$ for some constant $M_4$, and for each $s_i$ there exists a $k_i \leq k$ such that $|\operatorname{cum}_{k_i}(s_i)| \geq \Delta$ (one of the first $k$ cumulants is large) and $\mathbb{E}|s_i|^k \leq M_k$. For any $\epsilon > 0$, with the following setting of $\sigma$,
$$\sigma = \left( \frac{\Delta}{2\, k!} \cdot \frac{\sqrt{2\pi}}{4 (k-1)\, n^2} \right)^{\!k} \cdot \frac{1}{(2e)^{k+1}\, M_k\, \log(4n)^{k+1}},$$
Fourier PCA will recover vectors $\{b_1, \dots, b_n\}$ such that there exist signs $a_i = \pm 1$ satisfying $\|a_i A_i - b_i\| \leq \epsilon$ with high probability, using $(ckn)^{2k^2 + 2}\, (M_k / \Delta)^{2k + 2}\, M_4^2 / \epsilon^2$ samples.

Our analysis proceeds via the Fourier transform: for a random vector $x \in \mathbb{R}^n$ distributed according to $f$, the characteristic function is given by the Fourier transform
$$\phi(u) = \mathbb{E}\, e^{i u^T x} = \int f(x)\, e^{i u^T x}\, dx.$$
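The characteristic function is straightforward to estimate from samples, which is what the algorithm implicitly relies on. A numpy sketch (our own toy setup): for $x$ uniform on $[-1,1]^n$ the characteristic function factors as $\prod_j \sin(u_j)/u_j$, and the empirical average over $m$ samples is within roughly $1/\sqrt{m}$ of it:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 20000
u = rng.standard_normal(n)
X = rng.uniform(-1.0, 1.0, size=(m, n))
emp = np.exp(1j * (X @ u)).mean()      # empirical characteristic function at u
exact = np.prod(np.sin(u) / u)         # E e^{i u^T x} for x uniform on [-1, 1]^n
err = abs(emp - exact)
```

Each summand has modulus 1, so the deviation concentrates at rate $O(1/\sqrt{m})$; this is made precise in Lemma 4.12 below, which is part of the paper itself.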
We favour the Fourier transform, or characteristic function, over the Laplace transform (or moment generating function) for the simple reason that the Fourier transform always exists, even for very heavy-tailed distributions. In particular, the trivial bound $|e^{itx}| = 1$ means that once we have a moment bound, we can control the Fourier transform uniformly.

We will actually employ the second characteristic function, or cumulant generating function, given by $\psi(u) = \log(\phi(u))$. Note that both of these definitions are with respect to the observed random vector $x$. When $x$ arises from an ICA model $x = As$, we will also define the component-wise characteristic functions with respect to the underlying variables $s_i$: $\phi_i(u_i) = \mathbb{E}\, e^{i u_i s_i}$ and $\psi_i(u_i) = \log(\phi_i(u_i))$. Note that both of these functions are with respect to the underlying random variables $s_i$ and not the observed random variables $x_i$. For convenience, we shall also write $g_i = \psi_i''$.

Note that the reweighted covariance matrix in our algorithm is precisely the Hessian (second derivative matrix) $D^2 \psi$:
$$\Sigma_u = D^2 \psi = \frac{\mathbb{E}\, (x - \mu_u)(x - \mu_u)^T e^{i u^T x}}{\mathbb{E}\, e^{i u^T x}}, \quad\text{where}\quad \mu_u = \frac{\mathbb{E}\, x\, e^{i u^T x}}{\mathbb{E}\, e^{i u^T x}}.$$
This matrix $D^2 \psi$ has a very special form. Suppose that $A = I_n$:
$$\psi(u) = \log\big(\mathbb{E}\, e^{i u^T s}\big) = \log\left( \mathbb{E} \prod_{i=1}^n e^{i u_i s_i} \right) = \sum_{i=1}^n \log\big(\mathbb{E}\, e^{i u_i s_i}\big) = \sum_{i=1}^n \psi_i(u_i).$$
Taking a derivative leaves only a single term,
$$\frac{\partial \psi}{\partial u_i} = \psi_i'(u_i), \qquad (7)$$
and taking a second derivative leaves only the diagonal terms:
$$D^2 \psi = \operatorname{diag}\big(\psi_i''(u_i)\big) = \operatorname{diag}\big(g_i(u_i)\big).$$
Thus, diagonalizing this matrix gives us the columns of $A = I_n$, provided that the eigenvalues of $D^2 \psi$ are non-degenerate. In the general case, when $A \neq I_n$, we can first place $x$ in isotropic position, whence $A$ becomes unitary.
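The diagonal structure in the product case is easy to observe empirically: for $A = I_n$, the reweighted covariance computed from samples of a product distribution is approximately diagonal. A numpy sketch (the particular component distributions here are our own choice, not the paper's):

```python
import numpy as np

# A = I_n: a product distribution with three distinct isotropic-ish components
rng = np.random.default_rng(1)
N, n = 200000, 3
Y = np.column_stack([rng.uniform(-1.0, 1.0, N),       # uniform
                     rng.choice([-1.0, 1.0], N),      # Rademacher
                     rng.exponential(1.0, N) - 1.0])  # centered exponential
u = np.array([0.3, -0.2, 0.4])
w = np.exp(1j * (Y @ u))
w = w / w.sum()                                       # Fourier weights
mu_u = (w[:, None] * Y).sum(axis=0)
Z = Y - mu_u
Sigma_u = (w[:, None, None] * Z[:, :, None] * Z[:, None, :]).sum(axis=0)
off = Sigma_u - np.diag(np.diag(Sigma_u))             # should vanish as N grows
```

The off-diagonal entries are pure sampling noise of order $1/\sqrt{N}$, while the diagonal entries approximate the $g_i(u_i) = \psi_i''(u_i)$.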
We perform the change of basis carefully in Section 4.2, obtaining
$$D^2 \psi = A\, \operatorname{diag}\big(\psi_i''((A^T u)_i)\big)\, A^T.$$
Now $D^2 \psi$ is symmetric but not Hermitian, and its eigenvalues are complex; nonetheless, a diagonalization suffices to give the columns of $A$. To obtain a robust algorithm, we rely on the eigenvalues of $D^2 \psi$ being adequately spaced (so that the error arising from sampling does not mix the eigenspaces, hence the columns of $A$). Thus, we inject some randomness by picking a random Fourier coefficient, and hope that the $g_i(u_i)$ are sufficiently anti-concentrated. To this end, we truncate the Taylor series of $g_i$ at order $k$, where the $k$-th cumulant is one that differs substantially from that of a Gaussian. The resulting degree-$k$ polynomial gives us the spacings of the eigenvalues via polynomial anti-concentration estimates in Section 4.3, and we control the remaining terms, of order $k + 1$ and higher, by derivative estimates in Section 4.4. Notably, the further $s_i$ is from being a Gaussian (in cumulant terms), the stronger the anti-concentration. We pick the random Fourier coefficient $u$ according to a Gaussian $N(0, \sigma^2 I_n)$, and show that with high probability, for all pairs $i, j$, we have
$$\big| g_i((A^T u)_i) - g_j((A^T u)_j) \big| \geq \delta.$$
Critical to our analysis is the fact that $(A^T u)_i$ and $(A^T u)_j$ are both independent Gaussians, since the columns of $A$ are orthogonal by our assumption of isotropic position (Section 4.5). We then compute the sample complexity required to maintain these gaps in Section 4.6, and conclude with the proof of correctness of our algorithm in Section 4.7.

If more than one of the variables is a (standard) Gaussian, then a quick calculation verifies that $\psi_i''(u_i) = 1$ for each such variable.
Thus, in the presence of such variables, the eigenvectors corresponding to the eigenvalue $1$ are degenerate, and we cannot resolve between linear combinations of such vectors. The model is therefore indeterminate when some of the underlying random variables are too Gaussian. To deal with this, one typically hypothesizes that the underlying variables $s_i$ are different from Gaussians. One commonly used assumption is that for each $s_i$ the fourth moment or cumulant differs from that of a Gaussian. We weaken this assumption and only require that some cumulant of each $s_i$ differ from that of a Gaussian. The Gaussian density plays an important role in harmonic analysis as an eigenfunction of the Fourier transform operator, and we exploit this property to deal with additive Gaussian noise in our model in Section 4.8.

4.2 Eigenvalues

As noted previously, when $A = I_n$ we have $D^2 \psi = \operatorname{diag}(\psi_i''(u_i))$. When $A \neq I_n$ we have:

Lemma 4.3. Let $x \in \mathbb{R}^n$ be given by an ICA model $x = As$, where $A \in \mathbb{R}^{n \times n}$ is a unitary matrix and $s \in \mathbb{R}^n$ is an independent random vector. Then
$$D^2 \psi = A\, \operatorname{diag}\big(\psi_i''((A^T u)_i)\big)\, A^T.$$
This lemma is the standard chain rule for multivariate functions. A more general version, applying to higher derivative tensors, is proved in Section 6.

4.3 Anti-concentration for polynomials

The main result of this section is an anti-concentration inequality for univariate polynomials under a Gaussian measure. While this inequality appears very similar to the inequality of Carbery–Wright [20] (cf. [52], Corollary 3.23), we are not able to derive our inequality from it. The hypothesis of our inequality is weaker in that it only requires the polynomial to be monic, instead of requiring the polynomial to have unit variance as Carbery–Wright does; on the other hand, it applies only to univariate polynomials.
Theorem 4.4 (Anti-concentration of a polynomial in Gaussian space). Let $p(x)$ be a degree-$d$ monic polynomial over $\mathbb{R}$, and let $x \sim N(0, \sigma^2)$. Then for any $t \in \mathbb{R}$ we have
$$\Pr\big( |p(x) - t| \leq \epsilon \big) \leq \frac{4 d\, \epsilon^{1/d}}{\sigma \sqrt{2\pi}}.$$
For most of the proof we will work with the Lebesgue measure; the proof for the Gaussian measure then follows immediately. Our starting point is the following lemma, which can be derived from the properties of Chebyshev polynomials ([17], Sec. 2.1, Exercise 7); we include a proof for completeness. For an interval $[a, b]$, define the supremum norm of a real-valued function $f$ on $[a, b]$ by
$$\|f(x)\|_{[a,b]} := \big\| f(x)\, \chi_{[a,b]} \big\|_\infty = \sup_{x \in [a,b]} |f(x)|.$$
Then we have:

Lemma 4.5. The unique degree-$d$ monic polynomial minimising $\|p(x)\|_{[a,b]}$ is given by
$$p(x) = 2 \left( \frac{b - a}{4} \right)^{\!d} T_d\!\left( \frac{2x - a - b}{b - a} \right), \qquad (8)$$
where $T_d$ is the $d$-th Chebyshev polynomial.

Proof. It is known (see [17]) that for the interval $[-1, 1]$, the unique monic polynomial of degree $d$ minimizing $\|p(x)\|_{[-1,1]}$ is $2^{1-d} T_d(x)$. Map the interval $[a, b]$ to $[-1, 1]$ by the affine map $f(x) = (2x - a - b)/(b - a)$, which satisfies $f(a) = -1$ and $f(b) = 1$. Then $((b - a)/2)^d\, 2^{1-d}\, T_d(f(x)) = 2((b - a)/4)^d\, T_d(f(x))$ is the unique monic polynomial minimizing $\|\cdot\|_{[a,b]}$: if it were not, we could use a better polynomial to construct another monic polynomial (by reversing the above transformation) contradicting the fact that $2^{1-d} T_d$ is the unique monic minimizer of $\|\cdot\|_{[-1,1]}$.

From this we have the following lemma.

Lemma 4.6. Let $p(x)$ be a degree-$d$ monic polynomial over $\mathbb{R}$, and fix $\epsilon > 0$. Then for every $x$ there exists an $x'$ with $|x - x'| \leq \epsilon$ and $|p(x) - p(x')| \geq 2(\epsilon/2)^d$.

Proof.
We translate the polynomial $p$ to obtain the polynomial $q(y) := p(y + x) - p(x)$. Observe that $q(y)$ is monic and $q(0) = 0$. Now suppose that for all points $x' \in [x - \epsilon, x + \epsilon]$ we had $|p(x) - p(x')| < 2(\epsilon/2)^d$; then for all $y \in [-\epsilon, \epsilon]$ we would have $|q(y)| < 2(\epsilon/2)^d$. But from the previous lemma we know that for the interval $[-\epsilon, \epsilon]$, the minimum $L_\infty$-norm of any monic polynomial is attained by $r(y) = 2(\epsilon/2)^d\, T_d(y/\epsilon)$, and the value of this minimum is $2(\epsilon/2)^d$, a contradiction.

We can use the above lemma to give an upper bound on the measure of the set where a polynomial stays within a constant-sized band:

Lemma 4.7. Let $p(x)$ be a degree-$d$ monic polynomial. Then for any interval $[a, b]$ with $b - a = \epsilon$ we have
$$\mu\big( x \in \mathbb{R} : p(x) \in [a, b] \big) \leq 4 d \left( \frac{\epsilon}{2} \right)^{\!1/d},$$
where $\mu$ denotes the usual Lebesgue measure on $\mathbb{R}$.

Proof. Since $p$ is a continuous function, $p^{-1}([a, b]) = \cup_i I_i$, where the $I_i$ are disjoint closed intervals. There are at most $d$ such intervals: every time $p(x)$ exits and re-enters the interval $[a, b]$, there must be a change of sign of the derivative $p'(x)$ at some point in between, and since $p'(x)$ is a degree-$(d-1)$ polynomial, there can be only $d - 1$ changes of sign. Next, suppose that $|I_i| > 4(\epsilon/2)^{1/d}$; then there exists an interval $[x - 2(\epsilon'/2)^{1/d},\, x + 2(\epsilon'/2)^{1/d}] \subseteq I_i$ with $\epsilon' > \epsilon$. Then, by applying Lemma 4.6 with $2(\epsilon'/2)^{1/d}$ in place of $\epsilon$, there exists a point $x'$ with $|x - x'| \leq 2(\epsilon'/2)^{1/d}$ but
$$|p(x) - p(x')| \geq 2 \left[ \frac{1}{2} \cdot 2 \left( \frac{\epsilon'}{2} \right)^{\!1/d} \right]^d = \epsilon' > \epsilon.$$
This implies that $p(x') \notin [a, b]$, which is a contradiction. Hence we must have $|I_i| \leq 4(\epsilon/2)^{1/d}$, and
$$\sum_i |I_i| \leq d \max_i |I_i| \leq 4 d\, (\epsilon/2)^{1/d},$$
as required.

We can now give the proof of Theorem 4.4.

Proof of Theorem 4.4.
The Lebesgue measure of the set on which $p(x) \in [t - \epsilon, t + \epsilon]$ is bounded by Lemma 4.7; multiplying by the maximum density of the Gaussian, $1/(\sigma\sqrt{2\pi})$, gives the desired bound.

4.4 Truncation error

We expand the function $g_i = \psi_i''$ as a Taylor series with an error estimate:

Theorem 4.8 (Taylor's theorem with remainder). Let $f : \mathbb{R} \to \mathbb{R}$ be a $C^n$ function on some interval $I$, and let $a, b \in I$. Then
$$f(b) = \sum_{k=0}^{n-1} \frac{f^{(k)}(a)}{k!} (b - a)^k + \frac{f^{(n)}(\xi)}{n!} (b - a)^n$$
for some $\xi \in [a, b]$.

Accordingly, we write
$$g_i(u_i) = p_i(u_i) + \frac{g_i^{(k)}(\xi)}{k!}\, u_i^k, \qquad (9)$$
where $\xi \in [0, u_i]$ and $p_i$ is a polynomial of degree $k - 1$. To bound the error term in (9), we observe that it suffices to bound $[\log(\phi_i)]^{(k)}(u_i)$, using the following lemma.

Lemma 4.9. Let $x \in \mathbb{R}$ be a random variable with finite first $k$ absolute moments, and let $\phi(u)$ be the associated characteristic function. Then
$$\Big| [\log(\phi)]^{(k)}(u) \Big| \leq \frac{2^{k-1}\, (k-1)!\; \mathbb{E}|x|^k}{|\phi(u)|^k}.$$

Proof. We compute the derivatives of $\psi(u) = \log \phi(u)$ recursively, with $\psi'(u) = \phi'(u)/\phi(u)$ as the base case. Let $\psi^{(d)}$ be given as the ratio of two functions, a numerator function $N(u; d)$ and a denominator function $D(u; d)$, with no common factors, where $N(u; d)$ is a sum of terms of the form $\prod_{j=1}^d \phi^{(i_j)}(u)$, each with coefficient $\pm 1$. Some useful properties of the functions $N(u; d)$ and $D(u; d)$ are summarized in the following claim.

Claim 4.10. For $d \geq 1$, the functions $N(u; d)$ and $D(u; d)$ satisfy:
1. $D(u; d) = \phi(u)^d$.
2. For each term of $N(u; d)$, $\sum_{j=1}^d i_j \leq d$.
3. For each term of $N(u; d)$, the total number of factors of $\phi$ and its derivatives is at most $d$.
4. For $d \geq 1$, there are at most $2^{d-1}(d-1)!$ terms in $N(u; d)$.

Proof. We prove all of these by induction on $d$.
Clearly, all four statements hold in the base case $d = 1$. Assume they hold for some $d$; we examine the case $d + 1$. Writing $\psi^{(d+1)}(u)$ as the derivative of $\psi^{(d)}(u) = N(u; d)/D(u; d)$ and cancelling common factors gives
$$\psi^{(d+1)}(u) = \frac{N'(u; d)\, D(u; d) - N(u; d)\, D'(u; d)}{D(u; d)^2} = \frac{N'(u; d)\, \phi(u)^d - N(u; d)\, d\, \phi(u)^{d-1} \phi'(u)}{\phi(u)^{2d}} = \frac{N'(u; d)\, \phi(u) - d\, \phi'(u)\, N(u; d)}{\phi(u)^{d+1}}. \qquad (10)$$
Observing that there is always a term $\pm\phi'(u)^d$ in $N(u; d)$, we cannot cancel any further factors of $\phi(u)$. Hence $D(u; d+1) = \phi(u)^{d+1}$, proving the first part of the claim. The second and third parts of the claim follow immediately from the final expression for $\psi^{(d+1)}(u)$ above and the inductive hypothesis. To prove the fourth part, let $T(d)$ denote the total number of terms in $N(u; d)$; then by part 3 and the expansion in (10), we have $T(d+1) \leq d\, T(d) + d\, T(d) = 2 d\, T(d)$, from which $T(d+1) \leq 2^d\, d!$ follows immediately.

Returning to the proof of Lemma 4.9, for $d \leq k$ we observe that
$$\big| \phi^{(d)}(u) \big| = \big| \mathbb{E}\, (ix)^d e^{iux} \big| \leq \mathbb{E}\, \big| (ix)^d e^{iux} \big| \leq \mathbb{E}\, |x|^d.$$
Thus, for each term of $N(u; d)$,
$$\left| \prod_{j=1}^d \phi^{(i_j)}(u) \right| \leq \prod_{j=1}^d \big| \phi^{(i_j)}(u) \big| \leq \prod_{j=1}^d \mathbb{E}\, |x|^{i_j} \leq \mathbb{E}\, |x|^{\sum_j i_j} \leq \mathbb{E}\, |x|^d,$$
where the third inequality uses Fact 10.6. Combining Claim 4.10 with the previous equation, and noting that we never need to consider absolute moments of order higher than $k$ (which are guaranteed to exist by our hypothesis), gives the desired conclusion.

To conclude this section, we observe that if the distribution of $x \in \mathbb{R}$ is isotropic, then for $u \in (-1, 1)$ we have
$$\phi_x(u) = \phi_x(0) + \phi_x'(0)\, u + \frac{\phi_x''(\xi)}{2}\, u^2,$$
where $\xi \in [0, u]$. We have $\phi_x(0) = 1$, $\phi_x'(0) = i\,\mathbb{E}(x) = 0$, and $|\phi_x''(\xi)| \leq \mathbb{E}|x|^2 = 1$ by the isotropic position assumption.
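As a concrete instance of Lemma 4.9: for a Rademacher variable, $\phi(u) = \cos u$, so $\psi(u) = \log \cos u$ and $\psi'''(u) = -2 \tan(u) \sec^2(u)$, while the lemma's bound with $k = 3$ and $\mathbb{E}|x|^3 = 1$ is $2^2 \cdot 2! / |\cos u|^3 = 8/|\cos u|^3$. A numpy check on a grid (our own verification, not part of the proof):

```python
import math
import numpy as np

# Rademacher x: phi(u) = cos(u), psi = log cos(u), psi''' = -2 tan(u) sec(u)^2
k = 3
us = np.linspace(-1.2, 1.2, 241)
lhs = np.abs(-2.0 * np.tan(us) / np.cos(us) ** 2)    # |psi'''(u)| exactly
rhs = 2 ** (k - 1) * math.factorial(k - 1) / np.abs(np.cos(us)) ** k
```

Here the bound holds with room to spare: $|\psi'''| = 2|\sin u|/|\cos u|^3 \leq 8/|\cos u|^3$ reduces to $|\sin u| \leq 4$, which is always true; the lemma's value is that the same $\phi$-power denominator appears for every distribution with enough moments.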
Thus, for $u \in [-1/4, 1/4]$, Lemma 4.9 gives us
$$\Big| [\log(\phi_x)]^{(k)}(u) \Big| \leq \mathbb{E}|x|^k\, k^k. \qquad (11)$$

4.5 Eigenvalue spacings

We will apply Theorem 4.4 to the truncated Taylor polynomials $p_i$ of the functions $\psi_i''$, and bound the truncation error by at most a constant fraction of the desired anti-concentration.

Theorem 4.11. Let $s \in \mathbb{R}^n$ be a random vector with independent components, and for $t \in \mathbb{R}^n$ let $\psi(t) = \log \mathbb{E}\, e^{i t^T s}$ be the second characteristic function of $s$. Suppose we are given the following data and conditions:
1. An integer $k > 2$ such that $\mathbb{E}|s_i|^k$ exists for all $i \in [n]$.
2. $\Delta > 0$ such that for each $i \in [n]$ there exists $2 < k_i < k$ with $|\operatorname{cum}_{k_i}(s_i)| \geq \Delta$.
3. $M_2 > 0$ such that $\mathbb{E} s_i^2 \leq M_2$ for $i \in [n]$.
4. $M_k > 0$ such that $\mathbb{E}|s_i|^k \leq M_k$ for $i \in [n]$.
5. $g_i(t_i) := \frac{\partial^2 \psi(t)}{\partial t_i^2}$.
6. $\tau \sim N(0, \sigma^2 I_n)$, where $\sigma = \min\Big(1,\; \frac{1}{2\sqrt{2 M_2 \log(1/q)}},\; \sigma'\Big)$ and
$$\sigma' = \left(\frac{3}{8}\right)^{\!k+1} \cdot \frac{k-1}{k!} \cdot \left( \frac{\sqrt{2\pi}}{4k} \right)^{\!k-2} \cdot \frac{q^{k-2}}{\big(\sqrt{2 \log(1/q)}\big)^{k-1}} \cdot \frac{\Delta}{M_k}, \qquad (12)$$
with $0 < q \leq 1/3$.

Then with probability at least $1 - n^2 q$, for all distinct $i, j$ we have
$$\big| g_i(\tau_i) - g_j(\tau_j) \big| \geq \frac{\Delta}{2\, (k-2)!} \left( \frac{\sqrt{2\pi}\, \sigma q}{4 k} \right)^{\!k-2}.$$

Proof. We will argue about the spacing $|g_1(\tau_1) - g_2(\tau_2)|$, and then use the union bound to conclude that with high probability none of the spacings is small. Since $s_1$ has its first $k$ moments, we can apply Taylor's theorem with remainder. (Strictly speaking, more care is needed, as Theorem 4.8 was stated for functions of type $\mathbb{R} \to \mathbb{R}$, whereas our function here has type $\mathbb{R} \to \mathbb{C}$; to this end, we can consider the real and imaginary parts of the function separately and apply Theorem 4.8 to each part. We omit the details.) Applying Theorem 4.8 gives
$$g_1(t_1) = -\sum_{l=2}^{k_1} \operatorname{cum}_l(s_1)\, \frac{(i t_1)^{l-2}}{(l-2)!} + R_1(t_1)\, \frac{(i t_1)^{k_1 - 1}}{(k_1 - 1)!}.$$
Truncating $g_1$ after the degree-$(k_1 - 2)$ term yields a polynomial $p_1(t_1)$.
Denote the truncation error by $\rho_1(t_1)$. Then, fixing $t_2$ arbitrarily and setting $z = g_2(t_2)$ for brevity, we have
$$|g_1(t_1) - g_2(t_2)| = |p_1(t_1) + \rho_1(t_1) - z| \ge |p_1(t_1) - z| - |\rho_1(t_1)|.$$
We will show that $|p_1(t_1) - z|$ is likely to be large and $|\rho_1(t_1)|$ is likely to be small. Noting that $\frac{(k_1-2)!}{i^{k_1-2}\,\mathrm{cum}_{k_1}(s_1)}\,p_1(t_1)$ is monic of degree $k_1 - 2$ (but with coefficients from $\mathbb{C}$), we apply our anti-concentration result, Theorem 4.4. Again, although that theorem was proven for polynomials with real coefficients, its application to the present situation is easily seen to go through without altering the bound, by considering the real and imaginary parts separately. In the following, the probability is over $t_1 \sim N(0, \sigma^2)$:
$$\Pr\big(|p_1(t_1) - z| \le \epsilon_1\big) \le \frac{4(k_1-2)}{\sigma\sqrt{2\pi}}\Big(\frac{\epsilon_1(k_1-2)!}{|\mathrm{cum}_{k_1}(s_1)|}\Big)^{1/(k_1-2)} \le \frac{4k_1}{\sigma\sqrt{2\pi}}\Big(\frac{\epsilon_1(k_1-2)!}{\Delta}\Big)^{1/(k_1-2)}.$$
Setting
$$\epsilon_1 := \frac{\Delta}{(k_1-2)!}\Big(\frac{\sqrt{2\pi}\,\sigma q}{4\,k_1!}\Big)^{k_1-2} \ge \frac{\Delta}{(k-2)!}\Big(\frac{\sqrt{2\pi}\,\sigma q}{4\,k!}\Big)^{k-2}, \quad (13)$$
we have $\Pr(|p_1(t_1) - z| \le \epsilon_1) \le q$.

Next we bound the truncation error and show that $|\rho_1(t_1)| \le \epsilon_1/2$ with probability at least $1 - \frac{q}{\sqrt{\pi\log(1/q)}}$. Applying Lemma 4.9, the error introduced is
$$|\rho_1(t_1)| \le \frac{k_1!\,2^{k_1}\,\mathbb{E}|s_1|^{k_1+1}}{|\varphi_1(t_1)|^{k_1+1}} \cdot \frac{t_1^{k_1-1}}{(k_1-1)!}.$$
We now lower bound the probability that $|t_1|$ is small when $t_1 \sim N(0, \sigma^2)$:
$$\Pr\big(|t_1| \le \sigma\sqrt{2\log(1/q)}\big) \ge 1 - \frac{q}{\sqrt{\pi\log(1/q)}}.$$
The computation above used Claim 10.3. Thus with probability at least $1 - \frac{q}{\sqrt{\pi\log(1/q)}}$ we have
$$|\rho_1(t_1)| \le \frac{k_1!\,2^{k_1}\,M_k}{(3/4)^{k_1+1}(k_1-1)!}\cdot\big(\sigma\sqrt{2\log(1/q)}\big)^{k_1-1}; \quad (14)$$
here we used that by our choice of $\sigma$ we have $\sigma\sqrt{2\log(1/q)} \le \frac{1}{2\sqrt{M_2}}$, hence Lemma 10.1 gives $|\varphi(t_1)| \ge 3/4$. Now, for $|t_1| \le \sigma\sqrt{2\log(1/q)}$, we want $|\rho_1(t_1)| \le \epsilon_1/2$.
This is seen to be true by plugging in the value of $\epsilon_1$ from (13) and the bound on $\rho_1(t_1)$ from (14), together with our choice of $\sigma$. Thus we have proven that $|g_1(t_1) - g_2(t_2)| \ge \epsilon_1/2$ with probability at least $1 - \big(q + \frac{q}{\sqrt{\pi\log(1/q)}}\big) \ge 1 - 2q$ (using $q \in (0, 1/3]$). Applying the union bound over all pairs now gives the required bound.

4.6 Sample complexity

In this section we bound the sample complexity of the algorithm. First we show how many samples are necessary to accurately estimate the desired Fourier transforms $\mathbb{E} e^{iu^Tx}$, $\mathbb{E} xe^{iu^Tx}$ and $\mathbb{E} xx^Te^{iu^Tx}$.

Lemma 4.12. Let $x \in \mathbb{R}^n$ be a random vector. Fix $\epsilon > 0$ and a vector $u \in \mathbb{R}^n$. Let $x^{(j)}$ be i.i.d. samples drawn according to $x$. Then
$$\Big|\frac{1}{m}\sum_{j=1}^m e^{iu^Tx^{(j)}} - \mathbb{E} e^{iu^Tx}\Big| \le \epsilon$$
with probability at least $1 - 4e^{-m\epsilon^2/2}$.

Proof. The random variables $e^{iu^Tx}$ are bounded in magnitude by 1. We separate out the real and imaginary components of $e^{iu^Tx}$ and apply the Chernoff inequality to each separately.

In the most general setting, all we can do is bound the variance of our sample covariance matrix, and this gives a polynomial bound on the sample complexity.

Lemma 4.13. Suppose that the random vector $x \in \mathbb{R}^n$ is drawn from an isotropic distribution $F$. Then
$$\mathrm{Var}\big(x_je^{iu^Tx}\big) \le 1 \ \text{for } 1 \le j \le n, \qquad \mathrm{Var}\big(x_j^2e^{iu^Tx}\big) \le \mathbb{E} x_j^4, \qquad \mathrm{Var}\big(x_ix_je^{iu^Tx}\big) \le 1 \ \text{for } i \ne j.$$

Proof. $\mathrm{Var}(x_je^{iu^Tx}) = \mathbb{E} x_j^2 - \big|\mathbb{E} x_je^{iu^Tx}\big|^2 \le 1$. The other parts are similar, with the last inequality using isotropy.

We can combine these concentration results for the Fourier derivatives to obtain the final sample complexity bound. Recall from the proof of (11) that in the interval $u \in [-1/4, 1/4]$ the second derivative satisfies $|g(u)| \le c\,\mathbb{E}|x|^2 \le 16$ for an absolute constant $c$. We can now give the sample complexity of the algorithm.

Corollary 4.14. Let $x = As$ be an ICA model where $A \in \mathbb{R}^{n\times n}$ is a unitary matrix.
Suppose that the random vector $s \in \mathbb{R}^n$ is drawn from an isotropic distribution, and that for each $s_i$ we have $\mathbb{E} s_i^4 \le M$. Fix $\epsilon > 0$ and a vector $u \in \mathbb{R}^n$ with $\|u\| \le 1/4$. Let $\hat\Sigma_u$ be the matrix estimated from $m$ independent samples of $x = As$. Then $\|\hat\Sigma_u - \Sigma_u\|_F \le \epsilon$ with probability at least $1 - 1/n$ for $m \ge \mathrm{poly}(n, M)/\epsilon^2$.

Proof. Apply Chebyshev's inequality along with the variance bounds. Since the Frobenius norm is unitarily invariant, we can consider the error in the basis corresponding to $s$. In this basis,
$$\Big\|\mathbb{E}\, e^{iu^Ts}(s-\tilde\mu)(s-\tilde\mu)^T - \hat{\tilde\Sigma}_u\Big\| \le \Big\|\mathbb{E}\, ss^Te^{iu^Ts} - \frac{1}{m}\sum_{i=1}^m s^{(i)}(s^{(i)})^Te^{iu^Ts^{(i)}}\Big\| + 2\,\Big\|\mathbb{E}\, s\tilde\mu^Te^{iu^Ts} - \frac{1}{m}\sum_{i=1}^m s^{(i)}\hat{\tilde\mu}^Te^{iu^Ts^{(i)}}\Big\| + \Big\|\big(\tilde\mu\tilde\mu^T - \hat{\tilde\mu}\hat{\tilde\mu}^T\big)\,\mathbb{E} e^{iu^Ts}\Big\| \le \epsilon,$$
where the last bound is derived by apportioning $\epsilon/5$ error to each term. Finally, we conclude by noting that by our choice of $u$ we have $|\mathbb{E} e^{iu^Tx}| \ge 29/32$, and the multiplicative error due to the scaling by $1/\mathbb{E} e^{iu^Tx}$ is lower order in comparison to $\epsilon$.

For more structured distributions, e.g., logconcave distributions, or more generally distributions with subexponential tails, much sharper bounds are known on the sample complexity of covariance estimation; see, e.g., [57, 67, 60, 1].

4.7 Proof of Theorem 4.2

In this section we give the proof of correctness of our algorithm for ICA.

Proof of Theorem 4.2. In the exact case, the diagonal entries are given by $g_i((A^Tu)_i)$. Since $A$ is orthonormal, for any pair, $(A^Tu)_i = A_i^Tu$ and $(A^Tu)_j = A_j^Tu$ have orthogonal $A_i$ and $A_j$; hence the arguments of $g_i$ and $g_j$ are independent Gaussians, and Theorem 4.11 gives us the eigenvalue spacings of $\Sigma_u$ to be used in Lemma 8.1. In particular, the spacings are at least
$$\xi = \frac{\Delta}{2\,k!}\Big(\frac{\sqrt{2\pi}\,\sigma}{4(k-1)n^2}\Big)^{k}.$$
Thus, given desired accuracy $\epsilon$ in Lemma 8.1, we require the sampling error (in operator norm, which we upper bound by the Frobenius norm) to satisfy $\|E\|_F \le \epsilon\xi/(\xi+\epsilon)$. Substituting this directly into Corollary 4.14 gives the sample complexity.

4.8 Gaussian noise

The Gaussian function has several nice properties with respect to the Fourier transform, and we can exploit these to cancel out independent Gaussian noise in the problems that we study. To deal with Gaussian noise, when the observed signal is $x = As + \eta$, where $\eta$ is drawn from an unknown Gaussian $N(\mu_\eta, R_\eta)$ independent of $s$, we can use the following modified algorithm:

1. Pick two different random Gaussian vectors $u, v$.
2. Compute $\Sigma = \Sigma_0$, $\Sigma_u$ and $\Sigma_v$ as in the previous algorithm.
3. Output the eigenvectors of $(\Sigma_u - \Sigma)(\Sigma_v - \Sigma)^{-1}$.

Theorem 4.15. Let $x \in \mathbb{R}^n$ be given by a noisy independent components model $x = As + \eta$, where $A \in \mathbb{R}^{n\times n}$ is a full rank matrix and the noise vector $\eta$ has a Gaussian distribution. With sufficiently many samples, the modified algorithm outputs $A$.

Proof. When $x = As + \eta$, the function $\psi(u) = \log \mathbb{E} e^{iu^Tx}$ can be written as
$$\psi(u) = \log \mathbb{E} e^{iu^TAs} + \log \mathbb{E} e^{iu^T\eta}.$$
Therefore,
$$D^2\psi_u = A\,\mathrm{diag}\big(\psi''_i(A_i^Tu)\big)A^T + \frac{\mathbb{E}\, e^{iu^T\eta}(\eta-\mu_\eta)(\eta-\mu_\eta)^T}{\mathbb{E}\, e^{iu^T\eta}} = A\,\mathrm{diag}\big(\psi''_i(A_i^Tu)\big)A^T + \mathbb{E}(\eta-\mu_\eta)(\eta-\mu_\eta)^T = A\,\mathrm{diag}\big(\psi''_i(A_i^Tu)\big)A^T + R_\eta,$$
where $\eta \sim N(\mu_\eta, R_\eta)$. Therefore $\Sigma_u - \Sigma = A(D_u - D)A^T$, with $D$ being the covariance matrix of $s$, and
$$(\Sigma_u - \Sigma)(\Sigma_v - \Sigma)^{-1} = A(D_u - D)(D_v - D)^{-1}A^{-1}.$$
The eigenvectors of the above matrix are the columns of $A$. For a complete robustness analysis, one needs to control the spectral perturbations of the matrix $A(D_u - D)(D_v - D)^{-1}A^{-1}$ under sampling error. We omit this proof, but note that it follows easily using the techniques we develop for underdetermined ICA.
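The noise-cancellation identity above is easy to check numerically with exact population matrices. The sketch below uses synthetic diagonal factors $D$, $D_u$, $D_v$ and an arbitrary noise covariance $R_\eta$ standing in for the true quantities; it verifies that the additive noise term cancels and that the eigenvectors of $(\Sigma_u - \Sigma)(\Sigma_v - \Sigma)^{-1}$ are the columns of $A$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# Orthogonal mixing matrix A (fully determined case, so A^{-1} = A^T).
A, _ = np.linalg.qr(rng.standard_normal((n, n)))
# Synthetic diagonal factors and an arbitrary Gaussian noise covariance R_eta.
D = np.diag(rng.uniform(1.0, 2.0, n))
Du = np.diag(rng.uniform(3.0, 4.0, n))
Dv = np.diag(rng.uniform(5.0, 6.0, n))
G = rng.standard_normal((n, n))
R_eta = G @ G.T

# The additive R_eta term cancels in the differences Sigma_u - Sigma, etc.
Sigma = A @ D @ A.T + R_eta
Sigma_u = A @ Du @ A.T + R_eta
Sigma_v = A @ Dv @ A.T + R_eta
M = (Sigma_u - Sigma) @ np.linalg.inv(Sigma_v - Sigma)

# Eigenvectors of M recover the columns of A, up to sign and permutation.
_, V = np.linalg.eig(M)
match = np.abs(A.T @ np.real(V))   # close to a permutation matrix
print(np.round(match, 3))
```

Since the diagonal ratios $(D_u - D)_{ii}/(D_v - D)_{ii}$ are distinct almost surely, the eigendecomposition is unique up to scaling, which is what makes the recovery well defined.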
5 Efficient tensor decompositions

In this section we analyze the tensor decomposition algorithm, which will be our main tool for the underdetermined ICA problem.

5.1 Algorithm

Recall that our algorithm works by flattening the tensors $T_\mu$ and $T_\lambda$ into matrices $M_\mu$ and $M_\lambda$, and then observing that the eigenvectors of $M_\mu M_\lambda^{-1}$ are the vectors corresponding to the flattened (vectorized) $A_i^{\otimes d/2}$.

Diagonalize($M_\mu, M_\lambda$)
1. Compute the SVD $M_\mu = V\Sigma U^*$, and let $W$ consist of the left singular vectors (columns of $V$) corresponding to the $m$ largest singular values. Compute the matrix $M = (W^*M_\mu W)(W^*M_\lambda W)^{-1}$.
2. Compute the eigenvector decomposition $M = PDP^{-1}$.
3. Output the columns of $WP$.

In Step 3 of Tensor Decomposition below, we get an approximation of $v_i^{\odot d/2}$ up to a phase factor. We first correct the phase by maximizing the projection onto $\mathbb{R}^n$. To this end we prove:

Lemma 5.1. Let $v \in \mathbb{C}^n$ and $u \in \mathbb{R}^n$ be unit vectors such that for some $\varphi \in [0, 2\pi]$ we have $\|e^{i\varphi}v - u\| \le \epsilon$ with $0 \le \epsilon \le 1/2$. Let $\theta^* = \mathrm{argmax}_{\theta\in[0,2\pi]}\|\mathrm{Re}(e^{i\theta}v)\|$ and $u' = \mathrm{Re}(e^{i\theta^*}v)/\|\mathrm{Re}(e^{i\theta^*}v)\|$. Then there is a sign $\alpha \in \{-1, 1\}$ such that $\|\alpha u - u'\| \le 15\sqrt{\epsilon}$.

Tensor Decomposition($T_\mu, T_\lambda$)
1. Flatten the tensors to obtain $M_\mu = \tau^{-1}(T_\mu)$ and $M_\lambda = \tau^{-1}(T_\lambda)$.
2. $WP = \mathrm{Diagonalize}(M_\mu, M_\lambda)$.
3. For each column $C_i$ of $WP$, let $C'_i := \mathrm{Re}(e^{i\theta^*}C_i)/\|\mathrm{Re}(e^{i\theta^*}C_i)\|$, where $\theta^* = \mathrm{argmax}_{\theta\in[0,2\pi]}\|\mathrm{Re}(e^{i\theta}C_i)\|$.
4. For each column $C'_i$, let $v_i \in \mathbb{R}^n$ be such that $v_i^{\otimes d/2}$ is the best rank-1 approximation to $\tau(C'_i)$.

Proof (of Lemma 5.1). Without loss of generality we may assume $\varphi = 0$, hence $\|v - u\| \le \epsilon$. By the optimality of $\theta^*$,
$$\|\mathrm{Re}(e^{i\theta^*}v)\| \ge \|\mathrm{Re}(v)\| \ge 1 - \epsilon.$$
Let us denote $v' = e^{i\theta^*}v$; then $\|\mathrm{Re}(v')\|^2 + \|\mathrm{Im}(v')\|^2 = 1$, which implies $\|\mathrm{Im}(v')\|^2 \le 2\epsilon - \epsilon^2 < 2\epsilon$.
Now, using $\epsilon \le 1/2$, we have
$$\|v' - u'\| \le \|\mathrm{Re}(v') - u'\| + \|\mathrm{Im}(v')\| = \Big\|\mathrm{Re}(v') - \frac{\mathrm{Re}(v')}{\|\mathrm{Re}(v')\|}\Big\| + \|\mathrm{Im}(v')\| \le \|\mathrm{Re}(v')\|\,\Big|\frac{1}{1-\epsilon} - 1\Big| + \|\mathrm{Im}(v')\| \le 2\epsilon + \sqrt{2\epsilon} \le 4\sqrt{\epsilon},$$
and
$$\|u' - e^{i\theta^*}u\| \le \|u' - e^{i\theta^*}v\| + \|e^{i\theta^*}v - e^{i\theta^*}u\| = \|u' - v'\| + \|u - v\| < 5\sqrt{\epsilon}.$$
This implies $|\mathrm{Re}(e^{i\theta^*})| \ge 1 - 5\sqrt{\epsilon}$. Hence there is a sign $\alpha \in \{-1, 1\}$ such that $|e^{i\theta^*} - \alpha| \le 10\sqrt{\epsilon}$ (we omit some routine computations). Finally,
$$\|u' - \alpha u\| \le \|u' - e^{i\theta^*}u\| + \|e^{i\theta^*}u - \alpha u\| \le 5\sqrt{\epsilon} + 10\sqrt{\epsilon} = 15\sqrt{\epsilon}.$$

Lemma 5.2. For a unit vector $v \in \mathbb{R}^n$ and a positive integer $d$, given $v^{\odot d} + E$, where $E$ is an error vector, we can recover $v''$ such that for some $\alpha \in \{-1, 1\}$ we have
$$\|v - \alpha v''\|_2 \le \frac{2\|E\|_2}{\beta - \|E\|_2}, \qquad \text{where } \beta = \frac{1}{n^{d/2 - 1/2}}.$$

Proof. Let us for a moment work with $v^{\odot d}$ (so there is no error), and then we will take the error into account. In this case we can essentially read $v$ off from $v^{\odot d}$. Each one-dimensional slice of $v^{\otimes d}$ is a scaled copy of $v$. (Note that as vectors, $v^{\odot d}$ and $v^{\otimes d}$ are the same; they differ only in how their entries are arranged: in the former they are in a linear array, in the latter in an $n \times n \times \cdots \times n$ array. We use them interchangeably, and $v^{\otimes d} + E$ has the obvious meaning.) Let us choose the copy with the maximum norm. Since $\|v\| = 1$, there is a coordinate $v(i)$ such that $|v(i)| \ge 1/\sqrt{n}$. Thus there is a one-dimensional slice of $v^{\otimes d}$ with norm at least $1/n^{d/2-1/2} = \beta$. Scaling this slice to norm 1 yields $\alpha v$ for some $\alpha \in \{-1, 1\}$. Now, when we do have error and observe $v^{\otimes d} + E$, there must be a one-dimensional slice $v'$ of $v^{\otimes d} + E$ with norm at least $\beta - \|E\|_2$. Then, after normalizing $v'$ to $v''$, one can check that $\|\alpha v'' - v\| \le \frac{2\|E\|_2}{\beta - \|E\|_2}$ for some $\alpha \in \{-1, 1\}$.
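The recovery procedure in Lemma 5.2 is straightforward to implement; a minimal numpy sketch for $d = 3$ (the noise level $10^{-3}$ and the dimensions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
v = rng.standard_normal(n)
v /= np.linalg.norm(v)

# v^{(x)3} plus a small error tensor E.
T = np.einsum('i,j,k->ijk', v, v, v)
E = 1e-3 * rng.standard_normal((n, n, n))
T_noisy = T + E

# Pick the one-dimensional slice (fixing the first d-1 indices) of maximum norm.
norms = np.linalg.norm(T_noisy, axis=-1)          # shape (n, n)
i, j = np.unravel_index(np.argmax(norms), norms.shape)
slice_ = T_noisy[i, j, :]                         # approx (v_i v_j) * v + noise
v_hat = slice_ / np.linalg.norm(slice_)

err = min(np.linalg.norm(v - v_hat), np.linalg.norm(v + v_hat))
print(err)
```

The chosen slice is $v(i)v(j)\,v$ plus noise, with $|v(i)v(j)| \ge \beta$ by the pigeonhole argument above, so normalizing it recovers $v$ up to sign with error governed by the lemma's bound.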
5.2 Exact analysis

We begin with the proof of the tensor decomposition theorem with access to exact tensors, as stated in Theorem 1.2. This is essentially a structural result saying that we can recover the rank-1 components when the ratios $\mu_i/\lambda_i$ are distinct. We first note that for a tensor $T_\mu$ with a rank-1 decomposition as in (1), the flattened matrix version $M_\mu = \tau^{-1}(T_\mu)$ can be written as
$$M_\mu = (A^{\odot d/2})\,\mathrm{diag}(\mu_i)\,(A^{\odot d/2})^T.$$
We will argue that the diagonalisation step works correctly (we write $B = A^{\odot d/2}$ in what follows). The recovery of $A_i$ from the columns of $B$ follows by Lemma 5.1 above. Our theorem is as follows (note that the first condition below is simply a normalisation of the eigenvectors):

Theorem 5.3. Let $M_\mu, M_\lambda \in \mathbb{C}^{p\times p}$ be such that
$$M_\mu = B\,\mathrm{diag}(\mu_i)\,B^T \qquad\text{and}\qquad M_\lambda = B\,\mathrm{diag}(\lambda_i)\,B^T,$$
where $B \in \mathbb{R}^{p\times m}$ and $\mu, \lambda \in \mathbb{C}^m$ for some $m \le p$. Suppose that the following hold:
1. each column $B_i \in \mathbb{R}^p$ of $B$ satisfies $\|B_i\|_2 = 1$;
2. $\sigma_m(B) > 0$; and
3. $\mu_i, \lambda_i \ne 0$ for all $i$, and $\big|\frac{\mu_i}{\lambda_i} - \frac{\mu_j}{\lambda_j}\big| > 0$ for all $i \ne j$.
Then Diagonalize($M_\mu, M_\lambda$) outputs the columns of $B$ up to sign and permutation.

Proof. By our assumptions, the image of $M_\lambda$ has dimension $m$, and the matrix $W$ computed in Diagonalize($M_\mu, M_\lambda$) satisfies $\mathrm{colspan}(W) = \mathrm{colspan}(B)$. Moreover, we may choose $W$ to have all real entries because $B$ is a real matrix; this ensures that the ambiguities in the recovery of $B$ are signs rather than phases. Since the columns of $W$ are orthonormal, the columns of $P := W^TB$ all have unit norm, and $P$ is a full rank $m \times m$ matrix. So we can write
$$W^TM_\mu W = P\,\mathrm{diag}(\mu_i)\,P^T, \qquad (W^TM_\lambda W)^{-1} = (P^T)^{-1}\,\mathrm{diag}(\lambda_i^{-1})\,P^{-1},$$
which gives
$$(W^TM_\mu W)(W^TM_\lambda W)^{-1} = P\,\mathrm{diag}(\mu_i/\lambda_i)\,P^{-1}.$$
Thus the columns of $P$ are the eigenvectors of $(W^TM_\mu W)(W^TM_\lambda W)^{-1}$, and so our algorithm is able to recover the columns of $P$ up to sign and permutation. Call the recovered matrix $P'$. Denote by $P_1, \ldots, P_m$ the columns of $P$, and similarly for $P'$ and $B$. Then $P'$ is given by $P'_{\pi(j)} = \alpha_jP_j$, where $\pi : [m] \to [m]$ is a permutation and $\alpha_j \in \{-1, +1\}$.

We now claim that $WP = WW^TB = B$. To see this, let $\hat W = [W, W']$ be an orthonormal basis completing $W$. Then $\hat W^T\hat W = \hat W\hat W^T = I$. Also, $\hat W\hat W^T = WW^T + W'W'^T$. For any vector $v$ in the span of the columns of $W$ we have $v = \hat W\hat W^Tv = (WW^T + W'W'^T)v = WW^Tv$. In other words, $W$ acts as an orthonormal matrix restricted to its image, and thus $WW^T$ acts as the identity there. In particular, $WP = WW^TB = B$. Our algorithm has access to $P'$ as defined above rather than to $P$, and forms the product $WP'$. But now it is clear from $WP = B$ that $WP'_{\pi(j)} = \alpha_jB_j$. Thus the algorithm recovers $B$ up to sign and permutation.

5.3 Diagonalizability and robust analysis

In applications of our tensor decomposition algorithm, we do not have access to the true underlying tensors $T_\mu$ and $T_\lambda$, but rather to slightly perturbed versions. We now prove that under suitably small perturbations $R_\mu$ and $R_\lambda$, we are able to recover the correct rank-1 components with good accuracy. The statement of the robust version closely follows that of the exact version: we merely need to add some assumptions on the magnitude of the perturbations relative to the quotients $\mu_i/\lambda_i$, in conditions 4 and 5.

Theorem 5.4. Let $M_\mu, M_\lambda \in \mathbb{C}^{p\times p}$ be such that $M_\mu = B\,\mathrm{diag}(\mu_i)B^T$ and $M_\lambda = B\,\mathrm{diag}(\lambda_i)B^T$, where $B \in \mathbb{R}^{p\times m}$ and $\mu, \lambda \in \mathbb{C}^m$ for some $m \le p$. For error matrices $R_\mu, R_\lambda \in \mathbb{C}^{p\times p}$, let $M_\mu + R_\mu$ and $M_\lambda + R_\lambda$ be the perturbed versions of $M_\mu$ and $M_\lambda$. Let $0 < \epsilon < 1$.
Suppose that the following conditions and data are given:
1. Each column $B_i \in \mathbb{R}^p$ of $B$ satisfies $\|B_i\|_2 = 1$.
2. $\sigma_m(B) > 0$.
3. $\mu_i, \lambda_i \ne 0$ for all $i$, and $\big|\frac{\mu_i}{\lambda_i} - \frac{\mu_j}{\lambda_j}\big| \ge \Omega > 0$ for all $i \ne j$.
4. $0 < K_L \le |\mu_i|, |\lambda_i| \le K_U$.
5. $\|R_\mu\|_F, \|R_\lambda\|_F \le K_1 \le \frac{\epsilon K_L^2\,\sigma_m(B)^3}{2^{11}\kappa(B)^3K_Um^2}\min(\Omega, 1)$.

Then Diagonalize applied to $M_\mu + R_\mu$ and $M_\lambda + R_\lambda$ outputs $\tilde B$ such that there exist a permutation $\pi : [m] \to [m]$ and phases $\alpha_j$ (a phase $\alpha$ is a scalar in $\mathbb{C}$ with $|\alpha| = 1$) satisfying $\|B_j - \alpha_j\tilde B_{\pi(j)}\| \le \epsilon$. The running time of the algorithm is $\mathrm{poly}\big(p, \frac{1}{\Omega}, \frac{1}{K_L}, \frac{1}{\sigma_{\min}(B)}, \frac{1}{\epsilon}\big)$.

Proof. We begin with an informal outline of the proof. We basically implement the proof of the exact case; because of the perturbations, various equalities now hold only approximately, which leads to somewhat lengthy technical details, but the intuitive outline remains the same as in the exact case.

The algorithm constructs an orthonormal basis of the left singular space of $\bar M_\mu := M_\mu + R_\mu$; denote by $Y$ the matrix with this basis as its columns. The fact that $\bar M_\mu$ is close to $M_\mu$ gives, by Wedin's theorem (Theorem 8.2), that the left singular spaces of $\bar M_\mu$ and $M_\mu$ are close. More specifically, there are two matrices, $W$ with columns forming an orthonormal basis of the left singular space of $M_\mu$, and $X$ with columns forming an orthonormal basis of the left singular space of $\bar M_\mu$, such that $W$ and $X$ are close in the entrywise sense. This implies that $W^TB$ and $X^TB$ are close, which can be used to show that under appropriate conditions $X^TB$ is nonsingular. Now, using the fact that the columns of $Y$ and of $X$ span the same space, it follows that $\bar P := Y^TB$ is nonsingular.
In the next step we show, by virtue of $\|R_\mu\|$ being small, that the matrix $Y^T\bar M_\mu Y$ constructed by the algorithm is close to $\bar P\,\mathrm{diag}(\mu_i)\bar P^T$, where the $\mu_i$ are the eigenvalues of $M_\mu$; and similarly for $Y^T\bar M_\lambda Y$. We then show that $(Y^T\bar M_\mu Y)(Y^T\bar M_\lambda Y)^{-1}$ is diagonalizable, and that the diagonalization provides a matrix $\tilde P$ close to $\bar P$; hence $\tilde B = Y\tilde P$ gives the columns of $B$ up to phase factors and permutation, with small error.

A note on the running time. Algorithm Diagonalize uses SVD and eigenvector decomposition of diagonalizable (but not normal) matrices as subroutines. There are well-known algorithms for these, as discussed earlier. The outputs of these algorithms are not exact and have a quantifiable error: the SVD of $M \in \mathbb{C}^{n\times n}$ can be computed within error $\epsilon$ (for any reasonable notion of error, say $\|M - V\Sigma U^T\|_F$, where $V\Sigma U^T$ is the SVD output by the algorithm on input $M$) in time $\mathrm{poly}\big(n, \frac{1}{\epsilon}, \frac{1}{\sigma_{\min}(M)}\big)$. Similarly, for the eigenvector decomposition of a diagonalizable matrix $M \in \mathbb{C}^{n\times n}$ with eigenvalues satisfying $|\lambda_i - \lambda_j| \ge \Omega > 0$ for $i \ne j$, we can compute the decomposition within error $\epsilon$ in time $\mathrm{poly}\big(n, \frac{1}{\Omega}, \frac{1}{\epsilon}, \frac{1}{\min_i|\lambda_i|}\big)$. In the analysis below we ignore the errors from these computations, as they can be controlled and are of smaller order than the error from the main analysis. This can be made rigorous, but we omit the details in the interest of brevity. Combining the running times of the two subroutines, one can easily check that the overall running time is as claimed in the statement of the theorem.

We now proceed with the formal proof, which is broken into seven steps.

Step 1: $W^TB \approx X^TB$. Let $\bar M_\mu := M_\mu + R_\mu$ and $\bar M_\lambda := M_\lambda + R_\lambda$.
Now, the fact that $\|R_\mu\|_F$ is small implies, by Wedin's theorem (Theorem 8.2), that the left singular spaces of $M_\mu$ and $\bar M_\mu$ are close. Specifically, by Theorem IV.1.8 in [16] about canonical angles between subspaces, there exist an orthonormal basis of the left singular space of $M_\mu$ (given by the columns $w_1, \ldots, w_m$ of $W \in \mathbb{C}^{p\times m}$) and an orthonormal basis of the left singular space of $\bar M_\mu$ (given by the columns $x_1, \ldots, x_m$ of $X \in \mathbb{C}^{p\times m}$) such that
$$x_j = c_jw_j + s_jz_j \quad \text{for all } j,$$
where $0 \le c_1 \le \cdots \le c_m \le 1$, $1 \ge s_1 \ge \cdots \ge s_m \ge 0$, $c_j^2 + s_j^2 = 1$ for all $j$, and the vectors $w_1, \ldots, w_m; z_1, \ldots, z_m$ form an orthonormal set. (For the last condition to hold we need $p \ge 2m$. A similar representation can be derived when this condition does not hold, and the following computation remains valid; we omit a full discussion of this case for brevity. In any case, we could arrange that $p \ge 2m$ without any great penalty in the parameters achieved.)

We now apply Wedin's theorem (Theorem 8.2) to $M_\mu$ and $\bar M_\mu$ to upper bound the $s_j$. To this end, first note that by Claim 10.4 we have $\sigma_m(M_\mu) \ge K_L\sigma_m(B)^2$; second, by Weyl's inequality for singular values (Lemma 8.3) we have $|\sigma_j(\bar M_\mu) - \sigma_j(M_\mu)| \le \sigma_1(R_\mu) \le K_1$ for all $j$. Thus, in Theorem 8.2, with $\Sigma_1$ corresponding to the non-zero singular values of $M_\mu$, we have $\max\sigma(\Sigma_2) = 0$, and we can choose a corresponding conformal SVD of $\bar M_\mu$ so that $\min\sigma(\bar\Sigma_1) \ge K_L\sigma_m(B)^2 - K_1$. This gives
$$\|\sin\Phi\|_2 \le K_1/\big(K_L\sigma_m(B)^2 - K_1\big) =: K_2,$$
where $\Phi$ is the matrix of canonical angles between $\mathrm{colspan}(W)$ and $\mathrm{colspan}(X)$. Thus we have
$$s_j \le K_2 \quad (15)$$
for all $j$. Now we can show that $X^TB$ is close to $W^TB$: the $(i,j)$'th entry of $W^TB - X^TB$ is $(1-c_i)w_i^Tb_j - s_iz_i^Tb_j$.
Using (15) and $\|w_i\|, \|b_j\|, \|z_i\| \le 1$, we have $|(1-c_i)w_i^Tb_j - s_iz_i^Tb_j| \le s_i^2 + s_i \le 2K_2$, and so $\|W^TB - X^TB\|_F \le 2m^2K_2$. Hence, by Lemma 8.3, $|\sigma_j(W^TB) - \sigma_j(X^TB)| \le 2m^2K_2$ for all $j$.

Step 2: $\bar P := Y^TB$ is full rank. The singular values of $W^TB$ are the same as those of $B$; briefly, this is because $W^T$ acts as an isometry on $\mathrm{colspan}(B)$. Also observe that the singular values of $Y^TB$ are the same as those of $X^TB$; briefly, this is because $Y^T$ and $X^T$ act as isometries on $\mathrm{colspan}(X) = \mathrm{colspan}(Y)$. These two facts, together with the closeness of the singular values of $W^TB$ and $X^TB$ just shown, imply that
$$|\sigma_j(B) - \sigma_j(Y^TB)| \le 2m^2K_2 \quad (16)$$
for all $j$. Now, using that $2m^2K_2 < \sigma_m(B)/2$ (this follows from condition 5 of the theorem, which upper bounds $K_1$: we have $K_1 \le \epsilon\frac{K_L}{K_U}\cdot\frac{K_L\sigma_m(B)^3}{2^{11}\kappa(B)^3m^2}$, which gives $K_1 \le \frac{K_L\sigma_m(B)^3}{8m^2}$; this in turn implies $2m^2K_2 < \sigma_m(B)/2$ using $\sigma_m(B) \le 1$; we omit the easy verification), we get that $\sigma_m(Y^TB) > 0$, and hence $Y^TB$ is full rank.

We note some consequences of (16) for use in later steps:
$$\kappa(\bar P) \le 4\kappa(B). \quad (17)$$
This follows from $\kappa(\bar P) \le \frac{\sigma_1(B) + 2m^2K_2}{\sigma_m(B) - 2m^2K_2} \le 4\kappa(B)$, because $2m^2K_2 < \sigma_m(B)/2$. Moreover,
$$\sigma_m(\bar P) \le \sigma_m(B) + 2m^2K_2 < 2\sigma_m(B), \quad (18)$$
$$\sigma_m(\bar P) \ge \sigma_m(B) - 2m^2K_2 > \sigma_m(B)/2. \quad (19)$$

Step 3: $Y^T\bar M_\mu Y \approx \bar P\,\mathrm{diag}(\mu_i)\bar P^T$ and $Y^T\bar M_\lambda Y \approx \bar P\,\mathrm{diag}(\lambda_i)\bar P^T$. More precisely, let $E_\mu := Y^T\bar M_\mu Y - \bar P\,\mathrm{diag}(\mu_i)\bar P^T$; then $\|E_\mu\|_F \le \|R_\mu\|_F$, and similarly for $\bar M_\lambda$, with $E_\lambda := Y^T\bar M_\lambda Y - \bar P\,\mathrm{diag}(\lambda_i)\bar P^T$. The proof is short: we have $Y^T\bar M_\mu Y = Y^T(M_\mu + R_\mu)Y = Y^TM_\mu Y + Y^TR_\mu Y = \bar P\,\mathrm{diag}(\mu_i)\bar P^T + Y^TR_\mu Y$. Hence $\|E_\mu\|_F = \|Y^TR_\mu Y\|_F \le \|R_\mu\|_F$.

Step 4: $(Y^T\bar M_\mu Y)(Y^T\bar M_\lambda Y)^{-1}$ is diagonalizable.
This is because Theorem 5.5 is applicable to $\tilde N := (Y^T\bar M_\mu Y)(Y^T\bar M_\lambda Y)^{-1} = \big(\bar P\,\mathrm{diag}(\mu_i)\bar P^T + E_\mu\big)\big(\bar P\,\mathrm{diag}(\lambda_i)\bar P^T + E_\lambda\big)^{-1}$. Using $\|E_\mu\|_F \le \|R_\mu\|_F$, the two conditions to verify are:

- $6\kappa(\bar P)^3\cdot\frac{K_U}{K_L}\cdot m\cdot\frac{K_1}{\sigma_m(\bar P)^2K_L} \le \Omega$; this follows from our condition 5 using (17), (19) and $\sigma_m(B) \le 1$.
- $K_1 \le \sigma_m(\bar P)^2K_L/2$; this also follows from condition 5, using (18) and $\epsilon \le 1$.

Hence $\tilde N$ is diagonalizable: $\tilde N = \tilde P\,\mathrm{diag}(\tilde\gamma_i)\tilde P^{-1}$.

Step 5: the eigenvalues of $\tilde N$ are close to the eigenvalues of $\bar P\,\mathrm{diag}(\mu_i/\lambda_i)\bar P^{-1}$. This follows from our application of Theorem 5.5 in the previous step (specifically from (24)), and gives a permutation $\pi : [m] \to [m]$ such that
$$\Big|\frac{\mu_i}{\lambda_i} - \tilde\gamma_{\pi(i)}\Big| < \Omega/2,$$
where the $\tilde\gamma_i$ are the eigenvalues of $\tilde N$. In the next step we show that there exist phases $\alpha_i$ such that $\tilde P_{\pi,\alpha} := [\alpha_1\tilde P_{\pi(1)}, \alpha_2\tilde P_{\pi(2)}, \ldots, \alpha_m\tilde P_{\pi(m)}]$ is close to $\bar P$.

Step 6: $\bar P$ is close to $\tilde P$ up to phases and permutation of columns. We upper bound the angle $\theta$ between the corresponding eigenpairs $\big(\frac{\mu_j}{\lambda_j}, \bar P_j\big)$ and $\big(\tilde\gamma_{\pi(j)}, \tilde P_{\pi(j)}\big)$ of $N := \bar P\,\mathrm{diag}(\mu_i/\lambda_i)\bar P^{-1}$ and $\tilde N$. Theorem 8.7 (a generalized version of the $\sin\theta$ eigenspace perturbation theorem for diagonalizable matrices), applied to $N$ and $\tilde N$, gives (in the notation derived from Theorem 8.7)
$$\sin\theta \le \frac{\kappa(Z_2)\,\big\|(N - \tilde\gamma_{\pi(j)}I)\tilde P_{\pi(j)}\big\|_2}{\min_i\big|(N_2)_{ii} - \tilde\gamma_{\pi(j)}\big|}.$$
To bound the RHS, we estimate each of the three terms. The first term:
$$\kappa(Z_2) \le \kappa(\bar P^{-1}) = \kappa(\bar P) \le 4\kappa(B),$$
where for the first inequality we used that the condition number of a submatrix can only be smaller [63]; the second inequality is (17).
Setting $\mathrm{Err} := N - \tilde N$, we bound the second term:
$$\big\|(N - \tilde\gamma_{\pi(j)}I)\tilde P_{\pi(j)}\big\|_2 = \big\|(\tilde N - \tilde\gamma_{\pi(j)}I)\tilde P_{\pi(j)} + \mathrm{Err}\,\tilde P_{\pi(j)}\big\|_2 = \big\|\mathrm{Err}\,\tilde P_{\pi(j)}\big\|_2 \le \|\mathrm{Err}\|_2 \le \kappa(\bar P)^2\cdot\frac{K_U}{K_L}\cdot 2m\cdot\frac{K_1}{\sigma_m(\bar P)^2K_L} \quad \text{(by (23) in Theorem 5.5)}$$
$$\le 2^6\kappa(B)^2m\,\frac{K_U}{K_L}\,\frac{K_1}{\sigma_m(B)^2K_L} \quad \text{(using (17), (18))}. \quad (20)$$
And lastly, the third term:
$$\min_i\big|(N_2)_{ii} - \tilde\gamma_{\pi(j)}\big| \ge \min_{i: i\ne j}\Big|\frac{\mu_i}{\lambda_i} - \frac{\mu_j}{\lambda_j}\Big| - \Big|\frac{\mu_j}{\lambda_j} - \tilde\gamma_{\pi(j)}\Big| \ge \Omega - \kappa(\bar P)\|\mathrm{Err}\|_2 \quad \text{(using Lemma 8.5)}$$
$$\ge \Omega - 2^9\kappa(B)^3m\,\frac{K_U}{K_L}\,\frac{K_1}{\sigma_m(B)^2K_L} \quad \text{(using (20) and (17))}.$$
To abbreviate things a bit, let us set
$$\epsilon' := 2^9\kappa(B)^3\,\frac{K_U}{K_L}\,m\,\frac{K_1}{\sigma_m(B)^2K_L}.$$
Then, putting things together, we get $\sin\theta \le \frac{\epsilon'}{\Omega - \epsilon'}$. Now, using the fact that the columns of $\tilde P$ and $\bar P$ have unit length, there exist phases $\alpha_i$ such that
$$\big\|\alpha_j\tilde P_{\pi(j)} - \bar P_j\big\|_2 \le \frac{\epsilon'}{\Omega - \epsilon'}. \quad (21)$$

Step 7: $Y\tilde P$ gives $B$ approximately, up to phase factors and permutation of columns. This follows from two facts: (1) $\tilde P_{\pi,\alpha} \approx \bar P$, so $Y\tilde P_{\pi,\alpha} \approx Y\bar P$ (we will prove this shortly); and (2) $Y\bar P = YY^TB$ (by the definition of $\bar P$). Now note that the operator $YY^T$ is the projection onto $\mathrm{colspan}(Y)$; since the angle between $\mathrm{colspan}(Y)$ and $\mathrm{colspan}(B)$ is small, as we showed in Step 1, we get $YY^TB \approx B$. Formally, we have
$$\big\|Y\alpha_j\tilde P_{\pi(j)} - Y\bar P_j\big\|_2 \le \|Y\|_2\,\big\|\alpha_j\tilde P_{\pi(j)} - \bar P_j\big\|_2 \le \frac{\epsilon'}{\Omega - \epsilon'},$$
using (21), and $\|b_j - YY^Tb_j\|_2 \le K_2$, where the last inequality uses that the sine of the angle between $\mathrm{colspan}(Y)$ and $\mathrm{colspan}(W) = \mathrm{colspan}(B)$ is at most $K_2$, as proved in Step 1. Putting these together, we get
$$\big\|Y\alpha_j\tilde P_{\pi(j)} - b_j\big\|_2 \le \big\|b_j - YY^Tb_j\big\|_2 + \big\|Y\alpha_j\tilde P_{\pi(j)} - Y\bar P_j\big\|_2 \le \frac{\epsilon'}{\Omega - \epsilon'} + K_2.$$
Letting $\tilde B = Y\tilde P$ gives $\big\|\alpha_j\tilde B_{\pi(j)} - b_j\big\|_2 \le \frac{\epsilon'}{\Omega - \epsilon'} + K_2 \le \epsilon$. The last inequality follows from our condition 5, which implies that $\frac{\epsilon'}{\Omega - \epsilon'} \le \epsilon/2$ and $K_2 \le \epsilon/2$.
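The algorithm Diagonalize whose robustness was just analyzed can be sketched directly in numpy. This is an illustrative implementation on synthetic exact inputs (random $B$, $\mu$, $\lambda$ chosen here for the demonstration), not the full robust pipeline:

```python
import numpy as np

rng = np.random.default_rng(2)
p, m = 8, 4
B = rng.standard_normal((p, m))
B /= np.linalg.norm(B, axis=0)           # unit-norm columns, as in condition 1
mu = rng.uniform(1.0, 2.0, m)
lam = rng.uniform(1.0, 2.0, m)
M_mu = B @ np.diag(mu) @ B.T
M_lam = B @ np.diag(lam) @ B.T

def diagonalize(M_mu, M_lam, m):
    # Step 1: orthonormal basis W for the top-m left singular space of M_mu.
    V, _, _ = np.linalg.svd(M_mu)
    W = V[:, :m]
    M = (W.T @ M_mu @ W) @ np.linalg.inv(W.T @ M_lam @ W)
    # Step 2: eigendecomposition of the (generally non-normal) matrix M.
    _, P = np.linalg.eig(M)
    # Step 3: lift the eigenvectors back to R^p.
    return W @ np.real(P)

B_hat = diagonalize(M_mu, M_lam, m)
# Each recovered column matches some column of B up to sign.
errs = [min(min(np.linalg.norm(c - b), np.linalg.norm(c + b)) for b in B.T)
        for c in B_hat.T]
print(max(errs))
```

Because the eigenvectors returned by `np.linalg.eig` have unit norm and the columns of $P = W^TB$ also have unit norm (as shown in the proof of Theorem 5.3), no rescaling is needed before the comparison.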
Theorem 5.5 (Diagonalizability of perturbed matrices). Let $N_\mu, N_\lambda \in \mathbb{C}^{m\times m}$ be full rank complex matrices such that $N_\mu = Q\,\mathrm{diag}(\mu_i)Q^T$ and $N_\lambda = Q\,\mathrm{diag}(\lambda_i)Q^T$ for some $Q \in \mathbb{R}^{m\times m}$ and $\mu, \lambda \in \mathbb{C}^m$. Suppose we also have the following conditions and data:
1. $0 < K_L \le |\mu_i|, |\lambda_i| \le K_U$.
2. $|\mu_i/\lambda_i - \mu_j/\lambda_j| > \Omega > 0$ for all pairs $i \ne j$.
3. $0 < K < 1$ and $E_\mu, E_\lambda \in \mathbb{C}^{m\times m}$ such that $\|E_\mu\|_F, \|E_\lambda\|_F \le K$.
4. $6\kappa(Q)^3\cdot\frac{K_U}{K_L}\cdot m\cdot\frac{K}{\sigma_m(Q)^2K_L} \le \Omega$.
5. $K \le \sigma_m(Q)^2K_L/2$.

Then $(N_\mu + E_\mu)(N_\lambda + E_\lambda)^{-1}$ is diagonalizable and hence has $m$ eigenvectors.

Proof. Defining $F_\mu := (Q\,\mathrm{diag}(\mu_i)Q^T)^{-1}E_\mu$, and similarly $F_\lambda$, we have
$$(N_\mu + E_\mu)(N_\lambda + E_\lambda)^{-1} = \big(Q\,\mathrm{diag}(\mu_i)Q^T + E_\mu\big)\big(Q\,\mathrm{diag}(\lambda_i)Q^T + E_\lambda\big)^{-1}$$
$$= Q\,\mathrm{diag}(\mu_i)Q^T(I + F_\mu)(I + F_\lambda)^{-1}\big(Q\,\mathrm{diag}(\lambda_i)Q^T\big)^{-1}$$
$$= Q\,\mathrm{diag}(\mu_i)Q^T(I + F_\mu)(I + G_\lambda)\big(Q\,\mathrm{diag}(\lambda_i)Q^T\big)^{-1} \quad (22)$$
$$= Q\,\mathrm{diag}(\mu_i/\lambda_i)Q^{-1} + \mathrm{Err}.$$
In (22) above, $G_\lambda = (I + F_\lambda)^{-1} - I$; hence by Claim 10.5 (which is applicable because $\|F_\lambda\|_F \le \frac{K}{\sigma_m(Q)^2K_L} \le 1/2$, by our assumption) we have $\|G_\lambda\|_F \le (m+1)\|F_\lambda\|_F$. The norm of Err then satisfies
$$\|\mathrm{Err}\|_F \le \frac{\sigma_1(Q)^2}{\sigma_m(Q)^2}\cdot\frac{K_U}{K_L}\Big(\|F_\mu\|_F + (m+1)\|F_\lambda\|_F + (m+1)\|F_\mu\|_F\|F_\lambda\|_F\Big) \le \kappa(Q)^2\cdot\frac{K_U}{K_L}\cdot 2m\cdot\frac{K}{\sigma_m(Q)^2K_L}. \quad (23)$$
Now note that $3\kappa(Q)\|\mathrm{Err}\|_2 \le 6\kappa(Q)^3\cdot\frac{K_U}{K_L}\cdot m\cdot\frac{K}{\sigma_m(Q)^2K_L} \le \Omega$ by our assumption, and so Lemma 8.5 is applicable, with the matrices $Q\,\mathrm{diag}(\mu_i/\lambda_i)Q^{-1}$, $Q$, and Err playing the roles of $A$, $X$, and $E$, respectively. Lemma 8.5 gives us a permutation $\pi : [m] \to [m]$ such that
$$\big|\nu_{\pi(i)}\big(Q\,\mathrm{diag}(\mu_i/\lambda_i)Q^{-1} + \mathrm{Err}\big) - \nu_i\big(Q\,\mathrm{diag}(\mu_i/\lambda_i)Q^{-1}\big)\big| \le \kappa(Q)\|\mathrm{Err}\|_2 < \Omega/2, \quad (24)$$
where $\nu_i(M)$ denotes an eigenvalue of $M$. Hence all the eigenvalues of $(N_\mu + E_\mu)(N_\lambda + E_\lambda)^{-1}$ are distinct.
By Lemma 8.6, it has $m$ linearly independent eigenvectors $\{v_1, \ldots, v_m\}$.

6 Underdetermined ICA

In this section we give our algorithm for the underdetermined ICA problem and analyze it. In the underdetermined case there are more independent source variables than there are measurements, so $A$ has fewer rows than columns. We have to be more careful about fixing the normalization and scaling of the model than in the fully determined case, where isotropic position provides a convenient normalization for $x$, $A$ and $s$.

Problem 1 (Underdetermined ICA). Fix $n, m \in \mathbb{N}$ such that $n \le m$. We say that $x \in \mathbb{R}^n$ is generated by an underdetermined ICA model if $x = As$ for some fixed matrix $A \in \mathbb{R}^{n\times m}$ of full row rank, where $s \in \mathbb{R}^m$ is a fully independent random vector. In addition, we fix the normalization so that each column $A_i$ has unit norm. The problem is to recover the columns of $A$ from independent samples of $x$, modulo phase factors.

Additional assumptions are needed for the essentially unique identifiability of this model. For example, suppose that columns $A_i$ and $A_j$ are parallel, i.e., $A_i = cA_j$; then one could replace the variables $s_i$ and $s_j$ with $s_i + cs_j$ and $0$, and the model would still be consistent. We introduce the following sufficient condition for identifiability: we require that the $m$ column vectors of $A^{\odot k}$ be linearly independent for some $k > 0$ (smaller $k$ is better for the efficiency of the algorithm). We make this quantitative by requiring that the $m$'th singular value satisfy $\sigma_m(A^{\odot k}) > 0$.

Our approach to the underdetermined ICA problem is to apply our tensor decomposition to a pair of carefully chosen tensors that arise from the distribution. The tensors we use are the derivative tensors of the second characteristic function $\psi_x(u) = \log \mathbb{E} e^{iu^Tx}$.
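The identifiability condition $\sigma_m(A^{\odot k}) > 0$ is efficiently verifiable. A small sketch computing the column-wise Kronecker power and its smallest singular value; the dimensions $n = 3$, $m = 6$, $k = 2$ are arbitrary illustrative choices:

```python
import numpy as np

def col_kron_power(A, k):
    """Column-wise Kronecker power A^{(. k)}: column i becomes A_i (x) ... (x) A_i."""
    out = A.copy()
    for _ in range(k - 1):
        # (n^t, m) and (n, m) -> (n^{t+1}, m), column-by-column Kronecker product
        out = np.einsum('ir,jr->ijr', out, A).reshape(-1, A.shape[1])
    return out

rng = np.random.default_rng(3)
n, m, k = 3, 6, 2
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)
A2 = col_kron_power(A, k)                           # shape (9, 6)
sigma_m = np.linalg.svd(A2, compute_uv=False)[m - 1]
print(A2.shape, sigma_m)
```

For a generic $3 \times 6$ matrix $A$, the $9 \times 6$ matrix $A^{\odot 2}$ has full column rank even though $A$ itself cannot, which is exactly the phenomenon the condition exploits.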
This method generalises the fourth moment methods for ICA, where one computes the local optima of the following quartic form:
$$f(u) = \mathbb{E}(x^Tu)^4 - 3\big(\mathbb{E}(x^Tu)^2\big)^2.$$
An equivalent formulation is to consider the tensor $T \in \mathbb{R}^{n\times n\times n\times n}$ which represents this quartic form (just as in the matrix case, where symmetric matrices represent quadratic forms, symmetric tensors of order 4 represent quartic forms). Let us denote our overall tensor representing $f(u)$ by $T$, where $f(u) = T(u, u, u, u)$. By a relatively straightforward calculation, one can verify that $T(u,u,u,u)$ is the fourth derivative of the second characteristic function of $x$ evaluated at 0:
$$T = D_u^4\psi_x(0).$$
On the other hand, one can also verify that $T$ has the following decomposition (see for example [4]):
$$T = \sum_{j=1}^m \Big(\mathbb{E} s_j^4 - 3\big(\mathbb{E} s_j^2\big)^2\Big)\,A_j \otimes A_j \otimes A_j \otimes A_j.$$
So, in fact, one can view the fourth moment tensor methods as performing the tensor decomposition of only one tensor: the fourth derivative of $\psi$ evaluated at 0.

Our method also generalises the algorithm we gave for the fully determined case in Section 4; we can view that case as simply being the second derivative version of the more general algorithm. The techniques used in this section are generalisations and refinements of those used in the fully determined case, though the easy matrix decomposition arguments are replaced with the corresponding (harder) tensor arguments.

A property of the second characteristic function that is central to our algorithm is that the second characteristic function of a random vector with independent components factorizes into the sum of the second characteristic functions of the components:
$$\log \mathbb{E} e^{iu^Ts} = \sum_{j=1}^m \log \mathbb{E} e^{iu_js_j},$$
and now every mixed partial derivative (with respect to $u_j$ and $u_{j'}$ for $j \ne j'$) is 0, as each term in the sum depends only on one component of $u$.
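The stated decomposition of $T$ can be checked by simulation. The sketch below compares the empirical quartic form $f(u)$ with $\sum_j \kappa_4(s_j)(A_j^Tu)^4$ for Rademacher sources, for which $\kappa_4 = \mathbb{E}s^4 - 3(\mathbb{E}s^2)^2 = 1 - 3 = -2$; the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, N = 2, 3, 400_000
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)
# Rademacher sources: E s^4 - 3 (E s^2)^2 = 1 - 3 = -2.
s = rng.choice([-1.0, 1.0], size=(m, N))
x = A @ s
u = rng.standard_normal(n)
u /= np.linalg.norm(u)

proj = u @ x
f_emp = np.mean(proj ** 4) - 3 * np.mean(proj ** 2) ** 2
f_exact = np.sum(-2.0 * (u @ A) ** 4)   # sum_j kappa_4(s_j) (A_j^T u)^4
print(f_emp, f_exact)
```

With 400,000 samples and bounded sources, the two quantities agree to within sampling noise; the match confirms that only the diagonal cumulant terms survive in $f(u)$.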
Taking the derivative tensor thus results in a diagonal tensor, where the off-diagonal terms are all 0. In the case when we are interested in x = As, we need only perform the change of basis via A carefully for the derivative tensors via the chain rule.

One could also try to perform this analysis with the moment generating function E(e^{u^T x}), without the complex phase. The difficulty is that the moment generating function exists only if all moments of x exist, so a moment generating function approach could not deal with heavy-tailed distributions. Moreover, using a real exponential requires estimating exponentially large quantities from samples, and it is difficult to get good bounds on the sample complexity. Using the complex exponential avoids these problems, as all the quantities involved have modulus 1.

We will then apply our tensor decomposition framework: as before, we show that the eigenvalues of the flattened derivative tensors are well spaced in Section 6.3. We then study the sample complexity in Section 6.4 and assemble these components in Section 6.5.

6.1 Algorithm

For underdetermined ICA we compute the higher derivative tensors of the second characteristic function ψ_x(u) = log(φ_x(u)) at two random points and run the tensor decomposition algorithm from the previous section.

Underdetermined ICA(σ)
1. (Compute derivative tensors) Pick independent random vectors α, β ∼ N(0, σ^2 I_n). For even d, estimate the d'th derivative tensors of ψ_x(u) at α and β as T_α = D^d_u ψ_x(α) and T_β = D^d_u ψ_x(β).
2. (Tensor decomposition) Run Tensor Decomposition(T_α, T_β).

To estimate the d'th derivative tensor of ψ_x(u) empirically, one simply writes down the expression for the derivative tensor, and then estimates each entry from samples using the naive estimator. More precisely, we can use

    ∂φ(u)/∂u_i = E((ix_i) e^{iu^T x}).
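The identity ∂φ(u)/∂u_i = E((ix_i) e^{iu^T x}) is what makes the naive estimator work, and it can be validated directly against a finite difference of the empirical characteristic function, since differentiating the empirical mean under the sum is exact. A small sketch of ours (distribution and parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
x = rng.uniform(-1, 1, (N, 2))          # samples of x; any distribution works for this identity
u = np.array([0.2, -0.4])

phase = np.exp(1j * (x @ u))
# naive estimator of the gradient of phi at u: coordinate i is the mean of (i x_i) e^{i u.x}
grad_est = np.mean(1j * x * phase[:, None], axis=0)

def phi_hat(t):
    """Empirical characteristic function built from the same samples."""
    return np.mean(np.exp(1j * (x @ t)))

# a central finite difference of phi_hat agrees with grad_est up to O(h^2)
h = 1e-4
errs = []
for i in range(2):
    e = np.zeros(2)
    e[i] = h
    errs.append(abs((phi_hat(u + e) - phi_hat(u - e)) / (2 * h) - grad_est[i]))
```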
This states that differentiation in Fourier space is equivalent to multiplication in the original space; thus it suffices to estimate monomials of x reweighted by complex exponentials. To estimate the d'th derivative tensor of log(φ(u)) empirically, one simply writes down the expression for the derivative tensor, and then estimates each entry from samples using the naive estimator. Note that the derivatives can be somewhat complicated; for example, at fourth order we have

[D^4_u ψ]_{i_1,i_2,i_3,i_4} = (1/φ(u)^4) [
    E((ix_{i_1})(ix_{i_2})(ix_{i_3})(ix_{i_4}) e^{iu^T x}) φ(u)^3
  − E((ix_{i_2})(ix_{i_3})(ix_{i_4}) e^{iu^T x}) E((ix_{i_1}) e^{iu^T x}) φ(u)^2
  − E((ix_{i_2})(ix_{i_3}) e^{iu^T x}) E((ix_{i_1})(ix_{i_4}) e^{iu^T x}) φ(u)^2
  − E((ix_{i_2})(ix_{i_4}) e^{iu^T x}) E((ix_{i_1})(ix_{i_3}) e^{iu^T x}) φ(u)^2
  − E((ix_{i_2}) e^{iu^T x}) E((ix_{i_1})(ix_{i_3})(ix_{i_4}) e^{iu^T x}) φ(u)^2
  − E((ix_{i_3})(ix_{i_4}) e^{iu^T x}) E((ix_{i_1})(ix_{i_2}) e^{iu^T x}) φ(u)^2
  − E((ix_{i_3}) e^{iu^T x}) E((ix_{i_1})(ix_{i_2})(ix_{i_4}) e^{iu^T x}) φ(u)^2
  − E((ix_{i_4}) e^{iu^T x}) E((ix_{i_1})(ix_{i_2})(ix_{i_3}) e^{iu^T x}) φ(u)^2
  + 2 E((ix_{i_3})(ix_{i_4}) e^{iu^T x}) E((ix_{i_2}) e^{iu^T x}) E((ix_{i_1}) e^{iu^T x}) φ(u)
  + 2 E((ix_{i_3}) e^{iu^T x}) E((ix_{i_2})(ix_{i_4}) e^{iu^T x}) E((ix_{i_1}) e^{iu^T x}) φ(u)
  + 2 E((ix_{i_4}) e^{iu^T x}) E((ix_{i_2})(ix_{i_3}) e^{iu^T x}) E((ix_{i_1}) e^{iu^T x}) φ(u)
  + 2 E((ix_{i_3}) e^{iu^T x}) E((ix_{i_2}) e^{iu^T x}) E((ix_{i_1})(ix_{i_4}) e^{iu^T x}) φ(u)
  + 2 E((ix_{i_4}) e^{iu^T x}) E((ix_{i_2}) e^{iu^T x}) E((ix_{i_1})(ix_{i_3}) e^{iu^T x}) φ(u)
  + 2 E((ix_{i_4}) e^{iu^T x}) E((ix_{i_3}) e^{iu^T x}) E((ix_{i_1})(ix_{i_2}) e^{iu^T x}) φ(u)
  − 6 E((ix_{i_1}) e^{iu^T x}) E((ix_{i_2}) e^{iu^T x}) E((ix_{i_3}) e^{iu^T x}) E((ix_{i_4}) e^{iu^T x}) ].
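Each entry of the empirical derivative tensor is a sample mean of a quantity of the form (ix_{i_1})···(ix_{i_d}) e^{iu^T x}. Since |e^{iu^T x}| = 1, the variance of such an estimator is at most E|x_{i_1}···x_{i_d}|^2/N, which is the fact driving the sample-complexity bounds in Section 6.4. A quick numerical check of ours (d = 2 and a uniform product distribution are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
N, trials = 1_000, 200
t = np.array([0.3, -0.2])
M_2d = 1 / 5                 # max_j E s_j^{2d} for s_j ~ Uniform(-1, 1) and d = 2

# repeat the naive estimator of the entry with S = {1, 2}: mean of (i s_1)(i s_2) e^{i t.s}
ests = []
for _ in range(trials):
    s = rng.uniform(-1, 1, (N, 2))
    ests.append(np.mean((1j * s[:, 0]) * (1j * s[:, 1]) * np.exp(1j * (s @ t))))
ests = np.array(ests)

# complex variance E|X - EX|^2 of the estimator; at most M_2d / N since |e^{i t.s}| = 1
var = np.mean(np.abs(ests - ests.mean()) ** 2)
```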
The salient points are described in Lemma 4.9 and Claim 4.10: there are at most 2^{d−1}(d−1)! terms (counting multiplicities), and no term has combined exponents of x_i in all its factors higher than d. We give a rigorous analysis of the sampling error incurred in Section 6.4.

6.2 Truncation error

Lemma 6.1. Let s = (s_1, ..., s_m) ∈ R^m be a random vector with independent components, each with finite first k absolute moments, and for t ∈ R^m let φ(t) = E(e^{it^T s}) be the associated characteristic function. Then for k ≥ 1 and i_1, ..., i_k ∈ [m] we have

    |∂_{i_1,...,i_k} log φ(t)| ≤ 2^{k−1}(k−1)! max_{j∈[m]} E(|s_j|^k) / |φ(t)|^k.

Proof. To compute the derivatives of log φ(t) we proceed inductively, with ∂_{i_1} log φ(t) = (∂_{i_1}φ(t))/φ(t) as our base case. For d < k, write ∂_{i_1,...,i_d}(log φ) as N_d(t)/φ(t)^d. Then we have

    ∂_{i_1,...,i_{d+1}} log φ(t) = ∂_{i_{d+1}} ( N_d(t)/φ(t)^d )
      = [ (∂_{i_{d+1}} N_d(t)) φ(t)^d − N_d(t) d φ(t)^{d−1} ∂_{i_{d+1}} φ(t) ] / φ(t)^{2d}
      = [ (∂_{i_{d+1}} N_d(t)) φ(t) − d N_d(t) ∂_{i_{d+1}} φ(t) ] / φ(t)^{d+1}.    (25)

We make the following claim about N_d(t):

Claim 6.2. The function N_d(t) is a sum of terms of the form C_{S_1,...,S_d} ∂_{S_1}φ(t) ··· ∂_{S_d}φ(t), where the multisets S_1, ..., S_d ⊆ {i_1, ..., i_d} (regarded as a multiset) satisfy S_1 ∪ ... ∪ S_d = {i_1, ..., i_d}, and the C_{S_1,...,S_d} are integer coefficients with Σ|C_{S_1,...,S_d}| ≤ 2^{d−1}(d−1)!.

Proof. The first part follows by induction on d and (25). For the second part, let T(d) denote Σ|C_{S_1,...,S_d}|. Note that T(1) = 1. Then by (25) we have T(d+1) ≤ dT(d) + dT(d), which gives T(d) ≤ 2^{d−1}(d−1)!.

Returning to the proof of Lemma 6.1, we observe that for any multiset S with elements from [m] and size at most k, we have

    |∂_S φ(t)| = |E( i^{|S|} s^S e^{it^T s} )| ≤ E(|s^S|).
For ℓ ∈ [m], let p_ℓ be the number of times ℓ occurs in the multiset {i_1, ..., i_d}. For each term of N_d(t) we have

    | ∏_{j=1}^d ∂_{S_j}φ | = ∏_{j=1}^d |∂_{S_j}φ| ≤ ∏_{j=1}^d E(|s^{S_j}|) = ∏_{ℓ=1}^m E(|s_ℓ|^{p_ℓ}) ≤ ∏_{ℓ=1}^m ( E(|s_ℓ|^d) )^{p_ℓ/d} ≤ max_{ℓ∈[m]} E(|s_ℓ|^d).    (26)

The second equality above uses the independence of the s_ℓ, and the second inequality uses the first part of Fact 10.6. Thus |N_d(t)| ≤ 2^{d−1}(d−1)! max_{ℓ∈[m]} E(|s_ℓ|^d), which when divided by |φ(t)|^d gives the required bound.

6.3 Eigenvalue spacings

In this subsection we examine the anti-concentration of the diagonal entries ψ^{(d)}_i((A^T u)_i). The analysis has similarities to the fully determined case, but there are also some major differences: in the fully determined case, A_i^T u and A_j^T u are independent Gaussians because the columns of A are orthogonal by isotropic position (recall that we defined A_i^T to mean (A_i)^T). We cannot make A an orthonormal matrix in the underdetermined case, so we have to exploit the more limited randomness. An additional complication is that we are working with anti-concentration of the quotients of such diagonal entries rather than the entries themselves.

Theorem 6.3. Let s ∈ R^m be a random vector with independent components. For t ∈ R^m and d ∈ 2N let ψ_a(t) := log E(e^{it_a s_a}) and g_a(t_a) := d^d ψ_a(t_a)/dt_a^d for all a ∈ [m]. Let 0 < δ. Suppose that the following data and conditions are given:

1. E(s_a) = 0, E(s_a^2) ≤ M_2 and E(s_a^d) ≤ M_d for a ∈ [m], and M_2 < M_d.
2. k ≥ 2 and, for all a ∈ [m], k_a ∈ N with d < k_a < k such that |cum_{k_a}(s_a)| ≥ ∆.
3. E(|s_a|^{k_a+1}) ≤ M_k for a ∈ [m], and M_2 < M_k.
4. A ∈ R^{n×m} is a full row rank matrix whose columns all have unit norm, and 1 − ⟨A_a, A_b⟩^2 ≥ L^2 for all pairs of columns.
5.
u, v ∼ N(0, σ^2 I_n) sampled independently, where

    σ ≤ min( 1, 1/(2√(2 M_2 log(1/q))), σ' ),  with
    σ' = (∆(k−d+1)/k!) (3/8)^k (1/M_k) ( Lq√(2π)/(4(k−d)) )^{k−d} ( 1/√(2 log(1/q)) )^{k−d},

and 0 < q < 1/3. Then with probability at least 1 − 3m^2 q we have

    | g_b(A_b^T u)/g_b(A_b^T v) − g_a(A_a^T u)/g_a(A_a^T v) | ≥ ∆ (1/((k−d)!(d−1)!)) (3/8)^d (1/M_d) ( σLq√(2π)/(4(k−d)) )^{k−d},    (27)

for all distinct a, b ∈ [m]. (We count the small-probability case where g_b(A_b^T v) = 0 or g_a(A_a^T v) = 0 as violating the event in (27).)

Proof. Fix a ≠ b ∈ [m]; we show that the spacing in (27) is lower bounded for this pair with high probability. We then take a union bound over all m^2 pairs, which gives the desired result. For a random choice of v, the events

    g_a(A_a^T v) ≠ 0 and g_b(A_b^T v) ≠ 0    (28)

have probability 1. Thus in the following we will assume that these events hold. We will need concentration of A_a^T u and of A_a^T v:

    Pr_{u∼N(0,σ^2 I_n)}( |A_a^T u| > τ ) ≤ √(2/π) (σ^2 ‖r‖^2/τ) e^{−τ^2/(2σ^2‖r‖^2)} ≤ √(2/π) (σ^2/τ) e^{−τ^2/(2σ^2)};

here the first inequality used Claim 10.3 and the second used the fact that ‖r‖ ≤ 1. Substituting τ = σ√(2 log(1/q)) we get

    Pr( |A_a^T u| ≤ σ√(2 log(1/q)) ) ≥ 1 − σq/√(π log(1/q)) ≥ 1 − q/√(π log(1/q)).

By the union bound we have

    Pr( |A_a^T u|, |A_a^T v| ≤ σ√(2 log(1/q)) ) ≥ 1 − 2q/√(π log(1/q)).    (29)

In the sequel we will assume that the event in the previous expression happens. Under the assumption that |A_a^T v| ≤ σ√(2 log(1/q)) we have

    |g_a(A_a^T v)| = |ψ^{(d)}(A_a^T v)| ≤ 2^{d−1}(d−1)! M_d / (3/4)^d,    (30)

where to upper bound |ψ^{(d)}(A_a^T v)| we used Lemma 4.9, Lemma 10.1, and the condition σ√(2 log(1/q)) ≤ 1/(2√M_2), which follows from our assumption on σ.
To compute the probability that the spacing is at least ε_a, where ε_a > 0 will be chosen later, we condition on the fixing A_b^T u = z and on any fixing of v:

    Pr( |g_a(A_a^T u)/g_a(A_a^T v) − g_b(A_b^T u)/g_b(A_b^T v)| ≤ ε_a )
      = ∫ Pr( |g_a(A_a^T u)/g_a(A_a^T v) − g_b(z)/g_b(A_b^T v)| ≤ ε_a | A_b^T u = z ) Pr(A_b^T u = z) dz.

We will bound the conditional probability term. Applying Taylor's theorem with remainder (Theorem 4.8) gives

    g_a(A_a^T u) = i^d Σ_{l=d}^{k_a} cum_l(s_a) (i(A_a^T u))^{l−d}/(l−d)! + R_{k_a+1}(A_a^T u) (A_a^T u)^{k_a−d+1}/(k_a−d+1)!.

Truncating g_a after the degree (k_a − d) term yields a polynomial p_a(A_a^T u); denote the truncation error by ρ_a(A_a^T u). Then, setting

    h = g_b(A_b^T u) g_a(A_a^T v)/g_b(A_b^T v),

which is a constant under our conditioning, we have

    | g_a(A_a^T u)/g_a(A_a^T v) − g_b(A_b^T u)/g_b(A_b^T v) |
      = (1/|g_a(A_a^T v)|) | g_a(A_a^T u) − g_b(A_b^T u) g_a(A_a^T v)/g_b(A_b^T v) |
      = (1/|g_a(A_a^T v)|) | g_a(A_a^T u) − h |
      = (1/|g_a(A_a^T v)|) | p_a(A_a^T u) + ρ_a(A_a^T u) − h |
      ≥ (1/|g_a(A_a^T v)|) | p_a(A_a^T u) − h | − (1/|g_a(A_a^T v)|) | ρ_a(A_a^T u) |.

Now we are going to show that the first term above is likely to be large and the second one is likely to be small. We have A_a^T u = ⟨A_a, A_b⟩ A_b^T u + r^T u, where r is a vector orthogonal to A_b. Our hypothesis, namely 1 − ⟨A_a, A_b⟩^2 ≥ L^2, gives ‖r‖^2 ≥ L^2. The orthogonality of r and A_b implies that the univariate Gaussian r^T u is independent of A_b^T u. Now we apply our anti-concentration inequality from Theorem 4.4 to obtain (for u ∼ N(0, σ^2 I_n) and fixed v satisfying (28))

    Pr( |p_a(A_a^T u) − h| ≤ ε_a | A_b^T u = z ) ≤ (4(k_a−d)/(σ‖r‖√(2π))) ( ε_a (k_a−d)!/|cum_{k_a}(s_a)| )^{1/(k_a−d)}
      ≤ (4(k_a−d)/(σL√(2π))) ( ε_a (k_a−d)!/∆ )^{1/(k_a−d)}.    (31)

We choose ε_a := (∆/(k_a−d)!)
( σLq√(2π)/(4(k_a−d)) )^{k_a−d} ≥ (∆/(k−d)!) ( σLq√(2π)/(4(k−d)) )^{k−d} =: ε,

making the RHS of (31) equal to q. Recall that this was for fixed v satisfying (28). For the truncation error, assuming that the event |A_a^T u| ≤ σ√(2 log(1/q)) happens, we have

    |ρ_a(A_a^T u)| ≤ |ψ^{(k_a+1)}(A_a^T u)| (A_a^T u)^{k_a−d+1}/(k_a−d+1)!
      ≤ ( 2^{k_a} M_{k_a+1}/(3/4)^{k_a+1} ) ( k_a!/(k_a−d+1)! ) ( σ√(2 log(1/q)) )^{k_a−d+1}
      ≤ ε_a/2,

where to upper bound |ψ^{(k_a+1)}(A_a^T u)| we used Lemma 4.9, Lemma 10.1, and the condition σ√(2 log(1/q)) ≤ 1/(2√M_2), which holds given our upper bound on σ; the final inequality follows from our condition σ ≤ σ'. Thus with probability at least 1 − 2q/√(π log(1/q)) − q we have

    | p_a(A_a^T u) − h | − | ρ_a(A_a^T u) | ≥ ε_a/2

under the condition that A_b^T u = z and v is fixed. Since this holds for any z and any fixing of v, it also holds without the conditioning on the event A_b^T u = z and the fixing of v. Hence, using (30), with probability at least 1 − 2q/√(π log(1/q)) − q ≥ 1 − 3q we have

    (1/|g_a(A_a^T v)|) ( | p_a(A_a^T u) − h | − | ρ_a(A_a^T u) | ) ≥ ε_a (3/8)^d / ((d−1)! M_d) ≥ ε (3/8)^d / ((d−1)! M_d).

To summarize, with probability at least 1 − 3q the spacing is at least ε. By the union bound, with probability at least 1 − 3m^2 q all the spacings are at least ε.

The following is a straightforward corollary of the proof of the previous theorem.

Corollary 6.4. In the setting of Theorem 6.3, with probability at least 1 − 6mq we have

    |g_a(A^T u)|, |g_a(A^T v)| ≥ (∆_0/(2(k−d)!)) ( σq√(2π)/(4(k−d)) )^{k−d}

for all a ∈ [m].

An important part of the proof is to give a lower bound on the quantity 1 − ⟨A_i, A_j⟩^2 ≥ L^2 so that the ICA model remains identifiable. At order d, we will give our bounds in terms of σ_m(A^{⊙j}) for j = 1, ..., d/2.

Lemma 6.5. Fix m, n ∈ N such that n ≤ m.
Let A ∈ C^{n×m} be a full row rank matrix whose columns A_i have unit norm. Then

    1 − ⟨A_i, A_j⟩^2 ≥ (2/k) σ_m(A^{⊙k})^2

for all k ∈ N with k ≥ 2.

Proof. Consider the matrix B = A^{⊙2} and observe that ⟨A_i, A_j⟩^2 = ⟨B_i, B_j⟩. Define the matrix C = A^{⊙k}. Observe that ‖C_i‖ = ‖A_i‖^k = 1 and that

    1 − ⟨B_i, B_j⟩ = 1 − |⟨C_i, C_j⟩|^{2/k}    (32)

for k ≥ 2. Recall that σ_m(C) = min_{‖x‖=1} ‖Cx‖. In particular, if we consider the vector x = (e_i ± e_j)/√2 we have

    ‖Cx‖^2 = (1/2)( ‖C_i‖^2 + ‖C_j‖^2 ± 2⟨C_i, C_j⟩ ) = 1 ± ⟨C_i, C_j⟩ ≥ σ_m(C)^2.

Hence we must have 1 − |⟨C_i, C_j⟩| ≥ σ_m(C)^2. Combining this with (32), we obtain

    1 − ⟨B_i, B_j⟩ = 1 − |⟨C_i, C_j⟩|^{2/k} ≥ 1 − (1 − σ_m(C)^2)^{2/k} ≥ (2/k) σ_m(C)^2,

where the last step follows by noting that all derivatives of the function f(x) = (1−x)^t for t ∈ (0, 1) are negative in the interval x ∈ [0, 1], so that (1−x)^t ≤ 1 − tx there.

6.4 Sample complexity

To understand the complexity of our algorithm, we must bound how many samples are needed to estimate the matrices M_u and M_v accurately. Throughout this section, we estimate E(f(x)) for some function f(x) using independent samples x_i via

    Ē(f(x)) := (1/N) Σ_{i=1}^N f(x_i) → E(f(x)).

More generally, we will estimate derivative tensors as follows. As before, φ(t) = E(e^{it^T s}), and we define the empirical version of the characteristic function in the natural way: φ̄(t) := (1/N) Σ_{i=1}^N e^{it^T s_i}. As we will see, for a multiset S ⊆ [m] the derivative of φ̄(t) behaves nicely and will serve as an approximation of the derivative of φ(t). Note that

    ∂_S φ̄(t) = Ē( i^{|S|} s^S e^{it^T s} ),

where Ē(·) denotes empirical expectation over N i.i.d. samples of s. Similarly, we estimate ∂_S log φ(t) by

    ∂_S log φ̄(t) = N̄_d(t)/φ̄(t)^d,    (33)

where, by Claim 6.2, N̄_d(t) is a sum of the form Σ_{S_1,...,S_d} C_{S_1,...,S_d} (∂_{S_1} φ̄(t)) ···
(∂_{S_d} φ̄(t)), as described in Claim 6.2. Thus, to show that ∂_S log φ̄(t) is a good approximation of ∂_S log φ(t), we show that

    | N_d(t)/φ(t)^d − N̄_d(t)/φ̄(t)^d | = | φ̄(t)^d N_d(t) − φ(t)^d N̄_d(t) | / | φ(t)^d φ̄(t)^d |

is small. The notion of an empirical estimate of a derivative tensor now follows immediately from (33), which gives how to estimate the individual entries of the tensor.

Lemma 6.6. Let s ∈ R^m be a random vector with independent components. For t ∈ R^m let ψ_s(t) = log φ_s(t) = log E(e^{it^T s}) be the second characteristic function of s. Consider the d'th derivative tensor of ψ_s(t) (it contains m^d entries). Let M_2, M_d, M_{2d} > 0 be such that E(s_i^2) ≤ M_2, E(s_i^d) ≤ M_d and E(|s_i|^{2d}) ≤ M_{2d}. Fix 0 < ε < 1/4, and let ‖t‖ ≤ 1/(2√(2M_2)). Suppose we take N samples; then with probability at least

    1 − (m+d−1 choose d) ( 2^d M_{2d}/(ε^2 N) ) ( 2^d d (M_d+2)^{d−1} (d−1)! / ((3/4)^d (1/2)^d) )^2,

every entry of the empirical tensor will be within ε of the corresponding entry of the derivative tensor.

Proof. In light of (33), we will prove that each term in the expression for N̄_d(t) (a product of several factors ∂_{S_j} φ̄(t)) is a good approximation of the corresponding term in the expression for N_d(t), by showing that the corresponding factors in the products are close. Finally, we show that the whole sum is a good approximation. For a complex-valued random variable X with mean µ, note that

    Var(X) = E((X − µ)(X − µ)*) = E(XX*) − µµ* ≤ E(|X|^2).

We use the second moment method to prove that ∂_S φ̄(t) is a good approximation of ∂_S φ(t) with high probability. For a multiset S ⊆ {i_1, ..., i_d}, with p_j being the frequency of j in S, by the same arguments as in (26) we have

    Var( s^S e^{it^T s} ) ≤ E(s^{2S}) ≤ ∏_{j=1}^m E(s_j^{2p_j}) ≤ ∏_{j=1}^m ( E(s_j^{2d}) )^{p_j/d} ≤ ( max_j E(s_j^{2d}) )^{|S|/d} ≤ M_{2d}^{|S|/d}.
Thus Var(∂_S φ̄(t)) ≤ M_{2d}^{|S|/d}/N ≤ M_{2d}/N. Chebyshev's inequality (which remains unchanged for complex-valued random variables) for ε' > 0 yields

    Pr( | ∂_S φ̄(t) − ∂_S φ(t) | ≥ ε' ) ≤ M_{2d}/(ε'^2 N).    (34)

We will choose a value of ε' shortly. Now we bound the difference between the corresponding summands in the decompositions of N_d(t) and N̄_d(t) as sums. Specifically, except with probability at most (d+1)M_{2d}/(ε'^2 N) (this comes from the union bound: we want the event in (34) to hold for all S_j for j ∈ [d], as well as for S = ∅, corresponding to φ(t)), we have

    | φ̄(t)^d ∏_{j=1}^d ∂_{S_j}φ(t) − φ(t)^d ∏_{j=1}^d ∂_{S_j}φ̄(t) |
      ≤ ∏_{j=1}^d |∂_{S_j}φ(t)| | φ̄(t)^d − φ(t)^d | + |φ(t)^d| | ∏_{j=1}^d ∂_{S_j}φ(t) − ∏_{j=1}^d ∂_{S_j}φ̄(t) |
      ≤ M_d | φ̄(t)^d − φ(t)^d | + ( (M_d + ε')^d − M_d^d )
      ≤ ε' d M_d + ε' d (M_d + ε')^{d−1}
      ≤ 2 ε' d (M_d + 1 + ε')^{d−1},

where the second inequality used (26), Lemma 10.2, and |φ(t)| ≤ 1 and |φ̄(t)| ≤ 1. Now, using the expression for N_d(t) as a sum given in Claim 6.2, except with probability at most 2^d M_{2d}/(ε'^2 N) (the factor 2^d again comes from the union bound: we want the event in (34) to hold for all (multi-)subsets of {i_1, ..., i_d}), we have

    | ∂_S log φ̄(t) − ∂_S log φ(t) | = | φ̄(t)^d N_d(t) − φ(t)^d N̄(t)_d | / | φ(t)^d φ̄(t)^d |    (35)
      ≤ 2^d ε' d (M_d + 1 + ε')^{d−1} (d−1)! / | φ(t)^d φ̄(t)^d |
      ≤ 2^d ε' d (M_d + 1 + ε')^{d−1} (d−1)! / ( (3/4)^d (3/4 − ε')^d )    (36)
      ≤ ε,

where the last inequality used Lemma 10.1 and the choice

    ε' = ( 2^d d (M_d + 1 + ε')^{d−1} (d−1)! / ((3/4)^d (3/4 − ε')^d) )^{−1} ε.

Now, if we want (36) to hold for all multisets S of size d, then the union bound needs to be extended to all such multisets (of which there are (m+d−1 choose d)), giving an error probability of at most

    (m+d−1 choose d) 2^d M_{2d}/(ε'^2 N) = (m+d−1 choose d) ( 2^d M_{2d}/(ε^2 N) ) ( 2^d d (M_d + 1 + ε')^{d−1} (d−1)! /
((3/4)^d (3/4 − ε')^d) )^2,

as desired.

Lemma 6.7 (Sample Complexity). Let x = As be an ICA model with A ∈ R^{n×m}, x ∈ R^n, s ∈ R^m, and d an even positive integer. Let M_2, M_d, M_{2d} > 0 be such that E(s_i^2) ≤ M_2, E(s_i^d) ≤ M_d and E(|s_i|^{2d}) ≤ M_{2d}. Let v ∈ R^n satisfy ‖v‖_2 ≤ 1/(2‖A‖_2 √(2M_2)). Let T_v = D^d_u ψ_x(v) be the d'th derivative tensor of ψ_x(u) = log E(e^{iu^T x}) at v, and let T̄_v = D^d_u ψ̄_x(v) be its naive estimate using N independent samples of x, where

    N ≥ (m+d−1 choose d) ( m^{d/2} σ_1(A)^d )^2 ( 2^d M_{2d}/(ε^2 δ) ) ( 2^d d (M_d+2)^{d−1} (d−1)! / ((3/4)^d (1/2)^d) )^2.

Then with probability at least 1 − δ we have ‖T_v − T̄_v‖_F ≤ ε.

Proof. In the following, all tensors are flattened into matrices. Let x_j = As_j, j ∈ [N], be i.i.d. samples. Letting t = A^T v we have

    T_v = D^d_u ψ_x(u) = A^{⊗d/2} D^d_t ψ_s(t) (A^{⊗d/2})^T, and T̄_v = D^d_u ψ̄_x(u) = A^{⊗d/2} D^d_t ψ̄_s(t) (A^{⊗d/2})^T.

(Note that we could also have written D^d_u ψ_x(u) = A^{⊙d/2} diag(∂^d_{t_j} ψ_s(t)) (A^{⊙d/2})^T because the components of s are independent; however, the corresponding empirical equation D^d_u ψ̄_x(u) = A^{⊙d/2} diag(∂^d_{t_j} ψ̄_s(t)) (A^{⊙d/2})^T need not be true.) Hence

    ‖T̄_v − T_v‖_F = ‖ A^{⊗d/2} D^d_t ψ̄_s(t) (A^{⊗d/2})^T − A^{⊗d/2} D^d_t ψ_s(t) (A^{⊗d/2})^T ‖_F
      = ‖ A^{⊗d/2} ( D^d_t ψ̄_s(t) − D^d_t ψ_s(t) ) (A^{⊗d/2})^T ‖_F
      ≤ σ_1(A^{⊗d/2})^2 ‖ D^d_t ψ̄_s(t) − D^d_t ψ_s(t) ‖_F
      = σ_1(A)^d ‖ D^d_t ψ̄_s(t) − D^d_t ψ_s(t) ‖_F ≤ ε,

where the last inequality holds with probability at least 1 − δ by Lemma 6.6, which is applicable because ‖A^T v‖_2 ≤ ‖A‖_2 ‖v‖_2 ≤ 1/(2√(2M_2)).

6.5 Main theorem and proof

We are now ready to formally state and prove the main theorem. To get a success probability of 3/4, we choose q so that 20m^2 q < 1/4.

Theorem 6.8 (Underdetermined ICA).
Let x ∈ R^n be generated by an underdetermined ICA model x = As with A ∈ R^{n×m}, where n ≤ m. Suppose that the following data and conditions are given:

1. d ∈ 2N such that σ_m(A^{⊙d/2}) > 0.
2. k such that for each i there exists k_i, where d < k_i < k, such that |cum_{k_i}(s_i)| ≥ ∆_0.
3. Constants M_2, M_d, M_k, M_{2d} such that for each s_i the following bounds hold: E(s_i) = 0, E(s_i^2) ≤ M_2, E(s_i^d) ≤ M_d, E(|s_i|^{k_i+1}) ≤ M_k, ∆_0 ≤ M_d, E(|s_i|^{2d}) ≤ M_{2d}.
4. 0 < σ ≤ min( 1, σ_0, (1/(4m)) √(1/(6 M_2 ln(2/q))) ), where

    σ_0 = (∆_0 (k−d+1)/k!) (3/8)^k (1/M_k) ( 2σ_m(A^{⊙d/2}) q√(2π)/(4(k−d)√d) )^{k−d} ( 1/√(2 log(1/q)) )^{k−d}.

Then, with probability at least 1 − 20m^2 q, algorithm Underdetermined ICA(σ) will return a matrix B̃ such that there exist signs α_j ∈ {−1, 1} and a permutation π : [m] → [m] with ‖B_j − α_j B̃_{π(j)}‖ ≤ ε (where B = A^{⊙d/2}), using N samples, where

    N ≥ ( km(M_d+2) / (σ q σ_m(A^{⊙d/2})) )^{ck} κ(A^{⊙d/2})^6 M_{2d} / (∆_0^6 ε^2)

for some absolute constant c. The running time of the algorithm is poly(N).

Proof. The proof involves putting together various results we have proven. We take N independent samples of x and form the flattened d'th derivative tensors M̄_u, M̄_v of ψ(u) evaluated at u and v, which are sampled from N(0, σ_0^2 I_n). Recall that these are the matrices constructed by Underdetermined ICA(σ_0), which then invokes Diagonalize(·), which computes the eigendecomposition of M̄_u M̄_v^{−1}. We will denote by M_u, M_v the corresponding matrices without any sampling errors.

We will first use the result about eigenvalue spacings, Theorem 6.3, to get a bound on the spacings of the eigenvalues of the matrix M_u M_v^{−1}, where u, v ∼ N(0, σ_0^2 I_n) are random vectors. Next, we determine upper and lower bounds K_U and K_L on the eigenvalues of M_u and M_v.
We can then apply Theorem 5.4 to show that if we have sufficiently good approximations of M_u and M_v, then we will get a good reconstruction of the matrix A. Finally, we use Lemma 6.7 to determine the number of samples needed to get the required approximation.

Step 1. First, we apply Theorem 6.3. Note that our choice of σ_0 satisfies the constraints on σ in Theorem 6.3; thus, except with probability q, we have

    | g_b(A_b^T u)/g_b(A_b^T v) − g_a(A_a^T u)/g_a(A_a^T v) | ≥ Ω_0 := (∆_0/M_d) (3/8)^d (1/((d−1)!(k−d)!)) ( σ_0 q L√(2π)/(4(k−d)) )^{k−d},    (37)

for all pairs a, b ∈ [m]. Here L, as defined in Theorem 6.3, is given by 2σ_m(A^{⊙d/2})/√d by Lemma 6.5.

Step 2. Next, we show that u and v concentrate in norm. To do so, we apply the following concentration inequality for sub-exponential random variables (this is standard in the proof of the Johnson-Lindenstrauss Lemma, see [10, 29], or alternatively [66] for a more general formulation).

Lemma 6.9 ([10, 29]). Let z_i ∼ N(0, 1) be i.i.d.; then

    Pr( Σ_{i=1}^n z_i^2 ≥ βn ) ≤ e^{(n/2)(1 − β + log β)}.

For β ≥ 6, the bound only improves as n increases; thus we have the simplified bound

    Pr( Σ_{i=1}^n z_i^2 ≥ βn ) ≤ e^{−nβ/12}.

In particular, union bounding over both u, v ∈ R^n, we have

    Pr( ‖u‖, ‖v‖ ≥ 1/(2‖A‖_F √(2M_2)) ) = Pr( ‖u‖^2, ‖v‖^2 ≥ (1/(2‖A‖_F √(2M_2)))^2 )
      ≤ 2 exp( −(1/(12σ_0^2)) (1/(2‖A‖_F √(2M_2)))^2 ),

where in the second line we took βn = (1/σ_0^2)(1/(2‖A‖_F √(2M_2)))^2. Using ‖A‖_F ≤ √m and our choice of σ_0, which gives σ_0 ≤ (1/(4m))√(1/(6M_2 ln(2/q))), we obtain

    Pr( ‖u‖, ‖v‖ ≥ 1/(2‖A‖_F √(2M_2)) ) ≤ q.

Thus, except with probability q, the norms ‖u‖, ‖v‖ satisfy the hypotheses of Lemma 6.7.

Step 3. Now we determine the values of the parameters K_U and K_L used in Theorem 5.4. A bound for K_U can be obtained by applying Lemma 4.9 and Lemma 10.1 to ψ_s(t) = ψ_s(A^T u).
The latter lemma is applicable because ‖A^T u‖ ≤ ‖A‖_F ‖u‖ ≤ 1/(2√(2M_2)) and ‖A^T v‖ ≤ ‖A‖_F ‖v‖ ≤ 1/(2√(2M_2)) from Step 2:

    K_U = (d−1)! 2^{d−1} M_d / (3/4)^d.

For K_L, by Corollary 6.4 we can set

    K_L = (∆_0/(2(k−d)!)) ( σ_0 q√(2π)/(4(k−d)) )^{k−d},

which holds with probability at least 1 − 6mq.

Step 4. We now fix K_1, which is the upper bound on ‖M_u − M̄_u‖_F and ‖M_v − M̄_v‖_F needed in Theorem 5.4 (the role of these two quantities is played by ‖R_µ‖_F and ‖R_λ‖_F in that theorem). Our assumption ∆_0 ≤ M_d gives Ω_0 ≤ 1 by (37), and hence the bound required in Theorem 5.4 becomes

    K_1 = ε K_L^2 σ_m(B)^3 Ω_0 / ( 2^{11} κ(B)^3 K_U m^2 ),    (38)

where B = A^{⊙d/2}. For this K_1, by Theorem 5.4 the algorithm recovers B̃ with the property that there are signs α_j ∈ {−1, 1} and a permutation π : [m] → [m] such that ‖B_j − α_j B̃_{π(j)}‖ ≤ ε.

Step 5. It remains to determine the number of samples needed to achieve ‖M_u − M̄_u‖_F ≤ K_1 and ‖M_v − M̄_v‖_F ≤ K_1. By Step 2 above, we satisfy the hypotheses of Lemma 6.7. Hence by that lemma, for N at least the quantity

    (m+d−1 choose d) ( m^{d/2} σ_1(A)^d )^2 ( 2^d M_{2d}/(K_1^2 q) ) ( (16/3)^d d (M_d+2)^{d−1} (d−1)! )^2
      ≤ 11^{2d} m^{d/2} d^{2(d+1)} M_{2d} (M_d+2)^{2(d−1)} / (K_1^2 q),

we have ‖M_u − M̄_u‖_F ≤ K_1 except with probability q, and similarly for ‖M_v − M̄_v‖_F. Substituting the value of K_1 from (38), and in turn those of K_U, K_L and Ω_0, and simplifying (we omit the straightforward but tedious details), gives that it suffices to take

    N ≥ 2^{4k+6d+26} 3^{2d} d^{6d+2} (k−d)^{2(k−d)} m^{d/2+4} ( M_{2d} M_d^2 (M_d+2)^{2d} / ∆_0^6 ) ( κ(B)^6 / σ_m(B)^{k−d+6} ) ( 1/(σ^{5(k−d)} q^{5(k−d)+1}) ) ( 1/ε^2 ).

Accounting for the probability of all possible bad events enumerated in the proof via the union bound, we see that with probability at least 1 − q − 3m^2 q − 6mq − q > 1 − 20m^2 q no bad events happen.
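Both Theorem 6.8 and Lemma 6.5 are stated in terms of the computable quantity σ_m(A^{⊙k}). The sketch below (our own; dimensions are illustrative and khatri_rao_power is a helper name we introduce) computes the column-wise Kronecker power and checks the column-separation bound 1 − ⟨A_i, A_j⟩^2 ≥ (2/k) σ_m(A^{⊙k})^2 of Lemma 6.5 on a random unit-column matrix:

```python
import numpy as np

def khatri_rao_power(A, k):
    """Columnwise Kronecker power A^{(.)k}: column j becomes A_j tensor ... tensor A_j (k factors)."""
    C = A.copy()
    for _ in range(k - 1):
        C = np.einsum('ij,kj->ikj', C, A).reshape(-1, A.shape[1])
    return C

rng = np.random.default_rng(2)
n, m, k = 3, 5, 2
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)                     # unit-norm columns, as in the lemma

C = khatri_rao_power(A, k)                         # shape (n**k, m)
sigma_m = np.linalg.svd(C, compute_uv=False)[-1]   # m-th (smallest) singular value

# smallest column separation 1 - <A_i, A_j>^2 over all pairs
gaps = [1 - (A[:, i] @ A[:, j]) ** 2 for i in range(m) for j in range(i + 1, m)]
min_gap = min(gaps)
```

Here σ_m > 0 certifies the nondegeneracy condition of Problem 1 for this A, and min_gap ≥ (2/k) σ_m^2 is exactly the conclusion of Lemma 6.5.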
The running time computation involves empirical estimates of derivative tensors and SVD and eigenvalue computations; we skip the routine check that the running time is poly(N).

6.6 Underdetermined ICA with unknown Gaussian noise

Theorem 6.8 just proved is the detailed version of Theorem 1.3 without Gaussian noise. In this section we indicate how to extend this proof when there is Gaussian noise, thus proving Theorem 1.3 in full. Our algorithm for the noiseless case applies essentially unaltered to the case when the input has unknown Gaussian noise, provided d > 2; we comment on the case d = 2 at the end of this section. More precisely, the ICA model is now x' = x + η = As + η, where η ∼ N(0, Σ), Σ ∈ R^{n×n} is an unknown covariance matrix, and η is independent of s. Using the independence of η and s and the standard expression for the second characteristic function of the Gaussian, we have

    ψ_{x'}(u) = log E( e^{iu^T x + iu^T η} ) = ψ_x(u) + ψ_η(u) = ψ_x(u) − (1/2) u^T Σ u.    (39)

Our algorithm works with (an estimate of) the d'th derivative tensor of ψ_{x'}(u). For d > 2 we have D^d_u ψ_{x'}(u) = D^d_u ψ_x(u), as in (39) the component of the second characteristic function corresponding to the Gaussian noise is quadratic and vanishes under third and higher derivatives. Therefore, but for the estimation errors, the Gaussian noise makes no difference and the algorithm would still recover A as before. Since the algorithm works only with estimates of these derivatives, we have to account for how much our estimate of D^d_u ψ_x(u) changes due to the extra additive term involving the derivative of the estimate of the second characteristic function of the Gaussian. If Σ is such that the moments of the Gaussian noise also satisfy the conditions we imposed on the moments of the s_i in Theorem 6.8, then we can complete the proof with little extra work.
The only thing that changes in the proof of the main theorem is that instead of getting the bound ‖M_u − M̄_u‖ ≤ ε' we get the bound ‖M_u − M̄_u‖ ≤ 2ε'. If we increase the number of samples by a factor of 4, then this bound becomes ‖M_u − M̄_u‖ ≤ ε', and so the proof can be completed without any other change.

The d = 2 case. When d = 2, the second derivative of the component of the second characteristic function corresponding to the noise in (39) is a constant matrix independent of u. Thus if we take derivatives at two different points and subtract them, this constant matrix disappears. This is analogous to the algorithm we gave for fully determined ICA with noise in Section 4.8. The error analysis can still proceed along the above lines; we omit the details.

7 Mixtures of spherical Gaussians

Here we apply Fourier PCA to the classical problem of learning a mixture of Gaussians, assuming each Gaussian is spherical. More precisely, we get samples x + η, where x is from a distribution that is a mixture of k unknown Gaussians, with the i'th component having mixing weight w_i and distribution F_i = N(µ_i, σ_i^2 I); the noise η is drawn from N(µ_η, Σ_η) and is not necessarily spherical. The problem is to estimate the unknown parameters w_i, µ_i, σ_i. Our method parallels the Fourier PCA approach to ICA, but here, because the structure of the problem is additive (rather than multiplicative as in ICA), we can directly use the matrix D^2 φ rather than D^2 ψ = D^2 log(φ). It is easy to show that D^2 φ = Σ_u in the description of our algorithm. For any integrable function f : C^n → C, we observe that for a mixture F = Σ_{i=1}^k w_i F_i:

    E_F( f(x + η) ) = Σ_{i=1}^k w_i E_{F_i}( f(x + η) ).

We assume, without loss of generality, that the full mixture is centered at zero, so that

    Σ_{i=1}^k w_i µ_i = 0.

Fourier PCA for Mixtures
1.
Pick u randomly from N(0, I_n).
2. Compute M = E(xx^T); let V be the span of its top k−1 eigenvectors, let σ̄^2 be its k'th eigenvalue, and let v be its k'th eigenvector. Let z be a vector orthogonal to V and to u.
3. Compute

    Σ_u = E( xx^T e^{iu^T x} ),
    σ̄_u^2 = E( (z^T x)^2 e^{iu^T x} ),
    γ_u = (1/(u^T v)^2) ( −E( (v^T x)^2 e^{iu^T x} ) + σ̄_u^2 ),
    ũ = E( x (z^T x)^2 e^{iu^T x} ).

4. Compute the matrices M = E(xx^T) − σ̄^2 I and M_u = Σ_u − σ̄_u^2 I − iũu^T − iuũ^T − γ_u uu^T.
5. Run Tensor Decomposition(M_u, M) to obtain eigenvectors µ̃_j and eigenvalues λ_j of M_u M^{−1} (in their original coordinate representation).
6. Estimate the mixing weights by finding w ≥ 0 that minimizes ‖Σ_{j=1}^k √(w_j) µ̃_j‖ subject to Σ_{j=1}^k w_j = 1. Then estimate the means and variances via

    µ_j = (1/√(w_j)) µ̃_j,  e^{−(1/2)σ_j^2 ‖u‖^2 + iu^T µ_j} = λ_j.

Lemma 7.1. For any f : C^n → C and x ∼ N(µ, Σ), where Σ is positive definite:

    E( f(x) e^{iu^T x} ) = e^{iu^T µ − (1/2) u^T Σ u} E( f(x + iΣu) ).

Proof.
The proof is a standard completing-the-square argument; consider the exponent:

−(1/2)(x − μ)^T Σ^{−1} (x − μ) + i u^T x
 = −(1/2)( x^T Σ^{−1} x + μ^T Σ^{−1} μ − x^T Σ^{−1} μ − μ^T Σ^{−1} x ) + i u^T x
 = −(1/2)( x^T Σ^{−1} x − x^T Σ^{−1}(μ + iΣu) − (μ + iΣu)^T Σ^{−1} x + (μ + iΣu)^T Σ^{−1}(μ + iΣu) ) + i u^T μ − (1/2)(Σu)^T Σ^{−1}(Σu)
 = −(1/2)( x − (μ + iΣu) )^T Σ^{−1} ( x − (μ + iΣu) ) + i u^T μ − (1/2) u^T Σ u.

Thus:

E[f(x) e^{i u^T x}] = (det(Σ)^{1/2} (2π)^{n/2})^{−1} ∫ f(x) e^{i u^T x} e^{−(1/2)(x−μ)^T Σ^{−1}(x−μ)} dx
 = (det(Σ)^{1/2} (2π)^{n/2})^{−1} ∫ f(x) e^{−(1/2)(x−(μ+iΣu))^T Σ^{−1}(x−(μ+iΣu))} e^{i u^T μ − (1/2) u^T Σ u} dx.

Now with the change of variables y = x − iΣu, we obtain:

E[f(x) e^{i u^T x}] = (det(Σ)^{1/2} (2π)^{n/2})^{−1} e^{i u^T μ − (1/2) u^T Σ u} ∫ f(y + iΣu) e^{−(1/2)(y−μ)^T Σ^{−1}(y−μ)} dy = e^{i u^T μ − (1/2) u^T Σ u} E[f(y + iΣu)].

Note: technically we require that E[|f(x)|] < ∞ with respect to the Gaussian measure so as to apply the dominated convergence theorem, and an analytic extension of the Gaussian integral to complex numbers, but these arguments are standard and we omit them (see for example [46]).

Lemma 7.2. Let x ∈ R^n be drawn from a mixture of k spherical Gaussians in R^n, with u, z ∈ R^n as in the algorithm. Let ŵ_j = w_j e^{i u^T μ_j − (1/2) σ_j² ‖u‖²}. Then:

E[x x^T] = Σ_{j=1}^k w_j σ_j² I + Σ_{j=1}^k w_j μ_j μ_j^T. (40)

E[x x^T e^{i u^T x}] = Σ_{j=1}^k ŵ_j σ_j² I + Σ_{j=1}^k ŵ_j (μ_j + iσ_j² u)(μ_j + iσ_j² u)^T. (41)

E[x (z^T x)² e^{i u^T x}] = Σ_{j=1}^k ŵ_j σ_j² (μ_j + iσ_j² u). (42)

Proof. These are obtained by direct calculation, expanding out x ∼ N(μ_j, σ_j² I_n) for each component.
For (40):

E[x x^T] = Σ_{j=1}^k w_j E_{F_j}[x x^T] = Σ_{j=1}^k w_j E[(x − μ_j + μ_j)(x − μ_j + μ_j)^T] = Σ_{j=1}^k w_j ( σ_j² I_n + μ_j μ_j^T ).

(41) follows by applying Lemma 7.1 and the previous result:

E[x x^T e^{i u^T x}] = Σ_{i=1}^k w_i e^{i u^T μ_i − (1/2) σ_i² ‖u‖²} E_{F_i}[(x + iσ_i² u)(x + iσ_i² u)^T] = Σ_{i=1}^k ŵ_i ( σ_i² I_n + (μ_i + iσ_i² u)(μ_i + iσ_i² u)^T ).

To see (42), we write (noting that z is orthogonal to u and to each μ_j):

E[x (z^T x)² e^{i u^T x}] = Σ_{j=1}^k ŵ_j E[(x + iσ_j² u)(z^T (x + iσ_j² u))²]
 = Σ_{j=1}^k ŵ_j E[(x − μ_j + μ_j + iσ_j² u)(z^T (x − μ_j))²]
 = Σ_{j=1}^k ŵ_j ( E[(x − μ_j)(z^T (x − μ_j))²] + E[(μ_j + iσ_j² u)(z^T (x − μ_j))²] )
 = Σ_{j=1}^k ŵ_j σ_j² (μ_j + iσ_j² u),

where the first expectation vanishes because third central moments of a Gaussian are zero.

Instead of polynomial anti-concentration under a Gaussian measure, we require only a simpler lemma concerning the anti-concentration of complex exponentials:

Lemma 7.3 (Complex exponential anti-concentration). Let μ_i, μ_j ∈ R^n satisfy ‖μ_i − μ_j‖ > 0, and let u ∼ N(0, σ² I_n) where ‖μ_i − μ_j‖² σ² ≤ 2π². Then:

Pr[ |e^{i μ_i^T u} − e^{i μ_j^T u}| ≤ ε ] ≤ 16ε / ( ‖μ_i − μ_j‖ σ √(2π) ).

Proof. First, note that it suffices to show anti-concentration of the complex exponential around 1:

|e^{i μ_i^T u} − e^{i μ_j^T u}| = |e^{i μ_i^T u}| |1 − e^{i (μ_i − μ_j)^T u}| = |1 − e^{i (μ_i − μ_j)^T u}|.

The exponent (μ_i − μ_j)^T u is a random variable z ∈ R distributed according to N(0, σ² ‖μ_i − μ_j‖²). From plane geometry, we know that

|e^{iz} − 1| > ε whenever z ∉ ∪_{k∈Z} [2πk − 2ε, 2πk + 2ε].

We can bound the probability of the complementary event as follows:

Pr( z ∈ ∪_{k∈Z} [2πk − 2ε, 2πk + 2ε] ) ≤ 2 Σ_{k=0}^∞ (4ε / (‖μ_i − μ_j‖ σ √(2π))) e^{−(2πk)² / (2 ‖μ_i − μ_j‖² σ²)}
 ≤ (8ε / (‖μ_i − μ_j‖ σ √(2π))) Σ_{k=0}^∞ e^{−2π²k / (‖μ_i − μ_j‖² σ²)}
 = (8ε / (‖μ_i − μ_j‖ σ √(2π))) · 1 / (1 − e^{−2π² / (‖μ_i − μ_j‖² σ²)})
 ≤ 16ε / (‖μ_i − μ_j‖ σ √(2π)),

where the last line follows from the assumption ‖μ_i − μ_j‖² σ² ≤ 2π².
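As a quick numerical illustration of Lemma 7.3 (our sanity check, not part of the proof), one can estimate the left-hand probability by Monte Carlo and compare it against the bound. The particular means, σ, and ε below are arbitrary choices satisfying the hypothesis ‖μ_i − μ_j‖² σ² ≤ 2π².

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
mu_i = np.zeros(n)
mu_j = np.ones(n) / np.sqrt(n)          # chosen so that ||mu_i - mu_j|| = 1
delta = np.linalg.norm(mu_i - mu_j)
sigma, eps = 1.0, 0.05                  # delta^2 * sigma^2 = 1 <= 2*pi^2

# Sample u ~ N(0, sigma^2 I_n) and estimate Pr(|e^{i mu_i.u} - e^{i mu_j.u}| <= eps).
u = sigma * rng.standard_normal((200_000, n))
gap = np.abs(np.exp(1j * u @ mu_i) - np.exp(1j * u @ mu_j))
empirical = np.mean(gap <= eps)
bound = 16 * eps / (delta * sigma * np.sqrt(2 * np.pi))
print(empirical, bound)
```

With these parameters the bound evaluates to roughly 0.32, and the empirical frequency should come out well below it, as the lemma predicts.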
We can now prove that the algorithm is correct given sufficiently many samples. Using PCA we can find the span of the means {μ_1, …, μ_k} as the span of the top k − 1 right singular vectors of the matrix whose rows are sample points [64]. Projecting to this space, the mixture remains a mixture of spherical Gaussians. We assume that the μ_i are linearly independent (as in recent work [39] with higher moments).

Proof of Theorem 1.5. From Lemma 7.2, we observe that for any unit vector v,

E[(v^T x)²] = v^T E[x x^T] v = Σ_{i=1}^k w_i σ_i² + Σ_{i=1}^k w_i (μ_i^T v)².

Without loss of generality, we may assume that the overall mean is 0, hence Σ_i w_i μ_i = 0 and the μ_i are linearly dependent; their span therefore has dimension at most k − 1, and there exists a v orthogonal to all the μ_i. For such a v, the variance is σ̄² = Σ_{i=1}^k w_i σ_i², while for v in the span, the variance is strictly higher. Therefore the value σ̄² is simply the k'th eigenvalue of E[x x^T] (recalling that x is centered at 0). Thus, in the algorithm we have estimated the matrices

M = Σ_{i=1}^k w_i μ_i μ_i^T = A A^T

and

M_u = Σ_{i=1}^k w_i e^{−(1/2) ‖u‖² σ_i² + i u^T μ_i} μ_i μ_i^T = A D_u A^T,

with (D_u)_{ii} = e^{−(1/2) ‖u‖² σ_i² + i u^T μ_i}. Thus M_u M^{−1} = A D_u A^{−1}, and its eigenvectors are the columns of A, assuming the entries of D_u are distinct. We note that the columns of A are precisely μ̃_j = √(w_j) μ_j. The eigenvalue corresponding to the eigenvector μ̃_j is the j'th diagonal entry of D_u.

To prove the algorithm's correctness, we will once again apply Theorem 5.4 for robust tensor decomposition by verifying its five conditions. Condition 1 holds by our assumption on the means of the Gaussian mixture. Condition 3 holds by taking sufficiently many samples (the overall sample and time complexity will be linear in n and polynomial in k), and Conditions 2 and 4 hold by applying Lemma 7.3.
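The identity M_u M^{−1} = A D_u A^{−1} at the heart of this proof can be checked numerically with population-exact matrices. The sketch below is our own illustration, not the sample-based algorithm: it works directly in R^k (i.e., after the PCA projection, so that A is square and invertible), builds M and M_u from randomly chosen mixture parameters, and recovers the columns of A as eigenvectors up to complex scale.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4                                    # dimension after projecting to the span of the means
w = rng.dirichlet(np.ones(k))            # mixing weights w_j
mu = rng.standard_normal((k, k))         # column j is the component mean mu_j
sig2 = rng.uniform(0.5, 2.0, size=k)     # component variances sigma_j^2
u = rng.standard_normal(k)               # Fourier direction u

A = mu * np.sqrt(w)                      # columns tilde_mu_j = sqrt(w_j) * mu_j
D_u = np.diag(np.exp(-0.5 * (u @ u) * sig2 + 1j * (u @ mu)))

M = A @ A.T                              # population sum_j w_j mu_j mu_j^T
M_u = A @ D_u @ A.T                      # Fourier-reweighted counterpart

_, vecs = np.linalg.eig(M_u @ np.linalg.inv(M))

# Each eigenvector of M_u M^{-1} should align with some column of A (up to phase).
col_norms = np.linalg.norm(A, axis=0)
worst = 0.0
for j in range(k):
    c = vecs[:, j] / np.linalg.norm(vecs[:, j])
    cosines = np.abs(A.T @ c) / col_norms
    worst = max(worst, 1 - cosines.max())
print(worst)
```

The printed quantity is the worst misalignment over eigenvectors; in exact arithmetic it is 0, so numerically it should be at floating-point scale whenever the diagonal entries of D_u are distinct (which holds almost surely for a random u).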
We can apply our observations regarding Gaussian noise from Section 4.8. Namely, the covariance of the reweighted Gaussian is shifted by Σ_η, the covariance of the unknown noise. Thus, by considering Σ_u and the standard covariance and taking their difference, the contribution of the noise is removed, and we are left with a matrix that can be diagonalized using A.

8 Perturbation bounds

In this section, we collect all the eigenvalue decomposition bounds that we require in our proofs. The generalized Weyl inequality we derive in Lemma 8.5 surprisingly seems to be unknown in the literature.

8.1 SVD perturbations

In this section, we present two standard perturbation bounds on singular vectors. These bounds will help determine the accuracy needed in estimating the matrix from samples.

Lemma 8.1. Let A ∈ C^{n×n} and suppose that σ_i(A) − σ_{i+1}(A) ≥ ε for all i. Let E ∈ C^{n×n} be a matrix with ‖E‖_2 ≤ δ. Denote by v_i the right singular vectors of A and by v̂_i the right singular vectors of A + E. Then:

‖v_i − v̂_i‖ ≤ √2 δ / (ε − δ).

Proof. We first write:

‖v_i − v̂_i‖² = ⟨v_i − v̂_i, v_i − v̂_i⟩ = 2(1 − ⟨v_i, v̂_i⟩) = 2(1 − cos θ) ≤ 2(1 − cos² θ) = 2 sin² θ.

Next, we apply the following form of Wedin's theorem from [62], where notions such as the canonical angles used in the statement below are also explained.

Theorem 8.2. Let A, E ∈ C^{m×n} be complex matrices with m ≥ n. Let A have singular value decomposition

A = [U_1 U_2 U_3] ( Σ_1 0 ; 0 Σ_2 ; 0 0 ) ( V_1^* ; V_2^* ),

and similarly for Ã = A + E (with conformal decomposition using Ũ_1, Σ̃_1, etc.). Suppose there are numbers α, β > 0 such that

1. min σ(Σ̃_1) ≥ α + β,
2. max σ(Σ_2) ≤ α.
Then

‖sin(Φ)‖_2, ‖sin(Θ)‖_2 ≤ ‖E‖_2 / β,

where Φ is the (diagonal) matrix of canonical angles between the ranges of U_1 and Ũ_1, and Θ denotes the matrix of canonical angles between the ranges of U_2 and Ũ_2.

We also require the following form of Weyl's inequality (see [62]):

Lemma 8.3. Let A, E ∈ C^{m×n}. Then |σ_i(A + E) − σ_i(A)| ≤ σ_1(E).

By Weyl's inequality, we know that σ(Σ_1) − σ(Σ̃_2) ≥ ε − δ, and similarly for the smallest singular value. To apply Wedin's theorem, we pick the partition with Σ_1 the top i singular values and Σ_2 the remaining ones. Thus, taking α = σ_{i+1}(A) and β = ε − δ, we have

|sin(θ)| ≤ ‖sin(Φ)‖_2 ≤ δ / (ε − δ),

as required.

8.2 Perturbations of complex diagonalizable matrices

The main technical issue in giving a robust version of our algorithms is that the stability of eigenvectors of general matrices is more complicated than for Hermitian or normal matrices, where the sin(θ) theorem of Davis and Kahan [31] describes the whole situation. Roughly speaking, the difficulty lies in the fact that for a general matrix, the eigenvalue decomposition is given by A = P D P^{−1}. Upon adding a perturbation E, it is not clear a priori that A + E has a full set of eigenvectors; that is, A + E may no longer be diagonalizable. The goal of this section is to establish that for a general matrix with well-spaced eigenvalues, sufficiently small perturbations do not affect diagonalizability. We use the Bauer–Fike theorem via a homotopy argument typically used in proving strong versions of the Gershgorin circle theorem [68].

Theorem 8.4 (Bauer–Fike [11]). Let A ∈ C^{n×n} be a diagonalizable matrix with A = X diag(λ_i) X^{−1}. Then for any eigenvalue μ of A + E ∈ C^{n×n} we have

min_i |λ_i(A) − μ| ≤ κ(X) ‖E‖_2.
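To make the Bauer–Fike theorem concrete, here is a small numerical check (our illustration with arbitrary test matrices, not part of the proofs): every eigenvalue of A + E must lie within κ(X)‖E‖_2 of some eigenvalue of A.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
X = rng.standard_normal((n, n))               # eigenvector matrix (invertible a.s.)
lam = np.arange(1.0, n + 1)                   # well-separated eigenvalues of A
A = X @ np.diag(lam) @ np.linalg.inv(X)
E = 1e-3 * rng.standard_normal((n, n))        # small perturbation

radius = np.linalg.cond(X, 2) * np.linalg.norm(E, 2)   # kappa(X) * ||E||_2

# Distance from each perturbed eigenvalue to the nearest lambda_i.
mu = np.linalg.eigvals(A + E)
dist = np.abs(mu[:, None] - lam[None, :]).min(axis=1)
print(dist.max(), radius)
```

The maximum distance should be bounded by the Bauer–Fike radius; in practice it is typically much smaller, since the κ(X) factor is often far from tight.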
Using this, we prove a weak version of Weyl's theorem for diagonalizable matrices whose eigenvalues are well-spaced. We consider this a spectral-norm version of the strong Gershgorin circle theorem (which uses row-wise L_1 norms).

Lemma 8.5 (Generalized Weyl inequality). Let A ∈ C^{n×n} be a diagonalizable matrix with A = X diag(λ_i) X^{−1}. Let E ∈ C^{n×n} be a matrix such that |λ_i(A) − λ_j(A)| ≥ 3 κ(X) ‖E‖_2 for all i ≠ j. Then there exists a permutation π : [n] → [n] such that

|λ_i(A + E) − λ_{π(i)}(A)| ≤ κ(X) ‖E‖_2.

Proof. Consider the matrix M(t) = A + tE for t ∈ [0, 1]. By the Bauer–Fike theorem, every eigenvalue λ̂(t) of M(t) is contained in B(λ_i, t κ(X) ‖E‖_2) for some i (for λ ∈ C, t ∈ R we use B(λ, t) to denote the ball in C of radius t centered at λ). In particular, when t = 0 we know that λ̂(0) = λ_i ∈ B(λ_i, 0). As we increase t, λ̂(t) is a continuous function of t, thus it traces a connected curve in C. Suppose that λ̂(1) ∈ B(λ_j, κ(X) ‖E‖_2) for some j ≠ i; then for some t*, we must have λ̂(t*) ∉ ∪_ℓ B(λ_ℓ, κ(X) ‖E‖_2), as these balls are disjoint. This contradicts the Bauer–Fike theorem. Hence we must have λ̂(1) ∈ B(λ_i, κ(X) ‖E‖_2), as desired.

The following is a sufficient condition for the diagonalizability of a matrix. The result is well known (Exercise V.8.1 in [47], for example).

Lemma 8.6. Let A : V → V be a linear operator on a finite-dimensional vector space of dimension n. Suppose that all the eigenvalues of A are distinct, i.e., λ_i ≠ λ_j for all pairs i ≠ j. Then A has n linearly independent eigenvectors.

We require the following generalization of the Davis–Kahan sin(θ) theorem [31] to general diagonalizable matrices, due to Eisenstat and Ipsen [34]:

Theorem 8.7 (Generalized sin(θ) theorem).
Let A, A + E ∈ C^{n×n} be diagonalizable matrices. Let γ̂ be an eigenvalue of A + E with associated eigenvector x̂. Let

A = [X_1, X_2] ( Γ_1 0 ; 0 Γ_2 ) [X_1, X_2]^{−1}

be an eigendecomposition of A. Here Γ_1 consists of the eigenvalues of A closest to γ̂, i.e. ‖Γ_1 − γ̂ I‖_2 = min_i |γ_i − γ̂|, with associated matrix of eigenvectors X_1; Γ_2 contains the remaining eigenvalues, and the associated eigenvectors are in X_2. Also let

[X_1, X_2]^{−1} =: ( Z_1^* ; Z_2^* ).

Then the angle between x̂ and the subspace spanned by the eigenvectors associated with Γ_1 satisfies

sin(θ) ≤ κ(Z_2) ‖(A − γ̂ I) x̂‖_2 / min_i |(Γ_2)_{ii} − γ̂|.

9 General position and average-case analysis

Our necessary condition for identifiability is satisfied almost surely by randomly chosen vectors, for a fairly general class of distributions. For simplicity we restrict ourselves to the case d = 2 and the Gaussian distribution in the following theorem; the proof of a more general statement would be similar.

Theorem 9.1. Let v_1, …, v_m ∈ R^n be i.i.d. standard Gaussian random vectors, with m ≤ C(n+1, 2). Then v_1^{⊗2}, …, v_m^{⊗2} are linearly independent almost surely.

Proof sketch. Take m = C(n+1, 2) without loss of generality. Consider vectors w_1, …, w_m, where w_i is obtained from v_i^{⊗2} by removing duplicate components; e.g., for v_1 ∈ R², we have v_1^{⊗2} = (v_1(1)², v_1(2)², v_1(1) v_1(2), v_1(1) v_1(2)) and w_1 = (v_1(1)², v_1(2)², v_1(1) v_1(2)). Thus w_i ∈ R^{C(n+1,2)}. Now consider the determinant of the C(n+1, 2) × C(n+1, 2) matrix with the w_i as columns. As a formal multivariate polynomial in the components of the v_i, this determinant is not identically 0. This is because, for example, it can be checked that the monomial w_1(1)² ⋯ w_n(n)² w_{n+1}(ρ(n+1)) ⋯
w_m(ρ(m)) occurs precisely once in the expansion of the determinant as a sum of monomials (here ρ : {n+1, …, m} → C([n], 2) is an arbitrary bijection onto the set of pairs). The proof can now be completed along the lines of the well-known Schwartz–Zippel lemma.

We now show that the condition number of the Khatri–Rao power of a random matrix behaves well in certain situations. For simplicity we deal with the case where the entries of the base matrix M are chosen uniformly at random from {−1, 1}; the case of Gaussian entries gives a similar though slightly weaker result, but would require some extra work. We define a notion of d'th power of a matrix M ∈ R^{n×m} which is similar to the Khatri–Rao power except that we keep only the non-redundant multilinear part, resulting in a C(n, d) × m matrix. Working with this multilinear part will simplify things. Formally, M^{⊖d} := [M_1^{⊖d}, …, M_m^{⊖d}], where for a column vector C ∈ R^n we define C^{⊖d} ∈ R^{C(n,d)} with entries

C_S := C_{i_1} C_{i_2} ⋯ C_{i_d}, where 1 ≤ i_1 < i_2 < ⋯ < i_d ≤ n and S = {i_1, …, i_d} ∈ C([n], d).

The following theorem is stated for the case when the base matrix satisfies M ∈ R^{n×n²}. This choice keeps the statement and proof of the theorem simple; the generalization to other parameterizations is straightforward. While the theorem below is proved for the submatrices M^{⊖d} of the Khatri–Rao power M^{⊙d}, similar results hold for M^{⊙d} by the interlacing properties of the singular values of submatrices [63].

Theorem 9.2. Let M ∈ R^{n×m} be chosen by sampling each entry i.i.d. uniformly at random from {−1, 1}. For m = n², integer d ≥ 3, N = C(n, d), and A = M^{⊖d}, we have

E[ max_{j ≤ n²} |σ_j(A) − √N| ] < N^{1/2 − Ω(1)}.

Proof. We are going to use Theorem 5.62 of Vershynin [66], which we state here essentially verbatim:

Theorem 9.3 ([66]).
Let A be an N × m matrix (N ≥ m) whose columns A_j are independent isotropic random vectors in R^N with ‖A_j‖_2 = √N almost surely. Consider the incoherence parameter

μ := (1/N) E max_{j ≤ m} Σ_{k ∈ [m], k ≠ j} ⟨A_j, A_k⟩².

Then for absolute constants C, C_0 we have

E ‖ (1/N) A^* A − I ‖ ≤ C_0 √( μ log m / N ).

In particular,

E[ max_{j ≤ m} |σ_j(A) − √N| ] < C √( μ log m ).

Our matrix A = M^{⊖d} will play the role of the matrix A in Theorem 9.3. Note that for a column A_j we have E[A_j ⊗ A_j] = I, so the A_j are isotropic. Also note that ‖A_j‖_2 = √N always. We now bound the incoherence parameter μ. To this end, we first prove a concentration bound for ⟨A_j, A_k⟩, for fixed j, k. We use a concentration inequality for polynomials of random variables. Specifically, we use Theorem 23 at (http://www.contrib.andrew.cmu.edu/~ryanod/?p=1472). Let us restate that theorem here.

Theorem 9.4. Let f : {−1, 1}^n → R be a polynomial of degree at most k. Then for any t ≥ (2e)^{k/2} we have

Pr_{x ∼ {−1,1}^n} [ |f(x)| ≥ t ‖f‖_2 ] ≤ exp( −(k/2e) t^{2/k} ),

where ‖f‖_2 := [E_x f(x)²]^{1/2}.

For our application to ⟨A_j, A_k⟩, we first fix A_j arbitrarily. Then ⟨A_j, A_k⟩, which will play the role of f(x) in the above theorem, can be written as Σ_{S ∈ C([n], d)} c_S x_S, where the choice of the coefficients c_S = ±1 comes from the fixing of A_j, and the entries of A_k are of the form x_S with S ∈ C([n], d). Now

E_x ⟨A_j, A_k⟩² = Σ_{S, S′ ∈ C([n], d)} c_S c_{S′} E_x[x_S x_{S′}] = Σ_{S ∈ C([n], d)} c_S² E_x[x_S²] + Σ_{S ≠ S′} c_S c_{S′} E_x[x_S x_{S′}] = C(n, d) = N,

since each cross term E_x[x_S x_{S′}] with S ≠ S′ is the expectation of a nonconstant multilinear monomial and hence vanishes. In other words, for our choice of f we have ‖f‖_2 = √N. Applying Theorem 9.4 with t ≥ (2e)^{d/2} and λ = t √N, we have

Pr_x [ |⟨A_j, A_k⟩| ≥ λ ] ≤ exp( −(d/2e) t^{2/d} ) = exp( −(d/2e) λ^{2/d} / N^{1/d} ).
(43)

Note that we proved the above inequality for any fixed A_j, so clearly it also holds when A_j is random. We now estimate the parameter μ. Note that ⟨A_j, A_k⟩² ≤ N² always. When the union of the events in (43) over all j ≠ k, which we denote by B, does not hold, we use the bound just mentioned. For the following computation recall that the number of columns m of A is n²:

μ ≤ (1/N) m λ² Pr(B̄) + (1/N) m N² Pr(B) ≤ m λ²/N + m · m² · N · exp( −(d/2e) λ^{2/d} / N^{1/d} ) ≤ n² λ²/N + n⁶ N exp( −(d/2e) λ^{2/d} / N^{1/d} ). (44)

Now choose λ := N^{1/2 + ε} for a small ε > 0. Then the expression in (44) is bounded by

(44) ≤ n² N^{2ε} + n⁶ N exp( −(d/2e) N^{2ε/d} ).

It is now clear that for a sufficiently small choice of ε (say 0.05) and sufficiently large n (depending on d and ε), only the first term above is significant, and using our assumption d ≥ 3 gives

μ < 2 n² N^{2ε} ≤ 2 (d!)^{2/d} N^{2/d + 2ε} ≪ N.

Therefore by Theorem 9.3 we have

E ‖ (1/N) A^* A − I ‖ ≤ C_0 √( μ log n² / N ) < 1 / N^{Ω(1)},

which gives

E[ max_{j ≤ n²} |σ_j(A) − √N| ] < N^{1/2 − Ω(1)}.

In particular, setting σ_min(A) := σ_{n²}(A), we have

E[ |σ_min(A) − √N| ] < N^{1/2 − Ω(1)}.

Using Markov's inequality this also gives probability bounds.

10 Technical lemmas

In this section we collect some of the technical claims needed in the paper.

Lemma 10.1 (Nonvanishing of φ(t)). Let s be a real-valued random vector in R^m with independent components and E[s] = 0. Also let E[|s_j|] and E[s_j²] exist, with E[s_j²] ≤ M_2 for all j, for some M_2 > 0. Then for t ∈ R^m with ‖t‖_2 ≤ 1/(2√M_2), the characteristic function φ(·) of s satisfies |φ(t)| ≥ 3/4.

Proof. Using Taylor's theorem (4.8) for cos y and sin y gives

e^{iy} = cos y + i sin y = 1 + iy − (y²/2!) [cos(θ_1 y) + i sin(θ_2 y)],

for y, θ_1, θ_2 ∈ R with |θ_1| ≤ 1, |θ_2| ≤ 1.
Applying this to y = t^T s, taking the expectation over s, and using the assumption of zero means on the s_j, we get

E[e^{i t^T s}] = 1 − E[ ((t^T s)²/2) (cos(θ_1 y) + i sin(θ_2 y)) ],

which, using the independence of the components of s and the zero-means assumption, gives

|E[e^{i t^T s}] − 1| = |φ(t) − 1| ≤ (1/2) E[ (t^T s)² |cos(θ_1 y) + i sin(θ_2 y)| ] ≤ E[(t^T s)²] = Σ_j t_j² E[s_j²] ≤ M_2 ‖t‖_2² ≤ 1/4.

Lemma 10.2. Let a_1, …, a_d, b_1, …, b_d ∈ C be such that |a_j − b_j| ≤ ε for real ε ≥ 0, and |a_j| ≤ R for R > 0. Then

| Π_{j=1}^d a_j − Π_{j=1}^d b_j | ≤ (R + ε)^d − R^d.

Proof. For 1 ≤ j ≤ d, define the j'th elementary symmetric polynomial in d variables: σ_j(x_1, …, x_d) = Σ_{1 ≤ i_1 < … < i_j ≤ d} x_{i_1} ⋯ x_{i_j}, and set σ_0 := 1. We will use the following well-known inequality (see, e.g., [61]), which holds for x_ℓ ≥ 0 for all ℓ:

( σ_j(x_1, …, x_d) / C(d, j) )^{1/j} ≤ σ_1(x_1, …, x_d) / d. (45)

Let b_j = a_j + ε_j. Then

| Π_j (a_j + ε_j) − Π_j a_j | ≤ ε σ_{d−1}(|a_1|, …, |a_d|) + ε² σ_{d−2}(|a_1|, …, |a_d|) + … + ε^{d−1} σ_1(|a_1|, …, |a_d|) + ε^d ≤ d ε R^{d−1} + C(d, 2) ε² R^{d−2} + … + ε^d = (R + ε)^d − R^d,

where the second inequality follows from (45).

Claim 10.3. Let u ∈ R be sampled according to N(0, σ²). Then for τ > 0 we have

Pr(|u| > τ) ≤ √(2/π) (σ/τ) e^{−τ²/(2σ²)}.

Proof. Follows from the well-known fact that (1/√(2π)) ∫_a^∞ e^{−z²/2} dz ≤ (1/√(2π)) (1/a) e^{−a²/2} for a > 0.

We state the following easy claim without proof.

Claim 10.4. Let B ∈ C^{p×m} with p ≥ m and rank(B) = m. Let D ∈ C^{m×m} be a diagonal matrix. Then σ_m(B D B^T) ≥ σ_m(B)² σ_m(D).

Claim 10.5. For E ∈ C^{m×m} with ‖E‖_F < 1/2, we have (I − E)^{−1} = I + E + R, where ‖R‖_F < m ‖E‖_F.

Proof. For ‖E‖_F < 1/2 we have (I − E)^{−1} = I + E + E² + ⋯.
Hence

‖(I − E)^{−1} − (I + E)‖_F ≤ ‖E²‖_F ‖(I − E)^{−1}‖_F < m ‖E‖_F.

Fact 10.6. For a real-valued random variable x and any 0 < p ≤ q, we have

E(|x|^p)^{1/p} ≤ E(|x|^q)^{1/q}, and E(|x|^p) E(|x|^q) ≤ E(|x|^{p+q}).

Proof. Hölder's inequality implies that for 0 < p ≤ q we have E(|x|^p)^{1/p} ≤ E(|x|^q)^{1/q}, and hence

E(|x|^p) E(|x|^q) ≤ E(|x|^{p+q})^{p/(p+q)} E(|x|^{p+q})^{q/(p+q)} = E(|x|^{p+q}).

11 Conclusion

We conclude with some open problems.

(1) Our condition for ICA to be possible required that there exist a d such that A^{⊙d} has full column rank. As mentioned before, the existence of such a d turns out to be equivalent to the necessary and sufficient condition for ICA, namely that any two columns of A are linearly independent. Thus if d is large for a matrix A, then our algorithm, whose running time is exponential in d, will be inefficient. This is inevitable to some extent, as suggested by the ICA lower bound in [7]. However, the lower bound there requires that one of the s_i be Gaussian. Can one prove the lower bound without this requirement?

(2) Give an efficient algorithm for independent subspace analysis. This is the problem where the s_i are not all independent; rather, the set of indices [m] is partitioned into subsets, and for any two distinct subsets S_1 and S_2 in the partition, s_{S_1} is independent of s_{S_2}, where s_{S_1} denotes the vector of the s_i with i ∈ S_1. Clearly this problem is a generalization of ICA.

Acknowledgements. We thank Sham Kakade for helpful discussions and Yongshun Xiao for showing us the Gershgorin circle theorem and the continuous deformation technique used in Lemma 8.5.

References

[1] R. Adamczak, A. Litvak, A. Pajor, and N. Tomczak-Jaegermann. Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles. J. Amer. Math. Soc., 233:535–561, 2011.
[2] L. Albera, A. Ferréol, P. Comon, and P. Chevalier. Blind identification of overcomplete mixtures of sources (BIOME). Linear Algebra and its Applications, 391:3–30, 2004.
[3] A. Anandkumar, D. Foster, D. Hsu, S. Kakade, and Y.-K. Liu. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 25, pages 926–934, 2012.
[4] A. Anandkumar, R. Ge, D. Hsu, S. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. 2012.
[5] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. CoRR, abs/1210.7559, 2012.
[6] A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In Proc. of COLT, 2012.
[7] J. Anderson, M. Belkin, N. Goyal, L. Rademacher, and J. Voss. The more, the merrier: the blessing of dimensionality for learning large Gaussian mixtures. arXiv:1311.2891, 2013.
[8] J. Anderson, N. Goyal, and L. Rademacher. Efficient learning of simplices. COLT, 2013.
[9] S. Arora, R. Ge, A. Moitra, and S. Sachdeva. Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders. In NIPS, pages 2384–2392, 2012.
[10] R. I. Arriaga and S. Vempala. An algorithmic theory of learning: Robust concepts and random projection. Machine Learning, 63(2):161–182, 2006.
[11] F. L. Bauer and C. Fike. Norms and exclusion theorems. Numerische Mathematik, 2(1):137–141, 1960.
[12] M. Belkin, L. Rademacher, and J. Voss. Blind signal separation in the presence of Gaussian noise. In Proc. of COLT, 2013.
[13] M. Belkin and K. Sinha. Polynomial learning of distribution families. In FOCS, pages 103–112, 2010.
[14] M. Belkin and K. Sinha. Toward learning Gaussian mixtures with arbitrary separation.
In COLT, pages 407–419, 2010.
[15] A. Bhaskara, M. Charikar, A. Moitra, and A. Vijayaraghavan. Smoothed analysis of tensor decompositions. CoRR, abs/1311.3651, 2013.
[16] R. Bhatia. Matrix Analysis, volume 169. Springer, 1997.
[17] P. Borwein and T. Erdélyi. Polynomials and Polynomial Inequalities. Springer, 1995.
[18] S. C. Brubaker and S. Vempala. Random tensors and planted cliques. In RANDOM, pages 406–419, 2009.
[19] S. C. Brubaker and S. S. Vempala. Isotropic PCA and affine-invariant clustering. In Building Bridges, pages 241–281. Springer, 2008.
[20] A. Carbery and J. Wright. Distributional and L^q norm inequalities for polynomials over convex bodies in R^n. Mathematical Research Letters, 8:233–248, 2001.
[21] J. Cardoso. Source separation using higher order moments. In International Conference on Acoustics, Speech, and Signal Processing, 1989.
[22] J. Cardoso. Super-symmetric decomposition of the fourth-order cumulant tensor. Blind identification of more sources than sensors. In Acoustics, Speech, and Signal Processing, 1991. ICASSP-91., 1991 International Conference on, pages 3109–3112. IEEE, 1991.
[23] J. D. Carroll and J.-J. Chang. Analysis of individual differences in multidimensional scaling via an n-way generalization of Eckart–Young decomposition. Psychometrika, 35(3):283–319, 1970.
[24] J. T. Chang. Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Mathematical Biosciences, 137(1):51–73, 1996.
[25] K. Chaudhuri and S. Rao. Learning mixtures of product distributions using correlations and independence. In Proc. of COLT, 2008.
[26] P. Comon and C. Jutten, editors. Handbook of Blind Source Separation. Academic Press, 2010.
[27] P. Comon and M. Rajih. Blind identification of under-determined mixtures based on the characteristic function.
Signal Processing, 86(9):2271–2281, 2006.
[28] S. Dasgupta. Learning mixtures of Gaussians. In Foundations of Computer Science, 1999. 40th Annual Symposium on, pages 634–644. IEEE, 1999.
[29] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60–65, 2003.
[30] S. Dasgupta and L. Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. The Journal of Machine Learning Research, 8:203–226, 2007.
[31] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
[32] L. De Lathauwer, J. Castaing, and J. Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. Signal Processing, IEEE Transactions on, 55(6):2965–2973, 2007.
[33] L. De Lathauwer, B. De Moor, and J. Vandewalle. On the best rank-1 and rank-(R_1, R_2, …, R_n) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4):1324–1342, 2000.
[34] S. C. Eisenstat and I. C. Ipsen. Relative perturbation results for eigenvalues and eigenvectors of diagonalisable matrices. BIT Numerical Mathematics, 38(3):502–509, 1998.
[35] A. M. Frieze, M. Jerrum, and R. Kannan. Learning linear transformations. In FOCS, pages 359–368, 1996.
[36] G. H. Golub and C. F. V. Loan. Matrix Computations. The Johns Hopkins University Press, 2013.
[37] R. A. Harshman. Foundations of the PARAFAC procedure: models and conditions for an "explanatory" multimodal factor analysis. 1970.
[38] C. Hillar and L.-H. Lim. Most tensor problems are NP-hard. Journal of the ACM, 60, 2013.
[39] D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In ITCS, pages 11–20, 2013.
[40] D. Hsu, S. M. Kakade, and T. Zhang.
A spectral algorithm for learning hidden Markov models. 2009.
[41] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, 2001.
[42] A. T. Kalai, A. Moitra, and G. Valiant. Efficiently learning mixtures of two Gaussians. In Proceedings of the 42nd ACM Symposium on Theory of Computing, pages 553–562. ACM, 2010.
[43] R. Kannan and S. Vempala. Spectral Algorithms. Now Publishers Inc, 2009.
[44] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51:455–500, 2009.
[45] J. B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95–138, 1977.
[46] S. Lang. Complex Analysis, volume 103. Springer, 1998.
[47] S. Lang. Undergraduate Algebra. Springer Verlag, 2005.
[48] S. E. Leurgans, R. T. Ross, and R. B. Abel. A decomposition for 3-way arrays. SIAM Journal on Matrix Analysis and Applications, 14:1064–1083, 1993.
[49] L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Struct. Algorithms, 30(3):307–358, 2007.
[50] J. Marcinkiewicz. Sur une propriété de la loi de Gauss. Mathematische Zeitschrift, 44(1):612–618, 1939.
[51] A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 93–102. IEEE, 2010.
[52] E. Mossel, R. O'Donnell, and K. Oleszkiewicz. Noise stability of functions with low influences: Invariance and optimality. Annals of Math., 171:295–341, 2010.
[53] E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. In STOC, pages 366–375, 2005.
[54] P. Q. Nguyen and O. Regev. Learning a parallelepiped: Cryptanalysis of GGH and NTRU signatures. J.
Cryptology, 22(2):139–160, 2009.
[55] K. Pearson. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
[56] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.
[57] M. Rudelson. Random vectors in the isotropic position. J. of Functional Analysis, 164:60–72, 1999.
[58] A. Sanjeev and R. Kannan. Learning mixtures of arbitrary Gaussians. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pages 247–257. ACM, 2001.
[59] A. Smilde, R. Bro, and P. Geladi. Multi-way Analysis: Applications in the Chemical Sciences. Wiley, 2005.
[60] N. Srivastava and R. Vershynin. Covariance estimates for distributions with 2 + ε moments. Annals of Probability, to appear (arXiv:1106.2775), 2011.
[61] J. M. Steele. The Cauchy–Schwarz Master Class. Cambridge University Press, 2004.
[62] G. W. Stewart and J.-g. Sun. Matrix Perturbation Theory. 1990.
[63] R. C. Thompson. Principal submatrices IX: Interlacing inequalities for singular values of submatrices. Linear Algebra and its Applications, 5:1–12, 1972.
[64] S. Vempala and G. Wang. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841–860, 2004.
[65] S. S. Vempala and Y. Xiao. Structure from local optima: Learning subspace juntas via higher order PCA. CoRR, abs/1108.3329, 2011.
[66] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, Theory and Applications, pages 210–268. Cambridge University Press, Oxford.
[67] R. Vershynin.
How close is the sample covariance to the actual covariance matrix? Journal of Theoretical Probability, to appear (arXiv:1004.3484), 2010.
[68] J. H. Wilkinson. The Algebraic Eigenvalue Problem, volume 155. Oxford Univ Press, 1965.
[69] A. Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897–902, 2000.