New Algorithms for Learning Incoherent and Overcomplete Dictionaries



Sanjeev Arora∗   Rong Ge†   Ankur Moitra‡

May 27, 2014

Abstract

In sparse recovery we are given a matrix $A \in \mathbb{R}^{n \times m}$ ("the dictionary") and a vector of the form $AX$ where $X$ is sparse, and the goal is to recover $X$. This is a central notion in signal processing, statistics and machine learning. But in applications such as sparse coding, edge detection, compression and super-resolution, the dictionary $A$ is unknown and has to be learned from random examples of the form $Y = AX$ where $X$ is drawn from an appropriate distribution; this is the dictionary learning problem. In most settings, $A$ is overcomplete: it has more columns than rows. This paper presents a polynomial-time algorithm for learning overcomplete dictionaries; the only previously known algorithm with provable guarantees is the recent work of [48], which gave an algorithm for the undercomplete case, a setting that rarely arises in applications. Our algorithm applies to incoherent dictionaries, which have been a central object of study since they were introduced in seminal work of [18]. In particular, a dictionary is $\mu$-incoherent if each pair of columns has inner product at most $\mu/\sqrt{n}$. The algorithm makes natural stochastic assumptions about the unknown sparse vector $X$, which can contain $k \le c \min(\frac{\sqrt{n}}{\mu \log n}, m^{1/2-\eta})$ non-zero entries (for any $\eta > 0$). This is close to the best $k$ allowed by the best sparse recovery algorithms even if one knows the dictionary $A$ exactly. Moreover, both the running time and sample complexity depend on $\log 1/\epsilon$, where $\epsilon$ is the target accuracy, and so our algorithms converge very quickly to the true dictionary. Our algorithm can also tolerate substantial amounts of noise, provided it is incoherent with respect to the dictionary (e.g., Gaussian).
In the noisy setting, our running time and sample complexity depend polynomially on $1/\epsilon$, and this is necessary.

∗ arora@cs.princeton.edu, Princeton University, Computer Science Department and Center for Computational Intractability.
† rongge@microsoft.com, Microsoft Research, New England. Part of this work was done while the author was a graduate student at Princeton University and was supported in part by NSF grants CCF-0832797, CCF-1117309, CCF-1302518, DMS-1317308, and a Simons Investigator award.
‡ moitra@mit.edu, Massachusetts Institute of Technology, Department of Mathematics and CSAIL. Part of this work was done while the author was a postdoc at the Institute for Advanced Study and was supported in part by NSF grant No. DMS-0835373 and by an NSF Computing and Innovation Fellowship.

1 Introduction

Finding sparse representations for data (signals, images, natural language) is a major focus of computational harmonic analysis [20, 41]. This requires having the right dictionary $A \in \mathbb{R}^{n \times m}$ for the dataset, which allows each data point to be written as a sparse linear combination of the columns of $A$. For images, popular choices for the dictionary include sinusoids, wavelets, ridgelets, curvelets, etc. [41], and each one is useful for different types of features: wavelets for impulsive events, ridgelets for discontinuities in edges, curvelets for smooth curves, etc. It is common to combine such hand-designed bases into a single dictionary, which is "redundant" or "overcomplete" because $m \gg n$. This can allow sparse representation even if an image contains many different "types" of features jumbled together. In machine learning, dictionaries are also used for feature selection [45] and for building classifiers on top of sparse coding primitives [35].
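As a quick illustration of the generative model described above (not part of the original paper; the sizes below are hypothetical), a data point $Y = AX$ is a dense $n$-dimensional vector that nevertheless combines only $k$ columns of an overcomplete dictionary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 64, 256, 5              # overcomplete: m >> n (illustrative sizes only)
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)    # dictionary with unit-norm columns

# A k-sparse coefficient vector: k random coordinates, values in {-1, +1}.
X = np.zeros(m)
support = rng.choice(m, size=k, replace=False)
X[support] = rng.choice([-1.0, 1.0], size=k)

Y = A @ X                         # the observed data point
# Y is dense, but it is a combination of only the k columns indexed by `support`.
assert np.allclose(Y, A[:, support] @ X[support])
```

Dictionary learning asks for the reverse: recover $A$ (and the sparse $X$'s) from many such samples $Y$.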
In many settings hand-designed dictionaries do not do as well as dictionaries that are fit to the dataset using automated methods. In image processing such discovered dictionaries are used to perform denoising [21], edge detection [40], super-resolution [52] and compression. The problem of discovering the best dictionary for a dataset is called dictionary learning, and is also referred to as sparse coding in machine learning. Dictionary learning is also a basic building block in the design of deep learning systems [46]. See [3, 20] for further applications. In fact, the dictionary learning problem was identified by [44] as part of a study on internal image representations in the visual cortex. Their work suggested that basis vectors in learned dictionaries often correspond to well-known image filters such as Gabor filters.

Our goal is to design an algorithm for this problem with provable guarantees in the same spirit as recent work on nonnegative matrix factorization [7], topic models [8, 6] and mixture models [43, 12]. (We will later discuss why current algorithms in [39], [22], [4], [36], [38] do not come with such guarantees.) Designing such algorithms for dictionary learning has proved challenging. Even if the dictionary is completely known, it can be NP-hard to represent a vector $u$ as a sparse linear combination of the columns of $A$ [16]. However, for many natural types of dictionaries the problem of finding a sparse representation is computationally easy. The pioneering work of [18], [17] and [29] (building on the uncertainty principle of [19]) presented a number of important examples (in fact, the ones we used above) of dictionaries that are incoherent, and showed that $\ell_1$-minimization can find a sparse representation in a known, incoherent dictionary if one exists.

Definition 1.1 ($\mu$-incoherent).
An $n \times m$ matrix $A$ whose columns are unit vectors is $\mu$-incoherent if for all $i \neq j$ we have $|\langle A_i, A_j \rangle| \le \mu/\sqrt{n}$. We will refer to $A$ as incoherent if $\mu$ is $O(\log n)$.

A randomly chosen dictionary is incoherent with high probability (even if $m = n^{100}$). [18] gave many other important examples of incoherent dictionaries, such as one constructed from spikes and sines, as well as those built up from wavelets and sines, or even wavelets and ridgelets. There is a rich body of literature devoted to incoherent dictionaries (see additional references in [25]). [18] proved that given $u = Av$ where $v$ has $k$ nonzero entries with $k \le \sqrt{n}/2\mu$, basis pursuit (solvable by a linear program) recovers $v$ exactly, and $v$ is unique. [25] (and subsequently [50]) gave algorithms for recovering $v$ even in the presence of additive noise. [49] gave a more general exact recovery condition (ERC) under which the sparse recovery problem for incoherent dictionaries can be algorithmically solved. All of these require $n > k^2 \mu^2$. In a foundational work, [13] showed that basis pursuit solves the sparse recovery problem even for $n = O(k \log(m/k))$ if $A$ satisfies the weaker restricted isometry property [14]. Also, if $A$ is a full-rank square matrix, then we can compute $v$ from $A^{-1}u$, trivially. But our focus here will be on incoherent and overcomplete dictionaries; extending these results to RIP matrices is left as a major open problem.

The main result in this paper is an algorithm that provably learns an unknown, incoherent dictionary from random samples $Y = AX$, where $X$ is a vector with at most $k \le c \min(\frac{\sqrt{n}}{\mu \log n}, m^{1/2 - \eta})$ non-zero entries (for any $\eta > 0$ and a small enough constant $c > 0$ depending on $\eta$). Hence we can allow almost as many non-zeros in the hidden vector $X$ as the best sparse recovery algorithms, which assume that the dictionary $A$ is known.
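The claim above that a random dictionary is incoherent is easy to check numerically. The following sketch (our own illustration; the `coherence` helper and sizes are not from the paper) computes the smallest $\mu$ for which Definition 1.1 holds:

```python
import numpy as np

def coherence(A):
    """Return mu such that max_{i != j} |<A_i, A_j>| = mu / sqrt(n)
    for a dictionary A with unit-norm columns (Definition 1.1)."""
    n, _ = A.shape
    G = A.T @ A                    # Gram matrix of the columns
    np.fill_diagonal(G, 0.0)       # ignore the <A_i, A_i> = 1 diagonal
    return np.sqrt(n) * np.abs(G).max()

rng = np.random.default_rng(0)
n, m = 100, 400                    # overcomplete: m >> n (illustrative sizes)
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)     # normalize columns to unit length

mu = coherence(A)
print(mu)   # modest even though m >> n: random dictionaries are incoherent w.h.p.
```

An orthonormal dictionary has $\mu = 0$; for the random overcomplete dictionary above, $\mu$ stays on the order of $\sqrt{\log m}$ rather than the trivial bound $\sqrt{n}$.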
The precise requirements that we place on the distributional model are described in Section 1.2. We can relax some of these conditions at the cost of increased running time or requiring $X$ to be more sparse. Finally, our algorithm can tolerate a substantial amount of additive noise, an important consideration in most applications including sparse coding, provided it is independent and uncorrelated with the dictionary.

1.1 Related Work

Algorithms used in practice. Dictionary learning is solved in practice by variants of alternating minimization. [39] gave the first approach; subsequent popular approaches include the method of optimal directions (MOD) of [22] and K-SVD of [4]. The general idea is to maintain a guess for $A$ and $X$ and at every step either update $X$ (using basis pursuit) or update $A$ by, say, solving a least-squares problem. Provable guarantees for such algorithms have proved difficult because the initial guesses may be very far from the true dictionary, causing basis pursuit to behave erratically. Also, the algorithms could converge to a dictionary that is not incoherent, and thus unusable for sparse recovery. (In practice, these heuristics do often work.)

Algorithms with guarantees. An elegant paper of [48] shows how to provably recover $A$ exactly if it has full column rank and $X$ has at most $\sqrt{n}$ nonzeros. However, requiring $A$ to have full column rank precludes most interesting applications, where the dictionary is redundant and hence cannot have full column rank (see [18, 20, 41]). Moreover, the algorithm in [48] is not noise tolerant. After the initial announcement of this work, [2, 1] independently gave provable algorithms for learning overcomplete and incoherent dictionaries. Their first paper [2] requires the entries in $X$ to be independent random $\pm 1$ variables.
Their second [1] gives an algorithm (a version of alternating minimization) that converges to the correct dictionary given a good initial dictionary; such a good initialization can only be found using [2] in special cases, or more generally using this paper. Unlike our algorithms, theirs assume the sparsity of $X$ is at most $n^{1/4}$ or $n^{1/6}$ (assumption A4 in both papers), which is far from the $n^{1/2}$ limit of incoherent dictionaries. The main change from the initial version of our paper is that we have improved the dependence of our algorithms from $\mathrm{poly}(1/\epsilon)$ to $\log 1/\epsilon$ (see Section 5).

After this work, [11] gave a quasi-polynomial time algorithm for dictionary learning using the sum-of-squares SDP hierarchy. Their algorithm can output an approximate dictionary, under weaker assumptions, even when the sparsity is almost linear in the dimension.

Independent Component Analysis. When the entries of $X$ are independent, algorithms for independent component analysis (ICA) [15] can recover $A$. [23] gave a provable algorithm that recovers $A$ up to arbitrary accuracy, provided the entries in $X$ are non-Gaussian (when $X$ is Gaussian, $A$ is only determined up to rotations anyway). Subsequent works considered the overcomplete case and gave provable algorithms even when $A$ is $n \times m$ with $m > n$ [37, 28]. However, these algorithms are incomparable to ours, since they rely on different assumptions (independence vs. sparsity). With the sparsity assumption, we can make much weaker assumptions on how $X$ is generated. In particular, all these algorithms require the support $\Omega$ of the vector $X$ to be at least 3-wise independent ($\Pr[u, v, w \in \Omega] = \Pr[u \in \Omega] \Pr[v \in \Omega] \Pr[w \in \Omega]$) in the undercomplete case, and 4-wise independent in the overcomplete case.
Our algorithm only requires the support $\Omega$ to have bounded moments ($\Pr[u, v, w \in \Omega] \le \Lambda \Pr[u \in \Omega] \Pr[v \in \Omega] \Pr[w \in \Omega]$, where $\Lambda$ is a large constant or even a polynomial depending on $m, n, k$; see Definition 1.5). Also, because our algorithm relies on the sparsity constraint, we are able to get almost exact recovery in the noiseless case (see Theorem 1.4 and Section 5). This kind of guarantee is impossible for ICA without a sparsity assumption.

1.2 Our Results

A range of results are possible which trade off more assumptions against better performance. We give two illustrative ones: the first makes the most assumptions but has the best performance; the second has the weakest assumptions and somewhat worse performance. The theorem statements will be cleaner if we use asymptotic notation: the parameters $k, n, m$ will go to infinity, and the constants denoted as "$O(1)$" are arbitrary so long as they do not grow with these parameters.

First we define the class of distributions that the $k$-sparse vectors must be drawn from. We will be interested in distributions on $k$-sparse vectors in $\mathbb{R}^m$ where each coordinate is nonzero with probability $\Theta(k/m)$ (the constant in $\Theta(\cdot)$ can differ among coordinates).

Definition 1.2 (Distribution class $\Gamma$ and its moments). The distribution is in class $\Gamma$ if (i) each nonzero $X_i$ has expectation 0 and lies in $[-C, -1] \cup [1, C]$, where $C = O(1)$; (ii) conditioned on any subset of coordinates in $X$ being nonzero, the values $X_i$ are independent of each other. The distribution has bounded $\ell$-wise moments if the probability that $X$ is nonzero in any subset $S$ of $\ell$ coordinates is at most $c^\ell$ times $\prod_{i \in S} \Pr[X_i \neq 0]$, where $c = O(1)$.

Remark: (i) The bounded moments condition trivially holds for any constant $\ell$ if the set of nonzero locations is a random subset of size $k$.
The values at these nonzero locations are allowed to be distributed very differently from one another. (ii) The requirement that nonzero $X_i$'s be bounded away from zero in magnitude is similar in spirit to the Spike-and-Slab Sparse Coding (S3C) model of [27], which also encourages nonzero latent variables to be bounded away from zero to avoid degeneracy issues that arise when some coefficients are much larger than others. (iii) In the rest of the paper we will focus on the case $C = 1$; all the proofs generalize directly to the case $C > 1$ by losing constant factors in the guarantees.

Because of symmetry in the problem, we can only hope to learn the dictionary $A$ up to permutation and sign flips. We say two dictionaries are column-wise $\epsilon$-close if, after an appropriate permutation and sign flip, the corresponding columns are within distance $\epsilon$.

Definition 1.3. Two dictionaries $A, B \in \mathbb{R}^{n \times m}$ are column-wise $\epsilon$-close if there exist a permutation $\pi$ and $\theta \in \{\pm 1\}^m$ such that $\|A_i - \theta_i B_{\pi(i)}\| \le \epsilon$ for all $i$.

Later, when we talk about two dictionaries that are $\epsilon$-close, we always assume the columns are ordered correctly so that $\|A_i - B_i\| \le \epsilon$.

Theorem 1.4. There is a polynomial time algorithm to learn a $\mu$-incoherent dictionary $A$ from random examples. With high probability the algorithm returns a dictionary $\hat{A}$ that is column-wise $\epsilon$-close to $A$, given random samples of the form $Y = AX$, where $X \in \mathbb{R}^m$ is chosen according to some distribution in $\Gamma$ and $A$ is in $\mathbb{R}^{n \times m}$:

• If $k \le c \min(m^{2/5}, \frac{\sqrt{n}}{\mu \log n})$ and the distribution has bounded 3-wise moments, where $c > 0$ is a universal constant, then the algorithm requires $p_1$ samples and runs in time $\tilde{O}(p_1^2 n)$.
• If $k \le c \min(m^{(\ell-1)/(2\ell-1)}, \frac{\sqrt{n}}{\mu \log n})$ and the distribution has bounded $\ell$-wise moments, where $c > 0$ is a constant depending only on $\ell$, then the algorithm requires $p_2$ samples and runs in time $\tilde{O}(p_2^2 n)$.

• Even if each sample is of the form $Y^{(i)} = AX^{(i)} + \eta_i$, where the $\eta_i$'s are independent spherical Gaussian noise with standard deviation $\sigma = o(\sqrt{n})$, the algorithms above still succeed provided the number of samples is at least $p_3$ and $p_4$ respectively.

In particular, $p_1 = \Omega((m^2/k^2) \log m + m k^2 \log m + m \log m \log 1/\epsilon)$ and $p_2 = \Omega((m/k)^{\ell-1} \log m + m k^2 \log m \log 1/\epsilon)$, and $p_3$ and $p_4$ are larger by a $\sigma^2/\epsilon^2$ factor.

Remark: The sparsity that our algorithm can tolerate (the minimum of $\frac{\sqrt{n}}{\mu \log n}$ and $m^{1/2 - \eta}$) approaches the sparsity that the best known algorithms require even if $A$ is known. Although the running time and sample complexity of the algorithm are relatively large polynomials, there are many ways to optimize the algorithm. See the discussion in Section 7.

Now we describe the other result, which requires fewer assumptions on how the samples are generated but requires more stringent bounds on the sparsity:

Definition 1.5 (Distribution class $\mathcal{D}$). A distribution is in class $\mathcal{D}$ if (i) the events $X_i \neq 0$ have weakly bounded second and third moments, in the sense that $\Pr[X_i \neq 0 \text{ and } X_j \neq 0] \le n^{\epsilon} \Pr[X_i \neq 0] \Pr[X_j \neq 0]$ and $\Pr[X_i, X_j, X_t \neq 0] \le o(n^{1/4}) \Pr[X_i \neq 0] \Pr[X_j \neq 0] \Pr[X_t \neq 0]$; (ii) each nonzero $X_i$ is in $[-C, -1] \cup [1, C]$, where $C = O(1)$.

The following theorem is proved similarly to Theorem 1.4, and is sketched in Appendix B.

Theorem 1.6. There is a polynomial time algorithm to learn a $\mu$-incoherent dictionary $A$ from random examples of the form $Y = AX$, where $X$ is chosen according to some distribution in $\mathcal{D}$.
If $k \le c \min(m^{1/4}, \frac{n^{1/4 - \epsilon/2}}{\sqrt{\mu}})$ and we are given $p \ge \Omega(\max(\frac{m^2}{k^2} \log m, \frac{m n^{3/2} \log m \log n}{k^2 \mu}))$ samples, then the algorithm succeeds with high probability, and the output dictionary is column-wise $\epsilon = O(k \sqrt{\mu}/n^{1/4 - \epsilon/2})$-close to the true dictionary. The algorithm runs in time $\tilde{O}(p^2 n + m^2 p)$. The algorithm is also noise-tolerant, as in Theorem 1.4.

1.3 Proof Outline

The key observation in the algorithm is that we can test whether two samples share the same dictionary element (see Section 2). Given this information, we can build a graph whose vertices are the samples and whose edges correspond to samples that share the same dictionary element. A large cluster in this graph corresponds to the set of all samples with $X_i \neq 0$. In Section 3 we give an algorithm for finding all the large clusters. Then we show how to recover the dictionary given the clusters in Section 4. This allows us to get a rough estimate of the dictionary matrix. Section 5 gives an algorithm for refining the solution in the noiseless case. The three main parts of the techniques are:

Overlapping Clustering: Heuristics such as MOD [22] or K-SVD [4] have a cyclic dependence: if we knew $A$, we could solve for $X$, and if we knew all of the $X$'s we could solve for $A$. Our main idea is to break this cycle by (without knowing $A$) finding all of the samples where $X_i \neq 0$. We can think of this as a cluster $C_i$. Although our strategy is to cluster a random graph, what is crucial is that we are looking for an overlapping clustering, since each sample $X$ belongs to $k$ clusters! Many of the algorithms which have been designed for finding overlapping clusterings (e.g. [9], [10]) have a poor dependence on the maximum number of clusters that a node can belong to.
Instead, we give a simple combinatorial algorithm based on triplet (or higher-order) tests that recovers the underlying overlapping clustering. In order to prove correctness of our combinatorial algorithm, we rely on tools from discrete geometry, namely the piercing number [42, 5].

Recovering the Dictionary: Next, we observe that there are a number of natural algorithms for recovering the dictionary once we know the clusters $C_i$. We can think of a random sample from $C_i$ as applying a filter to the samples we are given, keeping only those samples where $X_i \neq 0$. The claim is that this distribution will have a much larger variance along the direction $A_i$ than along other directions, and this allows us to recover the dictionary either using a certain averaging algorithm, or by computing the largest singular vector of the samples in $C_i$. In fact, this latter approach is similar to K-SVD [4], and hence our analysis yields insights into why these heuristics work.

Fast Convergence: The above approach yields provable algorithms for dictionary learning whose running time and sample complexity depend polynomially on $1/\epsilon$. However, once we have a suitably good approximation to the true dictionary, can we converge at a much faster rate? We analyze a simple alternating minimization algorithm, IterativeAverage, and derive a formula for its updates that lets us analyze it as a noisy version of the matrix power method (see Lemma 5.6). This analysis is inspired by recent work on analyzing alternating minimization for the matrix completion problem [34, 32], and we obtain algorithms whose running time and sample complexity depend on $\log 1/\epsilon$.
Hence we get algorithms that converge rapidly to the true dictionary while simultaneously being able to handle almost the same sparsity as in the sparse recovery problem where $A$ is known!

NOTATION: Throughout this paper, we will use $Y^{(i)}$ to denote the $i$-th sample and $X^{(i)}$ the vector that generated it, i.e. $Y^{(i)} = AX^{(i)}$. Let $\Omega^{(i)}$ denote the support of $X^{(i)}$. For a vector $X$, let $X_i$ be its $i$-th coordinate. For a matrix $A \in \mathbb{R}^{n \times m}$ (especially the dictionary matrix), we use $A_i$ to denote the $i$-th column (the $i$-th dictionary element). Also, for a set $S \subset \{1, 2, \dots, m\}$, we use $A_S$ to denote the submatrix of $A$ with columns in $S$. We will use $\|A\|_F$ to denote the Frobenius norm and $\|A\|$ the spectral norm. Moreover, we will use $\Gamma$ to denote the distribution on $k$-sparse vectors $X$ that is used to generate our samples, and $\Gamma_i$ will denote the restriction of this distribution to vectors $X$ with $X_i \neq 0$. When we are working with a graph $G$ we will use $\Gamma_G(u)$ to denote the set of neighbors of $u$ in $G$. Throughout the paper, "with high probability" means the probability is at least $1 - n^{-\Delta}$ for large enough $\Delta$.

2 The Connection Graph

In this part we show how to test whether two samples share the same dictionary element, i.e., whether the supports $\Omega^{(i)}$ and $\Omega^{(j)}$ intersect. The idea is that we can check the inner product of $Y^{(i)}$ and $Y^{(j)}$, which can be decomposed into a sum of inner products of dictionary elements:

$$\langle Y^{(i)}, Y^{(j)} \rangle = \sum_{p \in \Omega^{(i)},\, q \in \Omega^{(j)}} \langle A_p, A_q \rangle\, X^{(i)}_p X^{(j)}_q$$

If the supports are disjoint, then each of the terms above is small, since $\langle A_p, A_q \rangle \le \mu/\sqrt{n}$ by the incoherence assumption. To prove that the sum is indeed small, we will appeal to the classic Hanson-Wright inequality:

Theorem 2.1 (Hanson-Wright).
[31] Let $X$ be a vector of independent, sub-Gaussian random variables with mean zero and variance one. Let $M$ be a symmetric matrix. Then

$$\Pr[\,|X^T M X - \mathrm{tr}(M)| > t\,] \le 2 \exp\{-c \min(t^2/\|M\|_F^2,\ t/\|M\|)\}$$

This will allow us to determine whether $\Omega^{(i)}$ and $\Omega^{(j)}$ intersect, but with false negatives:

Lemma 2.2. Suppose $k\mu < \frac{\sqrt{n}}{C' \log n}$ for a large enough constant $C'$ (depending on $C$ in Definition 1.2). Then if $\Omega^{(i)}$ and $\Omega^{(j)}$ are disjoint, with high probability $|\langle Y^{(i)}, Y^{(j)} \rangle| < 1/2$.

Proof: Let $N$ be the $k \times k$ submatrix resulting from restricting $A^T A$ to the locations where $X^{(i)}$ and $X^{(j)}$ are non-zero. Set $M$ to be a $2k \times 2k$ matrix where the $k \times k$ submatrices in the top-left and bottom-right are zero, and the $k \times k$ submatrices in the bottom-left and top-right are $(1/2)N$ and $(1/2)N^T$ respectively. Here we think of the vector $X$ as being a length-$2k$ vector whose first $k$ entries are the non-zero entries of $X^{(i)}$ and whose last $k$ entries are the non-zero entries of $X^{(j)}$. By construction, we have that $\langle Y^{(i)}, Y^{(j)} \rangle = X^T M X$. We can now appeal to the Hanson-Wright inequality (above). Note that since $\Omega^{(i)}$ and $\Omega^{(j)}$ do not intersect, the entries of $M$ are each at most $\mu/\sqrt{n}$, and so the Frobenius norm of $M$ is at most $\mu k/\sqrt{2n}$. This is also an upper bound on the spectral norm of $M$. We can set $t = 1/2$, and for $k\mu < \sqrt{n}/C' \log n$ both terms in the minimum are $\Omega(\log n)$, and this implies the lemma. □

We will also make use of a weaker bound (whose conditions, however, allow us to make fewer distributional assumptions):

Lemma 2.3. If $k^2 \mu < \sqrt{n}/2$, then $|\langle Y^{(i)}, Y^{(j)} \rangle| > 1/2$ implies that $\Omega^{(i)}$ and $\Omega^{(j)}$ intersect.

Proof: Suppose $\Omega^{(i)}$ and $\Omega^{(j)}$ are disjoint. Then the following upper bound holds:

$$|\langle Y^{(i)}, Y^{(j)} \rangle| \le \sum_{p \neq q} |\langle A_p, A_q \rangle\, X^{(i)}_p X^{(j)}_q| \le k^2 \mu/\sqrt{n} < 1/2$$

and this implies the lemma.
□

This only works up to $k = O(n^{1/4}/\sqrt{\mu})$. In comparison, the stronger bound of Lemma 2.2 makes use of the randomness of the signs of $X$ and works up to $k = O(\sqrt{n}/\mu \log n)$. In our algorithm, we build the following graph:

Definition 2.4. Given $p$ samples $Y^{(1)}, Y^{(2)}, \dots, Y^{(p)}$, build a connection graph on $p$ nodes where $i$ and $j$ are connected by an edge if and only if $|\langle Y^{(i)}, Y^{(j)} \rangle| > 1/2$.

This graph will "miss" some edges, since a pair $X^{(i)}$ and $X^{(j)}$ with intersecting supports does not necessarily meet the above condition. But by Lemma 2.2, (with high probability) this graph will not have any false positives:

Corollary 2.5. With high probability, each edge $(i, j)$ present in the connection graph corresponds to a pair where $\Omega^{(i)}$ and $\Omega^{(j)}$ have non-empty intersection.

Consider a sample $Y^{(1)}$ for which there is an edge to both $Y^{(2)}$ and $Y^{(3)}$. This means that there is some coordinate $i$ in both $\Omega^{(1)}$ and $\Omega^{(2)}$, and some coordinate $i'$ in both $\Omega^{(1)}$ and $\Omega^{(3)}$. However, the challenge is that we do not immediately know whether $\Omega^{(1)}$, $\Omega^{(2)}$ and $\Omega^{(3)}$ have a common intersection.

3 Overlapping Clustering

Our goal in this section is to determine which samples $Y$ have $X_i \neq 0$ just from the connection graph. To do this, we will identify a combinatorial condition that allows us to decide whether or not a set of three samples $Y^{(1)}$, $Y^{(2)}$ and $Y^{(3)}$, with supports $\Omega^{(1)}$, $\Omega^{(2)}$ and $\Omega^{(3)}$ respectively, have a common intersection. From this condition, it is straightforward to give an algorithm that correctly groups together all of the samples $Y$ that have $X_i \neq 0$. To reduce the number of letters used, we will focus on the first three samples $Y^{(1)}$, $Y^{(2)}$ and $Y^{(3)}$, although all the claims and lemmas hold for all triples.
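The connection graph of Definition 2.4, which this section takes as input, can be exercised on synthetic data. The sketch below (our own illustration; the sizes are hypothetical and chosen generously so the regime of Lemma 2.2 plausibly holds) builds the graph from pairwise inner products and counts false positives in the spirit of Corollary 2.5:

```python
import numpy as np
from itertools import combinations

def connection_graph(Y, tau=0.5):
    """Definition 2.4: one node per sample (row of Y), with an edge (i, j)
    whenever |<Y^(i), Y^(j)>| exceeds the threshold tau = 1/2."""
    G = np.abs(Y @ Y.T)
    np.fill_diagonal(G, 0.0)           # no self-loops
    return G > tau

# Toy instance: random incoherent dictionary, k-sparse +/-1 coefficients.
rng = np.random.default_rng(1)
n, m, k, p = 2500, 600, 3, 50          # illustrative sizes only
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)         # unit-norm columns
supports = [rng.choice(m, size=k, replace=False) for _ in range(p)]
Y = np.zeros((p, n))
for i, S in enumerate(supports):
    Y[i] = A[:, S] @ rng.choice([-1.0, 1.0], size=k)   # Y^(i) = A X^(i)

E = connection_graph(Y)
# W.h.p. every edge joins two samples whose supports intersect.
false_positives = sum(1 for i, j in combinations(range(p), 2)
                      if E[i, j] and not set(supports[i]) & set(supports[j]))
print(false_positives)                 # w.h.p. zero (Corollary 2.5)
```

Edges can still be missed (false negatives), which is exactly why the triple test developed in this section counts common neighbors rather than relying on single edges.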
Suppose we are given two samples $Y^{(1)}$ and $Y^{(2)}$ with supports $\Omega^{(1)}$ and $\Omega^{(2)}$ where $\Omega^{(1)} \cap \Omega^{(2)} = \{i\}$. We will prove that this pair can be used to recover all the samples $Y$ for which $X_i \neq 0$. This will follow because we will show that the expected number of common neighbors of $Y^{(1)}$, $Y^{(2)}$ and $Y$ is large if $X_i \neq 0$, and small otherwise. So throughout this subsection let us consider a sample $Y = AX$ and let $\Omega$ be its support. We will need the following elementary claim.

Claim 3.1. Suppose $\Omega^{(1)} \cap \Omega^{(2)} \cap \Omega^{(3)} \neq \emptyset$. Then

$$\Pr_Y[\text{for all } j = 1, 2, 3,\ |\langle Y, Y^{(j)} \rangle| > 1/2] \ge \frac{k}{2m}$$

Proof: Using ideas similar to Lemma 2.2, we can show that if $|\Omega \cap \Omega^{(1)}| = 1$ (that is, the new sample has a unique intersection with $\Omega^{(1)}$), then $|\langle Y, Y^{(1)} \rangle| > 1/2$. Now let $i \in \Omega^{(1)} \cap \Omega^{(2)} \cap \Omega^{(3)}$, and let $E$ be the event that $\Omega \cap \Omega^{(1)} = \Omega \cap \Omega^{(2)} = \Omega \cap \Omega^{(3)} = \{i\}$. Clearly, when event $E$ happens, $|\langle Y, Y^{(j)} \rangle| > 1/2$ for all $j = 1, 2, 3$. The probability of $E$ is at least

$$\Pr[i \in \Omega]\, \Pr[(\Omega^{(1)} \cup \Omega^{(2)} \cup \Omega^{(3)} \setminus \{i\}) \cap \Omega = \emptyset \mid i \in \Omega] = \frac{k}{m} \cdot (1 - O(k/m) \cdot 3k) \ge \frac{k}{2m}.$$

Here we used the bounded second moment property for the conditional probability, and a union bound. □

This claim establishes a lower bound on the expected number of common neighbors of a triple, if they have a common intersection. Next we establish an upper bound if they do not have a common intersection. Suppose $\Omega^{(1)} \cap \Omega^{(2)} \cap \Omega^{(3)} = \emptyset$. In principle we should be concerned that $\Omega$ could still intersect each of $\Omega^{(1)}$, $\Omega^{(2)}$ and $\Omega^{(3)}$ in different locations. Let $a = |\Omega^{(1)} \cap \Omega^{(2)}|$, $b = |\Omega^{(1)} \cap \Omega^{(3)}|$ and $c = |\Omega^{(2)} \cap \Omega^{(3)}|$.

Lemma 3.2. Suppose that $\Omega^{(1)} \cap \Omega^{(2)} \cap \Omega^{(3)} = \emptyset$.
Then the probability that $\Omega$ intersects each of $\Omega^{(1)}$, $\Omega^{(2)}$ and $\Omega^{(3)}$ is at most

$$\frac{k^6}{m^3} + \frac{3 k^3 (a + b + c)}{m^2}$$

Proof: We can break the event whose probability we would like to bound into two (not necessarily disjoint) events: (1) $\Omega$ intersects each of $\Omega^{(1)}$, $\Omega^{(2)}$ and $\Omega^{(3)}$ disjointly (i.e. it contains a point $i \in \Omega^{(1)}$ but $i \notin \Omega^{(2)}, \Omega^{(3)}$, and similarly for the other sets); (2) $\Omega$ contains a point in the common intersection of two of the sets, and one point from the remaining set. Clearly, if $\Omega$ intersects each of $\Omega^{(1)}$, $\Omega^{(2)}$ and $\Omega^{(3)}$, then at least one of these two events must occur.

The probability of the first event is at most the probability that $\Omega$ contains at least one element from each of three disjoint sets of size at most $k$. The probability that $\Omega$ contains an element of just one such set is at most the expected intersection, which is $\frac{k^2}{m}$, and since the intersections of $\Omega$ with each of these sets are non-positively correlated (because the sets are disjoint), the probability of the first event can be bounded by $\frac{k^6}{m^3}$.

Similarly, for the second event: consider the probability that $\Omega$ contains an element in $\Omega^{(1)} \cap \Omega^{(2)}$. Since $\Omega^{(1)} \cap \Omega^{(2)} \cap \Omega^{(3)} = \emptyset$, $\Omega$ must also contain an element in $\Omega^{(3)}$. The expected intersection of $\Omega$ and $\Omega^{(1)} \cap \Omega^{(2)}$ is $\frac{ka}{m}$, and the expected intersection of $\Omega$ and $\Omega^{(3)}$ is $\frac{k^2}{m}$; again the expectations are non-positively correlated, since the two sets $\Omega^{(1)} \cap \Omega^{(2)}$ and $\Omega^{(3)}$ are disjoint by assumption. Repeating this argument for the other pairs completes the proof of the lemma. □

Note that if $\Gamma$ has bounded higher-order moments, the probability that two sets of size $k$ intersect in at least $Q$ elements is at most $(\frac{k^2}{m})^Q$.
Hence we can assume that with high probability there is no pair of samples whose supports intersect in more than a constant number of locations. (When $\Gamma$ only has bounded 3-wise moments, see Appendix A.)

Algorithm 1 OverlappingCluster. Input: $p$ samples $Y^{(1)}, Y^{(2)}, \dots, Y^{(p)}$
1. Compute a graph $G$ on $p$ nodes where there is an edge between $i$ and $j$ iff $|\langle Y^{(i)}, Y^{(j)} \rangle| > 1/2$.
2. Set $T = \frac{pk}{10m}$.
3. Repeat $\Omega(m \log^2 m)$ times:
4.   Choose a random edge $(u, v)$ in $G$.
5.   Set $S_{u,v} = \{w : |\Gamma_G(u) \cap \Gamma_G(v) \cap \Gamma_G(w)| \ge T\} \cup \{u, v\}$.
6. Delete any set $S_{u,v}$ where $u, v$ are contained in a strictly smaller set $S_{a,b}$ (also delete any duplicates).
7. Output the remaining sets $S_{u,v}$.

Let us quantitatively compare our lower and upper bounds: if $k \le c m^{2/5}$, then the expected number of common neighbors for a triple with $\Omega^{(1)} \cap \Omega^{(2)} \cap \Omega^{(3)} \neq \emptyset$ is much larger than the expected number of common neighbors of a triple whose common intersection is empty. Under this condition, if we take $p = O(m^2/k^2 \log n)$ samples, each triple with a common intersection will have at least $T$ common neighbors, and each triple whose common intersection is empty will have fewer than $T/2$ common neighbors. Hence we can search for a triple with a common intersection as follows: we find a pair of samples $Y^{(1)}$ and $Y^{(2)}$ whose supports intersect, take a neighbor $Y^{(3)}$ of $Y^{(1)}$ in the connection graph (at random), and by counting the number of common neighbors of $Y^{(1)}$, $Y^{(2)}$ and $Y^{(3)}$ decide whether or not their supports have a common intersection.

Definition 3.3. We will call a pair of samples $Y^{(1)}$ and $Y^{(2)}$ an identifying pair for coordinate $i$ if the intersection of $\Omega^{(1)}$ and $\Omega^{(2)}$ is exactly $\{i\}$.

Theorem 3.4. The output of OverlappingCluster is an overlapping clustering where each set corresponds to some $i$ and contains all $Y^{(j)}$ for which $i \in \Omega^{(j)}$.
The algorithm runs in time $\tilde O(p^2n)$ and succeeds with high probability if $k \leq c\min(m^{2/5}, \frac{\sqrt{n}}{\mu\log n})$ and if $p = \Omega(\frac{m^2\log m}{k^2})$.

Proof: We can use Lemma 2.2 to conclude that each edge in $G$ corresponds to a pair whose supports intersect. We can appeal to Lemma 3.2 and Claim 3.1 to conclude that for $p = \Omega(\frac{m^2}{k^2}\log m)$, with high probability each triple with a common intersection has at least $T$ common neighbors, and each triple without a common intersection has at most $T/2$ common neighbors. In fact, for a random edge $(Y^{(1)}, Y^{(2)})$, the probability that the common intersection of $\Omega^{(1)}$ and $\Omega^{(2)}$ is exactly $\{i\}$ is $\Omega(1/m)$: we know that they do intersect, that intersection has a constant probability of being size one, and it is uniformly distributed over $m$ possible locations. Appealing to a coupon collector argument, we conclude that if the inner loop is run at least $\Omega(m\log^2 m)$ times then the algorithm finds an identifying pair $(u,v)$ for each column $A_i$ with high probability. Note that we may have pairs that are not an identifying pair for any coordinate $i$. However, any other pair $(u,v)$ found by the algorithm must have a common intersection. Consider for example a pair $(u,v)$ where $u$ and $v$ have common intersection $\{i,j\}$. Then we know that there is some other pair $(a,b)$ which is an identifying pair for $i$, and hence $S_{a,b} \subset S_{u,v}$. (In fact this containment is strict, since $S_{u,v}$ will also contain a set corresponding to an identifying pair for $j$ too.) Hence the second-to-last step of the algorithm will necessarily delete all such non-identifying pairs $S_{u,v}$. What is the running time of this algorithm? We need $O(p^2n)$ time to build the connection graph, and the loop takes $\tilde O(pmn)$ time.
Finally, the deletion step requires time $\tilde O(m^2)$, since there will be $\tilde O(m)$ pairs found in the previous step and, for each pair of pairs, we can delete $S_{u,v}$ if and only if there is a strictly smaller $S_{a,b}$ that contains $u$ and $v$. This concludes the proof of correctness of the algorithm, and its running time analysis. $\square$

4 Recovering the Dictionary

4.1 Finding the Relative Signs

Here we show how to recover the column $A_i$ once we have learned which samples $Y$ have $X_i \neq 0$. We will refer to this set of samples as the "cluster" $C_i$. The key observation is that if $\Omega^{(1)}$ and $\Omega^{(2)}$ uniquely intersect in index $i$, then the sign of $\langle Y^{(1)}, Y^{(2)}\rangle$ is equal to the sign of $X^{(1)}_i X^{(2)}_i$. If there are enough such pairs $Y^{(1)}$ and $Y^{(2)}$, we can determine not only which samples $Y$ have $X_i \neq 0$ but also which pairs of samples $Y$ and $Y'$ have $X_i, X'_i \neq 0$ and $\mathrm{sign}(X_i) = \mathrm{sign}(X'_i)$. This is the main step of the algorithm OverlappingAverage.

Theorem 4.1. If the input to OverlappingAverage $C_1, \ldots, C_m$ are the true clusters $\{j : i \in \Omega^{(j)}\}$ up to permutation, then the algorithm outputs a dictionary $\hat A$ that is column-wise $\epsilon$-close to $A$ with high probability if $k \leq \min(\sqrt{m}, \frac{\sqrt{n}}{\mu})$ and if $p = \Omega(\max(\frac{m^2\log m}{k^2}, \frac{m\log m}{\epsilon^2}))$. Furthermore, the algorithm runs in time $O(p^2)$.

Intuitively, the algorithm works because the sets $C_i^{\pm}$ correctly identify samples with the same sign. This is summarized in the following lemma.

Lemma 4.2. In Algorithm 2, $C_i^{\pm}$ is either $\{u : X^{(u)}_i > 0\}$ or $\{u : X^{(u)}_i < 0\}$.

Proof: It suffices to prove the lemma at the start of Step 8, since this step only takes the complement of $C_i^{\pm}$ with respect to $C_i$. Appealing to Lemma 2.2, we conclude that if $\Omega^{(u)}$ and $\Omega^{(v)}$ uniquely intersect in coordinate $i$ then the sign of $\langle Y^{(u)}, Y^{(v)}\rangle$ is equal to the sign of $X^{(u)}_i X^{(v)}_i$.
Hence when Algorithm 2 adds an element to $C_i^{\pm}$ it must have the same sign as the $i$-th component of $X^{(u_i)}$. What remains is to prove that each node $v \in C_i$ is correctly labeled. We will do this by showing that for any such vertex there is a length-two path of labeled pairs that connects $u_i$ to $v$, and this is true because the number of labeled pairs is large. We need the following simple claim:

Claim 4.3. If $p > m^2\log m/k^2$ then with high probability any two clusters share at most $2pk^2/m^2$ nodes in common.

This follows since the probability that a node is contained in any fixed pair of clusters is at most $k^2/m^2$. Then for any node $u \in C_i$, we would like to lower bound the number of labeled pairs it has in $C_i$. Since $u$ is in at most $k-1$ other clusters $C_{i_1}, \ldots, C_{i_{k-1}}$, the number of pairs $u,v$ with $v \in C_i$ that are not labeled for $C_i$ is at most
$$\sum_{t=1}^{k-1} |C_{i_t} \cap C_i| \leq k \cdot \frac{2pk^2}{m^2} \ll \frac{pk}{3m} = |C_i|/3$$
Therefore, for a fixed node $u$, for at least a $2/3$ fraction of the other nodes $w \in C_i$ the pair $u,w$ is labeled. Hence we conclude that for each pair of nodes $u_i, v \in C_i$, the number of $w$ for which both $(u_i, w)$ and $(w,v)$ are labeled is at least $|C_i|/3 > 0$, and so for every $v$ there is a labeled path of length two connecting $u_i$ to $v$. $\square$

Using this lemma, we are ready to prove that Algorithm 2 correctly learns all columns of $A$.

Algorithm 2 OverlappingAverage, Input: $p$ samples $Y^{(1)}, Y^{(2)}, \ldots, Y^{(p)}$ and overlapping clusters $C_1, C_2, \ldots, C_m$
1. For each $C_i$
2. For each pair $(u,v) \in C_i$ that does not appear in any other $C_j$ ($X^{(u)}$ and $X^{(v)}$ have a unique intersection)
3. Label the pair $+1$ if $\langle Y^{(u)}, Y^{(v)}\rangle > 0$ and otherwise label it $-1$.
4. Choose an arbitrary $u_i \in C_i$, and set $C_i^{\pm} = \{u_i\}$
5. For each $v \in C_i$
6. If the pair $(u_i, v)$ is labeled $+1$ add $v$ to $C_i^{\pm}$
7.
Else if there is $w \in C_i$ where the pairs $(u_i, w)$ and $(v,w)$ have the same label, add $v$ to $C_i^{\pm}$.
8. If $|C_i^{\pm}| \leq |C_i|/2$ set $C_i^{\pm} \leftarrow C_i \setminus C_i^{\pm}$.
9. Let $\hat A_i = \sum_{v \in C_i^{\pm}} Y^{(v)} / \|\sum_{v \in C_i^{\pm}} Y^{(v)}\|$
10. Output $\hat A$, where each column is $\hat A_i$ for some $i$

Proof: We can invoke Lemma 4.2 and conclude that $C_i^{\pm}$ is either $\{u : X^{(u)}_i > 0\}$ or $\{u : X^{(u)}_i < 0\}$, whichever set is larger. Let us suppose that it is the former. Then each $Y^{(u)}$ in $C_i^{\pm}$ is an independent sample from the distribution conditioned on $X_i > 0$, which we call $\Gamma_i^+$. We have that $E_{\Gamma_i^+}[AX] = cA_i$, where $c$ is a constant in $[1, C]$, because $E_{\Gamma_i^+}[X_j] = 0$ for all $j \neq i$. Let us compute the variance:
$$E_{\Gamma_i^+}\big[\|AX - E_{\Gamma_i^+}[AX]\|^2\big] \leq E_{\Gamma_i^+}[X_i^2] + \sum_{j\neq i} E_{\Gamma_i^+}[X_j^2] \leq C^2 + \sum_{j\neq i} \frac{C^2k}{m} \leq C^2(k+1)$$
Note that there are no cross-terms because the signs of each $X_j$ are independent. Furthermore, we can bound the norm of each vector $Y^{(u)}$ via incoherence. We conclude that if $|C_i^{\pm}| > C^2k\log m/\epsilon^2$, then with high probability $\|\hat A_i - A_i\| \leq \epsilon$, using the vector Bernstein inequality ([30], Theorem 12). This latter condition holds because we set $C_i^{\pm}$ to itself or its complement based on which one is larger. $\square$

4.2 An Approach via SVD

Here we give an alternative algorithm for recovering the dictionary, based instead on SVD. Intuitively, if we take all the samples whose support contains index $j$, then every such sample $Y^{(i)}$ has a component along direction $A_j$. Therefore direction $A_j$ should have the largest variance and can be found by SVD. The advantage is that methods like K-SVD which are quite popular in practice also rely on finding directions of maximum variance, so the analysis we provide here yields insights into why these approaches work.
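This variance intuition is easy to verify on synthetic data. The following sketch (our own toy setup with illustrative parameters, not an experiment from the paper) conditions on $X_j \neq 0$ and checks that the top eigenvector of the cluster's empirical covariance aligns with $A_j$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 64, 128, 5
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)        # unit-norm columns; random, hence incoherent w.h.p.

j = 0
samples = []
for _ in range(4000):
    supp = rng.choice(m, size=k, replace=False)
    if j not in supp:
        supp[0] = j                   # condition on X_j != 0 (the cluster for coordinate j)
    x = np.zeros(m)
    x[supp] = rng.choice([-1.0, 1.0], size=k)   # independent signs, so E[X_i] = 0
    samples.append(A @ x)
Y = np.array(samples)

Sigma = Y.T @ Y / len(Y)              # empirical covariance of the cluster
u = np.linalg.eigh(Sigma)[1][:, -1]   # top eigenvector (= top singular vector of Sigma)
print(abs(u @ A[:, j]))               # correlation with A_j; should be close to 1
assert abs(u @ A[:, j]) > 0.9
```

Every other direction only picks up variance of order $k/m$ per coordinate, so the top direction is dominated by $A_j$, as the analysis below quantifies.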
However, the crucial difference is that we rely on finding the correct overlapping clustering in the first step of our dictionary learning algorithms, whereas K-SVD and related approaches approximate it via their current guess for the dictionary.

Let us fix some notation: let $\Gamma_i$ be the distribution conditioned on $X_i \neq 0$. Then once we have found the overlapping clustering, each cluster is a set of random samples from $\Gamma_i$. Also let $\alpha = |\langle u, A_i\rangle|$.

Definition 4.4. Let $R_i^2 = 1 + \sum_{j\neq i} \langle A_i, A_j\rangle^2 E_{\Gamma_i}[X_j^2]$.

Note that $R_i^2$ is the projected variance of $\Gamma_i$ on the direction $u = A_i$. Our goal is to show that for any $u \neq A_i$ (i.e. $\alpha \neq 1$), the variance is strictly smaller.

Lemma 4.5. The projected variance of $\Gamma_i$ on $u$ is at most
$$\alpha^2 R_i^2 + \alpha\sqrt{1-\alpha^2}\,\frac{2\mu k}{\sqrt{n}} + (1-\alpha^2)\left(\frac{k}{m} + \frac{\mu k}{\sqrt{n}}\right)$$
Proof: Let $u_{\|}$ and $u_{\perp}$ be the components of $u$ in the direction of $A_i$ and perpendicular to $A_i$. Then we want to bound $E_{\Gamma_i}[\langle u, Y\rangle^2]$ where $Y$ is sampled from $\Gamma_i$. Since the signs of each $X_j$ are independent, we can write
$$E_{\Gamma_i}[\langle u, Y\rangle^2] = \sum_j E_{\Gamma_i}[\langle u, A_jX_j\rangle^2] = \sum_j E_{\Gamma_i}[\langle u_{\|} + u_{\perp}, A_jX_j\rangle^2]$$
Since $\alpha = \|u_{\|}\|$ we have:
$$E_{\Gamma_i}[\langle u, Y\rangle^2] = \alpha^2 R_i^2 + E_{\Gamma_i}\Big[\sum_{j\neq i}\big(2\langle u_{\|}, A_j\rangle\langle u_{\perp}, A_j\rangle + \langle u_{\perp}, A_j\rangle^2\big)X_j^2\Big]$$
Also $E_{\Gamma_i}[X_j^2] = (k-1)/(m-1)$. Let $v$ be the unit vector in the direction of $u_{\perp}$. We can write
$$E_{\Gamma_i}\Big[\sum_{j\neq i}\langle u_{\perp}, A_j\rangle^2X_j^2\Big] = (1-\alpha^2)\Big(\frac{k-1}{m-1}\Big)\,v^TA_{-i}A_{-i}^Tv$$
where $A_{-i}$ denotes the dictionary $A$ with the $i$-th column removed. The maximum over $v$ of $v^TA_{-i}A_{-i}^Tv$ is just the largest singular value of $A_{-i}A_{-i}^T$, which is the same as the largest singular value of $A_{-i}^TA_{-i}$, which by the Gershgorin disk theorem (see e.g. [33]) is at most $1 + \frac{\mu m}{\sqrt{n}}$.
And hence we can bound
$$E_{\Gamma_i}\Big[\sum_{j\neq i}\langle u_{\perp}, A_j\rangle^2X_j^2\Big] \leq (1-\alpha^2)\Big(\frac{k}{m} + \frac{\mu k}{\sqrt{n}}\Big)$$
Also, since $|\langle u_{\|}, A_j\rangle| = \alpha|\langle A_i, A_j\rangle| \leq \alpha\mu/\sqrt{n}$, we obtain:
$$E\Big[\sum_{j\neq i} 2\langle u_{\|}, A_j\rangle\langle u_{\perp}, A_j\rangle X_j^2\Big] \leq \alpha\sqrt{1-\alpha^2}\,\frac{2\mu k}{\sqrt{n}}$$
and this concludes the proof of the lemma. $\square$

Definition 4.6. Let $\zeta = \max\{\frac{\mu k}{\sqrt{n}}, \sqrt{\frac{k}{m}}\}$, so the expression in Lemma 4.5 can be upper bounded by $\alpha^2R_i^2 + 2\alpha\sqrt{1-\alpha^2}\cdot\zeta + (1-\alpha^2)\zeta^2$.

We will show that an approach based on SVD recovers the true dictionary up to additive accuracy $\pm\zeta$. Note that here $\zeta$ is a parameter that converges to zero as the size of the problem increases, but it is not a function of the number of samples. So unlike the algorithm in the previous subsection, we cannot make the error in our algorithm arbitrarily small by increasing the number of samples, but this algorithm has the advantage that it succeeds even when $E[X_i] \neq 0$.

Corollary 4.7. The maximum singular value of the covariance matrix $\Sigma_i$ of $\Gamma_i$ is at least $R_i^2$, and the corresponding direction $u$ satisfies $\|u - A_i\| \leq O(\zeta)$. Furthermore, the second largest singular value is bounded by $O(R_i^2\zeta^2)$.

Algorithm 3 OverlappingSVD, Input: $p$ samples $Y^{(1)}, Y^{(2)}, \ldots, Y^{(p)}$
1. Run OverlappingCluster (or OverlappingCluster2) on the $p$ samples
2. Let $C_1, C_2, \ldots, C_m$ be the $m$ returned overlapping clusters
3. Compute $\hat\Sigma_i = \frac{1}{|C_i|}\sum_{Y \in C_i} YY^T$
4. Compute the first singular vector $\hat A_i$ of $\hat\Sigma_i$
5. Output $\hat A$, where each column is $\hat A_i$ for some $i$

Proof: The bound in Lemma 4.5 is only an upper bound; however, the direction $\alpha = 1$ has variance $R_i^2 > 1$, and hence the direction of maximum variance must correspond to $\alpha \in [1 - O(\zeta^2), 1]$.
Then we can appeal to the variational characterization of singular values (see [33]):
$$\sigma_2(\Sigma_i) = \max_{u \perp A_i} \frac{u^T\Sigma_iu}{u^Tu}$$
Then the condition that $\alpha \in [-O(\zeta), O(\zeta)]$ for the second singular value implies the second part of the corollary. $\square$

Since we have a lower bound on the separation between the first and second singular values of $\Sigma_i$, we can apply Wedin's theorem and show that we can recover $A_i$ approximately even in the presence of noise.

Theorem 4.8 (Wedin [51]). Let $\delta = \sigma_1(M) - \sigma_2(M)$, let $M' = M + E$, and furthermore let $v_1$ and $v'_1$ be the first singular vectors of $M$ and $M'$ respectively. Then
$$\sin\Theta(v_1, v'_1) \leq \frac{C\|E\|}{\delta}$$
Hence, even if we do not have access to $\Sigma_i$ but rather to an approximation $\hat\Sigma_i$ (e.g. an empirical covariance matrix computed from our samples), we can use the above perturbation bound to show that we can still recover a direction that is close to $A_i$, and that in fact converges to $A_i$ as we take more and more samples.

Theorem 4.9. If the input to OverlappingSVD is the correct clustering, then the algorithm outputs a dictionary $\hat A$ such that for each $i$, $\|A_i - \hat A_i\| \leq \zeta$ with high probability, if $k \leq c\min(\sqrt{m}, \frac{\sqrt{n}}{\mu\log n})$ and if $p \geq \max(\frac{m^2\log m}{k^2}, \frac{mn\log m\log n}{\zeta^2})$.

Proof: Appealing to Theorem 3.4, we have that with high probability the call to OverlappingCluster returns the correct overlapping clustering. Then, given $\frac{n\log n}{\zeta^2}$ samples from the distribution $\Gamma_i$, the classic result of Rudelson implies that the computed empirical covariance matrix $\hat\Sigma_i$ is close in spectral norm to the true covariance matrix [47]. This, combined with the separation of the first and second singular values established in Corollary 4.7 and Wedin's theorem (Theorem 4.8), implies that we recover each column of $A$ up to additive accuracy $\zeta$, and this implies the theorem.
Note that since we only need to compute the first singular vector, this can be done via power iteration [26], and hence the bottleneck in the running time is the call to OverlappingCluster. $\square$

Algorithm 4 IterativeAverage, Input: initial estimate $B$ with $\|B_i - A_i\| \leq \epsilon$ for all $i$, and $q$ samples (independent of $B$) $Y^{(1)}, Y^{(2)}, \ldots, Y^{(q)}$
1. For each sample $i$, let $\Omega^{(i)} = \{j : |\langle Y^{(i)}, B_j\rangle| > 1/2\}$
2. For each dictionary element $j$
3. Let $C_j^+$ be the set of samples that have inner product more than $1/2$ with $B_j$ ($C_j^+ = \{i : \langle Y^{(i)}, B_j\rangle > 1/2\}$)
4. For each sample $i$ in $C_j^+$
5. Let $\hat X^{(i)} = B_{\Omega^{(i)}}^+Y^{(i)}$
6. Let $Q_{i,j} = Y^{(i)} - \sum_{t \in \Omega^{(i)}\setminus\{j\}} B_t\hat X^{(i)}_t$
7. Let $B'_j = \sum_{i \in C_j^+} Q_{i,j} / \|\sum_{i \in C_j^+} Q_{i,j}\|$.
8. Output $B'$.

4.3 Noise Tolerance

Here we elaborate on why the algorithm can tolerate noise, provided that the noise is uncorrelated with the dictionary (e.g. Gaussian noise). The observation is that in constructing the connection graph we only make use of the inner products between pairs of samples $Y^{(1)}$ and $Y^{(2)}$, the value of which is roughly preserved under various noise models. In turn, the overlapping clustering is a purely combinatorial algorithm that only makes use of the connection graph. Finally, we recover the dictionary $A$ using singular value decomposition, which is well known to be stable under noise (e.g. Wedin's theorem, Theorem 4.8).

5 Refining the Solution

Earlier sections gave noise-tolerant algorithms for the dictionary learning problem with sample complexity $O(\mathrm{poly}(n,m,k)/\epsilon^2)$. This dependency on $\epsilon$ is necessary for any noise-tolerant algorithm, since even if the dictionary has only one vector we need $O(1/\epsilon^2)$ samples to estimate the vector in the presence of noise. However, when $Y$ is exactly equal to $AX$ we can hope to recover the dictionary with better running time and far fewer samples.
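A single round of IterativeAverage (Algorithm 4 above) can be sketched as follows. This is a simplified sketch on noiseless synthetic data; the dictionary, thresholds, and sample sizes are our own illustrative choices:

```python
import numpy as np

def iterative_average_round(B, Y):
    """One round of the IterativeAverage update (sketch).
    B: n x m current estimate with unit-norm columns; Y: n x q noiseless samples."""
    n, m = B.shape
    inner = B.T @ Y                            # <Y^(i), B_j> for all pairs
    B_new = B.copy()
    for j in range(m):
        plus = np.where(inner[j] > 0.5)[0]     # C_j^+ : samples with <Y^(i), B_j> > 1/2
        if len(plus) == 0:
            continue
        acc = np.zeros(n)
        for i in plus:
            supp = np.where(np.abs(inner[:, i]) > 0.5)[0]    # estimated support Omega^(i)
            x_hat = np.linalg.pinv(B[:, supp]) @ Y[:, i]     # X-hat = B_Omega^+ Y^(i)
            keep = supp != j
            acc += Y[:, i] - B[:, supp[keep]] @ x_hat[keep]  # Q_{i,j}
        B_new[:, j] = acc / np.linalg.norm(acc)
    return B_new

# toy usage: perturb a random incoherent dictionary and refine it by one round
rng = np.random.default_rng(1)
n, m, k, q = 256, 64, 3, 8000
A = rng.standard_normal((n, m)); A /= np.linalg.norm(A, axis=0)
X = np.zeros((m, q))
for i in range(q):
    supp = rng.choice(m, size=k, replace=False)
    X[supp, i] = rng.choice([-1.0, 1.0], size=k)
Y = A @ X
P = rng.standard_normal((n, m)); P /= np.linalg.norm(P, axis=0)
B = A + 0.05 * P; B /= np.linalg.norm(B, axis=0)
err_before = np.linalg.norm(B - A, axis=0).mean()
err_after = np.linalg.norm(iterative_average_round(B, Y) - A, axis=0).mean()
print(err_before, err_after)   # the column-wise error should shrink
```

Per Theorem 5.1 below, repeating the round drives the error down geometrically; the sketch does a single round only.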
In particular, [24] recently established that $\ell_1$-minimization is locally correct for incoherent dictionaries, so it seems plausible that, given a very good estimate for $A$, there is some algorithm that computes a refined estimate of $A$ whose running time and sample complexity have a better dependence on $\epsilon$. In this section we analyze the local convergence of an algorithm that is similar to K-SVD [4]; see Algorithm 4, IterativeAverage. Recall that $B_S$ denotes the submatrix of $B$ whose columns are indexed by $S$; also, $P^+ = (P^TP)^{-1}P^T$ is the left pseudo-inverse of the matrix $P$. Hence $P^+P = I$, and $PP^+$ is the projection matrix onto the span of the columns of $P$.

The key lemma of this section shows that the error decreases by a constant factor in each round of IterativeAverage (provided that it was suitably small to begin with). Let $\epsilon_0 \leq \frac{1}{100k}$.

Theorem 5.1. Suppose the dictionary $A$ is $\mu$-incoherent with $\mu/\sqrt{n} < \frac{1}{k\log k}$, and the initial solution is $\epsilon < \epsilon_0$ close to the true solution (i.e. for all $i$, $\|B_i - A_i\| \leq \epsilon$). With high probability the output of IterativeAverage is a dictionary $B'$ that satisfies $\|B'_i - A_i\| \leq (1-\delta)\epsilon$, where $\delta$ is a universal positive constant. Moreover, the algorithm runs in time $O(qnk^2)$ and succeeds with high probability when the number of samples is $q = \Omega(m\log^2 m)$.

We will analyze the update made to the first column $B_1$, and the same argument will work for all columns (hence we can apply a union bound to complete the proof). To simplify the proof, we will let $\xi$ denote arbitrarily small constants (whose precise value will change from line to line). First, we establish some basic claims that will be the basis for our analysis of IterativeAverage.

Claim 5.2. Suppose $A$ is a $\mu$-incoherent matrix with $\mu/\sqrt{n} < \frac{1}{k\log k}$. If for all $i$, $\|B_i - A_i\| \leq \epsilon_0$, then IterativeAverage recovers the correct support for each sample (i.e.
$\Omega^{(i)} = \mathrm{supp}(X^{(i)})$) and the correct sign (i.e. $C_j^+ = \{i : X^{(i)}_j > 0\}$).$^1$

Proof: We can compute $\langle Y^{(i)}, B_1\rangle = \sum_{j \in \Omega^{(i)}} X^{(i)}_j\langle A_j, B_1\rangle$, and the total contribution of all of the terms besides $X^{(i)}_1\langle A_1, B_1\rangle$ (i.e. for $j \neq 1$) is at most $1/3$. This implies the claim. $\square$

Claim 5.3. The set of columns $\{B_i\}_i$ is $\mu'$-incoherent with $\mu' = \mu + O(k/\sqrt{n})$, where $\mu'/\sqrt{n} \leq \frac{1}{10k}$.

To simplify the notation, let us permute the samples so that $C_1^+ = \{1, 2, \ldots, l\}$. The probability that $X^{(i)}_1 > 0$ is $\Theta(k/m)$, and so for $q = \Theta(m\log^2 m)$ samples, with high probability the number of samples $l$ with $X^{(i)}_1 > 0$ is $\Omega(qk/m) = \Omega(k\log^2 m)$.

Definition 5.4. Let $M_i$ be the matrix $(0, B_{\Omega^{(i)}\setminus\{1\}})B_{\Omega^{(i)}}^+$. Then we can write $Q_{i,1} = (I - M_i)Y^{(i)}$.

Let us establish some basic properties of $M_i$ that we will need in our analysis:

Claim 5.5. $M_i$ has the following properties: (1) $M_iB_1 = 0$; (2) for all $j \in \Omega^{(i)}\setminus\{1\}$, $M_iB_j = B_j$; and (3) $\|M_i\| \leq 1 + \xi$.

Proof: The first and second properties follow immediately from the definition of $M_i$, and the third property follows from the Gershgorin disk theorem. $\square$

For the time being we will consider the vector $\hat B_1 = \sum_{i=1}^l Q_{i,1} / \sum_{i=1}^l X^{(i)}_1$. We cannot compute this vector directly (note that $\hat B_1$ and $B'_1$ are in general different), but first we will show that $\hat B_1$ and $A_1$ are suitably close. To accomplish this, we will first find a convenient expression for the error:

Lemma 5.6.
$$A_1 - \hat B_1 = \sum_{i=1}^l \frac{X^{(i)}_1}{\sum_{i'=1}^l X^{(i')}_1}M_i(A_1 - B_1) - \frac{\sum_{i=1}^l\sum_{j\in\Omega^{(i)}\setminus\{1\}}(I - M_i)(A_j - B_j)X^{(i)}_j}{\sum_{i=1}^l X^{(i)}_1}. \quad (1)$$
Proof: The proof mostly consists of carefully reorganizing terms and using the properties of the $M_i$'s to simplify the expression.
Let us first compute $\hat B_1 - B_1$:
$$\hat B_1 - B_1 = \frac{\sum_{i=1}^l X^{(i)}_1\big((I - M_i)A_1 - B_1\big) + \sum_{i=1}^l\sum_{j\in\Omega^{(i)}\setminus\{1\}}(I - M_i)A_jX^{(i)}_j}{\sum_{i=1}^l X^{(i)}_1}$$
$$= \sum_{i=1}^l \frac{X^{(i)}_1}{\sum_{i'=1}^l X^{(i')}_1}(I - M_i)(A_1 - B_1) + \frac{\sum_{i=1}^l\sum_{j\in\Omega^{(i)}\setminus\{1\}}(I - M_i)(A_j - B_j)X^{(i)}_j}{\sum_{i=1}^l X^{(i)}_1}.$$

$^1$Notice that this is not a "with high probability" statement; the support is always correctly recovered. That is why we use $\Omega^{(i)}$ both in the algorithm and for the true support.

The last equality uses the first and second properties of $M_i$ from the above claim. Consequently we have
$$A_1 - \hat B_1 = (A_1 - B_1) - (\hat B_1 - B_1) = \sum_{i=1}^l \frac{X^{(i)}_1}{\sum_{i'=1}^l X^{(i')}_1}M_i(A_1 - B_1) - \frac{\sum_{i=1}^l\sum_{j\in\Omega^{(i)}\setminus\{1\}}(I - M_i)(A_j - B_j)X^{(i)}_j}{\sum_{i=1}^l X^{(i)}_1}.$$
And this is our desired expression. $\square$

We will analyze the two terms in the above equation separately. The second term is the most straightforward to bound, since it is a sum of independent vector-valued random variables (after we condition on the support $\Omega^{(i)}$ of each sample in $C_1^+$).

Claim 5.7. If $l > \Omega(k\log^2 m)$, then with high probability the second term of Equation (1) is bounded by $\epsilon/100$.

Proof: The denominator is at least $l$, and the numerator is the sum of at most $lk$ independent random vectors with mean zero whose length is at most $3C\epsilon$. We can invoke the vector Bernstein inequality [30] and conclude that the sum is bounded by $O(C\sqrt{lk\log m}\,\epsilon)$ with high probability. After normalization the second term is bounded by $\epsilon/100$. $\square$

All that remains is to bound the first term. Note that the coefficient of $\|M_i(A_1 - B_1)\|$ is independent of the support, and so the first term will converge to its expectation, namely $E[\|M_i(A_1 - B_1)\|]$. So it suffices to bound this expectation.

Lemma 5.8. $E[\|M_i(A_1 - B_1)\|] \leq (1-\delta)\epsilon$.
Proof: We will break $A_1 - B_1$ into its component $x$ in the direction of $B_1$ and its orthogonal component $y$ in $B_1^{\perp}$. First we bound the norm of $x$:
$$\|x\| = |\langle A_1 - B_1, B_1\rangle| = |\langle A_1 - B_1, A_1 - B_1\rangle|/2 \leq \epsilon^2/2$$
Next we consider the component $y$. Consider the supports $\Omega^{(1)}$ and $\Omega^{(2)}$ of two random samples from $C_1^+$. These sets certainly intersect at least once, since both contain $\{1\}$. Yet with probability at least $2/3$ this is their only intersection (e.g. see Claim 4.3). If so, let $S = (\Omega^{(1)} \cup \Omega^{(2)})\setminus\{1\}$. Recall that $\|B_S^T\| \leq 1 + \xi$. However, $B_S^Ty$ is the concatenation of $B_{\Omega^{(1)}}^Ty$ and $B_{\Omega^{(2)}}^Ty$, and so we conclude that $\|B_{\Omega^{(1)}}^Ty\| + \|B_{\Omega^{(2)}}^Ty\| \leq (1+\xi)\sqrt{2}$. Since the spectral norm of $(0, B_{\Omega^{(i)}\setminus\{1\}})$ is bounded, we conclude that $\|M_1y\| + \|M_2y\| \leq (1+\xi)\sqrt{2}$. This implies that
$$E[\|M_i(A_1 - B_1)\|] \leq E[\|M_ix\|] + E[\|M_iy\|] \leq (2/3)(1+\xi)(\sqrt{2}/2)\epsilon + (1/3)(1+\xi)\epsilon + \epsilon^2/2$$
And this is indeed at most $(1-\delta)\epsilon$, which concludes the proof of the lemma. $\square$

Combining the two claims, we know that with high probability $\hat B_1$ has distance at most $(1-\delta)\epsilon$ to $A_1$. However, $B'_1$ is not equal to $\hat B_1$ (and we cannot compute $\hat B_1$ because we do not know the normalization factor). The key observation here is that $\hat B_1$ is a multiple of $B'_1$, and the vectors $B'_1$ and $A_1$ both have unit norm, so if $\hat B_1$ is close to $A_1$ then the vector $B'_1$ must also be close to $A_1$.

Claim 5.9. If $x$ and $y$ are unit vectors, and $x'$ is a multiple of $x$, then $\|x' - y\| \leq \epsilon < 1$ implies that $\|x - y\| \leq \epsilon\sqrt{1+\epsilon^2}$.

Algorithm 5 OverlappingCluster2, Input: $p$ samples $Y^{(1)}, Y^{(2)}, \ldots, Y^{(p)}$, integer $\ell$
1. Compute a graph $G$ on $p$ nodes where there is an edge between $i$ and $j$ iff $|\langle Y^{(i)}, Y^{(j)}\rangle| > 1/2$
2. Set $T = \frac{pk}{Cm2^{\ell}}$
3. Repeat $\Omega(k^{\ell-2}m\log^2 m)$ times:
4.
Choose a random node $u$ in $G$, and $\ell - 1$ of its neighbors $u_1, u_2, \ldots, u_{\ell-1}$
5. If $|\Gamma_G(u) \cap \Gamma_G(u_1) \cap \ldots \cap \Gamma_G(u_{\ell-1})| \geq T$
6. Set $S_{u_1,u_2,\ldots,u_{\ell-1}} = \{w : |\Gamma_G(u) \cap \Gamma_G(u_1) \cap \ldots \cap \Gamma_G(w)| \geq T\} \cup \{u_1, u_2, \ldots, u_{\ell-1}\}$
7. Delete any set $S_{u_1,u_2,\ldots,u_{\ell-1}}$ if $u_1, u_2, \ldots, u_{\ell-1}$ are contained in a strictly smaller set $S_{v_1,v_2,\ldots,v_{\ell-1}}$
8. Output the remaining sets $S_{u_1,u_2,\ldots,u_{\ell-1}}$

Proof: We have that $\|x - y\|^2 = \sin^2\theta + (1-\cos\theta)^2$ where $\theta$ is the angle between $x$ and $y$. Note that $\sin\theta \leq \|x' - y\| \leq \epsilon$, so $\|x - y\| \leq \sqrt{\epsilon^2 + (1 - \sqrt{1-\epsilon^2})^2}$. Note that for $0 \leq a \leq 1$ we have $1 - a \leq \sqrt{1-a}$, and this implies the claim. $\square$

This concludes the proof of Theorem 5.1. To bound the running time, observe that for each sample the main computations involve computing the pseudo-inverse of an $n \times k$ matrix, which takes $O(nk^2)$ time.

6 A Higher Order Algorithm

Here we extend the algorithm OverlappingCluster presented in Section 3 to succeed even when $k \leq c\min(m^{1/2-\eta}, \sqrt{n}/(\mu\log n))$. The premise of OverlappingCluster is that we can distinguish whether or not a triple of samples $Y^{(1)}, Y^{(2)}, Y^{(3)}$ has a common intersection based on their number of common neighbors in the connection graph. However, for $k = \omega(m^{2/5})$ this is no longer true! Instead we will consider higher-order groups of sets. In particular, for any $\eta > 0$ there is an $\ell$ so that we can distinguish whether or not an $\ell$-tuple of samples $Y^{(1)}, Y^{(2)}, \ldots, Y^{(\ell)}$ has a common intersection based on their number of common neighbors, and this test succeeds even for $k = \Omega(m^{1/2-\eta})$. The main technical challenge is in showing that if the sets $\Omega^{(1)}, \Omega^{(2)}, \ldots, \Omega^{(\ell)}$ do not have a common intersection, then we can upper bound the probability that a random set $\Omega$ intersects each of them.
To accomplish this, we will need to bound the number of ways of piercing, with at most $s$ points, $\ell$ sets $\Omega^{(1)}, \Omega^{(2)}, \ldots, \Omega^{(\ell)}$ that have bounded pairwise intersections (see the definitions below and Lemma 6.4), and this is the key to analyzing our higher order algorithm OverlappingCluster2. We will defer the proofs of the key lemmas and the description of the algorithm in this section to Appendix 6. Nevertheless, what we need is an analogue of Claim 3.1 and Lemma 3.2. The first is easy, but what about an analogue of Lemma 3.2? To analyze the probability that a set $\Omega$ intersects each of the sets $\Omega^{(1)}, \Omega^{(2)}, \ldots, \Omega^{(\ell)}$, we will rely on the following standard definition:

Definition 6.1. Given a collection of sets $\Omega^{(1)}, \Omega^{(2)}, \ldots, \Omega^{(\ell)}$, the piercing number is the minimum number of points $p_1, p_2, \ldots, p_r$ such that each set contains at least one point $p_i$.

The notion of piercing number is well studied in combinatorics (see e.g. [42]). However, one is usually interested in upper bounding the piercing number. For example, a classic result of Alon and Kleitman concerns the $(p,q)$-problem [5]: suppose we are given a collection of sets with the property that among every choice of $p$ of them, there is a subset of $q$ which intersect. Then how large can the piercing number be? Alon and Kleitman proved that the piercing number is at most a fixed constant $c(p,q)$, independent of the number of sets [5]. However, here our interest in the piercing number is not in bounding the minimum number of points needed, but rather in analyzing how many ways there are of piercing a collection of sets with at most $s$ points, since this will directly yield bounds on the probability that $\Omega$ intersects each of $\Omega^{(1)}, \Omega^{(2)}, \ldots, \Omega^{(\ell)}$. We will need as a condition that each pair of sets has bounded intersection, and this holds in our model with high probability.
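To make Definition 6.1 concrete, here is a small brute-force computation of the piercing number (our own illustrative code; it is exponential in the universe size and only meant for tiny examples):

```python
from itertools import combinations

def piercing_number(sets):
    """Smallest r such that some r points hit every set (brute force)."""
    universe = sorted(set().union(*sets))
    for r in range(1, len(universe) + 1):
        for pts in combinations(universe, r):
            if all(s.intersection(pts) for s in sets):
                return r
    return 0

# three sets that pairwise intersect but have empty common intersection:
family = [{1, 2}, {2, 3}, {1, 3}]
print(piercing_number(family))
assert piercing_number(family) == 2                     # e.g. the points {1, 2} hit all three
assert piercing_number([{1, 2}, {1, 3}, {1, 4}]) == 1   # a common point pierces everything
```

The first family pairwise intersects yet has empty common intersection, so no single point pierces it; this is exactly the distinction the common-neighbor test has to detect.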
Claim 6.2. With high probability, the intersection of any pair $\Omega^{(1)}, \Omega^{(2)}$ has size at most $Q$.

Definition 6.3. We will call a set of $\ell$ sets a $(k,Q)$-family if each set has size at most $k$ and the intersection of each pair of sets has size at most $Q$.

Lemma 6.4. The number of ways of piercing a $(k,Q)$-family (of $\ell$ sets) with $s$ points is at most $(\ell k)^s$. And crucially, if $\ell \geq s+1$, then the number of ways of piercing it with $s$ points is at most $Qs(s+1)(\ell k)^{s-1}$.

Proof: The first part of the lemma is the obvious upper bound. Now let us assume $\ell \geq s+1$. Given a set of $s$ points that pierce the sets, we can partition the $\ell$ sets into $s$ groups based on which of the $s$ points hits the set. (In general a set may be hit by more than one point, but we can break ties arbitrarily.) Let us fix any $s+1$ of the $\ell$ sets, and let $U$ be the union of the pairwise intersections of these sets. Then $U$ has size at most $Qs(s+1)$. Furthermore, by the pigeonhole principle, there must be a pair of these sets that is hit by the same point. Hence one of the $s$ points must belong to the set $U$, and we can remove this point and appeal to the first part of the lemma (removing any sets that are hit by this point). This concludes the proof of the second part of the lemma too. $\square$

Theorem 6.5. The algorithm OverlappingCluster2($\ell$) finds an overlapping clustering where each set corresponds to some $i$ and contains all $Y^{(j)}$ for which $X^{(j)}_i \neq 0$. The algorithm runs in time $\tilde O(k^{\ell-2}mp + p^2n)$ and succeeds with high probability if $k \leq c\min(m^{(\ell-1)/(2\ell-1)}, \frac{\sqrt{n}}{\mu\log n})$ and if $p = \Omega(\frac{m^2}{k^2}\log m + k^{\ell-2}m\log^2 m)$.

In order to prove this theorem, we first give an analogue of Claim 3.1:

Claim 6.6. Suppose $\Omega^{(1)} \cap \Omega^{(2)} \cap \cdots$
$\cap\, \Omega^{(\ell)} \neq \emptyset$. Then
$$\Pr_Y\big[\text{for all } j = 1, 2, \ldots, \ell,\ |\langle Y, Y^{(j)}\rangle| > 1/2\big] \geq \frac{k}{2m}$$
The proof of this claim is identical to the proof of Claim 3.1. Next we give the crucial corollary of Lemma 6.4.

Corollary 6.7. The probability that $\Omega$ hits each set in a $(k,Q)$-family (of $\ell$ sets) is at most
$$\sum_{2 \leq s \leq \ell-1} C_s(\ell k)^{s-1}\Big(\frac{k}{m}\Big)^s + \sum_{s \geq \ell}\Big(\frac{\ell k^2}{m}\Big)^s$$
where $C_s = Qs(s+1)$ is a constant depending polynomially on $s$.

Proof: We can break the event that $\Omega$ hits each set in a $(k,Q)$-family into a family of events: either $\Omega$ pierces the family with $s \leq \ell-1$ points, or it pierces it with $s \geq \ell$ points. In the former case we can invoke the second part of Lemma 6.4, and the probability that $\Omega$ hits any particular set of $s$ points is at most $(k/m)^s$. In the latter case we can invoke the first part of Lemma 6.4. $\square$

Note that if $k \leq m^{1/2}$ then $k/m$ is always greater than or equal to $k^{s-1}(k/m)^s$. And so asymptotically the largest term in the above sum is $(k^2/m)^{\ell}$, which we want to be asymptotically smaller than $k/m$, the probability in Claim 6.6. So if $k \leq cm^{(\ell-1)/(2\ell-1)}$ then the above bound is $o(k/m)$, which is asymptotically smaller than the probability that a given set of $\ell$ nodes that have a common intersection are each connected to a random (new) node in the connection graph. So again we can distinguish whether or not an $\ell$-tuple has a common intersection, and this immediately yields a new overlapping clustering algorithm that works for $k$ almost as large as $\sqrt{m}$, although the running time depends on how close $k$ is to this bound.

7 Discussion

This paper shows that it is possible to provably get around the chicken-and-egg problem inherent in dictionary learning: not knowing $A$ seems to prevent recovering the $X$'s, and vice versa.
By using combinatorial techniques to recover the support of each $X$ without knowing the dictionary, our algorithm suggests a new way to design algorithms. Currently the running time is $\tilde O(p^2n)$, which may be too slow for large-scale problems. But our algorithm suggests more efficient heuristic versions of recovering the support. One alternative is to construct the connection graph $G$ and then find the overlapping clustering by running a truncated power method [53] on $e_i + e_j$ (a vector that is one on indices $i,j$ and zero elsewhere, where $(i,j)$ is an edge). In experiments, this recovers a good enough approximation to the true clustering, which can then be used to smartly initialize K-SVD so that it does not have to start from scratch. In practice, this yields a hybrid method that converges much more quickly and succeeds more often. Thus we feel that in practice the best algorithm may use the algorithmic ideas presented here.

We note that for dictionary learning, making stochastic assumptions seems unavoidable. Interestingly, our experiments help to corroborate some of the assumptions. For instance, the condition $E[X_i \mid X_i \neq 0] = 0$ used in our best analysis also seems necessary for K-SVD; empirically, we have seen its performance degrade when this is violated.

Acknowledgements

We would like to thank Aditya Bhaskara, Tengyu Ma and Sushant Sachdeva for numerous helpful discussions throughout various stages of this work.

References

[1] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon. Learning sparsely used overcomplete dictionaries via alternating minimization. In arXiv:1310.7991, 2013.

[2] A. Agarwal, A. Anandkumar, and P. Netrapalli. Exact recovery of sparsely used overcomplete dictionaries. In arXiv:1309.1952, 2013.

[3] M. Aharon. Overcomplete dictionaries for sparse representation of signals.
PhD thesis, 2006.
[4] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. on Signal Processing, pages 4311–4322, 2006.
[5] N. Alon and D. Kleitman. Piercing convex sets and the Hadwiger-Debrunner $(p, q)$-problem. Advances in Mathematics, pages 103–112, 1992.
[6] A. Anandkumar, D. Foster, D. Hsu, S. Kakade, and Y. Liu. A spectral algorithm for latent Dirichlet allocation. In NIPS, pages 926–934, 2012.
[7] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization – provably. In STOC, pages 145–162, 2012.
[8] S. Arora, R. Ge, and A. Moitra. Learning topic models – going beyond SVD. In FOCS, pages 1–10, 2012.
[9] S. Arora, R. Ge, S. Sachdeva, and G. Schoenebeck. Finding overlapping communities in social networks: Towards a rigorous approach. In EC, 2012.
[10] M. Balcan, C. Borgs, M. Braverman, J. Chayes, and S.-H. Teng. Finding endogenously formed communities. In SODA, 2013.
[11] B. Barak, J. Kelner, and D. Steurer. Dictionary learning using sum-of-squares hierarchy. Unpublished manuscript, 2014.
[12] M. Belkin and K. Sinha. Polynomial learning of distribution families. In FOCS, pages 103–112, 2010.
[13] E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications of Pure and Applied Math, pages 1207–1223, 2006.
[14] E. Candes and T. Tao. Decoding by linear programming. IEEE Trans. on Information Theory, pages 4203–4215, 2005.
[15] P. Comon. Independent component analysis: A new concept? Signal Processing, pages 287–314, 1994.
[16] G. Davis, S. Mallat, and M. Avellaneda. Greedy adaptive approximations. J. of Constructive Approximation, pages 57–98, 1997.
[17] D. Donoho and M. Elad.
Optimally sparse representation in general (non-orthogonal) dictionaries via $\ell_1$-minimization. PNAS, pages 2197–2202, 2003.
[18] D. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Trans. on Information Theory, pages 2845–2862, 1999.
[19] D. Donoho and P. Stark. Uncertainty principles and signal recovery. SIAM J. on Appl. Math, pages 906–931, 1999.
[20] M. Elad. Sparse and Redundant Representations. Springer, 2010.
[21] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. on Signal Processing, pages 3736–3745, 2006.
[22] K. Engan, S. Aase, and J. Hakon-Husoy. Method of optimal directions for frame design. In ICASSP, pages 2443–2446, 1999.
[23] A. Frieze, M. Jerrum, and R. Kannan. Learning linear transformations. In FOCS, pages 359–368, 1996.
[24] Q. Geng, H. Wang, and J. Wright. On the local correctness of $\ell_1$-minimization for dictionary learning. arXiv:1101.5672, 2013.
[25] A. Gilbert, S. Muthukrishnan, and M. Strauss. Approximation of functions over redundant dictionaries using coherence. In SODA, 2003.
[26] G. Golub and C. van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.
[27] I. J. Goodfellow, A. Courville, and Y. Bengio. Large-scale feature learning with spike-and-slab sparse coding. In ICML, pages 718–726, 2012.
[28] N. Goyal, S. Vempala, and Y. Xiao. Fourier PCA. In STOC, 2014.
[29] R. Gribonval and M. Nielsen. Sparse representations in unions of bases. IEEE Transactions on Information Theory, pages 3320–3325, 2003.
[30] D. Gross. Recovering low-rank matrices from few coefficients in any basis. arXiv:0910.1879, 2009.
[31] D. Hanson and F. Wright. A bound on tail probabilities for quadratic forms in independent random variables. Annals of Math. Stat., pages 1079–1083, 1971.
[32] M. Hardt. On the provable convergence of alternating minimization for matrix completion. arXiv:1312.0925, 2013.
[33] R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press, 1990.
[34] P. Jain, P. Netrapalli, and S. Sanghavi. Low rank matrix completion using alternating minimization. In STOC, pages 665–674, 2013.
[35] K. Kavukcuoglu, M. Ranzato, and Y. LeCun. Fast inference in sparse coding algorithms with applications to object recognition. NYU Tech Report, 2008.
[36] K. Kreutz-Delgado, J. Murray, K. Engan, B. Rao, T. Lee, and T. Sejnowski. Dictionary learning algorithms for sparse representation. Neural Computation, 2003.
[37] L. De Lathauwer, J. Castaing, and J. Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Trans. on Signal Processing, pages 2965–2973, 2007.
[38] H. Lee, A. Battle, R. Raina, and A. Ng. Efficient sparse coding algorithms. In NIPS, 2006.
[39] M. Lewicki and T. Sejnowski. Learning overcomplete representations. Neural Computation, pages 337–365, 2000.
[40] J. Mairal, M. Leordeanu, F. Bach, M. Herbert, and J. Ponce. Discriminative sparse image models for class-specific edge detection and image interpretation. In ECCV, 2008.
[41] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1998.
[42] J. Matousek. Lectures on Discrete Geometry. Springer, 2002.
[43] A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In FOCS, pages 93–102, 2010.
[44] B. Olshausen and B. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, pages 3311–3325, 1997.
[45] M. Pontil, A. Argyriou, and T. Evgeniou. Multi-task feature learning. In NIPS, 2007.
[46] M. Ranzato, Y. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. In NIPS, 2007.
[47] M.
Rudelson. Random vectors in the isotropic position. J. of Functional Analysis, pages 60–72, 1999.
[48] D. Spielman, H. Wang, and J. Wright. Exact recovery of sparsely-used dictionaries. Journal of Machine Learning Research, 2012.
[49] J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, pages 2231–2242, 2004.
[50] J. Tropp, A. Gilbert, S. Muthukrishnan, and M. Strauss. Improved sparse approximation over quasi-incoherent dictionaries. In IEEE International Conf. on Image Processing, 2003.
[51] P. Wedin. Perturbation bounds in connection with singular value decompositions. BIT, pages 99–111, 1972.
[52] J. Yang, J. Wright, T. Huong, and Y. Ma. Image super-resolution as sparse representation of raw image patches. In CVPR, 2008.
[53] X. Yuan and T. Zhang. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, pages 899–925, 2013.

A Clustering Using Only Bounded 3-wise Moments

When the support of $X$ has only bounded 3-wise moments, it is possible for two supports to have a large intersection. In that case checking the number of common neighbors cannot correctly identify whether three samples have a common intersection. In particular, there might be false positives (three samples with no common intersection but with many common neighbors) but no false negatives (all samples with a common intersection will still have many common neighbors). The algorithm can still work in this case, because it is unlikely for two supports to have a very large intersection:

Lemma A.1. Suppose $\Gamma$ has bounded 3-wise moments and $k = cm^{2/5}$ for some small enough constant $c > 0$.
For any set $\Omega$ of size $k$, the probability that a random support $\Omega'$ from $\Gamma$ has intersection larger than $m^{1/5}/100$ with $\Omega$ is at most $O(m^{-6/5})$.

Proof: Let $T$ be the number of triples in the intersection of $\Omega$ and $\Omega'$. For any triple in $\Omega$, the probability that it is also in $\Omega'$ is at most $O(k^3/m^3)$ by the bounded 3-wise moments. Therefore $E[T] \leq \binom{k}{3} O(k^3/m^3) = O(k^6/m^3)$. On the other hand, whenever $\Omega$ and $\Omega'$ have intersection larger than $m^{1/5}/100$, $T$ is larger than $\binom{m^{1/5}/100}{3}$. By Markov's inequality we know $\Pr[|\Omega \cap \Omega'| \geq m^{1/5}/100] \leq O(m^{-6/5})$. $\square$

Since the probability of having false positives is small (but not negligible), we can do a simple trimming operation when we are computing the set $S_{u,v}$ in Algorithm 1. We shall change the definition of $S_{u,v}$ as follows:

1. Set $S'_{u,v} = \{w : |\Gamma_G(u) \cap \Gamma_G(v) \cap \Gamma_G(w)| \geq T\} \cup \{u, v\}$.
2. Set $S_{u,v} = \{w : w \in S'_{u,v} \text{ and } |\Gamma_G(w) \cap S'_{u,v}| \geq T\}$.

Now $S'_{u,v}$ is the same as the old definition and may have false positives. However, intuitively the false positives are not in the cluster, so they cannot have many connections to the cluster and will be filtered out in the second step. In particular, we have the following lemma:

Lemma A.2. If $(u, v)$ is an identifying pair (as defined in Definition 3.3) for $i$, then with high probability $S_{u,v}$ is the set $C_i = \{j : i \in \Omega^{(j)}\}$.

Proof: First we argue that the set $S'_{u,v}$ is the union of $C_i$ with a small set. By Claim 3.1 and a Chernoff bound, for all $w \in C_i$ the triple $u, v, w$ has more than $T$ common neighbors, so $w \in S'_{u,v}$. On the other hand, if $w \notin C_i$ but $w \in S'_{u,v}$, then by Lemma 3.2 we know $\Omega^{(w)}$ must have a large intersection with either $\Omega^{(u)}$ or $\Omega^{(v)}$, which has probability only $O(m^{-6/5})$ by Lemma A.1. Therefore, again by concentration bounds, with high probability $|S'_{u,v} \setminus C_i| \leq p/m \ll T$.
Now consider the second step. For the samples in $C_i$, the probability that they are connected to another random sample in $C_i$ is $1 - O(k^2/m)$, so by concentration bounds with high probability they have at least $T$ neighbors in $C_i$; they will not be filtered and remain in $S_{u,v}$. On the other hand, for any vertex $w \notin C_i$, the expected number of edges from $w$ to $C_i$ is only $O(k^2/m)|C_i| \ll T$, and by the concentration property these counts are concentrated around the expectation with high probability. So any $w \in S'_{u,v} \setminus C_i$ can only have $O(pk^3/m^2)$ edges to $C_i$, and $O(p/m)$ edges to $S'_{u,v} \setminus C_i$. The total number of edges to $S'_{u,v}$ is much less than $T$, so all of those vertices are removed and $S_{u,v} = C_i$. $\square$

This lemma ensures that after we pick enough random pairs, with high probability all the correct clusters $C_i$ are among the $S_{u,v}$'s. There can be "bad" sets, but same as before all those sets contain some of the $C_i$, and so will be removed at the end of the algorithm:

Claim A.3. For any pair $(u, v)$ with $i \in \Omega^{(u)} \cap \Omega^{(v)}$, let $C_i = \{j : i \in \Omega^{(j)}\}$; then with high probability $C_i \subseteq S_{u,v}$.

Proof: This is essentially contained in the proof of the previous lemma. As before, by Claim 3.1 we know $C_i \subseteq S'_{u,v}$. Now for any sample in $C_i$, the expected number of edges to $C_i$ is $(1 - o(1))|C_i|$; by concentration bounds we know the number of neighbors is larger than $T$ with high probability. Then we apply a union bound over all samples in $C_i$, and conclude that $C_i \subseteq S_{u,v}$. $\square$

B Extensions: Proof Sketch of Theorem 1.6

Let us first examine how the conditions in the hypothesis of Theorem 1.4 were used in its proof, and then discuss why they can be relaxed. Our algorithm is based on three steps: constructing the connection graph, finding the overlapping clustering, and recovering the dictionary.
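As a rough illustration (not the paper's exact procedure), the first two of these steps can be sketched in code. Here `tau` and the common-neighbor cutoff `T` are hypothetical parameters standing in for the thresholds derived in the analysis, and the cluster-growing rule is a simplified rendering of the $S_{u,v}$ construction:

```python
import itertools
import numpy as np

def connection_graph(Y, tau=0.5):
    """Step 1 (sketch): connect samples u, v when |<Y_u, Y_v>| > tau.
    tau is a hypothetical threshold, not the paper's exact constant."""
    G = {u: set() for u in range(len(Y))}
    for u, v in itertools.combinations(range(len(Y)), 2):
        if abs(float(np.dot(Y[u], Y[v]))) > tau:
            G[u].add(v)
            G[v].add(u)
    return G

def candidate_clusters(G, T):
    """Step 2 (sketch): for each edge (u, v), collect the vertices w
    that share at least T common neighbors with both u and v."""
    clusters = []
    for u in G:
        for v in G[u]:
            S = {w for w in G if len(G[u] & G[v] & G[w]) >= T} | {u, v}
            if S not in clusters:
                clusters.append(S)
    return clusters
```

On a toy instance where four samples share one dictionary element and a fifth shares none, the candidate clusters recover exactly the group of four.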
However, if we invoke Lemma 2.3 (as opposed to Lemma 2.2), then the properties we need of the connection graph follow from each $X$ being at most $k$-sparse for $k \leq n^{1/4}/\sqrt{\mu}$, without any distributional assumptions. Furthermore, the crucial steps in finding the overlapping clustering are bounds on the probability that a sample $X$ intersects a triple with a common intersection, and the probability that it does so when there is no common intersection (Claim 3.1 and Lemma 3.2). Indeed, these bounds hold whenever the probability of two sets intersecting in two or more locations is smaller (by, say, a factor of $k$) than the probability of the sets intersecting once. This can be true even if elements in the sets have significant positive correlation (but for ease of exposition, we have emphasized the simplest models at the expense of generality). Lastly, in Algorithm 2 we can instead consider the difference between the averages for $S_i$ and $C_i \setminus S_i$, and this succeeds even if $E[X_i]$ is non-zero. This last step does use the condition that the variables $X_i$ are independent, but if we instead use Algorithm 3 we can circumvent this assumption and still recover a dictionary that is close to the true one.

Finally, the "bounded away from zero" assumption in Definition 1.2 can be relaxed: the resulting algorithm recovers a dictionary that is close enough to the true one and still allows sparse recovery. This is because when the distribution has the anti-concentration property, a slight variant of Algorithm 1 can still find most (instead of all) columns with $X_i \neq 0$. Using the ideas from this part, we give a proof sketch for Theorem 1.6.

Proof: [sketch for Theorem 1.6] The proof follows the same steps as the proof of Theorem 4.9. There are a few steps that need to be modified:

1. Invoke Lemma 2.3 instead of Lemma 2.2.

2.
For Lemma 3.2, use the weaker bound on the 4-th moment. This is still OK because $k$ is smaller now.

3. In Definition 4.4, redefine $R_i^2$ to be $E_{x \in D_i}[\langle A_i, Ax \rangle^2]$.

4. In Lemma 4.5, use the bound
$$R_i^2 \leq \alpha^2 + 2\alpha\sqrt{1 - \alpha^2}\, k\sqrt{\mu}/n^{1/4} + (1 - \alpha^2) k^2 \mu/\sqrt{n}$$
in order to take the correlations between the $X_i$'s into account. $\square$

Remark: Based on different assumptions on the distribution, there are algorithms with different trade-offs. Theorem 1.6 is only used to illustrate the potential of our approach and does not try to achieve the optimal trade-off in every case. A major difference from the class $\Gamma$ is that the $X_i$'s do not have expectation 0 and are not forbidden from taking values close to 0 (provided they do have reasonable probability of taking values away from 0). Another major difference is that the distribution of $X_i$ can depend upon the values of the other nonzero coordinates. The weaker moment condition allows a fair bit of correlation among the set of nonzero coordinates.

It is also possible to relax the condition that each nonzero $X_i$ is in $[-C, -1] \cup [1, C]$. Instead we require that $X_i$ has magnitude at most $O(1)$, and has a weak anti-concentration property: for every $\delta > 0$ it has probability at least $c_\delta > 0$ of exceeding $\delta$ in magnitude. This requires changing Algorithm 1 in the following ways: For each set $S$, let $T$ be the subset of vertices that have at least a $1 - 2\delta$ fraction of $S$ as neighbors:
$$T = \{i \in S : |\Gamma_G(i) \cap S| \geq (1 - 2\delta)|S|\}.$$
Keep the sets $S$ such that a $1 - 2\delta$ fraction of the vertices are in $T$ (i.e., $|T| \geq (1 - 2\delta)|S|$). Here the choice of $\delta$ depends on the parameters $\mu, n, k$, and affects the final accuracy of the algorithm. This ensures that for any remaining $S$, there must be a single coordinate on which every $X^{(i)}$ for $i \in S$ is nonzero.
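This modified filtering step admits a short sketch in code, with $\Gamma_G$ represented as an adjacency dict of sets; the particular value of `delta` below is an illustrative stand-in for the choice dictated by $\mu, n, k$:

```python
def keep_dense_sets(candidates, G, delta):
    """Sketch of the modified filtering: within each candidate set S,
    T collects the vertices with at least a (1 - 2*delta) fraction of S
    as neighbors, and S survives only if T itself covers a (1 - 2*delta)
    fraction of S. G maps each vertex to its set of neighbors."""
    kept = []
    for S in candidates:
        T = {i for i in S if len(G[i] & S) >= (1 - 2 * delta) * len(S)}
        if len(T) >= (1 - 2 * delta) * len(S):
            kept.append(S)
    return kept
```

For example, a candidate set containing an isolated vertex fails the test and is trimmed, while a densely connected candidate survives.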
In the last step, only output sets that are significantly different from the previously output sets (significantly different means the symmetric difference is at least $pk/5m$).

C Discussion: Overlapping Communities

There is a connection between the approach used here and the recent work on algorithms for finding overlapping communities (see in particular [9], [10]). We can think of the set of samples $Y$ for which $X_i \neq 0$ as a "community". Then each sample is in more than one community, and indeed for our setting of parameters each sample is contained in $k$ communities. We can think of the main approach of this paper as: If we can find all of the overlapping communities, then we can learn an unknown dictionary.

So how can we find these overlapping communities? The recent papers [9], [10] pose deterministic conditions on what constitutes a community (e.g., each node outside of the community has fewer edges into the community than do other members of the community). These papers provide algorithms for finding all of the communities, provided these conditions are met. However, for our setting of parameters, both of these algorithms would run in quasi-polynomial time. For example, the parameter "$d$" in the paper [9] is an upper bound on how many communities a node can belong to, and the running time of the algorithms in [9] is quasi-polynomial in this parameter. But in our setting, each sample $Y$ belongs to $k$ communities – one for each non-zero value in $X$ – and the most interesting setting here is when $k$ is polynomially large. Similarly, the parameter "$\theta$" in [10] can be thought of as: If node $u$ is in community $c$, what is the ratio of the number of edges incident to $u$ that leave the community $c$ to the number that stay inside $c$?
Again, for our purposes this parameter "$\theta$" is roughly $k$, and the algorithms in [10] depend quasi-polynomially on this parameter. Hence these algorithms would not suffice for our purposes, because when applied to learning an unknown dictionary their running time would depend quasi-polynomially on the sparsity $k$. In contrast, our algorithms run in polynomial time in all of the parameters, albeit for a more restricted notion of what constitutes a community (but one that seems quite natural from the perspective of dictionary learning). Our algorithm OverlappingCluster finds all of the overlapping "communities" provided that whenever a triple of nodes shares a common community, they have many more common neighbors than if they do not all share a single community. The correctness of the algorithm is quite easy to prove once this condition is met; the main work here was in showing that our generative model meets these neighborhood conditions.
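In code, this neighborhood condition is just a thresholded common-neighbor count. A minimal sketch, with `G` an adjacency dict of neighbor sets and `T` a hypothetical threshold in place of the one from the analysis:

```python
def shares_community(G, u, v, w, T):
    """Declare that samples u, v, w share a community exactly when they
    have at least T common neighbors in the connection graph G (sketch).
    The guarantee holds only when triples with a common community have
    many more common neighbors than triples without one."""
    return len(G[u] & G[v] & G[w]) >= T
```

For instance, in a graph where vertices 0 through 3 are mutually connected and vertex 4 is isolated, the triple (0, 1, 2) passes the test while (0, 1, 4) does not.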
