More Algorithms for Provable Dictionary Learning
Sanjeev Arora (Princeton University, Computer Science Department and Center for Computational Intractability; arora@cs.princeton.edu), Aditya Bhaskara (Google Research NYC; bhaskara@cs.princeton.edu; part of this work was done while the author was a postdoc at EPFL, Switzerland), Rong Ge (Microsoft Research; rongge@microsoft.com; part of this work was done while the author was a graduate student at Princeton University), and Tengyu Ma (Princeton University, Computer Science Department and Center for Computational Intractability; tengyu@cs.princeton.edu). This work was supported in part by NSF grants CCF-0832797, CCF-1117309, CCF-1302518, DMS-1317308, and a Simons Investigator Grant.

August 21, 2018

Abstract

In dictionary learning, also known as sparse coding, the algorithm is given samples of the form $y = Ax$ where $x \in \mathbb{R}^m$ is an unknown random sparse vector and $A$ is an unknown dictionary matrix in $\mathbb{R}^{n \times m}$ (usually $m > n$, which is the overcomplete case). The goal is to learn $A$ and $x$. This problem has been studied in neuroscience, machine learning, vision, and image processing. In practice it is solved by heuristic algorithms, and provable algorithms have seemed hard to find. Recently, provable algorithms were found that work if the unknown feature vector $x$ is $\sqrt{n}$-sparse or even sparser. Spielman et al. [SWW12] did this for dictionaries where $m = n$; Arora et al. [AGM13] gave an algorithm for overcomplete ($m > n$) and incoherent matrices $A$; and Agarwal et al. [AAN13] handled a similar case but with weaker guarantees. This raised the problem of designing provable algorithms that allow sparsity $\gg \sqrt{n}$ in the hidden vector $x$. The current paper designs algorithms that allow sparsity up to $n/\mathrm{poly}(\log n)$. They work for a class of matrices where features are individually recoverable, a new notion identified in this paper that may motivate further work. The algorithms run in quasipolynomial time because they use limited enumeration.

1 Introduction

Dictionary learning, also known as sparse coding, tries to understand the structure of observed samples $y$ by representing them as sparse linear combinations of "dictionary" elements. More precisely, there is an unknown dictionary matrix $A \in \mathbb{R}^{n \times m}$ (usually $m > n$, which is the overcomplete case), and the algorithm is given samples $y = Ax$ where $x$ is an unknown random sparse vector. (We say a vector is $k$-sparse if it has at most $k$ nonzero coordinates.) The goal is to learn $A$ and $x$. Such sparse representation was first studied in neuroscience, where Olshausen and Field [OF97] suggested that dictionaries fitted to real-life images have similar properties to the receptive fields of neurons in the first layer of the visual cortex. Inspired by this neural analogy, dictionary learning is widely used in machine learning for feature selection [AEP06]. More recently the idea of sparse coding has also influenced deep learning [BC+07].
In image processing, learned dictionaries have been successfully applied to image denoising [EA06], edge detection [MLB+08] and super-resolution [YWHM08].

Provable guarantees for dictionary learning have seemed difficult because the obvious mathematical programming formulation is nonconvex: both $A$ and the $x$'s are unknown. Even when the dictionary $A$ is known, it is in general NP-hard to find the sparse combination $x$ given a worst-case $y$ [DMA97]. This problem of decoding $x$ given $Ax$ with full knowledge of $A$ is called sparse recovery or sparse regression, and is closely related to compressed sensing. For many types of dictionary $A$, sparse recovery was shown to be tractable even on worst-case $y$, starting with such a result for incoherent matrices by Donoho and Huo [DH01]. However, in most early works $x$ was constrained to be $\sqrt{n}$-sparse, until Candès, Romberg and Tao [CRT06] showed how to do sparse recovery even when the sparsity is $\Omega(n)$, assuming $A$ satisfies the restricted isometry property (RIP), which random matrices do.

But dictionary learning itself (recovering $A$ given samples $y$) has proved much harder, and heuristic algorithms are widely used. Lewicki and Sejnowski [LS00] designed the first one, which was followed by the method of optimal directions (MOD) [EAHH99] and K-SVD [AEB06]. See [Aha06] for more references. However, until recently there were no algorithms that provably recover the correct dictionary. Recently Spielman et al. [SWW12] gave such an algorithm for the full rank case (i.e., $m = n$) where the unknown feature vector $x$ is $\sqrt{n}$-sparse. However, in practice overcomplete dictionaries ($m > n$) are preferred. Arora et al. [AGM13] gave the first provable learning algorithm for overcomplete dictionaries that runs in polynomial time; they required $x$ to be $n^{1/2-\epsilon}$-sparse (roughly speaking) and $A$ to be incoherent. Independently, Agarwal et al. [AAN13] gave a weaker algorithm that also assumes $A$ is incoherent and allows $x$ to be $n^{1/4}$-sparse.

Thus all three of these recent algorithms cannot handle sparsity more than $\sqrt{n}$, and this is a fundamental limitation of the technique: they require two random $x, x'$ to intersect in no more than $O(1)$ coordinates with high probability, which fails to hold when the sparsity is $\gg \sqrt{n}$. Since sparse recovery (where $A$ is known) is possible even up to sparsity $\Omega(n)$, this raised the question whether dictionary learning is possible in that regime. In this paper we will refer to feature vectors with sparsity $n/\mathrm{poly}(\log n)$ as slightly sparse, since the methods in this paper do not seem to allow density higher than that. In our recent paper on deep learning ([ABGM13], Section 7) we showed how to solve dictionary learning in this regime for dictionaries which are adjacency matrices of random weighted sparse bipartite graphs; these are known to allow sparse recovery, albeit with a slight twist in the problem definition [Ind08, JXHC09, BGI+08]. Since real-life dictionaries are probably not random, this raises the question whether dictionary learning is possible in the slightly sparse case for other dictionaries. The current paper gives quasipolynomial-time algorithms for learning more such dictionaries.
The running time is quasipolynomial because the algorithms use limited enumeration (similarly, e.g., to algorithms for learning Gaussian mixtures). We now discuss this class of dictionaries.

Some of our discussion below refers to nonnegative dictionary learning, which constrains the matrix $A$ and the hidden vector $x$ to have nonnegative entries. This is a popular variant proposed by Hoyer [Hoy02], motivated again partly by the neural analogy. Algorithms like NN-K-SVD [AEB05] were then applied to image classification tasks. This version is also related to nonnegative matrix factorization [LS99], which has been observed to lead to factorizations that are usually sparser and more local than traditional methods like SVD.

1.1 How to define dictionary learning?

Now we discuss which versions of dictionary learning make more sense than others. For expository purposes we refer to the coordinates of the hidden vector $x$ as features, and those of the visible vector $y = Ax$ as pixels, even though the discussion applies to more than just computer vision.

Dictionary learning as defined here (which is the standard definition) assumes that the features' effects on the pixels add linearly. But the problem definition is somewhat arbitrary. On the one hand one could consider more general (and nonlinear) versions of this problem; for instance in vision, dictionary learning is part of a system that has to deal with occlusions among objects that may hide part of a feature, and to incorporate the fact that features may be present with an arbitrary translation/rotation. On the other hand, one could consider more specific versions that place restrictions on the dictionary, since not all dictionaries may make sense in applications. We consider this latter possibility now, with the usual caveat that it is nontrivial to cleanly formalize properties of real-life instances.

One reasonable property of real-life dictionaries is that each feature does not involve most pixels. This implies that the column vectors of $A$ are relatively sparse. Thus matrices with the RIP property, at least if they are dense, do not seem a good match. (By contrast, in the usual setting of compressed sensing, the matrix provides a basis for making measurements, and its density is a nonissue.) Another intuitive property is that features are individually recoverable, which means, roughly speaking, that to an observer who knows the dictionary, the presence of a particular feature should not be confusable with the effects produced by the usual distribution of other features (this is an average-case condition, since $x$ satisfies stochastic assumptions). In particular, one should be able to detect its presence by looking only at the pixels it would affect.

Thus it becomes clear that not all matrices that allow sparse recovery are of equal interest. The paper of Arora et al. [AGM13] restricts attention to incoherent matrices, where the columns have pairwise inner product at most $\mu/\sqrt{n}$ with $\mu$ small, like $\mathrm{poly}(\log n)$. These make sense on both of the above counts. First, they can have fairly sparse columns. Second, they satisfy $A^T A \approx I$, so given $Ax$ one can take its inner product with the $i$-th column $A_i$ to roughly determine the extent to which feature $i$ is present.
But incoherent matrices restrict sparsity to $O(\sqrt{n})$, so one is tempted by RIP matrices; as mentioned, however, their columns are fairly dense. Furthermore, RIP matrices were designed to allow sparse recovery for worst-case feature vectors, whereas in dictionary learning these are stochastic. As mentioned, sparse random graphs (with random edge weights in $[-1, 1]$) check all the right boxes (and were handled in our recent paper on deep learning) but require positing that the dictionary has no structure. The goal in the current paper is to move beyond random graphs.

Dictionaries with individually recoverable features. Let us try to formulate the property that features are individually recoverable. We hope this definition and discussion will stimulate further work (similar, we hope, to Dasgupta's formalization of separability for Gaussian mixtures [Das99]). Let us assume that the coordinates of $x$ are pairwise independent. Then the presence of the $i$-th feature (i.e., $x_i \neq 0$) changes the conditional distribution of those pixels involved in $A_i$, the $i$-th column of $A$. Features are said to be individually recoverable if this change in conditional distribution is not obtainable from other combinations of features that arise with reasonable probability. This statistical property is hard to work with, and below we suggest some (possibly too strong) combinatorial properties of the support of $A$ that imply it. Better formalizations seem quite plausible and are left for future work.

2 Definitions and Results

The dictionary is an unknown matrix $A \in \mathbb{R}^{n \times m}$. We are given i.i.d. samples $y$ generated by $y = Ax$, where $x \in \mathbb{R}^m$ is chosen from some distribution. We have $N$ samples $y^i = Ax^i$ for $i = 1, \ldots, N$. As in the introduction, we will refer to coordinates of $x$ as features and those of $y$ as pixels, even though vision isn't the only intended application. For most of the paper we assume the entries of $x$ are independent Bernoulli variables: each $x_i$ is 1 with probability $\rho$ and 0 with probability $1 - \rho$; we refer to this as the $\rho$-Bernoulli distribution. This assumption can be relaxed somewhat: we only need that the entries of $x$ are pairwise independent, and that $e^T x$ satisfies concentration bounds for reasonable vectors $e$. The nonzero entries of $x$ can also be in $[1, c]$ instead of being exactly 1.

The $j$-th column of $A$ is denoted by $A_j$, the $i$-th row of $A$ by $A^{(i)}$, and the entries of $A$ are denoted by $A^{(i)}_j$. For ease of exposition we first describe our learning algorithm for nonnegative dictionaries (i.e., $A^{(i)}_j \geq 0$ for all $i \in [n]$, $j \in [m]$) and then in Section 4 describe the generalization to the general case. Note that the subcase of nonnegative dictionaries is also of practical interest.

2.1 Nonnegative dictionaries

By normalizing, we can assume without loss of generality that the expected value of each pixel is 1, that is, $E[y_i] = E[(Ax)_i] = 1$. We also assume that $|A^{(i)}_j| \leq \Lambda$ for some constant $\Lambda$: no entry of $A$ is too large. (Though the absolute value notation is redundant in the nonnegative case, we keep it so that it can be adapted to the general case; similarly for the definition of $G_b$ below.) Let $G_b$ be the bipartite graph defined by the entries of $A$ that have magnitude at least $b$, that is, $G_b = \{(i, j) : |A^{(i)}_j| \geq b,\ i \in [n],\ j \in [m]\}$. We make two assumptions about this graph (the parameters $d$, $\sigma$ and $\kappa$ will be chosen later).
Assumption 1: (Every feature has a significant effect on pixels) There are at least $d$ edges with weights larger than $\sigma$ for every feature $j$. That is, the degree of $G_\sigma$ on the feature side is always larger than $d$.

Assumption 2: (Low pairwise intersections among features) In $G_\tau$ the neighborhood of each feature (that is, $\{i \in [n] : A^{(i)}_j \geq \tau\}$) has intersection up to $d/10$ (with total weight $< d\sigma/10$) with each of at most $o(1/\sqrt{\rho})$ other features, and intersection at most $\kappa$ with the neighborhood of each remaining feature. Here $\tau$ is $O_\theta(1/\log n)$ as explained below and $\kappa = O_\theta(d/\log^2 n)$.

Guideline through notation: We will think of $\sigma \leq 1$ as a small constant, $\Lambda \geq 1$ as a constant, and $\Delta$ as a sufficiently large constant which is used to control the assumptions. Let $\theta = (\sigma, \Lambda, \Delta)$; we use the notation $O_\theta(\cdot)$ to hide the dependencies on $\sigma, \Lambda, \Delta$. Also, we think of $m$ as not much larger than $n$, $\rho < 1/\mathrm{poly}(\log n)$, and $d \ll n$. The normalization assumption implies (for all practical purposes in the algorithm) that $md\rho \in [n/\Lambda, n/\tau]$. We typically think of $d$ as $1/\rho$, hence a running time of $m^d$ would be bad (though it is unclear a priori how to even achieve that). Precisely, for our algorithms to work, we need $d \geq \Delta\Lambda\log^2 n/\sigma^2$, $\tau = O(\sigma^4/\Delta\Lambda^2\log n) = O_\theta(1/\log n)$ and $\kappa = O(\sigma^8 d/\log^2 n\,\Delta^2\Lambda^6) = O_\theta(d/\log^2 n)$ for some sufficiently large constant $\Delta$, and the density $\rho = o(\sigma^5/\Lambda^{6.5}\log^{2.5} n) = o_\theta(1/\log^{2.5} n)$.

Note that if $G_\tau$ were like a random graph (i.e., if features affected random sets of $d$ pixels) with $d^2 \ll n$, the pairwise intersection $\kappa$ between the neighborhoods of two features in $G_\tau$ would be $O(1)$. However, we allow these intersections to be as large as $\kappa = O_\theta(d/\log^2 n)$.

Now we give a stronger version of Assumption 2 which will allow a stronger algorithm.

Assumption 2': In $G_\tau$, the pairwise intersection of the neighborhoods of any two features $j, k$ is less than $\kappa$, where $\tau = O_\theta(1/\log n)$ and $\kappa = O_\theta(d/\log^2 n)$.

The algorithm can only learn the real-valued matrix approximately. Two dictionaries are close if they satisfy the following definition:

Definition 1 ($\epsilon$-equivalent) Two dictionaries $A$ and $\hat{A} \in \mathbb{R}^{n \times m}$ are $\epsilon$-equivalent if, for a random vector $x \in \mathbb{R}^m$ with independent $\rho$-Bernoulli components, with high probability $Ax$ and $\hat{A}x$ are entry-wise $\epsilon$-close.

Theorem 1 (Nonnegative Case) Under Assumptions 1 and 2, when $\rho = o(\sigma^5/\Lambda^{6.5}\log^{2.5} n) = o_\theta(1/\log^{2.5} n)$, Algorithm 2 runs in $n^{O(\Lambda\log^2 n/\sigma^4)}$ time, uses $\mathrm{poly}(n)$ samples and outputs a matrix that is $o(\rho)$-equivalent to the true dictionary $A$. Furthermore, under Assumptions 1 and 2' the same algorithm returns a dictionary that is $n^{-C}$-equivalent to the true dictionary, while using $n^{4C+3}$ samples, where $C$ is a large constant depending on $\Delta$ (recall that $\Delta$ is a sufficiently large constant that controls the parameters of the assumptions).

The theorem is proved in Section 3.
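To make the setup concrete, the following is a minimal sketch of the generative model behind Theorem 1: a synthetic nonnegative dictionary whose columns each touch $d$ pixels, $\rho$-Bernoulli feature vectors, and the normalization $E[y_i] = 1$. The way the synthetic dictionary is drawn and all numerical parameter values are hypothetical choices for illustration only; they are not the paper's assumptions or settings.

```python
import numpy as np

# A minimal synthetic instance of the model y = A x with a nonnegative dictionary.
# All parameter values here are hypothetical illustrations.
rng = np.random.default_rng(0)
n, m, d, rho = 400, 600, 40, 0.01   # pixels, features, column degree, feature density

# Each column (feature) touches d random pixels with positive weights.
A = np.zeros((n, m))
for j in range(m):
    support = rng.choice(n, size=d, replace=False)
    A[support, j] = rng.uniform(0.5, 1.0, size=d)

# Normalize so that E[y_i] = rho * sum_j A[i, j] = 1 for every pixel i that is touched.
row_means = rho * A.sum(axis=1, keepdims=True)
A = A / np.maximum(row_means, 1e-12)

# Draw N samples: x has independent rho-Bernoulli entries, y = A x.
N = 5000
X = (rng.random((m, N)) < rho).astype(float)
Y = A @ X
print("average pixel value (should be close to 1):", Y.mean())
```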
Remark: Assumption 2 is existentially close to optimal, in the sense that if it is significantly violated (e.g., if there are $\mathrm{poly}(1/\rho)$ features that intersect the neighborhood of feature $j$ using edges of total weight $\Omega(\ell_1\text{-norm of } A_j)$), then feature $j$ is no longer individually recoverable: its effect can be duplicated, with high probability, by combinations of these other features. But a more precise characterization of individual recoverability would be nice, as would a matching algorithm.

2.2 Dictionaries with negative entries

When the edges can be both positive and negative, it is no longer valid to assume the expectations of the $y_i$'s are equal to 1. Instead, we choose a different normalization: the variances of the $y_i$'s are 1. We still assume the magnitudes of edge weights are at most $\Lambda$, and that features don't overlap much, as described in Assumptions 1 and 2. We also need one more assumption to bound the variance contributed by the small entries.

Assumption G1: The degree of $G_\sigma$ on the side of $x$ is always larger than $2d$.

Assumption G2': In $G_\tau$, the pairwise intersection of the neighborhoods of any two features $j, k$ is less than $\kappa$, where $\tau = O_\theta(1/\log n)$ and $\kappa = O_\theta(d/\log^2 n)$.

Assumption G3: (small entries of $A$ don't cause large effects) $\rho\|A^{(i)}_{\leq\tau}\|_2^2 \leq \gamma$, where $A^{(i)}_{\leq\delta}$ denotes the vector that contains only the entries of $A^{(i)}$ that are at most $\delta$ in magnitude, and $\gamma = \sigma^4/2\Delta\Lambda^2\log n$.

Note that Assumption G1 differs from Assumption 1 by a constant factor of 2 just to simplify some notation later. Assumption G2' is the same as before. Assumption G3 intuitively says that for each $y_i = \sum_k A^{(i)}_k x_k$, the smaller $A^{(i)}_k$'s should not contribute too much to the variance of $y_i$. This is automatically satisfied for nonnegative dictionaries because there can be no cancellations. Notice that this assumption is about rows of the matrix $A$ (corresponding to pixels), whereas the earlier assumptions are about columns of $A$ (corresponding to features). Also, take $\tau$ to be the smaller of the values required by Assumptions G2' and G3.

In terms of parameters, we still need $d \geq \Delta\Lambda\log^2 n/\sigma^2$ and $\kappa = O(\sigma^8 d/\log^2 n\,\Delta^2\Lambda^6) = O(d/\log^2 n)$. As before, $\Delta$ is a large enough constant.

Theorem 2 Under Assumptions G1, G2' and G3, when $\rho = o(\sigma^5/\Lambda^{6.5}\log^{2.5} n) = o_\theta(1/\log^{2.5} n)$, there is an algorithm that runs in $n^{O(\Delta\Lambda\log^2 n/\sigma^2)}$ time, uses $n^{4C+5}m$ samples and outputs a matrix that is $n^{-C}$-equivalent to the true dictionary $A$, where $C$ is a constant depending on $\Delta$.

The algorithm and the proof of Theorem 2 are sketched in Section 4.
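Both theorems state their guarantees in terms of Definition 1 ($\epsilon$-equivalence). The following is a minimal sketch of how that notion can be checked empirically for two candidate dictionaries; the function name, the finite number of trials, and the failure-fraction tolerance are illustrative assumptions introduced here, not anything specified in the paper.

```python
import numpy as np

def empirically_eps_equivalent(A, A_hat, rho, eps, trials=2000, fail_prob=0.01, seed=0):
    """Sample rho-Bernoulli vectors x and test whether A x and A_hat x are
    entry-wise eps-close on all but a fail_prob fraction of the trials
    (a finite-sample stand-in for the 'with high probability' in Definition 1)."""
    rng = np.random.default_rng(seed)
    m = A.shape[1]
    failures = 0
    for _ in range(trials):
        x = (rng.random(m) < rho).astype(float)
        if np.max(np.abs(A @ x - A_hat @ x)) > eps:
            failures += 1
    return failures <= fail_prob * trials
```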
3 Nonnegative Dictionary Learning

Recall that dictionary learning seems hard because both $A$ and $x$ are unknown. To get around this problem, previous works (e.g. [AGM13]) try to extract information about the assignment $x$ without first learning $A$ (but assuming nice properties of $A$). After finding $x$, recovering $A$ becomes easy. In [AGM13] the unknown $x$'s were recovered via an overlapping clustering procedure. The procedure relies on incoherence of $A$: when $A$ is incoherent it is possible to test whether the supports of $x^1, x^2$ intersect. This idea fails when $x$ is only slightly sparse, because in this setting the supports of $x^1, x^2$ always have a large intersection. Our algorithm here relies on correlations among pixels.

The key observation is: if the $j$-th bit of $x$ is 1, then $Ax = A_j + \sum_{k \neq j} A_k x_k$. Pixels with high values in $A_j$ tend to be elevated above their mean values (recall $A$ is nonnegative). At first it is unclear how this simultaneous elevation can be spotted, since $A_j$ is unknown and these elevations/correlations among pixels are much smaller than the standard deviation of individual pixels. Therefore we look for local regions (small subsets of pixels) in $A_j$ where this effect is significant in the aggregate (i.e., in the sum of pixel values), and can be used to consistently predict the value of $x_j$. These are called the signature sets (see Definition 2). If we can identify signature sets, they give us a good estimate of whether the feature $x_j$ is present. Since the signature sets are small, in quasipolynomial time we can afford to enumerate all sets of that size, and check whether the pixels in these sets are likely to be elevated together.

However, this does not solve the problem, because there can be many sets, called correlated sets below, that show similar correlations and look similar to signature sets. It is hard to separate signature sets from other correlated sets when the size of the set is small. This leads to the next idea: try to expand a signature set by first estimating the corresponding column of $A$, and then picking the large entries in that column. The resulting sets are called expanded signature sets; these have size $d$ (and hence could not have been found by exhaustive guessing alone). If the set being expanded is indeed a signature set, this expansion process correctly estimates the column of $A$. We give algorithms that find expanded signature sets, and using these sets we get a rough estimate of the matrix $A$. Finally, we also give a procedure that leverages the individually recoverable property of the features, and refines the solution to be inverse-polynomially equivalent to the true dictionary.

The high-level algorithm is described in Algorithm 2 (the concepts such as correlated sets and empirical bias are defined later). To simplify the proofs, the algorithm description uses Assumption 2'; we summarize later (in Section 3.4) what changes with Assumption 2. The main algorithm has three main steps. Section 3.1 explains how to test for correlated sets and expand a set (steps 1-2 in Algorithm 2); Section 3.2 shows how to find expanded signature sets and a rough estimate of $A$ (steps 3-6 in Algorithm 2); finally Section 3.3 shows how to refine the solution and get an $\hat{A}$ that is inverse-polynomially equivalent to $A$ (steps 7-10 in Algorithm 2).

3.1 Correlated Sets, Signature Sets and Expanded Sets

We consider a set $T$ of size $t = \Omega(\mathrm{poly}\log n)$ (to be specified later), and denote by $\beta_T$ the random variable representing the sum of all pixels in $T$, i.e., $\beta_T = \sum_{i \in T} y_i$. We can expand $\beta_T$ as
$$\beta_T = \sum_{i \in T} y_i = \sum_{i \in T}\sum_{j=1}^m A^{(i)}_j x_j = \sum_{j=1}^m\Big(\sum_{i \in T} A^{(i)}_j\Big)x_j.$$
Let $\beta_{j,T} = \sum_{i \in T} A^{(i)}_j$ be the contribution of $x_j$ to the sum $\beta_T$; then $\beta_T$ is just
$$\beta_T = \sum_{j=1}^m \beta_{j,T}\, x_j. \qquad (1)$$
Note that by the normalization of $E[y_i]$, we have $E[\beta_T] = \sum_{i \in T} E[y_i] = t$. Intuitively, if for all $j$ the $\beta_{j,T}$'s are relatively small, $\beta_T$ should concentrate around its mean.
On the other hand, if there is some $j$ whose coefficient $\beta_{j,T}$ is significantly larger than the other $\beta_{k,T}$'s, then $\beta_T$ will be elevated by $\beta_{j,T}$ precisely when $x_j = 1$. That is, with probability roughly $\rho$ (corresponding to when $x_j = 1$), we should observe $\beta_T$ to be roughly $\beta_{j,T}$ larger than its expectation. We now make this precise by defining such sets $T$, with only one large coefficient $\beta_{k,T}$, as signature sets.

Definition 2 (Signature Set) A set $T$ of size $t$ is a signature set for $x_j$ if $\beta_{j,T} \geq \sigma t$, and for all $k \neq j$ the contribution $\beta_{k,T} \leq \sigma^2 t/(\Delta\log n)$. Here $\Delta$ is a large enough constant.

The following lemma formalizes the earlier intuition that if $T$ is a signature set for $x_j$, then a large $\beta_T$ is highly correlated with the event $x_j = 1$.

Lemma 3 Suppose $T$ of size $t$ is a signature set for $x_j$ with $t = \omega(\sqrt{\log n})$. Let $E_1$ be the event that $x_j = 1$ and $E_2$ be the event that $\beta_T \geq E[\beta_T] + 0.9\sigma t$. Then for a large constant $C$ (depending on the $\Delta$ in Definition 2):
1. $\Pr[E_1] + n^{-2C} \geq \Pr[E_2] \geq \Pr[E_1] - n^{-2C}$.
2. $\Pr[E_2 \mid E_1] \geq 1 - n^{-2C}$, and $\Pr[E_2 \mid E_1^c] \leq n^{-2C}$.
3. $\Pr[E_1 \mid E_2] \geq 1 - n^{-C}$.

Proof: We can write $\beta_T$ as
$$\beta_T = \beta_{j,T}\, x_j + \sum_{k \neq j}\beta_{k,T}\, x_k. \qquad (2)$$
The idea is that since $\beta_{k,T} < \sigma^2 t/(\Delta\log n)$ for all $k \neq j$, the summation on the right-hand side above is highly concentrated around its mean, which is very close to $E[\beta_T] = t$. Therefore, since $\beta_{j,T} > \sigma t$, we know $\beta_T > t + 0.9\sigma t$ essentially iff $x_j = 1$.

Formally, observe that $E[\beta_{j,T} x_j] = \rho\beta_{j,T} \leq \rho\Lambda t = o(\sigma t)$, and recalling that $E[\beta_T] = t$, we have $E[\sum_{k \neq j}\beta_{k,T} x_k] = (1 - o(\sigma))t$. Let $M = \sigma^2 t/(\Delta\log n)$ be the upper bound on the $\beta_{k,T}$'s; then the variance of the sum $\sum_{k \neq j}\beta_{k,T} x_k$ is bounded by $\rho M\sum_{k \neq j}\beta_{k,T} \leq Mt$. Then by Bernstein's inequality (see Theorem 23, but note that $\sigma$ there is the standard deviation), we have
$$\Pr\Big[\Big|\sum_{k \neq j}\beta_{k,T} x_k - E\Big[\sum_{k \neq j}\beta_{k,T} x_k\Big]\Big| > \sigma t/20\Big] \leq 2\exp\Big(-\frac{\sigma^2 t^2/400}{2Mt + \frac{2}{3}M\cdot\sigma t/20}\Big) \leq n^{-2C},$$
where $C$ is a large constant depending on $\Delta$.

Part (2) immediately follows: if $x_j = 1$, then $\beta_T < t + 0.9\sigma t$ iff the sum deviates from its expectation by more than $\sigma t/20$, which happens with probability $< n^{-2C}$. Similarly, if $x_j = 0$, $E_2$ occurs with probability $< n^{-2C}$. This then implies part (1), since the probability of $E_1$ is precisely $\rho$. Combining (1) and (2), and using Bayes' rule $\Pr[E_1 \mid E_2] = \Pr[E_2 \mid E_1]\Pr[E_1]/\Pr[E_2]$, we obtain (3). ✷

Thus, if we can find a signature set for $x_j$, we roughly know the samples in which $x_j = 1$. The following lemma shows that, assuming low pairwise intersections among features, there exists a signature set for every feature $x_j$.

Lemma 4 Suppose $A$ satisfies Assumptions 1 and 2, and let $t = \Omega(\Lambda\Delta\log^2 n/\sigma^2)$. Then for any $j \in [m]$, there exists a signature set of size $t$ for node $x_j$.

Proof: We show existence by the probabilistic method. By Assumption 1, node $x_j$ has at least $d$ neighbors in $G_\sigma$. Let $T$ be a uniformly random set of $t$ neighbors of $x_j$ in $G_\sigma$. By the definition of $G_\sigma$ we have $\beta_{j,T} \geq \sigma t$. Using a bound on the intersection size (Assumption 2') followed by a Chernoff bound, we show that $T$ is a signature set with good probability. For $k \neq j$, let $f_{k,T}$ be the number of edges from $x_k$ to $T$ in the graph $G_\tau$.
Then we can upper bound $\beta_{k,T}$ by $t\tau + f_{k,T}\Lambda$, since all edge weights are at most $\Lambda$ and there are at most $f_{k,T}$ edges with weights larger than $\tau$. Using a simple Chernoff bound and a union bound, we know that with probability at least $1 - 1/n$, for all $k \neq j$, $f_{k,T} \leq 4\log n$. Therefore $\beta_{k,T} \leq t\tau + f_{k,T}\Lambda \leq \sigma^2 t/(\Delta\log n)$ for $t \geq \Omega(\Lambda\Delta\log^2 n/\sigma^2)$ and $\tau = O(\sigma^2/\Delta\log n)$. ✷

Although signature sets exist for all $x_j$, it is difficult to find them; even if we enumerate all subsets of size $t$, it is not clear how to know when we have found a signature set. Thus we first look for "correlated" sets, which are defined as follows:

Definition 3 (Correlated Set) A set $T$ of size $t$ is called correlated if, with probability at least $\rho - 1/n^2$ over the choice of $x$, $\beta_T \geq E[\beta_T] + 0.9\sigma t = t + 0.9\sigma t$.

It follows easily (from Lemma 3) that signature sets must be correlated sets.

Corollary 5 If $T$ of size $t$ is a signature set for $x_j$, and $t = \omega(\sqrt{\log n})$, then $T$ is a correlated set.

Although signature sets are all correlated sets, the converse is far from true: there can be many correlated sets that are not signature sets. A simple counterexample is when there are $j$ and $j'$ such that both $\beta_{j,T}$ and $\beta_{j',T}$ are larger than $\sigma t$. This kind of counterexample seems inevitable for any test on a set $T$ of polylogarithmic size. To resolve this issue, the idea is to expand any set of size $t$ into a much larger set $\tilde{T}$, called the expanded set for $T$. If $T$ happened to be a signature set to start with, such an expansion gives a good estimate of the corresponding column of $A$, and, more importantly, $\tilde{T}$ has a similar 'signature' property as $T$, which we can now verify because $\tilde{T}$ is large.

Algorithm 1 and Definition 4 show how to expand $T$ to $\tilde{T}$. The empirical expectation $\hat{E}[f(y)]$ is defined to be $\frac{1}{N}\sum_{i=1}^N f(y^i)$.

Algorithm 1. $\tilde{T}$ = expand($T$, threshold)
Input: $T$ of size $t$, $d$, and $N$ samples $y^1, \ldots, y^N$.
Output: vector $\tilde{A}_T \in \mathbb{R}^n$ and expanded set $\tilde{T}$ of size $d$ (when $T$ is a signature set, $\tilde{A}_T$ is an estimate of $A_j$).
1: Recovery step: Let $L$ be the set of samples whose $\beta_T$ values are larger than $\hat{E}[\beta_T]$ + threshold:
$$L = \{\, y^k : \beta_T^k \geq \hat{E}[\beta_T] + \text{threshold} \,\},$$
where $\beta_T^k$ denotes the value of $\beta_T$ on sample $y^k$.
2: Estimation step: Compute the empirical mean of the samples in $L$, and obtain $\tilde{A}_T$ by shifting and scaling:
$$\hat{E}_L[y] = \frac{1}{|L|}\sum_{y^k \in L} y^k, \qquad \tilde{A}_T(i) = \max\{0,\ (\hat{E}_L[y_i] - \hat{E}[y_i])/(1 - \rho)\}.$$
3: Expansion step: $\tilde{T} = \{d$ largest coordinates of $\tilde{A}_T\}$.

Definition 4 For any set $T$ of size $t$, the expanded set $\tilde{T}$ for $T$ is defined as the one output by Algorithm 1. The estimate $\tilde{A}_T$ is the output of step 2.
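The following is a minimal numpy sketch of Algorithm 1 (expand). It follows the three steps above, but the handling of an empty recovery set and the tie-breaking in the expansion step are simplifications introduced here, and the function is illustrative rather than the paper's exact implementation.

```python
import numpy as np

def expand(T, Y, d, threshold, rho):
    """Sketch of Algorithm 1.  Y is an (n, N) array of samples, T is a list of
    pixel indices, and threshold plays the role of 0.9*sigma*t (or 0.6*sigma*d
    when expanding an already-expanded set)."""
    n, N = Y.shape
    beta = Y[T, :].sum(axis=0)               # beta_T for every sample
    # Recovery step: samples whose beta_T is elevated above the empirical mean.
    L = beta >= beta.mean() + threshold
    if not np.any(L):
        return np.zeros(n), np.array([], dtype=int)
    # Estimation step: shifted and rescaled empirical mean of the selected samples.
    A_T = np.maximum(0.0, (Y[:, L].mean(axis=1) - Y.mean(axis=1)) / (1.0 - rho))
    # Expansion step: the d largest coordinates of the estimated column.
    T_expanded = np.argsort(A_T)[-d:]
    return A_T, T_expanded
```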
When $T$ is a signature set for $x_j$, $\tilde{A}_T$ is already close to the true $A_j$, and the expanded set $\tilde{T}$ is close to the largest entries of $A_j$.

Lemma 6 If $T$ is a signature set for $x_j$ and the number of samples is $N = \Omega(n^{2+\delta}/\rho^3)$, where $\delta$ is any positive constant, then with high probability $\|\tilde{A}_T - A_j\|_\infty \leq 1/n$. Furthermore, $\beta_{j,\tilde{T}}$ is $d/n$-close to the sum of the $d$ largest entries of $A_j$, and for all $i \in \tilde{T}$, $A^{(i)}_j \geq 0.9\sigma$.

Proof: Let us first consider $E[\tilde{A}_T] \triangleq (E[y \mid E_2] - \mathbf{1})/(1 - \rho)$, where $E_2$ is the event that $\beta_T \geq t + 0.9\sigma t$ defined in Lemma 3. Recall that because of the normalization, for any $i$ we have $\sum_{j \in [m]} A^{(i)}_j = 1/\rho$, so in particular $y_i \leq 1/\rho$. By Lemma 3 and some calculations (see Lemma 21), we have $\|E[y \mid E_2] - E[y \mid E_1]\|_\infty \leq n^{-C}/\rho$. Note that $E[y \mid E_1] = \mathbf{1} + (1 - \rho)A_j$. Therefore $\|E[\tilde{A}_T] - A_j\|_\infty \leq n^{-C}/\rho$. Now by concentration inequalities, when $N = \Omega(n^{2+\delta}/\rho^3)$ (notice that the variance of each coordinate is bounded by $\Lambda$), $\|\tilde{A}_T - E[\tilde{A}_T]\|_\infty \leq 1/n$ with very high probability ($\exp(-\Omega(n^\delta))$). This probability is high enough that we can apply a union bound over all signature sets. ✷

3.2 Identifying Expanded Signature Sets

We will now see the advantage that the expanded sets $\tilde{T}$ provide. If $T$ happens to be a signature set, the expanded set $\tilde{T}$ for $T$ also has a similar property. But now $\tilde{T}$ is a much larger set (size $d$ as opposed to $t = \mathrm{poly}\log$), and we know (by Assumption 2') that different features have limited intersection, so if we see a large elevation it is likely to be caused by a single feature! We will leverage this in order to identify expanded signature sets among all the expanded sets. If an expanded set $\tilde{T}$ also has essentially a unique large coefficient $\beta_{j,\tilde{T}}$, we call it an expanded signature set.

Definition 5 (Expanded Signature Set) An expanded set $\tilde{T}$ is an expanded signature set for $x_j$ if $\beta_{j,\tilde{T}} \geq 0.7\sigma d$ and for all $k \neq j$, $\beta_{k,\tilde{T}} \leq 0.3\sigma d$.

Note that an expanded signature set always has size $d$, and the gap between the largest $\beta_{j,\tilde{T}}$ and the second largest is only a constant factor, as opposed to logarithmic in the definition of signature sets. As its name suggests, the expanded set $\tilde{T}$ of a signature set $T$ for $x_j$ is an expanded signature set for $x_j$ as well. On the one hand, Lemma 6 guarantees that $\tilde{T}$ connects to $x_j$ with large weights; on the other hand, since the pairwise intersection of the neighborhoods of $x_j$ and $x_k$ in $G_\tau$ is small, $\tilde{T}$ cannot also connect to any other $x_k$ with too many large weights.

Lemma 7 If $T$ is a signature set for $x_j$, then the expanded set $\tilde{T}$ for $T$ is always an expanded signature set for $x_j$. In fact, the coefficient $\beta_{j,\tilde{T}}$ is at least $0.9\sigma d$.

Proof: Since there are at least $d$ weights $A^{(i)}_j$ bigger than $\sigma$ for any column $A_j$, by Lemma 6 we know $\beta_{j,\tilde{T}} \geq \sigma d - o(1)d \geq 0.9\sigma d$. Furthermore, Lemma 6 says $x_j$ connects to every node in $\tilde{T}$ with weight larger than $0.9\sigma$ (since by Assumption 1 there are more than $d$ edges of weight at least $\sigma$ from node $j$). By Assumption 2' on the graph, for any other $k \neq j$, the number of $y_i$'s that are connected to both $k$ and $j$ in $G_\tau$ is bounded by $\kappa$. In particular, the number of edges from $k$ to $\tilde{T}$ with weights more than $\tau$ is bounded by $\kappa$. Therefore the coefficient $\beta_{k,\tilde{T}} = \sum_{(i,k) \in G_\tau} A^{(i)}_k + \sum_{(i,k) \notin G_\tau} A^{(i)}_k$ is bounded by $\Lambda\kappa + |\tilde{T}|\tau = o(d) \leq 0.3\sigma d$. (Recall $\tau = o(1)$ and $\kappa = o(d)$.) ✷

The following notion of empirical bias is a more precise way (compared to correlated sets) of measuring the simultaneous elevation effect.

Definition 6 (Empirical Bias) The empirical bias $\hat{B}_{\tilde{T}}$ of an expanded set $\tilde{T}$ of size $d$ is defined to be the largest $B$ that satisfies
$$\big|\{\, k \in [N] : \beta_{\tilde{T}}^k \geq \hat{E}[\beta_{\tilde{T}}] + B \,\}\big| \geq \rho N/2.$$
In other words, $\hat{B}_{\tilde{T}}$ is the difference between the $\rho N/2$-th largest $\beta_{\tilde{T}}^k$ among the samples and $\hat{E}[\beta_{\tilde{T}}]$.
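To accompany Definition 6, here is a small sketch of computing the empirical bias of an expanded set from the samples; the function name and the way the order statistic is extracted are illustrative choices made here.

```python
import numpy as np

def empirical_bias(T_expanded, Y, rho):
    """Empirical bias of an expanded set (Definition 6): the gap between the
    (rho*N/2)-th largest value of beta_T over the samples and its empirical mean."""
    beta = Y[T_expanded, :].sum(axis=0)          # beta_T for each of the N samples
    N = beta.shape[0]
    k = max(1, int(np.floor(rho * N / 2)))       # rank of the order statistic
    kth_largest = np.sort(beta)[-k]
    return kth_largest - beta.mean()
```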
The key lemma in this part shows that the expanded set with the largest empirical bias must be an expanded signature set:

Lemma 8 Let $\tilde{T}^*$ be the set with the largest empirical bias $\hat{B}_{\tilde{T}^*}$ among all the expanded sets $\tilde{T}$. Then $\tilde{T}^*$ is an expanded signature set for some $x_j$.

We build this lemma in several steps. First of all, we show that the bias of $\tilde{T}$ is almost equal to the largest $\beta_{j,\tilde{T}}$: if $\beta_{\tilde{T}}$ contains a large term $\beta_{j,\tilde{T}} x_j$, then certainly this term will contribute to the bias $\hat{B}_{\tilde{T}}$; on the other hand, suppose in some extreme case $\beta_{\tilde{T}}$ only has two nonzero terms $\beta_{j,\tilde{T}} x_j + \beta_{k,\tilde{T}} x_k$. Then they cannot contribute more than $\max\{\beta_{j,\tilde{T}}, \beta_{k,\tilde{T}}\}$ to the bias, because otherwise both $x_k$ and $x_j$ would have to be 1 to make the sum larger than $\max\{\beta_{j,\tilde{T}}, \beta_{k,\tilde{T}}\}$, and this only happens with small probability $\rho^2 \ll \rho$. The intuitive argument above is not far from the truth: basically we can show that (a) there are indeed very few large coefficients $\beta_{k,\tilde{T}}$ (see Claim 9 for the precise statement), and (b) the sum of the small $\beta_{k,\tilde{T}} x_k$ concentrates around its mean, and thus won't contribute much to the bias. After relating the bias of $\tilde{T}$ to the largest coefficient $\max_j \beta_{j,\tilde{T}}$, we further argue that for the set $\tilde{T}^*$ with the largest bias among all the $\tilde{T}$, we not only see a large coefficient $\beta_{j,\tilde{T}}$, but also observe a gap between the top $\beta_{j,\tilde{T}}$ and all the other $\beta_{k,\tilde{T}}$'s, and hence $\tilde{T}^*$ is an expanded signature set for $x_j$.

We make the arguments above precise with the following claims. First, we show there cannot be too many large coefficients $\beta_{j,D}$ for any set $D$ of size $d$ (although we only apply the claim to expanded sets).

Claim 9 For any set $\tilde{T}$ of size $d$, the number of $k$'s such that $\beta_{k,\tilde{T}}$ is larger than $d\sigma^4/\Delta\Lambda^2\log n$ is at most $O(\Delta\Lambda^3\log n/\sigma^4)$.

Proof: For ease of exposition, define $K_{large} = \{k : \beta_{k,\tilde{T}} \geq d\sigma^4/\Delta\Lambda^2\log n\}$. Hence the goal is to prove that $|K_{large}| \leq O(\Delta\Lambda^3\log n/\sigma^4)$. Recall that $\beta_{k,\tilde{T}} = \sum_{i \in \tilde{T}} A^{(i)}_k$. Let $Q_k = \{i \in \tilde{T} : A^{(i)}_k \geq \tau\}$ be the subset of nodes in $\tilde{T}$ that connect to $k$ with weights larger than $\tau$. We have $\beta_{k,\tilde{T}} = \sum_{i \notin Q_k} A^{(i)}_k + \sum_{i \in Q_k} A^{(i)}_k$. The first sum is upper bounded by $d\tau \leq d\sigma^4/2\Delta\Lambda^2\log n$. Therefore, for $k \in K_{large}$, the second sum is lower bounded by $d\sigma^4/2\Delta\Lambda^2\log n$. Since $A^{(i)}_k \leq \Lambda$, we have $|Q_k| \geq \sigma^4 d/2\Delta\Lambda^3\log n$.

On the other hand, by Assumption 2' we know that in the graph $G_\tau$, any two features cannot share too many pixels: for any $k$ and $k'$, $|Q_k \cap Q_{k'}| \leq \kappa$. Also note that by definition $Q_k \subset \tilde{T}$, which implies that $|\cup_{k \in K_{large}} Q_k| \leq |\tilde{T}| = d$. By inclusion-exclusion we have
$$d \geq \Big|\bigcup_{k \in K_{large}} Q_k\Big| \geq \sum_{k \in K_{large}}|Q_k| - \sum_{k,k' \in K_{large}}|Q_k \cap Q_{k'}| \geq |K_{large}|\cdot\sigma^4 d/2\Delta\Lambda^3\log n - \frac{|K_{large}|^2}{2}\kappa. \qquad (3)$$
This implies that $|K_{large}| \leq O(\Delta\Lambda^3\log n/\sigma^4)$ when $\kappa = O(\sigma^8 d/\Delta^2\Lambda^6\log^2 n)$. (Note that any subset of $K_{large}$ also satisfies equation (3), thus we do not have to worry about the other range of solutions of (3).) ✷

For simplicity, let $k^* = \arg\max_k \beta_{k,\tilde{T}}$, so $\beta_{k^*,\tilde{T}}$ is the largest coefficient in $\beta_{\tilde{T}}$. Recall that the definition of expanded signature sets roughly translates to a constant-factor gap between $\beta_{k^*,\tilde{T}}$ and any other coefficient $\beta_{k,\tilde{T}}$. The next claim shows that the empirical bias $\hat{B}_{\tilde{T}}$ is a good estimate of $\beta_{k^*,\tilde{T}}$ when $\beta_{k^*,\tilde{T}}$ is large.
Claim 10 For any expanded set $\tilde{T}$ of size $d$, with high probability over the choice of all $N$ samples, the empirical bias $\hat{B}_{\tilde{T}}$ is within $0.1 d\sigma^2/\Lambda$ of $\beta_{k^*,\tilde{T}} = \max_k \beta_{k,\tilde{T}}$, whenever $\beta_{k^*,\tilde{T}}$ is at least $0.5 d\sigma$.

Proof: Let $K'_{large} = K_{large}\setminus\{k^*\}$ (with $K_{large}$ as defined in the proof of Claim 9), and let $\beta_{small,\tilde{T}} = \sum_{k \notin K_{large}}\beta_{k,\tilde{T}} x_k$ and $\beta_{large,\tilde{T}} = \sum_{k \in K'_{large}}\beta_{k,\tilde{T}} x_k$.

First of all, the variance of $\beta_{small,\tilde{T}}$ is bounded by $\rho\sum_{k \notin K_{large}}\beta_{k,\tilde{T}}^2 \leq d\sigma^4/\Delta\Lambda^2\log n \cdot \rho\sum_{k \notin K_{large}}\beta_{k,\tilde{T}} \leq d^2\sigma^4/\Delta\Lambda^2\log n$. By Bernstein's inequality, for sufficiently large $\Delta$, with probability at most $1/n^2$ over the choice of $x$ the value $|\beta_{small,\tilde{T}} - E[\beta_{small,\tilde{T}}]|$ is larger than $0.05 d\sigma^2/\Lambda$; that is, $\beta_{small,\tilde{T}}$ concentrates nicely around its mean. Secondly, with probability $\rho$ we have $x_{k^*} = 1$, and then $\beta_{k^*,\tilde{T}} x_{k^*}$ is elevated above its mean by roughly $\beta_{k^*,\tilde{T}}$. Thirdly, the mean of $\beta_{large,\tilde{T}}$ is at most $\rho\sum_{k \in K'_{large}}\beta_{k,\tilde{T}} \leq \rho|K_{large}|\Lambda d$, which is $o(\sigma d)$ by Claim 9. These three points together imply that with probability at least $\rho - n^{-2}$, $\beta_{\tilde{T}}$ is above its mean by $\beta_{k^*,\tilde{T}} - 0.1\sigma^2 d/\Lambda$. Also note that the empirical mean $\hat{E}[\beta_{\tilde{T}}]$ is sufficiently close to $E[\beta_{\tilde{T}}]$ with probability $1 - \exp(-\Omega(n))$ over the choice of the $N$ samples, when $N = \mathrm{poly}(n)$. Therefore, with probability $1 - \exp(-\Omega(n))$ over the choice of the $N$ samples, $\hat{B}_{\tilde{T}} > \beta_{k^*,\tilde{T}} - 0.1\sigma^2 d/\Lambda$.

It remains to prove the other side of the inequality, that is, $\hat{B}_{\tilde{T}} \leq \beta_{k^*,\tilde{T}} + 0.1\sigma^2 d/\Lambda$. Note that $|K_{large}| = O(\log n)$; thus with probability at least $1 - 2\rho^2|K_{large}|^2$, at most one of the $x_k$, $k \in K_{large}$, is equal to 1. Then with probability at least $1 - 2\rho^2|K_{large}|^2$ over the choice of $x$, $\beta_{large,\tilde{T}} + \beta_{k^*,\tilde{T}} x_{k^*}$ is elevated above its mean by at most $\beta_{k^*,\tilde{T}}$. Also, with probability $1 - n^{-2}$ over the choice of $x$, $\beta_{small,\tilde{T}}$ is above its mean by at most $0.1\sigma^2 d/\Lambda$. Therefore, with probability at least $1 - 3\rho^2|K_{large}|^2$ over the choice of $x$, $\beta_{\tilde{T}}$ is above its mean by at most $\beta_{k^*,\tilde{T}} + 0.1\sigma^2 d/\Lambda$. Hence, when $3\rho^2|K_{large}|^2 \leq \rho/3$, with probability at least $1 - \exp(-\Omega(n))$ over the choice of the $N$ samples, $\hat{B}_{\tilde{T}} \leq \beta_{k^*,\tilde{T}} + 0.1\sigma^2 d/\Lambda$. The condition is satisfied when $\rho \leq c/\log^2 n$ for a small enough constant $c$. ✷

Now we are ready to prove Lemma 8.

Proof (of Lemma 8): By Claim 10 and the existence of good expanded signature sets (Lemma 7), we know the maximum bias is at least $0.8\sigma d$. Applying Claim 10 again, we know that for the set $\tilde{T}^*$ with the largest bias, there must be a feature $j$ with $\beta_{j,\tilde{T}^*} \geq 0.7\sigma d$. For the sake of contradiction, assume now that the set $\tilde{T}^*$ with the largest bias is not an expanded signature set. Then there must be some $k \neq j$ with $\beta_{k,\tilde{T}^*} \geq 0.3\sigma d$. Let $Q_j$ and $Q_k$ be the sets of nodes in $\tilde{T}^*$ that are connected to $j$ and $k$ in $G_\tau$ (these are the same $Q$'s as in the proof of Claim 9). We know $|Q_j \cap Q_k| \leq \kappa$ by assumption, and $|Q_k| \geq 0.3\sigma d/\Lambda$. This means $|Q_j| \leq d - 0.3\sigma d/\Lambda + \kappa$ by inclusion-exclusion.
Now let $T'$ be a signature set for $x_j$, and let $\tilde{T}'$ be its expanded set. From Lemma 6 we know $\beta_{j,\tilde{T}'}$ is almost equal to the sum of the $d$ largest entries of $A_j$, which is at least $0.2\sigma^2 d/\Lambda$ larger than $\beta_{j,\tilde{T}^*}$, since $|Q_j| \leq d - 0.2\sigma d/\Lambda$. By Claim 10 we know $\hat{B}_{\tilde{T}'} \geq \beta_{j,\tilde{T}'} - 0.1\sigma^2 d/\Lambda > \beta_{j,\tilde{T}^*} + 0.1\sigma^2 d/\Lambda \geq \hat{B}_{\tilde{T}^*}$, which contradicts the assumption that $\tilde{T}^*$ is the set with the largest bias. ✷

Now that we have found expanded signature sets, we can apply Algorithm 1 to them (but with threshold $0.6\sigma d$ instead of $0.9\sigma t$) to get an estimate.

Lemma 11 If $\tilde{T}$ is an expanded signature set for $x_j$, and $\tilde{A}_{\tilde{T}}$ is the corresponding column output by Algorithm 1, then with high probability
$$\|\tilde{A}_{\tilde{T}} - A_j\|_\infty \leq O\big(\rho(\Lambda^3\log n/\sigma^2)^2\sqrt{\Lambda\log n}\big) = o(\sigma).$$

Proof: Define $E_1$ to be the event that $x_j = 1$, and $E_2$ to be the event that $\beta_{\tilde{T}} \geq 0.6 d\sigma$. When $E_1$ happens, event $E_2$ always happens unless $\beta_{small,\tilde{T}}$ is far from its expectation. In the proof of Claim 10 we have already shown that the number of such samples is at most $n$ with very high probability. Suppose $E_2$ happens and $E_1$ does not happen. Then either $\beta_{small,\tilde{T}}$ is far from its expectation, or at least two $x_k$'s with large coefficients $\beta_{k,\tilde{T}}$ are on. Recall that by Claim 9 the number of $x_k$'s with large coefficients is $|K_{large}| \leq O(\Lambda^3\log n/\sigma^2)$, so the probability that at least two large coefficients are "on" (with $x_k = 1$) is bounded by
$$O(\rho^2|K_{large}|^2) = \rho\cdot O(\rho\Lambda^6\log^2 n/\sigma^4) = \rho\cdot o(\sigma/\sqrt{\Lambda\log n}).$$
With very high probability the number of such samples is bounded by $\rho N\cdot o(\sigma/\sqrt{\Lambda\log n})$. Combining the two parts, we know the number of samples in $E_1 \oplus E_2$ (the symmetric difference of $E_1$ and $E_2$) is bounded by $\rho N\cdot o(\sigma/\sqrt{\Lambda\log n})$. Also, with high probability ($1 - n^{-C}$) all the samples have entries bounded by $O(\sqrt{\Lambda\log n})$, by Bernstein's inequality (the variance of $y_i$ is bounded by $\sum_j\rho(A^{(i)}_j)^2 \leq \max_j A^{(i)}_j\cdot\sum_j\rho A^{(i)}_j \leq \Lambda$). Notice that this is a statement about the entire sample, independent of the set $T$, so we do not need to apply a union bound over all expanded signature sets. Therefore, by Lemma 21,
$$\|\tilde{A}_{\tilde{T}} - A_j\|_\infty \leq o(\sigma/\sqrt{\Lambda\log n})\cdot O(\sqrt{\Lambda\log n}) = o(\sigma). \qquad \text{✷}$$

The previous lemma looks very similar to the lemma for signature sets; however, the benefit is that we know how to find a set that is guaranteed to be an expanded signature set! So we can iteratively find all expanded signature sets. After identifying $\tilde{T}_1, \tilde{T}_2, \ldots, \tilde{T}_k$ (reordering the columns of $A$ so that they correspond to the first $k$ columns), we can estimate the corresponding columns $\tilde{A}_{\tilde{T}_1}, \ldots, \tilde{A}_{\tilde{T}_k}$. Since these are close to the true columns $A_1, A_2, \ldots, A_k$ (w.l.o.g. we reorder columns so that $\tilde{A}_{\tilde{T}_j}$ corresponds to $A_j$ for $1 \leq j \leq k$), we can in fact compute $\hat{\beta}_{j,\tilde{T}} = \sum_{i \in \tilde{T}}\tilde{A}_{\tilde{T}_j}(i)$. By Lemma 11 we know $|\hat{\beta}_{j,\tilde{T}} - \beta_{j,\tilde{T}}| = o(\sigma d)$.

Lemma 12 Having found $\tilde{T}_i$ (and hence also $\tilde{A}_{\tilde{T}_i}$) for $i \leq k$, let $\tilde{T}$ be the set with the largest empirical bias among the expanded sets that have $\hat{\beta}_{j,\tilde{T}} < 0.2\sigma d$ for all $j \leq k$. Then $\tilde{T}$ is an expanded signature set for a new $x_j$ with $j > k$.

Proof: The proof is almost identical to that of Lemma 8.
First, if $T$ is a signature set for $x_j$ with $j > k$, then by Lemma 7 its expanded set $\tilde{T}$ satisfies $\hat{\beta}_{i,\tilde{T}} < 0.2\sigma d$ for all $i \leq k$, so it will compete to be the set with the largest empirical bias. Also, since $\hat{\beta}_{i,\tilde{T}} < 0.2\sigma d$ for all $i \leq k$, we know the large coefficients $\beta_{j,\tilde{T}}$ must have $j > k$. Leveraging this observation in the proof of Lemma 8 gives the result. ✷

3.3 Getting an Equivalent Dictionary

After finding expanded signature sets, we already have an estimate $\tilde{A}_{\tilde{T}_j}$ of $A_j$ that is entry-wise $o(\sigma)$-close. However, this alone does not imply that the two dictionaries are $\epsilon$-equivalent for very small $\epsilon$. In the final step, we look at all the large entries in the column $A_j$, and use them to identify whether the feature $x_j$ is 1 or 0. The ability to do this justifies the individually recoverable property of the dictionary.

Lemma 13 Let $S_j$ be the set of all entries larger than $\sigma/2$ in $\tilde{A}_{\tilde{T}_j}$. Then $|S_j| \geq d$, $\beta_{j,S_j} \geq (0.5 - o(1))|S_j|\sigma$, and for all $k \neq j$, $\beta_{k,S_j} \leq \sigma^2|S_j|/\Delta\log n$, where $\Delta$ is a large enough constant.

Proof: This follows directly from the assumptions. By Assumption 1, there are at least $d$ entries in $A_j$ that are larger than $\sigma$; all of these entries will be at least $(1 - o(1))\sigma$ in $\tilde{A}_{\tilde{T}_j}$, so $|S_j| \geq d$. Also, since for all $i \in S_j$, $\tilde{A}_{\tilde{T}_j}(i) \geq 0.5\sigma$, we know $A_j(i) \geq 0.5\sigma - o(\sigma)$, hence $\beta_{j,S_j} \geq (0.5 - o(1))|S_j|\sigma$. By Assumption 2', for any $k \neq j$, the number of edges in $G_\tau$ between $k$ and $S_j$ is bounded by $\kappa$, so $\beta_{k,S_j} \leq \tau|S_j| + \kappa\Lambda \leq \sigma^2|S_j|/\Delta\log n$. ✷

Since $S_j$ has a unique large coefficient $\beta_{j,S_j}$, and the rest of the coefficients are much smaller, when $\Delta$ is large enough and $N \geq n^{4C+\delta}/\rho^3$, we know $\hat{A}_j$ is entry-wise $n^{-2C}/\log n$-close to $A_j$ (this uses the same argument as in Lemma 6). We shall show this is enough to prove $n^{-C}$-equivalence between $\hat{A}$ and $A$.

Lemma 14 Let $A, \hat{A}$ be dictionaries whose rows have $\ell_1$-norm $O(1/\rho)$ and such that all entries of $A - \hat{A}$ have magnitude at most $\delta$. Then $\hat{A}$ and $A$ are $O(\sqrt{\delta\log n})$-equivalent.

The proof is an easy application of Bernstein's inequality (see Appendix A.1).

Remark: Notice that when $C \geq 1$ it is clear why $\hat{A}_j$ should have $\ell_1$-norm $O(1/\rho)$ (because it is very close to $A_j$); when $C$ is smaller we need to truncate the entries of $\hat{A}_j$ that are smaller than $n^{-2C}/\log n$.

We now formally write down the steps of the algorithm.

3.4 Working with Assumption 2

In order to assume Assumption 2 instead of 2', we need to change the definition of signature sets to allow $o(1/\sqrt{\rho})$ "moderately large" ($\sigma t/10$) entries. This makes the definition look similar to that of expanded signature sets. Such signature sets still exist by a probabilistic argument similar to Lemma 4. Lemma 7 and Claims 9 and 10 can also be adapted. Finally, for Lemma 14 the guarantee will be weaker (there can be $o(1/\sqrt{\rho})$ moderately large coefficients). The algorithm will only estimate $x_j$ incorrectly if at least 6 such coefficients are "on" (have the corresponding $x_j$ equal to 1), which happens with probability less than $o(\rho^3)$. By an argument similar to Lemma 6 and Lemma 14 we get the first part of Theorem 1.

Algorithm 2. Nonnegative Dictionary Learning
Input: $N$ samples $y^1, \ldots, y^N$ generated by $y^i = Ax^i$. The unknown dictionary $A$ satisfies Assumptions 1 and 2.
Output: $\hat{A}$ that is $n^{-C}$-close to $A$.
1: Enumerate all sets of size $t = O(\Lambda\log^2 n/\sigma^4)$; keep the sets that are correlated.
2: Expand every correlated set $T$: $\tilde{T}$ = Expand($T$, $0.9\sigma t$).
3: for $j = 1$ to $m$ do
4:   Let $\tilde{T}_j$ be the set with the largest empirical bias among the expanded sets with $\hat{\beta}_{k,\tilde{T}} = \sum_{i \in \tilde{T}}\tilde{A}_{\tilde{T}_k}(i) \leq 0.2 d\sigma$ for all $k < j$.
5:   Let $\tilde{A}_{\tilde{T}_j}$ be the result of the estimation step in Expand($\tilde{T}_j$, $0.6\sigma d$).
6: end for
7: for $j = 1$ to $m$ do
8:   Let $S_j$ be the set of entries that are larger than $\sigma/2$ in $\tilde{A}_{\tilde{T}_j}$.
9:   Let $\hat{A}_j$ be the result of the estimation step in Expand($S_j$, $0.4\sigma|S_j|$).
10: end for
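For orientation, the control flow of Algorithm 2 can be sketched as follows, reusing the expand and empirical_bias sketches given earlier (so this block is not self-contained). The brute-force enumeration over size-$t$ pixel sets is written literally with itertools.combinations, which is exactly the quasipolynomial step, and the thresholds are taken from the algorithm description above. This is a schematic illustration only, not the paper's implementation.

```python
import numpy as np
from itertools import combinations

def nonneg_dictionary_learning(Y, t, d, rho, sigma, m):
    """Schematic of Algorithm 2, using expand() and empirical_bias() from the
    earlier sketches.  Enumerating all size-t pixel sets is the quasipolynomial step."""
    n, N = Y.shape
    # Steps 1-2: keep correlated sets (Definition 3, tested empirically) and expand them.
    expanded = []
    for T in combinations(range(n), t):
        beta = Y[list(T), :].sum(axis=0)
        if np.mean(beta >= beta.mean() + 0.9 * sigma * t) >= rho - 1.0 / n**2:
            expanded.append(expand(list(T), Y, d, 0.9 * sigma * t, rho))
    # Steps 3-6: repeatedly pick the expanded set with the largest empirical bias
    # among those not explained by previously recovered columns.
    columns = []
    for _ in range(m):
        candidates = [(A_T, T_e) for A_T, T_e in expanded
                      if all(A_prev[T_e].sum() <= 0.2 * d * sigma for A_prev in columns)]
        if not candidates:
            break
        A_T, T_e = max(candidates, key=lambda c: empirical_bias(c[1], Y, rho))
        A_col, _ = expand(list(T_e), Y, d, 0.6 * sigma * d, rho)
        columns.append(A_col)
    # Steps 7-10: refine each column using its large entries.
    A_hat = np.zeros((n, len(columns)))
    for j, A_col in enumerate(columns):
        S_j = np.where(A_col > sigma / 2)[0]
        if len(S_j) == 0:
            continue
        A_hat[:, j], _ = expand(list(S_j), Y, len(S_j), 0.4 * sigma * len(S_j), rho)
    return A_hat
```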
4 General Case

With minor modifications, our algorithm and its analysis can be adapted to the general case in which the matrix $A$ can have both positive and negative entries. We follow the outline from the nonnegative case, and look at sets $T$ of size $t$. The quantities $\beta_T$ and $\beta_{j,T}$ are defined exactly as in Section 3.1. Additionally, let $\nu_T$ be the standard deviation of $\beta_T$, and let $\nu_{-j,T}$ be the standard deviation of $\beta_T - \beta_{j,T}x_j$. That is,
$$\nu_{-j,T}^2 = \mathrm{Var}[\beta_T - \beta_{j,T}x_j] = \rho\sum_{k \neq j}\beta_{k,T}^2.$$
The definition of signature sets requires an additional condition to take the standard deviations into account.

Definition 7 ((General) Signature Set) A set $T$ of size $t$ is a signature set for $x_j$ if, for some large constant $\Delta$, we have: (a) $|\beta_{j,T}| \geq \sigma t$, (b) for all $k \neq j$, the contribution $|\beta_{k,T}| \leq \sigma^2 t/(\Delta\log n)$, and additionally (c) $\nu_{-j,T} \leq \sigma t/\sqrt{\Delta\log n}$.

In the nonnegative case the additional condition $\nu_{-j,T} \leq \sigma t/\sqrt{\Delta\log n}$ was automatically implied by nonnegativity and scaling. Now we use Assumption G3 to show there exist sets $T$ for which (c) is true along with the other properties. To do that, we prove a simple lemma which lets us bound the variance (the same lemma is also used in other places).

Lemma 15 Let $T$ be a set of size $t$ and $S$ be an arbitrary subset of features, and consider the sum $\beta_{S,T} = \sum_{j \in S}\beta_{j,T}x_j$. Suppose for each $j \in S$ the number of edges from $j$ to $T$ in the graph $G_\tau$ is bounded by $W$. Then the variance of $\beta_{S,T}$ is bounded by $2tW + 2t^2\gamma$.

Proof: The idea is to split the weights $A^{(i)}_j$ into big and small ones (the threshold being $\tau$). Intuitively, on the one hand, the contribution to the variance from large weights is bounded because the number of such large edges is bounded by $W$. On the other hand, by Assumption G3, the total variance of the small weights is less than $\gamma$, which implies that the contribution of the small weights to the variance is also bounded. Formally, we have
$$\begin{aligned}
\mathrm{Var}[\beta_{S,T}] &= \rho\sum_{j\in S}\beta_{j,T}^2 = \rho\sum_{j\in S}\Big(\sum_{i\in T}A^{(i)}_j\Big)^2 \\
&= \rho\sum_{j\in S}\Big(\sum_{i\in T,\,(i,j)\in G_\tau}A^{(i)}_j + \sum_{i\in T,\,(i,j)\notin G_\tau}A^{(i)}_j\Big)^2 \\
&\leq 2\rho\sum_{j\in S}\Big[\Big(\sum_{i\in T,\,(i,j)\in G_\tau}A^{(i)}_j\Big)^2 + \Big(\sum_{i\in T,\,(i,j)\notin G_\tau}A^{(i)}_j\Big)^2\Big] \\
&\leq 2\rho\sum_{j\in S}\Big[W\sum_{i\in T,\,(i,j)\in G_\tau}\big(A^{(i)}_j\big)^2 + t\sum_{i\in T,\,(i,j)\notin G_\tau}\big(A^{(i)}_j\big)^2\Big] \\
&= 2\rho W\sum_{i\in T}\sum_{j\in S}\big(A^{(i)}_j\big)^2 + 2\rho t\sum_{i\in T}\sum_{j:(i,j)\notin G_\tau}\big(A^{(i)}_j\big)^2 \\
&\leq 2tW + 2t^2\gamma.
\end{aligned}$$
In the fourth line we used the Cauchy-Schwarz inequality, and in the last step we used Assumption G3 about the total variance due to small terms being small, as well as the normalization of the variance of each pixel. ✷

Lemma 16 Suppose $A$ satisfies our assumptions for general dictionaries, and let $t = \Omega(\Lambda\Delta\log^2 n/\sigma^2)$.
Then for any $j \in [m]$, there exists a general signature set of size $t$ for node $x_j$ (as in Definition 7).

Proof: As before, we use the probabilistic method. Suppose we fix some $j$. By Assumption G1, in $G_\sigma$, node $x_j$ has either at least $d$ positive neighbors or $d$ negative ones. W.l.o.g., let us assume there are $d$ negative neighbors. Let $T$ be a uniformly random subset of size $t$ of these negative neighbors. By the definition of $G_\sigma$, we have $\beta_{j,T} \leq -\sigma t$. For $k \neq j$, let $f_{k,T}$ be the number of edges from $x_k$ to $T$ in the graph $G_\tau$. Using the same argument as in the proof of Lemma 4, we have $f_{k,T} \leq 4\log n$ w.h.p. for all such $k \neq j$. Thus $|\beta_{k,T}| \leq t\tau + f_{k,T}\Lambda \leq \sigma^2 t/(\Delta\log n)$.

Thus it remains to bound $\nu_{-j,T}$. We apply Lemma 15 with $W = 4\log n \geq f_{k,T}$ and $S = [m]\setminus\{j\}$ on the set $T$: we get $\nu_{-j,T}^2 \leq 2tW + 2t^2\gamma$. Recalling the value of $\gamma$ from Assumption G3, this gives $\nu_{-j,T} \leq \sigma t/\sqrt{\Delta\log n}$. ✷

The proof of Lemma 3 now goes through in the general case (here we use the variance bound (c) in the general definition of signature sets), except that we need to redefine the event $E_2$ to handle the negative case. For completeness, we state the general version of Lemma 3 in Appendix A.2. As before, signature sets give a good indication of whether $x_j = 1$. Let us now define correlated sets; here we need to consider both positive and negative bias.

Definition 8 ((General) Correlated Set) A set $T$ of size $t$ is correlated if either, with probability at least $\rho - 1/n^2$ over the choice of $x$, $\beta_T \geq E[\beta_T] + 0.8\sigma t$, or, with probability at least $\rho - 1/n^2$, $\beta_T \leq E[\beta_T] - 0.8\sigma t$.

Starting with a correlated set (a potential signature set), we expand it as in Definition 4, except that we find $\tilde{T}$ as follows:
$$\tilde{T}_{temp} = \{2d \text{ coordinates of largest magnitude in } \tilde{A}_T\}, \qquad \tilde{T}_1 = \{i \in \tilde{T}_{temp} : \tilde{A}_T(i) \geq 0\},$$
$$\tilde{T} = \begin{cases}\tilde{T}_1 & \text{if } |\tilde{T}_1| \geq d,\\ \tilde{T}_{temp}\setminus\tilde{T}_1 & \text{otherwise.}\end{cases}$$
Our earlier definitions of expanded signature sets and bias can also be adapted naturally:

Definition 9 ((General) Expanded Signature Set) An expanded set $\tilde{T}$ is an expanded signature set for $x_j$ if $|\beta_{j,\tilde{T}}| \geq 0.7\sigma d$ and for all $k \neq j$, $|\beta_{k,\tilde{T}}| \leq 0.3\sigma d$.

Since Lemma 6 still holds, Lemma 7 follows straightforwardly. That is, there always exists a general expanded signature set $\tilde{T}$ that is produced by a set $T$ of size $t = O_\theta(\log^2 n)$. (Note that this is why in the general case we assume that $G_\sigma$ has degree at least $2d$ in Assumption G1: we want the size of a good expanded set to be $d$ instead of $d/2$, so that all the lemmas can be adapted without change of notation.)

Definition 10 ((General) Empirical Bias) The empirical bias $\hat{B}_{\tilde{T}}$ of an expanded set $\tilde{T}$ of size $d$ is defined to be the largest $B$ that satisfies
$$\big|\{\, k \in [N] : |\beta_{\tilde{T}}^k - \hat{E}[\beta_{\tilde{T}}]| \geq B \,\}\big| \geq \rho N/2.$$
In other words, $\hat{B}_{\tilde{T}}$ is the $\rho N/2$-th largest deviation $|\beta_{\tilde{T}}^k - \hat{E}[\beta_{\tilde{T}}]|$ among the samples.

Let us now intuitively describe why the analog of Lemma 8 holds in the general case. We provide the formal statement and the proof in Appendix A.2.

1. The first step, Claim 9, is a statement purely about the magnitudes of the edges (in fact, cancellations in $\beta_{k,\tilde{T}}$ for $k \neq j$ only help our case).
2. The second step, Claim 10, essentially argues that the small $\beta_{k,\tilde{T}}$ do not contribute much to the bias (a concentration bound, which still holds due to Lemma 15), and that the probability of two "large" features $j, j'$ being on simultaneously is very small. The latter holds even if the $\beta_{j,\tilde{T}}$ have different signs.

3. The final step in the proof of Lemma 8 is an argument which uses the assumption on the overlap between features to contradict the maximality of the bias in the case where $\beta_{j,\tilde{T}}$ and $\beta_{j',\tilde{T}}$ are both "large". This only uses the magnitudes of the entries of $A$, and thus also carries over.

Recovering an equivalent dictionary. The main lemma in the nonnegative case, which shows that Algorithm 1 roughly recovers a column, is Lemma 11. The proof uses the property that signature sets are elevated "almost iff" $x_j = 1$ to conclude that we get a good approximation to one of the columns. We have seen that this also holds in the general case, and since the rest of the argument deals only with the magnitudes of the entries, we conclude that we can roughly recover a column also in the general case. Let us state this formally.

Lemma 17 If $\tilde{T}$ is an expanded signature set for $x_j$, and $\tilde{A}_{\tilde{T}}$ is the corresponding column output by Algorithm 1, then with high probability
$$\|\tilde{A}_{\tilde{T}} - A_j\|_\infty \leq O\big(\rho(\Lambda^3\log n/\sigma^2)^2\sqrt{\Lambda\log n}\big) = o(\sigma).$$

Once we have all the entries which are $> \sigma/2$ in magnitude, we can use the 'refinement' trick of Lemma 13 to conclude that we can recover the entries.

Lemma 18 When the number of samples is at least $n^{4C+3}m$, the matrices $A$ and $\hat{A}$ are entry-wise $n^{-2C}m^{-1/2}$-close. Further, the two dictionaries are $n^{-C}$-equivalent.

The first part of the proof (showing entry-wise closeness) is very similar to Lemma 6. To show $n^{-C}$-equivalence, notice that when the entries are very close this just follows from Bernstein's inequality, with the variance bounded by $n^{-4C}m^{-1}\cdot m$. In Section 3 we do not use only this bound, because we want to be able to also handle the case when the entrywise error is only inverse-polylogarithmic (for Assumption 2).

References

[AAN13] Alekh Agarwal, Animashree Anandkumar, and Praneeth Netrapalli. Exact recovery of sparsely used overcomplete dictionaries. CoRR, abs/1309.1952, 2013.

[ABGM13] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. CoRR, abs/1310.6343, 2013.

[AEB05] Michal Aharon, Michael Elad, and Alfred M. Bruckstein. K-SVD and its non-negative variant for dictionary design. In Optics & Photonics 2005, pages 591411-591411. International Society for Optics and Photonics, 2005.

[AEB06] Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. Signal Processing, IEEE Transactions on, 54(11):4311-4322, 2006.

[AEP06] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In NIPS, pages 41-48, 2006.

[AGM13] Sanjeev Arora, Rong Ge, and Ankur Moitra. New algorithms for learning incoherent and overcomplete dictionaries. ArXiv, 1308.6273, 2013.

[Aha06] Michal Aharon. Overcomplete Dictionaries for Sparse Representation of Signals. PhD thesis, Technion - Israel Institute of Technology, 2006.

[BC+07] Y-lan Boureau, Yann LeCun, et al. Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems, pages 1185-1192, 2007.
References

[AAN13] Alekh Agarwal, Animashree Anandkumar, and Praneeth Netrapalli. Exact recovery of sparsely used overcomplete dictionaries. CoRR, abs/1309.1952, 2013.

[ABGM13] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. CoRR, abs/1310.6343, 2013.

[AEB05] Michal Aharon, Michael Elad, and Alfred M. Bruckstein. K-SVD and its non-negative variant for dictionary design. In Optics & Photonics 2005, pages 591411–591411. International Society for Optics and Photonics, 2005.

[AEB06] Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. Signal Processing, IEEE Transactions on, 54(11):4311–4322, 2006.

[AEP06] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In NIPS, pages 41–48, 2006.

[AGM13] Sanjeev Arora, Rong Ge, and Ankur Moitra. New algorithms for learning incoherent and overcomplete dictionaries. ArXiv, 1308.6273, 2013.

[Aha06] Michal Aharon. Overcomplete Dictionaries for Sparse Representation of Signals. PhD thesis, Technion - Israel Institute of Technology, 2006.

[BC+07] Y-lan Boureau, Yann LeCun, et al. Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems, pages 1185–1192, 2007.

[Ben62] George Bennett. Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57(297):33–45, 1962.

[Ber27] S. Bernstein. Theory of Probability, 1927.

[BGI+08] R. Berinde, A. C. Gilbert, P. Indyk, H. Karloff, and M. J. Strauss. Combining geometry and combinatorics: a unified approach to sparse signal recovery. In 46th Annual Allerton Conference on Communication, Control, and Computing, pages 798–805, 2008.

[CRT06] Emmanuel J. Candès, Justin Romberg, and Terence Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. Information Theory, IEEE Transactions on, 52(2):489–509, 2006.

[Das99] Sanjoy Dasgupta. Learning mixtures of Gaussians. In FOCS, pages 634–644. IEEE Computer Society, 1999.

[DH01] David L. Donoho and Xiaoming Huo. Uncertainty principles and ideal atomic decomposition. Information Theory, IEEE Transactions on, 47(7):2845–2862, 2001.

[DMA97] Geoff Davis, Stephane Mallat, and Marco Avellaneda. Adaptive greedy approximations. Constructive Approximation, 13(1):57–98, 1997.

[EA06] Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. Image Processing, IEEE Transactions on, 15(12):3736–3745, 2006.

[EAHH99] Kjersti Engan, Sven Ole Aase, and J. Hakon Husoy. Method of optimal directions for frame design. In Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on, volume 5, pages 2443–2446. IEEE, 1999.

[Hoy02] Patrik O. Hoyer. Non-negative sparse coding. In Neural Networks for Signal Processing, 2002. Proceedings of the 2002 12th IEEE Workshop on, pages 557–565. IEEE, 2002.

[Ind08] Piotr Indyk. Explicit constructions for compressed sensing of sparse signals. In Shang-Hua Teng, editor, SODA, pages 30–33. SIAM, 2008.

[JXHC09] Sina Jafarpour, Weiyu Xu, Babak Hassibi, and A. Robert Calderbank. Efficient and robust compressed sensing using optimized expander graphs. IEEE Transactions on Information Theory, 55(9):4299–4308, 2009.

[LS99] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

[LS00] Michael S. Lewicki and Terrence J. Sejnowski. Learning overcomplete representations. Neural Computation, 12(2):337–365, 2000.

[MLB+08] Julien Mairal, Marius Leordeanu, Francis Bach, Martial Hebert, and Jean Ponce. Discriminative sparse image models for class-specific edge detection and image interpretation. In Computer Vision–ECCV 2008, pages 43–56. Springer, 2008.

[OF97] Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.

[SWW12] Daniel A. Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. Journal of Machine Learning Research - Proceedings Track, 23:37.1–37.18, 2012.

[YWHM08] Jianchao Yang, John Wright, Thomas Huang, and Yi Ma. Image super-resolution as sparse representation of raw image patches. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
A  Full Proofs

In this section we give the omitted proofs.

A.1  Proof of Lemma 14

Proof: Let us focus on the $i$-th row of $A - \hat A$ and denote it by $w$. Then we have $\|w\|_1 \le \|A\|_1 + \|\hat A\|_1 \le O(1/\rho)$. Now consider the random variable $Z = \sum_j w_j x_j$, where the $x_j$ are i.i.d. Bernoulli random variables with probability $\rho$ of being $1$. By Bernstein's inequality (Theorem 23),
$$\Pr[Z - \mathbb{E}Z > t] \le \exp\!\left(-\frac{t^2}{2\rho\sum_j w_j^2 + \frac{2}{3}\delta t}\right).$$
Since $|w_j| < \delta$ for all $j$, we can bound the variance as $\rho\sum_j w_j^2 \le \delta\rho\sum_j |w_j| \le 2\delta$. Thus, setting $t = (4\delta\log n)^{1/2}$ (note that this is the $t$ of Bernstein's inequality, not the size of the signature sets), we obtain an upper bound of $1/\mathrm{poly}(n)$ on the probability. $\square$
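As a quick sanity check of this concentration step, the following NumPy sketch simulates $Z = \sum_j w_j x_j$ and compares the empirical tail frequency with the Bernstein bound used above. It is our own illustration: the parameters $m$, $\rho$, $\delta$, the choice of $w$, and taking $n = m$ are arbitrary, not values from the paper.

```python
import numpy as np

# Monte Carlo check of the Lemma 14 concentration step (illustrative parameters only).
rng = np.random.default_rng(0)
m, rho, delta = 1000, 0.02, 1e-3

# A "row of A - hat(A)": entries of magnitude < delta with ||w||_1 <= O(1/rho).
w = delta * rng.choice([-1.0, 1.0], size=m)
assert np.abs(w).sum() <= 2.0 / rho

# Z = sum_j w_j x_j with x_j i.i.d. Bernoulli(rho).
trials = 10_000
Z = (rng.random((trials, m)) < rho).astype(float) @ w

# Deviation used in the proof: t = (4 * delta * log n)^{1/2}, taking n = m here.
t = np.sqrt(4 * delta * np.log(m))

# Bernstein bound (Theorem 23) with the proof's worst-case variance 2*delta and |w_j| <= delta;
# this is roughly 1/poly(n), and the empirical tail frequency should lie below it.
bound = np.exp(-t**2 / (2 * (2 * delta) + (2.0 / 3.0) * delta * t))
tail = np.mean(Z - Z.mean() > t)
print(f"empirical tail {tail:.1e}  vs  Bernstein bound {bound:.1e}")
```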
A.2  Missing Lemmas and Proofs of Section 4

Lemma 19 (General Version of Lemma 3). Suppose $T$ of size $t$ is a general signature set for $x_j$ with $t = \omega(\sqrt{\log n})$. Let $E_1$ be the event that $x_j = 1$, and let $E_2$ be the event that $\beta_T \ge \mathbb{E}[\beta_T] + 0.9\sigma t$ if $\beta_{j,T} \ge \sigma t$, and the event that $\beta_T \le \mathbb{E}[\beta_T] - 0.9\sigma t$ if $\beta_{j,T} \le -\sigma t$. Then for a large constant $C$ (depending on $\Delta$):
1. $\Pr[E_1] + n^{-2C} \ge \Pr[E_2] \ge \Pr[E_1] - n^{-2C}$.
2. $\Pr[E_2 \mid E_1] \ge 1 - n^{-2C}$, and $\Pr[E_2 \mid E_1^c] \le n^{-2C}$.
3. $\Pr[E_1 \mid E_2] \ge 1 - n^{-C}$.

Proof: It is a straightforward modification of the proof of Lemma 3. First, $|\mathbb{E}[\beta_{j,T}x_j]| = o(\sigma t)$, and thus the mean of $\sum_{k\ne j}\beta_{k,T}x_k$ differs from that of $\beta_T$ by at most $o(\sigma t)$. Second, Bernstein's inequality requires that the largest coefficients and the total variance be bounded, which correspond exactly to properties (b) and (c) of a general signature set. The rest of the proof follows as in Lemma 3. $\square$

Lemma 20 (General Version of Lemma 8). Let $\tilde T^*$ be the set with the largest general empirical bias $\hat B_{\tilde T^*}$ among all the expanded sets $\tilde T$. Then $\tilde T^*$ is an expanded signature set for some $x_j$.

Proof: We first prove an analog of Claim 9. Let $W = \sigma^4 d/(2\Delta\Lambda^3\log n)$. Redefine
$$K_{\mathrm{large}} := \{k \in [m] : |\{i \in \tilde T : |A^{(i)}_k| \ge \tau\}| \ge W\}$$
to be the subset of nodes in $[m]$ that connect to at least $W$ nodes in $\tilde T$ in the subgraph $G_\tau$. Note that this implies that if $k \notin K_{\mathrm{large}}$, then $|\beta_{k,\tilde T}| \le d\tau + W\Lambda \le d\sigma^4/(\Delta\Lambda^2\log n)$. Let $Q_k = \{i \in \tilde T : |A^{(i)}_k| \ge \tau\}$. By definition, for $k \in K_{\mathrm{large}}$ we have $|Q_k| \ge W$. Then, as in the proof of Claim 9, using the fact that $|Q_k \cap Q_{k'}| \le \kappa$ and inclusion-exclusion, we get $|K_{\mathrm{large}}| \le O(\Delta\Lambda^3\log n/\sigma^4)$.

Next we prove an analog of Claim 10. Let $\beta_{\mathrm{small},\tilde T}$ and $\beta_{\mathrm{large},\tilde T}$ be defined as in the proof of Claim 10 (with the new definition of $K_{\mathrm{large}}$). By Lemma 15, the variance of $\beta_{\mathrm{small},\tilde T}$ is bounded by $2dW + 2d^2\gamma \le 2d^2\sigma^4/(\Delta\Lambda^2\log n)$. Therefore, by Bernstein's inequality, for sufficiently large $\Delta$, with probability at least $1 - n^{-2}$ over the choice of $x$ we have $|\beta_{\mathrm{small},\tilde T} - \mathbb{E}[\beta_{\mathrm{small},\tilde T}]| \le 0.05\, d\sigma^2/\Lambda$. It follows from the same argument as in Claim 10 that, with high probability over the choice of the $N$ samples, $|\hat B_{\tilde T} - \max_k \beta_{k,\tilde T}| \le 0.1\, d\sigma^2/\Lambda$ holds whenever $\max_k \beta_{k,\tilde T} \ge 0.5\, d\sigma$.

We now apply almost the same argument as in the proof of Lemma 8. By Lemma 17 we know that our algorithm must produce an expanded signature set of size $d$ with bias at least $0.8\sigma d$, and thus the set $\tilde T^*$ with the largest bias must have a large coefficient $j$ with $\beta_{j,\tilde T^*} \ge 0.7\sigma d$. Suppose there were some other $k$ with $\beta_{k,\tilde T^*} \ge 0.3\sigma d$. Then $|Q_k| \ge 0.3\sigma d/\Lambda$, so we could remove from $\tilde T^*$ the elements of $Q_k$ that lie outside $Q_j$, a set of size larger than $0.3\sigma d/\Lambda - \kappa$ by Assumption G2. By then adding other elements from the neighborhood of $j$ in $G_\sigma$ into $Q_j$, we would obtain a set with bias larger than that of $\tilde T^*$, contradicting the maximality of $\tilde T^*$. Hence no $k \ne j$ with $\beta_{k,\tilde T^*} \ge 0.3\sigma d$ exists, so $\tilde T^*$ is indeed an expanded signature set and the proof is complete. $\square$

B  Probability Inequalities

Lemma 21. Suppose $X$ is a bounded random variable in a normed vector space with $\|X\| \le M$. If an event $E$ happens with probability $1 - \delta$ for some $\delta < 1$, then $\|\mathbb{E}[X \mid E] - \mathbb{E}[X]\| \le 2\delta M$.

Proof: We have $\mathbb{E}[X] = \mathbb{E}[X \mid E]\Pr[E] + \mathbb{E}[X \mid E^c]\Pr[E^c] = \mathbb{E}[X \mid E] + (\mathbb{E}[X \mid E^c] - \mathbb{E}[X \mid E])\Pr[E^c]$, and therefore $\|\mathbb{E}[X \mid E] - \mathbb{E}[X]\| \le 2\delta M$. $\square$

Lemma 22. Suppose $X$ is a bounded random variable in a normed vector space with $\|X\| \le M$. If the events $E_1$ and $E_2$ have small symmetric difference in the sense that $\Pr[E_1^c \mid E_2] \le \delta$ and $\Pr[E_2^c \mid E_1] \le \delta$, then $\|\mathbb{E}[X \mid E_1] - \mathbb{E}[X \mid E_2]\| \le 4\delta M$.

Proof: Let $Y$ be $X$ conditioned on $E_2$. By Lemma 21, $\|\mathbb{E}[Y \mid E_1] - \mathbb{E}[Y]\| \le 2\delta M$, that is, $\|\mathbb{E}[X \mid E_1 E_2] - \mathbb{E}[X \mid E_2]\| \le 2\delta M$. Similarly $\|\mathbb{E}[X \mid E_1 E_2] - \mathbb{E}[X \mid E_1]\| \le 2\delta M$, and hence $\|\mathbb{E}[X \mid E_1] - \mathbb{E}[X \mid E_2]\| \le 4\delta M$. $\square$

Theorem 23 (Bernstein's Inequality [Ber27], cf. [Ben62]). Let $x_1, \ldots, x_n$ be independent random variables with finite variances $\sigma_i^2 = \mathbb{V}[x_i]$ and bounded so that $|x_i - \mathbb{E}[x_i]| \le M$. Let $\sigma^2 = \sum_i \sigma_i^2$. Then
$$\Pr\!\left[\left|\sum_{i=1}^n x_i - \mathbb{E}\Big[\sum_{i=1}^n x_i\Big]\right| > t\right] \le 2\exp\!\left(-\frac{t^2}{2\sigma^2 + \frac{2}{3}Mt}\right).$$
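As a small numerical illustration of Lemma 21 (our own addition; the distribution of $X$ and the choice of $E$ below are arbitrary illustrative choices, not taken from the paper):

```python
import numpy as np

# Numerical illustration of Lemma 21: if ||X|| <= M and Pr[E] = 1 - delta,
# then ||E[X | E] - E[X]|| <= 2 * delta * M.
rng = np.random.default_rng(1)
M, delta, n = 1.0, 0.05, 1_000_000

X = rng.uniform(-M, M, size=n)            # bounded scalar random variable, |X| <= M
# Pick E adversarially: drop the top delta-fraction of X, so conditioning shifts the mean.
E = X <= np.quantile(X, 1 - delta)         # Pr[E] is approximately 1 - delta

gap = abs(X[E].mean() - X.mean())
print(f"|E[X|E] - E[X]| = {gap:.4f}  <=  2*delta*M = {2 * delta * M:.4f}")
```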