A Simpler Approach to Matrix Completion
Benjamin Recht
Department of Computer Sciences, University of Wisconsin-Madison
1210 W Dayton St, Madison, WI 53706
email: brecht@cs.wisc.edu

October 2009

Abstract

This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix. These results improve on prior work by Candès and Recht [4], Candès and Tao [7], and Keshavan, Montanari, and Oh [18]. The reconstruction is accomplished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition. The proof of this assertion is short, self-contained, and uses very elementary analysis. The novel techniques herein are based on recent work in quantum information theory.

Keywords. Matrix completion, low-rank matrices, convex optimization, nuclear norm minimization, random matrices, operator Chernoff bound, compressed sensing.

1 Introduction

Recovering a low-rank matrix from a given subset of its entries is a recurring problem in collaborative filtering [25], dimensionality reduction [20, 28], and multi-class learning [2, 22]. While a variety of heuristics have been developed across many disciplines, the general problem of finding the lowest-rank matrix satisfying equality constraints is NP-hard. All known algorithms which can compute the lowest-rank solution for all instances require time at least exponential in the dimensions of the matrix in both theory and practice [9].
In sharp contrast to such worst-case pessimism, Candès and Recht showed that most low-rank matrices could be recovered from most sufficiently large sets of entries by computing the matrix of minimum nuclear norm that agreed with the provided entries [4], and furthermore the revealed set of entries could comprise a vanishing fraction of the entire matrix. The nuclear norm is equal to the sum of the singular values of a matrix and is the best convex lower bound of the rank function on the set of matrices whose singular values are all bounded by 1. The intuition behind this heuristic is that whereas the rank function counts the number of nonvanishing singular values, the nuclear norm sums their amplitude, much like how the $\ell_1$ norm is a useful surrogate for counting the number of nonzeros in a vector. Moreover, the nuclear norm can be minimized subject to equality constraints via semidefinite programming.

Nuclear norm minimization had long been observed to produce very low-rank solutions in practice (see, for example, [3, 11, 12, 21, 26]), but only very recently was there any theoretical basis for when it produced the minimum rank solution. The first paper to provide such foundations was [24], where Recht, Fazel, and Parrilo developed probabilistic techniques to study average-case behavior and showed that the nuclear norm heuristic could solve most instances of the rank minimization problem assuming the number of linear constraints was sufficiently large. The results in [24] inspired a groundswell of interest in theoretical guarantees for rank minimization, and these results lay the foundation for [4].
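As a quick numerical illustration of this surrogate relationship (a hypothetical numpy sketch, not part of the original analysis), the rank of a matrix counts its nonvanishing singular values while the nuclear norm sums their amplitudes:

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a rank-2 matrix as the sum of two random outer products.
A = np.outer(rng.standard_normal(5), rng.standard_normal(4)) \
  + np.outer(rng.standard_normal(5), rng.standard_normal(4))

sigma = np.linalg.svd(A, compute_uv=False)
rank = int(np.sum(sigma > 1e-10))  # rank: the number of nonvanishing singular values
nuclear = sigma.sum()              # nuclear norm ||A||_*: the sum of their amplitudes

print(rank, round(nuclear, 4))
```

Minimizing $\|X\|_*$ subject to linear equality constraints can then be posed as a semidefinite program, much as minimizing the $\ell_1$ norm of a vector can be posed as a linear program.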
Candès and Recht's bounds were subsequently improved by Candès and Tao [7] and Keshavan, Montanari, and Oh [18] to show that one could, in special cases, reconstruct a low-rank matrix by observing a set of entries of size at most a polylogarithmic factor larger than the intrinsic dimension of the variety of rank-$r$ matrices. This paper sharpens the results in [7, 18] to provide a bound on the number of entries required to reconstruct a low-rank matrix which is optimal up to a small numerical constant and one logarithmic factor. The main theorem makes minimal assumptions about the low-rank matrix of interest. Moreover, the proof is very short and relies on mostly elementary analysis.

In order to precisely state the main result, we need one definition. Candès and Recht observed that it is impossible to recover a matrix which is equal to zero in nearly all of its entries unless all of the entries of the matrix are observed (consider, for example, the rank-one matrix which is equal to 1 in one entry and zero everywhere else). In other words, the matrix cannot be mostly equal to zero on the observed entries. This motivated the following definition.

Definition 1.1 Let $U$ be a subspace of $\mathbb{R}^n$ of dimension $r$ and $P_U$ be the orthogonal projection onto $U$. Then the coherence of $U$ (vis-à-vis the standard basis $(e_i)$) is defined to be
$$\mu(U) \equiv \frac{n}{r} \max_{1 \le i \le n} \|P_U e_i\|^2. \qquad (1.1)$$

Note that for any subspace, the smallest $\mu(U)$ can be is 1, achieved, for example, if $U$ is spanned by vectors whose entries all have magnitude $1/\sqrt{n}$. The largest possible value for $\mu(U)$ is $n/r$, which corresponds to any subspace that contains a standard basis element. If a matrix has row and column spaces with low coherence, then each entry can be expected to provide about the same amount of information.
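Definition 1.1 is easy to compute for an explicit orthonormal basis. The sketch below (illustrative subspaces chosen for the demo) exhibits both extremes: a maximally spread-out subspace with $\mu(U) = 1$ and a subspace containing standard basis vectors with $\mu(U) = n/r$:

```python
import numpy as np

def coherence(U):
    """mu(U) = (n / r) * max_i ||P_U e_i||^2 for an n x r orthonormal basis U."""
    n, r = U.shape
    row_norms_sq = np.sum(U ** 2, axis=1)  # ||P_U e_i||^2 is the i-th squared row norm
    return (n / r) * row_norms_sq.max()

n, r = 8, 2
# Spanned by vectors whose entries all have magnitude 1/sqrt(n): mu = 1.
U_low = np.column_stack([np.ones(n), np.tile([1.0, -1.0], n // 2)]) / np.sqrt(n)
# Contains standard basis elements: mu = n / r.
U_high = np.eye(n)[:, :r]

print(coherence(U_low), coherence(U_high))  # 1.0 and n/r = 4.0
```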
Recall that the nuclear norm of an $n_1 \times n_2$ matrix $X$ is the sum of the singular values of $X$, $\|X\|_* = \sum_{k=1}^{\min\{n_1,n_2\}} \sigma_k(X)$, where, here and below, $\sigma_k(X)$ denotes the $k$th largest singular value of $X$. The main result of this paper is the following.

Theorem 1.1 Let $M$ be an $n_1 \times n_2$ matrix of rank $r$ with singular value decomposition $U\Sigma V^*$. Without loss of generality, impose the conventions $n_1 \le n_2$, $\Sigma$ is $r \times r$, $U$ is $n_1 \times r$, and $V$ is $n_2 \times r$. Assume that

A0 The row and column spaces have coherences bounded above by some positive $\mu_0$.

A1 The matrix $UV^*$ has a maximum entry bounded by $\mu_1\sqrt{r/(n_1 n_2)}$ in absolute value for some positive $\mu_1$.

Suppose $m$ entries of $M$ are observed with locations sampled uniformly at random. Then if
$$m \ge 32 \max\{\mu_1^2, \mu_0\}\, r (n_1 + n_2)\, \beta \log^2(2 n_2) \qquad (1.2)$$
for some $\beta > 1$, the minimizer to the problem
$$\text{minimize } \|X\|_* \quad \text{subject to } X_{ij} = M_{ij}, \;\; (i,j) \in \Omega \qquad (1.3)$$
is unique and equal to $M$ with probability at least $1 - 6\log(n_2)(n_1+n_2)^{2-2\beta} - n_2^{2-2\beta^{1/2}}$.

The assumptions A0 and A1 were introduced in [4]. Both $\mu_0$ and $\mu_1$ may depend on $r$, $n_1$, or $n_2$. Moreover, note that $\mu_1 \le \mu_0\sqrt{r}$ by the Cauchy-Schwarz inequality. As shown in [4], both subspaces selected from the uniform distribution and spaces constructed as the span of singular vectors with bounded entries are not only incoherent with the standard basis, but also obey A1 with high probability for values of $\mu_1$ at most logarithmic in $n_1$ and/or $n_2$. Applying this theorem to the models studied in Section 2 of [4], we find that there is a numerical constant $c_u$ such that $c_u r(n_1+n_2)\log^5(n_2)$ entries are sufficient to reconstruct a rank-$r$ matrix whose row and column spaces are sampled from the Haar measure on the Grassmann manifold. If $r > \log(n_2)$, the number of entries can be reduced to $c_u r(n_1+n_2)\log^4(n_2)$.
Similarly, there is a numerical constant $c_i$ such that $c_i \mu_0^2 r(n_1+n_2)\log^3(n_2)$ entries are sufficient to recover a matrix of arbitrary rank $r$ whose singular vectors have entries with magnitudes bounded by $\sqrt{\mu_0/n_1}$.

Theorem 1.1 greatly improves upon prior results. First of all, it has the weakest assumptions on the matrix to be recovered. In addition to assumption A1, Candès and Tao require a "strong incoherence condition" (see [7]) which is considerably more restrictive than the assumption A0 in Theorem 1.1. Many of their results also require restrictions on the rank of $M$, and their bounds depend superlinearly on $\mu_0$. Keshavan et al. require the matrix rank to be no more than $\log(n_2)$, and require bounds on the maximum magnitude of the entries in $M$ and the ratios $\sigma_1(M)/\sigma_r(M)$ and $n_2/n_1$. Theorem 1.1 makes no such assumptions about the rank, aspect ratio, or condition number of $M$. Moreover, (1.2) has a smaller log factor than [7], and features numerical constants that are both explicit and small.

Also note that there is not much room for improvement in the bound for $m$. It is a consequence of the coupon collector's problem that at least $n_2 \log n_2$ uniformly sampled entries are necessary just to guarantee that at least one entry in every row and column is observed with high probability. In addition, rank-$r$ matrices have $r(n_1+n_2-r)$ parameters, a fact that can be verified by counting the number of degrees of freedom in the singular value decomposition. Interestingly, Candès and Tao showed that $C\mu_0 n_2 r\log(n_2)$ entries were necessary for completion when the entries are sampled uniformly at random [7]. Hence, (1.2) is optimal up to a small numerical constant times $\log(n_2)$. Most importantly, the proof of Theorem 1.1 is short and straightforward.
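To get a feel for the scaling in (1.2), the following sketch evaluates the right-hand side of the bound for one hypothetical, well-conditioned instance (the parameter values are illustrative, not taken from the paper):

```python
import math

def entries_sufficient(n1, n2, r, mu0, mu1, beta):
    """Right-hand side of (1.2): 32 max{mu1^2, mu0} r (n1 + n2) beta log^2(2 n2)."""
    return 32 * max(mu1 ** 2, mu0) * r * (n1 + n2) * beta * math.log(2 * n2) ** 2

# Hypothetical incoherent instance: mu0 = mu1 = 1, rank 2.
n1 = n2 = 100_000
m = entries_sufficient(n1, n2, r=2, mu0=1.0, mu1=1.0, beta=1.1)
print(f"{m:.2e} sampled entries suffice, out of {n1 * n2:.2e} total")
```

For fixed rank and coherence, the bound grows like $(n_1+n_2)\log^2(2n_2)$, a vanishing fraction of the $n_1 n_2$ entries as the dimensions grow.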
Candès and Recht employed sophisticated tools from the study of random variables on Banach spaces, including decoupling tools and powerful moment inequalities for the norms of random matrices. Candès and Tao rely on intricate moment calculations spanning over 30 pages. The present work only uses basic matrix analysis, elementary large deviation bounds, and a noncommutative version of Bernstein's Inequality proven here in the Appendix.

The proof of Theorem 1.1 is inspired by a recent paper in quantum information which considered the problem of reconstructing the density matrix of a quantum ensemble using as few measurements as possible [16]. Their work adapted results from [4] and [5] to the quantum regime by using special algebraic properties of quantum measurements. Their proof followed a methodology analogous to the approach of Candès and Recht but had two main differences: they used a sampling-with-replacement model as a proxy for uniform sampling, and they deployed a powerful noncommutative Chernoff bound developed by Ahlswede and Winter for use in quantum information theory [1]. In this paper, I adapt these two strategies from [16] to the matrix completion problem. In Section 3 I show how the sampling-with-replacement model bounds probabilities in the uniform sampling model, and present very short proofs of some of the main results in [4]. Surprisingly, this yields a simple proof of Theorem 1.1, provided in Section 4, which has the least restrictive assumptions of any assertion proven thus far.

2 Preliminaries and notation

Before continuing, let us survey the notations used throughout the paper. I closely follow the conventions established in [4], and invite the reader to consult this reference for a more thorough discussion of the matrix completion problem and the associated convex geometry.
A thorough introduction to the necessary matrix analysis used in this paper can be found in [24]. Matrices are bold capital, vectors are bold lowercase, and scalars or entries are not bold. For example, $X$ is a matrix, and $X_{ij}$ its $(i,j)$th entry. Likewise $x$ is a vector, and $x_i$ its $i$th component. If $u_k \in \mathbb{R}^n$ for $1 \le k \le d$ is a collection of vectors, $[u_1, \ldots, u_d]$ will denote the $n \times d$ matrix whose $k$th column is $u_k$. $e_k$ will denote the $k$th standard basis vector in $\mathbb{R}^d$, equal to 1 in component $k$ and 0 everywhere else. The dimension of $e_k$ will always be clear from context. $X^*$ and $x^*$ denote the transpose of matrices $X$ and vectors $x$, respectively.

A variety of norms on matrices will be discussed. The spectral norm of a matrix is denoted by $\|X\|$. The Euclidean inner product between two matrices is $\langle X, Y\rangle = \operatorname{Tr}(X^*Y)$, and the corresponding Euclidean norm, called the Frobenius or Hilbert-Schmidt norm, is denoted $\|X\|_F$. That is, $\|X\|_F = \langle X, X\rangle^{1/2}$. The nuclear norm of a matrix $X$ is $\|X\|_*$. The maximum entry of $X$ (in absolute value) is denoted by $\|X\|_\infty \equiv \max_{ij}|X_{ij}|$. For vectors, the only norm applied is the usual Euclidean $\ell_2$ norm, simply denoted as $\|x\|$.

Linear transformations that act on matrices will be denoted by calligraphic letters. In particular, the identity operator will be denoted by $\mathcal{I}$. The spectral norm (the top singular value) of such an operator will be denoted by $\|\mathcal{A}\| = \sup_{X:\|X\|_F\le 1}\|\mathcal{A}(X)\|_F$.

Fix once and for all a matrix $M$ obeying the assumptions of Theorem 1.1. Let $u_k$ (respectively $v_k$) denote the $k$th column of $U$ (respectively $V$). Set $U \equiv \operatorname{span}(u_1,\ldots,u_r)$ and $V \equiv \operatorname{span}(v_1,\ldots,v_r)$. Also assume, without loss of generality, that $n_1 \le n_2$.
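The matrix norms just introduced are all available through numpy's SVD; this small sketch (values illustrative) records them side by side and checks the Frobenius norm against the trace inner product:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))

spectral  = np.linalg.norm(X, 2)                      # ||X||: the top singular value
frobenius = np.linalg.norm(X, 'fro')                  # ||X||_F = <X, X>^(1/2)
nuclear   = np.linalg.svd(X, compute_uv=False).sum()  # ||X||_*: sum of singular values
infinity  = np.abs(X).max()                           # ||X||_inf = max_ij |X_ij|

inner = np.trace(X.T @ X)  # <X, X> = Tr(X^* X) recovers ||X||_F^2
print(round(spectral, 4), round(frobenius, 4), round(nuclear, 4), round(infinity, 4))
```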
It is convenient to introduce the orthogonal decomposition $\mathbb{R}^{n_1\times n_2} = T \oplus T^\perp$, where $T$ is the linear space spanned by elements of the form $u_k y^*$ and $x v_k^*$, $1 \le k \le r$, where $x$ and $y$ are arbitrary, and $T^\perp$ is its orthogonal complement. $T^\perp$ is the subspace of matrices spanned by the family $(xy^*)$, where $x$ (respectively $y$) is any vector orthogonal to $U$ (respectively $V$).

The orthogonal projection $\mathcal{P}_T$ onto $T$ is given by
$$\mathcal{P}_T(Z) = P_U Z + Z P_V - P_U Z P_V, \qquad (2.1)$$
where $P_U$ and $P_V$ are the orthogonal projections onto $U$ and $V$, respectively. Note here that while $P_U$ and $P_V$ are matrices, $\mathcal{P}_T$ is a linear operator mapping matrices to matrices. The orthogonal projection onto $T^\perp$ is given by
$$\mathcal{P}_{T^\perp}(Z) = (\mathcal{I} - \mathcal{P}_T)(Z) = (I_{n_1} - P_U) Z (I_{n_2} - P_V),$$
where $I_d$ denotes the $d \times d$ identity matrix.

It follows from the definition (2.1) of $\mathcal{P}_T$ that
$$\mathcal{P}_T(e_a e_b^*) = (P_U e_a) e_b^* + e_a (P_V e_b)^* - (P_U e_a)(P_V e_b)^*.$$
This gives
$$\|\mathcal{P}_T(e_a e_b^*)\|_F^2 = \langle \mathcal{P}_T(e_a e_b^*), e_a e_b^*\rangle = \|P_U e_a\|^2 + \|P_V e_b\|^2 - \|P_U e_a\|^2\|P_V e_b\|^2.$$
Since $\|P_U e_a\|^2 \le \mu(U) r/n_1$ and $\|P_V e_b\|^2 \le \mu(V) r/n_2$,
$$\|\mathcal{P}_T(e_a e_b^*)\|_F^2 \le \max\{\mu(U), \mu(V)\}\, r\,\frac{n_1+n_2}{n_1 n_2} \le \mu_0 r\,\frac{n_1+n_2}{n_1 n_2}. \qquad (2.2)$$
I will make frequent use of this calculation throughout the sequel.

3 Sampling with Replacement

As discussed above, the main contribution of this work is an analysis of uniformly sampled sets of entries via the study of a sampling-with-replacement model. All of the previous work [4, 7, 18] studied a Bernoulli sampling model as a proxy for uniform sampling. There, each entry was revealed independently with probability equal to $p$.
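The projection (2.1) and the calculation behind (2.2) can be verified numerically before moving on. The sketch below (randomly drawn orthonormal bases, dimensions chosen purely for illustration) checks that $\mathcal{P}_T$ is idempotent and that the Frobenius-norm identity for $\mathcal{P}_T(e_a e_b^*)$ holds exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, r = 6, 8, 2

# Randomly drawn orthonormal bases for the column space U and row space V.
U = np.linalg.qr(rng.standard_normal((n1, r)))[0]
V = np.linalg.qr(rng.standard_normal((n2, r)))[0]
PU, PV = U @ U.T, V @ V.T

def P_T(Z):
    """The orthogonal projection (2.1): P_U Z + Z P_V - P_U Z P_V."""
    return PU @ Z + Z @ PV - PU @ Z @ PV

Z = rng.standard_normal((n1, n2))
idempotent = np.allclose(P_T(P_T(Z)), P_T(Z))  # P_T is a projection

a, b = 1, 3
E = np.zeros((n1, n2)); E[a, b] = 1.0
lhs = np.linalg.norm(P_T(E), 'fro') ** 2
pa, pb = PU[a, a], PV[b, b]  # ||P_U e_a||^2 and ||P_V e_b||^2
identity_holds = np.isclose(lhs, pa + pb - pa * pb)
print(idempotent, identity_holds)
```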
In all of these results, the theorem statements concerned sampling sets of $m$ entries uniformly, but it was shown that the probability of failure under Bernoulli sampling with $p = \frac{m}{n_1 n_2}$ closely approximated the probability of failure under uniform sampling. The present work will analyze the situation where each entry index is sampled independently from the uniform distribution on $\{1,\ldots,n_1\}\times\{1,\ldots,n_2\}$. This modification of the sampling model gives rise to all of the simplifications below.

It would appear that sampling with replacement is not suitable for analyzing matrix completion, as one might encounter duplicate entries. However, just as is the case with Bernoulli sampling, bounding the likelihood of error when sampling with replacement allows us to bound the probability of the nuclear norm heuristic failing under uniform sampling.

Proposition 3.1 The probability that the nuclear norm heuristic fails when the set of observed entries is sampled uniformly from the collection of sets of size $m$ is less than or equal to the probability that the heuristic fails when $m$ entries are sampled independently with replacement.

Proof The proof follows the argument in Section II.C of [6]. Let $\Omega'$ be a collection of $m$ entries, each sampled independently from the uniform distribution on $\{1,\ldots,n_1\}\times\{1,\ldots,n_2\}$. Let $\Omega_k$ denote a set of entries of size $k$ sampled uniformly from all collections of entries of size $k$. It follows that
$$\mathbb{P}(\mathrm{Failure}(\Omega')) = \sum_{k=0}^m \mathbb{P}(\mathrm{Failure}(\Omega') \mid |\Omega'| = k)\,\mathbb{P}(|\Omega'| = k) = \sum_{k=0}^m \mathbb{P}(\mathrm{Failure}(\Omega_k))\,\mathbb{P}(|\Omega'| = k) \ge \mathbb{P}(\mathrm{Failure}(\Omega_m))\sum_{k=0}^m \mathbb{P}(|\Omega'| = k) = \mathbb{P}(\mathrm{Failure}(\Omega_m)),$$
where the inequality follows because $\mathbb{P}(\mathrm{Failure}(\Omega_m)) \ge \mathbb{P}(\mathrm{Failure}(\Omega_{m'}))$ if $m \le m'$. That is, the probability of failure decreases as the number of entries revealed is increased.
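A quick simulation (hypothetical sizes) shows how mild the duplicate-entry issue is in practice: the number of distinct indices among $m$ draws with replacement concentrates tightly around its exact expectation $N(1 - (1 - 1/N)^m)$, where $N = n_1 n_2$:

```python
import numpy as np

rng = np.random.default_rng(7)
n1, n2, m = 50, 50, 1000
N = n1 * n2

idx = rng.integers(N, size=m)          # m entry indices, sampled with replacement
distinct = len(np.unique(idx))
expected = N * (1 - (1 - 1 / N) ** m)  # exact expected number of distinct indices

print(distinct, round(expected, 1))
```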
Surprisingly, changing the sampling model makes most of the theorems from [4] simple consequences of a noncommutative variant of Bernstein's Inequality.

Theorem 3.2 (Noncommutative Bernstein Inequality) Let $X_1,\ldots,X_L$ be independent zero-mean random matrices of dimension $d_1 \times d_2$. Suppose $\rho_k^2 = \max\{\|\mathbb{E}[X_kX_k^*]\|, \|\mathbb{E}[X_k^*X_k]\|\}$ and $\|X_k\| \le M$ almost surely for all $k$. Then for any $\tau > 0$,
$$\mathbb{P}\left[\left\|\sum_{k=1}^L X_k\right\| > \tau\right] \le (d_1+d_2)\exp\left(\frac{-\tau^2/2}{\sum_{k=1}^L \rho_k^2 + M\tau/3}\right).$$

Note that in the case that $d_1 = d_2 = 1$, this is precisely the two-sided version of the standard Bernstein Inequality. When the $X_k$ are diagonal, this bound is the same as applying the standard Bernstein Inequality and a union bound to the diagonal of the matrix summation. Furthermore, observe that the right-hand side is less than $(d_1+d_2)\exp\left(-\frac{3}{8}\tau^2/\left(\sum_{k=1}^L\rho_k^2\right)\right)$ as long as $\tau \le \frac{1}{M}\sum_{k=1}^L\rho_k^2$. This condensed form of the inequality will be used exclusively throughout. Theorem 3.2 is a corollary of a Chernoff bound for finite-dimensional operators developed by Ahlswede and Winter [1]. A similar inequality for symmetric i.i.d. matrices is proposed in [16]. The proof is provided in the Appendix.

Let us now record two theorems, proven for the Bernoulli model in [4], that admit very simple proofs in the sampling-with-replacement model. The theorem statements require some additional notation. Let $\Omega = \{(a_k,b_k)\}_{k=1}^l$ be a collection of indices sampled uniformly with replacement. Set $\mathcal{R}_\Omega$ to be the operator
$$\mathcal{R}_\Omega(Z) = \sum_{k=1}^{|\Omega|}\langle e_{a_k}e_{b_k}^*, Z\rangle\, e_{a_k}e_{b_k}^*.$$
Note that the $(i,j)$th component of $\mathcal{R}_\Omega(X)$ is zero unless $(i,j) \in \Omega$. For $(i,j) \in \Omega$, it is equal to $X_{ij}$ times the multiplicity of $(i,j)$ in $\Omega$. Unlike in previous work on matrix completion, $\mathcal{R}_\Omega$ is not a projection operator if there are duplicates in $\Omega$.
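A direct implementation of $\mathcal{R}_\Omega$ makes the multiplicity behavior concrete (a minimal sketch with hand-picked indices; the duplicate at $(0,1)$ is deliberate):

```python
import numpy as np

def R_Omega(Z, omega):
    """R_Omega(Z) = sum_k <e_{a_k} e_{b_k}^*, Z> e_{a_k} e_{b_k}^*, with multiplicity."""
    out = np.zeros_like(Z)
    for a, b in omega:  # duplicated indices accumulate
        out[a, b] += Z[a, b]
    return out

rng = np.random.default_rng(3)
Z = rng.standard_normal((3, 4))
omega = [(0, 1), (2, 3), (0, 1)]  # the index (0, 1) is sampled twice

RZ = R_Omega(Z, omega)
doubled = np.isclose(RZ[0, 1], 2 * Z[0, 1])          # entry scaled by its multiplicity
is_projection = np.allclose(R_Omega(RZ, omega), RZ)  # fails when duplicates exist
print(doubled, is_projection)
```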
Nonetheless, this does not adversely affect the argument, and $\mathcal{R}_\Omega(X) = 0$ if and only if $X_{ab} = 0$ for all $(a,b) \in \Omega$. Moreover, we can show that the maximum duplication of any entry is always less than $\frac{8}{3}\log(n_2)$ with very high probability.

Proposition 3.3 With probability at least $1 - n_2^{2-2\beta}$, the maximum number of repetitions of any entry in $\Omega$ is less than $\frac{8}{3}\beta\log(n_2)$ for $n_2 \ge 9$ and $\beta > 1$.

Proof This assertion can be proven by applying a standard Chernoff bound for the Bernoulli distribution. Note that for a fixed entry, the probability it is sampled more than $t$ times is equal to the probability of more than $t$ heads occurring in a sequence of $m$ tosses where the probability of a head is $\frac{1}{n_1n_2}$. This probability can be upper bounded by
$$\mathbb{P}[\text{more than } t \text{ heads in } m \text{ trials}] \le \left(\frac{m}{n_1 n_2 t}\right)^t \exp\left(t - \frac{m}{n_1 n_2}\right)$$
(see [17], for example). Applying the union bound over all of the $n_1 n_2$ entries and the fact that $\frac{m}{n_1 n_2} < 1$, we have
$$\mathbb{P}\left[\text{any entry is selected more than } \tfrac{8}{3}\beta\log(n_2) \text{ times}\right] \le n_1 n_2 \left(\tfrac{8}{3}\beta\log(n_2)\right)^{-\frac{8}{3}\beta\log(n_2)}\exp\left(\tfrac{8}{3}\beta\log(n_2)\right) \le n_2^{2-2\beta}$$
when $n_2 \ge 9$.

This application of the Chernoff bound is very crude, and much tighter bounds can be derived using more careful analysis. For example, in [15], the maximum oversampling is shown to be bounded by $O\left(\frac{\log(n_2)}{\log\log(n_2)}\right)$. For our purposes here, the loose upper bound provided by Proposition 3.3 will be more than sufficient.

In addition to this bound on the norm of $\mathcal{R}_\Omega$, the following theorem asserts that the operator $\mathcal{P}_T\mathcal{R}_\Omega\mathcal{P}_T$ is also very close to an isometry on $T$ if the number of sampled entries is sufficiently large. This result is analogous to Theorem 4.1 in [4] for the Bernoulli model, whose proof uses several powerful theorems from the study of probability in Banach spaces. Here, one only needs to compute a few low-order moments and then apply Theorem 3.2.
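The near-isometry on $T$ just described can be observed directly on a small instance. The following sketch (illustrative dimensions and sample size, unrelated to the theorem's constants) materializes $\mathcal{P}_T\mathcal{R}_\Omega\mathcal{P}_T$ as a matrix acting on vectorized $n_1 \times n_2$ arrays and measures its deviation from the rescaled projector:

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2, r, m = 12, 12, 2, 5000

U = np.linalg.qr(rng.standard_normal((n1, r)))[0]
V = np.linalg.qr(rng.standard_normal((n2, r)))[0]
PU, PV = U @ U.T, V @ V.T
P_T = lambda Z: PU @ Z + Z @ PV - PU @ Z @ PV

# m index pairs sampled uniformly with replacement; counts stores multiplicities.
counts = np.zeros((n1, n2))
for _ in range(m):
    counts[rng.integers(n1), rng.integers(n2)] += 1
R_Omega = lambda Z: counts * Z

def as_matrix(op):
    """Represent an operator on n1 x n2 matrices as an (n1*n2) x (n1*n2) array."""
    cols = []
    for j in range(n1 * n2):
        E = np.zeros(n1 * n2)
        E[j] = 1.0
        cols.append(op(E.reshape(n1, n2)).ravel())
    return np.array(cols).T

PT_mat = as_matrix(P_T)
PRP_mat = as_matrix(lambda Z: P_T(R_Omega(P_T(Z))))
deviation = np.linalg.norm((n1 * n2 / m) * PRP_mat - PT_mat, 2)
print(round(deviation, 3))  # shrinks like 1/sqrt(m) as more entries are sampled
```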
Theorem 3.4 Suppose $\Omega$ is a set of entries of size $m$ sampled independently and uniformly with replacement. Then for all $\beta > 1$,
$$\left\|\frac{n_1n_2}{m}\mathcal{P}_T\mathcal{R}_\Omega\mathcal{P}_T - \mathcal{P}_T\right\| \le \sqrt{\frac{16\mu_0 r(n_1+n_2)\beta\log(n_2)}{3m}}$$
with probability at least $1 - 2n_2^{2-2\beta}$ provided that $m > \frac{16}{3}\mu_0 r(n_1+n_2)\beta\log(n_2)$.

Proof Decompose any matrix $Z$ as $Z = \sum_{ab}\langle Z, e_ae_b^*\rangle e_ae_b^*$ so that
$$\mathcal{P}_T(Z) = \sum_{ab}\langle\mathcal{P}_T(Z), e_ae_b^*\rangle e_ae_b^* = \sum_{ab}\langle Z, \mathcal{P}_T(e_ae_b^*)\rangle e_ae_b^*. \qquad (3.1)$$
For $k = 1,\ldots,m$, sample $(a_k,b_k)$ from $\{1,\ldots,n_1\}\times\{1,\ldots,n_2\}$ uniformly with replacement. Then $\mathcal{R}_\Omega\mathcal{P}_T(Z) = \sum_{k=1}^m\langle Z, \mathcal{P}_T(e_{a_k}e_{b_k}^*)\rangle e_{a_k}e_{b_k}^*$, which gives
$$(\mathcal{P}_T\mathcal{R}_\Omega\mathcal{P}_T)(Z) = \sum_{k=1}^m\langle Z, \mathcal{P}_T(e_{a_k}e_{b_k}^*)\rangle\,\mathcal{P}_T(e_{a_k}e_{b_k}^*).$$
Now the fact that the operator $\mathcal{P}_T\mathcal{R}_\Omega\mathcal{P}_T$ does not deviate from its expected value
$$\mathbb{E}(\mathcal{P}_T\mathcal{R}_\Omega\mathcal{P}_T) = \mathcal{P}_T(\mathbb{E}\,\mathcal{R}_\Omega)\mathcal{P}_T = \mathcal{P}_T\left(\frac{m}{n_1n_2}\mathcal{I}\right)\mathcal{P}_T = \frac{m}{n_1n_2}\mathcal{P}_T$$
in the spectral norm can be proven using the Noncommutative Bernstein Inequality. To proceed, define the operator $\mathcal{T}_{ab}$ which maps $Z$ to $\langle\mathcal{P}_T(e_ae_b^*), Z\rangle\,\mathcal{P}_T(e_ae_b^*)$. This operator has rank one and operator norm $\|\mathcal{T}_{ab}\| = \|\mathcal{P}_T(e_ae_b^*)\|_F^2$, and we have $\mathcal{P}_T = \sum_{a,b}\mathcal{T}_{ab}$ by (3.1). Hence, for $k = 1,\ldots,m$, $\mathbb{E}[\mathcal{T}_{a_kb_k}] = \frac{1}{n_1n_2}\mathcal{P}_T$. Observe that if $A$ and $B$ are positive semidefinite, we have $\|A - B\| \le \max\{\|A\|, \|B\|\}$. Using this fact, we can compute the bound
$$\left\|\mathcal{T}_{a_kb_k} - \frac{1}{n_1n_2}\mathcal{P}_T\right\| \le \max\left\{\|\mathcal{P}_T(e_{a_k}e_{b_k}^*)\|_F^2,\;\frac{1}{n_1n_2}\right\} \le \mu_0 r\,\frac{n_1+n_2}{n_1n_2},$$
where the final inequality follows from (2.2).
We also have
$$\left\|\mathbb{E}\left[\left(\mathcal{T}_{a_kb_k} - \tfrac{1}{n_1n_2}\mathcal{P}_T\right)^2\right]\right\| = \left\|\mathbb{E}\left[\|\mathcal{P}_T(e_{a_k}e_{b_k}^*)\|_F^2\,\mathcal{T}_{a_kb_k}\right] - \frac{1}{n_1^2n_2^2}\mathcal{P}_T\right\| \le \max\left\{\left\|\mathbb{E}\left[\|\mathcal{P}_T(e_{a_k}e_{b_k}^*)\|_F^2\,\mathcal{T}_{a_kb_k}\right]\right\|,\;\frac{1}{n_1^2n_2^2}\right\} \le \max\left\{\|\mathbb{E}[\mathcal{T}_{a_kb_k}]\|\,\mu_0 r\,\frac{n_1+n_2}{n_1n_2},\;\frac{1}{n_1^2n_2^2}\right\} \le \mu_0 r\,\frac{n_1+n_2}{n_1^2n_2^2}.$$
The theorem now follows by applying the Noncommutative Bernstein Inequality.

The next theorem is an analog of Theorem 6.3 in [4] or Lemma 3.2 in [18]. This theorem asserts that for a fixed matrix, if one sets all of the entries not in $\Omega$ to zero, it remains close to a multiple of the original matrix in the operator norm.

Theorem 3.5 Suppose $\Omega$ is a set of entries of size $m$ sampled independently and uniformly with replacement, and let $Z$ be a fixed $n_1\times n_2$ matrix. Assume without loss of generality that $n_1 \le n_2$. Then for all $\beta > 1$,
$$\left\|\left(\frac{n_1n_2}{m}\mathcal{R}_\Omega - \mathcal{I}\right)(Z)\right\| \le \sqrt{\frac{8\beta n_1n_2^2\log(n_1+n_2)}{3m}}\,\|Z\|_\infty$$
with probability at least $1 - (n_1+n_2)^{1-\beta}$ provided that $m > 6\beta n_1\log(n_1+n_2)$.

Proof First observe that the operator norm can be upper bounded by a multiple of the matrix infinity norm:
$$\|Z\| = \sup_{\|x\|=1,\,\|y\|=1}\sum_{a,b} Z_{ab}\,y_ax_b \le \left(\sum_{a,b}Z_{ab}^2y_a^2\right)^{1/2}\left(\sum_{a,b}x_b^2\right)^{1/2} \le \left(\max_a\sum_b Z_{ab}^2\right)^{1/2}\sqrt{n_1} \le \sqrt{n_1n_2}\,\|Z\|_\infty.$$
Note that $\frac{n_1n_2}{m}\mathcal{R}_\Omega(Z) - Z = \frac{1}{m}\sum_{k=1}^m\left(n_1n_2Z_{a_kb_k}e_{a_k}e_{b_k}^* - Z\right)$. This is a sum of zero-mean random matrices, and
$$\|n_1n_2Z_{a_kb_k}e_{a_k}e_{b_k}^* - Z\| \le \|n_1n_2Z_{a_kb_k}e_{a_k}e_{b_k}^*\| + \|Z\| < \tfrac{3}{2}n_1n_2\|Z\|_\infty$$
for $n_1 \ge 2$. We also have
$$\left\|\mathbb{E}\left[(n_1n_2Z_{a_kb_k}e_{a_k}e_{b_k}^* - Z)^*(n_1n_2Z_{a_kb_k}e_{a_k}e_{b_k}^* - Z)\right]\right\| = \left\|n_1n_2\sum_{c,d}Z_{cd}^2\,e_de_d^* - Z^*Z\right\| \le \max\left\{\left\|n_1n_2\sum_{c,d}Z_{cd}^2\,e_de_d^*\right\|,\;\|Z^*Z\|\right\} \le n_1n_2^2\|Z\|_\infty^2,$$
where we again use the fact that $\|A - B\| \le \max\{\|A\|, \|B\|\}$ for positive semidefinite $A$ and $B$. A similar calculation holds for $(n_1n_2Z_{a_kb_k}e_{a_k}e_{b_k}^* - Z)(n_1n_2Z_{a_kb_k}e_{a_k}e_{b_k}^* - Z)^*$.
The theorem now follows by the Noncommutative Bernstein Inequality.

Finally, the following lemma is required to prove Theorem 1.1. Succinctly, it says that for a fixed matrix in $T$, the operator $\mathcal{P}_T\mathcal{R}_\Omega$ does not increase the matrix infinity norm.

Lemma 3.6 Suppose $\Omega$ is a set of entries of size $m$ sampled independently and uniformly with replacement, and let $Z \in T$ be a fixed $n_1\times n_2$ matrix. Assume without loss of generality that $n_1 \le n_2$. Then for all $\beta > 2$,
$$\left\|\frac{n_1n_2}{m}\mathcal{P}_T\mathcal{R}_\Omega(Z) - Z\right\|_\infty \le \sqrt{\frac{8\beta\mu_0 r(n_1+n_2)\log n_2}{3m}}\,\|Z\|_\infty$$
with probability at least $1 - 2n_2^{2-\beta}$ provided that $m > \frac{8}{3}\beta\mu_0 r(n_1+n_2)\log n_2$.

Proof This lemma can be proven using the standard Bernstein Inequality. For each matrix index $(c,d)$, sample $(a,b)$ uniformly at random to define the random variable
$$\xi_{cd} = \left\langle e_ce_d^*,\; n_1n_2\langle e_ae_b^*, Z\rangle\,\mathcal{P}_T(e_ae_b^*) - Z\right\rangle.$$
We have $\mathbb{E}[\xi_{cd}] = 0$, $|\xi_{cd}| \le \mu_0 r(n_1+n_2)\|Z\|_\infty$, and
$$\mathbb{E}[\xi_{cd}^2] = \frac{1}{n_1n_2}\sum_{a,b}\left\langle e_ce_d^*, n_1n_2\langle e_ae_b^*, Z\rangle\mathcal{P}_T(e_ae_b^*) - Z\right\rangle^2 = n_1n_2\sum_{a,b}\langle\mathcal{P}_T(e_ce_d^*), e_ae_b^*\rangle^2\langle e_ae_b^*, Z\rangle^2 - Z_{cd}^2 \le n_1n_2\,\|\mathcal{P}_T(e_ce_d^*)\|_F^2\,\|Z\|_\infty^2 \le \mu_0 r(n_1+n_2)\|Z\|_\infty^2.$$
Since the $(c,d)$ entry of $\frac{n_1n_2}{m}\mathcal{P}_T\mathcal{R}_\Omega(Z) - Z$ is identically distributed to $\frac{1}{m}\sum_{k=1}^m\xi_{cd}^{(k)}$, where the $\xi_{cd}^{(k)}$ are i.i.d. copies of $\xi_{cd}$, we have by Bernstein's Inequality and the union bound
$$\Pr\left[\left\|\frac{n_1n_2}{m}\mathcal{P}_T\mathcal{R}_\Omega(Z) - Z\right\|_\infty > \sqrt{\frac{8\beta\mu_0 r(n_1+n_2)\log(n_2)}{3m}}\,\|Z\|_\infty\right] \le 2n_1n_2\exp(-\beta\log(n_2)) \le 2n_2^{2-\beta}.$$

4 Proof of Theorem 1.1

The proof follows the program developed in [16], which itself adapted the strategy proposed in [4]. The main idea is to approximate a dual feasible solution of (1.3) which certifies that $M$ is the unique minimum nuclear norm solution. In [4], such a certificate was constructed via an infinite series using a construction developed in the compressed sensing literature [6, 13].
The terms in this series were then analyzed individually using the decoupling inequalities of de la Peña and Montgomery-Smith [10]. Truncating the infinite series after 4 terms gave their result. In [7], the authors bounded the contribution of $O(\log(n_2))$ terms in this series using intensive combinatorial analysis of each term. The insight in [16] was that, when sampling observations with replacement, a dual feasible solution could be closely approximated by a modified series where each term involved the product of independent random variables. This change in the sampling model allows one to avoid decoupling inequalities and gives rise to the dramatic simplification here.

To proceed, recall again that by Proposition 3.1 it suffices to consider the scenario when the entries are sampled independently and uniformly with replacement. I will first develop the main argument of the proof assuming many conditions hold with high probability. The proof is completed by subsequently bounding the probability that all of these events hold.

Suppose that
$$\left\|\frac{n_1n_2}{m}\mathcal{P}_T\mathcal{R}_\Omega\mathcal{P}_T - \mathcal{P}_T\right\| \le \frac{1}{2}, \qquad \|\mathcal{R}_\Omega\| \le \frac{8}{3}\beta^{1/2}\log(n_2). \qquad (4.1)$$
Also suppose there exists a $Y$ in the range of $\mathcal{R}_\Omega$ such that
$$\|\mathcal{P}_T(Y) - UV^*\|_F \le \sqrt{\frac{r}{2n_2}}, \qquad \|\mathcal{P}_{T^\perp}(Y)\| < \frac{1}{2}. \qquad (4.2)$$
If (4.1) holds, then for any $Z \in \ker\mathcal{R}_\Omega$, $\|\mathcal{P}_T(Z)\|$ cannot be too large. Indeed, we have
$$0 = \|\mathcal{R}_\Omega(Z)\|_F \ge \|\mathcal{R}_\Omega\mathcal{P}_T(Z)\|_F - \|\mathcal{R}_\Omega\mathcal{P}_{T^\perp}(Z)\|_F.$$
Now observe that
$$\|\mathcal{R}_\Omega\mathcal{P}_T(Z)\|_F^2 = \langle Z, \mathcal{P}_T\mathcal{R}_\Omega^2\mathcal{P}_T(Z)\rangle \ge \langle Z, \mathcal{P}_T\mathcal{R}_\Omega\mathcal{P}_T(Z)\rangle \ge \frac{m}{2n_1n_2}\|\mathcal{P}_T(Z)\|_F^2$$
and $\|\mathcal{R}_\Omega\mathcal{P}_{T^\perp}(Z)\|_F \le \frac{8}{3}\beta^{1/2}\log(n_2)\|\mathcal{P}_{T^\perp}(Z)\|_F$. Collecting these facts gives that for any $Z \in \ker\mathcal{R}_\Omega$,
$$\|\mathcal{P}_{T^\perp}(Z)\|_F \ge \sqrt{\frac{9m}{128\beta n_1n_2\log^2(n_2)}}\,\|\mathcal{P}_T(Z)\|_F > \sqrt{\frac{2r}{n_2}}\,\|\mathcal{P}_T(Z)\|_F.$$
Now recall that $\|A\|_* = \sup_{\|B\|\le 1}\langle A, B\rangle$.
For $Z \in \ker\mathcal{R}_\Omega$, pick $U_\perp$ and $V_\perp$ such that $[U, U_\perp]$ and $[V, V_\perp]$ are unitary matrices and $\langle U_\perp V_\perp^*, \mathcal{P}_{T^\perp}(Z)\rangle = \|\mathcal{P}_{T^\perp}(Z)\|_*$. Then it follows that
$$\|M + Z\|_* \ge \langle UV^* + U_\perp V_\perp^*, M + Z\rangle = \|M\|_* + \langle UV^* + U_\perp V_\perp^*, Z\rangle = \|M\|_* + \langle UV^* - \mathcal{P}_T(Y), \mathcal{P}_T(Z)\rangle + \langle U_\perp V_\perp^* - \mathcal{P}_{T^\perp}(Y), \mathcal{P}_{T^\perp}(Z)\rangle > \|M\|_* - \sqrt{\frac{r}{2n_2}}\,\|\mathcal{P}_T(Z)\|_F + \frac{1}{2}\|\mathcal{P}_{T^\perp}(Z)\|_* \ge \|M\|_*.$$
The first inequality holds from the variational characterization of the nuclear norm. We also used the fact that $\langle Y, Z\rangle = 0$ for all $Z \in \ker\mathcal{R}_\Omega$. Thus, if a $Y$ exists obeying (4.2), we have that for any $X$ obeying $\mathcal{R}_\Omega(X - M) = 0$, $\|X\|_* > \|M\|_*$. That is, if $X$ has $M_{ab} = X_{ab}$ for all $(a,b) \in \Omega$, then $X$ has strictly larger nuclear norm than $M$, and hence $M$ is the unique minimizer of (1.3). The remainder of the proof shows that such a $Y$ exists with high probability.

To this end, partition $1,\ldots,m$ into $p$ partitions of size $q$. By assumption, we may choose
$$q \ge \frac{128}{3}\max\{\mu_0, \mu_1^2\}\, r(n_1+n_2)\beta\log(n_1+n_2) \quad \text{and} \quad p \ge \frac{3}{4}\log(2n_2).$$
Let $\Omega_j$ denote the set of indices corresponding to the $j$th partition. Note that each of these partitions is independent of the others when the indices are sampled with replacement. Assume that
$$\left\|\frac{n_1n_2}{q}\mathcal{P}_T\mathcal{R}_{\Omega_k}\mathcal{P}_T - \mathcal{P}_T\right\| \le \frac{1}{2} \qquad (4.3)$$
for all $k$. Define $W_0 = UV^*$ and set
$$Y_k = \frac{n_1n_2}{q}\sum_{j=1}^k\mathcal{R}_{\Omega_j}(W_{j-1}), \qquad W_k = UV^* - \mathcal{P}_T(Y_k)$$
for $k = 1,\ldots,p$. Then
$$\|W_k\|_F = \left\|W_{k-1} - \frac{n_1n_2}{q}\mathcal{P}_T\mathcal{R}_{\Omega_k}(W_{k-1})\right\|_F = \left\|\left(\mathcal{P}_T - \frac{n_1n_2}{q}\mathcal{P}_T\mathcal{R}_{\Omega_k}\mathcal{P}_T\right)(W_{k-1})\right\|_F \le \frac{1}{2}\|W_{k-1}\|_F,$$
and it follows that $\|W_k\|_F \le 2^{-k}\|W_0\|_F = 2^{-k}\sqrt{r}$. Since $p \ge \frac{3}{4}\log(2n_2) \ge \frac{1}{2}\log_2(2n_2) = \log_2\sqrt{2n_2}$, $Y = Y_p$ will satisfy the first inequality of (4.2).
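The geometric decay of $\|W_k\|_F$ can be simulated directly (small illustrative dimensions and batch sizes, chosen for the demo rather than from the constants above):

```python
import numpy as np

rng = np.random.default_rng(6)
n1, n2, r = 12, 12, 2
q, p = 4000, 6  # batch size q per partition and number of partitions p (illustrative)

U = np.linalg.qr(rng.standard_normal((n1, r)))[0]
V = np.linalg.qr(rng.standard_normal((n2, r)))[0]
PU, PV = U @ U.T, V @ V.T
P_T = lambda Z: PU @ Z + Z @ PV - PU @ Z @ PV

W = U @ V.T                         # W_0 = U V^*
norms = [np.linalg.norm(W, 'fro')]  # ||W_0||_F = sqrt(r)
for _ in range(p):
    counts = np.zeros((n1, n2))     # a fresh, independent batch Omega_k
    for _ in range(q):
        counts[rng.integers(n1), rng.integers(n2)] += 1
    # W_k = (P_T - (n1 n2 / q) P_T R_{Omega_k} P_T)(W_{k-1}); W is already in T.
    W = W - (n1 * n2 / q) * P_T(counts * W)
    norms.append(np.linalg.norm(W, 'fro'))

print([round(x, 4) for x in norms])  # decays geometrically toward zero
```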
Also suppose that
$$\left\|W_{k-1} - \frac{n_1n_2}{q}\mathcal{P}_T\mathcal{R}_{\Omega_k}(W_{k-1})\right\|_\infty \le \frac{1}{2}\|W_{k-1}\|_\infty \qquad (4.4)$$
$$\left\|\left(\frac{n_1n_2}{q}\mathcal{R}_{\Omega_j} - \mathcal{I}\right)(W_{j-1})\right\| \le \sqrt{\frac{8n_1n_2^2\beta\log n_2}{3q}}\,\|W_{j-1}\|_\infty \qquad (4.5)$$
for $k = 1,\ldots,p$. To see that $\|\mathcal{P}_{T^\perp}(Y_p)\| \le \frac{1}{2}$ when (4.4) and (4.5) hold, observe that $\|W_k\|_\infty \le 2^{-k}\|UV^*\|_\infty$, and it follows that
$$\|\mathcal{P}_{T^\perp}Y_p\| \le \sum_{j=1}^p\left\|\frac{n_1n_2}{q}\mathcal{P}_{T^\perp}\mathcal{R}_{\Omega_j}W_{j-1}\right\| = \sum_{j=1}^p\left\|\mathcal{P}_{T^\perp}\left(\frac{n_1n_2}{q}\mathcal{R}_{\Omega_j}W_{j-1} - W_{j-1}\right)\right\| \le \sum_{j=1}^p\left\|\left(\frac{n_1n_2}{q}\mathcal{R}_{\Omega_j} - \mathcal{I}\right)(W_{j-1})\right\| \le \sum_{j=1}^p\sqrt{\frac{8n_1n_2^2\beta\log n_2}{3q}}\,\|W_{j-1}\|_\infty \le 2\sum_{j=1}^p 2^{-j}\sqrt{\frac{8n_1n_2^2\beta\log n_2}{3q}}\,\|UV^*\|_\infty < \sqrt{\frac{32\mu_1^2 rn_2\beta\log n_2}{3q}} < \frac{1}{2}$$
since $q > \frac{128}{3}\mu_1^2 rn_2\beta\log(n_2)$. The first inequality follows from the triangle inequality. The second line follows because $W_{j-1} \in T$ for all $j$. The third line follows because, for any $Z$, $\|\mathcal{P}_{T^\perp}(Z)\| = \|(I_{n_1} - P_U)Z(I_{n_2} - P_V)\| \le \|Z\|$. The fourth line applies (4.5). The next line follows from (4.4). The final line follows from the assumption A1.

All that remains is to bound the probability that all of the invoked events hold. With $m$ satisfying the bound in the main theorem statement, the first inequality in (4.1) fails to hold with probability at most $2n_2^{2-2\beta}$ by Theorem 3.4, and the second inequality fails to hold with probability at most $n_2^{2-2\beta^{1/2}}$ by Proposition 3.3. For all $k$, (4.3) fails to hold with probability at most $2n_2^{2-2\beta}$, (4.4) fails to hold with probability at most $2n_2^{2-2\beta}$, and (4.5) fails to hold with probability at most $(n_1+n_2)^{1-2\beta}$. Summing these all together, all of the events hold with probability at least $1 - 6\log(n_2)(n_1+n_2)^{2-2\beta} - n_2^{2-2\beta^{1/2}}$ by the union bound. This completes the proof.

5 Discussion and Conclusions

The results proven here are nearly optimal, but small improvements can possibly be made.
The numerical constant 32 in the statement of the theorem may be reducible by more clever bookkeeping, and it may be possible to derive a linear dependence on the logarithm of the matrix dimensions. But further reduction is not possible because of the necessary conditions provided by Candès and Tao. One minor improvement that could be made would be to remove the assumption A1. For instance, while $\mu_1$ is known to be small in most of the models of low-rank matrices that have been analyzed, no one has shown that an assumption of the form A1 is necessary for completion. Nonetheless, all prior results on matrix completion have imposed an assumption like A1 [4, 7, 18], and it would be interesting to see if it can be removed as a requirement, or if it is somehow necessary.

Surprisingly, the simplicity of the argument presented here mostly arises from the abandonment of Bernoulli sampling in favor of sampling with replacement. It would be of interest to review results investigating noise robustness of matrix completion [5, 19] or deconvolution of sparse and low-rank matrices [8] to see if those results can be improved by appealing to sampling with replacement. Furthermore, since much of the work on rank minimization and matrix completion borrows tools from the compressed sensing community, it is of interest to revisit this related body of work and to see if proofs can be simplified or bounds can be improved there as well. The noncommutative versions of the Chernoff and Bernstein inequalities may be useful throughout machine learning and statistical signal processing, and a fruitful line of inquiry would examine how to apply these tools from quantum information to the study of classical signals and systems.

Acknowledgments

B.R.
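The contrast between the two sampling models mentioned above can be made concrete in a few lines. This is a toy illustration only; the sizes are my own choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, m = 100, 100, 2000

# Sampling with replacement: exactly m i.i.d. uniform draws of entry indices.
# Duplicates can occur, and cutting the m draws into p batches of q = m / p
# entries yields batches that are mutually independent by construction.
with_repl = rng.integers(0, n1 * n2, size=m)
batches = with_repl.reshape(4, -1)   # p = 4 independent batches of q = 500

# Bernoulli sampling: each entry is revealed independently with probability
# m / (n1 * n2). The number of revealed entries is random, and splitting the
# revealed set into fixed-size pieces makes the pieces dependent.
mask = rng.random((n1, n2)) < m / (n1 * n2)
bernoulli = np.flatnonzero(mask)
```

The independence of the batches under with-replacement sampling is what permits the simple iterative construction of $Y$ in the proof.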
would like to thank Aram Harrow for introducing him to the operator Chernoff bound and many helpful clarifying conversations, Silvia Gandy for pointing out several typos in the original version of this manuscript, and Rob Nowak, Ali Rahimi, and Stephen Wright for many fruitful discussions about this paper.

References

[1] R. Ahlswede and A. Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48(3):569–579, 2002.

[2] A. Argyriou, C. A. Micchelli, and M. Pontil. Convex multi-task feature learning. Machine Learning, 2008. Published online first at http://www.springerlink.com/.

[3] C. Beck and R. D'Andrea. Computational study and comparisons of LFT reducibility methods. In Proceedings of the American Control Conference, 1998.

[4] E. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 2008. To appear. Preprint available at http://lanl.arxiv.org/abs/0805.4471.

[5] E. J. Candès and Y. Plan. Matrix completion with noise. Submitted to Proceedings of the IEEE. Preprint available at http://www-stat.stanford.edu/~candes/publications.html, 2009.

[6] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489–509, 2006.

[7] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. Submitted for publication. Preprint available at http://www-stat.stanford.edu/~candes/publications.html, 2009.

[8] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. Submitted for publication. Preprint available at http://ssg.mit.edu/group/venkatc/venkatc.shtml, 2009.

[9] A. L. Chistov and D. Y. Grigoriev.
Complexity of quantifier elimination in the theory of algebraically closed fields. In Proceedings of the 11th Symposium on Mathematical Foundations of Computer Science, volume 176 of Lecture Notes in Computer Science, pages 17–31. Springer-Verlag, 1984.

[10] V. H. de la Peña and S. J. Montgomery-Smith. Decoupling inequalities for the tail probabilities of multivariate U-statistics. Ann. Probab., 23(2):806–816, 1995.

[11] M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.

[12] M. Fazel, H. Hindi, and S. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, 2001.

[13] J. J. Fuchs. On sparse representations in arbitrary redundant bases. IEEE Transactions on Information Theory, 50:1341–1344, 2004.

[14] S. Golden. Lower bounds for the Helmholtz function. Physical Review, 137B(4):B1127–1128, 1965.

[15] G. H. Gonnet. Expected length of the longest probe sequence in hash code searching. Journal of the Association for Computing Machinery, 28(2):289–304, 1981.

[16] D. Gross, Y.-K. Liu, S. T. Flammia, S. Becker, and J. Eisert. Quantum state tomography via compressed sensing. Preprint available at http://arxiv.org/abs/0909.3304, 2009.

[17] T. Hagerup and C. Rüb. A guided tour of Chernoff bounds. Information Processing Letters, 33:305–308, 1990.

[18] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. 2009. Preprint available at http://arxiv.org/abs/0901.3150.

[19] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. Preprint available at http://arxiv.org/abs/0906.2027, 2009.

[20] N. Linial, E. London, and Y. Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15:215–245, 1995.

[21] M. Mesbahi and G. P. Papavassilopoulos.
On the rank minimization problem over a positive semidefinite linear matrix inequality. IEEE Transactions on Automatic Control, 42(2):239–243, 1997.

[22] G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. To appear in Journal of Statistics and Computing. Preprint available at http://www.seas.upenn.edu/~taskar/, 2009.

[23] D. Panchenko. MIT 18.465: Statistical Learning Theory. MIT OpenCourseWare, http://ocw.mit.edu/OcwWeb/Mathematics/18-465Spring-2007/CourseHome/, 2007.

[24] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum rank solutions of matrix equations via nuclear norm minimization. SIAM Review. To appear. Preprint available at http://pages.cs.wisc.edu/~brecht/publications.html.

[25] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the International Conference of Machine Learning, 2005.

[26] N. Srebro. Learning with Matrix Factorizations. PhD thesis, Massachusetts Institute of Technology, 2004.

[27] C. J. Thompson. Inequality with applications in statistical mechanics. Journal of Mathematical Physics, 6(11):1812–1823, 1965.

[28] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70(1):77–90, 2006.

A Operator Chernoff Bounds

In this section, I present a proof of Theorem 3.2, and also provide new proofs of some probability bounds from quantum information theory. To review, a symmetric matrix $A$ is positive semidefinite if all of its eigenvalues are nonnegative. If $A$ and $B$ are positive semidefinite matrices, $A \preceq B$ means $B - A$ is positive semidefinite. For square matrices $A$, the matrix exponential will be denoted $\exp(A)$ and is given by the power series

$$\exp(A) = \sum_{k=0}^{\infty} \frac{A^k}{k!}.$$
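These two definitions translate directly into code. A short sketch (the helper names `psd_leq` and `expm_sym` are my own): the semidefinite order is checked through the eigenvalues of $B - A$, and for a symmetric matrix the exponential can be computed by eigendecomposition, which agrees with the power series above.

```python
import numpy as np

def psd_leq(A, B, tol=1e-10):
    # A <= B in the semidefinite order iff B - A is positive semidefinite,
    # i.e. all eigenvalues of the symmetric matrix B - A are nonnegative.
    return np.linalg.eigvalsh(B - A).min() >= -tol

def expm_sym(A):
    # Matrix exponential of a symmetric matrix via eigendecomposition:
    # exp(A) = Q diag(exp(w)) Q^T, which equals the power series sum_k A^k/k!.
    w, Q = np.linalg.eigh(A)
    return (Q * np.exp(w)) @ Q.T
```

Truncating the power series gives the same answer for well-conditioned matrices, but the eigendecomposition route is the standard numerically stable choice for symmetric inputs.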
The following theorem is a generalization of Markov's inequality, originally proven in [1]. My proof closely follows the standard proof of the traditional Markov inequality, and does not rely on discrete summations.

Theorem A.1 (Operator Markov Inequality [1]) Let $X$ be a random positive semidefinite matrix and $A$ a fixed positive definite matrix. Then

$$\mathbb{P}[X \npreceq A] \le \operatorname{Tr}\!\left(\mathbb{E}[X]\, A^{-1}\right).$$

Proof. Note that if $X \npreceq A$, then $A^{-1/2} X A^{-1/2} \npreceq I$, and hence $\|A^{-1/2} X A^{-1/2}\| > 1$. Let $\mathbb{I}_{X \npreceq A}$ denote the indicator of the event $X \npreceq A$. Then

$$\mathbb{I}_{X \npreceq A} \le \operatorname{Tr}\!\left(A^{-1/2} X A^{-1/2}\right),$$

as the right-hand side is always nonnegative and, if the left-hand side equals 1, the trace of the right-hand side must exceed the norm of the right-hand side, which is greater than 1. Thus we have

$$\mathbb{P}[X \npreceq A] = \mathbb{E}\!\left[\mathbb{I}_{X \npreceq A}\right] \le \mathbb{E}\!\left[\operatorname{Tr}\!\left(A^{-1/2} X A^{-1/2}\right)\right] = \operatorname{Tr}\!\left(\mathbb{E}[X]\, A^{-1}\right),$$

where the last equality follows from the linearity and cyclic properties of the trace.

Next I will derive a noncommutative version of the Chernoff bound. This was also proven in [1] for i.i.d. matrices. The version stated here is more general in that the random matrices need not be identically distributed, but the proof is essentially the same.

Theorem A.2 (Noncommutative Chernoff Bound) Let $X_1, \ldots, X_n$ be independent symmetric random matrices in $\mathbb{R}^{d \times d}$, and let $A$ be an arbitrary symmetric matrix. Then for any invertible $d \times d$ matrix $T$,

$$\mathbb{P}\!\left[\sum_{k=1}^{n} X_k \npreceq n A\right] \le d \prod_{k=1}^{n} \left\| \mathbb{E}\!\left[\exp\!\left(T X_k T^* - T A T^*\right)\right] \right\|.$$

Proof. The proof relies on an estimate from statistical physics which is stated here without proof.

Lemma A.3 (Golden-Thompson inequality [14, 27]) For any symmetric matrices $A$ and $B$,

$$\operatorname{Tr}(\exp(A + B)) \le \operatorname{Tr}((\exp A)(\exp B)).$$

Much like the proof of the standard Chernoff bound, the theorem now follows from a long chain of inequalities.
$$
\begin{aligned}
\mathbb{P}\!\left[\sum_{k=1}^{n} X_k \npreceq n A\right]
&= \mathbb{P}\!\left[\sum_{k=1}^{n} (X_k - A) \npreceq 0\right] \\
&= \mathbb{P}\!\left[\sum_{k=1}^{n} T (X_k - A) T^* \npreceq 0\right] \\
&= \mathbb{P}\!\left[\exp\!\left(\sum_{k=1}^{n} T (X_k - A) T^*\right) \npreceq I_d\right] \\
&\le \operatorname{Tr}\!\left(\mathbb{E}\!\left[\exp\!\left(\sum_{k=1}^{n} T (X_k - A) T^*\right)\right]\right) \\
&= \mathbb{E}\!\left[\operatorname{Tr}\!\left(\exp\!\left(\sum_{k=1}^{n} T (X_k - A) T^*\right)\right)\right] \\
&\le \mathbb{E}\!\left[\operatorname{Tr}\!\left(\exp\!\left(\sum_{k=1}^{n-1} T (X_k - A) T^*\right) \exp\!\left(T (X_n - A) T^*\right)\right)\right] \\
&\le \mathbb{E}_{1, \ldots, n-1}\!\left[\operatorname{Tr}\!\left(\exp\!\left(\sum_{k=1}^{n-1} T (X_k - A) T^*\right) \mathbb{E}\!\left[\exp\!\left(T (X_n - A) T^*\right)\right]\right)\right] \\
&\le \left\|\mathbb{E}\!\left[\exp\!\left(T (X_n - A) T^*\right)\right]\right\|\; \mathbb{E}_{1, \ldots, n-1}\!\left[\operatorname{Tr}\!\left(\exp\!\left(\sum_{k=1}^{n-1} T (X_k - A) T^*\right)\right)\right] \\
&\le \prod_{k=2}^{n} \left\|\mathbb{E}\!\left[\exp\!\left(T (X_k - A) T^*\right)\right]\right\|\; \mathbb{E}\!\left[\operatorname{Tr}\!\left(\exp\!\left(T (X_1 - A) T^*\right)\right)\right] \\
&\le d \prod_{k=1}^{n} \left\|\mathbb{E}\!\left[\exp\!\left(T (X_k - A) T^*\right)\right]\right\|
\end{aligned}
$$

Here, the first three lines follow from standard properties of the semidefinite ordering. The fourth line invokes the Operator Markov Inequality. The sixth line follows from the Golden-Thompson inequality. The seventh line follows from independence of the $X_k$. The eighth line follows because, for positive definite matrices, $\operatorname{Tr}(AB) \le \operatorname{Tr}(A)\, \|B\|$; this is just another statement of the duality between the nuclear and operator norms. The ninth line iteratively repeats the previous two steps. The final line follows because, for a positive definite matrix $A$, $\operatorname{Tr}(A)$ is the sum of the eigenvalues of $A$, and all of the eigenvalues are at most $\|A\|$.

Let us now turn to proving the Noncommutative Bernstein Inequality presented in Section 3. The authors in [16] proposed a similar inequality for symmetric i.i.d. random matrices with a slightly worse constant. The proof here is more general and follows the standard derivation of Bernstein's inequality.

Proof [of Theorem 3.2] Set

$$Y_k = \begin{bmatrix} 0 & X_k \\ X_k^* & 0 \end{bmatrix}.$$

Then the $Y_k$ are symmetric random matrices, and for all $k$,

$$\left\|\mathbb{E}[Y_k^2]\right\| = \left\| \mathbb{E}\begin{bmatrix} X_k X_k^* & 0 \\ 0 & X_k^* X_k \end{bmatrix} \right\| = \max\left\{ \left\|\mathbb{E}[X_k X_k^*]\right\|,\; \left\|\mathbb{E}[X_k^* X_k]\right\| \right\} = \rho_k^2.$$

Moreover, the maximum singular value of $\sum_{k=1}^{L} X_k$ is equal to the maximum eigenvalue of $\sum_{k=1}^{L} Y_k$.
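Two ingredients of this argument can be checked numerically in a few lines: the Golden-Thompson inequality, and the fact that the symmetric dilation turns singular values into eigenvalues. This is a toy check with random matrices; `expm_sym` is my own helper.

```python
import numpy as np

def expm_sym(A):
    # Exponential of a symmetric matrix via eigendecomposition.
    w, Q = np.linalg.eigh(A)
    return (Q * np.exp(w)) @ Q.T

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4)); A = (A + A.T) / 2
B = rng.standard_normal((4, 4)); B = (B + B.T) / 2

# Golden-Thompson: Tr exp(A + B) <= Tr(exp(A) exp(B)) for symmetric A, B.
lhs = np.trace(expm_sym(A + B))
rhs = np.trace(expm_sym(A) @ expm_sym(B))

# Dilation used in the Bernstein proof: the largest singular value of X
# equals the largest eigenvalue of the symmetric matrix [[0, X], [X^T, 0]].
X = rng.standard_normal((3, 5))
Y = np.block([[np.zeros((3, 3)), X], [X.T, np.zeros((5, 5))]])
```

Equality holds in Golden-Thompson exactly when $A$ and $B$ commute; for generic random matrices the inequality is strict.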
By Theorem A.2, we have for all $\lambda > 0$

$$\mathbb{P}\!\left[\left\|\sum_{k=1}^{L} X_k\right\| > L t\right] = \mathbb{P}\!\left[\sum_{k=1}^{L} Y_k \npreceq L t\, I\right] \le (d_1 + d_2)\, \exp(-L \lambda t) \prod_{k=1}^{L} \left\|\mathbb{E}[\exp(\lambda Y_k)]\right\|.$$

For each $k$, let $Y_k = U_k \Lambda_k U_k^*$ be an eigenvalue decomposition, where $\Lambda_k$ is the diagonal matrix of the eigenvalues of $Y_k$. In turn, it follows that for $s > 0$,

$$- M^s Y_k^2 = -U_k M^s \Lambda_k^2 U_k^* \preceq U_k \Lambda_k^{2+s} U_k^* = Y_k^{2+s} \preceq U_k M^s \Lambda_k^2 U_k^* = M^s Y_k^2,$$

which then implies

$$\left\|\mathbb{E}[Y_k^{s+2}]\right\| \le M^s \left\|\mathbb{E}[Y_k^2]\right\|. \tag{A.1}$$

For fixed $k$, we have

$$
\begin{aligned}
\left\|\mathbb{E}[\exp(\lambda Y_k)]\right\|
&\le \|I\| + \sum_{j=2}^{\infty} \frac{\lambda^j}{j!} \left\|\mathbb{E}[Y_k^j]\right\| \\
&\le 1 + \sum_{j=2}^{\infty} \frac{\lambda^j}{j!} \left\|\mathbb{E}[Y_k^2]\right\| M^{j-2} \\
&= 1 + \frac{\rho_k^2}{M^2} \sum_{j=2}^{\infty} \frac{(\lambda M)^j}{j!} \\
&= 1 + \frac{\rho_k^2}{M^2} \left(\exp(\lambda M) - 1 - \lambda M\right) \\
&\le \exp\!\left(\frac{\rho_k^2}{M^2} \left(\exp(\lambda M) - 1 - \lambda M\right)\right).
\end{aligned}
$$

The first inequality follows from the triangle inequality and the fact that $\mathbb{E}[Y_k] = 0$, the second inequality follows from (A.1), and the final inequality follows from the fact that $1 + x \le \exp(x)$ for all $x$. Putting this together gives

$$\mathbb{P}\!\left[\left\|\sum_{k=1}^{L} X_k\right\| > L t\right] \le (d_1 + d_2)\, \exp\!\left(-\lambda L t + \frac{\sum_{k=1}^{L} \rho_k^2}{M^2} \left(\exp(\lambda M) - 1 - \lambda M\right)\right).$$

This final expression is now just a real number, and only has to be minimized as a function of $\lambda$. The theorem now follows by algebraic manipulation: the right-hand side is minimized by setting

$$\lambda = \frac{1}{M} \log\!\left(1 + \frac{t L M}{\sum_{k=1}^{L} \rho_k^2}\right),$$

and then basic approximations can be employed to complete the argument (see, for example, [23], lectures 4 and 5).
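The final minimization over $\lambda$ can be sanity-checked numerically: the exponent is convex in $\lambda$, and the stated $\lambda$ is its exact stationary point. A short sketch with arbitrary parameter values of my own choosing (the bound is evaluated up to the dimensional factor $d_1 + d_2$):

```python
import numpy as np

def bernstein_exponent(lmbda, t, L, M, rho2_sum):
    # Logarithm of the tail bound, up to the (d1 + d2) prefactor:
    # -lambda L t + (sum_k rho_k^2 / M^2) * (exp(lambda M) - 1 - lambda M)
    return -lmbda * L * t + (rho2_sum / M**2) * (np.expm1(lmbda * M) - lmbda * M)

# Arbitrary illustrative parameters.
t, L, M, rho2_sum = 0.5, 100, 1.0, 20.0

# The minimizing lambda from the proof: (1/M) log(1 + t L M / sum_k rho_k^2).
lam_star = (1 / M) * np.log1p(t * L * M / rho2_sum)

# Evaluate the exponent over a grid; lam_star should beat every grid point.
lams = np.linspace(1e-3, 3.0, 400)
vals = bernstein_exponent(lams, t, L, M, rho2_sum)
```

Setting the derivative $-Lt + (\sum_k \rho_k^2 / M)(e^{\lambda M} - 1)$ to zero reproduces the closed form for $\lambda$ above.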