Dimension-free tail inequalities for sums of random matrices

Authors: Daniel Hsu (1,2), Sham M. Kakade (2), and Tong Zhang (1)

1. Department of Statistics, Rutgers University
2. Department of Statistics, Wharton School, University of Pennsylvania

E-mail: djhsu@rci.rutgers.edu, skakade@wharton.upenn.edu, tzhang@stat.rutgers.edu

October 22, 2018

Abstract

We derive exponential tail inequalities for sums of random matrices with no dependence on the explicit matrix dimensions. These are similar to the matrix versions of the Chernoff bound and Bernstein inequality except with the explicit matrix dimensions replaced by a trace quantity that can be small even when the dimension is large or infinite. Some applications to principal component analysis and approximate matrix multiplication are given to illustrate the utility of the new bounds.

1 Introduction

Sums of random matrices arise in many statistical and probabilistic applications, and hence their concentration behavior is of significant interest. Surprisingly, the classical exponential moment method used to derive tail inequalities for scalar random variables carries over to the matrix setting when augmented with certain matrix trace inequalities. This fact was first discovered by Ahlswede and Winter (2002), who proved a matrix version of the Chernoff bound using the Golden-Thompson inequality (Golden, 1965; Thompson, 1965): $\operatorname{tr}\exp(A + B) \leq \operatorname{tr}(\exp(A)\exp(B))$ for all symmetric matrices $A$ and $B$. Later, it was demonstrated that the same technique could be adapted to yield analogues of other tail bounds such as Bernstein's inequality (Gross et al., 2010; Recht, 2009; Gross, 2009; Oliveira, 2010a,b). Recently, a theorem due to Lieb (1973) was identified by Tropp (2011a,b) to yield sharper versions of this general class of tail bounds. Altogether, these results have proved invaluable in constructing and simplifying many probabilistic arguments concerning sums of random matrices.

One deficiency of these previous inequalities is their explicit dependence on the matrix dimension, which prevents their application to infinite-dimensional spaces that arise in a variety of data analysis tasks (e.g., Schölkopf et al., 1999; Rasmussen and Williams, 2006; Fukumizu et al., 2007; Bach, 2008). In this work, we prove analogous results where the dimension is replaced with a trace quantity that can be small even when the dimension is large or infinite. For instance, in our matrix generalization of Bernstein's inequality, the (normalized) trace of the second moment matrix appears instead of the matrix dimension. Such trace quantities can often be regarded as an intrinsic notion of dimension. The price for this improvement is that the more typical exponential tail $e^{-t}$ is replaced with a slightly weaker tail $t(e^t - t - 1)^{-1} \approx e^{-t + \log t}$. As $t$ becomes large, the difference becomes negligible: for instance, if $t \geq 2.6$, then $t(e^t - t - 1)^{-1} \leq e^{-t/2}$.
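As a quick numerical check of this last claim, the following sketch (ours, assuming only numpy; the grid is an arbitrary choice) locates where $t(e^t - t - 1)^{-1}$ starts to fall below $e^{-t/2}$.

```python
import numpy as np

# Compare the weaker tail t*(e^t - t - 1)^(-1) against e^(-t/2).
# The paper states the former is dominated by the latter once t >= 2.6.
t = np.linspace(0.1, 20.0, 2000)
weak_tail = t / (np.exp(t) - t - 1.0)
half_exp_tail = np.exp(-t / 2.0)

# Smallest grid point from which the domination holds onward.
dominated = weak_tail <= half_exp_tail
threshold = t[np.argmax(dominated)]
print(f"t*(e^t - t - 1)^-1 <= e^(-t/2) for all grid points t >= {threshold:.2f}")
```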
There are some previous works that give dimension-free tail inequalities in some special cases. Rudelson and Vershynin (2007) prove exponential tail inequalities for sums of rank-one matrices by way of a key inequality of Rudelson (1999) (see also Oliveira, 2010a). Magen and Zouzias (2011) prove tail inequalities for sums of low-rank matrices using non-commutative Khintchine moment inequalities, but fall short of giving an exponential tail inequality. In contrast, our results are proved using a natural matrix generalization of the exponential moment method.

2 Preliminaries

Let $\xi_1, \ldots, \xi_n$ be random variables, and for each $i = 1, \ldots, n$, let $X_i := X_i(\xi_1, \ldots, \xi_i)$ be a symmetric matrix-valued functional of $\xi_1, \ldots, \xi_i$. We use $\mathbb{E}_i[\cdot]$ as shorthand for $\mathbb{E}[\cdot \mid \xi_1, \ldots, \xi_{i-1}]$. For any symmetric matrix $H$, let $\lambda_{\max}(H)$ denote its largest eigenvalue, $\exp(H) := I + \sum_{k=1}^{\infty} H^k / k!$, and $\log(\exp(H)) := H$.

The following convex trace inequality of Lieb (1973) was also used by Tropp (2011a,b).

Theorem 1 (Lieb, 1973). For any symmetric matrix $H$, the function $M \mapsto \operatorname{tr}\exp(H + \log(M))$ is concave in $M$ for $M \succ 0$.

The following lemma due to Tropp (2011b) is a matrix generalization of a scalar result due to Freedman (1975) (see also Zhang, 2005), where the key is the invocation of Theorem 1. We give the proof for completeness.

Lemma 1 (Tropp, 2011b). For any constant symmetric matrix $X_0$,

$$\mathbb{E}\left[ \operatorname{tr}\exp\left( \sum_{i=0}^{n} X_i - \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(X_i)] \right) \right] \leq \operatorname{tr}\exp(X_0). \tag{1}$$

Proof. By induction on $n$. The claim holds trivially for $n = 0$. Now fix $n \geq 1$, and assume as the inductive hypothesis that (1) holds with $n$ replaced by $n - 1$. In this case,

$$\begin{aligned}
\mathbb{E}\left[ \operatorname{tr}\exp\left( \sum_{i=0}^{n} X_i - \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(X_i)] \right) \right]
&= \mathbb{E}\left[ \mathbb{E}_n\left[ \operatorname{tr}\exp\left( \sum_{i=0}^{n-1} X_i - \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(X_i)] + \log \exp(X_n) \right) \right] \right] \\
&\leq \mathbb{E}\left[ \operatorname{tr}\exp\left( \sum_{i=0}^{n-1} X_i - \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(X_i)] + \log \mathbb{E}_n[\exp(X_n)] \right) \right] \\
&= \mathbb{E}\left[ \operatorname{tr}\exp\left( \sum_{i=0}^{n-1} X_i - \sum_{i=1}^{n-1} \log \mathbb{E}_i[\exp(X_i)] \right) \right] \\
&\leq \operatorname{tr}\exp(X_0)
\end{aligned}$$

where the first inequality follows from Theorem 1 and Jensen's inequality, and the second inequality follows from the inductive hypothesis. ∎
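Theorem 1 does the heavy lifting in Lemma 1 and in everything that follows, so a quick numerical illustration may be helpful. The sketch below is our own check, not part of the paper's argument; it assumes numpy and scipy are available, and verifies concavity of $M \mapsto \operatorname{tr}\exp(H + \log M)$ along a random segment of positive definite matrices.

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(0)

def f(H, M):
    # tr exp(H + log M), the function shown to be concave in M by Lieb's theorem
    return np.trace(expm(H + logm(M))).real

d = 5
A = rng.normal(size=(d, d)); H = (A + A.T) / 2          # random symmetric H
B = rng.normal(size=(d, d)); M0 = B @ B.T + np.eye(d)   # random positive definite M0
C = rng.normal(size=(d, d)); M1 = C @ C.T + np.eye(d)   # random positive definite M1

# Concavity along the segment: f(H, (1-s) M0 + s M1) >= (1-s) f(H, M0) + s f(H, M1)
for s in np.linspace(0.0, 1.0, 11):
    lhs = f(H, (1 - s) * M0 + s * M1)
    rhs = (1 - s) * f(H, M0) + s * f(H, M1)
    assert lhs >= rhs - 1e-8, (s, lhs, rhs)
print("Concavity held at all tested points on the segment.")
```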
3 Exponential tail inequalities for sums of random matrices

3.1 A generic inequality

We first state a generic inequality based on Lemma 1. This differs from earlier approaches, which instead combine Markov's inequality with a result similar to Lemma 1 (e.g., Tropp, 2011a, Theorem 3.6).

Theorem 2. For any $\eta \in \mathbb{R}$ and any $t > 0$,

$$\Pr\left[ \lambda_{\max}\left( \eta \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right) > t \right] \leq \operatorname{tr}\left( \mathbb{E}\left[ -\eta \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right] \right) \cdot (e^t - t - 1)^{-1}.$$

Proof. Fix a constant symmetric matrix $X_0$, and let $A := X_0 + \eta \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)]$. Note that $g(x) := e^x - x - 1$ is non-negative for all $x \in \mathbb{R}$ and increasing for $x \geq 0$. Letting $\{\lambda_i(A)\}$ denote the eigenvalues of $A$, we have

$$\begin{aligned}
\Pr[\lambda_{\max}(A) > t]\,(e^t - t - 1)
&= \mathbb{E}\left[ \mathbf{1}[\lambda_{\max}(A) > t]\,(e^t - t - 1) \right] \\
&\leq \mathbb{E}\left[ e^{\lambda_{\max}(A)} - \lambda_{\max}(A) - 1 \right] \\
&\leq \mathbb{E}\left[ \sum_i \left( e^{\lambda_i(A)} - \lambda_i(A) - 1 \right) \right] \\
&= \mathbb{E}[\operatorname{tr}(\exp(A) - A - I)] \\
&\leq \operatorname{tr}(\exp(X_0) + \mathbb{E}[-A] - I)
\end{aligned}$$

where the last inequality follows from Lemma 1 (applied with $\eta X_i$ in place of $X_i$). Now we take $X_0 \to 0$, so $\operatorname{tr}(\exp(X_0) - I) \to 0$. ∎

3.2 Some specific bounds

We now give some specific bounds as corollaries of Theorem 2. Most of the estimates used in the proofs are taken from previous works (e.g., Ahlswede and Winter, 2002; Tropp, 2011a); the main point here is to show how these previous techniques can be combined with Theorem 2 to yield new tail inequalities with no explicit dependence on the matrix dimension.

First, we give a bound under a subgaussian-type condition on the distribution.

Theorem 3 (Matrix subgaussian bound). Suppose there exist $\bar\sigma > 0$ and $\bar k > 0$ such that $\mathbb{E}_i[X_i] = 0$ for all $i = 1, \ldots, n$, and

$$\lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right) \leq \frac{\eta^2 \bar\sigma^2}{2}, \qquad \mathbb{E}\left[ \operatorname{tr}\left( \frac{1}{n} \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right) \right] \leq \frac{\eta^2 \bar\sigma^2 \bar k}{2}$$

for all $\eta > 0$, almost surely. Then for any $t > 0$,

$$\Pr\left[ \lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) > \sqrt{\frac{2\bar\sigma^2 t}{n}} \right] \leq \bar k \cdot t(e^t - t - 1)^{-1}.$$

Proof. Fix $\eta := \sqrt{2t/(\bar\sigma^2 n)}$. By Theorem 2, we obtain

$$\Pr\left[ \lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} X_i - \frac{1}{n\eta} \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right) > \frac{t}{n\eta} \right] \leq \operatorname{tr}\left( \mathbb{E}\left[ \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right] \right) \cdot (e^t - t - 1)^{-1} \leq \frac{n\eta^2\bar\sigma^2\bar k}{2} \cdot (e^t - t - 1)^{-1} = \bar k \cdot t(e^t - t - 1)^{-1}.$$

Now suppose

$$\lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} X_i - \frac{1}{n\eta} \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right) \leq \frac{t}{n\eta}.$$

This implies, for every non-zero vector $u$,

$$\frac{u^\top \left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) u}{u^\top u} \leq \frac{u^\top \left( \frac{1}{n\eta} \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right) u}{u^\top u} + \frac{t}{n\eta} \leq \lambda_{\max}\left( \frac{1}{n\eta} \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right) + \frac{t}{n\eta}$$

and therefore

$$\lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) \leq \lambda_{\max}\left( \frac{1}{n\eta} \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right) + \frac{t}{n\eta} \leq \frac{\eta\bar\sigma^2}{2} + \frac{t}{n\eta} = \sqrt{\frac{2\bar\sigma^2 t}{n}}$$

as required. ∎

We can also give a Bernstein-type bound based on moment conditions. For simplicity, we just state the bound in the case that the $\lambda_{\max}(X_i)$ are bounded almost surely.

Theorem 4 (Matrix Bernstein bound). Suppose there exist $\bar b > 0$, $\bar\sigma > 0$, and $\bar k > 0$ such that $\mathbb{E}_i[X_i] = 0$ and $\lambda_{\max}(X_i) \leq \bar b$ for all $i = 1, \ldots, n$, and

$$\lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[X_i^2] \right) \leq \bar\sigma^2, \qquad \mathbb{E}\left[ \operatorname{tr}\left( \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[X_i^2] \right) \right] \leq \bar\sigma^2 \bar k,$$

almost surely. Then for any $t > 0$,

$$\Pr\left[ \lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) > \sqrt{\frac{2\bar\sigma^2 t}{n}} + \frac{\bar b t}{3n} \right] \leq \bar k \cdot t(e^t - t - 1)^{-1}.$$

Proof. Let $\eta > 0$. For each $i = 1, \ldots, n$,

$$\exp(\eta X_i) \preceq I + \eta X_i + \frac{e^{\eta\bar b} - \eta\bar b - 1}{\bar b^2} \cdot X_i^2$$

and therefore

$$\log \mathbb{E}_i[\exp(\eta X_i)] \preceq \frac{e^{\eta\bar b} - \eta\bar b - 1}{\bar b^2} \cdot \mathbb{E}_i[X_i^2].$$

Since $e^x - x - 1 \leq x^2/(2(1 - x/3))$ for $0 \leq x < 3$, we have by Theorem 2

$$\Pr\left[ \lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) > \frac{\eta\bar\sigma^2}{2(1 - \eta\bar b/3)} + \frac{t}{\eta n} \right] \leq \frac{\eta^2\bar\sigma^2\bar k n}{2(1 - \eta\bar b/3)} \cdot (e^t - t - 1)^{-1}$$

provided that $\eta < 3/\bar b$. Choosing

$$\eta := \frac{3}{\bar b} \left( 1 - \frac{\sqrt{2\bar\sigma^2 t/n}}{2\bar b t/(3n) + \sqrt{2\bar\sigma^2 t/n}} \right)$$

gives the desired bound. ∎
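To see Theorem 4 at work, here is a small Monte Carlo check (our own illustration, assuming numpy; the choice of distribution, a fixed symmetric matrix with a Rademacher sign, is arbitrary). It estimates the tail probability empirically and compares it with the theorem's bound.

```python
import numpy as np

rng = np.random.default_rng(1)

# Summands are X_i = eps_i * B with Rademacher eps_i and fixed symmetric B,
# so E[X_i] = 0, lambda_max(X_i) <= ||B||_2 =: b_bar, and E[X_i^2] = B^2.
d, n, trials = 8, 200, 2000
G = rng.normal(size=(d, d))
B = (G + G.T) / (2 * np.sqrt(d))

eigs_B2 = np.linalg.eigvalsh(B @ B)
b_bar = np.max(np.abs(np.linalg.eigvalsh(B)))
sigma2 = eigs_B2.max()              # lambda_max of (1/n) sum E[X_i^2] = B^2
k_bar = eigs_B2.sum() / sigma2      # trace quantity replacing the dimension

t = 5.0
threshold = np.sqrt(2 * sigma2 * t / n) + b_bar * t / (3 * n)

# Empirical tail probability of lambda_max((1/n) sum_i X_i)
count = 0
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=n)
    S = (eps.sum() / n) * B          # (1/n) sum_i eps_i B
    count += np.linalg.eigvalsh(S).max() > threshold
bound = k_bar * t / (np.exp(t) - t - 1)
print(f"empirical tail: {count / trials:.4f}   Theorem 4 bound: {bound:.4f}")
```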
3.3 Discussion

The advantage of our results over previous exponential tail inequalities for sums of random matrices is the absence of explicit dependence on the matrix dimensions. Indeed, all previous tail inequalities using the exponential moment method (either via the Golden-Thompson inequality or Lieb's trace inequality) are roughly of the form $d \cdot e^{-t}$ when the matrices in the sum are $d \times d$ (Ahlswede and Winter, 2002; Gross et al., 2010; Recht, 2009; Gross, 2009; Tropp, 2011a,b). Our results also improve over the tail inequalities of Rudelson and Vershynin (2007) in that they apply to full-rank matrices, not just rank-one matrices; and also over that of Magen and Zouzias (2011) in that they provide an exponential tail inequality, rather than just a polynomial tail. Thus, our improvements widen the applicability of these inequalities (and the matrix exponential moment method in general); we explore some of these applications in Subsection 3.4.

One disadvantage of our technique is that in finite-dimensional settings, the relevant trace quantity that replaces the dimension may turn out to be of the same order as the dimension $d$ (an example of such a case is discussed next). In such cases, the resulting tail bound from Theorem 4 (say) of $\bar k \cdot t(e^t - t - 1)^{-1}$ is looser than the $d \cdot e^{-t}$ tail bound provided by earlier techniques (e.g., Tropp, 2011a).

We note that the matrix exponential moment method used here and in previous work can lead to a significantly suboptimal tail inequality in some cases. This was pointed out by Tropp (2011a, Section 4.6), but we elaborate on it further here. Suppose $x_1, \ldots, x_n \in \{\pm 1\}^d$ are i.i.d. random vectors with independent Rademacher entries (each coordinate of $x_i$ is $+1$ or $-1$ with equal probability). Let $X_i = x_i x_i^\top - I$, so $\mathbb{E}[X_i] = 0$, $\lambda_{\max}(X_i) = \lambda_{\max}(\mathbb{E}[X_i^2]) = d - 1$, and $\operatorname{tr}(\mathbb{E}[X_i^2]) = d(d - 1)$. In this case, Theorem 4 implies the bound

$$\Pr\left[ \lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top - I \right) > \sqrt{\frac{2(d-1)t}{n}} + \frac{(d-1)t}{3n} \right] \leq d\,t(e^t - t - 1)^{-1}.$$

On the other hand, because the $x_i$ have subgaussian projections, it is known that

$$\Pr\left[ \lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top - I \right) > 2\sqrt{\frac{71d + 16t}{n}} + \frac{10d + 2t}{n} \right] \leq 2e^{-t/2}$$

(Litvak et al., 2005; also see Lemma 2 in Appendix A). First, this latter inequality removes the $d$ factor on the right-hand side. Perhaps more importantly, the deviation term $t$ does not scale with $d$ in this inequality, whereas it does in the former. Thus this latter bound provides a much stronger exponential tail: roughly put, $\Pr[\lambda_{\max}(\sum_{i=1}^{n} x_i x_i^\top / n - I) > c \cdot (\sqrt{d/n} + d/n) + \tau] \leq \exp(-\Omega(n \min(\tau, \tau^2)))$ for some constant $c > 0$; the probability bound from Theorem 4 is only of the form $\exp(-\Omega((n/d)\min(\tau, \tau^2)))$.

The suboptimality of Theorem 4 is shared by all other existing tail inequalities proved using this exponential moment method. The issue is related to the asymptotic freeness of the random matrices $X_1, \ldots, X_n$ (Voiculescu, 1991; Guionnet, 2004), i.e., that nearly all high-order moments of random matrices vanish asymptotically, which is not exploited in the matrix exponential moment method. This means that the proof technique in the exponential moment method over-counts the contribution of high-order matrix moments that should have vanished. Formalizing this discrepancy would help clarify the limits of this technique, but the task is beyond the scope of this paper. It is also worth mentioning that asymptotic freeness only holds when the $X_i$ have independent entries. For matrices with correlated entries, our bound is close to best possible in the worst case.
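As a concrete look at the gap discussed above, the following sketch (ours, assuming numpy; parameter values are illustrative) simulates the Rademacher example and prints the empirical deviation alongside the deviation terms of the two bounds.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, t, trials = 20, 2000, 5.0, 200

devs = []
for _ in range(trials):
    X = rng.choice([-1.0, 1.0], size=(n, d))       # rows are Rademacher vectors
    S = X.T @ X / n - np.eye(d)                    # (1/n) sum x_i x_i^T - I
    devs.append(np.linalg.eigvalsh(S).max())

# Deviation term from Theorem 4 (scales with d) vs. the subgaussian-projection
# bound of Litvak et al. (2005).  The constants in the latter are large, so its
# advantage shows in how the deviation scales with t and d, not necessarily at
# any one fixed small parameter setting.
thm4_dev = np.sqrt(2 * (d - 1) * t / n) + (d - 1) * t / (3 * n)
litvak_dev = 2 * np.sqrt((71 * d + 16 * t) / n) + (10 * d + 2 * t) / n
print(f"empirical 95th pct: {np.quantile(devs, 0.95):.3f}")
print(f"Theorem 4 deviation: {thm4_dev:.3f}   Litvak et al. deviation: {litvak_dev:.3f}")
```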
3.4 Examples

For a matrix $M$, let $\|M\|_F$ denote its Frobenius norm, and let $\|M\|_2$ denote its spectral norm. If $M$ is symmetric, then $\|M\|_2 = \max\{\lambda_{\max}(M), -\lambda_{\min}(M)\}$, where $\lambda_{\max}(M)$ and $\lambda_{\min}(M)$ are, respectively, the largest and smallest eigenvalues of $M$.

3.4.1 Supremum of a random process

The first example embeds a random process in a diagonal matrix to show that Theorem 3 is tight in certain cases.

Example 1. Let $(Z_1, Z_2, \ldots)$ be (possibly dependent) mean-zero subgaussian random variables; i.e., each $\mathbb{E}[Z_i] = 0$, and there exist positive constants $\sigma_1, \sigma_2, \ldots$ such that

$$\mathbb{E}[\exp(\eta Z_i)] \leq \exp\left( \frac{\eta^2 \sigma_i^2}{2} \right) \quad \forall \eta \in \mathbb{R}.$$

We further assume that $v := \sup_i \{\sigma_i^2\} < \infty$ and $k := \frac{1}{v} \sum_i \sigma_i^2 < \infty$. Also, for convenience, we assume $\log k \geq 1.3$ (to simplify the tail inequality). Let $X = \operatorname{diag}(Z_1, Z_2, \ldots)$ be the random diagonal matrix with the $Z_i$ on its diagonal. We have $\mathbb{E}[X] = 0$, and

$$\log \mathbb{E}[\exp(\eta X)] \preceq \operatorname{diag}\left( \frac{\eta^2 \sigma_1^2}{2}, \frac{\eta^2 \sigma_2^2}{2}, \ldots \right),$$

so $\lambda_{\max}(\log \mathbb{E}[\exp(\eta X)]) \leq \eta^2 v / 2$ and $\operatorname{tr}(\log \mathbb{E}[\exp(\eta X)]) \leq \eta^2 v k / 2$. By Theorem 3, we have

$$\Pr\left[ \lambda_{\max}(X) > \sqrt{2vt} \right] \leq k\,t(e^t - t - 1)^{-1}.$$

Therefore, letting $t := 2(\tau + \log k) > 2.6$ for $\tau > 0$ and interpreting $\lambda_{\max}(X)$ as $\sup_i \{Z_i\}$,

$$\Pr\left[ \sup_i \{Z_i\} > 2\sqrt{ \sup_i \{\sigma_i^2\} \left( \log \frac{\sum_i \sigma_i^2}{\sup_i \{\sigma_i^2\}} + \tau \right) } \right] \leq e^{-\tau}.$$

Suppose the $Z_i \sim N(0, 1)$ are just $N$ i.i.d. standard Gaussian random variables. Then the above inequality states that the largest of the $Z_i$ is $O(\sqrt{\log N + \tau})$ with probability at least $1 - e^{-\tau}$; this is known to be tight up to constants, so the $\log N$ term cannot generally be removed. This fact has been noted by previous works on matrix tail inequalities (e.g., Tropp, 2011a), which also use this example as an extreme case. We note, however, that these previous works are not applicable to the case of a countably infinite number of mean-zero Gaussian random variables $Z_i \sim N(0, \sigma_i^2)$ (or more generally, subgaussian random variables), whereas the above inequality can be applied as long as the sum of the $\sigma_i^2$ is finite. ∎

3.4.2 Principal component analysis

Our next two examples use Theorem 4 to give spectral norm error bounds for estimating the second moment matrix of a random vector from i.i.d. copies. This is relevant in the context of (kernel) principal component analysis of high (or infinite) dimensional data (e.g., Schölkopf et al., 1999).

Example 2. Let $x_1, \ldots, x_n$ be i.i.d. random vectors with $\Sigma := \mathbb{E}[x_i x_i^\top]$, $K := \mathbb{E}[x_i x_i^\top x_i x_i^\top]$, and $\|x_i\|_2 \leq \bar\ell$ almost surely for some $\bar\ell > 0$. Let $X_i := x_i x_i^\top - \Sigma$ and $\hat\Sigma_n := n^{-1} \sum_{i=1}^{n} x_i x_i^\top$. We have $\lambda_{\max}(X_i) \leq \bar\ell^2 - \lambda_{\min}(\Sigma)$. Also, $\lambda_{\max}(n^{-1} \sum_{i=1}^{n} \mathbb{E}[X_i^2]) = \lambda_{\max}(K - \Sigma^2)$ and $\mathbb{E}[\operatorname{tr}(n^{-1} \sum_{i=1}^{n} \mathbb{E}[X_i^2])] = \operatorname{tr}(K - \Sigma^2)$. By Theorem 4,

$$\Pr\left[ \lambda_{\max}\left( \hat\Sigma_n - \Sigma \right) > \sqrt{\frac{2\lambda_{\max}(K - \Sigma^2)t}{n}} + \frac{(\bar\ell^2 - \lambda_{\min}(\Sigma))t}{3n} \right] \leq \frac{\operatorname{tr}(K - \Sigma^2)}{\lambda_{\max}(K - \Sigma^2)} \cdot t(e^t - t - 1)^{-1}.$$

Since $\lambda_{\max}(-X_i) \leq \lambda_{\max}(\Sigma)$, we also have

$$\Pr\left[ \lambda_{\max}\left( \Sigma - \hat\Sigma_n \right) > \sqrt{\frac{2\lambda_{\max}(K - \Sigma^2)t}{n}} + \frac{\lambda_{\max}(\Sigma)t}{3n} \right] \leq \frac{\operatorname{tr}(K - \Sigma^2)}{\lambda_{\max}(K - \Sigma^2)} \cdot t(e^t - t - 1)^{-1}.$$

Therefore

$$\Pr\left[ \|\hat\Sigma_n - \Sigma\|_2 > \sqrt{\frac{2\lambda_{\max}(K - \Sigma^2)t}{n}} + \frac{\max\{\bar\ell^2 - \lambda_{\min}(\Sigma), \lambda_{\max}(\Sigma)\}\,t}{3n} \right] \leq \frac{\operatorname{tr}(K - \Sigma^2)}{\lambda_{\max}(K - \Sigma^2)} \cdot 2t(e^t - t - 1)^{-1}.$$

A similar result was given by Zwald and Blanchard (2006, Lemma 1), but for Frobenius norm error rather than spectral norm error. This is generally incomparable to our result, although spectral norm error may be more appropriate in cases where the spectrum is slow to decay. ∎
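The quantities in Example 2 can all be estimated in simulation. The sketch below (our own illustration, assuming numpy; the bounded distribution is an arbitrary choice) computes the empirical spectral norm error and the trace ratio $\operatorname{tr}(K - \Sigma^2)/\lambda_{\max}(K - \Sigma^2)$ that plays the role of the dimension in the bound.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 50, 5000

# Population second moment matrix with rapidly decaying spectrum.
lam = 1.0 / (np.arange(1, d + 1) ** 2)
Sigma = np.diag(lam)

# x = sqrt(d) * Sigma^{1/2} u with u uniform on the unit sphere, so that
# E[x x^T] = Sigma and ||x||_2 <= sqrt(d * lambda_max(Sigma)) almost surely.
def sample(m):
    g = rng.normal(size=(m, d))
    u = g / np.linalg.norm(g, axis=1, keepdims=True)
    return np.sqrt(d) * u * np.sqrt(lam)       # broadcasting applies Sigma^{1/2}

X = sample(n)
Sigma_hat = X.T @ X / n
err = np.linalg.norm(Sigma_hat - Sigma, 2)

# Monte Carlo estimate of K = E[(x x^T)^2] = E[||x||^2 x x^T], then the trace
# ratio tr(K - Sigma^2) / lambda_max(K - Sigma^2): the "intrinsic dimension".
Y = sample(20000)
sq_norms = (Y ** 2).sum(axis=1)
K = (Y * sq_norms[:, None]).T @ Y / len(Y)
Mdiff = K - Sigma @ Sigma
ratio = np.trace(Mdiff) / np.linalg.eigvalsh(Mdiff).max()
print(f"spectral error: {err:.4f}   trace ratio: {ratio:.2f}   ambient dim: {d}")
```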
We now show that combining the bound from the previous example with sharper dimension-dependent tail inequalities can sometimes lead to stronger results.

Example 3. Let $x_1, \ldots, x_n$ be i.i.d. random vectors with $\Sigma := \mathbb{E}[x_i x_i^\top]$; let $X_i := x_i x_i^\top - \Sigma$ and $\hat\Sigma_n := n^{-1} \sum_{i=1}^{n} x_i x_i^\top$. For any positive integer $d \leq \operatorname{rank}(\Sigma)$, let $\Pi_{d,0}$ be the orthogonal projector onto the $d$-dimensional eigenspace of $\Sigma$ corresponding to its $d$ largest eigenvalues, and let $\Pi_{d,1} := I - \Pi_{d,0}$. We have

$$\begin{aligned}
\|\hat\Sigma_n - \Sigma\|_2 &\leq \|\Pi_{d,0}(\hat\Sigma_n - \Sigma)\Pi_{d,0}\|_2 + 2\|\Pi_{d,0}(\hat\Sigma_n - \Sigma)\Pi_{d,1}\|_2 + \|\Pi_{d,1}(\hat\Sigma_n - \Sigma)\Pi_{d,1}\|_2 \\
&\leq 2\|\Pi_{d,0}(\hat\Sigma_n - \Sigma)\Pi_{d,0}\|_2 + 2\|\Pi_{d,1}(\hat\Sigma_n - \Sigma)\Pi_{d,1}\|_2.
\end{aligned}$$

We can use the tail inequalities from this work to control $\|\Pi_{d,1}(\hat\Sigma_n - \Sigma)\Pi_{d,1}\|_2$, and use potentially sharper dimension-dependent inequalities to control $\|\Pi_{d,0}(\hat\Sigma_n - \Sigma)\Pi_{d,0}\|_2$. Let $\Sigma_{d,0} := \Pi_{d,0}\Sigma\Pi_{d,0}$, $\Sigma_{d,1} := \Pi_{d,1}\Sigma\Pi_{d,1}$, $K_{d,1} := \mathbb{E}[(\Pi_{d,1} x_i x_i^\top \Pi_{d,1})^2]$, and assume $\|\Pi_{d,1} x_i\|_2 \leq \bar\ell_{d,1}$ for all $i = 1, \ldots, n$ almost surely. Furthermore, suppose there exists $\gamma_{d,0} > 0$ such that for all $i = 1, \ldots, n$ and all vectors $\alpha$,

$$\mathbb{E}\left[ \exp\left( \alpha^\top \Sigma_{d,0}^{-1/2} x_i \right) \right] \leq \exp\left( \gamma_{d,0}\|\alpha\|_2^2 / 2 \right)$$

where $\Sigma_{d,0}^{-1/2}$ is the matrix square root of the Moore-Penrose pseudoinverse of $\Sigma_{d,0}$. This condition states that every projection of $\Sigma_{d,0}^{-1/2} x_i$ has subgaussian tails. In this case, the tail behavior of $\|\Pi_{d,0}(\hat\Sigma_n - \Sigma)\Pi_{d,0}\|_2$ should depend only on the subspace dimension $d$ rather than the ambient dimension. Indeed, a covering number argument gives

$$\Pr\left[ \|\Pi_{d,0}(\hat\Sigma_n - \Sigma)\Pi_{d,0}\|_2 > 2\gamma_{d,0}\|\Sigma\|_2 \left( \sqrt{\frac{71d + 16t}{n}} + \frac{5d + t}{n} \right) \right] \leq 2e^{-t/2}$$

for any $t > 0$ (see Lemma 2 in Appendix A). Combining this with the tail inequality from Example 2, we have (for $t \geq 2.6$)

$$\begin{aligned}
\Pr\Bigg[ \|\hat\Sigma_n - \Sigma\|_2 >\ & 4\gamma_{d,0}\|\Sigma\|_2 \left( \sqrt{\frac{71d + 16t}{n}} + \frac{5d + t}{n} \right) \\
&+ 2\sqrt{\frac{2\lambda_{\max}(K_{d,1} - \Sigma_{d,1}^2)\left( \log\left( \frac{\operatorname{tr}(K_{d,1} - \Sigma_{d,1}^2)}{\lambda_{\max}(K_{d,1} - \Sigma_{d,1}^2)} \right) + t \right)}{n}} \\
&+ \frac{2\max\{\bar\ell_{d,1}^2 - \lambda_{\min}(\Sigma_{d,1}), \lambda_{\max}(\Sigma_{d,1})\}\left( \log\left( \frac{\operatorname{tr}(K_{d,1} - \Sigma_{d,1}^2)}{\lambda_{\max}(K_{d,1} - \Sigma_{d,1}^2)} \right) + t \right)}{3n} \Bigg] \leq 4e^{-t/2}. \tag{2}
\end{aligned}$$

∎
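A minimal sketch of Example 3's decomposition (ours, assuming numpy; the Gaussian model and spectrum are illustrative): it forms the projectors $\Pi_{d,0}$ and $\Pi_{d,1}$ from the eigendecomposition of $\Sigma$ and reports the head, cross, and tail blocks of $\hat\Sigma_n - \Sigma$.

```python
import numpy as np

rng = np.random.default_rng(4)
dim, d, n = 40, 3, 2000

# Population covariance with a strong top-d eigenspace and a decaying tail.
lam = np.concatenate([np.full(d, 1.0), 0.05 / np.arange(1, dim - d + 1) ** 2])
Sigma = np.diag(lam)

X = rng.normal(size=(n, dim)) * np.sqrt(lam)   # x_i ~ N(0, Sigma)
Delta = X.T @ X / n - Sigma                    # Sigma_hat - Sigma

# Spectral projectors onto the top-d eigenspace of Sigma and its complement.
w, V = np.linalg.eigh(Sigma)
top = V[:, np.argsort(w)[::-1][:d]]
P0 = top @ top.T
P1 = np.eye(dim) - P0

norm = lambda M: np.linalg.norm(M, 2)
print(f"head block  ||P0 D P0||: {norm(P0 @ Delta @ P0):.4f}")
print(f"cross block ||P0 D P1||: {norm(P0 @ Delta @ P1):.4f}")
print(f"tail block  ||P1 D P1||: {norm(P1 @ Delta @ P1):.4f}")
print(f"full error  ||D||      : {norm(Delta):.4f}")
```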
Comparisons. We consider the following stylized scenario to compare the bounds from Example 2 and Example 3.

1. The largest $d$ eigenvalues of $\Sigma$ are all equal to $\|\Sigma\|_2$, and the remaining eigenvalues are smaller and rapidly decaying, so $\operatorname{tr}(\Sigma_{d,1}) / \|\Sigma\|_2$ is small.

2. $\bar\ell^2$ and $\bar\ell_{d,1}^2$ are within constant factors of $\operatorname{tr}(\Sigma)$ and $\operatorname{tr}(\Sigma_{d,1})$, respectively; this simply requires that the squared length of any $x_i$ never be more than a constant factor times its expected squared length.

3. $\lambda_{\max}(K - \Sigma^2)$ and $\lambda_{\max}(K_{d,1} - \Sigma_{d,1}^2)$ are within constant factors of $\lambda_{\max}(\Sigma)^2$ and $\lambda_{\max}(\Sigma_{d,1})^2$, respectively; this is similar to the previous condition.

We will also ignore constant and logarithmic factors, as well as the $\gamma_{d,0}$ factors. The bound on $\|\hat\Sigma_n\|_2$ from Example 3 then becomes (roughly)

$$\|\Sigma\|_2 \left( 1 + \sqrt{\frac{d}{n}} \right) + \|\Sigma\|_2 \left( \sqrt{\frac{t}{n}} + \frac{t}{n} + \frac{(\operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2)\,t}{n} \right) \tag{3}$$

whereas the bound from Example 2 is

$$\|\Sigma\|_2 + \|\Sigma\|_2 \left( \sqrt{\frac{t}{n}} + \frac{\left( d + \operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2 \right) t}{n} \right). \tag{4}$$

The main difference between these bounds is that the deviation term $t$ does not scale with $d$ in (3), but it does in (4), so the exponential tail in the latter is much weaker, as discussed in Subsection 3.3.

We can also compare the bound from Example 3 to the case where the $x_i$ are i.i.d. Gaussian random vectors with mean zero and covariance $\Sigma$. Arrange the $x_i$ as columns in a matrix $\hat A_n = [x_1 | \cdots | x_n]$, so $\|\hat\Sigma_n\|_2 = \frac{1}{n}\|\hat A_n \hat A_n^\top\|_2 = \frac{1}{n}\|\hat A_n\|_2^2$. Note that $\hat A_n$ has the same distribution as $\Sigma^{1/2} Z$, where $Z$ is a matrix of independent standard Gaussian random variables. The function $Z \mapsto \|\Sigma^{1/2} Z\|_2 = \|\hat A_n\|_2$ is $\|\Sigma^{1/2}\|_2$-Lipschitz in $Z$, so by Gaussian concentration (Pisier, 1989),

$$\Pr\left[ \|\hat A_n\|_2 > \mathbb{E}\left[ \|\hat A_n\|_2 \right] + \sqrt{2\|\Sigma\|_2 t} \right] \leq e^{-t}.$$

The expectation can be bounded using a result of Gordon (1985, 1988):

$$\mathbb{E}\left[ \|\hat A_n\|_2 \right] = \mathbb{E}\left[ \|\Sigma^{1/2} Z\|_2 \right] \leq \|\Sigma^{1/2}\|_2 \sqrt{n} + \|\Sigma^{1/2}\|_F.$$

Putting these together, we obtain

$$\Pr\left[ \|\hat\Sigma_n\|_2 > \|\Sigma\|_2 + 2\sqrt{\frac{\|\Sigma\|_2 \operatorname{tr}(\Sigma)}{n}} + 2\sqrt{\frac{2\|\Sigma\|_2^2 t}{n}} + \frac{\operatorname{tr}(\Sigma) + 2\sqrt{2\operatorname{tr}(\Sigma)\|\Sigma\|_2 t} + 2\|\Sigma\|_2 t}{n} \right] \leq e^{-t}.$$

In our stylized scenario, this roughly implies a bound on $\|\hat\Sigma_n\|_2$ of the form

$$\|\Sigma\|_2 \left( 1 + \sqrt{\frac{d + \operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2}{n}} + \frac{d + \operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2}{n} \right) + \|\Sigma\|_2 \left( \sqrt{\frac{t}{n}} + \frac{t}{n} \right). \tag{5}$$

Compared to (3), we see that the main difference is that $t$ does not scale with $\operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2$ in (5), but it does in (3). Therefore the bounds are comparable (up to constant and logarithmic factors) when the eigenspectrum of $\Sigma$ is rapidly decaying after the first $d$ eigenvalues.
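The Gaussian comparison is easy to check numerically. The following sketch (ours, assuming numpy; parameters illustrative) draws $\hat A_n = \Sigma^{1/2} Z$ and compares the average of $\|\hat A_n\|_2$ against Gordon's bound $\|\Sigma^{1/2}\|_2\sqrt{n} + \|\Sigma^{1/2}\|_F$ on its expectation.

```python
import numpy as np

rng = np.random.default_rng(5)
dim, n, trials = 30, 500, 100

lam = 1.0 / np.arange(1, dim + 1) ** 2        # decaying spectrum for Sigma
sqrt_lam = np.sqrt(lam)

# Gordon's bound: ||Sigma^{1/2}||_2 sqrt(n) + ||Sigma^{1/2}||_F
gordon = sqrt_lam.max() * np.sqrt(n) + np.linalg.norm(sqrt_lam)
norms = []
for _ in range(trials):
    A = sqrt_lam[:, None] * rng.normal(size=(dim, n))   # A_n = Sigma^{1/2} Z
    norms.append(np.linalg.norm(A, 2))
print(f"mean ||A_n||_2: {np.mean(norms):.3f}   Gordon bound on the mean: {gordon:.3f}")
```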
3.4.3 Approximate matrix multiplication

Finally, we give an example about approximating a matrix product $AB^\top$ using non-uniform sampling of the columns of $A$ and $B$.

Example 4. Let $A := [a_1 | \cdots | a_m]$ and $B := [b_1 | \cdots | b_m]$ be fixed matrices, each with $m$ columns. Assume $a_i \neq 0$ and $b_i \neq 0$ for all $i = 1, \ldots, m$. If $m$ is very large, then the straightforward computation of the product $AB^\top$ can be prohibitive. An alternative is to take a small (non-uniform) random sample of the columns of $A$ and $B$, say $a_{i_1}, b_{i_1}, \ldots, a_{i_n}, b_{i_n}$, and then compute a weighted sum of outer products

$$\frac{1}{n} \sum_{j=1}^{n} \frac{a_{i_j} b_{i_j}^\top}{p_{i_j}}$$

where $p_{i_j} > 0$ is the a priori probability of choosing the column index $i_j \in \{1, \ldots, m\}$ (the actual values of the probabilities $p_i$ for $i = 1, \ldots, m$ are given below). An analysis of this scheme was given by Magen and Zouzias (2011), with the stronger requirement that the number of columns sampled be polynomially related to the allowed failure probability. Here we give an analysis in which the number of columns sampled depends only logarithmically on the failure probability.

Let $X_1, \ldots, X_n$ be i.i.d. random matrices with the discrete distribution given by

$$\Pr\left[ X_j = \frac{1}{p_i} \begin{pmatrix} 0 & a_i b_i^\top \\ b_i a_i^\top & 0 \end{pmatrix} \right] = p_i \propto \|a_i\|_2 \|b_i\|_2$$

for all $i = 1, \ldots, m$, where $p_i := \|a_i\|_2 \|b_i\|_2 / Z$ and $Z := \sum_{i=1}^{m} \|a_i\|_2 \|b_i\|_2$. Let

$$\hat M_n := \frac{1}{n} \sum_{j=1}^{n} X_j \quad \text{and} \quad M := \begin{pmatrix} 0 & AB^\top \\ BA^\top & 0 \end{pmatrix}.$$

Note that $\|\hat M_n - M\|_2$ is the spectral norm error of approximating $AB^\top$ using the average of $n$ outer products $\frac{1}{n}\sum_{j=1}^{n} a_{i_j} b_{i_j}^\top / p_{i_j}$, where the indices are such that $i_j = i$ if and only if $X_j = \frac{1}{p_i}\left(\begin{smallmatrix} 0 & a_i b_i^\top \\ b_i a_i^\top & 0 \end{smallmatrix}\right)$ for $j = 1, \ldots, n$. We have the following identities:

$$\mathbb{E}[X_j] = \sum_{i=1}^{m} p_i \cdot \frac{1}{p_i} \begin{pmatrix} 0 & a_i b_i^\top \\ b_i a_i^\top & 0 \end{pmatrix} = \begin{pmatrix} 0 & \sum_{i=1}^{m} a_i b_i^\top \\ \sum_{i=1}^{m} b_i a_i^\top & 0 \end{pmatrix} = M,$$

$$\operatorname{tr}(\mathbb{E}[X_j^2]) = \operatorname{tr}\left( \sum_{i=1}^{m} p_i \cdot \frac{1}{p_i^2} \begin{pmatrix} a_i b_i^\top b_i a_i^\top & 0 \\ 0 & b_i a_i^\top a_i b_i^\top \end{pmatrix} \right) = \sum_{i=1}^{m} \frac{2\|a_i\|_2^2 \|b_i\|_2^2}{p_i} = 2Z^2,$$

$$\operatorname{tr}(\mathbb{E}[X_j]^2) = \operatorname{tr}\begin{pmatrix} AB^\top BA^\top & 0 \\ 0 & BA^\top AB^\top \end{pmatrix} = 2\operatorname{tr}(A^\top A B^\top B);$$

and the following inequalities:

$$\|X_j\|_2 \leq \max_{i=1,\ldots,m} \frac{1}{p_i} \left\| \begin{pmatrix} 0 & a_i b_i^\top \\ b_i a_i^\top & 0 \end{pmatrix} \right\|_2 = \max_{i=1,\ldots,m} \frac{\|a_i b_i^\top\|_2}{p_i} = Z,$$

$$\|\mathbb{E}[X_j]\|_2 = \|AB^\top\|_2 \leq \|A\|_2\|B\|_2, \qquad \|\mathbb{E}[X_j^2]\|_2 \leq \|A\|_2\|B\|_2 Z.$$

This means $\|X_j - M\|_2 \leq Z + \|A\|_2\|B\|_2$ and $\|\mathbb{E}[(X_j - M)^2]\|_2 \leq \|\mathbb{E}[X_j^2] - M^2\|_2 \leq \|A\|_2\|B\|_2(Z + \|A\|_2\|B\|_2)$, so Theorem 4 and a union bound imply

$$\Pr\left[ \|\hat M_n - M\|_2 > \sqrt{\frac{2\left( \|A\|_2\|B\|_2(Z + \|A\|_2\|B\|_2) \right)t}{n}} + \frac{(Z + \|A\|_2\|B\|_2)t}{3n} \right] \leq 4\left( \frac{Z^2 - \operatorname{tr}(A^\top A B^\top B)}{\|A\|_2\|B\|_2(Z + \|A\|_2\|B\|_2)} \right) \cdot t(e^t - t - 1)^{-1}.$$

Let $r_A := \|A\|_F^2 / \|A\|_2^2 \in [1, \operatorname{rank}(A)]$ and $r_B := \|B\|_F^2 / \|B\|_2^2 \in [1, \operatorname{rank}(B)]$ be the numerical (or stable) ranks of $A$ and $B$, respectively. Since $Z / (\|A\|_2\|B\|_2) \leq \|A\|_F\|B\|_F / (\|A\|_2\|B\|_2) = \sqrt{r_A r_B}$, we have the simplified (but slightly looser) bound

$$\Pr\left[ \frac{\|\hat M_n - M\|_2}{\|A\|_2\|B\|_2} > 2\sqrt{\frac{(1 + \sqrt{r_A r_B})(\log(4\sqrt{r_A r_B}) + t)}{n}} + \frac{2(1 + \sqrt{r_A r_B})(\log(4\sqrt{r_A r_B}) + t)}{3n} \right] \leq e^{-t}.$$

Therefore, for any $\epsilon \in (0, 1)$ and $\delta \in (0, 1)$, if

$$n \geq \left( \frac{8}{3} + 2\sqrt{\frac{5}{3}} \right) \cdot \frac{(1 + \sqrt{r_A r_B})(\log(4\sqrt{r_A r_B}) + \log(1/\delta))}{\epsilon^2},$$

then with probability at least $1 - \delta$ over the random choice of column indices $i_1, \ldots, i_n$,

$$\left\| \frac{1}{n} \sum_{j=1}^{n} \frac{a_{i_j} b_{i_j}^\top}{p_{i_j}} - AB^\top \right\|_2 \leq \epsilon\|A\|_2\|B\|_2. \quad ∎$$
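The sampling scheme of Example 4 is only a few lines of code. Below is a minimal implementation sketch (ours, assuming numpy; the matrix sizes and sample count are illustrative) of sampling column indices with probability $p_i \propto \|a_i\|_2\|b_i\|_2$ and averaging the reweighted outer products.

```python
import numpy as np

rng = np.random.default_rng(6)
k1, k2, m, n = 40, 30, 10000, 2000

A = rng.normal(size=(k1, m))
B = rng.normal(size=(k2, m))

# Sampling probabilities p_i proportional to ||a_i||_2 ||b_i||_2 (Example 4).
weights = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
p = weights / weights.sum()

# Draw n column indices and average the reweighted outer products a_i b_i^T / p_i.
idx = rng.choice(m, size=n, p=p)
approx = (A[:, idx] / p[idx]) @ B[:, idx].T / n

exact = A @ B.T
rel_err = np.linalg.norm(approx - exact, 2) / (np.linalg.norm(A, 2) * np.linalg.norm(B, 2))
print(f"relative spectral norm error: {rel_err:.4f}")
```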
Acknowledgements

We are grateful to Alex Gittens for useful comments and for pointing out a subtle mistake in our proof of Theorem 2 in an earlier draft, and to Joel Tropp for his many comments and suggestions.

References

R. Ahlswede and A. Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48(3):569-579, 2002.

F. Bach. Consistency of the group Lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179-1225, 2008.

D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100-118, 1975.

K. Fukumizu, F. Bach, and A. Gretton. Consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8:361-383, 2007.

S. Golden. Lower bounds for the Helmholtz function. Physical Review, 137(4B):1127-1128, 1965.

Y. Gordon. Some inequalities for Gaussian processes and applications. Israel J. Math., 50:265-289, 1985.

Y. Gordon. Gaussian processes and almost spherical sections of convex bodies. Annals of Probability, 16:180-188, 1988.

D. Gross. Recovering low-rank matrices from few coefficients in any basis, 2009. arXiv:0910.1879.

D. Gross, Y.-K. Liu, S. Flammia, S. Becker, and J. Eisert. Quantum state tomography via compressed sensing. Physical Review Letters, 105(15):150401, 2010.

A. Guionnet. Large deviations and stochastic calculus for large random matrices. Probability Surveys, 1:72-172, 2004.

E. H. Lieb. Convex trace functions and the Wigner-Yanase-Dyson conjecture. Adv. Math., 11:267-288, 1973.

A. Litvak, A. Pajor, M. Rudelson, and N. Tomczak-Jaegermann. Smallest singular values of random matrices and geometry of random polytopes. Advances in Mathematics, 195:491-523, 2005.

A. Magen and A. Zouzias. Low rank matrix-valued Chernoff bounds and approximate matrix multiplication. In Proceedings of the 22nd ACM-SIAM Symposium on Discrete Algorithms, 2011.

R. I. Oliveira. Sums of random Hermitian matrices and an inequality by Rudelson. Elec. Comm. Probab., 15:203-212, 2010a.

R. I. Oliveira. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges, 2010b. arXiv:0911.0600.

G. Pisier. The Volume of Convex Bodies and Banach Space Geometry. Cambridge University Press, 1989.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

B. Recht. A simple approach to matrix completion, 2009. arXiv:0910.0651v2.

M. Rudelson. Random vectors in isotropic position. Journal of Functional Analysis, 164:60-72, 1999.

M. Rudelson and R. Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM, 54(4), 2007.

B. Schölkopf, A. J. Smola, and K.-R. Müller. Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 327-352. MIT Press, 1999.

C. J. Thompson. Inequality with applications in statistical mechanics. Journal of Mathematical Physics, 6(11):1812-1813, 1965.

J. Tropp. User-friendly tail bounds for sums of random matrices, 2011a. arXiv:1004.4389v6.

J. Tropp. Freedman's inequality for matrix martingales, 2011b. arXiv:1101.3039.

D. Voiculescu. Limit laws for random matrices and free products. Invent. Math., 104:201-220, 1991.

T. Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Proceedings of the 18th Annual Conference on Learning Theory, 2005.

L. Zwald and G. Blanchard. On the convergence of eigenspaces in kernel principal component analysis. In Advances in Neural Information Processing Systems 18, 2006.

A Sums of random vector outer products

The following lemma is a tail inequality for the smallest and largest eigenvalues of the empirical covariance matrix of subgaussian random vectors. This result (with non-explicit constants) was originally obtained by Litvak et al. (2005).

Lemma 2. Let $x_1, \ldots, x_n$ be random vectors in $\mathbb{R}^d$ such that, for some $\gamma \geq 0$,

$$\mathbb{E}\left[ x_i x_i^\top \mid x_1, \ldots, x_{i-1} \right] = I \quad \text{and} \quad \mathbb{E}\left[ \exp(\alpha^\top x_i) \mid x_1, \ldots, x_{i-1} \right] \leq \exp\left( \|\alpha\|_2^2 \gamma / 2 \right) \ \text{for all } \alpha \in \mathbb{R}^d,$$

for all $i = 1, \ldots, n$, almost surely. For all $\epsilon_0 \in (0, 1/2)$ and $\delta \in (0, 1)$,

$$\Pr\left[ \lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \right) > 1 + \frac{1}{1 - 2\epsilon_0} \cdot \epsilon_{\epsilon_0, \delta, n} \ \text{ or } \ \lambda_{\min}\left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \right) < 1 - \frac{1}{1 - 2\epsilon_0} \cdot \epsilon_{\epsilon_0, \delta, n} \right] \leq \delta$$

where

$$\epsilon_{\epsilon_0, \delta, n} := \gamma \cdot \left( \sqrt{\frac{32\left( d\log(1 + 2/\epsilon_0) + \log(2/\delta) \right)}{n}} + \frac{2\left( d\log(1 + 2/\epsilon_0) + \log(2/\delta) \right)}{n} \right).$$

Remark 1. In our applications of this lemma, we simply choose $\epsilon_0 := 1/4$ for concreteness.

We give the proof of Lemma 2 for completeness. The subgaussian property most readily lends itself to bounds on linear combinations of subgaussian random variables. However, we are interested in bounding certain quadratic combinations. Therefore, we bootstrap from the bound for linear combinations to bound the moment generating function of the quadratic combinations; from there, we can obtain the desired tail inequality.

The following lemma relates the moment generating function to a tail inequality.

Lemma 3. Let $W$ be a non-negative random variable. For any $\eta \in \mathbb{R}$,

$$\mathbb{E}[\exp(\eta W)] - \eta\,\mathbb{E}[W] - 1 = \eta \int_0^\infty (\exp(\eta t) - 1) \cdot \Pr[W > t] \, dt.$$

Proof. Integration by parts. ∎
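The one-line proof compresses a short computation; for completeness, here is our own expansion using the standard tail-integral identity for a non-negative random variable.

```latex
% For W >= 0 we have e^{\eta W} = 1 + \int_0^W \eta e^{\eta t}\,dt, so taking
% expectations and swapping expectation and integral (Tonelli/Fubini):
\mathbb{E}[e^{\eta W}] = 1 + \eta \int_0^\infty e^{\eta t}\,\Pr[W > t]\,dt .
% Similarly, \mathbb{E}[W] = \int_0^\infty \Pr[W > t]\,dt.
% Subtracting \eta\,\mathbb{E}[W] + 1 from both sides gives the lemma:
\mathbb{E}[e^{\eta W}] - \eta\,\mathbb{E}[W] - 1
  = \eta \int_0^\infty \left( e^{\eta t} - 1 \right) \Pr[W > t]\,dt .
```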
The next lemma gives a tail inequality for any particular Rayleigh quotient of the empirical covariance matrix.

Lemma 4. Let $x_1, \ldots, x_n$ be random vectors in $\mathbb{R}^d$ such that, for some $\gamma \geq 0$,

$$\mathbb{E}\left[ x_i x_i^\top \mid x_1, \ldots, x_{i-1} \right] = I \quad \text{and} \quad \mathbb{E}\left[ \exp(\alpha^\top x_i) \mid x_1, \ldots, x_{i-1} \right] \leq \exp\left( \|\alpha\|_2^2 \gamma / 2 \right) \ \text{for all } \alpha \in \mathbb{R}^d,$$

for all $i = 1, \ldots, n$, almost surely. For all $\alpha \in \mathbb{R}^d$ such that $\|\alpha\|_2 = 1$, and all $\delta \in (0, 1)$,

$$\Pr\left[ \alpha^\top \left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \right) \alpha > 1 + \sqrt{\frac{32\gamma^2 \log(1/\delta)}{n}} + \frac{2\gamma\log(1/\delta)}{n} \right] \leq \delta$$

and

$$\Pr\left[ \alpha^\top \left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \right) \alpha < 1 - \sqrt{\frac{32\gamma^2 \log(1/\delta)}{n}} \right] \leq \delta.$$

Proof. Fix $\alpha \in \mathbb{R}^d$ with $\|\alpha\|_2 = 1$. For $i = 1, \ldots, n$, let $W_i := (\alpha^\top x_i)^2$, so $\mathbb{E}[W_i] = 1$. For any $t \geq 0$, using Chernoff's bounding method gives

$$\begin{aligned}
\mathbb{E}\left[ \mathbf{1}[W_i > t] \mid x_1, \ldots, x_{i-1} \right]
&\leq \inf_{\eta > 0} \left\{ \mathbb{E}\left[ \mathbf{1}\left[ \exp\left( \eta|\alpha^\top x_i| \right) > e^{\eta\sqrt{t}} \right] \mid x_1, \ldots, x_{i-1} \right] \right\} \\
&\leq \inf_{\eta > 0} \left\{ e^{-\eta\sqrt{t}} \cdot \left( \mathbb{E}\left[ \exp(\eta\,\alpha^\top x_i) \mid x_1, \ldots, x_{i-1} \right] + \mathbb{E}\left[ \exp(-\eta\,\alpha^\top x_i) \mid x_1, \ldots, x_{i-1} \right] \right) \right\} \\
&\leq \inf_{\eta > 0} \left\{ 2\exp\left( -\eta\sqrt{t} + \eta^2\gamma/2 \right) \right\} = 2\exp\left( -\frac{t}{2\gamma} \right).
\end{aligned}$$

So by Lemma 3, for any $0 < \eta < 1/(2\gamma)$,

$$\mathbb{E}\left[ \exp(\eta W_i) \mid x_1, \ldots, x_{i-1} \right] \leq 1 + \eta + \eta \int_0^\infty (\exp(\eta t) - 1) \cdot 2\exp\left( -\frac{t}{2\gamma} \right) dt = 1 + \eta + \frac{8\eta^2\gamma^2}{1 - 2\eta\gamma} \leq \exp\left( \eta + \frac{8\eta^2\gamma^2}{1 - 2\eta\gamma} \right)$$

and therefore

$$\mathbb{E}\left[ \exp\left( \eta \sum_{i=1}^{n} W_i \right) \right] \leq \exp\left( n\eta + \frac{8n\eta^2\gamma^2}{1 - 2\eta\gamma} \right).$$

Using Chernoff's bounding method twice more gives

$$\Pr\left[ \sum_{i=1}^{n} W_i > n + t \right] \leq \inf_{0 \leq \eta < 1/(2\gamma)} \left\{ \exp\left( -t\eta + \frac{8n\eta^2\gamma^2}{1 - 2\eta\gamma} \right) \right\} = \exp\left( -\frac{8n\gamma^2 + \gamma t - \sqrt{8n\gamma^2(8n\gamma^2 + 2\gamma t)}}{2\gamma^2} \right)$$

and

$$\Pr\left[ \sum_{i=1}^{n} W_i < n - t \right] \leq \inf_{\eta \leq 0} \left\{ \exp\left( t\eta + \frac{8n\eta^2\gamma^2}{1 - 2\eta\gamma} \right) \right\} \leq \exp\left( -\frac{t^2}{32n\gamma^2} \right).$$

The claim follows. ∎

In order to bound the smallest and largest eigenvalues of the empirical covariance matrix, we apply the bound for the Rayleigh quotient in Lemma 4 together with a covering argument.

Lemma 5 (Pisier, 1989). For any $\epsilon_0 > 0$, there exists $Q \subseteq S^{d-1} := \{\alpha \in \mathbb{R}^d : \|\alpha\|_2 = 1\}$ of cardinality at most $(1 + 2/\epsilon_0)^d$ such that for all $\alpha \in S^{d-1}$ there exists $q \in Q$ with $\|\alpha - q\|_2 \leq \epsilon_0$.

Proof of Lemma 2. Let $\hat\Sigma := (1/n)\sum_{i=1}^{n} x_i x_i^\top$, let $S^{d-1} := \{\alpha \in \mathbb{R}^d : \|\alpha\|_2 = 1\}$ be the unit sphere in $\mathbb{R}^d$, and let $Q \subset S^{d-1}$ be an $\epsilon_0$-cover of $S^{d-1}$ of minimum size with respect to $\|\cdot\|_2$. By Lemma 5, the cardinality of $Q$ is at most $(1 + 2/\epsilon_0)^d$. Let $\mathcal{E}$ be the event

$$\max\left\{ |q^\top(\hat\Sigma - I)q| : q \in Q \right\} \leq \epsilon_{\epsilon_0, \delta, n}.$$

By Lemma 4 and a union bound, $\Pr[\mathcal{E}] \geq 1 - \delta$. Now assume the event $\mathcal{E}$ holds. Let $\alpha_0 \in S^{d-1}$ be such that $|\alpha_0^\top(\hat\Sigma - I)\alpha_0| = \max\{|\alpha^\top(\hat\Sigma - I)\alpha| : \alpha \in S^{d-1}\} = \|\hat\Sigma - I\|_2$. Using the triangle and Cauchy-Schwarz inequalities, we have

$$\begin{aligned}
\|\hat\Sigma - I\|_2 = |\alpha_0^\top(\hat\Sigma - I)\alpha_0|
&= \min_{q \in Q} |q^\top(\hat\Sigma - I)q + \alpha_0^\top(\hat\Sigma - I)\alpha_0 - q^\top(\hat\Sigma - I)q| \\
&\leq \min_{q \in Q} |q^\top(\hat\Sigma - I)q| + |\alpha_0^\top(\hat\Sigma - I)\alpha_0 - q^\top(\hat\Sigma - I)q| \\
&= \min_{q \in Q} |q^\top(\hat\Sigma - I)q| + |\alpha_0^\top(\hat\Sigma - I)(\alpha_0 - q) - (q - \alpha_0)^\top(\hat\Sigma - I)q| \\
&\leq \min_{q \in Q} |q^\top(\hat\Sigma - I)q| + \|\alpha_0\|_2 \|\hat\Sigma - I\|_2 \|\alpha_0 - q\|_2 + \|q - \alpha_0\|_2 \|\hat\Sigma - I\|_2 \|q\|_2 \\
&\leq \epsilon_{\epsilon_0, \delta, n} + 2\epsilon_0\|\hat\Sigma - I\|_2,
\end{aligned}$$

so

$$\max\left\{ \lambda_{\max}(\hat\Sigma) - 1, \ 1 - \lambda_{\min}(\hat\Sigma) \right\} = \|\hat\Sigma - I\|_2 \leq \frac{1}{1 - 2\epsilon_0} \cdot \epsilon_{\epsilon_0, \delta, n}. \quad ∎$$
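Finally, a quick empirical check of Lemma 2 (our own sketch, assuming numpy, in the isotropic Gaussian case where $\gamma = 1$): it compares the extreme eigenvalues of the empirical covariance against the band allowed by the lemma with $\epsilon_0 = 1/4$.

```python
import numpy as np

rng = np.random.default_rng(7)
d, n, delta, eps0, gamma, trials = 10, 4000, 0.05, 0.25, 1.0, 500

# epsilon_{eps0, delta, n} from Lemma 2, including the covering-number term.
c = d * np.log(1 + 2 / eps0) + np.log(2 / delta)
eps = gamma * (np.sqrt(32 * c / n) + 2 * c / n)
band = eps / (1 - 2 * eps0)

violations = 0
for _ in range(trials):
    X = rng.normal(size=(n, d))          # isotropic: E[x x^T] = I, gamma = 1
    w = np.linalg.eigvalsh(X.T @ X / n)
    violations += (w.max() > 1 + band) or (w.min() < 1 - band)

# The constants in the lemma are conservative, so the empirical violation
# rate is typically far below the allowed failure probability delta.
print(f"violation rate: {violations / trials:.4f}  (Lemma 2 allows up to {delta})")
```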
