Dimension-free tail inequalities for sums of random matrices

Authors: Daniel Hsu (1,2), Sham M. Kakade (2), and Tong Zhang (1)

1. Department of Statistics, Rutgers University
2. Department of Statistics, Wharton School, University of Pennsylvania

E-mail: djhsu@rci.rutgers.edu, skakade@wharton.upenn.edu, tzhang@stat.rutgers.edu

October 22, 2018

Abstract

We derive exponential tail inequalities for sums of random matrices with no dependence on the explicit matrix dimensions. These are similar to the matrix versions of the Chernoff bound and Bernstein inequality except with the explicit matrix dimensions replaced by a trace quantity that can be small even when the dimension is large or infinite. Some applications to principal component analysis and approximate matrix multiplication are given to illustrate the utility of the new bounds.

1 Introduction

Sums of random matrices arise in many statistical and probabilistic applications, and hence their concentration behavior is of significant interest. Surprisingly, the classical exponential moment method used to derive tail inequalities for scalar random variables carries over to the matrix setting when augmented with certain matrix trace inequalities. This fact was first discovered by Ahlswede and Winter (2002), who proved a matrix version of the Chernoff bound using the Golden-Thompson inequality (Golden, 1965; Thompson, 1965): $\operatorname{tr}\exp(A + B) \leq \operatorname{tr}(\exp(A)\exp(B))$ for all symmetric matrices $A$ and $B$. Later, it was demonstrated that the same technique could be adapted to yield analogues of other tail bounds such as Bernstein's inequality (Gross et al., 2010; Recht, 2009; Gross, 2009; Oliveira, 2010a,b). Recently, a theorem due to Lieb (1973) was identified by Tropp (2011a,b) to yield sharper versions of this general class of tail bounds. Altogether, these results have proved invaluable in constructing and simplifying many probabilistic arguments concerning sums of random matrices.

One deficiency of these previous inequalities is their explicit dependence on the matrix dimension, which prevents their application to infinite-dimensional spaces that arise in a variety of data analysis tasks (e.g., Schölkopf et al., 1999; Rasmussen and Williams, 2006; Fukumizu et al., 2007; Bach, 2008). In this work, we prove analogous results where the dimension is replaced with a trace quantity that can be small even when the dimension is large or infinite. For instance, in our matrix generalization of Bernstein's inequality, the (normalized) trace of the second moment matrix appears instead of the matrix dimension. Such trace quantities can often be regarded as an intrinsic notion of dimension. The price for this improvement is that the more typical exponential tail $e^{-t}$ is replaced with a slightly weaker tail $t(e^t - t - 1)^{-1} \approx e^{-t + \log t}$. As $t$ becomes large, the difference becomes negligible: for instance, if $t \geq 2.6$, then $t(e^t - t - 1)^{-1} \leq e^{-t/2}$.
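As a quick numerical check of this last claim, the following sketch (ours, assuming only numpy; the grid is an arbitrary choice) locates where $t(e^t - t - 1)^{-1}$ starts to fall below $e^{-t/2}$.

```python
import numpy as np

# Compare the weaker tail t*(e^t - t - 1)^(-1) against e^(-t/2).
# The paper states the former is dominated by the latter once t >= 2.6.
t = np.linspace(0.1, 20.0, 2000)
weak_tail = t / (np.exp(t) - t - 1.0)
half_exp_tail = np.exp(-t / 2.0)

# Smallest grid point from which the domination holds onward.
dominated = weak_tail <= half_exp_tail
threshold = t[np.argmax(dominated)]
print(f"t*(e^t - t - 1)^-1 <= e^(-t/2) for all grid points t >= {threshold:.2f}")
```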
There are some previous works that give dimension-free tail inequalities in some special cases. Rudelson and Vershynin (2007) prove exponential tail inequalities for sums of rank-one matrices by way of a key inequality of Rudelson (1999) (see also Oliveira, 2010a). Magen and Zouzias (2011) prove tail inequalities for sums of low-rank matrices using non-commutative Khintchine moment inequalities, but fall short of giving an exponential tail inequality. In contrast, our results are proved using a natural matrix generalization of the exponential moment method.

2 Preliminaries

Let $\xi_1, \ldots, \xi_n$ be random variables, and for each $i = 1, \ldots, n$, let $X_i := X_i(\xi_1, \ldots, \xi_i)$ be a symmetric matrix-valued functional of $\xi_1, \ldots, \xi_i$. We use $\mathbb{E}_i[\cdot]$ as shorthand for $\mathbb{E}[\cdot \mid \xi_1, \ldots, \xi_{i-1}]$. For any symmetric matrix $H$, let $\lambda_{\max}(H)$ denote its largest eigenvalue, $\exp(H) := I + \sum_{k=1}^{\infty} H^k / k!$, and $\log(\exp(H)) := H$.

The following convex trace inequality of Lieb (1973) was also used by Tropp (2011a,b).

Theorem 1 (Lieb, 1973). For any symmetric matrix $H$, the function $M \mapsto \operatorname{tr}\exp(H + \log(M))$ is concave in $M$ for $M \succ 0$.

The following lemma due to Tropp (2011b) is a matrix generalization of a scalar result due to Freedman (1975) (see also Zhang, 2005), where the key is the invocation of Theorem 1. We give the proof for completeness.

Lemma 1 (Tropp, 2011b). For any constant symmetric matrix $X_0$,

$$\mathbb{E}\left[ \operatorname{tr}\exp\left( \sum_{i=0}^{n} X_i - \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(X_i)] \right) \right] \leq \operatorname{tr}\exp(X_0). \tag{1}$$

Proof. By induction on $n$. The claim holds trivially for $n = 0$. Now fix $n \geq 1$, and assume as the inductive hypothesis that (1) holds with $n$ replaced by $n - 1$. In this case,

$$\begin{aligned}
\mathbb{E}\left[ \operatorname{tr}\exp\left( \sum_{i=0}^{n} X_i - \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(X_i)] \right) \right]
&= \mathbb{E}\left[ \mathbb{E}_n\left[ \operatorname{tr}\exp\left( \sum_{i=0}^{n-1} X_i - \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(X_i)] + \log \exp(X_n) \right) \right] \right] \\
&\leq \mathbb{E}\left[ \operatorname{tr}\exp\left( \sum_{i=0}^{n-1} X_i - \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(X_i)] + \log \mathbb{E}_n[\exp(X_n)] \right) \right] \\
&= \mathbb{E}\left[ \operatorname{tr}\exp\left( \sum_{i=0}^{n-1} X_i - \sum_{i=1}^{n-1} \log \mathbb{E}_i[\exp(X_i)] \right) \right] \\
&\leq \operatorname{tr}\exp(X_0)
\end{aligned}$$

where the first inequality follows from Theorem 1 and Jensen's inequality, and the second inequality follows from the inductive hypothesis. ∎
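Theorem 1 does the heavy lifting in Lemma 1 and in everything that follows, so a quick numerical illustration may be helpful. The sketch below is our own check, not part of the paper's argument; it assumes numpy and scipy are available, and verifies concavity of $M \mapsto \operatorname{tr}\exp(H + \log M)$ along a random segment of positive definite matrices.

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(0)

def f(H, M):
    # tr exp(H + log M), the function shown to be concave in M by Lieb's theorem
    return np.trace(expm(H + logm(M))).real

d = 5
A = rng.normal(size=(d, d)); H = (A + A.T) / 2          # random symmetric H
B = rng.normal(size=(d, d)); M0 = B @ B.T + np.eye(d)   # random positive definite M0
C = rng.normal(size=(d, d)); M1 = C @ C.T + np.eye(d)   # random positive definite M1

# Concavity along the segment: f(H, (1-s) M0 + s M1) >= (1-s) f(H, M0) + s f(H, M1)
for s in np.linspace(0.0, 1.0, 11):
    lhs = f(H, (1 - s) * M0 + s * M1)
    rhs = (1 - s) * f(H, M0) + s * f(H, M1)
    assert lhs >= rhs - 1e-8, (s, lhs, rhs)
print("Concavity held at all tested points on the segment.")
```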
3 Exponential tail inequalities for sums of random matrices

3.1 A generic inequality

We first state a generic inequality based on Lemma 1. This differs from earlier approaches, which instead combine Markov's inequality with a result similar to Lemma 1 (e.g., Tropp, 2011a, Theorem 3.6).

Theorem 2. For any $\eta \in \mathbb{R}$ and any $t > 0$,

$$\Pr\left[ \lambda_{\max}\left( \eta \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right) > t \right] \leq \operatorname{tr}\left( \mathbb{E}\left[ -\eta \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right] \right) \cdot (e^t - t - 1)^{-1}.$$

Proof. Fix a constant symmetric matrix $X_0$, and let $A := X_0 + \eta \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)]$. Note that $g(x) := e^x - x - 1$ is non-negative for all $x \in \mathbb{R}$ and increasing for $x \geq 0$. Letting $\{\lambda_i(A)\}$ denote the eigenvalues of $A$, we have

$$\begin{aligned}
\Pr[\lambda_{\max}(A) > t]\,(e^t - t - 1)
&= \mathbb{E}\left[ \mathbf{1}[\lambda_{\max}(A) > t]\,(e^t - t - 1) \right] \\
&\leq \mathbb{E}\left[ e^{\lambda_{\max}(A)} - \lambda_{\max}(A) - 1 \right] \\
&\leq \mathbb{E}\left[ \sum_i \left( e^{\lambda_i(A)} - \lambda_i(A) - 1 \right) \right] \\
&= \mathbb{E}[\operatorname{tr}(\exp(A) - A - I)] \\
&\leq \operatorname{tr}(\exp(X_0) + \mathbb{E}[-A] - I)
\end{aligned}$$

where the last inequality follows from Lemma 1 (applied with $\eta X_i$ in place of $X_i$). Now we take $X_0 \to 0$, so $\operatorname{tr}(\exp(X_0) - I) \to 0$. ∎

3.2 Some specific bounds

We now give some specific bounds as corollaries of Theorem 2. Most of the estimates used in the proofs are taken from previous works (e.g., Ahlswede and Winter, 2002; Tropp, 2011a); the main point here is to show how these previous techniques can be combined with Theorem 2 to yield new tail inequalities with no explicit dependence on the matrix dimension.

First, we give a bound under a subgaussian-type condition on the distribution.

Theorem 3 (Matrix subgaussian bound). Suppose there exist $\bar\sigma > 0$ and $\bar k > 0$ such that $\mathbb{E}_i[X_i] = 0$ for all $i = 1, \ldots, n$, and

$$\lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right) \leq \frac{\eta^2 \bar\sigma^2}{2}, \qquad \mathbb{E}\left[ \operatorname{tr}\left( \frac{1}{n} \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right) \right] \leq \frac{\eta^2 \bar\sigma^2 \bar k}{2}$$

for all $\eta > 0$, almost surely. Then for any $t > 0$,

$$\Pr\left[ \lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) > \sqrt{\frac{2\bar\sigma^2 t}{n}} \right] \leq \bar k \cdot t(e^t - t - 1)^{-1}.$$

Proof. Fix $\eta := \sqrt{2t/(\bar\sigma^2 n)}$. By Theorem 2, we obtain

$$\Pr\left[ \lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} X_i - \frac{1}{n\eta} \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right) > \frac{t}{n\eta} \right] \leq \operatorname{tr}\left( \mathbb{E}\left[ \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right] \right) \cdot (e^t - t - 1)^{-1} \leq \frac{n\eta^2\bar\sigma^2\bar k}{2} \cdot (e^t - t - 1)^{-1} = \bar k \cdot t(e^t - t - 1)^{-1}.$$

Now suppose

$$\lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} X_i - \frac{1}{n\eta} \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right) \leq \frac{t}{n\eta}.$$

This implies, for every non-zero vector $u$,

$$\frac{u^\top \left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) u}{u^\top u} \leq \frac{u^\top \left( \frac{1}{n\eta} \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right) u}{u^\top u} + \frac{t}{n\eta} \leq \lambda_{\max}\left( \frac{1}{n\eta} \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right) + \frac{t}{n\eta}$$

and therefore

$$\lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) \leq \lambda_{\max}\left( \frac{1}{n\eta} \sum_{i=1}^{n} \log \mathbb{E}_i[\exp(\eta X_i)] \right) + \frac{t}{n\eta} \leq \frac{\eta\bar\sigma^2}{2} + \frac{t}{n\eta} = \sqrt{\frac{2\bar\sigma^2 t}{n}}$$

as required. ∎

We can also give a Bernstein-type bound based on moment conditions. For simplicity, we just state the bound in the case that the $\lambda_{\max}(X_i)$ are bounded almost surely.

Theorem 4 (Matrix Bernstein bound). Suppose there exist $\bar b > 0$, $\bar\sigma > 0$, and $\bar k > 0$ such that $\mathbb{E}_i[X_i] = 0$ and $\lambda_{\max}(X_i) \leq \bar b$ for all $i = 1, \ldots, n$, and

$$\lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[X_i^2] \right) \leq \bar\sigma^2, \qquad \mathbb{E}\left[ \operatorname{tr}\left( \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_i[X_i^2] \right) \right] \leq \bar\sigma^2 \bar k,$$

almost surely. Then for any $t > 0$,

$$\Pr\left[ \lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) > \sqrt{\frac{2\bar\sigma^2 t}{n}} + \frac{\bar b t}{3n} \right] \leq \bar k \cdot t(e^t - t - 1)^{-1}.$$

Proof. Let $\eta > 0$. For each $i = 1, \ldots, n$,

$$\exp(\eta X_i) \preceq I + \eta X_i + \frac{e^{\eta\bar b} - \eta\bar b - 1}{\bar b^2} \cdot X_i^2$$

and therefore

$$\log \mathbb{E}_i[\exp(\eta X_i)] \preceq \frac{e^{\eta\bar b} - \eta\bar b - 1}{\bar b^2} \cdot \mathbb{E}_i[X_i^2].$$

Since $e^x - x - 1 \leq x^2/(2(1 - x/3))$ for $0 \leq x < 3$, we have by Theorem 2

$$\Pr\left[ \lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) > \frac{\eta\bar\sigma^2}{2(1 - \eta\bar b/3)} + \frac{t}{\eta n} \right] \leq \frac{\eta^2\bar\sigma^2\bar k n}{2(1 - \eta\bar b/3)} \cdot (e^t - t - 1)^{-1}$$

provided that $\eta < 3/\bar b$. Choosing

$$\eta := \frac{3}{\bar b} \left( 1 - \frac{\sqrt{2\bar\sigma^2 t/n}}{2\bar b t/(3n) + \sqrt{2\bar\sigma^2 t/n}} \right)$$

gives the desired bound. ∎
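To see Theorem 4 at work, here is a small Monte Carlo check (our own illustration, assuming numpy; the choice of distribution, a fixed symmetric matrix with a Rademacher sign, is arbitrary). It estimates the tail probability empirically and compares it with the theorem's bound.

```python
import numpy as np

rng = np.random.default_rng(1)

# Summands are X_i = eps_i * B with Rademacher eps_i and fixed symmetric B,
# so E[X_i] = 0, lambda_max(X_i) <= ||B||_2 =: b_bar, and E[X_i^2] = B^2.
d, n, trials = 8, 200, 2000
G = rng.normal(size=(d, d))
B = (G + G.T) / (2 * np.sqrt(d))

eigs_B2 = np.linalg.eigvalsh(B @ B)
b_bar = np.max(np.abs(np.linalg.eigvalsh(B)))
sigma2 = eigs_B2.max()              # lambda_max of (1/n) sum E[X_i^2] = B^2
k_bar = eigs_B2.sum() / sigma2      # trace quantity replacing the dimension

t = 5.0
threshold = np.sqrt(2 * sigma2 * t / n) + b_bar * t / (3 * n)

# Empirical tail probability of lambda_max((1/n) sum_i X_i)
count = 0
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=n)
    S = (eps.sum() / n) * B          # (1/n) sum_i eps_i B
    count += np.linalg.eigvalsh(S).max() > threshold
bound = k_bar * t / (np.exp(t) - t - 1)
print(f"empirical tail: {count / trials:.4f}   Theorem 4 bound: {bound:.4f}")
```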
3.3 Discussion

The advantage of our results over previous exponential tail inequalities for sums of random matrices is the absence of explicit dependence on the matrix dimensions. Indeed, all previous tail inequalities using the exponential moment method (either via the Golden-Thompson inequality or Lieb's trace inequality) are roughly of the form $d \cdot e^{-t}$ when the matrices in the sum are $d \times d$ (Ahlswede and Winter, 2002; Gross et al., 2010; Recht, 2009; Gross, 2009; Tropp, 2011a,b). Our results also improve over the tail inequalities of Rudelson and Vershynin (2007) in that they apply to full-rank matrices, not just rank-one matrices; and also over that of Magen and Zouzias (2011) in that they provide an exponential tail inequality, rather than just a polynomial tail. Thus, our improvements widen the applicability of these inequalities (and the matrix exponential moment method in general); we explore some of these applications in Subsection 3.4.

One disadvantage of our technique is that in finite-dimensional settings, the relevant trace quantity that replaces the dimension may turn out to be of the same order as the dimension $d$ (an example of such a case is discussed next). In such cases, the resulting tail bound from Theorem 4 (say) of $\bar k \cdot t(e^t - t - 1)^{-1}$ is looser than the $d \cdot e^{-t}$ tail bound provided by earlier techniques (e.g., Tropp, 2011a).

We note that the matrix exponential moment method used here and in previous work can lead to a significantly suboptimal tail inequality in some cases. This was pointed out by Tropp (2011a, Section 4.6), but we elaborate on it further here. Suppose $x_1, \ldots, x_n \in \{\pm 1\}^d$ are i.i.d. random vectors with independent Rademacher entries (each coordinate of $x_i$ is $+1$ or $-1$ with equal probability). Let $X_i = x_i x_i^\top - I$, so $\mathbb{E}[X_i] = 0$, $\lambda_{\max}(X_i) = \lambda_{\max}(\mathbb{E}[X_i^2]) = d - 1$, and $\operatorname{tr}(\mathbb{E}[X_i^2]) = d(d - 1)$. In this case, Theorem 4 implies the bound

$$\Pr\left[ \lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top - I \right) > \sqrt{\frac{2(d-1)t}{n}} + \frac{(d-1)t}{3n} \right] \leq d\,t(e^t - t - 1)^{-1}.$$

On the other hand, because the $x_i$ have subgaussian projections, it is known that

$$\Pr\left[ \lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top - I \right) > 2\sqrt{\frac{71d + 16t}{n}} + \frac{10d + 2t}{n} \right] \leq 2e^{-t/2}$$

(Litvak et al., 2005; also see Lemma 2 in Appendix A). First, this latter inequality removes the $d$ factor on the right-hand side. Perhaps more importantly, the deviation term $t$ does not scale with $d$ in this inequality, whereas it does in the former. Thus this latter bound provides a much stronger exponential tail: roughly put, $\Pr[\lambda_{\max}(\sum_{i=1}^{n} x_i x_i^\top / n - I) > c \cdot (\sqrt{d/n} + d/n) + \tau] \leq \exp(-\Omega(n \min(\tau, \tau^2)))$ for some constant $c > 0$; the probability bound from Theorem 4 is only of the form $\exp(-\Omega((n/d)\min(\tau, \tau^2)))$.

The suboptimality of Theorem 4 is shared by all other existing tail inequalities proved using this exponential moment method. The issue is related to the asymptotic freeness of the random matrices $X_1, \ldots, X_n$ (Voiculescu, 1991; Guionnet, 2004), i.e., that nearly all high-order moments of random matrices vanish asymptotically, which is not exploited in the matrix exponential moment method. This means that the proof technique in the exponential moment method over-counts the contribution of high-order matrix moments that should have vanished. Formalizing this discrepancy would help clarify the limits of this technique, but the task is beyond the scope of this paper. It is also worth mentioning that asymptotic freeness only holds when the $X_i$ have independent entries. For matrices with correlated entries, our bound is close to best possible in the worst case.
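As a concrete look at the gap discussed above, the following sketch (ours, assuming numpy; parameter values are illustrative) simulates the Rademacher example and prints the empirical deviation alongside the deviation terms of the two bounds.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, t, trials = 20, 2000, 5.0, 200

devs = []
for _ in range(trials):
    X = rng.choice([-1.0, 1.0], size=(n, d))       # rows are Rademacher vectors
    S = X.T @ X / n - np.eye(d)                    # (1/n) sum x_i x_i^T - I
    devs.append(np.linalg.eigvalsh(S).max())

# Deviation term from Theorem 4 (scales with d) vs. the subgaussian-projection
# bound of Litvak et al. (2005).  The constants in the latter are large, so its
# advantage shows in how the deviation scales with t and d, not necessarily at
# any one fixed small parameter setting.
thm4_dev = np.sqrt(2 * (d - 1) * t / n) + (d - 1) * t / (3 * n)
litvak_dev = 2 * np.sqrt((71 * d + 16 * t) / n) + (10 * d + 2 * t) / n
print(f"empirical 95th pct: {np.quantile(devs, 0.95):.3f}")
print(f"Theorem 4 deviation: {thm4_dev:.3f}   Litvak et al. deviation: {litvak_dev:.3f}")
```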
3.4 Examples

For a matrix $M$, let $\|M\|_F$ denote its Frobenius norm, and let $\|M\|_2$ denote its spectral norm. If $M$ is symmetric, then $\|M\|_2 = \max\{\lambda_{\max}(M), -\lambda_{\min}(M)\}$, where $\lambda_{\max}(M)$ and $\lambda_{\min}(M)$ are, respectively, the largest and smallest eigenvalues of $M$.

3.4.1 Supremum of a random process

The first example embeds a random process in a diagonal matrix to show that Theorem 3 is tight in certain cases.

Example 1. Let $(Z_1, Z_2, \ldots)$ be (possibly dependent) mean-zero subgaussian random variables; i.e., each $\mathbb{E}[Z_i] = 0$, and there exist positive constants $\sigma_1, \sigma_2, \ldots$ such that

$$\mathbb{E}[\exp(\eta Z_i)] \leq \exp\left( \frac{\eta^2 \sigma_i^2}{2} \right) \quad \forall \eta \in \mathbb{R}.$$

We further assume that $v := \sup_i \{\sigma_i^2\} < \infty$ and $k := \frac{1}{v} \sum_i \sigma_i^2 < \infty$. Also, for convenience, we assume $\log k \geq 1.3$ (to simplify the tail inequality). Let $X = \operatorname{diag}(Z_1, Z_2, \ldots)$ be the random diagonal matrix with the $Z_i$ on its diagonal. We have $\mathbb{E}[X] = 0$, and

$$\log \mathbb{E}[\exp(\eta X)] \preceq \operatorname{diag}\left( \frac{\eta^2 \sigma_1^2}{2}, \frac{\eta^2 \sigma_2^2}{2}, \ldots \right),$$

so $\lambda_{\max}(\log \mathbb{E}[\exp(\eta X)]) \leq \eta^2 v / 2$ and $\operatorname{tr}(\log \mathbb{E}[\exp(\eta X)]) \leq \eta^2 v k / 2$. By Theorem 3, we have

$$\Pr\left[ \lambda_{\max}(X) > \sqrt{2vt} \right] \leq k\,t(e^t - t - 1)^{-1}.$$

Therefore, letting $t := 2(\tau + \log k) > 2.6$ for $\tau > 0$ and interpreting $\lambda_{\max}(X)$ as $\sup_i \{Z_i\}$,

$$\Pr\left[ \sup_i \{Z_i\} > 2\sqrt{ \sup_i \{\sigma_i^2\} \left( \log \frac{\sum_i \sigma_i^2}{\sup_i \{\sigma_i^2\}} + \tau \right) } \right] \leq e^{-\tau}.$$

Suppose the $Z_i \sim N(0, 1)$ are just $N$ i.i.d. standard Gaussian random variables. Then the above inequality states that the largest of the $Z_i$ is $O(\sqrt{\log N + \tau})$ with probability at least $1 - e^{-\tau}$; this is known to be tight up to constants, so the $\log N$ term cannot generally be removed. This fact has been noted by previous works on matrix tail inequalities (e.g., Tropp, 2011a), which also use this example as an extreme case. We note, however, that these previous works are not applicable to the case of a countably infinite number of mean-zero Gaussian random variables $Z_i \sim N(0, \sigma_i^2)$ (or more generally, subgaussian random variables), whereas the above inequality can be applied as long as the sum of the $\sigma_i^2$ is finite. ∎

3.4.2 Principal component analysis

Our next two examples use Theorem 4 to give spectral norm error bounds for estimating the second moment matrix of a random vector from i.i.d. copies. This is relevant in the context of (kernel) principal component analysis of high (or infinite) dimensional data (e.g., Schölkopf et al., 1999).

Example 2. Let $x_1, \ldots, x_n$ be i.i.d. random vectors with $\Sigma := \mathbb{E}[x_i x_i^\top]$, $K := \mathbb{E}[x_i x_i^\top x_i x_i^\top]$, and $\|x_i\|_2 \leq \bar\ell$ almost surely for some $\bar\ell > 0$. Let $X_i := x_i x_i^\top - \Sigma$ and $\hat\Sigma_n := n^{-1} \sum_{i=1}^{n} x_i x_i^\top$. We have $\lambda_{\max}(X_i) \leq \bar\ell^2 - \lambda_{\min}(\Sigma)$. Also, $\lambda_{\max}(n^{-1} \sum_{i=1}^{n} \mathbb{E}[X_i^2]) = \lambda_{\max}(K - \Sigma^2)$ and $\mathbb{E}[\operatorname{tr}(n^{-1} \sum_{i=1}^{n} \mathbb{E}[X_i^2])] = \operatorname{tr}(K - \Sigma^2)$. By Theorem 4,

$$\Pr\left[ \lambda_{\max}\left( \hat\Sigma_n - \Sigma \right) > \sqrt{\frac{2\lambda_{\max}(K - \Sigma^2)t}{n}} + \frac{(\bar\ell^2 - \lambda_{\min}(\Sigma))t}{3n} \right] \leq \frac{\operatorname{tr}(K - \Sigma^2)}{\lambda_{\max}(K - \Sigma^2)} \cdot t(e^t - t - 1)^{-1}.$$

Since $\lambda_{\max}(-X_i) \leq \lambda_{\max}(\Sigma)$, we also have

$$\Pr\left[ \lambda_{\max}\left( \Sigma - \hat\Sigma_n \right) > \sqrt{\frac{2\lambda_{\max}(K - \Sigma^2)t}{n}} + \frac{\lambda_{\max}(\Sigma)t}{3n} \right] \leq \frac{\operatorname{tr}(K - \Sigma^2)}{\lambda_{\max}(K - \Sigma^2)} \cdot t(e^t - t - 1)^{-1}.$$

Therefore

$$\Pr\left[ \|\hat\Sigma_n - \Sigma\|_2 > \sqrt{\frac{2\lambda_{\max}(K - \Sigma^2)t}{n}} + \frac{\max\{\bar\ell^2 - \lambda_{\min}(\Sigma), \lambda_{\max}(\Sigma)\}\,t}{3n} \right] \leq \frac{\operatorname{tr}(K - \Sigma^2)}{\lambda_{\max}(K - \Sigma^2)} \cdot 2t(e^t - t - 1)^{-1}.$$

A similar result was given by Zwald and Blanchard (2006, Lemma 1), but for Frobenius norm error rather than spectral norm error. This is generally incomparable to our result, although spectral norm error may be more appropriate in cases where the spectrum is slow to decay. ∎
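The quantities in Example 2 can all be estimated in simulation. The sketch below (our own illustration, assuming numpy; the bounded distribution is an arbitrary choice) computes the empirical spectral norm error and the trace ratio $\operatorname{tr}(K - \Sigma^2)/\lambda_{\max}(K - \Sigma^2)$ that plays the role of the dimension in the bound.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 50, 5000

# Population second moment matrix with rapidly decaying spectrum.
lam = 1.0 / (np.arange(1, d + 1) ** 2)
Sigma = np.diag(lam)

# x = sqrt(d) * Sigma^{1/2} u with u uniform on the unit sphere, so that
# E[x x^T] = Sigma and ||x||_2 <= sqrt(d * lambda_max(Sigma)) almost surely.
def sample(m):
    g = rng.normal(size=(m, d))
    u = g / np.linalg.norm(g, axis=1, keepdims=True)
    return np.sqrt(d) * u * np.sqrt(lam)       # broadcasting applies Sigma^{1/2}

X = sample(n)
Sigma_hat = X.T @ X / n
err = np.linalg.norm(Sigma_hat - Sigma, 2)

# Monte Carlo estimate of K = E[(x x^T)^2] = E[||x||^2 x x^T], then the trace
# ratio tr(K - Sigma^2) / lambda_max(K - Sigma^2): the "intrinsic dimension".
Y = sample(20000)
sq_norms = (Y ** 2).sum(axis=1)
K = (Y * sq_norms[:, None]).T @ Y / len(Y)
Mdiff = K - Sigma @ Sigma
ratio = np.trace(Mdiff) / np.linalg.eigvalsh(Mdiff).max()
print(f"spectral error: {err:.4f}   trace ratio: {ratio:.2f}   ambient dim: {d}")
```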
We now show that combining the bound from the previous example with sharper dimension-dependent tail inequalities can sometimes lead to stronger results.

Example 3. Let $x_1, \ldots, x_n$ be i.i.d. random vectors with $\Sigma := \mathbb{E}[x_i x_i^\top]$; let $X_i := x_i x_i^\top - \Sigma$ and $\hat\Sigma_n := n^{-1} \sum_{i=1}^{n} x_i x_i^\top$. For any positive integer $d \leq \operatorname{rank}(\Sigma)$, let $\Pi_{d,0}$ be the orthogonal projector onto the $d$-dimensional eigenspace of $\Sigma$ corresponding to its $d$ largest eigenvalues, and let $\Pi_{d,1} := I - \Pi_{d,0}$. We have

$$\begin{aligned}
\|\hat\Sigma_n - \Sigma\|_2 &\leq \|\Pi_{d,0}(\hat\Sigma_n - \Sigma)\Pi_{d,0}\|_2 + 2\|\Pi_{d,0}(\hat\Sigma_n - \Sigma)\Pi_{d,1}\|_2 + \|\Pi_{d,1}(\hat\Sigma_n - \Sigma)\Pi_{d,1}\|_2 \\
&\leq 2\|\Pi_{d,0}(\hat\Sigma_n - \Sigma)\Pi_{d,0}\|_2 + 2\|\Pi_{d,1}(\hat\Sigma_n - \Sigma)\Pi_{d,1}\|_2.
\end{aligned}$$

We can use the tail inequalities from this work to control $\|\Pi_{d,1}(\hat\Sigma_n - \Sigma)\Pi_{d,1}\|_2$, and use potentially sharper dimension-dependent inequalities to control $\|\Pi_{d,0}(\hat\Sigma_n - \Sigma)\Pi_{d,0}\|_2$. Let $\Sigma_{d,0} := \Pi_{d,0}\Sigma\Pi_{d,0}$, $\Sigma_{d,1} := \Pi_{d,1}\Sigma\Pi_{d,1}$, $K_{d,1} := \mathbb{E}[(\Pi_{d,1} x_i x_i^\top \Pi_{d,1})^2]$, and assume $\|\Pi_{d,1} x_i\|_2 \leq \bar\ell_{d,1}$ for all $i = 1, \ldots, n$ almost surely. Furthermore, suppose there exists $\gamma_{d,0} > 0$ such that for all $i = 1, \ldots, n$ and all vectors $\alpha$,

$$\mathbb{E}\left[ \exp\left( \alpha^\top \Sigma_{d,0}^{-1/2} x_i \right) \right] \leq \exp\left( \gamma_{d,0}\|\alpha\|_2^2 / 2 \right)$$

where $\Sigma_{d,0}^{-1/2}$ is the matrix square root of the Moore-Penrose pseudoinverse of $\Sigma_{d,0}$. This condition states that every projection of $\Sigma_{d,0}^{-1/2} x_i$ has subgaussian tails. In this case, the tail behavior of $\|\Pi_{d,0}(\hat\Sigma_n - \Sigma)\Pi_{d,0}\|_2$ should depend only on the subspace dimension $d$ rather than the ambient dimension. Indeed, a covering number argument gives

$$\Pr\left[ \|\Pi_{d,0}(\hat\Sigma_n - \Sigma)\Pi_{d,0}\|_2 > 2\gamma_{d,0}\|\Sigma\|_2 \left( \sqrt{\frac{71d + 16t}{n}} + \frac{5d + t}{n} \right) \right] \leq 2e^{-t/2}$$

for any $t > 0$ (see Lemma 2 in Appendix A). Combining this with the tail inequality from Example 2, we have (for $t \geq 2.6$)

$$\begin{aligned}
\Pr\Bigg[ \|\hat\Sigma_n - \Sigma\|_2 >\ & 4\gamma_{d,0}\|\Sigma\|_2 \left( \sqrt{\frac{71d + 16t}{n}} + \frac{5d + t}{n} \right) \\
&+ 2\sqrt{\frac{2\lambda_{\max}(K_{d,1} - \Sigma_{d,1}^2)\left( \log\left( \frac{\operatorname{tr}(K_{d,1} - \Sigma_{d,1}^2)}{\lambda_{\max}(K_{d,1} - \Sigma_{d,1}^2)} \right) + t \right)}{n}} \\
&+ \frac{2\max\{\bar\ell_{d,1}^2 - \lambda_{\min}(\Sigma_{d,1}), \lambda_{\max}(\Sigma_{d,1})\}\left( \log\left( \frac{\operatorname{tr}(K_{d,1} - \Sigma_{d,1}^2)}{\lambda_{\max}(K_{d,1} - \Sigma_{d,1}^2)} \right) + t \right)}{3n} \Bigg] \leq 4e^{-t/2}. \tag{2}
\end{aligned}$$

∎
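A minimal sketch of Example 3's decomposition (ours, assuming numpy; the Gaussian model and spectrum are illustrative): it forms the projectors $\Pi_{d,0}$ and $\Pi_{d,1}$ from the eigendecomposition of $\Sigma$ and reports the head, cross, and tail blocks of $\hat\Sigma_n - \Sigma$.

```python
import numpy as np

rng = np.random.default_rng(4)
dim, d, n = 40, 3, 2000

# Population covariance with a strong top-d eigenspace and a decaying tail.
lam = np.concatenate([np.full(d, 1.0), 0.05 / np.arange(1, dim - d + 1) ** 2])
Sigma = np.diag(lam)

X = rng.normal(size=(n, dim)) * np.sqrt(lam)   # x_i ~ N(0, Sigma)
Delta = X.T @ X / n - Sigma                    # Sigma_hat - Sigma

# Spectral projectors onto the top-d eigenspace of Sigma and its complement.
w, V = np.linalg.eigh(Sigma)
top = V[:, np.argsort(w)[::-1][:d]]
P0 = top @ top.T
P1 = np.eye(dim) - P0

norm = lambda M: np.linalg.norm(M, 2)
print(f"head block  ||P0 D P0||: {norm(P0 @ Delta @ P0):.4f}")
print(f"cross block ||P0 D P1||: {norm(P0 @ Delta @ P1):.4f}")
print(f"tail block  ||P1 D P1||: {norm(P1 @ Delta @ P1):.4f}")
print(f"full error  ||D||      : {norm(Delta):.4f}")
```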
Comparisons. We consider the following stylized scenario to compare the bounds from Example 2 and Example 3.

1. The largest $d$ eigenvalues of $\Sigma$ are all equal to $\|\Sigma\|_2$, and the remaining eigenvalues are smaller and rapidly decaying, so $\operatorname{tr}(\Sigma_{d,1}) / \|\Sigma\|_2$ is small.

2. $\bar\ell^2$ and $\bar\ell_{d,1}^2$ are within constant factors of $\operatorname{tr}(\Sigma)$ and $\operatorname{tr}(\Sigma_{d,1})$, respectively; this simply requires that the squared length of any $x_i$ never be more than a constant factor times its expected squared length.

3. $\lambda_{\max}(K - \Sigma^2)$ and $\lambda_{\max}(K_{d,1} - \Sigma_{d,1}^2)$ are within constant factors of $\lambda_{\max}(\Sigma)^2$ and $\lambda_{\max}(\Sigma_{d,1})^2$, respectively; this is similar to the previous condition.

We will also ignore constant and logarithmic factors, as well as the $\gamma_{d,0}$ factors. The bound on $\|\hat\Sigma_n\|_2$ from Example 3 then becomes (roughly)

$$\|\Sigma\|_2 \left( 1 + \sqrt{\frac{d}{n}} \right) + \|\Sigma\|_2 \left( \sqrt{\frac{t}{n}} + \frac{t}{n} + \frac{(\operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2)\,t}{n} \right) \tag{3}$$

whereas the bound from Example 2 is

$$\|\Sigma\|_2 + \|\Sigma\|_2 \left( \sqrt{\frac{t}{n}} + \frac{\left( d + \operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2 \right) t}{n} \right). \tag{4}$$

The main difference between these bounds is that the deviation term $t$ does not scale with $d$ in (3), but it does in (4), so the exponential tail in the latter is much weaker, as discussed in Subsection 3.3.

We can also compare the bound from Example 3 to the case where the $x_i$ are i.i.d. Gaussian random vectors with mean zero and covariance $\Sigma$. Arrange the $x_i$ as columns in a matrix $\hat A_n = [x_1 | \cdots | x_n]$, so $\|\hat\Sigma_n\|_2 = \frac{1}{n}\|\hat A_n \hat A_n^\top\|_2 = \frac{1}{n}\|\hat A_n\|_2^2$. Note that $\hat A_n$ has the same distribution as $\Sigma^{1/2} Z$, where $Z$ is a matrix of independent standard Gaussian random variables. The function $Z \mapsto \|\Sigma^{1/2} Z\|_2 = \|\hat A_n\|_2$ is $\|\Sigma^{1/2}\|_2$-Lipschitz in $Z$, so by Gaussian concentration (Pisier, 1989),

$$\Pr\left[ \|\hat A_n\|_2 > \mathbb{E}\left[ \|\hat A_n\|_2 \right] + \sqrt{2\|\Sigma\|_2 t} \right] \leq e^{-t}.$$

The expectation can be bounded using a result of Gordon (1985, 1988):

$$\mathbb{E}\left[ \|\hat A_n\|_2 \right] = \mathbb{E}\left[ \|\Sigma^{1/2} Z\|_2 \right] \leq \|\Sigma^{1/2}\|_2 \sqrt{n} + \|\Sigma^{1/2}\|_F.$$

Putting these together, we obtain

$$\Pr\left[ \|\hat\Sigma_n\|_2 > \|\Sigma\|_2 + 2\sqrt{\frac{\|\Sigma\|_2 \operatorname{tr}(\Sigma)}{n}} + 2\sqrt{\frac{2\|\Sigma\|_2^2 t}{n}} + \frac{\operatorname{tr}(\Sigma) + 2\sqrt{2\operatorname{tr}(\Sigma)\|\Sigma\|_2 t} + 2\|\Sigma\|_2 t}{n} \right] \leq e^{-t}.$$

In our stylized scenario, this roughly implies a bound on $\|\hat\Sigma_n\|_2$ of the form

$$\|\Sigma\|_2 \left( 1 + \sqrt{\frac{d + \operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2}{n}} + \frac{d + \operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2}{n} \right) + \|\Sigma\|_2 \left( \sqrt{\frac{t}{n}} + \frac{t}{n} \right). \tag{5}$$

Compared to (3), we see that the main difference is that $t$ does not scale with $\operatorname{tr}(\Sigma_{d,1})/\|\Sigma\|_2$ in (5), but it does in (3). Therefore the bounds are comparable (up to constant and logarithmic factors) when the eigenspectrum of $\Sigma$ is rapidly decaying after the first $d$ eigenvalues.
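The Gaussian comparison is easy to check numerically. The following sketch (ours, assuming numpy; parameters illustrative) draws $\hat A_n = \Sigma^{1/2} Z$ and compares the average of $\|\hat A_n\|_2$ against Gordon's bound $\|\Sigma^{1/2}\|_2\sqrt{n} + \|\Sigma^{1/2}\|_F$ on its expectation.

```python
import numpy as np

rng = np.random.default_rng(5)
dim, n, trials = 30, 500, 100

lam = 1.0 / np.arange(1, dim + 1) ** 2        # decaying spectrum for Sigma
sqrt_lam = np.sqrt(lam)

# Gordon's bound: ||Sigma^{1/2}||_2 sqrt(n) + ||Sigma^{1/2}||_F
gordon = sqrt_lam.max() * np.sqrt(n) + np.linalg.norm(sqrt_lam)
norms = []
for _ in range(trials):
    A = sqrt_lam[:, None] * rng.normal(size=(dim, n))   # A_n = Sigma^{1/2} Z
    norms.append(np.linalg.norm(A, 2))
print(f"mean ||A_n||_2: {np.mean(norms):.3f}   Gordon bound on the mean: {gordon:.3f}")
```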
3.4.3 Approximate matrix multiplication

Finally, we give an example about approximating a matrix product $AB^\top$ using non-uniform sampling of the columns of $A$ and $B$.

Example 4. Let $A := [a_1 | \cdots | a_m]$ and $B := [b_1 | \cdots | b_m]$ be fixed matrices, each with $m$ columns. Assume $a_i \neq 0$ and $b_i \neq 0$ for all $i = 1, \ldots, m$. If $m$ is very large, then the straightforward computation of the product $AB^\top$ can be prohibitive. An alternative is to take a small (non-uniform) random sample of the columns of $A$ and $B$, say $a_{i_1}, b_{i_1}, \ldots, a_{i_n}, b_{i_n}$, and then compute a weighted sum of outer products

$$\frac{1}{n} \sum_{j=1}^{n} \frac{a_{i_j} b_{i_j}^\top}{p_{i_j}}$$

where $p_{i_j} > 0$ is the a priori probability of choosing the column index $i_j \in \{1, \ldots, m\}$ (the actual values of the probabilities $p_i$ for $i = 1, \ldots, m$ are given below). An analysis of this scheme was given by Magen and Zouzias (2011), with the stronger requirement that the number of columns sampled be polynomially related to the allowed failure probability. Here we give an analysis in which the number of columns sampled depends only logarithmically on the failure probability.

Let $X_1, \ldots, X_n$ be i.i.d. random matrices with the discrete distribution given by

$$\Pr\left[ X_j = \frac{1}{p_i} \begin{pmatrix} 0 & a_i b_i^\top \\ b_i a_i^\top & 0 \end{pmatrix} \right] = p_i \propto \|a_i\|_2 \|b_i\|_2$$

for all $i = 1, \ldots, m$, where $p_i := \|a_i\|_2 \|b_i\|_2 / Z$ and $Z := \sum_{i=1}^{m} \|a_i\|_2 \|b_i\|_2$. Let

$$\hat M_n := \frac{1}{n} \sum_{j=1}^{n} X_j \quad \text{and} \quad M := \begin{pmatrix} 0 & AB^\top \\ BA^\top & 0 \end{pmatrix}.$$

Note that $\|\hat M_n - M\|_2$ is the spectral norm error of approximating $AB^\top$ using the average of $n$ outer products $\frac{1}{n}\sum_{j=1}^{n} a_{i_j} b_{i_j}^\top / p_{i_j}$, where the indices are such that $i_j = i$ if and only if $X_j = \frac{1}{p_i}\left(\begin{smallmatrix} 0 & a_i b_i^\top \\ b_i a_i^\top & 0 \end{smallmatrix}\right)$ for $j = 1, \ldots, n$. We have the following identities:

$$\mathbb{E}[X_j] = \sum_{i=1}^{m} p_i \cdot \frac{1}{p_i} \begin{pmatrix} 0 & a_i b_i^\top \\ b_i a_i^\top & 0 \end{pmatrix} = \begin{pmatrix} 0 & \sum_{i=1}^{m} a_i b_i^\top \\ \sum_{i=1}^{m} b_i a_i^\top & 0 \end{pmatrix} = M,$$

$$\operatorname{tr}(\mathbb{E}[X_j^2]) = \operatorname{tr}\left( \sum_{i=1}^{m} p_i \cdot \frac{1}{p_i^2} \begin{pmatrix} a_i b_i^\top b_i a_i^\top & 0 \\ 0 & b_i a_i^\top a_i b_i^\top \end{pmatrix} \right) = \sum_{i=1}^{m} \frac{2\|a_i\|_2^2 \|b_i\|_2^2}{p_i} = 2Z^2,$$

$$\operatorname{tr}(\mathbb{E}[X_j]^2) = \operatorname{tr}\begin{pmatrix} AB^\top BA^\top & 0 \\ 0 & BA^\top AB^\top \end{pmatrix} = 2\operatorname{tr}(A^\top A B^\top B);$$

and the following inequalities:

$$\|X_j\|_2 \leq \max_{i=1,\ldots,m} \frac{1}{p_i} \left\| \begin{pmatrix} 0 & a_i b_i^\top \\ b_i a_i^\top & 0 \end{pmatrix} \right\|_2 = \max_{i=1,\ldots,m} \frac{\|a_i b_i^\top\|_2}{p_i} = Z,$$

$$\|\mathbb{E}[X_j]\|_2 = \|AB^\top\|_2 \leq \|A\|_2\|B\|_2, \qquad \|\mathbb{E}[X_j^2]\|_2 \leq \|A\|_2\|B\|_2 Z.$$

This means $\|X_j - M\|_2 \leq Z + \|A\|_2\|B\|_2$ and $\|\mathbb{E}[(X_j - M)^2]\|_2 \leq \|\mathbb{E}[X_j^2] - M^2\|_2 \leq \|A\|_2\|B\|_2(Z + \|A\|_2\|B\|_2)$, so Theorem 4 and a union bound imply

$$\Pr\left[ \|\hat M_n - M\|_2 > \sqrt{\frac{2\left( \|A\|_2\|B\|_2(Z + \|A\|_2\|B\|_2) \right)t}{n}} + \frac{(Z + \|A\|_2\|B\|_2)t}{3n} \right] \leq 4\left( \frac{Z^2 - \operatorname{tr}(A^\top A B^\top B)}{\|A\|_2\|B\|_2(Z + \|A\|_2\|B\|_2)} \right) \cdot t(e^t - t - 1)^{-1}.$$

Let $r_A := \|A\|_F^2 / \|A\|_2^2 \in [1, \operatorname{rank}(A)]$ and $r_B := \|B\|_F^2 / \|B\|_2^2 \in [1, \operatorname{rank}(B)]$ be the numerical (or stable) ranks of $A$ and $B$, respectively. Since $Z / (\|A\|_2\|B\|_2) \leq \|A\|_F\|B\|_F / (\|A\|_2\|B\|_2) = \sqrt{r_A r_B}$, we have the simplified (but slightly looser) bound

$$\Pr\left[ \frac{\|\hat M_n - M\|_2}{\|A\|_2\|B\|_2} > 2\sqrt{\frac{(1 + \sqrt{r_A r_B})(\log(4\sqrt{r_A r_B}) + t)}{n}} + \frac{2(1 + \sqrt{r_A r_B})(\log(4\sqrt{r_A r_B}) + t)}{3n} \right] \leq e^{-t}.$$

Therefore, for any $\epsilon \in (0, 1)$ and $\delta \in (0, 1)$, if

$$n \geq \left( \frac{8}{3} + 2\sqrt{\frac{5}{3}} \right) \cdot \frac{(1 + \sqrt{r_A r_B})(\log(4\sqrt{r_A r_B}) + \log(1/\delta))}{\epsilon^2},$$

then with probability at least $1 - \delta$ over the random choice of column indices $i_1, \ldots, i_n$,

$$\left\| \frac{1}{n} \sum_{j=1}^{n} \frac{a_{i_j} b_{i_j}^\top}{p_{i_j}} - AB^\top \right\|_2 \leq \epsilon\|A\|_2\|B\|_2. \quad ∎$$
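The sampling scheme of Example 4 is only a few lines of code. Below is a minimal implementation sketch (ours, assuming numpy; the matrix sizes and sample count are illustrative) of sampling column indices with probability $p_i \propto \|a_i\|_2\|b_i\|_2$ and averaging the reweighted outer products.

```python
import numpy as np

rng = np.random.default_rng(6)
k1, k2, m, n = 40, 30, 10000, 2000

A = rng.normal(size=(k1, m))
B = rng.normal(size=(k2, m))

# Sampling probabilities p_i proportional to ||a_i||_2 ||b_i||_2 (Example 4).
weights = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
p = weights / weights.sum()

# Draw n column indices and average the reweighted outer products a_i b_i^T / p_i.
idx = rng.choice(m, size=n, p=p)
approx = (A[:, idx] / p[idx]) @ B[:, idx].T / n

exact = A @ B.T
rel_err = np.linalg.norm(approx - exact, 2) / (np.linalg.norm(A, 2) * np.linalg.norm(B, 2))
print(f"relative spectral norm error: {rel_err:.4f}")
```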
Acknowledgements

We are grateful to Alex Gittens for useful comments and for pointing out a subtle mistake in our proof of Theorem 2 in an earlier draft, and to Joel Tropp for his many comments and suggestions.

References

R. Ahlswede and A. Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48(3):569-579, 2002.

F. Bach. Consistency of the group Lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179-1225, 2008.

D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100-118, 1975.

K. Fukumizu, F. Bach, and A. Gretton. Consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8:361-383, 2007.

S. Golden. Lower bounds for the Helmholtz function. Physical Review, 137(4B):1127-1128, 1965.

Y. Gordon. Some inequalities for Gaussian processes and applications. Israel J. Math., 50:265-289, 1985.

Y. Gordon. Gaussian processes and almost spherical sections of convex bodies. Annals of Probability, 16:180-188, 1988.

D. Gross. Recovering low-rank matrices from few coefficients in any basis, 2009. arXiv:0910.1879.

D. Gross, Y.-K. Liu, S. Flammia, S. Becker, and J. Eisert. Quantum state tomography via compressed sensing. Physical Review Letters, 105(15):150401, 2010.

A. Guionnet. Large deviations and stochastic calculus for large random matrices. Probability Surveys, 1:72-172, 2004.

E. H. Lieb. Convex trace functions and the Wigner-Yanase-Dyson conjecture. Adv. Math., 11:267-288, 1973.

A. Litvak, A. Pajor, M. Rudelson, and N. Tomczak-Jaegermann. Smallest singular values of random matrices and geometry of random polytopes. Advances in Mathematics, 195:491-523, 2005.

A. Magen and A. Zouzias. Low rank matrix-valued Chernoff bounds and approximate matrix multiplication. In Proceedings of the 22nd ACM-SIAM Symposium on Discrete Algorithms, 2011.

R. I. Oliveira. Sums of random Hermitian matrices and an inequality by Rudelson. Elec. Comm. Probab., 15:203-212, 2010a.

R. I. Oliveira. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges, 2010b. arXiv:0911.0600.

G. Pisier. The Volume of Convex Bodies and Banach Space Geometry. Cambridge University Press, 1989.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

B. Recht. A simple approach to matrix completion, 2009. arXiv:0910.0651v2.

M. Rudelson. Random vectors in isotropic position. Journal of Functional Analysis, 164:60-72, 1999.

M. Rudelson and R. Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM, 54(4), 2007.

B. Schölkopf, A. J. Smola, and K.-R. Müller. Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 327-352. MIT Press, 1999.

C. J. Thompson. Inequality with applications in statistical mechanics. Journal of Mathematical Physics, 6(11):1812-1813, 1965.

J. Tropp. User-friendly tail bounds for sums of random matrices, 2011a. arXiv:1004.4389v6.

J. Tropp. Freedman's inequality for matrix martingales, 2011b. arXiv:1101.3039.

D. Voiculescu. Limit laws for random matrices and free products. Invent. Math., 104:201-220, 1991.

T. Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Proceedings of the 18th Annual Conference on Learning Theory, 2005.

L. Zwald and G. Blanchard. On the convergence of eigenspaces in kernel principal component analysis. In Advances in Neural Information Processing Systems 18, 2006.

A Sums of random vector outer products

The following lemma is a tail inequality for the smallest and largest eigenvalues of the empirical covariance matrix of subgaussian random vectors. This result (with non-explicit constants) was originally obtained by Litvak et al. (2005).

Lemma 2. Let $x_1, \ldots, x_n$ be random vectors in $\mathbb{R}^d$ such that, for some $\gamma \geq 0$,

$$\mathbb{E}\left[ x_i x_i^\top \mid x_1, \ldots, x_{i-1} \right] = I \quad \text{and} \quad \mathbb{E}\left[ \exp(\alpha^\top x_i) \mid x_1, \ldots, x_{i-1} \right] \leq \exp\left( \|\alpha\|_2^2 \gamma / 2 \right) \ \text{for all } \alpha \in \mathbb{R}^d,$$

for all $i = 1, \ldots, n$, almost surely. For all $\epsilon_0 \in (0, 1/2)$ and $\delta \in (0, 1)$,

$$\Pr\left[ \lambda_{\max}\left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \right) > 1 + \frac{1}{1 - 2\epsilon_0} \cdot \epsilon_{\epsilon_0, \delta, n} \ \text{ or } \ \lambda_{\min}\left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \right) < 1 - \frac{1}{1 - 2\epsilon_0} \cdot \epsilon_{\epsilon_0, \delta, n} \right] \leq \delta$$

where

$$\epsilon_{\epsilon_0, \delta, n} := \gamma \cdot \left( \sqrt{\frac{32\left( d\log(1 + 2/\epsilon_0) + \log(2/\delta) \right)}{n}} + \frac{2\left( d\log(1 + 2/\epsilon_0) + \log(2/\delta) \right)}{n} \right).$$

Remark 1. In our applications of this lemma, we simply choose $\epsilon_0 := 1/4$ for concreteness.

We give the proof of Lemma 2 for completeness. The subgaussian property most readily lends itself to bounds on linear combinations of subgaussian random variables. However, we are interested in bounding certain quadratic combinations. Therefore, we bootstrap from the bound for linear combinations to bound the moment generating function of the quadratic combinations; from there, we can obtain the desired tail inequality.

The following lemma relates the moment generating function to a tail inequality.

Lemma 3. Let $W$ be a non-negative random variable. For any $\eta \in \mathbb{R}$,

$$\mathbb{E}[\exp(\eta W)] - \eta\,\mathbb{E}[W] - 1 = \eta \int_0^\infty (\exp(\eta t) - 1) \cdot \Pr[W > t] \, dt.$$

Proof. Integration by parts. ∎
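The one-line proof compresses a short computation; for completeness, here is our own expansion using the standard tail-integral identity for a non-negative random variable.

```latex
% For W >= 0 we have e^{\eta W} = 1 + \int_0^W \eta e^{\eta t}\,dt, so taking
% expectations and swapping expectation and integral (Tonelli/Fubini):
\mathbb{E}[e^{\eta W}] = 1 + \eta \int_0^\infty e^{\eta t}\,\Pr[W > t]\,dt .
% Similarly, \mathbb{E}[W] = \int_0^\infty \Pr[W > t]\,dt.
% Subtracting \eta\,\mathbb{E}[W] + 1 from both sides gives the lemma:
\mathbb{E}[e^{\eta W}] - \eta\,\mathbb{E}[W] - 1
  = \eta \int_0^\infty \left( e^{\eta t} - 1 \right) \Pr[W > t]\,dt .
```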
The next lemma gives a tail inequality for any particular Rayleigh quotient of the empirical covariance matrix.

Lemma 4. Let $x_1, \ldots, x_n$ be random vectors in $\mathbb{R}^d$ such that, for some $\gamma \geq 0$,

$$\mathbb{E}\left[ x_i x_i^\top \mid x_1, \ldots, x_{i-1} \right] = I \quad \text{and} \quad \mathbb{E}\left[ \exp(\alpha^\top x_i) \mid x_1, \ldots, x_{i-1} \right] \leq \exp\left( \|\alpha\|_2^2 \gamma / 2 \right) \ \text{for all } \alpha \in \mathbb{R}^d,$$

for all $i = 1, \ldots, n$, almost surely. For all $\alpha \in \mathbb{R}^d$ such that $\|\alpha\|_2 = 1$, and all $\delta \in (0, 1)$,

$$\Pr\left[ \alpha^\top \left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \right) \alpha > 1 + \sqrt{\frac{32\gamma^2 \log(1/\delta)}{n}} + \frac{2\gamma\log(1/\delta)}{n} \right] \leq \delta$$

and

$$\Pr\left[ \alpha^\top \left( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top \right) \alpha < 1 - \sqrt{\frac{32\gamma^2 \log(1/\delta)}{n}} \right] \leq \delta.$$

Proof. Fix $\alpha \in \mathbb{R}^d$ with $\|\alpha\|_2 = 1$. For $i = 1, \ldots, n$, let $W_i := (\alpha^\top x_i)^2$, so $\mathbb{E}[W_i] = 1$. For any $t \geq 0$, using Chernoff's bounding method gives

$$\begin{aligned}
\mathbb{E}\left[ \mathbf{1}[W_i > t] \mid x_1, \ldots, x_{i-1} \right]
&\leq \inf_{\eta > 0} \left\{ \mathbb{E}\left[ \mathbf{1}\left[ \exp\left( \eta|\alpha^\top x_i| \right) > e^{\eta\sqrt{t}} \right] \mid x_1, \ldots, x_{i-1} \right] \right\} \\
&\leq \inf_{\eta > 0} \left\{ e^{-\eta\sqrt{t}} \cdot \left( \mathbb{E}\left[ \exp(\eta\,\alpha^\top x_i) \mid x_1, \ldots, x_{i-1} \right] + \mathbb{E}\left[ \exp(-\eta\,\alpha^\top x_i) \mid x_1, \ldots, x_{i-1} \right] \right) \right\} \\
&\leq \inf_{\eta > 0} \left\{ 2\exp\left( -\eta\sqrt{t} + \eta^2\gamma/2 \right) \right\} = 2\exp\left( -\frac{t}{2\gamma} \right).
\end{aligned}$$

So by Lemma 3, for any $0 < \eta < 1/(2\gamma)$,

$$\mathbb{E}\left[ \exp(\eta W_i) \mid x_1, \ldots, x_{i-1} \right] \leq 1 + \eta + \eta \int_0^\infty (\exp(\eta t) - 1) \cdot 2\exp\left( -\frac{t}{2\gamma} \right) dt = 1 + \eta + \frac{8\eta^2\gamma^2}{1 - 2\eta\gamma} \leq \exp\left( \eta + \frac{8\eta^2\gamma^2}{1 - 2\eta\gamma} \right)$$

and therefore

$$\mathbb{E}\left[ \exp\left( \eta \sum_{i=1}^{n} W_i \right) \right] \leq \exp\left( n\eta + \frac{8n\eta^2\gamma^2}{1 - 2\eta\gamma} \right).$$

Using Chernoff's bounding method twice more gives

$$\Pr\left[ \sum_{i=1}^{n} W_i > n + t \right] \leq \inf_{0 \leq \eta < 1/(2\gamma)} \left\{ \exp\left( -t\eta + \frac{8n\eta^2\gamma^2}{1 - 2\eta\gamma} \right) \right\} = \exp\left( -\frac{8n\gamma^2 + \gamma t - \sqrt{8n\gamma^2(8n\gamma^2 + 2\gamma t)}}{2\gamma^2} \right)$$

and

$$\Pr\left[ \sum_{i=1}^{n} W_i < n - t \right] \leq \inf_{\eta \leq 0} \left\{ \exp\left( t\eta + \frac{8n\eta^2\gamma^2}{1 - 2\eta\gamma} \right) \right\} \leq \exp\left( -\frac{t^2}{32n\gamma^2} \right).$$

The claim follows. ∎

In order to bound the smallest and largest eigenvalues of the empirical covariance matrix, we apply the bound for the Rayleigh quotient in Lemma 4 together with a covering argument.

Lemma 5 (Pisier, 1989). For any $\epsilon_0 > 0$, there exists $Q \subseteq S^{d-1} := \{\alpha \in \mathbb{R}^d : \|\alpha\|_2 = 1\}$ of cardinality at most $(1 + 2/\epsilon_0)^d$ such that for all $\alpha \in S^{d-1}$ there exists $q \in Q$ with $\|\alpha - q\|_2 \leq \epsilon_0$.

Proof of Lemma 2. Let $\hat\Sigma := (1/n)\sum_{i=1}^{n} x_i x_i^\top$, let $S^{d-1} := \{\alpha \in \mathbb{R}^d : \|\alpha\|_2 = 1\}$ be the unit sphere in $\mathbb{R}^d$, and let $Q \subset S^{d-1}$ be an $\epsilon_0$-cover of $S^{d-1}$ of minimum size with respect to $\|\cdot\|_2$. By Lemma 5, the cardinality of $Q$ is at most $(1 + 2/\epsilon_0)^d$. Let $\mathcal{E}$ be the event

$$\max\left\{ |q^\top(\hat\Sigma - I)q| : q \in Q \right\} \leq \epsilon_{\epsilon_0, \delta, n}.$$

By Lemma 4 and a union bound, $\Pr[\mathcal{E}] \geq 1 - \delta$. Now assume the event $\mathcal{E}$ holds. Let $\alpha_0 \in S^{d-1}$ be such that $|\alpha_0^\top(\hat\Sigma - I)\alpha_0| = \max\{|\alpha^\top(\hat\Sigma - I)\alpha| : \alpha \in S^{d-1}\} = \|\hat\Sigma - I\|_2$. Using the triangle and Cauchy-Schwarz inequalities, we have

$$\begin{aligned}
\|\hat\Sigma - I\|_2 = |\alpha_0^\top(\hat\Sigma - I)\alpha_0|
&= \min_{q \in Q} |q^\top(\hat\Sigma - I)q + \alpha_0^\top(\hat\Sigma - I)\alpha_0 - q^\top(\hat\Sigma - I)q| \\
&\leq \min_{q \in Q} |q^\top(\hat\Sigma - I)q| + |\alpha_0^\top(\hat\Sigma - I)\alpha_0 - q^\top(\hat\Sigma - I)q| \\
&= \min_{q \in Q} |q^\top(\hat\Sigma - I)q| + |\alpha_0^\top(\hat\Sigma - I)(\alpha_0 - q) - (q - \alpha_0)^\top(\hat\Sigma - I)q| \\
&\leq \min_{q \in Q} |q^\top(\hat\Sigma - I)q| + \|\alpha_0\|_2 \|\hat\Sigma - I\|_2 \|\alpha_0 - q\|_2 + \|q - \alpha_0\|_2 \|\hat\Sigma - I\|_2 \|q\|_2 \\
&\leq \epsilon_{\epsilon_0, \delta, n} + 2\epsilon_0\|\hat\Sigma - I\|_2,
\end{aligned}$$

so

$$\max\left\{ \lambda_{\max}(\hat\Sigma) - 1, \ 1 - \lambda_{\min}(\hat\Sigma) \right\} = \|\hat\Sigma - I\|_2 \leq \frac{1}{1 - 2\epsilon_0} \cdot \epsilon_{\epsilon_0, \delta, n}. \quad ∎$$
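Finally, a quick empirical check of Lemma 2 (our own sketch, assuming numpy, in the isotropic Gaussian case where $\gamma = 1$): it compares the extreme eigenvalues of the empirical covariance against the band allowed by the lemma with $\epsilon_0 = 1/4$.

```python
import numpy as np

rng = np.random.default_rng(7)
d, n, delta, eps0, gamma, trials = 10, 4000, 0.05, 0.25, 1.0, 500

# epsilon_{eps0, delta, n} from Lemma 2, including the covering-number term.
c = d * np.log(1 + 2 / eps0) + np.log(2 / delta)
eps = gamma * (np.sqrt(32 * c / n) + 2 * c / n)
band = eps / (1 - 2 * eps0)

violations = 0
for _ in range(trials):
    X = rng.normal(size=(n, d))          # isotropic: E[x x^T] = I, gamma = 1
    w = np.linalg.eigvalsh(X.T @ X / n)
    violations += (w.max() > 1 + band) or (w.min() < 1 - band)

# The constants in the lemma are conservative, so the empirical violation
# rate is typically far below the allowed failure probability delta.
print(f"violation rate: {violations / trials:.4f}  (Lemma 2 allows up to {delta})")
```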
