The Generalization Ability of Online Algorithms for Dependent Data


Authors: Alekh Agarwal, John C. Duchi

{alekh,jduchi}@eecs.berkeley.edu

November 27, 2024

Abstract

We study the generalization performance of online learning algorithms trained on samples coming from a dependent source of data. We show that the generalization error of any stable online algorithm concentrates around its regret—an easily computable statistic of the online performance of the algorithm—when the underlying ergodic process is β- or φ-mixing. We show high-probability error bounds assuming the loss function is convex, and we also establish sharp convergence rates and deviation bounds for strongly convex losses and several linear prediction problems such as linear and logistic regression, least-squares SVM, and boosting on dependent data. In addition, our results have straightforward applications to stochastic optimization with dependent data, and our analysis requires only martingale convergence arguments; we need not rely on more powerful statistical tools such as empirical process theory.

1 Introduction

Online learning algorithms have the attractive property that regret guarantees—performance of the sequence of points w(1), ..., w(n) the online algorithm plays measured against a fixed predictor w*—hold for arbitrary sequences of loss functions, without assuming any statistical regularity of the sequence. It is natural to ask whether one can say something stronger when some probabilistic structure underlies the sequence of examples, or loss functions, presented to the online algorithm. In particular, if the sequence of examples is generated by a stochastic process, can the online learning algorithm output a good predictor for future samples from the same process?
When data is drawn independently and identically distributed from a fixed underlying distribution, Cesa-Bianchi et al. [7] have shown that online learning algorithms can in fact output predictors with good generalization performance. Specifically, they show that for convex loss functions, the average of the n predictors played by the online algorithm has—with high probability—small generalization error on future examples generated i.i.d. from the same distribution. In this paper, we ask the same question when the data is drawn according to a (dependent) ergodic process.

In addition, this paper helps provide justification for the use of regret to a fixed comparator w* as a measure of performance for online learning algorithms. Regret to a fixed predictor is sometimes not a natural metric, which has led several researchers to study online algorithms with performance guarantees for (slowly) changing comparators w*(1), w*(2), ... (see, e.g., Herbster and Warmuth [13, 14]). When data comes i.i.d. from an (unknown) distribution, however, online-to-batch conversions [7] justify computing regret with respect to a fixed w*. In this paper, we show that even when data comes from a dependent stochastic process, regret to a fixed comparator is both meaningful and a reasonable evaluation metric.

Practically, though, many settings require learning with non-i.i.d. data—examples include time series data from financial problems, meteorological observations, and learning for predictive control—yet the generalization performance of statistical learning algorithms for non-independent data is perhaps not so well understood as that for the independent scenario. In spite of natural difficulties encountered with dependent data, several researchers have studied the convergence of statistical procedures in non-i.i.d. settings [29, 19, 30, 22].
In such scenarios, one generally assumes that the data are drawn from a stationary α-, β-, or φ-mixing sequence, which implies that dependence between observations weakens suitably over time. Yu [29] adapts classical empirical process techniques to prove uniform laws of large numbers for dependent data; perhaps a more direct parent to our approach is the work of Mohri and Rostamizadeh [22], who combine algorithmic stability [5] with known concentration inequalities to derive generalization bounds. Steinwart and Christmann [26] show fast rates of convergence for learning from stationary geometrically α-mixing processes, so long as the loss functions satisfy natural localization and self-bounding assumptions. Such assumptions were previously exploited in the machine learning and statistics literature for independent sequences (e.g. [2]), and Steinwart and Christmann extend these results by building off Bernstein-type inequalities for dependent sequences due to Modha and Masry [21].

In this paper, we show that online learning algorithms enjoy guarantees on generalization to unseen data for dependent data sequences from β- and φ-mixing sources. In particular, we show that stable online learning algorithms—those that do not change their predictor too aggressively between iterations—also yield predictors with small generalization error. In the most favorable regime of geometric mixing, we demonstrate generalization error on the order of O(log n / √n) after training on n samples when the loss function is convex and Lipschitz. We also demonstrate faster O(log n / n) convergence when the loss function is strongly convex in the hypothesis w, which is the usual case for regularized losses.
In addition, we consider linear prediction settings, and show O(log n / n) convergence when a loss that is strongly convex in its scalar argument (though not in the predictor w) is applied to a linear predictor ⟨w, ·⟩, which gives fast rates for least-squares SVMs, least-squares regression, logistic regression, and boosting over bounded sets. We also provide an example and associated learning algorithm for which the expected regret goes to −∞, while any fixed predictor has expected loss zero; this shows that low regret alone is not sufficient to guarantee small expected error when data samples are dependent.

In demonstrating generalization guarantees for online learning algorithms with dependent data, we answer an open problem posed by Cesa-Bianchi et al. [7] on whether online algorithms give good performance on unseen data when said data is drawn from a mixing stationary process. Our results also answer a question posed by Xiao [28] regarding the convergence of the regularized dual averaging algorithm with dependent stochastic gradients. More broadly, our results establish that any suitably stable optimization or online learning algorithm converges in stochastic approximation settings when the noise sequence is mixing. There is a rich history of classical work in this area (see e.g. the book [18] and references therein), but most results for dependent data are asymptotic, and to our knowledge there is a paucity of finite-sample and high-probability convergence guarantees. The guarantees we provide have applications to, for example, learning from Markov chains, autoregressive processes, or learning complex statistical models for which inference is expensive [27].

Our techniques build off of a recent paper by Duchi et al.
[10], where we show high-probability bounds on the convergence of the mirror descent algorithm for stochastic optimization even when the gradients are non-i.i.d. In particular, we build on our earlier martingale techniques, showing concentration inequalities for dependent random variables that are sharper than the previously used Bernstein concentration for geometrically α-mixing processes [21, 26] by exploiting recent ideas of Kakade and Tewari [17], though we use weakened versions of φ-mixing and β-mixing to prove our high-probability results. Further, our proof techniques require only relatively elementary martingale convergence arguments, and we do not require that the input data is stationary but only that it is suitably convergent.

2 Setup, Assumptions, and Notation

We assume that the online algorithm receives n data points x_1, ..., x_n from a sample space X, where the data is generated according to a stochastic process P, though the samples x_t are not necessarily i.i.d. or even independent. The online algorithm plays points (hypotheses) w ∈ W, and at iteration t the algorithm plays the point w(t) and suffers the loss F(w(t); x_t). We assume that the statistical samples x_t have a stationary distribution Π to which they converge (we make this precise shortly), and we measure generalization performance with respect to the expected loss or risk functional

    f(w) := E_Π[F(w; x)] = ∫_X F(w; x) dΠ(x).    (1)

Essentially, our goal is to show that after n iterations of any low-regret online algorithm, it is possible to use w(1), ..., w(n) to output a predictor or hypothesis ŵ_n for which f(ŵ_n) is guaranteed to be small with respect to any other hypothesis w*. Discussion of our statistical assumptions requires a few additional definitions.
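Before turning to those definitions, the risk functional (1) can be illustrated concretely: for a finite sample space it reduces to a finite sum. The following sketch uses an absolute-value loss and a two-state "sticky" Markov chain of our own choosing (not an example from the paper) to compute f(w) under Π and check that an ergodic average along a dependent sample path approaches it:

```python
import random

random.seed(0)
F = lambda w, x: abs(w - x)          # a 1-Lipschitz loss on W = [-1, 1]
pi = {1.0: 0.5, -1.0: 0.5}           # stationary distribution Pi on X = {-1, +1}

def f(w):
    """Risk functional (1): E_Pi[F(w; x)], a finite sum here."""
    return sum(p * F(w, x) for x, p in pi.items())

# dependent samples: a sticky chain that flips sign with probability q
q, x, xs = 0.3, 1.0, []
for _ in range(50000):
    if random.random() < q:
        x = -x
    xs.append(x)

w = 0.25
mc = sum(F(w, xt) for xt in xs) / len(xs)  # ergodic average along the path
assert abs(f(w) - 1.0) < 1e-12             # f(w) = 0.5|w-1| + 0.5|w+1| = 1 on [-1, 1]
assert abs(mc - f(w)) < 0.05               # long-run average approaches the risk
```

Even though the samples are dependent, the time average converges to the stationary risk; quantifying how fast, and for the algorithm's own iterates, is precisely what the mixing machinery below is for.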
The total variation distance between distributions P and Q defined on the probability space (S, F), where F is a σ-field, each with densities p and q with respect to an underlying measure µ,¹ is given by

    d_TV(P, Q) := sup_{A ∈ F} |P(A) − Q(A)| = (1/2) ∫_S |p(s) − q(s)| dµ(s).    (2)

Define the σ-field F_t = σ(x_1, ..., x_t). Let P_t[s] denote the distribution of x_t conditioned on F_s, that is, given the initial samples x_1, ..., x_s. Written slightly differently, P_t[s] = P_t(· | F_s) is a version of the conditional probability of x_t given the sigma field F_s = σ(x_1, ..., x_s). Our main assumption is that the stochastic process is suitably mixing: there is a stationary distribution Π to which the distribution of x_t converges as t grows. We also assume throughout that the distributions P_t[s] and Π are absolutely continuous with respect to an underlying measure µ. We use the following to measure convergence:

Definition 2.1 (Weak β- and φ-mixing). The β- and φ-mixing coefficients of the sampling distribution P are defined, respectively, as

    β(k) := sup_{t ∈ N} { 2 E[d_TV(P_{t+k}(· | F_t), Π)] }  and  φ(k) := sup_{t ∈ N, B ∈ F_t} { 2 d_TV(P_{t+k}(· | B), Π) }.

We say that the process is φ-mixing (respectively, β-mixing) if φ(k) → 0 (β(k) → 0) as k → ∞, and we assume without loss that β and φ are non-increasing.

The above definitions are weaker than the standard definitions of mixing [22, 6, 29], which require mixing over the entire future σ-field of the process, that is, σ(x_t, x_{t+1}, x_{t+2}, ...). In contrast, we require mixing over only the single-slice marginal of x_{t+k}.

¹This assumption is without loss, since P and Q are each absolutely continuous with respect to the measure P + Q.
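On a finite space, both expressions in (2) can be evaluated directly. A small sketch, with assumed three-point distributions of our own choosing, checking that the supremum form and the L1 form agree:

```python
from itertools import chain, combinations

def tv_sup(p, q):
    """d_TV via the supremum in (2): max over all events A of |P(A) - Q(A)|."""
    idx = range(len(p))
    events = chain.from_iterable(combinations(idx, r) for r in range(len(p) + 1))
    return max(abs(sum(p[i] for i in A) - sum(q[i] for i in A)) for A in events)

def tv_l1(p, q):
    """d_TV via the integral form in (2): (1/2) * sum_s |p(s) - q(s)|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
assert abs(tv_sup(p, q) - tv_l1(p, q)) < 1e-12   # the two forms in (2) agree
```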
From the definition, we also see that β-mixing is weaker than φ-mixing, since β(k) ≤ φ(k). We state our results in general forms using either the β- or φ-mixing coefficients of the stochastic process, and we generally use φ-mixing results for stronger high-probability guarantees compared to β-mixing. We remark that if the sequence {x_t} is i.i.d., then φ(1) = β(1) = 0.

Two regimes of β-mixing (and φ-mixing) will be of special interest. A process is called geometrically β-mixing (φ-mixing) if β(k) ≤ β_0 exp(−β_1 k^θ) (respectively φ(k) ≤ φ_0 exp(−φ_1 k^θ)) for some β_i, φ_i, θ > 0. Some stochastic processes satisfying geometric mixing include finite-state ergodic Markov chains and a large class of aperiodic, Harris-recurrent Markov processes; see the references [20, 21] for more examples. A process is called algebraically β-mixing (φ-mixing) if β(k) ≤ β_0 k^{−θ} (resp. φ(k) ≤ φ_0 k^{−θ}) for constants β_0, φ_0, θ > 0. Examples of algebraic mixing arise in certain Metropolis-Hastings samplers when the proposal distribution does not have a lower-bounded density [16], some queuing systems, and other unbounded processes.

We now turn to stating the relevant assumptions on the instantaneous loss functions F(·; x) and other quantities relevant to the online learning algorithm. Recall that the algorithm plays points (hypotheses) w ∈ W. Throughout, we make the following boundedness assumptions on F and the domain W, which are common in the online learning literature.

Assumption A (Boundedness). For µ-almost every (henceforth µ-a.e.) x, the function F(·; x) is convex and G-Lipschitz with respect to a norm ‖·‖ over W:

    |F(w; x) − F(v; x)| ≤ G‖w − v‖    (3)

for all w, v ∈ W. In addition, W is compact and has finite radius: for any w, w* ∈ W,

    ‖w − w*‖ ≤ R.    (4)

Further, F(w; x) ∈ [0, GR].
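As a concrete instance of geometric mixing, the conditional distributions of a two-state ergodic Markov chain approach Π at a geometric rate. The sketch below, with transition probabilities chosen by us for illustration, verifies numerically that the distance to stationarity contracts by the factor |1 − a − b| at every step:

```python
def step(dist, P):
    """One step of the chain: returns dist @ P for a 2-state transition matrix."""
    return [sum(dist[i] * P[i][j] for i in range(2)) for j in range(2)]

def tv(p, q):
    """Total variation distance between two distributions on {0, 1}."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

a, b = 0.3, 0.2                      # transition probabilities out of each state
P = [[1 - a, a], [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]      # stationary distribution of the chain
lam = abs(1 - a - b)                 # contraction factor

dist = [1.0, 0.0]                    # start deterministically in state 0
tv0 = tv(dist, pi)
gaps = []
for _ in range(10):
    dist = step(dist, P)
    gaps.append(tv(dist, pi))

# geometric decay: d_TV(P^k(0, .), pi) = lam**k * d_TV(delta_0, pi),
# so the chain is geometrically mixing with beta_1 = -log(lam), theta = 1
for k, g in enumerate(gaps, start=1):
    assert abs(g - lam ** k * tv0) < 1e-12
```

For a two-state chain the contraction is exact because the difference of the distribution from π lies along the single non-unit eigenvector of the transition matrix.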
As a consequence of Assumption A, f is also G-Lipschitz. Given the first two bounds (3) and (4) of Assumption A, the final condition can be assumed without loss; we make it explicit to avoid centering issues later. In the sequel, we give somewhat stronger results in the presence of the following additional assumption, which lower bounds the curvature of the expected function f:

Assumption B (Strong convexity). The expected function f is λ-strongly convex with respect to the norm ‖·‖, that is,

    f(v) ≥ f(w) + ⟨g, v − w⟩ + (λ/2)‖w − v‖²  for w, v ∈ W and for all g ∈ ∂f(w).    (5)

Lastly, to prove generalization error bounds for online learning algorithms, we require them to be appropriately stable, as described in the next assumption.

Assumption C. There is a non-increasing sequence κ(t) such that if w(t) and w(t + 1) are successive iterates of the online algorithm, then ‖w(t) − w(t + 1)‖ ≤ κ(t).

Here ‖·‖ is the same norm as that used in Assumption A. We observe that this stability assumption is different from the stability condition of Mohri and Rostamizadeh [22], and neither one implies the other. It is common (or at least straightforward) to establish bounds κ(t) as a part of the regret analysis of online algorithms (e.g. [28]), which motivates our assumption here.

What remains to complete our setup is to quantify our assumptions on the performance of the online learning algorithm. We assume access to an online algorithm whose regret is bounded by (the possibly random quantity) R_n for the sequence of points x_1, ..., x_n ∈ X; that is, the online algorithm produces a sequence of iterates w(1), ..., w(n) such that for any fixed w* ∈ W,

    Σ_{t=1}^n [F(w(t); x_t) − F(w*; x_t)] ≤ R_n.    (6)

Our goal is to use the sequence w(1), ..., w(n) to construct an estimator ŵ_n that performs well on unseen data.
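For intuition on Assumption C, note that projected online subgradient descent with stepsizes η_t satisfies ‖w(t + 1) − w(t)‖ ≤ Gη_t, since the subgradient norm is at most G and projection onto W is nonexpansive. A minimal sketch, with an assumed absolute-value loss and stepsize η_t = 1/√t (so the non-increasing choice κ(t) = G/√t works):

```python
import math, random

random.seed(0)
G, lo, hi = 1.0, -1.0, 1.0           # Lipschitz constant and domain W = [-1, 1]

def proj(w):
    """Euclidean projection onto W (nonexpansive)."""
    return min(max(w, lo), hi)

def subgrad(w, x):
    """A subgradient of F(w; x) = |w - x|; |g| <= G = 1."""
    return 1.0 if w >= x else -1.0

xs = [random.choice([-1.0, 1.0]) for _ in range(200)]
w, iterates = 0.0, [0.0]
for t, x in enumerate(xs, start=1):
    eta = 1.0 / math.sqrt(t)         # stepsize eta_t = 1/sqrt(t)
    w = proj(w - eta * subgrad(w, x))
    iterates.append(w)

# Assumption C holds with kappa(t) = G * eta_t = 1/sqrt(t)
for t in range(1, len(xs) + 1):
    assert abs(iterates[t] - iterates[t - 1]) <= G / math.sqrt(t) + 1e-12
```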
Since our samples are dependent, we measure the generalization error on future test samples drawn from the same sample path as the training data [22]. That is, we measure performance on the m samples x_{n+1}, ..., x_{n+m} drawn from the process P[n], and we would like to bound the future risk of ŵ_n, defined as

    (1/m) Σ_{t=1}^m E[F(ŵ_n; x_{n+t}) − F(w*; x_{n+t}) | F_n],    (7)

the conditional expectation of the losses F(ŵ_n; x) given the first n samples. Note that in the i.i.d. setting [7], the expectation above is the excess risk f(ŵ_n) − f(w*) of ŵ_n against w*, because x_{n+t} is independent of x_1, ..., x_n. Of course, we are in the dependent setting, so the generalization measure (7) requires slightly more care.

3 Generalization bounds for convex functions

Our definitions and assumptions in place, we show in this section that any suitably stable online learning algorithm enjoys a high-probability generalization guarantee for convex loss functions F. The main results of this section are Theorems 2 and 3, which give high-probability convergence of any stable online learning algorithm under φ- and β-mixing, respectively. Following Theorem 2, we also present an example illustrating that low regret is by itself insufficient to guarantee good generalization performance, in contrast to i.i.d. settings [7].

Before proceeding with our technical development, we describe the high-level structure and intuition underlying our proofs. The technical insight underpinning many of our results is that under our mixing assumptions, the distribution of the random instance x_{t+τ} is close to the stationary distribution conditioned on F_t. That is, looking some number of steps τ into the future from a time t is almost as good as obtaining an unbiased sample from the stationary distribution Π.
As a result, the loss F(w(t); x_{t+τ}) is a good proxy for f(w(t)), since w(t) depends only on x_1, ..., x_{t−1}. Lemma 1 formalizes this intuition. (Duchi et al. [10] use a similar technique as a building block.) Under our stability condition, we can further demonstrate that F(w(t); x_{t+τ}) is close to F(w(t + τ); x_{t+τ}), and the behavior of the latter sequence is nearly the same as the sequence F(w(t); x_t) with respect to which the regret R_n is measured. We make these ideas formal in Propositions 1 and 2. We then combine our intermediate results (including bounds on the regret R_n), applying relevant martingale concentration inequalities, to obtain the main theorems of this and later sections.

Our starting point is the above-mentioned technical lemma that underlies many of our results.

Lemma 1. Let w, v ∈ W be measurable with respect to the σ-field F_t and let Assumption A hold. Then for any τ ∈ N,

    E[F(w; x_{t+τ}) − F(v; x_{t+τ}) | F_t] ≤ f(w) − f(v) + GRφ(τ)

and

    E[ | E[F(w; x_{t+τ}) − F(v; x_{t+τ}) | F_t] − (f(w) − f(v)) | ] ≤ GRβ(τ).

Proof. We first prove the result for the φ-mixing bound. Recalling that f(w) = E_Π[F(w; x)] and the definition of the underlying measure µ and the densities π and p,

    E[F(w; x_{t+τ}) − F(v; x_{t+τ}) | F_t]
      = E[F(w; x_{t+τ}) − f(w) + f(v) − F(v; x_{t+τ}) | F_t] + f(w) − f(v)
      = ∫_X [F(w; x) − F(v; x)] (p_{t+τ}[t](x) − π(x)) dµ(x) + f(w) − f(v)
      ≤ ∫_X |F(w; x) − F(v; x)| |p_{t+τ}[t](x) − π(x)| dµ(x) + f(w) − f(v)
      ≤ GR ∫ |p_{t+τ}[t](x) − π(x)| dµ(x) + f(w) − f(v)
      = 2GR · d_TV(P_{t+τ}[t], Π) + f(w) − f(v),

where for the second inequality we used the Lipschitz assumption A and the compactness assumption on W.
Noting that 2 d_TV(P_{t+τ}[t], Π) ≤ φ(τ) by Definition 2.1 completes the proof of the first part.

To see the second inequality using β-mixing coefficients, we begin by noting that, as a consequence of the proof of the first inequality,

    E[F(w; x_{t+τ}) − F(v; x_{t+τ}) | F_t] − (f(w) − f(v)) ≤ 2GR d_TV(P_{t+τ}[t], Π),

and the inequality holds with w and v switched:

    E[F(v; x_{t+τ}) − F(w; x_{t+τ}) | F_t] − (f(v) − f(w)) ≤ 2GR d_TV(P_{t+τ}[t], Π).

Combining the two inequalities and taking expectations, we have

    E[ | E[F(w; x_{t+τ}) − F(v; x_{t+τ}) | F_t] − (f(w) − f(v)) | ] ≤ 2GR E[d_TV(P_{t+τ}(· | F_t), Π)] ≤ GRβ(τ)

by Definition 2.1 of the mixing coefficients.

Using Lemma 1, we can give a proposition that relates the risk on the test sequence to the expected error of a predictor w under the stationary distribution. The result shows that for any w measurable with respect to the σ-field F_n—we use ŵ_n ∈ F_n, the (as yet unspecified) output of the online learning algorithm—we can prove generalization bounds by showing that w has small risk under the stationary distribution Π.

Proposition 1. Under the Lipschitz assumption A, for any w ∈ W measurable with respect to F_n, any w* ∈ W, and any τ ∈ N,

    (1/m) Σ_{t=n+1}^{n+m} E[F(w; x_t) − F(w*; x_t) | F_n] ≤ f(w) − f(w*) + φ(τ)GR + (τ − 1)GR/m

and

    E[ (1/m) Σ_{t=n+1}^{n+m} E[F(w; x_t) − F(w*; x_t) | F_n] ] ≤ E[f(w)] − f(w*) + β(τ)GR + (τ − 1)GR/m.

Proof. The proof follows from Definition 2.1 of mixing. The key idea is to give up on the first τ − 1 test samples and use the mixing assumption to control the loss on the remainder.
We have

    Σ_{t=n+1}^{n+m} E[F(w; x_t) − F(w*; x_t) | F_n]
      = Σ_{t=n+1}^{n+τ−1} E[F(w; x_t) − F(w*; x_t) | F_n] + Σ_{t=n+τ}^{n+m} E[F(w; x_t) − F(w*; x_t) | F_n]
      ≤ (τ − 1)GR + Σ_{t=n+τ}^{n+m} E[F(w; x_t) − F(w*; x_t) | F_n],

since by the Lipschitz assumption A and compactness, F(w; x) − F(w*; x) ≤ GR. Now we apply Lemma 1 to the summation, which completes the proof.

Proposition 1 allows us to focus on controlling the error on the expected function f under the stationary distribution Π, which is a natural convergence guarantee. Indeed, the function f is the risk functional with respect to which convergence is measured in the standard i.i.d. case, and applying Proposition 1 with τ = 1 and φ(1) = 0 (or β(1) = 0) confirms that the bound is equal to f(w) − f(w*).

We now turn to controlling the error under f, beginning with a result that relates the risk performance of the sequence of hypotheses w(1), ..., w(n) output by the online learning algorithm to the algorithm's regret, a term dependent on the stability of the algorithm, and an additional random term. This proposition is the starting point for the remainder of our results in this section.

Proposition 2. Let Assumptions A and C hold and let w(t) denote the sequence of outputs of the online algorithm. Then for any τ ∈ N,

    Σ_{t=1}^n [f(w(t)) − f(w*)] ≤ R_n + Gτ Σ_{t=1}^n κ(t) + 2τGR + Σ_{t=1}^n [f(w(t)) − F(w(t); x_{t+τ}) + F(w*; x_{t+τ}) − f(w*)].    (8)

Proof. We begin by expanding the regret of w(t) on the sequence f via

    Σ_{t=1}^n [f(w(t)) − f(w*)]
      = Σ_{t=1}^n [f(w(t)) − F(w(t); x_{t+τ}) + F(w(t); x_{t+τ}) − f(w*)]
      = Σ_{t=1}^n [f(w(t)) − F(w(t); x_{t+τ}) + F(w*; x_{t+τ}) − f(w*) + F(w(t); x_{t+τ}) − F(w*; x_{t+τ})].
(9)

Now we use stability and the regret guarantee (6) to bound the last two terms of the summation (9). To that end, note that

    Σ_{t=1}^n [F(w(t); x_{t+τ}) − F(w*; x_{t+τ})]
      = Σ_{t=1}^n [F(w(t); x_t) − F(w*; x_t)]                                          (=: S_1)
      + Σ_{t=1}^{n−τ} [F(w(t); x_{t+τ}) − F(w(t + τ); x_{t+τ})]                        (=: S_2)
      + Σ_{t=n−τ+1}^n F(w(t); x_{t+τ}) − Σ_{t=1}^τ F(w(t); x_t) + Σ_{t=1}^τ F(w*; x_t) − Σ_{t=n+1}^{n+τ} F(w*; x_t).    (=: S_3)

We now bound the three terms in the summation. S_3 is bounded by 2τGR under the boundedness assumption A, and the regret bound (6) guarantees that S_1 ≤ R_n. Using the stability assumption C, we can bound S_2 by noting

    F(w(t); x_{t+τ}) − F(w(t + τ); x_{t+τ}) ≤ G‖w(t) − w(t + τ)‖ ≤ G Σ_{s=0}^{τ−1} κ(t + s) ≤ Gτκ(t),

where the last step uses the non-increasing property of the coefficients κ(t). Substituting the bounds on S_1, S_2, and S_3 into Eq. (9) completes the proposition.

The remaining development of this section consists of using the key inequality (8) in Proposition 2 to give expected and high-probability convergence guarantees for the online learning algorithm. Throughout, we define the output of the online algorithm to be the averaged predictor

    ŵ_n = (1/n) Σ_{t=1}^n w(t).    (10)

We begin with results giving convergence in expectation for stable online algorithms.

Theorem 1. Under Assumptions A and C, for any τ ∈ N the predictor ŵ_n satisfies the guarantee

    E[f(ŵ_n)] − f(w*) ≤ (1/n) E[R_n] + β(τ)GR + ((τ − 1)G/n) (2R + Σ_{t=1}^n κ(t)),

for any w* ∈ W.

Proof. From the inequality (8) in Proposition 2, what remains is to take the expectation of the random quantities.
To that end, we note that w(t) is measurable with respect to F_{t−1} (since the iterate at time t depends only on the first t − 1 samples) and apply Lemma 1, which gives

    E[E[F(w*; x_{t+τ−1}) − F(w(t); x_{t+τ−1}) | F_{t−1}]] ≤ f(w*) − f(w(t)) + GRβ(τ).

Adding the difference to the sum (8) with the setting τ ↦ (τ − 1) gives

    E[ Σ_{t=1}^n f(w(t)) − f(w*) ] ≤ E[R_n] + G(τ − 1) Σ_{t=1}^n κ(t) + 2(τ − 1)GR + nGRβ(τ).

Dividing by n and observing that f(ŵ_n) ≤ (1/n) Σ_{t=1}^n f(w(t)) by Jensen's inequality completes the proof.

We observe that setting τ = 1 and β(1) = 0 recovers an expected version of the results of Cesa-Bianchi et al. [7, Corollary 2] for i.i.d. samples. Theorem 1 combined with Proposition 1 immediately yields the following generalization bound. Our other results can be similarly extended, but we leave such development to the reader.

Corollary 2. Under Assumptions A and C, for any τ ∈ N the predictor ŵ_n satisfies the guarantee

    (1/m) E[ Σ_{t=n+1}^{n+m} F(ŵ_n; x_t) − F(w*; x_t) ] ≤ (1/n) E[R_n] + 2β(τ)GR + (τ − 1)G (2R/n + R/m + (1/n) Σ_{t=1}^n κ(t)).

It is clear that the stability assumption we make on the online algorithm plays a key role in our results whenever τ > 1, that is, whenever the samples are indeed dependent. It is natural to ask whether this additional term is just an artifact of our analysis, or whether low regret by itself ensures a small error under the stationary distribution even for dependent data. The next example shows that low regret—by itself—is insufficient for generalization guarantees, so some additional assumption on the online algorithm is necessary to guarantee small error under the stationary distribution.

Example 1 (Low regret does not imply convergence).
In one dimension, define the linear loss F(w; x) = ⟨w, x⟩, where x ∈ {−1, 1} and the set W = [−1, 1]. Let p > 0 and define the following dependent sampling process: at each time t, set

    x_t = 1 with probability p/2,  −1 with probability p/2,  x_{t−1} with probability 1 − p.

The stationary distribution Π is uniform on {−1, 1}, so the expected error E_Π[⟨w, x⟩] = 0 for any w ∈ W. However, we can demonstrate an update rule with negative expected regret as follows. Consider the algorithm which sets w(t) = −x_{t−1}, implementing a trivial so-called follow the leader strategy. With probability 1 − p/2, the value ⟨w(t), x_t⟩ = −1, while ⟨w(t), x_t⟩ = 1 with probability p/2. Consequently, the expectation of the cumulative sum Σ_{t=1}^n F(w(t); x_t) is −(1 − p)n. Using standard results on the expected deviation of the simple random walk (e.g. [4]), we know that

    E[ inf_{w ∈ W} Σ_{t=1}^n ⟨w, x_t⟩ ] = −E[ | Σ_{t=1}^n x_t | ] = Θ(−√n).

We are thus guaranteed that the expected regret of the update rule is −Ω((1 − p)n).

[Figure 1: The τ different blocks of near-martingales used in the proof of Theorem 2. Black boxes represent elements in the same index set I(1), gray in I(2), and so on.]

We have now seen that it is possible to achieve guarantees on the generalization properties of an online learning algorithm by taking expectation over both the training and test samples. We would like to prove stronger results that hold with high probability over the training data, as is possible in i.i.d. settings [7]. The next theorem applies martingale concentration arguments using the Hoeffding-Azuma inequality [1] to give high-probability concentration for the random quantities remaining in Proposition 2's bound.

Theorem 2.
Under Assumptions A and C, with probability at least 1 − δ, for any τ ∈ N and any w* ∈ W the predictor ŵ_n satisfies the guarantee

    f(ŵ_n) − f(w*) ≤ (1/n)R_n + ((τ − 1)G/n) Σ_{t=1}^n κ(t) + 2GR √((2τ/n) log(τ/δ)) + φ(τ)GR + 2(τ − 1)GR/n.

Proof. Inspecting the inequality (8) from Proposition 2, we observe that it suffices to bound

    Z_n := Σ_{t=1}^n [f(w(t)) − f(w*) − F(w(t); x_{t+τ−1}) + F(w*; x_{t+τ−1})].    (11)

This is analogous to the term that arises in the i.i.d. case [7], where Z_n is a bounded martingale sequence and hence concentrates around its expectation. Our proof that the sum (11) concentrates is similar to the argument Duchi et al. [10] use to prove concentration for the ergodic mirror descent algorithm. The idea is that though Z_n is not quite a martingale in the general ergodic case, it is in fact a sum of τ near-martingales. This technique of using blocks of random variables in dependent settings has also been used in previous work to directly bound the moment generating function of sums of dependent variables [21], though our approach is different. See Fig. 1 for a graphical representation of our choice (12) of the martingale sequences.

For i ∈ {1, ..., τ} and t ∈ {1, ..., ⌈n/τ⌉}, define the random variables

    X^i_t := f(w((t − 1)τ + i)) − f(w*) + F(w*; x_{tτ+i−1}) − F(w((t − 1)τ + i); x_{tτ+i−1}).    (12)

In addition, define the associated σ-fields F^i_t := F_{tτ+i−1} = σ(x_1, ..., x_{tτ+i−1}). Then it is clear that X^i_t is measurable with respect to F^i_t (recall that w(t) is measurable with respect to F_{t−1}), so the sequence X^i_t − E[X^i_t | F^i_{t−1}] defines a martingale difference sequence adapted to the filtration F^i_t, t = 1, 2, .... Following previous subsampling techniques [21, 10], we define the index set I(i) to be the indices {1, ...
, ⌊n/τ⌋ + 1} for i ≤ n − τ⌊n/τ⌋ and {1, ..., ⌊n/τ⌋} otherwise. Then a bit of algebra shows that

    Z_n = Σ_{i=1}^τ Σ_{t ∈ I(i)} (X^i_t − E[X^i_t | F^i_{t−1}]) + Σ_{i=1}^τ Σ_{t ∈ I(i)} E[X^i_t | F^i_{t−1}].    (13)

The first term in the decomposition (13) is a sum of τ different martingale difference sequences. In addition, the boundedness assumption A guarantees that |X^i_t − E[X^i_t | F^i_{t−1}]| ≤ 2GR, so each of the sequences is a bounded difference sequence. The Hoeffding-Azuma inequality [1] then guarantees

    P( Σ_{t ∈ I(i)} (X^i_t − E[X^i_t | F^i_{t−1}]) ≥ γ ) ≤ exp(−τγ²/(8nG²R²)).    (14)

To control the expectation term from the second sum in the representation (13), we use mixing. Indeed, Lemma 1 immediately implies that E[X^i_t | F^i_{t−1}] ≤ GRφ(τ). Combining these bounds with the application (14) of the Hoeffding-Azuma inequality, we see by a union bound that

    P(Z_n > nGRφ(τ) + γ) ≤ Σ_{i=1}^τ P( Σ_{t ∈ I(i)} (X^i_t − E[X^i_t | F^i_{t−1}]) ≥ γ/τ ) ≤ τ exp(−γ²/(8τnG²R²)).

Equivalently, by setting γ = 2GR √(2nτ log(τ/δ)), we obtain that with probability at least 1 − δ,

    Z_n ≤ GR ( nφ(τ) + 2√(2nτ log(τ/δ)) ).

Dividing by n and using the convexity of f as in the proof of Theorem 1 completes the proof.

To better illustrate our results, we now specialize them under concrete mixing assumptions in several corollaries, which should make clearer the rates of convergence of the procedures. We begin with two corollaries giving generalization error bounds for geometrically and algebraically φ-mixing processes (defined in Section 2).

Corollary 3. Under the assumptions of Theorem 2, assume further that φ(k) ≤ c exp(−φ_1 k^θ) for some universal constant c.
There exists a finite universal constant $C$ such that with probability at least $1 - \delta$, for any $w^* \in \mathcal{W}$,
$$f(\widehat{w}_n) - f(w^*) \le \frac{1}{n} R_n + C \cdot \bigg[ \frac{(\log n)^{1/\theta} G}{n \phi_1^{1/\theta}} \sum_{t=1}^n \kappa(t) + GR \sqrt{\frac{(\log n)^{1/\theta}}{n \phi_1^{1/\theta}} \log\frac{(\log n)^{1/\theta}}{\delta}} \bigg].$$
The corollary follows from Theorem 2 by taking $\tau = (\log n / (2\phi_1))^{1/\theta}$. When the samples $x_t$ come from a geometrically $\phi$-mixing process, Corollary 3 yields a high-probability generalization bound of the same order as that in the i.i.d. setting [7], up to poly-logarithmic factors. Algebraic mixing gives somewhat slower rates:

Corollary 4. Under the assumptions of Theorem 2, assume further that $\phi(k) \le \phi_0 k^{-\theta}$. Define $K_n = \sum_{t=1}^n \kappa(t)/R$. There exists a finite universal constant $C$ such that with probability at least $1 - \delta$, for any $w^* \in \mathcal{W}$,
$$f(\widehat{w}_n) - f(w^*) \le \frac{1}{n} R_n + C \cdot \bigg[ GR\,\phi_0^{\frac{1}{1+\theta}} \Big(\frac{K_n}{n}\Big)^{\frac{\theta}{1+\theta}} + GR\,\phi_0^{\frac{1}{\theta+1}} \big(K_n n^\theta\big)^{-\frac{1}{2\theta+2}} \sqrt{\frac{1}{\theta+1} \cdot \log\frac{n}{K_n \delta}} \bigg].$$
The corollary follows by setting $\tau = \phi_0^{1/(\theta+1)} (n/K_n)^{1/(\theta+1)}$. So long as the sum of the stability constants satisfies $\sum_{t=1}^n \kappa(t) = o(n)$, the bound in Corollary 4 converges to 0. In addition, we remark that under the same condition on the stability, an argument similar to that for Corollary 7 of Duchi et al. [10] implies $f(\widehat{w}_n) - f(w^*) \to 0$ almost surely whenever $\phi(k) \to 0$ as $k \to \infty$.

To obtain concrete generalization error rates from our results, one must know bounds on the stability sequence $\kappa(t)$ (and the regret $R_n$). For many online algorithms, including online gradient and mirror descent [9], the stability sequence satisfies $\kappa(t) \propto 1/\sqrt{t}$. As a more concrete example, consider Nesterov's dual averaging algorithm [23], which Xiao extends to regularized settings [28].
For convex, $G$-Lipschitz functions, the dual averaging algorithm satisfies $R_n = O(GR\sqrt{n})$, and with appropriate stepsize choice [28, Lemma 10] proportional to $\sqrt{t}$, one has $\kappa(t) \le R/\sqrt{t}$. Noting that $\sum_{t=1}^n t^{-1/2} \le 2\sqrt{n}$, substituting the stability bound into the result of Theorem 2 immediately yields the following: there exists a universal constant $C$ such that with probability at least $1 - \delta$,
$$f(\widehat{w}_n) - f(w^*) \le \frac{1}{n} R_n + C \cdot \inf_{\tau \in \mathbb{N}} \bigg\{ \frac{GR(\tau - 1)}{\sqrt{n}} + \frac{GR}{\sqrt{n}} \sqrt{\tau \log\frac{\tau}{\delta}} + \phi(\tau) GR \bigg\}. \quad (15)$$
The bound (15) captures the known convergence rates for i.i.d. sequences [7, 28] by taking $\tau = 1$, since $\phi(1) = 0$ in i.i.d. settings. In addition, specializing to the geometric mixing rate of Corollary 3, one obtains a generalization error bound of $O\big( (1 + 1/\phi_1) \frac{1}{\sqrt{n}} \big)$ up to poly-logarithmic factors.

Theorem 2 and the corollaries following it require $\phi$-mixing of the stochastic sequence $x_1, x_2, \ldots$, which is perhaps an undesirably strong assumption in some situations (for example, when the sample space $\mathcal{X}$ is unbounded). To mitigate this, we now give high-probability convergence results under the weaker assumption that the stochastic process $P$ is $\beta$-mixing. These results are (unsurprisingly) weaker than those for $\phi$-mixing; nonetheless, there is no significant loss in rates of convergence as long as the process $P$ mixes quickly enough.

Theorem 3. Under Assumptions A and C, with probability at least $1 - 2\delta$, for any $\tau \in \mathbb{N}$ and for all $w^* \in \mathcal{W}$ the predictor $\widehat{w}_n$ satisfies the guarantee
$$f(\widehat{w}_n) - f(w^*) \le \frac{1}{n} R_n + \frac{(\tau - 1) G}{n} \sum_{t=1}^n \kappa(t) + 2GR \sqrt{\frac{2\tau}{n} \log\frac{2\tau}{\delta}} + \frac{2\beta(\tau) GR}{\delta} + \frac{2(\tau - 1) GR}{n}.$$

Proof. Following the proof of Theorem 2, we construct the random variables $Z_n$ and $X_t^i$ as in the definitions (11) and (12).
Decomposing $Z_n$ into the two-part sum (13), we similarly apply the Hoeffding-Azuma inequality (as in the proof of Theorem 2) to the first term. The treatment of the second piece requires more care. Observe that for any fixed $i, t$, the fact that $w((t-1)\tau + i)$ and $w^*$ are measurable with respect to $\mathcal{F}_{t-1}^i$ guarantees via Lemma 1 that
$$\mathbb{E}\Big[ \big| \mathbb{E}[X_t^i \mid \mathcal{F}_{t-1}^i] \big| \Big] \le GR\,\beta(\tau).$$
Applying Markov's inequality, we see that with probability at least $1 - \delta$,
$$\sum_{i=1}^\tau \sum_{t \in \mathcal{I}(i)} \mathbb{E}[X_t^i \mid \mathcal{F}_{t-1}^i] \le \frac{nGR\beta(\tau)}{\delta}.$$
Continuing as in the proof of Theorem 2 yields the result of the theorem.

Though the $1/\delta$ factor in Theorem 3 may be large, we now show that things are not so difficult as they seem. Indeed, let us now make the additional assumption that the stochastic process $x_1, x_2, \ldots$ is geometrically $\beta$-mixing. We have the following corollary.

Corollary 5. Under the assumptions of Theorem 3, assume further that $\beta(k) \le \beta_0 \exp(-\beta_1 k^\theta)$. There exists a finite universal constant $C$ such that with probability at least $1 - 1/n$, for any $w^* \in \mathcal{W}$,
$$f(\widehat{w}_n) - f(w^*) \le \frac{1}{n} R_n + C \cdot \bigg[ \frac{(1.5 \log n)^{1/\theta} G}{n \beta_1^{1/\theta}} \sum_{t=1}^n \kappa(t) + GR \sqrt{\frac{(1.5 \log n)^{1/\theta}}{n \beta_1^{1/\theta}} \log\big( n (\log n)^{1/\theta} \big)} + \frac{\beta_0 GR}{\sqrt{n}} \bigg].$$
The corollary follows from Theorem 3 by setting $\tau = (1.5 \log n / \beta_1)^{1/\theta}$ and a few algebraic manipulations. Corollary 5 shows that under geometric $\beta$-mixing, we have essentially identical high-probability generalization guarantees to those for $\phi$-mixing (cf. Corollary 3), unless the desired error probability or the mixing constant $\theta$ is extremely small. We can make similar arguments for polynomially $\beta$-mixing stochastic processes, though the associated weakening of the bound is somewhat more pronounced.
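To make the role of the block length concrete: the corollaries above choose $\tau$ to balance the stability, concentration, and mixing terms of Theorem 2. The small Python sketch below evaluates the Theorem 2 bound (minus the regret term) over $\tau$ and picks the minimizer. All constants ($G$, $R$, $\delta$) and the geometric mixing coefficient $\phi(k) = e^{-k/2}$ are purely illustrative choices, not quantities from the text; the default stability sum uses the $\kappa(t) \propto 1/\sqrt{t}$ behavior discussed above.

```python
import math

def theorem2_bound(tau, n, G=1.0, R=1.0, delta=0.05,
                   phi=lambda k: math.exp(-0.5 * k),   # hypothetical geometric phi-mixing
                   kappa_sum=None):
    """Evaluate the Theorem 2 bound (minus the regret term) for a given tau.

    kappa_sum defaults to sum_t 1/sqrt(t), the typical stability sum of
    gradient-descent-style algorithms.  All constants are illustrative.
    """
    if kappa_sum is None:
        kappa_sum = sum(1.0 / math.sqrt(t) for t in range(1, n + 1))
    return ((tau - 1) * G / n * kappa_sum                       # stability term
            + 2 * G * R * math.sqrt(2 * tau / n * math.log(tau / delta))  # concentration
            + phi(tau) * G * R                                  # mixing term
            + 2 * (tau - 1) * G * R / n)

n = 10_000
best_tau = min(range(1, 50), key=lambda tau: theorem2_bound(tau, n))
# For geometric mixing the minimizer grows only logarithmically with n,
# in line with the choice tau = (log n / (2 phi_1))**(1/theta) of Corollary 3.
```

With these illustrative constants the minimizing $\tau$ is a small number that grows slowly with $n$, mirroring the logarithmic choices made in Corollaries 3 and 5.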
4 Generalization error bounds for strongly convex functions

It is by now well known that the regret of online learning algorithms scales as $O(\log n)$ for strongly convex functions, results due to the work of Hazan et al. [12]. To remind the reader, we recall Assumption B, which states that a function $f$ is $\lambda$-strongly convex with respect to the norm $\|\cdot\|$ if for all $g \in \partial f(w)$,
$$f(v) \ge f(w) + \langle g, v - w \rangle + \frac{\lambda}{2} \|w - v\|^2 \quad \text{for } w, v \in \mathcal{W}.$$
For many online algorithms, including online gradient and mirror descent [3, 12, 24, 9] and dual averaging [28, Lemma 11], the iterates satisfy the stability bound $\|w(t) - w(t+1)\| \le G/(\lambda t)$ when the loss functions $F(\cdot; x)$ are $\lambda$-strongly convex. Under these conditions, Corollary 2 gives an expected generalization error bound of $O(\inf_{\tau \in \mathbb{N}} \{\beta(\tau) + \tau \log n / n\})$, as compared to $O(\inf_{\tau \in \mathbb{N}} \{\beta(\tau) + \sqrt{\tau/n}\})$ for non-strongly convex problems. The improvement in rates, however, does not apply to Theorem 2's high-probability results, since the term controlling the fluctuations around the expectation of the martingale we construct scales as $\widetilde{O}(\sqrt{\tau/n})$. That said, when the samples $x_t$ are drawn i.i.d. from the distribution $\Pi$, Kakade and Tewari [17] show a generalization error bound of $O(\log n / n)$ with high probability by using self-bounding properties of an appropriately constructed martingale. In the next theorem, we combine the techniques used to prove our previous results with a self-bounding martingale argument to derive sharper generalization guarantees when the expected function $f$ is strongly convex. Throughout this section, we focus on the error relative to the minimum of the expected function: $w^* \in \arg\min_{w \in \mathcal{W}} f(w)$.

Theorem 4. Let Assumptions A, B, and C hold, so the expected function $f$ is $\lambda$-strongly convex with respect to the norm $\|\cdot\|$ over $\mathcal{W}$.
Then for any $\delta < 1/e$ and $n \ge 3$, with probability at least $1 - 4\delta \log n$, for any $\tau \in \mathbb{N}$ the predictor $\widehat{w}_n$ satisfies
$$f(\widehat{w}_n) - f(w^*) \le \frac{2}{n} R_n + \frac{2(\tau - 1) G}{n} \bigg( \sum_{t=1}^n \kappa(t) + 2R \bigg) + \frac{32 G^2 \tau}{\lambda n} \log\frac{\tau}{\delta} + \frac{12 \tau RG}{n} \log\frac{\tau}{\delta} + 2RG\phi(\tau).$$

Before we prove the theorem, we illustrate its use with a simple corollary. We again use Xiao's extension of Nesterov's dual averaging algorithm [23, 28], where for $G$-Lipschitz, $\lambda$-strongly convex losses $F$ it is shown that $\|w(t) - w(t+1)\| \le \kappa(t) \le \frac{G}{\lambda t}$. Consequently, Theorem 4 yields the following corollary, applicable to dual averaging, mirror descent, and online gradient descent:

Corollary 6. In addition to the conditions of Theorem 4, assume the stability bound $\kappa(t) \le G/(\lambda t)$. There is a universal constant $C$ such that with probability at least $1 - \delta \log n$,
$$f(\widehat{w}_n) - f(w^*) \le \frac{2}{n} R_n + C \cdot \inf_{\tau \in \mathbb{N}} \bigg\{ \frac{(\tau - 1) G^2}{\lambda n} \log n + \frac{\tau G^2}{\lambda n} \log\frac{\tau}{\delta} + \frac{G^2}{\lambda} \phi(\tau) \bigg\}.$$

Proof. The proof follows by noting two facts: first, $\sum_{t=1}^n \kappa(t) \le (G/\lambda)(1 + \log n)$; and second, the definition (5) of strong convexity implies
$$G\|w - v\| \ge f(v) - f(w) \ge \langle \nabla f(w), v - w \rangle + \frac{\lambda}{2} \|v - w\|^2.$$
Recalling [15] that $\|\nabla f(w)\|_* \le G$, we have $\|w - v\| \le 4G/\lambda$ for all $w, v \in \mathcal{W}$, so $R \le 2G/\lambda$.

We can further extend Corollary 6 using mixing-rate assumptions on $\phi$ as in Corollaries 3 and 4, though this follows the same lines as those results. For a few more concrete examples, we note that online gradient and mirror descent, as well as dual averaging [12, 9, 24, 28], all have $R_n \le C \cdot (G^2/\lambda) \log n$ when the loss functions $F(\cdot; x)$ are strongly convex (this is stronger than assuming that the expected function $f$ is strongly convex, but it allows sharp logarithmic bounds on the random quantity $R_n$).
In this special case, Corollary 6 implies the generalization bound
$$f(\widehat{w}_n) - f(w^*) = O\bigg( \frac{G^2}{\lambda} \inf_{\tau \in \mathbb{N}} \Big\{ \frac{\tau \log n}{n} + \phi(\tau) \Big\} \bigg)$$
with high probability. For example, online algorithms for SVMs (e.g., [25]) and other regularized problems satisfy a sharp high-probability generalization guarantee, even for non-i.i.d. data.

We now turn to proving Theorem 4, beginning with a martingale concentration inequality.

Lemma 7 (Freedman [11], Kakade and Tewari [17]). Let $X_1, \ldots, X_n$ be a martingale difference sequence adapted to the filtration $\mathcal{F}_t$ with $|X_t| \le b$. Define $V = \sum_{t=1}^n \mathbb{E}[X_t^2 \mid \mathcal{F}_{t-1}]$. For any $\delta < 1/e$ and $n \ge 3$,
$$\mathbb{P}\bigg[ \sum_{t=1}^n X_t \ge \max\Big\{ 2\sqrt{V}, \, 3b\sqrt{\log(1/\delta)} \Big\} \sqrt{\log(1/\delta)} \bigg] \le 4\delta \log n.$$

Proof of Theorem 4. For the proof of this theorem, we do not start from Proposition 2, as we did for the previous theorems, but begin directly with an appropriate martingale. Recalling the definition (12) of the random variables $X_t^i$ and the $\sigma$-fields $\mathcal{F}_t^i = \sigma(x_1, \ldots, x_{t\tau+i-1})$ from the proof of Theorem 2, our goal is to give sharper concentration results for the martingale difference sequence $X_t^i - \mathbb{E}[X_t^i \mid \mathcal{F}_{t-1}^i]$. To apply Lemma 7, we must bound the variance of the difference sequence. To that end, note that the conditional variance is bounded as
$$\mathbb{E}\big[ (X_t^i - \mathbb{E}[X_t^i \mid \mathcal{F}_{t-1}^i])^2 \mid \mathcal{F}_{t-1}^i \big] \le \mathbb{E}\big[ (X_t^i)^2 \mid \mathcal{F}_{t-1}^i \big]$$
$$= \mathbb{E}\Big[ \big( f(w((t-1)\tau + i)) - f(w^*) - F(w((t-1)\tau + i); x_{t\tau+i-1}) + F(w^*; x_{t\tau+i-1}) \big)^2 \,\Big|\, \mathcal{F}_{t-1}^i \Big] \le 4G^2 \|w((t-1)\tau + i) - w^*\|^2,$$
where in the last line we used the Lipschitz assumption A and the fact that $w((t-1)\tau + i)$ is measurable with respect to $\mathcal{F}_{t-1}^i$. Of course, since $w^*$ minimizes $f$, the $\lambda$-strong convexity of $f$ implies (see, e.g., [15]) that for any $w \in \mathcal{W}$,
$$f(w) - f(w^*) \ge \frac{\lambda}{2} \|w - w^*\|^2.$$
Consequently, we see that
$$\mathbb{E}\big[ (X_t^i - \mathbb{E}[X_t^i \mid \mathcal{F}_{t-1}^i])^2 \mid \mathcal{F}_{t-1}^i \big] \le \frac{8G^2}{\lambda} \big[ f(w((t-1)\tau + i)) - f(w^*) \big]. \quad (16)$$
What remains is to use the single-term conditional variance bound (16) to achieve deviation control over the entire sequence $X_t^i$. To that end, recall the index sets $\mathcal{I}(i)$ defined in the proof of Theorem 2, and define the summed variance terms $V_i := \sum_{t \in \mathcal{I}(i)} \mathbb{E}[(X_t^i - \mathbb{E}[X_t^i \mid \mathcal{F}_{t-1}^i])^2 \mid \mathcal{F}_{t-1}^i]$. Then the bound (16) gives
$$V_i \le \frac{8G^2}{\lambda} \sum_{t \in \mathcal{I}(i)} \big[ f(w(\tau(t-1) + i)) - f(w^*) \big].$$
Using the preceding variance bound, we can apply Freedman's concentration result (Lemma 7) to see that with probability at least $1 - (4\delta \log n)/\tau$,
$$\sum_{t \in \mathcal{I}(i)} \big( X_t^i - \mathbb{E}[X_t^i \mid \mathcal{F}_{t-1}^i] \big) \le \max\Big\{ 2\sqrt{V_i}, \, 6GR\sqrt{\log(\tau/\delta)} \Big\} \sqrt{\log(\tau/\delta)}. \quad (17)$$
We can use the inequality (17) to show concentration. Define the summations
$$S_i := \sum_{t \in \mathcal{I}(i)} f(w(\tau(t-1) + i)) - f(w^*) \quad \text{and} \quad \widehat{S}_i := \sum_{t \in \mathcal{I}(i)} F(w(\tau(t-1) + i); x_{\tau t + i - 1}) - F(w^*; x_{\tau t + i - 1}).$$
Then the definition (12) of the random variables $X_t^i$ coupled with the inequality (17) implies that
$$S_i \le \widehat{S}_i + \max\bigg\{ \sqrt{\frac{32G^2}{\lambda}} \sqrt{S_i}, \; 6GR\sqrt{\log\frac{\tau}{\delta}} \bigg\} \sqrt{\log\frac{\tau}{\delta}} + \sum_{t \in \mathcal{I}(i)} \mathbb{E}[X_t^i \mid \mathcal{F}_{t-1}^i] \le \widehat{S}_i + \sqrt{\frac{32G^2 \log\frac{\tau}{\delta}}{\lambda}} \sqrt{S_i} + 6GR \log\frac{\tau}{\delta} + |\mathcal{I}(i)|\, \phi(\tau) RG,$$
where we have applied Lemma 1. Solving the induced quadratic in $\sqrt{S_i}$, we see
$$\sqrt{S_i} \le \sqrt{\frac{8G^2 \log\frac{\tau}{\delta}}{\lambda}} + \sqrt{\frac{8G^2}{\lambda} \log\frac{\tau}{\delta} + \widehat{S}_i + |\mathcal{I}(i)|\, \phi(\tau) RG + 6GR \log\frac{\tau}{\delta}}.$$
Squaring both sides and using that $(a + b)^2 \le 2a^2 + 2b^2$, we find that
$$S_i \le \frac{32G^2}{\lambda} \log\frac{\tau}{\delta} + 2\widehat{S}_i + 12GR \log\frac{\tau}{\delta} + 2|\mathcal{I}(i)|\, \phi(\tau) RG \quad (18)$$
with probability at least $1 - 4\delta \log n / \tau$. We have now nearly completed the proof of the theorem.
Our first step for the remainder is to note that
$$\sum_{i=1}^\tau S_i = \sum_{t=1}^n f(w(t)) - f(w^*).$$
Applying a union bound, we use the inequality (18) to see that with probability at least $1 - 4\delta \log n$,
$$\sum_{t=1}^n f(w(t)) - f(w^*) \le 2 \sum_{i=1}^\tau \widehat{S}_i + \frac{32G^2 \tau}{\lambda} \log\frac{\tau}{\delta} + 12\tau GR \log\frac{\tau}{\delta} + 2n\phi(\tau) RG.$$
All that remains is to use stability to relate the sum $\sum_{i=1}^\tau \widehat{S}_i$ to the regret $R_n$, which is similar to what we did in the proof of Proposition 2. Indeed, by the definition of the sums $\widehat{S}_i$ we have
$$\sum_{i=1}^\tau \widehat{S}_i = \sum_{t=1}^n F(w(t); x_{t+\tau-1}) - F(w^*; x_{t+\tau-1})$$
$$= \sum_{t=1}^n F(w(t); x_t) - F(w^*; x_t) + \sum_{t=1}^{n-\tau} F(w(t); x_{t+\tau-1}) - F(w(t+\tau-1); x_{t+\tau-1}) + \sum_{t=1}^{\tau-1} F(w^*; x_t) - \sum_{t=n+1}^{n+\tau-1} F(w^*; x_t) + \sum_{t=n-\tau+1}^{n} F(w(t); x_{t+\tau-1}) - \sum_{t=1}^{\tau-1} F(w(t); x_t)$$
$$\le R_n + 2(\tau-1) GR + (\tau-1) G \sum_{t=1}^n \kappa(t), \quad (19)$$
where the inequality follows from the definition (6) of the regret, the boundedness assumption A, and the stability assumption C. Applying the final bound, we see that
$$\sum_{t=1}^n f(w(t)) - f(w^*) \le 2R_n + 2(\tau-1) G \sum_{t=1}^n \kappa(t) + \frac{32G^2 \tau}{\lambda} \log\frac{\tau}{\delta} + 12\tau GR \log\frac{\tau}{\delta} + 2n\phi(\tau) RG + 4(\tau-1) RG$$
with probability at least $1 - 4\delta \log n$. Dividing by $n$ and applying Jensen's inequality completes the proof.

We now turn to the case of $\beta$-mixing. As before, the proof largely follows that of the $\phi$-mixing case, with a suitable application of Markov's inequality being the only difference.

Theorem 5. In addition to Assumptions A and C, assume further that the expected function $f$ is $\lambda$-strongly convex with respect to the norm $\|\cdot\|$ over $\mathcal{W}$. Then for any $\delta < 1/e$ and $n \ge 3$, with probability greater than $1 - 5\delta \log n$, for any $\tau \in \mathbb{N}$ the predictor $\widehat{w}_n$ satisfies
$$f(\widehat{w}_n) - f(w^*) \le \frac{2}{n} R_n + \frac{2(\tau-1) G}{n} \bigg( \sum_{t=1}^n \kappa(t) + 2R \bigg) + \frac{32G^2 \tau}{\lambda n} \log\frac{\tau}{\delta} + \frac{12\tau RG}{n} \log\frac{2\tau}{\delta} + \frac{2RG\beta(\tau)}{\delta}.$$

Proof. We closely follow the proof of Theorem 4. Through the bound (17), no step in that proof uses $\phi$-mixing; $\phi$-mixing enters only in bounding terms of the form $\mathbb{E}[X_t^i \mid \mathcal{F}_{t-1}^i]$. Rather than bounding them immediately (as was done following Eq. (17) in the proof of Theorem 4), we carry them further through the steps of the proof. Using the notation of Theorem 4's proof, in place of the inequality (18) we have
$$S_i \le \frac{32G^2}{\lambda} \log\frac{\tau}{\delta} + 2\widehat{S}_i + 12GR \log\frac{\tau}{\delta} + \sum_{t \in \mathcal{I}(i)} \mathbb{E}[X_t^i \mid \mathcal{F}_{t-1}^i]$$
with probability at least $1 - 4\delta \log n / \tau$. Paralleling the proof of Theorem 4, we find that with probability at least $1 - 4\delta \log n$,
$$\sum_{t=1}^n f(w(t)) - f(w^*) \le 2R_n + 2(\tau-1) G \sum_{t=1}^n \kappa(t) + \frac{32G^2 \tau}{\lambda} \log\frac{\tau}{\delta} + 12\tau GR \log\frac{\tau}{\delta} + 4(\tau-1) RG + \sum_{i=1}^\tau \sum_{t \in \mathcal{I}(i)} \mathbb{E}[X_t^i \mid \mathcal{F}_{t-1}^i]. \quad (20)$$
As in the proof of Theorem 3, we apply Markov's inequality to the final term, which gives, with probability at least $1 - \delta$,
$$\sum_{i=1}^\tau \sum_{t \in \mathcal{I}(i)} \mathbb{E}[X_t^i \mid \mathcal{F}_{t-1}^i] \le \frac{2nGR\beta(\tau)}{\delta}.$$
Substituting this bound into the inequality (20) and applying a union bound (noting that $\delta < \delta \log n$) completes the proof.

As was the case for Theorem 3, when the process $x_1, x_2, \ldots$ is geometrically $\beta$-mixing, we can obtain a corollary of the above result showing no essential loss of rates relative to geometrically $\phi$-mixing processes. We omit the details, as the technique is essentially identical to that for Corollary 5.

5 Linear Prediction

For this section, we place ourselves in the common statistical prediction setting in which the samples come in pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$, where $y$ is the label or target value of the sample $x$, and the samples are finite-dimensional: $\mathcal{X} \subset \mathbb{R}^d$.
We measure the goodness of the hypothesis $w$ on the example $(x, y)$ by
$$F(w; (x, y)) = \ell(y, \langle x, w \rangle), \qquad \ell : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}, \quad (21)$$
where the loss function $\ell$ measures the accuracy of the prediction $\langle x, w \rangle$. An extraordinary number of statistical learning problems fall into the above framework: linear regression, where the loss has the form $\ell(y, \langle x, w \rangle) = \frac{1}{2}(y - \langle x, w \rangle)^2$; logistic regression, where $\ell(y, \langle x, w \rangle) = \log(1 + \exp(-y\langle x, w \rangle))$; boosting; and SVMs all have the form (21). The form (21) makes it clear that individual samples cannot yield strongly convex losses, since the linear operator $\langle x, \cdot \rangle$ has a nontrivial null space. However, in many problems the expected loss function $f(w) := \mathbb{E}_\Pi[F(w; (x, y))]$ is strongly convex even though the individual loss functions $F(w; (x, y))$ are not. To quantify this, we now assume that $\|x\|_2 \le r$ for $\mu$-a.e. $x \in \mathcal{X}$, and make the following assumption on the loss:

Assumption D (Linear strong convexity). For fixed $y$, the loss function $\ell(y, \cdot)$ is a $\lambda$-strongly convex and $L$-Lipschitz scalar function over $[-Rr, Rr]$:
$$\ell(y, b) \ge \ell(y, a) + \ell'(y, a)(b - a) + \frac{\lambda}{2}(b - a)^2 \quad \text{and} \quad |\ell(y, b) - \ell(y, a)| \le L|a - b|$$
for any $a, b \in \mathbb{R}$ with $\max\{|a|, |b|\} \le Rr$.

Our choice of $Rr$ above is intentional, since $\langle x, w \rangle \le Rr$ by Hölder's inequality and our compactness assumption (4). A few examples of such loss functions include logistic regression and least-squares regression, the latter of which satisfies Assumption D with $\lambda = 1$.
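To make Assumption D concrete, the constants can be checked numerically for specific losses. The Python sketch below is illustrative only: the interval bound `B` (standing in for $Rr$) and the grid resolution are arbitrary choices, not quantities from the text. It lower-bounds the second derivative of the scalar loss over $[-B, B]$, recovering $\lambda = 1$ for least squares and the much smaller endpoint curvature of the logistic loss.

```python
import math

def scalar_strong_convexity(ell_dd, B, grid=10_000):
    """Numerically lower-bound the second derivative of a scalar loss
    a -> ell(y, a) over [-B, B], i.e. the lambda of Assumption D."""
    return min(ell_dd(-B + 2 * B * k / grid) for k in range(grid + 1))

B = 2.0  # hypothetical bound R*r on |<x, w>|

# Least squares: ell(y, a) = (1/2)(y - a)^2, so ell'' = 1 everywhere.
lam_ls = scalar_strong_convexity(lambda a: 1.0, B)

# Logistic loss with y = 1: ell(a) = log(1 + exp(-a)), whose second
# derivative is sigma(a) * (1 - sigma(a)) with sigma the logistic function.
def logistic_dd(a):
    s = 1.0 / (1.0 + math.exp(-a))
    return s * (1.0 - s)

lam_logit = scalar_strong_convexity(logistic_dd, B)
# lam_ls is exactly 1; lam_logit = sigma(B)(1 - sigma(B)), which shrinks as
# B grows: logistic loss is strongly convex on compacta, with small lambda.
```

The shrinking curvature of the logistic loss as $B = Rr$ grows is why the effective $\lambda$ in the bounds of this section can be small for logistic regression, even though it is fixed at 1 for least squares.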
To see that an expected loss function satisfying Assumption D is strongly convex, note that
$$f(v) = \mathbb{E}_\Pi[\ell(y, \langle x, v \rangle)] \ge \mathbb{E}_\Pi\Big[ \ell(y, \langle x, w \rangle) + \ell'(y, \langle x, w \rangle)\big(\langle x, v \rangle - \langle x, w \rangle\big) + \frac{\lambda}{2}\big(\langle x, v \rangle - \langle x, w \rangle\big)^2 \Big]$$
$$= \mathbb{E}_\Pi\big[ F(w; (x, y)) + \langle \nabla F(w; (x, y)), v - w \rangle \big] + \frac{\lambda}{2} \mathbb{E}_\Pi\big[ \langle x, v \rangle^2 + \langle x, w \rangle^2 - 2\langle x, w \rangle \langle x, v \rangle \big]$$
$$= f(w) + \langle \nabla f(w), v - w \rangle + \frac{\lambda}{2} \big\langle \mathrm{Cov}(x)(w - v), w - v \big\rangle, \quad (22)$$
where $\mathrm{Cov}(x)$ is the covariance matrix of $x$ under the stationary distribution $\Pi$. (For notational convenience we use $\nabla F$ to denote either the gradient or a measurable selection from the subgradient set $\partial F$; this is no loss of generality.) So as long as $\lambda_{\min}(\mathrm{Cov}(x)) > 0$, we see that the expected function $f$ is $\lambda \cdot \lambda_{\min}(\mathrm{Cov}(x))$-strongly convex.

If we had access to a stable online learning algorithm with small (i.e., logarithmic) regret for losses of the form (21) satisfying Assumption D, we could simply apply Theorem 4 and guarantee good generalization properties of the predictor $\widehat{w}_n$ the algorithm outputs. The theorem assumes only strong convexity of the expected function $f$, which, per the discussion above, holds for linear prediction, so the sharp generalization guarantee would follow from the inequality (22). However, we found it difficult to show that existing algorithms satisfy our desiderata of logarithmic regret and stability, both of which are crucial requirements for our results. Below, we present a slight modification of Hazan et al.'s follow-the-approximate-leader (FTAL) algorithm [12] to achieve the desired results. Our approach is essentially to combine FTAL with the Vovk-Azoury-Warmuth forecaster [8, Chapter 11.8], in which the algorithm uses the sample $x_t$ to make its prediction. Specifically, our algorithm is as follows. At iteration $t$, the algorithm receives $x_t$,
plays the point $w(t)$, suffers the loss $F(w(t); (x_t, y_t))$, then adds $\nabla F(w(t); (x_t, y_t))$ to its collection of observed (sub)gradients. The algorithm's calculation of $w(t)$ at iteration $t$ is
$$w(t) = \operatorname*{argmin}_{w \in \mathcal{W}} \bigg\{ \sum_{i=1}^{t-1} \big\langle \nabla F(w(i); (x_i, y_i)), w \big\rangle + \frac{\lambda}{2} \sum_{i=1}^{t-1} \langle w(i) - w, x_i \rangle^2 + \frac{\lambda}{2} w^\top (x_t x_t^\top + \epsilon I) w \bigg\}. \quad (23)$$
The algorithm above is quite similar to Hazan et al.'s FTAL algorithm [12], and the following proposition shows that the update (23) does in fact have logarithmic regret (we give a proof of the proposition, which is somewhat technical, in Appendix A).

Proposition 3. Let the sequence $w(t)$ be defined by the update (23) under Assumption D. Then for any $\epsilon > 0$ and any sequence of samples $(x_t, y_t)$,
$$\sum_{t=1}^n F(w(t); (x_t, y_t)) - F(w^*; (x_t, y_t)) \le \frac{9 L^2 d}{2\lambda} \log\Big( \frac{r^2 n}{\epsilon} + 1 \Big) + \frac{\lambda \epsilon}{2} \|w^*\|_2^2.$$

What remains is to show that a suitable form of stability holds for the algorithm (23). The additional stability provided by using $x_t$ in the update of $w(t)$ appears to be important. In the original version [12] of the FTAL algorithm, the predictor $w(t)$ can change quite drastically if a sample $x_t$ sufficiently different from the past (in the sense that $\langle x_{t'}, x_t \rangle \approx 0$ for $t' < t$) is encountered. In the presence of dependence between samples, such large updates can be detrimental to performance, since they keep the algorithm from exploiting the mixing of the stochastic process. Returning to our argument on stability, we recall the proof of Theorem 4, specifically the argument leading to the bound (19). We see that the stability bound does not require the full power of Assumption C; it is in fact sufficient that
$$F(w(t); (x_{t+\tau}, y_{t+\tau})) - F(w(t+\tau); (x_{t+\tau}, y_{t+\tau})) \le \tau \kappa(t),$$
that is, that the differences in loss values are stable.
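When the constraint set is all of $\mathbb{R}^d$, the minimization (23) can be solved in closed form by setting the gradient of its objective to zero (the equivalent dual-averaging form is derived in Appendix A). The one-dimensional Python sketch below implements that unconstrained recursion with squared loss, $\lambda = \epsilon = 1$, and hand-picked data; everything about it is illustrative, and it is a sketch of the update rule only, not the constrained algorithm analyzed in the theorems.

```python
def ftal_vaw_1d(data, lam=1.0, eps=1.0):
    """One-dimensional, unconstrained sketch of the update (23).

    With W = R, the argmin has the closed form w(t) = -z(t-1) / (lam * A_{t,eps}),
    where z(t-1) accumulates the gradient-like vectors and A_{t,eps} is the
    (regularized) sum of squared inputs, including the current x_t.
    """
    z, A = 0.0, 0.0                      # running sum of g(i); sum of x_i^2
    iterates, preds = [], []
    for (x, y) in data:
        A_eps = A + x * x + eps          # A_{t,eps} includes the current x_t
        w = -z / (lam * A_eps)           # closed-form unconstrained update
        iterates.append(w)
        preds.append(x * w)
        grad = (x * w - y) * x           # d/dw of (1/2)(y - <x, w>)^2
        g = grad - lam * x * x * w       # gradient-like vector used by the update
        z += g
        A += x * x
    return iterates, preds

# Deterministic example: y = 2x exactly; the iterates approach 2.
data = [(1.0, 2.0)] * 20
iterates, preds = ftal_vaw_1d(data)
```

On this data the recursion gives $w(t) = 2(t-1)/(t+1)$, which converges to the true coefficient 2, while the $\epsilon I$ regularization keeps the early updates small.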
To quantify the stability of the algorithm (23), we require two definitions that will be useful here and in our subsequent proofs. Define the outer product matrices
$$A_t := \sum_{i=1}^t x_i x_i^\top \quad \text{and} \quad A_{t,\epsilon} := A_t + \epsilon I. \quad (24)$$
Given a positive definite matrix $A$, the associated Mahalanobis norm and its dual are defined as
$$\|w\|_A^2 := \langle Aw, w \rangle \quad \text{and} \quad \|w\|_{A^{-1}}^2 := \langle A^{-1} w, w \rangle.$$
Then the following proposition (whose proof we provide in Appendix A) shows that stability holds for the linear-prediction algorithm (23).

Proposition 4. Let $w(t)$ be generated according to the update (23) and let Assumption D hold. Then for any $\tau \in \mathbb{N}$,
$$F(w(t); (x_{t+\tau}, y_{t+\tau})) - F(w(t+\tau); (x_{t+\tau}, y_{t+\tau})) \le \frac{L^2}{2\lambda} \bigg( 6\tau \|x_{t+\tau}\|_{A_{t+\tau,\epsilon}^{-1}}^2 + 5 \sum_{s=1}^{\tau-1} \|x_{t+s}\|_{A_{t+s,\epsilon}^{-1}}^2 + 3 \|x_t\|_{A_{t,\epsilon}^{-1}}^2 \bigg).$$

We use one more observation to derive a generalization bound for the approximate follow-the-leader update (23). For any loss $\ell$ satisfying Assumption D, standard convex analysis gives $|\ell'(y, a)| \le L$, so by straightforward algebra (taking $a = -Rr$ and $b = Rr$),
$$2L|a - b| \ge \frac{\lambda}{2}(b - a)^2, \quad \text{implying} \quad \lambda \le \frac{2L}{Rr}. \quad (25)$$
Now, using Proposition 4 and the regret bound from Proposition 3, we give a fast high-probability convergence guarantee for online algorithms applied to linear prediction problems, such as linear or logistic regression, satisfying Assumption D. Specifically:

Theorem 6. Let $w(t)$ be generated according to the update (23) with $\epsilon = 1$. Then with probability at least $1 - 4\delta \log n$, for any $\tau \in \mathbb{N}$,
$$f(\widehat{w}_n) - f(w^*) \le \frac{L^2 d}{\lambda n}(9 + 14\tau) \log\big( r^2 n + 1 \big) + \frac{\lambda}{n} \|w^*\|_2^2 + \frac{32 L^2 r^2 \tau}{\lambda n \cdot \lambda_{\min}(\mathrm{Cov}(x))} \log\frac{\tau}{\delta} + \frac{8\tau L^2}{\lambda n} \Big( 3 \log\frac{\tau}{\delta} + 1 \Big) + \frac{4L^2}{\lambda} \phi(\tau).$$

Proof. Given the regret bound in Proposition 3, all that remains is to control the stability of the algorithm.
To that end, note that
$$\sum_{t=1}^{n-\tau} F(w(t); (x_{t+\tau}, y_{t+\tau})) - F(w(t+\tau); (x_{t+\tau}, y_{t+\tau})) \le \frac{7 L^2 \tau}{\lambda} \sum_{t=1}^n \|x_t\|_{A_{t,\epsilon}^{-1}}^2 \le \frac{7 L^2 \tau d}{\lambda} \log\Big( \frac{r^2 n}{\epsilon} + 1 \Big), \quad (26)$$
the last inequality following from an application of Hazan et al.'s Lemma 11 [12]. Further, using Assumption D, we know that the Lipschitz constant of $F$ is $G \le Lr$. We mimic the proof of Theorem 4 for the remainder of the argument. This requires a minor redefinition of our martingale sequence, since $w(t)$ depends on $x_t$ in the update (23), whereas our previous proofs required $w(t)$ to be measurable with respect to $\mathcal{F}_{t-1}$. As a result, we now define
$$X_t^i := f(w((t-1)\tau + i)) - f(w^*) + F(w^*; x_{t\tau+i}) - F(w((t-1)\tau + i); x_{t\tau+i}),$$
and the associated $\sigma$-fields $\mathcal{F}_t^i := \mathcal{F}_{t\tau+i} = \sigma(x_1, \ldots, x_{t\tau+i})$. The sequence $X_t^i - \mathbb{E}[X_t^i \mid \mathcal{F}_{t-1}^i]$ defines a martingale difference sequence adapted to the filtration $\mathcal{F}_t^i$, $t = 1, 2, \ldots$. The remainder of the proof parallels that of Theorem 4, with the modification that terms involving $(\tau - 1)G$ are replaced by terms involving $\tau G$. Specifically, we use the inequality (19), the regret bound from Proposition 3, and the stability guarantee (26) to see
$$f(\widehat{w}_n) - f(w^*) \le \frac{L^2 d}{\lambda n}(9 + 14\tau) \log\Big( \frac{r^2 n}{\epsilon} + 1 \Big) + \frac{\lambda \epsilon}{n} \|w^*\|_2^2 + \frac{32 L^2 r^2 \tau}{\lambda n \cdot \lambda_{\min}(\mathrm{Cov}(x))} \log\frac{\tau}{\delta} + \frac{3\tau LRr}{n} \Big( 4 \log\frac{\tau}{\delta} + 1 \Big) + 2LRr\,\phi(\tau).$$
Noting that $Rr \le 2L/\lambda$ by the bound (25) completes the proof.

To simplify the conclusions of Theorem 6, we may ignore constants and the size of the sample space $\mathcal{X}$. Doing so, we see that with probability at least $1 - \delta$,
$$f(\widehat{w}_n) - f(w^*) \le O(1) \cdot \inf_{\tau \in \mathbb{N}} \bigg\{ \frac{L^2 d\tau}{\lambda n} \log n + \frac{L^2 \tau}{\lambda n \cdot \lambda_{\min}(\mathrm{Cov}(x))} \log\frac{\tau \log n}{\delta} + \frac{L^2}{\lambda} \phi(\tau) \bigg\}.$$
In particular, we can specialize this result under different mixing assumptions on the process.
We give the bound only for geometrically mixing processes, that is, when $\phi(k) \le \phi_0 \exp(-\phi_1 k^\theta)$. Then we have, as in Corollary 3, the following:

Corollary 8. Let $w(t)$ be generated according to the follow-the-approximate-leader update (23) and assume that the process $P$ is geometrically $\phi$-mixing. Then with probability at least $1 - \delta$,
$$f(\widehat{w}_n) - f(w^*) \le O(1) \cdot \bigg[ \frac{L^2 d\, (\log n)^{1 + \frac{1}{\theta}}}{\phi_1^{1/\theta} \lambda n} + \frac{L^2 (\log n)^{\frac{1}{\theta}}}{\phi_1^{1/\theta} \lambda n \cdot \lambda_{\min}(\mathrm{Cov}(x))} \log\Big( \frac{\log n}{\delta} \Big) \bigg].$$
We conclude this section by noting, without proof, that since all the results here build on the theorems of Section 4, it is possible to derive analogous high-probability convergence guarantees when the stochastic process $P$ is $\beta$-mixing rather than $\phi$-mixing. In this case, we build on Theorem 5 rather than Theorem 4, but the techniques are largely identical.

6 Conclusions

In this paper, we have shown how to obtain high-probability, data-dependent bounds on the generalization error, or excess risk, of hypotheses output by online learning algorithms, even when the samples are dependent. In doing so, we have extended several known results on the generalization properties of online algorithms with independent data. By using martingale tools, we have given (we hope) direct, simple proofs of convergence guarantees for learning algorithms with dependent data without requiring the machinery of empirical process theory. In addition, the results in this paper may be of independent interest for stochastic optimization, since they show both the expected and high-probability convergence of any low-regret, stable online algorithm for stochastic approximation problems, even with dependent samples.

We believe this work raises a few natural open questions.
First, can online algorithms guarantee good generalization performance when the underlying stochastic process is only $\alpha$-mixing? Our techniques do not seem to extend readily to this more general setting, as $\alpha$-mixing is less natural for measuring convergence of conditional distributions, so we suspect that a different or more careful approach will be necessary. Our second question regards adaptivity: can an online algorithm be more intimately coupled with the data and automatically adapt to the dependence of the sequence of statistical samples $x_1, x_2, \ldots$? This might allow both stronger regret bounds and better rates of convergence than we have achieved.

Acknowledgments

We would like to thank Nicolò Cesa-Bianchi and several anonymous reviewers, whose careful readings of our work greatly improved it. In performing this work, AA was supported in part by an MSR PhD Fellowship and a Google PhD Fellowship, and JD was supported by the Department of Defense through a National Defense Science and Engineering Graduate Fellowship.

A Technical Proofs

Proof of Proposition 3. We first give an equivalent form of the algorithm (23) for which it is a bit simpler to prove results (though the form is less intuitive). Define the (sub)gradient-like vectors $g(t)$ for all $t$ as
$$g(t) := \nabla F(w(t); (x_t, y_t)) - \lambda x_t x_t^\top w(t). \quad (27)$$
Then a bit of algebra shows that the algorithm (23) is equivalent to
$$w(t) = \operatorname*{argmin}_{w \in \mathcal{W}} \bigg\{ \sum_{i=1}^{t-1} \langle g(i), w \rangle + \frac{\lambda}{2} \langle A_{t,\epsilon} w, w \rangle \bigg\}. \quad (28)$$
We now turn to the proof of the regret bound in the proposition. Our proof is similar to the proofs of related results of Nesterov [23] and Xiao [28].
We begin by noting that via Assumption D,
$$\sum_{t=1}^n F(w(t); (x_t, y_t)) - F(w^*; (x_t, y_t)) \le \sum_{t=1}^n \big\langle \nabla F(w(t); (x_t, y_t)), w(t) - w^* \big\rangle - \frac{\lambda}{2} \sum_{t=1}^n (w(t) - w^*)^\top x_t x_t^\top (w(t) - w^*)$$
$$= \sum_{t=1}^n \big\langle \nabla F(w(t); (x_t, y_t)) - \lambda x_t x_t^\top w(t), \, w(t) - w^* \big\rangle + \frac{\lambda}{2} \sum_{t=1}^n \big\langle x_t x_t^\top w(t), w(t) \big\rangle - \frac{\lambda}{2} \sum_{t=1}^n \big\langle x_t x_t^\top w^*, w^* \big\rangle$$
$$= \sum_{t=1}^n \langle g(t), w(t) - w^* \rangle + \frac{\lambda}{2} \sum_{t=1}^n \langle x_t, w(t) \rangle^2 - \frac{\lambda}{2} \langle A_n w^*, w^* \rangle. \quad (29)$$
Define the proximal function $\psi_t(w) = \frac{\lambda}{2} \langle A_{t,\epsilon} w, w \rangle$ and let $z(t) = \sum_{i=1}^t g(i)$. Then we can bound the regret (29) by taking a supremum and introducing the conjugate of $\psi$, defined by $\psi_n^*(z) = \sup_{w \in \mathcal{W}} \{\langle z, w \rangle - \psi_n(w)\}$. In particular, we see that for any $\epsilon \ge 0$,
$$\sum_{t=1}^n F(w(t); (x_t, y_t)) - F(w^*; (x_t, y_t)) \le \sum_{t=1}^n \langle g(t), w(t) \rangle + \frac{\lambda}{2} \sum_{t=1}^n \langle x_t, w(t) \rangle^2 + \sup_{w \in \mathcal{W}} \Big\{ -\langle z(n), w \rangle - \frac{\lambda}{2} \langle A_n w, w \rangle - \frac{\lambda \epsilon}{2} \|w\|_2^2 \Big\} + \frac{\lambda \epsilon}{2} \|w^*\|_2^2$$
$$= \sum_{t=1}^n \langle g(t), w(t) \rangle + \frac{\lambda}{2} \sum_{t=1}^n \langle x_t, w(t) \rangle^2 + \psi_n^*(-z(n)) + \frac{\lambda \epsilon}{2} \|w^*\|_2^2. \quad (30)$$
The function $\psi_n^*$ has $(1/\lambda)$-Lipschitz continuous gradient with respect to the Mahalanobis norm induced by $A_{n,\epsilon}$ (e.g., [15, 23]), and further it is known that $\nabla \psi_n^*(z) = \operatorname*{argmin}_{w \in \mathcal{W}} \{\langle -z, w \rangle + \psi_n(w)\}$, so that $\nabla \psi_n^*(-z(n-1)) = w(n)$ by definition of the update (23). Thus we see
$$\psi_n^*(-z(n)) \le \psi_n^*(-z(n-1)) + \big\langle \nabla \psi_n^*(-z(n-1)), z(n-1) - z(n) \big\rangle + \frac{1}{2\lambda} \|z(n) - z(n-1)\|_{A_{n,\epsilon}^{-1}}^2$$
$$= \psi_n^*(-z(n-1)) - \langle w(n), g(n) \rangle + \frac{1}{2\lambda} \|g(n)\|_{A_{n,\epsilon}^{-1}}^2 = -\langle z(n-1), w(n) \rangle - \frac{\lambda}{2} \langle A_{n,\epsilon} w(n), w(n) \rangle - \langle w(n), g(n) \rangle + \frac{1}{2\lambda} \|g(n)\|_{A_{n,\epsilon}^{-1}}^2,$$
since $w(n)$ minimizes $\langle z(n-1), w \rangle + \psi_n(w)$.
Plugging the last inequality into the bound (30) yields
$$\sum_{t=1}^n F(w(t); (x_t, y_t)) - F(w^*; (x_t, y_t)) \le \sum_{t=1}^n \langle g(t), w(t) \rangle + \frac{\lambda}{2} \sum_{t=1}^n \langle x_t, w(t) \rangle^2 - \langle z(n-1), w(n) \rangle - \frac{\lambda}{2} \langle A_{n,\epsilon} w(n), w(n) \rangle - \langle w(n), g(n) \rangle + \frac{\lambda \epsilon}{2} \|w^*\|_2^2 + \frac{1}{2\lambda} \|g(n)\|_{A_{n,\epsilon}^{-1}}^2$$
$$= \sum_{t=1}^{n-1} \langle g(t), w(t) \rangle + \frac{\lambda}{2} \sum_{t=1}^{n-1} \langle x_t, w(t) \rangle^2 - \langle z(n-1), w(n) \rangle - \frac{\lambda}{2} \langle A_{n-1,\epsilon} w(n), w(n) \rangle + \frac{\lambda \epsilon}{2} \|w^*\|_2^2 + \frac{1}{2\lambda} \|g(n)\|_{A_{n,\epsilon}^{-1}}^2$$
$$\le \sum_{t=1}^{n-1} \langle g(t), w(t) \rangle + \frac{\lambda}{2} \sum_{t=1}^{n-1} \langle x_t, w(t) \rangle^2 + \psi_{n-1}^*(-z(n-1)) + \frac{\lambda \epsilon}{2} \|w^*\|_2^2 + \frac{1}{2\lambda} \|g(n)\|_{A_{n,\epsilon}^{-1}}^2,$$
since $A_n = A_{n-1} + x_n x_n^\top$. Repeating the argument inductively down from $n - 1$, we find
$$\sum_{t=1}^n F(w(t); (x_t, y_t)) - F(w^*; (x_t, y_t)) \le \frac{1}{2\lambda} \sum_{t=1}^n \|g(t)\|_{A_{t,\epsilon}^{-1}}^2 + \frac{\lambda \epsilon}{2} \|w^*\|_2^2. \quad (31)$$
The bound (31) nearly completes the proof of the proposition, but we must control the gradient norm terms $\|g(t)\|_{A_{t,\epsilon}^{-1}}^2$. To that end, let $\alpha_t = \ell'(y_t, \langle x_t, w(t) \rangle) \in \mathbb{R}$ and note that
$$\|g(t)\|_{A_{t,\epsilon}^{-1}}^2 = \big\langle A_{t,\epsilon}^{-1} \big( \alpha_t x_t - \lambda x_t x_t^\top w(t) \big), \, \alpha_t x_t - \lambda x_t x_t^\top w(t) \big\rangle \le (L + \lambda Rr)^2 \|x_t\|_{A_{t,\epsilon}^{-1}}^2,$$
since $|\alpha_t| \le L$ by Assumption D. Now we apply a result of Hazan et al. [12, Lemma 11], giving
$$\sum_{t=1}^n \|g(t)\|_{A_{t,\epsilon}^{-1}}^2 \le (L + \lambda Rr)^2 d \log\Big( \frac{r^2 n}{\epsilon} + 1 \Big).$$
Using that $\lambda \le 2L/(Rr)$, we combine this with the bound (31) to obtain the result of the proposition.

Proof of Proposition 4. We begin by noting that any $g \in \partial F(w(t); (x_{t+\tau}, y_{t+\tau}))$ can be written as $\alpha x_{t+\tau}$ for some $\alpha \in [-L, L]$. Thus, using the first-order convexity inequality, we see there is such an $\alpha$ for which
$$F(w(t); x_{t+\tau}) - F(w(t+\tau); x_{t+\tau}) \le \alpha \langle x_{t+\tau}, w(t) - w(t+\tau) \rangle.$$
Now we apply Hölder's inequality and Lemma 9, which together yield
$$
\begin{aligned}
\langle x_{t+\tau}, w(t) - w(t+\tau)\rangle
&\le \|x_{t+\tau}\|_{A_{t+\tau,\epsilon}^{-1}} \|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}} \\
&\le \frac{3L}{\lambda}\sum_{s=0}^{\tau-1} \|x_{t+\tau}\|_{A_{t+\tau,\epsilon}^{-1}} \|x_{t+s}\|_{A_{t+s,\epsilon}^{-1}}
   + \frac{2L}{\lambda}\sum_{s=1}^{\tau} \|x_{t+\tau}\|_{A_{t+\tau,\epsilon}^{-1}} \|x_{t+s}\|_{A_{t+s,\epsilon}^{-1}} \\
&\le \frac{3L}{2\lambda}\left( \sum_{s=0}^{\tau-1} \|x_{t+s}\|_{A_{t+s,\epsilon}^{-1}}^2 + \tau \|x_{t+\tau}\|_{A_{t+\tau,\epsilon}^{-1}}^2 \right)
   + \frac{L}{\lambda}\left( \sum_{s=1}^{\tau} \|x_{t+s}\|_{A_{t+s,\epsilon}^{-1}}^2 + \tau \|x_{t+\tau}\|_{A_{t+\tau,\epsilon}^{-1}}^2 \right),
\end{aligned}
$$
where we have used the fact that $(a^2 + b^2)/2 \ge ab$ for any $a, b \in \mathbb{R}$. A reorganization of terms and the fact that $|\alpha| \le L$ complete the proof.

Lemma 9. Let $w(t)$ be generated according to the update (23). Then for any $\tau \in \mathbb{N}$,
$$
\|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}} \le \frac{3L}{\lambda}\sum_{s=0}^{\tau-1} \|x_{t+s}\|_{A_{t+s,\epsilon}^{-1}} + \frac{2L}{\lambda}\sum_{s=1}^{\tau} \|x_{t+s}\|_{A_{t+s,\epsilon}^{-1}}.
$$

Proof
Recall the definition (24) of the outer-product matrices $A_t$ and the construction (27) of the subgradient vectors $g(t)$ from the proof of Proposition 3. With the definition $z(t) = \sum_{i=1}^t g(i)$, also as in Proposition 3, the update (23) is equivalent to
$$
w(t) = \mathop{\rm argmin}_{w\in\mathcal{W}} \left\{ \langle z(t-1), w\rangle + \frac{\lambda}{2}\langle A_{t,\epsilon} w, w\rangle \right\}. \qquad (32)
$$
Now let us understand the stability of the solutions to the above updates. Fixing $\tau \in \mathbb{N}$, the first-order conditions for the optimality of $w(t)$ and $w(t+\tau)$ in the update (32) imply
$$
\langle z(t+\tau-1) + \lambda A_{t+\tau,\epsilon} w(t+\tau),\, w - w(t+\tau)\rangle \ge 0
\quad\text{and}\quad
\langle z(t-1) + \lambda A_{t,\epsilon} w(t),\, w' - w(t)\rangle \ge 0
$$
for all $w, w' \in \mathcal{W}$. Taking $w = w(t)$ and $w' = w(t+\tau)$, then adding the two inequalities, we see
$$
\langle z(t+\tau-1) - z(t-1) + \lambda A_{t+\tau,\epsilon} w(t+\tau) - \lambda A_{t,\epsilon} w(t),\, w(t) - w(t+\tau)\rangle \ge 0. \qquad (33)
$$
The remainder of the proof consists of manipulating the inequality (33) to achieve the desired result. To begin, we rearrange Eq.
(33) to state
$$
\begin{aligned}
\langle z(t+\tau-1) - z(t-1),\, w(t) - w(t+\tau)\rangle
&\ge \lambda \langle A_{t+\tau,\epsilon}(w(t) - w(t+\tau)),\, w(t) - w(t+\tau)\rangle + \lambda \langle (A_{t,\epsilon} - A_{t+\tau,\epsilon}) w(t),\, w(t) - w(t+\tau)\rangle \\
&= \lambda \|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}}^2 + \lambda \langle (A_{t,\epsilon} - A_{t+\tau,\epsilon}) w(t),\, w(t) - w(t+\tau)\rangle.
\end{aligned}
$$
Using Hölder's inequality applied to the dual norms $\|\cdot\|_A$ and $\|\cdot\|_{A^{-1}}$, we see that
$$
\lambda \|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}}^2
\le \|z(t+\tau-1) - z(t-1)\|_{A_{t+\tau,\epsilon}^{-1}} \|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}}
+ \lambda \|(A_{t+\tau,\epsilon} - A_{t,\epsilon}) w(t)\|_{A_{t+\tau,\epsilon}^{-1}} \|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}},
$$
and dividing by $\lambda \|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}}$ gives
$$
\|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}} \le \frac{1}{\lambda}\|z(t+\tau-1) - z(t-1)\|_{A_{t+\tau,\epsilon}^{-1}} + \|(A_{t+\tau,\epsilon} - A_{t,\epsilon}) w(t)\|_{A_{t+\tau,\epsilon}^{-1}}. \qquad (34)
$$
Now we note the fact that $A_{t+\tau,\epsilon} - A_{t,\epsilon} = \sum_{s=1}^{\tau} x_{t+s} x_{t+s}^\top$, so
$$
\|(A_{t+\tau,\epsilon} - A_{t,\epsilon}) w(t)\|_{A_{t+\tau,\epsilon}^{-1}}
\le \max_{s\in[\tau]} |\langle x_{t+s}, w(t)\rangle| \sum_{s=1}^{\tau} \|x_{t+s}\|_{A_{t+\tau,\epsilon}^{-1}}
\le Rr \sum_{s=1}^{\tau} \|x_{t+s}\|_{A_{t+\tau,\epsilon}^{-1}}.
$$
In addition, we have $z(t+\tau-1) - z(t-1) = \sum_{s=0}^{\tau-1} g(t+s)$, and as in the proof of Proposition 3,
$$
\|z(t+\tau-1) - z(t-1)\|_{A_{t+\tau,\epsilon}^{-1}} \le (L + \lambda Rr) \sum_{s=0}^{\tau-1} \|x_{t+s}\|_{A_{t+\tau,\epsilon}^{-1}} \le 3L \sum_{s=0}^{\tau-1} \|x_{t+s}\|_{A_{t+\tau,\epsilon}^{-1}},
$$
where for the last inequality we used the bound (25), which implies $Rr \le 2L/\lambda$. Thus the inequality (34) yields
$$
\|w(t) - w(t+\tau)\|_{A_{t+\tau,\epsilon}} \le \frac{3L}{\lambda}\sum_{s=0}^{\tau-1} \|x_{t+s}\|_{A_{t+\tau,\epsilon}^{-1}} + \frac{2L}{\lambda}\sum_{s=1}^{\tau} \|x_{t+s}\|_{A_{t+\tau,\epsilon}^{-1}}.
$$
Noting that $A_{t+1,\epsilon} \succeq A_{t,\epsilon}$ completes the proof.

References

[1] K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 68:357–367, 1967.
[2] P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.
[3] A. Beck and M. Teboulle.
Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.
[4] P. Billingsley. Probability and Measure. Wiley, second edition, 1986.
[5] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
[6] R. C. Bradley. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144, 2005.
[7] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50:2050–2057, 2004.
[8] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[9] J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In The 23rd Annual Conference on Computational Learning Theory, 2010.
[10] J. C. Duchi, A. Agarwal, M. Johansson, and M. I. Jordan. Ergodic subgradient descent. URL http://arxiv.org/abs/1105.4681, 2011.
[11] D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100–118, Feb. 1975.
[12] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69, 2007.
[13] M. Herbster and M. Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, 1998.
[14] M. Herbster and M. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281–309, 2001.
[15] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, 1996.
[16] S. F. Jarner and G. O. Roberts. Polynomial convergence rates of Markov chains. The Annals of Applied Probability, 12(1):224–247, 2002.
[17] S. M. Kakade and A. Tewari.
On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems 21, 2009.
[18] H. J. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, second edition, 2003.
[19] R. Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39:5–34, 2000.
[20] S. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Cambridge University Press, second edition, 2009.
[21] D. Modha and E. Masry. Minimum complexity regression estimation with weakly dependent observations. IEEE Transactions on Information Theory, 42(6):2133–2145, 1996.
[22] M. Mohri and A. Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11:789–814, 2010.
[23] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming A, 120(1):261–283, 2009.
[24] S. Shalev-Shwartz and Y. Singer. Logarithmic regret algorithms for strongly convex repeated games. Technical Report 42, The Hebrew University, 2007.
[25] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming Series B, to appear, 2011.
[26] I. Steinwart and A. Christmann. Fast learning from non-i.i.d. observations. In Advances in Neural Information Processing Systems 22, pages 1768–1776, 2009.
[27] G. Wei and M. A. Tanner. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704, 1990.
[28] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
[29] B. Yu.
Rates of convergence for empirical processes of stationary mixing sequences. Annals of Probability, 22(1):94–116, 1994.
[30] B. Zou, L. Li, and Z. Xu. The generalization performance of ERM algorithm with strongly mixing observations. Machine Learning, pages 275–295, 2009.
