Online Learning, Stability, and Stochastic Gradient Descent


Authors: Tomaso Poggio, Stephen Voinea, Lorenzo Rosasco

Online Learning, Stability, and Stochastic Gradient Descent

May 29, 2018

Tomaso Poggio, Stephen Voinea, Lorenzo Rosasco
CBCL, McGovern Institute, CSAIL, Brain Sciences Department, Massachusetts Institute of Technology

Abstract

In batch learning, stability together with existence and uniqueness of the solution corresponds to well-posedness of Empirical Risk Minimization (ERM) methods; recently, it was proved that CV_loo stability is necessary and sufficient for generalization and consistency of ERM ([9]). In this note, we introduce CV_on stability, which plays a similar role in online learning. We show that stochastic gradient descent (SGD) with the usual hypotheses is CV_on stable, and we then discuss the implications of CV_on stability for convergence of SGD.

This report describes research done within the Center for Biological & Computational Learning, in the Department of Brain & Cognitive Sciences, and in the Artificial Intelligence Laboratory at the Massachusetts Institute of Technology. This research was sponsored by grants from: AFOSR, DARPA, NSF. Additional support was provided by: Honda R&D Co., Ltd., Siemens Corporate Research, Inc., IIT, McDermott Chair.

Contents

1 Learning, Generalization and Stability
  1.1 Basic Setting
  1.2 Batch and Online Learning Algorithms
  1.3 Generalization and Consistency
    1.3.1 Other Measures of Generalization
  1.4 Stability and Generalization
2 Stability and SGD
  2.1 Setting and Preliminary Facts
    2.1.1 Stability of SGD
A Remarks: assumptions
B Learning Rates, Finite Sample Bounds and Complexity
  B.1 Connections Between Different Notions of Convergence
  B.2 Rates and Finite Sample Bounds
  B.3 Complexity and Generalization
  B.4 Necessary Conditions
  B.5 Robbins-Siegmund's Lemma

1 Learning, Generalization and Stability

In this section we collect some basic definitions and facts.

1.1 Basic Setting

Let $Z$ be a probability space with a measure $\rho$. A training set $S_n$ is an i.i.d. sample $z_i$, $i = 0, \dots, n-1$, from $\rho$. Assume that a hypothesis space $\mathcal{H}$ is given. We typically assume $\mathcal{H}$ to be a Hilbert space, and sometimes a $p$-dimensional Hilbert space, in which case, without loss of generality, we identify elements of $\mathcal{H}$ with $p$-dimensional vectors and $\mathcal{H}$ with $\mathbb{R}^p$. A loss function is a map $V : \mathcal{H} \times Z \to \mathbb{R}_+$. Moreover, we assume that $I(f) = \mathbb{E}_z V(f,z)$ exists and is finite for $f \in \mathcal{H}$. We consider the problem of finding a minimum of $I(f)$ in $\mathcal{H}$. In particular, we restrict ourselves to finding a minimizer of $I(f)$ in a closed subset $K$ of $\mathcal{H}$ (note that we can of course have $K = \mathcal{H}$). We denote this minimizer by $f_K$, so that

$$I(f_K) = \min_{f \in K} I(f).$$

Note that, in general, existence (and uniqueness) of a minimizer is not guaranteed unless some further assumptions are specified.

Example 1. An example of the above setting is supervised learning. In this case $X$ is usually a subset of $\mathbb{R}^d$ and $Y = [0,1]$. There is a Borel probability measure $\rho$ on $Z = X \times Y$, and $S_n$ is an i.i.d. sample $z_i = (x_i, y_i)$, $i = 0, \dots, n-1$, from $\rho$.
The hypothesis space $\mathcal{H}$ is a space of functions from $X$ to $Y$, and a typical example of a loss function is the square loss $(y - f(x))^2$.

1.2 Batch and Online Learning Algorithms

A batch learning algorithm $A$ maps a training set to a function in the hypothesis space, that is, $f_n = A(S_n) \in \mathcal{H}$, and is typically assumed to be symmetric, that is, invariant to permutations of the training set. An online learning algorithm is defined recursively by $f_0 = 0$ and $f_{n+1} = A(f_n, z_n)$. A weaker notion of an online algorithm is $f_0 = 0$ and $f_{n+1} = A(f_n, S_{n+1})$. The former definition gives a memory-less algorithm, while the latter keeps memory of the past (see [5]). Clearly, the algorithm obtained from either of these two procedures will not in general be symmetric.

Example 2 (ERM). The prototype example of a batch learning algorithm is empirical risk minimization, defined by the variational problem $\min_{f \in \mathcal{H}} I_n(f)$, where $I_n(f) = \mathbb{E}_n V(f,z)$, $\mathbb{E}_n$ being the empirical average over the sample, and $\mathcal{H}$ is typically assumed to be a proper, closed subspace of $\mathbb{R}^p$, for example a ball or the convex hull of some given finite set of vectors.

Example 3 (SGD). The prototype example of an online learning algorithm is stochastic gradient descent, defined by the recursion

$$f_{n+1} = \Pi_K(f_n - \gamma_n \nabla V(f_n, z_n)), \qquad (1)$$

where $z_n$ is fixed, $\nabla V(f_n, z_n)$ is the gradient of the loss with respect to $f$ at $f_n$, and $\gamma_n$ is a suitable decreasing sequence. Here $K$ is assumed to be a closed subset of $\mathcal{H}$ and $\Pi_K : \mathcal{H} \to K$ the corresponding projection. Note that if $K$ is convex then $\Pi_K$ is a contraction, i.e. $\|\Pi_K\| \le 1$, and moreover if $K = \mathcal{H}$ then $\Pi_K = I$.

1.3 Generalization and Consistency

In this section we discuss several ways of formalizing the concept of generalization of a learning algorithm.
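Before turning to generalization, the projected SGD recursion (1) of Example 3 can be sketched in code, here with the square loss of Example 1 on toy linear data. This is a minimal sketch under assumed choices: the helper names (`project_ball`, `grad_square_loss`), the ball-shaped $K$, the step size $\gamma_n = 1/(n+1)$, and the synthetic data are all illustrative and not from the paper.

```python
import numpy as np

def project_ball(f, radius=1.0):
    """Projection Pi_K onto K = closed ball of given radius (a convex set)."""
    norm = np.linalg.norm(f)
    return f if norm <= radius else f * (radius / norm)

def sgd(samples, grad_V, gamma=lambda n: 1.0 / (n + 1)):
    """Projected SGD: f_{n+1} = Pi_K(f_n - gamma_n * grad V(f_n, z_n)), f_0 = 0."""
    f = np.zeros_like(samples[0][0])
    for n, z in enumerate(samples):
        f = project_ball(f - gamma(n) * grad_V(f, z))
    return f

def grad_square_loss(f, z):
    """Gradient in f of the square loss V(f, (x, y)) = (y - <f, x>)^2."""
    x, y = z
    return -2.0 * (y - f @ x) * x

# Toy linear data: y = <f*, x> + small noise, with f* in the interior of K.
rng = np.random.default_rng(0)
f_star = np.array([0.5, -0.3])
samples = [(x, f_star @ x + 0.01 * rng.standard_normal())
           for x in rng.standard_normal((2000, 2))]
f_n = sgd(samples, grad_square_loss)
print(f_n)  # should end up close to f_star
```

Since the target $f^\ast$ here lies in the interior of the ball, the projection eventually becomes inactive, which is exactly the situation covered by Lemma 1 in Section 2.1.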
We say that an algorithm is weakly consistent if we have convergence of the risks in probability, that is, for all $\epsilon > 0$,

$$\lim_{n \to \infty} \mathbb{P}(I(f_n) - I(f_K) > \epsilon) = 0, \qquad (2)$$

and that it is strongly consistent if convergence holds almost surely, that is,

$$\mathbb{P}\left(\lim_{n \to \infty} I(f_n) - I(f_K) = 0\right) = 1.$$

A different notion of consistency, typically considered in statistics, is given by convergence in expectation,

$$\lim_{n \to \infty} \mathbb{E}[I(f_n) - I(f_K)] = 0.$$

Note that, in the above equations, probability and expectations are with respect to the sample $S_n$. We add three remarks.

Remark 1. A more general requirement than those described above is obtained by replacing $I(f_K)$ with $\inf_{f \in \mathcal{H}} I(f)$. Note that in this latter case no extra assumptions are needed.

Remark 2. Yet a more general requirement would be obtained by replacing $I(f_K)$ with $\inf_{f \in \mathcal{F}} I(f)$, $\mathcal{F}$ being the largest space such that $I(f)$ is defined. An algorithm having such a consistency property is called universal.

Remark 3. We note that, following [1], the convergence (2) corresponds to the definition of learnability of the class $\mathcal{H}$.

1.3.1 Other Measures of Generalization

Note that alternatively one could measure the error with respect to the norm in $\mathcal{H}$, that is, $\|f_n - f_K\|$, for example

$$\lim_{n \to \infty} \mathbb{P}(\|f_n - f_K\| > \epsilon) = 0. \qquad (3)$$

A different requirement is to have convergence in the form

$$\lim_{n \to \infty} \mathbb{P}(|I_n(f_n) - I(f_n)| > \epsilon) = 0. \qquad (4)$$

Note that for both of the above error measures one can consider different notions of convergence (almost surely, in expectation) as well as convergence rates, hence finite sample bounds. For certain algorithms, most notably ERM, under mild assumptions on the loss function, the convergence (4) implies weak consistency (see footnote 1). For general algorithms there is no straightforward connection between (4) and consistency (2).
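The weak consistency requirement (2) can be made concrete with a small Monte Carlo sketch for a one-dimensional ERM: for the square loss with $\mathcal{H} = \mathbb{R}$, the empirical risk minimizer is the sample mean and the excess risk reduces to $(f_n - \mu)^2$. The distribution and constants below are illustrative assumptions.

```python
import numpy as np

# Weak consistency (2) for a toy ERM: H = R, V(f, z) = (z - f)^2.
# The empirical risk minimizer f_n is the sample mean, f_K is the true
# mean mu, and the excess risk is I(f_n) - I(f_K) = (f_n - mu)^2.
rng = np.random.default_rng(1)
mu, eps, trials = 0.7, 0.01, 2000

def prob_excess(n):
    """Monte Carlo estimate of P(I(f_n) - I(f_K) > eps) over fresh samples S_n."""
    f_n = rng.normal(mu, 1.0, size=(trials, n)).mean(axis=1)
    return float(np.mean((f_n - mu) ** 2 > eps))

probs = {n: prob_excess(n) for n in (10, 100, 1000)}
print(probs)  # the probabilities shrink toward 0 as n grows
```

Here the probability in (2) is over the draw of $S_n$, which is why the estimate averages over many independent training sets of each size.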
Convergence (3) is typically stronger than (2); in particular, this can be seen if the loss satisfies the Lipschitz condition

$$|V(f,z) - V(f',z)| \le L \|f - f'\|, \quad L > 0, \qquad (5)$$

for all $f, f' \in \mathcal{H}$ and $z \in Z$, but it also holds for other loss functions which do not satisfy (5), such as the square loss.

1.4 Stability and Generalization

Different notions of stability are sufficient to imply consistency results as well as finite sample bounds. A strong form of stability is uniform stability:

$$\sup_{z \in Z} \sup_{z_1, \dots, z_n} \sup_{z' \in Z} |V(f_n, z) - V(f_{n,z'}, z)| \le \beta_n,$$

where $f_{n,z'}$ is the function returned by the algorithm if we replace the $i$-th point in $S_n$ by $z'$, and $\beta_n$ is a decreasing function of $n$. Bousquet and Elisseeff prove that the above condition, for algorithms which are symmetric, gives exponential tail inequalities on $I(f_n) - I_n(f_n)$, meaning that we have $\delta(\epsilon, n) = e^{-C\epsilon^2 n}$ for some constant $C$ [2]. Furthermore, it was shown in [10] that ERM with a strongly convex loss function is always uniformly stable. Weaker requirements can be defined by replacing one or more suprema with expectations or statements in probability; exponential inequalities will in general be replaced by weaker concentration results. A thorough discussion and a list of relevant references can be found in [6, 7]. Notice that the notion of CV_loo stability introduced there is necessary and sufficient for generalization and consistency of ERM ([9]) in the batch setting of classification and regression.
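The replace-one-point comparison behind uniform stability can be probed numerically. The sketch below trains ridge regression (a strongly convex, regularized ERM, of the kind covered by [10]) on $S_n$ and on a copy of $S_n$ with one point replaced, then compares the losses over a batch of test points; the data, the value of $\lambda$, and the sample sizes are illustrative assumptions, and a maximum over finitely many random test points only indicates the $O(1/(\lambda n))$ trend rather than the supremum in the definition.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, d = 0.1, 3

def ridge(X, y):
    """Minimizer of (1/n) sum_i (y_i - <f, x_i>)^2 + lam * ||f||^2."""
    n = len(y)
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def loss_gap(n):
    """max over test points of |V(f_n, z) - V(f_{n,z'}, z)| after one replacement."""
    X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
    f = ridge(X, y)
    X2, y2 = X.copy(), y.copy()
    X2[0], y2[0] = rng.standard_normal(d), rng.standard_normal()  # replace point 0
    f2 = ridge(X2, y2)
    xs = rng.standard_normal((50, d))                             # test points, y = 0
    return float(np.max(np.abs((xs @ f) ** 2 - (xs @ f2) ** 2)))

gaps = {n: loss_gap(n) for n in (100, 10000)}
print(gaps)  # the gap shrinks roughly like 1/(lam * n)
```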
This is the main motivation for introducing the very similar notion of CV_on stability for the online setting in the next section (see footnote 2).

2 Stability and SGD

Here we focus on online learning, and in particular on SGD, and discuss the role played by the following definition of stability, which we call CV_on stability.

Footnote 1: In fact, for ERM,
$$\mathbb{P}(I(f_n) - I(f_K) > \epsilon) = \mathbb{P}(I(f_n) - I_n(f_n) + I_n(f_n) - I_n(f_K) + I_n(f_K) - I(f_K) > \epsilon)$$
$$\le \mathbb{P}(I(f_n) - I_n(f_n) > \epsilon/3) + \mathbb{P}(I_n(f_n) - I_n(f_K) > \epsilon/3) + \mathbb{P}(I_n(f_K) - I(f_K) > \epsilon/3).$$
The first term goes to zero because of (4), the second term has probability zero since $f_n$ minimizes $I_n$, and the third term goes to zero if $V(f_K, z)$ is a well-behaved random variable (for example if the loss is bounded, but also under weaker moment/tail conditions).

Footnote 2: Thus for the setting of batch classification and regression it is not necessary (S. Shalev-Shwartz, pers. comm.) to use the framework of [8].

Definition 2.1. We say that an online algorithm is CV_on stable with rate $\beta_n$ if for $n > N$ we have

$$-\beta_n \le \mathbb{E}_{z_n}[V(f_{n+1}, z_n) - V(f_n, z_n) \mid S_n] < 0, \qquad (6)$$

where $S_n = z_0, \dots, z_{n-1}$ and $\beta_n \ge 0$ goes to zero with $n$. The definition above is of course equivalent to

$$0 < \mathbb{E}_{z_n}[V(f_n, z_n) - V(f_{n+1}, z_n) \mid S_n] \le \beta_n. \qquad (7)$$

In particular, we assume $\mathcal{H}$ to be a $p$-dimensional Hilbert space and $V(\cdot, z)$ to be convex and twice differentiable in the first argument for all values of $z$. We discuss the stability property of (1) when $K$ is a closed, convex subset; in particular, we focus on the case where we can drop the projection, so that

$$f_{n+1} = f_n - \gamma_n \nabla V(f_n, z_n). \qquad (8)$$

2.1 Setting and Preliminary Facts

We recall the following standard result; see [4] and references therein for a proof.

Theorem 1.
Assume that:

• There exists $f_K \in K$ such that $\nabla I(f_K) = 0$ and, for all $f \in \mathcal{H}$, $\langle f - f_K, \nabla I(f) \rangle > 0$.
• $\sum_n \gamma_n = \infty$ and $\sum_n \gamma_n^2 < \infty$.
• There exists $D > 0$ such that, for all $f_n \in \mathcal{H}$,

$$\mathbb{E}_{z_n}[\|\nabla V(f_n, z)\|^2 \mid S_n] \le D(1 + \|f_n - f_K\|^2). \qquad (9)$$

Then

$$\mathbb{P}\left(\lim_{n \to \infty} \|f_n - f_K\| = 0\right) = 1.$$

The following result will also be useful.

Lemma 1. Under the same assumptions as Theorem 1, if $f_K$ belongs to the interior of $K$, then there exists $N > 0$ such that for $n > N$, $f_n \in K$, so that the projections in (1) are not needed and the $f_n$ are given by $f_{n+1} = f_n - \gamma_n \nabla V(f_n, z_n)$.

2.1.1 Stability of SGD

Throughout this section we assume that

$$\langle f, H(V(f,z)) f \rangle \ge 0, \qquad \|H(V(f,z))\| \le M < \infty, \qquad (10)$$

for any $f \in \mathcal{H}$ and $z \in Z$; $H(V(f,z))$ is the Hessian of $V$.

Theorem 2. Under the same assumptions as Theorem 1, there exists $N$ such that for $n > N$, SGD satisfies CV_on stability with $\beta_n = C\gamma_n$, where $C$ is a universal constant.

Proof. From Taylor's formula,

$$V(f_{n+1}, z_n) - V(f_n, z_n) = \langle f_{n+1} - f_n, \nabla V(f_n, z_n) \rangle + \tfrac{1}{2} \langle f_{n+1} - f_n, H(V(\bar{f}, z_n))(f_{n+1} - f_n) \rangle, \qquad (11)$$

with $\bar{f} = \alpha f_n + (1-\alpha) f_{n+1}$ for some $0 \le \alpha \le 1$. We can use the definition of SGD and Lemma 1 to show that there exists $N$ such that for $n > N$, $f_{n+1} - f_n = -\gamma_n \nabla V(f_n, z_n)$. Hence, changing signs in (11) and taking the expectation with respect to $z_n$ conditioned on $S_n = z_0, \dots, z_{n-1}$, we get

$$\mathbb{E}_{z_n}[V(f_n, z_n) - V(f_{n+1}, z_n) \mid S_n] = \gamma_n \mathbb{E}_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n] - \tfrac{1}{2}\gamma_n^2 \mathbb{E}_{z_n}[\langle \nabla V(f_n, z_n), H(V(\bar{f}, z_n)) \nabla V(f_n, z_n) \rangle \mid S_n]. \qquad (12)$$

By (10) the second term is at most $\tfrac{1}{2}\gamma_n^2 M \, \mathbb{E}_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n]$, so the above quantity is bounded below by $\gamma_n(1 - \tfrac{1}{2}\gamma_n M)\, \mathbb{E}_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n]$, which is non-negative once $\gamma_n \le 2/M$. Moreover, since the second term in (12) is non-negative because of (10), using (9) we get

$$\mathbb{E}_{z_n}[V(f_n, z_n) - V(f_{n+1}, z_n) \mid S_n] \le \gamma_n \mathbb{E}_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n] \le \gamma_n D(1 + \|f_n - f_K\|^2) \le C\gamma_n,$$

if $n$ is large enough, since $\|f_n - f_K\|$ is bounded for large $n$ by Theorem 1.
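Theorem 2's bound can be checked numerically: run the plain recursion (8) up to step $n$, then estimate $\mathbb{E}_{z_n}[V(f_n, z_n) - V(f_{n+1}, z_n) \mid S_n]$ by averaging over fresh draws of $z_n$, and compare the estimate with $\gamma_n$. The square loss, the step-size schedule, and the toy data below are illustrative assumptions, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(3)
f_star = np.array([0.4, -0.2])
gamma = lambda n: 0.5 / (n + 10)  # sum gamma_n = inf, sum gamma_n^2 < inf

def draw_z():
    x = rng.standard_normal(2)
    return x, f_star @ x + 0.1 * rng.standard_normal()

def grad_V(f, z):
    """Gradient in f of V(f, (x, y)) = (y - <f, x>)^2."""
    x, y = z
    return -2.0 * (y - f @ x) * x

# Run plain SGD (8) for n steps on one trajectory S_n.
n_steps = 200
f = np.zeros(2)
for n in range(n_steps):
    f = f - gamma(n) * grad_V(f, draw_z())

# Estimate E_{z_n}[ V(f_n, z_n) - V(f_{n+1}, z_n) | S_n ] with fresh z_n.
def loss_drop(z):
    x, y = z
    f_next = f - gamma(n_steps) * grad_V(f, z)
    return (y - f @ x) ** 2 - (y - f_next @ x) ** 2

est = np.mean([loss_drop(draw_z()) for _ in range(20000)])
print(est, gamma(n_steps))  # est should be positive and of order gamma_n
```

In line with (7), the estimate comes out strictly positive and of the same order as $\gamma_n$, consistent with a rate $\beta_n = C\gamma_n$.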
A partial converse result is given by the following theorem.

Theorem 3. Assume that:

• There exists $f_K \in K$ such that $\nabla I(f_K) = 0$ and, for all $f \in \mathcal{H}$, $\langle f - f_K, \nabla I(f) \rangle > 0$.
• $\sum_n \gamma_n = \infty$ and $\sum_n \gamma_n^2 < \infty$.
• There exist $C, N > 0$ such that for all $n > N$, (7) holds with $\beta_n = C\gamma_n$.

Then

$$\mathbb{P}\left(\lim_{n \to \infty} \|f_n - f_K\| = 0\right) = 1. \qquad (13)$$

Proof. Note that from (11) we also have

$$\mathbb{E}_{z_n}[V(f_{n+1}, z_n) - V(f_n, z_n) \mid S_n] = -\gamma_n \mathbb{E}_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n] + \tfrac{1}{2}\gamma_n^2 \mathbb{E}_{z_n}[\langle \nabla V(f_n, z_n), H(V(\bar{f}, z_n)) \nabla V(f_n, z_n) \rangle \mid S_n],$$

so that using the stability assumption and (10) we obtain

$$-\beta_n \le \left(\tfrac{1}{2} M \gamma_n^2 - \gamma_n\right) \mathbb{E}_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n],$$

that is,

$$\mathbb{E}_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n] \le \frac{\beta_n}{\gamma_n - \tfrac{1}{2} M \gamma_n^2} = \frac{C\gamma_n}{\gamma_n - \tfrac{1}{2} M \gamma_n^2}.$$

From Lemma 1, for $n$ large enough we obtain

$$\|f_{n+1} - f_K\|^2 \le \|f_n - \gamma_n \nabla V(f_n, z_n) - f_K\|^2 = \|f_n - f_K\|^2 + \gamma_n^2 \|\nabla V(f_n, z_n)\|^2 - 2\gamma_n \langle f_n - f_K, \nabla V(f_n, z_n) \rangle,$$

so that, taking the expectation with respect to $z_n$ conditioned on $S_n$ and using the assumptions, we write

$$\mathbb{E}_{z_n}[\|f_{n+1} - f_K\|^2 \mid S_n] \le \|f_n - f_K\|^2 + \gamma_n^2 \mathbb{E}_{z_n}[\|\nabla V(f_n, z_n)\|^2 \mid S_n] - 2\gamma_n \langle f_n - f_K, \mathbb{E}_{z_n}[\nabla V(f_n, z_n) \mid S_n] \rangle$$
$$\le \|f_n - f_K\|^2 + \gamma_n^2 \frac{C\gamma_n}{\gamma_n - \tfrac{1}{2} M \gamma_n^2} - 2\gamma_n \langle f_n - f_K, \nabla I(f_n) \rangle,$$

since $\mathbb{E}_{z_n}[\nabla V(f_n, z_n) \mid S_n] = \nabla I(f_n)$. The series $\sum_n \gamma_n^2 \frac{C\gamma_n}{\gamma_n - \tfrac{1}{2} M \gamma_n^2}$ converges and the last inner product is positive by assumption, so the Robbins-Siegmund theorem implies (13), and the theorem is proved.

A Remarks: assumptions

• The assumptions will be satisfied if the loss is convex (and twice differentiable) and $\mathcal{H}$ is compact. In fact, a convex function is always locally Lipschitz, so that if we restrict $\mathcal{H}$ to be a compact set, $V$ satisfies (5) with $L = \sup_{f \in \mathcal{H}, z \in Z} \|\nabla V(f,z)\| < \infty$.
Similarly, since $V$ is twice differentiable and convex, the Hessian $H(V(f,z))$ of $V$ at any $f \in \mathcal{H}$ and $z \in Z$ is identified with a bounded, positive semi-definite matrix, that is,

$$\langle f, H(V(f,z)) f \rangle \ge 0, \qquad \|H(V(f,z))\| \le 1 < \infty,$$

for any $f \in \mathcal{H}$ and $z \in Z$, where for the sake of simplicity we took the bound on the Hessian to be 1.

• The gradient in the SGD update rule can be replaced by a stochastic subgradient with little change in the theorems.

B Learning Rates, Finite Sample Bounds and Complexity

B.1 Connections Between Different Notions of Convergence

It is well known that both convergence in expectation and strong convergence imply weak convergence. On the other hand, if we have weak consistency and

$$\sum_{n=1}^{\infty} \mathbb{P}(I(f_n) - I(f_K) > \epsilon) < \infty$$

for all $\epsilon > 0$, then weak consistency implies strong consistency by the Borel-Cantelli lemma.

B.2 Rates and Finite Sample Bounds

A stronger result is weak convergence with a rate, that is,

$$\mathbb{P}(I(f_n) - I(f_K) > \epsilon) \le \delta(n, \epsilon),$$

where $\delta(n, \epsilon)$ decreases in $n$ for all $\epsilon > 0$. We make two observations. First, one can see that the Borel-Cantelli lemma imposes a rate on the decay of $\delta(n, \epsilon)$. Second, typically $\delta = \delta(n, \epsilon)$ is invertible in $\epsilon$, so that we can write the above result as a finite sample bound

$$\mathbb{P}(I(f_n) - I(f_K) \le \epsilon(n, \delta)) \ge 1 - \delta.$$

B.3 Complexity and Generalization

We say that a class of real-valued functions $\mathcal{F}$ on $Z$ is uniform Glivenko-Cantelli (UGC) if, for all $\epsilon > 0$,

$$\lim_{n \to \infty} \mathbb{P}\left(\sup_{F \in \mathcal{F}} |\mathbb{E}_n(F) - \mathbb{E}(F)| > \epsilon\right) = 0.$$

If we consider the class of functions induced by $V$ and $\mathcal{H}$, that is, $F(\cdot) = V(f, \cdot)$, $f \in \mathcal{H}$, the above property can be written as

$$\lim_{n \to \infty} \mathbb{P}\left(\sup_{f \in \mathcal{H}} |I_n(f) - I(f)| > \epsilon\right) = 0. \qquad (14)$$

Clearly, the above property implies (4), hence consistency of ERM if $f_{\mathcal{H}}$ exists, under mild assumptions on the loss (see footnote 1). It is well known that UGC classes can be completely characterized by suitable capacity/complexity measures of $\mathcal{H}$. In particular, a class of binary-valued functions is UGC if and only if its VC-dimension is finite. Similarly, a class of bounded functions is UGC if and only if its fat-shattering dimension is finite. See [1] and references therein. Finite complexity of $\mathcal{H}$ is hence a sufficient condition for the consistency of ERM.

B.4 Necessary Conditions

One natural question is whether the above conditions are also necessary for consistency of ERM in the sense of (2), or, in other words, whether consistency of ERM on $\mathcal{H}$ implies that $\mathcal{H}$ is a UGC class. An argument in this direction is given by Vapnik, who calls the result (together with the converse direction) the key theorem in learning. Vapnik argues that (2) must be replaced by a much stronger notion of convergence, essentially obtained by replacing $\mathcal{H}$ with $\mathcal{H}_\gamma = \{f \in \mathcal{H} \mid I(f) \ge \gamma\}$, for all $\gamma$. Another result in this direction is given without proof in [1].

B.5 Robbins-Siegmund's Lemma

We use the stochastic approximation framework described by Duflo ([3], pp. 6-15). We assume a sequence of data $z_i$ defined on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ and a filtration $F = (\mathcal{F}_n)_{n \in \mathbb{N}}$, where $\mathcal{F}_n$ is a $\sigma$-field and $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$. In addition, a sequence $X_n$ of measurable functions from $(\Omega, \mathcal{A})$ to another measurable space is said to be adapted to $F$ if, for all $n$, $X_n$ is $\mathcal{F}_n$-measurable.

Definition. Suppose that $X = (X_n)$ is a sequence of random variables adapted to the filtration $F$. $X$ is a supermartingale if it is integrable (see [3]) and if $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] \le X_n$.

The following is a key theorem ([3]).

Theorem B.1 (Robbins-Siegmund). Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space.
Let $(V_n)$, $(\beta_n)$, $(\chi_n)$, $(\eta_n)$ be finite non-negative $\mathcal{F}_n$-measurable random variables, where $\mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}_n \subseteq \cdots$ is a sequence of sub-$\sigma$-algebras of $\mathcal{F}$. Suppose that $(V_n)$, $(\beta_n)$, $(\chi_n)$, $(\eta_n)$ are adapted to $F$ and that

$$\mathbb{E}[V_{n+1} \mid \mathcal{F}_n] \le V_n(1 + \beta_n) + \chi_n - \eta_n.$$

Then, if $\sum \beta_n < \infty$ and $\sum \chi_n < \infty$, almost surely $(V_n)$ converges to a finite random variable and the series $\sum \eta_n$ converges.

We provide a short proof of a special case of the theorem.

Theorem B.2. Suppose that $(V_n)$ and $(\eta_n)$ are positive sequences adapted to $F$ and that $\mathbb{E}[V_{n+1} \mid \mathcal{F}_n] \le V_n - \eta_n$. Then almost surely $(V_n)$ converges to a finite random variable and the series $\sum \eta_n$ converges.

Proof. Let $Y_n = V_n + \sum_{k=1}^{n-1} \eta_k$. Then we have

$$\mathbb{E}[Y_{n+1} \mid \mathcal{F}_n] = \mathbb{E}[V_{n+1} \mid \mathcal{F}_n] + \sum_{k=1}^{n} \eta_k \le V_n - \eta_n + \sum_{k=1}^{n} \eta_k = Y_n,$$

where we used the fact that $\eta_1, \dots, \eta_n$ are $\mathcal{F}_n$-measurable. So $(Y_n)$ is a supermartingale, and because $(V_n)$ and $(\eta_n)$ are positive sequences, $(Y_n)$ is bounded from below by 0, which implies that it converges almost surely. It follows that both $(V_n)$ and $\sum \eta_n$ converge.

References

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615-631, 1997.

[2] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2001.

[3] M. Duflo. Random Iterative Models. Springer, New York, 1991.

[4] J. Lelong. A central limit theorem for Robbins-Monro algorithms with projections. Web Tech Report, 1:1-15, 2005.

[5] A. Rakhlin. Lecture notes on online learning. Tech report, University of Pennsylvania, March 2010.

[6] A. Rakhlin, S. Mukherjee, and T. Poggio. Stability results in learning theory. Analysis and Applications, 3(4):397-417, 2005.

[7] S. Mukherjee, P. Niyogi, T. Poggio, and R. Rifkin.
Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25:161-193, 2006.

[8] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, pages 2635-2670, 2010.

[9] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in learning theory. Nature, 428:419-422, 2004.

[10] A. Wibisono, L. Rosasco, and T. Poggio. Sufficient conditions for uniform stability of regularization algorithms. Technical report, MIT Computer Science and Artificial Intelligence Laboratory.
