Margin-adaptive model selection in statistical learning

A classical condition for fast learning rates is the margin condition, first introduced by Mammen and Tsybakov. We tackle in this paper the problem of adaptivity to this condition in the context of model selection, in a general learning framework. Actually, we consider a weaker version of this condition that allows one to take into account that learning within a small model can be much easier than within a large one. Requiring this "strong margin adaptivity" makes the model selection problem more challenging. We first prove, in a general framework, that some penalization procedures (including local Rademacher complexities) exhibit this adaptivity when the models are nested. Contrary to previous results, this holds with penalties that only depend on the data. Our second main result is that strong margin adaptivity is not always possible when the models are not nested: for every model selection procedure (even a randomized one), there is a problem for which it does not demonstrate strong margin adaptivity.

Authors: Sylvain Arlot, Peter L. Bartlett

Bernoulli 17 (2), 2011, 687–713. DOI: 10.3150/10-BEJ288

¹ CNRS, Willow Project-Team, Laboratoire d'Informatique de l'École Normale Supérieure (CNRS/ENS/INRIA UMR 8548), 23, avenue d'Italie, CS 81321, 75214 Paris Cedex 13, France. E-mail: sylvain.arlot@ens.fr
² Computer Science Division and Department of Statistics, University of California, Berkeley, 367 Evans Hall #3860, Berkeley, CA 94720-3860, USA. E-mail: bartlett@cs.berkeley.edu

Keywords: adaptivity; empirical minimization; empirical risk minimization; local Rademacher complexity; margin condition; model selection; oracle inequalities; statistical learning

1. Introduction

We consider in this paper the model selection problem in a general framework. Since our main motivation comes from the supervised binary classification setting, we focus on this framework in this introduction. Section 2 introduces the natural generalization to empirical (risk) minimization problems, which we consider in the remainder of the paper.

We observe independent realizations $(X_i, Y_i) \in \mathcal{X} \times \mathcal{Y}$ for $i = 1, \dots, n$ of a random variable with distribution $P$, where $\mathcal{Y} = \{0, 1\}$. The goal is to build a (data-dependent) predictor $t$ (i.e., a measurable function $\mathcal{X} \to \mathcal{Y}$) such that $t(X)$ is as often as possible equal to $Y$, where $(X, Y) \sim P$ is independent from the data. This is the prediction problem, in the setting of supervised binary classification. In other words, the goal is to find $t$ minimizing the prediction error
$$P\gamma(t; \cdot) := \mathbb{P}_{(X,Y)\sim P}(t(X) \neq Y),$$
where $\gamma$ is the 0–1 loss.

The minimizer $s$ of the prediction error, when it exists, is called the Bayes predictor. Define the regression function $\eta(X) = \mathbb{P}_{(X,Y)\sim P}(Y = 1 \mid X)$. Then, a classical argument shows that $s(X) = \mathbf{1}_{\eta(X) \geq 1/2}$. However, $s$ is unknown, since it depends on the unknown distribution $P$. Our goal is to build from the data some predictor $t$ minimizing the prediction error, or equivalently the excess loss $\ell(s, t) := P\gamma(t) - P\gamma(s)$.
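To make these definitions concrete, here is a minimal numerical sketch on a toy finite $\mathcal{X}$ (the distribution values are hypothetical, chosen only for illustration): the Bayes predictor thresholds the regression function $\eta$ at $1/2$, and any competitor has non-negative excess loss.

```python
import numpy as np

# Toy joint distribution P on X x {0,1} with X = {0, 1, 2} (hypothetical values):
# p_x[j] = P(X = x_j), eta[j] = P(Y = 1 | X = x_j).
p_x = np.array([0.5, 0.3, 0.2])
eta = np.array([0.9, 0.4, 0.55])

def prediction_error(t):
    """P gamma(t) = P(t(X) != Y) for a predictor given as t[j] = t(x_j)."""
    # P(t(X) != Y | X = x_j) equals eta_j if t_j = 0, and 1 - eta_j if t_j = 1.
    return float(np.sum(p_x * np.where(t == 1, 1.0 - eta, eta)))

s = (eta >= 0.5).astype(int)            # Bayes predictor s(x) = 1_{eta(x) >= 1/2}
t = np.array([1, 1, 0])                 # an arbitrary competitor
excess_loss = prediction_error(t) - prediction_error(s)   # ell(s, t) >= 0
print(prediction_error(s), excess_loss)
```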
A classical approach to the prediction problem is empirical risk minimization. Let $P_n = n^{-1}\sum_{i=1}^n \delta_{(X_i, Y_i)}$ be the empirical measure and $S_m$ be any set of predictors, which is called a model. The empirical risk minimizer over $S_m$ is then defined as
$$\hat{s}_m \in \arg\min_{t \in S_m} P_n\gamma(t) = \arg\min_{t \in S_m} \left\{ \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{t(X_i) \neq Y_i} \right\}.$$
We expect that the risk of $\hat{s}_m$ is close to that of $s_m \in \arg\min_{t \in S_m} P\gamma(t)$, assuming that such a minimizer exists.

1.1. Margin condition

Depending on some properties of $P$ and the complexity of $S_m$, the prediction error of $\hat{s}_m$ is more or less distant from that of $s_m$. For instance, when $S_m$ has a finite Vapnik–Chervonenkis dimension $V_m$ [26, 27] and $s \in S_m$, it has been proven (see, e.g., [19]) that
$$\mathbb{E}[\ell(s, \hat{s}_m)] \leq C\sqrt{\frac{V_m}{n}}$$
for some numerical constant $C > 0$. This is optimal without any assumption on $P$, in the minimax sense: no estimator can have a smaller prediction risk uniformly over all distributions $P$ such that $s \in S_m$, up to the numerical factor $C$ [14].

However, there exist favorable situations where much smaller prediction errors ("fast rates", up to $n^{-1}$ instead of $n^{-1/2}$) can be obtained. A sufficient condition, the so-called "margin condition", has been introduced by Mammen and Tsybakov [21]. If, for some $\varepsilon_0, C_0 > 0$ and $\alpha \geq 1$,
$$\forall \varepsilon \in (0, \varepsilon_0] \qquad \mathbb{P}(|2\eta(X) - 1| \leq \varepsilon) \leq C_0 \varepsilon^\alpha, \qquad (1)$$
if the Bayes predictor $s$ belongs to $S_m$, and if $S_m$ is a VC-class of dimension $V_m$, then the prediction error of $\hat{s}_m$ is smaller than $L(C_0, \varepsilon_0, \alpha)\ln(n)(V_m/n)^{\kappa/(2\kappa-1)}$ in expectation, where $\kappa = (1+\alpha)/\alpha$ and $L(C_0, \varepsilon_0, \alpha) > 0$ only depends on $C_0$, $\varepsilon_0$ and $\alpha$. Corresponding minimax lower bounds [23] and other upper bounds can be obtained under other complexity assumptions (e.g., Assumption (A2) of Tsybakov [24], involving bracketing entropy). In the extreme situation where $\alpha = +\infty$, that is, for some $h > 0$,
$$\mathbb{P}(|2\eta(X) - 1| \leq h) = 0, \qquad (2)$$
then the same result holds with $\kappa = 1$ and $L(h) \propto h^{-1}$. More precisely, as proved in [23],
$$\mathbb{E}[\ell(s, \hat{s}_m)] \leq C\min\left\{ \frac{V_m(1 + \ln(n h^2 V_m^{-1}))}{nh}, \sqrt{\frac{V_m}{n}} \right\}.$$

Following the approach of Koltchinskii [16], we will consider the following generalization of the margin condition:
$$\forall t \in S \qquad \ell(s, t) \geq \varphi\left(\sqrt{\operatorname{var}_P(\gamma(t; \cdot) - \gamma(s; \cdot))}\right), \qquad (3)$$
where $S$ is the set of predictors, and $\varphi$ is a convex non-decreasing function on $[0, \infty)$ with $\varphi(0) = 0$. Indeed, the proofs of the above upper bounds on the prediction error of $\hat{s}_m$ use only that (1) implies (3) with $\varphi(x) = L(C_0, \varepsilon_0, \alpha)x^{2\kappa}$ and $\kappa = (1+\alpha)/\alpha$, and that (2) implies (3) with $\varphi(x) = h x^2$. (See, e.g., Proposition 1 in [24].)

All these results show that the empirical risk minimizer is adaptive to the margin condition, since it leads to an optimal excess risk under various assumptions on the complexity of $S_m$. However, obtaining such rates of estimation requires knowledge of some $S_m$ to which the Bayes predictor belongs, which is a strong assumption. A less restrictive framework is the following. First, we do not assume that $s \in S_m$.
Second, we do not assume that the margin condition (3) is satisfied for all $t \in S$, but only for $t \in S_m$, which can be seen as a "local" margin condition:
$$\forall t \in S_m \qquad \ell(s, t) \geq \varphi_m\left(\sqrt{\operatorname{var}_P(\gamma(t; \cdot) - \gamma(s; \cdot))}\right), \qquad (4)$$
where $\varphi_m$ is a convex non-decreasing function on $[0, \infty)$ with $\varphi_m(0) = 0$. The fact that $\varphi_m$ can depend on $m$ allows situations where we are lucky to have a strong margin condition for some small models but the global margin condition is loose. As proven in Section 5.2 (Proposition 1), such situations certainly exist. Note that when $\varphi_m(x) = h_m x^2$, (3) and (4) can be traced back to mean–variance conditions on $\gamma$ that were used in several papers for deriving convergence rates of some minimum contrast estimators on some given model $S_m$ (see, e.g., [11] and references therein).

1.2. Adaptive model selection

Assume now that we are not given a single model but a whole family $(S_m)_{m \in \mathcal{M}_n}$. By empirical risk minimization, we obtain a family $(\hat{s}_m)_{m \in \mathcal{M}_n}$ of predictors, from which we would like to select some $\hat{s}_{\hat{m}}$ with a prediction error $P\gamma(\hat{s}_{\hat{m}})$ as small as possible. The aim of such a model selection procedure $((X_1, Y_1), \dots, (X_n, Y_n)) \mapsto \hat{m} \in \mathcal{M}_n$ is to satisfy an oracle inequality of the form
$$\ell(s, \hat{s}_{\hat{m}}) \leq C\inf_{m \in \mathcal{M}_n}\{\ell(s, s_m) + R_{m,n}\}, \qquad (5)$$
where the leading constant $C \geq 1$ should be close to one and the remainder term $R_{m,n}$ should be close to $P\gamma(\hat{s}_m) - P\gamma(s_m)$. Typically, one proves that (5) holds either in expectation, or with high probability.

Assume for instance that $\varphi_m(x) = h_m x^2$ for some $h_m > 0$ and $S_m$ has a finite VC-dimension $V_m \geq 1$. In view of the aforementioned minimax lower bounds of [23], one cannot hope in general to prove an oracle inequality (5) with a remainder $R_{m,n}$ smaller than
$$\min\left\{ \frac{\ln(n)V_m}{n h_m}, \sqrt{\frac{V_m}{n}} \right\},$$
where the $\ln(n)$ term may only be necessary for some VC classes $S_m$ (see [23]). Then, adaptive model selection occurs when $\hat{m}$ satisfies an oracle inequality (5) with $R_{m,n}$ of the order of this minimax lower bound. More generally, let $\mathcal{C}_m$ be some complexity measure of $S_m$ (e.g., its VC-dimension, or the $\rho$ appearing in Tsybakov's assumption [24]). Then, define $R_n(\mathcal{C}_m, \varphi_m)$ as the minimax prediction error over the set of distributions $P$ such that $s \in S_m$ and the local margin condition (4) is satisfied in $S_m$ with $\varphi_m$, where $S_m$ has a complexity at most $\mathcal{C}_m$. Massart and Nédélec [23] have proven tight upper and lower bounds on $R_n(\mathcal{C}_m, \varphi_m)$ with several complexity measures; their results are stated with the margin condition (3), but they actually use its local version (4) only. A margin-adaptive model selection procedure should satisfy an oracle inequality of the form
$$\ell(s, \hat{s}_{\hat{m}}) \leq C\inf_{m \in \mathcal{M}_n}\{\ell(s, s_m) + R_n(\mathcal{C}_m, \varphi_m)\} \qquad (6)$$
without using the knowledge of $\mathcal{C}_m$ and $\varphi_m$. We call this property "strong margin adaptivity", to emphasize the fact that this is more challenging than adaptivity to a margin condition that holds uniformly over the models.

1.3. Penalization

We focus in particular in this paper on penalization procedures, which are defined as follows. Let $\operatorname{pen}: \mathcal{M}_n \to [0, \infty)$ be a (data-dependent) function, and define
$$\hat{m} \in \arg\min_{m \in \mathcal{M}_n}\{P_n\gamma(\hat{s}_m) + \operatorname{pen}(m)\}.$$
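A schematic sketch of this penalized selection rule follows; the models are represented as finite lists of candidate predictors, and the names and toy data are illustrative, not from the paper.

```python
import numpy as np

def empirical_risk(t, X, Y):
    """P_n gamma(t) for the 0-1 loss; t maps an array of features to {0,1}."""
    return float(np.mean(t(X) != Y))

def select_model(models, penalties, X, Y):
    """For each model S_m (a finite list of predictors here), compute the
    empirical risk minimizer s_hat_m, then pick m_hat minimizing
    P_n gamma(s_hat_m) + pen(m)."""
    erm = {m: min(S_m, key=lambda t: empirical_risk(t, X, Y))
           for m, S_m in models.items()}
    m_hat = min(models,
                key=lambda m: empirical_risk(erm[m], X, Y) + penalties[m])
    return m_hat, erm[m_hat]

# Hypothetical usage: two small models and constant penalties.
X = np.array([0, 1, 1, 0]); Y = np.array([0, 1, 0, 0])
f0 = lambda x: np.zeros_like(x)        # always predict 0
f1 = lambda x: x                       # predict the feature itself
m_hat, s_hat = select_model({0: [f0], 1: [f0, f1]}, {0: 0.0, 1: 0.1}, X, Y)
print(m_hat)
```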
Since our goal is to minimize the prediction error of $\hat{s}_m$, the ideal penalty would be
$$\operatorname{pen}_{\mathrm{id}}(m) := P\gamma(\hat{s}_m) - P_n\gamma(\hat{s}_m), \qquad (7)$$
but it is unknown because it depends on the distribution $P$. A classical way of designing a penalty is to estimate $\operatorname{pen}_{\mathrm{id}}(m)$, or at least a tight upper bound on it. We consider in particular local complexity measures [8, 10, 16, 20], because they estimate $\operatorname{pen}_{\mathrm{id}}$ tightly enough to achieve fast estimation rates when the margin condition holds true. See Section 3.2 for a detailed definition of these penalties.

1.4. Related results

There is a considerable wealth of literature on margin adaptivity in the context of model selection as well as model aggregation. Most of the papers consider the uniform margin condition, that is, when $\varphi_m \equiv \varphi$. Barron, Birgé and Massart [7] have proven oracle inequalities for deterministic penalties under some mean–variance condition on $\gamma$ close to (3) with $\varphi(x) = h x^2$. Following a similar approach, margin-adaptive oracle inequalities (with more general $\varphi$) have been proven with localized random penalties [8, 10, 16, 20] and with other penalties in a particular framework [25]. Adaptivity to the margin has also been considered with a regularized boosting method [12], the hold-out [13] and in a PAC-Bayes framework [5]. Aggregation methods have been studied in [24] and [17]. Notice also that a completely different approach is possible: estimate first the regression function $\eta$ (possibly through model selection), then use a plug-in classifier; this works provided $\eta$ is smooth enough [6].

It is quite unclear whether any of these results can be extended to strong margin adaptivity (actually, we will prove that this needs additional restrictions in general). To our knowledge, the only results allowing $\varphi_m$ to depend on $m$ can be found in [16]. First, when the models are nested, a comparison method based on local Rademacher complexities attains strong margin adaptivity, assuming that $s \in \bigcup_{m \in \mathcal{M}_n} S_m$ (Theorem 7; and it is quite unclear whether this still holds without the latter assumption). Second, a penalization method based on local Rademacher complexities has the same property in the general case, but it uses the knowledge of $(\varphi_m)_{m \in \mathcal{M}_n}$ (Theorems 6 and 11). Our claim is that when $\varphi_m$ does strongly depend on $m$, it is crucial to take it into account to choose the best model in $\mathcal{M}_n$. And such situations occur, as proven by our Proposition 1 in Section 5.2. But assuming either $s \in \bigcup_{m \in \mathcal{M}_n} S_m$ or that $\varphi_m$ is known is not realistic. Our goal is to investigate the kind of results that can be obtained with completely data-driven procedures; in particular, when $s \notin \bigcup_{m \in \mathcal{M}_n} S_m$.

1.5. Our results

In this paper, we aim at understanding when strong margin adaptivity can be obtained for data-dependent model selection procedures. Notice that we do not restrict ourselves to the classification setting. We consider a much more general framework (as in [16]), which is described in Section 2.

We prove two kinds of results. First, when models are nested, we show that some penalization methods are strongly margin adaptive (Theorem 1). In particular, this result holds for the local Rademacher complexities (Corollary 1).
Compared to previous results (in particular the ones of [16]), our main advance is that our penalties do not require the knowledge of $(\varphi_m)_{m \in \mathcal{M}_n}$, and we do not assume that the Bayes predictor belongs to any of the models.

Our second result probes the limits of strong margin adaptivity, without the nestedness assumption. A family of models exists such that, for every sample size $n$ and every (model) selection procedure $\hat{m}$, a distribution $P$ exists for which $\hat{m}$ fails to be strongly margin adaptive with a positive probability (Theorem 2). Hence, the previous positive results (Theorem 1 and Corollary 1) cannot be extended outside of the nested case for a general distribution $P$.

Where is the boundary between these two extremes? Obviously, the nestedness assumption is not necessary. For instance, when the global margin assumption is indeed tight ($\varphi = \varphi_m$ for every $m \in \mathcal{M}_n$), margin adaptivity can be obtained in several ways, as mentioned in Section 1.4. We sketch in Section 5 some situations where strong margin adaptivity is possible. More precisely, we state a general oracle inequality (Theorem 3), valid for any family of models and any distribution $P$. We then discuss assumptions under which its remainder term is small enough to imply strong margin adaptivity.

This paper is organized as follows. We describe the general setting in Section 2. We consider in Section 3 the nested case, in which strong margin adaptivity holds. Negative results (i.e., lower bounds on the prediction error of a general model selection procedure) are stated in Section 4. The line between these two situations is sketched in Section 5. We discuss our results in Section 6. All the proofs are given in Section 7.

2. The general empirical minimization framework

Although our main motivation comes from the classification problem, it turns out that all our results can be proven in the general setting of empirical minimization. As explained below, this setting includes binary classification with the 0–1 loss, bounded regression and several other frameworks. In the rest of the paper, we will use the following general notation, in order to emphasize the generality of our results.

We observe independent realizations $\xi_1, \dots, \xi_n \in \Xi$ of a random variable with distribution $P$, and we are given a set $\mathcal{F}$ of measurable functions $\Xi \to [0, 1]$. Our goal is to build some (data-dependent) $f$ such that its expectation $P(f) := \mathbb{E}_{\xi \sim P}[f(\xi)]$ is as small as possible. For the sake of simplicity, we assume that there is a minimizer $f^\star$ of $P(f)$ over $\mathcal{F}$.

This includes the prediction framework, in which $\Xi = \mathcal{X} \times \mathcal{Y}$, $\xi_i = (X_i, Y_i)$ and $\mathcal{F} := \{\xi \mapsto \gamma(t; \xi) \text{ s.t. } t \in S\}$, where $\gamma: S \times \Xi \to [0, 1]$ is any contrast function. Then, $f^\star$ is equal to $\gamma(s; \cdot)$, where $s$ is the Bayes predictor. In the binary classification framework, $\mathcal{Y} = \{0, 1\}$ and we can take the 0–1 contrast $\gamma(t; (x, y)) = \mathbf{1}_{t(x) \neq y}$, for instance. We then recover the setting described in Section 1. In the bounded regression framework, assuming that $\mathcal{Y} = [0, 1]$, we can take the least-squares contrast, $\gamma(t; (x, y)) = (t(x) - y)^2$. Many other contrast functions $\gamma$ can be considered, provided that they take their values in $[0, 1]$. Notice the one-to-one correspondence between predictors $t$ and functions $f_t := \gamma(t; \cdot)$ in the prediction framework.
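The following schematic sketch (illustrative names and data, not part of the paper) shows how this correspondence lets the same empirical-minimization code cover both classification and bounded regression.

```python
import numpy as np

def loss_function(t, gamma):
    """f_t = gamma(t; .): turn a predictor t into an element of F,
    i.e., a [0,1]-valued function of xi = (x, y)."""
    return lambda x, y: gamma(t, x, y)

gamma_01 = lambda t, x, y: float(t(x) != y)    # 0-1 contrast (classification)
gamma_ls = lambda t, x, y: (t(x) - y) ** 2     # least-squares contrast, Y in [0,1]

def P_n(f, sample):
    """Empirical measure: P_n(f) = n^{-1} sum_i f(xi_i) with xi_i = (x_i, y_i)."""
    return float(np.mean([f(x, y) for (x, y) in sample]))

# The same P_n applies verbatim to both contrasts (toy sample):
sample = [(0.2, 1.0), (0.8, 1.0), (0.6, 0.0)]
t = lambda x: float(x > 0.5)
print(P_n(loss_function(t, gamma_01), sample), P_n(loss_function(t, gamma_ls), sample))
```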
The empirical minimizer over $\mathcal{F}_m \subset \mathcal{F}$ (called a model) can then be defined as
$$\hat{f}_m \in \arg\min_{f \in \mathcal{F}_m} P_n(f).$$
We expect that its expectation $P(\hat{f}_m)$ is close to that of $f_m \in \arg\min_{f \in \mathcal{F}_m} P(f)$, assuming that such a minimizer exists. In the prediction framework, defining $\mathcal{F}_m := \{f_t \text{ s.t. } t \in S_m\}$, we have $\hat{f}_m = f_{\hat{s}_m}$ and $f_m = f_{s_m}$.

We can now write the global margin condition as follows:
$$\forall f \in \mathcal{F} \qquad P(f - f^\star) \geq \varphi\left(\sqrt{\operatorname{var}_P(f - f^\star)}\right), \qquad (8)$$
where $\varphi$ is a convex non-decreasing function on $[0, \infty)$ with $\varphi(0) = 0$. Similarly, the local margin condition is
$$\forall f \in \mathcal{F}_m \qquad P(f - f^\star) \geq \varphi_m\left(\sqrt{\operatorname{var}_P(f - f^\star)}\right). \qquad (9)$$

Notice that most of the upper and lower bounds on the risk under the margin condition given in the introduction stay valid in the general empirical minimization framework, at least when $\varphi_m(x) = (h_m x^2)^{\kappa_m}$ for some $h_m > 0$ and $\kappa_m \geq 1$ (see, e.g., [23] and [16]). Assume that $\mathcal{F}_m$ is a VC-type class of dimension $V_m$. If $\varphi_m(x) = h_m x^2$,
$$\mathbb{E}[P(\hat{f}_m - f^\star)] \leq 2P(f_m - f^\star) + C\min\left\{ \frac{\ln(n)V_m}{n h_m}, \sqrt{\frac{V_m}{n}} \right\}$$
for some numerical constant $C > 0$. If $\varphi_m(x) = (h_m x^2)^{\kappa_m}$ for some $h_m > 0$ and $\kappa_m \geq 1$,
$$\mathbb{E}[P(\hat{f}_m - f^\star)] \leq 2P(f_m - f^\star) + C\min\left\{ L(h_m, \kappa_m)\ln(n)\left(\frac{V_m}{n h_m}\right)^{\kappa_m/(2\kappa_m - 1)}, \sqrt{\frac{V_m}{n}} \right\}$$
for some constants $C, L(h_m, \kappa_m) > 0$.

Given a collection $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ of models, we are looking for a model selection procedure $(\xi_1, \dots, \xi_n) \mapsto \hat{m} \in \mathcal{M}_n$ satisfying an oracle inequality of the form
$$P(\hat{f}_{\hat{m}} - f^\star) \leq C\inf_{m \in \mathcal{M}_n}\{P(f_m - f^\star) + R_{m,n}\}, \qquad (10)$$
with a leading constant $C$ close to 1 and a remainder term $R_{m,n}$ as small as possible. Similarly to (6), we define a strongly margin-adaptive procedure as any $\hat{m}$ such that (10) holds with some numerical constant $C$, and $R_{m,n}$ of the order of the minimax risk $R_n(\mathcal{C}_m, \varphi_m)$, where $\mathcal{C}_m$ is some complexity measure of $\mathcal{F}_m$. Defining penalization methods as
$$\hat{m} \in \arg\min_{m \in \mathcal{M}_n}\{P_n(\hat{f}_m) + \operatorname{pen}(m)\} \qquad (11)$$
for some data-dependent $\operatorname{pen}: \mathcal{M}_n \to \mathbb{R}$, the ideal penalty is $\operatorname{pen}_{\mathrm{id}}(m) := (P - P_n)(\hat{f}_m)$.

3. Margin-adaptive model selection for nested models

3.1. General result

Our first result is a sufficient condition for penalization procedures to attain strong margin adaptivity when the models are nested (Theorem 1). Since this condition is satisfied by local Rademacher complexities, this leads to a data-driven margin-adaptive penalization procedure (Corollary 1).

Theorem 1. Fix $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ and $(\varphi_m)_{m \in \mathcal{M}_n}$ such that the local margin conditions (9) hold. Let $(t_m)_{m \in \mathcal{M}_n}$ be a sequence of positive reals that is non-decreasing (with respect to the inclusion ordering on $\mathcal{F}_m$). Assume that some constants $c, \eta \in (0, 1)$ and $C_1, C_2 \geq 0$ exist such that the following holds:

- The models $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ are nested.
- Lower bounds on the penalty: with probability at least $1 - \eta$, for every $m, m' \in \mathcal{M}_n$,
$$(1 - c)\operatorname{pen}(m) \geq (P - P_n)(\hat{f}_m - f_m) + \frac{t_m}{n} \geq 0, \qquad (12)$$
$$\mathcal{F}_{m'} \subset \mathcal{F}_m \;\Rightarrow\; c\operatorname{pen}(m) \geq v(m) - C_1 v(m') - C_2 P(f_{m'} - f^\star), \qquad (13)$$
where
$$v(m) := \sqrt{\frac{2 t_m \operatorname{var}_P(f_m - f^\star)}{n}}.$$
Then, if $\hat{m}$ is defined by (11), with probability at least $1 - \eta - 2\sum_{m \in \mathcal{M}_n} e^{-t_m}$, we have for every $\varepsilon \in (0, 1)$,
$$P(\hat{f}_{\hat{m}} - f^\star) \leq \frac{1}{1 - \varepsilon}\inf_{m \in \mathcal{M}_n}\left\{ (1 + \varepsilon + C_2 + \varepsilon C_1)P(f_m - f^\star) + \operatorname{pen}(m) + (1 + \max\{1, C_1\})\min\left\{ \varphi_m^\star\left(\sqrt{\frac{2 t_m}{\varepsilon^2 n}}\right), \sqrt{\frac{2 t_m}{n}} \right\} + \frac{t_m}{3n} \right\}, \qquad (14)$$
where $\varphi_m^\star(x) := \sup_{y \geq 0}\{xy - \varphi_m(y)\}$ is the convex conjugate of $\varphi_m$.

Theorem 1 is proved in Section 7.1.

Remark 1.
1. If $\operatorname{pen}(m)$ is of the right order, that is, not much larger than $\mathbb{E}[\operatorname{pen}_{\mathrm{id}}(m)]$, then Theorem 1 is a strong margin adaptivity result. Indeed, assuming that $\varphi_m(x) = (h_m x^2)^{\kappa_m}$, the remainder term is not too large, since $\varphi_m^\star(x) = L(h_m, \kappa_m)x^{2\kappa_m/(2\kappa_m - 1)}$ for some positive constant $L(h_m, \kappa_m)$. Hence, choosing $\varepsilon = 1/2$, for instance, we can rewrite (14) as
$$P(\hat{f}_{\hat{m}} - f^\star) \leq L(C_1, C_2)\inf_{m \in \mathcal{M}_n}\left\{ P(f_m - f^\star) + \operatorname{pen}(m) + L(h_m, \kappa_m)\left(\frac{t_m}{n}\right)^{\kappa_m/(2\kappa_m - 1)} \right\}$$
for some positive constants $L(C_1, C_2)$ and $L(h_m, \kappa_m)$. When $\varphi_m$ is a general convex function, minimax estimation rates are no longer available, so that we do not know whether the remainder term in (14) is of the right order. Nevertheless, no better risk bound is known, even for a single model to which $s$ belongs.
2. In the case that the $\varphi_m$ are known, methods involving local Rademacher complexities and $(\varphi_m)_{m \in \mathcal{M}_n}$ satisfy oracle inequalities similar to (14) (see Theorems 6 and 11 in [16]). On the contrary, the $\varphi_m$ are not assumed to be known in Theorem 1, and conditions (12) and (13) are satisfied by completely data-dependent penalties, as shown in Section 3.2. Also, Theorem 7 of [16] shows that adaptivity is possible using a comparison method, provided that $f^\star$ belongs to one of the models. However, it is not clear whether this comparison method achieves the optimal bias–variance trade-off in the general case, as in Theorem 1.

3.2. Local Rademacher complexities

Although Theorem 1 applies to any penalization procedure satisfying assumptions (12) and (13), we now focus on methods based on local Rademacher complexities. Let us define precisely these complexities. We mainly use the notation of [16]:

- for every $\delta > 0$, the $\delta$-minimal set of $\mathcal{F}_m$ w.r.t. the distribution $P$ is
$$\mathcal{F}_{m,P}(\delta) := \left\{ f \in \mathcal{F}_m \text{ s.t. } P(f) - \inf_{g \in \mathcal{F}_m} P(g) \leq \delta \right\},$$
- the $L_2(P)$ diameter of the $\delta$-minimal set of $\mathcal{F}_m$ is given by
$$D_P^2(\mathcal{F}_m; \delta) = \sup_{f, g \in \mathcal{F}_{m,P}(\delta)} P((f - g)^2),$$
- the expected modulus of continuity of $(P - P_n)$ over $\mathcal{F}_m$ is
$$\phi_n(\mathcal{F}_m; P; \delta) = \mathbb{E}\left[\sup_{f, g \in \mathcal{F}_{m,P}(\delta)} |(P_n - P)(f - g)|\right].$$

We then define
$$U_n(\mathcal{F}_m; \delta; t) := K\left( \phi_n(\mathcal{F}_m; P; \delta) + D_P(\mathcal{F}_m; \delta)\sqrt{\frac{t}{n}} + \frac{t}{n} \right),$$
where $K > 0$ is a numerical constant (to be chosen later). The (ideal) local complexity $\delta_n(\mathcal{F}_m; t)$ is (roughly) the smallest positive fixed point of $r \mapsto U_n(\mathcal{F}_m; r; t)$. More precisely,
$$\delta_n(\mathcal{F}_m; t) := \inf\left\{ \delta > 0 \text{ s.t. } \sup_{\sigma \geq \delta}\left\{ \frac{U_n(\mathcal{F}_m; \sigma; t)}{\sigma} \right\} \leq \frac{1}{2q} \right\}, \qquad (15)$$
where $q > 1$ is a numerical constant. Two important points, which follow from Theorems 1 and 3 of Koltchinskii [16], are that:

1. $\delta_n(\mathcal{F}_m; t)$ is large enough to satisfy assumption (12) with a probability at least $1 - \log_q(n/t)e^{-t}$ for each model $m \in \mathcal{M}_n$.
2. There is a completely data-dependent $\hat{\delta}_n(\mathcal{F}_m; t)$ such that
$$\forall m \in \mathcal{M}_n \qquad \mathbb{P}(\hat{\delta}_n(\mathcal{F}_m; t) \geq \delta_n(\mathcal{F}_m; t)) \geq 1 - 5\log_q\left(\frac{n}{t}\right)e^{-t}.$$

This data-dependent $\hat{\delta}_n(\mathcal{F}_m; t)$ is a resampling estimate of $\delta_n(\mathcal{F}_m; t)$, called the "local Rademacher complexity". Before stating the main result of this section, let us recall the definition of $\hat{\delta}_n(\mathcal{F}_m; t)$, as in [16]. We need the following additional notation:

- for every $\delta > 0$, the empirical $\delta$-minimal set of $\mathcal{F}_m$ is
$$\hat{\mathcal{F}}_{n,m}(\delta) := \left\{ f \in \mathcal{F}_m \text{ s.t. } P_n(f) - \inf_{g \in \mathcal{F}_m} P_n(g) \leq \delta \right\} = \mathcal{F}_{m,P_n}(\delta),$$
- the empirical $L_2$ diameter of the empirical $\delta$-minimal set of $\mathcal{F}_m$ is given by
$$\hat{D}_n^2(\mathcal{F}_m; \delta) = \sup_{f, g \in \hat{\mathcal{F}}_{n,m}(\delta)} P_n((f - g)^2),$$
- the modulus of continuity of the Rademacher process $f \mapsto n^{-1}\sum_{i=1}^n \varepsilon_i f(\xi_i)$ over $\mathcal{F}_m$, where $\varepsilon_1, \dots, \varepsilon_n$ are i.i.d. Rademacher random variables (i.e., $\varepsilon_i$ takes the values $+1$ and $-1$ with probability $1/2$ each), is
$$\hat{\phi}_n(\mathcal{F}_m; \delta) = \sup_{f, g \in \hat{\mathcal{F}}_{n,m}(\delta)}\left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i(f(\xi_i) - g(\xi_i)) \right|.$$

Defining
$$\hat{U}_n(\mathcal{F}_m; \delta; t) := \hat{K}\left( \hat{\phi}_n(\mathcal{F}_m; \hat{c}\delta) + \hat{D}_n(\mathcal{F}_m; \hat{c}\delta)\sqrt{\frac{t}{n}} + \frac{t}{n} \right)$$
(where $\hat{K}, \hat{c} > 0$ are numerical constants, to be chosen later), the local Rademacher complexity $\hat{\delta}_n(\mathcal{F}_m; t)$ is (roughly) the smallest positive fixed point of $r \mapsto \hat{U}_n(\mathcal{F}_m; r; t)$. More precisely,
$$\hat{\delta}_n(\mathcal{F}_m; t) := \inf\left\{ \delta > 0 \text{ s.t. } \sup_{\sigma \geq \delta}\left\{ \frac{\hat{U}_n(\mathcal{F}_m; \sigma; t)}{\sigma} \right\} \leq \frac{1}{2q} \right\}, \qquad (16)$$
where $q > 1$ is a numerical constant.

Corollary 1 (Strong margin adaptivity for local Rademacher complexities). There exist numerical constants $K > 0$ and $q > 1$ such that the following holds. Let $t > 0$. Assume that a numerical constant $L > 0$ exists and an event of probability at least $1 - L\log_q(n/t)\operatorname{Card}(\mathcal{M}_n)e^{-t}$ exists on which
$$\forall m \in \mathcal{M}_n \qquad \operatorname{pen}(m) \geq \frac{7}{2}\delta_n(\mathcal{F}_m; t), \qquad (17)$$
where $\delta_n(\mathcal{F}_m; t)$ is defined by (15) (and depends on both $K$ and $q$). Assume moreover that the models $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ are nested and $\hat{m} \in \arg\min_{m \in \mathcal{M}_n}\{P_n(\hat{f}_m) + \operatorname{pen}(m)\}$. Then, an event of probability at least $1 - [2 + (L + 1)\log_q(\frac{n}{t})]\operatorname{Card}(\mathcal{M}_n)e^{-t}$ exists on which, for every $\varepsilon \in (0, 1)$,
$$P(\hat{f}_{\hat{m}} - f^\star) \leq \frac{1}{1 - \varepsilon}\inf_{m \in \mathcal{M}_n}\left\{ \left(1 + \frac{2}{qK} + \varepsilon(1 + \sqrt{2})\right)P(f_m - f^\star) + \operatorname{pen}(m) + (1 + \sqrt{2})\min\left\{ \varphi_m^\star\left(\sqrt{\frac{2t}{\varepsilon^2 n}}\right), \sqrt{\frac{2t}{n}} \right\} + \frac{t}{3n} \right\}. \qquad (18)$$
In particular, this holds when $\operatorname{pen}(m) = \frac{7}{2}\hat{\delta}_n(\mathcal{F}_m; t)$, provided that $\hat{K}, \hat{c} > 0$ are larger than some constants depending only on $K$, $q$.

Corollary 1 is proved in Section 7.1.

Remark 2. One can always enlarge the constants $K$ and $q$, making the leading constant of the oracle inequality (18) closer to one, at the price of enlarging $\delta_n(\mathcal{F}_m; t)$ (hence $\operatorname{pen}(m)$ or $\hat{\delta}_n(\mathcal{F}_m; t)$). We do not know whether it is possible to make the leading constant closer to one without changing the penalization procedure itself. As we show in Section 5.2, there are distributions $P$ and collections of models $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ such that (18) is a strong improvement over the "uniform margin" case, in terms of prediction error. It seems reasonable to expect that this happens in a significant number of practical situations. In Section 5, we state a more general result (from which Theorem 1 is a corollary) that suggests why it is more difficult to prove Corollary 1 when $\varphi_m$ really depends on $m$.
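For a finite model, the fixed point (16) can be approximated directly by a grid search. The following sketch is illustrative only: the constants `K_hat`, `c_hat`, `q` and the grid are placeholders, not the theoretical values required by Corollary 1, and the model is represented by the matrix of its loss values on the sample.

```python
import numpy as np

def hat_delta_n(L, t, K_hat=2.0, c_hat=1.0, q=2.0, rng=None):
    """Grid-search sketch of the local Rademacher complexity of (16) for a
    finite model: L[j, i] = f_j(xi_i) in [0, 1]."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = L.shape[1]
    P_n = L.mean(axis=1)                      # empirical risks P_n(f_j)
    eps = rng.choice([-1.0, 1.0], size=n)     # one Rademacher realization

    def U_hat(delta):
        # empirical (c_hat * delta)-minimal set hat{F}_{n,m}(c_hat * delta)
        G = L[P_n - P_n.min() <= c_hat * delta]
        diffs = (G[:, None, :] - G[None, :, :]).reshape(-1, n)   # all pairs f - g
        D_hat = np.sqrt(np.max(np.mean(diffs ** 2, axis=1)))     # empirical diameter
        phi_hat = np.max(np.abs(diffs @ eps)) / n                # Rademacher modulus
        return K_hat * (phi_hat + D_hat * np.sqrt(t / n) + t / n)

    grid = np.logspace(-4, 0, 60)             # candidate values of delta
    ratios = np.array([U_hat(d) for d in grid]) / grid
    # suffix maxima give sup_{sigma >= delta} U_hat(sigma) / sigma on the grid
    ok = np.maximum.accumulate(ratios[::-1])[::-1] <= 1.0 / (2.0 * q)
    return float(grid[ok][0]) if ok.any() else float("inf")
```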
The general result of Section 5 is also useful to understand how the nestedness assumption might be relaxed in Theorem 1 and Corollary 1.

The reason why Corollary 1 implies strong margin adaptivity is that the local Rademacher complexities are not too large when the local margin condition is satisfied, together with a complexity assumption on $\mathcal{F}_m$. Indeed, there exists a distribution-dependent $\tilde{\delta}_n(\mathcal{F}_m; t)$ (defined as $\delta_n(\mathcal{F}_m; t)$ with $U_n(\mathcal{F}_m; \delta; t)$ replaced by $K_1 U_n(\mathcal{F}_m; K_2\delta; t)$ for some numerical constants $K_1, K_2 > 0$, related to $\hat{K}$ and $\hat{c}$) such that
$$\forall m \in \mathcal{M}_n \qquad \mathbb{P}(\tilde{\delta}_n(\mathcal{F}_m; t) \geq \hat{\delta}_n(\mathcal{F}_m; t) \geq \delta_n(\mathcal{F}_m; t)) \geq 1 - 5\log_q\left(\frac{n}{t}\right)e^{-t}.$$
(See Theorem 3 of [16].) This leads to several upper bounds on $\hat{\delta}_n(\mathcal{F}_m; t)$ under the local margin condition (9), by combining Lemma 5 of [16] with the examples of its Section 2.5. For instance, in the binary classification case, when $\mathcal{F}_m$ is the class of 0–1 loss functions associated with a VC-class $S_m$ of dimension $V_m$, such that the margin condition (9) holds with $\varphi_m(x) = h_m x^2$, we have for every $t > 0$ and $\varepsilon \in (0, 1]$,
$$\delta_n(\mathcal{F}_m; t) \leq \varepsilon P(f_m - f^\star) + \frac{K_3}{n h_m}\left[ \varepsilon^{-1}t + \varepsilon^{-2}V_m\ln\left(\frac{n\varepsilon^2 h_m}{K_4 V_m}\right) \right], \qquad (19)$$
where $K_3$ and $K_4$ depend only on $K$. (Similar upper bounds hold under several other complexity assumptions on the models $\mathcal{F}_m$; see [16].) In particular, when each model $S_m$ is a VC-class of dimension $V_m$, $\varphi_m(x) = h_m x^2$, $\operatorname{pen}(m) = \frac{7}{2}\hat{\delta}_n(\mathcal{F}_m; t)$ and $t = \ln(\operatorname{Card}(\mathcal{M}_n)) + 3\ln(n)$, (18) implies that
$$P(\hat{f}_{\hat{m}} - f^\star) \leq C\inf_{m \in \mathcal{M}_n}\left\{ P(f_m - f^\star) + \frac{\ln(\operatorname{Card}(\mathcal{M}_n)) + \ln(n) + V_m\ln(\mathrm{e}\,n h_m/V_m)}{n h_m} \right\}$$
with probability at least $1 - K n^{-2}$, for some numerical constants $C, K > 0$. Up to some $\ln(n)$ factor, this is a strong margin-adaptive model selection result, provided that $\operatorname{Card}(\mathcal{M}_n)$ is smaller than some power of $n$. Notice that the $\ln(n)$ factor is sometimes necessary (as shown by [23]), meaning that this upper bound is then optimal.

4. Lower bound for some non-nested models

In this section, we investigate the assumption in Theorem 1 that the models $\mathcal{F}_m$ are nested. To this aim, let us consider the case where models are singletons $\mathcal{F}_m = \{f_m\}$. Then, any estimator $\hat{f}_m \in \mathcal{F}_m$ is deterministic and equal to $f_m$, so that model selection amounts to selecting among a family $\{f_m \text{ s.t. } m \in \mathcal{M}_n\}$ of functions. Theorem 2 below shows that no selection procedure can be strongly margin-adaptive in general.

Theorem 2. Let $\gamma$ be the 0–1 loss and $\mathcal{F}_{0-1} := \{\gamma(u; \cdot) \text{ s.t. } u: \mathcal{X} \to \{0, 1\} \text{ is measurable}\}$ be the associated loss function class. If $\operatorname{Card}(\mathcal{X}) \geq 2$, two functions $f_0, f_1 \in \mathcal{F}_{0-1}$ and absolute constants $C_3, C_4 > 0$ exist such that the following holds. For every integer $n \geq 2$ and $\hat{m}$ a selection procedure (that is, a function $(\mathcal{X} \times \mathcal{Y})^n \to \mathcal{M} = \{0, 1\}$), a distribution $P$ exists such that
$$\mathbb{P}\left( P(f_{\hat{m}} - f^\star) \geq \frac{C_4\sqrt{n}}{\ln(n)}\min_{m \in \{0,1\}}\left\{ P(f_m - f^\star) + v(m) + \frac{\ln(n)}{n h_m} \right\} \right) \geq C_3 \qquad (20)$$
and
$$\mathbb{E}[P(f_{\hat{m}} - f^\star)] \geq \frac{C_3 C_4\sqrt{n}}{\ln(n)}\min_{m \in \{0,1\}}\left\{ P(f_m - f^\star) + v(m) + \frac{\ln(n)}{n h_m} \right\}, \qquad (21)$$
where, for every $m \in \{0, 1\}$,
$$v(m) := \sqrt{\frac{2\ln(n)\operatorname{var}_P(f_m - f^\star)}{n}} \qquad \text{and} \qquad h_m := \frac{P(f_m - f^\star)}{\operatorname{var}_P(f_m - f^\star)}.$$
Theorem 2 is proved in Section 7.2.
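The following Monte Carlo sketch (anticipating the explicit construction of $P_1$ in Section 7.2; it is illustrative and not part of the proof) suggests why even empirical minimization fails here: under $P_1$, the constant predictor $u_1$ has excess loss $(2n)^{-1}$, yet the sample prefers $u_0$, whose excess loss is of order $n^{-1/2}$, with probability bounded away from zero.

```python
import numpy as np

def simulate_erm_failure(n, reps=10000, seed=0):
    """Estimate how often empirical minimization between u_0 (predict 0) and
    u_1 (predict 1) picks the much worse u_0 under P_1 from Section 7.2,
    with alpha = 1/(2n) and h = 1/sqrt(2n)."""
    rng = np.random.default_rng(seed)
    alpha, h = 1.0 / (2 * n), 1.0 / np.sqrt(2 * n)
    picked_bad = 0
    for _ in range(reps):
        X_is_a = rng.random(n) < alpha            # True <=> X_i = a
        # P(Y = 1 | X = a) = 0 and P(Y = 1 | X = b) = 1/2 + h under P_1
        Y = np.where(X_is_a, 0.0, (rng.random(n) < 0.5 + h).astype(float))
        risk0 = np.mean(Y != 0)                   # P_n gamma(u_0)
        risk1 = np.mean(Y != 1)                   # P_n gamma(u_1)
        picked_bad += risk0 <= risk1              # ties broken in favor of u_0
    return picked_bad / reps

print(simulate_erm_failure(n=100))   # stays bounded away from 0 as n grows
```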
A straightforward corollary of Theorem 2 is that in the classification setting with the 0–1 loss, strong margin-adaptive model selection is not always possible when the models are not nested. Indeed, when $\mathcal{F}_m = \{f_m\}$ for every $m \in \mathcal{M}_n = \{0, 1\}$, (20) shows that for any model selection procedure $\hat{m}$, some distribution $P$ exists such that results like Theorem 1 or Corollary 1 do not hold if $t_m = \ln(n)$ for every $m$.

Remark 3.
1. Theorem 2 and its corollary for model selection also hold for randomized rules $\hat{m}: (\mathcal{X} \times \mathcal{Y})^n \to [0, 1]$ (where the value of $\hat{m}((X_i, Y_i)_{1 \leq i \leq n})$ is the probability assigned to the choice of $f_1$). Hence, aggregating models instead of selecting one does not modify the conclusion of Theorem 2.
2. The most reasonable selection procedure among two functions $f_0$ and $f_1$ (or two models $\{f_0\}$ and $\{f_1\}$) clearly is empirical minimization. The proof of Theorem 2 yields explicitly some distribution $P$, called $P_1$, such that (20) and (21) hold for empirical minimization. Note that when models are singletons, most penalization procedures coincide with empirical minimization, for instance, when $\operatorname{pen}(m)$ is proportional to the local Rademacher complexity $\hat{\delta}_n(\mathcal{F}_m; t)$, or to the ideal penalty $\operatorname{pen}_{\mathrm{id}}(m) = (P - P_n)(\hat{f}_m - f_m)$, its expectation or some quantile of $\operatorname{pen}_{\mathrm{id}}(m)$.
3. Theorem 2 focuses on margin adaptivity with $\varphi_m(x) = h_m x^2$, whereas the margin condition is also satisfied with other functions $\varphi_m$. This is both for simplicity reasons and because this choice emphasizes that one could hope for learning rates of order $1/(n h_m)$ if strong margin adaptivity were possible. The meaning of Theorem 2 is then mainly that one cannot guarantee to learn at a rate better than $1/\sqrt{n}$, whereas for some model, the excess loss and $1/(n h_m)$ both are of order $1/n$.
4. The counterexample given in the proof of Theorem 2 is highly non-asymptotic, since the distribution $P$ strongly depends on $n$. If $P$ and $f_0$, $f_1$ were fixed, it is well known that empirical minimization leads to asymptotic optimality, because $(f_m)_{m \in \{0,1\}}$ is finite and fixed when $n$ grows. This illustrates a significant difference between the asymptotic and non-asymptotic frameworks. Another example of such a difference occurs when the number of candidate functions (or models) is infinite, or grows to infinity with the sample size; see (iv) in Proposition 1 in Section 5.2.

With Theorem 1, we have proven a strong margin adaptivity result for nested models, which holds true when the penalty is built upon local Rademacher complexities. Therefore, adaptive model selection is attainable for nested models, whatever the distribution of the data. On the other hand, Theorem 2 gives a simple example where no model selection procedure can satisfy an oracle inequality (10) with a leading constant smaller than $C_4\sqrt{n}/\ln(n)$. Looking carefully at the selection problems considered in the proof of Theorem 2, it appears that the main reason why they are particularly tough is that we are quite "lucky" with one of the models: it has simultaneously a very small bias, a very small size and a large margin parameter, while other models with very similar appearance are much worse.
When looking for a more general strong margin adaptivity result, we must then keep in mind that this is a hopeless task in such situations.

Let us finally mention a related result in a close but slightly different framework. In the classification framework, under a global margin condition with $\varphi(x) \propto x^{2\kappa}$ with $\kappa \geq 1$, Theorem 3 in [18] shows that for any $M_n \geq 2$, a family $(u_m)_{m \in \mathcal{M}_n}$ of $M_n$ classifiers exists for which, for any selection procedure $\hat{m}$, some distribution $P$ exists such that
$$\mathbb{E}[P(f_{\hat{m}} - f^\star)] \geq \inf_{m \in \mathcal{M}_n}\{P(f_m - f^\star)\} + C\left(\frac{\ln(M_n)}{n}\right)^{\kappa/(2\kappa - 1)},$$
where $f_m = \gamma(u_m; \cdot)$ for some loss function $\gamma$. When $\hat{m}$ is (penalized) empirical minimization, the remainder term is shown to be as large as $C\sqrt{\ln(M_n)/n}$ when the margin condition holds with $\kappa > 1$. This result and Theorem 2 focus on different problems. In [18], the margin condition is only assumed to hold globally, and the focus is on the dependence of the remainder term on the cardinality $M_n$ of $\mathcal{M}_n$. Therefore, the counterexample given in [18] implies nothing about local margin conditions for $(f_m)_{m \in \mathcal{M}_n}$. Note that using these arguments, we could probably generalize Theorem 2 to a family of $M_n \geq 2$ functions and obtain a lower bound depending on $M_n$ as in [18].

5. General collections of models

As proven in Section 4, we cannot hope to obtain margin adaptivity without any assumption on either $P$ or the models. The purpose of this section is to explain what can still be proven in the general case, and why this is weaker than our Theorem 1.

5.1. A general oracle inequality

We start with a general result for penalties satisfying the lower bound (12).

Theorem 3. Let $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ be any countable family of models, and $(t_m)_{m \in \mathcal{M}_n}$ be any sequence of positive numbers. Let $\hat{m}$ be defined by (11) and assume that some $c \in (0, 1)$ exists such that
$$\forall m \in \mathcal{M}_n \qquad (1 - c)\operatorname{pen}(m) \geq (P - P_n)(\hat{f}_m - f_m) + \frac{t_m}{n} \geq 0 \qquad (22)$$
on an event of probability at least $1 - \eta$. Then, there exists an event of probability at least $1 - \eta - 2\sum_{m \in \mathcal{M}_n} e^{-t_m}$ on which the following holds: for every $\varepsilon \in (0, 1)$,
$$P(\hat{f}_{\hat{m}} - f^\star) \leq \frac{1}{1 - \varepsilon}\inf_{m \in \mathcal{M}_n}\left\{ P(f_m - f^\star) + \operatorname{pen}(m) + v(m) + \frac{t_m}{3n} \right\} + V_n, \qquad (23)$$
where
$$V_n := \frac{1}{1 - \varepsilon}\sup_{m \in \mathcal{M}_n}\{v(m) - \varepsilon P(f_m - f^\star) - c\operatorname{pen}(m)\} \qquad \text{and} \qquad v(m) := \sqrt{\frac{2 t_m \operatorname{var}_P(f_m - f^\star)}{n}}.$$

Theorem 3 is proved in Section 7.1. Let us make a few comments. First, without $V_n$, (23) is the kind of oracle inequality we are looking for, since the leading constant is close to 1 (provided $\varepsilon$ is small enough). For the sake of simplicity, assume that a margin condition (9) holds for every model $m \in \mathcal{M}_n$, with $\varphi_m(x) = h_m x^2$. Then,
$$v(m) \leq \sqrt{\frac{2 t_m P(f_m - f^\star)}{h_m n}} \leq \varepsilon P(f_m - f^\star) + \frac{t_m}{2\varepsilon h_m n}$$
for any $\varepsilon \in (0, 1)$. Hence, the first term of the right-hand side of (23) is smaller than
$$\frac{1 + \varepsilon}{1 - \varepsilon}\inf_{m \in \mathcal{M}_n}\left\{ P(f_m - f^\star) + \operatorname{pen}(m) + \frac{t_m}{2\varepsilon h_m n} + \frac{t_m}{3n} \right\},$$
which is the right-hand side of a margin-adaptive oracle inequality like (6) (at least when the penalty is itself of the right order). A similar result holds for a more general $\varphi_m$; see the proof of Theorem 1. Once we have a penalty satisfying (22) (for instance, a local Rademacher penalty), the main difficulty for proving a strong margin adaptivity result then lies in $V_n$.
It arises from the difference between the ideal penalty and the right-hand side of the lower bound (22), that is, $(P - P_n)(f_m)$. This random quantity is centered, and (up to a quantity independent of $m$) has deviations of order $v(m)$, Bernstein's inequality being unimprovable. Then, if $v(m)$ happens to be much larger than $P(f_m - f^\star) + \operatorname{pen}(m)$, $m$ is selected with a positive probability, whatever the value of $P(\hat{f}_m - f^\star)$. In that case, the expectation of $\hat{f}_{\hat{m}}$ is worse than the oracle by at least $v(m)$ (for any of these "bad" models). Hence, $V_n$ certainly is unavoidable in (23).

As shown by Theorem 2, $V_n$ can be much larger than the expectation of a strong margin-adaptive estimator. Nevertheless, $V_n$ is not always the main term on the right-hand side of (23). Let us now describe a set of favorable situations in which it is possible to prove that $V_n$ is small enough:

1. Models are nested, $t_m$ is non-decreasing (with respect to the inclusion ordering on $\mathcal{F}_m$), and pen satisfies the additional condition (13); see Section 3.
2. Models are nested, $t_m$ is non-decreasing and $v(m)$ is decreasing (or at least not increasing too much) when $\mathcal{F}_m$ increases. Indeed, let us fix $m, m^\star \in \mathcal{M}_n$ (think of $m^\star$ as a minimizer of the infimum on the right-hand side of (23)). When models are nested, either $\mathcal{F}_{m^\star} \subset \mathcal{F}_m$, so that $v(m) \leq \sup_{\mathcal{F}_{m^\star} \subset \mathcal{F}_{m'}}\{v(m')\}$, or $\mathcal{F}_m \subset \mathcal{F}_{m^\star}$, so that $\varphi_{m^\star} \leq \varphi_m$, hence $\varphi_m^\star \leq \varphi_{m^\star}^\star$. In the second case,
$$v(m) - \varepsilon P(f_m - f^\star) \leq \varphi_m^\star\left(\sqrt{\frac{2 t_m}{\varepsilon^2 n}}\right) \leq \varphi_{m^\star}^\star\left(\sqrt{\frac{2 t_m}{\varepsilon^2 n}}\right) \leq \varphi_{m^\star}^\star\left(\sqrt{\frac{2 t_{m^\star}}{\varepsilon^2 n}}\right)$$
since $t_m \leq t_{m^\star}$ and $\varphi_{m^\star}^\star$ is non-decreasing. As a consequence, for any $m^\star \in \mathcal{M}_n$,
$$V_n \leq \frac{1}{1 - \varepsilon}\max\left\{ \sup_{\mathcal{F}_{m^\star} \subset \mathcal{F}_{m'}}\{v(m')\}; \varphi_{m^\star}^\star\left(\sqrt{\frac{2 t_{m^\star}}{\varepsilon^2 n}}\right) \right\},$$
which is not too large provided that $v(m)$ never increases too much. Notice that we can understand assumption (13) as ensuring that the penalty compensates a possible increase of $v(m)$.
3. The oracle model prediction error does not decrease to zero faster than $n^{-1/2}$ and $t_m \leq t$. Indeed, the straightforward upper bound $v(m) \leq \sqrt{2 t_m/n}$ shows that $V_n \leq (1 - \varepsilon)^{-1}\sqrt{2t/n}$.
4. The margin condition does not depend on $m$ and $t_m \leq t$. Indeed, when $\varphi_m \equiv \varphi$ (or $\inf_m \varphi_m \geq \varphi$), we have
$$V_n \leq \frac{1}{1 - \varepsilon}\sup_{m \in \mathcal{M}_n}\left\{ \varphi_m^\star\left(\sqrt{\frac{2 t_m}{\varepsilon^2 n}}\right) \right\} \leq \frac{1}{1 - \varepsilon}\varphi^\star\left(\sqrt{\frac{2t}{\varepsilon^2 n}}\right).$$
5. The penalty satisfies $c\operatorname{pen}(m) \geq v(m)$ for every $m \in \mathcal{M}_n$, which can be ensured for instance by adding $c^{-1}v(m)$ (or an estimate of it) to a penalty satisfying (22). An example of this method is the one proposed by Koltchinskii [16] (Section 5.2), and in that case (23) coincides with his Theorem 6.

Points 3 and 4 above show that the challenging situations are the ones where the margin condition indeed depends on the model, and fast rates of estimation are attainable. We prove in Section 5.2 that such situations can occur, highlighting how our Theorem 1 is an improvement on existing results and their straightforward consequences. On the other hand, point 5 may seem contradictory with the negative results of Section 4. The explanation is that using $v(m)$ in the penalty means that $\hat{m}$ is not only a function of the data, but also of the unknown distribution $P$. Then it cannot be considered adaptive.
A more surprising consequence of this remark combined with Theorem 2 is that $v(m)$ cannot be estimated accurately enough uniformly over the set of all distributions $P$. Consider the proposal, in Section 5.1 of [16], to add
$$C\sqrt{\frac{t_m P_n(\hat{f}_m)}{n}}$$
to the penalty, which is sufficient to give a result like (14). The point is that such a penalty is generally much too large (at least for small models), which often results in an upper bound of order $n^{-1/2}$. In the examples we have in mind (as well as in the counterexamples of Section 4), the excess risk of the oracle is much smaller, typically of order $n^{-\beta}$ for some $\beta \in (1/2, 1]$.

5.2. The local margin conditions can be significantly tighter than the global one

In this section, we show that there exist challenging situations in which the margin condition holds for functions $\varphi_m$ strongly depending on $m$.

Proposition 1. Let $\kappa \in (1, +\infty)$ and assume that $\mathcal{X}$ is infinite. Let $\gamma$ be the 0–1 loss and $\mathcal{F}_{0-1} := \{\gamma(u; \cdot) \text{ s.t. } u: \mathcal{X} \to \{0, 1\} \text{ is measurable}\}$ be the associated loss function class. Then there exist a probability distribution $P$ on $\mathcal{X} \times \{0, 1\}$, a sequence $(f_j)_{j \in \mathbb{N}}$ of elements of $\mathcal{F}_{0-1}$ and positive constants $(C_i)_{5 \leq i \leq 7}$ (depending on $\kappa$ only) such that:

(i) $\forall k \in \mathbb{N}$, $P(f_{2k+1} - f^\star) = P(f_{2k} - f^\star) = b(k)$ and $2^{-k\kappa - 2} \leq b(k) \leq 2^{-k\kappa - 1}$.

(ii) The global margin condition (8) is satisfied over $\mathcal{F} = \mathcal{F}_{0-1}$ with $\varphi(x) = C_5 x^{2\kappa}$, and it is tight: $\forall k \in \mathbb{N}$, $\varphi(\sqrt{\operatorname{var}(f_{2k+1} - f^\star)}) \geq C_6 P(f_{2k+1} - f^\star)$.

(iii) A tighter local margin condition (9) holds over $\{f_{2k} \text{ s.t. } k \in \mathbb{N}\}$: $\forall k \in \mathbb{N}$, $P(f_{2k} - f^\star) \geq \operatorname{var}_P(f_{2k} - f^\star)$.

(iv) For every $m \in \mathbb{N}$, define $\mathcal{F}_m = \{f_m\}$ and consider the model selection problem among $(\mathcal{F}_m)_{0 \leq m \leq M_n}$ with $M_n \geq 2\log_2(n)$. Then, the right-hand side of a strong margin-adaptive oracle inequality of the form (10) is at most proportional to
$$\inf_{0 \leq 2k \leq M_n}\left\{ P(f_{2k} - f^\star) + \frac{\ln(n)}{n} \right\} \leq \frac{2\ln(n)}{n},$$
whereas the right-hand side of a global margin-adaptive oracle inequality is larger than $C_7 n^{-\kappa/(2\kappa - 1)} \gg \ln(n)/n$.

Proposition 1 is proved in Section 7.3. It gives an example of a model selection problem where strong margin adaptivity implies a faster rate of convergence than adaptivity to the global margin condition. Note that the same argument works with many other model selection problems, such as selecting among $(\{f_{2k+1} \text{ s.t. } 0 \leq k \leq m\})_{m \in \{1, \dots, (\ln(n))^2\}}$.
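The following numerical sketch (illustrative only; it implements the construction given in the proof in Section 7.3, with hypothetical parameter values) checks (i)–(iii): the local margin condition with $\varphi_{2k}(x) = x^2$ holds on even indices, while on odd indices only the global $\varphi(x) \propto x^{2\kappa}$ is tight.

```python
import numpy as np

def proposition1_check(kappa=2.0, k_max=8):
    """Compute the excesses and variances (34)-(36) of the construction in
    Section 7.3, with lambda = kappa - 1, p_k = 2^{-k-1}, delta_k = 2^{-k*lambda}
    and q_k = delta_k / (1 + delta_k)."""
    lam = kappa - 1.0
    for k in range(k_max):
        p_k, delta_k = 2.0 ** (-k - 1), 2.0 ** (-k * lam)
        q_k = delta_k / (1.0 + delta_k)
        exc_even = p_k * q_k                          # P(f_{2k} - f_star), see (34)
        var_even = exc_even - exc_even ** 2           # (36)
        exc_odd = delta_k * p_k * (1.0 - q_k)         # (34)
        var_odd = p_k * (1.0 - q_k) - exc_odd ** 2    # (35)
        print(k,
              exc_even >= var_even,                   # local condition (iii)
              round(exc_odd / var_odd ** kappa, 3))   # stays bounded: (ii) is tight

proposition1_check()
```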
6. Discussion

6.1. Other penalization procedures

We have focused in Section 3.2 on penalties defined in terms of local Rademacher complexities in order to prove that strong margin adaptivity is attainable for some data-driven penalties. An interesting question is whether such a result can be extended to penalties that can be computed faster. For instance, it is natural to think of estimating $\operatorname{pen}_{\mathrm{id}}(m)$ itself by resampling, instead of the local complexity $\delta_n(\mathcal{F}_m; t)$. Such penalties, with several kinds of resampling schemes, have been proposed in [2] and [3] and called "resampling penalties" (RP), generalizing the bootstrap penalty suggested by Efron [15]. Resampling penalties can be computed faster than local Rademacher complexities, because they are not defined as fixed points of the resampling estimate of a function. In particular, the $V$-fold penalties defined in [2] have the same computational cost as $V$-fold cross-validation. In addition, RP are easy to calibrate, since they depend on a single tuning parameter (the multiplicative factor in front of it), which can, for instance, be estimated from the data by using the "slope heuristics" (see [4]). On the contrary, local Rademacher complexities depend on two more constants, whose theoretical values are certainly too large for practical application.

Extending Corollary 1 to RP would require to prove that RP satisfy both assumptions (12) and (13). On the one hand, (12) means essentially that the penalty is larger than the expectation of the ideal penalty with large probability. Hence, one can conjecture that (12) holds for RP; a partial proof of (12) for RP in our general setting can be found in Chapter 7 of [1], together with an agenda for a complete proof, which seems to be a difficult theoretical problem. On the other hand, (13) seems less likely to hold for RP, and we may have to modify RP so that (13) can be satisfied in general. Proving such results would be quite interesting, since it would provide a strong margin-adaptive penalization procedure with a reasonably small computational cost.

6.2. Should we make collections of models nested?

A natural question coming from our results is whether one should make any collection of models nested before performing model selection, in order to improve performance. Let us consider the counterexample of Theorem 2 and look at what would happen if we make the models nested. Assume that $P = P_1$ is the distribution defined in the proof of Theorem 2. On the one hand, comparing $\{f_0\}$ and $\{f_0, f_1\}$, the model selection problem would be easy because the margin parameter $h_m$ is the same in both models, making the remainder term of order $n^{-1/2}$ (the remainder term $(n h_m)^{-1}$ can be replaced by $n^{-1/2}$ when $h_m \leq n^{-1/2}$ because of the upper bound $\operatorname{var}_P(f_m - f^\star) \leq 1/4$). And margin adaptivity is not challenging when the margin condition is merely not satisfied. On the other hand, when $P = P_1$, comparing $\{f_1\}$ and $\{f_0, f_1\}$ is more challenging because $f_1$ is really better than $f_0$. Here, contrary to the non-nested case, the large increase of the term $\operatorname{var}_P(f_m - f^\star)$ induces a similar increase in the $L_2(P_1)$ diameter of the class. Hence, local Rademacher complexities can detect it, as shown by Theorem 1.

To conclude, improving significantly the prediction performance of the final estimator by making the models nested requires some prior knowledge, such as a natural ordering between the (non-nested) models. Otherwise, Theorem 2 shows that choosing how to make the models nested, either from data or randomly, is not successful with probability at least $C_3 > 0$, whatever the sample size.

7. Proofs

7.1. Oracle inequalities

We give the proofs in a logical order, that is, first Theorem 3, then Theorem 1 (which is a corollary of it), and finally Corollary 1.

Proof of Theorem 3.
First, by definition of $\hat{m}$, for every $m \in \mathcal{M}_n$ we have
$$P_n(\hat{f}_{\hat{m}}) + \operatorname{pen}(\hat{m}) \leq P_n(\hat{f}_m) + \operatorname{pen}(m),$$
which can be rewritten as
$$P(\hat{f}_{\hat{m}} - f^\star) + (P_n - P)(\hat{f}_{\hat{m}} - f_{\hat{m}}) + (P_n - P)(f_{\hat{m}} - f^\star) + \operatorname{pen}(\hat{m})$$
$$\leq P(f_m - f^\star) + P_n(\hat{f}_m - f_m) + (P_n - P)(f_m - f^\star) + \operatorname{pen}(m)$$
$$\leq P(f_m - f^\star) + (P_n - P)(f_m - f^\star) + \operatorname{pen}(m).$$
On the event where (22) holds, we then have
$$P(\hat{f}_{\hat{m}} - f^\star) + (P_n - P)(f_{\hat{m}} - f^\star) + c\operatorname{pen}(\hat{m}) + \frac{t_{\hat{m}}}{n} \leq \inf_{m \in \mathcal{M}_n}\{P(f_m - f^\star) + (P_n - P)(f_m - f^\star) + \operatorname{pen}(m)\}. \qquad (24)$$
By Bernstein's inequality (see, e.g., Proposition 2.9 in [22]), for every $m \in \mathcal{M}_n$, there is an event of probability $1 - 2e^{-t_m}$ on which
$$|(P_n - P)(f_m - f^\star)| \leq v(m) + \frac{t_m}{3n}.$$
On the intersection of these events with the one on which (22) holds, we derive from (24) that
$$P(\hat{f}_{\hat{m}} - f^\star) - v(\hat{m}) + c\operatorname{pen}(\hat{m}) \leq \inf_{m \in \mathcal{M}_n}\left\{ P(f_m - f^\star) + \operatorname{pen}(m) + v(m) + \frac{t_m}{3n} \right\}.$$
For any $\varepsilon > 0$, the left-hand side is larger than
$$(1 - \varepsilon)P(\hat{f}_{\hat{m}} - f^\star) + \varepsilon P(f_{\hat{m}} - f^\star) + c\operatorname{pen}(\hat{m}) - v(\hat{m}) \geq (1 - \varepsilon)P(\hat{f}_{\hat{m}} - f^\star) - \sup_{m \in \mathcal{M}_n}\{v(m) - \varepsilon P(f_m - f^\star) - c\operatorname{pen}(m)\}.$$
The result follows. □

Proof of Theorem 1. We consider the event on which (23) holds. By Theorem 3, we know that it has probability at least $1 - \eta - 2\sum_{m \in \mathcal{M}_n} e^{-t_m}$. We first bound the first term on the right-hand side of (23). From (9), we have
$$\forall m \in \mathcal{M}_n \qquad v(m) \leq \sqrt{\frac{2 t_m}{n}}\,\varphi_m^{-1}(P(f_m - f^\star)).$$
Then, using that $xy \leq \varphi_m(x) + \varphi_m^\star(y)$ for every $x, y \geq 0$,
$$\forall m \in \mathcal{M}_n \qquad v(m) \leq \varphi_m^\star\left(\sqrt{\frac{2 t_m}{\varepsilon^2 n}}\right) + \varphi_m(\varepsilon\varphi_m^{-1}(P(f_m - f^\star))).$$
Since $\varphi_m$ is convex with $\varphi_m(0) = 0$, we have $\varphi_m(\lambda x) \leq \lambda\varphi_m(x)$ for every $\lambda \in (0, 1)$ and $x \geq 0$. Then, using also that $\operatorname{var}_P(f_m - f^\star) \leq 1$,
$$\forall m \in \mathcal{M}_n \qquad v(m) \leq \min\left\{ \sqrt{\frac{2 t_m}{n}}, \varphi_m^\star\left(\sqrt{\frac{2 t_m}{\varepsilon^2 n}}\right) + \varepsilon P(f_m - f^\star) \right\}, \qquad (25)$$
and the right-hand side of (23) is smaller than
$$\frac{1}{1 - \varepsilon}\inf_{m \in \mathcal{M}_n}\left\{ (1 + \varepsilon)P(f_m - f^\star) + \operatorname{pen}(m) + \min\left\{ \varphi_m^\star\left(\sqrt{\frac{2 t_m}{\varepsilon^2 n}}\right), \sqrt{\frac{2 t_m}{n}} \right\} + \frac{t_m}{3n} \right\} + V_n. \qquad (26)$$
It now remains to upper bound $V_n$. Let $m, m' \in \mathcal{M}_n$. Since the models $\mathcal{F}_m$ are nested, two cases can occur:

1. $\mathcal{F}_m \subset \mathcal{F}_{m'}$, which implies $t_m \leq t_{m'}$ and $\varphi_m \geq \varphi_{m'}$, hence $\varphi_m^\star \leq \varphi_{m'}^\star$. Using, in addition, (25) and that $\varphi_{m'}^\star$ is non-decreasing, we have
$$v(m) \leq \min\left\{ \sqrt{\frac{2 t_{m'}}{n}}, \varphi_{m'}^\star\left(\sqrt{\frac{2 t_{m'}}{\varepsilon^2 n}}\right) \right\} + \varepsilon P(f_m - f^\star).$$
2. $\mathcal{F}_{m'} \subset \mathcal{F}_m$. Using (13) and (25),
$$v(m) \leq C_1 v(m') + C_2 P(f_{m'} - f^\star) + c\operatorname{pen}(m) \leq C_1\min\left\{ \sqrt{\frac{2 t_{m'}}{n}}, \varphi_{m'}^\star\left(\sqrt{\frac{2 t_{m'}}{\varepsilon^2 n}}\right) \right\} + (C_2 + C_1\varepsilon)P(f_{m'} - f^\star) + c\operatorname{pen}(m).$$

Therefore,
$$V_n \leq \frac{1}{1 - \varepsilon}\inf_{m' \in \mathcal{M}_n}\left\{ \max\{1, C_1\}\min\left\{ \sqrt{\frac{2 t_{m'}}{n}}, \varphi_{m'}^\star\left(\sqrt{\frac{2 t_{m'}}{\varepsilon^2 n}}\right) \right\} + (C_2 + C_1\varepsilon)P(f_{m'} - f^\star) \right\}$$
and the result follows. □

Proof of Corollary 1. From [16] (Theorem 1 and (9.2) in the proof of its Lemma 2), we know that there exist numerical constants $K > 0$ and $q > 1$ such that (12) holds with $t_m = t$, $c = 5/7$ and $\eta = (L + 1)\log_q(\frac{n}{t})\operatorname{Card}(\mathcal{M}_n)e^{-t}$. In addition, Lemma 3 below shows that (13) holds with $C_1 = \sqrt{2}$ and $C_2 = 2/(Kq)$. The result follows from Theorem 1 with $t_m = t$. □

Lemma 3. Let $\mathcal{F}_{m'} \subset \mathcal{F}_m$ and $\delta_n$ be defined by (15). Then,
$$v(m) \leq 2\delta_n(\mathcal{F}_m; t) + \sqrt{2}v(m') + \frac{2P(f_{m'} - f^\star)}{qK}. \qquad (27)$$

Proof.
Since $\mathcal{F}_{m'} \subset \mathcal{F}_m$, $f_{m'} \in \mathcal{F}_m$ (as well as $f_m$), so that
$$D_P(\mathcal{F}_m; P(f_{m'} - f_m)) \geq \sqrt{P((f_m - f_{m'})^2)} \geq \sqrt{\operatorname{var}_P(f_m - f_{m'})} \geq \sqrt{\frac{\operatorname{var}_P(f_m - f^\star)}{2}} - \sqrt{\operatorname{var}_P(f_{m'} - f^\star)}. \qquad (28)$$
For the last inequality, we used that $\operatorname{var}(X) \leq 2\operatorname{var}(X + Y) + 2\operatorname{var}(Y)$ for any random variables $X$, $Y$, and the inequality $\sqrt{x + y} \leq \sqrt{x} + \sqrt{y}$ for every $x, y \geq 0$.

First, assume that the lower bound in (28) is non-positive. This implies
$$v(m) = \sqrt{\frac{2t\operatorname{var}_P(f_m - f^\star)}{n}} \leq \sqrt{2}v(m'),$$
so that (27) holds. Otherwise, the assumptions of Lemma 4 below hold with
$$D_0 = \sqrt{\frac{\operatorname{var}_P(f_m - f^\star)}{2}} - \sqrt{\operatorname{var}_P(f_{m'} - f^\star)} > 0$$
and $\sigma_0 = P(f_{m'} - f_m)$. We deduce from (29) that
$$\frac{v(m)}{2} - \frac{v(m')}{\sqrt{2}} \leq \delta_n(\mathcal{F}_m; t) + \frac{P(f_{m'} - f_m)}{qK} \leq \delta_n(\mathcal{F}_m; t) + \frac{P(f_{m'} - f^\star)}{qK},$$
and (27) also holds. □

Lemma 4. Let $\delta_n(\mathcal{F}_m; t)$ be defined by (15). Assume that there are some $D_0, \sigma_0 > 0$ such that $D_P(\mathcal{F}_m; \sigma_0) \geq D_0$. Then, we have the following lower bound:
$$\max\left\{ \delta_n(\mathcal{F}_m; t); \frac{\sigma_0}{qK} \right\} \geq D_0\sqrt{\frac{t}{n}}. \qquad (29)$$

Proof. First, (29) clearly holds when $\sigma_0/(qK) \geq D_0\sqrt{t/n}$. Otherwise, let $\sigma_1 = \max\{qK, 1\}D_0\sqrt{t/n} > \sigma_0$. From the definition of $U_n$, we have
$$\frac{U_n(\mathcal{F}_m; \sigma_1; t)}{\sigma_1} \geq \frac{K D_P(\mathcal{F}_m; \sigma_1)}{\sigma_1}\sqrt{\frac{t}{n}} \geq \frac{K D_0}{qK D_0\sqrt{t/n}}\sqrt{\frac{t}{n}} = \frac{1}{q} > \frac{1}{2q}.$$
Then, according to the definition (15) of $\delta_n(\mathcal{F}_m; t)$, $\delta_n(\mathcal{F}_m; t) \geq \sigma_1 \geq D_0\sqrt{t/n}$ and the result follows. □

7.2. Lower bounds (proof of Theorem 2)

For every $m \in \{0, 1\}$, let $f_m: (x, y) \mapsto \mathbf{1}_{y \neq m}$; $f_m \in \mathcal{F}_{0-1}$, since $f_m = \gamma(u_m; \cdot)$, where for every $x \in \mathcal{X}$, $u_m(x) = m$. Let $\alpha = (2n)^{-1}$ and $h = (2n)^{-1/2}$. Let $a \neq b$ be any two elements of $\mathcal{X}$. We define a probability distribution $P_1$ on $\mathcal{X} \times \{0, 1\}$ as follows: if $(X, Y) \sim P_1$, then
$$\mathbb{P}(X = a) = \alpha, \qquad \mathbb{P}(X = b) = 1 - \alpha, \qquad \mathbb{P}(Y = 1 \mid X = a) = 0 \qquad \text{and} \qquad \mathbb{P}(Y = 1 \mid X = b) = \frac{1}{2} + h.$$
We also define $P_0$ as the distribution of $(X, 1 - Y)$, where $(X, Y) \sim P_1$. In the following, for any distribution $Q$ on $\mathcal{X} \times \{0, 1\}$, we use the notation $\mathbb{P}_Q$ as a shortcut for $\mathbb{P}_{(X_i, Y_i)_{1 \leq i \leq n} \sim Q^{\otimes n}}$.

First, under distribution $P_1$, the Bayes predictor is $s = \mathbf{1}_b$,
$$P_1(f_0 - f^\star) = 2(1 - \alpha)h, \qquad P_1(f_1 - f^\star) = \alpha \qquad \text{and} \qquad \operatorname{var}_{P_1}(f_1 - f^\star) = \alpha - \alpha^2.$$
Hence,
$$\min_{m \in \{0,1\}}\left\{ P_1(f_m - f^\star) + v(m) + \frac{\ln(n)}{n h_m} \right\} \leq P_1(f_1 - f^\star) + v(1) + \frac{\ln(n)}{n h_1} \leq \alpha + \sqrt{\frac{2\alpha\ln(n)}{n}} + \frac{\ln(n)}{n} \leq \frac{2 + 3\ln(n)}{2n}.$$
Therefore, if $\mathbb{P}_{P_1}(\hat{m} = 0) \geq C_3$, then (20) holds when $P = P_1$, with $C_4 = 1/3$. Similarly, $\mathbb{P}_{P_0}(\hat{m} = 1) \geq C_3$ implies (20) with $P = P_0$ and $C_4 = 1/3$. So, in order to prove (20), we only need to prove that
$$\max_{j \in \{0,1\}}\{\mathbb{P}_{P_j}(\hat{m} = 1 - j)\} \geq C_3 > 0. \qquad (30)$$

The proof of (30) relies on three main facts. First,
$$\forall j \in \{0, 1\} \qquad \mathbb{P}_{P_j}(\forall i, X_i = b) = (1 - \alpha)^n = \left(1 - \frac{1}{2n}\right)^n \geq \frac{1}{2}. \qquad (31)$$
Second, for every $j \in \{0, 1\}$, under $P_j$, conditionally on $\{\forall i, X_i = b\}$, $\operatorname{Card}\{i \text{ s.t. } Y_i = 1\}$ is a binomial random variable with parameters $(n, p_j)$, where
$$p_j = \mathbb{P}_{(X,Y)\sim P_j}(Y = 1 \mid X = b) = \frac{1}{2} + (-1)^{j+1}h.$$
So, Lemma 5 shows that for every $j \in \{0, 1\}$ and every $k \in \mathbb{N} \cap [\frac{n}{2} - \sqrt{n}, \frac{n}{2} + \sqrt{n}]$,
$$\mathbb{P}_{P_j}(\operatorname{Card}\{i \text{ s.t. } Y_i = 1\} = k \mid \forall i, X_i = b) \geq \frac{C}{\sqrt{n}} > 0, \qquad (32)$$
where $C$ is an absolute constant. Third, let us define, for every $k \in \{0, \dots, n\}$,
$$\pi_k := \mathbb{P}_{P_U}(\hat{m}((X_i, Y_i)_{1 \leq i \leq n}) = 1 \mid \operatorname{Card}\{i \text{ s.t. } Y_i = 1\} = k \text{ and } \forall i, X_i = b),$$
where $P_U$ is the uniform distribution on $\{a, b\} \times \{0, 1\}$. A crucial remark is that $P_U$ can be replaced by either $P_0$ or $P_1$ in the definition of $\pi_k$, since the conditioning event determines $(X_i, Y_i)_{1 \leq i \leq n}$ up to the ordering of the observations; in the definition of $\pi_k$, the probability only refers to the ordering of the $(X_i, Y_i)$, and any product measure on $\mathcal{X} \times \{0, 1\}$ assigns equal probabilities to the $n!$ permutations of the $n$ observations. Note also that the definition of $\pi_k$ stays valid when $\hat{m}$ is a randomized selection rule, which proves the generalization of Theorem 2 pointed out in Remark 3.

For any given selection rule $\hat{m}$,
$$\operatorname{Card}\left\{ k \in \mathbb{N} \cap \left[\frac{n}{2} - \sqrt{n}, \frac{n}{2} + \sqrt{n}\right] \text{ s.t. } \pi_k > \frac{1}{2} \right\}$$
is either larger or smaller than $\sqrt{n}$. If it is larger, (31), (32) and the definition of the $\pi_k$ (with $P_0$ instead of $P_U$) show that
$$\mathbb{P}_{P_0}(\hat{m}((X_i, Y_i)_{1 \leq i \leq n}) = 1) \geq \sqrt{n} \times \frac{C}{\sqrt{n}} \times \frac{1}{2} = \frac{C}{2} = C_3 > 0,$$
so that (30) is satisfied. Otherwise, choosing $P_1$ instead of $P_0$ shows that (30) holds true. This proves (20), which clearly implies (21), since $P(f_{\hat{m}} - f^\star) \geq 0$ a.s.

A key tool in the proof of Theorem 2 is the following uniform lower bound on the density of the binomial distribution w.r.t. the counting measure on $\mathbb{N}$.

Lemma 5. For every $n \in \mathbb{N}$ and $p \in [0, 1]$, let $\mathcal{B}(n, p)$ denote the binomial distribution with parameters $(n, p)$. For every $a, b > 0$ and $c \in (0, 1/2)$, a positive constant $C(a, b, c)$ exists such that for any positive integer $n$,
$$\inf_{\substack{k \in \mathbb{N},\ |k - n/2| \leq \min\{a n^{1/2},\, n/2\} \\ |p - 1/2| \leq \min\{b n^{-1/2},\, c\}}}\{\sqrt{n}\,\mathbb{P}_{Z \sim \mathcal{B}(n,p)}(Z = k)\} \geq C(a, b, c) > 0. \qquad (33)$$

Proof. Let $n$, $k$, $p$ satisfy the above conditions, $Z \sim \mathcal{B}(n, p)$, and define
$$\eta := \frac{2k}{n} - 1, \qquad \delta := p - \frac{1}{2}.$$
The assumption on $k$ and $p$ becomes $|\eta| \leq \min\{2a n^{-1/2}, 1\}$ and $|\delta| \leq \min\{b n^{-1/2}, c\}$. In addition,
$$\mathbb{P}(Z = k) = p^k(1 - p)^{n - k}\binom{n}{k} = \left(\frac{1}{2} + \delta\right)^k\left(\frac{1}{2} - \delta\right)^{n - k}\frac{n!}{k!(n - k)!}.$$
We now use Stirling's formula: $\ln(n!) = n\ln(n) - n + \frac{1}{2}\ln(2\pi n) + \varepsilon_n$ for some sequence $\varepsilon_n \to 0$ as $n \to +\infty$ (one has $(12n + 1)^{-1} \leq \varepsilon_n \leq (12n)^{-1}$). Then,
$$\ln\mathbb{P}(Z = k) = k\ln\left(\frac{1}{2} + \delta\right) + (n - k)\ln\left(\frac{1}{2} - \delta\right) + \ln\frac{n!}{k!(n - k)!}$$
$$= \frac{n}{2}\left[(1 - \eta)\ln\left(\frac{1 - 2\delta}{1 - \eta}\right) + (1 + \eta)\ln\left(\frac{1 + 2\delta}{1 + \eta}\right)\right] - \frac{1}{2}\ln(n) + \frac{1}{2}\ln\left(\frac{2}{\pi}\right) - \frac{1}{2}\ln(1 - \eta^2) + \varepsilon_n - \varepsilon_k - \varepsilon_{n-k}.$$
Define $h: (-1, +\infty) \to \mathbb{R}$ by $h(x) := x^{-1}\ln(1 + x) - 1$, so that
$$\forall x > -1 \qquad \ln(1 + x) = x(1 + h(x)).$$
Recall that $|h(x)| \leq 2|x|$ as soon as $x \geq -1/2$, by the Taylor–Lagrange formula. In particular, $\lim_{x \to 0} h(x) = 0$. We then have
$$\ln\mathbb{P}(Z = k) = \frac{n}{2}[4\delta\eta - 2\eta^2 - 2\delta(1 - \eta)h(-2\delta) + \eta(1 - \eta)h(-\eta) + 2\delta(1 + \eta)h(2\delta) - \eta(1 + \eta)h(\eta)]$$
$$- \frac{1}{2}\ln(n) + \frac{1}{2}\ln\left(\frac{2}{\pi}\right) + \frac{\eta^2}{2}(1 + h(-\eta^2)) + \varepsilon_n - \varepsilon_k - \varepsilon_{n-k}.$$
Assuming that $n \geq n_0$ such that $\max\{a, b\}n^{-1/2} \leq 1/2$, it follows that
$$\ln\mathbb{P}(Z = k) = -\frac{1}{2}\ln(n) + R(k, n, p) \qquad \text{with} \qquad R(k, n, p) \geq -L(1 + a^2 + ab + b^2)$$
for some numerical constant $L > 0$, and this lower bound is uniform over $n \geq n_0$ and $k$, $p$ such that the conditions of the infimum in (33) are satisfied. On the other hand,
$$\inf_{n \leq n_0,\ 1 \leq k \leq n}\{\mathbb{P}_{Z \sim \mathcal{B}(n,p)}(Z = k)\} \geq K(p) > 0$$
as soon as $p \in (0, 1)$.
7.3. Proof of Proposition 1

Let $(x_j)_{j \in \mathbb{N}}$ be any infinite sequence of distinct elements of $\mathcal{X}$ and $\lambda > 0$ to be chosen later. We define $P$ as follows, denoting by $(X,Y)$ a pair of random variables with joint distribution $P$. For every $k \in \mathbb{N}$, $P(X = x_{2k}) = p_k q_k$ and $P(X = x_{2k+1}) = p_k (1 - q_k)$, where $p_k = 2^{-k-1}$ and $q_k \in [0,1]$ is to be chosen later; note that $\sum_{k \in \mathbb{N}} p_k = 1$. For every $k \in \mathbb{N}$, $P(Y = 1 \mid X = x_{2k}) = 0$ and $P(Y = 1 \mid X = x_{2k+1}) = (1 + \delta_k)/2$, where $\delta_k = 2^{-k\lambda}$. As a consequence, the Bayes predictor is $s := \mathbf{1}_{\{x_{2k+1} \text{ s.t. } k \in \mathbb{N}\}}$. Let us define, for every $j \in \mathbb{N}$,
$$u_j(x) := \begin{cases} s(x) & \text{if } x \ne x_j, \\ 1 - s(x) & \text{if } x = x_j, \end{cases} \qquad \text{and} \qquad f_j = \gamma(u_j; \cdot),$$
where $\gamma$ is the 0–1 loss. Then, for any $k \in \mathbb{N}$,
$$P(f_{2k+1} - f^\star) = \delta_k p_k (1 - q_k), \qquad P(f_{2k} - f^\star) = p_k q_k, \qquad (34)$$
$$\operatorname{var}_P(f_{2k+1} - f^\star) = p_k (1 - q_k) - (\delta_k p_k (1 - q_k))^2, \qquad (35)$$
$$\operatorname{var}_P(f_{2k} - f^\star) = p_k q_k - (p_k q_k)^2. \qquad (36)$$
We can now prove the four statements of Proposition 1.

(i) By (34), choosing $q_k = \delta_k / (1 + \delta_k)$ and $\lambda = \kappa - 1 > 0$ implies (i) with $b(k) = p_k q_k$.

(ii) For every $t \in (0,1)$,
$$P(|2\eta(X) - 1| \le t) = \sum_{k \in \mathbb{N}} P(X = x_{2k+1}) \mathbf{1}_{\delta_k \le t} \le \sum_{k \text{ s.t. } 2^{-k\lambda} \le t} 2^{-k-1} \le t^{1/\lambda}. \qquad (37)$$
By Lemma 9 of [9], (37) implies the global margin condition over $F_{0\text{–}1}$ with the function $\phi(x) = C_5 x^{2/(\lambda+1)}$, where $C_5$ only depends on $\lambda$. This implies the first part of (ii) since $\lambda = \kappa - 1 > 0$. For the second part, (35) implies that
$$\operatorname{var}_P(f_{2k+1} - f^\star) \ge p_k (1 - q_k)(1 - p_k) \ge \frac{p_k (1 - q_k)}{2} \ge \frac{p_k}{4} = 2^{-k-3},$$
hence the second part of (ii) holds with $C_6 = C_5 2^{2 - 3\kappa}$.

(iii) By (36), $\operatorname{var}_P(f_{2k} - f^\star) = p_k q_k (1 - p_k q_k) \le p_k q_k = P(f_{2k} - f^\star)$.

(iv) By (iii), for every $k \in \mathbb{N}$, a local margin condition holds on $F_{2k}$ with the function $\phi_{2k} : x \mapsto x^2$. So, the right-hand side of a strong margin-adaptive oracle inequality is at most (keeping only even values of $m$) proportional to
$$\inf_{0 \le k \le M_n/2} \left\{ P(f_{2k} - f^\star) + \frac{\ln(n)}{n} \right\} \le 2^{-\log_2(n) - 1} + \frac{\ln(n)}{n} = \frac{1}{2n} + \frac{\ln(n)}{n} \le \frac{2\ln(n)}{n}.$$
Note that the $\ln(n)$ factor may be replaced by a smaller quantity depending on the framework. The last statement on global margin adaptivity holds according to (ii), since $\phi^\star(x) = L(\kappa) x^{2\kappa/(2\kappa - 1)}$, where $L(\kappa) > 0$ only depends on $\kappa$. □
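To make the construction above concrete, here is a small numerical illustration of ours (with the arbitrary choice $\kappa = 2$, i.e., $\lambda = 1$, and $q_k = \delta_k/(1 + \delta_k)$ as in part (i)): it evaluates the excess losses (34) and variances (35)–(36), showing that the even models satisfy $\operatorname{var} \le \text{excess}$ (the local margin condition used in (iv)), while for the odd models the variance/excess ratio blows up like $2^{k\lambda}$.

```python
# Illustration of the construction in Section 7.3 (assumed choice: lambda = 1,
# i.e. kappa = 2, and q_k = delta_k / (1 + delta_k) as in part (i)).
lam = 1.0
for k in range(8):
    p_k = 2.0 ** (-k - 1)
    delta_k = 2.0 ** (-k * lam)
    q_k = delta_k / (1 + delta_k)
    excess_even = p_k * q_k                                       # eq. (34)
    excess_odd = delta_k * p_k * (1 - q_k)                        # eq. (34)
    var_even = p_k * q_k - (p_k * q_k) ** 2                       # eq. (36)
    var_odd = p_k * (1 - q_k) - (delta_k * p_k * (1 - q_k)) ** 2  # eq. (35)
    print(f"k={k}: even var/excess={var_even / excess_even:.3f}, "
          f"odd var/excess={var_odd / excess_odd:.3f}")
```

The even-model ratio stays below 1 for every $k$, whereas the odd-model ratio grows like $2^{k\lambda}$, which is exactly the contrast between statements (iii) and (ii) that makes strong margin adaptivity non-trivial here.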
Acknowledgements

The authors gratefully acknowledge the support of the NSF under awards DMS-0434383 and DMS-0707060. The first author's research was mostly carried out at Univ Paris-Sud (Laboratoire de Mathématiques, CNRS – UMR 8628), with the additional support of Inria Saclay (Select Project). The authors would also like to thank an anonymous referee for numerous comments that improved the presentation and some of the results of the paper.

References

[1] Arlot, S. (2007). Resampling and Model Selection. PhD thesis, University Paris-Sud 11, December 2007. Available at http://tel.archives-ouvertes.fr/tel-00198803/en/.
[2] Arlot, S. (2008). V-fold cross-validation improved: V-fold penalization. Available at arXiv:0802.0566v2.
[3] Arlot, S. (2009). Model selection by resampling penalization. Electron. J. Stat. 3 557–624 (electronic). MR2519533
[4] Arlot, S. and Massart, P. (2009). Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res. 10 245–279 (electronic).
[5] Audibert, J.-Y. (2004). Classification under polynomial entropy and margin assumptions and randomized estimators. Laboratoire de Probabilités et Modèles Aléatoires. Preprint.
[6] Audibert, J.-Y. and Tsybakov, A.B. (2007). Fast learning rates for plug-in classifiers. Ann. Statist. 35 608–633. MR2336861
[7] Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301–413. MR1679028
[8] Bartlett, P.L., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist. 33 1497–1537. MR2166554
[9] Bartlett, P.L., Jordan, M.I. and McAuliffe, J.D. (2006). Convexity, classification, and risk bounds. J. Amer. Statist. Assoc. 101 138–156. MR2268032
[10] Bartlett, P.L., Mendelson, S. and Philips, P. (2004). Local complexities for empirical risk minimization. In Learning Theory. Lecture Notes in Comput. Sci. 3120 270–284. Berlin: Springer. MR2177915
[11] Birgé, L. and Massart, P. (1998). Minimum contrast estimators on sieves: Exponential bounds and rates of convergence. Bernoulli 4 329–375. MR1653272
[12] Blanchard, G., Lugosi, G. and Vayatis, N. (2004). On the rate of convergence of regularized boosting classifiers. J. Mach. Learn. Res. 4 861–894. MR2076000
[13] Blanchard, G. and Massart, P. (2006). Discussion: "Local Rademacher complexities and oracle inequalities in risk minimization" [Ann. Statist. 34 (2006) 2593–2656] by V. Koltchinskii. Ann. Statist. 34 2664–2671. MR2329460
[14] Devroye, L. and Lugosi, G. (1995). Lower bounds in pattern recognition and learning. Pattern Recognition 28 1011–1018.
[15] Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-validation. J. Amer. Statist. Assoc. 78 316–331. MR0711106
[16] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593–2656. MR2329442
[17] Lecué, G. (2007). Simultaneous adaptation to the margin and to complexity in classification. Ann. Statist. 35 1698–1721. MR2351102
[18] Lecué, G. (2007). Suboptimality of penalized empirical risk minimization in classification. In COLT 2007. Lecture Notes in Artificial Intelligence 4539. Berlin: Springer. MR2397584
[19] Lugosi, G. (2002). Pattern classification and learning theory. In Principles of Nonparametric Learning (Udine, 2001). CISM Courses and Lectures 434 1–56. Vienna: Springer. MR1987656
[20] Lugosi, G. and Wegkamp, M. (2004). Complexity regularization via localized random penalties. Ann. Statist. 32 1679–1697. MR2089138
[21] Mammen, E. and Tsybakov, A.B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808–1829. MR1765618
[22] Massart, P. (2003). Concentration Inequalities and Model Selection. Lecture Notes in Mathematics 1896. Berlin: Springer. MR2319879
[23] Massart, P. and Nédélec, É. (2006). Risk bounds for statistical learning. Ann. Statist. 34 2326–2366. MR2291502
[24] Tsybakov, A.B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135–166. MR2051002
[25] Tsybakov, A.B. and van de Geer, S.A. (2005). Square root penalty: Adaptation to the margin in classification and in edge estimation. Ann. Statist. 33 1203–1224. MR2195633
[26] Vapnik, V.N. (1998). Statistical Learning Theory. New York: Wiley. MR1641250
[27] Vapnik, V.N. and Červonenkis, A.J. (1971). The uniform convergence of frequencies of the appearance of events to their probabilities. (Russian. English summary.) Teor. Verojatnost. i Primenen. 16 264–279. MR0288823
