Margin-adaptive model selection in statistical learning

A classical condition for fast learning rates is the margin condition, first introduced by Mammen and Tsybakov. We tackle in this paper the problem of adaptivity to this condition in the context of model selection, in a general learning framework. Actually, we consider a weaker version of this condition that allows one to take into account that learning within a small model can be much easier than within a large one. Requiring this "strong margin adaptivity" makes the model selection problem more challenging. We first prove, in a general framework, that some penalization procedures (including local Rademacher complexities) exhibit this adaptivity when the models are nested. Contrary to previous results, this holds with penalties that only depend on the data. Our second main result is that strong margin adaptivity is not always possible when the models are not nested: for every model selection procedure (even a randomized one), there is a problem for which it does not demonstrate strong margin adaptivity.

Authors: Sylvain Arlot, Peter L. Bartlett

Bernoulli 17 (2), 2011, 687–713. DOI: 10.3150/10-BEJ288

¹ CNRS, Willow Project-Team, Laboratoire d'Informatique de l'École Normale Supérieure (CNRS/ENS/INRIA UMR 8548), 23, avenue d'Italie, CS 81321, 75214 Paris Cedex 13, France. E-mail: sylvain.arlot@ens.fr
² Computer Science Division and Department of Statistics, University of California, Berkeley, 367 Evans Hall #3860, Berkeley, CA 94720-3860, USA. E-mail: bartlett@cs.berkeley.edu

Keywords: adaptivity; empirical minimization; empirical risk minimization; local Rademacher complexity; margin condition; model selection; oracle inequalities; statistical learning

1. Introduction

We consider in this paper the model selection problem in a general framework. Since our main motivation comes from the supervised binary classification setting, we focus on this framework in this introduction. Section 2 introduces the natural generalization to empirical (risk) minimization problems, which we consider in the remainder of the paper.

We observe independent realizations $(X_i, Y_i) \in \mathcal{X} \times \mathcal{Y}$ for $i = 1, \dots, n$ of a random variable with distribution $P$, where $\mathcal{Y} = \{0, 1\}$. The goal is to build a (data-dependent) predictor $t$ (i.e., a measurable function $\mathcal{X} \to \mathcal{Y}$) such that $t(X)$ is as often as possible equal to $Y$, where $(X, Y) \sim P$ is independent from the data. This is the prediction problem, in the setting of supervised binary classification. In other words, the goal is to find $t$ minimizing the prediction error
$$P\gamma(t; \cdot) := \mathbb{P}_{(X,Y)\sim P}(t(X) \neq Y),$$
where $\gamma$ is the 0–1 loss.

The minimizer $s$ of the prediction error, when it exists, is called the Bayes predictor. Define the regression function $\eta(X) = \mathbb{P}_{(X,Y)\sim P}(Y = 1 \mid X)$. Then, a classical argument shows that $s(X) = \mathbf{1}_{\eta(X) \geq 1/2}$. However, $s$ is unknown, since it depends on the unknown distribution $P$. Our goal is to build from the data some predictor $t$ minimizing the prediction error, or equivalently the excess loss $\ell(s, t) := P\gamma(t) - P\gamma(s)$.
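To make these definitions concrete, here is a minimal numerical sketch on a toy finite $\mathcal{X}$ (the distribution values are hypothetical, chosen only for illustration): the Bayes predictor thresholds the regression function $\eta$ at $1/2$, and any competitor has non-negative excess loss.

```python
import numpy as np

# Toy joint distribution P on X x {0,1} with X = {0, 1, 2} (hypothetical values):
# p_x[j] = P(X = x_j), eta[j] = P(Y = 1 | X = x_j).
p_x = np.array([0.5, 0.3, 0.2])
eta = np.array([0.9, 0.4, 0.55])

def prediction_error(t):
    """P gamma(t) = P(t(X) != Y) for a predictor given as t[j] = t(x_j)."""
    # P(t(X) != Y | X = x_j) equals eta_j if t_j = 0, and 1 - eta_j if t_j = 1.
    return float(np.sum(p_x * np.where(t == 1, 1.0 - eta, eta)))

s = (eta >= 0.5).astype(int)            # Bayes predictor s(x) = 1_{eta(x) >= 1/2}
t = np.array([1, 1, 0])                 # an arbitrary competitor
excess_loss = prediction_error(t) - prediction_error(s)   # ell(s, t) >= 0
print(prediction_error(s), excess_loss)
```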
A classical approach to the prediction problem is empirical risk minimization. Let $P_n = n^{-1}\sum_{i=1}^n \delta_{(X_i, Y_i)}$ be the empirical measure and $S_m$ be any set of predictors, which is called a model. The empirical risk minimizer over $S_m$ is then defined as
$$\hat{s}_m \in \arg\min_{t \in S_m} P_n\gamma(t) = \arg\min_{t \in S_m} \left\{ \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{t(X_i) \neq Y_i} \right\}.$$
We expect that the risk of $\hat{s}_m$ is close to that of $s_m \in \arg\min_{t \in S_m} P\gamma(t)$, assuming that such a minimizer exists.

1.1. Margin condition

Depending on some properties of $P$ and the complexity of $S_m$, the prediction error of $\hat{s}_m$ is more or less distant from that of $s_m$. For instance, when $S_m$ has a finite Vapnik–Chervonenkis dimension $V_m$ [26, 27] and $s \in S_m$, it has been proven (see, e.g., [19]) that
$$\mathbb{E}[\ell(s, \hat{s}_m)] \leq C\sqrt{\frac{V_m}{n}}$$
for some numerical constant $C > 0$. This is optimal without any assumption on $P$, in the minimax sense: no estimator can have a smaller prediction risk uniformly over all distributions $P$ such that $s \in S_m$, up to the numerical factor $C$ [14].

However, there exist favorable situations where much smaller prediction errors ("fast rates", up to $n^{-1}$ instead of $n^{-1/2}$) can be obtained. A sufficient condition, the so-called "margin condition", has been introduced by Mammen and Tsybakov [21]. If, for some $\varepsilon_0, C_0 > 0$ and $\alpha \geq 1$,
$$\forall \varepsilon \in (0, \varepsilon_0] \qquad \mathbb{P}(|2\eta(X) - 1| \leq \varepsilon) \leq C_0 \varepsilon^\alpha, \qquad (1)$$
if the Bayes predictor $s$ belongs to $S_m$, and if $S_m$ is a VC-class of dimension $V_m$, then the prediction error of $\hat{s}_m$ is smaller than $L(C_0, \varepsilon_0, \alpha)\ln(n)(V_m/n)^{\kappa/(2\kappa-1)}$ in expectation, where $\kappa = (1+\alpha)/\alpha$ and $L(C_0, \varepsilon_0, \alpha) > 0$ only depends on $C_0$, $\varepsilon_0$ and $\alpha$. Corresponding minimax lower bounds [23] and other upper bounds can be obtained under other complexity assumptions (e.g., Assumption (A2) of Tsybakov [24], involving bracketing entropy). In the extreme situation where $\alpha = +\infty$, that is, for some $h > 0$,
$$\mathbb{P}(|2\eta(X) - 1| \leq h) = 0, \qquad (2)$$
then the same result holds with $\kappa = 1$ and $L(h) \propto h^{-1}$. More precisely, as proved in [23],
$$\mathbb{E}[\ell(s, \hat{s}_m)] \leq C\min\left\{ \frac{V_m(1 + \ln(n h^2 V_m^{-1}))}{nh}, \sqrt{\frac{V_m}{n}} \right\}.$$

Following the approach of Koltchinskii [16], we will consider the following generalization of the margin condition:
$$\forall t \in S \qquad \ell(s, t) \geq \varphi\left(\sqrt{\operatorname{var}_P(\gamma(t; \cdot) - \gamma(s; \cdot))}\right), \qquad (3)$$
where $S$ is the set of predictors, and $\varphi$ is a convex non-decreasing function on $[0, \infty)$ with $\varphi(0) = 0$. Indeed, the proofs of the above upper bounds on the prediction error of $\hat{s}_m$ use only that (1) implies (3) with $\varphi(x) = L(C_0, \varepsilon_0, \alpha)x^{2\kappa}$ and $\kappa = (1+\alpha)/\alpha$, and that (2) implies (3) with $\varphi(x) = h x^2$. (See, e.g., Proposition 1 in [24].)

All these results show that the empirical risk minimizer is adaptive to the margin condition, since it leads to an optimal excess risk under various assumptions on the complexity of $S_m$. However, obtaining such rates of estimation requires knowledge of some $S_m$ to which the Bayes predictor belongs, which is a strong assumption. A less restrictive framework is the following. First, we do not assume that $s \in S_m$.
Second, we do not assume that the margin condition (3) is satisfied for all $t \in S$, but only for $t \in S_m$, which can be seen as a "local" margin condition:
$$\forall t \in S_m \qquad \ell(s, t) \geq \varphi_m\left(\sqrt{\operatorname{var}_P(\gamma(t; \cdot) - \gamma(s; \cdot))}\right), \qquad (4)$$
where $\varphi_m$ is a convex non-decreasing function on $[0, \infty)$ with $\varphi_m(0) = 0$. The fact that $\varphi_m$ can depend on $m$ allows situations where we are lucky to have a strong margin condition for some small models but the global margin condition is loose. As proven in Section 5.2 (Proposition 1), such situations certainly exist. Note that when $\varphi_m(x) = h_m x^2$, (3) and (4) can be traced back to mean–variance conditions on $\gamma$ that were used in several papers for deriving convergence rates of some minimum contrast estimators on some given model $S_m$ (see, e.g., [11] and references therein).

1.2. Adaptive model selection

Assume now that we are not given a single model but a whole family $(S_m)_{m \in \mathcal{M}_n}$. By empirical risk minimization, we obtain a family $(\hat{s}_m)_{m \in \mathcal{M}_n}$ of predictors, from which we would like to select some $\hat{s}_{\hat{m}}$ with a prediction error $P\gamma(\hat{s}_{\hat{m}})$ as small as possible. The aim of such a model selection procedure $((X_1, Y_1), \dots, (X_n, Y_n)) \mapsto \hat{m} \in \mathcal{M}_n$ is to satisfy an oracle inequality of the form
$$\ell(s, \hat{s}_{\hat{m}}) \leq C\inf_{m \in \mathcal{M}_n}\{\ell(s, s_m) + R_{m,n}\}, \qquad (5)$$
where the leading constant $C \geq 1$ should be close to one and the remainder term $R_{m,n}$ should be close to $P\gamma(\hat{s}_m) - P\gamma(s_m)$. Typically, one proves that (5) holds either in expectation, or with high probability.

Assume for instance that $\varphi_m(x) = h_m x^2$ for some $h_m > 0$ and $S_m$ has a finite VC-dimension $V_m \geq 1$. In view of the aforementioned minimax lower bounds of [23], one cannot hope in general to prove an oracle inequality (5) with a remainder $R_{m,n}$ smaller than
$$\min\left\{ \frac{\ln(n)V_m}{n h_m}, \sqrt{\frac{V_m}{n}} \right\},$$
where the $\ln(n)$ term may only be necessary for some VC classes $S_m$ (see [23]). Then, adaptive model selection occurs when $\hat{m}$ satisfies an oracle inequality (5) with $R_{m,n}$ of the order of this minimax lower bound. More generally, let $\mathcal{C}_m$ be some complexity measure of $S_m$ (e.g., its VC-dimension, or the $\rho$ appearing in Tsybakov's assumption [24]). Then, define $R_n(\mathcal{C}_m, \varphi_m)$ as the minimax prediction error over the set of distributions $P$ such that $s \in S_m$ and the local margin condition (4) is satisfied in $S_m$ with $\varphi_m$, where $S_m$ has a complexity at most $\mathcal{C}_m$. Massart and Nédélec [23] have proven tight upper and lower bounds on $R_n(\mathcal{C}_m, \varphi_m)$ with several complexity measures; their results are stated with the margin condition (3), but they actually use its local version (4) only. A margin-adaptive model selection procedure should satisfy an oracle inequality of the form
$$\ell(s, \hat{s}_{\hat{m}}) \leq C\inf_{m \in \mathcal{M}_n}\{\ell(s, s_m) + R_n(\mathcal{C}_m, \varphi_m)\} \qquad (6)$$
without using the knowledge of $\mathcal{C}_m$ and $\varphi_m$. We call this property "strong margin adaptivity", to emphasize the fact that this is more challenging than adaptivity to a margin condition that holds uniformly over the models.

1.3. Penalization

We focus in particular in this paper on penalization procedures, which are defined as follows. Let $\operatorname{pen}: \mathcal{M}_n \to [0, \infty)$ be a (data-dependent) function, and define
$$\hat{m} \in \arg\min_{m \in \mathcal{M}_n}\{P_n\gamma(\hat{s}_m) + \operatorname{pen}(m)\}.$$
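A schematic sketch of this penalized selection rule follows; the models are represented as finite lists of candidate predictors, and the names and toy data are illustrative, not from the paper.

```python
import numpy as np

def empirical_risk(t, X, Y):
    """P_n gamma(t) for the 0-1 loss; t maps an array of features to {0,1}."""
    return float(np.mean(t(X) != Y))

def select_model(models, penalties, X, Y):
    """For each model S_m (a finite list of predictors here), compute the
    empirical risk minimizer s_hat_m, then pick m_hat minimizing
    P_n gamma(s_hat_m) + pen(m)."""
    erm = {m: min(S_m, key=lambda t: empirical_risk(t, X, Y))
           for m, S_m in models.items()}
    m_hat = min(models,
                key=lambda m: empirical_risk(erm[m], X, Y) + penalties[m])
    return m_hat, erm[m_hat]

# Hypothetical usage: two small models and constant penalties.
X = np.array([0, 1, 1, 0]); Y = np.array([0, 1, 0, 0])
f0 = lambda x: np.zeros_like(x)        # always predict 0
f1 = lambda x: x                       # predict the feature itself
m_hat, s_hat = select_model({0: [f0], 1: [f0, f1]}, {0: 0.0, 1: 0.1}, X, Y)
print(m_hat)
```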
Since our goal is to minimize the prediction error of $\hat{s}_m$, the ideal penalty would be
$$\operatorname{pen}_{\mathrm{id}}(m) := P\gamma(\hat{s}_m) - P_n\gamma(\hat{s}_m), \qquad (7)$$
but it is unknown because it depends on the distribution $P$. A classical way of designing a penalty is to estimate $\operatorname{pen}_{\mathrm{id}}(m)$, or at least a tight upper bound on it. We consider in particular local complexity measures [8, 10, 16, 20], because they estimate $\operatorname{pen}_{\mathrm{id}}$ tightly enough to achieve fast estimation rates when the margin condition holds true. See Section 3.2 for a detailed definition of these penalties.

1.4. Related results

There is a considerable wealth of literature on margin adaptivity in the context of model selection as well as model aggregation. Most of the papers consider the uniform margin condition, that is, when $\varphi_m \equiv \varphi$. Barron, Birgé and Massart [7] have proven oracle inequalities for deterministic penalties under some mean–variance condition on $\gamma$ close to (3) with $\varphi(x) = h x^2$. Following a similar approach, margin-adaptive oracle inequalities (with more general $\varphi$) have been proven with localized random penalties [8, 10, 16, 20] and with other penalties in a particular framework [25]. Adaptivity to the margin has also been considered with a regularized boosting method [12], the hold-out [13] and in a PAC-Bayes framework [5]. Aggregation methods have been studied in [24] and [17]. Notice also that a completely different approach is possible: estimate first the regression function $\eta$ (possibly through model selection), then use a plug-in classifier; this works provided $\eta$ is smooth enough [6].

It is quite unclear whether any of these results can be extended to strong margin adaptivity (actually, we will prove that this needs additional restrictions in general). To our knowledge, the only results allowing $\varphi_m$ to depend on $m$ can be found in [16]. First, when the models are nested, a comparison method based on local Rademacher complexities attains strong margin adaptivity, assuming that $s \in \bigcup_{m \in \mathcal{M}_n} S_m$ (Theorem 7; and it is quite unclear whether this still holds without the latter assumption). Second, a penalization method based on local Rademacher complexities has the same property in the general case, but it uses the knowledge of $(\varphi_m)_{m \in \mathcal{M}_n}$ (Theorems 6 and 11). Our claim is that when $\varphi_m$ does strongly depend on $m$, it is crucial to take it into account to choose the best model in $\mathcal{M}_n$. And such situations occur, as proven by our Proposition 1 in Section 5.2. But assuming either $s \in \bigcup_{m \in \mathcal{M}_n} S_m$ or that $\varphi_m$ is known is not realistic. Our goal is to investigate the kind of results that can be obtained with completely data-driven procedures; in particular, when $s \notin \bigcup_{m \in \mathcal{M}_n} S_m$.

1.5. Our results

In this paper, we aim at understanding when strong margin adaptivity can be obtained for data-dependent model selection procedures. Notice that we do not restrict ourselves to the classification setting. We consider a much more general framework (as in [16]), which is described in Section 2.

We prove two kinds of results. First, when models are nested, we show that some penalization methods are strongly margin adaptive (Theorem 1). In particular, this result holds for the local Rademacher complexities (Corollary 1).
Compared to previous results (in particular the ones of [16]), our main advance is that our penalties do not require the knowledge of $(\varphi_m)_{m \in \mathcal{M}_n}$, and we do not assume that the Bayes predictor belongs to any of the models.

Our second result probes the limits of strong margin adaptivity, without the nestedness assumption. A family of models exists such that, for every sample size $n$ and every (model) selection procedure $\hat{m}$, a distribution $P$ exists for which $\hat{m}$ fails to be strongly margin adaptive with a positive probability (Theorem 2). Hence, the previous positive results (Theorem 1 and Corollary 1) cannot be extended outside of the nested case for a general distribution $P$.

Where is the boundary between these two extremes? Obviously, the nestedness assumption is not necessary. For instance, when the global margin assumption is indeed tight ($\varphi = \varphi_m$ for every $m \in \mathcal{M}_n$), margin adaptivity can be obtained in several ways, as mentioned in Section 1.4. We sketch in Section 5 some situations where strong margin adaptivity is possible. More precisely, we state a general oracle inequality (Theorem 3), valid for any family of models and any distribution $P$. We then discuss assumptions under which its remainder term is small enough to imply strong margin adaptivity.

This paper is organized as follows. We describe the general setting in Section 2. We consider in Section 3 the nested case, in which strong margin adaptivity holds. Negative results (i.e., lower bounds on the prediction error of a general model selection procedure) are stated in Section 4. The line between these two situations is sketched in Section 5. We discuss our results in Section 6. All the proofs are given in Section 7.

2. The general empirical minimization framework

Although our main motivation comes from the classification problem, it turns out that all our results can be proven in the general setting of empirical minimization. As explained below, this setting includes binary classification with the 0–1 loss, bounded regression and several other frameworks. In the rest of the paper, we will use the following general notation, in order to emphasize the generality of our results.

We observe independent realizations $\xi_1, \dots, \xi_n \in \Xi$ of a random variable with distribution $P$, and we are given a set $\mathcal{F}$ of measurable functions $\Xi \to [0, 1]$. Our goal is to build some (data-dependent) $f$ such that its expectation $P(f) := \mathbb{E}_{\xi \sim P}[f(\xi)]$ is as small as possible. For the sake of simplicity, we assume that there is a minimizer $f^\star$ of $P(f)$ over $\mathcal{F}$.

This includes the prediction framework, in which $\Xi = \mathcal{X} \times \mathcal{Y}$, $\xi_i = (X_i, Y_i)$ and $\mathcal{F} := \{\xi \mapsto \gamma(t; \xi) \text{ s.t. } t \in S\}$, where $\gamma: S \times \Xi \to [0, 1]$ is any contrast function. Then, $f^\star$ is equal to $\gamma(s; \cdot)$, where $s$ is the Bayes predictor. In the binary classification framework, $\mathcal{Y} = \{0, 1\}$ and we can take the 0–1 contrast $\gamma(t; (x, y)) = \mathbf{1}_{t(x) \neq y}$, for instance. We then recover the setting described in Section 1. In the bounded regression framework, assuming that $\mathcal{Y} = [0, 1]$, we can take the least-squares contrast, $\gamma(t; (x, y)) = (t(x) - y)^2$. Many other contrast functions $\gamma$ can be considered, provided that they take their values in $[0, 1]$. Notice the one-to-one correspondence between predictors $t$ and functions $f_t := \gamma(t; \cdot)$ in the prediction framework.
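The following schematic sketch (illustrative names and data, not part of the paper) shows how this correspondence lets the same empirical-minimization code cover both classification and bounded regression.

```python
import numpy as np

def loss_function(t, gamma):
    """f_t = gamma(t; .): turn a predictor t into an element of F,
    i.e., a [0,1]-valued function of xi = (x, y)."""
    return lambda x, y: gamma(t, x, y)

gamma_01 = lambda t, x, y: float(t(x) != y)    # 0-1 contrast (classification)
gamma_ls = lambda t, x, y: (t(x) - y) ** 2     # least-squares contrast, Y in [0,1]

def P_n(f, sample):
    """Empirical measure: P_n(f) = n^{-1} sum_i f(xi_i) with xi_i = (x_i, y_i)."""
    return float(np.mean([f(x, y) for (x, y) in sample]))

# The same P_n applies verbatim to both contrasts (toy sample):
sample = [(0.2, 1.0), (0.8, 1.0), (0.6, 0.0)]
t = lambda x: float(x > 0.5)
print(P_n(loss_function(t, gamma_01), sample), P_n(loss_function(t, gamma_ls), sample))
```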
The empirical minimizer over $\mathcal{F}_m \subset \mathcal{F}$ (called a model) can then be defined as
$$\hat{f}_m \in \arg\min_{f \in \mathcal{F}_m} P_n(f).$$
We expect that its expectation $P(\hat{f}_m)$ is close to that of $f_m \in \arg\min_{f \in \mathcal{F}_m} P(f)$, assuming that such a minimizer exists. In the prediction framework, defining $\mathcal{F}_m := \{f_t \text{ s.t. } t \in S_m\}$, we have $\hat{f}_m = f_{\hat{s}_m}$ and $f_m = f_{s_m}$.

We can now write the global margin condition as follows:
$$\forall f \in \mathcal{F} \qquad P(f - f^\star) \geq \varphi\left(\sqrt{\operatorname{var}_P(f - f^\star)}\right), \qquad (8)$$
where $\varphi$ is a convex non-decreasing function on $[0, \infty)$ with $\varphi(0) = 0$. Similarly, the local margin condition is
$$\forall f \in \mathcal{F}_m \qquad P(f - f^\star) \geq \varphi_m\left(\sqrt{\operatorname{var}_P(f - f^\star)}\right). \qquad (9)$$

Notice that most of the upper and lower bounds on the risk under the margin condition given in the introduction stay valid in the general empirical minimization framework, at least when $\varphi_m(x) = (h_m x^2)^{\kappa_m}$ for some $h_m > 0$ and $\kappa_m \geq 1$ (see, e.g., [23] and [16]). Assume that $\mathcal{F}_m$ is a VC-type class of dimension $V_m$. If $\varphi_m(x) = h_m x^2$,
$$\mathbb{E}[P(\hat{f}_m - f^\star)] \leq 2P(f_m - f^\star) + C\min\left\{ \frac{\ln(n)V_m}{n h_m}, \sqrt{\frac{V_m}{n}} \right\}$$
for some numerical constant $C > 0$. If $\varphi_m(x) = (h_m x^2)^{\kappa_m}$ for some $h_m > 0$ and $\kappa_m \geq 1$,
$$\mathbb{E}[P(\hat{f}_m - f^\star)] \leq 2P(f_m - f^\star) + C\min\left\{ L(h_m, \kappa_m)\ln(n)\left(\frac{V_m}{n h_m}\right)^{\kappa_m/(2\kappa_m - 1)}, \sqrt{\frac{V_m}{n}} \right\}$$
for some constants $C, L(h_m, \kappa_m) > 0$.

Given a collection $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ of models, we are looking for a model selection procedure $(\xi_1, \dots, \xi_n) \mapsto \hat{m} \in \mathcal{M}_n$ satisfying an oracle inequality of the form
$$P(\hat{f}_{\hat{m}} - f^\star) \leq C\inf_{m \in \mathcal{M}_n}\{P(f_m - f^\star) + R_{m,n}\}, \qquad (10)$$
with a leading constant $C$ close to 1 and a remainder term $R_{m,n}$ as small as possible. Similarly to (6), we define a strongly margin-adaptive procedure as any $\hat{m}$ such that (10) holds with some numerical constant $C$, and $R_{m,n}$ of the order of the minimax risk $R_n(\mathcal{C}_m, \varphi_m)$, where $\mathcal{C}_m$ is some complexity measure of $\mathcal{F}_m$. Defining penalization methods as
$$\hat{m} \in \arg\min_{m \in \mathcal{M}_n}\{P_n(\hat{f}_m) + \operatorname{pen}(m)\} \qquad (11)$$
for some data-dependent $\operatorname{pen}: \mathcal{M}_n \to \mathbb{R}$, the ideal penalty is $\operatorname{pen}_{\mathrm{id}}(m) := (P - P_n)(\hat{f}_m)$.

3. Margin-adaptive model selection for nested models

3.1. General result

Our first result is a sufficient condition for penalization procedures to attain strong margin adaptivity when the models are nested (Theorem 1). Since this condition is satisfied by local Rademacher complexities, this leads to a data-driven margin-adaptive penalization procedure (Corollary 1).

Theorem 1. Fix $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ and $(\varphi_m)_{m \in \mathcal{M}_n}$ such that the local margin conditions (9) hold. Let $(t_m)_{m \in \mathcal{M}_n}$ be a sequence of positive reals that is non-decreasing (with respect to the inclusion ordering on $\mathcal{F}_m$). Assume that some constants $c, \eta \in (0, 1)$ and $C_1, C_2 \geq 0$ exist such that the following holds:

- The models $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ are nested.
- Lower bounds on the penalty: with probability at least $1 - \eta$, for every $m, m' \in \mathcal{M}_n$,
$$(1 - c)\operatorname{pen}(m) \geq (P - P_n)(\hat{f}_m - f_m) + \frac{t_m}{n} \geq 0, \qquad (12)$$
$$\mathcal{F}_{m'} \subset \mathcal{F}_m \;\Rightarrow\; c\operatorname{pen}(m) \geq v(m) - C_1 v(m') - C_2 P(f_{m'} - f^\star), \qquad (13)$$
where
$$v(m) := \sqrt{\frac{2 t_m \operatorname{var}_P(f_m - f^\star)}{n}}.$$
Then, if $\hat{m}$ is defined by (11), with probability at least $1 - \eta - 2\sum_{m \in \mathcal{M}_n} e^{-t_m}$, we have for every $\varepsilon \in (0, 1)$,
$$P(\hat{f}_{\hat{m}} - f^\star) \leq \frac{1}{1 - \varepsilon}\inf_{m \in \mathcal{M}_n}\left\{ (1 + \varepsilon + C_2 + \varepsilon C_1)P(f_m - f^\star) + \operatorname{pen}(m) + (1 + \max\{1, C_1\})\min\left\{ \varphi_m^\star\left(\sqrt{\frac{2 t_m}{\varepsilon^2 n}}\right), \sqrt{\frac{2 t_m}{n}} \right\} + \frac{t_m}{3n} \right\}, \qquad (14)$$
where $\varphi_m^\star(x) := \sup_{y \geq 0}\{xy - \varphi_m(y)\}$ is the convex conjugate of $\varphi_m$.

Theorem 1 is proved in Section 7.1.

Remark 1.
1. If $\operatorname{pen}(m)$ is of the right order, that is, not much larger than $\mathbb{E}[\operatorname{pen}_{\mathrm{id}}(m)]$, then Theorem 1 is a strong margin adaptivity result. Indeed, assuming that $\varphi_m(x) = (h_m x^2)^{\kappa_m}$, the remainder term is not too large, since $\varphi_m^\star(x) = L(h_m, \kappa_m)x^{2\kappa_m/(2\kappa_m - 1)}$ for some positive constant $L(h_m, \kappa_m)$. Hence, choosing $\varepsilon = 1/2$, for instance, we can rewrite (14) as
$$P(\hat{f}_{\hat{m}} - f^\star) \leq L(C_1, C_2)\inf_{m \in \mathcal{M}_n}\left\{ P(f_m - f^\star) + \operatorname{pen}(m) + L(h_m, \kappa_m)\left(\frac{t_m}{n}\right)^{\kappa_m/(2\kappa_m - 1)} \right\}$$
for some positive constants $L(C_1, C_2)$ and $L(h_m, \kappa_m)$. When $\varphi_m$ is a general convex function, minimax estimation rates are no longer available, so that we do not know whether the remainder term in (14) is of the right order. Nevertheless, no better risk bound is known, even for a single model to which $s$ belongs.
2. In the case that the $\varphi_m$ are known, methods involving local Rademacher complexities and $(\varphi_m)_{m \in \mathcal{M}_n}$ satisfy oracle inequalities similar to (14) (see Theorems 6 and 11 in [16]). On the contrary, the $\varphi_m$ are not assumed to be known in Theorem 1, and conditions (12) and (13) are satisfied by completely data-dependent penalties, as shown in Section 3.2. Also, Theorem 7 of [16] shows that adaptivity is possible using a comparison method, provided that $f^\star$ belongs to one of the models. However, it is not clear whether this comparison method achieves the optimal bias–variance trade-off in the general case, as in Theorem 1.

3.2. Local Rademacher complexities

Although Theorem 1 applies to any penalization procedure satisfying assumptions (12) and (13), we now focus on methods based on local Rademacher complexities. Let us define precisely these complexities. We mainly use the notation of [16]:

- for every $\delta > 0$, the $\delta$-minimal set of $\mathcal{F}_m$ w.r.t. the distribution $P$ is
$$\mathcal{F}_{m,P}(\delta) := \left\{ f \in \mathcal{F}_m \text{ s.t. } P(f) - \inf_{g \in \mathcal{F}_m} P(g) \leq \delta \right\},$$
- the $L_2(P)$ diameter of the $\delta$-minimal set of $\mathcal{F}_m$ is given by
$$D_P^2(\mathcal{F}_m; \delta) = \sup_{f, g \in \mathcal{F}_{m,P}(\delta)} P((f - g)^2),$$
- the expected modulus of continuity of $(P - P_n)$ over $\mathcal{F}_m$ is
$$\phi_n(\mathcal{F}_m; P; \delta) = \mathbb{E}\left[\sup_{f, g \in \mathcal{F}_{m,P}(\delta)} |(P_n - P)(f - g)|\right].$$

We then define
$$U_n(\mathcal{F}_m; \delta; t) := K\left( \phi_n(\mathcal{F}_m; P; \delta) + D_P(\mathcal{F}_m; \delta)\sqrt{\frac{t}{n}} + \frac{t}{n} \right),$$
where $K > 0$ is a numerical constant (to be chosen later). The (ideal) local complexity $\delta_n(\mathcal{F}_m; t)$ is (roughly) the smallest positive fixed point of $r \mapsto U_n(\mathcal{F}_m; r; t)$. More precisely,
$$\delta_n(\mathcal{F}_m; t) := \inf\left\{ \delta > 0 \text{ s.t. } \sup_{\sigma \geq \delta}\left\{ \frac{U_n(\mathcal{F}_m; \sigma; t)}{\sigma} \right\} \leq \frac{1}{2q} \right\}, \qquad (15)$$
where $q > 1$ is a numerical constant. Two important points, which follow from Theorems 1 and 3 of Koltchinskii [16], are that:

1. $\delta_n(\mathcal{F}_m; t)$ is large enough to satisfy assumption (12) with a probability at least $1 - \log_q(n/t)e^{-t}$ for each model $m \in \mathcal{M}_n$.
2. There is a completely data-dependent $\hat{\delta}_n(\mathcal{F}_m; t)$ such that
$$\forall m \in \mathcal{M}_n \qquad \mathbb{P}(\hat{\delta}_n(\mathcal{F}_m; t) \geq \delta_n(\mathcal{F}_m; t)) \geq 1 - 5\log_q\left(\frac{n}{t}\right)e^{-t}.$$

This data-dependent $\hat{\delta}_n(\mathcal{F}_m; t)$ is a resampling estimate of $\delta_n(\mathcal{F}_m; t)$, called the "local Rademacher complexity". Before stating the main result of this section, let us recall the definition of $\hat{\delta}_n(\mathcal{F}_m; t)$, as in [16]. We need the following additional notation:

- for every $\delta > 0$, the empirical $\delta$-minimal set of $\mathcal{F}_m$ is
$$\hat{\mathcal{F}}_{n,m}(\delta) := \left\{ f \in \mathcal{F}_m \text{ s.t. } P_n(f) - \inf_{g \in \mathcal{F}_m} P_n(g) \leq \delta \right\} = \mathcal{F}_{m,P_n}(\delta),$$
- the empirical $L_2$ diameter of the empirical $\delta$-minimal set of $\mathcal{F}_m$ is given by
$$\hat{D}_n^2(\mathcal{F}_m; \delta) = \sup_{f, g \in \hat{\mathcal{F}}_{n,m}(\delta)} P_n((f - g)^2),$$
- the modulus of continuity of the Rademacher process $f \mapsto n^{-1}\sum_{i=1}^n \varepsilon_i f(\xi_i)$ over $\mathcal{F}_m$, where $\varepsilon_1, \dots, \varepsilon_n$ are i.i.d. Rademacher random variables (i.e., $\varepsilon_i$ takes the values $+1$ and $-1$ with probability $1/2$ each), is
$$\hat{\phi}_n(\mathcal{F}_m; \delta) = \sup_{f, g \in \hat{\mathcal{F}}_{n,m}(\delta)}\left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i(f(\xi_i) - g(\xi_i)) \right|.$$

Defining
$$\hat{U}_n(\mathcal{F}_m; \delta; t) := \hat{K}\left( \hat{\phi}_n(\mathcal{F}_m; \hat{c}\delta) + \hat{D}_n(\mathcal{F}_m; \hat{c}\delta)\sqrt{\frac{t}{n}} + \frac{t}{n} \right)$$
(where $\hat{K}, \hat{c} > 0$ are numerical constants, to be chosen later), the local Rademacher complexity $\hat{\delta}_n(\mathcal{F}_m; t)$ is (roughly) the smallest positive fixed point of $r \mapsto \hat{U}_n(\mathcal{F}_m; r; t)$. More precisely,
$$\hat{\delta}_n(\mathcal{F}_m; t) := \inf\left\{ \delta > 0 \text{ s.t. } \sup_{\sigma \geq \delta}\left\{ \frac{\hat{U}_n(\mathcal{F}_m; \sigma; t)}{\sigma} \right\} \leq \frac{1}{2q} \right\}, \qquad (16)$$
where $q > 1$ is a numerical constant.

Corollary 1 (Strong margin adaptivity for local Rademacher complexities). There exist numerical constants $K > 0$ and $q > 1$ such that the following holds. Let $t > 0$. Assume that a numerical constant $L > 0$ exists and an event of probability at least $1 - L\log_q(n/t)\operatorname{Card}(\mathcal{M}_n)e^{-t}$ exists on which
$$\forall m \in \mathcal{M}_n \qquad \operatorname{pen}(m) \geq \frac{7}{2}\delta_n(\mathcal{F}_m; t), \qquad (17)$$
where $\delta_n(\mathcal{F}_m; t)$ is defined by (15) (and depends on both $K$ and $q$). Assume moreover that the models $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ are nested and $\hat{m} \in \arg\min_{m \in \mathcal{M}_n}\{P_n(\hat{f}_m) + \operatorname{pen}(m)\}$. Then, an event of probability at least $1 - [2 + (L + 1)\log_q(\frac{n}{t})]\operatorname{Card}(\mathcal{M}_n)e^{-t}$ exists on which, for every $\varepsilon \in (0, 1)$,
$$P(\hat{f}_{\hat{m}} - f^\star) \leq \frac{1}{1 - \varepsilon}\inf_{m \in \mathcal{M}_n}\left\{ \left(1 + \frac{2}{qK} + \varepsilon(1 + \sqrt{2})\right)P(f_m - f^\star) + \operatorname{pen}(m) + (1 + \sqrt{2})\min\left\{ \varphi_m^\star\left(\sqrt{\frac{2t}{\varepsilon^2 n}}\right), \sqrt{\frac{2t}{n}} \right\} + \frac{t}{3n} \right\}. \qquad (18)$$
In particular, this holds when $\operatorname{pen}(m) = \frac{7}{2}\hat{\delta}_n(\mathcal{F}_m; t)$, provided that $\hat{K}, \hat{c} > 0$ are larger than some constants depending only on $K$, $q$.

Corollary 1 is proved in Section 7.1.

Remark 2. One can always enlarge the constants $K$ and $q$, making the leading constant of the oracle inequality (18) closer to one, at the price of enlarging $\delta_n(\mathcal{F}_m; t)$ (hence $\operatorname{pen}(m)$ or $\hat{\delta}_n(\mathcal{F}_m; t)$). We do not know whether it is possible to make the leading constant closer to one without changing the penalization procedure itself. As we show in Section 5.2, there are distributions $P$ and collections of models $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ such that (18) is a strong improvement over the "uniform margin" case, in terms of prediction error. It seems reasonable to expect that this happens in a significant number of practical situations. In Section 5, we state a more general result (from which Theorem 1 is a corollary) that suggests why it is more difficult to prove Corollary 1 when $\varphi_m$ really depends on $m$.
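For a finite model, the fixed point (16) can be approximated directly by a grid search. The following sketch is illustrative only: the constants `K_hat`, `c_hat`, `q` and the grid are placeholders, not the theoretical values required by Corollary 1, and the model is represented by the matrix of its loss values on the sample.

```python
import numpy as np

def hat_delta_n(L, t, K_hat=2.0, c_hat=1.0, q=2.0, rng=None):
    """Grid-search sketch of the local Rademacher complexity of (16) for a
    finite model: L[j, i] = f_j(xi_i) in [0, 1]."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = L.shape[1]
    P_n = L.mean(axis=1)                      # empirical risks P_n(f_j)
    eps = rng.choice([-1.0, 1.0], size=n)     # one Rademacher realization

    def U_hat(delta):
        # empirical (c_hat * delta)-minimal set hat{F}_{n,m}(c_hat * delta)
        G = L[P_n - P_n.min() <= c_hat * delta]
        diffs = (G[:, None, :] - G[None, :, :]).reshape(-1, n)   # all pairs f - g
        D_hat = np.sqrt(np.max(np.mean(diffs ** 2, axis=1)))     # empirical diameter
        phi_hat = np.max(np.abs(diffs @ eps)) / n                # Rademacher modulus
        return K_hat * (phi_hat + D_hat * np.sqrt(t / n) + t / n)

    grid = np.logspace(-4, 0, 60)             # candidate values of delta
    ratios = np.array([U_hat(d) for d in grid]) / grid
    # suffix maxima give sup_{sigma >= delta} U_hat(sigma) / sigma on the grid
    ok = np.maximum.accumulate(ratios[::-1])[::-1] <= 1.0 / (2.0 * q)
    return float(grid[ok][0]) if ok.any() else float("inf")
```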
The general result of Section 5 is also useful to understand how the nestedness assumption might be relaxed in Theorem 1 and Corollary 1.

The reason why Corollary 1 implies strong margin adaptivity is that the local Rademacher complexities are not too large when the local margin condition is satisfied, together with a complexity assumption on $\mathcal{F}_m$. Indeed, there exists a distribution-dependent $\tilde{\delta}_n(\mathcal{F}_m; t)$ (defined as $\delta_n(\mathcal{F}_m; t)$ with $U_n(\mathcal{F}_m; \delta; t)$ replaced by $K_1 U_n(\mathcal{F}_m; K_2\delta; t)$ for some numerical constants $K_1, K_2 > 0$, related to $\hat{K}$ and $\hat{c}$) such that
$$\forall m \in \mathcal{M}_n \qquad \mathbb{P}(\tilde{\delta}_n(\mathcal{F}_m; t) \geq \hat{\delta}_n(\mathcal{F}_m; t) \geq \delta_n(\mathcal{F}_m; t)) \geq 1 - 5\log_q\left(\frac{n}{t}\right)e^{-t}.$$
(See Theorem 3 of [16].) This leads to several upper bounds on $\hat{\delta}_n(\mathcal{F}_m; t)$ under the local margin condition (9), by combining Lemma 5 of [16] with the examples of its Section 2.5. For instance, in the binary classification case, when $\mathcal{F}_m$ is the class of 0–1 loss functions associated with a VC-class $S_m$ of dimension $V_m$, such that the margin condition (9) holds with $\varphi_m(x) = h_m x^2$, we have for every $t > 0$ and $\varepsilon \in (0, 1]$,
$$\delta_n(\mathcal{F}_m; t) \leq \varepsilon P(f_m - f^\star) + \frac{K_3}{n h_m}\left[ \varepsilon^{-1}t + \varepsilon^{-2}V_m\ln\left(\frac{n\varepsilon^2 h_m}{K_4 V_m}\right) \right], \qquad (19)$$
where $K_3$ and $K_4$ depend only on $K$. (Similar upper bounds hold under several other complexity assumptions on the models $\mathcal{F}_m$; see [16].) In particular, when each model $S_m$ is a VC-class of dimension $V_m$, $\varphi_m(x) = h_m x^2$, $\operatorname{pen}(m) = \frac{7}{2}\hat{\delta}_n(\mathcal{F}_m; t)$ and $t = \ln(\operatorname{Card}(\mathcal{M}_n)) + 3\ln(n)$, (18) implies that
$$P(\hat{f}_{\hat{m}} - f^\star) \leq C\inf_{m \in \mathcal{M}_n}\left\{ P(f_m - f^\star) + \frac{\ln(\operatorname{Card}(\mathcal{M}_n)) + \ln(n) + V_m\ln(\mathrm{e}\,n h_m/V_m)}{n h_m} \right\}$$
with probability at least $1 - K n^{-2}$, for some numerical constants $C, K > 0$. Up to some $\ln(n)$ factor, this is a strong margin-adaptive model selection result, provided that $\operatorname{Card}(\mathcal{M}_n)$ is smaller than some power of $n$. Notice that the $\ln(n)$ factor is sometimes necessary (as shown by [23]), meaning that this upper bound is then optimal.

4. Lower bound for some non-nested models

In this section, we investigate the assumption in Theorem 1 that the models $\mathcal{F}_m$ are nested. To this aim, let us consider the case where models are singletons $\mathcal{F}_m = \{f_m\}$. Then, any estimator $\hat{f}_m \in \mathcal{F}_m$ is deterministic and equal to $f_m$, so that model selection amounts to selecting among a family $\{f_m \text{ s.t. } m \in \mathcal{M}_n\}$ of functions. Theorem 2 below shows that no selection procedure can be strongly margin-adaptive in general.

Theorem 2. Let $\gamma$ be the 0–1 loss and $\mathcal{F}_{0-1} := \{\gamma(u; \cdot) \text{ s.t. } u: \mathcal{X} \to \{0, 1\} \text{ is measurable}\}$ be the associated loss function class. If $\operatorname{Card}(\mathcal{X}) \geq 2$, two functions $f_0, f_1 \in \mathcal{F}_{0-1}$ and absolute constants $C_3, C_4 > 0$ exist such that the following holds. For every integer $n \geq 2$ and $\hat{m}$ a selection procedure (that is, a function $(\mathcal{X} \times \mathcal{Y})^n \to \mathcal{M} = \{0, 1\}$), a distribution $P$ exists such that
$$\mathbb{P}\left( P(f_{\hat{m}} - f^\star) \geq \frac{C_4\sqrt{n}}{\ln(n)}\min_{m \in \{0,1\}}\left\{ P(f_m - f^\star) + v(m) + \frac{\ln(n)}{n h_m} \right\} \right) \geq C_3 \qquad (20)$$
and
$$\mathbb{E}[P(f_{\hat{m}} - f^\star)] \geq \frac{C_3 C_4\sqrt{n}}{\ln(n)}\min_{m \in \{0,1\}}\left\{ P(f_m - f^\star) + v(m) + \frac{\ln(n)}{n h_m} \right\}, \qquad (21)$$
where, for every $m \in \{0, 1\}$,
$$v(m) := \sqrt{\frac{2\ln(n)\operatorname{var}_P(f_m - f^\star)}{n}} \qquad \text{and} \qquad h_m := \frac{P(f_m - f^\star)}{\operatorname{var}_P(f_m - f^\star)}.$$
Theorem 2 is proved in Section 7.2.
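The following Monte Carlo sketch (anticipating the explicit construction of $P_1$ in Section 7.2; it is illustrative and not part of the proof) suggests why even empirical minimization fails here: under $P_1$, the constant predictor $u_1$ has excess loss $(2n)^{-1}$, yet the sample prefers $u_0$, whose excess loss is of order $n^{-1/2}$, with probability bounded away from zero.

```python
import numpy as np

def simulate_erm_failure(n, reps=10000, seed=0):
    """Estimate how often empirical minimization between u_0 (predict 0) and
    u_1 (predict 1) picks the much worse u_0 under P_1 from Section 7.2,
    with alpha = 1/(2n) and h = 1/sqrt(2n)."""
    rng = np.random.default_rng(seed)
    alpha, h = 1.0 / (2 * n), 1.0 / np.sqrt(2 * n)
    picked_bad = 0
    for _ in range(reps):
        X_is_a = rng.random(n) < alpha            # True <=> X_i = a
        # P(Y = 1 | X = a) = 0 and P(Y = 1 | X = b) = 1/2 + h under P_1
        Y = np.where(X_is_a, 0.0, (rng.random(n) < 0.5 + h).astype(float))
        risk0 = np.mean(Y != 0)                   # P_n gamma(u_0)
        risk1 = np.mean(Y != 1)                   # P_n gamma(u_1)
        picked_bad += risk0 <= risk1              # ties broken in favor of u_0
    return picked_bad / reps

print(simulate_erm_failure(n=100))   # stays bounded away from 0 as n grows
```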
A straightforward corollary of Theorem 2 is that in the classification setting with the 0–1 loss, strong margin-adaptive model selection is not always possible when the models are not nested. Indeed, when $\mathcal{F}_m = \{f_m\}$ for every $m \in \mathcal{M}_n = \{0, 1\}$, (20) shows that for any model selection procedure $\hat{m}$, some distribution $P$ exists such that results like Theorem 1 or Corollary 1 do not hold if $t_m = \ln(n)$ for every $m$.

Remark 3.
1. Theorem 2 and its corollary for model selection also hold for randomized rules $\hat{m}: (\mathcal{X} \times \mathcal{Y})^n \to [0, 1]$ (where the value of $\hat{m}((X_i, Y_i)_{1 \leq i \leq n})$ is the probability assigned to the choice of $f_1$). Hence, aggregating models instead of selecting one does not modify the conclusion of Theorem 2.
2. The most reasonable selection procedure among two functions $f_0$ and $f_1$ (or two models $\{f_0\}$ and $\{f_1\}$) clearly is empirical minimization. The proof of Theorem 2 yields explicitly some distribution $P$, called $P_1$, such that (20) and (21) hold for empirical minimization. Note that when models are singletons, most penalization procedures coincide with empirical minimization, for instance, when $\operatorname{pen}(m)$ is proportional to the local Rademacher complexity $\hat{\delta}_n(\mathcal{F}_m; t)$, or to the ideal penalty $\operatorname{pen}_{\mathrm{id}}(m) = (P - P_n)(\hat{f}_m - f_m)$, its expectation or some quantile of $\operatorname{pen}_{\mathrm{id}}(m)$.
3. Theorem 2 focuses on margin adaptivity with $\varphi_m(x) = h_m x^2$, whereas the margin condition is also satisfied with other functions $\varphi_m$. This is both for simplicity reasons and because this choice emphasizes that one could hope for learning rates of order $1/(n h_m)$ if strong margin adaptivity were possible. The meaning of Theorem 2 is then mainly that one cannot guarantee to learn at a rate better than $1/\sqrt{n}$, whereas for some model, the excess loss and $1/(n h_m)$ both are of order $1/n$.
4. The counterexample given in the proof of Theorem 2 is highly non-asymptotic, since the distribution $P$ strongly depends on $n$. If $P$ and $f_0$, $f_1$ were fixed, it is well known that empirical minimization leads to asymptotic optimality, because $(f_m)_{m \in \{0,1\}}$ is finite and fixed when $n$ grows. This illustrates a significant difference between the asymptotic and non-asymptotic frameworks. Another example of such a difference occurs when the number of candidate functions (or models) is infinite, or grows to infinity with the sample size; see (iv) in Proposition 1 in Section 5.2.

With Theorem 1, we have proven a strong margin adaptivity result for nested models, which holds true when the penalty is built upon local Rademacher complexities. Therefore, adaptive model selection is attainable for nested models, whatever the distribution of the data. On the other hand, Theorem 2 gives a simple example where no model selection procedure can satisfy an oracle inequality (10) with a leading constant smaller than $C_4\sqrt{n}/\ln(n)$. Looking carefully at the selection problems considered in the proof of Theorem 2, it appears that the main reason why they are particularly tough is that we are quite "lucky" with one of the models: it has simultaneously a very small bias, a very small size and a large margin parameter, while other models with very similar appearance are much worse.
When looking for a more general strong margin adaptivity result, we must then keep in mind that this is a hopeless task in such situations.

Let us finally mention a related result in a close but slightly different framework. In the classification framework, under a global margin condition with $\varphi(x) \propto x^{2\kappa}$ with $\kappa \geq 1$, Theorem 3 in [18] shows that for any $M_n \geq 2$, a family $(u_m)_{m \in \mathcal{M}_n}$ of $M_n$ classifiers exists for which, for any selection procedure $\hat{m}$, some distribution $P$ exists such that
$$\mathbb{E}[P(f_{\hat{m}} - f^\star)] \geq \inf_{m \in \mathcal{M}_n}\{P(f_m - f^\star)\} + C\left(\frac{\ln(M_n)}{n}\right)^{\kappa/(2\kappa - 1)},$$
where $f_m = \gamma(u_m; \cdot)$ for some loss function $\gamma$. When $\hat{m}$ is (penalized) empirical minimization, the remainder term is shown to be as large as $C\sqrt{\ln(M_n)/n}$ when the margin condition holds with $\kappa > 1$. This result and Theorem 2 focus on different problems. In [18], the margin condition is only assumed to hold globally, and the focus is on the dependence of the remainder term on the cardinality $M_n$ of $\mathcal{M}_n$. Therefore, the counterexample given in [18] implies nothing about local margin conditions for $(f_m)_{m \in \mathcal{M}_n}$. Note that using these arguments, we could probably generalize Theorem 2 to a family of $M_n \geq 2$ functions and obtain a lower bound depending on $M_n$ as in [18].

5. General collections of models

As proven in Section 4, we cannot hope to obtain margin adaptivity without any assumption on either $P$ or the models. The purpose of this section is to explain what can still be proven in the general case, and why this is weaker than our Theorem 1.

5.1. A general oracle inequality

We start with a general result for penalties satisfying the lower bound (12).

Theorem 3. Let $(\mathcal{F}_m)_{m \in \mathcal{M}_n}$ be any countable family of models, and $(t_m)_{m \in \mathcal{M}_n}$ be any sequence of positive numbers. Let $\hat{m}$ be defined by (11) and assume that some $c \in (0, 1)$ exists such that
$$\forall m \in \mathcal{M}_n \qquad (1 - c)\operatorname{pen}(m) \geq (P - P_n)(\hat{f}_m - f_m) + \frac{t_m}{n} \geq 0 \qquad (22)$$
on an event of probability at least $1 - \eta$. Then, there exists an event of probability at least $1 - \eta - 2\sum_{m \in \mathcal{M}_n} e^{-t_m}$ on which the following holds: for every $\varepsilon \in (0, 1)$,
$$P(\hat{f}_{\hat{m}} - f^\star) \leq \frac{1}{1 - \varepsilon}\inf_{m \in \mathcal{M}_n}\left\{ P(f_m - f^\star) + \operatorname{pen}(m) + v(m) + \frac{t_m}{3n} \right\} + V_n, \qquad (23)$$
where
$$V_n := \frac{1}{1 - \varepsilon}\sup_{m \in \mathcal{M}_n}\{v(m) - \varepsilon P(f_m - f^\star) - c\operatorname{pen}(m)\} \qquad \text{and} \qquad v(m) := \sqrt{\frac{2 t_m \operatorname{var}_P(f_m - f^\star)}{n}}.$$

Theorem 3 is proved in Section 7.1. Let us make a few comments. First, without $V_n$, (23) is the kind of oracle inequality we are looking for, since the leading constant is close to 1 (provided $\varepsilon$ is small enough). For the sake of simplicity, assume that a margin condition (9) holds for every model $m \in \mathcal{M}_n$, with $\varphi_m(x) = h_m x^2$. Then,
$$v(m) \leq \sqrt{\frac{2 t_m P(f_m - f^\star)}{h_m n}} \leq \varepsilon P(f_m - f^\star) + \frac{t_m}{2\varepsilon h_m n}$$
for any $\varepsilon \in (0, 1)$. Hence, the first term of the right-hand side of (23) is smaller than
$$\frac{1 + \varepsilon}{1 - \varepsilon}\inf_{m \in \mathcal{M}_n}\left\{ P(f_m - f^\star) + \operatorname{pen}(m) + \frac{t_m}{2\varepsilon h_m n} + \frac{t_m}{3n} \right\},$$
which is the right-hand side of a margin-adaptive oracle inequality like (6) (at least when the penalty is itself of the right order). A similar result holds for a more general $\varphi_m$; see the proof of Theorem 1. Once we have a penalty satisfying (22) (for instance, a local Rademacher penalty), the main difficulty for proving a strong margin adaptivity result then lies in $V_n$.
It arises from the difference between the ideal penalty and the right-hand side of the lower bound (22), that is, $(P - P_n)(f_m)$. This random quantity is centered, and (up to a quantity independent of $m$) has deviations of order $v(m)$, Bernstein's inequality being unimprovable. Then, if $v(m)$ happens to be much larger than $P(f_m - f^\star) + \operatorname{pen}(m)$, $m$ is selected with a positive probability, whatever the value of $P(\hat{f}_m - f^\star)$. In that case, the expectation of $\hat{f}_{\hat{m}}$ is worse than the oracle by at least $v(m)$ (for any of these "bad" models). Hence, $V_n$ certainly is unavoidable in (23).

As shown by Theorem 2, $V_n$ can be much larger than the expectation of a strong margin-adaptive estimator. Nevertheless, $V_n$ is not always the main term on the right-hand side of (23). Let us now describe a set of favorable situations in which it is possible to prove that $V_n$ is small enough:

1. Models are nested, $t_m$ is non-decreasing (with respect to the inclusion ordering on $\mathcal{F}_m$), and pen satisfies the additional condition (13); see Section 3.
2. Models are nested, $t_m$ is non-decreasing and $v(m)$ is decreasing (or at least not increasing too much) when $\mathcal{F}_m$ increases. Indeed, let us fix $m, m^\star \in \mathcal{M}_n$ (think of $m^\star$ as a minimizer of the infimum on the right-hand side of (23)). When models are nested, either $\mathcal{F}_{m^\star} \subset \mathcal{F}_m$, so that $v(m) \leq \sup_{\mathcal{F}_{m^\star} \subset \mathcal{F}_{m'}}\{v(m')\}$, or $\mathcal{F}_m \subset \mathcal{F}_{m^\star}$, so that $\varphi_{m^\star} \leq \varphi_m$, hence $\varphi_m^\star \leq \varphi_{m^\star}^\star$. In the second case,
$$v(m) - \varepsilon P(f_m - f^\star) \leq \varphi_m^\star\left(\sqrt{\frac{2 t_m}{\varepsilon^2 n}}\right) \leq \varphi_{m^\star}^\star\left(\sqrt{\frac{2 t_m}{\varepsilon^2 n}}\right) \leq \varphi_{m^\star}^\star\left(\sqrt{\frac{2 t_{m^\star}}{\varepsilon^2 n}}\right)$$
since $t_m \leq t_{m^\star}$ and $\varphi_{m^\star}^\star$ is non-decreasing. As a consequence, for any $m^\star \in \mathcal{M}_n$,
$$V_n \leq \frac{1}{1 - \varepsilon}\max\left\{ \sup_{\mathcal{F}_{m^\star} \subset \mathcal{F}_{m'}}\{v(m')\}; \varphi_{m^\star}^\star\left(\sqrt{\frac{2 t_{m^\star}}{\varepsilon^2 n}}\right) \right\},$$
which is not too large provided that $v(m)$ never increases too much. Notice that we can understand assumption (13) as ensuring that the penalty compensates a possible increase of $v(m)$.
3. The oracle model prediction error does not decrease to zero faster than $n^{-1/2}$ and $t_m \leq t$. Indeed, the straightforward upper bound $v(m) \leq \sqrt{2 t_m/n}$ shows that $V_n \leq (1 - \varepsilon)^{-1}\sqrt{2t/n}$.
4. The margin condition does not depend on $m$ and $t_m \leq t$. Indeed, when $\varphi_m \equiv \varphi$ (or $\inf_m \varphi_m \geq \varphi$), we have
$$V_n \leq \frac{1}{1 - \varepsilon}\sup_{m \in \mathcal{M}_n}\left\{ \varphi_m^\star\left(\sqrt{\frac{2 t_m}{\varepsilon^2 n}}\right) \right\} \leq \frac{1}{1 - \varepsilon}\varphi^\star\left(\sqrt{\frac{2t}{\varepsilon^2 n}}\right).$$
5. The penalty satisfies $c\operatorname{pen}(m) \geq v(m)$ for every $m \in \mathcal{M}_n$, which can be ensured for instance by adding $c^{-1}v(m)$ (or an estimate of it) to a penalty satisfying (22). An example of this method is the one proposed by Koltchinskii [16] (Section 5.2), and in that case (23) coincides with his Theorem 6.

Points 3 and 4 above show that the challenging situations are the ones where the margin condition indeed depends on the model, and fast rates of estimation are attainable. We prove in Section 5.2 that such situations can occur, highlighting how our Theorem 1 is an improvement on existing results and their straightforward consequences. On the other hand, point 5 may seem contradictory with the negative results of Section 4. The explanation is that using $v(m)$ in the penalty means that $\hat{m}$ is not only a function of the data, but also of the unknown distribution $P$. Then it cannot be considered adaptive.
A more surprising consequence of this remark combined with Theorem 2 is that $v(m)$ cannot be estimated accurately enough uniformly over the set of all distributions $P$. Consider the proposal, in Section 5.1 of [16], to add
$$C\sqrt{\frac{t_m P_n(\hat{f}_m)}{n}}$$
to the penalty, which is sufficient to give a result like (14). The point is that such a penalty is generally much too large (at least for small models), which often results in an upper bound of order $n^{-1/2}$. In the examples we have in mind (as well as in the counterexamples of Section 4), the excess risk of the oracle is much smaller, typically of order $n^{-\beta}$ for some $\beta \in (1/2, 1]$.

5.2. The local margin conditions can be significantly tighter than the global one

In this section, we show that there exist challenging situations in which the margin condition holds for functions $\varphi_m$ strongly depending on $m$.

Proposition 1. Let $\kappa \in (1, +\infty)$ and assume that $\mathcal{X}$ is infinite. Let $\gamma$ be the 0–1 loss and $\mathcal{F}_{0-1} := \{\gamma(u; \cdot) \text{ s.t. } u: \mathcal{X} \to \{0, 1\} \text{ is measurable}\}$ be the associated loss function class. Then there exist a probability distribution $P$ on $\mathcal{X} \times \{0, 1\}$, a sequence $(f_j)_{j \in \mathbb{N}}$ of elements of $\mathcal{F}_{0-1}$ and positive constants $(C_i)_{5 \leq i \leq 7}$ (depending on $\kappa$ only) such that:

(i) $\forall k \in \mathbb{N}$, $P(f_{2k+1} - f^\star) = P(f_{2k} - f^\star) = b(k)$ and $2^{-k\kappa - 2} \leq b(k) \leq 2^{-k\kappa - 1}$.

(ii) The global margin condition (8) is satisfied over $\mathcal{F} = \mathcal{F}_{0-1}$ with $\varphi(x) = C_5 x^{2\kappa}$, and it is tight: $\forall k \in \mathbb{N}$, $\varphi(\sqrt{\operatorname{var}(f_{2k+1} - f^\star)}) \geq C_6 P(f_{2k+1} - f^\star)$.

(iii) A tighter local margin condition (9) holds over $\{f_{2k} \text{ s.t. } k \in \mathbb{N}\}$: $\forall k \in \mathbb{N}$, $P(f_{2k} - f^\star) \geq \operatorname{var}_P(f_{2k} - f^\star)$.

(iv) For every $m \in \mathbb{N}$, define $\mathcal{F}_m = \{f_m\}$ and consider the model selection problem among $(\mathcal{F}_m)_{0 \leq m \leq M_n}$ with $M_n \geq 2\log_2(n)$. Then, the right-hand side of a strong margin-adaptive oracle inequality of the form (10) is at most proportional to
$$\inf_{0 \leq 2k \leq M_n}\left\{ P(f_{2k} - f^\star) + \frac{\ln(n)}{n} \right\} \leq \frac{2\ln(n)}{n},$$
whereas the right-hand side of a global margin-adaptive oracle inequality is larger than $C_7 n^{-\kappa/(2\kappa - 1)} \gg \ln(n)/n$.

Proposition 1 is proved in Section 7.3. It gives an example of a model selection problem where strong margin adaptivity implies a faster rate of convergence than adaptivity to the global margin condition. Note that the same argument works with many other model selection problems, such as selecting among $(\{f_{2k+1} \text{ s.t. } 0 \leq k \leq m\})_{m \in \{1, \dots, (\ln(n))^2\}}$.
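The following numerical sketch (illustrative only; it implements the construction given in the proof in Section 7.3, with hypothetical parameter values) checks (i)–(iii): the local margin condition with $\varphi_{2k}(x) = x^2$ holds on even indices, while on odd indices only the global $\varphi(x) \propto x^{2\kappa}$ is tight.

```python
import numpy as np

def proposition1_check(kappa=2.0, k_max=8):
    """Compute the excesses and variances (34)-(36) of the construction in
    Section 7.3, with lambda = kappa - 1, p_k = 2^{-k-1}, delta_k = 2^{-k*lambda}
    and q_k = delta_k / (1 + delta_k)."""
    lam = kappa - 1.0
    for k in range(k_max):
        p_k, delta_k = 2.0 ** (-k - 1), 2.0 ** (-k * lam)
        q_k = delta_k / (1.0 + delta_k)
        exc_even = p_k * q_k                          # P(f_{2k} - f_star), see (34)
        var_even = exc_even - exc_even ** 2           # (36)
        exc_odd = delta_k * p_k * (1.0 - q_k)         # (34)
        var_odd = p_k * (1.0 - q_k) - exc_odd ** 2    # (35)
        print(k,
              exc_even >= var_even,                   # local condition (iii)
              round(exc_odd / var_odd ** kappa, 3))   # stays bounded: (ii) is tight

proposition1_check()
```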
6. Discussion

6.1. Other penalization procedures

We have focused in Section 3.2 on penalties defined in terms of local Rademacher complexities in order to prove that strong margin adaptivity is attainable for some data-driven penalties. An interesting question is whether such a result can be extended to penalties that can be computed faster. For instance, it is natural to think of estimating $\operatorname{pen}_{\mathrm{id}}(m)$ itself by resampling, instead of the local complexity $\delta_n(\mathcal{F}_m; t)$. Such penalties, with several kinds of resampling schemes, have been proposed in [2] and [3] and called "resampling penalties" (RP), generalizing the bootstrap penalty suggested by Efron [15]. Resampling penalties can be computed faster than local Rademacher complexities, because they are not defined as fixed points of the resampling estimate of a function. In particular, the $V$-fold penalties defined in [2] have the same computational cost as $V$-fold cross-validation. In addition, RP are easy to calibrate, since they depend on a single tuning parameter (the multiplicative factor in front of it), which can, for instance, be estimated from the data by using the "slope heuristics" (see [4]). On the contrary, local Rademacher complexities depend on two more constants, whose theoretical values are certainly too large for practical application.

Extending Corollary 1 to RP would require to prove that RP satisfy both assumptions (12) and (13). On the one hand, (12) means essentially that the penalty is larger than the expectation of the ideal penalty with large probability. Hence, one can conjecture that (12) holds for RP; a partial proof of (12) for RP in our general setting can be found in Chapter 7 of [1], together with an agenda for a complete proof, which seems to be a difficult theoretical problem. On the other hand, (13) seems less likely to hold for RP, and we may have to modify RP so that (13) can be satisfied in general. Proving such results would be quite interesting, since it would provide a strong margin-adaptive penalization procedure with a reasonably small computational cost.

6.2. Should we make collections of models nested?

A natural question coming from our results is whether one should make any collection of models nested before performing model selection, in order to improve performance. Let us consider the counterexample of Theorem 2 and look at what would happen if we make the models nested. Assume that $P = P_1$ is the distribution defined in the proof of Theorem 2. On the one hand, comparing $\{f_0\}$ and $\{f_0, f_1\}$, the model selection problem would be easy because the margin parameter $h_m$ is the same in both models, making the remainder term of order $n^{-1/2}$ (the remainder term $(n h_m)^{-1}$ can be replaced by $n^{-1/2}$ when $h_m \leq n^{-1/2}$ because of the upper bound $\operatorname{var}_P(f_m - f^\star) \leq 1/4$). And margin adaptivity is not challenging when the margin condition is merely not satisfied. On the other hand, when $P = P_1$, comparing $\{f_1\}$ and $\{f_0, f_1\}$ is more challenging because $f_1$ is really better than $f_0$. Here, contrary to the non-nested case, the large increase of the term $\operatorname{var}_P(f_m - f^\star)$ induces a similar increase in the $L_2(P_1)$ diameter of the class. Hence, local Rademacher complexities can detect it, as shown by Theorem 1.

To conclude, improving significantly the prediction performance of the final estimator by making the models nested requires some prior knowledge, such as a natural ordering between the (non-nested) models. Otherwise, Theorem 2 shows that choosing how to make the models nested, either from data or randomly, is not successful with probability at least $C_3 > 0$, whatever the sample size.

7. Proofs

7.1. Oracle inequalities

We give the proofs in a logical order, that is, first Theorem 3, then Theorem 1 (which is a corollary of it), and finally Corollary 1.

Proof of Theorem 3.
First, by definition of $\hat{m}$, for every $m \in \mathcal{M}_n$ we have
$$P_n(\hat{f}_{\hat{m}}) + \operatorname{pen}(\hat{m}) \leq P_n(\hat{f}_m) + \operatorname{pen}(m),$$
which can be rewritten as
$$P(\hat{f}_{\hat{m}} - f^\star) + (P_n - P)(\hat{f}_{\hat{m}} - f_{\hat{m}}) + (P_n - P)(f_{\hat{m}} - f^\star) + \operatorname{pen}(\hat{m})$$
$$\leq P(f_m - f^\star) + P_n(\hat{f}_m - f_m) + (P_n - P)(f_m - f^\star) + \operatorname{pen}(m)$$
$$\leq P(f_m - f^\star) + (P_n - P)(f_m - f^\star) + \operatorname{pen}(m).$$
On the event where (22) holds, we then have
$$P(\hat{f}_{\hat{m}} - f^\star) + (P_n - P)(f_{\hat{m}} - f^\star) + c\operatorname{pen}(\hat{m}) + \frac{t_{\hat{m}}}{n} \leq \inf_{m \in \mathcal{M}_n}\{P(f_m - f^\star) + (P_n - P)(f_m - f^\star) + \operatorname{pen}(m)\}. \qquad (24)$$
By Bernstein's inequality (see, e.g., Proposition 2.9 in [22]), for every $m \in \mathcal{M}_n$, there is an event of probability $1 - 2e^{-t_m}$ on which
$$|(P_n - P)(f_m - f^\star)| \leq v(m) + \frac{t_m}{3n}.$$
On the intersection of these events with the one on which (22) holds, we derive from (24) that
$$P(\hat{f}_{\hat{m}} - f^\star) - v(\hat{m}) + c\operatorname{pen}(\hat{m}) \leq \inf_{m \in \mathcal{M}_n}\left\{ P(f_m - f^\star) + \operatorname{pen}(m) + v(m) + \frac{t_m}{3n} \right\}.$$
For any $\varepsilon > 0$, the left-hand side is larger than
$$(1 - \varepsilon)P(\hat{f}_{\hat{m}} - f^\star) + \varepsilon P(f_{\hat{m}} - f^\star) + c\operatorname{pen}(\hat{m}) - v(\hat{m}) \geq (1 - \varepsilon)P(\hat{f}_{\hat{m}} - f^\star) - \sup_{m \in \mathcal{M}_n}\{v(m) - \varepsilon P(f_m - f^\star) - c\operatorname{pen}(m)\}.$$
The result follows. □

Proof of Theorem 1. We consider the event on which (23) holds. By Theorem 3, we know that it has probability at least $1 - \eta - 2\sum_{m \in \mathcal{M}_n} e^{-t_m}$. We first bound the first term on the right-hand side of (23). From (9), we have
$$\forall m \in \mathcal{M}_n \qquad v(m) \leq \sqrt{\frac{2 t_m}{n}}\,\varphi_m^{-1}(P(f_m - f^\star)).$$
Then, using that $xy \leq \varphi_m(x) + \varphi_m^\star(y)$ for every $x, y \geq 0$,
$$\forall m \in \mathcal{M}_n \qquad v(m) \leq \varphi_m^\star\left(\sqrt{\frac{2 t_m}{\varepsilon^2 n}}\right) + \varphi_m(\varepsilon\varphi_m^{-1}(P(f_m - f^\star))).$$
Since $\varphi_m$ is convex with $\varphi_m(0) = 0$, we have $\varphi_m(\lambda x) \leq \lambda\varphi_m(x)$ for every $\lambda \in (0, 1)$ and $x \geq 0$. Then, using also that $\operatorname{var}_P(f_m - f^\star) \leq 1$,
$$\forall m \in \mathcal{M}_n \qquad v(m) \leq \min\left\{ \sqrt{\frac{2 t_m}{n}}, \varphi_m^\star\left(\sqrt{\frac{2 t_m}{\varepsilon^2 n}}\right) + \varepsilon P(f_m - f^\star) \right\}, \qquad (25)$$
and the right-hand side of (23) is smaller than
$$\frac{1}{1 - \varepsilon}\inf_{m \in \mathcal{M}_n}\left\{ (1 + \varepsilon)P(f_m - f^\star) + \operatorname{pen}(m) + \min\left\{ \varphi_m^\star\left(\sqrt{\frac{2 t_m}{\varepsilon^2 n}}\right), \sqrt{\frac{2 t_m}{n}} \right\} + \frac{t_m}{3n} \right\} + V_n. \qquad (26)$$
It now remains to upper bound $V_n$. Let $m, m' \in \mathcal{M}_n$. Since the models $\mathcal{F}_m$ are nested, two cases can occur:

1. $\mathcal{F}_m \subset \mathcal{F}_{m'}$, which implies $t_m \leq t_{m'}$ and $\varphi_m \geq \varphi_{m'}$, hence $\varphi_m^\star \leq \varphi_{m'}^\star$. Using, in addition, (25) and that $\varphi_{m'}^\star$ is non-decreasing, we have
$$v(m) \leq \min\left\{ \sqrt{\frac{2 t_{m'}}{n}}, \varphi_{m'}^\star\left(\sqrt{\frac{2 t_{m'}}{\varepsilon^2 n}}\right) \right\} + \varepsilon P(f_m - f^\star).$$
2. $\mathcal{F}_{m'} \subset \mathcal{F}_m$. Using (13) and (25),
$$v(m) \leq C_1 v(m') + C_2 P(f_{m'} - f^\star) + c\operatorname{pen}(m) \leq C_1\min\left\{ \sqrt{\frac{2 t_{m'}}{n}}, \varphi_{m'}^\star\left(\sqrt{\frac{2 t_{m'}}{\varepsilon^2 n}}\right) \right\} + (C_2 + C_1\varepsilon)P(f_{m'} - f^\star) + c\operatorname{pen}(m).$$

Therefore,
$$V_n \leq \frac{1}{1 - \varepsilon}\inf_{m' \in \mathcal{M}_n}\left\{ \max\{1, C_1\}\min\left\{ \sqrt{\frac{2 t_{m'}}{n}}, \varphi_{m'}^\star\left(\sqrt{\frac{2 t_{m'}}{\varepsilon^2 n}}\right) \right\} + (C_2 + C_1\varepsilon)P(f_{m'} - f^\star) \right\}$$
and the result follows. □

Proof of Corollary 1. From [16] (Theorem 1 and (9.2) in the proof of its Lemma 2), we know that there exist numerical constants $K > 0$ and $q > 1$ such that (12) holds with $t_m = t$, $c = 5/7$ and $\eta = (L + 1)\log_q(\frac{n}{t})\operatorname{Card}(\mathcal{M}_n)e^{-t}$. In addition, Lemma 3 below shows that (13) holds with $C_1 = \sqrt{2}$ and $C_2 = 2/(Kq)$. The result follows from Theorem 1 with $t_m = t$. □

Lemma 3. Let $\mathcal{F}_{m'} \subset \mathcal{F}_m$ and $\delta_n$ be defined by (15). Then,
$$v(m) \leq 2\delta_n(\mathcal{F}_m; t) + \sqrt{2}v(m') + \frac{2P(f_{m'} - f^\star)}{qK}. \qquad (27)$$

Proof.
Since $\mathcal{F}_{m'} \subset \mathcal{F}_m$, $f_{m'} \in \mathcal{F}_m$ (as well as $f_m$), so that
$$D_P(\mathcal{F}_m; P(f_{m'} - f_m)) \geq \sqrt{P((f_m - f_{m'})^2)} \geq \sqrt{\operatorname{var}_P(f_m - f_{m'})} \geq \sqrt{\frac{\operatorname{var}_P(f_m - f^\star)}{2}} - \sqrt{\operatorname{var}_P(f_{m'} - f^\star)}. \qquad (28)$$
For the last inequality, we used that $\operatorname{var}(X) \leq 2\operatorname{var}(X + Y) + 2\operatorname{var}(Y)$ for any random variables $X$, $Y$, and the inequality $\sqrt{x + y} \leq \sqrt{x} + \sqrt{y}$ for every $x, y \geq 0$.

First, assume that the lower bound in (28) is non-positive. This implies
$$v(m) = \sqrt{\frac{2t\operatorname{var}_P(f_m - f^\star)}{n}} \leq \sqrt{2}v(m'),$$
so that (27) holds. Otherwise, the assumptions of Lemma 4 below hold with
$$D_0 = \sqrt{\frac{\operatorname{var}_P(f_m - f^\star)}{2}} - \sqrt{\operatorname{var}_P(f_{m'} - f^\star)} > 0$$
and $\sigma_0 = P(f_{m'} - f_m)$. We deduce from (29) that
$$\frac{v(m)}{2} - \frac{v(m')}{\sqrt{2}} \leq \delta_n(\mathcal{F}_m; t) + \frac{P(f_{m'} - f_m)}{qK} \leq \delta_n(\mathcal{F}_m; t) + \frac{P(f_{m'} - f^\star)}{qK},$$
and (27) also holds. □

Lemma 4. Let $\delta_n(\mathcal{F}_m; t)$ be defined by (15). Assume that there are some $D_0, \sigma_0 > 0$ such that $D_P(\mathcal{F}_m; \sigma_0) \geq D_0$. Then, we have the following lower bound:
$$\max\left\{ \delta_n(\mathcal{F}_m; t); \frac{\sigma_0}{qK} \right\} \geq D_0\sqrt{\frac{t}{n}}. \qquad (29)$$

Proof. First, (29) clearly holds when $\sigma_0/(qK) \geq D_0\sqrt{t/n}$. Otherwise, let $\sigma_1 = \max\{qK, 1\}D_0\sqrt{t/n} > \sigma_0$. From the definition of $U_n$, we have
$$\frac{U_n(\mathcal{F}_m; \sigma_1; t)}{\sigma_1} \geq \frac{K D_P(\mathcal{F}_m; \sigma_1)}{\sigma_1}\sqrt{\frac{t}{n}} \geq \frac{K D_0}{qK D_0\sqrt{t/n}}\sqrt{\frac{t}{n}} = \frac{1}{q} > \frac{1}{2q}.$$
Then, according to the definition (15) of $\delta_n(\mathcal{F}_m; t)$, $\delta_n(\mathcal{F}_m; t) \geq \sigma_1 \geq D_0\sqrt{t/n}$ and the result follows. □

7.2. Lower bounds (proof of Theorem 2)

For every $m \in \{0, 1\}$, let $f_m: (x, y) \mapsto \mathbf{1}_{y \neq m}$; $f_m \in \mathcal{F}_{0-1}$, since $f_m = \gamma(u_m; \cdot)$, where for every $x \in \mathcal{X}$, $u_m(x) = m$. Let $\alpha = (2n)^{-1}$ and $h = (2n)^{-1/2}$. Let $a \neq b$ be any two elements of $\mathcal{X}$. We define a probability distribution $P_1$ on $\mathcal{X} \times \{0, 1\}$ as follows: if $(X, Y) \sim P_1$, then
$$\mathbb{P}(X = a) = \alpha, \qquad \mathbb{P}(X = b) = 1 - \alpha, \qquad \mathbb{P}(Y = 1 \mid X = a) = 0 \qquad \text{and} \qquad \mathbb{P}(Y = 1 \mid X = b) = \frac{1}{2} + h.$$
We also define $P_0$ as the distribution of $(X, 1 - Y)$, where $(X, Y) \sim P_1$. In the following, for any distribution $Q$ on $\mathcal{X} \times \{0, 1\}$, we use the notation $\mathbb{P}_Q$ as a shortcut for $\mathbb{P}_{(X_i, Y_i)_{1 \leq i \leq n} \sim Q^{\otimes n}}$.

First, under distribution $P_1$, the Bayes predictor is $s = \mathbf{1}_b$,
$$P_1(f_0 - f^\star) = 2(1 - \alpha)h, \qquad P_1(f_1 - f^\star) = \alpha \qquad \text{and} \qquad \operatorname{var}_{P_1}(f_1 - f^\star) = \alpha - \alpha^2.$$
Hence,
$$\min_{m \in \{0,1\}}\left\{ P_1(f_m - f^\star) + v(m) + \frac{\ln(n)}{n h_m} \right\} \leq P_1(f_1 - f^\star) + v(1) + \frac{\ln(n)}{n h_1} \leq \alpha + \sqrt{\frac{2\alpha\ln(n)}{n}} + \frac{\ln(n)}{n} \leq \frac{2 + 3\ln(n)}{2n}.$$
Therefore, if $\mathbb{P}_{P_1}(\hat{m} = 0) \geq C_3$, then (20) holds when $P = P_1$, with $C_4 = 1/3$. Similarly, $\mathbb{P}_{P_0}(\hat{m} = 1) \geq C_3$ implies (20) with $P = P_0$ and $C_4 = 1/3$. So, in order to prove (20), we only need to prove that
$$\max_{j \in \{0,1\}}\{\mathbb{P}_{P_j}(\hat{m} = 1 - j)\} \geq C_3 > 0. \qquad (30)$$

The proof of (30) relies on three main facts. First,
$$\forall j \in \{0, 1\} \qquad \mathbb{P}_{P_j}(\forall i, X_i = b) = (1 - \alpha)^n = \left(1 - \frac{1}{2n}\right)^n \geq \frac{1}{2}. \qquad (31)$$
Second, for every $j \in \{0, 1\}$, under $P_j$, conditionally on $\{\forall i, X_i = b\}$, $\operatorname{Card}\{i \text{ s.t. } Y_i = 1\}$ is a binomial random variable with parameters $(n, p_j)$, where
$$p_j = \mathbb{P}_{(X,Y)\sim P_j}(Y = 1 \mid X = b) = \frac{1}{2} + (-1)^{j+1}h.$$
So, Lemma 5 shows that for every $j \in \{0, 1\}$ and every $k \in \mathbb{N} \cap [\frac{n}{2} - \sqrt{n}, \frac{n}{2} + \sqrt{n}]$,
$$\mathbb{P}_{P_j}(\operatorname{Card}\{i \text{ s.t. } Y_i = 1\} = k \mid \forall i, X_i = b) \geq \frac{C}{\sqrt{n}} > 0, \qquad (32)$$
where $C$ is an absolute constant. Third, let us define, for every $k \in \{0, \dots, n\}$,
$$\pi_k := \mathbb{P}_{P_U}(\hat{m}((X_i, Y_i)_{1 \leq i \leq n}) = 1 \mid \operatorname{Card}\{i \text{ s.t. } Y_i = 1\} = k \text{ and } \forall i, X_i = b),$$
where $P_U$ is the uniform distribution on $\{a, b\} \times \{0, 1\}$. A crucial remark is that $P_U$ can be replaced by either $P_0$ or $P_1$ in the definition of $\pi_k$, since the conditioning event determines $(X_i, Y_i)_{1 \leq i \leq n}$ up to the ordering of the observations; in the definition of $\pi_k$, the probability only refers to the ordering of the $(X_i, Y_i)$, and any product measure on $\mathcal{X} \times \{0, 1\}$ assigns equal probabilities to the $n!$ permutations of the $n$ observations. Note also that the definition of $\pi_k$ stays valid when $\hat{m}$ is a randomized selection rule, which proves the generalization of Theorem 2 pointed out in Remark 3.

For any given selection rule $\hat{m}$,
$$\operatorname{Card}\left\{ k \in \mathbb{N} \cap \left[\frac{n}{2} - \sqrt{n}, \frac{n}{2} + \sqrt{n}\right] \text{ s.t. } \pi_k > \frac{1}{2} \right\}$$
is either larger or smaller than $\sqrt{n}$. If it is larger, (31), (32) and the definition of the $\pi_k$ (with $P_0$ instead of $P_U$) show that
$$\mathbb{P}_{P_0}(\hat{m}((X_i, Y_i)_{1 \leq i \leq n}) = 1) \geq \sqrt{n} \times \frac{C}{\sqrt{n}} \times \frac{1}{2} = \frac{C}{2} = C_3 > 0,$$
so that (30) is satisfied. Otherwise, choosing $P_1$ instead of $P_0$ shows that (30) holds true. This proves (20), which clearly implies (21), since $P(f_{\hat{m}} - f^\star) \geq 0$ a.s.

A key tool in the proof of Theorem 2 is the following uniform lower bound on the density of the binomial distribution w.r.t. the counting measure on $\mathbb{N}$.

Lemma 5. For every $n \in \mathbb{N}$ and $p \in [0, 1]$, let $\mathcal{B}(n, p)$ denote the binomial distribution with parameters $(n, p)$. For every $a, b > 0$ and $c \in (0, 1/2)$, a positive constant $C(a, b, c)$ exists such that for any positive integer $n$,
$$\inf_{\substack{k \in \mathbb{N},\ |k - n/2| \leq \min\{a n^{1/2},\, n/2\} \\ |p - 1/2| \leq \min\{b n^{-1/2},\, c\}}}\{\sqrt{n}\,\mathbb{P}_{Z \sim \mathcal{B}(n,p)}(Z = k)\} \geq C(a, b, c) > 0. \qquad (33)$$

Proof. Let $n$, $k$, $p$ satisfy the above conditions, $Z \sim \mathcal{B}(n, p)$, and define
$$\eta := \frac{2k}{n} - 1, \qquad \delta := p - \frac{1}{2}.$$
The assumption on $k$ and $p$ becomes $|\eta| \leq \min\{2a n^{-1/2}, 1\}$ and $|\delta| \leq \min\{b n^{-1/2}, c\}$. In addition,
$$\mathbb{P}(Z = k) = p^k(1 - p)^{n - k}\binom{n}{k} = \left(\frac{1}{2} + \delta\right)^k\left(\frac{1}{2} - \delta\right)^{n - k}\frac{n!}{k!(n - k)!}.$$
We now use Stirling's formula: $\ln(n!) = n\ln(n) - n + \frac{1}{2}\ln(2\pi n) + \varepsilon_n$ for some sequence $\varepsilon_n \to 0$ as $n \to +\infty$ (one has $(12n + 1)^{-1} \leq \varepsilon_n \leq (12n)^{-1}$). Then,
$$\ln\mathbb{P}(Z = k) = k\ln\left(\frac{1}{2} + \delta\right) + (n - k)\ln\left(\frac{1}{2} - \delta\right) + \ln\frac{n!}{k!(n - k)!}$$
$$= \frac{n}{2}\left[(1 - \eta)\ln\left(\frac{1 - 2\delta}{1 - \eta}\right) + (1 + \eta)\ln\left(\frac{1 + 2\delta}{1 + \eta}\right)\right] - \frac{1}{2}\ln(n) + \frac{1}{2}\ln\left(\frac{2}{\pi}\right) - \frac{1}{2}\ln(1 - \eta^2) + \varepsilon_n - \varepsilon_k - \varepsilon_{n-k}.$$
Define $h: (-1, +\infty) \to \mathbb{R}$ by $h(x) := x^{-1}\ln(1 + x) - 1$, so that
$$\forall x > -1 \qquad \ln(1 + x) = x(1 + h(x)).$$
Recall that $|h(x)| \leq 2|x|$ as soon as $x \geq -1/2$, by the Taylor–Lagrange formula. In particular, $\lim_{x \to 0} h(x) = 0$. We then have
$$\ln\mathbb{P}(Z = k) = \frac{n}{2}[4\delta\eta - 2\eta^2 - 2\delta(1 - \eta)h(-2\delta) + \eta(1 - \eta)h(-\eta) + 2\delta(1 + \eta)h(2\delta) - \eta(1 + \eta)h(\eta)]$$
$$- \frac{1}{2}\ln(n) + \frac{1}{2}\ln\left(\frac{2}{\pi}\right) + \frac{\eta^2}{2}(1 + h(-\eta^2)) + \varepsilon_n - \varepsilon_k - \varepsilon_{n-k}.$$
Assuming that $n \geq n_0$ such that $\max\{a, b\}n^{-1/2} \leq 1/2$, it follows that
$$\ln\mathbb{P}(Z = k) = -\frac{1}{2}\ln(n) + R(k, n, p) \qquad \text{with} \qquad R(k, n, p) \geq -L(1 + a^2 + ab + b^2)$$
for some numerical constant $L > 0$, and this lower bound is uniform over $n \geq n_0$ and $k$, $p$ such that the conditions of the infimum in (33) are satisfied. On the other hand,
$$\inf_{n \leq n_0,\ 1 \leq k \leq n}\{\mathbb{P}_{Z \sim \mathcal{B}(n,p)}(Z = k)\} \geq K(p) > 0$$
as soon as $p \in (0, 1)$.
7.3. Proof of Proposition 1

Let $(x_j)_{j \in \mathbb{N}}$ be any infinite sequence of distinct elements of $\mathcal{X}$ and $\lambda > 0$ to be chosen later. We define $P$ as follows, denoting by $(X,Y)$ a pair of random variables with joint distribution $P$. For every $k \in \mathbb{N}$, $P(X = x_{2k}) = p_k q_k$ and $P(X = x_{2k+1}) = p_k (1 - q_k)$, where $p_k = 2^{-k-1}$ and $q_k \in [0,1]$ is to be chosen later; note that $\sum_{k \in \mathbb{N}} p_k = 1$. For every $k \in \mathbb{N}$, $P(Y = 1 \mid X = x_{2k}) = 0$ and $P(Y = 1 \mid X = x_{2k+1}) = (1 + \delta_k)/2$, where $\delta_k = 2^{-k\lambda}$. As a consequence, the Bayes predictor is $s := \mathbf{1}_{\{x_{2k+1} \text{ s.t. } k \in \mathbb{N}\}}$. Let us define, for every $j \in \mathbb{N}$,
$$u_j(x) := \begin{cases} s(x) & \text{if } x \ne x_j, \\ 1 - s(x) & \text{if } x = x_j, \end{cases} \qquad \text{and} \qquad f_j = \gamma(u_j; \cdot),$$
where $\gamma$ is the 0–1 loss. Then, for any $k \in \mathbb{N}$,
$$P(f_{2k+1} - f^\star) = \delta_k p_k (1 - q_k), \qquad P(f_{2k} - f^\star) = p_k q_k, \qquad (34)$$
$$\operatorname{var}_P(f_{2k+1} - f^\star) = p_k (1 - q_k) - (\delta_k p_k (1 - q_k))^2, \qquad (35)$$
$$\operatorname{var}_P(f_{2k} - f^\star) = p_k q_k - (p_k q_k)^2. \qquad (36)$$
We can now prove the four statements of Proposition 1.

(i) By (34), choosing $q_k = \delta_k / (1 + \delta_k)$ and $\lambda = \kappa - 1 > 0$ implies (i) with $b(k) = p_k q_k$.

(ii) For every $t \in (0,1)$,
$$P(|2\eta(X) - 1| \le t) = \sum_{k \in \mathbb{N}} P(X = x_{2k+1}) \mathbf{1}_{\delta_k \le t} \le \sum_{k \text{ s.t. } 2^{-k\lambda} \le t} 2^{-k-1} \le t^{1/\lambda}. \qquad (37)$$
By Lemma 9 of [9], (37) implies the global margin condition over $F_{0\text{–}1}$ with the function $\phi(x) = C_5 x^{2/(\lambda+1)}$, where $C_5$ only depends on $\lambda$. This implies the first part of (ii) since $\lambda = \kappa - 1 > 0$. For the second part, (35) implies that
$$\operatorname{var}_P(f_{2k+1} - f^\star) \ge p_k (1 - q_k)(1 - p_k) \ge \frac{p_k (1 - q_k)}{2} \ge \frac{p_k}{4} = 2^{-k-3},$$
hence the second part of (ii) holds with $C_6 = C_5 2^{2 - 3\kappa}$.

(iii) By (36), $\operatorname{var}_P(f_{2k} - f^\star) = p_k q_k (1 - p_k q_k) \le p_k q_k = P(f_{2k} - f^\star)$.

(iv) By (iii), for every $k \in \mathbb{N}$, a local margin condition holds on $F_{2k}$ with the function $\phi_{2k} : x \mapsto x^2$. So, the right-hand side of a strong margin-adaptive oracle inequality is at most (keeping only even values of $m$) proportional to
$$\inf_{0 \le k \le M_n/2} \left\{ P(f_{2k} - f^\star) + \frac{\ln(n)}{n} \right\} \le 2^{-\log_2(n) - 1} + \frac{\ln(n)}{n} = \frac{1}{2n} + \frac{\ln(n)}{n} \le \frac{2\ln(n)}{n}.$$
Note that the $\ln(n)$ factor may be replaced by a smaller quantity depending on the framework. The last statement on global margin adaptivity holds according to (ii), since $\phi^\star(x) = L(\kappa) x^{2\kappa/(2\kappa - 1)}$, where $L(\kappa) > 0$ only depends on $\kappa$. □
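To make the construction above concrete, here is a small numerical illustration of ours (with the arbitrary choice $\kappa = 2$, i.e., $\lambda = 1$, and $q_k = \delta_k/(1 + \delta_k)$ as in part (i)): it evaluates the excess losses (34) and variances (35)–(36), showing that the even models satisfy $\operatorname{var} \le \text{excess}$ (the local margin condition used in (iv)), while for the odd models the variance/excess ratio blows up like $2^{k\lambda}$.

```python
# Illustration of the construction in Section 7.3 (assumed choice: lambda = 1,
# i.e. kappa = 2, and q_k = delta_k / (1 + delta_k) as in part (i)).
lam = 1.0
for k in range(8):
    p_k = 2.0 ** (-k - 1)
    delta_k = 2.0 ** (-k * lam)
    q_k = delta_k / (1 + delta_k)
    excess_even = p_k * q_k                                       # eq. (34)
    excess_odd = delta_k * p_k * (1 - q_k)                        # eq. (34)
    var_even = p_k * q_k - (p_k * q_k) ** 2                       # eq. (36)
    var_odd = p_k * (1 - q_k) - (delta_k * p_k * (1 - q_k)) ** 2  # eq. (35)
    print(f"k={k}: even var/excess={var_even / excess_even:.3f}, "
          f"odd var/excess={var_odd / excess_odd:.3f}")
```

The even-model ratio stays below 1 for every $k$, whereas the odd-model ratio grows like $2^{k\lambda}$, which is exactly the contrast between statements (iii) and (ii) that makes strong margin adaptivity non-trivial here.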
Acknowledgements

The authors gratefully acknowledge the support of the NSF under awards DMS-0434383 and DMS-0707060. The first author's research was mostly carried out at Univ Paris-Sud (Laboratoire de Mathématiques, CNRS – UMR 8628), with the additional support of Inria Saclay (Select Project). The authors would also like to thank an anonymous referee for numerous comments that improved the presentation and some of the results of the paper.

References

[1] Arlot, S. (2007). Resampling and Model Selection. PhD thesis, University Paris-Sud 11, December 2007. Available at http://tel.archives-ouvertes.fr/tel-00198803/en/.
[2] Arlot, S. (2008). V-fold cross-validation improved: V-fold penalization. Available at arXiv:0802.0566v2.
[3] Arlot, S. (2009). Model selection by resampling penalization. Electron. J. Stat. 3 557–624 (electronic). MR2519533
[4] Arlot, S. and Massart, P. (2009). Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res. 10 245–279 (electronic).
[5] Audibert, J.-Y. (2004). Classification under polynomial entropy and margin assumptions and randomized estimators. Laboratoire de Probabilités et Modèles Aléatoires. Preprint.
[6] Audibert, J.-Y. and Tsybakov, A.B. (2007). Fast learning rates for plug-in classifiers. Ann. Statist. 35 608–633. MR2336861
[7] Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301–413. MR1679028
[8] Bartlett, P.L., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist. 33 1497–1537. MR2166554
[9] Bartlett, P.L., Jordan, M.I. and McAuliffe, J.D. (2006). Convexity, classification, and risk bounds. J. Amer. Statist. Assoc. 101 138–156. MR2268032
[10] Bartlett, P.L., Mendelson, S. and Philips, P. (2004). Local complexities for empirical risk minimization. In Learning Theory. Lecture Notes in Comput. Sci. 3120 270–284. Berlin: Springer. MR2177915
[11] Birgé, L. and Massart, P. (1998). Minimum contrast estimators on sieves: Exponential bounds and rates of convergence. Bernoulli 4 329–375. MR1653272
[12] Blanchard, G., Lugosi, G. and Vayatis, N. (2004). On the rate of convergence of regularized boosting classifiers. J. Mach. Learn. Res. 4 861–894. MR2076000
[13] Blanchard, G. and Massart, P. (2006). Discussion: "Local Rademacher complexities and oracle inequalities in risk minimization" [Ann. Statist. 34 (2006) 2593–2656] by V. Koltchinskii. Ann. Statist. 34 2664–2671. MR2329460
[14] Devroye, L. and Lugosi, G. (1995). Lower bounds in pattern recognition and learning. Pattern Recognition 28 1011–1018.
[15] Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-validation. J. Amer. Statist. Assoc. 78 316–331. MR0711106
[16] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593–2656. MR2329442
[17] Lecué, G. (2007). Simultaneous adaptation to the margin and to complexity in classification. Ann. Statist. 35 1698–1721. MR2351102
[18] Lecué, G. (2007). Suboptimality of penalized empirical risk minimization in classification. In COLT 2007. Lecture Notes in Artificial Intelligence 4539. Berlin: Springer. MR2397584
[19] Lugosi, G. (2002). Pattern classification and learning theory. In Principles of Nonparametric Learning (Udine, 2001). CISM Courses and Lectures 434 1–56. Vienna: Springer. MR1987656
[20] Lugosi, G. and Wegkamp, M. (2004). Complexity regularization via localized random penalties. Ann. Statist. 32 1679–1697. MR2089138
[21] Mammen, E. and Tsybakov, A.B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808–1829. MR1765618
[22] Massart, P. (2003). Concentration Inequalities and Model Selection. Lecture Notes in Mathematics 1896. Berlin: Springer. MR2319879
[23] Massart, P. and Nédélec, É. (2006). Risk bounds for statistical learning. Ann. Statist. 34 2326–2366. MR2291502
[24] Tsybakov, A.B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135–166. MR2051002
[25] Tsybakov, A.B. and van de Geer, S.A. (2005). Square root penalty: Adaptation to the margin in classification and in edge estimation. Ann. Statist. 33 1203–1224. MR2195633
[26] Vapnik, V.N. (1998). Statistical Learning Theory. New York: Wiley. MR1641250
[27] Vapnik, V.N. and Červonenkis, A.J. (1971). The uniform convergence of frequencies of the appearance of events to their probabilities. (Russian. English summary.) Teor. Verojatnost. i Primenen. 16 264–279. MR0288823
