Journal of Machine Learning Research 0 (0000) 0. Submitted 3/08; Revised 9/08; Published 0/00.

Data-driven Calibration of Penalties for Least-Squares Regression

Sylvain Arlot (sylvain.arlot@math.u-psud.fr)
Pascal Massart (pascal.massart@math.u-psud.fr)
Univ Paris-Sud, UMR 8628, Laboratoire de Mathématiques, Orsay, F-91405; CNRS, Orsay, F-91405; INRIA Saclay, Projet Select.

Editor: John Lafferty

Abstract

Penalization procedures often suffer from their dependence on multiplying factors, whose optimal values are either unknown or hard to estimate from the data. We propose a completely data-driven calibration algorithm for these parameters in the least-squares regression framework, without assuming a particular shape for the penalty. Our algorithm relies on the concept of minimal penalty, recently introduced by Birgé and Massart (2007) in the context of penalized least squares for Gaussian homoscedastic regression. On the positive side, the minimal penalty can be evaluated from the data themselves, leading to a data-driven estimation of an optimal penalty which can be used in practice; on the negative side, their approach heavily relies on the homoscedastic Gaussian nature of their stochastic framework. The purpose of this paper is twofold: stating a more general heuristics for designing a data-driven penalty (the slope heuristics) and proving that it works for penalized least-squares regression with a random design, even for heteroscedastic non-Gaussian data. For technical reasons, some exact mathematical results will be proved only for regressogram bin-width selection. This is at least a first step towards further results, since the approach and the method that we use are indeed general.

Keywords: Data-driven Calibration, Non-parametric Regression, Model Selection by Penalization, Heteroscedastic Data, Regressogram

1. Introduction

In the last decades, model selection has received much interest, commonly through penalization. In short, penalization chooses the model minimizing the sum of the empirical risk (how well the algorithm fits the data) and of some measure of complexity of the model (called penalty); see FPE (Akaike, 1970), AIC (Akaike, 1973), Mallows' C_p or C_L (Mallows, 1973). Many other penalization procedures have been proposed since, among which Rademacher complexities (Koltchinskii, 2001; Bartlett et al., 2002), local Rademacher complexities (Bartlett et al., 2005; Koltchinskii, 2006), bootstrap penalties (Efron, 1983), and resampling and V-fold penalties (Arlot, 2008b,c).

Model selection can target two different goals. On the one hand, a procedure is efficient (or asymptotically optimal) when its quadratic risk is asymptotically equivalent to the risk of the oracle. On the other hand, a procedure is consistent when it asymptotically chooses the smallest true model with probability one. This paper deals with efficient procedures, without assuming the existence of a true model. A huge amount of literature exists about efficiency. First, Mallows' C_p, Akaike's FPE and AIC are asymptotically optimal, as proved by Shibata (1981) for Gaussian errors, by Li (1987) under suitable moment assumptions on the errors, and by Polyak and Tsybakov (1990) under sharper moment conditions, in the Fourier case.
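Before going further, here is a minimal Python sketch (ours, not from the paper) of what penalized model selection with Mallows' C_p looks like in practice: minimize the empirical risk plus 2σ²D_m/n over a family of regressogram models, assuming the noise level σ is constant and known. All function names and numerical values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data Y_i = s(X_i) + eps_i with homoscedastic, known sigma.
n, sigma = 200, 0.5
X = rng.uniform(0, 1, n)
Y = np.sin(np.pi * X) + sigma * rng.normal(size=n)

def empirical_risk(D):
    """Least-squares risk of the regressogram with D regular bins on [0, 1]."""
    bins = np.minimum((X * D).astype(int), D - 1)
    pred = np.zeros(n)
    for b in range(D):
        in_bin = bins == b
        if in_bin.any():
            pred[in_bin] = Y[in_bin].mean()
    return np.mean((Y - pred) ** 2)

# Mallows' C_p: minimize empirical risk + 2 * sigma^2 * D / n over the models.
dims = np.arange(1, 40)
crit = [empirical_risk(D) + 2 * sigma**2 * D / n for D in dims]
print("selected dimension:", dims[int(np.argmin(crit))])
```

The calibration problem studied in this paper is precisely what happens when the multiplying factor (here 2σ²/n) is unknown and must be estimated from the data.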
Non-asymptotic oracle inequalities (with some leading constant C > 1) have been obtained by Barron et al. (1999) and by Birgé and Massart (2001) in the Gaussian case, and by Baraud (2000, 2002) under some moment assumptions on the errors. In the Gaussian case, non-asymptotic oracle inequalities with leading constant C_n tending to 1 when n tends to infinity have been obtained by Birgé and Massart (2007).

However, from the practical point of view, both AIC and Mallows' C_p still present serious drawbacks. On the one hand, AIC relies on a strong asymptotic assumption, so that for small sample sizes the optimal multiplying factor can be quite different from one. Therefore, corrected versions of AIC have been proposed (Sugiura, 1978; Hurvich and Tsai, 1989). On the other hand, the optimal calibration of Mallows' C_p requires the knowledge of the noise level σ², assumed to be constant. When real data are involved, σ² has to be estimated separately and independently from any model, which is a difficult task. Moreover, the best estimator of σ² (say, with respect to the quadratic error) quite unlikely leads to the most efficient model selection procedure. Contrary to Mallows' C_p, the data-dependent calibration rule defined in this article is not a "plug-in" method; it focuses directly on efficiency, which can significantly improve the performance of the model selection procedure.

Existing penalization procedures present drawbacks similar to or stronger than those of AIC and Mallows' C_p, often because of a gap between theory and practice. For instance, oracle inequalities have only been proved for (global) Rademacher penalties multiplied by a factor two (Koltchinskii, 2001), while they are used without this factor (Lozano, 2000). As proved by Arlot (2007, Chapter 9), this factor is necessary in general. Therefore, the optimal calibration of these penalties is really an issue. The calibration problem is even harder for local Rademacher complexities: theoretical results hold only with large calibration constants, particularly the multiplying factor, and no optimal values are known. One of the purposes of this paper is to address the issue of optimizing the multiplying factor for general-shape penalties.

Few automatic calibration algorithms are available. The most popular ones are certainly cross-validation methods (Allen, 1974; Stone, 1974), in particular V-fold cross-validation (Geisser, 1975), because these are general-purpose methods relying on a widely valid heuristics. However, their computational cost can be high. For instance, V-fold cross-validation requires the entire model selection procedure to be performed V times for each candidate value of the constant to be calibrated. For penalties proportional to the dimension of the models, such as Mallows' C_p, alternative calibration procedures have been proposed by George and Foster (2000) and by Shen and Ye (2002).

A completely different approach has been proposed by Birgé and Massart (2007) for calibrating dimensionality-based penalties. Since this article extends their approach to a much wider range of applications, let us briefly recall their main results.
In Gaussian homoscedastic regression with a fixed design, assume that each model is a finite-dimensional vector space. Consider the penalty pen(m) = K D_m, where D_m is the dimension of the model m and K > 0 is a positive constant, to be calibrated. First, there exists a minimal constant K_min such that the ratio between the quadratic risk of the chosen estimator and the quadratic risk of the oracle is asymptotically infinite if K < K_min, and finite if K > K_min. Second, when K = K⋆ := 2 K_min, the penalty K D_m yields an efficient model selection procedure. In other words, the optimal penalty is twice the minimal penalty. This relationship characterizes the "slope heuristics" of Birgé and Massart (2007).

A crucial fact is that the minimal constant K_min can be estimated from the data, since large models are selected if and only if K < K_min. This leads to the following strategy for choosing K from the data. For every K ≥ 0, let m̂(K) be the model selected by minimizing the empirical risk penalized by pen(m) = K D_m. First, compute K̂_min such that D_{m̂(K)} is "huge" for K < K̂_min and "reasonably small" when K ≥ K̂_min; explicit values for "huge" and "small" are proposed in Section 3.3. Second, define m̂ := m̂(2 K̂_min). Such a method has been successfully applied to multiple change-point detection by Lebarbier (2005).

From the theoretical point of view, the issue for understanding and validating this approach is the existence of a minimal penalty. This question has been addressed for Gaussian homoscedastic regression with a fixed design by Birgé and Massart (2001, 2007) when the variance is known, and by Baraud et al. (2007) when the variance is unknown. Non-Gaussian or heteroscedastic data have never been considered. This article contributes to filling this gap in the theoretical understanding of penalization procedures.

The calibration algorithm proposed in this article relies on a generalization of Birgé and Massart's slope heuristics (Section 2.3). In Section 3, the algorithm is defined in the least-squares regression framework, for general-shape penalties. The shape of the penalty itself can be estimated from the data, as explained in Section 3.4.

The theoretical validation of the algorithm is provided in Section 4, from the non-asymptotic point of view. Non-asymptotic means in particular that the collection of models is allowed to depend on n: in practice, it is usual to allow the number of explanatory variables to increase with the number of observations. Considering models with a large number of parameters (for example of the order of a power of the sample size n) is also necessary to approximate functions belonging to a general approximation space. Thus, the non-asymptotic point of view allows us not to assume that the regression function is described by a small number of parameters. The existence of minimal penalties for heteroscedastic regression with a random design (Theorem 2) is proved in Section 4.3. In Section 4.4, by proving that twice the minimal penalty has some optimality properties (Theorem 3), we extend the so-called slope heuristics to heteroscedastic regression with a random design.
Moreover, neither Theorem 2 nor Theorem 3 assumes the data to be Gaussian; only mild moment assumptions are required. For proving Theorems 2 and 3, each model is assumed to be the vector space of piecewise constant functions on some partition of the feature space. This is indeed a restriction, but we conjecture that it is mainly technical, and that the slope heuristics remains valid at least in the general least-squares regression framework. We provide some evidence for this by proving two key concentration inequalities without the restriction to piecewise constant functions. Another argument supporting this conjecture is that recently several simulation studies have shown that the slope heuristics can be used in several frameworks: mixture models (Maugis and Michel, 2008), clustering (Baudry, 2007), spatial statistics (Verzelen, 2008), estimation of oil reserves (Lepez, 2002) and genomics (Villers, 2007). Although the slope heuristics has not been formally validated in these frameworks, this article is a first step towards such a validation, by proving that the slope heuristics can be applied whatever the shape of the ideal penalty.

This paper is organized as follows. The framework and the slope heuristics are described in Section 2. The resulting algorithm is defined in Section 3. The main theoretical results are stated in Section 4. All the proofs are given in Appendix A.

2. Framework

In this section, we describe the framework and the general slope heuristics.

2.1 Least-squares regression

Suppose we observe some data (X_1, Y_1), ..., (X_n, Y_n) ∈ X × R, independent with common distribution P, where the feature space X is typically a compact set of R^d. The goal is to predict Y given X, where (X, Y) ∼ P is a new data point independent of (X_i, Y_i)_{1≤i≤n}. Denoting by s the regression function, that is, s(x) = E[Y | X = x] for every x ∈ X, we can write

    Y_i = s(X_i) + σ(X_i) ε_i    (1)

where σ : X → R is the heteroscedastic noise level and the ε_i are i.i.d. centered noise terms, possibly dependent on X_i, but with mean 0 and variance 1 conditionally on X_i.

The quality of a predictor t : X → R is measured by the (quadratic) prediction loss

    E_{(X,Y)∼P}[γ(t, (X, Y))] =: Pγ(t),  where  γ(t, (x, y)) = (t(x) − y)²

is the least-squares contrast. The minimizer of Pγ(t) over the set of all predictors, called the Bayes predictor, is the regression function s. Therefore, the excess loss is defined as

    ℓ(s, t) := Pγ(t) − Pγ(s) = E_{(X,Y)∼P}[(t(X) − s(X))²].

Given a particular set of predictors S_m (called a model), we define the best predictor over S_m as

    s_m := arg min_{t∈S_m} { Pγ(t) },

with its empirical counterpart

    ŝ_m := arg min_{t∈S_m} { P_n γ(t) }

(when it exists and is unique), where P_n = n^{−1} Σ_{i=1}^{n} δ_{(X_i, Y_i)}. This estimator is the well-known empirical risk minimizer, also called the least-squares estimator since γ is the least-squares contrast.
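As a concrete illustration of this framework, here is a minimal Python sketch (ours), assuming regressogram models on regular partitions of [0, 1] as in Section 4.1: data are drawn from model (1) with a heteroscedastic noise level, and the excess loss ℓ(s, ŝ_m) of the empirical risk minimizer is approximated on a fine grid. The function names and the particular s and σ are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Data from model (1): heteroscedastic noise sigma(x), X uniform on [0, 1].
n = 500
s = lambda x: np.sin(np.pi * x)        # regression function
sigma = lambda x: 0.2 + 0.8 * x        # heteroscedastic noise level
X = rng.uniform(0, 1, n)
Y = s(X) + sigma(X) * rng.normal(size=n)

def fit_regressogram(D):
    """Empirical risk minimizer on S_m: piecewise constant on D regular bins.
    Empty cells are set to 0 (the text declares such models non-selectable)."""
    bins = np.minimum((X * D).astype(int), D - 1)
    return np.array([Y[bins == b].mean() if (bins == b).any() else 0.0
                     for b in range(D)])

def excess_loss(beta_hat):
    """Grid approximation of l(s, s_hat_m) = E[(s_hat_m(X) - s(X))^2]."""
    grid = np.linspace(0, 1, 10_000, endpoint=False) + 0.5e-4
    D = len(beta_hat)
    cells = np.minimum((grid * D).astype(int), D - 1)
    return np.mean((beta_hat[cells] - s(grid)) ** 2)

for D in (2, 8, 32, 128):              # the classical bias/variance trade-off
    print(D, excess_loss(fit_regressogram(D)))
```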
2.2 Ideal model selection

Let us assume that we are given a family of models (S_m)_{m∈M_n}, hence a family of estimators (ŝ_m)_{m∈M_n} obtained by empirical risk minimization. The model selection problem consists in looking for some data-dependent m̂ ∈ M_n such that ℓ(s, ŝ_m̂) is as small as possible. For instance, it would be convenient to prove some oracle inequality of the form

    ℓ(s, ŝ_m̂) ≤ C inf_{m∈M_n} { ℓ(s, ŝ_m) } + R_n

in expectation or on an event of large probability, with leading constant C close to 1 and R_n = o(n^{−1}).

General penalization procedures can be described as follows. Let pen : M_n → R_+ be some penalty function, possibly data-dependent, and define

    m̂ ∈ arg min_{m∈M_n} { crit(m) }  with  crit(m) := P_n γ(ŝ_m) + pen(m).    (2)

Since the ideal criterion crit(m) is the true prediction error Pγ(ŝ_m), the ideal penalty is

    pen_id(m) := Pγ(ŝ_m) − P_n γ(ŝ_m).

This quantity is unknown because it depends on the true distribution P. A natural idea is to choose pen(m) as close as possible to pen_id(m) for every m ∈ M_n. We will show below, in a general setting, that when pen is a good estimator of the ideal penalty pen_id, then m̂ satisfies an oracle inequality with leading constant C close to 1.

By definition of m̂,

    ∀m ∈ M_n,  P_n γ(ŝ_m̂) ≤ P_n γ(ŝ_m) + pen(m) − pen(m̂).

For every m ∈ M_n, we define

    p_1(m) = P(γ(ŝ_m) − γ(s_m)),  p_2(m) = P_n(γ(s_m) − γ(ŝ_m)),  δ(m) = (P_n − P)(γ(s_m)),

so that

    pen_id(m) = p_1(m) + p_2(m) − δ(m)  and  ℓ(s, ŝ_m) = P_n γ(ŝ_m) + p_1(m) + p_2(m) − δ(m) − Pγ(s).

Hence, for every m ∈ M_n,

    ℓ(s, ŝ_m̂) + (pen − pen_id)(m̂) ≤ ℓ(s, ŝ_m) + (pen − pen_id)(m).    (3)

Therefore, in order to derive an oracle inequality from (3), it is sufficient to show that, for every m ∈ M_n, pen(m) is close to pen_id(m).

2.3 The slope heuristics

If the penalty is too big, the left-hand side of (3) is larger than ℓ(s, ŝ_m̂), so that (3) implies an oracle inequality, possibly with a large leading constant C. On the contrary, if the penalty is too small, the left-hand side of (3) may become negligible with respect to ℓ(s, ŝ_m̂) (which would make C explode) or, worse, may be nonpositive. In the latter case, no oracle inequality can be derived from (3). We shall see in the following that ℓ(s, ŝ_m̂) blows up if and only if the penalty is smaller than some "minimal penalty".

Let us consider first the case pen(m) = p_2(m) in (2). Then, E[crit(m)] = E[P_n γ(s_m)] = Pγ(s_m), so that m̂ approximately minimizes the bias. Therefore, m̂ is one of the more complex models, and the risk of ŝ_m̂ is large. Let us now assume that pen(m) = K p_2(m). If 0 < K < 1, crit(m) is a decreasing function of the complexity of m, so that m̂ is again one of the more complex models. On the contrary, if K > 1, crit(m) increases with the complexity of m (at least for the largest models), so that m̂ has a small or medium complexity. This argument supports the conjecture that the "minimal amount of penalty" required for the model selection procedure to work is p_2(m).

In many frameworks, such as the one of Section 4.1, it turns out that

    ∀m ∈ M_n,  p_1(m) ≈ p_2(m).

Hence, the ideal penalty pen_id(m) ≈ p_1(m) + p_2(m) is close to 2 p_2(m). Since p_2(m) is a "minimal penalty", the optimal penalty is close to twice the minimal penalty:

    pen_id(m) ≈ 2 pen_min(m).

This is the so-called "slope heuristics", first introduced by Birgé and Massart (2007) in a Gaussian homoscedastic setting.
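A small numerical check of the keystone approximation p_1(m) ≈ p_2(m) can be run in the regressogram setting of Figure 1, where both quantities have closed forms (see Section A.1): p_1(m) = Σ_λ p_λ (β̂_λ − β_λ)² and p_2(m) = Σ_λ p̂_λ (β̂_λ − β_λ)². The following Monte Carlo sketch is ours; the exact cell means β_λ are computed analytically for s(x) = sin(πx) and X ∼ U([0, 1]).

```python
import numpy as np

rng = np.random.default_rng(2)
n, D = 1000, 20                      # sample size and model dimension
s = lambda x: np.sin(np.pi * x)

# Population projection s_m: cell means of s for X ~ U([0, 1]),
# beta_lambda = D * integral of sin(pi x) over the cell (exact here).
edges = np.arange(D + 1) / D
beta = D / np.pi * (np.cos(np.pi * edges[:-1]) - np.cos(np.pi * edges[1:]))

p1s, p2s = [], []
for _ in range(300):                 # Monte Carlo over samples
    X = rng.uniform(0, 1, n)
    Y = s(X) + 0.5 * rng.normal(size=n)
    bins = np.minimum((X * D).astype(int), D - 1)
    counts = np.bincount(bins, minlength=D)
    sums = np.bincount(bins, weights=Y, minlength=D)
    beta_hat = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
    # p_lambda = 1/D exactly; phat_lambda = counts / n.
    p1s.append(np.sum((beta_hat - beta) ** 2) / D)
    p2s.append(np.sum(counts / n * (beta_hat - beta) ** 2))

print(np.mean(p1s), np.mean(p2s))    # the two averages nearly coincide
```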
Note that a formal proof of the validity of the slope heuristics has only been given for Gaussian homoscedastic least-squares regression with a fixed design (Birgé and Massart, 2007); to the best of our knowledge, the present paper yields the second theoretical result on the slope heuristics.

This heuristics has some applications because the minimal penalty can be estimated from the data. Indeed, when the penalty is smaller than pen_min, the selected model m̂ is among the more complex ones. On the contrary, when the penalty is larger than pen_min, the complexity of m̂ is much smaller. This leads to the algorithm described in the next section.

3. A data-driven calibration algorithm

We can now define a data-driven calibration algorithm for penalization procedures, generalizing a method proposed by Birgé and Massart (2007) and implemented by Lebarbier (2005).

3.1 The general algorithm

Assume that the shape pen_shape : M_n → R_+ of the ideal penalty is known, from some prior knowledge or because it has first been estimated; see Section 3.4. Then, the penalty K⋆ pen_shape provides an approximately optimal procedure, for some unknown constant K⋆ > 0. The goal is to find some K̂ such that K̂ pen_shape is approximately optimal.

Let D_m be some known complexity measure of the model m ∈ M_n. Typically, when the models are finite-dimensional vector spaces, D_m is the dimension of S_m. According to the "slope heuristics" detailed in Section 2.3, the following algorithm provides an optimal calibration of the penalty pen_shape.

Algorithm 1 (Data-driven penalization with slope heuristics)

1. Compute the selected model m̂(K) as a function of K > 0:
   m̂(K) ∈ arg min_{m∈M_n} { P_n γ(ŝ_m) + K pen_shape(m) }.
2. Find K̂_min > 0 such that D_{m̂(K)} is "huge" for K < K̂_min and "reasonably small" for K > K̂_min.
3. Select the model m̂ := m̂(2 K̂_min).

A computationally efficient way to perform the first step of Algorithm 1 is provided in Section 3.2. The accurate definition of K̂_min is discussed in Section 3.3, including explicit values for "huge" and "reasonably small". Then, once P_n γ(ŝ_m) and pen_shape(m) are known for every m ∈ M_n, the complexity of Algorithm 1 is O(Card(M_n)²) (see Algorithm 2 and Proposition 1). This can be a decisive advantage compared to cross-validation methods, as discussed in Section 4.6.
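For orientation, here is a compact sketch of Algorithm 1 (ours), using a finite grid of K values rather than the exact trajectory computed in Section 3.2, and the "threshold" definition of K̂_min from Section 3.3. Inputs are arrays of precomputed empirical risks, penalty shapes and complexities; all names are illustrative assumptions.

```python
import numpy as np

def slope_heuristics_select(risks, pen_shape, dims, K_grid, D_thresh):
    """Algorithm 1 on a grid of K values (the exact path is Algorithm 2).

    risks[m]     : empirical risk P_n gamma(s_hat_m) of model m (array)
    pen_shape[m] : penalty shape pen_shape(m) (array)
    dims[m]      : complexity D_m (array)
    """
    # Step 1: selected model as a function of K.
    m_of_K = [int(np.argmin(risks + K * pen_shape)) for K in K_grid]
    # Step 2: first K on the grid with a "reasonably small" complexity;
    # we assume the grid reaches large enough K, else fall back to its end.
    jump = next((i for i, m in enumerate(m_of_K) if dims[m] <= D_thresh),
                len(m_of_K) - 1)
    K_min = K_grid[jump]
    # Step 3: select with twice the estimated minimal constant.
    return int(np.argmin(risks + 2 * K_min * pen_shape)), K_min
```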
3.2 Computation of (m̂(K))_{K≥0}

Step 1 of Algorithm 1 requires computing m̂(K) for every K ∈ (0, +∞). A computationally efficient way to perform this step is described in this subsection. We start with some notations:

    ∀m ∈ M_n,  f(m) = P_n γ(ŝ_m),  g(m) = pen_shape(m),
    and  ∀K ≥ 0,  m̂(K) ∈ arg min_{m∈M_n} { f(m) + K g(m) }.

Since the latter definition can be ambiguous, let us choose any total ordering ≼ on M_n such that g is non-decreasing, which is always possible if M_n is at most countable. Then, m̂(K) is defined as the smallest element of E(K) := arg min_{m∈M_n} { f(m) + K g(m) } for ≼.

The main reason why the whole trajectory (m̂(K))_{K≥0} can be computed efficiently is its particular shape. Indeed, the proof of Proposition 1 shows that K ↦ m̂(K) is piecewise constant, and non-increasing for ≼. Then, the whole trajectory (m̂(K))_{K≥0} can be summarized by

• the number of jumps i_max ∈ {0, ..., Card(M_n) − 1},
• the location of the jumps: an increasing sequence of nonnegative reals (K_i)_{0≤i≤i_max+1}, with K_0 = 0 and K_{i_max+1} = +∞,
• a non-increasing sequence of models (m_i)_{0≤i≤i_max}, with ∀i ∈ {0, ..., i_max}, ∀K ∈ [K_i, K_{i+1}), m̂(K) = m_i.

Algorithm 2 (Step 1 of Algorithm 1)
For every m ∈ M_n, define f(m) = P_n γ(ŝ_m) and g(m) = pen_shape(m). Choose ≼ any total ordering on M_n such that g is non-decreasing.

• Init: K_0 := 0, m_0 := arg min_{m∈M_n} { f(m) } (when this minimum is attained several times, m_0 is defined as the smallest one with respect to ≼).
• Step i, i ≥ 1: Let G(m_{i−1}) := { m ∈ M_n s.t. f(m) > f(m_{i−1}) and g(m) < g(m_{i−1}) }. If G(m_{i−1}) = ∅, then put K_i = +∞, i_max = i − 1 and stop. Otherwise,

      K_i := inf { (f(m) − f(m_{i−1})) / (g(m_{i−1}) − g(m))  s.t. m ∈ G(m_{i−1}) }    (4)

  and m_i := min_≼ F_i, with F_i := arg min_{m∈G(m_{i−1})} { (f(m) − f(m_{i−1})) / (g(m_{i−1}) − g(m)) }.

Proposition 1 (Correctness of Algorithm 2) If M_n is finite, Algorithm 2 terminates and i_max ≤ Card(M_n) − 1. With the notations of Algorithm 2, let m̂(K) be the smallest element of E(K) := arg min_{m∈M_n} { f(m) + K g(m) } with respect to ≼. Then, (K_i)_{0≤i≤i_max+1} is increasing and

    ∀i ∈ {0, ..., i_max},  ∀K ∈ [K_i, K_{i+1}),  m̂(K) = m_i.

Proposition 1 is proved in Section A.2. In the change-point detection framework, a similar result has been proved by Lavielle (2005). Proposition 1 also gives an upper bound on the computational complexity of Algorithm 2; since the complexity of each step is O(Card M_n), Algorithm 2 requires at most O(i_max Card M_n) ≤ O((Card M_n)²) operations. In general, this upper bound is pessimistic, since i_max ≪ Card M_n.
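The following Python sketch (ours) implements Algorithm 2 literally: models are assumed indexed so that g is non-decreasing, and ties are broken by the ordering ≼ of the text.

```python
import numpy as np

def penalty_path(f, g):
    """Exact trajectory K -> m_hat(K) of Algorithm 2 (a sketch, ours).

    f[m] = P_n gamma(s_hat_m), g[m] = pen_shape(m).
    Returns (breaks, models): m_hat(K) = models[i] on [breaks[i], breaks[i+1]).
    """
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    # m_0: minimizer of f; among ties, smallest w.r.t. an ordering with
    # g non-decreasing, i.e. smallest g.
    candidates = np.flatnonzero(f == f.min())
    m = int(candidates[np.argmin(g[candidates])])
    breaks, models = [0.0], [m]
    while True:
        better = np.flatnonzero((f > f[m]) & (g < g[m]))   # the set G(m_{i-1})
        if better.size == 0:
            breaks.append(np.inf)
            return breaks, models
        slopes = (f[better] - f[m]) / (g[m] - g[better])   # candidates for (4)
        K = slopes.min()
        tied = better[slopes == K]
        m = int(tied[np.argmin(g[tied])])                  # smallest element of F_i
        breaks.append(float(K))
        models.append(m)

# Example: f decreasing, g = dimension; the path visits models of
# decreasing dimension as K grows.
f = np.array([1.0, 0.7, 0.55, 0.5, 0.49])
g = np.arange(1.0, 6.0)
print(penalty_path(f, g))
```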
3.3 Definition of K̂_min

Step 2 of Algorithm 1 estimates K̂_min such that K̂_min pen_shape is the minimal penalty. The purpose of this subsection is to define K̂_min properly as a function of (m̂(K))_{K>0}.

According to the slope heuristics described in Section 2.3, K̂_min corresponds to a "complexity jump": if K < K̂_min, m̂(K) has a large complexity, whereas if K > K̂_min, m̂(K) has a small or medium complexity. Therefore, the two following definitions of K̂_min are natural.

[Figure 1: D_{m̂(K)} as a function of K for two different samples; panel (a) shows one clear jump, panel (b) two jumps, hence two candidate values for K̂_min. Data are simulated according to (1) with n = 200, X_i ∼ U([0,1]), ε_i ∼ N(0,1), s(x) = sin(πx) and σ ≡ 1. The models (S_m)_{m∈M_n} are the sets of piecewise constant functions on regular partitions of [0,1], with dimensions between 1 and n/ln(n). The penalty shape is pen_shape(m) = D_m and the dimension threshold is D_thresh = 19 ≈ n/(2 ln(n)). See experiment S1 by Arlot (2008c, Section 6.1) for details.]

Let D_thresh be the largest "reasonably small" complexity, meaning that models with larger complexities should not be selected. When D_m is the dimension of S_m as a vector space, D_thresh ∝ n/ln(n) or n/(ln(n))² are natural choices, since the dimension of the oracle is likely to be of order n^α for some α ∈ (0, 1). Then, define

    K̂_min := inf { K > 0 s.t. D_{m̂(K)} ≤ D_thresh }.    (thresh)

With this definition, Algorithm 2 can be stopped as soon as the threshold is reached. Another idea is that K̂_min should match the largest complexity jump:

    K̂_min := K_{i_jump+1}  with  i_jump = arg max_{i∈{0,...,i_max−1}} { D_{m_i} − D_{m_{i+1}} }.    (max jump)

In order to ensure that there is a clear jump in the sequence (D_{m_i})_{i≥0}, it may be useful to add a few models of large complexity.
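Both definitions are straightforward to compute from the output of the path sketch above, where breaks = (K_i) and dims_path[i] = D_{m_i}; the following helper functions are ours.

```python
import numpy as np

def K_min_threshold(breaks, dims_path, D_thresh):
    """Definition (thresh): first K with a reasonably small selected complexity."""
    for K, D in zip(breaks, dims_path):   # breaks has one extra (+inf) entry
        if D <= D_thresh:
            return K
    return np.inf

def K_min_max_jump(breaks, dims_path):
    """Definition (max jump): breakpoint K_{i_jump+1} of the largest jump."""
    drops = np.diff(dims_path)            # D_{m_{i+1}} - D_{m_i} <= 0 along the path
    i_jump = int(np.argmin(drops))        # most negative = largest dimension drop
    return breaks[i_jump + 1]
```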
As an illustration, we compared the two definitions above ("threshold" and "maximal jump") on 1000 simulated samples. The exact simulation framework is described in the caption of Figure 1. Three cases occurred:

1. There is one clear jump. Both definitions give the same value for K̂_min. This occurred for about 85% of the samples; an example is given in Figure 1a.
2. There are several jumps corresponding to close values of K. Definitions (thresh) and (max jump) give slightly different values for K̂_min, but the selected models m̂(2 K̂_min) are equal. This occurred for about 8.5% of the samples.
3. There are several jumps corresponding to distant values of K. Definitions (thresh) and (max jump) strongly disagree, eventually giving different selected models m̂(2 K̂_min). This occurred for about 6.5% of the samples; an example is given in Figure 1b.

The only problematic case is the third one, in which an arbitrary choice has to be made between definitions (thresh) and (max jump). With the same simulated data, we compared the prediction errors of the two methods by estimating the constant C_or that would appear in some oracle inequality,

    C_or := E[ℓ(s, ŝ_m̂)] / E[inf_{m∈M_n} { ℓ(s, ŝ_m) }].

With definition (thresh), C_or ≈ 1.88; with definition (max jump), C_or ≈ 2.01. For both methods, the standard error of the estimation is 0.04. As a comparison, Mallows' C_p with a classical estimator of the variance σ² has an estimated performance C_or ≈ 1.93 on the same data.

The overall conclusion of this simulation experiment is that Algorithm 1 can be competitive with Mallows' C_p in a framework where Mallows' C_p is known to be optimal. Definition (thresh) for K̂_min seems slightly more efficient than (max jump), but without convincing evidence. Indeed, both definitions depend on some arbitrary choices: the value of the threshold D_thresh in (thresh), and the maximal complexity among the collection of models (S_m)_{m∈M_n} in (max jump). When n is small, say n = 200, choosing D_thresh is tricky, since n/(2 ln(n)) and √n are quite close. Then, the difference between (thresh) and (max jump) is more likely to come from the particular choice D_thresh = 19 than from basic differences between the two definitions.

In order to estimate K̂_min as automatically as possible, we suggest combining the two definitions: when the selected models m̂(2 K̂_min) differ, send a warning to the final user, advising them to look at the curve K ↦ D_{m̂(K)} themselves; otherwise, remain confident in the automatic choice of m̂(2 K̂_min).

3.4 Penalty shape

For using Algorithm 1 in practice, it is necessary to know a priori, or at least to estimate, the optimal shape pen_shape of the penalty. Let us explain how this can be achieved in different frameworks.

The first example that comes to mind is pen_shape(m) = D_m. It is valid for homoscedastic least-squares regression on linear models, as shown by several papers mentioned in Section 1. Indeed, when Card(M_n) is smaller than some power of n, Mallows' C_p penalty, defined by pen(m) = 2 E[σ(X)²] n^{−1} D_m, is well known to be asymptotically optimal. For larger collections M_n, more elaborate results (Birgé and Massart, 2001, 2007) have shown that a penalty proportional to ln(n) E[σ(X)²] n^{−1} D_m and depending on the size of M_n is asymptotically optimal. Algorithm 1 then provides an alternative to plugging an estimator of E[σ(X)²] into the above penalties.

Let us detail two main advantages of our approach. First, we avoid the difficult task of estimating E[σ(X)²] without knowing in advance some model to which the true regression function belongs: Algorithm 1 provides a model-free estimation of the factor multiplying the penalty. Second, the estimator σ̂² of E[σ(X)²] with the smallest quadratic risk is certainly far from being the optimal one for model selection. For instance, underestimating the multiplicative factor is well known to lead to poor performance, whereas overestimating it does not increase the prediction error much in general. Hence, a good estimator of E[σ(X)²] for model selection should overestimate it with probability larger than 1/2. Algorithm 1 satisfies this property automatically, because K̂_min is estimated so that the selected model cannot be too large. In short, Algorithm 1 with pen_shape(m) = D_m is quite different from a simple plug-in version of Mallows' C_p. It leads to a really data-dependent penalty, which may perform better in practice than the best deterministic penalty K⋆ D_m.

In a more general framework, Algorithm 1 allows choosing a different penalty shape pen_shape. For instance, in the heteroscedastic least-squares regression framework of Section 2.1, the optimal penalty is no longer proportional to the dimension D_m of the models. This can be shown from computations made by Arlot (2008c, Proposition 1) when S_m is assumed to be the vector space of piecewise constant functions on a partition (I_λ)_{λ∈Λ_m} of X:

    E[pen_id(m)] = E[(P − P_n) γ(ŝ_m)] ≈ (2/n) Σ_{λ∈Λ_m} E[σ(X)² | X ∈ I_λ].    (5)

An exact result has been proved by Arlot (2008c, Proposition 1). Moreover, Arlot (2008a) gave an example of a model selection problem in which no penalty proportional to D_m can be asymptotically optimal.

A first way to estimate the shape of the penalty is simply to use (5) to compute pen_shape, when both the distribution of X and the shape of the noise level σ are known.
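The following sketch (ours) computes the shape (5) numerically for regressograms on [0, 1] with X uniform, assuming the noise level function σ is known; the conditional expectations are approximated on a fine grid.

```python
import numpy as np

def pen_shape_heteroscedastic(edges, sigma, n, n_grid=100_000):
    """Penalty shape from (5): (2/n) * sum_lambda E[sigma(X)^2 | X in I_lambda],
    for a partition of [0, 1] given by `edges` and X ~ U([0, 1]) (our sketch)."""
    x = np.linspace(0, 1, n_grid, endpoint=False) + 0.5 / n_grid
    cells = np.searchsorted(edges, x, side="right") - 1
    s2 = sigma(x) ** 2
    cond = np.array([s2[cells == lam].mean() for lam in range(len(edges) - 1)])
    return 2.0 / n * cond.sum()

# Example: the shape over regular partitions of increasing dimension.
sigma = lambda x: 0.2 + 0.8 * x
shape = [pen_shape_heteroscedastic(np.linspace(0, 1, D + 1), sigma, n=200)
         for D in range(1, 31)]
```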
In practice, one seldom has such prior knowledge. We suggest in this situation to use resampling penalties (Efron, 1983; Arlot, 2008c), or V-fold penalties (Arlot, 2008b), which have much smaller computational costs. Up to a multiplicative factor (automatically estimated by Algorithm 1), these penalties should estimate E[pen_id(m)] correctly in any framework. In particular, resampling and V-fold penalties are asymptotically optimal in the heteroscedastic least-squares regression framework (Arlot, 2008b,c).
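As a hedged sketch of this suggestion (ours, loosely after Arlot, 2008b; the exact normalization used there may differ, but any multiplicative constant is irrelevant here since Algorithm 1 re-estimates it), a V-fold penalty compares the full-sample risk of estimators trained on block-deleted data with their training risk:

```python
import numpy as np

def vfold_penalty_shape(X, Y, fit, V=5, rng=None):
    """V-fold penalty used as pen_shape (our sketch, normalization hedged).

    fit(X, Y) must return a prediction function x -> s_hat_m(x).
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    folds = rng.permutation(n) % V
    pen = 0.0
    for j in range(V):
        train = folds != j
        pred = fit(X[train], Y[train])                     # s_hat_m^{(-j)}
        full = np.mean((pred(X) - Y) ** 2)                 # P_n gamma(s_hat_m^{(-j)})
        held = np.mean((pred(X[train]) - Y[train]) ** 2)   # training risk
        pen += full - held
    return (V - 1) / V * pen

def fit(Xtr, Ytr, D=10):
    """Regressogram with D regular bins on [0, 1] (illustrative model)."""
    bins = lambda x: np.minimum((x * D).astype(int), D - 1)
    b = bins(Xtr)
    means = np.array([Ytr[b == k].mean() if (b == k).any() else 0.0
                      for k in range(D)])
    return lambda x: means[bins(x)]
```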
3.5 The general prediction framework

Section 2 and the definition of Algorithm 1 have been restricted to the least-squares regression framework. Actually, this restriction is not at all necessary for Algorithm 1 to be well defined, so that it can naturally be extended to the general prediction framework. More precisely, the (X_i, Y_i) can be assumed to belong to X × Y for some general Y, and γ : S × (X × Y) → [0, +∞) can be any contrast function. In particular, Y = {0, 1} leads to the binary classification problem, for which a natural contrast function is the 0–1 loss γ(t; (x, y)) = 1_{t(x)≠y}. In this case, the shape of the penalty pen_shape can for instance be estimated with the global or local Rademacher complexities mentioned in Section 1.

However, a natural question is whether the slope heuristics of Section 2.3, upon which Algorithm 1 relies, can be extended to the general framework. Several concentration results used to prove the validity of the slope heuristics in the least-squares regression framework in this article are valid in a general setting including binary classification. Even if the factor 2 coming from the closeness of E[p_1] and E[p_2] (see Section 2.3) may not be universally valid, we conjecture that Algorithm 1 can be used in other settings than least-squares regression. Moreover, as already mentioned at the end of Section 1, empirical studies have shown that Algorithm 1 can be successfully applied to several problems, with different shapes for the penalty. To our knowledge, giving a formal proof of this fact remains an interesting open problem.

4. Theoretical results

Algorithm 1 mainly relies on the "slope heuristics" developed in Section 2.3. The goal of this section is to provide a theoretical justification of this heuristics. It is split into two main results. First, Theorem 2 provides lower bounds on D_m̂ and on the risk of ŝ_m̂ when the penalty is smaller than pen_min(m) := E[p_2(m)]. Second, Theorem 3 is an oracle inequality with leading constant almost one when pen(m) ≈ 2 E[p_2(m)], relying on (3) and the comparison p_1 ≈ p_2.

In order to prove both theorems, two probabilistic results are necessary. First, p_1, p_2 and δ concentrate around their expectations; for p_2 and δ, this is proved in a general framework in Appendix A.6. Second, E[p_1(m)] ≈ E[p_2(m)] for every m ∈ M_n. The latter point is quite hard to prove in general, so that we must make an assumption on the models. Therefore, in this section, we restrict ourselves to the regressogram case, assuming that for every m ∈ M_n, S_m is the set of piecewise constant functions on some fixed partition (I_λ)_{λ∈Λ_m} of X. This framework is described precisely in the next subsection. Although we do not consider regressograms as a final goal, the theoretical results proved for regressograms help to understand better how to use Algorithm 1 in practice.

4.1 Regressograms

Let S_m be the set of piecewise constant functions on some partition (I_λ)_{λ∈Λ_m} of X. The empirical risk minimizer ŝ_m on S_m is called a regressogram. S_m is a vector space of dimension D_m = Card(Λ_m), spanned by the family (1_{I_λ})_{λ∈Λ_m}. Since this basis is orthogonal in L²(μ) for any probability measure μ on X, computations are quite easy. In particular, we have

    s_m = Σ_{λ∈Λ_m} β_λ 1_{I_λ}  and  ŝ_m = Σ_{λ∈Λ_m} β̂_λ 1_{I_λ},

where

    β_λ := E_P[Y | X ∈ I_λ],  β̂_λ := (1/(n p̂_λ)) Σ_{X_i∈I_λ} Y_i,  p̂_λ := P_n(X ∈ I_λ).

Note that ŝ_m is uniquely defined if and only if each I_λ contains at least one of the X_i. Otherwise, ŝ_m is not uniquely defined and we consider that the model m cannot be chosen.

4.2 Main assumptions

In this section, we make the following assumptions. First, each model S_m is a set of piecewise constant functions on some fixed partition (I_λ)_{λ∈Λ_m} of X. Second, the family (S_m)_{m∈M_n} satisfies:

(P1) Polynomial complexity of M_n: Card(M_n) ≤ c_M n^{α_M}.
(P2) Richness of M_n: ∃ m_0 ∈ M_n s.t. D_{m_0} ∈ [√n, c_rich √n].

Assumption (P1) is quite classical for proving the asymptotic optimality of a model selection procedure; it is for instance implicitly assumed by Li (1987) in the homoscedastic fixed-design case. Assumption (P2) is merely technical and can be changed if necessary; it only ensures that (S_m)_{m∈M_n} does not contain only models which are either too small or too large.

For any penalty function pen : M_n → R_+, we define the following model selection procedure:

    m̂ ∈ arg min_{m∈M_n, min_{λ∈Λ_m} p̂_λ > 0} { P_n γ(ŝ_m) + pen(m) }.    (6)

Moreover, the data (X_i, Y_i)_{1≤i≤n} are assumed to be i.i.d. and to satisfy:

(Ab) The data are bounded: ||Y_i||_∞ ≤ A < ∞.
(An) Uniform lower bound on the noise level: σ(X_i) ≥ σ_min > 0 a.s.
(Ap_u) The bias decreases as a power of D_m: there exist some β_+, C_+ > 0 such that ℓ(s, s_m) ≤ C_+ D_m^{−β_+}.
(ArX_ℓ) Lower regularity of the partitions for L(X): D_m min_{λ∈Λ_m} { P(X ∈ I_λ) } ≥ c^X_{r,ℓ}.

Further comments are made in Sections 4.3 and 4.4 about these assumptions, in particular about their possible weakening.

4.3 Minimal penalties

Our first result concerns the existence of a minimal penalty. In this subsection, (P2) is replaced by the following stronger assumption:

(P2+) ∃ c_0, c_rich > 0 s.t. ∀ l ∈ [√n, c_0 n/(c_rich ln(n))], ∃ m ∈ M_n s.t. D_m ∈ [l, c_rich l].

The reason why (P2) does not suffice to prove Theorem 2 below is that at least one model of dimension of order n/ln(n) should belong to the family (S_m)_{m∈M_n}; otherwise, it may not be possible to prove that such models are selected by penalization procedures when the penalty is below the minimal one.

Theorem 2 Suppose all the assumptions of Section 4.2 are satisfied. Let K ∈ [0, 1), L > 0, and assume that an event of probability at least 1 − Ln^{−2} exists on which

    ∀m ∈ M_n,  0 ≤ pen(m) ≤ K E[P_n(γ(s_m) − γ(ŝ_m))].    (7)
Then, there exist two positive constants K_1, K_2 such that, with probability at least 1 − K_1 n^{−2},

    D_m̂ ≥ K_2 n ln(n)^{−1},

where m̂ is defined by (6). On the same event,

    ℓ(s, ŝ_m̂) ≥ ln(n) inf_{m∈M_n} { ℓ(s, ŝ_m) }.    (8)

The constants K_1 and K_2 may depend on K, L and the constants in (P1), (P2+), (Ab), (An), (Ap_u) and (ArX_ℓ), but do not depend on n.

This theorem thus validates the first part of the heuristics of Section 2.3, proving that a minimal amount of penalization is required; when the penalty is smaller, the selected dimension D_m̂ and the quadratic risk ℓ(s, ŝ_m̂) of the final estimator blow up. This coupling is quite interesting, since the dimension D_m̂ is known in practice, contrary to ℓ(s, ŝ_m̂). It is then possible to detect from the data whether the penalty is too small, as proposed in Algorithm 1.

The main interest of this result is its combination with Theorem 3 below. Nevertheless, Theorem 2 is also interesting by itself for understanding the theoretical properties of penalization procedures. Indeed, it generalizes the results of Birgé and Massart (2007) on the existence of minimal penalties to heteroscedastic regression with a random design, even if we have to restrict ourselves to regressograms. Moreover, we have a general formulation for the minimal penalty,

    pen_min(m) := E[P_n(γ(s_m) − γ(ŝ_m))] = E[p_2(m)],

which can be used in frameworks where it is not proportional to the dimension D_m of the models (see Section 3.4 and references therein).

In addition, assumptions (Ab) and (An) on the data are much weaker than the Gaussian homoscedastic assumption. They are also much more realistic, and can moreover be strongly relaxed. Roughly speaking, the boundedness of the data can be replaced by conditions on the moments of the noise, and the uniform lower bound σ_min is no longer necessary when σ satisfies some mild regularity assumptions. We refer to Arlot (2008c, Section 4.3) for detailed statements of these assumptions, and for explanations on how to adapt the proofs to these situations.

Finally, let us comment on conditions (Ap_u) and (ArX_ℓ). The upper bound (Ap_u) on the bias occurs in the most reasonable situations, for instance when X ⊂ R^k is bounded, the partition (I_λ)_{λ∈Λ_m} is regular and the regression function s is α-Hölderian for some α > 0 (β_+ depending on α and k). It ensures that medium and large models have a significantly smaller bias than smaller ones; otherwise, the selected dimension would be allowed to be too small with significant probability. On the other hand, (ArX_ℓ) is satisfied at least for "almost regular" partitions (I_λ)_{λ∈Λ_m}, when X has a lower-bounded density w.r.t. the Lebesgue measure on X ⊂ R^k. Theorem 2 is stated with a general formulation of (Ap_u) and (ArX_ℓ), instead of assuming for instance that s is α-Hölderian and X has a lower-bounded density w.r.t. Leb, in order to point out the generality of the "minimal penalization" phenomenon. It occurs as soon as the models are not too pathological. In particular, we do not make any assumption on the distribution of X itself, but only assume that the models are not too badly chosen according to this distribution.
Such a condition can be checked in practice if some prior knowledge on L(X) is available; if part of the data are unlabeled (a usual case), classical density estimation procedures can be applied for estimating L(X) from unlabeled data (Devroye and Lugosi, 2001).

4.4 Optimal penalties

Algorithm 1 relies on a link between the minimal penalty pointed out by Theorem 2 and some optimal penalty. The following result provides a formal proof of this link in the framework we consider: penalties close to twice the minimal penalty satisfy an oracle inequality with leading constant approximately equal to one.

Theorem 3 Suppose all the assumptions of Section 4.2 are satisfied, together with

(Ap) The bias decreases like a power of D_m: there exist β_− ≥ β_+ > 0 and C_+, C_− > 0 such that

    C_− D_m^{−β_−} ≤ ℓ(s, s_m) ≤ C_+ D_m^{−β_+}.

Let δ ∈ (0, 1), L > 0, and assume that an event of probability at least 1 − Ln^{−2} exists on which, for every m ∈ M_n,

    (2 − δ) E[P_n(γ(s_m) − γ(ŝ_m))] ≤ pen(m) ≤ (2 + δ) E[P_n(γ(s_m) − γ(ŝ_m))].    (9)

Then, for every 0 < η < min{β_+; 1}/2, there exist a constant K_3 and a sequence ε_n tending to zero at infinity such that, with probability at least 1 − K_3 n^{−2},

    D_m̂ ≤ n^{1−η}  and  ℓ(s, ŝ_m̂) ≤ ( (1 + δ)/(1 − δ) + ε_n ) inf_{m∈M_n} { ℓ(s, ŝ_m) },    (10)

where m̂ is defined by (6). Moreover, we have the oracle inequality

    E[ℓ(s, ŝ_m̂)] ≤ ( (1 + δ)/(1 − δ) + ε_n ) E[ inf_{m∈M_n} { ℓ(s, ŝ_m) } ] + A² K_3 n^{−2}.

The constant K_3 may depend on L, δ, η and the constants in (P1), (P2), (Ab), (An), (Ap) and (ArX_ℓ), but not on n. The term ε_n is smaller than ln(n)^{−1/5}; it can be made smaller than n^{−δ} for any δ ∈ (0; δ_0(β_−, β_+)) at the price of enlarging K_3.

This theorem shows that twice the minimal penalty pen_min pointed out by Theorem 2 satisfies an oracle inequality with leading constant almost one. In other words, the slope heuristics of Section 2.3 is valid. The consequences of the combination of Theorems 2 and 3 are detailed in Section 4.5. The oracle inequality (10) remains valid when the penalty is only close to twice the minimal one. In particular, the shape of the penalty can be estimated by resampling, as suggested in Section 3.4.

Actually, Theorem 3 above is a corollary of a more general result stated in Appendix A.3, Theorem 5. If

    pen(m) ≈ K E[P_n(γ(s_m) − γ(ŝ_m))]    (11)

instead of (9), under the same assumptions, an oracle inequality with leading constant C(K) + ε_n instead of 1 + ε_n holds with large probability. The constant C(K) is equal to (K − 1)^{−1} when K ∈ (1, 2] and to C(K) = K − 1 when K > 2. Therefore, for every K > 1, the penalty defined by (11) is efficient up to a multiplicative constant. This result is new in the heteroscedastic framework.

Let us comment on the additional assumption (Ap), that is, the lower bound on the bias. Assuming ℓ(s, s_m) > 0 for every m ∈ M_n is classical for proving the asymptotic optimality of Mallows' C_p (Shibata, 1981; Li, 1987; Birgé and Massart, 2007). (Ap) has been made by Stone (1985) and Burman (2002) in the density estimation framework, for the same technical reasons as ours.
Assumption (Ap) is satisfied in several frameworks, such as the following one: (I_λ)_{λ∈Λ_m} is "regular", X has a lower-bounded density w.r.t. the Lebesgue measure on X ⊂ R^k, and s is non-constant and α-Hölderian (w.r.t. ||·||_∞), with β_1 = k^{−1} + α^{−1} − (k − 1) k^{−1} α^{−1} and β_2 = 2α k^{−1}. We refer to Arlot (2007, Section 8.10) for a complete proof.

When the lower bound in (Ap) is no longer assumed, (10) holds with two modifications in its right-hand side (for details, see Arlot, 2008c, Remark 9): the infimum is restricted to models of dimension larger than ln(n)^{γ_1}, and there is a remainder term ln(n)^{γ_2} n^{−1}, where γ_1, γ_2 > 0 are numerical constants. This is equivalent to (10), unless there is a model of small dimension with a small bias; the lower bound in (Ap) ensures that this cannot happen. Note that if there is a small model close to s, it is hopeless to obtain an oracle inequality with a penalty which estimates pen_id, simply because the deviations of pen_id around its expectation would be much larger than the excess loss of the oracle. In such a situation, BIC-like methods are more appropriate; for instance, Csiszár (2002) and Csiszár and Shields (2000) showed that BIC penalties are minimal penalties for estimating the order of a Markov chain.

4.5 Main theoretical and practical consequences

The slope heuristics and the correctness of Algorithm 1 follow from the combination of Theorems 2 and 3.

4.5.1 Optimal and minimal penalties

For the sake of simplicity, let us consider the penalty K E[p_2(m)] with any K > 0; any penalty close to this one satisfies similar properties. At first reading, one can think of the homoscedastic case, where E[p_2(m)] ≈ σ² D_m n^{−1}; one of the novelties of our results is that the general picture is quite similar.

According to Theorem 3, the penalization procedure associated with K E[p_2(m)] satisfies an oracle inequality with leading constant C_n(K) as soon as K > 1, and C_n(2) ≈ 1. Moreover, results proved by Arlot (2008b) imply that C_n(K) ≥ C(K) > 1 as soon as K is not close to 2. Therefore, K = 2 is the optimal multiplying factor in front of E[p_2(m)].

When K < 1, Theorem 2 shows that no oracle inequality can hold with leading constant C_n(K) < ln(n). Since C_n(K) ≤ (K − 1)^{−1} < ln(n) as soon as K > 1 + ln(n)^{−1}, K = 1 is the minimal multiplying factor in front of E[p_2(m)]. More generally, pen_min(m) := E[p_2(m)] is proved to be a minimal penalty. In short, Theorems 2 and 3 prove the slope heuristics described in Section 2.3:

    "optimal" penalty ≈ 2 × "minimal" penalty.

Birgé and Massart (2007) have proved the validity of the slope heuristics in the Gaussian homoscedastic framework. This paper extends their result to a non-Gaussian and heteroscedastic setting.

4.5.2 Dimension jump

In addition, Theorems 2 and 3 prove the existence of a crucial phenomenon: there exists a "dimension jump" (a complexity jump in the general framework) around the minimal penalty. Let us consider again the penalty K E[p_2(m)]. As in Algorithm 1, let us define

    m̂(K) ∈ arg min_{m∈M_n} { P_n γ(ŝ_m) + K E[p_2(m)] }.
A careful look at the proofs of Theorems 2 and 3 shows that there exist constants K_4, K_5 > 0 and an event of probability 1 − K_4 n^{−2} on which

    ∀ 0 < K < 1 − ln(n)^{−1},  D_{m̂(K)} ≥ K_5 n (ln(n))^{−2},
    and  ∀ K > 1 + ln(n)^{−1},  D_{m̂(K)} ≤ n^{1−η}.    (12)

Therefore, the dimension D_{m̂(K)} of the selected model jumps around the minimal value K = 1, from values of order n (ln(n))^{−2} down to n^{1−η}.

Let us now explain why Algorithm 1 is correct, assuming that pen_shape(m) is close to E[p_2(m)]. With definition (thresh) of K̂_min and a threshold D_thresh ∝ n (ln(n))^{−3}, (12) ensures that

    1 − ln(n)^{−1} ≤ K̂_min ≤ 1 + ln(n)^{−1}

with large probability. Then, according to Theorem 3, the output of Algorithm 1 satisfies an oracle inequality with leading constant C_n tending to one as n tends to infinity.
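The dimension jump is easy to observe numerically. The following sketch (ours) reproduces the setting of Figure 1 (σ ≡ 1, so E[p_2(m)] ≈ D_m/n) and prints the selected dimension on both sides of K = 1; one typically sees very large dimensions for K well below 1 and small ones above.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = rng.uniform(0, 1, n)
Y = np.sin(np.pi * X) + rng.normal(size=n)   # model (1) with sigma = 1

def risk(D):
    """Empirical risk of the regressogram with D regular bins."""
    bins = np.minimum((X * D).astype(int), D - 1)
    pred = np.zeros(n)
    for b in set(bins):
        pred[bins == b] = Y[bins == b].mean()
    return np.mean((Y - pred) ** 2)

dims = np.arange(1, n // int(np.log(n)) + 1)
risks = np.array([risk(D) for D in dims])

for K in (0.2, 0.5, 0.9, 1.1, 1.5, 2.0):     # pen(m) = K * D_m / n
    m = int(np.argmin(risks + K * dims / n))
    print(K, dims[m])
```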
4.6 Comparison with data-splitting methods

Tuning parameters are often chosen by cross-validation or by another data-splitting method; these suffer from some drawbacks compared to Algorithm 1.

First, V-fold cross-validation, leave-p-out and repeated learning-testing methods require a larger computation time. Indeed, they need to perform the empirical risk minimization process several times for each model, whereas Algorithm 1 only needs to perform it once.

Second, V-fold cross-validation is asymptotically suboptimal when V is fixed, as shown by Arlot (2008b, Theorem 1). The same suboptimality result is valid for the hold-out when the size of the training set is not asymptotically equivalent to the sample size n. On the contrary, Theorems 2 and 3 prove that Algorithm 1 is asymptotically optimal in a framework including the one used by Arlot (2008b, Theorem 1) for proving the suboptimality of V-fold cross-validation. Hence, the quadratic risk of Algorithm 1 should be smaller, by a factor κ > 1.

Third, hold-out with a training set of size n_t ∼ n, for instance n_t = n − √n or n_t = n(1 − ln(n)^{−1}), is known to be unstable. The final output m̂ strongly depends on the choice of a particular split of the data. According to the simulation study of Section 3.3, Algorithm 1 is far more stable.

To conclude, compared to data-splitting methods, Algorithm 1 is either faster to compute, more efficient in terms of quadratic risk, or more stable. Hence, Algorithm 1 should be preferred each time it can be used. Another approach is to use aggregation techniques instead of selecting one model. As shown by several results (see for instance Tsybakov, 2004; Lecué, 2007), aggregating estimators built upon a training sample of size n_t ∼ n can have an optimal quadratic risk. Moreover, aggregation requires approximately the same computation time as Algorithm 1, and is much more stable than the hold-out. Hence, it can be an alternative to model selection with Algorithm 1.

5. Conclusion

This paper provides mathematical evidence that the method introduced by Birgé and Massart (2007) for designing data-driven penalties remains efficient in a non-Gaussian framework. The purpose of this conclusion is to relate the slope heuristics developed in Section 2 to the well-known Mallows' C_p and Akaike's criteria and to the unbiased risk estimation principle. Let us come back to Gaussian model selection in order to explain how to guess the right penalty from the data themselves.

Let γ_n be some empirical criterion (for instance the least-squares criterion, as in this paper, or the log-likelihood criterion), (S_m)_{m∈M_n} be a collection of models and, for every m ∈ M_n, s_m be some minimizer of t ↦ E[γ_n(t)] over S_m (assuming that such a point exists). Minimizing some penalized criterion γ_n(ŝ_m) + pen(m) over M_n amounts to minimizing

    b̂_m − v̂_m + pen(m),  where  ∀m ∈ M_n,  b̂_m = γ_n(s_m) − γ_n(s)  and  v̂_m = γ_n(s_m) − γ_n(ŝ_m).

The point is that b̂_m is an unbiased estimator of the bias term ℓ(s, s_m). Having concentration arguments in mind, minimizing b̂_m − v̂_m + pen(m) can be conjectured to be approximately equivalent to minimizing

    ℓ(s, s_m) − E[v̂_m] + pen(m).

Since the purpose of model selection is to minimize the risk E[ℓ(s, ŝ_m)], an ideal penalty would be

    pen(m) = E[v̂_m] + E[ℓ(s_m, ŝ_m)].

In Gaussian least-squares regression with a fixed design, the models S_m are linear and E[v̂_m] = E[ℓ(s_m, ŝ_m)] is explicitly computable if the noise level is constant and known; this leads to Mallows' C_p penalty. When γ_n is the log-likelihood,

    E[v̂_m] ≈ E[ℓ(s_m, ŝ_m)] ≈ D_m/(2n)

asymptotically, where D_m stands for the number of parameters defining model S_m; this leads to Akaike's Information Criterion (AIC). Therefore, both Mallows' C_p and Akaike's criterion are based on the unbiased (or asymptotically unbiased) risk estimation principle. This paper goes further in this direction, using that E[v̂_m] ≈ E[ℓ(s_m, ŝ_m)] remains a valid approximation in a non-asymptotic framework. Then, a good penalty becomes 2 E[v̂_m], or 2 v̂_m, having concentration arguments in mind. Since v̂_m is the minimal penalty, this explains the slope heuristics (Birgé and Massart, 2007) and connects it to Mallows' C_p and Akaike's heuristics.

The second main idea developed in this paper is that the minimal penalty can be estimated from the data; Algorithm 1 uses the jump of complexity which occurs around the minimal penalty, as shown in Sections 3.3 and 4.5.2. Another way to estimate the minimal penalty, when it is (at least approximately) of the form α D_m, is to estimate α by the slope of the graph of γ_n(ŝ_m) for large enough values of D_m; this method can be extended to other shapes of penalties, simply by replacing D_m by some (known!) function f(D_m). The slope heuristics can even be combined with resampling ideas, by taking a function f built from a randomized empirical criterion. As shown by Arlot (2008a), this approach is much more efficient than the rougher choice f(D_m) = D_m for heteroscedastic regression frameworks.
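To make this slope-based estimation concrete, here is a minimal sketch (ours): for large D_m, γ_n(ŝ_m) ≈ const − α D_m, so α can be read off a linear fit over the large models. The cutoff D_large is an illustrative assumption.

```python
import numpy as np

def estimate_alpha(risks, dims, D_large=20):
    """Estimate the minimal-penalty constant alpha from the slope of the
    graph D_m -> -gamma_n(s_hat_m) over the large models (our sketch):
    minimal penalty ~ alpha * D_m, optimal penalty ~ 2 * alpha * D_m."""
    dims = np.asarray(dims, dtype=float)
    large = dims >= D_large
    slope, _ = np.polyfit(dims[large], -np.asarray(risks, dtype=float)[large], 1)
    return slope
```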
The question of the optimality of the slope heuristics in general remains an open problem; nevertheless, we believe that this heuristics can be useful in practice, and that proving its efficiency in this paper helps to understand it better.

Let us finally mention that, contrary to Birgé and Massart (2007), we assume in this paper that the collection of models M_n is "small", that is, Card(M_n) grows at most like a power of n. For several problems, such as complete variable selection, larger collections of models have to be considered; then, it is known from the homoscedastic case that the minimal penalty is much larger than E[p_2(m)]. Nevertheless, Émilie Lebarbier has used the slope heuristics with

    f(D_m) = D_m ( 2.5 + ln( n / D_m ) )

for multiple change-point detection from n noisy data, using the results of Birgé and Massart (2007) in the Gaussian case. Let us now explain how we expect to generalize the slope heuristics to the non-Gaussian heteroscedastic case when M_n is large. First, group the models according to some complexity index C_m, such as their dimension D_m; for C ∈ {1, ..., n^k}, define S̃_C = ∪_{C_m=C} S_m. Then, replace the model selection problem with the family (S_m)_{m∈M_n} by a "complexity selection problem", that is, model selection with the family (S̃_C)_{1≤C≤n^k}. We conjecture that this grouping of the models is sufficient to take into account the richness of M_n for the optimal calibration of the penalty. A theoretical justification of this point could rely on the extension of our results to any kind of model, since S̃_C is not a vector space in general.

Acknowledgments

The authors gratefully acknowledge the anonymous referees for several suggestions and references.

Appendix A. Proofs

This appendix is devoted to the proofs of the results stated in the paper. Proposition 1 is proved in Section A.2; Theorem 3 is proved in Sections A.3 and A.4; Theorem 2 is proved in Section A.5; the remaining sections are devoted to probabilistic results used in the main proofs and to technical proofs.

A.1 Conventions and notations

In the rest of the paper, L denotes a universal constant, not necessarily the same at each occurrence. When L is not universal but depends on p_1, ..., p_k, it is written L_{p_1,...,p_k}. Similarly, L_{(SH2)} (resp. L_{(SH5)}) denotes a constant allowed to depend on the parameters of the assumptions made in Theorem 2 (resp. Theorem 5), including (P1) and (P2). We also make use of the following notations:

• ∀a, b ∈ R, a ∧ b is the minimum of a and b, a ∨ b is the maximum of a and b, a_+ = a ∨ 0 is the positive part of a and a_− = a ∧ 0 is its negative part.
• ∀ I_λ ⊂ X, p_λ := P(X ∈ I_λ) and σ²_λ := E[(Y − s_m(X))² | X ∈ I_λ].
• Since E[p_1(m)] is not well defined (because of the event {min_{λ∈Λ_m} p̂_λ = 0}), we adopt the following convention:

    p_1(m) = p̃_1(m) := Σ_{λ∈Λ_m s.t. p̂_λ>0} p_λ (β_λ − β̂_λ)² + Σ_{λ∈Λ_m s.t. p̂_λ=0} p_λ σ²_λ.

  Remark that p_1(m) = p̃_1(m) when min_{λ∈Λ_m} p̂_λ > 0, so that this convention has no consequence on the final results (Theorems 2 and 5).

A.2 Proof of Proposition 1

First, since M_n is finite, the infimum in (4) is attained as soon as G(m_{i−1}) ≠ ∅, so that m_i is well defined for every i ≤ i_max. Moreover, by construction, g(m_i) decreases with i, so that all the m_i ∈ M_n are different; hence, Algorithm 2 terminates and i_max + 1 ≤ Card(M_n). We now prove by induction the following property for every i ∈ {0, ..., i_max}:

    P_i : K_i < K_{i+1} and ∀K ∈ [K_i, K_{i+1}), m̂(K) = m_i.

Notice also that K_i can always be defined by (4) with the convention inf ∅ = +∞.
$\mathcal{P}_0$ holds true. By definition of $K_1$, it is clear that $K_1 > 0$ (it may be equal to $+\infty$ if $G(m_0) = \emptyset$). For $K = K_0 = 0$, the definition of $m_0$ is the one of $\widehat{m}(0)$, so that $\widehat{m}(K) = m_0$. For $K \in (0, K_1)$, Lemma 4 shows that either $\widehat{m}(K) = \widehat{m}(0) = m_0$ or $\widehat{m}(K) \in G(m_0)$. In the latter case, by definition of $K_1$,
\[ \frac{f(\widehat{m}(K)) - f(m_0)}{g(m_0) - g(\widehat{m}(K))} \ge K_1 > K, \]
hence $f(\widehat{m}(K)) + K g(\widehat{m}(K)) > f(m_0) + K g(m_0)$, which contradicts the definition of $\widehat{m}(K)$. Therefore, $\mathcal{P}_0$ holds true.

$\mathcal{P}_i \Rightarrow \mathcal{P}_{i+1}$ for every $i \in \{0, \ldots, i_{\max} - 1\}$. Assume that $\mathcal{P}_i$ holds true. First, we have to prove that $K_{i+2} > K_{i+1}$. Since $K_{i_{\max}+1} = +\infty$, this is clear if $i = i_{\max} - 1$. Otherwise, $K_{i+2} < +\infty$ and $m_{i+2}$ exists. Then, by definition of $m_{i+2}$ and $K_{i+2}$ (resp. $m_{i+1}$ and $K_{i+1}$), we have
\[ f(m_{i+2}) - f(m_{i+1}) = K_{i+2} \big( g(m_{i+1}) - g(m_{i+2}) \big) \tag{13} \]
\[ f(m_{i+1}) - f(m_i) = K_{i+1} \big( g(m_i) - g(m_{i+1}) \big). \tag{14} \]
Moreover, $m_{i+2} \in G(m_{i+1}) \subset G(m_i)$, and $m_{i+2} \prec m_{i+1}$ (because $g$ is non-decreasing). Using again the definition of $K_{i+1}$, we have
\[ f(m_{i+2}) - f(m_i) > K_{i+1} \big( g(m_i) - g(m_{i+2}) \big) \tag{15} \]
(otherwise, we would have $m_{i+2} \in F_{i+1}$ and $m_{i+2} \prec m_{i+1}$, which is not possible). Combining the difference of (15) and (14) with (13), we get
\[ K_{i+2} \big( g(m_{i+1}) - g(m_{i+2}) \big) > K_{i+1} \big( g(m_{i+1}) - g(m_{i+2}) \big), \]
hence $K_{i+2} > K_{i+1}$, since $g(m_{i+1}) > g(m_{i+2})$.

Second, we prove that $\widehat{m}(K_{i+1}) = m_{i+1}$. From $\mathcal{P}_i$, we know that for every $m \in \mathcal{M}_n$ and every $K \in [K_i, K_{i+1})$, $f(m_i) + K g(m_i) \le f(m) + K g(m)$. Taking the limit as $K$ tends to $K_{i+1}$, it follows that $m_i \in E(K_{i+1})$. By (14), we then have $m_{i+1} \in E(K_{i+1})$. On the other hand, if $m \in E(K_{i+1})$, Lemma 4 shows that either $f(m) = f(m_i)$ and $g(m) = g(m_i)$, or $m \in G(m_i)$. In the first case, $m_{i+1} \prec m$ (because $g$ is non-decreasing). In the second one, $m \in F_{i+1}$, so $m_{i+1} \preceq m$. Since $\widehat{m}(K_{i+1})$ is the smallest element of $E(K_{i+1})$, we have proved that $m_{i+1} = \widehat{m}(K_{i+1})$.

Last, we have to prove that $\widehat{m}(K) = m_{i+1}$ for every $K \in (K_{i+1}, K_{i+2})$. From the last statement of Lemma 4, we have either $\widehat{m}(K) = \widehat{m}(K_{i+1}) = m_{i+1}$ or $\widehat{m}(K) \in G(\widehat{m}(K_{i+1})) = G(m_{i+1})$. In the latter case (which is only possible if $K_{i+2} < \infty$), by definition of $K_{i+2}$,
\[ \frac{f(\widehat{m}(K)) - f(m_{i+1})}{g(m_{i+1}) - g(\widehat{m}(K))} \ge K_{i+2} > K, \]
so that $f(\widehat{m}(K)) + K g(\widehat{m}(K)) > f(m_{i+1}) + K g(m_{i+1})$, which contradicts the definition of $\widehat{m}(K)$. This proves $\mathcal{P}_{i+1}$.
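The induction above also shows how to compute the whole path $K \mapsto \widehat{m}(K)$ in practice: starting from $m_0$, each breakpoint $K_{i+1}$ is the smallest ratio $(f(m) - f(m_i)) / (g(m_i) - g(m))$ over $m \in G(m_i)$. The following sketch (hypothetical names; tie-breaking is only schematic, whereas Proposition 1 breaks ties through the ordering $\prec$) implements this recursion for models summarized by their pairs $(f(m), g(m))$.

```python
def penalized_path(models):
    """models: list of (f, g) pairs. Returns breakpoints [K_0 = 0, K_1, ...]
    and path [m_0, m_1, ...] such that m_hat(K) = path[i] on [K_i, K_{i+1})."""
    # m_0 minimizes f + 0 * g; among ties, take the largest g
    current = min(range(len(models)),
                  key=lambda m: (models[m][0], -models[m][1]))
    breakpoints, path = [0.0], [current]
    while True:
        f_c, g_c = models[current]
        # G(m_i): models with strictly smaller g; if empty, K_{i+1} = +infinity
        candidates = [m for m, (f, g) in enumerate(models) if g < g_c]
        if not candidates:
            break
        # K_{i+1} = min over G(m_i) of (f(m) - f(m_i)) / (g(m_i) - g(m))
        K_next, current = min(
            ((models[m][0] - f_c) / (g_c - models[m][1]), m)
            for m in candidates
        )
        breakpoints.append(K_next)
        path.append(current)
    return breakpoints, path
```

As in the proof, the successive $g(m_i)$ decrease strictly, so the loop stops after at most $\mathrm{Card}(\mathcal{M}_n)$ steps.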
Lemma 4. With the notations of Proposition 1 and its proof, if $0 \le K < K'$, $m \in E(K)$ and $m' \in E(K')$, then one of the two following statements holds true:

(a) $f(m) = f(m')$ and $g(m) = g(m')$;

(b) $f(m) < f(m')$ and $g(m) > g(m')$.

In particular, either $\widehat{m}(K) = \widehat{m}(K')$ or $\widehat{m}(K') \in G(\widehat{m}(K))$.

Proof. By definition of $E(K)$ and $E(K')$,
\[ f(m) + K g(m) \le f(m') + K g(m') \tag{16} \]
\[ f(m') + K' g(m') \le f(m) + K' g(m). \tag{17} \]
Summing (16) and (17) gives $(K' - K) g(m') \le (K' - K) g(m)$, so that
\[ g(m') \le g(m). \tag{18} \]
Since $K \ge 0$, (16) and (18) give $f(m) + K g(m) \le f(m') + K g(m)$, that is,
\[ f(m) \le f(m'). \tag{19} \]
Moreover, if $g(m) = g(m')$, then (17) implies $f(m') \le f(m)$, hence $f(m) = f(m')$ by (19). Similarly, (16) and (18) show that $f(m) = f(m')$ implies $g(m) = g(m')$. In both cases, (a) is satisfied. Otherwise, $f(m) < f(m')$ and $g(m) > g(m')$, which is statement (b). The last statement follows by taking $m = \widehat{m}(K)$ and $m' = \widehat{m}(K')$, because $g$ is non-decreasing, so that the minimum of $g$ over $E(K)$ is attained by $\widehat{m}(K)$.

A.3 A general oracle inequality

First of all, let us state a general theorem, from which Theorem 3 is an obvious corollary.

Theorem 5. Suppose all the assumptions of Section 4.2 are satisfied, together with:

(Ap) The bias decreases like a power of $D_m$: there exist $\beta_- \ge \beta_+ > 0$ and $C_+, C_- > 0$ such that
\[ C_- D_m^{-\beta_-} \le \ell(s, s_m) \le C_+ D_m^{-\beta_+}. \]

Let $L, \xi, c_1, C_1, C_2 \ge 0$ and $c_2 > 1$, and assume that there exists an event of probability at least $1 - L n^{-2}$ on which, for every $m \in \mathcal{M}_n$ such that $D_m \ge (\ln n)^\xi$,
\[ \mathbb{E}\big[ c_1 P\big( \gamma(\widehat{s}_m) - \gamma(s_m) \big) + c_2 P_n\big( \gamma(s_m) - \gamma(\widehat{s}_m) \big) \big] \le \mathrm{pen}(m) \le \mathbb{E}\big[ C_1 P\big( \gamma(\widehat{s}_m) - \gamma(s_m) \big) + C_2 P_n\big( \gamma(s_m) - \gamma(\widehat{s}_m) \big) \big]. \tag{20} \]
Then, for every $0 < \eta < \min\{\beta_+; 1\}/2$, there exist a constant $K_3$ and a sequence $\epsilon_n$ tending to zero at infinity such that, with probability at least $1 - K_3 n^{-2}$, $D_{\widehat{m}} \le n^{1-\eta}$ and
\[ \ell(s, \widehat{s}_{\widehat{m}}) \le \left( 1 + \frac{(C_1 + C_2 - 2)_+}{(c_1 + c_2 - 1) \wedge 1} + \epsilon_n \right) \inf_{m \in \mathcal{M}_n} \ell(s, \widehat{s}_m), \tag{21} \]
where $\widehat{m}$ is defined by (6). Moreover, we have the oracle inequality
\[ \mathbb{E}[\ell(s, \widehat{s}_{\widehat{m}})] \le \left( 1 + \frac{(C_1 + C_2 - 2)_+}{(c_1 + c_2 - 1) \wedge 1} + \epsilon_n \right) \mathbb{E}\Big[ \inf_{m \in \mathcal{M}_n} \ell(s, \widehat{s}_m) \Big] + \frac{A^2 K_3}{n^2}. \tag{22} \]
The constant $K_3$ may depend on $L$, $\eta$, $\xi$, $c_1$, $c_2$, $C_1$, $C_2$ and on the constants in (P1), (P2), (Ab), (An), (Ap) and (ArX$\ell$), but not on $n$. The term $\epsilon_n$ is smaller than $(\ln n)^{-1/5}$; it can be made smaller than $n^{-\delta}$ for any $\delta \in (0, \delta_0(\beta_-, \beta_+))$ at the price of enlarging $K_3$.

The particular form of condition (20) on the penalty is motivated by the fact that the ideal shape of penalty $\mathbb{E}[\mathrm{pen}_{\mathrm{id}}(m)]$ (or equivalently $\mathbb{E}[2 p_2(m)]$) is unknown in general. It then has to be estimated from the data, for instance by resampling. Under the assumptions of Theorem 5, Arlot (2008b,c) has proved that resampling and $V$-fold penalties satisfy condition (20) with constants $c_1 + c_2 = 2 - \delta_n$ and $C_1 + C_2 = 2 + \delta_n$ (for some absolute sequence $\delta_n$ tending to zero at infinity), and some numerical constant $\xi > 0$. Theorem 5 then shows that such a penalization procedure satisfies an oracle inequality with leading constant tending to one asymptotically.

The rationale behind Theorem 5 is that if $\mathrm{pen}(m)$ is close to $c_1 p_1(m) + c_2 p_2(m)$, then $\mathrm{crit}(m) \approx \ell(s, s_m) + c_1 p_1(m) + (c_2 - 1) p_2(m)$. When $c_1 = c_2 = 1$, this is exactly the ideal criterion $\ell(s, \widehat{s}_m)$. When $c_1 + c_2 = 2$ with $c_1 \ge 0$ and $c_2 > 1$, we obtain the same result because $p_1(m)$ and $p_2(m)$ are quite close, at least when $D_m$ is large enough. The closeness between $p_1$ and $p_2$ is the keystone of the slope heuristics. Notice that if $\max_{m \in \mathcal{M}_n} D_m \le K_3' (\ln n)^{-1} n$ (for some constant $K_3'$ depending only on the assumptions of Theorem 3, as $K_3$ does), one can replace the condition $c_2 > 1$ by $c_1 + c_2 > 1$ and $c_1, c_2 \ge 0$.
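To read the leading constant in (21)–(22) concretely: with the resampling constants just quoted, it collapses to $1 + o(1)$. The display below spells out the arithmetic; the second example uses round numbers of our own choosing, purely for illustration.

```latex
% With c_1 + c_2 = 2 - \delta_n and C_1 + C_2 = 2 + \delta_n (resampling and
% V-fold penalties), the leading constant of (21) is
\[
1 + \frac{(C_1 + C_2 - 2)_+}{(c_1 + c_2 - 1) \wedge 1} + \epsilon_n
  = 1 + \frac{\delta_n}{1 - \delta_n} + \epsilon_n
  \xrightarrow[n \to \infty]{} 1.
\]
% By contrast, a penalty overshooting the ideal one by half (say C_1 + C_2 = 3
% and c_1 + c_2 = 2 with c_2 > 1) only degrades the constant to 2 + \epsilon_n.
```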
A.4 Proof of Theorem 5

This proof is similar to the one of Arlot (2008c, Theorem 1). We give it for the sake of completeness. From (3), we have, for each $m \in \mathcal{M}_n$ such that $A_n(m) := \min_{\lambda \in \Lambda_m} \{ n \widehat{p}_\lambda \} > 0$,
\[ \ell(s, \widehat{s}_{\widehat{m}}) - \big( \mathrm{pen}'_{\mathrm{id}}(\widehat{m}) - \mathrm{pen}(\widehat{m}) \big) \le \ell(s, \widehat{s}_m) + \big( \mathrm{pen}(m) - \mathrm{pen}'_{\mathrm{id}}(m) \big), \tag{23} \]
with
\[ \mathrm{pen}'_{\mathrm{id}}(m) := p_1(m) + p_2(m) - \delta(m) = \mathrm{pen}_{\mathrm{id}}(m) + (P - P_n) \gamma(s) \qquad \text{and} \qquad \delta(m) := (P_n - P)\big( \gamma(s_m) - \gamma(s) \big). \]
It is thus sufficient to control $\mathrm{pen} - \mathrm{pen}'_{\mathrm{id}}$ for every $m \in \mathcal{M}_n$. We will use the concentration inequalities of Section A.6 with $x = \gamma \ln(n)$ and $\gamma = 2 + \alpha_{\mathcal{M}}$. Define $B_n(m) = \min_{\lambda \in \Lambda_m} \{ n p_\lambda \}$ and let $\Omega_n$ be the event on which:

• for every $m \in \mathcal{M}_n$, (20) holds;

• for every $m \in \mathcal{M}_n$ such that $B_n(m) \ge 1$, (29) and (30) hold:
\[ \widetilde{p}_1(m) \ge \mathbb{E}[\widetilde{p}_1(m)] - L_{(SH5)} \left( \frac{(\ln n)^2}{\sqrt{D_m}} + e^{-L B_n(m)} \right) \mathbb{E}[p_2(m)], \]
\[ \widetilde{p}_1(m) \le \mathbb{E}[\widetilde{p}_1(m)] + L_{(SH5)} \left( \frac{(\ln n)^2}{\sqrt{D_m}} + \sqrt{D_m} \, e^{-L B_n(m)} \right) \mathbb{E}[p_2(m)]; \]

• for every $m \in \mathcal{M}_n$ such that $B_n(m) > 0$, (31), (28) and (26) hold:
\[ \widetilde{p}_1(m) \ge \left( \frac{1}{2 + (\gamma + 1) B_n(m)^{-1} \ln(n)} - L_{(SH5)} \frac{(\ln n)^2}{\sqrt{D_m}} \right) \mathbb{E}[p_2(m)], \]
\[ | p_2(m) - \mathbb{E}[p_2(m)] | \le L_{(SH5)} \frac{\ln(n)}{\sqrt{D_m}} \big( \ell(s, s_m) + \mathbb{E}[p_2(m)] \big), \]
\[ | \delta(m) | \le \frac{\ell(s, s_m)}{\sqrt{D_m}} + L_{(SH5)} \frac{\ln(n)}{\sqrt{D_m}} \, \mathbb{E}[p_2(m)]. \]

From Proposition 11 (for $\widetilde{p}_1$), Proposition 10 (for $p_2$) and Proposition 8 (for $\delta(m)$),
\[ \mathbb{P}(\Omega_n) \ge 1 - L \sum_{m \in \mathcal{M}_n} n^{-2-\alpha_{\mathcal{M}}} \ge 1 - L c_{\mathcal{M}} n^{-2}. \]
For every $m \in \mathcal{M}_n$ such that $D_m \le L c^X_{r,\ell} \, n (\ln n)^{-1}$, (ArX$\ell$) implies that $B_n(m) \ge L^{-1} \ln(n) \ge 1$. As a consequence, on $\Omega_n$, if $(\ln n)^7 \le D_m \le L c^X_{r,\ell} \, n (\ln n)^{-1}$,
\[ \max\big\{ | \widetilde{p}_1(m) - \mathbb{E}[\widetilde{p}_1(m)] |, \ | p_2(m) - \mathbb{E}[p_2(m)] |, \ | \delta(m) | \big\} \le L_{(SH5)} \frac{\mathbb{E}[ \ell(s, s_m) + p_2(m) ]}{\ln(n)}. \]
Using (32) (in Proposition 12) and the fact that $B_n(m) \ge L^{-1} \ln(n)$,
\[ \frac{(c_1 + c_2)(1 - \widetilde{\delta}_n)}{2} \, \mathbb{E}\big[ \widetilde{p}_1(m) + p_2(m) \big] \le \mathbb{E}[\mathrm{pen}(m)] \le \frac{(C_1 + C_2)(1 + \widetilde{\delta}_n)}{2} \, \mathbb{E}\big[ \widetilde{p}_1(m) + p_2(m) \big] \]
with $0 \le \widetilde{\delta}_n \le L (\ln n)^{-1/4}$. We deduce that if $n \ge L_{(SH5)}$, for every $m \in \mathcal{M}_n$ such that $(\ln n)^7 \le D_m \le L c^X_{r,\ell} \, n (\ln n)^{-1}$, on $\Omega_n$,
\[ \left( (c_1 + c_2 - 2)_- - \frac{L_{(SH5)}}{(\ln n)^{1/4}} \right) p_1(m) \le (\mathrm{pen} - \mathrm{pen}'_{\mathrm{id}})(m) \le \left( (C_1 + C_2 - 2)_+ + \frac{L_{(SH5)}}{(\ln n)^{1/4}} \right) p_1(m). \]
We need to assume that $n$ is large enough in order to upper bound $\mathbb{E}[p_2(m)]$ in terms of $p_1(m)$, since we only have $p_1(m) \ge \big( 1 - L_{(SH5)} (\ln n)^{-1/4} \big)_+ \mathbb{E}[p_2(m)]$ in general. Combined with (23), this gives: if $n \ge L_{(SH5)}$,
\[ \ell(s, \widehat{s}_{\widehat{m}}) \, \mathbf{1}_{(\ln n)^5 \le D_{\widehat{m}} \le L c^X_{r,\ell} n (\ln n)^{-1}} \le \left( 1 + \frac{(C_1 + C_2 - 2)_+}{(c_1 + c_2 - 1) \wedge 1} + \frac{L_{(SH5)}}{(\ln n)^{1/4}} \right) \times \inf_{\substack{m \in \mathcal{M}_n \ \text{s.t.} \\ (\ln n)^7 \le D_m \le L_{\alpha_{\mathcal{M}}, c^X_{r,\ell}} n (\ln n)^{-1}}} \ell(s, \widehat{s}_m). \]
We now use Lemmas 6 and 7 below to control, on $\Omega_n$, the dimensions of the selected model $\widehat{m}$ and of the oracle model $m_\star \in \arg\min_{m \in \mathcal{M}_n} \ell(s, \widehat{s}_m)$. The result follows, since $L_{(SH5)} (\ln n)^{-1/4} \le \epsilon_n = (\ln n)^{-1/5}$ for $n \ge L_{(SH5)}$. We finally remove the condition $n \ge n_0 = L_{(SH5)}$ by choosing $K_3 = L_{(SH5)}$ such that $K_3 n_0^{-2} \ge 1$.

Classical oracle inequality. Since (21) holds true on $\Omega_n$,
\[ \mathbb{E}[\ell(s, \widehat{s}_{\widehat{m}})] = \mathbb{E}\big[ \ell(s, \widehat{s}_{\widehat{m}}) \mathbf{1}_{\Omega_n} \big] + \mathbb{E}\big[ \ell(s, \widehat{s}_{\widehat{m}}) \mathbf{1}_{\Omega_n^c} \big] \le \left( 1 + \frac{(C_1 + C_2 - 2)_+}{(c_1 + c_2 - 1) \wedge 1} + \epsilon_n \right) \mathbb{E}\Big[ \inf_{m \in \mathcal{M}_n} \ell(s, \widehat{s}_m) \Big] + A^2 K_3 \, \mathbb{P}(\Omega_n^c), \]
which proves (22).
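The keystone $\mathbb{E}[\widetilde{p}_1(m)] \approx \mathbb{E}[p_2(m)]$ used above (via Proposition 12 below) is easy to observe numerically. The following sketch is not part of the paper: it estimates both terms by Monte Carlo for a regressogram on a regular partition of $[0,1]$, with a regression function and a heteroscedastic noise level chosen arbitrarily, and a large fresh sample standing in for the true distribution (so $s_m$ and $p_1$ are only approximated).

```python
import numpy as np

rng = np.random.default_rng(0)
s = lambda x: np.sin(2 * np.pi * x)          # regression function (arbitrary)
noise = lambda x: 0.5 + 0.5 * x              # heteroscedastic noise level

def p1_p2(n=2000, D=20, reps=200):
    """Monte Carlo estimates of E[p1(m)] and E[p2(m)] for the regressogram
    on the regular partition of [0, 1] into D bins."""
    edges = np.linspace(0, 1, D + 1)
    bins = lambda x: np.clip(np.digitize(x, edges) - 1, 0, D - 1)
    # large fresh sample: proxy for the distribution P and for s_m (beta)
    Xt = rng.uniform(0, 1, 50 * n)
    Yt = s(Xt) + noise(Xt) * rng.standard_normal(Xt.size)
    lt = bins(Xt)
    beta = np.array([Yt[lt == k].mean() for k in range(D)])
    p1s, p2s = [], []
    for _ in range(reps):
        X = rng.uniform(0, 1, n)
        Y = s(X) + noise(X) * rng.standard_normal(n)
        lab = bins(X)
        bhat = np.array([Y[lab == k].mean() if np.any(lab == k) else 0.0
                         for k in range(D)])
        # p2(m) = P_n(gamma(s_m) - gamma(s_hat_m)), on the training sample
        p2s.append(np.mean((Y - beta[lab]) ** 2 - (Y - bhat[lab]) ** 2))
        # p1(m) = P(gamma(s_hat_m) - gamma(s_m)), on the fresh sample
        p1s.append(np.mean((Yt - bhat[lt]) ** 2 - (Yt - beta[lt]) ** 2))
    return np.mean(p1s), np.mean(p2s)

print(p1_p2())   # the two values are close (and roughly of order D/n)
```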
Lemma 6 (Control of the dimension of the selected model). Let $c > 0$ and $\alpha > (1 - \beta_+)_+ / 2$. Then, if $n \ge L_{(SH5), c, \alpha}$, on the event $\Omega_n$ defined in the proof of Theorem 5,
\[ (\ln n)^7 \le D_{\widehat{m}} \le n^{1/2 + \alpha} \le c n (\ln n)^{-1}. \]

Lemma 7 (Control of the dimension of the oracle model). Define the oracle model $m_\star \in \arg\min_{m \in \mathcal{M}_n} \ell(s, \widehat{s}_m)$. Let $c > 0$ and $\alpha > (1 - \beta_+)_+ / 2$. Then, if $n \ge L_{(SH5), c, \alpha}$, on the event $\Omega_n$ defined in the proof of Theorem 5,
\[ (\ln n)^7 \le D_{m_\star} \le n^{1/2 + \alpha} \le c n (\ln n)^{-1}. \]

Proof of Lemma 6. By definition, $\widehat{m}$ minimizes $\mathrm{crit}(m)$ over $\mathcal{M}_n$. It thus also minimizes
\[ \mathrm{crit}'(m) = \mathrm{crit}(m) - P_n \gamma(s) = \ell(s, s_m) - p_2(m) + \delta(m) + \mathrm{pen}(m) \]
over $\mathcal{M}_n$.

1. Lower bound on $\mathrm{crit}'(m)$ for small models: let $m \in \mathcal{M}_n$ be such that $D_m < (\ln n)^7$. We then have
\[ \ell(s, s_m) \ge C_- (\ln n)^{-7\beta_-} \quad \text{from (Ap)}, \qquad \mathrm{pen}(m) \ge 0, \]
\[ p_2(m) \le L_{(SH5)} \sqrt{\frac{\ln(n)}{n}} + L_{(SH5)} \frac{D_m}{n} \le L_{(SH5)} \sqrt{\frac{\ln(n)}{n}} \quad \text{from (27)}, \]
and, from (24) (in Proposition 8),
\[ \delta(m) \ge - L_A \sqrt{\frac{\ell(s, s_m) \ln(n)}{n}} - L_A \frac{\ln(n)}{n} \ge - L_A \sqrt{\frac{\ln(n)}{n}}. \]
We then have $\mathrm{crit}'(m) \ge L_{(SH5)} (\ln n)^{-7\beta_-}$.

2. Lower bound for large models: let $m \in \mathcal{M}_n$ be such that $D_m \ge n^{1/2 + \alpha}$. From (20) and (27) (in Proposition 10),
\[ \mathrm{pen}(m) - p_2(m) \ge (c_2 - 1) \mathbb{E}[p_2(m)] - L_A \sqrt{\frac{\ln(n)}{n}} \ge (c_2 - 1) \sigma_{\min}^2 \frac{D_m}{n} - L_A \sqrt{\frac{\ln(n)}{n}}, \]
and, from (24), $\delta(m) \ge - L_{(SH5)} \sqrt{\ln(n)/n}$. Hence, if $D_m \ge n^{1/2 + \alpha}$ and $n \ge L_{(SH5), \alpha}$,
\[ \mathrm{crit}'(m) \ge \mathrm{pen}(m) + \delta(m) - p_2(m) \ge L_{(SH5), \alpha} \, n^{-1/2 + \alpha}. \]

3. There exists a better model for $\mathrm{crit}(m)$: from (P2), there exists $m_0 \in \mathcal{M}_n$ such that $\sqrt{n} \le D_{m_0} \le c_{\mathrm{rich}} \sqrt{n}$. If moreover $n \ge L_{c_{\mathrm{rich}}, \alpha}$, then $(\ln n)^7 \le \sqrt{n} \le D_{m_0} \le c_{\mathrm{rich}} \sqrt{n} \le n^{1/2 + \alpha}$. By (33) in Lemma 13, $A_n(m_0) \ge 1$ with probability at least $1 - L n^{-2}$. Using (Ap), $\ell(s, s_{m_0}) \le C_+ c_{\mathrm{rich}}^{\beta_+} n^{-\beta_+ / 2}$, so that, when $n \ge L_{(SH5)}$,
\[ \mathrm{crit}'(m_0) \le \ell(s, s_{m_0}) + | \delta(m_0) | + \mathrm{pen}(m_0) \le L_{(SH5)} \big( n^{-\beta_+ / 2} + n^{-1/2} \big). \]
If $n \ge L_{(SH5), \alpha}$, this upper bound is smaller than the lower bounds obtained above for small and large models, which proves the result.

Proof of Lemma 7. Recall that $m_\star$ minimizes $\ell(s, \widehat{s}_m) = \ell(s, s_m) + p_1(m)$ over $m \in \mathcal{M}_n$, with the convention $\ell(s, \widehat{s}_m) = \infty$ if $A_n(m) = 0$.

1. Lower bound on $\ell(s, \widehat{s}_m)$ for small models: let $m \in \mathcal{M}_n$ be such that $D_m < (\ln n)^7$. From (Ap), we have $\ell(s, \widehat{s}_m) \ge \ell(s, s_m) \ge C_- (\ln n)^{-7\beta_-}$.

2. Lower bound on $\ell(s, \widehat{s}_m)$ for large models: let $m \in \mathcal{M}_n$ be such that $D_m > n^{1/2 + \alpha}$. From (31), for $n \ge L_{(SH5), \alpha}$,
\[ \widetilde{p}_1(m) \ge \left( \frac{1}{2 + (\gamma + 1) (c^X_{r,\ell})^{-1} \ln(n)} - \frac{L_{(SH5), \alpha}}{n^{1/4}} \right) \mathbb{E}[p_2(m)], \]
so that $\ell(s, \widehat{s}_m) \ge \widetilde{p}_1(m) \ge L_{(SH5), \alpha} \, n^{-1/2 + \alpha}$.

3. There exists a better model for $\ell(s, \widehat{s}_m)$: let $m_0 \in \mathcal{M}_n$ be as in the proof of Lemma 6 and assume that $n \ge L_{c_{\mathrm{rich}}, \alpha}$. Then, $p_1(m_0) \le L_{(SH5)} \mathbb{E}[p_2(m_0)] \le L_{(SH5)} n^{-1/2}$ and the arguments of the previous proof show that
\[ \ell(s, \widehat{s}_{m_0}) \le L_{(SH5)} \big( n^{-\beta_+ / 2} + n^{-1/2} \big), \]
which is smaller than the lower bounds obtained in the two previous steps for $n \ge L_{(SH5), \alpha}$.
A.5 Proof of Theorem 2

Similarly to the proof of Theorem 5, we consider the event $\Omega'_n$, of probability at least $1 - L c_{\mathcal{M}} n^{-2}$, on which:

• for every $m \in \mathcal{M}_n$, (7) (for $\mathrm{pen}$), (31) (for $\widetilde{p}_1$), (27)–(28) (for $p_2$, with $x = \gamma \ln(n)$ and $\theta = \sqrt{\ln(n)/n}$) and (24)–(26) (for $\delta$, with $x = \gamma \ln(n)$ and $\eta = \sqrt{\ln(n)/n}$) hold true;

• for every $m \in \mathcal{M}_n$ such that $B_n(m) \ge 1$, (29) and (30) hold (for $\widetilde{p}_1$).

Lower bound on $D_{\widehat{m}}$. By definition, $\widehat{m}$ minimizes
\[ \mathrm{crit}'(m) = \mathrm{crit}(m) - P_n \gamma(s) = \ell(s, s_m) - p_2(m) + \delta(m) + \mathrm{pen}(m) \]
over the $m \in \mathcal{M}_n$ such that $A_n(m) \ge 1$. As in the proof of Theorem 5, we define $c = L c^X_{r,\ell} > 0$ such that, for every model of dimension $D_m \le c n (\ln n)^{-1}$, $B_n(m) \ge L^{-1} \ln(n) \ge 1$. Let $c' = \min(c, c_0)$ and let $d \in (0, 1)$ be a constant to be chosen later.

1. Lower bound on $\mathrm{crit}'(m)$ for "small" models: assume that $m \in \mathcal{M}_n$ and $D_m \le d c' n (\ln n)^{-1}$. Then, $\ell(s, s_m) + \mathrm{pen}(m) \ge 0$ and, from (24), $\delta(m) \ge - L_A \sqrt{\ln(n)/n}$. If $D_m \ge (\ln n)^4$, (28) implies that
\[ p_2(m) \le \left( 1 + \frac{L_{(SH2)}}{\ln(n)} \right) \mathbb{E}[p_2(m)] \le L_{(SH2)} \frac{D_m}{n} \le \frac{c' d L_{(SH2)}}{\ln(n)}. \]
On the other hand, if $D_m < (\ln n)^4$, (27) implies that $p_2(m) \le L_{(SH2)} \sqrt{\ln(n)/n}$. We then have
\[ \mathrm{crit}'(m) \ge - d L_{(SH2)} (\ln n)^{-1}. \]

2. There exists a better model for $\mathrm{crit}(m)$: let $m_1 \in \mathcal{M}_n$ be such that
\[ (\ln n)^4 \le \frac{c' d n}{c_{\mathrm{rich}} \ln(n)} \le D_{m_1} \le \frac{c' n}{\ln(n)} \le n. \]
From (P2+), this is possible as soon as $n \ge L_{c_{\mathrm{rich}}, c', d}$. By (33) in Lemma 13, $A_n(m_1) \ge 1$ with probability at least $1 - L n^{-2}$. We then have
\[ \ell(s, s_{m_1}) \le L_{(SH2), c'} (\ln n)^{\beta_+} n^{-\beta_+} \quad \text{by (Ap)}, \]
\[ p_2(m_1) \ge \left( 1 - \frac{L_{(SH2)}}{\ln(n)} \right) \mathbb{E}[p_2(m_1)] \quad \text{by (28)}, \qquad \mathrm{pen}(m_1) \le K \, \mathbb{E}[p_2(m_1)] \quad \text{by (7)}, \]
\[ | \delta(m_1) | \le L_A \sqrt{\frac{\ln(n)}{n}} \quad \text{by (24)}, \]
so that
\[ \mathrm{crit}'(m_1) \le L_{(SH2), c'} (\ln n)^{\beta_+} n^{-\beta_+} + \left( K - 1 + \frac{L_{(SH2)}}{\ln(n)} \right) \mathbb{E}[p_2(m_1)] + L_A \sqrt{\frac{\ln(n)}{n}} \le \big( K - 1 + L_{(SH2)} (\ln n)^{-1} \big) \frac{\sigma_{\min}^2 c'}{2 \ln(n)} \]
if $n \ge L_{(SH2), c'}$. We now choose $d$ such that the constant $d L_{(SH2)}$ appearing in the lower bound on $\mathrm{crit}'(m)$ for "small" models is smaller than $(1 - K - L_{(SH2)} (\ln n)^{-1}) \sigma_{\min}^2 c' / 2$, that is, $d \le L_{(SH2), c'}$. Then, we assume that $n \ge n_0 = L_{(SH2), c', d} = L_{(SH2)}$. Finally, we remove this condition as before by enlarging $K_1$.

Risk of $\widehat{s}_{\widehat{m}}$. The proof of (8) is quite similar to the one of Lemma 7. First, for every model $m \in \mathcal{M}_n$ such that $A_n(m) \ge 1$ and $D_m \ge K_2 n (\ln n)^{-1}$, we have
\[ \ell(s, \widehat{s}_m) \ge \widetilde{p}_1(m) \ge L_{(SH2)} K_2 (\ln n)^{-2} \quad \text{by (31)}. \]
Then, the model $m_0 \in \mathcal{M}_n$ defined previously satisfies $A_n(m_0) \ge 1$ and
\[ \ell(s, \widehat{s}_{m_0}) \le L_{(SH2)} \big( n^{-\beta_+ / 2} + n^{-1/2} \big). \]
If $n \ge L_{(SH2)}$, the ratio between these two bounds is larger than $\ln(n)$, so that (8) holds.

A.6 Concentration inequalities used in the main proofs

In this section, we no longer assume that each model is the set of piecewise constant functions on some partition of $\mathcal{X}$. First, we control $\delta(m)$ with general models and bounded data.

Proposition 8. Assume that $\| Y \|_\infty \le A < \infty$. Then, for all $x \ge 0$, on an event of probability at least $1 - 2 e^{-x}$:
\[ \forall \eta > 0, \qquad | \delta(m) | \le \eta \, \ell(s, s_m) + \left( \frac{4}{\eta} + \frac{8}{3} \right) \frac{A^2 x}{n}. \tag{24} \]
If moreover
\[ Q^{(p)}_m := \frac{n \, \mathbb{E}[p_2(m)]}{D_m} > 0, \tag{25} \]
then, on the same event,
\[ | \delta(m) | \le \frac{\ell(s, s_m)}{\sqrt{D_m}} + \frac{20 A^2 x}{3 \, Q^{(p)}_m} \, \frac{\mathbb{E}[p_2(m)]}{\sqrt{D_m}}. \tag{26} \]

Remark 9 (Regressogram case). If $S_m$ is the set of piecewise constant functions on some partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$,
\[ Q^{(p)}_m = \frac{1}{D_m} \sum_{\lambda \in \Lambda_m} \sigma_\lambda^2 \ge \sigma_{\min}^2 > 0. \]

Then, we derive a concentration inequality for $p_2(m)$ in the regressogram case from a general result by Boucheron and Massart (2008).

Proposition 10. Let $S_m$ be the model of piecewise constant functions associated with the partition $(I_\lambda)_{\lambda \in \Lambda_m}$. Assume that $\| Y \|_\infty \le A$ and define $p_2(m) = P_n( \gamma(s_m) - \gamma(\widehat{s}_m) )$. Then, for every $x \ge 0$, there exists an event of probability at least $1 - e^{1-x}$ on which, for every $\theta \in (0, 1)$,
\[ | p_2(m) - \mathbb{E}[p_2(m)] | \le L \left( \theta \, \ell(s, s_m) + \frac{A^2 \sqrt{D_m} \sqrt{x}}{n} + \frac{A^2 x}{\theta n} \right) \tag{27} \]
for some absolute constant $L$. If moreover $\sigma(X) \ge \sigma_{\min} > 0$ a.s., we have, on the same event,
\[ | p_2(m) - \mathbb{E}[p_2(m)] | \le \frac{L}{\sqrt{D_m}} \left( \ell(s, s_m) + \frac{A^2 \, \mathbb{E}[p_2(m)]}{\sigma_{\min}^2} \right) \big( \sqrt{x} + x \big). \tag{28} \]

Finally, we recall a concentration inequality for $p_1(m)$ proved by Arlot (2008b, Proposition 9); its proof is particular to the regressogram case.

Proposition 11 (Proposition 9, Arlot (2008b)). Let $\gamma > 0$ and let $S_m$ be the model of piecewise constant functions associated with the partition $(I_\lambda)_{\lambda \in \Lambda_m}$. Assume that $\| Y \|_\infty \le A < \infty$, $\sigma(X) \ge \sigma_{\min} > 0$ a.s. and $\min_{\lambda \in \Lambda_m} \{ n p_\lambda \} \ge B_n > 0$. Then, if $B_n \ge 1$, on an event of probability at least $1 - L n^{-\gamma}$,
\[ \widetilde{p}_1(m) \ge \mathbb{E}[\widetilde{p}_1(m)] - L_{A, \sigma_{\min}, \gamma} \left( \frac{(\ln n)^2}{\sqrt{D_m}} + e^{-L B_n} \right) \mathbb{E}[p_2(m)] \tag{29} \]
\[ \widetilde{p}_1(m) \le \mathbb{E}[\widetilde{p}_1(m)] + L_{A, \sigma_{\min}, \gamma} \left( \frac{(\ln n)^2}{\sqrt{D_m}} + \sqrt{D_m} \, e^{-L B_n} \right) \mathbb{E}[p_2(m)]. \tag{30} \]
If we only have the lower bound $B_n > 0$, then, with probability at least $1 - L n^{-\gamma}$,
\[ \widetilde{p}_1(m) \ge \left( \frac{1}{2 + (\gamma + 1) B_n^{-1} \ln(n)} - L_{A, \sigma_{\min}, \gamma} \frac{(\ln n)^2}{\sqrt{D_m}} \right) \mathbb{E}[p_2(m)]. \tag{31} \]

A.7 Additional results needed

A crucial result in the proofs of Theorems 5 and 2 is that $p_1(m)$ and $p_2(m)$ are close in expectation; the following proposition was proved by Arlot (2008b, Lemma 7).

Proposition 12 (Lemma 7, Arlot (2008b)). Let $S_m$ be a model of piecewise constant functions adapted to some partition $(I_\lambda)_{\lambda \in \Lambda_m}$. Assume that $\min_{\lambda \in \Lambda_m} \{ n p_\lambda \} \ge B > 0$. Then,
\[ \big( 1 - e^{-B} \big)^2 \, \mathbb{E}[p_2(m)] \le \mathbb{E}[\widetilde{p}_1(m)] \le \Big[ 2 \wedge \big( 1 + 5.1 \, B^{-1/4} \big) + (B \vee 1) \, e^{-(B \vee 1)} \Big] \, \mathbb{E}[p_2(m)]. \tag{32} \]

Finally, we need the following technical lemma in the proofs of the main theorems.

Lemma 13. Let $(p_\lambda)_{\lambda \in \Lambda_m}$ be non-negative real numbers summing to one, and let $(n \widehat{p}_\lambda)_{\lambda \in \Lambda_m}$ be a multinomial vector of parameters $(n; (p_\lambda)_{\lambda \in \Lambda_m})$. Then, for all $\gamma > 0$,
\[ \min_{\lambda \in \Lambda_m} \{ n \widehat{p}_\lambda \} \ge \frac{\min_{\lambda \in \Lambda_m} \{ n p_\lambda \}}{2} - 2 (\gamma + 1) \ln(n) \tag{33} \]
with probability at least $1 - 2 n^{-\gamma}$.

Proof. By Bernstein's inequality (Massart, 2007, Proposition 2.9), for all $\lambda \in \Lambda_m$ and $x \ge 0$,
\[ \mathbb{P}\left( n \widehat{p}_\lambda \ge n p_\lambda - \sqrt{2 n p_\lambda x} - \frac{x}{3} \right) \ge 1 - e^{-x}. \]
Take $x = (\gamma + 1) \ln(n)$ above, and remark that $\sqrt{2 n p_\lambda x} \le n p_\lambda / 2 + x$. The union bound gives the result since $\mathrm{Card}(\Lambda_m) \le n$.
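As a quick sanity check of Lemma 13 (not part of the proof), one can simulate multinomial vectors and count how often (33) fails; all parameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma, cells = 10_000, 2.0, 50
p = np.full(cells, 1.0 / cells)                 # uniform cell probabilities
bound = n * p.min() / 2 - 2 * (gamma + 1) * np.log(n)
counts = rng.multinomial(n, p, size=1000)       # 1000 independent draws
violations = np.mean(counts.min(axis=1) < bound)
print(violations, "should be at most", 2 * n ** (-gamma))
```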
A.8 Proof of Proposition 8

Since $\| Y \|_\infty \le A$, we have $\| s \|_\infty \le A$ and $\| s_m \|_\infty \le A$; in fact, everything happens as if $S_m \cup \{ s \}$ were bounded by $A$ in $L^\infty$. We have
\[ \delta(m) = \frac{1}{n} \sum_{i=1}^n \Big( \gamma(s_m, (X_i, Y_i)) - \gamma(s, (X_i, Y_i)) - \mathbb{E}\big[ \gamma(s_m, (X_i, Y_i)) - \gamma(s, (X_i, Y_i)) \big] \Big), \]
and the assumptions of Bernstein's inequality (Massart, 2007, Proposition 2.9) are fulfilled with $c = 8 A^2 / (3n)$ and $v = 8 A^2 \ell(s, s_m) / n$, since
\[ \big\| \gamma(s_m, (X_i, Y_i)) - \gamma(s, (X_i, Y_i)) - \mathbb{E}\big[ \gamma(s_m, (X_i, Y_i)) - \gamma(s, (X_i, Y_i)) \big] \big\|_\infty \le 8 A^2 \]
and
\[ \mathrm{Var}\big( \gamma(s_m, (X_i, Y_i)) - \gamma(s, (X_i, Y_i)) \big) \le \mathbb{E}\Big[ \big( \gamma(s_m, (X_i, Y_i)) - \gamma(s, (X_i, Y_i)) \big)^2 \Big] \le 8 A^2 \, \ell(s, s_m), \]
because $\| s_m - s \|_\infty \le 2A$,
\[ \big( \gamma(t, \cdot) - \gamma(s, \cdot) \big)^2 = \big( t(X) - s(X) \big)^2 \big( 2 (Y - s(X)) - t(X) + s(X) \big)^2 \]
and $\mathbb{E}[ (Y - s(X))^2 \mid X ] \le (2A)^2 / 4 = A^2$. We obtain that, with probability at least $1 - 2 e^{-x}$,
\[ | \delta(m) | \le \sqrt{2 v x} + c x = \sqrt{\frac{16 A^2 \ell(s, s_m) x}{n}} + \frac{8 A^2 x}{3 n}, \]
and (24) follows since $2 \sqrt{ab} \le a \eta + b \eta^{-1}$ for all $\eta > 0$. Taking $\eta = D_m^{-1/2} \le 1$ and using $Q^{(p)}_m$ defined by (25), we deduce (26).

A.9 Proof of Proposition 10

We apply here a result by Boucheron and Massart (2008, Theorem 2.2 in a preliminary version), in which it is only assumed that $\gamma$ takes its values in $[0, 1]$. This is satisfied when $\| Y \|_\infty \le A = 1/2$. When $A \neq 1/2$, we apply this result to $(2A)^{-1} Y$ and recover the general result by homogeneity. First, we recall this result in the bounded least-squares regression framework.

For every $t: \mathcal{X} \to \mathbb{R}$ and $\epsilon > 0$, define $d^2(s, t) = 2 \ell(s, t)$ and $w(\epsilon) = \sqrt{2} \, \epsilon$. Let $\phi_m$ belong to the class of nondecreasing continuous functions $f: \mathbb{R}_+ \to \mathbb{R}_+$ such that $x \mapsto f(x)/x$ is nonincreasing on $(0, +\infty)$ and $f(1) \ge 1$. Assume that, for every $u \in S_m$ and $\sigma > 0$ such that $\phi_m(\sigma) \le \sqrt{n} \sigma^2$,
\[ \sqrt{n} \, \mathbb{E}\left[ \sup_{t \in S_m, \, d(u, t) \le \sigma} \big| (P_n - P)\big( \gamma(u, \cdot) - \gamma(t, \cdot) \big) \big| \right] \le \phi_m(\sigma). \tag{34} \]
Let $\epsilon_{\star, m}$ be the unique positive solution of the equation $\sqrt{n} \, \epsilon_{\star, m}^2 = \phi_m( w(\epsilon_{\star, m}) )$. Then, there exists some absolute constant $L$ such that, for every real number $q \ge 2$,
\[ \big\| p_2(m) - \mathbb{E}[p_2(m)] \big\|_q \le \frac{L}{\sqrt{n}} \left( \sqrt{2q} \left( \sqrt{\ell(s, s_m)} \vee \epsilon_{\star, m} \right) + \frac{q^2}{\sqrt{n}} \right). \tag{35} \]
Using now that $S_m$ is the set of piecewise constant functions on some partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$, we can take
\[ \phi_m(\sigma) = 3 \sqrt{2} \sqrt{D_m} \times \sigma \quad \text{in (34)}; \tag{36} \]
the proof of this claim is given below. Then, $\epsilon_{\star, m} = 6 \sqrt{D_m} \, n^{-1/2}$. Combining (35) with the classical link between moments and concentration (see for instance Arlot, 2007, Lemma 8.9), the first result follows. The second result is obtained by taking $\theta = D_m^{-1/2}$, as in Proposition 8.

Proof of (36). Let $u \in S_m$ and $d(u, t) = \sqrt{2} \, \| u(X) - t(X) \|_2$ for every $t: \mathcal{X} \to \mathbb{R}$. Define $\psi: \mathbb{R}_+ \to \mathbb{R}_+$ by
\[ \psi(\sigma) = \mathbb{E}\left[ \sup_{t \in S_m, \, d(u, t) \le \sigma} \big| (P_n - P)\big( \gamma(u, \cdot) - \gamma(t, \cdot) \big) \big| \right]. \]
We are looking for some nondecreasing continuous function $\phi_m: \mathbb{R}_+ \to \mathbb{R}_+$ such that $\phi_m(x)/x$ is nonincreasing, $\phi_m(1) \ge 1$ and, for every $u \in S_m$ and every $\sigma > 0$ such that $\phi_m(\sigma) \le \sqrt{n} \sigma^2$, $\phi_m(\sigma) \ge \sqrt{n} \, \psi(\sigma)$.

We first derive a general upper bound on $\psi$. Assume that $u = s_m$; if this is not the case, the triangular inequality shows that $\psi_{\text{general } u} \le 2 \, \psi_{u = s_m}$. Let us write
\[ t = \sum_{\lambda \in \Lambda_m} t_\lambda \mathbf{1}_{I_\lambda} \qquad \text{and} \qquad u = s_m = \sum_{\lambda \in \Lambda_m} \beta_\lambda \mathbf{1}_{I_\lambda}. \]
Computation of $P( \gamma(t, \cdot) - \gamma(s_m, \cdot) )$ for a general $t \in S_m$:
\[ P\big( \gamma(t, \cdot) - \gamma(s_m, \cdot) \big) = \mathbb{E}\big[ (t(X) - Y)^2 - (s_m(X) - Y)^2 \big] = \mathbb{E}\big[ (t(X) - s_m(X))^2 \big] + 2 \, \mathbb{E}\big[ (t(X) - s_m(X)) (s_m(X) - s(X)) \big] = \mathbb{E}\big[ (t(X) - s_m(X))^2 \big] = \sum_{\lambda \in \Lambda_m} p_\lambda (t_\lambda - \beta_\lambda)^2, \]
since for every $\lambda \in \Lambda_m$, $\mathbb{E}[ s(X) \mid X \in I_\lambda ] = \beta_\lambda$.

Computation of $P_n( \gamma(t, \cdot) - \gamma(s_m, \cdot) )$ for a general $t \in S_m$: with $\eta_i = Y_i - s_m(X_i)$, we have
\[ P_n\big( \gamma(t, \cdot) - \gamma(s_m, \cdot) \big) = \frac{1}{n} \sum_{i=1}^n \big( (t(X_i) - Y_i)^2 - (u(X_i) - Y_i)^2 \big) = \frac{1}{n} \sum_{i=1}^n (t(X_i) - u(X_i))^2 - \frac{2}{n} \sum_{i=1}^n (t(X_i) - u(X_i)) \eta_i = \frac{1}{n} \sum_{i=1}^n \sum_{\lambda \in \Lambda_m} (t_\lambda - u_\lambda)^2 \mathbf{1}_{X_i \in I_\lambda} - \frac{2}{n} \sum_{i=1}^n \sum_{\lambda \in \Lambda_m} (t_\lambda - u_\lambda) \mathbf{1}_{X_i \in I_\lambda} \eta_i. \]

Back to $(P_n - P)$: taking the difference of the two computations above and using the triangular inequality,
\[ \big| (P_n - P)\big( \gamma(t, \cdot) - \gamma(u, \cdot) \big) \big| \le \left| \frac{1}{n} \sum_{i=1}^n \sum_{\lambda \in \Lambda_m} (t_\lambda - u_\lambda)^2 \big( \mathbf{1}_{X_i \in I_\lambda} - p_\lambda \big) \right| + \left| \frac{2}{n} \sum_{i=1}^n \sum_{\lambda \in \Lambda_m} (t_\lambda - u_\lambda) \mathbf{1}_{X_i \in I_\lambda} \eta_i \right| \le \frac{2A}{n} \sum_{\lambda \in \Lambda_m} \big( \sqrt{p_\lambda} \, | t_\lambda - u_\lambda | \big) \frac{\big| \sum_{i=1}^n ( \mathbf{1}_{X_i \in I_\lambda} - p_\lambda ) \big|}{\sqrt{p_\lambda}} + \frac{2}{n} \sum_{\lambda \in \Lambda_m} \big( \sqrt{p_\lambda} \, | t_\lambda - u_\lambda | \big) \frac{\big| \sum_{i=1}^n \mathbf{1}_{X_i \in I_\lambda} \eta_i \big|}{\sqrt{p_\lambda}}, \]
since $| t_\lambda - u_\lambda | \le 2A$ for every $t \in S_m$. We now assume that $d(u, t) \le \sigma$ for some $\sigma > 0$, that is,
\[ d(u, t)^2 = 2 \sum_{\lambda \in \Lambda_m} p_\lambda (t_\lambda - u_\lambda)^2 \le \sigma^2. \]
From the Cauchy–Schwarz inequality, we obtain, for every $t \in S_m$ such that $d(u, t) \le \sigma$,
\[ \big| (P_n - P)\big( \gamma(t, \cdot) - \gamma(u, \cdot) \big) \big| \le \frac{\sqrt{2} A \sigma}{n} \sqrt{ \sum_{\lambda \in \Lambda_m} \frac{\big( \sum_{i=1}^n ( \mathbf{1}_{X_i \in I_\lambda} - p_\lambda ) \big)^2}{p_\lambda} } + \frac{\sqrt{2} \sigma}{n} \sqrt{ \sum_{\lambda \in \Lambda_m} \frac{\big( \sum_{i=1}^n \mathbf{1}_{X_i \in I_\lambda} \eta_i \big)^2}{p_\lambda} }. \]

Back to $\psi$: the upper bound above does not depend on $t$, so that its left-hand side can be replaced by a supremum over $\{ t \in S_m \ \text{s.t.} \ d(u, t) \le \sigma \}$. Taking expectations and using Jensen's inequality ($\sqrt{\cdot}$ being concave), we obtain an upper bound on $\psi$:
\[ \psi(\sigma) \le \frac{\sqrt{2} A \sigma}{n} \sqrt{ \sum_{\lambda \in \Lambda_m} \frac{\mathbb{E}\Big[ \big( \sum_{i=1}^n ( \mathbf{1}_{X_i \in I_\lambda} - p_\lambda ) \big)^2 \Big]}{p_\lambda} } + \frac{\sqrt{2} \sigma}{n} \sqrt{ \sum_{\lambda \in \Lambda_m} \frac{\mathbb{E}\Big[ \big( \sum_{i=1}^n \mathbf{1}_{X_i \in I_\lambda} \eta_i \big)^2 \Big]}{p_\lambda} }. \tag{37} \]
For every $\lambda \in \Lambda_m$, we have
\[ \mathbb{E}\left[ \left( \sum_{i=1}^n ( \mathbf{1}_{X_i \in I_\lambda} - p_\lambda ) \right)^2 \right] = \sum_{i=1}^n \mathbb{E}\big[ ( \mathbf{1}_{X_i \in I_\lambda} - p_\lambda )^2 \big] = n p_\lambda (1 - p_\lambda), \tag{38} \]
which simplifies the first term. For the second term, notice that
\[ \forall i \neq j, \quad \mathbb{E}\big[ \mathbf{1}_{X_i \in I_\lambda} \mathbf{1}_{X_j \in I_\lambda} \eta_i \eta_j \big] = \mathbb{E}\big[ \mathbf{1}_{X_i \in I_\lambda} \eta_i \big] \, \mathbb{E}\big[ \mathbf{1}_{X_j \in I_\lambda} \eta_j \big] \quad \text{and} \quad \forall i, \quad \mathbb{E}\big[ \mathbf{1}_{X_i \in I_\lambda} \eta_i \big] = \mathbb{E}\big[ \mathbf{1}_{X_i \in I_\lambda} \mathbb{E}[ \eta_i \mid \mathbf{1}_{X_i \in I_\lambda} ] \big] = 0, \]
since $\eta_i$ is centered conditionally on $\mathbf{1}_{X_i \in I_\lambda}$. Then,
\[ \mathbb{E}\left[ \left( \sum_{i=1}^n \mathbf{1}_{X_i \in I_\lambda} \eta_i \right)^2 \right] = \sum_{i=1}^n \mathbb{E}\big[ \mathbf{1}_{X_i \in I_\lambda} \eta_i^2 \big] \le n p_\lambda \, \| \eta \|_\infty^2 \le n p_\lambda (2A)^2. \tag{39} \]
Combining (37) with (38) and (39), we deduce that
\[ \psi(\sigma) \le \frac{\sqrt{2} A \sigma}{\sqrt{n}} \sqrt{D_m - 1} + \frac{2 \sqrt{2} A \sigma}{\sqrt{n}} \sqrt{D_m} \le \frac{3 \sqrt{2} A \sqrt{D_m}}{\sqrt{n}} \times \sigma. \]
As already noticed, this bound has to be multiplied by $2$ to be valid for every $u \in S_m$ and not only for $u = s_m$. The resulting upper bound (multiplied by $\sqrt{n}$) has all the desired properties for $\phi_m$, since $6 \sqrt{2} A \sqrt{D_m} = 3 \sqrt{2 D_m} \ge 1$ when $A = 1/2$. The result follows.
References

Hirotugu Akaike. Statistical predictor identification. Ann. Inst. Statist. Math., 22:203–217, 1970.

Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971), pages 267–281. Akadémiai Kiadó, Budapest, 1973.

David M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125–127, 1974.

Sylvain Arlot. Resampling and Model Selection. PhD thesis, University Paris-Sud 11, December 2007. oai:tel.archives-ouvertes.fr:tel-00198803v1.

Sylvain Arlot. Suboptimality of penalties proportional to the dimension for model selection in heteroscedastic regression, December 2008a. arXiv:0812.3141.

Sylvain Arlot. V-fold cross-validation improved: V-fold penalization, February 2008b. arXiv:0802.0566v2.

Sylvain Arlot. Model selection by resampling penalization, March 2008c. oai:hal.archives-ouvertes.fr:hal-00262478v1.

Yannick Baraud. Model selection for regression on a fixed design. Probab. Theory Related Fields, 117(4):467–493, 2000.

Yannick Baraud. Model selection for regression on a random design. ESAIM Probab. Statist., 6:127–146 (electronic), 2002.

Yannick Baraud, Christophe Giraud, and Sylvie Huet. Gaussian model selection with unknown variance. To appear in The Annals of Statistics. arXiv:math.ST/0701250, 2007.

Andrew Barron, Lucien Birgé, and Pascal Massart. Risk bounds for model selection via penalization. Probab. Theory Related Fields, 113(3):301–413, 1999.

Peter L. Bartlett, Stéphane Boucheron, and Gábor Lugosi. Model selection and error estimation. Machine Learning, 48:85–113, 2002.

Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Ann. Statist., 33(4):1497–1537, 2005.

Jean-Patrick Baudry. Clustering through model selection criteria. Poster session at One Day Statistical Workshop in Lisieux. http://www.math.u-psud.fr/~baudry, June 2007.

Lucien Birgé and Pascal Massart. Gaussian model selection. J. Eur. Math. Soc. (JEMS), 3(3):203–268, 2001.

Lucien Birgé and Pascal Massart. Minimal penalties for Gaussian model selection. Probab. Theory Related Fields, 138(1-2):33–73, 2007.

Stéphane Boucheron and Pascal Massart. A poor man's Wilks phenomenon. Personal communication, March 2008.

Prabir Burman. Estimation of equifrequency histograms. Statist. Probab. Lett., 56(3):227–238, 2002.

Imre Csiszár. Large-scale typicality of Markov sample paths and consistency of MDL order estimators. IEEE Trans. Inform. Theory, 48(6):1616–1628, 2002.

Imre Csiszár and Paul C. Shields. The consistency of the BIC Markov order estimator. Ann. Statist., 28(6):1601–1619, 2000.

Luc Devroye and Gábor Lugosi. Combinatorial Methods in Density Estimation. Springer Series in Statistics. Springer-Verlag, New York, 2001.

Bradley Efron. Estimating the error rate of a prediction rule: improvement on cross-validation. J. Amer. Statist. Assoc., 78(382):316–331, 1983.

Seymour Geisser. The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70:320–328, 1975.

Edward I. George and Dean P. Foster. Calibration and empirical Bayes variable selection. Biometrika, 87(4):731–747, 2000.

Clifford M. Hurvich and Chih-Ling Tsai. Regression and time series model selection in small samples. Biometrika, 76(2):297–307, 1989.

Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory, 47(5):1902–1914, 2001.
Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist., 34(6):2593–2656, 2006.

Marc Lavielle. Using penalized contrasts for the change-point problem. Signal Process., 85(8):1501–1510, 2005.

Émilie Lebarbier. Detecting multiple change-points in the mean of a Gaussian process by model selection. Signal Process., 85:717–736, 2005.

Guillaume Lecué. Méthodes d'agrégation : optimalité et vitesses rapides. PhD thesis, LPMA, University Paris VII, May 2007.

Vincent Lepez. Some estimation problems related to oil reserves. PhD thesis, University Paris XI, 2002.

Ker-Chau Li. Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. Ann. Statist., 15(3):958–975, 1987.

Fernando Lozano. Model selection using Rademacher penalization. In Proceedings of the 2nd ICSC Symp. on Neural Computation (NC2000), Berlin, Germany. ICSC Academic Press, 2000.

Colin L. Mallows. Some comments on C_p. Technometrics, 15:661–675, 1973.

Pascal Massart. Concentration Inequalities and Model Selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007.

Cathy Maugis and Bertrand Michel. A non asymptotic penalized criterion for Gaussian mixture model selection. Technical Report 6549, INRIA, 2008.

Boris T. Polyak and Alexandre B. Tsybakov. Asymptotic optimality of the C_p-test in the projection estimation of a regression. Teor. Veroyatnost. i Primenen., 35(2):305–317, 1990.

Xiaotong Shen and Jianming Ye. Adaptive model selection. J. Amer. Statist. Assoc., 97(457):210–221, 2002.

Ritei Shibata. An optimal selection of regression variables. Biometrika, 68(1):45–54, 1981.

Charles J. Stone. An asymptotically optimal histogram selection rule. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, Vol. II (Berkeley, Calif., 1983), Wadsworth Statist./Probab. Ser., pages 513–520, Belmont, CA, 1985. Wadsworth.

M. Stone. Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B, 36:111–147, 1974.

Nariaki Sugiura. Further analysis of the data by Akaike's information criterion and the finite corrections. Comm. Statist. A—Theory Methods, 7(1):13–26, 1978.

Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32(1):135–166, 2004.

Nicolas Verzelen. Gaussian graphical models and model selection. PhD thesis, University Paris XI, December 2008.

Fanny Villers. Tests et sélection de modèles pour l'analyse de données protéomiques et transcriptomiques. PhD thesis, University Paris XI, December 2007.
