Model Selection with the Loss Rank Principle

A key issue in statistics and machine learning is to automatically select the "right" model complexity, e.g., the number of neighbors to be averaged over in k nearest neighbor (kNN) regression or the polynomial degree in regression with polynomials. …

Authors: Marcus Hutter, Minh-Ngoc Tran

Model Selection with the Loss Rank Principle

Marcus Hutter
RSISE @ ANU and SML @ NICTA, Canberra, ACT 0200, Australia
marcus@hutter1.net, www.hutter1.net

Minh-Ngoc Tran
Department of Statistics and Applied Probability, National University of Singapore
ngoctm@nus.edu.sg

2 March 2010

Abstract

A key issue in statistics and machine learning is to automatically select the "right" model complexity, e.g., the number of neighbors to be averaged over in k nearest neighbor (kNN) regression or the polynomial degree in regression with polynomials. We suggest a novel principle, the Loss Rank Principle (LoRP), for model selection in regression and classification. It is based on the loss rank, which counts how many other (fictitious) data would be fitted better. LoRP selects the model that has minimal loss rank. Unlike most penalized maximum likelihood variants (AIC, BIC, MDL), LoRP depends only on the regression functions and the loss function. It works without a stochastic noise model, and is directly applicable to any non-parametric regressor, like kNN.

Contents
1  Introduction
2  The Loss Rank Principle
3  LoRP for y-Linear Models
4  Optimality Properties of LoRP for Variable Selection
5  Experiments
6  Comparison to Gaussian Bayesian Linear Regression
7  Comparison to other Model Selection Schemes
8  Loss Functions and their Selection
9  Self-Consistent Regression
10 Nearest Neighbors Classification
11 Conclusion and Outlook
References

Keywords: model selection, loss rank principle, non-parametric regression, classification, general loss function, k nearest neighbors.

1 Introduction

Regression. Consider a regression or classification problem in which we want to determine the functional relationship y_i ≈ f_true(x_i) from data D = {(x_1,y_1),...,(x_n,y_n)}, i.e., we seek a function r(·|D) ≡ r(D)(·) such that r(x|D) ≡ r(D)(x) is close to the unknown f_true(x) for all x. One may define r(·|D) directly, e.g., "average the y values of the k nearest neighbors (kNN) of x in D", or select r(·|D) from a class of functions F that has smallest (training) error on D. If the class F is not too large, e.g., the polynomials of fixed reasonable degree d, this often works well.

Model selection. What remains is to select the right model complexity c, like k or d. This selection cannot be based on the training error, since the more complex the model (large d, small k), the better the fit on D (perfect for d = n and k = 1). This problem is called overfitting, for which various remedies have been suggested. The most popular ones in practice are based on a test set D' used for selecting the c for which the function r_c(·|D) has smallest (test) error on D', or improved versions like cross-validation [All74]. Typically D' is cut from D, thus reducing the sample size available for regression. Test set methods often work well in practice, but the reduced sample decreases accuracy, which can be a serious problem if n is small. We will not discuss empirical test set methods any further. See [Mac92] for a comparison of cross-validation with Bayesian model selection. There are also various model selection methods that allow using all data D for regression. The most popular ones can be regarded as penalized versions of Maximum Likelihood (ML).
In addition to the function class F_c (subscript c belonging to some set indexing the complexity), one has to specify a sampling model P(D|f), e.g., that the y_i have independent Gaussian distribution with mean f(x_i). ML chooses r_c(D) = argmax_{f∈F_c} P(D|f). Penalized ML (PML) then chooses ĉ = argmin_c {−log P(D|r_c(D)) + Penalty(c)}, where the penalty depends on the used approach (MDL [Ris78], BIC [Sch78], AIC [Aka73]). All PML variants rely on a proper sampling model (which may be difficult to establish), ignore (or at least do not tell how to incorporate) a potentially given loss function (see [Yam99, Grü04] for exceptions), are based on distribution-independent penalties (which may result in bad performance for specific distributions), and are typically limited to (semi)parametric models.

Main idea. The main goal of the paper is to establish a criterion for selecting the "best" model complexity c based on regressors r_c given as a black box without insight into the origin or inner structure of r_c, that does not depend on things often not given (like a stochastic noise model), and that exploits what is/should be given (like the loss function; note that the criterion can also be used for loss-function selection, see Section 8). The key observation we exploit is that large classes F_c or more flexible regressors r_c can fit more data well than more rigid ones. We define the loss rank of r_c as the number of other (fictitious) data D' that are fitted better by r_c(D') than D is fitted by r_c(D), as measured by some loss function. The loss rank is large for regressors fitting D not well and for too flexible regressors (in both cases the regressor fits many other D' better). The loss rank has a minimum for not too flexible regressors which fit D not too badly. We claim that minimizing the loss rank is a suitable model selection criterion, since it trades off the quality of fit against the flexibility of the model. Unlike PML, our Loss Rank Principle (LoRP) works without a noise (stochastic sampling) model, and is directly applicable to any non-parametric regressor, like kNN.

Related ideas. There are various other ideas that somehow count fictitious data. In normalized ML [Grü04], the complexity of a stochastic model class is defined as the log sum over all D' of maximum likelihood probabilities. In the luckiness framework for classification [Her02, Chp.4], the loss rank is related to the level of a hypothesis, if the empirical loss is used as an unluckiness function. The empirical Rademacher complexity [Kol01, BBL02] averages over all possible relabeled instances. Finally, instead of considering all D' one could consider only the set of all permutations of {y_1,...,y_n}, like in permutation tests [ET93]. The test statistic would here be the empirical loss.

Contents. In Section 2, after giving a brief introduction to regression, we formally state LoRP for model selection. Explicit expressions for the loss rank for the important class of linear regressors are derived in Section 3; this class includes kNN, polynomial, linear basis function (LBFR), kernel, projective regression, and some others. In Section 4, we establish optimality properties of LoRP for linear regression, namely model consistency and asymptotic mean efficiency.
Experiments are presented in Section 5: we compare LoRP to other selection methods and demonstrate the use of LoRP for some specific problems like choosing tuning parameters in kNN and spline regression. In Section 6 we compare linear LoRP to Bayesian model selection for linear regression with Gaussian noise and prior, and in Section 7 to PML, in particular MDL, BIC, and AIC, and then discuss two trace formulas for the effective dimension. Sections 8-10 can be considered as extension sections. In Section 8 we show how to generalize linear LoRP to non-quadratic loss, in particular to other norms. We also discuss how LoRP can be used to select the loss function itself, in case it is not part of the problem specification. In Section 9 we briefly discuss interpolation. LoRP only depends on the regressor on data D and not on x ∉ {x_1,...,x_n}. We construct canonical regressors for off-data interpolation from regressors given only on-data, in particular for kNN, kernel, and LBFR, and show that they are canonical. In Section 10 we derive exact expressions for kNN when {x_1,...,x_n} forms a discrete d-dimensional hypercube, and discuss the limits n → ∞, k → ∞, and d → ∞. Section 11 contains the conclusions of our work and further considerations that could be elaborated on in the future.

The main idea of LoRP has already been presented at the COLT 2007 conference [Hut07]. In this paper we present LoRP more thoroughly, discover its theoretical properties and evaluate the method through some experiments.

2 The Loss Rank Principle

After giving a brief introduction to regression, classification, model selection, overfitting, and some reoccurring examples, we state our novel Loss Rank Principle for model selection. We first state it for classification (Principle 3 for discrete values), then generalize it to regression (Principle 5 for continuous values), and exemplify it on two (over-simplistic) artificial Examples 4 and 6. Thereafter we show how to regularize LoRP for realistic regression problems.

Setup and notation. We assume data D = (x,y) := {(x_1,y_1),...,(x_n,y_n)} ∈ (X×Y)^n =: D has been observed. We think of the y as having an approximate functional dependence on x, i.e., y_i ≈ f_true(x_i), where ≈ means that the y_i are distorted by noise from the unknown "true" values f_true(x_i). We will write (x,y) for generic data points, use vector notation x = (x_1,...,x_n)⊤ and y = (y_1,...,y_n)⊤, and D' = (x',y') for generic (fictitious) data of size n. A full list of abbreviations and notations used throughout the paper is placed in the appendix.

Regression and classification. In regression problems Y is typically (a subset of) the reals ℝ or some more general measurable space like ℝ^m. In classification, Y is a finite set or at least discrete. We impose no restrictions on X. Indeed, x will essentially be fixed and plays only a spectator role, so we will often notationally suppress dependencies on x. The goal of regression/classification is to find a function f_D ∈ F ⊂ X → Y "close" to f_true based on the past observations D. Or, phrased another way: we are interested in a mapping r : D → F such that ŷ := r(x|D) ≡ r(D)(x) ≡ f_D(x) ≈ f_true(x) for all x ∈ X.
Example 1 (polynomial regression). For X = Y = ℝ, consider the set F_d := {f_w(x) = w_d x^{d−1} + ... + w_2 x + w_1 : w ∈ ℝ^d} of polynomials of degree d−1. Fitting the polynomial to data D, e.g., by least squares regression, we estimate w with ŵ_D. The regression function ŷ = r_d(x|D) = f_{ŵ_D}(x) can be written down in closed form (see Example 9). ♦

Example 2 (k nearest neighbors). Let Y be some vector space like ℝ and X be a metric space like ℝ^m with some (e.g., Euclidean) metric d(·,·). kNN estimates f_true(x) by averaging the y values of the k nearest neighbors N_k(x) of x in D, i.e., r_k(x|D) = (1/k) Σ_{i∈N_k(x)} y_i with |N_k(x)| = k such that d(x,x_i) ≤ d(x,x_j) for all i ∈ N_k(x) and j ∉ N_k(x). ♦

Parametric versus non-parametric regression. Polynomial regression is an example of parametric regression in the sense that r_d(D) is the optimal function from a family of functions F_d indexed by d < ∞ real parameters (w). In contrast, the kNN regressor r_k is directly given and is not based on a finite-dimensional family of functions. In general, r may be given either directly or be the result of an optimization process.

Loss function. The quality of fit to the data is usually measured by a loss function Loss(y,ŷ), where ŷ_i = f̂_D(x_i) is an estimate of y_i. Often the loss is additive: Loss(y,ŷ) = Σ_{i=1}^n Loss(y_i,ŷ_i). If the class F is not too large, good regression functions r(D) can be found by minimizing the loss w.r.t. all f ∈ F. For instance, r_d(D) = argmin_{f∈F_d} Σ_{i=1}^n (y_i − f(x_i))² and ŷ = r_d(x|D) in Example 1.

Regression class and loss. In the following we assume a class of regressors R (whatever their origin), e.g., the kNN regressors {r_k : k ∈ ℕ} or the least squares polynomial regressors {r_d : d ∈ ℕ_0 := ℕ ∪ {0}}. Each regressor r can be thought of as a model. Throughout the paper, we use the terms "regressor" and "model" interchangeably. Note that unlike f ∈ F, regressors r ∈ R are not functions of x alone but depend on all observations D, in particular on y. As for functions f, we can compute the empirical loss of each regressor r ∈ R:

    Loss_r(D) ≡ Loss_r(y|x) := Loss(y,ŷ) = Σ_{i=1}^n Loss(y_i, r(x_i|x,y))

where ŷ_i = r(x_i|D) in the third expression, and the last expression holds in case of additive loss.

Overfitting. Unfortunately, minimizing Loss_r w.r.t. r will typically not select the "best" overall regressor. This is the well-known overfitting problem. In case of polynomials, the classes F_d ⊂ F_{d+1} are nested, hence Loss_{r_d} is monotone decreasing in d, with Loss_{r_n} ≡ 0 perfectly fitting the data. In case of kNN, Loss_{r_k} is more or less an increasing function in k, with perfect regression on D for k = 1, since no averaging takes place. In general, R is often indexed by a "flexibility" or smoothness or complexity parameter, which has to be properly determined. The more flexible r is, the closer it can fit the data. Hence such an r has smaller empirical loss, but is not necessarily better since it has higher variance. Clearly, a too inflexible r also leads to a bad fit ("high bias").
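To make the overfitting effect tangible, here is a minimal Python sketch (illustrative only; the paper's own experiments in Section 5 use MATLAB) that fits least squares polynomials with an increasing number of coefficients d to a small noisy sample. The training loss decreases monotonically in d and is essentially zero at d = n, exactly as described above.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20
    x = np.linspace(0, 1, n)
    y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)   # noisy sample

    for d in [1, 2, 3, 5, 10, 20]:
        Phi = np.vander(x, d, increasing=True)              # basis 1, x, ..., x^(d-1)
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # least squares fit
        print(d, float(np.sum((y - Phi @ w) ** 2)))          # training loss shrinks with d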
Main goal. The main goal of the paper is to establish a selection criterion in order to specify the smallest model to which f_true belongs or is close to, and simultaneously determine the "best" fitting function r(D). The criterion

• is based on r given as a black box, i.e., does not require insight into the origin or inner structure of r;
• does not depend on things often not given (like a stochastic noise model); and
• exploits what is or should be given (like the loss function).

Definition of loss rank. We first consider discrete Y (i.e., classification) and fix x; y is the observed data and y' are fictitious others. The key observation we exploit is that a more flexible r can fit more data D' ∈ D well than a more rigid one. The more flexible r is, the smaller the empirical loss Loss_r(y|x) is. Instead of minimizing the unsuitable Loss_r(y|x) w.r.t. r, we could ask how many y' ∈ Y^n lead to smaller Loss_r than y. We define the loss rank of r (w.r.t. y) as the number of y' ∈ Y^n with smaller or equal empirical loss than y:

    Rank_r(y|x) ≡ Rank_r(L) := #{y' ∈ Y^n : Loss_r(y'|x) ≤ L}   with   L := Loss_r(y|x)        (1)

We claim that the loss rank of r is a suitable model selection measure. For (1) to make sense, we have to assume (and will later assure) that Rank_r(L) < ∞, i.e., there are only finitely many y' ∈ Y^n having loss smaller than L. Since the logarithm is a strictly monotone increasing function, we can also consider the logarithmic rank LR_r(y|x) := log Rank_r(y|x), which will be more convenient.

Principle 3 (LoRP for classification). For discrete Y, the best classifier/regressor r : D × X → Y in some class R for data D = (x,y) is the one with the smallest loss rank:

    r^best = argmin_{r∈R} LR_r(y|x) ≡ argmin_{r∈R} Rank_r(y|x)        (2)

where Rank_r is defined in (1).

We give a simple example for which we can compute all ranks by hand to help the reader better grasp how the principle works.

Example 4 (simple discrete). Consider X = {1,2}, Y = {0,1,2}, and two points D = {(1,1),(2,2)} lying on the diagonal x = y, with polynomial (zero, constant, linear) least squares regressors R = {r_0, r_1, r_2} (see Ex.1). r_0 is simply 0, r_1 the y-average, and r_2 the line through points (1,y_1) and (2,y_2). This, together with the quadratic Loss for generic y' and observed y = (1,2) and fixed x = (1,2), is summarized in the following table:

    d   r_d(x|x,y')                   Loss_d(y'|x)          Loss_d(D)
    0   0                             y'_1² + y'_2²         5
    1   (y'_1 + y'_2)/2               (y'_2 − y'_1)²/2      1/2
    2   (y'_2 − y'_1)(x − 1) + y'_1   0                     0

From the Loss we can easily compute the Rank for all nine y' ∈ {0,1,2}². Equal rank due to equal loss is indicated by a "=" in the table below. Whole equality groups are actually assigned the rank of their right-most member, e.g., for d = 1 the ranks of (y'_1,y'_2) = (0,1), (1,0), (2,1), (1,2) are all 7 (and not 4,5,6,7). The entries are the pairs y'_1y'_2 ordered by increasing Loss_d (ranks 1,...,9 from left to right):

    d = 0:  00 < 01 = 10 < 11 < 02 = 20 < 21 = 12 < 22,   Rank_{r_0}(D) = 8
    d = 1:  00 = 11 = 22 < 01 = 10 = 21 = 12 < 02 = 20,   Rank_{r_1}(D) = 7
    d = 2:  00 = 01 = 02 = 10 = 11 = 20 = 21 = 22 = 12,   Rank_{r_2}(D) = 9

So LoRP selects r_1 as best regressor, since it has minimal rank on D. r_0 fits D too badly and r_2 is too flexible (it perfectly fits all D'). ♦
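The rank computation in Example 4 can be checked mechanically. The following small Python sketch (illustrative, not the authors' code) enumerates all nine fictitious outcomes y' ∈ {0,1,2}², computes the quadratic loss of each least squares regressor r_d, and counts how many y' have loss ≤ Loss_d(D), reproducing the ranks 8, 7, 9 from the table.

    import itertools
    import numpy as np

    x = np.array([1.0, 2.0])
    y = np.array([1.0, 2.0])                  # observed data D
    Y = [0.0, 1.0, 2.0]                       # discrete value space

    def predict(yprime, d):
        """Least squares fit of r_d on (x, y') and its predictions at x (d = 0 is the zero function)."""
        yprime = np.asarray(yprime, float)
        if d == 0:
            return np.zeros_like(x)
        Phi = np.vander(x, d, increasing=True)            # basis 1, x, ..., x^(d-1)
        w, *_ = np.linalg.lstsq(Phi, yprime, rcond=None)
        return Phi @ w

    def loss(yprime, d):
        return float(np.sum((np.asarray(yprime) - predict(yprime, d)) ** 2))

    for d in (0, 1, 2):
        L = loss(y, d)
        rank = sum(loss(yp, d) <= L + 1e-9 for yp in itertools.product(Y, repeat=2))
        print(f"d={d}: Loss_d(D)={L:.2f}  Rank_d(D)={rank}")   # -> 8, 7, 9; LoRP picks d=1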
LoRP for continuous Y. We now consider the case of continuous or measurable spaces Y, i.e., normal regression problems. We assume Y = ℝ in the following exposition, but the idea and resulting principle hold for more general measurable spaces like ℝ^m. We simply reduce the model selection problem to the discrete case by considering the discretized space Y_ε = εℤ for small ε > 0 and discretizing y ↝ y_ε ∈ εℤ^n ("↝" means "is replaced by"). Then

    Rank^ε_r(L) := #{y'_ε ∈ Y^n_ε : Loss_r(y'_ε|x) ≤ L}   with   L = Loss_r(y_ε|x)

counts the number of ε-grid points in the set

    V_r(L) := {y' ∈ Y^n : Loss_r(y'|x) ≤ L}        (3)

which we assume (and later assure) to be finite, analogous to the discrete case. Hence Rank^ε_r(L)·ε^n is an approximation of the loss volume |V_r(L)| of set V_r(L), and typically Rank^ε_r(L)·ε^n = |V_r(L)|·(1 + O(ε)) → |V_r(L)| for ε → 0. Taking the logarithm we get LR^ε_r(y|x) = log Rank^ε_r(L) = log|V_r(L)| − n log ε + O(ε). Since n log ε is independent of r, we can drop it in comparisons like (2). So for ε → 0 we can define the log-loss "rank" simply as the log-volume

    LR_r(y|x) := log|V_r(L)|,   where   L := Loss_r(y|x)        (4)

Principle 5 (LoRP for regression). For measurable Y, the best regressor r : D × X → Y in some class R for data D = (x,y) is the one with the smallest loss volume:

    r^best = argmin_{r∈R} LR_r(y|x) ≡ argmin_{r∈R} |V_r(L)|

where LR, V_r, and L are defined in (3) and (4), and |V_r(L)| is the volume of V_r(L) ⊆ Y^n.

For discrete Y with counting measure we recover the discrete LoRP (Principle 3).

Example 6 (simple continuous). Consider Example 4 but with interval Y = [0,2]. The first table remains unchanged, while the second table becomes

    d = 0:  V_0(L) = {y' ∈ [0,2]² : y'_1² + y'_2² ≤ L},
            |V_0(L)| = (π/4)L if L ≤ 4;  4 if L ≥ 8;  2√(L−4) + L(π/4 − cos⁻¹(2/√L)) otherwise;
            Loss_0(D) = 5,  |V_0(Loss_0(D))| ≈ 3.6
    d = 1:  V_1(L) = {y' ∈ [0,2]² : (y'_2 − y'_1)²/2 ≤ L},
            |V_1(L)| = 4√(2L) − 2L if L ≤ 2;  4 if L ≥ 2;
            Loss_1(D) = 1/2,  |V_1(Loss_1(D))| = 3
    d = 2:  V_2(L) = {y' ∈ [0,2]² : 0 ≤ L} = [0,2]²,  |V_2(L)| = 4,  Loss_2(D) = 0,  |V_2(Loss_2(D))| = 4

So LoRP again selects r_1 as best regressor, since it has smallest loss volume on D. ♦

Infinite rank or volume. Often the loss rank/volume will be infinite, e.g., if we had chosen Y = ℤ in Ex.4 or Y = ℝ in Ex.6. There are various potential remedies. We could modify (a) the regressor r or (b) the Loss to make LR_r finite, (c) the Loss Rank Principle itself, or (d) find problem-specific solutions. Regressors r with infinite rank might be rejected for philosophical or pragmatic reasons. We will briefly consider (a) for linear regression later, but to fiddle around with r in a generic (black box) way seems difficult. We have no good idea how to tinker with LoRP itself (c), and a patched LoRP may also be less attractive. For kNN on a grid we later use remedy (d). While in (decision) theory the application's goal determines the loss, in practice the loss is often determined more by convenience or rules of thumb. So the Loss (b) seems the most inviting place to tinker with. A very simple modification is to add a small penalty term to the loss:

    Loss_r(y|x) ↝ Loss^α_r(y|x) := Loss_r(y|x) + α‖y‖²,   α > 0 "small"        (5)

The Euclidean norm ‖y‖² := Σ_{i=1}^n y_i² is the default, but other (non)norm regularizations are possible.
The regularized LR^α_r(y|x) based on Loss^α_r is always finite, since {y : ‖y‖² ≤ L} has finite volume. An alternative penalty α ŷ⊤ŷ, quadratic in the regression estimates ŷ_i = r(x_i|x,y), is possible if r is unbounded in every y → ∞ direction. A scheme trying to determine a single (flexibility) parameter (like d and k in the above examples) would be of no use if it depended on one (or more) other unknown parameters (α), since varying the unknown parameter leads to any (un)desired result. Since LoRP seeks the r of smallest rank, it is natural to also determine α = α_min by minimizing LR^α_r w.r.t. α. The good news is that this leads to meaningful results. Interestingly, as we will see later, a clever choice of α may also result in alternative optimalities of the selection procedure.

3 LoRP for y-Linear Models

In this section we consider the important class of y-linear regressions with quadratic loss function. By "y-linear regression", we mean that linearity is only assumed in y; the dependence on x can be arbitrary. This class is richer than it may appear. It includes the normal linear regression model, kNN (Example 7), kernel (Example 8), and many other regression models. For y-linear regression and Y = ℝ, the loss rank is the volume of an n-dimensional ellipsoid, which can be computed efficiently in time O(n³) (Theorem 10). For the special case of projective regression, e.g., linear basis function regression (Example 9), we can even determine the regularization parameter α analytically (Theorem 11).

y-Linear regression. We assume Y = ℝ in this section; the generalization to ℝ^m is straightforward. A y-linear regressor r can be written in the form

    ŷ = r(x|x,y) = Σ_{j=1}^n m_j(x,x) y_j   for all x ∈ X and some m_j : X × X^n → ℝ        (6)

Particularly interesting is r for x = x_1,...,x_n:

    ŷ_i = r(x_i|x,y) = Σ_j M_ij(x) y_j   with   M : X^n → ℝ^{n×n}        (7)

where matrix M_ij(x) = m_j(x_i,x). Since LoRP needs r only on the training data x, we only need M.

Example 7 (kNN ctd.). For kNN of Ex.2 we have m_j(x,x) = 1/k if j ∈ N_k(x) and 0 else, and M_ij(x) = 1/k if j ∈ N_k(x_i) and 0 else. ♦

Example 8 (kernel regression). Kernel regression takes a weighted average over y, where the weight of y_j to y is proportional to the similarity of x_j to x, measured by a kernel K(x,x_j), i.e., m_j(x,x) = K(x,x_j)/Σ_{j'=1}^n K(x,x_{j'}). For example, the Gaussian kernel for X = ℝ^m is K(x,x_j) = exp(−‖x−x_j‖²₂/2σ²). The width σ controls the smoothness of the kernel regressor, and LoRP selects the real-valued "complexity" parameter σ. ♦

Example 9 (linear basis function regression, LBFR). Let φ_1(x),...,φ_d(x) be a set or vector of "basis" functions, often called "features". We place no restrictions on X or φ : X → ℝ^d. Consider the class of functions linear in φ:

    F_d = {f_w(x) = Σ_{a=1}^d w_a φ_a(x) = w⊤φ(x) : w ∈ ℝ^d}

For instance, for X = ℝ and φ_a(x) = x^{a−1} we would recover the polynomial regression Example 1. For quadratic loss function Loss(y_i,ŷ_i) = (y_i − ŷ_i)² we have

    Loss_w(y|φ) := Σ_{i=1}^n (y_i − f_w(x_i))² = y⊤y − 2y⊤Φw + w⊤Bw

where matrix Φ is defined by Φ_ia = φ_a(x_i) and B is the symmetric matrix with B_ab = Σ_{i=1}^n φ_a(x_i)φ_b(x_i) = [Φ⊤Φ]_ab.
The loss is quadratic in w with minimum at w = B^{-1}Φ⊤y. So the least squares regressor is ŷ = y⊤ΦB^{-1}φ(x), hence m_j(x,x) = (ΦB^{-1}φ(x))_j and M(x) = ΦB^{-1}Φ⊤. ♦

Consider now a general linear regressor M with quadratic loss and quadratic penalty:

    Loss^α_M(y|x) = Σ_{i=1}^n (y_i − Σ_{j=1}^n M_ij y_j)² + α‖y‖² = y⊤S_α y,   where   S_α = (I−M)⊤(I−M) + αI        (8)

(I is the identity matrix). S_α is a symmetric matrix. For α > 0 it is positive definite and for α = 0 positive semidefinite. [Footnote 1: The mentioned alternative penalty α‖ŷ‖² would lead to S_α = (I−M)⊤(I−M) + αM⊤M. For LBFR, the penalty α‖ŵ‖² is popular (ridge regression). Apart from being limited to parametric regression, it has the disadvantage of not being reparametrization invariant. For instance, scaling φ_a(x) ↝ γ_a φ_a(x) does not change the class F_d, but changes the ridge regressor.] If λ_1,...,λ_n ≥ 0 are the eigenvalues of S_0, then λ_i + α are the eigenvalues of S_α. V(L) = {y' ∈ ℝ^n : y'⊤S_α y' ≤ L} is an ellipsoid with the eigenvectors of S_α being the main axes and √(L/(λ_i+α)) being their lengths. Hence the volume is

    |V(L)| = v_n ∏_{i=1}^n √(L/(λ_i+α)) = v_n L^{n/2} / √(det S_α)

where v_n = π^{n/2}/(n/2)! is the volume of the n-dimensional unit sphere, z! := Γ(z+1), and det is the determinant. Taking the logarithm we get

    LR^α_M(y|x) = log|V(Loss^α_M(y|x))| = (n/2) log(y⊤S_α y) − (1/2) log det S_α + log v_n        (9)

Since v_n is independent of α and M, it is possible to drop v_n. Consider now a class of linear regressors M = {M}, e.g., the kNN regressors {M_k : k ∈ ℕ} or the d-dimensional linear basis function regressors {M_d : d ∈ ℕ_0}.

Theorem 10 (LoRP for y-linear regression). For Y = ℝ, the best linear regressor M : X^n → ℝ^{n×n} in some class M for data D = (x,y) is

    M^best = argmin_{M∈M, α≥0} {(n/2) log(y⊤S_α y) − (1/2) log det S_α} = argmin_{M∈M, α≥0} {y⊤S_α y · (det S_α)^{−1/n}}        (10)

where S_α = S_α(M) is defined in (8).

The last expression shows that linear LoRP minimizes the Loss times the geometric average of the squared axes lengths of the ellipsoid V(1). Note that M^best depends on y, unlike the M ∈ M.

Nullspace of S_0. If M has an eigenvalue 1, then S_0 = (I−M)⊤(I−M) has a zero eigenvalue and α > 0 is necessary, since det S_0 = 0. Actually this is true for most practical M. Most linear regressors are invariant under a constant shift of y, i.e., r(x|x, y+c) = r(x|x,y) + c, which implies that M has eigenvector (1,...,1)⊤ with eigenvalue 1. This can easily be checked for kNN (Ex.2), kernel (Ex.8), and LBFR (Ex.9). Such a generic 1-eigenvector affecting all M ∈ M could easily, and maybe should, be filtered out by considering only the orthogonal space or by dropping these λ_i = 0 when computing det S_0. The 1-eigenvectors that depend on M are the ones where we really need a regularizer α > 0. For instance, M_d in LBFR has d eigenvalues 1, and M_kNN has as many eigenvalues 1 as there are disjoint components in the graph determined by the edges M_ij > 0. In general we need to find the optimal α numerically; if M is a projection we can find α_m analytically.
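As an illustration of Theorem 10 and of the numerical search for α just mentioned, the following Python sketch (hypothetical toy data, not from the paper; notation follows (8)-(10)) computes the eigenvalues λ_i of S_0 once, so that y⊤S_α y = y⊤S_0 y + α‖y‖² and det S_α = ∏_i(λ_i + α) are cheap to re-evaluate while minimizing over α, and then selects the number of basis functions d in an LBFR class.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def linear_lorp(M, y):
        """Linear LoRP criterion of Theorem 10, minimized over alpha > 0.
        Eigen-decomposing S_0 = (I-M)^T(I-M) once makes the alpha-search cheap:
        y^T S_alpha y = y^T S_0 y + alpha ||y||^2 and det S_alpha = prod_i (lambda_i + alpha)."""
        n = len(y)
        S0 = (np.eye(n) - M).T @ (np.eye(n) - M)
        lam = np.linalg.eigvalsh(S0)                 # eigenvalues lambda_i >= 0
        quad0, yy = y @ S0 @ y, y @ y
        def lr(t):                                   # t = log alpha
            a = np.exp(t)
            return 0.5 * n * np.log(quad0 + a * yy) - 0.5 * np.sum(np.log(lam + a))
        res = minimize_scalar(lr, bounds=(-30.0, 5.0), method="bounded")
        return res.fun, np.exp(res.x)

    # Toy use: choose the number of basis functions d in LBFR (Example 9) by linear LoRP.
    rng = np.random.default_rng(1)
    x = np.linspace(-1, 1, 40)
    y = 1 - 2 * x + 3 * x**2 + 0.3 * rng.normal(size=40)       # true model: quadratic + noise
    scores = {}
    for d in range(1, 8):
        Phi = np.vander(x, d, increasing=True)                  # features 1, x, ..., x^(d-1)
        M = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)           # projection M = Phi B^{-1} Phi^T
        scores[d] = linear_lorp(M, y)[0]
    print("LoRP selects d =", min(scores, key=scores.get))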
Numerical approximation of (det S_α)^{1/n} and the computational complexity of linear LoRP. For each α and candidate model, the determinant of S_α in the general case can be computed in time O(n³). Often M is a very sparse matrix (as in kNN) or can be well approximated by a sparse matrix (as for kernel regression), which allows us to approximate det S_α, sometimes in linear time [Reu02]. To search for the optimal α and M, the computational cost depends on the range of α we search and on the number of candidate models we have.

Projective regression. Consider a projection matrix M = P = P² with d (= tr P) eigenvalues 1 and n−d zero eigenvalues. For instance, M = ΦB^{-1}Φ⊤ of LBFR Ex.9 is such a matrix. This implies that S_α has d eigenvalues α and n−d eigenvalues 1+α, thus det S_α = α^d (1+α)^{n−d}. Let ρ = ‖y−ŷ‖²/‖y‖²; then y⊤S_α y = (ρ+α) y⊤y and

    LR^α_P = (n/2) log y⊤y + (n/2) log(ρ+α) − (d/2) log α − ((n−d)/2) log(1+α).        (11)

Solving ∂LR^α_P/∂α = 0 w.r.t. α we get a minimum at α = α_m := ρd/((1−ρ)n − d), provided that 1−ρ > d/n. After some algebra we get

    LR^{α_m}_P = (n/2) log y⊤y − (n/2) KL(d/n ‖ 1−ρ),   where   KL(p‖q) := p log(p/q) + (1−p) log((1−p)/(1−q))        (12)

is the relative entropy or Kullback-Leibler divergence. Note that (12) is still valid without the condition 1−ρ > d/n (the term log((1−ρ)^{n−d}) has been canceled in the derivation). What we need when using (12) is that d < n and ρ < 1, which are very reasonable in practice. Interestingly, if we use the penalty α‖ŷ‖² instead of α‖y‖², the loss rank has the same expression as (12) without any condition. [Footnote 2: Then S_α = (I_n − P)⊤(I_n − P) + αP⊤P = I_n + (α−1)P has d eigenvalues α and n−d eigenvalues 1, thus det S_α = α^d. The loss rank LR^α_P = (n/2) log y⊤y + (n/2) log(1 + (α−1)(1−ρ)) − (d/2) log α is minimized at α_m = ρd/((1−ρ)(n−d)). After some algebra we get the same expression of LR^{α_m}_P as (12).]

Minimizing LR^{α_m}_P w.r.t. P is equivalent to maximizing KL(d/n ‖ 1−ρ). The term ρ is a measure of fit: if d increases, then ρ decreases, and vice versa. We are seeking a tradeoff between the model complexity d and the measure of fit ρ, and LoRP suggests the optimal tradeoff by maximizing the KL divergence.

Theorem 11 (LoRP for projective regression). The best projective regressor P : X^n → ℝ^{n×n} with P = P² in some projective class P for data D = (x,y) is

    P^best = argmax_{P∈P} KL( tr P(x)/n ‖ y⊤P(x)y / y⊤y ).        (13)

4 Optimality Properties of LoRP for Variable Selection

In the previous sections, LoRP was stated for general-purpose model selection. By restricting attention to linear regression models, we will point out in this section some theoretical properties of LoRP for variable (also called feature or attribute) selection.

Variable selection is probably the most fundamental and important topic in linear regression analysis. At the initial stage of modeling, a large number of potential covariates are often introduced; one then has to select a smaller subset of the covariates to fit/interpret the data. There are two main goals of variable selection: one is model identification, the other is regression estimation. The former aims at identifying the true subset generating the data, while the latter aims at estimating the regression function efficiently, i.e., selecting a subset that has the minimum mean squared error loss. Note that whether or not there is a selection criterion achieving these two goals simultaneously is still an open question [Yan05, Grü04]. We show that with the optimal parameter α (defined as the α_m that minimizes the loss rank LR^α_M in α), LoRP satisfies the first goal, while with a suitable choice of α, LoRP satisfies the second goal.
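To connect Theorem 11 with the variable-selection setting that follows, here is a small Python sketch (hypothetical data, not from the paper's simulations): for a candidate design matrix X_S, the projection P = X_S(X_S⊤X_S)^{-1}X_S⊤ gives tr P = |S| and y⊤Py/y⊤y = 1 − ρ_S, and LoRP chooses the subset maximizing KL(|S|/n ‖ 1 − ρ_S).

    import numpy as np

    def kl_bernoulli(p, q):
        """KL(p||q) = p log(p/q) + (1-p) log((1-p)/(1-q)), cf. (12)."""
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    def projective_lorp(X_S, y):
        """Criterion of Theorem 11 for P = X_S (X_S^T X_S)^{-1} X_S^T; larger KL is better,
        so we return its negative to keep the usual 'smaller is better' convention."""
        n = len(y)
        P = X_S @ np.linalg.solve(X_S.T @ X_S, X_S.T)
        d = np.trace(P)                                # = |S| for a projection
        one_minus_rho = (y @ P @ y) / (y @ y)          # = 1 - rho_S
        return -kl_bernoulli(d / n, one_minus_rho)

    # Nested-subset selection as in Section 4 (hypothetical data):
    rng = np.random.default_rng(2)
    n, d_max = 200, 8
    X = rng.uniform(-1, 1, size=(n, d_max))
    beta = np.array([2.0, -1.5, 1.0, 0, 0, 0, 0, 0])   # only the first 3 covariates matter
    y = X @ beta + 0.5 * rng.normal(size=n)
    scores = {d: projective_lorp(X[:, :d], y) for d in range(1, d_max + 1)}
    print("LoRP selects the first", min(scores, key=scores.get), "covariates")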
Given d+1 potential covariates X_0 ≡ 1, X_1,...,X_d and a response variable Y, let X = x be a non-random design matrix of size n×(d+1) and y the corresponding response vector (if y and X are centered, then the covariate 1 can be omitted from the models). Denote by S = {0, j_1,...,j_{|S|−1}} the candidate model that has covariates X_0, X_{j_1},...,X_{j_{|S|−1}}. Under a proposed model S, we can write

    y = X_S β_S + σ_S ε

where ε is noise with expectation E[ε] = 0 and covariance Cov(ε) = I_n, σ_S > 0, β_S = (β_0, β_{j_1},...,β_{j_{|S|−1}})⊤, and X_S is the n×|S| design matrix obtained from X by removing the (j+1)st column for all j ∉ S.

Model consistency of LoRP for variable selection. The ordinary least squares (OLS) fitted vector under model S is ŷ_S = M_S y with M_S = X_S(X_S⊤X_S)^{-1}X_S⊤ being a projection matrix. From Theorem 11 the best subset chosen by LoRP is

    Ŝ_n = argmin_S LR^{α_m}_S = argmax_S {KL(|S|/n ‖ 1−ρ_S)},   ρ_S = ‖y − ŷ_S‖² / ‖y‖².

The term ρ_S is a measure of fit. It will be very close to 0 if model S is big; conversely, it will be close to 1 if S is too small. Therefore, it is reasonable to consider only cases in which ρ_S is bounded away from 0 and 1. In order to prove the theoretical properties of LoRP, we need the following technical assumption.

(A) For each candidate model S, ρ_S is bounded away from 0 and 1, i.e., there are constants c_1 and c_2 such that 0 < c_1 ≤ ρ_S ≤ c_2 < 1 with probability 1 (w.p.1).

Let σ̂²_S = ‖y − ŷ_S‖²/n and S_null = {0}. It is easy to see that for every S

    1 − ρ_S = ‖ŷ_S‖²/‖y‖²,   n σ̂²_S = ρ_S ‖y‖²,   n ȳ² = ‖ŷ_{S_null}‖² ≤ ‖ŷ_S‖² ≤ ‖y‖²        (14)

where ȳ denotes the arithmetic mean Σ_{i=1}^n y_i / n. Assumption (A) follows from

(A') 0 < liminf_{n→∞} (ȳ)² ≤ limsup_{n→∞} (‖y‖²/n) < ∞   and   ∀S: σ̂²_S → σ²_S > 0 w.p.1.

The first condition of (A') is obviously very mild and satisfied in almost all cases in practice. The second one is routinely used to derive asymptotic properties of model selection criteria (e.g., Theorem 2 of [Sha97] and Condition 1 of [WLT07]).

Lemma 12 (LoRP for variable selection). The loss rank of model S is

    LR_S ≡ LR^{α_m}_S = (n/2) log(n σ̂²_S) + (n/2) H(|S|/n) + (|S|/2) log((1−ρ_S)/ρ_S)        (15)

where ρ_S and σ̂²_S are defined in (14), and H(p) := −p log p − (1−p) log(1−p) is the entropy of p. Under Assumption (A) or (A'), after neglecting constants independent of S, the loss rank of model S has the form

    LR_S = (n/2) log σ̂²_S + (|S|/2) log n + O_P(1),        (16)

where O_P(1) denotes a random variable bounded w.p.1.

Proof. Inserting y⊤y = n σ̂²_S/ρ_S into (12) and rearranging terms gives (15). By Assumption (A) the last term in (15) is bounded w.p.1. The Taylor expansion log(1−p) = −p + O(p²) implies H(p)/p + log p → 1, hence (n/2) H(|S|/n) = (|S|/2) log n + O(1). Finally, dropping the S-independent term (n/2) log n from (15) gives (16).
This lemma implies that the loss rank LR_S here is a BIC-type criterion; thus we can immediately state, without proof, the following theorem, which is the well-known model consistency of BIC-type criteria (interested readers can find the routine proof in, for example, [Cha06]).

Theorem 13 (Model consistency). Under Assumption (A) or (A'), LoRP is model consistent for variable selection in the sense that the probability of selecting the true model goes to 1 for data size n → ∞.

The optimal regression estimation of LoRP. The second goal of model selection is often measured by the (asymptotic) mean efficiency [Shi83], which is briefly defined as follows. Let S_T denote the true model (which may contain an infinite number of covariates). For a candidate model S, let L_n(S) = ‖X_{S_T}β_{S_T} − X_S β̂_S‖² be the squared loss, where β̂_S is the OLS estimate, and let R_n(S) = E[L_n(S)] be the risk. The mean efficiency of a selection criterion δ is defined by the ratio

    eff(δ) = inf_S R_n(S) / E[L_n(S_δ)] ≤ 1

where S_δ is the model selected by δ. δ is said to be asymptotically mean efficient if liminf_{n→∞} eff(δ) = 1.

By minimizing the loss rank in α we have shown in the previous paragraph that LoRP satisfies the first goal of model selection. We now show that with a suitable choice of α, LoRP also satisfies the second goal. From (11), we have

    LR^α_S(y|x) = (n/2) log(σ̂²_S + (α/n) y⊤y) + (n/2) log n − (|S|/2) log α − ((n−|S|)/2) log(1+α).

By choosing α = α̃ = exp(−n(n+|S|)/(|S|(n−|S|−2))), under Assumption (A), the loss rank of model S (neglecting the common constant (n/2) log n) is proportional to

    LR^{α̃}_S(y|x) = n log σ̂²_S + n(n+|S|)/(n−|S|−2) + o_P(1),

which is the corrected AIC of [HT89]. As a result, LoRP(α̃) is optimal in terms of regression estimation, i.e., it is asymptotically mean efficient [Shi83, Sha97].

Theorem 14 (Asymptotic mean efficiency). Under Assumption (A) or (A'), with a suitable choice of α, the loss rank is proportional to the corrected AIC. As a result, LoRP is asymptotically mean efficient.

5 Experiments

In this section we present a simulation study for LoRP, compare it to other methods, and also demonstrate how LoRP can be used for some specific problems like choosing tuning parameters for kNN and spline regression. All experiments are conducted using MATLAB and the source code is freely available at http://www.hutter1.net/ai/lorpcode.zip.

Comparison to AIC and BIC for model identification. Samples are generated from the model

    y = β_0 + β_1 X_1 + ... + β_d X_d + ε,   ε ~ N(0,σ²)        (17)

where β is the vector of coefficients with some zero entries. Without loss of generality, we assume that β_0 = 0; otherwise, we can center the response vector y and standardize the design matrix X to exclude β_0 from the model. We shall compare the performance of LoRP to that of BIC and AIC with various factors n, d and signal-to-noise ratio (SNR), which is ‖β‖²/σ² (‖β‖² is often called the length of the signal). For a given set of factors (n, d, SNR), the way we simulate a dataset from model (17) is as follows. Entries of X are sampled from a uniform distribution on [−1,1]. To generate β, we first create a vector u = (u_1,...,u_d)⊤ whose entries are sampled from a uniform distribution on [−1,1].
The number of true covariates d* is randomly selected from {1,2,...,d}, the last d−d* entries of u are set to zero, and the coefficient vector β is computed by β_i = {length of signal} · u_i/‖u‖. In our simulation, the length of the signal was fixed to 10. The n observation errors ε_1,...,ε_n are sampled from a normal distribution with mean 0 and variance σ² = ‖β‖²/SNR. Finally, the response vector is computed by y = Xβ + ε. For each set of factors (n, d, SNR), 1000 data sets are simulated in the same manner to assess the average performance of the methods. For simplicity, a candidate model is specified by its order, i.e., we search for the best model among only the d models {1}, {1,2}, ..., {1,2,...,d}. For the general case, an efficient branch-and-bound algorithm [Mil02, Chp.3] can be used to exhaustively search for the best subsets.

Table 1 presents percentages of correctly-fitted models for various factors n, d and SNR. As shown, LoRP outperforms the others. The better performance of LoRP over BIC, which is the most popular criterion for model identification, is very encouraging. This is probably because LoRP is a selection criterion with a data-dependent penalty. This improvement needs a theoretical justification, which we intend to give in the future.

Table 1: Percentage of correctly-fitted models over 1000 replications

      n    d   SNR   AIC   BIC   LoRP        n    d   SNR   AIC   BIC   LoRP
    100    5    1     62    62    69       300    5    1     74    82    83
                5     85    85    86                  5     78    90    91
               10     80    90    91                 10     81    94    94
          10    1     52    42    54             10    1     63    67    71
                5     63    77    77                  5     70    85    86
               10     68    84    85                 10     74    90    90
          20    1     32    22    36             20    1     54    45    61
                5     55    63    65                  5     64    79    80
               10     56    73    74                 10     67    85    85

Comparison to AIC and BIC for regression estimation. Consider the following model, which is from [Shi83]:

    y = y(x) = log(1/(1−x)) + ε,   ε ~ N(0,σ²),   x ∈ [0,1).        (18)

We approximate the true function by a Fourier series and consider the problem of choosing a good order among the models

    y = β_0 + Σ_{l=1}^{k−1} β_l cos(πlx/δ)/(l+1) + ε,   k = 1,...,K.

In the present context, a model in the sense of Section 4 is completely specified by the order k of the Fourier series. Samples are created from (18) at the points x_i = δ i/(n+1), i = 1,...,n. As in [Shi83], we take δ = .99 and K = 163, with various n and σ. The performance is measured by the estimate of mean efficiency over 1000 replications.

Table 2 presents the simulation results. In general, LoRP (with α = α̃ as in Section 4) outperforms the others, except for cases with unrealistically high noise level. For cases with high noise, the mean efficiency of BIC is often larger than that of AIC and LoRP. This was also shown in the simulation study of [Shi83], Table 1. This phenomenon can be explained as follows. The risk of model k (the model specified by its order k) is R_n(k) = ‖(I − M_k) y_true‖² + kσ², where M_k is the regression matrix under model k and y_true is the vector of true values y(x_i). When σ → ∞, the ideal k* = arginf_k R_n(k) → 1. Because BIC penalizes the model complexity more strongly than AIC and LoRP do, the order chosen by BIC is closer to k* = 1 than the ones chosen by AIC and LoRP. As a result, the mean efficiency of BIC is larger than that of the others.
Table 2: Estimates of mean efficiency over 1000 replications

      n     σ     AIC   BIC   LoRP        n     σ     AIC   BIC   LoRP
    400   .001   1.00   .98   .99       600   .001   1.00   .98  1.00
          .01     .93   .68   .90             .01     .99   .67   .92
          .05     .88   .67   .95             .05     .90   .66   .94
          .1      .88   .67   .92             .1      .90   .67   .93
          .5      .81   .66   .85             .5      .82   .66   .83
         1        .79   .63   .82            1        .79   .65   .82
         5        .67   .65   .70            5        .65   .67   .66
        10        .54   .67   .59           10        .54   .59   .54
       100        .31   .89   .33          100        .40   .90   .41

LoRP for selecting a good number of neighbors in kNN. Let us now see how LoRP can be applied to select a good parameter k in kNN regression. We created a dataset of n = 100 observations (x_i,y_i) from the model

    y = f(x) + ε,   with   f(x) = sin(12(x+0.2))/(x+0.2),   x ∈ [0,1]        (19)

where ε ~ N(0,σ²) with σ = 0.5. The regression matrix M^(k) for kNN regression is determined by M^(k)_ij = 1/k if j ∈ N_k(x_i) and 0 else. Then the loss rank is

    LR(k) = inf_{α≥0} {(n/2) log(y⊤S^(k)_α y) − (1/2) log det S^(k)_α},   where   S^(k)_α = (I − M^(k))⊤(I − M^(k)) + αI.

The most widely-used method to select a good k is probably Generalized Cross-Validation (GCV) [CW79]:

    GCV(k) = n ‖(I − M^(k))y‖² / [tr(I − M^(k))]².

To judge how well GCV and LoRP work, we compare them to the expected prediction error defined as

    EPE(k) = Σ_{i=1}^n E(y_i − ŷ_i)² = Σ_{i=1}^n [σ² + (f(x_i) − (1/k) Σ_{j∈N_k(x_i)} f(x_j))² + σ²/k].

Figure 1(a) shows the curves LR(k), GCV(k), EPE(k) for k = 2,...,20 (the trivial case k = 1 is omitted), in which k = 7 nearest neighbors is chosen by LoRP and k = 8 is chosen by GCV. The "ideal" k is 5. Both LoRP and GCV do a reasonable job; LoRP works slightly better than GCV. (A small code sketch of this comparison is given below.)

[Figure 1: Choosing the tuning parameters in kNN (panel (a), number of neighbors k) and spline regression (panel (b), smoothing parameter λ). Each panel shows the LR, GCV, and EPE curves, scaled by their standard deviations.]

LoRP for selecting a good smoothing parameter. We now further demonstrate the use of LoRP in selecting a good smoothing parameter for spline regression. Consider the following problem: find a function belonging to the class of functions with continuous 2nd derivative that minimizes the penalized residual sum of squares

    RSS(f) = Σ_{i=1}^n (y_i − f(x_i))² + λ ∫ (f''(t))² dt,

where λ is called the smoothing parameter. The second term penalizes the curvature of the function f, and the smoothing parameter λ controls the amount of penalty. Our goal is to choose a good λ. It is well known (see, e.g., [HTF01], Section 5.4) that the solution is a natural spline f(x) = Σ_{j=1}^n N_j(x)θ_j, where N_1(x),...,N_n(x) are the basis functions of the natural cubic spline: N_1(x) = 1, N_2(x) = x, N_{k+2}(x) = d_k(x) − d_{n−1}(x) with d_k(x) = [(x−x_k)³₊ − (x−x_n)³₊]/(x_n − x_k). The problem thus reduces to finding a vector θ ∈ ℝ^n that minimizes

    RSS(θ) = (y − Nθ)⊤(y − Nθ) + λ θ⊤Ωθ

where N_ij = N_j(x_i) and Ω_ij = ∫ N''_i(x) N''_j(x) dx. It is easy to see that the solution is θ̂_λ = (N⊤N + λΩ)^{-1}N⊤y, and the fitted vector is ŷ = Nθ̂_λ = M_λ y with M_λ = N(N⊤N + λΩ)^{-1}N⊤. The fitted vector is linear in y, thus the loss rank is

    LR(λ) = inf_{α≥0} {(n/2) log(y⊤S^λ_α y) − (1/2) log det S^λ_α},   where   S^λ_α = (I − M_λ)⊤(I − M_λ) + αI.
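Both tuning problems of this section reduce to the same recipe: given a family of regression matrices (M^(k) for kNN, M_λ for the spline), evaluate LR and GCV for each candidate and take the minimizer. A minimal Python sketch for the kNN case of Figure 1(a), with data generated as in model (19) (illustrative only; the authors' MATLAB code is available at the URL above):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def loss_rank(M, y):
        """LR = min over alpha > 0 of (n/2) log(y^T S_alpha y) - (1/2) log det S_alpha, S_alpha as in (8)."""
        n = len(y)
        I = np.eye(n)
        S0 = (I - M).T @ (I - M)
        def lr(t):
            S = S0 + np.exp(t) * I
            return 0.5 * n * np.log(y @ S @ y) - 0.5 * np.linalg.slogdet(S)[1]
        return minimize_scalar(lr, bounds=(-30.0, 5.0), method="bounded").fun

    def gcv(M, y):
        """GCV = n ||(I - M) y||^2 / [tr(I - M)]^2."""
        n = len(y)
        r = y - M @ y
        return n * (r @ r) / np.trace(np.eye(n) - M) ** 2

    def knn_matrix(x, k):
        """kNN regression matrix: M_ij = 1/k if j is among the k nearest neighbors of x_i, else 0."""
        n = len(x)
        M = np.zeros((n, n))
        for i in range(n):
            M[i, np.argsort(np.abs(x - x[i]))[:k]] = 1.0 / k
        return M

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 1, 100))
    y = np.sin(12 * (x + 0.2)) / (x + 0.2) + 0.5 * rng.normal(size=100)   # data as in model (19)
    ks = range(2, 21)
    print("LoRP selects k =", min(ks, key=lambda k: loss_rank(knn_matrix(x, k), y)))
    print("GCV  selects k =", min(ks, key=lambda k: gcv(knn_matrix(x, k), y)))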
Let us consider again the dataset generated from model (19). Figure 1(b) shows the curves LR(λ), GCV(λ) and EPE(λ). The derivation of the expressions for GCV(λ) and EPE(λ) is similar to the previous example. λ ≈ 3×10⁻⁴ is the optimal value selected by the "ideal" criterion EPE; λ ≈ 5×10⁻⁴ and λ ≈ 7×10⁻⁴ are selected by LoRP and GCV, respectively. Once again, as in the previous example, LoRP selects a better λ than GCV does.

6 Comparison to Gaussian Bayesian Linear Regression

We now consider LBFR from a Bayesian perspective with Gaussian noise and prior, and compare it to LoRP. In addition to the noise model as in PML, one also has to specify a prior. Bayesian model selection (BMS) proceeds by selecting the model that has largest evidence. In the special case of LBFR with Gaussian noise and prior and a type II maximum likelihood estimate for the noise variance, the expression for the evidence has a structure similar to the expression for the loss rank.

Gaussian Bayesian LBFR / MAP. Recall from Sec.3 Ex.9 that F_d is the class of functions f_w(x) = w⊤φ(x) (w ∈ ℝ^d) that are linear in the feature vector φ. Let

    Gauss_N(z|µ,σ) := exp(−(1/2)(z−µ)⊤σ^{-1}(z−µ)) / ((2π)^{N/2} √det σ)        (20)

denote a general N-dimensional Gaussian distribution with mean µ and covariance matrix σ. We assume that observations y are perturbed from f_w(x) by independent additive Gaussian noise with variance β^{-1} and zero mean, i.e., the likelihood of y under model w is P(y|w) = Gauss_n(y|Φw, β^{-1}I), where Φ_ia = φ_a(x_i). A Bayesian assumes a prior (before seeing y) distribution on w. We assume a centered Gaussian with covariance matrix (αC)^{-1}, i.e., P(w) = Gauss_d(w|0, α^{-1}C^{-1}). From the prior and the likelihood one can compute the evidence and the posterior:

    Evidence:   P(y) = ∫ P(y|w) P(w) dw = Gauss_n(y|0, β^{-1}S^{-1})        (21)
    Posterior:  P(w|y) = P(y|w)P(w)/P(y) = Gauss_d(w|ŵ, A^{-1})        (22)
    B := Φ⊤Φ,   A := αC + βB,   M := βΦA^{-1}Φ⊤,   S := I − M,
    ŵ := βA^{-1}Φ⊤y,   ŷ := Φŵ = My

A standard Bayesian point estimate for w for fixed d is the one that maximizes the posterior (MAP), which in the Gaussian case coincides with the mean: ŵ = argmax_w P(w|y) = βA^{-1}Φ⊤y. For α → 0, MAP reduces to Maximum Likelihood (ML), which in the Gaussian case coincides with the least squares regression of Ex.9. For α > 0, the regression matrix M is not a projection anymore.

Bayesian model selection. Consider now a family of models {F_1, F_2, ...}. Here the F_d are the linear regressors with d basis functions, but in general they could be completely different model classes. All quantities in the previous paragraph implicitly depend on the choice of F, which we now make explicit with an index. In particular, the evidence for model class F is P_F(y). BMS chooses the model class (here d) F of highest evidence:

    F_BMS = argmax_F P_F(y)

Once the model class F_BMS is determined, the MAP (or other) regression function f_{w_{F_BMS}} or M_{F_BMS} is chosen. The data variance β^{-1} may be known or estimated from the data, C is often chosen to be I, and α has to be chosen somehow. Note that while α → 0 leads to a reasonable MAP=ML regressor for fixed d, this limit cannot be used for BMS.
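For concreteness, here is a short Python sketch of BMS for Gaussian Bayesian LBFR (hypothetical data and a hypothetical choice of the noise precision β; C = I). It evaluates the negative log evidence, cf. (23) below, for each model class F_d, minimizing over the prior precision α (the comparison below notes that α acts as a regularizer in both BMS and LoRP).

    import numpy as np

    def neg_log_evidence(Phi, y, alpha, beta, C=None):
        """-log P(y) for Gaussian Bayesian LBFR, eq. (23):
        (beta/2) y^T S y - (1/2) log det S - (n/2) log(beta/(2 pi)),
        with A = alpha C + beta Phi^T Phi, M = beta Phi A^{-1} Phi^T, S = I - M."""
        n, d = Phi.shape
        C = np.eye(d) if C is None else C
        A = alpha * C + beta * Phi.T @ Phi
        S = np.eye(n) - beta * Phi @ np.linalg.solve(A, Phi.T)
        _, logdetS = np.linalg.slogdet(S)
        return 0.5 * beta * (y @ S @ y) - 0.5 * logdetS - 0.5 * n * np.log(beta / (2 * np.pi))

    # Hypothetical comparison over polynomial model classes F_d (noise precision beta assumed known):
    rng = np.random.default_rng(3)
    x = np.linspace(-1, 1, 60)
    y = 1 + x - 2 * x**3 + 0.2 * rng.normal(size=60)
    beta = 1 / 0.2**2
    evidence = {}
    for d in range(1, 8):
        Phi = np.vander(x, d, increasing=True)
        evidence[d] = min(neg_log_evidence(Phi, y, a, beta) for a in np.logspace(-6, 2, 30))
    print("BMS selects d =", min(evidence, key=evidence.get))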
Comparison to LoRP. Inserting (20) into (21) and taking the logarithm, we see that BMS minimizes

    −log P_F(y) = (β/2) y⊤Sy − (1/2) log det S − (n/2) log(β/2π)        (23)

w.r.t. F. Let us estimate β by ML: we assume a broad prior α ≪ β so that β ∂S/∂β = O(α/β) can be neglected. Then −∂log P_F(y)/∂β = (1/2) y⊤Sy − n/(2β) + O(αn/β) = 0 ⇔ β ≈ β̂ := n/(y⊤Sy). Inserting β̂ into (23) we get

    −log P_F(y) = (n/2) log y⊤Sy − (1/2) log det S − (n/2) log(n/2πe)        (24)

Taking an improper prior P(β) ∝ β^{-1} and integrating out β leads, for small α, to a similar result. The last term in (24) is a constant independent of F and can be ignored. The first two terms have the same structure as in linear LoRP (10), but the matrix S is different. In both cases α acts as a regularizer, so we may minimize over α in BMS just as in LoRP. For α = 0 (which makes sense neither in BMS nor in LoRP), M in BMS coincides with M of Ex.9, but still the S_0 in LoRP is the square of the S in BMS. For α > 0, M of BMS may be regarded as a regularized regressor as suggested in Sec.2 (a), rather than a regularized loss function (b) as used in LoRP. Note also that BMS is limited to (semi)parametric regression, i.e., it does not cover the non-parametric kNN Ex.2 and kernel Ex.8, unlike LoRP. Since B only depends on x (and not on y), and all P are implicitly conditioned on x, one could choose C = B. In this case M = γΦB^{-1}Φ⊤, with γ = β/(α+β) < 1 for α > 0, is a simple multiplicative regularization of the projection ΦB^{-1}Φ⊤, and (24) coincides with (11) for suitable α, apart from an irrelevant additive constant; hence minimizing (24) over α also leads to (12).

7 Comparison to other Model Selection Schemes

In this section we give a brief introduction to PML for (semi)parametric regression and its major instantiations, AIC, BIC, and the MDL principle, whose penalty terms are all proportional to the number of parameters d. The effective number of parameters is often much smaller than d, e.g., if there are soft constraints like in ridge regression. We compare MacKay's trace formula [Mac92] for Gaussian Bayesian LBFR and Hastie et al.'s trace formula [HTF01] for general linear regression with LoRP.

Penalized ML (AIC, BIC, MDL). Consider a d-dimensional stochastic model class like the Gaussian Bayesian linear regression example of Section 6. Let P_d(y|w) be the data likelihood under the d-dimensional model w ∈ ℝ^d. The maximum likelihood (ML) estimator for fixed d is

    ŵ = argmax_w P_d(y|w) = argmin_w {−log P_d(y|w)}        (25)

Since −log P_d(y|ŵ) decreases with d, we cannot find the model dimension by simply minimizing over d (overfitting). Penalized ML adds a complexity term to get reasonable results:

    d̂ = argmin_d {−log P_d(y|ŵ) + Penalty(d)}        (26)

The penalty introduces a tradeoff between the first and second term with a minimum at d̂ < ∞. Various penalties have been suggested: AIC [Aka73] uses d, while BIC [Sch78] and the (crude) MDL [Ris78, Grü04] use (d/2) log n for Penalty(d).
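A tiny Python sketch of the PML recipe (25)-(26) for a Gaussian polynomial model class, with the penalties exactly as stated above (d for AIC, (d/2) log n for BIC/crude MDL); data and model class are hypothetical.

    import numpy as np

    def neg_log_ml(Phi, y):
        """-log P_d(y | w_ML) for a Gaussian noise model with ML-estimated variance."""
        n = len(y)
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        sigma2 = np.sum((y - Phi @ w) ** 2) / n
        return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * n

    def penalized_ml(x, y, penalty, d_max=10):
        """d_hat = argmin_d { -log P_d(y | w_hat) + Penalty(d) }, eq. (26)."""
        n = len(y)
        scores = {}
        for d in range(1, d_max + 1):
            Phi = np.vander(x, d, increasing=True)
            scores[d] = neg_log_ml(Phi, y) + penalty(d, n)
        return min(scores, key=scores.get)

    aic = lambda d, n: d                       # AIC penalty, as stated above
    bic = lambda d, n: 0.5 * d * np.log(n)     # BIC / crude MDL penalty

    rng = np.random.default_rng(4)
    x = np.linspace(-1, 1, 100)
    y = 0.5 - x + 2 * x**2 + 0.3 * rng.normal(size=100)
    print("AIC selects d =", penalized_ml(x, y, aic), "  BIC selects d =", penalized_ml(x, y, bic))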
There are at least three important conceptual differences to LoRP:

• In order to apply PML one needs to specify not only a class of regression functions, but a full probabilistic model P_d(y|w);
• PML ignores, or at least does not tell how to incorporate, a potentially given loss function;
• PML is mostly limited to selecting between (semi)parametric models.

We discuss two approaches to the last item in the remainder of this section (where AIC, BIC, and MDL are not directly applicable): (a) non-parametric models like kNN or kernel regression, or (b) cases where d does not reflect the "true" complexity of the model. [Mac92] suggests an expression for the effective number of parameters d_eff as a substitute for d in case of (b), while [HTF01] introduce another expression, which is applicable for both (a) and (b).

The trace penalty for parametric Gaussian LBFR. We continue with the Gaussian Bayesian linear regression example (see Section 6 for details and notation). Performing the integration in (21), [Mac92, Eq.(21)] derives the following expression for the Bayesian evidence for C = I:

    −log P(y) = (α Ê_W + β Ê_D) + ((1/2) log det A − (d/2) log α) − (n/2) log(β/2π)        (27)
    Ê_D = (1/2) ‖Φŵ − y‖²₂,   Ê_W = (1/2) ‖ŵ‖²₂

(the first bracket in (27) equals (β/2) y⊤Sy and the second equals −(1/2) log det S, cf. (23)). Minimizing (27) w.r.t. α leads to the following relation:

    0 = −∂log P(y)/∂α = Ê_W + (1/2) tr A^{-1} − d/(2α)        (using ∂/∂α log det A = tr A^{-1})

He argues that α‖ŵ‖²₂ corresponds to the effective number of parameters, hence

    d^McK_eff := α‖ŵ‖²₂ = 2α Ê_W = d − α tr A^{-1}        (28)

The trace penalty for general linear models. We now return to general linear regression ŷ = M(x)y (7). LBFR is a special case with a projection matrix M = M² whose rank d = tr M is the number of basis functions. M leaves d directions untouched and projects all other n−d directions to zero. For general M, [HTF01, Sec.5.4.1] argue to regard a direction that is only somewhat shrunken, say by a factor of 0 < β < 1, as a fractional parameter (β degrees of freedom). If β_1,...,β_n are the shrinkages = eigenvalues of M, the effective number of parameters could be defined as [HTF01, Sec.7.6]

    d^HTF_eff := Σ_{i=1}^n β_i = tr M,

where HTF stands for Hastie-Tibshirani-Friedman. This generalizes the relation d = tr M beyond projections. For MacKay's M (22), tr M = d − α tr A^{-1}, i.e., d^HTF_eff is consistent with, and generalizes, d^McK_eff.

Problems. Though nicely motivated, the trace formula is not without problems. First, since M = M² for projections, one could have argued equally well for d^HTF_eff = tr M². Second, for kNN we have tr M = n/k (since M is 1/k on the diagonal), which does not look unreasonable. Consider now kNN', which is defined as follows: we average over the k nearest neighbors excluding the closest neighbor. For sufficiently smooth functions, kNN' for suitable k is still a reasonable regressor, but tr M = 0 (since M is zero on the diagonal). So d^HTF_eff = 0 for kNN', which makes no sense and would lead one to always select the k = 1 model.

Relation to LoRP. In the case of kNN', tr M² would be a better estimate for the effective dimension. In linear LoRP, −log det S_α serves as complexity penalty.
Ignoring the nullspace of S_0 = (I−M)⊤(I−M) (8), we can Taylor expand −(1/2) log det S_0 in M:

    −(1/2) log det S_0 = −tr log(I−M) = Σ_{s=1}^∞ (1/s) tr(M^s) = tr M + (1/2) tr M² + ...

For BMS (24) with S = I − M (22) we get half of this value. So the trace penalty may be regarded as a leading order approximation to LoRP. The higher order terms prevent peculiarities like in kNN'.

Coding/MDL interpretation of LoRP. The basic idea of MDL is as follows [Grü04]: "The goal of statistical inference may be cast as trying to find regularity in the data. 'Regularity' may be identified with 'ability to compress'. MDL combines these two insights by viewing learning as data compression: it tells us that, for a given set of hypotheses H and data set D, we should try to find the hypothesis or combination of hypotheses in H that compresses D most." The standard incarnation of (crude) MDL is as follows: if H is a stochastic model of (discrete) data D, we can code D (by Shannon-Fano) in ⌈−log₂ P(D|H)⌉ bits. If we have a class of models H, we also have to code H (somehow, in, say, L(H) bits) in order to be able to decode D. MDL chooses the hypothesis H_MDL = argmin_{H∈H} {−log₂ P(D|H) + L(H)} of minimal two-part code length. For instance, if H is the class of all polynomials of all degrees with each coefficient coded to (1/2) log₂ n bits (i.e., O(n^{−1/2}) accuracy) and we condition on x, i.e., D ↝ y|x, MDL takes the form (25) and (26), i.e., H_MDL = (ŵ, d̂).

We now give LoRP (for discrete D) a data compression/MDL interpretation. For simplicity, we will first assume that all loss values are different, i.e., Loss_r(y'|x) ≠ Loss_r(y''|x) for y' ≠ y'' (adding infinitesimal random noise to Loss_r easily ensures this). In this case, Rank_r(·|x) : Y^n → ℕ is an order preserving bijection, i.e., Rank_r(y'|x) < Rank_r(y''|x) iff Loss_r(y'|x) < Loss_r(y''|x), with no gaps in the range of Rank_r(·|x). Phrased differently, Rank_r(·|x) codes each y' ∈ Y^n as a natural number m in increasing loss-order. The natural number m can itself be coded in ⌈log₂ m⌉ bits (using plain, not prefix, coding). Let us call this code of y' the Loss Rank Code (LRC). LRC has a nice characterization: LRC is the shortest loss-order preserving code. Ignoring the rounding, the length of LRC_r(y'|x) is LR_r(y'|x):

Proposition 15 (Minimality property). If all loss values are different, i.e., if Loss_r(y'|x) ≠ Loss_r(y''|x) for all y' ≠ y'', then the loss rank (code) of y is the smallest/shortest among all loss-order preserving rankings/codes C in the sense that

    Rank(y) = min{C(y) : C ∈ Y^n → ℕ ∧ (⋆)}
    ⌊LR(y)/log 2⌋ = min{Length(C(y)) : C ∈ Y^n → {0,1}* ∧ (⋆)}
    (⋆) := [Loss(y') < Loss(y'') ⇔ C(y') < C(y''), ∀ y', y'']

The proof follows from the fact that if a discrete injection (code) is order preserving, there exists a "smallest" one without gaps in the range. So LoRP minimizes the Loss Rank Code, where LRC itself is the shortest among all loss-order preserving codes. From this perspective, LoRP is just a different (non-stochastic, non-parametric, loss-based) incarnation of MDL.
The MDL philosophy provides a justification of LoRP (2), its regularization (5), and loss-function selection (Section 8). This identification should also allow to apply or adapt the various consistency results of MDL, implying that LoRP is consistent under some mild conditions. If some losses are equal, $\mathrm{Rank}_r(\cdot|x):\mathcal Y^n\to\mathbb{N}$ still preserves the order $\le$, but the mapping is neither surjective nor injective anymore.

Large regression classes $\mathcal R$. The classes $\mathcal R$ of regressors we considered so far were discrete and "small", often indexed by an integer complexity index (like $k$ in kNN or $d$ in LBFR). But large classes are also thinkable. As an extreme case, consider the class of all regressors. Clearly, there is an $r=r_D$ which "knows" $D$ and perfectly fits $D$ ($r(x_i|D)=y_i\ \forall i$), but is the worst possible on all other $D'$ ($r(x_i|D')=\infty,\ \forall i,\ \forall D'\neq D$). This $r$ has (discrete) rank 1, so it is best according to LoRP. So if $\mathcal R$ is too large, LoRP can overfit too. Consider a more realistic example: instead of taking all of the first $d$ basis functions in LBFR, we select some basis functions $\phi_{i_1},\dots,\phi_{i_d}$, i.e., $\mathcal R$ is indexed by $d$ integers, and $d$ may be variable too. One solution approach is to group more regressors in $\mathcal R$ into one function class $\mathcal F$, e.g., the class of functions $\mathcal F_{k,d}=\{w_1\phi_{i_1}+\dots+w_d\phi_{i_d} : w\in\mathbb{R}^d,\ 1\le i_1<\dots<i_d\le k\}$ that are linear in $d$ of the first $k$ bases. Now $\mathcal R$ is a small class indexed by $d$ and $k$ only. Looking at the coding interpretation of $\mathrm{LR}_r$ and the MDL philosophy suggests to assign a code to $r\in\mathcal R$ in order to get a complete code for $D$:

$$r^{\mathrm{best}} \;=\; \arg\min_r\{\mathrm{LR}_r(y|x) + L(r)\},$$

where $L(r)$ is the length of a code for $r$ (given $\mathcal R$). For $\mathcal R\simeq\mathbb{N}$ a single integer has to be coded, e.g., $k$ in $L(r)=L(k)\approx\log_2 k$ bits, which can usually be safely dropped/ignored. For more complex classes like the (ungrouped) LBFR subset selection above, $L(r)=L(i_1,\dots,i_d,d)\approx d\log_2 k+\log_2 d$ can become important.

8 Loss Functions and their Selection

General additive loss. Linear LoRP $\hat y=M(x)y$ of Section 3 can easily be generalized to non-quadratic loss. Let us consider the $\rho>0$ loss

$$\mathrm{Loss}_M(y|x) \;:=\; \Big(\sum_{i=1}^n(y_i-\hat y_i)^\rho\Big)^{1/\rho} \;=\; \|y-\hat y\|_\rho \;=\; \|(I-M)y\|_\rho,$$
$$V(L) \;=\; \{y'\in\mathbb{R}^n : \|(I-M)y'\|_\rho\le L\} \;=\; \{(I-M)^{-1}z\in\mathbb{R}^n : \|z\|_\rho\le L\}.$$

Let

$$v^n_\rho \;:=\; |\{z\in\mathbb{R}^n:\|z\|_\rho\le1\}| \;=\; 2^n\prod_{i=1}^{n-1}\frac{(\tfrac i\rho)!\,(\tfrac1\rho)!}{(\tfrac{i+1}\rho)!}, \qquad x!:=\Gamma(x+1),$$

be the volume of the unit $n$-dimensional $\rho$-norm "ball". Since $V(L)$ is a linear transformation of this ball with transformation matrix $(I-M)^{-1}$ and scaling $L$, we have $|V(L)|=v^n_\rho L^n/\det(I-M)$, hence

$$\mathrm{LR}_M(y|x) \;=\; \log|V(\mathrm{Loss}_M(y|x))| \;=\; n\log\|(I-M)y\|_\rho - \log\det(I-M) + \log v^n_\rho. \qquad (29)$$

For the $\rho=2$ norm, (29) reduces to $\mathrm{LR}^0_M$ (9). Note that $\mathrm{Loss}_M:=g(\|y-\hat y\|_\rho)$ leads to the same result (29) for any monotone increasing $g$, i.e., only the order of the loss matters, not its absolute value. More generally, $\mathrm{Loss}_M=g(\sum_i h(y_i-\hat y_i))$ for any $h$ implies

$$\mathrm{LR}_M(y|x) \;=\; n\log v^n_h\Big(\sum_i h(y_i-\hat y_i)\Big) - \log\det(I-M), \qquad v^n_h(l):=\big|\{z\in\mathbb{R}^n:\textstyle\sum_i h(z_i)\le l\}\big|^{1/n},$$

where $v^n_h(l)$ is a one-dimensional function of $l$ (independent of $D$ and $M$), to be determined once (e.g., $v^n_h(l)=l\cdot(v^n_\rho)^{1/n}\propto l$ for $\rho$-norm loss).
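A direct implementation of (29) is straightforward. The sketch below is illustrative only: the Gaussian-kernel hat matrix, the shrinkage factor $\gamma$, and the synthetic data are our own choices, made so that $\det(I-M)>0$ and the formula applies as stated:

```python
import numpy as np
from scipy.special import gammaln

def log_unit_ball_volume(n, rho):
    """log v_rho^n = log volume of the unit rho-norm ball in R^n,
    v_rho^n = 2^n * Gamma(1/rho + 1)^n / Gamma(n/rho + 1)."""
    return n * np.log(2) + n * gammaln(1/rho + 1) - gammaln(n/rho + 1)

def loss_rank(M, y, rho):
    """LR_M(y|x) = n log ||(I-M)y||_rho - log det(I-M) + log v_rho^n, Eq. (29)."""
    n = len(y)
    resid = (np.eye(n) - M) @ y
    loss = np.sum(np.abs(resid) ** rho) ** (1/rho)
    sign, logdet = np.linalg.slogdet(np.eye(n) - M)
    return n * np.log(loss) - logdet + log_unit_ball_volume(n, rho)

# Illustration with a Gaussian-kernel smoother, shrunk by gamma < 1 so that
# det(I - M) > 0 (this plays the role of the regularization M -> gamma*M
# discussed in the next paragraph).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 50))
y = np.sin(x) + 0.3 * rng.standard_normal(50)

def kernel_hat_matrix(x, bandwidth, gamma=0.99):
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return gamma * K / K.sum(axis=1, keepdims=True)

for h in [0.1, 0.3, 1.0, 3.0]:
    M = kernel_hat_matrix(x, h)
    print(f"bandwidth {h:4.1f}: LR = {loss_rank(M, y, rho=2):8.2f}")
# LoRP would select the bandwidth with the smallest loss rank LR.
```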
Regularization may be performed by $M\rightsquigarrow\gamma M$ with optimization over $\gamma<1$.

Loss-function selection. In principle, the loss function should be part of the problem specification, since it characterizes the ultimate goal. For instance, whether a test should more likely classify a healthy person as sick than a sick person as healthy depends on the severity of a misclassification (loss) in each direction. In reality, though, having to specify the loss function can be a nuisance. Sure, the loss has to respect some general features, e.g., that it increases with the deviation of $\hat y_i$ from $y_i$. Otherwise it is chosen by convenience or rules of thumb, rather than by elicitation of the real goal, for instance preferring the Euclidean norm over $\rho\neq2$ norms. If we subscribe to this procedure of choosing the loss function, we may ask whether it can be done in a more principled way. Consider a (not too large) class of loss functions $\mathrm{Loss}_\alpha$, indexed by some parameter $\alpha$, for instance $\mathrm{Loss}_\alpha=\|y-\hat y\|_\alpha$ from the previous paragraph. The regularized loss (5) also constitutes a class of losses; in that case we minimized over the regularization parameter $\alpha$. This suggests to choose, in general, the loss function that has minimal loss rank $\mathrm{LR}^\alpha_r$. The justifications are similar to the ones for minimizing $\mathrm{LR}^\alpha_r$ w.r.t. $r$. Note that the term $\log v^n_\rho$ cannot be dropped anymore, unlike in (10).
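Continuing the previous sketch, loss-function selection by minimal loss rank amounts to one extra loop over the candidate losses (here the $\rho$-norms); this reuses the hypothetical loss_rank and kernel_hat_matrix helpers and the synthetic data defined above, and is only meant to show the mechanics:

```python
# Sketch of loss-function selection by minimal loss rank (continuation of the
# previous snippet; rho plays the role of the loss index alpha in the text).
M = kernel_hat_matrix(x, bandwidth=0.5)       # fix a regressor ...
rhos = [1.0, 1.5, 2.0, 3.0, 4.0]
ranks = {rho: loss_rank(M, y, rho) for rho in rhos}
best_rho = min(ranks, key=ranks.get)          # ... and pick the rho-norm loss
print(ranks)                                  #     with the smallest loss rank
print("selected rho-norm loss: rho =", best_rho)
# The log v_rho^n term inside loss_rank matters here: it differs across the
# competing loss functions and can no longer be dropped.
```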
9 Self-Consistent Regression

So far we have considered only "on-data" regression: LoRP only depends on the regressor $r$ on data $D$ and not on $x\notin\{x_1,\dots,x_n\}$. We now construct canonical regressors for off-data $x$ from regressors given only on-data. First, this may ease the specification of the regression functions; second, it is a canonical way of interpolation (LoRP can't distinguish between $r$ that are identical on $D$); and third, we show that many standard regressors (kNN, kernel, LBFR) are self-consistent in the sense that they are canonical. We limit our exposition to linear regression.

Off-data regression. A linear regressor is completely determined by the $n$ functions $m_j$ (6), but not by the matrix function $M$ (7). Indeed, two sets $\{m_j\}$ and $\{m'_j\}$ that coincide on $D=(x,y)$, i.e., $m_j(x_i|x)=m'_j(x_i|x)\ \forall i,j$, but possibly differ for $x\notin x$, lead to the same matrix $M_{ij}(x)=m_j(x_i|x)=m'_j(x_i|x)$. LoRP has the advantage of depending only on $M$, but this also means that it cannot distinguish between an $m_j$ that behaves well for $x\notin x$ and one that, e.g., oscillates wildly outside $x$. Typically, the $m_j$ are given and, provided the model complexity is chosen appropriately, e.g., by LoRP, they properly interpolate $x$. Nevertheless, a canonical extension from $M$ to $m_j$ would be nice. In this way LoRP would not be vulnerable to bad $m_j$, and we could interpolate $D$ (predict $y$ for any $x\in\mathcal X$) even without $m_j$ given a priori.

We define a self-consistent regression scheme based only on $M$ (for all $n$). We ask for an estimate $\hat y$ of $y$ for $x\notin x$. We add a virtual data point $(x_0,y_0)$ to $D$, where $x_0=x$. If we knew $y_0=y$ we could estimate $\hat y_0=r(x_0|\{(x_0,y_0)\}\cup D)$, but we don't know $y_0$. Instead we can require a self-consistency condition, namely that $\hat y_0=y_0$ for $x_0\notin x$.

Definition 16 (canonical and self-consistent regressors) Let $M'_{ij}(x')$, $0\le i,j\le n$, be the regression matrix for the data set $D'=\{(x_0,y_0)\}\cup D=((x_0,x),(y_0,y))=(x',y')$ of size $n+1$.
(i) A linear regressor $\tilde y_0=\tilde r(x_0|D)$ is called a canonical regressor for $M'$ if the consistency condition $\tilde y_0=r(x_0|D')\equiv\sum_{j=0}^n M'_{0j}y_j$ holds $\forall x_0,D$.
(ii) A regressor $r$ is called self-consistent if $\tilde r=r$, i.e., if $r(x_0|\{(x_0,r(x_0|D))\}\cup D)=r(x_0|D)\ \forall x_0,D$.
(iii) A class of regressors $\mathcal R=\{r\}$ is called self-consistent if $\tilde{\mathcal R}=\{\tilde r\}\subseteq\mathcal R$.

We denote the solution of the self-consistency condition $y_0=\sum_{j=0}^n M'_{0j}y_j$ by $\tilde y_0$. So we have to solve

$$\tilde y_0 \;=\; \sum_{j=1}^n M'_{0j}y_j + M'_{00}\tilde y_0 \;\Longrightarrow\; \tilde y_0 \;=\; \frac{\sum_{j=1}^n M'_{0j}y_j}{1-M'_{00}} \;=\; \frac{\sum_{j=1}^n M'_{0j}y_j}{\sum_{j=1}^n M'_{0j}},$$

where the last equality only holds if $\sum_{j=0}^n M'_{0j}=1$, which is often the case, in particular for kNN and kernel regression, but not necessarily for LBFR.

Proposition 17 (canonical regressor) The linear regressor

$$\tilde y_0 \;=\; \tilde r(x_0|D) \;:=\; \sum_{j=1}^n\tilde m_j(x_0|x)\,y_j, \qquad \tilde m_j(x_0|x) \;:=\; \frac{M'_{0j}(x')}{1-M'_{00}(x')},$$

is the unique canonical regressor for $M'$ (if $M'_{00}<1$).

Example 18 (self-consistent kNN, ↑Ex.2) $M'_{0j}(x')=\frac1k$ for $x_j\in N'_k(x_0)$ and 0 otherwise. The $k$ nearest neighbors $N'_k(x_0)$ of $x_0$ among $x'$ consist of $x_0$ itself and the $k-1$ nearest neighbors $N_{k-1}(x_0)=:J$ of $x_0$ among $x$, i.e., $N'_k(x_0)=\{x_0\}\cup N_{k-1}(x_0)$. Hence

$$\tilde y_0 \;=\; \frac{\sum_{j=1}^n M'_{0j}y_j}{\sum_{j=1}^n M'_{0j}} \;=\; \frac{\sum_{j\in J}\frac1k y_j}{\sum_{j\in J}\frac1k} \;=\; \sum_{j\in J}\frac{y_j}{k-1} \;=\; \sum_{j=1}^n M^{(k-1)}_{0j}y_j \;=\; r_{k-1}(x_0|D) \;=\; \hat y_0.$$

Canonical kNN is equivalent to standard (k−1)NN, so the class of canonical kNN regressors coincides with the standard kNN class. ♦

Example 19 (self-consistent kernel)

$$M'_{0j}(x') \;=\; \frac{K(x_0,x_j)}{\sum_{j=0}^n K(x_0,x_j)} \;\Longrightarrow\; \tilde y_0 \;=\; \frac{\sum_{j=1}^n K(x_0,x_j)y_j}{\sum_{j=1}^n K(x_0,x_j)} \;=\; r(x_0|D) \;=\; \hat y_0.$$

Canonical kernel regression coincides with the standard kernel smoother. ♦

Example 20 (self-consistent LBFR)

$$B' \;=\; \sum_{i=0}^n\phi(x_i)\phi(x_i)^\top \;=\; B + \phi(x_0)\phi(x_0)^\top$$
$$\Longrightarrow\; M'_{0j} \;=\; \phi(x_0)^\top B'^{-1}\phi(x_j) \;=\; \phi(x_0)^\top\Big(B^{-1} - \frac{B^{-1}\phi(x_0)\phi(x_0)^\top B^{-1}}{1+\phi(x_0)^\top B^{-1}\phi(x_0)}\Big)\phi(x_j) \;=\; M_{0j} - \frac{M_{00}M_{0j}}{1+M_{00}} \;=\; \frac{M_{0j}}{1+M_{00}}$$
$$\Longrightarrow\; 1-M'_{00} \;=\; \frac1{1+M_{00}}.$$

In the first line we used the Sherman-Morrison formula for inverting $B'$. In the second line we defined $M_{0j}=\phi(x_0)^\top B^{-1}\phi(x_j)$, extending $M$. Hence

$$\tilde y_0 \;=\; \frac{\sum_{j=1}^n M'_{0j}y_j}{1-M'_{00}} \;=\; \sum_{j=1}^n M_{0j}y_j \;=\; \sum_{j=1}^n m_j(x_0,x)y_j \;=\; \hat y_0.$$

Canonical LBFR coincides with standard LBFR. ♦

Proposition 21 (self-consistent regressors) Kernel regression and linear basis function regression are self-consistent. kNN is not self-consistent, but the class of kNN regressors $\mathcal R=\{r_{kNN}:k\in\mathbb{N}\}$ is self-consistent.

To summarize, we expect LoRP to select good regressors with proper interpolation behavior for canonical and self-consistent regressors.
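Example 18 is easy to check numerically. The following sketch (our own construction, assuming a scalar $x$ and Euclidean distance) compares the canonical regressor of Proposition 17, built from the kNN hat matrix, with standard (k−1)NN:

```python
import numpy as np

def knn_predict(x0, X, y, k):
    """Standard kNN regression: average the y-values of the k nearest x's."""
    nbrs = np.argsort(np.abs(X - x0))[:k]
    return y[nbrs].mean()

def canonical_knn_predict(x0, X, y, k):
    """Canonical regressor of Proposition 17 built from the kNN hat matrix:
    the virtual point x0 is its own nearest neighbour, so row 0 of M' puts
    weight 1/k on x0 and on its k-1 nearest neighbours among X; then
    y0_tilde = sum_j M'_0j y_j / (1 - M'_00)."""
    nbrs = np.argsort(np.abs(X - x0))[:k - 1]   # x0 occupies one of the k slots
    w = np.zeros(len(X))
    w[nbrs] = 1.0 / k                           # M'_0j, j >= 1
    M00 = 1.0 / k                               # M'_00
    return w @ y / (1.0 - M00)

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 10, 30))
y = np.cos(X) + 0.1 * rng.standard_normal(30)

x0, k = 4.2, 5
print(canonical_knn_predict(x0, X, y, k))   # equals ...
print(knn_predict(x0, X, y, k - 1))         # ... standard (k-1)NN, as in Example 18
```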
10 Nearest Neighbors Classification

We now consider k-nearest-neighbors classification in more detail. In order to get more insight into LoRP we seek a case that allows an analytic solution. In general, the determinant $\det S_\alpha$ cannot be computed analytically, but for $x$ lying on a hypercube of the regular grid $\mathcal X=\mathbb{Z}^d$ it can. We derive exact expressions and consider the limits $n\to\infty$, $k\to\infty$, and $d\to\infty$.

kNN on a one-dimensional grid. We consider the $d=1$ dimensional case first. We assume $x=(1,2,3,\dots,n)$, a circular metric $d(x_i,x_j)=d(i,j)=\min\{|i-j|,n-|i-j|\}$, and odd $k\le n$. The kNN regression matrix $M_{ij}=b_{i-j}$ with $b_{i-j}=\frac1k$ if $d(i,j)\le\frac{k-1}2$ and 0 otherwise is a diagonal-constant (Toeplitz) matrix with circularity property $b_{i-j}=b_{i-j+n}$. For instance, for $k=3$ and $n=5$,

$$M \;=\; \frac13\begin{pmatrix}1&1&0&0&1\\1&1&1&0&0\\0&1&1&1&0\\0&0&1&1&1\\1&0&0&1&1\end{pmatrix}$$

For every circulant matrix, the eigenvectors $v^1,\dots,v^n$ are waves $v^l_j=\theta^{jl}$ with $\theta=e^{2\pi\sqrt{-1}/n}$. The eigenvalues are the Fourier transform $\hat b_l=\sum_{j=1}^n b_j\theta^{-jl}$ of $b$, since $\sum_j M_{ij}v^l_j=\sum_j b_{i-j}\theta^{jl}=\sum_j b_j\theta^{(i-j)l}=v^l_i\sum_j b_j\theta^{-jl}=\hat b_l v^l_i$, where we exploited circularity of $b$ and $\theta^{jl}$. For $M_{kNN}$ in particular we get

$$\hat b_l \;\overset{\text{circularity}}{=}\; \frac1k\sum_{j=-\frac{k-1}2}^{\frac{k-1}2}\theta^{-jl} \;\overset{\text{geometric sum}}{=}\; \frac1k\,\frac{\theta^{lk/2}-\theta^{-lk/2}}{\theta^{l/2}-\theta^{-l/2}} \;\overset{\text{insert }\theta}{=}\; \frac{\sin(\pi lk/n)}{k\sin(\pi l/n)} \;<\; 1$$

for $l\neq n$, and $\hat b_n=1$. The only 1-vector $v^n=\mathbf 1$ corresponds to a constant shift $y_i\rightsquigarrow y_i+c$ under which kNN (like many other regressors) is invariant. Instead of regularizing LoRP with $\alpha>0$ we can restrict $V(L)\subset\mathbb{R}^n$ to the space orthogonal to $v^n$, which means dropping $\hat b_n=1$ in the determinant. Intuitively, since this invariant direction is the same for all $k$, we can drop the same additive infinite constant from LR for every $k$, which is irrelevant for comparisons (formally we should compute $\lim_{\alpha\to0}\{\mathrm{LR}^\alpha_{k_1}-\mathrm{LR}^\alpha_{k_2}\}$). The exact expression for the restricted log-determinant (denoted by a prime) is

$$-\tfrac12\log\det{}'S_0 \;=\; -\log\det{}'(I-M) \;=\; -\sum_{l=1}^{n-1}\log(1-\hat b_l) \;=:\; \frac nk\,c^1_{nk} \;=\; c^1_{nk}\operatorname{tr}M.$$

For large $n$ (and large $k$) the expression can be simplified. The exact, large-$n$, and large-$k\ll n$ expressions are

$$c^1_{nk} \;=\; -\frac kn\sum_{l=1}^{n-1}\log\Big(1-\frac{\sin(\pi lk/n)}{k\sin(\pi l/n)}\Big)$$
$$c^1_{\infty k} \;=\; -\frac k\pi\int_{-\pi/2}^{\pi/2}\log\Big(1-\frac{\sin(kz)}{k\sin z}\Big)dz \qquad \big(z=\pi l/n\text{ for }l<\tfrac n2,\ z=\pi l/n-\pi\text{ else}\big)$$
$$c^1_{\infty\infty} \;=\; -\frac1\pi\int_{-\infty}^{\infty}\log\Big(1-\frac{\sin t}{t}\Big)dt \;\doteq\; 3.2 \qquad \big(t=kz,\ \sin z\sim z\big)$$

Further, $c^1_{\infty3}=3\log3\doteq3.295$. Since $c^1_{\infty k}$ is decreasing in $k$, $c^1_{\infty k}$ equals 3.2 within 3% for all $k$.

kNN on a d-dimensional grid. We now consider $x=\mathcal X^d=\{1,\dots,n_1\}^d$, a $d$-dimensional complete hypercube grid with $n=n_1^d$ points and Manhattan distance $d(x_i,x_j)=d(i,j)=\sum_{a=1}^d d_1(i_a,j_a)$ for all $x_i=i\in\mathcal X^d$ and $x_j=j\in\mathcal X^d$, where $d_1$ is the one-dimensional circular distance defined above (so actually $\mathcal X^d$ is a discrete torus). For $k=k_1^d$, the neighborhood $N_k(x)$ of $x$ is a cube of side length $k_1$. In this case, $M=M_1\otimes\dots\otimes M_1$ is a $d$-fold tensor product of the 1d $k_1$NN matrices $M_1$ of sample size $n_1$. The eigenvectors of $M$ are $v^{l_1}\otimes\dots\otimes v^{l_d}$ with eigenvalues $\hat b_{l_1}\cdots\hat b_{l_d}$. We get

$$-\log\det{}'(I-M) \;=\; -\sum_{l_1=1}^{n_1-1}\cdots\sum_{l_d=1}^{n_1-1}\log(1-\hat b_{l_1}\cdots\hat b_{l_d}) \qquad (30)$$
$$\xrightarrow{\;n\gg k\to\infty\;} \;-\frac1{\pi^d}\int_{\mathbb{R}^d}\log\Big(1-\prod_{a=1}^d\frac{\sin t_a}{t_a}\Big)d^dt \;=:\; \frac nk\,c^d_{\infty\infty}.$$

For instance, for $d=2$, numerical integration gives $c^2_{\infty\infty}\doteq2.2$, compared to 3.2 in one dimension.
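The exact constant $c^1_{nk}$ can be evaluated directly from the circulant eigenvalues. The following sketch (ours; it simply codes the sum above) reproduces the quoted values:

```python
import numpy as np

def c1(n, k):
    """Exact restricted log-determinant constant for kNN on the circular
    1-d grid: c^1_nk = -(k/n) * sum_{l=1}^{n-1} log(1 - b_hat_l), with
    b_hat_l = sin(pi*l*k/n) / (k*sin(pi*l/n)) the circulant eigenvalues."""
    l = np.arange(1, n)
    b_hat = np.sin(np.pi * l * k / n) / (k * np.sin(np.pi * l / n))
    return -(k / n) * np.sum(np.log1p(-b_hat))

for k in [3, 5, 11, 51]:
    print(f"k = {k:3d}:  c^1_nk = {c1(n=10_001, k=k):.3f}")
# For large n the values start near 3*log(3) ~ 3.295 at k = 3 and decrease
# towards the k -> infinity limit of about 3.2, so the LoRP penalty
# -log det'(I - M) is roughly 3.2 * tr M for 1-d circular kNN.
```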
For higher dimensions, evaluation of the $d$-dimensional integral becomes cumbersome, and we resort to a different approximation.

Taylor series in $M$. We can also (not only for kNN) expand $\log\det S_0$ in a Taylor series in $M$:

$$-\log\det{}'(I-M) \;=\; -\operatorname{tr}'\log(I-M) \;=\; \sum_{s=1}^\infty\frac1s\operatorname{tr}'(M^s) \;=\; \sum_{s=1}^\infty\frac1s(\operatorname{tr}'M_1^s)^d \;=\; \frac nk\sum_{s=1}^\infty\frac1s(A_{n_1k_1s})^d \;=:\; \frac nk\,c^d_{nk},$$

where we used $\operatorname{tr}(A\otimes B)=\operatorname{tr}(A)\cdot\operatorname{tr}(B)$ and $(A\otimes B)^s=A^s\otimes B^s$, and defined

$$A_{n_1k_1s} \;:=\; \frac{k_1}{n_1}\operatorname{tr}'(M_1^s) \;=\; \frac{k_1}{n_1}\sum_{l=1}^{n_1-1}(\hat b_l)^s \;\xrightarrow{\;n\gg k\to\infty\;}\; \frac1\pi\int_{-\infty}^{\infty}\Big(\frac{\sin t}{t}\Big)^s dt.$$

The one-dimensional integral can be expressed as a finite sum with $s$ terms or evaluated numerically. For any $n$ and $k$ one can show that $A_{nk1}=A_{nk2}=1>A_{nks}$ for $s>2$. So the expansion above is useful for large $d$. Note also that $c^d_{nk}$ is monotone decreasing in $d$. For $d\to\infty$ we have

$$c^\infty_{nk} \;=\; \sum_{s=1}^\infty\frac1s(A_{nks})^\infty \;=\; 1+\tfrac12+0+\dots \;=\; \tfrac32,$$

i.e., $c^d_{nk}$ decreases monotonically in $d$ from about 3.2 to $\tfrac32$. The practical implication of this observation, though, is limited, since $k=k_1^d\to\infty$ is actually not fixed for $d\to\infty$. Indeed, in practical high-dimensional problems, $k\ll n\ll 3^d$, but in our grid example $k=k_1^d\ge3^d$. Real data do not form full grids but sparse neighborhoods if $d$ is large.

11 Conclusion and Outlook

We introduced a new method, the Loss Rank Principle, for model selection. The loss rank of a model is defined as the number of other data that fit the model better than the training data. The model chosen by LoRP is the one with smallest loss rank. The loss rank has an explicit expression in the case of linear models. Model consistency and asymptotic efficiency of LoRP were considered. The numerical experiments suggest that LoRP works well in practice. A comparison between LoRP and other methods for model selection was also presented.

In this paper, we have only scratched the surface of LoRP. LoRP seems to be a promising principle with a lot of potential, leading to a rich field. In the following we briefly summarize miscellaneous considerations.

Comparison to Rademacher complexities. For a (binary) classification problem, the rank (1) of classifier $r$ can be reformulated as the probability that a randomly relabeled sample $y'$ behaves better than the actual $y$. The more flexible $r$ is, the larger its rank. The Rademacher complexity [Kol01, BBL02] of $r$ is the expectation of the difference between the misclassification loss under the actual $y$ and the misclassification loss under a randomly relabeled sample $y'$. The more flexible $r$ is, the larger its Rademacher complexity. There is therefore a close connection between LoRP and Rademacher complexities. Model selection based on Rademacher complexities has a number of attractive properties and has attracted many researchers, so this connection is worth exploring. Some results have already been obtained recently; to keep the present paper from getting too long, we will present them in a separate paper.

Monte Carlo estimates for non-linear LoRP. For non-linear regression we did not present an efficient algorithm for the loss rank/volume $\mathrm{LR}_r(y|x)$. The high-dimensional volume $|V_r(L)|$ (3) may be computed by Monte Carlo algorithms. Normally $V_r(L)$ constitutes a small part of $\mathcal Y^n$, and uniform sampling over $\mathcal Y^n$ is not feasible.
Instead, one should consider two competing regressors $r$ and $r'$ and compute $|V\cap V'|/|V|$ and $|V\cap V'|/|V'|$ by uniformly sampling from $V$ and $V'$ respectively, e.g., with a Metropolis-type algorithm. Taking the ratio we get $|V'|/|V|$ and hence the loss rank difference $\mathrm{LR}_r-\mathrm{LR}_{r'}$, which is sufficient for LoRP. The usual tricks and problems with sampling apply here too.

LoRP for hybrid model classes. LoRP is not restricted to model classes indexed by a single integral "complexity" parameter, but may be applied more generally to selecting among some (typically discrete) class of models/regressors. For instance, the class could contain kNN and polynomial regressors, and LoRP selects the complexity and type of regressor (non-parametric kNN versus parametric polynomials).

Generative versus discriminative LoRP. We have concentrated on counting $y$'s given fixed $x$, which corresponds to discriminative learning. LoRP might equally well be used for counting $(x,y)$, as alluded to in the introduction. This would correspond to generative learning. Both regimes are used in practice. See [LJ08] for some recent results on their relative benefit, and further references.

Acknowledgement. We would like to thank two anonymous reviewers for their detailed and helpful comments. The second author would like to thank SML@NICTA for supporting a visit which led to the present paper.

Appendix: List of Abbreviations and Notations

AIC = Akaike Information Criterion. BIC = Bayesian Information Criterion. BMS = Bayesian Model Selection. kNN = k Nearest Neighbors. LBFR = Linear Basis Function Regression. LoRP = Loss Rank Principle. LRC = Loss Rank Code. MAP = Maximum a Posteriori. MDL = Minimum Description Length. ML = Maximum Likelihood. PML = Penalized Maximum Likelihood.

$D=\{(x_1,y_1),\dots,(x_n,y_n)\}$ = observed data. $\mathcal D=\{D\}$ = set of all possible data $D$. $\mathcal X\times\mathcal Y$ = observation space. $x=(x_1,\dots,x_n)$ = vector of $x$-observations, similarly $y$. $f:\mathcal X\to\mathcal Y$ = functional dependence between $x$ and $y$. $\mathcal F$ = ("small") class of functions $f$. $\mathcal H$ = class of stochastic hypotheses/models. $r:\mathcal D\to\mathcal F$ = regressor/model. $\hat y_i=r(x_i|D)$ = $r$-estimate of $y_i$. $\mathcal R$ = ("small") class of regressors/models. $w\in\mathbb{R}^d$ = parametrization of $\mathcal F_d$. $N_k(x)$ = set of indices of the $k$ nearest neighbors of $x$ in $D$. $L=\mathrm{Loss}_r(D)=\mathrm{Loss}(y,\hat y)$ = empirical loss of $r$ on $D$. $\mathrm{Rank}_r(L)=\#\{y'\in\mathcal Y^n:\mathrm{Loss}_r(y'|x)\le L\}$ = loss rank of $r$. $V(L)$ = volume of $D$ under $r$. $\mathrm{LR}_r(y|x)$ = log rank/volume of $D$. $\mathrm{LR}^\alpha_r$ = regularized $\mathrm{LR}_r$. $d_{\mathrm{eff}}$ = effective dimension. $m_j(x,x)$ = coefficients of the linear regressor. $M(x)$ = linear regression or "hat" matrix. $\log$ = natural logarithm. $a\rightsquigarrow b$: $a$ is replaced by $b$.

References

[Aka73] H. Akaike. Information theory and an extension of the maximum likelihood principle. In Proc. 2nd International Symposium on Information Theory, pages 267-281, Budapest, Hungary, 1973. Akadémiai Kiadó.
[All74] D. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125-127, 1974.
[BBL02] P. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85-113, 2002.
[Cha06] A. Chambaz. Testing the order of a model. Annals of Statistics, 34(3):1166-1203, 2006.
[CW79] P. Craven and G. Wahba. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377-403, 1979.
[ET93] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, New York, 1993.
[Grü04] P. D. Grünwald. Tutorial on minimum description length. In Minimum Description Length: Recent Advances in Theory and Practice, Chapters 1 and 2. MIT Press, 2004. http://www.cwi.nl/~pdg/ftp/mdlintro.pdf.
[Her02] R. Herbrich. Learning Kernel Classifiers. The MIT Press, 2002.
[HT89] C. M. Hurvich and C. L. Tsai. Regression and time series model selection in small samples. Biometrika, 76(2):297-307, 1989.
[HTF01] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, 2001.
[Hut07] M. Hutter. The loss rank principle for model selection. In Proc. 20th Annual Conf. on Learning Theory (COLT'07), volume 4539 of LNAI, pages 589-603, San Diego, 2007. Springer, Berlin.
[Kol01] V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory, 47:1902-1914, 2001.
[LJ08] P. Liang and M. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In Proc. 25th International Conf. on Machine Learning (ICML-2008), volume 307, pages 584-591. ACM, 2008.
[Mac92] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.
[Mil02] A. Miller. Subset Selection in Regression. Chapman & Hall/CRC, 2002.
[Reu02] A. Reusken. Approximation of the determinant of large sparse symmetric positive definite matrices. SIAM Journal on Matrix Analysis and Applications, 23(3):799-818, 2002.
[Ris78] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465-471, 1978.
[Sch78] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461-464, 1978.
[Sha97] J. Shao. An asymptotic theory for linear model selection. Statistica Sinica, 7:221-264, 1997.
[Shi83] R. Shibata. Asymptotic mean efficiency of a selection of regression variables. Annals of the Institute of Statistical Mathematics, 35:415-423, 1983.
[WLT07] H. Wang, R. Li, and C. L. Tsai. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94(3):553-568, 2007.
[Yam99] K. Yamanishi. Extended stochastic complexity and minimax relative loss analysis. In Proc. 10th International Conference on Algorithmic Learning Theory (ALT'99), pages 26-38. Springer-Verlag, 1999.
[Yan05] Y. Yang. Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika, 92(4):937-950, 2005.
