Asymptotic optimality of a cross-validatory predictive approach to linear model selection

Abstract: In this article we study the asymptotic predictive optimality of a model selection criterion based on the cross-validatory predictive density, already available in the literature. For a dependent variable and associated explanatory variables, we consider a class of linear models as approximations to the true regression function. One selects a model among these using the criterion under study and predicts a future replicate of the dependent variable by an optimal predictor under the chosen model. We show that for squared error prediction loss, this scheme of prediction performs asymptotically as well as an oracle, where the oracle here refers to a model selection rule which minimizes this loss if the true regression were known.

Authors: Arijit Chakrabarti, Tapas Samanta

IMS Collections: Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, Vol. 3 (2008) 138–154. © Institute of Mathematical Statistics, 2008. DOI: 10.1214/074921708000000110

Affiliation: Applied Statistics Division, Indian Statistical Institute, 203 B. T. Road, Kolkata 700108, India; e-mail: arc@isical.ac.in, tapas@isical.ac.in

AMS 2000 subject classifications: Primary 62J05; secondary 62F15.

Keywords and phrases: cross-validation, oracle, predictive density.

Contents

1. Introduction
2. Basic results – case with σ² known
3. Case with σ² unknown
4. The "model true" case and consistency
5. Concluding remarks
Appendix
Acknowledgments
References

1. Introduction

The ultimate goal of modeling in any scientific or sociological investigation is to discover the underlying regular pattern or phenomenon, if any, which controls the data generating mechanism. Although it is almost impossible to imagine that a single model or a combination of a handful will fully capture the intricate functioning of nature or of sociological issues, one can always hope to come close. Given a choice of several models and a set of data, a popular method is to choose the model which explains or fits the given data best (in some well-defined sense). However, it is of prime importance that any chosen model be able to predict future observations from the same experiment or process reasonably well, and not merely fit the observed data. This is the purpose of predictive model selection.

One of the most prominent approaches to predictive model selection is cross-validation (see [17]) and variants thereof.
As the name cross-validation suggests, parameters of the population are estimated under each model using one part of the data (the "estimation set"), while the rest of the data (the "validation set") are predicted using the estimates based on the first group. This is done repeatedly with "validation sets" comprising different parts of the data; e.g., the whole data could simply be divided into 10 disjoint parts, each consisting of an equal number of observations and predicted using the rest. If, for a particular model, such predictions match best with the actually observed values, i.e., if the average prediction error is the smallest for it among all the candidate models, that model is selected. Optimality properties of classical cross-validatory techniques have been studied, e.g., in [12] and [16].
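To fix ideas, here is a minimal sketch of the classical scheme just described for linear models fitted by least squares. It assumes numpy; the candidate models are represented as lists of column indices of X, and the function names are ours, not from the original paper:

```python
import numpy as np

def cv_prediction_error(y, X, cols, n_folds=10, seed=0):
    """Average squared prediction error of the linear model using the
    columns `cols` of X, under disjoint V-fold cross-validation: each
    fold is predicted from a least squares fit on the remaining folds."""
    rng = np.random.default_rng(seed)          # fixed seed: same folds for every model
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    err = 0.0
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        Xt, Xv = X[np.ix_(train, cols)], X[np.ix_(fold, cols)]
        beta = np.linalg.lstsq(Xt, y[train], rcond=None)[0]
        err += np.sum((y[fold] - Xv @ beta) ** 2)
    return err / n

def select_model(y, X, candidates, n_folds=10):
    """Pick the candidate column subset with the smallest average prediction error."""
    return min(candidates, key=lambda cols: cv_prediction_error(y, X, cols, n_folds))
```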
In the Bayesian literature, several approaches to model selection have been studied with the predictive aspect in mind; see, e.g., [1, 4, 5, 8, 9, 10, 13, 14]. The purpose of this paper is to study the predictive properties of a model selection criterion (see (1.2) below) based on the average of the (log) cross-validatory predictive densities (see (1.1) below) and already available in the literature. Different types of averages (e.g., arithmetic mean, (log) geometric mean) of cross-validatory predictive densities have been studied by several authors ([2], [3], [5], [9] and [14]). Chakrabarti and Ghosh [5] considered an average with respect to disjoint validation sets and studied what the optimal proportion of the sample kept for validation should be in large samples, for the selection of a model closest to the true model (in terms of Kullback–Leibler divergence), and for the selection of the more parsimonious model if two models are equidistant from the truth. Using squared error prediction loss, we show that model selection using criterion (1.2) has an optimality property in predicting a future replicate of the dependent variable (for fixed values of the independent variables), when the true regression is being approximated by a class of candidate linear models. The proofs of the optimality results partly use some general techniques of Li [12] which were later adopted in [16].

In the Bayesian setup, the ordinary predictive density under a model is defined as the integral of the likelihood function of the observed data with respect to the prior distribution of the parameters under the model. Between two competing models, the one having a larger predictive density for the given data seems to be the more appropriate description of the unknown data generating process. In non-subjective Bayesian analysis, it is common to use noninformative priors for the parameters, which are typically improper and defined only up to unknown multiplicative constants. In such situations, use of the ordinary predictive density as a model selection criterion is inappropriate. To get around this difficulty, one updates the improper prior to a proper posterior based on part of the data (called the training sample) and then integrates the likelihood function of the rest of the data with respect to this posterior, thus giving the cross-validatory predictive density. This amounts to obtaining the predictive distribution of part of the data using information obtained from the rest of it.

This method of obtaining a cross-validatory predictive density can also be used when one puts a proper prior on the parameters of the model. The cross-validatory predictive density can then be used to obtain pseudo-Bayes factors, after appropriate averaging with respect to the different possible choices of the training sample. This line of thought owes its origin to Geisser [7] and Geisser and Eddy [8], and came to prominence through what are referred to as partial Bayes factors or intrinsic Bayes factors ([2], [3], [9], [11] and [15]).

In the next few paragraphs, we describe our setup and the model selection criterion we study. We follow the notation of Shao [16]. Let y_n = (y_1, . . . , y_n)′ be a vector of observations on the dependent (response) variable and let X_n = (x′_1, . . . , x′_n)′ be an n × p_n matrix of explanatory variables (which are potentially responsible for the variability in the y's), with x_i associated with y_i. Let μ_n denote E(y_n | X_n), the (unknown) average value of the response variable given the values of the explanatory variables. We further assume that, given X_n, e_n = y_n − μ_n has mean vector 0 and the components of e_n are independent with common variance σ², which could be known or unknown.

We are interested in capturing the functional relationship, if any, between μ_n and X_n which will be most suitable for predictive purposes. We restrict our search to a class of normal linear models. Our model space, denoted A_n, is indexed by α, where each α consists of a subset of size p_n(α) (1 ≤ p_n(α) ≤ p_n) of {1, 2, . . . , p_n}, and the true mean μ_n is assumed to be linearly related to the corresponding explanatory variables. More specifically, under model α ∈ A_n,

    y_n ∼ N( μ_n(α) = X_n(α)β_n(α), σ²I_n ),

where X_n(α) is the submatrix of X_n consisting of the p_n(α) columns specified by α and β_n(α) ∈ ℜ^{p_n(α)}. A Bayesian puts a prior on the unknown parameters within each model. We consider standard non-subjective priors (see, e.g., [1]) given by π_α(β_n(α)) ∝ 1 if σ² is known, and π_α(β_n(α), σ²) ∝ 1/σ² if σ² is unknown.

Consider, for example, the case with σ² unknown. Let π_α((β_n(α), σ²) | y_{k+1}, . . . , y_n) denote the posterior distribution of the parameters under the model given the observations (y_{k+1}, . . . , y_n). The cross-validatory predictive density of (y_1, . . . , y_k) given (y_{k+1}, . . . , y_n) under model α, denoted f_α(y_1, . . . , y_k | y_{k+1}, . . . , y_n), is given by

(1.1)    ∫ f_{β_n(α),σ²}(y_1, . . . , y_k) π_α((β_n(α), σ²) | y_{k+1}, . . . , y_n) dβ_n(α) dσ²,

where f_{β_n(α),σ²}(y_1, . . . , y_k) denotes the density of the k-dimensional normal vector, with mean vector given by the first k components of μ_n(α) and variance–covariance matrix σ²I_k, evaluated at (y_1, . . . , y_k). The predictive density of any subset (y_{t_1}, . . . , y_{t_k}) of y, given the rest of the components of y, can be calculated similarly under this model, where (t_1, . . . , t_k) denotes a subset of (1, . . . , n).
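For the simpler case with σ² known and the flat prior π_α(β_n(α)) ∝ 1, the integral (1.1) has a well-known closed form: the posterior of β_n(α) given the training block is normal, centered at the least squares estimate, so the validation block has a multivariate normal predictive density. The sketch below evaluates its logarithm; it assumes numpy and scipy, and the function name is ours:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_cv_predictive_density(y, X_alpha, valid_idx, sigma2):
    """log f_α(y_valid | y_train) for known σ² under the flat prior:
    the posterior of β(α) given the training block is
    N(β̂_i, σ²(X_i'X_i)^{-1}), so the validation block is predicted by a
    N(X_v β̂_i, σ²(I + X_v (X_i'X_i)^{-1} X_v')) density."""
    n = len(y)
    train_idx = np.setdiff1d(np.arange(n), valid_idx)
    Xi, Xv = X_alpha[train_idx], X_alpha[valid_idx]
    G = np.linalg.inv(Xi.T @ Xi)               # (X_i'X_i)^{-1}
    beta_hat = G @ Xi.T @ y[train_idx]         # posterior mean = LS estimate
    mean = Xv @ beta_hat
    cov = sigma2 * (np.eye(len(valid_idx)) + Xv @ G @ Xv.T)
    return multivariate_normal.logpdf(y[valid_idx], mean=mean, cov=cov)
```

Averaging such log densities over the choices of the training sample yields the criterion defined next.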
Since a good criterion should not depend too much on the choice of the training sample, we consider the geometric mean of the cross-validatory predictive densities obtained by varying the choice of the training sample. The ratio of such geometric means for two models is precisely the geometric intrinsic Bayes factor ([2], [3]). For model α, the criterion which we intend to study equals the logarithm of this geometric mean. Thus, if we consider a total of r training samples, this logarithm is given by

(1.2)    CV(α) = (1/r) Σ_{i=1}^r log f_α( y_{t_{1i}}, . . . , y_{t_{ki}} | {y_t : t ∉ (t_{1i}, . . . , t_{ki})} ),

where (y_{t_{1i}}, . . . , y_{t_{ki}}) is the set of y observations not included in the i-th training sample. One selects the model α̂_n ∈ A_n which maximizes CV(α).

Once a model is thus selected, we use the mean of the predictive distribution of y_n^new, given the observed y_n under the selected model, as the predictor for a future replicate y_n^new of the response variable at the same value X_n of the explanatory variables. An easy calculation shows that this turns out to be the least squares fit X_n(α̂_n)β̂_n(α̂_n) = P_n(α̂_n)y_n, where β̂_n(α) = (X_n(α)′X_n(α))^{−1}X_n(α)′y_n and P_n(α) = X_n(α)(X_n(α)′X_n(α))^{−1}X_n(α)′ is the usual projection matrix.
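The "easy calculation" referred to above is a standard conjugate-normal computation, which we sketch here for the case with σ² known (the case with σ² unknown yields the same posterior mean). Under the flat prior, the posterior of β_n(α) given y_n under model α is

    β_n(α) | y_n ∼ N( β̂_n(α), σ²(X_n(α)′X_n(α))^{−1} ),

so the mean of the predictive distribution of the future replicate under model α is

    E(y_n^new | y_n) = X_n(α) E(β_n(α) | y_n) = X_n(α)β̂_n(α) = P_n(α)y_n,

which is exactly the least squares fit under model α.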
Our goal is to evaluate this prediction scheme under the true regression using squared error prediction loss. Under the true μ_n, the future replicate y_n^new is independent of the original observations y_n. The quality of any predictor δ(y_n) of y_n^new based on y_n can be evaluated by the average prediction error E_{μ_n}( (1/n) ||y_n^new − δ(y_n)||² ), where E_{μ_n} denotes expectation with respect to the joint distribution of (y_n^new, y_n) when μ_n is the true unknown mean. This expectation is small if, for any fixed y_n, E_{μ_n}( (1/n) ||y_n^new − δ(y_n)||² | y_n ) is also small. As observed before, the predictor δ(y_n) we want to evaluate is the least squares predictive estimate of y_n^new under the chosen model α̂_n. Note that for any given fixed model α, the least squares predictive estimate is δ(y_n) = δ(y_n)(α) = μ̂_n(α) = X_n(α)β̂_n(α). Simple algebra shows that the above conditional expectation is, up to a constant which does not depend on α, equal to

(1.3)    L_n(α) = ||μ_n − μ̂_n(α)||² / n.

Hence the conditional expectation is minimized for a certain α if L_n(α) is minimized. If we knew the true μ_n, we could find the model which minimizes this L_n(α) for each y_n. We shall call this the oracle model, denoted α_{L_n}. The best any procedure can achieve is to do as well as the oracle in the limit, in terms of the loss, as the sample size grows to infinity. We show in the following sections of this article that under certain conditions, minimizing CV(α) with respect to α is asymptotically equivalent to minimizing L_n(α). Using this fact, it is shown that the ratio of L_n(α_{L_n}) to L_n(α̂_n) tends to 1 in probability, thereby establishing the optimal asymptotic behavior of criterion (1.2) in the problem of predicting a set of future observations.

In Sections 2 and 3 we consider the case where the true model is not in the model space – the proposed models are only approximations to the truth. In Section 2 we consider the case when σ² is known. We show that under certain assumptions, the model selection procedure under study performs as well as the oracle asymptotically, in the sense that the ratio of their losses tends to one in probability. In Section 3 we consider the more realistic situation when σ² is unknown. Under appropriate conditions it is shown that this procedure also achieves the oracle asymptotically in this case. As a validation of this method, we next consider in Section 4 the question of whether, under the assumption that the true model is indeed included in the model space, we do equally well in terms of attaining the oracle loss asymptotically. It is shown that this model selection procedure chooses the correct model with smallest dimension with probability tending to one, in addition to being asymptotically optimal in terms of attaining the oracle. Some concluding remarks are made in Section 5. Technical proofs of most of the results are given in the Appendix.

For notational simplicity we write y, μ, e, X(α), β(α) and P(α) in place of y_n, μ_n, e_n, X_n(α), β_n(α) and P_n(α) respectively, dropping the suffix n for the rest of the paper.

2. Basic results – case with σ² known

In this section we take the "model false" point of view: the models are only approximations to the truth and none of them is actually true. We show that under certain conditions, the model selection procedure under study is asymptotically optimal in the sense of performing as well as the oracle defined above.

As described in the Introduction, the model selection criterion under consideration is an average of the cross-validatory predictive density f_α(y_1, . . . , y_k | y_{k+1}, . . . , y_n) under model α over suitable choices of the "training sample" {y_{k+1}, . . . , y_n}. We do not recommend any particular choice of the training samples here; our results hold as long as each y_i, 1 ≤ i ≤ n, appears in the same number of the chosen training samples (which will be assumed throughout the paper).

Let y^i, i = 1, . . . , r, be the r training samples (each of size n − k). For each y^i, let μ_i and e_i be the subvectors of μ and e corresponding to the labels of the components of y^i, and let X_i(α) be the submatrix of X(α) consisting of the corresponding rows. Also, let β̂_i(α) = [X′_i(α)X_i(α)]^{−1}X′_i(α)y^i and P_i(α) = X_i(α)[X′_i(α)X_i(α)]^{−1}X′_i(α), i = 1, . . . , r. It will be assumed throughout that (n − k) → ∞ and that X′_i(α)X_i(α) is nonsingular for each i and α. With the standard non-subjective prior π(β(α)) = constant, we have a closed form expression for the cross-validatory predictive density. An alternative, equivalent criterion, which is to be minimized with respect to α, is

(2.1)    Γ(α) = (1/n)(y − X(α)β̂(α))′(y − X(α)β̂(α)) − (1/r) Σ_{i=1}^r (1/n)(y^i − X_i(α)β̂_i(α))′(y^i − X_i(α)β̂_i(α)) + (1/r) Σ_{i=1}^r (σ²/n) log( |X′(α)X(α)| / |X′_i(α)X_i(α)| ).

Note that Γ(α) is equal to the negative of criterion (1.2) up to an additive constant.
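The criterion (2.1) is directly computable from residual sums of squares and log determinants. A minimal sketch follows, assuming numpy; the function name and the representation of training samples as index arrays are ours:

```python
import numpy as np

def Gamma_known_sigma2(y, X_alpha, train_sets, sigma2):
    """A sketch of criterion (2.1) for one candidate model: the full-data
    residual mean square, minus the average training-sample residual mean
    square, plus the average log-determinant term. `train_sets` is a list
    of r index arrays, each of size n - k."""
    n = len(y)

    def rss_and_logdet(idx):
        Xi = X_alpha[idx]
        beta = np.linalg.lstsq(Xi, y[idx], rcond=None)[0]
        resid = y[idx] - Xi @ beta
        return resid @ resid, np.linalg.slogdet(Xi.T @ Xi)[1]

    rss_full, logdet_full = rss_and_logdet(np.arange(n))
    stats = [rss_and_logdet(tr) for tr in train_sets]
    avg_rss = np.mean([s[0] for s in stats])
    avg_logdet_ratio = np.mean([logdet_full - s[1] for s in stats])
    return rss_full / n - avg_rss / n + sigma2 / n * avg_logdet_ratio
```

Minimizing this quantity over the candidate models α is equivalent to maximizing CV(α) of (1.2).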
We will prove that minimization of Γ(α) is equivalent to minimization of the loss L_n(α) (defined in (1.3)) in an appropriate asymptotic sense, and this will lead to the desired asymptotic (predictive) optimality of the criterion under consideration. Note that the loss L_n(α) defined in (1.3) can be written as

    nL_n(α) = nΔ_n(α) + e′P(α)e,  where  nΔ_n(α) = μ′(I − P(α))μ,

and let

    nR_n(α) = E(nL_n(α)) = nΔ_n(α) + σ²p_n(α).

One of the key assumptions under which we prove our results is the following condition ([12], [16]):

(2.2)    Σ_{α∈A_n} 1/[nR_n(α)]^m → 0  for some positive integer m for which E(e_1^{4m}) < ∞.

We also assume

(2.3)    p_n λ_n / min_{α∈A_n} nR_n(α) → 0,  where λ_n = log( n/(n − k) ).

For certain remarks justifying these assumptions, see [12] and [16]. In particular, it is argued in these papers, using several concrete examples, that condition (2.2) is a natural one when the dimension p_n of the largest model grows with sample size. Also, if p_n remains bounded, nR_n(α) is expected to go to ∞ for all α as the sample size increases, provided the candidate models are separated from the truth. That min_α nR_n(α) → ∞ is assumption A.3′ of Li [12] and, as remarked therein, it is a quite reasonable assumption if p_n grows with n. Condition (2.3) requires that min_α nR_n(α) → ∞ at a suitable rate. Under condition (3.3) below ([16], condition (2.5)), (2.3) holds if (p_n λ_n)/n → 0. It is important to note that we also need to assume (n − k)/n → 0 to prove our results (see, e.g., (6.10)). This addresses an important question about the required size of the training sample. We do not, however, claim that it is a necessary condition for asymptotic predictive optimality.

We now consider the criterion Γ(α) as defined in (2.1). Since X(α)β̂(α) = P(α)y,

(2.4)    (1/n)(y − X(α)β̂(α))′(y − X(α)β̂(α)) = (1/n) y′(I − P(α))y = (1/n) e′e + L_n(α) − (2/n) e′P(α)e + (2/n) e′(I − P(α))μ.

Similarly,

(2.5)    (1/r) Σ_{i=1}^r (1/n)(y^i − X_i(α)β̂_i(α))′(y^i − X_i(α)β̂_i(α)) = ((n − k)/n²) e′e + (1/(nr)) Σ_{i=1}^r μ′_i(I − P_i(α))μ_i − (1/(nr)) Σ_{i=1}^r e′_i P_i(α) e_i + (2/(nr)) Σ_{i=1}^r e′_i(I − P_i(α))μ_i.

We first state two auxiliary results.

Lemma 2.1. Under conditions (2.2) and (2.3),

    (1/n)(y − X(α)β̂(α))′(y − X(α)β̂(α)) = (1/n) e′e + L_n(α) + o_p(L_n(α))

uniformly in α ∈ A_n. By saying Z_n(α) = o_p(L_n(α)) uniformly in α, we mean max_α |Z_n(α)|/L_n(α) →_p 0.

Lemma 2.2. Suppose that conditions (2.2) and (2.3) hold and (n − k)/n → 0. Then

    (1/r) Σ_{i=1}^r (1/n)(y^i − X_i(α)β̂_i(α))′(y^i − X_i(α)β̂_i(α)) = ((n − k)/n²) e′e + o_p(L_n(α)),

uniformly in α ∈ A_n.

Proofs of Lemma 2.1 and Lemma 2.2 are given in the Appendix. In order to prove the main result of this section, we need to assume another condition, given below. Let

(2.6)    a_{in}(α) = log( (n − k)^{p_n(α)} |X′(α)X(α)| / ( n^{p_n(α)} |X′_i(α)X_i(α)| ) ).

We assume

(2.7)    max_{α∈A_n} [ (1/r) Σ_{i=1}^r a_{in}(α) ] / [ nR_n(α) ] → 0.

Remark 2.1. Let x′_1(α), . . . , x′_n(α) be the n rows of X(α). If these n rows are "similar", e.g., if they can be thought of as (independent) realizations of a random vector x, and p_n is small compared to both n − k and n, then

    | X′(α)X(α)/n | = | (1/n) Σ_{j=1}^n x_j(α)x′_j(α) | ≈ | E(xx′) |,

and similarly | X′_i(α)X_i(α)/(n − k) | ≈ | E(xx′) |. In this case it follows that a_{in}(α) ≈ 0, and in such a situation assumption (2.7) seems quite reasonable, as illustrated in the sketch below.
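Remark 2.1 can be checked numerically: with i.i.d. rows, the two scaled information matrices estimate the same E(xx′), so a_{in}(α) is close to zero. A small illustration, assuming numpy; the helper name is ours:

```python
import numpy as np

def a_in(X_alpha, train_idx):
    """a_{in}(α) of (2.6), comparing the scaled information matrices of
    the full sample and of training sample i."""
    n, p = X_alpha.shape
    Xi = X_alpha[train_idx]
    m = len(train_idx)                              # m = n - k
    logdet_full = np.linalg.slogdet(X_alpha.T @ X_alpha)[1]
    logdet_i = np.linalg.slogdet(Xi.T @ Xi)[1]
    return p * np.log(m) + logdet_full - (p * np.log(n) + logdet_i)

# With i.i.d. rows, (1/n)X'X and (1/(n-k))X_i'X_i both approximate E(xx'),
# so a_in(α) should be near zero, in line with Remark 2.1:
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
print(a_in(X, np.arange(900)))                      # close to 0
```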
Now note that (2.3) and (2.7) imply that the third term on the right hand side of (2.1) is also of the order o_p(L_n(α)) uniformly in α ∈ A_n. Thus

    Γ(α) = constant + L_n(α) + o_p(L_n(α))  uniformly in α ∈ A_n,

which implies that minimization of Γ(α) is essentially equivalent to minimization of L_n(α) in an appropriate asymptotic sense, and we have the following result.

Theorem 2.1. Suppose that conditions (2.2), (2.3) and (2.7) hold and (n − k)/n → 0. Then we have the following results.

(a) Γ(α) = (k/n²) e′e + L_n(α) + o_p(L_n(α)) uniformly in α ∈ A_n.

(b) The model selection rule under study is asymptotically optimal in the sense that

    L_n(α̂_n) / min_{α∈A_n} L_n(α) →_p 1,

where α̂_n is as defined in Section 1.

The proof of Theorem 2.1 is given in the Appendix.

3. Case with σ² unknown

We now consider the more realistic situation when the variance σ² is unknown. The standard non-subjective prior in this case is π(β(α), σ²) ∝ 1/σ² under model α. Interestingly, the results in this case follow from the basic results obtained in Section 2. We consider here the ("model false") setup and assumptions of Section 2.

Let y^i, i = 1, . . . , r, be the r training samples chosen. The cross-validatory predictive density under model α for a training sample y^i is given, up to a multiplicative constant, by

    ( |X′_i(α)X_i(α)| / |X′(α)X(α)| )^{1/2} × [ (y^i − X_i(α)β̂_i(α))′(y^i − X_i(α)β̂_i(α)) ]^{(n−k)/2} / [ (y − X(α)β̂(α))′(y − X(α)β̂(α)) ]^{n/2}.

Our criterion (to be minimized with respect to α), which is an average over the r training samples, is given by

(3.1)    Γ(α) = log S(α) − ((n − k)/(nr)) Σ_{i=1}^r log S_i(α) + (1/(nr)) Σ_{i=1}^r log( |X′(α)X(α)| / |X′_i(α)X_i(α)| ),

where S(α) = (y − X(α)β̂(α))′(y − X(α)β̂(α)) and S_i(α) = (y^i − X_i(α)β̂_i(α))′(y^i − X_i(α)β̂_i(α)).
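Like (2.1), the criterion (3.1) needs only residual sums of squares and log determinants; no integration is required. A sketch under the same conventions as before (numpy; names ours):

```python
import numpy as np

def Gamma_unknown_sigma2(y, X_alpha, train_sets):
    """Criterion (3.1): log S(α) − ((n−k)/(nr)) Σ_i log S_i(α)
    + (1/(nr)) Σ_i log(|X'(α)X(α)| / |X_i'(α)X_i(α)|)."""
    n, r = len(y), len(train_sets)
    n_minus_k = len(train_sets[0])      # all training samples have size n − k

    def rss(idx):
        Xi = X_alpha[idx]
        beta = np.linalg.lstsq(Xi, y[idx], rcond=None)[0]
        resid = y[idx] - Xi @ beta
        return resid @ resid

    def logdet(idx):
        return np.linalg.slogdet(X_alpha[idx].T @ X_alpha[idx])[1]

    full = np.arange(n)
    return (np.log(rss(full))
            - n_minus_k / (n * r) * sum(np.log(rss(tr)) for tr in train_sets)
            + 1.0 / (n * r) * sum(logdet(full) - logdet(tr) for tr in train_sets))
```

One selects the model α̂_n minimizing this quantity.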
Note that Γ(α) = (k/n) log(nσ²) + Γ_1(α), where

(3.2)    Γ_1(α) = log( S(α)/(nσ²) ) − ((n − k)/(nr)) Σ_{i=1}^r log( S_i(α)/(nσ²) ) + (1/(nr)) Σ_{i=1}^r a_{in}(α) + (1/n) p_n(α) λ_n,

a_{in}(α) is as defined in (2.6) and λ_n = log( n/(n − k) ). Therefore, minimizing Γ(α) with respect to α is equivalent to minimizing Γ_1(α), for every σ. Let

    u_n(α) = log( e′e/(nσ²) + L_n(α)/σ² ).

In order to prove the asymptotic optimality of this method, we first note in Lemma 3.1 below that Γ_1(α) is asymptotically equivalent to u_n(α), and this in turn implies the desired conclusion as stated in Theorem 3.1. We prove these results by invoking certain conditions which we describe below. We first make the following assumption (see [16], condition (2.5)):

(3.3)    lim inf_{n→∞} min_α Δ_n(α) > 0,

where Δ_n(α) is as defined in Section 2. This may be thought of as an identifiability condition on the models in the model space, as appears in the discussion of Mervyn Stone on [16]. We further assume that

(3.4)    ((n − k)/n) log n → 0,  p_n λ_n / n → 0,  and (1/n) Σ_{i=1}^n μ_i² is bounded,

(3.5)    (1/(nr)) Σ_{i=1}^r a_{in}(α) → 0,

and

(3.6)    Σ_{i=1}^r log(S_i) > 0 with probability tending to 1,

where S_i is equal to S_i(α) with α the full model, i.e., α = {1, . . . , p_n}. One can give sufficient conditions for (3.6) based on the relative magnitudes of r and (n − k) as n → ∞, to the effect that r is not too large compared with n − k, which is the case for most practically implementable schemes. We do not, however, record the details here. The final results of this section are now stated below.

Lemma 3.1. Under conditions (3.3)–(3.6),

(3.7)    Γ_1(α) = u_n(α) + o_p(u_n(α))  uniformly in α.

Theorem 3.1. Under conditions (3.3)–(3.6),

(3.8)    L_n(α̂_n) / L_n(α_{L_n}) →_p 1.

Both Lemma 3.1 and Theorem 3.1 are proved in the Appendix.

4. The "model true" case and consistency

We now show that if some model in the model space is true, the model selection procedure under study chooses the correct model of the smallest dimension, in addition to being asymptotically optimal. Thus this procedure not only captures the truth but is at the same time as parsimonious as possible. Although the assumption of a true model may not seem very realistic, our result in this section provides a validation of the method. We consider, however, only the simpler case when σ² is known.

As in [16], let A_n^c ⊂ A_n denote the set of all proposed models that are actually correct. Thus for α ∈ A_n^c, μ = X(α)β(α) for some β(α) ∈ ℜ^{p_n(α)}. In Section 2 we assumed that A_n^c is empty. It is important to note that all the results of Section 2 hold with A_n replaced by A_n − A_n^c, under the corresponding assumptions with A_n replaced by A_n − A_n^c. In particular, if

(4.1)    Σ_{α∈A_n−A_n^c} 1/[nR_n(α)]^m → 0  for some positive integer m for which E(e_1^{4m}) < ∞

and

(4.2)    p_n λ_n / min_{α∈A_n−A_n^c} nR_n(α) → 0,  with λ_n = log( n/(n − k) ),

then

(4.3)    Γ(α) = (k/n²) e′e + L_n(α) + o_p(L_n(α))  uniformly in α ∈ A_n − A_n^c.

For α ∈ A_n^c, (I − P(α))μ = 0 and (I − P_i(α))μ_i = 0 for all i. Therefore, from (2.1), (2.4) and (2.5) we have, for α ∈ A_n^c,

(4.4)    Γ(α) = (k/n²) e′e − (1/n) e′P(α)e + (1/(nr)) Σ_{i=1}^r e′_i P_i(α) e_i + (σ²/(nr)) Σ_{i=1}^r log( |X′(α)X(α)| / |X′_i(α)X_i(α)| ).

Also, L_n(α) = (1/n) e′P(α)e for α ∈ A_n^c. We now assume that

(4.5)    lim sup_{n→∞} Σ_{α∈A_n^c} 1/[p_n(α)]^m < ∞

for some positive integer m such that E(e_1^{4m}) < ∞ (condition (3.10) of Shao [16]), and

(4.6)    max_{α∈A_n^c} [ (1/r) Σ_{i=1}^r a_{in}(α) ] / [ p_n(α) λ_n ] → 0,

with λ_n = log( n/(n − k) ) and a_{in}(α) as defined in (2.6). See Remark 2.1 in this context. Let α_n^c be the model α in A_n^c with the smallest dimension. Using the above, we now have:

Proposition 4.1. Under conditions (4.1), (4.2), (4.5) and (4.6),

(4.7)    Γ(α) = (k/n²) e′e + (1/n) λ_n σ² p_n(α) + o_p( (1/n) λ_n σ² p_n(α) )  uniformly in α ∈ A_n^c,

and

(4.8)    Γ(α_n^c) = (k/n²) e′e + o_p(L_n(α))  uniformly in α ∈ A_n − A_n^c.

The proof of Proposition 4.1 is given in the Appendix.
Keeping in mind the above facts, we now proceed towards proving that this model selection rule chooses the most parsimonious correct model, as claimed in Theorem 4.1 below. Towards this we first observe that (4.3) and (4.8) imply

    max_{α∈A_n−A_n^c} ( Γ(α_n^c) − (k/n²)e′e ) / ( Γ(α) − (k/n²)e′e ) < 1

with probability tending to 1. It then follows that

(4.9)    P[ Γ(α_n^c) ≤ Γ(α) ∀ α ∈ A_n − A_n^c ] → 1.

We now look for conditions under which

(4.10)    P[ Γ(α_n^c) ≤ Γ(α) ∀ α ∈ A_n^c ] → 1.

Let Z_n(α) = n[Γ(α) − Γ(α_n^c)]. It is enough to show that

(4.11)    P[ Z_n(α) ≥ 0 ∀ α ∈ A_n^c ] → 1.

Now,

(4.12)    P[ Z_n(α) < 0 for some α ∈ A_n^c ] ≤ Σ_{α∈A_n^c} P[ Z_n(α) < 0 ] ≤ Σ_{α∈A_n^c} P[ |Z_n(α) − E(Z_n(α))| > E(Z_n(α)) ] ≤ Σ_{α∈A_n^c} E|Z_n(α) − E(Z_n(α))|^{2m} / [E(Z_n(α))]^{2m}.

From (4.4),

(4.13)    Z_n(α) − E(Z_n(α)) = (1/r) Σ_{i=1}^r e′_i [P_i(α) − P_i(α_n^c)] e_i − e′[P(α) − P(α_n^c)]e,

and E(Z_n(α)) can be written as

    (1/σ²) E(Z_n(α)) = [p_n(α) − p_n(α_n^c)] λ_n + (1/r) Σ_{i=1}^r [a_{in}(α) − a_{in}(α_n^c)],

where a_{in}(α) is as defined in (2.6). If we assume

(4.14)    (1/r) Σ_{i=1}^r [a_{in}(α) − a_{in}(α_n^c)] = o_p( [p_n(α) − p_n(α_n^c)] λ_n )  uniformly in α ∈ A_n^c,

then

(4.15)    (1/σ²) E(Z_n(α)) = [p_n(α) − p_n(α_n^c)] λ_n + o_p( [p_n(α) − p_n(α_n^c)] λ_n )  uniformly in α ∈ A_n^c.

Noting that P(α) − P(α_n^c) and P_i(α) − P_i(α_n^c) are projection matrices, that the first term on the right hand side of (4.13) can be expressed as e′Me for some matrix M, and using Theorem 2 of Whittle [18] or inequality (6.2) of the Appendix, we have

    E|Z_n(α) − E(Z_n(α))|^{2m} ≤ constant × [p_n(α) − p_n(α_n^c)]^m.

It then follows from (4.12) and (4.15) that (4.11) holds if

(4.16)    Σ_{α∈A_n^c} 1 / ( λ_n^{2m} [p_n(α) − p_n(α_n^c)]^m ) → 0.

Thus we finally have the following.

Theorem 4.1. Under conditions (4.1), (4.2), (4.5), (4.6), (4.14) and (4.16),

(4.17)    P[ α̂_n = α_n^c ] → 1.

It is proved in the Appendix that under (4.1) and (4.2),

(4.18)    max_{α∈A_n−A_n^c} L_n(α_n^c)/L_n(α) →_p 0.

Since L_n(α_n^c) ≤ L_n(α) for all α ∈ A_n^c, Theorem 4.1 and (4.18) imply the following.

Theorem 4.2. Under the conditions of Theorem 4.1,

(4.19)    L_n(α̂_n) / L_n(α_{L_n}) →_p 1.

5. Concluding remarks

In this article we have studied the predictive optimality of a cross-validatory Bayesian approach to model selection in the context of selecting from among a set of linear models. It has been shown that this method predicts as well as the oracle as the sample size grows. In addition, it has been shown that in case the space of candidate models contains at least one correct model, this method chooses the correct model with the smallest dimension with probability tending to one as the sample size grows. Thus the method has two important facets – it is an optimal predictor, and it is a selection criterion which does not unnecessarily choose a complex model when simpler ones are apt.

Needless to say, this article has not addressed some interesting related issues. First, it would be interesting to see how this method works when applied in the setup of generalized linear models, through theoretical investigation and simulation.
Another focus of recent research is the case where the number of potential parameters in the models is very large, e.g., of the same order as the number of observations. Asymptotic optimality studies in such a setup, even for normal linear models, will be a really challenging task. Also, we have not touched upon the computational aspect of this method, which becomes important if the number of potential regressors and the number of models in the model space get large. We emphasize, however, that one rarely considers the set of all 2^p possible models if p regressors are available. For example, one can use expert knowledge about the problem under study and start with a pruned list of models, or one can take a nested sequence of models (thereby restricting the total number of models to at most p). Li ([12], Example 1) considered a situation where the p regressors are arranged in decreasing order of importance. He then considered p models, the α-th model consisting of the first α regressors in this ordered arrangement. See in this context Examples 1 and 2 of [16], where the number of models under consideration is fixed although the number of parameters may grow with sample size. Last but not least, as we commented before, the requirement that k/n → 1 is only a sufficient condition; a careful study of the necessity of this condition is in order. In some examples we have observed that k/n → c for any c ∈ (0, 1) is also sufficient to achieve optimality results similar to the ones we have obtained in this paper. Some theoretical investigations and simulation studies will hopefully prove conclusive in finding the optimal k. It is worth mentioning that in a related problem Chakrabarti and Ghosh [5] made interesting observations regarding this issue, which can be a starting point for such an investigation.

Appendix

We present in this section proofs of some of the results of the earlier sections. We will need bounds for the moments of linear and quadratic forms in e. Let A = (a_{ij}) be a non-random n × n matrix and b a non-random n-vector. Then by Theorem 2 of Whittle [18],

(6.1)    E( |e′b|^{2m} ) ≤ C_1 ( ||b||² )^m,

and

(6.2)    E| e′Ae − E(e′Ae) |^{2m} ≤ C_2 ( Σ_i Σ_j a_{ij}² )^m

for some constants C_1, C_2 > 0 and for any positive integer m for which E(e_1^{4m}) < ∞. Below, max_α will mean maximum over α ∈ A_n.

Proof of Lemma 2.1. As shown in Li ([12], p. 970), using Theorem 2 of Whittle [18] or inequalities (6.1) and (6.2) stated above, and condition (2.2), we have

(6.3)    max_α | e′P(α)e − σ²p_n(α) | / ( nR_n(α) ) →_p 0,

and

(6.4)    max_α | e′(I − P(α))μ | / ( nR_n(α) ) →_p 0.

Also, from (6.3),

(6.5)    max_α | L_n(α)/R_n(α) − 1 | = max_α | e′P(α)e − σ²p_n(α) | / ( nR_n(α) ) →_p 0.

Lemma 2.1 now follows from (2.3), (2.4), (6.3), (6.4) and (6.5).

Proof of Lemma 2.2. Let

    T_1 = (1/r) Σ_{i=1}^r μ′_i(I − P_i(α))μ_i,  T_2 = (1/r) Σ_{i=1}^r e′_i P_i(α) e_i,  T_3 = (1/r) Σ_{i=1}^r e′_i(I − P_i(α))μ_i.

Then, in view of (2.5), the left hand side of the equality claimed in Lemma 2.2 can be written as

    ((n − k)/n²) e′e + (1/n)(T_1 − T_2 + 2T_3).

We shall prove that

(6.6)    T_j/n = o_p(L_n(α))  uniformly in α,  for j = 1, 2, 3.

We fix a training sample y^1 = (y_1, y_2, . . . , y_{n−k})′.
Let X(α) = (X′_1, X′_{1c})′ and I − P(α) = (A′, B′)′, where X_1 and X_{1c} are the submatrices consisting of the first n − k rows and the last k rows of X(α), respectively, and A and B are the analogous submatrices of I − P(α). Then

(6.7)    μ′(I − P(α))μ = μ′A′Aμ + μ′B′Bμ,

and

(6.8)    μ′(I − P(α))μ − μ′_1(I − P_1(α))μ_1 = μ′B′(I − P_c)^{−1}Bμ,

where P_c = X_{1c}(X′(α)X(α))^{−1}X′_{1c} (see, e.g., Result (5.4) of Chatterjee and Hadi [6], p. 189). One can now check that (I − P_c)^{−1} = I + X_{1c}(X′_1X_1)^{−1}X′_{1c}, and

(6.9)    μ′B′(I − P_c)^{−1}Bμ − μ′B′Bμ = μ′B′X_{1c}(X′_1X_1)^{−1}X′_{1c}Bμ ≥ 0,

as (X′_1X_1)^{−1} is positive definite. From (6.7)–(6.9),

    μ′_1(I − P_1(α))μ_1 / ( nL_n(α) ) ≤ μ′_1(I − P_1(α))μ_1 / ( μ′(I − P(α))μ ) ≤ ||Aμ||² / ( ||Aμ||² + ||Bμ||² ).

We now consider the average over the r training samples. Since each y_i (1 ≤ i ≤ n) appears in the same number of training samples, we have

(6.10)    T_1 / ( nL_n(α) ) ≤ [ (1/r) Σ_{i=1}^r μ′_i(I − P_i(α))μ_i ] / ( μ′(I − P(α))μ ) ≤ (n − k)/n,

which converges to zero.

To prove (6.6) for j = 2, we note that T_2 can be expressed as e′M(α)e for some matrix M(α) = (m_{ij}), which is a sum of r matrices corresponding to the r choices of the n − k indices from {1, 2, . . . , n} (the n − k rows of X(α)). For example, for the training sample y^1 = (y_1, . . . , y_{n−k})′, e′_1 P_1(α) e_1 may be written as e′M_1(α)e, where

    M_1(α) = ( P_1(α)  0 ; 0  0 ),  thus  M(α) = (1/r) Σ_{i=1}^r M_i(α).

As the P_i(α) are all idempotent matrices, one can show that Σ_i Σ_j m_{ij}² ≤ p_n(α). Then, proceeding as in the proof of (6.3) given in Li ([12], p. 970), one can prove the result using (6.2), (2.2), (2.3) and (6.5). Indeed, by (6.2),

    P[ max_α | e′M(α)e − σ²p_n(α) | / ( nR_n(α) ) > ε ] ≤ C Σ_α [p_n(α)]^m / ( ε^{2m} [nR_n(α)]^{2m} )

for some constant C > 0. The result follows from (2.2), (2.3) and (6.5).

The proof of (6.6) for j = 3 is similar. We note that T_3 = e′b with b = (1/r) Σ_{i=1}^r (I − P_i(α))μ_i and ||b||² ≤ (1/r) Σ_{i=1}^r μ′_i(I − P_i(α))μ_i. By (6.1) and (6.10),

    P[ max_α |e′b| / ( nR_n(α) ) > ε ] ≤ C ( (n − k)/n )^m Σ_α 1/[nR_n(α)]^m

for some constant C > 0. The result follows from (2.2) and (6.5). Thus (6.6) is proved, and hence the lemma.

Remark 6.1. Indeed, to prove Lemma 2.1 and Lemma 2.2 we only need to assume p_n / min_{α∈A_n} nR_n(α) → 0 instead of the stronger condition (2.3). We do, however, need (2.3) to prove our final result.

Proof of Theorem 2.1. Since (2.3) and (2.7) imply that the third term on the right hand side of (2.1) is of the order o_p(L_n(α)) uniformly in α ∈ A_n, part (a) follows from (2.1), Lemma 2.1 and Lemma 2.2. From part (a), Γ(α) can be written as

    Γ(α) = (k/n²) e′e + L_n(α)(1 + ζ_n(α)),  α ∈ A_n,  where max_α |ζ_n(α)| →_p 0.

Now Γ(α̂_n) ≤ Γ(α) for all α implies

    L_n(α̂_n)/L_n(α) ≤ (1 + ζ_n(α)) / (1 + ζ_n(α̂_n)) ≤ (1 + max_α |ζ_n(α)|) / (1 − max_α |ζ_n(α)|)  for all α.

Part (b) follows from the above.

Proof of Lemma 3.1. We first note that under suitable conditions there exist 0 < δ < Δ such that

(6.11)    log(1 + δ) < u_n(α) < log(1 + Δ)  for all α

with probability tending to 1.
This follows from (3.3), (3.4) and the fact that e′e/(nσ²) →_p 1, noting that max_α e′P(α)e/n ≤ e′Pe/n →_p 0 and that L_n(α) is uniformly (in α) bounded with probability tending to 1. Here P is the projection matrix corresponding to the full model.

Consider now the expression in (3.2). By Lemma 2.1 of Section 2 and (6.11),

(6.12)    log[ S(α)/(nσ²) ] = log[ e′e/(nσ²) + L_n(α)/σ² + o_p(L_n(α)/σ²) ] = log[ e′e/(nσ²) + L_n(α)/σ² + o_p( e′e/(nσ²) + L_n(α)/σ² ) ] = log[ e^{u_n(α)} (1 + o_p(1)) ] = u_n(α) + o_p(1) = u_n(α) + o_p(u_n(α))

uniformly in α. In view of (3.5), to prove (3.7) it remains to show that

(6.13)    ((n − k)/(nr)) Σ_{i=1}^r log[ S_i(α)/(nσ²) ] = o_p(1).

Note that we are also using (3.4) and (6.11). Since S_i(α) ≥ S_i for all α and all i, we have, for all α,

    0 < (1/r) Σ_{i=1}^r log[S_i(α)] = log[ Π_{i=1}^r S_i(α) ]^{1/r} ≤ log[ (1/r) Σ_{i=1}^r S_i(α) ],

implying

    −((n − k)/n) log(nσ²) < ((n − k)/(nr)) Σ_{i=1}^r log[ S_i(α)/(nσ²) ] ≤ ((n − k)/n) log[ (1/r) Σ_{i=1}^r S_i(α) ] − ((n − k)/n) log(nσ²).

Then (6.13) follows from Lemma 2.2 of Section 2, condition (3.4) and the fact that L_n(α) is uniformly (in α) bounded with probability tending to 1 (as noted earlier in the argument for (6.11)).

Proof of Theorem 3.1. Let α̂_n be the model which minimizes Γ(α). Proceeding as in the proof of part (b) of Theorem 2.1 and using (3.7), we can prove that

    u_n(α̂_n) / u_n(α_{L_n}) →_p 1.

This, together with (6.11), implies that u_n(α̂_n) − u_n(α_{L_n}) →_p 0, i.e.,

    ( e′e + nL_n(α̂_n) ) / ( e′e + nL_n(α_{L_n}) ) →_p 1.

Since e′e/n →_p σ² and L_n(α_{L_n}) ≥ min_α Δ_n(α), using (3.3) we have

    L_n(α̂_n) / L_n(α_{L_n}) →_p 1.

Proof of Proposition 4.1. We first prove equation (4.7). Below, by max_α we mean maximum over α ∈ A_n^c. Let Z_n(α) = ( e′P(α)e ) / ( σ²p_n(α) ). We first show that max_α |Z_n(α)| = O_p(1). By (6.2),

    P[ max_α |Z_n(α) − 1| > M ] ≤ Σ_α E|Z_n(α) − 1|^{2m} / M^{2m} ≤ (C/M^{2m}) Σ_α 1/[p_n(α)]^m

for some constant C > 0, and by (4.5) this can be made arbitrarily small by choosing suitable M > 0. Thus max_α |Z_n(α) − 1| = O_p(1), implying max_α |Z_n(α)| = O_p(1). This implies that

    (1/n) e′P(α)e = o_p( (1/n) λ_n σ² p_n(α) )  uniformly in α ∈ A_n^c,

as λ_n → ∞. Proceeding in a similar manner, and noting that (1/r) Σ_{i=1}^r e′_i P_i(α) e_i can be written as e′M(α)e (see the proof of Lemma 2.2), one can prove

    (1/(nr)) Σ_{i=1}^r e′_i P_i(α) e_i = o_p( (1/n) λ_n σ² p_n(α) )  uniformly in α ∈ A_n^c.

The result now follows from (4.2), (4.4) and (4.6).

In order to complete the proof of Proposition 4.1, we now prove equation (4.8). From (4.7),

    Γ(α_n^c) = (k/n²) e′e + (1/n) λ_n σ² p_n(α_n^c) + o_p( (1/n) λ_n σ² p_n(α_n^c) ).

The result follows from (4.1) and (4.2), noting that (4.1) implies (6.5) with max_{α∈A_n} replaced by max_{α∈A_n−A_n^c}.

Proof of (4.18). Note that

    max_{α∈A_n−A_n^c} L_n(α_n^c)/L_n(α) = max_α e′P(α_n^c)e / ( nL_n(α) ).

By (6.2) and the arguments used earlier,

    P[ max_{α∈A_n−A_n^c} | e′P(α_n^c)e − σ²p_n(α_n^c) | / ( nR_n(α) ) > ε ] ≤ C [ p_n / min_α nR_n(α) ]^m Σ_{α∈A_n−A_n^c} 1/[nR_n(α)]^m

for some constant C. The result follows from (4.1) and (4.2).

Acknowledgments.
We would first like to thank Professors Bertrand Clarke and Subhashis Ghosal for inviting us to contribute to this volume. We consider it a great privilege to be able to write an article as an expression of our deep regard for our mentor and teacher, Professor Jayanta K. Ghosh. We cannot express in words how much we learned in numerous academic discussions with him over the years, which immensely influenced our thought process. Needless to say, this work owes much to the way we learned to think about model selection through our association with him. Finally, we thank an anonymous referee and the editors for helpful comments and suggestions towards the improvement of the paper.

References

[1] Barbieri, M. M. and Berger, J. O. (2004). Optimal predictive model selection. Ann. Statist. 32 870–897. MR2065192
[2] Berger, J. O. and Pericchi, L. R. (1996a). The intrinsic Bayes factor for model selection and prediction. J. Amer. Statist. Assoc. 91 109–122. MR1394065
[3] Berger, J. O. and Pericchi, L. R. (1996b). The intrinsic Bayes factor for linear models (with discussion). In Bayesian Statistics (J. M. Bernardo et al., eds.) 5 25–44. Oxford Univ. Press. MR1425398
[4] Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. Wiley, Chichester. MR1274699
[5] Chakrabarti, A. and Ghosh, J. K. (2008). Some aspects of Bayesian model selection for prediction (with discussion). In Bayesian Statistics (J. M. Bernardo et al., eds.) 8. To appear.
[6] Chatterjee, S. and Hadi, A. S. (1988). Sensitivity Analysis in Linear Regression. Wiley, New York. MR0939610
[7] Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc. 70 320–328.
[8] Geisser, S. and Eddy, W. F. (1979). A predictive approach to model selection. J. Amer. Statist. Assoc. 74 153–160. MR0529531
[9] Gelfand, A. E. and Dey, D. K. (1994). Bayesian model choice: asymptotics and exact calculations. J. Roy. Statist. Soc. Ser. B 56 501–514. MR1278223
[10] Gelfand, A. E. and Ghosh, S. K. (1998). Model choice: a minimum posterior predictive loss approach. Biometrika 85 1–11. MR1627258
[11] Ghosh, J. K. and Samanta, T. (2002). Nonsubjective Bayes testing – an overview. J. Statist. Plann. Inference 103 205–223. MR1896993
[12] Li, K.-C. (1987). Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. Ann. Statist. 15 958–975. MR0902239
[13] Mukhopadhyay, N. (2000). Bayesian model selection for high dimensional models with prediction error loss and 0–1 loss. Ph.D. thesis, Purdue Univ.
[14] Mukhopadhyay, N., Ghosh, J. K. and Berger, J. O. (2005). Some Bayesian predictive approaches to model selection. Statist. Probab. Lett. 73 369–379. MR2187852
[15] O'Hagan, A. (1995). Fractional Bayes factors for model comparisons. J. Roy. Statist. Soc. Ser. B 57 99–138. MR1325379
[16] Shao, J. (1997). An asymptotic theory for linear model selection (with discussion). Statist. Sinica 7 221–264. MR1466682
[17] Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). J. Roy. Statist. Soc. Ser. B 36 111–147. MR0356377
[18] Whittle, P. (1960). Bounds for the moments of linear and quadratic forms in independent variables. Theory Probab. Appl. 5 302–305. MR0133849
