The Residual Information Criterion, Corrected


Author: Chenlei Leng

Chenlei Leng*

October 29, 2018

Abstract

Shi and Tsai (JRSSB, 2002) proposed an interesting residual information criterion (RIC) for model selection in regression. Their RIC was motivated by the principle of minimizing the Kullback-Leibler discrepancy between the residual likelihoods of the true and candidate model. We show, however, that under this principle RIC would always choose the full (saturated) model. The residual likelihood, therefore, is not appropriate as a discrepancy measure in defining an information criterion. We explain why this is so and provide a corrected residual information criterion as a remedy.

KEY WORDS: Residual information criterion; Corrected residual information criterion.

1 Introduction

Given $n$ iid observations from a true model $y = X\beta_0 + \varepsilon$, where $y = (y_1, \dots, y_n)'$, $X$ is an $n \times p$ design matrix, $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)'$ follows a multivariate distribution with mean 0 and variance $\sigma_0^2 W(\theta_0)$, and $\beta_0 \in R^{p \times 1}$ is an unknown vector to be estimated. Here $\theta_0$ is an $m \times 1$ vector parameterizing the correlation matrix. Finally, we denote $A_0 = A(\beta_0) = \{j : \beta_{0j} \neq 0,\ j = 1, \dots, p\}$ as the nonzero coefficient set and $k_0 = \#A_0$ as the number of nonzero coefficients. The problem of estimating $A_0$ is often referred to as variable selection or model selection.

Variable selection in linear regression is probably one of the most important problems in statistics; see, for example, the references in Shao (1997). To automate the process of choosing a finite dimensional candidate model out of all possible models, various information criteria have been developed. There are two basic elements in all of these criteria: one element that measures the goodness of fit, and another term that penalizes the complexity of the fitted model, usually taken as a function of the number of parameters used.
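As a concrete illustration of this setup, the sketch below simulates one draw from the true model. The AR(1) form of $W(\theta_0)$ and all numerical values are hypothetical choices made for illustration only; they are not taken from the paper.

```python
import numpy as np

# Simulate n observations from the true model y = X beta0 + eps,
# with eps ~ N(0, sigma0^2 * W(theta0)). The AR(1) correlation matrix
# below is a hypothetical choice of W used only for illustration.
rng = np.random.default_rng(42)
n, p = 30, 5
sigma0_sq, theta0 = 2.0, 0.5
X = rng.normal(size=(n, p))
beta0 = np.array([1.0, -1.0, 0.5, 0.0, 0.0])       # A0 = {1, 2, 3}, so k0 = 3
idx = np.arange(n)
W = theta0 ** np.abs(np.subtract.outer(idx, idx))  # AR(1) correlation, W(theta0)
L = np.linalg.cholesky(sigma0_sq * W)              # factor of the error covariance
y = X @ beta0 + L @ rng.normal(size=n)             # one draw from the true model
```

Here the correlation is parameterized by the single scalar $\theta_0$ (so $m = 1$), matching the general setup above.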
Generally speaking, the existing variable selection approaches can be classified into two broad categories. On the one hand, AIC-type criteria, such as AIC (Akaike, 1970) and AICc (Hurvich and Tsai, 1989), seek to minimize the Kullback-Leibler divergence between the true and candidate model. On the other hand, BIC-type criteria (Schwarz, 1978) are used to identify a candidate model that achieves selection consistency. Obviously, these criteria are motivated by different assumptions and different considerations, practically and theoretically. Any particular choice of which one to use probably depends on the context and is subject to criticism, as each has its own merits and shortcomings.

In an important paper, Shi and Tsai (2002) proposed an interesting information criterion termed the residual information criterion (RIC). The authors showed that RIC is motivated by the consideration of minimizing the discrepancy between the residual log-likelihood functions of the true and candidate model. However, surprisingly, the authors arrived at a BIC-type criterion, in marked contrast with other information criteria, such as AIC and AICc, that are motivated by the same principle of minimizing the Kullback-Leibler discrepancy.

In this paper, we show that the RIC approach does not target minimizing the Kullback-Leibler discrepancy between residual likelihoods. We provide a corrected criterion RIC* motivated by this principle. However, we show that if the residual likelihoods are used to evaluate the Kullback-Leibler divergence between models, RIC (i.e., RIC*) would always choose the full model.

* Leng is Assistant Professor, Department of Statistics and Applied Probability, National University of Singapore. Leng's research is supported in part by NUS research grant R-155-050-053-133 (Email: stalc@nus.edu.sg).
Therefore, the residual likelihood is not an appropriate loss function with which to define an information criterion. We provide a simple likelihood-based approach to circumvent the problem.

The rest of the paper is organized as follows. Section 2 reviews the RIC method in Shi and Tsai. Since Shi and Tsai's RIC does not approximate the Kullback-Leibler divergence, we provide the RIC* measure as a correction. However, RIC* always chooses the full model, and the reason is explained. Section 3 presents the correct residual likelihood information criterion, motivated by minimizing the Kullback-Leibler divergence between likelihoods instead of residual likelihoods. Concluding remarks are given in Section 4.

2 The Residual Information Criterion

We review the RIC method of Shi and Tsai (2002) in this section. The model we consider in this article is a special case of that in Shi and Tsai (2002), obtained by setting the Box-Cox transformation parameter $\lambda$ to 1. The results in this paper can easily be extended to Box-Cox models following similar arguments in Shi and Tsai.

We start by looking at a candidate (working) model $y = X\beta + \varepsilon$ such that $\#A(\beta) = k$. We denote the active covariates in $X$ as $X_A$. Inspired by the residual likelihood method of Harville (1974) or Diggle et al. (2002) for obtaining an unbiased estimator of the error variance, we can write the residual log-likelihood as

$$L(\theta', \sigma^2) = -\frac{1}{2}(n-k)\log(2\pi) + \frac{1}{2}\log|X_A'X_A| - \frac{1}{2}(n-k)\log(\sigma^2) - \frac{1}{2}\log|W| - \frac{1}{2}\log|X_A'W^{-1}X_A| - \frac{1}{2}y'(W^{-1} - H_A)y/\sigma^2, \quad (1)$$

where $H_A = W^{-1}X_A(X_A'W^{-1}X_A)^{-1}X_A'W^{-1}$ and the dependence of $W$ on $\theta$ is suppressed.
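Equation (1) can be evaluated numerically. The sketch below is a direct transcription, assuming NumPy; the function name `residual_loglik` is our own.

```python
import numpy as np

def residual_loglik(y, X_A, W, sigma2):
    """Residual (REML) log-likelihood of a working model, eq. (1).

    y: (n,) response; X_A: (n, k) active columns of the design;
    W: (n, n) working correlation matrix; sigma2: error variance.
    """
    n, k = X_A.shape
    W_inv = np.linalg.inv(W)
    XtWX = X_A.T @ W_inv @ X_A
    # H_A = W^{-1} X_A (X_A' W^{-1} X_A)^{-1} X_A' W^{-1}
    H_A = W_inv @ X_A @ np.linalg.solve(XtWX, X_A.T @ W_inv)
    return (-0.5 * (n - k) * np.log(2 * np.pi)
            + 0.5 * np.linalg.slogdet(X_A.T @ X_A)[1]
            - 0.5 * (n - k) * np.log(sigma2)
            - 0.5 * np.linalg.slogdet(W)[1]
            - 0.5 * np.linalg.slogdet(XtWX)[1]
            - 0.5 * (y @ (W_inv - H_A) @ y) / sigma2)
```

With $W = I$, maximizing this over $\sigma^2$ recovers the usual unbiased variance estimate $\mathrm{RSS}/(n-k)$, which is the point of the residual likelihood construction.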
A useful measure of the distance between the working model and the true model is the Kullback-Leibler divergence

$$d(\theta', \sigma^2) = E_0[-2L(\theta', \sigma^2) + 2L_0(\theta_0', \sigma_0^2)], \quad (2)$$

where $E_0$ denotes the expectation under the true model and $L_0$ denotes the residual log-likelihood of the true model. Clearly, the best model loses the least information, in terms of Kullback-Leibler distance, relative to the truth and is therefore preferred. Such a criterion formulates RIC in an information-theoretic framework. Provided that one can unbiasedly estimate $d(\theta', \sigma^2)$, this criterion provides a sound basis for parameter estimation and statistical inference under appropriate conditions.

Since $E_0[2L_0(\theta_0', \sigma_0^2)]$ is independent of the working model, we just need to evaluate $E_0[-2L(\theta', \sigma^2)]$. In Shi and Tsai (2002), (2) is written as

$$d(\theta', \sigma^2) = E_0\left[(n-k)\log(\sigma^2) + \log|W| + \log|X_A'W^{-1}X_A| + y'(W^{-1} - H_A)y/\sigma^2\right] \quad (3)$$
$$= (n-k)\log(\sigma^2) + \log|W| + \log|X_A'W^{-1}X_A| + E_0(X\beta_0 + \varepsilon)'(W^{-1} - H_A)(X\beta_0 + \varepsilon)/\sigma^2 \quad (4)$$

by omitting irrelevant terms. Substituting the estimated values $\hat\theta$, $\hat\sigma^2$ into (4), we have

$$d(\hat\theta', \hat\sigma^2) = (n-k)\log(\hat\sigma^2) + \log|\hat W| + \log|X_A'\hat W^{-1}X_A| + (X\beta_0)'(\hat W^{-1} - \hat H_A)(X\beta_0)/\hat\sigma^2 + \mathrm{tr}\{(\hat W^{-1} - \hat H_A)W_0\}\sigma_0^2/\hat\sigma^2. \quad (5)$$

The above expression involves an unknown quantity $\sigma_0^2$. Following Shi and Tsai, we judge the quality of the candidate model by $E_0\{d(\hat\theta', \hat\sigma^2)\}$. Now, if we assume $A_0 \subseteq A$, an assumption also used in deriving AICc (Hurvich and Tsai, 1989), the term in (5) involving $X\beta_0$ becomes zero. Furthermore, if we assume $\hat\theta$ is consistent for $\theta_0$, we can estimate $W_0$ by $\hat W$ since $\hat W = W_0 + o_p(1)$. The trace term in (5) can then be approximated by $(n-k)\sigma_0^2/\hat\sigma^2$.
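The claim that the term involving $X\beta_0$ vanishes when $A_0 \subseteq A$ follows because $(W^{-1} - H_A)X_A = 0$, so $(W^{-1} - H_A)$ annihilates anything in the column span of $X_A$. A quick numerical check, with hypothetical values and $W = I$ for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))
beta0 = np.array([1.5, -2.0, 0.0, 0.0])  # true active set A0 = {1, 2}
W_inv = np.eye(n)                        # W = I for simplicity
X_A = X[:, :3]                           # candidate set A = {1, 2, 3} contains A0
H_A = W_inv @ X_A @ np.linalg.solve(X_A.T @ W_inv @ X_A, X_A.T @ W_inv)
# The bias term of (5): (X beta0)' (W^{-1} - H_A) (X beta0)
bias_term = (X @ beta0) @ (W_inv - H_A) @ (X @ beta0)
print(bias_term)  # numerically zero, since X beta0 lies in the span of X_A
```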
Since $A_0 \subseteq A$, $(n-k)\hat\sigma^2/\sigma_0^2$ follows a $\chi^2_{n-k}$ distribution and therefore

$$E_0[(n-k)\sigma_0^2/\hat\sigma^2] = (n-k)^2/(n-k-2).$$

Finally, Shi and Tsai argued that $\log|X_A'\hat W^{-1}X_A|$ can be approximated by $k\log(n)$. Putting everything together, they proposed the residual information criterion as follows:

$$\mathrm{RIC} = (n-k)\log(\hat\sigma^2) + \log|\hat W| + k\log(n) - k + \frac{4}{n-k-2}, \quad (6)$$

after removing the constant $n+2$. Asymptotically, the complexity part of RIC is of order $k\log(n)$. Compared to $\mathrm{BIC} = n\log(\tilde\sigma^2) + k\log(n)$, where $\tilde\sigma^2$ is the MLE of $\sigma_0^2$, it is intuitively clear that Shi and Tsai's RIC yields consistent models as BIC does. The complexity penalty of RIC, however, is fundamentally different from that of other familiar information criteria such as AIC and AICc, which are designed to approximate the Kullback-Leibler divergence between two models. This observation raises the question of whether RIC rightfully approximates the divergence.

It turns out that Shi and Tsai's derivation, motivated by minimizing the Kullback-Leibler distance, is incorrect in at least two important places:

1. In (3), a model-dependent term $\log|X_A'X_A|$ is omitted from (1), which causes serious bias in deriving an information criterion. In fact, following Shi and Tsai's arguments, we can approximate $\log|X_A'X_A|$ by $k\log(n)$, and thus RIC should have been

$$\mathrm{RIC}^* = (n-k)\log(\hat\sigma^2) + \log|\hat W| - k + \frac{4}{n-k-2}.$$

Note that in this formulation, RIC* always chooses the full model.

2. Even more severely, the practice of approximating the Kullback-Leibler distance between residual likelihoods for comparing models is totally wrong. To illustrate, suppose that $W = I$. In this simple case, the residual likelihood becomes

$$L(\sigma^2) = -\frac{1}{2}(n-k)\log(\sigma^2) - \frac{1}{2}y'[I - X_A(X_A'X_A)^{-1}X_A']y/\sigma^2.$$
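For reference, eq. (6) translates directly into code. This is a sketch assuming NumPy, with `ric` as our own name and the REML variance estimate plugged in for $\hat\sigma^2$:

```python
import numpy as np

def ric(y, X_A, W_hat):
    """Shi and Tsai's residual information criterion, eq. (6).
    Smaller values are preferred."""
    n, k = X_A.shape
    W_inv = np.linalg.inv(W_hat)
    H_A = W_inv @ X_A @ np.linalg.solve(X_A.T @ W_inv @ X_A, X_A.T @ W_inv)
    sigma2_hat = (y @ (W_inv - H_A) @ y) / (n - k)  # REML variance estimate
    return ((n - k) * np.log(sigma2_hat)
            + np.linalg.slogdet(W_hat)[1]
            + k * np.log(n) - k + 4.0 / (n - k - 2))
```

A candidate model that omits a strong covariate inflates `sigma2_hat` and hence its RIC value, which is what drives the criterion's good selection behavior in practice.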
We see immediately that $E_0[-2L(\sigma^2)] = (n-k)\log(\sigma^2) + (n-k)\sigma_0^2/\sigma^2$ whenever $A_0 \subseteq A$. Thus, for candidate models that include $X_{A_0}$ in the covariate set, $E_0[-2L(\sigma^2)]$ is always minimized by $\sigma^2 = \sigma_0^2$, and in this case $E_0[-2L(\sigma^2)] = (n-k)(\log(\sigma_0^2) + 1)$. Therefore, if one knows the exact data generating process, the ideal RIC leads to the full model, as its $E_0[-2L(\sigma^2)]$ is the smallest. This explains why RIC* always chooses the full model.

Given these serious flaws in going from an unbiased estimator of the Kullback-Leibler divergence to RIC, Shi and Tsai's RIC in (6) seems improperly motivated. Fortunately, Shi and Tsai's derivation can be corrected, and we introduce a corrected RIC in the next section.

3 A Corrected Residual Information Criterion

Instead of using the residual likelihood, a justifiable criterion is to use the log-likelihood, which up to an additive constant satisfies

$$-2L(\beta', \theta', \sigma^2) = n\log(\sigma^2) + \log|W| + (y - X\beta)'W^{-1}(y - X\beta)/\sigma^2,$$

in defining the divergence

$$d(\beta', \theta', \sigma^2) = E_0[-2L(\beta', \theta', \sigma^2) + 2L_0(\beta_0', \theta_0', \sigma_0^2)].$$

We can write

$$E_0[-2L(\beta', \theta', \sigma^2)] = E_0\left[n\log(\sigma^2) + \log|W| + (X\beta_0 + \varepsilon - X\beta)'W^{-1}(X\beta_0 + \varepsilon - X\beta)/\sigma^2\right]$$
$$= n\log(\sigma^2) + \log|W| + n\sigma_0^2/\sigma^2 + (X\beta - X\beta_0)'W^{-1}(X\beta - X\beta_0)/\sigma^2.$$

We can now replace $\sigma^2$, $\beta$ and $\theta$ by their estimates obtained from the residual likelihood method. Now suppose that $A_0 \subseteq A$. Following Shi and Tsai again, $E_0[n\sigma_0^2/\hat\sigma^2] \approx n(n-k)/(n-k-2)$. Since $\hat\beta - \beta_0$ follows the normal distribution $N\{0, \sigma_0^2(X_A'W^{-1}X_A)^{-1}\}$ asymptotically, $\frac{1}{k}(X\hat\beta - X\beta_0)'W^{-1}(X\hat\beta - X\beta_0)/\hat\sigma^2$ is distributed approximately as $F(k, n-k)$. Therefore,

$$E_0\{(X\hat\beta - X\beta_0)'W^{-1}(X\hat\beta - X\beta_0)/\hat\sigma^2\} = \frac{k(n-k)}{n-k-2}.$$
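The first point above can be seen with a few lines of arithmetic. With toy values $n = 100$ and $\sigma_0^2 = 1$ (chosen so that the factor $\log(\sigma_0^2) + 1$ is positive), the population value $(n-k)(\log(\sigma_0^2) + 1)$ strictly decreases in $k$, so the criterion always rewards adding covariates:

```python
import numpy as np

# Population value of -2 x (residual log-likelihood), minimized over sigma^2:
# E0[-2L] = (n - k) * (log(sigma0^2) + 1) for every candidate model with A0 in A.
# Toy values: n = 100, sigma0^2 = 1, so the factor log(sigma0^2) + 1 equals 1.
n, sigma0_sq = 100, 1.0
vals = [(n - k) * (np.log(sigma0_sq) + 1.0) for k in range(1, 11)]
print(vals)  # strictly decreasing in k: the criterion rewards every added covariate
```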
Putting everything together, we have the following corrected residual information criterion, which we shall refer to as RICc:

$$\mathrm{RICc} = n\log(\hat\sigma^2) + k + \frac{4(k+1)}{n-k-2},$$

after omitting a constant $n+2$. Note that $\mathrm{AIC} = n\log(\tilde\sigma^2) + 2k$ and $\mathrm{AICc} = n\log(\tilde\sigma^2) + 2n(k+1)/(n-k-2)$, where $\tilde\sigma^2$ is the MLE of $\sigma_0^2$. We can decompose the first term of RICc, AIC and AICc as $n\log(\mathrm{RSS}) - n\log(n-k)$, $n\log(\mathrm{RSS}) - n\log(n)$ and $n\log(\mathrm{RSS}) - n\log(n)$, respectively. Thus, the complexity penalties for RICc, AIC and AICc are $-n\log(n-k) + k + 4(k+1)/(n-k-2)$, $-n\log(n) + 2k$ and $-n\log(n) + 2n(k+1)/(n-k-2)$, respectively. It can be seen that RICc has a larger penalty than AIC and a smaller penalty than AICc when $n \gg k$.

4 Concluding Remarks

In fitting a model to data, one is required to choose a set of candidate models, a fitting procedure and a criterion for comparing competing models. A minimal requirement for a reasonable criterion is that the population version of the criterion is uniquely minimized by the set of parameters that generates the data. The population version of the residual likelihood information criterion is minimized by the full model and thus fails to meet this basic requirement. Therefore, the residual likelihood cannot be used as a discrepancy measure between models. A simple remedy is to use the likelihood-based Kullback-Leibler divergence. Although Shi and Tsai's RIC may be a legitimate criterion in its own right, our arguments show that it is not motivated by the right principle. Had one followed their motivation, RIC (i.e., RIC* in our notation) would have always chosen the full model.
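The penalty comparison above can be checked numerically. The sketch below assumes NumPy, with `penalties` as a hypothetical helper computing the three penalties after the common $n\log(\mathrm{RSS})$ goodness-of-fit term is removed:

```python
import numpy as np

def penalties(n, k):
    """Complexity penalties of RICc, AIC and AICc after removing the
    common n*log(RSS) goodness-of-fit term."""
    ricc = -n * np.log(n - k) + k + 4.0 * (k + 1) / (n - k - 2)
    aic = -n * np.log(n) + 2 * k
    aicc = -n * np.log(n) + 2.0 * n * (k + 1) / (n - k - 2)
    return ricc, aic, aicc

for n, k in [(50, 3), (200, 5), (1000, 10)]:
    ricc, aic, aicc = penalties(n, k)
    print(n, k, aic < ricc < aicc)  # True in each case: AIC < RICc < AICc
```

For example, at $n = 50$, $k = 3$ the penalties are roughly $-189.2$ (RICc), $-189.6$ (AIC) and $-186.7$ (AICc), consistent with the claimed ordering for $n \gg k$.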
However, Shi and Tsai's RIC, though motivated by the wrong principle (using the residual likelihood instead of the likelihood) and dangerously ignoring an important term $\log|X'X|$ in the approximation, has good small-sample performance in their simulations. Additionally, Shi and Tsai's RIC has been successfully applied in a number of settings, such as normal linear regression, Box-Cox transformation, inverse regression models (Ni et al., 2005) and longitudinal data analysis (Azari et al., 2006). This success may be understood as a consequence of Shi and Tsai's RIC resembling BIC. Despite the increasing popularity of RIC, Shi and Tsai's RIC remains unmotivated, and finding a justification for it is left as a future research topic.

References

Akaike, H. (1970). Statistical predictor identification. Annals of the Institute of Statistical Mathematics, 22, 203-217.

Azari, R., Li, L., and Tsai, C.-L. (2006). Longitudinal data model selection. Computational Statistics and Data Analysis, 50, 3053-3066.

Diggle, P.J., Heagerty, P.J., Liang, K.-Y. and Zeger, S.L. (2002). Analysis of Longitudinal Data (2nd edition). Oxford: Oxford University Press.

Harville, D.A. (1974). Bayesian inference for variance components using only error contrasts. Biometrika, 61, 383-385.

Hurvich, C.M., and Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297-307.

Ni, L., Cook, R.D., and Tsai, C.-L. (2005). A note on shrinkage sliced inverse regression. Biometrika, 92, 242-247.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.

Shao, J. (1997). An asymptotic theory for linear model selection (with discussion). Statistica Sinica, 7, 221-264.

Shi, P. and Tsai, C.-L. (2002). Regression model selection - a residual likelihood approach. Journal of the Royal Statistical Society B, 64, 237-252.
