The Residual Information Criterion, Corrected


Author: Chenlei Leng

Chenlei Leng*

October 29, 2018

Abstract

Shi and Tsai (JRSSB, 2002) proposed an interesting residual information criterion (RIC) for model selection in regression. Their RIC was motivated by the principle of minimizing the Kullback-Leibler discrepancy between the residual likelihoods of the true and candidate model. We show, however, that under this principle RIC would always choose the full (saturated) model. The residual likelihood, therefore, is not appropriate as a discrepancy measure in defining an information criterion. We explain why this is so and provide a corrected residual information criterion as a remedy.

KEY WORDS: Residual information criterion; Corrected residual information criterion.

1 Introduction

Given $n$ iid observations from a true model $y = X\beta_0 + \varepsilon$, where $y = (y_1, \dots, y_n)'$, $X$ is an $n \times p$ design matrix, $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)'$ follows a multivariate distribution with mean 0 and variance $\sigma_0^2 W(\theta_0)$, and $\beta_0 \in R^{p \times 1}$ is an unknown vector to be estimated. Here $\theta_0$ is an $m \times 1$ vector parameterizing the correlation matrix. Finally, we denote $A_0 = A(\beta_0) = \{j : \beta_{0j} \neq 0,\ j = 1, \dots, p\}$ as the nonzero coefficient set and $k_0 = \#A_0$ as the number of nonzero coefficients. The problem of estimating $A_0$ is often referred to as variable selection or model selection.

Variable selection in linear regression is probably one of the most important problems in statistics; see, for example, the references in Shao (1997). To automate the process of choosing a finite dimensional candidate model out of all possible models, various information criteria have been developed. There are two basic elements in all of these criteria: one element that measures the goodness of fit, and another term that penalizes the complexity of the fitted model, usually taken as a function of the number of parameters used.
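As a concrete illustration of this setup, the sketch below simulates one draw from the true model. The AR(1) form of $W(\theta_0)$ and all numerical values are hypothetical choices made for illustration only; they are not taken from the paper.

```python
import numpy as np

# Simulate n observations from the true model y = X beta0 + eps,
# with eps ~ N(0, sigma0^2 * W(theta0)). The AR(1) correlation matrix
# below is a hypothetical choice of W used only for illustration.
rng = np.random.default_rng(42)
n, p = 30, 5
sigma0_sq, theta0 = 2.0, 0.5
X = rng.normal(size=(n, p))
beta0 = np.array([1.0, -1.0, 0.5, 0.0, 0.0])       # A0 = {1, 2, 3}, so k0 = 3
idx = np.arange(n)
W = theta0 ** np.abs(np.subtract.outer(idx, idx))  # AR(1) correlation, W(theta0)
L = np.linalg.cholesky(sigma0_sq * W)              # factor of the error covariance
y = X @ beta0 + L @ rng.normal(size=n)             # one draw from the true model
```

Here the correlation is parameterized by the single scalar $\theta_0$ (so $m = 1$), matching the general setup above.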
Generally speaking, the existing variable selection approaches can be classified into two broad categories. On the one hand, AIC-type criteria, such as AIC (Akaike, 1970) and AICc (Hurvich and Tsai, 1989), seek to minimize the Kullback-Leibler divergence between the true and candidate model. On the other hand, BIC-type criteria (Schwarz, 1978) are used to identify a candidate model that achieves selection consistency. Obviously, these criteria are motivated by different assumptions and different considerations, practically and theoretically. Any particular choice of which one to use probably depends on the context and is subject to criticism, as each has its own merits and shortcomings.

In an important paper, Shi and Tsai (2002) proposed an interesting information criterion termed the residual information criterion (RIC). The authors showed that RIC is motivated by the consideration of minimizing the discrepancy between the residual log-likelihood functions of the true and candidate model. However, surprisingly, the authors arrived at a BIC-type criterion, in marked contrast with other information criteria, such as AIC and AICc, that are motivated by the same principle of minimizing the Kullback-Leibler discrepancy.

In this paper, we show that the RIC approach does not target minimizing the Kullback-Leibler discrepancy between residual likelihoods. We provide a corrected criterion RIC* motivated by this principle. However, we show that if the residual likelihoods are used to evaluate the Kullback-Leibler divergence between models, RIC (i.e., RIC*) would always choose the full model.

* Leng is Assistant Professor, Department of Statistics and Applied Probability, National University of Singapore. Leng's research is supported in part by NUS research grant R-155-050-053-133 (Email: stalc@nus.edu.sg).
Therefore, the residual likelihood is not an appropriate loss function with which to define an information criterion. We provide a simple likelihood-based approach to circumvent the problem.

The rest of the paper is organized as follows. Section 2 reviews the RIC method in Shi and Tsai. Since Shi and Tsai's RIC does not approximate the Kullback-Leibler divergence, we provide the RIC* measure as a correction. However, RIC* always chooses the full model, and the reason is explained. Section 3 presents the correct residual likelihood information criterion, motivated by minimizing the Kullback-Leibler divergence between likelihoods instead of residual likelihoods. Concluding remarks are given in Section 4.

2 The Residual Information Criterion

We review the RIC method of Shi and Tsai (2002) in this section. The model we consider in this article is a special case of that in Shi and Tsai (2002), obtained by setting the Box-Cox transformation parameter $\lambda$ to 1. The results in this paper can easily be extended to Box-Cox models following similar arguments in Shi and Tsai.

We start by looking at a candidate (working) model $y = X\beta + \varepsilon$ such that $\#A(\beta) = k$. We denote the active covariates in $X$ as $X_A$. Inspired by the residual likelihood method of Harville (1974) or Diggle et al. (2002) for obtaining an unbiased estimator of the error variance, we can write the residual log-likelihood as

$$L(\theta', \sigma^2) = -\frac{1}{2}(n-k)\log(2\pi) + \frac{1}{2}\log|X_A'X_A| - \frac{1}{2}(n-k)\log(\sigma^2) - \frac{1}{2}\log|W| - \frac{1}{2}\log|X_A'W^{-1}X_A| - \frac{1}{2}y'(W^{-1} - H_A)y/\sigma^2, \quad (1)$$

where $H_A = W^{-1}X_A(X_A'W^{-1}X_A)^{-1}X_A'W^{-1}$ and the dependence of $W$ on $\theta$ is suppressed.
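Equation (1) can be evaluated numerically. The sketch below is a direct transcription, assuming NumPy; the function name `residual_loglik` is our own.

```python
import numpy as np

def residual_loglik(y, X_A, W, sigma2):
    """Residual (REML) log-likelihood of a working model, eq. (1).

    y: (n,) response; X_A: (n, k) active columns of the design;
    W: (n, n) working correlation matrix; sigma2: error variance.
    """
    n, k = X_A.shape
    W_inv = np.linalg.inv(W)
    XtWX = X_A.T @ W_inv @ X_A
    # H_A = W^{-1} X_A (X_A' W^{-1} X_A)^{-1} X_A' W^{-1}
    H_A = W_inv @ X_A @ np.linalg.solve(XtWX, X_A.T @ W_inv)
    return (-0.5 * (n - k) * np.log(2 * np.pi)
            + 0.5 * np.linalg.slogdet(X_A.T @ X_A)[1]
            - 0.5 * (n - k) * np.log(sigma2)
            - 0.5 * np.linalg.slogdet(W)[1]
            - 0.5 * np.linalg.slogdet(XtWX)[1]
            - 0.5 * (y @ (W_inv - H_A) @ y) / sigma2)
```

With $W = I$, maximizing this over $\sigma^2$ recovers the usual unbiased variance estimate $\mathrm{RSS}/(n-k)$, which is the point of the residual likelihood construction.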
A useful measure of the distance between the working model and the true model is the Kullback-Leibler divergence

$$d(\theta', \sigma^2) = E_0[-2L(\theta', \sigma^2) + 2L_0(\theta_0', \sigma_0^2)], \quad (2)$$

where $E_0$ denotes the expectation under the true model and $L_0$ denotes the residual log-likelihood of the true model. Clearly, the best model loses the least information, in terms of Kullback-Leibler distance, relative to the truth and is therefore preferred. Such a criterion formulates RIC in an information-theoretic framework. Provided that one can unbiasedly estimate $d(\theta', \sigma^2)$, this criterion provides a sound basis for parameter estimation and statistical inference under appropriate conditions.

Since $E_0[2L_0(\theta_0', \sigma_0^2)]$ is independent of the working model, we just need to evaluate $E_0[-2L(\theta', \sigma^2)]$. In Shi and Tsai (2002), (2) is written as

$$d(\theta', \sigma^2) = E_0\left[(n-k)\log(\sigma^2) + \log|W| + \log|X_A'W^{-1}X_A| + y'(W^{-1} - H_A)y/\sigma^2\right] \quad (3)$$
$$= (n-k)\log(\sigma^2) + \log|W| + \log|X_A'W^{-1}X_A| + E_0(X\beta_0 + \varepsilon)'(W^{-1} - H_A)(X\beta_0 + \varepsilon)/\sigma^2 \quad (4)$$

by omitting irrelevant terms. Substituting the estimated values $\hat\theta$, $\hat\sigma^2$ into (4), we have

$$d(\hat\theta', \hat\sigma^2) = (n-k)\log(\hat\sigma^2) + \log|\hat W| + \log|X_A'\hat W^{-1}X_A| + (X\beta_0)'(\hat W^{-1} - \hat H_A)(X\beta_0)/\hat\sigma^2 + \mathrm{tr}\{(\hat W^{-1} - \hat H_A)W_0\}\sigma_0^2/\hat\sigma^2. \quad (5)$$

The above expression involves an unknown quantity $\sigma_0^2$. Following Shi and Tsai, we judge the quality of the candidate model by $E_0\{d(\hat\theta', \hat\sigma^2)\}$. Now, if we assume $A_0 \subseteq A$, an assumption also used in deriving AICc (Hurvich and Tsai, 1989), the term in (5) involving $X\beta_0$ becomes zero. Furthermore, if we assume $\hat\theta$ is consistent for $\theta_0$, we can estimate $W_0$ by $\hat W$ since $\hat W = W_0 + o_p(1)$. The trace term in (5) can then be approximated by $(n-k)\sigma_0^2/\hat\sigma^2$.
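The claim that the term involving $X\beta_0$ vanishes when $A_0 \subseteq A$ follows because $(W^{-1} - H_A)X_A = 0$, so $(W^{-1} - H_A)$ annihilates anything in the column span of $X_A$. A quick numerical check, with hypothetical values and $W = I$ for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))
beta0 = np.array([1.5, -2.0, 0.0, 0.0])  # true active set A0 = {1, 2}
W_inv = np.eye(n)                        # W = I for simplicity
X_A = X[:, :3]                           # candidate set A = {1, 2, 3} contains A0
H_A = W_inv @ X_A @ np.linalg.solve(X_A.T @ W_inv @ X_A, X_A.T @ W_inv)
# The bias term of (5): (X beta0)' (W^{-1} - H_A) (X beta0)
bias_term = (X @ beta0) @ (W_inv - H_A) @ (X @ beta0)
print(bias_term)  # numerically zero, since X beta0 lies in the span of X_A
```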
Since $A_0 \subseteq A$, $(n-k)\hat\sigma^2/\sigma_0^2$ follows a $\chi^2_{n-k}$ distribution and therefore

$$E_0[(n-k)\sigma_0^2/\hat\sigma^2] = (n-k)^2/(n-k-2).$$

Finally, Shi and Tsai argued that $\log|X_A'\hat W^{-1}X_A|$ can be approximated by $k\log(n)$. Putting everything together, they proposed the residual information criterion as follows:

$$\mathrm{RIC} = (n-k)\log(\hat\sigma^2) + \log|\hat W| + k\log(n) - k + \frac{4}{n-k-2}, \quad (6)$$

after removing the constant $n+2$. Asymptotically, the complexity part of RIC is of order $k\log(n)$. Compared to $\mathrm{BIC} = n\log(\tilde\sigma^2) + k\log(n)$, where $\tilde\sigma^2$ is the MLE of $\sigma_0^2$, it is intuitively clear that Shi and Tsai's RIC yields consistent models as BIC does. The complexity penalty of RIC, however, is fundamentally different from that of other familiar information criteria such as AIC and AICc, which are designed to approximate the Kullback-Leibler divergence between two models. This observation raises the question of whether RIC rightfully approximates the divergence.

It turns out that Shi and Tsai's derivation, motivated by minimizing the Kullback-Leibler distance, is incorrect in at least two important places:

1. In (3), a model-dependent term $\log|X_A'X_A|$ is omitted from (1), which causes serious bias in deriving an information criterion. In fact, following Shi and Tsai's arguments, we can approximate $\log|X_A'X_A|$ by $k\log(n)$, and thus RIC should have been

$$\mathrm{RIC}^* = (n-k)\log(\hat\sigma^2) + \log|\hat W| - k + \frac{4}{n-k-2}.$$

Note that in this formulation, RIC* always chooses the full model.

2. Even more severely, the practice of approximating the Kullback-Leibler distance between residual likelihoods for comparing models is totally wrong. To illustrate, suppose that $W = I$. In this simple case, the residual likelihood becomes

$$L(\sigma^2) = -\frac{1}{2}(n-k)\log(\sigma^2) - \frac{1}{2}y'[I - X_A(X_A'X_A)^{-1}X_A']y/\sigma^2.$$
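For reference, eq. (6) translates directly into code. This is a sketch assuming NumPy, with `ric` as our own name and the REML variance estimate plugged in for $\hat\sigma^2$:

```python
import numpy as np

def ric(y, X_A, W_hat):
    """Shi and Tsai's residual information criterion, eq. (6).
    Smaller values are preferred."""
    n, k = X_A.shape
    W_inv = np.linalg.inv(W_hat)
    H_A = W_inv @ X_A @ np.linalg.solve(X_A.T @ W_inv @ X_A, X_A.T @ W_inv)
    sigma2_hat = (y @ (W_inv - H_A) @ y) / (n - k)  # REML variance estimate
    return ((n - k) * np.log(sigma2_hat)
            + np.linalg.slogdet(W_hat)[1]
            + k * np.log(n) - k + 4.0 / (n - k - 2))
```

A candidate model that omits a strong covariate inflates `sigma2_hat` and hence its RIC value, which is what drives the criterion's good selection behavior in practice.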
We see immediately that $E_0[-2L(\sigma^2)] = (n-k)\log(\sigma^2) + (n-k)\sigma_0^2/\sigma^2$ whenever $A_0 \subseteq A$. Thus, for candidate models that include $X_{A_0}$ in the covariate set, $E_0[-2L(\sigma^2)]$ is always minimized by $\sigma^2 = \sigma_0^2$, and in this case $E_0[-2L(\sigma^2)] = (n-k)(\log(\sigma_0^2) + 1)$. Therefore, if one knows the exact data generating process, the ideal RIC leads to the full model, as its $E_0[-2L(\sigma^2)]$ is the smallest. This explains why RIC* always chooses the full model.

Given these serious flaws in going from an unbiased estimator of the Kullback-Leibler divergence to RIC, Shi and Tsai's RIC in (6) seems improperly motivated. Fortunately, Shi and Tsai's derivation can be corrected, and we introduce a corrected RIC in the next section.

3 A Corrected Residual Information Criterion

Instead of using the residual likelihood, a justifiable criterion is to use the log-likelihood, which up to an additive constant satisfies

$$-2L(\beta', \theta', \sigma^2) = n\log(\sigma^2) + \log|W| + (y - X\beta)'W^{-1}(y - X\beta)/\sigma^2,$$

in defining the divergence

$$d(\beta', \theta', \sigma^2) = E_0[-2L(\beta', \theta', \sigma^2) + 2L_0(\beta_0', \theta_0', \sigma_0^2)].$$

We can write

$$E_0[-2L(\beta', \theta', \sigma^2)] = E_0\left[n\log(\sigma^2) + \log|W| + (X\beta_0 + \varepsilon - X\beta)'W^{-1}(X\beta_0 + \varepsilon - X\beta)/\sigma^2\right]$$
$$= n\log(\sigma^2) + \log|W| + n\sigma_0^2/\sigma^2 + (X\beta - X\beta_0)'W^{-1}(X\beta - X\beta_0)/\sigma^2.$$

We can now replace $\sigma^2$, $\beta$ and $\theta$ by their estimates obtained from the residual likelihood method. Now suppose that $A_0 \subseteq A$. Following Shi and Tsai again, $E_0[n\sigma_0^2/\hat\sigma^2] \approx n(n-k)/(n-k-2)$. Since $\hat\beta - \beta_0$ follows the normal distribution $N\{0, \sigma_0^2(X_A'W^{-1}X_A)^{-1}\}$ asymptotically, $\frac{1}{k}(X\hat\beta - X\beta_0)'W^{-1}(X\hat\beta - X\beta_0)/\hat\sigma^2$ is distributed approximately as $F(k, n-k)$. Therefore,

$$E_0\{(X\hat\beta - X\beta_0)'W^{-1}(X\hat\beta - X\beta_0)/\hat\sigma^2\} = \frac{k(n-k)}{n-k-2}.$$
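The first point above can be seen with a few lines of arithmetic. With toy values $n = 100$ and $\sigma_0^2 = 1$ (chosen so that the factor $\log(\sigma_0^2) + 1$ is positive), the population value $(n-k)(\log(\sigma_0^2) + 1)$ strictly decreases in $k$, so the criterion always rewards adding covariates:

```python
import numpy as np

# Population value of -2 x (residual log-likelihood), minimized over sigma^2:
# E0[-2L] = (n - k) * (log(sigma0^2) + 1) for every candidate model with A0 in A.
# Toy values: n = 100, sigma0^2 = 1, so the factor log(sigma0^2) + 1 equals 1.
n, sigma0_sq = 100, 1.0
vals = [(n - k) * (np.log(sigma0_sq) + 1.0) for k in range(1, 11)]
print(vals)  # strictly decreasing in k: the criterion rewards every added covariate
```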
Putting everything together, we have the following corrected residual information criterion, which we shall refer to as RICc:

$$\mathrm{RICc} = n\log(\hat\sigma^2) + k + \frac{4(k+1)}{n-k-2},$$

after omitting a constant $n+2$. Note that $\mathrm{AIC} = n\log(\tilde\sigma^2) + 2k$ and $\mathrm{AICc} = n\log(\tilde\sigma^2) + 2n(k+1)/(n-k-2)$, where $\tilde\sigma^2$ is the MLE of $\sigma_0^2$. We can decompose the first term of RICc, AIC and AICc as $n\log(\mathrm{RSS}) - n\log(n-k)$, $n\log(\mathrm{RSS}) - n\log(n)$ and $n\log(\mathrm{RSS}) - n\log(n)$, respectively. Thus, the complexity penalties for RICc, AIC and AICc are $-n\log(n-k) + k + 4(k+1)/(n-k-2)$, $-n\log(n) + 2k$ and $-n\log(n) + 2n(k+1)/(n-k-2)$, respectively. It can be seen that RICc has a larger penalty than AIC and a smaller penalty than AICc when $n \gg k$.

4 Concluding Remarks

In fitting a model to data, one is required to choose a set of candidate models, a fitting procedure and a criterion for comparing competing models. A minimal requirement for a reasonable criterion is that the population version of the criterion is uniquely minimized by the set of parameters that generates the data. The population version of the residual likelihood information criterion is minimized by the full model and thus fails to meet this basic requirement. Therefore, the residual likelihood cannot be used as a discrepancy measure between models. A simple remedy is to use the likelihood-based Kullback-Leibler divergence. Although Shi and Tsai's RIC may be a legitimate criterion in its own right, our arguments show that it is not motivated by the right principle. Had one followed their motivation, RIC (i.e., RIC* in our notation) would have always chosen the full model.
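The penalty comparison above can be checked numerically. The sketch below assumes NumPy, with `penalties` as a hypothetical helper computing the three penalties after the common $n\log(\mathrm{RSS})$ goodness-of-fit term is removed:

```python
import numpy as np

def penalties(n, k):
    """Complexity penalties of RICc, AIC and AICc after removing the
    common n*log(RSS) goodness-of-fit term."""
    ricc = -n * np.log(n - k) + k + 4.0 * (k + 1) / (n - k - 2)
    aic = -n * np.log(n) + 2 * k
    aicc = -n * np.log(n) + 2.0 * n * (k + 1) / (n - k - 2)
    return ricc, aic, aicc

for n, k in [(50, 3), (200, 5), (1000, 10)]:
    ricc, aic, aicc = penalties(n, k)
    print(n, k, aic < ricc < aicc)  # True in each case: AIC < RICc < AICc
```

For example, at $n = 50$, $k = 3$ the penalties are roughly $-189.2$ (RICc), $-189.6$ (AIC) and $-186.7$ (AICc), consistent with the claimed ordering for $n \gg k$.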
However, Shi and Tsai's RIC, though motivated by the wrong principle (using the residual likelihood instead of the likelihood) and dangerously ignoring an important term $\log|X'X|$ in the approximation, has good small-sample performance in their simulations. Additionally, Shi and Tsai's RIC has been successfully applied in a number of settings, such as normal linear regression, Box-Cox transformation, inverse regression models (Ni et al., 2005) and longitudinal data analysis (Azari et al., 2006). This success may be understood as a consequence of Shi and Tsai's RIC resembling BIC. Despite the increasing popularity of RIC, Shi and Tsai's RIC remains unmotivated, and finding a justification for it is left as a future research topic.

References

Akaike, H. (1970). Statistical predictor identification. Annals of the Institute of Statistical Mathematics, 22, 203-217.

Azari, R., Li, L., and Tsai, C.-L. (2006). Longitudinal data model selection. Computational Statistics and Data Analysis, 50, 3053-3066.

Diggle, P.J., Heagerty, P.J., Liang, K.-Y. and Zeger, S.L. (2002). Analysis of Longitudinal Data (2nd edition). Oxford: Oxford University Press.

Harville, D.A. (1974). Bayesian inference for variance components using only error contrasts. Biometrika, 61, 383-385.

Hurvich, C.M., and Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297-307.

Ni, L., Cook, R.D., and Tsai, C.-L. (2005). A note on shrinkage sliced inverse regression. Biometrika, 92, 242-247.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.

Shao, J. (1997). An asymptotic theory for linear model selection (with discussion). Statistica Sinica, 7, 221-264.

Shi, P. and Tsai, C.-L. (2002). Regression model selection - a residual likelihood approach. Journal of the Royal Statistical Society B, 64, 237-252.
