Rejoinder: Fisher Lecture: Dimension Reduction in Regression

arXiv:0708.3774

Authors: R. Dennis Cook

Statistical Science 2007, Vol. 22, No. 1, 40–43
DOI: 10.1214/088342307000000078
Main article DOI: 10.1214/088342306000000682
© Institute of Mathematical Statistics, 2007

R. Dennis Cook is Professor, School of Statistics, University of Minnesota, 224 Church Street S.E., Minneapolis, Minnesota 55455, USA (e-mail: dennis@stat.umn.edu). This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2007, Vol. 22, No. 1, 40–43. This reprint differs from the original in pagination and typographic detail.

1. INTRODUCTION

I am grateful to all of the discussants for their comments, which raise a number of important and insightful issues and add significantly to the breadth of ideas. Following a few introductory comments on the need for a new regression genre that centers on dimension reduction, I turn to the discussants' remarks.

The development in the 1960s and early 1970s of diagnostic methods for regression produced a major shift in regression methodology. When a diagnostic produces compelling evidence of a deficiency in the current model or data, it is natural to pursue remedial action, leading to a new model and a new round of diagnostics, proceeding in this way until the required diagnostic checks are passed. By the late 1970s this type of iterative model development paradigm was widely represented in the applied sciences and was formalized in the statistical literature by Box (1979, 1980) and Cook and Weisberg (1982). With the availability of desktop computing starting in the mid-1980s, it is now possible to apply in reasonable time batteries of graphical and numerical diagnostics to many regressions.

Advances in computing and other technologies now allow scientists to routinely formulate regressions in which the number p of predictors is considerably larger than that normally considered in the past. Such large-p regressions necessitate a new type of analysis for at least two reasons. First, the standard iterative paradigm for model development can become untenable when p is large. Recognizing the variety of graphical diagnostics that could be used and the possibility of iteration, a thorough analysis might require assessment of many plots in addition to various numerical diagnostics. Experience has shown that the paradigm can often become imponderable when applied with too many predictors. Second, in some regressions, particularly those associated with high-throughput technologies, the sample size n may be smaller than p, leading to operational problems in addition to ponderability difficulties. These issues have caused a shift in the applied sciences toward a different regression genre with the goal of reducing the dimensionality of the vector X ∈ R^p of predictors as a first step in the analysis, effectively raising an old idea to a position of prominence.

Today, dimension reduction is ubiquitous in the applied sciences, represented primarily by principal component methodology. Fifteen years ago I rarely encountered intra-university scientists seeking help with principal component reductions in regression. Such settings no longer seem unusual. The reasons for this are as indicated previously: While I occasionally see problems with n < p, more frequently n is several times p, while p itself is too large for a full commitment to iterative model development guided by diagnostics.
This, in addition to the reasons stated in Section 2 of the article, leads me to conclude that the case for dimension reduction methodology has been made: methodology based on firm parametric foundations with subsequent robust and nonparametric counterparts. Whether the ideas and methodological directions I proposed will meet this goal is less clear, but I am still convinced that they hold promise when X and Y are jointly distributed.

In contrast to a comment by Christensen, I think parametric dimension reduction is currently as important, if not more important, than other forms, partly because dimension reduction methodology has existed mostly in a world apart from core Fisherian theory, making it difficult to appreciate what could be achieved. For this reason I welcome Christensen's development of connections with multivariate linear model theory.

2. APPLICABILITY

According to Christensen, a key issue in the development of models (2), (5), (10) and (13) is whether they are "broadly reasonable." I agree. Moreover, the emerging picture does seem to be one of broad reasonableness for the reasons indicated in the following sections.

2.1 First Applications

As mentioned in Section 6.5 of the article, I have used the reductive model (13) on several standard data sets from the literature and nearly always have found d < p, and often substantially so. Additionally, I recently applied model (13) to a consulting problem with about 35 predictors and a little less than 300 observations. Again d ≪ p with an apparently good fit.

The logo data analyzed by Li and Nachtsheim offer another opportunity to test the proposed methodology. To complement their analysis I first used Grassmann optimization to fit the reductive model (13) with d = 1 and the cubic option for f_y, without predictor screening. The log likelihood ratio statistic (cf.
Section 6.5) for comparing this fit to the full model has the value Λ_1 = 94.02 on r(p − 1) = 63 degrees of freedom, for a nominal p-value of 0.007. The same process with other values of d gave Λ_2 = 66.76, Λ_3 = 65.1 and Λ_21 = 0.076, with corresponding p-values 0.26, 0.21 and 0.99. Consequently, I inferred that the sufficient reduction Γ^T X is two-dimensional. This analysis provides another instance in support of the possibility that the proposed models are broadly reasonable. Grassmann optimization is preferable to evaluating the log likelihood at various combinations of the eigenvectors of Σ̂ or Σ̂_fit, although the latter can sometimes provide a useful approximation.

There are several ways in which an estimated sufficient reduction Γ̂^T X could be used to continue an analysis, depending on application-specific requirements. As long as d is sufficiently small, as it is in the logo data, the standard model-diagnostic paradigm might be used to develop a forward model for the regression of Y on Γ̂^T X. In cases where a full forward model is not essential, we could proceed directly to estimation of the forward mean function based on the relation

    E(Y | X = x) = E{Y N(x | Y)} / E{N(x | Y)},

where N is the normal density of X | Y and the expectations are with respect to the marginal distribution of Y. This can be estimated using

    Ê(Y | X = x) = Σ_{i=1}^n y_i N̂(x | y_i) / Σ_{i=1}^n N̂(x | y_i),

where N̂ denotes the estimated density obtained by substituting parameter estimates. The estimated mean function allows construction of residuals y_i − Ê(Y | X = x_i) that can be plotted to check the forward mean function implied by the inverse model. Here the residuals are used as a final diagnostic check, and not necessarily as an integral part of a model building process.
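The ratio estimator above is straightforward to compute once the inverse model has been fitted. A minimal sketch follows, assuming a fitted inverse mean function E(X | Y = y) and a fitted common covariance Δ̂ are already in hand; the function and argument names are illustrative, not from the article:

```python
import numpy as np

def inverse_mean_estimate(x, ys, mean_fn, Delta):
    """Estimate E(Y | X = x) from a fitted normal inverse model.

    ys      : (n,) observed responses y_1, ..., y_n
    mean_fn : callable returning the fitted inverse mean E(X | Y = y)
    Delta   : (p, p) fitted common covariance of X | Y
    Returns sum_i y_i * N_hat(x | y_i) / sum_i N_hat(x | y_i).
    """
    Dinv = np.linalg.inv(Delta)
    # Gaussian density of X | Y = y_i at x; the normalizing constant
    # is the same for every i and cancels in the ratio, so it is omitted
    w = np.array([np.exp(-0.5 * (x - mean_fn(y)) @ Dinv @ (x - mean_fn(y)))
                  for y in ys])
    return np.sum(ys * w) / np.sum(w)
```

Because the weights are a softmax-like function of Mahalanobis distance, observations whose fitted inverse mean lies close to x dominate the estimate, as the population relation suggests.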
2.2 Supervised Principal Components

Li and Nachtsheim developed a fascinating connection between the reductive model (13) and the supervised principal component (SPC) setting of Bair et al. (2006). The latent variable model used by Bair et al. (2006, Section 2.2) and described by Li and Nachtsheim leads to the PFC model (5) or the reductive model (13), depending on the restrictions placed on the covariance matrix of the errors (ε, ε_1, ..., ε_p). However, Bair et al. did not pose or use (5) or (13) as their basis for estimation, but instead used principal components. Thus, the method of SPCs is more in line with the PC model (2). Conformably partition X = (X_1^T, X_2^T)^T and Γ = (Γ_1^T, Γ_2^T)^T, and assume that Γ_2 = 0. Then under the PC model (2), the sufficient reduction is just Γ_1^T X_1 and the maximum likelihood estimator of span(Γ_1) is the span of the first d eigenvectors of the sample version of Var(X_1). This leads to exactly the analysis suggested by Bair et al.: Use initial screening to identify X_2 and then compute the first (or first few) principal components for X_1. I expect that PFC reductions based on model (5) or sufficient reductions based on model (13) can do much better than SPCs, particularly if prior screening is based on an information criterion like AIC or BIC applied in the context of the inverse model.

This expectation is supported by the logo data, since there is only a very weak relationship between the first two principal components and the estimated sufficient reduction based on model (13). For instance, the value of R² from the linear regression of the first principal component on Γ̂^T X is only 0.11.
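The screening-then-PCA recipe just described can be sketched in a few lines. The marginal correlation screen and the fixed cutoff below are illustrative assumptions, not Bair et al.'s exact tuning procedure:

```python
import numpy as np

def spc_directions(X, Y, d=1, keep=10):
    """Supervised principal components in outline: screen predictors on
    their marginal association with Y to identify the retained block X_1,
    then take the leading eigenvectors of the sample Var(X_1), which is
    the MLE of span(Gamma_1) under the PC model (2) with Gamma_2 = 0."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    # screening statistic: |sample correlation of each predictor with Y|
    score = np.abs(Xc.T @ Yc) / (np.linalg.norm(Xc, axis=0)
                                 * np.linalg.norm(Yc))
    idx = np.argsort(score)[::-1][:keep]      # columns defining X_1
    # leading right singular vectors of the centered retained block
    # = leading eigenvectors of the sample Var(X_1)
    _, _, Vt = np.linalg.svd(Xc[:, idx], full_matrices=False)
    return idx, Vt[:d].T                      # retained columns, directions
```

Replacing the final PCA step with a PFC or reductive-model fit on the retained block is the stronger form of supervision argued for in the text.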
The upper left plot in Li and Nachtsheim's Figure 1 was computed using the first principal standardized component, the first principal component computed after standardizing each predictor to have a marginal sample standard deviation of 1 (personal communication). The value of R² from the linear regression of the first principal standardized component on Γ̂^T X is 0.60. In this example the first principal standardized component outperformed the first principal component, although this need not always be so (cf. Section 7.3). The general points in my discussion of predictor standardization were captured nicely by Christensen's summary: Standardization is necessary, unless the regression can be described reasonably by the PC model (2) or the PFC model (5), but the common practice of marginal predictor scaling is not sufficient, and may even be counterproductive.

On balance, SPCs are based on relatively weak supervision, since the response is used only in the screening phase. Models (5) and (13) allow more complete supervision, whether used with prior screening or not.

2.3 Variance Reduction versus Bias

Consider a regression in which model (13) holds with d = p, which is equivalent to model (17), and suppose that no version of model (13) holds strictly with d < p. In that case, fitting (13) with d < p will result in bias along with a reduction in variance. It is conjectured that often the reduction in variance will outweigh the increase in bias, resulting in a reduction in mean squared error. In other words, there may be reason to pursue models like (13) with d < p even when they are "incorrect."

2.4 Partial Least Squares

Like Christensen, I have had misgivings about partial least squares and found its apparent popularity in some areas, particularly chemometrics, to be a bit curious.
However, the developments in the article are causing me to reconsider. In Section 7.4 of the article I developed a connection between OLS and the inverse model (17). Here I establish a connection with partial least squares via the population relationship between OLS and model (13) with d < p. For notational convenience, let M = Ω² + β Var(f_Y) β^T and C = Cov(X, Y). Suppose that we wish to estimate the population OLS coefficient vector B = Σ⁻¹ Cov(X, Y). From Proposition 4, Σ = Γ_0 Ω_0² Γ_0^T + Γ M Γ^T, and thus M⁻¹ = (Γ^T Σ Γ)⁻¹ and Σ⁻¹ = Γ_0 (Ω_0²)⁻¹ Γ_0^T + Γ M⁻¹ Γ^T. Now,

    B = Σ⁻¹ Γ β Cov(f_Y, Y)
      = Γ (Γ^T Σ Γ)⁻¹ β Cov(f_Y, Y)
      = P_Γ(Σ) B
      = Γ (Γ^T Σ Γ)⁻¹ Γ^T Cov(X, Y).

This says that B ∈ S_Γ and that we can construct a marginal estimator of B by projecting the usual moment estimate B̂ onto Ŝ_Γ relative to the Σ̂ inner product. The population version of the partial least squares coefficients follows this same pattern, B_pls = P_K(Σ) B (Helland, 1992; Helland and Almøy, 1994; Naik and Tsai, 2000), except that S_Γ is replaced by the cyclic subspace S_K spanned by K = (C, ΣC, Σ²C, ..., Σ^{q−1}C) for some integer q that is often chosen by cross-validation in practice. Assume that C can be written as a linear combination of at most q eigenvectors of Σ. Then B ∈ S_K (Naik and Tsai, 2000) and consequently the marginal estimator of B from model (13) and the PLS estimator have the same basic form, but differ on the method of estimating an upper bound (S_K or S_Γ) for span(B). Partial least squares estimates can be computed without inversion of Σ̂, which seems to be one of their attractions. In short, the proposed methods inherit support from the apparent reasonableness of partial least squares in some contexts.
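The population PLS projection onto the cyclic subspace can be written out directly. The sketch below builds K column by column and projects B in the Σ inner product; it is an illustration of the Helland (1992) characterization cited above, not production PLS code:

```python
import numpy as np

def pls_population_coefs(Sigma, C, q):
    """Population PLS coefficient vector B_pls = P_K(Sigma) B, where
    K = (C, Sigma C, ..., Sigma^{q-1} C) spans the cyclic (Krylov)
    subspace S_K and the projection uses the Sigma inner product."""
    p = Sigma.shape[0]
    K = np.empty((p, q))
    v = C.copy()
    for j in range(q):                 # build the Krylov basis
        K[:, j] = v
        v = Sigma @ v
    # P_K(Sigma) B = K (K^T Sigma K)^{-1} K^T Sigma B
    #             = K (K^T Sigma K)^{-1} K^T C,  since Sigma B = C,
    # so Sigma never needs to be inverted, as noted in the text
    return K @ np.linalg.solve(K.T @ Sigma @ K, K.T @ C)
```

When C is a combination of at most q eigenvectors of Σ, the Naik and Tsai (2000) result says B ∈ S_K, so the projection returns B = Σ⁻¹C exactly.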
Principal components, principal fitted components, partial least squares and reductive methods based on the inverse model (13) all attempt to make use of information from the marginal distribution of X when inferring about Y | X. This distinguishes them from methods like OLS, SIR, RMAVE and the lasso that apparently do not consider such information. The simulation results in the article show that substantial gains over forward methods are possible when X contains information on Y | X via Σ. It seems fair to conclude that the models considered in the article will be broadly reasonable at least within the class of regressions where X is informative about Y | X.

3. NONCONSTANT VAR(X | Y)

The development throughout the article was based on the assumption that Var(X | Y) is constant. If Var(X | Y) is not constant, then the reductions described here may no longer be sufficient, although they will still be functions of sufficient reductions. Li and Nachtsheim describe an extreme case of this where the forward mean function E(Y | X) is quadratic in one of the predictors without a linear trend. For example, when E(Y | X) = X_1² and X_1 is symmetrically distributed about 0, E(X_1 | Y) = 0 while Var(X_1 | Y) will be nonconstant. The potential for this kind of setting can be detected by using a numerical diagnostic for heteroscedasticity in the context of fitting a simple linear model to X_j | Y, j = 1, ..., p.

B. Li has taken an interesting first step in the study of models that allow for nonconstant Var(X | Y). His Theorem 2.2 shows that we can construct sufficient reductions for both the conditional mean and the conditional variance, and thus cover settings like that described by L. Li and Nachtsheim.
In the context of his model (5), the methods of this article should be useful for estimating span(Γ_1), but they will not be able to identify the part of span(Γ_2) that is not contained in span(Γ_1). However, even if span(Γ_1) = span(Γ_2), there should be efficiency gains from considering models like B. Li's (5). These gains are nicely illustrated by Li's examples. Additionally, analyses based on heteroscedastic inverse models will likely encounter inference issues not considered in the article. For instance, it may be of interest to test whether span(Γ_2) ⊆ span(Γ_1) or vice versa.

4. PRINCIPAL COMPONENTS

B. Li presented an intriguing explanation for why the response tends to have a higher correlation with the first principal component than any other component, while concluding that dimension reduction via X | Y should offer substantial gains. I expect that his Conjecture 1.1 is correct and that it offers a partial explanation for the popularity of reduction to a few leading principal components. We may be able to gain additional insights into the situation by reasoning as follows, using Li's notation and model. Recall that X is N(0, Σ), Y = β^T X + ε and X ⊥⊥ ε. Hold fixed β with ‖β‖ = 1 and Σ with eigenvalues λ_1 > ··· > λ_p > 0, and assume that the errors ε are normally distributed with mean 0 and variance σ²_ε. Then the squared correlation coefficient between Y and the first principal component v_1^T X can be expressed as

    ρ_1² = (β^T v_1)² λ_1 / (σ²_ε + Σ_{j=1}^p (β^T v_j)² λ_j).

If the eigenvalues λ_j are roughly the same, then the magnitude of ρ_1 is controlled by σ²_ε and the angles between β and the eigenvectors v_j. However, if β^T v_1 ≠ 0 and λ_1 is sufficiently larger than both λ_2 and σ²_ε, then ρ_1² will be close to 1.
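The closed-form expression for ρ_1² is easy to verify numerically against a simulation under the stated model. In the sketch below all dimensions, eigenvalues and the seed are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
lam = np.array([6.0, 2.0, 1.0, 0.5])                # eigenvalues of Sigma
V = np.linalg.qr(rng.standard_normal((p, p)))[0]    # eigenvectors v_1..v_p
Sigma = V @ np.diag(lam) @ V.T
beta = rng.standard_normal(p)
beta /= np.linalg.norm(beta)                        # ||beta|| = 1
sig2 = 1.0                                          # error variance

# closed form: rho_1^2 = (b'v_1)^2 lam_1 / (sig2 + sum_j (b'v_j)^2 lam_j)
bv = V.T @ beta
rho1_sq = bv[0] ** 2 * lam[0] / (sig2 + np.sum(bv ** 2 * lam))

# Monte Carlo estimate of the same squared correlation
n = 200_000
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Y = X @ beta + np.sqrt(sig2) * rng.standard_normal(n)
rho_hat = np.corrcoef(Y, X @ V[:, 0])[0, 1] ** 2
```

Rerunning with λ_1 inflated (say λ_1 = 60) pushes ρ_1² toward 1, matching the remark about λ_1 dominating λ_2 and σ²_ε.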
This might be taken to suggest that reduction to the first principal component is desirable, particularly when λ_1 ≫ λ_2. However, Y | v_1^T X is normally distributed with mean E(Y | v_1^T X) = β^T v_1 v_1^T X and variance

    Var(Y | v_1^T X) = σ²_ε + Σ_{j=2}^p (β^T v_j)² λ_j.

This distribution does not depend on the value of λ_1. Provided β ≠ v_1,

    Var(Y | v_1^T X) > Var(Y | β^T X),

reflecting the fact that v_1^T X is not a sufficient reduction and consequently that conditioning on v_1^T X does not exhaust the information that X contains about Y, regardless of the magnitude of ρ_1.

5. BINARY PREDICTORS

The main thrust of my article is on normal inverse models, but I also suggested how the ideas could be applied with conditional predictors X | Y from other families. I was particularly pleased to see that Li and Nachtsheim implemented an algorithm for regressions with all binary predictors, but was simultaneously a bit disappointed to see that it did not work out as crisply as expected. Their suggestion of a majorization strategy to resolve the optimization issues is excellent and will likely produce a stable and practically useful algorithm. Majorization may not be essential for similar algorithms developed for principal fitted components (ν_y = β f_y), although it might still improve performance.

ACKNOWLEDGMENTS

The author is grateful to the Executive Editor, Ed George, for arranging this discussion and to Liliana Forzani for helpful comments. Research for this discussion was supported in part by National Science Foundation Grant DMS-04-05360.

REFERENCES

Bair, E., Hastie, T., Paul, D. and Tibshirani, R. (2006). Prediction by supervised principal components. J. Amer. Statist. Assoc. 101 119–137. MR2252436

Box, G. E. P. (1979). Robustness in the strategy of scientific model building. In Robustness in Statistics (R. Launer and G. Wilkinson, eds.)
201–235. Academic Press, New York.

Box, G. E. P. (1980). Sampling and Bayes' inference in scientific modeling and robustness (with discussion). J. Roy. Statist. Soc. Ser. A 143 383–430. MR0603745

Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall, London. MR0675263

Helland, I. S. (1992). Maximum likelihood regression on relevant components. J. Roy. Statist. Soc. Ser. B 54 637–647. MR1160488

Helland, I. S. and Almøy, T. (1994). Comparison of prediction methods when only a few components are relevant. J. Amer. Statist. Assoc. 89 583–591. MR1294085

Naik, P. and Tsai, C.-L. (2000). Partial least squares estimator for single-index models. J. R. Stat. Soc. Ser. B Stat. Methodol. 62 763–771. MR1796290
