Comment: Struggles with Survey Weighting and Regression Modeling

Comment: Struggles with Survey Weighting and Regression Modeling [arXiv:0710.5005]

Authors: Robert M. Bell, Michael L. Cohen

Statistical Science 2007, Vol. 22, No. 2, 165–167. DOI: 10.1214/088342307000000177. Main article DOI: 10.1214/088342306000000691. © Institute of Mathematical Statistics, 2007.

Robert M. Bell is Member, Statistics Research Department, AT&T Labs–Research, 180 Park Avenue, Florham Park, New Jersey 07932, USA (e-mail: rbell@research.att.com). Michael L. Cohen is Study Director, Committee on National Statistics, National Academies, Room 1135 Keck Center, 500 5th St., N.W., Washington, District of Columbia 20001, USA (e-mail: mcohen@nas.edu).

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2007, Vol. 22, No. 2, 165–167. This reprint differs from the original in pagination and typographic detail.

Andrew Gelman’s article “Struggles with survey weighting and regression modeling” addresses the question of what approach analysts should use to produce estimates (and associated estimates of variability) based on sample survey data. Gelman starts by asserting that survey weighting is a “mess.” While we agree that incorporation of the survey design for regression remains challenging, with important open questions, many recent contributions to the literature have greatly clarified the situation. Examples include relatively recent contributions by Pfeffermann and Sverchkov (1999), Graubard and Korn (2002) and Little (2004). Gelman’s paper is a very welcome addition to that literature.

There are some understandable reasons for the current lack of resolution. First, U.S. federal statistical agencies have been historically limited by their mission statements to producing statistical summaries, primarily means, percentages, ratios and cross-classified tables of counts. This is one explanation for why Cochran (1977) and Kish (1965) devote the great majority of their classical texts to these estimates. As a result, the job of using regression and other more complex models to learn about any causal structure underlying these summary statistics was generally left to sister policy agencies and outside users.

However, things are changing. The federal statistical system (whether it likes it or not) is becoming more involved with complex modeling. This includes small-area estimation (e.g., unemployment estimates and census net undercoverage estimates) and research into models combining information from surveys with administrative data. (There will also likely be increased demands to use data mining procedures on federal statistical data.) This relatively new development has likely motivated several of the recent contributions on how to account for the sample design in complex models. Therefore, Gelman’s article and the resulting discussion come at an important time.

Another reason for the failure to resolve this class of problems is that this general issue is not easy. Attempts to resolve this problem raise a number of clashing perspectives, including: (1) whether to be model-based or design-based in one’s inference, (2) whether to take a Bayesian or a frequentist view, (3) whether one’s inference should be conditional on (some of) the observed values of the design variables and other auxiliary data that one might have for the full population, (4) whether one evaluates a procedure based on its small-sample performance or its asymptotic properties, and (5) whether one wants an algorithm specific to a particular regression model or something more omnibus.
A variety of general schemes have been proposed to deal with this hard problem, and several of them can be expressed as members or mixtures of the following pure strategies: (1) use an unweighted analysis of the collected data, which is a pure model-based perspective assuming the model is correct for the entire (super) population, (2) use the inverses of the sample selection probabilities as weights, which derives from a pure design-based perspective and is therefore not dependent on model-based assumptions either, and (3) include the survey design in the model as predictors (Little, 2004). The last strategy, for instance, would make sense if it was obvious that separate models were needed for subgroups defined by the survey variables. Gelman’s paper represents a mixture of strategies (2) and (3).

It is useful to take a closer look at the second example in Section 1.4 of Gelman’s article, which addresses the bias of the race coefficient for predicting log income when the sample is unrepresentative of the population in terms of gender. Like Gelman, we are viewing the problem as one of estimating the “so-called” census regression coefficient, which in this case is the mean log income for whites minus the mean log income for nonwhites in the finite population. Some algebra shows that conditional on the population margins and assuming that data are missing at random, the bias of the race coefficient in a simple unweighted regression of log income on race is approximately proportional to the product of two factors: the proportion of males in the sample minus the proportion in the population and the race–gender interaction for the population. In this simple setting, the bias is equivalently correctable either by weighting the simple regression or with the model-based algorithm outlined by Gelman.
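The algebra behind this statement can be checked with a small sketch. The Python below is an illustration only and does not come from Gelman’s article: the cell means of log income and the male shares are hypothetical numbers, and the sketch assumes that the male share is the same in both race groups. Under that extra assumption the expected bias of the unweighted coefficient equals the gender imbalance times the race–gender interaction exactly, rather than approximately.

```python
def race_coefficient(cell_means, male_share):
    """Expected white-minus-nonwhite mean of log income when a fraction
    `male_share` of each race group is male (an assumption of this sketch)."""
    white = (male_share * cell_means[("white", "male")]
             + (1 - male_share) * cell_means[("white", "female")])
    nonwhite = (male_share * cell_means[("nonwhite", "male")]
                + (1 - male_share) * cell_means[("nonwhite", "female")])
    return white - nonwhite


def race_gender_interaction(cell_means):
    """Race gap among males minus race gap among females."""
    male_gap = cell_means[("white", "male")] - cell_means[("nonwhite", "male")]
    female_gap = cell_means[("white", "female")] - cell_means[("nonwhite", "female")]
    return male_gap - female_gap


# Hypothetical cell means of log income (illustrative values only).
cell_means = {
    ("white", "male"): 10.8,
    ("white", "female"): 10.3,
    ("nonwhite", "male"): 10.2,
    ("nonwhite", "female"): 10.1,
}
pop_male = 0.49      # assumed male share in the population
sample_male = 0.40   # assumed male share in the sample

target = race_coefficient(cell_means, pop_male)         # census coefficient
unweighted = race_coefficient(cell_means, sample_male)  # expected unweighted estimate
bias = unweighted - target
predicted_bias = (sample_male - pop_male) * race_gender_interaction(cell_means)

# Poststratification weights restore the population gender distribution.
w_male = pop_male / sample_male
w_female = (1 - pop_male) / (1 - sample_male)
weighted_male_share = (sample_male * w_male
                       / (sample_male * w_male + (1 - sample_male) * w_female))
reweighted = race_coefficient(cell_means, weighted_male_share)

print(bias, predicted_bias, reweighted - target)
```

Under these assumptions the reweighted coefficient coincides with the census coefficient, which mirrors the statement that weighting removes the bias in this simple setting.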
Given the large interaction stipulated in the article, it is imperative that the bias be corrected, assuming a nontrivial deviation in terms of gender between the sample and the population. However, that may not be true in general. Whether one should try to correct for bias should also take into account the impact on the variance. Either weighting or modeling inflates the variance of the resulting estimate for the race coefficient. Unlike the bias, the added variance depends only on how much the distribution of gender in the sample differs from that in the population and not on the size of the interaction. When the true interaction is very small, the mean-squared error will increase if we try to correct for the bias either through weighting or modeling. On the other hand, for sufficiently large interaction effects, the correction decreases mean-squared error. The sample imbalance does not affect whether one is better off correcting, but only the magnitude of the expected benefit or harm from the correction.

In general, the size of the true interaction that implies one should correct for bias is on the order of the empirical uncertainty associated with the estimated interaction, so it is impossible to conclude with much confidence that correcting for bias is the wrong strategy. Consequently, it is a no-brainer to simply correct for the bias by either weighting or modeling, unless one has strong prior evidence that the interaction truly is very small.

However, surveys often have many potential stratifying variables, perhaps including some like state, with dozens of levels. For example, consider a longitudinal study where we would like the follow-up sample after nonresponse to represent the baseline sample. There may be dozens or even hundreds of variables on which we would like to balance.
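For the two-variable race and gender example discussed above, the trade-off between bias and variance can be made concrete with a short sketch. It is an illustration under added assumptions that are not stated in the text: a common within-cell variance `sigma**2`, the same male share in both race groups, and variances taken conditional on the realized cell counts. The function name and all inputs are hypothetical.

```python
import math


def mse_comparison(interaction, sample_male, pop_male,
                   n_white, n_nonwhite, sigma=1.0):
    """Return (mse_unweighted, mse_corrected, var_interaction) for the
    race coefficient under the simplifying assumptions of this sketch."""
    imbalance = sample_male - pop_male
    # Variance of the unweighted difference in means.
    base_var = sigma ** 2 * (1.0 / n_white + 1.0 / n_nonwhite)
    # The unweighted estimate is biased by imbalance * interaction.
    mse_unweighted = base_var + (imbalance * interaction) ** 2
    # Poststratifying on gender removes the bias but inflates the variance
    # by a factor that depends only on the imbalance.
    cell_spread = sample_male * (1.0 - sample_male)
    mse_corrected = base_var * (1.0 + imbalance ** 2 / cell_spread)
    # Sampling variance of the estimated race-gender interaction.
    var_interaction = base_var / cell_spread
    return mse_unweighted, mse_corrected, var_interaction


# Hypothetical design: 400 whites, 100 nonwhites, 40% male in the sample.
_, _, var_int = mse_comparison(0.0, 0.40, 0.49, 400, 100)
se_interaction = math.sqrt(var_int)

for true_interaction in (0.5 * se_interaction, 2.0 * se_interaction):
    unweighted, corrected, _ = mse_comparison(
        true_interaction, 0.40, 0.49, 400, 100)
    print(true_interaction, unweighted, corrected)
```

In this sketch the corrected and unweighted mean-squared errors differ by the squared imbalance times the difference between the variance of the estimated interaction and the squared true interaction. The imbalance therefore scales the gain or loss without changing its sign, and the break-even interaction is one standard error of the estimated interaction.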
Even with a few variables, it quickly becomes impractical either to form a complete cross-classification for weighting or to fit a model that represents all interactions of the original model covariates with variables related to the sample design. Some sort of compromise is imperative, and the question is how to choose it.

Survey practitioners use all sorts of compromises: at the crudest level, cross-classification while omitting some variables and/or collapsing values for other variables; raking or propensity score weights based on logistic regression of response at follow-up using selected interactions; and tools like weighting cells and weight trimming to control the variability of estimates. Modelers have an equally varied assortment of options at their disposal.

Does it matter whether one uses weights or a model-based approach? As Gelman shows, there is a correspondence between the corrections available by modeling versus weighting, so either path can work well. What matters most, we believe, is that decisions about which variables and interactions to use should be informed by the interactions that actually predict the outcome. In particular, even though weights can be created without even looking at the outcome, the best weights are likely to be ones that were informed by an appropriate model.

Gelman’s hierarchical regression model approach has some very appealing features. It supports the use of rich models of the dependent variables while at the same time reducing the chance of overfitting. Rather than treating interaction terms as either in or out, shrinking estimated interactions adaptively often improves predictive accuracy, and, most likely, bias correction.
These models also provide a principled basis for inference, which is hard to argue if “design-based” weights are chosen based on a modeling exercise. Finally, the paper helps to clarify the relationship between modeling and weighting for bias correction, by demonstrating that the modeling methodology implies the use of weights. This is important because weighting offers several practical benefits. These include (a) the ability to use standard software routines, (b) avoidance of the need to fit large models with many interactions (fixed and/or random effects) every time one wants to estimate even the simplest new regression model, and (c) the potential to provide for users of data from a government agency a simple way to produce near-optimal results.

Point (c) is somewhat Pollyannish and in need of some amplification. The ideal weights would vary from regression to regression, and the use of these weights would create a lot of work and would greatly complicate comparisons across analyses. To the extent that constant weights were proposed for use, one would want the weights to be such that they would work reasonably well across a range of potential regression analyses. Which terms to include in either a design-based or a model-based solution should depend on the size of various interactions on the dependent variables of interest. Unfortunately, a good set of weights for one regression analysis may be quite poor for another one. However, possibly the outcome variables could be grouped and a set of weights identified that work reasonably well for the entire group of variables.
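Raking, one of the compromises mentioned earlier, is a familiar way to build a single set of constant weights from marginal totals alone, without forming the full cross-classification of the population. The sketch below is a minimal pure-Python version of iterative proportional fitting on a two-way table. The function name, the cell counts and the target margins are hypothetical, and the sketch assumes every sample cell is nonempty and that the two sets of targets share the same total.

```python
def rake(sample_counts, row_targets, col_targets, max_iter=200, tol=1e-10):
    """Return one weight per cell so that weighted row and column totals
    match the target margins (iterative proportional fitting)."""
    fitted = [[float(count) for count in row] for row in sample_counts]
    n_rows, n_cols = len(fitted), len(fitted[0])
    for _ in range(max_iter):
        # Scale each row to its target total.
        for i in range(n_rows):
            scale = row_targets[i] / sum(fitted[i])
            fitted[i] = [value * scale for value in fitted[i]]
        # Scale each column to its target total.
        for j in range(n_cols):
            column_total = sum(fitted[i][j] for i in range(n_rows))
            scale = col_targets[j] / column_total
            for i in range(n_rows):
                fitted[i][j] *= scale
        # Columns now match exactly; stop once the rows match as well.
        row_gap = max(abs(sum(fitted[i]) - row_targets[i])
                      for i in range(n_rows))
        if row_gap < tol:
            break
    # Weight shared by every respondent in a cell.
    return [[fitted[i][j] / sample_counts[i][j] for j in range(n_cols)]
            for i in range(n_rows)]


# Hypothetical sample counts: rows are race groups, columns are genders.
counts = [[30, 20], [25, 25]]
row_targets = [75.0, 25.0]   # assumed population totals by race
col_targets = [49.0, 51.0]   # assumed population totals by gender

weights = rake(counts, row_targets, col_targets)
print(weights)
```

Weights of this kind match only the margins supplied, so they correct for an interaction between the raking variables only to the extent that the sample already reflects it. This is one reason the choice of terms should be guided by which interactions predict the outcomes of interest.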
We hope that researchers continue to investigate, as Gelman has suggested, the relationship between weighting and modeling to try to develop approaches that enjoy the best of both worlds, in particular that are omnibus for a variety of estimands of interest. Returning to the federal statistical system, given its need to produce a large number of estimates, often disaggregated demographically and geographically, for its large and diverse user community, there is an important advantage to more general-purpose and easy-to-apply methods.

Finally, we have a couple of questions or issues that could use further work or explication:

• Gelman’s method for estimating and producing inferences for census regression parameters relies on a hierarchical regression model, so it is important to understand the quality of fit of that model. However, for hierarchical regression models estimated using data from a complex sample, notions of standardized residuals and leverage and their use in assessing linearity, influence, variance heterogeneity, and so on, are quite complicated. Further, even with adequate diagnostics for hierarchical regression models, those diagnostics will not assess the influence of particular data points on the census population regression estimates. It would be valuable to investigate these issues further. (Initial efforts toward incorporating sample weights in diagnostic plots have been taken by Korn and Graubard, 1995.)

• Finally, although Gelman focuses on the goal of estimating linear regression parameters, he mentions that his techniques may extend to logistic regression. Modern data analysis makes use of a much wider variety of techniques, as found, for example, in Hastie, Tibshirani and Friedman (2001). For example, in classification and regression trees, the parameters play a very different role, and iterative steps are used to “grow” the tree. It is unclear how either a model-based or a weighting approach should be used in either growing classification or regression trees, or in assessing their performance on a training sample that was collected from a complex sample design. Research on the interface of these problems would be valuable.

In summary, Gelman’s research makes very valuable contributions to the question of how to carry out regression modeling from complex samples. Clearly, as Gelman has stated, more work is needed in this area.

ACKNOWLEDGMENT

We greatly appreciate Phil Kott’s critique of an earlier version of this comment.

REFERENCES

Cochran, W. G. (1977). Sampling Techniques, 3rd ed. Wiley, New York. MR0474575

Graubard, B. I. and Korn, E. L. (2002). Inference for superpopulation parameters using sample surveys. Statist. Sci. 17 73–96. MR1910075

Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Data Mining, Inference and Prediction. Springer, New York. MR1851606

Kish, L. (1965). Survey Sampling. Wiley, New York.

Korn, E. L. and Graubard, B. I. (1995). Examples of differing weighted and unweighted estimates from a sample survey. Amer. Statist. 49 291–295.

Little, R. J. A. (2004). To model or not to model? Competing modes of inference for finite population sampling. J. Amer. Statist. Assoc. 99 546–556. MR2109316

Pfeffermann, D. and Sverchkov, M. (1999). Parametric and semi-parametric estimation of regression models fitted to survey data. Sankhyā Ser. B 61 166–186. MR1720710
