Comment: Bayesian Checking of the Second Levels of Hierarchical Models [arXiv:0802.0743]
Authors: Andrew Gelman
Statistical Science 2007, Vol. 22, No. 3, 349–352. DOI: 10.1214/07-STS235A. Main article DOI: 10.1214/07-STS235. © Institute of Mathematical Statistics, 2007

Bayarri and Castellanos (BC) have written an interesting paper discussing two forms of posterior model check, one based on cross-validation and one based on replication of new groups in a hierarchical model. We think both these checks are good ideas and can become even more effective when understood in the context of posterior predictive checking. For the purpose of discussion, however, it is most interesting to focus on the areas where we disagree with BC:

1. We have a different view of model checking. Rather than setting the goal of having a fixed probability of rejecting a true model and a high probability of rejecting a false model, we recognize ahead of time that our model is wrong and view model checking as a way to explore and understand differences between model and data.
2. BC focus on p-values and scalar test statistics. We favor graphical summaries of multivariate test summaries.
3. For BC, it is important that p-values have a uniform distribution (i.e., that they be u-values, in our terminology) under the assumption that the null hypothesis is true. For us, it is important that p-values be interpretable as posterior probabilities comparing replicated to observed data.
4. BC recommend an “empirical Bayes prior p-value” as being better than the posterior predictive p-value. In fact, their empirical Bayes prior p-value is an approximation to a posterior predictive p-value which was recommended for hierarchical models in Gelman, Meng and Stern (1996).
BC miss this connection by not seeing the full generality of posterior predictive checking.

In our discussion, we go through each of the above points in turn and conclude with a comment on the potential importance of theoretical work such as BC’s on the future development of predictive model checking.

Andrew Gelman is Professor, Departments of Statistics and Political Science, Columbia University, New York, New York 10027, USA, e-mail: gelman@stat.columbia.edu. This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2007, Vol. 22, No. 3, 349–352. This reprint differs from the original in pagination and typographic detail.

1. THE GOAL OF MODEL CHECKING: REJECTING FALSE MODELS, OR UNDERSTANDING WAYS IN WHICH THE MODEL DOES NOT FIT DATA

All models are wrong, and the purpose of model checking (as we see it) is not to reject a model but rather to understand the ways in which it does not fit the data. From a Bayesian point of view, the posterior distribution is what is being used to summarize inferences, so this is what we want to check. The key questions then become: (a) what aspects of the model should be checked; (b) what replications should we compare the data to; (c) how to visualize the model checks, which are typically highly multidimensional; (d) what to make of the results?

In a wide-ranging discussion of different methods for Bayesian model checking, BC focus on the above question (d): in particular, how can Bayesian hypothesis testing be set up so that the resulting p-values can be used as a model-rejection rule with specified Type I errors? This question is sometimes framed as a desire for calibration in p-values, but ultimately the desire for calibration is most clearly interpretable within a model-rejection framework.
For example, BC write that some methods “can result in a severe conservatism incapable of detecting clearly inappropriate models.” But it is not at all clear that, just because a model is wrong, it is “inappropriate.” If a model predicts replicated data that are just like the observed data in important ways, it may very well be appropriate for these purposes. Recall that we have already agreed that our models are wrong; we would like to measure appropriateness in a direct way, rather than set a rule that even a true model must be declared “inappropriate” 5% of the time. For example, in the model considered by BC, we do not see the rationale for their testing the hypothesis $\mu = \mu_0$; we would rather just perform Bayesian inference for $\mu$.

Our concerns are thus a bit different from those of BC: we are less concerned about the properties of our procedures in the (relatively uninteresting) case that the model is true, and more interested in having the ability to address the misfit of model to data in direct terms. One reason, perhaps, for the popularity of our posterior predictive approach, in addition to its Bayesian flavor and ease of implementation, is the flexibility that allows us to consider complicated test summaries (including plots of the entire data set, as well as combinations of data and parameters and combinations of observed and missing data), thus bringing the power of exploratory data analysis to the checking of Bayesian models, and conversely bringing the power of Bayesian inference to exploratory data analysis.
Some of the difference in focus can be seen by looking at the graphs in BC (histograms of the null distributions of p-values, curves of predictive densities of unidimensional test summaries, and a single plot of raw data, but with no comparison plots of replicated data) and comparing to the graphs in Gelman (2004) and Gelman et al. (2005), which show various plots of time series and other multidimensional test summaries.

2. THE STEPS OF BAYESIAN MODEL CHECKING

BC begin their paper with a useful characterization of any checking method as having a diagnostic statistic, a distribution for the statistic, and a way to measure conflict with the null distribution. Here we briefly explain how our own applied model checking fits into BC’s three-step framework.

Step 1. BC consider a diagnostic statistic $T(x_{\mathrm{obs}})$ that depends entirely on observed data. In a Bayesian framework, the diagnostic statistic, or test statistic, or discrepancy measure can also depend on parameters (Gelman, Meng and Stern, 1996) and on missing or latent data (Gelman et al., 2005). It can be helpful to look purely at observed data, but the expanded formulation can allow us to define test variables that more directly catch features of substantive interest.

Step 2. We compare the test variable to the predictive distribution of other data sets $y^{\mathrm{rep}}$ that could have arisen from the same model. Formally introducing the replications $y^{\mathrm{rep}}$ is an important step in the mathematical formulation of Bayesian testing because it makes explicit the joint model, $p(y, y^{\mathrm{rep}}, \theta)$. (Bayarri and Castellanos use the notation $x$ for data, but we prefer $y$ because we commonly work in the applied regression framework in which $y$ is modeled conditional on predictors, $x$.)
Because we are doing Bayesian inference, we simply use the posterior distribution, $p(y^{\mathrm{rep}} \mid y)$, which is also called the posterior predictive distribution because $y^{\mathrm{rep}}$ can be viewed as predictions.

As discussed by Gelman, Meng and Stern (1996), the prior predictive distribution is also a posterior predictive distribution but with $y^{\mathrm{rep}}$ defined as arising from new parameters, $\theta^{\mathrm{rep}}$, drawn from the model. The choice of prior or posterior distribution (or, more generally, the choice of what is to be replicated in defining $y^{\mathrm{rep}}$) depends on which aspects of the model are being checked. In many cases, the prior distribution is assigned based on convenience and so there is no particular interest in checking its fit to the data.

In the context of the paper at hand, which is explicitly concerned with checking the second level of a hierarchical model, it makes sense to use an intermediate replication, in which the hyperparameters $\eta$ are kept the same but the lower-level parameters $\theta$ are replicated, that is, resampled from the group-level model. In the notation of BC, the predictive distribution of interest would be $p(\theta^{\mathrm{rep}}, x^{\mathrm{rep}}, \eta \mid x)$, averaging over the posterior distribution $p(\eta \mid x)$. This is a slight departure from BC’s recommendation to integrate out $\theta$. (Actually, we prefer the term “average over” to “integrate out” since we perform our computations using simulation.) As we discuss in Section 4 below, it turns out this is very close to what BC call the empirical Bayes prior predictive check.

Step 3. For a one-dimensional test summary, the discrepancy between model and data can be summarized by a p-value or, often more usefully, by a predictive confidence interval.
(For example, page 366 of Bayesian Data Analysis has an example from an analysis of elections in which 12.6% of the elections in the data switched parties, but in replicated data sets the 95% interval for the proportion of switches was [13.0%, 14.3%]. In this case, the model clearly did not fit this aspect of the data, but this difference of about one percentage point was not of practical significance.) For higher-dimensional test summaries, graphical summaries would be appropriate, up to and including plots of the entire data set, compared with plots of replicated data. There is some potential, we believe, to connect classes of models with classes of graphs to suggest natural and automatic displays of checks for many problems (Gelman, 2003, 2004).

As we have already noted, BC focus on p-values, which can be useful summaries but are no replacement for graphical comparisons of observed and replicated data that can reveal various aspects of model misfit. We emphasize that any of the methods discussed in the BC paper can be applied to graphical checks.

3. p-VALUES AND u-VALUES

Regarding the discussion in Section 3.5 of BC on p-values, we refer the reader to Section 2.3 of Gelman (2003), which distinguishes between Bayesian p-values (most simply, posterior probability statements of the form $\Pr(T(y^{\mathrm{rep}}) > T(y) \mid y)$) and u-values (data summaries with a uniform null distribution). Classical p-values with pivotal test statistics are also u-values, but in the presence of uncertainty about parameters it is not generally possible for tests to have both properties at once.
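A Bayesian p-value of the form Pr(T(y_rep) > T(y) | y), together with the predictive interval mentioned under Step 3, is usually obtained by simulation. The following is a minimal sketch under assumptions that are ours and not part of the original comment: a normal model with known standard deviation and a flat prior on the mean, hypothetical data, and the hypothetical function name `posterior_predictive_check`.

```python
import random
import statistics


def posterior_predictive_check(y, test_stat, sigma=1.0, n_sims=2000, seed=0):
    """Compare T(y) with T(y_rep) for y_rep drawn from p(y_rep | y).

    Illustrative model (an assumption, not taken from the paper):
    y_i ~ Normal(mu, sigma^2) with sigma known and a flat prior on mu,
    so that mu | y ~ Normal(mean(y), sigma^2 / n).
    """
    rng = random.Random(seed)
    n = len(y)
    y_bar = statistics.fmean(y)
    t_obs = test_stat(y)
    t_rep = []
    for _ in range(n_sims):
        mu = rng.gauss(y_bar, sigma / n ** 0.5)            # posterior draw of mu
        y_rep = [rng.gauss(mu, sigma) for _ in range(n)]   # one replicated data set
        t_rep.append(test_stat(y_rep))
    # Bayesian p-value: share of replications at least as extreme as the data.
    p_value = sum(t >= t_obs for t in t_rep) / n_sims
    # Central 95% predictive interval for the test statistic.
    t_sorted = sorted(t_rep)
    interval = (t_sorted[int(0.025 * n_sims)], t_sorted[int(0.975 * n_sims) - 1])
    return t_obs, p_value, interval


# Hypothetical data with one large value; T(y) = max(y) targets that feature.
y = [0.1, -0.4, 0.3, 0.8, -1.1, 0.2, -0.3, 6.0]
t_obs, p_value, interval = posterior_predictive_check(y, max)
print(t_obs, p_value, interval)
```

The p-value here is a direct posterior probability statement about replications under the assumed model; nothing in the sketch makes it uniformly distributed under the null, which is the distinction this section draws.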
On the occasions that we do summarize test statistics using tail-area probabilities, we prefer the p-value because it can be directly interpreted as a statement, conditional on the model, about what might be expected in future replications. Here we disagree with BC, who describe the uniform null distribution as “a very desirable property, namely having the same interpretation across problems.” It is perhaps a matter of taste whether to prefer a posterior summary with a direct probabilistic interpretation or a less-interpretable statistic that has a uniform distribution under the null model. We would certainly not call our p-values uninterpretable: for example, a p-value of 0.2 means clearly that, under the model, 20% of future data will be at least as extreme as the observed data. No calibration is necessary for this interpretation to be valid.

In any case, our point here is to distinguish between the two goals (a direct probability statement and a uniform null distribution) and to point out that, in general, you cannot have both, just as, in general, posterior means will not be unbiased estimates and posterior intervals will not have classical confidence coverage for all parameter values. Ultimately we will evaluate our Bayesian model-checking methods based on how well they help us understand differences between model and data, not based on theoretical coverage properties and not based on their rates of rejecting models which we know are false anyway.

4. THE “EMPIRICAL BAYES PRIOR p-VALUE”

BC’s paper concludes with a statement that empirical Bayes prior p-values “have better properties [than posterior p-values] and are easier to compute.” In fact, these EB-prior p-values are very close to posterior p-values, replicating $\theta$ but leaving the hyperparameters ($\eta$, in BC’s notation) fixed, a strategy which Gelman, Meng and Stern (1996) recommend for hierarchical models (Figure 1c on page 739 of that paper). The only difference between the EB-prior distribution and this posterior predictive distribution is that the former uses point estimates of the hyperparameters, which cannot in general be a good idea (consider, e.g., settings where no good point estimates exist, such as the 8-schools example from Chapter 5 of Bayesian Data Analysis). We suspect the good performance of the EB-prior p-values comes from the appropriate choice of replication for testing the second level of a hierarchical model (the same hyperparameters but new groups), not from the use of point estimates.

To put it another way, take BC’s “empirical Bayes” method, average over the hyperparameters so that it becomes “hierarchical Bayes” (as is appropriate given the other parts of the paper), and you get a posterior predictive check. We suppose that BC did not notice this because of their assumption that in posterior predictive checking, all parameters had to be kept the same in replications (as in Figure 1a on page 739 of Gelman, Meng and Stern, 1996). In fact, the flexibility of predictive checking allows different aspects of the data and parameter vectors to be preserved in replications, and for the particular goal of BC’s paper, it makes sense to replicate the parameters $\theta$ (as BC ended up discovering in their simulations).
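The contrast between plugging in point estimates of the hyperparameters and averaging over their posterior distribution can be sketched by simulation. The code below is only an illustration under our own assumptions: a hierarchical normal model with known group-level standard errors, a flat prior on mu, a uniform prior on tau truncated to a grid on (0, 30), and the grid posterior mode of tau as a stand-in for an empirical Bayes estimate (not BC's exact estimator). The function names are hypothetical, and the inputs are the estimates and standard errors usually quoted for the 8-schools example.

```python
import math
import random

# Estimates and standard errors usually quoted for the 8-schools example
# (Bayesian Data Analysis, Chapter 5); treat them as illustrative inputs.
y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]


def mu_given_tau(y, sigma, tau):
    """Mean and variance of mu | tau, y under a flat prior on mu."""
    w = [1.0 / (s ** 2 + tau ** 2) for s in sigma]
    v_mu = 1.0 / sum(w)
    return v_mu * sum(wi * yi for wi, yi in zip(w, y)), v_mu


def log_post_tau(y, sigma, tau):
    """Log p(tau | y) up to a constant, with a uniform prior on tau."""
    mu_hat, v_mu = mu_given_tau(y, sigma, tau)
    lp = 0.5 * math.log(v_mu)
    for yi, s in zip(y, sigma):
        v = s ** 2 + tau ** 2
        lp -= 0.5 * math.log(v) + 0.5 * (yi - mu_hat) ** 2 / v
    return lp


def replicate_new_groups(y, sigma, plug_in, n_sims=2000, tau_max=30.0,
                         n_grid=300, seed=1):
    """Draw y_rep for new groups: eta = (mu, tau) kept, theta replicated."""
    rng = random.Random(seed)
    grid = [tau_max * (k + 0.5) / n_grid for k in range(n_grid)]
    logp = [log_post_tau(y, sigma, t) for t in grid]
    top = max(logp)
    weights = [math.exp(lp - top) for lp in logp]
    tau_mode = grid[logp.index(top)]
    sims = []
    for _ in range(n_sims):
        if plug_in:
            # Empirical-Bayes-style: fix eta at a point estimate.
            tau = tau_mode
            mu, _ = mu_given_tau(y, sigma, tau)
        else:
            # Hierarchical Bayes: average over p(eta | y) by simulation.
            tau = rng.choices(grid, weights=weights)[0]
            mu_hat, v_mu = mu_given_tau(y, sigma, tau)
            mu = rng.gauss(mu_hat, math.sqrt(v_mu))
        theta_rep = [rng.gauss(mu, tau) for _ in sigma]   # parameters for new groups
        sims.append([rng.gauss(t, s) for t, s in zip(theta_rep, sigma)])
    return sims


def spread(values):
    """Test summary: range of the group-level estimates."""
    return max(values) - min(values)


def summarize(sims, t_obs):
    """Mean replicated spread and the tail-area probability Pr(T_rep >= T_obs)."""
    t_rep = [spread(s) for s in sims]
    return sum(t_rep) / len(t_rep), sum(t >= t_obs for t in t_rep) / len(t_rep)


mean_spread_eb, p_eb = summarize(replicate_new_groups(y, sigma, plug_in=True), spread(y))
mean_spread_hb, p_hb = summarize(replicate_new_groups(y, sigma, plug_in=False), spread(y))
print(mean_spread_eb, p_eb, mean_spread_hb, p_hb)
```

In this setup the marginal posterior mode of tau sits at the lower edge of the grid, so the plug-in replications understate the between-group variation that the averaged version retains. The two branches differ only in how the hyperparameters are handled; the choice of what to replicate is the same.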
Sinharay and Stern (2003) discuss these issues further in the context of the hierarchical normal model.

5. LOOKING FORWARD

As indicated by the plethora of methods discussed by BC, there are many ways of combining ideas of replication and cross-validation. A parallel situation arises in the literature of the bootstrap (Efron and Tibshirani, 1993), with parametric bootstraps, nonparametric bootstraps, and special methods for spatial and time-series data. A lot more work needs to be done. In particular, although we do find the posterior predictive framework useful, we recognize that there is something particularly compelling about external validation and cross-validation. At the theoretical level, there is an opening to incorporate validation into hierarchical modeling with the possibilities of different levels of cross-validation for individuals and groups (e.g., fivefold cross-validation of groups and tenfold cross-validation of observations within groups). More practical concerns include decisions about how to set up the tuning parameters for cross-validation and, when comparisons are made graphically, how to visualize the many replicated data sets. BC’s partial posterior predictive distribution could be an excellent way to unify this area.

The BC paper focuses on p-values, but if our own experience is any guide, we expect the most useful work to focus on graphical explorations of realized and replicated data. We focused on p-values in our 1996 paper, but in the years since, we have found graphical checks to be more helpful, with numerical summaries and p-values coming in at the end to give some structure to our visual judgments. The theoretical structure used by BC, of looking at null distributions of p-values, could become helpful here, and also for concerns of multiple comparisons.
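The two levels of cross-validation mentioned in this section (for example, fivefold for groups and tenfold for observations within groups) could be set up along the following lines. This is a rough sketch of the fold bookkeeping only, with a hypothetical function name and hypothetical group sizes; it says nothing about how the held-out predictions should be scored or displayed.

```python
import random
from collections import Counter


def two_level_folds(group_sizes, n_group_folds=5, n_obs_folds=10, seed=0):
    """Assign fold labels at two levels of a hierarchical data set.

    group_sizes: dict mapping a group id to its number of observations.
    Returns (group_fold, obs_fold), where
      group_fold[g]  is the fold in which the whole group g is held out, and
      obs_fold[g][i] is the fold in which observation i of group g is held out.
    """
    rng = random.Random(seed)
    groups = list(group_sizes)
    rng.shuffle(groups)
    # Deal shuffled groups into folds in round-robin order.
    group_fold = {g: k % n_group_folds for k, g in enumerate(groups)}
    obs_fold = {}
    for g, n in group_sizes.items():
        order = list(range(n))
        rng.shuffle(order)
        labels = [0] * n
        for k, i in enumerate(order):
            labels[i] = k % n_obs_folds
        obs_fold[g] = labels
    return group_fold, obs_fold


# Hypothetical layout: 10 groups with 20 observations each.
group_sizes = {f"group_{j}": 20 for j in range(10)}
group_fold, obs_fold = two_level_folds(group_sizes)
groups_per_fold = Counter(group_fold.values())
obs_per_fold = Counter(obs_fold["group_0"])
print(groups_per_fold, obs_per_fold)
```

Holding out whole groups checks predictions for new groups (the second level of the model), while holding out observations within groups checks predictions for existing groups.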
ACKNOWLEDGMENTS

We thank Hal Stern for helpful comments and the National Science Foundation and National Institutes of Health for financial support.

REFERENCES

Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York. MR1270903
Gelman, A. (2003). A Bayesian formulation of exploratory data analysis and goodness-of-fit testing. Internat. Statist. Review 71 369–382.
Gelman, A. (2004). Exploratory data analysis for complex models (with discussion). J. Comput. Graph. Statist. 13 755–787. MR2109052
Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2003). Bayesian Data Analysis, 2nd ed. CRC Press, London. MR2027492
Gelman, A., Meng, X. L. and Stern, H. S. (1996). Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statist. Sinica 6 733–760. MR1422404
Gelman, A., Van Mechelen, I., Verbeke, G., Heitjan, D. F. and Meulders, M. (2005). Multiple imputation for model checking: Completed-data plots with missing and latent data. Biometrics 61 74–85. MR2135847
Sinharay, S. and Stern, H. S. (2003). Posterior predictive model checking in hierarchical models. J. Statist. Plann. Inference 111 209–221. MR1955882