Rejoinder: Bayesian Checking of the Second Levels of Hierarchical Models [arXiv:0802.0743]
Authors: M. J. Bayarri, M. E. Castellanos
Statistical Science 2007, Vol. 22, No. 3, 363–367. DOI: 10.1214/07-STS235REJ. Main article DOI: 10.1214/07-STS235. © Institute of Mathematical Statistics, 2007.

We would like to thank the discussants for the valuable insights and for commenting on important aspects of model checking that we did not touch in our paper. Our goal was modest (but crucial): to select an appropriate distribution with which to judge the compatibility of the data with a hypothesized (hierarchical) model, when the test statistic is not ancillary and an improper prior is used for the hyperparameters. Since it is important to emphasize that this is by no means the only aspect of model checking, the discussants’ complementary contributions and comments are all most welcome. The specific technical contributions of Evans and Johnson are also appreciated, since their developments in this area were not mentioned in our review.

Several discussants have highlighted the importance of graphical displays in model checking. We will not comment on this because we entirely agree. We similarly agree with most of the discussants’ other comments, although in this rejoinder we mainly concentrate on disagreements. Our comments are organized around the main topics that arise in the discussions. We keep the same notation and terminology used in the paper (although it does conflict with the notation used by some of the discussants).

M. J. Bayarri is Professor, Department of Statistics and Operation Research, University of Valencia, Burjassot, (Valencia), 46100 Spain, e-mail: susie.bayarri@uv.es. M. E. Castellanos is Associate Professor, Department of Statistics and Operation Research, Rey Juan Carlos University, Móstoles, (Madrid), 28933 Spain, e-mail: maria.castellanos@urjc.es.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2007, Vol. 22, No. 3, 363–367. This reprint differs from the original in pagination and typographic detail.

ROLE OF PRIOR PREDICTIVE DISTRIBUTIONS WHEN MODEL UNCERTAINTY IS PRESENT

Bayesian analyses, when model uncertainty is present (model choice, model averaging), are based on the prior predictive distributions for the different models under consideration. Model checking is a quick-and-dirty shortcut to bypass model choice, and “pure” Bayesian reasoning indicates that all relevant information lies in the (prior) predictive distribution $m(x)$ for the entertained model.

As Evans points out, objective Bayes methodology should be guided by proper Bayes methodology, so objective Bayes model checking should also be based on the prior predictive distribution. The difficulty, however, is that only some aspects of this distribution can be utilized when the prior distribution is improper. Bayarri and Berger (1997, 1999, 2000) argue that the relevant aspect to consider for model checking is a conditional (prior) predictive distribution $m(x \mid u)$, where $U = U(X)$ is an appropriate conditioning statistic such that the posterior $\pi(\theta \mid u)$ is proper. Model checks (measures of surprise) computed with this distribution (such as $p$-values or relative surprise) are called conditional predictive measures. If we use a statistic $T$ to measure departure and use $U$ for conditioning, the relevant distribution for model checking is then $m(t \mid u)$.
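For reference, the conditional predictive distribution just described can be written out explicitly. The display below is a sketch in the notation of this rejoinder: $f$ denotes the sampling density and $\pi(\theta)$ the (possibly improper) prior. The tail direction of the $p$-value is an assumption made for illustration (large values of $T$ are taken to indicate incompatibility).

```latex
\begin{align*}
  \pi(\theta \mid u) &\propto f(u \mid \theta)\,\pi(\theta), \\
  m(t \mid u) &= \int f(t \mid u, \theta)\,\pi(\theta \mid u)\,d\theta, \\
  p &= \Pr\nolimits^{\,m(\cdot \mid u_{\mathrm{obs}})}\bigl(T \ge t_{\mathrm{obs}}\bigr).
\end{align*}
```

The requirement that $\pi(\theta \mid u)$ be proper is what makes the integral in the second line well defined even when $\pi(\theta)$ itself is improper.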
Evans’ prescription can be put in this framework with $T$ ancillary and $U$ sufficient (caution: Evans’ notation switches the roles of $T$ and $U$). Larsen and Lu’s (from now on L&L) prescription for checking group $i$ is also of this form with $T = T(X_i)$ and $U = X_{(-i)}$. The complete theory of Johnson (not sketched in his discussion) relies on the whole prior predictive. Hence, all these methods produce legitimate Bayesian measures of surprise. The posterior predictive distribution cannot be expressed in this way (it would produce a trivial, degenerate distribution).

Bayarri and Berger (1997, 1999) explore several choices of $U$ and recommend use of the conditional MLE of $\theta$, that is, the MLE computed in the conditional distribution $f(x \mid t, \theta)$. The resulting measures of surprise (or model checks) were shown to basically coincide with the partial posterior measures; indeed, the conditional predictive distribution for that choice of $U$ and the partial posterior predictive distribution are asymptotically equivalent (Robins (1999); Robins, van der Vaart and Ventura (2000)). We have concentrated on partial posterior measures because they are basically indistinguishable from the conditional predictive ones and they are easier to compute, but their Bayesian justification comes from the conditional predictive reasoning. We should perhaps have reiterated this in the paper.

CHOICE OF T AND/OR D

We are not addressing optimal choice of $T$ in this paper: we focus on the choice of the relevant distribution to locate $T$. $T$ is often chosen casually based on intuitive grounds and we wanted a method that would work with any choice of the departure statistic $T$ (although, of course, adequate choice of $T$ is always important to increase power).
However, several discussants have focused their discussion on specific choices, so we comment on those.

A preliminary issue is consideration of discrepancy measures, that is, functions of the data and the parameters $D = D(x, \theta)$, as well as statistics $T = T(x)$ for model checking. Gelman and L&L favor their routine use, also with informal, intuitively sound choices. Johnson’s proposal, although derived from a different philosophy, could also be considered under this umbrella. Johnson’s interesting method applies to invariant situations in which the distribution of an optimally chosen $D$, namely a pivotal quantity, is precisely known. Johnson’s elegant theorem shows how to obtain simulations from the pivotal quantities for the true (unknown) parameter values, so that their adequacy with the known distribution can be assessed. The main difficulty is that these simulations are highly correlated and proper assessments require prior predictive techniques (and hence informative priors). In some situations, the provided bounds for the $p$-values of the suggested test statistic might suffice, so these techniques are definitely worth considering. Note, however, that without an informative prior, interpretation of graphical displays, or other uses of these correlated simulations, is an issue.

Although our methodology could be applied to such functions [it would probably suffice to consider the joint conditional distribution $p(x, \theta \mid u)$], we have not thought about it enough to venture an opinion. Use of $D$’s seems intuitive; however, when used in conjunction with posterior predictive distributions, they suffer from the same type of conservativeness as statistics do (Robins (1999); Robins, van der Vaart and Ventura (2000)).
Since the problems are the same whether or not $T$ is chosen to also include parameters, we cast the rest of the rejoinder in terms of traditional statistics $T$. (Note that, if $T$ is ancillary or $D$ pivotal, the issues about how to integrate out the parameters disappear.)

Evans chooses not to integrate out the unknown $\theta$ but rather to eliminate it in traditional frequentist ways, by either conditioning on a sufficient statistic (i.e., $U$ above is sufficient) or by using an ancillary test statistic $T$. His argument is, however, also well within Bayesian thinking, providing a beautiful factorization of the joint (prior) distribution of $x$ and $\theta$ in which the role of the different factors can be very nicely interpreted. Although these specific choices of $T$ and $U$ are needed for the clean factorization, we show that other choices of $T$ and/or $U$ are also possible (maybe desirable) for model checking, and might be simpler to implement. This applies especially to problems in which the required statistics do not exist, are difficult to identify, or when sampling from the resulting distribution is particularly challenging.

Johnson wonders about choices of $T$ sufficient (or nearly so) and/or $T$ ancillary. $T$ should not be sufficient; a sufficient $T$ is virtually useless for model checking (this is in agreement with Evans’ remarks). An extensive discussion of this issue, with examples, can be found in Bayarri and Berger (1997), Bayarri and Berger (2000) and rejoinder. An ancillary $T$ simply reproduces frequentist testing with similar $p$-values (terminology from Bayarri and Berger (1999, 2000)); the Bayesian machinery for integrating out unknown quantities is simply not needed and, in this case, prior, posterior, conditional and partial posterior predictive distributions are all identical to the specified marginal distribution for $T$, $f(t)$.
When $T$ is nearly ancillary, then all procedures will produce very similar model checks.

L&L suggest choosing for group $i$ a $T_i$ which is a function of the data $X_i$ (and possibly the parameters) and as $U_i$ the rest of the data. As L&L indicate, there might be some concern about losing power, but certainly the behavior is much better than that of posterior predictive measures (as clearly shown by L&L’s Table 1). As we remarked before, this avoids double use of the data if we were only testing that group. Our main concern is how to properly interpret all these $T_i$’s jointly. L&L have been very careful not to compute any $p$-value based on overall measures. For instance, using the overall discrepancy measures $T_1 = \max\{\bar{X}_i\}$, $T_2 = \max\{|\bar{X}_i - \bar{\bar{X}}|\}$ and $T_3 = \max\{|\bar{X}_i - \mu|\}$ produces $p$-values equal to 0.479, 0.619 and 0.476, respectively, thus showing the same undesirable behavior as posterior predictive $p$-values, and the concern about double use of the data still arises. (For a simple example of similar issues with cross-validation $p$-values, see the rejoinder to Professor Carlin in Bayarri and Berger (1999).) If we keep the $p$-values individually, it is not very clear what to do with them. One concern is that they are probably highly correlated, and then displays of uniformity might mean little; another important concern is with multiplicity issues, especially when there are many groups. Of course the multiplicity issue gets worsened when, in addition to having many groups, one considers many $T$’s for each group. The only way that we know to satisfactorily handle multiplicities is Bayesian model selection analysis, and the complexity of the problem escalates (and again requires proper priors).
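The multiplicity concern can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that the $k$ group-level $p$-values are independent and exactly uniform under the model; since they are probably highly correlated in practice, the numbers are only indicative. The function names and the values of `k` and `alpha` are hypothetical and do not come from the paper or the discussions.

```python
import random


def prob_any_below(k: int, alpha: float) -> float:
    """P(at least one of k independent Uniform(0,1) p-values is <= alpha)."""
    return 1.0 - (1.0 - alpha) ** k


def simulate_any_below(k: int, alpha: float, n_sim: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability (independence assumed)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        # Smallest of k independent uniform p-values in one simulated study.
        if min(rng.random() for _ in range(k)) <= alpha:
            hits += 1
    return hits / n_sim


if __name__ == "__main__":
    for k in (1, 5, 20, 50):
        print(k, round(prob_any_below(k, 0.05), 3), round(simulate_any_below(k, 0.05), 3))
```

Under these assumptions, with 20 groups the chance that at least one $p$-value falls below 0.05 is already about 0.64 even when the model is correct, which is why individually reported $p$-values are hard to interpret jointly.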
METHODOLOGICAL ISSUES

In the discussion, various interesting methodological issues arose. We briefly address the main issues here.

Model elaboration. Gelman and Johnson touch on model elaboration followed by inference as an alternative to model checking. In the situation contemplated in this paper, however, in which we are seriously entertaining a model, an analysis with a single, more complex model would not be adequate. Correct Bayesian analysis should acknowledge the uncertainty in the model assessment, utilizing model selection (between the more elaborated and the simpler models) or model averaging. This is indeed the ideal Bayesian analysis, but both the analysis and the prior assessments are considerably harder than those required for our model checking proposal. Avoiding the full model uncertainty analysis in situations where we are reasonably confident in the assessed model was precisely the motivation for developing an objective Bayes model checking procedure. Of course, if the model is found incompatible with the data, then a full model selection analysis cannot be avoided.

Avoiding double use of the data. Evans suggests that, to avoid double use of the data, our choices for $T$ and $U$ should satisfy his factorization of the joint distribution, at least asymptotically. There is no need for this: we avoid double use of the data by conditioning. Also, there is no need for $T$ and $U$ to be independent (as when splitting the data), nor for $T$ to be sufficient nor for $U$ to be ancillary (in our notation, not Evans’). Computing a mean and a variance of the same posterior distribution is not using the data twice; it is describing two characteristics of that distribution.
Similarly, focusing on one “slice” (a conditional distribution) of the joint prior predictive $m(x)$ is not using the data twice, but using a specific characteristic of that distribution. To illustrate with the simplest discrete example, if $T = (x_1, x_2)$ and $U = x_1$, then $m(t \mid u_{\mathrm{obs}}) = m(x_1, x_2 \mid x_1 = u_{\mathrm{obs}}) = m(x_2 \mid x_1 = u_{\mathrm{obs}})$ if $x_1 = u_{\mathrm{obs}}$ and 0 otherwise; $x_1$ and $x_2$ are used for different things, but not used twice. Note that posterior predictive checks cannot be cast in this way. This issue is also discussed at length in the rejoinder of Bayarri and Berger (2000).

Accounting for uncertainty in the estimates. Gelman argues that there must be something wrong in our recommendation of plug-in checks over posterior predictive checks, since the former do not account for uncertainty in the estimates. It is true that plug-in checks make two mistakes (using the data twice and ignoring the uncertainty in the estimates), whereas posterior predictive checks only make the first mistake. Crucially, however, the second “mistake” that is made by plug-in checks actually operates in the opposite direction of the first mistake, and brings the resulting $p$-value closer to uniformity. This was formally shown to be the case in Robins, van der Vaart and Ventura (2000), but can also be understood intuitively: when the data are very incompatible with the model, posterior predictive (and plug-in) distributions sit in the wrong part of the space (the parameters are overtuned to accommodate for model deficiency) but, since the plug-in distribution is (wrongly) more concentrated than the posterior predictive distribution, it is less compatible with extreme values of test statistics, and hence is less conservative.
It is the theorem in Robins, van der Vaart and Ventura (2000) that shows the correction is not an overcompensation, that is, that the plug-in still remains conservative, while possessing more power. The plug-in predictive checks are also often easier to compute. Note that this superior performance of the plug-in checks occurs regardless of the specific form of checking used, that is, whether it is formal or graphical.

LIMITATIONS

We are sympathetic to the complaints concerning the difficulty of computing partial posterior (and conditional) predictive checks, but it can be done and the difficulty is only in estimating a (usually) univariate density at one point, not a difficult computation compared to most Bayesian computations nowadays. However, we recognize that more work is needed to develop fast and efficient algorithms to carry out the necessary computations. For invariant situations, the computations for posterior simulations from the pivotal quantity are simpler, but only when the computed bounds are satisfactory (and the test procedure adequate); otherwise, proper interpretation of the simulated values (whether for visual displays or numerical computations) requires prior predictive techniques, which not only need a proper prior, but also are of a similar level of complexity as the partial posterior predictive technique. Cross-validation may or may not be simpler to compute. The computations required for $m(g(x) \mid u)$ for a sufficient statistic $U$ (and $g$ any function of the data) are likely to be formidable; in Bayarri, Castellanos and Morales (2006) we actually suggest use of MCMC computations to generate from $m(g(x) \mid u)$ which are basically identical to the ones used for conditional and partial posterior predictive distributions.
For any $T$ (and discrepancy $D$), Robins (1999) and Robins, van der Vaart and Ventura (2000) suggest how to “center” them so as to produce asymptotically uniform $p$-values, and this can also be a daunting task. Posterior predictive techniques are usually simpler to compute than partial posterior or conditional predictive techniques.

Another limitation of our methodology is that it does not say anything about choosing $T$. Choice of $T$ is equivalent to informally choosing the aspect of the model to be checked. What we advocate, once a statistic $T$ has been chosen to detect incompatibility between data and model, is to locate the observed $t$ in the distribution $m(t \mid u)$ [or in its approximation $m(t \mid x_{\mathrm{obs}} \setminus t_{\mathrm{obs}})$]. In the language of Gelman, one should get the “replicates” for model checks from those distributions. This prescription holds whether $T$ is univariate or multivariate, and whether one uses graphics, residuals, relative surprise, $p$-values or other methods to formally or informally locate $T$ in $m(t \mid u)$. This addresses one of Gelman’s concerns. (Of course, if $T$ is multivariate, the definition of the $p$-value is not clear.) We do recognize, however, that choice of $T$ is an important issue. Evans and Johnson have both addressed this issue and their suggestions are certainly sensible and worth considering. We do recommend a specific choice of $U$, namely the conditional MLE. Robert and Rousseau (2002) and Fraser and Rousseau (2005) suggest use of the unconditional MLE instead; this choice is also worth exploring.

MISUNDERSTANDINGS

In the discussions, a number of the statements made concerning our methodology are incorrect. These statements refer to issues that were discussed in our earlier papers where the methodology was first presented, and so we neglected to review these issues in this paper.
We try to straighten out some of these misunderstandings here.

Gelman suggests that our methodology focuses on using $p$-values as a model-rejection rule with specified Type-I errors. This is not the case. We do not fix Type-I errors, nor do we advocate use of $p$-values as formal decision rules (indeed, we are quite opposed to it; see Sellke, Bayarri and Berger (2001), and Hubbard and Bayarri (2003)). Indeed, the methodology is valid whether or not $p$-values are used. We use $p$-values as “measures of surprise”: numerical quantifications of the incompatibility of the observed $t$ and the “reference” distribution; another such measure is the relative predictive surprise also explored in the paper (and which can readily be applied to multivariate $T$’s). Alternatively, one can opt for checking informally this incompatibility with graphical displays. The main advantage of $p$-values is pedagogical: statisticians are used to interpreting them. Of course, this familiarity is a detriment when procedures such as posterior predictive $p$-values are used, in that casual users will interpret the $p$-values as arising from a uniform distribution, not suspecting that they are instead arising from a distribution much more concentrated about 1/2.

Gelman and Johnson imply that the methodology can only be applied to simple examples and univariate statistics. This is not so. We use “simple” examples so that the numerical complexity does not obscure the relevant issues. As mentioned earlier, there is nothing in the methodology to prevent it being used with multivariate statistics. Similarly, although we use $p$-values and relative surprise (numerical quantifications), one can use graphical displays of simulations from $m(t \mid u)$ in the same way as the discussants use graphical displays from their proposed distributions.
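The concentration of posterior predictive $p$-values about 1/2 can be seen in an extreme toy case, which is an illustration and not an example taken from the paper. Assume $x_i \sim N(\theta, 1)$ with a flat prior on $\theta$, and take $T$ to be the sample mean, which is sufficient. Then $\theta \mid x \sim N(\bar{x}, 1/n)$ and the replicated mean is distributed symmetrically about the observed mean, so the posterior predictive $p$-value is 1/2 whatever the data, even when they come from a very different distribution. The following Monte Carlo sketch uses only the Python standard library; all names, sample sizes and seeds are hypothetical.

```python
import random
import statistics


def posterior_predictive_pvalue(x, n_rep=4000, seed=0):
    """Monte Carlo posterior predictive p-value for T = sample mean.

    Assumed toy model: x_i ~ N(theta, 1) with a flat prior on theta,
    so that theta | x ~ N(mean(x), 1/n).
    """
    rng = random.Random(seed)
    n = len(x)
    t_obs = statistics.fmean(x)
    sd = (1.0 / n) ** 0.5
    exceed = 0
    for _ in range(n_rep):
        theta = rng.gauss(t_obs, sd)   # draw theta from its posterior
        t_rep = rng.gauss(theta, sd)   # mean of a replicated sample of size n
        exceed += t_rep >= t_obs
    return exceed / n_rep


def demo(n=25, seed=1):
    rng = random.Random(seed)
    # Data generated from the assumed model.
    well_specified = [rng.gauss(3.0, 1.0) for _ in range(n)]
    # Data from a clearly different model: skewed, with variance far from 1.
    misspecified = [rng.expovariate(0.1) for _ in range(n)]
    return (posterior_predictive_pvalue(well_specified, seed=2),
            posterior_predictive_pvalue(misspecified, seed=3))


if __name__ == "__main__":
    print(demo())
```

Both values come out close to 0.5 up to Monte Carlo error, so in this toy setting the check cannot distinguish the adequate model from the inadequate one.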
Johnson conjectures that our $p$-values can be anticonservative. Conditional predictive $p$-values can never be uniformly conservative or anticonservative since, as valid Bayesian $p$-values (i.e., based on the prior predictive distribution), they are uniform on average. Partial posterior predictive $p$-values are not only asymptotically equivalent to the conditional predictive $p$-values (for the proposed $u$), but very often they are identical; when they are not, the partial posterior and conditional predictive distributions are extremely similar even after very few observations. Of course, if one has an ancillary statistic, one has exact uniformity, but this is rarely the case.

CONCLUSIONS

Model checking is subtle and has a variety of aspects, as clearly pointed out by the discussants. Optimal selection of $T$ and $U$ is still an issue, and cross-validation might prove useful. A possible answer is Evans’ proposals, but we find them unduly limited. Use of pivotal quantities is certainly a possibility in invariant situations, but proper interpretation in general would ultimately require prior predictive analysis and thus preclude use of improper priors. Techniques that produce $p$-values near 0.5 when the model is obviously wrong are simply bad techniques, whether one uses $p$-values, other characteristics of the reference distributions, or graphical displays. Such techniques can detect truly terrible models, but the fact that they can have such poor detection power means that “passing” such a model check does very little to instill confidence that one has a good model.

REFERENCES

Bayarri, M. J. and Berger, J. O. (1997). Measures of surprise in Bayesian analysis. ISDS Discussion Paper 97–46, Duke Univ.

Bayarri, M. J. and Berger, J. O. (1999). Quantifying surprise in the data and model verification. In Bayesian Statistics 6 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 53–82. Oxford Univ. Press. MR1723493

Bayarri, M. J. and Berger, J. O. (2000). p-values for composite null models (with discussion). J. Amer. Statist. Assoc. 95 1127–1142, 1157–1170. MR1804239

Bayarri, M. J., Castellanos, M. E. and Morales, J. (2006). MCMC methods to approximate conditional predictive distributions. Comput. Statist. Data Anal. 51 621–640. MR2297533

Fraser, D. A. S. and Rousseau, J. (2005). Developing p-values: A Bayesian-frequentist convergence. Cahier du CEREMADE, Univ. Paris Dauphine, France.

Hubbard, R. and Bayarri, M. J. (2003). Confusion over measures of evidence (p’s) versus errors (α’s) in classical statistical testing (with discussion). Amer. Statist. 57 171–182.

Robert, C. P. and Rousseau, J. (2002). A mixture approach to Bayesian goodness of fit. Cahier du CEREMADE, 02009, Univ. Paris Dauphine, France.

Robins, J. M. (1999). Discussion of quantifying surprise in the data and model verification. In Bayesian Statistics 6 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 67–70. Oxford Univ. Press. MR1723490

Robins, J. M., van der Vaart, A. and Ventura, V. (2000). Asymptotic distribution of p values in composite null models (with discussion). J. Amer. Statist. Assoc. 95 1143–1156, 1171–1172. MR1804240

Sellke, T., Bayarri, M. J. and Berger, J. O. (2001). Calibration of p-values for testing precise null hypotheses. Amer. Statist. 55 62–71. MR1818723