Inherent Difficulties of Non-Bayesian Likelihood-based Inference, as Revealed by an Examination of a Recent Book by Aitkin
A. Gelman¹, C.P. Robert²,³,⁴, and J. Rousseau²,⁴,⁵
¹Depts. of Statistics and of Political Science, Columbia University; ²Université Paris-Dauphine, CEREMADE; ³Institut Universitaire de France; ⁴CREST; ⁵ENSAE

October 26, 2018

Abstract

For many decades, statisticians have made attempts to prepare the Bayesian omelette without breaking the Bayesian eggs; that is, to obtain probabilistic likelihood-based inferences without relying on informative prior distributions. A recent example is Murray Aitkin's book, Statistical Inference, which presents an approach to statistical hypothesis testing based on comparisons of posterior distributions of likelihoods under competing models. Aitkin develops and illustrates his method using some simple examples of inference from iid data and two-way tests of independence. We analyze in this note some consequences of the inferential paradigm adopted therein, discussing why the approach is incompatible with a Bayesian perspective and why we do not find it relevant for applied work.

Keywords: Foundations, likelihood, Bayesian, Bayes factor, model choice, testing of hypotheses, improper priors, coherence.

1 Introduction

For many decades, statisticians have made attempts to prepare the Bayesian omelette without breaking the Bayesian eggs; that is, to obtain probabilistic likelihood-based inferences without relying on informative prior distributions. A recent example is Murray Aitkin's book, Statistical Inference, which is the culmination of a long research program on the topic of integrated evidence, exemplified by the discussion paper of Aitkin (1991).
∗gelman@stat.columbia.edu, xian@ceremade.dauphine.fr, rousseau@ensae.fr

The book, subtitled An Integrated Bayesian/Likelihood Approach, proposes handling statistical hypothesis testing and model selection via comparisons of posterior distributions of likelihood functions under the competing models, or via the posterior distribution of the likelihood ratios corresponding to those models. (The essence of the proposal is detailed in Section 2.) Instead of comparing Bayes factors or performing posterior predictive checks (comparing observed data to posterior replicated pseudo-datasets), Statistical Inference recommends a fusion between the likelihood and Bayesian paradigms that allows for the perpetuation of noninformative priors in testing settings where standard Bayesian practice prohibits their usage (DeGroot, 1973) or requires an extended decision-theoretic framework (Bernardo, 2011). While we appreciate the considerable effort made by Aitkin to place his theory within a Bayesian framework, we remain unconvinced of its coherence, for reasons exposed in this note.

Figure 1: Cover of Statistical Inference

From our Bayesian perspective, and for several distinct reasons detailed in the present note, integrated Bayesian/likelihood inference cannot fit within the philosophy of Bayesian inference. Aitkin's commendable attempt at creating a framework that incorporates the use of arbitrary noninformative priors in model choice procedures is thus incoherent in this Bayesian respect. When using improper priors leads to meaningless Bayesian procedures for posterior model comparison, we see this as a sign that the Bayesian model will not work for the problem at hand.
Rather than trying at all costs to keep the offending model and define marginal posterior probabilities by fiat (whether by BIC, DIC, intrinsic Bayes factors, or posterior likelihoods), we prefer to follow the full logic of Bayesian inference and recognize that, when one's Bayesian approach leads to a dead end, one must change either one's methodologies or one's beliefs (or both). Bayesians, both subjective and objective, have long recognized the need for tuning, expanding, or otherwise altering a model in light of its predictions (see, for example, Good, 1950, and Jaynes, 2003), and we view undefined Bayes factors as an example where otherwise useful methods are being extended beyond their applicability. To try to work around such problems without altering the prior distribution is, we believe, an abandonment of Bayesian principles and, more importantly, an abandoned opportunity for model improvement.

The criticisms found in the current review are therefore not limited to Aitkin's book; they also apply to previous patches such as the deviance information criterion (DIC) of Spiegelhalter et al. (2002) (which also uses a "posterior" expectation of the log-likelihood) and the pseudo-posteriors of Geisser and Eddy (1979) (which make extensive use of the data in their product of predictives). Unlike the author, who has felt the call to construct a partly new, if tentatively unifying, foundation for statistical inference, we have the luxury of feeling that we already live in a comfortable (even if not flawless) inferential house. Thus, we come to Aitkin's book not with a perceived need to rebuild but rather with a view toward strengthening the potential shakiness of the pillars that support our own inferences.
A key question when looking at any method for probabilistic inference that is not fully Bayesian is: For the applied problems that interest us, does the proposed new approach achieve better performance than our existing methods? Our answer, at which we arrive after careful thought, is no.

As an evaluation of the ideas found in Statistical Inference, the criticisms found in this review are inherently limited. We do not claim here that Aitkin's approach is wrong per se, merely that it does not fit within our inferential methodology, namely Bayesian statistics, despite using Bayesian tools. We acknowledge that statistical methods do not, and most likely never will, form a seamless logical structure. It may thus very well be that the approach of comparing posterior distributions of likelihoods could be useful for some actual applications, and perhaps Aitkin's book will inspire future researchers to demonstrate this.

Statistical Inference begins with a crisp review of the frequentist, likelihood, and Bayesian approaches to inference and then proceeds to the main issue: introducing the "integrated Bayes/likelihood approach", first described in Chapter 2. Much of the remaining methodological material appears in Chapters 4 ("Unified analysis of finite populations") and 7 ("Goodness of fit and model diagnostics"). The remaining chapters apply Aitkin's principles to various examples. The present article discusses the basic ideas in Statistical Inference, then considers the relevance of Aitkin's methodology within the Bayesian paradigm.
2 A small change in the paradigm

2.1 Posterior likelihood

"This quite small change to standard Bayesian analysis allows a very general approach to a wide range of apparently different inference problems; a particular advantage of the approach is that it can use the same noninformative priors." Statistical Inference, page xiii

The "quite small change" advocated by Statistical Inference consists in envisioning the likelihood function L(θ, x) as a generic function of the parameter θ that can be processed a posteriori (that is, with a distribution induced by the posterior π(θ | x)), hence allowing for a (posterior) cdf, mean, variance, and quantiles. In particular, the central tool for Aitkin's model fit is the "posterior cdf" of the likelihood,

F(z) = Pr^π(L(θ, x) > z | x).

As argued by the author (Chapter 2, page 21), this "small change" in perspective has several appealing features:

– The approach is general and resolves the difficulties with the Bayesian processing of point null hypotheses, being defined solely by the Bayesian model associated with L(θ, x);

– The approach allows for the use of generic noninformative and improper priors, again by being relative to a single model;

– The approach handles more naturally the "vexed question of model fit", still for the same reason;

– The approach is "simple."

As noted above, the setting is quite similar to that of Spiegelhalter et al.'s (2002) DIC, in that the deviance D(θ) = −2 log f(x | θ) is a renaming of the likelihood and is considered "a posteriori" both in D̄ = E[D(θ) | x] and in p_D = D̄ − D(θ̂), where θ̂ is a Bayesian estimator of θ, since DIC = p_D + D̄. The discussion of Spiegelhalter et al. (2002) made this point clear, see in particular Dawid (2002), even though the authors disagreed.
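To make the definition concrete, here is a minimal Monte Carlo sketch of the "posterior cdf" of the likelihood in a toy conjugate model of our own choosing (binomial data with a uniform prior; the setup is ours, not the book's):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(1)

# Toy setup (ours, not the book's): x successes in n binomial trials,
# with a uniform Beta(1, 1) prior on the success probability theta.
n, x = 10, 3

def likelihood(theta):
    """Binomial likelihood L(theta, x), viewed as a function of theta."""
    return comb(n, x) * theta**x * (1.0 - theta)**(n - x)

# The posterior is Beta(x + 1, n - x + 1); draw theta "a posteriori"
theta = rng.beta(x + 1, n - x + 1, size=100_000)
L = likelihood(theta)

def F(z):
    """Monte Carlo estimate of F(z) = Pr(L(theta, x) > z | x)."""
    return float(np.mean(L > z))
```

Model-fit statements in the book's style are then read off F, e.g. the posterior probability that the likelihood exceeds a given level.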
Plummer (2008) makes a similarly ambiguous proposal that also relates to Geisser and Eddy (1979) by its usage of cross-validation quantities.

We however dispute both the appropriateness and the magnitude of the change advocated in Statistical Inference and show below why, in our opinion, this shift in paradigm constitutes a new branch of statistical inference, differing from Bayesian analysis on many points. First, using priors and posteriors is no guarantee that inference is Bayesian (Seidenfeld, 1992). Empirical Bayes techniques are witnesses of this (Robbins, 1964, Carlin and Louis, 2008). Aitkin's key departure from Bayesian principles means that his procedure has to be validated on its own, rather than benefiting from the coherence inherent to Bayesian procedures. The practical advantage of the likelihood/Bayesian approach may be convenience, but the drawback is that the method pushes both the user and the statistician away from progress in model building.¹

We envision Bayesian data analysis as comprising three steps: (1) model building, (2) inference, and (3) model checking. In particular, we view steps (2) and (3) as separate. Inference works well, with many exciting developments still to come, handling complex models, leading to an unlimited range of applications, and a partial integration with classical approaches (as in the empirical Bayes work of Efron and Morris, 1975, or, more recently, the similarities between hierarchical Bayes and frequentist false discovery rates discussed by Efron, 2010), causal inference, machine learning, and other aims and methods of statistical inference.

Even in the face of all this progress on inference, Bayesian model checking remains a bit of an anomaly, with the three leading Bayesian approaches being Bayes factors, posterior predictive checks, and comparisons of models based on prediction error and other loss-based measures. (Decision-theoretic analyses as in Bernardo, 2011, while intellectually convincing, have not gained the same amount of popularity.) Unfortunately, as Aitkin points out, none of these model checking methods works completely smoothly: Bayes factors depend on aspects of a model that are untestable and are commonly assigned arbitrarily; posterior predictive checks are, in general, "conservative" in the sense of producing p-values whose probability distributions are concentrated near 0.5; and prediction error measures (which include cross-validation and DIC) require the user to divide data into test and validation sets, lest they use the data twice (a point discussed immediately below). The setting is even bleaker when trying to incorporate noninformative priors (Gelman et al., 2003, Robert, 2001), and new proposals are clearly of interest.

¹One might argue that, in practice, almost all Bayesians are subject to our criticism of "using models that make nonsensical predictions." For example, Gelman et al. (2003) and Marin and Robert (2007) are full of noninformative priors. Our criticism here, though, is not of noninformative priors in general but of incoherent predictions about quantities of interest. In particular, noninformative priors can often (but not always!) give reasonable inferences about parameters θ within a model, even while giving meaningless (or at least not universally accepted) values for marginal likelihoods that are needed for Bayesian model comparison. It is when interest shifts from Pr(θ | x, H) to Pr(H | x) that the Bayesian must set aside most of the noninformative π(θ | H) and, perhaps reluctantly, set up an informative model. See, e.g., Liang et al. (2008) and Johnson and Rossell (2010) for some current perspectives on Bayesian model choice using noninformative priors.
2.2 "Using the data twice"

"A persistent criticism of the posterior likelihood approach (...) has been based on the claim that these approaches are 'using the data twice,' or are 'violating temporal coherence.'" Statistical Inference, page 48

"Using the data twice" is not our main reservation about the method, if only because this is a rather vague concept. Obviously, one could criticize the use of the "posterior expectation" of the likelihood as being the ratio of the marginal of the twice-replicated data over the marginal of the original data,

E[L(θ, x) | x] = ∫ L(θ, x) π(θ | x) dθ = m(x, x) / m(x),

similar to Aitkin (1991) (a criticism clearly expressed in the discussion therein). However, a more fundamental issue is that the "posterior" distribution of the likelihood function cannot be justified from a Bayesian perspective. Statistical Inference stays away from decision theory (as stated on page xiv), so there is no derivation based on a loss function or such. Our primary difficulty with the integrated likelihood idea (and DIC as well) is (a) that the likelihood function does not exist a priori and (b) that it requires a joint distribution to be properly defined in the case of model comparison. The case for (a) is arguable, as Aitkin would presumably contest that there exists a joint distribution on the likelihood, even though the case of an improper prior stands out (see below). We still see the concept of a posterior probability that the likelihood ratio is larger than 1 as meaningless. The case for (b) is more clear-cut in that, when considering two models, hence a likelihood ratio, a Bayesian analysis does require a joint distribution on the two sets of parameters to reach a decision, even though in the end only one set will be used.
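The identity above is easy to check numerically; the following sketch does so in a conjugate binomial example of our own choosing (uniform prior), where all three quantities are available in closed form:

```python
from math import comb, gamma

def beta_fn(a, b):
    """Euler Beta function B(a, b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

# Toy setup (ours): x successes in n binomial trials, uniform prior on theta.
n, x = 10, 3

# m(x): marginal likelihood of the data under the uniform prior
m_x = comb(n, x) * beta_fn(x + 1, n - x + 1)

# m(x, x): marginal of the twice-replicated data, i.e. the integral of
# L(theta, x)^2 against the prior
m_xx = comb(n, x) ** 2 * beta_fn(2 * x + 1, 2 * (n - x) + 1)

# E[L(theta, x) | x]: posterior expectation of the likelihood, computed
# exactly since the posterior is Beta(x + 1, n - x + 1)
post_mean_L = comb(n, x) * beta_fn(2 * x + 1, 2 * (n - x) + 1) / beta_fn(x + 1, n - x + 1)

# The ratio representation E[L | x] = m(x, x) / m(x) holds exactly
assert abs(post_mean_L - m_xx / m_x) < 1e-12
```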
As detailed below in Section 4, this point is related to the introduction of pseudo-priors by Carlin and Chib (1995), who needed arbitrarily defined prior distributions on the parameters that do not exist.

In the specific case of an improper prior, Aitkin's approach cannot be validated in a probability setting, for the reason that there is no joint probability on (θ, x). Obviously, one could always advance that the whole issue is irrelevant, since improper priors do not stand within probability theory. However, improper priors do stand within the Bayesian framework, as demonstrated for instance by Hartigan (1983), and it is easy to give those priors an exact meaning. When the data are made of n iid observations x^n = (x_1, ..., x_n) from f_θ and an improper prior π is used on θ, we can consider a training sample (Smith and Spiegelhalter, 1982) x_(l), with (l) ⊂ {1, ..., n}, such that

∫ f(x_(l) | θ) dπ(θ) < ∞ (l ≤ n).

If we construct a probability distribution on θ by

π_{x_(l)}(θ) ∝ π(θ) f(x_(l) | θ),

the posterior distribution associated with this distribution and the remainder of the sample x_(−l) is given by

π_{x_(l)}(θ | x_(−l)) ∝ π(θ) f(x^n | θ), x_(−l) = {x_i, i ∉ (l)}.

This distribution is independent of the choice of the training sample; it only depends on the likelihood of the whole data x^n, and it therefore leads to a non-ambiguous posterior distribution² on θ. However, as is well known, this construction does not produce a joint distribution on (x^n, θ), which would be required to give a meaning to Aitkin's integrated likelihood. Therefore, his approach cannot cover the case of improper priors within a probabilistic framework and thus fails to solve the very difficulty with noninformative priors it aimed at solving. This is further illustrated by the use of Haldane's prior in Chapter 4 of Statistical Inference, despite its not allowing for empty cells in a contingency table (Jeffreys, 1939).

²Obvious extensions to the case of independent but non-iid data or of exchangeable data lead to the same interpretation. The case of dependent data is more delicate, but a similar interpretation can still be considered.

3 Posterior probability on the posterior probabilities

"The p-value is equal to the posterior probability that the likelihood ratio, for null hypothesis to alternative, is greater than 1 (...) The posterior probability is p that the posterior probability of H_0 is greater than 0.5." Statistical Inference, pages 42–43

Those two equivalent statements show that it is difficult to give a Bayesian interpretation to Aitkin's method, since the two "posterior probabilities" quoted above are incompatible. Indeed, a fundamental Bayesian property is that the posterior probability of an event related to the parameters of the model is not a random quantity but a number. To consider the "posterior probability of the posterior probability" means we are exiting the Bayesian domain, both from logical and philosophical viewpoints.

In Chapter 2, Aitkin exposes his (foundational) reasons for choosing this new integrated Bayes/likelihood approach. His criticism of Bayes factors is based on several points we feel it useful to reproduce here:

(i). "Have we really eliminated the uncertainty about the model parameters by integration? The integrated likelihood (...) is the expected value of the likelihood. But what of the prior variance of the likelihood?" (page 47).

(ii). "Any expectation with respect to the prior implies that the data has not yet been observed (...) So the "integrated likelihood" is the joint distribution of random variables drawn by a two-stage process. (...)
The marginal distribution of these random variables is not the same as the distribution of Y (...) and does not bear on the question of the value of θ in that population" (page 47).

(iii). "We cannot use an improper prior to compute the integrated likelihood. This eliminates the usual improper noninformative priors widely used in posterior inference." (page 47).

(iv). "Any parameters in the priors (...) will affect the value of the integrated likelihood and this effect does not disappear with increasing sample size" (page 47).

(v). "The Bayes factor is equal to the posterior mean of the likelihood ratio between the models" [meaning under the full-model posterior] (page 48).

(vi). "The Bayes factor diverges as the prior becomes diffuse. (...) This property of the Bayes factor has been known since the Lindley/Bartlett paradox of 1957" (page 48).

The representation (i) of the "integrated" (or marginal) likelihood as an expectation under the prior,

m(x) = ∫ L(θ, x) π(θ) dθ = E^π[L(θ, x)],

is unassailable and is for instance used as a starting point for motivating the nested sampling method (Skilling, 2006, Chopin and Robert, 2010). This does not imply that the extension to the variance or to any other moment stated in (i) has a similar meaning, nor that the move to the expectation under the posterior is valid within the Bayesian paradigm. While the difficulty (iii) with improper priors is real, and while the impact of the prior modelling (iv) may have a lingering effect, the other points can easily be rejected on the grounds that the posterior distribution of the likelihood is meaningless within a Bayesian perspective.
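As an aside, the prior-expectation representation of m(x) suggests the naive Monte Carlo estimate: average the likelihood over prior draws. A sketch in a conjugate example of our own choosing, where m(x) is known exactly:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(7)

# Toy setup (ours): x successes in n binomial trials, uniform Beta(1, 1) prior.
n, x = 10, 3

def likelihood(theta):
    """Binomial likelihood L(theta, x) as a function of theta."""
    return comb(n, x) * theta**x * (1.0 - theta)**(n - x)

# Prior-mean estimate of the marginal likelihood m(x) = E^pi[L(theta, x)]
theta = rng.uniform(0.0, 1.0, size=200_000)
m_hat = float(likelihood(theta).mean())

# Exact value under the uniform prior: m(x) = 1 / (n + 1) for every x
m_exact = 1.0 / (n + 1)
```

This naive estimator degrades when the posterior is much more concentrated than the prior, which is part of the motivation for methods such as nested sampling.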
This criticism is anticipated by Aitkin, who protests on pages 48–49 that, given point (v), the posterior distribution must be "meaningful," since the posterior mean is "meaningful". But the interpretation of the Bayes factor as a "posterior mean" is only an interpretation of an existing integral (in the specific case of nested models); it does not give any validation to the analysis. (The marginal likelihood may similarly be interpreted as a prior mean, despite depending on the observation x, as in the nested sampling perspective. More generally, bridge sampling techniques also exploit those multiple representations of a ratio of integrals, Gelman and Meng, 1998.) One could just as well take (ii) above as an argument against the integrated likelihood/Bayes perspective.

4 Products of posteriors

In the case of unrelated models to be compared, the fundamental theoretical argument against using posterior distributions of the likelihoods and of related terms is that the approach leads to parallel and separate simulations from the posteriors under each model. Statistical Inference recommends that models be compared via the distribution of the likelihood ratio values,

L_i(θ_i | x) / L_k(θ_k | x),

where the θ_i's and θ_k's are drawn from the respective posteriors. This choice is similar to Scott's (2002) and to Congdon's (2006) mistaken solutions exposed in Robert and Marin (2008), in that MCMC simulations are run for each model separately and the resulting samples are then gathered together to produce either the posterior expectation (in Scott's, 2002, case) or the posterior distribution (in Congdon's) of

ρ_i L(θ_i | x) / Σ_k ρ_k L(θ_k | x),

neither of which corresponds to a genuine Bayesian solution (see Robert and Marin, 2008).
Again, this is not so much because the dataset x is used repeatedly in this process (since reversible jump MCMC also produces separate samples from the different posteriors) as because of the fundamental lack of a common joint distribution, which is needed in the Bayesian framework. This means, e.g., that the integrated likelihood/Bayes technology is producing samples from the product of the posteriors (a product that clearly is not defined in a Bayesian framework) instead of using pseudo-priors as in Carlin and Chib (1995), i.e., of considering a joint posterior on (θ_1, θ_2), which is [proportional to]

p_1 m_1(x) π_1(θ_1 | x) π_2(θ_2) + p_2 m_2(x) π_2(θ_2 | x) π_1(θ_1). (1)

This makes a difference in the outcome, as illustrated in Figure 2, which compares the distribution of the likelihood ratio under the true joint posterior and under the product of posteriors, when assessing the fit of a Poisson model against the fit of a binomial model with m = 5 trials, for the observation x = 3. The joint simulation produces a much more supportive argument in favor of the binomial model, when compared with the product of the posteriors. (Again, this is inherently the flaw found in the reasoning leading to Scott's, 2002, and Congdon's, 2006, methods for approximating Bayes factors.)

Figure 2: Comparison of the distribution of the likelihood ratio under the correct joint posterior and under the product of the model-based posteriors, when assessing a Poisson model against a binomial with m = 5 trials, for x = 3. The joint simulation produces a much more supportive argument in favor of the binomial model, when compared with the product of the posteriors.
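The contrast shown in Figure 2 can be reproduced in a few lines. The observation x = 3 and the two models (Poisson versus binomial with m = 5 trials) are from the text; the conjugate priors, and the use of the priors themselves as pseudo-priors in the mixture (1), are our own illustrative choices:

```python
import numpy as np
from math import comb, factorial

rng = np.random.default_rng(0)
x, m, N = 3, 5, 50_000

def log_lik_pois(lam):
    # log of the Poisson likelihood at the observation x
    return -lam + x * np.log(lam) - np.log(factorial(x))

def log_lik_binom(p):
    # log of the binomial(m, p) likelihood at the observation x
    return np.log(comb(m, x)) + x * np.log(p) + (m - x) * np.log(1 - p)

# Illustrative conjugate priors (ours): lambda ~ Gamma(1, 1), p ~ Beta(1, 1),
# giving posteriors lambda | x ~ Gamma(1 + x, rate 2) and p | x ~ Beta(4, 3).

# Product-of-posteriors scheme: each parameter simulated from its own posterior
lam = rng.gamma(1 + x, 1 / 2, size=N)
p = rng.beta(1 + x, 1 + m - x, size=N)
lr_product = log_lik_pois(lam) - log_lik_binom(p)

# Joint-posterior scheme, mixture (1) with the priors used as pseudo-priors:
m1 = 1 / 16          # marginal likelihood of x = 3 under Poisson/Gamma(1, 1)
m2 = 1 / 6           # marginal likelihood of x = 3 under binomial/Beta(1, 1)
w1 = m1 / (m1 + m2)  # posterior probability of the Poisson model (p1 = p2 = 1/2)
pick1 = rng.random(N) < w1
lam_j = np.where(pick1, rng.gamma(1 + x, 1 / 2, size=N), rng.gamma(1, 1, size=N))
p_j = np.where(pick1, rng.uniform(size=N), rng.beta(1 + x, 1 + m - x, size=N))
lr_joint = log_lik_pois(lam_j) - log_lik_binom(p_j)
```

Under these choices the joint scheme puts far more mass on very negative log likelihood ratios, i.e. it is markedly more supportive of the binomial model than the product of posteriors, as in Figure 2.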
Although we do not advocate its use, a Bayesian version of Aitkin's proposal can be constructed based on the following loss function, which evaluates the estimation of the model index j based on the values of the parameters under both models and on the observation x:

L(δ, (j, θ_j, θ_{−j})) = I(δ = 1) I(f_2(x | θ_2) > f_1(x | θ_1)) + I(δ = 2) I(f_2(x | θ_2) < f_1(x | θ_1)), (2)

whose Bayes solution is δ^π(x) = 1 if Pr^π(f_2(x | θ_2) < f_1(x | θ_1) | x) > 1/2 and δ^π(x) = 2 otherwise. This solution depends on the joint posterior distribution (1) on (θ_1, θ_2), and thus differs from Aitkin's solution. We have

Pr^π[f_2(x | θ_2) < f_1(x | θ_1) | x] = π(M_1 | x) ∫_{Θ_2} Pr^{π_1}(l_1(θ_1) > l_2(θ_2) | x, θ_2) dπ_2(θ_2) + π(M_2 | x) ∫_{Θ_1} Pr^{π_2}(l_1(θ_1) > l_2(θ_2) | x, θ_1) dπ_1(θ_1),

where l_1 and l_2 denote the log-likelihoods and where the probabilities within the integrals are computed under π_1(θ_1 | x) and π_2(θ_2 | x), respectively. (Pseudo-priors as in Carlin and Chib, 1995, could be used instead of the true priors, a requirement when at least one of those priors is improper.)

An asymptotic evaluation of the above procedure is possible: consider a sample x^n of size n. If M_1 is the "true" model, then π(M_1 | x^n) = 1 + o_p(1) and we have

Pr^{π_1}(l_1(θ_1) > l_2(θ_2) | x^n, θ_2) = Pr[−X²_{p_1} > l_2(θ_2) − l_1(θ̂_1)] + O_p(1/√n) = F_{p_1}[l_1(θ̂_1) − l_2(θ_2)] + O_p(1/√n),

with obvious notation for the corresponding log-likelihoods, p_1 the dimension of Θ_1, θ̂_1 the maximum likelihood estimator of θ_1, and X²_{p_1} a chi-square random variable with p_1 degrees of freedom. Note also that, since l_2(θ_2) ≤ l_2(θ̂_2),

l_1(θ̂_1) − l_2(θ_2) ≥ n KL(f_0, f_{θ*_2}) + O_p(√n),

where KL(f, g) denotes the Kullback–Leibler divergence and θ*_2 denotes the projection of the true model on M_2, i.e., θ*_2 = argmin_{θ_2} KL(f_0, f_{θ_2}). We thus have

Pr^π[f(x^n | θ_2) < f(x^n | θ_1) | x^n] = 1 + o_p(1).
By symmetry, the same asymptotic consistency occurs under model M_2. In contrast, Aitkin's approach leads (at least in regular models) to the approximation

Pr[X²_{p_2} − X²_{p_1} > l_2(θ̂_2) − l_1(θ̂_1)],

where the X²_{p_2} and X²_{p_1} random variables are independent, hence producing quite a different result, one that depends on the asymptotic behavior of the likelihood ratio. Note that for both approaches to be equivalent one would need a pseudo-prior for M_2 (resp. M_1, if M_2 were true) as tight around the maximum likelihood estimate as the posterior π_2(θ_2 | x^n), which would be equivalent to some kind of empirical Bayes procedure.

Furthermore, in the case of embedded models M_1 ⊂ M_2, Aitkin's approach can be given a probabilistic interpretation. To this effect, we write the parameter under M_1 as (θ_1, ψ_0), ψ_0 being a fixed known quantity, and under M_2 as θ_2 = (θ_1, ψ), so that comparing M_1 with M_2 corresponds to testing the null hypothesis ψ = ψ_0. Aitkin does not impose a positive prior probability on M_1, since his prior only bears on M_2 (in a spirit close to the Savage–Dickey representation, see Marin and Robert, 2010). His approach is therefore similar to the inversion of a confidence region into a testing procedure (or vice versa). Under the model M_1 ⊂ M_2, denoting by l(θ, ψ) the log-likelihood of the bigger model,

Pr^π[l(θ_1, ψ_0) > l(θ_1, ψ) | x^n] ≈ Pr[X²_{p_2 − p_1} > −l(θ̂_1(ψ_0), ψ_0) + l(θ̂_1, ψ̂)] ≈ 1 − F_{p_2 − p_1}[−l(θ̂_1(ψ_0), ψ_0) + l(θ̂_1, ψ̂)],

which is the approximate p-value associated with the likelihood ratio test. Therefore, the aim of this approach seems to be, at least for embedded models where the Bernstein–von Mises theorem holds for the posterior distribution, to construct a Bayesian procedure reproducing the p-value associated with the likelihood ratio test.
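The reproduction of the likelihood ratio p-value can be seen exactly in the simplest embedded case. In the sketch below (our example, not the book's), x_1, ..., x_n are N(µ, 1), the null fixes µ = 0, and a flat prior on µ gives the posterior µ | x ~ N(x̄, 1/n); the posterior probability that the null log-likelihood beats the alternative's then matches the two-sided z-test p-value:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)

# Our example: n observations from N(mu, 1), testing mu = 0 inside the full
# model, with a flat prior on mu so that mu | x ~ N(xbar, 1/n).
n = 50
data = rng.normal(0.3, 1.0, size=n)
xbar = float(data.mean())

# In this model, l(mu) - l(0) = n * mu * xbar - n * mu**2 / 2 in closed form,
# so the posterior probability that l(0) > l(mu) is a simple Monte Carlo mean
mu = rng.normal(xbar, 1 / sqrt(n), size=200_000)
post_prob = float(np.mean(n * mu * xbar - 0.5 * n * mu**2 < 0))

# Two-sided p-value of the z-test of mu = 0
z = sqrt(n) * abs(xbar)
p_value = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
```

Here Pr^π(l(0) > l(µ) | x) = 2Φ(−√n |x̄|), which is exactly the classical p-value; the Monte Carlo estimate agrees up to simulation error.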
From a frequentist point of view, it is of interest to see that the posterior probability of the likelihood ratio being greater than one is approximately a p-value, at least in cases where the Bernstein–von Mises theorem holds, e.g., for embedded models and proper priors. This p-value can then be given a finite-sample meaning (under the above restrictions); however, it seems more interesting from a frequentist perspective than from a Bayesian one.³ From a Bayesian decision-theoretic viewpoint, this is even more dubious, since the loss function (2) is difficult to interpret and to justify.

³See Chapter 7 of Gelman et al. (2003) for a fully Bayesian treatment of finite-sample inference.

"Without a specific alternative, the best we can do is to make posterior probability statements about µ and transfer these to the posterior distribution of the likelihood ratio (...) There cannot be strong evidence in favor of a point null hypothesis against a general alternative hypothesis." Statistical Inference, pages 42–44

We further note that, once Statistical Inference has set the principle of using the posterior distribution of the likelihood ratio (or rather of the divergence difference, since this is at least symmetric in both hypotheses), there is a whole range of outputs available, including confidence intervals on the difference, for checking whether or not they contain zero. From our (Bayesian) perspective, this solution (a) is not Bayesian, for reasons exposed above, (b) is not parameterization invariant, and (c) relies once again on an arbitrary confidence level.

5 Misrepresentations

We have focused in this review on Aitkin's proposals rather than on his characterizations of other statistical methods. In a few places, however, we believe that there have been some unfortunate confusions on his part. On page 22, Aitkin describes Bayesian posterior distributions as "formally a measure of personal uncertainty about the model parameter," a statement that we believe holds generally only under a definition of "personal" that is so broad as to be meaningless. As we have discussed elsewhere (Gelman, 2008), Bayesian probabilities can be viewed as "subjective" or "personal," but this is not necessary. Or, to put it another way: if you want to label my posterior distribution as "personal" because it is based on my personal choice of prior distribution, you should also label inferences from the proportional hazards model as "personal" because they are based on the user's choice of the parameterization of Cox (1972); you should also label any linear regression (classical or otherwise) as "personal," as based on the individual's choice of predictors and assumptions of additivity, linearity, variance function, and error distribution; and so on for all but the very simplest models in existence.

In a nearly century-long tradition in statistics, any probability model is sharply divided into "likelihood" (which is considered to be objective and, in textbook presentations, is often simply given as part of the mathematical specification of the problem) and "prior" (a dangerously subjective entity into which the statistical researcher is encouraged to pour all of his or her pent-up skepticism). This may be a tradition, but it has no logical basis. If writers such as Aitkin wish to consider their likelihoods as objective and their priors as subjective, that is their privilege. But we would prefer them to restrain themselves when characterizing the models of others.
It would be polite to either tentatively accept the objectivity of others' models or, contrariwise, to gallantly affirm the subjectivity of one's own choices.

Aitkin also mischaracterizes hierarchical models, writing: "It is important not to interpret the prior as in some sense a model for nature [italics in the original], that nature has used a random process to draw a parameter value from a higher distribution of parameter values . . ." On the contrary, that is exactly how we interpret the prior distribution in the ideal case. Admittedly, we do not generally approach this ideal (except in settings such as genetics, where the population distribution of parameters has a clear sampling distribution), just as in practice the error terms in our regression models do not capture the true distribution of errors. Despite these imperfections, we believe that it can often be helpful to interpret the prior as a model for the parameter-generation process and to improve this model where appropriate.

6 Contributions of the book

Statistical Inference points out several important facts that are individually well known (but perhaps not well enough!), and by putting them all in one place it foregrounds the difficulty or impossibility of reconciling the different approaches to model checking. We all know that the p-value is in no way the posterior probability of a null hypothesis being true; in addition, Bayes factors as generally practiced correspond to no actual probability model. Also, it is well known that the so-called harmonic mean approach to calculating Bayes factors is inherently unstable, to the extent that in the situations where it does "work," it works by implicitly integrating over a space different from that of its nominal model.
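The instability of the harmonic mean estimator is easy to reproduce. The following sketch is our own toy example (not from the book): for a conjugate normal model the marginal likelihood is available in closed form, and harmonic-mean estimates computed from posterior draws can be compared against it across replications; the estimates typically overshoot the truth and vary erratically, reflecting the estimator's infinite variance.

```python
import math, random

random.seed(7)

# Toy conjugate model (hypothetical numbers): y_i ~ N(theta, 1), i = 1..n,
# with prior theta ~ N(0, tau^2).  Everything is analytic except the
# harmonic mean estimator itself.
n, tau = 30, 10.0
y = [random.gauss(1.0, 1.0) for _ in range(n)]
ybar = sum(y) / n
S = sum((yi - ybar) ** 2 for yi in y)

def log_norm_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# Exact log marginal likelihood: m(y) = (2*pi)^(-n/2) e^(-S/2)
#   * (2*pi/n)^(1/2) * N(ybar; 0, tau^2 + 1/n)
log_m = (-(n / 2) * math.log(2 * math.pi) - S / 2
         + 0.5 * math.log(2 * math.pi / n)
         + log_norm_pdf(ybar, 0.0, tau ** 2 + 1 / n))

# Conjugate posterior: theta | y ~ N(m_post, v_post)
v_post = 1 / (n + 1 / tau ** 2)
m_post = v_post * n * ybar

def log_lik(theta):
    return -(n / 2) * math.log(2 * math.pi) - sum((yi - theta) ** 2 for yi in y) / 2

def harmonic_mean_estimate(n_draws=5000):
    """Estimate log m(y) as -log of the mean of 1/L over posterior draws."""
    draws = [random.gauss(m_post, math.sqrt(v_post)) for _ in range(n_draws)]
    neg_ll = [-log_lik(t) for t in draws]        # log(1/L) for each draw
    mx = max(neg_ll)                             # log-sum-exp for stability
    log_mean_inv = mx + math.log(sum(math.exp(v - mx) for v in neg_ll) / n_draws)
    return -log_mean_inv

estimates = [harmonic_mean_estimate() for _ in range(5)]
print("exact log m(y):", round(log_m, 2))
print("harmonic mean estimates:", [round(e, 2) for e in estimates])
```

The reason for the failure is visible in the algebra: E_post[1/L(θ)] = 1/m(y) exactly, so the estimator is consistent, but the expectation is dominated by prior-tail draws that a finite posterior sample almost never visits, so the finite-sample estimates are driven by whichever rare low-likelihood draw happens to appear.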
Yes, we all know these things, but as is often the case with scientific anomalies, they are associated with such a high level of discomfort that many researchers tend to forget the problems or try to finesse them. It is refreshing to see the anomalies laid out so clearly.

At some points, however, Aitkin disappoints. For example, at the end of Section 7.2, he writes: "In the remaining sections of this chapter, we first consider the posterior predictive p-value and point out difficulties with the posterior predictive distribution which closely parallel those of Bayes factors." He follows up with a section entitled "The posterior predictive distribution," which concludes with an example that he writes "should be a matter of serious concern [emphasis in original] to those using posterior predictive distributions for predictive probability statements."

What is this example of serious concern? It is an imaginary problem in which he observes 1 success in 10 independent trials and then is asked to compute the probability of getting at most 2 successes in 20 more trials from the same process. Statistical Inference assumes a uniform prior distribution on the success probability and yields a predictive probability of 0.447, which, to him, "looks a vastly optimistic and unsound statement." Here, we think Aitkin should take Bayes a bit more seriously. If you think this predictive probability is unsound, there should be some aspect of the prior distribution or the likelihood that is unsound as well. This is what Good (1950) called "the device of imaginary results." We suggest that, rather than abandoning highly effective methods based on predictive distributions, Aitkin should look more carefully at his predictive distributions and either alter his model to fit his intuitions, alter his intuitions to fit his model, or do a bit of both. This is the value of inferential coherence as an ideal.
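The 0.447 figure is straightforward to reproduce: with a uniform (Beta(1, 1)) prior and 1 success in 10 trials, the posterior is Beta(2, 10), and the posterior predictive for the next 20 trials is beta-binomial. A short check using only the standard library:

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# Posterior after 1 success in 10 trials under a uniform Beta(1, 1) prior
a, b = 1 + 1, 1 + 9           # Beta(2, 10)
m = 20                        # number of future trials

def beta_binom_pmf(k):
    """Posterior predictive P(k successes in m future trials)."""
    return comb(m, k) * exp(log_beta(a + k, b + m - k) - log_beta(a, b))

prob_at_most_2 = sum(beta_binom_pmf(k) for k in range(3))
print(round(prob_at_most_2, 3))   # 0.447, matching the book's figure
```

Whether 0.447 "looks optimistic" is then a question about the uniform prior and the binomial likelihood, not about predictive distributions in general: replacing Beta(1, 1) with a prior that matches one's intuitions changes the answer directly.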
7 Solving non-problems

Several of the examples in Statistical Inference represent solutions to problems that seem to us to be artificial or conventional tasks with no clear analogy to applied work.

"They are artificial and are expressed in terms of a survey of 100 individuals expressing support (Yes/No) for the president, before and after a presidential address (...) The question of interest is whether there has been a change in support between the surveys (...). We want to assess the evidence for the hypothesis of equality H1 against the alternative hypothesis H2 of a change." (Statistical Inference, page 147)

Based on our experience in public opinion research, this is not a real question. Support for any political position is always changing. The real question is how much the support has changed, or perhaps how this change is distributed across the population.

A defender of Aitkin (and of classical hypothesis testing) might respond at this point that, yes, everybody knows that changes are never exactly zero and that we should take a more "grown-up" view of the null hypothesis: not that the change is zero but that it is nearly zero. Unfortunately, the metaphorical interpretation of hypothesis tests has problems similar to the theological doctrines of the Unitarian church. Once you have abandoned literal belief in the Bible, the question soon arises: why follow it at all? Similarly, once one recognizes the inappropriateness of the point null hypothesis, it makes more sense not to try to rehabilitate it or treat it as a treasured metaphor, but rather to attack our statistical problems directly, in this case by performing inference on the change in opinion in the population.

To be clear: we are not denying the value of hypothesis testing. In this example, we find it completely reasonable to ask whether observed changes are statistically significant, i.e.
whether the data are consistent with a null hypothesis of zero change. What we do not find reasonable is the statement that "the question of interest is whether there has been a change in support."

Figure 3: (a) Hypothetical graph of presidential approval with discrete jumps; (b) presidential approval series (for George W. Bush) showing movement at many different time scales. If the approval series looked like the graph on the left, then Aitkin's "question of interest" of "whether there has been a change in support between the surveys" would be completely reasonable. In the context of actual public opinion data, the question does not make sense; instead, we prefer to think of presidential approval as a continuously-varying process.

All this is application-specific. Suppose public opinion were observed to really be flat, punctuated by occasional changes, as in the left graph of Figure 3. In that case, Aitkin's question of "whether there has been a change" would be well-defined and appropriate, in that we could interpret the null hypothesis of no change as some minimal level of baseline variation. Real public opinion, however, does not look like baseline noise plus jumps; rather, it shows continuous movement on many time scales at once, as can be seen from the right graph of Figure 3, which shows actual presidential approval data. In this example, we do not see Aitkin's question as at all reasonable. Any attempt to work with a null hypothesis of opinion stability will be inherently arbitrary. It would make much more sense to model opinion as a continuously-varying process.
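Performing inference on the change directly is simple in the survey setting Aitkin describes. The sketch below is our own illustration (the counts are invented, and the two waves are treated as independent samples; the paired before/after design of the book's example would call for a model of individual transitions instead): independent Beta posteriors for each wave's support rate yield, by simulation, a posterior for the change p2 − p1, which is what one would actually report.

```python
import random

random.seed(0)

# Hypothetical survey counts (not from the book): 100 respondents per wave,
# waves treated as independent samples for simplicity.
yes1, n1 = 52, 100    # support before the address
yes2, n2 = 58, 100    # support after the address

# Independent uniform priors give Beta posteriors for each wave's support rate
def posterior_draws(yes, n, size=100_000):
    return [random.betavariate(1 + yes, 1 + n - yes) for _ in range(size)]

p1 = posterior_draws(yes1, n1)
p2 = posterior_draws(yes2, n2)
delta = sorted(b - a for a, b in zip(p1, p2))

# Summarize the posterior of the change itself, not a test of "no change"
mean_change = sum(delta) / len(delta)
interval = (delta[int(0.025 * len(delta))], delta[int(0.975 * len(delta))])
print(f"posterior mean change: {mean_change:.3f}")
print(f"95% interval: ({interval[0]:.3f}, {interval[1]:.3f})")
```

The output answers the question that matters ("how much has support changed, and how precisely do we know it?") and degrades gracefully when the change is small, whereas a test of p1 = p2 answers a question whose null is known in advance to be false.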
The statistical problem here is not merely that the null hypothesis of zero change is nonsensical; it is that the null is in no sense a reasonable approximation to any interesting model. The sociological problem is that, from Savage (1954) onward, many Bayesians have felt the need to mimic the classical null-hypothesis testing framework, even where it makes no sense. Aitkin is unfortunately no exception, taking a straightforward statistical question (estimating a time trend in opinion) and re-expressing it as an abstracted hypothesis-testing problem that pulls the analyst away from any interesting political questions.

8 Conclusion: Why did we write this review?

"The posterior has a non-integrable spike at zero. This is equivalent to assigning zero prior probability to these unobserved values." (Statistical Inference, page 98)

A skeptical (or even not-so-skeptical) reader might at this point ask: why did we bother to write a detailed review of a somewhat obscure statistical method that we do not even like? Our motivation surely was not to protect the world from a dangerous idea; if anything, we suspect our review will interest some readers who otherwise would not have heard about the approach (as previously illustrated by Robert, 2010).

In 1970, a book such as Statistical Inference could have had a large influence in statistics. As Aitkin notes in his preface, there was a resurgence of interest in the foundations of statistics around that time, with Lindley, Dempster, Barnard, and others writing about the intersections between classical and Bayesian inference (going beyond the long-understood results of asymptotic equivalence), and researchers such as Akaike and Mallows beginning to integrate model-based and predictive approaches to inference.
A glance at the influential text of Cox and Hinkley (1974) reveals that theoretical statistics at that time was focused on inference from independent data from specified sampling distributions (possibly after discarding information, as in rank-based tests), and "likelihood" was central to all these discussions. Forty years on, a book on likelihood inference is more of a niche item. Partly this is simply a consequence of the growth of the field: with the proliferation of books, journals, and online publications, it is much more difficult for any single book to gain prominence. More than that, though, we think statistical theory has moved away from iid analysis, toward more complex, structured problems.

That said, the foundational problems that Statistical Inference discusses are indeed important, and they have not yet been resolved. As models get larger, the problem of "nuisance parameters" is revealed to be not a mere nuisance but rather a central fact in all methods of statistical inference. As noted above, Aitkin makes valuable points (known, but not well-enough known) about the difficulties of Bayes factors, pure likelihood, and other superficially attractive approaches to model comparison. We believe it is a natural continuation of this work to point out the problems of the integrated likelihood approach as well.

For now, we recommend model expansion, Bayes factors where reasonable, cross-validation, and predictive model checking based on graphics rather than p-values. We recognize that each of these approaches has loose ends. But, as practical idealists, we consider inferential challenges to be opportunities for model improvement within the Bayesian realm rather than motivations for a new theory of noninformative priors that takes us into uncharted territories.

References

Aitkin, M. 1991. Posterior Bayes factors (with discussion). J. Royal Statist. Society Series B 53: 111–142.

Bernardo, J. 2011.
Integrated objective Bayesian estimation and hypothesis testing. In Bayesian Statistics 9, eds. J. Bernardo, J. Berger, M. Bayarri, D. Heckerman, A. Smith, and M. West, 1–68. Oxford: Oxford University Press.

Carlin, B. and S. Chib. 1995. Bayesian model choice through Markov chain Monte Carlo. J. Royal Statist. Society Series B 57(3): 473–484.

Carlin, B. and T. Louis. 2008. Bayes and Empirical Bayes Methods for Data Analysis. 3rd ed. New York: Chapman and Hall.

Chopin, N. and C. Robert. 2010. Properties of nested sampling. Biometrika 97: 741–755.

Congdon, P. 2006. Bayesian model choice based on Monte Carlo estimates of posterior model probabilities. Comput. Stat. Data Analysis 50: 346–357.

Dawid, A. 2002. Discussion of "Bayesian measures of model complexity and fit" by Spiegelhalter et al. J. Royal Statist. Society Series B 64(2): 583–639.

DeGroot, M. 1973. Doing what comes naturally: Interpreting a tail area as a posterior probability or as a likelihood ratio. J. American Statist. Assoc. 68: 966–969.

Efron, B. 2010. The future of indirect evidence (with discussion). Statist. Science 25(2): 145–171.

Efron, B. and C. Morris. 1975. Data analysis using Stein's estimator and its generalizations. J. American Statist. Assoc. 70: 311–319.

Geisser, S. and W. Eddy. 1979. A predictive approach to model selection. J. American Statist. Assoc. 74: 153–160.

Gelman, A., J. Carlin, H. Stern, and D. Rubin. 2003. Bayesian Data Analysis. 2nd ed. New York: Chapman and Hall.

Gelman, A. and X. Meng. 1998. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statist. Science 13: 163–185.

Good, I. 1950. Probability and the Weighing of Evidence. London: Charles Griffin.

Hartigan, J. A. 1983. Bayes Theory. New York: Springer-Verlag.

Jaynes, E. 2003. Probability Theory. Cambridge: Cambridge University Press.

Jeffreys, H. 1939. Theory of Probability.
1st ed. Oxford: The Clarendon Press.

Johnson, V. and D. Rossell. 2010. On the use of non-local prior densities in Bayesian hypothesis tests. J. Royal Statist. Society Series B 72: 143–170.

Liang, F., R. Paulo, G. Molina, M. Clyde, and J. Berger. 2008. Mixtures of g priors for Bayesian variable selection. J. American Statist. Assoc. 103(481): 410–423.

Marin, J. and C. Robert. 2007. Bayesian Core. New York: Springer-Verlag.

—. 2010. On resolving the Savage–Dickey paradox. Electron. J. Statist. 4: 643–654.

Plummer, M. 2008. Penalized loss functions for Bayesian model comparison. Biostatistics 9(3): 523–539.

Robbins, H. 1964. The empirical Bayes approach to statistical decision problems. Ann. Math. Statist. 35: 1–20.

Robert, C. 2001. The Bayesian Choice. 2nd ed. New York: Springer-Verlag.

—. 2010. The Search for Certainty: a critical assessment (with discussion). Bayesian Analysis 5(2): 213–222.

Robert, C. and J.-M. Marin. 2008. On some difficulties with a posterior probability approximation technique. Bayesian Analysis 3(2): 427–442.

Savage, L. 1954. The Foundations of Statistics. New York: John Wiley.

Scott, S. L. 2002. Bayesian methods for hidden Markov models: recursive computing in the 21st century. J. American Statist. Assoc. 97: 337–351.

Seidenfeld, T. 1992. R. A. Fisher's fiducial argument and Bayes' theorem. Statist. Science 7(3): 358–368.

Skilling, J. 2006. Nested sampling for general Bayesian computation. Bayesian Analysis 1(4): 833–860.

Smith, A. and D. Spiegelhalter. 1982. Bayes factors for linear and log-linear models with vague prior information. J. Royal Statist. Society Series B 44: 377–387.

Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. van der Linde. 2002. Bayesian measures of model complexity and fit (with discussion). J. Royal Statist. Society Series B 64(2): 583–639.