Importance sampling methods for Bayesian discrimination between embedded models


Authors: Jean-Michel Marin, Christian P. Robert

Affiliations: J.-M. Marin, Institut de Mathématiques et Modélisation de Montpellier, Université Montpellier 2, Case Courrier 51, 34095 Montpellier cedex 5, France (Jean-Michel.Marin@univ-montp2.fr); C.P. Robert, CEREMADE, Université Paris Dauphine, 75775 Paris, and Centre de Recherche en Economie et Statistique, INSEE, Paris, France (xian@ceremade.dauphine.fr).

June 8, 2018

Abstract

This paper surveys some well-established approaches to the approximation of Bayes factors used in Bayesian model choice, mostly as covered in Chen et al. (2000). Our focus here is on methods based on importance sampling strategies, rather than on variable dimension techniques like reversible jump MCMC, including: crude Monte Carlo, maximum likelihood based importance sampling, bridge and harmonic mean sampling, as well as Chib's method based on the exploitation of a functional equality. We demonstrate in this survey how these different methods can be efficiently implemented for testing the significance of a predictive variable in a probit model. Finally, we compare their performances on a real dataset.

Keywords: Bayesian inference; model choice; Bayes factor; Monte Carlo; importance sampling; bridge sampling; Chib's functional identity; supervised learning; probit model

1 Introduction

The contribution of Jim Berger to the better understanding of Bayesian testing is fundamental and wide-ranging, from establishing the fundamental difficulties with p-values in Berger and Sellke (1987), to formalising the intrinsic Bayes factors in Berger and Pericchi (1996), to solving the difficulty with improper priors in Berger et al. (1998), and beyond! While our contribution in this area is obviously much more limited, we aim at presenting here the most standard approaches to the approximation of Bayes factors.

The Bayes factor indeed is a fundamental procedure that stands at the core of the Bayesian theory of testing hypotheses, at least in the approach advocated both by Jeffreys (1939) and by Jaynes (2003). (Note that Robert et al. 2009 provides a reassessment of the crucial role of Jeffreys 1939 in setting a formal framework for Bayesian testing as well as for regular inference.) Given a hypothesis $H_0: \theta \in \Theta_0$ on the parameter $\theta \in \Theta$ of a statistical model, with observation $y$ and density $f(y\mid\theta)$, under a compatible prior of the form

$$\pi(\Theta_0)\,\pi_0(\theta) + \pi(\Theta_0^c)\,\pi_1(\theta),$$

the Bayes factor is defined as the posterior odds to prior odds ratio, namely

$$B_{01}(y) = \frac{\pi(\Theta_0 \mid y)}{\pi(\Theta_0^c \mid y)} \bigg/ \frac{\pi(\Theta_0)}{\pi(\Theta_0^c)} = \int_{\Theta_0} f(y\mid\theta)\,\pi_0(\theta)\,\mathrm{d}\theta \bigg/ \int_{\Theta_0^c} f(y\mid\theta)\,\pi_1(\theta)\,\mathrm{d}\theta.$$

Model choice can be considered from a similar perspective, since, under the Bayesian paradigm (see, e.g., Robert 2001), the comparison of models

$$M_i: \quad y \sim f_i(y\mid\theta_i), \quad \theta_i \sim \pi_i(\theta_i), \quad \theta_i \in \Theta_i, \quad i \in \mathcal{I},$$

where the family $\mathcal{I}$ can be finite or infinite, leads to posterior probabilities of the models under comparison such that

$$\mathbb{P}(M = M_i \mid y) \propto p_i \int_{\Theta_i} f_i(y\mid\theta_i)\,\pi_i(\theta_i)\,\mathrm{d}\theta_i,$$

where $p_i = \mathbb{P}(M = M_i)$ is the prior probability of model $M_i$.

In this short survey, we consider some of the most common Monte Carlo solutions used to approximate a generic Bayes factor or its fundamental component, the evidence

$$m_i = \int_{\Theta_i} \pi_i(\theta_i)\, f_i(y\mid\theta_i)\,\mathrm{d}\theta_i,$$

aka the marginal likelihood. Longer entries can be found in Carlin and Chib (1995), Chen et al. (2000), Robert and Casella (2004), or Friel and Pettitt (2008).
Note that we only briefly mention here trans-dimensional methods issued from the revolutionary paper of Green (1995), since our goal is to demonstrate that within-model simulation methods allow for the computation of Bayes factors and thus avoid the additional complexity involved in trans-dimensional methods. While amenable to an importance sampling technique of sorts, the alternative approach of nested sampling (Skilling 2006) is discussed in Chopin and Robert (2007) and Robert and Wraith (2009).

2 The Pima Indian benchmark model

In order to compare the performances of all methods presented in this survey, we chose to evaluate the corresponding estimates of the Bayes factor in the setting of a single variable selection for a probit model and to repeat the estimation in a Monte Carlo experiment to empirically assess the variability of those estimates.

We recall that a probit model can be represented as a natural latent variable model in that, if we consider a sample $z_1, \dots, z_n$ of $n$ independent latent variables associated with a standard regression model, i.e. such that

$$z_i \mid \theta \sim \mathcal{N}\left(x_i^T\theta,\, 1\right),$$

where the $x_i$'s are $p$-dimensional covariates and $\theta$ is the vector of regression coefficients, then $y_1, \dots, y_n$ such that $y_i = \mathbb{I}_{z_i > 0}$ is a probit sample. Indeed, given $\theta$, the $y_i$'s are independent Bernoulli random variables with $\mathbb{P}(y_i = 1 \mid \theta) = \Phi(x_i^T\theta)$, where $\Phi$ is the standard normal cdf.

The choice of a reference prior distribution for the probit model is open to debate, but the connection with the latent regression model induced Marin and Robert (2007) to suggest a $g$-prior model,

$$\theta \sim \mathcal{N}\left(0_p,\, n(X^TX)^{-1}\right),$$

with $n$ as the $g$ factor and $X$ as the regressor matrix. The corresponding posterior distribution is then associated with the density

$$\pi(\theta \mid y, X) \propto \prod_{i=1}^{n} \left\{1 - \Phi(x_i^T\theta)\right\}^{1-y_i}\, \Phi(x_i^T\theta)^{y_i} \times \exp\left\{-\theta^T(X^TX)\theta / 2n\right\}, \quad (1)$$

where $y = (y_1, \dots, y_n)$.

In the completed model, i.e. when including the latent variables $z = (z_1, \dots, z_n)$ into the model, the $y_i$'s are deterministic functions of the $z_i$'s and the so-called completed likelihood is

$$f(y, z \mid \theta) = (2\pi)^{-n/2} \exp\left\{ -\sum_{i=1}^{n} \left(z_i - x_i^T\theta\right)^2 \!/\, 2 \right\} \prod_{i=1}^{n} \left( \mathbb{I}_{y_i = 0}\,\mathbb{I}_{z_i \le 0} + \mathbb{I}_{y_i = 1}\,\mathbb{I}_{z_i > 0} \right).$$

The derived conditional distributions

$$z_i \mid y_i, \theta \sim \begin{cases} \mathcal{N}_+\left(x_i^T\theta, 1, 0\right) & \text{if } y_i = 1, \\ \mathcal{N}_-\left(x_i^T\theta, 1, 0\right) & \text{if } y_i = 0, \end{cases} \quad (2)$$

are of interest for constructing a Gibbs sampler on the completed model, where $\mathcal{N}_+(x_i^T\theta, 1, 0)$ denotes the Gaussian distribution with mean $x_i^T\theta$ and variance 1 that is left-truncated at 0, while $\mathcal{N}_-(x_i^T\theta, 1, 0)$ denotes the symmetrical normal distribution that is right-truncated at 0. The corresponding full conditional on the parameters is given by

$$\theta \mid y, z \sim \mathcal{N}\left( \frac{n}{n+1}(X^TX)^{-1}X^Tz,\; \frac{n}{n+1}(X^TX)^{-1} \right). \quad (3)$$

Indeed, since direct simulation from the posterior distribution of $\theta$ is intractable, Albert and Chib (1993) suggest implementing a Gibbs sampler based on the above set of full conditionals. More precisely, given the current value of $\theta$, one cycle of the Gibbs algorithm produces a new value for $z$ as simulated from the conditional distribution (2), which, when substituted into (3), produces a new value for $\theta$. Although it does not impact the long-term properties of the sampler, the starting value of $\theta$ may be taken as the maximum likelihood estimate to avoid burn-in steps in the Gibbs sampler.

Given this probit model, the dataset we consider covers a population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona. These women were tested for diabetes according to World Health Organization (WHO) criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases, and are available with the basic R package (R Development Core Team 2008).
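Returning to the Albert and Chib (1993) Gibbs sampler described above, a minimal Python sketch (the paper itself works in R) may help fix ideas. The one-covariate setting, the simulated data, the function names, and the naive rejection sampler for the truncated normals in (2) are our own illustrative choices, not the paper's implementation:

```python
import math
import random

def sample_truncnorm(rng, mean, positive):
    """Naive rejection sampler for a N(mean, 1) variate truncated to
    (0, inf) when positive is True, and to (-inf, 0] otherwise."""
    while True:
        z = rng.gauss(mean, 1.0)
        if (z > 0) == positive:
            return z

def gibbs_probit(x, y, n_iter=2000, seed=42):
    """Albert-Chib Gibbs sampler for a one-covariate probit model
    under the g-prior theta ~ N(0, n / x'x)."""
    rng = random.Random(seed)
    n = len(y)
    xtx = sum(xi * xi for xi in x)
    shrink = n / (n + 1.0)          # the n/(n+1) shrinkage factor in (3)
    theta, draws = 0.0, []
    for _ in range(n_iter):
        # step (2): z_i | y_i, theta is N(x_i theta, 1) truncated by the sign of y_i
        z = [sample_truncnorm(rng, xi * theta, yi == 1) for xi, yi in zip(x, y)]
        # step (3): theta | y, z is Gaussian
        mean = shrink * sum(xi * zi for xi, zi in zip(x, z)) / xtx
        theta = rng.gauss(mean, math.sqrt(shrink / xtx))
        draws.append(theta)
    return draws

# simulate a probit sample with true theta = 1, then fit it
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(100)]
y = [1 if rng.gauss(xi, 1.0) > 0 else 0 for xi in x]
draws = gibbs_probit(x, y)
post_mean = sum(draws[500:]) / len(draws[500:])
```

For the real multi-covariate problem, step (3) would use a multivariate Gaussian draw and a more efficient truncated normal sampler than plain rejection.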
This dataset, used as a benchmark for supervised learning methods, contains information about 332 women with the following variables:

– glu: plasma glucose concentration in an oral glucose tolerance test;
– bp: diastolic blood pressure (mm Hg);
– ped: diabetes pedigree function;
– type: Yes or No, for diabetic according to WHO criteria.

For this dataset, the goal is to explain the diabetes variable type by using the explanatory variables glu, bp and ped. The following table is an illustration of a classical (maximum likelihood) analysis of this dataset, obtained using the R glm() function with the probit link:

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1347  -0.9217  -0.6963   0.9959   2.3235

Coefficients:
     Estimate Std. Error z value Pr(>|z|)
glu  0.012616   0.002406   5.244 1.57e-07 ***
bp  -0.029050   0.004094  -7.096 1.28e-12 ***
ped  0.350301   0.208806   1.678   0.0934 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 460.25 on 332 degrees of freedom
Residual deviance: 386.73 on 329 degrees of freedom
AIC: 392.73

Number of Fisher Scoring iterations: 4

This analysis sheds some doubt on the relevance of the covariate ped in the model and we can reproduce the study from a Bayesian perspective, computing the Bayes factor $B_{01}$ opposing the probit model based only on the covariates glu and bp (model 0) to the probit model based on the covariates glu, bp, and ped (model 1). This is equivalent to testing the hypothesis $H_0: \theta_3 = 0$ since the models are nested, where $\theta_3$ is the parameter of the probit model associated with covariate ped. (Note that there is no intercept in either model.)
If we denote by $X_0$ the $332 \times 2$ matrix containing the values of glu and bp for the 332 individuals and by $X_1$ the $332 \times 3$ matrix containing the values of the covariates glu, bp, and ped, the Bayes factor $B_{01}$ is given by

$$B_{01}(y) = (2\pi)^{1/2}\, n^{1/2}\, \frac{|X_0^TX_0|^{1/2}}{|X_1^TX_1|^{1/2}}\; \frac{\displaystyle\int_{\mathbb{R}^2} \prod_{i=1}^{n} \left\{1 - \Phi\left((X_0)_{i,\cdot}\theta\right)\right\}^{1-y_i} \Phi\left((X_0)_{i,\cdot}\theta\right)^{y_i} \exp\left\{-\theta^T(X_0^TX_0)\theta/2n\right\}\mathrm{d}\theta}{\displaystyle\int_{\mathbb{R}^3} \prod_{i=1}^{n} \left\{1 - \Phi\left((X_1)_{i,\cdot}\theta\right)\right\}^{1-y_i} \Phi\left((X_1)_{i,\cdot}\theta\right)^{y_i} \exp\left\{-\theta^T(X_1^TX_1)\theta/2n\right\}\mathrm{d}\theta} \quad (4)$$

$$= \mathbb{E}_{\mathcal{N}_2(0_2,\, n(X_0^TX_0)^{-1})}\left[ \prod_{i=1}^{n} \left\{1 - \Phi\left((X_0)_{i,\cdot}\theta\right)\right\}^{1-y_i} \Phi\left((X_0)_{i,\cdot}\theta\right)^{y_i} \right] \bigg/ \mathbb{E}_{\mathcal{N}_3(0_3,\, n(X_1^TX_1)^{-1})}\left[ \prod_{i=1}^{n} \left\{1 - \Phi\left((X_1)_{i,\cdot}\theta\right)\right\}^{1-y_i} \Phi\left((X_1)_{i,\cdot}\theta\right)^{y_i} \right],$$

using the shortcut notation that $A_{i,\cdot}$ is the $i$-th row of the matrix $A$.

3 The basic Monte Carlo solution

As already shown above, when testing for a null hypothesis (or a model) $H_0: \theta \in \Theta_0$ against the alternative hypothesis (or the alternative model) $H_1: \theta \in \Theta_1$, the Bayes factor is defined by

$$B_{01}(y) = \int_{\Theta_0} f(y\mid\theta_0)\,\pi_0(\theta_0)\,\mathrm{d}\theta_0 \bigg/ \int_{\Theta_1} f(y\mid\theta_1)\,\pi_1(\theta_1)\,\mathrm{d}\theta_1.$$

We assume in this survey that the prior distributions under both the null and the alternative hypotheses are proper, as, typically, they should be. (In the case of common nuisance parameters, a common improper prior measure can be used on those; see Berger et al. (1998) and Marin and Robert (2007). This obviously complicates the computational aspect, as some methods like crude Monte Carlo cannot be used at all, while others are more prone to suffer from infinite variance.) In that setting, the most elementary approximation to $B_{01}(y)$ consists in using a ratio of two standard Monte Carlo approximations based on simulations from the corresponding priors. Indeed, for $i = 0, 1$:

$$\int_{\Theta_i} f(y\mid\theta)\,\pi_i(\theta)\,\mathrm{d}\theta = \mathbb{E}_{\pi_i}\left[f(y\mid\theta)\right].$$
Figure 1: Pima Indian dataset: boxplot of 100 Monte Carlo estimates of $B_{01}(y)$ based on simulations from the prior distributions, for $2 \times 10^4$ simulations.

If $\theta_{0,1}, \dots, \theta_{0,n_0}$ and $\theta_{1,1}, \dots, \theta_{1,n_1}$ are two independent samples generated from the prior distributions $\pi_0$ and $\pi_1$, respectively, then

$$\frac{n_0^{-1}\sum_{j=1}^{n_0} f(y\mid\theta_{0,j})}{n_1^{-1}\sum_{j=1}^{n_1} f(y\mid\theta_{1,j})} \quad (5)$$

is a strongly consistent estimator of $B_{01}(y)$.

In most cases, sampling from the prior distribution corresponding to either hypothesis is straightforward and fast. Therefore, the above estimator is extremely easy to derive as a brute-force evaluation of the Bayes factor. However, if either of the posterior distributions is quite different from the corresponding prior distribution (and it should be for vague priors), the Monte Carlo evaluation of the corresponding evidence is highly inefficient, since the sample will overwhelmingly produce negligible values of $f(y\mid\theta_{i,j})$. In addition, if $f^2(y\mid\theta)$ is not integrable against $\pi_0$ or $\pi_1$, the resulting estimator has infinite variance. Since importance sampling usually requires an equivalent computational effort, with a potentially high efficiency reward, crude Monte Carlo approaches of this type are usually disregarded.

Figure 1 and Table 1 summarize the results based on 100 replications of Monte Carlo approximations of $B_{01}(y)$, using equation (5) with $n_0 = n_1 = 20{,}000$ simulations. As predicted, the variability of the estimator is very high when compared with the other estimates studied in this survey. (Obviously, the method is asymptotically unbiased and, the functions being square integrable in (4), has a finite variance. A massive simulation effort would obviously lead to a precise estimate of the Bayes factor.)
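To make the prior-sampling estimator in (5) concrete on a case where the evidence is available in closed form, consider the toy model $y \sim \mathcal{N}(\theta, 1)$ with prior $\theta \sim \mathcal{N}(0, 1)$, so that $m(y) = \mathcal{N}(y; 0, 2)$. The sketch below (our own toy example, not the paper's probit setting) estimates one evidence term of the ratio (5):

```python
import math
import random

def norm_pdf(x, mu=0.0, sd=1.0):
    """Density of N(mu, sd^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def crude_mc_evidence(y, n=100_000, seed=0):
    """Estimate m(y) = E_prior[f(y | theta)] by simulating theta from the prior."""
    rng = random.Random(seed)
    return sum(norm_pdf(y, rng.gauss(0.0, 1.0), 1.0) for _ in range(n)) / n

y = 1.2
approx = crude_mc_evidence(y)
exact = norm_pdf(y, 0.0, math.sqrt(2.0))   # marginally, y ~ N(0, 2)
```

A ratio of two such estimates, one per model, gives (5); with a vague prior, most prior draws contribute negligible likelihood values, which is exactly the inefficiency discussed above.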
4 Usual importance sampling approximations

Defining two importance distributions with densities $\varpi_0$ and $\varpi_1$, with the same supports as $\pi_0$ and $\pi_1$, respectively, we have

$$B_{01}(y) = \mathbb{E}_{\varpi_0}\left[ f(y\mid\theta)\,\pi_0(\theta)\big/\varpi_0(\theta) \right] \Big/ \mathbb{E}_{\varpi_1}\left[ f(y\mid\theta)\,\pi_1(\theta)\big/\varpi_1(\theta) \right].$$

Therefore, given two independent samples generated from the distributions $\varpi_0$ and $\varpi_1$, $\theta_{0,1}, \dots, \theta_{0,n_0}$ and $\theta_{1,1}, \dots, \theta_{1,n_1}$, respectively, the corresponding importance sampling estimate of $B_{01}(y)$ is

$$\frac{n_0^{-1}\sum_{j=1}^{n_0} f_0(y\mid\theta_{0,j})\,\pi_0(\theta_{0,j})\big/\varpi_0(\theta_{0,j})}{n_1^{-1}\sum_{j=1}^{n_1} f_1(y\mid\theta_{1,j})\,\pi_1(\theta_{1,j})\big/\varpi_1(\theta_{1,j})}. \quad (6)$$

Compared with the standard Monte Carlo approximation above, this approach offers the advantage of opening up the choice in the representation (6), in that it is possible to pick importance distributions $\varpi_0$ and $\varpi_1$ that lead to a significant reduction in the variance of the importance sampling estimate. This implies choosing importance functions that provide as good an approximation as possible to the corresponding posterior distributions. Maximum likelihood asymptotic distributions or kernel approximations based on a sample generated from the posterior are natural candidates in this setting, even though the approximation grows harder as the dimension increases.

For the Pima Indian benchmark, we propose for instance to use as importance distributions Gaussian distributions with means equal to the maximum likelihood (ML) estimates and covariance matrices equal to the estimated covariance matrices of the ML estimates, both of which are provided by the R glm() function. While, in general, those Gaussian distributions provide crude approximations to the posterior distributions, the specific case of the probit model will show this is an exceptionally good approximation to the posterior, since it leads to the best solution among all those compared here.
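On the same toy model as before ($y \sim \mathcal{N}(\theta,1)$, $\theta \sim \mathcal{N}(0,1)$), estimator (6) with a Gaussian proposal centred at the ML estimate (here simply $\hat\theta = y$) can be sketched as follows; the toy setting and names are ours, not the paper's probit implementation:

```python
import math
import random

def norm_pdf(x, mu=0.0, sd=1.0):
    """Density of N(mu, sd^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def is_evidence(y, n=100_000, seed=1):
    """Importance sampling estimate of m(y) with proposal N(y, 1),
    a rough Gaussian approximation to the posterior N(y/2, 1/2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        theta = rng.gauss(y, 1.0)
        # importance weight = likelihood * prior / proposal
        total += (norm_pdf(y, theta, 1.0) * norm_pdf(theta, 0.0, 1.0)
                  / norm_pdf(theta, y, 1.0))
    return total / n

y = 1.2
approx = is_evidence(y)
exact = norm_pdf(y, 0.0, math.sqrt(2.0))   # marginally, y ~ N(0, 2)
```

Replacing the proposal by the exact posterior $\mathcal{N}(y/2, 1/2)$ would make the weights constant and the estimate exact, which is the zero-variance limit of the variance-reduction argument above.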
The results, obtained over 100 replications of the methodology with $n_0 = n_1 = 20{,}000$, are summarized in Figure 2 and Table 1. They are clearly excellent, while requiring the same computing time as the original simulation from the prior.

5 Bridge sampling methodology

The original version of the bridge sampling approximation to the Bayes factor (Gelman and Meng 1998, Chen et al. 2000) relies on the assumption that the parameters of both models under comparison belong to the same space: $\Theta_0 = \Theta_1$. In that case, for likelihood functions $f_0$ and $f_1$ under respectively models $M_0$ and $M_1$, the bridge representation of the Bayes factor is

$$B_{01}(y) = \int_{\Theta_0} f_0(y\mid\theta)\,\pi_0(\theta)\,\mathrm{d}\theta \bigg/ \int_{\Theta_1} f_1(y\mid\theta)\,\pi_1(\theta)\,\mathrm{d}\theta = \mathbb{E}_{\pi_1}\left[ \frac{f_0(y\mid\theta)\,\pi_0(\theta)}{f_1(y\mid\theta)\,\pi_1(\theta)} \,\bigg|\, y \right]. \quad (7)$$

Figure 2: Pima Indian dataset: boxplots of 100 Monte Carlo and importance sampling estimates of $B_{01}(y)$, based on simulations from the prior distributions, for $2 \times 10^4$ simulations.

Given a sample from the posterior distribution of $\theta$ under model $M_1$, $\theta_{1,1}, \dots, \theta_{1,N} \sim \pi_1(\theta\mid y)$, a first bridge sampling approximation to $B_{01}(y)$ is

$$N^{-1}\sum_{j=1}^{N} \frac{f_0(y\mid\theta_{1,j})\,\pi_0(\theta_{1,j})}{f_1(y\mid\theta_{1,j})\,\pi_1(\theta_{1,j})}.$$

From a practical perspective, for the above bridge sampling approximation to be of any use, the constraint on the common parameter space for both models goes further, in that not only must both models have the same complexity, but they must also be parameterised on a common ground, i.e. in terms of some specific moments of the sampling model, so that parameters under both models have a common meaning. Otherwise, the resulting bridge sampling estimator will have very poor convergence properties, possibly with infinite variance.
Equation (7) is nothing but a very special case of the general representation (Torrie and Valleau 1977)

$$B_{01}(y) = \mathbb{E}_{\varphi}\left[ f_0(y\mid\theta)\,\pi_0(\theta)\big/\varphi(\theta) \right] \Big/ \mathbb{E}_{\varphi}\left[ f_1(y\mid\theta)\,\pi_1(\theta)\big/\varphi(\theta) \right],$$

which holds for any density $\varphi$ with a sufficiently large support and which only requires a single sample $\theta_1, \dots, \theta_N$ generated from $\varphi$ to produce an importance sampling estimate of the ratio of the marginal likelihoods. Apart from using the same importance function $\varphi$ for both integrals, this method is therefore a special case of importance sampling.

Another extension of this bridge sampling approach is based on the general representation

$$B_{01}(y) = \int f_0(y\mid\theta)\,\pi_0(\theta)\,\alpha(\theta)\,\pi_1(\theta\mid y)\,\mathrm{d}\theta \bigg/ \int f_1(y\mid\theta)\,\pi_1(\theta)\,\alpha(\theta)\,\pi_0(\theta\mid y)\,\mathrm{d}\theta$$
$$\approx \frac{n_1^{-1}\sum_{j=1}^{n_1} f_0(y\mid\theta_{1,j})\,\pi_0(\theta_{1,j})\,\alpha(\theta_{1,j})}{n_0^{-1}\sum_{j=1}^{n_0} f_1(y\mid\theta_{0,j})\,\pi_1(\theta_{0,j})\,\alpha(\theta_{0,j})},$$

where $\theta_{0,1}, \dots, \theta_{0,n_0}$ and $\theta_{1,1}, \dots, \theta_{1,n_1}$ are two independent samples coming from the posterior distributions $\pi_0(\theta\mid y)$ and $\pi_1(\theta\mid y)$, respectively. This applies for any positive function $\alpha$ as long as the upper integral exists. Some choices of $\alpha$ lead to very poor performances of the method, in connection with the harmonic mean approach (see Section 6), but there exists a quasi-optimal solution, as provided by Gelman and Meng (1998):

$$\alpha^\star(\theta) \propto \frac{1}{n_0\,\pi_0(\theta\mid y) + n_1\,\pi_1(\theta\mid y)}.$$

This optimum cannot be used per se, since it requires the normalising constants of both $\pi_0(\theta\mid y)$ and $\pi_1(\theta\mid y)$. As suggested by Gelman and Meng (1998), an approximate version uses iterative versions of $\alpha^\star$, based on iterated approximations to the Bayes factor. Note that this solution recycles simulations from both posteriors, which is quite appropriate since one model is selected via the Bayes factor, instead of using an importance weighted sample common to both approximations.
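The iterated use of $\alpha^\star$ can be sketched on two unnormalised densities sharing the same support; the Meng and Wong style fixed-point update below is one standard way to implement it, and the toy densities and names are our own, not the paper's:

```python
import math
import random

def bridge_ratio(q0, q1, draws0, draws1, n_iter=50):
    """Iterative bridge sampling estimate of Z0/Z1 based on the
    quasi-optimal alpha of Gelman and Meng (1998); q0, q1 are unnormalised
    densities and draws_k come from the corresponding normalised densities."""
    n0, n1 = len(draws0), len(draws1)
    s0, s1 = n0 / (n0 + n1), n1 / (n0 + n1)
    l1 = [q0(t) / q1(t) for t in draws1]   # density ratios at draws from model 1
    l0 = [q0(t) / q1(t) for t in draws0]   # density ratios at draws from model 0
    r = 1.0                                # initial guess for Z0/Z1
    for _ in range(n_iter):
        num = sum(l / (s0 * l + s1 * r) for l in l1) / n1
        den = sum(1.0 / (s0 * l + s1 * r) for l in l0) / n0
        r = num / den
    return r

def q0(t):   # unnormalised N(0, 1); Z0 = sqrt(2*pi)
    return math.exp(-0.5 * t * t)

def q1(t):   # unnormalised N(1, 2); Z1 = sqrt(4*pi)
    return math.exp(-0.25 * (t - 1.0) ** 2)

rng = random.Random(3)
draws0 = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
draws1 = [rng.gauss(1.0, math.sqrt(2.0)) for _ in range(20_000)]
r = bridge_ratio(q0, q1, draws0, draws1)   # true ratio is sqrt(2*pi)/sqrt(4*pi)
```

In the Bayes factor setting, $q_k$ would be the unnormalised posterior $f_k(y\mid\theta)\,\pi_k(\theta)$ and the draws would come from the two posterior samplers.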
We will see below an alternative representation of the bridge factor that bypasses this difficulty (if difficulty there is!). Those derivations are, however, restricted to the case where both models have the same complexity, and thus they do not apply to embedded models, when $\Theta_0 \subset \Theta_1$ in such a way that $\theta_1 = (\theta, \psi)$, i.e., when the submodel corresponds to a specific value $\psi_0$ of $\psi$: $f_0(y\mid\theta) = f(y\mid\theta, \psi_0)$.

The extension of the most advanced bridge sampling strategies to such cases requires the introduction of a pseudo-posterior density, $\omega(\psi\mid\theta, y)$, on the parameter that does not appear in the embedded model, in order to reconstitute the equivalence between both parameter spaces. Indeed, if we augment $\pi_0(\theta\mid y)$ with $\omega(\psi\mid\theta, y)$, we obtain a joint distribution with density $\pi_0(\theta\mid y) \times \omega(\psi\mid\theta, y)$ on $\Theta_1$. The Bayes factor can then be expressed as

$$B_{01}(y) = \frac{\displaystyle\int_{\Theta_1} f(y\mid\theta,\psi_0)\,\pi_0(\theta)\,\omega(\psi\mid\theta,y)\,\alpha(\theta,\psi)\,\pi_1(\theta,\psi\mid y)\,\mathrm{d}\theta\,\mathrm{d}\psi}{\displaystyle\int_{\Theta_1} f(y\mid\theta,\psi)\,\pi_1(\theta,\psi)\,\alpha(\theta,\psi)\,\pi_0(\theta\mid y)\,\omega(\psi\mid\theta,y)\,\mathrm{d}\theta\,\mathrm{d}\psi}, \quad (8)$$

for all functions $\alpha(\theta,\psi)$, because it is clearly independent of the choice of both $\alpha(\theta,\psi)$ and $\omega(\psi\mid\theta,y)$. Obviously, the performances of the approximation

$$\frac{n_1^{-1}\sum_{j=1}^{n_1} f(y\mid\theta_{1,j},\psi_0)\,\pi_0(\theta_{1,j})\,\omega(\psi_{1,j}\mid\theta_{1,j},y)\,\alpha(\theta_{1,j},\psi_{1,j})}{n_0^{-1}\sum_{j=1}^{n_0} f(y\mid\theta_{0,j},\psi_{0,j})\,\pi_1(\theta_{0,j},\psi_{0,j})\,\alpha(\theta_{0,j},\psi_{0,j})},$$

where $(\theta_{0,1},\psi_{0,1}), \dots, (\theta_{0,n_0},\psi_{0,n_0})$ and $(\theta_{1,1},\psi_{1,1}), \dots, (\theta_{1,n_1},\psi_{1,n_1})$ are two independent samples generated from the distributions $\pi_0(\theta\mid y) \times \omega(\psi\mid\theta,y)$ and $\pi_1(\theta,\psi\mid y)$, respectively, do depend on this completion by the pseudo-posterior as well as on the function $\alpha(\theta,\psi)$.

Chen et al. (2000) establish that the asymptotically optimal choice for $\omega(\psi\mid\theta,y)$ is the obvious one, namely

$$\omega(\psi\mid\theta,y) = \pi_1(\psi\mid\theta,y),$$

which most often is unavailable in closed form (especially when considering that the normalising constant of $\omega(\psi\mid\theta,y)$ is required in (8)). However, in latent variable models, approximations of the conditional posteriors often are available, as detailed in Section 7.

While this extension of the basic bridge sampling approximation is paramount for handling embedded models, its implementation suffers from the dependence on this pseudo-posterior. In addition, this technical device brings the extended bridge methodology close to the cross-model alternatives of Carlin and Chib (1995) and Green (1995), in that both those approaches rely on completing distributions, either locally (Green 1995) or globally (Carlin and Chib 1995), to link both models under comparison in a bijective relation. The density $\omega(\psi\mid\theta_0, y)$ is then a pseudo-posterior distribution in Carlin and Chib's (1995) sense, and it can be used as Green's (1995) proposal in the reversible jump MCMC step to move (or not) from model $M_0$ to model $M_1$. While using cross-model solutions to compare only two models does seem superfluous, given that the randomness in picking the model at each step of the simulation is not as useful as in the setting of comparing a large number or an infinity of models, the average acceptance probability for moving from model $M_0$ to model $M_1$ is related to the Bayes factor, since

$$\mathbb{E}_{\pi_0\times\omega}\left[ \frac{f(y\mid\theta,\psi)\,\pi_1(\theta,\psi)}{f(y\mid\theta,\psi_0)\,\pi_0(\theta)\,\omega(\psi\mid\theta,y)} \right] = B_{01}(y)^{-1},$$

even though the average

$$\mathbb{E}_{\pi_0\times\omega}\left[ \min\left\{ 1,\; \frac{f(y\mid\theta,\psi)\,\pi_1(\theta,\psi)}{f(y\mid\theta,\psi_0)\,\pi_0(\theta)\,\omega(\psi\mid\theta,y)} \right\} \right]$$

does not provide a closed form solution.
For the Pima Indian benchmark, we use as pseudo-posterior density $\omega(\theta_3\mid\theta_1,\theta_2,y)$ the conditional Gaussian density deduced from the asymptotic Gaussian distribution on $(\theta_1,\theta_2,\theta_3)$ already used in the importance sampling solution, with mean equal to the ML estimate of $(\theta_1,\theta_2,\theta_3)$ and with covariance matrix equal to the estimated covariance matrix of the ML estimate. The quasi-optimal solution $\alpha^\star$ in the bridge sampling estimate is replaced with the inverse of an average between the asymptotic Gaussian distribution in model $M_1$ and the product of the asymptotic Gaussian distribution in model $M_0$ times the above $\omega(\theta_3\mid\theta_1,\theta_2,y)$. This obviously is a suboptimal choice, but it offers the advantage of providing a non-iterative solution.

Figure 3: Pima Indian dataset: (left) boxplots of 100 importance sampling, bridge sampling and Monte Carlo estimates of $B_{01}(y)$, based on simulations from the prior distributions, for $2 \times 10^4$ simulations; (right) same comparison for the importance sampling versus bridge sampling estimates only.

The results, obtained over 100 replications of the methodology with $n_0 = n_1 = 20{,}000$, are summarized in Figure 3 and Table 1. The left-hand graph shows that this choice of bridge sampling estimator produces a solution whose variation is quite close to the (excellent) importance sampling solution, a considerable improvement upon the initial Monte Carlo estimator. However, the right-hand graph shows that the importance sampling solution remains far superior, especially when accounting for the computing time. (In this example, running 20,000 iterations of the Gibbs sampler for the models with both two and three variables takes approximately 32 seconds.)
6 Harmonic mean approximations

While using the generic harmonic mean approximation to the marginal likelihood is often fraught with danger (Neal 1994), the representation (Gelfand and Dey 1994) ($k = 0, 1$)

$$\mathbb{E}_{\pi_k}\left[ \frac{\varphi_k(\theta)}{\pi_k(\theta)\,f_k(y\mid\theta)} \,\bigg|\, y \right] = \int \frac{\varphi_k(\theta)}{\pi_k(\theta)\,f_k(y\mid\theta)}\, \frac{\pi_k(\theta)\,f_k(y\mid\theta)}{m_k(y)}\,\mathrm{d}\theta = \frac{1}{m_k(y)} \quad (9)$$

holds, no matter what the density $\varphi_k(\theta)$ is, provided $\varphi_k(\theta) = 0$ when $\pi_k(\theta)\,f_k(y\mid\theta) = 0$. This representation is remarkable in that it allows for a direct processing of Monte Carlo or MCMC output from the posterior distribution $\pi_k(\theta\mid y)$. As with importance sampling approximations, the variability of the corresponding estimator of $B_{01}(y)$ will be small if the distributions $\varphi_k(\theta)$ ($k = 0, 1$) are close to the corresponding posterior distributions. However, as opposed to usual importance sampling constraints, the density $\varphi_k(\theta)$ must have lighter (rather than fatter) tails than $\pi_k(\theta)\,f_k(y\mid\theta)$ for the approximation of the inverse marginal $m_k(y)^{-1}$,

$$N^{-1}\sum_{j=1}^{N} \frac{\varphi_k(\theta_{k,j})}{\pi_k(\theta_{k,j})\,f_k(y\mid\theta_{k,j})},$$

to enjoy finite variance. For instance, using $\varphi_k(\theta) = \pi_k(\theta)$ as in the original harmonic mean approximation (Newton and Raftery 1994) will most usually result in an infinite variance estimator, as discussed by Neal (1994). On the contrary, using $\varphi_k$'s with constrained supports derived from a Monte Carlo sample, like the convex hull of the simulations corresponding to the 10% or to the 25% HPD regions (that again is easily derived from the simulations), is both completely appropriate and implementable (Robert and Wraith 2009).
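Identity (9) can be checked numerically on the earlier toy model ($y \sim \mathcal{N}(\theta,1)$, $\theta \sim \mathcal{N}(0,1)$), whose posterior $\mathcal{N}(y/2, 1/2)$ can be simulated directly; the light-tailed $\varphi$ below (a Gaussian narrower than the posterior) and all names are our own illustrative choices:

```python
import math
import random

def norm_pdf(x, mu=0.0, sd=1.0):
    """Density of N(mu, sd^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def gelfand_dey_evidence(y, n=100_000, seed=2):
    """Estimate m(y) via (9): the posterior average of phi/(prior*likelihood)
    converges to 1/m(y); phi is chosen with lighter tails than the posterior."""
    rng = random.Random(seed)
    post_mu, post_sd = y / 2.0, math.sqrt(0.5)
    total = 0.0
    for _ in range(n):
        theta = rng.gauss(post_mu, post_sd)       # draw from the posterior
        phi = norm_pdf(theta, post_mu, 0.6)       # sd 0.6 < posterior sd 0.707
        total += phi / (norm_pdf(theta, 0.0, 1.0) * norm_pdf(y, theta, 1.0))
    return n / total                               # invert the average of (9)

y = 1.2
approx = gelfand_dey_evidence(y)
exact = norm_pdf(y, 0.0, math.sqrt(2.0))   # marginally, y ~ N(0, 2)
```

Setting `phi = norm_pdf(theta, 0.0, 1.0)` instead would recover the original Newton and Raftery (1994) harmonic mean and its infinite-variance behaviour.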
However, for the Pima Indian benchmark, we propose to use instead as our distributions $\varphi_k(\theta)$ the very same distributions as those used in the above importance sampling approximations, that is, Gaussian distributions with means equal to the ML estimates and covariance matrices equal to the estimated covariance matrices of the ML estimates. The results, obtained over 100 replications of the methodology with $N = 20{,}000$ simulations for each approximation of $m_k(y)$ ($k = 0, 1$), are summarized in Figure 4 and Table 1. They show a very clear proximity between both importance solutions in this special case, and a corresponding domination of the bridge sampling estimator, even though the importance sampling estimate is much faster to compute. This remark must be toned down by considering that the computing time due to the Gibbs sampler should not necessarily be taken into account in the comparison, since samples are generated under both models.

7 Exploiting functional equalities

Chib's (1995) method for approximating a marginal (likelihood) is a direct application of Bayes' theorem: given $y \sim f_k(y\mid\theta)$ and $\theta \sim \pi_k(\theta)$, we have that

$$m_k = \frac{f_k(y\mid\theta)\,\pi_k(\theta)}{\pi_k(\theta\mid y)},$$

for all $\theta$'s (since both the lhs and the rhs of this equation are constant in $\theta$). Therefore, if an arbitrary value of $\theta$, say $\theta_k^*$, is selected and if a good approximation to $\pi_k(\theta\mid y)$ can be constructed, denoted $\widehat{\pi}_k(\theta\mid y)$, Chib's (1995) approximation to the evidence is

$$m_k = \frac{f_k(y\mid\theta_k^*)\,\pi_k(\theta_k^*)}{\widehat{\pi}_k(\theta_k^*\mid y)}. \quad (10)$$

Figure 4: Pima Indian dataset: (left) boxplots of 100 bridge sampling, harmonic mean and importance sampling estimates of $B_{01}(y)$, based on simulations from the prior distributions, for $2 \times 10^4$ simulations; (right) same comparison for the harmonic mean versus importance sampling estimates only.

In a general setting, $\widehat{\pi}_k(\theta\mid y)$ may be the Gaussian approximation based on the MLE, already used in the importance sampling, bridge sampling and harmonic mean solutions, but this is unlikely to be accurate in a general framework. A second solution is to use a nonparametric approximation based on a preliminary MCMC sample, even though the accuracy may also suffer in large dimensions. In the special setting of latent variable models (like mixtures of distributions, but also like probit models), Chib's (1995) approximation is particularly attractive, as there exists a natural approximation to $\pi_k(\theta\mid y)$, based on the Rao–Blackwell (Gelfand and Smith 1990) estimate

$$\widehat{\pi}_k(\theta_k^*\mid y) = \frac{1}{T}\sum_{t=1}^{T} \pi_k\left(\theta_k^*\mid y, z_k^{(t)}\right),$$

where the $z_k^{(t)}$'s are the latent variables simulated by the MCMC sampler. The estimate $\widehat{\pi}_k(\theta_k^*\mid y)$ is a parametric unbiased approximation of $\pi_k(\theta_k^*\mid y)$ that converges at rate $\mathrm{O}(1/\sqrt{T})$. This Rao–Blackwell approximation obviously requires the full conditional density $\pi_k(\theta_k^*\mid y, z)$ to be available in closed form (constant included) but, as already explained, this is the case for the probit model. Figure 5 and Table 1 summarize the results obtained for 100 replications of Chib's approximations of $B_{01}(y)$ with $T = 20{,}000$ simulations for each approximation of $m_k(y)$ ($k = 0, 1$).
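The functional identity underlying (10) is easy to verify on the earlier toy model ($y \sim \mathcal{N}(\theta,1)$, $\theta \sim \mathcal{N}(0,1)$), where the posterior $\mathcal{N}(y/2, 1/2)$ is known exactly, so no Rao–Blackwell estimate is needed and any $\theta^*$ returns the same evidence. The toy example is ours; in the probit case, $\widehat{\pi}_k$ would be the Rao–Blackwell average over the Gibbs output:

```python
import math

def norm_pdf(x, mu=0.0, sd=1.0):
    """Density of N(mu, sd^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def chib_evidence(y, theta_star):
    """Chib's identity m(y) = f(y|theta*) pi(theta*) / pi(theta*|y),
    evaluated with the exact posterior N(y/2, 1/2) of this toy model."""
    like = norm_pdf(y, theta_star, 1.0)          # f(y | theta*)
    prior = norm_pdf(theta_star, 0.0, 1.0)       # pi(theta*)
    post = norm_pdf(theta_star, y / 2.0, math.sqrt(0.5))  # pi(theta* | y)
    return like * prior / post

y = 1.2
exact = norm_pdf(y, 0.0, math.sqrt(2.0))   # m(y): marginally y ~ N(0, 2)
m_a = chib_evidence(y, 0.3)
m_b = chib_evidence(y, -1.0)               # a different theta*, same value
```

In practice, $\theta^*$ is taken in a high-posterior-density region, since that is where the estimated denominator $\widehat{\pi}_k(\theta^*\mid y)$ is most accurate.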
While Chib's method is usually very reliable and dominates importance sampling, the incredibly good approximation provided by the asymptotic Gaussian distribution implies that, in this particular case, Chib's method is dominated by both the importance sampling and the harmonic mean estimates.

Figure 5: Pima Indian dataset: boxplots of 100 Chib's, harmonic mean and importance sampling estimates of $B_{01}(y)$, based on simulations from the prior distributions, for $2 \times 10^4$ simulations.

Table 1: Pima Indian dataset: performances of the various approximation methods used in this survey.

                      Monte Carlo   Importance sampling   Bridge sampling   Harmonic mean   Chib's approximation
Median                3.277         3.108                 3.087             3.107           3.104
Standard deviation    0.7987        0.0017                0.1357            0.0025          0.0195
Duration in seconds   7             7                     71                70              64

8 Conclusion

In this short evaluation of the most common approximations to the Bayes factor, we have found that a particular importance sampling and its symmetric harmonic mean counterpart are both very efficient in the case of the probit model. The bridge sampling estimate is much less efficient in this example, due to the approximation error resulting from the pseudo-posterior. In most settings, bridge sampling actually does better than the equivalent importance sampler (Robert and Wraith 2009), while Chib's method is much more generic than the four alternatives. The recommendation resulting from the short experiment above is therefore to look for handy approximations to the posterior distribution, whenever available, but to fall back on Chib's method as a backup solution providing a reference or better.

Acknowledgements

J.-M. Marin and C.P. Robert are supported by the 2009–2012 grant ANR-09-BLAN-0218 "Big'MC".

References

Albert, J. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. J. American Statist. Assoc., 88, 669–679.
Berger, J. and Pericchi, L. (1996). The intrinsic Bayes factor for model selection and prediction. J. American Statist. Assoc., 91, 109–122.

Berger, J., Pericchi, L. and Varshavsky, J. (1998). Bayes factors and marginal distributions in invariant situations. Sankhya A, 60, 307–321.

Berger, J. and Sellke, T. (1987). Testing a point-null hypothesis: the irreconcilability of significance levels and evidence (with discussion). J. American Statist. Assoc., 82, 112–122.

Carlin, B. and Chib, S. (1995). Bayesian model choice through Markov chain Monte Carlo. J. Royal Statist. Society Series B, 57, 473–484.

Chen, M., Shao, Q. and Ibrahim, J. (2000). Monte Carlo Methods in Bayesian Computation. Springer-Verlag, New York.

Chib, S. (1995). Marginal likelihood from the Gibbs output. J. American Statist. Assoc., 90, 1313–1321.

Chopin, N. and Robert, C. (2007). Contemplating evidence: properties, extensions of, and alternatives to nested sampling. Tech. Rep. 2007-46, CEREMADE, Université Paris Dauphine.

Friel, N. and Pettitt, A. (2008). Marginal likelihood estimation via power posteriors. J. Royal Statist. Society Series B, 70(3), 589–607.

Gelfand, A. and Dey, D. (1994). Bayesian model choice: asymptotics and exact calculations. J. Royal Statist. Society Series B, 56, 501–514.

Gelfand, A. and Smith, A. (1990). Sampling based approaches to calculating marginal densities. J. American Statist. Assoc., 85, 398–409.

Gelman, A. and Meng, X. (1998). Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Statist. Science, 13, 163–185.

Green, P. (1995). Reversible jump MCMC computation and Bayesian model determination. Biometrika, 82, 711–732.

Jaynes, E. (2003). Probability Theory. Cambridge University Press, Cambridge.

Jeffreys, H. (1939). Theory of Probability. 1st ed. The Clarendon Press, Oxford.

Marin, J.-M. and Robert, C. (2007). Bayesian Core. Springer-Verlag, New York.

Neal, R. (1994). Contribution to the discussion of "Approximate Bayesian inference with the weighted likelihood bootstrap" by Michael A. Newton and Adrian E. Raftery. J. Royal Statist. Society Series B, 56(1), 41–42.

Newton, M. and Raftery, A. (1994). Approximate Bayesian inference by the weighted likelihood bootstrap (with discussion). J. Royal Statist. Society Series B, 56, 1–48.

R Development Core Team (2008). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

Robert, C. (2001). The Bayesian Choice. 2nd ed. Springer-Verlag, New York.

Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods. 2nd ed. Springer-Verlag, New York.

Robert, C., Chopin, N. and Rousseau, J. (2009). Theory of Probability revisited (with discussion). Statist. Science. (to appear).

Robert, C. and Wraith, D. (2009). Computational methods for Bayesian model choice. In MaxEnt 2009 proceedings, American Institute of Physics.

Skilling, J. (2006). Nested sampling for general Bayesian computation. Bayesian Analysis, 1(4), 833–860.

Torrie, G. and Valleau, J. (1977). Nonphysical sampling distributions in Monte Carlo free-energy estimation: umbrella sampling. J. Comp. Phys., 23, 187–199.
