Variational Probabilistic Inference and the QMR-DT Network

We describe a variational approximation method for efficient inference in large-scale probabilistic models. Variational methods are deterministic procedures that provide approximations to marginal and conditional probabilities of interest. They provi…

Authors: T. S. Jaakkola, M. I. Jordan

Variational Probabilistic Inference and the QMR-DT Network
Journal of Articial In telligence Researc h 10 (1999) 291-322 Submitted 10/98; published 5/99 V ariational Probabilistic Inference and the QMR-DT Net w ork T ommi S. Jaakk ola tommi@ai.mit.edu A rticial Intel ligenc e L ab or atory, Massachusetts Institute of T e chnolo gy, Cambridge, MA 02139 USA Mic hael I. Jordan jord an@cs.berkeley.edu Computer Scienc e Division and Dep artment of Statistics, University of California, Berkeley, CA 94720-1776 USA Abstract W e describ e a v ariational appro ximation metho d for ecien t inference in large-scale probabilistic mo dels. V ariational metho ds are deterministic pro cedures that pro vide ap- pro ximations to marginal and conditional probabilities of in terest. They pro vide alterna- tiv es to appro ximate inference metho ds based on sto c hastic sampling or searc h. W e describ e a v ariational approac h to the problem of diagnostic inference in the \Quic k Medical Ref- erence" (QMR) net w ork. The QMR net w ork is a large-scale probabilistic graphical mo del built on statistical and exp ert kno wledge. Exact probabilistic inference is infeasible in this mo del for all but a small set of cases. W e ev aluate our v ariational inference algorithm on a large set of diagnostic test cases, comparing the algorithm to a state-of-the-art sto c hastic sampling metho d. 1. In tro duction Probabilistic mo dels ha v e b ecome increasingly prev alen t in AI in recen t y ears. Bey ond the signican t represen tational adv an tages of probabilit y theory , including guaran tees of consistency and a naturalness at com bining div erse sources of kno wledge (P earl, 1988), the disco v ery of general exact inference algorithms has b een principally resp onsible for the rapid gro wth in probabilistic AI (see, e.g., Lauritzen & Spiegelhalter, 1988; P earl, 1988; Sheno y , 1992). These exact inference metho ds greatly expand the range of mo dels that can b e treated within the probabilistic framew ork and pro vide a unifying p ersp ectiv e on the general problem of probabilistic computation in graphical mo dels. Probabilit y theory can b e view ed as a com binatorial calculus that instructs us in ho w to merge the probabilities of sets of ev en ts in to probabilities of comp osites. The k ey op er- ation is that of marginalization, whic h in v olv es summing (or in tegrating) o v er the v alues of v ariables. Exact inference algorithms essen tially nd w a ys to p erform as few sums as p ossible during marginalization op erations. In terms of the graphical represen tation of probabilit y distributions|in whic h random v ariables corresp ond to no des and conditional indep endenci es are expressed as missing edges b et w een no des|exact inference algorithms dene a notion of \lo calit y" (for example as cliques in an appropriately dened graph), and attempt to restrict summation op erators to lo cally dened sets of no des. c  1999 AI Access F oundation and Morgan Kaufmann Publishers. All righ ts reserv ed. Jaakk ola & Jord an While this approac h manages to sta v e o the exp onen tial explosion of exact probabilistic computation, suc h an exp onen tial explosion is inevitable for an y calculus that explicitly p erforms summations o v er sets of no des. That is, there are mo dels of in terest in whic h \lo cal" is o v erly large (see Jordan, et al., in press). F rom this p oin t of view, it is p erhaps not surprising that exact inference is NP-hard (Co op er, 1990). In this pap er w e discuss the inference problem for a particular large-scale graphical mo del, the Quic k Medical Reference (QMR) mo del. 1 The QMR mo del consists of a com- bination of statistical and exp ert kno wledge for appro ximately 600 signican t diseases and appro ximately 4000 ndings. In the probabilistic form ulation of the mo del (the QMR-DT), the diseases and the ndings are arranged in a bi-partite graph, and the diagnosis problem is to infer a probabilit y distribution for the diseases giv en a subset of ndings. Giv en that eac h nding is generally relev an t to a wide v ariet y of diseases, the graph underlying the QMR-DT is dense, reecting high-order sto c hastic dep endencies. The computational com- plexit y of treating these dep endencies exactly can b e c haracterized in terms of the size of the maximal clique of the \moralized" graph (see, e.g., Dec h ter, 1998; Lauritzen & Spiegel- halter, 1988). In particular, the running time is exp onen tial in this measure of size. F or the QMR-DT, considering the standardized \clino copathologic conference" (CPC) cases that w e discuss b elo w, w e nd that the median size of the maximal clique of the moralized graph is 151.5 no des. This rules out the use of general exact algorithms for the QMR-DT. The general algorithms do not tak e adv an tage of the particular parametric form of the probabilit y distributions at the no des of the graph, and it is conceiv able that additional factorizations migh t b e found that tak e adv an tage of the particular c hoice made b y the QMR-DT. Suc h a factorization w as in fact found b y Hec k erman (1989); his \Quic kscore algorithm" pro vides an exact inference algorithm that is tailored to the QMR-DT. Unfortu- nately , ho w ev er, the run time of the algorithm is still exp onen tial in the n um b er of p ositiv e ndings. F or the CPC cases, w e estimate that the algorithm w ould require an a v erage of 50 y ears to solv e the inference problem on curren t computers. F aced with the apparen t infeasibili t y of exact inference for large-scale mo dels suc h as the QMR-DT, man y researc hers ha v e in v estigated appro ximation metho ds. One general approac h to dev eloping appro ximate algorithms is to p erform exact inference, but to do so partially . One can consider partial sets of no de instan tiations, partial sets of h yp otheses, and partial sets of no des. This p oin t of view has led to the dev elopmen t of algorithms for appro ximate inference based on heuristic searc h. Another approac h to dev eloping appro x- imation algorithms is to exploit a v eraging phenomena in dense graphs. In particular, la ws of large n um b ers tell us that sums of random v ariables can b eha v e simply , con v erging to predictable n umerical results. Th us, there ma y b e no need to p erform sums explicitly , either exactly or partially . This p oin t of view leads to the v ariational approac h to appro ximate inference. Finally , y et another approac h to appro ximate inference is based on sto c hastic sampling. One can sample from simplied distributions and in so doing obtain information ab out a more complex distribution of in terest. W e discuss eac h of these metho ds in turn. Horvitz, Suermondt and Co op er (1991) ha v e dev elop ed a partial ev aluation algorithm kno wn as \b ounded conditioning" that w orks b y considering partial sets of no de instan- 1. The acron ym \QMR-DT" that w e use in this pap er refers to the \decision-theoretic " reform ulation of the QMR b y Sh w e, et al. (1991). Sh w e, et al. replaced the heuristic represen tation emplo y ed in the original QMR mo del (Miller, F asarie, & My ers, 1986) b y a probabili stic represen tation. 292 V aria tional Pr obabilistic Inference and QMR-DT tiations. The algorithm is based on the notion of a \cutset"; a subset of no des whose remo v al renders the remaining graph singly-connected. Ecien t exact algorithms exist for singly-connected graphs (P earl, 1988). Summing o v er all instan tiations of the cutset, one can calculate p osterior probabilities for general graphs using the ecien t algorithm as a subroutine. Unfortunately , ho w ev er, there are exp onen tially man y suc h cutset instan tia- tions. The b ounded conditioning algorithm aims at forestalling this exp onen tial gro wth b y considering partial sets of instan tiations. Although this algorithm has promise for graphs that are \nearly singly-connected," it seems unlik ely to pro vide a solution for dense graphs suc h as the QMR-DT. In particular, the median cutset size for the QMR-DT across the CPC cases is 106.5, yielding an unmanageably large n um b er of 2 106 : 5 cutset instan tiations. Another approac h to appro ximate inference is pro vided b y \searc h-based" metho ds, whic h consider no de instan tiations across the en tire graph (Co op er, 1985; Henrion, 1991; P eng & Reggia, 1987). The general hop e in these metho ds is that a relativ ely small fraction of the (exp onen tially man y) no de instan tiations con tains a ma jorit y of the probabilit y mass, and that b y exploring the high probabilit y instan tiations (and b ounding the unexplored probabilit y mass) one can obtain reasonable b ounds on p osterior probabilities. The QMR- DT searc h space is h uge, con taining appro ximately 2 600 disease h yp otheses. If, ho w ev er, one only considers cases with a small n um b er of diseases, and if the h yp otheses in v olving a small n um b er of diseases con tain most of the high probabilit y p osteriors, then it ma y b e p ossible to searc h a signican t fraction of the relev an t p ortions of the h yp othesis space. Henrion (1991) w as in fact able to run a searc h-based algorithm on the QMR-DT inference problem, for a set of cases c haracterized b y a small n um b er of diseases. These w ere cases, ho w ev er, for whic h the exact Quic kscore algorithm is ecien t. The more general corpus of CPC cases that w e discuss in the curren t pap er is not c haracterized b y a small n um b er of diseases p er case. In general, ev en if w e imp ose the assumption that patien ts ha v e a limited n um b er N of diseases, w e cannot assume a priori that the mo del will sho w a sharp cuto in p osterior probabilit y after disease N . Finally , in high-dimensional searc h problems it is often necessary to allo w paths that are not limited to the target h yp othesis subspace; in particular, one w ould lik e to b e able to arriv e at a h yp othesis con taining few diseases b y pruning h yp otheses con taining additional diseases (P eng & Reggia, 1987). Imp osing suc h a limitation can lead to failure of the searc h. More recen t partial ev aluation metho ds include the \lo calized partial ev aluation" metho d of Drap er and Hanks (1994), the \incremen tal SPI" algorithm of D'Am brosio (1993), the \probabilistic partial ev aluation" metho d of P o ole (1997), and the \mini-buc k ets" algorithm of Dec h ter (1997). The former algorithm considers partial sets of no des, and the latter three consider partial ev aluations of the sums that emerge during an exact inference run. These are all promising metho ds, but lik e the other partial ev aluation metho ds it is y et not clear if they restrict the exp onen tial gro wth in complexit y in w a ys that yield realistic accuracy/time tradeos in large-scale mo dels suc h as the QMR-DT. 2 V ariational metho ds pro vide an alternativ e approac h to appro ximate inference. They are similar in spirit to partial ev aluation metho ds (in particular the incremen tal SPI and mini-buc k ets algorithms), in that they aim to a v oid p erforming sums o v er exp onen tially 2. D'Am brosio (1994) rep orts \mixed" results using incremen tal SPI on the QMR-DT, for a somewhat more dicult set of cases than Hec k erman (1989) and Henrion (1991), but still with a restricted n um b er of p ositiv e ndings. 293 Jaakk ola & Jord an man y summands, but they come at the problem from a dieren t p oin t of view. F rom the v ariational p oin t of view, a sum can b e a v oided if it con tains a sucien t n um b er of terms suc h that a la w of large n um b ers can b e in v ok ed. A v ariational approac h to inference replaces quan tities that can b e exp ected to b e the b eneciary of suc h an a v eraging pro cess with surrogates kno wn as \v ariational parameters." The inference algorithm manipulates these parameters directly in order to nd a go o d appro ximation to a marginal probabilit y of in terest. The QMR-DT mo del turns out to b e a particularly app ealing arc hitecture for the dev elopmen t of v ariational metho ds. As w e will sho w, v ariational metho ds ha v e a simple graphical in terpretation in the case of the QMR-DT. A nal class of metho ds for p erforming appro ximate inference are the sto c hastic sam- pling metho ds. Sto c hastic sampling is a large family , including tec hniques suc h as rejection sampling, imp ortance sampling, and Mark o v c hain Mon te Carlo metho ds (MacKa y , 1998). Man y of these metho ds ha v e b een applied to the problem of appro ximate probabilistic in- ference for graphical mo dels and analytic results are a v ailable (Dagum & Horvitz, 1993). In particular, Sh w e and Co op er (1991) prop osed a sto c hastic sampling metho d kno wn as \lik eliho o d-w eigh ted sampling" for the QMR-DT mo del. Their results are the most promis- ing results to date for inference for the QMR-DT|they w ere able to pro duce reasonably accurate appro ximations in reasonable time for t w o of the dicult CPC cases. W e consider the Sh w e and Co op er algorithm later in this pap er; in particular w e compare the algorithm empirically to our v ariational algorithm across the en tire corpus of CPC cases. Although it is imp ortan t to compare appro ximation metho ds, it should b e emphasized at the outset that w e do not think that the goal should b e to iden tify a single c hampion appro ximate inference tec hnique. Rather, dieren t metho ds exploit dieren t structural features of large-scale probabilit y mo dels, and w e exp ect that optimal solutions will in v olv e a com bination of metho ds. W e return to this p oin t in the discussion section, where w e consider v arious promising h ybrids of appro ximate and exact inference algorithms. The general problem of appro ximate inference is NP-hard (Dagum & Lub y , 1993) and this pro vides additional reason to doubt the existence of a single c hampion appro ximate inference tec hnique. W e think it imp ortan t to stress, ho w ev er, that this hardness result, together with Co op er's (1990) hardness result for exact inference cited ab o v e, should not b e tak en to suggest that exact inference and appro ximate inference are \equally hard." T o tak e an example from a related eld, there exist large domains of solid and uid mec hanics in whic h exact solutions are infeasible but in whic h appro ximate tec hniques (nite elemen t metho ds) w ork w ell. Similarly , in statistical ph ysics, v ery few mo dels are exactly solv able, but there exist appro ximate metho ds (mean eld metho ds, renormalization group metho ds) that w ork w ell in man y cases. W e feel that the goal of researc h in probabilistic inference should similarly b e that of iden tifying eectiv e appro ximate tec hniques that w ork w ell in large classes of problems. 2. The QMR-DT Net w ork The QMR-DT net w ork (Sh w e et al., 1991) is a t w o-lev el or bi-partite graphical mo del (see Figure 1). The top lev el of the graph con tains no des for the dise ases , and the b ottom lev el con tains no des for the ndings . 294 V aria tional Pr obabilistic Inference and QMR-DT There are a n um b er of conditional indep endence assumptions reected in the bi-partite graphical structure. In particular, the diseases are assumed to b e marginally indep enden t. (I.e., they are indep enden t in the absence of ndings. Note that diseases are not assumed to b e m utually exclusiv e; a patien t can ha v e m ultiple diseases). Also, giv en the states of the disease no des, the ndings are assumed to b e conditionally indep enden t. (F or a discussion regarding the medical v alidit y and the diagnostic consequences of these and other assumptions em b edded in to the QMR-DT b elief net w ork, see Sh w e et al., 1991). diseases findings d 1 d n f 1 f m Figure 1: The QMR b elief net w ork is a t w o-lev el graph where the dep endencies b et w een the diseases and their asso ciated ndings ha v e b een mo deled via noisy-OR gates. T o state more precisely the probabilit y mo del implied b y the QMR-DT mo del, w e write the join t probabilit y of diseases and ndings as: P ( f ; d ) = P ( f j d ) P ( d ) = " Y i P ( f i j d ) # 2 4 Y j P ( d j ) 3 5 (1) where d and f are binary (1/0) v ectors referring to presence/absence states of the diseases and the p ositiv e/negativ e states or outcomes of the ndings, resp ectiv ely . The conditional probabilities P ( f i j d ) are represen ted b y the \noisy-OR mo del" (P earl, 1988): P ( f i = 0 j d ) = P ( f i = 0 j L ) Y j 2  i P ( f i = 0 j d j ) (2) = (1  q i 0 ) Y j 2  i (1  q ij ) d j (3)  e   i 0  P j 2  i  ij d j ; (4) where  i is the set of diseases that are paren ts of the nding f i in the QMR graph, q ij = P ( f i = 1 j d j = 1) is the probabilit y that the disease j , if presen t, could alone cause the nding to ha v e a p ositiv e outcome, and q i 0 = P ( f i = 1 j L ) is the \leak" probabilit y , i.e., the probabilit y that the nding is caused b y means other than the diseases included in the QMR mo del. In the nal line, w e reparameterize the noisy-OR probabilit y mo del using an exp onen tiated notation. In this notation, the mo del parameters are giv en b y  ij =  log (1  q ij ). 295 Jaakk ola & Jord an 3. Inference Carrying out diagnostic inference in the QMR mo del in v olv es computing the p osterior marginal probabilities of the diseases giv en a set of observ ed p ositiv e ( f i = 1) and negativ e ( f i 0 = 0) ndings. Note that the set of observ ed ndings is considerably smaller than the set of p ossible ndings; note moreo v er (from the bi-partite structure of the QMR-DT graph) that unobserv ed ndings ha v e no eect on the p osterior probabilities for the diseases. F or brevit y w e adopt a notation in whic h f + i corresp onds to the ev en t f i = 1, and f  i refers to f i = 0 (p ositiv e and negativ e ndings resp ectiv ely). Th us the p osterior probabilities of in terest are P ( d j j f + ; f  ), where f + and f  are the v ectors of p ositiv e and negativ e ndings. The negativ e ndings f  are b enign with resp ect to the inference problem|they can b e incorp orated in to the p osterior probabilit y in linear time in the n um b er of asso ciated diseases and in the n um b er of negativ e ndings. As w e discuss b elo w, this can b e seen from the fact that the probabilit y of a negativ e nding in Eq. (4) is the exp onen tial of an expression that is linear in the d j . The p ositiv e ndings, on the other hand, are more problematic. In the w orst case the exact calculation of p osterior probabilities is exp onen tially costly in the n um b er of p ositiv e ndings (Hec k erman, 1989; D'Am brosio, 1994). Moreo v er, in practical diagnostic situations the n um b er of p ositiv e ndings often exceeds the feasible limit for exact calculations. Let us consider the inference calculations in more detail. T o nd the p osterior probabilit y P ( d j f + ; f  ), w e rst absorb the evidence from negativ e ndings, i.e., w e compute P ( d j f  ). This is just P ( f  j d ) P ( d ) with normalization. Since b oth P ( f  j d ) and P ( d ) factorize o v er the diseases (see Eq. (1) and Eq. (2) ab o v e), the p osterior P ( d j f  ) m ust factorize as w ell. The normalization of P ( f  j d ) P ( d ) therefore reduces to indep enden t normalizations o v er eac h disease and can b e carried out in time linear in the n um b er of diseases (or negativ e ndings). In the remainder of the pap er, w e concen trate solely on the p ositiv e ndings as they p ose the real computational c hallenge. Unless otherwise stated, w e assume that the prior distribution o v er the diseases already con tains the evidence from the negativ e ndings. In other w ords, w e presume that the up dates P ( d j ) P ( d j j f  ) ha v e already b een made. W e no w turn to the question of computing P ( d j j f + ), the p osterior marginal probabilit y based on the p ositiv e ndings. F ormally , obtaining suc h a p osterior in v olv es marginalizing P ( f + j d ) P ( d ) across the remaining diseases: P ( d j j f + ) / X d n d j P ( f + j d ) P ( d ) (5) where the summation is o v er all the p ossible congurations of the disease v ariables other than d j (w e use the shorthand summation index d n d j for this). In the QMR mo del P ( f + j d ) P ( d ) has the form: P ( f + j d ) P ( d ) = " Y i P ( f + i j d ) # 2 4 Y j P ( d j ) 3 5 (6) = " Y i  1  e   i 0  P j  ij d j  # 2 4 Y j P ( d j ) 3 5 (7) 296 V aria tional Pr obabilistic Inference and QMR-DT whic h follo ws from Eq. (4) and the fact that P ( f + i j d ) = 1  P ( f  j d ). T o p erform the summation in Eq. (5) o v er the diseases, w e w ould ha v e to m ultiply out the terms 1  e fg corresp onding to the conditional probabilities for eac h p ositiv e nding. The n um b er of suc h terms is exp onen tial in the n um b er of p ositiv e ndings. While algorithms exist that attempt to nd and exploit factorizations in this expression, based on the particular pattern of observ ed evidence (cf. Hec k erman, 1989; D'Am brosio, 1994), these algorithms are limited to roughly 20 p ositiv e ndings on curren t computers. It seems unlik ely that there is sucien t laten t factorization in the QMR-DT mo del to b e able to handle the full CPC corpus, whic h has a median n um b er of 36 p ositiv e ndings p er case and a maxim um n um b er of 61 p ositiv e ndings. 4. V ariational Metho ds Exact inference algorithms p erform man y millions of arithmetic op erations when applied to complex graphical mo dels suc h as the QMR-DT. While this proliferation of terms expresses the sym b olic structure of the mo del, it do es not necessarily express the n umeric structure of the mo del. In particular, man y of the sums in the QMR-DT inference problem are sums o v er large n um b ers of random v ariables. La ws of large n um b ers suggest that these sums ma y yield predictable n umerical results o v er the ensem ble of their summands, and this fact migh t enable us to a v oid p erforming the sums explicitly . T o exploit the p ossibilit y of n umerical regularit y in dense graphical mo dels w e dev elop a v ariational approac h to appro ximate probabilistic inference. V ariational metho ds are a general class of appro ximation tec hniques with wide application throughout applied math- ematics. V ariational metho ds are particularly useful when applied to highly-coupled sys- tems. By in tro ducing additional parameters, kno wn as \v ariational parameters"|whic h essen tially serv e as lo w-dimensional surrogates for the high-dimensional couplings of the system|these metho ds ac hiev e a decoupling of the system. The mathematical mac hinery of the v ariational approac h pro vides algorithms for nding v alues of the v ariational pa- rameters suc h that the decoupled system is a go o d appro ximation to the original coupled system. In the case of probabilistic graphical mo dels v ariational metho ds allo w us to simplify a complicated join t distribution suc h as the one in Eq. (7). This is ac hiev ed via parameter- ized transformations of the individual no de probabilities. As w e will see later, these no de transformations can b e in terpreted graphically as delinking the no des from the graph. Ho w do w e nd appropriate transformations? The v ariational metho ds that w e consider here come from con v ex analysis (see App endix 6). Let us b egin b y considering metho ds for obtaining upp er b ounds on probabilities. A w ell-kno wn fact from con v ex analysis is that an y conca v e function can b e represen ted as the solution to a minimization problem: f ( x ) = min  f  T x  f  (  ) g (8) where f  (  ) is the c onjugate function of f ( x ). The function f  (  ) is itself obtained as the solution to a minimization problem: f  (  ) = min x f  T x  f ( x ) g : (9) 297 Jaakk ola & Jord an The formal iden tit y of this pair of minimization problems expresses the \dualit y" of f and its conjugate f  . The represen tation of f in Eq. (8) is kno wn as a variational tr ansformation . The pa- rameter  is kno wn as a variational p ar ameter . If w e relax the minimization and x the the v ariational parameter to an arbitrary v alue, w e obtain an upp er b ound: f ( x )   T x  f  (  ) : (10) The b ound is b etter for some v alues of the v ariational parameter than for others, and for a particular v alue of  the b ound is exact. W e also w an t to obtain lo w er b ounds on conditional probabilities. A straigh tforw ard w a y to obtain lo w er b ounds is to again app eal to conjugate dualit y and to express func- tions in terms of a maximization principle. This represen tation, ho w ev er, applies to c onvex functions|in the curren t pap er w e require lo w er b ounds for c onc ave functions. Our con- ca v e functions, ho w ev er, ha v e a sp ecial form that allo ws us to exploit conjugate dualit y in a dieren t w a y . In particular, w e require b ounds for functions of the form f ( a + P j z j ), where f is a conca v e function, where z j for i 2 f 1 ; 2 ; : : : ; n g are non-negativ e v ariables, and where a is a constan t. The v ariables z j in this expression are eectiv ely coupled|the impact of c hanging one v ariable is con tingen t on the settings of the remaining v ariables. W e can use Jensen's inequalit y , ho w ev er, to obtain a lo w er b ound in whic h the v ariables are decoupled. 3 In particular: f ( a + X j z j ) = f ( a + X j q j z j q j ) (11)  X j q j f ( a + z j q j ) (12) where the q j can b e view ed as dening a probabilit y distribution o v er the v ariables z j . The v ariational parameter in this case is the probabilit y distribution q . The optimal setting of this parameter is giv en b y q j = z j = P k z k . This is easily v eried b y substitution in to Eq. (12), and demonstrates that the lo w er b ound is tigh t. 4.1 V ariational Upp er and Lo w er Bounds for Noisy-OR Let us no w return to the problem of computing the p osterior probabilities in the QMR mo del. Recall that it is the conditional probabilities corresp onding to the p ositiv e ndings that need to b e simplied. T o this end, w e write P ( f + i j d ) = 1  e   i 0  P j  ij d j = e log (1  e  x ) (13) where x =  i 0 + P j  ij d j . Consider the exp onen t f ( x ) = log (1  e  x ). F or noisy-OR, as w ell as for man y other conditional mo dels in v olving compact represen tations (e.g., logistic regression), the exp onen t f ( x ) is a conca v e function of x . Based on the discussion in the 3. Jensen's inequalit y , whic h states that f ( a + P j q j x j )  P j q j f ( a + x j ), for conca v e f , where P q j = 1, and 0  q j  1, is a simple consequence of Eq. (8), where x is tak en to b e a + P j q j x j . 298 V aria tional Pr obabilistic Inference and QMR-DT previous section, w e kno w that there m ust exist a v ariational upp er b ound for this function that is linear in x : f ( x )   x  f  (  ) (14) Using Eq. (9) to ev aluate the conjugate function f  (  ) for noisy-OR, w e obtain: f  (  ) =   log  + (  + 1) log (  + 1) (15) The desired b ound is obtained b y substituting in to Eq. (13) (and recalling the denition x =  i 0 + P j  ij d j ): P ( f + i j d ) = e f (  i 0 + P j  ij d j ) (16)  e  i (  i 0 + P j  ij d j )  f  (  i ) (17)  P ( f + i j d;  i ) : (18) Note that the \v ariational evidence" P ( f + i j d;  i ) is the exp onen tial of a term that is linear in the disease v ector d . Just as with the negativ e ndings, this implies that the v ariational evidence can b e incorp orated in to the p osterior in time linear in the n um b er of diseases asso ciated with the nding. There is also a graphical w a y to understand the eect of the transformation. W e rewrite the v ariational evidence as follo ws: P ( f + i j d;  i ) = e  i (  i 0 + P j  ij d j )  f  (  i ) (19) = e  i  i 0  f  (  i ) Y j h e  i  ij i d j : (20) Note that the rst term is a constan t, and note moreo v er that the pro duct is factorized across the diseases. Eac h of the latter factors can b e m ultiplied with the pre-existing prior on the corresp onding disease (p ossibly itself mo dulated b y factors from the negativ e evidence). The constan t term can b e view ed as asso ciated with a delink ed nding no de f i . Indeed, the eect of the v ariational transformation is to delink the nding no de f i from the graph, altering the priors of the disease no des that are connected to that nding no de. This graphical p ersp ectiv e will b e imp ortan t for the presen tation of our v ariational algorithm| w e will b e able to view v ariational transformations as simplifying the graph un til a p oin t at whic h exact metho ds can b e run. W e no w turn to the lo w er b ounds on the conditional probabilities P ( f + i j d ). The exp o- nen t f (  i 0 + P j  ij d j ) in the exp onen tial represen tation is of the form to whic h w e applied Jensen's inequalit y in the previous section. Indeed, since f is conca v e w e need only iden tify the non-negativ e v ariables z j , whic h in this case are  ij d j , and the constan t a , whic h is no w  i 0 . Applying the b ound in Eq. (12) w e ha v e: P ( f + i j d ) = e f (  i 0 + P j  ij d j ) (21)  e P j q j j i f   io +  ij d j q j j i  (22) 299 Jaakk ola & Jord an = e P j q j j i h d j f   io +  ij q j j i  +(1  d j ) f (  io ) i (23) = e P j q j j i d j h f   io +  ij q j j i   f (  io ) i + f (  io ) (24)  P ( f + i j d; q j i ) (25) where w e ha v e allo w ed a dieren t v ariational distribution q j i for eac h nding. Note that once again the b ound is linear in the exp onen t. As in the case of the upp er b ound, this implies that the v ariational evidence can b e incorp orated in to the p osterior distribution in time linear in the n um b er of diseases. Moreo v er, w e can once again view the v ariational transformation in terms of delinking the nding no de f i from the graph. 4.2 Appro ximate Inference for QMR In the previous section w e describ ed ho w v ariational transformations are deriv ed for indi- vidual ndings in the QMR mo del; w e no w discuss ho w to utilize these transformations in the con text of an o v erall inference algorithm. Conceptually the o v erall approac h is straigh tforw ard. Eac h transformation in v olv es replacing an exact conditional probabilit y of a nding with a lo w er b ound and an upp er b ound: P ( f + i j d; q j i )  P ( f + i j d )  P ( f + i j d;  i ) : (26) Giv en that suc h transformations can b e view ed as delinking the i th nding no de from the graph, w e see that the transformations not only yield b ounds, but also yield a sim- plied graphical structure. W e can imagine in tro ducing transformations sequen tially un til the graph is sparse enough that exact metho ds b ecome feasible. A t that p oin t w e stop in tro ducing transformations and run an exact algorithm. There is a problem with this approac h, ho w ev er. W e need to decide at eac h step whic h no de to transform, and this requires an assessmen t of the eect on o v erall accuracy of transforming the no de. W e migh t imagine calculating the c hange in a probabilit y of in terest b oth b efore and after a giv en transformation, and c ho osing to transform that no de that yields the least c hange to our target probabilit y . Unfortunately w e are unable to calculate probabilities in the original un transformed graph, and th us w e are unable to assess the eect of transforming an y one no de. W e are unable to get the algorithm started. Supp ose instead that w e w ork bac kw ards. That is, w e in tro duce transformations for al l of the ndings, reducing the graph to an en tirely decoupled set of no des. W e optimize the v ariational parameters for this fully transformed graph (more on optimization of the v ariational parameters b elo w). F or this graph inference is trivial. Moreo v er, it is also easy to calculate the eect of reinstating a single exact conditional at one no de: w e c ho ose to reinstate that no de whic h yields the most c hange. Consider in particular the case of the upp er b ounds (lo w er b ounds are analogous). Eac h transformation in tro duces an upp er b ound on a conditional probabilit y P ( f + i j d ). Th us the lik eliho o d of observing the (p ositiv e) ndings P ( f + ) is also upp er b ounded b y its v ariational coun terpart P ( f + j  ): P ( f + ) = X d P ( f + j d ) P ( d )  X d P ( f + j d;  ) P ( d )  P ( f + j  ) (27) 300 V aria tional Pr obabilistic Inference and QMR-DT W e can assess the accuracy of eac h v ariational transformation after in tro ducing and op- timizing the v ariational transformations for all the p ositiv e ndings. Separately for eac h p ositiv e nding w e replace the v ariationally transformed conditional probabilit y P ( f + i j d;  i ) with the corresp onding exact conditional P ( f + i j d ) and compute the dierence b et w een the resulting b ounds on the lik eliho o d of the observ ations:  i = P ( f + j  )  P ( f + j  n  i ) (28) where P ( f + j  n  i ) is computed without transforming the i th p ositiv e nding. The larger the dierence  i is, the w orse the i th v ariational transformation is. W e should therefore in tro duce the transformations in the ascending order of  i s. Put another w a y , w e should treat exactly (not transform) those conditional probabilities whose  i measure is large. In practice, an in telligen t metho d for ordering the transformations is critical. Figure 2 compares the calculation of lik eliho o ds based on the  i measure as opp osed to a metho d that c ho oses the ordering of transformations at random. The plot corresp onds to a repre- sen tativ e diagnostic case, and sho ws the upp er b ounds on the log-lik eliho o d s of the observ ed ndings as a function of the n um b er of conditional probabilities that w ere left in tact (i.e. not transformed). Note that the upp er b ound m ust impro v e (decrease) with few er trans- formations. The results are striking|the c hoice of ordering has a large eect on accuracy (note that the plot is on a log-scale). 0 2 4 6 8 10 12 −60 −55 −50 −45 −40 −35 −30 # of exactly treated findings log−likelihood Figure 2: The upp er b ound on the log-lik eliho o d for the delta metho d of remo ving trans- formations (solid line) and a metho d that bases the c hoice on a random ordering (dashed line). Note also that the curv e for the prop osed ranking is con v ex; th us the b ound impro v es less the few er transformations there are left. This is b ecause w e rst remo v e the w orst transformations, replacing them with the exact conditionals. The remaining transforma- tions are b etter as indicated b y the delta measure and th us the b ound impro v es less with further replacemen ts. W e mak e no claims for optimalit y of the delta metho d; it is simply a useful heuristic that allo ws us to c ho ose an ordering for v ariational transformations in a computationally ecien t w a y . Note also that our implemen tation of the metho d optimizes the v ariational parameters only once at the outset and c ho oses the ordering of further transformations based on these xed parameters. These parameters are sub optimal for graphs in whic h 301 Jaakk ola & Jord an substan tial n um b ers of no des ha v e b een reinstated, but w e ha v e found in practice that this simplied algorithm still pro duces reasonable orderings. Once w e ha v e decided whic h no des to reinstate, the appro ximate inference algorithm can b e run. W e in tro duce transformations at those no des that w ere left transformed b y the ordering algorithm. The pro duct of all of the exact conditional probabilities in the graph with the transformed conditional probabilities yields an upp er or lo w er b ound on the o v erall join t probabilit y asso ciated with the graph (the pro duct of b ounds is a b ound). Sums of b ounds are still b ounds, and th us the lik eliho o d (the marginal probabilit y of the ndings) is b ounded b y summing across the b ounds on the join t probabilit y . In particular, an upp er b ound on the lik eliho o d is obtained via: P ( f + ) = X d P ( f + j d ) P ( d )  X d P ( f + j d;  ) P ( d )  P ( f + j  ) (29) and the corresp onding lo w er b ound on the lik eliho o d is obtained similarly: P ( f + ) = X d P ( f + j d ) P ( d )  X d P ( f + j d; q ) P ( d )  P ( f + j q ) (30) In b oth cases w e assume that the graph has b een sucien tly simplied b y the v ariational transformations so that the sums can b e p erformed ecien tly . The expressions in Eq. (29) and Eq. (30) yield upp er and lo w er b ounds for arbitrary v alues of the v ariational parameters  and q . W e wish to obtain the tigh test p ossible b ounds, th us w e optimize these expressions with resp ect to  and q . W e minimize with resp ect to  and maximize with resp ect to q . App endix 6 discusses these optimization problems in detail. It turns out that the upp er b ound is con v ex in the  and th us the adjustmen t of the v ariational parameters for the upp er b ound reduces to a con v ex optimization problem that can b e carried out ecien tly and reliably (there are no lo cal minima). F or the lo w er b ound it turns out that the maximization can b e carried out via the EM algorithm. Finally , although b ounds on the lik eliho o d are useful, our ultimate goal is to appro ximate the marginal p osterior probabilities P ( d j j f + ). There are t w o basic approac hes to utilizing the v ariational b ounds in Eq. (29) and Eq. (30) for this purp ose. The rst metho d, whic h will b e our emphasis in the curren t pap er, in v olv es using the transformed probabilit y mo del (the mo del based either on upp er or lo w er b ounds) as a computationally ecien t surrogate for the original probabilit y mo del. That is, w e tune the v ariational parameters of the transformed mo del b y requiring that the mo del giv e the tigh test p ossible b ound on the lik eliho o d. W e then use the tuned transformed mo del as an inference engine to pro vide appro ximations to other probabilities of in terest, in particular the marginal p osterior probabilities P ( d j j f + ). The appro ximations found in this manner are not b ounds, but are computationally ecien t appro ximations. W e pro vide empirical data in the follo wing section that sho w that this approac h indeed yields go o d appro ximations to the marginal p osteriors for the QMR-DT net w ork. A more am bitious goal is to obtain in terv al b ounds for the marginal p osterior probabil- ities themselv es. T o this end, let P ( f + ; d j j  ) denote the com bined ev en t that the QMR-DT mo del generates the observ ed ndings f + and that the j th disease tak es the v alue d j . These b ounds follo w directly from: P ( f + ; d j ) = X d n d j P ( f + j d ) P ( d )  X d n d j P ( f + j d;  ) P ( d )  P ( f + ; d j j  ) (31) 302 V aria tional Pr obabilistic Inference and QMR-DT where P ( f + j d;  ) is a pro duct of upp er-b ound transformed conditional probabilities and exact (un transformed) conditionals. Analogously w e can compute a lower b ound P ( f + ; d j j q ) b y applying the lo w er b ound transformations: P ( f + ; d j ) = X d n d j P ( f + j d ) P ( d )  X d n d j P ( f + j d; q ) P ( d )  P ( f + ; d j j q ) (32) Com bining these b ounds w e can obtain in terv al b ounds on the p osterior marginal proba- bilities for the diseases (cf. Drap er & Hanks 1994): P ( f + ; d j j q ) P ( f + ;  d j j  ) + P ( f + ; d j j q )  P ( d j j f + )  P ( f + ; d j j  ) P ( f + ; d j j  ) + P ( f + ;  d j j q ) ; (33) where  d j is the binary complemen t of d j . 5. Exp erime n tal Ev aluation The diagnostic cases that w e used in ev aluating the p erformance of the v ariational tec h- niques w ere cases abstracted from clino copathologic conference (\CPC") cases. These cases generally in v olv e m ultiple diseases and are considered to b e clinically dicult cases. They are the cases in whic h Middleton et al. (1990) did not nd their imp ortance sampling metho d to w ork satisfactorily . Our ev aluation of the v ariational metho dology consists of three parts. In the rst part w e exploit the fact that for a subset of the CPC cases (4 of the 48 cases) there are a sucien tly small n um b er of p ositiv e ndings that w e can calculate exact v alues of the p osterior marginals using the Quic kscore algorithm. That is, for these four cases w e w ere able to obtain a \gold standard" for comparison. W e pro vide an assessmen t of the accuracy and eciency of v ariational metho ds on those four CPC cases. W e presen t v ariational upp er and lo w er b ounds on the lik eliho o d as w ell as scatterplots that compare v ariational appro ximations of the p osterior marginals to the exact v alues. W e also presen t comparisons with the lik eliho o d-w eigh ted sampler of Sh w e and Co op er (1991). In the second section w e presen t results for the remaining, in tractable CPC cases. W e use length y runs of the Sh w e and Co op er sampling algorithm to pro vide a surrogate for the gold standard in these cases. Finally , in the third section w e consider the problem of obtaining in terv al b ounds on the p osterior marginals. 5.1 Comparison to Exact Marginals F our of the CPC cases ha v e 20 or few er p ositiv e ndings (see T able 1), and for these cases it is p ossible to calculate the exact v alues of the lik eliho o d and the p osterior marginals in a reasonable amoun t of time. W e used Hec k erman's \Quic kscore" algorithm (Hec k er- man 1989)|an algorithm tailored to the QMR-DT arc hitecture|to p erform these exact calculations. Figure 3 sho ws the log-lik eliho o d for the four tractable CPC cases. The gure also sho ws the v ariational lo w er and upp er b ounds. W e calculated the v ariational b ounds t wice, with diering n um b ers of p ositiv e ndings treated exactly in the t w o cases (\treated exactly" 303 Jaakk ola & Jord an case # of p os. ndings # of neg. ndings 1 20 14 2 10 21 3 19 19 4 19 33 T able 1: Description of the cases for whic h w e ev aluated the exact p osterior marginals. (a) 0 1 2 3 4 5 −70 −60 −50 −40 −30 −20 sorted cases log−likelihood (b) 0 1 2 3 4 5 −70 −60 −50 −40 −30 −20 sorted cases log−likelihood Figure 3: Exact v alues and v ariational upp er and lo w er b ounds on the log-lik eliho o d log P ( f + j  ) for the four tractable CPC cases. In (a) 8 p ositiv e ndings w ere treated exactly , and in (b) 12 p ositiv e ndings w ere treated exactly . simply means that the nding is not transformed v ariationally). In panel (a) there w ere 8 p ositiv e ndings treated exactly , and in (b) 12 p ositiv e ndings w ere treated exactly . As exp ected, the b ounds w ere tigh ter when more p ositiv e ndings w ere treated exactly . 4 The a v erage running time across the four tractable CPC cases w as 26.9 seconds for the exact metho d, 0.11 seconds for the v ariational metho d with 8 p ositiv e ndings treated exactly , and 0.85 seconds for the v ariational metho d with 12 p ositiv e ndings treated exactly . (These results w ere obtained on a 433 MHz DEC Alpha computer). Although the lik eliho o d is an imp ortan t quan tit y to appro ximate (particularly in appli- cations in whic h parameters need to b e estimated), of more in terest in the QMR-DT setting are the p osterior marginal probabilities for the individual diseases. As w e discussed in the previous section, the simplest approac h to obtaining v ariational estimates of these quan- tities is to dene an appro ximate v ariational distribution based either on the distribution P ( f + j  ), whic h upp er-b ounds the lik eliho o d, or the distribution P ( f + j q ), whic h lo w er- b ounds the lik eliho o d. F or xed v alues of the v ariational parameters (c hosen to pro vide a tigh t b ound to the lik eliho o d), b oth distributions pro vide partially factorized appro xi- mations to the join t probabilit y distribution. These factorized forms can b e exploited as 4. Giv en that a signican t fraction of the p ositiv e ndings are b eing treated exactly in these sim ulation s, one ma y w onder what if an y additional accuracy is due to the v ariational transformations. W e address this concern later in this section and demonstrate that the v ariational transformations are in fact resp onsible for a signican t p ortion of the accuracy in these cases. 304 V aria tional Pr obabilistic Inference and QMR-DT ecien t appro ximate inference engines for general p osterior probabilities, and in particular w e can use them to pro vide appro ximations to the p osterior marginals of individual diseases. In practice w e found that the distribution P ( f + j  ) yielded more accurate p osterior marginals than the distribution P ( f + j q ), and w e restrict our presen tation to P ( f + j  ). Fig- ure 4 displa ys a scatterplot of these appro ximate p osterior marginals, with panel (a) corre- (a) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 exact marginals variational estimates (b) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 exact marginals variational estimates Figure 4: Scatterplot of the v ariational p osterior estimates and the exact marginals. In (a) 8 p ositiv e ndings w ere treated exactly and in (b) 12 p ositiv e ndings w ere treated exactly . sp onding to the case in whic h 8 p ositiv e ndings w ere treated exactly and panel (b) the case in whic h 12 p ositiv e ndings treated exactly . The plots w ere obtained b y rst extracting the 50 highest p osterior marginals from eac h case using exact metho ds and then computing the appro ximate p osterior marginals for the corresp onding diseases. If the appro ximate marginals are in fact correct then the p oin ts in the gures should align along the diagonals as sho wn b y the dotted lines. W e see a reasonably go o d corresp ondence|the v ariational algorithm app ears to pro vide a go o d appro ximation to the largest p osterior marginals. (W e quan tify this corresp ondence with a ranking measure later in this section). A curren t state-of-the-art algorithm for the QMR-DT is the enhanced v ersion of lik eliho o d- w eigh ted sampling prop osed b y Sh w e and Co op er (1991). Lik eliho o d-w ei gh ted sampling is a sto c hastic sampling metho d prop osed b y F ung and Chang (1990) and Shac h ter and P eot (1990). Lik eliho o d-w ei gh ted sampling is basically a simple forw ard sampling metho d that w eigh ts samples b y their lik eliho o ds. It can b e enhanced and impro v ed b y utilizing \self- imp ortance sampling" (see Shac h ter & P eot, 1990), a v ersion of imp ortance sampling in whic h the imp ortance sampling distribution is con tin ually up dated to reect the curren t estimated p osterior distribution. Middleton et al. (1990) utilized lik eliho o d-w eigh ted sam- pling with self-imp ortance sampling (as w ell as a heuristic initialization sc heme kno wn as \iterativ e tabular Ba y es") for the QMR-DT mo del and found that it did not w ork sat- isfactorily . Subsequen t w ork b y Sh w e and Co op er (1991), ho w ev er, used an additional enhancemen t to the algorithm kno wn as `Mark o v blank et scoring" (see Shac h ter & P eot, 1990), whic h distributes fractions of samples to the p ositiv e and negativ e v alues of a no de in prop ortion to the probabilit y of these v alues conditioned on the Mark o v blank et of the no de. The com bination of Mark o v blank et scoring and self-imp ortance sampling yielded 305 Jaakk ola & Jord an an eectiv e algorithm. 5 In particular, with these mo dications in place, Sh w e and Co op er rep orted reasonable accuracy for t w o of the dicult CPC cases. W e re-implemen ted the lik eliho o d-w ei gh ted sampling algorithm of Sh w e and Co op er, incorp orating the Mark o v blank et scoring heuristic and self-imp ortance sampling. (W e did not utilize \iterativ e tabular Ba y es" but instead utilized a related initialization sc heme{ \heuristic tabular Ba y es"{also discussed b y Sh w e and Co op er). In this section w e discuss the results of running this algorithm on the four tractable CPC cases, comparing to the results of v ariational inference. 6 In the follo wing section w e presen t a fuller comparativ e analysis of the t w o algorithms for all of the CPC cases. Lik eliho o d-w eigh ting sampling, and indeed an y sampling algorithm, realizes a time- accuracy tradeo|taking additional samples requires more time but impro v es accuracy . In comparing the sampling algorithm to the v ariational algorithm, w e ran the sampling algorithm for sev eral dieren t total time p erio ds, so that the accuracy ac hiev ed b y the sampling algorithm roughly co v ered the range ac hiev ed b y the v ariational algorithm. The results are sho wn in Figure 5, with the righ t-hand curv e corresp onding to the sampling runs. The gure displa ys the mean correlations b et w een the appro ximate and exact p osterior marginals across ten indep enden t runs of the algorithm (for the four tractable CPC cases). 10 −1 10 0 10 1 10 2 0.86 0.88 0.9 0.92 0.94 0.96 0.98 1 execution time in seconds mean correlation Figure 5: The mean correlation b et w een the appro ximate and exact p osterior marginals as a function of the execution time (in seconds). Solid line: v ariational estimates; dashed line: lik eliho o d-w ei gh ting sampling. The lines ab o v e and b elo w the sam- pling result represen t standard errors of the mean based on the ten indep enden t runs of the sampler. V ariational algorithms are also c haracterized b y a time-accuracy tradeo. In particular, the accuracy of the metho d generally impro v es as more ndings are treated exactly , at the cost of additional computation. Figure 5 also sho ws the results from the v ariational algorithm (the left-hand curv e). The three p oin ts on the curv e corresp ond to up to 8, 12 and 5. The initiali zati on metho d pro v ed to ha v e little eect on the inference results. 6. W e also in v estigated Gibbs sampling (P earl, 1988). The results from Gibbs sampling w ere not as go o d as the results from lik eliho o d-w ei gh ted sampling, and w e rep ort only the latter results in the remainder of the pap er. 306 V aria tional Pr obabilistic Inference and QMR-DT 16 p ositiv e ndings treated exactly . Note that the v ariational estimates are deterministic and th us only a single run w as made. The gure sho ws that to ac hiev e roughly equiv alen t lev els of accuracy , the sampling algorithm requires signican tly more computation time than the v ariational metho d. Although scatterplots and correlation measures pro vide a rough indication of the accu- racy of an appro ximation algorithm, they are decien t in sev eral resp ects. In particular, in diagnostic practice the in terest is in the abilit y of an algorithm to rank diseases correctly , and to a v oid b oth false p ositiv es (diseases that are not in fact signican t but are included in the set of highly rank ed diseases) and false negativ es (signican t diseases that are omit- ted from the set of highly rank ed diseases). W e dened a ranking measure as follo ws (see also Middleton et al., 1990). Consider a set of the N highest ranking disease h yp otheses, where the ranking is based on the correct p osterior marginals. Corresp onding to this set of diseases w e can nd the smallest set of N 0 appro ximately rank ed diseases that includes the N signican t ones. In other w ords, for an y N \true p ositiv es" an appro ximate metho d pro duces N 0  N \false p ositiv es." Plotting false p ositiv es as a function of true p ositiv es pro vides a meaningful and useful measure of the accuracy of an appro ximation sc heme. T o the exten t that a metho d pro vides a nearly correct ranking of true p ositiv es the plot increases slo wly and the area under the curv e is small. When a signican t disease app ears late in the appro ximate ordering the plot increases rapidly near the true rank of the missed disease and the area under the curv e is large. W e also plot the n um b er of \false negativ es" in a set of top N highly rank ed diseases. F alse negativ es refer to the n um b er of diseases, out of the N highest ranking diseases, that do not app ear in the set of N appro ximately rank ed diseases. Note that unlik e the previous measure, this measure do es not rev eal the sev erit y of the misplacemen ts, only their frequency . With this impro v ed diagnostic measure in hand, let us return to the ev aluation of the inference algorithms, b eginning with the v ariational algorithm. Figure 6 pro vides plots of (a) 0 10 20 30 40 50 0 10 20 30 40 50 60 true positives false positives (b) 0 10 20 30 40 50 0 1 2 3 4 5 6 7 approximate ranking false negatives Figure 6: (a) Av erage n um b er of false p ositiv es as a function of true p ositiv es for the v aria- tional metho d (solid lines) and the partially-exact metho d (dashed line). (b) F alse negativ es in the set of top N appro ximately rank ed diseases. In b oth gures 8 p ositiv e ndings w ere treated exactly . the false p ositiv es (panel a) and false negativ es (panel b) against the true p ositiv es for the 307 Jaakk ola & Jord an (a) 0 10 20 30 40 50 0 5 10 15 20 25 30 35 40 true positives false positives (b) 0 10 20 30 40 50 0 0.5 1 1.5 2 2.5 3 3.5 4 approximate ranking false negatives Figure 7: (a) Av erage n um b er of false p ositiv es as a function of true p ositiv es for the v aria- tional metho d (solid line) and the partially-exact metho d (dashed line). (b) F alse negativ es in the set of top N appro ximately rank ed diseases. In b oth gures 12 p ositiv e ndings w ere treated exactly . tractable CPC cases. Eigh t p ositiv e ndings w ere treated exactly in the sim ulation sho wn in this gure. Figure 7 displa ys the results when 12 p ositiv e nding w ere treated exactly . As w e noted earlier, 8 and 12 p ositiv e ndings comprise a signican t fraction of the total p ositiv e ndings for the tractable CPC cases, and th us it is imp ortan t to v erify that the v ariational transformations are in fact con tributing to the accuracy of the p osterior appro ximations ab o v e and b ey ond the exact calculations. W e did this b y comparing the v ariational metho d to a metho d whic h w e call the \partially-exact" metho d in whic h the p osterior probabilities w ere obtained using only those ndings that w ere treated exactly in the v ariational calculations (i.e., using only those ndings that w ere not transformed). If the v ariational transformations did not con tribute to the accuracy of the appro ximation, then the p erformance of the partially-exact metho d should b e comparable to that of the v ariational metho d. 7 Figure 6 and Figure 7 clearly indicate that this is not the case. The dierence in accuracy b et w een these metho ds is substan tial while their computational load is comparable (ab out 0.1 seconds on a 433MHz Dec Alpha). W e b eliev e that the accuracy p ortra y ed in the false p ositiv e plots pro vides a go o d in- dication of the p oten tial of the v ariational algorithm for pro viding a practical solution to the appro ximate inference problem for the QMR-DT. As the gures sho w, the n um b er of false p ositiv es gro ws slo wly with the n um b er of true p ositiv es. F or example, as sho wn in Figure 6 where eigh t p ositiv e ndings are treated exactly , to nd the 20 most lik ely diseases w e w ould only need to en tertain the top 23 diseases in the list of appro ximately rank ed diseases (compared to more than 50 for the partially-exact metho d). The ranking plot for the lik eliho o d-w eigh ted sampler is sho wn in Figure 8, with the curv e for the v ariational metho d from Figure 7 included for comparison. T o mak e these plots, w e ran the lik eliho o d-w ei gh ted sampler for an amoun t of time (6.15 seconds) that w as 7. It should b e noted that this is a conserv ativ e comparison, b ecause the partially-exact metho d in fact b enets from the v ariational transformation|the set of exactly treated p ositiv e ndings is selected on the basis of the accuracy of the v ariational transformations, and these accuracies correlate with the diagnostic relev ance of the ndings. 308 V aria tional Pr obabilistic Inference and QMR-DT 0 10 20 30 40 50 0 5 10 15 20 25 30 35 40 true positives false positives Figure 8: Av erage n um b er of false p ositiv es as a function of true p ositiv es for the lik eliho o d- w eigh ted sampler (dashed line) and the v ariational metho d (solid line) with 12 p ositiv e ndings treated exactly . comparable to the time allo cated to our slo w est v ariational metho d (3.17 seconds; this w as the case in whic h 16 p ositiv e ndings w ere treated exactly . Recall that the time required for the v ariational algorithm with 12 p ositiv e ndings treated exactly w as 0.85 seconds.) As the plots sho w, for these tractable CPC cases, the v ariational metho d is signican tly more accurate than the sampling algorithm for comparable computational loads. 5.2 The F ull CPC Corpus W e no w consider the full CPC corpus. The ma jorit y of these cases (44 of 48 cases), ha v e more than 20 p ositiv e ndings and th us app ear to b e b ey ond the reac h of exact metho ds. An imp ortan t attraction of sampling metho ds is the mathematical guaran tee of accurate estimates in the limit of a sucien tly large sample size (Gelfand & Smith, 1990). Th us sampling metho ds ha v e the promise of pro viding a general metho dology for appro ximate inference, with t w o ca v eats: (1) the n um b er of samples that is needed can b e dicult to diagnosis, and (2) v ery man y samples ma y b e required to obtain accurate estimates. F or real-time applications, the latter issue can rule out sampling solutions. Ho w ev er, long-term runs of a sampler can still pro vide a useful baseline for the ev aluation of the accuracy of faster appro ximation algorithms. W e b egin b y considering this latter p ossibilit y in the con text of lik eliho o d-w eigh ted sampling for the QMR-DT. W e then turn to a comparativ e ev aluation of lik eliho o d-w eigh ted sampling and v ariational metho ds in the time-limited setting. T o explore the viabilit y of the lik eliho o d-w eigh ted sampler for pro viding a surrogate for the gold standard, w e carried out t w o indep enden t runs eac h consisting of 400,000 samples. Figure 9(a) sho ws the estimates of the log-lik eliho o d from the rst sampling run for all of the CPC cases. W e also sho w the v ariational upp er and lo w er b ounds for these cases (the cases ha v e b een sorted according to the lo w er b ound). Note that these b ounds are rigorous b ounds on the true log-lik eliho o d, and th us they pro vide a direct indication of the accuracy of the sampling estimates. Although w e see that man y of the estimates lie b et w een the b ounds, w e also see in man y cases that the sampling estimates deviate substan tially from the b ounds. This suggests that the p osterior marginal estimates obtained from these samples are lik ely to b e unreliable as w ell. Indeed, Figure 9(b) presen ts a scatterplot of 309 Jaakk ola & Jord an (a) 0 10 20 30 40 50 −200 −180 −160 −140 −120 −100 −80 −60 −40 −20 sorted cases log−likelihood (b) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 sampling estimates 1 sampling estimates 2 Figure 9: (a) Upp er and lo w er b ounds (solid lines) and the corresp onding sampling esti- mates (dashed line) of the log-lik eliho o d of observ ed ndings for the CPC cases. (b) A correlation plot b et w een the p osterior marginal estimates from t w o inde- p enden t sampling runs. estimated p osterior marginals for the t w o indep enden t runs of the sampler. Although w e see man y cases in whic h the results lie on the diagonal, indicating agreemen t b et w een the t w o runs, w e also see man y pairs of p osterior estimates that are far from the diagonal. These results cast some doubt on the viabilit y of the lik eliho o d-w eigh ted sampler as a general appro ximator for the full set of CPC cases. Ev en more problematically w e app ear to b e without a reliable surrogate for the gold standard for these cases, making it dicult to ev aluate the accuracy of real-time appro ximations suc h as the v ariational metho d. Note, ho w ev er, that the estimates in Figure 9(a) seem to fall in to t w o classes|estimates that lie within the v ariational b ounds and estimates that are rather far from the b ounds. This suggests the p ossibilit y that the distribution b eing sampled from is m ulti-mo dal, with some estimates falling within the correct mo de and pro viding go o d appro ximations and with others falling in spurious mo des and pro viding seriously inaccurate appro ximations. If the situation holds, then an accurate surrogate for the gold standard migh t b e obtained b y using the v ariational b ounds to lter the sampling results and retaining only those estimates that lie b et w een the b ounds giv en b y the v ariational approac h. Figure 10 pro vides some evidence of the viabilit y of this approac h. In 24 out of the 48 CPC cases b oth of the indep enden t runs of the sampler resulted in estimates of the log- lik eliho o d lying appro ximately within the v ariational b ounds. W e recomputed the p osterior marginal estimates for these selected cases and plotted them against eac h other in the gure. The scatterplot sho ws a high degree of corresp ondence of the p osterior estimates in these cases. W e th us ten tativ ely assume that these estimates are accurate enough to serv e as a surrogate gold standard and pro ceed to ev aluate the real-time appro ximations. Figure 11 plots the false p ositiv es against the true p ositiv es on the 24 selected CPC cases for the v ariational metho d. Tw elv e p ositiv e ndings w ere treated exactly in this sim ulation. Obtaining the v ariational estimates to ok 0.29 seconds of computer time p er case. Although the curv e increases more rapidly than with the tractable CPC cases, the v ariational algorithm still app ears to pro vide a reasonably accurate ranking of the p osterior marginals, within a reasonable time frame. 310 V aria tional Pr obabilistic Inference and QMR-DT 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 sampling estimates 1 sampling estimates 2 Figure 10: A correlation plot b et w een the selected p osterior marginal estimates from t w o indep enden t sampling runs, where the selection w as based on the v ariational upp er and lo w er b ounds. 0 10 20 30 40 50 0 10 20 30 40 50 60 70 true positives false positives Figure 11: Av erage n um b er of false p ositiv es as a function of true p ositiv es for the v ari- ational metho d (solid line) and the lik eliho o d-w ei gh ted sampler (dashed line). F or the v ariational metho d 12 p ositiv e ndings w ere treated exactly , and for the sampler the results are a v erages across ten runs. T o compare the v ariational algorithm to a time-limited v ersion of the lik eliho o d-w eigh ted sampler w e ran the latter algorithm for a p erio d of time (8.83 seconds p er case) roughly com- parable to the running time of the v ariational algorithm (0.29 seconds p er case). Figure 11 sho ws the corresp onding plot of false p ositiv es against true p ositiv es, where w e ha v e a v er- aged o v er ten indep enden t runs. W e see that the curv e increases signican tly more steeply than the v ariational curv e. T o nd the 20 most lik ely diseases with the v ariational metho d w e w ould only need to en tertain the top 30 diseases in the list of appro ximately rank ed diseases. F or the sampling metho d w e w ould need to en tertain the top 70 appro ximately rank ed diseases. 5.3 In terv al Bounds on the Marginal Probabilities Th us far w e ha v e utilized the v ariational approac h to pro duce appro ximations to the p oste- rior marginals. The appro ximations that w e ha v e discussed originate from upp er and lo w er 311 Jaakk ola & Jord an b ounds on the lik eliho o d, but they are not themselv es b ounds. That is, they are not guar- an teed to lie ab o v e or b elo w the true p osteriors, as w e see in Figure 4. As w e discussed in Section 4.1, ho w ev er, it is also p ossible to induce upp er and lo w er b ounds on the p osterior marginals from upp er and lo w er b ounds on the lik eliho o d (cf. Eq. 33). In this section w e ev aluate these in terv al b ounds for the QMR-DT p osterior marginals. Figure 12 displa ys histogram of the in terv al b ounds for the four tractable CPC cases, the 24 selected CPC cases from the previous section, and all of the CPC cases. These histograms include all of the diseases in the QMR-DT net w ork. In the case of the tractable cases the (a) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Interval size Frequency (b) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Interval size Frequency (c) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Interval size Frequency Figure 12: Histograms of the size of the in terv al b ounds on all of the diseases in the QMR- DT net w ork for (a) the four tractable CPC cases, (b) the 24 selected CPC cases from the previous section, and (c) all of the CPC cases. v ariational metho d w as run with 12 p ositiv e ndings treated exactly . F or the remaining CPC cases the v ariational metho d w as run with 16 p ositiv e ndings treated exactly . The running time of the algorithm w as less than 10 seconds of computer time p er CPC case. F or the tractable CPC cases, the in terv al b ounds are tigh t for nearly all of the diseases in the net w ork. Ho w ev er, (1) few of the p ositiv e ndings are treated v ariationally in these cases, and (2) there is no need in practice to compute v ariational b ounds for these cases. W e get a somewhat b etter picture of the viabilit y of the v ariational in terv al b ounds in Figure 12(b) and Figure 12(c), and the picture is decidedly mixed. F or the 24 selected cases, tigh t b ounds are pro vided for appro ximately half of the diseases. The b ounds are v acuous for appro ximately a quarter of the diseases, and there are a range of diseases in b et w een. When w e consider all of the CPC cases, appro ximately a third of the b ounds are tigh t and nearly half are v acuous. Although these results ma y indicate limitations in our v ariational appro ximation, there is another more immediate problem that app ears to b e resp onsible for the lo oseness of the b ounds in man y cases. In particular, recall that w e use the Quic kscore algorithm (Hec k erman, 1989) to handle the exact calculations within the framew ork of our v ariational algorithm. Unfortunately Quic kscore suers from v anishing n umerical precision for large n um b ers of p ositiv e ndings, and in general w e b egin to run in to n umerical problems, resulting in v acuous b ounds, when 16 p ositiv e ndings are incorp orated exactly in to the v ariational appro ximation. Th us, although it is clearly of in terest to run the v ariational algorithm for longer durations, and thereb y impro v e the b ounds, w e are unable to do so within our curren t implemen tation of the exact subroutine. 312 V aria tional Pr obabilistic Inference and QMR-DT While it is clearly w orth studying metho ds other than Quic kscore for treating the ex- act ndings within the v ariational algorithm, it is also of in terest to consider com bining v ariational metho ds with other metho ds, suc h as searc h-based or other partial ev aluation metho ds, that are based on in terv als. These metho ds ma y help in simplifying the p osterior and ob viating the need for impro ving the exact calculations. It is also w orth emphasizing the p ositiv e asp ect of these results and their p oten tial practical utilit y . The previous section sho w ed that the v ariational metho d can pro vide ac- curate appro ximations to the p osterior marginals. Com bined with the in terv al b ounds in this section|whic h are calculated ecien tly|the user can obtain guaran tees on appro xi- mately a third of these appro ximations. Giv en the relativ ely b enign rate of increase in false p ositiv es as a function of true p ositiv es (Figure 11), suc h guaran tees ma y suce. Finally , for diseases in whic h the b ounds are lo ose there are also p erturbation metho ds a v ailable (Jaakk ola, 1997) that can help to v alidate the appro ximations for these diseases. 6. Discussion Let us summarize the v ariational inference metho d and ev aluate the results that w e ha v e obtained. The v ariational metho d b egins with parameterized upp er and lo w er b ounds on the indi- vidual conditional probabilities at the no des of the mo del. F or the QMR-DT, these b ounds are exp onen tials of linear functions, and in tro ducing them in to the mo del corresp onds to delinking no des from the graph. Sums of pro ducts of these b ounds yield b ounds, and th us w e readily obtain parameterized b ounds on marginal probabilities, in particular upp er and lo w er b ounds on the lik eliho o d. W e exploited the lik eliho o d b ounds in ev aluating the output of the lik eliho o d-w eigh ted sampling algorithm. Although the sampling algorithm did not yield reliable results across the corpus of CPC cases, when w e utilized the v ariational upp er and lo w er b ounds to select among the samples w e w ere able to obtain sampling results that w ere consisten t b et w een runs. This suggests a general pro cedure in whic h v ariational b ounds are used to assess the con v ergence of a sampling algorithm. (One can also imagine a more in timate relationship b et w een these algorithms in whic h the v ariational b ounds are used to adjust the on-line course of the sampler). The fact that w e ha v e b ounds on the lik eliho o d (or other marginal probabilities) is critical|the b ounding prop ert y allo ws us to nd optimizing v alues of the v ariational pa- rameters b y minimizing the upp er-b ounding v ariational distribution and maximizing the lo w er-b ounding v ariational distribution. In the case of the QMR-DT net w ork (a bipar- tite noisy-OR graph), the minimization problem is a con v ex optimization problem and the maximization problem is solv ed via the EM algorithm. Once the v ariational parameters are optimized, the resulting v ariational distribution can b e exploited as an inference engine for calculating appro ximations to p osterior probabilities. This tec hnique has b een our fo cus in the pap er. Graphically , the v ariationally transformed mo del can b e view ed as a sub-graph of the original mo del in whic h some of the nding no des ha v e b een delink ed. If a sucien t n um b er of ndings are delink ed v ariationally then it is p ossible to run an exact algorithm on the resulting graph. This approac h yields appro ximations to the p osterior marginals of the disease no des. 313 Jaakk ola & Jord an W e found empirically that these appro ximations app eared to pro vide go o d appro xima- tions to the true p osterior marginals. This w as the case for the tractable set of CPC cases (cf. Figure 7) and|sub ject to our assumption that w e ha v e obtained a go o d surrogate for the gold standard via the selected output of the sampler|also the case for the full CPC corpus (cf. Figure 11). W e also compared the v ariational algorithm to a state-of-the-art algorithm for the QMR- DT, the lik eliho o d-w ei gh ted sampler of Sh w e and Co op er (1991). W e found that the v aria- tional algorithm outp erformed the lik eliho o d-w eigh ted sampler b oth for the tractable cases and for the full corpus. In particular, for a xed accuracy requiremen t the v ariational algo- rithm w as signican tly faster (cf. Figure 5), and for a xed time allotmen t the v ariational algorithm w as signican tly more accurate (cf. Figure 8 and Figure 11). Our results w ere less satisfactory for the in terv al b ounds on the p osterior marginals. Across the full CPC corpus w e found that for appro ximately one third of the disease the b ounds w ere tigh t but for half of the diseases the b ounds w ere v acuous. A ma jor imp edimen t to obtaining tigh ter b ounds app ears to lie not in the v ariational appro ximation p er se but rather in the exact subroutine, and w e are in v estigating exact metho ds with impro v ed n umerical prop erties. Although w e ha v e fo cused in detail on the QMR-DT mo del in this pap er, it is w orth noting that the v ariational probabilistic inference metho dology is considerably more general. Sp ecically , the metho ds that w e ha v e describ ed here are not limited to the bi-partite graphical structure of the QMR-DT mo del, nor is it necessary to emplo y noisy-OR no des (Jaakk ola & Jordan, 1996). It is also the case that the t yp e of transformations that w e ha v e exploited in the QMR-DT setting extend to a larger class of dep endence relations based on generalized linear mo dels (Jaakk ola, 1997). Finally , for a review of applications of v ariational metho ds to a v ariet y of other graphical mo del arc hitectures, see Jordan, et al. (1998). A promising direction for future researc h app ears to b e in the in tegration of v arious kinds of appro ximate and exact metho ds (see, e.g., Dagum & Horvitz, 1992; Jensen, Kong, & Kjrul, 1995). In particular, searc h-based metho ds (Co op er, 1985; P eng & Reggia, 1987, Henrion, 1991) and v ariational metho ds b oth yield b ounds on probabilities, and, as w e ha v e indicated in the in tro duction, they seem to exploit dieren t asp ects of the struc- ture of complex probabilit y distributions. It ma y b e p ossible to com bine the b ounds from these algorithm|the v ariational b ounds migh t b e used to guide the searc h, or the searc h- based b ounds migh t b e used to aid the v ariational appro ximation. Similar commen ts can b e made with resp ect to lo calized partial ev aluation metho ds and b ounded conditioning metho ds (Drap er & Hanks, 1994; Horvitz, et al., 1989). Also, w e ha v e seen that v ariational b ounds can b e used for assessing whether estimates from Mon te Carlo sampling algorithms ha v e con v erged. A further in teresting h ybrid w ould b e a sc heme in whic h v ariational ap- pro ximations are rened b y treating them as initial conditions for a sampler. Ev en without extensions our results in this pap er app ear quite promising. W e ha v e presen ted an algorithm whic h runs in real time on a large-scale graphical mo del for whic h exact algorithms are in general infeasible. The results that w e ha v e obtained app ear to b e reasonably accurate across a corpus of dicult diagnostic cases. While further w ork is needed, w e b eliev e that our results indicate a promising role for v ariational inference in dev eloping, critiquing and exploiting large-scale probabilistic mo dels suc h as the QMR-DT. 314 V aria tional Pr obabilistic Inference and QMR-DT Ac kno wledgemen ts W e w ould lik e to thank the Univ ersit y of Pittsburgh and Randy Miller for the use of the QMR-DT database. W e also w an t to thank Da vid Hec k erman for suggesting that w e attac k QMR-DT with v ariational metho ds, and for pro viding helpful counsel along the w a y . App endix A. Dualit y The upp er and lo w er b ounds for individual conditional probabilit y distributions that form the basis of our v ariational metho d are based on the \dual" or \conjugate" represen tations of con v ex functions. W e presen t a brief description of con v ex dualit y in this app endix, and refer the reader to Ro c k afellar (1970) for a more extensiv e treatmen t. Let f ( x ) b e a real-v alued, con v ex function dened on a con v ex set X (for example, X = R n ). F or simplicit y of exp osition, w e assume that f is a w ell-b eha v ed (dieren tiable) function. Consider the graph of f , i.e., the p oin ts ( x; f ( x )) in an n + 1 dimensional space. The fact that the function f is con v ex translates in to con v exit y of the set f ( x; y ) : y  f ( x ) g called the epigr aph of f and denoted b y epi ( f ) (Figure 13). It is an elemen tary prop ert y f(x) epi(f) x ξ - y - f*( ξ ) ≤ 0 x ξ ’ - y - f*( ξ ’) ≤ 0 x ξ - y - µ ≤ 0 Figure 13: Half-spaces con taining the con v ex set epi ( f ). The conjugate function f  (  ) denes the critical half-spaces whose in tersection is epi ( f ), or, equiv alen tly , it denes the tangen t planes of f ( x ). of con v ex sets that they can b e represen ted as the in tersection of all the half-spaces that con tain them (see Figure 13). Through parameterizing these half-spaces w e obtain the dual represen tations of con v ex functions. T o this end, w e dene a half-space b y the condition: all ( x; y ) suc h that x T   y    0 (34) where  and  parameterize all (non-v ertical) half-spaces. W e are in terested in c haracter- izing the half-spaces that con tain the epigraph of f . W e require therefore that the p oin ts in the epigraph m ust satisfy the half-space condition: for ( x; y ) 2 epi ( f ), w e m ust ha v e x T   y    0. This holds whenev er x T   f ( x )    0 as the p oin ts in the epigraph ha v e the prop ert y that y  f ( x ). Since the condition m ust b e satised b y all x 2 X , it follo ws 315 Jaakk ola & Jord an that max x 2 X f x T   f ( x )   g  0 ; (35) as w ell. Equiv alen tly ,   max x 2 X f x T   f ( x ) g (36) where the righ t-hand side of this equation denes a function of  , whic h is kno wn as the \dual" or \conjugate" function f  (  ). This function, whic h is also a con v ex function, denes the critical half-spaces whic h are needed for the represen tation of epi ( f ) as an in tersection of half-spaces (Figure 13). T o clarify the dualit y b et w een f ( x ) and f  ( x ), let us drop the maxim um and rewrite the inequalit y as: x T   f ( x ) + f  (  ) (37) In this equation, the roles of the t w o functions are in terc hangeable and w e ma y susp ect that also f ( x ) can obtained from the dual function f  ( x ) b y an optimization pro cedure. This is in fact the case and w e ha v e: f ( x ) = max  2  f x T   f  (  ) g (38) This equalit y states that the dual of the dual giv es bac k the original function. It pro vides the computational to ol for calculating dual functions. F or conca v e (con v ex do wn) functions the results are analogous; w e replace max with min , and lo w er b ounds with upp er b ounds. App endix B. Optimization of the V ariational P arameters The v ariational metho d that w e ha v e describ ed in v olv es replacing selected lo cal conditional probabilities with either upp er-b ounding or lo w er-b ounding v ariational transformations. Because the pro duct of b ounds is a b ound, the v ariationally transformed join t probabilit y distribution is a b ound (upp er or lo w er) on the true join t probabilit y distribution. More- o v er, b ecause sums of b ounds is a b ound on the sum, w e can obtain b ounds on marginal probabilities b y marginalizing the v ariationally transformed join t probabilit y distribution. In particular, this pro vides a metho d for obtaining b ounds on the lik eliho o d (the marginal probabilit y of the evidence). Note that the v ariationally transformed distributions are b ounds for arbitrary v alues of the v ariational parameters (b ecause eac h individual ly transformed no de conditional prob- abilit y is a b ound for arbitrary v alues of its v ariational parameter). T o obtain optimizing v alues of the v ariational parameters, w e tak e adv an tage of the fact that our transformed distribution is a b ound, and either minimize (in the case of upp er b ounds) or maximize (in the case of lo w er b ounds) the transformed distribution with resp ect to the v ariational parameters. It is this optimization pro cess whic h pro vides a tigh t b ound on the marginal probabilit y of in terest (e.g., the lik eliho o d) and thereb y pic ks out a particular v ariational distribution that can subsequen tly b e used for appro ximate inference. 316 V aria tional Pr obabilistic Inference and QMR-DT In this app endix w e discuss the optimization problems that w e m ust solv e in the case of noisy-OR net w orks. W e consider the upp er and lo w er b ounds separately , b eginning with the upp er b ound. Upp er Bound T ransformations Our goal is to compute a tigh t upp er b ound on the lik eliho o d of the observ ed ndings: P ( f + ) = P d P ( f + j d ) P ( d ). As discussed in Section 4.2, w e obtain an upp er b ound on P ( f + j d ) b y in tro ducing upp er b ounds for individual no de conditional probabilities. W e represen t this upp er b ound as P ( f + j d;  ), whic h is a pro duct across the individual v aria- tional transformations and ma y con tain con tributions due to ndings that are b eing treated exactly (i.e., are not transformed). Marginalizing across d w e obtain a b ound: P ( f + )  X d P ( f + j d;  ) P ( d )  P ( f + j  ) : (39) It is this latter quan tit y that w e wish to minimize with resp ect to the v ariational parameters  . T o simplify the notation w e assume that the rst m p ositiv e ndings ha v e b een trans- formed (and therefore need to b e optimized) while the remaining conditional probabilities will b e treated exactly . In this notation P ( f + j  ) is giv en b y P ( f + j  ) = X d 2 4 Y i  m P ( f + i j d;  i ) 3 5 " Y i>m P ( f + i j d ) # Y j P ( d j ) (40) / E 8 < : Y i  m P ( f + i j d;  i ) 9 = ; ; (41) where the exp ectation is tak en with resp ect to the p osterior distribution for the diseases giv en those p ositiv e ndings that w e plan to treat exactly . Note that the prop ortionalit y constan t do es not dep end on the v ariational parameters (it is the lik eliho o d of the exactly treated p ositiv e ndings). W e no w insert the explicit forms of the transformed conditional probabilities (see Eq. (17)) in to Eq. (41) and nd: P ( f + j  ) / E 8 < : Y i  m e  i (  i 0 + P j  ij d j )  f  (  i ) 9 = ; (42) = e P i  m (  i  i 0  f  (  i )) E  e P j; i  m  i  ij d j  (43) where w e ha v e simply con v erted the pro ducts o v er i in to sums in the exp onen t and pulled out the terms that are constan ts with resp ect to the exp ectation. On a log-scale, the prop ortionalit y b ecomes an equiv alence up to a constan t: log P ( f + j  ) = C + X i  m (  i  i 0  f  (  i )) + log E  e P j;i  m  i  ij d j  (44) 317 Jaakk ola & Jord an Sev eral observ ations are in order. Recall that f  (  i ) is the conjugate of the conca v e function f (the exp onen t), and is therefore also conca v e; for this reason  f  (  i ) is con v ex. In App endix C w e pro v e that the remaining term: log E  e P j;i  m  i  ij d j  (45) is also a con v ex function of the v ariational parameters. No w, since an y sum of con v ex functions is con v ex, w e conclude that log P ( f + j  ) is a con v ex function of the v ariational parameters. This means that there are no lo cal minima in our optimization problem. W e ma y safely emplo y the standard Newton-Raphson pro cedure to solv e r log P ( f + j  ) = 0. Alternativ ely w e can utilize xed-p oin t iterations. In particular, w e calculate the deriv ativ es of the v ariational form and iterativ ely solv e for the individual v ariational parameters  k suc h that the deriv ativ es are zero. The deriv ativ es are giv en as follo ws: @ @  k log P ( f + j  ) =  k 0 + log  k 1 +  k + E 8 < : X j  k j d j 9 = ; (46) @ 2 @ 2  k log P ( f + j  ) = 1  k  1 1 +  k + V ar 8 < : X j  k j d j 9 = ; ; (47) where the exp ectation and the v ariance are with resp ect to the p osterior appro ximation P ( d j f + ;  ), and b oth deriv ativ es can b e computed in time linear in the n um b er of asso ci- ated diseases for the nding. The b enign scaling of the v ariance calculations comes from exploiting the sp ecial prop erties of the noisy-OR dep endence and the marginal indep endence of the diseases. Calculating the exp ectations in Eq. (7) is exp onen tially costly in the n um b er of exactly treated p ositiv e ndings. When there are a large n um b er of p ositiv e ndings, w e can ha v e recourse to a simplied pro cedure in whic h w e optimize v ariational parameters after ha ving transformed all or most of the p ositiv e ndings. While the resulting v ariational parameters are sub optimal, w e ha v e found in practice that the incurred loss in accuracy is t ypically quite small. In the sim ulations rep orted in the pap er, w e optimized the v ariational parameters after appro ximately half of the exactly treated ndings had b een in tro duced. (T o b e precise, in the case of 8, 12 and 16 total ndings treated exactly , w e optimized the parameters after 4, 8, and 8 ndings, resp ectiv ely , w ere in tro duced). Lo w er Bound T ransformations Mimic king the case of upp er b ounds, w e replace individual conditional probabilities of the ndings with lo w er-b ounding transformations, resulting in a lo w er-b ounding expression P ( f + j d; q ). T aking the pro duct with P ( d ) and marginalizing o v er d yields a lo w er b ound on the lik eliho o d: P ( f + )  X d P ( f + j d; q ) P ( d )  P ( f + j q ) : (48) W e wish to maximize P ( f + j q ) with resp ect to the v ariational parameters q to obtain the tigh test p ossible b ound. 318 V aria tional Pr obabilistic Inference and QMR-DT Our problem can b e mapp ed on to a standard optimization problem in statistics. In particular, treating d as a laten t v ariable, f as an observ ed v ariable, and q as a parameter v ector, the optimization of P ( f + j q ) (or its logarithm) can b e view ed as a standard maxim um lik eliho o d estimation problem for a laten t v ariable mo del. It can b e solv ed using the EM algorithm (Dempster, Laird, & Rubin, 1977). The algorithm yields a sequence of v ariational parameters that monotonically increase the ob jectiv e function log P ( f + j q ). Within the EM framew ork, w e obtain an up date of the v ariational parameters b y maximizing the exp ected complete log-lik eliho o d: E  log P ( f + j d; q ) P ( d )  = X i E n log P ( f + i j d; q j i ) o + constan t ; (49) where q old denotes the v ector of v ariational parameters b efore the up date, where the con- stan t term is indep enden t of the v ariational parameters q and where the exp ectation is with resp ect to the p osterior distribution P ( d j f + ; q old ) / P ( f + j d; q old ) P ( d ). Since the v ariational parameters asso ciated with the conditional probabilities P ( f + i j d; q j i ) are indep enden t of one another, w e can maximize eac h term in the ab o v e sum separately . Recalling the form of the v ariational transformation (see Eq. (24)), w e ha v e: E n log P ( f + i j d; q j i ) o = X j q j j i E f d j g " f  io +  ij q j j i !  f (  io ) # + f (  io ) (50) whic h w e are to maximize with resp ect to q j j i while k eeping the exp ectations E f d j g xed. This optimization problem can b e solv ed iterativ ely and monotonically b y p erforming the follo wing sync hronous up dates with normalization: q j j i E f d j g " q j j i f  io +  ij q j j i !   ij f 0  io +  ij q j j i !  q j j i f (  io ) # (51) where f 0 denotes the deriv ativ e of f . (The up date is guaran teed to b e non-negativ e). This algorithm can b e easily extended to handle the case where not all the p ositiv e ndings ha v e b een transformed. The only new feature is that some of the conditional probabilities in the pro ducts P ( f + j d; q old ) and P ( f + j d; q ) ha v e b een left in tact, i.e., not transformed; the optimization with resp ect to the v ariational parameters corresp onding to the transformed conditionals pro ceeds as b efore. App endix C. Con v exit y The purp ose of this app endix is to demonstrate that the function: log E  e P j;i  m  i  ij d j  (52) is a con v ex function of the v ariational parameters  i . W e note rst that ane transforma- tions do not c hange con v exit y prop erties. Th us con v exit y in X = P j;i  m  i  ij d j implies 319 Jaakk ola & Jord an con v exit y in the v ariational parameters  . It remains to sho w that log E n e X o = log X i p i e X i = f ( ~ X ) (53) is a con v ex function of the v ector ~ X = f X 1 : : : X n g T ; here w e ha v e indicated the discrete v alues in the range of the random v ariable X b y X i and denoted the probabilit y measure on suc h v alues b y p i . T aking the gradien t of f with resp ect to X k giv es: @ @ X k f ( ~ X ) = p k e X k P i p i e X i  q k (54) where q k denes a probabilit y distribution. The con v exit y is rev ealed b y a p ositiv e semi- denite Hessian H , whose comp onen ts in this case are H k l = @ 2 @ X k @ X l f ( ~ X ) =  k l q k  q k q l (55) T o see that H is p ositiv e semi-denite, consider ~ Z T H ~ Z = X k q k Z 2 k  ( X k q k Z k )( X l q l Z l ) = V ar f Z g  0 (56) where V ar f Z g is the v ariance of a discrete random v ariable Z whic h tak es the v alues Z i with probabilit y q i . References D'Am brosio, B. (1993). Incremen tal probabilistic inference. In Pr o c e e dings of the Ninth Confer enc e on Unc ertainty in A rticial Intel ligenc e . San Mateo, CA: Morgan Kauf- mann. D'Am brosio, B. (1994). Sym b olic probabilistic inference in large BN20 net w orks. In Pr o- c e e dings of the T enth Confer enc e on Unc ertainty in A rticial Intel ligenc e . San Mateo, CA: Morgan Kaufmann. Co op er, G. (1985). NESTOR: A c omputer-b ase d me dic al diagnostic aid that inte gr ates c ausal and pr ob abilistic know le dge . Ph.D. Dissertation, Medical Informatics Sciences, Stanford Univ ersit y , Stanford, CA. (Av ailable from UMI at h ttp://wwwlib.umi.com/dissertations/main). Co op er, G. (1990). The computational complexit y of probabilistic inference using Ba y esian b elief net w orks. A rticial Intel ligenc e , 42 , 393{405. Dagum, P ., & Horvitz, E. (1992). Reform ulating inference problems through selectiv e conditioning. In Pr o c e e dings of the Eighth A nnual Confer enc e on Unc ertainty in A rticial Intel ligenc e . Dagum, P ., & Horvitz, E. (1993). A Ba y esian analysis of sim ulation algorithms for inference in Belief net w orks. Networks , 23 , 499{516. 320 V aria tional Pr obabilistic Inference and QMR-DT Dagum, P ., & Lub y , M. (1993). Appro ximate probabilistic reasoning in Ba y esian b elief net w orks is NP-hard. A rticial Intel ligenc e , 60 , 141{153. Dec h ter, R. (1997). Mini-buc k ets: A general sc heme of generating appro ximations in auto- mated reasoning. In Pr o c e e dings of the Fifte enth International Joint Confer enc e on A rticial Intel ligenc e . Dec h ter, R. (1998). Buc k et elimination: A unifying framew ork for probabilistic inference. In M. I. Jordan (Ed.), L e arning in Gr aphic al Mo dels . Cam bridge, MA: MIT Press. Dempster, A., Laird, N., & Rubin, D. (1977). Maxim um lik eliho o d from incomplete data via the EM algorithm. Journal of the R oyal Statistic al So ciety B , 39 , 1{38. Drap er, D., & Hanks, S. (1994). Lo calized partial ev aluation of b elief net w orks. In Pr o- c e e dings of the T enth A nnual Confer enc e on Unc ertainty in A rticial Intel ligenc e . F ung, R., & Chang, K. C. (1990). W eigh ting and in tegrating evidence for sto c hastic sim- ulation in Ba y esian net w orks. In Pr o c e e dings of Fifth Confer enc e on Unc ertainty in A rticial Intel ligenc e . Amsterdam: Elsevier Science. Gelfand, A., & Smith, A. (1990). Sampling-based approac hes to calculating marginal Den- sities. Journal of the A meric an Statistic al Asso ciation , 85 , 398{409. Hec k erman, D. (1989). A tractable inference algorithm for diagnosing m ultiple diseases. In Pr o c e e dings of the Fifth Confer enc e on Unc ertainty in A rticial Intel ligenc e . Henrion, M. (1991). Searc h-based metho ds to b ound diagnostic probabilities in v ery large b elief nets. In Pr o c e e dings of Seventh Confer enc e on Unc ertainty in A rticial Intel li- genc e . Horvitz, E. Suermondt, H., & Co op er, G. (1989). Bounded conditioning: Flexible inference for decisions under scarce resources. In Pr o c e e dings of Fifth Confer enc e on Unc er- tainty in A rticial Intel ligenc e . Jaakk ola, T. (1997). V ariational metho ds for infer enc e and le arning in gr aphic al mo dels . PhD thesis, Departmen t of Brain and Cognitiv e Sciences, Massac h usetts Institute of T ec hnology . Jaakk ola, T., & Jordan, M. (1996). Recursiv e algorithms for appro ximating probabilities in graphical mo dels. In A dvanc es of Neur al Information Pr o c essing Systems 9 . Cam- bridge, MA: MIT Press. Jensen, C. S., Kong, A., & Kjrul, U. (1995). Blo c king-Gibbs sampling in v ery large probabilistic exp ert systems. International Journal of Human-Computer Studies , 42 , 647{666. Jensen, F. (1996). Intr o duction to Bayesian networks . New Y ork: Springer. 321 Jaakk ola & Jord an Jordan, M., Ghaharamani, Z. Jaakk ola, T., & Saul, L. (in press). An in tro duction to v ariational metho ds for graphical mo dels. Machine L e arning . Lauritzen, S., & Spiegelhalter, D. (1988). Lo cal computations with probabilities on graph- ical structures and their application to exp ert systems (with discussion). Journal of the R oyal Statistic al So ciety B, 50 , 157{224. MacKa y , D. J. C. (1998). In tro duction to Mon te Carlo metho ds. In M. I. Jordan (Ed.), L e arning in Gr aphic al Mo dels . Cam bridge, MA: MIT Press. Middleton, B., Sh w e, M., Hec k erman, D., Henrion, M., Horvitz, E., Lehmann, H., & Co op er, G. (1990). Probabilistic diagnosis using a reform ulation of the INTERNIST-1/QMR kno wledge base I I. Ev aluation of diagnostic p erformance. Section on Medical Infor- matics T ec hnical rep ort SMI-90-0329, Stanford Univ ersit y . Miller, R. A., F asarie, F. E., & My ers, J. D. (1986). Quic k medical reference (QMR) for diagnostic assistance. Me dic al Computing, 3 , 34{48. P earl, J. (1988). Pr ob abilistic R e asoning in Intel ligent Systems . San Mateo, CA: Morgan Kaufmann. P eng, Y., & Reggia, J. (1987). A probabilistic causal mo del for diagnostic problem solving { P art 2: Diagnostic strategy . IEEE T r ans. on Systems, Man, and Cyb ernetics: Sp e cial Issue for Diagnosis , 17 , 395{406. P o ole, D. (1997). Probabilistic partial ev aluation: Exploiting rule structure in probabilistic inference. In Pr o c e e dings of the Fifte enth International Joint Confer enc e on A rticial Intel ligenc e . Ro c k afellar, R. (1972). Convex A nalysis . Princeton Univ ersit y Press. Shac h ter, R. D., & P eot, M. (1990). Sim ulation approac hes to general probabilistic inference on b elief net w orks. In Pr o c e e dings of Fifth Confer enc e on Unc ertainty in A rticial Intel ligenc e . Elsevier Science: Amsterdam. Sheno y , P . P . (1992). V aluation-based systems for Ba y esian decision analysis. Op er ations R ese ar ch, 40 , 463{484. Sh w e, M., & Co op er, G. (1991). An empirical analysis of lik eliho o d { w eigh ting sim ulation on a large, m ultiply connected medical b elief net w ork. Computers and Biome dic al R ese ar ch , 24 , 453-475. Sh w e, M., Middleton, B., Hec k erman, D., Henrion, M., Horvitz, E., Lehmann, H., & G. Co op er (1991). Probabilistic diagnosis using a reform ulation of the INTERNIST- 1/QMR kno wledge base I. The probabilistic mo del and inference algorithms. Metho ds of Information in Me dicine , 30 , 241{255. 322

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment