Learning Gaussian Networks

Dan Geiger (geiger02@gmail.com)
David Heckerman (heckerma@hotmail.com)

July 1994, Revised May 2021

Abstract

We describe scoring metrics for learning Bayesian networks from a combination of user knowledge and statistical data. Previous work has concentrated on metrics for domains containing only discrete variables, under the assumption that data represents a multinomial sample. In this paper, we extend this work, developing scoring metrics for domains containing only continuous variables under the assumption that continuous data is sampled from a multivariate normal distribution. Our work extends traditional statistical approaches for identifying vanishing regression coefficients in that we identify two important assumptions, called event equivalence and parameter modularity, that when combined allow the construction of prior distributions for multivariate normal parameters from a single prior Bayesian network specified by a user.

Corrections to the original text in red are taken from the 2021 update of J. Kuipers, G. Moffa, and D. Heckerman, Addendum on the scoring of Gaussian directed acyclic graphical models, Annals of Statistics 42, 1689-1691, Aug 2014 (arXiv:1402.6863). Other updates to the original are in blue.

1 Introduction

Several researchers have examined methods for learning Bayesian networks from data, including Cooper and Herskovits (1991, 1992), Buntine (1991), Spiegelhalter et al. (1993), and Heckerman et al. (1994) (herein referred to as CH, Buntine, SDLC, and HGC, respectively). These methods all have the same basic components: a scoring metric and a search procedure. The metric computes a score that is proportional to the posterior probability of a network structure, given data and a user's prior knowledge. The search procedure generates networks for evaluation by the scoring metric.
These methods use the two components to identify a network or set of networks with high relative posterior probabilities, and these networks are then used to predict future events.

Previous work has concentrated on domains containing only discrete variables, under the assumption that data is sampled from a multivariate discrete distribution. In this paper, we develop metrics for domains containing only continuous variables, under the assumption that continuous data is sampled from a multivariate normal (Gaussian) distribution. Previously, when working with continuous variables, the standard solution had been to transform each such variable x_i into a discrete one by splitting its domain into several mutually exclusive and exhaustive regions. Our metrics eliminate the need for this transformation. In addition, our metrics have the advantage that they use the low polynomial dimensionality of the parameter space of a multivariate normal distribution, whereas their discrete counterparts often require a parameter space that is exponential in the number of domain variables.

Our work can be viewed as an extension of traditional statistical approaches for identifying vanishing regression coefficients, such as those described in DeGroot (1970, Chapter 11). In particular, we translate two assumptions that we identified in HGC for domains containing only discrete variables, called parameter modularity and event equivalence, to domains containing continuous variables. The assumption of parameter modularity addresses the relationship among prior distributions of parameters for different Bayesian-network structures. The property of event equivalence says that two Bayesian-network structures that represent the same set of independence assertions should correspond to the same event and thus receive the same score.
We show that, when combined, these assumptions allow the construction of reasonable prior distributions for multivariate normal parameters from a single prior Bayesian network specified by a user.

Our identification of event equivalence arises from a subtle distinction between two types of Bayesian networks. The first type, called belief networks, represents only assertions of conditional independence and dependence. The second type, called causal networks, represents assertions of cause and effect as well as assertions of independence and dependence. In this paper, we argue that metrics for belief networks should satisfy event equivalence, whereas metrics for causal networks need not. Our score-equivalent metrics for belief networks are similar to the metrics described by Dawid and Lauritzen (1993), except that our metrics score directed networks, whereas their metrics score undirected networks. In this paper, we concentrate on directed models rather than on undirected models, because we believe that users find the former easier to build and interpret. We note that much of the mathematics involved in our derivations is borrowed from DeGroot's book, "Optimal Statistical Decisions" (1970).

2 Gaussian Belief Networks

Throughout this discussion, we consider a domain \vec{x} of n continuous variables x_1, \ldots, x_n. We use \rho(\vec{x} \mid \xi) to denote the joint probability density function (pdf) over \vec{x} of a person with background knowledge \xi. We use p(e \mid \xi) to denote the probability of a discrete event e.

A belief network for \vec{x} represents a joint pdf over \vec{x} by encoding assertions of conditional independence as well as a collection of pdfs. From the chain rule of probability, we know

    \rho(x_1, \ldots, x_n \mid \xi) = \prod_{i=1}^{n} \rho(x_i \mid x_1, \ldots, x_{i-1}, \xi)    (1)

For each variable x_i, let \Pi_i \subseteq \{x_1, \ldots, x_{i-1}\} be a set of variables that renders x_i and \{x_1, \ldots, x_{i-1}\} conditionally independent. That is,

    \rho(x_i \mid x_1, \ldots, x_{i-1}, \xi) = \rho(x_i \mid \Pi_i, \xi)    (2)

A belief network is a pair (B_S, B_P), where B_S is a belief-network structure that encodes the assertions of conditional independence in Equation 2, and B_P is a set of pdfs corresponding to that structure. In particular, B_S is a directed acyclic graph such that (1) each variable in \vec{x} corresponds to a node in B_S, and (2) the parents of the node corresponding to x_i are the nodes corresponding to the variables in \Pi_i. (In the remainder of this paper, we use x_i to refer to both the variable and its corresponding node in a graph.) Associated with node x_i in B_S are the pdfs \rho(x_i \mid \Pi_i, \xi); B_P is the union of these pdfs. Combining Equations 1 and 2, we see that any belief network for \vec{x} uniquely determines a joint pdf for \vec{x}. That is,

    \rho(x_1, \ldots, x_n \mid \xi) = \prod_{i=1}^{n} \rho(x_i \mid \Pi_i, \xi)

A minimal belief network is a belief network in which Equation 2 is violated if any arc is removed. Thus, a minimal belief network represents both assertions of independence and assertions of dependence.

Let us suppose that the joint probability density function for \vec{x} is a multivariate (nonsingular) normal distribution. In this case, we write

    \rho(\vec{x} \mid \xi) = n(\vec{m}, \Sigma^{-1}) \equiv (2\pi)^{-n/2} |\Sigma|^{-1/2} e^{-\frac{1}{2}(\vec{x}-\vec{m})' \Sigma^{-1} (\vec{x}-\vec{m})}

where \vec{m} is an n-dimensional mean vector and \Sigma = (\sigma_{ij}) is an n \times n covariance matrix, both of which are implicitly functions of \xi, and where |\Sigma| is the determinant of \Sigma. We shall often find it convenient to refer to the precision matrix W = \Sigma^{-1}, whose elements are denoted by w_{ij}.

This distribution can be written as a product of conditional distributions, each being an independent normal distribution. Namely,

    \rho(\vec{x} \mid \xi) = \prod_{i=1}^{n} \rho(x_i \mid x_1, \ldots, x_{i-1}, \xi)    (3)

    \rho(x_i \mid x_1, \ldots, x_{i-1}, \xi) = n\left(m_i + \sum_{j=1}^{i-1} b_{ji}(x_j - m_j),\; 1/v_i\right)    (4)

where m_i is the unconditional mean of x_i, v_i is the conditional variance of x_i given values for x_1, \ldots, x_{i-1}, and b_{ji} is a linear coefficient reflecting the strength of the relationship between x_j and x_i (e.g., DeGroot, p. 55).[1] Thus, we may interpret a multivariate normal distribution as a belief network, where b_{ji} = 0 (j < i) implies that x_j is not a parent of x_i. We call this special form of a belief network a Gaussian belief network. The name is adopted from Shachter and Kenley (1989), who first described Gaussian influence diagrams.

More formally, a Gaussian belief network is a pair (B_S, B_P), where (1) B_S is a belief-network structure containing nodes x_1, \ldots, x_n and no arc from x_j to x_i whenever b_{ji} = 0, j < i, (2) B_P is the collection of parameters \vec{m} = (m_1, \ldots, m_n), \vec{v} = (v_1, \ldots, v_n), and B = \{b_{ji} \mid j < i\}, and (3) the joint distribution over \vec{x} is determined by Equations 3 and 4. Due to special properties of nonsingular normal distributions, a minimal Gaussian belief network is one where there is an arc from x_j to x_i if and only if b_{ji} \neq 0.

Given a multivariate normal density, we can generate a Gaussian belief network, and vice versa. The unconditional means \vec{m} are the same in both representations. Shachter and Kenley (1989) describe the general transformation from the \vec{v} and \{b_{ji} \mid j < i\} of a given Gaussian belief network G to the precision matrix W of the normal distribution represented by G. They use the following recursive formula, in which W(i) denotes the i \times i upper-left submatrix of W, \vec{b}_i denotes the column vector (b_{1i}, \ldots, b_{i-1,i}), and \vec{b}'_i denotes its transpose (i.e., the row vector (b_{1i}, \ldots, b_{i-1,i})):

    W(i+1) = \begin{pmatrix} W(i) + \dfrac{\vec{b}_{i+1}\vec{b}'_{i+1}}{v_{i+1}} & -\dfrac{\vec{b}_{i+1}}{v_{i+1}} \\ -\dfrac{\vec{b}'_{i+1}}{v_{i+1}} & \dfrac{1}{v_{i+1}} \end{pmatrix}    (5)

for i > 0, and W(1) = 1/v_1. Equation 5 plays a key role in this paper.

For example, suppose x_1 = n(m_1, 1/v_1), x_2 = n(m_2, 1/v_2), and x_3 = n(m_3 + b_{13}(x_1 - m_1) + b_{23}(x_2 - m_2), 1/v_3). The belief-network structure defined by these equations is shown in Figure 1. The precision matrix is given by

    W = \begin{pmatrix} \dfrac{1}{v_1} + \dfrac{b_{13}^2}{v_3} & \dfrac{b_{13} b_{23}}{v_3} & -\dfrac{b_{13}}{v_3} \\ \dfrac{b_{13} b_{23}}{v_3} & \dfrac{1}{v_2} + \dfrac{b_{23}^2}{v_3} & -\dfrac{b_{23}}{v_3} \\ -\dfrac{b_{13}}{v_3} & -\dfrac{b_{23}}{v_3} & \dfrac{1}{v_3} \end{pmatrix}    (6)

[1] The coefficients b_{ji} can be thought of as regression coefficients or expressed in terms of Yule's (1907) partial regression coefficient \beta.

Figure 1: A belief-network structure for three variables.

Table 1: A complete database for the domain associated with the network shown in Figure 1.

    Case     x_1     x_2     x_3
      1     -0.78   -1.55    0.11
      2      0.18   -3.04   -2.35
      3      1.87    1.04    0.48
      4     -0.42    0.27   -0.68
      5      1.23    1.52    0.31
      6      0.51   -0.22   -0.60
      7      0.44   -0.18    0.13
      8      0.57   -1.82   -2.76
      9      0.64    0.47    0.74
     10      1.05    0.15    0.20
     11      0.43    2.13    0.63
     12      0.16   -0.94   -1.96
     13      1.64    1.25    1.03
     14     -0.52   -2.18   -2.31
     15     -0.37   -1.30   -0.70
     16      1.35    0.87    0.23
     17      1.44   -0.83   -1.61
     18     -0.55   -1.33   -1.67
     19      0.79   -0.62   -2.00
     20      0.53   -0.93   -2.92

The Gaussian-belief-network representation of a multivariate normal distribution is better suited to model elicitation and understanding than is the standard representation [Shachter and Kenley, 1989]. To assess a Gaussian belief network, the user needs to specify (1) the unconditional mean of each variable x_i (m_i), (2) the relative importance of each parent x_j in determining the values of its child x_i (b_{ji}), and (3) a conditional variance for x_i given that its parents are fixed (v_i). Equation 5 then determines W.
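As an illustration of the recursion in Equation 5, the following sketch (our own illustrative code, not part of the paper) assembles the precision matrix W from the conditional variances v_i and coefficients b_{ji}, and can be checked against the three-variable example of Equation 6. The function name and the numeric values below are assumptions chosen for the example.

```python
import numpy as np

def precision_from_network(v, b):
    """Recursion of Equation 5: build the precision matrix W of the
    multivariate normal encoded by a Gaussian belief network.

    v -- conditional variances v_1..v_n (0-indexed here)
    b -- dict mapping (j, i) with j < i to the coefficient b_{ji}
         relating parent x_{j+1} to child x_{i+1}; a missing entry is
         zero, i.e. no arc from x_{j+1} to x_{i+1}.
    """
    n = len(v)
    W = np.array([[1.0 / v[0]]])                         # W(1) = 1/v_1
    for i in range(1, n):
        bi = np.array([b.get((j, i), 0.0) for j in range(i)])  # vector b_{i+1}
        top_left = W + np.outer(bi, bi) / v[i]           # W(i) + b b'/v
        top_right = (-bi / v[i]).reshape(-1, 1)          # -b/v
        bottom = np.concatenate([-bi / v[i], [1.0 / v[i]]]).reshape(1, -1)
        W = np.block([[top_left, top_right], [bottom]])  # Equation 5
    return W
```

For the network of Figure 1 (no arc from x_1 to x_2), the result reproduces the matrix of Equation 6, and inverting W recovers the covariance matrix \Sigma of the joint normal.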
In contrast, when assessing a normal distribution directly, one needs to guarantee that the assessed covariance matrix is positive-definite, a task done by altering, in some ad hoc manner, the correlations stated by the user.

3 A Metric for Gaussian Belief Networks

We are interested in computing a score for a Gaussian belief-network structure, given a set of cases D = \{\vec{x}_1, \ldots, \vec{x}_m\}. Each case \vec{x}_i is the observation of one or more variables in \vec{x}. We sometimes refer to D as a database. Table 1 is an example of a database for the three-node domain of the Gaussian belief network shown in Figure 1. Our scoring metrics are based on five assumptions, the first of which is the following:

Assumption 1 The database D is a random sample from a multivariate normal distribution with unknown means \vec{m} and unknown precision matrix W.

Because every Gaussian belief network is equivalent to a multivariate normal distribution, Assumption 1 is equivalent to stating that the database D is a random sample from a Gaussian belief network with unknown parameters \vec{m}, \vec{v}, and B = \{b_{ji} \mid j < i\}.

A Bayesian measure of the goodness of a network structure is its posterior probability given a database:

    p(B_S \mid D, \xi) = c \, p(B_S \mid \xi) \, \rho(D \mid B_S, \xi)

where c = 1/\rho(D \mid \xi) = 1/\sum_{B_S} p(B_S \mid \xi) \rho(D \mid B_S, \xi) is a normalization constant. For even small domains, however, there are too many network structures to sum over in order to determine the constant. Therefore, we use p(B_S \mid \xi) \rho(D \mid B_S, \xi) = \rho(D, B_S \mid \xi) as our score.

Also problematic is our use of the term B_S as an argument of a probability. In particular, B_S is a belief-network structure, not an event. Thus, we need a definition of an event B^e_S that corresponds to structure B_S (the superscript "e" stands for event).
A natural definition for this event is that B^e_S holds true iff the database is a random sample from a minimal Gaussian belief network with structure B_S; that is, iff for all j < i, b_{ji} \neq 0 if and only if there is an arc from x_j to x_i in B_S. For example, the event B^e_S corresponding to the Gaussian belief network of Figure 1 is the event \{b_{12} = 0, b_{13} \neq 0, b_{23} \neq 0\}.

This definition has the following desirable property. When two belief-network structures represent the same assertions of conditional independence, we say that they are isomorphic. For example, in the three-variable domain \{x_1, x_2, x_3\}, the network structures x_1 \to x_2 \to x_3 and x_1 \leftarrow x_2 \to x_3 represent the same assertion: x_1 and x_3 are independent given x_2. Given the definition of B^e_S, it can be shown that events B^e_{S1} and B^e_{S2} are equivalent if and only if the structures B_{S1} and B_{S2} are isomorphic. That is, the relation of isomorphism induces an equivalence class on the set of events B^e_S. We call this property event equivalence.

There is a problem with the definition, however. In particular, events corresponding to some non-isomorphic network structures are not mutually exclusive. For example, in the four-variable domain \{x_1, x_2, x_3, x_4\}, consider the structures x_1 \Rightarrow B \Leftarrow x_4 and x_1 \Rightarrow B \Rightarrow x_4, where B is the subnetwork structure x_2 \to x_3, and x \Rightarrow B means that there is an arc from x to both variables in B. The events corresponding to these structures both include the situation where x_1 and x_4 are marginally independent. Arbitrary overlaps between events can make scores difficult to interpret and use. For example, the prediction of future events by averaging over multiple models cannot be justified.
In our case, however, we can repair the definition of B^e_S so as to make non-equivalent events mutually exclusive, without affecting our mathematical results or the intuitive understanding of events by the user. In particular, all overlaps will be of measure zero with respect to the events that create the overlap. Thus, given a set of overlapping events, we simply exclude the intersection from all but one of the events. We note that this revised definition retains the property of event equivalence.

Proposition 1 (Event Equivalence) Belief-network structures B_{S1} and B_{S2} are isomorphic if and only if B^e_{S1} = B^e_{S2}.

Because the score for network structure B_S is \rho(D, B^e_S \mid \xi), an immediate consequence of the property of event equivalence is score equivalence.

Proposition 2 (Score Equivalence) The scores of two isomorphic belief-network structures must be equal.

Given the property of event equivalence, we technically should score each belief-network-structure equivalence class, rather than each belief-network structure. Nonetheless, users find it intuitive to work with (i.e., construct and interpret) belief networks. Consequently, we continue our presentation in terms of belief networks, keeping Proposition 2 in mind.

3.1 Complete Gaussian Belief Networks

We first derive \rho(D, B^e_S \mid \xi), assuming B_S is the structure of a complete Gaussian belief network. A complete Gaussian belief network is one with no missing edges. Applying the property of event equivalence, we know that the event associated with any complete belief network is the same; we use B^e_{SC} to denote this event.

To motivate the derivation, consider the following expansion of \rho(D \mid B^e_{SC}, \xi):

    \rho(D \mid B^e_{SC}, \xi) = \prod_{l=1}^{m} \rho(C_l \mid C_1, \ldots, C_{l-1}, B^e_{SC}, \xi)
                               = \prod_{l=1}^{m} \int \rho(C_l \mid \vec{m}, W, B^e_{SC}, \xi) \, \rho(\vec{m}, W \mid C_1, \ldots, C_{l-1}, B^e_{SC}, \xi) \, d\vec{m}\,dW

Thus, we can derive the metric if we find a conjugate distribution for the parameters \vec{m} and W such that the integral above has a closed-form solution. The next assumption leads to such a conjugate distribution. If all variables in a case are observed, we say that the case is complete. If all cases in a database are complete, we say that the database is complete.

Assumption 2 All databases are complete.[2]

Given this assumption, the following distribution is conjugate for multivariate-normal sampling.

Theorem 3 (DeGroot, p. 178) Suppose that \vec{x}_1, \ldots, \vec{x}_l is a random sample from a multivariate normal distribution with an unknown value of the mean vector \vec{m} and an unknown value of the precision matrix W. Suppose that the prior joint distribution of \vec{m} and W is the normal-Wishart distribution: the conditional distribution of \vec{m} given W is n(\vec{\mu}_0, \nu W) with \nu > 0, and the marginal distribution of W is a Wishart distribution with \alpha > n - 1 degrees of freedom and precision matrix T_0, denoted by w(\alpha, T_0). Then the posterior joint distribution of \vec{m} and W given \vec{x}_i, i = 1, \ldots, l, is as follows: The conditional distribution of \vec{m} given W is a multivariate normal distribution with mean vector \vec{\mu}_l and precision matrix (\nu + l)W, where

    \bar{X}_l = \frac{1}{l} \sum_{i=1}^{l} \vec{x}_i, \qquad \vec{\mu}_l = \frac{\nu \vec{\mu}_0 + l \bar{X}_l}{\nu + l}    (7)

and the marginal distribution of W is w(\alpha + l, T_l), where S_l and T_l are given by

    S_l = \sum_{i=1}^{l} (\vec{x}_i - \bar{X}_l)(\vec{x}_i - \bar{X}_l)'    (8)

    T_l = T_0 + S_l + \frac{\nu l}{\nu + l} (\vec{\mu}_0 - \bar{X}_l)(\vec{\mu}_0 - \bar{X}_l)'    (9)

In this theorem, \bar{X}_l and S_l are the sample mean and scatter matrix of the database, respectively.
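The posterior updates of Theorem 3 (Equations 7 through 9) are simple matrix arithmetic. The sketch below (illustrative code; the function name is ours) computes the updated hyperparameters from a batch of cases. Because the update is an exact conjugate posterior, processing the data in two batches must agree with a single batch update, which the test exercises on the first four cases of Table 1 under an assumed prior.

```python
import numpy as np

def normal_wishart_update(mu0, nu, alpha, T0, X):
    """Equations 7-9: posterior normal-Wishart parameters after
    observing the l cases stored as the rows of X."""
    l = X.shape[0]
    xbar = X.mean(axis=0)
    mu_l = (nu * mu0 + l * xbar) / (nu + l)            # Equation 7
    centered = X - xbar
    S_l = centered.T @ centered                        # scatter matrix, Equation 8
    d = (mu0 - xbar).reshape(-1, 1)
    T_l = T0 + S_l + (nu * l / (nu + l)) * (d @ d.T)   # Equation 9
    return mu_l, nu + l, alpha + l, T_l
```

The returned tuple can be fed back in as the prior for the next batch of cases, which is exactly how the predictive product in the expansion above is evaluated.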
Also, an n-dimensional Wishart distribution with \alpha degrees of freedom and precision matrix T_0 is given by

    \rho(W \mid \xi) = w(\alpha, T_0) \equiv c(n, \alpha) \, |T_0|^{\alpha/2} \, |W|^{(\alpha - n - 1)/2} \, e^{-\frac{1}{2}\mathrm{tr}\{T_0 W\}}    (10)

where \mathrm{tr}\{T_0 W\} is the sum of the diagonal elements of T_0 W, and

    c(n, \alpha) = \left[ 2^{\alpha n/2} \, \pi^{n(n-1)/4} \, \prod_{i=1}^{n} \Gamma\!\left(\frac{\alpha + 1 - i}{2}\right) \right]^{-1}

The parameters \nu, \alpha, \vec{\mu}_0, and T_0 are implicit functions of the user's background knowledge \xi. The quantities \nu and \alpha can be thought of as the effective sample sizes of the normal and Wishart components of the prior, respectively. Summarizing our discussion so far, we make the following assumption:

[2] SDLC present a survey of approximation methods for handling missing data in the context of discrete variables. Some of these methods, in modified form, can be applied to Gaussian networks.

Assumption 3 The prior distribution \rho(\vec{m}, W \mid B^e_{SC}, \xi) is a normal-Wishart distribution as given in Theorem 3.

From Equation 5, this assumption fixes the distribution \rho(\vec{m}, \vec{v}, B \mid B^e_{SC}, \xi). Nonetheless, we shall sometimes find it easier to specify the prior density in the space of W, rather than in the space of parameters describing a Gaussian belief network.

If \rho(\vec{x} \mid \vec{m}, W, B^e_{SC}, \xi) = n(\vec{m}, W) and \rho(\vec{m}, W \mid B^e_{SC}, \xi) is a normal-Wishart distribution as specified by Theorem 3, then \rho(\vec{x} \mid B^e_{SC}, \xi), defined by

    \rho(\vec{x} \mid B^e_{SC}, \xi) = \int \rho(\vec{x} \mid \vec{m}, W, B^e_{SC}, \xi) \, \rho(\vec{m}, W \mid B^e_{SC}, \xi) \, d\vec{m}\,dW

is an n-dimensional multivariate t distribution with \gamma = \alpha - n + 1 degrees of freedom, location vector \vec{\mu}_0, and precision matrix T'_0 = \frac{\nu\gamma}{\nu+1} T_0^{-1}. This result can be derived by first integrating over \vec{m} using Equation 6 on p. 178 of DeGroot with sample size equal to one, and then integrating over W following an approach similar to that on pp. 179-180 of DeGroot.
Also, using Equation 3 on p. 180 of DeGroot, the t distribution \rho(\vec{x} \mid B^e_{SC}, \xi) can be written in a less traditional form as follows:

    \rho(\vec{x} \mid B^e_{SC}, \xi) = (2\pi)^{-n/2} \left(\frac{\nu}{\nu+1}\right)^{n/2} \frac{c(n, \alpha)}{c(n, \alpha+1)} \, |T_0|^{\alpha/2} \, |T_1|^{-(\alpha+1)/2}    (11)

where T_1 is defined by Equation 9 with l = 1.

Combining these facts with Theorem 3, we know that \rho(C_l \mid C_1, \ldots, C_{l-1}, B^e_{SC}, \xi) is a multivariate t distribution with parameters \nu + l - 1, \alpha + l - 1, \vec{\mu}_{l-1}, and T_{l-1}. Consequently, we obtain

    \rho(D \mid B^e_{SC}, \xi) = \prod_{l=1}^{m} \rho(C_l \mid C_1, \ldots, C_{l-1}, B^e_{SC}, \xi)
                               = \prod_{l=1}^{m} (2\pi)^{-n/2} \left(\frac{\nu + l - 1}{\nu + l}\right)^{n/2} \frac{c(n, \alpha + l - 1)}{c(n, \alpha + l)} \, \frac{|T_{l-1}|^{(\alpha + l - 1)/2}}{|T_l|^{(\alpha + l)/2}}
                               = (2\pi)^{-nm/2} \left(\frac{\nu}{\nu + m}\right)^{n/2} \frac{c(n, \alpha)}{c(n, \alpha + m)} \, |T_0|^{\alpha/2} \, |T_m|^{-(\alpha + m)/2}    (12)

Multiplying Equation 12 by the prior probability p(B^e_{SC} \mid \xi) yields a metric for scoring B^e_{SC}.

3.2 General Gaussian Belief Networks

We now consider an arbitrary Gaussian belief network B_S. To form a prior distribution for the parameters of B_S, we make two additional assumptions:

Assumption 4 (Parameter Independence) For every Gaussian belief network B_S, \rho(\vec{v}, B \mid B^e_S, \xi) = \prod_{i=1}^{n} \rho(v_i, \vec{b}_i \mid B^e_S, \xi).

We note that this assumption is consistent with Assumption 3, because if \rho(W \mid B^e_{SC}, \xi) is a Wishart distribution, then \rho(\vec{v}, B \mid B^e_{SC}, \xi), obtained from \rho(W \mid B^e_{SC}, \xi) by using Equation 5 and the Jacobian \partial W/\partial(\vec{v}, B) of this transformation, is equal to \prod_{i=1}^{n} \rho(v_i, \vec{b}_i \mid B^e_{SC}, \xi). The derivation of this claim is given in the Appendix (Theorem 7).

Assumption 5 (Parameter Modularity) If x_i has the same parents in two Gaussian belief networks B_{S1} and B_{S2}, then \rho(v_i, \vec{b}_i \mid B^e_{S1}, \xi) = \rho(v_i, \vec{b}_i \mid B^e_{S2}, \xi).

Assumption 4 has been made in discrete contexts by many researchers (e.g., CH, Buntine, SDLC, and HGC).
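The closed form of Equation 12 is easy to evaluate in log space, with the Wishart normalization c(n, \alpha) computed from log-gamma terms. The sketch below (our own illustrative code, not from the paper; names and the prior values in the usage are assumptions) scores a complete structure, and its internal consistency can be checked via the chain rule: the marginal of the full database must equal the marginal of an initial batch times the marginal of the remainder computed under the posterior of Theorem 3.

```python
import math
import numpy as np

def log_c(n, alpha):
    # log of the Wishart normalization constant c(n, alpha) of Equation 10
    return -(alpha * n / 2 * math.log(2)
             + n * (n - 1) / 4 * math.log(math.pi)
             + sum(math.lgamma((alpha + 1 - i) / 2) for i in range(1, n + 1)))

def log_marginal_complete(D, mu0, nu, alpha, T0):
    """log rho(D | B^e_SC, xi) for a complete structure, via Equation 12."""
    m, n = D.shape
    xbar = D.mean(axis=0)
    S = (D - xbar).T @ (D - xbar)                       # Equation 8
    d = (mu0 - xbar).reshape(-1, 1)
    Tm = T0 + S + (nu * m / (nu + m)) * (d @ d.T)       # Equation 9 with l = m
    return (-n * m / 2 * math.log(2 * math.pi)
            + n / 2 * (math.log(nu) - math.log(nu + m))
            + log_c(n, alpha) - log_c(n, alpha + m)
            + alpha / 2 * np.linalg.slogdet(T0)[1]
            - (alpha + m) / 2 * np.linalg.slogdet(Tm)[1])
```

Working with log-determinants (`slogdet`) and log-gamma keeps the computation stable even for large databases, where the determinant ratio in Equation 12 would otherwise underflow.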
Assumption 5 has also been made by these same researchers, but HGC were the first to make the assumption explicit and to emphasize its importance for generating prior distributions. Parameter modularity plays a similarly important role in the current development. In particular, this assumption, in conjunction with the property of event equivalence and our previous assumptions, allows us to determine the joint prior distribution of the parameters \vec{m}, \vec{v}, B associated with any Gaussian network B_S from the joint density \rho(\vec{m}, W \mid B^e_{SC}, \xi).

To see this fact, first note that, by the definition of the event B^e_S, \rho(\vec{m} \mid \vec{v}, B, B^e_S, \xi) = \rho(\vec{m} \mid \vec{v}, B, B^e_{SC}, \xi). The latter distribution is determined by \rho(\vec{m} \mid W, B^e_{SC}, \xi), which is given. Second, from Assumption 4, we obtain \rho(\vec{v}, B \mid B^e_S, \xi) by determining \rho(v_i, \vec{b}_i \mid B^e_S, \xi) for each i. By Assumption 5, however, \rho(v_i, \vec{b}_i \mid B^e_S, \xi) is equal to \rho(v_i, \vec{b}_i \mid B^e_{S'C}, \xi) for any complete network structure B_{S'C} in which the parents of x_i are the same as those in B_S. By event equivalence and Assumption 4, we obtain \rho(v_i, \vec{b}_i \mid B^e_{S'C}, \xi) from the given density \rho(W \mid B^e_{SC}, \xi).

From Assumptions 1 through 5, we derive \rho(D \mid B^e_S, \xi). To do so, we need the following theorem, whose proof is provided in the Appendix. [Note: a derivation from weaker assumptions is given in D. Geiger and D. Heckerman, Parameter Priors for Directed Acyclic Graphical Models and the Characterization of Several Probability Distributions, The Annals of Statistics, 30: 1412-1440, Oct 2002.]

Theorem 4 If \rho(\vec{x} \mid \vec{m}, W, D, \xi) is a multivariate normal distribution, and \rho(\vec{m} \mid W, D, B^e_S, \xi) is a multivariate normal distribution with precision matrix \nu W, \nu > 0, then

    \rho(x_i \mid x_1, \ldots, x_{i-1}, \vec{v}, B, D, B^e_S, \xi) = \rho(x_i \mid \Pi_i, v_i, \vec{b}_i, D^{x_i \Pi_i}, B^e_{S'}, \xi)

where B_{S'} is any network in which x_i has the same parents as in B_S, and D^{x_i \Pi_i} is the database D restricted to the variables in \{x_i\} \cup \Pi_i. In particular, this claim holds for any complete Gaussian belief network B_{SC} = B_{S'} in which \Pi_i and x_i appear before any other variables, and \Pi_i appears before x_i.

Let D_l = \{C_1, \ldots, C_{l-1}\} and let C_l be an instance of x_1, \ldots, x_n. In the following derivation, we use x_i and \Pi_i to represent the instances of x_i and \Pi_i in the l-th case. Theorem 4 yields

    \rho(D \mid \vec{v}, B, B^e_S, \xi) = \prod_{l=1}^{m} \prod_{i=1}^{n} \rho(x_i \mid x_1, \ldots, x_{i-1}, \vec{v}, B, D_l, B^e_S, \xi)
                                        = \prod_{l=1}^{m} \prod_{i=1}^{n} \frac{\rho(x_i, \Pi_i \mid v_i, \vec{b}_i, D^{x_i \Pi_i}_l, B^e_S, \xi)}{\rho(\Pi_i \mid v_i, \vec{b}_i, D^{x_i \Pi_i}_l, B^e_S, \xi)}

and

    \rho(\Pi_i \mid v_i, \vec{b}_i, D^{x_i \Pi_i}_l, B^e_S, \xi) = \rho(\Pi_i \mid v_i, \vec{b}_i, D^{\Pi_i}_l, B^e_S, \xi)

By combining these equations, we obtain the following likelihood separability property:

    \rho(D \mid \vec{v}, B, B^e_S, \xi) = \prod_{i=1}^{n} \frac{\rho(D^{x_i \Pi_i} \mid v_i, \vec{b}_i, B^e_S, \xi)}{\rho(D^{\Pi_i} \mid v_i, \vec{b}_i, B^e_S, \xi)}    (13)

By Bayes' rule, \rho(\vec{v}, B \mid D, B^e_S, \xi) is proportional to \rho(D \mid \vec{v}, B, B^e_S, \xi) \, \rho(\vec{v}, B \mid B^e_S, \xi).
Thus, because \rho(D \mid \vec{v}, B, B^e_S, \xi) factors as shown by Equation 13, and \rho(\vec{v}, B \mid B^e_S, \xi) factors as given by Assumption 4, we obtain the following posterior parameter independence property:

    \rho(\vec{v}, B \mid D, B^e_S, \xi) = \prod_{i=1}^{n} \rho(v_i, \vec{b}_i \mid D^{x_i \Pi_i}, B^e_S, \xi)

In a similar manner, whenever x_i has the same parents in two Gaussian belief networks B_S and B_{S'}, by using Equation 13 with B^e_S on the right-hand side replaced by B^e_{S'} and using Assumption 5, we obtain the posterior parameter modularity property:

    \rho(v_i, \vec{b}_i \mid D^{x_i \Pi_i}, B^e_S, \xi) = \rho(v_i, \vec{b}_i \mid D^{x_i \Pi_i}, B^e_{S'}, \xi)

Now, we have

    \rho(D \mid B^e_S, \xi) = \prod_{l=1}^{m} \rho(C_l \mid D_l, B^e_S, \xi)    (14)

    \rho(C_l \mid D_l, B^e_S, \xi) = \prod_{i=1}^{n} \rho(x_i \mid x_1, \ldots, x_{i-1}, D_l, B^e_S, \xi)

    \rho(x_i \mid x_1, \ldots, x_{i-1}, D_l, B^e_S, \xi) = \int \rho(x_i \mid x_1, \ldots, x_{i-1}, D_l, \vec{v}, B, B^e_S, \xi) \, \rho(\vec{v}, B \mid D_l, B^e_S, \xi) \, d\vec{v}\,dB    (15)

By applying Theorem 4 to the first term of the right-hand side of Equation 15, and posterior parameter independence and posterior parameter modularity to the second term, we obtain

    \rho(x_i \mid x_1, \ldots, x_{i-1}, D_l, B^e_S, \xi) = \int \rho(x_i \mid \Pi_i, v_i, \vec{b}_i, D^{x_i \Pi_i}_l, B^e_{SC}, \xi) \, \rho(v_i, \vec{b}_i \mid D^{x_i \Pi_i}_l, B^e_{SC}, \xi) \, dv_i\,d\vec{b}_i = \rho(x_i \mid \Pi_i, D^{x_i \Pi_i}_l, B^e_{SC}, \xi)

Therefore,

    \rho(C_l \mid D_l, B^e_S, \xi) = \prod_{i=1}^{n} \frac{\rho(x_i, \Pi_i \mid D^{x_i \Pi_i}_l, B^e_{SC}, \xi)}{\rho(\Pi_i \mid D^{x_i \Pi_i}_l, B^e_{SC}, \xi)}    (16)

Furthermore, because \rho(\Pi_i \mid D^{x_i \Pi_i}_l, B^e_{SC}, \xi) is a multivariate t distribution, we know that \rho(\Pi_i \mid D^{x_i \Pi_i}_l, B^e_{SC}, \xi) = \rho(\Pi_i \mid D^{\Pi_i}_l, B^e_{SC}, \xi) (DeGroot, p. 60). Thus, combining Equations 14 and 16, we have

    \rho(D \mid B^e_S, \xi) = \prod_{i=1}^{n} \frac{\rho(D^{x_i \Pi_i} \mid B^e_{SC}, \xi)}{\rho(D^{\Pi_i} \mid B^e_{SC}, \xi)}    (17)

where each term in Equation 17 is of the form given in Equation 12.
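The combination rule of Equation 17 can be sketched as follows (illustrative code, not from the paper). One caveat, flagged in the text: deriving the prior parameters for subsets of the domain variables from the full-domain prior is a subtle step, and the naive restriction used here (selecting the corresponding entries of \mu_0 and the corresponding submatrix of T_0, with \alpha unchanged) is only a stand-in. The sketch therefore illustrates the structure of Equation 17 and the cancellation that underlies score equivalence, which holds for any per-subset marginal, rather than a faithful BGe implementation.

```python
import math
import numpy as np

def log_c(n, alpha):
    # log of the Wishart normalization constant c(n, alpha) of Equation 10
    return -(alpha * n / 2 * math.log(2)
             + n * (n - 1) / 4 * math.log(math.pi)
             + sum(math.lgamma((alpha + 1 - i) / 2) for i in range(1, n + 1)))

def log_marginal(D, mu0, nu, alpha, T0, subset):
    """log rho(D^Y | B^e_SC, xi) for a subset Y of variable indices, via
    Equation 12.  NOTE: restricting mu0 and T0 by plain row/column
    selection is a naive stand-in for the subset-prior recipe."""
    if not subset:
        return 0.0
    idx = sorted(subset)
    Dy, mu, T = D[:, idx], mu0[idx], T0[np.ix_(idx, idx)]
    m, n = Dy.shape
    xbar = Dy.mean(axis=0)
    d = (mu - xbar).reshape(-1, 1)
    Tm = T + (Dy - xbar).T @ (Dy - xbar) + (nu * m / (nu + m)) * (d @ d.T)
    return (-n * m / 2 * math.log(2 * math.pi)
            + n / 2 * (math.log(nu) - math.log(nu + m))
            + log_c(n, alpha) - log_c(n, alpha + m)
            + alpha / 2 * np.linalg.slogdet(T)[1]
            - (alpha + m) / 2 * np.linalg.slogdet(Tm)[1])

def bge_log_score(D, parents, mu0, nu, alpha, T0):
    """Equation 17: sum over nodes of the log family/parent marginal ratio.
    parents maps each node index to the list of its parent indices."""
    return sum(log_marginal(D, mu0, nu, alpha, T0, list(pa) + [i])
               - log_marginal(D, mu0, nu, alpha, T0, list(pa))
               for i, pa in parents.items())
```

Because each node contributes a ratio of subset marginals, the terms for isomorphic structures cancel exactly as in the proof of Theorem 5, and for a complete structure the product telescopes to the full marginal of Equation 12.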
Multiplying Equation 17 by p(B^e_S \mid \xi), we obtain a metric for an arbitrary Gaussian belief network B_S. (This development is incomplete, as it requires a recipe for deriving the parameters of the prior for subsets of the domain variables from the prior for all domain variables. The recipe implicit in an example given in the original version (deleted in this version) is incorrect. For a correction, see the 2021 update of D. Geiger and D. Heckerman, Parameter Priors for Directed Acyclic Graphical Models and the Characterization of Several Probability Distributions, The Annals of Statistics, 30: 1412-1440, Oct 2002.) We call this metric BGe, which stands for Bayesian metric for Gaussian networks having score equivalence.

3.3 Score Equivalence

In making the assumptions of parameter independence and parameter modularity, we have, in effect, specified the prior densities for the multivariate normal parameters in terms of the structure of a belief network. Consequently, there is the possibility that this specification violates the property of score equivalence. The following theorem, however, demonstrates that our specification implies score equivalence.

Theorem 5 (Score Equivalence) If B_{S1} and B_{S2} are isomorphic belief-network structures, then \rho(D \mid B^e_{S1}, \xi) and \rho(D \mid B^e_{S2}, \xi) as computed by Equation 17 are equal.

Proof: In Heckerman et al. (1994, Theorem 10), we show that a belief-network structure can be transformed into an isomorphic structure by a series of arc reversals, such that, whenever an arc from x_i to x_j is reversed, \Pi_i = \Pi_j \setminus \{x_i\}. Thus, our claim follows if we can prove it for the case where B_{S1} and B_{S2} differ by a single arc reversal with this restriction. So, let B_{S1} and B_{S2} be two isomorphic network structures that differ only in the direction of the arc between x_i and x_j (say x_i \to x_j in B_{S1}). Let R be the set of parents of x_i in B_{S1}.
By the cited theorem, R \cup \{x_i\} is the set of parents of x_j in B_{S1}, R is the set of parents of x_j in B_{S2}, and R \cup \{x_j\} is the set of parents of x_i in B_{S2}. Because the two structures differ only in the reversal of a single arc, the only terms in the product of Equation 17 that can differ are those involving x_i and x_j. For B_{S1}, these terms are

    \frac{\rho(D^{x_i R} \mid B^e_{SC}, \xi)}{\rho(D^{R} \mid B^e_{SC}, \xi)} \cdot \frac{\rho(D^{x_i x_j R} \mid B^e_{SC}, \xi)}{\rho(D^{x_i R} \mid B^e_{SC}, \xi)} = \frac{\rho(D^{x_i x_j R} \mid B^e_{SC}, \xi)}{\rho(D^{R} \mid B^e_{SC}, \xi)}

whereas for B_{S2}, they are

    \frac{\rho(D^{x_j R} \mid B^e_{SC}, \xi)}{\rho(D^{R} \mid B^e_{SC}, \xi)} \cdot \frac{\rho(D^{x_i x_j R} \mid B^e_{SC}, \xi)}{\rho(D^{x_j R} \mid B^e_{SC}, \xi)} = \frac{\rho(D^{x_i x_j R} \mid B^e_{SC}, \xi)}{\rho(D^{R} \mid B^e_{SC}, \xi)}

Thus, \rho(D \mid B^e_{S1}, \xi) = \rho(D \mid B^e_{S2}, \xi).

3.4 Encoding Prior Knowledge: The Prior Gaussian Belief Network

From the previous discussion, we see that there are three components of a user's prior knowledge that are relevant to learning Gaussian networks: (1) the prior probabilities p(B^e_S \mid \xi), (2) the effective sample sizes \alpha and \nu, and (3) the parameters \vec{\mu}_0 and T_0. The assessment of the prior probabilities p(B^e_S \mid \xi) is straightforward; Buntine and HGC, for example, describe methods that facilitate these assessments. In addition, a user can assess the effective sample sizes directly. In this section, we concentrate on the assessment of \vec{\mu}_0 and T_0.

Using (1) our previous observation that \rho(\vec{x} \mid B^e_{SC}, \xi) is a multivariate t distribution, and (2) Equation 11 on p. 61 of DeGroot with \alpha > n + 1, we obtain

    E(\vec{x} \mid B^e_{SC}, \xi) = \vec{\mu}_0, \qquad \mathrm{Cov}(\vec{x} \mid B^e_{SC}, \xi) = \frac{\nu + 1}{\nu} \cdot \frac{1}{\alpha - n - 1} \, T_0    (18)

Thus, a person can assess a Gaussian belief network for E(\vec{x} \mid B^e_{SC}, \xi) and \mathrm{Cov}(\vec{x} \mid B^e_{SC}, \xi), and then compute \vec{\mu}_0 and T_0 using Equations 18. We call this belief network a prior belief network.
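Inverting Equation 18 to recover \vec{\mu}_0 and T_0 from an assessed prior network is a one-line computation. The sketch below is illustrative (the function name and the numeric values in the usage are our own assumptions); it takes the mean and covariance assessed from a prior Gaussian belief network and returns the normal-Wishart parameters.

```python
import numpy as np

def prior_parameters(mean, cov, nu, alpha):
    """Equation 18 inverted: given the mean and covariance assessed from a
    prior Gaussian belief network, return mu_0 and T_0.  Equation 18
    requires alpha > n + 1 for the covariance of the t distribution
    to exist."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    n = mean.shape[0]
    if alpha <= n + 1:
        raise ValueError("Equation 18 requires alpha > n + 1")
    T0 = (nu / (nu + 1.0)) * (alpha - n - 1.0) * cov
    return mean, T0
```

In practice, the covariance fed to this routine would itself be computed from an assessed prior network via Equation 5 (invert the resulting W to obtain the covariance), so the user never has to state a positive-definite matrix directly.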
4 Metrics for Gaussian Causal Networks

People often have knowledge about the causal relationships among variables in addition to knowledge about conditional independence. Such causal knowledge is stronger than conditional-independence knowledge, because it allows us to derive beliefs about a domain after we intervene. Causal networks, described—for example—by Spirtes et al. (1993), Pearl and Verma (1991), and Heckerman and Shachter (1994), represent such causal relationships among variables. In particular, a causal network for U is a belief network for U, wherein it is asserted that each nonroot node x is caused by its parents. The precise meaning of cause and effect is not important for our discussion. The interested reader should consult the previous references.

The event C^e_S is the same as that for a belief-network structure, except that we also include in the event the assertion that each nonroot node is caused by its parents. Thus, in contrast to the case for belief networks, it is not appropriate to require the properties of event equivalence or score equivalence. For example, consider a domain containing two variables x and y. Both the causal network C_S1, where x points to y, and the causal network C_S2, where y points to x, represent the assertion that x and y are dependent. The network C_S1, however, in addition represents the assertion that x causes y, whereas the network C_S2 represents the assertion that y causes x. Thus, the events C^e_S1 and C^e_S2 are not equal. Indeed, it is reasonable to assume that these events—and the events associated with any two different causal-network structures—are mutually exclusive. In principle, then, a user may assign a (possibly different) prior distribution to the parameters \vec{m}, \vec{v}, and B for every complete Gaussian causal network, constrained only by the assumption of parameter modularity.
The prior distributions for parameters of incomplete networks would then be determined by parameter modularity. We call this general metric BG, as it is a superset of the BGe metric. For practical reasons, however, the assessment process should be constrained. One alternative is to use the BGe metric. A more general alternative is to continue to use the prior network to compute \vec{µ}_0 and T_0, but to allow the effective sample size to vary for different variables and different parent sets of each variable. We call this metric the BGp metric, where "p" stands for prior network.

5 Summary and Future Work

We have described metrics for learning belief networks and causal networks from a combination of user knowledge and statistical data for domains containing only continuous variables. An important contribution has been our elucidation of the property of event equivalence and the assumption of parameter modularity. We have shown that these properties, when combined, allow a statistician to compute a reasonable prior distribution for the parameters of any Gaussian belief network, given a single prior Gaussian belief network provided by a user.

A legitimate concern with our approach is that the multivariate normal model is too restrictive. In practice, when this model is inappropriate, statisticians will typically turn to a more general model in which each continuous variable, conditioned on its parents, is assumed to be a mixture of multivariate normal distributions. In Geiger and Heckerman (1994), we derive metrics for domains containing both discrete and continuous variables, subject to the restriction that a domain can be decomposed into disjoint sets of continuous variables, where each such set is conditioned by a set of discrete variables. We note that this work, when combined with approximation methods that handle missing data, provides a method for learning with multivariate normal mixtures.
In the discrete case, a complete network has one parameter for each instance of \vec{x}. Consequently, it is easy to overfit such a structure with data, and the metrics developed for discrete domains provide a means by which we can avoid such overfitting. In the continuous case, a complete network has only n + n(n − 1)/2 parameters. Thus, it is possible that the errors introduced by our methods—arising from heuristic search in an exponential space to find one or a handful of structures with high scores—outweigh the benefits associated with decreasing the degree of overfitting. We leave this concern for future experimentation.

Acknowledgments

We thank Wray Buntine and anonymous reviewers for useful suggestions.

References

[Cooper and Herskovits, 1991] Cooper, G. and Herskovits, E. (January, 1991). Technical Report SMI-91-1, Section of Medical Informatics, University of Pittsburgh.

[Cooper and Herskovits, 1992] Cooper, G. and Herskovits, E. (1992). Machine Learning, 9:309–347.

[Dawid and Lauritzen, 1993] Dawid, A. and Lauritzen, S. (1993). Annals of Statistics, 21:1272–1317.

[DeGroot, 1970] DeGroot, M. (1970). McGraw-Hill, New York.

[Geiger and Heckerman, 1994] Geiger, D. and Heckerman, D. (March, 1994). Technical Report MSR-TR-94-10, Microsoft.

[Heckerman et al., 1994] Heckerman, D., Geiger, D., and Chickering, D. (1994). In this proceedings.

[Heckerman and Shachter, 1994] Heckerman, D. and Shachter, R. (1994). In this proceedings.

[Pearl and Verma, 1991] Pearl, J. and Verma, T. (1991). In Allen, J., Fikes, R., and Sandewall, E., editors, Knowledge Representation and Reasoning: Proceedings of the Second International Conference, pages 441–452. Morgan Kaufmann, New York.

[Shachter and Kenley, 1989] Shachter, R. and Kenley, C. (1989). Management Science, 35:527–550.

[Spiegelhalter et al., 1993] Spiegelhalter, D., Dawid, A., Lauritzen, S., and Cowell, R. (1993).
Statistical Science, 8:219–282.

[Spirtes et al., 1993] Spirtes, P., Glymour, C., and Scheines, R. (1993). Springer-Verlag, New York.

[Yule, 1907] Yule, G. (1907). Proceedings of the Royal Society of London, Series A, 79:182–193.

Appendix

Theorem 6 The Jacobian J for the change of variables from W to {\vec{v}, B} is given by

J = \left| \frac{\partial W}{\partial \{\vec{v}, B\}} \right| = \prod_{i=1}^{n} v_i^{-(i+1)}     (19)

Proof: Let J(i) denote the Jacobian for the first i variables in W. Then J(i) has the following matrix form:

J(i) = \begin{pmatrix} J(i-1) & 0 & 0 \\ 0 & -\frac{1}{v_i} I_{i-1,i-1} & 0 \\ 0 & 0 & -\frac{1}{v_i^2} \end{pmatrix}     (20)

where I_{k,k} is the identity matrix of size k × k. Thus, the absolute value of J(i) is given by

|J(i)| = \frac{1}{v_i^{i+1}} \cdot |J(i-1)|     (21)

which gives Equation 19.

Theorem 7 If ρ(W | ξ) has an n-dimensional Wishart distribution, then

ρ(\vec{v}, B | ξ) = \prod_{i=1}^{n} ρ(v_i, \vec{b}_i | ξ)

Proof: By assumption, we have

ρ(W | ξ) = c \, |W|^{(α - n - 1)/2} \, e^{-\frac{1}{2} \mathrm{tr}\{T_0 W\}}     (22)

Thus, we must express Equation 22 in terms of {\vec{v}, B}, multiply by the Jacobian given by Theorem 6, and show that the resulting function factors as a function of i. From Equation 5, we get

|W(i)| = \frac{1}{v_i} |W(i-1)|, \quad \text{so that} \quad |W| = \prod_{i=1}^{n} v_i^{-1}

so that the determinant in Equation 22 factors as a function of i. Also, Equation 5 implies (by induction) that each element w_{ij} in W is a sum of terms, each being a function of \vec{b}_i and v_i. Consequently, the exponent in Equation 22 factors as a function of i.

Theorem 4 If ρ(\vec{x} | \vec{m}, W, D, B^e_S, ξ) is a multivariate normal distribution, and ρ(\vec{m} | W, D, B^e_S, ξ) is a multivariate normal distribution with precision matrix νW, ν > 0, then

ρ(x_i | x_1, …, x_{i-1}, \vec{v}, B, D, B^e_S, ξ) = ρ(x_i | Π_i, v_i, \vec{b}_i, D_{x_i Π_i}, B^e_{S0}, ξ)

where B_{S0} is any network where x_i has the same parents as in B_S, and D_{x_i Π_i} is the database D restricted to the variables in {x_i} ∪ Π_i.
Proof: Using

ρ(\vec{x} | W, D, B^e_S, ξ) = \int ρ(\vec{x} | \vec{m}, W, D, B^e_S, ξ) \, ρ(\vec{m} | W, D, B^e_S, ξ) \, d\vec{m}

and Assumptions 1 and 3, we obtain

ρ(\vec{x} | W, D, B^e_S, ξ) = c \, |W|^{1/2} \cdot e^{-\frac{1}{2} \frac{ν}{ν+1} \sum_{i,j=1}^{n} (x_i - µ_{Di})(x_j - µ_{Dj}) w_{ij}}     (23)

where \vec{µ}_D is the posterior mean after seeing D, given by Equation 7 of Theorem 3.

The marginal distribution ρ(x_1, …, x_i | ξ) of a normal distribution n(\vec{m}, W) is a normal distribution n(\vec{m}_i, W_i), where \vec{m}_i and W_i are the terms in \vec{m} and W that correspond to x_1, …, x_i. Thus, using |W| = \prod_{i=1}^{n} v_i^{-1}, Equation 23 becomes

ρ(x_1, …, x_i | W, D, B^e_S, ξ) = c \, |W_i|^{1/2} \cdot e^{-\frac{1}{2} \frac{ν}{ν+1} \sum_{j,k=1}^{i} (x_j - µ_{Dj})(x_k - µ_{Dk}) w_{jk}}     (24)

By expressing W in terms of \vec{v} and B using Equation 5, we obtain

\frac{ρ(x_1, …, x_i | \vec{v}, B, D, B^e_S, ξ)}{ρ(x_1, …, x_{i-1} | \vec{v}, B, D, B^e_S, ξ)} = c \cdot v_i^{-1/2} \cdot e^{-\frac{1}{2} \frac{ν}{ν+1} A}     (25)

where

A = \mathrm{tr}\left[ (\vec{x} - \vec{µ}_D)_i \, (\vec{x} - \vec{µ}_D)'_i \begin{pmatrix} \frac{\vec{b}_i \vec{b}'_i}{v_i} & -\frac{\vec{b}_i}{v_i} \\ -\frac{\vec{b}'_i}{v_i} & \frac{1}{v_i} \end{pmatrix} \right]     (26)

where (\vec{x} - \vec{µ}_D)_i is the column vector of the i elements of (\vec{x} - \vec{µ}_D) that correspond to x_1, …, x_i. Starting with any network B_{S0} such that the parents of x_i are the same as in B_S, we obtain exactly Equations 25 and 26. Furthermore, because \vec{µ}_D depends only on D_{x_i Π_i}, the theorem is established.
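The appendix results lend themselves to a direct numerical check. The sketch below is a verification aid, not part of the paper's method: it builds W from {\vec{v}, B} for n = 3 using the recursion of Equation 5, assumed here to have the block form visible in Equation 26 (W(i−1) augmented by \vec{b}_i\vec{b}'_i/v_i, −\vec{b}_i/v_i, and 1/v_i, which is the form consistent with |W(i)| = |W(i−1)|/v_i). It then checks three facts numerically: the Jacobian determinant of Theorem 6, the determinant factorization |W| = ∏_i v_i^{−1} used in the proof of Theorem 7, and the identity that the trace in Equation 26 equals the squared regression residual of x_i on its predecessors divided by v_i (which is why the ratio in Equation 25 is a univariate normal density in x_i). All parameter values are arbitrary.

```python
import numpy as np

def build_W(theta, n=3):
    # Recursion of Equation 5 (block form assumed from Equation 26):
    # W(i) = [[W(i-1) + b_i b_i'/v_i, -b_i/v_i],
    #         [-b_i'/v_i,              1/v_i ]]
    v, pos, bs = theta[:n], n, []
    for i in range(1, n):          # b_i has i components in a complete network
        bs.append(theta[pos:pos + i])
        pos += i
    W = np.array([[1.0 / v[0]]])
    for i, b in enumerate(bs, start=1):
        W = np.block([[W + np.outer(b, b) / v[i], (-b / v[i])[:, None]],
                      [(-b / v[i])[None, :], np.array([[1.0 / v[i]]])]])
    return W

def upper_triangle(theta):
    # The n(n+1)/2 free entries of the symmetric matrix W.
    return build_W(theta)[np.triu_indices(3)]

def finite_diff_jacobian(f, theta, h=1e-6):
    # Central differences, one parameter at a time.
    J = np.empty((len(f(theta)), len(theta)))
    for j in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[j] += h
        tm[j] -= h
        J[:, j] = (f(tp) - f(tm)) / (2.0 * h)
    return J

theta = np.array([1.3, 0.7, 2.1, 0.5, -0.4, 0.8])   # v_1..v_3, b_2, b_3
v = theta[:3]

# Theorem 6 (Equation 19): |dW/d{v,B}| = prod_i v_i^{-(i+1)}
det_numeric = abs(np.linalg.det(finite_diff_jacobian(upper_triangle, theta)))
det_theorem = np.prod(v ** -(np.arange(1, 4) + 1))

# Theorem 7's proof: |W| = prod_i v_i^{-1}
det_W = np.linalg.det(build_W(theta))

# Equation 26: tr[(x - mu_D)_i (x - mu_D)_i' M] = (squared residual) / v_i
z, b, vi = np.array([0.3, -1.1, 0.6]), theta[4:6], v[2]
M = np.block([[np.outer(b, b) / vi, (-b / vi)[:, None]],
              [(-b / vi)[None, :], np.array([[1.0 / vi]])]])
A_trace = np.trace(np.outer(z, z) @ M)
A_residual = (z[-1] - b @ z[:-1]) ** 2 / vi
```

The two determinant checks agree up to finite-difference noise, and the trace identity holds up to floating-point error; changing the arbitrary parameter values does not affect either agreement.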