A Bayesian Approach to Learning Bayesian Networks with Local Structure
Authors: **David Maxwell Chickering (dmax@microsoft.com), David Heckerman (heckerma@microsoft.com), Christopher Meek (meek@microsoft.com), Microsoft Research, Redmond, WA 98052-6399**
Abstract

Recently several researchers have investigated techniques for using data to learn Bayesian networks containing compact representations for the conditional probability distributions (CPDs) stored at each node. The majority of this work has concentrated on using decision-tree representations for the CPDs. In addition, researchers typically apply non-Bayesian (or asymptotically Bayesian) scoring functions such as MDL to evaluate the goodness-of-fit of networks to the data.

In this paper we investigate a Bayesian approach to learning Bayesian networks that contain the more general decision-graph representations of the CPDs. First, we describe how to evaluate the posterior probability—that is, the Bayesian score—of such a network, given a database of observed cases. Second, we describe various search spaces that can be used, in conjunction with a scoring function and a search procedure, to identify one or more high-scoring networks. Finally, we present an experimental evaluation of the search spaces, using a greedy algorithm and a Bayesian scoring function.

1 INTRODUCTION

Given a set of observations in some domain, a common problem that a data analyst faces is to build one or more models of the process that generated the data. In the last few years, researchers in the UAI community have contributed an enormous body of work to this problem, using Bayesian networks as the model of choice. Recent works include Cooper and Herskovits (1992), Buntine (1991), Spiegelhalter et al. (1993), and Heckerman et al. (1995).

A substantial amount of the early work on learning Bayesian networks has used observed data to infer global independence constraints that hold in the domain of interest. Global independences are precisely those that follow from the missing edges within a Bayesian-network structure. More recently, researchers (including Boutilier et al., 1996 and Friedman and Goldszmidt, 1996) have extended the "classical" definition of a Bayesian network to include efficient representations of local constraints that can hold among the parameters stored in the nodes of the network. Two notable features of this recent work are (1) the majority of effort has concentrated on inferring decision trees, which are structures that can explicitly represent some parameter equality constraints, and (2) researchers typically apply non-Bayesian (or asymptotically Bayesian) scoring functions such as MDL to evaluate the goodness-of-fit of networks to the data.

In this paper, we apply a Bayesian approach to learning Bayesian networks that contain decision graphs—generalizations of decision trees that can encode arbitrary equality constraints—to represent the conditional probability distributions in the nodes.

In Section 2, we introduce notation and previous relevant work. In Section 3 we describe how to evaluate the Bayesian score of a Bayesian network that contains decision graphs.
In Section 4, we investigate how a search algorithm can be used, in conjunction with a scoring function, to identify these networks from data. In Section 5, we use data from various domains to evaluate the learning accuracy of a greedy search algorithm applied to the search spaces defined in Section 4. Finally, in Section 6, we conclude with a discussion of future extensions to this work.

2 BACKGROUND

In this section, we describe our notation and discuss previous relevant work. Throughout the remainder of this paper, we use lower-case letters to refer to variables, and upper-case letters to refer to sets of variables. We write x_i = k when we observe that variable x_i is in state k. When we observe the state of every variable in a set X, we call the set of observations a state of X. Although arguably an abuse of notation, we find it convenient to index the states of a set of variables with a single integer. For example, if X = {x_1, x_2} is a set containing two binary variables, we may write X = 2 to denote {x_1 = 1, x_2 = 0}.

In Section 2.1, we define a Bayesian network. In Section 2.2 we describe decision trees and how they can be used to represent the probabilities within a Bayesian network. In Section 2.3, we describe decision graphs, which are generalizations of decision trees.

2.1 BAYESIAN NETWORKS

Consider a domain U of n discrete variables x_1, …, x_n, where each x_i has a finite number of states. A Bayesian network for U represents a joint probability distribution over U by encoding (1) assertions of conditional independence and (2) a collection of probability distributions. Specifically, a Bayesian network B is the pair (B_S, Θ), where B_S is the structure of the network, and Θ is a set of parameters that encode local probability distributions.

The structure B_S has two components: the global structure G and a set of local structures M. G is an acyclic directed graph—dag for short—that contains a node for each variable x_i ∈ U. The edges in G denote probabilistic dependences among the variables in U. We use Par(x_i) to denote the set of parent nodes of x_i in G. We use x_i to refer to both the variable in U and the corresponding node in G. The set of local structures M = {M_1, …, M_n} is a set of n mappings, one for each variable x_i, such that M_i maps each value of {x_i, Par(x_i)} to a parameter in Θ.

The assertions of conditional independence implied by the global structure G in a Bayesian network B impose the following decomposition of the joint probability distribution over U:

p(U | B) = ∏_i p(x_i | Par(x_i), Θ, M_i, G)    (1)

The set of parameters Θ contains—for each node x_i, for each state k of x_i, and for each parent state j—a single parameter¹ Θ(i, j, k) that encodes the conditional probabilities given in Equation 1. That is,

p(x_i = k | Par(x_i) = j, Θ, M_i, G) = Θ(i, j, k)    (2)

Note that the function Θ(i, j, k) depends on both M_i and G. For notational simplicity we leave this dependency implicit. Let r_i denote the number of states of variable x_i, and let q_i denote the number of states of the set Par(x_i).

¹Because the sum ∑_k p(x_i = k | Par(x_i), Θ, M_i, G) must be one, Θ will actually only contain r_i − 1 distinct parameters for this distribution. For simplicity, we leave this implicit for the remainder of the paper.

Figure 1: Bayesian network for U = {x, y, z}
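To make Equations 1 and 2 concrete, the following minimal sketch (all variable names and numbers are invented for illustration, not taken from the paper) evaluates the joint probability of a complete instantiation of U as the product of local table lookups:

```python
# Hypothetical two-variable network x -> y, both binary.
# theta[var][j][k] plays the role of Theta(i, j, k) in Equation 2:
# the probability that var equals k given that its parents are in state j.
theta = {
    "x": {(): {0: 0.6, 1: 0.4}},        # x has no parents: a single parent state ()
    "y": {(0,): {0: 0.9, 1: 0.1},       # parent state j encoded as a tuple of parent values
          (1,): {0: 0.2, 1: 0.8}},
}
parents = {"x": (), "y": ("x",)}

def joint_probability(assignment):
    """Equation 1: p(U | B) = prod_i p(x_i | Par(x_i), Theta, M_i, G)."""
    p = 1.0
    for var, par_vars in parents.items():
        j = tuple(assignment[v] for v in par_vars)   # state of Par(var)
        p *= theta[var][j][assignment[var]]          # Theta(var, j, k)
    return p

print(joint_probability({"x": 1, "y": 0}))           # 0.4 * 0.2 = 0.08
```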
We use Θ_ij to denote the set of parameters characterizing the distribution p(x_i | Par(x_i) = j, Θ, M_i, G):

Θ_ij = ∪_{k=1..r_i} Θ(i, j, k)

We use Θ_i to denote the set of parameters characterizing all of the conditional distributions p(x_i | Par(x_i), Θ, M_i, G):

Θ_i = ∪_{j=1..q_i} Θ_ij

In the "classical" implementation of a Bayesian network, each node x_i stores (r_i − 1) · q_i distinct parameters in a large table. That is, M_i is simply a lookup into a table. Note that the size of this table grows exponentially with the number of parents of x_i.

2.2 DECISION TREES

There are often equality constraints that hold among the parameters in Θ_i, and researchers have used mappings other than complete tables to represent these parameters more efficiently. For example, consider the global structure G depicted in Figure 1, and assume that all nodes are binary. Furthermore, assume that if x = 1, then the value of z does not depend on y. That is,

p(z | x = 1, y = 0, Θ, M_z, G) = p(z | x = 1, y = 1, Θ, M_z, G)

Using the decision tree shown in Figure 2 to implement the mapping M_z, we can represent p(z | x = 1, y, Θ, M_z, G) using a single distribution for both p(z | x = 1, y = 0, Θ, M_z, G) and p(z | x = 1, y = 1, Θ, M_z, G).

Figure 2: Decision tree for node z. The root splits on x; the x = 0 branch splits on y, yielding leaves p(z | x = 0, y = 0) and p(z | x = 0, y = 1), while the x = 1 branch leads to a single leaf p(z | x = 1, y = 0) = p(z | x = 1, y = 1).

Decision trees, described in detail by Breiman et al. (1984), can be used to represent sets of parameters in a Bayesian network. Each tree is a dag containing exactly one root node, and every node other than the root node has exactly one parent. Each leaf node contains a table of r_i − 1 distinct parameters that collectively define a conditional probability distribution p(x_i | Par(x_i), Θ, M_i, G). Each non-leaf node in the tree is annotated with the name of one of the parent variables π ∈ Par(x_i). Out-going edges from a node π in the tree are annotated with mutually exclusive and collectively exhaustive sets of values for the variable π.

When a node v in a decision tree is annotated with the name π, we say that v splits π. If the edge from v_1 to child v_2 is annotated with the value k, we say that v_2 is the child of v_1 corresponding to k. Note that by definition of the edge annotations, the child of a node corresponding to any value is unique.

We traverse the decision tree to find the parameter Θ(i, j, k) as follows. First, initialize v to be the root node in the decision tree. Then, as long as v is not a leaf, let π be the node in Par(x_i) that v splits, reset v to be the child of v corresponding to the value of π—determined by Par(x_i) = j—and repeat. If v is a leaf, we return the parameter in the table corresponding to state k of x_i, as in the sketch below.
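The traversal just described is easy to implement. Here is a minimal sketch; the `Node` class and its fields (`split_var`, `children`, `params`) are our own hypothetical representation, not a data structure prescribed by the paper:

```python
class Node:
    """A node of a decision tree (or graph) for variable x_i."""
    def __init__(self, split_var=None, children=None, params=None):
        self.split_var = split_var      # parent variable this node splits on; None for a leaf
        self.children = children or {}  # value of split_var -> child Node
        self.params = params            # leaf only: state k of x_i -> parameter Theta(i, j, k)

def lookup_parameter(root, parent_values, k):
    """Find Theta(i, j, k) by walking from the root to the leaf matching j.

    parent_values: dict mapping each parent variable to its observed value
    (a concrete encoding of the parent-state index j).
    """
    v = root
    while v.split_var is not None:                   # while v is not a leaf
        v = v.children[parent_values[v.split_var]]   # child corresponding to pi's value
    return v.params[k]

# The tree of Figure 2: the root splits on x; only the x = 0 branch splits on y.
shared = Node(params={0: 0.3, 1: 0.7})               # single leaf for x = 1, any y
tree_z = Node(split_var="x", children={
    0: Node(split_var="y", children={0: Node(params={0: 0.9, 1: 0.1}),
                                     1: Node(params={0: 0.2, 1: 0.8})}),
    1: shared,
})
assert lookup_parameter(tree_z, {"x": 1, "y": 0}, 1) == \
       lookup_parameter(tree_z, {"x": 1, "y": 1}, 1)  # same leaf, as in Figure 2
```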
Decision trees are more expressive mappings than complete tables, as we can represent all of the parameters from a complete table using a complete decision tree. A complete decision tree T_i for a node x_i is a tree of depth |Par(x_i)|, such that every node v_l at level l in T_i splits on the l-th parent π_l ∈ Par(x_i) and has exactly r_{π_l} children, one for each value of π_l. It follows from this definition that if T_i is a complete tree, then Θ(i, j, k) will map to a distinct parameter for each distinct {i, j}, which is precisely the behavior of a complete table.

Researchers have found that decision trees are useful for eliciting probability distributions, as experts often have extensive knowledge about equality of conditional distributions. Furthermore, many researchers have developed methods for learning these local structures from data.

2.3 DECISION GRAPHS

In this section we describe a generalization of the decision tree, known as a decision graph, that can represent a much richer set of equality constraints among the local parameters. A decision graph is identical to a decision tree except that, in a decision graph, the non-root nodes can have more than one parent. Consider, for example, the decision graph depicted in Figure 3. This decision graph represents a conditional probability distribution p(z | x, y, Θ) for the node z in Figure 1 that has different equality constraints than the tree shown in Figure 2. Specifically, the decision graph encodes the equality

p(z | x = 0, y = 1, Θ) = p(z | x = 1, y = 0, Θ)

Figure 3: Decision graph for node z. The root splits on x and both branches split on y; the leaves for {x = 0, y = 1} and {x = 1, y = 0} are merged into a single shared leaf, alongside separate leaves for p(z | x = 0, y = 0) and p(z | x = 1, y = 1).

We use D_i to denote a decision graph for node x_i. If the mapping in a node x_i is implemented with D_i, we use D_i instead of M_i to denote the mapping. A decision graph D_i can explicitly represent an arbitrary set of equality constraints of the form

Θ_ij = Θ_ij′    (3)

for j ≠ j′. To demonstrate this, consider a complete tree T_i for node x_i. We can transform T_i into a decision graph that represents all of the desired constraints by simply merging together any leaf nodes that contain parameter sets that are equal.

It is interesting to note that any equality constraint of the form given in Equation 3 can also be interpreted as the following independence constraint:

x_i ⊥⊥ Par(x_i) | Par(x_i) = j or Par(x_i) = j′

If we allow nodes in a decision graph D_i to split on node x_i as well as the nodes in Par(x_i), we can represent an arbitrary set of equality constraints among the parameters Θ_i. We return to this issue in Section 6, and assume for now that nodes in D_i do not split on x_i.

3 LEARNING DECISION GRAPHS

Many researchers have derived the Bayesian measure-of-fit—herein called the Bayesian score—for a network, assuming that there are no equalities among the parameters. Friedman and Goldszmidt (1996) derive the Bayesian score for a structure containing decision trees. In this section, we show how to evaluate the Bayesian score for a structure containing decision graphs.

To derive the Bayesian score, we first need to make an assumption about the process that generated the database D. In particular, we assume that the database D is a random (exchangeable) sample from some unknown distribution Θ_U, and that all of the constraints in Θ_U can be represented using a network structure B_S containing decision graphs.

As we saw in the previous section, the structure B_S = {G, M} imposes a set of independence constraints that must hold in any distribution represented using a Bayesian network with that structure.
We define B_S^h to be the hypothesis that (1) the independence constraints imposed by structure B_S hold in the joint distribution Θ_U from which the database D was generated, and (2) Θ_U contains no other independence constraints. We refer the reader to Heckerman et al. (1994) for a more detailed discussion of structure hypotheses.

The Bayesian score for a structure B_S is the posterior probability of B_S^h, given the observed database D:

p(B_S^h | D) = c · p(D | B_S^h) p(B_S^h)

where c = 1 / p(D). If we are only concerned with the relative scores of various structures, as is almost always the case, then the constant c can be ignored. Consequently, we extend our definition of the Bayesian score to be any function proportional to p(D | B_S^h) p(B_S^h). For now, we assume that there is an efficient method for assessing p(B_S^h) (assuming this distribution is uniform, for example), and concentrate on how to derive the marginal-likelihood term p(D | B_S^h). By integrating over all of the unknown parameters Θ we have:

p(D | B_S^h) = ∫_Θ p(Θ | B_S^h) p(D | Θ, B_S^h) dΘ    (4)

Researchers typically make a number of simplifying assumptions that collectively allow Equation 4 to be expressed in closed form. Before introducing these assumptions, we need the following notation.

As we showed in Section 2, if the local structure for a node x_i is a decision graph D_i, then the sets of parameters Θ_ij and Θ_ij′ can be identical for j ≠ j′. For the derivations to follow, we find it useful to enumerate the distinct parameter sets in Θ_i. Equivalently, we find it useful to enumerate the leaves in a decision graph. For the remainder of this section, we adopt the following syntactic convention. When referring to a parameter set stored in the leaf of a decision graph, we use a to denote the node index, and b to denote the parent-state index. When referring to a parameter set in the context of a specific parent state of a node, we use i to denote the node index and j to denote the parent-state index.

To enumerate the set of leaves in a decision graph D_a, we define a set of leaf-set indices L_a. The idea is that L_a contains exactly one parent-state index for each leaf in the graph. More precisely, let l denote the number of leaves in D_a. Then L_a = {b_1, …, b_l} is defined as a set with the following properties:

1. For all {b, b′} ⊆ L_a, b ≠ b′ ⇒ Θ_ab ≠ Θ_ab′
2. ∪_{b ∈ L_a} Θ_ab = Θ_a

The first property ensures that each index in L_a corresponds to a different leaf, and the second property ensures that every leaf is included.

One assumption used to derive Equation 4 in closed form is the parameter-independence assumption. Simply stated, this assumption says that given the hypothesis B_S^h, knowledge about any distinct parameter set Θ_ab does not give us any information about any other distinct parameter set.

Assumption 1 (Parameter Independence)

p(Θ | B_S^h) = ∏_{a=1..n} ∏_{b ∈ L_a} p(Θ_ab | B_S^h)

Another assumption that researchers make is the Dirichlet assumption. This assumption restricts the prior distributions over the distinct parameter sets to be Dirichlet.

Assumption 2 (Dirichlet) For all a and for all b ∈ L_a,

p(Θ_ab | B_S^h) ∝ ∏_{c=1..r_a} Θ_abc^{α_abc − 1}

where α_abc > 0 for 1 ≤ c ≤ r_a. Recall that r_a denotes the number of states for node x_a.
The hyperparameters α_abc characterize our prior knowledge about the parameters in Θ. Heckerman et al. (1995) describe how to derive these exponents from a prior Bayesian network. We return to this issue later.

Using these assumptions, we can derive the Bayesian score for a structure that contains decision graphs by following a method completely analogous to that of Heckerman et al. (1995). Before showing the result, we must define the inverse function of Θ(i, j, k). Let θ denote an arbitrary parameter in Θ. The function Θ⁻¹(θ) denotes the set of index triples that Θ() maps into θ. That is,

Θ⁻¹(θ) = {(i, j, k) | Θ(i, j, k) = θ}

Let D_ijk denote the number of cases in D for which x_i = k and Par(x_i) = j. We define N_abc as follows:

N_abc = ∑_{(i,j,k) ∈ Θ⁻¹(θ_abc)} D_ijk

Intuitively, N_abc is the number of cases in D that provide information about the parameter θ_abc. Letting N_ab = ∑_c N_abc and α_ab = ∑_c α_abc, we can write the Bayesian score as follows:

p(D, B_S^h) = p(B_S^h) ∏_{a=1..n} ∏_{b ∈ L_a} [Γ(α_ab) / Γ(N_ab + α_ab)] · ∏_{c=1..r_a} [Γ(N_abc + α_abc) / Γ(α_abc)]    (5)

We can determine all of the counts N_abc for each node x_a as follows. First, initialize all the counts N_abc to zero. Then, for each case C in the database, let k_C and j_C denote the value of x_a and the state of Par(x_a) in the case, respectively, and increment by one the count N_abc corresponding to the parameter θ_abc = p(x_a = k_C | Par(x_a) = j_C, Θ, D_a). Each such parameter can be found efficiently by traversing D_a from the root.

We say a scoring function is node decomposable if it can be factored into a product of functions that each depend only on a node and its parents. Node decomposability is useful for efficiently searching through the space of global network structures. Note that Equation 5 is node decomposable as long as p(B_S^h) is node decomposable.

We now consider some node-decomposable distributions for p(B_S^h). Perhaps the simplest choice is a uniform prior over network structures; that is, we set p(B_S^h) to a constant in Equation 5. We use this simple prior for the experiments described in Section 5. Another approach is to (a priori) favor networks with fewer parameters. For example, we can use

p(B_S^h) ∝ κ^{|Θ|} = ∏_{a=1..n} κ^{|Θ_a|}    (6)

where 0 < κ ≤ 1. Note that κ = 1 corresponds to the uniform prior over all structure hypotheses.

A simple prior for the parameters in Θ is to assume α_abc = 1 for all a, b, c. This choice of values corresponds to a uniform prior over the parameters, and was explored by Cooper and Herskovits (1992) in the context of Bayesian networks containing complete tables. We call the Bayesian scoring function the uniform scoring function if all the hyperparameters are set to one. We have found that this prior works well in practice and is easy to implement.
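The counting procedure and Equation 5 translate directly into code. Here is a minimal sketch (the helper names are ours, not the paper's), computed in log space with `math.lgamma` for numerical stability. It reuses the hypothetical `Node` representation from Section 2.2; deduplicating leaves by object identity plays the role of the index set L_a, and passing `alpha=1.0` realizes the uniform scoring function:

```python
import math
from collections import Counter

def node_log_score(var, root, data, alpha=1.0):
    """Log of node var's factor in Equation 5, for a decision graph rooted
    at `root` and a database `data` given as a list of dicts of observed values.
    alpha = 1.0 for every hyperparameter gives the uniform scoring function.
    """
    counts, leaves = Counter(), {}
    for case in data:
        v = root
        while v.split_var is not None:      # traverse D_a from the root (Section 3)
            v = v.children[case[v.split_var]]
        leaves[id(v)] = v                   # id() dedupes shared leaves: the set L_a
        counts[(id(v), case[var])] += 1     # increment the matching N_abc
    logp = 0.0
    for lid, leaf in leaves.items():        # leaves never reached contribute log(1) = 0
        states = list(leaf.params)
        n_ab = sum(counts[(lid, k)] for k in states)
        a_ab = alpha * len(states)
        logp += math.lgamma(a_ab) - math.lgamma(n_ab + a_ab)
        for k in states:
            logp += math.lgamma(counts[(lid, k)] + alpha) - math.lgamma(alpha)
    return logp
```

Because Equation 5 factors over nodes, the full log score is the sum of `node_log_score` over all nodes plus log p(B_S^h); this is exactly the node decomposability exploited by the search procedures of Section 4.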
Using two additional assumptions, Heckerman et al. (1995) show that each α_abc can be derived from a prior Bayesian network. The idea is that α_abc is proportional to the prior probability, obtained from the prior network, of all states of {x_i = k, Par(x_i) = j} that map to the parameter θ_abc. Specifically, if B_P is our prior Bayesian network, we set

α_abc = α ∑_{(i,j,k) ∈ Θ⁻¹(θ_abc)} p(x_i = k, Par(x_i) = j | B_P)

where α is a single equivalent sample size used to assess all of the exponents, and Par(x_i) denotes the parents of x_i in G (as opposed to the parents in the prior network). α can be understood as a measure of confidence that we have in the parameters of B_P. We call the Bayesian scoring function the PN scoring function (Prior Network scoring function) if the exponents are assessed this way. Heckerman et al. (1995) derive these constraints in the context of Bayesian networks with complete tables. In the full version of this paper, we show that these constraints follow when using decision graphs as well, with only slight modifications to the additional assumptions. Although we do not provide the details here, we can use the decision-graph structure to efficiently compute the exponents α_abc from the prior network in much the same way we computed the N_abc values from the database.

4 SEARCH

Given a scoring function that evaluates the merit of a Bayesian-network structure B_S, learning Bayesian networks from data reduces to a search for one or more structures that have a high score. Chickering (1995) shows that finding the optimal structure containing complete tables for the mappings M is NP-hard when using a Bayesian scoring function. Given this result, it seems reasonable to assume that by allowing (the more general) decision-graph mappings the problem remains hard, and consequently it is appropriate to apply heuristic search techniques.

In Section 4.1, we define a search space over decision-graph structures within a single node x_i, assuming that the parent set Par(x_i) is fixed. Once such a space is defined, we can apply to that space any number of well-known search algorithms. For the experiments described in Section 5, for example, we apply greedy search. In Section 4.2 we describe a greedy algorithm that combines local-structure search over all the decision graphs in the nodes with a global-structure search over the edges in G.

4.1 DECISION-GRAPH SEARCH

In this section, we assume that the states of our search space correspond to all of the possible decision graphs for some node x_i. In order for a search algorithm to traverse this space, we must define a set of operators that transform one state into another. We define three operators, each of which is a modification to the current set of leaves in a decision graph.

Definition (Complete Split) Let v be a leaf node in the decision graph, and let π ∈ Par(x_i) be a parent of x_i. A complete split C(v, π) adds r_π new leaf nodes as children to v, where each child of v corresponds to a distinct value of π.

Definition (Binary Split) Let v be a leaf node in the decision graph, and let π ∈ Par(x_i) be a parent of x_i. A binary split B(v, π, k) adds 2 new leaf nodes as children to v, where the first child corresponds to state k of π, and the other child corresponds to all other states of π.

Definition (Merge) Let v_1 and v_2 be two distinct leaf nodes in the decision graph. A merge M(v_1, v_2) merges v_1 and v_2 into a single node. That is, the resulting node inherits all parents from both v_1 and v_2.
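A minimal sketch of the three operators, again over the hypothetical `Node` representation from Section 2.2 (a split mutates a leaf in place, and the states of π are passed in explicitly); the pre-condition discussed next, that an operator must change the implied constraints, is not checked here:

```python
def complete_split(v, pi, pi_states, make_leaf):
    """C(v, pi): leaf v becomes a split on pi with one fresh leaf per state of pi."""
    v.split_var, v.params = pi, None
    v.children = {s: make_leaf() for s in pi_states}

def binary_split(v, pi, k, pi_states, make_leaf):
    """B(v, pi, k): leaf v becomes a split on pi with two fresh leaves:
    one for state k, and one shared by all remaining states of pi."""
    child_k, child_rest = make_leaf(), make_leaf()
    v.split_var, v.params = pi, None
    v.children = {s: (child_k if s == k else child_rest) for s in pi_states}

def merge(root, v1, v2):
    """M(v1, v2): redirect every edge that points at leaf v2 to point at
    leaf v1 instead, so that v1 inherits all of v2's parents."""
    seen, stack = set(), [root]
    while stack:
        u = stack.pop()
        if id(u) in seen or u.split_var is None:
            continue                      # skip leaves and already-visited nodes
        seen.add(id(u))
        for s, child in u.children.items():
            if child is v2:
                u.children[s] = v1
            else:
                stack.append(child)

# Example leaf factory: a fresh leaf with its own (initially empty) parameter table.
make_leaf = lambda: Node(params={})
```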
In Figure 4, we show the result of each type of operator applied to a decision graph for a node z with parents x and y, where x and y both have three states.

Figure 4: Example of the application of each type of operator: (a) the original decision graph, whose root splits on y and has leaves v_1, v_2, and v_3; (b) the result of applying C(v_3, x); (c) the result of applying B(v_3, x, 0); and (d) the result of applying M(v_2, v_3).

We add the pre-condition that an operator must change the parameter constraints implied by the decision graph. We would not allow, for example, a complete split C(v_1, y) in Figure 4a: two of v_1's new children would correspond to impossible states of y ({y = 0 and y = 1} and {y = 0 and y = 2}), and the third child would correspond to the original constraints at v_1 ({y = 0 and y = 0}).

Note that starting from a decision graph containing a single node (both the root and a leaf node), we can generate a complete decision tree by repeatedly applying complete splits. As discussed in the previous section, we can represent any parameter-set equalities by merging the leaves of a complete decision tree. Consequently, starting from a graph containing one node there exists a series of operators that results in any set of possible parameter-set equalities. Note also that if we repeatedly merge the leaves of a decision graph until there is a single parameter set, the resulting graph is equivalent (in terms of parameter equalities) to the graph containing a single node. Therefore, our operators are sufficient for moving from any set of parameter constraints to any other set of parameter constraints. Although we do not discuss them here, there are methods that can simplify (in terms of the number of nodes) some decision graphs such that they represent the same set of parameter constraints.

The complete-split operator is actually not needed to ensure that all parameter equalities can be reached: any complete split can be replaced by a series of binary splits such that the resulting parameter-set constraints are identical. We included the complete-split operator in the hope that it would help lead the search algorithm to better structures. In Section 5, we compare greedy-search performance in various search spaces defined by including only subsets of the above operators.

4.2 COMBINING GLOBAL AND LOCAL SEARCH

In this section we describe a greedy algorithm that combines global-structure search over the edges in G with local-structure search over the decision graphs in all of the nodes of G.

Suppose that in the decision graph D_i for node x_i, there is no non-leaf node annotated with some parent π ∈ Par(x_i). In this case, x_i is independent of π given its other parents, and we can remove π from Par(x_i) without violating the decomposition given in Equation 1. Thus given a fixed structure, we can learn all the local decision graphs for all of the nodes, and then delete those parents that are independent. We can also consider adding edges as follows. For each node x_i, add to Par(x_i) all non-descendants of x_i in G, learn a decision graph for x_i, and then delete all parents that are not contained in the decision graph. Figure 5 shows a greedy algorithm that combines these two ideas.
In our experiments, we started the algorithm with a structure for which G contains no edges, and each graph D_i consists of a single root node.

1. Score the current network structure B_S.
2. For each node x_i in G:
3.   Add every non-descendant that is not a parent of x_i to Par(x_i).
4.   For every possible operator O to the decision graph D_i:
5.     Apply O to B_S.
6.     Score the resulting structure.
7.     Unapply O.
8.   Remove any parent that was added to x_i in step 3.
9. If the best score from step 6 is better than the current score:
10.   Let O be the operator that resulted in the best score.
11.   If O is a split operator (either complete or binary) on a node x_j that is not in Par(x_i), then add x_j to Par(x_i).
12.   Apply O to B_S.
13.   Go to step 1.
14. Otherwise, return B_S.

Figure 5: Greedy algorithm that combines local and global structure search

Note that as a result of a merge operator in a decision graph D_i, x_i may be rendered independent of one of its parents π ∈ Par(x_i), even if D_i contains a node annotated with π. For a simple example, we could repeatedly merge all leaves into a single leaf node; the resulting graph implies that x_i does not depend on any of its parents. We found experimentally that—when using the algorithm from Figure 5—this phenomenon is rare. Because testing for these parent deletions is expensive, we chose not to check for them in the experiments described in Section 5.

Another greedy approach for learning structures containing decision trees has been explored by Friedman and Goldszmidt (1996). The idea is to score edge operations in G (adding, deleting, or reversing edges) by applying the operation and then greedily learning the local decision trees for any nodes whose parents have changed as a result of the operation. In the full version of the paper, we compare our approach to theirs.
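Both the per-node search of Section 4.1 and the combined search of Figure 5 share the same greedy skeleton: try every applicable operator, keep the best scoring one, and stop at a local maximum. A minimal generic sketch, where the callables are assumptions standing in for the machinery above rather than an API from the paper:

```python
def greedy_search(state, candidate_ops, apply_op, undo_op, score):
    """Greedy hill-climbing: apply the best score-improving operator until none exists.

    candidate_ops(state) -> operators applicable to the current state
    apply_op / undo_op   -> mutate / restore the state for one operator
    score(state)         -> Bayesian score (e.g., summed node_log_score over nodes)
    """
    current = score(state)
    while True:
        best_op, best = None, current
        for op in candidate_ops(state):
            apply_op(state, op)          # tentatively apply...
            s = score(state)
            undo_op(state, op)           # ...and roll back
            if s > best:
                best_op, best = op, s
        if best_op is None:
            return state                 # no operator improves the score: local maximum
        apply_op(state, best_op)         # commit the best operator and continue
        current = best
```

With a node-decomposable score, only the factors for nodes touched by an operator need rescoring, which keeps the inner loop affordable in practice.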
5 EXPERIMENTAL RESULTS

In this section we investigate how varying the set of allowed operators affects the performance of greedy search. By disallowing the merge operator, the search algorithms will identify decision-tree local structures in the Bayesian network. Consequently, we can see how learning accuracy changes, in the context of greedy search, when we generalize the local structures from decision trees to decision graphs.

In all of the experiments described in this section, we measure learning accuracy by the posterior probability of the identified structure hypotheses. Researchers often use other criteria, such as predictive accuracy on a holdout set or structural difference from some generative model. The reason that we do not use any of these criteria is that we are evaluating how well the search algorithm performs in various search spaces, and the goal of the search algorithm is to maximize the scoring function. We are not evaluating how well the Bayesian scoring functions approximate some other criteria.

In our first experiment, we consider the Promoter Gene Sequences database from the UC Irvine collection, consisting of 106 cases. There are 58 variables in this domain. 57 of these variables, {x_1, …, x_57}, represent the "base-pair" values in a DNA sequence, and each has four possible values. The other variable, promoter, is binary and indicates whether or not the sequence has promoter activity.

The goal of learning in this domain is to build an accurate model of the distribution p(promoter | x_1, …, x_57), and consequently it is reasonable to consider a static graphical structure for which Par(promoter) = {x_1, …, x_57}, and search for a decision graph in node promoter.

Table 1 shows the relative Bayesian scores for the best decision graph learned, using a greedy search with various parameter priors and search spaces. All searches started with a decision graph containing a single node, and the current best operator was applied at each step until no operator increased the score of the current state. Each column corresponds to a different restriction of the search space described in Section 4.1: the labels indicate which operators the greedy search was allowed to use, where C denotes complete splits, B denotes binary splits, and M denotes merges. The column labeled BM, for example, shows the results when a greedy search used binary splits and merges, but not complete splits. Each row corresponds to a different parameter prior for the Bayesian scoring function. The U-PN scoring function is a special case of the PN scoring function for which the prior network imposes a uniform distribution over all variables. The number following U-PN in the row labels indicates the equivalent sample size α. All results use a uniform prior over structure hypotheses. A value of zero in a row of the table denotes the hypothesis with lowest probability out of all those identified using the given parameter prior. All other values denote the natural logarithm of how many times more likely the identified hypothesis is than the one with lowest probability.

Table 1: Greedy search performance for various Bayesian scoring functions, using different sets of operators, in the Promoter domain.

| Prior | C | B | CB | CM | BM | CBM |
|---|---|---|---|---|---|---|
| uniform | 0 | 13.62 | 6.07 | 22.13 | 26.11 | 26.11 |
| U-PN 10 | 0 | 6.12 | 4.21 | 9.5 | 10.82 | 12.93 |
| U-PN 20 | 0 | 5.09 | 3.34 | 14.11 | 12.11 | 14.12 |
| U-PN 30 | 0 | 4.62 | 2.97 | 10.93 | 12.98 | 16.65 |
| U-PN 40 | 0 | 3.14 | 1.27 | 16.3 | 13.54 | 16.02 |
| U-PN 40 | 0 | 2.99 | 1.12 | 15.76 | 15.54 | 17.54 |

By comparing the relative values between searches that use merges and searches that do not, we see that without exception, adding the merge operator results in a significantly more probable structure hypothesis. We can therefore conclude that a greedy search over decision graphs results in better solutions than a greedy search over decision trees. An interesting observation is that the complete-split operator actually reduces solution quality when we restrict the search to decision trees.

We performed an identical experiment on another classification problem, but for simplicity we only present the results for the uniform scoring function. Recall from Section 3 that the uniform scoring function has all of the hyperparameters α_abc set to one. This second experiment was run with the Splice-junction Gene Sequences database, again from the UC Irvine repository. This database also contains a DNA sequence, and the problem is to predict whether the position in the middle of the sequence is an "intron-exon" boundary, an "exon-intron" boundary, or neither. The results are given in Table 2. We used the same uniform prior for structure hypotheses.
Table 2: Greedy search performance for the uniform scoring function, using different sets of operators, in the Splice domain.

| C | B | CB | CM | BM | CBM |
|---|---|---|---|---|---|
| 0 | 383 | 363 | 464 | 655 | 687 |

Table 2 again supports the claim that we get a significant improvement by using decision graphs instead of decision trees.

Our final set of experiments was done in the ALARM domain, a well-known benchmark for Bayesian-network learning algorithms. The ALARM network, described by Beinlich et al. (1989), is a hand-constructed Bayesian network used for diagnosis in a medical domain. The parameters of this network are stored using complete tables.

In the first experiment for the ALARM domain, we demonstrate that for a fixed global structure G, the hypothesis identified by searching for local decision graphs in all the nodes can be significantly better than the hypothesis corresponding to complete tables in the nodes. We first generated 1000 cases from the ALARM network, and then computed the uniform Bayesian score for the ALARM network, assuming that the parameter mappings M are complete tables. We expect the posterior of this model to be quite good, because we are evaluating the generative model structure. Next, using the uniform scoring function, we applied the six greedy searches as in the previous experiments to identify good decision graphs for all of the nodes in the network. We kept the global structure G fixed to be identical to the global structure of the ALARM network. The results are shown in Table 3, and the values have the same semantics as in the previous two tables. The score given in the first column, labeled COMP, is the score for the complete-table model.

Table 3: Greedy search performance for the uniform scoring function for each node in the ALARM network. Also included is the uniform score for the complete-table model.

| COMP | C | B | CB | CM | BM | CBM |
|---|---|---|---|---|---|---|
| 0 | 134 | 186 | 165 | 257 | 270 | 270 |

Table 3 demonstrates that search using decision graphs can identify significantly better models than search restricted to decision trees. The fact that the complete-table model attains such a low score (the best hypothesis we found is e^270 times more probable than the complete-table hypothesis!) is not surprising upon examination of the probability tables stored in the ALARM network: most of the tables contain parameter-set equalities.

In the next experiment, we used the ALARM domain to test the structure-learning algorithm given in Section 4.2. We again generated a database of 1000 cases, and used the uniform scoring function with a uniform prior over structure hypotheses. We ran six versions of our algorithm, corresponding to the six possible sets of local-structure operators as in the previous experiments. We also ran a greedy structure-search algorithm that assumes complete tables in the nodes. We initialized this search with a global network structure with no edges, and the operators were single-edge modifications to the graph: deletion, addition, and reversal. In Table 4 we show the results.

Table 4: Performance of the greedy algorithm that combines local and global structure search, using different sets of operators, in the ALARM domain. Also included is the result of a greedy algorithm that searches for global structure assuming complete tables.

| COMP | C | B | CB | CM | BM | CBM |
|---|---|---|---|---|---|---|
| 255 | 0 | 256 | 241 | 869 | 977 | 1136 |
The column labeled COMP corresponds to the greedy search over structures with complete tables. Once again, we note that when we allow nodes to contain decision graphs, we get a significant improvement in solution quality. Note that the search over complete-table structures out-performed our algorithm when we restricted the algorithm to search for decision trees containing either (1) only complete splits or (2) complete splits and binary splits.

In our final experiment, we repeated the previous experiment, except that we only allowed our algorithm to add parents that are not descendants in the generative model. That is, we restricted the global search over G to those dags that did not violate the partial ordering in the ALARM network. We also ran the same greedy structure-search algorithm that searches over structures with complete tables, except we initialized the search with the ALARM network. The results of this experiment are shown in Table 5. From the table, we see that the constrained searches exhibit the same relative behavior as the unconstrained searches.

Table 5: Performance of a restricted version of our greedy algorithm, using different sets of operators, in the ALARM domain. Also included is the result of a greedy algorithm, initialized with the global structure of the ALARM network, that searches for global structure assuming complete tables.

| COMP | C | B | CB | CM | BM | CBM |
|---|---|---|---|---|---|---|
| 0 | 179 | 334 | 307 | 553 | 728 | 790 |

For each experiment in the ALARM domain (Tables 3, 4, and 5) the values presented measure the performance of search relative to the worst performance in that experiment. In Table 6, we compare search performance across all experiments in the ALARM domain. That is, a value of zero in the table corresponds to the experiment and set of operators that led to the learned hypothesis with lowest posterior probability, out of all experiments and operator restrictions we considered in the ALARM domain. All other values given in the table are relative to this (lowest) posterior probability. The row labels correspond to the experiment: S denotes the first experiment, which performed local searches in a static global structure; U denotes the second experiment, which performed unconstrained structural searches; and C denotes the final experiment, which performed constrained structural search.

Table 6: Comparison of Bayesian scores for all experiments in the ALARM domain.

| | COMP | C | B | CB | CM | BM | CBM |
|---|---|---|---|---|---|---|---|
| S | 278 | 412 | 464 | 443 | 534 | 548 | 548 |
| U | 255 | 0 | 256 | 241 | 869 | 976 | 1136 |
| C | 336 | 515 | 670 | 643 | 889 | 1064 | 1126 |

Rather surprisingly, each hypothesis learned using global-structure search with decision graphs had a higher posterior than every hypothesis learned using the generative static structure.

6 DISCUSSION

In this paper we showed how to derive the Bayesian score of a network structure that contains parameter maps implemented as decision graphs. We defined a search space for learning individual decision graphs within a static global structure, and defined a greedy algorithm that searches for both global and local structure simultaneously. We demonstrated experimentally that greedy search over structures containing decision graphs significantly outperforms greedy search over both (1) structures containing complete tables and (2) structures containing decision trees.
We now consider an extension to the decision graph that we mentioned in Section 2.3. Recall that in a decision graph, the parameter sets are stored in a table within the leaves. When decision graphs are implemented this way, any parameter θ_abc must belong to exactly one (distinct) parameter set. An important consequence of this property is that if the priors for the parameter sets are Dirichlet (Assumption 2), then the posterior distributions are Dirichlet as well. That is, the Dirichlet distribution is conjugate with respect to the likelihood of the observed data. As a result, it is easy to derive the Bayesian scoring function in closed form.

If we allow nodes within a decision graph D_i to split on node x_i, we can represent an arbitrary set of parameter constraints of the form Θ(i, j, k) = Θ(i, j′, k′) for j ≠ j′ and k ≠ k′. For example, consider a Bayesian network for the two-variable domain {x, y}, where x is a parent of y. We can use a decision graph for y that splits on y to represent the constraint

p(y = 1 | x = 0, Θ, D_y, G) = p(y = 0 | x = 1, Θ, D_y, G)

Unfortunately, when we allow these types of constraints, the Dirichlet distribution is no longer conjugate with respect to the likelihood of the data, and the parameter-independence assumption is violated. Consequently, the derivation described in Section 3 will not apply. Conjugate priors for a decision graph D_i that splits on node x_i do exist, however, and in the full version of this paper we use a weaker version of parameter independence to derive the Bayesian score for these graphs in closed form.

We conclude by noting that it is easy to extend the definition of a network structure to represent constraints between the parameters of different nodes in the network, e.g. Θ_ij = Θ_i′j′ for i ≠ i′. Both Buntine (1994) and Thiesson (1995) consider these types of constraints. The Bayesian score for such structures can be derived by simple modifications to the approach described in this paper.

References

[Beinlich et al., 1989] Beinlich, I., Suermondt, H., Chavez, R., and Cooper, G. (1989). The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proceedings of the Second European Conference on Artificial Intelligence in Medicine, London, pages 247–256. Springer-Verlag, Berlin.

[Boutilier et al., 1996] Boutilier, C., Friedman, N., Goldszmidt, M., and Koller, D. (1996). Context-specific independence in Bayesian networks. In Proceedings of Twelfth Conference on Uncertainty in Artificial Intelligence, Portland, OR, pages 115–123. Morgan Kaufmann.

[Breiman et al., 1984] Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth.

[Buntine, 1991] Buntine, W. (1991). Theory refinement on Bayesian networks. In Proceedings of Seventh Conference on Uncertainty in Artificial Intelligence, Los Angeles, CA, pages 52–60. Morgan Kaufmann.

[Buntine, 1994] Buntine, W. L. (1994). Learning with graphical models. Technical Report FIA-94-02, NASA Ames.

[Chickering, 1995] Chickering, D. M. (1995). Learning Bayesian networks is NP-complete. Submitted to: Lecture Notes in Statistics.

[Cooper and Herskovits, 1992] Cooper, G. and Herskovits, E. (1992).
A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347.

[Friedman, 1996] Friedman, J. (1996). On bias, variance, 0/1-loss, and the curse of dimensionality. Data Mining and Knowledge Discovery, 1.

[Friedman and Goldszmidt, 1996] Friedman, N. and Goldszmidt, M. (1996). Learning Bayesian networks with local structure. In Proceedings of Twelfth Conference on Uncertainty in Artificial Intelligence, Portland, OR. Morgan Kaufmann.

[Heckerman et al., 1994] Heckerman, D., Geiger, D., and Chickering, D. (1994). Learning Bayesian networks: The combination of knowledge and statistical data. In Proceedings of Tenth Conference on Uncertainty in Artificial Intelligence, Seattle, WA, pages 293–301. Morgan Kaufmann.

[Heckerman et al., 1995] Heckerman, D., Geiger, D., and Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243.

[Spiegelhalter et al., 1993] Spiegelhalter, D., Dawid, A., Lauritzen, S., and Cowell, R. (1993). Bayesian analysis in expert systems. Statistical Science, 8:219–282.

[Thiesson, 1995] Thiesson, B. (1995). Score and information for recursive exponential models with incomplete data. Technical report, Institute of Electronic Systems, Aalborg University, Aalborg, Denmark.