Improving parameter learning of Bayesian nets from incomplete data


Authors: Giorgio Corani, Cassio P. De Campos

G. Corani and C. P. de Campos
IDSIA
Galleria 2, CH-6928 Manno (Lugano), Switzerland
{giorgio,cassio}@idsia.ch

Abstract

This paper addresses the estimation of the parameters of a Bayesian network from incomplete data. The task is usually tackled by running the Expectation-Maximization (EM) algorithm several times in order to obtain a high log-likelihood estimate. We argue that choosing the maximum log-likelihood estimate (as well as the maximum penalized log-likelihood and the maximum a posteriori estimate) has severe drawbacks, being affected both by overfitting and by model uncertainty. Two ideas are discussed to overcome these issues: a maximum entropy approach and a Bayesian model averaging approach. Both ideas can be easily applied on top of EM, while the entropy idea can also be implemented in a more sophisticated way, through a dedicated non-linear solver. A vast set of experiments shows that these ideas produce significantly better estimates and inferences than the traditional and widely used maximum (penalized) log-likelihood and maximum a posteriori estimates. In particular, if EM is adopted as optimization engine, the model averaging approach is the best-performing one; its performance is matched by the entropy approach when implemented using the non-linear solver. The results suggest that the applicability of these ideas is immediate (they are easy to implement and to integrate into currently available inference engines) and that they constitute a better way to learn Bayesian network parameters.

1 Introduction

This paper focuses on learning the parameters of a Bayesian network (BN) with known structure from incomplete samples, under the assumption of MAR (missing-at-random) missing data. In this setting, the missing data make the log-likelihood (LL) function non-concave and multimodal. The most common approach to maximize LL in the presence of missing data is the Expectation-Maximization (EM) algorithm [4], which generally converges to a local maximum of the LL function. EM can be easily modified to maximize, rather than LL, the posterior probability (MAP), as well as other penalized maximum likelihood ideas [11, Sec. 1.6]. Generally, maximizing MAP rather than LL yields smoother estimates, less prone to overfitting [8]. In the following, we refer to the function to be maximized as the score. Although we focus on BN learning, the ideas are general and should also apply to other probabilistic graphical models that share similar characteristics in terms of parameter learning.

In order to reduce the chance of obtaining an estimate with low score, a multi-start approach is adopted: EM is started from many different initialization points, eventually selecting the estimate corresponding to the run that achieves the highest score. We argue, however, that this strategy has some drawbacks. Firstly, the estimate that maximizes the score may well do so because of overfitting: even if MAP estimation is adopted, the fixed structure and the (limited) amount of data may lead to overfitting, because they might not fully represent the distribution that generated the data.
Secondly, the estimates produced by the different EM runs are typically very different from each other, and yet they achieve very close scores [7, Chap. 19]. Choosing the single estimate with the highest score thus incurs model uncertainty, because the estimates with slightly worse scores are completely ignored. Overall, the score alone does not seem to be powerful enough to identify the best estimate. Note that the challenge presented here does not involve model complexity: all the competing estimates have the same underlying structure, and thus approaches such as the Bayesian Information Criterion (BIC) do not apply.

We propose and compare two approaches to replace the criterion of selecting the highest-score estimate, both based on well-known ideas already applied in other contexts. The first is based on the principle of maximum entropy and the second on model averaging. The maximum entropy criterion can be stated as: "when we make inferences based on incomplete information, we should draw them from that probability distribution that has the maximum entropy permitted by the information which we do have" [6]. We interpret this principle by first discarding estimates with low score, which we assume to be poor, and then selecting the most entropic estimate among the remaining ones. The entropy-based criterion is expected to yield parameter estimates that are more robust to overfitting than those from the criterion of maximum score. In its simplest version, we apply the entropy principle on top of a multi-start EM; in a more sophisticated fashion, we implement it using a non-linear solver. In [12], a similar idea has been explored to fit continuous distributions from only partial knowledge about moments or other features extracted from data.

The model averaging idea is instead inspired by Bayesian Model Averaging (BMA, see [5]), and is designed to be used on top of the multi-start EM. BMA is a technique specifically designed to deal with model uncertainty; it prescribes averaging the predictions produced by a set of competing models, assigning to each model a weight proportional to its posterior probability. In the literature, BMA has often been used to manage ensembles of BN classifiers; yet, most attempts were not successful, as reviewed by [2]. The problem is that a single model usually becomes much more probable than any competitor, and thus there is little difference between using BMA or the single most probable model (that is, the maximum-score one). Our approach is to apply BMA locally to each conditional probability distribution that defines the BN; this allows us to instantiate a single BN model, whose parameters can be readily inspected, instead of dealing with a collection of models.

Sections 3 and 4 present a vast amount of experiments with different BNs and missingness processes. These experiments, performed on a grid of computers, would have taken more than nine months if run on a single desktop computer. They consistently show that both the entropy and the BMA-based approaches provide better estimates than maximum-score estimation, and that these better parameters also result in better inferences with the resulting BNs.
2 Methods

We adopt Bayesian networks as the framework for our study, even if the discussion of this paper may be relevant to the parameter learning of other probabilistic graphical models. Therefore, we assume that the reader is familiar with the basic concepts of Bayesian networks [7]. A Bayesian network (BN) is a triple $(G, \mathbf{X}, P)$, where $G$ is a directed acyclic graph with nodes associated to random variables $\mathbf{X} = \{X_1, \ldots, X_n\}$ over discrete domains $\{\Omega_{X_1}, \ldots, \Omega_{X_n}\}$, and $P$ is a collection of probability values $p(x_j \mid \pi_j)$ with $\sum_{x_j \in \Omega_{X_j}} p(x_j \mid \pi_j) = 1$, where $x_j \in \Omega_{X_j}$ is a category of $X_j$ and $\pi_j \in \times_{X \in \Pi_j} \Omega_X$ an instantiation of the parents $\Pi_j$ of $X_j$ in $G$. Furthermore, every variable is conditionally independent of its non-descendants given its parents. Given its independence assumptions, the joint probability distribution represented by a BN is obtained by $p(\mathbf{x}) = \prod_j p(x_j \mid \pi_j)$, where $\mathbf{x} \in \Omega_{\mathbf{X}}$ and all $x_j, \pi_j$ (for every $j$) agree with $\mathbf{x}$. Uppercase letters are used for random variables and lowercase letters for their corresponding categories. The graph $G$ and the variables $\mathbf{X}$ (and their domains) are assumed to be known; $\theta_{v|w}$ is used to denote the probability $p(v \mid w)$ (with $V, W \subseteq \mathbf{X}$).

Given the data $\mathbf{y} = (y_1, \ldots, y_N)$ with $N$ instances such that $y_i \in \Omega_{\mathbf{Y}_i}$ and $\mathbf{Y}_i \subseteq \mathbf{X}$ is the set of observed variables of instance $i$, we denote by $N_w$ the number of instances of $\mathbf{y}$ that agree with the configuration $w$. The goal is to learn $P$, which is usually done by maximizing the penalized (log-)likelihood (LL) of $\mathbf{y}$:

$$\hat{\theta} = \operatorname*{argmax}_{\theta} S_\theta(\mathbf{y}) = \operatorname*{argmax}_{\theta} \left( \sum_{i=1}^{N} \log \theta_{y_i} + \alpha(\theta) \right),$$

where $\alpha$ is the penalty term. The argument $\mathbf{y}$ of $S_\theta$ is omitted from now on ($S$ stands for score). We use the penalized LL because one can simply set the penalty to zero to obtain the standard likelihood, or to $\log p(\theta)$ to get the MAP estimation. For ease of exposition, we assume

$$\alpha(\theta) = \log \prod_{j=1}^{n} \prod_{x_j} \prod_{\pi_j} \theta_{x_j|\pi_j}^{\alpha_{x_j,\pi_j}}, \quad \text{with } \alpha_{x_j,\pi_j} = \frac{1}{|\Omega_{X_j}| \cdot |\Omega_{\Pi_j}|},$$

which is the MAP version with equivalent sample size set to one. For a complete data set (that is, $\mathbf{Y}_i = \mathbf{X}$ for all $i$), we have a concave sum of logarithms on $\theta$:

$$S_\theta = \sum_{j=1}^{n} \sum_{x_j} \sum_{\pi_j} N'_{x_j,\pi_j} \log \theta_{x_j|\pi_j},$$

where $N'_{x_j,\pi_j} = N_{x_j,\pi_j} + \alpha_{x_j,\pi_j}$, and the estimate $\hat{\theta}_{x_j|\pi_j} = N'_{x_j,\pi_j} / (\sum_{x_j} N'_{x_j,\pi_j})$ achieves the maximum score. In the case of incomplete data, we have

$$S_\theta = \sum_{i=1}^{N} \log \sum_{z_i} \prod_{j=1}^{n} \theta_{x^i_j|\pi^i_j} + \alpha(\theta), \qquad (1)$$

where $x^i = (y_i, z_i) = (x^i_1, \ldots, x^i_n)$ represents an instantiation of all the variables. No closed-form solution is known, and one has to directly optimize $\max_\theta S_\theta$, subject to $\forall j\, \forall \pi_j: \sum_{x_j} \theta_{x_j|\pi_j} = 1$ and $\forall j\, \forall x_j\, \forall \pi_j: \theta_{x_j|\pi_j} \geq 0$.
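To make Eq. (1) concrete, the following minimal Python sketch (our own illustration, not the authors' code) evaluates the score of an incomplete data set by brute-force enumeration of the completions $z_i$; the dictionary-based CPT layout is a hypothetical choice of ours. The marginalization over completions is exactly what makes $S_\theta$ non-concave.

```python
# Sketch of Eq. (1). Hypothetical layout: cpts[var][parent_config] is a
# 1-D numpy array over the states of var.
import itertools
import numpy as np

def instance_logprob(cpts, parents, domains, observed):
    """log p(y_i): sum the joint probability over all completions z_i."""
    missing = [v for v in domains if v not in observed]
    total = 0.0
    for z in itertools.product(*(range(domains[v]) for v in missing)):
        x = {**observed, **dict(zip(missing, z))}   # one full completion
        p = 1.0
        for var in domains:
            pa = tuple(x[u] for u in parents[var])
            p *= cpts[var][pa][x[var]]
        total += p
    return np.log(total)

def score(cpts, parents, domains, data):
    """S_theta = sum_i log p(y_i) + alpha(theta), ESS-one penalty as above."""
    s = sum(instance_logprob(cpts, parents, domains, y) for y in data)
    for table in cpts.values():
        # alpha_{x_j,pi_j} = 1 / (|Omega_{X_j}| * |Omega_{Pi_j}|)
        a = 1.0 / sum(len(probs) for probs in table.values())
        for probs in table.values():
            s += a * np.log(probs).sum()
    return s

# toy chain A -> B, both binary; the second instance has B missing
parents = {"A": (), "B": ("A",)}
domains = {"A": 2, "B": 2}
cpts = {"A": {(): np.array([0.6, 0.4])},
        "B": {(0,): np.array([0.7, 0.3]), (1,): np.array([0.2, 0.8])}}
data = [{"A": 0, "B": 1}, {"A": 1}]
print(score(cpts, parents, domains, data))
```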
The most common approach to optimize this function is the EM method, which completes the data with the expected counts for each missing variable given the observed variables: each variable $Z^i_j$ with a missing value is completed by "weights" $\hat{\theta}^k_{Z_j|y_i}$ for each such $i, j$, where $\hat{\theta}^k$ represents the current estimate at iteration $k$. This amounts to weighting the chance of having $Z^i_j = z_j$ by the (current) distribution of $Z_j$ given $y_i$ (this is known as the E-step, and requires inferences over the network instantiated with $P = \hat{\theta}^k$). Using these weights together with the actual counts from the data, the sufficient statistics $N^k_{x_j,\pi_j}$ are computed for every $x_j, \pi_j$, and the next (updated) estimate $\hat{\theta}^{k+1}$ is obtained as if the data were complete: $\hat{\theta}^{k+1}_{x_j|\pi_j} = N'^k_{x_j,\pi_j} / (\sum_{x_j} N'^k_{x_j,\pi_j})$, where $N'^k_{x_j,\pi_j} = N^k_{x_j,\pi_j} + \alpha_{x_j,\pi_j}$ as before (this is the M-step). Because in the first step there is no current estimate $\hat{\theta}^0$, an initial guess has to be used. Using the score to test convergence, this procedure reaches a stationary point, which is usually a local optimum of the problem, and may vary according to the initial guess $\hat{\theta}^0$.

In view of obtaining an estimate with high score, it is common to execute multiple runs of EM with distinct initial guesses and then to take the estimate with maximum score among them. However, it is often the case that many distinct estimates have very similar scores, and simply selecting the one with the highest score is clearly an over-simplified decision, because equal (or almost equal) scores cannot be used as a criterion to find the best estimate.

2.1 Entropy

A high score is not the only target when seeking a good estimate. There are many estimates that lie within a tiny distance from the global maximum and can be as good as or better than the global one. A simple alternative approach is to pick the parameter estimate with maximum entropy among those that have a high score [6]. Therefore, a possible criterion is

$$\hat{\theta} = \operatorname*{argmax}_{\theta} \left( - \sum_{j=1}^{n} \sum_{\pi_j} \sum_{x_j} \theta_{x_j|\pi_j} \log \theta_{x_j|\pi_j} \right), \qquad (2)$$

subject to $S_\theta \geq c \cdot s^*$, for a given $0 \leq c \leq 1$, where $s^*$ is the maximum score. The optimization of Eq. (2) computes the maximum entropy distribution with the guarantee that its score is high (within a small difference from the maximum value). Eq. (2) is referred to as local entropy in [10]. Maximum entropy is used because it leads to the most conservative (least informative) distribution over the set of all estimates that achieve a score as good as $c \cdot s^*$. The constraint that bounds the score is not as simple as it reads: in fact, it is necessary to use all the equations that define the score function to force it to be greater than a certain value, if one wants to use a non-linear optimization suite. On the other hand, a simple implementation of the entropy idea can be obtained from the many runs of the EM method: among the estimates returned by the different EM runs that achieve a high enough score (compared to the maximum obtained one), select the estimate with maximum entropy. This differs from the usual maximum entropy inference in that it first checks for high-score estimates and then maximizes entropy among them. Given the usually great number of parameters to estimate in a Bayesian network, restricting ourselves only to those estimates that have exactly equal score is undesirable: usually only the top-scoring estimate would be left.
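In code, this EM-based version of the criterion takes only a few lines. The sketch below is ours (using the same hypothetical CPT layout as the earlier sketch); since the scores are negative log-values, "at least 95% of the highest score" is read here as being within 5% of its magnitude, an interpretation of ours.

```python
# Entropy-based selection among multi-start EM runs (a sketch, not the
# authors' implementation). `runs` is a list of (score, cpts) pairs.
import numpy as np

def estimate_entropy(cpts):
    """Sum of the entropies of all local conditional distributions (Eq. 2)."""
    return sum(-(probs * np.log(probs)).sum()
               for table in cpts.values() for probs in table.values())

def select_by_entropy(runs, c=0.95):
    s_star = max(s for s, _ in runs)                # best score over the runs
    threshold = s_star - (1.0 - c) * abs(s_star)    # within (1-c)|s*| of it
    survivors = [cpts for s, cpts in runs if s >= threshold]
    return max(survivors, key=estimate_entropy)     # most entropic survivor
```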
A related maximum entropy approach is shown in [12], with two main differences: (i) they focus on a continuous scenario with a compact parametrization, e.g. using mean/variance or other similar features of the data, which results in more estimates that equally fit the data (our setting has categorical data, and there are tens to hundreds of parameters to estimate); (ii) they force their estimator to have likelihood precisely equal to the value of the maximum score (which in our case would correspond to using $c = 1$). We note that they have also extended their idea to a so-called regularized version, which includes a penalty in the entropy function; this is shown to become similar to the MAP estimation. We work differently, by allowing some variation in the score without an extra penalization for that purpose, as we consider all estimates with high score as equally good (they are later discriminated by their entropy). In a BN with more than a couple of variables, the number of parameters to estimate quickly becomes large, and there is only a very small (or no) region of the parameter space with estimates that achieve the very same global maximum value. However, a feasibility region defined by a small percentage away from the maximum score is enough to produce a whole region of estimates, indicating that the region of high-score estimates is almost (but not exactly) flat. This is expected in a high-dimensional parameter space.

2.2 Bayesian Model Averaging

A BMA-based approach can also be used to overcome the model uncertainty and overfitting problems. BMA is applied to the alternative estimates returned by the various EM runs in order to obtain a final single estimate. The rationale is as follows: if we consider as query the variable $X_j$ given its parents, the posterior probability distribution $p(X_j \mid \pi_j)$ returned by an inference corresponds to the distribution specified in the conditional probability table of the BN for $X_j$. To answer this query using BMA, we use the estimates identified by each run of EM: we average the inferences returned by the models identified by each EM run, using weights proportional to the score achieved by them (strictly, BMA requires averaging using as weights the posterior probability of each model, obtained as the product of its prior probability and its marginal likelihood, which involves averaging over the parameters of the models). We repeat this query for each $j$ and each combination of $\pi_j$; eventually we instantiate a single BN, by setting $\hat{\theta}_{x_j|\pi_j} = p(x_j \mid \pi_j)$, where $p(x_j \mid \pi_j)$ is the BMA-averaged inference. In practice, this can be done by simply averaging the coefficients of the conditional probability tables. Thus, BMA is locally applied to estimate each conditional probability distribution; we denote by $\hat{\theta}$ the estimate obtained in this way. Note that the inferences returned by $\hat{\theta}$ match those produced by standard BMA (which always computes the answer by querying all the estimated models and then averaging the returned values) only for queries on $X_j$ given its parents, not for more general queries. However, it is generally not possible to obtain a summary estimate that exactly matches the inferences produced by the standard usage of BMA. In fact, standard BMA can be seen as an ensemble of BNs; to get a single estimate, one would have to average the joint distributions of these BNs. This would produce a representation of the joint probability distribution that is not guaranteed to factorize according to the original BN structure; moreover, it would be very demanding from the computational viewpoint. An exception exists for naive structures [3], while our BMA-based approach can be used to estimate the parameters of any BN.
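A sketch of this local averaging follows (ours, with the same hypothetical CPT layout as before); the weights are taken as a softmax of the log-scores, in the spirit of "proportional to the score" noted above.

```python
# Local BMA over the EM runs (a sketch). Each run contributes its CPTs,
# weighted by exp(score - best_score); tables are averaged entrywise.
import numpy as np

def bma_estimate(runs):
    """`runs`: list of (log_score, cpts) pairs with identical CPT layouts."""
    scores = np.array([s for s, _ in runs])
    w = np.exp(scores - scores.max())
    w /= w.sum()                                    # normalized model weights
    averaged = {}
    for var, table in runs[0][1].items():
        averaged[var] = {}
        for pa in table:
            mix = sum(wi * cpts[var][pa] for wi, (_, cpts) in zip(w, runs))
            averaged[var][pa] = mix / mix.sum()     # guard rounding drift
    return averaged
```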
3 Experiments with EM as Underlying Engine

In order to compare the entropy and BMA approaches, we perform experiments using different network structures (Asia, Alarm and randomly generated networks), sample sizes (n=100, n=200) and percentages of missing data mp (mp=30%, mp=60%). We also include the maximum-score approach in the experiments to serve as baseline, which we refer to as MAP (this is similar to a penalized LL, as defined in Sec. 2). A triple <network structure, n, mp> identifies a setting. For each setting, we perform 300 experiments, where each experiment is organized as follows: a) random draw of the parameters of the reference network; b) sampling of n complete instances from the reference network; c) application of an MCAR missingness process, which turns each value of the instances into missing with probability mp (MCAR, or missing completely at random, indicates that the probability of each value being missing depends neither on the value itself nor on the values of other variables); d) execution of 30 runs of EM from different initializations, using the MAP-based score; e) choice of the estimate using MAP, entropy and BMA.

To evaluate the quality of the estimates, we measure the KL divergence between the joint distribution represented by the reference network and the estimated networks (denoted as the joint metric). To measure the quality of the inferences produced by the models obtained with the different criteria, it is necessary to select queries of interest. Having seen in experiments that both BMA and entropy yield consistently better estimates than MAP, we have chosen a query that could mitigate the differences among methods, in order to conservatively assess the advantage (in terms of inferences) of both BMA and entropy over MAP. In particular, we query the marginal joint distribution of all leaf nodes, without any evidence set in the network (the leaf metric). This requires marginalizing out all non-leaf variables, so it involves all variables in the computation. Because of that, local "mistakes" in the estimates can compensate each other, making it harder to assess differences among methods. This is a desired characteristic if one wants to understand how strong the difference among the ideas is.

Since KL divergences are not normally distributed, we analyze the results through the non-parametric Friedman test with a significance level of 1% (which is reasonably strict). To prevent issues from multiple comparisons, we perform the post-hoc analysis of the test via Tukey's Honestly Significant Difference. Hence, the analysis yields a rank of methods for each setting and each metric. As for entropy, we choose the maximum entropy estimate among those whose score was at least as high as 95% of the highest score.
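For the small networks considered here, the joint metric can be computed by brute force; a sketch of ours, reusing the hypothetical CPT layout from Section 2, assuming both networks share the same (known) structure:

```python
# KL(reference || estimate) between the joint distributions of two BNs
# with the same structure, by enumerating all joint configurations.
import itertools
import numpy as np

def joint_prob(cpts, parents, x):
    p = 1.0
    for var, table in cpts.items():
        pa = tuple(x[u] for u in parents[var])
        p *= table[pa][x[var]]
    return p

def joint_kl(ref_cpts, est_cpts, parents, domains):
    names = list(domains)
    kl = 0.0
    for values in itertools.product(*(range(domains[v]) for v in names)):
        x = dict(zip(names, values))
        p = joint_prob(ref_cpts, parents, x)
        if p > 0.0:
            kl += p * np.log(p / joint_prob(est_cpts, parents, x))
    return kl
```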
3.1 ASIA Network

This set of experiments uses the structure of the Asia network [9]. In all settings (shown in Table 1) and for both metrics (joint and leaf), the Friedman test returned the rank: 1) BMA; 2) entropy; 3) MAP. To better understand the quantitative difference among them, we report in Table 1 the relative medians of KL divergence, namely the medians of BMA and entropy in a certain task divided by the median obtained by MAP in the same task. This allows us to see their quantitative improvement over MAP.

              n=100                  n=200
         mp=30%      mp=60%      mp=30%      mp=60%
        BMA  entr.  BMA  entr.  BMA  entr.  BMA  entr.
joint   0.90 0.96   0.79 0.90   0.92 0.96   0.81 0.91
leaf    0.93 0.92   0.87 0.86   0.98 0.99   0.92 0.89

Table 1: Relative medians of KL divergence with the ASIA network, i.e., medians of BMA and entropy divided by the median of MAP. Smaller numbers indicate better performance; in particular, values smaller than 1 indicate a smaller median than MAP. Each cell corresponds to 300 experiments.

The improvement of the median over MAP ranges, depending on the task, from 2% to 29% for BMA and from 1% to 14% for entropy. Moreover, the improvement is consistent, occurring in all settings. Interestingly, the differences in performance increase when the learning task is more challenging. For instance, the advantage of BMA over entropy, and of both BMA and entropy over MAP, increases with the percentage of missing data. Conversely, the greater the sample size, the easier the learning task; thus the performance of the methods is more similar for n=200 than for n=100, even though the differences remain statistically significant in all cases. As a result of the conservative design of the query, the differences among methods are generally less apparent in the leaf metric than in the joint metric. Overall, these experiments indicate BMA as the best option, while both BMA and entropy provide significantly better performance than MAP, both in the parameter estimates and in the inferences. An insight into the reason why entropy outperforms MAP is given by Figure 1, which clearly shows that a higher MAP score does not necessarily imply a better estimate; instead, when dealing with estimates of high MAP score, entropy is more discriminative than the MAP score and also has a stronger correlation with the KL divergence.

[Figure 1: Relation between KL divergence, entropy (y-axis) and MAP score (x-axis); darker points represent lower KL divergence between true and estimated joint distributions. The figure refers to one thousand EM runs performed on an incomplete training set of 200 samples.]
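For completeness, a small sketch of the omnibus part of the statistical analysis described in Section 3, assuming the 300 paired KL values per method have already been collected (the values below are placeholders; the Tukey-HSD post-hoc is left to a statistics package of choice):

```python
# Friedman test over the paired KL divergences of the three methods
# across the 300 experiments of one setting (placeholder data).
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
kl_map = rng.gamma(2.0, 0.5, size=300)              # fabricated placeholders
kl_entropy = kl_map * rng.uniform(0.8, 1.0, size=300)
kl_bma = kl_map * rng.uniform(0.7, 0.95, size=300)

stat, p = friedmanchisquare(kl_map, kl_entropy, kl_bma)
print(f"Friedman statistic={stat:.2f}, p-value={p:.4f}")

# average ranks (lower KL = better = rank 1), used to order the methods
values = np.column_stack([kl_map, kl_entropy, kl_bma])
ranks = np.argsort(np.argsort(values, axis=1), axis=1) + 1
print("mean ranks (MAP, entropy, BMA):", ranks.mean(axis=0))
```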
3.2 ALARM Network

The ALARM network has 37 nodes and 8 leaves [1]. Again, we consider mp of 30% and 60%, and sample sizes n of 100 and 200. As for the joint metric, in all settings the rank was: 1) BMA; 2) entropy; 3) MAP. As for the leaf metric, in all settings but one the rank was: 1) BMA & entropy; 2) MAP. The relative medians of KL divergence, reported in Table 2, show larger differences on the joint metric than on the leaf one; only a slight difference exists between BMA and entropy in the latter case, while the advantage of both ideas over MAP is clearer. On the joint metric, BMA is by far the best-performing approach. Also in this case, the differences among the ideas are emphasized when mp increases or the sample size decreases. As before, BMA achieves the best results: it provides the better parameter estimates, and it is at least as good as entropy for inferences. BMA and entropy consistently outperform MAP in the quality of both parameter estimates and inferences.

              n=100                  n=200
         mp=30%      mp=60%      mp=30%      mp=60%
        BMA  entr.  BMA  entr.  BMA  entr.  BMA  entr.
joint   0.85 0.93   0.79 0.88   0.89 0.93   0.82 0.89
leaf    0.96 0.95   0.94 0.94   0.98 0.97   0.97 0.96

Table 2: Relative medians of KL divergence for experiments with the ALARM network. Each cell regards 300 experiments.

3.3 Randomly generated networks

In the case of randomly generated structures, the experimental procedure described in Section 3 also includes the generation of the random structure, which is accomplished before drawing the parameters. Given two variables X_i and X_j, an arc from X_i to X_j is randomly included with probability 1/3 if i < j (no arc is included if j >= i, which ensures that the graph is acyclic). Furthermore, the maximum number of parents is set to 4, and the number of categories per variable ranges from 2 to 4 (randomly chosen as well). After the graph is generated, the experiments proceed as before (see Table 3). On the joint metric, we always obtain the rank 1) BMA; 2) entropy; 3) MAP. On the leaf metric, we obtain that same rank in two settings, and the rank 1) BMA & entropy; 2) MAP in the other two settings. Thus, there is a consistent superiority of BMA over entropy in estimating parameters, although this does not always imply a superiority on the leaf queries. Both BMA and entropy are superior to MAP in all the cases. In summary, these experiments indicate BMA as the best choice: it consistently yields the best parameter estimates, and on the leaf queries it is either the best idea or it ties with entropy. Overall, BMA and entropy are consistently better than MAP, both on the joint and on the leaf metric.

              n=100                  n=200
         mp=30%      mp=60%      mp=30%      mp=60%
        BMA  entr.  BMA  entr.  BMA  entr.  BMA  entr.
joint   0.78 0.94   0.75 0.89   0.82 0.92   0.80 0.89
leaf    0.89 0.92   0.88 0.88   0.96 0.97   0.95 0.92

Table 3: Relative medians for experiments with randomly generated networks with 20 nodes.

4 Experiments using Continuous Optimization

The maximum entropy criterion, applied to the selection of the parameter estimates, maximizes entropy while guaranteeing the score to exceed a certain threshold. So far, we have applied this idea on top of the multi-start EM, which selects only among the estimates returned by the different EM runs. This approach identifies parameter estimates which are better than those from MAP, but usually worse than BMA's estimates. An alternative way to implement the idea of maximum entropy is to solve the non-linear optimization problem directly. This requires maximizing Eq. (2) subject to the constraint that the score is only marginally smaller than the best score, allowing a more fine-grained way to select the estimate than looking at the solutions identified by the different EM runs. This approach is referred to in the following as C-entropy. The reason to analyze this situation is that EM runs tend to return local optima of the score function; the maximum entropy estimate, however, might well be a non-optimum in terms of score. Because of that, even if we increase the number of EM runs and use many more initializations (a situation which we have tested), the entropy idea is still confined to stationary points of the score function, which is only an approximation of a true maximum entropy approach. Therefore, we perform experiments with two specific network structures with the aim of understanding whether the entropy idea improves by using a continuous optimization method instead of EM. For ease of exposition, we focus only on the joint metric, as this is the metric in which entropy is consistently inferior to BMA.
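A minimal sketch of the C-entropy optimization on a toy X -> Y network (both binary, X occasionally missing), using SciPy's SLSQP solver. The model, the data and the reading of the score threshold (within 5% of the maximum's magnitude, since log-scores are negative) are assumptions of ours, not the paper's dedicated solver.

```python
# C-entropy on a toy BN X -> Y: maximize the entropy of the parameters
# subject to the (penalized) score staying close to the best score found.
import numpy as np
from scipy.optimize import minimize

# theta = [P(X=1), P(Y=1|X=1), P(Y=1|X=0)]; x == -1 marks a missing value
data = [(1, 1), (1, 1), (0, 0), (-1, 1), (-1, 0), (1, 0), (0, 1), (-1, 1)]

def score(theta):
    """Penalized log-likelihood of the incomplete data (MAP-like score)."""
    px, py1, py0 = theta
    s = 0.0
    for x, y in data:
        if x == 1:
            s += np.log(px) + np.log(py1 if y == 1 else 1.0 - py1)
        elif x == 0:
            s += np.log(1.0 - px) + np.log(py0 if y == 1 else 1.0 - py0)
        else:                       # X missing: marginalize it out, as in Eq. (1)
            py = px * py1 + (1.0 - px) * py0
            s += np.log(py if y == 1 else 1.0 - py)
    for p in theta:                 # ESS-one Dirichlet-like penalty
        s += 0.5 * (np.log(p) + np.log(1.0 - p))
    return s

def neg_entropy(theta):
    """Negative sum of the entropies of the local distributions (Eq. 2)."""
    return sum(p * np.log(p) + (1.0 - p) * np.log(1.0 - p) for p in theta)

bounds = [(1e-6, 1.0 - 1e-6)] * 3
rng = np.random.default_rng(0)

# step 1: multi-start local search for the best score s*
best = max((minimize(lambda t: -score(t), rng.uniform(0.2, 0.8, 3),
                     bounds=bounds) for _ in range(10)),
           key=lambda r: -r.fun)
s_star = -best.fun

# step 2: maximum entropy subject to the score being within 5% of |s*|
threshold = s_star - 0.05 * abs(s_star)
res = minimize(neg_entropy, best.x, method="SLSQP", bounds=bounds,
               constraints=[{"type": "ineq",
                             "fun": lambda t: score(t) - threshold}])
print("max-score estimate:", np.round(best.x, 3), " score:", round(s_star, 3))
print("C-entropy estimate:", np.round(res.x, 3), " score:", round(score(res.x), 3))
```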
4.1 Experiments with BN1

Figure 2 shows the structure of the first set of experiments. Variables A (binary) and B (ternary) have uniform distributions and are always observed; variables U, E and T are binary (assuming states true and false); the value of T is defined by the logical relation T = E AND U. Variable T is always observed, while U and E are affected by the missingness process: both U and E are observed if and only if T is true. Therefore, E and U are either both observed and positive, or non-observed. The missingness process is MAR, because given T (always observed) the probability of U and E being missing does not depend on their values (more precisely, this missingness process is MAR, as required by EM, but not MCAR; for a discussion of the different kinds of missingness, see for instance [7, Sec. 19.1.2]). Under the chosen conditions, E and U were missing in about 85% of the sampled instances. We assume the conditional probabilities of T to be known, thus focusing on the difficulty of learning the probabilities of nodes U and E.

[Figure 2: Network BN1, with nodes A, B, U, E, T; the nodes affected by the missingness process (U and E) have a grey background.]

For this network, we also developed a solver which identifies the global maximum of the MAP score; technically, the solver maps the learning problem into a polynomial programming one (more details are given in the supplementary material). We use the global solver in two different ways: to compute C-entropy, thus ensuring the estimate to have a MAP score close to the global maximum (we allowed a tolerance of 1%); and to compute the global MAP estimate on its own, referred to as global MAP. It is interesting to compare global MAP with MAP in order to assess the impact of the local solver (that is, EM) on the quality of the estimates. To the best of our knowledge, there is no previous analysis with global exact solvers for learning BNs from incomplete samples.

The same experimental procedure of Section 3 is used, considering sample sizes in {100, 300} and performing 300 experiments for each setting. For both sample sizes, the Friedman test on the joint metric returns the following rank: 1) BMA; 2) C-entropy; 3) entropy; 4) MAP; 5) global MAP. The relative medians were, respectively: for n=100, 0.27, 0.33, 0.37, 1, 1.5; for n=300, 0.14, 0.25, 0.35, 1, 1.2. These results can be commented on from several viewpoints. First, C-entropy significantly improves over entropy, almost reaching the same quality as BMA. Second, the estimates found by the global solver were worse than those of MAP; in fact, the penalized MAP function offers only partial protection against overfitting: it is indeed less prone to overfitting than the log-likelihood, but an estimate which maximizes the MAP score can still be heavily affected by overfitting. The value of the MAP score achieved by the global solver was only slightly higher (around 2.5%) than that achieved by the local maximum identified by the multi-start EM. Third, BMA and the two entropy implementations outperform MAP, confirming the results of the previous experiments.
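To illustrate BN1's missingness mechanism, a small sketch of ours follows; the probabilities of U and E are placeholders (the paper draws all parameters at random, and A and B are omitted here for brevity), chosen so that roughly 85% of the instances end up with U and E missing, as in the paper.

```python
# Sampling from the core of BN1 and applying its MAR mask: T = E and U,
# T is always observed, while U and E are revealed only when T is true.
import numpy as np

rng = np.random.default_rng(0)

def sample_bn1(n, p_u=0.4, p_e=0.4):
    """Return n instances (u, e, t), with None marking a hidden value."""
    rows = []
    for _ in range(n):
        u = bool(rng.random() < p_u)
        e = bool(rng.random() < p_e)
        t = u and e                       # deterministic CPT of T
        if t:
            rows.append((u, e, t))        # U, E observed iff T is true
        else:
            rows.append((None, None, t))  # hidden, but MAR given T
    return rows

data = sample_bn1(1000)
frac_missing = np.mean([r[0] is None for r in data])
print(f"fraction of instances with U, E missing: {frac_missing:.2f}")
```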
4.2 Experiments with BN3

Network BN3 has structure A -> B -> C. We considered two different configurations of the number of states per node: 5-3-5 (meaning A and C with 5 categories and B with 3) and 8-4-8 (A and C with 8 categories and B with 4). In both cases, we made B randomly missing in 85% of the instances. Thus, despite the simple structure of the network, the learning task is interesting, since there are many missing data and a moderately high number of states. The two configurations require estimating from incomplete samples respectively 2·5 + 4·3 = 22 and 8·3 + 7·4 = 52 parameters; to these numbers one should add the marginals of A, which are however learned from complete samples, and whose estimate is thus identical for all methods. We fixed n to 300 for the 5-3-5 configuration and to 500 for the 8-4-8 one.

In this case, the performance of C-entropy was extremely good. We obtained, for both settings, the following rank in the Friedman test: 1) C-entropy; 2) BMA; 3) entropy; 4) MAP. Global MAP was not run, because the number of free unknowns in the continuous optimization problem was too high to reach the global solution in reasonable time. The relative medians for the joint metric are (numbers given in the order C-entropy, BMA, entropy, MAP): for the 5-3-5 configuration, 0.71, 0.78, 0.90, 1; for the 8-4-8 configuration, 0.68, 0.78, 0.92, 1. These experiments confirm that, in order to get the most out of the entropy criterion, it is much more effective to use a dedicated solver than to apply it on top of the multi-start EM. Here, C-entropy is even better than BMA, so its implementation for general settings, and further analyses, are intended for the near future.

5 Conclusions

This paper suggests that maximizing (penalized) likelihood or MAP scores is not the best choice to learn the parameters of a Bayesian network. In particular, a high score is necessary but not sufficient for a good estimate. To improve estimation, we propose: (i) a BMA approach that averages over the estimates learned in different runs of EM, and (ii) a maximum entropy criterion applied over the estimates with high scores. The entropy idea can be implemented on top of EM or using a dedicated non-linear solver, which allows a more fine-grained choice of estimates. Both BMA and entropy can be promptly integrated into any EM implementation at virtually no cost in terms of implementation and running time; the non-linear solver for entropy, instead, requires some additional implementation effort. Thorough experiments show that the presented ideas significantly improve the quality of estimates when compared to the standard maximum penalized likelihood and MAP approaches.
If EM is used as the optimization engine, then BMA yields by far the best results, followed by entropy and MAP. If the dedicated non-linear solver is used, entropy performs as well as BMA, or even better. Moreover, for a specific network, we developed a global solver for the MAP estimation. We showed that its score is only slightly higher than the maximum identified by the EM runs, and yet it yields worse estimates. This corroborates the other results, indicating that the usual scores do suffer from overfitting and/or model uncertainty.

References

[1] I. A. Beinlich, H. J. Suermondt, R. M. Chavez, and G. F. Cooper. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Second European Conference on Artificial Intelligence in Medicine, volume 38, pages 247-256, 1989.

[2] J. Cerquides and R. Lopez de Mantaras. Robust Bayesian linear classifier ensembles. In Proc. European Conference on Machine Learning (ECML) 2005, volume 3720 of Lecture Notes in Computer Science, pages 72-83. Springer, 2005.

[3] D. Dash and G. F. Cooper. Exact model averaging with naive Bayesian classifiers. In Proc. of the 19th International Conference on Machine Learning, pages 91-98, 2002.

[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.

[5] J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky. Bayesian model averaging: a tutorial. Statistical Science, 14(4):382-417, 1999.

[6] E. T. Jaynes. On the rationale of maximum-entropy methods. Proc. IEEE, 70(9):939-952, 1982.

[7] D. Koller and N. Friedman. Probabilistic Graphical Models. MIT Press, 2009.

[8] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics & Data Analysis, 19(2):191-201, 1995.

[9] S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), 50(2):157-224, 1988.

[10] T. Lukasiewicz. Credal networks under maximum entropy. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 363-370. Morgan Kaufmann Publishers Inc., 2000.

[11] G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley, New York, 1997.

[12] S. Wang, D. Schuurmans, F. Peng, and Y. Zhao. Combining statistical language models via the latent maximum entropy principle. Machine Learning, 60(1-3):229-250, 2005.
