Learning Undirected Graphical Models with Structure Penalty
Authors: Shilin Ding
Shilin Ding (sding@stat.wisc.edu)
Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, USA
April 26, 2011

Abstract

In undirected graphical models, learning the graph structure and learning the functions that relate the predictive variables (features) to the responses given the structure are two topics that have been widely investigated in machine learning and statistics. Learning graphical models in two stages will have problems because the graph structure may change after considering the features. The main contribution of this paper is the proposed method that learns the graph structure and the functions on the graph at the same time. General graphical models with binary outcomes conditioned on predictive variables are proved to be equivalent to the multivariate Bernoulli model. The reparameterization of the potential functions in the graphical model by conditional log odds ratios in the multivariate Bernoulli model offers an advantage in the representation of the conditional independence structure in the model. Additionally, we impose a structure penalty on groups of conditional log odds ratios to learn the graph structure. These groups of functions are designed with overlaps to enforce hierarchical function selection. In this way, we are able to shrink higher order interactions to obtain a sparse graph structure. Simulation studies show that the method is able to recover the graph structure. The analysis of county data from the Census Bureau gives interesting relations between unemployment rate, crime and others discovered by the model.

1. Introduction

In undirected graphical models (Markov Random Fields), a graph G is defined as G = (V, E), where V = {1, ..., K} is the set of nodes and E ⊂ V × V is the set of links between the nodes.
In fact, V is associated with a set of multivariate response variables Y_1, ..., Y_K, and E specifies the conditional independence structure among them. For example, a link between two nodes i, j indicates a pairwise interaction f_{ij}, and a clique among three nodes i, j, k indicates a third order interaction f_{ijk}. These functions formulate the effects of the predictive variables (features), X, on the responses and their interactions. Graphical models have been used in many applications. Conditional Random Fields (CRF) (Lafferty et al., 2001) and their extensions, e.g. dynamic CRF (Sutton et al., 2007), are well known in the Natural Language Processing community. CRFs achieve great success on sequentially structured text by modeling the interaction of the labels (Y) on the nodes conditioned flexibly on the features (X). There are also numerous applications of graphical models in computer vision (Szeliski et al., 2007; Honorio and Samaras, 2010). The Ising model is another classical example that draws great interest in image processing (Williams et al., 2004) and has also recently been applied to social networks (Banerjee et al., 2008). The intuition behind utilizing a graph structure is that some responses are related while others are not. However, in many cases the graph is pre-determined by domain knowledge. For example, Duan et al. (2008) proposed a collective model for labeling music signals with a fully connected graph, which they called collective conditional random fields. They have 10 semantic categories such as genre (blues, rap, ...), instrument (guitar, piano, ...), production (studio, live), rhythm (strong, weak, middle), etc. It is possible that some links are not necessary, e.g. between production and instrument. Estimating the parameters on these relations will lead to over-fitting.
Therefore, graph structure learning is an important aspect of relation discovery in multivariate response applications and multi-task learning. Many papers have focused on the graphical model selection issue. Meinshausen and Buhlmann (2006) and Peng et al. (2009) studied sparse covariance estimation of Gaussian outcomes (Speed and Kiiveri, 1986) without input features. The covariance matrix determines the dependence structure in the Gaussian distribution, and its sparsity specifies the linkage in Gaussian Markov Random Fields. This is not the case for non-elliptical distributions, such as the distribution of discrete random variables. Ravikumar et al. (2010) focused on graph structure selection in the Ising model based on l1-regularized logistic regression. It gave sufficient conditions for consistently estimating the neighborhood of the nodes, without input features. However, two marginally independent response variables may become dependent after conditioning on X. So, ignoring the predictive variables may lead to inconsistent estimation of the graph structure. To the best of our knowledge, no previous work has addressed the issue of learning the graph structure and the functions associated with the graph at the same time. In this paper, our first contribution is the proof of the equivalence between the general graphical model with binary outcomes and the multivariate Bernoulli (MVB) model. The functions that represent the effects of the predictive variables on the responses and their interactions (at all levels) can be formulated in the MVB model, which is endowed with the advantage of interpreting the graph structure. It follows from the sparsity of links in graphical models that some functions are constant zero, which means certain responses are conditionally independent.
Therefore, we impose the structure penalty on groups of functions with overlaps to obtain a sparse estimate of the graph structure. These groups are designed to enforce sparsity on the functions and to shrink higher order interactions, so that they only appear after their lower order components have entered the model.

The paper is organized as follows: Section 2 introduces graphical models and their relation with the multivariate Bernoulli model. Sections 3 and 4 discuss the model for learning the graph structure and the functions on the graph through the structure penalty. The experiments are shown and discussed in Section 5. Section 6 gives the conclusion and future work.

2. Conditional Independence in Graphical Models

The notations in this paper are summarized in Table 1.

Table 1: Notations

  ||·||        L2 norm
  n            Sample size
  p            Number of covariates
  K            Number of responses/outputs
  Y            K-dimensional response/output
  Ω            The set {1, 2, ..., K}
  Ψ_K          Power set of Ω except the empty set
  m            The highest level of interaction considered in the model; in this paper m = K
  q            Number of f_ω in the assumption; in this paper q = 2^K − 1
  ω, κ, v      Elements of Ψ_K used for indexing
  y_ω(i)       y_ω(i) = ∏_{k∈ω} y_k(i)
  Y(i)         Augmented responses (y_1(i), ..., y_Ω(i))
  c            Model parameters
  p̃            (p + 1) · q, the dimension of c
  T_v          T_v = {ω | v ⊂ ω} is the subgraph rooted at v containing all its descendants
  f_{T_v}      f_{T_v} = (f_ω), ω ∈ T_v
  J(f_{T_v})   Penalty on f_{T_v}
  p_v          Weight for the penalty on structure T_v
  s_v, r_v     Subgradients of λJ(f_{T_v}) for the v-th group
  S_ω(y; x)    S_ω(y; x) = Σ_{κ⊂ω} y_κ f_κ

In graphical models, G = (V, E), the distribution of the multivariate discrete random variables Y_1, ..., Y_K is:

    p(Y_1 = y_1, ..., Y_K = y_K | X) = (1/Z(X)) ∏_{C∈𝒞} Φ_C(y_C; X)    (1)

where Z is the normalization factor.
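As a concrete illustration of the factorization in (1), the sketch below enumerates a four-node binary graph with maximal cliques {1,2,3} and {3,4} (the structure of Figure 1(a)), normalizes the product of potentials, and numerically checks the Markov property stated later in this section. The potential values are made up for the sketch, not taken from the paper.

```python
import itertools
import math

# Potentials on the two maximal cliques; illustrative values only.
def potential_123(y1, y2, y3):
    return math.exp(0.5 * (y1 * y2 + y1 * y3 + y2 * y3))

def potential_34(y3, y4):
    return math.exp(0.8 * y3 * y4)

states = list(itertools.product([0, 1], repeat=4))
unnorm = {y: potential_123(y[0], y[1], y[2]) * potential_34(y[2], y[3])
          for y in states}
Z = sum(unnorm.values())                      # normalization factor Z(X) in (1)
p = {y: u / Z for y, u in unnorm.items()}

def prob(**fixed):
    """Marginal probability of the assignments in `fixed`, e.g. y1=1, y3=0."""
    idx = {'y1': 0, 'y2': 1, 'y3': 2, 'y4': 3}
    return sum(pr for y, pr in p.items()
               if all(y[idx[k]] == v for k, v in fixed.items()))

# Markov property: Y1 and Y4 are conditionally independent given Y3,
# since Y3 blocks the path between them.
lhs = prob(y1=1, y4=1, y3=1) / prob(y3=1)
rhs = (prob(y1=1, y3=1) / prob(y3=1)) * (prob(y4=1, y3=1) / prob(y3=1))
```

Because the joint factorizes as Φ_{123}(y_1, y_2, y_3) Φ_{34}(y_3, y_4), conditioning on Y_3 separates (Y_1, Y_2) from Y_4, so `lhs` and `rhs` agree exactly.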
The distribution is factorized according to the cliques in the graph. A clique C ⊂ Ω = {1, ..., K} is the set of nodes of a fully connected subgraph. Φ_C(y_C; X) is the potential function on C. It depends on y_C = {y_i | i ∈ C} and the predictive variables X, which are shared across all response variables. One example application is to model the relations among multiple clinical responses (hypertension, diabetes, etc.) and how they are affected by a person's genetic variables and environmental variables (smoking, income, etc.). For the purpose of efficient computation, 𝒞 is usually the set of all maximal cliques of the graph. A maximal clique is a clique that is not properly contained in any other clique in the graph. Different representations, with 𝒞 a set of non-maximal cliques, can be converted to the maximal clique representation by redefinition of the potential functions (Wainwright and Jordan, 2008). Furthermore, 𝒞 does not have to reflect the graph structure, as long as it is sufficient. For example, the most general choice for any given graphical model is 𝒞 = {Ω}. The conditional independence between the response variables is implicitly formulated by the restrictions on the potential functions; see Theorem 2.2 and Example 2.1 for details. The Markov property states that any two nodes not in a clique are conditionally independent given the other nodes. For example, Y_s, Y_t are conditionally independent given all other variables that block the path from Y_s to Y_t. Therefore, 𝒞 as the set of maximal cliques factorizes the graph and specifies the conditional independence in the model. In Figure 1(a), we have 2 cliques, {1, 2, 3} and {3, 4}. In this case, {Y_2, Y_4} are conditionally independent given Y_3, and so are {Y_1, Y_4}.

Figure 1: Graphical Model Examples.
(a) Model 1: the triangle clique {Y_1, Y_2, Y_3} indicates a third order interaction. Additionally, there is a pairwise interaction between Y_3 and Y_4. Y_4 is conditionally independent of Y_1 or Y_2 given Y_3. (b) Model 2: Y_5, Y_6 are another set of interacting random variables, which are independent of the other 4 nodes. (c) Model 3: Y_5, Y_6, Y_7, Y_8 form a 4-node clique that is connected to Y_1, Y_2, Y_3 through Y_4. (d) Model 4: Y_4, ..., Y_8 form a strongly connected component, and Y_9, Y_10 are independent of the other nodes.

Given the graph structure, the potential functions are convenient for characterizing the distribution on the graph. However, if the graph is unknown in advance, estimating the potential functions does not give direct inference of the graph structure, because there are different representations with different choices of cliques in the same graph, as mentioned above. Even if we assume the most general case, where 𝒞 = {Ω}, the conditional independence between the nodes cannot be represented by Φ_C in a simple form (see Example 2.1), which makes the learning of the graph structure difficult. The multivariate Bernoulli (MVB) model can represent a graphical model whose nodes are Bernoulli random variables, i.e. Y_i = 0 or 1, and the parameterization in the MVB model is suitable for learning the graph structure. We will show later that it is equivalent to the graphical model (1) with binary outcomes. The distribution of the MVB model is:

    p(Y_1 = y_1, ..., Y_K = y_K | X)
      = exp{ y_1 f_1 + ... + y_K f_K + ... + y_1 y_2 f_{1,2} + ... + y_1 ... y_K f_{1,...,K} − b(f) }
      = exp{ Σ_{ω=1}^{2^K−1} y_ω f_ω − b(f) }    (2)

Here we use the following notation. Let Ψ_K be the power set of Ω = {1, ..., K}. Counting the empty set, there are 2^K elements in Ψ_K.
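As a quick numerical check of the parameterization in (2), one can enumerate all 2^K outcomes, compute b(f) as the log normalizer, and confirm that the result is a proper distribution. A minimal sketch for K = 3, with made-up values for the f_ω (not fitted values):

```python
import itertools
import math

K = 3
# Psi_K minus the empty set: all nonempty subsets of {0, ..., K-1}.
subsets = [w for r in range(1, K + 1)
           for w in itertools.combinations(range(K), r)]
f = {w: 0.1 * len(w) for w in subsets}   # illustrative conditional log odds ratios

def log_unnorm(y):
    # sum over omega of y_omega * f_omega, where y_omega = prod_{k in omega} y_k
    return sum(f[w] for w in subsets if all(y[k] == 1 for k in w))

ys = list(itertools.product([0, 1], repeat=K))
b = math.log(sum(math.exp(log_unnorm(y)) for y in ys))   # b(f) in (2)
p = {y: math.exp(log_unnorm(y) - b) for y in ys}
```

Note that the all-zero outcome has log mass −b(f), since every y_ω vanishes there; this is the base category of the model.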
Where convenient in what follows, we will relabel these elements from 0 to 2^K − 1; e.g. for K = 3, we will use f_{1,2,3} and f_7 interchangeably without further specification. Because there are 2^K − 1 free parameters in (2), we denote Ψ_K = Ψ_K − {∅} for simplicity of notation. Let ω denote a set in Ψ_K, and define Y = (y_1, ..., y_ω, ..., y_Ω) to be the augmented response with

    y_ω = ∏_{i∈ω} y_i    (3)

and let f = (f_1, ..., f_ω, ..., f_Ω) be the vector of natural parameters, where f_ω(x) is the conditional log odds ratio (Gao et al., 2001) to be estimated:

    f_ω = log OR(Y_i, i ∈ ω | Y_j = 0, j ∉ ω; X)    (4)

Here, the odds ratios are calculated recursively:

    OR(Y_i) = P(Y_i = 1) / (1 − P(Y_i = 1)),    (5)

    OR(Y_i, i ∈ ω ∪ {k}) = OR(Y_i, i ∈ ω | Y_k = 1) / OR(Y_i, i ∈ ω | Y_k = 0),  for k ∉ ω    (6)

Later, we will call f_1, ..., f_K the main effects, while f_{1,2}, ..., f_{1,...,K} measure the interactions between the response variables. We are interested in the sparse estimation of f_ω, which lies in a Reproducing Kernel Hilbert Space (RKHS) H_ω with kernel K_ω (Wahba, 1990). Denote:

    S_ω(y; x) = Σ_{κ⊂ω} y_κ f_κ(x);    S_ω(x) = Σ_{κ⊂ω} f_κ(x);    (7)

Then the normalization factor is:

    exp(b(f(x))) = 1 + Σ_{ω∈Ψ_K} exp(S_ω(x))    (8)

And we have the following lemma:

Lemma 2.1 In the multivariate Bernoulli model, define the odd–even partition of the power set of ω as Ψ^ω_odd = {κ ⊂ ω | |κ| = |ω| − k, where k is odd}, and Ψ^ω_even = {κ ⊂ ω | |κ| = |ω| − k, where k is even}.
Note that |Ψ^ω_odd| = |Ψ^ω_even| = 2^{|ω|−1}. The natural parameters have the following property:

    f_ω = log [ ∏_{κ∈Ψ^ω_even} p(Y_i = 1, i ∈ κ, and Y_j = 0, j ∈ Ω − κ | X)
              / ∏_{κ∈Ψ^ω_odd} p(Y_i = 1, i ∈ κ, and Y_j = 0, j ∈ Ω − κ | X) ]    (9)

and

    exp(S_ω) = p(Y_i = 1, i ∈ ω, and Y_j = 0, j ∈ Ω − ω | X) / P(Y_i = 0, i ∈ Ω | X)    (10)

The equivalence between graphical models and the MVB model is given in the following theorem.

Theorem 2.2 A graphical model of the general form (1) with 0/1 nodes is equivalent to the multivariate Bernoulli model (2). And the following are equivalent:
1. There is no |C|-order interaction in {Y_i, i ∈ C}.
2. There is no clique C ∈ Ψ_K in the graph.
3. f_ω = 0 for all ω such that C ⊂ ω.

A proof is given in Appendix A. Theorem 2.2 states that there is a clique C in the graphical model if there is ω ⊃ C with f_ω ≠ 0 in the MVB model. The conditional independence specified in the graphical model can be fully formulated by the MVB model.

Example 2.1 For a graph with K nodes, the parameters in the GM are {Φ_ω | ω ∈ Ψ_K}, where Φ_ω = Φ_Ω(Y_i = 1, i ∈ ω, and Y_j = 0, j ∈ Ω − ω) is the potential function. We usually restrict Φ_∅ = 1 to make the model identifiable, so there are 2^K − 1 free parameters. Similarly, there are also 2^K − 1 free parameters in the MVB model, (f_1, ..., f_Ω). When K = 2, Ω = {1, 2}, 𝒞 = {Ω}; define Φ_11 to be the potential function Φ_Ω(Y_1 = 1, Y_2 = 1; X) for simplicity, and define Φ_10, Φ_01, Φ_00 similarly.
The probability distribution under the GM parameterization is

    p(Y_1 = 1, Y_2 = 1 | X) = Φ_11 / Z,    p(Y_1 = 1, Y_2 = 0 | X) = Φ_10 / Z,
    p(Y_1 = 0, Y_2 = 1 | X) = Φ_01 / Z,    p(Y_1 = 0, Y_2 = 0 | X) = Φ_00 / Z.

The relation between the GM and the MVB model is

    f_1 = log(Φ_10 / Φ_00),    f_2 = log(Φ_01 / Φ_00),    f_{1,2} = log( (Φ_11 · Φ_00) / (Φ_01 · Φ_10) )

Note that independence between Y_1 and Y_2 implies

    f_{1,2} = 0,  or equivalently  log( (Φ_11 · Φ_00) / (Φ_01 · Φ_10) ) = 0    (11)

Therefore, sparseness in the conditional log odds ratios of the MVB model gives a direct inference of the graph structure. This property does not apply to the GM.

3. Structure Penalty

In many applications, the assumption on graphical models is that the graph structure has few large cliques. By Theorem 2.2, this is equivalent to sparsity in the higher order interactions of the MVB model. So we will impose a sparse penalty on the dependence structure to shrink higher order interactions. Let y(i) = (y_1(i), ..., y_K(i)), x(i) = (x_1(i), ..., x_p(i)) be the i-th data point. The augmented representation of the multivariate responses is:

    Y(i) = (y_1(i), ..., y_ω(i), ..., y_Ω(i))    (12)

There are |Ψ_K| = 2^K − 1 components in total.¹ Denote the number of components by q. In this paper, we consider learning the full model, where q = |Ψ_K|. Suppose each function f_ω is in a Reproducing Kernel Hilbert Space (RKHS) H_ω with kernel K_ω (Wahba, 1990). The general penalized log likelihood model is:

    min I_λ(f) = L(f) + λJ(f)    (13)

where the first term is the negative log-likelihood:

    L(f) = Σ_{i=1}^n [ −Y(i)^T f(x(i)) + b(f) ]    (14)

and the second term, J(·), is the structure penalty. The objective is to obtain a sparse estimate of the cliques via the structure penalty on f. Consider the pairwise links.
No link between Y_s and Y_t in the graphical model means f_ω = 0 for all ω ⊃ {s, t}. For example, in Figure 1(b), Y_1 and Y_4 being conditionally independent means f_{1,4}, f_{1,2,4}, f_{1,3,4}, f_{1,2,3,4} are all zero. This objective is similar to sparse covariance matrix estimation for neighborhood selection with the lasso in Gaussian data (Meinshausen and Buhlmann, 2006). However, our model deals with higher order covariance structures that do not exist in Gaussian data. In addition, we consider not only the graph structure of the responses Y alone, but also the functions of the predictive variables X on Y. To satisfy this intuition, the penalty is designed to shrink large cliques in the graph. Suppose that in the true model there is no interaction on clique C; then all f_ω with C ⊂ ω should be zero. The penalty is designed to shrink such f_ω to zero. The idea can be viewed as group lasso with overlaps. Group lasso (Yuan and Lin, 2006) leads to selection of variables in groups. It gives consistent estimation when the groups are exclusive and their union is the whole set. Jacob et al. (2009) considered the penalty on groups with arbitrary overlaps. Zhao et al. (2009) set up the general framework for hierarchical variable selection with overlapping groups, which we adopt here for the functions. We consider the penalty guided by the structure in Figure 2. The guiding graph T has 2^K − 1 nodes: 1, ..., ω, ..., Ω. With some abuse of notation, we use the elements of Ψ_K to index the nodes of T. There is an edge from ω_1 to ω_2 if and only if ω_1 ⊂ ω_2 and |ω_1| + 1 = |ω_2|. Domain knowledge can be applied here to design a different guiding structure. Jenatton et al. (2009) discussed how to define the groups to achieve different nonzero patterns.

1. In applications with large graphs, we only consider interactions up to the m-th order.
We truncate higher order interactions and obtain Σ_{k=1}^m C(K, k) functions.

Figure 2: Hierarchical Guiding Structure for the Penalty

Let T_v = {ω ∈ Ψ_K | v ⊂ ω} be the subgraph rooted at v in T, including all the descendants of v. Denote f_{T_v} = (f_ω), ω ∈ T_v. All the functions are categorized into groups with overlaps as G = (T_1, ..., T_Ω). The structure penalty on the group T_v of functions is:

    J(f_{T_v}) = p_v sqrt( Σ_{ω∈T_v} ||f_ω||²_{H_ω} )    (15)

where p_v is the weight for the penalty on T_v, chosen empirically as 1/|T_v|. Then the objective function is:

    min_f I_λ(f) = L(f) + λ Σ_v p_v sqrt( Σ_{ω∈T_v} ||f_ω||²_{H_ω} )    (16)

The following theorem shows that by solving objective (16), f_{ω_1} will enter the model before f_{ω_2} if ω_1 ⊂ ω_2. That is to say, if f_{ω_1} does not exist, there will be no higher order interaction on ω_2. The proof is given in Appendix A.

Theorem 3.1 Objective (16) is convex, thus the minimum is attainable. Let ω_1, ω_2 ∈ Ψ_K and ω_1 ⊂ ω_2. If f* is the minimizer, that is, 0 ∈ ∂I_λ(f*), the subgradient of I_λ at f*, then f*_{ω_2} = 0 almost surely if f*_{ω_1} = 0.

Example 3.1 If K = 3, f = (f_1, f_2, f_3, f_{1,2}, f_{1,3}, f_{2,3}, f_{1,2,3}). The group of functions at node 1 in Figure 2 is f_{T_1} = (f_1, f_{1,2}, f_{1,3}, f_{1,2,3}).
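The overlapping groups of Example 3.1 can also be generated programmatically. The sketch below builds T_v = {ω | v ⊆ ω} for K = 3 and evaluates the penalty (15) with the weights p_v = 1/|T_v|; the function norms are placeholder numbers, not fitted values.

```python
import itertools
import math

K = 3
# Nonempty subsets of {1, ..., K}, i.e. Psi_K.
Psi = [frozenset(w) for r in range(1, K + 1)
       for w in itertools.combinations(range(1, K + 1), r)]

# T_v = all omega containing v; the groups overlap by construction.
groups = {v: [w for w in Psi if v <= w] for v in Psi}

norms = {w: 1.0 for w in Psi}   # stand-ins for ||f_omega||_{H_omega}

def penalty(lam=1.0):
    total = 0.0
    for v, Tv in groups.items():
        p_v = 1.0 / len(Tv)                     # empirical weight from the text
        total += lam * p_v * math.sqrt(sum(norms[w] ** 2 for w in Tv))
    return total

# The group rooted at node 1 collects f_1, f_{1,2}, f_{1,3}, f_{1,2,3},
# matching Example 3.1.
T1 = sorted(groups[frozenset({1})], key=lambda s: (len(s), sorted(s)))
```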
The objective is:

    min −l(y, f) + λ [ p_1 sqrt(||f_1||² + ||f_{1,2}||² + ||f_{1,3}||² + ||f_{1,2,3}||²)
                     + p_2 sqrt(||f_2||² + ||f_{1,2}||² + ||f_{2,3}||² + ||f_{1,2,3}||²)
                     + p_3 sqrt(||f_3||² + ||f_{1,3}||² + ||f_{2,3}||² + ||f_{1,2,3}||²)    (17)
                     + p_4 sqrt(||f_{1,2}||² + ||f_{1,2,3}||²) + p_5 sqrt(||f_{1,3}||² + ||f_{1,2,3}||²)
                     + p_6 sqrt(||f_{2,3}||² + ||f_{1,2,3}||²) + p_7 sqrt(||f_{1,2,3}||²) ]

Algorithm 1: Proximal Linearization Algorithm
  Input: c_0, α_0, ζ > 1, tol > 0
  repeat
    Choose α_k ∈ [α_min, α_max]
    Solve Eq. (19) for d_k
    while δ_k = I_λ(c_k) − I_λ(c_k + d_k) < ||d_k||³ do   // insufficient decrease
      Set α_k = max(α_min, ζ α_k)
      Solve Eq. (19) for d_k
    end while
    Set α_{k+1} = α_k / ζ
    Set c_{k+1} = c_k + d_k
  until δ_k < tol

4. Parameter Estimation

In this paper, we focus on the situation where the ω-th function space is H_ω = {1} ⊕ H_ω^1, where {1} refers to the constant function space and H_ω^1 is an RKHS with a linear kernel. The function f_ω ∈ H_ω has the form f_ω(x) = c_{ω0} + Σ_{j=1}^p c_{ωj} x_j. Its norm is ||f_ω||²_{H_ω} = ||c_ω||². Here, we denote c_ω = (c_{ω0}, ..., c_{ωp})^T ∈ R^{p+1}, a vector of length p + 1, and let c = (c_ω)_{ω∈Ψ_K} ∈ R^p̃ be the concatenated vector of all parameters, of length p̃ = (p + 1) · q. Hence, objective (16) is now:

    min_c I_λ(c) = L(c) + λ Σ_v p_v ||c_{T_v}||    (18)

where c_{T_v} = (c_ω)_{ω∈T_v} is a (p + 1)·|T_v| vector. To solve (18), we iteratively solve the following proximal linearization problem (Wright, 2010):

    min_c L_k + ∇L_k^T (c − c_k) + (α_k/2) ||c − c_k||² + λJ(c)    (19)

where L_k = L(c_k) and α_k is a positive scalar chosen adaptively. With a slight abuse of notation, we use c_k to denote the vector of all parameters at the k-th step. Algorithm 1 summarizes the framework for solving (18).
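A schematic version of Algorithm 1 is sketched below. To keep the sketch self-contained we use a quadratic loss and a plain l1 penalty, whose proximal step (the analogue of subproblem (19)) has a closed form; the paper's overlapping group penalty instead requires the dual solver of Section 4. The step-size management mirrors the pseudocode: grow α until the decrease test passes, then relax it by the factor ζ.

```python
import numpy as np

def prox_l1(z, t):
    # Soft-thresholding: closed-form proximal operator of t * ||.||_1.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def objective(c, A, y, lam):
    return 0.5 * np.sum((A @ c - y) ** 2) + lam * np.sum(np.abs(c))

def prox_linearize(A, y, lam=0.1, alpha=1.0, zeta=2.0, tol=1e-10,
                   alpha_min=1e-4, alpha_max=1e8, max_iter=1000):
    c = np.zeros(A.shape[1])
    for _ in range(max_iter):
        grad = A.T @ (A @ c - y)
        while True:
            # Solve the linearized subproblem (the analogue of (19)).
            d = prox_l1(c - grad / alpha, lam / alpha) - c
            delta = objective(c, A, y, lam) - objective(c + d, A, y, lam)
            if delta >= np.linalg.norm(d) ** 3 or alpha >= alpha_max:
                break
            alpha = min(alpha_max, zeta * alpha)   # insufficient decrease
        c = c + d
        alpha = max(alpha_min, alpha / zeta)       # relax the step size
        if delta < tol:
            break
    return c
```

With A = I the minimizer of the surrogate problem is simply the soft-thresholding of y, which makes a convenient sanity check for the loop.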
Following the analysis in Wright (2010), we can show that the proximal linearization algorithm converges for the negative log-likelihood loss function plus group-lasso-type penalties with overlaps. However, solving group lasso with overlaps is not trivial, due to the non-smoothness at the singular point. In recent years, several papers have addressed this problem. Jacob et al. (2009) duplicated the variables in the design matrix that appear in group overlaps, then solved the problem as group lasso without overlaps. Kim and Xing (2010) reparameterized the group norm with additional dummy variables. They alternately optimized the model parameters and the dummy variables at each iteration. The method performs efficiently with a quadratic loss function on Gaussian data, but optimizing alternately over two sets of parameters might not scale well on penalized logistic regression. In this paper, we solve (19) via its smooth and convex dual problem, proposed by Liu and Ye (2010). Let Z = {v ∈ Ψ_K | ||c_{T_v}|| = 0}, and let Z̄ = Ψ_K − Z be its complement. Define s_v, v ∈ Ψ_K, as:

    s_v ∈ S_v = { s = (s_ω)_{ω∈Ψ_K} | s ∈ R^p̃, ||s|| ≤ λ p_v, s_ω = 0 if ω ∉ T_v }    (20)

Then the subgradient of (19) is:

    ∇L + α_k (c − c_k) + Σ_{v∈Z} s_v + Σ_{u∈Z̄} r_u    (21)

where s_v is the subgradient of λ p_v ||c_{T_v}|| for v ∈ Z, and r_u is the subgradient for u ∈ Z̄:

    r_u = arg max_{s_u ∈ S_u} ⟨s_u, c⟩, for u ∈ Z̄    (22)

The subgradient s_v lies in a ball of a certain subspace of R^p̃. These subspaces are not perpendicular to each other. Thus the s_v's are not separable, and a closed form solution of (19) cannot be obtained. We solve the proximal subproblem (19) via its smoothed, convex dual problem.
Note that (19) is equivalent to:

    min_{c∈R^p̃} max_{S∈𝒮} φ(c, S) = ∇L_k^T (c − c_k) + (α_k/2) ||c − c_k||² + Σ_v ⟨s_v, c⟩    (23)

where S is a p̃ × q matrix whose columns are the s_v, and 𝒮 = {S | S = (s_1, ..., s_v, ..., s_Ω), s_v ∈ S_v for v ∈ Ψ_K} is the feasible region of S. Since φ(·, S) is lower semicontinuous and φ(c, ·) is upper semicontinuous, there exists a saddle point, and the max and min are exchangeable. The solution minimizing φ(c, S) is:

    c̃ = arg min_c φ(c, S) = c_k − (1/α_k) ∇L_k − (1/α_k) Σ_v s_v    (24)

Substituting c̃ back into (23), we obtain the dual problem of (19):

    max_{S∈𝒮} η(S) = −(1/2) || Σ_v s_v ||² + (α_k c_k − ∇L_k)^T Σ_v s_v    (25)

Following the proof in Liu and Ye (2010), we can show that η(S) is convex and Lipschitz continuous. The differential is α_k c̃ e^T, where e ∈ R^p̃ is a vector of ones. Hence, (25) can be solved by existing gradient methods. We use the accelerated gradient descent method implemented in Liu et al. (2009).

5. Experiments

5.1 Simulation Study

Simulations are performed to evaluate the learning accuracy of our method. The graphs of the response variables are depicted in Figure 1. In the simulation, we assume the most general distribution family for each graphical model according to its graph structure. For example, in Model 1 we have 4 response variables, (Y_1, Y_2, Y_3, Y_4). There is a triangular clique (Y_1, Y_2, Y_3), but Y_4 is independent of the other response variables. In this case, we have 15 conditional log odds ratios to estimate. In the true model, the non-zero functions are {f_1, f_2, f_3, f_{1,2}, f_{1,3}, f_{2,3}, f_{1,2,3}, f_4}. In Model 3, there are 255 functions and 25 of them are nonzero. In Model 4, there are 1023 functions and 25 of them are nonzero. The predictive variables X = (X_1, ..., X_5) are independently generated from a multivariate Gaussian distribution with mean 0 and variance 1. Each f_ω has 6 parameters to estimate (one of them the intercept). These parameters, c_{ωj}, j = 1, ..., 5, are uniformly sampled from {−5, −4, ..., 5}. We set the intercepts c_{ω0} of the main effects to 1, and those of second or higher order interactions to 2. Each Y is randomly selected proportional to the probability in equation (2), where f = Xc. We generate 100 datasets for each graph structure in Figure 1 to evaluate the learning accuracy. The sample size of each dataset is 1000. The tuning parameter λ is chosen by two different tuning methods: 1) GACV (generalized approximate cross validation); 2) BGACV (B-type GACV). The details of the tuning methods are discussed in Appendix B.

Table 2: Number of True Positive and False Positive Functions

                  f_{1,2}  f_{1,3}  f_{2,3}  f_{3,4}  f_{1,2,3}  f_{5,7,8}  f_{5,6,7,8}   FP
  Model 1 GACV      100      100      100      99        89          -           -       123
          BGACV      83       89       86      68        17          -           -         5
  Model 2 GACV      100       99      100      99        83          -           -       321
          BGACV      92       93       92      83        34          -           -        46
  Model 3 GACV       95       89       79      98        57         77          31       394
          BGACV      37       18       21      92         0         36           0       143
  Model 4 GACV       91       98       96      88        59         49           -       654
          BGACV      70       72       69      66         0          0           -       134

In Table 2 we count, for each function f_ω, the number of runs out of the 100 replications in which f_ω is recovered (||c_ω|| ≠ 0). Recovered functions in the true model are counted as true positives, while those not in the true model are false positives. Since the main effects are always detected correctly, they are not listed in the table. The structure penalty is efficient at recognizing strong interactions between the responses, such as the interaction between Y_1 and Y_2. But its performance on higher order interactions degrades in more complex graph structures, e.g. f_{1,2,3} in Models 3 and 4.
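The data-generating scheme described above can be sketched for Model 1 (K = 4, p = 5): draw the coefficients as in the text, then sample each multivariate response with probability proportional to Eq. (2) with f = Xc. The random seed and the specific draws are of course arbitrary.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
K, p, n = 4, 5, 1000
# All 15 nonempty subsets of {0, ..., 3} (0-indexed responses).
subsets = [w for r in range(1, K + 1)
           for w in itertools.combinations(range(K), r)]
# True model of Model 1: main effects f_1..f_4 plus the triangle clique.
nonzero = {(0,), (1,), (2,), (3,), (0, 1), (0, 2), (1, 2), (0, 1, 2)}

coef = {}
for w in subsets:
    if w in nonzero:
        c0 = 1.0 if len(w) == 1 else 2.0          # intercepts as in the text
        cj = rng.choice(np.arange(-5, 6), size=p).astype(float)
        coef[w] = np.concatenate(([c0], cj))
    else:
        coef[w] = np.zeros(p + 1)                 # zero functions

def sample_y(x):
    xa = np.concatenate(([1.0], x))
    f = {w: coef[w] @ xa for w in subsets}
    ys = list(itertools.product([0, 1], repeat=K))
    logits = np.array([sum(f[w] for w in subsets if all(y[k] for k in w))
                       for y in ys], dtype=float)
    probs = np.exp(logits - logits.max())         # stabilized Eq. (2)
    probs /= probs.sum()
    return ys[rng.choice(len(ys), p=probs)]

X = rng.standard_normal((n, p))
Y = np.array([sample_y(x) for x in X])
```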
Compared to GACV, BGACV tends to give sparser results in general, because it places a larger penalty on the degrees of freedom of the estimated model. Conversely, GACV discovers more true positive functions at the cost of a higher false positive rate. In Figure 3, we show the learning results in terms of true positive rate (TPR) with sample size increasing from 100 to 1000. The experimental setting is the same as before.

[Figure 3: four panels, (a) Model 1 through (d) Model 4; x-axis: sample size (100 to 1000), y-axis: true positive rate (0.5 to 1.0); one curve per tuning method, GACV and BGACV.]

Figure 3: True positive rate (TPR) of graph structure learning with sample size increasing from 100 to 1000, for the two tuning methods GACV and BGACV.

The TPR curves (GACV) have converged for Models 1 and 2 after 500 samples. But for Models 3 and 4, a larger sample size is required to achieve convergence, since there are 1023 functions to estimate. BGACV is more conservative in selecting non-zero functions. The TPRs improve as the sample size increases. Compared to Models 1 and 2, the algorithm needs a substantially larger sample size to achieve a high TPR on Models 3 and 4. GACV achieves a better true positive rate on all four graphical models; this tuning method obtains less sparse models than BGACV.

5.2 Census Bureau County Data

We use the county data from the U.S. Census Bureau² to validate our method. It includes demographic data for all counties in the United States, covering population, employment, votes (Scammon et al., 2005), etc. We delete the counties which have missing values in the columns we are interested in, leaving 2668 data entries in total. The outcomes for this study are summarized in Table 3.

2. http://www.census.gov/statab/www/ccdb.html

Table 3: Selected Response Variables

  Response   Description                                            Positive %
  Vote       2004 votes for the Republican presidential candidate     81.11
  Poverty    Persons in poverty                                       52.70
  VCrime     Violent crime, e.g. murder, robbery                      23.09
  PCrime     Property crime, e.g. burglary                             6.82
  URate      Unemployment rate                                        51.35
  PChange    Population change in percentage from 2000 to 2006        64.96

"Vote" is coded as 1 if the Republican candidate won in the 2004 presidential election. To dichotomize the remaining response variables, the national mean is selected as the threshold. The third column of the table gives the percentage of positives in the data. The features in the model are: percentage change in housing units; government expenditure; population percentage of 3 ethnic groups (White, African, Asian), of people foreign born, of people over 65, of people under 18, of people with a high school education, and of people with a bachelor's degree; birth rate; and death rate. In the experiment, we first standardize the data to mean 0 and variance 1. Then we can obtain an estimated graphical model for every fixed λ. Adjusting the regularization parameter λ from 0.3289 to 0.1389, we discover new interactions entering the model. The graph structure for λ = 0.1559 (chosen by cross validation) is shown in Figure 4. The first number on an edge indicates the order in which the link entered the model, while the second is the corresponding λ. The unemployment rate plays an important role as a hub in the graph. It is strongly related to poverty and crimes. Population change is negatively related to violent crimes, as well as to unemployment rate.

[Figure 4: graph over the six responses (Vote, URate, PChange, Poverty, VCrime, PCrime), with edges labeled by recovery order and the corresponding λ, the first recovered edge being Vote–PChange at λ = 0.3289.]

Figure 4: Correlations of the responses in the county data from the Census Bureau.
The fi rst num b er on the edge sho ws the ord er at whic h the link is reco v ered. The n umb er in br ac k et denotes the corresp ond ing λ . 13 W e analyze the lin k b etw een “V ote” and “PChange”, whic h is reco ve red b y our method b efore any other links. The marginal correlation b et w een them (without conditioning on X ) is 0 . 0389, w hic h is the second smallest absolute v alue in the correlation matrix. The partial correlat ion m etho d (P eng et al., 20 09 ) is tak en as an example to sho w ho w the links of resp onse v ariables are disco vered without considering X . The lin k b et ween “V ote” and “PChange” is the tenth reco v ered link usin g R pac k age sp ac e . The reason is that after taking features in to accoun t, the dep endence stru cture of resp onse v ariables may change. The main contribution in this case is “p ercenta ge of housing unit change” ( X 1 ) a n d “p opulation p ercenta ge of p eople o ver 65” ( X 2 ). Part of the fi tted mo del is sho wn b elow: f V ote = 0 . 0463 · X 1 + 0 . 0877 · X 2 + · · · f P C hang e = 0 . 2315 · X 1 − 0 . 0942 · X 2 + · · · f V ote,P C h ang e = 0 . 0211 · X 1 − 0 . 0115 · X 2 + · · · The main effects suggest th at with increase of housing units, the coun ties tend to increase in p opulation and v ote f or Republican. With in crease of p eople o v er 65, the coun ties tend to lose p opulation, but still more lik ely to v ote f or Repub lican. The in teraction function rev eals that as housing u nits increase, the coun ties are more lik ely to ha v e b oth p ositive results f or “V ote” and “PChange”. But this tenden cy will b e coun teracted b y the increase of p eople o ver 65: the resp onses are less lik ely to tak e b oth p ositiv e v alues. 6. Conclusions and F uture W or k The structure p enalty on the multiv ariate Bernoulli mo del can efficien tly learn th e graph structure, which indicate s th e conditional indep end ence of the resp onse v ariables. 
In this paper, we only consider linear models for the conditional log odds ratios. The method can be extended to the smoothing spline ANOVA model, with more freedom in choosing the kernels, and Theorem 3.1 holds naturally. It is also interesting to see how the penalty improves the prediction power compared to large margin methods. Another extension is the selection of features for each f_ω. Sparsity in each function requires sparsity within each group, but the graph structure would change with different selected features. One application is to discover the relations among multiple symptoms or clinical responses and how they are affected by environmental and genetic covariates. Smoking could be significant for many diseases and their interactions, while other covariates, such as taking vitamins, might be related to only a subset of the symptoms. As a result, we will investigate sparse penalties within each function.

Acknowledgments

Research is supported in part by NIH Grant EY09946, NSF Grant DMS-0906818 and ONR Grant N0014-09-1-0655.

References

O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9:485-516, 2008.

Z. Duan, L. Lu, and C. Zhang. Collective annotation of music from multiple semantic categories. In Proceedings of the 9th International Conference on Music Information Retrieval, pages 237-242, Philadelphia, USA, 2008.

F. Gao, G. Wahba, R. Klein, and B. Klein. Smoothing spline ANOVA for multivariate Bernoulli observations, with application to ophthalmology data. Journal of the American Statistical Association, 96(453):127, 2001.

J. Honorio and D. Samaras. Multi-task learning of Gaussian graphical models. In Proceedings of the 27th International Conference on Machine Learning, pages 447-454, Haifa, Israel, 2010.

L. Jacob, G. Obozinski, and J.P. Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 433-440, 2009.

R. Jenatton, J.Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. arXiv:0904.3523, 2009.

S. Kim and E.P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In Proceedings of the 27th International Conference on Machine Learning, pages 543-550, Haifa, Israel, 2010.

J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pages 282-289, 2001.

J. Liu and J. Ye. Fast overlapping group lasso. arXiv:1009.0306v1, 2010.

J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009. URL http://www.public.asu.edu/~jye02/Software/SLEP.

Xiwen Ma. Penalized Regression in Reproducing Kernel Hilbert Spaces With Randomized Covariate Data. PhD thesis, Department of Statistics, University of Wisconsin-Madison, 2010.

N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436-1462, 2006.

J. Peng, P. Wang, N. Zhou, and J. Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735-746, 2009.

P. Ravikumar, M.J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using l1-regularized logistic regression. Annals of Statistics, 38(3):1287-1319, 2010.

R.M. Scammon, A.V. McGillivray, and R. Cook.
America Votes 26: 2003-2004, Election Returns By State. CQ Press, 2005. ISBN 9781568029740.

W. Shi, G. Wahba, S. Wright, K. Lee, R. Klein, and B. Klein. LASSO-Patternsearch algorithm with application to ophthalmology and genomic data. Statistics and its Interface, 1(1):137, 2008.

T.P. Speed and H.T. Kiiveri. Gaussian Markov distributions over finite graphs. Annals of Statistics, 14(1):138-150, 1986.

C. Sutton, A. McCallum, and K. Rohanimanesh. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. The Journal of Machine Learning Research, 8:693-723, 2007.

R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:1068-1080, June 2007.

G. Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990.

M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1:1-305, 2008.

O. Williams, A. Blake, and R. Cipolla. The Variational Ising Classifier (VIC) algorithm for coherently contaminated data. Advances in Neural Information Processing Systems, 17:1497-1504, 2004.

S.J. Wright. Accelerated block-coordinate relaxation for regularized optimization. Technical report, Department of Computer Science, University of Wisconsin-Madison, 2010.

D. Xiang and G. Wahba. A generalized approximate cross validation for smoothing splines with non-Gaussian data. Statistica Sinica, 6:675-692, 1996.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49-67, 2006.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468-3497, 2009.

Appendix A. Proofs

A.1 Proof of Theorem 2.2

Proof. Given graphical model (1), let y^κ_C be the realization of y_C such that y^κ_C = {y^κ_i | i ∈ C}, where y^κ_i = 1 if i ∈ κ and y^κ_i = 0 otherwise. Let the odd-even partition of the power set of ω be defined as in Lemma 2.1. The conditional log odds ratios in the MVB model are:

    f^\omega(x) = \log \frac{\prod_{\kappa \in \Psi^\omega_{even}} \prod_{C \in \mathcal{C}} \Phi_C(y^\kappa_C; x)}{\prod_{\kappa \in \Psi^\omega_{odd}} \prod_{C \in \mathcal{C}} \Phi_C(y^\kappa_C; x)}    (26)

    b(f) = \log \frac{Z(x)}{\prod_{C \in \mathcal{C}} \Phi_C(0; x)}

Conversely, given the MVB model (2), the cliques can be determined by the nonzero f^ω: a clique C = ω exists whenever f^ω ≠ 0. The maximal cliques can then be inferred from the graph structure; suppose they are C_1, ..., C_m. Let ω_i be such that ψ(ω_i) = C_i, i = 1, ..., m, let κ_1 = ∅, and let κ_i be such that ψ(κ_i) = C_i ∩ (C_{i-1} ∪ ··· ∪ C_1), i = 2, ..., m. Then one possible parameterization is:

    \Phi_{C_i}(y_{C_i}; x) = \exp\left( S^{\omega_i}(y; x) - S^{\kappa_i}(y; x) \right)    (27)

    Z(x) = \exp(b(f))

Therefore, graphical model (1) with binary outcomes is equivalent to the MVB model (2).

For the latter part of the theorem, 1 ⇒ 2 and 3 ⇒ 1 follow naturally from the Markov property of graphical models. To show 2 ⇒ 3, notice that whenever κ ∩ C = κ′ ∩ C, we have y^κ_C = y^{κ′}_C. For any possible v = κ ∩ C, every κ′ ∈ {κ | κ = v ∪ u, s.t. u ⊂ ω − v} gives κ′ ∩ C = v. There are 2^{|ω−v|} such κ′ in total, due to the choice of u. Moreover, they appear equally in the numerator and denominator of equation (26).
So, for any C ∈ 𝒞,

    \prod_{\kappa \in \Psi^\omega_{even}} \Phi_C(y^\kappa_C; x) = \prod_{\kappa \in \Psi^\omega_{odd}} \Phi_C(y^\kappa_C; x)    (28)

It follows from (26) that f^ω = 0.

A.2 Proof of Theorem 3.1

Proof. We give the proof for the linear case. The convexity of I_λ is easy to check, since L and J(f_{T_v}) are all convex in c. Suppose there is some ω_2 ⊃ ω_1 such that c*_{ω_2} ≠ 0. By the groups constructed through Figure 2, ‖c*_{T_v}‖ ≠ 0 for all v ⊂ ω_1, so the partial derivative of objective (18) with respect to c_{ω_1} satisfies

    \frac{\partial L}{\partial c_{\omega_1}} + \lambda \sum_{v \subset \omega_1} p_v \frac{c^*_{\omega_1}}{\| c^*_{T_v} \|} = 0    (29)

Hence, the probability of {∃ ω_2 ⊃ ω_1 s.t. c*_{ω_2} ≠ 0} equals the probability that ∂L/∂c_{ω_1} = 0, which is 0.

Appendix B. Tuning

For the i-th data point, we have:

    S^\omega_i = S^\omega(x(i))    (30)

    b_i = b(f(x(i))) = \log\left( 1 + \sum_\omega \exp S^\omega_i \right)    (31)

Then the mean of the augmented response Y(i) in the multivariate Bernoulli model is:

    \mu(i) = E[Y(i) \mid x(i), f] = (\mu_1(i), \cdots, \mu_\kappa(i), \cdots, \mu_\Omega(i))    (32)

where

    \mu_\kappa(i) = \frac{\partial b_i}{\partial f_\kappa} = \frac{\sum_{\omega \in T_\kappa} \exp S^\omega_i}{\exp b_i}    (33)

The q × q covariance matrix of the augmented response is:

    W(i) = \mathrm{var}(Y(i) \mid x(i), f)    (34)

where the (α, β)-th element of W(i) is:

    W_{\alpha,\beta}(i) = \frac{\partial^2 b_i}{\partial f_\alpha \partial f_\beta} = \frac{\sum_{\omega \in T_\alpha \cap T_\beta} \exp S^\omega_i}{\exp b_i} - \mu_\alpha(i) \mu_\beta(i)    (35)

Let R_v be the p̃ × p̃ diagonal matrix whose (i, i)-th element is 1 if the i-th coefficient of c belongs to some f_ω with ω ∈ T_v, and 0 otherwise. Then the v-th group penalty in (18) can be written as:

    J(f_{T_v}) = p_v \sqrt{ \sum_{\omega \in T_v} \| f_\omega \|^2 } = p_v \| R_v c \|    (36)

Noting that R_v is symmetric and R_v R_v = R_v, direct calculation yields the derivative and Hessian of the penalty term:

    \frac{\partial J}{\partial c} = \sum_{v: R_v c \neq 0} p_v \frac{R_v c}{\| R_v c \|}    (37)

    \frac{\partial^2 J}{\partial c \partial c^T} = \sum_{v: R_v c \neq 0} p_v J_v = \sum_{v: R_v c \neq 0} p_v \frac{R_v \left( \| R_v c \|^2 I - c c^T \right) R_v}{\| R_v c \|^3}    (38)

where J_v := R_v ( \| R_v c \|^2 I - c c^T ) R_v / \| R_v c \|^3.
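The group penalty (36) and its gradient (37) can be illustrated with a short numerical sketch. This is our own toy code, not the paper's implementation; the index sets in `groups` and the weights `p_v` in `weights` are made up, and `c` plays the role of the stacked coefficient vector, so that each selector matrix R_v is represented by an index array:

```python
# Illustrative sketch (ours) of the overlapping group penalty
# J(f_{T_v}) = p_v * ||R_v c|| and its gradient. Each R_v is a 0/1
# diagonal selector, represented here by an index array.
import numpy as np

def group_penalty(c, groups, weights):
    """Sum over groups v of p_v * ||R_v c||."""
    return sum(p * np.linalg.norm(c[idx]) for idx, p in zip(groups, weights))

def group_penalty_grad(c, groups, weights):
    """Gradient: sum over active groups of p_v * R_v c / ||R_v c||."""
    g = np.zeros_like(c)
    for idx, p in zip(groups, weights):
        norm = np.linalg.norm(c[idx])
        if norm > 0:               # only active groups contribute, as in (37)
            g[idx] += p * c[idx] / norm
    return g

# Toy example: 4 coefficients and two overlapping, hierarchy-style groups.
c = np.array([3.0, 4.0, 0.0, 0.0])
groups = [np.array([0, 1]), np.array([1, 2, 3])]   # group 2 overlaps group 1
weights = [1.0, 1.0]
J = group_penalty(c, groups, weights)       # ||(3,4)|| + ||(4,0,0)|| = 5 + 4 = 9
g = group_penalty_grad(c, groups, weights)
```

Because coefficient 1 is shared by both groups, it stays penalized even when the second group is otherwise zero; this overlap is what enforces the hierarchical shrinkage of higher order interactions.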
Denote the grand design matrix as

    D = \left( D(1)^T \cdots D(n)^T \right)^T    (39)

where D(i) is block-diagonal with q copies of x(i)^T:

    D(i) = \begin{pmatrix} x(i)^T & 0 & \cdots & 0 \\ 0 & x(i)^T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x(i)^T \end{pmatrix}    (40)

Suppose there are N nonzero elements of c, at locations {a_1, ..., a_N}. Let D̃ be the matrix composed of the a_1-th, ..., a_N-th columns of D. Then the Hessian matrix of I_λ in (16) is

    \frac{\partial^2 I_\lambda}{\partial c \partial c^T} = \frac{\partial^2 L}{\partial c \partial c^T} + \lambda \frac{\partial^2 J}{\partial c \partial c^T} = \tilde{D}^T W \tilde{D} + \lambda \sum_{v: R_v c \neq 0} p_v J_v    (41)

Let H be the nq × nq influence matrix satisfying

    f_{\lambda,\epsilon} - f_\lambda \approx H \epsilon    (42)

where ε is a small perturbation of Y, f_λ = D c_λ is the estimated function value with tuning parameter λ, and f_{λ,ε} is the estimated function value under the perturbation. A first order Taylor expansion of ∂I_λ/∂c (Y + ε, c_{λ,ε}) then leads to the following formulation of H (refer to Xiang and Wahba (1996) and Chapter 3 of Ma (2010) for more details):

    H = \tilde{D} \left( \frac{\partial^2 I_\lambda}{\partial c \partial c^T} \right)^{-1} \tilde{D}^T = \tilde{D} \left( \tilde{D}^T W \tilde{D} + \lambda \sum_{v: R_v c \neq 0} p_v J_v \right)^{-1} \tilde{D}^T    (43)

The (i, j)-th q × q submatrix of H is

    H(i, j) = \tilde{D}(i) \left( \frac{\partial^2 I_\lambda}{\partial c \partial c^T} \right)^{-1} \tilde{D}(j)^T    (44)

where D̃(i) denotes the q rows of D̃ corresponding to the i-th data point. Let Q(i) = I − H(i, i) W(i) for i = 1, ..., n, and define the generalized average matrix Q̄ of {Q(i), i = 1, ..., n} as the q × q matrix with δ on the diagonal and γ everywhere off the diagonal:

    \bar{Q} = (\delta - \gamma) I_{q \times q} + \gamma \, e e^T    (45)

where e is the vector of ones of length q and

    \delta = \frac{1}{nq} \sum_{i=1}^n \mathrm{tr}(Q(i)), \quad \gamma = \frac{1}{nq(q-1)} \sum_{i=1}^n \left( e^T Q(i) e - \mathrm{tr}(Q(i)) \right)    (46)

Let H̄ be the generalized average of {H(i, i), i = 1, ..., n}. The GACV score is

    GACV(\lambda) = OBS(\lambda) + \frac{1}{n} \sum_{i=1}^n Y(i)^T \bar{Q}^{-1} \bar{H} \left( Y(i) - \mu(i) \right)    (47)

where

    OBS(\lambda) = \frac{1}{n} \sum_{i=1}^n \left[ -Y(i)^T f_\lambda(x(i)) + b(f_\lambda(x(i))) \right]    (48)

is the observed negative log-likelihood.

The degrees of freedom of multivariate Bernoulli data are generally difficult to obtain.
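The "generalized average" matrix Q̄ of (46)-(47) admits a compact numerical sketch. The code below is our own illustration, not the paper's; `Qs` is a made-up list of q × q matrices standing in for the Q(i), and `e` is the vector of ones of length q:

```python
# Sketch (ours) of the generalized average matrix: a common diagonal
# value delta (average diagonal entry over all Q(i)) and a common
# off-diagonal value gamma (average off-diagonal entry).
import numpy as np

def generalized_average(Qs):
    n = len(Qs)
    q = Qs[0].shape[0]
    e = np.ones(q)
    delta = sum(np.trace(Q) for Q in Qs) / (n * q)
    # e^T Q e sums all entries; subtracting tr(Q) leaves the off-diagonals.
    gamma = sum(e @ Q @ e - np.trace(Q) for Q in Qs) / (n * q * (q - 1))
    return (delta - gamma) * np.eye(q) + gamma * np.outer(e, e)

# Toy example with two 2x2 matrices.
Qs = [np.array([[1.0, 0.5], [0.5, 1.0]]),
      np.array([[3.0, 1.5], [2.5, 3.0]])]
Qbar = generalized_average(Qs)   # delta = 2.0, gamma = 1.25
```

Replacing each Q(i) by this two-parameter average is what makes the single q × q inverse Q̄^{-1} in the GACV score cheap to compute.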
But we can obtain a good approximation from GACV (Shi et al., 2008) as

    \hat{df}(\lambda) = \sum_{i=1}^n Y(i)^T \bar{Q}^{-1} \bar{H} \left( Y(i) - \mu(i) \right)    (49)

So the BGACV score can be defined as

    BGACV(\lambda) = OBS(\lambda) + \frac{\log n}{2n} \sum_{i=1}^n Y(i)^T \bar{Q}^{-1} \bar{H} \left( Y(i) - \mu(i) \right)    (50)
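The two tuning scores differ only in the weight placed on the degrees-of-freedom-like term, which is the source of BGACV's extra sparsity noted in Section 5.1. The sketch below is ours; `obs` and `df_term` are placeholder numbers standing in for the observed negative log-likelihood OBS(λ) and the sum over i of Y(i)^T Q̄^{-1} H̄ (Y(i) − μ(i)):

```python
# Sketch (ours): GACV weights the complexity term by 1/n, while BGACV
# uses the BIC-flavored weight log(n)/(2n).
import math

def gacv_score(obs, df_term, n):
    return obs + df_term / n

def bgacv_score(obs, df_term, n):
    # log(n)/2 exceeds 1 once n > e^2 (about 7.4), so for any realistic
    # sample size BGACV penalizes model complexity more heavily.
    return math.log(n) / (2.0 * n) * df_term + obs

n, obs, df_term = 1000, 0.6, 40.0
g_score = gacv_score(obs, df_term, n)    # 0.6 + 40/1000 = 0.64
b_score = bgacv_score(obs, df_term, n)
```

With these placeholder values, the BGACV score exceeds the GACV score, which is consistent with BGACV selecting sparser models in the simulations.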