Inferring Multiple Graphical Structures



Julien Chiquet, Yves Grandvalet, Christophe Ambroise
Laboratoire Statistique et Génome, 523 Place des Terrasses, 91000 Évry
e-mail: [julien.chiquet,christophe.ambroise]@genopole.cnrs.fr; yves.grandvalet@utc.fr
url: http://stat.genopole.cnrs.fr

Abstract: Gaussian Graphical Models provide a convenient framework for representing dependencies between variables. Recently, this tool has received a high interest for the discovery of biological networks. The literature focuses on the case where a single network is inferred from a set of measurements, but, as wetlab data is typically scarce, several assays, where the experimental conditions affect interactions, are usually merged to infer a single network. In this paper, we propose two approaches for estimating multiple related graphs, by rendering the closeness assumption into an empirical prior or group penalties. We provide quantitative results demonstrating the benefits of the proposed approaches. The methods presented in this paper are embedded in the R package simone from version 1.0-0 and later.

Keywords and phrases: Network inference, Gaussian graphical model, Multiple sample setup, Coop-LASSO, Intertwined-LASSO.

Contents

1 Motivations
2 Network Inference with GGM
3 Inferring Multiple GGMs
  3.1 Intertwined Estimation
  3.2 Graphical Cooperative-LASSO
4 Algorithms
  4.1 Problem Decomposition
  4.2 Solving the Sub-Problems
  4.3 Implementation Details
5 Experiments
  5.1 Synthetic Data
    5.1.1 Simulation Protocol
    5.1.2 Experimental Setup
    5.1.3 Results
  5.2 Protein Signaling Network
6 Conclusion
Acknowledgments
A Proofs
  A.1 Derivation of the pseudo-log-likelihood
  A.2 Blockwise Optimization of the pseudo-log-likelihood
  A.3 Subdifferential for the Cooperative-LASSO
References

1. Motivations

Systems biology provides a large amount of data sets that aim to understand the complex relationships existing between the molecular entities that drive any biological process. Depending on the molecule of interest, various networks can be inferred, e.g., gene-to-gene regulation networks or protein-protein interaction networks. The basic idea is to consider that if two molecules interact, a statistical dependency between their expression should be observed.

A convenient model of multivariate dependence patterns is Gaussian Graphical Modeling (GGM). In this framework, a multidimensional Gaussian variable is characterized by the so-called concentration matrix, where conditional independence between pairs of variables is characterized by a zero entry.
This matrix may be represented by an undirected graph, where each vertex represents a variable, and an edge connects two vertices if the corresponding pair of random variables are dependent, conditional on the remaining variables.

Merging different experimental conditions from wetlab data is a common practice in GGM-based inference methods (Toh and Horimoto 2002, Schäfer and Strimmer 2005). This process enlarges the number of observations available for inferring interactions. However, GGMs assume that the observed data form an independent and identically distributed (i.i.d.) sample. In the aforementioned paradigm, assuming that the merged data is drawn from a single Gaussian component is obviously wrong, and is likely to have detrimental side effects in the estimation process.

In this paper, we propose to remedy this problem by estimating multiple GGMs, each of which matches different modalities of the same set of variables, which correspond here to the different experimental conditions. As the distributions of these modes have strong commonalities, we propose to estimate these graphs jointly. Considering several tasks at a time has been attracting much attention in the machine learning literature, where the generic problem is usually referred to as "multi-task learning" (Caruana 1997). Efron (2009) used the terms "indirect evidence" and "learning from the experience of others" for similar ideas. The principle is to learn an inductive bias, whose role is to stabilize the estimation process, hopefully enabling more accurate predictions in small sample size regimes (Baxter 2000). The techniques comprise the empirical and hierarchical Bayes methodologies and regularization schemes (see Argyriou et al. 2008, for example), which may be interpreted as approximations to the latter.
Here, we will mainly follow the penalty-based approach to leverage the inference of several related graphs towards a common pattern. A typical example of this problem arises when inferring gene interactions from data measured on slightly different stem cells, such as wild and mutant. It is reasonable to assume that most, though not all, interactions will be common to both types of cells. The line of attack we propose alleviates the difficulties arising from the scarcity of data in each experimental condition by coupling the estimation problems. Our first proposal biases the estimation of the concentration matrices towards a common value. Our second proposition focuses on the similarities in the sparsity pattern, which are more directly related to the graph itself. We propose the Cooperative-LASSO, which builds on the Group-LASSO (Yuan and Lin 2006) to favor solutions with a common sparsity pattern, but encodes a further preference towards solutions with similar sign patterns, thus preserving the type of co-regulation (positive or negative) across assays.

To our knowledge, the present work is the first to exploit the multi-task learning framework for learning GGMs. However, coupling the estimation of several networks has recently been investigated for Markov random fields, in the context of time-varying networks. Kolar et al. (2009) propose two specific constraints, one for smooth variations over time, the other one for abrupt changes. Their penalties are closer to the Fused-LASSO and total variation penalties than to the group penalties proposed here.

2. Network Inference with GGM

In the GGM framework, we aim to infer the graph of conditional dependencies among the p variables of a vector X from independent observations (X^1, ..., X^n). We assume that X is a p-dimensional Gaussian random variable X ~ N(0_p, Σ).
Let K = Σ^{-1} be the concentration matrix of the model; the non-zero entries K_ij indicate a conditional dependency between the variables X_i and X_j, and thus define the graph G of conditional dependencies of X.

The GGM approach produces the graph G from an inferred K. The latter cannot be obtained by maximum likelihood estimation, which would typically return a full matrix, and hence a useless fully connected graph. To produce sparse networks, Banerjee et al. (2008) propose to penalize the entries of K by an ℓ1-norm. Friedman et al. (2008) later addressed the very same problem with an elegant algorithm named the graphical-LASSO. Their well-motivated approach produces a sparse, symmetric and positive-definite estimate of the concentration matrix. However, a cruder though more direct approach has been reported to be more accurate in terms of edge detection (Villers et al. 2008, Rocha et al. 2008). This approach, proposed by Meinshausen and Bühlmann (2006) and known as neighborhood selection, determines G via an iterative estimation of the neighborhood of its nodes. For this purpose, it considers p independent ℓ1-penalized regression problems. Let X be the n × p matrix of stacked observations, whose kth row contains (X^k)^T. The vertices adjacent to vertex i are estimated by the non-zero elements of β solving

    min_{β ∈ R^{p-1}} (1/n) ||X_i − X_{\i} β||_2^2 + λ ||β||_1,    (1)

where X_i is the ith column of X and X_{\i} is X deprived of its ith column: the ith variable is "explained" by the remaining ones. As the neighborhoods of the p variables are selected separately, a post-symmetrization must be applied to manage inconsistencies between edge selections; Meinshausen and Bühlmann suggest AND or OR rules, which are both asymptotically consistent (as n goes to infinity).
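The neighborhood selection scheme is easy to sketch. The following illustration is ours, not the simone implementation: `lasso_cd` and `neighborhood_selection` are hypothetical helper names, the penalized regression (1) is solved by plain coordinate descent, and both symmetrization rules are supported.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for (1/n) * ||y - X b||_2^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding feature j, then soft-threshold.
            rho = X[:, j] @ (y - X @ b + X[:, j] * b[j]) / n
            b[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / col_sq[j]
    return b

def neighborhood_selection(X, lam, rule="AND"):
    """Regress each X_i on the remaining columns as in (1); an edge (i, j)
    is kept according to the AND or OR symmetrization rule."""
    p = X.shape[1]
    A = np.zeros((p, p), dtype=bool)
    for i in range(p):
        rest = [j for j in range(p) if j != i]
        A[i, rest] = lasso_cd(X[:, rest], X[:, i], lam) != 0
    return A & A.T if rule == "AND" else A | A.T
```

The AND rule keeps an edge only when both regressions select it, so its edge set is always a subset of the OR rule's.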
Solving the p regression problems (1) may be interpreted as inferring the concentration matrix in a penalized, pseudo maximum likelihood framework, where the joint distribution of X is approximated by the product of the p distributions of each variable conditional on the other ones (Rocha et al. 2008, Ambroise et al. 2009, Ravikumar et al. 2010), that is

    L(K|X) = Σ_{i=1}^p Σ_{k=1}^n log P(X_i^k | X_{\i}^k ; K_i),

where X_{\i}^k is the kth realization of the vector X deprived of its ith coordinate. Considering the Gaussian assumption on the generation of the data X, the pseudo-log-likelihood admits a compact and simple expression (see derivation in Appendix A.1):

    L(K|X) = (n/2) log det(D) − (n/2) Tr(D^{-1/2} K S K D^{-1/2}) − (np/2) log(2π),    (2)

where S = n^{-1} X^T X is the empirical covariance matrix, and D is a p × p diagonal matrix with elements D_ii = K_ii. In the sequel, it will be convenient to use the sufficiency of S for K, and, by a slight abuse of notation, write L(K|X) = L(K|S).

Following Banerjee et al. (2008), an ℓ1 penalty may be added to obtain a sparse estimate of K. Nevertheless, our approach to maximizing the pseudo-log-likelihood is much simpler than the optimization of the log-likelihood proposed by Banerjee et al. (2008). Indeed, as stated formally in the following proposition, maximizing the penalized pseudo-log-likelihood on the set of arbitrary matrices (not constrained to be either symmetric or positive definite) boils down to solving p independent LASSO problems of size p − 1. Furthermore, for the purpose of discovering the graph structure, additional computational savings are achieved by remarking that D need not be estimated. We thus avoid the iterative scheme of Rocha et al. (2008) alternating optimization with respect to D and to the off-diagonal elements of D^{-1} K.
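As a sanity check (ours, on toy inputs), the compact expression (2) can be verified numerically against the defining sum of conditional Gaussian log-densities, using X_i | X_{\i} ~ N(−K_ii^{-1} K_{i\i}^T X_{\i}, 1/K_ii):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 50

# Toy symmetric positive-definite concentration matrix K and data X
# (the identity below is algebraic: it holds for any such K and any data).
A = rng.standard_normal((p, p))
K = A @ A.T + p * np.eye(p)
X = rng.standard_normal((n, p))
S = X.T @ X / n

# Sum of the conditional Gaussian log-densities.
lhs = 0.0
for i in range(p):
    rest = [j for j in range(p) if j != i]
    var = 1.0 / K[i, i]
    mean = -(X[:, rest] @ K[rest, i]) / K[i, i]
    lhs += np.sum(-0.5 * np.log(2 * np.pi * var)
                  - (X[:, i] - mean) ** 2 / (2 * var))

# Compact expression (2); note Tr(D^{-1/2} K S K D^{-1/2}) = Tr(D^{-1} K S K).
D = np.diag(np.diag(K))
rhs = (n / 2) * np.log(np.linalg.det(D)) \
    - (n / 2) * np.trace(np.linalg.inv(D) @ K @ S @ K) \
    - (n * p / 2) * np.log(2 * np.pi)

assert np.isclose(lhs, rhs)
```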
Proposition 1. Consider the following reordering of the rows and columns of K and S:

    [ K_{\i\i}   K_{i\i} ]      [ S_{\i\i}   S_{i\i} ]
    [ K_{i\i}^T  K_ii    ],     [ S_{i\i}^T  S_ii    ],    (3)

where K_{\i\i} is the matrix K deprived of its ith column and its ith line, and where K_{i\i} is the ith column of K deprived of its ith element. The problem

    max_{K_ij : i ≠ j} L(K|S) − λ ||K||_1,    (4)

where ||K||_1 is the componentwise ℓ1-norm, can be solved column-wise by considering p LASSO problems of the form

    min_{β ∈ R^{p-1}} (1/2) ||S_{\i\i}^{1/2} β + S_{\i\i}^{-1/2} S_{i\i}||_2^2 + (λ/n) ||β||_1,    (5)

where the optimal β is the maximizer of (4) with respect to K_ii^{-1} K_{i\i} as defined in (3). Hence, Problem (4) may be decomposed into the p Problems (5) of size p − 1 generated by the p possible permutations in (3). The full solution to Problem (4), in {K_ij : i ≠ j}, also requires K_ii (see Rocha et al. 2008). However, since our interest is to unveil the graph structure, the sparsity pattern of the penalized maximum likelihood estimate of K is sufficient, and the latter is directly recovered from β.

Proof. See Appendix A.2.

From the definition of the covariance matrix S, it is clear that Problem (1) is a slight reparameterization of Problem (5). Hence, the graph produced by the approach of Meinshausen and Bühlmann (2006) is identical to the one obtained by maximizing the penalized pseudo-likelihood (Ambroise et al. 2009, Proposition 8).

3. Inferring Multiple GGMs

In transcriptomics, it is common practice to conduct several assays where the experimental conditions differ, resulting in T samples measuring the expression of the same molecules. From a statistical viewpoint, we have T samples belonging to different sub-populations, hence with different distributions.
Assuming that each sample was drawn independently from a Gaussian distribution X^{(t)} ~ N(0_p, Σ^{(t)}), the T samples may be processed separately by following the approach described in Section 2. The objective function is expressed compactly as a sum:

    max_{ {K_ij^{(t)} : i ≠ j}_{t=1}^T }  Σ_{t=1}^T ( L(K^{(t)} | S^{(t)}) − λ ||K^{(t)}||_1 ).    (6)

Note that it is sensible to apply the same penalty parameter λ for all samples provided that the T samples have similar sizes and originate from similar distributions, in particular regarding scaling and sparseness.

Problem (6) ignores the relationships between regulation networks. When the tasks are known to have strong commonalities, the multi-task learning framework is well adapted, especially for small sample sizes, where sharing information may considerably improve estimation accuracy. To couple the estimation problems, we have to break the separability in K^{(1)}, ..., K^{(T)} in Problem (6). This may be achieved by modifying either the data-fitting term or the penalizer. These two options result respectively in the graphical Intertwined-LASSO and the graphical Cooperative-LASSO presented below.

3.1. Intertwined Estimation

In the Maximum A Posteriori framework, the estimation of a concentration matrix can be biased towards a specific value, say S_0^{-1}. From a practical viewpoint, this is usually done by considering a conjugate prior on K, that is, a Wishart distribution W(S_0^{-1}, n). The MAP estimate is then computed as if we had observed additional observations of empirical covariance matrix S_0.

Here, we would like to bias each estimation problem towards the same concentration matrix, whose value is unknown. An empirical Bayes solution would be to set S_0 = S̄, where S̄ is the weighted average of the T empirical covariance matrices.
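The pooled matrix S̄ and the blend S̃^(t) = αS^(t) + (1 − α)S̄ used in the criterion below are straightforward to form; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def intertwined_covariances(S_list, n_list, alpha=0.5):
    """Return the blended matrices S~(t) = alpha * S(t) + (1 - alpha) * S_bar,
    where S_bar is the sample-size-weighted average of the task covariances."""
    n = float(sum(n_list))
    S_bar = sum(nt * St for St, nt in zip(S_list, n_list)) / n
    return [alpha * St + (1 - alpha) * S_bar for St in S_list]
```

With α = 1 the tasks decouple completely, while α = 0 makes every task see the pooled covariance; the experiments in the paper fix α = 1/2.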
As in the maximum likelihood framework, this approach would lead to a full concentration matrix. Hence, we will consider here a penalized criterion, which does not exactly fit the penalized maximum likelihood nor the MAP framework, but that will perform the desired coupling between the estimates of K^{(1)}, ..., K^{(T)} while pursuing the original sparseness goal.

Formally, let n_1, ..., n_T be the sizes of the respective samples, whose empirical covariance matrices are denoted by S^{(1)}, ..., S^{(T)}. Also denoting n = Σ_t n_t, we consider the following problem:

    max_{ {K_ij^{(t)} : i ≠ j}_{t=1}^T }  Σ_{t=1}^T ( L(K^{(t)} | S̃^{(t)}) − λ ||K^{(t)}||_1 ),    (7)

where S̃^{(t)} = α S^{(t)} + (1 − α) S̄ and S̄ = n^{-1} Σ_{t=1}^T n_t S^{(t)}. As this criterion amounts to considering that we observed a blend of the actual data for task t and data from the other tasks, we will refer to this approach as intertwined estimation. The idea is reminiscent of the compromise between linear discriminant analysis and its quadratic counterpart performed by the regularized discriminant analysis of Friedman (1989). Although the tools are similar, the primary goals differ: Friedman (1989) aims at getting control on the number of effective parameters, whereas we want to bias empirical distributions towards a common model. In order to avoid multiple hyper-parameter tuning, the results shown in the experimental section were obtained with α arbitrarily set to 1/2. More refined strategies are left for future work.

3.2. Graphical Cooperative-LASSO

The second approach consists in devising penalties that encourage similar sparsity patterns across tasks, such as the Group-LASSO (Yuan and Lin 2006), which has already inspired some multi-task learning strategies (Argyriou et al. 2008), but was never considered for learning graph models.
We shortly describe how the Group-LASSO may be used for inferring multiple graphs before introducing a slightly more complex penalty that was inspired by the application to biological interactions, but should be relevant in many other applications.

As in the single task case, sparsity of the concentration matrices is obtained via an ℓ1 penalization of their entries. An additional constraint imposes the similarity between the concentration matrices. Each interaction is considered as a group. The Group-LASSO is a mixed norm that encourages sparse solutions with respect to groups, where groups form a pre-defined partition of variables. In the GGM framework, by grouping the partial correlations between variables across the T tasks, such a penalty will favor graphs G_1, ..., G_T with common edges. The learning problem is then

    max_{ {K_ij^{(t)} : i ≠ j}_{t=1}^T }  Σ_{t=1}^T L(K^{(t)} | S^{(t)}) − λ Σ_{i ≠ j} ( Σ_{t=1}^T (K_ij^{(t)})^2 )^{1/2}.    (8)

Though this formalization expresses some of our expectations regarding the commonalities between tasks, it is not really satisfying here, since we aim at inferring the support of the solution (that is, the set of non-zero entries of K^{(t)}). To enable the inference of different networks (t, u), we must have some (i, j) such that K_ij^{(t)} = 0 and K_ij^{(u)} ≠ 0. This event occurs with probability zero with the Group-LASSO, whose variables enter or leave the support group-wise (Yuan and Lin 2006). However, we may cure this problem by considering a regularization term that better suits our needs. Namely, when the graphs represent the regulation networks of the same set of molecules across experimental conditions, we expect a stronger similarity pattern than the one expressed in (8).
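A small numeric illustration (ours, on toy 2 × 2 matrices) of a further limitation of (8): the group penalty is blind to sign inversions across tasks, charging consistent and inverted co-regulation identically.

```python
import numpy as np

def group_penalty(K_list):
    """Group term of (8): for each pair i < j, the l2 norm of the vector
    (K^(1)_ij, ..., K^(T)_ij) across tasks (the sum over i != j in (8)
    simply doubles this value)."""
    Ks = np.stack(K_list)                   # shape (T, p, p)
    iu = np.triu_indices(Ks.shape[1], k=1)  # each unordered pair once
    return np.linalg.norm(Ks[:, iu[0], iu[1]], axis=0).sum()

K1      = np.array([[1.0,  0.5], [ 0.5, 1.0]])
K2_same = np.array([[1.0,  0.5], [ 0.5, 1.0]])   # consistent co-regulation
K2_flip = np.array([[1.0, -0.5], [-0.5, 1.0]])   # inverted co-regulation

# The group penalty cannot tell the two situations apart:
assert np.isclose(group_penalty([K1, K2_same]), group_penalty([K1, K2_flip]))
```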
Specifically, the co-regulation encompasses up-regulation and down-regulation, and the type of regulation is not likely to be inverted across assays: in terms of partial correlations, sign swaps are very unlikely. This additional constraint is formalized in the following learning problem:

    max_{ {K_ij^{(t)} : i ≠ j}_{t=1}^T }  Σ_{t=1}^T L(K^{(t)} | S^{(t)}) − λ Σ_{i ≠ j} [ ( Σ_{t=1}^T (K_ij^{(t)})_+^2 )^{1/2} + ( Σ_{t=1}^T (−K_ij^{(t)})_+^2 )^{1/2} ],    (9)

where (u)_+ = max(0, u).

Figures 1 and 2 illustrate the role of each penalty on a problem with T = 2 tasks and p = 2 variables. They represent several views of the unit balls

    Σ_{i=1}^2 ( Σ_{t=1}^2 (β_i^{(t)})^2 )^{1/2} ≤ 1,  and
    Σ_{i=1}^2 [ ( Σ_{t=1}^2 (β_i^{(t)})_+^2 )^{1/2} + ( Σ_{t=1}^2 (−β_i^{(t)})_+^2 )^{1/2} ] ≤ 1,

that is, the admissible set of each penalty for a problem with two tasks and two features. These plots also provide some insight on the sparsity patterns that originate from the penalties, since sparsity is related to the singularities at the boundary of the admissible set (Nikolova 2000). In particular, the first column illustrates that, when β_2^{(2)} is null, β_2^{(1)} may also be exactly zero, while the second column shows that this event is improbable when β_2^{(2)} differs from zero. The second row illustrates the same type of relationship between β_1^{(2)} and β_1^{(1)}, as expected from the symmetries of the unit ball.

Figure 2 corresponds to the Cooperative-LASSO penalty. These plots should be compared with their Group-LASSO counterpart in Figure 1. We see that there are additional discontinuities in the unit ball, resulting in new vertices on the 3-D plots. As before, when β_2^{(2)} is null, β_2^{(1)} may also be exactly zero, but in addition, we may also have β_1^{(1)} or β_1^{(2)} exactly null.
Accordingly, in the second and third rows, we see that we may have β_2^{(1)} null when β_2^{(2)} is non-zero. These new edges will result in some new zeroes where the Group-LASSO would have allowed a solution with opposite signs between tasks. The second main striking difference with the Group-LASSO is the loss of the axial symmetry of the Cooperative-LASSO when some variables are non-zero. These plots illustrate that the decoupling of the positive and negative parts of the regression coefficients in the penalty favors solutions where these coefficients are of the same sign across tasks. The penalties are identical in the positive and negative orthants, but the Cooperative-LASSO penalization is more stringent elsewhere, when there are some sign mismatches between tasks. The most extreme situation occurs when there is no sign agreement across tasks for all variables. In the setup represented here, with only two tasks, the effective penalty then reduces to the LASSO.

[Figure 1 omitted: Admissible set for the Group-LASSO penalty for a problem with two tasks and two features. Top row: cuts of the unit ball through (β_1^{(1)}, β_1^{(2)}, β_2^{(1)}) for various values of β_2^{(2)}, where (β_1^{(1)}, β_1^{(2)}) span the horizontal plane and β_2^{(1)} is on the vertical axis; bottom rows: cuts through (β_1^{(1)}, β_2^{(1)}) for various values of β_1^{(2)} and β_2^{(2)}.]

[Figure 2 omitted: Admissible set for the Cooperative-LASSO penalty for a problem with two tasks and two features; same views as Figure 1.]

4. Algorithms

In this section, we describe the strategy proposed for solving the three optimization problems introduced above, based upon the proposal of Osborne et al. (2000a) for solving the LASSO. This part also draws its inspiration from Osborne et al. (2000b), Kim et al. (2006), Roth and Fischer (2008).

4.1. Problem Decomposition

The multiple independent tasks Problem (6) can be solved by considering either T single tasks like (4) of size (p − 1) × p (each one possibly decomposed into p LASSO sub-problems of size p − 1), or a single large problem of size T × (p − 1) × p, which can be decomposed into p LASSO sub-problems of size (p − 1) × T through Proposition 1. This line of attack is not computationally efficient here, but it becomes advantageous when considering the penalties presented in Section 3.2. It is introduced at this point to provide a unified conceptual view of all algorithms.

Consider the (pT) × (pT) block-diagonal matrix C composed of the empirical covariance matrices of each task,

    C = diag(S^{(1)}, ..., S^{(T)}),

and define

    C_{\i\i} = diag(S^{(1)}_{\i\i}, ..., S^{(T)}_{\i\i}),    C_{i\i} = ( S^{(1)}_{i\i} ; ... ; S^{(T)}_{i\i} ).    (10)

The (p − 1)T × (p − 1)T matrix C_{\i\i} is the matrix C where we removed each line and each column pertaining to variable i. We define C̃, C̃_{\i\i} and C̃_{i\i} similarly, with S^{(t)} replaced by S̃^{(t)} for each t = 1, ..., T in the above definitions.

Let β^{(t)} ∈ R^{p-1} denote the vector estimating K^{(t)}_{i\i}, defined from K^{(t)} as in (3), and let β ∈ R^{T × (p-1)} be the vector of the concatenated estimates β^T = (β^{(1)T}, ..., β^{(T)T}). The optimization of (6) is achieved by solving p sub-problems of the form

    min_{β ∈ R^{T × (p-1)}} (1/2) ||C_{\i\i}^{1/2} β + C_{\i\i}^{-1/2} C_{i\i}||_2^2 + λ Σ_{t=1}^T (1/n_t) ||β^{(t)}||_1.    (11)

Note that we do not need to perform the costly matrix operations that are expressed in the first term of the objective function of Problem (11). In practice, we compute

    f(β; C) = (1/2) β^T C_{\i\i} β + β^T C_{i\i},

which only differs from the squared ℓ2 norm in (11) by a constant that is irrelevant for the optimization process.

Accordingly, Problems (7), (8) and (9) can be decomposed into p minimization sub-problems whose objective functions may be written as

    L_k(β) = f(β) + λ g_k(β),    (12)

where, with a slight abuse of notation, f(β) is either f(β; C̃) for Problem (7) or f(β; C) for Problems (8) and (9), and where g_k(β) stands for the penalty functions respectively defined below:

• for the graphical Intertwined-LASSO,

    g_1(β) = Σ_{t=1}^T (1/n_t) ||β^{(t)}||_1;

• for the graphical Group-LASSO,

    g_2(β) = Σ_{i=1}^{p-1} ||β_i^{[1:T]}||_2,

  where β_i^{[1:T]} = (β_i^{(1)}, ..., β_i^{(T)})^T ∈ R^T is the vector of the ith component across tasks;

• for the graphical Cooperative-LASSO,

    g_3(β) = Σ_{i=1}^{p-1} ( ||(β_i^{[1:T]})_+||_2 + ||(−β_i^{[1:T]})_+||_2 ).

Since f is convex with respect to β, and all penalties are norms, all these objective functions are convex and thus easily amenable to optimization. They are non-differentiable at zero, due to the penalty terms, which all favor zero coefficients. Bearing in mind the typical problems in biological data, where graphs have a few tens or hundreds of nodes, and where connectivity is very weak(1), we need convex optimization tools that are efficient for medium-size problems with extremely sparse solutions. We thus chose a greedy strategy that aims at solving a series of small-size sub-problems, and will offer a simple monitoring of convergence.

4.2. Solving the Sub-Problems

The minimizers β of the objective functions (12) are assumed to have many zero coefficients. The approach developed for the LASSO by Osborne et al. (2000a) takes advantage of this sparsity by solving a series of small linear systems, whose size is incrementally increased/decreased, similarly to a column generation algorithm. The master problem is the original problem, but solved only with respect to the subset of variables currently identified as non-zero β coefficients. The sub-problem of identifying new non-zero variables simply consists in detecting the violations of the first-order optimality conditions with respect to all variables. When there are no more such violations, the current solution is optimal.

(1) Typically, the number of edges in such graphs is expected to scale as the number of nodes, that is, we expect of the order of √(pT) non-zero coefficients in each sub-problem of size T × (p − 1).
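As a concrete companion to the penalties g_2 and g_3 defined above, the following numpy sketch (ours; tasks stacked as rows of B, g_1 omitted since it is a plain weighted ℓ1 norm) exhibits their key contrasts:

```python
import numpy as np

def g2(B):
    """Group-LASSO penalty: sum over features of the l2 norm of the
    column beta_i^[1:T]; B has shape (T, p - 1), tasks as rows."""
    return np.linalg.norm(B, axis=0).sum()

def g3(B):
    """Cooperative-LASSO penalty: the positive and negative parts of each
    column are penalized by separate group norms."""
    pos = np.linalg.norm(np.maximum(B, 0.0), axis=0)
    neg = np.linalg.norm(np.maximum(-B, 0.0), axis=0)
    return (pos + neg).sum()

B_same = np.array([[0.3, 0.0], [ 0.4, 0.0]])   # T = 2, signs agree
B_flip = np.array([[0.3, 0.0], [-0.4, 0.0]])   # sign mismatch on feature 1

assert np.isclose(g3(B_same), g2(B_same))  # identical when signs agree
assert g3(B_flip) > g2(B_flip)             # more stringent on sign mismatches
# A fully mismatched group is charged its l1 norm, as under the plain LASSO:
assert np.isclose(g3(B_flip), np.abs(B_flip).sum())
```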
The objective functions L_k(β) are convex and smooth except at some locations with zero coefficients. Thus, the minimizer is such that the null vector is an element of the subdifferential ∂_β L_k(β). In our problems, the subdifferential is given by

    ∂_β L_k(β) = ∇_β f(β) + λ ∂_β g_k(β),    (13)

where ∇_β f(β) = C_{\i\i} β + C_{i\i} and where the form of ∂_β g_k(β) differs for the three problems and will be detailed below.

The algorithm is started from a sparse initial guess, that is, β = 0 or, if available, the solution obtained on a more constrained problem with a larger penalization parameter λ. Then, one converges to the global solution iteratively, by managing the index A of the non-zero coefficients of β and solving the master problem over A, where the problem is continuously differentiable. The management of A requires two steps: the first one removes from A the coefficients that have been zeroed when solving the previous master problem, ensuring its differentiability at the next iteration, and the second one examines the candidate non-zero coefficients that could enter A. In this process, summarized in Algorithm 1, the size of the bigger master problems is typically of the order of magnitude of the number of non-zero entries in the solution. Solving the master problem with respect to the non-zero coefficients β_A can be formalized as solving min_h L_k(β_A + h), where h ∈ R^{|A|} is optimal if 0 ∈ ∂_h L_k(β_A + h).

Algorithm 1: General optimization algorithm

    // 0. INITIALIZATION
    β ← 0; A ← ∅
    while 0 ∉ ∂_β L(β) do
        // 1. MASTER PROBLEM: OPTIMIZATION WITH RESPECT TO β_A
        Find an (approximate) solution h to the smooth problem
            ∇_h f(β_A + h) + λ ∂_h g_k(β_A + h) = 0    // where ∂_h g_k = {∇_h g_k}
        β_A ← β_A + h
        // 2. IDENTIFY NEWLY ZEROED VARIABLES
        while ∃ i ∈ A : β_i = 0 and min_{θ ∈ ∂_{β_i} g_k} |∂f(β)/∂β_i + λθ| = 0 do
            A ← A \ {i}
        // 3. IDENTIFY NEW NON-ZERO VARIABLES
        // Select i ∈ A^c such that an infinitesimal change of β_i
        // provides the highest reduction of L_k
        i ← argmax_{j ∈ A^c} v_j, where v_j = min_{θ ∈ ∂_{β_j} g_k} |∂f(β)/∂β_j + λθ|
        if v_i ≠ 0 then A ← A ∪ {i}
        else stop and return β, which is optimal

4.3. Implementation Details

We provide below the implementation details that are specific to each optimization problem. The specificity of each problem lies in ∂_β g_k(β), denoted θ herein.

Intertwined LASSO – This LASSO problem is solved as proposed by Osborne et al. (2000a), except that we consider here the Lagrangian formulation with λ fixed. The components of θ in the subdifferential (13) read

    if β_i = 0 then θ_i ∈ [−1, 1], else θ_i = sign(β_i).

Solving the master problem on A requires an estimate of θ_A at β_A + h. It is computed based on a local approximation, where the components of sign(β_A + h) are replaced by sign(β_A).(2) This leads to the following descent direction h:

    h = −β_A − C̃_{\i\i}^{-1}(A, A) ( C̃_{i\i}(A) + λ θ_A ),

where, in order to avoid double subscripts, we use the notation M(A, A) for the square submatrix of M formed by the rows and columns indexed by A, and v(A) for the subvector formed by the components of v indexed by A. Then, before updating β_A, one checks whether the local approximation used to compute h is consistent with the sign of the new solution. If that is not the case, one looks for the largest step size ρ in direction h such that β_A^+ = β_A + ρh is sign-consistent with β_A. This amounts to zeroing a coefficient, say β_i, and i is removed from A if |∂f(β^+)/∂β_i| < λ; otherwise, the corresponding θ_i is set to −sign(∂f(β^+)/∂β_i).
In any case, a new direction h is computed as above, and β_A is updated until the optimality conditions are reached within A. Finally, the global optimum is attained if the first-order optimality conditions are met for all the components of β, that is, if β̂ verifies

\[ 0 \in \tilde{C}_{\setminus i\setminus i}\, \hat\beta + \tilde{C}_{i\setminus i} + \lambda\,\theta, \quad \text{where}\ \theta_{\mathcal{A}} = \operatorname{sign}(\hat\beta_{\mathcal{A}})\ \text{and}\ \|\theta_{\mathcal{A}^c}\|_\infty \le 1. \]

Graphical Group-LASSO – In this problem, the subdifferential (13) is conditioned on the norm of β_i^{[1:T]}, the vector of the ith components across tasks. Let θ_i^{[1:T]} = (θ_i^{(1)}, …, θ_i^{(T)})ᵀ ∈ R^T be defined similarly to β_i^{[1:T]}; we have

\[ \big\|\theta_i^{[1:T]}\big\|_2 \le 1 \ \text{if}\ \big\|\beta_i^{[1:T]}\big\|_2 = 0, \qquad \theta_i^{[1:T]} = \big\|\beta_i^{[1:T]}\big\|_2^{-1} \beta_i^{[1:T]} \ \text{otherwise}, \]

where, here and in what follows, 0/0 is defined by continuation as 0/0 = 0. As the subgradient w.r.t. β_i^{[1:T]} reduces to a gradient whenever one component of β_i^{[1:T]} is non-zero, the null variables are managed here by subsets of T variables, according to ‖∇_{β_i^{[1:T]}} f(β)‖₂, instead of on the one-by-one basis of the LASSO. Hence, we only need to index the groups i ∈ {1, …, p−1} in A. Here also, solving the master problem on A requires an estimate of θ_A^{[1:T]} at β_A^{[1:T]} + h. Provided that ‖β_i^{[1:T]}‖₂ ≠ 0 for all i ∈ A, θ_A^{[1:T]} is differentiable w.r.t. β_A^{[1:T]}. It is thus approximated by a first-order Taylor expansion, resulting in a Newton–Raphson or quasi-Newton step. Here, we use quasi-Newton with BFGS updates.

² When A is updated and β_i = 0, the corresponding θ_i is set to −sign(∂f(β)/∂β_i).
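The group-wise test for entering the active set follows directly from this subdifferential: a null group i may stay inactive as long as ‖∇_{β_i^{[1:T]}} f(β)‖₂ ≤ λ, since an admissible θ of norm at most one can then absorb the gradient. A minimal numpy sketch of this check (function names are hypothetical, not part of the simone package):

```python
import numpy as np

def inactive_group_ok(grad_group, lam):
    """First-order check for an inactive group in the Group-LASSO:
    0 is in the subdifferential iff ||grad||_2 <= lam, since any
    theta with ||theta||_2 <= 1 is admissible when the group is null."""
    return np.linalg.norm(grad_group) <= lam

def groups_to_activate(grad, lam):
    """Indices of groups violating the condition, i.e. candidates to enter A.

    grad: (T, p) array, gradient of f at beta = 0, one column per group i.
    """
    norms = np.linalg.norm(grad, axis=0)   # ||grad_i||_2 across the T tasks
    return np.flatnonzero(norms > lam)

# Toy gradient at beta = 0 for T = 3 tasks and p = 4 groups
grad = np.array([[0.1, 2.0, 0.0, 0.5],
                 [0.1, 1.0, 0.2, 0.5],
                 [0.0, 2.0, 0.1, 0.5]])
print(groups_to_activate(grad, lam=1.0))   # prints [1]: only group 1 exceeds lam
```

Only the column whose Euclidean norm across tasks exceeds λ is a candidate to enter A, which is exactly the subset-of-T management described above.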
Note that, whenever β_i^{[1:T]} = 0, that is, when a new group of variables has just been activated or is about to be deactivated, the corresponding θ_i^{[1:T]} is set so that

\[ \big\| \nabla_{\beta_i^{[1:T]}} f(\beta) + \lambda\, \theta_i^{[1:T]} \big\|_2 \tag{14} \]

is minimal (that is, with θ_i^{[1:T]} proportional to ∇_{β_i^{[1:T]}} f). The updates of A are also based on the minimal value of (14).

Graphical Cooperative-LASSO – Like the Group-LASSO, the Cooperative-LASSO considers a group structure, but its implementation differs considerably from the former in the management of A. Though several variables are usually activated or deactivated at the same time, they typically correspond to subsets of β_i^{[1:T]}, and these subsets are context-dependent: they are not defined beforehand. As a result, the index of non-zero β_i^{[1:T]} is better handled by considering two sets, indexing the β_i^{[1:T]} with positive and with negative components respectively:

\[ \mathcal{A}_+ = \Big\{ i \in \{1,\dots,p-1\} : \big\|\big(\beta_i^{[1:T]}\big)_+\big\|_2 > 0 \Big\}, \qquad \mathcal{A}_- = \Big\{ i \in \{1,\dots,p-1\} : \big\|\big(-\beta_i^{[1:T]}\big)_+\big\|_2 > 0 \Big\}. \]

Let T denote the index of the non-zero entries of β_i^{[1:T]}, with complement T^c; the subdifferential at the current solution is such that:

• if i ∈ A₊^c ∩ A₋^c, then max( ‖(θ_i^{[1:T]})₊‖₂, ‖(−θ_i^{[1:T]})₊‖₂ ) ≤ 1;
• if i ∈ A₊^c ∩ A₋, then θ_i^T = ‖(−β_i^T)₊‖₂⁻¹ β_i^T, and θ_i^{T^c} is such that ‖(θ_i^{T^c})₊‖₂ ≤ 1 and ‖(−θ_i^{T^c})₊‖₂ = 0;
• if i ∈ A₊ ∩ A₋^c, then θ_i^T = ‖(β_i^T)₊‖₂⁻¹ β_i^T, and θ_i^{T^c} is such that ‖(−θ_i^{T^c})₊‖₂ ≤ 1 and ‖(θ_i^{T^c})₊‖₂ = 0;
• if i ∈ A₊ ∩ A₋, then θ_i^{(t)} = ‖( sign(β_i^{(t)}) β_i^{[1:T]} )₊‖₂⁻¹ β_i^{(t)}, for t = 1, …, T.
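The sets A₊ and A₋ defined above are cheap to compute with the positive-part operator; a small self-contained sketch (array and function names are hypothetical):

```python
import numpy as np

def active_sets(B):
    """Return the Cooperative-LASSO index sets A+ and A-.

    B: (T, p) array, column i holding beta_i across the T tasks.
    A+ gathers the columns with at least one positive entry,
    A- those with at least one negative entry.
    """
    pos = np.maximum(B, 0.0)                 # (beta)_+
    neg = np.maximum(-B, 0.0)                # (-beta)_+
    a_plus = np.flatnonzero(np.linalg.norm(pos, axis=0) > 0)
    a_minus = np.flatnonzero(np.linalg.norm(neg, axis=0) > 0)
    return a_plus, a_minus

# Toy coefficients for T = 2 tasks and p = 4 variables
B = np.array([[ 0.3, 0.0, -0.2,  0.1],
              [ 0.0, 0.0, -0.5, -0.4]])
a_plus, a_minus = active_sets(B)
print(a_plus, a_minus)   # A+ = {0, 3}, A- = {2, 3}; column 1 is in neither
```

Column 3 belongs to both sets, illustrating how a single group may carry positive and negative components simultaneously, the fourth case of the subdifferential above.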
Once A₊ and A₋ are determined, the master problem is solved as for the Group-LASSO, with BFGS updates, using box constraints to ensure a sign-feasible β_i^{[1:T]} for every i such that i ∈ A₊^c ∩ A₋ or i ∈ A₊ ∩ A₋^c. When a new variable has just been activated or is about to be deactivated, the corresponding θ_i^{[1:T]} is set so that

\[ \big\| \big( \nabla_{\beta_i^{[1:T]}} f(\beta) + \lambda\,\theta_i^{[1:T]} \big)_+ \big\|_2 + \big\| \big( -\nabla_{\beta_i^{[1:T]}} f(\beta) - \lambda\,\theta_i^{[1:T]} \big)_+ \big\|_2 \tag{15} \]

is minimal. The updates of A are also based on the minimal value of (15).

5. Experiments

In most real-life applications, the major part of the inferred graphs is unknown, with little available information on the presence or absence of edges. We essentially face an unsupervised learning problem, where there is no objective criterion allowing to compare different solutions. As a result, setting the hyper-parameters is particularly troublesome, much like choosing the number of components in a mixture model, and it is common practice to visualize several networks corresponding to a series of penalties.

Regarding the first issue, we chose to present here synthetic and well-known real data that allow for an objective quantitative evaluation. Regarding the second issue, the choice of penalty parameters can be guided by theoretical results that provide a bound on the rate of false edge discovery (Meinshausen and Bühlmann 2006, Banerjee et al. 2008, Ambroise et al. 2009), or by more traditional information criteria targeting the estimation of K (Yuan and Lin 2007, Rocha et al. 2008). However, these proposals tend to behave poorly, and it is usual practice to compare the performance of learning algorithms by providing a series of results, such as precision-recall plots or ROC curves, leaving the choice of penalty parameters as a mostly open question for future research.
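The practice of scanning a series of penalties can be sketched on the generic ℓ₁ problem underlying neighborhood selection. The snippet below (plain numpy, not the simone implementation) traces the number of selected coefficients along a decreasing grid of λ values, starting from λ_max = ‖c‖_∞, above which the solution is empty:

```python
import numpy as np

def lasso_cd(C, c, lam, n_iter=300):
    """Cyclic coordinate descent for 0.5 * b'Cb + c'b + lam * ||b||_1."""
    beta = np.zeros(len(c))
    for _ in range(n_iter):
        for j in range(len(c)):
            r = c[j] + C[j] @ beta - C[j, j] * beta[j]
            beta[j] = np.sign(-r) * max(abs(r) - lam, 0.0) / C[j, j]
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
C = X.T @ X / 100 + 0.05 * np.eye(10)    # toy covariance block
c = rng.standard_normal(10)

lam_max = np.max(np.abs(c))              # above this, the solution is empty
lams = lam_max * np.array([1.0, 0.5, 0.1, 0.01])
nnz = [int(np.sum(lasso_cd(C, c, lam) != 0)) for lam in lams]
print(nnz)   # number of selected coefficients grows as lambda decreases
```

Each value of λ yields one candidate network; sweeping the grid is what produces the precision-recall curves reported in the experiments below.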
Although the shortcomings of this type of comparison are well known (Drummond and Holte 2006, Bengio et al. 2005), we will use precision vs. recall plots, which can be considered as valuable exploratory tools. Precision is the ratio of the number of true selected edges to the total number of selected edges in the inferred graphs. Recall is the ratio of the number of true selected edges in the inferred graphs to the total number of edges in the true graphs. In a statistical framework, the recall is equivalent to the power, and the precision is equivalent to one minus the false discovery proportion.

5.1. Synthetic Data

5.1.1. Simulation Protocol

To generate T samples stemming from a similar graph, we first draw an "ancestor" graph with p nodes and k edges according to the Erdős–Rényi model. Here, we consider a simple setting with T = 4 and a network with p = 20 nodes and k = 20 edges, as illustrated in Figure 3.

Fig 3. Set of simulated graphs: ancestor (top) and two children (bottom) engendered by a δ = 2 perturbation.

Then, T children graphs are produced by random addition and deletion of δ edges in the ancestor graph. The T concentration matrices are built from the normalized graph Laplacians, whose off-diagonal elements are slightly deflated to produce strictly diagonally dominant matrices. To allow for positively and negatively correlated variables, we generate a strictly triangular matrix of random signs drawn from a Rademacher distribution. This matrix is symmetrized, complemented with ones on the diagonal, and its component-wise multiplication with the deflated Laplacians produces the ground truth for the concentration matrices K^{(1)}, …, K^{(T)}. Each K^{(t)} is finally used to generate n Gaussian vectors with zero mean and covariance (K^{(t)})⁻¹.

5.1.2.
Experimental Setup

The precision-recall plots are computed by considering the cumulative number of true and false edge detections among the T = 4 children networks. Let E^{(t)} be the set of edges of children network t; precision and recall are respectively formally defined as

\[ \mathrm{precision} = \frac{\sum_{t=1}^{T} \sum_{(i,j)\in\mathcal{E}^{(t)}} \mathbf{1}\big(\hat{K}^{(t)}_{ij}\big)}{\sum_{t=1}^{T} \sum_{i>j} \mathbf{1}\big(\hat{K}^{(t)}_{ij}\big)} \qquad \text{and} \qquad \mathrm{recall} = \frac{\sum_{t=1}^{T} \sum_{(i,j)\in\mathcal{E}^{(t)}} \mathbf{1}\big(\hat{K}^{(t)}_{ij}\big)}{\sum_{t=1}^{T} \big|\mathcal{E}^{(t)}\big|}, \]

where K̂^{(t)}_{ij} is the estimated partial correlation between variables i and j for network t, 1(u) returns 1 if u ≠ 0 and 0 otherwise, and |E| is the cardinality of E.

To ensure representativeness, the precision-recall figures are averaged over 100 random draws of the ancestor graph, the averaging being performed for fixed values of the penalization parameter λ. That is, each point in the (precision, recall) plane is the average of the 100 points obtained for each random draw of the ancestor graph, for a given estimation method and a given value of the penalization parameter. We compare our proposals, namely the Graphical Intertwined (α = 1/2), Cooperative and Group LASSO, to two baselines: the original neighborhood selection of Meinshausen and Bühlmann (2006), either applied separately to each graph (annotated "independent"), or computed on the data set merging the data originating from all graphs (annotated "pooled").

5.1.3. Results

Figures 4, 5 and 6 display precision-recall plots for nine prototypical situations. From Figure 4 to Figure 6, the sample size increases, and from top to bottom, the differentiation between networks increases. First, note that the independent strategy is not influenced by the level of perturbation, but only by the sub-sample size, as expected.
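The cumulative ratios defined in this setup translate directly into code; a small self-contained sketch (helper name hypothetical) computing precision and recall from estimated concentration matrices and true edge sets:

```python
import numpy as np

def precision_recall(K_hats, true_edges, p):
    """Cumulative precision/recall over T networks.

    K_hats:     list of T (p, p) estimated concentration matrices;
    true_edges: list of T sets of edges (i, j) with i > j.
    """
    tp = selected = true_total = 0
    for K_hat, edges in zip(K_hats, true_edges):
        for i in range(p):
            for j in range(i):
                if K_hat[i, j] != 0:
                    selected += 1
                    if (i, j) in edges:
                        tp += 1
        true_total += len(edges)
    precision = tp / selected if selected else 1.0
    recall = tp / true_total if true_total else 1.0
    return precision, recall

# Toy example: one network, 4 nodes, true edges {(1,0), (2,1)}
K_hat = np.zeros((4, 4))
K_hat[1, 0] = K_hat[3, 2] = 0.5          # one true and one false detection
prec, rec = precision_recall([K_hat], [{(1, 0), (2, 1)}], p=4)
print(prec, rec)  # 0.5 0.5
```

With one true and one false detection out of two true edges, both ratios equal 0.5, matching the definitions above.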
The top graph in Figure 4 represents the small-sample low-perturbation situation, where merging data sets is a good strategy, clearly improving on the independent analysis. The latter performs poorly, and our multi-task approaches dominate the pooled strategy, the Cooperative-LASSO being superior to the Group-LASSO, which in turn has the advantage over the Intertwined LASSO. The medium/large-sample low-perturbation (top graphs in Figures 5 and 6), small/medium-sample medium-perturbation (middle graphs in Figures 4 and 5) and small-sample large-perturbation (bottom graph in Figure 4) situations are qualitatively similar, with smaller differences between all competitors. For the large-sample medium-perturbation (middle graph in Figure 6) and medium-sample high-perturbation (bottom graph in Figure 5) cases, all methods perform similarly. There is a slight advantage for the multi-task strategies at high recalls, and a slight advantage for the independent analysis at low recalls (that is, for high penalization, where there are fewer effective degrees of freedom to determine). The bottom graph in Figure 6 represents the large-sample high-perturbation situation, where merging data is a bad strategy, since the networks differ significantly and there is enough data to estimate each network independently. The independent strategy works best, closely followed by the Intertwined LASSO. The Cooperative and Group-LASSO behave equally well at high recalls (low penalization parameters), but for highly penalized solutions (low recalls), they eventually become slightly worse than the pooled estimation.

These experiments show that our proposals are valuable, especially in the most common situation where data are scarce. Among the baselines, the usual pooled sample strategy is good in the small-sample low-perturbation case, and the opposite independent strategy is better in the large-sample high-perturbation case.
The Intertwined LASSO is very robust, in the sense that it always performs favorably compared to the best baseline method over the whole spectrum of situations. Furthermore, except for the large-sample high-perturbation case, the Group-LASSO performs even better, and the Cooperative-LASSO further strengthens the advantage of the multiple graph inference approach.

5.2. Protein Signaling Network

Only a few real data sets come with a reliable and exhaustive ground truth allowing quantitative assessments. We make use of a multivariate flow cytometry

Fig 4 (panels δ = 1, 3, 5). Precision-recall curves for the Intertwined, Cooperative, Group and the two baseline LASSOs, for inferring four graphs (each with p = 20 nodes, k = 20 edges and a perturbation δ from the ancestor graph) from four samples of size n_t = 25.

Fig 5
(panels δ = 1, 3, 5). Precision-recall curves for the Intertwined, Cooperative, Group and the two baseline LASSOs, for inferring four graphs (each with p = 20 nodes, k = 20 edges and a perturbation δ from the ancestor graph) from four samples of size n_t = 50.

Fig 6 (panels δ = 1, 3, 5). Precision-recall curves for the Intertwined, Cooperative, Group and the two baseline LASSOs, for inferring four graphs (each with p = 20 nodes, k = 20 edges and a perturbation δ from the ancestor graph) from four samples of size n_t = 100.

data set pertaining to a well-studied human T-cell signaling pathway (Sachs et al. 2005). The latter involves 11 signaling molecules (phospholipids and phosphorylated proteins) and 20 interactions described in the literature. The signaling network is perturbed by activating or inhibiting the production of a given molecule. Fourteen assays have been conducted, each aiming to reveal a different part of the network. Here, we use only four assays (inhibition of PKC, activation of PKC, inhibition of AKT, activation of PKA). Graphs inferred using only one assay at a time show that each assay really focuses on a different part of the network (see Figure 7).
Fig 7. Four graphs inferred from a single assay each. From left to right, top to bottom: graphs inferred from an assay inhibiting AKT, activating PKA, inhibiting PKC, and activating PKC. Thick black lines represent true positives and thin red lines false positives.

Fig 8. Ground-truth pathway (left) and graph sum of the four graphs estimated by the Intertwined LASSO using all data (right). Thick black lines represent true positives and thin red lines false positives.

When considering a strategy based on inference from multiple assays, the first false positive inferred by the Intertwined Graphical LASSO occurs when 11 true interactions out of 20 are detected (see Figure 8). This edge, between p38 and JNK, is in fact due to an indirect connection via unmeasured MAP kinase kinases (Sachs et al. 2005), a typical confounding problem arising in this context. Considering partial correlations within the subset of available variables, the edge is correctly detected, but it is a false positive with respect to the biological ground truth. Furthermore, in larger biological networks, the absence of an edge in the ground-truth pathway often merely means that there is as yet no evidence that the co-regulation exists. As a result, most real-data evaluations of graph inference methods are based on qualitative, subjective assessments by experts.
This caveat notwithstanding, the various inference algorithms behave here as in the synthetic experiments: all inference methods perform about equally well for large samples (each assay consists here of about 1000 repeated measurements). Figure 9 displays the results obtained for small sample sizes. Here also, the precision-recall plots are averaged over 100 independent random draws of samples of size n_t, that is, n = 4 n_t observations over the four considered assays. It is worth noticing that the large-sample limit is almost reached for n_t = 20. As for the synthetic experiments, the averaging is performed for fixed values of the penalization parameter λ. In this situation, our proposals dominate the best baseline strategy, which is pooled estimation. Again, the Intertwined LASSO is very robust, but the Group-LASSO, and to a greater extent the Cooperative-LASSO, perform better in the small-sample-size regime.

6. Conclusion

This paper presents the first methods dedicated to the inference of multiple graphs in the Gaussian Graphical Model framework. In this setup, the two baseline approaches consist in either handling the inference problems separately or as a single one by merging the available data sets. Our proposals, motivated by bioinformatics applications, were devised to describe the dependencies between pairs of variables in analogous operating conditions, such as measurements recorded in different assays. This situation occurs routinely with omics data. Our approaches are based on the neighborhood selection of Meinshausen and Bühlmann (2006). The first one, the Intertwined Graphical LASSO, relaxes the uniqueness constraint that is implicit when the tasks are processed as a single one, merely biasing the results towards a common answer.
Our second approach, the Graphical Cooperative-LASSO, is based on a group penalty that favors similar graphs, with homogeneous dependencies between the same pairs of variables. Homogeneity is quantified here by the magnitude and sign of partial correlations. The Cooperative-LASSO contrasts with the Group-LASSO in being able to infer differing graph structures across tasks. Our experimental results show that our proposals are valuable and robust, consistently performing at least as well as the best of the two baseline solutions.

The algorithms developed in this paper are made available within the R package simone from version 1.0-0 and later. This package also embeds an extension of the multi-task framework to time-course data, that is, when transcriptomics data are collected by considering the same individual across time. This implementation builds on the ℓ₁-penalized VAR(1) model described in Charbonnier et al. (2010).

Fig 9 (panels n_t = 7, 10, 20). Precision-recall curves for the Intertwined, Cooperative, Group and the two baseline LASSOs, for inferring the graphs on four assays of Sachs' data from four samples of size n_t = 7, 10 and 20.
As future work, we will provide a theoretical analysis of the Cooperative-LASSO regarding uniqueness of the solution and selection consistency, or sparsistency, which corresponds here to the asymptotic convergence of the set of detected edges towards the set of true edges.

Acknowledgments

Yves Grandvalet was partially supported by the PASCAL2 Network of Excellence, the European ICT FP7 under grant No 247022 - MASH, and the French National Research Agency (ANR) under grant ClasSel ANR-08-EMER-002. Christophe Ambroise and Julien Chiquet were partially supported by the ANR under grants GD2GS ANR-05-MMSA-0013 and NeMo ANR-08-BLAN-0304-01.

Appendix A: Proofs

A.1. Derivation of the pseudo-log-likelihood

We show here that the pseudo-log-likelihood

\[ \mathcal{L}(K|X) = \sum_{i=1}^{p} \sum_{k=1}^{n} \log P\big(X^k_i \,\big|\, X^k_{\setminus i}; K_i\big), \tag{16} \]

associated to a sample of size n drawn independently from the multivariate Gaussian vector X ∼ N(0_p, Σ), reads

\[ \mathcal{L}(K|X) = \frac{n}{2}\log\det(D) - \frac{n}{2}\,\mathrm{Tr}\big(D^{-1/2} K S K D^{-1/2}\big) - \frac{np}{2}\log(2\pi), \]

where S = n⁻¹XᵀX is the empirical variance-covariance matrix and D is the diagonal matrix such that D_ii = K_ii, for i = 1, …, p.

Proof. Since the joint distribution of X^k is Gaussian, the distribution of X^k_i conditioned on the remaining variables X^k_{∖i} is also Gaussian, with parameters (μ^k_i, σ_i) given by

\[ \mu^k_i = \Sigma_{i\setminus i}^{\top} \Sigma_{\setminus i\setminus i}^{-1} X^k_{\setminus i}, \qquad \sigma_i = \Sigma_{ii} - \Sigma_{i\setminus i}^{\top} \Sigma_{\setminus i\setminus i}^{-1} \Sigma_{i\setminus i}, \tag{17} \]

where Σ_{∖i∖i} is the matrix Σ deprived of its ith column and its ith row, and Σ_{i∖i} is the ith column of Σ deprived of its ith element.
As K = Σ⁻¹, reordering the rows and columns of the matrices yields

\[ \begin{pmatrix} \Sigma_{\setminus i\setminus i} & \Sigma_{i\setminus i} \\ \Sigma_{i\setminus i}^{\top} & \Sigma_{ii} \end{pmatrix} \times \begin{pmatrix} K_{\setminus i\setminus i} & K_{i\setminus i} \\ K_{i\setminus i}^{\top} & K_{ii} \end{pmatrix} = \begin{pmatrix} I_{p-1} & 0 \\ 0 & 1 \end{pmatrix}, \]

where K_{∖i∖i} is the matrix K deprived of its ith column and its ith row, K_{i∖i} is the ith column of K deprived of its ith element, and I_{p−1} is the identity matrix of size p − 1. Two of these blockwise equalities can be rewritten as follows:

\[ \Sigma_{ii} = \big(1 - \Sigma_{i\setminus i}^{\top} K_{i\setminus i}\big)/K_{ii}, \qquad \Sigma_{\setminus i\setminus i}^{-1}\Sigma_{i\setminus i} = -K_{i\setminus i}/K_{ii}. \]

Using the above identities in (17), we obtain

\[ \sigma_i = \big(1 - \Sigma_{i\setminus i}^{\top} K_{i\setminus i}\big)/K_{ii} + \Sigma_{i\setminus i}^{\top} K_{i\setminus i}/K_{ii} = 1/K_{ii}, \qquad \boldsymbol{\mu}_i = -K_{i\setminus i}^{\top} X_{\setminus i}^{\top}/K_{ii}, \]

where μ_i = (μ¹_i, …, μⁿ_i)ᵀ. Using these notations and the corresponding blockwise notations for S (S_ii = n⁻¹X_iᵀX_i, S_{i∖i} = n⁻¹X_{∖i}ᵀX_i and S_{∖i∖i} = n⁻¹X_{∖i}ᵀX_{∖i}), Equation (16) reads

\[ \mathcal{L}(K|X) = -\frac{n}{2}\sum_{i=1}^{p} \log\sigma_i - \sum_{i=1}^{p} \frac{1}{2\sigma_i}\,(X_i - \boldsymbol{\mu}_i)^{\top}(X_i - \boldsymbol{\mu}_i) - \frac{np}{2}\log(2\pi) \]
\[ = \frac{n}{2}\sum_{i=1}^{p} \log K_{ii} - \frac{np}{2}\log(2\pi) - \frac{n}{2}\sum_{i=1}^{p} K_{ii}\Big( S_{ii} + \frac{2}{K_{ii}}\, S_{i\setminus i}^{\top} K_{i\setminus i} + \frac{1}{K_{ii}^2}\, K_{i\setminus i}^{\top} S_{\setminus i\setminus i} K_{i\setminus i} \Big) \tag{18} \]
\[ = \frac{n}{2}\log\det D - \frac{np}{2}\log(2\pi) - \frac{n}{2}\sum_{i=1}^{p} \frac{1}{K_{ii}}\, K_i^{\top} S K_i, \]

where K_i is the ith column of K and D is the diagonal matrix such that D_ii = K_ii. Finally, we use that ∑_{i=1}^{p} (1/K_ii) K_iᵀ S K_i = Tr(D^{−1/2} K S K D^{−1/2}) to conclude the proof.

A.2. Blockwise Optimization of the pseudo-log-likelihood

Proof of Proposition 1. From (18), we have

\[ \mathcal{L}(K|S) = -\frac{n}{2}\sum_{i=1}^{p} \Big( 2\, S_{i\setminus i}^{\top} K_{i\setminus i} + \frac{1}{K_{ii}}\, K_{i\setminus i}^{\top} S_{\setminus i\setminus i} K_{i\setminus i} \Big) + c, \tag{19} \]

where c does not depend on K_ij with j ≠ i. Thus, if we discard the symmetry constraint on K, maximizing the pseudo-likelihood (19) with respect to the non-diagonal entries of K amounts to solving p independent maximization problems with respect to K_{i∖i}, i = 1, …, p.
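The blockwise identities used in the derivation of A.1 are easy to confirm numerically; the following small numpy check (a sanity sketch, not part of the paper) verifies that the conditional variance Σ_ii − Σ_{i∖i}ᵀΣ_{∖i∖i}⁻¹Σ_{i∖i} equals 1/K_ii and that Σ_{∖i∖i}⁻¹Σ_{i∖i} = −K_{i∖i}/K_ii:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 6
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)        # random SPD covariance matrix
K = np.linalg.inv(Sigma)               # concentration matrix

for i in range(p):
    idx = [j for j in range(p) if j != i]
    S_mi = Sigma[np.ix_(idx, idx)]     # Sigma with row/column i removed
    s_i = Sigma[idx, i]                # i-th column without its i-th entry
    cond_var = Sigma[i, i] - s_i @ np.linalg.solve(S_mi, s_i)
    # sigma_i = 1 / K_ii
    assert abs(cond_var - 1.0 / K[i, i]) < 1e-8
    # Sigma_{\i\i}^{-1} Sigma_{i\i} = -K_{i\i} / K_ii
    assert np.allclose(np.linalg.solve(S_mi, s_i), -K[idx, i] / K[i, i])

print("blockwise identities verified")
```

Both identities are the standard Schur-complement relations between a covariance matrix and its inverse, which the proof exploits coordinate by coordinate.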
The summands of (19) can be rewritten as

\[ -\frac{n}{2 K_{ii}}\Big( 2 K_{ii}\, S_{i\setminus i}^{\top} K_{i\setminus i} + K_{i\setminus i}^{\top} S_{\setminus i\setminus i} K_{i\setminus i} \Big) = -\frac{n K_{ii}}{2}\,\Big\| K_{ii}^{-1} S_{\setminus i\setminus i}^{1/2} K_{i\setminus i} + S_{\setminus i\setminus i}^{-1/2} S_{i\setminus i} \Big\|_2^2 + c', \]

where c' = (n K_ii/2) S_{i∖i}ᵀ S_{∖i∖i}⁻¹ S_{i∖i} does not depend on K_ij with j ≠ i. Adding an ℓ₁ penalty term on K_{i∖i} and defining β = K_ii⁻¹ K_{i∖i} leads to the objective function of Problem (5), which concludes the proof.

A.3. Subdifferential for the Cooperative-LASSO

By definition, for a convex function g, the subdifferential at β₀ is

\[ \partial g|_{\beta_0} = \big\{ \theta : \forall \beta,\ g(\beta) - g(\beta_0) \ge \theta^{\top}(\beta - \beta_0) \big\}. \]

The function g(β) = ‖(β)₊‖₂ + ‖(−β)₊‖₂ has kinks whenever β has at least one zero component and either no positive or no negative component. There are thus three situations where the subdifferential does not reduce to the gradient:

1. ‖(β₀)₊‖₂ = 0 and ‖(−β₀)₊‖₂ ≠ 0;
2. ‖(β₀)₊‖₂ ≠ 0 and ‖(−β₀)₊‖₂ = 0;
3. ‖(β₀)₊‖₂ = 0 and ‖(−β₀)₊‖₂ = 0, i.e. β₀ = 0.

For the first situation, denoting A the index of the non-zero entries of β₀ and A^c its complement, the subdifferential is defined as

\[ \Big\{ \|(-\beta_0)_+\|_2^{-1}\beta_0 + \theta \ : \ \theta_{\mathcal{A}} = 0 \ \text{and}\ \forall \beta_{\mathcal{A}^c},\ \|(\beta_{\mathcal{A}^c})_+\|_2 \ge \theta_{\mathcal{A}^c}^{\top} \beta_{\mathcal{A}^c} \Big\}. \tag{20} \]

The set of admissible θ is explicitly given by

\[ \Big\{ \theta \ : \ \theta_{\mathcal{A}} = 0,\ \|(\theta_{\mathcal{A}^c})_+\|_2 \le 1 \ \text{and}\ \|(-\theta_{\mathcal{A}^c})_+\|_2 = 0 \Big\}. \tag{21} \]

Proof. We first show that, for any θ in the set defined in (21), the inequality in definition (20) always holds. Dropping the subscript A^c for readability, we have:

\[ \theta^{\top}\beta = (\theta)_+^{\top}\beta - (-\theta)_+^{\top}\beta = (\theta)_+^{\top}\beta = (\theta)_+^{\top}(\beta)_+ - (\theta)_+^{\top}(-\beta)_+ \le (\theta)_+^{\top}(\beta)_+ \le \|(\theta)_+\|_2\,\|(\beta)_+\|_2 \le \|(\beta)_+\|_2. \]
To finish the proof, it suffices to exhibit some β such that the inequality in definition (20) does not hold when ‖(θ)₊‖₂ > 1 or when ‖(−θ)₊‖₂ > 0. For ‖(θ)₊‖₂ > 1, we choose β = (θ)₊, yielding θᵀβ = ‖(θ)₊‖₂², and ‖(β)₊‖₂ = ‖(θ)₊‖₂ < ‖(θ)₊‖₂², hence ‖(β)₊‖₂ < θᵀβ; for ‖(−θ)₊‖₂ > 0, we choose β = −(−θ)₊, yielding θᵀβ = ‖(−θ)₊‖₂² > 0, and ‖(β)₊‖₂ = 0, hence ‖(β)₊‖₂ < θᵀβ.

The second situation is treated as the first one, yielding

\[ \partial g|_{\beta_0} = \Big\{ \|(\beta_0)_+\|_2^{-1}\beta_0 + \theta \ : \ \theta_{\mathcal{A}} = 0,\ \|(-\theta_{\mathcal{A}^c})_+\|_2 \le 1 \ \text{and}\ \|(\theta_{\mathcal{A}^c})_+\|_2 = 0 \Big\}. \]

For the last situation, the subdifferential, defined as

\[ \partial g|_{\beta_0} = \big\{ \theta : \forall \beta,\ \|(\beta)_+\|_2 + \|(-\beta)_+\|_2 \ge \theta^{\top}\beta \big\}, \tag{22} \]

reads

\[ \partial g|_{\beta_0} = \Big\{ \theta : \max\big( \|(\theta)_+\|_2,\ \|(-\theta)_+\|_2 \big) \le 1 \Big\}. \tag{23} \]

Proof. We first show that, for all the elements of ∂g as explicitly defined in (23), the inequality in definition (22) always holds:

\[ \theta^{\top}\beta = (\theta)_+^{\top}\beta - (-\theta)_+^{\top}\beta \le (\theta)_+^{\top}(\beta)_+ + (-\theta)_+^{\top}(-\beta)_+ \le \|(\theta)_+\|_2\|(\beta)_+\|_2 + \|(-\theta)_+\|_2\|(-\beta)_+\|_2 \le \max\big(\|(\theta)_+\|_2, \|(-\theta)_+\|_2\big)\big( \|(\beta)_+\|_2 + \|(-\beta)_+\|_2 \big). \]

To finish the proof, it suffices to exhibit some β such that the inequality in definition (22) does not hold for max(‖(θ)₊‖₂, ‖(−θ)₊‖₂) > 1. Without loss of generality, we assume ‖(θ)₊‖₂ > 1, and choose β = (θ)₊, yielding θᵀβ = ‖(θ)₊‖₂², and ‖(β)₊‖₂ + ‖(−β)₊‖₂ = ‖(β)₊‖₂ = ‖(θ)₊‖₂ < ‖(θ)₊‖₂², hence ‖(β)₊‖₂ + ‖(−β)₊‖₂ < θᵀβ.

References

C. Ambroise, J. Chiquet, and C. Matias.
Inferring sparse Gaussian graphical models with latent structure. Electron. J. Stat., 3:205–238, 2009.
A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.
O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res., 9:485–516, 2008.
J. Baxter. A model of inductive bias learning. J. Artif. Int. Res., 12(1):149–198, 2000.
S. Bengio, J. Mariéthoz, and M. Keller. The expected performance curve. In ICML Workshop on ROC Analysis in Machine Learning, 2005.
R. Caruana. Multitask learning. Mach. Learn., 28(1):41–75, 1997.
C. Charbonnier, J. Chiquet, and C. Ambroise. Weighted-LASSO for structured network inference from time course data. Statistical Applications in Genetics and Molecular Biology, 9(1), 2010.
C. Drummond and R. C. Holte. Cost curves: An improved method for visualizing classifier performance. Mach. Learn., 65(1):95–130, 2006.
B. Efron. The future of indirect evidence. Technical Report 250, Division of Biostatistics, Stanford University, 2009.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
J. H. Friedman. Regularized discriminant analysis. J. Amer. Statist. Assoc., 84(405):165–175, 1989.
Y. Kim, J. Kim, and Y. Kim. Blockwise sparse regression. Statistica Sinica, 16:375–390, 2006.
M. Kolar, L. Song, A. Ahmed, and E. P. Xing. Estimating time-varying networks. Ann. Appl. Stat., 2009.
N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Ann. Statist., 34(3):1436–1462, 2006.
M. Nikolova. Local strong homogeneity of a regularized estimator. SIAM J. Appl. Math., 61(2):633–658, 2000.
M. R. Osborne, B.
Presnell, and B. A. Turlach. On the LASSO and its dual. J. Comput. Graph. Statist., 9(2):319–337, 2000a.
M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA J. Numer. Anal., 20(3):389–403, 2000b.
P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using ℓ₁-regularized logistic regression. Ann. Statist., 38:1287–1319, 2010.
G. V. Rocha, P. Zhao, and B. Yu. A path following algorithm for sparse pseudo-likelihood inverse covariance estimation (SPLICE), 2008.
V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In International Conference on Machine Learning, 2008.
K. Sachs, O. Perez, D. Pe'er, D. A. Lauffenburger, and G. P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308:523–529, 2005.
J. Schäfer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat. Appl. Genet. Mol. Biol., 4(1), 2005.
H. Toh and K. Horimoto. Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics, 18:287–297, 2002.
F. Villers, B. Schaeffer, C. Bertin, and S. Huet. Assessing the validity domains of graphical Gaussian models in order to infer relationships among components of complex biological systems. Stat. Appl. Genet. Mol. Biol., 7(2), 2008.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(1):49–67, 2006.
M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
