Split Bregman Method for Sparse Inverse Covariance Estimation with Matrix Iteration Acceleration

Authors: Gui-Bo Ye, Jian-Feng Cai, Xiaohui Xie

School of Information and Computer Science, University of California, Irvine, CA 92697, USA (Gui-Bo Ye, e-mail: yeg@uci.edu; Xiaohui Xie, e-mail: xhx@ics.uci.edu); Department of Mathematics, University of California, Los Angeles, CA 90095, USA (Jian-Feng Cai, e-mail: cai@math.ucla.edu)

Abstract: We consider the problem of estimating the inverse covariance matrix by maximizing the likelihood function with a penalty added to encourage the sparsity of the resulting matrix. We propose a new approach based on the split Bregman method to solve the regularized maximum likelihood estimation problem. We show that our method is significantly faster than the widely used graphical lasso method, which is based on blockwise coordinate descent, on both artificial and real-world data. More importantly, unlike the graphical lasso, the split Bregman based method is much more general, and can be applied to a class of regularization terms other than the $\ell_1$ norm.

Keywords and phrases: inverse covariance matrix, split Bregman, $\ell_1$ norm, graphical model.

1. Introduction

Undirected graphical models provide an efficient way to describe and explain the relationships among a set of objects (or variables), and have become a popular tool for modeling networks of components in a variety of applications, including the internet, social networks, and gene networks [1, 2, 3]. A key step in constructing a graphical model for a particular application is to learn the structure of the graphical model from a set of observations, a task often called model selection in statistics. If we assume the observations have a multivariate Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$, learning the structure of a graphical model can be reformulated as a covariance selection problem [4], which aims to identify the nonzero entries of the inverse covariance matrix (also known as the concentration matrix or precision matrix). The idea behind this formulation is that if the $(i,j)$-entry of $\Sigma^{-1}$ is zero, then variables $i$ and $j$ are conditionally independent given the other variables [5, 6]. The principle of parsimony suggests that among the graphical models that adequately explain the data we should select the simplest. Thus, it is natural to impose an $\ell_1$ penalty on the estimate of $\Sigma^{-1}$ to promote the sparsity of the resulting graph, as has been proposed by a number of authors [7, 2, 8, 9].

It has been noted that if we fit a linear regression model to each variable using the others as predictors, the regression coefficient of variable $j$ on $i$ will be, up to a positive scalar, equal to the $(i,j)$-entry of $\Sigma^{-1}$ [10, 6]. Based on this observation, Meinshausen and Bühlmann [7] proposed a simple approach to the structural learning problem: they estimate a sparse graphical model by fitting a lasso regression model for each variable, treating that variable as the response and the others as predictors. The $(i,j)$-entry of $\Sigma^{-1}$ is estimated to be nonzero if either the estimated coefficient of variable $i$ on $j$ or the estimated coefficient of variable $j$ on $i$ is nonzero (alternatively, one can use an AND rule, requiring both coefficients to be nonzero).
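To illustrate this correspondence concretely (our own sketch, not code from the paper), the following numpy snippet builds a small Gaussian model with a known sparse precision matrix and checks that the population regression coefficients of variable 0 on the rest vanish exactly where the corresponding off-diagonal entries of $\Sigma^{-1}$ vanish; the example matrix and its dimensions are made up.

```python
import numpy as np

# Hypothetical 4-variable example: a sparse, diagonally dominant precision matrix.
Theta = np.array([[2.0, 0.6, 0.0, 0.0],
                  [0.6, 2.0, 0.5, 0.0],
                  [0.0, 0.5, 2.0, 0.4],
                  [0.0, 0.0, 0.4, 2.0]])
Sigma = np.linalg.inv(Theta)  # covariance implied by Theta

# Population regression of variable 0 on variables 1..3:
# beta = Sigma[1:,1:]^{-1} Sigma[1:,0], which equals -Theta[0,1:] / Theta[0,0].
beta = np.linalg.solve(Sigma[1:, 1:], Sigma[1:, 0])
print(beta)                         # zero exactly where Theta[0, 1:] is zero
print(-Theta[0, 1:] / Theta[0, 0])  # matches beta entrywise
```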
Although this approach is computationally attractive and has been shown to consistently estimate $\Sigma^{-1}$ asymptotically, it does not take the intrinsic symmetry of $\Sigma^{-1}$ into account, and can produce contradictory neighborhoods and a non-positive-definite concentration matrix. Furthermore, if the same penalty parameter is used for all lasso regressions, as is commonly done to reduce the number of parameters, the approach will not correctly infer graphs with skewed degree distributions. To overcome these limitations, Peng et al. [6] proposed a symmetric regression approach; however, the derived concentration matrix can still be non-positive-definite.

A more principled approach to covariance selection is to find the $\Sigma^{-1}$ that maximizes the log-likelihood of the data with an added $\ell_1$ penalty [11] to encourage the sparsity of the resulting graph [2, 8, 9]. More specifically, suppose we are given $n$ samples independently drawn from a $p$-variate Gaussian distribution: $x_1, \ldots, x_n \sim N(\mu, \Sigma)$. Let $S$ be the empirical covariance matrix:

$$S := \frac{1}{n}\sum_{k=1}^{n}(x_k - \mu)(x_k - \mu)^T.$$

Denote $\Theta = \Sigma^{-1}$. The goal is to find the $\Theta^*$ that maximizes the penalized log-likelihood

$$\log\det\Theta - \operatorname{tr}(S\Theta) - \lambda\sum_{i\neq j}|\Theta_{ij}|, \tag{1}$$

subject to the constraint that $\Theta$ is positive definite. Here, we use $\operatorname{tr}$ to denote the trace and $\Theta_{ij}$ to denote the $(i,j)$-entry of $\Theta$. Due to the $\ell_1$ penalty term and the explicit positive definite constraint on $\Theta$, the method leads to a sparse estimate of the concentration matrix that is guaranteed to be positive definite. The simpler approach of Meinshausen and Bühlmann [7] can be viewed as an approximation to the penalized maximum likelihood formulation [8, 9].
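For concreteness, the penalized log-likelihood (1) is straightforward to evaluate; below is a minimal numpy sketch (our illustration, not the authors' code), where the helper name `penalized_loglik` and the toy data are made up.

```python
import numpy as np

def penalized_loglik(Theta, S, lam):
    """Objective (1): log det(Theta) - tr(S @ Theta) - lam * sum_{i != j} |Theta_ij|."""
    sign, logdet = np.linalg.slogdet(Theta)
    assert sign > 0, "Theta must be positive definite"
    off_diag_l1 = np.abs(Theta).sum() - np.abs(np.diag(Theta)).sum()
    return logdet - np.trace(S @ Theta) - lam * off_diag_l1

# Toy usage with a made-up dataset: n = 50 samples of p = 5 variables.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
S = np.cov(X, rowvar=False, bias=True)  # S = (1/n) sum_k (x_k - mu)(x_k - mu)^T
print(penalized_loglik(np.eye(5), S, lam=0.1))
```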
The objective function in (1) is strictly convex, so a global optimal solution is guaranteed to exist and be unique. Finding the optimal solution can, however, be computationally challenging due to the $\log\det$ term in the likelihood function and the nondifferentiability of the $\ell_1$ penalty. Yuan and Lin [2] solved (1) using the interior-point method for the "maxdet" problem, which is prohibitive for problems with more than tens of variables due to its memory requirements and computational complexity. Banerjee et al. [8] developed a blockwise coordinate descent method to solve (1), after noting that the dual of (1) can be reduced to a lasso regression problem if one focuses on optimizing only one row or column of $\Omega$. Exploiting this observation further, Friedman et al. [9] used a coordinate descent procedure to solve the lasso regression arising in the dual of the blockwise coordinate descent, and implemented an efficient method, called the "graphical lasso", to solve (1). The graphical lasso is remarkably fast; it can solve a $p = 1000$ dimension ($\sim$500,000 parameters) problem within a minute. However, a major limitation of the blockwise coordinate descent methods, including the graphical lasso, is that their derivation is specific to the $\ell_1$ penalty. This limitation prevents them from being applied to other types of penalties, such as the combination of $\ell_1$ and $\ell_2$ penalties used in the elastic net [12], the fused lasso penalty [13], and the group lasso penalty [14], which have been found to be more useful than the simple $\ell_1$ penalty in a number of applications.

In this paper, we propose a new algorithm based on the split Bregman method [15, 16] to solve the penalized maximum likelihood estimation problem (1). We show that our method is not only substantially faster than the graphical lasso method, but can also be easily extended to handle a broad range of penalty terms. Although Bregman iteration is an old technique proposed in the sixties [17, 18], it gained significant interest only recently, after Osher and others demonstrated its high efficiency for sparsity recovery in a wide range of problems, including image restoration [19, 15, 16], compressed sensing [20, 21, 22], matrix completion [23], low rank matrix recovery [24], and general fused lasso problems [25]. By introducing an auxiliary variable, we show that the optimization problem in (1) can be reformulated so that the log-likelihood term and the penalty term interact only through an equality constraint, thereby enabling an efficient application of the split Bregman method to solve the optimization problem.

The rest of the paper is organized as follows. In Section 2, we first present a generalized sparse inverse covariance matrix estimation problem, which includes (1) as a special case. We then derive a split Bregman algorithm, called SBGM, to solve the generalized problem (Subsection 2.2). The convergence property of the algorithm is also given. SBGM consists of three update steps, with the first update being the major time-consuming one. In Subsection 2.3, we provide an explicit formula for the first update step and propose an efficient Newton method to solve the resulting matrix quadratic equation. In Section 3, we implement SBGM to solve the special case (1). In Section 4, we illustrate the utility of our algorithm and compare its performance to the graphical lasso method using both simulated data and gene expression data.

2. The split Bregman method for generalized sparse graphical models (SBGM)

The split Bregman method was originally proposed by Osher and coauthors to solve total variation based image restoration problems [15]. It was later found to be either equivalent or closely related to a number of other existing optimization algorithms, including Douglas-Rachford splitting [26], the alternating direction method of multipliers (ADMM) [27, 28, 15, 29], and the method of multipliers [30]. Because of its fast convergence and ease of implementation, it is increasingly becoming a method of choice for solving large-scale sparsity recovery problems [24, 16, 25].

In this section, we first extend the $\ell_1$ regularized maximum likelihood inverse covariance matrix estimation problem (1) to allow for a more general class of regularization terms. We reformulate the problem by introducing an auxiliary variable. We then proceed to derive a split Bregman method to solve the reformulated problem.
2.1. A general problem and its reformulation

We derive our algorithm in a more general setting than the one described in (1). Instead of the $\ell_1$-norm penalty, we consider a more general penalty $\phi(\Theta)$, which is only required to be convex and to satisfy $\phi(\Theta) = \phi(\Theta^T)$. The $\ell_1$ norm and a number of other penalty types, including those used in the elastic net [12], fused lasso [13], and group lasso [14], can be viewed as special cases of this general penalty term. We find $\Theta^*$ by solving the following constrained optimization problem:

$$\min_{\Theta \succ 0}\; -\log\det\Theta + \operatorname{tr}(S\Theta) + \lambda\phi(\Theta), \tag{2}$$

where $\Theta \succ 0$ means that $\Theta$ is positive definite, and $\lambda \in \mathbb{R}_+$ is a regularization parameter. If we choose $\phi(\Theta) = \sum_{i\neq j}|\Theta_{ij}|$, the general problem reduces to problem (1).

The log-likelihood term and the regularization term in (2) are coupled, which makes the optimization problem difficult to solve. However, the two terms can be decoupled if we introduce an auxiliary variable to transfer the coupling from the objective function to the constraints. More specifically, problem (2) is equivalent to the following problem:

$$\min\; -\log\det\Theta + \operatorname{tr}(S\Theta) + \lambda\phi(A) \quad \text{s.t.}\quad A = \Theta,\;\; \Theta \succ 0. \tag{3}$$

The introduction of the new variable $A$ is a key step of our algorithm; it makes the problem amenable to the split Bregman procedure detailed below.

2.2. Derivation of the split Bregman method

Although the split Bregman method originated from Bregman iterations [31, 16, 32], it has been shown to be equivalent to the alternating direction method of multipliers (ADMM) [27, 28, 33, 34]. For simplicity of presentation, we derive the split Bregman method using the augmented Lagrangian method [35, 30]. We first define an augmented Lagrangian function of (3):

$$L(\Theta, A, M) = -\log\det\Theta + \operatorname{tr}(S\Theta) + \lambda\phi(A) + \operatorname{tr}(M^T(\Theta - A)) + \frac{\mu}{2}\|\Theta - A\|_F^2, \tag{4}$$

where the matrix $M$ is a dual variable corresponding to the linear constraint $\Theta = A$, and $\mu > 0$ is a parameter. Compared with the standard Lagrangian function, the augmented Lagrangian function has an extra term $\frac{\mu}{2}\|\Theta - A\|_F^2$, which penalizes violation of the linear constraint $\Theta = A$.

With the definition of the augmented Lagrangian function (4), the primal problem (3) is equivalent to

$$\min_{\Theta \succ 0,\, A}\;\max_{M}\; L(\Theta, A, M). \tag{5}$$

Exchanging the order of min and max in (5) leads to the formulation of the dual problem

$$\max_{M} E(M), \quad \text{with } E(M) = \min_{\Theta \succ 0,\, A} L(\Theta, A, M). \tag{6}$$

Note that the gradient $\nabla E(M)$ can be calculated as follows [36]:

$$\nabla E(M) = \Theta(M) - A(M), \quad \text{with } (\Theta(M), A(M)) = \arg\min_{\Theta \succ 0,\, A} L(\Theta, A, M). \tag{7}$$

Applying gradient ascent to the dual problem (6) and using equation (7), we obtain the method of multipliers [30] to solve (3):

$$\begin{cases} (\Theta^{k+1}, A^{k+1}) = \arg\min_{\Theta \succ 0,\, A} L(\Theta, A, M^k), \\ M^{k+1} = M^k + \mu(\Theta^{k+1} - A^{k+1}). \end{cases} \tag{8}$$

Here we have used $\mu$ as the step size of the gradient ascent. It is easy to see that the efficiency of the iterative algorithm (8) largely hinges on whether the first equation of (8) can be solved efficiently. Note that the augmented Lagrangian function $L(\Theta, A, M^k)$ still contains a nondifferentiable term $\phi(A)$. But unlike in the original objective function (2), the nondifferentiable term induced by $\phi$ has now been transferred from terms involving $\Theta$ to terms involving $A$ only.
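As a sanity-check sketch (ours, with assumed function names), the augmented Lagrangian (4) translates directly into numpy; `phi` stands for any convex penalty obeying $\phi(A) = \phi(A^T)$.

```python
import numpy as np

def aug_lagrangian(Theta, A, M, S, lam, mu, phi):
    """Augmented Lagrangian (4) for a generic convex penalty phi."""
    sign, logdet = np.linalg.slogdet(Theta)
    assert sign > 0, "Theta must stay positive definite"
    return (-logdet + np.trace(S @ Theta) + lam * phi(A)
            + np.trace(M.T @ (Theta - A))
            + 0.5 * mu * np.linalg.norm(Theta - A, "fro") ** 2)

# Example penalty: the off-diagonal l1 norm that recovers problem (1).
phi_l1 = lambda A: np.abs(A).sum() - np.abs(np.diag(A)).sum()
```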
Thus we can solve the first equation of (8) through an iterative algorithm that alternates between the minimization over $\Theta$ and over $A$:

$$\begin{cases} \Theta^{k+1} = \arg\min_{\Theta \succ 0}\; -\log\det\Theta + \operatorname{tr}(S\Theta) + \operatorname{tr}((M^k)^T(\Theta - A^k)) + \frac{\mu}{2}\|\Theta - A^k\|_F^2, \\ A^{k+1} = \arg\min_{A}\; \lambda\phi(A) + \operatorname{tr}((M^k)^T(\Theta^{k+1} - A)) + \frac{\mu}{2}\|\Theta^{k+1} - A\|_F^2. \end{cases} \tag{9}$$

The method of multipliers requires that the alternating minimization over $\Theta$ and $A$ in (9) be run multiple times until convergence. However, because the first equation of (8) represents only one step of the overall iteration, it is actually not necessary to solve it exactly. In fact, the split Bregman method (or the alternating direction method of multipliers [27]) uses only one iteration of (9), which leads to the following iterative algorithm for solving (3):

$$\begin{cases} \Theta^{k+1} = \arg\min_{\Theta \succ 0}\; -\log\det\Theta + \operatorname{tr}(S\Theta) + \operatorname{tr}((M^k)^T(\Theta - A^k)) + \frac{\mu}{2}\|\Theta - A^k\|_F^2, \\ A^{k+1} = \arg\min_{A}\; \lambda\phi(A) + \frac{\mu}{2}\|\Theta^{k+1} - A + \mu^{-1}M^k\|_F^2, \\ M^{k+1} = M^k + \mu(\Theta^{k+1} - A^{k+1}). \end{cases} \tag{10}$$

2.2.1. Convergence property

The convergence of the iteration (10) follows from the convergence theory of the alternating direction method of multipliers, or from the convergence theory of the split Bregman method [27, 37, 16].

Theorem 1. Let $\Theta^k$ be generated by (10), and let $\Theta^*$ be the unique minimizer of (3). Then $\lim_{k\to\infty}\|\Theta^k - \Theta^*\| = 0$.

By Theorem 1, the condition for the convergence of the iteration (10) is quite mild, and is even independent of the choice of the parameter $\mu$ in (10). This property makes the split Bregman method quite general and easy to implement, which partly explains why the method has been gaining popularity recently.

2.2.2. Updating $\Theta$ and $A$

We first focus on the computation of the first equation of (10). Taking the derivative of the objective function and setting it to zero, we get

$$-\Theta^{-1} + \mu\Theta = \mu A^k - S - M^k. \tag{11}$$

This is a quadratic equation in which the unknown is a matrix. The complexity of solving this equation is at least $O(p^3)$ because of the inversion involved in (11). Note that because $\phi(\Theta) = \phi(\Theta^T)$, if $\Theta^k$ is symmetric, so is $\mu A^k - S - M^k$. It is easy to check that the explicit solution of (11) under the constraint $\Theta \succ 0$, i.e., $\Theta^{k+1}$, is

$$\Theta^{k+1} = \frac{K^k + \sqrt{(K^k)^2 + 4\mu I}}{2\mu}, \quad \text{where } K^k = \mu A^k - S - M^k. \tag{12}$$

Here $\sqrt{C}$ denotes the square root of a symmetric positive definite matrix $C$. Recall that the square root of a symmetric positive definite matrix $C$ is the matrix whose eigenvectors are the same as those of $C$ and whose eigenvalues are the square roots of those of $C$. Therefore, to compute the update $\Theta^{k+1}$, one may first compute the eigenvalues and eigenvectors of $K^k$, and then obtain the eigenvalues of $\Theta^{k+1}$ according to (12) by replacing the matrices with the corresponding eigenvalues. This approach is rather slow due to the eigendecomposition step. In the next subsection, we propose a faster algorithm for solving (11) based on Newton's iteration [38, 39].
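For reference, the eigendecomposition route to (12) takes only a few lines of numpy (our sketch of the baseline we are about to speed up; the function name is ours):

```python
import numpy as np

def theta_update_eig(K, mu):
    """Solve -inv(Theta) + mu*Theta = K for Theta > 0 via (12), using an
    eigendecomposition of the symmetric matrix K."""
    w, U = np.linalg.eigh(K)  # K = U diag(w) U^T
    theta_eigs = (w + np.sqrt(w ** 2 + 4.0 * mu)) / (2.0 * mu)  # scalar version of (12)
    return (U * theta_eigs) @ U.T  # positive definite by construction
```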
For the second equation of (10), we have made the data fitting term $\frac{\mu}{2}\|\Theta^{k+1} - A + \mu^{-1}M^k\|_F^2$ separable with respect to the entries of $A$. Thus, if $\phi(A)$ is separable, the solution is very easy to obtain, and the computational complexity is $O(p^2)$ in most cases; see Section 3 for the special case $\phi(A) = \sum_{i\neq j}|A_{ij}|$. Hence, compared with the update of $\Theta$, the computational time for updating $A$ is minor and mostly negligible. The computation of the third equation of (10) is straightforward, and its computational cost is also $O(p^2)$, which can be neglected as well. So the overall computational time for running one iteration of (10) depends mostly on how fast $\Theta$ can be updated.

2.3. Newton's method to calculate the square root of a positive definite matrix

As described above, the main step in obtaining the update $\Theta^{k+1}$ is calculating the square root of the positive definite matrix $(K^k)^2 + 4\mu I$, where $K^k$ is a symmetric matrix. One possible way is to calculate all the eigenvalues and eigenvectors of the matrix $K^k$, which is computationally demanding when the size of the matrix is large. In this subsection, we use Newton's method [39, 38] to calculate the square root of a positive definite matrix directly, instead of using the eigenvalue decomposition of $K^k$. Our numerical experiments show that Newton's method for calculating the square root of a positive definite matrix of the form $K^2 + 4\mu I$, where $K$ is symmetric, is about four times faster than the approach based on eigendecomposition.

We begin with finding the positive root of the equation $x^2 - a = 0$ ($a > 0$) using Newton's method [40, 41]:

$$x_{k+1} = \frac{1}{2}\left(x_k + \frac{a}{x_k}\right). \tag{13}$$

The following lemma ensures that (13) with $x_0 > 0$ always converges to $\sqrt{a}$, and that the convergence rate is quadratic.

Lemma 1. If we choose $x_0 > 0$, then the sequence $x_k$ generated by (13) is well-defined and satisfies

$$|x_{k+1} - \sqrt{a}| \le \min\left\{\frac{1}{2\sqrt{a}}\,|x_k - \sqrt{a}|^2,\; \frac{1}{2}\,|x_k - \sqrt{a}|\right\}.$$

Proof. Since $x_0 > 0$, $x_1$ is well-defined and $x_1 = \frac{1}{2}(x_0 + \frac{a}{x_0}) \ge \sqrt{a}$. By induction, $x_k$ is well-defined and $x_k \ge \sqrt{a}$. Moreover, using the iteration (13),

$$|x_{k+1} - \sqrt{a}| = \left|\frac{1}{2}\left(x_k + \frac{a}{x_k}\right) - \sqrt{a}\right| = \frac{1}{2|x_k|}\,|x_k - \sqrt{a}|^2 \le \frac{1}{2\sqrt{a}}\,|x_k - \sqrt{a}|^2,$$

and

$$|x_{k+1} - \sqrt{a}| = \frac{1}{2|x_k|}\,|x_k - \sqrt{a}|^2 = \frac{1}{2}\left(1 - \frac{\sqrt{a}}{x_k}\right)|x_k - \sqrt{a}| \le \frac{1}{2}\,|x_k - \sqrt{a}|.$$

Now we apply Newton's method to calculate the square root of a positive definite matrix of the form $K^2 + \alpha I$, where $K$ is symmetric and $\alpha$ is a positive constant. Let $X_0 = \sqrt{\alpha}\, I$ and

$$X_{k+1} = \frac{1}{2}\left(X_k + (X_k)^{-1}(K^2 + \alpha I)\right). \tag{14}$$

Then the iteration (14) always converges quadratically to $\sqrt{K^2 + \alpha I}$. Similar algorithms have been proposed in [42, 43, 41] to compute polar decompositions.

Theorem 2. If we choose $X_0 = \sqrt{\alpha}\, I$, then $X_k$ generated by (14) is well-defined and converges quadratically to $\sqrt{K^2 + \alpha I}$. More specifically,

$$\|X_{k+1} - \sqrt{K^2 + \alpha I}\|_2 \le \min\left\{\frac{1}{2\sqrt{\alpha + \lambda_{\min}(K^2)}}\,\|X_k - \sqrt{K^2 + \alpha I}\|_2^2,\; \frac{1}{2}\,\|X_k - \sqrt{K^2 + \alpha I}\|_2\right\}, \tag{15}$$

where $\lambda_{\min}(K^2)$ is the minimum eigenvalue of the matrix $K^2$.

Proof. Let $K = U\Omega U^T$ with $\Omega = \operatorname{diag}(\omega_1, \ldots, \omega_p)$ and $UU^T = U^T U = I$. Since $X_0 = \sqrt{\alpha}\, I$, by induction $X_k$ can be written as $X_k = U\Omega_k U^T$ with $\Omega_k = \operatorname{diag}(\omega_1^k, \ldots, \omega_p^k)$ and $\omega_i^k > 0$. Furthermore,

$$X_{k+1} = U\left(\frac{1}{2}\left(\Omega_k + (\Omega_k)^{-1}(\Omega^2 + \alpha I)\right)\right)U^T.$$

Therefore, (14) changes only the eigenvalues, which are governed by (13). The theorem follows immediately from Lemma 1.

Note that the initial guess $X_0$ can be any matrix of the form $U\Lambda U^T$, where $U$ contains the eigenvectors of $K$ and $\Lambda$ is a diagonal matrix with positive diagonal entries. In the above, we have chosen $X_0 = \sqrt{\alpha}\, I$ for simplicity.
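The iteration (14) translates directly into code. Below is a minimal numpy sketch (our illustration; the paper's implementation is in Matlab/mex) using the stopping rule adopted in Section 4, namely a relative change below $10^{-6}$:

```python
import numpy as np

def newton_sqrt(K, alpha, tol=1e-6, max_iter=50):
    """Newton iteration (14) for sqrt(K^2 + alpha*I), with K symmetric, alpha > 0."""
    p = K.shape[0]
    K2aI = K @ K + alpha * np.eye(p)
    X = np.sqrt(alpha) * np.eye(p)  # X_0 = sqrt(alpha) * I
    for _ in range(max_iter):
        X_new = 0.5 * (X + np.linalg.solve(X, K2aI))  # X_{k+1} in (14)
        X_new = 0.5 * (X_new + X_new.T)  # symmetrize against floating-point drift
        if np.linalg.norm(X_new - X, "fro") <= tol * np.linalg.norm(X, "fro"):
            return X_new
        X = X_new
    return X
```

In exact arithmetic every iterate shares the eigenvectors of $K$ and is symmetric; the explicit symmetrization above is a cheap numerical safeguard.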
Combining the iteration (10) with Newton's method for calculating the square root of a positive definite matrix of the form $K^2 + 4\mu I$, where $K$ is symmetric, we get Algorithm 1 for computing the solution of (2).

Algorithm 1: Split Bregman method for generalized graphical models (2) (SBGM)

Given $\mu$. Initialize $\Theta^0$, $A^0$, and $M^0$.
repeat
  1) Compute $K^k = \mu A^k - S - M^k$.
  2) Use Newton's method to compute $X^k = \sqrt{(K^k)^2 + 4\mu I}$.
  3) $\Theta^{k+1} = \frac{K^k + X^k}{2\mu}$.
  4) $A^{k+1} = \arg\min_A\; \lambda\phi(A) + \frac{\mu}{2}\|\Theta^{k+1} - A + \mu^{-1}M^k\|_F^2$.
  5) $M^{k+1} = M^k + \mu(\Theta^{k+1} - A^{k+1})$.
until convergence

3. The split Bregman method for $\ell_1$ penalized graphical models (SBGLasso)

In this section, we describe a detailed implementation of the split Bregman method for solving the $\ell_1$ penalized inverse covariance matrix estimation problem (1). The $\ell_1$ penalty corresponds to a special case of the general penalty used in (2), with $\phi(\Theta) = \sum_{i\neq j}|\Theta_{ij}|$. Algorithm 1 can be applied here with only minor changes. The updates of $\Theta$ and $M$ are exactly the same as the ones in Algorithm 1. For the update of $A$, let $T_\lambda$ be the soft thresholding operator defined on matrix space by

$$T_\lambda(\Omega) = \left(t_\lambda(\omega_{ij})\right)_{i \neq j}, \quad \text{with } t_\lambda(\omega_{ij}) = \operatorname{sgn}(\omega_{ij})\max\{0, |\omega_{ij}| - \lambda\},$$

applied to the off-diagonal entries. Then the update of $A$ is

$$A^{k+1} = T_{\lambda/\mu}\left(\Theta^{k+1} + \mu^{-1}M^k\right).$$

So we obtain Algorithm 2 for solving (1).

Algorithm 2: Split Bregman method for $\ell_1$ penalized graphical models (SBGLasso)

Initialize $\Theta^0$, $A^0$, and $M^0$.
repeat
  1) Compute $K^k = \mu A^k - S - M^k$.
  2) Use Newton's method to compute $X^k = \sqrt{(K^k)^2 + 4\mu I}$.
  3) $\Theta^{k+1} = \frac{K^k + X^k}{2\mu}$.
  4) $A^{k+1} = T_{\lambda/\mu}(\Theta^{k+1} + \mu^{-1}M^k)$.
  5) $M^{k+1} = M^k + \mu(\Theta^{k+1} - A^{k+1})$.
until convergence
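Putting the pieces together, Algorithm 2 admits a compact numpy sketch (our illustration, reusing the `newton_sqrt` helper from the sketch in Subsection 2.3; the initialization and the feasibility-based stopping test are assumptions consistent with Section 4):

```python
import numpy as np

def soft_threshold_offdiag(W, t):
    """T_t from Section 3: soft-threshold the off-diagonal entries, keep the diagonal."""
    A = np.sign(W) * np.maximum(np.abs(W) - t, 0.0)
    np.fill_diagonal(A, np.diag(W))  # the diagonal is not penalized
    return A

def sbglasso(S, lam, mu=0.5, tol=1e-4, max_iter=500):
    """Sketch of Algorithm 2 (SBGLasso) for the l1-penalized problem (1)."""
    p = S.shape[0]
    Theta, A, M = np.eye(p), np.eye(p), np.zeros((p, p))
    for _ in range(max_iter):
        K = mu * A - S - M                       # step 1
        X = newton_sqrt(K, 4.0 * mu)             # step 2: sqrt(K^2 + 4*mu*I)
        Theta = (K + X) / (2.0 * mu)             # step 3: update (12)
        A = soft_threshold_offdiag(Theta + M / mu, lam / mu)  # step 4
        M = M + mu * (Theta - A)                 # step 5
        if np.linalg.norm(Theta - A, "fro") <= tol * np.linalg.norm(Theta, "fro"):
            break                                # primal feasibility reached
    return A  # sparse estimate; A coincides with Theta at convergence
```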
4. Numerical experiments

In this section, we use time trials to illustrate the utility of our proposed algorithms. We first demonstrate the efficiency of the proposed Newton's method for calculating the square root of a positive definite matrix by comparing it with the eigenvalue decomposition method. We then benchmark the performance of the split Bregman method (SBGLasso) for solving the $\ell_1$ penalized inverse covariance matrix estimation problem, and compare it to the graphical lasso method. The time trials were all conducted on an Intel Core 2 Duo desktop PC (E7500, 2.93 GHz).

4.1. Newton's method versus the eigenvalue decomposition method for computing the square root of a positive definite matrix

We first illustrate the efficiency of Newton's method for computing the square root of a positive definite matrix of the form $M = K^2 + \mu I$, where $K = (B + B')/2$ and $B$ is a $p \times p$ random matrix with entries drawn from the standard Gaussian distribution. This form of the matrix $M$ is motivated by the square root we need to compute in equation (12). Our algorithm is implemented in Matlab using mex programming. We compare it with the method based on the Matlab function "schur" applied to the matrix $K$. More specifically, we use the two-line code "[U, D] = schur(K); SR = U*diag(sqrt(diag(D).^2 + mu))*U'" to compute the square root of $M$. Note that the upper triangular matrix produced by the Schur decomposition is exactly diagonal since $K$ is symmetric.

In Table 1, we list the number of iterations needed for our Newton's method to compute the square root of $M$, where we stop the iterations whenever the relative change between two successive steps is less than $10^{-6}$. We also calculated the relative difference between the result derived by our method and the one computed by Schur decomposition. As shown in Table 1, the relative difference is of order $10^{-9}$ or smaller, demonstrating that our algorithm is highly accurate. Empirically, we find that our algorithm needs only 8 steps to converge, and the number of steps stays essentially constant as the matrix size increases, which illustrates the quadratic convergence rate established in Theorem 2. Overall, our algorithm is substantially faster than the Schur decomposition based method, saving more than 75% of the computational time on all examples we tested. For example, when $p = 2000$, our algorithm takes 7.45 seconds to find a solution, significantly less than the 38.08 seconds used by the method based on Schur decomposition.

Table 1: Experimental results for computing the square root of $M$. All results are averages of 10 runs.

| $p$ | Newton iterations | Newton total time (s) | Schur total time (s) | relative difference |
|---|---|---|---|---|
| 1000 | 8 | 1.20 | 4.54 | $9.69 \times 10^{-14}$ |
| 1500 | 8 | 3.35 | 16.11 | $8.54 \times 10^{-13}$ |
| 2000 | 8 | 7.45 | 38.08 | $3.09 \times 10^{-11}$ |
| 2500 | 8 | 13.56 | 76.03 | $3.75 \times 10^{-10}$ |
| 3000 | 8 | 22.43 | 127.32 | $2.35 \times 10^{-9}$ |

4.2. SBGLasso versus the graphical lasso for $\ell_1$ penalized graphical models

Next we illustrate the efficiency of the split Bregman method for $\ell_1$ penalized maximum likelihood graphical models (SBGLasso) using time trials. The graphical lasso proposed by Friedman et al. [9] is by far the most efficient method for solving the large-scale inverse covariance matrix estimation problem (1), so we focus our comparison on the graphical lasso method. SBGLasso is coded in C, linked to a Matlab function, while the graphical lasso is coded in Fortran, linked to an R language function, so it is reasonable to compare them using time trials even though they are implemented in two different languages.

The stopping criterion of SBGLasso is specified as follows. Let $\Phi(\Theta^k) = -\log\det\Theta^k + \operatorname{tr}(S\Theta^k) + \lambda\sum_{i\neq j}|\Theta_{ij}^k|$. Since the $\ell_1$ penalized maximum likelihood graphical model (1) is a special case of model (2), we have $\lim_{k\to\infty}\Phi(\Theta^k) = \Phi(\Theta^*)$ by Theorem 1. We terminate the algorithm when the relative change of the energy functional $\Phi(\Theta)$ falls below a certain threshold $\delta$. In addition, we require the relative difference $\|\Theta - A\|_F / \|\Theta\|_F$ to be less than $\delta$ to make sure that the resulting solution is also primal feasible. We used $\delta = 10^{-4}$ in our simulations. For the graphical lasso, we also set the termination threshold to $10^{-4}$. Note that the convergence of Algorithm 2 is guaranteed no matter what value of $\mu$ is chosen, as shown in Theorem 1. However, the choice of $\mu$ can influence the speed of the algorithm, since it affects the number of iterations involved. In our implementation, we found empirically that $\mu = 0.5$ works well for the artificial data and $\mu = 0.012$ for the gene expression data.
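In code, the two-part termination test described above reads as follows (a sketch; `Phi_old` and `Phi_new` denote successive values of the energy $\Phi$):

```python
import numpy as np

def converged(Phi_old, Phi_new, Theta, A, delta=1e-4):
    """Stop when the energy changes little AND Theta ~ A (primal feasibility)."""
    rel_energy = abs(Phi_new - Phi_old) / abs(Phi_old)
    rel_feasibility = np.linalg.norm(Theta - A, "fro") / np.linalg.norm(Theta, "fro")
    return rel_energy < delta and rel_feasibility < delta
```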
4.2.1. Artificial data

To generate the artificial data, we first create a set of sparse inverse covariance matrices of different dimensions $p$, and then generate $n$ samples from the multivariate Gaussian distribution parametrized by each of the inverse covariance matrices. The sparse inverse covariance matrices are created using the same procedure as described in [8]. More specifically, to create an inverse covariance matrix, we first generate a random $p \times p$ diagonal matrix with positive entries, and then symmetrically insert random numbers at approximately $p$ randomly chosen locations of the matrix. Positive definiteness is ensured by adding a multiple of the identity to the matrix if needed. The penalty parameter is chosen such that the estimated inverse covariance matrix has roughly the same number of nonzero entries as the actual inverse covariance matrix.
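A hedged numpy sketch of this generation procedure (our reading of the recipe from [8]; the entry magnitudes, the number of inserted entries, and the eigenvalue shift are illustrative choices):

```python
import numpy as np

def make_sparse_precision(p, seed=0):
    """Random sparse precision matrix: positive diagonal plus ~p symmetric
    off-diagonal entries, shifted by a multiple of I to ensure positive definiteness."""
    rng = np.random.default_rng(seed)
    Theta = np.diag(rng.uniform(1.0, 2.0, size=p))
    for _ in range(p):  # roughly p randomly chosen off-diagonal locations
        i, j = rng.integers(0, p, size=2)
        if i != j:
            Theta[i, j] = Theta[j, i] = rng.uniform(-1.0, 1.0)
    w_min = np.linalg.eigvalsh(Theta).min()
    if w_min <= 0:
        Theta += (0.1 - w_min) * np.eye(p)  # add a multiple of the identity
    return Theta

Theta_true = make_sparse_precision(p=100)
X = np.random.default_rng(1).multivariate_normal(
    np.zeros(100), np.linalg.inv(Theta_true), size=1000)  # n = 1000 samples
```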
Table 2 shows a comparison of SBGLasso and the graphical lasso, in terms of both computational time and relative error, on the artificial data. We note that SBGLasso is significantly faster than the graphical lasso for large-scale data ($p > 500$), while achieving the same level of accuracy. It takes less than half of the computational time used by the graphical lasso in all cases we tested except $p = 500$. For example, when $n = 2000$, $p = 3000$, SBGLasso takes only 269.53 seconds to find the optimal solution, compared to the 729.63 seconds used by the graphical lasso.

To evaluate how the performance of SBGLasso scales with the problem size, we plotted the CPU time that SBGLasso took to solve problem (1) as a function of $p$ and $n$ (Figure 1). The CPU time is roughly quadratic in $p$ and constant in $n$. This behavior is expected, since each of the concentration matrices used in the artificial data has about $\frac{p(p+1)}{2}$ unknowns, which is quadratic in $p$. By contrast, the number of samples only appears in $S$, and does not directly affect the speed of SBGLasso.

Table 2: Comparing the performance of SBGLasso and the graphical lasso on the artificial data.

| $n$, $p$ | SBGLasso total time (s) | SBGLasso relative error | Graphical lasso total time (s) | Graphical lasso relative error |
|---|---|---|---|---|
| $n = 1000$, $p = 500$ | 2.61 | 0.1136 | 1.94 | 0.1136 |
| $n = 1000$, $p = 1000$ | 14.09 | 0.1192 | 27.08 | 0.1193 |
| $n = 1000$, $p = 2000$ | 88.71 | 0.1170 | 217.88 | 0.1170 |
| $n = 2000$, $p = 3000$ | 269.53 | 0.0980 | 729.63 | 0.0979 |

Fig 1: CPU times for SBGLasso for the same problem as in Table 2, for different values of $n$ and $p$. (a) CPU seconds versus dimension $p$, with $n$ fixed at 1000; (b) CPU seconds versus number of samples $n$, with $p$ fixed at 2000.

4.2.2. Gene expression data

We next apply SBGLasso to learn gene regulatory networks from microarray gene expression data. We used the Rosetta Inpharmatics compendium of gene expression profiles described by Hughes et al. [44]. The compendium contains microarray measurements of the expression of 6316 genes in response to 300 diverse mutations and chemical treatments in S. cerevisiae. We preprocessed this data by removing genes that either contain missing values in some experiments or have low variance across the different experimental conditions. This left us with a final dataset of $p = 1000$ genes measured across $n = 300$ experiments.

We used SBGLasso to solve the sparse graphical model (1) on this data, and compared its performance to the graphical lasso [9]. The results are summarized in Table 3, which shows the computational time spent by each solver for different choices of the regularization parameter $\lambda$. SBGLasso consistently outperforms the graphical lasso, saving more than 40% of the time in all cases. The concentration matrices derived by SBGLasso and the graphical lasso are very similar. Table 3 also lists the values of the objective function achieved when the algorithms converge, showing that the results from the two solvers are almost identical.

Table 3: Comparing the performance of SBGLasso and the graphical lasso on the gene expression data.

| $\lambda$ | SBGLasso total time (s) | SBGLasso energy $\Phi(\Theta)$ | Graphical lasso total time (s) | Graphical lasso energy $\Phi(\Theta)$ |
|---|---|---|---|---|
| 0.01 | 89.10 | $-2.96 \times 10^3$ | 167.23 | $-2.96 \times 10^3$ |
| 0.015 | 59.95 | $-2.78 \times 10^3$ | 147.92 | $-2.79 \times 10^3$ |
| 0.02 | 54.41 | $-2.72 \times 10^3$ | 116.18 | $-2.72 \times 10^3$ |
| 0.025 | 52.94 | $-2.69 \times 10^3$ | 87.74 | $-2.69 \times 10^3$ |

5. Discussion

The problem of inverse covariance matrix estimation arises in many applications. Because the number of available samples is usually much smaller than the number of free parameters associated with the inverse covariance matrix, a parsimonious approach is to select the simplest matrix that adequately explains the data. This leads to the idea of formulating the problem as a regularized maximum likelihood estimation problem, with a penalty added to encourage the sparsity of the resulting matrix. Solving the regularized maximum likelihood problem is, however, nontrivial for large-scale problems, because of the complexity of the log-likelihood term and the nondifferentiability of the regularization term.

In this work, we propose a new approach based on the split Bregman method to solve the regularized inverse covariance matrix estimation problem. We show that the approach is very fast; it solves a $3000 \times 3000$ matrix estimation problem in less than 5 minutes. We compared the split Bregman method to the graphical lasso method, the state of the art for solving $\ell_1$ penalized inverse covariance matrix estimation problems [9]. We show that our method is about twice as fast as the graphical lasso for large-scale problems, on both the artificial data and the gene expression data, while achieving the same level of accuracy. More importantly, our method is much more general and can handle a variety of sparsity regularization terms other than the $\ell_1$ norm, such as those used in the elastic net, fused lasso, and group lasso. By contrast, the graphical lasso can only be applied to the $\ell_1$ norm penalty.

The contribution of this paper lies in two aspects. First, we reformulated the regularized maximum likelihood estimation problem by introducing an auxiliary variable, which decouples the log-likelihood term and the regularization term, making the problem amenable to the split Bregman method. Second, we proposed using Newton's method to calculate the square root of a positive definite matrix, the major time-consuming step of our algorithm.
Because the matrix quadratic equation is separable in its eigenvalues, the conventional solver first computes the eigendecomposition and then solves univariate quadratic equations in the eigenvalues. This approach is generally slow, as the eigendecomposition is time-consuming. To accelerate it, we propose a matrix iterative algorithm [39, 38] that directly solves the matrix quadratic equation based on Newton's method, without using the eigendecomposition. This contribution is crucial, and makes the algorithm derived from the split Bregman method very easy to implement and highly efficient.

References

[1] S. L. Lauritzen. Graphical Models. Oxford University Press, USA, 1996.
[2] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
[3] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003.
[4] A. P. Dempster. Covariance selection. Biometrics, 28(1):157–175, 1972.
[5] D. Edwards. Introduction to Graphical Modelling. Springer Texts in Statistics. Springer-Verlag, New York, second edition, 2000.
[6] J. Peng, P. Wang, N. Zhou, and J. Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735–746, 2009.
[7] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
[8] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9:485–516, 2008.
[9] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
[10] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag, 2009.
[11] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.
[12] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol., 67(2):301–320, 2005.
[13] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 67(1):91–108, 2005.
[14] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(1):49–67, 2006.
[15] T. Goldstein and S. Osher. The split Bregman method for L1-regularized problems. SIAM J. Imaging Sci., 2(2):323–343, 2009.
[16] J. F. Cai, S. Osher, and Z. Shen. Split Bregman methods and frame based image restoration. Multiscale Model. Simul., 8(2):337–369, 2009.
[17] L. M. Brègman. A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. Ž. Vyčisl. Mat. i Mat. Fiz., 7:620–631, 1967.
[18] A. E. Çetin. Reconstruction of signals from Fourier transform samples. Signal Process., 16(2):129–148, 1989.
[19] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin. An iterative regularization method for total variation-based image restoration. Multiscale Model. Simul., 4(2):460–489, 2005.
[20] J. F. Cai, S. Osher, and Z. Shen. Linearized Bregman iterations for compressed sensing. Math. Comp., 78(267):1515–1536, 2009.
[21] S. Osher, Y. Mao, B. Dong, and W. Yin. Fast linearized Bregman iteration for compressive sensing and sparse denoising. Commun. Math. Sci., 8(2):93–111, 2010.
[22] W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman iterative algorithms for l1-minimization with applications to compressed sensing. SIAM J. Imaging Sci., 1(1):143–168, 2008.
[23] J. F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20(4):1956–1982, 2010.
[24] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Arxiv preprint arXiv:0912.3599, 2009.
[25] G. B. Ye and X. Xie. Split Bregman method for large scale fused lasso. Computational Statistics and Data Analysis, 2010.
[26] C. Wu and X. C. Tai. Augmented Lagrangian method, dual methods, and split Bregman iteration for ROF, vectorial TV, and high order models. SIAM Journal on Imaging Sciences, 3:300, 2010.
[27] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers and Mathematics with Applications, 2(1):17–40, 1976.
[28] R. Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. Rev. Franc. Automat. Inform. Rech. Operat., pages 41–76, 1975.
[29] E. Esser. Applications of Lagrangian-based alternating direction methods and connections to split Bregman. CAM Report, 9:31, 2009. URL ftp://ftp.math.ucla.edu/pub/camreport/cam09-31.pdf.
[30] R. T. Rockafellar. A dual approach to solving nonlinear programming problems by unconstrained optimization. Math. Programming, 5:354–373, 1973.
[31] J. F. Cai, S. Osher, and Z. Shen. Linearized Bregman iterations for frame-based image deblurring. SIAM J. Imaging Sci., 2(1):226–252, 2009.
[32] X. Zhang, M. Burger, X. Bresson, and S. Osher. Bregmanized nonlocal regularization for deconvolution and sparse reconstruction. SIAM Journal on Imaging Sciences, 3:253–276, 2010.
[33] S. Setzer. Split Bregman algorithm, Douglas-Rachford splitting and frame shrinkage. Scale Space and Variational Methods in Computer Vision, pages 464–476, 2009.
[34] X. Zhang, M. Burger, and S. Osher. A unified primal-dual algorithm framework based on Bregman iteration. Journal of Scientific Computing, 2010.
[35] M. R. Hestenes. Multiplier and gradient methods. J. Optimization Theory Appl., 4:303–320, 1969.
[36] D. P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Computer Science and Applied Mathematics. Academic Press Inc., New York, 1982.
[37] J. Eckstein and D. P. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1):293–318, 1992.
[38] N. J. Higham. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2008.
[39] N. J. Higham. Newton's method for the matrix square root. Mathematics of Computation, 46(174):537–549, 1986.
[40] R. L. Burden and J. D. Faires. Numerical Analysis. 8th edition, Brooks/Cole, 2004.
[41] J. F. Cai and S. Osher. Fast singular value thresholding without singular value decomposition. 2010.
[42] N. J. Higham. Computing the polar decomposition - with applications. SIAM J. Sci. Stat. Comput., 7(4):1160–1174, 1986.
[43] N. J. Higham and S. S. Robert. Fast polar decomposition of an arbitrary matrix. In SIAM J. Sci. Stat. Comput. Citeseer, 1990.
[44] T. R. Hughes, M. J. Marton, A. R. Jones, C. J. Roberts, R. Stoughton, C. D. Armour, H. A. Bennett, E. Coffey, H. Dai, Y. D. He, et al. Functional discovery via a compendium of expression profiles. Cell, 102(1):109–126, 2000.
