Model Selection Through Sparse Maximum Likelihood Estimation
We consider the problem of estimating the parameters of a Gaussian or binary distribution in such a way that the resulting undirected graphical model is sparse. Our approach is to solve a maximum likelihood problem with an added l_1-norm penalty term…
Authors: Onureena Banerjee, Laurent El Ghaoui, Alex
Model Selection Thr ough Sp arse Max Likelihood Estima tion Mo del Selection Through Sparse Maxim um Lik eliho o d Estimation for Multiv aria te G a ussian or Binary Data On ureena Banerjee onureena@eecs.berkeley.edu Lauren t El Ghaoui elghaoui@eec s.berkeley.edu EECS Dep artment University of California, Berkeley Berkeley, CA 94720 US A Alexandre d’Aspremon t aspremon@princeton.edu ORFE Dep artment Princ eton Un i versity Princ eton, NJ 08544 U SA Editor: Leslie P ack Kaelbling Abstract W e consider the problem of estimating the par ameters of a Gaussian or binary distr ib ution in such a wa y that the resulting undirec t ed g raphical mo del is spa rse. Our approach is to so lve a maximum lik eliho od pr o blem with an added ℓ 1 -norm p enalt y term. The problem as for m ulated is co n vex but the memory requir emen ts and complex it y of existing int erior point methods are prohibitiv e for problems with more than tens of no des. W e present t w o new algor ithms for solving problems with at least a thousand nodes in the Gaussian case. Our fir st algo rithm uses blo c k co ordinate descent, and can b e interpreted as r ecursiv e ℓ 1 -norm pena lized regr ession. O ur seco nd algo rithm, ba sed on Nesterov’s firs t order metho d, yields a c o mplexit y estimate with a better dependence on proble m size than existing interior po int metho ds. Using a log determinant relaxa t ion o f the lo g partition function (W ain wr igh t and Jo rdan [2006]), we sho w that these same algorithms can be used to solve an approximate sparse maximum likelihoo d problem for the binary cas e. W e tes t our a lgorithms on synthetic data, as well a s o n gene expressio n and senate voting records data. Keyw ords: Mo del Selection, Maxim um Likelihoo d Estimation, Con vex Optimization, Gaussian Graphica l Model, Binary Data 1 Banerjee, El Ghaoui, and d’Aspremont 1. In tr oduction Undirected graphical mo dels offer a wa y to describ e and explain the relationships among a set of v ariables, a cen tral elemen t of m ultiv ariate data analysis. Th e principle of parsimony dictates that w e should select the simplest graph i cal mod e l that adequately explains the data. In this pap er weconsider practical w a ys of implement ing the follo wing approac h to finding such a mo del: giv en a set of data, w e solv e a maxim um lik eliho o d pr o blem with an added ℓ 1 -norm p enalt y to mak e the resulting graph as sparse as p ossible. Man y authors h a v e studied a v ariety of related ideas. In th e Gaussian case, mo del selection in v olv es fin ding the pattern of zeros in the in v erse co v ariance matrix, since th ese zeros corresp ond to conditional ind ep endencies among the v ariables. T raditionally , a greedy forw ard-bac kw ard searc h al gorithm is used to determine the zero pattern [e.g., Laur i tzen, 1996]. Ho w ever, this is computationally infeasible for d a ta w ith ev en a mo derate num b er of v ariables. Li and Gui [2005] introd uce a gradient descent algorithm in whic h th e y account for the sparsit y of the in v ers e co v ariance matrix b y definin g a loss function that is the negativ e of the log likelihoo d function. Recen tly , Huang et al. [2005] considered p enalized maxim um lik eliho od estimation, and Dahl et al. [2006] prop osed a set of large scale metho ds for pr o blems where a sparsit y pattern for the inv erse co v ariance is giv en and one must estimate the nonzero elemen ts of the m a trix. Another wa y to estimate the graph i cal mo del is to find the set of n e igh b ors of eac h node in the graph b y regressing that v ariable aga inst th e r e maining v ariables. In this v ein, Dobra and W est [2004] employ a sto c hastic algorithm to manage tens of thousand s of v ari- ables. There has also b een a great deal of in terest in u sing ℓ 1 -norm p enalties in statistical applications. d’Aspr e mon t et al. [2004] apply an ℓ 1 norm p enalt y to s parse principle comp o- nen t analysis. Directly r e lated to our problem is the use of the Lasso of Tibshirani [1996] to obtain a v ery short list of neigh b ors for eac h n ode in the graph. Meinshausen and B ¨ uhlmann [2006] study this appr o ac h in detail, and sho w that th e resulting estimator is consistent, ev en for high-dimensional graphs. The problem formulati on for Gaussian d a ta, therefore, is simp le . The difficulty lies in its computation. Although the pr o blem is con v ex, it is non-smo oth and has an un b ounded constrain t set. As w e shall see, the resulting complexit y for existing in terior p oin t metho ds is O ( p 6 ), wher e p is the num b er of v ariables in the distribu tion. In add it ion, interior p oin t metho ds require that at eac h step w e compute and store a Hessian of size O ( p 2 ). The memory requirements and c omplexit y are thus prohib i tiv e for O ( p ) h igher than the te ns. Sp ecializ ed algorithms are n e eded to h a ndle larger problems. The remainder of the pap er is organized as follo ws. W e b egin by considering Gaussian data. I n Section 2 w e set u p the pr o blem, d e riv e its d u a l, d i scuss prop erties of the solution and how hea vily to w eigh t the ℓ 1 -norm p enalt y in our pr o blem. In Section 3 we present a pro v ably con v ergen t blo c k co ordinate descen t algorithm that can b e in terpreted as recursiv e ℓ 1 -norm p enaliz ed regression. In Sect ion ?? w e presen t a second, alternativ e algorithm based 2 Model Selection Through Sp arse Max L ikelihood Estima tion on Ne stero v’s recen t wo rk on non-smo oth optimization, and giv e a rigorous complexit y analysis with b etter dep endence on problem s i ze than interior p oin t metho ds. In Section ?? w e s h o w that the algorithms we dev elop ed for the Gaussian case can also b e used to solv e an app r o ximate sp a rse maximum lik eliho od problem for multiv ariate b i nary data, usin g a log determinan t r e laxation f o r th e log partition function giv en by W ainwrigh t and Jordan [2006]. In Section 6, we test our metho ds on synthet ic as w ell as gene expression and senate v oting records data. 2. Problem F orm ulation In this section w e set up the sp a rse maximum lik eliho od problem for Gaussian data, derive its du a l, and discu ss some of its prop erties. 2.1 Problem setup. Supp ose w e are giv en n samples ind e p enden tly dra w n from a p -v ariate Gaussian distribution: y (1) , . . . , y ( n ) ∼ N (Σ p , µ ), wh er e the co v ariance m atrix Σ is to b e estimated. L et S denote the second momen t matrix ab out the mean: S := 1 n n X k =1 ( y ( k ) − µ )( y ( k ) − µ ) T . Let ˆ Σ − 1 denote our estimate of the inv erse co v ariance matrix. Our estimato r tak es the form: ˆ Σ − 1 = arg ma x X ≻ 0 log det X − trace( S X ) − λ k X k 1 . (1) Here, k X k 1 denotes the su m of the absolute v alues of the elemen ts of the p ositiv e defin ite matrix X . The scalar paramet er λ control s the size of the p enalt y . Th e p enalt y term is a pro xy for the n um b er of nonzero elemen ts in X , and is often us e d – a lbiet with v ector, not matrix, v ariables – in regression tec hniques, s u c h as the Lasso. In the case wh er e S ≻ 0, the classical maxim um lik eliho o d estimate is r e co vered for λ = 0. Ho wev er, when the num b er of samples n is small compared to the num b er of v ariables p , the second moment matrix ma y n o t b e inv ertible. In s uc h cases, for λ > 0, our estimator p erforms some regularization so that o ur estimate ˆ Σ is alw a ys inv ertible, n o matter ho w small the r a tio of s amp le s to v ariables is. 3 Banerjee, El Ghaoui, and d’Aspremont Ev en in ca ses wh er e w e ha ve enough samples so that S ≻ 0, the in verse S − 1 ma y not b e sp a rse, even if there are man y cond itional ind e p endencies amo ng the v ariables in the distribution. By trading off maximalit y of the log likel iho od for sparsity , w e hop e to find a v ery s p a rse solution that s till ad equ a tely explains the data. A larger v alue of λ corresp onds to a sparser solution that fi t s the data less well . A smaller λ corresp onds to a solution that fits the data we ll but is less s parse. The c hoice of λ is therefore an imp ortan t issue that will b e examined in detail in S ection 2.3. 2.2 The dual problem and b ounds on the solution. W e can write (1 ) as max X ≻ 0 min k U k ∞ ≤ λ log det X + trace( X , S + U ) where k U k ∞ denotes the maxim um absolute v alue elemen t of the symmetric matrix U . This corresp onds to seeking an estimate with the m aximum w orst-case log lik eliho od , o ver a ll additiv e p erturbations of the second momen t matrix S . A similar robustness int erpretation can b e made f or a n um b er of estimation p roblems, such as su p port vec tor mac h in e s f o r classification. W e can obtain the d ual p roblem by exc hanging the max and the min . The resulting in n er problem in X can b e solv ed analytically b y setting the gradient of the ob jectiv e to zero and solving for X . The result is min k U k ∞ ≤ λ − log d e t( S + U ) − p where the prim al and du a l v ariables are related as: X = ( S + U ) − 1 . Not e that the log determinan t fu nct ion acts a log b a rrier, creating an implicit constrain t that S + U ≻ 0. T o write things neatly , let W = S + U . Then the du a l of our sparse m aximum lik eliho od problem is ˆ Σ := max { log det W : k W − S k ∞ ≤ λ } . (2) Observe that the d ual problem (1) estimates the co v ariance matrix wh il e the primal p r o blem estimates its inv erse. W e also observe that the diagonal elemen ts of the solution are Σ k k = S k k + λ for all k . The follo wing theorem sho ws that adding the ℓ 1 -norm p enalt y regularlizes the solution. Theorem 1 F or every λ > 0 , the optimal solution to (1) is unique, with b ounde d eigenval- ues: p λ ≥ k ˆ Σ − 1 k 2 ≥ ( k S k 2 + λp ) − 1 . Here, k A k 2 denotes the m a xim um eigenv alue of a symmetric matrix A . 4 Model Selection Through Sp arse Max L ikelihood Estima tion The dual p roblem (2) is sm ooth and con v ex. When p ( p + 1) / 2 is in the lo w h undreds, the problem can b e solv ed by existing soft w are that uses an inte rior p oin t metho d [e.g., V andenberghe et al., 199 8]. The complexit y to compu t e an ǫ -sub optimal solution using suc h s e cond-order metho ds, ho wev er, is O ( p 6 log(1 /ǫ )), making them infeasible when p is larger than th e tens. A relat ed problem, solv ed by Dahl et al. [200 6], is t o co mpute a maximum lik eliho od es- timate of the co v ariance matrix when the sparsit y structure of the inv erse is kno wn in adv ance. This is a ccomplished b y adding c onstrain ts to (1) of the form: X ij = 0 for all pairs ( i, j ) in some s pecified s e t. Our constrain t set is un b ounded as w e h o p e to unco v er the sp a rsit y structure automatically , starting with a dense s econd momen t matrix S . 2.3 Choice of p enalt y parameter. Consider the true, unknown graphical mo del for a give n distribution. This graph has p no des, and an edge b et wee n no des k and j is missin g if v ariables k and j are indep enden t conditional on the rest of the v ariables. F or a giv en no de k , let C k denote its connectivit y comp onen t: the set of all no des that are connected to no de k through some c h a in of edges. In particular, if no de j 6∈ C k , then v ariables j a nd k are indep endent . W e w ould lik e to c h oose the p enalt y parameter λ so that, for fi nite samp les, the probabilit y of error in estimating the graph ical mo del is controll ed. T o this en d, we can adapt the work of Meinshausen and B ¨ uhlmann [2006] as follo ws. Let ˆ C λ k denote our estimate of the connec- tivit y comp onen t of n ode k . In the con text of our optimizatio n pr o blem, this corr esp onds to the en tries of row k in ˆ Σ that are nonzero. Let α b e a give n level in [0 , 1]. Consid er the follo win g c hoice for the p enalt y parameter in (1): λ ( α ) := (max i>j ˆ σ i ˆ σ j ) t n − 2 ( α/ 2 p 2 ) q n − 2 + t 2 n − 2 ( α/ 2 p 2 ) (3) where t n − 2 ( α ) d e notes the (100 − α )% p oint of the Stu den t’s t-distribu t ion for n − 2 degrees of fr eedom, and ˆ σ i is the empirical v ariance of v ariable i . Then we can pr o v e the follo win g theorem: Theorem 2 Using λ ( α ) the p enalty p ar ameter in (1), for any fixe d level α , P ( ∃ k ∈ { 1 , . . . , p } : ˆ C λ k 6⊆ C k ) ≤ α. Observe that, for a fix ed prob lem size p , as the num b er of samples n incr eases to infi nit y , the p e nalt y parameter λ ( α ) decrea ses to zero. Thus, asymp totically w e reco v er the clas- 5 Banerjee, El Ghaoui, and d’Aspremont sical maximum like liho od estimate, S , whic h in turn conv erges in pr o babilit y t o the true co v ariance Σ. 3. Blo c k Co ordin ate Descen t Algorit hm In this section we pr esen t an algorithm for solving (2) that uses b lo c k co ordin ate descen t. 3.1 Algorithm description. W e b egin b y detail ing the al gorithm. F or a n y symmetric matrix A , let A \ j \ k denote the matrix p ro du ced by remo ving ro w k and column j . Let A j denote column j w ith the diagonal ele men t A j j remo v ed. T he plan is to optimize o ver one r o w and column of the v ariable matrix W at a time, and to rep eatedly sweep through all columns until we ac h iev e con vergence. Initialize: W (0) := S + λI F or k ≥ 0 1. F or j = 1 , . . . , p (a) Let W ( j − 1) denote the cur ren t iterate. Solve the quadratic pr ogram ˆ y := arg min y { y T ( W ( j − 1) \ j \ j ) − 1 y : k y − S j k ∞ ≤ λ } (4) (b) Up d ate rule: W ( j ) is W ( j − 1) with column/row W j replaced by ˆ y . 2. Let ˆ W (0) := W ( p ) . 3. After eac h sw eep thr ough all columns, c hec k the conv ergence condition. Conv ergence o ccurs wh en trace(( ˆ W (0) ) − 1 S ) − p + λ k ( ˆ W (0) ) − 1 k 1 ≤ ǫ. (5) 3.2 Con v e rgence and prop erty of solution. Using Sch u r complements, we can pr o ve con v ergence: Theorem 3 The blo ck c o or dinate desc ent algorithm describ e d ab ove c onver ges, acheiving an ǫ -su b optima l solution to (2). In p articular, the i ter ates pr o duc e d by the algorithm ar e strictly p ositive definite: e ach time we swe ep thr ough the c olumns, W ( j ) ≻ 0 for al l j . 6 Model Selection Through Sp arse Max L ikelihood Estima tion The pro of of Theorem 3 sheds so me in teresting ligh t on th e solution to problem (1). In particular, we can use this metho d to sh o w that th e solution has the follo wing prop ert y: Theorem 4 Fix any k ∈ { 1 , . . . , p } . If λ ≥ | S k j | for al l j 6 = k , th en c olumn and r ow k of the solution ˆ Σ to (2) ar e zer o, excluding the diagonal element. This means that, for a give n s econd momen t matrix S , if λ is c h osen su c h th at the condition in Theorem 4 is met for some column k , then the sparse maximum likel iho o d metho d esti- mates v ariable k to b e indep endent of all other v ariables in the distribu tion. In particular, Theorem 4 imp lies that if λ ≥ | S k j | for all k > j , then (1) estimates all v ariables in the distribution to b e pairwise indep endent. Using the work of Lu o and Tsen g [1992], it may b e p ossib le to sho w that the lo cal conv er- gence r ate of this metho d is at least linear. In practice w e ha v e found that a s m all n um b er of sweeps through all columns, indep endent of p roblem size p , is sufficient to ac hiev e co n- v ergence. F or a fixed num b er of K sweeps, the cost of the metho d is O ( K p 4 ), since eac h iteration costs O ( p 3 ). 3.3 In terpretation as recursiv e penalized regression. The du al of (4) is min x x T W ( j − 1) \ j \ j x − S T j x + λ k x k 1 . (6) Strong d ualit y obtains so that pr oblems (6) and (4) are equiv alen t. If w e let Q denote the square ro ot of W ( j − 1) \ j \ j , and b := 1 2 Q − 1 S j , then w e can wr ite (6) as min x k Qx − b k 2 2 + λ k x k 1 . (7) The pr oblem (7) is a p enalized least-squares p roblems, kn own as th e Lasso. If W ( j − 1) \ j \ j w ere the j -th pr incipal minor of the sample co v ariance S , then (7) w ould b e equiv alen t to a p enalized regression of v ariable j ag ainst all others. T h us, the appr oac h is reminiscen t of the appr oac h explored by Meinshau s en and B ¨ uhlmann [2006], bu t there are t wo differences. First, w e b egin w ith some regularization and, as a consequence, eac h p en alized regression problem has a u nique solution. Second, and more imp ortan tly , w e up date the problem data after eac h regression: except at the ve ry fir st up date, W ( j − 1) \ j \ j is n ev er a minor of S . In this sense, the coord inate descent metho d can b e in terpreted as a recurs ive Lasso. 7 Banerjee, El Ghaoui, and d’Aspremont 4. Nestero v’s First Order Metho d In this section we apply the recen t results d ue to Nesterov [2005] to obtain a first order algo- rithm for solving (1) with lo wer memory requ iremen ts and a rigorous complexit y estimate with a b etter dep endence on pr oblem size th an those offered by in terior p oin t metho ds. Our pur p ose is not to obtain another algorithm, as we ha v e found that the blo c k co ordin ate descen t is fairly efficien t; rather, w e seek to use Nestero v’s formalism to derive a r igorous complexit y estimate for the problem, imp r o v ed o v er that offered by in terior-p oin t metho ds. As we will see, Nestero v’s framewo rk allo ws us to obtain an algorithm that has a complexit y of O ( p 4 . 5 /ǫ ), where ǫ > 0 is the desired accuracy on the ob jectiv e of problem (1). This is in c on trast to the complexit y of interio r-p oint metho d s, O ( p 6 log(1 /ǫ )). Th us, Nest ero v’s metho d provides a muc h b etter dep endence on problem size and lo w er m emory r equiremen ts at the exp ense of a degraded dep end ence on accuracy . 4.1 Idea of Nesterov’s metho d. Nestero v’s method applies to a class of non-smooth, con ve x optimization problems of the form min x { f ( x ) : x ∈ Q 1 } (8) where the ob jectiv e fu nction can b e written as f ( x ) = ˆ f ( x ) + max u {h Ax, u i 2 : u ∈ Q 2 } . Here, Q 1 and Q 2 are b ounded, closed, con v ex sets, ˆ f ( x ) is d ifferen tiable (with a Lipsc hitz- con tinuous gradien t) and con v ex on Q 1 , and A is a linear op erator. The c h allenge is to write our problem in the appropriate form and choose asso ciated fu n ctions and parameters in suc h a wa y as to obtain the b est p ossible complexit y estimate, by applying general r esults obtained by Nesterov [2005]. Observe that w e ca n wr ite (1) in the form (8) if we imp ose b oun ds on the e igen v alues of the solution, X . T o this end , we let Q 1 := { x : aI X bI } Q 2 := { u : k u k ∞ ≤ λ } (9) where the constants a, b are giv en suc h that b > a > 0. By Theorem 1, we kn ow that suc h b oun d s alwa ys exist. W e also define ˆ f ( x ) := − log det x + h S, x i , and A := λI . T o Q 1 and Q 2 , w e asso ciate norm s and cont in uous, strongly con v ex fu nctions, called pro x- functions, d 1 ( x ) and d 2 ( u ). F or Q 1 w e c ho ose the F rob enius n orm, and a prox-function 8 Model Selection Through Sp arse Max L ikelihood Estima tion d 1 ( x ) = − log det x + log b . F or Q 2 , w e choose the F rob enius norm agai n, and a p r o x- function d 2 ( x ) = k u k 2 F / 2. The metho d applies a s m o othing tec hn ique to the non-smo oth prob lem (8), which replaces the ob jectiv e of th e original problem, f ( x ), by a p enalized function inv olving the p ro x- function d 2 ( u ): ˜ f ( x ) = ˆ f ( x ) + max u ∈ Q 2 {h Ax, u i − µd 2 ( u ) } . (10) The ab o v e fu nction tur n s out to b e a smo oth un iform appr oximati on to f ev erywhere. It is d ifferentiable, conv ex on Q 1 , and a has a Lipsc hitz-con tinuous gradien t, with a constan t L that can b e computed as detailed b elo w. A s p ecific gradien t scheme is then applied to this smo oth appro ximation, w ith con v ergence rate O ( L/ǫ ). 4.2 Algorithm and complexity estimate. T o detail the algo rithm and compute the complexit y , we m u st first calc ulate some pa- rameters corresp onding to our definitions a nd c h oices ab ov e. First, the strong con ve xit y parameter for d 1 ( x ) on Q 1 is σ 1 = 1 /b 2 , in the sense that ∇ 2 d 1 ( X )[ H, H ] = trace( X − 1 H X − 1 H ) ≥ b − 2 k H k 2 F for every symmetric H . F ur th ermore, the cen ter of th e set Q 1 is x 0 := arg min x ∈ Q 1 d 1 ( x ) = bI , and satisfies d 1 ( x 0 ) = 0. With our c hoice, we h a v e D 1 := max x ∈ Q 1 d 1 ( x ) = p log ( b/a ). Similarly , th e strong conv exit y parameter for d 2 ( u ) on Q 2 is σ 2 := 1, and w e hav e D 2 := m ax u ∈ Q 2 d 2 ( U ) = p 2 / 2 . With this c h oice, the cen ter of the s et Q 2 is u 0 := arg min u ∈ Q 2 d 2 ( u ) = 0. F or a desired accuracy ǫ , w e set the sm o othness parameter µ := ǫ/ 2 D 2 , and s et x 0 = bI . The algorithm p ro ceeds as follo w s: F or k ≥ 0 do 1. Compute ∇ ˜ f ( x k ) = − x − 1 + S + u ∗ ( x k ), wher e u ∗ ( x ) solv es (10 ). 2. Find y k = arg min y {h∇ ˜ f ( x k ) , y − x k i + 1 2 L ( ǫ ) k y − x k k 2 F : y ∈ Q 1 } . 3. Find z k = arg min x { L ( ǫ ) σ 1 d 1 ( X ) + P k i =0 i +1 2 h∇ ˜ f ( x i ) , x − x i i : x ∈ Q 1 } . 9 Banerjee, El Ghaoui, and d’Aspremont 4. Up d ate x k = 2 k +3 z k + k +1 k +3 y k . In our case, the Lipsc h itz constan t for the gradien t of our smo oth app ro ximation to the ob jectiv e fu nction is L ( ǫ ) := M + D 2 k A k 2 / (2 σ 2 ǫ ) where M := 1 /a 2 is the Lip sc hitz constan t for th e gradien t of ˜ f , and the norm k A k is induced by the F rob enius norm , and is equal to λ . The algorithm is guarante ed to pr o duce an ǫ -su b optimal solution after a n um b er of steps not exceeding N ( ǫ ) := 4 k A k r D 1 D 2 σ 1 σ 2 · 1 ǫ + q M D 1 σ 1 ǫ = ( κ p (log κ ) )(4 p 1 . 5 aλ/ √ 2 + √ ǫp ) /ǫ. (11) where κ = b/a is a b oun d on the condition num b er of the solution. No w we are ready to estimate the complexit y of the algorithm. F or Step 1, the gradien t of the smo oth a pproxima tion is c omputed in closed form b y taking the in verse of x . Step 2 essen tially amoun ts to pro jecting on Q 1 , and r equires that w e solv e an eigen v alue problem. The same is true for Step 3. In fact, eac h iteratio n costs O ( p 3 ). T he num b er of iterations necessary to ac hiev e an ob jectiv e with absolute accuracy less than ǫ is give n in (11 ) by N ( ǫ ) = O ( p 1 . 5 /ǫ ). Th us, if the condition num b er κ is fixed in adv ance, th e complexit y of the algorithm is O ( p 4 . 5 /ǫ ). 5. Binary V aria bles: Appro ximate Sparse Ma xim um Likelihoo d Estimation In this section, we consid er the problem of estimating a n undirected g raphical mo del fo r m ultiv ariate binary d ata. Recently , W ain wright et al. [2006] applied an ℓ 1 -norm p enalt y to the logistic regression problem to obtain a binary v ers ion of the h igh-dimensional consistency results of Meinshausen and B ¨ uhlmann [2006 ]. W e apply the log determinant r elaxation of W ain wrigh t and Jordan [2006] to form ulate an appro ximate sparse maxim um likel iho o d (ASML) problem for estimating the parameters in a m ultiv ariate binary distr ib ution. W e sho w that the r esu lting problem is the same as the Gaussian sparse maxim um lik eliho o d (SML) problem, and that we can therefore apply our previously-dev elop ed algorithms to sparse mo d el selection in a bin ary setting. 10 Model Selection Through Sp arse Max L ikelihood Estima tion Consider a distribu tion made up of p binary rand om v ariables. Usin g n data s amples, w e wish to estimate th e stru cture of the distribution. T he logistic mo d el of this d istr ibution is p ( x ; θ ) = exp { p X i =1 θ i x i + p − 1 X i =1 p X j = i +1 θ ij x i x j − A ( θ ) } (12) where A ( θ ) = log X x ∈X p exp { p X i =1 θ i x i + p − 1 X i =1 p X j = i +1 θ ij x i x j } (13) is the log partition fu nction. The sparse maxim um lik elihoo d problem in this case is to maximize (12) with an add ed ℓ 1 -norm p enalt y on terms θ k j . Sp ecifically , in the undirected graph ical mo del, an edge b et w een no des k and j is missing if θ k j = 0. A w ell-kno wn difficult y is that the lo g partition function has to o man y terms in its outer sum to compute. Ho w ev er, if we use the log determinan t relaxation for the log partition function d ev eloped by W ain wrigh t and Jordan [2006], w e can obtain an approxi mate sp arse maxim um lik eliho o d (ASML) estimate. W e s hall set up the problem in the next section. 5.1 Problem form ulation. Let’s b egin with some notation. Letting d := p ( p + 1) / 2, d efine the m ap R : R d → S p +1 as follo ws: R ( θ ) = 0 θ 1 θ 2 . . . θ p θ 1 0 θ 12 . . . θ 1 p . . . θ p θ 1 p θ 2 p . . . 0 Supp ose that our n samples are z (1 ) , . . . , z ( n ) ∈ {− 1 , +1 } p . Let ¯ z i and ¯ z ij denote sample mean and second moments. The sparse maximum likel iho o d pr oblem is ˆ θ exact := arg max θ 1 2 h R ( θ ) , R ( ¯ z ) i − A ( θ ) − λ k θ k 1 . (14) Finally d efine the constant v ector m = (1 , 4 3 , . . . , 4 3 ) ∈ R p +1 . W ainwrigh t and Jordan [2006] giv e an upp er b oun d on the log p artition function as the solution to th e follo wing v ariational problem: A ( θ ) ≤ max µ 1 2 log det( R ( µ ) + diag( m )) + h θ , µ i = 1 2 · max µ log det( R ( µ ) + diag( m )) + h R ( θ ) , R ( µ ) i . (15) 11 Banerjee, El Ghaoui, and d’Aspremont If we use the b ound (15) in our sparse m axim um lik eliho o d p roblem (14), w e won’t b e able to extract an optimizing argument ˆ θ . O ur first step, therefore, will b e to rewrite the b ound in a form that will allo w this. Lemma 5 We c an r ewrite the b ound (15) as A ( θ ) ≤ p 2 log( eπ 2 ) − 1 2 ( p + 1) − 1 2 · { max ν ν T m + log det( − ( R ( θ ) + diag ( ν ))) . (16) Using this v ersion of the b ound (15), we h a v e the follo wing th eorem. Theorem 6 Using the upp er b ound on the lo g p artition function given in (16), the appr ox- imate sp arse maximum likeliho o d pr oblem has the fol lowing solution: ˆ θ k = ¯ µ k ˆ θ k j = − ( ˆ Γ) − 1 k j (17) wher e the matrix ˆ Γ is the solution to the fol lowing pr oblem, r elate d to (2): ˆ Γ := arg max { log d et W : W k k = S k k + 1 3 , | W k j − S k j | ≤ λ } . (18) Here, S is defin ed as b efore: S = 1 n n X k =1 ( z ( k ) − ¯ µ )( z ( k ) − ¯ µ ) T where ¯ µ is the v ector of samp le means ¯ z i . In particular, this mea ns that w e can reu se the algorithms devel op ed in Sections 3 and ?? for prob lems with binary v ariables. The relaxation (15) is the simp lest one offered by W ain wrigh t and Jordan [2006]. The relaxation can b e tightened by adding linear constrain ts on the v ariable µ . 5.2 P e nalt y parameter c hoice for binary v ariables. F or the c hoice of the p enalt y p arameter λ , w e can derive a formula analogo us to (3). Consider the c h oice λ ( α ) bin := ( χ 2 ( α/ 2 p 2 , 1)) 1 2 (min i>j ˆ σ i ˆ σ j ) √ n (19) 12 Model Selection Through Sp arse Max L ikelihood Estima tion where χ 2 ( α, 1) is the (1 00 − α )% p oint of the c h i-square distribution for one d egree of freedom. Since our v ariables take on v alues in {− 1 , 1 } , the empirical v ariances are o f the form: ˆ σ 2 i = 1 − ¯ µ 2 i . Using (19), w e ha v e th e follo wing b inary version of Theorem 2: Theorem 7 With (19) chosen a s the p enalty p ar ameter in the appr oximate sp arse maxi- mum likeliho o d pr oblem, for a fixe d level α , P ( ∃ k ∈ { 1 , . . . , p } : ˆ C λ k 6⊆ C k ) ≤ α. 6. Numerical Results In th is section w e p r esen t the resu lts of some numerical exp er im ents, b oth on syn thetic and real data. 6.1 Syn thetic exp eriments. Synt hetic exp eriments require that w e generate un derlying sparse inv erse co v ariance matri- ces. T o this end, w e first rand omly choose a diagonal matrix w ith p ositiv e d iagonal en tries. A giv en num b er of nonzeros are in serted in th e matrix at ran d om lo cations sym metrically . P ositiv e definiteness is ensur ed by adding a m ultiple of the iden tit y to the matrix if needed. The multiple is c hosen to b e only as large as necessary for inv ersion with n o errors . 6.1.1 Sp arsity and thresh olding. A v ery simple app roac h to obtaining a sparse estimate of the inv erse co v ariance matrix w ould b e to apply a thresh old to the inv erse empirical co v ariance matrix, S − 1 . Ho w ev er , ev en when S is easily in v ertib le, it can b e difficult to select a thr eshold lev el. W e solv ed a syn thetic problem of size p = 100 where the true concen tration matrix d ensit y was set to δ = 0 . 1. Dra wing n = 200 samples, we plot in Figure (1) the sorted absolute v alue elemen ts of S − 1 on the left and ˆ Σ − 1 on the righ t. It is clearly easier to choose a threshold lev el for the S ML estimate. Applying a thr esh old to either S − 1 or ˆ Σ − 1 w ould decrease the log lik eliho o d of the estimate by an unkn own amoun t. W e only observ e th at to preserve p ositiv e d efiniteness, the threshold lev el t must satisfy the b oun d t ≤ min k v k 1 ≤ 1 v T S − 1 v . 13 Banerjee, El Ghaoui, and d’Aspremont 0 2000 4000 6000 8000 10000 0 2 4 6 8 10 (A) 0 2000 4000 6000 8000 10000 0 2 4 6 8 10 (B) Figure 1: S orted ab s olute v alue of elemen ts of (A) S − 1 and (B) ˆ Σ − 1 . T he s olution ˆ Σ − 1 to (1) is un-thresholded. 6.1.2 Recovering s tructure. W e b egin with a small exp eriment to test the a bilit y of th e metho d to reco v er the sparse structure of an un derlying co v ariance matrix. Figure 2 (A) sho ws a sparse inv erse co v ariance matrix of size p = 30. Figure 2 (B) displa ys a c orresp ond ing S − 1 , u s ing n = 60 sa mples. Figure 2 (C) displa ys the solution to (1) f or λ = 0 . 1. The v alue of the p enalt y p arameter here is c hosen arbitrarily , and the solution is not t hresholded. Neverthele ss, w e can still pic k out features that were present in the tru e underlyin g in v erse co v ariance matrix. A 10 20 30 5 10 15 20 25 30 B 10 20 30 5 10 15 20 25 30 C 10 20 30 5 10 15 20 25 30 Figure 2: R eco v ering th e sparsit y pattern. W e plot (A) the original inv erse co v ariance matrix Σ − 1 , (B) the noisy sample in v erse S − 1 , and (C) the solution to pr oblem (1) for λ = 0 . 1. Using the same underlying in v erse co v ariance matrix, w e rep eat th e exp eriment using smaller sample sizes. W e solve (1 ) for n = 30 and n = 20 using the same arbitrarily c h osen p enalt y parameter v alue λ = 0 . 1, and displa y the sol utions in Figure (3). As ex- p ected, our abilit y to pic k out features of the true inv erse co v ariance matrix d iminishes with the num b er of samples. This is an added r eason to c ho ose a larger v alue of λ wh en w e ha v e few er samples, as in (3 ). 14 Model Selection Through Sp arse Max L ikelihood Estima tion A 10 20 30 5 10 15 20 25 30 B 10 20 30 5 10 15 20 25 30 C 10 20 30 5 10 15 20 25 30 Figure 3: R eco v ering the sp arsit y p attern for small sample size. W e p lot (A) the original in v erse cov ariance matrix Σ − 1 , (B ) the solution to problem (1) for n = 30 and (C) the solution for n = 20. A p enalt y parameter of λ = 0 . 1 is used for (B) and (C). 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 −0.4 −0.2 0 0.2 0.4 0.6 0.8 1 P S f r a g r e p l a c e m e n t s ˆ Σ − 1 ij λ 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 −0.7 −0.6 −0.5 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 P S f r a g r e p l a c e m e n t s ˆ Σ − 1 i j λ λ ˆ Σ − 1 ij Figure 4: Path f ollo wing: elemen ts of solution to (1) as λ increases. Red lines corresp ond to elemen ts that are zero in the true inv erse cov ariance matrix; blue lines corresp ond to tru e nonzeros. V ertical lines mark a r ange of λ v alues us ing which w e reco v er the sp arsit y pattern exactly . 6.1.3 P a th following ex periments . Figure (4) shows t wo path follo w in g examples. W e solve t w o r andomly generated pr oblems of size p = 5 and n = 100 samples. T he red lines corresp ond to elemen ts of the solution that are zero in the tr ue underlying in v erse co v ariance matrix. The blu e lines corresp ond to true n onzeros. The v ertical lines mark r anges of λ for whic h w e reco v er the correct sparsit y pattern exactly . Note that, by Theorem 4, for λ v alues greater than those sh o w n, the solution will b e diagonal. 15 Banerjee, El Ghaoui, and d’Aspremont −0.5 0 0.5 1 0 1 2 3 4 5 6 7 P S f r a g r e p l a c e m e n t s Error (in %) log( λ/σ ) Figure 5: R eco v ering sp arsit y p attern in a matrix with add ed un if orm noise of size σ = 0 . 1. W e plot the a verage p ercen tage or m isclassified entries as a function of log ( λ/σ ). On a related note, w e obs erv e that (1) also works well in r eco vering the sparsity pattern of a matrix mask ed by noise. The follo wing exp eriment illustrates this observ ation. W e generate a sp arse in v erse co v ariance matrix of size p = 50 as describ ed ab o v e. Th en, instead of u sing an empirical co v ariance S as input to (1), we u se S = (Σ − 1 + V ) − 1 , wh ere V is a r andomly generated u niform noise of size σ = 0 . 1. W e th en solv e (1) f or v arious v alues of the p enalt y parameter λ . In figure 5, for a eac h of v alue of λ sh o wn, w e randomly selected 10 s ample co v ariance ma- trices S of size p = 50 and computed the n um b er of misclassified zeros and nonzero elemen ts in t he solution to (1). W e plot the av erage p ercenta ge of error s (n um b er of misclassified zeros plus misclassified nonzeros d ivided by p 2 ), as w ell as error bars corresp onding to one standard deviation. As sh o w n, the err or r ate is nearly zero on av erage w h en the p enalt y is set to equ al the n oise lev el σ . 6.1.4 CPU times ve rsus problem size. F or a sense of the practical p erformance of the Nestero v metho d and the blo ck co ordinate descen t metho d , we randomly selected 10 sample co v ariance matrices S for problem s izes p ranging fr om 400 to 1000.In eac h case, the n umber of samples n wa s chosen to b e ab out a third of p . In figur e 6 w e plot the av erage CPU time to ac hiev e a d ualit y gap of ǫ = 1. CPU times were computed using an AMD Athlo n 64 2.20Ghz pro cessor with 1.96GB of RAM. 16 Model Selection Through Sp arse Max L ikelihood Estima tion 400 500 600 700 800 900 1000 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 Problem size p Average CPU times in seconds Figure 6: Average CPU times vs. prob lem s ize us ing blo c k co ord inate d escen t. W e plot the a verage CPU time (in seconds) to reac h a gap of ǫ = 0 . 1 v ersus problem size p . As sho wn, we are typicall y able to solv e a problem of size p = 1000 in ab out tw o and half hours. 6.1.5 Performance a s a binar y classifier. In th is section we numerically examine the abilit y of the sparse maxim um likel iho o d (SML) metho d to correctly classify elemen ts of the in v erse cov ariance matrix as zero or nonzero. F or comparision, w e will use th e Lasso estimate of Meinshausen and B ¨ uhlmann [200 6], whic h has b een sh o wn to p erform extremely w ell. The Lasso regresses eac h v ariable a gainst a ll others one at a time. Up on obtaining a solution θ ( k ) for eac h v ariable k , one can estimate sparsit y in one of tw o w ays: either b y declaring an elemen t ˆ Σ ij nonzero if b oth θ ( k ) i 6 = 0 and θ ( k ) j 6 = 0 (Lasso-AND ) or, less conserv ativ ely , if either o f those quan tities is nonzero (Lasso-OR). As noted previously , Meinshausen and B ¨ u h lmann [2006] ha v e also d eriv ed a formula for c h o osing their p enalt y parameter. Both the SML and Lasso p enalt y paramet er form ulas dep end on a chosen lev el α , which is a b ound on the same error probabilit y for eac h metho d . F or these exp er im ents, w e set α = 0 . 05. In the follo wing experiments, w e fi xed the problem size p at 30 and generated sparse un- derlying inv ers e cov ariance matrices as d escrib ed ab ov e. W e v aried the num b er of samples n from 10 to 310. F or eac h v alue of n shown, w e ran 30 trials in whic h we estimated the sparsit y p attern of the inv erse co v ariance matrix using th e S ML, Lasso-OR, and Lasso-AND 17 Banerjee, El Ghaoui, and d’Aspremont 1 2 3 4 5 6 7 8 9 10 11 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Positive predictive value Ratio of samples to variables: n/p SML LASSO−V LASSO−& 1 2 3 4 5 6 7 8 9 10 11 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 Power Ratio of samples to variables: n/p SML LASSO−V LASSO−& 1 2 3 4 5 6 7 8 9 10 11 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05 Density of solution Ratio of samples to variables: n/p SML LASSO−V LASSO−& True Figure 7: C lassifying zeros and nonzeros for a true density of δ = 0 . 05. W e plot the p ositiv e predictiv e v alue, the p ow er, and the estimated densit y u sing S ML, Lasso-OR and Lasso-AND. metho ds. W e th en recorded the a v erage num b er of n onzeros estimated b y e ac h metho d, and the av erage num b er of entries correctly iden tified as n onzero (true p ositiv es). W e sho w t w o sets of plots. Figure (7) corresp ond s to exp eriments where the true density w as set to b e lo w , δ = 0 . 05. W e plot the p ow er (prop ortion of correctly id en tified nonzeros), p ositiv e predictiv e v alue (prop ortion of estimated nonzeros that are correct), and the den s it y estimated b y eac h metho d. Figure (8) corresp onds to exp erimen ts wh ere the true densit y w as set to b e high, δ = 0 . 40, an d we p lot the same three quantiti es. 18 Model Selection Through Sp arse Max L ikelihood Estima tion 0 1 2 3 4 5 6 7 8 9 10 11 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 Positive predictive value Ratio of samples to variables: n/p SML LASSO−V LASSO−& 0 1 2 3 4 5 6 7 8 9 10 11 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 Power Ratio of samples to variables: n/p SML LASSO−V LASSO−& 0 1 2 3 4 5 6 7 8 9 10 11 −0.05 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 Density of solution Ratio of samples to variables: n/p SML LASSO−V LASSO−& True Figure 8: C lassifying zeros and nonzeros for a true density of δ = 0 . 40. W e plot the p ositiv e predictiv e v alue, the p ow er, and the estimated densit y u sing S ML, Lasso-OR and Lasso-AND. Meinshausen and B ¨ uh lmann [2006] rep ort that, asymptotically , Lasso-AND and Lasso-OR yield the same estimate of the sparsit y pattern of the inv erse co v ariance m atrix. At a fin ite n um b er of samp les, the S ML metho d seems to fall in in b etw een the t w o metho ds in terms of p o w er, p ositiv e predictiv e v alue, and the density of the estimate. It typica lly offers, on a verage , the lo w est total num b er of errors, tied with either Lasso-AND or Lasso-OR. Among the tw o Lasso metho d s , it would seem that if the true densit y is very lo w, it is sligh tly b etter to u se the more conserv ativ e Lasso-AND. If the densit y is higher, it m ay b e b etter to use Lasso-OR. Wh en the true densit y is un kno wn, we can ac hieve an accuracy comparable to 19 Banerjee, El Ghaoui, and d’Aspremont the b etter choice among the Lasso metho ds by computing the SML estimate. Figure (9) sho ws one example of sp ars it y pattern reco v ery w hen the tr ue density is lo w. A B C D Figure 9: C omparing sparsit y p attern reco v er y to the Lasso. (A) true co v ariance (B) L asso- OR (C) Lasso-AND (D) SML. The Lasso and SML metho d s hav e a comparable compu tational complexit y . Ho wev er, unlik e the Lasso, the SML metho d is not p arallelizable. P aralleliz ation w ould render the Lasso a more computationally attractiv e choi ce, since eac h v ariable can regressed against all other separately , at an individual cost of O ( p 3 ). In exchange, SML can offer a more accurate estimate of the sparsit y pattern, as w ell a s a w ell-conditioned estimate of the co v ariance matrix itself. 6.2 Gene expression and U.S. Se nate v oting records data. W e tested our algorithms on three s ets o f data: tw o ge ne expression data sets, as w ell as US Senate vot ing records. In this section w e br iefly explore the resulting graph ical mo dels. 20 Model Selection Through Sp arse Max L ikelihood Estima tion 6.2.1 Rosett a Inpha rma tics comp endium. W e applied our algorithms to the Roset ta Inpharm atics C omp end iu m of gene expression profiles describ ed by Hughes et al. [2000]. The 300 exp er im ent comp endiu m con tains n = 253 samples with p = 6136 v ariables. With a view to w ards obtaining a v ery sparse graph, w e replaced α/ 2 p 2 in (3) by α , and set α = 0 . 05. T he resu lting p enalt y paramete r is λ = 0 . 0313. This is a large p enalt y for this data set, and by applying Theorem 4 w e find that all b u t 270 of the v ariables are estimated to b e indep endent from all the rest, clearly a v ery conserv ativ e estimate. Figure (10) d ispla ys the r esulting graph . Figure 10: App lication to Hu ghes comp end iu m. T h e ab ov e graph r esults f r om solving (1) for th is data set with a p enalty parameter of λ = 0 . 0313 . Figure (11) closes in on a region of Figure (10), a cl uster of genes that is un connected to the remaining genes in this estimate. Acc ording to Gene Ontolo gy [see Consortium, 2000], these genes are asso ciated with iron h omeostasis. The probabilit y that a gene h as b een false included in this cluster is at most 0 . 05. As a second example, in Figure (12), we sh o w a su b graph of genes asso ciated w ith cellular mem brane fusion. All thr ee graphs were rendered using Cytoscap e. 21 Banerjee, El Ghaoui, and d’Aspremont Figure 11: App lication to Hughes dataset (closeup of Figur e (10). These genes are asso ci- ated w ith iron h omeostasis. Figure 12: App lication to Hughes dataset (subgrap h of Figure (10). T h ese genes are asso- ciated with cellular mem brane fusion. 22 Model Selection Through Sp arse Max L ikelihood Estima tion T able 1: Pred ictor genes f or LDL receptor. Accession Gene BF553500 Cbp/p300-intera cting transactiv a tor BF387347 EST BF405996 calcium ch annel, vol t age dependent NM 017158 cytochrome P450, 2c39 K03249 enoyl-CoA, hydra t ase/3-hydro x y a cyl Co A dehydrog. BE100965 EST AI411979 Carnitine O-acetyl transferase AI410548 3-hydro xy isobutyr yl-Co A hydrolase NM 017288 sodium channel, vol t age-ga ted Y00102 estr ogen receptor 1 NM 0 13200 carnitine p almitoyl transferase 1b 6.2.2 Iconix microarra y d a t a. Next we analyzed a s u bset of a 10 , 000 gene microarra y dataset from 160 drug treated rat liv er s Natsoulis et al. [2005]. In this study , r ats were treated with a v ariet y of fi brate, statin, or estrogen receptor agonist comp oun ds. T aking th e 500 genes w ith the highest v ariance, we once again replaced α/ 2 p 2 in (3) by α , and set α = 0 . 05. The resulting p enalty p arameter is λ = 0 . 0853 . By applying Theorem 4 we fin d that all bu t 339 of the v ariables are estimated to b e inde- p end en t from the rest. This estimate is less conserv ativ e than that obtained in the Hughes case since th e ratio of samples to v ariables is 160 to 500 in s tead of 253 to 6136. The first order neig h b ors of any no de in a Ga ussian graphical model form the set of pre- dictors for that v ariable. In the estimated obtained by solving (1), w e f ou n d that LDL receptor had one of the largest num b er of fi r st-order neigh b ors in th e Gaussian graphical mo del. T he LDL receptor is b eliev ed to b e one of the k ey med iators o f the effect of b oth statins and estrogenic comp ound s on L DL c holesterol. T able 1 lists s ome of the fi rst order neigh b ors of LDL receptor. It is p erhaps not su rprising that several of these genes are directly in v olve d in either lipid or steroid metab olism (K0324 9, AI4119 79, AI410548 , NM 01320 0, Y0010 2). Other genes suc h as Cbp/p300 are kn o wn to b e global transcriptional r egulators. Finally , some are un- annotated E S Ts. Their connection to th e LDL receptor in this analysis ma y provide clues to their function. 23 Banerjee, El Ghaoui, and d’Aspremont 6.2.3 Sena te voting re cords d a t a. W e conclud e our numerical exp erimen ts by testing our approximat e sparse maxim um like li- ho o d estimation metho d on b inary data. The d ata s et consists of US senate v oting records data f rom th e 109th congress (2004 - 2006). There are one hundred v ariables, corresp ond- ing to 100 s en ators. Eac h of the 542 samples is bill that wa s put to a v ote. Th e v otes are recorded as -1 for n o and 1 for yes. There a re ma n y m issing v alues in this d ataset, corresp onding to missed v otes. S ince our analysis dep end s on data v alues tak en solely f rom {− 1 , 1 } , it w as necessary to impu te v alues to these. F or this exp erimen t, w e replaced all missing v otes with no es (-1). W e c hose the p enalt y parameter λ ( α ) ac cording to (19), using a signifi cance lev el of α = 0 . 05. Figure (13) sh o ws the resulting graphical mo d el, r endered using Cytoscap e. Red no des corresp ond to Repub lican senators, and blue no des corresp ond to Demo cratic senators. W e can mak e some tent ativ e observ ations by browsing the net w ork of senators. As neigh b ors most Demo crats ha v e only other Demo crats and Rep ublicans ha v e only other Repub licans. Senator Ch afee (R, RI) h as only demo crats as h is neighbors, an observ ation that s upp orts media statemen ts made by and ab out Ch afee du ring those years. S enator Allen (R, V A) unites t wo otherw ise separate groups of Repu blicans and also pro vides a co nnection to the large clus ter of Demo crats through Ben Nelson (D, NE). S enator Lieb erman (D, CT) is connected to other Demo crats only through Kerr y (D, MA), his run ning mate in th e 2004 p resident ial election. Th ese observ ations also matc h media statemen ts made b y b oth pund its and p oliticians. Thus, although we obtained this graphical mo d el via a relaxation of th e log partition function, the resulting p icture is largely supp orted by con v en tional wisdom. Figure (14) sho ws a subgraph consisting of n eigh b ors of degree thr ee or lo w er of Senator Allen. Ac kno wledgmen ts W e are indebted to Georges Natsoulis for his interpretatio n of the Iconix dataset analysis and for gathering the senate v oting records. W e also thank Martin W ain wright, Bin Y u, P eter Bartlett, and Mic hael J ordan for man y enligh tening con versations. References D. Bertsek as. Nonline ar Pr o gr amming . A thena Scien tific, 1998. Gene On tology C onsortium. Gene on tology: to ol for the unifi cation of biology . N atur e Genet. , 25:25–29, 2000. 24 Model Selection Through Sp arse Max L ikelihood Estima tion Figure 13: US Senate, 109th Congress (2004 -2006) . The graph d ispla ys th e solution to (14) obtained using the log determinan t relaxation to the log partition fun ction of W ain wr igh t and Jordan [200 6]. Demo cratic senators are colored blue and Republican senators are colored red. 25 Banerjee, El Ghaoui, and d’Aspremont Figure 14: US Senate, 109th Congress. Neigh b ors of Senator Allen (degree th ree or low er). J. Dahl, V. Ro yc h o w dhury , and L. V andenb erghe. Co v ariance selection for non-c hordal graphs via c hordal embedd ing. Submitte d to Optimization Metho ds and Softwar e , 2006. Alexandre d’Aspremont, Laur en t El Gh aoui, M.I. Jord an, and G. R. G. Lanc kriet. A direct form ulation for sparse PC A usin g semidefin ite programming. A dvanc es i n Neu r al Information Pr o c essing Systems , 17, 2004. A. Dobra and M. W est. Ba y esian cov ariance selecti on. Working p ap er, ISDS, Duke Uni- versity , 2004. J. Z . Huang, N. Liu, and M. P ourahmadi. Cov ariance selection and estimattion via p enalized normal lik eliho o d. W harton Pr eprint , 2005. T. R . Hughes, M. J. Marton, A. R. Jones, C. J. Rob erts, R. Stoughto n, C. D. Armour, H. A. Bennett, E. Coffey , H. Dai, Y. D. He, M. J. Kid d, A. M. King, M. R. Mey er, D. Slade, P . Y. Lum, S. B. Stepanian ts, D. D. Sho emaker, D. Gac hotte, K. Chakraburtty , J. Simon, M. Bard, and S. H. F riend. F unctional disco v ery via a comp endiu m of expression p rofiles. Cel l , 102(1):10 9–126 , 20 00. S. Lauritzen. Gr aphic al Mo dels . Spr inger V erlag, 1996. H. Li and J. Gui. Gradient directed r egularization for sparse gaussian concen tration graphs, with applications to inf erence of genetic n etw orks. Unive rsity of Pennsylvania T e chnic al R ep ort , 2005. Z. Q . Lu o and P . Tseng. On the con v ergence of the co ordin ate d escen t metho d for con v ex differen tiable minimizat ion. Journal of Optimization The ory and Applic ations , 72(1): 7–35, 1992. 26 Model Selection Through Sp arse Max L ikelihood Estima tion N. Meinshausen and P . B ¨ uhlmann . High dimensional graphs and v ariable selection with the lasso. Annals of statistics , 34:1436–1 462, 2006. G. Natsoulis, L. El Ghaoui, G. Lanc kriet, A. T olley , F. Lero y , S. Du n lea, B. Eynon, C. Pe ar- son, S. T ugendreich, and K. Jarnagin. Classification of a large microarray data set: al- gorithm comparison and analysis of drug signatures. Genome R ese ar ch , 15:724 –736, 2005. Y. Nestero v. Smo oth minimization of non-smo oth f u nctions. M ath. Pr o g. , 103(1):127 –152, 2005. R. Tibshirani. Regression shr ink age and select ion via the lasso. Journal of the Royal statistic al so ciety, series B , 58(267-2 88), 1996 . Liev en V andenb erghe, Stephen Bo yd, and Shao-P o W u. Determinan t maximiza tion with linear matrix inequalit y constraints. SIAM Journal on Matrix Analysis and Applic ations , 19(4): 499 – 533, 1998. M. W a in wrigh t and M. Jordan . Log-determinant r elaxation for appr o x im ate inference in discrete marko v random fields . IEEE T r ansactions on Signal Pr o c essing , 2006. M. J. W ainwrigh t, P . Ra viku mar, and J . D. Laffert y . High-dimensional graph ical mo del selection using ℓ 1 -regularized logistic r egression. Pr o c e e dings of A dvanc es in Neur al In- formation Pr o c essing Systems , 2006. App endix A. Pro of of solution properties and blo c k co ordinate descen t con v ergence In this section, w e giv e short proofs of the tw o theorems o n prop erties of the solution to (1), as well as the con v er gence of the b lo c k co ordin ate descen t metho d . Pro of of Theorem 1: Since ˆ Σ satisfies ˆ Σ = S + ˆ U , where k U k ∞ ≤ λ , we ha v e: k ˆ Σ k 2 = k S + ˆ U k 2 ≤ k S k 2 + k U k 2 ≤ k S k 2 + k U k ∞ ≤ k S k 2 + λp whic h yields the lo w er b ound on k ˆ Σ − 1 k 2 . Lik ewise, w e can sho w that k ˆ Σ − 1 k 2 is b ound ed ab o v e. A t the optim um, the primal du al gap is zero: − log d et ˆ Σ − 1 + trace( S ˆ Σ − 1 ) + λ k ˆ Σ − 1 k 1 − log det ˆ Σ − p = trace( S ˆ Σ − 1 ) + λ k ˆ Σ − 1 k 1 − p = 0 27 Banerjee, El Ghaoui, and d’Aspremont W e therefore ha v e k ˆ Σ − 1 k 2 ≤ k ˆ Σ − 1 k F ≤ k ˆ Σ − 1 k 1 = p/λ − trace( S ˆ Σ − 1 ) /λ ≤ p/λ where the last inequality f ollo ws from trace( S ˆ Σ − 1 ) ≥ 0, since S 0 and ˆ Σ − 1 ≻ 0. Next we p ro v e the con v ergence of blo c k co ordin ate descent : Pro of of Theorem 3: T o s ee that optimizing ov er one ro w and column of W in (2) yields the qu adratic program (4), let all b ut the last row and column of W b e fi xed. Since w e know the diagonal en tries of the s olution, we can fix th e remaining diagonal en try as wel l: W = W \ p \ p w p w T p W pp Then, us ing Sc h ur complemen ts, w e hav e that det W = det W \ p \ p · ( W pp − w T p ( W \ p \ p ) − 1 w p ) whic h giv es rise to (4). By general results on blo c k co ordinate descent algorithms [e.g., Bertsek as, 1998], th e algo- rithms con v erges if and only if (4) has a uniqu e solution at eac h iteratio n. Thus it suffices to sho w th at, at ev er y swee p, W ( j ) ≻ 0 for all columns j . Prior to the first sweep, the in itial v alue of the v ariable is p ositiv e definite: W (0) ≻ 0 since W (0) := S + λI , and w e ha v e S 0 and λ > 0 by assump tion. No w supp ose that W ( j ) ≻ 0. T h is implies that th e follo win g Sch ur complemen t is p ositiv e: w j j − W T j ( W ( j ) \ j \ j ) − 1 W j > 0 By the up date rule we hav e that the c orresp ond ing S c hur complement for W ( j + 1) is ev en greater: w j j − W T j ( W ( j + 1) \ j \ j ) − 1 W j > w j j − W T j ( W ( j ) \ j \ j ) − 1 W j > 0 so that W ( j + 1) ≻ 0. Finally , w e apply T heorem 3 to prov e th e second prop erty of th e solution. Pro of of Theorem 4: 28 Model Selection Through Sp arse Max L ikelihood Estima tion Supp ose that column j of the second momen t matrix satisfies | S ij | ≤ λ for all i 6 = j . Th is means th at the zero v ector is in the co nstrain t set of (4) for that column. E ac h time we return to column j , the ob jectiv e fu nction will b e different, but alw a ys of the form y T Ay for A ≻ 0. S ince the constrain t set will not c hange, the solution for column j w ill alw a ys b e zero. By T heorem 3 , the blo c k co ordinate descen t algorithm con v erges to a solution, and so therefore th e solution m ust ha v e ˆ Σ j = 0. App endix B. Pro of of error bounds. Next we shall sho w that the p enalty p arameter c hoice give n in (3) yields the error pr ob ab ility b oun d of T heorem 2. The pro of is nearly identica l to that of [Meinshausen and B ¨ uh lmann, 2006,Theorem 3]. T he differences stem from a differen t ob jectiv e function, and the fact that our v ariable is a matrix of size p rather than a vec tor of size p . Our p r o of is only an adaptation of their p ro of to our problem. B.1 Preliminaries Before we b egin, consider problem (1), for a matrix S of an y size: ˆ X = arg min − log det X + trace( S X ) + λ k X k 1 where we hav e dr op p ed the constrain t X ≻ 0 since it is implicit, due to th e log determinant function. Since the problem is unconstrained, the solution ˆ X must corresp ond to setting the su bgradien t of the ob jectiv e to zero: S ij − X − 1 ij = − λ f or X ij > 0 S ij − X − 1 ij = λ for X ij < 0 | S ij − X − 1 ij | ≤ λ for X ij = 0 (20) Recall that b y Theorem 1 , the s olution is u nique for λ p ositiv e. B.2 Pro of of error b ound for Gaussian data. No w we are ready to prov e T heorem 2. Pro of of Theorem 2: Sort columns of the cov ariance matrix so that v ariables in the same connectivit y comp onent are group ed together. The correct zero p attern for the co v ariance matrix is then blo c k diagonal. Define Σ correct := blk diag( C 1 , . . . , C ℓ ) (21) 29 Banerjee, El Ghaoui, and d’Aspremont The inv erse (Σ correct ) − 1 m ust also b e blo c k d iagonal, with p ossib le additional zeros ins ide the blo c ks. If w e constrain the solution to (1) to ha ve th is stru cture, then b y the form of the ob jectiv e, w e c an optimize o v er eac h blo c k separately . F or eac h bloc k, th e sol ution is c h aracterized by (20). No w, sup p ose that λ > max i ∈ N ,j ∈ N \ C i | S ij − Σ correct ij | (22) Then, b y the subgradien t c haracterizatio n of the solution noted ab o ve, and the fact that the solution is u nique for λ > 0, it must b e the case that ˆ Σ = Σ correct . By the definition of Σ correct , this implies that, for ˆ Σ, we h av e ˆ C k = C k for all k ∈ N . T aking th e cont rap ositiv e of this statemen t, we can wr ite: P ( ∃ k ∈ N : ˆ C k 6⊆ C k ) ≤ P (max i ∈ N ,j ∈ N \ C i | S ij − Σ correct ij | ≥ λ ) ≤ p 2 ( n ) · max i ∈ N ,j ∈ N \ C i P ( | S ij − Σ correct ij | ≥ λ ) = p 2 ( n ) · max i ∈ N ,j ∈ N \ C i P ( | S ij | ≥ λ ) (23) The equalit y at the end f ollo ws since, by d efinition, Σ correct ij = 0 for j ∈ N \ C i . It remains to b ound P ( | S ij | ≥ λ ). The statemen t | S k j | ≥ λ can b e written as: | R k j | (1 − R 2 k j ) − 1 2 ≥ λ ( s k k s j j − λ 2 ) − 1 2 where R k j is the correlation b et w een v ariables k and j , since | R k j | (1 − R 2 k j ) − 1 2 = | S k j | ( S k k S j j − S 2 k j ) − 1 2 F urthermore, the condition j ∈ N \ C k is equiv alen t to sa ying that v ariables k and j are indep en den t: Σ k j = 0. Cond itional on this, the statistic R k j (1 − R 2 k j ) − 1 2 ( n − 2) 1 2 has a S tudent’s t-distribution for n − 2 degrees of freedom. Ther efore, f or all j ∈ N \ C k , P ( | S k j | ≥ λ | S k k = s k k , S j j = s j j ) = 2 P ( T n − 2 ≥ λ ( s k k s j j − λ 2 ) − 1 2 ( n − 2) 1 2 | S k k = s k k , S j j = s j j ) ≤ 2 ˜ F n − 2 ( λ ( ˆ σ 2 k ˆ σ 2 j − λ 2 ) − 1 2 ( n − 2) 1 2 ) (24) 30 Model Selection Through Sp arse Max L ikelihood Estima tion where ˆ σ 2 k is the sample v ariance of v ariable k , and ˜ F n − 2 = 1 − F n − 2 is the CDF of the Student’s t-distribu tion with n − 2 d egree of f reedom. T his implies th at, for all j ∈ N \ C k , P ( | S k j | ≥ λ ) ≤ 2 ˜ F n − 2 ( λ ( ˆ σ 2 k ˆ σ 2 j − λ 2 ) − 1 2 ( n − 2) 1 2 ) since P ( A ) = R P ( A | B ) P ( B ) dB ≤ K R P ( B ) dB = K . Pu tting the inequalities together, w e hav e that: P ( ∃ k : ˆ C λ k 6⊆ C k ) ≤ p 2 · max k ,j ∈ N \ C k 2 ˜ F n − 2 ( λ ( ˆ σ 2 k ˆ σ 2 j − λ 2 ) − 1 2 ( n − 2) 1 2 ) = 2 p 2 ˜ F n − 2 ( λ (( n − 2) / (( max i>j ˆ σ k ˆ σ j ) 2 − λ 2 )) 1 2 ) F or any fixed α , our requir ed condition on λ is therefore ˜ F n − 2 ( λ (( n − 2) / (( max i>j ˆ σ k ˆ σ j ) 2 − λ 2 )) 1 2 ) = α/ 2 p 2 whic h is satisfied by choosing λ according to (3). B.3 Pro of of bound for binary dat a. W e can reuse muc h of the pr evious p r o of to d eriv e a corresp onding formula for the binary case. Pro of of Theorem 7: The pro of of T heorem 7 is identic al to th e p ro of of Theorem 2, except that w e h a v e a differen t null d istribution for | S k j | . The null distr ib ution of nR 2 k j is c hi-squared with one degree of fr eedom. Analogous to (24 ), we hav e: P ( | S k j | ≥ λ | S k k = s k k , S j j = s j j ) = 2 P ( nR 2 k j ≥ nλ 2 s k k s j j | S k k = s k k , S j j = s j j ) ≤ 2 ˜ G ( nλ 2 ˆ σ 2 k ˆ σ 2 j ) where ˆ σ 2 k is the sample v ariance of v ariable k , and ˜ G = 1 − G is the CDF of the c hi-squared distribution w ith one degree of fr eedom. Th is implies that, for all j ∈ N \ C k , P ( | S k j | ≥ λ ) ≤ 2 ˜ G (( λ ˆ σ k ˆ σ j √ n ) 2 ) 31 Banerjee, El Ghaoui, and d’Aspremont Putting the inequalities together, we h av e that: P ( ∃ k : ˆ C λ k 6⊆ C k ) ≤ p 2 · max k ,j ∈ N \ C k 2 ˜ G (( λ ˆ σ k ˆ σ j √ n ) 2 ) = 2 p 2 ˜ G ((min i>j ˆ σ k ˆ σ j ) 2 nλ 2 ) so that, for an y fi xed α , w e can ac hiev e our desired b ound by choosing λ ( α ) according to (19). App endix C. Pro of of connection betw een Gaussia n SML and binary ASML W e end with a pro of of Theorem 6, whic h connects the exac t Gaussian sp arse maxim um lik eliho o d pr oblem with the appro ximate sparse maximum lik eliho o d p roblem ob tained by using the log determinant relaxatio n of W ainwrigh t and Jordan [2006]. First we must pro v e Lemma 5. Pro of of Lemma 5: The conjugate fu nction for the con v ex n ormalization A ( θ ) is d efined as A ∗ ( µ ) := su p θ {h µ, θ i − A ( θ ) } (25) W ain wrigh t and Jordan derive a low er b ound on this conjugate function u sing an entrop y b oun d : A ∗ ( µ ) ≥ B ∗ ( µ ) (26) Since our original v ariables are spin v ariables x {− 1 , +1 } , the b ound giv en in the pap er is B ∗ ( µ ) := − 1 2 log det( R ( µ ) + diag( m )) − p 2 log( eπ 2 ) (27) where m := (1 , 4 3 , . . . , 4 3 ). The du al of this lo wer b ound is B ( θ ): B ∗ ( µ ) := max θ h θ , µ i − B ( θ ) ≤ max θ h θ , µ i − A ( θ ) = : A ∗ ( µ ) (28) 32 Model Selection Through Sp arse Max L ikelihood Estima tion This means that, for all µ, θ , h θ , µ i − B ( θ ) ≤ A ∗ ( µ ) (29) or B ( θ ) ≥ h θ , µ i − A ∗ ( µ ) (30) so that in particular B ( θ ) ≥ max µ h θ , µ i − A ∗ ( µ ) =: A ( θ ) (31) Using the d efinition of B ( θ ) and its d ual B ∗ ( µ ), we can write B ( θ ) := max µ h θ , µ i − B ∗ ( µ ) = p 2 log( eπ 2 ) + max µ 1 2 h R ( θ ) , R ( µ ) i + 1 2 log det( R ( µ ) + diag( m )) = p 2 log( eπ 2 ) + 1 2 · max {h R ( θ ) , X − diag( m ) i + log det( X ) : X ≻ 0 , diag( X ) = m } = p 2 log( eπ 2 ) + 1 2 · { max X ≻ 0 min ν h R ( θ ) , X − diag( m ) i + log det( X ) + ν T (diag( X ) − m ) } = p 2 log( eπ 2 ) + 1 2 · { max X ≻ 0 min ν h R ( θ ) + diag( ν ) , X i + log det( X ) − ν T m } = p 2 log( eπ 2 ) + 1 2 · { min ν − ν T m + m ax X ≻ 0 h R ( θ ) + diag( ν ) , X i + log det( X ) } = p 2 log( eπ 2 ) + 1 2 · { min ν − ν T m − log det( − ( R ( θ ) + diag( ν ))) − ( p + 1) } = p 2 log( eπ 2 ) − 1 2 ( p + 1) + 1 2 · { min ν − ν T m − log det( − ( R ( θ ) + diag( ν ))) } = p 2 log( eπ 2 ) − 1 2 ( p + 1) − 1 2 · { max ν ν T m + log det( − ( R ( θ ) + diag( ν λ ) } (32) No w we use lemma 5 to pro v e the main result of section 5.1. Ha ving expressed the upp er b oun d on th e log partition function as a constan t minus a maximization problem will h elp when we formulate the sp ars e approxi mate m axim um lik eliho o d problem. Pro of of Theorem 6: 33 Banerjee, El Ghaoui, and d’Aspremont The appr o ximate sparse maximum lik elihoo d problem is obtained by replacing the log p ar- tition fun ction A ( θ ) w ith its upp er b ound B ( θ ), as derived in lemma 5 : n · { max θ 1 2 h R ( θ ) , R ( ¯ z ) i − B ( θ ) − λ k θ k 1 } = n · { max θ 1 2 h R ( θ ) , R ( ¯ z ) i − λ k θ k 1 + 1 2 ( p + 1) − p 2 log( eπ 2 ) + 1 2 · { max ν ν T m + log d et( − ( R ( θ ) + diag( ν ))) } } = n 2 ( p + 1) − np 2 log( eπ 2 ) + n 2 · max θ , ν { ν T m + h R ( θ ) , R ( ¯ z ) i + log d et( − ( R ( θ ) + d iag( ν ))) − 2 λ k θ k 1 } (33) W e can collect the v ariables θ and ν into an unconstrained s y m metric matrix v ariable Y := − ( R ( θ ) + diag( ν )). Observe that h R ( θ ) , R ( ¯ z ) i = h− Y − diag( ν ) , R ( ¯ z ) i = −h Y , R ( ¯ z ) i − h diag( ν ) , R ( ¯ z ) i = −h Y , R ( ¯ z ) i (34) and that ν T m = h diag( ν ) , diag( ν ) i = h− Y − R ( θ ) , diag( m ) i = −h Y , d iag( m ) i − h R ( θ ) , diag( m ) i = −h Y , diag( m ) i (35) The appr o xim ate sparse maximum lik eliho o d problem can then b e written in terms of Y : n 2 ( p + 1) − np 2 log( eπ 2 ) + n 2 · max θ , ν { ν T m + h R ( θ ) , R ( ¯ z ) i + log d et( − ( R ( θ ) + d iag( ν ))) − 2 λ k θ k 1 } = n 2 ( p + 1) − np 2 log( eπ 2 ) + n 2 · max { log det Y − h Y , R ( ¯ z ) + diag( m ) i − 2 λ P p i =2 P p +1 j = i +1 | Y ij |} (36) If we let M := R ( ¯ z ) + diag( m ), then: M = 1 ¯ µ T ¯ µ Z + 1 3 I where ¯ µ is the sample mean and Z = 1 n n X k =1 z ( k ) ( z ( k ) ) T 34 Model Selection Through Sp arse Max L ikelihood Estima tion Due to the added 1 3 I term, w e hav e that M ≻ 0 for any data set. The pr ob lem can no w b e written as: ˆ Y := arg max { log det Y − h Y , M i − 2 λ p X i =2 p +1 X j = i +1 | Y ij | : Y ≻ 0 } (37) Since we are only p enalizing certain elemen ts of the v ariable Y , the solution ˆ X of th e dual problem to (37) will b e of the form: ˆ X = 1 ¯ µ T ¯ µ ˜ X where ˜ X := arg max { log det V : V k k = Z k k + 1 3 , | V k j − Z k j | ≤ λ } . W e can write an equiv alent problem for estimating the co v ariance matrix. Define a new v ariable: Γ = V − ¯ µ ¯ µ T Using this v ariable, and the fact that the second m omen t matrix ab out the mean, defined as b efore, can b e written S = 1 n n X k =1 z ( k ) ( z ( k ) ) T − ¯ µ ¯ µ T = Z − ¯ µ ¯ µ T w e obtain the formulation (18 ). Using Sc h ur complement s, we see that our pr imal v ariable is of th e form: Y = ∗ ∗ ∗ ˆ Γ − 1 F rom our defi n ition of th e v ariable Y , we see that the paramet ers w e are estimating, ˆ θ k j , are the n egativ es of the off-diagonal elemen ts of ˆ Γ − 1 , whic h giv es us (17). 35
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment