High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
Authors: Animashree Anandkumar, Vincent Y. F. Tan, Alan S. Willsky
Journal of Machine Learning Research (Submitted 6/11)

Animashree Anandkumar (a.anandkumar@uci.edu)
Center for Pervasive Communications and Computing, Electrical Engineering and Computer Science, University of California, Irvine, Irvine, CA 92697

Vincent Y. F. Tan (vtan@wisc.edu)
Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI 53706

Alan S. Willsky (willsky@mit.edu)
Stochastic Systems Group, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139

Editor: Martin Wainwright

Abstract

We consider the problem of high-dimensional Gaussian graphical model selection. We identify a set of graphs for which an efficient estimation algorithm exists, and this algorithm is based on thresholding of empirical conditional covariances. Under a set of transparent conditions, we establish structural consistency (or sparsistency) for the proposed algorithm when the number of samples scales as $n = \Omega(J_{\min}^{-2} \log p)$, where $p$ is the number of variables and $J_{\min}$ is the minimum (absolute) edge potential of the graphical model. The sufficient conditions for sparsistency are based on the notion of walk-summability of the model and the presence of sparse local vertex separators in the underlying graph. We also derive novel non-asymptotic necessary conditions on the number of samples required for sparsistency.

Keywords: Gaussian graphical model selection, high-dimensional learning, local-separation property, walk-summability, necessary conditions for model selection.

1. Introduction

Probabilistic graphical models offer a powerful formalism for representing high-dimensional distributions succinctly. In an undirected graphical model, the conditional independence relationships among the variables are represented in the form of an undirected graph. Learning graphical models from observed samples is an important task, and involves both structure and parameter estimation. While there are many techniques for parameter estimation (e.g., expectation maximization), structure estimation is arguably more challenging. High-dimensional structure estimation is NP-hard for general models (Karger and Srebro, 2001; Bogdanov et al., 2008) and, moreover, the number of samples available for learning is typically much smaller than the number of dimensions (or variables).

The complexity of structure estimation depends crucially on the underlying graph structure. Chow and Liu (1968) established that structure estimation in tree models reduces to a maximum weight spanning tree problem and is thus computationally efficient. However, a general characterization of graph families for which structure estimation is tractable has so far been lacking. In this paper, we present such a characterization based on the so-called local separation property in graphs.
It turns out that a wide variety of (random) graphs satisfy this property (with probability tending to one), including large-girth graphs, the Erdős-Rényi random graphs (Bollobás, 1985) and power-law graphs (Chung and Lu, 2006), as well as graphs with short cycles such as the small-world graphs (Watts and Strogatz, 1998) and other hybrid/augmented graphs (Chung and Lu, 2006, Ch. 12).

Successful structure estimation also relies on certain assumptions on the parameters of the model, and these assumptions are tied to the specific algorithm employed. For instance, for convex-relaxation approaches (Meinshausen and Bühlmann, 2006; Ravikumar et al., 2008), the assumptions are based on certain incoherence conditions on the model, which are hard to interpret as well as to verify in general. In this paper, we present a set of transparent conditions for Gaussian graphical model selection based on walk-sum analysis (Malioutov et al., 2006). Walk-sum analysis has previously been employed to analyze the performance of loopy belief propagation (LBP) and its variants in Gaussian graphical models. In this paper, we demonstrate that walk-summability also turns out to be a natural criterion for efficient structure estimation, thereby reinforcing its importance in characterizing the tractability of Gaussian graphical models.

1.1 Summary of Results

Our main contributions in this work are threefold. First, we propose a simple local algorithm for Gaussian graphical model selection, termed the conditional covariance threshold test (CCT), based on a set of conditional covariance thresholding tests. Second, we derive sample complexity results for our algorithm to achieve structural consistency (or sparsistency). Third, we prove a novel non-asymptotic lower bound on the sample complexity required by any learning algorithm to succeed. We now elaborate on these contributions.

Our structure learning procedure is known as the Conditional Covariance Test (CCT; see Footnote 1) and is outlined in Algorithm 1. Let CCT$(x^n; \xi_{n,p}, \eta)$ denote the output edge set of CCT given $n$ i.i.d. samples $x^n$, a threshold $\xi_{n,p}$ (which depends on both $p$ and $n$) and a constant $\eta \in \mathbb{N}$, which is related to the local vertex separation property (described later). The conditional covariance test proceeds in the following manner. First, the empirical absolute conditional covariances (see Footnote 2) are computed as

$$\widehat{\Sigma}(i, j \mid S) := \widehat{\Sigma}(i, j) - \widehat{\Sigma}(i, S)\, \widehat{\Sigma}^{-1}(S, S)\, \widehat{\Sigma}(S, j),$$

where $\widehat{\Sigma}(\cdot, \cdot)$ denotes the respective empirical covariance (sub)matrices.

Algorithm 1: CCT$(x^n; \xi_{n,p}, \eta)$ for structure learning using samples $x^n$.
Initialize $\widehat{G}^n_p = (V, \emptyset)$. For each $i, j \in V$, if

$$\min_{\substack{S \subset V \setminus \{i,j\} \\ |S| \le \eta}} \left|\widehat{\Sigma}(i, j \mid S)\right| > \xi_{n,p}, \qquad (1)$$

then add $(i, j)$ to $\widehat{G}^n_p$. Output: $\widehat{G}^n_p$.

Footnote 1: An analogous test is employed for Ising model selection in (Anandkumar et al., 2011b) based on conditional mutual information. We later note that the conditional mutual information test has slightly worse sample complexity for learning Gaussian models.

Footnote 2: Alternatively, conditional independence can be tested via sample partial correlations, which can be computed via regression or recursion. See (Kalisch and Bühlmann, 2007) for details.
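To make Algorithm 1 concrete, the following is a minimal Python sketch of the test (our own rendering, not the authors' code); the helper names `cond_cov` and `cct` are ours, and no attempt is made at the optimizations an efficient implementation would use.

```python
import itertools
import numpy as np

def cond_cov(S_hat, i, j, S):
    """Empirical conditional covariance
    Sigma(i,j|S) = Sigma(i,j) - Sigma(i,S) Sigma(S,S)^{-1} Sigma(S,j)."""
    if not S:
        return S_hat[i, j]
    S = list(S)
    return S_hat[i, j] - S_hat[i, S] @ np.linalg.solve(S_hat[np.ix_(S, S)], S_hat[S, j])

def cct(x, xi, eta):
    """Conditional Covariance Test: x is an (n, p) sample matrix, xi the
    threshold xi_{n,p}, eta the bound on the conditioning-set size.
    Returns the estimated edge set."""
    n, p = x.shape
    S_hat = np.cov(x, rowvar=False)          # empirical covariance matrix
    edges = set()
    for i, j in itertools.combinations(range(p), 2):
        rest = [k for k in range(p) if k not in (i, j)]
        # minimize |Sigma(i,j|S)| over all subsets S with |S| <= eta, cf. (1)
        min_cc = min(
            abs(cond_cov(S_hat, i, j, S))
            for size in range(eta + 1)
            for S in itertools.combinations(rest, size)
        )
        if min_cc > xi:                      # declare (i,j) an edge
            edges.add((i, j))
    return edges
```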
Note that $\widehat{\Sigma}^{-1}(S, S)$ exists when the number of samples satisfies $n > |S|$ (which is the regime under consideration). The conditional covariance is thus computed for each node pair $(i, j) \in V^2$, and the conditioning set which achieves the minimum is found over all subsets of cardinality at most $\eta$; if the minimum value exceeds the threshold $\xi_{n,p}$, then the node pair is declared an edge. See Algorithm 1 for details. The computational complexity of the algorithm is $O(p^{\eta+2})$, which is efficient for small $\eta$.

For the so-called walk-summable Gaussian graphical models, the parameter $\eta$ can be interpreted as an upper bound on the size of local vertex separators in the underlying graph. Many graph families have small $\eta$ and, as such, are amenable to computationally efficient structure estimation by our algorithm. These include Erdős-Rényi random graphs, power-law graphs and small-world graphs, as discussed previously.

We establish that the proposed algorithm has a sample complexity of $n = \Omega(J_{\min}^{-2} \log p)$, where $p$ is the number of nodes (variables) and $J_{\min}$ is the minimum (absolute) edge potential in the model. As expected, the sample complexity improves when $J_{\min}$ is large, i.e., when the model has strong edge potentials. However, as we shall see, $J_{\min}$ cannot be arbitrarily large for the model to remain walk-summable. We derive the minimum sample complexity for various graph families and show that this minimum is attained when $J_{\min}$ takes the maximum possible value.

We also develop novel techniques to obtain necessary conditions for consistent structure estimation of Erdős-Rényi random graphs and other ensembles with non-uniform distributions over graphs. We obtain non-asymptotic bounds on the number of samples $n$ in terms of the expected degree and the number of nodes of the model. The techniques employed are information-theoretic in nature (Cover and Thomas, 2006). We cast the learning problem as a source-coding problem and develop necessary conditions which combine the use of Fano's inequality with the so-called asymptotic equipartition property.

Our sufficient conditions for structural consistency are based on walk-summability. This characterization is novel to the best of our knowledge. Previously, walk-summable models have been extensively studied in the context of inference in Gaussian graphical models. As a by-product of our analysis, we also establish the correctness of loopy belief propagation for walk-summable Gaussian graphical models Markov on locally tree-like graphs (see Section 5 for details). This suggests that walk-summability is a fundamental criterion for tractable learning and inference in Gaussian graphical models.

1.2 Related Work

Given that structure learning of general graphical models is NP-hard (Karger and Srebro, 2001; Bogdanov et al., 2008), the focus has been on characterizing classes of models on which learning is tractable. The seminal work of Chow and Liu (1968) provided an efficient implementation of maximum-likelihood structure estimation for tree models via a maximum weighted spanning tree algorithm. Error-exponent analysis of the Chow-Liu algorithm was studied (Tan et al., 2010), and extensions to general forest models were considered by Tan et al. (2011) and Liu et al. (2011).
Learning trees with latent (hidden) variables (Choi et al., 2011) has also been studied recently.

For graphical models Markov on general graphs, alternative approaches are required for structure estimation. A recent paradigm for structure estimation is based on convex relaxation, where an estimate is obtained via convex optimization incorporating an $\ell_1$-based penalty term to encourage sparsity. For Gaussian graphical models, such approaches have been considered in Meinshausen and Bühlmann (2006); Ravikumar et al. (2008); d'Aspremont et al. (2008), and the sample complexity of the proposed algorithms has been analyzed. A major disadvantage of convex-relaxation methods is that the incoherence conditions required for consistent estimation are hard to interpret, and it is not straightforward to characterize the class of models satisfying these conditions.

An alternative to the convex-relaxation approach is the use of simple greedy local algorithms for structure learning. The conditions required for consistent estimation are typically more transparent, albeit somewhat restrictive. Bresler et al. (2008) propose an algorithm for structure learning of general graphical models Markov on bounded-degree graphs, based on a series of conditional-independence tests. Abbeel et al. (2006) propose an algorithm, similar in spirit, for learning factor graphs with bounded degree. Spirtes and Meek (1995), Cheng et al. (2002), Kalisch and Bühlmann (2007) and Xie and Geng (2008) propose conditional-independence tests for learning Bayesian networks on directed acyclic graphs (DAGs). Netrapalli et al. (2010) proposed a faster greedy algorithm, based on conditional entropy, for graphs with large girth and bounded degree. However, all of these works (Bresler et al., 2008; Abbeel et al., 2006; Spirtes and Meek, 1995; Cheng et al., 2002; Netrapalli et al., 2010) require the maximum degree in the graph to be bounded ($\Delta = O(1)$), which is restrictive. We allow for graphs where the maximum degree can grow with the number of nodes. Moreover, we establish a natural tradeoff between the maximum degree and other parameters of the graph (e.g., girth) required for consistent structure estimation.

Necessary conditions for consistent graphical model selection provide a lower bound on sample complexity and have been explored before by Santhanam and Wainwright (2008); Wang et al. (2010). These works consider graphs drawn uniformly from the class of bounded-degree graphs and establish that $n = \Omega(\Delta^k \log p)$ samples are required for consistent structure estimation in a $p$-node graph with maximum degree $\Delta$, where $k$ is typically a small positive integer. However, a direct application of these methods yields poor lower bounds if the ensemble of graphs has a highly non-uniform distribution. This is the case with the ensemble of Erdős-Rényi random graphs (Bollobás, 1985). Necessary conditions for structure estimation of Erdős-Rényi random graphs were derived for Ising models by Anandkumar et al. (2010) based on an information-theoretic covering argument. However, this approach is not directly applicable to the Gaussian setting.
We present a novel approach for obtaining necessary conditions for Gaussian graphical model selection based on the notion of typicality. We characterize the set of typical graphs for the Erdős-Rényi ensemble, derive a modified form of Fano's inequality, and obtain a non-asymptotic lower bound on sample complexity involving the average degree and the number of nodes.

We also briefly point to a large body of work on high-dimensional covariance selection under different notions of sparsity. Note that the assumption of a Gaussian graphical model Markov on a sparse graph is one such formulation. Other notions of sparsity include Gaussian models with sparse covariance matrices, or models with a banded Cholesky factorization. Also, note that many works consider covariance estimation instead of selection, and in general, estimation guarantees can be obtained under less stringent conditions. See Lam and Fan (2009), Rothman et al. (2008), Huang et al. (2006) and Bickel and Levina (2008) for details.

Paper Outline

The paper is organized as follows. We introduce the system model in Section 2. We prove the main result of our paper, regarding the structural consistency of the conditional covariance thresholding test, in Section 3. We prove necessary conditions for model selection in Section 4. In Section 5, we analyze the performance of loopy belief propagation in Gaussian graphical models. Section 6 concludes the paper. Proofs and additional discussion are provided in the appendix.

2. Preliminaries and System Model

2.1 Gaussian Graphical Models

A Gaussian graphical model is a family of jointly Gaussian distributions which factor in accordance with a given graph. Given a graph $G = (V, E)$ with $V = \{1, \ldots, p\}$, consider a vector of Gaussian random variables $X = [X_1, X_2, \ldots, X_p]^T$, where each node $i \in V$ is associated with a scalar Gaussian random variable $X_i$. A Gaussian graphical model Markov on $G$ has a probability density function (pdf) that may be parameterized as

$$f_X(x) \propto \exp\left(-\tfrac{1}{2}\, x^T J_G\, x + h^T x\right), \qquad (2)$$

where $J_G$ is a positive-definite symmetric matrix whose sparsity pattern corresponds to that of the graph $G$. More precisely,

$$J_G(i, j) = 0 \iff (i, j) \notin G. \qquad (3)$$

The matrix $J_G$ is known as the potential or information matrix, the non-zero entries $J(i, j)$ as the edge potentials, and the vector $h$ as the potential vector. A model is said to be attractive if $J(i, j) \le 0$ for all $i \ne j$. The parameterization in (2) is known as the information form and is related to the standard mean-covariance parameterization of the Gaussian distribution via

$$\mu = J^{-1} h, \qquad \Sigma = J^{-1},$$

where $\mu := \mathbb{E}[X]$ is the mean vector and $\Sigma := \mathbb{E}[(X - \mu)(X - \mu)^T]$ is the covariance matrix.

We say that a jointly Gaussian random vector $X$ with joint pdf $f(x)$ satisfies the local Markov property with respect to a graph $G$ if

$$f(x_i \mid x_{N(i)}) = f(x_i \mid x_{V \setminus i}) \qquad (4)$$

holds for all nodes $i \in V$, where $N(i)$ denotes the set of neighbors of node $i \in V$ and $V \setminus i$ denotes the set of all nodes excluding $i$. More generally, we say that $X$ satisfies the global Markov property if for all disjoint sets $A, B \subset V$, we have

$$f(x_A, x_B \mid x_S) = f(x_A \mid x_S)\, f(x_B \mid x_S), \qquad (5)$$
where the set $S$ is a separator of $A$ and $B$ (see Footnote 3). The local and global Markov properties are equivalent for non-degenerate Gaussian distributions (Lauritzen, 1996).

Footnote 3: A set $S \subset V$ is a separator for sets $A$ and $B$ if the removal of the nodes in $S$ partitions $A$ and $B$ into distinct components.

Our results on structure learning depend on the precision matrix $J$. Let

$$J_{\min} := \min_{(i,j) \in G} |J(i, j)|, \qquad J_{\max} := \max_{(i,j) \in G} |J(i, j)|, \qquad D_{\min} := \min_i J(i, i). \qquad (6)$$

Intuitively, models with edge potentials which are "too small" or "too large" are harder to learn than those with comparable potentials. Since we consider the high-dimensional case where the number of variables $p$ grows, we allow the bounds $J_{\min}$, $J_{\max}$ and $D_{\min}$ to potentially scale with $p$.

The partial correlation coefficient between variables $X_i$ and $X_j$, for $i \ne j$, measures their conditional covariance given all other variables. These coefficients are computed by normalizing the off-diagonal values of the information matrix, i.e.,

$$R(i, j) := \frac{\Sigma(i, j \mid V \setminus \{i, j\})}{\sqrt{\Sigma(i, i \mid V \setminus \{i, j\})\, \Sigma(j, j \mid V \setminus \{i, j\})}} = -\frac{J(i, j)}{\sqrt{J(i, i)\, J(j, j)}}. \qquad (7)$$

For all $i \in V$, set $R(i, i) = 0$. We henceforth refer to $R$ as the partial correlation matrix.

An important sub-class of Gaussian graphical models of the form in (33) are the walk-summable models (Malioutov et al., 2006). A Gaussian model is said to be $\alpha$-walk-summable if

$$\|\bar{R}\| \le \alpha < 1, \qquad (8)$$

where $\bar{R} := [|R(i, j)|]$ denotes the entry-wise absolute value of the partial correlation matrix $R$, and $\|\cdot\|$ denotes the spectral or 2-norm of the matrix, which for symmetric matrices is given by the maximum absolute eigenvalue. In other words, walk-summability means that the attractive model formed by taking the absolute values of the partial correlation matrix of the Gaussian graphical model is also valid (i.e., the corresponding potential matrix is positive definite). This immediately implies that attractive models form a sub-class of walk-summable models. For a detailed discussion of walk-summability, see Section A.1.
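As an illustration, (7) and (8) are straightforward to check numerically given a precision matrix $J$; the sketch below (ours, not from the paper) forms the partial correlation matrix and evaluates the spectral norm of its entry-wise absolute value.

```python
import numpy as np

def partial_correlation_matrix(J):
    """R(i,j) = -J(i,j) / sqrt(J(i,i) J(j,j)), with zero diagonal, cf. (7)."""
    d = np.sqrt(np.diag(J))
    R = -J / np.outer(d, d)
    np.fill_diagonal(R, 0.0)
    return R

def walk_summability_norm(J):
    """Spectral norm of the entry-wise absolute partial correlation matrix;
    the model is alpha-walk-summable iff this is at most alpha < 1, cf. (8)."""
    R_bar = np.abs(partial_correlation_matrix(J))
    return np.linalg.norm(R_bar, 2)  # for symmetric R_bar: max |eigenvalue|

# Example: a chain graph on 4 nodes with unit diagonal and edge potential 0.3.
J = np.eye(4) + 0.3 * (np.eye(4, k=1) + np.eye(4, k=-1))
assert walk_summability_norm(J) < 1  # this model is walk-summable
```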
2.2 Tractable Graph Families

We consider the class of Gaussian graphical models Markov on a graph $G_p$ belonging to some ensemble $\mathcal{G}(p)$ of graphs with $p$ nodes. We consider the high-dimensional learning regime, where both $p$ and the number of samples $n$ grow simultaneously; typically, the growth of $p$ is much faster than that of $n$. We emphasize that in our formulation the graph ensemble $\mathcal{G}(p)$ can be either deterministic or random; in the latter case, we also specify a probability measure over the set of graphs in $\mathcal{G}(p)$. In the setting where $\mathcal{G}(p)$ is a random-graph ensemble, let $P_{X,G}$ denote the joint probability distribution of the variables $X$ and the graph $G \sim \mathcal{G}(p)$, and let $f_{X|G}$ denote the conditional (Gaussian) density of the variables Markov on the given graph $G$. Let $P_G$ denote the probability distribution of a graph $G$ drawn from the random ensemble $\mathcal{G}(p)$. We say that almost every (a.e.) graph $G$ satisfies a certain property $Q$ if

$$\lim_{p \to \infty} P_G[G \text{ satisfies } Q] = 1.$$

In other words, the property $Q$ holds asymptotically almost surely (a.a.s.; see Footnote 4) with respect to the random-graph ensemble $\mathcal{G}(p)$. Our conditions and theoretical guarantees will be based on this notion for random graph ensembles. Intuitively, this means that graphs that have a vanishing probability of occurrence as $p \to \infty$ are ignored.

Footnote 4: The term a.a.s. does not apply to deterministic graph ensembles $\mathcal{G}(p)$, where no randomness is assumed; in that setting, we assume that the property $Q$ holds for every graph in the ensemble.

We now characterize the ensemble of graphs amenable to consistent structure estimation under our formulation. To this end, we define the concept of local separation in graphs; see Fig. 1 for an illustration.

Figure 1: Illustration of the $l$-local separator set $S(i, j; G, l)$ for the graph shown, with $l = 4$. Here $N(i) = \{a, b, c, d\}$ is the neighborhood of $i$, and the $l$-local separator set is $S(i, j; G, l) = \{a, b\} \subset N(i; G)$. This is because the path through $c$ connecting $i$ and $j$ has length greater than $l$, and hence node $c \notin S(i, j; G, l)$.

For $\gamma \in \mathbb{N}$, let $B_\gamma(i; G)$ denote the set of vertices within distance $\gamma$ from $i$ with respect to graph $G$. Let $H_{\gamma,i} := G(B_\gamma(i))$ denote the subgraph of $G$ spanned by $B_\gamma(i; G)$, where in addition we retain the nodes not in $B_\gamma(i)$ (and remove the corresponding edges). Thus, the number of vertices in $H_{\gamma,i}$ is $p$.

Definition 1 ($\gamma$-Local Separator) Given a graph $G$, a $\gamma$-local separator $S_\gamma(i, j)$ between $i$ and $j$, for $(i, j) \notin G$, is a minimal vertex separator (see Footnote 5) with respect to the subgraph $H_{\gamma,i}$. The parameter $\gamma$ is referred to as the path threshold for local separation.

Footnote 5: A minimal separator is a separator of smallest cardinality.

In other words, the $\gamma$-local separator $S_\gamma(i, j)$ separates nodes $i$ and $j$ with respect to paths in $G$ of length at most $\gamma$. We now characterize the ensemble of graphs based on the size of local separators.

Definition 2 ($(\eta, \gamma)$-Local Separation Property) An ensemble of graphs satisfies the $(\eta, \gamma)$-local separation property if for a.e. $G_p$ in the ensemble,

$$\max_{(i,j) \notin G_p} |S_\gamma(i, j)| \le \eta. \qquad (9)$$

We denote such a graph ensemble by $\mathcal{G}(p; \eta, \gamma)$.

In Section 3, we propose an efficient algorithm for graphical model selection when the underlying graph belongs to a graph ensemble $\mathcal{G}(p; \eta, \gamma)$ with sparse local separators (i.e., small $\eta$, for $\eta$ defined in (9)). We will see that the computational complexity of our proposed algorithm scales as $O(p^{\eta+2})$. We now provide examples of several graph families satisfying (9); a computational sketch of Definition 1 follows Example 1 below.

Example 1: Bounded Degree

We now show that the local-separation property holds for a rich class of graphs. Any (deterministic or random) ensemble of degree-bounded graphs $\mathcal{G}_{\mathrm{Deg}}(p, \Delta)$ satisfies the $(\eta, \gamma)$-local separation property with $\eta = \Delta$ and arbitrary $\gamma \in \mathbb{N}$. If we do not impose any further constraints on $\mathcal{G}_{\mathrm{Deg}}$, the computational complexity of our proposed algorithm scales as $O(p^{\Delta+2})$ (see also Bresler et al. (2008), where the computational complexity is comparable). Thus, when $\Delta$ is large, our proposed algorithm and the one in Bresler et al. (2008) are computationally intensive.
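To make Definition 1 concrete, the following is a small sketch (ours, not from the paper) that computes a $\gamma$-local separator with off-the-shelf graph tooling; it assumes the networkx library and that $(i, j)$ is a non-edge, as required by the definition.

```python
import networkx as nx

def gamma_local_separator(G, i, j, gamma):
    """Minimum-cardinality vertex separator of i and j in the subgraph
    over the gamma-ball of i (Definition 1); requires (i, j) not in G."""
    ball = set(nx.single_source_shortest_path_length(G, i, cutoff=gamma))
    H = G.subgraph(ball)                 # edges of G among B_gamma(i; G)
    if j not in H or not nx.has_path(H, i, j):
        return set()                     # i and j already separated in H
    return nx.minimum_node_cut(H, i, j)

# Checking the (eta, gamma)-local separation property (9) for one graph G:
# eta = max(len(gamma_local_separator(G, i, j, gamma))
#           for i, j in nx.non_edges(G))
```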
Our goal in this paper is to relax the usual bounded-degree assumption and to consider ensembles of graphs $\mathcal{G}(p)$ whose maximum degrees may grow with the number of nodes $p$. To this end, we discuss other structural constraints which can lead to graphs with sparse local separators.

Example 2: Bounded Local Paths

Another sufficient condition (see Footnote 6) for the $(\eta, \gamma)$-local separation property in Definition 2 to hold is that there are at most $\eta$ paths of length at most $\gamma$ in $G$ between any two nodes (henceforth termed the $(\eta, \gamma)$-local paths property). In other words, there are at most $\eta - 1$ overlapping (see Footnote 7) cycles of length smaller than $2\gamma$; a brute-force check of this property is sketched at the end of this example.

Footnote 6: For any graph satisfying the $(\eta, \gamma)$-local separation property, the number of vertex-disjoint paths of length at most $\gamma$ between any two non-neighbors is bounded above by $\eta$, by appealing to Menger's theorem for bounded path lengths (Lovász et al., 1978). However, in the definition of the local-paths property, we consider all distinct paths of length at most $\gamma$, not just vertex-disjoint paths.

Footnote 7: Two cycles are said to overlap if they have common vertices.

In particular, a special case of the local-paths property described above is the so-called girth property. The girth of a graph is the length of its shortest cycle. Thus, a graph with girth $g$ satisfies the $(\eta, \gamma)$-local separation property with $\eta = 1$ and $\gamma = g$. Let $\mathcal{G}_{\mathrm{Girth}}(p; g)$ denote the ensemble of graphs with girth at least $g$. There are many graph constructions which lead to large girth. For example, the bipartite Ramanujan graphs (Chung, 1997, p. 107) and the random Cayley graphs (Gamburd et al., 2009) have large girth.

The girth condition can be weakened to allow for a small number of short cycles, while not allowing typical node neighborhoods to contain short cycles. Such graphs are termed locally tree-like. For instance, the ensemble of Erdős-Rényi graphs $\mathcal{G}_{\mathrm{ER}}(p, c/p)$, where an edge between any node pair appears with probability $c/p$ independently of the other node pairs, is locally tree-like. The parameter $c$ may grow with $p$, albeit at a controlled rate for tractable structure learning; we make this precise in Example 3 in Section 3.1. The proof of the following result may be found in (Anandkumar et al., 2011a, Lemma 3).

Proposition 3 (Random Graphs are Locally Tree-Like) The ensemble of Erdős-Rényi graphs $\mathcal{G}_{\mathrm{ER}}(p, c/p)$ satisfies the $(\eta, \gamma)$-local separation property in (9) with

$$\eta = 2, \qquad \gamma \le \frac{\log p}{4 \log c}. \qquad (10)$$

Thus, there are at most two paths of length smaller than $\gamma$ between any two nodes in Erdős-Rényi graphs a.a.s., or equivalently, there are no overlapping cycles of length smaller than $2\gamma$ a.a.s. Similar observations apply to the more general scale-free or power-law graphs (Chung and Lu, 2006; Dommers et al., 2010). Along similar lines, the ensemble of $\Delta$-random regular graphs, denoted by $\mathcal{G}_{\mathrm{Reg}}(p, \Delta)$, i.e., the uniform ensemble of regular graphs with degree $\Delta$, has no overlapping cycles of length at most $\Theta(\log_{\Delta-1} p)$ a.a.s. (McKay et al., 2004, Lemma 1).
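For small graphs, the local-paths property can be checked by brute force; the sketch below (ours, assuming networkx) counts distinct simple paths of length at most $\gamma$ between every node pair. The enumeration is exponential in the worst case, so this is purely illustrative.

```python
import itertools
import networkx as nx

def local_paths_eta(G, gamma):
    """Smallest eta such that G satisfies the (eta, gamma)-local paths
    property: the maximum number of distinct paths of length <= gamma
    between any pair of nodes. Brute force; small graphs only."""
    eta = 0
    for u, v in itertools.combinations(G.nodes, 2):
        paths = nx.all_simple_paths(G, u, v, cutoff=gamma)
        eta = max(eta, sum(1 for _ in paths))
    return eta
```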
Example 3: Small-World Graphs

The previous two examples showed that local separation holds under two different conditions: bounded maximum degree and a bounded number of local paths. The former class of graphs can have short cycles, but the maximum degree needs to be constant, while the latter class of graphs can have a large maximum degree, but the number of overlapping short cycles needs to be small. We now provide instances which incorporate both these features, large degrees and short cycles, and yet satisfy the local separation property.

The class of hybrid or augmented graphs (Chung and Lu, 2006, Ch. 12) consists of graphs which are the union of two graphs: a "local" graph having short cycles and a "global" graph having small average distances. Since the hybrid graph is the union of these local and global graphs, it has both large degrees and short cycles. The simplest model $\mathcal{G}_{\mathrm{Watts}}(p, d, c/p)$, first studied by Watts and Strogatz (1998), consists of the union of a $d$-dimensional grid and an Erdős-Rényi random graph with parameter $c$. It is easily seen that a.e. graph $G \sim \mathcal{G}_{\mathrm{Watts}}(p, d, c/p)$ satisfies the $(\eta, \gamma)$-local separation property in (9) with

$$\eta = d + 2, \qquad \gamma \le \frac{\log p}{4 \log c}.$$

Similar observations apply to the more general hybrid graphs studied in (Chung and Lu, 2006, Ch. 12).

Counter-example: Dense Graphs

While the above examples illustrate that a large class of graphs satisfies the local separation criterion, there do exist graphs which do not satisfy it. Such graphs tend to be "dense", i.e., the number of edges scales super-linearly in the number of nodes. An instance is the Erdős-Rényi ensemble $\mathcal{G}_{\mathrm{ER}}(p, c/p)$ in the dense regime, where the average degree scales as $c = \Omega(p^{2})$. In this regime, the node degrees as well as the number of short cycles grow with $p$, and thus the size of the local separators also grows with $p$. Such graphs are hard instances for our algorithm.

3. Guarantees for Conditional Covariance Thresholding

3.1 Assumptions

(A1) Sample Scaling Requirements: We consider the asymptotic setting where both the number of variables (nodes) $p$ and the number of samples $n$ tend to infinity. We assume that the parameters $(n, p, J_{\min})$ scale in the following fashion (see Footnote 8):

$$n = \Omega(J_{\min}^{-2} \log p). \qquad (11)$$

We require that the number of nodes $p \to \infty$ to exploit the local separation properties of the class of graphs under consideration.

Footnote 8: The notations $\omega(\cdot)$, $\Omega(\cdot)$ refer to asymptotics as the number of variables $p \to \infty$.

(A2) $\alpha$-Walk-summability: The Gaussian graphical model Markov on $G_p \sim \mathcal{G}(p)$ is $\alpha$-walk-summable a.a.s., i.e.,

$$\|\bar{R}_{G_p}\| \le \alpha < 1, \qquad \text{a.e. } G_p \sim \mathcal{G}(p), \qquad (12)$$

where $\alpha$ is a constant (i.e., not a function of $p$), $\bar{R} := [|R(i, j)|]$ is the entry-wise absolute value of the partial correlation matrix $R$, and $\|\cdot\|$ denotes the spectral norm.

(A3) Local-Separation Property: We assume that the ensemble of graphs $\mathcal{G}(p; \eta, \gamma)$ satisfies the $(\eta, \gamma)$-local separation property with $\eta$ and $\gamma$ satisfying

$$\eta = O(1), \qquad J_{\min}\, D_{\min}^{-1}\, \alpha^{-\gamma} = \omega(1), \qquad (13)$$

where $\alpha$ is given by (12) and $D_{\min} := \min_i J(i, i)$ is the minimum diagonal entry of the potential matrix $J$.

(A4) Condition on Edge Potentials: The minimum absolute edge potential of an $\alpha$-walk-summable Gaussian graphical model satisfies

$$D_{\min}\, (1 - \alpha) \min_{(i,j) \in G_p} \frac{|J(i, j)|}{K(i, j)} > 1 + \delta, \qquad (14)$$

for almost every $G_p \sim \mathcal{G}(p)$, for some $\delta > 0$ (not depending on $p$), where

$$K(i, j) := \left\| J(V \setminus \{i, j\},\, \{i, j\}) \right\|_2$$
is the spectral norm of the indicated submatrix of the potential matrix $J$ (see Footnote 9), and $D_{\min} := \min_i J(i, i)$ is the minimum diagonal entry of $J$. Intuitively, (14) limits the extent of non-homogeneity in the model and the extent of overlap of neighborhoods. Moreover, this assumption is not required for consistent graphical model selection when the model is attractive ($J(i, j) \le 0$ for $i \ne j$); see Footnote 10.

Footnote 9: Here and in the sequel, for $A, B \subset V$, we use the notation $J(A, B)$ to denote the sub-matrix of $J$ indexed by rows in $A$ and columns in $B$.

Footnote 10: Assumption (A4) rules out the possibility that the neighbors are marginally independent. See Section B.3 for details.

(A5) Choice of Threshold $\xi_{n,p}$: The threshold $\xi_{n,p}$ for graph estimation under the CCT algorithm is chosen as a function of the number of nodes $p$, the number of samples $n$, and the minimum edge potential $J_{\min}$ as follows:

$$\xi_{n,p} = O(J_{\min}), \qquad \xi_{n,p} = \omega\left(\frac{\alpha^\gamma}{D_{\min}}\right), \qquad \xi_{n,p} = \Omega\left(\sqrt{\frac{\log p}{n}}\right), \qquad (15)$$

where $\alpha$ is given by (12), $D_{\min} := \min_i J(i, i)$ is the minimum diagonal entry of the potential matrix $J$, and $\gamma$ is the path threshold (9) for the $(\eta, \gamma)$-local separation property to hold.

Assumption (A1) stipulates how $n$, $p$ and $J_{\min}$ should scale for consistent graphical model selection, i.e., the sample complexity. The sample size $n$ needs to be sufficiently large with respect to the number of variables $p$ in the model for consistent structure reconstruction. Assumptions (A2) and (A4) impose constraints on the model parameters, and assumption (A3) restricts the class of graphs under consideration. To the best of our knowledge, all previous works dealing with graphical model selection, e.g., Meinshausen and Bühlmann (2006) and Ravikumar et al. (2008), also impose some conditions for consistent graphical model selection. Assumption (A5) concerns the choice of a suitable threshold $\xi_{n,p}$ for thresholding conditional covariances. In the sequel, we compare the conditions for consistent recovery after presenting our main theorem.

Example 1: Degree-Bounded Ensembles

To gain a better understanding of conditions (A1)-(A5), consider the ensemble of graphs $\mathcal{G}_{\mathrm{Deg}}(p; \Delta)$ with bounded degree $\Delta \in \mathbb{N}$. It can be established that for the walk-summability condition in (A2) to hold (see Footnote 11), we require, for normalized precision matrices ($J(i, i) = 1$), that

$$J_{\max} = O\left(\frac{1}{\Delta}\right). \qquad (16)$$

See Section A.2 for a detailed discussion. When the minimum potential achieves this bound ($J_{\min} = \Theta(1/\Delta)$), a sufficient condition for (A3) to hold is given by

$$\Delta\, \alpha^\gamma = o(1), \qquad (17)$$

where $\gamma$ is the path threshold for the local-separation property to hold according to Definition 2. Intuitively, we require a larger path threshold $\gamma$ as the degree bound $\Delta$ on the graph ensemble increases.

Footnote 11: We can provide improved bounds for random-graph ensembles; see Section A.2 for details.

Note that (17) allows the degree bound $\Delta$ to grow with the number of nodes as long as the path threshold $\gamma$ also grows appropriately. For example, if the maximum degree scales as $\Delta = O(\mathrm{poly}(\log p))$ and the path threshold scales as $\gamma = O(\log \log p)$, then (17) is satisfied. This implies that graphs with fairly large degrees and short cycles can be recovered successfully using our algorithm.
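The scalings in (15) leave multiplicative constants unspecified; purely for illustration, the sketch below (ours, with arbitrary constants `C1` and `C2` that are not from the paper) picks a threshold respecting all three conditions whenever the regime permits.

```python
import math

def choose_threshold(n, p, J_min, alpha, gamma, D_min, C1=2.0, C2=2.0):
    """Pick xi_{n,p} per the scalings in (15): above the statistical noise
    level sqrt(log p / n) and the bias level alpha^gamma / D_min, while
    remaining below J_min. C1, C2 are illustrative constants only."""
    xi = max(C1 * math.sqrt(math.log(p) / n),
             C2 * alpha ** gamma / D_min)
    if xi >= J_min:
        raise ValueError("outside the regime of (A1)-(A5): the threshold "
                         "cannot separate edges from non-edges")
    return xi
```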
Example 2: Girth-Bounded Ensembles

The condition in (17) can be specialized to the ensemble of girth-bounded graphs $\mathcal{G}_{\mathrm{Girth}}(p; g)$ in a straightforward manner as

$$\Delta\, \alpha^g = o(1), \qquad (18)$$

where $g$ corresponds to the girth of the graphs in the ensemble. The condition in (18) demonstrates a natural tradeoff between the girth and the maximum degree: graphs with large degrees can be learned efficiently if their girths are large. Indeed, in the extreme case of trees, which have infinite girth, (18) imposes no constraint on node degrees for successful recovery; recall that the Chow-Liu algorithm (Chow and Liu, 1968) is an efficient method for model selection on tree distributions.

Example 3: Erdős-Rényi and Small-World Ensembles

We can also conclude that a.e. Erdős-Rényi graph $G \sim \mathcal{G}_{\mathrm{ER}}(p, c/p)$ satisfies (13) when $c = O(\mathrm{poly}(\log p))$, under the best-possible scaling of $J_{\min}$ subject to the walk-summability constraint in (12). This is because it can be shown that $J_{\min} = O(1/\sqrt{\Delta})$ is required for walk-summability in (12) to hold; see Section A.2 for details. Note that a.a.s. the maximum degree $\Delta$ of $G \sim \mathcal{G}_{\mathrm{ER}}(p, c/p)$ satisfies

$$\Delta = O\left(\frac{\log p \, \log c}{\log \log p}\right)$$

from (Bollobás, 1985, Ex. 3.6), and $\gamma = O(\frac{\log p}{\log c})$ from (10). Thus, Erdős-Rényi graphs are amenable to successful recovery when the average degree $c = O(\mathrm{poly}(\log p))$. Similarly, for the small-world ensemble $\mathcal{G}_{\mathrm{Watts}}(p, d, c/p)$, when $d = O(1)$ and $c = O(\mathrm{poly}(\log p))$, the graphs are amenable to consistent estimation.

3.2 Consistency of Conditional Covariance Thresholding

Assuming (A1)-(A5), we now state our main result. The proof of this result and the auxiliary lemmata for the proof can be found in Sections B and C.

Theorem 4 (Structural consistency of CCT) For structure learning of Gaussian graphical models Markov on a graph $G_p \sim \mathcal{G}(p; \eta, \gamma)$, CCT$(x^n; \xi_{n,p}, \eta)$ is consistent for a.e. graph $G_p$. In other words,

$$\lim_{\substack{n, p \to \infty \\ n = \Omega(J_{\min}^{-2} \log p)}} P\left[\mathrm{CCT}(\{x^n\}; \xi_{n,p}, \eta) \ne G_p\right] = 0. \qquad (19)$$

Remarks:

1. Consistency guarantee: The CCT algorithm consistently recovers the structure of Gaussian graphical models asymptotically, with probability tending to one, where the probability measure is with respect to both the random graph (drawn from the ensemble $\mathcal{G}(p; \eta, \gamma)$) and the samples (drawn from $\prod_{i=1}^n f(x_i \mid G)$).

2. Analysis of sample complexity: The above result states the sample complexity for CCT ($n = \Omega(J_{\min}^{-2} \log p)$), which improves when the minimum edge potential $J_{\min}$ is large (see Footnote 12). This is intuitive, since the edges have stronger potentials in this case. On the other hand, $J_{\min}$ cannot be arbitrarily large, since the $\alpha$-walk-summability assumption in (12) imposes an upper bound on $J_{\min}$. The minimum sample complexity (over different parameter settings) is attained when $J_{\min}$ achieves this upper bound; see Section A.2 for details. For example, for any degree-bounded graph ensemble $\mathcal{G}(p, \Delta)$ with maximum degree $\Delta$, the minimum sample complexity is $n = \Omega(\Delta^2 \log p)$, i.e., when $J_{\min} = \Theta(1/\Delta)$, while for Erdős-Rényi random graphs the minimum sample complexity can be improved to $n = \Omega(\Delta \log p)$, i.e., when $J_{\min} = \Theta(1/\sqrt{\Delta})$.

Footnote 12: Note that the sample complexity also implicitly depends on the walk-summability parameter $\alpha$ through (13).
3. Comparison with Ravikumar et al. (2008): The work by Ravikumar et al. (2008) employs an $\ell_1$-penalized likelihood estimator for structure estimation in Gaussian graphical models. Under the so-called incoherence conditions, the sample complexity is $n = \Omega((\Delta^2 + J_{\min}^{-2}) \log p)$. Our sample complexity in (11) is the same in terms of its dependence on $J_{\min}$, and there is no explicit dependence on the maximum degree $\Delta$. Moreover, we have a transparent sufficient condition in terms of $\alpha$-walk-summability in (12), which directly imposes scaling conditions on $J_{\min}$.

4. Comparison with Meinshausen and Bühlmann (2006): The work by Meinshausen and Bühlmann (2006) considers $\ell_1$-penalized linear regression for neighborhood selection of Gaussian graphical models and establishes a sample complexity of $n = \Omega((\Delta + J_{\min}^{-2}) \log p)$. We note that our guarantees allow for graphs which do not necessarily satisfy the conditions imposed by Meinshausen and Bühlmann (2006). For instance, the assumption of neighborhood stability (assumption 6 in (Meinshausen and Bühlmann, 2006)) is hard to verify in general, and the relaxation of this assumption corresponds to the class of models with diagonally-dominant covariance matrices. Note that the class of Gaussian graphical models with diagonally-dominant covariance matrices forms a strict sub-class of walk-summable models, and thus satisfies assumption (A2) for the theorem to hold. Thus, Theorem 4 applies to a larger class of Gaussian graphical models than Meinshausen and Bühlmann (2006). Furthermore, the conditions for successful recovery in Theorem 4 are arguably more transparent.

5. Comparison with Ising models: Our above result for learning Gaussian graphical models is analogous to structure estimation of Ising models subject to an upper bound on the edge potentials (Anandkumar et al., 2011b), a regime we characterize as a conditional uniqueness regime. Thus, walk-summability is the analogous condition for Gaussian models.

Proof Outline: We first analyze the scenario when exact statistics are available. (i) We establish that for any two non-neighbors $(i, j) \notin G$, the minimum conditional covariance in (1) (based on exact statistics) does not exceed the threshold $\xi_{n,p}$. (ii) Similarly, we establish that the conditional covariance in (1) exceeds the threshold $\xi_{n,p}$ for all neighbors $(i, j) \in G$. (iii) We then extend these results to the empirical versions using concentration bounds.

3.2.1 Performance of the Conditional Mutual Information Test

We now consider the conditional mutual information test, analyzed in Anandkumar et al. (2011b) for Ising models, and note that it has slightly worse sample complexity than using conditional covariances. Using the threshold $\xi_{n,p}$ defined in (15), the conditional mutual information test CMIT is given by the threshold test

$$\min_{\substack{S \subset V \setminus \{i,j\} \\ |S| \le \eta}} \widehat{I}(X_i; X_j \mid X_S) > \xi_{n,p}^2, \qquad (20)$$

and node pairs $(i, j)$ exceeding the threshold are added to the estimate $\widehat{G}^n_p$.
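For Gaussian variables, the empirical conditional mutual information in (20) has a closed form in terms of the conditional covariances already computed for CCT, via the standard identity $\widehat{I}(X_i; X_j \mid X_S) = -\tfrac{1}{2}\log(1 - \hat{\rho}^2)$, where $\hat{\rho}$ is the sample partial correlation of $(i, j)$ given $S$. The sketch below (ours) reuses the `cond_cov` helper from the CCT sketch in Section 1.1.

```python
import math

def cond_mutual_info(S_hat, i, j, S):
    """Gaussian conditional mutual information (in nats):
    I(X_i; X_j | X_S) = -0.5 * log(1 - rho^2), with rho the partial
    correlation of (i, j) given S, computed from conditional covariances."""
    c_ij = cond_cov(S_hat, i, j, S)
    c_ii = cond_cov(S_hat, i, i, S)
    c_jj = cond_cov(S_hat, j, j, S)
    rho = c_ij / math.sqrt(c_ii * c_jj)
    return -0.5 * math.log(1.0 - rho ** 2)
```

CMIT then simply replaces the minimized statistic in the CCT loop with `cond_mutual_info` and compares against $\xi_{n,p}^2$.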
Assuming (A1)-(A5), we have the following result.

Theorem 5 (Structural consistency of CMIT) For structure learning of the Gaussian graphical model on a graph $G_p \sim \mathcal{G}(p; \eta, \gamma)$, CMIT$(x^n; \xi_{n,p}, \eta)$ is consistent for a.e. graph $G_p$. In other words,

$$\lim_{\substack{n, p \to \infty \\ n = \Omega(J_{\min}^{-4} \log p)}} P\left[\mathrm{CMIT}(\{x^n\}; \xi_{n,p}, \eta) \ne G_p\right] = 0. \qquad (21)$$

The proof of this theorem is provided in Section C.3.

Remarks:

1. For Gaussian random variables, conditional covariances and conditional mutual information are equivalent tests for conditional independence. However, from the above results, we note that there is a difference in the sample complexity of the two tests. The sample complexity of CMIT is $n = \Omega(J_{\min}^{-4} \log p)$, in contrast to $n = \Omega(J_{\min}^{-2} \log p)$ for CCT. This is due to the faster decay of conditional mutual information on the edges, compared to the decay of conditional covariances. Thus, conditional covariances are more efficient for Gaussian graphical model selection than conditional mutual information.

4. Necessary Conditions for Model Selection

In the previous sections, we proposed and analyzed efficient algorithms for learning the structure of Gaussian graphical models Markov on graph ensembles satisfying the local-separation property. In this section, we study the problem of deriving necessary conditions for consistent structure learning.

For the class of degree-bounded graphs $\mathcal{G}_{\mathrm{Deg}}(p, \Delta)$, necessary conditions on sample complexity have been characterized before (Wang et al., 2010) by considering a certain (limited) set of ensembles. However, a naïve application of such bounds (based on Fano's inequality (Cover and Thomas, 2006, Ch. 2)) turns out to be too weak for the class of Erdős-Rényi graphs $\mathcal{G}_{\mathrm{ER}}(p, c/p)$, where the average degree $c$ (see Footnote 13) is much smaller than the maximum degree. We now provide necessary conditions on the sample complexity for recovery of Erdős-Rényi graphs. Our information-theoretic techniques may also be applicable to other ensembles of random graphs; this is a promising avenue for future work.

Footnote 13: The techniques in this section are applicable when the average degree $c$ of the $\mathcal{G}_{\mathrm{ER}}(p, c/p)$ ensemble is a function of $p$, e.g., $c = O(\mathrm{poly}(\log p))$.

4.1 Setup

We now describe the problem more formally. A graph $G$ is drawn from the ensemble of Erdős-Rényi graphs $G \sim \mathcal{G}_{\mathrm{ER}}(p, c/p)$. The learner is also provided with $n$ conditionally i.i.d. samples $X^n := (X_1, \ldots, X_n) \in (\mathcal{X}^p)^n$ (where $\mathcal{X} = \mathbb{R}$) drawn from the conditional (Gaussian) product probability density function (pdf) $\prod_{i=1}^n f(x_i \mid G)$. The task is then to estimate $G$, a random quantity. The estimate is denoted as $\widehat{G} := \widehat{G}(X^n)$. It is desired to derive tight necessary conditions on $n$ (as a function of $c$ and $p$) so that the probability of error

$$P_e^{(p)} := P(\widehat{G} \ne G) \to 0 \qquad (22)$$

as the number of nodes $p$ tends to infinity. Note that the probability measure $P$ in (22) is associated with both the realization of the random graph $G$ and the samples $X^n$.

The task is reminiscent of source coding (or compression), a problem of central importance in information theory (Cover and Thomas, 2006): we would like to derive fundamental limits associated with the problem of reconstructing the source $G$ given a compressed version $X^n$ of it ($X^n$ is also analogous to the "message").

Figure 2: The canonical source coding problem: a source sequence $X^m \sim P^m(x)$ is encoded into a message $M \in [2^{mR}]$ and decoded as $\widehat{X}^m$. See Chapter 3 in (Cover and Thomas, 2006).
However, note the important distinction: while in source coding the source coder can design both the encoder and the decoder, our problem mandates that the code is fixed by the conditional probability density $f(x \mid G)$; we are only allowed to design the decoder. See the comparison in Figs. 2 and 3.

Figure 3: The estimation problem is analogous to source coding: the "source" is $G \sim \mathcal{G}_{\mathrm{ER}}(p, \frac{c}{p})$, the "message" is $X^n \in (\mathbb{R}^p)^n$ drawn from $\prod_{i=1}^n f(x_i \mid G)$, and the "decoded source" is $\widehat{G}$. We ask what minimum "rate" (analogous to the number of samples $n$) is required so that $\widehat{G} = G$ with high probability.

4.2 Necessary Conditions for Exact Recovery

To derive the necessary condition for learning Gaussian graphical models Markov on sparse Erdős-Rényi graphs $G \sim \mathcal{G}_{\mathrm{ER}}(p, c/p)$, we assume strict walk-summability with parameter $\alpha$, according to (12). We are then able to demonstrate the following:

Theorem 6 (Weak Converse for Gaussian Models) For a walk-summable Gaussian graphical model satisfying (12) with parameter $\alpha$, for almost every graph $G \sim \mathcal{G}_{\mathrm{ER}}(p, c/p)$ as $p \to \infty$, in order for $P_e^{(p)} \to 0$, we require that

$$n \ge \frac{\binom{p}{2}\, H_b\!\left(\frac{c}{p}\right)}{\frac{p}{2} \log_2\left[2\pi e\left(\frac{1}{1-\alpha} + 1\right)\right]} \qquad (23)$$

for all $p$ sufficiently large.

The proof is provided in Section D.1. By expanding the binary entropy function $H_b(\cdot)$, it is easy to see that the statement in (23) can be weakened to the necessary condition

$$n \ge \frac{c \log_2 p}{\log_2\left[2\pi e\left(\frac{1}{1-\alpha} + 1\right)\right]}. \qquad (24)$$

The above condition does not involve any asymptotic notation, and it demonstrates the dependence of the sample complexity on $p$, $c$ and $\alpha$ transparently. Finally, the dependence on $\alpha$ can be explained as follows: any $\alpha$-walk-summable model is also $\beta$-walk-summable for all $\beta > \alpha$. Thus, the class of $\beta$-walk-summable models contains the class of $\alpha$-walk-summable models. This results in a looser bound in (23) for larger $\alpha$.

4.3 Necessary Conditions for Recovery with Distortion

In this section, we generalize Theorem 6 to the case where we only require estimation of the underlying graph up to a certain edit distance: an error is declared if and only if the estimated graph $\widehat{G}$ exceeds an edit distance (or distortion) $D$ from the true graph. The edit distance $d : \mathcal{G}_p \times \mathcal{G}_p \to \mathbb{N} \cup \{0\}$ between two undirected graphs $G = (V, E)$ and $G' = (V, E')$ is defined as $d(G, G') := |E \,\triangle\, E'|$, where $\triangle$ denotes the symmetric difference between the edge sets $E$ and $E'$. The edit distance can be regarded as a distortion measure between two graphs (a small computational sketch follows below).

Given a positive integer $D$, known as the distortion, suppose we declare an error if and only if $d(G, \widehat{G}) > D$; the probability of error is then redefined as

$$P_e^{(p)} := P\left(d(G, \widehat{G}(X^n)) > D\right). \qquad (25)$$

We derive necessary conditions on $n$ (as a function of $p$ and $c$) such that the probability of error (25) goes to zero as $p \to \infty$. To ease notation, we define the ratio

$$\beta := \frac{D}{\binom{p}{2}}. \qquad (26)$$

Note that $\beta$ may be a function of $p$; we do not attempt to make this dependence explicit.
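The edit distance above is immediate to compute; a tiny sketch (ours, assuming networkx-style graph objects on a common vertex set) follows.

```python
def edit_distance(G1, G2):
    """d(G, G') = |E(G) symmetric-difference E(G')| between two undirected
    graphs on the same vertex set, treating edges as unordered pairs."""
    E1 = {frozenset(e) for e in G1.edges}
    E2 = {frozenset(e) for e in G2.edges}
    return len(E1 ^ E2)
```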
The following corollary is based on an idea propounded by Kim et al. (2008), among others.

Corollary 7 (Weak Converse for Discrete Models With Distortion) For $P_e^{(p)} \to 0$, we must have

$$n \ge \frac{\binom{p}{2}\left[H_b\!\left(\frac{c}{p}\right) - H_b(\beta)\right]}{\frac{p}{2} \log_2\left[2\pi e\left(\frac{1}{1-\alpha} + 1\right)\right]} \qquad (27)$$

for all $p$ sufficiently large.

The proof of this corollary is provided in Section D.7. Note that for (27) to be a useful bound, we need $\beta < c/p$, which translates to an allowed distortion $D < cp/2$. We observe from (27) that, because the error criterion has been relaxed, the required number of samples is also reduced relative to the corresponding lower bound in (23).

4.4 Proof Techniques

Our analysis tools for deriving necessary conditions for Gaussian graphical model selection are information-theoretic in nature. A common and natural tool to derive necessary conditions (also called converses) is to resort to Fano's inequality (Cover and Thomas, 2006, Chapter 2), which (lower) bounds the probability of error $P_e^{(p)}$ as a function of the equivocation, or conditional entropy, $H(G \mid X^n)$ and the size of the set of all graphs with $p$ nodes. However, a direct and naïve application of Fano's inequality results in a trivial lower bound, as the set of all graphs which can be realized by $\mathcal{G}_{\mathrm{ER}}(p, c/p)$ is, loosely speaking, "too large".

To ameliorate this problem, we employ another information-theoretic notion, known as typicality. A typical set is, roughly speaking, a set that has small cardinality and yet has high probability as $p \to \infty$. For example, the probability of a typical length-$m$ sequence is of the order $\approx 2^{-mH}$ (where $H$ is the entropy rate of the source), and hence those sequences with probability close to this value are called typical. In our context, given a graph $G$, we define $\bar{d}(G)$ to be the ratio of the number of edges of $G$ to the total number of nodes $p$. Let $\mathcal{G}_p$ denote the set of all graphs with $p$ nodes. For a fixed $\epsilon > 0$, we define the following set of graphs:

$$\mathcal{T}_\epsilon^{(p)} := \left\{ G \in \mathcal{G}_p : \left| \frac{\bar{d}(G)}{c/2} - 1 \right| \le \frac{\epsilon}{2} \right\}. \qquad (28)$$

The set $\mathcal{T}_\epsilon^{(p)}$ is known as the $\epsilon$-typical set of graphs. Every graph $G \in \mathcal{T}_\epsilon^{(p)}$ has a normalized number of edges $\bar{d}(G)$ within a factor $1 \pm \epsilon/2$ of its Erdős-Rényi expectation $c/2$. Note that typicality ideas are usually used to derive sufficient conditions in information theory (Cover and Thomas, 2006) (achievability, in information-theoretic parlance); our use of typicality for graphical model selection, together with Fano's inequality to derive converse statements, seems novel. Indeed, the proof of the converse of the source coding theorem in Cover and Thomas (2006, Chapter 3) utilizes only Fano's inequality. We now summarize the properties of the typical set.

Lemma 8 (Properties of $\mathcal{T}_\epsilon^{(p)}$) The $\epsilon$-typical set of graphs has the following properties:

1. $P(\mathcal{T}_\epsilon^{(p)}) \to 1$ as $p \to \infty$.

2. For all $G \in \mathcal{T}_\epsilon^{(p)}$, we have (see Footnote 14)

$$\exp_2\left(-\binom{p}{2} H_b\!\left(\frac{c}{p}\right)(1 + \epsilon)\right) \le P(G) \le \exp_2\left(-\binom{p}{2} H_b\!\left(\frac{c}{p}\right)\right). \qquad (29)$$

3. The cardinality of the $\epsilon$-typical set can be bounded as

$$(1 - \epsilon)\, \exp_2\left(\binom{p}{2} H_b\!\left(\frac{c}{p}\right)\right) \le \left|\mathcal{T}_\epsilon^{(p)}\right| \le \exp_2\left(\binom{p}{2} H_b\!\left(\frac{c}{p}\right)(1 + \epsilon)\right) \qquad (30)$$

for all $p$ sufficiently large.

Footnote 14: We use the notation $\exp_2(\cdot)$ to mean $2^{(\cdot)}$.

The proof of this lemma can be found in Section D.2. Parts 1 and 3 of Lemma 8 respectively say that the set of typical graphs has high probability and has very small cardinality relative to the number of graphs with $p$ nodes, $|\mathcal{G}_p| = \exp_2\left(\binom{p}{2}\right)$. Part 2 of Lemma 8 is known as the asymptotic equipartition property: the graphs in the typical set are almost uniformly distributed.
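For intuition about the magnitudes involved, the lower bounds (23) and (27) are easy to evaluate numerically; the sketch below (ours) uses the binary entropy function $H_b$ in bits.

```python
import math

def Hb(q):
    """Binary entropy in bits."""
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def fano_lower_bound(p, c, alpha, beta=0.0):
    """Evaluate the sample-complexity lower bound (23) (beta = 0) or
    (27) (beta > 0) for the Erdos-Renyi ensemble G(p, c/p)."""
    numer = math.comb(p, 2) * (Hb(c / p) - (Hb(beta) if beta > 0 else 0.0))
    denom = (p / 2) * math.log2(2 * math.pi * math.e * (1 / (1 - alpha) + 1))
    return numer / denom

# Example: fano_lower_bound(1000, 5, 0.5) evaluates to roughly 8,
# consistent with the weakened form (24).
```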
5. Implications for Loopy Belief Propagation

An active area of research in the graphical model community is inference, i.e., the task of computing node marginals (or MAP estimates) through efficient distributed algorithms. The simplest of these algorithms is the belief propagation (BP) algorithm (see Footnote 15), where messages are passed among the neighbors in the graph of the model. It is known that belief propagation (and max-product) is exact on tree models, meaning that correct marginals are computed at all the nodes (Pearl, 1988). On general graphs, on the other hand, the generalized version of BP, known as loopy belief propagation (LBP), may not converge, and even if it does, the marginals may not be correct. Motivated by the twin problems of convergence and correctness, there has been extensive work on characterizing LBP's performance for different models; see Section 5.3 for details. As a by-product of our previous analysis on graphical model selection, we now show the asymptotic correctness of LBP on walk-summable Gaussian models when the underlying graph is locally tree-like.

Footnote 15: The variant of the belief propagation algorithm which computes the MAP estimates is known as the max-product algorithm.

5.1 Background

The belief propagation (BP) algorithm is a distributed algorithm where messages (or beliefs) are passed among the neighbors to draw inferences at the nodes of a graphical model. The computation of node marginals through naïve variable elimination (or Gaussian elimination in the Gaussian setting) is prohibitively expensive. However, if the graph is sparse (consists of few edges), the computation of node marginals may be sped up dramatically by exploiting the graph structure and using distributed algorithms to parallelize the computations.

For the sake of completeness, we now recall the basic steps of LBP, specific to Gaussian graphical models. Given a message schedule which specifies how messages are exchanged, each node $j$ receives information from each of its neighbors (according to the graph), where the message $m^t_{i \to j}(x_j)$ from $i$ to $j$ in the $t$-th iteration is parameterized as

$$m^t_{i \to j}(x_j) := \exp\left(-\tfrac{1}{2}\, \Delta J^t_{i \to j}\, x_j^2 + \Delta h^t_{i \to j}\, x_j\right).$$

Each node $i$ prepares the message $m^t_{i \to j}(x_j)$ by collecting messages from its neighbors from the previous iteration (under parallel iterations), computing

$$\widehat{J}_{i \setminus j}(t) = J(i, i) + \sum_{k \in N(i) \setminus j} \Delta J^{t-1}_{k \to i}, \qquad \widehat{h}_{i \setminus j}(t) = h(i) + \sum_{k \in N(i) \setminus j} \Delta h^{t-1}_{k \to i},$$

where

$$\Delta J^t_{i \to j} = -J(j, i)\, \widehat{J}^{-1}_{i \setminus j}(t)\, J(j, i), \qquad \Delta h^t_{i \to j} = -J(j, i)\, \widehat{J}^{-1}_{i \setminus j}(t)\, \widehat{h}_{i \setminus j}(t).$$
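The updates above translate directly into code; the following is a minimal sketch (ours) of parallel Gaussian LBP that maintains the scalar message parameters $\Delta J_{i \to j}$ and $\Delta h_{i \to j}$ and returns the approximate means and variances at the (near-)fixed point.

```python
import numpy as np

def gaussian_lbp(J, h, iters=50):
    """Parallel Gaussian loopy BP on precision matrix J and potential
    vector h; convergence holds for walk-summable models
    (Malioutov et al., 2006). Returns approximate means and variances."""
    p = J.shape[0]
    nbrs = [[k for k in range(p) if k != i and J[i, k] != 0] for i in range(p)]
    dJ = np.zeros((p, p))   # dJ[i, j] = Delta J_{i -> j}
    dh = np.zeros((p, p))   # dh[i, j] = Delta h_{i -> j}
    for _ in range(iters):
        dJ_new, dh_new = np.zeros_like(dJ), np.zeros_like(dh)
        for i in range(p):
            for j in nbrs[i]:
                # J_hat_{i\j} and h_hat_{i\j} from previous-iteration messages
                Jij = J[i, i] + sum(dJ[k, i] for k in nbrs[i] if k != j)
                hij = h[i] + sum(dh[k, i] for k in nbrs[i] if k != j)
                dJ_new[i, j] = -J[j, i] ** 2 / Jij
                dh_new[i, j] = -J[j, i] * hij / Jij
        dJ, dh = dJ_new, dh_new
    Jhat = np.array([J[i, i] + sum(dJ[k, i] for k in nbrs[i]) for i in range(p)])
    hhat = np.array([h[i] + sum(dh[k, i] for k in nbrs[i]) for i in range(p)])
    return hhat / Jhat, 1.0 / Jhat   # approximate means and variances
```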
5.2 Results

Let $\Sigma_{\mathrm{LBP}}(i, i)$ denote the variance at node $i$ at the LBP fixed point (see Footnote 16). Without loss of generality, we consider the normalized version of the precision matrix, $J = I - R$, which can always be obtained from a general precision matrix via normalization. We can then renormalize the variances computed via LBP to obtain the variances corresponding to the unnormalized precision matrix. We consider the following ensemble of locally tree-like graphs. Consider the event that the neighborhood of a node $i$ has no cycles up to graph distance $\gamma$, given by

$$\Gamma(i; \gamma, G) := \{B_\gamma(i; G) \text{ does not contain any cycles}\}.$$

We assume a random graph ensemble $\mathcal{G}(p)$ such that for a given node $i \in V$, we have

$$P[\Gamma^c(i; \gamma, G)] = o(1). \qquad (31)$$

Footnote 16: Convergence of LBP on walk-summable models has been established by Malioutov et al. (2006).

Proposition 9 (Correctness of LBP) Given an $\alpha$-walk-summable Gaussian graphical model on a.e. locally tree-like graph $G \sim \mathcal{G}(p; \gamma)$ with parameter $\gamma$ satisfying (31), we have

$$\left|\Sigma_G(i, i) - \Sigma_{\mathrm{LBP}}(i, i)\right| \overset{\text{a.a.s.}}{=} O\left(\max\left(\alpha^\gamma,\ P[\Gamma^c(i; \gamma, G)]\right)\right). \qquad (32)$$

The proof is given in Section B.4.

Remarks:

1. The class of Erdős-Rényi random graphs $G \sim \mathcal{G}_{\mathrm{ER}}(p, c/p)$ satisfies (31) with $\gamma = O(\log p / \log c)$ for a node $i \in V$ chosen uniformly at random.

2. Recall that the class of random regular graphs $G \sim \mathcal{G}_{\mathrm{Reg}}(p, \Delta)$ has girth $O(\log_{\Delta-1} p)$. Thus, for any node $i \in V$, (31) holds with $\gamma = O(\log_{\Delta-1} p)$.

5.3 Previous Work on Loopy Belief Propagation

It has long been known, through numerous empirical studies (Murphy et al., 1999) and the phenomenal success of turbo decoding (McEliece et al., 2002), that loopy belief propagation (LBP) performs reasonably well on a variety of graphical models, though it must also be mentioned that LBP fails catastrophically on other models. Weiss (2000) proved that if the underlying graph (of a Gaussian graphical model) consists of a single cycle, LBP converges and is correct, i.e., the fixed points of the means and the variances are the same as the true means and variances. In addition, sufficient conditions for a unique fixed point are known (Mooij and Kappen, 2007). The max-product variant of LBP (also called the min-sum algorithm) has been studied by Bayati et al. (2005); Sanghavi et al. (2009); Ruozzi and Tatikonda (2010). Despite its seemingly heuristic nature, LBP has found a variety of concrete applications, especially in combinatorial optimization (Moallemi and Van Roy, 2010; Gamarnik et al., 2010). Indeed, it has been applied to and analyzed for NP-hard problems such as maximum matching (Bayati et al., 2008b), b-matching (Sanghavi et al., 2009), and the Steiner tree problem (Bayati et al., 2008a).

The application of BP for inference in Gaussian graphical models has been studied extensively, starting with the seminal work by Weiss and Freeman (2001). Undoubtedly the Kalman filter is the most familiar instance of BP in Gaussian graphical models. The notion of walk-summability in Gaussian graphical models was introduced by Malioutov et al. (2006). Among other results, the authors showed that LBP converges to the correct means for walk-summable models, but the estimated variances may nevertheless still be incorrect. Chandrasekaran et al. (2008) leveraged the ideas of Malioutov et al. (2006) to analyze related inference algorithms such as embedded trees and the block Gauss-Seidel method. Recently, Liu et al. (2010) considered a modified version of LBP by identifying a special set of nodes, called the feedback vertex set (FVS) (Vazirani, 2001), that breaks (or approximately breaks) cycles in the loopy graph.
This allows one to perform inference tractably while trading off accuracy against computational complexity. For Gaussian graphical models Markov on locally tree-like graphs, an approximate FVS can be identified. This set, though not an FVS per se, allows one to break all the short cycles in the graph, and thus allows tight error bounds on the inferred variances to be established.

The performance of LBP on locally tree-like graphs has also been studied for other families of graphical models. For Ising models Markov on locally tree-like graphs, Dembo and Montanari (2010) established an analogous result for attractive (also known as ferromagnetic) models. Note that the class of walk-summable Gaussian graphical models is a superset of the class of attractive Gaussian models. An interpretation of LBP in terms of graph covers is given by Vontobel (2010), and its equivalence to walk-summability for Gaussian graphical models is established by Ruozzi et al. (2009).

6. Conclusion

In this paper, we adopted a novel and unified paradigm for graphical model selection. We presented a simple local algorithm for structure estimation with low computational and sample complexities under a set of mild and transparent conditions. This algorithm succeeds on a wide range of graph ensembles such as the Erdős-Rényi ensemble, small-world networks, etc. We also employed novel information-theoretic techniques for establishing necessary conditions for graphical model selection.

Acknowledgement

The first author is supported in part by the setup funds at UCI and the AFOSR Award FA9550-10-1-0310, the second author is supported by A*STAR, Singapore, and the third author is supported in part by AFOSR under Grant FA9550-08-1-1080. The authors thank Venkat Chandrasekaran (MIT) for discussions on walk-summable models, Elchanan Mossel (UC Berkeley) for discussions on the necessary conditions for model selection, and Divyanshu Vats (U. Minn.) for extensive comments. The authors thank the Associate Editor Martin Wainwright (Berkeley) and the anonymous reviewers for comments which significantly improved this manuscript.

Appendix A. Walk-Summable Gaussian Graphical Models

A.1 Background on Walk-Summability

We now recap the properties of walk-summable Gaussian graphical models, as given by (12). For details, see Malioutov et al. (2006). For simplicity, we first assume that the diagonal of the potential matrix J is normalized (J(i,i) = 1 for all i in V). We remove this assumption and consider general unnormalized precision matrices in Section B.2. Consider splitting the matrix J into the identity matrix and the partial correlation matrix R, defined in (7):

    J = I - R.   (33)

The covariance matrix Sigma of the graphical model in (33) can be decomposed as

    \Sigma = J^{-1} = (I - R)^{-1} = \sum_{k=0}^{\infty} R^k, \qquad \|R\| < 1,   (34)

using the Neumann power series for the matrix inverse. Note that we require \|R\| < 1 for (34) to hold, which is implied by walk-summability in (12) (since \|R\| \le \|\bar{R}\|).
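The decomposition (34) is easy to verify numerically. The following small numpy sketch (our own illustration, not code from the paper) builds a walk-summable model and checks that the truncated Neumann series approaches J^{-1}, with a geometrically decaying tail:

    import numpy as np

    rng = np.random.default_rng(0)
    p = 8
    # Sparse symmetric R with small entries, so that ||Rbar|| < 1 (walk-summable).
    R = np.triu(rng.uniform(-0.15, 0.15, (p, p)) * (rng.random((p, p)) < 0.3), k=1)
    R = R + R.T
    assert np.linalg.norm(np.abs(R), 2) < 1   # alpha-walk-summability check
    J = np.eye(p) - R                         # normalized precision matrix (33)
    Sigma = np.linalg.inv(J)

    # Truncated Neumann series (34): the error decays geometrically in the cutoff.
    partial, Rk = np.zeros_like(J), np.eye(p)
    for k in range(25):
        partial += Rk
        Rk = Rk @ R
    print(np.max(np.abs(Sigma - partial)))    # ~0, up to the geometric tail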
We now relate the matrix power R^l to walks on the graph G. A walk w of length l >= 0 on graph G is a sequence of nodes w := (w_0, w_1, ..., w_l) traversed on the graph G, i.e., (w_k, w_{k+1}) in G. Let |w| denote the length of the walk. Given a matrix R_G supported on graph G, let the weight of a walk be

    \phi(w) := \prod_{k=1}^{|w|} R(w_{k-1}, w_k).

The elements of the matrix power R^l are given by

    R^l(i,j) = \sum_{w:\, i \xrightarrow{l} j} \phi(w),   (35)

where i -l-> j denotes the set of walks from i to j of length l. For this reason, we henceforth refer to R as the walk matrix.

Let i -> j denote the set of all walks between i and j. Under the walk-summability condition in (12), the sum \sum_{w: i\to j}\phi(w) converges irrespective of the order in which the walks are collected, and it equals the covariance Sigma(i,j). In Section A.3, we relate walk-summability in (12) to the notion of correlation decay, whereby the effect of faraway nodes on covariances can be controlled and the local-separation property of the graphs under consideration can be exploited.

A.2 Sufficient Conditions for Walk-Summability

We now provide sufficient conditions and a suitable parameterization for walk-summability in (12) to hold. The adjacency matrix A_G of a graph G with maximum degree Delta_G satisfies lambda_max(A_G) <= Delta_G, since it is dominated by a Delta_G-regular graph, which has maximum eigenvalue Delta_G. From the Perron-Frobenius theorem, for the adjacency matrix A_G we have lambda_max(A_G) = ||A_G||, where ||A_G|| is the spectral radius of A_G. Thus, for R_G supported on graph G, we have alpha := ||Rbar_G|| = O(J_max Delta), where J_max := max_{i,j} |R(i,j)|. This implies that

    J_{\max} = O\Big(\frac{1}{\Delta}\Big)   (36)

is needed to have alpha < 1, which is the requirement for walk-summability. When the graph G is an Erdős-Rényi random graph, G ~ G_ER(p, c/p), we can provide better bounds. When G ~ G_ER(p, c/p), we have (Krivelevich and Sudakov, 2003) that

    \lambda_{\max}(A_G) = (1 + o(1))\,\max\big(\sqrt{\Delta_G},\, c\big),

where Delta_G is the maximum degree and A_G is the adjacency matrix. Thus, in this case, when c = O(1), we require that

    J_{\max} = O\Big(\sqrt{\tfrac{1}{\Delta}}\Big)   (37)

for walk-summability (alpha < 1). Note that when c = O(poly(log p)), w.h.p. Delta_G = Theta(log p / log log p) (Bollobás, 1985, Ex. 3.6).

A.3 Implications of Walk-Summability

Recall that Sigma_G denotes the covariance matrix of the Gaussian graphical model on graph G, and that J_G = Sigma_G^{-1} with J_G = I - R_G in (33). We now relate the walk-summability condition in (12) to correlation decay in the model. In other words, under walk-summability, we can show that the effect of faraway nodes on covariances decays with distance, as made precise in Lemma 10.

Let B_gamma(i) denote the set of nodes within gamma hops of node i in graph G. Denote

    H_{\gamma;ij} := G\big(B_\gamma(i) \cap B_\gamma(j)\big)   (38)

as the induced subgraph of G over the intersection of the gamma-hop neighborhoods at i and j, while retaining the nodes in V \ {B_gamma(i) cap B_gamma(j)} (as isolated nodes). Thus, H_{gamma;ij} has the same number of nodes as G.

We first make the following simple observation: the (i,j) element of the gamma-th power of the walk matrix, R^gamma_G(i,j), is given by walks of length gamma between i and j on graph G, and thus depends only on the subgraph^17 H_{gamma;ij} (see (35)). This enables us to quantify the effect of the nodes outside B_gamma(i) cap B_gamma(j) on the covariance Sigma_G(i,j). Define a new walk matrix R_{H_{gamma;ij}} such that

    R_{H_{\gamma;ij}}(a,b) = \begin{cases} R_G(a,b), & a, b \in B_\gamma(i) \cap B_\gamma(j), \\ 0, & \text{otherwise}. \end{cases}   (39)-(40)

In other words, R_{H_{gamma;ij}} is formed by considering the Gaussian graphical model over the graph H_{gamma;ij}. Let Sigma_{H_{gamma;ij}} denote the corresponding covariance matrix.^18

17. Note that R^gamma(i,j) = 0 if B_gamma(i) cap B_gamma(j) is empty.
18. When B_gamma(i) cap B_gamma(j) is empty, meaning that the graph distance between i and j is more than gamma, we obtain Sigma_{H_{gamma;ij}} = I.
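The walk-sum identity (35) can be checked by brute force on a small graph. The sketch below is our own illustration (enumeration over all node sequences is exponential and purely pedagogical); it relies on the fact that phi(w) = 0 for any sequence using a non-edge, so summing over all node sequences equals summing over valid walks:

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(1)
    p, l = 5, 3
    R = np.triu(rng.uniform(-0.2, 0.2, (p, p)) * (rng.random((p, p)) < 0.5), k=1)
    R = R + R.T   # walk matrix supported on a small random graph

    def walk_sum(R, i, j, l):
        # Sum of phi(w) over all length-l walks from i to j, per (35).
        total = 0.0
        for mid in product(range(len(R)), repeat=l - 1):
            w = (i, *mid, j)
            total += np.prod([R[w[k - 1], w[k]] for k in range(1, l + 1)])
        return total

    i, j = 0, 4
    print(np.linalg.matrix_power(R, l)[i, j], walk_sum(R, i, j, l))  # equal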
Lemma 10 (Covariance Bounds Under Walk-Summability) For any walk-summable Gaussian graphical model (alpha := ||Rbar_G|| < 1), we have^19

    \max_{i,j} |\Sigma_G(i,j) - \Sigma_{H_{\gamma;ij}}(i,j)| \le \frac{2\alpha^{\gamma+1}}{1-\alpha} = O(\alpha^\gamma).   (41)

Thus, for walk-summable Gaussian graphical models we have alpha := ||Rbar_G|| < 1, implying that the error in (41), incurred by approximating the covariance using the local neighborhood, decays exponentially with distance. Parts of the proof below are inspired by Dumitriu and Pal (2009).

Proof: Using the power series in (34), we can write the covariance matrix as

    \Sigma_G = \sum_{k=0}^{\gamma} R_G^k + E_G,

where the error matrix E_G has spectral radius ||E_G|| <= ||R_G||^{gamma+1}/(1 - ||R_G||), from (34). Thus,^20 for any i, j in V,

    \Big|\Sigma_G(i,j) - \sum_{k=0}^{\gamma} R_G^k(i,j)\Big| \le \frac{\|R_G\|^{\gamma+1}}{1-\|R_G\|}.   (42)

Similarly, we have

    \Big|\Sigma_{H_{\gamma;ij}}(i,j) - \sum_{k=0}^{\gamma} R_{H_{\gamma;ij}}^k(i,j)\Big| \le \frac{\|R_{H_{\gamma;ij}}\|^{\gamma+1}}{1-\|R_{H_{\gamma;ij}}\|}   (43)
    \overset{(a)}{\le} \frac{\|\bar{R}_G\|^{\gamma+1}}{1-\|\bar{R}_G\|},   (44)

where for inequality (a) we use the fact that ||R_{H_{gamma;ij}}|| <= ||Rbar_{H_{gamma;ij}}|| <= ||Rbar_G||, since H_{gamma;ij} is a subgraph^21 of G. Since, by the observation above, R^k_G(i,j) = R^k_{H_{gamma;ij}}(i,j) for all k <= gamma, combining (42) and (44) using the triangle inequality yields (41). QED

We also make some simple observations about conditional covariances in walk-summable models. Recall that Rbar_G denotes the matrix of absolute values of R_G, and that R_G is the walk matrix over graph G. Also recall that the alpha-walk-summability condition in (12) is ||Rbar_G|| <= alpha < 1.

Proposition 11 (Conditional Covariances under Walk-Summability) Given a walk-summable Gaussian graphical model, for any i, j in V and S subset of V with i, j not in S, we have

    \Sigma(i,j|S) = \sum_{\substack{w:\, i \to j \\ k \notin S\ \forall k \in w}} \phi_G(w).   (45)

Moreover, we have

    \sup_{i \in V,\ S \subset V \setminus \{i\}} \Sigma(i,i|S) \le (1-\alpha)^{-1} = O(1).   (46)

Proof: We have, from Rue and Held (2005, Thm. 2.5),

    \Sigma(i,j|S) = J^{-1}_{-S,-S;G}(i,j),

where J_{-S,-S;G} denotes the submatrix of the potential matrix J_G obtained by deleting the nodes in S. Since a submatrix of a walk-summable matrix is walk-summable, we obtain (45) by appealing to the walk-sum expression for conditional covariances. For (46), let ||A||_inf denote the maximum absolute value of the entries of a matrix A. Using the monotonicity of the spectral norm and the fact that ||A||_inf <= ||A||, we have

    \sup_{i \in V,\ S \subset V\setminus\{i\}} \Sigma(i,i|S) \le \|J^{-1}_{-S,-S;G}\| \le (1 - \|R_{-S,-S;G}\|)^{-1} \le (1 - \|\bar{R}_{-S,-S;G}\|)^{-1} \le (1 - \|\bar{R}_G\|)^{-1} = O(1). \quad \text{QED}

Thus, the conditional covariance in (45) consists of walks in the original graph G that do not pass through the nodes in S.

19. The bound in (41) also holds if H_{gamma;ij} is replaced with any of its supergraphs.
20. For any matrix A, we have max_{i,j} |A(i,j)| <= ||A||.
21. When two matrices A and B are such that |A(i,j)| >= |B(i,j)| for all i, j, we have ||Abar|| >= ||Bbar||.
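The identity Sigma(i,j|S) = J^{-1}_{-S,-S;G}(i,j) used in the proof above is easy to check numerically. A small sketch (our own illustration) compares it with the Schur complement of the covariance matrix (used as (48) in Appendix B below):

    import numpy as np

    rng = np.random.default_rng(2)
    p = 6
    R = np.triu(rng.uniform(-0.2, 0.2, (p, p)), k=1)
    R = R + R.T
    J = np.eye(p) - R                  # walk-summable precision matrix
    Sigma = np.linalg.inv(J)

    S = [3, 5]                         # conditioning set
    keep = [k for k in range(p) if k not in S]

    # Conditional covariance via the Schur complement of Sigma...
    A, B = np.ix_(keep, keep), np.ix_(keep, S)
    cond_schur = (Sigma[A]
                  - Sigma[B] @ np.linalg.inv(Sigma[np.ix_(S, S)]) @ Sigma[np.ix_(S, keep)])

    # ...equals the inverse of the precision submatrix with S deleted (Rue and Held).
    cond_sub = np.linalg.inv(J[A])
    print(np.max(np.abs(cond_schur - cond_sub)))   # ~0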
Appendix B. Graphs with the Local-Separation Property

B.1 Conditional Covariance between Non-Neighbors: Normalized Case

We now provide bounds on the conditional covariance for Gaussian graphical models Markov on a graph G ~ G(p; eta, gamma) satisfying the (eta, gamma)-local-separation property, as per Definition 2.

Lemma 12 (Conditional Covariance Between Non-neighbors) For a walk-summable Gaussian graphical model, the conditional covariance between non-neighbors i and j, conditioned on S_gamma, the gamma-local separator between i and j, satisfies

    \max_{j \notin \mathcal{N}(i)} \Sigma(i;j|S_\gamma) = O(\|\bar{R}_G\|^\gamma).   (47)

Proof: In this proof, we abbreviate S_gamma by S for notational convenience. The conditional covariance is given by the Schur complement, i.e., for any subset A such that A cap S is empty,

    \Sigma(A|S) = \Sigma(A,A) - \Sigma(A,S)\,\Sigma(S,S)^{-1}\,\Sigma(S,A).   (48)

We use the notation Sigma_G(A,A) to denote the submatrix of the covariance matrix Sigma_G when the underlying graph is G. As in Lemma 10, we may decompose Sigma_G as follows:

    \Sigma_G = \Sigma_{H_\gamma} + E_\gamma,

where H_gamma is the subgraph spanned by the gamma-hop neighborhood B_gamma(i), and E_gamma is the error matrix. Let F_gamma be the matrix such that

    \Sigma_G(S,S)^{-1} = \Sigma_{H_\gamma}(S,S)^{-1} + F_\gamma.

We have Sigma_{H_gamma}(i,j|S) = 0, where Sigma_{H_gamma}(i,j|S) denotes the conditional covariance under the model given by the subgraph H_gamma. This is due to the Markov property, since i and j are separated by S in the subgraph H_gamma. Thus, using (48), the conditional covariance on graph G can be bounded as

    \Sigma_G(i,j|S) = O\big(\max(\|E_\gamma\|, \|F_\gamma\|)\big).

By Lemma 10, we have ||E_gamma|| = O(||Rbar_G||^gamma). Using the Woodbury matrix-inversion identity, we also have ||F_gamma|| = O(||Rbar_G||^gamma). QED

B.2 Extension to General Precision Matrices: Unnormalized Case

We now extend the above analysis to general precision matrices J whose diagonal elements are not assumed to be unity. Write the precision matrix as J = D - E, where D is a diagonal matrix and E has zero diagonal elements. We thus have

    J_{\mathrm{norm}} := D^{-1/2} J D^{-1/2} = I - R,   (49)

where R is the partial correlation matrix. This also implies that J = D^{1/2} J_norm D^{1/2}. Thus, we have

    \Sigma = D^{-1/2}\,\Sigma_{\mathrm{norm}}\,D^{-1/2},   (50)

where Sigma_norm := J_norm^{-1} is the covariance matrix corresponding to the normalized model. When the model is walk-summable, i.e., ||Rbar|| <= alpha < 1, we have Sigma_norm = sum_{k>=0} R^k.
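The normalization (49)-(50) amounts to a symmetric diagonal rescaling, as the following short sketch (ours, not from the paper) confirms on a random model:

    import numpy as np

    rng = np.random.default_rng(3)
    p = 6
    R = np.triu(rng.uniform(-0.2, 0.2, (p, p)), k=1); R = R + R.T
    D = np.diag(rng.uniform(0.5, 2.0, p))            # arbitrary positive diagonal
    J = D ** 0.5 @ (np.eye(p) - R) @ D ** 0.5        # unnormalized precision matrix

    Dinv_half = np.diag(1.0 / np.sqrt(np.diag(J)))
    J_norm = Dinv_half @ J @ Dinv_half               # (49): unit diagonal, J_norm = I - R
    Sigma = np.linalg.inv(J)
    Sigma_norm = np.linalg.inv(J_norm)

    # (50): Sigma = D^{-1/2} Sigma_norm D^{-1/2}
    print(np.max(np.abs(Sigma - Dinv_half @ Sigma_norm @ Dinv_half)))  # ~0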
We now utilize the results derived in the previous sections for the normalized model (Lemma 10 and Lemma 12) to obtain bounds for general precision matrices.

Lemma 13 (Covariance Bounds for General Models) For any walk-summable Gaussian graphical model (alpha := ||Rbar_G|| < 1), we have the following results:

1. Covariance bounds: the covariance entries, upon limiting to a subgraph H_{gamma;ij}, satisfy, for any i, j in V,

    \max_{i,j} |\Sigma_G(i,j) - \Sigma_{H_{\gamma;ij}}(i,j)| \le \frac{2\alpha^{\gamma+1}}{D_{\min}(1-\alpha)} = O\Big(\frac{\alpha^\gamma}{D_{\min}}\Big),   (51)

where D_min := min_i D(i,i) = min_i J(i,i).

2. Conditional covariance between non-neighbors: the conditional covariance between non-neighbors i and j, conditioned on S_gamma, the gamma-local separator between i and j, satisfies

    \max_{j \notin \mathcal{N}(i)} \Sigma(i;j|S_\gamma) = O\Big(\frac{\alpha^\gamma}{D_{\min}}\Big),   (52)

where D_min := min_i D(i,i) = min_i J(i,i).

Proof: Using (50) and Lemma 10, we have (51). Similarly, it can be shown that, for any i, j in V and S subset of V \ {i,j},

    \Sigma(i,j|S) = D(i,i)^{-1/2}\,\Sigma_{\mathrm{norm}}(i,j|S)\,D(j,j)^{-1/2},

where Sigma_norm(i,j|S) is the conditional covariance corresponding to the model with the normalized precision matrix. From Lemma 12, we have (52). QED

B.3 Conditional Covariance between Neighbors: General Case

We provide a lower bound on the conditional covariance among neighbors for the graphs under consideration. Recall that J_min denotes the minimum (absolute) edge potential. Let

    K(i,j) := \|J(V \setminus \{i,j\}, \{i,j\})\|,

where J(V \ {i,j}, {i,j}) is a submatrix of the potential matrix J.

Lemma 14 (Conditional Covariance Between Neighbors) For an alpha-walk-summable Gaussian graphical model satisfying

    D_{\min}(1-\alpha)\,\min_{(i,j)\in G_p} \frac{|J(i,j)|}{K^2(i,j)} > 1 + \delta,   (53)

for some delta > 0 (not depending on p), where D_min := min_i J(i,i), we have

    |\Sigma_G(i,j|S)| = \Omega(J_{\min}),   (54)

for any (i,j) in G such that j in N(i) and any subset S subset of V with i, j not in S.

Proof: First note that, for attractive models,

    \Sigma_G(i,j|S) \overset{(a)}{\ge} \Sigma_{G_1}(i,j|S) \overset{(b)}{=} \frac{-J(i,j)}{J(i,i)J(j,j) - J(i,j)^2} = \Omega(J_{\min}),   (55)

where G_1 is the graph consisting only of the edge (i,j). Inequality (a) arises from the fact that in attractive models the weights of all walks are positive, and thus the weight of the walks on G_1 forms a lower bound for those on G (recall that the covariances are given by the sum-weight of walks on the graphs). Equality (b) is by direct matrix inversion of the model on G_1.

For general models, we need further analysis. Let A = {i,j} and B = V \ {S cup A}, for some S subset of V \ A. Let Sigma(A,A) denote the covariance matrix on the set A, and let Jtilde(A,A) := Sigma(A,A)^{-1} denote the corresponding marginal potential matrix. We have, for all S subset of V \ A,

    \tilde{J}(A,A) = J(A,A) - J(A,B)\,J(B,B)^{-1}\,J(B,A).

Recall that ||A||_inf denotes the maximum absolute value of the entries of a matrix A. Then

    \|J(A,B)J(B,B)^{-1}J(B,A)\|_\infty \overset{(a)}{\le} \|J(A,B)J(B,B)^{-1}J(B,A)\| \overset{(b)}{\le} \|J(A,B)\|^2 \|J(B,B)^{-1}\| = \frac{\|J(A,B)\|^2}{\lambda_{\min}(J(B,B))} \overset{(c)}{\le} \frac{K^2(i,j)}{D_{\min}(1-\alpha)},   (56)-(57)

where inequality (a) arises from the fact that the l_inf norm is bounded by the spectral norm, (b) arises from the sub-multiplicative property of norms, and (c) arises from the alpha-walk-summability of the model, which gives lambda_min(J(B,B)) >= D_min(1-alpha), together with the fact that K(i,j) >= ||J(A,B)||. Assuming (53), we have

    |\tilde{J}(i,j)| > J_{\min} - \frac{\|J(A,B)\|^2}{D_{\min}(1-\alpha)} = \Omega(J_{\min}).

Since

    \Sigma_G(i,j|S) = \frac{-\tilde{J}(i,j)}{\tilde{J}(i,i)\tilde{J}(j,j) - \tilde{J}(i,j)^2},

we have the result. QED

B.4 Analysis of Loopy Belief Propagation

Proof of Proposition 9: From Lemma 10 in Section A.3, for any alpha-walk-summable Gaussian graphical model we have, for all nodes i in V, conditioned on the event Gamma(i; gamma, G),

    |\Sigma_G(i,i) - \Sigma_{\mathrm{LBP}}(i,i)| = O(\|\bar{R}_G\|^\gamma).   (58)

This is because, conditioned on Gamma(i; gamma, G), it can be shown that the series expansions based on the walk-sums corresponding to the variances Sigma_{H_{gamma;ij}}(i,i) and Sigma_LBP(i,i) are identical up to walks of length gamma, and the effect of walks beyond length gamma can be bounded as above.
Moreover, for a sequence of alpha-walk-summable models, we have Sigma(i,i) <= M for all i in V, for some constant M, and similarly Sigma_LBP(i,i) <= M' for some constant M', since the latter is obtained from the set of self-avoiding walks in G. We thus have

    E[|\Sigma_G(i,i) - \Sigma_{\mathrm{LBP}}(i,i)|] \le O(\|\bar{R}_G\|^\gamma) + P[\Gamma^c(i;\gamma)] = o(1),

where the expectation E is over the ensemble G(p). By Markov's inequality,^22 we have the result. QED

22. By Markov's inequality, for a non-negative random variable X, we have P[X > delta] <= E[X]/delta. Choosing delta = omega(E[X]) yields the result.

Appendix C. Sample-Based Analysis

C.1 Concentration of Empirical Quantities

For our sample complexity analysis, we recall the concentration result of Ravikumar et al. (2008, Lemma 1) for sub-Gaussian matrices and specialize it to Gaussian matrices.

Lemma 15 (Concentration of Empirical Covariances) For any p-dimensional Gaussian random vector X = [X_1, ..., X_p], the empirical covariance obtained from n samples satisfies

    P\big[|\widehat{\Sigma}(i,j) - \Sigma(i,j)| > \epsilon\big] \le 4\exp\Big(-\frac{n\epsilon^2}{3200\,M^2}\Big),   (59)

for all epsilon in (0, 40M), where M := max_i Sigma(i,i).

This translates into bounds for the empirical conditional covariance.

Corollary 16 (Concentration of Empirical Conditional Covariance) For a walk-summable p-dimensional Gaussian random vector X = [X_1, ..., X_p], we have

    P\Big[\max_{i\neq j,\ S\subset V,\ |S|\le\eta} |\widehat{\Sigma}(i,j|S) - \Sigma(i;j|S)| > \epsilon\Big] \le 4\,p^{\eta+2}\exp\Big(-\frac{n\epsilon^2}{K}\Big),   (60)

where K in (0, infinity) is a constant which is bounded when ||Sigma||_inf is bounded, for all epsilon in (0, 40M) with M := max_i Sigma(i,i), and n >= eta.

Proof: For given i, j in V and S subset of V with |S| <= eta <= n, using (48),

    P\big[|\widehat{\Sigma}(i,j|S) - \Sigma(i;j|S)| > \epsilon\big] \le P\Big[\{|\widehat{\Sigma}(i,j) - \Sigma(i;j)| > K'\epsilon\} \cup \bigcup_{k\in S}\{|\widehat{\Sigma}(i,k) - \Sigma(i;k)| > K'\epsilon\}\Big],

where K' is a constant which is bounded when ||Sigma||_inf is bounded. Using Lemma 15 and the union bound, we have the result. QED

C.2 Proof of Theorem 4

We are now ready to prove Theorem 4. We analyze the error events for the conditional covariance thresholding test CCT. For any (i,j) not in G_p, define the event

    F_1(i,j; \{x^n\}, G_p) := \{|\widehat{\Sigma}(i,j|S)| > \xi_{n,p}\},   (61)

where xi_{n,p} is the threshold in (15) and S is the gamma-local separator between i and j (since the minimum in (1) is achieved by the gamma-local separator). Similarly, for any edge (i,j) in G_p, define the event

    F_2(i,j; \{x^n\}, G_p) := \{\exists S \subset V : |S| \le \eta,\ |\widehat{\Sigma}(i,j|S)| < \xi_{n,p}\}.   (62)

The probability of error resulting from CCT can thus be bounded by the two types of errors:

    P[\mathrm{CCT}(\{x^n\}; \xi_{n,p}) \neq G_p] \le P\Big[\bigcup_{(i,j)\in G_p} F_2(i,j;\{x^n\},G_p)\Big] + P\Big[\bigcup_{(i,j)\notin G_p} F_1(i,j;\{x^n\},G_p)\Big].   (63)

For the first term, applying the union bound and using the result (60) of Corollary 16,

    P\Big[\bigcup_{(i,j)\in G_p} F_2(i,j;\{x^n\},G_p)\Big] = O\Big(p^{\eta+2}\exp\Big(-\frac{n(C_{\min}(p) - \xi_{n,p})^2}{K_2}\Big)\Big),   (64)

where K_2 in (0, infinity) is a constant and

    C_{\min}(p) := \inf_{\substack{(i,j)\in G_p \\ S\subset V,\ i,j\notin S,\ |S|\le\eta}} |\Sigma(i,j|S)| = \Omega(J_{\min}), \quad \forall p \in \mathbb{N},   (65)

from (54). Since xi_{n,p} = o(J_min), (64) is o(1) when n > L log p / J^2_min for sufficiently large L (depending on eta and M). For the second term in (63),

    P\Big[\bigcup_{(i,j)\notin G_p} F_1(i,j;\{x^n\},G_p)\Big] = O\Big(p^{\eta+2}\exp\Big(-\frac{n(\xi_{n,p} - C_{\max}(p))^2}{K_2}\Big)\Big),   (66)

where

    C_{\max}(p) := \max_{(i,j)\notin G_p} |\Sigma(i,j|S)| = O\Big(\frac{\alpha^\gamma}{D_{\min}}\Big),   (67)

from (52). For the choice of xi_{n,p} in (15), (66) is o(1), and this completes the proof of Theorem 4.
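To make the test analyzed above concrete, here is a minimal sketch of CCT as described by the events (61)-(62): declare an edge (i,j) exactly when the empirical conditional covariance stays above the threshold for every candidate separator of size at most eta. The code is our own illustration; the function names and the exhaustive search over separators are simplifications, not the paper's implementation.

    import numpy as np
    from itertools import combinations

    def cond_cov(Sigma, i, j, S):
        # Schur complement (48): Sigma(i, j | S).
        if not S:
            return Sigma[i, j]
        S = list(S)
        a = Sigma[np.ix_([i], S)]
        b = Sigma[np.ix_(S, [j])]
        return Sigma[i, j] - (a @ np.linalg.inv(Sigma[np.ix_(S, S)]) @ b)[0, 0]

    def cct(samples, xi, eta):
        """Conditional covariance thresholding test (sketch).

        samples: (n, p) data matrix; xi: threshold xi_{n,p}; eta: max separator size.
        """
        n, p = samples.shape
        S_hat = np.cov(samples, rowvar=False)
        edges = set()
        for i, j in combinations(range(p), 2):
            rest = [k for k in range(p) if k not in (i, j)]
            # Minimum empirical conditional covariance over separators |S| <= eta.
            min_cc = min(abs(cond_cov(S_hat, i, j, S))
                         for size in range(eta + 1)
                         for S in combinations(rest, size))
            if min_cc > xi:
                edges.add((i, j))
        return edges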
C.3 Conditional Mutual Information Thresholding Test

We now analyze the performance of the conditional mutual information thresholding test. We first note bounds on the conditional mutual information.

Proposition 17 (Conditional Mutual Information) Under the assumptions (A1)-(A5), the conditional mutual information among non-neighbors, conditioned on the gamma-local separator, satisfies

    \max_{(i,j)\notin G} I(X_i; X_j \mid X_{S_\gamma}) = O(\alpha^{2\gamma}),   (68)

and the conditional mutual information among neighbors satisfies

    \min_{\substack{(i,j)\in G \\ S\subset V\setminus\{i,j\}}} I(X_i; X_j \mid X_S) = \Omega(J^2_{\min}).   (69)

Proof: The conditional mutual information for Gaussian variables is given by

    I(X_i; X_j \mid X_S) = -\tfrac{1}{2}\log\big(1 - \rho^2(i,j|S)\big),   (70)

where rho(i,j|S) is the conditional correlation coefficient, given by

    \rho(i,j|S) := \frac{\Sigma(i,j|S)}{\sqrt{\Sigma(i,i|S)\,\Sigma(j,j|S)}}.

From (46) in Proposition 11, we have Sigma(i,i|S) = O(1), and thus the result holds. QED

We now note the concentration bounds for the empirical mutual information.

Lemma 18 (Concentration of Empirical Mutual Information) For any p-dimensional Gaussian random vector X = [X_1, ..., X_p], the empirical mutual information obtained from n samples satisfies

    P\big(|\widehat{I}(X_i;X_j) - I(X_i;X_j)| > \epsilon\big) \le 24\exp\Big(-\frac{nM\epsilon^2}{204800\,L^2}\Big),   (71)

for some constant L which is finite when rho_max := max_{i != j} |rho(i,j)| < 1, for all epsilon < rho_max, and for M := max_i Sigma(i,i).

Proof: The result on empirical covariances can be found in (Ravikumar et al., 2008, Lemma 1). The result in (71) will be shown through a sequence of transformations. First, we bound P(|rho_hat(i,j) - rho(i,j)| > epsilon). Consider

    P\big(|\widehat{\rho}(i,j) - \rho(i,j)| > \epsilon\big)
    = P\Big(\Big|\frac{\widehat{\Sigma}(i,j)}{(\widehat{\Sigma}(i,i)\widehat{\Sigma}(j,j))^{1/2}} - \frac{\Sigma(i,j)}{(\Sigma(i,i)\Sigma(j,j))^{1/2}}\Big| > \epsilon\Big)
    = P\Big(\Big|\frac{\widehat{\Sigma}(i,j)}{\Sigma(i,j)}\Big(\frac{\Sigma(i,i)}{\widehat{\Sigma}(i,i)}\cdot\frac{\Sigma(j,j)}{\widehat{\Sigma}(j,j)}\Big)^{1/2} - 1\Big| > \frac{\epsilon}{|\rho(i,j)|}\Big)
    \overset{(a)}{\le} P\Big(\frac{\widehat{\Sigma}(i,j)}{\Sigma(i,j)} > \Big(1 + \frac{\epsilon}{|\rho(i,j)|}\Big)^{1/3}\Big) + P\Big(\frac{\widehat{\Sigma}(i,j)}{\Sigma(i,j)} < \Big(1 - \frac{\epsilon}{|\rho(i,j)|}\Big)^{1/3}\Big) + \cdots + P\Big(\frac{\Sigma(j,j)}{\widehat{\Sigma}(j,j)} > \Big(1 + \frac{\epsilon}{|\rho(i,j)|}\Big)^{2/3}\Big) + P\Big(\frac{\Sigma(j,j)}{\widehat{\Sigma}(j,j)} < \Big(1 - \frac{\epsilon}{|\rho(i,j)|}\Big)^{2/3}\Big)
    \overset{(b)}{\le} P\Big(\frac{\widehat{\Sigma}(i,j)}{\Sigma(i,j)} > 1 + \frac{\epsilon}{8|\rho(i,j)|}\Big) + P\Big(\frac{\widehat{\Sigma}(i,j)}{\Sigma(i,j)} < 1 - \frac{\epsilon}{8|\rho(i,j)|}\Big) + \cdots + P\Big(\frac{\widehat{\Sigma}(j,j)}{\Sigma(j,j)} > 1 + \frac{\epsilon}{3|\rho(i,j)|}\Big) + P\Big(\frac{\widehat{\Sigma}(j,j)}{\Sigma(j,j)} < 1 - \frac{\epsilon}{3|\rho(i,j)|}\Big)
    \overset{(c)}{\le} 24\exp\Big(-\frac{nM\epsilon^2}{204800\,|\rho(i,j)|^2}\Big) \overset{(d)}{\le} 24\exp\Big(-\frac{nM\epsilon^2}{204800}\Big),

where in (a) we used the fact that P(ABC > 1 + delta) <= P(A > (1+delta)^{1/3} or B > (1+delta)^{1/3} or C > (1+delta)^{1/3}) together with the union bound; in (b) we used the facts that (1+delta)^3 <= 1 + 8 delta and (1+delta)^{-2/3} <= 1 - delta/3 for delta = epsilon/|rho(i,j)| < 1; in (c) we used the result in (59); and in (d) we used the bound |rho(i,j)| < 1.
Now, define the bijective function I(|rho|) := -(1/2) log(1 - rho^2). We claim that there exists a constant L in (0, infinity), depending only on rho_max < 1, such that

    |I(x) - I(y)| \le L|x - y|,   (72)

i.e., the function I : [0, rho_max] -> R_+ is L = L(rho_max)-Lipschitz. This is because the slope of the function I is bounded on the interval [0, rho_max]. Thus, we have the inclusion

    \{|\widehat{I}(X_i;X_j) - I(X_i;X_j)| > \epsilon\} \subset \{|\widehat{\rho}(i,j) - \rho(i,j)| > \epsilon/L\},   (73)

since if |I_hat(X_i;X_j) - I(X_i;X_j)| > epsilon, then L|rho_hat(i,j) - rho(i,j)| > epsilon by (72). By monotonicity of measure and (73), we have the desired result. QED

We can now obtain the desired result on the concentration of the empirical conditional mutual information.

Lemma 19 (Concentration of Empirical Conditional Mutual Information) For a walk-summable p-dimensional Gaussian random vector X = [X_1, ..., X_p], we have

    P\Big[\max_{\substack{i\neq j,\ S\subset V\setminus\{i,j\},\ |S|\le\eta}} |\widehat{I}(X_i;X_j|X_S) - I(X_i;X_j|X_S)| > \epsilon\Big] \le 24\,p^{\eta+2}\exp\Big(-\frac{nM\epsilon^2}{204800\,L^2}\Big),   (74)

for constants M, L in (0, infinity) and all epsilon < rho_max, where

    \rho_{\max} := \max_{\substack{i\neq j,\ S\subset V\setminus\{i,j\},\ |S|\le\eta}} |\rho(i,j|S)|.

Proof: Since the model is walk-summable, we have max_{i,S} Sigma(i,i|S) = O(1), and thus the constant M is bounded. Similarly, due to strict positive definiteness, we have rho_max < 1 even as p -> infinity, and thus the constant L is also finite. The result then follows from the union bound. QED

The sample complexity for structural consistency of CMIT follows along the same lines as the analysis for CCT.

Appendix D. Necessary Conditions for Model Selection

D.1 Necessary Conditions for Exact Recovery

We provide the proof of Theorem 6 in this section. We collect four auxiliary lemmata whose proofs (together with the proof of Lemma 8) are given at the end of the section. For information-theoretic notation, the reader is referred to Cover and Thomas (2006).

Lemma 20 (Upper Bound on Differential Entropy of Mixture) Let alpha < 1. Suppose that, asymptotically almost surely, each precision matrix J_G = I - R_G satisfies (12), i.e., that ||Rbar_G|| <= alpha for a.e. G in G(p). Then, for the Gaussian model, we have

    h(X^n) \le \frac{pn}{2}\log_2\frac{2\pi e}{1-\alpha},   (75)

where recall that X^n | G ~ prod_{i=1}^n f(x_i | G).

For the sake of convenience, we define the random variable

    W := \begin{cases} 1, & G \in T^{(p)}_\epsilon, \\ 0, & G \notin T^{(p)}_\epsilon. \end{cases}   (76)

The random variable W indicates whether G is in T^{(p)}_epsilon.

Lemma 21 (Lower Bound on Conditional Differential Entropy) Suppose that each precision matrix J_G has unit diagonal. Then,

    h(X^n \mid G, W) \ge -\frac{pn}{2}\log_2(2\pi e).   (77)

Lemma 22 (Conditional Fano Inequality) In the above notation, we have

    \frac{H(G \mid X^n, G \in T^{(p)}_\epsilon) - 1}{\log_2(|T^{(p)}_\epsilon| - 1)} \le P\big(\widehat{G}(X^n) \neq G \mid G \in T^{(p)}_\epsilon\big).   (78)

Lemma 23 (Exponential Decay of the Probability of the Atypical Set) Define the rate function

    K(c,\epsilon) := \frac{c}{2}\big[(1+\epsilon)\ln(1+\epsilon) - \epsilon\big].

The probability of the epsilon-atypical set decays as

    P\big((T^{(p)}_\epsilon)^c\big) = P\big(G \notin T^{(p)}_\epsilon\big) \le 2\exp(-pK(c,\epsilon))   (79)

for all p >= 1.

Note the non-asymptotic nature of the bound in (79). The rate function K(c,epsilon) satisfies lim_{epsilon -> 0} K(c,epsilon)/epsilon^2 = c/4. We prove Theorem 6 using these lemmata.
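Before turning to the proof, a quick numerical check of the rate function in Lemma 23 (our own illustration, not from the paper) confirms its positivity and the limit K(c,epsilon)/epsilon^2 -> c/4:

    import numpy as np

    def K(c, eps):
        # Rate function of Lemma 23: K(c, eps) = (c/2) * ((1+eps)*ln(1+eps) - eps).
        return 0.5 * c * ((1 + eps) * np.log1p(eps) - eps)

    c = 3.0
    for eps in [1.0, 0.1, 0.01, 0.001]:
        print(eps, K(c, eps), K(c, eps) / eps**2)  # ratio tends to c/4 = 0.75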
Proof: Consider the following sequence of bounds:

    \frac{pn}{2}\log_2\frac{2\pi e}{1-\alpha} \overset{(a)}{\ge} h(X^n) \overset{(b)}{\ge} h(X^n \mid W)   (80)
    = I(X^n; G \mid W) + h(X^n \mid G, W)
    \overset{(c)}{\ge} I(X^n; G \mid W) - \frac{pn}{2}\log_2(2\pi e)
    = H(G \mid W) - H(G \mid X^n, W) - \frac{pn}{2}\log_2(2\pi e),   (81)

where (a) follows from Lemma 20, (b) holds because conditioning does not increase differential entropy, and (c) follows from Lemma 21. We will lower bound the first term in (81) and upper bound the second term in (81).

Now consider the first term in (81):

    H(G \mid W) = H(G \mid W=1)P(W=1) + H(G \mid W=0)P(W=0)
    \overset{(a)}{\ge} H(G \mid W=1)P(W=1)
    \overset{(b)}{\ge} H(G \mid G \in T^{(p)}_\epsilon)(1-\epsilon)
    \overset{(c)}{\ge} (1-\epsilon)\binom{p}{2}H_b\Big(\frac{c}{p}\Big),   (82)

where (a) holds because the entropy H(G | W=0) and the probability P(W=0) are both non-negative. Inequality (b) follows, for all p sufficiently large, from the definition of W together with Lemma 8 part 1. Statement (c) comes from the fact that

    H(G \mid G \in T^{(p)}_\epsilon) = -\sum_{g\in T^{(p)}_\epsilon} P(g \mid g\in T^{(p)}_\epsilon)\log_2 P(g \mid g\in T^{(p)}_\epsilon) \ge -\sum_{g\in T^{(p)}_\epsilon} P(g \mid g\in T^{(p)}_\epsilon)\Big(-\binom{p}{2}H_b\Big(\frac{c}{p}\Big)\Big) = \binom{p}{2}H_b\Big(\frac{c}{p}\Big),

where the inequality uses the upper bound on P(g | g in T^{(p)}_epsilon) from Lemma 8 part 2. We are now done bounding the first term in the difference in (81).

Now we bound the second term in (81). First, we derive a bound on H(G | X^n, W = 1). Consider

    P^{(p)}_e := P(\widehat{G}(X^n) \neq G)
    \overset{(a)}{=} P(\widehat{G}(X^n)\neq G \mid W=1)P(W=1) + P(\widehat{G}(X^n)\neq G \mid W=0)P(W=0)
    \ge P(\widehat{G}(X^n)\neq G \mid W=1)P(W=1)
    \overset{(b)}{\ge} P(\widehat{G}(X^n)\neq G \mid G\in T^{(p)}_\epsilon)\cdot\frac{1}{1+\epsilon}
    \overset{(c)}{\ge} \frac{H(G \mid X^n, G\in T^{(p)}_\epsilon) - 1}{\log_2|T^{(p)}_\epsilon|}\cdot\frac{1}{1+\epsilon},   (83)

where (a) is by the law of total probability, (b) holds for all p sufficiently large by Lemma 8 part 1, and (c) is due to the conditional version of Fano's inequality (Lemma 22). Then, from (83), we have

    H(G \mid X^n, W=1) \le P^{(p)}_e(1+\epsilon)\log_2|T^{(p)}_\epsilon| + 1 \le P^{(p)}_e(1+\epsilon)\binom{p}{2}H_b\Big(\frac{c}{p}\Big) + 1.   (84)

Recall the rate function K(c,epsilon) := (c/2)[(1+epsilon)ln(1+epsilon) - epsilon]. Note that this function is positive whenever c, epsilon > 0; in fact, it is monotonically increasing in both parameters. We now utilize (84) to bound H(G | X^n, W):

    H(G \mid X^n, W) = H(G \mid X^n, W=1)P(W=1) + H(G \mid X^n, W=0)P(W=0)
    \overset{(a)}{\le} H(G \mid X^n, W=1) + H(G \mid X^n, W=0)P(W=0)
    \overset{(b)}{\le} H(G \mid X^n, W=1) + H(G \mid X^n, W=0)\big(2e^{-pK(c,\epsilon)}\big)
    \overset{(c)}{\le} H(G \mid X^n, W=1) + \binom{p}{2}\big(2e^{-pK(c,\epsilon)}\big)
    \overset{(d)}{\le} P^{(p)}_e(1+\epsilon)\binom{p}{2}H_b\Big(\frac{c}{p}\Big) + 1 + 2\binom{p}{2}e^{-pK(c,\epsilon)},   (85)

where (a) holds because we upper bounded P(W=1) by unity, (b) follows from Lemma 23, (c) follows by upper bounding the conditional entropy by binom(p,2), and (d) follows from (84).

Substituting (82) and (85) back into (81) yields

    \frac{pn}{2}\log_2\frac{(2\pi e)^2}{1-\alpha} \ge (1-\epsilon)\binom{p}{2}H_b\Big(\frac{c}{p}\Big) - P^{(p)}_e(1+\epsilon)\binom{p}{2}H_b\Big(\frac{c}{p}\Big) - 1 - 2\binom{p}{2}e^{-pK(c,\epsilon)} = \binom{p}{2}H_b\Big(\frac{c}{p}\Big)\Big[(1-\epsilon) - P^{(p)}_e(1+\epsilon)\Big] - \Theta\big(p^2 e^{-pK(c,\epsilon)}\big),

which implies that

    n \ge \frac{\binom{p}{2}H_b\big(\frac{c}{p}\big)\Big[(1-\epsilon) - P^{(p)}_e(1+\epsilon)\Big] - \Theta\big(p^2 e^{-pK(c,\epsilon)}\big)}{\frac{p}{2}\log_2\frac{(2\pi e)^2}{1-\alpha}}.

Note that the Theta(.) term vanishes as p -> infinity, since the rate function K(c,epsilon) is positive. If we impose that P^{(p)}_e -> 0 as p -> infinity, then n must satisfy (23) by the arbitrariness of epsilon > 0. This completes the proof of Theorem 6. QED

D.2 Proof of Lemma 8

Proof: Part 1 follows directly from the law of large numbers.
Part 2 follows from the fact that the binomial pmf is maximized at its mean. Hence, for G in T^{(p)}_epsilon, we have

    P(G) \le \Big(\frac{c}{p}\Big)^{cp/2}\Big(1-\frac{c}{p}\Big)^{\binom{p}{2}-cp/2}.

We arrive at the upper bound after some rudimentary algebra. The lower bound can be proved by observing that, for G in T^{(p)}_epsilon,

    P(G) \ge \Big(\frac{c}{p}\Big)^{cp(1+\epsilon)/2}\Big(1-\frac{c}{p}\Big)^{\binom{p}{2}-cp(1+\epsilon)/2}
    = \exp_2\Big(\binom{p}{2}\Big[(1+\epsilon)\frac{c}{p}\log_2\frac{c}{p} + \Big(1-\frac{c(1+\epsilon)}{p}\Big)\log_2\Big(1-\frac{c}{p}\Big)\Big]\Big)
    \ge \exp_2\Big(\binom{p}{2}(1+\epsilon)\Big[\frac{c}{p}\log_2\frac{c}{p} + \Big(1-\frac{c}{p}\Big)\log_2\Big(1-\frac{c}{p}\Big)\Big]\Big)
    = \exp_2\Big(-\binom{p}{2}H_b\Big(\frac{c}{p}\Big)(1+\epsilon)\Big).

The result in Part 2 follows immediately by appealing to the symmetry of the binomial pmf about its mean.

Part 3 follows from the following chain of inequalities:

    1 = \sum_{G\in\mathcal{G}_p} P(G) \ge \sum_{G\in T^{(p)}_\epsilon} P(G) \ge \sum_{G\in T^{(p)}_\epsilon} \exp_2\Big(-\binom{p}{2}H_b\Big(\frac{c}{p}\Big)(1+\epsilon)\Big) = |T^{(p)}_\epsilon|\,\exp_2\Big(-\binom{p}{2}H_b\Big(\frac{c}{p}\Big)(1+\epsilon)\Big).

This completes the proof of the upper bound on |T^{(p)}_epsilon|. The lower bound follows by noting that, for sufficiently large p, P(T^{(p)}_epsilon) >= 1 - epsilon (by Lemma 8 part 1). Thus,

    1-\epsilon \le \sum_{G\in T^{(p)}_\epsilon} P(G) \le \sum_{G\in T^{(p)}_\epsilon} \exp_2\Big(-\binom{p}{2}H_b\Big(\frac{c}{p}\Big)\Big) = |T^{(p)}_\epsilon|\,\exp_2\Big(-\binom{p}{2}H_b\Big(\frac{c}{p}\Big)\Big).

This completes the proof. QED

D.3 Proof of Lemma 20

Proof: Note that the distribution of X (with G marginalized out) is a Gaussian mixture model given by sum_{G in G_p} P(G) N(0, J_G^{-1}). As such, the covariance matrix of X is given by

    \Sigma_X = \sum_{G\in\mathcal{G}_p} P(G)\,J_G^{-1}.   (86)

This is not immediately obvious, but it follows from the zero-mean nature of each Gaussian probability density function N(0, J_G^{-1}). Using (86), we have the following chain of inequalities:

    h(X^n) \le n\,h(X)
    \overset{(a)}{\le} \frac{n}{2}\log_2\big((2\pi e)^p\det(\Sigma_X)\big)
    = \frac{n}{2}\big[p\log_2(2\pi e) + \log_2\det(\Sigma_X)\big]
    \overset{(b)}{\le} \frac{n}{2}\big[p\log_2(2\pi e) + p\log_2\lambda_{\max}(\Sigma_X)\big]
    = \frac{n}{2}\Big[p\log_2(2\pi e) + p\log_2\lambda_{\max}\Big(\sum_{G\in\mathcal{G}_p} P(G)J_G^{-1}\Big)\Big]
    \overset{(c)}{\le} \frac{n}{2}\Big[p\log_2(2\pi e) + p\log_2\sum_{G\in\mathcal{G}_p} P(G)\,\lambda_{\max}\big(J_G^{-1}\big)\Big]
    = \frac{n}{2}\Big[p\log_2(2\pi e) + p\log_2\sum_{G\in\mathcal{G}_p} P(G)\,\frac{1}{\lambda_{\min}(J_G)}\Big]
    \overset{(d)}{\le} \frac{n}{2}\Big[p\log_2(2\pi e) + p\log_2\sum_{G\in\mathcal{G}_p} P(G)\,\frac{1}{1-\alpha}\Big]
    = \frac{pn}{2}\log_2\frac{2\pi e}{1-\alpha},

where (a) uses the maximum entropy principle (Cover and Thomas, 2006, Chapter 13), i.e., that the Gaussian maximizes the entropy subject to an average power constraint; (b) uses the fact that the determinant of Sigma_X is upper bounded by lambda_max(Sigma_X)^p; (c) uses the convexity of lambda_max(.) (it equals the operator norm ||.||_2 over the set of symmetric matrices); and (d) uses the fact that alpha >= ||Rbar_G||_2 >= ||R_G||_2 = ||I - J_G||_2 >= lambda_max(I - J_G) = 1 - lambda_min(J_G) a.a.s. This completes the proof. QED

D.4 Proof of Lemma 21

Proof: First, we lower bound h(X^n | G, W = 1) as follows:

    h(X^n \mid G) = \sum_{g\in\mathcal{G}_p} P(g)\,h(X^n \mid G=g)
    \overset{(a)}{=} n\sum_{g\in\mathcal{G}_p} P(g)\,h(X \mid G=g)
    \overset{(b)}{=} \frac{n}{2}\sum_{g\in\mathcal{G}_p} P(g)\log_2\big[(2\pi e)^p\det(J_g^{-1})\big]
    = -\frac{n}{2}\sum_{g\in\mathcal{G}_p} P(g)\log_2\big[(2\pi e)^{-p}\det(J_g)\big]
    \overset{(c)}{\ge} -\frac{n}{2}\sum_{g\in\mathcal{G}_p} P(g)\log_2\big[(2\pi e)^{-p}\big] \ge -\frac{pn}{2}\log_2(2\pi e),

where (a) holds because the samples in X^n are conditionally independent given G = g, (b) is by the Gaussian assumption, and (c) is by Hadamard's inequality,

    \det(J_g) \le \prod_{i=1}^{p}[J_g]_{ii} = 1,   (87)

together with the assumption that each diagonal element of each precision matrix J_g = I - R_g equals 1 a.a.s. This proves the claim. QED

D.5 Proof of Lemma 22

Proof: Define the "error" random variable

    E := \begin{cases} 1, & \widehat{G}(X^n) \neq G, \\ 0, & \widehat{G}(X^n) = G. \end{cases}
Now consider

    H(E, G \mid X^n, W=1) = H(E \mid X^n, W=1) + H(G \mid E, X^n, W=1)   (88)
    = H(G \mid X^n, W=1) + H(E \mid G, X^n, W=1).   (89)

The first term in (88) can be bounded above by 1, since the alphabet of the random variable E is of size 2. Since H(G | E=0, X^n, W=1) = 0, the second term in (88) can be bounded from above as

    H(G \mid E, X^n, W=1) = H(G \mid E=0, X^n, W=1)P(E=0 \mid W=1) + H(G \mid E=1, X^n, W=1)P(E=1 \mid W=1)
    \le P\big(\widehat{G}(X^n)\neq G \mid G\in T^{(p)}_\epsilon\big)\log_2\big(|T^{(p)}_\epsilon| - 1\big).

The second term in (89) is 0. Hence, we have the desired conclusion. QED

D.6 Proof of Lemma 23

Proof: The proof uses standard Chernoff bounding techniques, but the scaling in p is somewhat different from the usual Chernoff (Cramér) upper bound. For simplicity, we write M := binom(p,2). Let Y_i, i = 1, ..., M, be independent Bernoulli random variables such that P(Y_i = 1) = c/p. Then the probability in question can be bounded as

    P\big(G \notin T^{(p)}_\epsilon\big) = P\Big(\Big|\frac{1}{cp}\sum_{i=1}^{M} Y_i - \frac{1}{2}\Big| > \frac{\epsilon}{2}\Big)
    \overset{(a)}{\le} 2\,P\Big(\frac{1}{cp}\sum_{i=1}^{M} Y_i > \frac{1+\epsilon}{2}\Big)
    \overset{(b)}{\le} 2\,E\Big[\exp\Big(t\sum_{i=1}^{M} Y_i - \frac{ptc}{2}(1+\epsilon)\Big)\Big]   (90)
    = 2\exp\Big(-\frac{ptc}{2}(1+\epsilon)\Big)\prod_{i=1}^{M} E[\exp(tY_i)],   (91)

where (a) follows from the union bound and (b) follows from an application of Markov's inequality with t >= 0 in (90). Now, the moment generating function of a Bernoulli random variable with success probability q is q e^t + (1 - q). Using this fact, we can further upper bound (91) as follows:

    P\big(G \notin T^{(p)}_\epsilon\big) \le 2\exp\Big(-\frac{ptc}{2}(1+\epsilon) + M\ln\Big(\frac{c}{p}e^t + \Big(1-\frac{c}{p}\Big)\Big)\Big)
    \overset{(a)}{\le} 2\exp\Big(-\frac{ptc}{2}(1+\epsilon) + \frac{p(p-1)}{2}\cdot\frac{c}{p}\,(e^t-1)\Big)
    \le 2\exp\Big(-p\Big[\frac{tc}{2}(1+\epsilon) - \frac{c}{2}(e^t-1)\Big]\Big),   (92)

where in (a) we used the fact that ln(1+z) <= z. Now, we differentiate the exponent in square brackets with respect to t >= 0 to find the tightest bound. The optimal parameter is t* = ln(1+epsilon); substituting it into the exponent recovers exactly K(c,epsilon) = (c/2)[(1+epsilon)ln(1+epsilon) - epsilon]. Substituting this back into (92) completes the proof. QED

D.7 Necessary Conditions for Recovery with Distortion

We now provide the proof of Corollary 7. It follows from the following generalization of the conditional Fano inequality presented in Lemma 22; this is a modified version of an analogous theorem in Kim et al. (2008).

Lemma 24 (Conditional Fano Inequality (Generalization)) In the above notation, we have

    \frac{H(G \mid X^n, G \in T^{(p)}_\epsilon) - 1 - \log_2 L}{\log_2(|T^{(p)}_\epsilon| - 1)} \le P\big(d(G, \widehat{G}(X^n)) > D \mid G \in T^{(p)}_\epsilon\big),   (93)

where L = 2^{\binom{p}{2}H_b(\beta)} and beta is defined in (26).

We provide only a proof sketch of Lemma 24, since it is similar to that of Lemma 22.

Proof: The key to establishing (93) is to upper bound the cardinality of the set {G in G_p : d(G, G') <= D}, which is isomorphic to {E in E_p : |E triangle E'| <= D}, where E_p is the set of all edge sets (on p nodes). For this purpose, we order the node pairs of a labelled undirected graph lexicographically. Now, we map each edge set E to a length-binom(p,2) bit string s(E) in {0,1}^{binom(p,2)}. The characters of the string s(E) indicate whether or not an edge is present between the corresponding node pairs. Define d_H(s, s') to be the Hamming distance between the strings s and s'. Then, note that

    |E \,\triangle\, E'| = d_H\big(s(E), s(E')\big) = d_H\big(s(E)\oplus s(E'), \mathbf{0}\big),   (94)

where oplus denotes addition in F_2 and 0 denotes the all-zeros string. The relation in (94) means that the cardinality of the set {E in E_p : |E triangle E'| <= D} is equal to the number of strings of Hamming weight at most D. With this realization, it is easy to see that

    \Big|\Big\{s \in \{0,1\}^{\binom{p}{2}} : d_H(s,\mathbf{0}) \le D\Big\}\Big| = \sum_{k=0}^{D}\binom{\binom{p}{2}}{k} \le 2^{\binom{p}{2}H_b\big(D/\binom{p}{2}\big)} = L.

By using the same steps as in the proof of Lemma 22 (or Fano's inequality for list decoding), we arrive at the desired conclusion. QED
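The counting step in the proof sketch above is easy to verify numerically; the following snippet (ours, not from the paper) compares the exact Hamming-ball size with the entropy bound for small parameters:

    import math

    def hb(x):
        # Binary entropy in bits.
        return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

    M, D = 45, 9   # M = p(p-1)/2 node pairs for p = 10; distortion D <= M/2
    ball = sum(math.comb(M, k) for k in range(D + 1))
    bound = 2 ** (M * hb(D / M))
    print(ball, bound, ball <= bound)   # True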
References

P. Abbeel, D. Koller, and A. Y. Ng. Learning factor graphs in polynomial time and sample complexity. The Journal of Machine Learning Research, 7:1743-1788, 2006.

A. Anandkumar, V. Y. F. Tan, and A. S. Willsky. High-dimensional structure learning of Ising models on sparse random graphs. Preprint, available on arXiv:1011.0129, Nov. 2010.

A. Anandkumar, A. Hassidim, and J. Kelner. Topology discovery of sparse random graphs with few participants. arXiv:1102.5063, Feb. 2011a.

A. Anandkumar, V. Y. F. Tan, and A. S. Willsky. High-dimensional structure learning of Ising models: tractable graph families. Preprint, available on arXiv:1107.1736, June 2011b.

M. Bayati, D. Shah, and M. Sharma. Maximum weight matching via max-product belief propagation. In Proc. IEEE Intl. Symposium on Information Theory (ISIT), 2005.

M. Bayati, A. Braunstein, and R. Zecchina. A rigorous analysis of the cavity equations for the minimum spanning tree. Journal of Mathematical Physics, 49:125206, 2008a.

M. Bayati, D. Shah, and M. Sharma. Max-product for maximum weight matching: convergence, correctness, and LP duality. IEEE Transactions on Information Theory, 54(3):1241-1251, 2008b.

P. J. Bickel and E. Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577-2604, 2008.

A. Bogdanov, E. Mossel, and S. Vadhan. The complexity of distinguishing Markov random fields. Approximation, Randomization and Combinatorial Optimization: Algorithms and Techniques, pages 331-342, 2008.

B. Bollobás. Random Graphs. Academic Press, 1985.

G. Bresler, E. Mossel, and A. Sly. Reconstruction of Markov random fields from samples: some observations and algorithms. In Intl. Workshop APPROX: Approximation, Randomization and Combinatorial Optimization, pages 343-356. Springer, 2008.

V. Chandrasekaran, J. K. Johnson, and A. S. Willsky. Estimation in Gaussian graphical models using tractable subgraphs: a walk-sum analysis. IEEE Transactions on Signal Processing, 56(5):1916-1930, 2008.

J. Cheng, R. Greiner, J. Kelly, D. Bell, and W. Liu. Learning Bayesian networks from data: an information-theory based approach. Artificial Intelligence, 137(1-2):43-90, 2002.

M. J. Choi, V. Y. F. Tan, A. Anandkumar, and A. Willsky. Learning latent tree graphical models. J. of Machine Learning Research, 12:1771-1812, May 2011.

C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Tran. on Information Theory, 14(3):462-467, 1968.

F. R. K. Chung. Spectral Graph Theory. Amer. Mathematical Society, 1997.

F. R. K. Chung and L. Lu. Complex Graphs and Networks. Amer. Mathematical Society, 2006.

T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, Inc., 2006.

A. d'Aspremont, O. Banerjee, and L. El Ghaoui. First-order methods for sparse covariance selection. SIAM J. Matrix Anal. & Appl., 30(56), 2008.

A. Dembo and A. Montanari. Ising models on locally tree-like graphs. Annals of Applied Probability, 2010.

S. Dommers, C. Giardinà, and R. van der Hofstad. Ising models on power-law random graphs. Journal of Statistical Physics, pages 1-23, 2010.

I. Dumitriu and S. Pal. Sparse regular random graphs: spectral density and eigenvectors. arXiv:0910.5306, 2009.

D. Gamarnik, D. Shah, and Y. Wei. Belief propagation for min-cost network flow: convergence & correctness. In Proc. of ACM-SIAM Symposium on Discrete Algorithms, pages 279-292, 2010.

A. Gamburd, S. Hoory, M. Shahshahani, A. Shalev, and B. Virag. On the girth of random Cayley graphs. Random Structures & Algorithms, 35(1):100-117, 2009.

J. Z. Huang, N. Liu, M. Pourahmadi, and L. Liu. Covariance matrix selection and estimation via penalised normal likelihood. Biometrika, 93(1), 2006.

M. Kalisch and P. Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J. of Machine Learning Research, 8:613-636, 2007.

D. Karger and N. Srebro. Learning Markov networks: maximum bounded tree-width graphs. In Proc. of ACM-SIAM Symposium on Discrete Algorithms, pages 392-401, 2001.

Y.-H. Kim, A. Sutivong, and T. M. Cover. State amplification. IEEE Transactions on Information Theory, 54(5):1850-1859, May 2008.

M. Krivelevich and B. Sudakov. The largest eigenvalue of sparse random graphs. Combinatorics, Probability and Computing, 12(01):61-72, 2003.

C. Lam and J. Fan. Sparsistency and rates of convergence in large covariance matrix estimation. Annals of Statistics, 37(6B):4254, 2009.

S. L. Lauritzen. Graphical Models. Clarendon Press, 1996.

H. Liu, M. Xu, H. Gu, A. Gupta, J. Lafferty, and L. Wasserman. Forest density estimation. J. of Machine Learning Research, 12:907-951, 2011.

Y. Liu, V. Chandrasekaran, A. Anandkumar, and A. Willsky. Feedback message passing for inference in Gaussian graphical models. In Proc. of IEEE ISIT, Austin, USA, June 2010.

L. Lovász, V. Neumann-Lara, and M. Plummer. Mengerian theorems for paths of bounded length. Periodica Mathematica Hungarica, 9(4):269-276, 1978.

D. M. Malioutov, J. K. Johnson, and A. S. Willsky. Walk-sums and belief propagation in Gaussian graphical models. J. of Machine Learning Research, 7:2031-2064, 2006.

R. J. McEliece, D. J. C. MacKay, and J. F. Cheng. Turbo decoding as an instance of Pearl's belief propagation algorithm. IEEE Journal on Selected Areas in Communications, 16(2):140-152, 2002.

B. D. McKay, N. C. Wormald, and B. Wysocka. Short cycles in random regular graphs. The Electronic Journal of Combinatorics, 11(R66):1, 2004.

N. Meinshausen and P. Bühlmann. High dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3):1436-1462, 2006.

C. C. Moallemi and B. Van Roy. Convergence of min-sum message-passing for convex optimization. IEEE Transactions on Information Theory, 56(4):2041-2050, 2010.

J. M. Mooij and H. J. Kappen. Sufficient conditions for convergence of the sum-product algorithm. IEEE Transactions on Information Theory, 53(12):4422-4437, 2007.

K. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: an empirical study. In Proc. of Uncertainty in AI, pages 467-475, 1999.

P. Netrapalli, S. Banerjee, S. Sanghavi, and S. Shakkottai. Greedy learning of Markov network structure. In Proc. of Allerton Conf. on Communication, Control and Computing, Monticello, USA, Sept. 2010.

J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. arXiv:0811.3628, 2008.

A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494-515, 2008.

H. Rue and L. Held. Gaussian Markov Random Fields: Theory and Applications. Chapman and Hall, London, 2005.

N. Ruozzi and S. Tatikonda. Convergent and correct message passing schemes for optimization problems over graphical models. arXiv:1002.3239, 2010.

N. Ruozzi, J. Thaler, and S. Tatikonda. Graph covers and quadratic minimization. In Proc. of 47th Annual Allerton Conference on Communication, Control, and Computing, pages 1590-1596, 2009.

S. Sanghavi, D. Shah, and A. S. Willsky. Message passing for maximum weight independent set. IEEE Transactions on Information Theory, 55(11):4822-4834, 2009.

N. P. Santhanam and M. J. Wainwright. Information-theoretic limits of high-dimensional model selection. In International Symposium on Information Theory, Toronto, Canada, July 2008.

P. Spirtes and C. Meek. Learning Bayesian networks with discrete variables from data. In Proc. of Intl. Conf. on Knowledge Discovery and Data Mining, pages 294-299, 1995.

V. Y. F. Tan, A. Anandkumar, L. Tong, and A. Willsky. A large-deviation analysis for the maximum likelihood learning of tree structures. IEEE Tran. on Information Theory, March.

V. Y. F. Tan, A. Anandkumar, and A. Willsky. Learning Gaussian tree models: analysis of error exponents and extremal structures. IEEE Tran. on Signal Processing, 58(5):2701-2714, May 2010.

V. Y. F. Tan, A. Anandkumar, and A. Willsky. Learning Markov forest models: analysis of error rates. J. of Machine Learning Research, 12:1617-1653, May 2011.

V. V. Vazirani. Approximation Algorithms. Springer, 2001.

P. O. Vontobel. Counting in graph covers: a combinatorial characterization of the Bethe entropy function. arXiv:1012.0065, 2010.

W. Wang, M. J. Wainwright, and K. Ramchandran. Information-theoretic bounds on model selection for Gaussian Markov random fields. In IEEE International Symposium on Information Theory Proceedings (ISIT), Austin, TX, June 2010.

D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440-442, 1998.

Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1):1-41, 2000.

Y. Weiss and W. T. Freeman. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13(10):2173-2200, 2001.

X. Xie and Z. Geng. A recursive method for structural learning of directed acyclic graphs. J. of Machine Learning Research, 9:459-483, 2008.