Information-theoretic limits on sparse signal recovery: Dense versus sparse measurement matrices

Wei Wang*, Martin J. Wainwright†,*, Kannan Ramchandran*
{wangwei, wainwrig, kannanr}@eecs.berkeley.edu
Department of Electrical Engineering and Computer Sciences*, and Department of Statistics†
University of California, Berkeley

Technical Report, Department of Statistics, UC Berkeley, May 2008

Abstract

We study the information-theoretic limits of exactly recovering the support of a sparse signal using noisy projections defined by various classes of measurement matrices. Our analysis is high-dimensional in nature, in which the number of observations $n$, the ambient signal dimension $p$, and the signal sparsity $k$ are all allowed to tend to infinity in a general manner. This paper makes two novel contributions. First, we provide sharper necessary conditions for exact support recovery using general (non-Gaussian) dense measurement matrices. Combined with previously known sufficient conditions, this result yields sharp characterizations of when the optimal decoder can recover a signal for various scalings of the sparsity $k$ and sample size $n$, including the important special case of linear sparsity ($k = \Theta(p)$) using a linear scaling of observations ($n = \Theta(p)$). Our second contribution is to prove necessary conditions on the number of observations $n$ required for asymptotically reliable recovery using a class of $\gamma$-sparsified measurement matrices, where the measurement sparsity $\gamma(n, p, k) \in (0, 1]$ corresponds to the fraction of non-zero entries per row. Our analysis allows general scaling of the quadruplet $(n, p, k, \gamma)$, and reveals three different regimes, corresponding to whether measurement sparsity has no effect, a minor effect, or a dramatic effect on the information-theoretic limits of the subset recovery problem.

Keywords: sparsity recovery; sparse random matrices; subset selection; compressive sensing; signal denoising; sparse approximation; information-theoretic bounds; Fano's inequality.

1 Introduction

The problem of estimating a $k$-sparse vector $\beta \in \mathbb{R}^p$ based on a set of $n$ noisy linear observations is of broad interest, arising in subset selection in regression, graphical model selection, group testing, signal denoising, sparse approximation, and compressive sensing. A large body of recent work (e.g., [6, 9, 10, 5, 4, 14, 21, 22, 23, 13, 7, 25, 26, 19]) has analyzed the use of $\ell_1$-relaxation methods for estimating high-dimensional sparse signals, and established conditions (on signal sparsity and the choice of measurement matrices) under which they succeed with high probability.

Of complementary interest are the information-theoretic limits of the sparsity recovery problem, which apply to the performance of any procedure regardless of its computational complexity. Such analysis has two purposes: first, to demonstrate where known polynomial-time methods achieve the information-theoretic bounds, and second, to reveal situations in which current methods are sub-optimal. An interesting question which arises in this context is the effect of the choice of measurement matrix on the information-theoretic limits of sparsity recovery. As we will see, the standard Gaussian measurement ensemble is an optimal choice in terms of minimizing the number of observations required for recovery.
However, this choice produces highly dense measurement matrices, which may lead to prohibitively high computational complexity and storage requirements. Sparse matrices can reduce this complexity, and also lower communication cost and latency in distributed network and streaming applications. On the other hand, such measurement sparsity, though beneficial from the computational standpoint, may reduce statistical efficiency by requiring more observations to decode. Therefore, an important issue is to characterize the trade-off between measurement sparsity and statistical efficiency.

With this motivation, this paper makes two contributions. First, we derive sharper necessary conditions for exact support recovery, applicable to a general class of dense measurement matrices (including non-Gaussian ensembles). In conjunction with the sufficient conditions from previous work [24], this analysis provides a sharp characterization of necessary and sufficient conditions for various sparsity regimes. Our second contribution is to address the effect of measurement sparsity, meaning the fraction $\gamma \in (0, 1]$ of non-zeros per row in the matrices used to collect measurements. We derive lower bounds on the number of observations required for exact sparsity recovery, as a function of the signal dimension $p$, signal sparsity $k$, and measurement sparsity $\gamma$. This analysis highlights a trade-off between the statistical efficiency of a measurement ensemble and the computational complexity associated with storing and manipulating it.

The remainder of the paper is organized as follows. We first define our problem formulation in Section 1.1, and then discuss our contributions and some connections to related work in Section 1.2. Section 2 provides precise statements of our main results, as well as a discussion of their consequences. Section 3 provides proofs of the necessary conditions for various classes of measurement matrices, while proofs of the more technical lemmas are given in the appendices. Finally, we conclude and discuss open problems in Section 4.

1.1 Problem formulation

There are a variety of problem formulations in the growing body of work on compressive sensing and related areas. The signal model may be exactly sparse, approximately sparse, or compressible (i.e., approximately sparse in some orthonormal basis). The most common signal model is a deterministic one, although Bayesian formulations are also possible. In addition, the observation model can be either noiseless or noisy, and the measurement matrix can be random or deterministic. Furthermore, the signal recovery can be perfect or approximate, assessed by various error metrics (e.g., $\ell_q$-norms, prediction error, subset recovery).

In this paper, we consider a deterministic signal model, in which $\beta \in \mathbb{R}^p$ is a fixed but unknown vector with exactly $k$ non-zero entries. We refer to $k$ as the signal sparsity and $p$ as the signal dimension, and define the support set of $\beta$ as

$$S := \{ i \in \{1, \dots, p\} \mid \beta_i \neq 0 \}. \tag{1}$$

Note that there are $N = \binom{p}{k}$ possible support sets, corresponding to the $N$ possible $k$-dimensional subspaces in which $\beta$ can lie. We are given a vector of $n$ noisy observations $Y \in \mathbb{R}^n$, of the form

$$Y = X\beta + W, \tag{2}$$

where $X \in \mathbb{R}^{n \times p}$ is the measurement matrix, and $W \sim N(0, \sigma^2 I_{n \times n})$ is additive Gaussian noise.
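To make the observation model concrete, here is a minimal numerical sketch (ours, not the paper's; it assumes NumPy, and all problem sizes are illustrative) that draws a $k$-sparse signal, a standard Gaussian measurement matrix, and the noisy observations of equation (2):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 128, 512, 10           # illustrative sizes (not from the paper)

# k-sparse signal: support S chosen uniformly at random, entries >= beta_min
beta_min = 0.5
S = rng.choice(p, size=k, replace=False)
beta = np.zeros(p)
beta[S] = beta_min * rng.choice([-1.0, 1.0], size=k) * (1 + rng.random(k))

# standard Gaussian ensemble and observation model (2): Y = X beta + W
X = rng.standard_normal((n, p))
W = rng.standard_normal(n)       # sigma^2 = 1, as assumed in the paper
Y = X @ beta + W
```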
Our results apply to various classes of dense and $\gamma$-sparsified measurement matrices, which will be defined concretely in Section 2. Throughout this paper, we assume without loss of generality that $\sigma^2 = 1$, since any scaling of $\sigma$ can be accounted for in the scaling of $\beta$.

Our goal is to perform exact recovery of the support set $S$, which corresponds to a standard model selection error criterion. More precisely, we measure the error between the estimate $\hat\beta$ and the true signal $\beta$ using the $\{0, 1\}$-valued loss function

$$\rho(\hat\beta, \beta) := \mathbb{I}\bigl[\{\hat\beta_i \neq 0,\ \forall i \in S\} \cap \{\hat\beta_j = 0,\ \forall j \notin S\}\bigr]. \tag{3}$$

The results of this paper apply to arbitrary decoders. Any decoder is a mapping $g$ from the observations $Y$ to an estimated subset $\hat S = g(Y)$. Let $\mathbb{P}[g(Y) \neq S \mid S]$ be the conditional probability of error given that the true support is $S$. Assuming that $\beta$ has support $S$ chosen uniformly at random over the $N$ possible subsets of size $k$, the average probability of error is given by

$$p_{\mathrm{err}} = \frac{1}{\binom{p}{k}} \sum_{S} \mathbb{P}[g(Y) \neq S \mid S]. \tag{4}$$

We say that sparsity recovery is asymptotically reliable if $p_{\mathrm{err}} \to 0$ as $n \to \infty$. Since we are trying to recover the support exactly from noisy measurements, our results necessarily involve the minimum value of $\beta$ on its support,

$$\beta_{\min} := \min_{i \in S} |\beta_i|. \tag{5}$$

In particular, our results apply to decoders that operate over the signal class

$$\mathcal{C}(\beta_{\min}) := \{\beta \in \mathbb{R}^p \mid |\beta_i| \geq \beta_{\min}\ \forall i \in S\}. \tag{6}$$

With this set-up, our goal is to find necessary conditions on the parameters $(n, p, k, \beta_{\min}, \gamma)$ that any decoder, regardless of its computational complexity, must satisfy for asymptotically reliable recovery to be possible. We are interested in lower bounds on the number of measurements $n$, in general settings where both the signal sparsity $k$ and the measurement sparsity $\gamma$ are allowed to scale with the signal dimension $p$. As our analysis shows, the appropriate notion of rate for this problem is $R = \frac{\log \binom{p}{k}}{n}$.

1.2 Our contributions

One body of past work [12, 18, 1] has focused on the information-theoretic limits of sparse estimation under $\ell_2$ and other distortion metrics, using power-based SNR measures of the form

$$\mathrm{SNR} := \frac{\mathbb{E}[\|X\beta\|_2^2]}{\mathbb{E}[\|W\|_2^2]} = \|\beta\|_2^2. \tag{7}$$

(Note that the second equality assumes that the noise variance $\sigma^2 = 1$, and that the measurement matrix is standardized, with each element $X_{ij}$ having zero mean and variance one.) It is important to note that the power-based SNR (7), though appropriate for $\ell_2$-distortion, is not suitable for the support recovery problem. Although the minimum value is related to this power-based measure by the inequality $k\beta_{\min}^2 \leq \mathrm{SNR}$, for the ensemble of signals $\mathcal{C}(\beta_{\min})$ defined in equation (6), the $\ell_2$-based SNR measure (7) can be made arbitrarily large while still having one coefficient $\beta_i$ equal to the minimum value (assuming that $k > 1$). Consequently, as our results show, it is possible to generate problem instances for which support recovery is arbitrarily difficult (in particular, by sending $\beta_{\min} \to 0$ at an arbitrarily rapid rate), even as the power-based SNR (7) becomes arbitrarily large.
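The following small numerical sketch (ours, not the paper's; NumPy assumed, values illustrative) makes this point concrete: letting one coefficient grow drives the power-based SNR (7) arbitrarily high, while $\beta_{\min}$, the quantity that actually governs support recovery, stays fixed and small:

```python
import numpy as np

k, beta_min = 10, 0.01
for big in [1.0, 1e2, 1e4]:
    beta_S = np.full(k, beta_min)
    beta_S[-1] = big                 # one large coefficient
    snr = np.sum(beta_S ** 2)        # power-based SNR (7): var(X_ij)=1, sigma^2=1
    print(f"SNR = {snr:.3e}, but beta_min = {np.abs(beta_S).min()}")
# SNR grows without bound while the recovery-relevant beta_min stays at 0.01
```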
The paper [24] was the first to consider the information-theoretic limits of exact subset recovery using dense Gaussian measurement ensembles, explicitly identifying the minimum value $\beta_{\min}$ as the key parameter. This analysis yielded necessary and sufficient conditions on general quadruples $(n, p, k, \beta_{\min})$ for asymptotically reliable recovery. Subsequent work [16, 2] has extended this type of analysis to the criterion of partial support recovery. In this paper, we consider only exact support recovery, but provide results for general dense measurement ensembles, thereby extending previous results. In conjunction with known sufficient conditions [24], one consequence of our first main result (Theorem 1, below) is a set of sharp necessary and sufficient conditions for the optimal decoder to recover the support of a signal with linear sparsity ($k = \Theta(p)$) using only a linear fraction of observations ($n = \Theta(p)$). Moreover, for the special case of the standard Gaussian ensemble, Theorem 1 also recovers some results independently obtained in concurrent work by Reeves [16] and Fletcher et al. [11].

We then consider the effect of measurement sparsity, which we assess in terms of the fraction $\gamma \in (0, 1]$ of non-zeros per row of the measurement matrix $X$. Some past work in compressive sensing has proposed computationally efficient recovery methods based on sparse measurement matrices, including work inspired by expander graphs and coding theory [26, 19], sparse random projections for Johnson-Lindenstrauss embeddings [25], and sketching and group testing [13, 7]. All of this work deals with the noiseless observation model, in contrast to the noisy observation model (2) considered here. The paper [1] provides results on sparse measurements for noisy problems and distortion-type error metrics, using a Bayesian signal model and a power-based SNR that is not appropriate for the subset recovery problem. Also, some concurrent work [15] provides sufficient conditions for support recovery using the Lasso ($\ell_1$-constrained quadratic programming) for appropriately sparsified ensembles. These results can be viewed as complementary to the information-theoretic analysis of this paper.

In this paper, we characterize the inherent trade-off between measurement sparsity and statistical efficiency. More specifically, our second main result (Theorem 2, below) provides necessary conditions for exact support recovery, using $\gamma$-sparsified Gaussian measurement matrices (see equation (8)), for general scalings of the parameters $(n, p, k, \beta_{\min}, \gamma)$. This analysis reveals three regimes of interest, corresponding to whether measurement sparsity has no effect, a small effect, or a significant effect on the number of measurements necessary for recovery. Thus, there exist regimes in which measurement sparsity fundamentally alters the ability of any method to decode.

2 Main results and consequences

In this section, we state our main results and discuss some of their consequences. Our analysis applies to random ensembles of measurement matrices $X \in \mathbb{R}^{n \times p}$, where each entry $X_{ij}$ is drawn i.i.d. from some underlying distribution. The most commonly studied random ensemble is the standard Gaussian case, in which each $X_{ij} \sim N(0, 1)$. Note that this choice generates a highly dense measurement matrix $X$, with $np$ non-zero entries. Our first result (Theorem 1) applies to more general ensembles that satisfy the moment conditions $\mathbb{E}[X_{ij}] = 0$ and $\mathrm{var}(X_{ij}) = 1$, which allows for a variety of non-Gaussian distributions (e.g., uniform, Bernoulli, etc.). In addition, we also derive results (Theorem 2) for $\gamma$-sparsified matrices $X$, in which each entry $X_{ij}$ is drawn i.i.d. according to

$$X_{ij} = \begin{cases} N\bigl(0, \tfrac{1}{\gamma}\bigr) & \text{w.p. } \gamma, \\ 0 & \text{w.p. } 1 - \gamma. \end{cases} \tag{8}$$

Note that when $\gamma = 1$, $X$ is exactly the standard Gaussian ensemble. We refer to the sparsification parameter $\gamma \in (0, 1]$ as the measurement sparsity. Our analysis allows this parameter to vary as a function of $(n, p, k)$.
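As a concrete illustration (a sketch of ours, assuming NumPy), the samplers below draw matrices from zero-mean, unit-variance dense ensembles covered by Theorem 1, and from the $\gamma$-sparsified ensemble (8) covered by Theorem 2; note that each sparsified entry has variance $\gamma \cdot (1/\gamma) = 1$, so the average SNR is unchanged by sparsification:

```python
import numpy as np

rng = np.random.default_rng(1)

def dense_ensemble(n, p, kind="gaussian"):
    """Zero-mean, unit-variance i.i.d. ensembles covered by Theorem 1."""
    if kind == "gaussian":
        return rng.standard_normal((n, p))
    if kind == "bernoulli":                      # Rademacher +/-1 entries
        return rng.choice([-1.0, 1.0], size=(n, p))
    if kind == "uniform":                        # uniform, scaled to unit variance
        return rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, p))
    raise ValueError(kind)

def sparsified_gaussian(n, p, gamma):
    """gamma-sparsified Gaussian ensemble of equation (8)."""
    mask = rng.random((n, p)) < gamma            # entry is non-zero w.p. gamma
    return mask * rng.standard_normal((n, p)) / np.sqrt(gamma)

X = sparsified_gaussian(200, 1000, gamma=0.05)
print(X.var())   # ~1: sparsification preserves unit variance by construction
```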
2.1 Tighter bounds on dense ensembles

We begin by noting an analogy to the Gaussian channel coding problem, which yields a straightforward but loose set of necessary conditions. Support recovery can be viewed as a channel coding problem, in which there are $N = \binom{p}{k}$ possible support sets of $\beta$, corresponding to messages to be sent over a Gaussian channel with noise variance 1. The effective code rate is then $R = \frac{\log \binom{p}{k}}{n}$. If each support set $S$ is encoded as the codeword $c(S) = X\beta$, where $X$ has i.i.d. Gaussian entries, then by standard Gaussian channel capacity results, we immediately obtain a lower bound on the number of observations $n$ necessary for asymptotically reliable recovery,

$$n > \frac{\log \binom{p}{k}}{\frac{1}{2}\log\bigl(1 + \|\beta\|_2^2\bigr)}. \tag{9}$$

This bound is tight for $k = 1$ and Gaussian measurements, but loose in general. As Theorem 1 clarifies, there are additional elements in the support recovery problem that distinguish it from a standard Gaussian coding problem: first, the signal power $\|\beta\|_2^2$ does not capture the inherent problem difficulty for $k > 1$, and second, there is overlap between support sets for $k > 1$. The following result provides sharper conditions on subset recovery.

Theorem 1 (General ensembles). Let the measurement matrix $X \in \mathbb{R}^{n \times p}$ be drawn with i.i.d. elements from any distribution with zero mean and variance one. Then a necessary condition for asymptotically reliable recovery over the signal class $\mathcal{C}(\beta_{\min})$ is

$$n > \max\bigl\{ f_1(p, k, \beta_{\min}),\ f_2(p, k, \beta_{\min}),\ k - 1 \bigr\}, \tag{10}$$

where

$$f_1(p, k, \beta_{\min}) := \frac{\log \binom{p}{k} - 1}{\frac{1}{2}\log\bigl(1 + k\beta_{\min}^2\bigl(1 - \frac{k}{p}\bigr)\bigr)}, \tag{11a}$$

$$f_2(p, k, \beta_{\min}) := \frac{\log(p - k + 1) - 1}{\frac{1}{2}\log\bigl(1 + \beta_{\min}^2\bigl(1 - \frac{1}{p - k + 1}\bigr)\bigr)}. \tag{11b}$$

The proof of Theorem 1, given in Section 3, uses Fano's inequality to bound the probability of error of any recovery method. In addition to the standard Gaussian ensemble ($X_{ij} \sim N(0, 1)$), this result also covers matrices from other common ensembles (e.g., Bernoulli $X_{ij} \in \{-1, +1\}$). It generalizes and strengthens earlier results on subset recovery [24]. Note that $\|\beta\|_2^2 \geq k\beta_{\min}^2$ (with equality when $|\beta_i| = \beta_{\min}$ for all indices $i \in S$), so that this bound is strictly tighter than the intuitive bound (9). Moreover, by fixing the value of $\beta$ at $(k - 1)$ indices to $\beta_{\min}$ and allowing the last component of $\beta$ to tend to infinity, we can drive the power $\|\beta\|_2^2$ to infinity, while the minimum value $\beta_{\min}$ still determines the lower bound.

Table 1. Tight necessary and sufficient conditions on the number of observations $n$ required for exact support recovery, in several regimes of interest.

| Regime | Necessary conditions (Theorem 1) | Sufficient conditions (Wainwright [24]) |
|---|---|---|
| $k = \Theta(p)$, $\beta_{\min}^2 = \Theta(\frac{1}{k})$ | $\Theta(p \log p)$ | $\Theta(p \log p)$ |
| $k = \Theta(p)$, $\beta_{\min}^2 = \Theta(\frac{\log k}{k})$ | $\Theta(p)$ | $\Theta(p)$ |
| $k = \Theta(p)$, $\beta_{\min}^2 = \Theta(1)$ | $\Theta(p)$ | $\Theta(p)$ |
| $k = o(p)$, $\beta_{\min}^2 = \Theta(\frac{1}{k})$ | $\Theta(k \log(p - k))$ | $\Theta(k \log(p - k))$ |
| $k = o(p)$, $\beta_{\min}^2 = \Theta(\frac{\log k}{k})$ | $\max\bigl\{\Theta\bigl(\frac{k \log(p/k)}{\log\log k}\bigr), \Theta\bigl(\frac{k \log(p - k)}{\log k}\bigr)\bigr\}$ | $\Theta(k \log(p/k))$ |
| $k = o(p)$, $\beta_{\min}^2 = \Theta(1)$ | $\max\bigl\{\Theta\bigl(\frac{k \log(p/k)}{\log k}\bigr), \Theta(k)\bigr\}$ | $\Theta(k \log(p/k))$ |
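For concreteness, here is a short sketch (ours; plain Python, natural logarithms, which only rescales both sides consistently) that evaluates the necessary condition (10) of Theorem 1 and compares it with the loose channel-capacity bound (9):

```python
from math import lgamma, log

def log_binom(p, k):
    """log of the binomial coefficient C(p, k), via log-gamma."""
    return lgamma(p + 1) - lgamma(k + 1) - lgamma(p - k + 1)

def theorem1_bound(p, k, beta_min):
    """Necessary number of observations from equation (10)."""
    f1 = (log_binom(p, k) - 1) / (0.5 * log(1 + k * beta_min**2 * (1 - k / p)))
    f2 = (log(p - k + 1) - 1) / (0.5 * log(1 + beta_min**2 * (1 - 1 / (p - k + 1))))
    return max(f1, f2, k - 1)

def capacity_bound(p, k, norm_beta_sq):
    """Loose Gaussian channel-coding bound of equation (9)."""
    return log_binom(p, k) / (0.5 * log(1 + norm_beta_sq))

p, k, beta_min = 10_000, 100, 0.3
print(theorem1_bound(p, k, beta_min))
print(capacity_bound(p, k, k * beta_min**2))   # with all |beta_i| = beta_min
```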
The necessary conditions in Theorem 1 can be compared against the sufficient conditions in Wainwright [24] for exact support recovery using the standard Gaussian ensemble, as shown in Table 1. We obtain tight necessary and sufficient conditions in the regime of linear signal sparsity (meaning $k/p = \alpha$ for some $\alpha \in (0, 1)$), under various scalings of the minimum value $\beta_{\min}$. We also obtain tight matching conditions in the regime of sublinear signal sparsity (in which $k/p \to 0$) when $k\beta_{\min}^2 = \Theta(1)$. There remains a slight gap, however, in the sublinear sparsity regime when $k\beta_{\min}^2 \to \infty$ (see the bottom two rows of Table 1). Moreover, these information-theoretic bounds can be compared to the recovery threshold of $\ell_1$-constrained quadratic programming, known as the Lasso [23]. This comparison reveals that whenever $k\beta_{\min}^2 = \Theta(1)$ (in both the linear and sublinear sparsity regimes), then $\Theta(k \log(p - k))$ observations are necessary and sufficient for sparsity recovery, and hence the Lasso method is information-theoretically optimal. In contrast, when $k\beta_{\min}^2 \to \infty$ and $k/p = \alpha$, there is a gap between the performance of the Lasso and the information-theoretic bounds.

Theorem 1 has some consequences related to results proved in concurrent work. Reeves and Gastpar [16] have shown that in the regime of linear sparsity $k/p = \alpha > 0$, if any decoder is given only a linear fraction sample size (meaning that $n = \Theta(p)$), then in order to recover the support exactly, one must have $k\beta_{\min}^2 \to +\infty$. This result is one corollary of Theorem 1, since if $\beta_{\min}^2 = \Theta(1/k)$, then we have

$$n > \frac{\log(p - k + 1) - 1}{\frac{1}{2}\log(1 + \Theta(1/k))} = \Omega(k \log(p - k)) \gg \Theta(p),$$

so that the scaling $n = \Theta(p)$ is precluded. In other concurrent work, Fletcher et al. [11] used direct methods to show that for the special case of the standard Gaussian ensemble, the number of observations must satisfy $n > \Omega\bigl(\frac{\log(p - k)}{\beta_{\min}^2}\bigr)$. This bound is a consequence of our lower bound $f_2(p, k, \beta_{\min})$; moreover, Theorem 1 implies the same lower bound for general (non-Gaussian) ensembles as well.

In the regime of linear sparsity, Wainwright [24] showed, by direct analysis of the optimal decoder, that the scaling $\beta_{\min}^2 = \Omega(\log(k)/k)$ is sufficient for exact support recovery using a linear fraction $n = \Theta(p)$ of observations. Combined with the necessary condition in Theorem 1, we obtain the following corollary, which provides a sharp characterization of the linear-linear regime:

Corollary 1. Consider the regime of linear sparsity, meaning that $k/p = \alpha \in (0, 1)$, and suppose that a linear fraction $n = \Theta(p)$ of observations are made. Then the optimal decoder can recover the support exactly if and only if $\beta_{\min}^2 = \Omega(\log k / k)$.

2.2 Effect of measurement sparsity

We now turn to the effect of measurement sparsity on recovery, considering in particular the $\gamma$-sparsified ensemble (8). Even though the average signal-to-noise ratio of our channel remains the same (since $\mathrm{var}(X_{ij}) = 1$ for all choices of $\gamma$ by construction), the Gaussian channel coding bound (9) is clearly not tight for sparse $X$, even in the case of $k = 1$. The loss in statistical efficiency is due to the fact that we are constraining our codebook to have a sparse structure, which may be far from a capacity-achieving code.

Theorem 1 applies to any ensemble in which the components are zero-mean and unit-variance. However, if we apply it to the $\gamma$-sparsified ensemble, it yields lower bounds that are independent of $\gamma$. Intuitively, it is clear that the procedure of $\gamma$-sparsification should cause deterioration in support recovery. Indeed, the following result provides refined bounds that capture the effects of $\gamma$-sparsification. Let $\phi(\mu, \sigma^2)$ denote the Gaussian density with mean $\mu$ and variance $\sigma^2$, and define the following two mixture distributions:

$$\psi_1 := \sum_{\ell=0}^{k} \binom{k}{\ell} \gamma^\ell (1 - \gamma)^{k - \ell}\, \phi\Bigl(0,\ 1 + \frac{\ell \beta_{\min}^2}{\gamma}\Bigr), \tag{12}$$

$$\psi_2 := \gamma\, \phi\Bigl(0,\ 1 + \frac{\beta_{\min}^2}{\gamma}\Bigr) + (1 - \gamma)\, \phi(0, 1). \tag{13}$$

Furthermore, let $H(\cdot)$ denote the entropy functional. With this notation, we have the following result.

Theorem 2 (Sparse ensembles). Let the measurement matrix $X \in \mathbb{R}^{n \times p}$ be drawn with i.i.d. elements from the $\gamma$-sparsified Gaussian ensemble (8). Then a necessary condition for asymptotically reliable recovery over the signal class $\mathcal{C}(\beta_{\min})$ is

$$n > \max\bigl\{ g_1(p, k, \beta_{\min}, \gamma),\ g_2(p, k, \beta_{\min}, \gamma),\ k - 1 \bigr\}, \tag{14}$$

where

$$g_1(p, k, \beta_{\min}, \gamma) := \frac{\log \binom{p}{k} - 1}{H(\psi_1) - \frac{1}{2}\log(2\pi e)}, \tag{15a}$$

$$g_2(p, k, \beta_{\min}, \gamma) := \frac{\log(p - k + 1) - 1}{H(\psi_2) - \frac{1}{2}\log(2\pi e)}. \tag{15b}$$
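Since $H(\psi_1)$ and $H(\psi_2)$ rarely admit closed forms, here is a sketch (ours, assuming NumPy; Monte Carlo accuracy is illustrative only) that estimates the mixture entropies by sampling and evaluates $g_1$ and $g_2$ from equations (15a)-(15b):

```python
import numpy as np
from math import comb, lgamma, log, pi

rng = np.random.default_rng(2)

def mixture_entropy_mc(weights, variances, m=200_000):
    """Monte Carlo entropy (nats) of a zero-mean Gaussian scale mixture."""
    weights, variances = np.asarray(weights), np.asarray(variances)
    comp = rng.choice(len(weights), size=m, p=weights)
    y = rng.standard_normal(m) * np.sqrt(variances[comp])
    dens = (weights / np.sqrt(2 * pi * variances) *
            np.exp(-0.5 * y[:, None] ** 2 / variances)).sum(axis=1)
    return -np.mean(np.log(dens))

def theorem2_bound(p, k, beta_min, gamma):
    w1 = np.array([comb(k, l) * gamma**l * (1 - gamma)**(k - l) for l in range(k + 1)])
    v1 = 1 + np.arange(k + 1) * beta_min**2 / gamma        # psi_1 of (12)
    w2, v2 = np.array([1 - gamma, gamma]), np.array([1.0, 1 + beta_min**2 / gamma])
    h1, h2 = mixture_entropy_mc(w1, v1), mixture_entropy_mc(w2, v2)
    log_binom = lgamma(p + 1) - lgamma(k + 1) - lgamma(p - k + 1)
    g1 = (log_binom - 1) / (h1 - 0.5 * log(2 * pi * np.e))
    g2 = (log(p - k + 1) - 1) / (h2 - 0.5 * log(2 * pi * np.e))
    return max(g1, g2, k - 1)

print(theorem2_bound(p=5000, k=20, beta_min=0.5, gamma=0.1))
```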
The proof of Theorem 2, given in Section 3, again uses Fano's inequality, but explicitly analyzes the effect of measurement sparsification on the distribution of the observations. The necessary condition in Theorem 2 is plotted in Figure 1, showing distinct regimes of behavior depending on how the quantity $\gamma k$ scales, where $\gamma \in (0, 1]$ is the measurement sparsification parameter and $k$ is the signal sparsity index.

[Figure 1. Fundamental limits of sparsity in three regimes. The rate $R = \frac{\log \binom{p}{k}}{n}$ is plotted against the signal dimension $p$ using equation (14) in three regimes, depending on how the quantity $\gamma k$ scales: $\gamma k \to 0$, $\gamma k = \Theta(1)$, and $\gamma k \to \infty$.]

In order to characterize the thresholds at which measurement sparsity begins to degrade the performance of any decoder, Corollary 2 below further bounds the necessary conditions in Theorem 2 in three cases. For any scalar $\gamma$, let $H_{\mathrm{binary}}(\gamma)$ denote the entropy of a $\mathrm{Ber}(\gamma)$ variate.

Corollary 2 (Three regimes). The necessary conditions in Theorem 2 can be simplified as follows.

(a) In general,

$$g_1(p, k, \beta_{\min}, \gamma) \geq \frac{\log \binom{p}{k} - 1}{\frac{1}{2}\log\bigl(1 + k\beta_{\min}^2\bigr)}, \tag{16a}$$

$$g_2(p, k, \beta_{\min}, \gamma) \geq \frac{\log(p - k + 1) - 1}{\frac{1}{2}\log\bigl(1 + \beta_{\min}^2\bigr)}. \tag{16b}$$

(b) If $\gamma k = \tau$ for some constant $\tau$, then

$$g_1(p, k, \beta_{\min}, \gamma) \geq \frac{\log \binom{p}{k} - 1}{\frac{1}{2}\tau \log\bigl(1 + \frac{k\beta_{\min}^2}{\tau}\bigr) + C}, \tag{17a}$$

$$g_2(p, k, \beta_{\min}, \gamma) \geq \frac{\log(p - k + 1) - 1}{\frac{1}{2}\frac{\tau}{k}\log\bigl(1 + \frac{k\beta_{\min}^2}{\tau}\bigr) + H_{\mathrm{binary}}\bigl(\frac{\tau}{k}\bigr)}, \tag{17b}$$

where $C = \frac{1}{2}\log\bigl(2\pi e\bigl(\tau + \frac{1}{12}\bigr)\bigr)$ is a constant.

(c) If $\gamma k \leq 1$, then

$$g_1(p, k, \beta_{\min}, \gamma) \geq \frac{\log \binom{p}{k} - 1}{\frac{1}{2}\gamma k \log\bigl(1 + \frac{\beta_{\min}^2}{\gamma}\bigr) + k H_{\mathrm{binary}}(\gamma)}, \tag{18a}$$

$$g_2(p, k, \beta_{\min}, \gamma) \geq \frac{\log(p - k + 1) - 1}{\frac{1}{2}\gamma \log\bigl(1 + \frac{\beta_{\min}^2}{\gamma}\bigr) + H_{\mathrm{binary}}(\gamma)}. \tag{18b}$$
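A sketch (ours; the parameter choices are illustrative, and regime (b) is applied by treating the observed $\gamma k$ as the constant $\tau$) of the closed-form denominators from Corollary 2, which avoid the Monte Carlo step above; $H_{\mathrm{binary}}$ is the Bernoulli entropy in nats:

```python
import numpy as np

def h_binary(q):
    """Entropy of a Ber(q) variate in nats."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * np.log(q) - (1 - q) * np.log(1 - q)

def corollary2_denominator(k, beta_min, gamma):
    """Upper bound on H(psi_1) - (1/2) log(2 pi e) from Corollary 2 (nats).

    Dividing log C(p, k) - 1 by this value gives the lower bounds
    (17a)/(18a) on g_1.
    """
    gk = gamma * k
    if gk <= 1:                                   # regime (c): sparsity dominates
        return 0.5 * gk * np.log(1 + beta_min**2 / gamma) + k * h_binary(gamma)
    tau = gk                                      # regime (b) form, gk treated as tau
    C = 0.5 * np.log(2 * np.pi * np.e * (tau + 1 / 12))
    return 0.5 * tau * np.log(1 + k * beta_min**2 / tau) + C

for gamma in [1e-4, 1e-2, 0.5]:
    print(gamma, corollary2_denominator(k=100, beta_min=0.3, gamma=gamma))
```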
Corollary 2 reveals three regimes of behavior, defined by the scaling of the measurement sparsity $\gamma$ and the signal sparsity $k$. If $\gamma k \to \infty$ as $p \to \infty$, then the recovery threshold (16) is of the same order as the threshold for dense measurement ensembles. In this regime, sparsifying the measurement ensemble has no asymptotic effect on performance. In sharp contrast, if $\gamma k \to 0$ sufficiently fast as $p \to \infty$, then the recovery threshold (18) changes fundamentally compared to the dense case. Finally, if $\gamma k = \Theta(1)$, then the recovery threshold (17) transitions between the two extremes.

Using the bounds in Corollary 2, the necessary conditions in Theorem 2 are shown in Table 2 under different scalings of the parameters $(n, p, k, \beta_{\min}, \gamma)$. In particular, if $\gamma = o\bigl(\frac{1}{k \log k}\bigr)$ and the minimum value $\beta_{\min}^2$ does not increase with $k$, then the denominator $\gamma k \log\frac{1}{\gamma}$ goes to zero. Hence, the number of measurements that any decoder needs in order to recover reliably increases dramatically in this regime.

Table 2. Necessary conditions on the number of observations $n$ required for exact support recovery, shown in different regimes of the parameters $(p, k, \beta_{\min}, \gamma)$.

| Regime | $k = o(p)$ | $k = \Theta(p)$ |
|---|---|---|
| $\beta_{\min}^2 = \Theta(\frac{1}{k})$, $\gamma = o(\frac{1}{k \log k})$ | $\Theta\bigl(\frac{k \log(p - k)}{\gamma k \log\frac{1}{\gamma}}\bigr)$ | $\Theta\bigl(\frac{p \log p}{\gamma p \log\frac{1}{\gamma}}\bigr)$ |
| $\beta_{\min}^2 = \Theta(\frac{1}{k})$, $\gamma = \Omega(\frac{1}{k \log k})$ | $\Theta(k \log(p - k))$ | $\Theta(p \log p)$ |
| $\beta_{\min}^2 = \Theta(\frac{\log k}{k})$, $\gamma = o(\frac{1}{k \log k})$ | $\Theta\bigl(\frac{k \log(p - k)}{\gamma k \log\frac{1}{\gamma}}\bigr)$ | $\Theta\bigl(\frac{p \log p}{\gamma p \log\frac{1}{\gamma}}\bigr)$ |
| $\beta_{\min}^2 = \Theta(\frac{\log k}{k})$, $\gamma = \Theta(\frac{1}{k \log k})$ | $\Theta(k \log(p - k))$ | $\Theta(p \log p)$ |
| $\beta_{\min}^2 = \Theta(\frac{\log k}{k})$, $\gamma = \Omega(\frac{1}{k})$ | $\max\bigl\{\Theta\bigl(\frac{k \log(p/k)}{\log\log k}\bigr), \Theta\bigl(\frac{k \log(p - k)}{\log k}\bigr)\bigr\}$ | $\Theta(p)$ |

3 Proofs of our main results

In this section, we provide the proofs of Theorems 1 and 2. Establishing necessary conditions for exact sparsity recovery amounts to finding conditions on $(n, p, k, \beta_{\min})$ (and possibly $\gamma$) under which the probability of error of any recovery method stays bounded away from zero as $n \to \infty$. At a high level, our general approach is quite simple: we consider restricted problems in which the decoder has been given some additional side information, and then apply Fano's inequality [8] to lower bound the probability of error. In order to establish the two types of necessary conditions (e.g., $f_1(p, k, \beta_{\min})$ versus $f_2(p, k, \beta_{\min})$), we consider two classes of restricted ensembles: one which captures the bulk effect of having many competing subsets at large distances, and the other which captures the effect of a smaller number of subsets at very close distances. This is illustrated in Figure 2(a).

[Figure 2. Illustration of restricted ensembles. (a) In restricted ensemble A, the decoder must distinguish between $\binom{p}{k}$ support sets with an average overlap of size $\frac{k^2}{p}$, whereas in restricted ensemble B, it must decode amongst a subset of the $k(p - k) + 1$ supports with overlap $k - 1$. (b) In restricted ensemble B, the decoder is given the locations of the $k - 1$ largest non-zeros, and it must estimate the location of the smallest non-zero from the $p - k + 1$ remaining possible indices.]
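The $\frac{k^2}{p}$ average-overlap claim in Figure 2(a) is easy to sanity-check numerically (a sketch of ours, assuming NumPy; sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
p, k, trials = 1000, 50, 20_000

S = rng.choice(p, size=k, replace=False)   # fix one support set
overlaps = [np.intersect1d(S, rng.choice(p, size=k, replace=False)).size
            for _ in range(trials)]
print(np.mean(overlaps), k**2 / p)         # both ~2.5: average overlap is k^2/p
```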
We note that although the first restricted ensemble is a harder problem, applying Fano to the second restricted ensemble yields a tighter analysis in some regimes. In all cases, we assume that the support $S$ of the unknown vector $\beta \in \mathbb{R}^p$ is chosen randomly and uniformly over all $\binom{p}{k}$ possible support sets. Throughout the remainder of the paper, we use the notation $X_j \in \mathbb{R}^n$ to denote column $j$ of the matrix $X$, and $X_U \in \mathbb{R}^{n \times |U|}$ to denote the submatrix containing the columns indexed by the set $U$. Similarly, let $\beta_U \in \mathbb{R}^{|U|}$ denote the subvector of $\beta$ corresponding to the index set $U$.

Restricted ensemble A: In the first restricted problem, also exploited in previous work [24], we assume that while the support set $S$ is unknown, the decoder knows a priori that $\beta_j = \beta_{\min}$ for all $j \in S$. In other words, the decoder knows the value of $\beta$ on its support, but it does not know the locations of the non-zeros. Conditioned on the event that $S$ is the true underlying support of $\beta$, the observation vector $Y \in \mathbb{R}^n$ can then be written as

$$Y := \sum_{j \in S} X_j \beta_{\min} + W. \tag{19}$$

If a decoder can recover the support of any $p$-dimensional $k$-sparse vector $\beta$, then it must be able to recover a $k$-sparse vector that is constant on its support. Furthermore, having knowledge of the value $\beta_{\min}$ at the decoder cannot increase the probability of error. Finally, we assume that $\beta_j = \beta_{\min}$ for all $j \in S$ to construct the most difficult possible instance within our ensemble. Thus, we can apply Fano's inequality to lower bound the probability of error in the restricted problem, and so obtain a lower bound on the probability of error for the general problem. This procedure yields the lower bounds $f_1(p, k, \beta_{\min})$ and $g_1(p, k, \beta_{\min}, \gamma)$ in Theorems 1 and 2, respectively.

Restricted ensemble B: The second restricted ensemble is designed to capture the confusable effects of the relatively small number ($p - k + 1$) of very close-by subsets (see Figure 2(b)). This restricted ensemble is defined as follows. Suppose that the decoder is given the locations of all but the smallest non-zero value of the vector $\beta$, as well as the values of $\beta$ on its support. More precisely, let $j^*$ denote the unknown location of the smallest non-zero value of $\beta$, which we assume achieves the minimum (i.e., $\beta_{j^*} = \beta_{\min}$), and let $T = S \setminus \{j^*\}$. Given knowledge of $(T, \beta_T, \beta_{\min})$, the decoder may simply subtract $X_T \beta_T = \sum_{j \in T} X_j \beta_j$ from $Y$, so that it is left with the modified $n$-vector of observations

$$\widetilde{Y} := X_{j^*} \beta_{\min} + W. \tag{20}$$

By re-ordering indices as need be, we may assume without loss of generality that $T = \{p - k + 2, \dots, p\}$, so that $j^* \in \{1, \dots, p - k + 1\}$. The remaining sub-problem is to determine, given the observations $\widetilde{Y}$, the location of the single non-zero. Note that when we assume that the support of $\beta$ is uniformly chosen over all $\binom{p}{k}$ possible subsets of size $k$, then given $T$, the location of the remaining non-zero is uniformly distributed over $\{1, \dots, p - k + 1\}$.
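The reduction to equation (20) is straightforward to simulate. Below is a sketch (ours, assuming NumPy; the maximum-correlation decoder is one simple choice for illustration, not necessarily the optimal decoder analyzed in the paper) in which a genie reveals $(T, \beta_T)$, the decoder subtracts $X_T\beta_T$, and then locates $j^*$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k, beta_min = 200, 500, 10, 0.6

S = rng.choice(p, size=k, replace=False)
beta = np.zeros(p)
beta[S] = beta_min + rng.random(k)
j_star = S[np.argmin(np.abs(beta[S]))]
beta[j_star] = beta_min                      # smallest entry achieves the minimum

X = rng.standard_normal((n, p))
Y = X @ beta + rng.standard_normal(n)

# genie side information: T = S \ {j*} and the values beta_T
T = np.setdiff1d(S, [j_star])
Y_tilde = Y - X[:, T] @ beta[T]              # modified observations (20)

# locate the single remaining non-zero among the p-k+1 candidate indices
candidates = np.setdiff1d(np.arange(p), T)
scores = np.abs(X[:, candidates].T @ Y_tilde)
print(candidates[np.argmax(scores)] == j_star)
```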
We will now argue that analyzing the probability of error of this restricted problem gives us a lower bound on the probability of error in the original problem. Let $\widetilde\beta \in \mathbb{R}^{p - k + 1}$ be a vector with exactly one non-zero. We can augment $\widetilde\beta$ with $k - 1$ non-zeros at the end to obtain a $p$-dimensional vector. If a decoder can recover the support of any $p$-dimensional $k$-sparse vector $\beta$, then it can recover the support of the augmented $\widetilde\beta$, and hence the support of $\widetilde\beta$. Similarly, providing the decoder with side information about the non-zero values of $\beta$ cannot increase the probability of error. As before, we can apply Fano's inequality to lower bound the probability of error in this restricted problem, thereby obtaining the lower bounds $f_2(p, k, \beta_{\min})$ and $g_2(p, k, \beta_{\min}, \gamma)$ in Theorems 1 and 2, respectively.

3.1 Proof of Theorem 1

In this section, we derive the necessary conditions $f_1(p, k, \beta_{\min})$ and $f_2(p, k, \beta_{\min})$ in Theorem 1 for the general class of measurement matrices, by applying Fano's inequality to bound the probability of decoding error in restricted problems A and B, respectively.

3.1.1 Applying Fano to restricted ensemble A

We first perform our analysis of the error probability for a particular instance of the random measurement matrix $X$, and subsequently average over the ensemble of matrices. Let $\Omega$ denote a random subset chosen uniformly at random over all $\binom{p}{k}$ subsets $S \subset \{1, \dots, p\}$ of size $k$. The probability of decoding error, for a given $X$, can be lower bounded by Fano's inequality as

$$p_{\mathrm{err}}(X) \geq \frac{H(\Omega \mid Y) - 1}{\log \binom{p}{k}} = 1 - \frac{I(\Omega; Y) + 1}{\log \binom{p}{k}},$$

where we have used the fact that $H(\Omega \mid Y) = H(\Omega) - I(\Omega; Y) = \log \binom{p}{k} - I(\Omega; Y)$. Thus the problem is reduced to upper bounding the mutual information $I(\Omega; Y)$ between the random subset $\Omega$ and the noisy observations $Y$. Since both $X$ and $\beta_{\min}$ are known and fixed, the mutual information can be expanded as $I(\Omega; Y) = H(Y) - H(Y \mid \Omega) = H(Y) - H(W)$.

We first bound the entropy of the observation vector $H(Y)$, using the fact that differential entropy is maximized by the Gaussian distribution with matched variance. More specifically, for a given $X$, let $\Lambda(X)$ denote the covariance matrix of $Y$ conditioned on $X$ (hence the diagonal entry $\Lambda_{ii}(X)$ represents the variance of $Y_i$). With this notation, the entropy of $Y$ can be bounded as

$$H(Y) \leq \sum_{i=1}^{n} H(Y_i) \leq \sum_{i=1}^{n} \frac{1}{2}\log\bigl(2\pi e\, \Lambda_{ii}(X)\bigr).$$

Next, the entropy of the Gaussian noise vector $W \sim N(0, I_{n \times n})$ can be computed as $H(W) = \frac{n}{2}\log(2\pi e)$. Combining these two terms, we then obtain the following bound on the mutual information,

$$I(\Omega; Y) \leq \sum_{i=1}^{n} \frac{1}{2}\log\bigl(\Lambda_{ii}(X)\bigr).$$

With this bound on the mutual information, we now average the probability of error over the ensemble of measurement matrices $X$. Exploiting the concavity of the logarithm and applying Jensen's inequality, the average probability of error can be bounded as

$$\mathbb{E}_X[p_{\mathrm{err}}(X)] \geq 1 - \frac{\sum_{i=1}^{n} \frac{1}{2}\log\bigl(\mathbb{E}_X[\Lambda_{ii}(X)]\bigr) + 1}{\log \binom{p}{k}}. \tag{21}$$

It remains to compute the expectation $\mathbb{E}_X[\Lambda_{ii}(X)]$ over the ensemble of matrices $X$ drawn with i.i.d. entries from any distribution with zero mean and unit variance. The proof of the following lemma involves some relatively straightforward but lengthy calculation, and is given in Appendix A.

Lemma 1. Given i.i.d. $X_{ij}$ with zero mean and unit variance, the average covariance is given by

$$\mathbb{E}_X[\Lambda(X)] = \Bigl(1 + k\beta_{\min}^2\Bigl(1 - \frac{k}{p}\Bigr)\Bigr) I_{n \times n}. \tag{22}$$

Finally, combining Lemma 1 with equation (21), we obtain that the average probability of error is bounded away from zero if

$$n < \frac{\log \binom{p}{k} - 1}{\frac{1}{2}\log\bigl(1 + k\beta_{\min}^2\bigl(1 - \frac{k}{p}\bigr)\bigr)},$$

as claimed.
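Lemma 1 is easy to probe by simulation (a sketch of ours, assuming NumPy; sample sizes are illustrative): for each fixed matrix row, estimate the variance of $Y_i$ over random supports and noise, then average over the matrix ensemble and compare with $1 + k\beta_{\min}^2(1 - k/p)$:

```python
import numpy as np

rng = np.random.default_rng(5)
p, k, beta_min = 30, 6, 0.8
n_X, n_S = 500, 5000

lam = np.empty(n_X)
for t in range(n_X):
    x = rng.standard_normal(p)                 # one row of X (any unit-variance law)
    # sample supports S uniformly: first k entries of random permutations
    idx = rng.random((n_S, p)).argsort(axis=1)[:, :k]
    sums = x[idx].sum(axis=1)                  # sum over j in S of X_ij, per support
    lam[t] = beta_min**2 * sums.var() + 1.0    # Lambda_ii(X) = Var(Y_i | X)

print(lam.mean(), 1 + k * beta_min**2 * (1 - k / p))   # Lemma 1 predicts 4.072
```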
3.1.2 Applying Fano to restricted ensemble B

The analysis of restricted ensemble B is completely analogous to the proof for restricted ensemble A, so we only outline the key steps below. Let $\Omega$ denote a random variable with uniform distribution over the indices $\{1, \dots, p - k + 1\}$. The probability of decoding error, for a given measurement matrix $X$, can be lower bounded by Fano's inequality as

$$p_{\mathrm{err}}(X) \geq 1 - \frac{I(\Omega; \widetilde{Y}) + 1}{\log(p - k + 1)}.$$

As before, the key problem of bounding the mutual information $I(\Omega; \widetilde{Y})$ between the random index $\Omega$ and the modified observation vector $\widetilde{Y}$ can be reduced to bounding the entropy $H(\widetilde{Y})$. For each fixed $X$, let $\Lambda(X)$ denote the covariance matrix of $\widetilde{Y}$. Since the differential entropy of $\widetilde{Y}_i$ is upper bounded by the entropy of a Gaussian distribution with variance $\Lambda_{ii}(X)$, we obtain the following bound on the mutual information,

$$I(\Omega; \widetilde{Y}) = H(\widetilde{Y}) - \frac{n}{2}\log(2\pi e) \leq \sum_{i=1}^{n} \frac{1}{2}\log\bigl(\Lambda_{ii}(X)\bigr).$$

Applying Jensen's inequality, we can then bound the average probability of error, averaged over the ensemble of measurement matrices $X$, as

$$\mathbb{E}_X[p_{\mathrm{err}}(X)] \geq 1 - \frac{\sum_{i=1}^{n} \frac{1}{2}\log\bigl(\mathbb{E}_X[\Lambda_{ii}(X)]\bigr) + 1}{\log(p - k + 1)}. \tag{23}$$

The proof of Lemma 2 below follows the same steps as the derivation of Lemma 1, and is omitted.

Lemma 2. Given i.i.d. $X_{ij}$ with zero mean and unit variance, the average covariance is given by

$$\mathbb{E}_X[\Lambda(X)] = \Bigl(1 + \beta_{\min}^2\Bigl(1 - \frac{1}{p - k + 1}\Bigr)\Bigr) I_{n \times n}. \tag{24}$$

Finally, combining Lemma 2 with the Fano bound (23), we obtain that the average probability of error is bounded away from zero if

$$n < \frac{\log(p - k + 1) - 1}{\frac{1}{2}\log\bigl(1 + \beta_{\min}^2\bigl(1 - \frac{1}{p - k + 1}\bigr)\bigr)},$$

as claimed.

3.2 Proof of Theorem 2

This section contains proofs of the necessary conditions in Theorem 2 for the $\gamma$-sparsified Gaussian measurement ensemble (8). We proceed as before, applying Fano's inequality to restricted problems A and B in order to derive the conditions $g_1(p, k, \beta_{\min}, \gamma)$ and $g_2(p, k, \beta_{\min}, \gamma)$, respectively.

3.2.1 Analyzing restricted ensemble A

In analyzing the probability of error in restricted ensemble A, the initial steps proceed as in the proof of Theorem 1, first bounding the probability of error for a fixed instance of the measurement matrix $X$, and later averaging over the $\gamma$-sparsified Gaussian ensemble (8). Let $\Omega$ denote a random subset uniformly distributed over the $\binom{p}{k}$ possible subsets $S \subset \{1, \dots, p\}$ of size $k$. As before, the probability of decoding error, for each fixed $X$, can be lower bounded by Fano's inequality as

$$p_{\mathrm{err}}(X) \geq 1 - \frac{I(\Omega; Y) + 1}{\log \binom{p}{k}}.$$

We can similarly bound the mutual information

$$I(\Omega; Y) = H(Y) - H(W) \leq \sum_{i=1}^{n} H(Y_i) - \frac{n}{2}\log(2\pi e),$$

using the Gaussian entropy for $W \sim N(0, I_{n \times n})$. From this point, the key subproblem is to compute the entropy of $Y_i = \sum_{j \in S} X_{ij}\beta_{\min} + W_i$. To characterize the limiting behavior of the random variable $Y_i$, note that $Y_i$ is distributed according to the density

$$\psi_1(y, i; X) = \frac{1}{\binom{p}{k}} \sum_{S} \frac{1}{\sqrt{2\pi}} \exp\Bigl(-\frac{1}{2}\Bigl(y - \beta_{\min}\sum_{j \in S} X_{ij}\Bigr)^2\Bigr).$$

For each fixed matrix $X$, this density is a mixture of Gaussians with unit variances and means that depend on the values of $\{X_{i1}, \dots, X_{ip}\}$, summed over subsets $S \subset \{1, \dots, p\}$ with $|S| = k$. At a high level, our immediate goal is to characterize the entropy $H(\psi_1)$. Note that as $X$ varies over the ensemble (8), the sequence $\{\psi_1(\cdot\,; X)\}_p$, indexed by the signal dimension $p$, is actually a sequence of random densities. As an intermediate step, the following lemma characterizes the average pointwise behavior of this random sequence of densities, and is proven in Appendix B.

Lemma 3. Let $X$ be drawn with i.i.d. entries from the $\gamma$-sparsified Gaussian ensemble (8). For each fixed $y$ and for all $i = 1, \dots, n$, $\mathbb{E}_X[\psi_1(y, i; X)] = \bar\psi_1(y)$, where

$$\bar\psi_1(y) = \mathbb{E}_L\Biggl[\frac{1}{\sqrt{2\pi\bigl(1 + \frac{L\beta_{\min}^2}{\gamma}\bigr)}} \exp\Biggl(-\frac{y^2}{2\bigl(1 + \frac{L\beta_{\min}^2}{\gamma}\bigr)}\Biggr)\Biggr] \tag{25}$$

is a mixture of Gaussians with binomial weights $L \sim \mathrm{Bin}(k, \gamma)$; this is precisely the mixture $\psi_1$ defined in equation (12).
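Lemma 3 can be checked pointwise by simulation (a sketch of ours, assuming NumPy; by the symmetry argument of Appendix B, one support $S = \{1, \dots, k\}$ suffices): average the random density over many draws of a sparsified row and compare with the binomial mixture (25):

```python
import numpy as np
from math import comb, exp, pi, sqrt

rng = np.random.default_rng(6)
k, gamma, beta_min, m = 8, 0.3, 0.7, 500_000

# draw row entries X_{i1..ik} from the gamma-sparsified ensemble (8)
mask = rng.random((m, k)) < gamma
X_row = mask * rng.standard_normal((m, k)) / sqrt(gamma)

for y in [0.0, 1.5]:
    emp = np.mean(np.exp(-0.5 * (y - beta_min * X_row.sum(axis=1))**2)) / sqrt(2*pi)
    mix = sum(comb(k, l) * gamma**l * (1 - gamma)**(k - l)
              * exp(-y**2 / (2 * (1 + l * beta_min**2 / gamma)))
              / sqrt(2 * pi * (1 + l * beta_min**2 / gamma)) for l in range(k + 1))
    print(y, emp, mix)    # the empirical average matches the mixture (25)
```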
For certain scalings, we can use concentration results for U-statistics [20] to prove that $\psi_1$ converges uniformly to $\bar\psi_1$, and from there that $H(\psi_1) \stackrel{p}{\to} H(\bar\psi_1)$. In general, however, we always have an upper bound, which is sufficient for our purposes. Indeed, since the differential entropy $H(\psi_1)$ is a concave function of $\psi_1$, by Jensen's inequality and Lemma 3 we have

$$\mathbb{E}_X[H(\psi_1)] \leq H(\mathbb{E}_X[\psi_1]) = H(\bar\psi_1).$$

With these ingredients, we conclude that the average error probability of any decoder, averaged over the sparsified Gaussian measurement ensemble, is lower bounded by

$$\mathbb{E}_X[p_{\mathrm{err}}(X)] \geq 1 - \frac{\sum_{i=1}^{n}\mathbb{E}_X[H(Y_i)] - \frac{n}{2}\log(2\pi e) + 1}{\log \binom{p}{k}} = 1 - \frac{\sum_{i=1}^{n}\mathbb{E}_X[H(\psi_1)] - \frac{n}{2}\log(2\pi e) + 1}{\log \binom{p}{k}} \geq 1 - \frac{n H(\bar\psi_1) - \frac{n}{2}\log(2\pi e) + 1}{\log \binom{p}{k}}.$$

Therefore, the probability of decoding error is bounded away from zero if

$$n < \frac{\log \binom{p}{k} - 1}{H(\bar\psi_1) - \frac{1}{2}\log(2\pi e)},$$

as claimed.

3.2.2 Analyzing restricted ensemble B

The analysis of restricted ensemble B mirrors exactly the derivation for restricted ensemble A, so we only outline the key steps in this section. Letting $\Omega \sim \mathrm{Uni}\{1, \dots, p - k + 1\}$, we again apply Fano's inequality to restricted problem B, using the sparse measurement ensemble (8):

$$p_{\mathrm{err}}(X) \geq 1 - \frac{I(\Omega; \widetilde{Y}) + 1}{\log(p - k + 1)}.$$

In order to upper bound $I(\Omega; \widetilde{Y})$, we need to upper bound the entropy $H(\widetilde{Y})$. The sequence of densities associated with $\widetilde{Y}_i$ becomes

$$\psi_2(y, i; X) = \frac{1}{p - k + 1} \sum_{j=1}^{p - k + 1} \frac{1}{\sqrt{2\pi}} \exp\Bigl(-\frac{1}{2}\bigl(y - \beta_{\min} X_{ij}\bigr)^2\Bigr).$$

Lemma 4 below characterizes the average pointwise behavior of these densities, and follows from the proof of Lemma 3, with $S$ taken to range over subsets of the indices $\{1, \dots, p - k + 1\}$ of size $|S| = 1$.

Lemma 4. Let $X$ be drawn with i.i.d. entries according to (8). For each fixed $y$ and for all $i = 1, \dots, n$, $\mathbb{E}_X[\psi_2(y, i; X)] = \bar\psi_2(y)$, where

$$\bar\psi_2(y) = \mathbb{E}_B\Biggl[\frac{1}{\sqrt{2\pi\bigl(1 + \frac{B\beta_{\min}^2}{\gamma}\bigr)}} \exp\Biggl(-\frac{y^2}{2\bigl(1 + \frac{B\beta_{\min}^2}{\gamma}\bigr)}\Biggr)\Biggr] \tag{26}$$

is a mixture of Gaussians with Bernoulli weights $B \sim \mathrm{Ber}(\gamma)$.

As before, we can apply Jensen's inequality to obtain the bound $\mathbb{E}_X[H(\psi_2)] \leq H(\mathbb{E}_X[\psi_2]) = H(\bar\psi_2)$. The necessary condition then follows from the Fano bound on the probability of error.

3.3 Proof of Corollary 2

In this section, we derive bounds on the expressions $g_1(p, k, \beta_{\min}, \gamma)$ and $g_2(p, k, \beta_{\min}, \gamma)$ in Theorem 2. We begin by noting that the Gaussian mixture distribution $\psi_1$ defined in (12) is a strict generalization of the distribution $\psi_2$ defined in (13); in particular, setting the parameter $k = 1$ in $\psi_1$ recovers $\psi_2$.
The variance associated with the mixture distribution $\psi_1$ is equal to $\sigma_1^2 = 1 + k\beta_{\min}^2$, and so the entropy of $\psi_1$ is always bounded by the entropy of a Gaussian distribution with variance $\sigma_1^2$,

$$H(\psi_1) \leq \frac{1}{2}\log\bigl(2\pi e\bigl(1 + k\beta_{\min}^2\bigr)\bigr).$$

Similarly, the mixture distribution $\psi_2$ has variance equal to $1 + \beta_{\min}^2$, so that the entropy associated with $\psi_2$ can in general be bounded as

$$H(\psi_2) \leq \frac{1}{2}\log\bigl(2\pi e\bigl(1 + \beta_{\min}^2\bigr)\bigr).$$

This yields the first set of bounds in (16).

Next, to derive more refined bounds which capture the effects of measurement sparsity, we make use of the following lemma (proven in Appendix C) to bound the entropy associated with the mixture distribution $\psi_1$:

Lemma 5. For the Gaussian mixture distribution $\psi_1$ defined in (12),

$$H(\psi_1) \leq \mathbb{E}_L\Bigl[\frac{1}{2}\log\Bigl(2\pi e\Bigl(1 + \frac{L\beta_{\min}^2}{\gamma}\Bigr)\Bigr)\Bigr] + H(L),$$

where $L \sim \mathrm{Bin}(k, \gamma)$.

We can further bound the expression in Lemma 5 in three cases, delineated by the quantity $\gamma k$. The proof of the following claim is given in Appendix D.

Lemma 6. Let $E = \mathbb{E}_L\bigl[\frac{1}{2}\log\bigl(1 + \frac{L\beta_{\min}^2}{\gamma}\bigr)\bigr]$, where $L \sim \mathrm{Bin}(k, \gamma)$.

(a) If $\gamma k > 3$, then

$$\frac{1}{4}\log\Bigl(1 + \frac{k\beta_{\min}^2}{3}\Bigr) \leq E \leq \frac{1}{2}\log\bigl(1 + k\beta_{\min}^2\bigr).$$

(b) If $\gamma k = \tau$ for some constant $\tau$, then

$$\frac{1}{2}\bigl(1 - e^{-\tau}\bigr)\log\Bigl(1 + \frac{k\beta_{\min}^2}{\tau}\Bigr) \leq E \leq \frac{1}{2}\tau\log\Bigl(1 + \frac{k\beta_{\min}^2}{\tau}\Bigr).$$

(c) If $\gamma k \leq 1$, then

$$\frac{1}{4}\gamma k\log\Bigl(1 + \frac{\beta_{\min}^2}{\gamma}\Bigr) \leq E \leq \frac{1}{2}\gamma k\log\Bigl(1 + \frac{\beta_{\min}^2}{\gamma}\Bigr).$$

Finally, combining Lemmas 5 and 6 with some simple bounds on the entropy of the binomial variate $L$ (given in Appendix E), we obtain the bounds on $g_1(p, k, \beta_{\min}, \gamma)$ in (17) and (18).

We can similarly bound the entropy associated with the Gaussian mixture distribution $\psi_2$. Since the density $\psi_2$ is a special case of the density $\psi_1$ with $k$ set to 1, we can again apply Lemma 5 to obtain

$$H(\psi_2) \leq \mathbb{E}_B\Bigl[\frac{1}{2}\log\Bigl(2\pi e\Bigl(1 + \frac{B\beta_{\min}^2}{\gamma}\Bigr)\Bigr)\Bigr] + H(B) = \frac{\gamma}{2}\log\Bigl(1 + \frac{\beta_{\min}^2}{\gamma}\Bigr) + H_{\mathrm{binary}}(\gamma) + \frac{1}{2}\log(2\pi e).$$

We have thus obtained the bounds on $g_2(p, k, \beta_{\min}, \gamma)$ in equations (17) and (18).
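The bounds of Lemma 6 are easy to probe numerically (a sketch of ours; parameter choices are illustrative and chosen to land in cases (a) and (c)): compute $E$ exactly from the binomial pmf and compare against the stated envelopes:

```python
from math import comb, log

def E_exact(k, gamma, beta_min):
    """E = E_L[(1/2) log(1 + L beta_min^2 / gamma)], L ~ Bin(k, gamma)."""
    return sum(comb(k, l) * gamma**l * (1 - gamma)**(k - l)
               * 0.5 * log(1 + l * beta_min**2 / gamma) for l in range(k + 1))

k, beta_min = 200, 0.5
for gamma in [0.001, 0.005, 0.05]:        # gamma*k = 0.2, 1.0, 10
    gk, E = gamma * k, E_exact(k, gamma, beta_min)
    if gk <= 1:                           # case (c) envelopes
        lo = 0.25 * gk * log(1 + beta_min**2 / gamma)
        hi = 0.50 * gk * log(1 + beta_min**2 / gamma)
    else:                                 # case (a) envelopes, valid for gk > 3
        lo, hi = 0.25 * log(1 + k * beta_min**2 / 3), 0.5 * log(1 + k * beta_min**2)
    print(f"gk={gk:5.2f}: {lo:.3f} <= E={E:.3f} <= {hi:.3f}")
```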
4 Discussion

In this paper, we have studied the information-theoretic limits of exact support recovery for general scalings of the parameters $(n, p, k, \beta_{\min}, \gamma)$. Our first result (Theorem 1) applies generally to measurement matrices with zero-mean, unit-variance entries. It strengthens previously known bounds, and combined with known sufficient conditions [24], yields a sharp characterization of recovering signals with linear sparsity using a linear fraction of observations (Corollary 1). Our second result (Theorem 2) applies to $\gamma$-sparsified Gaussian measurement ensembles, and reveals three different regimes of measurement sparsity, depending on how significantly they impair statistical efficiency. For linear signal sparsity, Theorem 2 is not a sharp result (by comparison to Theorem 1 in the dense case); however, its tightness for sublinear signal sparsity is an interesting open problem. Finally, Theorem 1 implies that the standard Gaussian ensemble is an information-theoretically optimal choice for the measurement matrix: no other zero-mean, unit-variance distribution can reduce the number of observations necessary for recovery, and in fact the standard Gaussian distribution achieves matching sufficient bounds [24]. This fact raises an interesting open question on the design of other, more computationally friendly, measurement matrices which are optimal in the information-theoretic sense.

Acknowledgment

The work of WW and KR was supported by NSF grant CCF-0635114. The work of MJW was supported by NSF grants CAREER-CCF-0545862 and DMS-0605165.

A Proof of Lemma 1

We begin by defining some additional notation. Let $\bar\beta \in \mathbb{R}^p$ be a $k$-sparse vector with $\bar\beta_j = \beta_{\min}$ for all indices $j$ in the support set $S$. Recall that $\Omega$ denotes a random subset uniformly distributed over all $\binom{p}{k}$ possible subsets $S \subset \{1, \dots, p\}$ with $|S| = k$. Conditioned on the event that $\Omega = S$, the vector of $n$ observations can then be written as

$$Y := X_S \bar\beta_S + W = \beta_{\min}\sum_{j \in S} X_j + W.$$

Note that for a given instance of the matrix $X$, the distribution of $Y$ is a Gaussian mixture with density $f(y) = \frac{1}{\binom{p}{k}}\sum_S \phi(X_S\bar\beta_S, I)$, where we use $\phi(X_S\bar\beta_S, I)$ to denote the density of a Gaussian random vector with mean $X_S\bar\beta_S$ and covariance $I$. Let $\mu(X) = \mu \in \mathbb{R}^n$ and $\Lambda(X) = \Lambda \in \mathbb{R}^{n \times n}$ be the mean vector and covariance matrix of $Y$, respectively. The covariance matrix of $Y$ can be computed as $\Lambda = \mathbb{E}[YY^T] - \mu\mu^T$, where

$$\mu = \frac{1}{\binom{p}{k}}\sum_S X_S\bar\beta_S = \frac{\beta_{\min}}{\binom{p}{k}}\sum_S \sum_{j \in S} X_j$$

and

$$\mathbb{E}[YY^T] = \mathbb{E}\bigl[(X\bar\beta)(X\bar\beta)^T\bigr] + \mathbb{E}[WW^T] = \frac{1}{\binom{p}{k}}\sum_S (X_S\bar\beta_S)(X_S\bar\beta_S)^T + I.$$

With this notation, we can now compute the expectation of the covariance matrix $\mathbb{E}_X[\Lambda]$, averaged over any distribution on $X$ with independent, zero-mean, unit-variance entries. For the first term, we have

$$\mathbb{E}_X\bigl[\mathbb{E}[YY^T]\bigr] = \frac{\beta_{\min}^2}{\binom{p}{k}}\sum_S \mathbb{E}_X\Biggl[\sum_{j \in S} X_j X_j^T + \sum_{i \neq j \in S} X_i X_j^T\Biggr] + I = \frac{\beta_{\min}^2}{\binom{p}{k}}\sum_S \sum_{j \in S} I + I = \bigl(1 + k\beta_{\min}^2\bigr) I,$$

where the second equality uses the facts that $\mathbb{E}_X[X_j X_j^T] = I$ and $\mathbb{E}_X[X_i X_j^T] = 0$ for $i \neq j$. Next, we compute the second term as

$$\mathbb{E}_X[\mu\mu^T] = \Bigl(\frac{\beta_{\min}}{\binom{p}{k}}\Bigr)^2 \mathbb{E}_X\Biggl[\sum_{S, U}\sum_{j \in S \cap U} X_j X_j^T + \sum_{S, U}\sum_{\substack{i \in S,\ j \in U \\ i \neq j}} X_i X_j^T\Biggr] = \Bigl(\frac{\beta_{\min}}{\binom{p}{k}}\Bigr)^2 \sum_{S, U} |S \cap U|\, I.$$

From here, note that there are $\binom{p}{k}$ possible subsets $S$. For each $S$, a counting argument reveals that there are $\binom{k}{\lambda}\binom{p - k}{k - \lambda}$ subsets $U$ of size $k$ which have $\lambda = |S \cap U|$ overlaps with $S$. Thus the scalar multiplicative factor above can be written as

$$\Bigl(\frac{\beta_{\min}}{\binom{p}{k}}\Bigr)^2 \sum_{S, U} |S \cap U| = \frac{\beta_{\min}^2}{\binom{p}{k}} \sum_{\lambda=1}^{k} \binom{k}{\lambda}\binom{p - k}{k - \lambda}\,\lambda.$$

Finally, using a substitution of variables (setting $\lambda' = \lambda - 1$) and applying Vandermonde's identity [17], we have

$$\Bigl(\frac{\beta_{\min}}{\binom{p}{k}}\Bigr)^2 \sum_{S, U} |S \cap U| = \frac{\beta_{\min}^2}{\binom{p}{k}}\, k \sum_{\lambda'=0}^{k-1} \binom{k - 1}{\lambda'}\binom{p - k}{k - \lambda' - 1} = \frac{\beta_{\min}^2}{\binom{p}{k}}\, k \binom{p - 1}{k - 1} = \frac{k^2\beta_{\min}^2}{p}.$$

Combining these terms, we conclude that

$$\mathbb{E}_X[\Lambda(X)] = \Bigl(1 + k\beta_{\min}^2\Bigl(1 - \frac{k}{p}\Bigr)\Bigr) I.$$
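The counting step can be confirmed exactly (a sketch of ours; pure Python, exact integer arithmetic): the weighted sum $\sum_\lambda \lambda \binom{k}{\lambda}\binom{p-k}{k-\lambda}$ collapses to $k\binom{p-1}{k-1}$, which produces the $\frac{k^2}{p}$ factor:

```python
from math import comb

p, k = 23, 7
lhs = sum(l * comb(k, l) * comb(p - k, k - l) for l in range(1, k + 1))
rhs = k * comb(p - 1, k - 1)          # Vandermonde's identity after l' = l - 1
assert lhs == rhs
print(lhs / comb(p, k), k**2 / p)     # both equal k^2/p = 49/23
```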
B Proof of Lemma 3

Consider the following sequences of densities,

$$\psi_1(y, i; X) = \frac{1}{\binom{p}{k}}\sum_S \frac{1}{\sqrt{2\pi}}\exp\Bigl(-\frac{1}{2}\Bigl(y - \beta_{\min}\sum_{j \in S} X_{ij}\Bigr)^2\Bigr)$$

and

$$\bar\psi_1(y) = \mathbb{E}_L\Biggl[\frac{1}{\sqrt{2\pi\bigl(1 + \frac{L\beta_{\min}^2}{\gamma}\bigr)}}\exp\Biggl(-\frac{y^2}{2\bigl(1 + \frac{L\beta_{\min}^2}{\gamma}\bigr)}\Biggr)\Biggr],$$

where $L \sim \mathrm{Bin}(k, \gamma)$. Our goal is to show that for each fixed $y$ and row index $i$, the pointwise average of the stochastic sequence of densities $\psi_1$ over the ensemble of matrices $X$ satisfies $\mathbb{E}_X[\psi_1(y, i; X)] = \bar\psi_1(y)$. By symmetry, it is sufficient to compute this expectation for the subset $S = \{1, \dots, k\}$.

When each $X_{ij}$ is drawn i.i.d. according to the $\gamma$-sparsified ensemble (8), the random variable $Z = y - \beta_{\min}\sum_{j=1}^{k} X_{ij}$ has a Gaussian mixture distribution which can be described as follows: denoting the mixture label by $L \sim \mathrm{Bin}(k, \gamma)$, we have $Z \sim N\bigl(y, \frac{\ell\beta_{\min}^2}{\gamma}\bigr)$ if $L = \ell$, for $\ell = 0, \dots, k$. Thus, conditioned on the mixture label $L = \ell$, the random variable

$$\widetilde{Z} = \frac{\gamma}{\ell\beta_{\min}^2}\Bigl(y - \beta_{\min}\sum_{j=1}^{k} X_{ij}\Bigr)^2$$

has a noncentral chi-square distribution with 1 degree of freedom and noncentrality parameter $\lambda = \frac{\gamma y^2}{\ell\beta_{\min}^2}$. Evaluating the moment-generating function [3] $M_{\widetilde{Z}}(t) = \mathbb{E}[e^{t\widetilde{Z}}]$ of $\widetilde{Z}$ then gives us the desired quantity,

$$\mathbb{E}_X\Biggl[\frac{1}{\sqrt{2\pi}}\exp\Bigl(-\frac{1}{2}\Bigl(y - \beta_{\min}\sum_{j=1}^{k} X_{ij}\Bigr)^2\Bigr)\Biggr] = \sum_{\ell=0}^{k}\frac{1}{\sqrt{2\pi}}\,\mathbb{E}_X\Biggl[\exp\Bigl(-\frac{1}{2}\Bigl(y - \beta_{\min}\sum_{j=1}^{k} X_{ij}\Bigr)^2\Bigr)\,\Bigm|\, L = \ell\Biggr]\,\mathbb{P}(L = \ell) = \sum_{\ell=0}^{k}\frac{1}{\sqrt{2\pi}}\, M_{\widetilde{Z}}\Bigl(-\frac{\ell\beta_{\min}^2}{2\gamma}\Bigr)\,\mathbb{P}(L = \ell) = \mathbb{E}_L\Biggl[\frac{1}{\sqrt{2\pi\bigl(1 + \frac{L\beta_{\min}^2}{\gamma}\bigr)}}\exp\Biggl(-\frac{y^2}{2\bigl(1 + \frac{L\beta_{\min}^2}{\gamma}\bigr)}\Biggr)\Biggr],$$

as claimed.
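The moment-generating-function step can be verified directly (a sketch of ours, assuming NumPy): for one degree of freedom, a noncentral chi-square variate is the square of a unit-variance Gaussian with mean $\sqrt{\lambda}$, and its MGF has the standard closed form $M(t) = \exp\bigl(\frac{\lambda t}{1 - 2t}\bigr)/\sqrt{1 - 2t}$ for $t < \frac{1}{2}$:

```python
import numpy as np

rng = np.random.default_rng(7)
lam, t, m = 2.3, -0.8, 5_000_000      # noncentrality lambda, and t < 1/2

# chi^2_1(lambda) variate: the square of N(sqrt(lambda), 1)
z_tilde = (np.sqrt(lam) + rng.standard_normal(m))**2
mc = np.mean(np.exp(t * z_tilde))     # Monte Carlo estimate of E[exp(t * Z~)]

closed = np.exp(lam * t / (1 - 2*t)) / np.sqrt(1 - 2*t)   # MGF of chi^2_1(lambda)
print(mc, closed)                     # agreement confirms the step used in Lemma 3
```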
C Proof of Lemma 5

Let $Z$ be a random variable distributed according to the density

$$\bar\psi_1(y) = \mathbb{E}_L\Biggl[\frac{1}{\sqrt{2\pi\bigl(1 + \frac{L\beta_{\min}^2}{\gamma}\bigr)}}\exp\Biggl(-\frac{y^2}{2\bigl(1 + \frac{L\beta_{\min}^2}{\gamma}\bigr)}\Biggr)\Biggr],$$

where $L \sim \mathrm{Bin}(k, \gamma)$. To compute the entropy of $Z$, we can expand the mutual information $I(Z; L)$ in two ways, as $I(Z; L) = H(Z) - H(Z \mid L) = H(L) - H(L \mid Z)$, and obtain

$$H(Z) = H(Z \mid L) + H(L) - H(L \mid Z).$$

The conditional distribution of $Z$ given $L = \ell$ is Gaussian, and so the conditional entropy of $Z$ given $L$ can be written as

$$H(Z \mid L) = \mathbb{E}_L\Bigl[\frac{1}{2}\log\Bigl(2\pi e\Bigl(1 + \frac{L\beta_{\min}^2}{\gamma}\Bigr)\Bigr)\Bigr].$$

Furthermore, we can bound the conditional entropy of $L$ given $Z$ as $0 \leq H(L \mid Z) \leq H(L)$. This gives upper and lower bounds on the entropy of $Z$,

$$H(Z \mid L) \leq H(Z) \leq H(Z \mid L) + H(L).$$

D Proof of Lemma 6

We first derive upper and lower bounds in the case $\gamma k \leq 1$. We can rewrite the binomial distribution as

$$p(\ell) := \binom{k}{\ell}\gamma^\ell(1 - \gamma)^{k - \ell} = \frac{\gamma k}{\ell}\binom{k - 1}{\ell - 1}\gamma^{\ell - 1}(1 - \gamma)^{k - \ell},$$

and hence

$$E = \frac{1}{2}\sum_{\ell=1}^{k}\log\Bigl(1 + \frac{\ell\beta_{\min}^2}{\gamma}\Bigr)p(\ell) = \frac{\gamma k}{2}\sum_{\ell=1}^{k}\frac{1}{\ell}\log\Bigl(1 + \frac{\ell\beta_{\min}^2}{\gamma}\Bigr)\binom{k - 1}{\ell - 1}\gamma^{\ell - 1}(1 - \gamma)^{k - \ell}.$$

Taking the first two terms of the binomial expansion of $\bigl(1 + \frac{\beta_{\min}^2}{\gamma}\bigr)^\ell$ and noting that all the terms are non-negative, we obtain the inequality $\bigl(1 + \frac{\beta_{\min}^2}{\gamma}\bigr)^\ell \geq 1 + \frac{\ell\beta_{\min}^2}{\gamma}$, and consequently

$$\log\Bigl(1 + \frac{\beta_{\min}^2}{\gamma}\Bigr) \geq \frac{1}{\ell}\log\Bigl(1 + \frac{\ell\beta_{\min}^2}{\gamma}\Bigr).$$

Using a change of variables (setting $\ell' = \ell - 1$) and applying the binomial theorem, we thus obtain the upper bound

$$E \leq \frac{\gamma k}{2}\log\Bigl(1 + \frac{\beta_{\min}^2}{\gamma}\Bigr)\sum_{\ell'=0}^{k-1}\binom{k - 1}{\ell'}\gamma^{\ell'}(1 - \gamma)^{k - \ell' - 1} = \frac{\gamma k}{2}\log\Bigl(1 + \frac{\beta_{\min}^2}{\gamma}\Bigr).$$

To derive the lower bound, we use the facts that $1 + x \leq e^x$ for all $x \in \mathbb{R}$, and $e^{-x} \leq 1 - \frac{x}{2}$ for $x \in [0, 1]$:

$$E = \frac{1}{2}\sum_{\ell=1}^{k}\log\Bigl(1 + \frac{\ell\beta_{\min}^2}{\gamma}\Bigr)p(\ell) \geq \frac{1}{2}\log\Bigl(1 + \frac{\beta_{\min}^2}{\gamma}\Bigr)\sum_{\ell=1}^{k}p(\ell) = \frac{1}{2}\log\Bigl(1 + \frac{\beta_{\min}^2}{\gamma}\Bigr)\bigl(1 - (1 - \gamma)^k\bigr) \geq \frac{1}{2}\log\Bigl(1 + \frac{\beta_{\min}^2}{\gamma}\Bigr)\bigl(1 - e^{-\gamma k}\bigr) \stackrel{(a)}{\geq} \frac{1}{2}\log\Bigl(1 + \frac{\beta_{\min}^2}{\gamma}\Bigr)\frac{\gamma k}{2}.$$

Next, we examine the case $\gamma k = \tau$ for some constant $\tau$. The derivation of the upper bound in the case $\gamma k \leq 1$ holds for the $\gamma k = \tau$ case as well. The proof of the lower bound follows the same steps as in the $\gamma k \leq 1$ case, except that we stop before applying the last inequality (a).

Finally, we derive bounds in the case $\gamma k > 3$. Since the mean of an $L \sim \mathrm{Bin}(k, \gamma)$ random variable is $\gamma k$, by Jensen's inequality the following bound always holds:

$$\mathbb{E}_L\Bigl[\frac{1}{2}\log\Bigl(1 + \frac{L\beta_{\min}^2}{\gamma}\Bigr)\Bigr] \leq \frac{1}{2}\log\bigl(1 + k\beta_{\min}^2\bigr).$$

To derive a matching lower bound, we use the fact that the median of a $\mathrm{Bin}(k, \gamma)$ distribution is one of $\{\lfloor\gamma k\rfloor - 1, \lfloor\gamma k\rfloor, \lfloor\gamma k\rfloor + 1\}$. This allows us to bound

$$E \geq \frac{1}{2}\sum_{\ell=\lfloor\gamma k\rfloor - 1}^{k}\log\Bigl(1 + \frac{\ell\beta_{\min}^2}{\gamma}\Bigr)p(\ell) \geq \frac{1}{2}\log\Bigl(1 + \frac{(\lfloor\gamma k\rfloor - 1)\beta_{\min}^2}{\gamma}\Bigr)\sum_{\ell=\lfloor\gamma k\rfloor - 1}^{k}p(\ell) \geq \frac{1}{4}\log\Bigl(1 + \frac{k\beta_{\min}^2}{3}\Bigr),$$

where in the last step we used the facts that $\frac{(\lfloor\gamma k\rfloor - 1)\beta_{\min}^2}{\gamma} \geq \frac{(\gamma k - 2)\beta_{\min}^2}{\gamma} \geq \frac{k\beta_{\min}^2}{3}$ for $\gamma k > 3$, and $\sum_{\ell=\mathrm{median}}^{k} p(\ell) \geq \frac{1}{2}$.

E Bounds on binomial entropy

Lemma 7. Let $L \sim \mathrm{Bin}(k, \gamma)$. Then $H(L) \leq k H_{\mathrm{binary}}(\gamma)$. Furthermore, if $\gamma = o\bigl(\frac{1}{k\log k}\bigr)$, then $k H_{\mathrm{binary}}(\gamma) \to 0$ as $k \to \infty$.

Proof. We can express the binomial variate as $L = \sum_{i=1}^{k} Z_i$, where the $Z_i \sim \mathrm{Ber}(\gamma)$ are i.i.d. Since $H(g(Z_1, \dots, Z_k)) \leq H(Z_1, \dots, Z_k)$, we have

$$H(L) \leq H(Z_1, \dots, Z_k) = k H_{\mathrm{binary}}(\gamma).$$

Next we find the limit of $k H_{\mathrm{binary}}(\gamma) = k\gamma\log\frac{1}{\gamma} + k(1 - \gamma)\log\frac{1}{1 - \gamma}$. Let $\gamma = \frac{1}{k f(k)}$, and assume that $f(k) \to \infty$ as $k \to \infty$. The first term can then be written as

$$k\gamma\log\frac{1}{\gamma} = \frac{1}{f(k)}\log\bigl(k f(k)\bigr) = \frac{\log k}{f(k)} + \frac{\log f(k)}{f(k)},$$

and so $k\gamma\log\frac{1}{\gamma} \to 0$ if $f(k) = \omega(\log k)$. The second term can also be expanded as

$$-k(1 - \gamma)\log(1 - \gamma) = -k\log\Bigl(1 - \frac{1}{k f(k)}\Bigr) + \frac{1}{f(k)}\log\Bigl(1 - \frac{1}{k f(k)}\Bigr) = -\log\Biggl(\Bigl(1 - \frac{1}{k f(k)}\Bigr)^k\Biggr) + \frac{1}{f(k)}\log\Bigl(1 - \frac{1}{k f(k)}\Bigr).$$

If $f(k) \to \infty$ as $k \to \infty$, then we have the limits

$$\lim_{k \to \infty}\Bigl(1 - \frac{1}{k f(k)}\Bigr)^k = 1 \quad\text{and}\quad \lim_{k \to \infty}\Bigl(1 - \frac{1}{k f(k)}\Bigr) = 1,$$

which in turn imply that

$$\lim_{k \to \infty}\log\Biggl(\Bigl(1 - \frac{1}{k f(k)}\Bigr)^k\Biggr) = 0 \quad\text{and}\quad \lim_{k \to \infty}\frac{1}{f(k)}\log\Bigl(1 - \frac{1}{k f(k)}\Bigr) = 0.$$

Lemma 8. Let $L \sim \mathrm{Bin}(k, \gamma)$. Then $H(L) \leq \frac{1}{2}\log\bigl(2\pi e\bigl(k\gamma(1 - \gamma) + \frac{1}{12}\bigr)\bigr)$.

Proof. We immediately obtain this bound by applying the differential entropy bound on discrete entropy [8].
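Both entropy bounds of Appendix E are simple to check numerically (a sketch of ours; exact binomial pmf, natural logs): compute $H(L)$ directly and compare with $k H_{\mathrm{binary}}(\gamma)$ and the differential-entropy bound of Lemma 8:

```python
from math import comb, e, log, pi

def binom_entropy(k, gamma):
    """Exact entropy (nats) of L ~ Bin(k, gamma)."""
    pmf = [comb(k, l) * gamma**l * (1 - gamma)**(k - l) for l in range(k + 1)]
    return -sum(q * log(q) for q in pmf if q > 0)

k, gamma = 100, 0.07
H = binom_entropy(k, gamma)
bound7 = k * (-gamma * log(gamma) - (1 - gamma) * log(1 - gamma))    # Lemma 7
bound8 = 0.5 * log(2 * pi * e * (k * gamma * (1 - gamma) + 1 / 12))  # Lemma 8
print(H, bound7, bound8)    # H(L) is below both bounds
```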
References

[1] S. Aeron, M. Zhao, and S. Venkatesh. Information-theoretic bounds to sensing capacity of sensor networks under fixed SNR. In Information Theory Workshop, September 2007.

[2] M. Akcakaya and V. Tarokh. Shannon theoretic limits on noisy compressive sampling. Technical Report arXiv:cs.IT/0711.0366, Harvard, November 2007.

[3] L. Birgé. An alternative point of view on Lepski's method. In State of the Art in Probability and Statistics, number 37 in IMS Lecture Notes, pages 113–133. Institute of Mathematical Statistics, 2001.

[4] E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, August 2006.

[5] E. Candes and T. Tao. Decoding by linear programming. IEEE Trans. Info Theory, 51(12):4203–4215, December 2005.

[6] S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Computing, 20(1):33–61, 1998.

[7] G. Cormode and S. Muthukrishnan. Towards an algorithmic theory of compressed sensing. Technical report, Rutgers University, July 2005.

[8] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley and Sons, New York, 1991.

[9] D. Donoho. Compressed sensing. IEEE Trans. Info. Theory, 52(4):1289–1306, April 2006.

[10] D. Donoho, M. Elad, and V. M. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Info Theory, 52(1):6–18, January 2006.

[11] A. K. Fletcher, S. Rangan, and V. K. Goyal. Necessary and sufficient conditions on sparsity pattern recovery. Technical Report arXiv:cs.IT/0804.1839, UC Berkeley, April 2008.

[12] A. K. Fletcher, S. Rangan, V. K. Goyal, and K. Ramchandran. Denoising by sparse approximation: Error bounds based on rate-distortion theory. Journal on Applied Signal Processing, 10:1–19, 2006.

[13] A. Gilbert, M. Strauss, J. Tropp, and R. Vershynin. Algorithmic linear dimension reduction in the ℓ1-norm for sparse vectors. In Proc. Allerton Conference on Communication, Control and Computing, Allerton, IL, September 2006.

[14] N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 2006. To appear.

[15] D. Omidiran and M. J. Wainwright. High-dimensional subset recovery in noise: Sparsified measurements without loss of statistical efficiency. Technical report, Department of Statistics, UC Berkeley, April 2008. Short version presented at Int. Symp. Info. Theory, July 2008.

[16] G. Reeves. Sparse signal sampling using noisy linear projections. Master's thesis, UC Berkeley, December 2007.

[17] J. Riordan. Combinatorial Identities. Wiley Series in Probability and Mathematical Statistics. Wiley, New York, 1968.

[18] S. Sarvotham, D. Baron, and R. G. Baraniuk. Measurements versus bits: Compressed sensing meets information theory. In Proc. Allerton Conference on Control, Communication and Computing, September 2006.

[19] S. Sarvotham, D. Baron, and R. G. Baraniuk. Sudocodes: Fast measurement and reconstruction of sparse signals. In Int. Symposium on Information Theory, Seattle, WA, July 2006.

[20] R. J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley Series in Probability and Statistics. Wiley, 1980.

[21] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[22] J. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Info Theory, 52(3):1030–1051, March 2006.

[23] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity using ℓ1-constrained quadratic programs. Technical Report 709, Department of Statistics, UC Berkeley, 2006.

[24] M. J. Wainwright. Information-theoretic bounds for sparsity recovery in the high-dimensional and noisy setting. Technical Report 725, Department of Statistics, UC Berkeley, January 2007. Presented at International Symposium on Information Theory, June 2007.

[25] W. Wang, M. Garofalakis, and K. Ramchandran. Distributed sparse random projections for refinable approximation. In Proc. International Conference on Information Processing in Sensor Networks, Nashville, TN, April 2007.

[26] W. Xu and B. Hassibi. Efficient compressed sensing with deterministic guarantees using expander graphs. In Information Theory Workshop (ITW), September 2007.