The Benefit of Group Sparsity

This paper develops a theory for group Lasso using a concept called strong group sparsity. Our result shows that group Lasso is superior to standard Lasso for strongly group-sparse signals. This provides a convincing theoretical justification for using group sparse regularization when the underlying group structure is consistent with the data. Moreover, the theory predicts some limitations of the group Lasso formulation that are confirmed by simulation studies.

Authors: Junzhou Huang (Computer Science Department, Rutgers University) and Tong Zhang (Statistics Department, Rutgers University)

The Benet of Group Sparsit y Junzhou Huang Computer Science Departmen t, Rutgers Univ ersit y T ong Zhang Statistics Departmen t, Rut g ers Univ ersit y Abstract This pap er dev elops a theory for group Lasso using a concept called str ong gr oup sp arsity . Our result sho ws that group Lasso is sup erior to standard Lasso for strongly group - s parse signals. This pro vides a con vincing theoretical jus ti cation for using group sparse regularization when the underlying group structure is consisten t with the data. Moreo v er, the theory predicts some limitations of the group Lasso form ulation t hat are conrmed b y sim ulation studies. 1 In tro duction W e are in terested in th e sp a rse learnin g p ro b lem for least squares regression. Consi der a set of p basis v ectors { x 1 , . . . , x p } where x j ∈ R n for eac h j . Here, n is the sample size. Denote b y X the n × p data matrix, with column j of X b eing x j . Giv en an observ ation y = [ y 1 , . . . , y n ] ∈ R n that is generated from a sparse linear com bination of the basi s v ectors plus a sto c hastic noise v ecto r  ∈ R n : y = X ¯ β +  = d X j =1 ¯ β j x j + , where w e assume that the target co ecien t ¯ β is sparse. Throughout the pap er, w e consider x ed design only . That is, w e assume X is  xed, and ra n domiza ti o n is wi th resp ect to the noise  . Note that w e do not assume that the noise  is zero-me an . Dene the supp ort of a sparse v e ctor β ∈ R p as supp( β ) = { j : β j 6 = 0 } , and k β k 0 = | supp( β ) | . A natural metho d for sparse learning is L 0 regularization: ˆ β L 0 = arg min β ∈ R p k X β − y k 2 2 sub ject to k β k 0 ≤ k , where k is the sparsit y . Sin c e this optimization problem is generally NP-hard, in practice, one often consider the follo wing L 1 regularization problem, whic h is the closest con v ex relaxation of L 0 : ˆ β L 1 = arg min β ∈ R p  1 n k X β − y k 2 2 + λ k β k 1  , where λ is an appropriately c hosen regularization parameter. This metho d is often referred to as Lasso i n the statistical literature. 1 In practical applications, one often kno ws a group structure o n the co ecien t v ector ¯ β so that v ariables in the same group tend to b e zeros or n o n zero s sim ultaneously . The purp ose of this pap er is to sho w that if suc h a structure ex i sts, then b etter results can b e obtained . 2 Strong Group Sparsit y F or simplicit y , w e shall only c on sider non -o v erlapping groups in this pap er, although our analysis can b e adapted to handle mo derately o v erlappi ng groups. Assume that { 1 , . . . , p } = ∪ m j =1 G j is partitioned in to m disjoin t groups G 1 , G 2 , . . . , G m : G i ∩ G j = ∅ when i 6 = j . Moreo v er, throughout th e pap er, w e let k j = | G j | , and k 0 = max j ∈{ 1 ,...,m } k j . Giv en S ⊂ { 1 , . . . , m } that denotes a set of groups, w e dene G S = ∪ j ∈ S G j . Giv en a subset of v ariables F ⊂ { 1 , . . . , p } and a co ecien t v ector β ∈ R p , let β F b e the v ecto r in R | F | whic h is iden tical to β in F . Similar, X F is the n × | F | matrix with columns iden tical to X in F . The follo wing metho d, often referred to as group L asso, has b een prop osed to tak e adv an tage of the group structure: ˆ β = arg min β   1 n k X β − y k 2 2 + λ m X j =1 k β G j k 2   . (1) The purp ose of this pap er is to dev elop a theory that c h a racterizes the p erformance of (1). W e are in terested in conditions under whic h group Lasso yields b etter estimate of ¯ β than the standard Lasso. 
Instead of the standard sparsity assumption, where the complexity is measured by the number of nonzero coefficients, we introduce the strong group sparsity concept below. The idea is to measure the complexity of a sparse signal using group sparsity in addition to coefficient sparsity.

Definition 2.1 A coefficient vector $\bar{\beta} \in \mathbb{R}^p$ is $(g, k)$ strongly group-sparse if there exists a set $S$ of groups such that $\mathrm{supp}(\bar{\beta}) \subset G_S$, $|G_S| \leq k$, and $|S| \leq g$.

The new concept is referred to as strong group-sparsity because $k$ is used to measure the sparsity of $\bar{\beta}$ instead of $\|\bar{\beta}\|_0$. If this notion is beneficial, then $k / \|\bar{\beta}\|_0$ should be small, which means that the signal has to be efficiently covered by the groups. In fact, the group Lasso method does not work well when $k / \|\bar{\beta}\|_0$ is large. In that case, the signal is only weakly group-sparse, and one needs $\|\bar{\beta}\|_0$ to precisely measure the real sparsity of the signal. Unfortunately, such information is not included in the group Lasso formulation, and there is no simple fix of this problem using variations of group Lasso. This is because our theory requires the group Lasso regularization term to be strong enough to dominate the noise, and this strong regularization causes a bias of the order $O(k)$ which cannot be removed. This is one fundamental drawback inherent to the group Lasso formulation.
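To make Definition 2.1 concrete, the sketch below computes the $(g, k)$ pair induced by a given partition: $S$ is simply the set of groups that intersect the support, $g = |S|$, and $k = |G_S|$ (the helper name is ours):

```python
import numpy as np

def strong_group_sparsity(beta, groups):
    """Return (g, k) for `beta` under the partition `groups`:
    g = number of groups touching supp(beta), k = total size of those groups."""
    support = set(np.flatnonzero(beta))
    active = [G for G in groups if support & set(G)]
    return len(active), sum(len(G) for G in active)

# Example: p = 8, two groups of four; only coordinate 0 is nonzero,
# so g = 1 but k = 4 while ||beta||_0 = 1 (here k / ||beta||_0 = 4 is large).
beta = np.array([1.0, 0, 0, 0, 0, 0, 0, 0])
groups = [np.arange(0, 4), np.arange(4, 8)]
print(strong_group_sparsity(beta, groups))  # -> (1, 4)
```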
3 Related Work

The idea of using group structure to achieve better sparse recovery performance has received much attention. For example, group sparsity has been considered for simultaneous sparse approximation [12] and multi-task compressive sensing [4] from the Bayesian hierarchical modeling point of view. Under the Bayesian hierarchical model framework, data from all sources contribute to the estimation of hyper-parameters in the sparse prior model. The shared prior can then be inferred from multiple sources. Although the idea can be justified using standard Bayesian intuition, there are no theoretical results showing how much better (and under what kind of conditions) the resulting algorithms perform. In [11], the authors attempted to derive a bound on the number of samples needed to recover block-sparse signals, where the coefficients in each block are either all zero or all nonzero. In our terminology, this corresponds to the case of group sparsity with equal-size groups. The algorithm considered there is a special case of (1) with $\lambda_j \to 0^+$. However, their result is very loose and does not demonstrate the advantage of group Lasso over standard Lasso.

In the statistical literature, the group Lasso (1) has been studied by a number of authors [13, 1, 7, 5, 8]. There were no theoretical results in [13]. Although some theoretical results were developed in [1, 7], neither showed that group Lasso is superior to the standard Lasso. The authors of [5] showed that group Lasso can be superior to standard Lasso when each group is an infinite-dimensional kernel, using an argument completely different from ours (they relied on the fact that a meaningful analysis can be obtained for kernel methods in infinite dimension). Their idea cannot be adapted to show the advantage of group Lasso in the finite-dimensional scenarios of interest here, such as the standard compressive sensing setting. Our analysis, which focuses on the latter, is therefore complementary to their work. Another related work is [8], where the authors considered a special case of group Lasso in the multi-task learning scenario and showed that the number of samples required for recovering the exact support set may be smaller for group Lasso under appropriate conditions. However, there are major differences between our analysis and theirs. For example, the group formulation we consider here is more general and includes the multi-task scenario as a special case. Moreover, we study signal recovery performance in 2-norm instead of the exact recovery of the support set. The sparse eigenvalue condition employed in this work is often considerably weaker than the irrepresentable-type condition in their analysis (which is required for exact support set recovery). Our analysis also shows that for strongly group-sparse signals, even when the number of samples is large, group Lasso can still have an advantage in that it is more robust to noise than standard Lasso.

In the above context, the main contribution of this work is the introduction of the strong group sparsity concept, under which a satisfactory theory of group Lasso is developed. Our result shows that strongly group-sparse signals can be estimated more reliably using group Lasso, in that it requires fewer samples in the compressive sensing setting and is more robust to noise in the statistical estimation setting. Finally, we shall mention that, independently of the authors, results similar to those presented in this paper have been obtained in [6] with a similar technical analysis. However, while our paper studies the general group Lasso formulation, only the special case of multi-task learning is considered in [6].

4 Assumptions

The following assumption on the noise is important in our analysis. It captures an important advantage of group Lasso over standard Lasso under the strong group sparsity assumption.

Assumption 4.1 (Group noise condition) There exist non-negative constants $a, b$ such that for any fixed group $j \in \{1, \ldots, m\}$ and $\eta \in (0, 1)$: with probability larger than $1 - \eta$, the noise projection to the $j$-th group is bounded by
$$\|(X_{G_j}^\top X_{G_j})^{-0.5} X_{G_j}^\top (\epsilon - \mathbb{E}\epsilon)\|_2 \leq a\sqrt{k_j} + b\sqrt{-\ln \eta}.$$

The importance of this assumption is that the concentration term $\sqrt{-\ln \eta}$ does not depend on the group size $k_j$. This reveals a significant benefit of group Lasso over standard Lasso: the concentration term does not increase when the group size increases. This implies that if we can correctly guess the group sparsity structure, the group Lasso estimator is more stable with respect to stochastic noise than the standard Lasso. We shall point out that this assumption holds for independent sub-Gaussian noise vectors, where $\mathbb{E}\,e^{t(\epsilon_i - \mathbb{E}\epsilon_i)} \leq e^{t^2\sigma^2/2}$ for all $t$ and $i = 1, \ldots, n$. It can be shown that one may choose $a = 2.8$ and $b = 2.4$ when $\eta \in (0, 0.5)$. Since a complete treatment of sub-Gaussian noise is not important for the purpose of this paper, we only prove this assumption for independent Gaussian noise, where it can be directly calculated.

Proposition 4.1 Assume the entries of the noise vector $\epsilon$ are independent Gaussians: $\epsilon_i - \mathbb{E}\epsilon_i \sim N(0, \sigma_i^2)$, where each $\sigma_i \leq \sigma$ ($i = 1, \ldots, n$). Then Assumption 4.1 holds with $a = \sigma$ and $b = \sqrt{2}\sigma$.
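Proposition 4.1 is easy to check numerically: for Gaussian noise with $\sigma_i = \sigma$, the projected noise norm is distributed as $\sigma$ times a $\chi_{k_j}$ variable. A small Monte Carlo sanity check (the seed, trial count, and problem sizes here are our own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma, eta = 1000, 8, 1.0, 0.05
trials = 20000

# Fixed design for one group G with k variables.
XG = rng.standard_normal((n, k))
# Whitening matrix (X_G^T X_G)^{-1/2} via eigendecomposition.
w, V = np.linalg.eigh(XG.T @ XG)
W = V @ np.diag(w ** -0.5) @ V.T

# Exceedance frequency of the Assumption 4.1 bound with a = sigma,
# b = sqrt(2) * sigma, as given by Proposition 4.1.
bound = sigma * np.sqrt(k) + np.sqrt(2) * sigma * np.sqrt(-np.log(eta))
norms = np.array([
    np.linalg.norm(W @ (XG.T @ (sigma * rng.standard_normal(n))))
    for _ in range(trials)
])
print("empirical exceedance:", np.mean(norms > bound), "<= eta =", eta)
```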
The next assumption handles the case where the true target is not exactly sparse. That is, we only assume that $X\bar{\beta} \approx \mathbb{E}y$.

Assumption 4.2 (Group approximation error condition) There exist $\delta_a, \delta_b \geq 0$ such that for all groups $j \in \{1, \ldots, m\}$, the projection of the error mean $\mathbb{E}\epsilon$ to the $j$-th group is bounded by
$$\|(X_{G_j}^\top X_{G_j})^{-0.5} X_{G_j}^\top \mathbb{E}\epsilon\|_2 / \sqrt{n} \leq \sqrt{k_j}\,\delta_a + \delta_b.$$

As mentioned earlier, we do not assume that the noise is zero-mean, hence $\mathbb{E}\epsilon$ may not equal zero. In other words, this condition covers the situation where the true target is not exactly sparse. It resembles the algebraic noise of [14], but takes the group structure into account. Similar to [14], we have the following result.

Proposition 4.2 Consider a $(g, k)$ strongly group-sparse coefficient vector $\bar{\beta}$ such that $\frac{1}{n}\|X\bar{\beta} - \mathbb{E}y\|_2^2 \leq \Delta^2$, and $a_0, b_0 \geq 0$. Then there exists a $(g', k')$ strongly group-sparse $\bar{\beta}'$ such that
$$k' a_0^2 + g' b_0^2 \leq 2(k a_0^2 + g b_0^2), \quad \|X\bar{\beta}' - \mathbb{E}y\|_2 \leq \|X\bar{\beta} - \mathbb{E}y\|_2, \quad \mathrm{supp}(\bar{\beta}) \subset \mathrm{supp}(\bar{\beta}'),$$
and for all groups $j$:
$$\|(X_{G_j}^\top X_{G_j})^{-0.5} X_{G_j}^\top (X\bar{\beta}' - \mathbb{E}y)\|_2 / \sqrt{n} \leq (a_0\sqrt{k_j} + b_0)\,\Delta / \sqrt{k a_0^2 + b_0^2}.$$

The proposition shows that if the approximation error of $\bar{\beta}$ is $\Delta = \|X\bar{\beta} - \mathbb{E}y\|_2/\sqrt{n}$, then we may find an alternative target $\bar{\beta}'$ with similar sparsity for which we can take $\delta_a = a_0\Delta/\sqrt{k a_0^2 + b_0^2}$ and $\delta_b = b_0\Delta/\sqrt{k a_0^2 + b_0^2}$ in Assumption 4.2. This means that in Theorem 5.1 below, by choosing $a_0 = a$ and $b_0 = b\sqrt{\ln(m/\eta)}$, the contribution of the approximation error to the reconstruction error $\|\hat{\beta} - \bar{\beta}\|_2$ is $O(\Delta)$. Note that this assumption does not show the benefit of group Lasso over standard Lasso. Therefore, in order to compare our results to those for the standard Lasso, one may consider the simple situation where $\delta_a = \delta_b = 0$, that is, the target is exactly sparse. The only reason to include Assumption 4.2 is to illustrate that our analysis can handle approximate sparsity.

The last assumption is a sparse eigenvalue condition, used in the modern analysis of Lasso (e.g., [2, 14]). It is also closely related to (and slightly weaker than) the RIP (restricted isometry property) assumption [3] in the compressive sensing literature. This assumption takes advantage of the group structure, and can be considered as (a weaker version of) group RIP. We introduce a definition before stating the assumption.

Definition 4.1 For all $F \subset \{1, \ldots, p\}$, define
$$\rho_-(F) = \inf\left\{ \tfrac{1}{n}\|X\beta\|_2^2 / \|\beta\|_2^2 : \mathrm{supp}(\beta) \subset F \right\}, \qquad \rho_+(F) = \sup\left\{ \tfrac{1}{n}\|X\beta\|_2^2 / \|\beta\|_2^2 : \mathrm{supp}(\beta) \subset F \right\}.$$
Moreover, for all $1 \leq s \leq p$, define
$$\rho_-(s) = \inf\{\rho_-(G_S) : S \subset \{1, \ldots, m\},\ |G_S| \leq s\}, \qquad \rho_+(s) = \sup\{\rho_+(G_S) : S \subset \{1, \ldots, m\},\ |G_S| \leq s\}.$$

Assumption 4.3 (Group sparse eigenvalue condition) There exist $s, c > 0$ such that
$$\frac{\rho_+(s) - \rho_-(2s)}{\rho_-(s)} \leq c.$$
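The quantities in Definition 4.1 are extreme eigenvalues of normalized Gram sub-matrices: $\rho_-(G_S)$ and $\rho_+(G_S)$ are the smallest and largest eigenvalues of $\frac{1}{n}X_{G_S}^\top X_{G_S}$. For small $m$ they can be computed by brute force over group subsets, as in this sketch (exponential in $m$, for illustration only; the function name is ours):

```python
import numpy as np
from itertools import combinations

def group_sparse_eigenvalues(X, groups, s):
    """Brute-force rho_-(s), rho_+(s) of Definition 4.1: extreme eigenvalues
    of (1/n) X_F^T X_F over all unions F = G_S with |F| <= s."""
    n = X.shape[0]
    lo, hi = np.inf, 0.0
    m = len(groups)
    for r in range(1, m + 1):
        for S in combinations(range(m), r):
            F = np.concatenate([groups[j] for j in S])
            if len(F) > s:
                continue
            eig = np.linalg.eigvalsh(X[:, F].T @ X[:, F] / n)
            lo, hi = min(lo, eig[0]), max(hi, eig[-1])
    return lo, hi
```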
Assumption 4.3 illustrates another advantage of group Lasso over standard Lasso. Since we only consider eigenvalues for sub-matrices consistent with the group structure $\{G_j\}$, the ratio $\rho_+(s)/\rho_-(s)$ can be significantly smaller than the corresponding ratio for Lasso (which considers all subsets of $\{1, \ldots, p\}$ up to size $s$). For example, assume that all group sizes are identical, $k_1 = \cdots = k_m = k_0$, and $s$ is a multiple of $k_0$. For random projections used in compressive sensing applications, only $n = O(s + (s/k_0)\ln m)$ projections are needed for Assumption 4.3 to hold. In comparison, standard Lasso requires $n = O(s \ln p)$ projections. The difference can be significant when $p$ and $k_0$ are large. More precisely, we have the following random projection sample complexity bound for the group sparse eigenvalue condition. Although we assume a Gaussian random matrix in order to state explicit constants, it is clear that similar results hold for other sub-Gaussian random matrices.

Proposition 4.3 (Group-RIP) Suppose that the elements of $X$ are iid standard Gaussian random variables $N(0, 1)$. For any $t > 0$ and $\delta \in (0, 1)$, let
$$n \geq \frac{8}{\delta^2}\left[\ln 3 + t + k\ln(1 + 8/\delta) + g\ln(em/g)\right].$$
Then with probability at least $1 - e^{-t}$, the random matrix $X \in \mathbb{R}^{n \times p}$ satisfies the following group-RIP inequality for all $(g, k)$ strongly group-sparse vectors $\bar{\beta} \in \mathbb{R}^p$:
$$(1 - \delta)\|\bar{\beta}\|_2 \leq \frac{1}{\sqrt{n}}\|X\bar{\beta}\|_2 \leq (1 + \delta)\|\bar{\beta}\|_2. \tag{2}$$

5 Main Results

Our main result is the following signal recovery (2-norm parameter estimation error) bound for group Lasso.

Theorem 5.1 Suppose that Assumption 4.1, Assumption 4.2, and Assumption 4.3 are valid. Take $\lambda_j = (A\sqrt{k_j} + B)/\sqrt{n}$, where both $A$ and $B$ may depend on the data $y$. Given $\eta \in (0, 1)$, with probability larger than $1 - \eta$, if the following conditions hold:

- $A \geq 4\max_j \rho_+(G_j)^{1/2}(a + \delta_a\sqrt{n})$;
- $B \geq 4\max_j \rho_+(G_j)^{1/2}(b\sqrt{\ln(m/\eta)} + \delta_b\sqrt{n})$;
- $\bar{\beta}$ is a $(g, k)$ strongly group-sparse coefficient vector;
- $s \geq k + k_0$;
- with $\ell = s - (k + k_0) + 1$ and $g_\ell = \min\{|S| : |G_S| \geq \ell,\ S \subset \{1, \ldots, m\}\}$, we have
$$c^2 \leq \frac{\ell A^2 + g_\ell B^2}{72(kA^2 + gB^2)};$$

then the solution of (1) satisfies
$$\|\hat{\beta} - \bar{\beta}\|_2 \leq \frac{\sqrt{4.5}}{\rho_-(s)\sqrt{n}}\left(1 + 0.25\,c^{-1}\right)\sqrt{A^2 k + g B^2}.$$

The first four conditions of the theorem are not critical, as they are just definitions and choices for $\lambda_j$. The fifth condition is critical: the group sparse eigenvalue condition has to be satisfied with some $c$ that is not too large. In order to satisfy the condition, $\ell$ should be chosen relatively large, as the right-hand side is linear in $\ell$. However, this implies that $s$ also grows linearly. It is possible to find such an $s$ whenever $c^2$ in Assumption 4.3 grows sub-linearly in $s$.

Consider the situation where $\delta_a = \delta_b = 0$. If the conditions of Theorem 5.1 are satisfied, then
$$\|\hat{\beta} - \bar{\beta}\|_2^2 = O\left((k + g\ln(m/\eta))/n\right).$$
In comparison, the Lasso estimator can only achieve the bound
$$\|\hat{\beta}_{L_1} - \bar{\beta}\|_2^2 = O\left(\|\bar{\beta}\|_0 \ln(p/\eta)/n\right).$$
If $k/\|\bar{\beta}\|_0 \ll \ln(p/\eta)$ (which means that the group structure is useful) and $g \ll \|\bar{\beta}\|_0$, then group Lasso is superior. This is consistent with intuition. However, if $k \gg \|\bar{\beta}\|_0 \ln(p/\eta)$, then group Lasso is inferior. This happens when the signal is not strongly group-sparse.

Theorem 5.1 also suggests that if the group sizes are uneven, then group Lasso may not work well when the signal is contained in small groups. This is because in such a case $g_\ell$ can be significantly smaller than $g$ even with relatively large $\ell$, which means we have to choose a large $s$ and small $c$, implying a poor bound. This prediction is confirmed in Section 6.2 using simulated data.
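To see the scale of the two error rates, the snippet below evaluates $(k + g\ln(m/\eta))/n$ and $\|\bar{\beta}\|_0\ln(p/\eta)/n$ (constants dropped) for the parameters used in the simulations of Section 6.1.1; the numbers only illustrate relative order, since the $O(\cdot)$ constants are unknown, and the choice $\eta = 0.05$ is ours:

```python
import numpy as np

# Parameters from Section 6.1.1: p = 512, k = ||beta||_0 = 64, g = 16,
# m = 128 groups of size k_0 = 4; sample size n = 192.
p, k, g, m, n, eta = 512, 64, 16, 128, 192, 0.05
beta0 = 64  # ||beta||_0

group_rate = (k + g * np.log(m / eta)) / n  # Theorem 5.1 rate for group Lasso
lasso_rate = beta0 * np.log(p / eta) / n    # standard Lasso rate
print(f"group Lasso rate ~ {group_rate:.2f}, Lasso rate ~ {lasso_rate:.2f}")
# prints approximately 0.99 vs 3.08 (smaller is better)
```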
Intuitively, group Lasso favors large groups because the 2-norm regularization for a large group is weaker. Adjusting the regularization parameters $\lambda_j$ not only fails to work in theory, but is also impractical, since it is unrealistic to tune many parameters. This unstable behavior with respect to uneven group sizes may be regarded as another drawback of the group Lasso formulation.

In the following, we present two simplifications of Theorem 5.1 that are easier to interpret. The first is the compressive sensing case, which does not consider stochastic noise.

Corollary 5.1 (Compressive sensing) Suppose that Assumption 4.1 and Assumption 4.2 are valid with $a = b = \delta_b = 0$. Take $\lambda_j = 4\sqrt{k_j}\max_j \rho_+(G_j)^{1/2}\delta_a$. Let $\bar{\beta}$ be a $(g, k)$ strongly group-sparse signal, $\ell = k$, and $s = 2k + k_0 - 1$. If $(\rho_+(s) - \rho_-(2s))/\rho_-(s) \leq 1/\sqrt{72}$, then the solution of (1) satisfies
$$\|\hat{\beta} - \bar{\beta}\|_2 \leq \frac{6\sqrt{2} + 18}{\rho_-(s)}\max_j \rho_+(G_j)^{1/2}\,\delta_a\sqrt{k}.$$

If $\delta_a = 0$, then we achieve exact recovery. Moreover, Proposition 4.2 implies that we may choose a target with similar sparsity such that $\delta_a\sqrt{k} = O(\|X\bar{\beta} - \mathbb{E}y\|_2/\sqrt{n})$. This implies a bound $\|\hat{\beta} - \bar{\beta}\|_2 = O(\|X\bar{\beta} - \mathbb{E}y\|_2/\sqrt{n})$.

If we have equal-sized groups, the number of samples $n$ required for Corollary 5.1 to hold (that is, for $(\rho_+(s) - \rho_-(2s))/\rho_-(s) \leq 1/\sqrt{72}$) is $O(k + g\ln(m/g))$, where $g = k/k_0$. In comparison, although a similar result holds for Lasso, it requires a sample size of order $\|\bar{\beta}\|_0\ln(p/\|\bar{\beta}\|_0)$. Again, group Lasso has a significant advantage if $k/\|\bar{\beta}\|_0 \ll \ln(p/\|\bar{\beta}\|_0)$, $g \ll \|\bar{\beta}\|_0$, and $p$ is large.

The following corollary is for equal-sized groups, and the result is simpler to interpret. For standard Lasso, $B = O(\sqrt{\ln p})$, while for group Lasso, $B = O(\sqrt{\ln m})$. The benefit of group Lasso is the division of $B^2$ by $k_0$ in the bound, which is a significant improvement when the dimensionality $p$ is large. The disadvantage of group Lasso is that the signal sparsity $\|\bar{\beta}\|_0$ is replaced by the group sparsity $k$. This is not an artifact of our analysis, but rather a fundamental drawback inherent to the group Lasso formulation. The effect is observable, as shown in our simulation studies.

Corollary 5.2 (Even group size) Suppose that Assumption 4.1 and Assumption 4.2 are valid. Assume also that all groups are of equal size: $k_0 = k_j$ for $j = 1, \ldots, m$. Given $\eta \in (0, 1)$, let $\lambda_j = (A\sqrt{k_0} + B)/\sqrt{n}$, where $A \geq 4\max_j \rho_+(G_j)^{1/2}(a + \delta_a\sqrt{n})$ and $B \geq 4\max_j \rho_+(G_j)^{1/2}(b\sqrt{\ln(m/\eta)} + \delta_b\sqrt{n})$. Let $\bar{\beta}$ be a $(k/k_0, k)$ strongly group-sparse signal. With probability larger than $1 - \eta$, if
$$6\sqrt{2}\,\frac{\rho_+(k + \ell) - \rho_-(2k + 2\ell)}{\rho_-(k + \ell)} < \sqrt{\ell/k}$$
for some $\ell > 0$ that is a multiple of $k_0$, then the solution of (1) satisfies
$$\|\hat{\beta} - \bar{\beta}\|_2 \leq \rho_-(k + \ell)^{-1}\left(\sqrt{4.5} + 4.5\sqrt{k/\ell}\right)\sqrt{A^2 + B^2/k_0}\,\sqrt{k/n}.$$
6 Simulation Studies

We verify our theory by comparing group Lasso to Lasso on simulated data. For quantitative evaluation, the recovery error is defined as the relative 2-norm difference between the estimated sparse coefficient vector $\beta_{\mathrm{est}}$ and the ground-truth sparse coefficient vector $\bar{\beta}$: $\|\beta_{\mathrm{est}} - \bar{\beta}\|_2 / \|\bar{\beta}\|_2$. The regularization parameter $\lambda$ in Lasso is chosen by five-fold cross validation. In group Lasso, we simply set the regularization parameters $\lambda_j = \lambda\sqrt{k_j}/\sqrt{n}$ for $j = 1, 2, \ldots, m$; the parameter $\lambda$ is again chosen by five-fold cross validation. This corresponds to setting $B = 0$ in the formula $\lambda_j = O(A\sqrt{k_j} + B)$. Since the relative performance of group Lasso versus standard Lasso is similar for other values of $B$, we do not include results with $B \neq 0$, in order to avoid redundancy.

6.1 Even group size

In this set of experiments, the projection matrix $X$ is generated by creating an $n \times p$ matrix with i.i.d. draws from a standard Gaussian distribution $N(0, 1)$. For simplicity, the rows of $X$ are normalized to unit magnitude. Zero-mean Gaussian noise with standard deviation $\sigma = 0.01$ is added to the measurements. Our task is to compare the recovery performance of Lasso and group Lasso for $(g, k)$ strongly group-sparse signals.

6.1.1 With correct group structure

In this experiment, we randomly generate $(g, k)$ strongly group-sparse coefficients with values $\pm 1$, where $p = 512$, $k = 64$, and $g = 16$. There are 128 groups with even group size $k_0 = 4$. Here the group structure coincides with the signal sparsity: $k = \|\bar{\beta}\|_0$. Figure 1 shows an instance of the generated sparse coefficient vector and the results recovered by Lasso and group Lasso, respectively, when $n = 3k = 192$. Since the sample size $n$ is only three times the signal sparsity $k$, the standard Lasso does not achieve good recovery, whereas the group Lasso achieves near-perfect recovery of the original signal.

[Figure 1: Recovery results when the assumed group structure is correct. (a) Original data; (b) results with Lasso (recovery error 0.3444); (c) results with group Lasso (recovery error 0.0419).]

Figure 2(a) shows the effect of the sample size $n$, where we report the recovery error averaged over 100 random runs for each sample size. Group Lasso is clearly superior in this case. These results show that group Lasso can achieve better recovery performance for $(g, k)$ strongly group-sparse signals with fewer measurements, which is consistent with our theory.

To study the effect of the group number $g$ (with $k$ fixed), we set the sample size to $n = 160$ and vary the group number while keeping the other parameters unchanged. Figure 2(b) shows the recovery performance of the two algorithms, averaged over 100 random runs for each setting.

[Figure 2: Recovery performance: (a) recovery error vs. sample size ratio $n/k$; (b) recovery error vs. group number $g$.]

As expected, the recovery performance of Lasso is independent of the group number, within statistical error. Moreover, the recovery results for group Lasso are significantly better when the group number $g$ is much smaller than the sparsity $k = 64$. When $g = k$, group Lasso becomes identical to Lasso, which is expected. This shows that the recovery performance of group Lasso degrades as $g/k$ increases, which confirms our theory.
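The data-generation protocol above is easy to reproduce. The sketch below builds one instance of the Section 6.1.1 setup ($p = 512$, $k = 64$, $g = 16$, $k_0 = 4$, $n = 192$) and computes the recovery-error metric. It reuses the `group_lasso` sketch from Section 2, and the fixed $\lambda$ below stands in for the paper's five-fold cross validation:

```python
import numpy as np

rng = np.random.default_rng(1)
p, k0, m, g, n, sigma = 512, 4, 128, 16, 192, 0.01
groups = [np.arange(j * k0, (j + 1) * k0) for j in range(m)]

# (g, k) strongly group-sparse ground truth with +/-1 entries on g groups.
beta_true = np.zeros(p)
for j in rng.choice(m, size=g, replace=False):
    beta_true[groups[j]] = rng.choice([-1.0, 1.0], size=k0)

# Design with unit-norm rows, plus Gaussian noise (Section 6.1 protocol).
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = X @ beta_true + sigma * rng.standard_normal(n)

lam = 0.05  # illustrative value; the paper tunes lambda by five-fold CV
beta_hat = group_lasso(X, y, groups, [lam * np.sqrt(k0) / np.sqrt(n)] * m)
err = np.linalg.norm(beta_hat - beta_true) / np.linalg.norm(beta_true)
print(f"group Lasso recovery error: {err:.3f}")
```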
6.1.2 With incorrect group structure

In this experiment, we assume that the known group structure does not exactly match the sparsity of the signal (that is, $k > \|\bar{\beta}\|_0$). We randomly generate strongly group-sparse coefficients with values $\pm 1$, where $p = 512$, $\|\bar{\beta}\|_0 = 64$, and $g = 16$. In the first experiment, we let $k = 4\|\bar{\beta}\|_0$ and use $m = 32$ groups with even group size $k_0 = 16$. Figure 3 shows one instance of the generated sparse signal and the results recovered by Lasso and group Lasso, respectively, when $n = 3\|\bar{\beta}\|_0 = 192$. In this case, the standard Lasso obtains better recovery results than the group Lasso.

[Figure 3: Recovery results when the assumed group structure is incorrect. (a) Original data; (b) results with Lasso (recovery error 0.3616); (c) results with group Lasso (recovery error 0.6688).]

Figure 4(a) shows the effect of the sample size $n$, where we report the recovery error averaged over 100 random runs for each sample size. The group Lasso recovery performance is clearly inferior to that of the Lasso. This shows that group Lasso fails when $k/\|\bar{\beta}\|_0$ is relatively large, which is consistent with our theory.

To study the effect of $k/\|\bar{\beta}\|_0$ on group Lasso performance, we keep $\|\bar{\beta}\|_0$ fixed and vary the group size as $k_0 = 1, 2, 4, 8, 16, 32, 64$, which gives $k/\|\bar{\beta}\|_0 = 1, 1, 1, 2, 4, 8, 16$. Figure 4(b) shows the performance of the two algorithms for the different group sizes $k_0$ in terms of recovery error. The performance of group Lasso is better when $k/\|\bar{\beta}\|_0 = 1$; however, when $k/\|\bar{\beta}\|_0 > 1$, the performance of group Lasso deteriorates.

[Figure 4: Recovery performance: (a) recovery error vs. sample size ratio $n/k$; (b) recovery error vs. group size $k_0$.]

6.2 Uneven group size

In this set of experiments, we randomly generate $(g, k)$ strongly sparse coefficients with values $\pm 1$, where $p = 512$ and $g = 4$. There are 64 unevenly sized groups. The projection matrix $X$ and the noise are generated as in the even group size case. Our task is to compare the recovery performance of Lasso and group Lasso for $(g, k)$ strongly sparse signals with $\|\bar{\beta}\|_0 = k$. To reduce variance, we run each experiment 100 times and report the average performance.

In the first experiment, the sizes of the 64 groups are randomly generated, and the $g = 4$ active groups are randomly drawn from these 64 groups. Figure 5(a) shows the recovery performance of Lasso and group Lasso with increasing sample size (measurements) in terms of recovery error. As in the even group size case, group Lasso obtains better recovery results than Lasso. This shows that group Lasso is superior when the group sizes are randomly uneven.
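An uneven partition of the kind used here can be generated, for example, by cutting $\{1, \ldots, p\}$ at $m - 1$ random interior points; this is one possible protocol, since the paper does not specify the exact size distribution:

```python
import numpy as np

rng = np.random.default_rng(2)
p, m = 512, 64
# Cut {0, ..., p-1} at m-1 random interior points to get m uneven groups.
cuts = np.sort(rng.choice(np.arange(1, p), size=m - 1, replace=False))
bounds = np.concatenate(([0], cuts, [p]))
groups = [np.arange(bounds[i], bounds[i + 1]) for i in range(m)]
sizes = np.array([len(G) for G in groups])
print("group sizes: min", sizes.min(), "max", sizes.max(), "sum", sizes.sum())
```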
[Figure 5: Recovery performance: (a) the $g$ active groups have randomly uneven sizes; (b) half of the $g$ active groups are single-element groups and the other half have large sizes.]

As discussed after Theorem 5.1, because group Lasso favors large groups, if the signal is contained in small groups, then the performance of group Lasso can be relatively poor. To confirm this claim of Theorem 5.1, we consider the special case where 32 groups have large sizes and each of the remaining 32 groups has only one element. First, we consider the case where half of the $g = 4$ active groups are drawn from the single-element groups and the other half from the large groups. Figure 5(b) shows the signal recovery performance of Lasso and group Lasso. Group Lasso clearly performs better, but the results are not as good as those of Figure 5(a). Moreover, Figure 6(a) shows the recovery performance of Lasso and group Lasso when all of the $g = 4$ active groups are drawn from large groups; we observe that the relative performance of group Lasso improves. Finally, Figure 6(b) shows the recovery performance when all of the $g = 4$ active groups are drawn from single-element groups. Group Lasso is clearly inferior to Lasso in this case. This confirms the prediction of Theorem 5.1 that group Lasso favors large groups.

[Figure 6: Recovery performance: (a) all $g$ active groups have large sizes; (b) all $g$ active groups are single-element groups.]

7 Conclusion

In this paper we introduced a concept called strong group sparsity that characterizes the signal recovery performance of group Lasso. In particular, we showed that group Lasso is superior to standard Lasso when the underlying signal is strongly group-sparse:

- Group Lasso is more robust to noise due to the stability associated with group structure.
- Group Lasso requires a smaller sample size to satisfy the sparse eigenvalue condition required in the modern sparsity analysis.

However, group Lasso can be inferior if the signal is only weakly group-sparse or is covered by groups with small sizes. Moreover, group Lasso does not perform well with overlapping groups (which are not analyzed in this paper). Better learning algorithms are needed to overcome these limitations.

References

[1] Francis R. Bach. Consistency of the group Lasso and multiple kernel learning. JMLR, 9:1179-1225, 2008.
[2] Peter Bickel, Ya'acov Ritov, and Alexandre Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 2008. To appear.
[3] Emmanuel J. Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51:4203-4215, 2005.
[4] S. Ji, D. Dunson, and L. Carin. Multi-task compressive sensing. IEEE Transactions on Signal Processing, 2008. Accepted.
[5] Vladimir Koltchinskii and Ming Yuan. Sparse recovery in large ensembles of kernel machines. In COLT'08, 2008.
[6] Karim Lounici, Massimiliano Pontil, Alexandre B. Tsybakov, and Sara A. van de Geer. Taking advantage of sparsity in multi-task learning. Submitted, 2009.
[7] Yuval Nardi and Alessandro Rinaldo. On the asymptotic properties of the group lasso estimator for linear models. Electronic Journal of Statistics, 2:605-633, 2008.
[8] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Union support recovery in high-dimensional multivariate regression. Technical Report 761, UC Berkeley, 2008.
[9] G. Pisier. The Volume of Convex Bodies and Banach Space Geometry. Cambridge University Press, 1989.
[10] Holger Rauhut, Karin Schnass, and Pierre Vandergheynst. Compressed sensing and redundant dictionaries. IEEE Transactions on Information Theory, 54(5), 2008.
[11] M. Stojnic, F. Parvaresh, and B. Hassibi. On the reconstruction of block-sparse signals with an optimal number of measurements. Preprint, 2008.
[12] D. Wipf and B. Rao. An empirical Bayesian strategy for solving the simultaneous sparse approximation problem. IEEE Transactions on Signal Processing, 55(7):3704-3716, 2007.
[13] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49-67, 2006.
[14] Tong Zhang. Some sharp performance bounds for least squares regression with L1 regularization. The Annals of Statistics, 2009. To appear.

A Proof of Proposition 4.1

Without loss of generality, we may assume $\sigma_i > 0$ for all $i$ (otherwise, we can still let $\sigma_i > 0$ and then take the limit $\sigma_i \to 0$ for some $i$). For notational simplicity, we remove the subscript $j$ from the group index and consider a group $G$ with $k$ variables. Let $\Sigma$ be the diagonal matrix with the $\sigma_i$ as its diagonal elements. We can find an $n \times k$ matrix $Z = X_G(X_G^\top \Sigma X_G)^{-0.5}$ such that $Z^\top \Sigma Z = I_{k \times k}$. Let $\xi = Z^\top(\epsilon - \mathbb{E}\epsilon) \in \mathbb{R}^k$. Since for all $v \in \mathbb{R}^n$ we have $\|(X_G^\top X_G)^{-0.5}X_G^\top v\|_2 = \|(Z^\top Z)^{-0.5}Z^\top v\|_2$, it follows that
$$\frac{\|(X_G^\top X_G)^{-0.5}X_G^\top(\epsilon - \mathbb{E}\epsilon)\|_2^2}{\xi^\top \xi} \leq \sup_{v \in \mathbb{R}^n}\frac{v^\top Z(Z^\top Z)^{-1}Z^\top v}{v^\top Z Z^\top v} = \sup_{u \in \mathbb{R}^k}\frac{u^\top(Z^\top Z)^{-1}u}{u^\top u} = \sup_{u \in \mathbb{R}^k}\frac{u^\top Z^\top \Sigma Z u}{u^\top(Z^\top Z)u} \leq \sup_{v \in \mathbb{R}^n}\frac{v^\top \Sigma v}{v^\top v} \leq \sigma^2.$$
Therefore, we only need to show that, with probability at least $1 - \eta$ for all $\eta \in (0, 1)$,
$$\|\xi\|_2 \leq a\sqrt{k} + b\sqrt{-\ln \eta} \tag{3}$$
with $a = 1$ and $b = \sqrt{2}$. To prove this inequality, we note that the condition $Z^\top \Sigma Z = I_{k \times k}$ means that the covariance matrix of $\xi$ is $I_{k \times k}$. Therefore the components of $\xi$ are $k$ iid Gaussians $N(0, 1)$, and the distribution of $\|\xi\|_2^2$ is $\chi_k^2$. Many methods have been suggested to approximate the tail probability of the $\chi^2$ distribution. For example, a well-known approximation of $\|\xi\|_2$ is the normal $N(\sqrt{k - 0.5}, 0.5)$, which would imply $a = b = 1$ in (3). In the following, we derive a slightly weaker tail probability bound by direct integration of the tail probability, for $\delta \geq \sqrt{k}$:
$$
\begin{aligned}
P(\|\xi\|_2^2 \geq \delta^2) &= \frac{1}{\Gamma(k/2)2^{k/2}}\int_{x \geq \delta^2} x^{k/2-1}e^{-x/2}\,dx = \frac{2}{\Gamma(k/2)2^{k/2}}\int_{x \geq \delta} x^{k-1}e^{-x^2/2}\,dx \\
&= \frac{2\delta^{k-1}}{\Gamma(k/2)2^{k/2}}\int_{x \geq 0} e^{-(x+\delta)^2/2 + (k-1)\ln(1+x/\delta)}\,dx \\
&\leq \frac{2\delta^{k-1}e^{-\delta^2/2}}{\Gamma(k/2)2^{k/2}}\int_{x \geq 0} e^{-x^2/2 + x(-\delta + (k-1)/\delta)}\,dx \leq \frac{\sqrt{2\pi}\,\delta^{k-1}e^{-\delta^2/2}}{\Gamma(k/2)2^{k/2}} \\
&\leq \sqrt{0.5}\,(\delta/\sqrt{k})^{k-1}e^{-0.5\delta^2 + 0.5k} \leq \sqrt{0.5}\,e^{-\delta^2/2 + 0.5k + (k-1)(\delta/\sqrt{k} - 1)} \leq \sqrt{0.5}\,e^{-(\delta - \sqrt{k})^2/2}.
\end{aligned}
$$
This implies that (3) holds with $a = 1$ and $b = \sqrt{2}$. Note that in the above derivation we used the following Stirling lower bound for the Gamma function: $\Gamma(0.5k) \geq \sqrt{2\pi}\,(0.5k)^{0.5k - 0.5}e^{-0.5k}$.
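The bound (3) with $a = 1$, $b = \sqrt{2}$ is simple to check by simulation (a sanity-check sketch; the seed, dimension, and sample counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
k, trials = 8, 200_000
xi = rng.standard_normal((trials, k))
norms = np.linalg.norm(xi, axis=1)  # ||xi||_2 with xi ~ N(0, I_k)

for eta in (0.1, 0.01, 0.001):
    bound = np.sqrt(k) + np.sqrt(2.0) * np.sqrt(-np.log(eta))  # a=1, b=sqrt(2)
    print(f"eta={eta}: empirical P(||xi||_2 > bound) = "
          f"{np.mean(norms > bound):.4f} (should be <= {eta})")
```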
B Proof of Proposition 4.2

We consider the following group-greedy procedure, starting with $\bar{\beta}^{(0)} = \bar{\beta}$ and forming $(g^{(\ell)}, k^{(\ell)})$ strongly group-sparse $\bar{\beta}^{(\ell)}$ as follows for $\ell = 1, 2, \ldots$:

- let $r^{(\ell-1)} = X\bar{\beta}^{(\ell-1)} - \mathbb{E}y$;
- let $j^{(\ell)} = \arg\max_j \left[\|(X_{G_j}^\top X_{G_j})^{-0.5}X_{G_j}^\top r^{(\ell-1)}\|_2 / \sqrt{k_j a_0^2 + b_0^2}\right]$;
- let $\bar{\beta}^{(\ell)} = \bar{\beta}^{(\ell-1)}$, and then reset its coefficients in group $G_j$ as $\bar{\beta}^{(\ell)}_{G_j} = \bar{\beta}^{(\ell)}_{G_j} - (X_{G_j}^\top X_{G_j})^{-1}X_{G_j}^\top r^{(\ell-1)}$, where $j = j^{(\ell)}$.

It is not difficult to check that, with $j = j^{(\ell)}$,
$$\|r^{(\ell-1)}\|_2^2 - \|r^{(\ell)}\|_2^2 = \|(X_{G_j}^\top X_{G_j})^{-0.5}X_{G_j}^\top r^{(\ell-1)}\|_2^2, \quad k^{(\ell)} - k^{(\ell-1)} \leq k_j, \quad g^{(\ell)} - g^{(\ell-1)} \leq 1.$$
Therefore, if for all $0 \leq \ell \leq t$ we have
$$\max_j \left[\|(X_{G_j}^\top X_{G_j})^{-0.5}X_{G_j}^\top r^{(\ell)}\|_2 / \sqrt{k_j a_0^2 + b_0^2}\right] \geq \sqrt{n}\,\Delta / \sqrt{k a_0^2 + b_0^2},$$
then by summing over $\ell = 1, \ldots, t, t+1$, we obtain
$$n\Delta^2 = \|r^{(0)}\|_2^2 \geq \sum_{\ell=1}^{t+1}\left[\|r^{(\ell-1)}\|_2^2 - \|r^{(\ell)}\|_2^2\right] \geq n\sum_{\ell=1}^{t+1}\left[(k^{(\ell)} - k^{(\ell-1)})a_0^2 + (g^{(\ell)} - g^{(\ell-1)})b_0^2\right]\frac{\Delta^2}{k a_0^2 + b_0^2} \geq n\left[(k^{(t+1)} - k)a_0^2 + (g^{(t+1)} - g)b_0^2\right]\frac{\Delta^2}{k a_0^2 + b_0^2}.$$
This implies that $k^{(t+1)}a_0^2 + g^{(t+1)}b_0^2 \leq 2(k a_0^2 + g b_0^2)$. Therefore, if we let $t$ be the first time $k^{(t+1)}a_0^2 + g^{(t+1)}b_0^2 > 2(k a_0^2 + g b_0^2)$, then there exists $\ell \leq t$ such that $\bar{\beta}' = \bar{\beta}^{(\ell)}$ satisfies the requirement.
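The group-greedy procedure above is constructive, so one step of it is straightforward to implement. A sketch under the assumption that each $X_{G_j}^\top X_{G_j}$ is invertible (the function and variable names are ours):

```python
import numpy as np

def group_greedy_step(X, beta, Ey, groups, a0, b0):
    """One step of the Appendix B procedure: pick the group j maximizing
    ||(X_Gj^T X_Gj)^{-1/2} X_Gj^T r||_2 / sqrt(k_j a0^2 + b0^2), then refit
    beta on that group against the current residual."""
    r = X @ beta - Ey  # residual r^(l-1)
    best, best_score = None, -np.inf
    for j, G in enumerate(groups):
        XG = X[:, G]
        gram = XG.T @ XG
        v = XG.T @ r
        # ||gram^{-1/2} v||_2^2 = v^T gram^{-1} v
        score = np.sqrt(v @ np.linalg.solve(gram, v))
        score /= np.sqrt(len(G) * a0**2 + b0**2)
        if score > best_score:
            best, best_score = j, score
    G = groups[best]
    XG = X[:, G]
    beta = beta.copy()
    beta[G] -= np.linalg.solve(XG.T @ XG, XG.T @ r)  # exact refit on group G
    return beta, best_score
```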
C Proof of Proposition 4.3

The following lemma is taken from [9]. Since the proof is simple, it is included for completeness.

Lemma C.1 Consider the unit sphere $S^{k-1} = \{x : \|x\|_2 = 1\}$ in $\mathbb{R}^k$ ($k \geq 1$). For any $\varepsilon > 0$, there exists an $\varepsilon$-cover $Q \subset S^{k-1}$ such that $\min_{q \in Q}\|x - q\|_2 \leq \varepsilon$ for all $\|x\|_2 = 1$, with $|Q| \leq (1 + 2/\varepsilon)^k$.

Proof. Let $B^k = \{x : \|x\|_2 \leq 1\}$ be the unit ball in $\mathbb{R}^k$, and let $Q = \{q_i\}_{i=1,\ldots,|Q|} \subset S^{k-1}$ be a maximal subset such that $\|q_i - q_j\|_2 > \varepsilon$ for all $i \neq j$. By maximality, $Q$ is an $\varepsilon$-cover of $S^{k-1}$. Since the balls $q_i + (\varepsilon/2)B^k$ are disjoint and belong to $(1 + \varepsilon/2)B^k$, we have
$$\sum_{i \leq |Q|}\mathrm{vol}\big(q_i + (\varepsilon/2)B^k\big) \leq \mathrm{vol}\big((1 + \varepsilon/2)B^k\big).$$
Therefore $|Q|(\varepsilon/2)^k\,\mathrm{vol}(B^k) \leq (1 + \varepsilon/2)^k\,\mathrm{vol}(B^k)$, which implies $|Q| \leq (1 + 2/\varepsilon)^k$.

The following concentration result for the $\chi^2$ distribution is similar to Proposition 4.1. This is where the Gaussian assumption is used in the proof; a similar result holds for sub-Gaussian random variables.

Lemma C.2 Let $\xi \in \mathbb{R}^n$ be a vector of $n$ iid standard Gaussian variables, $\xi_i \sim N(0, 1)$. Then for all $\epsilon \geq 0$:
$$\Pr\left(\big|\|\xi\|_2 - \sqrt{n}\big| \geq \epsilon\right) \leq 3e^{-\epsilon^2/2}.$$

Proof. Proposition 4.1 implies that $\Pr(\|\xi\|_2 - \sqrt{n} \geq \epsilon) \leq \sqrt{0.5}\,e^{-\epsilon^2/2}$. Using a derivation identical to that in the proof of Proposition 4.1, with $\delta = \sqrt{n} - \epsilon$ and $k = n$, we obtain
$$\Pr\left(\|\xi\|_2 - \sqrt{n} \leq -\epsilon\right) \leq \frac{2\delta^{k-1}e^{-\delta^2/2}}{\Gamma(k/2)2^{k/2}}\int_{x \leq 0} e^{-x^2/2 + x(-\delta + (k-1)/\delta)}\,dx \leq \frac{2\delta^{k-1}e^{-\delta^2/2}}{\Gamma(k/2)2^{k/2}}\int_{x \leq 0} e^{-x^2/2 - x}\,dx \leq 3 \times \frac{\sqrt{2\pi}\,\delta^{k-1}e^{-\delta^2/2}}{\Gamma(k/2)2^{k/2}} \leq 3\sqrt{0.5}\,e^{-\epsilon^2/2}.$$
Combining the above two inequalities, we obtain the desired bound.

The derivation of the following estimate employs a standard proof technique (for example, see [10]).

Lemma C.3 Suppose $X$ is generated according to Proposition 4.3. For any fixed set $S \subset \{1, \ldots, p\}$ with $|S| = k$ and $0 < \delta < 1$, we have, with probability exceeding $1 - 3(1 + 8/\delta)^k e^{-n\delta^2/8}$:
$$(1 - \delta)\|\beta\|_2 \leq \frac{1}{\sqrt{n}}\|X_S\beta\|_2 \leq (1 + \delta)\|\beta\|_2 \tag{4}$$
for all $\beta \in \mathbb{R}^k$.

Proof. It is enough to prove the conclusion in the case $\|\beta\|_2 = 1$. According to Lemma C.1, given $\epsilon_1 > 0$, there exists a finite set $Q = \{q_i\}$ with $|Q| \leq (1 + 2/\epsilon_1)^k$ such that $\|q_i\|_2 = 1$ for all $i$, and $\min_i\|\beta - q_i\|_2 \leq \epsilon_1$ for all $\|\beta\|_2 = 1$. For each $i$, since the elements of $\xi = X_S q_i$ are iid Gaussians $N(0, 1)$, Lemma C.2 implies that for all $\epsilon_2 > 0$:
$$\Pr\left(\big|\|X_S q_i\|_2 - \sqrt{n}\|q_i\|_2\big| \geq \sqrt{n}\,\epsilon_2\right) \leq 3e^{-n\epsilon_2^2/2}.$$
Taking the union bound over all $q_i \in Q$, we obtain, with probability exceeding $1 - 3(1 + 2/\epsilon_1)^k e^{-n\epsilon_2^2/2}$: for all $q_i \in Q$,
$$(1 - \epsilon_2) \leq \frac{1}{\sqrt{n}}\|X_S q_i\|_2 \leq (1 + \epsilon_2).$$
Now define $\rho$ as the smallest non-negative number such that
$$\frac{1}{\sqrt{n}}\|X_S\beta\|_2 \leq (1 + \rho) \tag{5}$$
for all $\beta \in \mathbb{R}^k$ with $\|\beta\|_2 = 1$. Since for all $\|\beta\|_2 = 1$ we can find $q_i \in Q$ such that $\|\beta - q_i\|_2 \leq \epsilon_1$, we have
$$\|X_S\beta\|_2 \leq \|X_S q_i\|_2 + \|X_S(\beta - q_i)\|_2 \leq \sqrt{n}\,(1 + \epsilon_2 + (1 + \rho)\epsilon_1),$$
where we used (5) in the derivation. Since $\rho$ is the smallest non-negative constant for which (5) holds, we have $\sqrt{n}(1 + \rho) \leq \sqrt{n}(1 + \epsilon_2 + (1 + \rho)\epsilon_1)$, which implies $\rho \leq (\epsilon_1 + \epsilon_2)/(1 - \epsilon_1)$. Now choose $\epsilon_1 = \delta/4$ and $\epsilon_2 = \delta/2$. Since $0 < \delta < 1$, it is easy to see that $\rho \leq \delta$. This proves the upper bound. For the lower bound, we note that for all $\|\beta\|_2 = 1$ with $\|\beta - q_i\|_2 \leq \epsilon_1$, we have
$$\|X_S\beta\|_2 \geq \|X_S q_i\|_2 - \|X_S(\beta - q_i)\|_2 \geq \sqrt{n}\,(1 - \epsilon_2 - (1 + \rho)\epsilon_1),$$
which leads to the desired result.

Proof of Proposition 4.3. For each subset of groups $S \subset \{1, \ldots, m\}$ with $|S| \leq g$ and $|G_S| \leq k$, we know from Lemma C.3 that for all $\beta$ with $\mathrm{supp}(\beta) \subset G_S$:
$$(1 - \delta)\|\beta\|_2 \leq \frac{1}{\sqrt{n}}\|X\beta\|_2 \leq (1 + \delta)\|\beta\|_2$$
with probability exceeding $1 - 3(1 + 8/\delta)^k e^{-n\delta^2/8}$. Since the number of such group sets $S$ can be no more than $\binom{m}{g} \leq (em/g)^g$, by taking the union bound, we know that the group-RIP inequality (2) fails with probability less than $3(em/g)^g(1 + 8/\delta)^k e^{-n\delta^2/8} \leq e^{-t}$.

D Technical Lemmas

The following lemmas are adapted from [14] to handle the group sparsity structure. Similar techniques can be found in [2]. The first lemma is from [14]; the proof is included for completeness.

Lemma D.1 Let $A = X^\top X/n$, and let $I$ and $J$ be non-overlapping index sets in $\{1, \ldots, p\}$. We have
$$\|A_{I,J}\|_2 \leq \sqrt{(\rho_+(I) - \rho_-(I \cup J))(\rho_+(J) - \rho_-(I \cup J))},$$
where the matrix 2-norm is defined as $\|A_{I,J}\|_2 = \sup_{\|u\|_2 = \|v\|_2 = 1}|u^\top A_{I,J}v|$.
Proof. Consider $v \in \mathbb{R}^p$ with $v_I \in \mathbb{R}^{|I|}$ and $v_J \in \mathbb{R}^{|J|}$. Positive semi-definiteness implies that, for all $t$,
$$\rho_+(I)\|v_I\|_2^2 + 2tv_I^\top A_{I,J}v_J + t^2\rho_+(J)\|v_J\|_2^2 \geq v_I^\top A_{I,I}v_I + 2tv_I^\top A_{I,J}v_J + t^2v_J^\top A_{J,J}v_J \geq \rho_-(I \cup J)\big(\|v_I\|_2^2 + t^2\|v_J\|_2^2\big).$$
This implies
$$|v_I^\top A_{I,J}v_J| \leq \sqrt{(\rho_+(I) - \rho_-(I \cup J))(\rho_+(J) - \rho_-(I \cup J))}\,\|v_I\|_2\|v_J\|_2,$$
which leads to the desired result.

The next lemma uses the previous result to control the contribution of the non-signal part $G^c$ of an error vector $u$ to the product $u_G^\top A_{G,G^c}u_{G^c}$.

Lemma D.2 Given $u \in \mathbb{R}^p$ and $S \subset \{1, \ldots, m\}$, consider $\ell \geq 1$ and define
$$\lambda_-^2 = \min\Big\{\sum_{j \in S'}\lambda_j^2 : |G_{S'}| \geq \ell\Big\}.$$
Let $S_0 \subset \{1, \ldots, m\} - S$ contain the indices $j$ of the largest values of $\|u_{G_j}\|_2/\lambda_j$ ($j \notin S$), chosen to satisfy $\ell \leq |G_{S_0}| < \ell + k_0$. Let $G = G_S \cup G_{S_0}$. Then
$$\sqrt{\sum_{j \notin S \cup S_0}\|u_{G_j}\|_2^2} \leq (2\lambda_-)^{-1}\sum_{j \notin S}\lambda_j\|u_{G_j}\|_2$$
and
$$\frac{1}{n}\Big|\sum_{j \notin S \cup S_0}u_G^\top X_G^\top X_{G_j}u_{G_j}\Big| \leq \lambda_-^{-1}\tilde{\rho}_+\|u_G\|_2\sum_{j \notin S}\lambda_j\|u_{G_j}\|_2,$$
where
$$\tilde{\rho}_+ = \sqrt{(\rho_+(G) - \rho_-(|G| + \ell + k_0 - 1))(\rho_+(\ell + k_0 - 1) - \rho_-(|G| + \ell + k_0 - 1))}.$$

Proof. Without loss of generality, we assume $S = \{1, \ldots, g\}$ and that the indices $j > g$ are in descending order of $\|u_{G_j}\|_2/\lambda_j$. Let $S_0, S_1, \ldots$ be the first, second, etc., consecutive blocks of indices $j > g$ such that $\ell \leq |G_{S_k}| < \ell + k_0$ (except possibly for the last $S_k$). If we let $G_k = G_{S_k}$, then:
$$\sum_{j \notin S \cup S_0}\|u_{G_j}\|_2^2 \leq \Big(\sum_{j \notin S \cup S_0}\lambda_j\|u_{G_j}\|_2\Big)\Big(\max_{j \notin S \cup S_0}\frac{\|u_{G_j}\|_2}{\lambda_j}\Big) \leq \Big(\sum_{j \notin S \cup S_0}\lambda_j\|u_{G_j}\|_2\Big)\Big(\min_{j \in S_0}\frac{\|u_{G_j}\|_2}{\lambda_j}\Big) \leq \Big(\sum_{j \notin S \cup S_0}\lambda_j\|u_{G_j}\|_2\Big)\frac{\sum_{j \in S_0}\lambda_j\|u_{G_j}\|_2}{\sum_{j \in S_0}\lambda_j^2} \leq \frac{\big[\sum_{j \notin S}\lambda_j\|u_{G_j}\|_2\big]^2}{4\lambda_-^2}.$$
This proves the first inequality of the lemma. Similarly, we have
$$
\begin{aligned}
\sum_{k \geq 1}\|u_{G_k}\|_2 &= \sum_{k \geq 1}\sqrt{\sum_{j \in S_k}\|u_{G_j}\|_2^2} \leq \sum_{k \geq 1}\sqrt{\sum_{j \in S_k}\lambda_j\|u_{G_j}\|_2}\sqrt{\max_{j \in S_k}\frac{\|u_{G_j}\|_2}{\lambda_j}} \\
&\leq \sum_{k \geq 1}\sqrt{\sum_{j \in S_k}\lambda_j\|u_{G_j}\|_2}\sqrt{\min_{j \in S_{k-1}}\frac{\|u_{G_j}\|_2}{\lambda_j}} \leq \sum_{k \geq 1}\sqrt{\sum_{j \in S_k}\lambda_j\|u_{G_j}\|_2}\sqrt{\frac{\sum_{j \in S_{k-1}}\lambda_j\|u_{G_j}\|_2}{\sum_{j \in S_{k-1}}\lambda_j^2}} \\
&\leq \lambda_-^{-1}\sum_{k \geq 1}\sqrt{\sum_{j \in S_k}\lambda_j\|u_{G_j}\|_2}\sqrt{\sum_{j \in S_{k-1}}\lambda_j\|u_{G_j}\|_2} \leq \lambda_-^{-1}\sum_{k \geq 1}\frac{1}{2}\Big[\sum_{j \in S_k}\lambda_j\|u_{G_j}\|_2 + \sum_{j \in S_{k-1}}\lambda_j\|u_{G_j}\|_2\Big] \\
&\leq \lambda_-^{-1}\sum_{k \geq 0}\sum_{j \in S_k}\lambda_j\|u_{G_j}\|_2 = \lambda_-^{-1}\sum_{j \notin S}\lambda_j\|u_{G_j}\|_2.
\end{aligned}
$$
Therefore
$$n^{-1}\Big|\sum_{j \notin S \cup S_0}u_G^\top X_G^\top X_{G_j}u_{G_j}\Big| \leq n^{-1}\sum_{k \geq 1}|u_G^\top X_G^\top X_{G_k}u_{G_k}| \leq n^{-1}\sum_{k \geq 1}\|X_G^\top X_{G_k}\|_2\|u_{G_k}\|_2\|u_G\|_2 \leq \tilde{\rho}_+\|u_G\|_2\sum_{k \geq 1}\|u_{G_k}\|_2 \leq \tilde{\rho}_+\lambda_-^{-1}\|u_G\|_2\sum_{j \notin S}\lambda_j\|u_{G_j}\|_2.$$
Note that Lemma D.1 is used to bound $\|X_G^\top X_{G_k}\|_2$. This proves the second inequality of the lemma.

The following lemma shows that the group $L_1$-norm of the group Lasso estimator's non-signal part is small (compared to the group $L_1$-norm of the parameter estimation error in the signal part).

Lemma D.3 Let $\mathrm{supp}(\bar{\beta}) \subset G_S$ for some $S \subset \{1, \ldots, m\}$. Assume that for all $j$:
$$\lambda_j \geq 4\rho_+(G_j)^{1/2}\|(X_{G_j}^\top X_{G_j})^{-1/2}X_{G_j}^\top \epsilon\|_2/\sqrt{n}.$$
Then the solution of (1) satisfies
$$\sum_{j \notin S}\lambda_j\|\hat{\beta}_{G_j}\|_2 \leq 3\sum_{j \in S}\lambda_j\|\bar{\beta}_{G_j} - \hat{\beta}_{G_j}\|_2.$$

Proof. The first-order condition is
$$2X^\top X(\hat{\beta} - \bar{\beta}) - 2X^\top \epsilon + \sum_{j=1}^{m}\lambda_j n\,\frac{\hat{\beta}_{G_j}}{\|\hat{\beta}_{G_j}\|_2} = 0. \tag{6}$$
By multiplying both sides by $(\hat{\beta} - \bar{\beta})^\top$, we obtain
$$0 \geq -2(\hat{\beta} - \bar{\beta})^\top X^\top X(\hat{\beta} - \bar{\beta}) = -2(\hat{\beta} - \bar{\beta})^\top X^\top \epsilon + \sum_{j=1}^{m}\lambda_j n\,(\hat{\beta} - \bar{\beta})_{G_j}^\top\frac{\hat{\beta}_{G_j}}{\|\hat{\beta}_{G_j}\|_2}.$$
Therefore
$$
\begin{aligned}
\sum_{j \notin S}\lambda_j\|\hat{\beta}_{G_j}\|_2 &\leq \sum_{j \in S}\lambda_j\|\bar{\beta}_{G_j} - \hat{\beta}_{G_j}\|_2 + 2(\hat{\beta} - \bar{\beta})^\top X^\top \epsilon/n \\
&\leq \sum_{j \in S}\lambda_j\|\bar{\beta}_{G_j} - \hat{\beta}_{G_j}\|_2 + 2\sum_{j=1}^{m}\rho_+(G_j)^{1/2}\|(\hat{\beta} - \bar{\beta})_{G_j}\|_2\,\|(X_{G_j}^\top X_{G_j})^{-1/2}X_{G_j}^\top \epsilon\|_2/\sqrt{n} \\
&\leq \sum_{j \in S}\lambda_j\|\bar{\beta}_{G_j} - \hat{\beta}_{G_j}\|_2 + 0.5\sum_{j=1}^{m}\lambda_j\|(\hat{\beta} - \bar{\beta})_{G_j}\|_2,
\end{aligned}
$$
where the last inequality follows from the assumption of the lemma. Simplifying the above inequality yields the desired bound.

The following lemma bounds the parameter estimation error by combining the previous two lemmas.

Lemma D.4 Let $\mathrm{supp}(\bar{\beta}) \subset G_S$ for some $S \subset \{1, \ldots, m\}$. Consider $\ell \geq 1$ and let $s = |G_S| + \ell + k_0 - 1$. Define
$$\lambda_-^2 = \min\Big\{\sum_{j \in S'}\lambda_j^2 : |G_{S'}| \geq \ell\Big\}, \qquad \tilde{\rho}_+ = \sqrt{(\rho_+(s) - \rho_-(2s - |G_S|))(\rho_+(s - |G_S|) - \rho_-(2s - |G_S|))}.$$
If for all $j$:
$$\lambda_j \geq 4\rho_+(G_j)^{1/2}\|(X_{G_j}^\top X_{G_j})^{-1/2}X_{G_j}^\top \epsilon\|_2/\sqrt{n}, \qquad \text{and} \qquad \frac{6\tilde{\rho}_+}{\rho_-(s)} \leq \frac{\lambda_-}{\sqrt{\sum_{j \in S}\lambda_j^2}},$$
then the solution of (1) satisfies
$$\|\hat{\beta} - \bar{\beta}\|_2 \leq \frac{1.5}{\rho_-(s)}\Big(1 + 1.5\lambda_-^{-1}\sqrt{\sum_{j \in S}\lambda_j^2}\Big)\sqrt{\sum_{j \in S}\lambda_j^2}.$$

Proof. Define $S_0$ as in Lemma D.2 and let $G = \bigcup_{j \in S \cup S_0}G_j$. By multiplying both sides of (6) by $(\hat{\beta} - \bar{\beta})_G^\top$, we obtain
$$2(\hat{\beta} - \bar{\beta})_G^\top X_G^\top X(\hat{\beta} - \bar{\beta}) - 2(\hat{\beta} - \bar{\beta})_G^\top X_G^\top \epsilon + \sum_{j \in S \cup S_0}\lambda_j n\,(\hat{\beta} - \bar{\beta})_{G_j}^\top\frac{\hat{\beta}_{G_j}}{\|\hat{\beta}_{G_j}\|_2} = 0.$$
Similar to the proof of Lemma D.3, we use the assumptions on $\lambda_j$ to obtain:
$$4n^{-1}(\hat{\beta} - \bar{\beta})_G^\top X_G^\top X(\hat{\beta} - \bar{\beta}) + \sum_{j \in S_0}\lambda_j\|\hat{\beta}_{G_j}\|_2 \leq 3\sum_{j \in S}\lambda_j\|\hat{\beta}_{G_j} - \bar{\beta}_{G_j}\|_2. \tag{7}$$
Now, Lemma D.2 implies that
$$(\hat{\beta} - \bar{\beta})_G^\top X_G^\top X(\hat{\beta} - \bar{\beta}) \geq (\hat{\beta} - \bar{\beta})_G^\top X_G^\top X_G(\hat{\beta} - \bar{\beta})_G - \tilde{\rho}_+\lambda_-^{-1}n\,\|(\hat{\beta} - \bar{\beta})_G\|_2\sum_{j \notin S}\lambda_j\|(\hat{\beta} - \bar{\beta})_{G_j}\|_2.$$
By applying Lemma D.3, we have
$$
\begin{aligned}
n^{-1}(\hat{\beta} - \bar{\beta})_G^\top X_G^\top X(\hat{\beta} - \bar{\beta}) &\geq \rho_-(G)\|(\hat{\beta} - \bar{\beta})_G\|_2^2 - 3\tilde{\rho}_+\lambda_-^{-1}\|(\hat{\beta} - \bar{\beta})_G\|_2\sum_{j \in S}\lambda_j\|(\hat{\beta} - \bar{\beta})_{G_j}\|_2 \\
&\geq \rho_-(G)\|(\hat{\beta} - \bar{\beta})_G\|_2^2 - 3\tilde{\rho}_+\lambda_-^{-1}\sqrt{\sum_{j \in S}\lambda_j^2}\,\|(\hat{\beta} - \bar{\beta})_G\|_2^2 \geq 0.5\rho_-(G)\|(\hat{\beta} - \bar{\beta})_G\|_2^2.
\end{aligned}
$$
The assumption of the lemma is used to derive the last inequality. Now, plugging this inequality into (7), we have
$$\|(\hat{\beta} - \bar{\beta})_G\|_2^2 \leq 1.5\rho_-(G)^{-1}\sum_{j \in S}\lambda_j\|\hat{\beta}_{G_j} - \bar{\beta}_{G_j}\|_2 \leq 1.5\rho_-(G)^{-1}\sqrt{\sum_{j \in S}\lambda_j^2}\,\|(\hat{\beta} - \bar{\beta})_G\|_2.$$
This implies
$$\|(\hat{\beta} - \bar{\beta})_G\|_2^2 \leq 2.25\rho_-(G)^{-2}\sum_{j \in S}\lambda_j^2.$$
Now Lemma D.2 and Lemma D.3 imply that
$$\|\hat{\beta} - \bar{\beta}\|_2^2 - \|(\hat{\beta} - \bar{\beta})_G\|_2^2 \leq 0.25\lambda_-^{-2}\Big(\sum_{j \notin S}\lambda_j\|(\hat{\beta} - \bar{\beta})_{G_j}\|_2\Big)^2 \leq 2.25\lambda_-^{-2}\Big(\sum_{j \in S}\lambda_j\|(\hat{\beta} - \bar{\beta})_{G_j}\|_2\Big)^2 \leq 2.25\lambda_-^{-2}\sum_{j \in S}\lambda_j^2\,\|(\hat{\beta} - \bar{\beta})_G\|_2^2.$$
Combining the previous two displayed inequalities, we obtain the lemma.

E Proof of Theorem 5.1

Assumption 4.1 implies that, with probability larger than $1 - \eta$, uniformly for all groups $j$, we have
$$\|(X_{G_j}^\top X_{G_j})^{-0.5}X_{G_j}^\top(\epsilon - \mathbb{E}\epsilon)\|_2 \leq a\sqrt{k_j} + b\sqrt{\ln(m/\eta)}.$$
It follows that with the choice of $A$, $B$, and $\lambda_j$,
$$\lambda_j \geq 4\rho_+(G_j)^{1/2}\|(X_{G_j}^\top X_{G_j})^{-1/2}X_{G_j}^\top \epsilon\|_2/\sqrt{n}$$
for all $j$.
Moreover, the assumptions of the theorem also imply that $\tilde{\rho}_+ \leq \rho_+(s) - \rho_-(2s)$, and
$$\frac{\tilde{\rho}_+}{\rho_-(s)} \leq \frac{\rho_+(s) - \rho_-(2s)}{\rho_-(s)} \leq c \leq \frac{\sqrt{\ell A^2 + g_\ell B^2}}{6\sqrt{2(kA^2 + gB^2)}} \leq \frac{\lambda_-}{6\sqrt{\sum_{j \in S}\lambda_j^2}}.$$
Note that we have used the fact that, for any set of groups $S'$,
$$\sum_{j \in S'}\left[A^2k_j + B^2\right] \leq n\sum_{j \in S'}\lambda_j^2 \leq 2\sum_{j \in S'}\left[A^2k_j + B^2\right].$$
Therefore the conditions of Lemma D.4 are satisfied, and its conclusion implies that
$$\|\hat{\beta} - \bar{\beta}\|_2 \leq \frac{1.5}{\rho_-(s)}\Big(1 + 1.5\lambda_-^{-1}\sqrt{\sum_{j \in S}\lambda_j^2}\Big)\sqrt{\sum_{j \in S}\lambda_j^2} \leq \frac{1.5}{\rho_-(s)}\Big(1 + \frac{1}{4c}\Big)\sqrt{\sum_{j \in S}\lambda_j^2} \leq \frac{1.5}{\rho_-(s)}\Big(1 + \frac{1}{4c}\Big)\sqrt{2(A^2k + B^2g)/n}.$$
This proves the theorem.
