Learning with Structured Sparsity
This paper investigates a new learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this co…
Authors: Junzhou Huang, Tong Zhang, Dimitris Metaxas
We are interested in the sparse learning problem under the fixed design condition. Consider a fixed set of p basis vectors {x_1, ..., x_p} where x_j ∈ R^n for each j. Here, n is the sample size. Denote by X the n × p data matrix, with column j of X being x_j. Given a random observation y = [y_1, ..., y_n] ∈ R^n that depends on an underlying coefficient vector β̄ ∈ R^p, we are interested in the problem of estimating β̄ under the assumption that the target coefficient β̄ is sparse. Throughout the paper, we consider fixed design only. That is, we assume X is fixed, and randomization is with respect to the noise in the observation y.
We consider the situation where Ey can be approximated by a sparse linear combination of the basis vectors:
Ey ≈ Xβ̄, where we assume that β̄ is sparse. Define the support of a vector β ∈ R^p as supp(β) = {j : β_j ≠ 0}, and ‖β‖_0 = |supp(β)|. A natural method for sparse learning is L_0 regularization:

β̂ = arg min_{β ∈ R^p} Q̂(β) subject to ‖β‖_0 ≤ s,
where s is the desired sparsity. For simplicity, unless otherwise stated, the objective function considered throughout this paper is the least squares loss

Q̂(β) = ‖Xβ − y‖_2^2,
although other objective functions for generalized linear models (such as logistic regression) can be similarly analyzed.
Since this optimization problem is generally NP-hard, in practice one often considers approximate solutions. A standard approach is convex relaxation of L_0 regularization to L_1 regularization, often referred to as Lasso [22]. Another commonly used approach is greedy selection, such as orthogonal matching pursuit (OMP) [23].
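For concreteness, here is a minimal sketch of standard OMP for the least squares loss (our own illustration; the sparsity-budget stopping rule and variable names are choices we make here, not taken from the paper):

```python
import numpy as np

def omp(X, y, s):
    """Standard orthogonal matching pursuit: greedily add the feature most
    correlated with the residual, refit by least squares, and repeat until
    the sparsity budget s is reached."""
    n, p = X.shape
    support, coef = [], np.zeros(0)
    residual = y.copy()
    while len(support) < s:
        # pick the column most correlated with the current residual
        j = int(np.argmax(np.abs(X.T @ residual)))
        if j in support:  # no new feature improves the fit
            break
        support.append(j)
        # refit by least squares on the current support
        coef, _, _, _ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ coef
    beta = np.zeros(p)
    beta[support] = coef
    return beta
```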
In practical applications, one often knows a structure on the coefficient vector β̄ in addition to sparsity. For example, in group sparsity, one assumes that variables in the same group tend to be zero or nonzero simultaneously. The purpose of this paper is to study the more general estimation problem under structured sparsity. If meaningful structures exist, we show that one can take advantage of such structures to improve on standard sparse learning.
The idea of using structure in addition to sparsity has been explored before. An example is group structure, which has received much attention recently. For example, group sparsity has been considered for simultaneous sparse approximation [24] and multi-task compressive sensing [14] from the Bayesian hierarchical modeling point of view. Under the Bayesian hierarchical model framework, data from all sources contribute to the estimation of hyper-parameters in the sparse prior model. The shared prior can then be inferred from multiple sources. He et al. recently extended the idea to tree sparsity in the Bayesian framework [11,12]. Although the idea can be justified using standard Bayesian intuition, there are no theoretical results showing how much better (and under what kind of conditions) the resulting algorithms perform. In the statistical literature, Lasso has been extended to the group Lasso when there exist group/block structured dependences among the sparse coefficients [25]. However, none of the above-mentioned work showed an advantage of using group structure. Although some theoretical results were developed in [1,18], neither showed that group Lasso is superior to the standard Lasso. The authors of [15] showed that group Lasso can be superior to standard Lasso when each group is an infinite-dimensional kernel, by relying on the fact that meaningful analysis can be obtained for kernel methods in infinite dimension. In [19], the authors consider a special case of group Lasso in the multi-task learning scenario, and show that the number of samples required for recovering the exact support set is smaller for group Lasso under appropriate conditions. In [13], a theory for group Lasso was developed using a concept called strong group sparsity, which is a special case of the general structured sparsity idea considered here. It was shown in [13] that group Lasso is superior to standard Lasso for strongly group-sparse signals, which provides a convincing theoretical justification for using group structured sparsity.
While group Lasso works under the strong group sparsity assumption, it does not handle the more general structures considered in this paper. Several limitations of group Lasso were mentioned in [13]. For example, group Lasso does not correctly handle overlapping groups (in that overlapping components are over-counted); that is, a given coefficient should not belong to different groups. This requirement is too rigid for many practical applications. To address this issue, a method called composite absolute penalty (CAP) was proposed in [27] that can handle overlapping groups. Unfortunately, no theory has been established to demonstrate the effectiveness of the approach. In a related development [16], Kowalski et al. generalized the mixed-norm penalty to structured shrinkage, which can identify structured significance maps and thus handle overlapping groups. However, the structured shrinkage operations do not necessarily converge to a fixed point, and no additional theory was given to justify the method.
Other structures have also been explored in the literature. For example, so-called tonal and transient structures were considered for sparse decomposition of audio signals in [8], but again without any theory. Grimm et al. [10] investigated positive polynomials with structured sparsity from an optimization perspective. The theoretical result there did not address the effectiveness of such methods in comparison to standard sparsity. The closest work to ours is the recent paper [2], which we learned of after finishing this paper. In that paper, a specific case of structured sparsity, referred to as model-based sparsity, was considered. It is important to note that some theoretical results were obtained there showing the effectiveness of their method in compressive sensing, although in a more limited scope than the results presented here. Moreover, they do not provide a generic framework for structured sparsity. In their algorithm, different schemes have to be specifically designed for different data models, under specialized assumptions. It remains an open issue how to develop a general theory for structured sparsity, together with a general algorithm that can be applied to a wide class of such problems.
We see from the above discussion that there exists extensive literature on structured sparsity, with empirical evidence showing that one can achieve better performance by imposing additional structures. However, none of the previous work established a general theoretical framework for structured sparsity that can quantify its effectiveness. The goal of this paper is to develop such a general theory, addressing the following issues, with special attention to the benefit of structured sparsity over standard non-structured sparsity:
• quantifying structured sparsity;
• the minimal number of measurements required in compressive sensing;
• estimation accuracy under stochastic noise;
• an efficient algorithm that can solve a wide class of structured sparsity problems.
In structured sparsity, not all sparse patterns are equally likely. For example, in group sparsity, coefficients within the same group are more likely to be zeros or nonzeros simultaneously. This means that if a sparse coefficient vector's support set is consistent with the underlying group structure, then it is more likely to occur, and hence incurs a smaller penalty in learning. One contribution of this work is to formulate how to define structure on top of sparsity, and how to penalize each sparsity pattern.
In order to formalize the idea, we denote by I = {1, ..., p} the index set of the coefficients. To any sparse subset F ⊂ {1, ..., p} we assign a cost cl(F). In structured sparsity, the cost of F is an upper bound of the coding length of F (the number of bits needed to represent F by a computer program) in a pre-chosen prefix coding scheme. It is a well-known fact in information theory (e.g. [7]) that, mathematically, the existence of such a coding scheme is equivalent to the Kraft inequality

Σ_{F⊂I, F≠∅} 2^{−cl(F)} ≤ 1.
From the Bayesian statistics point of view, 2^{−cl(F)} can be regarded as a lower bound of the probability of F. The probability model of structured sparse learning is thus: first generate the sparsity pattern F according to probability 2^{−cl(F)}; then generate the coefficients in F.

Definition 3.1 A cost function cl(F) defined on subsets of I is called a coding length (in base-2) if

Σ_{F⊂I, F≠∅} 2^{−cl(F)} ≤ 1.
We give ∅ a coding length of 0. The corresponding structured sparse coding complexity of F is defined as

c(F) = |F| + cl(F).
Clearly if cl(F) is sub-additive, then the corresponding coding complexity c(F) is also sub-additive. Based on the structured coding complexity of subsets of I, we can now define the structured coding complexity of a sparse coefficient vector β̄ ∈ R^p.

Definition 3.2 Given a coding complexity c(F), the structured sparse coding complexity of a coefficient vector β̄ ∈ R^p is

c(β̄) = min{c(F) : supp(β̄) ⊂ F}.
Later in the paper, we will show that if a coefficient vector β̄ has a small coding complexity c(β̄), then β̄ can be effectively learned, with good in-sample prediction performance (in statistical learning) and reconstruction performance (in compressive sensing). To see why the definition requires adding |F| to cl(F), consider the generative model for structured sparsity mentioned earlier. In this model, the number of bits needed to encode a sparse coefficient vector is the sum of the number of bits needed to encode F (which is cl(F)) and the number of bits needed to encode the nonzero coefficients in F (which requires O(|F|) bits up to a fixed precision). Therefore the total number of bits required is cl(F) + O(|F|). This information-theoretic result translates into a statistical estimation result: without additional regularization, the learning complexity for least squares regression within any fixed support set F is O(|F|). By adding the model selection complexity cl(F) for each support set F, we obtain an overall statistical estimation complexity of O(cl(F) + |F|).
While the idea of using coding-based penalization is clearly motivated by the minimum description length (MDL) principle, the actual penalty we obtain for structured sparsity problems is different from the standard MDL penalty for model selection. This difference is important in sparse learning. Therefore, in order to prevent confusion, we avoid using MDL in our terminology. Nevertheless, one may consider our framework as a natural combination of the MDL idea and modern sparsity analysis. Before giving detailed examples, we introduce a general coding scheme called block coding. The basic idea of block coding is to define a coding scheme on a small number of base blocks (a block is a subset of I), and then define a coding scheme on all subsets of I using these base blocks.
Consider a subset B ⊂ 2^I. That is, each element (a block) of B is a subset of I. We call B a block set if I = ∪_{B∈B} B and all single-element sets {j} (j ∈ I) belong to B. Note that B may contain additional non-single-element blocks. The requirement that B contain all single-element sets is for convenience, as it implies that every subset F ⊂ I can be expressed as a union of blocks in B.
Let cl_0 be a code length on B:

Σ_{B∈B} 2^{−cl_0(B)} ≤ 1;

we define cl(B) = cl_0(B) + 1 for B ∈ B. It is not difficult to show that the following cost function is a coding length on 2^I:

cl(F) = min{Σ_{j=1}^b cl(B_j) : F = ∪_{j=1}^b B_j, B_j ∈ B}.

This is because

Σ_{F⊂I, F≠∅} 2^{−cl(F)} ≤ Σ_{b≥1} Σ_{B_1,...,B_b ∈ B} Π_{j=1}^b 2^{−cl(B_j)} ≤ Σ_{b≥1} 2^{−b} = 1.
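To make the construction concrete, the following sketch (our own illustration, not code from the paper) computes cl(F) for a given block set by a shortest-cover search; it is exponential-time in general and meant only for tiny index sets:

```python
import heapq
import math

def block_coding_length(F, blocks):
    """blocks maps each base block (a frozenset) to cl(B) = cl_0(B) + 1.
    Returns the block coding length of F: the cheapest way of writing F
    as a union of base blocks contained in F."""
    F = frozenset(F)
    best = {frozenset(): 0.0}
    heap = [(0.0, 0, frozenset())]
    counter = 1                       # tie-breaker so the heap never compares sets
    while heap:
        cost, _, covered = heapq.heappop(heap)
        if covered == F:
            return cost
        if cost > best.get(covered, math.inf):
            continue
        for B, clB in blocks.items():
            if not B <= F:            # the union must equal F exactly
                continue
            nxt = covered | B
            if cost + clB < best.get(nxt, math.inf):
                best[nxt] = cost + clB
                heapq.heappush(heap, (cost + clB, counter, nxt))
                counter += 1
    return math.inf                   # F cannot be covered by the blocks

# toy check on I = {0,1,2,3}: single-element blocks plus one pair block
p = 4
blocks = {frozenset([j]): 1 + math.log2(p) for j in range(p)}
blocks[frozenset([0, 1])] = 1 + math.log2(p)
print(block_coding_length({0, 1}, blocks))   # 3.0 via the pair block, not 6.0
```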
It is clear from the definition that block coding is sub-additive.
The main purpose of introducing block coding is to design computationally efficient algorithms based on the block structure. In particular, this paper considers a structured greedy algorithm that can take advantage of block structures. In the structured greedy algorithm, instead of searching over all subsets of I up to a fixed coding complexity s (the number of such subsets can be exponential in s), we greedily add blocks from B one at a time. Each search problem over B can be performed efficiently because we require that B contain only a computationally manageable number of base blocks. Therefore the algorithm is computationally efficient.
We will show that under appropriate conditions, a target coefficient vector with a small block coding complexity can be approximately learned using the structured greedy algorithm. This means that the block coding scheme has important algorithmic implications. That is, if a coding scheme can be approximated by block coding with a small number of base blocks, then the corresponding estimation problem can be approximately solved using the structured greedy algorithm. For this reason, we shall pay special attention to block coding approximation schemes for the examples discussed below.
A simple coding scheme is to code each subset F ⊂ I of cardinality k using k log_2(2p) bits, which corresponds to block coding with B consisting only of single-element sets, each base block having coding length cl_0 = log_2 p. This corresponds to the complexity of standard sparse learning.
A more general version is to consider single-element blocks B = {{j} : j ∈ I} with a non-uniform coding scheme cl_0({j}) = c_j such that Σ_j 2^{−c_j} ≤ 1. This leads to a non-uniform coding length on subsets B ⊂ I of cl(B) = |B| + Σ_{j∈B} c_j.
In particular, if a feature j is likely to be nonzero, we should give it a smaller coding length c_j, and if a feature j is likely to be zero, we should give it a larger coding length.
The concept of group sparsity has appeared in various recent works, such as the group Lasso in [25]. Consider a partition I = ∪_{j=1}^m G_j into m disjoint groups. Let B_G contain the m groups {G_j}, and let B_1 contain the p single-element blocks. The strong group sparsity coding scheme gives each element of B_1 a code length cl_0 of ∞, and each element of B_G a code length cl_0 of log_2 m. The block coding scheme with blocks B = B_G ∪ B_1 then leads to group sparsity, which only looks for signals consisting of whole groups. The resulting coding length is cl(B) = g log_2(2m) if B can be represented as the union of g disjoint groups G_j, and cl(B) = ∞ otherwise.
Note that if the signal can be expressed as the union of g groups, each of size k_0, then the group coding length g log_2(2m) can be significantly smaller than the standard sparsity coding length of gk_0 log_2 p. As we shall see later, the smaller coding complexity implies better learning behavior, which is essentially the advantage of using group sparse structure. It was shown in [13] that strong group sparsity as defined above also characterizes the performance of group Lasso. Therefore if a signal has a pre-determined group structure, group Lasso is superior to standard Lasso.
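A quick numeric illustration of this gap (our own example, not taken from the paper):

```python
import math

p, m, k0, g = 512, 64, 8, 4          # p variables, m groups of size k0, g active groups
k = g * k0                           # 32 nonzero coefficients in total
group_bits = g * math.log2(2 * m)    # strong group sparsity: 4 * 7 = 28 bits
standard_bits = k * math.log2(2 * p) # element-wise coding:  32 * 10 = 320 bits
print(group_bits, standard_bits)
```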
An extension of this idea is to allow more general block coding lengths cl_0(G_j) and cl_0({j}), so that

Σ_{j=1}^m 2^{−cl_0(G_j)} + Σ_{j=1}^p 2^{−cl_0({j})} ≤ 1.
This leads to non-uniform coding of the groups, so that a group that is more likely to be nonzero is given a smaller coding length.
One may also create a hierarchical group structure. A simple example is the wavelet coefficients of a signal [17]. Another simple example is a binary tree with the variables as leaves, which we describe below. Each internal node in the tree is associated with three options: left child only, right child only, or both children; each option can be encoded in log_2 3 bits.
Given a subset F ⊂ I, we can go down from the root of the tree and, at each node, decide whether only the left child contains elements of F, only the right child contains elements of F, or both children contain elements of F. The coding length of F is therefore log_2 3 times the total number of internal nodes leading to elements of F. Since each leaf corresponds to no more than log_2 p internal nodes, the total coding length is no worse than (log_2 3)(log_2 p)|F|. However, the coding length can be significantly smaller if the nodes are close to each other or clustered, because the paths to nearby leaves share internal nodes. In the extreme case, when the selected leaves are consecutive, the coding length is O(|F| + log_2 p). More generally, if we order the elements of F as F = {j_1, ..., j_q}, the coding length can be bounded in terms of the gaps between consecutive elements j_{i+1} − j_i.
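A small sketch of this computation (our own illustration) for a complete binary tree over p leaves, with p a power of two and heap-style node indexing:

```python
import math

def tree_coding_length(F, p):
    """Coding length of the leaf set F under the binary-tree scheme:
    log2(3) bits for each internal node on the union of root-to-leaf paths."""
    internal = set()
    for j in F:
        v = j + p          # heap index of leaf j in a complete binary tree
        while v > 1:
            v //= 2        # walk up toward the root
            internal.add(v)
    return math.log2(3) * len(internal)

p = 512
print(tree_coding_length(range(64), p))         # consecutive leaves: paths shared
print(tree_coding_length(range(0, 512, 8), p))  # scattered leaves: far more nodes
```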
If all internal nodes of the tree are also variables in I (for example, in the case of wavelet decomposition), then one may consider feature sets F with the following property: if a node is selected, then its parent is also selected. This requirement is very effective in wavelet compression, and is often referred to as the zero-tree structure [21]. Similar requirements have also been applied in statistics [27] for variable selection. The argument presented in this section shows that if we require F to satisfy the zero-tree structure, then its coding length is at most O(|F|), without any explicit dependence on the dimensionality p. This is because one does not have to reach a leaf node.
The tree-based coding scheme discussed in this section can be approximated by block coding using no more than p^{1+δ} base blocks (δ > 0). The idea is similar to that of the image coding example in the more general graph sparsity scheme, which we discuss next.
We consider a generalization of the hierarchical and group sparsity idea that employs a (directed or undirected) graph structure G on I. To the best of our knowledge, this general structure has not been considered in any previous work.
In graph sparsity, each variable (an element of I) is a node of G but G may also contain additional nodes that are not variables. For simplicity, we assume G contains a starting node (this requirement is not critical).
At each node v ∈ G, we define a coding length cl_v(S) on all subsets S of the neighborhood N_v of v (including the empty set), as well as on any other single node u ∈ G (with length cl_v(u)), normalized so that

Σ_{S⊂N_v} 2^{−cl_v(S)} + Σ_{u∈G} 2^{−cl_v(u)} ≤ 1.
To encode F ⊂ G, we start with an active set containing only the starting node, and finish when the active set becomes empty. At each node v taken from the active set, we may either pick a subset S ⊂ N_v, with coding length cl_v(S), or a single node u ∈ G, with coding length cl_v(u), and add the selection to the active set. We then remove v from the active set (once a node v is removed, it never returns to the active set). This process continues until the active set is empty.
As a concrete example, consider image processing, where each image is a rectangle of pixels (nodes); each pixel is connected to its four adjacent pixels, which forms the underlying graph structure. At each pixel, the number of subsets of its neighborhood is 2^4 = 16 (including the empty set), and each subset is given a coding length cl_v(S) = 5; we also encode every other pixel in the image with a random jump, each with a coding length 1 + log_2 p. Using this scheme, we can encode each connected region F with no more than log_2 p + 5|F| bits by growing the region from a single point within it. Therefore, if F is composed of g connected regions, the coding length is g log_2 p + 5|F|, which can be significantly better than the standard sparse coding length of |F| log_2 p.
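A sketch of this encoder on a 4-connected pixel grid (our own illustration under the costs stated above):

```python
import math
from collections import deque

def region_coding_length(region, width, height):
    """Coding length (in bits) of one 4-connected region of (x, y) pixels,
    encoded by growing it from an arbitrary seed: one random jump costing
    1 + log2(p) bits, plus cl_v(S) = 5 bits per processed pixel."""
    p = width * height
    region = set(region)
    seed = next(iter(region))
    bits = 1 + math.log2(p)          # jump to the seed pixel
    seen, queue = {seed}, deque([seed])
    while queue:
        x, y = queue.popleft()
        bits += 5                    # subset of the 4 neighbors to add
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in region and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return bits                      # = 1 + log2(p) + 5*|F| for a connected F
```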
This example shows that the general graph coding scheme presented here favors connected regions (that is, nodes that are grouped together with respect to the graph structure). In particular, it proves the following more general result.

Proposition 4.1 Given a graph G, there exists a constant C_G such that for any probability distribution q on G (Σ_{v∈G} q(v) = 1 and q(v) ≥ 0 for v ∈ G), the following quantity is a coding length on 2^G:

cl(F) = C_G |F| + Σ_{j=1}^g log_2 (1/q(v_j)),

where F ⊂ G can be decomposed into the union of g connected components F = ∪_{j=1}^g F_j, and v_j is a node of F_j.
Our simple and suboptimal coding scheme described for the image processing example gives an upper bound C_G ≤ 1 + d_G, where d_G is the maximum degree of G. However, for many graphs, one can improve this constant to O(log d_G) with a slightly more complicated argument. As a simple application of graph coding, consider the special case where there is only one connected component, containing the starting node v_0. We can simply let q(v_0) = 1, and the coding length is O(|F|), independent of the dimensionality p. This generalizes the similar claim for the zero-tree structure described earlier.
The graph coding scheme can be approximated with block coding. The idea is to consider relatively small base blocks consisting of nodes that are close together with respect to the graph structure, and then use the induced block coding scheme to approximate the graph coding.
For example, for the image coding example discussed previously, we can use connected blocks of size up to δ log_2 p / 5 as base blocks of B (δ > 0). Since each base block can be encoded with (1 + δ) log_2 p bits by the earlier discussion, the total number of base blocks is no more than p^{1+δ}. We can give each such block a coding length of (1 + δ) log_2 p. A connected region F can be covered by O(1 + |F|/log_2 p) such blocks, so the corresponding block coding length of F is cl(F) = O(|F| + log_2 p), which matches the original graph coding length up to a constant. This means that the graph coding length can be approximated by a block coding scheme. As we have pointed out, such an approximation is useful because the latter is required by the structured greedy algorithm proposed in this paper.
Let z_j ∈ {0, 1} be a random variable indicating whether j ∈ I is selected. The most general coding scheme is to consider a joint probability distribution of z = [z_1, ..., z_p]. The coding length for F can be defined as −log_2 p(z_1, ..., z_p), with z_j = I(j ∈ F) indicating whether j ∈ F.
Such a probability distribution can often be conveniently represented as a binary random field on an underlying graph. In order to encourage sparsity, the marginal probability that z_j = 1 should on average be close to O(1/p), so that the expected number of j's with z_j = 1 is O(1). For disconnected graphs, the z_j are iid Bernoulli random variables, each equal to one with probability 1/p. In this case, the coding length of a set F is

cl(F) = −|F| log_2 (1/p) − (p − |F|) log_2 (1 − 1/p) ≈ |F| log_2 p + log_2 e.
This is essentially the probability model for the standard sparsity scheme.
In many cases, it is possible to approximate a general random field coding scheme with block coding by using approximation methods from the graphical model literature. However, the details of such approximations are beyond the scope of this paper.

5 Algorithms for Structured Sparsity
The following algorithm is a natural extension of L_0 regularization to structured sparsity problems. It penalizes the coding complexity rather than the cardinality (sparsity) of the feature set:

β̂_constr = arg min_{β ∈ R^p} Q̂(β) subject to c(β) ≤ s. (1)
Alternatively, we may consider the penalized formulation

β̂_pen = arg min_{β ∈ R^p} [Q̂(β) + λ c(β)]. (2)
The optimization of either (1) or (2) is generally hard. For related problems, there are two common approaches to alleviate this difficulty. One is convex relaxation (L_1 regularization to replace L_0 regularization for standard sparsity); the other is forward greedy selection (also called orthogonal matching pursuit, or OMP). We do not know of any convex-relaxation extension of L_1 regularization that can handle general structured sparsity formulations. However, one can extend the greedy algorithm by using a block structure. We call the resulting procedure the structured greedy algorithm, or StructOMP, which can approximately solve (1).
We have discussed the relationship between this greedy algorithm and block coding in Section 4. It is important to understand that the block structure is only used to limit the search space in the greedy algorithm; the actual coding scheme does not have to be the corresponding block coding. However, our theoretical analysis assumes that the underlying coding scheme can be approximated with block coding using the base blocks employed in the greedy algorithm. Although one does not need to know the specific approximation in order to use the greedy algorithm, knowing its existence (which can be shown for the examples discussed in Section 4) guarantees the effectiveness of the algorithm. It is also useful to understand that our result does not imply that the algorithm will be ineffective if the actual coding scheme cannot be approximated by block coding. In Figure 1 (the StructOMP procedure), we are given a set of blocks B that contains subsets of I. Instead of searching over all subsets F ⊂ I up to a certain complexity c(F) = |F| + cl(F), which is computationally infeasible, we search only over the blocks in B. It is assumed that searching over B is computationally manageable.
At each step (*), we try to find a block from B that maximizes progress. It is thus necessary to define a quantity that measures progress. Our idea is to approximately maximize the gain ratio

λ^{(k)} = (Q̂(β^{(k−1)}) − Q̂(β^{(k)})) / (c(β^{(k)}) − c(β^{(k−1)})),

which measures the reduction of the objective function per unit increase of coding complexity. This greedy criterion is a natural generalization of the standard greedy algorithm, and essential in our analysis. For least squares regression, we can approximate λ^{(k)} using the following definition:

φ(B) = ‖P_{B∪F^{(k−1)}}(y − Xβ^{(k−1)})‖_2^2 / (c(B ∪ F^{(k−1)}) − c(F^{(k−1)})),

where

P_F = X_F (X_F^⊤ X_F)^{−} X_F^⊤

is the projection matrix onto the subspace generated by the columns of X_F. We then select B^{(k)} so that

φ(B^{(k)}) ≥ γ max_{B∈B} φ(B),
where γ ∈ (0, 1] is a fixed approximation ratio that specifies the quality of approximate optimization. Alternatively, we may use the simpler definition

φ̃(B) = ‖X_B^⊤(y − Xβ^{(k−1)})‖_2^2 / [n(c(B ∪ F^{(k−1)}) − c(F^{(k−1)}))],

which is easier to compute, especially when blocks overlap. Since the ratio ‖X_B^⊤ z‖_2^2 / (n‖P_B z‖_2^2) (with z = y − Xβ^{(k−1)}) is bounded between ρ_−(B) and ρ_+(B) (these quantities are defined in Definition 6.1), maximizing φ̃(B) leads to an approximate maximization of φ(B) with γ ≥ ρ_−(B)/ρ_+(B).
Note that we shall ignore B ∈ B such that B ⊂ F^{(k−1)}, and simply set the corresponding gain to 0. Moreover, if there exists a base block B ⊄ F^{(k−1)} with c(B ∪ F^{(k−1)}) ≤ c(F^{(k−1)}), we can always select B and let F^{(k)} = B ∪ F^{(k−1)} (this is because it is always beneficial to add more features into F^{(k)} without additional coding complexity). We assume this step is always performed when such a B ∈ B exists. The non-trivial case is c(B ∪ F^{(k−1)}) > c(F^{(k−1)}) for all B ∈ B; in this case both φ(B) and φ̃(B) are well defined.
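Putting these pieces together, below is a compact sketch of the structured greedy algorithm for least squares (our own rendering of the procedure in Figure 1; using the simpler criterion φ̃ with the constant 1/n dropped, since it does not change which block maximizes the gain, and a budget-based stopping rule, both of which are implementation choices):

```python
import numpy as np

def struct_omp(X, y, blocks, coding_len, s_max):
    """Structured greedy algorithm (StructOMP) for least squares.

    blocks     : list of index lists (the base block set B)
    coding_len : maps a support set (frozenset) to c(F) = |F| + cl(F)
    s_max      : coding-complexity budget used as the stopping rule
    """
    n, p = X.shape
    F, beta = frozenset(), np.zeros(p)
    while True:
        residual = y - X @ beta
        best_gain, best_B = 0.0, None
        for B in blocks:
            newF = F | frozenset(B)
            if newF == F:
                continue                      # block adds no new feature
            dc = coding_len(newF) - coding_len(F)
            if dc <= 0:                       # free features: take immediately
                best_gain, best_B = np.inf, B
                break
            gain = np.linalg.norm(X[:, B].T @ residual) ** 2 / dc
            if gain > best_gain:
                best_gain, best_B = gain, B
        if best_B is None:
            break                             # no block makes progress
        newF = F | frozenset(best_B)
        if coding_len(newF) > s_max:
            break                             # budget exhausted
        F = newF
        idx = sorted(F)                       # refit by least squares on F
        coef, _, _, _ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        beta = np.zeros(p)
        beta[idx] = coef
    return beta, F
```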
6 Theory of Structured Sparsity

We assume sub-Gaussian noise as follows.
Assumption 6.1 Assume that {y_i}_{i=1,...,n} are independent (but not necessarily identically distributed) sub-Gaussians: there exists a constant σ ≥ 0 such that for all i and all t ∈ R,

E e^{t(y_i − Ey_i)} ≤ e^{σ^2 t^2/2}.
Both Gaussian and bounded random variables are sub-Gaussian under the above definition. For example, if a random variable ξ ∈ [a, b], then ξ satisfies the condition with σ = (b − a)/2.
The following property of sub-Gaussian noise is important in our analysis. Our simple proof yields a sub-optimal choice of constants.

Proposition 6.1 Let P ∈ R^{n×n} be a projection matrix of rank k, and let y satisfy Assumption 6.1. Then for all η ∈ (0, 1), with probability larger than 1 − η, ‖P(y − Ey)‖_2^2 ≤ σ^2 · O(k + ln(1/η)).
We also need to generalize the sparse eigenvalue condition used in modern sparsity analysis. It is related to (and weaker than) the RIP (restricted isometry property) assumption [6] from the compressive sensing literature. This definition takes advantage of coding complexity, and can also be considered a (weaker) structured version of RIP. We introduce a definition.

Definition 6.1 For all F ⊂ {1, ..., p}, define

ρ_−(F) = inf{‖Xβ‖_2^2 / (n‖β‖_2^2) : supp(β) ⊂ F}, ρ_+(F) = sup{‖Xβ‖_2^2 / (n‖β‖_2^2) : supp(β) ⊂ F}.
Moreover, for all s > 0, define

ρ_−(s) = inf{ρ_−(F) : F ⊂ I, c(F) ≤ s}, ρ_+(s) = sup{ρ_+(F) : F ⊂ I, c(F) ≤ s}.
In the theoretical analysis, we need to assume that ρ_−(s) is not too small for some s that is larger than the signal complexity. Since we only consider eigenvalues of submatrices with small cost c(β̄), the sparse eigenvalue ρ_−(s) can be significantly larger than the corresponding quantity for standard sparsity (which considers all subsets of {1, ..., p} up to size s). For example, for random projections used in compressive sensing applications, the coding length c(supp(β̄)) is O(k ln p) in standard sparsity, but can be as low as c(supp(β̄)) = O(k) in structured sparsity (if we can guess supp(β̄) approximately correctly). Therefore instead of requiring n = O(k ln p) samples, we require only n = O(k + cl(supp(β̄))). The difference can be significant when p is large and the coding length cl(supp(β̄)) ≪ k ln p. An example is group sparsity, where we have p/k_0 evenly sized groups, and variables within each group are simultaneously zero or nonzero. The coding length of the groups is (k/k_0) ln(p/k_0), which is significantly smaller than k ln p when p is large.
More precisely, we have the following random projection sample complexity bound for the structured sparse eigenvalue condition. The theorem implies that the structured RIP condition is satisfied with sample size n = O((k/k_0) ln(p/k_0)) in group sparsity, rather than n = O(k ln p) in standard sparsity. It therefore shows that in compressive sensing applications, it is possible to reconstruct signals from fewer random projections by using group sparsity (or more general structured sparsity).

Theorem 6.1 (Structured-RIP) Suppose that the elements of X are iid standard Gaussian random variables N(0, 1). For any t > 0 and δ ∈ (0, 1), let the sample size satisfy n ≥ C δ^{−2}(s + t) for a suitable absolute constant C. Then with probability at least 1 − e^{−t}, the random matrix X ∈ R^{n×p} satisfies the following structured-RIP inequality for all vectors β ∈ R^p with coding complexity no more than s:

(1 − δ)‖β‖_2 ≤ ‖Xβ‖_2 / √n ≤ (1 + δ)‖β‖_2.
Although the theorem assumes a Gaussian random matrix in order to obtain explicit constants, it is clear that similar results hold for other sub-Gaussian random matrices.
The following result gives a performance bound for constrained coding complexity regularization.

Theorem 6.2 Suppose that Assumption 6.1 is valid. Consider any fixed target β̄ ∈ R^p. Then with probability exceeding 1 − η, for all ε ≥ 0 and all β̂ ∈ R^p such that Q̂(β̂) ≤ Q̂(β̄) + ε, we have a prediction bound of the form

‖Xβ̂ − Ey‖_2 ≤ ‖Xβ̄ − Ey‖_2 + O(σ √(c(β̂) + ln(1/η)) + √ε).
Moreover, if the coding scheme c(·) is sub-additive, then the parameter estimation error ‖β̂ − β̄‖_2 satisfies a corresponding bound, with 1/ρ_−(c(β̂) + c(β̄)) entering the constant.
This theorem immediately implies the following result for (1): for all β̄ such that c(β̄) ≤ s, the estimator β̂_constr satisfies, with probability exceeding 1 − η,

‖Xβ̂_constr − Ey‖_2 ≤ ‖Xβ̄ − Ey‖_2 + O(σ √(s + ln(1/η))).
Note that we generally expect ρ_−(s + c(β̄)) = O(1). The result immediately implies that as the sample size n → ∞ with s/n → 0, the root mean squared prediction error ‖Xβ̂ − Ey‖_2/√n converges to the optimal prediction performance inf_{c(β̄)≤s} ‖Xβ̄ − Ey‖_2/√n. This result is agnostic in the sense that even if ‖Xβ̄ − Ey‖_2/√n is large, the result remains meaningful: it says that the performance of the estimator β̂ is competitive with the best possible estimator in the class c(β̄) ≤ s.
In compressive sensing applications, we take σ = 0, and we are interested in recovering β̄ from random projections. For simplicity, we let Xβ̄ = Ey = y, and our result shows that the constrained coding complexity penalization method achieves exact reconstruction β̂_constr = β̄ as long as ρ_−(2c(β̄)) > 0 (by setting s = c(β̄)). According to Theorem 6.1, this is possible when the number of random projections (sample size) reaches n = O(c(β̄)). This is a generalization of corresponding results in compressive sensing [6]. As we have pointed out earlier, this number can be significantly smaller than the standard sparsity requirement of n = O(‖β̄‖_0 ln p) if the structure imposed in the formulation is meaningful.
Similar to Theorem 6.2, we can obtain the following result for (2). A related result for standard sparsity under Gaussian noise can be found in [5].

Theorem 6.3 Suppose that Assumption 6.1 is valid. Consider any fixed target β̄ ∈ R^p. Then with probability exceeding 1 − η, for all λ > 7.4σ^2 and a ≥ 7.4σ^2/(λ − 7.4σ^2), we have a bound of the form

‖Xβ̂_pen − Ey‖_2^2 ≤ (1 + a)‖Xβ̄ − Ey‖_2^2 + O(λ c(β̄) + (1 + a^{−1})σ^2 ln(1/η)).
Unlike the result for (1), the prediction performance ‖Xβ̂_pen − Ey‖_2^2 of the estimator in (2) is competitive with (1 + a)‖Xβ̄ − Ey‖_2^2, which is a constant factor larger than the optimal prediction performance ‖Xβ̄ − Ey‖_2^2. By optimizing over λ and a, it is possible to obtain a result similar to that of Theorem 6.2. However, this requires tuning λ, which is not as convenient as tuning s in (1). Note that both the results presented here and those in [5] are superior to more traditional least squares regression results with λ explicitly fixed (for example, theoretical results for AIC), because the form presented in Theorem 6.2 can only be obtained by tuning λ. Such tuning is important in real applications.
We shall need one more definition before stating our main results: the block coding complexity c(β̄, B), defined analogously to c(β̄) except that the support of β̄ must be covered by base blocks from B, with the cost summed over the blocks used. The following theorem shows that if c(β̄, B) is small, then one can use the structured greedy algorithm to find a coefficient vector β^{(k)} that is competitive with β̄, with a coding complexity c(β^{(k)}) not much worse than c(β̄, B). This implies that if the original coding complexity c(β̄) can be approximated by the block complexity c(β̄, B), then we can approximately solve (1).
Theorem 6.4 Suppose the coding scheme is sub-additive. Consider β̄ and ε > 0, and suppose the stopping complexity s in the structured greedy algorithm is chosen of order (ρ_+(B)/ρ_−(s + c(β̄))) · c(β̄, B) ln(‖y‖_2^2/ε) or larger. Then at the stopping time k, we have

Q̂(β^{(k)}) ≤ Q̂(β̄) + ε.
By Theorem 6.2, the result in Theorem 6.4 implies that the prediction performance of β^{(k)} is, with high probability, competitive with that of β̄, up to terms of order σ√(s + ln(1/η)).
The result shows that in order to approximate a signal β̄ up to accuracy ε, one needs to use coding complexity O(ln(1/ε)) · c(β̄, B). If B contains small blocks and their sub-blocks with equal coding length, and the coding scheme is the block coding generated by B, then c(β̄, B) = c(β̄). In this case we need complexity O(s ln(1/ε)) to approximate a signal with coding complexity s.
In order to remove the O(ln(1/ε)) factor, backward greedy strategies can be employed, as shown in various recent works such as [26]. For simplicity, we do not analyze such strategies in this paper. However, in the following we present an additional convergence result for the structured greedy algorithm that applies to the weakly sparse (compressible) signals common in practice. It shows that the ln(1/ε) factor can be removed for such weakly sparse signals.
Theorem 6.5 Suppose the coding scheme is sub-additive, and we are given a sequence of targets β̄_0, β̄_1, β̄_2, .... Suppose the stopping complexity s dominates a suitably weighted sum of the block complexities c(β̄_j, B), with weights involving the logarithms of the successive objective-value ratios appearing below, for some ε > 0. Then at the stopping time k, we have

Q̂(β^{(k)}) ≤ Q̂(β̄_0) + ε.
In the above theorem, we can see that if the signal is only weakly sparse, in the sense that the ratios (Q̂(β̄_{j+1}) − Q̂(β̄_0) + ε)/(Q̂(β̄_j) − Q̂(β̄_0) + ε) grow sub-exponentially in j, then we can choose s = O(c(β̄_0, B)). This means that we can find β^{(k)} of complexity s = O(c(β̄_0, B)) that approximates the signal β̄_0. The worst-case scenario is when Q̂(β̄_1) ≈ Q̂(0), which reduces to the s = O(c(β̄_0, B) ln(1/ε)) complexity of Theorem 6.4.
As an application, we introduce the following concept of a weakly sparse compressible target, which generalizes the corresponding concept of a compressible signal in standard sparsity from the compressive sensing literature [9].

Definition 6.3 The target Ey is (a, q)-compressible with respect to the block set B if there exist constants a, q > 0 such that for each s > 0 there exists β̄(s) with c(β̄(s), B) ≤ s and

‖Xβ̄(s) − Ey‖_2^2 ≤ a n s^{−q}.
Corollary 6.1 Suppose that the target is (a, q)-compressible with respect to B. Then with probability 1 − η, the structured greedy algorithm run with a stopping complexity s = O(q s') satisfies, at the stopping time k,

Q̂(β^{(k)}) ≤ ‖y − Ey‖_2^2 + O(a n / s'^q + σ^2(1 + ln(1/η))),

where the constants depend on ρ_+(B) and min_{u≤s'} ρ_−(s + c(β̄(u))).
If we assume the underlying coding scheme is the block coding generated by B, then min_{u≤s'} ρ_−(s + c(β̄(u))) ≥ ρ_−(s + s'). The corollary shows that we can approximate a compressible signal of complexity s' with complexity s = O(q s') using the greedy algorithm. This means the greedy algorithm obtains the optimal rate for weakly sparse compressible signals; the sample complexity suffers only a constant factor O(q). Combining this result with Theorem 6.2 and taking a union bound, we have, with probability 1 − 2η at the stopping time k, a bound of the form ‖Xβ^{(k)} − Ey‖_2^2 = O(a n/s'^q + σ^2(s + ln(1/η))).
Given a fixed n, we can obtain a convergence result by choosing s (and thus s') to optimize the right-hand side. The resulting rate is optimal for the special case of standard sparsity, which implies that the bound has the optimal form for structured q-compressible targets. In particular, in compressive sensing applications where σ = 0, we obtain that when the sample size reaches n = O(q s'), the reconstruction performance is ‖β^{(k)} − β̄‖_2^2 = O(a/s'^q), which matches that of the constrained coding complexity regularization method (1) up to a constant factor O(q). Since many real data involve weakly sparse signals, our result provides strong theoretical justification for the use of OMP in such problems. Our experiments are consistent with the theory.

7 Experiments

The purpose of these experiments is to demonstrate the advantage of structured sparsity over standard sparsity. We compare the proposed StructOMP to OMP and Lasso, which are standard algorithms that achieve sparsity without considering structure. In our experiments, we use Lasso-modified least angle regression (LARS/Lasso) as the solver for Lasso [4]. In order to quantitatively compare the performance of the different algorithms, we use the recovery error, defined as the relative difference in 2-norm between the estimated sparse coefficient vector β̂_est and the ground-truth sparse coefficient β̄: ‖β̂_est − β̄‖_2/‖β̄‖_2. Our experiments focus on graph sparsity, with several different underlying graph structures. Note that graph sparsity is more general than group sparsity; in fact, connected regions may be regarded as dynamic groups that are not pre-defined. However, for illustration, we include a comparison with group Lasso on some 1D simulated examples, where the underlying structure can be more easily approximated by pre-defined groups. Since the remaining experiments involve more complicated structures that are harder to approximate with pre-defined groups, we exclude group Lasso from those experiments.
In the first experiment, we randomly generate a 1D structured sparse signal with values ±1, where p = 512, k = 64 and g = 4. The support set of these signals is composed of g connected regions.
Here, each component of the sparse coefficient vector is connected to its two adjacent components, which forms the underlying (line) graph structure. The graph sparsity concept introduced earlier is used to compute the coding length of sparsity patterns in StructOMP. The projection matrix X is generated by creating an n × p matrix with i.i.d. draws from a standard Gaussian distribution N(0, 1). For simplicity, the rows of X are normalized to unit magnitude. Zero-mean Gaussian noise with standard deviation σ = 0.01 is added to the measurements. Our task is to compare the recovery performance of StructOMP to those of OMP, Lasso and group Lasso on these structured sparse signals.
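The data generation for this experiment can be sketched as follows (our own reconstruction of the setup; the region generator makes no effort to keep the g regions well separated, and struct_omp refers to the hypothetical sketch in Section 5):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, g, n, sigma = 512, 64, 4, 160, 0.01

# ground truth: g connected regions of +-1 values on a 1D line graph
beta_true = np.zeros(p)
region_len = k // g
for start in rng.choice(p - region_len, size=g, replace=False):
    beta_true[start:start + region_len] = rng.choice([-1.0, 1.0], size=region_len)

# Gaussian projection matrix with unit-norm rows, plus observation noise
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = X @ beta_true + sigma * rng.standard_normal(n)

def recovery_error(beta_est):
    """Relative 2-norm error used throughout the experiments."""
    return np.linalg.norm(beta_est - beta_true) / np.linalg.norm(beta_true)
```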
Figure 2 shows one instance of the generated signal and the corresponding results recovered by the different algorithms when n = 160. Since the sample size n is not large enough, OMP and Lasso do not achieve good recovery, whereas StructOMP achieves near-perfect recovery of the original signal. We also include group Lasso in this experiment for illustration, using pre-defined consecutive groups that do not completely overlap with the support of the signal. Since we do not know the correct group size, we try group Lasso with several different group sizes (gs = 2, 4, 8, 16). Although the results obtained with group Lasso are better than those of OMP and Lasso, they are still inferior to those of StructOMP. As mentioned, this is because the pre-defined groups do not completely overlap with the support of the signal, which reduces efficiency. To study how the sample size n affects recovery performance, we vary the sample size and record the results of the different algorithms. To reduce randomness, we repeat the experiment 100 times for each sample size. Figure 3(a) shows the recovery performance of the algorithms, averaged over 100 random runs for each sample size. As expected, StructOMP is better than group Lasso and far better than OMP and Lasso. The results show that StructOMP achieves better recovery of structured sparse signals with fewer samples.
Note that Lasso performs better than OMP in the first example. This is because the signal is strongly sparse (that is, all nonzero coefficients are significantly different from zero).

[Figure 2 panels: (b) recovery with OMP (error 0.9921); (c) Lasso (error 0.8660); (d) group Lasso, gs = 2 (error 0.4832); (e) group Lasso, gs = 4 (error 0.4832); (f) group Lasso, gs = 8 (error 0.2646); (g) group Lasso, gs = 16 (error 0.3980); (h) StructOMP (error 0.0246).]

In the second experiment, we randomly generate a 1D structured sparse signal with weak sparsity, where the nonzero coefficients decay gradually to zero with no clear cutoff. As expected from our theory (see Theorem 6.5 and the discussion thereafter), OMP becomes much more competitive relative to Lasso; in fact, OMP outperforms Lasso in this more realistic situation. One instance of the generated signal is shown in Figure 4(a). Here, p = 512 and no coefficient of the signal is exactly zero. We define the sparsity k as the number of coefficients that contain 95% of the signal energy. The support set of these signals is composed of g = 2 connected regions. Again, each element of the sparse coefficient vector is connected to its two adjacent elements, which forms the underlying 1D line graph structure, and the graph sparsity concept introduced earlier is used to compute the coding length of sparsity patterns in StructOMP. The projection matrix X is generated by creating an n × p matrix with i.i.d. draws from a standard Gaussian distribution N(0, 1); the rows of X are normalized to unit magnitude, and zero-mean Gaussian noise with standard deviation σ = 0.01 is added to the measurements. Figure 4 shows one generated signal and the recovered results when k = 32 and n = 48. Again, OMP and Lasso do not achieve good recovery, whereas StructOMP achieves near-perfect recovery of the original signal. We try group Lasso with several different group sizes (gs = 2, 4, 8, 16); although its results are better than those of OMP and Lasso, they are still inferior to those of StructOMP. To study how the sample size n affects recovery performance, we vary the sample size and record the results of the different algorithms, repeating the experiment 100 times for each sample size. Figure 3(b) shows the recovery performance of the different algorithms, averaged over 100 random runs for each sample size. As expected, StructOMP is superior in all cases. What differs from the first experiment is that the recovery error of OMP becomes smaller than that of Lasso. This is consistent with our theory, which predicts that if the underlying signal is weakly sparse, the relative performance of OMP becomes comparable to Lasso.
It is well known that 2D natural images are sparse in a wavelet basis. Their wavelet coefficients have a hierarchical tree structure, which is widely used in wavelet-based compression algorithms [21].
Figure 5(a) shows a widely used example image of size 64 × 64: cameraman. Each 2D wavelet coefficient of this image is connected to its parent coefficient and child coefficients, which forms the underlying hierarchical tree structure (a special case of graph sparsity). In our experiment, we choose the Haar wavelet to obtain the tree-structured sparse wavelet coefficients. The projection matrix X and the noise are generated by the same method as for the 1D structured sparse signals. OMP, Lasso and StructOMP are used to recover the wavelet coefficients from the random projection samples, and the inverse wavelet transform then reconstructs the image from the recovered coefficients. Our task is to compare the recovery performance of StructOMP to those of OMP and Lasso. Figure 5 shows one example of the results recovered by the different algorithms; StructOMP obtains the best result. Figure 6(a) shows the recovery performance of the three algorithms, averaged over 100 random runs for each sample size. StructOMP is better than both Lasso and OMP in this case. Since real image data are weakly sparse, the performance of standard OMP (without structured sparsity) is similar to that of Lasso.

Background-subtracted images are typical structured-sparsity data in static video surveillance applications. They generally correspond to the foreground objects of interest. Unlike the whole scene, these images are not only spatially sparse but also inclined to cluster into groups corresponding to different foreground objects. Thus, StructOMP can obtain superior recovery from compressive sensing measurements received by a centralized server from multiple, randomly placed optical sensors. In this experiment, the test video is downloaded from http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/, and the background-subtracted images are obtained with the software of [28]. One sample image frame is shown in Figure 7(a). The support set of these 2D images is thus composed of several connected regions, and each pixel of the background-subtracted image is connected to its four adjacent pixels, forming the underlying graph structure for graph sparsity. The results in Figure 7 demonstrate that StructOMP outperforms both OMP and Lasso in recovery. We randomly choose 100 background-subtracted images as test images. Figure 6(b) shows the recovery performance as a function of increasing sample size, demonstrating again that StructOMP significantly outperforms OMP and Lasso on video data. Compared to the image compression example of the previous section, the background-subtracted images have a more clearly defined sparsity pattern in which the nonzero coefficients are generally distinct from zero (that is, stronger sparsity); this explains why Lasso performs better than standard (unstructured) OMP on this particular data. The result is again consistent with our theory.

In structured sparsity, the complexity of learning is measured by the coding complexity c(β̄) ≤ ‖β̄‖_0 + cl(supp(β̄)) instead of the ‖β̄‖_0 ln p that determines the complexity in standard sparsity. Using this notion, a theory parallel to that of standard sparsity has been developed. The theory shows that if the coding length cl(supp(β̄)) is small for a target coefficient vector β̄, then the complexity of learning β̄ can be significantly smaller than the corresponding complexity in standard sparsity.
Experimental results demonstrate that significant improvements can be obtained on some real problems that have natural structures.
The structured greedy algorithm presented in this paper is the first efficient algorithm proposed to handle general structured sparsity learning, and it is shown to be effective under appropriate conditions. Future work includes additional computationally efficient methods, such as convex relaxation (e.g. L_1 regularization for standard sparsity, and group Lasso for strong group sparsity) and backward greedy strategies to improve upon the forward greedy method considered in this paper.
Proof Let s_n = Σ_{i=1}^n (x_i y_i − x_i Ey_i); then by assumption, E(e^{t s_n} + e^{−t s_n}) ≤ 2e^{Σ_i x_i^2 σ^2 t^2/2}, which implies that Pr(|s_n| ≥ ε) e^{tε} ≤ 2e^{Σ_i x_i^2 σ^2 t^2/2}. Now let t = ε/(Σ_i x_i^2 σ^2); we obtain the desired bound.
The following lemma is taken from [20]. Since the proof is simple, it is included for completeness.
Lemma A.2 Consider the unit sphere S^{k−1} = {x : ‖x‖_2 = 1} in R^k (k ≥ 1). Given any ε > 0, there exists an ε-cover Q ⊂ S^{k−1} such that min_{q∈Q} ‖x − q‖_2 ≤ ε for all ‖x‖_2 = 1, with |Q| ≤ (1 + 2/ε)^k.

Proof Let B^k = {x : ‖x‖_2 ≤ 1} be the unit ball in R^k. Let Q = {q_i}_{i=1,...,|Q|} ⊂ S^{k−1} be a maximal subset such that ‖q_i − q_j‖_2 > ε for all i ≠ j. By maximality, Q is an ε-cover of S^{k−1}. Since the balls q_i + (ε/2)B^k are disjoint and contained in (1 + ε/2)B^k, we have Σ_{i≤|Q|} vol(q_i + (ε/2)B^k) ≤ vol((1 + ε/2)B^k), hence |Q|(ε/2)^k ≤ (1 + ε/2)^k, which proves the first inequality.

The second inequality follows directly from Lemma C.4 with λ = 0.
Proof of Theorem 6.3
The desired bound is a direct consequence of Lemma C.4, by noticing that 2√(2σ^2 ln(6/η)) ‖Xβ̄ − Ey‖_2 ≤ a‖Xβ̄ − Ey‖_2^2 + a^{−1} · 2σ^2 ln(6/η) for a > 0, and b + a^{−1} · 2σ^2 ≤ (10 + 5a + 7a^{−1})σ^2.

D Proof of Theorem 6.4 and Theorem 6.5
The following lemma is an adaptation of a similar result in [26] on greedy algorithms for standard sparsity.

Proof For all ℓ ∈ F, ‖Xβ + αXe_ℓ − y‖_2^2 achieves its minimum at α = 0 (where e_ℓ is the vector of zeros except for the ℓ-th component, which is one). This implies that x_ℓ^⊤(Xβ − y) = 0.

E Proof of Corollary 6.1
Given s', we consider f_j = min_{ℓ≥j} Q̂(β̄(s'/2^ℓ)). We may assume that f_0 is achieved at ℓ_0 = 0. Note that by Lemma C.3, we have with probability 1 − 2^{−j−1}η:
|Q̂(β̄(s'/2^j)) − ‖y − Ey‖_2^2| ≤ 2‖Xβ̄(s'/2^j) − Ey‖_2^2 + 2σ^2[j + 1 + ln(2/η)] ≤ 2an·2^{qj}/s'^q + 2σ^2[j + 1 + ln(2/η)].

Taking the union bound over j, the above inequality holds for all j simultaneously with probability 1 − η. Now, by taking ε = 2an/s'^q + 2σ^2[ln(2/η) + 1] in Theorem 6.5, we obtain a condition on s involving the sum

Σ_{j≥0} 2^{−j} ln(2 + 2(j + 2^{q(j+1)})) ≤ Σ_{j≥0} 2^{−j} (ln 2 + 1 + j + q(j + 1) ln 2) ≤ 2 + 4(1 + q ln 2),

where we have used the simple inequality ln(α + β) ≤ α + ln β when α, β ≥ 1. Therefore Theorem 6.5 can be applied to obtain the desired bound.