Universal Regularizers For Robust Sparse Coding and Modeling


Ignacio Ramírez and Guillermo Sapiro
Department of Electrical and Computer Engineering, University of Minnesota
{ramir048,guille}@umn.edu

Abstract—Sparse data models, where data is assumed to be well represented as a linear combination of a few elements from a dictionary, have gained considerable attention in recent years, and their use has led to state-of-the-art results in many signal and image processing tasks. It is now well understood that the choice of the sparsity regularization term is critical in the success of such models. Based on a codelength minimization interpretation of sparse coding, and using tools from universal coding theory, we propose a framework for designing sparsity regularization terms which have theoretical and practical advantages when compared to the more standard ℓ₀ or ℓ₁ ones. The presentation of the framework and theoretical foundations is complemented with examples that show its practical advantages in image denoising, zooming and classification.

I. INTRODUCTION

Sparse modeling calls for constructing a succinct representation of some data as a combination of a few typical patterns (atoms) learned from the data itself. Significant contributions to the theory and practice of learning such collections of atoms (usually called dictionaries or codebooks), e.g., [1], [14], [33], and of representing the actual data in terms of them, e.g., [8], [11], [12], have been developed in recent years, leading to state-of-the-art results in many signal and image processing tasks [24], [26], [27], [34]. We refer the reader for example to [4] for a recent review on the subject. A critical component of sparse modeling is the actual sparsity of the representation, which is controlled by a regularization term (regularizer for short) and its associated parameters.
The choice of the functional form of the regularizer and its parameters is a challenging task. Several solutions to this problem have been proposed in the literature, ranging from the automatic tuning of the parameters [20] to Bayesian models, where these parameters are themselves considered as random variables [17], [20], [51]. In this work we adopt the interpretation of sparse coding as a codelength minimization problem. This is a natural and objective method for assessing the quality of a statistical model for describing given data, and is based on the Minimum Description Length (MDL) principle [37]. In this framework, the regularization term in the sparse coding formulation is interpreted as the cost in bits of describing the sparse linear coefficients used to reconstruct the data. Several works on image coding using this approach were developed in the 1990s under the name of "complexity-based" or "compression-based" coding, following the popularization of MDL as a powerful statistical modeling tool [9], [31], [40]. The focus of these early works was on denoising using wavelet bases, using either generic asymptotic results from MDL or fixed probability models to compute the description length of the coefficients. A later, major breakthrough in MDL theory was the adoption of universal coding tools to compute optimal codelengths. In this work, we improve and extend on previous results in this line of work by designing regularization terms based on such universal codes for image coefficients, meaning that the codelengths obtained when encoding the coefficients of any (natural) image with such codes will be close to the shortest codelengths that can be obtained with any model fitted specifically for that particular instance of coefficients.
The resulting framework not only formalizes sparse coding from the MDL and universal coding perspectives but also leads to a family of universal regularizers which we show to consistently improve results in image processing tasks such as denoising and classification. These models also enjoy several desirable theoretical and practical properties, such as statistical consistency (in certain cases), improved robustness to outliers in the data, and improved sparse signal recovery (e.g., decoding of sparse signals from a compressive sensing point of view [5]) when compared with the traditional ℓ₀- and ℓ₁-based techniques in practice. These models also lend themselves to a simple and efficient optimization technique for solving the corresponding sparse coding problems as a series of weighted ℓ₁ subproblems, which in turn can be solved with off-the-shelf algorithms such as LARS [12] or IST [11]. Details are given in the sequel. Finally, we apply our universal regularizers not only for coding using fixed dictionaries, but also for learning the dictionaries themselves, leading to further improvements in all the aforementioned tasks.

The remainder of this paper is organized as follows: in Section II we introduce the standard framework of sparse modeling. Section III is dedicated to the derivation of our proposed universal sparse modeling framework, while Section IV deals with its implementation. Section V presents experimental results showing the practical benefits of the proposed framework in image denoising, zooming and classification tasks. Concluding remarks are given in Section VI.

II. SPARSE MODELING AND THE NEED FOR BETTER MODELS

Let X ∈ ℝ^(M×N) be a set of N column data samples x_j ∈ ℝ^M, D ∈ ℝ^(M×K) a dictionary of K column atoms d_k ∈ ℝ^M, and A ∈ ℝ^(K×N), a_j ∈ ℝ^K, a set of reconstruction coefficients such that X = DA.
We use a_k^T to denote the k-th row of A, the coefficients associated to the k-th atom in D. For each j = 1,…,N we define the active set of a_j as A_j = {k : a_kj ≠ 0, 1 ≤ k ≤ K}, and ‖a_j‖₀ = |A_j| as its cardinality. The goal of sparse modeling is to design a dictionary D such that for all or most data samples x_j there exists a coefficient vector a_j such that x_j ≈ Da_j and ‖a_j‖₀ is small (usually below some threshold L ≪ K). Formally, we would like to solve the following problem:

  min_{D,A} Σ_{j=1}^N ψ(a_j)  s.t.  ‖x_j − Da_j‖₂² ≤ ε,  j = 1,…,N,   (1)

where ψ(·) is a regularization term which induces sparsity in the columns of the solution A. Usually the constraint ‖d_k‖₂ ≤ 1, k = 1,…,K, is added, since otherwise we could always decrease the cost function arbitrarily by multiplying D by a large constant and dividing A by the same constant. When D is fixed, the problem of finding a sparse a_j for each sample x_j is called sparse coding,

  a_j = arg min_a ψ(a)  s.t.  ‖x_j − Da‖₂² ≤ ε.   (2)

Among possible choices of ψ(·) are the ℓ₀ pseudo-norm, ψ(·) = ‖·‖₀, and the ℓ₁ norm. The former tries to solve directly for the sparsest a_j, but since it is non-convex, it is commonly replaced by the ℓ₁ norm, which is its closest convex approximation. Furthermore, under certain conditions on the (fixed) D and the sparsity of a_j, the solutions to the ℓ₀- and ℓ₁-based sparse coding problems coincide (see for example [5]). The problem (1) is also usually formulated in Lagrangian form,

  min_{D,A} Σ_{j=1}^N ‖x_j − Da_j‖₂² + λψ(a_j),   (3)

along with its respective sparse coding problem when D is fixed,

  a_j = arg min_a ‖x_j − Da‖₂² + λψ(a).   (4)

Even when the regularizer ψ(·) is convex, the sparse modeling problem, in any of its forms, is jointly non-convex in (D, A).
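The Lagrangian sparse coding problem (4) with ψ(·) = ‖·‖₁ can be solved by iterative soft-thresholding (IST [11]). The following minimal sketch (our own illustration, not the authors' implementation) makes the iteration concrete:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t*||.||_1: elementwise soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, D, lam, n_iter=500):
    """Approximate a = argmin_a ||x - D a||_2^2 + lam*||a||_1 (Eq. (4) with psi = l1)."""
    a = np.zeros(D.shape[1])
    # Lipschitz constant of the gradient of the quadratic term.
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ a - x)
        a = soft_threshold(a - grad / L, lam / L)
    return a
```

For an orthonormal D (e.g., D = I) the iteration reaches its fixed point in one step: elementwise soft-thresholding of Dᵀx with threshold λ/2, the classical closed-form solution in the orthonormal case.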
Therefore, the standard approach to find an approximate solution is to use alternate minimization: starting with an initial dictionary D⁽⁰⁾, we minimize (3) alternately in A via (2) or (4) (sparse coding step) and in D (dictionary update step). The sparse coding step can be solved efficiently when ψ(·) = ‖·‖₁ using for example IST [11] or LARS [12], or with OMP [28] when ψ(·) = ‖·‖₀. The dictionary update step can be done using for example MOD [14] or K-SVD [1].

A. Interpretations of the sparse coding problem

We now turn our attention to the sparse coding problem: given a fixed dictionary D, for each sample vector x_j, compute the sparsest vector of coefficients a_j that yields a good approximation of x_j. The sparse coding problem admits several interpretations. What follows is a summary of these interpretations and the insights that they provide into the properties of the sparse models that are relevant to our derivation.

1) Model selection in statistics: Using the ℓ₀ norm as ψ(·) in (4) is known in the statistics community as Akaike's Information Criterion (AIC) when λ = 1, or the Bayes Information Criterion (BIC) when λ = ½ log M, two popular forms of model selection (see [22, Chapter 7]). In this context, the ℓ₁ regularizer was introduced in [43], again as a convex approximation of the above model selection methods, and is commonly known (either in its constrained or Lagrangian form) as the Lasso. Note however that, in the regression interpretation of (4), the roles of D and X are very different.
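The alternate minimization loop can be sketched as follows; this is a schematic of the general recipe under our own choices (IST for the sparse coding step, MOD for the dictionary update), not the authors' code:

```python
import numpy as np

def dictionary_learning(X, K, lam, n_outer=20, n_inner=200, seed=0):
    """Alternate minimization for (3) with psi = l1: IST coding + MOD update."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    D = rng.standard_normal((M, K))
    D /= np.linalg.norm(D, axis=0)            # enforce ||d_k||_2 <= 1
    A = np.zeros((K, N))
    for _ in range(n_outer):
        # Sparse coding step: minimize over A with D fixed (IST on all columns).
        L = 2.0 * np.linalg.norm(D, 2) ** 2
        for _ in range(n_inner):
            T = A - 2.0 * D.T @ (D @ A - X) / L
            A = np.sign(T) * np.maximum(np.abs(T) - lam / L, 0.0)
        # Dictionary update step (MOD): least-squares fit, then renormalize atoms.
        D = X @ A.T @ np.linalg.pinv(A @ A.T)
        norms = np.maximum(np.linalg.norm(D, axis=0), 1e-12)
        D /= norms
        A *= norms[:, None]                   # keep the product D A unchanged
    return D, A
```

The rescaling of A after normalizing the atoms keeps DA fixed, so the residual term is unaffected while the unit-norm constraint on the atoms is maintained.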
2) Maximum a posteriori: Another interpretation of (4) is that of a maximum a posteriori (MAP) estimation of a_j in the logarithmic scale, that is,

  a_j = arg max_a {log P(a|x_j)} = arg max_a {log P(x_j|a) + log P(a)} = arg min_a {−log P(x_j|a) − log P(a)},   (5)

where the observed samples x_j are assumed to be contaminated with additive, zero-mean, IID Gaussian noise with variance σ², P(x_j|a) ∝ e^(−(1/2σ²)‖x_j−Da‖₂²), and a prior probability model on a of the form P(a) ∝ e^(−θψ(a)) is considered. The energy term in Equation (4) follows by plugging the previous two probability models into (5) and factoring out 2σ², yielding λ = 2σ²θ. According to (5), the ℓ₁ regularizer corresponds to an IID Laplacian prior with mean 0 and inverse-scale parameter θ, P(a) = ∏_{k=1}^K (θ/2)e^(−θ|a_k|) = (θ/2)^K e^(−θ‖a‖₁), which has a special meaning in signal processing tasks such as image or audio compression. This is due to the widely accepted fact that representation coefficients derived from predictive coding of continuous-valued signals, and, more generally, responses of zero-mean filters, are well modeled using Laplacian distributions. For example, for the special case of DCT coefficients of image patches, an analytical study of this phenomenon is provided in [25], along with further references on the subject.

3) Codelength minimization: Sparse coding, in all its forms, has yet another important interpretation. Suppose that we have a fixed dictionary D and that we want to use it to compress an image, either losslessly, by encoding the reconstruction coefficients A and the residual X − DA, or in a lossy manner, by obtaining a good approximation X ≈ DA and encoding only A. Consider for example the latter case.
Most modern compression schemes consist of two parts: a probability assignment stage, where the data, in this case A, is assigned a probability P(A), and an encoding stage, where a code C(A) of length L(A) bits is assigned to the data given its probability, so that L(A) is as short as possible. The techniques known as Arithmetic and Huffman coding provide the best possible solution for the encoding step, which is to approximate the Shannon ideal codelength L(A) = −log P(A) [10, Chapter 5]. Therefore, modern compression theory deals with finding the coefficients A that maximize P(A), or, equivalently, that minimize −log P(A). Now, to encode X lossily, we obtain coefficients A such that each data sample x_j is approximated up to a certain ℓ₂ distortion ε, ‖x_j − Da_j‖₂² ≤ ε. Therefore, given a model P(a) for a vector of reconstruction coefficients, and assuming that we encode each sample independently, the optimum vector of coefficients a_j for each sample x_j will be the solution to the optimization problem

  a_j = arg min_a −log P(a)  s.t.  ‖x_j − Da‖₂² ≤ ε,   (6)

which, for the choice P(a) ∝ e^(−ψ(a)), coincides with the error-constrained sparse coding problem (2). Suppose now that we want lossless compression. In this case we also need to encode the reconstruction residual x_j − Da_j. Since P(x, a) = P(x|a)P(a), the combined codelength will be

  L(x_j, a_j) = −log P(x_j, a_j) = −log P(x_j|a_j) − log P(a_j).   (7)

Therefore, obtaining the best coefficients a_j amounts to solving min_a L(x_j, a_j), which is precisely the MAP formulation of (5), which in turn, for proper choices of P(x|a) and P(a), leads to the Lagrangian form of sparse coding (4).¹

¹ Laplacian models, as well as Gaussian models, are probability distributions over ℝ, characterized by continuous probability density functions, f(a) = F′(a), F(a) = P(x ≤ a).
If the reconstruction coefficients are considered real numbers, under any of these distributions any instance of A ∈ ℝ^(K×N) will have measure 0, that is, P(A) = 0. In order to use such distributions as our models for the data, we assume that the coefficients in A are quantized to a precision Δ, small enough for the density function f(a) to be approximately constant in any interval [a − Δ/2, a + Δ/2], a ∈ ℝ, so that we can approximate P(a) ≈ Δf(a), a ∈ ℝ. Under these assumptions, −log P(a) ≈ −log f(a) − log Δ, and the effect of Δ on the codelength produced by any model is the same. Therefore, we will omit Δ in the sequel and treat density functions and probability distributions interchangeably as P(·). Of course, in real compression applications, Δ needs to be tuned.

Fig. 1. Standard 8×8 DCT dictionary (a); global empirical distribution of the coefficients in A (b, log scale); empirical distributions of the coefficients associated to each of the K = 64 DCT atoms (c, log scale). The distributions in (c) have a similar heavy-tailed shape (heavier than Laplacian), but the variance in each case can be significantly different. (d) Histogram of the K = 64 different θ̂_k values obtained by fitting a Laplacian distribution to each row a_k^T of A. Note that there are significant occurrences between θ̂ = 5 and θ̂ = 25. The coefficients A used in (b-d) were obtained from encoding 10⁶ 8×8 patches (after removing their DC component) randomly sampled from the Pascal 2006 dataset of natural images [15]. (e) Histograms showing the spatial variability of the best local estimates of θ̂_k for a few rows of A across different regions of an image. In this case, the coefficients A correspond to the sparse encoding of all 8×8 patches from a single image, in scan-line order. For each k, each value of θ̂_k was computed from a random contiguous block of 250 samples from a_k^T.
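The footnote's approximation P(a) ≈ Δf(a) is easy to check numerically, e.g., for a Laplacian density: the exact codelength of a quantization bin and the approximation −log f(a) − log Δ agree to a tiny fraction of a bit for small Δ (the parameter values below are our own, chosen for illustration):

```python
import math

def laplacian_pdf(a, theta):
    # Two-sided Laplacian density with inverse-scale parameter theta.
    return 0.5 * theta * math.exp(-theta * abs(a))

def quantized_prob(a, theta, delta):
    # Exact probability mass of the bin [a - delta/2, a + delta/2] under the
    # Laplacian CDF, for a bin lying entirely on the positive axis.
    lo, hi = a - delta / 2, a + delta / 2
    return 0.5 * (math.exp(-theta * lo) - math.exp(-theta * hi))

theta, delta, a = 10.0, 1e-3, 0.2
exact_bits = -math.log2(quantized_prob(a, theta, delta))
approx_bits = -math.log2(laplacian_pdf(a, theta)) - math.log2(delta)
```

The gap between the two codelengths shrinks quadratically in θΔ, which is why Δ can be ignored when comparing models.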
The procedure was repeated 4000 times to obtain an empirical distribution. The wide supports of the empirical distributions indicate that the estimated θ̂ can take very different values, even for the same atom, depending on the region of the data from which the coefficients are taken.

As one can see, the codelength interpretation of sparse coding is able to unify and interpret both the constrained and unconstrained formulations within one consistent framework. Furthermore, this framework offers a natural and objective measure for comparing the quality of different models P(x|a) and P(a) in terms of the codelengths obtained.

4) Remarks on related work: As mentioned in the introduction, the codelength interpretation of signal coding was already studied in the context of orthogonal wavelet-based denoising. An early example of this line of work considers a regularization term which uses the Shannon entropy function −Σᵢ pᵢ log pᵢ to give a measure of the sparsity of the solution [9]. However, the entropy function is not used as a measure of the ideal codelength for describing the coefficients, but as a measure of the sparsity (actually, group sparsity) of the solution. The MDL principle was applied to the signal estimation problem in [40]. In this case, the codelength term includes the description of both the location and the magnitude of the nonzero coefficients. Although a pioneering effort, the model assumed in [40] for the coefficient magnitude is a uniform distribution on [0, 1], which does not exploit a priori knowledge of image coefficient statistics, and the description of the support is slightly wasteful. Furthermore, the codelength expression used is an asymptotic result, actually equivalent to BIC (see Section II-A1), which can be misleading when working with small sample sizes, such as when encoding small image patches, as in current state-of-the-art image processing applications.
The uniform distribution was later replaced by the universal code for integers [38] in [31]. However, as in [40], the model is so general that it does not perform well for the specific case of coefficients arising from image decompositions, leading to poor results. In contrast, our models are derived following a careful analysis of image coefficient statistics. Finally, probability models suited to image coefficient statistics, of the form P(a) ∝ e^(−|a|^β) (known as generalized Gaussians), were applied to the MDL-based signal coding and estimation framework in [31]. The justification for such models is based on the empirical observation that sparse coefficient statistics exhibit "heavy tails" (see next section). However, the choice is ad hoc and no optimality criterion is available to compare it with other possibilities. Moreover, there is no closed-form solution for performing parameter estimation in such a family of models, which requires numerical optimization techniques. In Section III we derive a number of probability models for which parameter estimation can be computed efficiently in closed form, and which are guaranteed to optimally describe image coefficients.

B. The need for a better model

As explained in the previous subsection, the use of the ℓ₁ regularizer implies that all the coefficients in A share the same Laplacian parameter θ. However, as noted in [25] and references therein, the empirical variance of the coefficients associated to different atoms, that is, of the different rows a_k^T of A, varies greatly with k = 1,…,K. This is clearly seen in Figures 1(a-c), which show the empirical distribution of DCT coefficients of 8×8 patches. As the variance of a Laplacian is 2/θ², different variances indicate different underlying θ. The histogram of the set {θ̂_k : k = 1,…,K} of estimated Laplacian parameters for each row k, Figure 1(d), shows that this is indeed the case, with significant occurrences of values of θ̂ in a range of 5 to 25.

The straightforward modification suggested by this phenomenon is to use a model where each row of A has its own weight associated to it, leading to a weighted ℓ₁ regularizer. However, from a modeling perspective, this results in K parameters to be adjusted instead of just one, which often results in poor generalization properties. For example, in the cases studied in Section V, even with thousands of images available for learning these parameters, the results of applying the learned model to new images were always significantly worse (over 1 dB in estimation problems) when compared to those obtained using simpler models such as an unweighted ℓ₁.² One reason for this failure may be that real images, as well as other types of signals such as audio samples, are far from stationary. In this case, even if each atom k is associated to its own θ_k (λ_k), the optimal value of θ_k can have significant local variations at different positions or times. This effect is shown in Figure 1(e), where, for each k, θ_k was re-estimated several times using samples from different regions of an image, and the histogram of the different estimated values of θ̂_k was computed. Here again we used the DCT basis as the dictionary D.

The need for a flexible model which at the same time has a small number of parameters leads naturally to Bayesian formulations where the different possible λ_k are "marginalized out" by imposing a hyper-prior distribution on λ, sampling λ using its posterior distribution, and then averaging the estimates obtained with the sampled sparse-coding problems. Examples of this recent line of work, and of the closely related Bayesian Compressive Sensing, are developed for example in [23], [44], [49], [48].
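The per-row fits behind Figure 1(d) rely on the closed-form Laplacian MLE θ̂ = 1/mean(|a|). A small synthetic sketch (ours, not the paper's code) of estimating a different θ for each "atom" row:

```python
import numpy as np

def laplacian_theta_mle(row):
    # MLE of the inverse-scale parameter of a zero-mean Laplacian:
    # theta_hat = 1 / mean(|a|).
    return 1.0 / np.mean(np.abs(row))

rng = np.random.default_rng(0)
thetas = [5.0, 10.0, 25.0]   # one inverse-scale parameter per synthetic "atom"
# NumPy's Laplacian is parameterized by the scale b = 1/theta.
A_synth = np.stack([rng.laplace(0.0, 1.0 / t, size=100_000) for t in thetas])
estimates = [laplacian_theta_mle(row) for row in A_synth]
```

With enough samples per row the estimates recover the distinct underlying θ values, mirroring the spread of θ̂_k observed across real DCT atoms.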
Despite its promising results, the Bayesian approach is often criticized due to the potentially expensive sampling process (something which can be reduced for certain choices of the priors involved [23]), the arbitrariness in the choice of the priors, and the lack of proper theoretical justification for the proposed models [48]. In this work we pursue the same goal of deriving a more flexible and accurate sparse model than the traditional ones, while avoiding an increase in the number of parameters and the burden of possibly solving several sampled instances of the sparse coding problem. For this, we deploy tools from the very successful information-theoretic field of universal coding, which is an extension of the compression scenario summarized above in Section II-A to the case where the probability model for the data to be described is itself unknown and has to be described as well.

III. UNIVERSAL MODELS FOR SPARSE CODING

Following the discussion in the preceding section, we now have several possible scenarios to deal with. First, we may still want a single value of θ to work well for all the coefficients in A, and try to design a sparse coding scheme that does not depend on prior knowledge of the value of θ. Secondly, we can consider an independent (but not identically distributed) Laplacian model where the underlying parameter θ can be different for each atom d_k, k = 1,…,K. In the most extreme scenario, we can consider each single coefficient a_kj in A to have its own unknown underlying θ_kj and yet, we would like to encode each of these coefficients (almost) as if we knew its hidden parameter. The first two scenarios are the ones which fit the original purpose of universal coding theory [29], which is the design of optimal codes for data whose probability models are unknown, when the models themselves are to be encoded as well in the compressed representation.
We now develop the basic ideas and techniques of universal coding applied to the first scenario, where the problem is to describe A as an IID Laplacian with unknown parameter θ. Assuming a known parametric form for the prior, with unknown parameter θ, leads to the concept of a model class. In our case, we consider the class M = {P(A|θ) : θ ∈ Θ} of all IID Laplacian models over A ∈ ℝ^(K×N), where

  P(A|θ) = ∏_{j=1}^N ∏_{k=1}^K P(a_kj|θ),  P(a_kj|θ) = (θ/2)e^(−θ|a_kj|),

and Θ ⊆ ℝ⁺. The goal of universal coding is to find a probability model Q(A) which can fit A as well as the model in M that best fits A after having observed it. A model Q(A) with this property is called universal (with respect to the model class M).

² Note that this is the case when the weights are found by maximum likelihood. Other applications of weighted ℓ₁ regularizers, using other types of weighting strategies, are known to improve over ℓ₁-based ones for certain applications (see e.g. [51]).

For simplicity, in the following discussion we consider the coefficient matrix A to be arranged as a single long column vector of length n = K × N, a = (a₁,…,a_n). We also use the letter a without sub-index to denote the value of a random variable representing coefficient values. First we need to define a criterion for comparing the fitting quality of different models. In universal coding theory this is done in terms of the codelengths L(a) required by each model to describe a. If the model consists of a single probability distribution P(·), we know from Section II-A3 that the optimum codelength corresponds to L_P(a) = −log P(a). Moreover, this relationship defines a one-to-one correspondence between distributions and codelengths, so that for any coding scheme with lengths L_Q(a) there is an associated distribution Q(a) = 2^(−L_Q(a)).
Now suppose that we are restricted to a class of models M, and that we need to choose the model P̂ ∈ M that assigns the shortest codelength to a particular instance of a. We then have that P̂ is the model in M that assigns the maximum probability to a. For a class M parametrized by θ, this corresponds to P̂ = P(a|θ̂(a)), where θ̂(a) is the maximum likelihood estimator (MLE) of the model class parameter θ given a (we will usually omit the argument and just write θ̂). Unfortunately, we also need to include the value of θ̂ in the description of a for the decoder to be able to reconstruct it from the code C(a). Thus, we have that any model Q(a) inducing valid codelengths L_Q(a) will have L_Q(a) > −log P(a|θ̂). The overhead of L_Q(a) with respect to −log P(a|θ̂) is known as the codelength regret,

  R(a, Q) := L_Q(a) − (−log P(a|θ̂(a))) = −log Q(a) + log P(a|θ̂(a)).

A model Q(a) (or, more precisely, a sequence of models, one for each data length n) is called universal if R(a, Q) grows sublinearly in n for all possible realizations of a, that is, (1/n)R(a, Q) → 0, ∀a ∈ ℝⁿ, so that the codelength regret with respect to the MLE becomes asymptotically negligible.

There are a number of ways to construct universal probability models. The simplest one is the so-called two-part code, where the data is described in two parts. The first part describes the optimal parameter θ̂(a) and the second part describes the data according to the model with the estimated parameter value, P(a|θ̂(a)). For uncountable parameter spaces Θ, such as a compact subset of ℝ, the value of θ̂ has to be quantized in order to be described with a finite number of bits d. We call the quantized parameter θ̂_d.
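A toy numeric illustration of the two-part code for the IID Laplacian class (our own sketch; we charge the conventional (1/2)log₂ n bits for the quantized parameter): the total overhead with respect to the MLE codelength stays a handful of bits, i.e., sublinear in n:

```python
import math
import random

def laplacian_codelength(a, theta):
    # Ideal codelength -log2 P(a|theta) for IID two-sided Laplacian samples
    # (density only; the common quantization offset -n*log2(delta) is dropped).
    return sum(-math.log2(0.5 * theta) + theta * abs(v) / math.log(2) for v in a)

def two_part_regret(a):
    n = len(a)
    theta_mle = n / sum(abs(v) for v in a)      # MLE of the inverse scale
    step = 1.0 / math.sqrt(n)                   # quantization step O(1/sqrt(n))
    theta_q = max(step, round(theta_mle / step) * step)
    param_bits = 0.5 * math.log2(n)             # cost of describing theta_q
    # Parameter cost plus penalty for coding with theta_q instead of the MLE.
    return (param_bits + laplacian_codelength(a, theta_q)
            - laplacian_codelength(a, theta_mle))

random.seed(1)
# Laplacian(theta = 4) samples: exponential magnitudes with random signs.
data = [random.expovariate(4.0) * random.choice([-1, 1]) for _ in range(10000)]
```

Since the MLE minimizes the codelength within the class, the penalty term is nonnegative, and with the O(1/√n) quantization it contributes only a fraction of a bit.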
The regret for this model is thus

  R(a, Q) = L(θ̂_d) + L(a|θ̂_d) − L(a|θ̂) = L(θ̂_d) + (−log P(a|θ̂_d)) − (−log P(a|θ̂)).

The key for this model to be universal is in the choice of the quantization step for the parameter θ̂, so that both its description length L(θ̂_d) and the difference −log P(a|θ̂_d) − (−log P(a|θ̂)) grow sublinearly. This can be achieved by letting the quantization step shrink as O(1/√n) [37], thus requiring d = O(0.5 log n) bits to describe each dimension of θ̂_d. This gives a total regret for two-part codes which grows as (dim(Θ)/2) log n, where dim(Θ) is the dimension of the parameter space Θ.

Another important universal code is the so-called Normalized Maximum Likelihood (NML) [42]. In this case the universal model Q*(a) corresponds to the model that minimizes the worst-case regret,

  Q*(a) = arg min_Q max_a {−log Q(a) + log P(a|θ̂(a))},

which can be written in closed form as Q*(a) = P(a|θ̂(a))/C(M, n), where the normalization constant

  C(M, n) := ∫_{a∈ℝⁿ} P(a|θ̂(a)) da

is the value of the minimax regret and depends only on M and the length of the data n.³ Note that the NML model requires C(M, n) to be finite, something which is often not the case.

The two previous examples are good for assigning a probability to coefficients a that have already been computed, but they cannot be used as a model for computing the coefficients themselves, since they depend on having observed them in the first place. For this and other reasons that will become clearer later, we concentrate our work on a third important family of universal codes derived from the so-called mixture models (also called Bayesian mixtures). In a mixture model, Q(a) is a convex mixture of all the models P(a|θ) in M, indexed by the model parameter θ, Q(a) = ∫_Θ P(a|θ)w(θ)dθ, where w(θ) specifies the weight of each model. Being a convex mixture implies that w(θ) ≥ 0 and ∫_Θ w(θ)dθ = 1, thus w(θ) is itself a probability measure over Θ. We will restrict ourselves to the particular case when a is considered a sequence of independent random variables,⁴

  Q(a) = ∏_{j=1}^n Q_j(a_j),  Q_j(a_j) = ∫_Θ P(a_j|θ) w_j(θ) dθ,   (8)

where the mixing function w_j(θ) can be different for each sample j. An important particular case of this scheme is the so-called Sequential Bayes code, in which w_j(θ) is computed sequentially as a posterior distribution based on previously observed samples, that is, w_j(θ) = P(θ|a₁, a₂,…,a_{j−1}) [21, Chapter 6]. In this work, for simplicity, we restrict ourselves to the case where w_j(θ) = w(θ) is the same for all j. The result is an IID model where the probability of each sample a_j is a mixture of some probability measure over ℝ,

  Q_j(a_j) = Q(a_j) = ∫_Θ P(a_j|θ) w(θ) dθ, ∀ j = 1,…,n.   (9)

A well-known result for IID mixture (Bayesian) codes states that their asymptotic regret is O((dim(Θ)/2) log n), thus establishing their universality, as long as the weighting function w(θ) is positive, continuous and unimodal over Θ (see for example [21, Theorem 8.1], [41]).

³ The minimax optimality of Q*(a) derives from the fact that it defines a complete uniquely decodable code for all data a of length n, that is, it satisfies the Kraft inequality with equality, Σ_{a∈ℝⁿ} 2^(−L_{Q*}(a)) = 1. Since every uniquely decodable code with lengths {L_Q(a) : a ∈ ℝⁿ} must satisfy the Kraft inequality (see [10, Chapter 5]), if there exists a value of a such that L_Q(a) < L_{Q*}(a) (that is, 2^(−L_Q(a)) > 2^(−L_{Q*}(a))), then there must exist a vector a′ for which L_Q(a′) > L_{Q*}(a′) for the Kraft inequality to hold. Therefore the regret of Q for a′ is necessarily greater than C(M, n), which shows that Q* is minimax optimal.

⁴ More sophisticated models which include dependencies between the elements of a are out of the scope of this work.
This gives us great flexibility in the choice of a weighting function w(θ) that guarantees universality. Of course, the results are asymptotic and the o(log n) terms can be large, so that the choice of w(θ) can have practical impact for small sample sizes. In the following discussion we derive several IID mixture models for the Laplacian model class M. For this purpose, it will be convenient to consider the corresponding one-sided counterpart of the Laplacian, which is the exponential distribution over the absolute value of the coefficients, |a|, and then symmetrize back to obtain the final distribution over the signed coefficients a.

A. The conjugate prior

In general, (9) can be computed in closed form if w(θ) is the conjugate prior of P(a|θ). When P(a|θ) is an exponential (one-sided Laplacian), the conjugate prior is the Gamma distribution,

  w(θ|κ, β) = Γ(κ)⁻¹ β^κ θ^(κ−1) e^(−βθ),  θ ∈ ℝ⁺,

where κ and β are its shape and scale parameters respectively. Plugging this into (9) we obtain the Mixture of exponentials model (MOE), which has the following form (see Appendix A for the full derivation),

  Q_MOE(a|β, κ) = κβ^κ (a + β)^(−(κ+1)),  a ∈ ℝ⁺.   (10)

With some abuse of notation, we will also denote the symmetric distribution on a as MOE,

  Q_MOE(a|β, κ) = (1/2)κβ^κ (|a| + β)^(−(κ+1)),  a ∈ ℝ.   (11)

Although the resulting prior has two parameters to deal with instead of one, we know from universal coding theory that, in principle, any choice of κ and β will give us a model whose codelength regret is asymptotically small. Furthermore, being an IID model, each coefficient of a is itself modeled as a mixture of exponentials, which makes the resulting model over a very well suited to the most flexible scenario, where the "underlying" θ can be different for each a_j.
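The closed form (10) can be verified numerically by integrating the exponential densities against the Gamma prior (a self-contained sketch with plain trapezoidal integration; the parameter values are our own):

```python
import math

def gamma_pdf(theta, kappa, beta):
    # Gamma prior with shape kappa; beta enters through e^{-beta*theta}.
    if theta <= 0.0:
        return 0.0
    return beta ** kappa / math.gamma(kappa) * theta ** (kappa - 1) * math.exp(-beta * theta)

def moe_pdf(a, kappa, beta):
    # Closed form of Eq. (10): the exponential/Gamma mixture.
    return kappa * beta ** kappa * (a + beta) ** (-(kappa + 1))

def mixture_numeric(a, kappa, beta, hi=60.0, steps=60_000):
    # Trapezoidal approximation of  integral_0^inf theta*e^{-theta*a} w(theta) dtheta.
    h = hi / steps
    total = 0.0
    for i in range(steps + 1):
        th = i * h
        f = th * math.exp(-th * a) * gamma_pdf(th, kappa, beta)
        total += f if 0 < i < steps else 0.5 * f
    return total * h
```

The numeric mixture and the analytic MOE density match to several decimal places, confirming the conjugacy computation.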
In Section V-B we will show that a single MOE distribution can fit each of the $K$ rows of $A$ better than $K$ separate Laplacian distributions fine-tuned to those rows, which would require a total of $K$ parameters to be estimated. Thus, not only can we deal with one single unknown $\theta$, but we can actually achieve maximum flexibility with only two parameters ($\kappa$ and $\beta$). This property is particular to the mixture models, and does not apply to the other universal models presented.

⁴ More sophisticated models which include dependencies between the elements of $a$ are beyond the scope of this work.

Finally, if desired, both $\kappa$ and $\beta$ can be easily estimated using the method of moments (see Appendix A). Given sample estimates of the first and second non-central moments, $\hat\mu_1 = \frac{1}{n}\sum_{j=1}^n |a_j|$ and $\hat\mu_2 = \frac{1}{n}\sum_{j=1}^n |a_j|^2$, we have

$$\hat\kappa = \frac{2(\hat\mu_2 - \hat\mu_1^2)}{\hat\mu_2 - 2\hat\mu_1^2} \quad \text{and} \quad \hat\beta = (\hat\kappa - 1)\hat\mu_1. \quad (12)$$

When the MOE prior is plugged into (5) instead of the standard Laplacian, the following new sparse coding formulation is obtained,

$$a_j = \arg\min_a \|x_j - Da\|_2^2 + \lambda_{\mathrm{MOE}} \sum_{k=1}^K \log(|a_k| + \beta), \quad (13)$$

where $\lambda_{\mathrm{MOE}} = 2\sigma^2(\kappa+1)$. An example of the MOE regularizer, and the thresholding function it induces, is shown in Figure 2 (center column) for $\kappa = 2.5$, $\beta = 0.05$. Smooth, differentiable non-convex regularizers such as the one in (13) have become a mainstream robust alternative to the $\ell_1$ norm in statistics [16], [51]. Furthermore, it has been shown that the use of such regularizers in regression leads to consistent estimators which are able to identify the relevant variables in a regression model (oracle property) [16]. This is not always the case for the $\ell_1$ regularizer, as was proved in [51].
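The moment estimators in (12) are straightforward to implement; a minimal sketch (the parameter values in the usage note below are arbitrary):

```python
import numpy as np

def moe_moments_fit(a):
    """Estimate (kappa, beta) of the MOE prior from coefficients a via Eq. (12)."""
    a = np.abs(np.asarray(a, dtype=float))
    mu1 = a.mean()            # first non-central moment
    mu2 = (a ** 2).mean()     # second non-central moment
    kappa = 2.0 * (mu2 - mu1 ** 2) / (mu2 - 2.0 * mu1 ** 2)
    beta = (kappa - 1.0) * mu1
    return kappa, beta
```

Note that the estimator implicitly assumes $\hat\mu_2 > 2\hat\mu_1^2$; for the MOE this holds in expectation whenever $\kappa > 2$ (finite second moment), since $\mu_2 - 2\mu_1^2 = 2\beta^2/((\kappa-1)^2(\kappa-2)) > 0$.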
The MOE regularizer has also been recently proposed in the context of compressive sensing [6], where it is conjectured to be better than the $\ell_1$ term at recovering sparse signals in compressive sensing applications.⁵ This conjecture was partially confirmed recently for non-convex regularizers of the form $\psi(a) = \|a\|_r$ with $0 < r < 1$ in [39], [18], and for a more general family of non-convex regularizers, including the one in (13), in [47]. In all cases, it was shown that the conditions on the sensing matrix (here $D$) guaranteeing exact recovery can be significantly relaxed if non-convex regularizers are used instead of the $\ell_1$ norm, provided that the exact solution to the non-convex optimization problem can be computed. In practice, this regularizer is being used with success in a number of applications, here and in [7], [46].⁶ Our experimental results in Section V provide further evidence of the benefits of non-convex regularizers, leading to much improved recovery accuracy of sparse coefficients compared to $\ell_1$ and $\ell_0$. We also show in Section V that the MOE prior is much more accurate than the standard Laplacian at modeling the distribution of reconstruction coefficients drawn from a large database of image patches, and how these improvements lead to better results in applications such as image estimation and classification.

B. The Jeffreys prior

The Jeffreys prior for a parametric model class $\mathcal{M} = \{P(a|\theta), \theta \in \Theta\}$ is defined as

$$w(\theta) = \frac{\sqrt{|I(\theta)|}}{\int_\Theta \sqrt{|I(\xi)|}\, d\xi}, \quad \theta \in \Theta, \quad (14)$$

where $|I(\theta)|$ is the determinant of the Fisher information matrix,

$$I(\theta) = \left. \mathbb{E}_{P(a|\tilde\theta)}\left[ -\frac{\partial^2}{\partial\tilde\theta^2} \log P(a|\tilde\theta) \right] \right|_{\tilde\theta = \theta}.$$
(15)

The Jeffreys prior is well known in Bayesian theory due to three important properties: it virtually eliminates the hyper-parameters of the model, it is invariant to the original parametrization of the distribution, and it is a "non-informative prior," meaning that it represents well the lack of prior information on the unknown parameter $\theta$ [3]. It turns out that, for quite different reasons, the Jeffreys prior is also of paramount importance in the theory of universal coding. For instance, it has been shown in [2] that the worst-case regret of the mixture code obtained using the Jeffreys prior approaches that of the NML as the number of samples $n$ grows. Thus, by using Jeffreys, one can attain the minimum worst-case regret asymptotically, while retaining the advantages of a mixture (not needing hindsight of $a$), which in our case means being able to use it as a model for computing $a$ via sparse coding.

For the exponential distribution we have $I(\theta) = \frac{1}{\theta^2}$. Clearly, if we let $\Theta = (0, \infty)$, the integral in (14) evaluates to $\infty$. Therefore, in order to obtain a proper integral, we need to exclude $0$ and $\infty$ from $\Theta$ (note that

⁵ In [6], the logarithmic regularizer arises from approximating the $\ell_0$ pseudo-norm as an $\ell_1$-normalized element-wise sum, without the insight and theoretical foundation reported here.

⁶ While these works support the use of such non-convex regularizers, none of them formally derives them using the universal coding framework as in this paper.

Fig. 2. Left to right: $\ell_1$ (green), MOE (red) and JOE (blue) regularizers and their corresponding thresholding functions $\mathrm{thres}(x) := \arg\min_a \{(x-a)^2 + \lambda\psi(|a|)\}$. The unbiasedness of MOE is due to the fact that large coefficients are not shrunk by the thresholding function. Also, although the JOE regularizer is biased, the shrinkage of large coefficients can be much smaller than the one applied to small coefficients.
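The thresholding functions of Figure 2 can be reproduced numerically by a direct grid search over the scalar problem in the caption. A small sketch for the MOE case follows; the values of $\lambda$, $\beta$ are illustrative choices, not necessarily those used in the paper's figure:

```python
import numpy as np

def moe_threshold(x, lam=0.01, beta=0.05):
    """thres(x) = argmin_a (x - a)^2 + lam * log(|a| + beta), by dense grid search."""
    grid = np.linspace(-abs(x) - 1.0, abs(x) + 1.0, 200001)
    cost = (x - grid) ** 2 + lam * np.log(np.abs(grid) + beta)
    return grid[np.argmin(cost)]
```

With these values, small inputs are mapped to (essentially) zero while large inputs pass through almost unshrunk, illustrating the near-unbiasedness of MOE for large coefficients.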
this was not needed for the conjugate prior). We choose to define $\Theta = [\theta_1, \theta_2]$, $0 < \theta_1 < \theta_2 < \infty$, leading to $w(\theta) = \frac{1}{\ln(\theta_2/\theta_1)}\frac{1}{\theta}$, $\theta \in [\theta_1, \theta_2]$. The resulting mixture, after being symmetrized around $0$, has the following form (see Appendix A):

$$Q_{\mathrm{JOE}}(a|\theta_1,\theta_2) = \frac{1}{2\ln(\theta_2/\theta_1)}\, \frac{1}{|a|} \left( e^{-\theta_1|a|} - e^{-\theta_2|a|} \right), \quad a \in \mathbb{R}\setminus\{0\}. \quad (16)$$

We refer to this prior as a Jeffreys mixture of exponentials (JOE), and again overload this acronym to refer to the symmetric case as well. Note that although $Q_{\mathrm{JOE}}$ is not defined for $a = 0$, its limit as $a \to 0$ is finite and evaluates to $\frac{\theta_2 - \theta_1}{2\ln(\theta_2/\theta_1)}$. Thus, by defining $Q_{\mathrm{JOE}}(0) := \frac{\theta_2 - \theta_1}{2\ln(\theta_2/\theta_1)}$, we obtain a prior that is well defined and continuous for all $a \in \mathbb{R}$. When plugged into (5), we get the JOE-based sparse coding formulation,

$$\min_a \|x_j - Da\|_2^2 + \lambda_{\mathrm{JOE}} \sum_{k=1}^K \left\{ \log|a_k| - \log\left( e^{-\theta_1|a_k|} - e^{-\theta_2|a_k|} \right) \right\}, \quad (17)$$

where, in accordance with the convention just defined for $Q_{\mathrm{JOE}}(0)$, we define $\psi_{\mathrm{JOE}}(0) := -\log(\theta_2 - \theta_1)$, the limit of the bracketed expression as $a_k \to 0$. According to the MAP interpretation we have $\lambda_{\mathrm{JOE}} = 2\sigma^2$, coming from the Gaussian assumption on the approximation error as explained in Section II-A. As with MOE, the JOE-based regularizer $\psi_{\mathrm{JOE}}(\cdot) = -\log Q_{\mathrm{JOE}}(\cdot)$ is continuous and differentiable in $\mathbb{R}^+$, and its derivative converges to a finite value at zero, $\lim_{a \to 0} \psi'_{\mathrm{JOE}}(a) = \frac{\theta_2^2 - \theta_1^2}{2(\theta_2 - \theta_1)} = \frac{\theta_1 + \theta_2}{2}$. As we will see later in Section IV, these properties are important to guarantee the convergence of sparse coding algorithms using non-convex priors.
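The behavior of $\psi_{\mathrm{JOE}}$ near zero is easy to verify numerically. A small sketch (with $\theta_1 = 20$, $\theta_2 = 100$ as in the Figure 2 example), using the convention $\psi_{\mathrm{JOE}}(0) = -\log(\theta_2 - \theta_1)$ for the per-coefficient term in (17):

```python
import math

def psi_joe(a, t1=20.0, t2=100.0):
    """Per-coefficient JOE regularizer from Eq. (17):
    log|a| - log(exp(-t1*|a|) - exp(-t2*|a|)), with its continuous extension at 0."""
    a = abs(a)
    if a == 0.0:
        return -math.log(t2 - t1)   # limit value at zero
    return math.log(a) - math.log(math.exp(-t1 * a) - math.exp(-t2 * a))
```

Near zero the value approaches $-\log(80) \approx -4.38$ and the one-sided slope approaches $(\theta_1+\theta_2)/2 = 60$; for large $|a|$ the increments approach $\theta_1|a|$, i.e., $\ell_1$-like behavior, as discussed next.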
Note from (17) that we can rewrite the JOE regularizer as

$$\psi_{\mathrm{JOE}}(a_k) = \log|a_k| - \log\left[ e^{-\theta_1|a_k|}\left(1 - e^{-(\theta_2-\theta_1)|a_k|}\right) \right] = \theta_1|a_k| + \log|a_k| - \log\left(1 - e^{-(\theta_2-\theta_1)|a_k|}\right),$$

so that for sufficiently large $|a_k|$, $\log(1 - e^{-(\theta_2-\theta_1)|a_k|}) \approx 0$ and $\theta_1|a_k| \gg \log|a_k|$, giving $\psi_{\mathrm{JOE}}(|a_k|) \approx \theta_1|a_k|$. Thus, for large $|a_k|$, the JOE regularizer behaves like $\ell_1$ with $\lambda_0 = 2\sigma^2\theta_1$. In terms of the probability model, this means that the tails of the JOE mixture behave like a Laplacian with $\theta = \theta_1$, with the region where this happens determined by the value of $\theta_2 - \theta_1$. The fact that the non-convex region of $\psi_{\mathrm{JOE}}(\cdot)$ is confined to a neighborhood around $0$ could help to avoid falling into bad local minima during the optimization (see Section IV for more details on the optimization aspects). Finally, although having Laplacian tails means that the estimated $a$ will be biased [16], the sharper peak at $0$ allows us to perform a more aggressive thresholding of small values without excessively clipping large coefficients, which would lead to the typical over-smoothing of signals recovered using an $\ell_1$ regularizer. See Figure 2 (rightmost column) for an example regularizer based on JOE with parameters $\theta_1 = 20$, $\theta_2 = 100$, and the thresholding function it induces.

The JOE regularizer has two hyper-parameters $(\theta_1, \theta_2)$ which define $\Theta$ and which, in principle, need to be tuned. One possibility is to choose $\theta_1$ and $\theta_2$ based on the physical properties of the data to be modeled, so that the possible values of $\theta$ never fall outside the range $[\theta_1, \theta_2]$. For example, when modeling patches from grayscale images with a limited dynamic range of $[0, 255]$ in a DCT basis, the maximum variance of the coefficients can never exceed $128^2$. The same is true for the minimum variance, which is determined by the quantization noise.
Having said this, in practice it is advantageous to adjust $[\theta_1, \theta_2]$ to the data at hand. In this case, although no closed-form solutions exist for estimating $[\theta_1, \theta_2]$ using MLE or the method of moments, standard optimization techniques can easily be applied to obtain them. See Appendix A for details.

C. The conditional Jeffreys

A recent approach to dealing with the case when the integral over $\Theta$ in the Jeffreys prior is improper is the conditional Jeffreys [21, Chapter 11]. The idea is to construct a proper prior, based on the improper Jeffreys prior and the first few $n_0$ samples of $a$, $(a_1, a_2, \ldots, a_{n_0})$, and then use it for the remaining data. The key observation is that although the normalizing integral $\int \sqrt{I(\theta)}\, d\theta$ in the Jeffreys prior is improper, the unnormalized prior $w(\theta) = \sqrt{I(\theta)}$ can be used as a measure to weight $P(a_1, a_2, \ldots, a_{n_0}|\theta)$,

$$w(\theta) = \frac{P(a_1, a_2, \ldots, a_{n_0}|\theta)\,\sqrt{I(\theta)}}{\int_\Theta P(a_1, a_2, \ldots, a_{n_0}|\xi)\,\sqrt{I(\xi)}\, d\xi}. \quad (18)$$

It turns out that the integral in (18) usually becomes proper for small $n_0$, on the order of $\dim(\Theta)$. In our case we have that, for any $n_0 \geq 1$, the resulting prior is a $\mathrm{Gamma}(\kappa_0, \beta_0)$ distribution with $\kappa_0 := n_0$ and $\beta_0 := \sum_{j=1}^{n_0} |a_j|$ (see Appendix A for details). Therefore, using the conditional Jeffreys prior in the mixture leads to a particular instance of MOE, which we denote by CMOE (although the functional form is identical to MOE), in which the Gamma parameters $\kappa$ and $\beta$ are automatically selected from the data. This may explain in part why the Gamma prior performs so well in practice, as we will see in Section V. Furthermore, we observe that the value of $\beta$ obtained with this approach ($\beta_0$) coincides with the one estimated using the method of moments for MOE if the $\kappa$ in MOE is fixed to $\kappa = \kappa_0 + 1 = n_0 + 1$.
Indeed, if computed from $n_0$ samples, the method of moments for MOE gives $\beta = (\kappa - 1)\hat\mu_1$, with $\hat\mu_1 = \frac{1}{n_0}\sum_{j=1}^{n_0} |a_j|$, which yields $\beta = \frac{n_0 + 1 - 1}{n_0}\sum_{j=1}^{n_0} |a_j| = \beta_0$. It turns out in practice that the value of $\kappa$ estimated using the method of moments lies between $2$ and $3$ for the type of data that we deal with (see Section V), which is just above the minimum acceptable value for the CMOE prior to be defined, namely $n_0 = 1$ (corresponding to $\kappa = 2$). This justifies our choice of $n_0 = 2$ when applying CMOE in practice.

As $n_0$ becomes large, so does $\kappa_0 = n_0$, and the Gamma prior $w(\theta)$ obtained with this method converges to a Kronecker delta at the mean value of the Gamma distribution, $\delta_{\kappa_0/\beta_0}(\cdot)$. Consequently, when $w(\theta) \approx \delta_{\kappa_0/\beta_0}(\theta)$, the mixture $\int_\Theta P(a|\theta) w(\theta)\, d\theta$ will be close to $P(a|\kappa_0/\beta_0)$. Moreover, from the definitions of $\kappa_0$ and $\beta_0$ we have that $\kappa_0/\beta_0$ is exactly the MLE of $\theta$ for the Laplacian distribution. Thus, for large $n_0$, the conditional Jeffreys method approaches the MLE Laplacian model. Although from a universal coding point of view this is not a problem, for large $n_0$ the conditional Jeffreys model loses its flexibility to deal with the case when different coefficients in $A$ have different underlying $\theta$. On the other hand, a small $n_0$ can lead to a prior $w(\theta)$ that is overfitted to the local properties of the first samples, which can be problematic for non-stationary data such as image patches. Ultimately, $n_0$ defines a trade-off between the flexibility and the accuracy of the resulting model.

IV. OPTIMIZATION AND IMPLEMENTATION DETAILS

All of the mixture models discussed so far yield non-convex regularizers, rendering the sparse coding problem non-convex in $a$.
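The conditional-Jeffreys parameter choice is a two-line computation. A minimal sketch, which also checks the coincidence with the method-of-moments $\beta$ noted above (the sample values in the check are arbitrary):

```python
import numpy as np

def cmoe_params(a, n0=2):
    """Conditional-Jeffreys (CMOE) Gamma parameters from the first n0 coefficients:
    kappa0 = n0, beta0 = sum of |a_j| for j = 1..n0."""
    a = np.abs(np.asarray(a, dtype=float))
    return float(n0), float(a[:n0].sum())
```

With $\kappa = n_0 + 1$, the method-of-moments relation $\beta = (\kappa-1)\hat\mu_1$ over the same $n_0$ samples gives $n_0 \cdot \frac{1}{n_0}\sum_{j=1}^{n_0}|a_j| = \beta_0$, i.e., the two estimates coincide exactly.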
It turns out, however, that these regularizers satisfy certain conditions which make the resulting sparse coding optimization well suited to approximation by a sequence of successive convex sparse coding problems, a technique known as Local Linear Approximation (LLA) [52] (see also [46], [19] for alternative optimization techniques for such non-convex sparse coding problems). In a nutshell, suppose we need to obtain an approximate solution to

$$a_j = \arg\min_a \|x_j - Da\|_2^2 + \lambda \sum_{k=1}^K \psi(|a_k|), \quad (19)$$

where $\psi(\cdot)$ is a non-convex function over $\mathbb{R}^+$. At each LLA iteration, we compute $a_j^{(t+1)}$ by performing a first-order expansion of $\psi(\cdot)$ around the $K$ elements of the current estimate $a_{kj}^{(t)}$,

$$\tilde\psi_k^{(t)}(|a|) = \psi\left(|a_{kj}^{(t)}|\right) + \psi'\left(|a_{kj}^{(t)}|\right)\left(|a| - |a_{kj}^{(t)}|\right) = \psi'\left(|a_{kj}^{(t)}|\right)|a| + c_k,$$

and solving the convex weighted $\ell_1$ problem that results after discarding the constant terms $c_k$,

$$a_j^{(t+1)} = \arg\min_a \|x_j - Da\|_2^2 + \lambda \sum_{k=1}^K \tilde\psi_k^{(t)}(|a_k|) = \arg\min_a \|x_j - Da\|_2^2 + \sum_{k=1}^K \lambda_k^{(t)}|a_k|, \quad (20)$$

where we have defined $\lambda_k^{(t)} := \lambda\,\psi'\left(|a_{kj}^{(t)}|\right)$. If $\psi'(\cdot)$ is continuous in $(0, +\infty)$, and right-continuous and finite at $0$, then the LLA algorithm converges to a stationary point of (19) [51]. These conditions are met by both the MOE and JOE regularizers. Although, for the JOE prior, the derivative $\psi'(\cdot)$ is not defined at $0$, it converges to the limit $\frac{\theta_2^2 - \theta_1^2}{2(\theta_2 - \theta_1)}$ as $|a| \to 0$, which is well defined for $\theta_2 \neq \theta_1$. If $\theta_2 = \theta_1$, the JOE mixing function is a Kronecker delta and the prior reduces to a Laplacian with parameter $\theta = \theta_1 = \theta_2$. Therefore, for all of the mixture models studied, the LLA method converges to a stationary point. In practice, we have observed that 5 iterations are enough for convergence.
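The LLA scheme above amounts to an iteratively reweighted $\ell_1$ solver. A minimal sketch for the MOE case, where $\psi(|a|) = \log(|a| + \beta)$ so that $\psi'(|a|) = 1/(|a| + \beta)$, using a plain ISTA inner solver for the weighted $\ell_1$ subproblem (20). For simplicity this sketch starts from $a^{(0)} = 0$ rather than the Laplacian-equivalent initialization described below, and all parameter values are illustrative:

```python
import numpy as np

def weighted_l1_ista(D, x, lam_k, n_iter=600):
    """Solve min_a ||x - D a||_2^2 + sum_k lam_k |a_k| by proximal gradient (ISTA)."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - 2.0 * D.T @ (D @ a - x) / L    # gradient step on the quadratic term
        a = np.sign(z) * np.maximum(np.abs(z) - lam_k / L, 0.0)  # per-coef soft threshold
    return a

def lla_moe(D, x, lam, beta, n_lla=5):
    """LLA for the MOE regularizer: each iteration solves the weighted-l1
    problem (20) with weights lam_k = lam * psi'(|a_k|) = lam / (|a_k| + beta)."""
    a = np.zeros(D.shape[1])                   # simplified initialization (see text)
    for _ in range(n_lla):
        lam_k = lam / (np.abs(a) + beta)
        a = weighted_l1_ista(D, x, lam_k)
    return a
```

The reweighting penalizes coefficients that are currently small much more heavily than large ones, which is what produces the aggressive-yet-unbiased thresholding behavior of Figure 2.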
Thus, the cost of sparse coding with the proposed non-convex regularizers is at most 5 times that of a single $\ell_1$ sparse coding, and can be less in practice if warm restarts are used to begin each iteration. Of course we need a starting point $a_j^{(0)}$ and, the problem being non-convex, this choice will influence the approximation that we obtain. One reasonable choice, used in this work, is to define $a_{kj}^{(0)} = a_0$, $k = 1, \ldots, K$, $j = 1, \ldots, N$, where $a_0$ is a scalar such that $\psi'(a_0) = \mathbb{E}_w[\theta]$, that is, such that the first sparse coding step corresponds to a Laplacian regularizer whose parameter is the average value of $\theta$ under the mixing prior $w(\theta)$. Finally, note that although the discussion here has revolved around the Lagrangian formulation of sparse coding in (4), this technique is also applicable to the constrained formulation given by Equation (1) for a fixed dictionary $D$.

Expected approximation error: Since we are solving a convex approximation to the actual target optimization problem, it is of interest to know how good this approximation is in terms of the original cost function. To give an idea of this, after an approximate solution $a$ is obtained, we compute the expected value of the difference between the true and approximate regularization terms. The expectation is taken, naturally, with respect to the assumed distribution of the coefficients in $a$. Since the regularizers are separable, we can compute the error separately as an expectation over each $k$-th coefficient, $\zeta_q(a_k) = \mathbb{E}_{\nu \sim q}\left[\tilde\psi_k(\nu) - \psi(\nu)\right]$, where $\tilde\psi_k(\cdot)$ is the linear approximation of $\psi(\cdot)$ around the final estimate of $a_k$. For the case $q = \mathrm{MOE}$, the expression obtained is (see Appendix)

$$\zeta_{\mathrm{MOE}}(a_k, \kappa, \beta) = \mathbb{E}_{\nu \sim \mathrm{MOE}(\kappa,\beta)}\left[\tilde\psi_k(\nu) - \psi(\nu)\right] = \log(a_k + \beta) + \frac{1}{a_k + \beta}\left(\frac{\beta}{\kappa - 1} - a_k\right) - \log\beta - \frac{1}{\kappa}.$$
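The closed form above is easy to verify numerically. A quick sketch, using $\kappa = 2.8$, $\beta = 0.07$ (the values estimated for image-patch data in the experimental section): since $\psi$ is concave, its tangent $\tilde\psi_k$ lies above it, so $\zeta \geq 0$, with $\zeta(0) = (\kappa-1)^{-1} - \kappa^{-1}$ and the minimum $\log\frac{\kappa}{\kappa-1} - \frac{1}{\kappa}$ attained at the MOE mean $a_k = \beta/(\kappa-1)$:

```python
import numpy as np

def zeta_moe(a_k, kappa, beta):
    """Expected LLA approximation error for the MOE regularizer psi = log(. + beta)."""
    return (np.log(a_k + beta) + (beta / (kappa - 1.0) - a_k) / (a_k + beta)
            - np.log(beta) - 1.0 / kappa)
```

With $\kappa = 2.8$, $\beta = 0.07$ this gives $\zeta(0) \approx 0.198$ and a minimum of $\approx 0.085$.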
In the MOE case, for fixed $\kappa$ and $\beta$, the minimum of $\zeta_{\mathrm{MOE}}$ occurs at $a_k = \frac{\beta}{\kappa-1} = \mu(\beta,\kappa)$, the mean of the MOE distribution. We also have $\zeta_{\mathrm{MOE}}(0) = (\kappa-1)^{-1} - \kappa^{-1}$. The function $\zeta_q(\cdot)$ can be evaluated on each coefficient of $A$ to give an idea of the quality of the approximation. For example, in the experiments of Section V we obtained an average value of $0.16$, which lies between $\zeta_{\mathrm{MOE}}(0) = 0.19$ and $\min_a \zeta_{\mathrm{MOE}}(a) = 0.09$. Depending on the experiment, this represents 6% to 7% of the total sparse coding cost function value, showing the efficiency of the proposed optimization.

Comments on parameter estimation: All the universal models presented so far, with the exception of the conditional Jeffreys, depend on hyper-parameters which in principle should be tuned for optimal performance (recall that they do not influence the universality of the model). If tuning is needed, it is important to remember that the proposed universal models are intended for reconstruction coefficients of clean data, and thus their hyper-parameters should be computed from statistics of clean data, or by compensating for the distortion that noise induces in the statistics (see for example [30]). Finally, note that when $D$ is linearly dependent, with $\mathrm{rank}(D) = M$, the coefficient matrix $A$ resulting from an exact reconstruction of $X$ will have many zeros, which are not properly explained by any continuous distribution such as the Laplacian. We sidestep this issue by computing the statistics only from the non-zero coefficients in $A$. Dealing properly with the case $P(a = 0) > 0$ is beyond the scope of this work.

V. EXPERIMENTAL RESULTS

In the following experiments, the testing data $X$ are $8 \times 8$ patches drawn from the Pascal VOC2006 testing subset,⁷ which consists of high-quality $640 \times 480$ RGB images with 8 bits per channel.
For the experiments, we converted the 2600 images to grayscale by averaging the channels, and scaled the dynamic range to the $[0, 1]$ interval. Results similar to those shown here are also obtained for other patch sizes.

A. Dictionary learning

For the experiments that follow, unless otherwise stated, we use a "global" overcomplete dictionary $D$ with $K = 4M = 256$ atoms, trained on the full VOC2006 training subset using the method described in [35], [36], which seeks to minimize the following cost during training,⁸

$$\min_{D,A} \frac{1}{N}\sum_{j=1}^N \left\{ \|x_j - Da_j\|_2^2 + \lambda\psi(a_j) \right\} + \mu\left\| D^T D \right\|_F^2, \quad (21)$$

where $\|\cdot\|_F$ denotes the Frobenius norm. The additional term $\mu\|D^T D\|_F^2$ encourages incoherence in the learned dictionary, that is, it forces the atoms to be as orthogonal as possible. Dictionaries with lower coherence are well known to have several theoretical advantages, such as improved ability to recover sparse signals [11], [45], and faster and better convergence to the solution of the sparse coding problems (1) and (3) [13]. Furthermore, in [35] it was shown that adding incoherence leads to improvements in a variety of sparse modeling applications, including the ones discussed below. We used MOE as the regularizer in (21), with $\lambda = 0.1$ and $\mu = 1$, both chosen empirically. See [1], [26], [35] for details on the optimization of (3) and (21).

B. MOE as a prior for sparse coding coefficients

We begin by comparing the performance of the Laplacian and MOE priors in fitting a single global distribution to the whole matrix $A$. We compute $A$ using (1) with $\epsilon \approx 0$ and then, following the discussion in Section IV, restrict our study to the nonzero elements of $A$. The empirical distribution of $A$ is plotted in Figure 3(a), along with the best-fitting Laplacian, MOE, JOE, and a particularly good example of the conditional Jeffreys (CMOE) distributions.⁹
The MLE for the Laplacian fit is $\hat\theta = N_1/\|A\|_1 = 27.2$ (here $N_1$ is the number of nonzero elements in $A$). For MOE, using (12), we obtained $\kappa = 2.8$ and $\beta = 0.07$. For JOE, $\theta_1 = 2.4$ and $\theta_2 = 371.4$. Following the discussion in Section III-C, we used the value $\kappa = 2.8$ obtained via the method of moments for MOE as a hint for choosing $n_0 = 2$ ($\kappa = n_0 + 1 = 3 \approx 2.8$), yielding $\beta_0 = 0.07$, which coincides with the $\beta$ obtained using the method of moments. As observed in Figure 3(a), in all cases the proposed mixture models fit the data better: significantly better for both Gamma-based mixtures, MOE and CMOE, and slightly better for JOE. This is further confirmed by the Kullback-Leibler divergence (KLD) obtained in each case. Note that JOE fails to significantly improve on the Laplacian model due to the excessively large estimated range $[\theta_1, \theta_2]$. In this sense, it is clear that the JOE model is very sensitive to its hyper-parameters, and a better and more robust estimation would be needed for it to be useful in practice. Given these results, hereafter we concentrate on the best case, which is the MOE prior (which, as detailed above, can be derived from the conditional Jeffreys as well, thus representing both approaches). From Figure 1(e) we know that the optimal $\hat\theta$ varies locally across different regions; thus, we expect the mixture models to perform well also on a per-atom basis. This is confirmed in Figure 3(b), where we show, for each row $a^k$, $k = 1, \ldots, K$, the difference in KLD between the globally fitted MOE distribution and the best per-atom fitted MOE, the globally fitted Laplacian, and the per-atom fitted Laplacians, respectively. As can be observed, the KLD obtained with the global MOE is significantly smaller than that of the global Laplacian in all cases, and even than that of the per-atom Laplacians in most cases.
This shows that MOE, with only two parameters (which can be easily estimated, as

⁷ http://pascallin.ecs.soton.ac.uk/challenges/VOC/databases.html#VOC2006

⁸ While we could have used off-the-shelf dictionaries such as DCT in order to test our universal sparse coding framework, it is important to use dictionaries that lead to state-of-the-art results in order to show the additional potential improvement of our proposed regularizers.

⁹ To compute the empirical distribution, we quantized the elements of $A$ uniformly in steps of $2^{-8}$, which, for the amount of data available, gives us enough detail and at the same time reliable statistics for all the quantized values.

Fig. 3. (a) Empirical distribution of the coefficients in $A$ for image patches (blue dots), and the best-fitting Laplacian (green), MOE (red), CMOE (orange) and JOE (yellow) distributions. The Laplacian (KLD = 0.17 bits) is clearly not fitting the tails properly, and is not sufficiently peaked at zero either. The two models based on a Gamma prior, MOE (KLD = 0.01 bits) and CMOE (KLD = 0.01 bits), provide an almost perfect fit. The fitted JOE (KLD = 0.14 bits) is the most sharply peaked at 0, but does not fit the tails as tightly as desired. As a reference, the entropy of the empirical distribution is $H = 3.00$ bits. (b) KLD for the best-fitting global Laplacian (dark green), per-atom Laplacian (light green), global MOE (dark red) and per-atom MOE (light red), relative to the KLD between the globally fitted MOE distribution and the empirical distribution. The horizontal axis represents the index of each atom, $k = 1, \ldots, K$, ordered according to the difference in KLD between the global MOE and the per-atom Laplacian model. Note how the global MOE outperforms both the global and per-atom Laplacian models in all but the first 4 cases.
(c) Active set recovery accuracy of $\ell_1$ and MOE, as defined in Section V-C, for $L = 5$ and $L = 10$, as a function of $\sigma$. The improvement of MOE over $\ell_1$ is a factor of 5 to 9. (d) PSNR of the recovered sparse signals with respect to the true signals. In this case, significant improvements can be observed in the high SNR range, especially for highly sparse ($L = 5$) signals. The performance of both methods is practically the same for $\sigma \geq 10$.

detailed in the text), is a much better model than $K$ Laplacians (requiring $K$ critical parameters) fitted specifically to the coefficients associated with each atom. Whether these modeling improvements have a practical impact is explored in the next experiments.

C. Recovery of noisy sparse signals

Here we compare the active set recovery properties of the MOE prior with those of the $\ell_1$-based one, on data for which the sparsity assumption $|\mathcal{A}_j| \leq L$ holds exactly for all $j$. To this end, we obtain sparse approximations to each sample $x_j$ using the $\ell_0$-based Orthogonal Matching Pursuit algorithm (OMP) on $D$ [28], and record the resulting active sets $\mathcal{A}_j$ as ground truth. The data is then contaminated with additive Gaussian noise of variance $\sigma^2$, and recovery is performed by solving (1) for $A$ with $\epsilon = CM\sigma^2$ and either the $\ell_1$- or the MOE-based regularizer for $\psi(\cdot)$. We use $C = 1.32$, which is a standard value in denoising applications (see for example [27]). For each sample $j$, we measure the error of each method in recovering the active set as the Hamming distance between the true and estimated supports of the corresponding reconstruction coefficients. The accuracy of a method is then given as the percentage of samples for which this error falls below a given threshold $T$. Results are shown in Figure 3(c) for $L = (5, 10)$ and $T = (2, 4)$ respectively, for various values of $\sigma$. Note the very significant improvement obtained with the proposed model.
Given the estimated active set $\mathcal{A}_j$, the estimated clean patch is obtained by projecting $x_j$ onto the subspace spanned by the atoms that are active according to $\mathcal{A}_j$, using least squares (the standard procedure for denoising once the active set is determined). We then measure the PSNR of the estimated patches with respect to the true ones. The results are shown in Figure 3(d), again for various values of $\sigma$. As can be observed, the MOE-based recovery is significantly better, especially in the high SNR range. Notably, the more accurate active set recovery of MOE does not seem to improve the denoising performance in this case. However, as we will see next, it does make a difference when denoising real-life signals, as well as in classification tasks.

D. Recovery of real signals with simulated noise

This experiment is analogous to the previous one, with the original natural image patches as data (without forcing exact sparsity). Since in this case the sparsity assumption is only approximate, and no ground truth is available for the active sets, we compare the different methods in terms of their denoising performance. A critical strategy in image denoising is the use of overlapping patches, where for each pixel in the image a patch is extracted with that pixel as its center. The patches are denoised independently as $M$-dimensional signals

Fig. 4. Sample image denoising results. Top: Barbara, $\sigma = 30$. Bottom: Boats, $\sigma = 40$. From left to right: noisy, $\ell_1$/OMP, $\ell_1$/$\ell_1$, MOE/MOE. The reconstruction obtained with the proposed model is more accurate, as evidenced by a better reconstruction of the texture in Barbara and sharp edges in Boats, and does not produce the artifacts seen in both the $\ell_1$ and $\ell_0$ reconstructions, which appear as black/white speckles all over Barbara, and ringing on the edges in Boats.
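The two measurements used in these experiments, the support (Hamming) error and the least-squares projection onto the recovered active set, can be sketched as follows. This is a simplified illustration with synthetic data, not the paper's exact experimental code:

```python
import numpy as np

def support_error(true_set, est_set, K):
    """Hamming distance between the indicator vectors of two active sets over K atoms."""
    t = np.zeros(K, dtype=bool); t[list(true_set)] = True
    e = np.zeros(K, dtype=bool); e[list(est_set)] = True
    return int(np.sum(t != e))

def project_on_active(D, x, active):
    """Least-squares estimate of the clean patch given the active set."""
    idx = sorted(active)
    coef, *_ = np.linalg.lstsq(D[:, idx], x, rcond=None)
    return D[:, idx] @ coef
```

Projecting onto the (low-dimensional) active subspace discards most of the noise, which is why accurate active-set recovery matters for the final estimate.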
σ = 10 (learning/coding):
image    ℓ1/ℓ0      ℓ1/ℓ1      MOE/ℓ0     MOE/MOE    [1]
barbara  30.4/34.4  31.2/33.8  30.5/34.4  30.9/34.4  34.4
boat     30.4/33.7  30.9/33.4  30.5/33.7  30.8/33.8  33.7
lena     31.8/35.5  32.4/35.1  32.1/35.6  32.3/35.6  35.5
peppers  31.6/34.8  32.1/34.6  31.8/34.9  32.0/34.9  34.8
man      29.6/33.0  30.6/32.9  29.7/33.0  30.2/33.1  32.8
AVERAGE  30.7/34.2  31.4/33.9  30.8/34.2  31.1/34.3  34.1

σ = 20 (learning/coding):
image    ℓ1/ℓ0      ℓ1/ℓ1      MOE/ℓ0     MOE/MOE    [1]
barbara  26.5/30.6  26.9/30.2  26.8/30.7  27.0/30.9  30.8
boat     26.9/30.2  27.2/30.1  27.1/30.3  27.3/30.4  30.3
lena     28.3/32.3  28.6/32.0  28.7/32.3  28.8/32.4  32.4
peppers  28.3/31.9  28.7/31.8  28.6/31.9  28.7/32.0  31.9
man      25.8/28.8  26.3/28.9  26.0/28.9  26.2/29.0  28.8
AVERAGE  27.0/30.6  27.4/30.4  27.3/30.6  27.5/30.8  30.6

σ = 30 (learning/coding):
image    ℓ1/ℓ0      ℓ1/ℓ1      MOE/ℓ0     MOE/MOE    [1]
barbara  24.5/28.2  24.8/28.2  24.8/28.3  24.9/28.5  28.4
boat     25.0/28.1  25.2/28.2  25.3/28.2  25.4/28.3  28.2
lena     26.4/30.1  26.6/30.2  26.7/30.3  26.8/30.4  30.3
peppers  26.3/29.8  26.6/29.9  26.6/29.9  26.7/29.9  -
man      23.9/26.5  24.2/26.8  24.1/26.6  24.2/26.7  26.5
AVERAGE  25.1/28.3  25.4/28.5  25.4/28.4  25.5/28.5  28.4

TABLE I. DENOISING RESULTS: In each table, each column shows the denoising performance of a learning+coding combination. Results are shown in pairs, where the left number is the PSNR between the clean and recovered individual patches, and the right number is the PSNR between the clean and recovered images. Best results are shown in bold in the original. The proposed MOE produces better final results than both the $\ell_0$ and $\ell_1$ ones in all cases, and at the patch level for all $\sigma > 10$. Note that the average values reported are the PSNR of the average MSE, not the average of the PSNRs.

and then recombined into the final denoised images by simple averaging.
Although this consistently improves the final result in all cases, the size of the improvement varies significantly with the method used to denoise the individual patches. Therefore, we now compare the denoising performance of each method at two levels: individual patches and final image. To denoise each image, the global dictionary described in Section V-A is further adapted to the noisy image patches using (21) for a few iterations, and used to encode the noisy patches via (2) with $\epsilon = CM\sigma^2$. We repeated the experiment for two learning variants ($\ell_1$ and MOE regularizers) and two coding variants ((2) with the regularizer used for learning, and $\ell_0$ via OMP). The four variants were applied to the standard images Barbara, Boats, Lena, Man and Peppers, and the results are summarized in Table I, with sample results shown in Figure 4. Although the quantitative improvements over $\ell_1$ seen in Table I are small, there is a significant improvement at the visual level, as can be seen in Figure 4. In all cases the PSNR obtained matches or surpasses the ones reported in [1].¹⁰

E. Zooming

As an example of signal recovery in the absence of noise, we took the previous set of images, plus a particularly challenging one (Tools), and subsampled them to half their size. We then simulated a zooming effect by upsampling

¹⁰ Note that in [1] the denoised image is finally blended with the noisy image using an empirical weight, providing an extra improvement to the final PSNR in some cases. The results in Table I are already better without this extra step.

image    cubic  ℓ0    ℓ1    MOE
barbara  25.0   25.6  25.5  25.6
boat     28.9   29.8  29.8  29.9
lena     32.7   33.8  33.8  33.9
peppers  32.0   33.4  33.4  33.4
man      28.4   29.4  29.3  29.4
tools    21.0   22.3  22.2  22.3
AVERAGE  22.8   24.0  24.0  24.1

Fig. 5. Zooming results (PSNR in dB). Left to right: summary table, the Tools image, and a detail of the zooming results for the framed region (top to bottom, left to right: cubic, $\ell_0$, $\ell_1$, MOE).
Each of the 75% missing pixels is then estimated (see, e.g., [50] and references therein). We use a technique similar to the one used in [32]: the image is first interpolated and then deconvolved using a Wiener filter. The deconvolved image has artifacts that we treat as noise in the reconstruction. However, since there is no real noise, we do not perform averaging of the patches, using only the center pixel of x̂_j to fill in the missing pixel at j. The results are summarized in Figure 5, where we again observe that using MOE instead of ℓ0 or ℓ1 improves the results. As can be seen, the MOE result is as sharp as that of ℓ0 but produces fewer artifacts, which is reflected in the 0.1 dB overall improvement obtained with MOE, as seen in the summary table of Figure 5.

F. Classification with universal sparse models

In this section we apply the proposed universal models to a classification problem in which each sample x_j is to be assigned a class label y_j = 1, ..., c, which serves as an index into the set of possible classes {C_1, C_2, ..., C_c}. We follow the procedure of [36], where the classifier assigns each sample x_j by means of the maximum a posteriori criterion (5), with the term −log P(a) corresponding to the assumed prior, and the dictionaries representing each class are learned from training samples using (21) with the corresponding regularizer ψ(a) = −log P(a). Each experiment is repeated for the baseline Laplacian model implied by the ℓ1 regularizer and for the universal model MOE, and the results are then compared. We expect the more accurate prior model for the coefficients to yield an improved likelihood estimation, which in turn should improve the accuracy of the system. We begin with a classic texture classification problem, where patches have to be identified as belonging to one out of a number of possible textures.
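The per-class MAP assignment just described can be sketched as follows. This is a simplified illustration, not the paper's system: the coder is a plain ISTA lasso rather than the solvers used by the authors, the MOE regularizer is only used to score the candidate classes, and the values of `lam` and `beta` are illustrative.

```python
import numpy as np

def ista_lasso(x, D, lam, iters=200):
    """Plain ISTA for min_a 0.5*||x - D a||^2 + lam*||a||_1 (simplified coder)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(iters):
        g = a + D.T @ (x - D @ a) / L
        a = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)
    return a

def moe_penalty(a, beta):
    """MOE regularizer psi(a) = sum_i log(1 + |a_i|/beta), up to constants."""
    return np.sum(np.log(1.0 + np.abs(a) / beta))

def classify(x, dicts, lam=0.1, beta=0.07):
    """Assign x to the class whose dictionary yields the lowest MAP energy
    (fitting term plus regularizer), one dictionary per class."""
    energies = []
    for D in dicts:
        a = ista_lasso(x, D, lam)
        energies.append(0.5 * np.sum((x - D @ a) ** 2) + moe_penalty(a, beta))
    return int(np.argmin(energies))
```

A sample that is sparsely representable in one class's dictionary but dense in the others gets a much lower energy under the correct class, which is the mechanism the experiments below rely on.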
We experimented with samples of c = 2 and c = 3 textures drawn at random from the Brodatz database,11 the ones actually used being shown in Figure 6. In each case the experiment was repeated 10 times. In each repetition, a dictionary of K = 300 atoms was learned from all 16×16 patches of the leftmost halves of each sample texture, and the patches from the rightmost halves of the texture samples were then classified. For c = 2 we obtained an average error rate of 5.13% using ℓ1 against 4.12% using MOE, a 20% reduction in classification error. For c = 3 the average error rate was 13.54% using ℓ1 and 11.48% using MOE, which is 15% lower. Thus, using the universal model instead of ℓ1 yields a significant improvement in this case (see for example [26] for other results on classification of Brodatz textures).

The second sample problem is the Graz'02 bike detection problem,12 where each pixel of each testing image has to be classified as either background or as part of a bike; each pixel thus belongs to one of two classes. For each of the training images (which by convention are the first 150 even-numbered images), we are given a mask that tells us whether each pixel belongs to a bike or to the background. We then train one dictionary for bike patches and another for background patches; patches that contain pixels from both classes are assigned to the class corresponding to the majority of their pixels.

11 http://www.ux.uis.no/~tranden/brodatz.html
12 http://lear.inrialpes.fr/people/marszalek/data/ig02/

Fig. 6. Textures used in the texture classification example.

Fig. 7. Classification results. Left to right: precision vs. recall curve, a sample image from the Graz'02 dataset, its ground truth, and the corresponding estimated maps obtained with ℓ1 and MOE for a fixed threshold. The precision vs. recall curve shows that the mixture model gives better precision in all cases. In the example, the classification obtained with MOE yields fewer false positives and more true positives than the one obtained with ℓ1.

In Figure 7 we show the precision vs. recall curves obtained with the detection framework when either the ℓ1 or the MOE regularizer was used in the system. As can be seen, the MOE-based model outperforms the ℓ1-based one in this classification task as well, giving better precision for all recall values. In the above experiments, the parameters for the ℓ1 prior (λ), the MOE model (λ_MOE) and the incoherence term (μ) were all adjusted by cross-validation. The only exception is the MOE parameter β, which was chosen based on the fitting experiment as β = 0.07.

VI. CONCLUDING REMARKS

A framework for designing sparse modeling priors was introduced in this work, using tools from universal coding, which formalizes sparse coding and modeling from an MDL perspective. The priors obtained lead to models with both theoretical and practical advantages over the traditional ℓ0- and ℓ1-based ones. In all derived cases, the resulting non-convex problems can be efficiently (approximately) solved via a few iterations of (weighted) ℓ1 subproblems. We also showed that these priors fit the empirical distribution of sparse codes of image patches significantly better than the traditional IID Laplacian model, and even than the non-identically distributed independent Laplacian model in which a different Laplacian parameter is adjusted to the coefficients associated with each atom, thus showing the flexibility and accuracy of the proposed models. This additional flexibility, furthermore, comes at the small cost of only 2 parameters that can be easily and efficiently tuned (either (κ, β) in the MOE model, or (θ1, θ2) in the JOE model), instead of K (the dictionary size) as in weighted ℓ1 models.
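The weighted-ℓ1 iterations mentioned above can be sketched for the MOE regularizer as follows: each local linear approximation (LLA) step linearizes log(|a_i| + β) at the current iterate, yielding weights 1/(|a_i| + β). The inner solver here is a plain ISTA, and `lam`, `beta`, and the iteration counts are illustrative choices, not the paper's.

```python
import numpy as np

def weighted_ista(x, D, lam, w, a0=None, iters=300):
    """ISTA for the weighted lasso: min_a 0.5*||x - D a||^2 + lam*sum_i w_i*|a_i|."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1]) if a0 is None else a0.copy()
    for _ in range(iters):
        g = a + D.T @ (x - D @ a) / L      # gradient step on the fitting term
        a = np.sign(g) * np.maximum(np.abs(g) - lam * w / L, 0.0)  # soft threshold
    return a

def moe_lla(x, D, lam, beta, outer=5):
    """LLA for the MOE penalty sum_i log(|a_i| + beta): each outer step solves a
    weighted-l1 subproblem with weights 1/(|a_i| + beta), warm-started."""
    a = np.zeros(D.shape[1])
    for _ in range(outer):
        w = 1.0 / (np.abs(a) + beta)       # reweighting from the current iterate
        a = weighted_ista(x, D, lam, w, a0=a)
    return a
```

Large coefficients receive small weights and are barely penalized in later iterations, which is how the non-convex penalty reduces the bias of plain ℓ1 while reusing convex machinery.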
The additional accuracy of the proposed models was shown to have significant practical impact in active set recovery of sparse signals, image denoising, and classification applications. Compared to the Bayesian approach, we avoid the potential burden of solving several sampled sparse problems, or being forced to use a conjugate prior for computational reasons (although in our case, a fortiori, the conjugate prior does provide us with a good model). Overall, as demonstrated in this paper, the introduction of information theory tools can lead to formally addressing critical aspects of sparse modeling.

Future work in this direction includes the design of priors that take into account the nonzero mass at a = 0 that appears in overcomplete models, and the online learning of the model parameters from noisy data, following for example the technique in [30].

ACKNOWLEDGMENTS

Work partially supported by NGA, ONR, ARO, NSF, NSSEFF, and FUNDACIBA-ANTEL. We wish to thank Julien Mairal for providing us with his fast sparse modeling toolbox, SPAMS.13 We also thank Federico Lecumberry for his participation in the incoherent dictionary learning method, and for helpful comments.

APPENDIX

DERIVATION OF THE MOE MODEL

In this case we have P(a|θ) = θe^{−θa} and w(θ|κ, β) = (1/Γ(κ)) θ^{κ−1} β^κ e^{−βθ}, which, when plugged into (9), gives

$$ Q(a|\beta,\kappa) = \int_0^{\infty} \theta e^{-\theta a}\, \frac{1}{\Gamma(\kappa)}\, \theta^{\kappa-1} \beta^{\kappa} e^{-\beta\theta}\, d\theta = \frac{\beta^{\kappa}}{\Gamma(\kappa)} \int_0^{\infty} e^{-\theta(a+\beta)}\, \theta^{\kappa}\, d\theta. $$

After the change of variables u := (a + β)θ (with u(0) = 0, u(∞) = ∞), the integral can be written as

$$ Q(a|\beta,\kappa) = \frac{\beta^{\kappa}}{\Gamma(\kappa)} \int_0^{\infty} e^{-u} \left(\frac{u}{a+\beta}\right)^{\kappa} \frac{du}{a+\beta} = \frac{\beta^{\kappa}}{\Gamma(\kappa)} (a+\beta)^{-(\kappa+1)} \int_0^{\infty} e^{-u} u^{\kappa}\, du = \frac{\beta^{\kappa}}{\Gamma(\kappa)} (a+\beta)^{-(\kappa+1)}\, \Gamma(\kappa+1), $$

and, since Γ(κ + 1) = κΓ(κ), we obtain Q(a|β, κ) = κβ^κ (a + β)^{−(κ+1)}; the integral on the second line is precisely the definition of Γ(κ + 1).
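The closed form just derived can be sanity-checked numerically by integrating the Gamma-exponential mixture directly; a minimal sketch (the parameter values are arbitrary):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as gamma_fn

def moe_pdf(a, beta, kappa):
    """Closed-form one-sided MOE density kappa*beta^kappa*(a+beta)^-(kappa+1)."""
    return kappa * beta ** kappa * (a + beta) ** (-(kappa + 1))

def moe_pdf_numeric(a, beta, kappa):
    """Same density obtained by integrating theta*exp(-theta*a) against the
    Gamma prior w(theta|kappa,beta) with rate beta, as in the derivation."""
    w = lambda t: t ** (kappa - 1) * beta ** kappa * np.exp(-beta * t) / gamma_fn(kappa)
    val, _ = quad(lambda t: t * np.exp(-t * a) * w(t), 0.0, np.inf)
    return val
```

The two evaluations agree to quadrature accuracy, and the closed form integrates to one over the positive axis.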
The symmetrization is obtained by substituting a by |a| and dividing the normalization constant by two, Q(|a| | β, κ) = (κβ^κ/2)(|a| + β)^{−(κ+1)}. The mean of the MOE distribution (which is defined only for κ > 1) can be easily computed using integration by parts,

$$ \mu(\beta,\kappa) = \kappa\beta^{\kappa} \int_0^{\infty} \frac{u}{(u+\beta)^{\kappa+1}}\, du = \kappa\beta^{\kappa} \left[ \left. \frac{-u}{\kappa(u+\beta)^{\kappa}} \right|_0^{\infty} + \frac{1}{\kappa}\int_0^{\infty} \frac{du}{(u+\beta)^{\kappa}} \right] = \frac{\beta}{\kappa-1}. $$

In the same way, it is easy to see that the non-central moment of order i is μ_i = β^i / \binom{κ−1}{i} (defined for κ > i). The MLE estimates of κ and β can be obtained using any nonlinear optimization technique such as Newton's method, using for example the estimates obtained with the method of moments as a starting point. In practice, however, we have not observed any significant improvement of the MLE estimates over the moments-based ones.

Expected approximation error in the cost function

As mentioned in the optimization section, the LLA approximates the MOE regularizer as a weighted ℓ1. Here we develop an expression for the expected error between the true function and its convex approximation, where the expectation is taken (naturally) with respect to the MOE distribution. Given the value of the current iterate a^{(t)} = a_0 (assumed positive, since the function and its approximation are symmetric), the approximated regularizer is ψ^{(t)}(a) = log(a_0 + β) + (a − a_0)/(a_0 + β). We have

$$ \mathrm{E}_{a\sim\mathrm{MOE}(\kappa,\beta)}\!\left[ \psi^{(t)}(a) - \psi(a) \right] = \int_0^{\infty} \frac{\kappa\beta^{\kappa}}{(a+\beta)^{\kappa+1}} \left[ \log(a_0+\beta) + \frac{a-a_0}{a_0+\beta} - \log(a+\beta) \right] da $$
$$ = \log(a_0+\beta) - \frac{a_0}{a_0+\beta} + \frac{\kappa\beta^{\kappa}}{a_0+\beta}\int_0^{\infty} \frac{a}{(a+\beta)^{\kappa+1}}\, da - \kappa\beta^{\kappa}\int_0^{\infty} \frac{\log(a+\beta)}{(a+\beta)^{\kappa+1}}\, da $$
$$ = \log(a_0+\beta) - \frac{a_0}{a_0+\beta} + \frac{\beta}{(a_0+\beta)(\kappa-1)} - \log\beta - \frac{1}{\kappa}. $$
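The expected-error expression can be verified against a direct quadrature of E[ψ^(t)(a) − ψ(a)]; a minimal sketch (the parameter values are arbitrary, and the −a_0/(a_0 + β) term follows from E[a − a_0] = β/(κ−1) − a_0):

```python
import numpy as np
from scipy.integrate import quad

def lla_gap_closed_form(a0, beta, kappa):
    """Closed-form expected LLA error under MOE(kappa, beta), as derived above."""
    return (np.log(a0 + beta) - a0 / (a0 + beta)
            + beta / ((a0 + beta) * (kappa - 1.0)) - np.log(beta) - 1.0 / kappa)

def lla_gap_numeric(a0, beta, kappa):
    """Direct quadrature of E[psi_t(a) - psi(a)] over the MOE density."""
    pdf = lambda a: kappa * beta ** kappa * (a + beta) ** (-(kappa + 1))
    gap = lambda a: (np.log(a0 + beta) + (a - a0) / (a0 + beta) - np.log(a + beta))
    val, _ = quad(lambda a: pdf(a) * gap(a), 0.0, np.inf)
    return val
```

Since the tangent of the concave function log(a + β) lies above it, the expected gap is always nonnegative.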
13 http://www.di.ens.fr/willow/SPAMS/

DERIVATION OF THE CONSTRAINED JEFFREYS (JOE) MODEL

In the case of the exponential distribution, the Fisher information matrix in (15) evaluates to

$$ I(\theta) = \left.\mathrm{E}_{P(\cdot|\tilde\theta)}\!\left[ \frac{\partial^2}{\partial\tilde\theta^2}\left( -\log\tilde\theta + \tilde\theta a \right) \right]\right|_{\tilde\theta=\theta} = \left.\mathrm{E}_{P(\cdot|\tilde\theta)}\!\left[ \frac{1}{\tilde\theta^2} \right]\right|_{\tilde\theta=\theta} = \frac{1}{\theta^2}. $$

By plugging this result into (14) with Θ = [θ1, θ2], 0 < θ1 < θ2 < ∞, we obtain w(θ) = (1/ln(θ2/θ1)) (1/θ). We now derive the (one-sided) JOE probability density function by plugging this w(θ) into (9),

$$ Q(a) = \int_{\theta_1}^{\theta_2} \theta e^{-\theta a}\, \frac{1}{\ln(\theta_2/\theta_1)}\, \frac{d\theta}{\theta} = \frac{1}{\ln(\theta_2/\theta_1)} \int_{\theta_1}^{\theta_2} e^{-\theta a}\, d\theta = \frac{1}{\ln(\theta_2/\theta_1)}\, \frac{1}{a} \left( e^{-\theta_1 a} - e^{-\theta_2 a} \right). $$

Although Q(a) cannot be evaluated at a = 0, the limit for a → 0 exists and is finite, so we simply define Q(0) as this limit,

$$ \lim_{a\to 0} Q(a) = \lim_{a\to 0} \frac{1}{\ln(\theta_2/\theta_1)\, a}\left[ 1 - \theta_1 a + O(a^2) - \left(1 - \theta_2 a + O(a^2)\right) \right] = \frac{\theta_2 - \theta_1}{\ln(\theta_2/\theta_1)}. $$

Again, parameter estimation can be performed, for example, using maximum likelihood (via nonlinear optimization) or the method of moments. In this case, however, the method of moments does not provide a closed form solution for (θ1, θ2). The non-central moment of order i is

$$ \mu_i = \int_0^{\infty} \frac{a^i}{\ln(\theta_2/\theta_1)}\, \frac{1}{a}\left[ e^{-\theta_1 a} - e^{-\theta_2 a} \right] da = \frac{1}{\ln(\theta_2/\theta_1)}\left[ \int_0^{\infty} a^{i-1} e^{-\theta_1 a}\, da - \int_0^{\infty} a^{i-1} e^{-\theta_2 a}\, da \right]. \quad (22) $$

For i = 1, both integrals in (22) are trivially evaluated, yielding μ1 = (1/ln(θ2/θ1)) (θ1^{−1} − θ2^{−1}).
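The JOE density derived above, including the removable singularity at a = 0, can be checked numerically; a minimal sketch (the values of θ1 and θ2 are arbitrary):

```python
import numpy as np
from scipy.integrate import quad

def joe_pdf(a, t1, t2):
    """One-sided JOE density; Q(0) is defined as the limit (t2-t1)/ln(t2/t1)."""
    L = np.log(t2 / t1)
    if a <= 0.0:
        return (t2 - t1) / L
    return (np.exp(-t1 * a) - np.exp(-t2 * a)) / (L * a)
```

The closed form matches the defining mixture integral over [θ1, θ2], integrates to one over the positive axis, and is continuous at a = 0.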
For i > 1, these integrals can be solved using integration by parts:

$$ \mu_i^{+} = \int_0^{\infty} a^{i-1} e^{-\theta_1 a}\, da = \left. a^{i-1}\frac{1}{(-\theta_1)} e^{-\theta_1 a}\right|_0^{\infty} - \frac{i-1}{(-\theta_1)}\int_0^{\infty} a^{i-2} e^{-\theta_1 a}\, da, $$
$$ \mu_i^{-} = \int_0^{\infty} a^{i-1} e^{-\theta_2 a}\, da = \left. a^{i-1}\frac{1}{(-\theta_2)} e^{-\theta_2 a}\right|_0^{\infty} - \frac{i-1}{(-\theta_2)}\int_0^{\infty} a^{i-2} e^{-\theta_2 a}\, da, $$

where the first term on the right hand side of both equations evaluates to 0 for i > 1. Therefore, for i > 1 we obtain the recursions μ_i^+ = ((i−1)/θ1) μ_{i−1}^+ and μ_i^− = ((i−1)/θ2) μ_{i−1}^−, which, combined with the result for i = 1, give the final expression for all the moments of order i > 0,

$$ \mu_i = \frac{(i-1)!}{\ln(\theta_2/\theta_1)}\left( \frac{1}{\theta_1^{\,i}} - \frac{1}{\theta_2^{\,i}} \right), \quad i = 1, 2, \ldots $$

In particular, for i = 1 and i = 2 we have

$$ \theta_1 = \left( \ln(\theta_2/\theta_1)\,\mu_1 + \theta_2^{-1} \right)^{-1}, \qquad \theta_2 = \left( \theta_1^{-2} - \ln(\theta_2/\theta_1)\,\mu_2 \right)^{-1/2}, $$

which, when combined, give us

$$ \theta_1 = \frac{2\mu_1}{\mu_2 + \ln(\theta_2/\theta_1)\,\mu_1^2}, \qquad \theta_2 = \frac{2\mu_1}{\mu_2 - \ln(\theta_2/\theta_1)\,\mu_1^2}. \quad (23) $$

One possibility is to solve the nonlinear equation θ2/θ1 = (μ2 + ln(θ2/θ1)μ1²)/(μ2 − ln(θ2/θ1)μ1²) for u = θ1/θ2 by finding the roots of u = (μ2 + μ1² ln u)/(μ2 − μ1² ln u) and choosing one of them based on some side information. Another possibility is to simply fix the ratio θ2/θ1 beforehand and solve for θ1 and θ2 using (23).

DERIVATION OF THE CONDITIONAL JEFFREYS (CMOE) MODEL

The conditional Jeffreys method defines a proper prior w(θ) by assuming that n_0 samples from the data to be modeled, a^{n_0}, were already observed. Plugging the Fisher information of the exponential distribution, I(θ) = θ^{−2}, into (18) we obtain

$$ w(\theta) = \frac{P(a^{n_0}|\theta)\,\theta^{-1}}{\int_{\Theta} P(a^{n_0}|\xi)\,\xi^{-1}\, d\xi} = \frac{\left(\prod_{j=1}^{n_0} \theta e^{-\theta a_j}\right)\theta^{-1}}{\int_0^{\infty} \left(\prod_{j=1}^{n_0} \xi e^{-\xi a_j}\right)\xi^{-1}\, d\xi} = \frac{\theta^{n_0-1}\, e^{-\theta\sum_{j=1}^{n_0} a_j}}{\int_0^{\infty} \xi^{n_0-1}\, e^{-\xi\sum_{j=1}^{n_0} a_j}\, d\xi}. $$
Denoting S_0 = Σ_{j=1}^{n_0} a_j and performing the change of variables u := S_0 ξ, we obtain

$$ w(\theta) = \frac{\theta^{n_0-1}\, e^{-S_0\theta}}{S_0^{-n_0} \int_0^{\infty} u^{n_0-1} e^{-u}\, du} = \frac{S_0^{n_0}\, \theta^{n_0-1}\, e^{-S_0\theta}}{\Gamma(n_0)}, $$

where the last equality derives from the definition of the Gamma function, Γ(n_0). We see that the resulting prior w(θ) is a Gamma distribution Gamma(κ_0, β_0) with κ_0 = n_0 and β_0 = S_0 = Σ_{j=1}^{n_0} a_j.

REFERENCES

[1] M. Aharon, M. Elad, and A. Bruckstein. The K-SVD: An algorithm for designing of overcomplete dictionaries for sparse representations. IEEE Trans. SP, 54(11):4311–4322, Nov. 2006.
[2] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Trans. IT, 44(6):2743–2760, 1998.
[3] J. Bernardo and A. Smith. Bayesian Theory. Wiley, 1994.
[4] A. Bruckstein, D. Donoho, and M. Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34–81, Feb. 2009.
[5] E. J. Candès. Compressive sampling. Proc. of the International Congress of Mathematicians, 3, Aug. 2006.
[6] E. J. Candès, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. J. Fourier Anal. Appl., 14(5):877–905, Dec. 2008.
[7] R. Chartrand. Fast algorithms for nonconvex compressive sensing: MRI reconstruction from very few data. In IEEE ISBI, June 2009.
[8] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
[9] R. Coifman and M. Wickenhauser. Entropy-based algorithms for best basis selection. IEEE Trans. IT, 38:713–718, 1992.
[10] T. Cover and J. Thomas. Elements of Information Theory. John Wiley and Sons, Inc., 2nd edition, 2006.
[11] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. on Pure and Applied Mathematics, 57:1413–1457, 2004.
[12] B. Efron, T.
Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
[13] M. Elad. Optimized projections for compressed-sensing. IEEE Trans. SP, 55(12):5695–5702, Dec. 2007.
[14] K. Engan, S. Aase, and J. Husoy. Multi-frame compression: Theory and design. Signal Processing, 80(10):2121–2140, Oct. 2000.
[15] M. Everingham, A. Zisserman, C. Williams, and L. Van Gool. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf.
[16] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal Am. Stat. Assoc., 96(456):1348–1360, Dec. 2001.
[17] M. Figueiredo. Adaptive sparseness using Jeffreys prior. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Adv. NIPS, pages 697–704. MIT Press, Dec. 2001.
[18] S. Foucart and M. Lai. Sparsest solutions of underdetermined linear systems via ℓq-minimization for 0 < q ≤ 1. Applied and Computational Harmonic Analysis, 3(26):395–407, 2009.
[19] G. Gasso, A. Rakotomamonjy, and S. Canu. Recovering sparse signals with non-convex penalties and DC programming. IEEE Trans. SP, 57(12):4686–4698, 2009.
[20] R. Giryes, Y. Eldar, and M. Elad. Automatic parameter setting for iterative shrinkage methods. In IEEE 25th Convention of Electronics and Electrical Engineers in Israel (IEEEI'08), Dec. 2008.
[21] P. Grünwald. The Minimum Description Length Principle. MIT Press, June 2007.
[22] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2nd edition, Feb. 2009.
[23] S. Ji, Y. Xue, and L. Carin. Bayesian compressive sensing. IEEE Trans. SP, 56(6):2346–2356, 2008.
[24] B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Trans. PAMI, 27(6):957–968, 2005.
[25] E. Lam and J. Goodman. A mathematical analysis of the DCT coefficient distributions for images. IEEE Trans. IP, 9(10):1661–1666, 2000.
[26] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Adv. NIPS, volume 21, Dec. 2009.
[27] J. Mairal, G. Sapiro, and M. Elad. Learning multiscale sparse representations for image and video restoration. SIAM MMS, 7(1):214–241, April 2008.
[28] S. Mallat and Z. Zhang. Matching pursuit in a time-frequency dictionary. IEEE Trans. SP, 41(12):3397–3415, 1993.
[29] N. Merhav and M. Feder. Universal prediction. IEEE Trans. IT, 44(6):2124–2147, Oct. 1998.
[30] G. Motta, E. Ordentlich, I. Ramirez, G. Seroussi, and M. Weinberger. The DUDE framework for grayscale image denoising. Technical report, HP Laboratories, 2009. http://www.hpl.hp.com/techreports/2009/HPL-2009-252.html.
[31] P. Moulin and J. Liu. Analysis of multiresolution image denoising schemes using generalized-Gaussian and complexity priors. IEEE Trans. IT, April 1999.
[32] R. Neelamani, H. Choi, and R. Baraniuk. ForWaRD: Fourier-wavelet regularized deconvolution for ill-conditioned systems. IEEE Trans. SP, 52(2):418–433, 2004.
[33] B. Olshausen and D. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.
[34] R. Raina, A. Battle, H. Lee, B. Packer, and A. Ng. Self-taught learning: transfer learning from unlabeled data. In ICML, pages 759–766, June 2007.
[35] I. Ramirez, F. Lecumberry, and G. Sapiro. Universal priors for sparse modeling. In CAMSAP, Dec. 2009.
[36] I. Ramírez, P. Sprechmann, and G. Sapiro. Classification and clustering via dictionary learning with structured incoherence and shared features. In CVPR, June 2010.
[37] J. Rissanen. Universal coding, information, prediction and estimation. IEEE Trans. IT, 30(4), July 1984.
[38] J. Rissanen.
Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific, 1992.
[39] R. Saab, R. Chartrand, and O. Yilmaz. Stable sparse approximation via nonconvex optimization. In ICASSP, April 2008.
[40] N. Saito. Simultaneous noise suppression and signal compression using a library of orthonormal bases and the MDL criterion. In E. Foufoula-Georgiou and P. Kumar, editors, Wavelets in Geophysics, pages 299–324. New York: Academic, 1994.
[41] G. Schwartz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.
[42] Y. Shtarkov. Universal sequential coding of single messages. Probl. Inform. Transm., 23(3):3–17, July 1987.
[43] R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society: Series B, 58(1):267–288, 1996.
[44] M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning, 1:211–244, 2001.
[45] J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Trans. IT, 50(10):2231–2242, Oct. 2004.
[46] J. Trzasko and A. Manduca. Highly undersampled magnetic resonance image reconstruction via homotopic ℓ0-minimization. IEEE Trans. MI, 28(1):106–121, Jan. 2009.
[47] J. Trzasko and A. Manduca. Relaxed conditions for sparse signal recovery with general concave priors. IEEE Trans. SP, 57(11):4347–4354, 2009.
[48] D. Wipf, J. Palmer, and B. Rao. Perspectives on sparse Bayesian learning. In Adv. NIPS, Dec. 2003.
[49] D. Wipf and B. Rao. An empirical Bayesian strategy for solving the simultaneous sparse approximation problem. IEEE Trans. IP, 55(7-2):3704–3716, 2007.
[50] G. Yu, G. Sapiro, and S. Mallat. Solving inverse problems with piecewise linear estimators: From Gaussian mixture models to structured sparsity. Preprint arXiv:1006.3056.
[51] H. Zou. The adaptive LASSO and its oracle properties. Journal Am. Stat. Assoc., 101:1418–1429, 2006.
[52] H. Zou and R. Li.
One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509–1533, 2008.
