Sparse Image Representation with Epitomes
Louise Benoît^{1,3}, Julien Mairal^{2,3}, Francis Bach^{2,4}, Jean Ponce^{1,3}
^1 École Normale Supérieure, 45 rue d'Ulm, 75005 Paris, France.
^2 INRIA, 23 avenue d'Italie, 75013 Paris, France.

Abstract

Sparse coding, the decomposition of a vector using only a few basis elements, is widely used in machine learning and image processing. The basis set, also called a dictionary, is learned to adapt to specific data, and this approach has proven to be very effective in many image processing tasks. Traditionally, the dictionary is an unstructured "flat" set of atoms. In this paper, we study structured dictionaries [1] obtained from an epitome [11], or from a set of epitomes. The epitome is itself a small image, and the atoms are all the patches of a chosen size inside this image. This considerably reduces the number of parameters to learn and provides sparse image decompositions with shift-invariance properties. We propose a new formulation and an algorithm for learning the structured dictionaries associated with epitomes, and illustrate their use in image denoising tasks.

1. Introduction

Jojic, Frey and Kannan [11] introduced in 2003 a probabilistic generative image model called an epitome. Intuitively, the epitome is a small image that summarizes the content of a larger one, in the sense that for any patch from the large image there should be a similar one in the epitome. This is an intriguing notion, which has been applied to image reconstruction tasks [11], and epitomes have also been extended to the video domain [5], where they have been used in denoising, super-resolution, object removal and video interpolation. Other successful applications of epitomes include location recognition [20] and face recognition [6].
^3 WILLOW project-team, Laboratoire d'Informatique de l'École Normale Supérieure, ENS/INRIA/CNRS UMR 8548.
^4 SIERRA team, Laboratoire d'Informatique de l'École Normale Supérieure, ENS/INRIA/CNRS UMR 8548.

Figure 1. A "flat" dictionary (left) vs. an epitome (right). Sparse coding with an epitome is similar to sparse coding with a flat dictionary, except that the atoms are extracted from the epitome and may overlap, instead of being chosen from an unstructured set of patches and assumed to be independent of each other.

Aharon and Elad [1] have introduced an alternative formulation within the sparse coding framework, called the image-signature dictionary, and applied it to image denoising. Their formulation unifies the concepts of epitome and dictionary learning [9, 21] by allowing an image patch to be represented as a sparse linear combination of several patches extracted from the epitome (Figure 1). The resulting sparse representations are highly redundant (there are as many dictionary elements as overlapping patches in the epitome), with dictionaries represented by a reasonably small number of parameters (the number of pixels in the epitome). Such a representation has also proven useful for texture synthesis [22].

In a different line of work, some research has focused on learning shift-invariant dictionaries [13, 23], in the sense that dictionary elements can be used with different shifts to represent signals exhibiting patterns that appear several times at different positions. While this is different from the image-signature dictionaries of Aharon and Elad [1], the two ideas are related, and as shown in this paper, such shift invariance can be achieved by using a collection of smaller epitomes.
In fact, one of our main contributions is to unify the frameworks of epitome and dictionary learning, and to establish the continuity between dictionaries, shift-invariant dictionaries, and epitomes. We propose a formulation based on the concept of epitomes/image-signature dictionaries introduced by [1, 11], which allows one to learn a collection of epitomes, and which is generic enough to be used with epitomes that may have different shapes, or with different dictionary parameterizations. We present this formulation for the specific case of image patches for simplicity, but it applies to spatio-temporal blocks in a straightforward manner.

The following notation is used throughout the paper: for $q \geq 1$, we define the $\ell_q$-norm of a vector $x$ in $\mathbb{R}^m$ as $\|x\|_q \triangleq (\sum_{j=1}^m |x_j|^q)^{1/q}$, where $x_j$ denotes the $j$-th coordinate of $x$. If $X$ is a matrix in $\mathbb{R}^{m \times n}$, $x^i$ denotes its $i$-th row, while $x_j$ denotes its $j$-th column; as usual, $x_{i,j}$ denotes the entry of $X$ at the $i$-th row and $j$-th column. We consider the Frobenius norm of $X$: $\|X\|_F \triangleq (\sum_{i=1}^m \sum_{j=1}^n x_{i,j}^2)^{1/2}$.

This paper is organized as follows: Section 2 introduces our formulation. We present our dictionary learning algorithm in Section 3. Section 4 introduces different improvements to this algorithm, and Section 5 demonstrates experimentally the usefulness of our approach.

2. Proposed Approach

Given a set of $n$ training image patches of size $m$ pixels, represented by the columns of a matrix $X = [x_1, \ldots, x_n]$ in $\mathbb{R}^{m \times n}$, the classical dictionary learning formulation, as introduced by [21] and revisited by [9, 14], tries to find a dictionary $D = [d_1, \ldots, d_p]$ in $\mathbb{R}^{m \times p}$ such that each signal $x_i$ can be represented by a sparse linear combination of the columns of $D$. More precisely, the dictionary $D$ is learned along with a matrix of decomposition coefficients $A = [\alpha_1, \ldots$
$, \alpha_n]$ in $\mathbb{R}^{p \times n}$ such that $x_i \approx D\alpha_i$ for every signal $x_i$. Following [14], we consider the formulation

$$\min_{D \in \mathcal{D},\, A \in \mathbb{R}^{p \times n}} \; \frac{1}{n} \sum_{i=1}^n \Big[ \frac{1}{2} \|x_i - D\alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \Big], \quad (1)$$

where the quadratic term ensures that the vectors $x_i$ are close to the approximations $D\alpha_i$, the $\ell_1$-norm induces sparsity in the coefficients $\alpha_i$ (see, e.g., [4, 24]), and $\lambda$ controls the amount of regularization. To prevent the columns of $D$ from being arbitrarily large (which would lead to arbitrarily small values of the $\alpha_i$), the dictionary $D$ is constrained to belong to the convex set $\mathcal{D}$ of matrices in $\mathbb{R}^{m \times p}$ whose columns have an $\ell_2$-norm less than or equal to one:

$$\mathcal{D} \triangleq \{ D \in \mathbb{R}^{m \times p} \;\text{s.t.}\; \forall j = 1, \ldots, p,\; \|d_j\|_2 \leq 1 \}.$$

As will become clear shortly, this constraint is not adapted to dictionaries extracted from epitomes, since overlapping patches cannot be expected to all have the same norm. Thus we introduce an unconstrained formulation equivalent to Eq. (1):

$$\min_{D \in \mathbb{R}^{m \times p},\, A \in \mathbb{R}^{p \times n}} \; \frac{1}{n} \sum_{i=1}^n \Big[ \frac{1}{2} \|x_i - D\alpha_i\|_2^2 + \lambda \sum_{j=1}^p \|d_j\|_2 |\alpha_{j,i}| \Big]. \quad (2)$$

This formulation removes the constraint $D \in \mathcal{D}$ from Eq. (1), and replaces the $\ell_1$-norm by a weighted $\ell_1$-norm. As shown in Appendix A, Eq. (1) and Eq. (2) are equivalent in the sense that a solution of Eq. (1) is also a solution of Eq. (2), and from every solution of Eq. (2), a solution of Eq. (1) can be obtained by normalizing its columns to one. To the best of our knowledge, this equivalent formulation is new, and is key to learning an epitome with $\ell_1$-regularization: the use of a convex regularizer (the $\ell_1$-norm), which empirically provides better-behaved dictionaries than $\ell_0$ (where the $\ell_0$ pseudo-norm counts the number of nonzero elements in a vector) for denoising tasks (see Table 1), differentiates us from the ISD formulation of [1].
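The equivalence between the constrained formulation of Eq. (1) and the weighted-$\ell_1$ formulation of Eq. (2) can be checked numerically. The sketch below (our own code, not the authors') evaluates the weighted-$\ell_1$ objective for an unnormalized dictionary, then the plain-$\ell_1$ objective after normalizing columns and rescaling the coefficients accordingly; the two values agree.

```python
# Numerical check of the Eq. (1) / Eq. (2) equivalence on random data.
import numpy as np

rng = np.random.default_rng(0)
m, p = 8, 12
x = rng.standard_normal(m)
D = 3.0 * rng.standard_normal((m, p))      # unnormalized dictionary
alpha = rng.standard_normal(p)
lam = 0.1

norms = np.linalg.norm(D, axis=0)          # diagonal of Gamma = diag(||d_j||_2)
# weighted-l1 objective of Eq. (2)
obj2 = 0.5 * np.sum((x - D @ alpha) ** 2) + lam * np.sum(norms * np.abs(alpha))

D_unit = D / norms                         # D' = D Gamma^{-1}, unit-norm columns
alpha_prime = norms * alpha                # alpha' = Gamma alpha
# plain-l1 objective of Eq. (1) at the rescaled point
obj1 = 0.5 * np.sum((x - D_unit @ alpha_prime) ** 2) + lam * np.sum(np.abs(alpha_prime))

assert np.isclose(obj1, obj2)
```

The identity holds term by term: $D'\alpha' = D\alpha$ and $\|d_j\|_2 |\alpha_j| = |\alpha'_j|$, which is exactly the argument of Appendix A.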
To prevent degenerate solutions in the dictionary learning formulation with the $\ell_1$-norm, it is important to constrain the dictionary elements with the $\ell_2$-norm. Whereas such a constraint can easily be imposed in classical dictionary learning, its extension to epitome learning is not straightforward, and the original ISD formulation is not compatible with convex regularizers. Eq. (2) is an equivalent unconstrained formulation, which lends itself well to epitome learning.

We can now formally introduce the general concept of an epitome as a small image of size $\sqrt{M} \times \sqrt{M}$, encoded (for example in row order) as a vector $E$ in $\mathbb{R}^M$. We also introduce a linear operator $\varphi : \mathbb{R}^M \to \mathbb{R}^{m \times p}$ that extracts all overlapping patches from the epitome $E$ and rearranges them into the columns of a matrix of $\mathbb{R}^{m \times p}$, the integer $p$ being the number of such overlapping patches. Concretely, we have $p = (\sqrt{M} - \sqrt{m} + 1)^2$. In this context, $\varphi(E)$ can be interpreted as a traditional flat dictionary with $p$ elements, except that it is generated by a small number $M$ of parameters, compared to the $pm$ parameters of a flat dictionary. Our approach thus generalizes to a much wider range of epitomic structures, using any mapping $\varphi$ that admits fast projections onto $\mathrm{Im}(\varphi)$. The functions $\varphi$ we have used so far are relatively simple, but they give a framework that easily extends to families of epitomes, shift-invariant dictionaries, and plain dictionaries. The only assumption we make is that $\varphi$ is a linear operator of rank $M$ (i.e., $\varphi$ is injective). This list is not exhaustive, which naturally opens up new perspectives. The fact that a dictionary $D$ is obtained from an epitome is characterized by the fact that $D$ is in the image $\mathrm{Im}\,\varphi$ of the linear operator $\varphi$.
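The operator $\varphi$ is easy to make concrete. The sketch below (our own code) extracts all overlapping $\sqrt{m} \times \sqrt{m}$ patches of an epitome encoded as a $\sqrt{M} \times \sqrt{M}$ image and stacks them as the columns of an $m \times p$ matrix, matching the count $p = (\sqrt{M} - \sqrt{m} + 1)^2$.

```python
# A minimal sketch of the linear operator phi from Section 2.
import numpy as np

def phi(E_img, patch_side):
    """Extract all overlapping patches of E_img as columns of an m x p matrix."""
    side = E_img.shape[0]                   # sqrt(M)
    n_pos = side - patch_side + 1           # patch positions per dimension
    cols = []
    for i in range(n_pos):
        for j in range(n_pos):
            patch = E_img[i:i + patch_side, j:j + patch_side]
            cols.append(patch.ravel())      # row-order vectorization
    return np.stack(cols, axis=1)           # shape m x p, with p = n_pos**2

E_img = np.arange(64, dtype=float).reshape(8, 8)   # sqrt(M) = 8
D = phi(E_img, 3)                                   # sqrt(m) = 3, so m = 9
assert D.shape == (9, (8 - 3 + 1) ** 2)             # p = 36 overlapping patches
```

Because every column is read off the same $M$ pixels, $D$ has only $M = 64$ free parameters here, versus $pm = 324$ for a flat dictionary of the same size.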
Given a dictionary $D$ in $\mathrm{Im}\,\varphi$, the unique (by injectivity of $\varphi$) epitome representation can be obtained by computing the inverse of $\varphi$ on $\mathrm{Im}\,\varphi$, for which a closed form using pseudo-inverses exists, as shown in Appendix B. Our goal being to adapt the epitome to the training image patches, the general minimization problem can therefore be expressed as follows:

$$\min_{D \in \mathrm{Im}\,\varphi,\, A \in \mathbb{R}^{p \times n}} \; \frac{1}{n} \sum_{i=1}^n \Big[ \frac{1}{2} \|x_i - D\alpha_i\|_2^2 + \lambda \sum_{j=1}^p \|d_j\|_2 |\alpha_{j,i}| \Big]. \quad (3)$$

There are several motivations for such an approach. As discussed above, the choice of the function $\varphi$ lets us adapt this technique to different problems such as multiple epitomes or any other type of dictionary representation. This formulation is therefore deliberately generic. In practice, we have mainly focused on two simple cases in the experiments of this paper: a single epitome [11] (or image-signature dictionary [1]) and a set of epitomes. Furthermore, we have now come down to a more traditional and well-studied problem: dictionary learning. We will therefore use the techniques and algorithms developed in the dictionary learning literature to solve the epitome learning problem.

3. Basic Algorithm

As for classical dictionary learning, the optimization problem of Eq. (3) is not jointly convex in $(D, A)$, but it is convex with respect to $D$ when $A$ is fixed, and vice versa. A block-coordinate descent scheme that alternates between the optimization of $D$ and $A$, while keeping the other parameter fixed, has emerged as a natural and simple way of learning dictionaries [9, 10], and has proven relatively efficient when the training set is not too large. Even though the formulation remains nonconvex, so that this method is not guaranteed to find the global optimum, it has proven experimentally to be good enough for many tasks [9]. We therefore adopt this optimization scheme as well, and detail the different steps below.
Note that other algorithms such as stochastic gradient descent (see [1, 14]) could be used as well, and in fact can easily be derived from the material of this section. However, we have chosen not to investigate this kind of technique, for simplicity reasons: stochastic gradient descent algorithms are potentially more efficient than the block-coordinate scheme mentioned above, but they require the (sometimes non-trivial) tuning of a learning rate.

3.1. Step 1: Optimization of A with D Fixed.

In this step of the algorithm, $D$ is fixed, so the constraint $D \in \mathrm{Im}\,\varphi$ is not involved in the optimization of $A$. Furthermore, note that updating the matrix $A$ consists of solving $n$ independent optimization problems with respect to each column $\alpha_i$. For each of them, one has to solve a weighted-$\ell_1$ optimization problem. Let us consider the update of a column $\alpha_i$ of $A$. We introduce the matrix $\Gamma \triangleq \mathrm{diag}[\|d_1\|_2, \ldots, \|d_p\|_2]$ and define $D' = D\Gamma^{-1}$. If $\Gamma$ is non-singular, we show in Appendix A that the relation $\alpha_i'^\star = \Gamma \alpha_i^\star$ holds, where

$$\alpha_i'^\star = \operatorname*{argmin}_{\alpha_i' \in \mathbb{R}^p} \; \frac{1}{2} \|x_i - D'\alpha_i'\|_2^2 + \lambda \|\alpha_i'\|_1,$$

and

$$\alpha_i^\star = \operatorname*{argmin}_{\alpha_i \in \mathbb{R}^p} \; \frac{1}{2} \|x_i - D\alpha_i\|_2^2 + \lambda \sum_{j=1}^p \|d_j\|_2 |\alpha_{j,i}|.$$

This shows that the update of each column can easily be obtained with classical solvers for $\ell_1$-decomposition problems. We use to that effect the LARS algorithm [8], implemented in the software accompanying [14].

Since our optimization problem is invariant to multiplying $D$ by a scalar and $A$ by its inverse, we then proceed to the following renormalization to ensure numerical stability and prevent the entries of $D$ and $A$ from becoming too large: we rescale $D$ and $A$ with $s = \min_{j \in \{1,\ldots,p\}} \|d_j\|_2$, and define $D \leftarrow \frac{1}{s} D$ and $A \leftarrow s A$. Since the image of $\varphi$ is a vector space, $D$ stays in the image of $\varphi$ after this normalization, and, as noted before, it does not change the value of the objective function.
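Step 1 is straightforward to sketch in code: the weighted-$\ell_1$ problem reduces to a plain lasso on the column-normalized dictionary $D' = D\Gamma^{-1}$, after which $\alpha^\star = \Gamma^{-1}\alpha'^\star$. The paper uses LARS [8]; below, a simple ISTA loop (our own code) stands in for the $\ell_1$ solver.

```python
# Step 1 via the rescaling trick of Appendix A; ISTA stands in for LARS.
import numpy as np

def ista_lasso(D, x, lam, n_iter=500):
    # minimize 0.5 ||x - D a||_2^2 + lam ||a||_1 by proximal gradient (ISTA)
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - D.T @ (D @ a - x) / L      # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

rng = np.random.default_rng(1)
m, p = 10, 20
D = rng.standard_normal((m, p))            # unnormalized dictionary
x = rng.standard_normal(m)
lam = 0.2

norms = np.linalg.norm(D, axis=0)          # diagonal of Gamma
alpha_prime = ista_lasso(D / norms, x, lam)   # plain lasso on D' = D Gamma^{-1}
alpha = alpha_prime / norms                   # map back: alpha = Gamma^{-1} alpha'
```

By construction, the weighted-$\ell_1$ objective at `alpha` equals the plain-$\ell_1$ objective at `alpha_prime`, so any standard lasso solver can be reused unchanged.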
3.2. Step 2: Optimization of D with A Fixed.

We use a projected gradient descent algorithm [3] to update $D$. The objective function $f$ minimized during this step can be written as

$$f(D) \triangleq \frac{1}{2} \|X - DA\|_F^2 + \lambda \sum_{j=1}^p \|d_j\|_2 \|\alpha^j\|_1, \quad (4)$$

where $A$ is fixed, and we recall that $\alpha^j$ denotes its $j$-th row. The function $f$ is differentiable, except when a column of $D$ is equal to zero, which we assume without loss of generality not to be the case. Suppose indeed that a column $d_j$ of $D$ is equal to zero. Then, without changing the value of the cost function of Eq. (3), one can set the corresponding row $\alpha^j$ to zero as well, which results in a function $f$ defined in Eq. (4) that does not depend on $d_j$ anymore. We have, however, not observed such a situation in our experiments. The function $f$ can therefore be considered differentiable, and one can easily compute its gradient as

$$\nabla f(D) = -(X - DA) A^T + D\Delta,$$

where $\Delta$ is defined as

$$\Delta \triangleq \mathrm{diag}\Big( \frac{\lambda \|\alpha^1\|_1}{\|d_1\|_2}, \ldots, \frac{\lambda \|\alpha^p\|_1}{\|d_p\|_2} \Big).$$

To use projected gradient descent, we now need a method for projecting $D$ onto the convex set $\mathrm{Im}\,\varphi$, and the update rule becomes

$$D \leftarrow \Pi_{\mathrm{Im}\,\varphi}\big[ D - \rho \nabla f(D) \big],$$

where $\Pi_{\mathrm{Im}\,\varphi}$ is the orthogonal projector onto $\mathrm{Im}\,\varphi$, and $\rho$ is a gradient step, chosen with a line-search rule such as the Armijo rule [3].

Interestingly, in the case of the single epitome (and in fact in any other extension where $\varphi$ is a linear operator that extracts some patches from a parameter vector $E$), this projector admits a closed form: let us consider the linear operator $\varphi^* : \mathbb{R}^{m \times p} \to \mathbb{R}^M$ such that, for a matrix $D$ in $\mathbb{R}^{m \times p}$, a pixel of the epitome $\varphi^*(D)$ is the average of the entries of $D$ corresponding to this pixel. We give the formal form of this operator in Appendix B, and show the following results: (i) $\varphi^*$ is indeed linear; (ii) $\Pi_{\mathrm{Im}\,\varphi} = \varphi \circ \varphi^*$.
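The closed-form projector $\Pi_{\mathrm{Im}\,\varphi} = \varphi \circ \varphi^*$ is also easy to verify numerically. The sketch below (our own code, with $\varphi$ restated inline so the snippet is self-contained) implements $\varphi^*$ as per-pixel averaging and checks both properties stated above.

```python
# phi^* averages, for each epitome pixel, all entries of D that map to it;
# phi o phi^* then projects any m x p matrix onto the dictionaries in Im(phi).
import numpy as np

def phi(E_img, s):
    n = E_img.shape[0] - s + 1
    return np.stack([E_img[i:i+s, j:j+s].ravel()
                     for i in range(n) for j in range(n)], axis=1)

def phi_star(D, side, s):
    # scatter each column back to its patch location, then divide by the counts
    n = side - s + 1
    acc = np.zeros((side, side))
    cnt = np.zeros((side, side))
    for k in range(D.shape[1]):
        i, j = divmod(k, n)
        acc[i:i+s, j:j+s] += D[:, k].reshape(s, s)
        cnt[i:i+s, j:j+s] += 1.0
    return acc / cnt

side, s = 6, 3
rng = np.random.default_rng(2)
E_img = rng.standard_normal((side, side))
assert np.allclose(phi_star(phi(E_img, s), side, s), E_img)   # phi^* o phi = Id

D_any = rng.standard_normal((s * s, (side - s + 1) ** 2))
P = phi(phi_star(D_any, side, s), s)                 # projection of D_any
assert np.allclose(phi(phi_star(P, side, s), s), P)  # the projector is idempotent
```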
With this closed form of $\Pi_{\mathrm{Im}\,\varphi}$ in hand, we now have an efficient algorithmic procedure for performing the projection. Our method is therefore quite generic, and can adapt to a wide variety of functions $\varphi$. Extending it to the case where $\varphi$ is not linear, but still injective and with an efficient method to project onto $\mathrm{Im}\,\varphi$, will be the topic of future work.

4. Improvements

We present in this section several improvements to our basic framework, which either improve the convergence speed of the algorithm or generalize the formulation.

4.1. Accelerated Gradient Method for Updating D.

A first improvement is to accelerate the convergence of the update of $D$ using an accelerated gradient technique [2, 19]. These methods, which build upon early work by Nesterov [18], have attracted a lot of attention recently in machine learning and signal processing, especially because of their fast convergence rate (which is proven to be optimal among first-order methods) and their ability to deal with large, possibly nonsmooth problems. Whereas the value of the objective function with classical gradient descent algorithms for solving smooth convex problems is guaranteed to decrease with a convergence rate of $O(1/k)$, where $k$ is the number of iterations, other algorithmic schemes have been proposed with a convergence rate of $O(1/k^2)$ and the same cost per iteration as classical gradient algorithms [2, 18, 19]. The difference between these methods and gradient descent is that two sequences of parameters are maintained during the iterative procedure, and each update uses information from past iterations. This leads to theoretically better convergence rates, which are often also better in practice. We have chosen here, for its simplicity, the FISTA algorithm of Beck and Teboulle [2], which includes a practical line-search scheme for automatically tuning the gradient step.
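The two-sequence scheme can be sketched generically. The code below (our own sketch, not the paper's implementation) is an accelerated projected-gradient loop in the FISTA style, where the projection plays the role of the proximal operator for the set constraint; in the paper, the gradient is that of Eq. (4) and the projection is $\Pi_{\mathrm{Im}\,\varphi}$. A fixed step $1/L$ replaces the line search for brevity, and the method is demonstrated on a toy least-squares problem over a subspace.

```python
# Accelerated projected gradient (FISTA-style), generic form.
import numpy as np

def fista_projected(grad, project, x0, step, n_iter=2000):
    # two sequences (x, y) with an extrapolation step between iterates
    x = project(x0)
    y, t = x.copy(), 1.0
    for _ in range(n_iter):
        x_new = project(y - step * grad(y))            # gradient + projection
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)  # extrapolation
        x, t = x_new, t_new
    return x

# toy instance: minimize 0.5 ||A x - b||^2 over the subspace {x : x[0] = x[1]}
rng = np.random.default_rng(3)
A = rng.standard_normal((8, 4))
b = rng.standard_normal(8)
grad = lambda x: A.T @ (A @ x - b)

def project(x):
    x = x.copy()
    x[0] = x[1] = 0.5 * (x[0] + x[1])   # orthogonal projection onto the subspace
    return x

step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
x_hat = fista_projected(grad, project, np.zeros(4), step)
```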
Interestingly, we have indeed observed that FISTA converges significantly faster than the projected gradient descent algorithm.

4.2. Multi-Scale Version

To improve the results without increasing the computing time, we have also implemented a multi-scale approach that exploits the spatial nature of the epitome. Instead of directly learning an epitome of size $M$, we first learn an epitome of a smaller size on a reduced image with correspondingly smaller patches, and, after upscaling, we use the resulting epitome as the initialization for the next scale. In practice, we iterate this process two to three times. The procedure is illustrated in Figure 2. Intuitively, learning smaller epitomes is an easier task than directly learning a large one, and such a procedure provides a good initialization for learning a large epitome.

Multi-scale Epitome Learning.
Input: $n$ number of scales, $r$ ratio between each scale, $E_0$ random initialization for the first scale.
for $k = 1$ to $n$ do
  Given $I_k$, the rescaling of image $I$ by the ratio $1/r^{n-k}$, and $X_k$ the corresponding patches,
  initialize with $E = \mathrm{upscale}(E_{k-1}, r)$,
  $E_k = \mathrm{epitome}(X_k, E)$.
end for
Output: learned epitome $E$.

Figure 2. Multi-scale epitome learning algorithm.

4.3. Multi-Epitome Extension

Another improvement is to consider not a single epitome but a family of epitomes, in order to learn dictionaries with some shift invariance, which has been the focus of recent work [13, 23]. Note that different types of structured dictionaries have also been proposed with the same motivation of learning shift-invariant features in image classification tasks [12], but in a significantly different framework (the structure of the dictionaries learned in [12] comes from a different sparsity-inducing penalization).

Figure 3. A "flat" dictionary (left) vs. a collection of 4 epitomes (right). The atoms are extracted from the epitomes and may overlap.
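The multi-scale loop of Figure 2 (Section 4.2) can be sketched as follows. All names here are ours, not the authors' implementation: `learn_epitome` is a trivial placeholder standing in for the block-coordinate learning of Section 3 so that the control flow runs, `upscale` uses pixel replication, and `rescale_image` is a crude block-averaging downscaler that assumes the ratio divides the image size.

```python
# Control flow of the multi-scale epitome learning procedure (Figure 2).
import numpy as np

def upscale(E_img, r):
    return np.kron(E_img, np.ones((r, r)))          # nearest-neighbor upscaling

def rescale_image(img, factor):
    # block-averaging downscaler; `factor` must divide both image dimensions
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def learn_epitome(patches, E_init):
    # placeholder for the alternating minimization of Section 3
    return E_init

def multiscale_epitome(img, n_scales, r, E0):
    E = E0
    for k in range(1, n_scales + 1):
        factor = r ** (n_scales - k)                # rescaling ratio 1 / r^(n-k)
        I_k = rescale_image(img, factor) if factor > 1 else img
        patches = I_k                               # patch extraction elided here
        if k > 1:
            E = upscale(E, r)                       # warm start from the last scale
        E = learn_epitome(patches, E)
    return E

img = np.random.default_rng(4).standard_normal((32, 32))
E = multiscale_epitome(img, n_scales=2, r=2, E0=np.zeros((8, 8)))
assert E.shape == (16, 16)   # the 8x8 first-scale epitome was upscaled once
```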
As mentioned before, we are able to learn a set of $N$ epitomes instead of a single one by changing the function $\varphi$ introduced earlier. The vector $E$ now contains the pixels (parameters) of several small epitomes, and $\varphi$ is the linear operator that extracts all overlapping patches from all epitomes. In the same way, the projector onto $\mathrm{Im}\,\varphi$ is still easy to compute in closed form, and the rest of the algorithm stays unchanged. Other "epitomic" structures could easily be used within our framework, even though we have limited ourselves for simplicity to the case of single and multiple epitomes of the same size and shape.

The multi-epitome version of our approach can be seen as an interpolation between a classical dictionary and a single epitome. Indeed, defining a multitude of epitomes of the same size as the considered patches is equivalent to working with a dictionary, while defining a large number of epitomes slightly larger than the patches is equivalent to shift-invariant dictionaries. In Section 5, we experimentally compare these different regimes for the task of image denoising.

4.4. Initialization

Because of the nonconvexity of the optimization problem, initialization is an important issue in epitome learning. We have already mentioned a multi-scale strategy to overcome this issue, but for the first scale the problem remains. Whereas classical flat dictionaries can naturally be initialized with prespecified dictionaries such as an overcomplete DCT basis (see [9]), the epitome does not admit such a natural choice. In all the experiments (unless written otherwise), we use as the initialization a single epitome (or a collection of epitomes), common to all experiments, which is learned using our algorithm, initialized with a Gaussian low-pass filtered random image, on a set of 100,000 random patches extracted from 5,000 natural images (all different from the test images used for denoising).
5. Experimental Validation

Figure 4. The House, Peppers, Cameraman, Lena, Boat and Barbara images.

We provide in this section qualitative and quantitative validation. We first study the influence of the different model hyperparameters on the visual aspect of the epitome, before moving to an image denoising task. We choose to represent the epitomes as images in order to visualize more easily the patches that will be extracted to form the images. Since epitomes contain negative values, they are arbitrarily rescaled between 0 and 1 for display. In this section, we work with several images, which are shown in Figure 4.

5.1. Influence of the Initialization

In order to measure the influence of the initialization on the resulting epitome, we have run the same experiment with different initializations. Figure 5 shows the different results obtained. The difference in contrast may be due to the scaling of the data in the display process. This experiment illustrates that different initializations lead to visually different epitomes. While this property might not be desirable, the classical dictionary learning framework suffers from the same issue, yet has led to successful applications in image processing [9].

Figure 5. Three epitomes obtained on the boat image for different initializations, but with identical parameters. Left: epitome obtained with initialization on an epitome learned on random patches from natural images. Middle and right: epitomes obtained for two different random initializations.

5.2. Influence of the Size of the Patches

The size of the patches seems to play an important role in the visual aspect of the epitome. We illustrate in Figure 6 an experiment where pairs of epitomes of size 46 × 46 are learned with different patch sizes.

Figure 6. Pairs of epitomes of width 46 obtained for patches of width 6, 8, 9, 10 and 12. All other parameters are unchanged. Experiments run with 2 scales (20 iterations for the first scale, 5 for the second) on the house image.

As we see, learning epitomes with small patches seems to introduce finer details and structures in the epitome, whereas large patches induce epitomes with coarser structures.

Figure 7. 1, 2, 4 and 20 epitomes learned on the barbara image with the same parameters. They are of sizes 42, 32, 25 and 15, respectively, in order to keep the same number of elements in D. They are not represented to scale.

5.3. Influence of the Number of Epitomes

We present in this section an experiment where the number of learned epitomes varies, while keeping the same number of columns in $D$. The 1, 2, 4 and 20 epitomes learned on the barbara image are shown in Figure 7. When the number of epitomes is small, we observe in the epitomes some discontinuities between texture areas with different visual characteristics, which is not the case when learning several independent epitomes.

5.4. Application to Denoising

In order to evaluate the performance of epitome learning in various regimes (single epitome, multiple epitomes), we use the same methodology as [1], which builds on the successful denoising method first introduced by [9]. Let us consider the classical problem of restoring a noisy image $y$ in $\mathbb{R}^n$ which has been corrupted by white Gaussian noise of standard deviation $\sigma$. We denote by $y_i$ in $\mathbb{R}^m$ the patch of $y$ centered at pixel $i$ (with any arbitrary ordering of the image pixels). The method of [9] proceeds as follows:

• Learn a dictionary $D$ adapted to all overlapping patches $y_1, y_2, \ldots$ from the noisy image $y$.

• Approximate each noisy patch using the learned dictionary with a greedy algorithm called orthogonal matching pursuit (OMP) [17], obtaining a clean estimate of every patch $y_i$ by addressing the problem

$$\operatorname*{argmin}_{\alpha_i \in \mathbb{R}^p} \|\alpha_i\|_0 \;\text{ s.t. }\; \|y_i - D\alpha_i\|_2^2 \leq C\sigma^2,$$

where $D\alpha_i$ is a clean estimate of the patch $y_i$, $\|\alpha_i\|_0$ is the $\ell_0$ pseudo-norm of $\alpha_i$, and $C$ is a regularization parameter. Following [9], we choose $C = 1.15$.

• Since every pixel of $y$ admits many clean estimates (one estimate for every patch the pixel belongs to), average the estimates.

Figure 8. Artificially noised boat image (with standard deviation σ = 15), and the result of our denoising algorithm.

Quantitative results for the single epitome and the multi-scale multi-epitome settings are presented in Table 1, on six images and five levels of noise. We evaluate the performance of the denoising process by computing the peak signal-to-noise ratio (PSNR) for each pair of images. For each level of noise, we have selected the best regularization parameter λ over all six images, and have then used it in all the experiments. The PSNR values are averaged over 5 experiments with 5 different noise realizations. The mean standard deviation is 0.05 dB, both for the single epitome and the multi-scale multi-epitomes.

We see from this experiment that the formulation we propose is competitive with that of [1]. Learning multiple epitomes instead of a single one seems to provide better results, which might be explained by the lack of flexibility of the single-epitome representation. Evidently, these results are not as good as those of recent state-of-the-art denoising algorithms such as [7, 15], which exploit more sophisticated image models.

σ    method  house  peppers  c.man  barbara  lena   boat
10   IE      35.98  34.52    33.90  34.41    35.51  33.70
10   E       35.86  34.41    33.83  34.01    35.43  33.63
15   IE      34.45  32.50    31.65  32.23    33.74  31.81
15   E       34.32  32.36    31.59  31.84    33.66  31.75
20   IE      33.18  31.00    30.19  30.69    32.42  30.45
20   E       33.08  30.93    30.11  30.33    32.35  30.37
25   IE      32.02  29.82    29.08  29.49    31.36  29.36
25   E       31.96  29.77    29.01  29.14    31.29  29.30
50   IE      27.83  26.06    25.57  25.04    27.90  26.01
50   E       27.83  26.07    25.60  24.86    27.82  26.02

Table 1. PSNR results.
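The three-step denoising procedure of Section 5.4 can be sketched in code. This is our own minimal implementation, not the authors' code: `omp` is a basic greedy orthogonal matching pursuit whose stopping rule follows the residual criterion above, and `denoise` averages the overlapping patch estimates, as in the third step.

```python
# Sketch of the patch-based denoising pipeline: OMP per patch, then averaging.
import numpy as np

def omp(D, y, eps):
    # greedy OMP: add the most correlated atom, refit by least squares,
    # and stop once the squared residual norm falls below eps
    m, p = D.shape
    support, residual = [], y.copy()
    alpha = np.zeros(p)
    while residual @ residual > eps and len(support) < m:
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        coef = np.linalg.lstsq(D[:, support], y, rcond=None)[0]
        residual = y - D[:, support] @ coef
    if support:
        alpha[support] = coef
    return alpha

def denoise(noisy, D, s, eps):
    # estimate every s x s patch with OMP, then average overlapping estimates
    h, w = noisy.shape
    acc, cnt = np.zeros_like(noisy), np.zeros_like(noisy)
    for i in range(h - s + 1):
        for j in range(w - s + 1):
            y = noisy[i:i+s, j:j+s].ravel()
            acc[i:i+s, j:j+s] += (D @ omp(D, y, eps)).reshape(s, s)
            cnt[i:i+s, j:j+s] += 1.0
    return acc / cnt

rng = np.random.default_rng(5)
D = rng.standard_normal((16, 32))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
alpha = omp(D, D[:, 3], eps=1e-10)         # recovers a 1-sparse test signal
noisy = rng.standard_normal((12, 12))
out = denoise(noisy, D, s=4, eps=2.0)      # 4x4 patches match m = 16
```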
First row: 20 epitomes of size 7 × 7 learned with 3 scales (IE: improved epitome); second row: single epitome of size 42 × 42 (E). Best results are in bold.

σ     IE     E      [1]    [11]   [9]    [7]    [15]
10    34.83  34.67  34.71  28.83  34.76  35.24  35.32
15    32.95  32.79  32.84  28.92  32.87  33.43  33.50
20    31.55  31.41  31.36  28.55  31.52  32.15  32.18
25    30.41  30.29  29.99  28.12  30.42  31.15  31.11
50    26.57  26.52  25.91  25.21  26.66  27.69  27.87
mean  31.26  31.14  30.96  27.93  31.25  31.93  32.00

Table 2. Quantitative comparative evaluation. PSNR values are averaged over 5 images. We compare our method to two previous epitome-learning-based algorithms, ISD [1] and the epitomes of Jojic, Frey and Kannan ([11], as reported in [1]), and to three more elaborate dictionary-learning-based algorithms: K-SVD [9], BM3D [7], and LSSC [15].

But our goal is to illustrate the performance of epitome learning on an image reconstruction task, in order to better understand these formulations.

6. Conclusion

We have introduced in this paper a new formulation and an efficient algorithm for learning epitomes in the context of sparse coding, extending the work of Aharon and Elad [1] and unifying it with recent work on shift-invariant dictionary learning. Our approach is generic, can interpolate between these two regimes, and can possibly be applied to other formulations. Future work will extend our framework to the video setting, to other image processing tasks such as inpainting, and to learning image features for classification or recognition tasks, where shift invariance has proven to be a key property for achieving good results [12]. Another direction we are pursuing is to find ways to encode other invariance properties through different mapping functions ϕ.

Acknowledgments.
This work was partly supported by the European Community under the ERC grants "VideoWorld" and "Sierra".

A. Appendix: ℓ1-Norm and Weighted ℓ1-Norm

In this appendix, we show the equivalence between the two minimization problems introduced in Section 3.1. Let us denote

$$F(D, \alpha) = \frac{1}{2} \|x - D\alpha\|_2^2 + \lambda \sum_{j=1}^p \|d_j\|_2 |\alpha_j|, \quad (5)$$

and

$$G(D, \alpha) = \frac{1}{2} \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1. \quad (6)$$

Let us define $\alpha' \in \mathbb{R}^p$ and $D' \in \mathbb{R}^{m \times p}$ such that $D' = D\Gamma^{-1}$ and $\alpha' = \Gamma\alpha$, where $\Gamma = \mathrm{diag}[\|d_1\|_2, \ldots, \|d_p\|_2]$. The goal is to show that $\alpha'^\star = \Gamma\alpha^\star$, where

$$\alpha^\star = \operatorname*{argmin}_\alpha F(D, \alpha), \quad \text{and} \quad \alpha'^\star = \operatorname*{argmin}_{\alpha'} G(D', \alpha').$$

We clearly have $D\alpha = D'\alpha'$. Furthermore, since $\Gamma\alpha = \alpha'$, we have, for all $j = 1, \ldots, p$, $\|d_j\|_2 |\alpha_j| = |\alpha'_j|$. Therefore,

$$F(D, \alpha) = G(D', \alpha'). \quad (7)$$

Moreover, since for every $D$ the normalized matrix $D'$ belongs to the set $\mathcal{D}$, we have shown the equivalence between Eq. (1) and Eq. (2).

B. Appendix: Projection onto Im ϕ

In this appendix, we show how to compute the orthogonal projection onto the vector space $\mathrm{Im}\,\varphi$. Let us denote by $R_i$ the binary matrix in $\{0,1\}^{m \times M}$ that extracts the $i$-th patch from $E$. With this notation, the transpose $R_i^T$ is an $M \times m$ binary matrix corresponding to the linear operator that takes a patch of size $m$ and places it at location $i$ in an epitome of size $M$ that is zero everywhere else. We therefore have $\varphi(E) = [R_1 E, \ldots, R_p E]$. We denote by $\varphi^* : \mathbb{R}^{m \times p} \to \mathbb{R}^M$ the linear operator defined as

$$\varphi^*(D) = \Big( \sum_{j=1}^p R_j^T R_j \Big)^{-1} \Big( \sum_{j=1}^p R_j^T d_j \Big),$$

which creates an epitome of size $M$ such that each pixel contains the average of the corresponding entries of $D$. Indeed, the $M \times M$ matrix $\sum_{j=1}^p R_j^T R_j$ is diagonal, and its $i$-th diagonal entry is the number of entries of $D$ corresponding to the pixel $i$ of the epitome. Denoting by $R \triangleq [R_1^T, \ldots, R_p^T]^T$, which is an $mp \times M$ matrix, we have $\mathrm{vec}(\varphi(E)) = RE$, where $\mathrm{vec}(D) \triangleq [d_1^T, \ldots$
$, d_p^T]^T$, the vector of size $mp$ obtained by concatenating the columns of $D$, and also $\varphi^*(D) = (R^T R)^{-1} R^T \mathrm{vec}(D)$. Since $\mathrm{vec}(\mathrm{Im}\,\varphi) = \mathrm{Im}\,R$ and $\mathrm{vec}(\varphi(\varphi^*(D))) = R(R^T R)^{-1} R^T \mathrm{vec}(D)$, which is an orthogonal projection onto $\mathrm{Im}\,R$, the two following properties result; they are useful in our framework and classical in signal processing with overcomplete representations [16]:

• $\varphi^*$ is the inverse function of $\varphi$ on $\mathrm{Im}\,\varphi$: $\varphi^* \circ \varphi = \mathrm{Id}$.
• $\varphi \circ \varphi^*$ is the orthogonal projector onto $\mathrm{Im}\,\varphi$.

References

[1] M. Aharon and M. Elad. Sparse and redundant modeling of image content using an image-signature-dictionary. SIAM Journal on Imaging Sciences, 1(3):228–247, July 2008.
[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[3] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999.
[4] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
[5] V. Cheung, B. Frey, and N. Jojic. Video epitomes. In Proc. CVPR, 2005.
[6] X. Chu, S. Yan, L. Li, K. L. Chan, and T. S. Huang. Spatialized epitome and its applications. In Proc. CVPR, 2010.
[7] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
[8] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
[9] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, December 2006.
[10] K. Engan, S. O. Aase, and J. H. Husoy. Frame-based signal compression using method of optimal directions (MOD).
In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1999.
[11] N. Jojic, B. Frey, and A. Kannan. Epitomic analysis of appearance and shape. In Proc. ICCV, 2003.
[12] K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proc. CVPR, 2009.
[13] B. Mailhé, S. Lesage, R. Gribonval, F. Bimbot, and P. Vandergheynst. Shift-invariant dictionary learning for sparse representations: extending K-SVD. In Proc. EUSIPCO, 2008.
[14] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.
[15] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In Proc. ICCV, 2009.
[16] S. Mallat. A Wavelet Tour of Signal Processing, Second Edition. Academic Press, New York, 1999.
[17] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.
[18] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). Dokl. Akad. Nauk SSSR, 269:543–547, 1983.
[19] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.
[20] K. Ni, A. Kannan, A. Criminisi, and J. Winn. Epitomic location recognition. In Proc. CVPR, 2008.
[21] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.
[22] G. Peyré. Sparse modeling of textures. Journal of Mathematical Imaging and Vision, 34(1):17–31, 2009.
[23] J. Thiagarajan, K. Ramamurthy, and A. Spanias. Shift-invariant sparse representation of images using learned dictionaries.
In IEEE International Workshop on Machine Learning for Signal Processing, 2008.
[24] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.