A Probabilistic Framework for Discriminative Dictionary Learning

Bernard Ghanem and Narendra Ahuja

Abstract

In this paper, we address the problem of discriminative dictionary learning (DDL), where sparse linear representation and classification are combined in a probabilistic framework. As such, a single discriminative dictionary and linear binary classifiers are learned jointly. By encoding sparse representation and discriminative classification models in a MAP setting, we propose a general optimization framework that allows for a data-driven tradeoff between faithful representation and accurate classification. As opposed to previous work, our learning methodology is capable of incorporating a diverse family of classification cost functions (including those used in popular boosting methods), while avoiding the need for involved optimization techniques. We show that DDL can be solved by a sequence of updates that make use of well-known and well-studied sparse coding and dictionary learning algorithms from the literature. To validate our DDL framework, we apply it to digit classification and face recognition and test it on standard benchmarks.

1 Introduction

Representation of signals as sparse linear combinations of a basis set is popular in the signal/image processing and machine learning communities. In this representation, a sample $\vec{y}$ is described by a sparse coefficient vector $\vec{x}$ that linearly combines a small number of columns of a dictionary $D$, such that $\vec{y} = D\vec{x}$. Significant theoretical progress has been made to determine the necessary and sufficient conditions under which recovery of the sparsest representation using a predefined $D$ is guaranteed [3, 27, 4]. Recent sparse coding methods achieve state-of-the-art results for various visual tasks, such as face recognition [29]. Instead of minimizing the $\ell_0$ norm of $\vec{x}$, these methods solve relaxed versions of the originally NP-hard problem, which we will refer to as traditional sparse coding (TSC).
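To make the TSC step concrete, the following is a minimal sketch of Orthogonal Matching Pursuit (OMP), one greedy TSC solver of the kind the paper relies on in later sections. It is an illustration under stated assumptions, not the authors' implementation: the function name `omp`, the tolerance `tol`, and the scale-invariant atom selection are hypothetical choices made here.

```python
import numpy as np

def omp(D, y, T, tol=1e-12):
    """Greedy TSC sketch: approximate y using at most T columns of D."""
    y = np.asarray(y, dtype=float)
    norms = np.maximum(np.linalg.norm(D, axis=0), tol)
    x = np.zeros(D.shape[1])
    support, coef = [], np.zeros(0)
    residual = y.copy()
    for _ in range(T):
        if np.linalg.norm(residual) <= tol:
            break  # y is already represented exactly
        # Select the atom most correlated with the residual (scale-invariant).
        k = int(np.argmax(np.abs(D.T @ residual) / norms))
        if k in support:
            break  # no new atom improves the fit
        support.append(k)
        # Re-fit every selected coefficient by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    if support:
        x[support] = coef
    return x
```

Calling `omp(D, y, T)` returns a coefficient vector with at most `T` nonzero entries. Greedy selection is only one way to relax the $\ell_0$ problem; $\ell_1$-based solvers are another.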
However, it has been empirically shown that adapting $D$ to underlying data can improve upon state-of-the-art techniques in various restoration and denoising tasks [6, 23]. This adaptation is made possible by solving a sparse matrix factorization problem, which we refer to as dictionary learning. Learning $D$ is done by alternating between TSC and dictionary updates [1, 8, 20, 15]. For an overview of TSC, dictionary learning, and some of their applications, we refer the reader to [28, 7].

In this paper, we address the problem of discriminative dictionary learning (DDL), where $D$ is viewed as a linear mapping between the original data space and the space of sparse representations, whose dimensionality is usually higher. In DDL, we seek an optimal mapping that yields faithful sparse representation and allows for maximal discriminability between labeled data. These two objectives are seldom complementary and tend to conflict in many cases; thus, classification can be viewed as a regularizer for reliable representation and vice versa. From both viewpoints, this regularization is important to prevent overfitting to the labeled data. Therefore, instead of optimizing the two objectives separately, we seek joint optimization.

In the case of sparse linear representation, the problem of DDL was recently introduced and developed in [19, 21, 22], under the name supervised dictionary learning (SDL). In this paper, we denote the problem as DDL instead of SDL, since DDL inherently includes the semi-supervised case. SDL is also addressed in a recent work on task-driven dictionary learning [18]. The form of the optimization problem in SDL is shown in Eq. (1). The objective is a linear combination of a representation cost $e_R$ and a classification cost $e_C$ using data labels $L$ and classifier parameters $W$.
$$\min_{X, D, W} \; e_R(Y, X, D) + \lambda\, e_C(X, W, L) \tag{1}$$

Although [22, 21] use multiple dictionaries, it is clear that learning a single dictionary allows for sharing of features among labeled classes, less computational cost, and less risk of overfitting. As a result, our proposed method learns a single dictionary $D$. Here, we note that [13] addresses a similar problem, where $D$ is predefined and $e_C$ is the Fisher criterion. Despite their merits, SDL methods have the following drawbacks. (i) Most methods use limited forms for $e_C$ (e.g. softmax applied to reconstruction error). Consequently, they cannot generalize to incorporate popular classification costs, such as the exponential loss used in AdaBoost or the hinge loss in SVMs. (ii) Previous SDL methods weight the training samples and the classifiers uniformly by setting the fixed mixing coefficient $\lambda$ according to cross-validation. This biases their cost functions to samples that are badly represented or misclassified. As such, they are more sensitive to outlier, noisy, and mislabeled training data. (iii) From an optimization viewpoint, the SDL objective functions are quite involved, especially due to the use of the softmax function for multi-class discrimination.

Contributions: Our proposed DDL framework addresses the previous issues by learning a linear map $D$ that allows for maximal class discrimination in the labeled data when using linear classification. (i) We show that this framework is applicable to a general family of classification cost functions, including those used in popular boosting methods. (ii) Since we pose DDL in a probabilistic setting, the representation-classification tradeoff and the weighting of training samples correspond to MAP parameters that are estimated in a data-driven fashion that avoids parameter tuning. (iii) Since we decouple $e_R$ and $e_C$, the representations $X$ act as the only liaisons between classification and representation.
In fact, this is why well-studied methods in dictionary learning and TSC can be easily incorporated in solving the DDL problem. This avoids involved optimization techniques. Our framework is efficient, general, and modular, so that any improvement or theoretical guarantee on individual modules (i.e. TSC or dictionary learning) can be seamlessly incorporated.

The paper is organized as follows. In Section 2, we describe the probabilistic representation and classification models in our DDL framework and how they are combined in a MAP setting. Section 3 presents the learning methodology that estimates the MAP parameters and shows how inference is done. In Section 4, we validate our framework by applying it to digit classification and face recognition and showing that it achieves performance comparable to the state of the art on benchmark datasets.

2 Overview of DDL Framework

In this section, we give a detailed description of the probabilistic models used for representation and classification. Our optimization framework, formulated in a standard MAP setup, seeks to maximize the likelihood of the given labeled data coupled with priors on the model parameters.

2.1 Representation and Classification Models

We assume that each $M$-dimensional data sample can be represented as a sparse linear combination of $K$ dictionary atoms with additive Gaussian noise of diagonal covariance: $\vec{y} = D\vec{x} + \vec{n}$; $\vec{n} \sim \mathcal{N}(\vec{0}, \sigma^2 I)$. Here, we view the sparse representation $\vec{x}$ as a latent variable of the representation model. In training, we assume that the training samples are represented by this model. However, test samples can be contaminated by various types of noise that need not be zero-mean Gaussian in nature. In testing, we have: $\vec{y} = D\vec{x} + \vec{e} + \vec{n}$, where we constrain any auxiliary noise $\vec{e}$ (e.g. occlusion) to be sparse in nature without modeling its explicit distribution. This constraint is used in the error correction method for sparse representation in [27].
The representation in testing is identical to the one in training, except that the testing dictionary is augmented by the identity, i.e. $\vec{y} = [D \;\; I]\,[\vec{x}; \vec{e}] + \vec{n}$. In both cases, the likelihood of observing a specific $\vec{y}$ is modeled as a Gaussian: $(\vec{y} \mid \vec{x}, D) \sim \mathcal{N}(D\vec{x}, \sigma^2 I)$. Since a single dictionary is used to represent samples belonging to different classes, sharing of features is allowed among classes, which simplifies the learning process.

To model the classification process, we assume that each data sample corresponds to a label vector $\vec{l} \in \{-1, +1\}^C$, which encodes the class membership of this sample, where $C$ is the total number of classes. In our experiments, only one value in $\vec{l}$ is $+1$. We apply a linear classifier (or equivalently a set of additively boosted linear classifiers) to the sparse representations in a one-vs-all classification setup. The probabilistic classification model is shown in Eq. (2), where $\Omega(\cdot)$ is the classification cost function. Note that appending 1 to $\vec{x}$ intrinsically adds a bias term to each classifier $\vec{w}$. Due to the linearity of the classifier, discrimination of the $j$-th class is completely determined by the scalar cost function $\Omega(\vec{x}) = \Omega(z_j)$, where $z_j = l_j \vec{w}_j^T \vec{x}$. This function quantifies the cost of assigning label $l_j$ to representation $\vec{x}$ using the $j$-th classifier $\vec{w}_j$. For now, we do not specify the functional form of $\Omega(\cdot)$. In Section 3, we show that most forms of $\Omega(\cdot)$ used in practice are easily incorporated into our DDL framework. Since we seek effective class discrimination, we expect low classification cost for the given representations. Therefore, by arranging all $C$ linear classifiers in matrix $W$, the event $(\vec{l} \mid \vec{x}, W)$ can be modeled as a product of $C$ independent exponential distributions parameterized by $\gamma_j$ for $j = 1, \ldots, C$.
By denoting $\vec{w}_j$ as the classifier of the $j$-th class, we have:

$$p(\vec{l} \mid \vec{x}, W) \propto \frac{1}{\prod_{j=1}^{C} \gamma_j} \exp\left(-\sum_{j=1}^{C} \frac{1}{\gamma_j}\, \Omega(l_j \vec{w}_j^T \vec{x})\right) \tag{2}$$

2.2 Overall Probabilistic Model

To formalize notation, we consider a training set of $N$ data samples in $\mathbb{R}^M$ that are columns of the data matrix $Y \in \mathbb{R}^{M \times N}$. The $i$-th column of the label matrix $L \in \{+1, -1\}^{C \times N}$ is the label vector $\vec{l}_i$ corresponding to the $i$-th data sample. Here, we assume that there are $K$ atoms in the dictionary $D \in \mathbb{R}^{M \times K}$, where $K$ is a fixed integer that is application-dependent. Typically, $K \gg M$. Note that there have been recent attempts to determine an optimal $K$ for a given dataset [24]. For our experiments, $K$ is kept fixed and its optimization is left for future work. The representation matrix $X \in \mathbb{R}^{K \times N}$ is a sparse matrix, whose columns represent the sparse codes of the data samples $Y$ using dictionary $D$. The linear classifiers are columns in matrix $W \in \mathbb{R}^{K \times C}$. We denote $\Theta_R = \{\sigma_i^2\}_{i=1}^{N}$ and $\Theta_C = \{\gamma_j\}_{j=1}^{C}$ as the representation and classification parameters respectively.

In what follows, we combine the representation and classification models from the previous section in a unified framework that will allow for the joint MAP estimation of the unknowns: $D$, $X$, $W$, $\Theta_R$, and $\Theta_C$. By making the standard assumption that the posterior probability consists of a dominant peak, we determine the required MAP estimates by maximizing the product: $p(Y \mid D, X, \Theta_R)\, p(L \mid X, W, \Theta_C)\, p(\Theta_R)\, p(\Theta_C)$. Here, we make the simplifying assumption that the priors of the dictionary and the representations are uniform. To model the priors of $\Theta_R$ and $\Theta_C$ and to avoid using hyper-parameters, we choose the objective non-parametric Jeffreys prior, which has been shown to perform well for classification and regression tasks [9]. Therefore, we obtain $p(\Theta_R) \propto \prod_{i=1}^{N} \frac{1}{\sigma_i^2}$ and $p(\Theta_C) \propto \prod_{j=1}^{C} \frac{1}{\gamma_j}$.
The motivations behind the selection of these priors are that (i) the representation prior encourages a low variance representation (i.e. the training data should properly fit the proposed representation model) and that (ii) the classification prior encourages a low mean (and variance)¹ classification cost (i.e. the training data should be properly classified using the proposed classification model). By minimizing the sum of the negative log-likelihood of the data and labels and the negative log priors, MAP estimation requires solving the optimization problem in Eq. (3), where $L_{ji}$ represents the label of the $i$-th training sample with respect to the $j$-th class. To encode the sparse representation model, we explicitly enforce sparsity on $X$ by requiring that each representation $\vec{x}_i \in \mathcal{S}_T = \{\vec{a} : \|\vec{a}\|_0 \le T\}$. An alternative for obtaining sparse representations is to assume that $\vec{x}_i$ follows a Laplacian prior, which leads to an $\ell_1$ regularizer in the objective. While this sparsifying regularizer alleviates some of the complexity of Eq. (3), it leads to the problem of selecting proper parameters for these Laplacian priors. Note that recent efforts have been made to find optimal estimates of these Laplacian parameters in the context of sparse coding [11, 30, 2]. However, to avoid additional parameters, we choose the form in Eq. (3), where the first two terms of the objective correspond to the representation cost and the last two to the classification cost.

¹ The mean and variance of an exponential distribution with parameter $\lambda = \frac{1}{\gamma}$ are $\gamma$ and $\gamma^2$ respectively.

$$\min_{D, W, X, \Theta_R, \Theta_C} \; \sum_{i=1}^{N} \frac{\|\vec{y}_i - D\vec{x}_i\|_2^2}{2\sigma_i^2} + \sum_{i=1}^{N} \ln \sigma_i^{M+2} + \sum_{j=1}^{C} \sum_{i=1}^{N} \frac{\Omega(L_{ji} \vec{w}_j^T \vec{x}_i)}{\gamma_j} + \sum_{j=1}^{C} \ln \gamma_j^{N+1} \tag{3}$$

In the following section, we show that Eq. (3) can be solved for a general family of cost functions $\Omega(\cdot)$ using well-known and well-studied techniques in TSC and dictionary learning.
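As a rough numerical companion to Eq. (3), the sketch below evaluates the objective for fixed $D$, $X$, and $W$ and computes the values of $\sigma_i$ and $\gamma_j$ at which its derivatives vanish, namely $\sigma_i^2 = \|\vec{y}_i - D\vec{x}_i\|_2^2 / (M+2)$ and $\gamma_j = \sum_i \Omega(L_{ji}\vec{w}_j^T\vec{x}_i) / (N+1)$. The function names and the default square loss are assumptions made for illustration, and the sparsity constraint on $X$ is not checked.

```python
import numpy as np

def square_loss(z):
    """Square cost (1 - z)^2, used here only as a default example."""
    return (1.0 - z) ** 2

def ddl_objective(Y, D, X, W, L, sigma, gamma, loss=square_loss):
    """Value of the Eq. (3) objective for given variables and parameters."""
    M, N = Y.shape
    sq_res = np.sum((Y - D @ X) ** 2, axis=0)   # ||y_i - D x_i||^2, shape (N,)
    cost = loss(L * (W.T @ X))                  # Omega(L_ji w_j^T x_i), shape (C, N)
    rep = np.sum(sq_res / (2.0 * sigma ** 2)) + (M + 2) * np.sum(np.log(sigma))
    cls = np.sum(cost.sum(axis=1) / gamma) + (N + 1) * np.sum(np.log(gamma))
    return rep + cls

def closed_form_parameters(Y, D, X, W, L, loss=square_loss):
    """Stationary points of Eq. (3) in sigma_i and gamma_j, all else fixed."""
    M, N = Y.shape
    sq_res = np.sum((Y - D @ X) ** 2, axis=0)
    sigma = np.sqrt(sq_res / (M + 2))
    gamma = loss(L * (W.T @ X)).sum(axis=1) / (N + 1)
    return sigma, gamma
```

Each $\sigma_i$ and each $\gamma_j$ appears in only two terms of Eq. (3), which is why both have closed-form minimizers and no tuning parameter is involved.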
In other words, developing specialized optimization methods and performing parameter tuning are not required.

3 Learning Methodology

Since the objective function and sparsity constraints in Eq. (3) are non-convex, we decouple the dependent variables by resorting to a blockwise coordinate descent method (alternating optimization). At each iteration, only a subset of variables is updated at a time. Clearly, learning $D$ is decoupled from learning $W$, if $X$ and $(\Theta_R, \Theta_C)$ are fixed. Next, we identify the four basic update procedures in our DDL framework. In what follows, we denote the estimate of variable $A$ at iteration $k$ as $A^{(k)}$.

3.1 Classifier Update

Since the classification terms in Eq. (3) are decoupled from the representation terms and independent of each other, each classifier can be learned separately. In this paper, we focus on four popular forms of $\Omega(\cdot)$, as shown in Figure 1(a): (i) the square loss: $\Omega(z) = (1 - z)^2$ optimized by the boosted square leverage method [5], (ii) the exponential loss: $\Omega(z) = e^{-z}$ optimized by the AdaBoost method [10], (iii) the logistic loss: $\Omega(z) = \ln(1 + e^{-z})$ optimized by the LogitBoost method [10], and (iv) the hinge loss: $\Omega(z) = \max(0, 1 - z)$ optimized by the SVM method. Since additive boosting of linear classifiers yields a linear classifier, we allow for seamless incorporation of additive boosting, which is a novel contribution.

Figure 1: Four classification cost functions: square, exponential, logistic, and hinge loss in 1(a) (classification cost). 1(b) (classifier weights) plots their impacts on classifier weights (second derivatives) in our DDL framework.

3.2 Discriminative Sparse Coding

In this section, we describe how well-known and well-studied TSC algorithms (e.g. Orthogonal Matching Pursuit (OMP)) are used to update $X^{(k+1)}$ from $X^{(k)}$. This is done by solving the problem in Eq. (4), which we refer to as discriminative sparse coding (DSC).
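For reference, the costs of Section 3.1 can be written out together with their first and second derivatives (the second derivative is the quantity that Figure 1(b) relates to classifier weights). This is a sketch with hypothetical names (`LOSSES`, `sigmoid`, `hinge`). The plain hinge is included as a value only, because it has no usable second derivative and the form of its smooth approximation is not spelled out in this text.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# name -> (Omega, first derivative, second derivative), each a function of z
LOSSES = {
    "square": (lambda z: (1.0 - z) ** 2,
               lambda z: -2.0 * (1.0 - z),
               lambda z: 2.0),
    "exponential": (lambda z: math.exp(-z),
                    lambda z: -math.exp(-z),
                    lambda z: math.exp(-z)),
    "logistic": (lambda z: math.log1p(math.exp(-z)),
                 lambda z: sigmoid(z) - 1.0,
                 lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
}

def hinge(z):
    """Plain hinge cost max(0, 1 - z); value only."""
    return max(0.0, 1.0 - z)
```

Two properties are visible directly in the code: the second derivative of the square cost is constant, and for the exponential cost the second derivative equals the cost itself.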
DSC requires the sparse code to not only reliably represent the data sample but also to be discriminable by the one-vs-all classifiers. Here, we denote $\vec{l}$ as the label vector of the $i$-th data element (i.e. the $i$-th column of $L$). The $(k)$ superscripts are omitted from variables not being updated to facilitate readability. Here, we note that DSC, as defined here, is a generalization of the functional form used in [13].

$$\vec{x}_i^{(k+1)} = \arg\min_{\vec{x} \in \mathcal{S}_T} \; \|\vec{b} - A\vec{x}\|_2^2 + 2\sum_{j=1}^{C} \frac{\Omega(\vec{g}_j^T \vec{x})}{\gamma_j}, \quad \text{where } \vec{b} = \frac{\vec{y}_i}{\sigma_i};\; A = \frac{D}{\sigma_i};\; \vec{g}_j = l_j \vec{w}_j \tag{4}$$

Solving Eq. (4): The complexity of this solution depends on the nature of $\Omega(\cdot)$. However, it is easy to show that, by applying a projected Newton descent method to Eq. (4), DSC can be formulated as a sequence of TSC problems, if $\Omega(z)$ is strictly convex. At each Newton iteration, a quadratic local approximation of the cost function is minimized. If we denote $\Omega_1(z)$ and $\Omega_2(z)$ as the first and second derivatives of $\Omega(z)$ respectively and $\Omega_{12}(z) = \frac{\Omega_1(z)}{\Omega_2(z)}$, the quadratic approximation of $\Omega(z)$ around $z_p$ is $\Omega(z) \approx \Omega(z_p) + \Omega_1(z_p)(z - z_p) + \frac{1}{2}\Omega_2(z_p)(z - z_p)^2$. Since $\Omega_2(z)$ is a strictly positive function, we can complete the square to get $\Omega(z) \approx \frac{1}{2}\Omega_2(z_p)\left[z - (z_p - \Omega_{12}(z_p))\right]^2 + \text{const}$. By replacing this approximation in Eq. (4), the objective function at the $(p+1)$-th Newton iteration is: $\|\vec{b} - A\vec{x}\|_2^2 + \|H^{(p)}(\vec{\delta}^{(p)} - G^T\vec{x})\|_2^2$. In fact, this objective takes the form of a TSC problem and, thus, can be solved by any TSC algorithm. Here, $G$ is formed by the columnwise concatenation of the $\vec{g}_j$ and we define $\delta^{(p)}(j) = \vec{g}_j^T\vec{x}^{(p)} - \Omega_{12}(\vec{g}_j^T\vec{x}^{(p)})$ for $j = 1, \ldots, C$. Also, we define the diagonal weight matrix $H^{(p)}$, where $H^{(p)}(j, j) = \left(\frac{\Omega_2(\vec{g}_j^T\vec{x}^{(p)})}{\gamma_j}\right)^{\frac{1}{2}}$ weights the $j$-th classifier. Based on this derivation, the same TSC algorithm (e.g. OMP) can be used to solve the DSC problem iteratively, as illustrated in Algorithm 1. The convergence of this algorithm is dependent on whether the TSC algorithm is capable of recovering the sparsest solution at each iteration. Although this is not guaranteed in general, the convergence of TSC algorithms to the sparsest solution has been shown to hold when the solution is sparse enough, even if the dictionary atoms are highly correlated [3, 27, 12, 4]. In our experiments, we see that the DSC objective is reduced sequentially and convergence is obtained in almost all cases. Furthermore, we provide a Stop Criterion (threshold on the relative change in solution) for the early termination of Algorithm 1 to avoid needless computation.

Algorithm 1 Discriminative Sparse Coding (DSC)
INPUT: $A$, $\vec{b}$, $G$, $\vec{\alpha}$, $\Omega$, $\vec{x}^{(0)}$, $T$, $p_{max}$, Stop Criterion (here $\vec{\alpha}$ holds the reciprocals $1/\gamma_j$, as passed in Algorithm 2)
while (Stop Criterion not met) AND $p \le p_{max}$ do
    compute and form: $\vec{\delta}^{(p)}$ and $H^{(p)}$;
    $\vec{x}^{(p+1)} = \text{TSC}\left(\begin{bmatrix} \vec{b} \\ H^{(p)}\vec{\delta}^{(p)} \end{bmatrix}, \begin{bmatrix} A \\ H^{(p)}G^T \end{bmatrix}, T\right)$;
    $p = p + 1$;
end while
OUTPUT: $\vec{x}^{(p)}$

Popular Forms of $\Omega(z)$: Here, we focus on particular forms of $\Omega(z)$, namely the four functions in Section 3.1. Before proceeding, we need to replace the traditional hinge cost with a strictly convex approximation. We use the smooth hinge approximation introduced by [17], which can approximate the traditional hinge arbitrarily closely. As seen before, $\Omega_2(z)$ and $\Omega_{12}(z)$ are the only functions that play a role in the DSC solution. Obviously, only one iteration of Algorithm 1 is needed when the square cost is used, since it is already quadratic. For all other $\Omega(z)$, at the $p$-th iteration of DSC, the impact of the $j$-th classifier on the overall cost (or equivalently on updating the sparse code) is determined by $H^{(p)}(j, j)$. This weight is influenced by two terms. (i) It decreases as $\gamma_j$ grows. So, a classifier with a smaller mean training cost (i.e. higher training set discriminability) yields more impact on the solution. (ii) It grows with $\Omega_2(l_j\vec{w}_j^T\vec{x}^{(p)})$, the second derivative at the previous solution. In this case, the impact of the $j$-th classifier is determined by the type of classification cost used. In Figure 1(b), we plot the relationship between $\Omega(z)$ and $\Omega_2(z)$ for all four $\Omega(z)$ types. For the square and hinge functions, $\Omega(z)$ and $\Omega_2(z)$ are independent; thus, a classifier yielding high sample discriminability (low $\Omega(z)$) is weighted the same as one yielding low discriminability. For the exponential case, the relationship is linear and positively correlated (indeed, $\Omega_2(z) = \Omega(z) = e^{-z}$); thus, the lower a classifier's sample discriminability is, the higher its weight. This implies that the sparse code will be updated to correct for classifiers that misclassified the training sample in the previous iteration. Clearly, this makes representation sensitive to samples that are "hard" to classify as well as outliers. This sensitivity is overcome when the logistic cost is used. Here, the relationship is positively correlated for moderate costs but negatively correlated for high costs. This is consistent with the theoretical argument that LogitBoost should outperform AdaBoost when training data is noisy or mislabeled.

3.3 Unsupervised Dictionary Learning

When $X^{(k)}$, $\Theta_R^{(k)}$, and $\Theta_C^{(k)}$ are fixed, $D^{(k)}$ can be updated by any unsupervised dictionary learning method. In our experiments, we use the KSVD algorithm, since it avoids expensive matrix inversion operations required by other methods. Also, efficient versions of KSVD have recently been developed [25]. By alternating between TSC and dictionary updates (SVD operations), KSVD iteratively reduces the overall representation cost and generates a dictionary with normalized atoms and the corresponding sparse representations.
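The dictionary update stage of KSVD can be sketched as one sweep over the atoms with the sparsity pattern of $X$ held fixed: each atom and its nonzero coefficients are replaced by the best rank-one fit to the residual of the samples that use that atom. The code below follows that standard description of KSVD [1]. The function name `ksvd_dictionary_pass` is hypothetical, and leaving unused atoms untouched is a simplification made here.

```python
import numpy as np

def ksvd_dictionary_pass(Y, D, X):
    """One sweep of KSVD atom updates; the support of X is never enlarged."""
    D = D.copy()
    X = X.copy()
    for k in range(D.shape[1]):
        users = np.flatnonzero(X[k, :])   # samples whose code uses atom k
        if users.size == 0:
            continue                      # unused atom: left as-is in this sketch
        # Residual of those samples with the contribution of atom k removed.
        E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
        U, s, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, k] = U[:, 0]                 # new unit-norm atom
        X[k, users] = s[0] * Vt[0, :]     # matching coefficients
    return D, X
```

Because each rank-one fit is optimal for its atom, the representation error $\|Y - DX\|_F$ cannot increase during the sweep, and no matrix inversion is needed.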
In our case, the representations are known a priori, so only a single iteration of the KSVD algorithm is required. For more details, we refer the readers to [1].

3.4 Parameter Estimation and Initialization

The use of the Jeffreys prior for $\Theta_R$ and $\Theta_C$ yields simple update equations: $\sigma_i^{(k)} = \left(\frac{1}{M+2}\|\vec{y}_i - D^{(k)}\vec{x}_i^{(k)}\|_2^2\right)^{\frac{1}{2}}$ and $\gamma_j^{(k)} = \frac{1}{N+1}\sum_{i=1}^{N} \Omega\left(L_{ji}(\vec{w}_j^{(k)})^T\vec{x}_i^{(k)}\right)$. These variables estimate the sample representation variance and the mean/variance of the classification cost respectively. Since the overall update scheme is iterative, proper initialization is needed. In our experiments, we initialize $D^{(0)}$ to a randomly selected subset of training samples (uniformly chosen from the different classes) or to random zero-mean Gaussian vectors, followed by columnwise normalization. Interestingly, both schemes produce similar dictionaries, although the randomized scheme requires more iterations for convergence. The representations $X^{(0)}$ are computed by TSC using $D^{(0)}$. Initializing the remaining variables uses the update schemes above. Algorithm 2 summarizes the overall DDL framework.

Algorithm 2 Discriminative Dictionary Learning (DDL)
INPUT: $Y$, $L$, $T$, $\Omega$, $q_{max}$, $p_{max}$, Stop Criterion
Initialize $D^{(0)}$, $X^{(0)}$, $\Theta_R^{(0)}$, $\Theta_C^{(0)}$, and $q = 0$
while (Stop Criterion not met) AND $q \le q_{max}$ do
    for $i = 1$ to $N$ do
        $\vec{x}_i^{(q+1)} = \text{DSC}\left(\frac{D^{(q)}}{\sigma_i^{(q)}}, \frac{\vec{y}_i}{\sigma_i^{(q)}}, W^{(q)}\text{diag}(\vec{l}_i), \frac{1}{\Theta_C^{(q)}}, \Omega, \vec{x}_i^{(q)}, T, p_{max}, \text{Stop Criterion}\right)$;
    end for
    Learn classifiers $W^{(q+1)}$ using $L$ and $X^{(q+1)}$;
    $D^{(q+1)} = \text{KSVD}(D^{(q)}, X^{(q+1)}, T)$;
    Update $\vec{\sigma}^{(q+1)}$ and $\vec{\gamma}^{(q+1)}$;
    $q = q + 1$;
end while
OUTPUT: $D^{(q)}$, $W^{(q)}$, $X^{(q)}$, $\vec{\sigma}^{(q)}$, and $\vec{\gamma}^{(q)}$

3.5 Inference

After learning $D$ and $W$, we describe how the label of a test sample $\vec{y}_t$ is inferred.
We seek the class $j_t$ that maximizes $p(\vec{y}_t \mid \vec{l}_t(j))$, where $\vec{l}_t(j)$ is the label vector of $\vec{y}_t$ assuming it belongs to class $j$. By marginalizing with respect to $\vec{x}$ and assuming a single dominant representation $\vec{x}_t$ exists, $j_t$ is the class that maximizes $p(\vec{y}_t \mid \vec{x}_t, D)\, p(\vec{x}_t \mid \vec{l}_t(j), W)$, as in Eq. (5).

$$j_t = \arg\max_{j \in \{1, \ldots, C\}} \left[\max_{\vec{x} \in \mathcal{S}_T} \; p(\vec{y}_t \mid \vec{x}, D)\, p(\vec{l}_t(j) \mid \vec{x}, W)\right] \tag{5}$$

The inner maximization problem is exactly a DSC problem where $\vec{l}_t(j)$ is the hypothesized label vector. Here, we use the testing representation model to account for dense errors (e.g. occlusion), thus augmenting $D$ by identity. Computing $j_t$ involves $C$ independent DSC problems. To reduce computational cost, we solve a single TSC problem instead: $\vec{x}_t = \arg\max_{\vec{x} \in \mathcal{S}_T} p(\vec{y}_t \mid \vec{x}, D)$. In this case, $j_t = \arg\max_{j \in \{1, \ldots, C\}} p(\vec{l}_t(j) \mid \vec{x}_t, W)$.

Implementation Details: There are several ways to speed up computation and allow for quicker convergence. (i) The DSC update step is the most computationally expensive operation in Algorithm 2. This is mitigated by using a greedy TSC method (Batch-OMP instead of $\ell_1$ minimization methods) and exploiting the inherent parallelism of DDL (e.g. doing DSC updates in parallel). (ii) Selecting suitable initializations for $D$ and the DSC solutions can dramatically speed up convergence. For example, choosing $D^{(0)}$ from the training set leads to a smaller number of DDL iterations than randomly choosing $D^{(0)}$. Also, we initialize DSC solutions at a given DDL iteration with those from the previous iteration. Moreover, the DDL framework is easily extended to the semi-supervised case, where only a subset of training samples are labeled. The only modification to be made here is to use TSC (instead of DSC) to update the representations of unlabeled samples.
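The two core routines of this section, the Newton-style DSC iteration of Algorithm 1 and the simplified single-TSC inference rule, can be sketched as follows. This is an illustration under assumptions rather than the authors' code: `omp` is a basic greedy TSC solver standing in for Batch-OMP, the names `dsc` and `infer_label` and the stopping tolerance are hypothetical, `d1` and `d2` are the first and second derivatives of the chosen cost, and the identity augmentation of $D$ for occluded test samples is omitted.

```python
import numpy as np

def omp(D, y, T, tol=1e-12):
    """Greedy TSC sketch: approximate y using at most T columns of D."""
    y = np.asarray(y, dtype=float)
    norms = np.maximum(np.linalg.norm(D, axis=0), tol)
    x = np.zeros(D.shape[1])
    support, coef = [], np.zeros(0)
    residual = y.copy()
    for _ in range(T):
        if np.linalg.norm(residual) <= tol:
            break
        k = int(np.argmax(np.abs(D.T @ residual) / norms))
        if k in support:
            break
        support.append(k)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    if support:
        x[support] = coef
    return x

def dsc(A, b, G, gamma, d1, d2, x0, T, p_max=20, tol=1e-6):
    """Algorithm 1 sketch: every Newton step is a TSC problem on stacked data."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(p_max):
        z = G.T @ x                        # z_j = g_j^T x
        delta = z - d1(z) / d2(z)          # target of the local quadratic model
        h = np.sqrt(d2(z) / gamma)         # diagonal of H
        b_aug = np.concatenate([b, h * delta])
        A_aug = np.vstack([A, h[:, None] * G.T])
        x_new = omp(A_aug, b_aug, T)
        # Stop when the relative change in the solution is small.
        if np.linalg.norm(x_new - x) <= tol * max(np.linalg.norm(x), 1.0):
            return x_new
        x = x_new
    return x

def infer_label(D, W, gamma, y, T, loss):
    """Simplified inference: one TSC solve, then the lowest-cost label vector."""
    x = omp(D, y, T)
    scores = W.T @ x
    C = W.shape[1]
    costs = []
    for j in range(C):
        l = -np.ones(C)
        l[j] = 1.0                         # hypothesized one-vs-all label vector
        costs.append(np.sum(loss(l * scores) / gamma))
    return int(np.argmin(costs)), x
```

With the square cost, `delta` is identically 1 and the stacked problem no longer depends on the previous iterate, which matches the remark in Section 3.2 that a single iteration suffices in that case.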

4 Experimental Results

In this section, we provide empirical analysis of our DDL framework when applied to handwritten digit classification ($C = 10$) and face recognition ($C = 38$). Digit classification is a standard machine learning task with two popular benchmarks, the USPS and MNIST datasets. The digit samples in these two datasets have been acquired under different conditions or written using significantly different handwriting styles. To alleviate this problem, we use the alignment and error correction technique for TSC that was introduced in [26]. This corrects for gross errors that might occur (e.g. due to thickening of handwritten strokes or reasonable rotation/translation). Consequently, we do not need to augment the training set with shifted versions of the training images, as done in [18]. Furthermore, we apply DDL to face recognition, which is a machine vision problem where sparse representation has made a big impact. We use the Extended Yale B (E-YALE-B) benchmark for evaluation. To show that learning $D$ in a discriminative fashion improves upon traditional dictionary learning, we compare our method against a baseline that treats representation and classification independently. In the baseline, $X$ and $D$ are estimated using KSVD, $W$ is learned using $X$ and $L$ directly, and a winner-take-all classification strategy is used. Clearly, our framework is general, so we do not expect to outperform methods that use domain-specific features and machinery. However, we do achieve results comparable to state-of-the-art. Also, we show that our DDL framework significantly outperforms the baseline. In all our experiments, we set $q_{max} = 20$ and $p_{max} = 100$ and initialize $D$ to elements in the training set.

Digit Classification: The USPS dataset comprises $N = 7291$ training and 2007 test images, each of $16 \times 16$ pixels ($M = 256$). We plot the test error rates of the baseline for the four classifier types and for a range of $T$ and $K$ values in Figure 2.
Beneath each plot, we indicate the values of $K$ and $T$ that yield minimum error. This is a common way of reporting SDL results [18, 19, 21, 22]. Interestingly, the square loss classifier leads to the lowest error and the best generalization. For comparison, we plot the results of our DDL method in Figure 3. Clearly, our method achieves a significant improvement of 4.5% over the baseline, and 1% and 0.5% over the SDL methods in [19] and [18] respectively. Our results are comparable to the state-of-the-art performance (2.2%) [16]. This result shows that adapting $D$ to the underlying data and class labels yields a dictionary that is better suited for classification. Increasing $T$ leads to an overall improvement of performance because representation becomes more reliable. However, we observe that beyond $T = 3$, this improvement is insignificant. The square loss classifier achieves the lowest performance and the logistic classifier achieves the highest. The variations of error with $K$ are similar for all the classifiers. Error steadily decreases until an "optimal" $K$ value is reached. Beyond this $K$ value, performance deteriorates due to overfitting. Future work will study how to automatically predict this optimal value from training data, without resorting to cross-validation.

In Figure 4, we plot the learned parameters $\Theta_R$ (in histogram form) and $\Theta_C$ for a typical DDL setup. We observe that the form of these plots does not significantly change when the training setting is changed. We notice that the $\Theta_R$ histogram fits the form of the Jeffreys prior, $p(x) \propto \frac{1}{x}$. Most of the $\sigma$ values are close to zero, which indicates reliable reconstruction of the data. On the other hand, $\Theta_C$ takes on similar values for most classes, except the "0" digit class, which contains a significant amount of variation and thus the highest classification cost.
Note that these values tend to be inversely proportional to the classification performance of their corresponding linear classifiers. We provide a visualization of the learned $D$ in the supplementary material. Interestingly, we observe that the dictionary atoms resemble digits in the training set and that the number of atoms that resemble a particular class is inversely proportional to the accuracy of that class's binary classifier. This occurs because a "hard" class contains more intra-class variations requiring more atoms for representation.

The MNIST dataset comprises $N = 60000$ training and 10000 test images, each of $28 \times 28$ pixels ($M = 784$). We show the baseline and DDL test error rates in Table 1. We train each classifier type using the $K$ and $T$ values that achieved minimum error for that classifier on the USPS dataset. Compared to the baseline, we observe a similar improvement in performance as in the USPS case. Also, our results are comparable to state-of-the-art performance (0.53%) for this dataset [14].

Face Recognition: The E-YALE-B dataset comprises 2,414 images of $C = 38$ individuals, each of $192 \times 168$ pixels, which we downsample by a factor of 8 in each dimension ($M = 504$). Using a classification setup similar to [29] with $K = 600$ and $T = 5$, we record the classification results in Table 1, which lead to implications similar to those in our previous experiments. Interestingly, DDL achieves similar results to the robust sparse representation method of [29], which uses all training samples ($K \approx 1200$) as atoms in $D$. This shows that learning a discriminative $D$ can reduce the dictionary size by as much as 50%, without significant loss in performance.

Figure 2: Baseline classification performance on the USPS dataset

Figure 3: DDL classification performance on the USPS dataset

Figure 4: Parameters $\Theta_R$ and $\Theta_C$ learned from the USPS dataset
Table 1: Baseline and DDL test error on the MNIST and E-YALE-B datasets

            MNIST (digit classification)        E-YALE-B (face recognition)
            SQ      EXP     LOG     HINGE       SQ      EXP     LOG     HINGE
BASELINE    8.35%   6.91%   5.77%   4.92%       10.23%  9.65%   9.23%   9.17%
DDL         1.41%   1.28%   1.01%   0.72%       8.89%   7.82%   7.57%   7.30%

5 Conclusions

This paper addresses the problem of discriminative dictionary learning by jointly learning a sparse linear representation model and a linear classification model in a MAP setting. We develop an optimization framework that can incorporate a diverse family of popular classification cost functions and that is solved by a sequence of update operations that build on well-known and well-studied methods in sparse representation and dictionary learning. Experiments on standard datasets show that this framework outperforms the baseline and achieves performance comparable to the state of the art.

References

[1] M. Aharon, M. Elad, and A. M. Bruckstein. The K-SVD: an algorithm for designing of overcomplete dictionaries for sparse representations. In IEEE Transactions on Signal Processing, volume 54, 2006.
[2] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. ArXiv e-prints, 2008.
[3] M. Davenport and M. Wakin. Analysis of Orthogonal Matching Pursuit using the restricted isometry property. IEEE Transactions on Information Theory, 56(9):4395–4401, 2010.
[4] D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proc. of the National Academy of Sciences, 100(5):2197–202, 2003.
[5] N. Duffy and D. Helmbold. Boosting methods for regression. Journal of Machine Learning Research, 47(2):153–200, 2002.
[6] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–45, 2006.
[7] M. Elad, M. Figueiredo, and Y. Ma. On the Role of Sparse and Redundant Representations in Image Processing. Proceedings of the IEEE, 98(6):972–982, 2010.
[8] K. Engan, S. Aase, and J. Husoy. Frame based signal compression using method of optimal directions (MOD). In IEEE Intern. Symp. Circ. Syst., 1999.
[9] M. Figueiredo. Adaptive Sparseness using Jeffreys' Prior. NIPS, 1:697–704, 2002.
[10] J. Friedman, R. Tibshirani, and T. Hastie. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.
[11] R. Giryes, M. Elad, and Y. Eldar. Automatic parameter setting for iterative shrinkage methods. In IEEE Convention of Electrical and Electronics Engineers in Israel, pages 820–824, 2009.
[12] R. Gribonval and M. Nielsen. Sparse representations in unions of bases. IEEE Transactions on Information Theory, 49(12):3320–3325, 2004.
[13] K. Huang and S. Aviyente. Sparse representation for signal classification. In NIPS, pages 609–616, 2006.
[14] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? ICCV, pages 2146–2153, 2009.
[15] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In ICML, 2010.
[16] D. Keysers, J. Dahmen, T. Theiner, and H. Ney. Experiments with an extended tangent distance. ICPR, 1(2):38–42, 2000.
[17] N. Loeff and A. Farhadi. Scene discovery by matrix factorization. ECCV, pages 451–464, 2008.
[18] J. Mairal, F. Bach, and J. Ponce. Task-Driven Dictionary Learning. ArXiv e-prints, Sept. 2010.
[19] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In NIPS, 2008.
[20] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. ICML, pages 1–8, 2009.
[21] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. In CVPR, 2008.
[22] J. Mairal, M. Leordeanu, F. Bach, M. Hebert, and J. Ponce. Discriminative sparse image models for class-specific edge detection and image interpretation. ECCV, pages 43–56, 2008.
[23] J. Mairal, G. Sapiro, and M. Elad. Learning multiscale sparse representations for image and video restoration. SIAM Multiscale Modeling and Simulation, 7(1):214–241, 2008.
[24] R. Mazhar and P. Gader. EK-SVD: Optimized dictionary design for sparse representations. In ICPR, pages 1–4, 2008.
[25] R. Rubinstein, M. Zibulevsky, and M. Elad. Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit. CS Technion Technical Report, pages 1–15, 2008.
[26] A. Wagner, J. Wright, A. Ganesh, Z. Zhou, and Y. Ma. Towards a practical face recognition system: Robust registration and illumination by sparse representation. In CVPR, pages 597–604, 2009.
[27] J. Wright and Y. Ma. Dense error correction via l1-minimization. In IEEE Transactions on Information Theory, number 2, pages 3033–3036, 2010.
[28] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan. Sparse Representation for Computer Vision and Pattern Recognition. Proceedings of the IEEE, 98(6):1031–1044, 2010.
[29] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. TPAMI, 31(2):210–27, 2009.
[30] H. Zou. The Adaptive Lasso and its Oracle Properties. Journal of the American Statistical Association, 101:1418–1429, 2006.
