Large Scale, Large Margin Classification using Indefinite Similarity Measures


Omid Aghazadeh and Stefan Carlsson

Abstract

Despite the success of the popular kernelized support vector machines, they have two major limitations: they are restricted to Positive Semi-Definite (PSD) kernels, and their training complexity scales at least quadratically with the size of the data. Many natural measures of similarity between pairs of samples are not PSD, e.g. invariant kernels, and those that are implicitly or explicitly defined by latent variable models. In this paper, we investigate scalable approaches for using indefinite similarity measures in large margin frameworks. In particular, we show that a normalization of similarity to a subset of the data points constitutes a representation suitable for linear classifiers. The result is a classifier which is competitive with kernelized SVM in terms of accuracy, despite having better training and test time complexities. Experimental results demonstrate that on the CIFAR-10 dataset, the model equipped with similarity measures invariant to rigid and non-rigid deformations can be made more than 5 times sparser while being more accurate than kernelized SVM using RBF kernels.

1 Introduction

The linear support vector machine (SVM) has become the classifier of choice for many large scale classification problems. The main reasons for the success of linear SVM are its max margin property, achieved through a convex optimization, a training time linear in the size of the training data, and a testing time independent of it. Although a linear classifier operating on the input space is usually not very flexible, a linear classifier operating on a mapping of the data to a higher dimensional feature space can become arbitrarily complex. Mixtures of linear classifiers have been proposed to increase the non-linearity of linear classifiers [10, 1]; these can be seen as feature mappings augmented with non-linear gating functions.
The training of these mixture models usually scales bilinearly with the size of the data and the number of mixtures. The drawbacks are the non-convexity of the optimization procedures and the need to know the (maximum) number of components beforehand. Kernelized SVM maps the data to a possibly higher dimensional feature space, maintains convexity, and can become arbitrarily flexible depending on the choice of the kernel function. The use of kernels, however, is limiting. Firstly, kernelized SVM has significantly higher training and test time complexities when compared to linear SVM. As the number of support vectors grows approximately linearly with the training data [22], the training complexity becomes approximately somewhere between $O(n^2)$ and $O(n^3)$. Testing time complexity scales linearly with the number of support vectors, bounded by $O(n)$. Secondly, positive (semi) definite (PSD) kernels are sometimes not expressive enough to model various sources of variation in the data. A recent study [21] argues that metric constraints are not necessarily optimal for recognition. For example, in image classification problems, considering kernels as similarity measures, they cannot align exemplars or model deformations when measuring similarities. As a response to this, invariant kernels were introduced [6], which are generally indefinite. Indefinite similarity measures plugged into SVM solvers result in non-convex optimizations, unless explicitly made PSD, mainly using eigendecomposition methods [3]. Alternatively, latent variable models have been proposed to address the alignment problem, e.g. [9, 25]. In these cases, the dependency of the latent variables on the parameters of the model being learnt has two main drawbacks: 1) the optimization problem becomes non-convex, and 2) the cost of training becomes much higher than in the case without latent variables.
This paper aims to address these problems using explicit basis expansion. We show that the resulting model: 1) has better training and test time complexities than kernelized SVM models, 2) can make use of indefinite similarity measures without any need for removal of the negative eigenvalues, which requires an expensive eigendecomposition, and 3) can make use of multiple similarity measures without losing convexity, at a cost linear in the number of similarity measures. Our contributions are: 1) proposing and analyzing Basis Expanding SVM (BE-SVM) with respect to the aforementioned three properties, and 2) investigating the suitability of particular forms of invariant similarity measures for large scale visual recognition problems.

2 Basis Expanding Support Vector Machine

2.1 Background: SVM

Given a dataset $D = \{(x_1, y_1), \dots, (x_n, y_n) \mid x_i \in \mathcal{X}, y_i \in \{-1, 1\}\}$, SVM based methods learn max margin binary classifiers. The SVM classifier is $f(x) = \langle w, x \rangle \geq 0$ (we omit the bias term for the sake of clarity). The weight $w$ is learnt by minimizing $\frac{1}{2}\langle w, w \rangle + C \sum_i \ell_H(y_i, f(x_i))$, where $\ell_H(y, x) = \max(0, 1 - xy)$ is the Hinge loss. Any positive semi-definite (PSD) kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ can be associated with a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, and vice versa; that is, $\langle \psi_{\mathcal{H}}(x), \psi_{\mathcal{H}}(y) \rangle = k(x, y)$, where $\psi_{\mathcal{H}} : \mathcal{X} \to \mathcal{H}$ is the implicitly defined feature mapping associated with $\mathcal{H}$, and consequently with $k(\cdot, \cdot)$. The representer theorem states that in such a case, $\psi_{\mathcal{H}}(w) = \sum_i \gamma_i k(\cdot, x_i)$ with $\gamma_i \in \mathbb{R}$ for all $i$. For the particular case of the linear kernel $k(x, y) = x \cdot y$, associated with a Euclidean space, the linear SVM classifier is $f_l(x) = w^T x \geq 0$, where $w$ is given by minimizing the primal SVM objective $\frac{1}{2}\|w\|^2 + C \sum_i \ell_H(y_i, f_l(x_i))$. More generally, given an arbitrary PSD kernel $k(\cdot, \cdot)$, the kernelized SVM classifier is $f_k(x) = \sum_i \alpha_i k(x, x_i) \geq 0$, where the $\alpha_i$ are learnt by minimizing the dual SVM objective $\frac{1}{2}\alpha^T Y K Y \alpha - \|\alpha\|_1$ subject to $0 \leq \alpha_i \leq C$ and $\alpha^T y = 0$, where $Y = \mathrm{diag}(y)$. The need for positiveness of $k(\cdot, \cdot)$ is evident in the dual SVM objective, where the quadratic regularizing term depends on the eigenvalues of $K_{ij} = k(x_i, x_j)$. For indefinite $k(\cdot, \cdot)$, the problem becomes non-convex, and the inner products need to be re-defined, as there is no RKHS associated with indefinite similarity measures. Various workarounds for indefinite similarity measures exist, most of which involve an expensive eigendecomposition of the Gram matrix [3]. A PSD kernel can be learnt from the similarity matrix under some constraints, e.g. being close to the similarity matrix, where closeness is usually measured by the Frobenius norm. For the Frobenius norm, the closed form solution is spectrum clipping, namely setting the negative eigenvalues of the Gram matrix to 0 [3]. As pointed out in [4], there is no guarantee that the resulting PSD kernels are optimal for classification. Nevertheless, jointly optimizing for a PSD kernel and the classifier [4] is impractical for large scale scenarios. We do not go into the details of possible re-formulations regarding indefinite similarity measures, but refer the reader to [19, 13, 3] for more information. Linear and kernelized SVM have very different properties. Linear SVM has a training cost of $O(d_x n)$ and a testing cost of $O(d_x)$, where $d_x$ is the dimensionality of $x$. Kernelized SVM has a training complexity of $O(d_k n n_{sv} + n_{sv}^3)$ [15], where $d_k$ is the cost of evaluating the kernel for one pair of data points and $n_{sv}$ is the number of resulting support vectors. The testing cost of kernelized SVM is $O(d_k n_{sv})$.
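The test-time cost $O(d_k n_{sv})$ can be made concrete with a small sketch. The code below is illustrative only (scikit-learn on synthetic data, not the paper's setup): it fits an RBF-kernel SVM from a precomputed Gram matrix and then evaluates the decision function using similarities to the $n_{sv}$ support vectors alone, rather than to all $n$ training samples.

```python
import numpy as np
from sklearn.svm import SVC

def rbf_gram(A, B, gamma=1.0):
    """Gaussian RBF Gram matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=200) > 0, 1, -1)

K = rbf_gram(X, X)                          # n x n Gram matrix
clf = SVC(C=1.0, kernel="precomputed").fit(K, y)

# Test-time cost is O(d_k * n_sv): only kernel evaluations against the
# support vectors are required, not against all n training samples.
X_test = rng.normal(size=(10, 5))
K_sv = rbf_gram(X_test, X[clf.support_])    # n_test x n_sv block
scores = K_sv @ clf.dual_coef_.ravel() + clf.intercept_
```

The manually computed `scores` coincide with the full decision function, which is what makes restricting evaluation to the support vectors a valid speedup.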
Therefore, a significant body of research has been dedicated to reducing the training and test costs of kernelized SVM by approximating the original problem.

2.2 Speeding up Kernelized SVM

A common approach for approximating the kernelized SVM problem is to restrict the feature mapping of $w$: $\psi_{\mathcal{H}}(w) \approx \psi_R(w) = \sum_{j=1}^{J} \beta_j \psi_{\mathcal{H}}(z_j)$ with $J < n$. Methods in this direction either learn synthetic samples $z_j$ [24] or restrict the $z_j$ to lie on the training data [15]. These methods essentially exploit low rank approximations of the Gram matrix $K$. Low rank approximations of a positive definite $K \succ 0$ result in speedups in the training and testing complexities of kernelized SVM. Methods that learn basis coordinates outside the training data, e.g. [24], usually involve intermediate optimization overheads and are thus prohibitive in large scale scenarios. On the contrary, the Nyström method gives a low rank PSD approximation to $K$ at a very low cost. The Nyström method [23] approximates $K$ using a randomly selected subset of the data:

$$K \approx K_{nm} K_{mm}^{-1} K_{mn} \qquad (1)$$

where $K_{ab}$ refers to a sub-matrix of $K = K_{nn}$ indexed by $a = (a_1, \dots, a_n)^T$, $a_i \in \{0, 1\}$, and similarly by $b$. The approximation (1) is derived by defining eigenfunctions of $k(\cdot, \cdot)$ as expansions of the numerical eigenvectors of $K$. A consequence is that the data can be embedded in a Euclidean space: $K \approx \Psi_{mn}^T \Psi_{mn}$, where $\Psi_{mn}$, the Nyström feature space, is

$$\Psi_{mn} = K_{mm}^{-\frac{1}{2}} K_{mn} \qquad (2)$$

Methods exist which explicitly or implicitly exploit this, e.g. [14], to reduce both the training and test costs by restricting the support vectors to be a subset of the bases defined by $m$. In case of indefinite similarity measures, $K_{mm}^{-\frac{1}{2}}$ in (2) will not be real. In the rest of the paper, we denote an indefinite version of a similarity matrix $K$ by $\tilde{K}$, and refer to the normalization by $K_{mm}^{-\frac{1}{2}}$ as the Nyström normalization.
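A small numerical sketch of the embedding (2) may help. The code below is illustrative, not the authors' implementation: it applies a spectrum-clip / pseudo-inverse guard to the landmark block so that the embedding stays real even when that block is indefinite, anticipating the spectrum fixes discussed next.

```python
import numpy as np

def nystrom_features(K_mm, K_mn, tol=1e-8):
    """Nystrom embedding Psi_mn = K_mm^(-1/2) K_mn of eq. (2).
    K_mm^(-1/2) is real only for a PSD K_mm; negative (and numerically
    tiny) eigenvalues are therefore clipped and pseudo-inverted, which
    amounts to the spectrum-clip fix on the landmark block."""
    w, V = np.linalg.eigh((K_mm + K_mm.T) / 2)           # symmetrize, decompose
    inv_sqrt_w = np.where(w > tol, w, np.inf) ** -0.5    # 1/sqrt(w), or 0 if clipped
    return (V * inv_sqrt_w) @ V.T @ K_mn                 # |m| x n feature matrix

# Toy check: linear kernel on rank-5 data with 20 random landmarks,
# where the Nystrom reconstruction K ~ Psi^T Psi is (numerically) exact.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
K = X @ X.T
m = rng.choice(100, size=20, replace=False)
Psi = nystrom_features(K[np.ix_(m, m)], K[m, :])
K_approx = Psi.T @ Psi                                   # eq. (1)
```

The reconstruction is exact here because the landmark block has the same rank as $K$; for an indefinite input the function still returns a real embedding, whose Gram matrix equals the spectrum-clipped similarity.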
In order to get a PSD approximation to an indefinite $\tilde{K}$, the indefinite $\tilde{K}_{mm}$ in (1) needs to be made PSD. Spectrum clip, spectrum flip, spectrum shift, and spectrum square are possible solutions based on the eigendecomposition of $\tilde{K}_{mm}$. The latter can be achieved without the eigendecomposition step: $\tilde{K}_{mm}^T \tilde{K}_{mm} \succeq 0$. If the goal is to find the PSD matrix closest to the original indefinite $\tilde{K}$ with respect to the reduced basis set $m$, spectrum clip gives the closed form solution. Therefore, when there are few negative eigenvalues, the spectrum clip technique gives good low rank approximations to $\tilde{K}_{mm}$, which can be used in (1) to obtain a low rank PSD approximation to $\tilde{K}$. However, when there are a considerable number of negative eigenvalues, as is the case with most of the similarity measures we consider later in section 2.4, there is no guarantee that the resulting PSD matrix is optimal for classification. This is true especially when the eigenvectors associated with negative eigenvalues contain discriminative information. We experimentally verify in section 3.3 that the negative eigenvalues do contain discriminative information. We seek normalizations that do not assume a PSD $K_{mm}$ and do not require eigendecompositions. For example, one can replace $K_{mm}$ in (2) with the covariance of the columns of $K_{mn}$. We experimentally found that a simple embedding, presented in the next section in (4), is competitive with the Nyström embedding (2) for PSD similarity measures, while outperforming it for the indefinite ones that we studied.

2.3 Basis Expanding SVM

Basis Expanding SVM (BE-SVM) is a linear SVM classifier equipped with a normalization of the following explicit feature map:

$$\tilde{\varphi}(x) = [s(b_1, x), \dots, s(b_B, x)]^T \qquad (3)$$

where $\mathcal{B} = \{b_1, \dots, b_B\}$ is an ordered basis set (assumed given for the moment; we experiment with different basis selection strategies in section 3.4) which is a subset of the training data, and $s(\cdot, \cdot)$ is a pairwise similarity measure.
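The map (3), together with the centering and average-norm scaling that BE-SVM applies to it (eq. (4), given next), can be sketched as follows. This is illustrative code under placeholder assumptions: the linear similarity, the basis choice, and all sizes are arbitrary.

```python
import numpy as np

def besvm_features(S_bx, mean=None, scale=None):
    """BE-SVM representation. Column j of S_bx is phi~(x_j), the vector of
    similarities s(b_i, x_j) of sample x_j to the B bases (eq. (3)).
    Columns are centered by the training mean of phi~ and divided by the
    average l2 norm of the centered training columns (eq. (4))."""
    if mean is None:                                   # fit statistics on training data
        mean = S_bx.mean(axis=1, keepdims=True)        # E_X[phi~]
        scale = np.linalg.norm(S_bx - mean, axis=0).mean()
    return (S_bx - mean) / scale, mean, scale

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
B = X[:20]                                 # bases: a subset of the training data
S = B @ X.T                                # s(.,.) chosen as a linear similarity here
Phi, mu, sc = besvm_features(S)            # 20 x 200 training feature matrix
```

The columns of `Phi` (one per sample) can then be fed to any linear SVM solver; test samples reuse `mu` and `sc` fitted on the training similarities. After this normalization the average column norm is exactly 1.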
The BE-SVM feature space is defined by

$$\varphi(x) = \frac{1}{E_X\left[\|\tilde{\varphi} - E_X[\tilde{\varphi}]\|\right]}\left(\tilde{\varphi}(x) - E_X[\tilde{\varphi}]\right) \qquad (4)$$

and is similar to the Nyström feature space (2) with a different normalization scheme, as pointed out in section 2.2. The centralization of $\tilde{\varphi}(\cdot)$ better conditions $\varphi(\cdot)$ for a linear SVM solver, and the normalization by the average $\ell_2$ norm is most useful when combining multiple similarity measures. The BE-SVM classifier is

$$f_B(x) = w^T \varphi(x) \geq 0 \qquad (5)$$

where $w$ is found by minimizing the primal BE-SVM objective

$$\frac{1}{2}\|w\|_2^2 + C \sum_i \ell_H(y_i, f_B(x_i))^2 \qquad (6)$$

An $\ell_1$ regularizer results in sparser solutions, but at the cost of a more expensive optimization than with an $\ell_2$ regularizer. Therefore, for large scale scenarios, an $\ell_2$ regularizer combined with a reduced basis set $\mathcal{B}$ is preferred to an $\ell_1$ regularizer combined with a larger basis set. Using multiple similarity measures is straightforward in BE-SVM. The concatenated feature map $\varphi_M(x) = \left[\varphi^{(1)}(x)^T, \dots, \varphi^{(M)}(x)^T\right]^T$ encodes the values of the $M$ similarity measures evaluated on the corresponding bases $\mathcal{B}^{(1)}, \dots, \mathcal{B}^{(M)}$. In this work, we restrict the study to the case where the bases are shared among the $M$ similarity measures, i.e. $\mathcal{B}^{(1)} = \dots = \mathcal{B}^{(M)}$.

2.4 Indefinite Similarity Measures for Visual Recognition

The lack of expressibility of PSD kernels has been argued before, e.g. in [3, 4, 21]. For example, similarity measures which are not based on vectorial representations of data are most likely to be indefinite. Particularly in computer vision, considering latent information results in the lack of a fixed vectorial representation of instances, and therefore similarity measures based on latent information are most likely to be indefinite. A few applications of indefinite similarity measures in computer vision are pointed out below.
[6] proposes (indefinite) jitter kernels for building desired invariances into classification problems. [1] uses indefinite pairwise similarity measures with latent positions of objects for clustering. [16] considers deformation models for image matching. [7] defines an indefinite similarity measure based on explicit correspondences between pairs of images for image classification. In this work, we consider similarity measures with latent deformations:

$$s(x_i, x_j) = \max_{z_i \in \mathcal{Z}(x_i),\, z_j \in \mathcal{Z}(x_j)} \left[ K_I\left(\phi(x_i, z_i), \phi(x_j, z_j)\right) + R(z_i) + R(z_j) \right] \qquad (7)$$

where $K_I(\cdot, \cdot)$ is a similarity measure (potentially a PD kernel), $\phi(x, z)$ is a representation of $x$ given the latent variable $z$, $R(z)$ is a regularization term on the latent variable $z$, and $\mathcal{Z}(x)$ is the set of possible latent variables associated with $x$. Specifically, when $R(\cdot) = 0$ and $\mathcal{Z}(x)$ involves latent positions, the similarity measure becomes similar to that of [1]. When $R(\cdot) = 0$ and $\mathcal{Z}(x)$ involves latent positions and local deformations, it becomes similar to the zero order model of [16]. Finally, an MRF prior in combination with latent positions and local deformations gives a similarity measure similar to that of [7]. The proposed similarity measure (7) picks the latent variables which have the maximal (regularized) similarity values $K_I(\cdot, \cdot)$. This is in contrast to [6], where the latent variables were suggested to be those which minimize a metric distance based on the kernel $K_I(\cdot, \cdot)$. The advantage of a metric based latent variable selection is not clear, and some works argue against unnecessary restrictions to metrics [21]. Also, if $K_I(\cdot, \cdot)$ is not PSD, deriving a metric from it is at best expensive. Therefore, the latent variables in (7) are selected according to similarity values instead of metric distances.

2.5 Multi Class Classification

SVMs are mostly known as binary classifiers.
Two popular extensions to multi-class problems are one-vs-rest (1vR) and one-vs-one (1v1). These two simple extensions have been argued to perform as well as more sophisticated formulations [20]. In particular, [20] concludes that for kernelized SVMs the two are competitive in terms of accuracy, while in terms of training and testing complexities 1v1 is superior. Therefore, we only consider the 1v1 approach for kernelized SVM. For linear SVMs, however, 1v1 results in unnecessary overhead and 1vR is the algorithm of choice. A 1vR BE-SVM, where bases from all classes are used in each of the binary classifiers, can be expected to be both faster and to generalize better than a 1v1 BE-SVM. A 1v1 BE-SVM, where only bases from the two classes under consideration are used in each binary classifier, has a clear advantage in terms of training complexity; however, due to the reduction in the size of the basis set, it generalizes less well than a 1vR approach. Therefore, we only consider the 1vR formulation for BE-SVM. (Note that [25] and similar approaches use a PD kernel on a fixed vectorial representation of the data, given the latent information. The latent information in turn is updated using an alternating minimization approach. This makes the optimization non-convex, and differs from similarity measures which directly model latent information.)

Table 1: Complexity analysis for kernelized SVM (KSVM) and BE-SVM. The number of samples for each of the C classes is assumed equal to n/C. M is the number of kernels/similarity measures, M·d̄_φ is the dimensionality of the representations required for evaluating the M kernels/similarity measures, and M·d̄_K is the cost of evaluating all M kernels/similarity measures.

              Training                                      Testing (per sample)
              Memory              Computation               Memory          Computation
  KSVM        nM d̄_φ + n²/C       n²M d̄_K + n³/C            nM d̄_φ          (n/C) M d̄_K
  BE-SVM      nM d̄_φ + nM|B|      (n/C)|B| M d̄_K            |B| M d̄_φ       (|B|/C) M d̄_K
Table 1 summarizes the memory and computational complexity analysis for 1v1 kernelized SVM and 1vR BE-SVM. Shown are upper bound complexities, where we have taken $n$ as the upper bound on $n_{sv}$.

2.6 Margin Analysis of Basis Expanding SVM

Both kernelized SVM and BE-SVM are max margin classifiers in their feature spaces. The feature space of kernelized SVM, $\psi_{\mathcal{H}}(\cdot)$, is implicitly defined via the kernel function $k(\cdot, \cdot)$, while the feature space of BE-SVM is explicitly defined via empirical kernel maps. In order to derive the margin as a function of the data, we first need to derive the dual BE-SVM objective, where we assume a non-squared Hinge loss and unnormalized feature mappings $\tilde{\varphi}(\cdot)$. Borrowing from the representer theorem and considering the KKT conditions of the primal, one can derive $w = \sum_i y_i \beta_i \tilde{\varphi}(x_i)$, and consequently derive the BE-SVM dual objective, which is similar to the dual SVM objective but with $K_{ij} = \tilde{\varphi}(x_i)^T \tilde{\varphi}(x_j)$. Let $S_{\mathcal{B}X}$ refer to the similarity of the data to the bases. Given the optimal dual variables $0 \leq \beta_i \leq C$, the margin of BE-SVM is

$$M_{BE}(\beta) = \left(\beta^T Y S_{\mathcal{B}X}^T S_{\mathcal{B}X} Y \beta\right)^{-1} \qquad (8)$$

as opposed to that of the kernelized SVM, given the optimal dual variables $0 \leq \alpha_i \leq C$:

$$M_K(\alpha) = \left(\alpha^T Y K Y \alpha\right)^{-1} \qquad (9)$$

For comparison, the margin of the Nyströmized method is

$$M_N(\alpha) = \left(\alpha^T Y K_{X\mathcal{B}} K_{\mathcal{B}\mathcal{B}}^{-1} K_{\mathcal{B}X} Y \alpha\right)^{-1} \qquad (10)$$

Furthermore, $S_{\mathcal{B}X}^T S_{\mathcal{B}X}$ is PSD, and that is BE-SVM's workaround for using indefinite similarity measures.

BE-SVM vs kernelized SVM: When $s(\cdot, \cdot) = k(\cdot, \cdot)$ and all training exemplars are used as bases, the margin of BE-SVM becomes $\left(\beta^T Y K^2 Y \beta\right)^{-1}$. Comparing to the margin of SVM, for the same parameter $C$ and the same kernel, the solution (and thus the margin) of BE-SVM is driven even more by large eigenpairs, and even less by small ones. It is straightforward to verify that $K^2 = \sum_i \lambda_i^2 v_i v_i^T$. Therefore, the contribution of large eigenpairs, i.e. $\{(\lambda_i, v_i) \mid \lambda_i > 1\}$, to $K^2$ is amplified. Similarly, the contribution of small eigenpairs, those with $\lambda_i < 1$, to $K^2$ is dampened.

Figure 1: Demonstration of kernelized SVM and BE-SVM using two Gaussian RBF kernels with $\gamma_1 = 10$, $\gamma_2 = 10^2$ and $C = 10$. (a) Kernelized SVM, based on equally weighted kernels (SVs: 224.9, CV accuracy: 84.9). (b) BE-SVM dual objective, without normalization (SVs: 170.4, CV accuracy: 84.7). (c) BE-SVM primal objective, with normalization, on 10% of the data randomly selected as bases (SVs: 65.0, CV accuracy: 85.4). 10 fold cross validation accuracy and the number of support vectors are averaged over $i = 1{:}20$ scenarios based on the same problem but with different spatial noise. The noise model for the $i$th scenario is a zero mean Gaussian with $\sigma_i = 10^{-2} i$. The visualization is on the noiseless data for clarity. Best viewed electronically.

BE-SVM vs Nyströmized method: When $s(\cdot, \cdot) = k(\cdot, \cdot)$ and a subset of the training exemplars is used as bases (the reduced setting), the resulting margin of BE-SVM is $\left(\beta^T Y K_{X\mathcal{B}} K_{\mathcal{B}X} Y \beta\right)^{-1}$. Comparing to the margin of the Nyströmized method, most of the difference between the Nyströmized method and BE-SVM is the normalization by $K_{\mathcal{B}\mathcal{B}}^{-1}$.
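The eigenvalue-reweighting claim about $K^2$ is easy to check numerically; a small illustrative sketch on a random symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 30))
K = (A + A.T) / 2                        # a symmetric (possibly indefinite) matrix
lam, V = np.linalg.eigh(K)

# K^2 shares K's eigenvectors, with every eigenvalue squared:
K2 = sum(l**2 * np.outer(v, v) for l, v in zip(lam, V.T))
assert np.allclose(K2, K @ K)

# Eigenpairs with |lambda| > 1 are amplified in K^2, those with
# |lambda| < 1 are dampened, matching the margin discussion above.
assert all(l**2 > abs(l) for l in lam if abs(l) > 1)
assert all(l**2 < abs(l) for l in lam if 0 < abs(l) < 1)
```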
For covariance kernels, for which the Nyströmized method is most suitable, $K_{\mathcal{B}\mathcal{B}}$ is the covariance of the basis set in the feature space. Therefore, it can be said that the normalization by $K_{\mathcal{B}\mathcal{B}}^{-1}$ essentially de-correlates the bases in the feature space. Although this is an appealing property, there is no RKHS associated with indefinite similarity measures, and the de-correlation in such cases is non-trivial. In case of covariance kernels, it can be said that BE-SVM assumes un-correlated bases, while bases are always correlated in the feature space. As larger sets of bases usually result in more (non-diagonal) covariance, the un-correlated assumption is more strongly violated with large basis sets. The consequence is that in such cases, i.e. covariance kernels with large basis sets, BE-SVM can be expected to perform worse than the Nyströmized method. However, for sufficiently small basis sets, or in case of indefinite similarity measures, there is no reason for the superiority of the Nyströmized method. In such cases, and in practice, BE-SVM is competitive with or better than the Nyströmized method.

2.6.1 Demonstration on 2D Toy Data

Figure 1 visualizes the use of multiple Gaussian RBF kernels in BE-SVM and kernelized SVM. We point out the following observations. 1) The dual objective of BE-SVM (exact) tends to result in sparser solutions, as measured by non-zero support vector coefficients (compare Figure 1a with 1b). We believe the main reason for this to be the modification of the eigenvalues described in section 2.6. Note, however, that in order to classify a new sample, its similarity to all training data needs to be evaluated, irrespective of the sparsity of the BE-SVM solution (see equation (13)). In this sense, the BE-SVM dual objective results in completely dense solutions, similar to the primal BE-SVM objective without any basis reduction.
However, the solution can be made sparse by construction by reducing the basis set, as with the primal BE-SVM objective. We do not demonstrate this here, mainly because our focus is on the (approximate) primal objective. 2) Due to the definition of the (linear) kernel in BE-SVM (see equation (13)), the solution of BE-SVM has an inherent bias with respect to the (marginal) distribution of class labels. In other words, the contribution of each class to the norm of $\tilde{\varphi}(\cdot)$, and consequently to the value of $\tilde{K}(\cdot, \cdot)$, directly depends on the number of bases from each class. Consequently, the decision boundary of BE-SVM is shifted towards the class with fewer bases: compare the decision boundaries on the left sides of Figures 1a and 1b. In the experiments on the CIFAR-10 dataset, as the numbers of exemplars from different classes are roughly equal, this did not play a crucial role.

2.7 Related Work

There exists a body of work regarding the use of proximity data, similarity, or dissimilarity measures in classification problems. [18] uses similarity to a fixed set of samples as features for a kernel SVM classifier. [11] uses proximities to all the data as features for a linear SVM classifier. [12] uses proximities to all the data as features and proposes a linear program machine based on this representation. In contrast, we use a normalization of the similarity of points to a subset of the data as features for a (fast, approximate) linear SVM classifier.

3 Experiments

3.1 Dataset and Experimental Setup

We present our experimental results on the CIFAR-10 dataset [17]. The dataset is comprised of 60,000 tiny 32 × 32 RGB images, 6,000 images for each of the 10 classes, divided into 6 folds with unequal distribution of class labels per fold. The first 5 folds are used for training and the 6th fold is used for testing. We use a modified version of the HOG feature [5], described in [9].
For most of our experiments, we use HOG cell sizes of 8 and 4 pixels, which result in $31 \times (32/8)^2 = 496$ and $31 \times (32/4)^2 = 1984$ dimensional representations of each image. Due to the normalization of each HOG cell, namely normalizing by the gradient/contrast information of the neighboring cells, the HOG cells on the border of images are not normalized properly. We believe this to have a negative effect on the results, but as the aim of this paper is not to get the best possible results out of the model, we rely on the consistency of the normalization across all images to address this problem. A possible fix is to up-sample the images and ignore the HOG cells at the boundaries, but we do not provide results for such fixes. For all the experiments, we center the HOG feature vectors and scale them inversely by the average $\ell_2$ norm of the centered feature vectors, similar to the normalization of BE-SVM (4). This results in easier selection of the parameters $C$ and $\gamma$ for the SVM formulations. Unless stated otherwise, we fix $C = 2$ and $\gamma = 1$ for kernelized SVM with Gaussian RBF kernels, and $C = 1$ for the rest. We use LibLinear [8] to optimize the primal linear SVM objectives with squared Hinge loss, similar to (6). For kernelized SVM, we use LibSVM [2]. We report multi-class classification results (0-1 loss) on the test set, where we used a 1v1 formulation for kernelized SVM and a 1vR formulation for the other methods.

3.2 Baseline: SVM with Positive Definite Kernels

Figure 4a shows the performance of linear SVM (H4L and H8L) and kernelized SVM with a Gaussian RBF kernel (K4R and K8R) as a function of the number of parameters in the models. The number of parameters for linear SVM is the input dimensionality, and for kernelized SVM it is $n_{sv}(d_\phi + 1)$, where $d_\phi$ is the dimensionality of the feature vector the corresponding kernel operates on. The 5 numbers for each model are the results of the model trained on 1, ..., 5 folds of the training data (each fold contains 10,000 samples). Figure 4b shows the performance of kernelized SVM as a function of the number of support vectors when trained on 1, ..., 5 folds. Except for the linear SVM with a HOG cell size of 8 pixels (496 dimensions), whose performance saturates at 4 folds, all models consistently benefit from more training data.

3.3 BE-SVM with Invariant Similarity Measures

The general form of the invariant similarity measures we consider was given in (7). In particular, we consider rigid and deformable similarity measures where the smallest unit of deformation/translation is a HOG cell.

Figure 2: Performance of BE-SVM as a function of different similarity measures when trained on the first fold: (a) full bases; (b) basis selection with $B = 10 \times 100$. H4 (H8) refers to a HOG cell size of 4 (8) pixels, L and R refer to linear and Gaussian RBF kernels respectively, and $(h_R, h_L)$ refers to a similarity measure with $h_R$ rigid and $h_L$ local deformations (11), (12).

The rigid similarity measure models invariance to translations and is given by

$$K_R(x, y) = \max_{z_R \in \mathcal{Z}_R} \sum_{c \in \mathcal{C}} \phi_C(x, c)^T \phi_C(y, c + z_R) \qquad (11)$$

where $\mathcal{Z}_R = \{(z_x, z_y) \mid z_x, z_y \in \{-h_R, \dots, h_R\}\}$ allows a maximum of $h_R$ HOG cells of displacement in the $x, y$ directions, $\mathcal{C} = \{(x, y) \mid x, y \in \{1, \dots, h_H\}\}$ is the set of indices of the $h_H$ HOG cells in each direction, and $\phi_C(x, c)$ is the 31 dimensional HOG cell of $x$ located at position $c$. $\phi_C(x, c)$ is zero for cells outside $x$ (zero-padding). $K_R(x, y)$ is the maximal cross correlation between $\phi(x)$ and $\phi(y)$.
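A direct (unoptimized) sketch of $K_R$ on a grid of cells may clarify the zero-padded maximization. This is illustrative code: the grid size, cell dimensionality, and random inputs are placeholders, not the paper's data.

```python
import numpy as np

def rigid_similarity(phi_x, phi_y, h_R=1):
    """Rigid similarity K_R (eq. (11)): maximal cross correlation between two
    H x H x d grids of (HOG) cells over displacements of up to h_R cells,
    with zero padding for cells displaced outside the grid."""
    H = phi_x.shape[0]
    best = -np.inf
    for zx in range(-h_R, h_R + 1):
        for zy in range(-h_R, h_R + 1):
            pad = np.zeros_like(phi_y)
            # cells of phi_y translated by (zx, zy); out-of-grid cells stay zero
            x0, x1 = max(0, zx), min(H, H + zx)
            y0, y1 = max(0, zy), min(H, H + zy)
            pad[x0:x1, y0:y1] = phi_y[x0 - zx:x1 - zx, y0 - zy:y1 - zy]
            best = max(best, float((phi_x * pad).sum()))
    return best

rng = np.random.default_rng(0)
phi = rng.normal(size=(8, 8, 31))          # an 8x8 grid of 31-dimensional cells
shifted = np.zeros_like(phi)
shifted[1:, :, :] = phi[:-1, :, :]         # phi translated by one cell
# the invariant similarity realigns the shifted copy with the original:
assert rigid_similarity(phi, shifted, h_R=1) >= (phi * phi)[:-1].sum() - 1e-6
```

With `h_R = 0` the measure reduces to the plain inner product of the two grids; allowing displacements can only increase the value, which is what makes it translation invariant.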
The deformable similarity measure allows local deformations (displacements) of each of the HOG cells, in addition to invariance to rigid deformations:

$$K_L(x, y) = \max_{z_R \in \mathcal{Z}_R} \sum_{c \in \mathcal{C}} \max_{z_L \in \mathcal{Z}_L} \phi_C(x, c)^T \phi_C(y, c + z_R + z_L) \qquad (12)$$

where $\mathcal{Z}_L = \{(z_x, z_y) \mid z_x, z_y \in \{-h_L, \dots, h_L\}\}$ allows a maximum of $h_L$ HOG cells of local deformation for each of the HOG cells of $y$. We consider a maximum deformation of 8 pixels, e.g. 2 HOG cells for a HOG cell size of 4 pixels. Regularizing global or local deformations is straightforward in this formulation. However, we did not notice significant improvements for the set of displacements we considered, which is probably related to the small size of the latent set suitable for the small images in CIFAR-10. Figure 2a shows the performance of BE-SVM using different similarity measures when trained on the first fold. It can be seen that the invariant similarity measures improve recognition performance. In particular, in the absence of any other information, modelling rigid deformations (latent positions) seems to be much more beneficial than modelling local deformations. An interesting observation is that aligning the data at higher resolutions is much more crucial: all models (linear SVM, kernelized SVM, and BE-SVM) suffer performance losses when the resolution is increased from a HOG cell size of 8 pixels to 4 pixels. However, BE-SVM achieves significant performance gains by aligning the data at higher resolutions: compare H4L with H4(1,0) and H4(2,0), and H8L with H8(1,0). We tried training linear and kernelized SVM models by jittering the feature vectors in the same manner that the invariant similarity measures do (11), (12), that is, jittering the HOG cells with zero-padding for cells outside the images. This resulted in significant performance losses for both linear SVM and kernelized SVM, while also significantly increasing memory requirements and computation times.
We believe the reason for this to be boundary effects, which are also mentioned in previous work, e.g. [6]. We also believe that jittering the input images, in combination with some boundary heuristics (see section 3.1), would improve test performance (while significantly increasing training complexity), but we do not provide experimental results for such cases.

Table 2: Eigenvalue analysis of various similarity measures based on a HOG cell size of 4.

          H4(0,0)  H4(0,1)  H4(1,0)  H4(1,1)  H4(2,0)  H4(2,1)  CorNyst  CorBE
  NgRat   .00      .26      .18      .25      .16      .30      .20      .61
  NgEng   .00      .04      .05      .05      .04      .07      .33      .73

3.4 Basis Selection

Figure 2b shows the accuracy of BE-SVM using different similarity measures and different basis selection strategies, for a basis size of $B = 10 \times 100$ exemplars. In the figure, 'Rand' refers to a random selection of the bases, 'Indx' refers to selection of samples according to their indices, 'K KMed' refers to a kernel k-medoids approach based on the similarity measure, and 'Nystrom' refers to selection of bases similar to the 'Indx' approach, but with the Nyström normalization, using a spectrum clip fix for indefinite similarity measures (see section 2.2). The reported results for the 'Rand' method are averaged over 5 trials; the variance was not significant. It can be observed that all methods except 'Nystrom' result in similar performance. We also tried other, more sophisticated sample selection criteria, but observed similar behaviour. We attribute this to the little variation in the quality of exemplars in the CIFAR-10 dataset. Having observed this, for the rest of the sub-sampling strategies we do not average over multiple random basis selection trials, but rather use the deterministic 'Indx' approach. The difference between the normalization factors in BE-SVM and the Nyström method (see section 2.2) is evident in the figure. The BE-SVM normalization tends to be consistently superior in case of indefinite similarity measures.
For PSD kernels (H4L, H8L, H4R, and H8R), the Nyström normalization tends to be better at lower resolutions (H8) and worse at higher resolutions (H4). We believe the main reason for this is the lack of significant similarity between bases at higher resolutions in the absence of any alignment. In such cases, the low-rank assumption on K [23] is violated, and normalization by a diagonally dominant K_mm will not capture any useful information.

In order to analyze how the performance of BE-SVM depends on the eigenvalues of the similarity measures, we provide the following eigenvalue analysis. We compute the similarity of the bases to themselves – corresponding to K_mm in (2) – and perform an eigendecomposition of the resulting matrix. Table 2 shows the ratio of negative eigenvalues,

\text{NgRat} = \frac{1}{B} \sum_i [\lambda_i < 0],

and the relative energy of the negative eigenvalues,

\text{NgEng} = \frac{\sum_i |\lambda_i| [\lambda_i < 0]}{\sum_i |\lambda_i| [\lambda_i > 0]},

as a function of various similarity measures for B = 10 × 100 and a HOG cell size of 4. The last two columns, namely 'CorNyst' and 'CorBE', reflect the correlation of the measured quantities – 'NgRat' and 'NgEng' – with the observed performance of BE-SVM using the Nyström normalization and the BE-SVM normalization, respectively. We used Pearson's r to measure the extent of linear dependence between the test performances and the different normalization schemes. It can be observed that: 1) both normalization schemes have a positive correlation with both the ratio of negative eigenvalues and their relative energy, and 2) the BE-SVM normalization correlates more strongly with the measured quantities. From this, we conclude that negative eigenvectors contain discriminative information and that BE-SVM's normalization is more suitable for indefinite similarity measures. We also experimented with spectrum flip and spectrum square methods for the Nyström normalization, but they generally provided slightly worse results in comparison to the spectrum clip technique.
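The eigenvalue statistics reported in Table 2 can be computed directly from the basis similarity matrix; a minimal sketch (function name ours):

```python
import numpy as np

def negative_spectrum_stats(K_mm):
    """Compute 'NgRat' and 'NgEng' of Table 2 for a basis
    similarity matrix K_mm (possibly indefinite)."""
    # Symmetrize before the eigendecomposition to guard against
    # small asymmetries in the computed similarities.
    lam = np.linalg.eigvalsh((K_mm + K_mm.T) / 2.0)
    ng_rat = np.mean(lam < 0)        # ratio of negative eigenvalues
    ng_eng = np.abs(lam[lam < 0]).sum() / np.abs(lam[lam > 0]).sum()
    return ng_rat, ng_eng
```

For a PSD matrix both statistics are zero; larger values indicate a more strongly indefinite similarity measure.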
3.5 Multiple Similarity Measures

Different similarity measures contain complementary information. Fortunately, BE-SVM can make use of multiple similarity measures by construction. To demonstrate this, using one fold of training data and B = 10 × 50, we greedily – in an incremental way – augmented the similarity measures with the most contributing ones. Using this approach, we found two (ordered) sets of similarity measures with complementary information: 1) a low-resolution set M_1 = {H8R, H8(1,0), H8(0,1)}, and 2) a two-resolution set M_2 = {H8R, H4(2,0), H4(0,1), H8(1,0)}. Surprisingly, the two-resolution sequence resembles those of the part-based models [9] and multi-resolution rigid models [1], in that the information is processed at two levels: a coarser rigid 'root' level and a finer-scale deformable level.

Figure 3: Performance of BE-SVM using multiple similarity measures (H8R; H8R+H8(1,0); H8R+H8(1,0)+H8(0,1); H8R+H4(2,0); H8R+H4(2,0)+H4(0,1)+H8(1,0)) for various sizes of the basis set B = 10 × {25, ..., 500}. Results with dotted, dashed, and solid lines represent 1, 3, and 5 folds worth of training data. See text for analysis.

We then trained BE-SVM models using these similarity measures for various sizes of the basis set, and for various sizes of training data. Figures 4a and 4b show these results, where the BE-SVM models are trained on all 5 folds. The shown number of supporting exemplars (and consequently the number of parameters) for BE-SVM is based on the size of the basis set. It can be seen that using a basis size of B = 10 × 250, the performance of BE-SVM using more than 3 two-resolution similarity measures surpasses that of the kernelized SVM trained on all the data and based on approximately B = 10 × 4000 support vectors.
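The greedy, incremental augmentation of similarity measures described above can be sketched as follows. This is a hedged sketch: `train_and_score` is a hypothetical stand-in for training BE-SVM on the stacked empirical kernel maps of the chosen measures and reporting validation accuracy; the names and the stopping rule are our assumptions.

```python
def greedy_augment(candidates, train_and_score, max_measures=4):
    """Greedy forward selection of similarity measures.

    candidates : list of similarity-measure identifiers, e.g.
                 ["H8R", "H8(1,0)", "H8(0,1)", "H4(2,0)", "H4(0,1)"]
    train_and_score : callable mapping a list of measures to a
                      validation score (hypothetical stand-in for
                      training BE-SVM on their stacked kernel maps).
    """
    selected, remaining = [], list(candidates)
    best_score = -float("inf")
    while remaining and len(selected) < max_measures:
        # Score every one-measure extension of the current set.
        scored = [(train_and_score(selected + [m]), m) for m in remaining]
        score, best_m = max(scored)
        if score <= best_score:       # stop when nothing contributes
            break
        best_score = score
        selected.append(best_m)
        remaining.remove(best_m)
    return selected
```

The returned list is ordered by contribution, matching the ordered sets M_1 and M_2 reported above.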
Using low-resolution similarity measures, B = 10 × 500 outperforms kernelized SVMs trained on up to 4 folds of the training data. Furthermore, it can be observed that for the same model complexity, as measured either by the number of supporting exemplars or by the number of model parameters, BE-SVM performs better than kernelized SVM.

Figure 3 shows the performance of BE-SVM using different similarity measures for various basis sizes and for different training set sizes. It can be observed that using (invariant) indefinite similarity measures can significantly increase the performance of the model: compare the red curve with any other curve with the same line style. For example, using all the training data, the two-resolution deformable approach results in 8-10% improvement in accuracy in comparison to the best performing PSD kernel (H8R). Furthermore, the two-resolution approach outperforms the single-resolution approach by approximately 3-4% accuracy (compare blue and black curves with the same line style). Measured by model parameters, BE-SVM is roughly 8 times sparser than kernelized SVM for the same accuracy. Measured by supporting exemplars, its sparsity factor increases to roughly 30. We need to point out that different similarity measures have different complexities, e.g. H8(1,0) is more expensive to evaluate than H8R. However, when the bases are shared between different similarity measures, the CPU cache can be utilized much more efficiently, as there will be fewer memory accesses and more (cached) computations.

3.5.1 Multiple Kernel Learning with PSD Kernels

We tried Multiple Kernel Learning (MKL) for kernelized SVM with PSD kernels. Compared to sophisticated MKL methods, we found the following procedure to give competitive performances with much lower training costs. Defining K_C(·, ·) = αK_1(·, ·) + (1 − α)K_2(·, ·), our MKL approach consists of performing a line search for the α ∈ {0, 0.1, . . . , 1} which results in the best 5-fold cross-validation performance.

Figure 4: Performance of BE-SVM vs. model complexity for various sizes of the basis set, using multiple similarity measures: (a) parameters vs. performance, (b) supporting exemplars vs. performance. Each curve for linear SVM (H4L, H8L) and kernelized SVM (K4R, K8R) represents the result of training on 1, . . . , 5 folds of training data. Each curve for BE-SVM shows the result of training the model with a basis set of size B = 10 × {25, 50, 100, 250, 500} on 5 folds of the training data.

Using this procedure, linear kernels were found not to contribute anything to Gaussian RBF kernels. The optimal combination of the high-resolution and low-resolution Gaussian RBF kernels (K4R and K8R) resulted in a performance gain of less than 0.5% accuracy in comparison to K8R. We found this insignificant and did not report its performance, considering the fact that the number of parameters increases approximately 4-fold using this approach.

3.6 BE-SVM's Normalizations

It can be verified that in the case of unnormalized features \tilde{\varphi}^{(m)}(\cdot), the corresponding Gram matrix will be

\tilde{K}(x_i, x_j) = \sum_{m=1}^{M} \tilde{K}^{(m)}(x_i, x_j) = \sum_{m=1}^{M} \sum_{b=1}^{B} s^{(m)}(b_b, x_i)\, s^{(m)}(b_b, x_j)    (13)

where the \tilde{K}^{(m)} are combined with equal weights, the value of each of which depends (locally) on how the similarities of x_i and x_j correlate with respect to the bases.
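The unnormalized empirical kernel map, and the Gram matrix of (13) that its inner products induce, can be sketched as follows (a minimal sketch; the function name is ours):

```python
import numpy as np

def empirical_kernel_map(x, bases, measures):
    """Unnormalized map: concatenate the similarities of x to each of
    the B bases under each of the M similarity measures s^(m)."""
    return np.concatenate([[s(b, x) for b in bases] for s in measures])

# The inner product of two maps recovers the Gram matrix of Eq. (13):
#   K~(x_i, x_j) = sum_m sum_b s^(m)(b_b, x_i) s^(m)(b_b, x_j)
```

Because the map is explicit, a linear SVM trained on these features plays the role of the kernelized classifier, regardless of whether the measures s^(m) are PSD.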
In the case of normalized features, the centered values of each similarity measure are weighted by (E_X[\| \tilde{\varphi} - E_X[\tilde{\varphi}] \|])^{-2}, i.e. more global weight is put on (the centered values of) the similarity measures with smaller variances in similarity values. While BE-SVM's normalization of empirical kernel maps is not optimal for discrimination, it can be seen as a reasonable prior for combining different similarity measures. Utilizing such a prior, in combination with linear classifiers and ℓ_P regularizers, has two important consequences: 1) the centering helps reduce the correlation between dimensions, and the scaling helps balance the effect of regularization on the different similarity measures, irrespective of their overall norms; and 2) such a scaling directly affects parameter tuning when learning the linear classifiers: for all the similarity measures (and combinations of similarity measures) with various basis sizes, the same parameter, C = 1, was used to train the classifiers. While cross-validation would still be a better option, cross-validating over different parameter settings – especially when combining multiple similarity measures – would be very expensive and prohibitive. By using BE-SVM's normalization, we essentially avoid searching for optimal combination weights for the different similarity measures, as well as tuning the C parameter of the linear SVM training.

In this section, we quantitatively evaluate the normalization suggested for BE-SVM (4) and compare it to a few other combinations. In particular, we consider various normalizations of the HOG feature vectors and, similarly, various normalization schemes for the empirical kernel map \tilde{\varphi} (3).
We consider the following normalizations:

Figure 5: Performance of BE-SVM for different normalization schemes of the feature vector and the empirical kernel map, and different similarity measures (H4L, H4R, H4(2,0), H8L, H8R, H8(1,0)), for C = 1 (top) and cross-validated C (bottom), with B = 10 × 100 and 1 fold of training data. "F + K (P)" in the legend reflects using the F and K normalization schemes for the feature vectors and the empirical kernel maps respectively, which results in the average test performance of P (averaged over the similarity measures).

Figure 6: Performance of BE-SVM for different normalization schemes of the feature vector and the empirical kernel map, and different combinations of similarity measures.
"F + K (P)" in the legend reflects using the F and K normalization schemes for the feature vectors and the empirical kernel maps respectively, which results in the average test performance of P (averaged over the combinations of similarity measures).

• No normalization (Unnorm)
• Z-scoring, namely centering and scaling each dimension by the inverse of its standard deviation (Z-Score)
• BE-SVM normalization, namely centering and scaling all dimensions by the inverse average ℓ_2 norm of the centered vectors (BE-SVM)

We report test performances for all combinations of normalizations of the feature vectors and the empirical kernel maps, for two cases: 1) when C = 1, and 2) when the C parameter is cross-validated from C ∈ {10^{-1}, 10^{0}, 10^{1}}. In both cases, |B| = 10 × 100 bases were uniformly sub-sampled from the first fold of the training set ('Indx' basis selection).

Figure 5 shows the performance of BE-SVM in combination with different normalizations of the feature vectors and empirical kernel maps, for different similarity measures. On top, the reported numbers are for C = 1, while on the bottom, C is cross-validated. It can be observed that BE-SVM's normalization works best for both the feature and the empirical kernel map normalizations. Although z-scoring is more suitable for linear similarity measures (compare BE-SVM + BE-SVM with Z-Score + BE-SVM in H4L, H8L, H4(x,y) and H8(x,y)), overall BE-SVM's normalization of the feature space works better than the alternatives. In particular, in the single similarity measure cases, normalizing the features according to BE-SVM's normalization appears to be more important than normalizing the empirical kernel map. While cross-validation of the C parameter marginally affects the performance, it does not change the conclusions drawn from the C = 1 case.
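The three normalization schemes compared above can be sketched as follows. This is a minimal sketch with names of our choosing; statistics are estimated on the given matrix, whereas in practice they would be estimated on the training set and reused at test time.

```python
import numpy as np

def normalize(X, scheme):
    """Apply one of the three compared schemes to the rows of X
    (feature vectors or empirical kernel maps)."""
    if scheme == "Unnorm":
        return X
    Xc = X - X.mean(axis=0)                     # centering (both schemes)
    if scheme == "Z-Score":
        return Xc / Xc.std(axis=0)              # per-dimension scaling
    if scheme == "BE-SVM":
        # one global scale: inverse average l2 norm of the centered vectors
        return Xc / np.linalg.norm(Xc, axis=1).mean()
    raise ValueError(scheme)
```

Note the design difference: Z-Score rescales each dimension independently, while the BE-SVM scheme applies a single global scale, so the relative magnitudes of the dimensions (and hence of the different similarity measures) within one map are preserved while the average row norm becomes 1.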
Figure 6 shows the performance of BE-SVM in combination with different normalizations of the feature vectors and empirical kernel maps, for different combinations of similarity measures (the sequence of greedily augmented similarity measures M_2: the set of two-resolution similarity measures described in Section 3.5). It can be observed that BE-SVM's normalization of the kernel map is much more important and effective when combining multiple similarity measures (compare to Figure 5).

These observations quantitatively motivate the use of BE-SVM's normalization, with the following benefits, at least on the dataset we experimented on:

• It removes the need for cross-validation to tune the C parameter and the mixing weights of the different similarity measures.
• As the feature vector is centered and properly scaled, the linear SVM solver converges much faster than in the unnormalized case, or when C ≫ 1.
• It results in robust learning of BE-SVM, which can efficiently combine different similarity measures, i.e. RBF kernels (H8R) and linear deformable similarity measures (H4(2,0), H4(0,1), H8(1,0)).

4 Conclusion

We analyzed scalable approaches for using indefinite similarity measures in large margin scenarios. We showed that our model, based on an explicit basis expansion of the data according to arbitrary similarity measures, can achieve competitive recognition performance while scaling better with respect to the size of the data. The model, named Basis Expanding SVM, was thoroughly analyzed and extensively tested on the CIFAR-10 dataset.

In this study, we did not explore basis selection strategies, mainly due to the small intra-class variation of the dataset. We expect basis selection strategies to play a crucial role in the performance of the resulting model on more challenging datasets, e.g. Pascal VOC or ImageNet. Therefore, an immediate future work is to apply BE-SVM to larger scale and more challenging problems, e.g.
object detection, in combination with data-driven basis selection strategies.

References

[1] O. Aghazadeh, H. Azizpour, J. Sullivan, and S. Carlsson. Mixture component identification and learning for visual recognition. In European Conference on Computer Vision, 2012.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, pages 27:1–27:27, 2011.
[3] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Cazzanti. Similarity-based classification: Concepts and algorithms. Journal of Machine Learning Research, pages 747–776, 2009.
[4] Y. Chen, M. R. Gupta, and B. Recht. Learning kernels from indefinite similarities. In International Conference on Machine Learning, 2009.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[6] D. Decoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, pages 161–190, 2002.
[7] O. Duchenne, A. Joulin, and J. Ponce. A graph-matching kernel for object categorization. In IEEE International Conference on Computer Vision, 2011.
[8] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 2008.
[9] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1627–1645, 2010.
[10] Z. Fu, A. Robles-Kelly, and J. Zhou. Mixing linear SVMs for nonlinear classification. Neural Networks, pages 1963–1975, 2010.
[11] T. Graepel, R. Herbrich, P. Bollmann-Sdorra, and K. Obermayer. Classification on pairwise proximity data. In Neural Information Processing Systems, 1998.
[12] T. Graepel, R. Herbrich, B. Schölkopf, A. Smola, P. Bartlett, K. Müller, K. Obermayer, and R. Williamson. Classification on proximity data with LP-machines. In Neural Information Processing Systems, 1999.
[13] B. Haasdonk. Feature space interpretation of SVMs with indefinite kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005.
[14] Y.-J. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines. Technical report, Data Mining Institute, Computer Sciences Department, University of Wisconsin, 2001.
[15] S. S. Keerthi, O. Chapelle, and D. DeCoste. Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, pages 1493–1515, 2006.
[16] D. Keysers, T. Deselaers, C. Gollan, and H. Ney. Deformation models for image recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1422–1435, 2007.
[17] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[18] L. Liao and W. S. Noble. Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. Journal of Computational Biology, pages 857–868, 2003.
[19] C. S. Ong, X. Mary, S. Canu, and A. J. Smola. Learning with non-positive kernels. In International Conference on Machine Learning, 2004.
[20] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, pages 101–141, 2004.
[21] W. J. Scheirer, M. J. Wilber, M. Eckmann, and T. E. Boult. Good recognition is non-metric. CoRR, 2013.
[22] I. Steinwart. Sparseness of support vector machines – some asymptotically sharp bounds. In Neural Information Processing Systems, 2004.
[23] C. Williams and M. Seeger. The effect of the input density distribution on kernel-based classifiers. In International Conference on Machine Learning, 2000.
[24] M. Wu, B. Schölkopf, and G. Bakir. A direct method for building sparse kernel learning algorithms. Journal of Machine Learning Research, pages 603–624, 2006.
[25] W. Yang, Y. Wang, A. Vahdat, and G. Mori. Kernel latent SVM for visual recognition. In Neural Information Processing Systems, 2012.
