This paper addresses the pattern classification problem that arises when the available target data include some uncertainty information. The target data considered here are either qualitative (a class label) or quantitative (an estimate of the posterior probability). Our main contribution is an SVM-inspired formulation of this problem that takes class labels into account through a hinge loss and probability estimates through an epsilon-insensitive cost function, together with a minimum-norm (maximum-margin) objective. This formulation admits a dual form leading to a quadratic program and allows the use of a representer theorem and an associated kernel. The resulting solution can be used both for decision and for posterior probability estimation. Empirical evidence shows that our method outperforms the regular SVM in terms of probability prediction and classification performance.
In the mainstream supervised classification scheme, an expert is required to label a set of data which is then used to train the classifier. However, even for an expert, this labelling task is likely to be difficult in many applications. As a result, the training data set may contain inaccurate classes for some examples, which leads to non-robust classifiers [1]. For instance, this is often the case in medical imaging, where radiologists have to outline what they think are malignant tissues on medical images without access to the reference histopathologic information.

We propose to deal with these uncertainties by introducing probabilistic labels in the learning stage so as to: 1. stick to the real-life annotation problem, 2. avoid discarding uncertain data, 3. balance the influence of uncertain data in the classification process. Our study focuses on the widely used Support Vector Machine (SVM) two-class classification problem [2]. This method aims at finding the separating hyperplane maximizing the margin between the examples of both classes. Several mappings from SVM scores to class membership probabilities have been proposed in the literature [3,4]. In our approach, we propose to use both labels and probabilities as inputs, thus learning a classifier and a probabilistic output simultaneously. Note that the output of our classifier may be transformed into probability estimates without using any mapping algorithm.

In section 2 we define our new SVM problem formulation (referred to as P-SVM) dealing with certain and probabilistic labels simultaneously. Section 3 describes the whole framework of P-SVM and presents the associated quadratic problem. Finally, in section 5 we compare its performances to the classical SVM formulation (C-SVM) over different data sets to demonstrate its potential.
We present below a new formulation of the two-class classification problem that deals with uncertain labels. Let $X$ be a feature space. We define $(x_i, l_i)_{i=1\dots m}$ as the learning dataset of input vectors $(x_i)_{i=1\dots m} \in X$ along with their corresponding labels $(l_i)_{i=1\dots m}$, the latter of which are either
• class labels $y_i \in \{-1, +1\}$, for $i = 1, \dots, n$, or
• probabilistic labels $p_i \in [0, 1]$, for $i = n+1, \dots, m$. The probability $p_i$ associated with point $x_i$ accounts for the uncertainty about $x_i$'s class; we define it as the posterior probability for class $1$.
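For illustration only, a possible in-memory layout for such a mixed dataset is sketched below; the container and field names are ours and not part of the paper's notation.

```python
# Hypothetical container for a mixed dataset: the first n points carry
# hard class labels in {-1, +1}, the remaining m - n points carry
# posterior-probability labels in [0, 1].
import numpy as np
from dataclasses import dataclass

@dataclass
class MixedDataset:
    X_lab: np.ndarray    # shape (n, d): points with class labels
    y: np.ndarray        # shape (n,): labels in {-1, +1}
    X_prob: np.ndarray   # shape (m - n, d): points with probabilistic labels
    p: np.ndarray        # shape (m - n,): posterior probabilities for class 1
```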
We define the associated pattern recognition problem as
$$\min_{w,\, b} \ \frac{1}{2}\|w\|^2$$
subject to
$$y_i\bigl(\langle w, x_i\rangle + b\bigr) \ge 1, \qquad i = 1, \dots, n,$$
$$z_i^- \le \langle w, x_i\rangle + b \le z_i^+, \qquad i = n+1, \dots, m,$$
where the boundaries $z_i^-$ and $z_i^+$ directly depend on $p_i$. This formulation consists in minimizing the complexity of the model while forcing good classification and good probability estimation (close to $p_i$). Obviously, if $n = m$, we are brought back to the classical SVM problem formulation.
Following the soft-margin idea introduced in the regular SVM to deal with non-separable data, we introduce slack variables $\xi_i$, $\xi_i^-$ and $\xi_i^+$. These measure the degree of violation of the constraints for datum $x_i$, thus relaxing the hard constraints of the initial optimization problem, which becomes
$$\min_{w,\, b,\, \xi,\, \xi^-,\, \xi^+} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i + \tilde{C} \sum_{i=n+1}^{m} \bigl(\xi_i^- + \xi_i^+\bigr)$$
subject to
$$y_i\bigl(\langle w, x_i\rangle + b\bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0, \qquad i = 1, \dots, n,$$
$$z_i^- - \xi_i^- \le \langle w, x_i\rangle + b \le z_i^+ + \xi_i^+, \quad \xi_i^-, \xi_i^+ \ge 0, \qquad i = n+1, \dots, m.$$
Parameters $C$ and $\tilde{C}$ are predefined positive real numbers controlling the relative weighting of the classification and regression performances.
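As an illustration (not the authors' implementation), the soft-margin primal above can be solved directly with an off-the-shelf convex solver. The sketch below assumes a linear kernel, takes the bounds $z_i^-$ and $z_i^+$ as given inputs, and uses function and variable names of our own choosing.

```python
# Illustrative sketch: solving the soft-margin P-SVM primal with CVXPY,
# assuming a linear kernel and precomputed bounds z_minus, z_plus for the
# probabilistic examples.
import numpy as np
import cvxpy as cp

def fit_p_svm_linear(X_lab, y, X_prob, z_minus, z_plus, C=100.0, C_tilde=100.0):
    d = X_lab.shape[1]
    n, q = X_lab.shape[0], X_prob.shape[0]

    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)      # classification slacks
    xi_m = cp.Variable(q, nonneg=True)    # lower regression slacks
    xi_p = cp.Variable(q, nonneg=True)    # upper regression slacks

    f_lab = X_lab @ w + b                 # scores on points with class labels
    f_prob = X_prob @ w + b               # scores on probabilistic points

    objective = cp.Minimize(0.5 * cp.sum_squares(w)
                            + C * cp.sum(xi)
                            + C_tilde * cp.sum(xi_m + xi_p))
    constraints = [cp.multiply(y, f_lab) >= 1 - xi,   # hinge constraints
                   f_prob >= z_minus - xi_m,          # lower interval bound
                   f_prob <= z_plus + xi_p]           # upper interval bound
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```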
Let $\varepsilon$ be the labelling precision and $\delta$ the confidence we have in the labelling, and let us define $\eta = \varepsilon + \delta$. The regression problem then consists in finding optimal parameters $w$ and $b$ such that the probability predicted for point $x_i$ remains within $\eta$ of $p_i$. Finally:
where
We can rewrite the problem in its dual form by introducing Lagrange multipliers. We look for a stationary point of the Lagrangian $L$ defined as
Computing the derivatives of $L$ with respect to $w$, $b$, $\xi$, $\xi^-$ and $\xi^+$ leads to the following optimality conditions:
where
Simplifying these calculations then leads to
Formulations (2) and (3) can easily be generalized by introducing kernel functions. Let $k$ be a positive kernel satisfying Mercer's condition and $\mathcal{H}$ the associated Reproducing Kernel Hilbert Space (RKHS). Within this framework, equation (2) becomes
Formulation (3) remains identical, with
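As a concrete illustration of this representer-theorem form, the sketch below evaluates a decision function of the type $f(x) = \sum_i a_i k(x_i, x) + b$ with the RBF kernel used in the experiments; the coefficient vector $a$ and offset $b$ are assumed to come from the dual solution, and the helper names are ours.

```python
# Minimal sketch of the kernelized decision function implied by the
# representer theorem: f(x) = sum_i a_i * k(x_i, x) + b.
import numpy as np

def rbf_kernel(U, V, sigma=1.0):
    # k(u, v) = exp(-||u - v||^2 / (2 sigma^2)), computed pairwise
    d2 = (np.sum(U**2, axis=1)[:, None]
          + np.sum(V**2, axis=1)[None, :]
          - 2 * U @ V.T)
    return np.exp(-d2 / (2 * sigma**2))

def decision_function(X_train, a, b, X_new, sigma=1.0):
    # Kernel expansion over the training points
    return rbf_kernel(X_new, X_train, sigma) @ a + b
```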
In order to experimentally evaluate the proposed method for handling uncertain labels in SVM classification, we simulated the different data sets described below. In these numerical examples, an RBF kernel $k(u, v) = e^{-\|u - v\|^2 / 2\sigma^2}$ is used and $C = \tilde{C} = 100$. We implemented our method using the SVM-KM Toolbox [8]. We compare the classification performances and probabilistic predictions of the C-SVM and P-SVM approaches. In the first case, probabilities are estimated using Platt's scaling algorithm [3], while in the second case probabilities are directly estimated via the formula defined in (2):
for probability distributions P and Q (for evaluating probability estimation).
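For reference, a simplified sketch of Platt's scaling [3], used here only for the C-SVM baseline, is given below; the function name is ours and Platt's regularized targets are omitted for brevity.

```python
# Simplified sketch of Platt's scaling: fit P(y=1|f) = 1 / (1 + exp(A*f + B))
# on SVM scores by maximum likelihood.
import numpy as np
from scipy.optimize import minimize

def platt_scaling(scores, labels):
    t = (labels + 1) / 2.0                      # map {-1, +1} -> {0, 1}

    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * scores + B))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

    A, B = minimize(nll, x0=[-1.0, 0.0]).x
    return lambda f: 1.0 / (1.0 + np.exp(A * f + B))
```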
We generate two unidimensional datasets, labelled '+1' and '-1', from normal distributions with variances $\sigma_{-1}^2 = \sigma_{1}^2 = 0.3$ and means $\mu_{-1} = -0.5$ and $\mu_{1} = +0.5$. Let $(x_i^l)_{i=1\dots n_l}$ denote the learning data set ($n_l = 200$) and $(x_i^t)_{i=1\dots n_t}$ the test set ($n_t = 1000$). We compute, for each point $x_i$, its true probability $P(y_i = +1 \mid x_i)$.
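A minimal sketch of this simulation, assuming equal class priors (which the text does not state explicitly), is given below; names and seeds are ours.

```python
# Two 1-D Gaussian classes with means -0.5 / +0.5 and variance 0.3,
# plus the true posterior P(y = +1 | x) computed by Bayes' rule.
import numpy as np
from scipy.stats import norm

def simulate(n_per_class, mu=0.5, var=0.3, seed=0):
    rng = np.random.default_rng(seed)
    sd = np.sqrt(var)
    x = np.concatenate([rng.normal(-mu, sd, n_per_class),
                        rng.normal(+mu, sd, n_per_class)])
    y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
    # True posterior under equal class priors
    p_pos = norm.pdf(x, +mu, sd)
    p_neg = norm.pdf(x, -mu, sd)
    return x, y, p_pos / (p_pos + p_neg)

x_l, y_l, p_l = simulate(100, seed=1)   # learning set, n_l = 200 points
x_t, y_t, p_t = simulate(500, seed=2)   # test set, n_t = 1000 points
```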