Submodular Optimization for Efficient Semi-supervised Support Vector Machines


Authors: Wael Emara, Mehmed Kantardzic

Computer Engineering and Computer Science Department, University of Louisville, Louisville, Kentucky 40292
Email: waemar01@cardmail.louisville.edu, mmkant01@louisville.edu

Abstract: In this work we present a quadratic programming approximation of the Semi-Supervised Support Vector Machine (S3VM) problem, namely approximate QP-S3VM, that can be efficiently solved using off-the-shelf optimization packages. We prove that this approximate formulation establishes a relation between the low density separation and the graph-based models of semi-supervised learning (SSL), which is important for developing a unifying framework for semi-supervised learning methods. Furthermore, we propose the novel idea of representing SSL problems as submodular set functions and use efficient submodular optimization algorithms to solve them. Using this new idea we develop a representation of the approximate QP-S3VM as a maximization of a submodular set function, which makes it possible to optimize using efficient greedy algorithms. We demonstrate that the proposed methods are accurate and provide significant improvement in time complexity over the state of the art in the literature.

I. INTRODUCTION

Recent advances in information technology impose serious challenges on traditional machine learning algorithms, where classification models are trained using labeled samples. Data collection and storage have never been easier, and therefore using such enormous volumes of data to infer reliable classification models is of utmost importance. Meanwhile, labeling entire data sets to train classification models is no longer a valid option due to the high cost of experienced human annotators.
Despite recent efforts to make annotation of large data sets cheap and reliable by using an online workforce, the collected labeled data can never keep up with the cheap collection of unlabeled data. Semi-supervised learning (SSL) handles this issue by utilizing a large amount of unlabeled samples, along with labeled samples, to build better performing classifiers. Two assumptions form the basis for the usefulness of unlabeled samples in discriminative SSL methods: the cluster assumption and the smoothness assumption [1]. Although both assumptions use the idea that samples that are close under some distance metric should assume the same label, they inspire different categories of SSL algorithms, namely low density separation methods (for the cluster assumption) and graph-based methods (for the smoothness assumption). In the low density separation methods the unlabeled samples are used to better estimate the boundaries of each class. The graph-based methods use labeled and unlabeled samples to construct a graph representation of the data set, where information is then propagated from the labeled samples to the unlabeled samples through the dense regions of the graph, a process known as label propagation [2].

The practical success and the theoretical robustness of large margin methods in general, and Support Vector Machines (SVM) in particular, have drawn a lot of attention to Semi-Supervised Support Vector Machines (S3VM) [3]. However, the problem is challenging due to the non-convexity of the objective function. In this paper we propose an approximate S3VM formulation that results in a standard quadratic programming problem, namely approximate QP-S3VM, that can be solved directly using off-the-shelf optimization packages.
One important aspect of the proposed formulation is that it uncovers a connection between the S3VM, as a low density separation method, and the graph-based algorithms, which is a helpful step towards a unifying framework for SSL [4]. Furthermore, we present a new formulation of loss-based SSL problems. The new formulation represents SSL problems as set functions and uses the theory of submodular set function optimization to solve them efficiently. Specifically, we present a submodular set function that is equivalent to the proposed approximate QP-S3VM and solve it efficiently using a greedy approach that is well established in optimizing submodular functions [5].

Section I-A provides preliminaries of S3VM and the notation used throughout the paper. The proposed approximate QP-S3VM is detailed in Section II. In Section III we present the submodular formulation of the approximate QP-S3VM. Experimental results are provided in Section IV, followed by the conclusion in Section V.

A. Preliminaries

Semi-supervised learning uses partially labeled data sets L ∪ U, where L = {(x_i, y_i)}, U = {x_j}, x ∈ R^n, and y_i ∈ {+1, −1}. Throughout this paper we use i and j as indices for labeled and unlabeled samples, respectively. The major body of work on S3VM is based on the idea of solving a standard SVM while treating the unknown labels as additional variables [3]. The semi-supervised learning problem is to find the solution of

\min_{w, y_j} J(w, y_j) = \frac{1}{2}\|w\|^2 + C \sum_{i \in L} \ell_l(w, (x_i, y_i)) + C^* \sum_{j \in U} \ell_u(w, x_j)    (1)

where the loss functions for unlabeled samples ℓ_u and labeled samples ℓ_l are defined as follows:

\ell_u(w, (x_j, y_j)) = \max_{y_j \in \{-1, +1\}} \{0,\; 1 - y_j(\langle w, x_j \rangle + b)\}    (2)

\ell_l(w, (x_i, y_i)) = \max\{0,\; 1 - y_i(\langle w, x_i \rangle + b)\}    (3)

The solution of Eqn. (1) results in the optimal separating hyperplane w and the labels y_j assigned to the unlabeled samples.
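The two losses in Eqns. (2) and (3) can be sketched directly. The following is a minimal illustration (function names are ours, not from the paper): the labeled loss is the standard hinge, and the unlabeled loss, as written in Eqn. (2), is taken over both candidate labels.

```python
import numpy as np

def labeled_hinge(w, b, x, y):
    # Eqn (3): standard hinge loss for a labeled sample (x, y).
    return max(0.0, 1.0 - y * (np.dot(w, x) + b))

def unlabeled_hinge(w, b, x):
    # Eqn (2): loss of an unlabeled sample, evaluated over both
    # candidate labels y_j in {-1, +1} as written in the formulation.
    return max(labeled_hinge(w, b, x, +1.0), labeled_hinge(w, b, x, -1.0))
```

For a sample well inside the positive half-space, the hinge under y = +1 vanishes while the hinge under y = −1 grows with the margin; a sample on the decision boundary costs 1 under either label.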
The loss over labeled and unlabeled samples is controlled by two parameters, C and C^*, which reflect the confidence in the labels y_i and in the cluster assumption, respectively. Algorithms that solve Eqn. (1) can broadly be divided into combinatorial and continuous optimization algorithms. In continuous optimization algorithms, for a given fixed w, the optimal y_j are simply obtained by sgn(⟨w, x_j⟩ + b). The problem then comes down to a continuous optimization problem in w. On the other hand, in combinatorial optimization algorithms, for given y_j, the optimization for w is a standard SVM problem. Therefore, if we define a function I(y_j) such that

I(y_j) = \min_{w} J(w, y_j)    (4)

the problem is transformed into minimizing I(y_j) over a set of binary variables, where each evaluation of I(y_j) is a standard SVM optimization problem [6], [7], [8]:

\min_{y_j} I(y_j).    (5)

Solving Eqn. (1) may lead to degenerate solutions where all the unlabeled samples are assigned to one class. This is usually handled in the literature by enforcing a balancing constraint, which makes sure that a certain ratio r of the unlabeled samples is assigned to class +1 [3].

II. QUADRATIC PROGRAMMING APPROXIMATION OF S3VM (QP-S3VM)

In Eqn. (5) the combinatorial formulation of S3VM optimizes for the labels y_j that minimize the loss associated with each unlabeled sample. To overcome the hard combinatorial problem, the loss of setting y_j = 1, denoted by ℓ_j^+, is assigned a new variable p_j, where 0 ≤ p_j ≤ 1. This variable indicates the probability that y_j = 1 is correct. Similarly, the loss of setting y_j = −1, denoted by ℓ_j^−, is weighted by the probability 1 − p_j. The balancing constraint then has the form Σ_{j∈U} p_j = r|U|. This modified formulation has the following form [8], [9]:

Problem 1. Continuous optimization formulation of the combinatorial S3VM problem.
\arg\min_{P} \min_{w} J(w, P) = \frac{1}{2}\|w\|^2 + C \sum_{i \in L} \zeta_i + C^* \sum_{j \in U} p_j \ell_j^+ + C^* \sum_{j \in U} (1 - p_j) \ell_j^-

subject to

y_i[\langle w, x_i \rangle + b] \ge 1 - \zeta_i
\langle w, x_j \rangle + b \ge 1 - \ell_j^+
-\langle w, x_j \rangle - b \ge 1 - \ell_j^-
\zeta_i \ge 0, \quad \ell_j^+ \ge 0, \quad \ell_j^- \ge 0
0 \le p_j \le 1, \quad \sum_{j \in U} p_j = r|U|    (6)

Now that the problem has been simplified from being combinatorial in y_i, y_j ∈ {+1, −1} to being continuous in p_j ∈ [0, 1], we proceed to find the dual form. Deriving the Lagrangian of the continuous formulation in Problem 1 and applying the Karush-Kuhn-Tucker conditions to it, the obtained dual form is presented in Problem 2.

Problem 2. Dual form of \min_w J(w, P) in Problem 1.

\max_{A, B, \Gamma} I_{Dual}    (7)

where

I_{Dual} = A' 1_{|L|} + (\Gamma + B)' 1_{|U|} - \frac{1}{2}(A \circ Y)' K_{ll} (A \circ Y) - \frac{1}{2}(\Gamma - B)' K_{uu} (\Gamma - B) - (A \circ Y)' K_{lu} (\Gamma - B)    (8)

subject to

0 \le A \le C 1_{|L|}
0 \le \Gamma \le C^* P
0 \le B \le C^* (1_{|U|} - P)

where 1_{|L|} is a ones vector of length |L|, and similarly 1_{|U|}; α_i is the Lagrange multiplier of the labeled loss constraint (ζ_i); γ_j is the Lagrange multiplier of the unlabeled loss constraint (ℓ_j^+); β_j is the Lagrange multiplier of the unlabeled loss constraint (ℓ_j^−); A' = [α_1, ..., α_{|L|}], B' = [β_1, ..., β_{|U|}], Γ' = [γ_1, ..., γ_{|U|}], P' = [p_1, ..., p_{|U|}], Y' = [y_1, ..., y_{|L|}]; K_{ll} = K_{i,i'} ∀ i, i' ∈ L; K_{uu} = K_{j,j'} ∀ j, j' ∈ U; K_{lu} = K_{i,j} ∀ i ∈ L, j ∈ U.

Using the derived dual form in Problem 2, we propose an approximate optimization based on minimizing an upper bound of \max_{A, B, \Gamma} I_{Dual}. The proposed upper bound is specified in the following theorem.

Theorem 1. Proposed upper bound for \max_{A, B, \Gamma} I_{Dual}:

\max_{A, B, \Gamma} I_{Dual} \le I(w^*) + C^*|U| + M_1 + M_2    (9)

where

I(w^*) = \min_{w} \frac{1}{2}\|w\|^2 + C \sum_{i \in L} \zeta_i

M_1 = \frac{1}{2} C^{*2} (1_{|U|} - P)' K_{uu} P

M_2 = C C^* Y' K_{lu} (1_{|U|} - P)    (10)

Proof: See the appendix.
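The dual objective of Eqn. (8) is just a pair of quadratic forms plus linear terms, and is straightforward to evaluate numerically. The sketch below (function and variable names are ours) computes I_Dual for given multiplier vectors, which is useful for sanity-checking any derivation against the primal.

```python
import numpy as np

def dual_objective(A, B, G, Y, K_ll, K_uu, K_lu):
    # I_Dual of Eqn (8); A, G, B hold the multipliers alpha, gamma, beta.
    AY = A * Y        # elementwise product (A ∘ Y)
    GB = G - B        # (Γ − B)
    return (A.sum() + (G + B).sum()
            - 0.5 * AY @ K_ll @ AY
            - 0.5 * GB @ K_uu @ GB
            - AY @ K_lu @ GB)
```

With all multipliers at zero the objective is zero, which matches the fact that the zero point is always dual-feasible under the box constraints of Problem 2.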
Examining the upper bound in Theorem 1, I(w^*) is the objective function value of optimizing a standard supervised SVM on the labeled samples L. Therefore, it is constant, as is the term C^*|U|. The rest of the upper bound, M_1 + M_2, is a function of P. The optimal values of P are now obtainable through the following optimization problem.

Problem 3. Quadratic programming approximation of Semi-supervised Support Vector Machines (QP-S3VM):

\min_{P} \frac{1}{2} C^{*2} (1_{|U|} - P)' K_{uu} P + C C^* Y' K_{lu} (1_{|U|} - P)    (11)

subject to

P' 1_{|U|} = r|U|, \quad 0 \le P \le 1_{|U|}.    (12)

Note: Equation (11) can be rewritten in the standard quadratic programming form as follows:

\min_{P} -\frac{1}{2} C^{*2} P' K_{uu} P + \left( \frac{1}{2} C^{*2} 1'_{|U|} K_{uu} - C C^* Y' K_{lu} \right) P    (13)

The proposed approximate formulation is a quadratic programming problem in the variables p_j. In order to avoid trivial solutions where all the variables p_j are zero, we add the constraint P' 1 = r|U|, which makes sure that a certain ratio r of the unlabeled samples is assigned to class +1.

A. QP-S3VM Model Interpretation

In this section we analyze the approximate model obtained in Problem 3. This is necessary to ensure that the approximate model does not deviate from the original S3VM problem. The first term in Eqn. (11) can be expanded as follows:

\frac{1}{2} C^{*2} (1_{|U|} - P)' K_{uu} P = \underbrace{\frac{1}{2} C^{*2} \sum_{j = j'} [K_{uu}]_{j,j'} \, p_{j'}(1 - p_j)}_{Q_1} + \underbrace{\frac{1}{2} C^{*2} \sum_{j=1}^{|U|-1} \sum_{j'=j+1}^{|U|} [K_{uu}]_{j,j'} (p_j + p_{j'} - 2 p_j p_{j'})}_{Q_2}    (14)

As Q_1 is negative quadratic in p_j, minimizing Q_1 forces the values of p_j to be either 0 or 1. In other words, minimizing Q_1 helps make clear assignments of the labels to the unlabeled samples.
To understand the implications of minimizing Q_2 on the solution of Problem 3, we start by plotting z = (p_j + p_{j'} − 2 p_j p_{j'}) for all p_j, p_{j'} ∈ [0, 1], as shown in Fig. 1.

Fig. 1. Plot of z = (p_j + p_{j'} − 2 p_j p_{j'}) for all p_j, p_{j'} ∈ [0, 1].

In Fig. 1 we see that small values of z, i.e. z ≈ 0, mean that p_j ≈ p_{j'}, while large values of z, i.e. z ≈ 1, mean that p_j p_{j'} ≈ 0. To minimize Q_2 we assign small z to large-valued [K_{uu}]_{j,j'}. This means that when two unlabeled samples x_j and x_{j'} are close, i.e. [K_{uu}]_{j,j'} is large, the assigned small-valued z forces them to assume the same label, i.e. p_j ≈ p_{j'}. On the other hand, if [K_{uu}]_{j,j'} is small, we assign a large z to it. In other words, if the two unlabeled samples are not close (small [K_{uu}]_{j,j'}), then they should be assigned to different classes, by setting z to be large, i.e. p_j p_{j'} ≈ 0. It is now easy to see how minimizing Q_2 basically implements the cluster assumption of semi-supervised learning algorithms, where unlabeled samples form clusters and all samples in the same cluster have the same label. Notice that during the minimization of Q_2 a smaller minimum value is achievable if all the unlabeled samples are assigned the same label, that is, when z = 0 and therefore p_j = p_{j'}. However, this is a degenerate solution, and this is why the balancing constraint is important in the approximate formulation in Problem 3.

Next we study the second term in Eqn. (11). We start by rewriting it as follows:

C C^* Y' K_{lu} (1_{|U|} - P) = C C^* \sum_{i \in L, j \in U} y_i [K_{lu}]_{i,j} (1 - p_j) = \underbrace{C C^* \sum_{\substack{i \in L, j \in U \\ y_i = +1}} [K_{lu}]_{i,j} (1 - p_j)}_{Q_3} + \underbrace{C C^* \sum_{\substack{i \in L, j \in U \\ y_i = -1}} [K_{lu}]_{i,j} (p_j - 1)}_{Q_4}    (15)

We split Eqn. (15) into terms associated with labeled samples with y_i = +1 (Q_3) and those with y_i = −1 (Q_4).
This is necessary because the interpretation depends on the labels y_i. Since p_j ∈ [0, 1], minimizing Q_3 involves assigning small (1 − p_j), i.e. p_j ≈ 1, to [K_{lu}]_{i,j} with large values, and vice versa: small-valued [K_{lu}]_{i,j} are assigned large (1 − p_j), i.e. p_j ≈ 0. In other words, if an unlabeled sample x_j is close to a labeled sample (x_i, y_i = +1), i.e. [K_{lu}]_{i,j} is large, then this unlabeled sample should have the same label as the labeled sample, that is, p_j ≈ 1 and y_j = +1. On the other hand, if the unlabeled sample x_j is far from the labeled sample (x_i, y_i = +1), i.e. [K_{lu}]_{i,j} is small, then this unlabeled sample should have the opposite label to that of the labeled sample, that is, p_j ≈ 0 and y_j = −1. Once again it is notable that if the balancing constraint is not used, a smaller value for the minimum of Q_3 is achievable if all the unlabeled samples are assigned the same label, p_j = 1 and y_j = +1. The same argument holds for minimizing Q_4, where unlabeled samples with large/small similarity to a labeled sample (x_i, y_i = −1) will be assigned small/large (p_j − 1), i.e. p_j ≈ 0 and p_j ≈ 1, respectively.

The process of jointly minimizing Q_2, which implements the cluster assumption of semi-supervised learning, and Q_3 + Q_4, where unlabeled samples are assigned labels by their similarity to labeled samples, results in a formulation that follows the same intuition behind label propagation algorithms [2] for semi-supervised learning. That is, the labeling process chooses dense regions to propagate labels through the unlabeled samples. Therefore, the approximate formulation provided in Problem 3 does not deviate from the general paradigm of the semi-supervised learning problem. Meanwhile, the provided formulation gives an insight into the connection between the avoiding-dense-regions semi-supervised algorithms, which include S3VM, and the graph-based algorithms.
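The interpretation of Q_2, Q_3, and Q_4 above can be checked numerically. The sketch below evaluates the QP-S3VM objective of Eqn. (11) on a hypothetical two-cluster kernel of our own construction and solves Problem 3 with a general-purpose NLP solver (SciPy's SLSQP; note that since the objective is concave in P, a local solver like this gives no global guarantee, which motivates Section III). All names and the toy kernel are our illustrations, not from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def qp_s3vm_objective(P, K_uu, K_lu, Y, C, Cs):
    # Eqn (11): 1/2 C*^2 (1 - P)' K_uu P + C C* Y' K_lu (1 - P)
    one = np.ones_like(P)
    return (0.5 * Cs**2 * (one - P) @ K_uu @ P
            + C * Cs * Y @ K_lu @ (one - P))

def solve_qp_s3vm(K_uu, K_lu, Y, C, Cs, r):
    # Problem 3: balancing constraint P'1 = r|U| and box 0 <= P <= 1.
    n = K_uu.shape[0]
    res = minimize(qp_s3vm_objective, x0=np.full(n, r),
                   args=(K_uu, K_lu, Y, C, Cs),
                   bounds=[(0.0, 1.0)] * n,
                   constraints=[{"type": "eq", "fun": lambda P: P.sum() - r * n}],
                   method="SLSQP")
    return res.x

# Toy setup: unlabeled samples {0, 1} form one cluster, {2, 3} another;
# the single labeled sample (y = +1) is most similar to the first cluster.
K_uu = np.array([[1.0, 0.9, 0.1, 0.1],
                 [0.9, 1.0, 0.1, 0.1],
                 [0.1, 0.1, 1.0, 0.9],
                 [0.1, 0.1, 0.9, 1.0]])
K_lu = np.array([[0.8, 0.8, 0.1, 0.1]])
Y = np.array([1.0])

# A cluster-consistent assignment scores better than one that splits the
# clusters, matching the Q2/Q3 interpretation above (both satisfy r = 0.5).
consistent = qp_s3vm_objective(np.array([1., 1., 0., 0.]), K_uu, K_lu, Y, 1.0, 1.0)
split = qp_s3vm_objective(np.array([1., 0., 1., 0.]), K_uu, K_lu, Y, 1.0, 1.0)
```

With r = 0.5, `solve_qp_s3vm(K_uu, K_lu, Y, 1.0, 1.0, 0.5)` should push P toward (1, 1, 0, 0), assigning the cluster near the positive labeled sample to class +1.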
III. SUBMODULAR OPTIMIZATION OF APPROXIMATE QP-S3VM

The approximate QP-S3VM formulation proposed in Problem 3 is simple and intuitive. However, since it is a quadratic minimization of a concave function, the computational complexity of finding a solution becomes a hindering issue, especially for semi-supervised learning problems, which are inherently large scale. In this section we use the concepts of submodular set functions to provide a simple and efficient algorithm for the proposed approximate QP-S3VM problem.

Submodular set functions play a central role in combinatorial optimization [10]. They are considered the discrete analog of convex functions in continuous optimization, in the sense of structural properties that can be exploited algorithmically. They also emerge as a natural structural form in classic combinatorial problems such as maximum coverage and maximum facility location in location analysis, as well as max-cut problems in graphs. More recently, submodular set functions have become key concepts in machine learning, where problems such as feature selection [11] and active learning [12] are solved by maximizing submodular set functions, while other core problems like clustering and learning structures of graphical models have been formulated as submodular set function minimization [13].

As discussed in Section II, the solution of the approximate QP-S3VM provides a value for the variable p_j associated with each unlabeled sample x_j, j ∈ U, such that p_j = 1 for y_j = +1 and p_j = 0 for y_j = −1. In this section we use a different perspective of the problem. In this new perspective, the problem of binary semi-supervised classification in general is concerned with choosing a subset A from the pool of all unlabeled samples U. All the unlabeled samples x_j, j ∈ A, should be assigned the label y_j = +1, and the rest of them, x_j, j ∈ U \ A, will be assigned the label y_j = −1.
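The subset encoding just described is a one-line mapping; a minimal sketch (our own helper, for illustration only):

```python
def labels_from_subset(A, n_unlabeled):
    # Subset view of Section III: unlabeled indices in A get the label +1,
    # and the remaining indices in U \ A get -1.
    return [1 if j in A else -1 for j in range(n_unlabeled)]
```

For example, with four unlabeled samples, the subset {0, 2} encodes the label vector (+1, −1, +1, −1).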
Each possible subset A is assigned a value by a set function f(A) that has the same optimal solution, in terms of A and U \ A, as the original semi-supervised classification problem. What makes the reformulation of semi-supervised learning into a set function interesting is that if the set function f(A) is monotonic submodular, many algorithms can solve the problem efficiently [10]. In the following we give some background on the concept of submodularity in set functions and how we employ it to solve our problem efficiently.

Let f(X) be a set function defined on the set X = {x_1, x_2, ..., x_n}. The monotonicity and submodularity of f(X) are defined as follows [10]:

Definition 1. For all sets A, B ⊆ X with A ⊆ B, a set function f : 2^X → R is:
a) Monotonic if f(A) ≤ f(B).
b) Submodular if f(A ∪ {x_j}) − f(A) ≥ f(B ∪ {x_j}) − f(B) for all x_j ∉ B.

A well acknowledged result by Nemhauser et al. [5], see Theorem 2 below, establishes a lower bound on the performance of the simple greedy algorithm (Algorithm 1) when it is used to maximize a monotone submodular set function subject to a cardinality constraint. The simple greedy algorithm works by repeatedly adding the element that maximally increases the objective value, and according to Theorem 2 this simple procedure is guaranteed to achieve at least a constant fraction (1 − 1/e) of the optimal solution, where e is the base of the natural logarithm.

Theorem 2. Given a finite set X = {x_1, x_2, ..., x_n} and a monotonic submodular function f(A), where A ⊆ X and f(∅) = 0, consider the following maximization problem:

A^* = \arg\max_{|A| \le k} f(A).

The greedy maximization algorithm returns A_{Greedy} such that f(A_{Greedy}) \ge (1 - \frac{1}{e}) f(A^*).

Algorithm 1: Greedy Algorithm for Submodular Function Maximization with a Cardinality Constraint [5], [14]
1. Start with X_0 = ∅
2. For i = 1 to k:
     x^* := \arg\max_x [ f(X_{i-1} ∪ {x}) − f(X_{i-1}) ]
     X_i := X_{i-1} ∪ {x^*}

A. Solving QP-S3VM Using Submodular Optimization

In this section we use the concepts of submodular function maximization to provide an efficient and simple algorithm for solving the approximate QP-S3VM problem. Towards this goal we propose the following submodular maximization problem, which is equivalent to the approximate QP-S3VM in Problem 3.

Problem 4. Submodular maximization formulation that is equivalent to Problem 3:

\max_{|A| \le r|U|} S(A)    (16)

where

S(A) = -\frac{1}{2} C^{*2} \sum_{j \in A, j' \in U} [K_{uu}]_{j,j'} + C C^* \sum_{j \in A, i \in L} y_i [K_{lu}]_{i,j} + \frac{1}{2} C^{*2} \sum_{j, j' \in A} [K_{uu}]_{j,j'} + \underbrace{d \sum_{j, j' \in A} \left[ \delta_{j,j'} \left( \frac{3}{2} C^{*2} |U| + C C^* |L| \right) - \frac{1}{2} C^{*2} \right]}_{Q_5},    (17)

where S is a submodular set function defined on all subsets A ⊂ U of unlabeled samples assigned to the class y_j = +1, 0 ≤ K_{ij} ≤ d, and δ_{j,j'} = 1 for j = j' and 0 otherwise.

Problem 4 basically maximizes the negative of a discrete version of the objective function in Eqn. (13). The correspondence between the first three terms in S(A) and Eqn. (13) is straightforward. However, the term Q_5 is of our design, and it is added to ensure the monotonicity and submodularity of S(A), as shown in Theorem 3. The constant d is the maximum value of the kernel matrix; therefore d = 1 for Radial Basis Function (RBF) kernels. If the data is feature-wise normalized (a highly recommended practice) with values in [0, 1], then for the linear kernel d is equal to the number of dimensions of the used data set (for dense data) or the average number of non-zero features (for sparse data). Since for a fixed |A| the value of Q_5 is constant, the optimal solution obtained by optimizing S(A) is not affected by adding Q_5. In other words, Q_5 depends on the cardinality of A, not its contents.
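Problem 4 and Algorithm 1 together can be sketched in a few lines. The following is a minimal implementation under our own naming (the paper provides no code): `make_S` builds S(A) of Eqn. (17) with d = 1, and `greedy_max` applies the greedy rule of Algorithm 1.

```python
import numpy as np

def make_S(K_uu, K_lu, Y, C, Cs, d=1.0):
    # S(A) of Eqn (17); A is a set of unlabeled indices assigned y_j = +1.
    nU, nL = K_uu.shape[0], K_lu.shape[0]
    diag_const = 1.5 * Cs**2 * nU + C * Cs * nL  # delta_{j,j'} constant in Q5

    def S(A):
        if not A:
            return 0.0                                       # S(empty) = 0
        idx = sorted(A)
        a = len(idx)
        t1 = -0.5 * Cs**2 * K_uu[idx, :].sum()               # coupling of A to all of U
        t2 = C * Cs * (Y @ K_lu[:, idx]).sum()               # similarity to labeled samples
        t3 = 0.5 * Cs**2 * K_uu[np.ix_(idx, idx)].sum()      # within-A coupling
        q5 = d * (a * diag_const - 0.5 * Cs**2 * a * a)      # Q5: depends only on |A|
        return t1 + t2 + t3 + q5
    return S

def greedy_max(S, n_unlabeled, k):
    # Algorithm 1: repeatedly add the element with the largest marginal gain.
    A = set()
    for _ in range(int(k)):
        gains = {m: S(A | {m}) - S(A) for m in range(n_unlabeled) if m not in A}
        A.add(max(gains, key=gains.get))
    return A
```

A typical use is `A = greedy_max(make_S(K_uu, K_lu, Y, C, Cs), n_unlabeled, r * n_unlabeled)`; the indices in A are labeled +1 and the remainder −1. Each call to S touches O(|U| |A|) kernel entries, which is what makes the greedy pass cheap compared to solving the concave QP directly.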
Theorem 3. The set function S(A) in Problem 4 is monotone (non-decreasing), submodular, and S(∅) = 0.

Proof: See the appendix.

Now that we have shown that S(A) is monotonic, submodular, and S(∅) = 0, the greedy maximization algorithm can be used to optimize Problem 4, and the performance guarantee in Theorem 2 holds. To summarize, the proposed equivalent submodular maximization in Problem 4 is defined on all subsets A of samples belonging to the class labeled y_j = +1. The efficient greedy algorithm in Algorithm 1 is used to solve the problem. Once the optimum solution A^* is determined, the rest of the unlabeled samples, i.e. U \ A^*, belong to the class with labels y_j = −1. We use the proposed algorithm in the transductive setting of semi-supervised learning. However, if the inductive setting is needed, a standard supervised SVM training can be performed to give the final hyperplane w.

IV. EXPERIMENTAL RESULTS

In this section we illustrate the accuracy and efficiency of the proposed QP-S3VM and its submodular optimization (S-QP-S3VM). To this end, we compare the performance of QP-S3VM and S-QP-S3VM with three competitive S3VM algorithms, namely the Transductive Support Vector Machine (TSVM) [7], Deterministic Annealing for Semi-supervised Kernel Machines (DA) [8], and ∇TSVM [15]. All experiments are performed on a 2 GHz Intel Core 2 Duo machine with 2 GB RAM. The experiments are performed on several real-world data sets (see Table I) that are selected so as to achieve diversity in terms of dimensionality and distribution properties.

TABLE I. DATA SETS USED IN THE EXPERIMENTS [16], [17].
Data set       Features    Samples   Labeled   C       C*/C     r
australian     14          690       3         0.922   10^-1    0.44
w6a            300         1,900     19        0.838   10^-4    0.5
svmguide1      4           3,089     15        1.055   10^-3    0.65
a9a            123         15,680    78        0.897   10^-3    0.5
news20.binary  1,355,191   19,900    100       6.087   10^-3    0.5
real-sim       20,958      72,309    8         1       10^-4    0.31
KDD-99         122         10^6      10        1       10^-4    0.56

In the transductive learning accuracy experiment we considered a challenging setup where the number of labeled samples does not exceed 1% of the available unlabeled data, and in two data sets the percentage is as low as 0.01%. The labeled/unlabeled splitting process is repeated 10 times and the average is reported in Table II. To illustrate the value of using unlabeled samples in the semi-supervised setting, the results of a standard SVM trained using only the labeled samples are also presented. All experiments use the linear kernel with feature-wise normalized data. The ratio of positive samples in the output, r, is set to the correct ratio in the unlabeled samples. It is clear from Table II that QP-S3VM and S-QP-S3VM are superior in terms of accuracy to TSVM, DA, and ∇TSVM.

TABLE II. CLASSIFICATION ACCURACY EXPERIMENTS FOR MEDIUM SIZE DATA SETS.

Data set       SVM      TSVM    DA      ∇TSVM   QP-S3VM   S-QP-S3VM
australian     50.029   63.26   60.48   56.53   75.57     74.49
w6a            67.44    58.73   68.09   52.60   72.33     70.75
svmguide1      71.19    77.31   80.98   69.71   92.73     92.45
a9a            66.91    71.49   72.91   64.43   -         74.90
news20.binary  63.35    -       67.94   -       -         71.44
real-sim       52.13    -       69.23   -       -         71.83
KDD-99         72.12    -       97.12   -       -         98.46

In Table III we provide a CPU-time comparison between QP-S3VM, S-QP-S3VM, TSVM, DA, and ∇TSVM. It is clear that, from the time complexity perspective, S-QP-S3VM is far more efficient than its competitors.

TABLE III. CPU TIME (SECONDS) EXPERIMENTS.

Data set       TSVM     DA      ∇TSVM   QP-S3VM    S-QP-S3VM
australian     11.73    0.786   0.452   174.82     0.013
w6a            109.40   0.836   2.491   6,993.12   0.038
svmguide1      186.59   2.46    0.803   -          0.008
a9a            206.30   20.78   18.68   -          0.335
news20.binary  -        653.4   -       -          3.241
real-sim       -        89.38   -       -          1.925
KDD-99         -        2,740   -       -          1,620

V. CONCLUSION AND FUTURE WORK

In this paper we propose a quadratic programming approximation of the semi-supervised SVM problem (QP-S3VM) that proved to be efficient to solve using standard optimization techniques. One major contribution of the proposed QP-S3VM is that it establishes a link between the two major paradigms of semi-supervised learning, namely low density separation methods and graph-based methods. Such a link is considered a significant step towards a unifying framework for semi-supervised learning methods. Furthermore, we propose a novel formulation of semi-supervised learning problems in terms of submodular set functions which, to the authors' knowledge, is the first time such an idea has been presented. Using this new formulation we present a methodology for using submodular optimization techniques to efficiently solve the proposed QP-S3VM problem. Finally, our idea of representing semi-supervised learning problems as submodular set functions can have a great impact on many learning schemes, as it opens the door to using an arsenal of algorithms that have theoretical guarantees and efficient performance. The authors are already making progress in extending the presented work to multi-class semi-supervised formulations, as well as examining the relationship between submodular optimization over different matroids and its interpretation in terms of semi-supervised learning. One last intriguing point about the proposed work is that samples are assigned to classes (in our case, the positive class) sequentially.
This opens the door to possible ways of estimating the ratio of positive samples r automatically during the learning process, which is still a problem for most semi-supervised techniques, especially if there exists a difference in the ratio r between the labeled and unlabeled samples.

VI. APPENDIX

A. Proof of Theorem 1

To get an upper bound for I_{Dual} we divide it into several components as follows:

I_{Dual} = N_1 + N_2 + N_3    (18)

where

N_1 = A' 1_{|L|} - \frac{1}{2}(A \circ Y)' K_{ll} (A \circ Y)

N_2 = (\Gamma + B)' 1_{|U|} - \frac{1}{2}(\Gamma - B)' K_{uu} (\Gamma - B)

N_3 = -(A \circ Y)' K_{lu} (\Gamma - B).    (19)

Then

\max_{A, B, \Gamma} I_{Dual} \le \max_{A} N_1 + \max_{B, \Gamma} N_2 + \max_{A, B, \Gamma} N_3    (20)

\max_A N_1 is the dual form of a standard supervised SVM problem using the labeled data, i.e.

\max_{A} N_1 = \min_{w} \frac{1}{2}\|w\|^2 + C \sum_{i \in L} \zeta_i    (21)

Furthermore, using the value limits of A, B and Γ, i.e. 0 ≤ A ≤ C 1_{|L|}, 0 ≤ B ≤ C^*(1_{|U|} − P) and 0 ≤ Γ ≤ C^* P, we can derive the following upper bounds on N_2 and N_3:

\max_{B, \Gamma} N_2 \le C^*|U| + \frac{1}{2} C^{*2} (1_{|U|} - P)' K_{uu} P    (22)

and

\max_{A, B, \Gamma} N_3 \le C C^* Y' K_{lu} (1_{|U|} - P).    (23)

Combining the three upper bounds we get the bound provided in the theorem.

B. Proof of Theorem 3

First, S(∅) = 0 follows directly from the definition in Eqn. (17), where all the summations are over elements of the set A; therefore if A = ∅ then S(∅) = 0. For the sake of simplicity we consider the special case where d = 1; the extension to general values of d is fairly straightforward. Next we prove the monotonicity property.
Using the definition of S(A), we can show that for any m ∈ U with m ∉ A, the increase in the objective value of S due to adding m is

S(A ∪ m) − S(A) = -\frac{1}{2} C^{*2} \sum_{j' \in U} [K_{uu}]_{m,j'} + C C^* \sum_{i \in L} y_i [K_{lu}]_{i,m} + C^{*2} \sum_{j' \in A} [K_{uu}]_{m,j'} - C^{*2}|A| + \frac{1}{2} C^{*2} \left( [K_{uu}]_{m,m} - 1 \right) + \frac{3}{2} C^{*2} |U| + C C^* |L|    (24)

Since we are examining the case where d = 1, we have 0 ≤ K_{i,j} ≤ 1 and K_{i,i} = 1. Therefore, since

\frac{1}{2} C^{*2} \left( [K_{uu}]_{m,m} - 1 \right) = 0

C^{*2} \sum_{j' \in A} [K_{uu}]_{m,j'} \ge 0

C C^* |L| + C C^* \sum_{i \in L} y_i [K_{lu}]_{i,m} \ge 0

\frac{3}{2} C^{*2} |U| \ge \frac{1}{2} C^{*2} \sum_{j' \in U} [K_{uu}]_{m,j'} + C^{*2} |A|    (25)

it follows that S(A ∪ m) − S(A) ≥ 0. Thus the monotonicity property of S(A) holds.

Now we prove the submodularity of S(A) by considering the set B = A ∪ {q}, where q ∈ U. Using the same element m as before, i.e. m ∈ U and m ∉ A, we need to show that adding m to the set A has more effect than adding it to the set B, as stated in Definition 1-b. Since

S(B) = -\frac{1}{2} C^{*2} \sum_{j \in A \cup \{q\}, j' \in U} [K_{uu}]_{j,j'} + C C^* \sum_{j \in A \cup \{q\}, i \in L} y_i [K_{lu}]_{i,j} + \frac{1}{2} C^{*2} \sum_{j, j' \in A \cup \{q\}} [K_{uu}]_{j,j'} + \sum_{j, j' \in A \cup \{q\}} \left[ \delta_{j,j'} \left( \frac{3}{2} C^{*2} |U| + C C^* |L| \right) - \frac{1}{2} C^{*2} \right]    (26)

we obtain

S(B ∪ m) − S(B) = -\frac{1}{2} C^{*2} \sum_{j' \in U} [K_{uu}]_{m,j'} + C C^* \sum_{i \in L} y_i [K_{lu}]_{i,m} + C^{*2} \sum_{j' \in A \cup \{q\}} [K_{uu}]_{m,j'} - C^{*2}(|A| + 1) + \frac{1}{2} C^{*2} \left( [K_{uu}]_{m,m} - 1 \right) + \frac{3}{2} C^{*2} |U| + C C^* |L|    (27)

Therefore

\left( S(A ∪ m) − S(A) \right) - \left( S(B ∪ m) − S(B) \right) = C^{*2} \left( 1 - [K_{uu}]_{q,m} \right) \ge 0    (28)

Hence the set function S(A) is submodular.

REFERENCES

[1] X. Zhu, "Semi-supervised learning literature survey," Computer Sciences, University of Wisconsin-Madison, Tech. Rep. 1530, 2005.
[2] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proceedings of the International Conference on Machine Learning, 2003.
[3] O. Chapelle, V. Sindhwani, and S. S. Keerthi, "Optimization techniques for semi-supervised support vector machines," Journal of Machine Learning Research, vol. 9, pp. 203-233, 2008.
[4] H. Narayanan, M. Belkin, and P. Niyogi, "On the relation between low density separation, spectral clustering and graph cuts," in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds. Cambridge, MA: MIT Press, 2007.
[5] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, "An analysis of approximations for maximizing submodular set functions - I," Mathematical Programming, vol. 14, pp. 265-294, 1978.
[6] O. Chapelle, V. Sindhwani, and S. S. Keerthi, "Branch and bound for semi-supervised support vector machines," in Twentieth Annual Conference on Neural Information Processing Systems (NIPS 2006), Cambridge, MA, USA, 2007, pp. 217-224.
[7] T. Joachims, "Transductive inference for text classification using support vector machines," in Proceedings of ICML-99, 16th International Conference on Machine Learning, I. Bratko and S. Dzeroski, Eds. Bled, Slovenia: Morgan Kaufmann Publishers, San Francisco, US, 1999, pp. 200-209.
[8] V. Sindhwani, S. S. Keerthi, and O. Chapelle, "Deterministic annealing for semi-supervised kernel machines," in ICML '06: Proceedings of the 23rd International Conference on Machine Learning, New York, NY, USA, 2006, pp. 841-848.
[9] J. Wang, X. Shen, and W. Pan, "On efficient large margin semisupervised learning: Method and theory," Journal of Machine Learning Research, vol. 10, pp. 719-742, June 2009.
[10] M. Grötschel, L. Lovász, and A. Schrijver, Geometric Algorithms and Combinatorial Optimization, 2nd corrected ed., ser. Algorithms and Combinatorics. Springer, 1993, vol. 2.
[11] M. Narasimhan and J. Bilmes, "A submodular-supermodular procedure with applications to discriminative structure learning," in Uncertainty in Artificial Intelligence (UAI). Edinburgh, Scotland: Morgan Kaufmann Publishers, July 2005.
[12] A. Krause and C. Guestrin, "Nonmyopic active learning of Gaussian processes: an exploration-exploitation approach," in ICML '07: Proceedings of the 24th International Conference on Machine Learning. New York, NY, USA: ACM, 2007, pp. 449-456.
[13] M. Narasimhan, N. Jojic, and J. Bilmes, "Q-clustering," in Advances in Neural Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. Platt, Eds. Cambridge, MA: MIT Press, 2006, pp. 979-986.
[14] M. Sviridenko, "A note on maximizing a submodular set function subject to a knapsack constraint," Operations Research Letters, vol. 32, no. 1, pp. 41-43, 2004.
[15] O. Chapelle and A. Zien, "Semi-supervised classification by low density separation," in Tenth International Workshop on Artificial Intelligence and Statistics, 2005, pp. 57-64.
[16] A. Asuncion and D. Newman, "UCI machine learning repository," 2007.
[17] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, "A practical guide to support vector classification," Department of Computer Science, National Taiwan University, Tech. Rep., 2003.
