A Gaussian Belief Propagation Solver for Large Scale Support Vector Machines

Danny Bickson
School of Engineering and Computer Science, The Hebrew University of Jerusalem, Givat Ram, Jerusalem 91904, Israel
danny.bickson@gmail.com

Elad Yom-Tov
IBM Haifa Research Lab, Haifa 31905, Israel
yomtov@il.ibm.com

Danny Dolev
School of Engineering and Computer Science, The Hebrew University of Jerusalem, Givat Ram, Jerusalem 91904, Israel
dolev@cs.huji.ac.il

Abstract

Support vector machines (SVMs) are an extremely successful type of classification and regression algorithm. Building an SVM entails solving a constrained convex quadratic programming problem, which is quadratic in the number of training samples. We introduce an efficient parallel implementation of a support vector regression solver, based on the Gaussian Belief Propagation (GaBP) algorithm. In this paper, we demonstrate that methods from the complex systems domain can be utilized to perform efficient distributed computation. We compare the proposed algorithm to previously proposed distributed and single-node SVM solvers. Our comparison shows that the proposed algorithm is just as accurate as these solvers, while being significantly faster, especially for large datasets. We demonstrate the scalability of the proposed algorithm on up to 1,024 computing nodes and hundreds of thousands of data points using an IBM Blue Gene supercomputer. As far as we know, our work is the largest parallel implementation of belief propagation ever done, demonstrating the applicability of this algorithm to large scale distributed computing systems.

1 Introduction

Support vector machines (SVMs) are a class of algorithms that have, in recent years, exhibited superior performance compared to other pattern classification algorithms.
There are several formulations of the SVM problem, depending on the specific application of the SVM (e.g., classification, regression, etc.). One of the difficulties in using SVMs is that building an SVM requires solving a constrained quadratic programming problem, whose size is quadratic in the number of training examples. This fact has led to extensive research on efficient SVM solvers. Recently, several researchers have suggested using multiple computing nodes in order to increase the computational power available for solving SVMs.

In this article, we introduce a distributed SVM solver based on the Gaussian Belief Propagation (GaBP) algorithm. We improve on the original GaBP algorithm by reducing the communication load, as represented by the number of messages sent in each optimization iteration, from $O(n^2)$ to $O(n)$ aggregated messages, where $n$ is the number of data points. Previously, it was known that the GaBP algorithm is very efficient for sparse matrices. Using our novel construction, we demonstrate that the algorithm exhibits very good performance for dense matrices as well. We also show that the GaBP algorithm can be used with kernels, making the algorithm more powerful than previously thought possible.

Using extensive simulation we demonstrate the applicability of our protocol vs. the state-of-the-art existing parallel SVM solvers. Using a Linux cluster of up to a hundred machines and the IBM Blue Gene supercomputer we managed to solve very large datasets of up to hundreds of thousands of data points, using up to 1,024 CPUs working in parallel. Our comparison shows that the proposed algorithm is just as accurate as these previous solvers, while being significantly faster. A preliminary version of this paper appeared as a poster in [20].

1.1 Classification Using Support Vector Machines

We begin by formulating the SVM problem.
Consider a training set:

$$D = \{ (x_i, y_i),\ i = 1, \ldots, N,\ x_i \in \Re^m,\ y_i \in \{-1, 1\} \}. \quad (1)$$

The goal of the SVM is to learn a mapping from $x_i$ to $y_i$ such that the error in mapping, as measured on a new dataset, would be minimal. SVMs learn to find the linear weight vector that separates the two classes so that

$$y_i (x_i \cdot w + b) \ge 1 \quad \text{for } i = 1, \ldots, N. \quad (2)$$

There may exist many hyperplanes that achieve such separation, but SVMs find a weight vector $w$ and a bias term $b$ that maximize the margin $2 / \|w\|$. Therefore, the optimization problem that needs to be solved is

$$\min\ J_D(w) = \tfrac{1}{2} \|w\|^2, \quad (3)$$
$$\text{subject to}\ y_i (x_i \cdot w + b) \ge 1 \quad \text{for } i = 1, \ldots, N. \quad (4)$$

Any points lying on the hyperplane $y_i (x_i \cdot w + b) = 1$ are called support vectors. If the data cannot be separated using a linear separator, a slack variable $\xi \ge 0$ is introduced and the constraint is relaxed to:

$$y_i (x_i \cdot w + b) \ge 1 - \xi_i \quad \text{for } i = 1, \ldots, N. \quad (5)$$

The optimization problem then becomes:

$$\min\ J_D(w) = \tfrac{1}{2} \|w\|^2 + C \sum_{i=1}^N \xi_i, \quad (6)$$
$$\text{subject to}\ y_i (x_i \cdot w + b) \ge 1 - \xi_i \quad \text{for } i = 1, \ldots, N, \quad (7)$$
$$\xi_i \ge 0 \quad \text{for } i = 1, \ldots, N. \quad (8)$$

The weights of the linear function can be found directly or by converting the problem into its dual optimization problem, which is usually easier to solve. Using the notation of Vijayakumar and Wu [7], the dual problem is:

$$\max\ L_D(h) = \sum_i h_i - \tfrac{1}{2} h' D h, \quad (9)$$
$$\text{subject to}\ 0 \le h_i \le C, \quad i = 1, \ldots, N, \quad (10)$$
$$\sum_i h_i y_i = 0, \quad (11)$$

where $D$ is a matrix such that $D_{ij} = y_i y_j K(x_i, x_j)$ and $K(\cdot, \cdot)$ is either an inner product of the samples or a function of these samples. In the latter case, this function is known as the kernel function, which can be any function that complies with the Mercer conditions [9]. For example, these may be polynomial functions, radial-basis (Gaussian) functions, or hyperbolic tangents.
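As an illustration of the dual formulation (9), the matrix $D$ can be assembled directly from the definitions above. The following sketch (toy data and an RBF kernel with an arbitrary width $\gamma$, both chosen purely for illustration) also checks the two properties the later derivation relies on: $D$ is symmetric and positive semi-definite whenever $K$ is a Mercer kernel.

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) for all pairs of rows of X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def dual_matrix(X, y, gamma=0.5):
    """D_ij = y_i * y_j * K(x_i, x_j), the matrix of the dual objective (9)."""
    K = rbf_kernel(X, gamma)
    return (y[:, None] * y[None, :]) * K

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
D = dual_matrix(X, y)

# D inherits symmetry and positive semi-definiteness from the Mercer kernel K.
assert np.allclose(D, D.T)
assert np.linalg.eigvalsh(D).min() > -1e-10
```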
If the data is not separable, $C$ is a tradeoff between maximizing the margin and reducing the number of misclassifications. The classification of a new data point is then computed using the following equation:

$$f(x) = \mathrm{sign}\Big( \sum_{i \in SV} h_i y_i K(x_i, x) + b \Big). \quad (12)$$

1.2 Kernel Ridge Regression

Kernel Ridge Regression (KRR) implements a regularized form of the least squares method, useful for both regression and classification. The non-linear version of KRR is similar to the Support Vector Machine (SVM) problem. However, in the latter, special emphasis is given to points close to the decision boundary, which is not provided by the cost function used by KRR.

Given training data $D = \{x_i, y_i\}_{i=1}^l$, $x_i \in R^d$, $y_i \in R$, the KRR algorithm determines the parameter vector $w \in R^d$ of a non-linear model (using the "kernel trick") via minimization of the following objective function [6]:

$$\min\ \lambda \|w\|^2 + \sum_{i=1}^l \big(y_i - w^T \Phi(x_i)\big)^2,$$

where $\lambda$ is a tradeoff parameter between the two terms of the optimization function, and $\Phi(\cdot)$ is a (possibly non-linear) mapping of the training patterns. One can show that the dual form of this optimization problem is given by:

$$\max\ W(\alpha) = y^T \alpha - \tfrac{1}{4\lambda} \alpha^T K \alpha - \tfrac{1}{4} \alpha^T \alpha, \quad (13)$$

where $K$ is a matrix whose $(i,j)$-th entry is the kernel function $K_{i,j} = \Phi(x_i)^T \Phi(x_j)$. The optimal solution to this optimization problem is:

$$\alpha = 2\lambda (K + \lambda I)^{-1} y.$$

The corresponding prediction function is given by:

$$f(x) = w^T \Phi(x) = y^T (K + \lambda I)^{-1} k(x),$$

where $k(x)$ denotes the vector whose $i$-th entry is $K(x_i, x)$. The underlying assumption used is that the kernel matrices are invertible.

1.3 Previous Approaches for Solving Parallel SVMs

There are several main methods for finding a solution to an SVM problem on a single-node computer. (See Chapter 10 of [9] for a taxonomy of such methods.)
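The KRR closed form of Section 1.2 is easy to verify numerically: at $\alpha = 2\lambda (K + \lambda I)^{-1} y$ the gradient of the dual objective (13) vanishes. A minimal sketch, using a linear kernel and random data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 8, 0.3
X = rng.standard_normal((n, 3))
y = rng.standard_normal(n)

# Linear kernel for simplicity; any Mercer kernel behaves the same way.
K = X @ X.T
alpha = 2 * lam * np.linalg.solve(K + lam * np.eye(n), y)

# Gradient of W(a) = y'a - a'Ka/(4*lam) - a'a/4 must vanish at the optimum.
grad = y - K @ alpha / (2 * lam) - alpha / 2
assert np.allclose(grad, 0, atol=1e-8)

# Predictions on the training points: f = K (K + lam*I)^{-1} y,
# which is the same as K @ alpha / (2*lam).
f = K @ np.linalg.solve(K + lam * np.eye(n), y)
assert np.allclose(f, K @ alpha / (2 * lam))
```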
However, since solving an SVM is quadratic in time and cubic in memory, these methods encounter difficulty when scaling to datasets that have many examples and support vectors. The latter two are not synonymous. A large dataset with many repeated examples might be solved using sub-sampling approaches, while a highly non-separable dataset with many support vectors will require an altogether different solution strategy.

The literature covers several attempts at solving SVMs in parallel, which allow for greater computational power and larger memory size. In Collobert et al. [15] the SVM solver is parallelized by training multiple SVMs, each on a subset of the training data, and aggregating the resulting classifiers into a single classifier. The training data is then redistributed to the classifiers according to their performance and the process is iterated until convergence is reached. The need to re-divide the data among the SVM classifiers means that the data must be moved between nodes several times; this rules out such an approach where bandwidth is a concern. A more low-level approach is taken by Zanghirati et al. [14], where the quadratic optimization problem is divided into smaller quadratic programs (similar to Active Set methods), each of which is solved on a different node. The results are aggregated and the process is repeated until convergence. The performance of this method has a strong dependence on the caching architecture of the cluster.

Graf et al. [17] partition the data and solve an SVM for each partition. The support vectors from each pair of classifiers are then aggregated into a new training set for which an SVM is solved. The process continues until a single classifier remains. The aggregation process can be iterated, using the support vectors of the final classifier in the previous iteration to seed the new classifiers.
One problem with this approach is that the data must be repeatedly shared between nodes, meaning that once again the goal of data distribution cannot be attained. The second problem, which might be more severe, is that the number of possible support vectors is restricted by the capacity of a single SVM solver.

Yom-Tov [8] proposed modifying the sequential algorithm developed in [7] to batch mode. In this way, the complete kernel matrix is held in distributed memory and the Lagrange multipliers are computed iteratively. This method has the advantage that it can efficiently solve difficult SVM problems that have many support vectors in their solution. Based on that work, we show in this paper how an SVM solution can be obtained by adapting a Gaussian Belief Propagation algorithm to the solution of the algorithm proposed in [7].

Recently, Hazan et al. proposed an iterative algorithm for parallel decomposition based on Fenchel duality [23]. Zanni et al. propose a decomposition method for computing SVMs in parallel [22]. We compare our run time results to both systems in Section 5.

2 Gaussian Belief Propagation

In this section we present our novel contribution: a Gaussian Belief Propagation solver for distributed computation of the SVM problem [19, 18, 21]. In the following, we provide a step-by-step derivation of our GaBP solver from our proposed cost function.

As stated in the previous section, our aim is to find $x^*$, a solution to the quadratic cost function given in eq. (9). Using linear algebra notation, we can rewrite the same cost function:

$$\min\ E(x) \triangleq x^T W x / 2 - y^T x.$$

As the matrix $W$ is symmetric (e.g., $W = S^T S$), the derivative of the quadratic form with respect to the vector $x$ is given by $E'(x) = Wx - y$. Thus, equating $E'(x) = 0$ gives the global minimum $x^*$ of this convex function, which is the desired solution $x^* = W^{-1} y$.
Now, one can define the following jointly Gaussian distribution

$$p(x) \triangleq Z^{-1} \exp(-E(x)) = Z^{-1} \exp(-x^T W x / 2 + y^T x), \quad (14)$$

where $Z$ is a distribution normalization factor. Defining the vector $\mu \triangleq W^{-1} y$, one gets the form

$$p(x) = Z^{-1} \exp(\mu^T W \mu / 2) \times \exp(-x^T W x / 2 + \mu^T W x - \mu^T W \mu / 2) = \zeta^{-1} \exp\big(-\tfrac{1}{2}(x - \mu)^T W (x - \mu)\big) = N(\mu, W^{-1}), \quad (15)$$

where the new normalization factor is $\zeta \triangleq Z \exp(-\mu^T W \mu / 2)$. To summarize to this point, the target solution $x^* = W^{-1} y$ is equal to $\mu \triangleq W^{-1} y$, the mean vector of the distribution $p(x)$, as defined in eq. (14). (An extension to non-symmetric matrices is discussed in [18]; for simplicity of argument, we handle the symmetric case in this paper.)

The formulation above allows us to shift the problem from an algebraic to a probabilistic domain. Instead of solving a deterministic vector-matrix linear equation, we now solve an inference problem in a graphical model describing a certain Gaussian distribution function. Given the graph of $W$ and the prior vector $y$, one knows how to write explicitly $p(x)$ (14) and the corresponding graph $G$ with edge potentials (compatibility functions) $\psi_{ij}$ and self-potentials ("evidence") $\phi_i$. These graph potentials are determined according to the following pairwise factorization of the Gaussian distribution $p(x)$ (14):

$$p(x) \propto \prod_{i=1}^K \phi_i(x_i) \prod_{\{i,j\}} \psi_{ij}(x_i, x_j), \quad (16)$$

resulting in $\psi_{ij}(x_i, x_j) \triangleq \exp(-x_i A_{ij} x_j)$ and $\phi_i(x_i) \triangleq \exp\big(b_i x_i - A_{ii} x_i^2 / 2\big)$. The set of edges $\{i, j\}$ corresponds to the set of non-zero entries of $W$. Hence, we would like to calculate the marginal densities, which must also be Gaussian,

$$p(x_i) \sim N\big(\mu_i = \{W^{-1} y\}_i,\ P_i^{-1} = \{W^{-1}\}_{ii}\big),$$

where $\mu_i$ and $P_i$ are the marginal mean and inverse variance (a.k.a. precision), respectively.
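The claim that the minimizer of the quadratic cost equals the Gaussian mean $\mu = W^{-1}y$ can be checked numerically. A minimal sketch, using a randomly generated symmetric positive definite $W$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
S = rng.standard_normal((n, n))
W = S.T @ S + n * np.eye(n)        # symmetric positive definite
y = rng.standard_normal(n)

x_star = np.linalg.solve(W, y)     # mu = W^{-1} y, the Gaussian mean

# The gradient E'(x) = W x - y vanishes at the minimizer / mean.
assert np.allclose(W @ x_star - y, 0, atol=1e-9)

# Perturbing x_star in any direction increases E(x) = x'Wx/2 - y'x,
# since the increase is exactly d'Wd/2 >= 0 for positive definite W.
E = lambda x: 0.5 * x @ W @ x - y @ x
for _ in range(10):
    d = rng.standard_normal(n) * 1e-3
    assert E(x_star + d) >= E(x_star)
```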
Recall that, according to our previous argument, the inferred mean $\mu_i$ is identical to the desired solution $x_i^*$. The move to the probabilistic domain calls for the utilization of BP as an efficient inference engine. The sum-product rule of BP for continuous variables, required in our case, is given by [1]

$$m_{ij}(x_j) = \alpha \int_{x_i} \psi_{ij}(x_i, x_j)\, \phi_i(x_i) \prod_{k \in N(i) \setminus j} m_{ki}(x_i)\, dx_i, \quad (17)$$

where $m_{ij}(x_j)$ is the message sent from node $i$ to node $j$ over their shared edge on the graph, the scalar $\alpha$ is a normalization constant, and the set $N(i) \setminus j$ denotes all the nodes neighboring $x_i$, except $x_j$. The marginals are computed according to the product rule [1]

$$p(x_i) = \alpha\, \phi_i(x_i) \prod_{k \in N(i)} m_{ki}(x_i). \quad (18)$$

GaBP is a special case of continuous BP where the underlying distribution is Gaussian. In [21] we show how to derive the GaBP update rules by substituting Gaussian distributions into the continuous BP equations. The output of this derivation is a set of update rules that are computed locally by each node. Our GaBP-based solver is summarized in Algorithm 1.

Algorithm 1 can be easily executed distributively. Each node $i$ receives as input the $i$'th row (or column) of the matrix $W$ and the scalar $b_i$. In each iteration, a message containing two reals, $\mu_{ij}$ and $P_{ij}$, is sent to every neighboring node through their mutual edge, corresponding to a non-zero $A_{ij}$ entry.

Convergence. If it converges, GaBP is known to result in exact inference [1]. In contrast to conventional iterative methods for the solution of systems of linear equations, for GaBP, determining the exact region of convergence and the convergence rate remain open research problems. All that is known is a sufficient (but not necessary) condition [4, 5] stating that GaBP converges when the spectral radius satisfies $\rho(|I_K - W|) < 1$.
A stricter sufficient condition [1], actually proved earlier, determines that the matrix $W$ must be diagonally dominant (i.e., $|W_{ii}| > \sum_{j \ne i} |W_{ij}|,\ \forall i$) in order for GaBP to converge.

Efficiency. The local computation at a node in each round is fairly minimal. Each node $i$ computes locally two scalar values $\mu_{ij}, P_{ij}$ for each neighbor $j \in N(i)$. Convergence time is dependent on both the inputs and the network topology. Empirical results are provided in Section 5.

Algorithm 1
1. Initialize:
   - Set the neighborhood $N(i)$ to include every $k \ne i$ such that $A_{ki} \ne 0$.
   - Set the scalar fixes $P_{ii} = A_{ii}$ and $\mu_{ii} = b_i / A_{ii}$, $\forall i$.
   - Set the initial $k \to i$ scalar messages $P_{ki} = 0$ and $\mu_{ki} = 0$ for all $k \in N(i)$.
   - Set a convergence threshold $\epsilon$.
2. Iterate:
   - Propagate the $k \to i$ messages $P_{ki}$ and $\mu_{ki}$, $\forall i$ (under a certain scheduling).
   - Compute the $i \to j$ scalar messages
     $P_{ij} = -A_{ij}^2 \big/ \big(P_{ii} + \sum_{k \in N(i) \setminus j} P_{ki}\big)$,
     $\mu_{ij} = \big(P_{ii}\mu_{ii} + \sum_{k \in N(i) \setminus j} P_{ki}\mu_{ki}\big) \big/ A_{ij}$.
3. Check:
   - If the messages $P_{ij}$ and $\mu_{ij}$ did not converge (w.r.t. $\epsilon$), return to Step 2. Else, continue to Step 4.
4. Infer:
   - Compute the marginal means $\mu_i = \big(P_{ii}\mu_{ii} + \sum_{k \in N(i)} P_{ki}\mu_{ki}\big) \big/ \big(P_{ii} + \sum_{k \in N(i)} P_{ki}\big)$, $\forall i$.
   - (Optionally compute the marginal precisions $P_i = P_{ii} + \sum_{k \in N(i)} P_{ki}$.)
5. Solve:
   - Find the solution $x_i^* = \mu_i$, $\forall i$.

3 Proposed Solution of the SVM Solver Based on GaBP

For our proposed solution, we take the exponent of the dual SVM formulation given in equation (9) and solve $\max \exp(L_D(h))$. Since $\exp(\cdot)$ is monotonic, the maximizer of $\exp(L_D(h))$ coincides with the maximizer of $L_D(h)$, which is a global maximum since the matrix $D$ is symmetric and positive definite.
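Algorithm 1 can be sketched serially in a few lines; the distributed version partitions the rows of $A$ across nodes, but the message updates are identical. The following minimal Python sketch (a small diagonally dominant system chosen so that the sufficient convergence condition above holds) recovers $x = A^{-1}b$:

```python
import numpy as np

def gabp(A, b, max_iter=100, eps=1e-8):
    """Serial sketch of GaBP (Algorithm 1): solves A x = b for a symmetric,
    diagonally dominant A via scalar (precision, mean) messages."""
    n = len(b)
    P = np.zeros((n, n))     # P[k, i]: precision message k -> i
    mu = np.zeros((n, n))    # mu[k, i]: mean message k -> i
    Pii = np.diag(A).copy()
    mii = b / Pii
    N = [np.flatnonzero((A[i] != 0) & (np.arange(n) != i)) for i in range(n)]
    for _ in range(max_iter):
        P_new, mu_new = P.copy(), mu.copy()
        for i in range(n):
            for j in N[i]:
                ks = [k for k in N[i] if k != j]
                P_excl = Pii[i] + sum(P[k, i] for k in ks)
                m_excl = Pii[i] * mii[i] + sum(P[k, i] * mu[k, i] for k in ks)
                P_new[i, j] = -A[i, j] ** 2 / P_excl      # step 2, precision
                mu_new[i, j] = m_excl / A[i, j]           # step 2, mean
        done = (np.abs(P_new - P).max() < eps and np.abs(mu_new - mu).max() < eps)
        P, mu = P_new, mu_new
        if done:
            break
    # Step 4: infer the marginal means, which are the solution x*.
    x = np.empty(n)
    for i in range(n):
        num = Pii[i] * mii[i] + sum(P[k, i] * mu[k, i] for k in N[i])
        den = Pii[i] + sum(P[k, i] for k in N[i])
        x[i] = num / den
    return x

# Diagonally dominant test system: GaBP converges and matches a direct solve.
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 5.0, 1.0],
              [0.5, 1.0, 4.0]])
b = np.array([1.0, -2.0, 3.0])
assert np.allclose(gabp(A, b), np.linalg.solve(A, b), atol=1e-6)
```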
Now we can interpret the new problem formulation as a probability density function, which is itself Gaussian:

$$p(h) \propto \exp\big(-\tfrac{1}{2} h' D h + h' \mathbf{1}\big),$$

where $\mathbf{1}$ is the all-ones vector $(1, 1, \cdots, 1)$, and find the assignment $\hat{h} = \arg\max p(h)$. It is known [5] that in Gaussian models finding the MAP assignment is equivalent to solving the inference problem. To solve the inference problem, namely computing the marginals $\hat{h}$, we propose using the GaBP algorithm, which is a distributed message passing algorithm. We take the computed $\hat{h}$ as the Lagrange multiplier weights of the support vectors of the original SVM data points and apply a threshold for choosing data points with non-zero weight as support vectors.

Note that using this formulation we ignore the remaining constraints (10), (11). In other words, we do not solve the SVM problem, but the kernel ridge regression problem. Nevertheless, empirical results presented in Section 5 show that we achieve very good classification accuracy vs. state-of-the-art SVM solvers.

Finally, following [7], we remove the explicit bias term $b$ and instead add another dimension to the pattern vector $x_i$, such that $\acute{x}_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,m}, \lambda)$, where $\lambda$ is a scalar constant. The modified weight vector, which incorporates the bias term, is written as $\acute{w} = (w_1, w_2, \ldots, w_m, b/\lambda)$. However, this modification causes a change to the optimized margin. Vijayakumar and Wu [7] discuss the effect of this modification and reach the conclusion that "setting the augmenting term to zero (equivalent to neglecting the bias term) in high dimensional kernels gives satisfactory results on real world data". We did not completely neglect the bias term; in our experiments, which used the Radial Basis Kernel, we set it to $1/N$, as proposed in [8].
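Putting the pieces together, the pipeline is: build $D$, weight its diagonal (as discussed in Section 3.1), solve $Dh = \mathbf{1}$, threshold $h$ to pick support vectors, and classify by the sign of the kernel expansion. A minimal end-to-end sketch on toy data follows; a direct solve stands in for the distributed GaBP iteration, which computes the same $h = D^{-1}\mathbf{1}$, and the kernel width, $C$, and threshold are illustrative choices, not the paper's tuned values.

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy two-class data: two Gaussian blobs.
X = np.vstack([rng.normal(-1, 0.4, (20, 2)), rng.normal(1, 0.4, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)

gamma, C = 1.0, 10.0
K = np.exp(-gamma * ((X[:, None] - X[None, :]) ** 2).sum(-1))
D = np.outer(y, y) * K
D += (1.0 / C) * np.eye(len(y))          # diagonal weighting (Section 3.1)

# GaBP would compute h = D^{-1} 1 by message passing; a direct solve
# stands in for it in this sketch.
h = np.linalg.solve(D, np.ones(len(y)))

# Points whose weight exceeds a threshold are taken as support vectors.
sv = np.flatnonzero(np.abs(h) > 1e-3)

def predict(x):
    k = np.exp(-gamma * ((X[sv] - x) ** 2).sum(-1))
    return np.sign(h[sv] * y[sv] @ k)

train_acc = np.mean([predict(x) == t for x, t in zip(X, y)])
assert train_acc > 0.9
```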
3.1 GaBP Algorithm Convergence

In order to force the algorithm to converge, we artificially weight the main diagonal of the kernel matrix $D$ to make it diagonally dominant. Section 5 outlines our empirical results showing that this modification did not significantly affect the classification error on any of the tested datasets. A partial justification for weighting the main diagonal is found in [6]. In the 2-norm soft margin formulation of the SVM problem, the sum of squared slack variables is minimized:

$$\min_{\xi, w, b}\ \|w\|_2^2 + C \sum_i \xi_i^2 \quad \text{s.t.}\ y_i(w \cdot x_i + b) \ge 1 - \xi_i.$$

The dual problem is derived:

$$W(h) = \sum_i h_i - \tfrac{1}{2} \sum_{i,j} y_i y_j h_i h_j \big(x_i \cdot x_j + \tfrac{1}{C}\delta_{ij}\big),$$

where $\delta_{ij}$ is the Kronecker delta, defined to be 1 when $i = j$ and zero elsewhere. It is shown that the only change relative to the 1-norm soft margin SVM is the addition of $1/C$ to the diagonal of the inner product matrix associated with the training set. This has the effect of adding $1/C$ to the eigenvalues, rendering the kernel matrix (and thus the GaBP problem) better conditioned [6].

3.2 Convergence in Asynchronous Settings

One of the desired properties of a large scale algorithm is that it should converge in asynchronous settings as well as in synchronous settings. This is because in a large-scale communication network, clocks are not synchronized accurately and some nodes may be slower than others, while some nodes experience longer communication delays.

Recent work by Koller et al. [3] defines conditions for the convergence of belief propagation. This work defines a distance metric on the space of BP messages; if this metric forms a max-norm construction, the BP algorithm converges under some assumptions.
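The eigenvalue-shift argument of Section 3.1 is easy to illustrate numerically: adding $1/C$ to the diagonal shifts every eigenvalue of the Gram matrix up by exactly $1/C$, which strictly reduces the condition number. A small sketch (random data and an arbitrary kernel width, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 5))
# RBF Gram matrix; the width 0.05 is an arbitrary illustrative choice.
K = np.exp(-0.05 * ((X[:, None] - X[None, :]) ** 2).sum(-1))

C = 10.0
K_loaded = K + (1.0 / C) * np.eye(30)    # the diagonal weighting of Section 3.1

ev = np.linalg.eigvalsh(K)
ev_loaded = np.linalg.eigvalsh(K_loaded)

# Every eigenvalue is shifted up by exactly 1/C ...
assert np.allclose(ev_loaded, ev + 1.0 / C)
# ... so the condition number (largest / smallest eigenvalue) shrinks.
assert ev_loaded[-1] / ev_loaded[0] < ev[-1] / ev[0]
```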
Using experiments on various network sizes, up to a sparse matrix of one million by one million nodes, the algorithm converged asynchronously in all cases where it converged in synchronous settings. Furthermore, as noted in [3], in asynchronous settings the algorithm converges faster as compared to synchronous settings.

4 Algorithm Optimization

Instead of sending a message composed of the pair $\mu_{ij}$ and $P_{ij}$, a node can broadcast the aggregated sums

$$\tilde{P}_i = P_{ii} + \sum_{k \in N(i)} P_{ki}, \quad (19)$$
$$\tilde{\mu}_i = \tilde{P}_i^{-1} \big(P_{ii}\mu_{ii} + \sum_{k \in N(i)} P_{ki}\mu_{ki}\big). \quad (20)$$

Consequently, each node can retrieve locally the $P_{i \setminus j}$ and $\mu_{i \setminus j}$ from the sums by means of a subtraction:

$$P_{i \setminus j} = \tilde{P}_i - P_{ji}, \quad (21)$$
$$\mu_{i \setminus j} = P_{i \setminus j}^{-1} \big(\tilde{P}_i \tilde{\mu}_i - P_{ji}\mu_{ji}\big). \quad (22)$$

The rest of the algorithm remains the same.

Algorithm 2
1. Initialize:
   - Set the neighborhood $N(i)$ to include every $k \ne i$ such that $A_{ki} \ne 0$.
   - Set the scalar fixes $P_{ii} = A_{ii}$ and $\mu_{ii} = b_i / A_{ii}$, $\forall i$.
   - Set the initial $i \to N(i)$ broadcast messages $\tilde{P}_i = 0$ and $\tilde{\mu}_i = 0$.
   - Set the initial $k \to i$ internal scalars $P_{ki} = 0$ and $\mu_{ki} = 0$ for all $k \in N(i)$.
   - Set a convergence threshold $\epsilon$.
2. Iterate:
   - Broadcast the aggregated sum messages $\tilde{P}_i = P_{ii} + \sum_{k \in N(i)} P_{ki}$, $\tilde{\mu}_i = \tilde{P}_i^{-1}\big(P_{ii}\mu_{ii} + \sum_{k \in N(i)} P_{ki}\mu_{ki}\big)$, $\forall i$ (under a certain scheduling).
   - Compute the $i \to j$ internal scalars $P_{ij} = -A_{ij}^2 / (\tilde{P}_i - P_{ji})$, $\mu_{ij} = (\tilde{P}_i\tilde{\mu}_i - P_{ji}\mu_{ji}) / A_{ij}$.
3. Check:
   - If the internal scalars $P_{ij}$ and $\mu_{ij}$ did not converge (w.r.t. $\epsilon$), return to Step 2. Else, continue to Step 4.
4. Infer:
   - Compute the marginal means $\mu_i = \big(P_{ii}\mu_{ii} + \sum_{k \in N(i)} P_{ki}\mu_{ki}\big) \big/ \big(P_{ii} + \sum_{k \in N(i)} P_{ki}\big) = \tilde{\mu}_i$, $\forall i$.
   - (Optionally compute the marginal precisions $P_i = P_{ii} + \sum_{k \in N(i)} P_{ki} = \tilde{P}_i$.)
5. Solve:
   - Find the solution $x_i^* = \mu_i$, $\forall i$.

5 Experimental Results

We implemented our proposed algorithm using approximately 1,000 lines of code in C. We implemented communication between the nodes using the MPICH2 message passing interface [24]. Each node was responsible for $d$ data points out of the total $n$ data points in the dataset. Our implementation used synchronous communication rounds because of MPI limitations. In Section 6 we further elaborate on this issue.

Each node was assigned several examples from the input file. Then, the kernel matrix $D$ was computed by the nodes in a distributed fashion, so that each node computed the rows of the kernel matrix related to its assigned data points. After computing the relevant parts of the matrix $D$, the nodes weighted the diagonal of the matrix $D$, as discussed in Section 3.1. Then, several rounds of communication between the nodes were run. In each round, using our optimization, a total of $n$ sums were calculated using the MPI_Allreduce system call. Finally, each node output the solution $x$, which was the mean of the input Gaussian that matched its own data points. Each $x_i$ signified the weight of data point $i$ for being chosen as a support vector.

To compare our algorithm's performance, we used two algorithms: Sequential SVM (SVMSeq) [7] and SVMlight [16]. We used the SVMSeq implementation provided within the IBM Parallel Machine Learning (PML) toolbox [10]. The PML implements the same algorithm by Vijayakumar and Wu [7] that our GaBP solver is based on, but the implementation is through a master-slave architecture as described in [8]. SVMlight is a single computing node solver.

Table 1 describes the seven datasets we used to compare the algorithms and the classification accuracy obtained.
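The recovery-by-subtraction identities of Section 4 (equations (19)-(22)), which make the per-round Allreduce of $n$ aggregated sums sufficient, can be checked numerically. This sketch fabricates one round of per-edge messages (random values standing in for a real GaBP iteration) and verifies that the exclusion sums recovered from the broadcast aggregates match the directly computed ones:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
# Fabricated per-edge messages P[k, i], mu[k, i] standing in for one GaBP round.
P = np.abs(rng.standard_normal((n, n)))
np.fill_diagonal(P, 0)
mu = rng.standard_normal((n, n))
Pii = 1.0 + np.abs(rng.standard_normal(n))
mii = rng.standard_normal(n)

max_err = 0.0
for i in range(n):
    # Broadcast aggregates (19)-(20): one message per node instead of per edge.
    P_tilde = Pii[i] + P[:, i].sum()
    mu_tilde = (Pii[i] * mii[i] + P[:, i] @ mu[:, i]) / P_tilde
    for j in range(n):
        if j == i:
            continue
        # Direct exclusion sums, as computed edge-by-edge in Algorithm 1.
        P_excl = Pii[i] + P[:, i].sum() - P[j, i]
        mu_excl = (Pii[i] * mii[i] + P[:, i] @ mu[:, i] - P[j, i] * mu[j, i]) / P_excl
        # Recovery by subtraction, (21)-(22).
        P_rec = P_tilde - P[j, i]
        mu_rec = (P_tilde * mu_tilde - P[j, i] * mu[j, i]) / P_rec
        max_err = max(max_err, abs(P_rec - P_excl), abs(mu_rec - mu_excl))

assert max_err < 1e-9
```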
These computations were done using five processing nodes (3.5 GHz Intel Pentium machines, running the Linux operating system) for each of the parallel solvers. All datasets were taken from the UCI repository [11]. We used medium-sized datasets (up to 20,000 examples) so that run times using SVMlight would not be prohibitively high.

Table 1: Error rates of the GaBP solver versus those of the parallel sequential solver and SVMlight.

  Dataset      Dimension   Train    Test    GaBP (%)   Sequential (%)   SVMlight (%)
  Isolet       617         6238     1559    7.06       5.84             49.97
  Letter       16          20000    -       2.06       2.06             2.3
  Mushroom     117         8124     -       0.04       0.05             0.02
  Nursery      25          12960    -       4.16       5.29             0.02
  Pageblocks   10          5473     -       3.86       4.08             2.74
  Pen digits   16          7494     3498    1.66       1.37             1.57
  Spambase     57          4601     -       16.3       16.5             6.57

Table 2: Running times (in seconds) of the GaBP solver (working in a distributed environment) compared to that of the IBM parallel solver.

  Dataset      GaBP   Sequential
  Isolet       228    1328
  Letter       468    601
  Mushroom     226    176
  Nursery      221    297
  Pageblocks   26     37
  Pen digits   45     155
  Spambase     49     79

All algorithms were run with an RBF kernel. The parameters of the algorithm (kernel width and misclassification cost) were optimized using a line-search algorithm, as detailed in [12]. Note that SVMlight is a single node solver, which we use mainly as a comparison for the classification accuracy. Using the Friedman test [13], we did not detect any statistically significant difference between the performance of the algorithms with regard to accuracy ($p < 10^{-3}$).

Figure 1 shows the speedup results of the algorithm when running the GaBP algorithm on a Blue Gene supercomputer. The speedup with $N$ nodes is computed as the run time of the algorithm on a single node, divided by the run time using $N$ nodes.
Obviously, it is desirable to obtain linear speedup, i.e., doubling computational power halves the processing time, but this is limited by the communication load and by parts of the algorithm that cannot be parallelized. Since Blue Gene is currently limited to 0.5 GB of memory at each node, most datasets could not be run on a single node. We therefore show speedup relative to two nodes. As the figure shows, in most cases we get a linear speedup up to 256 CPUs, which means that the running time is inversely proportional to the number of CPUs used. When using 512 to 1,024 CPUs, the communication overhead reduces the efficiency of the parallel computation. We identified this problem as an area for future research into optimizing the performance for larger scale grids.

We also tested the ability to build classifiers for larger datasets. Table 3 shows the run times of the GaBP algorithm using 1,024 CPUs on two larger datasets, both from the UCI repository. This demonstrates the ability of the algorithm to process very large datasets in a reasonably short amount of time. We compare our running time to the state-of-the-art parallel decomposition methods by Zanni et al. [22] and Hazan et al. [23]. On the MNIST dataset we were considerably slower, by a factor of two, but on the larger Covertype dataset we have superior performance. Note that it is hard to compare running times since the machines used for experimentation are different. Zanni used 16 Pentium IV machines with 16 GB memory, Hazan used 10 Pentium IV machines with 4 GB memory, while we used a larger number of weaker Pentium IV machines with 400 MB of memory. Furthermore, for the Covertype dataset we used only 150,000 data points while Zanni and Hazan used the full dataset, which is twice as large.
Figure 1: Speedup of the GaBP algorithm relative to 2 CPUs (log-log plot of speedup vs. number of CPUs for the Isolet, Pageblocks, Pen digits, and Spambase datasets).

Table 3: Running times of the GaBP solver for large datasets using 1,024 CPUs on an IBM Blue Gene supercomputer. Running time results are compared to two state-of-the-art solvers: [22] and [23].

  Dataset     Dim   Num of examples     GaBP (sec)   [22] (sec)   [23] (sec)
  Covertype   54    150,000/300,000     468          24365        16742
  MNIST       784   60,000              756          359          18

6 Discussion

In this paper we demonstrated the application of Gaussian Belief Propagation to the solution of SVM problems. Our experiments demonstrate the usefulness of this solver, being both accurate and scalable.

We implemented our algorithm using a synchronous communication model mainly because MPICH2 does not support asynchronous communication. While synchronous communication is the mode of choice for supercomputers such as Blue Gene, in many cases, such as heterogeneous grid environments, asynchronous communication will be preferred. We believe that the next challenging goal will be to implement the proposed algorithm in asynchronous settings, where algorithm rounds will no longer be synchronized. Our initial experiments with very large sparse kernel matrices (millions of data points) show that asynchronous settings converge faster. Recent work by Koller [3] supports this claim by showing that in many cases the BP algorithm converges faster in asynchronous settings.

Another challenging task would involve scaling to datasets of millions of data points. Currently the full kernel matrix is computed by the nodes.
While this is effective for problems with many support vectors [8], it is not required in many problems which are either easily separable, or else where the classification error is less important compared to the time required to learn the model. Thus, solvers scaling to much larger datasets may have to diverge from the current strategy of computing the full kernel matrix and instead sparsify the kernel matrix, as is commonly done in single node solvers.

Finally, it remains an open question whether SVMs can be solved efficiently in peer-to-peer environments, where each node can (efficiently) obtain data from only several close peers. Future work will be required in order to verify how the GaBP algorithm performs in such an environment, where only partial segments of the kernel matrix can be computed by each node.

References

[1] Y. Weiss and W. T. Freeman. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. In NIPS 12, 1999.
[2] J. K. Johnson. Walk-summable Gauss-Markov random fields. Technical report, February 2002 (corrected, November 2005).
[3] G. Elidan, I. McGraw, and D. Koller. Residual belief propagation: Informed scheduling for asynchronous message passing. In Proceedings of the Twenty-second Conference on Uncertainty in AI (UAI), Boston, Massachusetts, 2006.
[4] J. K. Johnson, D. M. Malioutov, and A. S. Willsky. Walk-sum interpretation and analysis of Gaussian belief propagation. In Advances in Neural Information Processing Systems, vol. 18, pp. 579-586, 2006.
[5] D. M. Malioutov, J. K. Johnson, and A. S. Willsky. Walk-sums and belief propagation in Gaussian graphical models. Journal of Machine Learning Research, vol. 7, pp. 2031-2064, October 2006.
[6] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
ISBN 0-521 - 78019 -5. [7] Sethu Vijay akumar and Si W u (1 999), Sequential Support V ector Classifiers and R e gression. Pro c. Internati o nal Conference on Soft Compu ti ng (SOCO’99), Geno a, Ital y , pp.610-6 19. [8] Elad Y o m-T ov (2007), A distributed seq u ential solver for large scale SVMs. In: O. Chap elle, D. DeCoste, J. Weston, L . Bottou: Large scal e kernel machines. MIT Press, pp. 141-156 . [9] B. Sc h ¨ olkopf and A. J. Smola. Learning with kernels: Supp ort vecto r machine s , regulariza- tion, optimization, and b eyond. M IT Press , Cambridge, MA, USA, 20 02. [10] htt p://www.alphawork s.ibm.com/tech/pml [11] Catherine L. Blake, Eamonn J. Keo gh, and Christo pher J. Merz. UCI repo sitor y of machine learning databases, 1998. URL http://www.i cs.uci.edu/$\sim$ mlearn/MLRepository.html . [12] R. Rifkin and A. Klautau. In defense of One-vs-All classification. Journal o f Mac hine Learn- ing Research, 5:10 1–141, 200 4. [13] J. Dem ˘ sar. Statistical comparisons of classifiers over multiple data sets. Journal of Machin e Learning Research, 7 :1–30, 20 06. [14] G. Zangh irati and L. Zann i. A parallel solver for large quadratic p rograms in training s upport vector machines. Pa rall el computi n g, 29:5 35-551, 20 03. [15] R. Collob ert, S. Bengio, and Y. Bengi o. A parallel mixture of svms for very large sc ale pr oblems. In Advances in Neural Information Pro cessing Systems. MIT Press, 2002. [16] T. Jo a chims. M aking large-scale svm learning practical. In ”B. Sch¨ o lkopf, C. Burges, A. Smola” (Editors), Advances in Kernel M ethod s - Supp ort V ector Learning , [17] H. P . Graf, E. Cosatto, L. Bottou, I. Durdanovic, and V. V apnik. Pa rall e l supp ort vecto r machines: T he cascade svm. In Advances i n Neural Information Processin g Systems , 2004 . [18] D. Bickson, O. Shental, P . H . Siegel, J. K. W olf, and D. Dolev. Gaussi a n b elief p ropa gation based multius er detection. In IEEE Int. Symp. o n Inform. 
Theory (ISIT) , T oronto, Canada, July 2008, to app ear . [19] O. Shen t a l , D. Bickso n , P . H. Sieg e l, J. K. Wolf and D. Dolev. Gaussian bel i ef p ropa g ation solver for systems of linear equation s.In IEE E Int. Symp. o n Info rm. Theory (ISIT) , T oronto, Canada, July 20 0 8, to app ear. 11 [20] D.Bick s on, D. Dolev and E . Y o m-T ov, Solving Large Sca l e Kernel Ridg e Regression using A Gaussian Belief Propagati on So lver i n NIPS Wo rksho p on E fficient Machine Learning , Canada, 2007 . [21] D. Bick son, O. Shental, P . H. Siegel, J. K. Wolf, an d D. Dolev. Lin ea r detection via b e lief propagation, in 45th Allerton Conf. on Communi c ations, Control and Computi ng , Monticello , IL, USA, Sept. 2 007. [22] L. Zanni, T. Serafi ni and Gaetano Zanghi rati. Pa ralle l Softw ar e for T raining Large Scale Supp o rt Vector M achines on Multiprocessor Sy s tems. In proc of Journal of Machin e Learning Research, Vol. 7 (July 2 006), pp . 1 467-149 2. [23] T. Hazan, A. Man and A. Shashua. A Pa rallel Decomp ositio n So lver for SVM: Distributed Dual Asce n t using Fenchel Duality . In Conference on Computer Vision and Pattern Recog- nition (CVPR) , Anchorage, Jun e 2008, to appear. [24] MPI mess age passi n g i n terface. http://www-un ix.mcs.anl.gov/mp i/mpich/ 12
