Fixed-point and coordinate descent algorithms for regularized kernel methods
Authors: Francesco Dinuzzo
July 2, 2018

Abstract

In this paper, we study two general classes of optimization algorithms for kernel methods with convex loss function and quadratic norm regularization, and analyze their convergence. The first approach, based on fixed-point iterations, is simple to implement and analyze, and can be easily parallelized. The second, based on coordinate descent, exploits the structure of additively separable loss functions to compute solutions of line searches in closed form. Instances of these general classes of algorithms are already incorporated into state-of-the-art machine learning software for large-scale problems. We start from a solution characterization of the regularized problem, obtained using subdifferential calculus and resolvents of monotone operators, that holds for general convex loss functions regardless of differentiability. The two methodologies described in the paper can be regarded as instances of non-linear Jacobi and Gauss–Seidel algorithms, and are both well suited to solve large-scale problems.

1 Introduction

The development of optimization software for learning from large datasets is heavily influenced by the memory hierarchies of computer storage. In the presence of memory constraints, most high-order optimization methods become unfeasible, whereas techniques such as coordinate descent or stochastic gradient descent may exploit the specific structure of learning functionals to scale well with the dataset size. Considerable effort has been devoted to making kernel methods feasible on large-scale problems [Bottou et al., 2007]. One of the most important features of modern machine learning methodologies is the ability to leverage sparsity in order to obtain scalability.
Typically, learning methods that impose sparsity are based on the minimization of non-differentiable objective functionals. This is the case for support vector machines and for methods based on ℓ1 regularization. In this paper, we analyze optimization algorithms for a general class of regularization functionals, using subdifferential calculus and resolvents of monotone operators [Rockafellar, 1970] to manage non-differentiability. In particular, we study learning methods that can be interpreted as the minimization of a convex empirical risk term plus a squared norm regularization in a reproducing kernel Hilbert space [Aronszajn, 1950] H_K with non-null reproducing kernel K, namely

    min_{g ∈ H_K} f(g(x_1), …, g(x_ℓ)) + ||g||²_{H_K} / 2,    (1)

where f : ℝ^ℓ → ℝ₊ is a finite-valued, bounded-below, convex function. Regularization problems of the form (1) admit a unique optimal solution which, in view of the representer theorem [Schölkopf et al., 2001], can be represented as a finite linear combination of kernel sections:

    g(x) = Σ_{i=1}^ℓ c_i K_{x_i}(x).

We characterize optimal coefficients c_i of the linear combination via a family of non-linear equations. Then, we introduce two general classes of optimization algorithms for large-scale regularization methods that can be regarded as instances of non-linear Jacobi and Gauss–Seidel algorithms, and analyze their convergence properties. Finally, we state a theorem that shows how to reformulate convex regularization problems so as to trade off positive semidefiniteness of the kernel matrix for differentiability of the empirical risk.
2 Solution characterization

As a consequence of the representer theorem, an optimal solution of problem (1) can be obtained by solving finite-dimensional optimization problems of the form

    min_{c ∈ ℝ^ℓ} F(c),    F(c) = f(Kc) + cᵀKc/2,    (2)

where K ∈ ℝ^{ℓ×ℓ} is a non-null symmetric positive semidefinite matrix called the kernel matrix. The entries k_ij of the kernel matrix are given by k_ij = K(x_i, x_j), where K : X × X → ℝ is a positive semidefinite kernel function. It is easy to verify that the resulting kernel matrix is symmetric and positive semidefinite. Let k_i (i = 1, …, ℓ) denote the columns of the kernel matrix. Particularly interesting is the case in which the function f is additively separable.

Definition 1 (Additively separable functional). A functional f : ℝ^ℓ → ℝ is called additively separable if

    f(z) = Σ_{i=1}^ℓ f_i(z_i).    (3)

Parametric models with ℓ2 (ridge) regularization correspond to the case in which inputs are n-dimensional numeric vectors (X = ℝⁿ) and the kernel matrix is chosen as K = XXᵀ, where X ∈ ℝ^{ℓ×n} is a matrix whose rows are the input data x_i. Letting

    w := Xᵀc,    (4)

the following class of problems is obtained:

    min_{w ∈ ℝⁿ} f(Xw) + ||w||₂²/2.    (5)

Observe that one can optimize over the whole space ℝⁿ, since the optimal weight vector will automatically be of the form (4). Parametric models with ℓ2 regularization can thus be seen as specific instances of kernel methods in which K is the linear kernel: K(x_1, x_2) = ⟨x_1, x_2⟩₂. In the following, two key mathematical objects will be used to characterize optimal solutions of problems (2) and (5). The first is the subdifferential ∂f of the empirical risk. The second is the resolvent of the inverse subdifferential, defined as

    J_α := (I + α(∂f)⁻¹)⁻¹.    (6)

See the appendix for more details about these objects.
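The equivalence between the kernel form (2) with the linear kernel and the parametric form (5) can be checked numerically. The sketch below is our illustration, not the paper's code; we pick the quadratic (RLS) loss f(z) = ||y − z||²/(2λ), for which both problems have closed-form minimizers, and verify that the solutions are linked by (4).

```python
import numpy as np

rng = np.random.default_rng(0)
ell, n, lam = 8, 3, 0.3
X = rng.standard_normal((ell, n))   # rows are the inputs x_i
y = rng.standard_normal(ell)

# Kernel form (2) with the linear kernel K = X X^T and RLS loss:
# the optimality condition 0 = (Kc - y)/lam + c gives (K + lam*I) c = y.
K = X @ X.T
c = np.linalg.solve(K + lam * np.eye(ell), y)

# Parametric form (5): (X^T X + lam*I) w = X^T y.
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# The two solutions are linked by w = X^T c, as in (4).
print(np.allclose(w, X.T @ c))  # → True
```

The agreement follows from the push-through identity Xᵀ(XXᵀ + λI)⁻¹ = (XᵀX + λI)⁻¹Xᵀ.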
The following result characterizes optimal solutions of problem (2) via a non-linear equation involving J_α. The characterization also holds for non-differentiable loss functions, and is obtained without introducing constrained optimization problems. The proof of Theorem 1 is given in the appendix.

Theorem 1. For any α > 0, there exist optimal solutions of problem (2) such that

    c = −J_α(αKc − c),    (7)

where J_α is the resolvent of the inverse subdifferential (∂f)⁻¹, see (6).

The usefulness of condition (7) depends on the possibility of computing closed-form expressions for the resolvent, which may not be feasible for general convex functionals. Remarkably, for many learning methods one can exploit the specific structure of f to work out closed-form expressions. For instance, when f is additively separable as in (3), the subdifferential decouples with respect to the different components. In such a case, the computation of the resolvent reduces to the inversion of a function of a single variable, which can often be obtained in closed form. Indeed, in many supervised learning problems, additive separability holds with f_i(z_i) = λ⁻¹L(y_i, z_i), where L : ℝ × ℝ → ℝ₊ is a loss function and λ > 0 is a regularization parameter. Table 1 reports the expression of J_α for commonly used loss functions. When f is additively separable, the characterization (7) can be generalized as follows.

Corollary 1. Assume that (3) holds. Then, for any α_i > 0, i = 1, …, ℓ, there exist optimal solutions of problem (2) such that

    c_i = −J^i_{α_i}(α_i k_iᵀc − c_i),    i = 1, …, ℓ,    (8)

where J^i_{α_i} are the resolvents of the inverse subdifferentials (∂f_i)⁻¹, see (6).
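For the RLS loss, condition (7) can be verified numerically. The following sketch (ours; we fix α = 1 for simplicity) solves the RLS optimality condition 0 = (Kc − y)/λ + c, i.e. (K + λI)c = y, and checks that c is indeed a fixed point of c = −J_α(αKc − c), where −J₁(v) = (y − v)/(1 + λ) is the RLS resolvent of Table 1 at α = 1.

```python
import numpy as np

rng = np.random.default_rng(1)
ell, lam = 10, 0.5
x = rng.standard_normal((ell, 2))
# Gaussian kernel matrix (a common choice; symmetric positive semidefinite).
sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)
y = rng.standard_normal(ell)

# RLS optimal coefficients: 0 = (Kc - y)/lam + c  <=>  (K + lam*I) c = y.
c = np.linalg.solve(K + lam * np.eye(ell), y)

# Fixed-point characterization (7) with alpha = 1:
# c = -J_1(K c - c), where -J_1(v) = (y - v)/(1 + lam) for the RLS loss.
v = K @ c - c
print(np.allclose(c, (y - v) / (1 + lam)))  # → True
```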
In this paper, we analyze two iterative approaches to compute optimal solutions of problem (2), based on the solution characterizations of Theorem 1 and Corollary 1. For both methods, we show that cluster points of the iteration sequence are optimal solutions, and that

    min_{c ∈ ℝ^ℓ} F(c) = lim_{k→+∞} F(c^k),    (9)

where F denotes the functional of problem (2).

    Name    | Loss L(y_1, y_2)       | Operator −J_α(v)
    --------|------------------------|----------------------------------------------
    L1-SVM  | (1 − y_1 y_2)_+        | y ⊙ min((αλ)⁻¹, (1 − y ⊙ v)_+)
    L2-SVM  | (1 − y_1 y_2)²_+       | y ⊙ (1 − y ⊙ v)_+ / (1 + αλ)
    RLS     | (y_1 − y_2)²/2         | (y − v)/(1 + αλ)
    RLA     | |y_1 − y_2|            | sign(y − v) ⊙ min((αλ)⁻¹, |y − v|)
    SVR     | (|y_1 − y_2| − ε)_+    | sign(y − v) ⊙ min((αλ)⁻¹, (|y − v| − ε)_+)

Table 1: Operator −J_α for different methods. Some of the losses are expressed using the "positive part" function defined as (x)_+ = max{0, x}. In the rightmost column, ⊙ denotes the element-wise product, and functions are applied component-wise.

Section 3 describes a first approach, which involves simply iterating equation (7) according to the fixed-point method. The method can also be regarded as a non-linear Jacobi algorithm to solve equation (7). It is shown that α can always be chosen so as to make the iterations approximate an optimal solution to arbitrary precision. In Section 4, we describe a second approach, which involves separately iterating the single components using the characterization of equation (8). For a suitable choice of α_i, the method boils down to coordinate descent, and optimality of cluster points holds whenever indices are picked according to an "essentially cyclic" rule. Equivalently, the method can be regarded as a non-linear Gauss–Seidel algorithm to solve (8).
3 Fixed-point algorithms

In this section, we suggest computing the optimal coefficient vector c of problem (2) by simply iterating equation (7), starting from any initial condition c⁰:

    c^{k+1} = −J_α(αKc^k − c^k).    (10)

Such a procedure is the well-known fixed-point iteration (also known as Picard or Richardson iteration) method. Provided that α is properly chosen, the procedure can be used to solve problem (2) to any given accuracy. Before analyzing the convergence properties of method (10), let's study the computational complexity of a single iteration. To this end, one can decompose the iteration into three intermediate steps:

    z^k = Kc^k,              (step 1)
    v^k = αz^k − c^k,        (step 2)
    c^{k+1} = −J_α(v^k).     (step 3)

The decomposition emphasizes the separation between the role of the kernel (affecting only step 1) and the role of the function f (affecting only step 3).

Step 1

Step one is the only one that involves the kernel matrix. Generally, it is also the most computationally and memory demanding step. Since z = Kc represents predictions on the training inputs (or a quantity related to them), being able to perform fast predictions also has a crucial impact on the training time. This is remarkable, since good prediction speed is a desirable goal on its own. Notice that an efficient implementation of the prediction step is beneficial for any learning method of the form (2), independently of f. Ideally, the computational cost of such a matrix-vector multiplication is O(ℓ²). However, the kernel matrix might not fit into memory, so that the time needed to compute the product might also include special computations or additional I/O operations. Observe that, if many components of the vector c are null, only a subset of the rows of the kernel matrix is necessary in order to compute the product.
Hence, methods that impose sparsity in the vector c may produce a significant speed-up in the prediction step. As an additional remark, observe that the matrix-vector product is an operation that can be easily parallelized. In the linear case (5), the computation of z^k can be divided into two parts:

    w^k = Xᵀc^k,    z^k = Xw^k.

In order to compute the product, it is not even necessary to form the kernel matrix, which may yield a significant memory saving. The two intermediate products both need O(nℓ) operations, and the overall cost still scales with O(nℓ). When the number of features is much lower than the number of examples (n ≪ ℓ), this is a significant improvement with respect to O(ℓ²). The speed-up and memory saving are even more dramatic when X is sparse. In such a case, computing the product in two steps might be more convenient also when n > ℓ.

Step 2

Step two is a simple subtraction between vectors, whose computational cost is O(ℓ). In Section 5, it is shown that v = αKc − c can be interpreted as the vector of predictions on the training inputs associated with another learning problem, consisting in minimizing a regularized functional whose empirical risk is always differentiable, and whose kernel is not necessarily positive.

Step 3

Step three is the only one that depends on the function f. Hence, different algorithms can be implemented by simply choosing different resolvents J_α. Table 1 reports the loss function L and the corresponding resolvent for some common supervised learning methods. Some examples are given below. Consider problem (2) with the "hinge" loss function L(y_1, y_2) = (1 − y_1y_2)_+, associated with the popular Support Vector Machine (SVM). For SVM, step three reads

    c^{k+1} = y ⊙ min((αλ)⁻¹, (1 − y ⊙ v^k)_+),

where ⊙ denotes the element-wise product, and min is applied element-wise.
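The three steps above can be sketched for the SVM case as follows. This is our illustration, not the paper's code: we fix α = 1 and build a synthetic positive definite kernel matrix with spectral norm at most one, so that the step-size condition 0 < α < 2/||K||₂ used in the convergence analysis holds, and then check that the iterates settle at a fixed point whose coefficients satisfy the box constraint 0 ≤ y_i c_i ≤ 1/λ.

```python
import numpy as np

rng = np.random.default_rng(2)
ell, lam = 12, 0.5

# Synthetic positive definite kernel matrix with spectral norm <= 1
# (so alpha = 1 satisfies 0 < alpha < 2/||K||_2).
Q, _ = np.linalg.qr(rng.standard_normal((ell, ell)))
K = Q @ np.diag(rng.uniform(0.2, 1.0, ell)) @ Q.T
y = rng.choice([-1.0, 1.0], ell)

# Fixed-point iteration (10) with the hinge-loss operator from Table 1:
# c^{k+1} = y * min(1/(alpha*lam), (1 - y*v^k)_+), with v^k = alpha*K c^k - c^k.
c = np.zeros(ell)
for _ in range(500):
    v = K @ c - c                                                # steps 1 and 2
    c = y * np.minimum(1.0 / lam, np.maximum(0.0, 1.0 - y * v))  # step 3

# At convergence the update leaves c (numerically) unchanged, and the
# coefficients satisfy the SVM box constraint 0 <= y_i c_i <= 1/lam.
v = K @ c - c
c_next = y * np.minimum(1.0 / lam, np.maximum(0.0, 1.0 - y * v))
print(np.max(np.abs(c_next - c)) < 1e-8)
print(np.all((y * c >= 0) & (y * c <= 1 / lam)))
```

With the eigenvalues of K confined to [0.2, 1], the iteration map is a contraction (factor at most ||K − I||₂ ≤ 0.8, since J_α is non-expansive), so 500 iterations are far more than enough here.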
As a second example, consider classic regularized least squares (RLS). In this case, step three reduces to

    c^{k+1} = (y − v^k)/(1 + αλ).

Generally, the complexity of step three is O(ℓ) for any of the classical loss functions.

3.1 Convergence

The following result states that the sequence generated by the iterative procedure (10) can be used to approximately solve problem (2) to any precision, provided that α is suitably chosen.

Theorem 2. If the sequence c^k is generated according to algorithm (10), and

    0 < α < 2/||K||₂,    (11)

then (9) holds. Moreover, c^k is bounded, and any cluster point is a solution of problem (2).

A stronger convergence result holds when the kernel matrix is strictly positive or f is differentiable with Lipschitz continuous gradient. Under these conditions, it turns out that the whole sequence c^k converges at least linearly to a unique fixed point.

Theorem 3. Suppose that the sequence c^k is generated according to algorithm (10), where α satisfies (11), and one of the following conditions holds:

1. The kernel matrix K is positive definite.

2. The function f is everywhere differentiable and ∇f is Lipschitz continuous.

Then, there exists a unique solution c* of equation (7), and c^k converges to c* at the following rate:

    ||c^{k+1} − c*||₂ ≤ µ||c^k − c*||₂,    0 ≤ µ < 1.

In practice, condition (11) can be equivalently satisfied by fixing α = 1 and scaling the kernel matrix to have spectral norm between 0 and 2. In problems that involve a regularization parameter, this last choice will only affect its scale. A possible practical rule to choose the value of α is α = 1/||K||₂, which is equivalent to scaling the kernel matrix to have spectral norm equal to one. However, in order to compute the scaling factor in this way, one generally needs all the entries of the kernel matrix.
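Putting the pieces together for RLS, a minimal sketch of method (10) (our illustration; Gaussian kernel, α fixed to one after scaling the kernel matrix to unit spectral norm, as suggested above) converges to the solution of (K + λI)c = y:

```python
import numpy as np

rng = np.random.default_rng(3)
ell, lam = 15, 0.1
x = rng.standard_normal((ell, 2))
K = np.exp(-((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))
y = rng.standard_normal(ell)

# Fix alpha = 1 and rescale the kernel matrix to unit spectral norm,
# so that condition (11) holds (this only rescales the problem).
K = K / np.linalg.norm(K, 2)

c = np.zeros(ell)
for _ in range(500):
    v = K @ c - c                    # steps 1 and 2 (alpha = 1)
    c = (y - v) / (1 + lam)          # step 3: RLS resolvent from Table 1

# The iterates approach the RLS solution (K + lam*I)^{-1} y.
print(np.allclose(c, np.linalg.solve(K + lam * np.eye(ell), y)))  # → True
```

Here the iteration map has Lipschitz constant at most 1/(1 + λ) < 1, so convergence is linear regardless of whether K is singular.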
A cheaper alternative that uses only the diagonal entries of the kernel matrix is α = 1/tr(K), which is equivalent to fixing α to one and normalizing the kernel matrix to have trace one. To see that this last rule satisfies condition (11), observe that the trace of a positive semidefinite matrix is an upper bound for its spectral norm. In the linear case (5), one can directly compute α on the basis of the data matrix X. In particular, we have ||K||₂ = ||X||₂² and tr(K) = ||X||_F², where ||·||_F denotes the Frobenius norm.

4 Coordinate-wise iterative algorithms

In this section, we describe a second optimization approach that can be seen as a way to iteratively enforce optimality condition (8). Throughout the section, it is assumed that f is additively separable as in (3). In view of Corollary 1, the optimality condition can be rewritten for a single component as in (8). Consider the following general update algorithm:

    c_i^{k+1} = −J^i_{α_i}(α_i k_iᵀc^k − c_i^k),    i = 1, …, ℓ.    (12)

A serial implementation of algorithm (10) can be obtained by choosing α_i = α and cyclically computing the new components c_i^{k+1} according to equation (12). Observe that this approach requires keeping both c^k and c^{k+1} in memory at any given time. In the next subsection, we analyze a different choice of the parameters α_i that leads to a class of coordinate descent algorithms, based on the principle of using newly computed information as soon as it is available.

4.1 Coordinate descent methods

Algorithm 1 Coordinate descent for regularized kernel methods

    while max_i |h_i| ≥ δ do
        Pick a coordinate index i according to some rule,
        z_i^k = k_iᵀc^k,
        v_i^k = z_i^k/k_ii − c_i^k,
        tmp = S_i(v_i^k),
        h_i = tmp − c_i^k,
        c_i^{k+1} = tmp,
    end while

A coordinate descent algorithm updates a single variable at each iteration by solving a sub-problem of dimension one.
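The norm and trace identities used above to choose α (tr(K) ≥ ||K||₂ for a positive semidefinite K, and ||K||₂ = ||X||₂², tr(K) = ||X||_F² in the linear case) are easy to check numerically; a small sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((20, 5))
K = X @ X.T                      # linear kernel matrix

spec = np.linalg.norm(K, 2)      # spectral norm ||K||_2
assert np.isclose(spec, np.linalg.norm(X, 2) ** 2)             # ||K||_2 = ||X||_2^2
assert np.isclose(np.trace(K), np.linalg.norm(X, 'fro') ** 2)  # tr(K) = ||X||_F^2
# The trace bounds the spectral norm, so alpha = 1/tr(K) satisfies (11).
print(np.trace(K) >= spec)  # → True
```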
In recent years, optimization via coordinate descent has become a popular approach in machine learning and statistics, since its implementation is straightforward and enjoys favorable computational properties [Friedman et al., 2007, Tseng and Yun, 2008, Wu and Lange, 2008, Chang et al., 2008, Hsieh et al., 2008, Yun and Toh, 2009, Huang et al., 2010, Friedman et al., 2010]. Although the method may require many iterations to converge, the specific structure of supervised learning objective functionals allows the sub-problems to be solved with high efficiency. This makes the approach competitive especially for large-scale problems, in which memory limitations hinder the use of second-order optimization algorithms. As a matter of fact, state-of-the-art solvers for large-scale supervised learning, such as glmnet [Friedman et al., 2010] for generalized linear models, or LIBLINEAR [Fan et al., 2008] for SVMs, are based on coordinate descent techniques. The update for c_i^k in algorithm (12) also depends on components c_j^k with j < i which have already been updated. Hence, one needs to keep in memory coefficients from two subsequent iterations, c^{k+1} and c^k. In this subsection, we describe a method that allows taking advantage of the computed information as soon as it is available, by overwriting the coefficients with the new values. Assume that the diagonal elements of the kernel matrix are strictly positive, i.e. k_ii > 0. Notice that this last assumption can be made without any loss of generality. Indeed, if k_ii = 0 for some index i then, in view of the inequality |k_ij| ≤ √(k_ii k_jj), it follows that k_ij = 0 for all j. Hence, the whole i-th column (row) of the kernel matrix is zero, and can be removed without affecting optimization results for the other coefficients.
By letting α_i = 1/k_ii and S_i := −J^i_{(k_ii)⁻¹} in equation (8), the i-th coefficient in the inner sum cancels out, and we obtain

    c_i = S_i( Σ_{j≠i} (k_ij/k_ii) c_j ).    (13)

The optimal i-th coefficient is thus expressed as a function of the others. Similar characterizations have also been derived in [Dinuzzo and De Nicolao, 2009] for several loss functions. Equation (13) is the starting point to obtain a variety of coordinate descent algorithms involving the iterative choice of a coordinate index i followed by the optimization of c_i as a function of the other coefficients. A simple test on the residual of equation (13) can be used as a stopping condition. The approach can also be regarded as a non-linear Gauss–Seidel method [Ortega and Rheinboldt, 2000] for solving the equations (8). It is assumed that the vector c is initialized to some initial c⁰, and that the coefficients h_i are initialized to the residuals of equation (13) evaluated at c⁰. Remarkably, in order to implement the method for different loss functions, we simply need to modify the expression of the functions S_i. Each update only involves a single row (column) of the kernel matrix. In the following, we will assume that indices are recursively picked according to a rule that satisfies the following condition, see [Tseng, 2001, Luenberger and Ye, 2008].

Essentially Cyclic Rule. There exists a constant integer T > ℓ such that every index i ∈ {1, …, ℓ} is chosen at least once between the k-th iteration and the (k + T − 1)-th, for all k.

Iterations of coordinate descent algorithms that use an essentially cyclic rule can be grouped in macro-iterations, containing at most T updates of the form (13), within which all the indices are picked at least once.
Below, we report some simple rules that satisfy the essentially cyclic condition and don't require maintaining any additional information (such as the gradient):

1. Cyclic rule: In each macro-iteration, each index is picked exactly once, in the order 1, …, ℓ. Hence, each macro-iteration consists of exactly ℓ iterations.

2. Aitken double sweep rule: Consists in alternating macro-iterations in which indices are chosen in the natural order 1, …, ℓ with macro-iterations in the reverse order, i.e. (ℓ − 1), …, 1.

3. Randomized cyclic rule: The same as the cyclic rule, except that indices are randomly permuted at each macro-iteration.

In the linear case (5), z_i^k can be computed as follows:

    w^k = Xᵀc^k,    z_i^k = x_iᵀw^k.

By exploiting the fact that only one component of the vector c changes from one iteration to the next, the first equation can be further developed:

    w^k = Xᵀc^k = w^{k−1} + (Xᵀe_p)h_p = w^{k−1} + x_p h_p,

where p denotes the index chosen in the previous iteration, and h_p denotes the variation of the coefficient c_p in the previous iteration. By introducing these new quantities, the coordinate descent algorithm can be rewritten as in Algorithm 2, where we have set S_i := −J^i_{||x_i||₂⁻²}.

Algorithm 2 Coordinate descent (linear kernel)

    while max_i |h_i| ≥ δ do
        Pick a coordinate index i according to some rule,
        if h_p ≠ 0 then
            w^k = w^{k−1} + x_p h_p,
        end if
        z_i^k = x_iᵀw^k,
        v_i^k = z_i^k/||x_i||₂² − c_i^k,
        tmp = S_i(v_i^k),
        h_i = tmp − c_i^k,
        c_i^{k+1} = tmp,
        p = i
    end while

The computational cost of a single iteration depends mainly on the updates for w and z_i, and scales linearly with the number of features, i.e. O(n). When the loss function has linear traits, it is often the case that the coefficient c_i doesn't change after the update, so that h_i = 0.
When this happens, the next update of w can be skipped, obtaining a significant speed-up. Further, if the vectors x_i are sparse, the average computational cost of the second line may be much lower than O(n). A technique of this kind has been proposed in [Hsieh et al., 2008] and implemented in the package LIBLINEAR [Fan et al., 2008] to improve the speed of coordinate descent iterations for linear SVM training. Here, one can see that the same technique can be applied to any convex loss function, provided that an expression for the corresponding resolvent is available.

The main convergence result for coordinate descent is stated below. It should be observed that the classical theory of convergence for coordinate descent is typically formulated for differentiable objective functionals. When the objective functional is not differentiable, there exist counterexamples showing that the method may get stuck at a non-stationary point [Auslender, 1976]. In the non-differentiable case, optimality of cluster points of coordinate descent iterations has been proven in [Tseng, 2001] (see also references therein), under the additional assumption that the non-differentiable part is additively separable. Unfortunately, the result of [Tseng, 2001] cannot be directly applied to problem (2), since the (possibly) non-differentiable part f(Kc) is not separable with respect to the optimization variables c_i, even when (3) holds. Notice also that, when the kernel matrix is not strictly positive, the level sets of the objective functional are unbounded (see Lemma 1 in the appendix). Despite these facts, it still holds that cluster points of coordinate descent iterations are optimal, as stated by the next theorem.

Theorem 4. Suppose that the following conditions hold:

1. The function f is additively separable as in (3),

2.
The diagonal entries of the kernel matrix satisfy k_ii > 0,

3. The sequence c^k is generated by the coordinate descent algorithm (Algorithm 1 or 2), where indices are recursively selected according to an essentially cyclic rule.

Then, (9) holds, c^k is bounded, and any cluster point is a solution of problem (2).

5 A reformulation theorem

The following result shows that solutions of problem (2) satisfying equation (7) are also stationary points of a suitable family of differentiable functionals.

Theorem 5. If c satisfies (7), then it is also a stationary point of the following functional:

    F_α(c) = α⁻¹f_α(K_α c) + cᵀK_α c/2,

where f_α denotes the Moreau–Yosida regularization of f, and K_α := αK − I.

Theorem 5 gives insight into the role of the parameter α, as well as providing an interesting link with machine learning with indefinite kernels. By the properties of the Moreau–Yosida regularization, f_α is differentiable with Lipschitz continuous gradient. It follows that F_α also has such a property. Notice that lower values of α are associated with smoother functions f_α, while the gradient of α⁻¹f_α is non-expansive. A lower value of α also implies a "less positive semidefinite" kernel, since the eigenvalues of K_α are given by (αα_i − 1), where α_i denote the eigenvalues of K. Indeed, the kernel becomes non-positive as soon as α min_i{α_i} < 1. Hence, the relaxation parameter α regulates a trade-off between smoothness of f_α and positivity of the kernel. When f is additively separable as in (3), it follows that f_α is also additively separable:

    f_α(z) = Σ_{i=1}^ℓ f_{iα}(z_i),

where f_{iα} is the Moreau–Yosida regularization of f_i. The components can often be computed in closed form, so that an "equivalent differentiable loss function" can be derived for non-differentiable problems.
For instance, when f_i is given by the hinge loss f_i(z_i) = (1 − y_i z_i)_+, letting α = 1, we obtain

    f_{i1}(z_i) = { 1/2 − y_i z_i,          y_i z_i ≤ 0,
                  { (1 − y_i z_i)²_+ / 2,   y_i z_i > 0.

Observe that this last function is differentiable with Lipschitz continuous derivative. By Theorem 5, it follows that the SVM solution can be equivalently computed by searching for the stationary points of a new regularization functional, obtained by replacing the hinge loss with its equivalent differentiable loss function and modifying the kernel matrix by subtracting the identity.

6 Conclusions

In this paper, fixed-point and coordinate descent algorithms for regularized kernel methods with convex empirical risk and squared RKHS norm regularization have been analyzed. The two approaches can be regarded as instances of non-linear Jacobi and Gauss–Seidel algorithms to solve a suitable non-linear equation that characterizes optimal solutions. While the fixed-point algorithm has the advantage of being parallelizable, the coordinate descent algorithm is able to immediately exploit the information computed during the update of a single coefficient. Both classes of algorithms have the potential to scale well with the dataset size. Finally, it has been shown that minimizers of convex regularization functionals are also stationary points of a family of differentiable regularization functionals involving the Moreau–Yosida regularization of the empirical risk.

Appendix A

In this section, we review some concepts and theorems from analysis and linear algebra which are used in the proofs. Let E denote a Euclidean space endowed with the standard inner product ⟨x_1, x_2⟩₂ = x_1ᵀx_2 and the induced norm ||x||₂ = √⟨x, x⟩₂.

Set-valued maps

A set-valued map (or multifunction) A : E → 2^E is a rule that associates to each point x ∈ E a subset A(x) ⊆ E.
Notice that any map A : E → E can be seen as a specific instance of a multifunction such that A(x) is a singleton for all x ∈ E. The multifunction A is called monotone whenever

    ⟨y_1 − y_2, x_1 − x_2⟩₂ ≥ 0,    ∀x_1, x_2 ∈ E, y_1 ∈ A(x_1), y_2 ∈ A(x_2).

If there exists L ≥ 0 such that

    ||y_1 − y_2||₂ ≤ L||x_1 − x_2||₂,    ∀x_1, x_2 ∈ E, y_1 ∈ A(x_1), y_2 ∈ A(x_2),

then A is single-valued, and is called a Lipschitz continuous function with modulus L. A Lipschitz continuous function is called non-expansive if L = 1, contractive if L < 1, and firmly non-expansive if

    ||y_1 − y_2||₂² ≤ ⟨y_1 − y_2, x_1 − x_2⟩₂,    ∀x_1, x_2 ∈ E, y_1 ∈ A(x_1), y_2 ∈ A(x_2).

In particular, firmly non-expansive maps are single-valued, monotone, and non-expansive. For any monotone multifunction A, its resolvent J^A_α is defined for any α > 0 as

    J^A_α := (I + αA)⁻¹,

where I stands for the identity operator. Resolvents of monotone operators are known to be firmly non-expansive.

Finite-valued convex functions

A function f : E → ℝ is called finite-valued convex if, for any α ∈ [0, 1] and any x_1, x_2 ∈ E, it satisfies

    −∞ < f(αx_1 + (1 − α)x_2) ≤ αf(x_1) + (1 − α)f(x_2) < +∞.

The subdifferential of a finite-valued convex function f is a multifunction ∂f : E → 2^E defined as

    ∂f(x) = {ξ ∈ E : f(y) − f(x) ≥ ⟨ξ, y − x⟩₂, ∀y ∈ E}.

It can be shown that the following properties hold:

1. ∂f(x) is a non-empty convex compact set for any x ∈ E.

2. f is (Gâteaux) differentiable at x if and only if ∂f(x) = {∇f(x)} is a singleton (whose unique element is the gradient).

3. ∂f is a monotone multifunction.

4. The point x* is a (global) minimizer of f if and only if 0 ∈ ∂f(x*).
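As a concrete instance of these objects (our illustration): for f(x) = |x| on E = ℝ, the resolvent (I + α∂f)⁻¹ of the subdifferential is the well-known soft-thresholding map, and its firm non-expansiveness can be checked numerically on random pairs of points:

```python
import numpy as np

alpha = 0.7

def soft_threshold(v, a):
    # Resolvent (I + a*df)^{-1} of the subdifferential of f(x) = |x|:
    # solves x + a*s = v with s in the subdifferential of |.| at x,
    # i.e. shrinks v toward 0 by a and clips to 0 inside [-a, a].
    return np.sign(v) * np.maximum(np.abs(v) - a, 0.0)

rng = np.random.default_rng(7)
x1, x2 = rng.standard_normal(1000), rng.standard_normal(1000)
y1, y2 = soft_threshold(x1, alpha), soft_threshold(x2, alpha)

# Firm non-expansiveness: ||y1 - y2||^2 <= <y1 - y2, x1 - x2>
# (checked coordinate-by-coordinate, since the map acts on E = R).
lhs = (y1 - y2) ** 2
rhs = (y1 - y2) * (x1 - x2)
print(np.all(lhs <= rhs + 1e-12))  # → True
```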
For any finite-valued convex function f, its Moreau–Yosida regularization (or Moreau envelope, or quadratic min-convolution) is defined as

    f_α(x) := min_{y ∈ E} ( f(y) + (α/2)||y − x||₂² ).

For any fixed x, the minimum in the definition of f_α is attained at y = p_α(x), where

    p_α := (I + α⁻¹∂f)⁻¹

denotes the so-called proximal mapping. It can be shown that the following remarkable properties hold:

1. f_α is convex and differentiable, and the gradient ∇f_α is Lipschitz continuous with modulus α.

2. f_α(x) = f(p_α(x)) + (α/2)||p_α(x) − x||₂².

3. f_α and f have the same set of minimizers for all α.

4. The gradient ∇f_α is called the Moreau–Yosida regularization of ∂f, and satisfies

    ∇f_α(x) = α(x − p_α(x)) = αJ_α(x),

where J_α denotes the resolvent of the inverse subdifferential defined as J_α := (I + α(∂f)⁻¹)⁻¹.

Convergence theorems

Theorem 6 (Contraction mapping theorem). Let A : E → E and suppose that, given c⁰, the sequence c^k is generated as c^{k+1} = A(c^k). If A is contractive with modulus µ, then there exists a unique fixed point c* such that c* = A(c*), and the sequence c^k converges to c* at a linear rate:

    ||c^{k+1} − c*||₂ ≤ µ||c^k − c*||₂,    0 ≤ µ < 1.

The following result is known as Zangwill's convergence theorem [Zangwill, 1969]; see also page 206 of [Luenberger and Ye, 2008].

Theorem 7 (Zangwill's convergence theorem). Let A : E → 2^E denote a multifunction, and suppose that, given c⁰, the sequence c^k is generated as c^{k+1} ∈ A(c^k). Let Γ ⊂ E be the solution set. If the following conditions hold:

1. The graph G_A = {(x, y) ∈ E × E : y ∈ A(x)} is a closed set,

2. There exists a descent function F such that
   • for all x ∈ Γ, F(A(x)) ≤ F(x),
   • for all x ∉ Γ, F(A(x)) < F(x),

3.
The sequence $c^k$ is bounded,

then all the cluster points of $c^k$ belong to the solution set.

Appendix B

The following lemma will prove useful in the subsequent proofs.

Lemma 1. The functional $F$ of problem (2) is such that $F(c + u) = F(c)$ for any vector $u$ in the nullspace of the kernel matrix.

Proof. Let $u$ denote any vector in the nullspace of the kernel matrix. Then, we have

$$F(c + u) = f(K(c + u)) + \frac{(c + u)^T K (c + u)}{2} = f(Kc) + \frac{c^T K c}{2} = F(c).$$

Proof of Theorem 1. Problem (2) is a convex optimization problem, where the functional $F$ is continuous and bounded below. First of all, we show that there exist optimal solutions. Observe that the minimization can be restricted to the range of the kernel matrix. Indeed, any vector $c \in E$ can be uniquely decomposed as $c = u + v$, where $u$ belongs to the nullspace of $K$ and $v$ belongs to the range. By Lemma 1, we have $F(c) = F(v)$. Since $F$ is coercive on the range of the kernel matrix ($\lim_{\|v\|_2 \to +\infty} F(v) = +\infty$), it follows that there exist optimal solutions. A necessary and sufficient condition for $c^*$ to be optimal is

$$0 \in \partial F(c^*) = K \left( \partial f(K c^*) + c^* \right) = K G(c^*), \qquad G(c^*) := \partial f(K c^*) + c^*.$$

Consider the decomposition $G(c^*) = u_G + v_G$, where $u_G$ belongs to the nullspace of the kernel matrix and $v_G$ belongs to the range. Observe that

$$v_G = G(c^*) - u_G = G(c^* - u_G).$$

We have

$$0 \in K G(c^*) = K v_G \;\Rightarrow\; 0 \in G(c^* - u_G) = v_G,$$

so that, for any optimal $c^*$, there exists an optimal $c = c^* - u_G$ such that

$$0 \in \partial f(K c) + c. \qquad (14)$$

By introducing the inverse subdifferential, equation (14) can be rewritten as

$$K c \in (\partial f)^{-1}(-c).$$

Multiplying both sides by $\alpha > 0$ and subtracting $c$, we obtain

$$\alpha K c - c \in \alpha (\partial f)^{-1}(-c) - c.$$

Finally, introducing the resolvent $J_\alpha$ as in (6), we have

$$\alpha K c - c \in (J_\alpha)^{-1}(-c).$$

Since $J_\alpha$ is single-valued, equation (7) follows.
Proof of Corollary 1. Let us start from the sufficient condition for optimality (14). If (3) holds, then the subdifferential of $f$ decouples with respect to the different components, so that there exist optimal coefficients $c_i$ such that

$$0 \in \partial f_i \left( k_i^T c \right) + c_i, \qquad i = 1, \ldots, \ell.$$

Equivalently,

$$k_i^T c \in (\partial f_i)^{-1}(-c_i).$$

Multiplying both sides by $\alpha_i > 0$ and subtracting $c_i$, we have

$$\alpha_i k_i^T c - c_i \in \alpha_i (\partial f_i)^{-1}(-c_i) - c_i.$$

The thesis follows by introducing the resolvents $J_{\alpha_i}^i$ and solving for $-c_i$.

Proof of Theorem 2. We show that the sequence $c^k$ generated by algorithm (10) converges to an optimal solution of Problem (2). By Theorem 1, there exist optimal solutions $c^*$ satisfying (7). We now observe that any other vector $c$ such that $K(c^* - c) = 0$ is also optimal. Indeed, we have $c = c^* + u$, where $u$ belongs to the nullspace of the kernel matrix. By Lemma 1, it follows that $F(c) = F(c^*)$. To prove (9), it suffices to show that $K r^k \to 0$, where $r^k := c^k - c^*$ can be uniquely decomposed as

$$r^k = u^k + v^k, \qquad K u^k = 0, \qquad \langle u^k, v^k \rangle_2 = 0.$$

We need to prove that $\|v^k\|_2 \to 0$. Since $J_\alpha$ is nonexpansive, we have

$$\gamma^{k+1} := \|r^{k+1}\|_2^2 = \|c^{k+1} - c^*\|_2^2 = \|J_\alpha(\alpha K c^k - c^k) - J_\alpha(\alpha K c^* - c^*)\|_2^2 \leq \|\alpha K r^k - r^k\|_2^2 = \|\alpha K v^k - r^k\|_2^2.$$

Observing that $v^k$ is orthogonal to the nullspace of the kernel matrix, we can further estimate as follows:

$$\|\alpha K v^k - r^k\|_2^2 = \gamma^k - v^{kT} \left( 2\alpha K - \alpha^2 K^2 \right) v^k \leq \gamma^k - \beta \|v^k\|_2^2,$$

where

$$\beta := \min_{i : \alpha_i > 0} \alpha \alpha_i (2 - \alpha \alpha_i),$$

and $\alpha_i$ denote the eigenvalues of the kernel matrix. Since the kernel matrix is positive semidefinite and condition (11) holds, we have $0 \leq \alpha \alpha_i < 2$. Since the kernel matrix is not null and has a finite number of eigenvalues, there is at least one eigenvalue with strictly positive distance from zero. It follows that $\beta > 0$.
Since

$$0 \leq \gamma^{k+1} \leq \gamma^0 - \beta \sum_{j=1}^{k} \|v^j\|_2^2,$$

we have, necessarily, that $\|v^k\|_2 \to 0$. Finally, observe that $c^k$ remains bounded:

$$\|c^k\|_2 \leq \|r^k\|_2 + \|c^*\|_2 \leq \|r^0\|_2 + \|c^*\|_2,$$

so that there is a subsequence converging to an optimal solution. In fact, by (9) it follows that any cluster point of $c^k$ is an optimal solution.

Proof of Theorem 3. Algorithm (10) can be rewritten as

$$c^{k+1} = A(c^k),$$

where the map $A : E \to E$ is defined as

$$A(c) := -J_\alpha(\alpha K c - c).$$

Under both conditions (1) and (2) of the theorem, we show that $A$ is contractive. Uniqueness of the fixed point, and convergence with linear rate, will then follow from the contraction mapping theorem (Theorem 6). Let

$$\mu_1 := \|\alpha K - I\|_2 = \max_i |1 - \alpha_i \alpha|,$$

where $\alpha_i$ denote the eigenvalues of the kernel matrix. Since the kernel matrix is positive semidefinite, and condition (11) holds, we have $0 \leq \alpha_i \alpha < 2$, so that $\mu_1 \leq 1$. We now show that the following inequality holds:

$$\|J_\alpha(y_1) - J_\alpha(y_2)\|_2 \leq \mu_2 \|y_1 - y_2\|_2, \qquad (15)$$

where

$$\mu_2 := \left( 1 + \frac{1}{L^2} \right)^{-1/2},$$

and $L$ denotes the Lipschitz modulus of $\nabla f$ when $f$ is differentiable with Lipschitz continuous gradient, and $L = +\infty$ otherwise. Since $J_\alpha$ is nonexpansive, it is easy to see that (15) holds when $L = +\infty$. Suppose now that $f$ is differentiable and $\nabla f$ is Lipschitz continuous with modulus $L$. It follows that the inverse gradient satisfies

$$\|(\nabla f)^{-1}(x_1) - (\nabla f)^{-1}(x_2)\|_2 \geq \frac{1}{L} \|x_1 - x_2\|_2.$$

Since $(\nabla f)^{-1}$ is monotone, we have

$$\|J_\alpha^{-1}(x_1) - J_\alpha^{-1}(x_2)\|_2^2 = \|x_1 - x_2 + (\nabla f)^{-1}(x_1) - (\nabla f)^{-1}(x_2)\|_2^2 \geq \|x_1 - x_2\|_2^2 + \|(\nabla f)^{-1}(x_1) - (\nabla f)^{-1}(x_2)\|_2^2 \geq \left( 1 + \frac{1}{L^2} \right) \|x_1 - x_2\|_2^2.$$

From this last inequality, we obtain (15).
Finally, we have

$$\|A(c_1) - A(c_2)\|_2 = \|J_\alpha(\alpha K c_1 - c_1) - J_\alpha(\alpha K c_2 - c_2)\|_2 \leq \mu_2 \|(\alpha K - I)(c_1 - c_2)\|_2 \leq \mu \|c_1 - c_2\|_2,$$

where we have set $\mu := \mu_1 \mu_2$. Consider the case in which $K$ is strictly positive definite. Then, it holds that $0 < \alpha_i \alpha < 2$, so that $\mu_1 < 1$, and $A$ is contractive. Finally, when $f$ is differentiable and $\nabla f$ is Lipschitz continuous, we have $\mu_2 < 1$ and, again, it follows that $A$ is contractive. By the contraction mapping theorem (Theorem 6), there exists a unique $c^*$ satisfying (7), and the sequence $c^k$ of Picard iterations converges to $c^*$ at a linear rate.

Proof of Theorem 4. We shall apply Theorem 7 to the coordinate descent macro-iterations, where the solution set $\Gamma$ is given by

$$\Gamma := \{ c \in E : \text{(8) holds} \}.$$

Let $A$ denote the algorithmic map obtained after each macro-iteration of the coordinate descent algorithm. By the essentially cyclic rule, we have

$$A(c) = \bigcup_{(i_1, \ldots, i_s) \in I} \{ (A_{i_1} \circ \cdots \circ A_{i_s})(c) \},$$

where $I$ is the set of strings of length at most $s = T$ on the alphabet $\{1, \ldots, \ell\}$ such that all the characters are picked at least once. Observing that the set $I$ has finite cardinality, it follows that the graph $G_A$ is the union of a finite number of graphs of point-to-point maps:

$$G_A = \bigcup_{(i_1, \ldots, i_s) \in I} \{ (x, y) \in E \times E : y = (A_{i_1} \circ \cdots \circ A_{i_s})(x) \}.$$

Now notice that each map $A_i$ is of the form

$$A_i(c) = c + e_i t_i(c), \qquad t_i(c) := S_i \left( \sum_{j \neq i} \frac{k_{ij}}{k_{ii}} c_j \right) - c_i.$$

All the resolvents are Lipschitz continuous, so that the functions $A_i$ are also Lipschitz continuous. It follows that the composition of a finite number of such maps is continuous, and its graph is a closed set. Since the union of a finite number of closed sets is also closed, we obtain that $G_A$ is closed.
Each map $A_i$ yields the solution of an exact line search over the $i$-th coordinate direction for minimizing the functional $F$ of Problem (2). Hence, the function

$$\varphi_i(t) = F(c + e_i t)$$

is minimized at $t_i(c)$, that is,

$$0 \in \partial \varphi_i(t_i(c)) = \langle e_i, \partial F(c + e_i t_i(c)) \rangle_2 = \langle k_i, \partial f(K A_i(c)) + A_i(c) \rangle_2.$$

Equivalently,

$$-\langle k_i, A_i(c) \rangle_2 \in \langle k_i, \partial f(K A_i(c)) \rangle_2. \qquad (16)$$

By definition of subdifferential, we have

$$f(K A_i(c)) - f(K c) \leq t_i(c) \gamma, \qquad \forall \gamma \in \langle k_i, \partial f(K A_i(c)) \rangle_2.$$

In particular, in view of (16), we have

$$f(K A_i(c)) - f(K c) \leq -t_i(c) \langle k_i, A_i(c) \rangle_2.$$

Now, observe that

$$F(A(c)) \leq F(A_i(c)) = F(c + e_i t_i(c)) = F(c) + \frac{t_i^2(c) k_{ii}}{2} + t_i(c) \langle k_i, c \rangle_2 + f(K A_i(c)) - f(K c) \leq F(c) + \frac{t_i^2(c) k_{ii}}{2} + t_i(c) \langle k_i, c \rangle_2 - t_i(c) \langle k_i, A_i(c) \rangle_2 = F(c) + \frac{t_i^2(c) k_{ii}}{2} + t_i(c) \langle k_i, c - A_i(c) \rangle_2 = F(c) + \frac{t_i^2(c) k_{ii}}{2} - t_i^2(c) k_{ii} = F(c) - \frac{t_i^2(c) k_{ii}}{2}.$$

Since $k_{ii} > 0$, the following inequalities hold:

$$t_i^2(c) \leq \frac{2}{k_{ii}} \left( F(c) - F(A_i(c)) \right) \leq \frac{2}{k_{ii}} \left( F(c) - F(A(c)) \right). \qquad (17)$$

We now show that $F$ is a descent function for the map $A$ associated with the solution set $\Gamma$. Indeed, if $c$ satisfies (8), then the application of the map $A$ does not change the position, so that $F(A(c)) = F(c)$. On the other hand, if $c$ does not satisfy (8), there is at least one index $i$ such that $t_i(c) \neq 0$. Since all the components are chosen at least once, and in view of (17), we have $F(A(c)) < F(c)$. Finally, we need to prove that the sequence of macro-iterations remains bounded. In fact, it turns out that the whole sequence $c^k$ of iterations of the coordinate descent algorithm is bounded.
From the first inequality in (17), the sequence $F(c^k)$ is non-increasing and bounded below, and thus it must converge to a number

$$F_\infty = \lim_{k \to +\infty} F(c^k) \leq F(c^0). \qquad (18)$$

Again from (17), we obtain that the sequence of step sizes is square summable:

$$\sum_{k=0}^{+\infty} \|c^{k+1} - c^k\|_2^2 \leq \frac{2}{\min_j k_{jj}} \left( F(c^0) - F_\infty \right) < +\infty.$$

In particular, the step sizes are also uniformly bounded:

$$t_i^2(c^k) = \|c^{k+1} - c^k\|_2^2 \leq \frac{2}{\min_j k_{jj}} \left( F(c^0) - F_\infty \right) < +\infty. \qquad (19)$$

Now, fix any coordinate $i$, and consider the sequence $c_i^k$. Let $h_{ij}$ denote the subsequence of indices at which the $i$-th component is picked by the essentially cyclic rule, and observe that

$$c_i^{h_{ij}} = S_i \left( \frac{k_i^T c^{h_{ij}-1}}{k_{ii}} - c_i^{h_{ij}-1} \right).$$

Recalling the definition of $S_i$, and after some algebra, the last equation can be rewritten as

$$c_i^{h_{ij}} \in -\partial f_i \left( k_i^T c^{h_{ij}-1} + k_{ii} t_i \left( c^{h_{ij}-1} \right) \right).$$

Since $\partial f_i(x)$ is a compact set for any $x \in \mathbb{R}$, it suffices to show that the argument of the subdifferential is bounded. For any $k$, let us decompose $c^k$ as

$$c^k = u^k + v^k, \qquad K u^k = 0, \qquad \langle u^k, v^k \rangle_2 = 0.$$

Letting $\alpha_1 > 0$ denote the smallest non-null eigenvalue of the kernel matrix, we have

$$\alpha_1 \|v^k\|_2^2 \leq v^{kT} K v^k = c^{kT} K c^k \leq 2 F(c^k) \leq 2 F(c^0).$$

By the triangle inequality, we have

$$\left| k_i^T c^k + k_{ii} t_i(c^k) \right| \leq M \left( \frac{|k_i^T c^k|}{k_{ii}} + \left| t_i(c^k) \right| \right), \qquad M := \max_j |k_{jj}|.$$

The first term can be majorized as follows:

$$\frac{|k_i^T c^k|}{k_{ii}} = \frac{|k_i^T v^k|}{k_{ii}} \leq \frac{\|k_i\|_2}{k_{ii}} \|v^k\|_2 \leq \frac{\|k_i\|_2}{k_{ii}} \sqrt{\frac{2 F(c^0)}{\alpha_1}} \leq \sqrt{\frac{2 \ell F(c^0)}{\alpha_1}} < +\infty,$$

while the term $|t_i(c^k)|$ is bounded in view of (19). It follows that $c_i^k$ is bounded independently of $i$, which implies that $c^k$ is bounded. In particular, the subsequence consisting of the macro-iterations is bounded as well. By Theorem 7, there is at least one subsequence of the sequence of macro-iterations converging to a limit $c^\infty$ that satisfies (8), and thus minimizes $F$.
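The coordinate descent scheme can likewise be sketched for the squared loss (our own illustration; the matrix, data, and the closed-form line search below are our assumptions, not the paper's general $S_i$ update). Each inner step solves the exact line search $\min_t F(c + e_i t)$ for $F(c) = \|Kc - y\|_2^2/(2\lambda) + c^T K c / 2$ in closed form, and repeated sweeps converge to the kernel ridge solution:

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 30, 1.0
A = rng.normal(size=(n, n))
K = A @ A.T / n + np.eye(n)   # well-conditioned SPD Gram matrix (any SPD matrix is a valid kernel matrix)
y = rng.normal(size=n)

c = np.zeros(n)
Kc = np.zeros(n)                   # running product K @ c, updated in O(n) per inner step
for sweep in range(500):           # macro-iterations: each sweep picks every coordinate once
    for i in range(n):
        k_i = K[:, i]
        # Closed-form exact line search: d/dt F(c + e_i t) = 0 for the squared loss gives
        # t = (k_i^T (y - Kc) - lam * k_i^T c) / (||k_i||^2 + lam * k_ii)
        t = (k_i @ (y - Kc) - lam * (k_i @ c)) / (k_i @ k_i + lam * K[i, i])
        c[i] += t
        Kc += t * k_i

c_star = np.linalg.solve(K + lam * np.eye(n), y)  # exact minimizer (K + lam*I)^{-1} y
assert np.linalg.norm(c - c_star) < 1e-6
```

Since $F$ is quadratic here, each sweep of exact coordinate minimization is one Gauss-Seidel pass on the normal equations, which is precisely the non-linear Gauss-Seidel viewpoint mentioned in the abstract.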
By continuity of $F$, we have

$$F(c^\infty) = \min_{c \in \mathbb{R}^\ell} F(c).$$

Finally, in view of (18), we have $F_\infty = F(c^\infty)$, which proves (9) and shows that any cluster point of $c^k$ is an optimal solution of Problem (2).

Proof of Theorem 5. Equation (7) can be rewritten as

$$J_\alpha(K_\alpha c) + c = 0.$$

Now, let $f_\alpha$ denote the Moreau-Yosida regularization of $f$. From the properties of $f_\alpha$, we have

$$\nabla f_\alpha(K_\alpha c) + \alpha c = 0.$$

Multiplying both sides of the previous equation by $\alpha^{-1} K_\alpha$, we obtain

$$\alpha^{-1} K_\alpha \nabla f_\alpha(K_\alpha c) + K_\alpha c = 0.$$

Finally, the last equation can be rewritten as

$$\nabla_c \left( \alpha^{-1} f_\alpha(K_\alpha c) + \frac{c^T K_\alpha c}{2} \right) = 0,$$

so that the thesis follows.

References

N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337-404, 1950.

A. Auslender. Optimisation: Méthodes Numériques. Masson, France, 1976.

L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors. Large Scale Kernel Machines. MIT Press, Cambridge, MA, USA, 2007.

K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate descent method for large-scale L2-loss linear support vector machines. Journal of Machine Learning Research, 9:1369-1398, 2008.

F. Dinuzzo and G. De Nicolao. An algebraic characterization of the optimum of regularized kernel methods. Machine Learning, 74(3):315-345, 2009.

R. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.

J. Friedman, T. Hastie, H. Hoefling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302-332, 2007.

J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1-22, 2010.

C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan.
A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008), pages 408-415, Helsinki, Finland, 2008.

F.-L. Huang, C.-J. Hsieh, K.-W. Chang, and C.-J. Lin. Iterative scaling and coordinate descent methods for maximum entropy models. Journal of Machine Learning Research, 11:815-848, 2010.

D. G. Luenberger and Y. Ye. Linear and Nonlinear Programming. International Series in Operations Research and Management Science. Springer, 2008.

J. M. Ortega and W. C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Classics in Applied Mathematics. SIAM, 2000.

R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, USA, 1970.

B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. Neural Networks and Computational Learning Theory, 81:416-426, 2001.

P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475-494, June 2001.

P. Tseng and S. Yun. A coordinate gradient descent method for linearly constrained smooth optimization and support vector machines training. Computational Optimization and Applications, pages 1-28, 2008.

T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2(1):224-244, 2008.

S. Yun and K.-C. Toh. A coordinate gradient descent method for ℓ1-regularized convex minimization. Computational Optimization and Applications, pages 1-35, 2009.

W. Zangwill. Non-linear Programming: A Unified Approach. Prentice-Hall, Englewood Cliffs, NJ, 1969.