Semi-supervised Ranking Pursuit
Authors: Evgeni Tsivtsivadze, Tom Heskes
Evgeni Tsivtsivadze (The Netherlands Organization for Applied Scientific Research, Zeist, The Netherlands; evgeni.tsivtsivadze@tno.nl) and Tom Heskes (Institute for Computing and Information Sciences, Radboud University Nijmegen, The Netherlands; t.heskes@science.ru.nl)

Abstract. We propose a novel sparse preference learning/ranking algorithm. Our algorithm approximates the true utility function by a weighted sum of basis functions using the squared loss on pairs of data points, and is a generalization of the kernel matching pursuit method. It can operate both in a supervised and a semi-supervised setting and allows efficient search for multiple, near-optimal solutions. Furthermore, we describe an extension of the algorithm suitable for combined ranking and regression tasks. In our experiments we demonstrate that the proposed algorithm outperforms several state-of-the-art learning methods when taking into account unlabeled data, and performs comparably in a supervised learning scenario, while providing sparser solutions.

1 Introduction

Recently, preference learning [1] has received significant attention in the machine learning community. Informally, the main goal of this task is the prediction of an ordering of the data points, rather than the prediction of a numerical score as in regression, or of a class label as in classification. The ranking problem can be considered a special case of preference learning in which a strict order is defined over all data points. Applications of algorithms that learn to rank are widespread, including information retrieval (collaborative filtering, web search, e.g. [2]), natural language processing (parse ranking, e.g. [3]), bioinformatics (protein ranking, e.g. [4]), and many others.
Despite notable progress in the development and application of preference learning/ranking algorithms (e.g. [1]), so far the emphasis has mainly been on improving the learning performance of the methods. Much less is known about models that, in addition, focus on interpretability and sparseness of the ranking solution. In this work we propose a novel preference learning/ranking algorithm that, besides state-of-the-art performance, can also lead to sparse models and notably faster prediction times (an absolute necessity for a wide range of applications, such as search engines) compared to its non-sparse counterparts. (For an overview of learning-to-rank applications and benchmarks, see e.g. http://research.microsoft.com/en-us/um/beijing/projects/letor/paper.aspx.)

Sparse modeling is a rapidly developing area of machine learning motivated by the statistical problem of variable selection in high-dimensional datasets. The aim is to obtain a small, highly predictive set of variables that can help to enhance our understanding of the underlying phenomena. This objective constitutes a crucial difference between sparse modeling and other machine learning approaches. Recent developments in theory and algorithms for sparse modeling mainly concern l1-regularization and convex relaxation of the subset selection problem. Examples of such algorithms include sparse regression (e.g. Lasso [5]) and its various extensions (Elastic Net [6], group Lasso [7,8], simultaneous/multi-task Lasso [9]), as well as sparse dimensionality reduction (sparse PCA [9], NMF [10]) algorithms. Applications of these methods are wide-ranging, including computational biology, neuroscience, image processing, information retrieval, and social network analysis, to name a few.
The sparse ranking algorithm we propose here is not tied to a particular domain and can be applied to various problems where it is necessary to estimate preference relations/rankings of objects, as well as to obtain a compact and representative model.

RankSVM [11] is a ranking method that can lead to sparse solutions. However, in RankSVM sparsity control is not explicit, and the produced models are usually far from being interpretable. Also note that ranking algorithms are frequently not directly applicable to more general preference learning tasks, or can become computationally expensive. Our method is a generalization of the (kernel) matching pursuit algorithm [12]: it approximates the true utility function by a weighted sum of basis functions using the squared loss on pairs of data points. Unlike existing methods, our algorithm allows explicit control over the sparsity of the model and can be applied to ranking and preference learning problems. Furthermore, an extension of the algorithm allows us to efficiently search for several near-optimal solutions instead of a single one. One of the problems that arises during sparse modeling is the possible existence of multiple nearly-optimal solutions (e.g. due to the lack of a single sparse ground truth). This situation is common for many biological problems when, for example, finding a few highly predictive proteins does not exclude the possibility of finding some other group of genes/proteins with similar properties. The same situation can occur in many other domains, e.g. information retrieval (various groups of highly descriptive documents/queries in a document ranking task), natural language processing (parse re-ranking), etc. Therefore, it is important to explore and include search for multiple nearly-optimal sparse solutions rather than a single solution.
In our empirical evaluation we show that our algorithm can operate in supervised and semi-supervised settings, leads to sparse solutions, and improves performance compared to several baseline methods.

2 Problem Setting

Let X be a set of instances and Y be a set of labels. We consider the label ranking task [1,13]: we want to predict for any instance x ∈ X a preference relation P_x ⊆ Y × Y among the set of labels Y. We assume that the true preference relation P_x is transitive and asymmetric for each instance x ∈ X. Our training set {(q_i, s_i)}_{i=1}^n contains data points (q_i, s_i) = ((x_i, y_i), s_i) ∈ (X × Y) × R, each consisting of an instance-label tuple q_i = (x_i, y_i) ∈ X × Y and its score s_i ∈ R. We define the pair of data points ((x, y), s) and ((x′, y′), s′) to be relevant iff x = x′, and irrelevant otherwise.

As an example, consider an information retrieval task where every query is associated with a set of retrieved documents. The intersection of the retrieved documents associated with different queries can be either empty or non-empty. We are usually interested in ranking the documents that are associated with a single query (the one that has retrieved the documents). Thus, ranks between documents retrieved by different queries are not relevant for this task, whereas those documents retrieved by the same query are relevant.

Given a relevant pair ((x, y), s) and ((x, y′), s′), we say that instance x prefers label y to y′ iff s > s′. If s = s′, the labels are called tied. Accordingly, we write y ≻_x y′ if s > s′ and y ∼_x y′ if s = s′. Finally, we define our training set T = (Q, s, W), where Q = (q_1, ..., q_n)^t ∈ (X × Y)^n is the vector of instance-label training tuples and s = (s_1, ..., s_n)^t ∈ R^n is the corresponding vector of scores.
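The grouping of data points into relevant pairs can be illustrated with a short sketch. This is not part of the paper's formulation; the helper name and the query-identifier representation are assumptions, following the retrieval example above:

```python
import numpy as np

def build_relevance_matrix(query_ids):
    """W[i, j] = 1 iff data points i and j belong to the same query
    (a relevant pair, i != j), and 0 otherwise; the diagonal is zero."""
    q = np.asarray(query_ids)
    W = (q[:, None] == q[None, :]).astype(float)
    np.fill_diagonal(W, 0.0)
    return W

# Two queries: points 0-2 were retrieved by query "a", points 3-4 by query "b".
W = build_relevance_matrix(["a", "a", "a", "b", "b"])
```

Only within-query pairs enter the matrix, matching the definition of relevance above; this matrix W is exactly the preference graph used in the next section.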
The matrix W defines a preference graph and incorporates information about the relevance of a particular data point to the task, e.g. [W]_{i,j} = 1 if (q_i, q_j), 1 ≤ i, j ≤ n, i ≠ j, are relevant, and [W]_{i,j} = 0 otherwise.

Informally, the goal of our ranking task is to find a label ranking function such that the ranking P_{f,x} ⊆ Y × Y induced by the function for any instance x ∈ X is a good "prediction" of the true preference relation P_x ⊆ Y × Y. Formally, we search for a function f : X × Y → R mapping each instance-label tuple (x, y) to a real value representing the (predicted) relevance of the label y with respect to the instance x. To measure how well a hypothesis f is able to predict the preference relations P_x for all instances x ∈ X, we consider the following cost function (disagreement error), which captures the number of incorrectly predicted pairs of relevant training data points:

    d(f, T) = (1/2) Σ_{i,j=1}^n [W]_{i,j} |sign(s_i − s_j) − sign(f(q_i) − f(q_j))|,    (1)

where sign(·) denotes the signum function.

3 Ranking Pursuit

In this section we tailor the kernel matching pursuit algorithm [12] to the specific setting of preference learning and ranking. Consider the training set T = (Q, s, W) and a dictionary of functions D = {k_1, ..., k_N}, where N is the number of functions in the dictionary. We are interested in finding a sparse approximation of the prediction function f_P(q) = Σ_{p=1}^P a_p k_{γ_p}(q) using the basis functions {k_{γ_1}, ..., k_{γ_P}} ⊂ D and the coefficients {a_1, ..., a_P} ∈ R^P. The order of the dictionary functions as they appear in the expansion is given by a set of indices {γ_1, ..., γ_P}, where γ ∈ {1, ..., N}. The basis functions are chosen to be kernel functions, similar to [12], that is, k_γ(q) = k(q_γ, q) with k(·,·) an appropriate kernel function. We will use the notation f_P = (f_P(q_1), ..., f_P(q_n))^t for the n-dimensional vector that corresponds to the evaluation of the function on the training points, and similarly k_γ = (k_γ(q_1), ..., k_γ(q_n))^t. We also define r = s − f_P to be the residue. The basis functions and the corresponding coefficients are to be chosen such that they minimize an approximation of the disagreement error:

    c(f_P, T) = (1/2) Σ_{i,j=1}^n [W]_{i,j} ((s_i − s_j) − (f_P(q_i) − f_P(q_j)))^2,

which in matrix form can be written as

    c(f_P, T) = (s − f_P)^t L (s − f_P),    (2)

where L is the Laplacian matrix of the graph W.

Require: Training set with scored data T, dictionary of functions D, number of basis functions P.
1: Initialize: set residue vector r_1 = s
2: for p = 1, ..., P (or until performance on the validation set stops improving) do
3:   for each possible labeling γ do
4:     Compute a*(γ) = (k_γ^t L k_γ)^{−1} k_γ^t L r_p
5:     Set J(γ) = J(a*(γ), γ)
6:   end for
7:   Pick γ_p = argmin_γ J(γ)
8:   Set a_p = a*(γ_p) and compute the new residuals r_{p+1} = r_p − a_p k_{γ_p}
9: end for
10: Compute prediction: f_P(q) = Σ_{p=1}^P a_p k_{γ_p}(q)

Fig. 1. Supervised ranking pursuit algorithm.

The ranking pursuit starts at stage 0, with f_0, and recursively appends functions to an initially empty basis at each stage of training, so as to reduce the approximation of the ranking error. Given f_p, we build f_{p+1}(a, γ) = f_p + a k_γ by searching for γ ∈ {1, ..., N} and a ∈ R such that at every step the error (of the residue) is minimized:

    J(a, γ) = c(f_{p+1}(a, γ), T) = (s − f_{p+1}(a, γ))^t L (s − f_{p+1}(a, γ)) = (r_p − a k_γ)^t L (r_p − a k_γ),

where we define k_{γ_i} = k(q_γ, q_i) and k_γ = (k_{γ_1}, ..., k_{γ_n})^t.
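The greedy loop of Figure 1 can be sketched compactly in Python. This is an illustration, not the authors' implementation; the kernel evaluations are assumed precomputed into the columns of a matrix K_dict, and the function name is hypothetical:

```python
import numpy as np

def ranking_pursuit(K_dict, s, L, P):
    """Greedy ranking pursuit (cf. Figure 1).
    K_dict : n x N matrix whose column g holds the basis vector k_gamma
             evaluated on the n training points.
    s      : score vector; L : graph Laplacian of W; P : number of basis functions.
    Returns the selected dictionary indices and their coefficients."""
    r = np.asarray(s, dtype=float).copy()   # residue r_1 = s
    indices, coefs = [], []
    for _ in range(P):
        best_J, best_g, best_a = np.inf, None, 0.0
        for g in range(K_dict.shape[1]):
            k = K_dict[:, g]
            denom = k @ L @ k
            if denom < 1e-12:               # skip (near-)degenerate basis vectors
                continue
            a = (k @ L @ r) / denom         # closed-form minimizer for this gamma
            res = r - a * k
            J = res @ L @ res               # pairwise squared loss, eq. (2)
            if J < best_J:
                best_J, best_g, best_a = J, g, a
        if best_g is None:                  # no usable basis vector left
            break
        indices.append(best_g)
        coefs.append(best_a)
        r = r - best_a * K_dict[:, best_g]  # residual update
    return indices, np.array(coefs)
```

Since x^t L x = (1/2) Σ_{i,j} [W]_{i,j} (x_i − x_j)^2, the inner objective J is exactly the pairwise squared loss of eq. (2) evaluated on the residue.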
By setting the first derivative to zero and solving the resulting system of equations, we can obtain the coefficient a that minimizes J(a, γ) for a given γ, that is,

    a = (k_γ^t L k_γ)^{−1} k_γ^t L r_p.

The set of basis functions and coefficients obtained at every iteration of the algorithm is suboptimal. This can be corrected by a back-fitting procedure using a least-squares approximation of the disagreement error. The optimal value of the parameter P, which can be considered a "regularization" parameter of the algorithm, is estimated using a cross-validation procedure. The pseudo-code for the algorithm is presented in Figure 1.

3.1 Learning Multiple Near-Optimal Solutions

In this subsection we formulate an extension of the ranking pursuit algorithm that can efficiently use unscored data to improve the performance of the algorithm. The main idea behind our approach is to construct multiple, near-optimal, "sparse" ranking functions that give a small error on the scored data and whose predictions agree on the unscored part.

Semi-supervised learning algorithms have gained more and more attention in recent years, as unlabeled data is typically much easier to obtain than labeled data. Multi-view learning algorithms, such as co-training [14], split the attributes into independent sets, and an algorithm is learnt based on these different "views". The goal of the learning process consists of finding a prediction function for every view (for the learning task) that performs well on the labeled data of the designated view, such that all prediction functions agree on the unlabeled data. Closely related to this approach is the co-regularization framework described in [15,16], where the same idea of agreement maximization between the predictors is central.
Briefly stated, algorithms based upon this approach search for hypotheses from different Reproducing Kernel Hilbert Spaces (RKHS) [17], namely views, such that the training error of each hypothesis on the labeled data is small and, at the same time, the hypotheses give similar predictions for the unlabeled data. Within this framework, the disagreement is taken into account via a co-regularization term. Empirical results show that the co-regularization approach works well for classification [15], regression [18], and clustering tasks [19,20]. Moreover, theoretical investigations demonstrate that the co-regularization approach reduces the Rademacher complexity by an amount that depends on the "distance" between the views [21,22].

Let us consider M different feature spaces H_1, ..., H_M that can be constructed from different data point descriptions (i.e., different features) or by using different kernel functions. Similar to [12], we consider each H to be an RKHS. In addition to the training set T = (Q, s, W) originating from a set {(q_i, s_i)}_{i=1}^n of data points with scoring information, we also have a training set T̄ = (Q̄, W̄) from a set {q̄_i}_{i=1}^l of data points without scoring information, with Q̄ = (q̄_1, ..., q̄_l)^t ∈ (X × Y)^l and the corresponding adjacency matrix W̄. To avoid misunderstandings with the definition of the label ranking task, we will use the terms "scored" instead of "labeled" and "unscored" instead of "unlabeled". We search for the functions F_P = (f_P^(1), ..., f_P^(M)) ∈ H_1 × ... × H_M minimizing

    c̃(F_P, T, T̄) = Σ_{v=1}^M c(f_P^(v), T) + ν Σ_{v,u=1}^M c̄(f_P^(v), f_P^(u), T̄),    (3)

where ν ∈ R_+ is a regularization parameter and c̄ is the loss function measuring the disagreement between the prediction functions of the views on the unscored data:

    c̄(f_P^(v), f_P^(u), T̄) = (1/2) Σ_{i,j=1}^l [W̄]_{i,j} ((f_P^(v)(q̄_i) − f_P^(v)(q̄_j)) − (f_P^(u)(q̄_i) − f_P^(u)(q̄_j)))^2.

Although we have used unscored data in our formulation, we note that the algorithm can also operate in a purely supervised setting. It will then not only minimize the error on the scored data but also enforce agreement among the prediction functions constructed from the different views. The prediction functions f_P^(v) ∈ H_v of (3) for v = 1, ..., M have the form f_P^(v)(q) = Σ_{p=1}^P a_p^(v) k^(v)_{γ^v_p}(q), with corresponding coefficients {a_1^(v), ..., a_P^(v)} ∈ R^P.

Let L̄ denote the Laplacian matrix of the graph W̄. Using a similar approach as in Section 3, we can write the objective function as

    J(a, γ) = c̃(F_{p+1}(a, γ), T, T̄)
            = Σ_{v=1}^M (r_p − a^(v) k^(v)_{γ_v})^t L (r_p − a^(v) k^(v)_{γ_v})
              + ν Σ_{v,u=1}^M (a^(v) k̄^(v)_{γ_v} − a^(u) k̄^(u)_{γ_u})^t L̄ (a^(v) k̄^(v)_{γ_v} − a^(u) k̄^(u)_{γ_u}),

where a = (a^(1), ..., a^(M))^t ∈ R^M, γ = (γ_1, ..., γ_M) with γ_v ∈ {1, ..., N}, and k̄_γ is the basis vector expansion on the unscored data, with k̄_{γ_i} = k(q_γ, q̄_i). By taking partial derivatives with respect to the coefficients in each view (for clarity we denote k^(v)_{γ_v} and k̄^(v)_{γ_v} as k^(v) and k̄^(v), respectively) and defining g_ν^(v) = 2ν(M − 1) k̄^(v)t L̄ k̄^(v) and g^(v) = k^(v)t L k^(v), we obtain

    ∂J(a, γ)/∂a^(v) = 2 (g^(v) + g_ν^(v)) a^(v) − 2 k^(v)t L r_p − 4ν Σ_{u=1, u≠v}^M k̄^(v)t L̄ k̄^(u) a^(u).
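Setting these partial derivatives to zero couples the views through a small M × M linear system in the coefficients. A sketch of how it can be assembled and solved (illustrative only; the function name is hypothetical, and ks[v], kbars[v] stand for the chosen vectors k^(v) and k̄^(v)):

```python
import numpy as np

def solve_view_coefficients(ks, kbars, L, Lbar, r, nu):
    """Assemble and solve the M x M system implied by the zero-gradient
    conditions, for one candidate basis vector per view.
    ks[v]    : k^(v), basis vector evaluated on the n scored points
    kbars[v] : kbar^(v), basis vector evaluated on the l unscored points
    L, Lbar  : Laplacians of W and W-bar; r : current residue; nu >= 0."""
    M = len(ks)
    A = np.empty((M, M))
    b = np.empty(M)
    for v in range(M):
        b[v] = ks[v] @ L @ r
        for u in range(M):
            if u == v:
                # diagonal entry: g^(v) + g_nu^(v)
                A[v, v] = (ks[v] @ L @ ks[v]
                           + 2 * nu * (M - 1) * (kbars[v] @ Lbar @ kbars[v]))
            else:
                # off-diagonal coupling between views v and u
                A[v, u] = -2 * nu * (kbars[v] @ Lbar @ kbars[u])
    return np.linalg.solve(A, b)
```

With ν = 0 the system is diagonal, and each view's coefficient reduces to the supervised solution (k^(v)t L r_p) / (k^(v)t L k^(v)), as one would expect.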
At the optimum we have ∂J(a, γ)/∂a^(v) = 0 for all views; thus, we get the exact solution by solving

    [ g^(1) + g_ν^(1)        −2ν k̄^(1)t L̄ k̄^(2)   ... ] [ a^(1) ]   [ k^(1)t L r_p ]
    [ −2ν k̄^(2)t L̄ k̄^(1)   g^(2) + g_ν^(2)        ... ] [ a^(2) ] = [ k^(2)t L r_p ]
    [ ...                    ...                    ... ] [ ...   ]   [ ...          ]

with respect to the coefficients in each view. Note that the left-hand side matrix is positive definite by construction and, therefore, invertible. Once the coefficients are estimated, multiple solutions can be obtained using the prediction functions constructed for each view. We can also consider a single prediction function that is given, for example, by the average of the functions for all views.

The overall complexity of the standard ranking pursuit algorithm is O(P n^2); thus, there is no increase in computational time compared to the kernel matching pursuit algorithm in the supervised setting [12]. The semi-supervised version of the ranking pursuit algorithm requires O(P n M (M^3 + M^2 l)) time, which is linear in the number of unscored data points.* The pseudo-code for the algorithm is presented in Figure 2.

Require: Training set with scored and unscored data T, T̄, dictionary of functions D, number of basis functions P, co-regularization parameter ν.
1: Initialize: set residue vector r_1 = s
2: for p = 1, ..., P (or until performance on the validation set stops improving) do
3:   for each possible labeling γ do
4:     Compute a*(γ) = (B + C)^{−1} e using the matrices
         B = diag(g^(1), g^(2), ...),
         e = (k^(1)t_{γ_1} L r_p, k^(2)t_{γ_2} L r_p, ...)^t,
         C = [ g_ν^(1)                       −2ν k̄^(1)t_{γ_1} L̄ k̄^(2)_{γ_2}   ... ]
             [ −2ν k̄^(2)t_{γ_2} L̄ k̄^(1)_{γ_1}   g_ν^(2)                      ... ]
             [ ...                           ...                           ... ]
5:     Set J(γ) = J(a*(γ), γ)
6:   end for
7:   Pick γ_p = argmin_γ J(γ)
8:   Set a_p = a*(γ_p) and compute the new residuals r_{p+1} = r_p − (1/M) Σ_{v=1}^M a_p^(v) k^(v)_{γ_v}
9: end for
10: Compute prediction: f_P(q) = (1/M) Σ_{v=1}^M Σ_{p=1}^P a_p^(v) k^(v)_{γ^v_p}(q)

Fig. 2. Semi-supervised ranking pursuit algorithm.

* In semi-supervised learning usually n ≪ l; thus, linear complexity in the number of unscored data points is beneficial. We note that the complexity of the algorithm can be further reduced to O(P M^3 n l) by forcing the indices of the nonzero coefficients in the different views to be the same.

4 Combined Ranking and Regression Pursuit

Recently, a method for combined ranking and regression has been proposed in [23]. The authors suggest that in many circumstances it is beneficial to minimize the combined objective function simultaneously, since the algorithm can then avoid learning degenerate models suited only to some particular set of performance metrics. Furthermore, such an objective can help to improve regression performance in some circumstances, e.g. when there is large class imbalance. Empirically, the combined approach gives the "best of both" performance, performing as well at regression as a regression-only method, and as well at ranking as a ranking-only method. However, despite the efficient stochastic gradient descent algorithm described in [23], the objective function to be minimized still consists of two separate parts, namely regression and ranking, with appropriate weight coefficients attached to both. Motivated by the above approach and the strong empirical results presented in [23], we propose a framework for joint ranking and regression optimization based on our ranking pursuit algorithm.
We argue that our approach is slightly more elegant and simpler compared to [23], because we employ a generalization of kernel matching pursuit, a genuine regression algorithm; thus, we do not have to consider two separate objective functions when learning joint ranking and regression models.

Whereas the kernel matching pursuit algorithm minimizes the least-squares error function

    c(f_P, T) = (1/2) Σ_{i=1}^n (s_i − f_P(q_i))^2,

recall that the supervised ranking pursuit chooses the basis functions and the corresponding coefficients such that they minimize an approximation of the disagreement error:

    c(f_P, T) = (1/2) Σ_{i,j=1}^n [W]_{i,j} ((s_i − s_j) − (f_P(q_i) − f_P(q_j)))^2,

which in matrix form can be written as

    c(f_P, T) = (s − f_P)^t L (s − f_P),    (4)

where L is the Laplacian matrix of the graph W defined in Section 2. Note that we can obtain a standard regression algorithm by using an identity matrix instead of L in (4). The simple idea behind our combined ranking and regression approach is an appropriate selection of the weights in the matrix L, so that in one special case we obtain a regression formulation and in another we recover complete pairwise ranking. For this purpose we consider the weighted Laplacian matrix

    L̃ = β I + (1 − β) L.    (5)

By setting the coefficient β equal to zero, we recover the standard ranking pursuit. On the other hand, by setting β equal to 1, we obtain kernel matching pursuit [12]. Setting the coefficient to values between these extremes and using the weighted L̃ in (4) corresponds to minimizing a "combined" ranking and regression objective function. We refer to this algorithm as combined ranking and regression pursuit (CRRP).
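The interpolation in eq. (5) is a one-liner; the sketch below (illustrative; function names are ours) also shows the resulting combined cost for a prediction vector f:

```python
import numpy as np

def crrp_laplacian(L, beta):
    """Weighted Laplacian of eq. (5): beta = 0 gives the ranking-only
    objective, beta = 1 the regression-only (matching pursuit) one."""
    return beta * np.eye(L.shape[0]) + (1.0 - beta) * L

def crrp_cost(s, f, L, beta):
    """Combined cost (s - f)^t L~ (s - f), i.e. eq. (4) with L replaced by L~."""
    d = np.asarray(s, float) - np.asarray(f, float)
    return d @ crrp_laplacian(L, beta) @ d
```

For β = 1 the cost is proportional to the plain squared error Σ_i (s_i − f_i)^2; for β = 0 it is the pairwise ranking loss of eq. (4); intermediate β blends the two.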
5 Subset of Regressors Method for Ranking Algorithms

For comparison with the state-of-the-art, we will compare our algorithm with sparse RankRLS, recently proposed in [24]. The main idea behind sparse RankRLS is to employ a subset selection method (described e.g. in [25]) that is generally used for computational speed-up purposes. For example, a popular approach to speeding up the algorithm consists in approximating the kernel matrix. This, in turn, leads to solutions that do not depend on all data points present in the training set and can thus be considered sparse.

Let us briefly describe this approach. Consider a setup where, instead of selecting a basis function to minimize the disagreement error at every iteration of the algorithm as in Section 3, we choose the prediction function to have the form f(q) = Σ_{p=1}^n a_p k(q, q_p). Given this prediction function, which depends on all training data, the objective function in matrix form can be written as (s − K a)^t L (s − K a), where K ∈ R^{n×n} is a kernel matrix constructed from the training set and a = (a_1, ..., a_n)^t ∈ R^n is the corresponding coefficient vector. Now, let R = {i_1, ..., i_r} ⊆ [n] be a subset of indices such that only a_{i_1}, ..., a_{i_r} are nonzero. By randomly selecting a subset of data points, we can approximate the prediction function using f̂(q) = Σ_{j=1}^r a_{i_j} k(q, q_{i_j}). Similarly, we can approximate the kernel matrix and define the matrix K_{R,R} ∈ R^{r×r} that contains the rows and columns indexed by R. This approach to matrix approximation, known as "subset of regressors", was pioneered in [26] and is frequently applied in practice. Although it may seem over-simplistic (e.g. other methods might appear more suitable than random selection of the regressors), it is efficient and usually leads to quite good performance.
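A sketch of the random subset-of-regressors construction described above (illustrative only; in practice sparse RankRLS also adds a regularization term, which is omitted here, and the function name is ours):

```python
import numpy as np

def subset_of_regressors_fit(K, s, L, r, seed=0):
    """Pick r random indices R, restrict the expansion to those columns of
    the kernel matrix K, and minimize (s - K_R a)^t L (s - K_R a)."""
    rng = np.random.default_rng(seed)
    R = np.sort(rng.choice(K.shape[0], size=r, replace=False))
    K_R = K[:, R]                               # n x r restricted kernel matrix
    A = K_R.T @ L @ K_R                         # normal equations in the L-seminorm
    b = K_R.T @ L @ s
    a = np.linalg.lstsq(A, b, rcond=None)[0]    # lstsq, since L is singular
    return R, a
```

Predictions on a new point q then use only the r stored training points, f̂(q) = Σ_j a_j k(q, q_{R_j}), which is what makes the resulting model sparse.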
The reason behind this is that the solution obtained using a subset of regressors method can be shown to be equivalent to a "non-sparse" solution obtained with some other kernel function (e.g. [24]). In our experiments we evaluate the performance of the sparse RankRLS algorithm and compare it to the supervised and semi-supervised ranking pursuit algorithms. We demonstrate that selection of the nonzero coefficients based on iterative minimization of the disagreement error (the strategy used by the ranking pursuit algorithm) leads to better results compared to random subset selection.

Table 1. Performance comparison of the learning algorithms in the supervised experiment conducted on the Jester joke dataset. A normalized version of the disagreement error is used as the performance evaluation measure. Note that despite performance similar to that of the ranking algorithms, ranking pursuit leads on average to 30% sparser solutions.

Method            20−40   40−60   60−80
RLS               0.425   0.419   0.383
Matching Pursuit  0.428   0.417   0.381
RankSVM           0.412   0.404   0.372
RankRLS           0.409   0.407   0.374
Sparse RankRLS    0.414   0.410   0.380
Ranking Pursuit   0.410   0.404   0.373

6 Experiments

6.1 Jester joke dataset

We perform a set of experiments on the publicly available Jester joke dataset. The task we address is the prediction of the joke preferences of a user based on the preferences of other users. The dataset contains 4.1M ratings, in the range from −10.0 to +10.0, of 100 jokes, assigned by a group of 73,421 users. Our experimental setup is similar to that of [2]. We have grouped the users into three groups according to the number of jokes they have rated: 20−40 jokes, 40−60 jokes, and 60−80 jokes. The test users are randomly selected among the users who had rated between 50 and 300 jokes. For each test user, half of the preferences is reserved for training and half for testing.
The preferences are derived from the differences of the ratings the test user gives to jokes, e.g. a joke with a higher score is preferred over a joke with a lower score. The features for each test user are generated as follows. A set of 300 reference users is selected at random from one of the three groups, and their ratings for the corresponding jokes are used as feature values. In case a user has not rated the joke, the median of his/her ratings is used as the feature value. The experiment is done for 300 different test users and the average performance is recorded. Finally, we repeat the complete experiment ten times with a different set of 300 test users selected at random. We report the average value over the ten runs for each of the three groups.

In this experiment we compare the performance of the ranking pursuit algorithm to several algorithms, namely kernel matching pursuit [12], RankSVM [11], RLS [27] (also known as kernel ridge regression [28], proximal SVM [29], and LS-SVM [30]), RankRLS, and sparse RankRLS [24], in terms of the disagreement error (1). In all algorithms we use a Gaussian kernel, where the width parameter is chosen from the set {2^{−15}, 2^{−14}, ..., 2^{14}, 2^{15}}, and other parameters (e.g. stopping criteria) are chosen by taking the average over the performances on a hold-out set. The hold-out set is created similarly to the corresponding training/test set.

The results of the collaborative filtering experiment are included in Table 1. It can be observed that ranking-based approaches in general outperform the regression methods. According to the Wilcoxon signed-rank test [31], the differences in performance are statistically significant (p < 0.05). However, the differences in performance among the ranking/regression algorithms are not statistically significant. Although the performance of the ranking pursuit algorithm is similar to that of the RankSVM and RankRLS algorithms, the obtained solutions are on average 30% sparser.

To evaluate the performance of the semi-supervised extension of the ranking pursuit algorithm, we construct datasets similarly to the supervised learning experiment, with the following modification. To simulate unscored data, for each test user we make only half of his/her preferences from the training set available for learning. Using this training set we construct two views, each containing half of the scored and half of the unscored data points. The rest of the experimental setup follows the previously described supervised learning setting. The results of this experiment are included in Table 2. We observe notable improvement in the performance of the semi-supervised ranking pursuit algorithm compared to all baseline methods. This improvement is statistically significant according to the Wilcoxon signed-rank test with 0.05 as the significance threshold.

Table 2. Performance comparison of the learning algorithms in the semi-supervised experiment conducted on the Jester joke dataset. Supervised learning methods are trained only on the scored part of the dataset. A normalized version of the disagreement error is used as the performance evaluation measure. Note that semi-supervised ranking pursuit notably outperforms the other methods.

Method              20−40   40−60   60−80
RLS                 0.449   0.434   0.405
Matching Pursuit    0.451   0.433   0.404
RankSVM             0.428   0.417   0.391
RankRLS             0.429   0.418   0.393
Sparse RankRLS      0.431   0.424   0.397
Ranking Pursuit     0.428   0.417   0.393
SS Ranking Pursuit  0.419   0.411   0.381

The Jester joke dataset is available at http://www.ieor.berkeley.edu/~goldberg/jester-data/.
The performance of the supervised methods in this experiment decreases (compared to the supervised learning experiment), as expected, because the amount of labeled data is half as large.

6.2 MovieLens dataset

The MovieLens dataset consists of approximately 1M ratings by 6,040 users for 3,900 movies. Ratings are integers from 1 to 5. The experiments were set up in the same way as for the Jester joke dataset. The results of the supervised experiment are presented in Table 3. Similarly to the results obtained on the Jester joke dataset, we observe that the ranking pursuit algorithm leads to much more compact models, about 35% sparser, while having performance comparable to that of the ranking algorithms.

Table 3. Performance comparison of the learning algorithms in the supervised experiment conducted on the MovieLens dataset. A normalized version of the disagreement error is used as the performance evaluation measure. Note that despite performance similar to that of the ranking algorithms, ranking pursuit leads on average to 35% sparser solutions.

Method            20−40   40−60   60−80
RLS               0.495   0.494   0.482
Matching Pursuit  0.494   0.497   0.484
RankSVM           0.481   0.472   0.453
RankRLS           0.479   0.472   0.455
Sparse RankRLS    0.484   0.478   0.460
Ranking Pursuit   0.480   0.472   0.453

Table 4. Performance comparison of the learning algorithms in the semi-supervised experiment conducted on the MovieLens dataset. Supervised learning methods are trained only on the scored part of the dataset. A normalized version of the disagreement error is used as the performance evaluation measure. Note that semi-supervised ranking pursuit notably outperforms the other methods.

Method              20−40   40−60   60−80
RLS                 0.497   0.495   0.487
Matching Pursuit    0.498   0.495   0.485
RankSVM             0.487   0.479   0.464
RankRLS             0.486   0.479   0.463
Sparse RankRLS      0.490   0.485   0.470
Ranking Pursuit     0.487   0.479   0.462
SS Ranking Pursuit  0.481   0.474   0.458
The results of the semi-supervised experiment are presented in Table 4. We again observe a notable improvement in the performance of the semi-supervised ranking pursuit algorithm compared to all baseline methods. The improvement is statistically significant (p < 0.05).

6.3 CRRP Algorithm Evaluation

To empirically evaluate our approach, termed combined ranking and regression pursuit (CRRP), we conduct experiments using the CRRP algorithm on the Jester joke dataset. The experiments are conducted following the supervised learning setup described above. We use the disagreement error and the mean squared error (MSE) to measure the performance of the algorithm in the ranking and regression settings, respectively. The obtained results are presented in Table 5. It can be observed that by choosing the weight coefficient appropriately (β = 0.5), the CRRP algorithm performs almost as well as the specialized algorithms on ranking and regression tasks.

Table 5. Performance comparison of the ranking, regression, and CRRP algorithms on the Jester joke dataset. For the performance evaluation in the regression task we use the mean squared error (MSE), and for the ranking task we use the normalized version of the disagreement error. Note that CRRP is able to achieve good performance in both ranking and regression settings.

Ranking task         20-40   40-60   60-80
RLS                  0.425   0.419   0.383
Matching Pursuit     0.428   0.417   0.381
RankSVM              0.412   0.404   0.372
RankRLS              0.409   0.407   0.374
Sparse RankRLS       0.414   0.410   0.380
Ranking Pursuit      0.410   0.404   0.373
CRRP                 0.413   0.408   0.373

Regression task      20-40   40-60   60-80
RLS                  21.6    19.3    15.2
Matching Pursuit     20.1    18.7    14.9
RankSVM              34.2    31.6    28.9
RankRLS              33.8    31.3    29.2
Sparse RankRLS       36.5    33.8    32.9
Ranking Pursuit      34.0    30.9    29.1
CRRP                 23.2    20.4    17.3
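The trade-off governed by the weight coefficient β can be illustrated with a minimal sketch: a pairwise squared loss on score differences (the pairwise analogue of the squared loss used by ranking pursuit) blended with a pointwise squared regression loss. The function name and the exact weighting scheme are illustrative assumptions, not the authors' precise CRRP objective:

```python
import numpy as np

def combined_loss(y, f, beta=0.5):
    """Illustrative combined ranking/regression objective: beta weights
    a pairwise squared ranking loss against a pointwise squared
    regression loss. beta=1 is rank-only, beta=0 is regression-only."""
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    # pointwise squared error (regression part)
    regression = np.mean((y - f) ** 2)
    # squared loss on score differences over all pairs i < j (ranking part)
    dy = y[:, None] - y[None, :]
    df = f[:, None] - f[None, :]
    iu = np.triu_indices(len(y), k=1)
    ranking = np.mean((dy[iu] - df[iu]) ** 2)
    return beta * ranking + (1.0 - beta) * regression
```

Note that predictions shifted by a constant incur zero ranking loss but a nonzero regression loss, which matches the observation that rank-only methods do poorly on MSE while intermediate β balances both criteria.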
Note that when considering the regression setting, the CRRP algorithm improves over the MSE performance of the rank-only methods, but does not outperform the regression-only methods. Furthermore, the performance differences of the CRRP algorithm to the ranking methods on the regression task, as well as to the regression methods on the ranking task, are statistically significant according to a Wilcoxon signed-rank test (p < 0.05). To summarize, while CRRP does not outperform specialized algorithms in regression or ranking, it is able to achieve notably better performance compared to the regression-only methods on the ranking task and the ranking-only methods on the regression task.

7 Conclusions

We propose a sparse preference learning/ranking algorithm as well as its semi-supervised extension. Our algorithm is a generalization of the kernel matching pursuit algorithm [12] and allows explicit control over the sparsity of the solution. It is also naturally applicable in circumstances where one is interested in obtaining multiple near-optimal solutions, which frequently arise during sparse modeling of many problems in biology, information retrieval, natural language processing, etc. Another contribution of this paper is a combined ranking and regression (CRRP) method, formulated within the framework of the proposed ranking pursuit algorithm.

The empirical evaluation demonstrates that in the supervised setting our algorithm outperforms regression methods such as kernel matching pursuit and RLS, and performs comparably to the RankRLS, sparse RankRLS, and RankSVM algorithms, while providing sparser solutions. In the semi-supervised setting our ranking pursuit algorithm notably outperforms all baseline methods.
We also show that the CRRP algorithm is suitable for learning combined ranking and regression objectives and leads to good performance in both ranking and regression tasks. In the future we aim to apply our algorithm in other domains and will examine different aggregation techniques for multiple sparse solutions.

References

1. Fürnkranz, J., Hüllermeier, E., eds.: Preference Learning. Springer (2010)
2. Cortes, C., Mohri, M., Rastogi, A.: Magnitude-preserving ranking algorithms. In Ghahramani, Z., ed.: Proceedings of the International Conference on Machine Learning, New York, NY, USA, ACM (2007) 169–176
3. Collins, M., Koo, T.: Discriminative reranking for natural language parsing. Computational Linguistics 31(1) (2005) 25–70
4. Kuang, R., Weston, J., Noble, W.S., Leslie, C.S.: Motif-based protein ranking by network propagation. Bioinformatics 21(19) (2005) 3711–3718
5. Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58 (1996) 267–288
6. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67 (2005) 301–320
7. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68 (2006) 49–67
8. Bach, F.R.: Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research 9 (2008) 1179–1225
9. Lee, S., Zhu, J., Xing, E.P.: Adaptive multi-task lasso: with application to eQTL detection. In: Advances in Neural Information Processing Systems (2010)
10. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In Leen, T.K., Dietterich, T.G., Tresp, V., eds.: Advances in Neural Information Processing Systems, MIT Press (2000) 556–562
11.
Joachims, T.: A support vector method for multivariate performance measures. In: Proceedings of the International Conference on Machine Learning, New York, NY, USA, ACM (2005) 377–384
12. Vincent, P., Bengio, Y.: Kernel matching pursuit. Machine Learning 48(1–3) (2002) 165–187
13. Dekel, O., Manning, C.D., Singer, Y.: Log-linear models for label ranking. In Thrun, S., Saul, L., Schölkopf, B., eds.: Advances in Neural Information Processing Systems, Cambridge, MA, MIT Press (2004) 497–504
14. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the Conference on Computational Learning Theory, New York, NY, USA, ACM (1998) 92–100
15. Sindhwani, V., Niyogi, P., Belkin, M.: A co-regularization approach to semi-supervised learning with multiple views. In: Proceedings of the ICML Workshop on Learning with Multiple Views (2005)
16. Tsivtsivadze, E., Pahikkala, T., Boberg, J., Salakoski, T., Heskes, T.: Co-regularized least-squares for label ranking. In Fürnkranz, J., Hüllermeier, E., eds.: Preference Learning. Springer (2011) 107–123
17. Schölkopf, B., Herbrich, R., Smola, A.J.: A generalized representer theorem. In Helmbold, D.P., Williamson, B., eds.: Proceedings of the Conference on Computational Learning Theory, London, Springer (2001) 416–426
18. Brefeld, U., Gärtner, T., Scheffer, T., Wrobel, S.: Efficient co-regularised least squares regression. In: Proceedings of the International Conference on Machine Learning, New York, NY, USA, ACM (2006) 137–144
19. Kumar, A., Daumé III, H.: A co-training approach for multi-view spectral clustering. In Getoor, L., Scheffer, T., eds.: Proceedings of the International Conference on Machine Learning, ACM (2011) 393–400
20. Kumar, A., Rai, P., Daumé III, H.: Co-regularized multi-view spectral clustering.
In Shawe-Taylor, J., Zemel, R., eds.: Advances in Neural Information Processing Systems, MIT Press (2011)
21. Rosenberg, D., Bartlett, P.L.: The Rademacher complexity of co-regularized kernel classes. In Meila, M., Shen, X., eds.: Proceedings of the International Conference on Artificial Intelligence and Statistics (2007) 396–403
22. Sindhwani, V., Rosenberg, D.: An RKHS for multi-view learning and manifold co-regularization. In McCallum, A., Roweis, S., eds.: Proceedings of the International Conference on Machine Learning, Helsinki, Finland, Omnipress (2008) 976–983
23. Sculley, D.: Combined regression and ranking. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10), ACM (2010) 979–988
24. Pahikkala, T., Tsivtsivadze, E., Airola, A., Järvinen, J., Boberg, J.: An efficient algorithm for learning to rank from preference graphs. Machine Learning 75(1) (2009) 129–165
25. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). MIT Press (2005)
26. Poggio, T., Girosi, F.: Networks for approximation and learning. Proceedings of the IEEE 78(9) (1990) 1481–1497
27. Rifkin, R., Yeo, G., Poggio, T.: Regularized least-squares classification. In Suykens, J., Horvath, G., Basu, S., Micchelli, C., Vandewalle, J., eds.: Advances in Learning Theory: Methods, Models and Applications, Amsterdam, IOS Press (2003) 131–154
28. Saunders, C., Gammerman, A., Vovk, V.: Ridge regression learning algorithm in dual variables. In: Proceedings of the International Conference on Machine Learning, Morgan Kaufmann Publishers Inc. (1998) 515–521
29. Fung, G., Mangasarian, O.L.: Proximal support vector machine classifiers. In Provost, F., Srikant, R., eds.: Proceedings of KDD-2001: Knowledge Discovery and Data Mining, ACM (2001) 77–86
30.
Suykens, J.A.K., Van Gestel, T., De Brabanter, J., De Moor, B., Vandewalle, J.: Least Squares Support Vector Machines. World Scientific Publishing, Singapore (2002)
31. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7 (2006) 1–30