A Survey on Metric Learning for Feature Vectors and Structured Data

Technical report

Aurélien Bellet* (bellet@usc.edu)
Department of Computer Science
University of Southern California
Los Angeles, CA 90089, USA

Amaury Habrard (amaury.habrard@univ-st-etienne.fr)
Marc Sebban (marc.sebban@univ-st-etienne.fr)
Laboratoire Hubert Curien UMR 5516
Université de Saint-Etienne
18 rue Benoit Lauras, 42000 St-Etienne, France

Abstract

The need for appropriate ways to measure the distance or similarity between data is ubiquitous in machine learning, pattern recognition and data mining, but handcrafting such good metrics for specific problems is generally difficult. This has led to the emergence of metric learning, which aims at automatically learning a metric from data and has attracted a lot of interest in machine learning and related fields for the past ten years. This survey paper proposes a systematic review of the metric learning literature, highlighting the pros and cons of each approach. We pay particular attention to Mahalanobis distance metric learning, a well-studied and successful framework, but additionally present a wide range of methods that have recently emerged as powerful alternatives, including nonlinear metric learning, similarity learning and local metric learning. Recent trends and extensions, such as semi-supervised metric learning, metric learning for histogram data and the derivation of generalization guarantees, are also covered. Finally, this survey addresses metric learning for structured data, in particular edit distance learning, and attempts to give an overview of the remaining challenges in metric learning for the years to come.

Keywords: Metric Learning, Similarity Learning, Mahalanobis Distance, Edit Distance

© Aurélien Bellet, Amaury Habrard and Marc Sebban.

1. Introduction

The notion of pairwise metric (used throughout this survey as a generic term for distance, similarity or dissimilarity function) between data points plays an important role in many machine learning, pattern recognition and data mining techniques.[1] For instance, in classification, the k-Nearest Neighbor classifier (Cover and Hart, 1967) uses a metric to identify the nearest neighbors; many clustering algorithms, such as the prominent K-Means (Lloyd, 1982), rely on distance measurements between data points; in information retrieval, documents are often ranked according to their relevance to a given query based on similarity scores. Clearly, the performance of these methods depends on the quality of the metric: as in the saying "birds of a feather flock together", we hope that it identifies as similar (resp. dissimilar) the pairs of instances that are indeed semantically close (resp. different). General-purpose metrics exist (e.g., the Euclidean distance and the cosine similarity for feature vectors, or the Levenshtein distance for strings) but they often fail to capture the idiosyncrasies of the data of interest. Improved results are expected when the metric is designed specifically for the task at hand. Since manual tuning is difficult and tedious, a lot of effort has gone into metric learning, the research topic devoted to automatically learning metrics from data.

*. Most of the work in this paper was carried out while the author was affiliated with Laboratoire Hubert Curien UMR 5516, Université de Saint-Etienne, France.
1. Metric-based learning methods were the focus of the recent SIMBAD European project (ICT 2008-FET 2008-2011). Website: http://simbad-fp7.eu/
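As a tiny illustration of how much the choice of metric matters to such methods, consider a 1-nearest-neighbor rule on made-up toy data. Simply reweighting the features (a diagonal special case of the metrics studied in this survey) changes which point is nearest, and hence the prediction; the data, weights and function name below are ours, chosen purely for illustration.

```python
import numpy as np

# Toy data: the first feature carries the class information,
# the second is noise on a much larger scale.
X = np.array([[0.0,  0.0],    # class 0
              [0.2, 90.0],    # class 0
              [1.1,  0.0]])   # class 1
y = np.array([0, 0, 1])
query = np.array([1.05, 80.0])

def nearest_label(q, X, y, w):
    # Weighted Euclidean distance: a diagonal special case of a
    # Mahalanobis metric, with M = diag(w).
    d = np.sqrt((w * (X - q) ** 2).sum(axis=1))
    return y[np.argmin(d)]

print(nearest_label(query, X, y, np.array([1.0, 1.0])))  # plain Euclidean: predicts 0
print(nearest_label(query, X, y, np.array([1.0, 0.0])))  # noisy feature ignored: predicts 1
```

With the plain Euclidean distance the large-scale noisy feature dominates and the query is matched to a point of class 0; down-weighting that feature flips the nearest neighbor to the class-1 point. Metric learning automates exactly this kind of adjustment.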
1.1 Metric Learning in a Nutshell

Although its origins can be traced back to some earlier work (e.g., Short and Fukunaga, 1981; Fukunaga, 1990; Friedman, 1994; Hastie and Tibshirani, 1996; Baxter and Bartlett, 1997), metric learning really emerged in 2002 with the pioneering work of Xing et al. (2002) that formulates it as a convex optimization problem. It has since been a hot research topic, being the subject of tutorials at ICML 2010[2] and ECCV 2010[3] and workshops at ICCV 2011,[4] NIPS 2011[5] and ICML 2013.[6]

The goal of metric learning is to adapt some pairwise real-valued metric function, say the Mahalanobis distance d_M(x, x') = sqrt((x − x')^T M (x − x')), to the problem of interest using the information brought by training examples. Most methods learn the metric (here, the positive semi-definite matrix M in d_M) in a weakly-supervised way from pair- or triplet-based constraints of the following form:

• Must-link / cannot-link constraints (sometimes called positive / negative pairs):

    S = {(x_i, x_j) : x_i and x_j should be similar},
    D = {(x_i, x_j) : x_i and x_j should be dissimilar}.

• Relative constraints (sometimes called training triplets):

    R = {(x_i, x_j, x_k) : x_i should be more similar to x_j than to x_k}.

A metric learning algorithm basically aims at finding the parameters of the metric such that it best agrees with these constraints (see Figure 1 for an illustration), in an effort to approximate the underlying semantic metric. This is typically formulated as an optimization problem that has the following general form:

    min_M  ℓ(M, S, D, R) + λ R(M)

where ℓ(M, S, D, R) is a loss function that incurs a penalty when training constraints are violated, R(M) is some regularizer on the parameters M of the learned metric and λ ≥ 0 is the regularization parameter. As we will see in this survey, state-of-the-art metric learning formulations essentially differ by their choice of metric, constraints, loss function and regularizer.

After the metric learning phase, the resulting function is used to improve the performance of a metric-based algorithm, which is most often k-Nearest Neighbors (k-NN), but may also be a clustering algorithm such as K-Means, a ranking algorithm, etc. The common process in metric learning is summarized in Figure 2.

2. http://www.icml2010.org/tutorials.html
3. http://www.ics.forth.gr/eccv2010/tutorials.php
4. http://www.iccv2011.org/authors/workshops/
5. http://nips.cc/Conferences/2011/Program/schedule.php?Session=Workshops
6. http://icml.cc/2013/?page_id=41

Figure 1: Illustration of metric learning applied to a face recognition task. For simplicity, images are represented as points in 2 dimensions. Pairwise constraints, shown in the left pane, are composed of images representing the same person (must-link, shown in green) or different persons (cannot-link, shown in red). We wish to adapt the metric so that there are fewer constraint violations (right pane). Images are taken from the Caltech Faces dataset.[8]

Figure 2: The common process in metric learning. A metric is learned from training data (drawn from an underlying distribution) and plugged into a metric-based algorithm that outputs a predictor (e.g., a classifier, a regressor, a recommender system...) which hopefully performs better than a predictor induced by a standard (non-learned) metric.

1.2 Applications

Metric learning can potentially be beneficial whenever the notion of metric between instances plays an important role.
Recently, it has been applied to problems as diverse as link prediction in networks (Shaw et al., 2011), state representation in reinforcement learning (Taylor et al., 2011), music recommendation (McFee et al., 2012), partitioning problems (Lajugie et al., 2014), identity verification (Ben et al., 2012), webpage archiving (Law et al., 2012), cartoon synthesis (Yu et al., 2012) and even assessing the efficacy of acupuncture (Liang et al., 2012), to name a few. In the following, we list three large fields of application where metric learning has been shown to be very useful.

8. http://www.vision.caltech.edu/html-files/archive.html

Computer vision

There is a great need for appropriate metrics in computer vision, not only to compare images or videos in ad-hoc representations, such as bags-of-visual-words (Li and Perona, 2005), but also in the pre-processing step consisting in building this very representation (for instance, visual words are usually obtained by means of clustering). For this reason, there exists a large body of metric learning literature dealing specifically with computer vision problems, such as image classification (Mensink et al., 2012), object recognition (Frome et al., 2007; Verma et al., 2012), face recognition (Guillaumin et al., 2009b; Lu et al., 2012), visual tracking (Li et al., 2012; Jiang et al., 2012) or image annotation (Guillaumin et al., 2009a).

Information retrieval

The objective of many information retrieval systems, such as search engines, is to provide the user with the most relevant documents according to his/her query. This ranking is often achieved by using a metric between two documents or between a document and a query. Applications of metric learning to these settings include the work of Lebanon (2006); Lee et al. (2008); McFee and Lanckriet (2010); Lim et al. (2013).

Bioinformatics

Many problems in bioinformatics involve comparing sequences such as DNA, proteins or temporal series. These comparisons are based on structured metrics such as edit distance measures (or related string alignment scores) for strings, or the Dynamic Time Warping distance for temporal series. Learning these metrics to adapt them to the task of interest can greatly improve the results. Examples include the work of Xiong and Chen (2006); Saigo et al. (2006); Kato and Nagano (2010); Wang et al. (2012a).

1.3 Related Topics

We mention here three research topics that are related to metric learning but outside the scope of this survey.

Kernel learning

While metric learning is parametric (one learns the parameters of a given form of metric, such as a Mahalanobis distance), kernel learning is usually nonparametric: one learns the kernel matrix without any assumption on the form of the kernel that implicitly generated it. These approaches are thus very powerful but limited to the transductive setting and can hardly be applied to new data. The interested reader may refer to the recent survey on kernel learning by Abbasnejad et al. (2012).

Multiple kernel learning

Unlike kernel learning, Multiple Kernel Learning (MKL) is parametric: it learns a combination of predefined base kernels. In this regard, it can be seen as more restrictive than metric or kernel learning, but as opposed to kernel learning, MKL has very efficient formulations and can be applied in the inductive setting. The interested reader may refer to the recent survey on MKL by Gönen and Alpaydin (2011).
Dimensionality reduction

Supervised dimensionality reduction aims at finding a low-dimensional representation that maximizes the separation of labeled data, and in this respect has connections with metric learning,[9] although the primary objective is quite different. Unsupervised dimensionality reduction, or manifold learning, usually assumes that the (unlabeled) data lie on an embedded low-dimensional manifold within the higher-dimensional space and aims at "unfolding" it. These methods aim at capturing or preserving some properties of the original data (such as the variance or local distance measurements) in the low-dimensional representation.[10] The interested reader may refer to the surveys by Fodor (2002) and van der Maaten et al. (2009).

1.4 Why this Survey?

As pointed out above, metric learning has been a hot topic of research in machine learning for a few years and has now reached a considerable level of maturity both practically and theoretically. The early review due to Yang and Jin (2006) is now largely outdated as it misses out on important recent advances: more than 75% of the work referenced in the present survey is post 2006. A more recent survey, written independently and in parallel to our work, is due to Kulis (2012). Despite some overlap, it should be noted that both surveys have their own strengths and complement each other well. Indeed, the survey of Kulis takes a more general approach, attempting to provide a unified view of a few core metric learning methods. It also goes into depth about topics that are only briefly reviewed here, such as kernelization, optimization methods and applications.
On the other hand, the present survey is a detailed and comprehensive review of the existing literature, covering more than 50 approaches (including many recent works that are missing from Kulis' paper) with their relative merits and drawbacks. Furthermore, we give particular attention to topics that are not covered by Kulis, such as metric learning for structured data and the derivation of generalization guarantees. We think that the present survey may foster novel research in metric learning and be useful to a variety of audiences, in particular: (i) machine learners wanting to get introduced to or update their knowledge of metric learning will be able to quickly grasp the pros and cons of each method as well as the current strengths and limitations of the research area as a whole, and (ii) machine learning practitioners interested in applying metric learning to their own problem will find information to help them choose the methods most appropriate to their needs, along with links to source code whenever available.

Note that we focus on general-purpose methods, i.e., those that are applicable to a wide range of application domains. The abundant literature on metric learning designed specifically for computer vision is not addressed because the understanding of these approaches requires a significant amount of background in that area. For this reason, we think that they deserve a separate survey, targeted at the computer vision audience.

1.5 Prerequisites

This survey is almost self-contained and has few prerequisites. For metric learning from feature vectors, we assume that the reader has some basic knowledge of linear algebra and convex optimization (if needed, see Boyd and Vandenberghe, 2004, for a brush-up). For metric learning from structured data, we assume that the reader has some familiarity with basic probability theory, statistics and likelihood maximization. The notations used throughout this survey are summarized in Table 1.

9. Some metric learning methods can be seen as finding a new feature space, and a few of them actually have the additional goal of making this feature space low-dimensional.
10. These approaches are sometimes referred to as "unsupervised metric learning", which is somewhat misleading because they do not optimize a notion of metric.

Table 1: Summary of the main notations.

    Notation                 Description
    R                        Set of real numbers
    R^d                      Set of d-dimensional real-valued vectors
    R^{c×d}                  Set of c×d real-valued matrices
    S^d_+                    Cone of symmetric PSD d×d real-valued matrices
    X                        Input (instance) space
    Y                        Output (label) space
    S                        Set of must-link constraints
    D                        Set of cannot-link constraints
    R                        Set of relative constraints
    z = (x, y) ∈ X × Y       An arbitrary labeled instance
    x                        An arbitrary vector
    M                        An arbitrary matrix
    I                        Identity matrix
    M ⪰ 0                    PSD matrix M
    ‖·‖_p                    p-norm
    ‖·‖_F                    Frobenius norm
    ‖·‖_*                    Nuclear norm
    tr(M)                    Trace of matrix M
    [t]_+ = max(0, 1 − t)    Hinge loss function
    ξ                        Slack variable
    Σ                        Finite alphabet
    x                        String of finite size

1.6 Outline

The rest of this paper is organized as follows. We first assume that data consist of vectors lying in some feature space X ⊆ R^d. Section 2 describes key properties that we will use to provide a taxonomy of metric learning algorithms. In Section 3, we review the large body of work dealing with supervised Mahalanobis distance learning. Section 4 deals with recent advances and trends in the field, such as linear similarity learning, nonlinear and local methods, histogram distance learning, the derivation of generalization guarantees and semi-supervised metric learning methods. We cover metric learning for structured data in Section 5, with a focus on edit distance learning.
Lastly, we conclude this survey in Section 6 with a discussion on the current limitations of the existing literature and promising directions for future research.

2. Key Properties of Metric Learning Algorithms

Except for a few early methods, most metric learning algorithms are essentially "competitive" in the sense that they are able to achieve state-of-the-art performance on some problems. However, each algorithm has its intrinsic properties (e.g., type of metric, ability to leverage unsupervised data, good scalability with dimensionality, generalization guarantees, etc.) and emphasis should be placed on those when deciding which method to apply to a given problem. In this section, we identify and describe five key properties of metric learning algorithms, summarized in Figure 3. We use them to provide a taxonomy of the existing literature: the main features of each method are given in Table 2.[11]

Figure 3: Five key properties of metric learning algorithms: learning paradigm (fully supervised, weakly supervised, semi-supervised), form of metric (linear, nonlinear, local), optimality of the solution (local, global), scalability (w.r.t. dimension, w.r.t. number of examples) and dimensionality reduction (yes, no).

Learning Paradigm

We will consider three learning paradigms:

• Fully supervised: the metric learning algorithm has access to a set of labeled training instances {z_i = (x_i, y_i)}_{i=1}^n, where each training example z_i ∈ Z = X × Y is composed of an instance x_i ∈ X and a label (or class) y_i ∈ Y. Y is a discrete and finite set of |Y| labels (unless stated otherwise). In practice, the label information is often used to generate specific sets of pair/triplet constraints S, D, R, for instance based on a notion of neighborhood.[12]
• Weakly supervised: the algorithm has no access to the labels of individual training instances: it is only provided with side information in the form of sets of constraints S, D, R. This is a meaningful setting in a variety of applications where labeled data is costly to obtain while such side information is cheap: examples include users' implicit feedback (e.g., clicks on search engine results), citations among articles or links in a network. This can be seen as having label information only at the pair/triplet level.

• Semi-supervised: besides the (full or weak) supervision, the algorithm has access to a (typically large) sample of unlabeled instances for which no side information is available. This is useful to avoid overfitting when the labeled data or side information are scarce.

11. Whenever possible, we use the acronyms provided by the authors of the studied methods. When there is no known acronym, we take the liberty of choosing one.
12. These constraints are usually derived from the labels prior to learning the metric and never challenged. Note that Wang et al. (2012b) propose a more refined (but costly) approach to the problem of building the constraints from labels. Their method alternates between selecting the most relevant constraints given the current metric and learning a new metric based on the current constraints.

Form of Metric

Clearly, the form of the learned metric is a key choice. One may identify three main families of metrics:

• Linear metrics, such as the Mahalanobis distance. Their expressive power is limited but they are easier to optimize (they usually lead to convex formulations, and thus global optimality of the solution) and less prone to overfitting.

• Nonlinear metrics, such as the χ² histogram distance. They often give rise to nonconvex formulations (subject to local optimality) and may overfit, but they can capture nonlinear variations in the data.

• Local metrics, where multiple (linear or nonlinear) local metrics are learned (typically simultaneously) to better deal with complex problems, such as heterogeneous data. They are however more prone to overfitting than global methods since the number of parameters they learn can be very large.

Scalability

With the amount of available data growing fast, the problem of scalability arises in all areas of machine learning. First, it is desirable for a metric learning algorithm to scale well with the number of training examples n (or constraints). As we will see, learning the metric in an online way is one of the solutions. Second, metric learning methods should also scale reasonably well with the dimensionality d of the data. However, since metric learning is often phrased as learning a d × d matrix, designing algorithms that scale reasonably well with this quantity is a considerable challenge.

Optimality of the Solution

This property refers to the ability of the algorithm to find the parameters of the metric that best satisfy the criterion of interest. Ideally, the solution is guaranteed to be the global optimum; this is essentially the case for convex formulations of metric learning. On the contrary, for nonconvex formulations, the solution may only be a local optimum.

Dimensionality Reduction

As noted earlier, metric learning is sometimes formulated as finding a projection of the data into a new feature space. An interesting byproduct in this case is to look for a low-dimensional projected space, allowing faster computations as well as more compact representations. This is typically achieved by forcing or regularizing the learned metric matrix to be low-rank.
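To make these properties concrete, here is a deliberately naive instance of the general "loss + regularizer" formulation of Section 1.1: a linear (Mahalanobis) metric, fully supervised pair constraints derived from labels, hinge-type penalties with a Frobenius regularizer, and a projected gradient step that keeps M in the PSD cone. All function names and hyperparameter values are ours and this is not any specific published algorithm, just a sketch of the moving parts.

```python
import numpy as np
from itertools import combinations

def pairs_from_labels(y):
    """Fully supervised setting: derive must-link (S) and cannot-link (D)
    pair constraints from class labels (a common, simple construction)."""
    idx = list(combinations(range(len(y)), 2))
    S = [(i, j) for i, j in idx if y[i] == y[j]]
    D = [(i, j) for i, j in idx if y[i] != y[j]]
    return S, D

def learn_metric(X, S, D, lam=0.1, lr=0.01, margin=1.0, epochs=100):
    """Toy instance of: min_M loss(M, S, D) + lam * ||M||_F^2.

    Hinge-style penalties push squared distances of must-link pairs below
    the margin and those of cannot-link pairs above it; a projection step
    (clipping negative eigenvalues) keeps M in the PSD cone.
    """
    M = np.eye(X.shape[1])
    for _ in range(epochs):
        G = 2 * lam * M  # gradient of the Frobenius regularizer
        for (i, j) in S:
            v = (X[i] - X[j])[:, None]
            if float(v.T @ M @ v) > margin:   # similar pair too far apart
                G += v @ v.T
        for (i, j) in D:
            v = (X[i] - X[j])[:, None]
            if float(v.T @ M @ v) < margin:   # dissimilar pair too close
                G -= v @ v.T
        M -= lr * G
        w, V = np.linalg.eigh(M)              # projection onto the PSD cone
        M = (V * np.clip(w, 0.0, None)) @ V.T
    return M

# Usage on made-up data: feature 0 separates the classes, feature 1 is noise.
X = np.array([[0.0, 0.0], [0.1, 5.0], [2.0, 0.0], [2.1, 5.0]])
y = np.array([0, 0, 1, 1])
S, D = pairs_from_labels(y)
M = learn_metric(X, S, D)
```

After training, the learned M down-weights the noisy dimension, so must-link pairs end up closer than cannot-link pairs; the eigenvalue clipping illustrates why the PSD constraint hurts scalability in d, as discussed above.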
Page | Name | Year | Source code | Supervision | Form of metric | Scalab. w.r.t. n | Scalab. w.r.t. d | Optimum | Dim. reduction | Regularizer | Additional information
11 | MMC | 2002 | Yes | Weak | Linear | ★✩✩ | ✩✩✩ | Global | No | None | —
11 | S&J | 2003 | No | Weak | Linear | ★★✩ | ★★★ | Global | No | Frobenius norm | —
12 | NCA | 2004 | Yes | Full | Linear | ★✩✩ | ★★✩ | Local | Yes | None | For k-NN
12 | MCML | 2005 | Yes | Full | Linear | ★✩✩ | ✩✩✩ | Global | No | None | For k-NN
13 | LMNN | 2005 | Yes | Full | Linear | ★★✩ | ★✩✩ | Global | No | None | For k-NN
13 | RCA | 2003 | Yes | Weak | Linear | ★★✩ | ★★✩ | Global | No | None | —
14 | ITML | 2007 | Yes | Weak | Linear | ★✩✩ | ★★✩ | Global | No | LogDet | Online version
15 | SDML | 2009 | No | Weak | Linear | ★✩✩ | ★★✩ | Global | No | LogDet + L1 | n ≪ d
15 | POLA | 2004 | No | Weak | Linear | ★★★ | ★✩✩ | Global | No | None | Online
15 | LEGO | 2008 | No | Weak | Linear | ★★★ | ★★✩ | Global | No | LogDet | Online
16 | RDML | 2009 | No | Weak | Linear | ★★★ | ★★✩ | Global | No | Frobenius norm | Online
16 | MDML | 2012 | No | Weak | Linear | ★★★ | ★✩✩ | Global | Yes | Nuclear norm | Online
16 | mt-LMNN | 2010 | Yes | Full | Linear | ★★✩ | ✩✩✩ | Global | No | Frobenius norm | Multi-task
17 | MLCS | 2011 | No | Weak | Linear | ★✩✩ | ★★✩ | Local | Yes | N/A | Multi-task
17 | GPML | 2012 | No | Weak | Linear | ★✩✩ | ★★✩ | Global | Yes | von Neumann | Multi-task
18 | TML | 2010 | Yes | Weak | Linear | ★★✩ | ★★✩ | Global | No | Frobenius norm | Transfer learning
19 | LPML | 2006 | No | Weak | Linear | ★★✩ | ★★✩ | Global | Yes | L1 norm | —
19 | SML | 2009 | No | Weak | Linear | ★✩✩ | ✩✩✩ | Global | Yes | L2,1 norm | —
19 | BoostMetric | 2009 | Yes | Weak | Linear | ★✩✩ | ★★✩ | Global | Yes | None | —
20 | DML-p | 2012 | No | Weak | Linear | ★✩✩ | ★✩✩ | Global | No | None | —
20 | RML | 2010 | No | Weak | Linear | ★★✩ | ✩✩✩ | Global | No | Frobenius norm | Noisy constraints
21 | MLR | 2010 | Yes | Full | Linear | ★★✩ | ✩✩✩ | Global | Yes | Nuclear norm | For ranking
22 | SiLA | 2008 | No | Full | Linear | ★★✩ | ★★✩ | N/A | No | None | Online
22 | gCosLA | 2009 | No | Weak | Linear | ★★★ | ✩✩✩ | Global | No | None | Online
23 | OASIS | 2009 | Yes | Weak | Linear | ★★★ | ★★✩ | Global | No | Frobenius norm | Online
23 | SLLC | 2012 | No | Full | Linear | ★★✩ | ★★✩ | Global | No | Frobenius norm | For linear classif.
24 | RSL | 2013 | No | Full | Linear | ★✩✩ | ★★✩ | Local | No | Frobenius norm | Rectangular matrix
25 | LSMD | 2005 | No | Weak | Nonlinear | ★✩✩ | ★★✩ | Local | Yes | None | —
25 | NNCA | 2007 | No | Full | Nonlinear | ★✩✩ | ★★✩ | Local | Yes | Recons. error | —
26 | SVML | 2012 | No | Full | Nonlinear | ★✩✩ | ★★✩ | Local | Yes | Frobenius norm | For SVM
26 | GB-LMNN | 2012 | No | Full | Nonlinear | ★★✩ | ★★✩ | Local | Yes | None | —
26 | HDML | 2012 | Yes | Weak | Nonlinear | ★★✩ | ★★✩ | Local | Yes | L2 norm | Hamming distance
27 | M2-LMNN | 2008 | Yes | Full | Local | ★★✩ | ★✩✩ | Global | No | None | —
28 | GLML | 2010 | No | Full | Local | ★★★ | ★★✩ | Global | No | Diagonal | Generative
28 | Bk-means | 2009 | No | Weak | Local | ★✩✩ | ★★★ | Global | No | RKHS norm | Bregman dist.
29 | PLML | 2012 | Yes | Weak | Local | ★★✩ | ✩✩✩ | Global | No | Manifold + Frobenius | —
29 | RFD | 2012 | Yes | Weak | Local | ★★✩ | ★★★ | N/A | No | None | Random forests
30 | χ2-LMNN | 2012 | No | Full | Nonlinear | ★★✩ | ★★✩ | Local | Yes | None | Histogram data
31 | GML | 2011 | No | Weak | Linear | ★✩✩ | ★★✩ | Local | No | None | Histogram data
31 | EMDL | 2012 | No | Weak | Linear | ★✩✩ | ★★✩ | Local | No | Frobenius norm | Histogram data
34 | LRML | 2008 | Yes | Semi | Linear | ★✩✩ | ✩✩✩ | Global | No | Laplacian | —
35 | M-DML | 2009 | No | Semi | Linear | ★✩✩ | ✩✩✩ | Local | No | Laplacian | Auxiliary metrics
35 | SERAPH | 2012 | Yes | Semi | Linear | ★✩✩ | ✩✩✩ | Local | Yes | Trace + entropy | Probabilistic
36 | CDML | 2011 | No | Semi | N/A | N/A | N/A | N/A | N/A | N/A | Domain adaptation
36 | DAML | 2011 | No | Semi | Nonlinear | ★✩✩ | ✩✩✩ | Global | No | MMD | Domain adaptation

Table 2: Main features of metric learning methods for feature vectors. Scalability levels are relative and given as a rough guide.

3. Supervised Mahalanobis Distance Learning

This section deals with (fully or weakly) supervised Mahalanobis distance learning (sometimes simply referred to as distance metric learning), which has attracted a lot of interest due to its simplicity and nice interpretation in terms of a linear projection. We start by presenting the Mahalanobis distance and two important challenges associated with learning this form of metric.

The Mahalanobis distance

This term comes from Mahalanobis (1936) and originally refers to a distance measure that incorporates the correlation between features:

    d_maha(x, x') = sqrt((x − x')^T Ω^{−1} (x − x')),

where x and x' are random vectors from the same distribution with covariance matrix Ω. By an abuse of terminology common in the metric learning literature, we will in fact use the term Mahalanobis distance to refer to generalized quadratic distances, defined as

    d_M(x, x') = sqrt((x − x')^T M (x − x'))

and parameterized by M ∈ S^d_+, where S^d_+ is the cone of symmetric positive semi-definite (PSD) d × d real-valued matrices (see Figure 4).[13] M ∈ S^d_+ ensures that d_M satisfies the properties of a pseudo-distance: for all x, x', x'' ∈ X,

1. d_M(x, x') ≥ 0 (nonnegativity),
2. d_M(x, x) = 0 (identity),
3. d_M(x, x') = d_M(x', x) (symmetry),
4. d_M(x, x'') ≤ d_M(x, x') + d_M(x', x'') (triangle inequality).
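The four properties above are easy to verify numerically. A quick sanity check, where the PSD matrix is built as A^T A from an arbitrary matrix A (our own construction for the sketch, not one used by any particular method):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
M = A.T @ A  # symmetric PSD by construction, so d_M is a pseudo-distance

def d_M(u, v):
    diff = u - v
    return float(np.sqrt(diff @ M @ diff))

x, xp, xpp = rng.normal(size=(3, 3))  # three arbitrary points
assert d_M(x, xp) >= 0.0                                  # nonnegativity
assert d_M(x, x) == 0.0                                   # identity
assert np.isclose(d_M(x, xp), d_M(xp, x))                 # symmetry
assert d_M(x, xpp) <= d_M(x, xp) + d_M(xp, xpp) + 1e-12   # triangle inequality
```

Note that d_M is only a pseudo-distance: when M is rank-deficient, d_M(x, x') = 0 can hold for distinct x and x', so the check above does not (and should not) test the converse of the identity property.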
Interpretation

Note that when M is the identity matrix, we recover the Euclidean distance. Otherwise, one can express M as L^T L, where L ∈ R^{k×d} and k is the rank of M. We can then rewrite d_M(x, x') as follows:

    d_M(x, x') = sqrt((x − x')^T M (x − x'))
               = sqrt((x − x')^T L^T L (x − x'))
               = sqrt((Lx − Lx')^T (Lx − Lx')).

Thus, a Mahalanobis distance implicitly corresponds to computing the Euclidean distance after the linear projection of the data defined by the transformation matrix L. Note that if M is low-rank, i.e., rank(M) = r < d, then it induces a linear projection of the data into a space of lower dimension r. It thus allows a more compact representation of the data and cheaper distance computations, especially when the original feature space is high-dimensional. These nice properties explain why learning a Mahalanobis distance has attracted a lot of interest and is a major component of metric learning.

Challenges

This leads us to two important challenges associated with learning Mahalanobis distances. The first one is to maintain M ∈ S^d_+ in an efficient way during the optimization process. A simple way to do this is to use the projected gradient method, which consists in alternating between a gradient step and a projection step onto the PSD cone by setting the negative eigenvalues to zero.[14] However, this is expensive for high-dimensional problems as eigenvalue decomposition scales in O(d³). The second challenge is to learn a low-rank matrix (which implies a low-dimensional projection space, as noted earlier) instead of a full-rank one. Unfortunately, optimizing M subject to a rank constraint or regularization is NP-hard and thus cannot be carried out efficiently.

13. Note that in practice, to get rid of the square root, the Mahalanobis distance is learned in its more convenient squared form d_M^2(x, x') = (x − x')^T M (x − x').
14. Note that Qian et al. (2013) have proposed some heuristics to avoid doing this projection at each iteration.

Figure 4: The cone S²_+ of positive semi-definite 2×2 matrices of the form [α β; β γ].

The rest of this section is a comprehensive review of the supervised Mahalanobis distance learning methods of the literature. We first present two early approaches (Section 3.1). We then discuss methods that are specific to k-nearest neighbors (Section 3.2), inspired from information theory (Section 3.3), online learning approaches (Section 3.4), multi-task learning (Section 3.5) and a few more that do not fit any of the previous categories (Section 3.6).

3.1 Early Approaches

The approaches in this section deal with the PSD constraint in a rudimentary way.

MMC (Xing et al.)

The seminal work of Xing et al. (2002) is the first Mahalanobis distance learning method.[15] It relies on a convex formulation with no regularization, which aims at maximizing the sum of distances between dissimilar points while keeping the sum of distances between similar examples small:

    max_{M ∈ S^d_+}  Σ_{(x_i, x_j) ∈ D} d_M(x_i, x_j)
    s.t.  Σ_{(x_i, x_j) ∈ S} d_M^2(x_i, x_j) ≤ 1.        (1)

The algorithm used to solve (1) is a simple projected gradient approach requiring the full eigenvalue decomposition of M at each iteration. This is typically intractable for medium and high-dimensional problems.

S&J (Schultz & Joachims)

The method proposed by Schultz and Joachims (2003) relies on the parameterization M = A^T W A, where A is fixed and known and W is diagonal.
We get:

$$d^2_M(x_i, x_j) = (Ax_i - Ax_j)^T W (Ax_i - Ax_j).$$

15. Source code available at: http://www.cs.cmu.edu/~epxing/papers/

By definition, M is PSD, and thus one can optimize over the diagonal matrix W and avoid the need for costly projections onto the PSD cone. They propose a formulation based on triplet constraints:

$$\min_W \|M\|^2_F + C \sum_{i,j,k} \xi_{ijk} \quad \text{s.t.} \quad d^2_M(x_i, x_k) - d^2_M(x_i, x_j) \geq 1 - \xi_{ijk} \quad \forall (x_i, x_j, x_k) \in R, \qquad (2)$$

where $\|M\|^2_F = \sum_{i,j} M^2_{ij}$ is the squared Frobenius norm of M, the $\xi_{ijk}$'s are "slack" variables allowing soft constraints[16] and $C \geq 0$ is the trade-off parameter between regularization and constraint satisfaction. Problem (2) is convex and can be solved efficiently. The main drawback of this approach is that it is less general than full Mahalanobis distance learning: one only learns a weighting W of the features. Furthermore, A must be chosen manually.

3.2 Approaches Driven by Nearest Neighbors

The objective functions of the methods presented in this section are related to a nearest neighbor prediction rule.

NCA (Goldberger et al.) The idea of Neighbourhood Component Analysis[17] (NCA), introduced by Goldberger et al. (2004), is to optimize the expected leave-one-out error of a stochastic nearest neighbor classifier in the projection space induced by $d_M$. They use the decomposition $M = L^T L$ and define the probability that $x_i$ selects $x_j$ as its neighbor by

$$p_{ij} = \frac{\exp(-\|Lx_i - Lx_j\|^2_2)}{\sum_{l \neq i} \exp(-\|Lx_i - Lx_l\|^2_2)}, \qquad p_{ii} = 0.$$

Then, the probability that $x_i$ is correctly classified is:

$$p_i = \sum_{j : y_j = y_i} p_{ij}.$$

They learn the distance by solving:

$$\max_L \sum_i p_i. \qquad (3)$$

Note that the matrix L can be chosen to be rectangular, inducing a low-rank M. The main limitation of (3) is that it is nonconvex and thus subject to local maxima. Hong et al.
(2011) later proposed to learn a mixture of NCA metrics, while Tarlow et al. (2013) generalized NCA to k-NN with k > 1.

MCML (Globerson & Roweis) Shortly after Goldberger et al., Globerson and Roweis (2005) proposed MCML (Maximally Collapsing Metric Learning), an alternative convex formulation based on minimizing a KL divergence between $p_{ij}$ and an ideal distribution, which can be seen as attempting to collapse each class to a single point.[18] Unlike NCA, the optimization is done with respect to the matrix M and the problem is thus convex. However, like MMC, MCML requires costly projections onto the PSD cone.

16. This is a classic trick used for instance in soft-margin SVM (Cortes and Vapnik, 1995). Throughout this survey, we will consistently use the symbol ξ to denote slack variables.
17. Source code available at: http://www.ics.uci.edu/~fowlkes/software/nca/

LMNN (Weinberger et al.) Large Margin Nearest Neighbors[19] (LMNN), introduced by Weinberger et al. (2005; 2008; 2009), is one of the most widely-used Mahalanobis distance learning methods and has been the subject of many extensions (described in later sections). One of the reasons for its popularity is that the constraints are defined in a local way: the k nearest neighbors (the "target neighbors") of any training instance should belong to the correct class, while instances of other classes (the "impostors") should be kept away. The Euclidean distance is used to determine the target neighbors. Formally, the constraints are defined in the following way:

$$S = \{(x_i, x_j) : y_i = y_j \text{ and } x_j \text{ belongs to the } k\text{-neighborhood of } x_i\},$$
$$R = \{(x_i, x_j, x_k) : (x_i, x_j) \in S, \; y_i \neq y_k\}.$$
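The constraint sets S and R above can be built from any labeled sample. The following sketch (the function name and the brute-force neighbor search are ours, not the authors' implementation) constructs them with Euclidean target neighbors, as LMNN does:

```python
import numpy as np

def lmnn_constraints(X, y, k=3):
    """Build LMNN-style constraint sets: S holds (i, j) target-neighbor pairs,
    where x_j is among the k same-class Euclidean nearest neighbors of x_i;
    R holds (i, j, l) triplets where x_l carries a different label (impostor
    candidate)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    S, R = [], []
    for i in range(n):
        same = [j for j in np.argsort(D[i]) if j != i and y[j] == y[i]]
        for j in same[:k]:                      # target neighbors of x_i
            S.append((i, j))
            for l in range(n):
                if y[l] != y[i]:                # differently-labeled points
                    R.append((i, j, l))
    return S, R

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
y = np.array([0, 0, 1, 1])
S, R = lmnn_constraints(X, y, k=1)
assert (0, 1) in S and (1, 0) in S
assert all(y[i] == y[j] and y[i] != y[l] for i, j, l in R)
```

Note that, as stated above, the target neighbors are fixed once and for all using the Euclidean distance; they are not updated as the metric is learned.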
The distance is learned using the following convex program:

$$\min_{M \in S^d_+} (1 - \mu) \sum_{(x_i, x_j) \in S} d^2_M(x_i, x_j) + \mu \sum_{i,j,k} \xi_{ijk} \quad \text{s.t.} \quad d^2_M(x_i, x_k) - d^2_M(x_i, x_j) \geq 1 - \xi_{ijk} \quad \forall (x_i, x_j, x_k) \in R, \qquad (4)$$

where $\mu \in [0, 1]$ controls the "pull/push" trade-off. The authors developed a special-purpose solver—based on subgradient descent and careful book-keeping—that is able to deal with billions of constraints. Alternative ways of solving the problem have been proposed (Torresani and Lee, 2006; Nguyen and Guo, 2008; Park et al., 2011; Der and Saul, 2012). LMNN generally performs very well in practice, although it is sometimes prone to overfitting due to the absence of regularization, especially in high dimension. It is also very sensitive to the ability of the Euclidean distance to select relevant target neighbors. Note that Do et al. (2012) highlighted a relation between LMNN and Support Vector Machines.

3.3 Information-Theoretic Approaches

The methods presented in this section frame metric learning as an optimization problem involving an information measure.

RCA (Bar-Hillel et al.) Relevant Component Analysis[20] (Shental et al., 2002; Bar-Hillel et al., 2003, 2005) makes use of positive pairs only and is based on subsets of the training examples called "chunklets". These are obtained from the set of positive pairs by applying a transitive closure: for instance, if $(x_1, x_2) \in S$ and $(x_2, x_3) \in S$, then $x_1$, $x_2$ and $x_3$ belong to the same chunklet. Points in a chunklet are believed to share the same label. Assuming a total of n points in k chunklets, the algorithm is very efficient since it simply amounts to

18. An implementation is available within the Matlab Toolbox for Dimensionality Reduction: http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
19.
Source code available at: http://www.cse.wustl.edu/~kilian/code/code.html
20. Source code available at: http://www.scharp.org/thertz/code.html

computing the following matrix:

$$\hat{C} = \frac{1}{n} \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_i^j - \hat{m}_j)(x_i^j - \hat{m}_j)^T,$$

where chunklet j consists of $\{x_i^j\}_{i=1}^{n_j}$ and $\hat{m}_j$ is its mean. Thus, RCA essentially reduces the within-chunklet variability in an effort to identify features that are irrelevant to the task. The inverse of $\hat{C}$ is used in a Mahalanobis distance. The authors have shown that (i) it is the optimal solution to an information-theoretic criterion involving a mutual information measure, and (ii) it is also the optimal solution to the optimization problem consisting in minimizing the within-class distances. An obvious limitation of RCA is that it cannot make use of the discriminative information brought by negative pairs, which explains why it is not very competitive in practice. RCA was later extended to handle negative pairs, at the cost of a more expensive algorithm (Hoi et al., 2006; Yeung and Chang, 2006).

ITML (Davis et al.) Information-Theoretic Metric Learning[21] (ITML), proposed by Davis et al. (2007), is an important work because it introduces LogDet divergence regularization, which will later be used in several other Mahalanobis distance learning methods (e.g., Jain et al., 2008; Qi et al., 2009). This Bregman divergence on positive definite matrices is defined as:

$$D_{ld}(M, M_0) = \text{tr}(M M_0^{-1}) - \log\det(M M_0^{-1}) - d,$$

where d is the dimension of the input space and $M_0$ is some positive definite matrix we want to remain close to. In practice, $M_0$ is often set to I (the identity matrix), and thus the regularization aims at keeping the learned distance close to the Euclidean distance.
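The LogDet divergence defined above is easy to compute directly from its formula; the following NumPy sketch (an illustration, not the ITML solver) also shows the two properties used below: the divergence vanishes at $M = M_0$ and blows up outside the positive definite cone:

```python
import numpy as np

def logdet_div(M, M0):
    """Bregman LogDet divergence
    D_ld(M, M0) = tr(M M0^{-1}) - log det(M M0^{-1}) - d,
    which is finite only when M is positive definite."""
    d = M.shape[0]
    if np.any(np.linalg.eigvalsh(M) <= 0):
        return np.inf                     # infinite outside the PD cone
    P = M @ np.linalg.inv(M0)
    return np.trace(P) - np.linalg.slogdet(P)[1] - d

I = np.eye(3)
assert np.isclose(logdet_div(I, I), 0.0)          # zero at M = M0
M = np.diag([2.0, 1.0, 0.5])
assert logdet_div(M, I) > 0                       # positive elsewhere
assert logdet_div(np.diag([1.0, 1.0, -1.0]), I) == np.inf
```

This finiteness barrier is exactly why minimizing $D_{ld}(M, M_0)$ keeps the iterates positive definite without any explicit projection.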
The key feature of the LogDet divergence is that it is finite if and only if M is positive definite. Therefore, minimizing $D_{ld}(M, M_0)$ provides an automatic and cheap way of preserving the positive semi-definiteness of M. ITML is formulated as follows:

$$\min_{M \in S^d_+} D_{ld}(M, M_0) + \gamma \sum_{i,j} \xi_{ij} \quad \text{s.t.} \quad d^2_M(x_i, x_j) \leq u + \xi_{ij} \;\; \forall (x_i, x_j) \in S, \qquad d^2_M(x_i, x_j) \geq v - \xi_{ij} \;\; \forall (x_i, x_j) \in D, \qquad (5)$$

where $u, v \in \mathbb{R}$ are threshold parameters and $\gamma \geq 0$ is the trade-off parameter. ITML thus aims at satisfying the similarity and dissimilarity constraints while staying as close as possible to the Euclidean distance (if $M_0 = I$). More precisely, the information-theoretic interpretation behind minimizing $D_{ld}(M, M_0)$ is that it is equivalent to minimizing the KL divergence between two multivariate Gaussian distributions parameterized by M and $M_0$. The algorithm proposed to solve (5) is efficient, converges to the global minimum and the resulting distance performs well in practice. A limitation of ITML is that $M_0$, which must be picked by hand, can have an important influence on the quality of the learned distance. Note that Kulis et al. (2009) have shown how hashing can be used together with ITML to achieve fast similarity search.

21. Source code available at: http://www.cs.utexas.edu/~pjain/itml/

SDML (Qi et al.) With Sparse Distance Metric Learning (SDML), Qi et al. (2009) specifically deal with the case of high-dimensional data together with few training samples, i.e., $n \ll d$. To avoid overfitting, they use a double regularization: the LogDet divergence (using $M_0 = I$ or $M_0 = \Omega^{-1}$, where Ω is the covariance matrix) and L1-regularization on the off-diagonal elements of M.
The justification for using this L1-regularization is two-fold: (i) a practical one is that in high-dimensional spaces, the off-diagonal elements of $\Omega^{-1}$ are often very small, and (ii) a theoretical one, suggested by a consistency result from previous work on covariance matrix estimation (Ravikumar et al., 2011) that applies to SDML. They use a fast algorithm based on block-coordinate descent (the optimization is done over each row of $M^{-1}$) and obtain very good performance in the specific case $n \ll d$.

3.4 Online Approaches

In online learning (Littlestone, 1988), the algorithm receives training instances one at a time and updates the current hypothesis at each step. Although the performance of online algorithms is typically inferior to that of batch algorithms, they are very useful for tackling large-scale problems that batch methods fail to address due to time and space complexity issues. Online learning methods often come with regret bounds, stating that the accumulated loss suffered along the way is not much worse than that of the best hypothesis chosen in hindsight.[22]

POLA (Shalev-Shwartz et al.) POLA (Shalev-Shwartz et al., 2004), for Pseudo-metric Online Learning Algorithm, is the first online Mahalanobis distance learning approach; it learns the matrix M as well as a threshold $b \geq 1$. At each step t, POLA receives a labeled pair $(x_i, x_j, y_{ij})$, where $y_{ij} = 1$ if $(x_i, x_j) \in S$ and $y_{ij} = -1$ if $(x_i, x_j) \in D$, and performs two successive orthogonal projections:

1. Projection of the current solution $(M^{t-1}, b^{t-1})$ onto the set $C_1 = \{(M, b) \in \mathbb{R}^{d^2+1} : [y_{ij}(d^2_M(x_i, x_j) - b) + 1]_+ = 0\}$, which is done efficiently (closed-form solution). The constraint basically requires that the distance between two instances with the same (resp. different) labels be below (resp. above) the threshold b with margin 1.
We get an intermediate solution $(M^{t-\frac{1}{2}}, b^{t-\frac{1}{2}})$ that satisfies this constraint while staying as close as possible to the previous solution.

2. Projection of $(M^{t-\frac{1}{2}}, b^{t-\frac{1}{2}})$ onto the set $C_2 = \{(M, b) \in \mathbb{R}^{d^2+1} : M \in S^d_+, b \geq 1\}$, which is done rather efficiently (in the worst case, one only needs to compute the minimal eigenvalue of $M^{t-\frac{1}{2}}$). This projects the matrix back onto the PSD cone. We thus get a new solution $(M^t, b^t)$ that yields a valid Mahalanobis distance.

A regret bound for the algorithm is provided.

LEGO (Jain et al.) LEGO (LogDet Exact Gradient Online), developed by Jain et al. (2008), is an improved version of POLA based on LogDet divergence regularization. It features tighter regret bounds, more efficient updates and better practical performance.

22. A regret bound has the following general form: $\sum_{t=1}^{T} \ell(h_t, z_t) - \sum_{t=1}^{T} \ell(h^*, z_t) \leq O(\sqrt{T})$, where T is the number of steps, $h_t$ is the hypothesis at time t and $h^*$ is the best batch hypothesis.

RDML (Jin et al.) RDML (Jin et al., 2009) is similar to POLA in spirit but is more flexible. At each step t, instead of forcing the margin constraint to be satisfied, it performs a gradient descent step of the following form (assuming Frobenius regularization):

$$M^t = \pi_{S^d_+}\!\left(M^{t-1} - \lambda y_{ij}(x_i - x_j)(x_i - x_j)^T\right),$$

where $\pi_{S^d_+}(\cdot)$ is the projection onto the PSD cone. The parameter λ implements a trade-off between satisfying the pairwise constraint and staying close to the previous matrix $M^{t-1}$. Using some linear algebra, the authors show that this update can be performed by solving a convex quadratic program instead of resorting to eigenvalue computation like POLA. RDML is evaluated on several benchmark datasets and is shown to perform comparably to LMNN and ITML.
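The RDML-style update above can be sketched in a few lines of NumPy. This is an illustration only: it implements the update naively via eigendecomposition-based projection (which the authors precisely avoid) and ignores their step-size rule and the check that the pair actually incurs a loss:

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone (zero out negative eigenvalues)."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.maximum(w, 0.0)) @ V.T

def online_update(M, xi, xj, yij, lam=0.1):
    """One online metric update in the spirit of RDML: a gradient step on the
    pairwise term, then projection back onto the PSD cone.
    yij = +1 for a similar pair, -1 for a dissimilar pair."""
    diff = (xi - xj).reshape(-1, 1)
    return project_psd(M - lam * yij * (diff @ diff.T))

M = np.eye(2)
xi, xj = np.array([1.0, 0.0]), np.array([0.9, 0.1])
M = online_update(M, xi, xj, +1)                  # similar pair: pull closer
d2_new = (xi - xj) @ M @ (xi - xj)
d2_old = (xi - xj) @ np.eye(2) @ (xi - xj)
assert d2_new <= d2_old                           # their distance did not grow
assert np.all(np.linalg.eigvalsh(M) >= -1e-10)    # M remains PSD
```

For a similar pair ($y_{ij} = +1$) the update subtracts a rank-one term along the pair's difference direction, shrinking their distance, while the projection keeps the iterate a valid metric.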
MDML (Kunapuli & Shavlik) MDML (Kunapuli and Shavlik, 2012), for Mirror Descent Metric Learning, is an attempt to propose a general framework for online Mahalanobis distance learning. It is based on composite mirror descent (Duchi et al., 2010), which allows online optimization of many regularized problems. It can accommodate a large class of loss functions and regularizers for which efficient updates are derived, and the algorithm comes with a regret bound. Their study focuses on regularization with the nuclear norm (also called trace norm), introduced by Fazel et al. (2001) and defined as $\|M\|_* = \sum_i \sigma_i$, where the $\sigma_i$'s are the singular values of M.[23] It is known to be the best convex relaxation of the rank of the matrix, and thus nuclear norm regularization tends to induce low-rank matrices. In practice, MDML has performance comparable to LMNN and ITML, is fast and sometimes induces low-rank solutions, but surprisingly the algorithm was not evaluated on large-scale datasets.

3.5 Multi-Task Metric Learning

This section covers Mahalanobis distance learning for the multi-task setting (Caruana, 1997), where, given a set of related tasks, one learns a metric for each in a coupled fashion in order to improve the performance on all tasks.

mt-LMNN (Parameswaran & Weinberger) Multi-Task LMNN[24] (Parameswaran and Weinberger, 2010) is a straightforward adaptation of the ideas of Multi-Task SVM (Evgeniou and Pontil, 2004) to metric learning. Given T related tasks, they model the problem as learning a shared Mahalanobis metric $d_{M_0}$ as well as task-specific metrics $d_{M_1}, \ldots, d_{M_T}$, and define the metric for task t as

$$d_t(x, x') = (x - x')^T (M_0 + M_t)(x - x').$$

Note that $M_0 + M_t \succeq 0$, hence $d_t$ is a valid pseudo-metric.
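The additive decomposition above is simple enough to state in code. The following sketch (our illustration, with made-up matrices) computes the task metric from the shared and task-specific parts:

```python
import numpy as np

def mt_distance2(x, xp, M0, Mt):
    """Squared multi-task metric d_t(x, x') = (x - x')^T (M0 + Mt) (x - x'),
    combining the shared metric M0 with the task-specific metric Mt."""
    d = x - xp
    return d @ (M0 + Mt) @ d

M0 = np.eye(2)                 # shared part (Euclidean here)
Mt = np.diag([1.0, 0.0])       # task-specific part emphasizing feature 0
x, xp = np.array([1.0, 1.0]), np.array([0.0, 0.0])
# Feature 0 contributes 1 (shared) + 1 (task) = 2; feature 1 contributes 1.
assert np.isclose(mt_distance2(x, xp, M0, Mt), 3.0)
```

Note how the sum $M_0 + M_t$ of two PSD matrices is itself PSD, which is exactly why $d_t$ remains a valid pseudo-metric.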
The LMNN formulation is easily generalized to this multi-task setting so as to learn the metrics jointly, with a specific regularization term defined as follows:

$$\gamma_0 \|M_0 - I\|^2_F + \sum_{t=1}^{T} \gamma_t \|M_t\|^2_F,$$

where $\gamma_t$ controls the regularization of $M_t$. When $\gamma_0 \to \infty$, the shared metric $d_{M_0}$ is simply the Euclidean distance, and the formulation reduces to T independent LMNN formulations. On the other hand, when $\gamma_{t>0} \to \infty$, the task-specific matrices are simply zero matrices and the formulation reduces to LMNN on the union of all data. In between these extreme cases, these parameters can be used to adjust the relative importance of each metric: $\gamma_0$ to set the overall level of shared information, and $\gamma_t$ to set the importance of $M_t$ with respect to the shared metric. The formulation remains convex and can be solved using the same efficient solver as LMNN. In the multi-task setting, mt-LMNN clearly outperforms single-task metric learning methods and other multi-task classification techniques such as mt-SVM.

23. Note that when $M \in S^d_+$, $\|M\|_* = \text{tr}(M) = \sum_{i=1}^{d} M_{ii}$, which is much cheaper to compute.
24. Source code available at: http://www.cse.wustl.edu/~kilian/code/code.html

MLCS (Yang et al.) MLCS (Yang et al., 2011) is a different approach to the problem of multi-task metric learning. For each task $t \in \{1, \ldots, T\}$, the authors consider learning a Mahalanobis metric

$$d^2_{L_t^T L_t}(x, x') = (x - x')^T L_t^T L_t (x - x') = (L_t x - L_t x')^T (L_t x - L_t x'),$$

parameterized by the transformation matrix $L_t \in \mathbb{R}^{r \times d}$. They show that $L_t$ can be decomposed into a "subspace" part $L_{t0} \in \mathbb{R}^{r \times d}$ and a "low-dimensional metric" part $R_t \in \mathbb{R}^{r \times r}$ such that $L_t = R_t L_{t0}$. The main assumption of MLCS is that all tasks share a common subspace, i.e., $\forall t$, $L_{t0} = L_0$.
This parameterization can be used to extend most metric learning methods to the multi-task setting, although it breaks the convexity of the formulation and is thus subject to local optima. However, as opposed to mt-LMNN, it can be made low-rank by setting $r < d$ and thus has far fewer parameters to learn. In their work, MLCS is applied to the version of LMNN solved with respect to the transformation matrix (Torresani and Lee, 2006). The resulting method is evaluated on problems with very scarce training data, studying the performance for different values of r. It is shown to outperform mt-LMNN, but the setup is a bit unfair to mt-LMNN, since the latter is forced to be low-rank by eigenvalue thresholding.

GPML (Yang et al.) The work of Yang et al. (2012) identifies two drawbacks of previous multi-task metric learning approaches: (i) MLCS's assumption of a common subspace is sometimes too strict and leads to a nonconvex formulation, and (ii) the Frobenius regularization of mt-LMNN does not preserve geometry. This property is defined as the ability to propagate side-information: the task-specific metrics should be regularized so as to preserve the relative distance between training pairs. They introduce the following formulation, which extends any metric learning algorithm to the multi-task setting:

$$\min_{M_0, \ldots, M_T \in S^d_+} \sum_{t=1}^{T} \left(\ell(M_t, S_t, D_t, R_t) + \gamma \, d_\varphi(M_t, M_0)\right) + \gamma_0 \, d_\varphi(A_0, M_0), \qquad (6)$$

where $\ell(M_t, S_t, D_t, R_t)$ is the loss function for task t based on the training pairs/triplets (depending on the chosen algorithm), $d_\varphi(A, B) = \varphi(A) - \varphi(B) - \text{tr}\left((\nabla\varphi(B))^T (A - B)\right)$ is a Bregman matrix divergence (Dhillon and Tropp, 2007) and $A_0$ is a predefined metric (e.g., the identity matrix I).
mt-LMNN can essentially be recovered from (6) by setting $\varphi(A) = \|A\|^2_F$ and adding the constraints $M_t \succeq M_0$. The authors focus on the von Neumann divergence:

$$d_{VN}(A, B) = \text{tr}(A \log A - A \log B - A + B),$$

where $\log A$ is the matrix logarithm of A. Like the LogDet divergence mentioned earlier in this survey (Section 3.3), the von Neumann divergence is known to be rank-preserving and to provide automatic enforcement of positive-semidefiniteness. The authors further show that minimizing this divergence encourages geometry preservation between the learned metrics. Problem (6) remains convex as long as the original algorithm used for solving each task is convex, and can be solved efficiently using gradient descent methods. In the experiments, the method is adapted to LMNN and outperforms single-task LMNN as well as mt-LMNN, especially when training data is very scarce.

TML (Zhang & Yeung) Zhang and Yeung (2010) propose a transfer metric learning (TML) approach.[25] They assume that we are given S independent source tasks with enough labeled data and that a Mahalanobis distance $M_s$ has been learned for each task s. The goal is to leverage the information of the source metrics to learn a distance $M_t$ for a target task, for which we only have a scarce amount $n_t$ of labeled data. No assumption is made about the relation between the source tasks and the target task: they may be positively/negatively correlated or uncorrelated. The problem is formulated as a joint minimization over $M_t \in S^d_+$ and a matrix $\Omega \succeq 0$.

Shi et al. (2011) use GLML metrics as base kernels to learn a global kernel in a discriminative manner.

Bk-means (Wu et al.) Wu et al. (2009, 2012) propose to learn Bregman distances (or Bregman divergences), a family of metrics that do not necessarily satisfy the triangle inequality or symmetry (Bregman, 1967).
Given a strictly convex and twice differentiable function $\varphi : \mathbb{R}^d \to \mathbb{R}$, the Bregman distance is defined as:

$$d_\varphi(x, x') = \varphi(x) - \varphi(x') - (x - x')^T \nabla\varphi(x').$$

It generalizes many widely-used measures: the Mahalanobis distance is recovered by setting $\varphi(x) = \frac{1}{2} x^T M x$, the KL divergence (Kullback and Leibler, 1951) by choosing $\varphi(p) = \sum_{i=1}^{d} p_i \log p_i$ (here, p is a discrete probability distribution), etc. Wu et al. consider the following symmetrized version:

$$d_\varphi(x, x') = \left(\nabla\varphi(x) - \nabla\varphi(x')\right)^T (x - x') = (x - x')^T \nabla^2\varphi(\tilde{x})(x - x'),$$

where $\tilde{x}$ is a point on the line segment between x and x'. Therefore, $d_\varphi$ amounts to a Mahalanobis distance parameterized by the Hessian matrix of φ, which depends on the location of x and x'. In this respect, learning φ can be seen as learning an infinite number of local Mahalanobis distances. They take a nonparametric approach by assuming φ to belong to a Reproducing Kernel Hilbert Space $H_K$ associated with a kernel function $K(x, x') = h(x^T x')$, where $h(z)$ is a strictly convex function (set to $\exp(z)$ in the experiments). This allows the derivation of a representer theorem. Setting $\varphi(x) = \sum_{i=1}^{n} \alpha_i h(x_i^T x)$ leads to the following formulation based on classic positive/negative pairs:

$$\min_{\alpha \in \mathbb{R}^n_+, \, b} \frac{1}{2} \alpha^T K \alpha + C \sum_{(x_i, x_j) \in S \cup D} \ell(y_{ij}[d_\varphi(x_i, x_j) - b]), \qquad (15)$$

where K is the Gram matrix, $\ell(t) = \max(0, 1 - t)$ is the hinge loss and C is the trade-off parameter. Problem (15) is solved by a simple subgradient descent approach where each iteration has linear complexity. Note that (15) only has n + 1 variables, instead of $d^2$ as in most metric learning formulations, leading to very scalable learning.
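The Bregman distance definition above is straightforward to evaluate numerically. The sketch below (an illustration, not the authors' code) checks the Mahalanobis special case: with $\varphi(x) = \frac{1}{2} x^T M x$, the Bregman distance equals half the squared Mahalanobis distance:

```python
import numpy as np

def bregman(phi, grad_phi, x, xp):
    """Bregman distance d_phi(x, x') = phi(x) - phi(x') - (x - x')^T grad_phi(x')."""
    return phi(x) - phi(xp) - (x - xp) @ grad_phi(xp)

# phi(x) = 1/2 x^T M x, so grad_phi(x) = M x for symmetric M.
M = np.diag([2.0, 1.0])
phi = lambda x: 0.5 * x @ M @ x
grad_phi = lambda x: M @ x
x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
d = bregman(phi, grad_phi, x, xp)
assert np.isclose(d, 0.5 * (x - xp) @ M @ (x - xp))   # = 1/2 squared Mahalanobis
# Bregman distances are not symmetric in general, though this quadratic case is.
assert np.isclose(bregman(phi, grad_phi, xp, x), d)
```

Expanding the quadratic confirms the identity: $\varphi(x) - \varphi(x') - (x - x')^T M x' = \frac{1}{2}(x - x')^T M (x - x')$.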
The downside is that computing the learned distance requires n kernel evaluations, which can be expensive for large datasets. The method is evaluated on clustering problems and exhibits good performance, matching or improving that of other metric learning approaches.

PLML (Wang et al.) Wang et al. (2012c) propose PLML,[35] a Parametric Local Metric Learning method where a Mahalanobis metric $d^2_{M_i}$ is learned for each training instance $x_i$:

$$d^2_{M_i}(x_i, x_j) = (x_i - x_j)^T M_i (x_i - x_j).$$

$M_i$ is parameterized as a weighted linear combination of metric bases $M_{b_1}, \ldots, M_{b_m}$, where $M_{b_j} \succeq 0$ is associated with an anchor point $u_j$.[36] In other words, $M_i$ is defined as:

$$M_i = \sum_{j=1}^{m} W_{ib_j} M_{b_j}, \quad W_{ib_j} \geq 0, \quad \sum_{j=1}^{m} W_{ib_j} = 1,$$

where the nonnegativity of the weights ensures that the combination is PSD. The weight learning procedure is a trade-off between three terms: (i) each point $x_i$ should be close to its linear approximation $\sum_{j=1}^{m} W_{ib_j} u_j$, (ii) the weighting scheme should be local (i.e., $W_{ib_j}$ should be large if $x_i$ and $u_j$ are similar), and (iii) the weights should vary smoothly over the data manifold (i.e., similar training instances should be assigned similar weights).[37] Given the weights, the basis metrics $M_{b_1}, \ldots, M_{b_m}$ are then learned in a large-margin fashion using positive and negative training pairs and Frobenius regularization. In terms of scalability, the weight learning procedure is fairly efficient. However, the metric bases learning procedure requires an eigen-decomposition at each step, which scales in $O(d^3)$, making the approach intractable for high-dimensional problems. In practice, PLML performs very well on the evaluated datasets and is quite robust to overfitting due to its global manifold regularization.
However, like LMNN, PLML is sensitive to the relevance of the Euclidean distance for assessing the similarity between (anchor) points. Note that PLML has many hyper-parameters, but in the experiments the authors use default values for most of them. Huang et al. (2013) propose to regularize the anchor metrics to be low-rank and use alternating optimization to solve the problem.

RFD (Xiong et al.) The originality of the Random Forest Distance (Xiong et al., 2012) is to see the metric learning problem as a pair classification problem.[38] Each pair of examples $(x, x')$ is mapped to the following feature space:

$$\phi(x, x') = \begin{bmatrix} |x - x'| \\ \frac{1}{2}(x + x') \end{bmatrix} \in \mathbb{R}^{2d}.$$

35. Source code available at: http://cui.unige.ch/~wangjun/papers/PLML.zip
36. In practice, these anchor points are defined as the means of clusters constructed by the K-Means algorithm.
37. The weights of a test instance can be learned by optimizing the same trade-off given the weights of the training instances, or simply set to the weights of the nearest training instance.
38. Source code available at: http://www.cse.buffalo.edu/~cxiong/RFD_Package.zip

The first part of $\phi(x, x')$ encodes the relative position of the examples and the second part their absolute position, as opposed to the implicit mapping of the Mahalanobis distance, which only encodes relative information. The metric is based on a random forest F, i.e.,

$$d_{RFD}(x, x') = F(\phi(x, x')) = \frac{1}{T} \sum_{t=1}^{T} f_t(\phi(x, x')),$$

where $f_t(\cdot) \in \{0, 1\}$ is the output of decision tree t. RFD is thus highly nonlinear and is able to implicitly adapt the metric throughout the space: when a decision tree in F selects a node split based on a value of the absolute-position part, the entire sub-tree is specific to that region of $\mathbb{R}^{2d}$.
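The pair feature map $\phi$ above is simple to implement; the following sketch (our illustration, not the RFD package) builds it and checks that it treats the pair symmetrically, which is what makes the resulting "distance" well-defined on unordered pairs:

```python
import numpy as np

def rfd_features(x, xp):
    """Pair feature map phi(x, x') = [|x - x'| ; (x + x')/2] in R^{2d}.
    The first half encodes relative position, the second absolute position."""
    return np.concatenate([np.abs(x - xp), 0.5 * (x + xp)])

x, xp = np.array([1.0, 3.0]), np.array([2.0, 1.0])
f = rfd_features(x, xp)
assert f.shape == (4,)                            # 2d features for d = 2
assert np.allclose(f, [1.0, 2.0, 1.5, 2.0])
assert np.allclose(rfd_features(xp, x), f)        # symmetric in the pair
```

A random forest trained on these features (labeled by whether the pair shares a class) then yields $d_{RFD}$ as the fraction of trees voting "dissimilar".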
As compared to other local metric learning methods, training is very efficient: each tree takes $O(n \log n)$ time to grow, and trees can be built in parallel. A drawback is that evaluating the learned metric requires computing the output of the T trees. The experiments highlight the importance of encoding absolute information, and show that RFD outperforms some global and local metric learning methods on several datasets while being quite fast.

4.4 Metric Learning for Histogram Data

Histograms are feature vectors that lie on the probability simplex $S^d$. This representation is very common in areas dealing with complex objects, such as natural language processing, computer vision or bioinformatics: each instance is represented as a bag of features, i.e., a vector containing the frequency of each feature in the object. Bags-of(-visual)-words (Salton et al., 1975; Li and Perona, 2005) are a common example of such data. We present here three metric learning methods designed specifically for histograms.

χ²-LMNN (Kedem et al.) Kedem et al. (2012) propose χ²-LMNN, which is based on a simple yet prominent histogram metric, the χ² distance (Hafner et al., 1995), defined as

$$\chi^2(x, x') = \frac{1}{2} \sum_{i=1}^{d} \frac{(x_i - x'_i)^2}{x_i + x'_i}, \qquad (16)$$

where $x_i$ denotes the i-th feature of x.[39] Note that χ² is a (nonlinear) proper distance. They propose to generalize this distance with a linear transformation, introducing the following pseudo-distance:

$$\chi^2_L(x, x') = \chi^2(Lx, Lx'),$$

where $L \in \mathbb{R}^{r \times d}$, with the constraint that L maps any x onto $S^d$ (the authors show that this can be enforced using a simple trick). The objective function is the same as LMNN[40] and is optimized using a standard subgradient descent procedure.
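The χ² distance of (16) is easy to compute directly, restricting the sum to nonzero bins as required. A minimal NumPy sketch (our illustration):

```python
import numpy as np

def chi2_dist(x, xp):
    """Chi-squared histogram distance: 1/2 * sum_i (x_i - x'_i)^2 / (x_i + x'_i),
    with the sum restricted to bins where x_i + x'_i > 0 (avoids division by zero)."""
    num = (x - xp) ** 2
    den = x + xp
    mask = den > 0
    return 0.5 * np.sum(num[mask] / den[mask])

x = np.array([0.5, 0.5, 0.0])
xp = np.array([0.25, 0.25, 0.5])
assert np.isclose(chi2_dist(x, x), 0.0)
assert np.isclose(chi2_dist(x, xp), chi2_dist(xp, x))        # symmetric
assert np.isclose(chi2_dist(x, xp),
                  0.5 * (0.0625 / 0.75 + 0.0625 / 0.75 + 0.25 / 0.5))
```

Note how the per-bin denominator $x_i + x'_i$ down-weights differences in heavily populated bins, which is what distinguishes χ² from a plain squared Euclidean distance on histograms.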
Although subject to local optima, experiments show great improvements on histogram data compared to standard histogram metrics and Mahalanobis distance learning methods, as well as promising results for dimensionality reduction (when $r < d$).

39. The sum in (16) must be restricted to entries that are nonzero in either x or x' to avoid division by zero.
40. To be precise, it requires an additional parameter. In standard LMNN, due to the linearity of the Mahalanobis distance, solutions obtained with different values of the margin only differ up to a scaling factor—the margin is thus set to 1. Here, χ² is nonlinear and therefore this value must be tuned.

GML (Cuturi & Avis) While χ²-LMNN optimizes a simple bin-to-bin histogram distance, Cuturi and Avis (2011) propose to consider the more powerful cross-bin Earth Mover's Distance (EMD) introduced by Rubner et al. (2000), which can be seen as the distance between a source histogram x and a destination histogram x'. On an intuitive level, x is viewed as piles of earth at several locations (bins) and x' as several holes, where the value of each feature represents the amount of earth and the capacity of the hole, respectively. The EMD is then equal to the minimum amount of effort needed to move all the earth from x to x'. The cost of moving one unit of earth from bin i of x to bin j of x' is encoded in the so-called ground distance matrix $D \in \mathbb{R}^{d \times d}$.[41] The computation of EMD amounts to finding the optimal flow matrix F, where $f_{ij}$ corresponds to the amount of earth moved from bin i of x to bin j of x'.
Given the ground distance matrix D, the computation of $\text{EMD}_D(x, x')$ is a linear program:

$$\text{EMD}_D(x, x') = \min_{f \in C(x, x')} d^T f,$$

where f and d are respectively the flow and ground matrices rewritten as vectors for notational simplicity, and $C(x, x')$ is the convex set of feasible flows (which can be represented by linear constraints). Ground Metric Learning (GML) aims at learning D based on training triplets $(x_i, x_j, w_{ij})$, where $x_i$ and $x_j$ are two histograms and $w_{ij} \in \mathbb{R}$ is a weight quantifying the similarity between $x_i$ and $x_j$. The optimized criterion essentially aims at minimizing the sum of the $w_{ij}\,\text{EMD}_D(x_i, x_j)$—which is a nonlinear function of D—by casting the problem as a difference of two convex functions. A local minimum is found efficiently by a subgradient descent approach. Experiments on image datasets show that GML outperforms standard histogram distances as well as Mahalanobis distance methods.

EMDL (Wang & Guibas) Building on GML and successful Mahalanobis distance learning approaches such as LMNN, Wang and Guibas (2012) aim at learning the EMD ground matrix in the more flexible setting where the algorithm is provided with a set of relative constraints R that must be satisfied with a large margin. The problem is formulated as

$$\min_{D \in \mathcal{D}} \|D\|^2_F + C \sum_{i,j,k} \xi_{ijk} \quad \text{s.t.} \quad \text{EMD}_D(x_i, x_k) - \text{EMD}_D(x_i, x_j) \geq 1 - \xi_{ijk} \quad \forall (x_i, x_j, x_k) \in R, \qquad (17)$$

where $\mathcal{D} = \left\{D \in \mathbb{R}^{d \times d} : \forall i, j \in \{1, \ldots, d\}, \; d_{ij} \geq 0, \; d_{ii} = 0\right\}$ and $C \geq 0$ is the trade-off parameter.[42] The authors also propose a pair-based formulation. Problem (17) is bi-convex and is solved using an alternating procedure: first fix the ground metric and solve for the flow matrices (this amounts to a set of standard EMD problems), then solve for the ground matrix given the flows (this is a quadratic program).
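The EMD linear program above can be solved with any generic LP solver. Here is a sketch using `scipy.optimize.linprog` (an illustration assuming equal-mass histograms, not the GML or EMDL implementation):

```python
import numpy as np
from scipy.optimize import linprog

def emd(x, xp, D):
    """Earth Mover's Distance between histograms x and x' of equal total mass:
    minimize <D, F> over flows F >= 0 whose row sums are x and column sums are x'."""
    d = len(x)
    A_eq, b_eq = [], []
    for i in range(d):                   # row i of F sums to x[i]
        row = np.zeros(d * d)
        row[i * d:(i + 1) * d] = 1
        A_eq.append(row)
        b_eq.append(x[i])
    for j in range(d):                   # column j of F sums to x'[j]
        col = np.zeros(d * d)
        col[j::d] = 1
        A_eq.append(col)
        b_eq.append(xp[j])
    res = linprog(D.flatten(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

# Ground distances between 3 bins on a line: d_ij = |i - j|.
D = np.abs(np.arange(3)[:, None] - np.arange(3)[None, :]).astype(float)
x = np.array([1.0, 0.0, 0.0])
xp = np.array([0.0, 0.0, 1.0])
assert np.isclose(emd(x, xp, D), 2.0)    # one unit of earth moved two bins away
```

The learning problem is then layered on top: GML and EMDL optimize the entries of D itself, re-solving such flow problems as an inner step.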
The algorithm stops when the changes in the ground matrix are sufficiently small. The procedure is subject to local optima (because (17) is not jointly convex) and is not guaranteed to converge: there is a need for a trade-off parameter α between stable but conservative updates (i.e., staying close to the previous ground matrix) and aggressive but less stable updates. Experiments on face verification datasets confirm that EMDL improves upon standard histogram distances and Mahalanobis distance learning methods.

41. For EMD to be a proper distance, D must satisfy the following ∀ i, j, k ∈ {1, ..., d}: (i) d_ij ≥ 0, (ii) d_ii = 0, (iii) d_ij = d_ji and (iv) d_ij ≤ d_ik + d_kj.
42. Note that unlike in GML, D ∈ 𝒟 may not be a valid distance matrix. In this case, EMD_D is not a proper distance.

Figure 5: The two-fold problem of generalization in metric learning. We may be interested in the generalization ability of the learned metric itself: can we say anything about its consistency on unseen data drawn from the same distribution? Furthermore, we may also be interested in the generalization ability of the predictor using that metric: can we relate its performance on unseen data to the quality of the learned metric?

4.5 Generalization Guarantees for Metric Learning

The derivation of guarantees on the generalization performance of the learned model is a wide topic in statistical learning theory (Vapnik and Chervonenkis, 1971; Valiant, 1984). Assuming that data points are drawn i.i.d.
from some (unknown but fixed) distribution P, one essentially aims at bounding the deviation of the true risk of the learned model (its performance on unseen data) from its empirical risk (its performance on the training sample).[43] In the specific context of metric learning, we claim that the question of generalization can be seen as two-fold (Bellet, 2012), as illustrated by Figure 5:

• First, one may consider the consistency of the learned metric, i.e., trying to bound the deviation between the empirical performance of the metric on the training sample and its generalization performance on unseen data.

• Second, the learned metric is used to improve the performance of some prediction model (e.g., k-NN or a linear classifier). It would thus be meaningful to express the generalization performance of this predictor in terms of that of the learned metric.

As in the classic supervised learning setting (where training data consist of individual labeled instances), generalization guarantees may be derived for supervised metric learning (where training data consist of pairs or triplets). Indeed, most supervised metric learning methods can be seen as minimizing a (regularized) loss function ℓ based on the training pairs/triplets. However, the i.i.d. assumption is violated in the metric learning scenario since the training pairs/triplets are constructed from the training sample. For this reason, establishing generalization guarantees for the learned metric is challenging, and only recently has this question been investigated from a theoretical standpoint.

43. This deviation is typically a function of the number of training examples and some notion of complexity of the model.
Metric consistency bounds for batch methods  Given a training sample T = {z_i = (x_i, y_i)}_{i=1}^n drawn i.i.d. from an unknown distribution μ, let us consider fully supervised Mahalanobis metric learning of the following general form:

    min_{M ∈ S^d_+}  (1/n²) Σ_{z_i, z_j ∈ T} ℓ(d²_M, z_i, z_j) + λ R(M),

where R(M) is the regularizer, λ the regularization parameter, and the loss function ℓ is of the form ℓ(d²_M, z_i, z_j) = g(y_i y_j [c − d²_M(x_i, x_j)]) with c > 0 a decision threshold variable and g convex and Lipschitz continuous. This includes popular loss functions such as the hinge loss. Several recent works have proposed to study the convergence of the empirical risk (as measured by ℓ on pairs from T) to the true risk over the unknown probability distribution μ. The framework proposed by Bian and Tao (2011, 2012) is quite rigid since it relies on strong assumptions on the distribution of the examples and cannot accommodate any regularization (a constraint to bound M is used instead). Jin et al. (2009) use a notion of uniform stability (Bousquet and Elisseeff, 2002) adapted to the case of metric learning (where training data is made of pairs) to derive generalization bounds that are limited to Frobenius norm regularization. Bellet and Habrard (2012) demonstrate how to adapt the more flexible notion of algorithmic robustness (Xu and Mannor, 2012) to the metric learning setting to derive (loose) generalization bounds for any matrix norm (including sparsity-inducing ones) as regularizer. They also show that a weak notion of robustness is necessary and sufficient for metric learning algorithms to generalize well. Lastly, Cao et al. (2012a) use a notion of Rademacher complexity (Bartlett and Mendelson, 2002) dependent on the regularizer to derive bounds for several matrix norms.
All these results can easily be adapted to non-Mahalanobis linear metric learning formulations.

Regret bound conversion for online methods  Wang et al. (2012d, 2013b) deal with the online learning setting. They show that existing proof techniques to convert regret bounds into generalization bounds (see for instance Cesa-Bianchi and Gentile, 2008) only hold for univariate loss functions, and derive an alternative framework that can deal with pairwise losses. At each round, the online algorithm receives a new instance and is assumed to pair it with all previously-seen data points. As this is expensive or even infeasible in practice, Kar et al. (2013) propose to use a buffer containing only a bounded number of the most recent instances. They are also able to obtain tighter bounds based on a notion of Rademacher complexity, essentially adapting and extending the work of Cao et al. (2012a). These results suggest that one can obtain generalization bounds for most (if not all) online metric learning algorithms with bounded regret (such as those presented in Section 3.4).

Link between learned metric and classification performance  The second question of generalization (i.e., at the classifier level) remains an open problem for the most part. To the best of our knowledge, it has only been addressed in the context of metric learning for linear classification. Bellet et al. (2011, 2012a,b) rely upon the theory of learning with (ε, γ, τ)-good similarity functions (Balcan et al., 2008a), which makes the link between properties of a similarity function and the generalization of a linear classifier built from this similarity. Bellet et al.
propose to use (ε, γ, τ)-goodness as an objective function for metric learning, and show that in this case it is possible to derive generalization guarantees not only for the learned similarity but also for the linear classifier. Guo and Ying (2014) extend the results of Bellet et al. to several matrix norms using a Rademacher complexity analysis based on techniques from Cao et al. (2012a).

4.6 Semi-Supervised Metric Learning Methods

In this section, we present two categories of metric learning methods that are designed to deal with semi-supervised learning tasks. The first one corresponds to the standard semi-supervised setting, where the learner makes use of unlabeled pairs in addition to positive and negative constraints. The second one concerns approaches which learn metrics to address semi-supervised domain adaptation problems, where the learner has access to labeled data drawn according to a source distribution and unlabeled data generated from a different (but related) target distribution.

4.6.1 Standard Semi-Supervised Setting

The following metric learning methods leverage the information brought by the set of unlabeled pairs, i.e., pairs of training examples that do not belong to the sets of positive and negative pairs: U = {(x_i, x_j) : i ≠ j, (x_i, x_j) ∉ S ∪ D}. An early approach by Bilenko et al. (2004) combined semi-supervised clustering with metric learning. In the following, we review general metric learning formulations that incorporate information from the set of unlabeled pairs U.

LRML (Hoi et al.)  Hoi et al. (2008, 2010) propose to follow the principles of manifold regularization for semi-supervised learning (Belkin and Niyogi, 2004) by resorting to a weight matrix W that encodes the similarity between pairs of points.[44] Hoi et al.
construct W using the Euclidean distance as follows:

    W_ij = 1 if x_i ∈ N(x_j) or x_j ∈ N(x_i), and 0 otherwise,

where N(x_j) denotes the nearest neighbor list of x_j. Using W, they use the following regularization, known as the graph Laplacian regularizer:

    (1/2) Σ_{i,j=1}^n d²_M(x_i, x_j) W_ij = tr(X L X^T M),

where X is the data matrix and L = D − W is the graph Laplacian matrix, with D a diagonal matrix such that D_ii = Σ_j W_ij. Intuitively, this regularization favors an "affinity-preserving" metric: the distance between points that are similar according to W should remain small according to the learned metric. Experiments show that LRML (Laplacian Regularized Metric Learning) significantly outperforms supervised methods when the side information is scarce. An obvious drawback is that computing W is intractable for large-scale datasets. This work has inspired a number of extensions and improvements: Liu et al. (2010) introduce a refined way of constructing W, while Baghshah and Shouraki (2009), Zhong et al. (2011) and Wang et al. (2013a) use a different (but similar in spirit) manifold regularizer.

44. Source code available at: http://www.ee.columbia.edu/~wliu/

M-DML (Zha et al.)  The idea of Zha et al. (2009) is to augment Laplacian regularization with metrics M_1, ..., M_K learned from auxiliary datasets. Formally, for each available auxiliary metric, a weight matrix W_k is constructed following Hoi et al. (2008, 2010), but using metric M_k instead of the Euclidean distance. These are then combined to obtain the following regularizer:

    Σ_{k=1}^K α_k tr(X L_k X^T M),

where L_k is the Laplacian associated with weight matrix W_k and α_k is the weight reflecting the utility of auxiliary metric M_k. As such weights are difficult to set in practice, Zha et al.
propose to learn them together with the metric M by alternating optimization (which only converges to a local minimum). Experiments on a face recognition task show that metrics learned from auxiliary datasets can be successfully used to improve performance over LRML.

SERAPH (Niu et al.)  Niu et al. (2012) tackle semi-supervised metric learning from an information-theoretic perspective by optimizing a probability of labeling a given pair parameterized by a Mahalanobis distance:[45]

    p_M(y | x, x′) = 1 / (1 + exp(y (d²_M(x, x′) − η))).

M is optimized to maximize the entropy of p_M on the labeled pairs S ∪ D and minimize it on unlabeled pairs U, following the entropy regularization principle (Grandvalet and Bengio, 2004). Intuitively, the regularization enforces low uncertainty of unobserved weak labels. They also encourage a low-rank projection by using the trace norm. The resulting nonconvex optimization problem is solved using an EM-like iterative procedure where the M-step involves a projection onto the PSD cone. The proposed method outperforms supervised metric learning methods when the amount of supervision is very small, but was only evaluated against one semi-supervised method (Baghshah and Shouraki, 2009) known to be subject to overfitting.

4.6.2 Metric Learning for Domain Adaptation

In the domain adaptation (DA) setting (Mansour et al., 2009; Quiñonero-Candela, 2009; Ben-David et al., 2010), the labeled training data and the test data come from different (but somehow related) distributions (referred to as the source and target distributions respectively). This situation occurs very often in real-world applications (famous examples include speech recognition, spam detection and object recognition) and is also relevant for metric learning. Although domain adaptation is sometimes achieved by using a small
45.
Source code available at: http://sugiyama-www.cs.titech.ac.jp/~gang/software.html

sample of labeled target data (Saenko et al., 2010; Kulis et al., 2011), we review here the more challenging case where only unlabeled target data is available.

CDML (Cao et al.)  CDML (Cao et al., 2011), for Consistent Distance Metric Learning, deals with the setting of covariate shift, which assumes that the source and target data distributions p_S(x) and p_T(x) are different but the conditional distribution of the labels given the features, p(y|x), remains the same. In the context of metric learning, the assumption is made at the pair level, i.e., p(y_ij | x_i, x_j) is stable across domains. Cao et al. show that if some metric learning algorithm minimizing some training loss Σ_{(x_i, x_j) ∈ S∪D} ℓ(d²_M, x_i, x_j) is asymptotically consistent without covariate shift, then the following algorithm is consistent under covariate shift:

    min_{M ∈ S^d_+}  Σ_{(x_i, x_j) ∈ S∪D} w_ij ℓ(d²_M, x_i, x_j),  where w_ij = (p_T(x_i) p_T(x_j)) / (p_S(x_i) p_S(x_j)).   (18)

Problem (18) can be seen as cost-sensitive metric learning, where the cost of each pair is given by the importance weight w_ij. Therefore, adapting a metric learning algorithm to covariate shift boils down to computing the importance weights, which can be done reliably using unlabeled data (Tsuboi et al., 2008). The authors experiment with ITML and show that their adapted version outperforms the regular one in situations of (real or simulated) covariate shift.

DAML (Geng et al.)  DAML (Geng et al., 2011), for Domain Adaptation Metric Learning, tackles the general domain adaptation setting. In this case, a classic strategy in DA is to use a term that brings the source and target distributions closer. Following this line of work, Geng et al.
regularize the metric using the empirical Maximum Mean Discrepancy (MMD, Gretton et al., 2006), a nonparametric way of measuring the difference in distribution between the source sample S and the target sample T:

    MMD(S, T) = ‖ (1/|S|) Σ_{i=1}^{|S|} φ(x_i) − (1/|T|) Σ_{i=1}^{|T|} φ(x′_i) ‖²_H,

where φ(x) is a nonlinear feature mapping function that maps x to the Reproducing Kernel Hilbert Space H. The MMD can be computed efficiently using the kernel trick and can thus be used as a (convex) regularizer in kernelized metric learning algorithms (see Section 4.2.1). DAML is thus a trade-off between satisfying the constraints on the labeled source data and finding a projection that minimizes the discrepancy between the source and target distributions. Experiments on face recognition and image annotation tasks in the DA setting highlight the effectiveness of DAML compared to classic metric learning methods.

5. Metric Learning for Structured Data

In many domains, data naturally come structured, as opposed to the "flat" feature vector representation we have focused on so far. Indeed, instances can come in the form of strings, such as words, text documents or DNA sequences; trees, like XML documents, secondary structure of RNA or parse trees; and graphs, such as networks, 3D objects or molecules.
Page  Name     Year  Source Code  Data Type  Method             Script       Optimum  Negative Pairs
39    R&Y      1998  Yes          String     Generative+EM      All          Local    No
39    O&S      2006  Yes          String     Discriminative+EM  All          Local    No
40    Saigo    2006  Yes          String     Gradient Descent   All          Local    No
40    GESL     2011  Yes          All        Gradient Descent   Levenshtein  Global   Yes
41    Bernard  2006  Yes          Tree       Both+EM            All          Local    No
41    Boyer    2007  Yes          Tree       Generative+EM      All          Local    No
41    Dalvi    2009  No           Tree       Discriminative+EM  All          Local    No
41    Emms     2012  No           Tree       Discriminative+EM  Optimal      Local    No
41    N&B      2007  No           Graph      Generative+EM      All          Local    No

Table 3: Main features of metric learning methods for structured data. Note that all methods make use of positive pairs.

In the context of structured data, metrics are especially appealing because they can be used as a proxy to access data without having to manipulate these complex objects. Indeed, given an appropriate structured metric, one can use any metric-based algorithm as if the data consisted of feature vectors. Many of these metrics actually rely on representing structured objects as feature vectors, such as some string kernels (see Lodhi et al., 2002, and variants) or bags-of-(visual)-words (Salton et al., 1975; Li and Perona, 2005). In this case, metric learning can simply be performed on the feature vector representation, but this strategy can imply a significant loss of structural information. On the other hand, there exist metrics that operate directly on the structured objects and can thus capture more structural distortions. However, learning such metrics is challenging because most structured metrics are combinatorial by nature, which explains why it has received less attention than metric learning from feature vectors.
In this section, we focus on the edit distance, which basically measures (in terms of number of operations) the cost of turning an object into another. The edit distance has attracted most of the interest in the context of metric learning for structured data because (i) it is defined for a variety of objects: sequences (Levenshtein, 1966), trees (Bille, 2005) and graphs (Gao et al., 2010), and (ii) it is naturally amenable to learning due to its parameterization by a cost matrix. We review string edit distance learning in Section 5.1, while methods for trees and graphs are covered in Section 5.2. The features of each approach are summarized in Table 3.

5.1 String Edit Distance Learning

In this section, we first introduce some notations as well as the string edit distance. We then review the relevant metric learning methods.

5.1.1 Notations and Definitions

Definition 1 (Alphabet and string)  An alphabet Σ is a finite nonempty set of symbols. A string x is a finite sequence of symbols from Σ. The empty string/symbol is denoted by $ and Σ* is the set of all finite strings (including $) that can be generated from Σ. Finally, the length of a string x is denoted by |x|.

Figure 6: A memoryless stochastic transducer that models the edit probability of any pair of strings built from Σ = {a, b}. Edit probabilities assigned to each transition are not shown here for the sake of readability.

Definition 2 (String edit distance)  Let C be a nonnegative (|Σ|+1) × (|Σ|+1) matrix giving the cost of the following elementary edit operations: insertion, deletion and substitution of a symbol, where symbols are taken from Σ ∪ {$}. Given two strings x, x′ ∈ Σ*, an edit script is a sequence of operations that turns x into x′.
The string edit distance (Levenshtein, 1966) between x and x′ is defined as the cost of the cheapest edit script and can be computed in O(|x| · |x′|) time by dynamic programming.

Similar metrics include the Needleman-Wunsch score (Needleman and Wunsch, 1970) and the Smith-Waterman score (Smith and Waterman, 1981). These alignment-based measures use the same substitution operations as the edit distance, but a linear gap penalty function instead of insertion/deletion costs.

The standard edit distance, often called Levenshtein edit distance, is based on a unit cost for all operations. However, this might not reflect the reality of the considered task: for example, in typographical error correction, the probability that a user hits the Q key instead of W on a QWERTY keyboard is much higher than the probability that he hits Q instead of Y. For some applications, such as protein alignment or handwritten digit recognition, hand-tuned cost matrices may be available (Dayhoff et al., 1978; Henikoff and Henikoff, 1992; Micó and Oncina, 1998). Otherwise, there is a need for automatically learning the cost matrix C for the task at hand.

5.1.2 Stochastic String Edit Distance Learning

Optimizing the edit distance is challenging because the optimal sequence of operations depends on the edit costs themselves, and therefore updating the costs may change the optimal edit script. Most general-purpose approaches get round this problem by considering a stochastic variant of the edit distance, where the cost matrix defines a probability distribution over the edit operations. One can then define an edit similarity as the posterior probability p_e(x′|x) that an input string x is turned into an output string x′.
This corresponds to summing over all possible edit scripts that turn x into x′ instead of only considering the optimal script. Such a stochastic edit process can be represented as a probabilistic model, such as a stochastic transducer (Figure 6), and one can estimate the parameters of the model (i.e., the cost matrix) that maximize the expected log-likelihood of positive pairs. This is done via an EM-like iterative procedure (Dempster et al., 1977). Note that unlike the standard edit distance, the obtained edit similarity does not usually satisfy the properties of a distance (in fact, it is often not symmetric and rarely satisfies the triangle inequality).

Ristad and Yianilos  The first method for learning a string edit metric, in the form of a generative model, was proposed by Ristad and Yianilos (1998).[46] They use a memoryless stochastic transducer which models the joint probability of a pair p_e(x, x′), from which p_e(x′|x) can be estimated. Parameter estimation is performed with an EM procedure. The Expectation step takes the form of a probabilistic version of the dynamic programming algorithm of the standard edit distance. The M-step aims at maximizing the likelihood of the training pairs of strings so as to define a joint distribution over the edit operations:

    Σ_{(u,v) ∈ (Σ∪{$})² \ {($,$)}} C_uv + c(#) = 1,  with c(#) > 0 and C_uv ≥ 0,

where # is a termination symbol and c(#) the associated cost (probability). Note that Bilenko and Mooney (2003) extended this approach to the Needleman-Wunsch score with affine gap penalty and applied it to duplicate detection.
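For reference, the standard cost-parameterized edit distance of Definition 2, whose dynamic program the Expectation step above turns into a probabilistic version, can be sketched as follows. This is a minimal illustration of ours: the three cost functions stand in for the cost matrix C and are hypothetical.

```python
def edit_distance(x, y, sub_cost, ins_cost, del_cost):
    """Cost-parameterized Levenshtein distance (Definition 2), computed
    in O(|x|*|y|) time by dynamic programming; D[i][j] holds the cheapest
    cost of turning the prefix x[:i] into y[:j]."""
    m, n = len(x), len(y)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + del_cost(x[i - 1])
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + ins_cost(y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + del_cost(x[i - 1]),          # delete x_i
                D[i][j - 1] + ins_cost(y[j - 1]),          # insert y_j
                D[i - 1][j - 1]                            # substitute / match
                + (0.0 if x[i - 1] == y[j - 1] else sub_cost(x[i - 1], y[j - 1])),
            )
    return D[m][n]

# Unit costs for every operation recover the standard Levenshtein distance.
unit = edit_distance("kitten", "sitting",
                     sub_cost=lambda a, b: 1.0,
                     ins_cost=lambda a: 1.0,
                     del_cost=lambda a: 1.0)
print(unit)  # 3.0
```

Learning approaches such as the one of Ristad and Yianilos replace these fixed costs with estimated (probabilistic) parameters.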
To deal with the tendency of Maximum Likelihood estimators to overfit when the number of parameters is large (in this case, when the alphabet size is large), Takasu (2009) proposes a Bayesian parameter estimation of pair-HMMs, providing a way to smooth the estimation.

Oncina and Sebban  The work of Oncina and Sebban (2006) describes three levels of bias induced by the use of generative models: (i) dependence between edit operations, (ii) dependence between the costs and the prior distribution of strings p_e(x), and (iii) the fact that to obtain the posterior probability one must divide by the empirical estimate of p_e(x). These biases are highlighted by empirical experiments conducted with the method of Ristad and Yianilos (1998). To address these limitations, they propose the use of a conditional transducer as a discriminative model that directly models the posterior probability p(x′|x) that an input string x is turned into an output string x′ using edit operations.[46] Parameter estimation is also done with EM, where the maximization step differs from that of Ristad and Yianilos (1998) as shown below:

    ∀ u ∈ Σ:  Σ_{v ∈ Σ∪{$}} C_{v|u} + Σ_{v ∈ Σ} C_{v|$} = 1,  with  Σ_{v ∈ Σ} C_{v|$} + c(#) = 1.

In order to allow the use of negative pairs, McCallum et al. (2005) consider another discriminative model, conditional random fields, that can deal with positive and negative pairs in specific states, still using EM for parameter estimation.

5.1.3 String Edit Distance Learning by Gradient Descent

The use of EM has two main drawbacks: (i) it may converge to a local optimum, and (ii) parameter estimation and distance calculations must be done at each iteration, which can be very costly if the size of the alphabet and/or the length of the strings are large.

46. An implementation is available within the SEDiL platform (Boyer et al.
, 2008): http://labh-curien.univ-st-etienne.fr/SEDiL/

The following methods get round these drawbacks by formulating the learning problem in the form of an optimization problem that can be efficiently solved by a gradient descent procedure.

Saigo et al.  Saigo et al. (2006) manage to avoid the need for an iterative procedure like EM in the context of detecting remote homology in protein sequences.[47] They learn the parameters of the Smith-Waterman score, which is plugged into their local alignment kernel k_LA, where all possible local alignments π for changing x into x′ are taken into account (Saigo et al., 2004):

    k_LA(x, x′) = Σ_π exp(t · s(x, x′, π)).   (19)

In the above formula, t is a parameter and s(x, x′, π) is the corresponding score of π, defined as follows:

    s(x, x′, π) = Σ_{u,v ∈ Σ} n_{u,v}(x, x′, π) · C_uv − n_{g_d}(x, x′, π) · g_d − n_{g_e}(x, x′, π) · g_e,   (20)

where n_{u,v}(x, x′, π) is the number of times that symbol u is aligned with v, while g_d and g_e, along with their corresponding numbers of occurrences n_{g_d}(x, x′, π) and n_{g_e}(x, x′, π), are two parameters dealing respectively with the opening and extension of gaps.

Unlike the Smith-Waterman score, k_LA is differentiable and can be optimized by a gradient descent procedure. The objective function that they optimize is meant to favor the discrimination between positive and negative examples, but this is done by only using positive pairs of distant homologs. The approach has two additional drawbacks: (i) the objective function is nonconvex and thus subject to local minima, and (ii) in general, k_LA does not fulfill the properties of a kernel.

GESL (Bellet et al.)  Bellet et al.
(2011, 2012a) propose a convex programming approach to learn edit similarity functions from both positive and negative pairs without requiring a costly iterative procedure.[48] They use the following simplified edit function:

    e_C(x, x′) = Σ_{(u,v) ∈ (Σ∪{$})² \ {($,$)}} C_uv · #_uv(x, x′),

where #_uv(x, x′) is the number of times the operation u → v appears in the Levenshtein script. Therefore, e_C can be optimized directly since the sequence of operations is fixed (it does not depend on the costs). The authors optimize the nonlinear similarity K_C(x, x′) = 2 exp(−e_C(x, x′)) − 1 derived from e_C. Note that K_C is not required to be PSD nor symmetric. GESL (Good Edit Similarity Learning) is expressed as follows:

    min_{C, B_1, B_2}  (1/n²) Σ_{z_i, z_j} ℓ(C, z_i, z_j) + β ‖C‖²_F
    s.t.  B_1 ≥ −log(1/2),  0 ≤ B_2 ≤ −log(1/2),  B_1 − B_2 = η_γ,

where β ≥ 0 is a regularization parameter, η_γ ≥ 0 a parameter corresponding to a desired "margin", and

    ℓ(C, z_i, z_j) = [B_1 − e_C(x_i, x_j)]_+  if y_i ≠ y_j,
                   = [e_C(x_i, x_j) − B_2]_+  if y_i = y_j.

GESL essentially learns the edit cost matrix C so as to optimize the (ε, γ, τ)-goodness (Balcan et al., 2008a) of the similarity K_C(x, x′) and thereby enjoys generalization guarantees both for the learned similarity and for the resulting linear classifier (see Section 4.5). A potential drawback of GESL is that it optimizes a simplified variant of the edit distance, although this does not seem to be an issue in practice. Note that GESL can be straightforwardly adapted to learn tree or graph edit similarities (Bellet et al., 2012a).

47. Source code available at: http://sunflower.kuicr.kyoto-u.ac.jp/~hiroto/project/optaa.html
48. Source code available at: http://www-bcf.usc.edu/~bellet/
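The simplified edit function e_C above can be sketched as follows (our illustration, not the authors' code): the Levenshtein script is computed once with unit costs, and the learned costs are then summed over its fixed sequence of operations. The cost dictionary in the example is hypothetical.

```python
def levenshtein_script(x, y):
    """Backtrack one optimal unit-cost Levenshtein script, returned as a
    list of operations (u, v); '$' denotes the empty symbol, so (u, '$')
    is a deletion and ('$', v) an insertion."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1,
                          D[i - 1][j - 1] + (x[i - 1] != y[j - 1]))
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (x[i - 1] != y[j - 1]):
            ops.append((x[i - 1], y[j - 1]))   # substitution (or match)
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append((x[i - 1], "$"))        # deletion
            i -= 1
        else:
            ops.append(("$", y[j - 1]))        # insertion
            j -= 1
    return ops

def e_C(x, y, C):
    """GESL's simplified edit function: the learned cost C[(u, v)] summed
    over the operations of the fixed Levenshtein script."""
    return sum(C.get(op, 0.0) for op in levenshtein_script(x, y))

# Hypothetical learned costs: substituting 'a' by 'b' is cheap; unlisted
# operations (including identical matches) cost 0 here.
C = {("a", "b"): 0.1, ("c", "$"): 1.0}
print(e_C("ac", "bc", C))  # 0.1 (only the a->b substitution has nonzero cost)
```

Because the script is fixed by the unit-cost distance, e_C is linear in the costs C, which is what makes the GESL problem above convex.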
5.2 Tree and Graph Edit Distance Learning

In this section, we briefly review the main approaches in tree/graph edit distance learning. We do not delve into the details of these approaches as they are essentially adaptations of the stochastic string edit distance learning methods presented in Section 5.1.2.

Bernard et al.  Extending the work of Ristad and Yianilos (1998) and Oncina and Sebban (2006) on string edit similarity learning, Bernard et al. (2006, 2008) propose both a generative and a discriminative model for learning tree edit costs.[46] They rely on the tree edit distance by Selkow (1977), which is cheaper to compute than that of Zhang and Shasha (1989), and adapt the updates of EM to this case.

Boyer et al.  The work of Boyer et al. (2007) tackles the more complex variant of the tree edit distance (Zhang and Shasha, 1989), which allows the insertion and deletion of single nodes instead of entire subtrees only.[46] Parameter estimation in the generative model is also based on EM.

Dalvi et al.  The work of Dalvi et al. (2009) points out a limitation of the approach of Bernard et al. (2006, 2008): they model a distribution over tree edit scripts rather than over the trees themselves, and unlike the case of strings, there is no bijection between the edit scripts and the trees. Recovering the correct conditional probability with respect to trees requires a careful and costly procedure. They propose a more complex conditional transducer that models the conditional probability over trees and again use EM for parameter estimation.

Emms  The work of Emms (2012) points out a theoretical limitation of the approach of Boyer et al. (2007): the authors use a factorization that turns out to be incorrect in some cases.
Emms shows that a correct factorization exists when considering only the edit script of highest probability instead of all possible scripts, and derives the corresponding EM updates. An obvious drawback is that the output of the model is not the probability p(x′|x). Moreover, the approach is prone to overfitting and requires smoothing and other heuristics (such as a final step of zeroing-out the diagonal of the cost matrix).

Neuhaus & Bunke  In their paper, Neuhaus and Bunke (2007) learn a (more general) graph edit similarity, where each edit operation is modeled by a Gaussian mixture density. Parameter estimation is done using an EM-like algorithm. Unfortunately, the approach is intractable: the complexity of the EM procedure is exponential in the number of nodes (and so is the computation of the distance).

6. Conclusion and Discussion

In this survey, we provided a comprehensive review of the main methods and trends in metric learning. We here briefly summarize and draw promising lines for future research.

6.1 Summary

Numerical data  While metric learning for feature vectors was still in its early life at the time of the first survey (Yang and Jin, 2006), it has now reached a good maturity level. Indeed, recent methods are able to deal with a large spectrum of settings in a scalable way. In particular, online approaches have played a significant role towards better scalability, complex tasks can be tackled through nonlinear or local metric learning, methods have been derived for difficult settings such as ranking, multi-task learning or domain adaptation, and the question of generalization in metric learning has been the focus of recent papers.
Structured data On the other hand, much less work has gone into metric learning for structured data, and advances made for numerical data have not yet propagated to structured data. Indeed, most approaches remain based on EM-like algorithms, which makes them intractable for large datasets and instance sizes, and hard to analyze due to local optima. Nevertheless, recent advances such as GESL (Bellet et al., 2011) have shown that drawing inspiration from successful feature vector formulations (even if it requires simplifying the metric) can be highly beneficial in terms of scalability and flexibility. This is a promising direction and probably a good omen for the development of this research area.

6.2 What next?

In light of this survey, we can identify the limitations of the current literature and speculate on where the future of metric learning is going.

Scalability with both n and d There have been satisfying solutions to perform metric learning on large datasets ("Big Data") through online learning or stochastic optimization. The question of scalability with the dimensionality is more involved, since most methods learn O(d²) parameters, which is intractable for real-world applications involving thousands of features, unless dimensionality reduction is applied beforehand. Kernelized methods have O(n²) parameters instead, but this is infeasible when n is also large. Therefore, the challenge of achieving high scalability with both n and d has yet to be overcome. Recent approaches have tackled the problem by optimizing over the manifold of low-rank matrices (Shalit et al., 2012; Cheng, 2013) or defining the metric based on a combination of simple classifiers (Kedem et al., 2012; Xiong et al., 2012). These approaches have good potential for future research.
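To make the parameter-count argument concrete, the low-rank route stores a rectangular factor L of shape (r, d) rather than the full d×d PSD matrix M = LᵀL, reducing the parameter count from O(d²) to O(rd). A minimal NumPy sketch (function and variable names are ours, purely for illustration):

```python
import numpy as np

def lowrank_dist(L, x, y):
    """Mahalanobis-type distance d_L(x, y) = ||L (x - y)||_2.

    L has shape (r, d) with r << d, so only r*d parameters are stored
    instead of the d*d entries of the full PSD matrix M = L^T L."""
    z = L @ (x - y)
    return float(np.sqrt(z @ z))

rng = np.random.default_rng(0)
d, r = 1000, 10                       # high-dimensional input, low rank
L = rng.standard_normal((r, d))
x, y = rng.standard_normal(d), rng.standard_normal(d)

# same value as the full quadratic form sqrt((x-y)^T M (x-y)) with M = L^T L
M = L.T @ L
diff = x - y
full = float(np.sqrt(diff @ M @ diff))
```

The price of this parameterization is that optimizing over L instead of M makes the problem nonconvex, which is precisely why the cited works resort to optimization over the low-rank manifold.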
More theoretical understanding Although several recent papers have looked at the generalization of metric learning, analyzing the link between the consistency of the learned metric and its performance in a given algorithm (classifier, clustering procedure, etc.) remains an important open problem. So far, only results for linear classification have been obtained (Bellet et al., 2012b; Guo and Ying, 2014), while learned metrics are also heavily used for k-NN classification, clustering or information retrieval, for which no theoretical result is known.

Unsupervised metric learning A natural question to ask is whether one can learn a metric in a purely unsupervised way. So far, this has only been done as a byproduct of dimensionality reduction algorithms. Other relevant criteria should be investigated, for instance learning a metric that is robust to noise or invariant to some transformations of interest, in the spirit of denoising autoencoders (Vincent et al., 2008; Chen et al., 2012). Some results in this direction have been obtained for image transformations (Kumar et al., 2007). A related problem is to characterize what it means for a metric to be good for clustering. There has been preliminary work on this question (Balcan et al., 2008b; Lajugie et al., 2014), which deserves more attention.

Leveraging the structure The simple example of metric learning designed specifically for histogram data (Kedem et al., 2012) has shown that taking the structure of the data into account when learning the metric can lead to significant improvements in performance. As data is becoming more and more structured (e.g., social networks), using this structure to bias the choice of metric is likely to receive increasing interest in the near future.
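As a concrete instance of structure-aware distances for histogram data, the χ² distance is a standard measure between normalized histograms, and the work of Kedem et al. (2012) learns a generalized variant of it. A minimal sketch of the plain (non-learned) χ² distance, with an illustrative eps added for numerical stability (names are our own):

```python
import numpy as np

def chi2_dist(p, q, eps=1e-12):
    """Symmetric chi-squared distance between two histograms
    (nonnegative bins, typically normalized to sum to 1)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(0.5 * np.sum((p - q) ** 2 / (p + q + eps)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
```

Unlike the plain Euclidean distance, the per-bin normalization by p + q respects the fact that histogram bins live on the probability simplex, which is the kind of structural prior this paragraph advocates.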
Adapting the metric to changing data An important issue is to develop methods robust to changes in the data. In this line of work, metric learning in the presence of noisy data as well as for transfer learning and domain adaptation have recently received some interest. However, these efforts are still insufficient for dealing with lifelong learning applications, where the learner experiences concept drift and must detect and adapt the metric to different changes.

Learning richer metrics Existing metric learning algorithms ignore the fact that the notion of similarity is often multimodal: there exist several ways in which two instances may be similar (perhaps based on different features), and different degrees of similarity (versus the simple binary similar/dissimilar view). Being able to model these shades, as well as to interpret why things are similar, would bring the learned metrics closer to our own notions of similarity.

Acknowledgments

We would like to acknowledge support from the ANR LAMPADA 09-EMER-007-02 project.

References

M. Ehsan Abbasnejad, Dhanesh Ramachandram, and Mandava Rajeswari. A survey of the state of the art in learning the kernels. Knowledge and Information Systems (KAIS), 31(2):193–221, 2012.

Pierre-Antoine Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.

Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.

Mahdieh S. Baghshah and Saeed B. Shouraki. Semi-Supervised Metric Learning Using Pairwise Constraints. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 1217–1222, 2009.

Maria-Florina Balcan, Avrim Blum, and Nathan Srebro. Improved Guarantees for Learning via Similarity Functions.
In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 287–298, 2008a.

Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. A Discriminative Framework for Clustering via Similarity Functions. In ACM Symposium on Theory of Computing (STOC), pages 671–680, 2008b.

Aharon Bar-Hillel, Tomer Hertz, Noam Shental, and Daphna Weinshall. Learning Distance Functions using Equivalence Relations. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 11–18, 2003.

Aharon Bar-Hillel, Tomer Hertz, Noam Shental, and Daphna Weinshall. Learning a Mahalanobis Metric from Equivalence Constraints. Journal of Machine Learning Research (JMLR), 6:937–965, 2005.

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research (JMLR), 3:463–482, 2002.

Jonathan Baxter and Peter L. Bartlett. The Canonical Distortion Measure in Feature Space and 1-NN Classification. In Advances in Neural Information Processing Systems (NIPS) 10, 1997.

Mikhail Belkin and Partha Niyogi. Semi-Supervised Learning on Riemannian Manifolds. Machine Learning Journal (MLJ), 56(1–3):209–239, 2004.

Aurélien Bellet. Supervised Metric Learning with Generalization Guarantees. PhD thesis, University of Saint-Etienne, 2012.

Aurélien Bellet and Amaury Habrard. Robustness and Generalization for Metric Learning. Technical report, University of Saint-Etienne, September 2012. arXiv:1209.1086.

Aurélien Bellet, Amaury Habrard, and Marc Sebban. Learning Good Edit Similarities with Generalization Guarantees. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), pages 188–203, 2011.

Aurélien Bellet, Amaury Habrard, and Marc Sebban.
Good edit similarity learning by loss minimization. Machine Learning Journal (MLJ), 89(1):5–35, 2012a.

Aurélien Bellet, Amaury Habrard, and Marc Sebban. Similarity Learning for Provably Accurate Sparse Linear Classification. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1871–1878, 2012b.

Xianye Ben, Weixiao Meng, Rui Yan, and Kejun Wang. An improved biometrics technique based on metric learning approach. Neurocomputing, 97:44–51, 2012.

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning Journal (MLJ), 79(1-2):151–175, 2010.

Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization. Princeton University Press, 2009.

Marc Bernard, Amaury Habrard, and Marc Sebban. Learning Stochastic Tree Edit Distance. In Proceedings of the 17th European Conference on Machine Learning (ECML), pages 42–53, 2006.

Marc Bernard, Laurent Boyer, Amaury Habrard, and Marc Sebban. Learning probabilistic models of tree edit distance. Pattern Recognition (PR), 41(8):2611–2629, 2008.

Jinbo Bi, Dijia Wu, Le Lu, Meizhu Liu, Yimo Tao, and Matthias Wolf. AdaBoost on low-rank PSD matrices for metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2617–2624, 2011.

Wei Bian. Constrained Empirical Risk Minimization Framework for Distance Metric Learning. IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 23(8):1194–1205, 2012.

Wei Bian and Dacheng Tao. Learning a Distance Metric by Empirical Loss Minimization.
In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI), pages 1186–1191, 2011.

Mikhail Bilenko and Raymond J. Mooney. Adaptive Duplicate Detection Using Learnable String Similarity Measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 39–48, 2003.

Mikhail Bilenko, Sugato Basu, and Raymond J. Mooney. Integrating Constraints and Metric Learning in Semi-Supervised Clustering. In Proceedings of the 21st International Conference on Machine Learning (ICML), pages 81–88, 2004.

Philip Bille. A survey on tree edit distance and related problems. Theoretical Computer Science (TCS), 337(1-3):217–239, 2005.

Olivier Bousquet and André Elisseeff. Stability and Generalization. Journal of Machine Learning Research (JMLR), 2:499–526, 2002.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Laurent Boyer, Amaury Habrard, and Marc Sebban. Learning Metrics between Tree Structured Data: Application to Image Recognition. In Proceedings of the 18th European Conference on Machine Learning (ECML), pages 54–66, 2007.

Laurent Boyer, Yann Esposito, Amaury Habrard, José Oncina, and Marc Sebban. SEDiL: Software for Edit Distance Learning. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), pages 672–677, 2008. URL http://labh-curien.univ-st-etienne.fr/SEDiL/.

Lev M. Bregman. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.

Bin Cao, Xiaochuan Ni, Jian-Tao Sun, Gang Wang, and Qiang Yang.
Distance Metric Learning under Covariate Shift. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI), pages 1204–1210, 2011.

Qiong Cao, Zheng-Chu Guo, and Yiming Ying. Generalization Bounds for Metric and Similarity Learning. Technical report, University of Exeter, July 2012a. arXiv:1207.5437.

Qiong Cao, Yiming Ying, and Peng Li. Distance Metric Learning Revisited. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), pages 283–298, 2012b.

Rich Caruana. Multitask Learning. Machine Learning Journal (MLJ), 28(1):41–75, 1997.

Nicolò Cesa-Bianchi and Claudio Gentile. Improved Risk Tail Bounds for On-Line Algorithms. IEEE Transactions on Information Theory (TIT), 54(1):386–390, 2008.

Ratthachat Chatpatanasiri, Teesid Korsrilabutr, Pasakorn Tangchanachaianan, and Boonserm Kijsirikul. A new kernelization framework for Mahalanobis distance learning algorithms. Neurocomputing, 73:1570–1579, 2010.

Gal Chechik, Uri Shalit, Varun Sharma, and Samy Bengio. An Online Algorithm for Large Scale Image Similarity Learning. In Advances in Neural Information Processing Systems (NIPS) 22, pages 306–314, 2009.

Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large Scale Online Learning of Image Similarity Through Ranking. Journal of Machine Learning Research (JMLR), 11:1109–1135, 2010.

Minmin Chen, Zhixiang Eddie Xu, Kilian Q. Weinberger, and Fei Sha. Marginalized Denoising Autoencoders for Domain Adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Li Cheng. Riemannian Similarity Learning. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

Sumit Chopra, Raia Hadsell, and Yann LeCun.
Learning a Similarity Metric Discriminatively, with Application to Face Verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 539–546, 2005.

Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning Journal (MLJ), 20(3):273–297, 1995.

Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory (TIT), 13(1):21–27, 1967.

Koby Crammer and Gal Chechik. Adaptive Regularization for Weight Matrices. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research (JMLR), 7:551–585, 2006.

Marco Cuturi and David Avis. Ground Metric Learning. Technical report, Kyoto University, 2011. arXiv:1110.2306.

Nilesh N. Dalvi, Philip Bohannon, and Fei Sha. Robust web extraction: an approach based on a probabilistic tree-edit model. In Proceedings of the ACM SIGMOD International Conference on Management of Data (COMAD), pages 335–348, 2009.

Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning (ICML), pages 209–216, 2007.

Margaret O. Dayhoff, Robert M. Schwartz, and Bruce C. Orcutt. A model of evolutionary change in proteins. Atlas of protein sequence and structure, 5(3):345–351, 1978.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

Jia Deng, Alexander C. Berg, and Li Fei-Fei.
Hierarchical semantic indexing for large scale image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 785–792, 2011.

Matthew Der and Lawrence K. Saul. Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning. In Advances in Neural Information Processing Systems (NIPS) 25, pages 3239–3247, 2012.

Inderjit S. Dhillon and Joel A. Tropp. Matrix Nearness Problems with Bregman Divergences. SIAM Journal on Matrix Analysis and Applications, 29(4):1120–1146, 2007.

Huyen Do, Alexandros Kalousis, Jun Wang, and Adam Woznica. A metric learning perspective of SVM: on the relation of LMNN and SVM. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 308–317, 2012.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite Objective Mirror Descent. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), pages 14–26, 2010.

Martin Emms. On Stochastic Tree Distances and Their Training via Expectation-Maximisation. In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods (ICPRAM), pages 144–153, 2012.

Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117, 2004.

Maryam Fazel, Haitham Hindi, and Stephen P. Boyd. A Rank Minimization Heuristic with Application to Minimum Order System Approximation. In Proceedings of the American Control Conference, pages 4734–4739, 2001.

Imola K. Fodor. A Survey of Dimension Reduction Techniques. Technical report, Lawrence Livermore National Laboratory, 2002. UCRL-ID-148494.

Yoav Freund and Robert E.
Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. In Proceedings of the 2nd European Conference on Computational Learning Theory (EuroCOLT), pages 23–37, 1995.

Jerome H. Friedman. Flexible Metric Nearest Neighbor Classification. Technical report, Department of Statistics, Stanford University, 1994.

Jerome H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics (AOS), 29(5):1189–1232, 2001.

Andrea Frome, Yoram Singer, Fei Sha, and Jitendra Malik. Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1–8, 2007.

Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.

Xinbo Gao, Bing Xiao, Dacheng Tao, and Xuelong Li. A survey of graph edit distance. Pattern Analysis and Applications (PAA), 13(1):113–129, 2010.

Bo Geng, Dacheng Tao, and Chao Xu. DAML: Domain Adaptation Metric Learning. IEEE Transactions on Image Processing (TIP), 20(10):2980–2989, 2011.

Amir Globerson and Sam T. Roweis. Metric Learning by Collapsing Classes. In Advances in Neural Information Processing Systems (NIPS) 18, pages 451–458, 2005.

Jacob Goldberger, Sam Roweis, Geoff Hinton, and Ruslan Salakhutdinov. Neighbourhood Components Analysis. In Advances in Neural Information Processing Systems (NIPS) 17, pages 513–520, 2004.

Mehmet Gönen and Ethem Alpaydin. Multiple Kernel Learning Algorithms. Journal of Machine Learning Research (JMLR), 12:2211–2268, 2011.

Yves Grandvalet and Yoshua Bengio. Semi-supervised Learning by Entropy Minimization. In Advances in Neural Information Processing Systems (NIPS) 17, pages 529–536, 2004.
Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander J. Smola. A Kernel Method for the Two-Sample-Problem. In Advances in Neural Information Processing Systems (NIPS) 19, pages 513–520, 2006.

Matthieu Guillaumin, Thomas Mensink, Jakob J. Verbeek, and Cordelia Schmid. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 309–316, 2009a.

Matthieu Guillaumin, Jakob J. Verbeek, and Cordelia Schmid. Is that you? Metric learning approaches for face identification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 498–505, 2009b.

Zheng-Chu Guo and Yiming Ying. Guaranteed Classification via Regularized Similarity Learning. Neural Computation, 26(3):497–522, 2014.

James L. Hafner, Harpreet S. Sawhney, William Equitz, Myron Flickner, and Wayne Niblack. Efficient Color Histogram Indexing for Quadratic Form Distance Functions. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 17(7):729–736, 1995.

Trevor Hastie and Robert Tibshirani. Discriminant Adaptive Nearest Neighbor Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 18(6):607–616, 1996.

Søren Hauberg, Oren Freifeld, and Michael J. Black. A Geometric take on Metric Learning. In Advances in Neural Information Processing Systems (NIPS) 25, pages 2033–2041, 2012.

Yujie He, Wenlin Chen, and Yixin Chen. Kernel Density Metric Learning. Technical report, Washington University in St. Louis, 2013.

Steven Henikoff and Jorja G. Henikoff. Amino acid substitution matrices from protein blocks.
Proceedings of the National Academy of Sciences of the United States of America, 89(22):10915–10919, 1992.

Steven C. Hoi, Wei Liu, Michael R. Lyu, and Wei-Ying Ma. Learning Distance Metrics with Contextual Constraints for Image Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2072–2078, 2006.

Steven C. Hoi, Wei Liu, and Shih-Fu Chang. Semi-supervised distance metric learning for Collaborative Image Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

Steven C. Hoi, Wei Liu, and Shih-Fu Chang. Semi-supervised distance metric learning for collaborative image retrieval and clustering. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 6(3), 2010.

Yi Hong, Quannan Li, Jiayan Jiang, and Zhuowen Tu. Learning a mixture of sparse distance metrics for classification and dimensionality reduction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 906–913, 2011.

Kaizhu Huang, Yiming Ying, and Colin Campbell. GSML: A Unified Framework for Sparse Metric Learning. In Proceedings of the IEEE International Conference on Data Mining (ICDM), pages 189–198, 2009.

Kaizhu Huang, Rong Jin, Zenglin Xu, and Cheng-Lin Liu. Robust Metric Learning by Smooth Optimization. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI), pages 244–251, 2010.

Kaizhu Huang, Yiming Ying, and Colin Campbell. Generalized sparse metric learning with relative comparisons. Knowledge and Information Systems (KAIS), 28(1):25–45, 2011.

Yinjie Huang, Cong Li, Michael Georgiopoulos, and Georgios C. Anagnostopoulos. Reduced-Rank Local Distance Metric Learning.
In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), pages 224–239, 2013.

Prateek Jain, Brian Kulis, Inderjit S. Dhillon, and Kristen Grauman. Online Metric Learning and Fast Similarity Search. In Advances in Neural Information Processing Systems (NIPS) 21, pages 761–768, 2008.

Prateek Jain, Brian Kulis, and Inderjit S. Dhillon. Inductive Regularized Learning of Kernel Functions. In Advances in Neural Information Processing Systems (NIPS) 23, pages 946–954, 2010.

Prateek Jain, Brian Kulis, Jason V. Davis, and Inderjit S. Dhillon. Metric and Kernel Learning Using a Linear Transformation. Journal of Machine Learning Research (JMLR), 13:519–547, 2012.

Nan Jiang, Wenyu Liu, and Ying Wu. Order determination and sparsity-regularized metric learning adaptive visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1956–1963, 2012.

Rong Jin, Shijun Wang, and Yang Zhou. Regularized Distance Metric Learning: Theory and Algorithm. In Advances in Neural Information Processing Systems (NIPS) 22, pages 862–870, 2009.

Thorsten Joachims, Thomas Finley, and Chun-Nam J. Yu. Cutting-plane training of structural SVMs. Machine Learning Journal (MLJ), 77(1):27–59, 2009.

Purushottam Kar, Bharath Sriperumbudur, Prateek Jain, and Harish Karnick. On the Generalization Ability of Online Learning Algorithms for Pairwise Loss Functions. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

Tsuyoshi Kato and Nozomi Nagano. Metric learning for enzyme active-site search. Bioinformatics, 26(21):2698–2704, 2010.

Dor Kedem, Stephen Tyree, Kilian Weinberger, Fei Sha, and Gert Lanckriet.
Non-linear Metric Learning. In Advances in Neural Information Processing Systems (NIPS) 25, pages 2582–2590, 2012.

Brian Kulis. Metric Learning: A Survey. Foundations and Trends in Machine Learning (FTML), 5(4):287–364, 2012.

Brian Kulis, Prateek Jain, and Kristen Grauman. Fast Similarity Search for Learned Metrics. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 31(12):2143–2157, 2009.

Brian Kulis, Kate Saenko, and Trevor Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1785–1792, 2011.

Solomon Kullback and Richard Leibler. On Information and Sufficiency. Annals of Mathematical Statistics, 22(1):79–86, 1951.

M. Pawan Kumar, Philip H. S. Torr, and Andrew Zisserman. An Invariant Large Margin Nearest Neighbour Classifier. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1–8, 2007.

Gautam Kunapuli and Jude Shavlik. Mirror Descent for Metric Learning: A Unified Approach. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), pages 859–874, 2012.

Rémi Lajugie, Sylvain Arlot, and Francis Bach. Large-Margin Metric Learning for Constrained Partitioning Problems. In Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.

Marc T. Law, Carlos S. Gutierrez, Nicolas Thome, and Stéphane Gançarski. Structural and visual similarity learning for Web page archiving. In Proceedings of the 10th International Workshop on Content-Based Multimedia Indexing (CBMI), pages 1–6, 2012.

Guy Lebanon. Metric Learning for Text Documents.
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 28(4):497–508, 2006.

Jung-Eun Lee, Rong Jin, and Anil K. Jain. Rank-based distance metric learning: An application to image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics-Doklady, 6:707–710, 1966.

Fei-Fei Li and Pietro Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 524–531, 2005.

Xi Li, Chunhua Shen, Qinfeng Shi, Anthony Dick, and Anton van den Hengel. Non-sparse Linear Representations for Visual Tracking with Online Reservoir Metric Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1760–1767, 2012.

Zhaohui Liang, Gang Zhang, Li Jiang, and Wenbin Fu. Learning a Consistent PRO-Outcomes Metric through KCCA for an Efficacy Assessing Model of Acupuncture. Journal of Chinese Medicine Research and Development (JCMRD), 1(3):79–88, 2012.

Daryl K. Lim, Brian McFee, and Gert Lanckriet. Robust Structural Metric Learning. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

Nick Littlestone. Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm. Machine Learning Journal (MLJ), 2(4):285–318, 1988.

Meizhu Liu and Baba C. Vemuri. A Robust and Efficient Doubly Regularized Metric Learning Approach. In Proceedings of the 12th European Conference on Computer Vision (ECCV), pages 646–659, 2012.

Wei Liu, Shiqian Ma, Dacheng Tao, Jianzhuang Liu, and Peng Liu.
Semi-Supervised Sparse Metric Learning using Alternating Linearization Optimization. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1139–1148, 2010.

Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory (TIT), 28:129–137, 1982.

Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text Classification using String Kernels. Journal of Machine Learning Research (JMLR), 2:419–444, 2002.

Jiwen Lu, Junlin Hu, Xiuzhuang Zhou, Yuanyuan Shang, Yap-Peng Tan, and Gang Wang. Neighborhood repulsed metric learning for kinship verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2594–2601, 2012.

Prasanta Chandra Mahalanobis. On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India, 2(1):49–55, 1936.

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain Adaptation: Learning Bounds and Algorithms. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009.

Andrew McCallum, Kedar Bellare, and Fernando Pereira. A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance. In Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence (UAI), pages 388–395, 2005.

Brian McFee and Gert R. G. Lanckriet. Metric Learning to Rank. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 775–782, 2010.

Brian McFee, Luke Barrington, and Gert R. G. Lanckriet. Learning Content Similarity for Music Recommendation. IEEE Transactions on Audio, Speech & Language Processing (TASLP), 20(8):2207–2218, 2012.

Thomas Mensink, Jakob J.
Verbeek, Florent Perronnin, and Gabriela Csurka. Metric Learning for Large Scale Image Classification: Generalizing to New Classes at Near-Zero Cost. In Proceedings of the 12th European Conference on Computer Vision (ECCV), pages 488–501, 2012.

Luisa Micó and Jose Oncina. Comparison of fast nearest neighbour classifiers for handwritten character recognition. Pattern Recognition Letters (PRL), 19:351–356, 1998.

Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology (JMB), 48(3):443–453, 1970.

Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103:127–152, 2005.

Michel Neuhaus and Horst Bunke. Automatic learning of cost functions for graph edit distance. Journal of Information Science (JIS), 177(1):239–247, 2007.

Behnam Neyshabur, Nati Srebro, Ruslan Salakhutdinov, Yury Makarychev, and Payman Yadollahpour. The Power of Asymmetry in Binary Hashing. In Advances in Neural Information Processing Systems (NIPS) 26, pages 2823–2831, 2013.

Hieu V. Nguyen and Li Bai. Cosine Similarity Metric Learning for Face Verification. In Proceedings of the 10th Asian Conference on Computer Vision (ACCV), pages 709–720, 2010.

Nam Nguyen and Yunsong Guo. Metric Learning: A Support Vector Approach. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), pages 125–136, 2008.

Gang Niu, Bo Dai, Makoto Yamada, and Masashi Sugiyama. Information-theoretic Semi-supervised Metric Learning via Entropy Regularization. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Yung-Kyun Noh, Byoung-Tak Zhang, and Daniel D. Lee.
Generativ e Lo cal Metric Learn- ing for Nearest Neigh b or C lassifica tion. In A dvanc es in N eur al Infor mation Pr o c essing Systems (NIPS) 23 , pages 1822–1830 , 2010. Mohammad Norouzi, Da vid J. Fleet, and Rus lan Salakh utdinov. Hamming Distance Metric Learning. I n Advanc es in Neu r al Informatio n Pr o c essing Systems (NIP S) 25 , pages 1070– 1078, 2012a. Mohammad Norouzi, Ali Pu njani, and Da vid J . Fleet. Fast Searc h in Hamming Space with Multi-Index Hashing. In Pr o c e e dings of the IEEE Confer enc e on Computer Vi si on and Pattern Re c o gnition (CVPR) , 2012b. 53 Bellet, Habrard and S ebban Jose On cina and Marc Sebb an. Learning Sto c hastic Edit Distance: app lic ation in hand- written c haracter recognition. Pattern R e c o gnition (PR) , 39(9):157 5–1587, 200 6. Shibin P aramesw aran and Kilian Q. W einberger. Large Margin Multi-Task Metric L earn ing. In Advanc es in Neur al Informatio n Pr o c essing Systems (NIPS) 23 , pages 1867–18 75, 2010. Ky oungup P ark, C h unhua Sh en, Zhihui Hao, and Junae Kim. Efficient ly Learning a Dis- tance Metric for Large Margin Nearest Neigh b or Classification. In Pr o c e e dings of the 25th AAAI Confer enc e on Artificial Intel ligenc e , 2011. Karl P earson. On Lines and P lanes of Closest Fit to Poin ts in Space. Philosophic al Magazine , 2(6):5 59–572, 1901 . Ali M. Q amar and Eric Gaussier. Online and Batc h L earn ing of Generalized Cosine Simi- larities. In Pr o c e e dings of the IEEE International Confer enc e on Data Mining (ICDM) , pages 926–931, 2009. Ali M. Qamar and Eric Gaussier. RE LIEF Algorithm and Similarit y Learning for k-NN. International Journal of Computer Information Systems and Industrial Management A p- plic at ions (IJCISIM) , 4:445–458, 2012. Ali M. Qamar, Eric Gauss ier, Jean-Pierre Chev allet, and Jo o- Hwee Lim. Similarit y Learnin g for Nearest Neigh b or Classification. 
In Pr o c e e dings of the IEE E International Confer enc e on Data Mining (ICD M) , p ag es 983–98 8, 2008 . Guo-Jun Qi, J inh ui T ang, Z heng-Jun Zh a, T at-Seng Chua, and Hong-Jiang Zhang. An Efficien t Sparse Metric Learning in High-Dimensional Space via l1-Penalized Log- Determinan t Regularization. In Pr o c e e dings of the 26th International Confer enc e on Machine L e arning (ICM L) , 2009. Qi Qian, Rong Jin , Jin f eng Yi, Lijun Zhang, and Shen ghuo Zhu. Effi cient Distance Metric Learning by Adaptiv e Sampling and Mini-Batc h Sto c hastic Gradient Descen t (SGD). arXiv:1304 .1192, April 2013. Joaquin Qu i ˜ nonero-Candela. Dataset Shift in Machine Le arning . MIT Press, 2009. Dev a Ramanan and Simon Bak er. Lo cal Distance Fu nctions: A Taxonom y , New Algorithms, and an Ev aluatio n . IEEE Tr ansactions on Pattern Analysis and Machine Intel ligenc e (TP AMI) , 33(4) :794–806, 20 11. Pradeep Ravikumar, Martin J. W ainwrigh t, Garve sh Raskutti, and Bin Y u. High- dimensional co v aria n ce estimation by minimizing ℓ 1 -p enalized log-determinan t div er- gence. Ele ctr onic Journal of Statistics , 5:935–98 0, 2011. Eric S. Ristad and Pe ter N. Yianilos. Learning String-Edit Distance. IEEE Tr ansactio ns on Pattern Analysis and Machine Intel lige nc e (TP AMI) , 20(5):5 22–532, 199 8. Romer Rosales and Glenn F un g. Learn ing S parse Metrics via Linear Programming. In Pr o c e e dings of the 12th ACM SIGKDD International Confer enc e on Know le dge Disc overy and Data Mi ning , pages 367–3 73, 2006. 54 A Sur vey on Metric Learning for Fea ture Vectors and Structured Da t a Y ossi Ru bner, Carlo T omasi, and Leonidas J. Guibas. T he Earth Mo v er’s Dista nce as a Metric for Image Retriev al. International Journal of Computer Vision (IJCV) , 40 (2): 99–12 1, 2000. Kate Saenko , Brian K u lis, Mario F ritz, and T rev or Darrell. Ad ap tin g Visual Category Mo dels to New Domains. 
In Pr o c e e dings of the 11th Eur op e an Confer enc e on Computer Vision (ECCV) , pages 213–226 , 201 0. Hiroto Saigo, Jean-Philipp e V ert, Nobuh isa Ueda, and T atsuya Akutsu . Protein homology detection u s ing strin g alignmen t k ernels. Bioinformatics , 20(11):168 2–1689, 2004. Hiroto Saigo, Jean-Philipp e V ert, and T atsuya Akutsu. Op timizing amino acid sub stitution matrices with a lo cal alignment kernel. Bioinforma tics , 7(246):1–1 2, 2006. Ruslan Salakh utdino v and Geoffrey E . Hin ton. Learning a Nonlinear Em b eddin g by Preserv- ing Class Neigh b ourho o d Stru ctur e. I n Pr o c e e dings of the 11th International Confer enc e on Artificial Intel ligenc e and Statistics (AIST A TS) , pages 412–419, 2007. Gerard S alt on, Andr ew W ong, and C. S . Y ang. A v ector space mo del for automatic indexing. Communic a tions of the A CM , 18(11):61 3–620, 1975. Rob ert E. S c hapire and Y oav F reund. Bo osting: Foundations and Algorithms . MIT Press, 2012. Bernhard Sc h¨ olko pf, Alexander S mola, and Klaus-Rob ert M¨ uller. Nonlinear comp onen t analysis as a kernel eigen v alue pr oblem. Neu r al Computation (NECO) , 10(1):1299– 1319, 1998. Matthew Sch ultz and Th orsten Joac hims. Learning a Distance Metric from Relativ e Com- parisons. In A dvanc es in Neur al Information Pr o c essing Systems (NIPS) 16 , 2003. Stanley M. Selk o w. Th e tr ee -to-tree editing p roblem. Information Pr o c essing Le tters , 6(6): 184–1 86, 197 7. Shai Shalev-Shw artz, Y oram Singer, and And rew Y. Ng. Online an d batc h learning of pseudo-metrics. In Pr o c e e dings of the 21st Internationa l Confer enc e on M achine Le arning (ICML) , 2004. Uri S halit, Daph na W einshall, and Gal Chec hik. Online Learning in The Manifold of Low- Rank Matrices. In Advanc es in Neur al Information Pr o c essing Systems (NIPS) 23 , pages 2128– 2136, 201 0. Uri Shalit, Daphna W einshall, and Gal Chec hik. On line Learning in the E m b edded Manifold of Low-rank Matrices. 
Journal of Machine Le arning Rese a r ch (JMLR) , 13:429 –458, 2012. Blak e Shaw, Bert C. Hu an g, and T on y Jebara. Learnin g a Distance Metric from a Net wo rk. In Advanc es in Neur al Informatio n Pr o c essing Systems (NIPS) 24 , pages 1899–19 07, 2011. 55 Bellet, Habrard and S ebban Ch u nh ua Shen, Ju n ae Kim, Lei W ang, and Ant on v an d en Hengel. Positiv e S emidefinite Metric Learning with Boosting. In Advanc es in N eur al Information Pr o c essing Systems (NIPS) 22 , pages 1651–16 60, 200 9. Ch u nh ua Shen, Ju n ae Kim, Lei W ang, and Ant on v an d en Hengel. Positiv e S emidefinite Metric Learn ing Usin g Bo osting- like Algorithms. Journal of Machine Le arning Rese ar ch (JMLR) , 13:1007– 1036, 2012. Noam Shenta l, T omer Hertz, Daphn a W einshall, and Misha Pa v el. Adjustment L earn ing and Relev ant Comp onen t Analysis. In Pr o c e e dings of the 7th Eur op e an Confer enc e on Computer Vision (ECCV) , pages 776–792, 2002. Y u an Sh i, Y ung-Kyun Noh, F ei Sha, and Daniel D. Lee. Learning Discriminativ e Metrics via Generativ e Mo dels and Kernel Learning. arXiv:1109.39 40, Septem b er 2011. Rob ert D. S hort and K einosu k e F uku naga . The optimal d istance measur e for nearest n eig h- b or classification. IEEE T r ansactions on Information The ory (TIT) , 27(5):6 22–626, 1981. Josef Sivic and Andrew Zisserman. Effi cie nt visual search of videos cast as text retriev a l. IEEE Tr ansactions on Pattern Analysis and M achine Intel ligenc e (TP AMI) , 31:591–6 06, 2009. T emple F. S mith and Mic hael S. W aterman. Identificat ion of common molecular su bse- quences. Journal of Mole cular Biolo gy (JMB) , 147(1 ):195–197 , 1981. A tsuhiro T ak asu. Ba y esian S imilarit y Mo del Estimation for Approximat e Recognized Text Searc h. In Pr o c e e dings of the 10th Internatio nal Confer enc e on Do cument Analysis and Re c o gnition (ICDAR) , pages 611–6 15, 2009. 
Daniel Tarlow, Kevin Swersky, Ilya Sutskever, Laurent Charlin, and Rich Zemel. Stochastic k-Neighborhood Selection for Supervised and Unsupervised Learning. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

Matthew E. Taylor, Brian Kulis, and Fei Sha. Metric learning for reinforcement learning agents. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 777–784, 2011.

Lorenzo Torresani and Kuang-Chih Lee. Large Margin Component Analysis. In Advances in Neural Information Processing Systems (NIPS) 19, pages 1385–1392, 2006.

Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large Margin Methods for Structured and Interdependent Output Variables. Journal of Machine Learning Research, 6:1453–1484, 2005.

Yuta Tsuboi, Hisashi Kashima, Shohei Hido, Steffen Bickel, and Masashi Sugiyama. Direct Density Ratio Estimation for Large-scale Covariate Shift Adaptation. In Proceedings of the SIAM International Conference on Data Mining (SDM), pages 443–454, 2008.

Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27:1134–1142, 1984.

Laurens J. P. van der Maaten, Eric O. Postma, and H. Jaap van den Herik. Dimensionality Reduction: A Comparative Review. Technical report, Tilburg University, 2009. TiCC-TR 2009-005.

Vladimir N. Vapnik and Alexey Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications (TPA), 16(2):264–280, 1971.

Nakul Verma, Dhruv Mahajan, Sundararajan Sellamanickam, and Vinod Nair. Learning Hierarchical Similarity Metrics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2280–2287, 2012.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 1096–1103, 2008.

Fan Wang and Leonidas J. Guibas. Supervised Earth Mover's Distance Learning and Its Computer Vision Applications. In Proceedings of the 12th European Conference on Computer Vision (ECCV), pages 442–455, 2012.

Jingyan Wang, Xin Gao, Quanquan Wang, and Yongping Li. ProDis-ContSHC: learning protein dissimilarity measures and hierarchical context coherently for protein-protein comparison in protein database retrieval. BMC Bioinformatics, 13(S-7):S2, 2012a.

Jun Wang, Huyen T. Do, Adam Woznica, and Alexandros Kalousis. Metric Learning with Multiple Kernels. In Advances in Neural Information Processing Systems (NIPS) 24, pages 1170–1178, 2011.

Jun Wang, Adam Woznica, and Alexandros Kalousis. Learning Neighborhoods for Metric Learning. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), pages 223–236, 2012b.

Jun Wang, Adam Woznica, and Alexandros Kalousis. Parametric Local Metric Learning for Nearest Neighbor Classification. In Advances in Neural Information Processing Systems (NIPS) 25, pages 1610–1618, 2012c.

Qianying Wang, Pong C. Yuen, and Guocan Feng. Semi-supervised metric learning via topology preserving multiple semi-supervised assumptions. Pattern Recognition (PR), 2013a.

Yuyang Wang, Roni Khardon, Dmitry Pechyony, and Rosie Jones. Generalization Bounds for Online Learning Algorithms with Pairwise Loss Functions. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), pages 13.1–13.22, 2012d.

Yuyang Wang, Roni Khardon, Dmitry Pechyony, and Rosie Jones. Online Learning with Pairwise Loss Functions. Technical report, Tufts University, January 2013b. arXiv:1301.5332.

Kilian Q. Weinberger and Lawrence K. Saul. Fast Solvers and Efficient Implementations for Distance Metric Learning. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 1160–1167, 2008.

Kilian Q. Weinberger and Lawrence K. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research (JMLR), 10:207–244, 2009.

Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. In Advances in Neural Information Processing Systems (NIPS) 18, pages 1473–1480, 2005.

Lei Wu, Rong Jin, Steven C.-H. Hoi, Jianke Zhu, and Nenghai Yu. Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering. In Advances in Neural Information Processing Systems (NIPS) 22, pages 2089–2097, 2009.

Lei Wu, Steven C.-H. Hoi, Rong Jin, Jianke Zhu, and Nenghai Yu. Learning Bregman Distance Functions for Semi-Supervised Clustering. IEEE Transactions on Knowledge and Data Engineering (TKDE), 24(3):478–491, 2012.

Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart J. Russell. Distance Metric Learning with Application to Clustering with Side-Information. In Advances in Neural Information Processing Systems (NIPS) 15, pages 505–512, 2002.

Caiming Xiong, David Johnson, Ran Xu, and Jason J. Corso. Random forests for metric learning with implicit pairwise position dependence. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 958–966, 2012.

Huilin Xiong and Xue-Wen Chen. Kernel-based distance metric learning for microarray data classification. BMC Bioinformatics, 7:299, 2006.

Huan Xu and Shie Mannor. Robustness and Generalization. Machine Learning Journal (MLJ), 86(3):391–423, 2012.

Zhixiang Xu, Kilian Q. Weinberger, and Olivier Chapelle. Distance Metric Learning for Kernel Machines. arXiv:1208.3422, 2012.

Liu Yang and Rong Jin. Distance Metric Learning: A Comprehensive Survey. Technical report, Department of Computer Science and Engineering, Michigan State University, 2006.

Peipei Yang, Kaizhu Huang, and Cheng-Lin Liu. Multi-Task Low-Rank Metric Learning Based on Common Subspace. In Proceedings of the 18th International Conference on Neural Information Processing (ICONIP), pages 151–159, 2011.

Peipei Yang, Kaizhu Huang, and Cheng-Lin Liu. Geometry Preserving Multi-task Metric Learning. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), pages 648–664, 2012.

Dit-Yan Yeung and Hong Chang. Extending the relevant component analysis algorithm for metric learning using both positive and negative equivalence constraints. Pattern Recognition (PR), 39(5):1007–1010, 2006.

Yiming Ying and Peng Li. Distance Metric Learning with Eigenvalue Optimization. Journal of Machine Learning Research (JMLR), 13:1–26, 2012.

Yiming Ying, Kaizhu Huang, and Colin Campbell. Sparse Metric Learning via Smooth Optimization. In Advances in Neural Information Processing Systems (NIPS) 22, pages 2214–2222, 2009.

Jun Yu, Meng Wang, and Dacheng Tao. Semisupervised Multiview Distance Metric Learning for Cartoon Synthesis. IEEE Transactions on Image Processing (TIP), 21(11):4636–4648, 2012.

Zheng-Jun Zha, Tao Mei, Meng Wang, Zengfu Wang, and Xian-Sheng Hua. Robust Distance Metric Learning with Auxiliary Knowledge. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), pages 1327–1332, 2009.

De-Chuan Zhan, Ming Li, Yu-Feng Li, and Zhi-Hua Zhou. Learning instance specific distances using metric propagation. In Proceedings of the 26th International Conference on Machine Learning, 2009.

Changshui Zhang, Feiping Nie, and Shiming Xiang. A general kernelization framework for learning algorithms based on kernel PCA. Neurocomputing, 73(4–6):959–967, 2010.

Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal of Computing (SICOMP), 18(6):1245–1262, 1989.

Yu Zhang and Dit-Yan Yeung. Transfer metric learning by learning task relationships. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1199–1208, 2010.

Guoqiang Zhong, Kaizhu Huang, and Cheng-Lin Liu. Low Rank Metric Learning with Manifold Regularization. In Proceedings of the IEEE International Conference on Data Mining (ICDM), pages 1266–1271, 2011.