Efficient Classification for Metric Data∗

Lee-Ad Gottlieb†   Aryeh Kontorovich‡   Robert Krauthgamer§

June 11, 2018

Abstract

Recent advances in large-margin classification of data residing in general metric spaces (rather than Hilbert spaces) enable classification under various natural metrics, such as string edit and earthmover distance. A general framework developed for this purpose by von Luxburg and Bousquet [JMLR, 2004] left open the questions of computational efficiency and of providing direct bounds on generalization error. We design a new algorithm for classification in general metric spaces, whose runtime and accuracy depend on the doubling dimension of the data points, and can thus achieve superior classification performance in many common scenarios. The algorithmic core of our approach is an approximate (rather than exact) solution to the classical problems of Lipschitz extension and of Nearest Neighbor Search. The algorithm's generalization performance is guaranteed via the fat-shattering dimension of Lipschitz classifiers, and we present experimental evidence of its superiority to some common kernel methods. As a by-product, we offer a new perspective on the nearest neighbor classifier, which yields significantly sharper risk asymptotics than the classic analysis of Cover and Hart [IEEE Trans. Info. Theory, 1967].

1 Introduction

A recent line of work extends the large-margin classification paradigm from Hilbert spaces to less structured ones, such as Banach or even metric spaces; see e.g. [23, 34, 13, 40]. In this metric approach, data is presented as points with distances but lacking the additional structure of inner products. The potentially significant advantage is that the metric can be precisely suited to the type of data, e.g. earthmover distance for images, or edit distance for sequences.
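To make the last point concrete: in the metric approach a classifier only ever consumes pairwise distances, never coordinates or inner products. A minimal sketch of such a metric on sequence data (the standard Levenshtein edit distance; the function name is our own illustration):

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: a metric on strings (symmetric, zero iff the
    strings are equal, and satisfying the triangle inequality)."""
    prev = list(range(len(t) + 1))          # distances from s[:0] to every prefix of t
    for i, cs in enumerate(s, 1):
        cur = [i]                           # distance from s[:i] to the empty prefix
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # delete cs
                           cur[j - 1] + 1,               # insert ct
                           prev[j - 1] + (cs != ct)))    # substitute cs -> ct
        prev = cur
    return prev[-1]
```

No embedding of strings into a vector space is used or needed; this is exactly the setting in which inner-product machinery is unavailable.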
However, much of the existing machinery of classification algorithms and generalization bounds (e.g. [11, 32]) depends strongly on the data residing in a Hilbert space. This structural requirement severely limits this machinery's applicability: many natural metric spaces cannot be represented in a Hilbert space faithfully; formally, every embedding into a Hilbert space of metrics such as ℓ₁, earthmover, and edit distance must distort distances by a large factor [14, 29, 2]. Ad-hoc solutions such as kernelization cannot circumvent this shortcoming, because imposing an inner product obviously embeds the data in some Hilbert space.

∗ An extended abstract of this work appeared in Proceedings of the 23rd COLT, 2010 [16].
† L. Gottlieb is with the Department of Computer Science and Mathematics at Ariel University (email: leead@ariel.ac.il).
‡ A. Kontorovich is with the Department of Computer Science at Ben-Gurion University of the Negev (email: karyeh@cs.bgu.ac.il). His work was supported in part by the Israel Science Foundation (grant No. 1141/12) and a Yahoo Faculty award.
§ R. Krauthgamer is with the Faculty of Mathematics and Computer Science at the Weizmann Institute of Science (email: robert.krauthgamer@weizmann.ac.il). His work was supported in part by the Israel Science Foundation (grant #452/08), the US-Israel BSF (grant #2010418), and by a Minerva grant.

To address this gap, von Luxburg and Bousquet [34] developed a powerful framework of large-margin classification for a general metric space X. They first show that the natural hypotheses (classifiers) to consider in this context are maximally smooth Lipschitz functions; indeed, they reduce classification (of points in a metric space X) with no training error to finding a Lipschitz function f : X → R consistent with the data, which is a classic problem in Analysis, known as Lipschitz extension.
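The Lipschitz extension problem has a classical constructive solution, the McShane-Whitney formula, which already hints at the nearest-neighbor flavor of the classifiers studied later. A minimal sketch, assuming the sample is given as (point, value) pairs and rho is an arbitrary metric (function and argument names are ours):

```python
def lipschitz_extension(sample, L, rho):
    """McShane-Whitney extension: given values f(x_i) on a finite sample,
    with Lipschitz constant at most L there, the function
        f*(x) = min_i [ f(x_i) + L * rho(x, x_i) ]
    agrees with f on the sample and has Lipschitz constant at most L
    on the whole space."""
    def f_star(x):
        return min(fx + L * rho(x, xi) for xi, fx in sample)
    return f_star
```

With labels ±1 as the sample values and L equal to 2 divided by the distance between the two label classes, sgn(f*) is a consistent classifier of exactly the kind described above.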
Next, they establish error bounds in the form of expected surrogate loss. Finally, the computational problem of evaluating the classification function is reduced, assuming zero training error, to exact nearest neighbor search. This matches a popular classification heuristic, and in retrospect provides a rigorous explanation for this heuristic's empirical success in general metric spaces, extending the seminal analysis of Cover and Hart [12] for the Euclidean case.

The work of [34] has left open some algorithmic questions. In particular, allowing nonzero training error is apt to significantly reduce the Lipschitz constant, thereby producing classifiers that have lower complexity and are less likely to overfit. This introduces the algorithmic challenge of constructing a Lipschitz classifier that minimizes the 0-1 training error. In addition, exact nearest neighbor search in general metrics has time complexity proportional to the size of the dataset, rendering the technique impractical when the training sample is large. Finally, bounds on the expected surrogate loss may significantly overestimate the generalization error, which is the true quantity of interest.

Our contribution   We solve the problems delineated above by showing that data residing in a metric space of low doubling dimension admits accurate and computationally efficient classification. This is the first result that ties the doubling dimension of the data to either classification error or algorithmic runtime.¹ Specifically, we (i) prove generalization bounds for the classification (0-1) error as opposed to surrogate loss, (ii) construct and evaluate the classifier in a computationally efficient manner, and (iii) perform efficient structural risk minimization by optimizing the tradeoff between the classifier's smoothness and its training error.
Our generalization bound for Lipschitz classifiers controls the expected classification error directly (rather than expected surrogate loss), and may be significantly sharper than the latter in many common scenarios. We provide this bound in Section 3, using an elementary analysis of the fat-shattering dimension. In hindsight, our approach offers a new perspective on the nearest neighbor classifier, with significantly tighter risk asymptotics than the classic analysis of Cover and Hart [12].

We further give efficient algorithms to implement the Lipschitz classifier, both for the training and the evaluation stages. In Section 4 we prove that once a Lipschitz classifier has been chosen, the hypothesis can be evaluated quickly on any new point x ∈ X using approximate nearest neighbor search, which is known to be fast when points have a low doubling dimension. In Section 5 we further show how to quickly compute a near-optimal classifier (in terms of classification error bound), even when the training error is nonzero. In particular, this necessitates the optimization of the number of incorrectly labeled examples (and moreover, of their identity) as part of the structural risk minimization.

Finally, we give in Section 6 two exemplary setups. In the first, the data is represented using the earthmover metric over the plane. In the second, the data is a set of time series vectors equipped with a popular distance function. We provide basic theoretical and experimental analysis, which illustrates the potential power of our approach.

2 Definitions and notation

Notation   We will use standard O(·), Ω(·) notation for orders of magnitude. If f = O(g) and g = O(f), we write f = Θ(g). Whenever f = O(n polylog n), we denote this by f = Õ(n). For a natural number n ∈ N, [n] denotes the set {1, . . . , n}.
Metric spaces   A metric ρ on a set X is a positive symmetric function satisfying the triangle inequality ρ(x, y) ≤ ρ(x, z) + ρ(z, y); together the two comprise the metric space (X, ρ). The diameter of a set A ⊆ X is defined by diam(A) = sup_{x,y∈A} ρ(x, y), and the distance between two sets A, B ⊂ X is defined by ρ(A, B) = inf_{x∈A, y∈B} ρ(x, y). The Lipschitz constant of a function f : X → R, denoted ‖f‖_Lip, is defined to be the smallest L > 0 that satisfies |f(x) − f(y)| ≤ L ρ(x, y) for all x, y ∈ X.

¹ Previously, the doubling dimension of the space of classifiers was used in [8], but this is less relevant to our discussion.

Doubling dimension   For a metric space (X, ρ), let λ be the smallest value such that every ball in X can be covered by λ balls of half the radius. Then λ is the doubling constant of X, and the doubling dimension of X is ddim(X) = log₂ λ. A metric is doubling when its doubling dimension is bounded. Note that while a low Euclidean dimension implies a low doubling dimension (Euclidean metrics of dimension d have doubling dimension Θ(d) [21]), low doubling dimension is strictly more general than low Euclidean dimension. The following packing property can be demonstrated via repeated applications of the doubling property (see, for example, [25]):

Lemma 1. Let X be a metric space, and suppose that S ⊂ X is finite and has minimum interpoint distance at least α > 0. Then the cardinality of S is

    |S| ≤ (2 diam(S)/α)^{ddim(X)}.

Nets   Let (X, ρ) be a metric space and suppose S ⊂ X. An ε-net of S is a subset T ⊂ S with the following properties: (i) packing: all distinct u, v ∈ T satisfy ρ(u, v) ≥ ε, which means that T is ε-separated; and (ii) covering: every point u ∈ S is strictly within distance ε of some point v ∈ T, namely ρ(u, v) < ε.
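An ε-net of a finite set can be built greedily in one pass; the sketch below is our own illustration (quadratic time, not the fast net constructions cited later), and it also lets one check the packing bound of Lemma 1 numerically:

```python
def epsilon_net(points, eps, rho):
    """Greedy eps-net: scan the points, keeping any point at distance >= eps
    from every point kept so far. The result is eps-separated (packing), and
    every input point lies strictly within eps of some net point (covering),
    since a skipped point was within eps of an earlier net point."""
    net = []
    for p in points:
        if all(rho(p, q) >= eps for q in net):
            net.append(p)
    return net
```

On ten unit-spaced points of the real line (doubling dimension 1), the net {0, 2, 4, 6, 8} for ε = 2 indeed has size at most 2 · diam/α as Lemma 1 predicts.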
Learning   Our setting in this paper is the agnostic PAC learning model [27]. Examples are drawn independently from X × {−1, 1} according to some unknown probability distribution P, and the learner, having observed n such pairs (x, y), produces a hypothesis h : X → {−1, 1}. The generalization error is the probability of misclassifying a new point drawn from P:

    P{(x, y) : h(x) ≠ y}.

The quantity above is random, since it depends on the n observations, and we wish to upper-bound it in probability. Most bounds of this sort contain a training error term, which is the fraction of observed examples misclassified by h, roughly corresponding to bias in Statistics, as well as a hypothesis complexity term, which measures the richness of the class of all admissible hypotheses [35], roughly corresponding to variance in Statistics. Optimizing the tradeoff between these two terms is known as Structural Risk Minimization (SRM).² Keeping in line with the literature, we ignore the measure-theoretic technicalities associated with taking suprema over uncountable function classes.

3 Generalization bounds

In this section, we derive generalization bounds for Lipschitz classifiers over doubling spaces. As noted by [34], Lipschitz functions are the natural object to consider in an optimization/regularization framework. The basic intuition behind our proofs is that the Lipschitz constant plays the role of the inverse margin in the confidence of the classifier. As in [34], a small Lipschitz constant corresponds to a large margin, which in turn yields low hypothesis complexity and variance. However, in contrast to

² Robert Schapire pointed out to us that these terms from Statistics are not entirely accurate in the machine learning setting.
In particular, the classifier complexity term does not correspond to the variance of the classifier in any quantitatively precise way. However, the intuition underlying SRM corresponds precisely to the one behind the bias-variance tradeoff in Statistics, and so we shall occasionally use the latter term as well.

[34] (whose generalization bounds rely on Rademacher averages), we use the doubling property of the metric space directly to control the fat-shattering dimension. We apply tools from generalized Vapnik-Chervonenkis theory to the case of Lipschitz classifiers. Let F be a collection of functions f : X → R, and recall the definition of the fat-shattering dimension [1, 4]: a set X ⊂ X is said to be γ-shattered by F if there exists some function r : X → R such that for each label assignment y ∈ {−1, 1}^X there is an f ∈ F satisfying y(x)(f(x) − r(x)) ≥ γ > 0 for all x ∈ X. The γ-fat-shattering dimension of F, denoted fat_γ(F), is the cardinality of the largest set γ-shattered by F.

For the case of Lipschitz functions, we will show that the notion of fat-shattering dimension may be somewhat simplified. We say that a set X ⊂ X is γ-shattered at zero by a collection of functions F if for each y ∈ {−1, 1}^X there is an f ∈ F satisfying y(x) f(x) ≥ γ for all x ∈ X. (This is the definition above with r ≡ 0.) We write fat⁰_γ(F) to denote the cardinality of the largest set γ-shattered at zero by F, and show that for Lipschitz function classes the two notions coincide.

Lemma 2. Let F be the collection of all f : X → R with ‖f‖_Lip ≤ L. Then fat_γ(F) = fat⁰_γ(F).

Proof. We begin by recalling the classic Lipschitz extension result, essentially due to [26] and [36]: any real-valued function f defined on a subset X of a metric space X has an extension f* to all of X satisfying ‖f*‖_Lip = ‖f‖_Lip.
Thus, in what follows we will assume that any function f defined on X ⊂ X is also defined on all of X via some Lipschitz extension (in particular, to bound ‖f‖_Lip it suffices to bound the restricted ‖f|_X‖_Lip). Consider some finite X ⊂ X. If X is γ-shattered at zero by F then by definition it is also γ-shattered. Now assume that X is γ-shattered by F. Thus, there is some function r : X → R such that for each y ∈ {−1, 1}^X there is an f = f_{r,y} ∈ F such that f_{r,y}(x) ≥ r(x) + γ if y(x) = +1, and f_{r,y}(x) ≤ r(x) − γ if y(x) = −1. Let us define the function f̃_y on X (and, as per above, on all of X) by f̃_y(x) = γ y(x). It is clear that the collection {f̃_y : y ∈ {−1, 1}^X} γ-fat-shatters X at zero; it only remains to verify that f̃_y ∈ F, i.e.,

    sup_{y∈{−1,1}^X} ‖f̃_y‖_Lip ≤ sup_{y∈{−1,1}^X} ‖f_{r,y}‖_Lip.

Indeed,

    sup_{y∈{−1,1}^X; x,x′∈X} (f_{r,y}(x) − f_{r,y}(x′)) / ρ(x, x′) ≥ sup_{x,x′∈X} (r(x) − r(x′) + 2γ) / ρ(x, x′) ≥ sup_{x,x′∈X} 2γ / ρ(x, x′) = sup_{y∈{−1,1}^X} ‖f̃_y‖_Lip.

A consequence of Lemma 2 is that in considering the generalization properties of Lipschitz functions we need only bound the γ-fat-shattering dimension at zero. The latter is achieved by observing that the packing number of a metric space controls the fat-shattering dimension of Lipschitz functions defined over the metric space:

Theorem 3. Let (X, ρ) be a metric space. Fix some L > 0, and let F be the collection of all f : X → R with ‖f‖_Lip ≤ L. Then for all γ > 0,

    fat_γ(F) = fat⁰_γ(F) ≤ M(X, ρ, 2γ/L),

where M(X, ρ, ε) is the ε-packing number of X, defined as the cardinality of the largest ε-separated subset of X.

Proof. Suppose that S ⊆ X is γ-shattered at zero.
The case |S| = 1 is trivial, so we assume the existence of x ≠ x′ ∈ S and f ∈ F such that f(x) ≥ γ > −γ ≥ f(x′). The Lipschitz property then implies that ρ(x, x′) ≥ 2γ/L, and the claim follows.

Corollary 4. Let metric space X have doubling dimension ddim(X), and let F be the collection of real-valued functions over X with Lipschitz constant at most L. Then for all γ > 0,

    fat_γ(F) ≤ (L diam(X)/γ)^{ddim(X)}.

Proof. The claim follows immediately from Theorem 3 and the packing property of doubling spaces (Lemma 1).

Equipped with these estimates for the fat-shattering dimension of Lipschitz classifiers, we can invoke a standard generalization bound stated in terms of this quantity. For the remainder of this section, we take γ = 1 and say that a function f classifies an example (x_i, y_i) correctly if

    y_i f(x_i) ≥ 1.    (1)

The following generalization bounds appear in [4].

Theorem 5. Let F be a collection of real-valued functions over some set X, define D = fat_{1/16}(F), and let P be some probability distribution on X × {−1, 1}. Suppose that (x_i, y_i), i = 1, . . . , n are drawn from X × {−1, 1} independently according to P and that some f ∈ F classifies the n examples correctly, in the sense of (1). Then with probability at least 1 − δ,

    P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n) (D log₂(34en/D) log₂(578n) + log₂(4/δ)).

Furthermore, if f ∈ F is correct on all but k examples, we have with probability at least 1 − δ,

    P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + √((2/n) (D ln(34en/D) log₂(578n) + ln(4/δ))).

Applying Corollary 4, we obtain the following consequence of Theorem 5.

Corollary 6. Let metric space X have doubling dimension ddim(X), and let F be the collection of real-valued functions over X with Lipschitz constant at most L.
Then for any f ∈ F that classifies a sample of size n correctly, we have with probability at least 1 − δ,

    P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n) (D log₂(34en/D) log₂(578n) + log₂(4/δ)).    (2)

Likewise, if f is correct on all but k examples, we have with probability at least 1 − δ,

    P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + √((2/n) (D ln(34en/D) log₂(578n) + ln(4/δ))).    (3)

In both cases, D = fat_{1/16}(F) ≤ (16 L diam(X))^{ddim(X)}.

3.1 Comparison with previous generalization bounds

Our generalization bounds are not directly comparable to those of von Luxburg and Bousquet [34]. In general, two approaches exist to analyze binary classification by continuous-valued functions: thresholding by the sign function, or bounding some expected surrogate loss function. They opt for the latter approach, defining the surrogate loss function

    ℓ(f(x), y) = min(max(0, 1 − y f(x)), 1)

and bounding the risk E[ℓ(f(x), y)]. We take the former approach, bounding the generalization error P(sgn(f(x)) ≠ y) directly. Although for {−1, 1}-valued labels the risk upper-bounds the generalization error, it could potentially be a crude overestimate. von Luxburg and Bousquet [34] demonstrated that the Rademacher average of Lipschitz functions over the p-dimensional unit cube (p ≥ 3) is of order Θ(n^{−1/p}), and since the proof uses only covering numbers, a similar bound holds for all metric spaces with bounded diameter and doubling dimension. In conjunction with Theorem 5(b) of [5], this observation yields the following bound.

Lemma 7. Let X be a metric space with diam(X) ≤ 1, and let F be the collection of all f : X → R with ‖f‖_Lip ≤ 1.
If (x_i, y_i) ∈ X × {−1, 1} are drawn iid with respect to some probability distribution P, then with probability at least 1 − δ every f ∈ F satisfies

    P{(x, y) : f(x) ≠ y} ≤ O(k_f/n + n^{−1/ddim(X)} + √(ln(1/δ)/n)),

where k_f is the number of examples f labels incorrectly.

Our results compare favorably to those of [34] when we assume fixed diameter diam(X) ≤ 1 and Lipschitz constant L ≤ 1, and the number of observations n goes to infinity. Indeed, Lemma 7 bounds the excess error decay by O(n^{−1/ddim(X)}), whereas Corollary 6 gives a rate of Õ(n^{−1/2}).

3.2 Comparison with previous nearest-neighbor bounds

Corollary 6 also allows us to significantly sharpen the asymptotic analysis of [12] for the nearest-neighbor classifier. Following the presentation in [33], with an appropriate generalization to general metric spaces, the analysis of [12] implies that the 1-nearest-neighbor classifier h_NN achieves

    E[err(h_NN)] ≤ 2 err(h*) + O(‖η‖_Lip diam(X) / n^{1/(ddim(X)+1)}),    (4)

where η(x) = P(Y = 1 | X = x) is the conditional probability of the 1 label, and h*(x) = sgn(η(x) − 1/2) is the Bayes optimal classifier. The curse of dimensionality exhibited in the term n^{1/(ddim(X)+1)} is real: for each n, there exists a distribution such that for sample size n ≪ (L + 1)^{ddim(X)}, we have E[err(h_NN)] ≥ Ω(1). However, Corollary 6 shows that this analysis is overly pessimistic. Comparing (4) with (2) in the case where err(h*) = 0, we see that once the sample size passes a critical number on the order of (L diam(X))^{ddim(X)}, the expected generalization error begins to decay as Õ(1/n), which is much faster than the rate suggested by (4).

4 Lipschitz extension classifier

Given n labeled points (x_1, y_1), . . .
, (x_n, y_n) ∈ X × {−1, 1}, we construct our classifier in a similar manner to [34], via a Lipschitz extension of the label values y_i to all of X. Let S⁺, S⁻ ⊂ {x_1, . . . , x_n} be the sets of positively and negatively labeled points. Our starting point is the same extension function used in [34]; namely, for α ∈ [0, 1] define f_α : X → R by

    f_α(x) = α min_{i∈[n]} ( y_i + 2 ρ(x, x_i)/ρ(S⁺, S⁻) ) + (1 − α) max_{j∈[n]} ( y_j − 2 ρ(x, x_j)/ρ(S⁺, S⁻) ).    (5)

It is easy to verify (see also [34, Lemmas 7 and 12]) that f_α(x_i) agrees with the sample label y_i for all i ∈ [n], and that its Lipschitz constant is identical to the one induced by the labeled points, which in turn is obviously 2/ρ(S⁺, S⁻). However, computing the exact value of f_α(x) for a point x ∈ X (or even the sign of f_α(x) at this point) requires an exact nearest neighbor search, and in an arbitrary metric space nearest neighbor search requires Ω(n) time.

In this section, we design a classifier that is evaluated at a point x ∈ X using an approximate nearest neighbor search.³ It is known how to build a data structure for a set of n points in time 2^{O(ddim(X))} n log n, so as to support (1+ε)-approximate nearest neighbor searches in time 2^{O(ddim(X))} log n + ε^{−O(ddim(X))} [10, 22] (see also [25, 7]). Our classifier below relies only on a given subset of the given n points, which may eventually lead to improved generalization bounds (i.e., it provides a tradeoff between k and L in Theorem 5).

Theorem 8. Let (X, ρ) be a metric space, and fix 0 < ε < 1/32. Let S be a sample consisting of n labeled points (x_1, y_1), . . . , (x_n, y_n) ∈ X × {−1, 1}.
Fix a subset S₁ ⊂ S of cardinality n − k, on which the constructed classifier must agree with the given labels, and partition it into S₁⁺, S₁⁻ ⊂ S₁ according to the labels, letting L = 2/ρ(S₁⁺, S₁⁻). Then there is a binary classification function h : X → {−1, 1} satisfying:

(a) h(x) can be evaluated at each x ∈ X in time 2^{O(ddim(X))} log n + ε^{−O(ddim(X))}, after an initial computation of (2^{O(ddim(X))} log n + ε^{−O(ddim(X))}) n time.

(b) With probability at least 1 − δ (over the sampling of S),

    P{(x, y) : h(x) ≠ y} ≤ k/n + √((2/n) (D ln(34en/D) log₂(578n) + ln(4/δ))),

where D = (16 L diam(X)/(1 − 32ε))^{ddim(X)}.

We will use the following simple lemma.

Lemma 9. For any function class F mapping X to R, define its ε-perturbation F_ε to be

    F_ε = { f̃ ∈ R^X : ‖f − f̃‖_∞ ≤ ε, f ∈ F },

where ‖f − f̃‖_∞ = sup_{x∈X} |f(x) − f̃(x)|. Then for 0 < ε < γ,

    fat_γ(F_ε) ≤ fat_{γ−ε}(F).

Proof. Suppose that F_ε is able to γ-shatter the finite subset X ⊂ X. Then there is an r ∈ R^X so that for all y ∈ {−1, 1}^X, there is an f̃_y ∈ F_ε such that

    y(x)(f̃_y(x) − r(x)) ≥ γ,  ∀x ∈ X.    (6)

Now by definition, for each f̃_y ∈ F_ε there is some f_y ∈ F such that sup_{x∈X} |f_y(x) − f̃_y(x)| ≤ ε. We claim that the collection {f_y : y ∈ {−1, 1}^X} is able to (γ − ε)-shatter X. Indeed, replacing f̃_y(x) with f_y(x) in (6) perturbs the left-hand side by an additive term of at most ε.

Proof of Theorem 8. Without loss of generality, assume S₁ ⊂ S corresponds to the points indexed by i = 1, . . . , n − k. We begin by observing that since all of the sample labels have values in {±1}, any Lipschitz extension may be truncated to the range [−1, 1].
Formally, if g is a Lipschitz extension of the labels y_i from the sample S to all of X, then so is T_{[−1,1]} ∘ g, where T_{[a,b]}(z) = max{a, min{b, z}} is the truncation operator. In particular, take g to be as in (5) with α = 1, and write

    r_i(x) = 2 ρ(x, x_i)/ρ(S₁⁺, S₁⁻).

³ If x* is the nearest neighbor of a test point x, then any point x̃ satisfying ρ(x, x̃) ≤ (1 + ε) ρ(x, x*) is called a (1+ε)-approximate nearest neighbor of x.

Now defining

    f(x) = T_{[−1,1]}( min_{i∈[n−k]} { y_i + r_i(x) } ) = min_{i∈[n−k]} { T_{[−1,1]}(y_i + r_i(x)) },    (7)

where the second equality is by monotonicity of the truncation operator, we conclude that f is a Lipschitz extension of the data, with the same Lipschitz constant L = 2/ρ(S₁⁺, S₁⁻).

Now precompute⁴ in time 2^{O(ddim(X))} n log n a data structure that supports (1+ε)-approximate nearest neighbor searches on the point set S₁⁺, and a similar one for the point set S₁⁻. Now compute (still during the learning phase) an estimate ρ̃(S₁⁺, S₁⁻) for ρ(S₁⁺, S₁⁻), by searching the second data structure for each of the points in S₁⁺, and taking the minimum of all the resulting distances. This estimate satisfies

    1 ≤ ρ̃(S₁⁺, S₁⁻)/ρ(S₁⁺, S₁⁻) ≤ 1 + ε,    (8)

and this entire precomputation process takes (2^{O(ddim(X))} log n + ε^{−O(ddim(X))}) n time. Given a test point x ∈ X to be classified, search for x in the two data structures (for S₁⁺ and for S₁⁻), and denote the indices of the points answered by them by a⁺, a⁻ ∈ [n − k], respectively. The (1+ε)-approximation guarantee means that

    1 ≤ ρ(x, x_{a⁺})/ρ(x, S₁⁺) ≤ 1 + ε  and  1 ≤ ρ(x, x_{a⁻})/ρ(x, S₁⁻) ≤ 1 + ε.
Define, as a computationally efficient estimate of f, the function

    f̃(x) = min_{a∈{a⁺,a⁻}} T_{[−1,1]}( y_a + 2 ρ(x, x_a)/ρ̃(S₁⁺, S₁⁻) ),

and let our classifier be h(x) = sgn(f̃(x)). We remark that the case a = a⁻ always attains the minimum in the definition of f̃ (because a = a⁺ only produces values greater than or equal to y_{a⁺} = 1), and therefore one can avoid the computation of a⁺, and even the construction of a data structure for S₁⁺. In fact, the same argument shows that also in the definition of f in (7) we can omit from the minimization points with label y_i = +1.

This classifier h = sgn(f̃) can be evaluated on a new point x ∈ X in time 2^{O(ddim(X))} log n + ε^{−O(ddim(X))}, and it thus remains to bound the generalization error of h. To this end, we will show that

    sup_{x∈X} |f(x) − f̃(x)| ≤ 2ε.    (9)

This means that f̃ is a 2ε-perturbation of f, as stipulated by Lemma 9, and the generalization error of h will follow from Corollary 6 using the fact that f has Lipschitz constant L.

To prove (9), fix an x ∈ X. Let i* ∈ [n − k] be an index attaining the minimum in the definition of f(x) in (7), and similarly a* ∈ [n − k] for f̃(x). Using the remark above, we may assume that their labels are y_{i*} = y_{a*} = −1. Moreover, by inspecting the definition of f we may further assume that i* attains the minimum of r_i(x) (over all points labeled −1), and thus also of its numerator ρ(x, x_i). And since index a* was chosen as an approximate nearest neighbor (among all points labeled −1), we get 1 ≤ ρ(x, x_{a*})/ρ(x, x_{i*}) ≤ 1 + ε. Together with (8), we have

    r_{i*}(x)/(1 + ε) ≤ 2 ρ(x, x_{a*})/ρ̃(S₁⁺, S₁⁻) ≤ (1 + ε) r_{i*}(x).    (10)

⁴ The word precompute underscores the fact that this computation is done during the "offline" learning phase.
Its result is then used to achieve fast "online" evaluation of the classifier on any point x ∈ X during the testing phase.

We now need the following simple claim:

    0 ≤ B ≤ (1 + ε) C  ⟹  T_{[−1,1]}(−1 + B) ≤ 2ε + T_{[−1,1]}(−1 + C).

To verify the claim, assume first that C ≤ 2; then B ≤ C + εC ≤ C + 2ε, and now use the fact that adding −1 and truncating are both monotone operations to get T_{[−1,1]}(−1 + B) ≤ T_{[−1,1]}(−1 + C + 2ε), and the right-hand side is clearly at most T_{[−1,1]}(−1 + C) + 2ε. Assume next that C ≥ 2; then obviously T_{[−1,1]}(−1 + B) ≤ 1 = T_{[−1,1]}(−1 + C). The claim follows.

Applying this simple claim twice, once for each inequality in (10), we obtain that

    −2ε ≤ T_{[−1,1]}(y_{i*} + r_{i*}(x)) − T_{[−1,1]}( y_{a*} + 2 ρ(x, x_{a*})/ρ̃(S₁⁺, S₁⁻) ) ≤ 2ε,

which proves (9) and completes the proof of the theorem.

5 Structural Risk Minimization

In this section, we show how to efficiently construct a classifier that optimizes the "bias-variance tradeoff" implicit in Corollary 6, equation (3). Let X be a metric space, and assume we are given a labeled sample S = (x_i, y_i) ∈ X × {−1, 1}. For any Lipschitz constant L, let k(L) be the minimal training error of S over all classifiers with Lipschitz constant L. We rewrite the generalization bound as follows:

    P{(x, y) : sgn(f(x)) ≠ y} ≤ k(L)/n + √((2/n) (D ln(34en/D) log₂(578n) + ln(4/δ))) =: G(L),    (11)

where D = (16 L diam(X))^{ddim(X)}. This bound contains a free parameter, L, which may be tuned in the course of structural risk minimization. More precisely, decreasing L drives the "bias" term (number of mistakes) up and the "variance" term (fat-shattering dimension) down. We thus seek an (optimal) value of L where G(L) achieves its minimum value, as described in the following theorem, which is our SRM result.
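For concreteness, the objective G(L) of (11) is straightforward to evaluate once k(L) is known; a sketch (our own illustration, assuming n, diam(X), ddim(X), and δ are given, and writing the complexity term exactly as in (11)):

```python
from math import e, log, log2, sqrt

def srm_objective(L, k_L, n, ddim, diam, delta):
    """G(L) from equation (11): the training-error term k(L)/n plus the
    complexity term, with D = (16 * L * diam(X)) ** ddim(X).
    Here log is the natural logarithm ln."""
    D = (16.0 * L * diam) ** ddim
    complexity = sqrt((2.0 / n) * (D * log(34.0 * e * n / D) * log2(578.0 * n)
                                   + log(4.0 / delta)))
    return k_L / n + complexity
```

Increasing L can only decrease k(L) but inflates D, and hence the complexity term; the algorithm described next searches over candidate values of L for the minimizer of this tradeoff.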
Theorem 10. Let X be a metric space and 0 < ε < 1/32. Given a labeled sample S = (x_i, y_i) ∈ X × {−1, 1}, i = 1, . . . , n, there exists a binary classification function h : X → {−1, 1} satisfying the following properties:

(a) h(x) can be evaluated at each x ∈ X in time 2^{O(ddim(X))} log n + ε^{−O(ddim(X))}, after an initial computation of 2^{O(ddim(X))} n log n + (ddim(X)/ε)^{O(ddim(X))} n time.

(b) The generalization error of h is bounded by

    P{(x, y) : sgn(f(x)) ≠ y} ≤ c · inf_{L>0} [ k(L)/n + √((2/n) (D ln(34en/D) log₂(578n) + ln(4/δ))) ],

for some constant c ≤ 2(1 + ε), and where D = D(L) = (16 L diam(X)/(1 − 32ε))^{ddim(X)}.

We proceed with a description of our algorithm. We first give an algorithm with runtime O(n^{4.373}), and then improve the runtime, first to O((ddim(X)/ε) n^{2.373} log n), then to O(ddim(X) n² log n), and finally to 2^{O(ddim(X))} n log n + (ddim(X)/ε)^{O(ddim(X))} n.

Algorithm description   We start by giving a randomized algorithm that finds a value L* that is optimal, namely, G(L*) = inf_{L>0} G(L) for G(L) as defined in (11). The runtime of this algorithm is O(n^{4.373}) with high probability. First note the behavior of k(L) as L increases: k(L) decreases only when the value of L crosses certain critical values, each determined by a pair x_i ∈ S⁺, x_j ∈ S⁻ (that is, L = 2/ρ(x_i, x_j)); for such L, the classification function h can correctly classify both these points. There are O(n²) critical values of L, and these can be determined by enumerating all interpoint distances between the subsets S⁺, S⁻ ⊂ S. Below, we will show that for any given L, the value k(L) can be computed in randomized time O(n^{2.373}).
More precisely, we will show how to compute in this time a partition of S into sets S_1 (with Lipschitz constant L) and S_0 (of size k(L)). Given the sets S_0, S_1 ⊂ S, we can construct the classifier of Corollary 6. Since there are O(n^2) critical values of L, we can calculate k(L) for all critical values in O(n^{4.373}) total time, and thereby determine L*. Then, by Corollary 6, we may compute a classifier with a bias-variance tradeoff arbitrarily close to optimal.

To compute k(L) for a given L in randomized time O(n^{2.373}), consider the following algorithm. Construct a bipartite graph G = (V^+, V^−, E). The vertex sets V^+, V^− correspond to the labeled sets S^+, S^− ⊂ S, respectively. The length of an edge e = (u, v) connecting vertices u ∈ V^+ and v ∈ V^− is equal to the distance between the corresponding points, and E includes all edges of length less than 2/L. (This E can be computed in O(n^2) time.) Now, for every edge e ∈ E, a classifier with Lipschitz constant L necessarily misclassifies at least one endpoint of e. Hence, finding a classifier with Lipschitz constant L that misclassifies a minimum number of points of S is exactly the problem of finding a minimum vertex cover of the bipartite graph G. (This is an unweighted graph: the lengths are used only to determine E.) By König's theorem, the minimum vertex cover problem in bipartite graphs is equivalent to the maximum matching problem, and a maximum matching in bipartite graphs can be computed in randomized time O(n^{2.373}) [28, 37]. This maximum matching immediately identifies a minimum vertex cover, which in turn gives the subsets S_0, S_1, allowing us to compute a classifier achieving nearly optimal SRM.

First improvement. The runtime given above can be reduced from randomized O(n^{4.373}) to randomized O(ddim(X)/ε · n^{2.373} log n), if we are willing to settle for a generalization bound G(L) within a (1+ε) factor of the optimal G(L*), for any ε ∈ (0, 1). To achieve this improvement, we discretize the candidate values of L, and evaluate k(L) only for O(ddim(X)/ε · log n) values of L, rather than for all Θ(n^2) values as above. In the extreme case where the optimal hypothesis fails on all points of a single label, the classifier h is a constant function and L* = 0. In all other cases, L* must take values in the range [2/diam(X), n/diam(X)]; indeed, every hypothesis correctly classifying a pair of oppositely labeled points has Lipschitz constant at least 2/diam(X), and if L* > n/diam(X) then the complexity term (and hence G(L*)) is greater than 1. Our algorithm evaluates k(L) for the values L = (2/diam(X)) · (1 + ε/ddim(X))^i for i = 0, 1, …, ⌈log_{1+ε/ddim(X)}(n/2)⌉, and uses the candidate that minimizes G(L). The number of candidate values of L is O(ddim(X)/ε · log n), and one of these values (call it L′) satisfies L* ≤ L′ < (1 + ε/ddim(X)) L*. Observe that k(L′) ≤ k(L*), and that the complexity term for L′ is greater than that for L* by at most a factor of

sqrt( (1 + ε/ddim(X))^{ddim(X)} ) ≤ e^{ε/2} ≤ 1 + ε

(where the final inequality holds since ε < 1). It follows that G(L′) < (1+ε) G(L*), implying that this algorithm achieves a (1+ε)-approximation to G(L*).

Second improvement. The runtime can be further reduced from randomized O(ddim(X)/ε · n^{2.373} log n) to deterministic O(ddim(X) n^2 log n), if we are willing to settle for a generalization bound G(L) within a constant factor of 2 of the optimal G(L*). The improvement comes from a faster vertex-cover computation.
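A 2-approximate vertex cover can be obtained from any maximal matching: scan the edges and add both endpoints of every edge whose endpoints are still uncovered; the resulting cover is at most twice the optimum. A sketch (edges given as pairs of hashable vertex identifiers):

```python
def greedy_vertex_cover(edges):
    """2-approximate vertex cover via a maximal matching: whenever an
    edge has both endpoints uncovered, add both to the cover.
    Runs in time linear in the number of edges."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.add(u)
            cover.add(v)
    return cover
```

Every edge scanned is either already covered or contributes both endpoints, so the chosen endpoints form a matching; any vertex cover must contain at least one endpoint of each matched edge, which gives the factor of 2.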
It is well known that a 2-approximation to vertex cover can be computed (in arbitrary graphs) by a greedy algorithm in time linear in the graph size, O(|V^+ ∪ V^−| + |E|) = O(n^2); see e.g. [3]. Hence, we can compute in O(n^2) time a function k′(L) that satisfies k(L) ≤ k′(L) ≤ 2k(L). We replace the randomized O(n^{2.373}) algorithm with this O(n^2)-time greedy algorithm. Then k′(L) ≤ 2k(L), and because we can approximate the complexity term to within a factor smaller than 2 (as above, by choosing a constant ε < 1), our resulting algorithm finds a Lipschitz constant L′ for which G(L′) ≤ 2 · G(L*).

Final improvement. We can further improve the runtime from O(ddim(X) n^2 log n) to 2^{O(ddim(X))} n log n + (ddim(X)/ε)^{O(ddim(X))} n, at the cost of increasing the approximation factor to 2(1+ε). The idea is to work with a sparser representation of the vertex cover problem. Recall that we discretized the values of L to powers of (1 + ε/ddim(X)). As was already observed by [25, 10] in the context of hierarchies for doubling metrics, X contains at most ε^{−1} 2^{O(ddim(X))} n of these distinct rounded critical values. After constructing a standard hierarchy (in time 2^{O(ddim(X))} n log n), these ordered values may be extracted with (ddim(X)/ε)^{O(ddim(X))} n more work.

Let L be a discretized value considered above. We extract from S a subset S′ ⊂ S that is an (ε/ddim(X) · 2/L)-net of S. Map each point p ∈ S to its closest net point p′ ∈ S′, and maintain for each net point two lists of points of S that are mapped to it: one list for positively labeled points and one for negatively labeled points.
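The net extraction just described can be sketched with the standard greedy construction (a quadratic-time illustration only; the paper instead uses hierarchies to achieve 2^{O(ddim(X))} n log n time, and `metric` is a hypothetical callable):

```python
def greedy_net(points, radius, metric):
    """Greedy r-net: every input point lies within `radius` of some net
    point (covering), and net points are pairwise more than `radius`
    apart (packing). Also returns the point-to-net-point map."""
    net = []
    for p in points:
        if all(metric(p, q) > radius for q in net):
            net.append(p)
    nearest = {p: min(net, key=lambda q: metric(p, q)) for p in points}
    return net, nearest
```

The covering property holds because a point is skipped only when some existing net point is already within `radius` of it, and the packing property holds by construction.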
We now create an instance of vertex cover for the points of S: an edge e = (u, v) for u ∈ V^+ and v ∈ V^− is added to the edge set E′ if the distance between the respective net points u′ and v′ is at most (1 − 2ε/ddim(X)) · 2/L. Notice that E′ ⊂ E, because the distance between such u, v is at most

(1 − 2ε/ddim(X)) · 2/L + 2 · (ε/ddim(X)) · 2/L = 2/L.

Moreover, the edge set E′ can be stored implicitly by recording every pair of net points that are within distance (1 − 2ε/ddim(X)) · 2/L: oppositely labeled point pairs that map (respectively) to such a net-point pair are considered (implicitly) to have an edge in E′. By the packing property, the number of net-point pairs to be recorded is at most (ddim(X)/ε)^{O(ddim(X))} n, and by employing a hierarchy, the entire (implicit) construction may be done in time 2^{O(ddim(X))} n log n + (ddim(X)/ε)^{O(ddim(X))} n.

Now, for a given L, the run of the greedy algorithm for vertex cover can be implemented on this graph in time (ddim(X)/ε)^{O(ddim(X))} n, as follows. The greedy algorithm considers a pair of net points within distance (1 − 2ε/ddim(X)) · 2/L. If there exist u ∈ V^+ and v ∈ V^− that map to these net points, then u, v are deleted from S and from the respective lists of the net points. (And similarly if u, v map to the same net point.) The algorithm terminates when there are no more points to remove, and correctness follows.

We now turn to the analysis. Since E′ ⊂ E, the guarantees of the earlier greedy algorithm still hold. The resulting point set may contain oppositely labeled points within distance

(1 − 2ε/ddim(X)) · 2/L − 2 · (ε/ddim(X)) · 2/L = (1 − 4ε/ddim(X)) · 2/L,

resulting in a Lipschitz constant of L / (1 − 4ε/ddim(X)).
This Lipschitz constant is slightly larger than the given L, which has the effect of increasing the complexity term in G(L) by a factor of (1 − 4ε/ddim(X))^{−ddim(X)/2} = 1 + Θ(ε). The final result is achieved by scaling down ε to remove the leading constant.

6 Example: Earthmover and time-series metrics

To illustrate the potential power of our approach, we analyze its potential for two well-known metrics: the earthmover distance, which operates on geometric sets, and Edit Distance with Real Penalty, which operates on time series. We use the earthmover distance again in Section 7 for our experiments.

Earthmover distance. We will analyze the doubling dimension of an earthmover metric, which is a natural metric for comparing two sets of k geometric features. It is often used in computer vision; for instance, an image can be represented as a set of pixels in a color space, yielding an accurate measure of dissimilarity between the color characteristics of the images [31]. In an analogous manner, an image can be represented as a set of representative geometric features, such as object contours [19], other features [20], and SIFT descriptors [30]. In these contexts, k ≥ 2 is usually a parameter which models the number of geometric features identified inside each image.

We use a simple yet common version, denoted (X_k, EMD), where each point in X_k is a multiset of size k in the unit square in the Euclidean plane; formally, S ⊂ [0,1]^2 and |S| = k (allowing and counting multiplicities). The distance between such sets S, T ∈ X_k is given by

EMD(S, T) = min_{π : S → T} { (1/k) sum_{s ∈ S} ‖s − π(s)‖_2 },   (12)

where the minimum is over all one-to-one mappings π : S → T. In other words, EMD(S, T) is the minimum-cost bipartite matching between the two sets S, T, where costs correspond to Euclidean distances.

Lemma 11.
The earthmover metric over X_k satisfies diam(X_k) ≤ √2 and ddim(X_k) ≤ O(k log k).

Proof. For the rest of this proof, a point refers to one in the unit square, not in X_k. Consider a ball in X_k of radius r > 0 around some S. Let N be an r/2-net of the unit square [0,1]^2, according to the definition in Section 2. Now consider all multisets T ⊂ [0,1]^2 of size k that satisfy the following condition: every point of T belongs to the net N and is within (Euclidean) distance (k + 1/2)r of at least one point of S. Points in such a multiset T are chosen from a collection of size at most k · ⌈(k + 1/2)r / (r/2)⌉^{O(1)} ≤ k^{O(1)} (by the packing property of the net points in the Euclidean plane). Thus, the number of such multisets T is λ ≤ (k^{O(1)})^k = k^{O(k)}.

To complete the proof of the lemma, it suffices to show that the radius-r ball (in X_k) around S is covered by the λ balls of radius r/2 whose centers are given by the above multisets T. To see this, consider a multiset S′ such that EMD(S, S′) ≤ r, and let us show that S′ is contained in an r/2-ball around one of the above multisets T. Observe that every point of S′ is within distance at most kr of at least one point of S. Now "map" each point of S′ to its nearest point in the net N, which must be less than r/2 away, by the covering property of the net. The result is a multiset T as above with EMD(S′, T) ≤ r/2.

Time-series distance metric. To present another example of the utility of our classification algorithms, we show that a commonly used metric model of sparse time-series vectors (with unbounded real coordinates) actually has a low doubling dimension.
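Before turning to time series, we note that the distance (12) can be evaluated directly from its definition as a minimum over bijections. The following brute-force sketch (exponential in k, so for illustration only; in practice one would use a polynomial-time assignment algorithm) makes the definition concrete:

```python
from itertools import permutations
from math import dist  # Euclidean distance between two points

def emd(S, T):
    """EMD(S, T) of equation (12): the minimum over all one-to-one
    mappings pi of the average Euclidean distance ||s - pi(s)||_2,
    found here by trying every bijection of the k-element multisets."""
    k = len(S)
    assert len(T) == k
    return min(sum(dist(s, t) for s, t in zip(S, perm)) / k
               for perm in permutations(T))
```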
A widely used similarity function for time series is Dynamic Time Warping (DTW) [39], a non-metric distance function between two time series which is similar to the ℓ_1 norm, except that it also allows coordinate deletions or insertions in order to align the two series. The latter operations are used to ensure that the resulting series are of equal length, and they can also correlate the respective peaks and troughs of the series. We will consider a simple and popular metric version of DTW known as Edit Distance with Real Penalty (ERP) [9], which allows for insertions of zero-valued elements only.

The ERP distance is formally defined as follows. Given time-series vectors r and s with unbounded real coordinates, where the length of the longer series is exactly m, we may insert into r and s any number of zero-valued coordinates (called gaps) to produce new series r̃ and s̃ of equal length. Let R_p be the set of all time series of length p ≥ m which may be derived from r via gap insertions, and similarly S_p for s. Then

d_ERP(r, s) = min_{p ≥ m, r̃ ∈ R_p, s̃ ∈ S_p} ‖r̃ − s̃‖_1.

The ERP distance can be computed in quadratic time [9].

Our contribution is twofold. We show in Lemma 13 that a set of time series of length at most m may have doubling dimension Ω(m) under ERP, even when the coordinate range is limited to {1, 2}. This dimension is quite high, and motivates us to consider sparse time series, which form an active field of study [15, 41, 18, 24, 38]. We show in Lemma 14 that the set of time-series vectors with only k non-zero elements (that is, k-sparse vectors) has doubling dimension O(k log k) under ERP, irrespective of the vector length m and even when the coordinates are real and unbounded. We first prove the claim below, and then proceed to the lower bound on the dimension of length-m vectors under ERP.

Claim 12.
Consider the set T = {1,2}^m and an integer d ∈ [4, m/2]. Then every vector r ∈ T is within ERP distance d of fewer than (3em/d)^{2d} other vectors of T.

Proof. We may view ERP on the vectors of T as a procedure transforming a vector r ∈ T into some vector s ∈ T as follows: the procedure inserts gaps into r to produce a vector r̃, uses substitutions to transform r̃ into s̃, and then deletes all gaps (i.e., zero-valued elements) from s̃ to produce the vector s. Here, the cost of a substitution from r̃_i to s̃_i is considered to be |r̃_i − s̃_i| ∈ {1, 2}, while the insertions and deletions entail no cost. Without loss of generality, we may assume that r̃_i and s̃_i are not both gaps. It follows that whenever r̃_i is produced by a gap insertion, or s̃_i is a gap coordinate to be deleted, there must be a substitution from r̃_i to s̃_i. Thus, if d_ERP(r, s) ≤ j for r, s ∈ T, then the ERP procedure includes at most j substitutions, and consequently at most j insertions and at most j deletions.

For a fixed r, the vector r̃ can be produced in one of at most

sum_{j=0}^{d} C(m+j, j) < d · C(3m/2, d) ≤ d (3em/(2d))^d

possible ways of inserting j ≤ d gap elements among the m coordinates of r. (Here we used the standard formula for combinations with replacement.) Having produced r̃, the vector s̃ can be produced in at most

sum_{j=0}^{d} C(m+d, j) · 2^d < d · C(3m/2, d) · 2^d ≤ d (3em/d)^d

possible ways via substitutions in j ≤ d elements, where a single substitution sets some s̃_i to one of two possible values different from r̃_i. Having produced s̃, the vector s is produced by simply removing all gaps from s̃. It follows that there are fewer than

d^2 (3em/(2d))^d (3em/d)^d ≤ (3em/d)^{2d}

vectors of T within distance d of r, as claimed.

Lemma 13. There exists a set S ⊂ {1,2}^m whose doubling dimension under the ERP metric is Ω(m).

Proof.
We will demonstrate that for all m ≥ 4 · 35 = 140, there exists a set S ⊂ {1,2}^m of cardinality 2^{m/2} with diameter at most m and minimum interpoint distance at least d = ⌊m/35⌋. As a consequence of Lemma 1, this S has doubling dimension Ω(m). Our proof uses a neighborhood counting argument similar to the one presented in [6, Lemma 8].

We begin with the set T = {1,2}^m of cardinality 2^m. The maximum interpoint distance under ERP in T is at most m, as this is the maximum distance under ℓ_1 in T. By Claim 12 with d = ⌊m/35⌋, each vector r ∈ T is within distance d of fewer than

(3em/d)^{2d} ≤ ( 3em/(m/35 − 1) )^{2m/35} = ( 3 · 35e · 1/(1 − 35/m) )^{2m/35} ≤ (4 · 35e)^{2m/35} < 2^{m/2}

other points of T. We now use T to construct S greedily, starting with the empty set and repeatedly placing into S a point of T at distance at least d = ⌊m/35⌋ from all points currently in S. Each point added to S invalidates fewer than 2^{m/2} other points of T from appearing in S in the future: these are the points within distance d of the new point. It follows that S contains at least 2^m / 2^{m/2} = 2^{m/2} points, as claimed.

Figure 1: The raw flower-contour data, 512 × 512 pixel black-and-white images. The two kinds of flowers (five- and six-petaled) are displayed in separate rows.

Although the doubling dimension of time-series vectors under ERP is large, the situation for sparse vectors is much better, irrespective of the vector length and even when the coordinates are unbounded reals.

Lemma 14. Every set S of k-sparse time-series vectors with real unbounded coordinates and arbitrary length has doubling dimension O(k log k) under the ERP metric.

Proof.
Similarly to [17], let the density constant µ(S) of S be the smallest number such that for every p > 0, every ball of radius p contains at most µ(S) points of S at mutual distance strictly greater than p/2. It is known that the doubling constant of S is at most the density constant, i.e., λ(S) ≤ µ(S). Indeed, for each radius-p ball in S, take a maximal set (with respect to containment) of points at mutual distance strictly greater than p/2, and let each point be the center of a ball of radius p/2; the maximality implies that the small balls cover all points of the larger one. We get that every ball in S can be covered by at most µ(S) balls of half the radius. It thus suffices to prove an upper bound on the density constant µ(S).

To this end, consider a subset T ⊂ S such that for some p > 0, all interpoint distances in T are in the range (4p, 8p]. In what follows, we prove an upper bound on |T|. First, for each vector r ∈ T, discretize the vector by rounding down each coordinate r_i to the nearest multiple of p/k, producing a new set T′. Done over all coordinates, the rounding alters interpoint ERP distances by less than 2k · p/k = 2p in total, and therefore all interpoint ERP distances in T′ are in the range (2p, 10p]. We further remove from each r ∈ T′ all zero-valued coordinates, which has no effect on interpoint ERP distances. The resulting set T′ has the same size as T, i.e., |T′| = |T|.

To bound |T′|, fix an arbitrary vector r ∈ T′, and consider the number of distinct discretized k-sparse vectors at ERP distance at most 10p from r. The ERP procedure may add up to k gaps to r to produce r̃, and there are sum_{j=0}^{k} C(k+j, j) ≤ sum_{j=0}^{k} C(2k, j) < 2^{2k} possible gap configurations. Note that the length of r̃ is at most 2k.
Moving to the substitutions: since all vectors of T′ are discretized into multiples of p/k and are at distance at most 10p, we can view the substitutions as adding to or removing from these coordinates weight in units (multiples) of p/k, and there are in total 10p / (p/k) = 10k such units. If we view each substitution as accounting for a unit of weight, and associate each coordinate with a sign that encodes whether the weight will be added to or subtracted from that coordinate, then there are at most (2k)^{10k} · 2^{2k} possible substitution configurations producing s̃. Having produced s̃, the vector s is obtained from it by removing all gap elements. Altogether,

2^{ddim(S)} ≤ µ(S) = |T| = |T′| ≤ 2^{2k} (2k)^{10k} 2^{2k} = k^{O(k)},

from which the lemma follows.

7 Experiments

We considered the task of distinguishing five-petaled flowers from six-petaled ones. The images were taken from a shape matching/retrieval database called the MPEG-7 Test Set (http://www.dabi.temple.edu/~shape/MPEG7/dataset.html; the five-petaled flowers are under device0-1 and the six-petaled ones under device1-1), and are displayed in Fig. 1. The original images were represented as 512 × 512 black-and-white matrices; we sampled these down to 128 × 128. To render the task nontrivial, we retained only the image contour, as otherwise it would suffice to consider the ratio of black to white pixels to achieve 100% accuracy. To illustrate the relative advantage of the earthmover distance over the Euclidean one, we translated each image in the plane by various random shifts.

    Method                        Error
    EMD nearest-neighbor          0.13
    Euclidean nearest-neighbor    0.39
    Euclidean SVM                 0.43
    SVM with RBF kernel           0.39

Table 1: Experiments classifying the flower images. Each method is averaged over hundreds of experiments.

We ran four classification algorithms on this data:

• Euclidean Nearest Neighbor.
The images were treated as vectors in R^{128×128}, endowed with the Euclidean metric ℓ_2.

• EMD Nearest Neighbor. The images were cut up into 16 × 16 = 256 square blocks, where each block b is viewed as a vector b ∈ R^{8×8}. Each image is thus represented as a sequence of 256 blocks, and over these sequences EMD is defined as in (12), except that we used ℓ_1 instead of ℓ_2 as the base distance.

• Euclidean SVM. The Support Vector Machine (SVM) algorithm [11] was used, operating on vectors in R^{128×128} with the Euclidean kernel ⟨x, y⟩ = x^T y, and with the regularization constant tuned by cross-validation.

• SVM with RBF kernel. The SVM algorithm was used, operating on vectors in R^{128×128} with the Radial Basis Function (RBF) kernel ⟨x, y⟩ = exp(−‖x − y‖_2^2 / σ^2), where the regularization constant and σ were tuned by cross-validation.

Our experimental results are listed in Table 1. The relative magnitudes carry more significance than the absolute values, as the latter fluctuate with experiment design choices, such as the magnitude of the image translations, the thickness of the contour retained, and so forth. These results exhibit a natural setting in which a classification algorithm for a non-Hilbertian metric significantly outperforms the Hilbert-space algorithms, which is the main point we wished to illustrate here.

Acknowledgements

We thank the editor and anonymous referees for helpful suggestions for the manuscript. Thanks also to Shai Ben-David and Shai Shalev-Shwartz for sharing their book draft, and to Ulrike von Luxburg for enlightening discussions.

References

[1] Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615–631, 1997.

[2] Alexandr Andoni and Robert Krauthgamer. The computational hardness of estimating edit distance.
SIAM J. Comput., 39(6):2398–2429, April 2010.

[3] R. Bar-Yehuda and S. Even. A linear-time approximation algorithm for the weighted vertex cover problem. Journal of Algorithms, 2(2):198–203, 1981.

[4] Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods: Support Vector Learning, pages 43–54. MIT Press, 1999.

[5] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

[6] T. Batu, F. Ergun, J. Kilian, A. Magen, S. Raskhodnikova, R. Rubinfeld, and R. Sami. A sublinear algorithm for weakly approximating edit distance. In Proceedings of the 35th ACM Symposium on Theory of Computing, pages 316–324, 2003.

[7] Alina Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest neighbor. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 97–104. ACM, 2006.

[8] Nader H. Bshouty, Yi Li, and Philip M. Long. Using the doubling dimension to analyze the generalization of learning algorithms. Journal of Computer and System Sciences, 75(6):323–335, 2009.

[9] Lei Chen and Raymond Ng. On the marriage of Lp-norms and edit distance. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, VLDB '04, pages 792–803, 2004.

[10] Richard Cole and Lee-Ad Gottlieb. Searching dynamic point sets in spaces with bounded doubling dimension. In STOC, pages 574–583, 2006.

[11] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[12] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21–27, 1967.

[13] Ricky Der and Daniel Lee.
Large-margin classification in Banach spaces. In JMLR Workshop and Conference Proceedings Volume 2: AISTATS 2007, pages 91–98, 2007.

[14] P. Enflo. On the nonexistence of uniform homeomorphisms between Lp-spaces. Ark. Mat., 8:103–105, 1969.

[15] Marco Franciosi and Giulia Menconi. Multi-dimensional sparse time series: feature extraction. CoRR, 2008.

[16] Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient classification for metric data. In COLT, pages 433–440, 2010.

[17] Lee-Ad Gottlieb and Robert Krauthgamer. Proximity algorithms for nearly doubling spaces. SIAM J. Discrete Math., 27(4):1759–1769, 2013.

[18] Josif Grabocka, Alexandros Nanopoulos, and Lars Schmidt-Thieme. Classification of sparse time series via supervised matrix factorization. In AAAI'12, 2012.

[19] K. Grauman and T. Darrell. Fast contour matching using approximate Earth Mover's Distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Washington DC, June 2004.

[20] K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Beijing, China, October 2005.

[21] Anupam Gupta, Robert Krauthgamer, and James R. Lee. Bounded geometries, fractals, and low-distortion embeddings. In FOCS, pages 534–543, 2003.

[22] Sariel Har-Peled and Manor Mendel. Fast construction of nets in low-dimensional metrics and their applications. SIAM J. Comput., 35(5):1148–1184, 2006.

[23] Matthias Hein, Olivier Bousquet, and Bernhard Schölkopf. Maximal margin classification for metric spaces. J. Comput. Syst. Sci., 71(3):333–359, 2005.

[24] Vahid Khanagha and Khalid Daoudi. An efficient solution to sparse linear prediction analysis of speech.
EURASIP Journal on Audio, Speech, and Music Processing, 2013(1):1–9, 2013.

[25] R. Krauthgamer and J. R. Lee. Navigating nets: Simple algorithms for proximity search. In 15th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 798–807, January 2004.

[26] E. J. McShane. Extension of range of functions. Bull. Amer. Math. Soc., 40(12):837–842, 1934.

[27] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.

[28] Marcin Mucha and Piotr Sankowski. Maximum matchings via Gaussian elimination. In FOCS '04: Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, pages 248–255, 2004.

[29] Assaf Naor and Gideon Schechtman. Planar earthmover is not in L1. SIAM J. Comput., 37:804–826, June 2007.

[30] Ofir Pele and Michael Werman. A linear time histogram metric for improved SIFT matching. In 10th European Conference on Computer Vision, ECCV '08, pages 495–508. Springer-Verlag, 2008.

[31] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.

[32] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 2002.

[33] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[34] Ulrike von Luxburg and Olivier Bousquet. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5:669–695, 2004.

[35] Larry Wasserman. All of Nonparametric Statistics. Springer Texts in Statistics. Springer, New York, 2006.

[36] Hassler Whitney. Analytic extensions of differentiable functions defined in closed sets.
Transactions of the American Mathematical Society, 36(1):63–89, 1934.

[37] Virginia Vassilevska Williams. Breaking the Coppersmith-Winograd barrier. In STOC '12: Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, New York, NY, USA, 2012. ACM Press.

[38] Bin Yang, Chenjuan Guo, and Christian S. Jensen. Travel cost inference from sparse, spatio-temporally correlated time series using Markov models. Proc. VLDB Endow., 6(9):769–780, July 2013.

[39] Byoung-Kee Yi, H. V. Jagadish, and Christos Faloutsos. Efficient retrieval of similar time sequences under time warping. In Proceedings of the Fourteenth International Conference on Data Engineering, ICDE '98, pages 201–208. IEEE Computer Society, 1998.

[40] Haizhang Zhang, Yuesheng Xu, and Jun Zhang. Reproducing kernel Banach spaces for machine learning. Journal of Machine Learning Research, 10:2741–2775, 2009.

[41] J. Ziniel, L. C. Potter, and P. Schniter. Tracking and smoothing of time-varying sparse signals via approximate belief propagation. In Signals, Systems and Computers (ASILOMAR), 2010 Conference Record of the Forty Fourth Asilomar Conference on, pages 808–812, November 2010.
