Scale-sensitive Psi-dimensions: the Capacity Measures for Classifiers Taking Values in R^Q


Authors: Yann Guermeur (LORIA)

LORIA-CNRS, Campus Scientifique, BP 239, 54506 Vandœuvre-lès-Nancy Cedex, France (e-mail: Yann.Guermeur@loria.fr)

Abstract. Bounds on the risk play a crucial role in statistical learning theory. They usually involve as capacity measure of the model studied the VC dimension or one of its extensions. In classification, such "VC dimensions" exist for models taking values in {0, 1}, {1, . . . , Q} and R. We introduce the generalizations appropriate for the missing case, the one of models with values in R^Q. This provides us with a new guaranteed risk for M-SVMs which appears superior to the existing one.

Keywords: Large margin classifiers, Generalized VC dimensions, M-SVMs.

1 Introduction

Vapnik's statistical learning theory [Vapnik, 1998] deals with three types of problems: pattern recognition, regression estimation and density estimation. However, the theory of bounds has primarily been developed for the computation of dichotomies only. Central in this theory is the notion of "capacity" of classes of functions. In the case of binary classifiers, the measure of this capacity is the famous Vapnik-Chervonenkis (VC) dimension. Extensions have also been proposed for real-valued bi-class models and for multi-class models taking their values in the set of categories. Strangely enough, no generalized VC dimension was available so far for Q-category classifiers taking their values in R^Q. This was all the more unsatisfactory as many classifiers exhibit this property, such as the multi-layer perceptrons, or the multi-class support vector machines (M-SVMs). In this paper, the scale-sensitive Ψ-dimensions are introduced to fill this gap.
A generalization of Sauer's lemma [Sauer, 1972] is given, which relates the covering numbers appearing in the standard guaranteed risk for large margin multi-category discriminant models to one of these dimensions, the margin Natarajan dimension. This latter dimension is then bounded from above for the architecture shared by all the M-SVMs proposed so far. This provides us with a sharper bound on their sample complexity.

The organization of the paper is as follows. Section 2 introduces the basic bound on the risk of large margin multi-category discriminant models. In Section 3, the scale-sensitive Ψ-dimensions are defined, and the generalized Sauer lemma is formulated. The upper bound on the margin Natarajan dimension of the M-SVMs is then described in Section 4. For lack of space, proofs are omitted. They can be found in [Guermeur, 2004].

2 Basic theory of large margin Q-category classifiers

We consider Q-category pattern recognition problems, with 3 ≤ Q < ∞. A pattern is represented by its description x ∈ X, and the set of categories Y is identified with the set of indices of the categories, {1, . . . , Q}. The link between patterns and categories is supposed to be probabilistic: X and Y are probability spaces, and X × Y is endowed with a probability measure P, fixed but unknown. Let (X, Y) be a random pair distributed according to P. Training consists in using an m-sample s_m = ((X_i, Y_i))_{1 ≤ i ≤ m} of independent copies of (X, Y) to select, in a given class of functions G, a function classifying data in an optimal way. The criterion to be optimized, the risk, is the expectation with respect to P of a given loss function. The way the functions in G perform classification must be specified. We consider classes of functions from X into R^Q.
A function g = (g_k)_{1 ≤ k ≤ Q} ∈ G assigns x ∈ X to the category l if and only if g_l(x) > max_{k ≠ l} g_k(x). Cases of ex aequo are treated as errors. This calls for the choice of a loss function ℓ defined on G × X × Y by

$$\ell(y, g(x)) = \mathbb{1}_{\left\{g_y(x) \le \max_{k \ne y} g_k(x)\right\}}.$$

The risk of g is then given by:

$$R(g) = \mathbb{E}\left[\ell(Y, g(X))\right] = \int_{\mathcal{X} \times \mathcal{Y}} \mathbb{1}_{\left\{g_y(x) \le \max_{k \ne y} g_k(x)\right\}} \, dP(x, y).$$

This study deals with large margin classifiers, when the underlying notion of multi-class margin is the following one.

Definition 1 (Multi-class margin). Let g be a function from X into R^Q. Its margin on (x, y) ∈ X × Y, M(g, x, y), is given by:

$$M(g, x, y) = \frac{1}{2}\left(g_y(x) - \max_{k \ne y} g_k(x)\right).$$

Basically, the central elements used to assign a pattern to a category, and to derive a level of confidence in this assignation, are the index of the highest output and the difference between this output and the second highest one. The class of functions of interest is thus the image of G by application of an appropriate operator. Two such "margin operators" are considered here, Δ and Δ*.

Definition 2 (Δ operator). Define Δ as an operator on G such that:

$$\Delta : \mathcal{G} \to \Delta\mathcal{G}, \quad g \mapsto \Delta g = (\Delta g_k)_{1 \le k \le Q},$$
$$\forall x \in \mathcal{X}, \quad \Delta g(x) = \left(\frac{1}{2}\left(g_k(x) - \max_{l \ne k} g_l(x)\right)\right)_{1 \le k \le Q}.$$

For all (g, x) ∈ G × X, let M(g, x) = max_k Δg_k(x).

Definition 3 (Δ* operator). Define Δ* as an operator on G such that:

$$\Delta^* : \mathcal{G} \to \Delta^*\mathcal{G}, \quad g \mapsto \Delta^* g = (\Delta^* g_k)_{1 \le k \le Q},$$
$$\forall x \in \mathcal{X}, \quad \Delta^* g(x) = \left(\mathrm{sign}\left(\Delta g_k(x)\right) \cdot M(g, x)\right)_{1 \le k \le Q}.$$

In the sequel, Δ^# is used in place of Δ and Δ* in the formulas that hold true for both operators. The empirical margin risk is defined as follows.

Definition 4 (Margin risk). Let γ ∈ R*_+.
The risk with margin γ of g, R_γ(g), and its empirical estimate on s_m, R_{γ,s_m}(g), are defined as:

$$R_\gamma(g) = \int_{\mathcal{X} \times \mathcal{Y}} \mathbb{1}_{\left\{\Delta^{\#} g_y(x) < \gamma\right\}} \, dP(x, y), \qquad R_{\gamma, s_m}(g) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}_{\left\{\Delta^{\#} g_{Y_i}(X_i) < \gamma\right\}}.$$

For technical reasons, it is useful to squash the functions Δ^# g_k as much as possible without altering the value of the empirical margin risk. This is achieved by application of another operator.

Definition 5 (π_γ operator [Bartlett, 1998]). For γ ∈ R*_+, define π_γ as an operator on G such that:

$$\pi_\gamma : \mathcal{G} \to \pi_\gamma\mathcal{G}, \quad g \mapsto \pi_\gamma g = (\pi_\gamma g_k)_{1 \le k \le Q},$$
$$\forall x \in \mathcal{X}, \quad \pi_\gamma g(x) = \left(\mathrm{sign}(g_k(x)) \cdot \min\left(|g_k(x)|, \gamma\right)\right)_{1 \le k \le Q}.$$

Let Δ^#_γ denote π_γ ∘ Δ^#, and let Δ^#_γ G be defined as the set of functions Δ^#_γ g. The capacity of Δ^#_γ G is characterized by its covering numbers.

Definition 6 (ε-cover, ε-net and covering numbers). Let (E, ρ) be a pseudo-metric space, E′ ⊂ E and ε ∈ R*_+. An ε-cover of E′ is a coverage of E′ with open balls of radius ε the centers of which belong to E. These centers form an ε-net of E′. A proper ε-net of E′ is an ε-net of E′ included in E′. If E′ has an ε-net of finite cardinality, then its covering number N(ε, E′, ρ) is the smallest cardinality of its ε-nets. If there is no such finite cover, then the covering number is defined to be ∞. N^{(p)}(ε, E′, ρ) will designate the covering number of E′ obtained by considering proper ε-nets only.

The covering numbers of interest use the following pseudo-metric.

Definition 7 (functional pseudo-metric). Let G be a class of functions from X into R^Q. For a set s_X^n ⊂ X of cardinality n, define the pseudo-metric d_{ℓ_∞, ℓ_∞(s_X^n)} on G as:

$$\forall (g, g') \in \mathcal{G}^2, \quad d_{\ell_\infty, \ell_\infty(s_X^n)}(g, g') = \max_{x \in s_X^n} \left\| g(x) - g'(x) \right\|_\infty.$$

Let N^{(p)}_{∞,∞}(ε, Δ^#_γ G, n) = sup_{s_X^n ⊂ X} N^{(p)}(ε, Δ^#_γ G, d_{ℓ_∞, ℓ_∞(s_X^n)}).
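Before stating the risk bound, it may help to make Definitions 1-5 concrete. The following sketch (all function and variable names are ours, not the paper's) computes the Δ and Δ* operators, the squashing operator π_γ, and the empirical margin risk on a toy sample:

```python
# Illustrative sketch of Definitions 1-5; names are ours, not the paper's.

def sign(t):
    return 1.0 if t > 0 else (-1.0 if t < 0 else 0.0)

def delta(g_x):
    """Delta operator (Definition 2): Delta g_k(x) = (g_k(x) - max_{l != k} g_l(x)) / 2."""
    Q = len(g_x)
    return [0.5 * (g_x[k] - max(g_x[l] for l in range(Q) if l != k)) for k in range(Q)]

def delta_star(g_x):
    """Delta* operator (Definition 3): sign(Delta g_k(x)) times M(g, x) = max_k Delta g_k(x)."""
    d = delta(g_x)
    m_x = max(d)
    return [sign(dk) * m_x for dk in d]

def pi_gamma(v, gamma):
    """pi_gamma operator (Definition 5): clip every component to [-gamma, gamma]."""
    return [sign(vk) * min(abs(vk), gamma) for vk in v]

def empirical_margin_risk(delta_outputs, labels, gamma):
    """Definition 4: fraction of examples with Delta# g_{y_i}(x_i) < gamma."""
    return sum(1 for d, y in zip(delta_outputs, labels) if d[y] < gamma) / len(labels)

g_x = [1.0, 3.0, 2.0]             # Q = 3 outputs of g on one pattern x
print(delta(g_x))                 # -> [-1.0, 0.5, -0.5]: only the winning category is positive
print(delta_star(g_x))            # -> [-0.5, 0.5, -0.5]: every component shares the magnitude M(g, x)
print(pi_gamma(delta(g_x), 0.4))  # -> [-0.4, 0.4, -0.4]: squashed to [-gamma, gamma]
sample = [delta(v) for v in ([1.0, 3.0, 2.0], [2.0, 1.8, 0.5], [0.3, 0.2, 0.9])]
print(empirical_margin_risk(sample, [1, 0, 2], 0.4))  # margins ~0.5, 0.1, 0.3 -> risk 2/3
```

Note that applying π_γ after Δ leaves the empirical margin risk unchanged, since clipping at γ cannot move a value across the threshold γ; this is exactly why the squashing costs nothing in Definition 4.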
The following theorem extends to the multi-class case Corollary 9 in [Bartlett, 1998].

Theorem 1 (Theorem 1 in [Guermeur, 2004]). Let s_m be an m-sample of examples independently drawn from a probability distribution on X × Y. With probability at least 1 − δ, for every value of γ in (0, 1], the risk of any function g in a class G is bounded from above by:

$$R(g) \le R_{\gamma, s_m}(g) + \sqrt{\frac{2}{m}\left(\ln\left(2\mathcal{N}^{(p)}_{\infty,\infty}\left(\gamma/4, \Delta^{\#}_\gamma \mathcal{G}, 2m\right)\right) + \ln\left(\frac{2}{\gamma\delta}\right)\right)} + \frac{1}{m}. \tag{1}$$

Studying the sample complexity of a classifier G can thus amount to computing an upper bound on N^{(p)}_{∞,∞}(γ/4, Δ^#_γ G, 2m). In [Guermeur et al., 2005], we reached this goal by relating these numbers to the entropy numbers of the corresponding evaluation operator. In the present paper, we follow the traditional path of VC bounds, by making use of a generalized VC dimension.

3 Bounding covering numbers in terms of the margin Natarajan dimension

The Ψ-dimensions are the generalized VC dimensions that characterize the learnability of classes of {1, . . . , Q}-valued functions.

Definition 8 (Ψ-dimensions [Ben-David et al., 1995]). Let F be a class of functions on a set X taking their values in the finite set {1, . . . , Q}. Let Ψ be a set of mappings ψ from {1, . . . , Q} into {−1, 1, ∗}, where ∗ is thought of as a null element. A subset s_X^n = {x_i : 1 ≤ i ≤ n} of X is said to be Ψ-shattered by F if there is a mapping ψ^n = (ψ^{(1)}, . . . , ψ^{(i)}, . . . , ψ^{(n)}) in Ψ^n such that, for each vector v_y of {−1, 1}^n, there is a function f_y in F satisfying (ψ^{(i)} ∘ f_y(x_i))_{1 ≤ i ≤ n} = v_y. The Ψ-dimension of F, denoted by Ψ-dim(F), is the maximal cardinality of a subset of X Ψ-shattered by F, if it is finite, or infinity otherwise.

One of these dimensions needs to be singled out, the Natarajan dimension.

Definition 9 (Natarajan dimension [Ben-David et al., 1995]). Let F be a class of functions on a set X taking their values in {1, . . . , Q}. The Natarajan dimension of F, N-dim(F), is the Ψ-dimension of F in the specific case where Ψ is the set of Q(Q − 1) mappings ψ_{k,l} (1 ≤ k ≠ l ≤ Q) such that ψ_{k,l} takes the value 1 if its argument is equal to k, the value −1 if its argument is equal to l, and ∗ otherwise.

The fat-shattering dimension characterizes the uniform Glivenko-Cantelli classes among the classes of real-valued functions.

Definition 10 (fat-shattering dimension [Alon et al., 1997]). Let G be a class of functions from X into R. For γ ∈ R*_+, s_X^n = {x_i : 1 ≤ i ≤ n} ⊂ X is said to be γ-shattered by G if there is a vector v_b = (b_i) ∈ R^n such that, for each vector v_y = (y_i) ∈ {−1, 1}^n, there is a function g_y ∈ G satisfying

$$\forall i \in \{1, \ldots, n\}, \quad y_i \left(g_y(x_i) - b_i\right) \ge \gamma.$$

The fat-shattering dimension of G, P_γ-dim(G), is the maximal cardinality of a subset of X γ-shattered by G, if it is finite, or infinity otherwise.

Given the results available for the Ψ-dimensions and the fat-shattering dimension, it appears natural, in order to study the generalization capabilities of classifiers taking values in R^Q, to consider the use of capacity measures obtained as mixtures of the two concepts, namely scale-sensitive Ψ-dimensions.

Definition 11 (Ψ-dimension with margin γ). Let G be a class of functions on a set X taking their values in R^Q. Let Ψ be a family of mappings ψ from {1, . . . , Q} into {−1, 1, ∗}. For γ ∈ R*_+, a subset s_X^n = {x_i : 1 ≤ i ≤ n} of X is said to be γ-Ψ-shattered by Δ^# G if there is a mapping ψ^n = (ψ^{(1)}, . . . , ψ^{(i)}, . . . , ψ^{(n)}) in Ψ^n and a vector v_b = (b_i) in R^n such that, for each vector v_y = (y_i) of {−1, 1}^n, there is a function g_y in G satisfying

$$\forall i \in \{1, \ldots, n\}, \quad \begin{cases} \text{if } y_i = 1, & \exists k : \psi^{(i)}(k) = 1 \;\wedge\; \Delta^{\#} g_{y,k}(x_i) - b_i \ge \gamma \\ \text{if } y_i = -1, & \exists l : \psi^{(i)}(l) = -1 \;\wedge\; \Delta^{\#} g_{y,l}(x_i) + b_i \ge \gamma. \end{cases}$$

The γ-Ψ-dimension of Δ^# G, Ψ-dim(Δ^# G, γ), is the maximal cardinality of a subset of X γ-Ψ-shattered by Δ^# G, if it is finite, or infinity otherwise.

The margin Natarajan dimension is defined accordingly.

Definition 12 (Natarajan dimension with margin γ). Let G be a class of functions on a set X taking their values in R^Q. For γ ∈ R*_+, a subset s_X^n = {x_i : 1 ≤ i ≤ n} of X is said to be γ-N-shattered by Δ^# G if there is a set I(s_X^n) = {(i_1(x_i), i_2(x_i)) : 1 ≤ i ≤ n} of n pairs of distinct indices in {1, . . . , Q} and a vector v_b = (b_i) in R^n such that, for each binary vector v_y = (y_i) ∈ {−1, 1}^n, there is a function g_y in G satisfying

$$\forall i \in \{1, \ldots, n\}, \quad \begin{cases} \text{if } y_i = 1, & \Delta^{\#} g_{y, i_1(x_i)}(x_i) - b_i \ge \gamma \\ \text{if } y_i = -1, & \Delta^{\#} g_{y, i_2(x_i)}(x_i) + b_i \ge \gamma. \end{cases}$$

The Natarajan dimension with margin γ of the class Δ^# G, N-dim(Δ^# G, γ), is the maximal cardinality of a subset of X γ-N-shattered by Δ^# G, if it is finite, or infinity otherwise.

For this scale-sensitive Ψ-dimension, the connection with the covering numbers of interest, or generalized Sauer lemma, is the following one.

Theorem 2 (Theorem 4 in [Guermeur, 2004]). Let G be a class of functions from a domain X into R^Q. For every value of γ in (0, 1] and every m ∈ N* satisfying 2m ≥ N-dim(Δ_γ G, γ/24), the following bound is true:

$$\mathcal{N}^{(p)}_{\infty,\infty}\left(\gamma/4, \Delta^{*}_\gamma \mathcal{G}, 2m\right) < 2 \left(288 m Q^2 (Q-1)\right)^{\left\lceil d \log_2\left(23 e m Q (Q-1)/d\right)\right\rceil} \tag{2}$$

where d = N-dim(Δ_γ G, γ/24).

This theorem is the central result of the paper (and the novelty in the revised version of [Guermeur, 2004]). What makes it a nontrivial Q-class extension of Lemma 3.5 in [Alon et al., 1997] is the presence of both margin operators.
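To get a feel for the orders of magnitude involved, one can evaluate the right-hand side of (2) numerically and plug it into (1). The sketch below is purely illustrative: the values of m, Q, d, γ and δ are our own toy choices, not results from the paper, and with samples of this size the resulting bound is vacuous (larger than 1).

```python
import math

# Numerical illustration only; the toy values of m, Q, d, gamma, delta are ours.

def log2_covering_bound(m, Q, d):
    """log2 of the right-hand side of (2), which upper-bounds
    log2 N^(p)_{inf,inf}(gamma/4, Delta*_gamma G, 2m)."""
    exponent = math.ceil(d * math.log2(23 * math.e * m * Q * (Q - 1) / d))
    return 1 + exponent * math.log2(288 * m * Q ** 2 * (Q - 1))

def guaranteed_risk(emp_risk, m, Q, d, gamma, delta):
    """Right-hand side of (1), with the covering number bounded via (2)."""
    ln_N = log2_covering_bound(m, Q, d) * math.log(2)
    return emp_risk + math.sqrt(
        (2 / m) * (math.log(2) + ln_N + math.log(2 / (gamma * delta)))
    ) + 1 / m

m, Q, d, gamma, conf = 10_000, 3, 50, 0.5, 0.05
print(guaranteed_risk(0.05, m, Q, d, gamma, conf))
```

Since the exponent in (2) grows like d log2(m), the confidence term of (1) only becomes informative when m is large relative to the margin Natarajan dimension d, which is why the sharper bound on d derived in Section 4 matters.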
The reason why Δ* appears in the covering number instead of Δ is the very principle at the basis of all the variants of Sauer's lemma: two functions separated with respect to the functional pseudo-metric used (here d_{ℓ_∞, ℓ_∞(s_X^n)}) shatter (at least) one point in s_X^n. This is true for Δ*_γ G, or more precisely its η-discretization, but not for Δ_γ G (see Section 5.3 in [Guermeur, 2004] for details). One can derive a variant of Theorem 2 involving N-dim(Δ*_γ G, γ/24). This alternative is however of lesser interest, for reasons that will appear below.

4 Margin Natarajan dimension of the M-SVMs

We now compute an upper bound on the margin Natarajan dimension of interest when G is the class of functions computed by the M-SVMs. These large margin classifiers are built around a Mercer kernel. Let κ be such a kernel on X and (H_κ, ⟨·,·⟩_{H_κ}) the corresponding reproducing kernel Hilbert space (RKHS) [Aronszajn, 1950]. Let Φ be any of the mappings on X satisfying:

$$\forall (x, x') \in \mathcal{X}^2, \quad \kappa(x, x') = \langle \Phi(x), \Phi(x') \rangle, \tag{3}$$

where ⟨·,·⟩ is the dot product of the ℓ_2 space. "The" feature space traditionally designates any of the Hilbert spaces (E_{Φ(X)}, ⟨·,·⟩) spanned by the Φ(X). By definition of a RKHS, H = (H_κ + {1})^Q is the class of functions h = (h_k)_{1 ≤ k ≤ Q} from X into R^Q of the form:

$$h(\cdot) = \left(\sum_{i=1}^{l_k} \beta_{ik} \, \kappa(x_{ik}, \cdot) + b_k\right)_{1 \le k \le Q}$$

where the x_{ik} are elements of X (the β_{ik} and b_k are scalars), as well as the limits of these functions when the sets {x_{ik} : 1 ≤ i ≤ l_k} become dense in X in the norm induced by the dot product. Due to (3), H can also be seen as a multivariate affine model on Φ(X). Functions h can then be rewritten as:

$$h(\cdot) = \left(\langle w_k, \cdot \rangle + b_k\right)_{1 \le k \le Q}$$

where the vectors w_k are elements of E_{Φ(X)}.
They are thus described by the pair (w, b), with w = (w_k)_{1 ≤ k ≤ Q} and b = (b_k)_{1 ≤ k ≤ Q}. Let H̄ stand for the product space H_κ^Q. Its norm ‖·‖_H̄ is given by:

$$\left\| \bar{h} \right\|_{\bar{\mathcal{H}}} = \sqrt{\sum_{k=1}^{Q} \left\| w_k \right\|^2} = \left\| w \right\|.$$

Definition 13 (M-SVM). An M-SVM is a large margin multi-category discriminant model obtained by minimizing over the hyperplane Σ_{k=1}^Q h_k = 0 of H an objective function of the form:

$$J(h) = \sum_{i=1}^{m} \ell_{\text{M-SVM}}\left(y_i, h(x_i)\right) + \lambda \left\| w \right\|^2$$

where the empirical term, used in place of the empirical risk, involves a loss function ℓ_M-SVM which is convex.

The M-SVMs only differ in the nature of ℓ_M-SVM. The specification of this function is such that the introduction of the penalizer ‖w‖² tends to maximize a notion of margin directly connected with the one of Definition 1. The formulation of the generalized Sauer lemma provided here (Theorem 2) is the one obtained under the weakest hypotheses. Proceeding as in the bi-class case, we express below a bound on the margin Natarajan dimension of the M-SVMs as a function of the volume occupied by the data in E_{Φ(X)} and constraints on (w, b), thus restricting the study to functions with a well-defined range. In that case, a variant of Theorem 2 can be derived from Lemma 7 in [Guermeur, 2004] which does not involve π_γ but relates the covering numbers of Δ* G to the margin Natarajan dimension of Δ G. Its use for M-SVMs is advantageous since N-dim(Δ H̄, ε) is easier to bound than N-dim(Δ_γ H, ε) (nonlinearity is difficult to handle). This change of generalized Sauer lemma calls for the use of an intermediate formula relating the covering numbers of Δ*_γ H and Δ*_γ H̄. It is provided by the following lemma.

Lemma 1 (Lemmas 9 and 10 in [Guermeur, 2004]).
Let H be the class of functions that a Q-category M-SVM can implement under the hypothesis b ∈ [−β, β]^Q. Let (γ, ε) ∈ R² satisfy 0 < ε ≤ γ ≤ 1. Then:

$$\mathcal{N}^{(p)}_{\infty,\infty}\left(\epsilon, \Delta^{*}_\gamma \mathcal{H}, m\right) \le \left(2 \left\lceil \frac{\beta}{\epsilon} \right\rceil + 1\right)^{Q} \mathcal{N}^{(p)}_{\infty,\infty}\left(\epsilon/2, \Delta^{*}_\gamma \bar{\mathcal{H}}, m\right). \tag{4}$$

A final theorem then completes the construction of the guaranteed risk.

Theorem 3 (Theorem 5 in [Guermeur, 2004]). Let H̄ be the class of functions that a Q-category M-SVM can implement under the hypothesis that Φ(X) is included in the closed ball of radius Λ_{Φ(X)} about the origin in E_{Φ(X)} and the constraints 1/2 max_{1 ≤ k
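As an illustration of Definition 13, the objective J(h) can be written down explicitly once a kernel and a loss are fixed. The sketch below assumes a Gaussian kernel and a Weston-and-Watkins-style hinge loss, which is only one possible choice of ℓ_M-SVM (the paper deliberately leaves the loss unspecified); the hyperplane constraint Σ_k h_k = 0 is not enforced here, and all names are ours.

```python
import math

# Sketch of the M-SVM objective of Definition 13 under assumed choices
# (Gaussian kernel, Weston-and-Watkins-style hinge loss). The constraint
# sum_k h_k = 0 of the definition is omitted for brevity.

def gaussian_kernel(x, x_prime, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, x_prime)) / (2 * sigma ** 2))

def h_outputs(x, support, beta, b, kernel):
    """Kernel expansion h_k(x) = sum_i beta[i][k] kappa(x_i, x) + b_k."""
    return [sum(beta[i][k] * kernel(x_i, x) for i, x_i in enumerate(support)) + b[k]
            for k in range(len(b))]

def objective(samples, labels, support, beta, b, lam, kernel):
    """J(h) = sum_i ell_M-SVM(y_i, h(x_i)) + lambda * ||w||^2."""
    Q = len(b)
    emp = 0.0
    for x, y in zip(samples, labels):
        h = h_outputs(x, support, beta, b, kernel)
        # Convex surrogate: each rival output should stay at least 1 below h_y.
        emp += sum(max(0.0, 1.0 - (h[y] - h[k])) for k in range(Q) if k != y)
    # Kernel trick: ||w||^2 = sum_k sum_{i,j} beta[i][k] beta[j][k] kappa(x_i, x_j).
    norm2 = sum(beta[i][k] * beta[j][k] * kernel(x_i, x_j)
                for k in range(Q)
                for i, x_i in enumerate(support)
                for j, x_j in enumerate(support))
    return emp + lam * norm2

# Toy check: with beta = 0, h is constant, every rival violates the margin,
# and the empirical term equals m * (Q - 1).
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
labels = [0, 1, 2]
beta = [[0.0] * 3 for _ in samples]
print(objective(samples, labels, samples, beta, [0.0] * 3, 0.1, gaussian_kernel))  # -> 6.0
```

Note how the penalizer ‖w‖² is computed entirely through κ, in keeping with the identification (3) of the dot product in the feature space with the kernel.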
