Scale-sensitive Psi-dimensions: the Capacity Measures for Classifiers Taking Values in R^Q


Authors: Yann Guermeur (LORIA)

LORIA-CNRS, Campus Scientifique, BP 239, 54506 Vandœuvre-lès-Nancy Cedex, France (e-mail: Yann.Guermeur@loria.fr)

Abstract. Bounds on the risk play a crucial role in statistical learning theory. They usually involve as capacity measure of the model studied the VC dimension or one of its extensions. In classification, such "VC dimensions" exist for models taking values in {0, 1}, {1, . . . , Q} and R. We introduce the generalizations appropriate for the missing case, the one of models with values in R^Q. This provides us with a new guaranteed risk for M-SVMs which appears superior to the existing one.

Keywords: Large margin classifiers, Generalized VC dimensions, M-SVMs.

1 Introduction

Vapnik's statistical learning theory [Vapnik, 1998] deals with three types of problems: pattern recognition, regression estimation and density estimation. However, the theory of bounds has primarily been developed for the computation of dichotomies only. Central in this theory is the notion of "capacity" of classes of functions. In the case of binary classifiers, the measure of this capacity is the famous Vapnik-Chervonenkis (VC) dimension. Extensions have also been proposed for real-valued bi-class models and for multi-class models taking their values in the set of categories. Strangely enough, no generalized VC dimension was available so far for Q-category classifiers taking their values in R^Q. This was all the more unsatisfactory as many classifiers exhibit this property, such as the multi-layer perceptrons, or the multi-class support vector machines (M-SVMs). In this paper, the scale-sensitive Ψ-dimensions are introduced to fill this gap.
A generalization of Sauer's lemma [Sauer, 1972] is given, which relates the covering numbers appearing in the standard guaranteed risk for large margin multi-category discriminant models to one of these dimensions, the margin Natarajan dimension. This latter dimension is then bounded from above for the architecture shared by all the M-SVMs proposed so far. This provides us with a sharper bound on their sample complexity.

The organization of the paper is as follows. Section 2 introduces the basic bound on the risk of large margin multi-category discriminant models. In Section 3, the scale-sensitive Ψ-dimensions are defined, and the generalized Sauer lemma is formulated. The upper bound on the margin Natarajan dimension of the M-SVMs is then described in Section 4. For lack of space, proofs are omitted. They can be found in [Guermeur, 2004].

2 Basic theory of large margin Q-category classifiers

We consider Q-category pattern recognition problems, with 3 ≤ Q < ∞. A pattern is represented by its description x ∈ X, and the set of categories Y is identified with the set of indices of the categories, {1, . . . , Q}. The link between patterns and categories is supposed to be probabilistic: X and Y are probability spaces, and X × Y is endowed with a probability measure P, fixed but unknown. Let (X, Y) be a random pair distributed according to P. Training consists in using an m-sample s_m = ((X_i, Y_i))_{1 ≤ i ≤ m} of independent copies of (X, Y) to select, in a given class of functions G, a function classifying data in an optimal way. The criterion to be optimized, the risk, is the expectation with respect to P of a given loss function. The way the functions in G perform classification must be specified. We consider classes of functions from X into R^Q.
A function g = (g_k)_{1 ≤ k ≤ Q} ∈ G assigns x ∈ X to the category l if and only if g_l(x) > max_{k ≠ l} g_k(x). Cases of ex aequo are treated as errors. This calls for the choice of a loss function ℓ defined on G × X × Y by

$$\ell(y, g(x)) = \mathbb{1}_{\left\{g_y(x) \le \max_{k \ne y} g_k(x)\right\}}.$$

The risk of g is then given by:

$$R(g) = \mathbb{E}\left[\ell(Y, g(X))\right] = \int_{\mathcal{X} \times \mathcal{Y}} \mathbb{1}_{\left\{g_y(x) \le \max_{k \ne y} g_k(x)\right\}} \, dP(x, y).$$

This study deals with large margin classifiers, when the underlying notion of multi-class margin is the following one.

Definition 1 (Multi-class margin). Let g be a function from X into R^Q. Its margin on (x, y) ∈ X × Y, M(g, x, y), is given by:

$$M(g, x, y) = \frac{1}{2}\left(g_y(x) - \max_{k \ne y} g_k(x)\right).$$

Basically, the central elements used to assign a pattern to a category, and to derive a level of confidence in this assignation, are the index of the highest output and the difference between this output and the second highest one. The class of functions of interest is thus the image of G by application of an appropriate operator. Two such "margin operators" are considered here, Δ and Δ*.

Definition 2 (Δ operator). Define Δ as an operator on G such that:

$$\Delta : \mathcal{G} \to \Delta\mathcal{G}, \quad g \mapsto \Delta g = (\Delta g_k)_{1 \le k \le Q},$$
$$\forall x \in \mathcal{X}, \quad \Delta g(x) = \left(\frac{1}{2}\left(g_k(x) - \max_{l \ne k} g_l(x)\right)\right)_{1 \le k \le Q}.$$

For all (g, x) ∈ G × X, let M(g, x) = max_k Δg_k(x).

Definition 3 (Δ* operator). Define Δ* as an operator on G such that:

$$\Delta^* : \mathcal{G} \to \Delta^*\mathcal{G}, \quad g \mapsto \Delta^* g = (\Delta^* g_k)_{1 \le k \le Q},$$
$$\forall x \in \mathcal{X}, \quad \Delta^* g(x) = \left(\mathrm{sign}\left(\Delta g_k(x)\right) \cdot M(g, x)\right)_{1 \le k \le Q}.$$

In the sequel, Δ^# is used in place of Δ and Δ* in the formulas that hold true for both operators. The empirical margin risk is defined as follows.

Definition 4 (Margin risk). Let γ ∈ R*_+.
The risk with margin γ of g, R_γ(g), and its empirical estimate on s_m, R_{γ,s_m}(g), are defined as:

$$R_\gamma(g) = \int_{\mathcal{X} \times \mathcal{Y}} \mathbb{1}_{\left\{\Delta^{\#} g_y(x) < \gamma\right\}} \, dP(x, y), \qquad R_{\gamma, s_m}(g) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}_{\left\{\Delta^{\#} g_{Y_i}(X_i) < \gamma\right\}}.$$

For technical reasons, it is useful to squash the functions Δ^# g_k as much as possible without altering the value of the empirical margin risk. This is achieved by application of another operator.

Definition 5 (π_γ operator [Bartlett, 1998]). For γ ∈ R*_+, define π_γ as an operator on G such that:

$$\pi_\gamma : \mathcal{G} \to \pi_\gamma\mathcal{G}, \quad g \mapsto \pi_\gamma g = (\pi_\gamma g_k)_{1 \le k \le Q},$$
$$\forall x \in \mathcal{X}, \quad \pi_\gamma g(x) = \left(\mathrm{sign}(g_k(x)) \cdot \min\left(|g_k(x)|, \gamma\right)\right)_{1 \le k \le Q}.$$

Let Δ^#_γ denote π_γ ∘ Δ^#, and let Δ^#_γ G be defined as the set of functions Δ^#_γ g. The capacity of Δ^#_γ G is characterized by its covering numbers.

Definition 6 (ε-cover, ε-net and covering numbers). Let (E, ρ) be a pseudo-metric space, E′ ⊂ E and ε ∈ R*_+. An ε-cover of E′ is a coverage of E′ with open balls of radius ε the centers of which belong to E. These centers form an ε-net of E′. A proper ε-net of E′ is an ε-net of E′ included in E′. If E′ has an ε-net of finite cardinality, then its covering number N(ε, E′, ρ) is the smallest cardinality of its ε-nets. If there is no such finite cover, then the covering number is defined to be ∞. N^{(p)}(ε, E′, ρ) will designate the covering number of E′ obtained by considering proper ε-nets only.

The covering numbers of interest use the following pseudo-metric.

Definition 7 (functional pseudo-metric). Let G be a class of functions from X into R^Q. For a set s_X^n ⊂ X of cardinality n, define the pseudo-metric d_{ℓ_∞, ℓ_∞(s_X^n)} on G as:

$$\forall (g, g') \in \mathcal{G}^2, \quad d_{\ell_\infty, \ell_\infty(s_X^n)}(g, g') = \max_{x \in s_X^n} \left\| g(x) - g'(x) \right\|_\infty.$$

Let N^{(p)}_{∞,∞}(ε, Δ^#_γ G, n) = sup_{s_X^n ⊂ X} N^{(p)}(ε, Δ^#_γ G, d_{ℓ_∞, ℓ_∞(s_X^n)}).
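Before stating the risk bound, it may help to make Definitions 1-5 concrete. The following sketch (all function and variable names are ours, not the paper's) computes the Δ and Δ* operators, the squashing operator π_γ, and the empirical margin risk on a toy sample:

```python
# Illustrative sketch of Definitions 1-5; names are ours, not the paper's.

def sign(t):
    return 1.0 if t > 0 else (-1.0 if t < 0 else 0.0)

def delta(g_x):
    """Delta operator (Definition 2): Delta g_k(x) = (g_k(x) - max_{l != k} g_l(x)) / 2."""
    Q = len(g_x)
    return [0.5 * (g_x[k] - max(g_x[l] for l in range(Q) if l != k)) for k in range(Q)]

def delta_star(g_x):
    """Delta* operator (Definition 3): sign(Delta g_k(x)) times M(g, x) = max_k Delta g_k(x)."""
    d = delta(g_x)
    m_x = max(d)
    return [sign(dk) * m_x for dk in d]

def pi_gamma(v, gamma):
    """pi_gamma operator (Definition 5): clip every component to [-gamma, gamma]."""
    return [sign(vk) * min(abs(vk), gamma) for vk in v]

def empirical_margin_risk(delta_outputs, labels, gamma):
    """Definition 4: fraction of examples with Delta# g_{y_i}(x_i) < gamma."""
    return sum(1 for d, y in zip(delta_outputs, labels) if d[y] < gamma) / len(labels)

g_x = [1.0, 3.0, 2.0]             # Q = 3 outputs of g on one pattern x
print(delta(g_x))                 # -> [-1.0, 0.5, -0.5]: only the winning category is positive
print(delta_star(g_x))            # -> [-0.5, 0.5, -0.5]: every component shares the magnitude M(g, x)
print(pi_gamma(delta(g_x), 0.4))  # -> [-0.4, 0.4, -0.4]: squashed to [-gamma, gamma]
sample = [delta(v) for v in ([1.0, 3.0, 2.0], [2.0, 1.8, 0.5], [0.3, 0.2, 0.9])]
print(empirical_margin_risk(sample, [1, 0, 2], 0.4))  # margins ~0.5, 0.1, 0.3 -> risk 2/3
```

Note that applying π_γ after Δ leaves the empirical margin risk unchanged, since clipping at γ cannot move a value across the threshold γ; this is exactly why the squashing costs nothing in Definition 4.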
The following theorem extends to the multi-class case Corollary 9 in [Bartlett, 1998].

Theorem 1 (Theorem 1 in [Guermeur, 2004]). Let s_m be an m-sample of examples independently drawn from a probability distribution on X × Y. With probability at least 1 − δ, for every value of γ in (0, 1], the risk of any function g in a class G is bounded from above by:

$$R(g) \le R_{\gamma, s_m}(g) + \sqrt{\frac{2}{m}\left(\ln\left(2\mathcal{N}^{(p)}_{\infty,\infty}\left(\gamma/4, \Delta^{\#}_\gamma \mathcal{G}, 2m\right)\right) + \ln\left(\frac{2}{\gamma\delta}\right)\right)} + \frac{1}{m}. \tag{1}$$

Studying the sample complexity of a classifier G can thus amount to computing an upper bound on N^{(p)}_{∞,∞}(γ/4, Δ^#_γ G, 2m). In [Guermeur et al., 2005], we reached this goal by relating these numbers to the entropy numbers of the corresponding evaluation operator. In the present paper, we follow the traditional path of VC bounds, by making use of a generalized VC dimension.

3 Bounding covering numbers in terms of the margin Natarajan dimension

The Ψ-dimensions are the generalized VC dimensions that characterize the learnability of classes of {1, . . . , Q}-valued functions.

Definition 8 (Ψ-dimensions [Ben-David et al., 1995]). Let F be a class of functions on a set X taking their values in the finite set {1, . . . , Q}. Let Ψ be a set of mappings ψ from {1, . . . , Q} into {−1, 1, ∗}, where ∗ is thought of as a null element. A subset s_X^n = {x_i : 1 ≤ i ≤ n} of X is said to be Ψ-shattered by F if there is a mapping ψ^n = (ψ^{(1)}, . . . , ψ^{(i)}, . . . , ψ^{(n)}) in Ψ^n such that, for each vector v_y of {−1, 1}^n, there is a function f_y in F satisfying (ψ^{(i)} ∘ f_y(x_i))_{1 ≤ i ≤ n} = v_y. The Ψ-dimension of F, denoted by Ψ-dim(F), is the maximal cardinality of a subset of X Ψ-shattered by F, if it is finite, or infinity otherwise.

One of these dimensions needs to be singled out, the Natarajan dimension.

Definition 9 (Natarajan dimension [Ben-David et al., 1995]). Let F be a class of functions on a set X taking their values in {1, . . . , Q}. The Natarajan dimension of F, N-dim(F), is the Ψ-dimension of F in the specific case where Ψ is the set of Q(Q − 1) mappings ψ_{k,l} (1 ≤ k ≠ l ≤ Q) such that ψ_{k,l} takes the value 1 if its argument is equal to k, the value −1 if its argument is equal to l, and ∗ otherwise.

The fat-shattering dimension characterizes the uniform Glivenko-Cantelli classes among the classes of real-valued functions.

Definition 10 (fat-shattering dimension [Alon et al., 1997]). Let G be a class of functions from X into R. For γ ∈ R*_+, s_X^n = {x_i : 1 ≤ i ≤ n} ⊂ X is said to be γ-shattered by G if there is a vector v_b = (b_i) ∈ R^n such that, for each vector v_y = (y_i) ∈ {−1, 1}^n, there is a function g_y ∈ G satisfying

$$\forall i \in \{1, \ldots, n\}, \quad y_i \left(g_y(x_i) - b_i\right) \ge \gamma.$$

The fat-shattering dimension of G, P_γ-dim(G), is the maximal cardinality of a subset of X γ-shattered by G, if it is finite, or infinity otherwise.

Given the results available for the Ψ-dimensions and the fat-shattering dimension, it appears natural, in order to study the generalization capabilities of classifiers taking values in R^Q, to consider the use of capacity measures obtained as mixtures of the two concepts, namely scale-sensitive Ψ-dimensions.

Definition 11 (Ψ-dimension with margin γ). Let G be a class of functions on a set X taking their values in R^Q. Let Ψ be a family of mappings ψ from {1, . . . , Q} into {−1, 1, ∗}. For γ ∈ R*_+, a subset s_X^n = {x_i : 1 ≤ i ≤ n} of X is said to be γ-Ψ-shattered by Δ^# G if there is a mapping ψ^n = (ψ^{(1)}, . . . , ψ^{(i)}, . . . , ψ^{(n)}) in Ψ^n and a vector v_b = (b_i) in R^n such that, for each vector v_y = (y_i) of {−1, 1}^n, there is a function g_y in G satisfying

$$\forall i \in \{1, \ldots, n\}, \quad \begin{cases} \text{if } y_i = 1, & \exists k : \psi^{(i)}(k) = 1 \;\wedge\; \Delta^{\#} g_{y,k}(x_i) - b_i \ge \gamma \\ \text{if } y_i = -1, & \exists l : \psi^{(i)}(l) = -1 \;\wedge\; \Delta^{\#} g_{y,l}(x_i) + b_i \ge \gamma. \end{cases}$$

The γ-Ψ-dimension of Δ^# G, Ψ-dim(Δ^# G, γ), is the maximal cardinality of a subset of X γ-Ψ-shattered by Δ^# G, if it is finite, or infinity otherwise.

The margin Natarajan dimension is defined accordingly.

Definition 12 (Natarajan dimension with margin γ). Let G be a class of functions on a set X taking their values in R^Q. For γ ∈ R*_+, a subset s_X^n = {x_i : 1 ≤ i ≤ n} of X is said to be γ-N-shattered by Δ^# G if there is a set I(s_X^n) = {(i_1(x_i), i_2(x_i)) : 1 ≤ i ≤ n} of n pairs of distinct indices in {1, . . . , Q} and a vector v_b = (b_i) in R^n such that, for each binary vector v_y = (y_i) ∈ {−1, 1}^n, there is a function g_y in G satisfying

$$\forall i \in \{1, \ldots, n\}, \quad \begin{cases} \text{if } y_i = 1, & \Delta^{\#} g_{y, i_1(x_i)}(x_i) - b_i \ge \gamma \\ \text{if } y_i = -1, & \Delta^{\#} g_{y, i_2(x_i)}(x_i) + b_i \ge \gamma. \end{cases}$$

The Natarajan dimension with margin γ of the class Δ^# G, N-dim(Δ^# G, γ), is the maximal cardinality of a subset of X γ-N-shattered by Δ^# G, if it is finite, or infinity otherwise.

For this scale-sensitive Ψ-dimension, the connection with the covering numbers of interest, or generalized Sauer lemma, is the following one.

Theorem 2 (Theorem 4 in [Guermeur, 2004]). Let G be a class of functions from a domain X into R^Q. For every value of γ in (0, 1] and every m ∈ N* satisfying 2m ≥ N-dim(Δ_γ G, γ/24), the following bound is true:

$$\mathcal{N}^{(p)}_{\infty,\infty}\left(\gamma/4, \Delta^{*}_\gamma \mathcal{G}, 2m\right) < 2 \left(288 m Q^2 (Q-1)\right)^{\left\lceil d \log_2\left(23 e m Q (Q-1)/d\right)\right\rceil} \tag{2}$$

where d = N-dim(Δ_γ G, γ/24).

This theorem is the central result of the paper (and the novelty in the revised version of [Guermeur, 2004]). What makes it a nontrivial Q-class extension of Lemma 3.5 in [Alon et al., 1997] is the presence of both margin operators.
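To get a feel for the orders of magnitude involved, one can evaluate the right-hand side of (2) numerically and plug it into (1). The sketch below is purely illustrative: the values of m, Q, d, γ and δ are our own toy choices, not results from the paper, and with samples of this size the resulting bound is vacuous (larger than 1).

```python
import math

# Numerical illustration only; the toy values of m, Q, d, gamma, delta are ours.

def log2_covering_bound(m, Q, d):
    """log2 of the right-hand side of (2), which upper-bounds
    log2 N^(p)_{inf,inf}(gamma/4, Delta*_gamma G, 2m)."""
    exponent = math.ceil(d * math.log2(23 * math.e * m * Q * (Q - 1) / d))
    return 1 + exponent * math.log2(288 * m * Q ** 2 * (Q - 1))

def guaranteed_risk(emp_risk, m, Q, d, gamma, delta):
    """Right-hand side of (1), with the covering number bounded via (2)."""
    ln_N = log2_covering_bound(m, Q, d) * math.log(2)
    return emp_risk + math.sqrt(
        (2 / m) * (math.log(2) + ln_N + math.log(2 / (gamma * delta)))
    ) + 1 / m

m, Q, d, gamma, conf = 10_000, 3, 50, 0.5, 0.05
print(guaranteed_risk(0.05, m, Q, d, gamma, conf))
```

Since the exponent in (2) grows like d log2(m), the confidence term of (1) only becomes informative when m is large relative to the margin Natarajan dimension d, which is why the sharper bound on d derived in Section 4 matters.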
The reason why Δ* appears in the covering number instead of Δ is the very principle at the basis of all the variants of Sauer's lemma: two functions separated with respect to the functional pseudo-metric used (here d_{ℓ_∞, ℓ_∞(s_X^n)}) shatter (at least) one point in s_X^n. This is true for Δ*_γ G, or more precisely its η-discretization, but not for Δ_γ G (see Section 5.3 in [Guermeur, 2004] for details). One can derive a variant of Theorem 2 involving N-dim(Δ*_γ G, γ/24). This alternative is however of lesser interest, for reasons that will appear below.

4 Margin Natarajan dimension of the M-SVMs

We now compute an upper bound on the margin Natarajan dimension of interest when G is the class of functions computed by the M-SVMs. These large margin classifiers are built around a Mercer kernel. Let κ be such a kernel on X and (H_κ, ⟨·,·⟩_{H_κ}) the corresponding reproducing kernel Hilbert space (RKHS) [Aronszajn, 1950]. Let Φ be any of the mappings on X satisfying:

$$\forall (x, x') \in \mathcal{X}^2, \quad \kappa(x, x') = \langle \Phi(x), \Phi(x') \rangle, \tag{3}$$

where ⟨·,·⟩ is the dot product of the ℓ_2 space. "The" feature space traditionally designates any of the Hilbert spaces (E_{Φ(X)}, ⟨·,·⟩) spanned by the Φ(X). By definition of a RKHS, H = (H_κ + {1})^Q is the class of functions h = (h_k)_{1 ≤ k ≤ Q} from X into R^Q of the form:

$$h(\cdot) = \left(\sum_{i=1}^{l_k} \beta_{ik} \, \kappa(x_{ik}, \cdot) + b_k\right)_{1 \le k \le Q}$$

where the x_{ik} are elements of X (the β_{ik} and b_k are scalars), as well as the limits of these functions when the sets {x_{ik} : 1 ≤ i ≤ l_k} become dense in X in the norm induced by the dot product. Due to (3), H can also be seen as a multivariate affine model on Φ(X). Functions h can then be rewritten as:

$$h(\cdot) = \left(\langle w_k, \cdot \rangle + b_k\right)_{1 \le k \le Q}$$

where the vectors w_k are elements of E_{Φ(X)}.
They are thus described by the pair (w, b), with w = (w_k)_{1 ≤ k ≤ Q} and b = (b_k)_{1 ≤ k ≤ Q}. Let H̄ stand for the product space H_κ^Q. Its norm ‖·‖_H̄ is given by:

$$\left\| \bar{h} \right\|_{\bar{\mathcal{H}}} = \sqrt{\sum_{k=1}^{Q} \left\| w_k \right\|^2} = \left\| w \right\|.$$

Definition 13 (M-SVM). An M-SVM is a large margin multi-category discriminant model obtained by minimizing over the hyperplane Σ_{k=1}^Q h_k = 0 of H an objective function of the form:

$$J(h) = \sum_{i=1}^{m} \ell_{\text{M-SVM}}\left(y_i, h(x_i)\right) + \lambda \left\| w \right\|^2$$

where the empirical term, used in place of the empirical risk, involves a loss function ℓ_M-SVM which is convex.

The M-SVMs only differ in the nature of ℓ_M-SVM. The specification of this function is such that the introduction of the penalizer ‖w‖² tends to maximize a notion of margin directly connected with the one of Definition 1. The formulation of the generalized Sauer lemma provided here (Theorem 2) is the one obtained under the weakest hypotheses. Proceeding as in the bi-class case, we express below a bound on the margin Natarajan dimension of the M-SVMs as a function of the volume occupied by the data in E_{Φ(X)} and constraints on (w, b), thus restricting the study to functions with a well-defined range. In that case, a variant of Theorem 2 can be derived from Lemma 7 in [Guermeur, 2004] which does not involve π_γ but relates the covering numbers of Δ* G to the margin Natarajan dimension of Δ G. Its use for M-SVMs is advantageous since N-dim(Δ H̄, ε) is easier to bound than N-dim(Δ_γ H, ε) (nonlinearity is difficult to handle). This change of generalized Sauer lemma calls for the use of an intermediate formula relating the covering numbers of Δ*_γ H and Δ*_γ H̄. It is provided by the following lemma.

Lemma 1 (Lemmas 9 and 10 in [Guermeur, 2004]).
Let H be the class of functions that a Q-category M-SVM can implement under the hypothesis b ∈ [−β, β]^Q. Let (γ, ε) ∈ R² satisfy 0 < ε ≤ γ ≤ 1. Then:

$$\mathcal{N}^{(p)}_{\infty,\infty}\left(\epsilon, \Delta^{*}_\gamma \mathcal{H}, m\right) \le \left(2 \left\lceil \frac{\beta}{\epsilon} \right\rceil + 1\right)^{Q} \mathcal{N}^{(p)}_{\infty,\infty}\left(\epsilon/2, \Delta^{*}_\gamma \bar{\mathcal{H}}, m\right). \tag{4}$$

A final theorem then completes the construction of the guaranteed risk.

Theorem 3 (Theorem 5 in [Guermeur, 2004]). Let H̄ be the class of functions that a Q-category M-SVM can implement under the hypothesis that Φ(X) is included in the closed ball of radius Λ_{Φ(X)} about the origin in E_{Φ(X)} and the constraints 1/2 max_{1 ≤ k
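As an illustration of Definition 13, the objective J(h) can be written down explicitly once a kernel and a loss are fixed. The sketch below assumes a Gaussian kernel and a Weston-and-Watkins-style hinge loss, which is only one possible choice of ℓ_M-SVM (the paper deliberately leaves the loss unspecified); the hyperplane constraint Σ_k h_k = 0 is not enforced here, and all names are ours.

```python
import math

# Sketch of the M-SVM objective of Definition 13 under assumed choices
# (Gaussian kernel, Weston-and-Watkins-style hinge loss). The constraint
# sum_k h_k = 0 of the definition is omitted for brevity.

def gaussian_kernel(x, x_prime, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, x_prime)) / (2 * sigma ** 2))

def h_outputs(x, support, beta, b, kernel):
    """Kernel expansion h_k(x) = sum_i beta[i][k] kappa(x_i, x) + b_k."""
    return [sum(beta[i][k] * kernel(x_i, x) for i, x_i in enumerate(support)) + b[k]
            for k in range(len(b))]

def objective(samples, labels, support, beta, b, lam, kernel):
    """J(h) = sum_i ell_M-SVM(y_i, h(x_i)) + lambda * ||w||^2."""
    Q = len(b)
    emp = 0.0
    for x, y in zip(samples, labels):
        h = h_outputs(x, support, beta, b, kernel)
        # Convex surrogate: each rival output should stay at least 1 below h_y.
        emp += sum(max(0.0, 1.0 - (h[y] - h[k])) for k in range(Q) if k != y)
    # Kernel trick: ||w||^2 = sum_k sum_{i,j} beta[i][k] beta[j][k] kappa(x_i, x_j).
    norm2 = sum(beta[i][k] * beta[j][k] * kernel(x_i, x_j)
                for k in range(Q)
                for i, x_i in enumerate(support)
                for j, x_j in enumerate(support))
    return emp + lam * norm2

# Toy check: with beta = 0, h is constant, every rival violates the margin,
# and the empirical term equals m * (Q - 1).
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
labels = [0, 1, 2]
beta = [[0.0] * 3 for _ in samples]
print(objective(samples, labels, samples, beta, [0.0] * 3, 0.1, gaussian_kernel))  # -> 6.0
```

Note how the penalizer ‖w‖² is computed entirely through κ, in keeping with the identification (3) of the dot product in the feature space with the kernel.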
