Bounding the Fat Shattering Dimension of a Composition Function Class Built Using a Continuous Logic Connective


Authors: Hubert Haoyang Duan

Bounding the Fat Shattering Dimension of a Composition Function Class Built Using a Continuous Logic Connective

By Hubert Haoyang Duan
Supervised by Dr. Vladimir Pestov

Project submitted in partial fulfillment of the requirements for the degree of B.Sc. Honours Specialization in Mathematics, Department of Mathematics and Statistics, Faculty of Science, University of Ottawa.

© Hubert Haoyang Duan, Ottawa, Canada, 2011

Abstract

We begin this report by describing the Probably Approximately Correct (PAC) model for learning a concept class, consisting of subsets of a domain, and a function class, consisting of functions from the domain to the unit interval. Two combinatorial parameters, the Vapnik-Chervonenkis (VC) dimension and its generalization, the Fat Shattering dimension of scale ε, are explained, and a few examples of their calculations are given with proofs. We then explain Sauer's Lemma, which involves the VC dimension and is used to prove the equivalence of a concept class being distribution-free PAC learnable and it having finite VC dimension. As the main new result of our research, we explore the construction of a new function class, obtained by forming compositions with a continuous logic connective, a uniformly continuous function from the unit hypercube to the unit interval, from a collection of function classes. Vidyasagar had proved that such a composition function class has finite Fat Shattering dimension of all scales if the classes in the original collection do; however, no estimates of the dimension were known. Using results by Mendelson-Vershynin and Talagrand, we bound the Fat Shattering dimension of scale ε of this new function class in terms of the Fat Shattering dimensions of the collection's classes. We conclude this report by providing a few open questions and future research topics involving the PAC learning model.
Contents

1 Introduction
2 Brief Overview of Analysis and Measure Theory
3 The Probably Approximately Correct Learning Model
4 The Vapnik-Chervonenkis Dimension
  4.1 Sauer's Lemma
  4.2 Characterization of concept class distribution-free PAC learning
5 The Fat Shattering Dimension
  5.1 Sufficient condition for function class distribution-free PAC learning
6 The Fat Shattering Dimension of a Composition Function Class
  6.1 Construction in the context of concept classes
  6.2 Construction of new function class with continuous logic connective
  6.3 Main Result
  6.4 Proofs
7 Open Questions
8 Conclusion
References

1 Introduction

In the area of statistical learning theory, the Probably Approximately Correct (PAC) learning model formalizes the notion of learning by using sample data points to produce valid hypotheses through algorithms. For instance, the following illustrates one learning problem which can be formalized in the PAC model. Suppose there is a disease which affects certain people, and out of 100 people in a hospital, 12 of them are sick with this disease. Is there a way to predict whether any given person in the hospital has the disease or not?

This report covers the PAC learning model applied to learning a collection of subsets C of a domain X, called a concept class, and, more generally, a collection of functions F from X to the unit interval [0, 1], called a function class. The report involves mostly concepts from analysis and some concepts from probability theory, but only the completion of the first two years of undergraduate studies in mathematics is assumed of the readers.
Report outline

First, we give two definitions of PAC learning, one for a concept class C and the other for a function class F, and explore two combinatorial parameters, the Vapnik-Chervonenkis (VC) dimension and the Fat Shattering dimension of scale ε, for C and F, respectively. Then, we explain Sauer's Lemma, a theorem which involves the VC dimension of C and is used to prove that the finiteness of this dimension is a sufficient condition for C to be learnable. Finally, as the main new result of our research, given function classes F₁, ..., F_k and a "continuous logic connective" (that is, a continuous function u: [0,1]^k → [0,1]), we consider the construction of a new composition function class u(F₁, ..., F_k), consisting of functions u(f₁, ..., f_k) defined by

  u(f₁, ..., f_k)(x) = u(f₁(x), ..., f_k(x))

for f_i ∈ F_i. We then bound the Fat Shattering dimension of scale ε of this class in terms of a sum of the Fat Shattering dimensions of scale δ(ε, k) of F₁, ..., F_k, where δ(ε, k) depends only on ε and k. There is a previously known analogous estimate for a composition of concept classes built using a usual connective of classical logic [18]. We deduce our new bound using results from Mendelson-Vershynin and Talagrand.

Before jumping into the PAC learning model, we provide some basic terminology and results from analysis and measure theory. From now on, any propositions or examples given with proofs, unless mentioned otherwise, are done by us and are independent of any sources.

2 Brief Overview of Analysis and Measure Theory

This section lists some definitions and results in measure theory and analysis, found in standard textbooks, such as [6], [18], and [2], which are used in this report.

Probability space

Definition 2.1. Let X be a set.
A σ-algebra S is a non-empty collection of subsets of X such that the following are satisfied:

1. If A ∈ S, then X \ A ∈ S.
2. If A_i ∈ S for i ∈ ℕ, then ⋃_{i∈ℕ} A_i ∈ S.

If S is a σ-algebra, then the pair (X, S) is called a measurable space.

Definition 2.2. Suppose (X, S) and (Y, T) are two measurable spaces. A function f: X → Y is called measurable if f⁻¹(T) ∈ S for all T ∈ T.

Definition 2.3. Given a measurable space (X, S), a function µ: S → ℝ⁺ = {r ∈ ℝ : r ≥ 0} is a measure if the following hold:

1. µ(∅) = 0.
2. If A_i ∈ S for all i ∈ ℕ and A_i ∩ A_j = ∅ whenever i ≠ j, then

     µ(⋃_{i∈ℕ} A_i) = Σ_{i∈ℕ} µ(A_i).

The triple (X, S, µ) is called a measure space. If, in addition, µ satisfies µ(X) = 1, then µ is a probability measure and (X, S, µ) is called a probability space.

Given a probability space (X, S, µ), one can measure the difference between two subsets A, B ∈ S of X by looking at their symmetric difference A △ B, which is indeed in S:

  µ(A △ B) = µ((A ∪ B) \ (A ∩ B)) = µ(((X \ A) ∩ B) ∪ (A ∩ (X \ B))).

More generally, given two measurable functions f, g: X → [0,1], one can look at the expected value of their absolute difference by integrating with respect to µ:

  ∫_X |f(x) − g(x)| dµ(x).

This report does not go into any details involving the Lebesgue integral, but it does assume that integration of measurable functions to the real numbers, which form a measure space, makes sense and is linear and order-preserving:

  ∫_X (r f(x) + r′ g(x)) dµ(x) = r ∫_X f(x) dµ(x) + r′ ∫_X g(x) dµ(x)

and

  ∫_X f(x) dµ(x) ≤ ∫_X g(x) dµ(x), if f(x) ≤ g(x) for all x ∈ X.
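On a finite probability space, these two quantities can be computed directly, which also previews a fact used later in the report: for indicator functions, the expected absolute difference equals the measure of the symmetric difference. The following is a minimal Python sketch of that check; the six-point space, its point masses, and the sets A and B are made-up illustrative choices.

```python
# A finite probability space: X = {0, ..., 5} with point masses summing to 1.
X = range(6)
mass = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.15, 4: 0.15, 5: 0.1}

def mu(A):
    """Measure of a subset A of X: sum of its point masses."""
    return sum(mass[x] for x in A)

def expected_abs_diff(f, g):
    """E_mu |f - g| on the finite space: a weighted sum stands in
    for the Lebesgue integral."""
    return sum(abs(f(x) - g(x)) * mass[x] for x in X)

A, B = {0, 1, 2}, {2, 3}
sym_diff = (A | B) - (A & B)  # the symmetric difference A triangle B

chi_A = lambda x: 1 if x in A else 0
chi_B = lambda x: 1 if x in B else 0

print(mu(sym_diff))                     # measure of A triangle B
print(expected_abs_diff(chi_A, chi_B))  # equals mu(A triangle B)
```

Both printed values agree, since |χ_A − χ_B| is exactly the indicator function of A △ B.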
Validating hypotheses in the PAC learning model uses the idea of measuring the symmetric difference of two subsets of a probability space (X, S, µ) and calculating the expected value of the difference of f, g: X → [0,1]. The structure of metric spaces arises naturally from these two notions.

Metric spaces

Definition 2.4. Let M be a nonempty set. A function d: M × M → ℝ⁺ is a metric if the following hold for all m₁, m₂, m₃ ∈ M:

1. d(m₁, m₂) = 0 if and only if m₁ = m₂
2. d(m₁, m₂) = d(m₂, m₁)
3. d(m₁, m₂) ≤ d(m₁, m₃) + d(m₃, m₂)

In this case, the pair (M, d) is called a metric space.

Definition 2.5. Given a metric space (M, d), a metric sub-space of M (which is a metric space in its own right) is a nonempty subset M′ ⊆ M equipped with the distance d|_{M′}, the restriction of d to M′.

The structure of a metric space exists in every vector space equipped with a norm.

Definition 2.6. Suppose V is a vector space over ℝ. A function ρ: V → ℝ⁺ is a norm on V if for all v₁, v₂ ∈ V and for all r ∈ ℝ:

1. ρ(r v₁) = |r| ρ(v₁)
2. ρ(v₁ + v₂) ≤ ρ(v₁) + ρ(v₂)
3. ρ(v₁) = 0 if and only if v₁ = 0

If ρ is a norm on V, then (V, ρ) is called a normed vector space.

Proposition 2.7. Based on Definition 2.6, the function d: V × V → ℝ⁺ defined by d(u, v) = ρ(u − v) is a metric on V; d is called the metric induced by the norm ρ on V.

The following subsection provides a few examples of metric spaces which will be encountered in this report.

Examples of metric spaces

The real numbers (ℝ, ρ), with the absolute value norm ρ(r) = |r| for r ∈ ℝ, form a normed vector space, so ℝ can be equipped with a metric structure.

Example 2.8. The set ℝ with distance d defined by d(r₁, r₂) = |r₁ − r₂| for r₁, r₂ ∈ ℝ is a metric space.
The unit interval [0, 1] is a subset of ℝ, so it is a metric sub-space of (ℝ, d), and this space will be used quite often in this report.

Given a probability space (X, S, µ), the set V of all bounded measurable functions from X to ℝ is a vector space, with point-wise addition and scalar multiplication. The function ρ: V → ℝ⁺ defined by

  ρ(f) = √( ∫_X (f(x))² dµ(x) )

is a norm on V if any two functions f, g: X → ℝ which agree on a subset of X with full measure, µ({x ∈ X : f(x) = g(x)}) = 1, are identified.¹ The norm ρ is called the L²(µ) norm on V, and we normally write ||f||₂ = ρ(f) for f ∈ V. As a result, V can be turned into a metric space.

Example 2.9. Following the notations in the paragraph above, V is a metric space with distance d defined by

  d(f, g) = ||f − g||₂ = √( ∫_X (f(x) − g(x))² dµ(x) ).

Write [0,1]^X for the set of all measurable functions from a probability space (X, S, µ) to [0,1]. Then, it is a metric sub-space of V with distance induced by the L²(µ) norm on V, restricted of course to [0,1]^X.

Given metric spaces (M₁, d₁), ..., (M_k, d_k), their product M₁ × ... × M_k always has a metric structure.

¹ This identification can be done using an equivalence relation, so this report will not go into any details here.

Example 2.10. If (M₁, d₁), ..., (M_k, d_k) are metric spaces, then their product M₁ × ... × M_k is a metric space with distance d₂ defined by

  d₂((m₁, ..., m_k), (m′₁, ..., m′_k)) = √( (d₁(m₁, m′₁))² + ... + (d_k(m_k, m′_k))² ).

The distance d₂ is normally referred to as the L² product distance on M₁ × ... × M_k.

From Examples 2.8 and 2.10, the set [0,1]^k, which denotes the set-theoretic product [0,1] × ... × [0,1], is then a metric space with distance d₂ defined by

  d₂((r₁, ..., r_k), (r′₁, ..., r′_k)) = √( |r₁ − r′₁|² + ... + |r_k − r′_k|² ).

Also, following Examples 2.9 and 2.10, if F₁, ..., F_k are sets of measurable functions from a probability space (X, S, µ) to the unit interval, then F_i ⊆ [0,1]^X for each i = 1, ..., k. Therefore, the product F₁ × ... × F_k is a metric space with distance defined by

  d₂((f₁, ..., f_k), (f′₁, ..., f′_k)) = √( (||f₁ − f′₁||₂)² + ... + (||f_k − f′_k||₂)² ).

3 The Probably Approximately Correct Learning Model

Let (X, S) be a measurable space. A concept class C of X is a subset of S, and an element A ∈ C (a measurable subset of X) is called a concept. A function class F is a collection of measurable functions from X to the unit interval [0,1]. Unless stated otherwise, from this section onwards, the following notations will be used:

1. X = (X, S): a measurable space
2. µ: a probability measure S → ℝ⁺
3. C: a concept class and F: a function class
4. [0,1]^X: the set of all measurable functions f: X → [0,1], instead of the customary notation of all functions from X to [0,1]

This section provides the definitions of learning C and F in the Probably Approximately Correct (PAC) learning model, introduced in 1984 by Valiant.

Concept class PAC learning involves producing a valid hypothesis for every concept A ∈ C by first drawing random points from X, forming a training sample labeled with whether these points are contained in A. In other words, a labeled sample of m points x₁, ..., x_m ∈ X for A consists of these points and the evaluations χ_A(x₁), ..., χ_A(x_m) of the indicator function χ_A: X → {0,1}, where

  χ_A(x) = 1 if and only if x ∈ A.

On the other hand, an unlabeled sample of points does not include these evaluations.
The set of all labeled samples of m points can then be identified with (X × {0,1})^m, and producing a hypothesis for A with a labeled sample is exactly the process of associating the sample to a concept H ∈ C (i.e. this process is a function from the set of all labeled samples to the concept class). Here is the precise definition of a concept class being learnable.

Definition 3.1 ([16]). A concept class C is distribution-free Probably Approximately Correct learnable if there exists an algorithm² L: ∪_{m∈ℕ} (X × {0,1})^m → C with the following property: for every ε > 0, for every δ > 0, there exists an M ∈ ℕ such that for every A ∈ C, for every probability measure µ, for every m ≥ M, for any x₁, ..., x_m ∈ X, we have µ(H_m △ A) < ε with confidence at least 1 − δ, where

  H_m = L((x₁, χ_A(x₁)), ..., (x_m, χ_A(x_m))).

Confidence of at least 1 − δ in the definition above, keeping to the same notations, simply means that the (product) measure of the set of all m-tuples (x₁, ..., x_m) ∈ X^m where µ(H_m △ A) < ε, with H_m = L((x₁, χ_A(x₁)), ..., (x_m, χ_A(x_m))), is at least 1 − δ. In other words, an equivalent statement to "C is distribution-free PAC learnable" is that for every ε, δ > 0, there exists M ∈ ℕ such that for every A ∈ C, probability measure µ, and m ≥ M,

  µ^m({(x₁, ..., x_m) ∈ X^m : µ(H_m △ A) ≥ ε}) ≤ δ,³

for H_m = L((x₁, χ_A(x₁)), ..., (x_m, χ_A(x_m))).

² In this report, a learning algorithm is simply defined to be a function.
In short, a concept class C is distribution-free learnable in the PAC learning model if a hypothesis H can always be constructed from an algorithm L for every concept A ∈ C, using any labeled sample for A, such that the measure of their symmetric difference H △ A is arbitrarily small with respect to every probability measure and with arbitrarily high confidence, as long as the sample size is large enough.

Every concept A ∈ C is a subset of X, so A can be associated to its indicator function χ_A: X → {0,1}. Even more generally, χ_A is a function from X to [0,1]; in other words, every concept class C can be identified with a function class F_C = {χ_A: X → [0,1] : A ∈ C}, so it is natural to generalize Definition 3.1 to any function class F.

Definition 3.1 involves the symmetric difference of two concepts; its generalization to measurable functions f, g: X → [0,1] is the expected value of their absolute difference E_µ(f, g), as seen in the previous section:

  E_µ(f, g) = ∫_X |f(x) − g(x)| dµ(x).

A simple exercise can show that if f, g ∈ [0,1]^X take values in {0,1}, so that they are indicator functions of two concepts A, B ⊆ X, then E_µ(f, g) coincides with the measure of their symmetric difference: E_µ(f, g) = µ(A △ B), where f = χ_A and g = χ_B.

With this generalization of the symmetric difference, distribution-free PAC learning for any function class can be defined. In the context of function class learning, a labeled sample of m points x₁, ..., x_m ∈ X for a function f ∈ F consists of these points and the evaluations f(x₁), ..., f(x_m). Then, the set of all labeled samples of m points can be identified with (X × [0,1])^m, and producing a hypothesis is the process of associating a labeled sample to a function H ∈ F (just as in concept class learning).

Definition 3.2 ([18]).
A function class F is distribution-free Probably Approximately Correct learnable if there exists an algorithm L: ∪_{m∈ℕ} (X × [0,1])^m → F with the following property: for every ε > 0, for every δ > 0, there exists an M ∈ ℕ such that for every f ∈ F, for every probability measure µ, for every m ≥ M, for any x₁, ..., x_m ∈ X, we have E_µ(H_m, f) < ε with confidence at least 1 − δ, where

  H_m = L((x₁, f(x₁)), ..., (x_m, f(x_m))).

³ The symbol µ^m denotes the product measure on X^m; the reader can refer to [6] for the details.

Both definitions of PAC learning contain the ε and δ parameters. The error parameter ε is used because the hypothesis is not required to have zero error, only an arbitrarily small error. The risk parameter δ exists because there is no guarantee that every collection of sufficiently large training samples leads to a valid hypothesis; the learning algorithm is only expected to produce a valid hypothesis from the sample points with confidence at least 1 − δ. Hence, the name "Probably (δ) Approximately (ε) Correct" is used [8].

The following example illustrates that the set of all axis-aligned rectangles in ℝ² is distribution-free PAC learnable. Both the statement and its proof can be found in Chapter 3 of [18] and Chapter 1 of [8].

Example 3.3. In X = ℝ², the concept class C = {[a,b] × [c,d] : a, b, c, d ∈ ℝ} is distribution-free PAC learnable.

Proof. Let ε, δ > 0. Given a concept A and any sample of m training points x₁, ..., x_m ∈ X, define the hypothesis concept H_m to be the intersection of all rectangles containing the training points x_i such that χ_A(x_i) = 1. In other words, H_m is the smallest rectangle that contains the sample points in A. Let µ be any probability measure. Since H_m ⊆ A, in fact H_m △ A = A \ H_m, which can be broken down into four sections T₁, ..., T₄.
If we can conclude that

  µ(⋃_{i=1}^{4} T_i) < ε

with confidence at least 1 − δ, then the proof is complete. Consider the top section T₁ and define T̃₁ to be the rectangle along the top part of A whose measure is exactly ε/4. The event T̃₁ ⊆ T₁, which is equivalent to µ(T₁) ≥ ε/4, holds exactly when no points in the sample x₁, ..., x_m fall in T̃₁, and the probability of this event (which is the measure of all such m-tuples (x₁, ..., x_m) ∈ X^m where x_i ∉ T̃₁ for all i = 1, ..., m) is

  (1 − ε/4)^m.

Similarly, the same holds for the other three sections T₂, ..., T₄. Therefore, the probability that there exists at least one T_i such that µ(T_i) ≥ ε/4, where i ∈ {1, ..., 4}, is at most

  4(1 − ε/4)^m.

Hence, as long as we pick m large enough that 4(1 − ε/4)^m ≤ δ, with confidence (probability) at least 1 − δ, we have µ(T_i) < ε/4 for every i = 1, ..., 4 and thus,

  µ(H_m △ A) = µ(⋃_{i=1}^{4} T_i) ≤ µ(T₁) + ... + µ(T₄) < 4(ε/4) = ε.

Please note that this argument, though very intuitive, actually requires the classical Glivenko-Cantelli theorem. In summary, as long as m ≥ (4/ε) ln(4/δ), with confidence at least 1 − δ, µ(H_m △ A) < ε. We note that this estimate of the sample size depends only on ε and δ, so C is indeed distribution-free PAC learnable.

In the next section, a fundamental theorem which characterizes concept class distribution-free PAC learning will be stated, and two more concept classes, one learnable and the other not,⁴ will be given. However, in order to state this theorem, the notion of shattering, which is essential in learning theory, must be introduced.

⁴ They are direct results of the theorem.
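The tightest-rectangle learner of Example 3.3 is simple enough to simulate. The following Python sketch is our own illustration, not part of the report: the target rectangle A, the uniform distribution on the unit square, and the tolerances ε and δ are made-up choices, and the sample size m ≥ (4/ε) ln(4/δ) is the bound from the proof.

```python
import math
import random

random.seed(0)  # deterministic illustration

def tightest_rectangle(sample):
    """The learner of Example 3.3: the smallest axis-aligned rectangle
    containing the positively labeled sample points (None if none exist)."""
    pos = [p for p, label in sample if label == 1]
    if not pos:
        return None
    xs = [p[0] for p in pos]
    ys = [p[1] for p in pos]
    return (min(xs), max(xs), min(ys), max(ys))

def inside(rect, p):
    return (rect is not None and rect[0] <= p[0] <= rect[1]
            and rect[2] <= p[1] <= rect[3])

# Hypothetical target concept and distribution: A = [0.2, 0.8] x [0.3, 0.7],
# with mu the uniform measure on the unit square.
A = (0.2, 0.8, 0.3, 0.7)
eps, delta = 0.1, 0.05
m = math.ceil(4 / eps * math.log(4 / delta))  # sample size from the proof

points = [(random.random(), random.random()) for _ in range(m)]
sample = [(p, 1 if inside(A, p) else 0) for p in points]
H = tightest_rectangle(sample)

# H is always contained in A, so mu(H triangle A) = mu(A \ H);
# estimate it by Monte Carlo.
trials = [(random.random(), random.random()) for _ in range(20000)]
err = sum(1 for p in trials if inside(A, p) != inside(H, p)) / len(trials)
print(m, err)  # with probability at least 1 - delta, err falls below eps
```

Since H_m only ever encloses positively labeled points, the containment H_m ⊆ A used in the proof holds on every run.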
4 The Vapnik-Chervonenkis Dimension

The Vapnik-Chervonenkis dimension is a combinatorial parameter which is defined using the notion of shattering, developed first in 1971 by Vapnik and Chervonenkis.

Definition 4.1 ([17]). Given any set X and a collection A of subsets of X, the collection A shatters a subset S ⊆ X if for every B ⊆ S, there exists A ∈ A such that A ∩ S = B.

There is an equivalent condition to shattering, which is sometimes easier to work with, expressed in terms of characteristic functions of subsets of X.

Proposition 4.2. The collection A shatters a subset S = {x₁, ..., x_n} ⊆ X if and only if for every e = (e₁, ..., e_n) ∈ {0,1}ⁿ, there exists A ∈ A such that χ_A(x_i) = e_i for all i = 1, ..., n.

Proof. Trivial.

Definition 4.3 ([17]). The Vapnik-Chervonenkis (VC) dimension of the collection A, denoted by VC(A), is defined to be the cardinality of the largest finite subset S ⊆ X shattered by A. If A shatters arbitrarily large finite subsets of X, then the VC dimension of A is defined to be ∞.

The VC dimension is defined for every collection A of subsets of any set X, so in particular, X = (X, S) can be a measurable space and A = C can be a concept class. The following are a few examples of how to calculate VC dimensions in the context of X = ℝⁿ. In order to prove that the VC dimension of a concept class C is d, we must provide a subset S ⊆ X with cardinality d which is shattered by C and prove that no subset with cardinality d + 1 can be shattered by C.

Example 4.4. If X = ℝ, then the powerset of X has infinite VC dimension. More generally, for every infinite set X, VC(P(X)) = ∞.

Example 4.5. In the space X = ℝ, let C = {[a,b] : a, b ∈ ℝ, a < b} be the collection of all closed intervals. Then, VC(C) = 2.

Proof.
Consider the subset S = {1, 2} ⊆ ℝ; C shatters S because

  [a, b] ∩ S = ∅      if a > 2 or b < 1
  [a, b] ∩ S = {1}    if a ≤ 1 ≤ b < 2
  [a, b] ∩ S = {2}    if 1 < a ≤ 2 ≤ b
  [a, b] ∩ S = {1, 2} if a ≤ 1 and b ≥ 2.

On the other hand, given any subset S = {x, y, z} ⊆ ℝ with three distinct points, assume the order to be x < y < z. Then, there is no closed interval in C containing x and z but not y.

Example 4.6. Consider the space X = ℝⁿ. A hyperplane H_{a,b} is defined by a nonzero vector a = (a₁, ..., a_n) ∈ ℝⁿ and a scalar b ∈ ℝ:

  H_{a,b} = {x = (x₁, ..., x_n) ∈ ℝⁿ : x · a = b} = {x = (x₁, ..., x_n) ∈ ℝⁿ : x₁a₁ + ... + x_na_n = b}.

Write C for the set of all hyperplanes: C = {H_{a,b} : a ∈ ℝⁿ \ {0}, b ∈ ℝ}. Then VC(C) = n.

Proof. Consider the subset S = {e₁, ..., e_n} ⊆ ℝⁿ, where e_i is the vector with 1 in the i-th component and 0 everywhere else. Suppose B ⊆ S; there are two cases to consider:

1. If B = ∅, then let a = (1, 1, ..., 1) ∈ ℝⁿ, and the hyperplane H_{a,−1} = {x ∈ ℝⁿ : x₁ + ... + x_n = −1} is disjoint from S.
2. If B ≠ ∅, then set a = (a₁, ..., a_n) ∈ ℝⁿ \ {0}, where a_i = χ_B(e_i). Then the hyperplane H_{a,1} = {x ∈ ℝⁿ : x₁a₁ + ... + x_na_n = 1} satisfies H_{a,1} ∩ S = B.

Moreover, no subset S = {x₁, ..., x_n, x_{n+1}} ⊆ ℝⁿ with cardinality n + 1 can be shattered by C. At best, there exists a unique hyperplane H_{a,b} containing n of these points, say {x₁, ..., x_n}, so if x_{n+1} ∈ H_{a,b}, then there is no hyperplane that includes x₁, ..., x_n but not x_{n+1}. Otherwise, if x_{n+1} ∉ H_{a,b}, then there is no hyperplane that includes all of x₁, ..., x_n, x_{n+1}.
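For a finite set of points, shattering can be checked mechanically via Proposition 4.2 by enumerating all labelings. The Python sketch below (our own illustration) does this for the closed intervals of Example 4.5, encoding the geometric fact from the proof that an interval cuts out of a finite set either the empty set or a contiguous run of its sorted points.

```python
from itertools import combinations, product

def intervals_shatter(S):
    """Brute-force check that closed intervals [a, b] shatter the finite
    set S: every 0/1 labeling of S must be cut out by some interval.
    The subsets an interval can cut out of S are exactly the empty set
    and the contiguous runs of the sorted points."""
    S = sorted(S)
    n = len(S)
    realizable = {frozenset()}
    for i in range(n):
        for j in range(i, n):
            realizable.add(frozenset(S[i:j + 1]))
    return all(
        frozenset(x for x, e in zip(S, labels) if e == 1) in realizable
        for labels in product([0, 1], repeat=n)
    )

print(intervals_shatter([1, 2]))  # True, so VC >= 2
# No 3-point subset of {0, ..., 9} is shattered, consistent with VC = 2.
print(any(intervals_shatter(c) for c in combinations(range(10), 3)))  # False
```

The failing labeling for any three points x < y < z is always the same one identified in the proof: label x and z with 1 and y with 0.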
The first example is trivial and the second is fairly well-known, seen in [8] and [10], but we believe the third, Example 4.6, is a new result.

A very important concept related to shattering is the growth of the number of possible subsets A ∩ S, for A ∈ C, as S ⊆ X increases in size. It is clear that this growth is always exponential if C has infinite VC dimension; Sauer's Lemma explains the growth when VC(C) < ∞.

4.1 Sauer's Lemma

Given a concept class C of X, another way to express that C shatters a subset S ⊆ X with cardinality n is to consider the set of all A ∩ S, where A ∈ C. Following Chapter 4 of [18], C shatters S if and only if |{A ∩ S : A ∈ C}| = 2ⁿ. More generally, for any subset S ⊆ X, define

  π(S; C) = |{A ∩ S : A ∈ C}| and π(n; C) = max_{|S|=n} π(S; C).

Then, the VC dimension of C can be expressed in terms of the growth of π(n; C) as n gets large.

Proposition 4.7. Given a concept class C, the following conditions are equivalent:

1. VC(C) ≥ n;
2. C shatters some subset S ⊆ X with cardinality n;
3. π(n; C) = 2ⁿ.

Moreover, the class C has infinite VC dimension if and only if π(n; C) = 2ⁿ for all n ∈ ℕ. Conversely, C has finite VC dimension, say VC(C) ≤ d, if and only if π(n; C) < 2ⁿ for all n > d.

Proof. The proof follows from the fact that C shatters S if and only if π(S; C) = 2ⁿ.

The extremely interesting fact, as seen in the next theorem, is that if C has finite VC dimension d, then π(n; C) is bounded by a polynomial in n of degree d, for n ≥ d. This result, called Sauer's Lemma, was first proven in 1972 by Sauer. In other words, as n gets large, π(n; C) is either always an exponential function with base 2 or eventually bounded by a polynomial function of a fixed degree.

Theorem 4.8 (Sauer's Lemma [12]). Suppose a concept class C has finite VC dimension d.
Then

  π(n; C) ≤ (en/d)^d,

for all n ≥ d ≥ 1.

Of course, everything in this subsection, including Sauer's Lemma, is true for any collection of subsets of any set, but in the context of statistical learning theory, Sauer's Lemma is particularly useful because it is used to prove the equivalence of a concept class having finite VC dimension and the class being distribution-free PAC learnable.

4.2 Characterization of concept class distribution-free PAC learning

The following is one of the main theorems concerning PAC learning, whose proof results from Vapnik and Chervonenkis' paper [17] in 1971 and the 1989 paper [5] by Blumer et al.

Theorem 4.9 ([17] and [5]). Let C be a concept class of a measurable space (X, S). The following are equivalent:

1. C is distribution-free Probably Approximately Correct learnable.
2. VC(C) < ∞.

Both directions of the proof require expressing the number of sample training points required for learning in terms of the VC dimension of C; Sauer's Lemma is used to provide a sufficient number of points required for learning in the direction 2) ⇒ 1).

Using Theorem 4.9, one can more easily determine whether a given concept class is distribution-free PAC learnable.

Example 4.10. Let X be any infinite set. Then the powerset P(X) is not distribution-free PAC learnable.

Example 4.11. The set of all hyperplanes C = {H_{a,b} : a ∈ ℝⁿ \ {0}, b ∈ ℝ}, as defined in Example 4.6, is distribution-free PAC learnable.

Both examples come directly from the calculations of their concept classes' VC dimensions in Examples 4.4 and 4.6 and from Theorem 4.9.

Every concept class C can be viewed as a function class F_C = {χ_A: X → [0,1] : A ∈ C}, as seen in Section 3, so a natural question is whether the notion of shattering can be generalized.
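As a numeric sanity check of Sauer's Lemma (ours, not from the report), one can count π(n; C) for the class C of closed intervals from Example 4.5, where VC(C) = 2. An interval cuts out of n sorted points either the empty set or one of the n(n+1)/2 contiguous runs, so π(n; C) = n(n+1)/2 + 1, a quadratic, which indeed stays below the bound (en/2)²:

```python
import math

def growth_intervals(n):
    """pi(n; C) for C = closed intervals: a trace A ∩ S is either the empty
    set or a contiguous run of the n sorted, distinct points of S."""
    points = list(range(n))
    traces = {frozenset()}
    for i in range(n):
        for j in range(i, n):
            traces.add(frozenset(points[i:j + 1]))
    return len(traces)

d = 2  # VC dimension of closed intervals (Example 4.5)
for n in range(d, 12):
    pi_n = growth_intervals(n)
    sauer = (math.e * n / d) ** d        # Sauer's bound (en/d)^d
    assert pi_n == n * (n + 1) // 2 + 1  # quadratic growth, not 2^n
    assert pi_n <= sauer
    print(n, pi_n, round(sauer, 1))
```

Note that the same count π(n; C) is independent of which n distinct points are chosen, so brute-forcing one representative set suffices here.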
Indeed, the next section introduces the Fat Shattering dimension of scale ε, which is a generalization of the VC dimension.

5 The Fat Shattering Dimension

Let ε > 0 from this section onwards. A combinatorial parameter which generalizes the Vapnik-Chervonenkis dimension is the Fat Shattering dimension of scale ε, defined first by Kearns and Schapire in 1994. This dimension, assigned to function classes, involves the notion of ε-shattering. Like the notion of (regular) shattering, it can be defined for any collection of functions f: X → [0,1], where X is any set, but for the sake of this report, the following sections (still) assume X = (X, S) is a measurable space and the collection of functions is a function class F.

Definition 5.1 ([7]). Let F be a function class. Given a subset S = {x₁, ..., x_n} ⊆ X, the class F ε-shatters S, with witness c = (c₁, ..., c_n) ∈ [0,1]ⁿ, if for every e ∈ {0,1}ⁿ, there exists f ∈ F such that

  f(x_i) ≥ c_i + ε for e_i = 1, and f(x_i) ≤ c_i − ε for e_i = 0.

Definition 5.2 ([7]). The Fat Shattering dimension of scale ε > 0 of F, denoted by fat_ε(F), is defined to be the cardinality of the largest finite subset of X that can be ε-shattered by F. If F can ε-shatter arbitrarily large finite subsets, then the Fat Shattering dimension of scale ε of F is defined to be ∞.

When the function class F consists only of functions taking values in {0,1}, the Fat Shattering dimension of any scale ε ≤ 1/2 of F agrees with the VC dimension of the corresponding collection of subsets of X induced by the (indicator) functions in F.

Proposition 5.3. Suppose a function class F consists only of binary functions f: X → {0,1}. For every f ∈ F, there exists a unique subset A_f ⊆ X such that χ_{A_f} = f. Moreover, writing C = {A_f : f ∈ F}, we have VC(C) = fat_ε(F) for all ε ≤ 0.5.

Proof.
The first statement, the existence of a unique subset A_f ⊆ X for every binary function f, is clear. Let ε ≤ 0.5. To show that VC(C) = fat_ε(F), it suffices to prove that C shatters S = {x₁, ..., x_n} if and only if F ε-shatters S. The equivalent condition to shattering seen in Proposition 4.2 will be used.

Suppose C shatters S and define c = (0.5, 0.5, ..., 0.5) ∈ [0,1]ⁿ. For every e ∈ {0,1}ⁿ, there exists A_f ∈ C, where f ∈ F, such that χ_{A_f}(x_i) = e_i for all i = 1, ..., n, and thus

  f(x_i) = χ_{A_f}(x_i) = e_i ≥ 0.5 + ε for e_i = 1, and f(x_i) = χ_{A_f}(x_i) = e_i ≤ 0.5 − ε for e_i = 0.

Conversely, suppose F ε-shatters S, with witness c = (c₁, ..., c_n) ∈ [0,1]ⁿ. Let e ∈ {0,1}ⁿ; there exists f ∈ F such that

  f(x_i) ≥ c_i + ε for e_i = 1, and f(x_i) ≤ c_i − ε for e_i = 0,

but f is binary and ε is strictly positive, so f(x_i) ≥ c_i + ε implies f(x_i) = 1 for e_i = 1, and f(x_i) ≤ c_i − ε implies f(x_i) = 0 for e_i = 0. As a result, consider A_f ∈ C: χ_{A_f}(x_i) = f(x_i) = e_i for all i = 1, ..., n. Therefore, VC(C) = fat_ε(F).

Here is an example of a commonly used function class which we proved, independently of any sources, to have infinite Fat Shattering dimension of scale ε.

Example 5.4. Let X = ℝ⁺ and let F be the set of all continuous functions f: X → [0,1]. Then fat_ε(F) = ∞ for all 0 < ε ≤ 0.5.

Proof. Suppose 0 < ε ≤ 0.5, and consider a collection of continuous [0,1]-valued functions defined as follows. Given e ∈ {0,1}^ℕ, a countable binary sequence, define f_e: X → [0,1] at the integers by

  f_e(x) = 1 if e_i = 1, and f_e(x) = 0 if e_i = 0, for x = i ∈ ℕ.

Otherwise, for x ∈ [m, m+1] with m ∈ ℕ,

  f_e(x) = −(x − m) + 1 if e_m = 1 and e_{m+1} = 0;
  f_e(x) = x − m        if e_m = 0 and e_{m+1} = 1;
  f_e(x) = e_m          if e_m = e_{m+1}.
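Proposition 5.3 can also be checked mechanically on a small domain. The Python sketch below is our own illustration, with a made-up binary class on X = {0, 1, 2}: a labeling realized by indicator functions is exactly an ε-shattering with witness (0.5, ..., 0.5).

```python
from itertools import product

def fat_shatters(F, S, c, eps):
    """Brute-force test of Definition 5.1: does F eps-shatter S with
    witness c? Every labeling e must be achieved by some f in F."""
    for e in product([0, 1], repeat=len(S)):
        ok = any(all((f(x) >= ci + eps) if ei == 1 else (f(x) <= ci - eps)
                     for x, ci, ei in zip(S, c, e))
                 for f in F)
        if not ok:
            return False
    return True

def vc_shatters(C, S):
    """Proposition 4.2: C shatters S iff every labeling is realized."""
    return all(any(all((x in A) == bool(ei) for x, ei in zip(S, e)) for A in C)
               for e in product([0, 1], repeat=len(S)))

# A binary class on X = {0, 1, 2}: all subsets of {0, 1}, via indicators.
C = [set(), {0}, {1}, {0, 1}]
F = [lambda x, A=A: 1 if x in A else 0 for A in C]

eps = 0.4            # any eps <= 0.5 works, per Proposition 5.3
S = [0, 1]
print(vc_shatters(C, S), fat_shatters(F, S, [0.5] * 2, eps))  # True True
print(fat_shatters(F, [0, 1, 2], [0.5] * 3, eps))             # False
```

The last line fails because no set in C contains the point 2, so no labeling demanding a value above 0.5 + ε there can be realized; this matches VC(C) = 2.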
For each $e \in \{0,1\}^{\mathbb{N}}$, $f_e$ is continuous because it is defined piecewise by line segments which agree on the overlaps. Write $\mathcal{F}' = \{f_e : e \in \{0,1\}^{\mathbb{N}}\}$, so that $\mathcal{F}' \subseteq \mathcal{F}$. To show that $\mathrm{fat}_\epsilon(\mathcal{F}) = \infty$, it suffices to prove that $\mathrm{fat}_\epsilon(\mathcal{F}') = \infty$. Consider the subset $S = \{1, \ldots, n\} \subseteq X$ for any $n \in \mathbb{N}$; the collection $\mathcal{F}'$ $\epsilon$-shatters $S$ with witness $c = (0.5, 0.5, \ldots, 0.5) \in [0,1]^n$. Indeed, each $e \in \{0,1\}^n$ can be extended to a countable binary sequence $\tilde{e}$, where $\tilde{e}_i = e_i$ for all $i = 1, \ldots, n$ and $\tilde{e}_i = 0$ otherwise. Then it is clear that $f_{\tilde{e}}(x_i) = 1 \geq c_i + \epsilon$ for $\tilde{e}_i = 1$ and $f_{\tilde{e}}(x_i) = 0 \leq c_i - \epsilon$ for $\tilde{e}_i = 0$, with $x_i = i \in S$ for $i = 1, \ldots, n$.

With the generalization from a concept class to a function class, a natural question is whether the finiteness of the Fat Shattering dimension of all scales $\epsilon$ for a function class $\mathcal{F}$ is equivalent to $\mathcal{F}$ being distribution-free PAC learnable. This question is addressed in the following subsection.

5.1 Sufficient condition for function class distribution-free PAC learning

One direction of Theorem 4.9 can be generalized and stated in terms of the Fat Shattering dimension of scale $\epsilon$ of a function class.

Theorem 5.5 ([1] and [18]). Let $\mathcal{F}$ be a function class. If $\mathrm{fat}_\epsilon(\mathcal{F}) < \infty$ for all $\epsilon > 0$, then $\mathcal{F}$ is distribution-free PAC learnable.

However, the converse to Theorem 5.5 is false: there exists a distribution-free PAC learnable function class with infinite Fat Shattering dimension of some scale $\epsilon$. In fact, for every concept class $\mathcal{C}$ with cardinality $\aleph_0$ or $2^{\aleph_0}$, there is an associated function class $\mathcal{F}_{\mathcal{C}}$, defined as follows. Set up a bijection $b : \mathcal{C} \to [0, 1/3]$ or $b : \mathcal{C} \to [0, 1/3] \cap \mathbb{Q}$, depending on the cardinality of $\mathcal{C}$, and for every $A \in \mathcal{C}$, define a function $f_A : X \to [0,1]$ by
$$f_A(x) = \chi_A(x) + (-1)^{\chi_A(x)}\, b(A).$$
Now write $\mathcal{F}_{\mathcal{C}} = \{f_A : A \in \mathcal{C}\}$.
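The two-valued behaviour of $f_A$ can be checked concretely. Below is a minimal Python sketch (our own illustration, not part of the report): the finite domain, the concept class of all its subsets, and the injective labelling map `b` into $[0, 1/3]$ are hypothetical choices made only so the construction can be evaluated on a small example.

```python
from itertools import combinations

# Illustrative sketch of the construction f_A(x) = chi_A(x) + (-1)^{chi_A(x)} * b(A).
# The finite domain X, the concept class C (all subsets of X), and the injective
# map b into [0, 1/3] are hypothetical choices for demonstration only.

X = [0, 1, 2]

C = [frozenset(c) for r in range(len(X) + 1) for c in combinations(X, r)]
b = {A: i / (3 * len(C)) for i, A in enumerate(C)}  # injective, values in [0, 1/3)

def f(A, x):
    chi = 1 if x in A else 0          # the indicator chi_A(x)
    return chi + (-1) ** chi * b[A]   # equals b(A) off A, and 1 - b(A) on A

# f_A takes only the two values b(A) and 1 - b(A), so a single labelled point
# whose value lies in {b(A), 1 - b(A)} already identifies A (and hence f_A).
A = frozenset({0, 2})
values = {f(A, x) for x in X}
print(sorted(values))  # [b(A), 1 - b(A)]
```

Because `b` is injective, distinct concepts produce distinct value pairs, which is exactly the "unique identification" property used in the proof of Proposition 5.6 below.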
Note that $\mathcal{F}_{\mathcal{C}}$ can be thought of as the collection of all indicator functions of $A \in \mathcal{C}$, except that each "indicator" function $f_A$ takes the two unique identifying values $b(A)$ and $1 - b(A)$, instead of simply $0$ and $1$. The following proposition provides many counterexamples to the converse of Theorem 5.5, which are much simpler than the one found in [18]. The construction of the function class $\mathcal{F}_{\mathcal{C}}$ and the proposition below are developed from an idea of Example 2.10 in [11].

Proposition 5.6. Let $\mathcal{C}$ be a concept class. The associated function class $\mathcal{F}_{\mathcal{C}} = \{f_A : A \in \mathcal{C}\}$, defined in the previous paragraph, is always distribution-free PAC learnable; this class has infinite Fat Shattering dimension of all scales $\epsilon < 1/6$ if $\mathcal{C}$ has infinite VC dimension.

Proof. The function class $\mathcal{F}_{\mathcal{C}}$ is distribution-free PAC learnable because every function $f_A \in \mathcal{F}_{\mathcal{C}}$ can be uniquely identified from just one point $x_0 \in X$ in any labeled sample: $f_A(x_0) \in \{b(A), 1 - b(A)\}$ uniquely determines $A$ and thus $f_A$.

Furthermore, suppose $\mathcal{C}$ has infinite VC dimension. Let $n \in \mathbb{N}$ be arbitrary; because $\mathrm{VC}(\mathcal{C}) = \infty$, there exists $S = \{x_1, \ldots, x_n\}$ such that $\mathcal{C}$ shatters $S$. Suppose $\epsilon < 1/6$; we claim that $\mathcal{F}_{\mathcal{C}}$ $\epsilon$-shatters $S$ with witness $c = (0.5, \ldots, 0.5) \in [0,1]^n$. Indeed, let $e \in \{0,1\}^n$; by Proposition 4.2, there exists $A \in \mathcal{C}$ such that $\chi_A(x_i) = e_i$ for all $i = 1, \ldots, n$. As a result, $f_A(x_i) = 1 - b(A) \geq 0.5 + \epsilon$ for $e_i = 1$ and $f_A(x_i) = b(A) \leq 0.5 - \epsilon$ for $e_i = 0$. Consequently, $\mathcal{F}_{\mathcal{C}}$ has infinite Fat Shattering dimension of all scales $\epsilon < 1/6$.

The next section explains the main result of our research: bounding the Fat Shattering dimension of scale $\epsilon$ of a composition function class which is built with a continuous logic connective.
6 The Fat Shattering Dimension of a Composition Function Class

The goals of this section are to construct a new function class from old ones by means of a continuous logic connective and to bound the Fat Shattering dimension of scale $\epsilon$ of the new function class in terms of the dimensions of the old ones. The following subsection provides this construction, which can be found in Chapter 4 of [18], in the context of concept classes using a connective of classical logic.

6.1 Construction in the context of concept classes

Let $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_k$ be concept classes, where $k \geq 2$, and let $u : \{0,1\}^k \to \{0,1\}$ be any function, commonly known as a connective of classical logic. A new collection of subsets of $X$ arises from $\mathcal{C}_1, \ldots, \mathcal{C}_k$ as follows. As mentioned earlier in this report, every element $A \in \mathcal{C}_i$ can be identified with a binary function $f : X \to \{0,1\}$, namely its characteristic function $f = \chi_A$, and vice versa. Now, for any $k$ functions $f_1, \ldots, f_k : X \to \{0,1\}$, where $f_i \in \mathcal{C}_i$ with $i = 1, \ldots, k$, consider a new function $u(f_1, \ldots, f_k) : X \to \{0,1\}$ defined by
$$u(f_1, \ldots, f_k)(x) = u(f_1(x), \ldots, f_k(x)).$$
The set of all possible $u(f_1, \ldots, f_k)$, denoted by $u(\mathcal{C}_1, \ldots, \mathcal{C}_k)$, is given by
$$u(\mathcal{C}_1, \ldots, \mathcal{C}_k) = \{u(f_1, \ldots, f_k) : f_i \in \mathcal{C}_i\}.$$
For instance, when $k = 2$, we can consider the "Exclusive Or" connective $\oplus : \{0,1\}^2 \to \{0,1\}$, defined by $p \oplus q = (p \wedge \neg q) \vee (\neg p \wedge q)$, which corresponds to the symmetric difference operation. Then our new concept class constructed from $\mathcal{C}_1$ and $\mathcal{C}_2$ is $\{A_1 \triangle A_2 : A_1 \in \mathcal{C}_1, A_2 \in \mathcal{C}_2\}$.

The next theorem states that if $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_k$ all have finite VC dimension to start with, then regardless of $u$, the new collection $u(\mathcal{C}_1, \ldots, \mathcal{C}_k)$ always has finite VC dimension.

Theorem 6.1 ([18]). Let $k \geq 2$, and suppose $\mathcal{C}_1, \ldots$
$, \mathcal{C}_k$ are concept classes, each viewed as a collection of binary functions, and $u : \{0,1\}^k \to \{0,1\}$ is any function. If the VC dimension of $\mathcal{C}_i$ is finite for all $i = 1, \ldots, k$, then there exists a constant $\alpha = \alpha_k$ (see footnote 5), which depends only on $k$, such that
$$\mathrm{VC}(u(\mathcal{C}_1, \ldots, \mathcal{C}_k)) < d \alpha k, \quad \text{where } d = \max_{i=1,\ldots,k} \mathrm{VC}(\mathcal{C}_i).$$

The proof of this theorem can be found in [18] and uses Sauer's Lemma to bound the VC dimension of $u(\mathcal{C}_1, \ldots, \mathcal{C}_k)$. The main objective of our project was to generalize this theorem to function classes, in terms of the Fat Shattering dimension of scale $\epsilon$; the connective of classical logic $u$ would have to be replaced by a continuous logic connective, a continuous function $u : [0,1]^k \to [0,1]$.

6.2 Construction of new function class with continuous logic connective

In first-order logic, there are only two truth-values, $0$ and $1$, so a connective is a function $\{0,1\}^k \to \{0,1\}$ in the classical sense. However, in continuous logic, truth-values can be found anywhere in the unit interval $[0,1]$. Therefore, we should consider a function $u : [0,1]^k \to [0,1]$, which will transform function classes, and require that $u$ be a continuous logic connective. In other words, $u$ should be continuous from the (product) metric space $[0,1]^k$ to the unit interval [19]; in fact, because $u$ is continuous from a compact metric space to a metric space, it is automatically uniformly continuous. The following provides the definition of a uniformly continuous function $u$ from any metric space to another, but we must first qualify $u$ with a modulus of uniform continuity.

Definition 6.2 (See e.g. [19]). A modulus of uniform continuity is any function $\delta : (0,1] \to (0,1]$.

Definition 6.3 (See e.g. [19]). Let $(M_1, d_1)$ and $(M_2, d_2)$ be two metric spaces.
A function $u : M_1 \to M_2$ is uniformly continuous if there exists (a modulus of uniform continuity) $\delta : (0,1] \to (0,1]$ such that for all $\epsilon \in (0,1]$ and $m_1, m_2 \in M_1$, if $d_1(m_1, m_2) < \delta(\epsilon)$, then $d_2(u(m_1), u(m_2)) < \epsilon$. Such a $\delta$ is called a modulus of uniform continuity for $u$.

In particular, $u : [0,1]^k \to [0,1]$, where $[0,1]^k$ is equipped with the $L_2$ product distance $d_2$, is uniformly continuous with modulus of uniform continuity $\delta$ if for every $\epsilon \in (0,1]$ and every $(r_1, \ldots, r_k), (r'_1, \ldots, r'_k) \in [0,1]^k$,
$$d_2((r_1, \ldots, r_k), (r'_1, \ldots, r'_k)) < \delta(\epsilon) \ \Rightarrow\ |u(r_1, \ldots, r_k) - u(r'_1, \ldots, r'_k)| < \epsilon.$$

(Footnote 5: more specifically, $\alpha = \alpha_k$ is the smallest integer such that $k < \alpha \log(e\alpha)$.)

Given function classes $\mathcal{F}_1, \ldots, \mathcal{F}_k$ and a uniformly continuous function $u : [0,1]^k \to [0,1]$, consider the new function class $u(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ defined by
$$u(\mathcal{F}_1, \ldots, \mathcal{F}_k) = \{u(f_1, \ldots, f_k) : f_i \in \mathcal{F}_i\},$$
where $u(f_1, \ldots, f_k)(x) = u(f_1(x), \ldots, f_k(x))$ for all $x \in X$, just as in Section 6.1 for concept classes, with $f_i \in \mathcal{F}_i$ and $i = 1, \ldots, k$. Our main result states that the Fat Shattering dimension of scale $\epsilon$ of $u(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ is bounded by a sum of the Fat Shattering dimensions of scale $\delta(\epsilon, k)$ of $\mathcal{F}_1, \ldots, \mathcal{F}_k$, where $\delta(\epsilon, k)$ is a function of the modulus of uniform continuity $\delta(\epsilon)$ for $u$ and of $k$. It is a known result, seen in Chapter 5 of [18], that this new class $u(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ has finite Fat Shattering dimension of all scales $\epsilon > 0$ (and thus is distribution-free PAC learnable) if each of $\mathcal{F}_1, \ldots, \mathcal{F}_k$ has finite Fat Shattering dimension of all scales, but no bounds were known.

6.3 Main Result

Fix $k \geq 2$; the following theorem is our main new result.

Theorem 6.4. Let $\epsilon > 0$, let $\mathcal{F}_1, \ldots$
$, \mathcal{F}_k$ be function classes of $X$, and let $u : [0,1]^k \to [0,1]$ be a uniformly continuous function with modulus of continuity $\delta(\epsilon)$. Then
$$\mathrm{fat}_\epsilon(u(\mathcal{F}_1, \ldots, \mathcal{F}_k)) \leq \left( \frac{K \log\!\big(4 c' k \sqrt{k} \,/\, (\delta(\epsilon/(2c'))\, \epsilon)\big)}{K' \log 2} \right) \sum_{i=1}^{k} \mathrm{fat}_{\frac{c\, \delta(\epsilon/(2c'))\, \epsilon}{k\sqrt{k}}}(\mathcal{F}_i),$$
where $c, c', K, K'$ are some absolute constants.

Extracting the actual values of these absolute constants is not easy, and we hope to find them in future research. For this reason, comparing the bound in Theorem 6.4 with the existing estimate for the VC dimension of a composition concept class is difficult; however, in statistical learning theory, estimates for function class learning are generally much worse than estimates for concept class learning.

In order to prove Theorem 6.4, for clarity, we first introduce an auxiliary function $\phi : \mathcal{F}_1 \times \ldots \times \mathcal{F}_k \to [0,1]^X$, which is uniformly continuous from the metric space $\mathcal{F}_1 \times \ldots \times \mathcal{F}_k$ with the $L_2$ product distance $\tilde{d}_2$ to the metric space $[0,1]^X$ with the distance induced by the $L_2(\mu)$ norm, and prove the following lemma.

Lemma 6.5. Let $\epsilon > 0$, let $\mathcal{F}_1, \ldots, \mathcal{F}_k$ be function classes of $X$, and let $\phi : \mathcal{F}_1 \times \ldots \times \mathcal{F}_k \to [0,1]^X$ be uniformly continuous with some modulus of continuity $\delta(\epsilon, k)$, a function of $\epsilon$ and $k$. Then
$$\mathrm{fat}_{c'\epsilon}(\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k)) \leq \left( \frac{K \log(2\sqrt{k}/\delta(\epsilon, k))}{K' \log 2} \right) \sum_{i=1}^{k} \mathrm{fat}_{\frac{c\,\delta(\epsilon,k)}{\sqrt{k}}}(\mathcal{F}_i),$$
where $c, c', K, K'$ are some absolute constants and the symbol $\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k)$ simply represents the image of $\phi$.

Then we will relate the two uniformly continuous functions $u$ and $\phi$.

Lemma 6.6. Let $\epsilon > 0$. If $u : [0,1]^k \to [0,1]$ is uniformly continuous with modulus of continuity $\delta(\epsilon)$, then the function $\phi : \mathcal{F}_1 \times \ldots \times \mathcal{F}_k \to [0,1]^X$ defined by $\phi(f_1, \ldots, f_k)(x) = u(f_1(x), \ldots$
$, f_k(x))$ is also uniformly continuous, with modulus of continuity $\frac{\delta(\epsilon/2)\,\epsilon}{2k}$, and in fact $\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k) = u(\mathcal{F}_1, \ldots, \mathcal{F}_k)$.

6.4 Proofs

In order to prove Lemma 6.5, we first introduce the concept of an $\epsilon$-covering number for any metric space, based on [9], and relate this number for a function class to its Fat Shattering dimension of scale $\epsilon$ using results from Mendelson and Vershynin [9] and Talagrand [15].

Definition 6.7. Let $\epsilon > 0$ and suppose $(M, d)$ is a metric space. The $\epsilon$-covering number of $M$, denoted by $N(M, \epsilon, d)$, is the minimal number $N$ such that there exist elements $m_1, m_2, \ldots, m_N \in M$ with the property that for all $m \in M$, there exists $i \in \{1, 2, \ldots, N\}$ for which $d(m, m_i) < \epsilon$. The set $\{m_1, m_2, \ldots, m_N\}$ is called a (minimal) $\epsilon$-net of $M$.

The following proposition relates the $\epsilon$-covering number of a product $M_1 \times \ldots \times M_k$ of metric spaces, with the $L_2$ product distance $d_2$, to the $\frac{\epsilon}{\sqrt{k}}$-covering number of each space $M_i$.

Proposition 6.8. Let $\epsilon > 0$ and suppose $(M_1, d_1), \ldots, (M_k, d_k)$ are metric spaces, each with finite $\frac{\epsilon}{\sqrt{k}}$-covering number $N_i = N(M_i, \frac{\epsilon}{\sqrt{k}}, d_i)$ for $i = 1, \ldots, k$. Then
$$N(M_1 \times \ldots \times M_k, \epsilon, d_2) \leq \prod_{i=1}^{k} N_i.$$

Proof. Let $C_i = \{a^i_1, \ldots, a^i_{N_i}\}$ be a minimal $\frac{\epsilon}{\sqrt{k}}$-net for $M_i$ with respect to the distance $d_i$, where $i = 1, \ldots, k$, and suppose $(a_1, \ldots, a_k) \in M_1 \times \ldots \times M_k$. Then, for each $i = 1, \ldots, k$, there exists $a^i_{j_i} \in C_i$, where $1 \leq j_i \leq N_i$, such that $d_i(a_i, a^i_{j_i}) < \frac{\epsilon}{\sqrt{k}}$. Hence,
$$d_2((a_1, \ldots, a_k), (a^1_{j_1}, \ldots, a^k_{j_k})) = \sqrt{(d_1(a_1, a^1_{j_1}))^2 + \ldots + (d_k(a_k, a^k_{j_k}))^2} < \sqrt{\left(\frac{\epsilon}{\sqrt{k}}\right)^2 + \ldots + \left(\frac{\epsilon}{\sqrt{k}}\right)^2} = \epsilon,$$
where each $(a^1_{j_1}, \ldots, a^k_{j_k}) \in C_1 \times \ldots \times C_k$, which has cardinality $\prod_{i=1}^{k} N_i$. Therefore, $N(M_1 \times \ldots$
$\times M_k, \epsilon, d_2) \leq \prod_{i=1}^{k} N_i$.

Also, if $u : M_1 \to M_2$ is any uniformly continuous function with a modulus of uniform continuity $\delta(\epsilon)$ from any metric space to another, then the image of a minimal $\delta(\epsilon)$-net of $M_1$ under $u$ becomes an $\epsilon$-net for $u(M_1)$.

Proposition 6.9. Let $\epsilon > 0$ and suppose $(M_1, d_1)$ and $(M_2, d_2)$ are two metric spaces. If a function $u : M_1 \to M_2$ is uniformly continuous with a modulus of continuity $\delta(\epsilon)$, then
$$N(u(M_1), \epsilon, d_2) \leq N(M_1, \delta(\epsilon), d_1),$$
where $u(M_1)$ denotes the image of $u$.

Proof. Suppose $N = N(M_1, \delta(\epsilon), d_1)$ is the $\delta(\epsilon)$-covering number for $M_1$ and let $\{m_1, \ldots, m_N\}$ be a $\delta(\epsilon)$-net for $M_1$. Then for every $u(m) \in u(M_1)$, where $m \in M_1$, there exists $i \in \{1, \ldots, N\}$ such that $d_1(m, m_i) < \delta(\epsilon)$, which implies $d_2(u(m), u(m_i)) < \epsilon$, as $u$ is uniformly continuous. As a result, the set $\{u(m_1), \ldots, u(m_N)\}$ is an $\epsilon$-net for $u(M_1)$, so $N(u(M_1), \epsilon, d_2) \leq N(M_1, \delta(\epsilon), d_1)$.

In particular, we can view $\mathcal{F}_1, \ldots, \mathcal{F}_k$ as metric spaces, all with distances induced by the $L_2(\mu)$ norm, and suppose $\phi : \mathcal{F}_1 \times \ldots \times \mathcal{F}_k \to [0,1]^X$ is uniformly continuous with modulus of continuity $\delta(\epsilon, k)$. Then, by Proposition 6.8, if $\mathcal{F}_1, \ldots, \mathcal{F}_k$ all have finite $\frac{\delta(\epsilon,k)}{\sqrt{k}}$-covering numbers, the metric space $\mathcal{F}_1 \times \ldots \times \mathcal{F}_k$, with the $L_2$ product metric $\tilde{d}_2$, also has a finite $\delta(\epsilon, k)$-covering number: writing $N(\mathcal{F}_i, \frac{\delta(\epsilon,k)}{\sqrt{k}}, L_2(\mu))$ for the $\frac{\delta(\epsilon,k)}{\sqrt{k}}$-covering number of $\mathcal{F}_i$, we have
$$N(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k, \delta(\epsilon, k), \tilde{d}_2) \leq \prod_{i=1}^{k} N\!\left(\mathcal{F}_i, \frac{\delta(\epsilon,k)}{\sqrt{k}}, L_2(\mu)\right).$$
Now, by Proposition 6.9,
$$N(\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k), \epsilon, L_2(\mu)) \leq N(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k, \delta(\epsilon, k), \tilde{d}_2) \leq \prod_{i=1}^{k} N\!\left(\mathcal{F}_i, \frac{\delta(\epsilon,k)}{\sqrt{k}}, L_2(\mu)\right).$$
In other words, the $\epsilon$-covering number for $\phi(\mathcal{F}_1 \times \ldots$
$\times \mathcal{F}_k)$ is bounded by a product of the $\frac{\delta(\epsilon,k)}{\sqrt{k}}$-covering numbers of each $\mathcal{F}_i$.

To prove Lemma 6.5, we now state the main theorem of a paper by Mendelson and Vershynin, which relates the $\epsilon$-covering number of a function class to its Fat Shattering dimension of scale $\epsilon$.

Theorem 6.10 ([9]). Let $\epsilon > 0$ and let $\mathcal{F}$ be a function class. Then for every probability measure $\mu$,
$$N(\mathcal{F}, \epsilon, L_2(\mu)) \leq \left(\frac{2}{\epsilon}\right)^{K\, \mathrm{fat}_{c\epsilon}(\mathcal{F})}$$
for absolute constants $c, K$.

And Talagrand provides the converse.

Theorem 6.11 ([15]). Following the notation of Theorem 6.10, there exists a probability measure $\mu$ such that
$$N(\mathcal{F}, \epsilon, L_2(\mu)) \geq 2^{K'\, \mathrm{fat}_{c'\epsilon}(\mathcal{F})}$$
for absolute constants $c', K'$.

Proof of Lemma 6.5. By Propositions 6.8 and 6.9,
$$N(\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k), \epsilon, L_2(\mu)) \leq \prod_{i=1}^{k} N\!\left(\mathcal{F}_i, \frac{\delta(\epsilon,k)}{\sqrt{k}}, L_2(\mu)\right),$$
so
$$\log N(\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k), \epsilon, L_2(\mu)) \leq \sum_{i=1}^{k} \log N\!\left(\mathcal{F}_i, \frac{\delta(\epsilon,k)}{\sqrt{k}}, L_2(\mu)\right).$$
By Theorem 6.10,
$$\log N\!\left(\mathcal{F}_i, \frac{\delta(\epsilon,k)}{\sqrt{k}}, L_2(\mu)\right) \leq K\, \mathrm{fat}_{\frac{c\,\delta(\epsilon,k)}{\sqrt{k}}}(\mathcal{F}_i) \log\!\left(\frac{2\sqrt{k}}{\delta(\epsilon,k)}\right)$$
for any probability measure $\mu$, where $c, K$ are absolute constants. Moreover, by Theorem 6.11, for some probability measure $\mu$ and absolute constants $c', K'$,
$$\log N(\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k), \epsilon, L_2(\mu)) \geq K'\, \mathrm{fat}_{c'\epsilon}(\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k)) \log 2,$$
and altogether,
$$\mathrm{fat}_{c'\epsilon}(\phi(\mathcal{F}_1 \times \ldots \times \mathcal{F}_k)) \leq \frac{\sum_{i=1}^{k} K\, \mathrm{fat}_{\frac{c\,\delta(\epsilon,k)}{\sqrt{k}}}(\mathcal{F}_i) \log(2\sqrt{k}/\delta(\epsilon,k))}{K' \log 2} = \left(\frac{K \log(2\sqrt{k}/\delta(\epsilon,k))}{K' \log 2}\right) \sum_{i=1}^{k} \mathrm{fat}_{\frac{c\,\delta(\epsilon,k)}{\sqrt{k}}}(\mathcal{F}_i).$$

Now all that is left is to prove Lemma 6.6.

Proof of Lemma 6.6. Suppose $u : [0,1]^k \to [0,1]$ is uniformly continuous with a modulus of continuity $\delta(\epsilon)$, where $[0,1]^k$ is a metric space with the $L_2$ product distance $d_2$. We claim that the function $\phi : \mathcal{F}_1 \times \ldots \times \mathcal{F}_k \to [0,1]^X$ defined by $\phi(f_1, \ldots, f_k)(x) = u(f_1(x), \ldots$
$, f_k(x))$ is uniformly continuous with modulus of continuity $\frac{\delta(\epsilon/2)\,\epsilon}{2k}$.

Let $\epsilon > 0$ and $(f_1, \ldots, f_k), (f'_1, \ldots, f'_k) \in \mathcal{F}_1 \times \ldots \times \mathcal{F}_k$. Suppose
$$\tilde{d}_2((f_1, \ldots, f_k), (f'_1, \ldots, f'_k)) = \sqrt{(\|f_1 - f'_1\|_2)^2 + \ldots + (\|f_k - f'_k\|_2)^2} < \frac{\delta(\epsilon/2)\,\epsilon}{2k} = \sqrt{\frac{\delta(\epsilon/2)^2 (\epsilon/2)^2}{k^2}}.$$
Hence, for each $i = 1, \ldots, k$,
$$\|f_i - f'_i\|_2 = \sqrt{\int_X (f_i(x) - f'_i(x))^2 \, d\mu(x)} < \sqrt{\frac{\delta(\epsilon/2)^2 (\epsilon/2)^2}{k^2}}.$$
Write $A_i = \left\{x \in X : |f_i(x) - f'_i(x)| \geq \sqrt{\frac{\delta(\epsilon/2)^2}{k}}\right\}$; we must have $\mu(A_i) < \frac{(\epsilon/2)^2}{k}$ for each $i = 1, \ldots, k$. Otherwise,
$$\int_X (f_i(x) - f'_i(x))^2 \, d\mu(x) = \int_{A_i} (f_i(x) - f'_i(x))^2 \, d\mu(x) + \int_{X \setminus A_i} (f_i(x) - f'_i(x))^2 \, d\mu(x)$$
$$\geq \mu(A_i) \left(\sqrt{\frac{\delta(\epsilon/2)^2}{k}}\right)^{\!2} + \int_{X \setminus A_i} (f_i(x) - f'_i(x))^2 \, d\mu(x) \geq \frac{(\epsilon/2)^2}{k} \cdot \frac{\delta(\epsilon/2)^2}{k} = \frac{\delta(\epsilon/2)^2 (\epsilon/2)^2}{k^2},$$
which is a contradiction.

Now write $A = A_1 \cup \ldots \cup A_k$, so that
$$X \setminus A = \left\{x \in X : |f_i(x) - f'_i(x)| < \sqrt{\frac{\delta(\epsilon/2)^2}{k}} \text{ for all } i = 1, \ldots, k\right\}.$$
Suppose $x \in X \setminus A$; then
$$d_2((f_1(x), \ldots, f_k(x)), (f'_1(x), \ldots, f'_k(x))) = \sqrt{|f_1(x) - f'_1(x)|^2 + \ldots + |f_k(x) - f'_k(x)|^2} < \sqrt{\frac{\delta(\epsilon/2)^2}{k} + \ldots + \frac{\delta(\epsilon/2)^2}{k}} = \delta(\epsilon/2).$$
Consequently, by the uniform continuity of $u$, for all $x \in X \setminus A$,
$$|u(f_1(x), \ldots, f_k(x)) - u(f'_1(x), \ldots, f'_k(x))| < \epsilon/2.$$
Finally,
$$\|\phi(f_1, \ldots, f_k) - \phi(f'_1, \ldots, f'_k)\|_2 = \sqrt{\int_X (u(f_1(x), \ldots, f_k(x)) - u(f'_1(x), \ldots, f'_k(x)))^2 \, d\mu(x)}$$
$$\leq \sqrt{\int_{X \setminus A} (u(f_1(x), \ldots, f_k(x)) - u(f'_1(x), \ldots, f'_k(x)))^2 \, d\mu(x)} + \sqrt{\int_A (u(f_1(x), \ldots$$
$, f_k(x)) - u(f'_1(x), \ldots, f'_k(x)))^2 \, d\mu(x)}$
$$< \sqrt{\int_{X \setminus A} (\epsilon/2)^2 \, d\mu(x)} + \sqrt{\int_A 1 \, d\mu(x)} \leq (\epsilon/2) + (\epsilon/2) = \epsilon,$$
since $\mu(A) \leq \sum_{i=1}^{k} \mu(A_i) \leq k \cdot \frac{(\epsilon/2)^2}{k} = (\epsilon/2)^2$.

Now we will prove our main theorem.

Proof of Theorem 6.4. By Lemma 6.6, if $u : [0,1]^k \to [0,1]$ is uniformly continuous with modulus of continuity $\delta(\epsilon)$, then $\phi : \mathcal{F}_1 \times \ldots \times \mathcal{F}_k \to [0,1]^X$, defined by $\phi(f_1, \ldots, f_k)(x) = u(f_1(x), \ldots, f_k(x))$, is also uniformly continuous with modulus of continuity $\frac{\delta(\epsilon/2)\,\epsilon}{2k}$. Then apply Lemma 6.5 with $\delta(\epsilon, k) = \frac{\delta(\epsilon/2)\,\epsilon}{2k}$; with a simple change of variables $c'\epsilon' \to \epsilon$, Theorem 6.4 follows directly.

Altogether, we can summarize the maps in this section in the following two diagrams (where $i$ is the diagonal map):
$$X \xrightarrow{\ i\ } X^k \xrightarrow{\ f_1 \times \ldots \times f_k\ } [0,1]^k \xrightarrow{\ u\ } [0,1], \qquad \mathcal{F}_1 \times \ldots \times \mathcal{F}_k \xrightarrow{\ \phi\ } [0,1]^X.$$

This result is potentially useful because it allows us to construct new function classes using common continuous logic connectives and bound their Fat Shattering dimensions of scale $\epsilon$. For instance, the function $u : [0,1]^2 \to [0,1]$ defined by $u(r_1, r_2) = r_1 \cdot r_2$ (multiplication) is uniformly continuous with modulus of continuity $\delta(\epsilon) = \frac{\epsilon}{2}$. Indeed, let $\epsilon > 0$ and consider $(r_1, r_2), (r'_1, r'_2) \in [0,1]^2$. Suppose $d_2((r_1, r_2), (r'_1, r'_2)) < \delta(\epsilon) = \frac{\epsilon}{2}$, so $|r_1 - r'_1| \leq \sqrt{|r_1 - r'_1|^2 + |r_2 - r'_2|^2} < \frac{\epsilon}{2}$, and similarly $|r_2 - r'_2| < \frac{\epsilon}{2}$. Then
$$|u(r_1, r_2) - u(r'_1, r'_2)| = |r_1 r_2 - r'_1 r'_2| = |r_1 r_2 - r_1 r'_2 + r_1 r'_2 - r'_1 r'_2| = |r_1 (r_2 - r'_2) + r'_2 (r_1 - r'_1)|$$
$$\leq |r_1 (r_2 - r'_2)| + |r'_2 (r_1 - r'_1)| \leq |r_2 - r'_2| + |r_1 - r'_1| < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon.$$
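The modulus computed above for the multiplication connective can also be checked numerically. The following Python sketch (our own sanity check, not from the report) samples random pairs in $[0,1]^2$ at $d_2$-distance less than $\delta(\epsilon) = \epsilon/2$ and verifies that the product changes by less than $\epsilon$; the sampling scheme and seed are arbitrary choices.

```python
import random

# Numerical sanity check that u(r1, r2) = r1 * r2 on [0,1]^2 admits the modulus
# of uniform continuity delta(eps) = eps / 2, as derived above.

def u(r1, r2):
    return r1 * r2

def d2(r, s):
    return ((r[0] - s[0]) ** 2 + (r[1] - s[1]) ** 2) ** 0.5

random.seed(0)
eps = 0.1
delta = eps / 2
for _ in range(10_000):
    r = (random.random(), random.random())
    # perturb each coordinate by less than delta/2, clamped back into [0, 1]
    s = tuple(min(1.0, max(0.0, x + random.uniform(-delta / 2, delta / 2))) for x in r)
    assert d2(r, s) < delta            # the perturbation stays within delta ...
    assert abs(u(*r) - u(*s)) < eps    # ... so the product moves by less than eps
print("delta(eps) = eps/2 held on all sampled pairs")
```

Such a check does not prove anything, of course, but it is a quick way to catch an arithmetic slip in a proposed modulus before attempting the proof.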
As a result, if $\mathcal{F}_1$ and $\mathcal{F}_2$ are two function classes with finite Fat Shattering dimensions of some scale $\epsilon$, then the function class $u(\mathcal{F}_1, \mathcal{F}_2) = \mathcal{F}_1 \mathcal{F}_2 = \{f_1 \cdot f_2 : f_1 \in \mathcal{F}_1, f_2 \in \mathcal{F}_2\}$, defined by pointwise multiplication, also has finite Fat Shattering dimension of scale $\epsilon$, up to some constant factor, and Theorem 6.4 provides a precise bound.

We have made an interesting connection, which has not been explored much in the past, between continuous logic and PAC learning, and we plan to investigate this connection even further. For instance, the relationship between compositions of function classes and continuous logic may be interesting to study, because compositions of uniformly continuous functions are again uniformly continuous. Furthermore, we can try to add some topological structure to concept classes to see how PAC learning is affected. The next section provides a couple of other possible future research topics.

7 Open Questions

The definitions of distribution-free PAC learning, for both concept and function classes, in Section 3 made no assumptions about probability measures, as a learning algorithm has to produce a valid hypothesis for any probability measure $\mu$. If we fix a probability measure $\mu$ and ask whether a concept class, or a function class, is PAC learnable, then we are working in the context of fixed distribution PAC learning.

Definition 7.1 ([18]). Let $\mu$ be a probability measure. A function class $\mathcal{F}$ is Probably Approximately Correct learnable under $\mu$ if there exists an algorithm $L : \cup_{m \in \mathbb{N}} (X \times [0,1])^m \to \mathcal{F}$ with the following property: for every $\epsilon > 0$ and every $\delta > 0$, there exists $M \in \mathbb{N}$ such that for every $f \in \mathcal{F}$, every $m \geq M$, and any $x_1, \ldots, x_m \in X$, we have $E_\mu(H_m, f) < \epsilon$ with confidence at least $1 - \delta$, where
$$E_\mu(H_m, f) = \int_X |f(x) - H_m(x)| \, d\mu(x)$$
and $H_m = L((x_1, f(x_1)), \ldots$
$, (x_m, f(x_m)))$.

When a function class $\mathcal{F}$ consists only of binary functions, i.e. $\mathcal{F} = \mathcal{C}$ is a concept class, there is a theorem, proved by Benedek and Itai in 1991, which gives a characterization of fixed distribution PAC learnability.

Theorem 7.2 ([4]). Fix a probability measure $\mu$ and consider a concept class $\mathcal{C}$. The following are equivalent:

1. $\mathcal{C}$ is Probably Approximately Correct learnable under $\mu$.

2. (Finite Metric Entropy condition) The $\epsilon$-covering number of $\mathcal{C}$, when viewed as a metric space with distance $d(A, A') = \mu(A \triangle A')$, is finite for every $\epsilon > 0$.

However, there is no such characterization for fixed distribution PAC learnability of a general function class. Talagrand proved that a function class is a Glivenko-Cantelli (GC) function class with regard to a single measure $\mu$ if and only if the class has no witness of irregularity, a property that involves shattering [13], [14]. Every GC function class is PAC learnable under $\mu$ [11], but the property of having no witness of irregularity is strictly stronger than PAC learnability. We would like to propose the following conjecture for a possible characterization.

Conjecture 7.3. Fix a probability measure $\mu$ and consider a function class $\mathcal{F}$. Let $\epsilon > 0$. The following are equivalent:

1. The function class $\mathcal{F}$ is PAC learnable under $\mu$ to accuracy $\epsilon$. (Being PAC learnable to accuracy $\epsilon$ means Definition 7.1 is satisfied, but only for this particular $\epsilon$.)

2. There exist $M$, $N$, and $\gamma > 0$ such that for all functions $f \in \mathcal{F}$, with probability at least $\gamma$, the set $\{g \in \mathcal{F} : g|_{\bar{x}_N} = f|_{\bar{x}_N}\}$ has an $\epsilon$-covering number, with respect to the distance $d = E_\mu(\cdot\,,\cdot)$, of at most $M$, where $\bar{x}_N$ denotes a sample of $N$ points.

A very interesting research topic is to study this conjecture and either prove or disprove it.
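Theorem 7.2's metric-entropy condition is easy to experiment with on finite examples. Below is a small Python sketch (our illustration; the finite domain and uniform measure are hypothetical choices) that equips a concept class with the distance $d(A, A') = \mu(A \triangle A')$ and upper-bounds the $\epsilon$-covering number of Definition 6.7 with a greedy net.

```python
from itertools import combinations

# Greedy upper bound on the eps-covering number N(C, eps, d) of Definition 6.7,
# for the metric d(A, A') = mu(A symmetric-difference A') from Theorem 7.2.
# The finite domain and the uniform measure mu are hypothetical choices.

X = range(6)
mu = 1 / len(X)  # uniform probability measure on X

C = [frozenset(c) for r in range(len(X) + 1) for c in combinations(X, r)]

def d(A, B):
    return len(A ^ B) * mu  # measure of the symmetric difference

def greedy_net(concepts, eps):
    """Return centers such that every concept is strictly within eps of one of them."""
    net = []
    for A in concepts:
        if all(d(A, B) >= eps for B in net):
            net.append(A)
    return net

for eps in (0.2, 0.4, 0.6):
    net = greedy_net(C, eps)
    # every concept lies strictly within eps of some net element,
    # so N(C, eps, d) <= len(net): the metric entropy is finite at this scale
    assert all(any(d(A, B) < eps for B in net) for A in C)
    print(eps, len(net))
```

The greedy net is generally not minimal, so it only witnesses finiteness of the covering number; on a finite domain the condition holds trivially, which is why the interesting cases of Theorem 7.2 involve infinite domains.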
Also, by Proposition 5.6, the finiteness of the Fat Shattering dimension of all scales $\epsilon > 0$ does not characterize function class PAC learning in the distribution-free case; consequently, another topic of research would be to come up with a new combinatorial parameter for a function class, related to the notion of shattering, which would characterize learning. This new parameter would have to solve the problem of unique identification of functions, a problem that does not occur with concept classes.

Yet another possible research topic is to generalize the definitions of PAC learning and introduce observation noise, both in the fixed distribution and distribution-free cases. The paper [3], written by Bartlett et al., proves that the finiteness of the Fat Shattering dimension of all scales of a function class $\mathcal{F}$ is equivalent to $\mathcal{F}$ being distribution-free learnable under certain noise distributions. It would be interesting to generalize this result and/or apply it in the fixed distribution setting.

8 Conclusion

This report introduces the definitions of Probably Approximately Correct learning for concept and function classes and defines the Vapnik-Chervonenkis dimension for concept classes and the Fat Shattering dimension of scale $\epsilon > 0$ for function classes. Finiteness of the VC dimension characterizes concept class distribution-free PAC learning; however, the finiteness of the Fat Shattering dimension of all scales $\epsilon$ is still only sufficient for function class learning, and not necessary. Given function classes $\mathcal{F}_1, \ldots, \mathcal{F}_k$, one can construct a new class $u(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ using a continuous function $u : [0,1]^k \to [0,1]$, a continuous logic connective. The main new result of this report shows that the Fat Shattering dimension of scale $\epsilon$ of $u(\mathcal{F}_1, \ldots, \mathcal{F}_k)$ is bounded by a sum of the Fat Shattering dimensions of scale $\delta(\epsilon, k)$ of the classes $\mathcal{F}_1, \ldots$
$, \mathcal{F}_k$, up to some absolute constants. This result can be useful because it allows us to construct new function classes, which may be very natural objects, and bound their Fat Shattering dimensions.

References

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-Sensitive Dimensions, Uniform Convergence, and Learnability. Journal of the ACM 44.4 (1997), 615-631.

[2] G. Auliac and J. Y. Caby. Mathématiques: Topologie et Analyse, 3rd Ed. Belgium: EdiScience, 2005.

[3] P. L. Bartlett, P. M. Long, and R. C. Williamson. Fat-Shattering and the Learnability of Real-Valued Functions. Journal of Computer and System Sciences 52.3 (1994), 434-452.

[4] G. M. Benedek and A. Itai. Learnability with respect to Fixed Distributions. Theoretical Computer Science 86.2 (1991), 377-389.

[5] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis Dimension. Journal of the ACM 36.4 (1989), 929-965.

[6] J. L. Doob. Measure Theory. New York: Springer-Verlag, 1994.

[7] M. J. Kearns and R. Schapire. Efficient Distribution-free Learning of Probabilistic Concepts. Journal of Computer and System Sciences 48.3 (1994), 464-497.

[8] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. Cambridge, Massachusetts: The MIT Press, 1994.

[9] S. Mendelson and R. Vershynin. Entropy and the Combinatorial Dimension. Inventiones Mathematicae 152 (2003), 37-55.

[10] V. Pestov. Indexability, Concentration, and VC Theory. Invited paper, Proc. of the 3rd International Conf. on Similarity Search and Applications (SISAP 2010), 3-12.

[11] V. Pestov.
A Note on Sample Complexity of Learning Binary Output Neural Networks Under Fixed Input Distributions. Proc. 2010 Eleventh Brazilian Symposium on Neural Networks, IEEE Computer Society, Los Alamitos-Washington-Tokyo (2010), 7-12.

[12] N. Sauer. On the Densities of Families of Sets. J. Combinatorial Theory 13 (1972), 145-147.

[13] M. Talagrand. The Glivenko-Cantelli Problem. Annals of Probability 15 (1987), 837-870.

[14] M. Talagrand. The Glivenko-Cantelli Problem, Ten Years Later. J. Theoret. Probab. 9 (1996), 371-384.

[15] M. Talagrand. Vapnik-Chervonenkis Type Conditions and Uniform Donsker Classes of Functions. Annals of Probability 31.3 (2003), 1565-1582.

[16] L. G. Valiant. A Theory of the Learnable. Communications of the ACM 27.11 (1984), 1134-1142.

[17] V. N. Vapnik and A. Y. Chervonenkis. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Prob. and its Appl. 16.2 (1971), 264-280.

[18] M. Vidyasagar. A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. London: Springer-Verlag London Limited, 1997.

[19] I. B. Yaacov, A. Berenstein, C. W. Henson, and A. Usvyatsov. Model Theory for Metric Structures. London Math Society Lecture Note Series 350 (2008), 315-427.