PAC learnability under non-atomic measures: a problem by Vidyasagar

Vladimir Pestov

Departamento de Matemática, Universidade Federal de Santa Catarina, Campus Universitário Trindade, CEP 88.040-900 Florianópolis-SC, Brasil¹
Department of Mathematics and Statistics, University of Ottawa, 585 King Edward Avenue, Ottawa, Ontario, K1N6N5 Canada²

Abstract

In response to a 1997 problem of M. Vidyasagar, we state a criterion for PAC learnability of a concept class $\mathscr{C}$ under the family of all non-atomic (diffuse) measures on the domain $\Omega$. The uniform Glivenko–Cantelli property with respect to non-atomic measures is no longer a necessary condition, and consistent learnability cannot in general be expected. Our criterion is stated in terms of a combinatorial parameter $\mathrm{VC}(\mathscr{C} \bmod \omega_1)$ which we call the VC dimension of $\mathscr{C}$ modulo countable sets. The new parameter is obtained by “thickening up” single points in the definition of VC dimension to uncountable “clusters”. Equivalently, $\mathrm{VC}(\mathscr{C} \bmod \omega_1) \le d$ if and only if every countable subclass of $\mathscr{C}$ has VC dimension $\le d$ outside a countable subset of $\Omega$. The new parameter can also be expressed as the classical VC dimension of $\mathscr{C}$ calculated on a suitable subset of a compactification of $\Omega$. We do not make any measurability assumptions on $\mathscr{C}$, assuming instead the validity of Martin's Axiom (MA). Similar results are obtained for function learning in terms of fat-shattering dimension modulo countable sets, but, just like in the classical distribution-free case, the finiteness of this parameter is sufficient but not necessary for PAC learnability under non-atomic measures.

Keywords: PAC learnability, non-atomic measures, learning rule, uniform Glivenko–Cantelli classes, Martin's Axiom, VC dimension modulo countable sets, fat shattering dimension modulo countable sets

2010 MSC: 68T05, 03E05

1. Introduction

A fundamental result of statistical learning theory says that under some mild measurability assumptions on a concept class $\mathscr{C}$ the following three conditions are equivalent: (1) $\mathscr{C}$ is distribution-free PAC learnable over the family $P(\Omega)$ of all probability measures on the domain $\Omega$, (2) $\mathscr{C}$ is a uniform Glivenko–Cantelli class with respect to $P(\Omega)$, and (3) the Vapnik–Chervonenkis dimension of $\mathscr{C}$ is finite [17, 18, 4]. In this paper we are interested in the problem, discussed by Vidyasagar in both editions of his book [19, 20] as problem 12.8, of giving a similar combinatorial description of concept classes $\mathscr{C}$ which are PAC learnable under the family $P_{na}(\Omega)$ of all non-atomic probability measures on $\Omega$. (A measure $\mu$ is non-atomic, or diffuse, if every set $A$ of strictly positive measure contains a subset $B$ with $0 < \mu(B) < \mu(A)$.)

The condition $\mathrm{VC}(\mathscr{C}) < \infty$, while of course sufficient for $\mathscr{C}$ to be learnable under $P_{na}(\Omega)$, is not necessary. Let a concept class $\mathscr{C}$ consist of all finite and all cofinite subsets of a standard Borel space $\Omega$. Then $\mathrm{VC}(\mathscr{C}) = \infty$, and moreover $\mathscr{C}$ is clearly not a uniform Glivenko–Cantelli class with respect to non-atomic measures. At the same time, $\mathscr{C}$ is PAC learnable under non-atomic measures: any learning rule $\mathcal{L}$ consistent with the subclass $\{\emptyset, \Omega\}$ will learn $\mathscr{C}$. Notice that $\mathscr{C}$ is not consistently learnable under non-atomic measures: there are consistent learning rules mapping every training sample to a finite set, and they will not learn any cofinite subset of $\Omega$.

¹ Pesquisador Visitante do CNPq.
² Permanent address.
Preprint submitted to Theoretical Computer Science, November 21, 2018.

The most salient feature of this example is that PAC learnability of a concept class $\mathscr{C}$ under non-atomic measures is not affected by adding to $\mathscr{C}$ symmetric differences $C \,\triangle\, N$ for each $C \in \mathscr{C}$ and every countable set $N$.
A version of VC dimension oblivious to this kind of set-theoretic “noise” is obtained from the classical definition by “thickening up” individual points and replacing them with uncountable clusters (Figure 1).

[Figure 1: A family $A_1, A_2, \ldots, A_n$ of uncountable sets shattered by $\mathscr{C}$.]

Define the VC dimension of a concept class $\mathscr{C}$ modulo countable sets as the supremum of natural $n$ for which there exists a family of $n$ uncountable sets, $A_1, A_2, \ldots, A_n \subseteq \Omega$, shattered by $\mathscr{C}$ in the sense that for each $J \subseteq \{1, 2, \ldots, n\}$, there is $C \in \mathscr{C}$ which contains all sets $A_i$, $i \in J$, and is disjoint from all sets $A_j$, $j \notin J$. Denote this parameter by $\mathrm{VC}(\mathscr{C} \bmod \omega_1)$. Clearly, for every concept class $\mathscr{C}$,
\[
\mathrm{VC}(\mathscr{C} \bmod \omega_1) \le \mathrm{VC}(\mathscr{C}).
\]
In our example above, one has $\mathrm{VC}(\mathscr{C} \bmod \omega_1) = 1$, even as $\mathrm{VC}(\mathscr{C}) = \infty$.

Our main theorem for PAC concept learning under non-atomic measures requires an additional set-theoretic hypothesis, Martin's Axiom (MA) [8, 9, 11]. This is one of the most often used and best studied additional set-theoretic assumptions beyond the standard Zermelo-Fraenkel set theory with the Axiom of Choice (ZFC). Here is one of the equivalent forms. Let $B$ be a Boolean algebra satisfying the countable chain condition (that is, every family of pairwise disjoint elements of $B$ is countable). Then for every family $\mathcal{X}$ of cardinality $< 2^{\aleph_0}$ of subsets of $B$ there is a maximal ideal $\xi$ (element of the Stone space of $B$) with the property: each $X \in \mathcal{X}$ disjoint from $\xi$ admits an upper bound $x \notin \xi$.

The above conclusion holds unconditionally if $\mathcal{X}$ is countable (due to the Baire Category Theorem), and thus Martin's Axiom follows from the Continuum Hypothesis (CH).
At the same time, MA is compatible with the negation of CH, and in fact it is precisely the combination MA + $\neg$CH that is really interesting.

As a consequence of Martin's Axiom, the usual sigma-additivity of a measure can be strengthened as follows: the union of $< 2^{\aleph_0}$ Lebesgue measurable sets is Lebesgue measurable. Essentially, this is the only property we need in the proof of the following result.

Theorem 1.1. Let $(\Omega, \mathcal{A})$ be a standard Borel space, and let $\mathscr{C} \subseteq \mathcal{A}$ be a concept class. Under Martin's Axiom, the following are equivalent.
1. $\mathscr{C}$ is PAC learnable under the family of all non-atomic measures.
2. $\mathrm{VC}(\mathscr{C} \bmod \omega_1) = d < \infty$.
3. Every countable subclass $\mathscr{C}' \subseteq \mathscr{C}$ has finite VC dimension on the complement to some countable subset of $\Omega$ (which depends on $\mathscr{C}'$).
4. There is $d$ such that for every countable $\mathscr{C}' \subseteq \mathscr{C}$ one has $\mathrm{VC}(\mathscr{C}') \le d$ on the complement to some countable subset of $\Omega$ (depending on $\mathscr{C}'$).
5. Every countable subclass $\mathscr{C}' \subseteq \mathscr{C}$ is a uniform Glivenko–Cantelli class with respect to the family of non-atomic measures.
6. Every countable subclass $\mathscr{C}' \subseteq \mathscr{C}$ is a uniform Glivenko–Cantelli class with respect to the family of non-atomic measures, with sample complexity $s(\epsilon, \delta)$ which only depends on $\mathscr{C}$ and not on $\mathscr{C}'$.
If $\mathscr{C}$ is universally separable [15], the above are also equivalent to:
7. VC dimension of $\mathscr{C}$ is finite outside of a countable subset of $\Omega$.
8. $\mathscr{C}$ is a uniform Glivenko–Cantelli class with respect to the family of non-atomic probability measures.
9. $\mathscr{C}$ is consistently PAC learnable under the family of all non-atomic measures.

Notice that for universally separable classes, (1)–(9) are pairwise equivalent without additional set-theoretic assumptions.
(A class $\mathscr{C}$ is universally separable if it contains a countable subclass $\mathscr{C}'$ which is universally dense: for each $C \in \mathscr{C}$ there is a sequence $(C_n)$, $C_n \in \mathscr{C}'$, such that the indicator functions $I_{C_n}$ converge to $I_C$ pointwise.) The concept class in the above example (which is even image admissible Souslin [6], but not universally separable) shows that in general (7), (8) and (9) are not equivalent to the remaining conditions.

The core of Theorem 1.1, and the main technical novelty of our paper, is the proof of the implication (3) $\Rightarrow$ (1). It is based on a special choice of a consistent learning rule $\mathcal{L}$ having the property that for every concept $C \in \mathscr{C}$, the image of all learning samples of the form $(\sigma, C \cap \sigma)$ under $\mathcal{L}$ forms a uniform Glivenko–Cantelli class. It is for establishing this property of $\mathcal{L}$ that we need Martin's Axiom.

Most of the remaining implications are relatively straightforward adaptations of the standard techniques of statistical learning. Nevertheless, (2) $\Rightarrow$ (3) requires a certain technical dexterity, and we study this implication in the setting of Boolean algebras.

An analog of Theorem 1.1 also holds for PAC learning of function classes. In this case, we are employing a version of fat shattering dimension [1], which we call fat shattering dimension modulo countable sets and denote $\mathrm{fat}_\epsilon(\mathscr{F} \bmod \omega_1)$. However, just like in the classical case, finiteness of this combinatorial parameter at every scale $\epsilon > 0$, while sufficient for PAC learnability of a function class $\mathscr{F}$ under non-atomic measures, is not necessary. It is easy to construct a function class $\mathscr{F}$ with $\mathrm{fat}_\epsilon(\mathscr{F} \bmod \omega_1) = \infty$ which is distribution-free probably exactly learnable (Example 7.3).
Recall that a function $f \colon X \to Y$ between two measurable spaces (sets equipped with sigma-algebras of subsets) is universally measurable if for every measurable subset $A \subseteq Y$ and every probability measure $\mu$ on $X$ the set $f^{-1}(A)$ is $\mu$-measurable. For instance, Borel functions are universally measurable.

Theorem 1.2. Let $\Omega$ be a standard Borel space, and let $\mathscr{F}$ be a class of universally measurable functions on $\Omega$ with values in $[0, 1]$. Consider the following conditions.
1. $\mathscr{F}$ is PAC learnable under the family of all non-atomic measures.
2. For every $\epsilon > 0$, $\mathrm{fat}_\epsilon(\mathscr{F} \bmod \omega_1) = d(\epsilon) < \infty$.
3. For each $\epsilon > 0$, every countable subclass $\mathscr{F}' \subseteq \mathscr{F}$ has finite $\epsilon$-fat shattering dimension on the complement to some countable subset of $\Omega$ (which depends on $\mathscr{F}'$).
4. There is a function $d(\epsilon)$ such that for every countable $\mathscr{F}' \subseteq \mathscr{F}$ and all $\epsilon > 0$ one has $\mathrm{fat}_\epsilon(\mathscr{F}') \le d(\epsilon)$ on the complement to some countable subset of $\Omega$ (depending on $\mathscr{F}'$).
5. Every countable subclass $\mathscr{F}' \subseteq \mathscr{F}$ is a uniform Glivenko–Cantelli class with respect to the family of non-atomic measures.
6. Every countable subclass $\mathscr{F}' \subseteq \mathscr{F}$ is a uniform Glivenko–Cantelli class with respect to the family of non-atomic measures, with sample complexity $s(\epsilon, \delta)$ which only depends on $\mathscr{F}$ and not on $\mathscr{F}'$.
The conditions (2)–(6) are pairwise equivalent, and under Martin's Axiom each of them implies (1). If $\mathscr{F}$ is universally separable, the conditions (2)–(6) are also equivalent to:
7. For each $\epsilon > 0$, $\epsilon$-fat shattering dimension of $\mathscr{F}$ is finite outside of a countable subset of $\Omega$.
8. $\mathscr{F}$ is a uniform Glivenko–Cantelli class with respect to the family of non-atomic probability measures,
and each of them implies
9. $\mathscr{F}$ is consistently PAC learnable under the family of all non-atomic measures.
We begin the paper by reviewing a general formal setting for PAC learnability, after which we proceed to analysis of a well-known example of a concept class of VC dimension 1 which is not a uniform Glivenko–Cantelli class and is not consistently PAC learnable [5, 4]. The example was originally constructed under the Continuum Hypothesis, though in fact Martin's Axiom suffices. We observe that the class $\mathscr{C}$ in the example is still PAC learnable, and this observation provides a clue to our approach to constructing learning rules.

This analysis is followed by a series of general results about PAC learnability of a function class $\mathscr{F}$ under non-atomic measures under Martin's Axiom and without making any assumptions on measurability of $\mathscr{F}$ except the measurability of individual members $f$ of the class.

In the two sections to follow, we discuss Boolean algebras, which appear to provide a useful framework for studying concept learning under intermediate families of measures, and commutative $C^\ast$-algebras and their spaces of maximal ideals, which provide a similar convenient framework for function classes. In particular, we will show that for a concept class $\mathscr{C}$ our version of the VC dimension modulo countable sets, $\mathrm{VC}(\mathscr{C} \bmod \omega_1)$, is just the usual VC dimension of the family of closures, $\mathrm{cl}(C)$, of all $C \in \mathscr{C}$, taken in a suitable compactification $b\Omega$ of $\Omega$ and computed over a certain subdomain of $b\Omega$, as illustrated in Figure 2.

[Figure 2: $\mathrm{VC}(\mathscr{C} \bmod \omega_1)$ via the usual VC dimension of $\mathscr{C}$. Labels in the figure: $\Omega$, $C$, $b\Omega$, $\mathrm{cl}(C)$, a subdomain of $b\Omega$.]
A similar result holds for the fat shattering dimension.

At the next stage we establish the corresponding parts of Theorems 1.1 and 1.2 for universally separable classes, at which moment we have all the machinery needed to accomplish the general case.

A conference version of this paper [13] treated the case of concept classes, but we believe that the presentation of our approach has now improved considerably.

2. The setting

We need to fix a precise setting, which is mostly standard [19, 20], see also [1, 4, 12].

The domain (instance space) $\Omega = (\Omega, \mathcal{A})$ is a measurable space, that is, a set $\Omega$ equipped with a sigma-algebra of subsets $\mathcal{A}$. Typically, $\Omega$ is assumed to be a standard Borel space, that is, a complete separable metric space equipped with the sigma-algebra of Borel subsets. We will clarify the assumption whenever necessary.

In the learning model, a set $\mathcal{P}$ of probability measures on $\Omega$ is fixed. Usually either $\mathcal{P} = P(\Omega)$ is the set of all probability measures (distribution-free learning), or $\mathcal{P} = \{\mu\}$ is a single measure (learning under fixed distribution). In our article, the case of interest is the family $\mathcal{P} = P_{na}(\Omega)$ of all non-atomic measures.

We will not distinguish between a measure $\mu$ and its Lebesgue completion, that is, an extension of $\mu$ over the larger sigma-algebra of Lebesgue measurable subsets of $\Omega$. Consequently, we will sometimes use the term measurability meaning Lebesgue measurability. No confusion can arise here.

A function class, $\mathscr{F}$, is a family of functions from $\Omega$ to the unit interval $[0, 1]$ which are measurable with regard to every $\mu \in \mathcal{P}$. For instance, elements of $\mathscr{F}$ can be universally measurable, or most often Borel. A concept class, $\mathscr{C}$, is a function class with values in $\{0, 1\}$ or, equivalently, a family of measurable subsets of $\Omega$.
Every probability measure $\mu$ on $\Omega$ determines an $L^1$ distance between functions:
\[
\| f - g \|_1 = \int_\Omega | f(x) - g(x) | \, d\mu(x).
\]
For concept classes, this reduces to the following metric:
\[
d_\mu(A, B) = \mu(A \,\triangle\, B).
\]
Often it is convenient to approximate the functions from $\mathscr{F}$ with elements of the hypothesis space, $\mathscr{H}$, which is, technically, a family of functions whose closure in each space $L^1(\mu)$, $\mu \in \mathcal{P}$, contains $\mathscr{F}$. However, in our article we make no distinction between $\mathscr{H}$ and $\mathscr{F}$.

A learning sample is a pair $(\sigma, r)$, where $\sigma$ is a finite subset of $\Omega$ and $r$ is a function from $\sigma$ to $[0, 1]$. It is convenient to assume that elements $x_1, x_2, \ldots, x_n \in \sigma$ are ordered, and thus the set of all samples $(\sigma, r)$ with $|\sigma| = n$ can be identified with $(\Omega \times [0, 1])^n$. In the case of concept classes, a learning sample is simply a pair $(\sigma, \tau)$ of finite subsets of $\Omega$, where $\tau \subseteq \sigma$ is thought of as the set of points where $r$ takes the value 1. The set of all samples of size $n$ in this case is $(\Omega \times \{0, 1\})^n$.

A learning rule (for $\mathscr{F}$) is a mapping
\[
\mathcal{L} \colon \bigcup_{n=1}^{\infty} \Omega^n \times [0, 1]^n \to \mathscr{F}
\]
which satisfies the following measurability condition: for every $f \in \mathscr{F}$ and $\mu \in \mathcal{P}$, the function
\[
\Omega^n \ni \sigma \mapsto \| \mathcal{L}(\sigma, f \upharpoonright \sigma) - f \|_1 \in \mathbb{R} \tag{2.1}
\]
is measurable.

A learning rule $\mathcal{L}$ is consistent (with a function class $\mathscr{F}$) if for every $f \in \mathscr{F}$ and each $\sigma \in \Omega^n$ one has
\[
\mathcal{L}(\sigma, f \upharpoonright \sigma) \upharpoonright \sigma = f \upharpoonright \sigma.
\]
In the case of a concept class $\mathscr{C}$, the consistency condition becomes this: for every $C \in \mathscr{C}$ and each $\sigma \in \Omega^n$ one has
\[
\mathcal{L}(\sigma, C \cap \sigma) \cap \sigma = C \cap \sigma.
\]
A learning rule $\mathcal{L}$ is probably approximately correct (PAC) under $\mathcal{P}$ if for every $\epsilon > 0$
\[
\sup_{\mu \in \mathcal{P}} \sup_{f \in \mathscr{F}} \mu^{\otimes n} \{ \sigma \in \Omega^n : \| \mathcal{L}(\sigma, f \upharpoonright \sigma) - f \|_1 > \epsilon \} \to 0 \quad \text{as } n \to \infty. \tag{2.2}
\]
Here $\mu^{\otimes n}$ denotes the (Lebesgue extension of the) product measure on $\Omega^n$. Now the origin of the measurability condition (2.1) on the mapping $\mathcal{L}$ is clear: it is implicit in (2.2).
Equivalently, there is a function $s(\epsilon, \delta)$ (sample complexity of $\mathcal{L}$) such that for each $f \in \mathscr{F}$ and every $\mu \in \mathcal{P}$ an i.i.d. sample $\sigma$ with $\ge s(\epsilon, \delta)$ points has the property $\| \mathcal{L}(\sigma, f \upharpoonright \sigma) - f \|_1 < \epsilon$ with confidence $\ge 1 - \delta$.

In particular, for a concept class $\mathscr{C}$, it is convenient to rewrite the definition of a PAC learning rule thus: for each $\epsilon > 0$,
\[
\sup_{\mu \in \mathcal{P}} \sup_{C \in \mathscr{C}} \mu^{\otimes n} \{ \sigma \in \Omega^n : \mu( \mathcal{L}(\sigma, C \cap \sigma) \,\triangle\, C ) > \epsilon \} \to 0 \quad \text{as } n \to \infty. \tag{2.3}
\]
In terms of the sample complexity function $s(\epsilon, \delta)$, a learning rule $\mathcal{L}$ is PAC if for each $C \in \mathscr{C}$ and every $\mu \in \mathcal{P}$ an i.i.d. sample $\sigma$ with $\ge s(\epsilon, \delta)$ points has the property $\mu( C \,\triangle\, \mathcal{L}(\sigma, C \cap \sigma) ) < \epsilon$ with confidence $\ge 1 - \delta$.

A function class $\mathscr{F}$ is PAC learnable under $\mathcal{P}$ if there exists a PAC learning rule for $\mathscr{F}$ under $\mathcal{P}$. A class $\mathscr{F}$ is consistently learnable (under $\mathcal{P}$) if every learning rule consistent with $\mathscr{F}$ is PAC under $\mathcal{P}$. If $\mathcal{P} = P(\Omega)$ is the set of all probability measures, then $\mathscr{F}$ is said to be (distribution-free) PAC learnable. If $\mathcal{P} = \{\mu\}$ is a single probability measure, one is talking of learning under a single measure (or distribution). These definitions apply in particular to concept classes as well.

Learnability under intermediate families of measures on $\Omega$ has received considerable attention, cf. Chapter 7 in [20].

Notice that in this paper, we only talk of potential PAC learnability, adopting a purely information-theoretic viewpoint. As a consequence, our statements about learning rules are existential rather than constructive, and building learning rules by transfinite recursion is perfectly acceptable.

An important concept is that of a uniform Glivenko–Cantelli function class with respect to a family of measures $\mathcal{P}$, that is, a function class $\mathscr{F}$ such that for each $\epsilon > 0$
\[
\sup_{\mu \in \mathcal{P}} \mu^{\otimes n} \Bigl\{ \sup_{f \in \mathscr{F}} | E_\mu(f) - E_{\mu_n}(f) | \ge \epsilon \Bigr\} \to 0 \quad \text{as } n \to \infty \tag{2.4}
\]
(cf. [6], Ch. 3; [12]).
Here $\mu_n$ stands for the empirical (uniform) measure on $n$ points, sampled in an i.i.d. fashion from $\Omega$ according to the distribution $\mu$. The symbol $E_{\mu_n}$ means the empirical mean of $f$ on the sample $\sigma$. One also says that $\mathscr{F}$ has the property of uniform convergence of empirical means (UCEM property) with respect to $\mathcal{P}$ [20].

In the case of a concept class $\mathscr{C}$, the uniform Glivenko–Cantelli property becomes
\[
\sup_{\mu \in \mathcal{P}} \mu^{\otimes n} \Bigl\{ \sup_{C \in \mathscr{C}} | \mu(C) - \mu_n(C) | \ge \epsilon \Bigr\} \to 0 \quad \text{as } n \to \infty. \tag{2.5}
\]
In this case, one says that $\mathscr{C}$ has the property of uniform convergence of empirical measures, which is also abbreviated to UCEM property (with respect to $\mathcal{P}$).

Every uniform Glivenko–Cantelli class (with respect to $\mathcal{P}$) is PAC learnable (under $\mathcal{P}$). In the distribution-free situation the converse holds under mild additional measurability conditions on the class (but not always [5], see a discussion in Section 3 below). For learning under a single measure, it is not so: a PAC learnable class under a single distribution $\mu$ need not be uniform Glivenko–Cantelli with respect to $\mu$ (cf. Chapter 6 in [20], or else [14], Example 2.10, where a countable counter-example is given). Not every PAC learnable class under non-atomic measures is uniform Glivenko–Cantelli with respect to non-atomic measures either: the class consisting of all finite and all cofinite subsets of $\Omega$ is a counter-example.

We say, following Pollard [15], that a function class $\mathscr{F}$ is universally separable if it contains a countable subfamily $\mathscr{F}'$ which is universally dense in $\mathscr{F}$: every function $f \in \mathscr{F}$ is a pointwise limit of a sequence of elements of $\mathscr{F}'$. By the Lebesgue Dominated Convergence Theorem, for every probability measure $\mu$ on $\Omega$ the set $\mathscr{F}'$ is everywhere dense in $\mathscr{F}$ in the $L^1(\mu)$-distance.
In particular, a concept class $\mathscr{C}$ is universally separable if it contains a countable subfamily $\mathscr{C}'$ with the property that for every $C \in \mathscr{C}$ there exists a sequence $(C_n)_{n=1}^{\infty}$ of sets from $\mathscr{C}'$ and for every $x \in \Omega$ there is $N$ with the property that, for all $n \ge N$, $x \in C_n$ if $x \in C$, and $x \notin C_n$ if $x \notin C$.

Probably the main source of uniform Glivenko–Cantelli classes is the finiteness of VC dimension. Assume that $\mathscr{C}$ satisfies a suitable measurability condition, for instance, $\mathscr{C}$ is image admissible Souslin [6], or else universally separable. (In particular, a countable $\mathscr{C}$ satisfies either condition.) If $\mathrm{VC}(\mathscr{C}) = d < \infty$, then $\mathscr{C}$ is uniform Glivenko–Cantelli, with a sample complexity bound that does not depend on $\mathscr{C}$, but only on $\epsilon$, $\delta$, and $d$. The following is a typical (and far from being optimal) such estimate, which can be deduced, for instance, along the lines of [12]:
\[
s(\epsilon, \delta, d) \le \frac{128}{\epsilon^2} \left( d \log \left( \frac{2e^2}{\epsilon} \log \frac{2e}{\epsilon} \right) + \log \frac{8}{\delta} \right). \tag{2.6}
\]
For our purposes, we will fix any such bound and refer to it as a “standard” sample complexity estimate for $s(\epsilon, \delta, d)$.

Let us recall a more general concept of fat shattering dimension [1] which is relevant for function classes. Let $\epsilon > 0$. A finite subset $A$ of $\Omega$ is $\epsilon$-fat shattered by a function class $\mathscr{F}$ with witness function $h \colon A \to [0, 1]$ if for every $B \subseteq A$ there is a function $f_B \in \mathscr{F}$ such that
\[
\begin{cases}
f_B(a) > h(a) + \epsilon & \text{for } a \in B, \\
f_B(a) < h(a) - \epsilon & \text{for } a \in A \setminus B.
\end{cases} \tag{2.7}
\]
The $\epsilon$-fat shattering dimension of $\mathscr{F}$ (over the domain $\Omega$) is defined as
\[
\mathrm{fat}_\epsilon \mathscr{F} = \sup \{ |A| : A \subseteq \Omega, \ A \text{ is } \epsilon\text{-fat shattered by } \mathscr{F} \}.
\]
In particular, if $\mathscr{C}$ is a concept class, then for any $\epsilon < 1/2$ the $\epsilon$-fat shattering dimension of $\mathscr{C}$ is the VC dimension of $\mathscr{C}$ (with the strict inequalities of (2.7) and the witness $h \equiv 1/2$, the value $\epsilon = 1/2$ itself is excluded).

If we want to stress that the combinatorial dimension is calculated over a particular domain $\Omega$, we will use the notation $\mathrm{fat}_\epsilon(\mathscr{F} \upharpoonright \Omega)$ and $\mathrm{VC}(\mathscr{C} \upharpoonright \Omega)$.
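The two combinatorial definitions above can be checked by brute force on small finite examples. The following is only an illustrative sketch under strong assumptions: the domain and the class are finite, the witness function is supplied by hand, and the helper names `is_fat_shattered` and `vc_dimension` are hypothetical and not taken from the paper. It tests condition (2.7) directly and computes the classical VC dimension by enumerating traces.

```python
from itertools import combinations


def is_fat_shattered(points, functions, witness, eps):
    """Check condition (2.7) for a finite set `points`: for every subset B,
    some function is > witness + eps on B and < witness - eps off B."""
    points = list(points)
    n = len(points)
    for mask in range(2 ** n):          # each mask encodes one subset B
        found = False
        for f in functions:
            ok = True
            for i, a in enumerate(points):
                if (mask >> i) & 1:     # a belongs to B
                    ok = f(a) > witness(a) + eps
                else:                   # a belongs to A \ B
                    ok = f(a) < witness(a) - eps
                if not ok:
                    break
            if ok:
                found = True
                break
        if not found:
            return False
    return True


def vc_dimension(domain, concepts):
    """Brute-force VC dimension of a finite family of subsets of a finite domain.
    Stops at the first size k with no shattered set, since every subset of a
    shattered set is shattered."""
    domain = list(domain)
    best = 0
    for k in range(1, len(domain) + 1):
        shattered = any(
            len({frozenset(c & set(A)) for c in concepts}) == 2 ** k
            for A in combinations(domain, k)
        )
        if not shattered:
            break
        best = k
    return best


# Toy class (an assumption for illustration): initial segments of {0, 1, 2, 3}.
segments = [frozenset(range(y)) for y in range(5)]
indicators = [lambda a, c=c: 1.0 if a in c else 0.0 for c in segments]
half = lambda a: 0.5                    # constant witness h = 1/2

print(vc_dimension(range(4), segments))
print(is_fat_shattered([0], indicators, half, 0.25))
print(is_fat_shattered([0, 1], indicators, half, 0.25))
```

For indicator functions with the constant witness 1/2, the fat-shattering test at any scale below 1/2 agrees with ordinary shattering, which is the content of the remark on concept classes above. The search over witness functions is deliberately left out of the sketch.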
In the definition of $\epsilon$-fat shattering dimension, one can assume without loss of generality the values of $\epsilon$ and of a witness function to be rational. More precisely, the following holds.

Lemma 2.1. Suppose a finite set $A$ is $\epsilon$-fat shattered by a function class $\mathscr{F}$. Then there is a rational value $\epsilon' > \epsilon$ such that $A$ is $\epsilon'$-fat shattered by $\mathscr{F}$ with a rational-valued witness function $h' \colon A \to \mathbb{Q}$.

Proof. Let $h$ be a witness of $\epsilon$-fat shattering for $A$. For each $B \subseteq A$ choose a function $f_B$ satisfying Condition (2.7). For every $a \in A$ define
\[
S_a = \min_{a \in B} f_B(a), \qquad s_a = \max_{a \in A \setminus B} f_B(a),
\]
where the minimum is taken over all $B \subseteq A$ containing $a$ and the maximum over all $B \subseteq A$ not containing $a$. One has
\[
s_a < h(a) - \epsilon < h(a) + \epsilon < S_a,
\]
and so $S_a - s_a > 2\epsilon$. One can therefore select rational values $\epsilon'_a > \epsilon$ and $h'(a)$ such that
\[
s_a + \epsilon'_a < h'(a) < S_a - \epsilon'_a.
\]
This way, we obtain a desired witness function $h'$, and the proof is now finished by setting $\epsilon' = \min_{a \in A} \epsilon'_a$.

Every function class $\mathscr{F}$ whose $\epsilon$-fat shattering dimension is finite at every scale $\epsilon > 0$ is uniform Glivenko–Cantelli. Here is an asymptotic estimate of the sample size taken from [1] (Theorem 3.6):
\[
s(\epsilon, \delta, d) \le C \left( \frac{1}{\epsilon^2} \, d(\epsilon/24) \log^2 \frac{d(\epsilon/24)}{\epsilon} + \log \frac{1}{\delta} \right), \tag{2.8}
\]
where $d \colon \mathbb{R}_+ \to \mathbb{N}$ is the fat-shattering dimension of $\mathscr{F}$ understood as a function of epsilon, $d(\epsilon) = \mathrm{fat}_\epsilon(\mathscr{F})$. In the formula, $C$ denotes a universal constant whose value can be extracted from the proofs in [1], but, given the presence of such a loose scale as $\epsilon/24$, does not really matter. Tighter sample size estimates can be found in [3]. Again, we will refer to Condition (2.8) as the “standard” complexity estimate corresponding to the fat shattering dimension function $d$.

Finally, recall that a subset $N \subseteq \Omega$ is universal null if for every non-atomic probability measure $\mu$ on $(\Omega, \mathcal{A})$ one has $\mu(N') = 0$ for some Borel set $N'$ containing $N$. Universal null Borel sets are just countable sets.

3. Revisiting an example of Durst and Dudley

In order to explain our approach to constructing a learning rule that is PAC under non-atomic distributions, we need to examine the traditional way of proving distribution-free PAC learnability. A usual approach consists of two stages.
1. A function (or concept) class $\mathscr{F}$ is uniform Glivenko–Cantelli as long as a suitable combinatorial parameter of $\mathscr{F}$ (VC dimension, fat-shattering dimension etc.) is finite.
2. A uniform Glivenko–Cantelli class $\mathscr{F}$ is PAC learnable. Moreover, such a class is consistently PAC learnable: every consistent learning rule $\mathcal{L}$ for $\mathscr{F}$ is probably approximately correct.

The proof of every statement of the former type depends in an essential way on the Fubini theorem, and so some measurability restrictions on the class $\mathscr{F}$ are necessary. Without them, the conclusion is not true in general. Here is a classical example of a concept class having finite VC dimension which is not uniform Glivenko–Cantelli.

Example 3.1 (Durst and Dudley [5], Proposition 2.2; cf. also [21], p. 314; [6], pp. 170–171). Let $\Omega$ be an uncountable standard Borel space, that is, up to an isomorphism, a Borel space associated to the unit interval $[0, 1]$. The cardinality of $\Omega$ is continuum. Choose a minimal well-ordering $\prec$ on $\Omega$, and let $\mathscr{C}$ consist of all half-open initial segments of the ordered set $(\Omega, \prec)$, that is, subsets of the form $I_y = \{ x \in \Omega : x \prec y \}$, $y \in \Omega$. Clearly, the VC dimension of the class $\mathscr{C}$ is one.

Fix a non-atomic Borel probability measure $\mu$ on $\Omega$ (e.g., the Lebesgue measure on $[0, 1]$). Now assume the validity of the Continuum Hypothesis. Under this assumption, every element of $\mathscr{C}$ is a countable set, therefore Borel measurable of measure zero. At the same time, for every $n$ and each random $n$-sample $\sigma$, there is a countable initial segment $C \in \mathscr{C}$ containing all elements of $\sigma$.
The empirical measure of $C$ with respect to $\sigma$ is one. Thus, no finite sample guesses the measure of all elements of $\mathscr{C}$ to within an accuracy $\epsilon < 1$ with a non-vanishing confidence.

A further modification of this construction gives an example of a concept class of finite VC dimension which is not consistently PAC learnable.

Example 3.2 (Blumer, Ehrenfeucht, Haussler, and Warmuth [4], p. 953). Again, assume the Continuum Hypothesis. Add to the concept class $\mathscr{C}$ from Example 3.1 the set $\Omega$ as an element. In other words, form a concept class $\mathscr{C}'$ consisting of all initial segments of $(\Omega, \prec)$, including improper ones. One still has $\mathrm{VC}(\mathscr{C}') = 1$. For a finite labelled sample $(\sigma, \tau)$ define
\[
\mathcal{L}(\sigma, \tau) = \min \{ y : \tau \subseteq I_y \}. \tag{3.1}
\]
The learning rule $\mathcal{L}$ is clearly consistent with the class $\mathscr{C}$, but is not probably approximately correct, because for the concept $C = \Omega$ the value $\mathcal{L}(\Omega \cap \sigma) = \mathcal{L}(\sigma, \sigma)$ will always return a countable concept $I_y$, and if $\mu$ is a non-atomic Borel probability measure on $\Omega$, then $\mu(C \,\triangle\, I_y) = 1$. The concept $C = \Omega$ is not learned to accuracy $\epsilon < 1$ with a non-zero confidence.

Remark 3.3. It is important to note that (again, under the Continuum Hypothesis) the class $\mathscr{C}'$ is nevertheless distribution-free PAC learnable. Indeed, redefine a well-ordering on $\mathscr{C}' = \{ I_x : x \in \Omega \} \cup \{ \Omega \}$ by making $\Omega$ the smallest element (instead of the largest one) and keeping the order relation between the other elements the same. Denote the new order relation by $\prec_1$, and define a learning rule $\mathcal{L}_1$ similarly to Eq. (3.1), but this time understanding the minimum with respect to the order $\prec_1$:
\[
\mathcal{L}_1(\sigma, \tau) = \min\nolimits^{(\prec_1)} \Bigl\{ C \in \mathscr{C}' : C \cap \sigma = \bigcap_{\tau \subseteq D} D \Bigr\}. \tag{3.2}
\]
In essence, $\mathcal{L}_1$ examines all the concepts following a transfinite order on them, and if a labelled sample is consistent with the class $\mathscr{C}'$, then $\mathcal{L}_1$ returns the first concept consistent with the sample that it comes across.
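The effect of the ordering can be seen in a finite toy model. This is only a sketch under explicit assumptions: the domain $\{0, \ldots, 9\}$ with its usual order stands in for the well-ordered set $(\Omega, \prec)$, the helper names `initial_segment` and `first_consistent` are hypothetical, and the measure-theoretic part of the argument (countable concepts being null sets) has no finite counterpart. The sketch shows only that "return the first consistent concept" gives different answers on the same labelled sample depending on whether $\Omega$ is placed last or first in the enumeration.

```python
OMEGA = frozenset(range(10))            # finite stand-in for the domain


def initial_segment(y):
    """Finite stand-in for I_y = {x : x < y}."""
    return frozenset(range(y))


def first_consistent(ordered_concepts, sample, positives):
    """Return the first concept, in the given order, whose trace on the
    sample equals the set of positively labelled points."""
    sample = set(sample)
    positives = set(positives)
    for concept in ordered_concepts:
        if concept & sample == positives:
            return concept
    return None                         # sample not consistent with the class


# I_0, ..., I_9 are all proper subsets of OMEGA in this toy model.
segments = [initial_segment(y) for y in range(10)]
order_omega_last = segments + [OMEGA]   # Omega is the largest element
order_omega_first = [OMEGA] + segments  # Omega is the smallest element

sigma = [2, 5, 7]
tau = list(sigma)                       # sample labelled by the concept C = Omega

guess_last = first_consistent(order_omega_last, sigma, tau)
guess_first = first_consistent(order_omega_first, sigma, tau)
print(sorted(guess_last))               # a proper initial segment
print(sorted(guess_first))              # the whole domain
```

With $\Omega$ placed last, the rule answers with a proper initial segment as long as the sample misses the top of the order, mirroring the failure in Example 3.2. With $\Omega$ placed first, the same sample returns $\Omega$, while samples labelled by a proper initial segment still return an initial segment.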
To understand what difference it makes with Example 3.2, let $\mu$ be again a non-atomic probability measure on $\Omega$. If $C = \Omega$, then for every sample $\sigma$ consistently labelled with $C$ the rule $\mathcal{L}_1$ will return $C$, because this is the smallest consistent concept encountered by the algorithm. If $C \ne \Omega$, then for $\mu$-almost all samples $\sigma$ (that is, for a set of $\mu$-measure one) the labelling on $\sigma$ produced by $C$ will be empty, and the concept $\mathcal{L}_1(\sigma, \emptyset)$ returned by $\mathcal{L}_1$, while possibly different from $C$, will be again a countable concept, meaning that $\mu(C \,\triangle\, \mathcal{L}_1(\sigma, \emptyset)) = 0$.

To give a formal proof that $\mathcal{L}_1$ is PAC, notice that for every $C \in \mathscr{C}'$ and each $n \in \mathbb{N}$ the collection of pairwise distinct concepts $\mathcal{L}_1(\sigma \cap C)$, $\sigma \in \Omega^n$, is only countable (under the Continuum Hypothesis), because they are all contained in the $\prec_1$-initial segment of a minimally ordered set $\mathscr{C}'$ of cardinality continuum, bounded by $C$ itself. As a consequence, the concept class
\[
\mathcal{L}_1^C = \{ \mathcal{L}_1(\sigma \cap C) : \sigma \in \Omega^n, \ n \in \mathbb{N} \} \subseteq \mathscr{C}' \tag{3.3}
\]
is also countable (assuming the Continuum Hypothesis). The VC dimension of the family $\mathcal{L}_1^C \cup \{ C \}$ is $\le 1$, and being countable, it is a uniform Glivenko–Cantelli class with a standard sample complexity as in Eq. (2.6). Consequently, given $\epsilon, \delta > 0$, one has for each probability measure $\mu$ on $\Omega$ and a random sample $\sigma \in \Omega^n$, with confidence $\ge 1 - \delta$,
\[
\mu( C \,\triangle\, \mathcal{L}_1(\sigma, C \cap \sigma) ) < \epsilon
\]
provided $n \ge s(\epsilon, \delta, 1)$, as required.

Remark 3.4. Thus, under the Continuum Hypothesis, the example of Dudley and Durst as modified by Blumer, Ehrenfeucht, Haussler, and Warmuth gives an example of a PAC learnable concept class which is not uniform Glivenko–Cantelli (even if having finite VC dimension). As it will become clear in the next Section, the assumption of the Continuum Hypothesis can be weakened to Martin's Axiom.
Still, it would b e int eresting to kno w whether an e x ample with the same com bination o f proper ties ca n b e constructed without additional set-theoretic a s sumptions. A basic observ ation of this section is that in order for a learning r ule L to b e P A C, the as sumption on F b e ing uniform Glivenko–Can telli can be weakened a s fo llows. Lemma 3.5. L et F b e a fu n ction class and P a family of pr ob ability me asur es on the domain Ω . Su pp ose ther e ex ists a fu n ction s ( ǫ, δ ) and a c onsistent le arning rule L for F with the pr op erty that for every f ∈ F , the set L f ∪ { f } is Glivenko–C antel li with r esp e ct to P with the sample c omplexity s ( ǫ , δ ) , wher e L f = {L ( f ↾ σ ) : σ ∈ Ω n , n ∈ N } . Then L is pr ob ably appr oximately c orr e ct under P with sample c omplexity s ( ǫ, δ ) . R emark 3.6 . Of cours e instead of L f ∪ { f } it is suﬃcien t to make the same assumption on the cla ss L f . This will not aﬀect the P A C learnability of L . How ever, an estimate for the sa mple c o mplexity of the union in terms of s ( ǫ, δ ) will b e so mew ha t awkw ard, and in view of a sp eciﬁc wa y in which the ab ov e Lemma is going to b e used, the current assumption is technically mor e co nv enient. This simple fact b eco mes very useful in combination with the technique of w ell-orde r ings in the case where P consis ts of non-ato mic meas ur es and therefore consis ten t P AC learnability is not to b e exp ected. A t the same time, this a pproach requir es additional set-theoretic ax ioms in order to assure mea surability of emerging function c la sses. Of cours e the Contin uum Hypo thesis is a r a ther strong ass umption, which is particularly unnatural in a probabilistic context (cf. [7 ]). But it is unnecessary . Mar tin’s Axiom is a m uc h weak er and natura l additional set-theoretic a xiom, which works just as w ell. W e explain how the ab ov e idea is fo r malized in the setting of Martin’s Axiom in the next Section. 4. 
Learnability under Martin's Axiom

Martin's Axiom (MA) [8, 9, 11] in one of its equivalent forms says that no compact Hausdorff topological space with the countable chain condition is a union of strictly fewer than continuum nowhere dense subsets. Thus, it can be seen as a strengthening of the statement of the Baire Category Theorem. In particular, the Continuum Hypothesis (CH) implies MA. However, MA is compatible with the negation of CH, and this is where the most interesting applications of MA are to be found.

We will be using just one particular consequence of Martin's Axiom. For the proof of the following result, see [11], Theorem 2.21, or [8], or [9], pp. 563–565.

Theorem 4.1 (Martin–Solovay). Let $(\Omega, \mu)$ be a standard Lebesgue non-atomic probability space. Under Martin's Axiom, the Lebesgue measure is $2^{\aleph_0}$-additive, that is, if $\kappa < 2^{\aleph_0}$ and $A_\alpha$, $\alpha < \kappa$, is a family of pairwise disjoint measurable sets, then $\bigcup_{\alpha<\kappa} A_\alpha$ is Lebesgue measurable and
\[
\mu\Bigl(\bigcup_{\alpha<\kappa} A_\alpha\Bigr) = \sum_{\alpha<\kappa} \mu(A_\alpha).
\]
In particular, the union of fewer than continuum null subsets of $\Omega$ is a null subset.

Here is a central technical tool used in our proofs.

Lemma 4.2. Let $\mathscr{F}$ be a function class and $\mathcal{P}$ a family of probability measures on a standard Borel domain $\Omega$. Consider the following properties.

1. Every countable subclass of $\mathscr{F}$ is uniform Glivenko–Cantelli with respect to $\mathcal{P}$.
2. There is a function $s(\epsilon, \delta)$ such that every countable subclass of $\mathscr{F}$ is uniform Glivenko–Cantelli with respect to $\mathcal{P}$ with sample complexity $s(\epsilon, \delta)$.
3. Every subclass $\mathscr{F}'$ of $\mathscr{F}$ having cardinality $< 2^{\aleph_0}$ is uniform Glivenko–Cantelli with respect to $\mathcal{P}$.
4. There is a function $s(\epsilon, \delta)$ such that every subclass $\mathscr{F}'$ of $\mathscr{F}$ having cardinality $< 2^{\aleph_0}$ is uniform Glivenko–Cantelli with respect to $\mathcal{P}$ with sample complexity $s(\epsilon, \delta)$.
Then
\[
(4) \Rightarrow (2) \Leftrightarrow (1), \qquad (4) \Rightarrow (3) \Rightarrow (1).
\]
Under Martin's Axiom, all four conditions are equivalent.

Proof. The implications (2) ⇒ (1), (3) ⇒ (1), (4) ⇒ (2) and (4) ⇒ (3) are trivially true.

To show (1) ⇒ (2), let $\delta, \epsilon > 0$ be arbitrary but fixed. For each countable subclass $\mathscr{F}'$, choose the smallest value of sample complexity $s = s(\mathscr{F}', \epsilon, \delta) \in \mathbb{N}$. The integer-valued function $\mathscr{F}' \mapsto s(\mathscr{F}', \epsilon, \delta)$ is monotone under inclusions: if $\mathscr{F}' \subseteq \mathscr{F}''$, then $s(\mathscr{F}', \epsilon, \delta) \le s(\mathscr{F}'', \epsilon, \delta)$. If $(\mathscr{F}'_n)$ is a countable sequence of countable classes, then the union $\bigcup_{n=1}^{\infty} \mathscr{F}'_n$ is a countable class, whose sample complexity $s(\bigcup_{n=1}^{\infty} \mathscr{F}'_n, \epsilon, \delta)$ forms an upper bound for all $s(\mathscr{F}'_n, \epsilon, \delta)$, $n = 1, 2, \ldots$. Thus, the function $\mathscr{F}' \mapsto s(\mathscr{F}', \epsilon, \delta)$ for $\delta, \epsilon > 0$ fixed is bounded on countable sets of inputs. To conclude the proof, it is enough to notice that a real-valued function is bounded if and only if its restriction to every countable subset of the domain is bounded.

Now assume Martin's Axiom. It is enough to prove (2) ⇒ (4). This is done by a transfinite induction on the cardinality $\kappa = |\mathscr{F}'| < 2^{\aleph_0}$. Let us pick the same complexity function $s = s(\epsilon, \delta)$ as in (2). For $\kappa = \aleph_0$ there is nothing to prove. Else, represent $\mathscr{F}'$ as a union of an increasing transfinite chain of function classes $\mathscr{F}_\alpha$, $\alpha < \kappa$, for each of which the statement of (4) holds. For every $\epsilon > 0$ and $n \in \mathbb{N}$, the set
\[
\Bigl\{\sigma \in \Omega^n : \sup_{f \in \mathscr{F}'} \bigl|\mathbb{E}_{\mu_n(\sigma)}(f) - \mathbb{E}_\mu(f)\bigr| < \epsilon\Bigr\}
= \bigcap_{\alpha<\kappa} \Bigl\{\sigma \in \Omega^n : \sup_{f \in \mathscr{F}_\alpha} \bigl|\mathbb{E}_{\mu_n(\sigma)}(f) - \mathbb{E}_\mu(f)\bigr| < \epsilon\Bigr\}
\]
is measurable as an easy consequence of Martin–Solovay's Theorem 4.1. Given $\delta > 0$ and $n \ge s(\epsilon, \delta)$, another application of the same result leads us to conclude that for every $\mu \in \mathcal{P}$:
\[
\begin{aligned}
\mu^{\otimes n}\Bigl\{\sigma \in \Omega^n : \sup_{f \in \mathscr{F}'} \bigl|\mathbb{E}_{\mu_n(\sigma)}(f) - \mathbb{E}_\mu(f)\bigr| < \epsilon\Bigr\}
&= \mu^{\otimes n}\Bigl(\bigcap_{\alpha<\kappa} \Bigl\{\sigma \in \Omega^n : \sup_{f \in \mathscr{F}_\alpha} \bigl|\mathbb{E}_{\mu_n(\sigma)}(f) - \mathbb{E}_\mu(f)\bigr| < \epsilon\Bigr\}\Bigr) \\
&= \inf_{\alpha<\kappa} \mu^{\otimes n}\Bigl\{\sigma \in \Omega^n : \sup_{f \in \mathscr{F}_\alpha} \bigl|\mathbb{E}_{\mu_n(\sigma)}(f) - \mathbb{E}_\mu(f)\bigr| < \epsilon\Bigr\} \\
&\ge 1 - \delta,
\end{aligned}
\]
as required.

Lemma 4.3. Let $\mathscr{F}$ be a function class whose countable subclasses are uniform Glivenko–Cantelli with respect to a family of probability measures $\mathcal{P}$. Let $\mathcal{L}$ be a consistent learning rule for $\mathscr{F}$ with the property that for every $f \in \mathscr{F}$, the set
\[
\mathcal{L}_{f,n} = \{\mathcal{L}(f \upharpoonright \sigma) : \sigma \in \Omega^n\} \tag{4.1}
\]
has cardinality strictly less than continuum. Under Martin's Axiom, the rule $\mathcal{L}$ is probably approximately correct under $\mathcal{P}$. The common sample complexity of countable subclasses of $\mathscr{F}$ becomes the sample complexity bound for the learning rule $\mathcal{L}$.

Proof. Recall that, under Martin's Axiom, $2^{\aleph_0}$ is a regular cardinal, and thus admits no countable cofinal subset. Therefore, under the assumptions of the Lemma, the cardinality of $\mathcal{L}_f = \bigcup_{n=1}^{\infty} \mathcal{L}_{f,n}$ is still strictly less than continuum. The same is true of the class $\mathcal{L}_f \cup \{f\}$. Applying now Lemma 4.2 and then Lemma 3.5, we conclude.

The following result establishes the existence of learning rules with the above property.

Lemma 4.4. Let $\mathscr{F}$ be an infinite function class on a measurable space $\Omega$. Denote by $\kappa = |\mathscr{F}|$ the cardinality of $\mathscr{F}$. There exists a consistent learning rule $\mathcal{L}$ for $\mathscr{F}$ with the property that for every $f \in \mathscr{F}$ and each $n$, the set $\mathcal{L}_{f,n}$ (cf. Eq. (4.1)) has cardinality $< \kappa$. Under Martin's Axiom the rule $\mathcal{L}$ satisfies the measurability condition (2.1).

Proof. Choose a minimal well-ordering of elements of $\mathscr{F}$:
\[
\mathscr{F} = \{f_\alpha : \alpha < \kappa\}.
\]
Notice that $\kappa$ never exceeds the cardinality of the continuum $2^{\aleph_0}$ because $\mathscr{F}$ consists of Borel measurable functions on a standard Borel domain. For this reason, every initial segment of the above ordering has cardinality strictly less than $2^{\aleph_0}$. For every $\sigma \in \Omega^n$ and $\tau \in [0,1]^n$, set the value $\mathcal{L}(\sigma, \tau)$ of the learning rule equal to $f_\beta$, where
\[
\beta = \min\{\alpha < \kappa : f_\alpha \upharpoonright \sigma = \tau\},
\]
provided such a $\beta$ exists.
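The rule just defined can be illustrated on a finite toy example. The following Python sketch is not part of the original argument: the class of threshold functions, their particular ordering, the grid `DOMAIN`, and the names `min_index_rule` and `returned_indices` are all assumptions made for illustration, and a finite list stands in for the well-ordering $\{f_\alpha : \alpha < \kappa\}$. The rule returns the index of the first hypothesis in the ordering that agrees with the labelled sample; the helper then collects every index the rule can return when the target is $f_\alpha$.

```python
from itertools import product

def min_index_rule(ordered_class, sample, labels):
    """Index beta of the first hypothesis in the ordering with f_beta(x) == label
    for every labelled sample point; None if no hypothesis is consistent."""
    for beta, f in enumerate(ordered_class):
        if all(f(x) == t for x, t in zip(sample, labels)):
            return beta
    return None

def threshold(a):
    """Indicator function of the half-line [a, +infinity) (hypothetical toy concept)."""
    return lambda x: 1 if x >= a else 0

# A finite list playing the role of the well-ordering f_0, f_1, f_2, f_3.
ORDERED_CLASS = [threshold(a) for a in (0.8, 0.2, 0.5, 0.35)]
DOMAIN = (0.1, 0.3, 0.4, 0.6, 0.9)

def returned_indices(alpha, n):
    """All indices returned by the rule on n-samples from DOMAIN labelled by f_alpha."""
    target = ORDERED_CLASS[alpha]
    found = set()
    for sample in product(DOMAIN, repeat=n):
        labels = [target(x) for x in sample]
        found.add(min_index_rule(ORDERED_CLASS, sample, labels))
    return found

if __name__ == "__main__":
    # Indices returned for the target f_2 on all 2-samples from DOMAIN.
    print(sorted(returned_indices(2, 2)))
```

In this toy setting the returned index never exceeds the index of the target, since the target itself is always consistent with its own labelling; this is the finite shadow of the fact that the hypotheses returned for $f_\alpha$ stay inside the initial segment bounded by $f_\alpha$.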
Clearly, for each $\alpha < \kappa$ one has
\[
\mathcal{L}(\sigma, f_\alpha \upharpoonright \sigma) \in \{f_\beta : \beta \le \alpha\},
\]
which assures that the set in (4.1) has cardinality strictly less than continuum. Besides, the learning rule $\mathcal{L}$ is consistent.

Fix $f = f_\alpha \in \mathscr{F}$, $\alpha < \kappa$. For every $\beta \le \alpha$ define $D_\beta = \{\sigma \in \Omega^n : f \upharpoonright \sigma = f_\beta \upharpoonright \sigma\}$. The sets $D_\beta$ are measurable, and the function
\[
\Omega^n \ni \sigma \mapsto \mathbb{E}_\mu\bigl|\mathcal{L}(f \upharpoonright \sigma) - f\bigr| \in \mathbb{R}
\]
takes a constant value $\|f - f_\beta\|_{L^1(\mu)}$ on each set $D_\beta \setminus \bigcup_{\gamma<\beta} D_\gamma$, $\beta \le \alpha$. Such sets, as well as all their possible unions, are measurable under Martin's Axiom by force of Martin–Solovay's Theorem 4.1, and their union is $\Omega^n$. This implies the condition (2.1) for $\mathcal{L}$.

Lemma 4.3 and Lemma 4.4 lead to the following result.

Theorem 4.5 (Assuming Martin's Axiom). Let $\mathscr{F}$ be a function class consisting of Borel measurable functions on a standard Borel domain $\Omega$, and let $\mathcal{P}$ be a family of probability measures on $\Omega$. Suppose that every countable subclass of $\mathscr{F}$ is uniform Glivenko–Cantelli with respect to $\mathcal{P}$. Then the function class $\mathscr{F}$ is PAC learnable. In addition, there exists a common sample complexity bound for countable subclasses of $\mathscr{F}$, and any such bound gives a sample complexity bound for PAC learnability of $\mathscr{F}$.

We again recall that a set $A \subseteq \Omega$ is universal null if it is Lebesgue measurable with respect to every non-atomic Borel probability measure $\mu$ on $\Omega$ and $\mu(A) = 0$.

Corollary 4.6 (Assuming Martin's Axiom). Let $\mathscr{F}$ be a function class consisting of Borel measurable functions on a standard Borel space $\Omega$. Suppose for every $\epsilon > 0$ there is a natural number $d(\epsilon)$ such that every countable subclass $\mathscr{F}' \subseteq \mathscr{F}$ has $\epsilon$-fat shattering dimension $\le d(\epsilon)$ outside of some universal null set (which depends on $\mathscr{F}'$).
Then the function class $\mathscr{F}$ is PAC learnable under the family $\mathcal{P}$ of non-atomic probability measures, with the standard sample complexity corresponding to the given value of fat shattering dimension.

Proof. Let $\mathscr{F}' \subseteq \mathscr{F}$ be a countable subclass. For every $n \in \mathbb{N}$, choose a universal null set $A_n$ such that the $(1/n)$-fat shattering dimension of $\mathscr{F}'$ restricted to $\Omega \setminus A_n$ is bounded by $d(1/n)$. Consider $A = \bigcup_{n=1}^{\infty} A_n$. The function class $\mathscr{F}'$ restricted to $\Omega \setminus A$ is uniform Glivenko–Cantelli, with the usual sample complexity given by $d(\epsilon)$. In particular, $\mathscr{F}' \upharpoonright (\Omega \setminus A)$ is uniform Glivenko–Cantelli with respect to the family $\mathcal{P}$ of non-atomic probability measures. Since $\mu(A) = 0$ for all $\mu \in \mathcal{P}$, we conclude that the class $\mathscr{F}'$ is uniform Glivenko–Cantelli with respect to $\mathcal{P}$ even if viewed on the original domain of definition, $\Omega$. It remains to apply Theorem 4.5.

Corollary 4.7 (Assuming Martin's Axiom). Let $\mathscr{C}$ be a concept class consisting of Borel measurable functions on a standard Borel space $\Omega$. Suppose that for some $d$ every countable subclass $\mathscr{C}' \subseteq \mathscr{C}$ has VC dimension $\le d$ outside of a universal null set (which depends on $\mathscr{C}'$). Then the concept class $\mathscr{C}$ is PAC learnable under the family $\mathcal{P}$ of non-atomic probability measures, with the standard sample complexity corresponding to the given value of VC dimension.

5. VC dimension and Boolean algebras

Recall that a Boolean algebra, $B = \langle B, \wedge, \vee, \neg, 0, 1\rangle$, consists of a set, $B$, equipped with two associative and commutative binary operations, $\wedge$ ("meet") and $\vee$ ("join"), which are distributive over each other and satisfy the absorption principles $a \vee (a \wedge b) = a$, $a \wedge (a \vee b) = a$, as well as a unary operation $\neg$ (complement) and two elements $0$ and $1$, satisfying $a \vee \neg a = 1$, $a \wedge \neg a = 0$.
For instance, the family $2^\Omega$ of all subsets of a set $\Omega$, with the union as join, intersection as meet, the empty set as $0$ and $\Omega$ as $1$, as well as the set-theoretic complement $\neg A = A^c$, forms a Boolean algebra. In fact, every Boolean algebra can be realized as an algebra of subsets of a suitable $\Omega$. Even better, according to the Stone representation theorem, a Boolean algebra $B$ is isomorphic to the Boolean algebra formed by all open-and-closed subsets of a suitable compact space, $S(B)$, called the Stone space of $B$, where the Boolean algebra operations are interpreted set-theoretically as above.

The space $S(B)$ can be obtained in different ways. For instance, one can think of elements of $S(B)$ as Boolean algebra homomorphisms from $B$ to the two-element Boolean algebra $\{0,1\}$ (the algebra of subsets of a singleton). In this way, $S(B)$ is a closed topological subspace of the compact zero-dimensional space $\{0,1\}^B$ with the usual Tychonoff product topology.

The Stone space of the Boolean algebra $B = 2^\Omega$ is known as the Stone–Čech compactification of $\Omega$, and is denoted $\beta\Omega$. The elements of $\beta\Omega$ are ultrafilters on $\Omega$. A collection $\xi$ of non-empty subsets of $\Omega$ is an ultrafilter if it is closed under finite intersections and if for every subset $A \subseteq \Omega$ either $A \in \xi$ or $A^c \in \xi$. To every point $x \in \Omega$ there corresponds a trivial (principal) ultrafilter, $\bar{x}$, consisting of all sets $A$ containing $x$. However, if $\Omega$ is infinite, the Axiom of Choice assures that there exist non-principal ultrafilters on $\Omega$. Recall that a non-empty family $\Phi$ of non-empty subsets of a set $X$ is a filter if it is closed under finite intersections and supersets. A consequence of the Axiom of Choice states that every filter is contained in an ultrafilter. Now starting with a filter having an empty intersection (e.g.
the filter of all cofinite subsets of the natural numbers), one obtains a non-principal ultrafilter.

Basic open sets in the space $\beta\Omega$ are of the form
\[
\bar{A} = \{\zeta \in \beta\Omega : A \in \zeta\},
\]
where $A \subseteq \Omega$. It is interesting to note that each $\bar{A}$ is at the same time closed, and in fact $\bar{A}$ is the closure of $A$ in $\beta\Omega$. Moreover, every open and closed subset of $\beta\Omega$ is of the form $\bar{A}$.

A one-to-one correspondence between ultrafilters on $\Omega$ and Boolean algebra homomorphisms $2^\Omega \to \{0,1\}$ is this: think of an ultrafilter $\xi$ on $\Omega$ as its own indicator function $\chi_\xi$ on $2^\Omega$, sending $A \subseteq \Omega$ to $1$ if and only if $A \in \xi$. It is not difficult to verify that $\chi_\xi$ is a Boolean algebra homomorphism, and that every homomorphism arises in this way.

The book [10] is a standard reference to the above topics.

Given a subset $\mathscr{C}$ of a Boolean algebra $B$, and a subset $X$ of the Stone space $S(B)$, one can regard $\mathscr{C}$ as a set of binary functions restricted to $X$, and compute the VC dimension of $\mathscr{C}$ over $X$. We will denote this parameter $\mathrm{VC}(\mathscr{C} \upharpoonright X)$.

A subset $I$ of a Boolean algebra $B$ is an ideal if, whenever $x, y \in I$ and $a \in B$, one has $x \vee y \in I$ and $a \wedge x \in I$. Define a symmetric difference on $B$ by the formula $x \,\triangle\, y = (x \vee y) \wedge \neg(x \wedge y)$. The quotient Boolean algebra $B/I$ consists of all equivalence classes modulo the equivalence relation $x \sim y \iff x \,\triangle\, y \in I$. It can be easily verified to be a Boolean algebra on its own, with operations induced from $B$ in a unique way.

The Stone space of $B/I$ can be identified with a compact topological subspace of $S(B)$, consisting of all homomorphisms $B \to \{0,1\}$ whose kernel contains $I$. For instance, if $B = 2^\Omega$ and $I$ is an ideal of subsets of $\Omega$, then the Stone space of $2^\Omega/I$ is easily seen to consist of all ultrafilters on $\Omega$ which do not contain sets from $I$.

Theorem 5.1.
Let $\mathscr{C}$ be a concept class consisting of measurable subsets of a measurable domain $\Omega = (\Omega, \mathcal{A})$, and let $I$ be an ideal of sets on $\Omega$. The following conditions are equivalent.

1. The VC dimension of the (family of closures of the) concept class $\mathscr{C}$ restricted to the Stone space of the quotient algebra $2^\Omega/I$ is at least $n$: $\mathrm{VC}(\mathscr{C} \upharpoonright S(2^\Omega/I)) \ge n$.
2. There exists a family $A_1, A_2, \ldots, A_n$ of subsets of $\Omega$ not belonging to $I$, which is shattered by $\mathscr{C}$ in the sense that if $J \subseteq \{1, 2, \ldots, n\}$, then there is $C \in \mathscr{C}$ which contains all sets $A_i$, $i \in J$, and is disjoint from all sets $A_i$, $i \notin J$. In addition, the subsets $A_i$ can be assumed measurable.

Proof. (1) ⇒ (2). Choose ultrafilters $\xi_1, \ldots, \xi_n$ in the Stone space of the Boolean algebra $2^\Omega/I$, whose collection is shattered by $\mathscr{C}$. For every $J \subseteq \{1, 2, \ldots, n\}$, select $C_J \in \mathscr{C}$ which carves the subset $\{\xi_i : i \in J\}$ out of $\{\xi_1, \ldots, \xi_n\}$. This means $C_J \in \xi_i$ if and only if $i \in J$. For all $i = 1, 2, \ldots, n$, set
\[
A_i = \bigcap_{J \ni i} C_J \;\cap\; \bigcap_{J \not\ni i} C_J^c. \tag{5.1}
\]
Then $A_i \in \xi_i$ and hence $A_i \notin I$. Furthermore, if $i \in J$, then clearly $A_i \subseteq C_J$, and if $i \notin J$, then $A_i \cap C_J = \emptyset$. The sets $A_i$ are measurable by their definition.

(2) ⇒ (1). Let $A_1, A_2, \ldots, A_n$ be a family of subsets of $\Omega$ not belonging to the set ideal $I$ and shattered by $\mathscr{C}$ in the sense of the theorem. For every $i$, the family of sets of the form $A_i \cap B^c$, $B \in I$, generates a filter and so is contained in some ultrafilter $\xi_i$, which is clearly disjoint from $I$ and contains $A_i$. If $J \subseteq \{1, 2, \ldots, n\}$ and $C_J \in \mathscr{C}$ contains all sets $A_i$, $i \in J$, and is disjoint from all sets $A_i$, $i \notin J$, then the closure $\bar{C}_J$ of $C_J$ in the Stone space contains $\xi_i$ if and only if $i \in J$. We conclude: the collection of ultrafilters $\xi_i$, $i = 1, 2, \ldots, n$, which are all contained in the Stone space of $2^\Omega/I$, is shattered by the closed sets $\bar{C}_J$.
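Condition (2) and formula (5.1) can be checked mechanically on a finite toy domain. The sketch below is an illustration only and not part of the original proof: the six-point domain, the concept class, the chosen concepts $C_J$, and the function names `carving_concept`, `shatters`, and `clusters` are all assumptions. On a finite domain there are no ultrafilters other than points, so the code shows only the combinatorics of shattering a family of sets instead of a family of points.

```python
from itertools import combinations

def index_subsets(n):
    """All subsets J of {0, ..., n-1}, as frozensets."""
    for r in range(n + 1):
        for J in combinations(range(n), r):
            yield frozenset(J)

def carving_concept(concepts, family, J):
    """A concept containing every A_i with i in J and disjoint from every A_i
    with i not in J, or None if there is no such concept."""
    for C in concepts:
        if all(A <= C if i in J else C.isdisjoint(A)
               for i, A in enumerate(family)):
            return C
    return None

def shatters(concepts, family):
    """Condition (2): every index set J is carved out by some concept."""
    return all(carving_concept(concepts, family, J) is not None
               for J in index_subsets(len(family)))

def clusters(omega, carving, n):
    """Formula (5.1): A_i is the intersection of C_J over all J containing i
    and of the complements of C_J over all J not containing i."""
    result = []
    for i in range(n):
        A = set(omega)
        for J, C in carving.items():
            A &= C if i in J else omega - C
        result.append(frozenset(A))
    return result

OMEGA = frozenset(range(6))
CONCEPTS = [frozenset(), frozenset({0, 1, 2}), frozenset({3, 4, 5}), OMEGA,
            frozenset({0, 3})]
# Hypothetical concepts C_J carving the two-point set {0, 3}
# (index 0 stands for the point 0, index 1 for the point 3).
CARVING = {frozenset(): frozenset(), frozenset({0}): frozenset({0, 1, 2}),
           frozenset({1}): frozenset({3, 4, 5}), frozenset({0, 1}): OMEGA}

if __name__ == "__main__":
    A = clusters(OMEGA, CARVING, 2)
    print(A, shatters(CONCEPTS, A))
```

Here the two shattered points are thickened by (5.1) into the two three-element blocks, and the same concepts $C_J$ shatter the blocks in the sense of condition (2).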
It follows in particular that the VC dimension of a concept class does not change if the domain $\Omega$ is compactified.

Corollary 5.2. $\mathrm{VC}(\mathscr{C} \upharpoonright \Omega) = \mathrm{VC}(\mathscr{C} \upharpoonright \beta\Omega)$.

Proof. The inequality $\mathrm{VC}(\mathscr{C} \upharpoonright \Omega) \le \mathrm{VC}(\mathscr{C} \upharpoonright \beta\Omega)$ is trivial. To establish the converse, assume there is a subset of $\beta\Omega$ of cardinality $n$ shattered by $\mathscr{C}$. Choose sets $A_i$ as in Theorem 5.1,(2), applied to the trivial ideal $I = \{\emptyset\}$, for which $S(2^\Omega/I) = \beta\Omega$. Clearly, any subset of $\Omega$ meeting each $A_i$ at exactly one point is shattered by $\mathscr{C}$.

Definition 5.3. Given a concept class $\mathscr{C}$ on a domain $\Omega$ and an ideal $I$ of subsets of $\Omega$, we define the VC dimension of $\mathscr{C}$ modulo $I$,
\[
\mathrm{VC}(\mathscr{C} \bmod I) = \mathrm{VC}(\mathscr{C} \upharpoonright S(2^\Omega/I)).
\]
That is, $\mathrm{VC}(\mathscr{C} \bmod I) \ge n$ if and only if any of the equivalent conditions of Theorem 5.1 is met.

Definition 5.4. Let $\mathscr{C}$ be a concept class on a domain $\Omega$. If $I$ is the ideal of all countable subsets of $\Omega$, we denote $\mathrm{VC}(\mathscr{C} \bmod I)$ by $\mathrm{VC}(\mathscr{C} \bmod \omega_1)$ and call it the VC dimension modulo countable sets.

Now Theorem 5.1 validates the definition of VC dimension modulo countable sets in the form stated in the Introduction to our article.

6. Fat-shattering dimension modulo countable sets

When dealing with real-valued functions instead of subsets of the domain, the role of Boolean algebras is taken over by commutative $C^*$-algebras. Here is a brief summary. See e.g. [2] for more.

Recall that a $C^*$-algebra is an associative algebra over the field of complex numbers $\mathbb{C}$ equipped with an involution (an anti-linear map $x \mapsto x^*$) and a norm which is submultiplicative ($\|xy\| \le \|x\|\,\|y\|$) and satisfies the property $\|x^*x\| = \|x\|^2$. For instance, the family $C(X)$ of all continuous complex-valued functions on a compact topological space $X$ forms a commutative unital $C^*$-algebra. Conversely, every commutative unital $C^*$-algebra $A$ is of this form. The space $X$, called the Gelfand space, or the maximal ideal space of $A$, is uniquely defined.
Its elements can be described as non-zero multiplicative complex linear functionals on $A$. The topology on the space of such functionals is the weak star (weak$^*$) topology, that is, the coarsest topology making every evaluation map $\varphi \mapsto \varphi(a)$, $a \in A$, continuous.

We want to calculate the maximal ideal space of the $C^*$-algebra $\ell^\infty(\Omega)$ of all bounded complex-valued functions on a set $\Omega$. With this purpose, we introduce the following notion. Given a bounded scalar-valued function $f$ on a set $\Omega$ and an ultrafilter $\xi$ on $\Omega$, the limit of $f$ along the ultrafilter $\xi$ is a uniquely defined number, $y$, with the property that for each $\epsilon > 0$,
\[
\{x \in \Omega : |f(x) - y| < \epsilon\} \in \xi. \tag{6.1}
\]
The limit along an ultrafilter, or an ultralimit, for short, is denoted $\lim_{x \to \xi} f(x)$. Unlike the usual limit, the ultralimit of a bounded function along a fixed ultrafilter always exists, the proof of which fact mimics the classical Heine–Borel compactness argument for the closed interval. This observation makes the ultralimit a very powerful tool. Its downside is a highly non-constructive nature: typically, the value of an ultralimit of a particular function cannot be computed explicitly except in the "uninteresting" situations where it coincides with the usual limit.

The correspondence $\xi \mapsto \lim_{x \to \xi} f(x)$ defines a continuous function $\bar{f}$ on $\beta\Omega$, which is the unique continuous extension of $f$ over the Stone–Čech compactification $\beta\Omega$. Here, as is usual in set-theoretic topology and analysis, we identify every point $x$ of $\Omega$ with the corresponding principal (trivial) ultrafilter, $\bar{x}$, consisting of all subsets of $\Omega$ which contain $x$ as an element.

If an ultrafilter $\xi$ is fixed, then the correspondence $f \mapsto \bar{f}(\xi)$ is a linear multiplicative functional of norm one on $\ell^\infty(\Omega)$, sending the function $1$ to $1$.
It turns out that every linear multiplicative functional $\varphi$ of norm one on $\ell^\infty(\Omega)$ sending $1$ to $1$ is of this form, that is, is the ultralimit along some ultrafilter on $\Omega$. This is, in fact, a rather simple observation: it suffices to restrict $\varphi$ to the set of all $\{0,1\}$-valued functions on $\Omega$ and notice that the image of every such function is necessarily either $0$ or $1$; the family $\xi$ of all sets $A \subseteq \Omega$ with $\varphi(\chi_A) = 1$ is now seen to be an ultrafilter, and an approximation argument with finite linear combinations shows that for every $f \in \ell^\infty(\Omega)$ one must have $\varphi(f) = \lim_{x \to \xi} f(x)$. In this way the maximal ideal space of $\ell^\infty(\Omega)$ is identified with the space of ultrafilters $\beta\Omega$, that is, the Stone–Čech compactification of $\Omega$. Thus, the $C^*$-algebras $\ell^\infty(\Omega)$ and $C(\beta\Omega)$ are isomorphic. An isomorphism is given by the map $f \mapsto \bar{f}$, where $\bar{f}$ is the unique continuous extension of $f$ over $\beta\Omega$ mentioned above.

Given a $C^*$-algebra $A$, an ideal $I$ of $A$ is a closed linear subspace stable under multiplication by elements of $A$. The quotient algebra $A/I$ is again a $C^*$-algebra (which is in general not an easy fact to prove). If $A$ is a commutative unital $C^*$-algebra and $I$ is a non-trivial ideal ($I \neq A$), then $A/I$ is isomorphic to the algebra of continuous functions on a suitable closed subspace $Y$ of the maximal ideal space $X$ of $A$. A functional $x \in X$ belongs to $Y$ if and only if it factors through the quotient map $\pi \colon A \to A/I$, that is, the kernel of $x \colon A \to \mathbb{C}$ contains $I$. Conversely, every compact subspace of $X$ determines an ideal of $C(X)$.

A link with the Boolean algebra setting is provided by the following observation: every ideal $I$ of subsets of $\Omega$ generates an ideal $\tilde{I}$ of the $C^*$-algebra $\ell^\infty(\Omega)$, as the smallest ideal of $\ell^\infty(\Omega)$ containing the characteristic functions of all elements of $I$.
Now one can verify without difficulty that the maximal ideal space of the $C^*$-algebra $\ell^\infty(\Omega)/\tilde{I}$ is the Stone space of the Boolean algebra $2^\Omega/I$. In fact, every ideal of $\ell^\infty(\Omega)$ is of this form.

Definition 6.1. Let $A$ be a commutative unital $C^*$-algebra, $\mathscr{F}$ a subset of $A$, and $I$ an ideal of $A$. For every $\epsilon > 0$, define the $\epsilon$-fat shattering dimension of $\mathscr{F}$ modulo $I$, denoted $\mathrm{fat}_\epsilon(\mathscr{F} \bmod I)$, as the $\epsilon$-fat shattering dimension of $\mathscr{F}$ viewed as a function class on the maximal ideal space $Y$ of $A/I$.

In a more detailed way, we denote by $\pi \colon A \to A/I$ the quotient homomorphism. A finite set $B \subseteq Y$ is $\epsilon$-fat shattered by $\mathscr{F}$ if for some function $h \colon B \to [0,1]$ and every $C \subseteq B$ there is $f_C \in \mathscr{F}$ with
\[
\begin{cases}
y(\pi(f_C)) > h(y) + \epsilon, & y \in C, \\
y(\pi(f_C)) < h(y) - \epsilon, & y \notin C.
\end{cases}
\]
Here elements $y \in Y$ are treated as functionals on $A/I$. The $\epsilon$-fat shattering dimension of $\mathscr{F}$ modulo $I$, denoted $\mathrm{fat}_\epsilon(\mathscr{F} \bmod I)$, is the supremum of cardinalities of finite subsets of the maximal ideal space of $A/I$ that are $\epsilon$-fat shattered by $\mathscr{F}$.

Definition 6.2. Let $\mathscr{F}$ be a function class on a domain $\Omega$, and let $\epsilon > 0$. We call the $\epsilon$-fat shattering dimension of $\mathscr{F}$ modulo countable sets the value $\mathrm{fat}_\epsilon(\mathscr{F} \bmod \tilde{I})$, where $\tilde{I}$ is the $C^*$-algebra ideal of $\ell^\infty(\Omega)$ generated by characteristic functions of countable sets.

Now we reformulate Definition 6.2 avoiding the $C^*$-algebraic terminology. Let $\beta_{\omega_1}\Omega$ denote the collection of all points of $\beta\Omega$ which, viewed as ultrafilters on $\Omega$, only contain uncountable sets. The $\epsilon$-fat shattering dimension of $\mathscr{F}$ modulo countable sets is the usual $\epsilon$-fat shattering dimension of the class of functions $f \in \mathscr{F}$ extended over $\beta\Omega$ by continuity and then restricted to $\beta_{\omega_1}\Omega$.

We have an analogue of Theorem 5.1.

Theorem 6.3. Let $\mathscr{F}$ be a class of measurable functions on a standard Borel domain $\Omega$, and let $I$ be an ideal of the $C^*$-algebra $\ell^\infty(\Omega)$. Fix any $\epsilon > 0$. The following are equivalent.

1.
The $\epsilon$-fat shattering dimension of $\mathscr{F}$ modulo $I$ is at least $n$.
2. There exists a family $A_1, A_2, \ldots, A_n$ of measurable subsets of $\Omega$ whose indicator functions do not belong to $I$, which is $\epsilon$-fat shattered by $\mathscr{F}$ in the following sense: there is a witness function $h \colon \{1, 2, \ldots, n\} \to [0,1]$ and for each $J \subseteq \{1, 2, \ldots, n\}$ there is an $f_J \in \mathscr{F}$ such that
\[
\begin{aligned}
(i \in J \wedge x \in A_i) &\Rightarrow f_J(x) > h(i) + \epsilon, \\
(i \notin J \wedge x \in A_i) &\Rightarrow f_J(x) < h(i) - \epsilon.
\end{aligned} \tag{6.2}
\]

Proof. Before proceeding to the argument, let us recall that ultrafilters on $\Omega$ are viewed sometimes as mere points of the Stone–Čech compactification $\beta\Omega$, and sometimes as families of subsets of $\Omega$. Every point $x \in \Omega$ is canonically identified with the corresponding principal ultrafilter $\bar{x}$, and every bounded function $f$ on $\Omega$ admits a canonical continuous extension over $\beta\Omega$ via the rule $\bar{f}(\xi) = \lim_{x \to \xi} f(x)$. Notice that this definition implies $\bar{f}(\bar{x}) = f(x)$ whenever $x \in \Omega$.

(1) ⇒ (2). Let $Y \subseteq \beta\Omega$ denote the maximal ideal space of the $C^*$-algebra $\ell^\infty(\Omega)/I$. In other words, $\ell^\infty(\Omega)/I \cong C(Y)$. There exist $n$ elements of $Y$ which are $\epsilon$-fat shattered by $\mathscr{F}$, let us say $\xi_1, \ldots, \xi_n$. Recall that these are ultrafilters on $\Omega$, that is, families of subsets of the domain. Choose a witness function $h \colon \{1, 2, \ldots, n\} \to [0,1]$, and select for every $J \subseteq \{1, 2, \ldots, n\}$ a function $f_J \in \mathscr{F}$ whose ultralimit along $\xi_i$ is $> h(i) + \epsilon$ if $i \in J$, and is $< h(i) - \epsilon$ otherwise. For all $i = 1, 2, \ldots, n$, set
\[
\widetilde{A}_i = \bigcap_{J \ni i} \bigl\{\xi \in \beta\Omega : \bar{f}_J(\xi) > h(i) + \epsilon\bigr\} \;\cap\; \bigcap_{J \not\ni i} \bigl\{\xi \in \beta\Omega : \bar{f}_J(\xi) < h(i) - \epsilon\bigr\}, \tag{6.3}
\]
and consider $A_i = \widetilde{A}_i \cap \Omega$. For every $i$ one has $\xi_i \in \widetilde{A}_i$ by the choice of the functions $f_J$. Since the value $\bar{f}_J(\xi_i)$ is the ultralimit of $f_J$ along $\xi_i$, it follows from the definition of an ultralimit (6.1) that each of the $2^n$ sets appearing in Eq.
(6.3), intersected with $\Omega$, belongs to $\xi_i$, and since $\xi_i$ is closed under finite intersections, one has $A_i \in \xi_i$. Equivalently, $\bar{\chi}_{A_i}(\xi_i) = 1$, which implies that $\chi_{A_i} \notin I$ (as every function in the ideal $I$, or, a bit more precisely, its unique continuous extension over $\beta\Omega$, identically vanishes on $Y$). Since the functions $f_J$ are measurable with regard to the Borel structure on $\Omega$, so are the sets $A_i$. The condition (2) is verified by the definition of the sets $A_i$.

(2) ⇒ (1). Let $A_1, A_2, \ldots, A_n$ be a family of subsets of $\Omega$ satisfying (2). Their topological closures $\bar{A}_i$ taken in $\beta\Omega$ satisfy
\[
\begin{aligned}
(i \in J \wedge \xi \in \bar{A}_i) &\Rightarrow \bar{f}_J(\xi) > h(i) + \epsilon, \\
(i \notin J \wedge \xi \in \bar{A}_i) &\Rightarrow \bar{f}_J(\xi) < h(i) - \epsilon.
\end{aligned}
\]
The condition $\chi_{A_i} \notin I$ can be reformulated as $\bar{A}_i \cap Y \neq \emptyset$. Choose $\xi_i \in \bar{A}_i \cap Y$ for every $i = 1, 2, \ldots, n$. The set $\{\xi_i\}_{i=1}^{n}$ is $\epsilon$-fat shattered by the functions $\bar{f}$, $f \in \mathscr{F}$, with the witness function $\xi_i \mapsto h(i)$.

Remark 6.4. Note that we have not used the assumption of measurability of the subsets $A_i$ in the proof of the implication (2) ⇒ (1).

Corollary 6.5. Let $\mathscr{F}$ be a class of $[0,1]$-valued functions on $\Omega$ and let $\epsilon > 0$. The $\epsilon$-fat shattering dimension of $\mathscr{F}$ equals the $\epsilon$-fat shattering dimension of the set of functions $\bar{f}$, $f \in \mathscr{F}$, on $\beta\Omega$.

Corollary 6.6. Let $\mathscr{F}$ be a class of $[0,1]$-valued functions on $\Omega$ and let $\epsilon > 0$. The $\epsilon$-fat shattering dimension of $\mathscr{F}$ modulo countable sets is the supremum of cardinalities of finite families $A_1, A_2, \ldots, A_n$ of uncountable subsets of $\Omega$ which are $\epsilon$-fat shattered by $\mathscr{F}$ in the sense of Condition (6.2) with a suitable witness function $h \colon \{1, 2, \ldots, n\} \to [0,1]$.

7.
Finiteness of combinatorial dimension modulo countable sets as a necessary condition

In this Section, we remark that, similarly to the classical case of distribution-free learning, finiteness of VC dimension modulo countable sets is necessary for PAC learnability of a concept class under non-atomic measures, but this is not the case for fat shattering dimension of a function class.

Lemma 7.1. Every uncountable Borel subset of a standard Borel space supports a non-atomic Borel probability measure.

Proof. Let $A$ be an uncountable Borel subset of a standard Borel space $\Omega$, that is, $\Omega$ is a Polish space equipped with its Borel structure. According to Souslin's theorem (see e.g. Theorem 3.2.1 in [2]), there exist a Polish (complete separable metric) space $X$ and a continuous one-to-one mapping $f$ of $X$ onto $A$. The Polish space $X$ must therefore be uncountable, and so supports a non-atomic probability measure, $\nu$. The direct image measure $f_*\nu$ on $\Omega$, given by $(f_*\nu)(B) = \nu(f^{-1}(B))$, is a Borel probability measure supported on $A$, and it is non-atomic because the inverse image of every singleton is at most a singleton in $X$ and thus has measure zero.

The following result makes no measurability assumptions on the concept class.

Theorem 7.2. Let $\mathscr{C}$ be a concept class on a domain $(\Omega, \mathcal{B})$ which is a standard Borel space. If $\mathscr{C}$ is PAC learnable under non-atomic measures, then the VC dimension of $\mathscr{C}$ modulo countable sets is finite.

Proof. This is just a minor variation of a classical result for distribution-free PAC learnability (Theorem 2.1(i) in [4]; we will follow the proof as presented in [20], Lemma 7.2 on p. 279).

Suppose $\mathrm{VC}(\mathscr{C} \bmod \omega_1) \ge d$. According to Theorem 5.1, there is a family of uncountable Borel sets $A_i$, $i = 1, 2, \ldots, d$, shattered by $\mathscr{C}$ in our sense. Using Lemma 7.1, select for every $i = 1, 2, \ldots$
, $d$ a non-atomic probability measure $\mu_i$ supported on $A_i$, and let $\mu = \frac{1}{d} \sum_{i=1}^{d} \mu_i$. This $\mu$ is a non-atomic Borel probability measure, giving each $A_i$ equal weight $1/d$. See Figure 3.

[Figure 3: Construction of the measure $\mu$: diffuse measures of mass $1/d$ supported on the sets $A_1, A_2, \ldots, A_d$.]

For every $d$-bit string $\sigma$ there is a concept $C_\sigma \in \mathscr{C}$ which contains all $A_i$ with $\sigma_i = 1$ and is disjoint from $A_i$ with $\sigma_i = 0$. If $A$ and $B$ take constant values on all the sets $A_i$, $i = 1, 2, \ldots, d$, then $d_\mu(A, B)$ is just the normalized Hamming distance between the corresponding $d$-bit strings. Now, given a concept $A$ of the form $C_\sigma$, there are
\[
\sum_{k \le 2\epsilon d} \binom{d}{k}
\]
concepts $B$ of the same form with $d_\mu(A, B) \le 2\epsilon$. This allows us to get the following lower bound on the number of pairwise $2\epsilon$-separated concepts:
\[
\frac{2^d}{\sum_{k \le 2\epsilon d} \binom{d}{k}}.
\]
The Chernoff–Okamoto bound allows one to estimate the above expression from below by $\exp[2(0.5 - 2\epsilon)^2 d]$. We conclude: the metric entropy of $\mathscr{C}$ with regard to $\mu$ is bounded from below by
\[
M(2\epsilon, \mathscr{C}, \mu) \ge \exp[2(0.5 - 2\epsilon)^2 d].
\]
The assumption $\mathrm{VC}(\mathscr{C} \bmod \omega_1) = \infty$ now implies that for every $0 < \epsilon < 0.25$,
\[
\sup_{\mu \in \mathcal{P}} M(2\epsilon, \mathscr{C}, \mu) = \infty,
\]
where $\mathcal{P}$ denotes the family of all non-atomic measures on $\Omega$. By Lemma 7.1 in [20], p. 278, the class $\mathscr{C}$ is not PAC learnable under $\mathcal{P}$.

On the contrary, a function class $\mathscr{F}$ can be PAC learnable under non-atomic measures and still have an infinite fat-shattering dimension modulo countable sets. The following is an adaptation of Example 2.10 in [14].

Example 7.3.
For a given $n \in \mathbb{N}$, call any interval of the form $[i/n, (i+1)/n]$, $i = 0, 1, \ldots, n-1$, an interval of order $n$. Form the class $\mathscr{C}_n$ consisting of all unions of less than $\sqrt[3]{n}$ intervals of order $n$. Let $\mathscr{C}$ be the union of classes $\mathscr{C}_n$, $n \in \mathbb{N}$. Now we will transform $\mathscr{C}$ into a function class. With this purpose, establish a bijection $i$ between $\mathscr{C}$ and the rational points of the interval $[0, 1/3]$. Let $\mathscr{F}$ consist of all functions of the form $f_C$, where
\[
f_C(x) = \chi_C(x) + (-1)^{\chi_C(x)}\, i(C).
\]
Each function $f_C$ takes its (rational) values in $[0, 1/3] \cup [2/3, 1]$ and is uniquely identifiable by its value at any single point $x \in [0,1]$. For this reason, the class $\mathscr{F}$ is (exactly) learnable. A learning rule is given, for instance, by
\[
\mathcal{L}(x, r) = i^{-1}(\min\{r, 1 - r\}),
\]
where $(x, r)$ is a learning 1-sample; this recovers the concept $C$ and hence the function $f_C$. At the same time, $\mathrm{fat}_{1/6}(\mathscr{F} \bmod \omega_1) = \infty$. Indeed, given any $k \in \mathbb{N}$, an arbitrary collection $I_1, I_2, \ldots, I_k$ of $k$ pairwise distinct intervals of order $n = k^3$ is $1/6$-shattered by the functions $f_C$, $C \in \mathscr{C}_n$, with the witness function taking a constant value $1/2$.

This example can be further modified. For instance, one can consider a larger class $\tilde{\mathscr{F}}$ consisting of all functions $f$ for which there exists a $g \in \mathscr{F}$ with $\{x : f(x) \neq g(x)\}$ being a universal null set. The class $\tilde{\mathscr{F}}$ is probably exactly learnable by the same learning rule $\mathcal{L}$ as above.

8. The universally separable case

In this Section we will express our versions of the combinatorial dimension modulo countable sets in terms of the corresponding classical notions. Namely, we will prove that $\mathrm{VC}(\mathscr{C} \bmod \omega_1) \le d$ if and only if every countable subclass of $\mathscr{C}$ has VC dimension $\le d$ outside of a suitable countable set, and similarly for fat shattering dimension.

Lemma 8.1. Let $\mathscr{F}$ be a universally separable function class, with a universally dense countable subset $\mathscr{F}'$.
Then for every $\epsilon > 0$, $\mathrm{fat}_\epsilon(\mathscr{F}) = \mathrm{fat}_\epsilon(\mathscr{F}')$.

Proof. For every $f \in \mathscr{F}$ there is a sequence $(f_n)$ of elements of $\mathscr{F}'$ which converges to $f$ pointwise: given a finite $A \subseteq \Omega$ and a $\gamma > 0$, there is an $N$ such that whenever $n \ge N$, one has $|f(x) - f_n(x)| < \gamma$ for all $x \in A$. This means that if $A$ is $\epsilon$-fat shattered by $\mathscr{F}$, it is equally well shattered by $\mathscr{F}'$, with the same witness function. This observation establishes the inequality $\mathrm{fat}_\epsilon(\mathscr{F}) \le \mathrm{fat}_\epsilon(\mathscr{F}')$, while the converse inequality is trivially true.

Since for a concept class $\mathscr{C}$ one has $\mathrm{VC}(\mathscr{C}) = \mathrm{fat}_\epsilon(\mathscr{C})$ whenever $\epsilon < 1/2$, we obtain:

Corollary 8.2. Let $\mathscr{C}$ be a universally separable concept class, and let $\mathscr{C}'$ be a universally dense countable subset of $\mathscr{C}$. Then $\mathrm{VC}(\mathscr{C}) = \mathrm{VC}(\mathscr{C}')$.

While a version of the following result for fat shattering dimension covers the VC dimension as a particular case, the proof is technically more complicated, and we feel that the complications obscure the simple idea of the proof for VC dimension. For this reason, we give a separate presentation for VC dimension first.

Theorem 8.3. For a universally separable concept class $\mathscr{C}$, the following conditions are equivalent.
1. $\mathrm{VC}(\mathscr{C} \bmod \omega_1) \le d$.
2. There exists a countable subset $A \subseteq \Omega$ such that $\mathrm{VC}(\mathscr{C} \restriction (\Omega \setminus A)) \le d$.

Proof. (1) ⇒ (2): Choose a countable universally dense subfamily $\mathscr{C}'$ of $\mathscr{C}$. Let $\mathscr{B}$ be the smallest Boolean algebra of subsets of $\Omega$ containing $\mathscr{C}'$. Denote by $A$ the union of all elements of $\mathscr{B}$ that are countable sets. Clearly, $\mathscr{B}$ is countable, and so $A$ is a countable set. Let a finite set $B \subseteq \Omega \setminus A$ be shattered by $\mathscr{C}$. Then, by Corollary 8.2, it is shattered by $\mathscr{C}'$. Select a family $\mathscr{S}$ of $2^{|B|}$ sets in $\mathscr{C}'$ shattering $B$. For every $b \in B$ the set
$$[b] = \bigcap_{b \in C \in \mathscr{S}} C \;\cap \bigcap_{b \notin C \in \mathscr{S}} C^c$$
is uncountable (for it belongs to $\mathscr{B}$ yet is not contained in $A$), and the collection of sets $[b]$, $b \in B$, is shattered by $\mathscr{C}'$.
According to (1), $|B| \le d$, from which we deduce (2). Notice that this establishes the inequality $\mathrm{VC}(\mathscr{C} \restriction (\Omega \setminus A)) \le \mathrm{VC}(\mathscr{C} \bmod \omega_1)$.

(2) ⇒ (1): Fix a countable $A \subseteq \Omega$ such that $\mathrm{VC}(\mathscr{C} \restriction (\Omega \setminus A)) \le d$. Suppose a collection of $n$ uncountable sets $A_i$, $i = 1, 2, \ldots, n$, is shattered by $\mathscr{C}$ in our sense. The sets $A_i \setminus A$ are non-empty; pick a representative $a_i \in A_i \setminus A$, $i = 1, 2, \ldots, n$. The resulting set $\{a_i\}_{i=1}^{n}$ is shattered by $\mathscr{C}$, meaning $n \le d$.

Now a version for fat shattering dimension.

Theorem 8.4. For a universally separable function class $\mathscr{F}$ and $\epsilon > 0$, the following conditions are equivalent.
1. $\mathrm{fat}_\epsilon(\mathscr{F} \bmod \omega_1) \le d$.
2. There exists a countable subset $A \subseteq \Omega$ such that $\mathrm{fat}_\epsilon(\mathscr{F} \restriction (\Omega \setminus A)) \le d$.

Proof. (1) ⇒ (2): For a function $f$ on $\Omega$ and $r \in \mathbb{R}$, denote $[f < r] = \{x \in \Omega : f(x) < r\}$ and $[f > r] = \{x \in \Omega : f(x) > r\}$. Let $\mathscr{F}'$ be a countable universally dense subfamily of $\mathscr{F}$. Denote by $\mathscr{B}$ the smallest algebra of subsets of $\Omega$ containing all sets $[f < r]$, $[f > r]$ for $f \in \mathscr{F}'$ and $r \in \mathbb{Q}$. Now denote by $A$ the union of all elements of $\mathscr{B}$ that are countable sets. Since $\mathscr{B}$ is countable, so is $A$.

Let a finite set $B \subseteq \Omega \setminus A$ be $\epsilon$-fat shattered by $\mathscr{F}$. Then, by Lemma 8.1, it is shattered by $\mathscr{F}'$, and by Lemma 2.1, there is a rational $\epsilon' > \epsilon$ and a rational-valued function $h : B \to \mathbb{Q}$ such that $B$ is $\epsilon'$-fat shattered by a family $\mathscr{S} = \{f_C : C \subseteq B\}$ of $2^{|B|}$ functions in $\mathscr{F}'$ with $h$ as a witness function. For every $b \in B$ form the set
$$[b] = \{x \in \Omega : \forall C \subseteq B,\ (b \in C \Rightarrow f_C(x) > h(b) + \epsilon') \wedge (b \notin C \Rightarrow f_C(x) < h(b) - \epsilon')\}.$$
The set $[b]$ belongs to the algebra of sets $\mathscr{B}$ and is not contained in $A$ (for instance, $b \in [b]$ and $b \notin A$). Therefore, $[b]$ is uncountable. If $b, c \in B$ and $b \ne c$, then $[b] \cap [c] = \emptyset$.
Finally, the collection of sets $[b]$, $b \in B$, is $\epsilon'$-fat shattered by $\mathscr{F}'$ with $h$ as a witness function, hence $\epsilon$-fat shattered. Since $|B| \le d$ by (1), we have proved (2), and established the inequality $\mathrm{fat}_\epsilon(\mathscr{F} \restriction (\Omega \setminus A)) \le \mathrm{fat}_\epsilon(\mathscr{F} \bmod \omega_1)$.

(2) ⇒ (1): Fix a countable subset $A \subseteq \Omega$ such that $\mathrm{fat}_\epsilon(\mathscr{F} \restriction (\Omega \setminus A)) \le d$. Suppose a collection of $n$ uncountable sets $A_i$, $i = 1, 2, \ldots, n$, is $\epsilon$-fat shattered by the function class $\mathscr{F}$. The sets $A_i \setminus A$ are non-empty, so we can select a representative $a_i$ in each one of them, $i = 1, 2, \ldots, n$. The resulting set $\{a_i\}_{i=1}^{n}$ is $\epsilon$-fat shattered by $\mathscr{F}$, meaning $n \le d$.

Corollary 8.5. Let $\mathscr{C}$ be a universally separable concept class on a Borel domain $\Omega$. If $d = \mathrm{VC}(\mathscr{C} \bmod \omega_1) < \infty$, then $\mathscr{C}$ is a uniform Glivenko–Cantelli class with respect to non-atomic measures and consistently PAC learnable under non-atomic measures, with a standard sample complexity corresponding to $d$.

Proof. The class $\mathscr{C}$ has finite VC dimension in the complement to a suitable countable subset $A$ of $\Omega$, hence $\mathscr{C}$ is a universal Glivenko–Cantelli class (in the classical sense) in the standard Borel space $\Omega \setminus A$. But $A$ is a universal null set in $\Omega$, hence clearly $\mathscr{C}$ is universal Glivenko–Cantelli with respect to non-atomic measures.

The class $\mathscr{C}$ is distribution-free consistently PAC learnable in the domain $\Omega \setminus A$, with the standard sample complexity $s(\epsilon, \delta, d)$. Let $\mathcal{L}$ be any consistent learning rule for $\mathscr{C}$ in $\Omega$. The restriction of $\mathcal{L}$ to $\Omega \setminus A$ (more exactly, to $\bigcup_{n=1}^{\infty} ((\Omega \setminus A)^n \times \{0, 1\}^n)$) is a consistent learning rule for $\mathscr{C}$ restricted to the standard Borel space $\Omega \setminus A$. Together with the fact that $A$ has measure zero with respect to any non-atomic measure, this implies that $\mathcal{L}$ is a PAC learning rule for $\mathscr{C}$ under non-atomic measures, with the same sample complexity function $s(\epsilon, \delta, d)$.

Similarly, we obtain:

Corollary 8.6.
Let $\mathscr{F}$ be a universally separable function class on a Borel domain $\Omega$. If for every $\epsilon > 0$ one has $d = \mathrm{fat}_\epsilon(\mathscr{F} \bmod \omega_1) < \infty$, then $\mathscr{F}$ is a uniform Glivenko–Cantelli class with respect to non-atomic measures and consistently PAC learnable under non-atomic measures, with a standard sample complexity corresponding to $d$.

Here are the two main conclusions of this Section. Notice that the following criteria no longer assume universal separability of the classes involved.

Corollary 8.7. For a concept class $\mathscr{C}$, the following are equivalent.
1. The VC dimension of $\mathscr{C}$ modulo countable sets is $\le d$;
2. For every countable subclass $\mathscr{C}'$ of $\mathscr{C}$, there exists a countable $A \subseteq \Omega$ such that the VC dimension of $\mathscr{C}'$ restricted to $\Omega \setminus A$ is $\le d$.

Proof. (1) ⇒ (2): the VC dimension modulo countable sets is monotone with respect to subclasses, so $\mathrm{VC}(\mathscr{C}' \bmod \omega_1) \le d$. Now Theorem 8.3 gives the desired conclusion.

(2) ⇒ (1): assume uncountable sets $A_1, A_2, \ldots, A_n$ are shattered by $\mathscr{C}$. Select a family $\mathscr{S}$ of $2^n$ concepts that does the shattering. There is a countable $A$ such that $\mathrm{VC}(\mathscr{S} \restriction (\Omega \setminus A)) \le d$. Choose a representative $a_i$ in each of the non-empty sets $A_i \setminus A$. Since the set $\{a_i\}_{i=1}^{n}$ is shattered by the family $\mathscr{S}$ restricted to $\Omega \setminus A$, one concludes that $n \le d$.

Similarly, one obtains:

Corollary 8.8. For a function class $\mathscr{F}$ and $\epsilon > 0$, the following are equivalent.
1. $\mathrm{fat}_\epsilon(\mathscr{F} \bmod \omega_1) \le d$;
2. For every countable subclass $\mathscr{F}'$ of $\mathscr{F}$, one has $\mathrm{fat}_\epsilon(\mathscr{F}' \restriction (\Omega \setminus A)) \le d$ for a suitable countable $A$ (which depends on $\mathscr{F}'$).

9. Proofs of two theorems from the Introduction

Now we are in a position to prove the two main Theorems 1.1 and 1.2, just by putting together various results established in the article.

9.1. Key to the proof of Theorem 1.1

(1) ⇒ (2): this is Theorem 7.2.
(2) ⇒ (3): Corollary 8.7.
(3) ⇒ (4): assume that for every $d$ there is a countable subclass $\mathscr{C}_d$ of $\mathscr{C}$ with the property that the VC dimension of $\mathscr{C}_d$ is $\ge d$ after removing any countable subset of $\Omega$. Clearly, the countable class $\bigcup_{d=1}^{\infty} \mathscr{C}_d$ will have infinite VC dimension outside of every countable subset of $\Omega$, a contradiction.
(4) ⇒ (6): as a consequence of a classical result of Vapnik and Chervonenkis, every countable subclass $\mathscr{C}'$ is universal Glivenko–Cantelli with respect to all probability measures supported outside of some countable subset of $\Omega$, and a standard bound for the sample complexity $s(\delta, \epsilon)$ only depends on $d$, from which the statement follows.
(6) ⇒ (5): trivial.
(5) ⇒ (1): this is Theorem 4.5, and the only implication requiring Martin's Axiom.

In the universally separable case, the equivalence (2) ⇔ (7) is due to Theorem 8.3, (2) ⇒ (8) follows from Corollary 8.5, (8) ⇒ (9) is standard, and (9) ⇒ (1) is trivial.

9.2. Key to the proof of Theorem 1.2

(2) ⇒ (3): Corollary 8.8.
(3) ⇒ (4): assume that for some $\epsilon > 0$ and every value $d \in \mathbb{N}$ there is a countable subclass $\mathscr{F}_d$ of $\mathscr{F}$ with the property that the $\epsilon$-fat shattering dimension of $\mathscr{F}_d$ is $\ge d$ after removing any countable subset of $\Omega$. Then the countable function class $\bigcup_{d=1}^{\infty} \mathscr{F}_d$ will have infinite $\epsilon$-fat shattering dimension outside of every countable subset of $\Omega$, which is a contradiction.
(4) ⇒ (6): combining the assumption with Theorem 2.5 in [1], one concludes that every countable subclass $\mathscr{F}'$ of $\mathscr{F}$ is universal Glivenko–Cantelli with respect to all probability measures supported outside of a suitable countable subset of $\Omega$, with a standard bound for the sample complexity $s(\delta, \epsilon)$ only depending on $d(\epsilon)$.
(6) ⇒ (5): trivial.
(5) ⇒ (1): Theorem 4.5. This is the only implication requiring Martin's Axiom.
In the universally separable case, the equivalence of (1) and (7) is the statement of Theorem 8.4, (7) ⇒ (8) is Corollary 8.6, and (8) ⇒ (9) is standard.

Note again that the implication (1) ⇒ (2) is in general invalid, cf. Example 7.3.

10. Conclusion and Open Problems

We have characterized concept classes $\mathscr{C}$ that are distribution-free PAC learnable under the family of all non-atomic probability measures on the domain. The criterion is obtained without any measurability conditions on the concept class, but at the expense of making a set-theoretic assumption in the form of Martin's Axiom. In fact, assuming Martin's Axiom makes things easier, and as this axiom is very natural, perhaps it deserves its small corner within the foundations of statistical learning.

Generalizing the result to function classes, using a version of the fat shattering dimension modulo countable sets, did not pose particular technical difficulties. However, the finiteness of this combinatorial parameter is no longer necessary for PAC learnability of a function class under non-atomic measures, just as in the classical distribution-free situation.

It would still be interesting to know if the present results hold without Martin's Axiom, under the assumption that the concept class $\mathscr{C}$ is image admissible Souslin ([6], pages 186–187). The difficulty here is selecting a measurable learning rule $\mathcal{L}$ with the property that the images of all learning samples $(\sigma, C \cap \sigma)$, $\sigma \in \Omega^n$, are uniform Glivenko–Cantelli. An obvious route to pursue is the recursion on the Borel rank of $\mathscr{C}$, but we were unable to follow it through.

Now, a concept class $\mathscr{C}$ will be learnable under non-atomic measures provided there is a hypothesis class $\mathscr{H}$ which has finite VC dimension and such that every $C \in \mathscr{C}$ differs from a suitable $H \in \mathscr{H}$ by a null set. If $\mathscr{C}$ consists of all finite and all cofinite subsets of $\Omega$, this $\mathscr{H}$ is given by $\{\emptyset, \Omega\}$.
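As an informal illustration of the finite/cofinite example, the following sketch simulates learning with the core $\{\emptyset, \Omega\}$ under one particular non-atomic measure. Everything in it is an assumption made for illustration rather than something taken from the text: the domain is taken to be $[0, 1)$ with the uniform distribution, the learning rule is a simple majority vote, and the names `label`, `learn_core` and `simulate` are hypothetical.

```python
import random


def label(x, finite_part, cofinite):
    """Label of x under the finite set, or under its complement if cofinite."""
    inside = x in finite_part
    return int(inside != cofinite)


def learn_core(labels):
    """Pick a hypothesis from the core {empty, Omega} by majority vote.

    This rule is an illustrative assumption, not the paper's construction.
    """
    return "Omega" if 2 * sum(labels) > len(labels) else "empty"


def simulate(cofinite, n=1000, seed=0):
    """Draw n points uniformly from [0, 1) and learn from their labels."""
    rng = random.Random(seed)
    finite_part = {0.25, 0.5, 0.75}  # hypothetical finite set
    xs = [rng.random() for _ in range(n)]
    return learn_core([label(x, finite_part, cofinite) for x in xs])


print(simulate(cofinite=False))  # prints "empty"
print(simulate(cofinite=True))   # prints "Omega"
```

A fixed finite set has measure zero under any non-atomic measure, so the sample almost surely misses it, and the returned hypothesis differs from the target concept only on a null set.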
One may conjecture that $\mathscr{C}$ is learnable under non-atomic measures if and only if it admits such a "core" $\mathscr{H}$ having finite VC dimension. Is this true?

Another natural question is: can one characterize concept classes that are uniformly Glivenko–Cantelli with respect to all non-atomic measures? Apparently, this task requires yet another version of shattering dimension, which is strictly intermediate between Talagrand's "witness of irregularity" [16] and our VC dimension modulo countable sets. We do not have a viable candidate.

Is it possible to construct an example of a concept class of finite VC dimension which is not consistently PAC learnable [5, 4] without additional set-theoretical assumptions, just under the ZFC axiomatics?

Finally, our investigation opens up a possibility of linking learnability and VC dimension to Boolean algebras and their Stone spaces. This could be a glib exercise in generalization for its own sake, or maybe something deeper if one manages to invoke model theory and forcing.

Acknowledgements

The author is most grateful to two anonymous referees for their thorough reading of the paper and numerous useful suggestions which have helped to improve the presentation considerably. Of course the remaining imperfections are all the author's own.

References

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi and D. Haussler, Scale-sensitive dimensions, uniform convergence, and learnability, Journal of the ACM 44 (1997), 615–631.
[2] F. Arveson, An Invitation to C*-Algebras, Graduate Texts in Mathematics 39, Springer-Verlag, New York–Heidelberg, 1976.
[3] P.L. Bartlett and P.M. Long, More theorems about scale-sensitive dimensions and learning, in: Proc. Eighth Annual Conf. on Computational Learning Theory (COLT '95), ACM, New York, NY, USA, 1995, pp. 392–401.
[4] A. Blumer, A. Ehrenfeucht, D. Haussler and M.K.
Warmuth, Learnability and the Vapnik-Chervonenkis dimension, Journal of the ACM 36 (4) (1989), 929–965.
[5] M. Durst and R.M. Dudley, Empirical processes, Vapnik–Chervonenkis classes, and Poisson processes, Prob. and Math. Statistics 1 (1980), 109–115.
[6] R.M. Dudley, Uniform Central Limit Theorems, Cambridge Studies in Advanced Mathematics 63, Cambridge University Press, Cambridge, 1999.
[7] C. Freiling, Axioms of symmetry: throwing darts at the real number line, J. Symbolic Logic 51 (1986), 190–200.
[8] D.H. Fremlin, Consequences of Martin's Axiom, Cambridge Tracts in Mathematics 84, Cambridge University Press, Cambridge, 1984.
[9] T. Jech, Set Theory, Academic Press, New York–London, 1978.
[10] P.T. Johnstone, Stone Spaces, Reprint of the 1982 edition, Cambridge Studies in Advanced Mathematics 3, Cambridge University Press, Cambridge, 1986.
[11] K. Kunen, Set Theory, North-Holland, Amsterdam, 1980.
[12] S. Mendelson, A few notes on statistical learning theory, in: S. Mendelson, A.J. Smola, Eds., Advanced Lectures in Machine Learning, LNCS 2600, Springer, 2003, pp. 1–40.
[13] V. Pestov, PAC learnability of a concept class under non-atomic measures: a problem by Vidyasagar, in: Proc. 21st Intern. Conference on Algorithmic Learning Theory (ALT'2010), Canberra, Australia, 6-8 Oct. 2010 (M. Hutter, F. Stephan, V. Vovk, T. Zeugmann, eds.), Lect. Notes in Artificial Intelligence 6331, Springer, 2010, pp. 134–147.
[14] V. Pestov, A note on sample complexity of learning binary output neural networks under fixed input distributions, in: Proc. 2010 Eleventh Brazilian Symposium on Neural Networks (São Bernardo do Campo, SP, Brazil, 23-28 October 2010), IEEE Computer Society, Los Alamitos-Washington-Tokyo, 2010, pp. 7–12.
[15] D. Pollard, Convergence of Stochastic Processes, Springer-Verlag, New York, 1984.
[16] M.
Talagrand, The Glivenko-Cantelli problem, ten years later, J. Theoret. Probab. 9 (1996), 371–384.
[17] V.N. Vapnik and A.Ja. Červonenkis, The uniform convergence of frequencies of the appearance of events to their probabilities, Dokl. Akad. Nauk SSSR 181 (1968), 781–783 (Russian). Engl. transl.: Soviet Math. Dokl. 9 (1968), 915–918.
[18] V.N. Vapnik and A.Ya. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory of Probability and its Applications 16, issue 2 (1971), 264–280.
[19] M. Vidyasagar, A theory of learning and generalization. With applications to neural networks and control systems, Communications and Control Engineering Series, Springer-Verlag London, Ltd., London, 1997.
[20] M. Vidyasagar, Learning and Generalization, with Applications to Neural Networks, 2nd Ed., Springer-Verlag, 2003.
[21] R.S. Wenokur and P.M. Dudley, Some special Vapnik–Chervonenkis classes, Discrete Math. 33 (1981), 313–318.
