Activized Learning: Transforming Passive to Active with Improved Label Complexity


Authors: Steve Hanneke

Steve Hanneke (shanneke@stat.cmu.edu), Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 USA

Abstract

We study the theoretical advantages of active learning over passive learning. Specifically, we prove that, in noise-free classifier learning for VC classes, any passive learning algorithm can be transformed into an active learning algorithm with asymptotically strictly superior label complexity for all nontrivial target functions and distributions. We further provide a general characterization of the magnitudes of these improvements in terms of a novel generalization of the disagreement coefficient. We also extend these results to active learning in the presence of label noise, and find that even under broad classes of noise distributions, we can typically guarantee strict improvements over the known results for passive learning.

Keywords: Active Learning, Selective Sampling, Sequential Design, Statistical Learning Theory, PAC Learning, Sample Complexity

1. Introduction and Background

The recent rapid growth in data sources has spawned an equally rapid expansion in the number of potential applications of machine learning methodologies to extract useful concepts from this data. However, in many cases, the bottleneck in the application process is the need to obtain accurate annotation of the raw data according to the target concept to be learned. For instance, in webpage classification, it is straightforward to rapidly collect a large number of webpages, but training an accurate classifier typically requires a human expert to examine and label a number of these webpages, which may require significant time and effort. For this reason, it is natural to look for ways to reduce the total number of labeled examples required to train an accurate classifier.
In the traditional machine learning protocol, here referred to as passive learning, the examples labeled by the expert are sampled independently at random, and the emphasis is on designing learning algorithms that make the most effective use of the number of these labeled examples available. However, it is possible to go beyond such methods by altering the protocol itself, allowing the learning algorithm to sequentially select the examples to be labeled, based on its observations of the labels of previously-selected examples; this interactive protocol is referred to as active learning. The objective in designing this selection mechanism is to focus the expert's efforts toward labeling only the most informative data for the learning process, thus eliminating some degree of redundancy in the information content of the labeled examples.

It is now well-established that active learning can sometimes provide significant practical and theoretical advantages over passive learning, in terms of the number of labels required to obtain a given accuracy. However, our current understanding of active learning in general is still quite limited in several respects. First, since we are lacking a complete understanding of the potential capabilities of active learning, we are not yet sure to what standards we should aspire for active learning algorithms to meet, and in particular this challenges our ability to characterize how a "good" active learning algorithm should behave.

∗ Some of these (and related) results previously appeared in the author's doctoral dissertation (Hanneke, 2009b).
Second, since we have yet to identify a complete set of general principles for the design of effective active learning algorithms, in many cases the most effective known active learning algorithms have problem-specific designs (e.g., designed specifically for linear separators, or decision trees, etc., under specific assumptions on the data distribution), and it is not clear what components of their design can be abstracted and transferred to the design of active learning algorithms for different learning problems (e.g., with different types of classifiers, or different data distributions). Finally, we have yet to fully understand the scope of the relative benefits of active learning over passive learning, and in particular the conditions under which such improvements are achievable, as well as a general characterization of the potential magnitudes of these improvements. In the present work, we take steps toward closing this gap in our understanding of the capabilities, general principles, and advantages of active learning.

Additionally, this work has a second theme, motivated by practical concerns. To date, the machine learning community has invested decades of research into constructing solid, reliable, and well-behaved passive learning algorithms, and into understanding their theoretical properties. We might hope that an equivalent amount of effort is not required in order to discover and understand effective active learning algorithms. In particular, rather than starting from scratch in the design and analysis of active learning algorithms, it seems desirable to leverage this vast knowledge of passive learning, to whatever extent possible. For instance, it may be possible to design active learning algorithms that inherit certain desirable behaviors or properties of a given passive learning algorithm.
In this way, we can use a given passive learning algorithm as a reference point, and the objective is to design an active learning algorithm with performance guarantees strictly superior to those of the passive algorithm. Thus, if the passive learning algorithm has proven effective in a variety of common learning problems, then the active learning algorithm should be even better for those same learning problems. This approach also has the advantage of immediately supplying us with a collection of theoretical guarantees on the performance of the active learning algorithm: namely, improved forms of all known guarantees on the performance of the given passive learning algorithm.

Due to its obvious practical advantages, this general line of informal thinking dominates the existing literature on empirically-tested heuristic approaches to active learning, as most of the published heuristic active learning algorithms make use of a passive learning algorithm as a subroutine (e.g., SVM, logistic regression, k-NN, etc.), constructing sets of labeled examples and feeding them into the passive learning algorithm at various times during the execution of the active learning algorithm (see the references in Section 7). Below, we take a more rigorous look at this general strategy. We develop a reduction-style framework for studying this approach to the design of active learning algorithms relative to a given passive learning algorithm. We then proceed to develop and analyze a variety of such methods, to realize this approach in a very general sense. Specifically, we explore the following fundamental questions.

• Is there a general procedure that, given any passive learning algorithm, transforms it into an active learning algorithm requiring significantly fewer labels to achieve a given accuracy?
• If so, how large is the reduction in the number of labels required by the resulting active learning algorithm, compared to the number of labels required by the original passive algorithm?

• What are sufficient conditions for an exponential reduction in the number of labels required?

• To what extent can these methods be made robust to imperfect or noisy labels?

In the process of exploring these questions, we find that for many interesting learning problems, the techniques in the existing literature are not capable of realizing the full potential of active learning. Thus, exploring this topic in generality requires us to develop novel insights and entirely new techniques for the design of active learning algorithms. We also develop corresponding natural complexity quantities to characterize the performance of such algorithms. Several of the results we establish here are more general than any related results in the existing literature, and in many cases the algorithms we develop use significantly fewer labels than any previously published methods.

1.1 Background

The term active learning refers to a family of supervised learning protocols, characterized by the ability of the learning algorithm to pose queries to a teacher, who has access to the target concept to be learned. In practice, the teacher and queries may take a variety of forms: a human expert, in which case the queries may be questions or annotation tasks; nature, in which case the queries may be scientific experiments; a computer simulation, in which case the queries may be particular parameter values or initial conditions for the simulator; or a host of other possibilities.
In our present context, we will specifically discuss a protocol known as pool-based active learning, a type of sequential design based on a collection of unlabeled examples; this seems to be the most common form of active learning in practical use today (e.g., Settles, 2010; Baldridge and Palmer, 2009; Gangadharaiah, Brown, and Carbonell, 2009; Hoi, Jin, Zhu, and Lyu, 2006; Luo, Kramer, Goldgof, Hall, Samson, Remsen, and Hopkins, 2005; Roy and McCallum, 2001; Tong and Koller, 2001; McCallum and Nigam, 1998). We will not discuss alternative models of active learning, such as online (Dekel, Gentile, and Sridharan, 2010) or exact (Hegedüs, 1995).

In the pool-based active learning setting, the learning algorithm is supplied with a large collection of unlabeled examples (the pool), and is allowed to select any example from the pool to request that it be labeled. After observing the label of this example, the algorithm can then select another unlabeled example from the pool to request that it be labeled. This continues sequentially for a number of rounds until some halting condition is satisfied, at which time the algorithm returns a function intended to approximately mimic and generalize the observed labeling behavior. This setting contrasts with passive learning, in which the learning algorithm is supplied with a collection of labeled examples. Supposing the labels received agree with some true target concept, the objective is to use this returned function to approximate the true target concept on future (previously unobserved) data points. The hope is that, by carefully selecting which examples should be labeled, the algorithm can achieve improved accuracy while using fewer labels compared to passive learning.
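The pool-based protocol described above can be summarized in a short sketch. This is a generic illustration only, not any particular algorithm from this paper: the query-selection rule, halting condition, and passive fitting subroutine are all placeholder parameters of our own choosing.

```python
def pool_based_active_learning(pool, oracle, select_query, fit, halt):
    """Generic pool-based active learning loop (illustrative sketch).

    pool         : list of unlabeled examples
    oracle       : function x -> label (the expert/teacher)
    select_query : rule mapping (unlabeled, labeled) to the index of the
                   next pool example to label, or None to stop
    fit          : subroutine mapping labeled data to a classifier
    halt         : stopping condition on the labeled data collected so far
    """
    labeled = []                 # (example, label) pairs observed so far
    unlabeled = list(pool)
    while unlabeled and not halt(labeled):
        i = select_query(unlabeled, labeled)
        if i is None:            # the rule declines to query further
            break
        x = unlabeled.pop(i)
        labeled.append((x, oracle(x)))   # one label request per round
    # return a function intended to mimic the observed labeling behavior
    return fit(labeled)
```

The interesting design question, of course, is what `select_query` should be; the disagreement-based and splitting-based strategies discussed below are precisely different instantiations of that rule.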
The motivation for this setting is simple. For many modern machine learning problems, unlabeled examples are inexpensive and available in abundance, while annotation is time-consuming or expensive. For instance, this is the case in the aforementioned webpage classification problem, where the pool would be the set of all webpages, and labeling a webpage requires a human expert to examine the website content. Settles (2010) surveys a variety of other applications for which active learning is presently being used. To simplify the discussion, in this work we focus specifically on binary classification, in which there are only two possible labels. The results generalize naturally to multiclass classification as well.

As the above description indicates, when studying the advantages of active learning, we are primarily interested in the number of label requests sufficient to achieve a given accuracy, a quantity referred to as the label complexity (Definition 1 below). Although active learning has been an active topic in the machine learning literature for many years now, our theoretical understanding of this topic was largely lacking until very recently. However, within the past few years, there has been an explosion of progress. These advances can be grouped into two categories: namely, the realizable case and the agnostic case.

1.1.1 The Realizable Case

In the realizable case, we are interested in a particularly strict scenario, where the true label of any example is determined by a function of the features (covariates), and where that function has a specific known form (e.g., linear separator, decision tree, union of intervals, etc.); the set of classifiers having this known form is referred to as the concept space. The natural formalization of the realizable case is very much analogous to the well-known PAC model for passive learning (Valiant, 1984).
In the realizable case, there are obvious examples of learning problems where active learning can provide a significant advantage compared to passive learning; for instance, in the problem of learning threshold classifiers on the real line (Example 1 below), a kind of binary search strategy for selecting which examples to request labels for naturally leads to exponential improvements in label complexity compared to learning from random labeled examples (passive learning). As such, there is a natural attraction to determine how general this phenomenon is. This leads us to think about general-purpose learning strategies (i.e., which can be instantiated for more than merely threshold classifiers on the real line), which exhibit this binary search behavior in various special cases.

The first such general-purpose strategy to emerge in the literature was a particularly elegant strategy proposed by Cohn, Atlas, and Ladner (1994), typically referred to as CAL after its discoverers (Meta-Algorithm 2 below). The strategy behind CAL is the following. The algorithm examines each example in the unlabeled pool in sequence, and if there are two classifiers in the concept space consistent with all previously-observed labels, but which disagree on the label of this next example, then the algorithm requests that label, and otherwise it does not. For this reason, below we refer to the general family of algorithms inspired by CAL as disagreement-based methods. Disagreement-based methods are sometimes referred to as "mellow" active learning, since in some sense this is the least we can expect from a reasonable active learning algorithm; it never requests the label of an example whose label it can infer from information already available, but otherwise makes no attempt to seek out particularly informative examples to request the labels of.
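As a concrete illustration (not the paper's Meta-Algorithm 2 itself, which is stated for general concept spaces), the CAL strategy can be specialized to threshold classifiers on the real line. The set of thresholds consistent with the labels seen so far is an interval, a label is requested only where consistent thresholds disagree, and the interval shrinks in a binary-search-like fashion, so the number of label requests is typically only logarithmic in the pool size. The sketch below is our own minimal rendering under these assumptions.

```python
import random

def cal_thresholds(pool, oracle):
    """CAL specialized to the class {x -> 1 if x >= t else 0} on the line.

    The thresholds consistent with the labels seen so far form an
    interval (lo, hi]; a label is requested only for points on which
    consistent thresholds disagree, i.e., only for x with lo < x < hi.
    """
    lo, hi = float("-inf"), float("inf")
    queries = 0
    for x in pool:              # examine the pool in sequence, as CAL does
        if lo < x < hi:         # region of disagreement: request the label
            queries += 1
            if oracle(x) == 1:  # label 1 implies the target t <= x
                hi = x
            else:               # label 0 implies the target t > x
                lo = x
        # otherwise every consistent threshold agrees on x; no request
    return (lo, hi), queries

random.seed(0)
pool = [random.random() for _ in range(1024)]
(lo, hi), queries = cal_thresholds(pool, oracle=lambda x: int(x >= 0.5))
# `queries` is typically on the order of log(len(pool)), far below 1024,
# while (lo, hi] localizes the target threshold to a tiny interval.
```

Note the "mellow" character: the rule never ranks candidates by informativeness; it simply declines to query any point whose label is already determined by the surviving interval.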
That is, the notion of informativeness implicit in disagreement-based methods is a binary one, so that an example is either informative or not informative, but there is no further ranking of the informativeness of examples. The disagreement-based strategy is quite general, and obviously leads to algorithms that are at least reasonable, but Cohn, Atlas, and Ladner (1994) did not study the label complexity achieved by their strategy in any generality.

In a Bayesian variant of the realizable setting, Freund, Seung, Shamir, and Tishby (1997) studied an algorithm known as Query by Committee (QBC), which in some sense represents a Bayesian variant of CAL. However, QBC does distinguish between different levels of informativeness beyond simple disagreement, based on the amount of disagreement on a random unlabeled example. They were able to analyze the label complexity achieved by QBC in terms of a type of information gain, and found that when the information gain is lower bounded by a positive constant, the algorithm achieves a label complexity exponentially smaller than the known results for passive learning. In particular, this is the case for the threshold learning problem, and also for the problem of learning higher-dimensional (nearly balanced) linear separators when the data satisfy a certain (uniform) distribution. Below, we will not discuss this analysis further, since it is for a slightly different (Bayesian) setting. However, the results below in our present setting do have interesting implications for the Bayesian setting as well, as discussed in the recent work of Yang, Hanneke, and Carbonell (2011).

The first general analysis of the label complexity of active learning in the (non-Bayesian) realizable case came in the breakthrough work of Dasgupta (2005).
In that work, Dasgupta proposed a quantity, called the splitting index, to characterize the label complexities achievable by active learning. The splitting index analysis is noteworthy for several reasons. First, one can show it provides nearly tight bounds on the minimax label complexity for a given concept space and data distribution. In particular, the analysis matches the exponential improvements known to be possible for threshold classifiers, as well as generalizations to higher-dimensional homogeneous linear separators under near-uniform distributions (as first established by Dasgupta, Kalai, and Monteleoni (2005, 2009)). Second, it provides a novel notion of informativeness of an example, beyond the simple binary notion of informativeness employed in disagreement-based methods. Specifically, it describes the informativeness of an example in terms of the number of pairs of well-separated classifiers for which at least one out of each pair will definitely be contradicted, regardless of the example's label. Finally, unlike any other existing work on active learning (present work included), it provides an elegant description of the trade-off between the number of label requests and the number of unlabeled examples needed by the learning algorithm.

Another interesting byproduct of Dasgupta's work is a better understanding of the nature of the improvements achievable by active learning in the general case. In particular, his work clearly illustrates the need to study the label complexity as a quantity that varies depending on the particular target concept and data distribution. We will see this issue arise in many of the examples below.
Coming from a slightly different perspective, Hanneke (2007a) later analyzed the label complexity of active learning in terms of an extension of the teaching dimension (Goldman and Kearns, 1995). Related quantities were previously used by Hegedüs (1995) and Hellerstein, Pillaipakkamnatt, Raghavan, and Wilkins (1996) to tightly characterize the number of membership queries sufficient for Exact learning; Hanneke (2007a) provided a natural generalization to the PAC learning setting. At this time, it is not clear how this quantity relates to the splitting index. From a practical perspective, in some instances it may be easier to calculate (see the work of Nowak (2008) for a discussion related to this), though in other cases the opposite seems true.

The next progress toward understanding the label complexity of active learning came in the work of Hanneke (2007b), who introduced a quantity called the disagreement coefficient (Definition 9 below), accompanied by a technique for analyzing disagreement-based active learning algorithms. In particular, implicit in that work, and made explicit in the later work of Hanneke (2011), was the first general characterization of the label complexities achieved by the original CAL strategy for active learning in the realizable case, stated in terms of the disagreement coefficient. The results of the present work are direct descendants of that 2007 paper, and we will discuss the disagreement coefficient, and results based on it, in substantial detail below. Disagreement-based active learners such as CAL are known to be sometimes suboptimal relative to the splitting index analysis, and therefore the disagreement coefficient analysis sometimes results in larger label complexity bounds
Howe ver , in many cases the label compl exity bounds based on the disa greement coeffici ent are surpri singly goo d considerin g the simpl icity of the meth ods. Fur - thermore , as we will see belo w , the disagr eement coefficien t has the practical benefit of often being fair ly straightfor ward to calcu late for a vari ety of learning problems, partic ularly when there is a natura l geometric interpr etation of the classifiers and th e data dist rib ution is relati vely smooth . As we discu ss below , it can also be used to bound the label complex ity of activ e learning in noisy setting s. For thes e reasons (simplicity of algori thms, eas e of calculati on, and appli cability be yond the realiza ble cas e), subsequen t work on the labe l complexit y of act iv e learning has tended to fa v or the disagree ment-base d approach , making use of the disagree ment coef ficient to bound the label comple xity (Dasgu pta, Hsu, and Montele oni, 2007; F r iedman, 200 9; Beyg elzimer , Dasgu pta, and Langford , 2009; W ang, 200 9; Balcan , Hanneke , and V aughan, 201 0; Hannek e, 201 1; K oltchi nskii, 2010; Bey gelzimer , Hsu, L an gford, and Zhang, 2010; Mahalanabis , 2011 ; W ang, 20 11). A signif- icant part of the present paper foc uses on e xtendin g and genera lizing the disagre ement coef ficient analys is, while still maintainin g the relati ve ea se of ca lculatio n that makes the disagre ement coef fi - cient so useful . In ad dition to man y pos iti ve result s, Dasgupta (2005) also pointed out se ve ral neg ati ve re sults, e ven for very simple and natural learning problems. In particu lar , for m a ny probl ems, the minimax label comple xity of acti ve learni ng will be no bett er tha n that of passi ve learni ng. 
In fact, Balcan, Hanneke, and Vaughan (2010) later showed that, for a certain type of active learning algorithm – namely, self-verifying algorithms, which themselves adaptively determine how many label requests they need to achieve a given accuracy – there are even particular target concepts and data distributions for which no active learning algorithm of that type can outperform passive learning. Since all of the above label complexity analyses (splitting index, teaching dimension, disagreement coefficient) apply to certain respective self-verifying learning algorithms, these negative results are also reflected in all of the existing general label complexity analyses as well.

While at first these negative results may seem discouraging, Balcan, Hanneke, and Vaughan (2010) noted that if we do not require the algorithm to be self-verifying, instead simply measuring the number of label requests the algorithm needs to find a good classifier, rather than the number needed to both find a good classifier and verify that it is indeed good, then these negative results vanish. In fact, (shockingly) they were able to show that for any concept space with finite VC dimension, and any fixed data distribution, for any given passive learning algorithm there is an active learning algorithm with asymptotically superior label complexity for every nontrivial target concept! A positive result of this generality and strength is certainly an exciting advance in our understanding of the advantages of active learning. But perhaps equally exciting are the unresolved questions raised by that work, as there are potential opportunities to strengthen, generalize, simplify, and elaborate on this result.
First, note that the above statement allows the active learning algorithm to be specialized to the particular distribution according to which the (unlabeled) data are sampled, and indeed the active learning method used by Balcan, Hanneke, and Vaughan (2010) in their proof has a rather strong direct dependence on the data distribution (which cannot be removed by simply replacing some calculations with data-dependent estimators). One interesting question is whether an alternative approach might avoid this direct distribution-dependence in the algorithm, so that the claim can be strengthened to say that the active algorithm is superior to the passive algorithm for all nontrivial target concepts and data distributions. This question is interesting both theoretically, in order to obtain the strongest possible theorem on the advantages of active learning, as well as practically, since direct access to the distribution from which the data are sampled is typically not available in practical learning scenarios.

A second question left open by Balcan, Hanneke, and Vaughan (2010) regards the magnitude of the gap between the active and passive label complexities. Specifically, although they did find particularly nasty learning problems where the label complexity of active learning will be close to that of passive learning (though always better), they hypothesized that for most natural learning problems, the improvements over passive learning should typically be exponentially large (as is the case for threshold classifiers); they gave many examples to illustrate this point, but left open the problem of characterizing general sufficient conditions for these exponential improvements to be achievable, even when they are not achievable by self-verifying algorithms.
Another question left unresolved by Balcan, Hanneke, and Vaughan (2010) is whether this type of general improvement guarantee might be realized by a computationally efficient active learning algorithm. Finally, they left open the question of whether such general results might be further generalized to settings that involve noisy labels. The present work picks up where Balcan, Hanneke, and Vaughan (2010) left off in several respects, making progress on each of the above questions, in some cases completely resolving the question.

1.1.2 The Agnostic Case

In addition to the above advances in our understanding of active learning in the realizable case, there has also been wonderful progress in making these methods robust to imperfect teachers, feature space underspecification, and model misspecification. This general topic goes by the name agnostic active learning, from its roots in the agnostic PAC model (Kearns, Schapire, and Sellie, 1994). In contrast to the realizable case, in the agnostic case, there is not necessarily a perfect classifier of a known form, and indeed there may even be label noise so that there is no perfect classifier of any form. Rather, we have a given set of classifiers (e.g., linear separators, or depth-limited decision trees, etc.), and the objective is to identify a classifier whose accuracy is not much worse than the best classifier of that type. Agnostic learning is strictly more general, and often more difficult, than realizable learning; this is true for both passive learning and active learning. However, for a given agnostic learning problem, we might still hope that active learning can achieve a given accuracy using fewer labels than required for passive learning.
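The agnostic objective just described has a standard formalization, which we state here in generic notation (the notation is ours, not necessarily that of the formal definitions given later in the paper): writing $\mathcal{C}$ for the given set of classifiers and $\mathrm{er}(h) = \mathbb{P}(h(X) \neq Y)$ for the error rate under the joint distribution of $(X, Y)$, the learner seeks a classifier $\hat{h}$ satisfying

```latex
\mathrm{er}(\hat{h}) \;\le\; \inf_{h \in \mathcal{C}} \mathrm{er}(h) \;+\; \varepsilon
```

for a given accuracy parameter $\varepsilon > 0$. The realizable case is then the special instance in which $\inf_{h \in \mathcal{C}} \mathrm{er}(h) = 0$, so that some classifier in $\mathcal{C}$ labels the data perfectly.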
The general topic of agnostic active learning got its first taste of real progress from Balcan, Beygelzimer, and Langford (2006a, 2009) with the publication of the A² (agnostic active) algorithm. This method is a noise-robust disagreement-based algorithm, which can be applied with essentially arbitrary types of classifiers under arbitrary noise distributions. It is interesting both for its effectiveness and (as with CAL) its elegance. The original work of Balcan, Beygelzimer, and Langford (2006a, 2009) showed that, in some special cases (thresholds, and homogeneous linear separators under a uniform distribution), the A² algorithm does achieve improved label complexities compared to the known results for passive learning.

Using a different type of general active learning strategy, Hanneke (2007a) found that the teaching dimension analysis (discussed above for the realizable case) can be extended beyond the realizable case, arriving at general bounds on the label complexity under arbitrary noise distributions. These bounds improve over the known results for passive learning in many cases. However, the algorithm requires direct access to a certain quantity that depends on the noise distribution (namely, the noise rate, defined in Section 6 below), which would not be available in many real-world learning problems.

Later, Hanneke (2007b) established a general characterization of the label complexities achieved by A², expressed in terms of the disagreement coefficient. The result holds for arbitrary types of classifiers (of finite VC dimension) and arbitrary noise distributions, and represents the natural generalization of the aforementioned realizable-case analysis of CAL. In many cases, this result shows improvements over the known results for passive learning.
Furthermore, because of the simplicity of the disagreement coefficient, the bound can be calculated for a variety of natural learning problems.

Soon after this, Dasgupta, Hsu, and Monteleoni (2007) proposed a new active learning strategy, which is also effective in the agnostic setting. Like A², the new algorithm is a noise-robust disagreement-based method. The work of Dasgupta, Hsu, and Monteleoni (2007) is significant for at least two reasons. First, they were able to establish a general label complexity bound for this method based on the disagreement coefficient. The bound is similar in form to the previous label complexity bound for A² by Hanneke (2007b), but improves the dependence of the bound on the disagreement coefficient. Second, the proposed method of Dasgupta, Hsu, and Monteleoni (2007) set a new standard for computational and aesthetic simplicity in agnostic active learning algorithms. This work has since been followed by related methods of Beygelzimer, Dasgupta, and Langford (2009) and Beygelzimer, Hsu, Langford, and Zhang (2010). In particular, Beygelzimer, Dasgupta, and Langford (2009) develop a method capable of learning under an essentially arbitrary loss function; they also show label complexity bounds similar to those of Dasgupta, Hsu, and Monteleoni (2007), but applicable to a larger class of loss functions, and stated in terms of a generalization of the disagreement coefficient for arbitrary loss functions.

While the above results are encouraging, the guarantees reflected in these label complexity bounds essentially take the form of (at best) constant factor improvements; specifically, in some cases the bounds improve the dependence on the noise rate factor (defined in Section 6 below), compared to the known results for passive learning.
In fact, Kääriäinen (2006) showed that any label complexity bound depending on the noise distribution only via the noise rate cannot do better than this type of constant-factor improvement. This raised the question of whether, with a more detailed description of the noise distribution, one can show improvements in the asymptotic form of the label complexity compared to passive learning. Toward this end, Castro and Nowak (2008) studied a certain refined description of the noise conditions, related to the margin conditions of Mammen and Tsybakov (1999), which are well-studied in the passive learning literature. Specifically, they found that in some special cases, under certain restrictions on the noise distribution, the asymptotic form of the label complexity can be improved compared to passive learning, and in some cases the improvements can even be exponential in magnitude; to achieve this, they developed algorithms specifically tailored to the types of classifiers they studied (threshold classifiers and boundary fragment classes). Balcan, Broder, and Zhang (2007) later extended this result to general homogeneous linear separators under a uniform distribution. Following this, Hanneke (2009a, 2011) generalized these results, showing that both of the published general agnostic active learning algorithms (Balcan, Beygelzimer, and Langford, 2009; Dasgupta, Hsu, and Monteleoni, 2007) can also achieve these types of improvements in the asymptotic form of the label complexity; he further proved general bounds on the label complexities of these methods, again based on the disagreement coefficient, which apply to arbitrary types of classifiers, and which reflect these types of improvements (under conditions on the disagreement coefficient).
Wang (2009) later bounded the label complexity of A² under somewhat different noise conditions, in particular identifying weaker noise conditions sufficient for these improvements to be exponential in magnitude (again, under conditions on the disagreement coefficient). Koltchinskii (2010) has recently improved on some of Hanneke's results, refining certain logarithmic factors and simplifying the proofs, using a slightly different algorithm based on similar principles. Though the present work discusses only classes of finite VC dimension, most of the above references also contain results for various types of nonparametric classes with infinite VC dimension.

At present, all of the published bounds on the label complexity of agnostic active learning also apply to self-verifying algorithms. As mentioned, in the realizable case, it is typically possible to achieve significantly better label complexities if we do not require the active learning algorithm to be self-verifying, since the verification of learning may be more difficult than the learning itself (Balcan, Hanneke, and Vaughan, 2010). We might wonder whether this is also true in the agnostic case, and whether agnostic active learning algorithms that are not self-verifying might possibly achieve significantly better label complexities than the existing label complexity bounds described above. We investigate this in depth below.

1.2 Summary of Contributions

In the present work, we build on and extend the above results in a variety of ways, resolving a number of open problems. The main contributions of this work can be summarized as follows.
• We formally define a notion of a universal activizer, a meta-algorithm that transforms any passive learning algorithm into an active learning algorithm with asymptotically strictly superior label complexities for all nontrivial target concepts and distributions.

• We analyze the existing strategy of disagreement-based active learning from this perspective, precisely characterizing the conditions under which this strategy can lead to a universal activizer in the realizable case.

• We propose a new type of active learning algorithm, based on shatterable sets, and prove that we can construct universal activizers for the realizable case based on this idea; in particular, this overcomes the issue of distribution-dependence in the existing results mentioned above.

• We present a novel generalization of the disagreement coefficient, along with a new asymptotic bound on the label complexities achievable by active learning in the realizable case; this new bound is often significantly smaller than the existing results in the published literature.

• We state new concise sufficient conditions for exponential improvements over passive learning to be achievable in the realizable case, including a significant weakening of known conditions in the published literature.

• We present a new general-purpose active learning algorithm for the agnostic case, based on the aforementioned idea involving shatterable sets.

• We prove a new asymptotic bound on the label complexities achievable by active learning in the presence of label noise (the agnostic case), often significantly smaller than any previously published results.

• We formulate a general conjecture on the theoretical advantages of active learning over passive learning in the presence of arbitrary types of label noise.
1.3 Outline of the Paper

The paper is organized as follows. In Section 2, we introduce the basic notation used throughout, formally define the learning protocol, and formally define the label complexity. We also define the notion of an activizer, which is a procedure that transforms a passive learning algorithm into an active learning algorithm with asymptotically superior label complexity. In Section 3, we review the established technique of disagreement-based active learning, and prove a new result precisely characterizing the scenarios in which disagreement-based active learning can be used to construct an activizer. In particular, we find that in many scenarios, disagreement-based active learning is not powerful enough to provide the desired improvements. In Section 4, we move beyond disagreement-based active learning, developing a new type of active learning algorithm based on shatterable sets of points. We apply this technique to construct a simple 3-stage procedure, which we then prove is a universal activizer for any concept space of finite VC dimension. In Section 5, we begin by reviewing the known results for bounding the label complexity of disagreement-based active learning in terms of the disagreement coefficient; we then develop a somewhat more involved procedure, again based on shatterable sets, which takes full advantage of the sequential nature of active learning. In addition to being an activizer, we show that this procedure often achieves label complexities dramatically superior to those achievable by passive learning. In particular, we define a novel generalization of the disagreement coefficient, and use it to bound the label complexity of this procedure. This also provides us with concise sufficient conditions for obtaining exponential improvements over passive learning.
Continuing in Section 6, we extend our framework to allow for label noise (the agnostic case), and discuss the possibility of extending the results from previous sections to these noisy learning problems. We first review the known results for noise-robust disagreement-based active learning, and characterizations of its label complexity in terms of the disagreement coefficient and Mammen-Tsybakov noise parameters. We then proceed to develop a new type of noise-robust active learning algorithm, again based on shatterable sets, and prove bounds on its label complexity in terms of our aforementioned generalization of the disagreement coefficient. Additionally, we present a general conjecture concerning the existence of activizers for certain passive learning algorithms in the agnostic case. We conclude in Section 7 with a host of enticing open problems for future investigation.

2. Definitions and Notation

For most of the paper, we consider the following formal setting. There is a measurable space (X, F_X), where X is called the instance space; for simplicity, we suppose this is a standard Borel space (Srivastava, 1998) (e.g., R^m under the usual Borel σ-algebra), though most of the results generalize. A classifier is any measurable function h : X → {−1, +1}. There is a set C of classifiers called the concept space. In the realizable case, the learning problem is characterized as follows. There is a probability measure P on X, and a sequence Z_X = {X_1, X_2, . . .} of independent X-valued random variables, each with distribution P. We refer to these random variables as the sequence of unlabeled examples; although in practice this sequence would typically be large but finite, to simplify the discussion and focus strictly on counting labels, we will suppose this sequence is inexhaustible.
There is additionally a special element f ∈ C, called the target function, and we denote Y_i = f(X_i); we further denote by Z = {(X_1, Y_1), (X_2, Y_2), . . .} the sequence of labeled examples, and for m ∈ N we denote by Z_m = {(X_1, Y_1), (X_2, Y_2), . . . , (X_m, Y_m)} the finite subsequence consisting of the first m elements of Z. For any classifier h, we define the error rate er(h) = P(x : h(x) ≠ f(x)). Informally, the learning objective in the realizable case is to identify some h with small er(h) using elements from Z, without direct access to f.

An active learning algorithm A is permitted direct access to the Z_X sequence (the unlabeled examples), but to gain access to the Y_i values it must request them one at a time, in a sequential manner. Specifically, given access to the Z_X values, the algorithm selects any index i ∈ N, requests to observe the Y_i value, then having observed the value of Y_i, selects another index i′, observes the value of Y_{i′}, etc. The algorithm is given as input an integer n, called the label budget, and is permitted to observe at most n labels total before eventually halting and returning a classifier ĥ_n = A(n); that is, by definition, an active learning algorithm never attempts to access more than the given budget n number of labels. We will then study the values of n sufficient to guarantee E[er(ĥ_n)] ≤ ε, for any given value ε ∈ (0, 1). We refer to this as the label complexity. We will be particularly interested in the asymptotic dependence on ε in the label complexity, as ε → 0. Formally, we have the following definition.
Definition 1 An active learning algorithm A achieves label complexity Λ(·, ·, ·) if, for every target function f, distribution P, ε ∈ (0, 1), and integer n ≥ Λ(ε, f, P), we have E[er(A(n))] ≤ ε. ⋄

This definition of label complexity is similar to one originally studied by Balcan, Hanneke, and Vaughan (2010). It has a few features worth noting. First, the label complexity has an explicit dependence on the target function f and distribution P. As noted by Dasgupta (2005), we need this dependence if we are to fully understand the range of label complexities achievable by active learning; we further illustrate this issue in the examples below. The second feature to note is that the label complexity, as defined here, is simply a sufficient budget size to achieve the specified accuracy. That is, here we are asking only how many label requests are required for the algorithm to achieve a given accuracy (in expectation). However, as noted by Balcan, Hanneke, and Vaughan (2010), this number might not be sufficiently large to detect that the algorithm has indeed achieved the required accuracy based only on the observed data. That is, because the number of labeled examples used in active learning can be quite small, we come across the problem that the number of labels needed to learn a concept might be significantly smaller than the number of labels needed to verify that we have successfully learned the concept. As such, this notion of label complexity is most useful in the design of effective learning algorithms, rather than for predicting the number of labels an algorithm should request in any particular application.
Specifically, to design effective active learning algorithms, we should generally desire small label complexity values, so that (in the extreme case) if some algorithm A has smaller label complexity values than some other algorithm A′ for all target functions and distributions, then (all other factors being equal) we should clearly prefer algorithm A over algorithm A′; this is true regardless of whether we have a means to detect (verify) how large the improvements offered by algorithm A over algorithm A′ are for any particular application. Thus, in our present context, this notion of label complexity plays a role analogous to concepts such as universal consistency or admissibility, which are also generally useful in guiding the design of effective algorithms, but are not intended to be informative in the context of any particular application. See the work of Balcan, Hanneke, and Vaughan (2010) for a discussion of this issue, as it relates to a definition of label complexity similar to that above, as well as other notions of label complexity from the active learning literature (some of which include a verification requirement).

We will be interested in the performance of active learning algorithms, relative to the performance of a given passive learning algorithm. In this context, a passive learning algorithm A takes as input a finite sequence of labeled examples L ∈ ⋃_n (X × {−1, +1})^n, and returns a classifier ĥ = A(L). We allow both active and passive learning algorithms to be randomized: that is, to have internal randomness, in addition to the given random data. We define the label complexity for a passive learning algorithm as follows.
Definition 2 A passive learning algorithm A achieves label complexity Λ(·, ·, ·) if, for every target function f, distribution P, ε ∈ (0, 1), and integer n ≥ Λ(ε, f, P), we have E[er(A(Z_n))] ≤ ε. ⋄

Although technically some algorithms may be able to achieve a desired accuracy without any observations, to make the general results easier to state (namely, those in Section 5), unless otherwise stated we suppose label complexities (both passive and active) take strictly positive values, among N ∪ {∞}; note that label complexities (both passive and active) can be infinite, indicating that the corresponding algorithm might not achieve expected error rate ε for any n ∈ N.

Both the passive and active label complexities are defined as a number of labels sufficient to guarantee the expected error rate is at most ε. It is also common in the literature to discuss the number of label requests sufficient to guarantee the error rate is at most ε with high probability 1 − δ (e.g., Balcan, Hanneke, and Vaughan, 2010). In the present work, we formulate our results in terms of the expected error rate because it simplifies the discussion of asymptotics, in that we need only study the behavior of the label complexity as the single argument ε approaches 0, rather than the more complicated behavior of a function of ε and δ as both ε and δ approach 0 at various relative rates. However, we note that analogous results for these high-probability guarantees on the error rate can be extracted from the proofs below without much difficulty, and in several places we explicitly state results of this form.

Below we employ the standard notation from asymptotic analysis, including O(·), o(·), Ω(·), ω(·), Θ(·), ≪, and ≫.
In all contexts below not otherwise specified, the asymptotics are always considered as ε → 0 when considering a function of ε, and as n → ∞ when considering a function of n; also, in any expression of the form "x → 0," we always mean the limit from above (i.e., x ↓ 0). For instance, when considering nonnegative functions of ε, λ_a(ε) and λ_p(ε), the above notations are defined as follows. We say λ_a(ε) = o(λ_p(ε)) when lim_{ε→0} λ_a(ε)/λ_p(ε) = 0, and this is equivalent to writing λ_p(ε) = ω(λ_a(ε)), λ_a(ε) ≪ λ_p(ε), or λ_p(ε) ≫ λ_a(ε). We say λ_a(ε) = O(λ_p(ε)) when lim sup_{ε→0} λ_a(ε)/λ_p(ε) < ∞, which can be equivalently expressed as λ_p(ε) = Ω(λ_a(ε)). Finally, we write λ_a(ε) = Θ(λ_p(ε)) to mean that both λ_a(ε) = O(λ_p(ε)) and λ_a(ε) = Ω(λ_p(ε)) are satisfied.

Define the class of functions Polylog(1/ε) as those g : (0, 1) → [0, ∞) such that, for some k ∈ [0, ∞), g(ε) = O(log^k(1/ε)). For a label complexity Λ, also define the set Nontrivial(Λ) as the collection of all pairs (f, P) of a classifier and a distribution such that, ∀ε > 0, Λ(ε, f, P) < ∞, and ∀g ∈ Polylog(1/ε), Λ(ε, f, P) = ω(g(ε)). In this context, an active meta-algorithm is a procedure A_a taking as input a passive algorithm A_p and a label budget n, such that for any passive algorithm A_p, A_a(A_p, ·) is an active learning algorithm. We define an activizer for a given passive algorithm as follows.

Definition 3 We say an active meta-algorithm A_a activizes a passive algorithm A_p for a concept space C if the following holds.
For any label complexity Λ_p achieved by A_p, the active learning algorithm A_a(A_p, ·) achieves a label complexity Λ_a such that, for every f ∈ C and every distribution P on X with (f, P) ∈ Nontrivial(Λ_p), there exists a constant c ∈ [1, ∞) such that Λ_a(cε, f, P) = o(Λ_p(ε, f, P)). In this case, A_a is called an activizer for A_p with respect to C, and the active learning algorithm A_a(A_p, ·) is called the A_a-activized A_p. ⋄

We also refer to any active meta-algorithm A_a that activizes every passive algorithm A_p for C as a universal activizer for C. One of the main contributions of this work is establishing that such universal activizers do exist for any VC class C.

A bit of explanation is in order regarding Definition 3. We might interpret it as follows: an activizer for A_p strongly improves (in a little-o sense) the label complexity for all nontrivial target functions and distributions. Here, we seek a meta-algorithm that, when given A_p as input, results in an active learning algorithm with strictly superior label complexities. However, there is a sense in which some distributions P or target functions f are trivial relative to A_p. For instance, perhaps A_p has a default classifier that it is naturally biased toward (e.g., with minimal P(x : h(x) = +1), as in the Closure algorithm (Auer and Ortner, 2004)), so that when this default classifier is the target function, A_p achieves a constant label complexity. In these trivial scenarios, we cannot hope to improve over the behavior of the passive algorithm, but instead can only hope to compete with it.
The sense in which we wish to compete may be a subject of some controversy, but the implication of Definition 3 is that the label complexity of the activized algorithm should be strictly better than every nontrivial upper bound on the label complexity of the passive algorithm. For instance, if Λ_p(ε, f, P) ∈ Polylog(1/ε), then we are guaranteed Λ_a(ε, f, P) ∈ Polylog(1/ε) as well, but if Λ_p(ε, f, P) = O(1), we are still only guaranteed Λ_a(ε, f, P) ∈ Polylog(1/ε). This serves the purpose of defining a framework that can be studied without requiring too much obsession over small additive terms in trivial scenarios, thus focusing the analyst's efforts toward nontrivial scenarios where A_p has relatively large label complexity, which are precisely the scenarios for which active learning is truly needed. In our proofs, we find that in fact Polylog(1/ε) can be replaced with log(1/ε), giving a slightly broader definition of "nontrivial," for which all of the results below still hold. Section 7 discusses open problems regarding this issue of trivial problems.

The definition of Nontrivial(·) also only requires the activized algorithm to be effective in scenarios where the passive learning algorithm has reasonable behavior (i.e., finite label complexities); this is only intended to keep with the reduction-based style of the framework, and in fact this restriction can easily be lifted using a trick from Balcan, Hanneke, and Vaughan (2010) (aggregating the activized algorithm with another algorithm that is always reasonable). Finally, we also allow a constant factor c loss in the ε argument to Λ_a.
We allow this to be an arbitrary constant, again in the interest of allowing the analyst to focus only on the most significant aspects of the problem; for most reasonable passive learning algorithms, we typically expect Λ_p(ε, f, P) = Poly(1/ε), in which case c can be set to 1 by adjusting the leading constant factors of Λ_a. A careful inspection of our proofs reveals that c can always be set arbitrarily close to 1 without affecting the theorems below (and in fact, we can even get c = (1 + o(1)), a function of ε).

Throughout this work, we will adopt the usual notation for probabilities, such as P(er(ĥ) > ε), and as usual we interpret this as measuring the corresponding event in the (implicit) underlying probability space. In particular, we make the usual implicit assumption that all sets involved in the analysis are measurable; where this assumption does not hold, we may turn to outer probabilities, though we will not make further mention of these technical details. We will also use the notation P^k(·) to represent k-dimensional product measures; for instance, for a measurable set A ⊆ X^k, P^k(A) = P((X′_1, . . . , X′_k) ∈ A), for independent P-distributed random variables X′_1, . . . , X′_k. Additionally, to simplify notation, we will adopt the convention that X^0 = {∅}, and P^0(X^0) = 1. Throughout, we will denote by 1_A(z) the indicator function for a set A, which has the value 1 when z ∈ A and 0 otherwise; additionally, at times it will be more convenient to use the bipolar indicator function, defined as 1^±_A(z) = 2·1_A(z) − 1. We will require a few additional definitions for the discussion below.
For any classifier h : X → {− 1 , +1 } and finite sequen ce of labe led ex amples L ∈ S m ( X × {− 1 , +1 } ) m , de fine the empirical err or rate er L ( h ) = |L| − 1 P ( x,y ) ∈L 1 {− y } ( h ( x )) ; for complete ness, define er ∅ ( h ) = 0 . Also, for L = Z m , th e first m labeled e xamples in the data sequ ence, abbr ev iate this as er m ( h ) = er Z m ( h ) . For a ny distri bu tion P on X , set of cl assifiers H , classifier h , and r > 0 , define B H ,P ( h, r ) = { g ∈ H : P ( x : h ( x ) 6 = g ( x )) ≤ r } ; when P = P , the distrib ution of the unlabeled e xamples, and P is clear from the con text, we abbrev iate this as B H ( h, r ) = B H , P ( h, r ) ; furthermore, when P = P and H = C , the concept spa ce, and both P and C are clear from the contex t, we abbrev iate this as B( h , r ) = B C , P ( h, r ) . A l so, for any se t of classifiers H , and an y sequence of labeled example s L ∈ S m ( X × {− 1 , +1 } ) m , define H [ L ] = { h ∈ H : er L ( h ) = 0 } ; for any ( x, y ) ∈ X × {− 1 , +1 } , abbre viate H [( x, y )] = H [ { ( x, y ) } ] = { h ∈ H : h ( x ) = y } . W e also adopt the usual definiti on of “shatter ing” use d in lea rning theory (e.g ., V apnik, 1998). Specifically , for any set of class ifiers H , k ∈ N , and S = ( x 1 , . . . , x k ) ∈ X k , we say H sh atter s S if, ∀ ( y 1 , . . . , y k ) ∈ {− 1 , +1 } k , ∃ h ∈ H such that ∀ i ∈ { 1 , . . . , k } , h ( x i ) = y i ; equi v alen tly , H shatte rs S if ∃{ h 1 , . . . , h 2 k } ⊆ H such that for each i, j ∈ { 1 , . . . , 2 k } w it h i 6 = j , ∃ ℓ ∈ { 1 , . . . , k } with h i ( x ℓ ) 6 = h j ( x ℓ ) . T o simplify notation, we will also say th at H shat ters ∅ if and only if H 6 = {} . As us ual, we define the V C dimension of C , denoted d , as the lar gest inte ger k such that ∃ S ∈ X k shatte red by C (V apnik, 1998) . T o focus on no ntri vial problems, we will only con sider concep t spaces C with d > 0 in the results belo w . 
Generally, any such concept space C with d < ∞ is called a VC class.

2.1 Motivating Examples

Throughout this paper, we will repeatedly refer to a few canonical examples. Although themselves quite toy-like, they represent the boiled-down essence of some important distinctions between various types of learning problems. In some sense, the process of grappling with the fundamental distinctions raised by these types of examples has been a driving force behind much of the recent progress in understanding the label complexity of active learning.

The first example is perhaps the most classic, and is clearly the first that comes to mind when considering the potential for active learning to provide strong improvements over passive learning.

Example 1 In the problem of learning threshold classifiers, we consider X = [0, 1] and C = {h_z(x) = 1^±_{[z,1]}(x) : z ∈ (0, 1)}. ⋄

There is a simple universal activizer for threshold classifiers, based on a kind of binary search. Specifically, suppose n ∈ N and that A_p is any given passive learning algorithm. Consider the points in {X_1, X_2, . . . , X_m}, for m = 2^{n−1}, and sort them in increasing order: X_{(1)}, X_{(2)}, . . . , X_{(m)}. Also initialize ℓ = 0 and u = m + 1, and define X_{(0)} = 0 and X_{(m+1)} = 1. Now request the label of X_{(i)} for i = ⌊(ℓ + u)/2⌋ (i.e., the median point between ℓ and u); if the label is −1, let ℓ = i, and otherwise let u = i; repeat this (requesting this median point, then updating ℓ or u accordingly) until we have u = ℓ + 1. Finally, let ẑ = X_{(u)}, construct the labeled sequence L = {(X_1, h_ẑ(X_1)), . . . , (X_m, h_ẑ(X_m))}, and return the classifier ĥ = A_p(L). Since each label request at least halves the set of integers between ℓ and u, the total number of label requests is at most log_2(m) + 1 = n.
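As a concrete illustration, the binary-search strategy just described can be sketched as follows. This is a hypothetical rendering, not code from the paper: `request_label` stands in for the label oracle on Y_i, `passive_alg` for A_p, and `passive_threshold` is a toy passive learner included only so the demo is self-contained.

```python
def passive_threshold(labeled):
    """Toy passive learner (demo only): threshold at the smallest positive point."""
    pos = [x for x, y in labeled if y == +1]
    z = min(pos) if pos else 1.0
    return lambda x: 1 if x >= z else -1

def activized_threshold_learner(xs, request_label, passive_alg, n):
    """Binary-search activizer for threshold classifiers with label budget n.

    xs: at least m = 2**(n-1) unlabeled points in [0, 1];
    request_label: oracle returning the true label f(x) in {-1, +1};
    passive_alg: any passive learner taking [(x, y), ...] and returning a classifier.
    """
    m = 2 ** (n - 1)
    assert len(xs) >= m
    order = sorted(range(m), key=lambda i: xs[i])    # ranks of X_(1), ..., X_(m)
    lo, hi = 0, m + 1                                # sentinels X_(0) = 0, X_(m+1) = 1
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if request_label(xs[order[mid - 1]]) == -1:  # label of X_(mid)
            lo = mid
        else:
            hi = mid
    z_hat = 1.0 if hi == m + 1 else xs[order[hi - 1]]
    # All m labels are now inferable, so we can feed the full labeled sample to A_p.
    labeled = [(x, 1 if x >= z_hat else -1) for x in xs[:m]]
    return passive_alg(labeled)
```

With n = 4 the procedure sees m = 8 unlabeled points but issues at most 4 label requests, while handing A_p a fully (and correctly) labeled 8-point sample, exactly the exponential-in-n compression described above.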
Supposing f ∈ C is the target function, this procedure maintains the invariant that f(X_{(ℓ)}) = −1 and f(X_{(u)}) = +1. Thus, once we reach u = ℓ + 1, since f is a threshold, it must be some h_z with z ∈ (X_{(ℓ)}, X_{(u)}]; therefore every X_{(j)} with j ≤ ℓ has f(X_{(j)}) = −1, and likewise every X_{(j)} with j ≥ u has f(X_{(j)}) = +1; in particular, this means L equals Z_m, the true labeled sequence. But this means ĥ = A_p(Z_m). Since n = log_2(m) + 1, this active learning algorithm will achieve an error rate equivalent to what A_p achieves with m labeled examples, but using only log_2(m) + 1 label requests. In particular, this implies that if A_p achieves label complexity Λ_p, then this active learning algorithm achieves label complexity Λ_a such that Λ_a(ε, f, P) ≤ log_2 Λ_p(ε, f, P) + 2; as long as 1 ≪ Λ_p(ε, f, P) < ∞, this is o(Λ_p(ε, f, P)), so that this procedure activizes A_p for C.

The second example we consider is almost equally simple (only increasing the VC dimension from 1 to 2), but is far more subtle in terms of how we must approach its analysis in active learning.

Example 2 In the problem of learning interval classifiers, we consider X = [0, 1] and C = {h_{[a,b]}(x) = 1^±_{[a,b]}(x) : 0 < a ≤ b < 1}. ⋄

For the intervals problem, we can also construct a universal activizer, though it is slightly more complicated. Specifically, suppose again that n ∈ N and that A_p is any given passive learning algorithm. We first request the labels {Y_1, Y_2, . . . , Y_{⌈n/2⌉}} of the first ⌈n/2⌉ examples in the sequence. If every one of these labels is −1, then we immediately return the all-negative constant classifier ĥ(x) = −1. Otherwise, consider the points {X_1, X_2, . . . , X_m}, for m = max{2^{⌊n/4⌋−1}, n}, and sort them in increasing order X_{(1)}, X_{(2)}, . . . , X_{(m)}. For some value i ∈ {1, . . . , ⌈n/2⌉} with Y_i = +1, let j_+ denote the corresponding index j such that X_{(j)} = X_i. Also initialize ℓ_1 = 0, u_1 = ℓ_2 = j_+, and u_2 = m + 1, and define X_{(0)} = 0 and X_{(m+1)} = 1. Now if ℓ_1 + 1 < u_1, request the label of X_{(i)} for i = ⌊(ℓ_1 + u_1)/2⌋ (i.e., the median point between ℓ_1 and u_1); if the label is −1, let ℓ_1 = i, and otherwise let u_1 = i; repeat this (requesting this median point, then updating ℓ_1 or u_1 accordingly) until we have u_1 = ℓ_1 + 1. Next, if ℓ_2 + 1 < u_2, request the label of X_{(i)} for i = ⌊(ℓ_2 + u_2)/2⌋ (i.e., the median point between ℓ_2 and u_2); if the label is −1, let u_2 = i, and otherwise let ℓ_2 = i; repeat this (requesting this median point, then updating u_2 or ℓ_2 accordingly) until we have u_2 = ℓ_2 + 1. Finally, let â = X_{(u_1)} and b̂ = X_{(ℓ_2)}, construct the labeled sequence L = {(X_1, h_{[â,b̂]}(X_1)), . . . , (X_m, h_{[â,b̂]}(X_m))}, and return the classifier ĥ = A_p(L). Since each label request in the second phase halves the set of values between either ℓ_1 and u_1 or ℓ_2 and u_2, the total number of label requests is at most min{m, ⌈n/2⌉ + 2 log_2(m) + 2} ≤ n.

Suppose f ∈ C is the target function, and let w(f) = P(x : f(x) = +1). If w(f) = 0, then with probability 1 the algorithm will return the constant classifier ĥ(x) = −1, which has er(ĥ) = 0 in this case. Otherwise, if w(f) > 0, then for any n ≥ (2/w(f)) ln(1/ε), with probability at least 1 − ε, there exists i ∈ {1, . . . , ⌈n/2⌉} with Y_i = +1. Let H_+ denote the event that such an i exists. Supposing this is the case, the algorithm will make it into the second phase.
In this case, the procedure maintains the invariant that f(X_{(ℓ_1)}) = −1, f(X_{(u_1)}) = f(X_{(ℓ_2)}) = +1, and f(X_{(u_2)}) = −1, where ℓ_1 < u_1 ≤ ℓ_2 < u_2. Thus, once we have u_1 = ℓ_1 + 1 and u_2 = ℓ_2 + 1, since f is an interval, it must be some h_{[a,b]} with a ∈ (X_{(ℓ_1)}, X_{(u_1)}] and b ∈ [X_{(ℓ_2)}, X_{(u_2)}); therefore every X_{(j)} with j ≤ ℓ_1 or j ≥ u_2 has f(X_{(j)}) = −1, and likewise every X_{(j)} with u_1 ≤ j ≤ ℓ_2 has f(X_{(j)}) = +1; in particular, this means L equals Z_m, the true labeled sequence. But this means ĥ = A_p(Z_m). Supposing A_p achieves label complexity Λ_p, and that n ≥ max{8 + 4 log_2 Λ_p(ε, f, P), (2/w(f)) ln(1/ε)}, then m ≥ 2^{⌊n/4⌋−1} ≥ Λ_p(ε, f, P) and E[er(ĥ)] ≤ E[er(ĥ)1_{H_+}] + (1 − P(H_+)) ≤ E[er(A_p(Z_m))] + ε ≤ 2ε. In particular, this means this active learning algorithm achieves label complexity Λ_a such that, for any f ∈ C with w(f) = 0, Λ_a(2ε, f, P) = 0, and for any f ∈ C with w(f) > 0, Λ_a(2ε, f, P) ≤ max{8 + 4 log_2 Λ_p(ε, f, P), (2/w(f)) ln(1/ε)}. If (f, P) ∈ Nontrivial(Λ_p), then (2/w(f)) ln(1/ε) = o(Λ_p(ε, f, P)) and 8 + 4 log_2 Λ_p(ε, f, P) = o(Λ_p(ε, f, P)), so that Λ_a(2ε, f, P) = o(Λ_p(ε, f, P)). Therefore, this procedure activizes A_p for C.

This example also brings to light some interesting phenomena in the analysis of the label complexity of active learning. Note that unlike the thresholds example, we have a much stronger dependence on the target function in these label complexity bounds, via the w(f) quantity. This issue is fundamental to the problem, and cannot be avoided. In particular, when P([0, x]) is continuous, this is the very issue that makes the minimax label complexity for this problem (i.e., min_{Λ_a} max_{f∈C} Λ_a(ε, f, P)) no better than passive learning (Dasgupta, 2005).
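The two-phase interval procedure can likewise be sketched in code. Again this is a hypothetical illustration rather than the paper's pseudocode; labels are cached so that, as in the analysis, no point is ever requested twice, and `passive_interval` is a toy stand-in for A_p.

```python
import math

def passive_interval(labeled):
    """Toy passive learner (demo only): tightest interval around the positives."""
    pos = [x for x, y in labeled if y == +1]
    if not pos:
        return lambda x: -1
    a, b = min(pos), max(pos)
    return lambda x: 1 if a <= x <= b else -1

def activized_interval_learner(xs, request_label, passive_alg, n):
    """Two-phase activizer for interval classifiers h_[a,b] with label budget n.

    Phase 1 spends up to ceil(n/2) labels looking for a positive example; phase 2
    runs two binary searches to bracket the interval's endpoints among the m
    sorted points. A cache ensures each point's label is requested at most once.
    """
    cache = {}
    def ask(x):
        if x not in cache:
            cache[x] = request_label(x)
        return cache[x]

    k = math.ceil(n / 2)
    if all(ask(x) == -1 for x in xs[:k]):
        return lambda x: -1                     # default all-negative classifier
    m = max(2 ** (n // 4 - 1), n)
    assert len(xs) >= m
    order = sorted(range(m), key=lambda i: xs[i])
    X = lambda j: 0.0 if j == 0 else 1.0 if j == m + 1 else xs[order[j - 1]]
    rank = {i: r + 1 for r, i in enumerate(order)}
    j_plus = rank[next(i for i in range(k) if ask(xs[i]) == +1)]

    def bisect(lo, hi, neg_raises_lo):
        # Shrink (lo, hi) until adjacent; a -1 label moves lo up in the left
        # endpoint search, and moves hi down in the right endpoint search.
        while lo + 1 < hi:
            mid = (lo + hi) // 2
            if (ask(X(mid)) == -1) == neg_raises_lo:
                lo = mid
            else:
                hi = mid
        return lo, hi

    _, u1 = bisect(0, j_plus, neg_raises_lo=True)       # locate left endpoint
    l2, _ = bisect(j_plus, m + 1, neg_raises_lo=False)  # locate right endpoint
    a_hat, b_hat = X(u1), X(l2)
    # All m labels are now inferable, so we can feed the full labeled sample to A_p.
    labeled = [(x, 1 if a_hat <= x <= b_hat else -1) for x in xs[:m]]
    return passive_alg(labeled)
```

Note how the target-dependence enters only through phase 1: the loop scanning for a positive example is exactly the (2/w(f)) ln(1/ε) term in the bound, while phase 2 contributes only the logarithmic binary-search cost.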
Thus, this problem emphasizes the need for any informative label complexity analysis of active learning to explicitly describe the dependence of the label complexity on the target function, as advocated by Dasgupta (2005). This example also highlights the unverifiability phenomenon explored by Balcan, Hanneke, and Vaughan (2010): in the case w(f) = 0, the error rate of the returned classifier is zero, but (for nondegenerate P) there is no way for the algorithm to verify this fact based only on the finite number of labels it observes. In fact, Balcan, Hanneke, and Vaughan (2010) have shown that under continuous P, for any f ∈ C with w(f) = 0, the number of labels required to both find a classifier of small error rate and verify that the error rate is small based only on observable quantities is essentially no better than for passive learning. These issues are present to a small degree in the intervals example, but were easily handled in a very natural way. The target-dependence shows up only in an initial phase of waiting for a positive example, and the always-negative classifiers were handled by setting a default return value. However, we can amplify these issues so that they show up in more subtle and involved ways. Specifically, consider the following example, studied by Balcan, Hanneke, and Vaughan (2010).

Example 3 In the problem of learning unions of i intervals, we consider X = [0, 1] and C = { h_z(x) = 1^±_{∪_{j=1}^{i} [z_{2j−1}, z_{2j}]}(x) : 0 < z_1 ≤ z_2 ≤ ... ≤ z_{2i} < 1 }. ⋄

The challenge of this problem is that, because sometimes z_j = z_{j+1} for some j values, we do not know how many intervals are required to minimally represent the target function: only that it is at most i. This issue will be made clearer below.
We can essentially think of any effective strategy here as having two components: one component that searches (perhaps randomly) with the purpose of identifying at least one example from each decision region, and another component that refines our estimates of the endpoints of the regions the first component identifies. Later, we will go through the behavior of a universal activizer for this problem in detail.

3. Disagreement-Based Active Learning

At present, perhaps the best-understood active learning algorithms are those choosing their label requests based on disagreement among a set of remaining candidate classifiers. The canonical algorithm of this type, a version of which we discuss below in Section 5.1, was proposed by Cohn, Atlas, and Ladner (1994). Specifically, for any set H of classifiers, define the region of disagreement:

DIS(H) = {x ∈ X : ∃h_1, h_2 ∈ H s.t. h_1(x) ≠ h_2(x)}.

The basic idea of disagreement-based algorithms is that, at any given time in the algorithm, there is a subset V ⊆ C of remaining candidates, called the version space, which is guaranteed to contain the target f. When deciding whether to request a particular label Y_i, the algorithm simply checks whether X_i ∈ DIS(V): if so, the algorithm requests Y_i, and otherwise it does not. This general strategy is reasonable, since for any X_i ∉ DIS(V), the label agreed upon by V must be f(X_i), so that we would get no information by requesting Y_i; that is, for X_i ∉ DIS(V), we can accurately infer Y_i based on information already available.
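For a finite (discretized) version space, DIS(H) can be computed directly from its definition. The following minimal Python sketch (our own illustration, using a grid of threshold classifiers h_z(x) = +1 iff x ≥ z) does exactly that:

```python
def region_of_disagreement(H, xs):
    """DIS(H) restricted to the points xs: those x on which two classifiers
    in the finite set H predict different labels."""
    return [x for x in xs if len({h(x) for h in H}) > 1]

# Discretized thresholds h_z(x) = +1 iff x >= z, for z in {0.1, ..., 0.9}.
H = [(lambda x, z=z: +1 if x >= z else -1) for z in [i / 10 for i in range(1, 10)]]
grid = [i / 100 for i in range(101)]
dis = region_of_disagreement(H, grid)
# All thresholds agree below 0.1 (every h says -1) and at or above 0.9 (+1),
# and disagree in between, so DIS(H) is the grid portion of [0.1, 0.9).
```

A disagreement-based learner would request labels only for the points landing in `dis`, and infer the (unanimous) label everywhere else.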
This type of algorithm has recently received substantial attention, not only for its obvious elegance and simplicity, but also because (as we discuss in Section 6) there are natural ways to extend the technique to the general problem of learning with label noise and model misspecification (the agnostic setting). The details of disagreement-based algorithms can vary in how they update the set V and how frequently they do so, but it turns out almost all disagreement-based algorithms share many of the same fundamental properties, which we describe below.

3.1 A Basic Disagreement-Based Active Learning Algorithm

In Section 5.1, we discuss several known results on the label complexities achievable by these types of active learning algorithms. However, for now let us examine a very basic algorithm of this type. The following is intended to be a simple representative of the family of disagreement-based active learning algorithms. It has been stripped down to the bare essentials of what makes such algorithms work. As a result, although the gap between its label complexity and that achieved by passive learning is not necessarily as large as those achieved by the more sophisticated disagreement-based active learning algorithms of Section 5.1, it has the property that whenever those more sophisticated methods have label complexities asymptotically superior to those achieved by passive learning, that guarantee will also be true for this simpler method, and vice versa. The algorithm operates in only two phases. In the first, it uses one batch of label requests to reduce the version space V to a subset of C; in the second, it uses another batch of label requests, this time only requesting labels for points in DIS(V).
Thus, we have isolated precisely that aspect of disagreement-based active learning that involves improvements due to only requesting the labels of examples in the region of disagreement. The procedure is formally defined as follows, in terms of an estimator P̂_n(DIS(V)) specified below.

Meta-Algorithm 0
Input: passive algorithm A_p, label budget n
Output: classifier ĥ
0. Request the first ⌊n/2⌋ labels {Y_1, ..., Y_⌊n/2⌋}, and let t ← ⌊n/2⌋
1. Let V = {h ∈ C : er_⌊n/2⌋(h) = 0}
2. Let Δ̂ ← P̂_n(DIS(V))
3. Let L ← {}
4. For m = ⌊n/2⌋ + 1, ..., ⌊n/2⌋ + ⌊n/(4Δ̂)⌋
5.   If X_m ∈ DIS(V) and t < n, request the label Y_m of X_m, and let ŷ ← Y_m and t ← t + 1
6.   Else let ŷ ← h(X_m) for an arbitrary h ∈ V
7.   Let L ← L ∪ {(X_m, ŷ)}
8. Return A_p(L)

Meta-Algorithm 0 depends on a data-dependent estimator P̂_n(DIS(V)) of P(DIS(V)), which we can define in a variety of ways using only unlabeled examples. In particular, for the theorems below, we will take the following definition for P̂_n(DIS(V)), designed to be a confidence upper bound on P(DIS(V)). Let U_n = {X_{n²+1}, ..., X_{2n²}}. Then define

P̂_n(DIS(V)) = max{ (2/n²) Σ_{x ∈ U_n} 1_{DIS(V)}(x), 4/n }.   (1)

Meta-Algorithm 0 is divided into two stages: one stage where we focus on reducing V, and a second stage where we construct the sample L for the passive algorithm. This might intuitively seem somewhat wasteful, as one might wish to use the requested labels from the first stage to augment those in the second stage when constructing L, thus feeding all of the observed labels into the passive algorithm A_p.
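As a concrete, hedged illustration, the following Python sketch instantiates Meta-Algorithm 0 for threshold classifiers under uniform P on [0, 1], representing the version space V by the interval (a, b] of consistent thresholds; the function names and the simple midpoint passive learner are our own, not part of the formal algorithm.

```python
import random

def meta_algorithm_0_thresholds(passive, label, n, rng):
    """Sketch of Meta-Algorithm 0 for thresholds h_z(x) = +1 iff x >= z.

    passive: maps a labeled sample L to a classifier (here, a threshold);
    label: oracle x -> f(x); n: label budget.
    """
    xs = [rng.random() for _ in range(4 * n * n)]  # unlabeled i.i.d. stream
    t = 0

    # Steps 0-1: request the first floor(n/2) labels; V = (a, b] in threshold space.
    a, b = 0.0, 1.0
    for x in xs[: n // 2]:
        t += 1
        if label(x) == +1:
            b = min(b, x)
        else:
            a = max(a, x)

    # Step 2: estimate P(DIS(V)) = P((a, b)) from fresh unlabeled data, as in (1).
    U = xs[n * n : 2 * n * n]
    delta_hat = max(2.0 * sum(a < x < b for x in U) / (n * n), 4.0 / n)

    # Steps 3-7: build L, requesting labels only for points in DIS(V) = (a, b).
    L = []
    for x in xs[n // 2 : n // 2 + int(n / (4 * delta_hat))]:
        if a < x < b:
            if t >= n:
                break          # budget exhausted (w.h.p. never reached; see text)
            t += 1
            y = label(x)
        else:
            y = +1 if x >= b else -1   # every h in V agrees on x: infer the label
        L.append((x, y))
    return passive(L), t, len(L)

# A toy consistent passive learner: the midpoint between observed classes.
midpoint = lambda L: (max([x for x, y in L if y == -1], default=0.0)
                      + min([x for x, y in L if y == +1], default=1.0)) / 2
```

Running this with n = 100 and a target threshold at, say, 0.42, the passive learner receives a correctly labeled sample several times larger than the number of labels actually requested, which is the source of the improvement analyzed below.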
Indeed, this can improve the label complexity in some cases (albeit only by a constant factor); however, in order to get the general property of being an activizer for all passive algorithms A_p, we construct the sample L so that the conditional distribution of the X components in L given |L| is P^|L|, so that it is (conditionally) an i.i.d. sample, which is essential to our analysis. The choice of the number of (unlabeled) examples to process in the second stage guarantees (by a Chernoff bound) that the "t < n" constraint in Step 5 is redundant; this is a trick we will employ in several of the methods below. As explained above, because f ∈ V, this implies that every (x, y) ∈ L has y = f(x). To give some basic intuition for how this algorithm behaves, consider the example of learning threshold classifiers (Example 1); to simplify the explanation, for now we ignore the fact that P̂_n is only an estimate, as well as the "t < n" constraint in Step 5 (both of which will be addressed in the general analysis below). In this case, suppose the target function is f = h_z. Let a = max{X_i : X_i < z, 1 ≤ i ≤ ⌊n/2⌋} and b = min{X_i : X_i ≥ z, 1 ≤ i ≤ ⌊n/2⌋}. Then V = {h_z′ : a < z′ ≤ b} and DIS(V) = (a, b), so that the second phase of the algorithm only requests labels for points in the region (a, b). With probability 1 − ε, the probability mass in this region is at most O(log(1/ε)/n), so that |L| ≥ ℓ_{n,ε} = Ω(n²/log(1/ε)); also, since the labels in L are all correct, and the X_m values in L are conditionally i.i.d. (with distribution P) given |L|, we see that the conditional distribution of L given |L| = ℓ is the same as the (unconditional) distribution of Z_ℓ.
In particular, if A_p achieves label complexity Λ_p, and ĥ_n is the classifier returned by Meta-Algorithm 0 applied to A_p, then for any n = Ω(√(Λ_p(ε, f, P) log(1/ε))) chosen so that ℓ_{n,ε} ≥ Λ_p(ε, f, P), we have

E[er(ĥ_n)] ≤ ε + sup_{ℓ ≥ ℓ_{n,ε}} E[er(A_p(Z_ℓ))] ≤ ε + sup_{ℓ ≥ Λ_p(ε,f,P)} E[er(A_p(Z_ℓ))] ≤ 2ε.

This indicates the active learning algorithm achieves label complexity Λ_a with Λ_a(2ε, f, P) = O(√(Λ_p(ε, f, P) log(1/ε))). In particular, if ∞ > Λ_p(ε, f, P) = ω(log(1/ε)), then Λ_a(2ε, f, P) = o(Λ_p(ε, f, P)). Therefore, Meta-Algorithm 0 is a universal activizer for the space of threshold classifiers. In contrast, consider the problem of learning interval classifiers (Example 2). In this case, suppose the target function f has P(x : f(x) = +1) = 0, and that P is uniform on [0, 1]. Since (with probability one) every Y_i = −1, we have V = {h_[a,b] : {X_1, ..., X_⌊n/2⌋} ∩ [a, b] = ∅}. But this contains classifiers h_[a,a] for every a ∈ (0, 1) \ {X_1, ..., X_⌊n/2⌋}, so that DIS(V) = (0, 1) \ {X_1, ..., X_⌊n/2⌋}. Thus, P(DIS(V)) = 1, and |L| = O(n); that is, A_p gets run with no more labeled examples than simple passive learning would use. This indicates we should not expect Meta-Algorithm 0 to be a universal activizer for interval classifiers. Below, we formalize this by constructing a passive learning algorithm A_p that Meta-Algorithm 0 does not activize in this scenario.

3.2 The Limiting Region of Disagreement

In this subsection, we generalize the examples from the previous subsection. Specifically, we prove that the performance of Meta-Algorithm 0 is intimately tied to a particular limiting set, referred to as the disagreement core.
A similar definition was given by Balcan, Hanneke, and Vaughan (2010) (there referred to as the boundary, for reasons that will become clear below); it is also related to certain quantities in the work of Hanneke (2007b, 2011) described below in Section 5.1.

Definition 4 Define the disagreement core of a classifier f with respect to a set of classifiers H and distribution P as ∂_{H,P} f = lim_{r→0} DIS(B_{H,P}(f, r)). ⋄

When P = P, the true distribution on X, and P is clear from the context, we abbreviate this as ∂_H f = ∂_{H,P} f; if additionally H = C, the full concept space, which is clear from the context, we further abbreviate this as ∂f = ∂_C f = ∂_{C,P} f. As we will see, disagreement-based algorithms often tend to focus their label requests around the disagreement core of the target function. As such, the concept of the disagreement core will be essential in much of our discussion below. We therefore go through a few examples to build intuition about this concept and its properties. Perhaps the simplest example to start with is C as the class of threshold classifiers (Example 1), under P uniform on [0, 1]. For any h_z ∈ C and sufficiently small r > 0, B(h_z, r) = {h_z′ : |z′ − z| ≤ r}, and DIS(B(h_z, r)) = [z − r, z + r). Therefore, ∂h_z = lim_{r→0} DIS(B(h_z, r)) = lim_{r→0} [z − r, z + r) = {z}. Thus, in this case, the disagreement core of h_z with respect to C and P is precisely the decision boundary of the classifier. As a slightly more involved example, consider again the example of interval classifiers (Example 2), again under P uniform on [0, 1]. Now for any h_[a,b] ∈ C with b − a > 0, for any sufficiently small r > 0, B(h_[a,b], r) = {h_[a′,b′] : |a − a′| + |b − b′| ≤ r}, and DIS(B(h_[a,b], r)) = [a − r, a + r) ∪ (b − r, b + r].
Therefore, ∂h_[a,b] = lim_{r→0} DIS(B(h_[a,b], r)) = lim_{r→0} [a − r, a + r) ∪ (b − r, b + r] = {a, b}. Thus, in this case as well, the disagreement core of h_[a,b] with respect to C and P is again the decision boundary of the classifier. As the above two examples illustrate, ∂f often corresponds to the decision boundary of f in some geometric interpretation of X and f. Indeed, under fairly general conditions on C and P, the disagreement core of f does correspond to (a subset of) the set of points dividing the two label regions of f; for instance, Friedman (2009) derives sufficient conditions under which this is the case. In these cases, the behavior of disagreement-based active learning algorithms can often be interpreted in the intuitive terms of seeking label requests near the decision boundary of the target function, to refine an estimate of that boundary. However, in some more subtle scenarios this is no longer the case, for interesting reasons. To illustrate this, let us continue the example of interval classifiers from above, but now consider h_[a,a] (i.e., h_[a,b] with a = b). This time, for any r ∈ (0, 1) we have B(h_[a,a], r) = {h_[a′,b′] ∈ C : b′ − a′ ≤ r}, and DIS(B(h_[a,a], r)) = (0, 1). Therefore, ∂h_[a,a] = lim_{r→0} DIS(B(h_[a,a], r)) = lim_{r→0} (0, 1) = (0, 1). This example shows that in some cases, the disagreement core does not correspond to the decision boundary of the classifier, and indeed has P(∂f) > 0. Intuitively, as in the above example, this typically happens when the decision surface of the classifier is in some sense simpler than it could be. For instance, consider the space C of unions of two intervals (Example 3 with i = 2) under uniform P.
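These limits can be checked numerically. The sketch below (our own illustration, with discretized classes and Monte-Carlo estimates standing in for the exact probabilities) estimates P(DIS(B(f, r))) for shrinking r, contrasting a threshold target, whose disagreement core has measure zero, with the degenerate interval target, whose core is all of (0, 1):

```python
import random

def dis_ball_mass(classifiers, f, r, xs):
    """Estimate P(DIS(B(f, r))): restrict to the ball of classifiers within
    (empirical) distance r of f, then measure the fraction of xs on which
    the ball still disagrees."""
    ball = [h for h in classifiers
            if sum(h(x) != f(x) for x in xs) / len(xs) <= r]
    first = ball[0]
    return sum(any(h(x) != first(x) for h in ball) for x in xs) / len(xs)

rng = random.Random(0)
xs = [rng.random() for _ in range(2000)]          # Monte-Carlo sample of P
grid = [i / 50 for i in range(51)]
thresholds = [(lambda x, z=z: +1 if x >= z else -1) for z in grid]
intervals = [(lambda x, a=a, b=b: +1 if a <= x <= b else -1)
             for a in grid for b in grid if a <= b]

f_thr = lambda x: +1 if x >= 0.5 else -1          # threshold target
f_neg = lambda x: -1                              # "empty interval" target

# Thresholds: DIS(B(f, r)) has mass about 2r, vanishing as r -> 0; but for
# the empty-interval target, the ball of radius r contains every short
# interval, so DIS(B(f, r)) covers essentially all of (0, 1) for every r.
m_thr_big = dis_ball_mass(thresholds, f_thr, 0.10, xs)
m_thr_small = dis_ball_mass(thresholds, f_thr, 0.02, xs)
m_int = dis_ball_mass(intervals, f_neg, 0.05, xs)
```

Here `m_thr_big` ≈ 0.2 and `m_thr_small` ≈ 0.04 shrink with r, while `m_int` stays near 1, mirroring ∂h_z = {z} versus ∂h_[a,a] = (0, 1).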
The classifiers f ∈ C with P(∂f) > 0 are precisely those representable (up to probability-zero differences) as a single interval. The others (with 0 < z_1 < z_2 < z_3 < z_4 < 1) have ∂h_z = {z_1, z_2, z_3, z_4}. In these examples, the f ∈ C with P(∂f) > 0 are not only simpler than other nearby classifiers in C, but they are also in some sense degenerate relative to the rest of C; however, it turns out this is not always the case, as there exist scenarios (C, P), even with d = 2, and even with countable C, for which every f ∈ C has P(∂f) > 0; in these cases, every classifier is in some important sense simpler than some other subset of nearby classifiers in C. In Section 3.3, we show that the label complexity of disagreement-based active learning is intimately tied to the disagreement core. In particular, scenarios where P(∂f) > 0, such as those mentioned above, lead to the conclusion that disagreement-based methods are sometimes insufficient for activized learning. This motivates the design of more sophisticated methods in Section 4, which overcome this deficiency, along with a corresponding refinement of the definition of "disagreement core" in Section 5.2 that eliminates the above issue with "simple" classifiers.

3.3 Necessary and Sufficient Conditions for Disagreement-Based Activized Learning

In the specific case of Meta-Algorithm 0, for large n we may intuitively expect it to focus its second batch of label requests in and around the disagreement core of the target function. Thus, whenever P(∂f) = 0, we should expect the label requests to be quite focused, and therefore the algorithm should achieve higher accuracy compared to passive learning.
On the other hand, if P(∂f) > 0, then the label requests will not become focused beyond a constant fraction of the space, so that the improvements achieved by Meta-Algorithm 0 over passive learning should be, at best, a constant factor. This intuition is formalized in the following general theorem, the proof of which is included in Appendix A.

Theorem 5 For any VC class C, Meta-Algorithm 0 is a universal activizer for C if and only if every f ∈ C and distribution P has P(∂_{C,P} f) = 0. ⋄

While the formal proof is given in Appendix A, the general idea is simple. As we always have f ∈ V, any ŷ inferred in Step 6 must equal f(x), so that all of the labels in L are correct. Also, as n grows large, classic results on passive learning imply the diameter of the set V will become small, shrinking to zero as n → ∞ (Vapnik, 1982; Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989). Therefore, as n → ∞, DIS(V) should converge to a subset of ∂f, so that in the case P(∂f) = 0, we have Δ̂ → 0; thus |L| ≫ n, which implies an asymptotic strict improvement in label complexity over the passive algorithm A_p that L is fed into in Step 8. On the other hand, since ∂f is defined by classifiers arbitrarily close to f, it is unlikely that any finite sample of correctly labeled examples can contradict enough classifiers to make DIS(V) significantly smaller than ∂f, so that we always have P(DIS(V)) ≥ P(∂f). Therefore, if P(∂f) > 0, then Δ̂ converges to some nonzero constant, so that |L| = O(n), representing only a constant factor improvement in label complexity.
In fact, as is implied by this sketch (and is proven in Appendix A), the targets f and distributions P for which Meta-Algorithm 0 achieves asymptotic strict improvements over all passive learning algorithms (for which f and P are nontrivial) are precisely those (and only those) for which P(∂_{C,P} f) = 0. There are some general conditions under which the zero-probability disagreement cores condition of Theorem 5 will hold. For instance, it is not difficult to show this will always hold when X is countable; furthermore, with some effort one can show it will hold for most classes having VC dimension one (e.g., any countable C with d = 1). However, as we have seen, not all spaces C satisfy this zero-probability disagreement cores property. In particular, for the interval classifiers studied in Section 3.2, we have P(∂h_[a,a]) = P((0, 1)) = 1. Indeed, the aforementioned special cases aside, for most nontrivial spaces C, one can construct distributions P that in some sense mimic the intervals problem, so that we should typically expect disagreement-based methods will not be activizers. For detailed discussions of various scenarios where the P(∂_{C,P} f) = 0 condition is (or is not) satisfied for various C, P, and f, see the works of Hanneke (2009b, 2007b, 2011); Balcan, Hanneke, and Vaughan (2010); Friedman (2009); Wang (2009, 2011).

4. Beyond Disagreement: A Basic Activizer

Since the zero-probability disagreement cores condition of Theorem 5 is not always satisfied, we are left with the question of whether there could be other techniques for active learning, beyond simple disagreement-based methods, which could activize every passive learning algorithm for every VC class.
In this section, we present an entirely new type of active learning algorithm, unlike anything in the existing literature, and we show that it is indeed a universal activizer for any class C of finite VC dimension.

4.1 A Basic Activizer

As mentioned, the case P(∂f) = 0 is already handled nicely by disagreement-based methods, since the label requests made in the second stage of Meta-Algorithm 0 will become focused into a small region, and L therefore grows faster than n. Thus, the primary question we are faced with is what to do when P(∂f) > 0. Since (loosely speaking) we have DIS(V) → ∂f in Meta-Algorithm 0, P(∂f) > 0 corresponds to scenarios where the label requests of Meta-Algorithm 0 will not become focused beyond a certain extent; specifically, since P(DIS(V) ⊕ ∂f) → 0 almost surely (where ⊕ is the symmetric difference), Meta-Algorithm 0 will request labels for a constant fraction of the examples in L. On the one hand, this is definitely a major problem for disagreement-based methods, since it prevents them from improving over passive learning in those cases. On the other hand, if we do not restrict ourselves to disagreement-based methods, we may actually be able to exploit properties of this scenario, so that it works to our advantage. In particular, since P(DIS(V) ⊕ ∂_C f) → 0 and P(∂_V f ⊕ ∂_C f) = 0 (almost surely) in Meta-Algorithm 0, for sufficiently large n a random point x_1 in DIS(V) is likely to be in ∂_V f. We can exploit this fact by using x_1 to split V into two subsets: V[(x_1, +1)] and V[(x_1, −1)]. Now, if x_1 ∈ ∂_V f, then (by definition of the disagreement core) inf_{h ∈ V[(x_1,+1)]} er(h) = inf_{h ∈ V[(x_1,−1)]} er(h) = 0. Therefore, for almost every point x ∉ DIS(V[(x_1, +1)]), the label agreed upon for x by classifiers in V[(x_1, +1)] should be f(x).
Similarly, for almost every point x ∉ DIS(V[(x_1, −1)]), the label agreed upon for x by classifiers in V[(x_1, −1)] should be f(x). Thus, we can accurately infer the label of any point x ∉ DIS(V[(x_1, +1)]) ∩ DIS(V[(x_1, −1)]) (except perhaps a probability-zero subset). With these sets V[(x_1, +1)] and V[(x_1, −1)] in hand, there is no longer a need to request the labels of points for which either of them has agreement about the label, and we can focus our label requests on the region DIS(V[(x_1, +1)]) ∩ DIS(V[(x_1, −1)]), which may be much smaller than DIS(V). Now if P(DIS(V[(x_1, +1)]) ∩ DIS(V[(x_1, −1)])) → 0, then the label requests will become focused to a shrinking region, and by the same reasoning as for Theorem 5 we can asymptotically achieve strict improvements over passive learning by a method analogous to Meta-Algorithm 0 (with changes as described above). Already this provides a significant improvement over disagreement-based methods in many cases; indeed, in some cases (such as intervals) this already addresses the nonzero-probability disagreement core issue in Theorem 5. In other cases (such as unions of two intervals), it does not completely address the issue, since for some targets we do not have P(DIS(V[(x_1, +1)]) ∩ DIS(V[(x_1, −1)])) → 0. However, by repeatedly applying this same reasoning, we can address the issue in full generality. Specifically, if P(DIS(V[(x_1, +1)]) ∩ DIS(V[(x_1, −1)])) ↛ 0, then DIS(V[(x_1, +1)]) ∩ DIS(V[(x_1, −1)]) essentially converges to a region ∂_{C[(x_1,+1)]} f ∩ ∂_{C[(x_1,−1)]} f, which has nonzero probability, and is nearly equivalent to ∂_{V[(x_1,+1)]} f ∩ ∂_{V[(x_1,−1)]} f.
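The effect of conditioning on x_1 can be seen directly in a small simulation (our own illustration, with the interval class discretized to a grid): with an all-negative sample, DIS(V) is essentially the whole space, but after splitting on a point x_1 ∈ DIS(V), the intersection DIS(V[(x_1, +1)]) ∩ DIS(V[(x_1, −1)]) collapses to a small neighborhood of x_1.

```python
import random

rng = random.Random(3)
grid = [i / 100 for i in range(1, 100)]
C = [(lambda x, a=a, b=b: +1 if a <= x <= b else -1)
     for a in grid for b in grid if a <= b]

def dis(H, xs):
    """Grid restriction of the region of disagreement of the finite set H."""
    return {x for x in xs if len({h(x) for h in H}) > 1}

obs = rng.sample(grid, 20)                    # observed points, all labeled -1
V = [h for h in C if all(h(x) == -1 for x in obs)]

# No focusing: every unobserved grid point is still in DIS(V).
assert len(dis(V, grid)) == len(grid) - len(obs)

x1 = sorted(dis(V, grid))[40]                 # a point in DIS(V)
V_plus = [h for h in V if h(x1) == +1]
V_minus = [h for h in V if h(x1) == -1]
both = dis(V_plus, grid) & dis(V_minus, grid)

# Focusing: the intersection is confined to the gap between the observed
# negatives adjacent to x1, a small fraction of the space.
assert len(both) < len(dis(V, grid)) / 2
```

Every classifier in V[(x_1, +1)] is an interval containing x_1, so its disagreement region is trapped between the observed negatives bracketing x_1, which is exactly why the intervals case is repaired by a single split.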
Thus, for sufficiently large n, a random x_2 in DIS(V[(x_1, +1)]) ∩ DIS(V[(x_1, −1)]) will likely be in ∂_{V[(x_1,+1)]} f ∩ ∂_{V[(x_1,−1)]} f. In this case, we can repeat the above argument, this time splitting V into four sets (V[(x_1, +1)][(x_2, +1)], V[(x_1, +1)][(x_2, −1)], V[(x_1, −1)][(x_2, +1)], and V[(x_1, −1)][(x_2, −1)]), each with infimum error rate equal to zero, so that for any point x in the region of agreement of any of these four sets, the agreed-upon label will (almost surely) be f(x), so that we can infer that label. Thus, we need only request the labels of those points in the intersection of all four regions of disagreement. We can further repeat this process as many times as needed, until we get a partition of V with shrinking probability mass in the intersection of the regions of disagreement, which (as above) can then be used to obtain asymptotic improvements over passive learning. Note that the above argument can be written more concisely in terms of shattering. That is, any x ∈ DIS(V) is simply an x such that V shatters {x}; a point x ∈ DIS(V[(x_1, +1)]) ∩ DIS(V[(x_1, −1)]) is simply one for which V shatters {x_1, x}; and for any x ∉ DIS(V[(x_1, +1)]) ∩ DIS(V[(x_1, −1)]), the label y we infer about x has the property that the set V[(x, −y)] does not shatter {x_1}. This continues for each repetition of the above idea, with x in the intersection of the four regions of disagreement simply being one for which V shatters {x_1, x_2, x}, and so on. In particular, this perspective makes it clear that we need only repeat this idea at most d times to get a shrinking intersection region, since no set of d + 1 points is shatterable.
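The shatterability reformulation is easy to state in code. This hedged sketch (a finite discretized interval class; the helper names are our own) checks whether a finite version space V shatters a set S, and exhibits the mechanics above: a single point off the observed data is always shattered (it lies in DIS(V)), while a pair {x_1, x_2} is shattered exactly when no observed negative separates x_1 from x_2.

```python
def shatters(V, S):
    """True iff the finite classifier set V realizes all 2^|S| labelings of S."""
    realized = {tuple(h(x) for x in S) for h in V}
    return len(realized) == 2 ** len(S)

# Discretized interval class on a grid, with three observed negative labels.
grid = [i / 50 for i in range(1, 50)]
C = [(lambda x, a=a, b=b: +1 if a <= x <= b else -1)
     for a in grid for b in grid if a <= b]
negs = [0.2, 0.5, 0.8]                      # observed (all-negative) labels
V = [h for h in C if all(h(x) == -1 for x in negs)]

assert shatters(V, [0.3])          # x in DIS(V): the singleton {x} is shattered
assert shatters(V, [0.3, 0.4])     # no observed negative in between
assert not shatters(V, [0.3, 0.6]) # 0.5 separates them: (+1, +1) is unrealizable
```

The failed labeling in the last case is exactly the one a single interval cannot realize, which is the obstruction that forces the extra rounds of splitting for unions of intervals.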
Note that there may be unobservable factors (e.g., the target function) determining the appropriate number of iterations of this idea sufficient to have a shrinking probability of requesting a label, while maintaining the accuracy of inferred labels. To address this, we can simply try all d + 1 possibilities, and then select one of the resulting d + 1 classifiers via a kind of tournament of pairwise comparisons. Also, in order to reduce the probability of a mistaken inference due to x_1 ∉ ∂_V f (or similarly for later x_i), we can replace each single x_i with multiple samples, and then take a majority vote over whether to infer the label, and which label to infer if we do so; generally, we can think of this as estimating certain probabilities, and below we write these estimators as P̂_m, and discuss the details of their implementation later. Combining Meta-Algorithm 0 with the above reasoning motivates a new type of active learning algorithm, referred to as Meta-Algorithm 1 below, and stated as follows.

Meta-Algorithm 1
Input: passive algorithm A_p, label budget n
Output: classifier ĥ
0. Request the first m_n = ⌊n/3⌋ labels, {Y_1, ..., Y_{m_n}}, and let t ← m_n
1. Let V = {h ∈ C : er_{m_n}(h) = 0}
2. For k = 1, 2, ..., d + 1
3.   Δ̂^(k) ← P̂_{m_n}( x : P̂( S ∈ X^{k−1} : V shatters S ∪ {x} | V shatters S ) ≥ 1/2 )
4.   Let L_k ← {}
5.   For m = m_n + 1, ..., m_n + ⌊n/(6 · 2^k Δ̂^(k))⌋
6.     If P̂_m( S ∈ X^{k−1} : V shatters S ∪ {X_m} | V shatters S ) ≥ 1/2 and t < ⌊2n/3⌋
7.       Request the label Y_m of X_m, and let ŷ ← Y_m and t ← t + 1
8.     Else, let ŷ ← argmax_{y ∈ {−1,+1}} P̂_m( S ∈ X^{k−1} : V[(X_m, −y)] does not shatter S | V shatters S )
9.     Let L_k ← L_k ∪ {(X_m, ŷ)}
10. Return ActiveSelect({A_p(L_1), A_p(L_2), ..., A_p(L_{d+1})}, ⌊n/3⌋, {X_{m_n + max_k |L_k| + 1}, ...})

Subroutine: ActiveSelect
Input: set of classifiers {h_1, h_2, ..., h_N}, label budget m, sequence of unlabeled examples U
Output: classifier ĥ
0. For each j, k ∈ {1, 2, ..., N} s.t. j < k,
1.   Let R_{jk} be the first ⌊m/(j(N−j) ln(eN))⌋ points in U ∩ {x : h_j(x) ≠ h_k(x)} (if such values exist)
2.   Request the labels for R_{jk} and let Q_{jk} be the resulting set of labeled examples
3.   Let ê_{kj} = er_{Q_{jk}}(h_k)
4. Return h_k̂, where k̂ = max{k ∈ {1, ..., N} : max_{j<k} ê_{kj} ≤ 7/12}

Theorem 6 For any VC class C, Meta-Algorithm 1 is a universal activizer for C. ⋄

The following corollary is one concrete implication of Theorem 6.

Corollary 7 For any VC class C, there exists an active learning algorithm achieving a label complexity Λ_a such that, for all target functions f ∈ C and distributions P, Λ_a(ε, f, P) = o(1/ε). ⋄

Proof The one-inclusion graph passive learning algorithm of Haussler, Littlestone, and Warmuth (1994) is known to achieve label complexity at most d/ε, for every target function f ∈ C and distribution P. Thus, Theorem 6 implies that the (Meta-Algorithm 1)-activized one-inclusion graph algorithm satisfies the claim.

As a byproduct, Theorem 6 also establishes the basic fact that there exist activizers. In some sense, this observation opens up a new realm for exploration: namely, characterizing the properties that activizers can possess. This topic includes a vast array of questions, many of which deal with whether activizers are capable of preserving various properties of the given passive algorithm (e.g., margin-based dimension-independence, minimaxity, admissibility, etc.). Section 7 describes a variety of enticing questions of this type. In the sections below, we will consider quantifying how large the gap in label complexity between the given passive learning algorithm and the resulting activized algorithm can be.
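The pairwise tournament can be sketched as follows (a hedged Python illustration: the per-pair budget ⌊m/(j(N−j) ln(eN))⌋ and the 7/12 acceptance threshold are taken from the subroutine's description, while the names and the threshold-classifier test harness are our own assumptions). Each later candidate is compared with every earlier one only on points where they disagree, which is where their error rates differ.

```python
import math
import random

def active_select(classifiers, label, m, U):
    """Tournament sketch of ActiveSelect: h_k survives if its empirical error,
    measured only on points where it disagrees with an earlier h_j, never
    exceeds 7/12; return the last survivor."""
    N = len(classifiers)
    survives = [True] * (N + 1)          # 1-based; survives[1] holds vacuously
    for k in range(2, N + 1):
        hk = classifiers[k - 1]
        for j in range(1, k):
            hj = classifiers[j - 1]
            budget = max(1, int(m / (j * (N - j) * math.log(math.e * N))))
            R = [x for x in U if hj(x) != hk(x)][:budget]
            Q = [(x, label(x)) for x in R]        # the label requests
            if Q and sum(y != hk(x) for x, y in Q) / len(Q) > 7 / 12:
                survives[k] = False
    return classifiers[max(k for k in range(1, N + 1) if survives[k]) - 1]
```

Note the harmonic structure of the budgets: summed over the (N − j) comparisons for each j, they total roughly m, so the subroutine respects its label budget.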
We will additionally study the effects of label noise on the possibility of activized learning.

4.4 Implementation and Efficiency

Meta-Algorithm 1 typically also has certain desirable efficiency guarantees. Specifically, suppose that for any m labeled examples Q, there is an algorithm with poly(d · m) running time that finds some h ∈ C with er_Q(h) = 0 if one exists, and otherwise returns a value indicating that no such h exists in C; for many concept spaces with a kind of geometric interpretation, there are known methods with this capability (Khachiyan, 1979; Karmarkar, 1984; Valiant, 1984; Kearns and Vazirani, 1994). We can use such a subroutine to create an efficient implementation of the main body of Meta-Algorithm 1. Specifically, rather than explicitly representing V in Step 1, we can simply store the set Q_0 = {(X_1, Y_1), ..., (X_{m_n}, Y_{m_n})}. Then for any step in the algorithm where we need to test whether V shatters a set R, we can simply try all 2^{|R|} possible labelings of R, and for each one temporarily add these |R| additional labeled examples to Q_0 and check whether there is an h ∈ C consistent with all of the labels. At first, it might seem that these 2^k evaluations would be prohibitive; however, supposing P̂_{m_n} is implemented so that it is Ω(1/poly(n)) (as it is in Appendix B.1), note that the loop beginning at Step 5 executes a nonzero number of times only if n/Δ̂^(k) > 2^k, so that 2^k ≤ poly(n); we can easily add a condition that skips the step of calculating Δ̂^(k) if 2^k exceeds this poly(n) lower bound on n/Δ̂^(k), so that even those shatterability tests can be skipped in this case. Thus, for the actual occurrences of it in the algorithm, testing whether V shatters R requires only poly(n) · poly(d · (|Q_0| + |R|)) time.
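The reduction of shatterability testing to a consistency (ERM-style) oracle can be sketched directly; in this hedged illustration, `consistent` stands in for the assumed poly-time subroutine, instantiated here for threshold classifiers (our own example class, not the general construction).

```python
from itertools import product

def shatters_via_oracle(consistent, Q0, R):
    """V = {h in C : er_Q0(h) = 0} shatters R iff every one of the 2^|R|
    labelings of R is consistent with Q0 for some h in C; each check is one
    call to the consistency oracle on Q0 plus |R| temporary examples."""
    return all(consistent(Q0 + list(zip(R, ys)))
               for ys in product((-1, +1), repeat=len(R)))

def thresholds_consistent(Q):
    """Consistency oracle for h_z(x) = +1 iff x >= z: some z fits Q iff
    every negative point lies strictly below every positive point."""
    neg = [x for x, y in Q if y == -1]
    pos = [x for x, y in Q if y == +1]
    return not neg or not pos or max(neg) < min(pos)

Q0 = [(0.2, -1), (0.8, +1)]
# A single point between the observed labels is shattered (it is in DIS(V)),
# but no pair is, since thresholds have VC dimension d = 1.
```

For sets R with |R| ≤ d + 1, the 2^{|R|} oracle calls stay polynomial, matching the accounting in the paragraph above.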
The total number of times this test is performed in calculating Δ̂^(k) (from Appendix B.1) is itself only poly(n), and the number of iterations of the loop in Step 5 is at most n/Δ̂^(k) = poly(n). Determining the label ŷ in Step 8 can be performed in a similar fashion. So in general, the total running time of the main body of Meta-Algorithm 1 is poly(d · n). The only remaining question is the efficiency of the final step. Of course, we can require A_p to have running time polynomial in the size of its input set (and d). But beyond this, we must consider the efficiency of the ActiveSelect subroutine. This actually turns out to have some subtleties involved. The way it is stated above is simple and elegant, but not always efficient. Specifically, we have no a priori bound on the number of unlabeled examples the algorithm must process before finding a point X_m where h_j(X_m) ≠ h_k(X_m). Indeed, if P(x : h_j(x) ≠ h_k(x)) = 0, we may effectively need to examine the entire infinite sequence of X_m values to determine this. Fortunately, these problems can be corrected without difficulty, simply by truncating the search at a predetermined number of points. Specifically, rather than taking the next ⌊m/(N choose 2)⌋ examples for which h_j and h_k disagree, simply restrict ourselves to at most this number, or at most the number of such points among the next M unlabeled examples. In Appendix B, we show that ActiveSelect, as originally stated, has a high-probability (1 − exp{−Ω(m)}) guarantee that the classifier it selects has error rate at most twice the best of the N it is given. With the modification to truncate the search at M unlabeled examples, this guarantee is increased to min_k er(h_k) + max{er(h_k), m/M}. For the concrete guarantee of Corollary 7, it suffices to take M ≫ m².
However, to guarantee the modified ActiveSelect can still be used in Meta-Algorithm 1 while maintaining (the stronger) Theorem 6, we need M at least as big as Ω(min{exp{m^c}, m/min_k er(h_k)}), for any constant c > 0. In general, if we have a 1/poly(n) lower bound on the error rate of the classifier produced by A_p for a given number of labeled examples as input, we can set M as above using this lower bound in place of min_k er(h_k), resulting in an efficient version of ActiveSelect that still guarantees Theorem 6. However, it is presently not known whether there always exist universal activizers that are efficient (either poly(d · n) or poly(d/ε) running time) when the above assumptions on efficiency of A_p and finding h ∈ C with er_Q(h) = 0 hold.

5. The Magnitudes of Improvements

In the previous section, we saw that we can always improve the label complexity of a passive learning algorithm by activizing it. However, there remains the question of how large the gap is between the passive algorithm's label complexity and the activized algorithm's label complexity. In the present section, we refine the above procedures to take greater advantage of the sequential nature of active learning. For each, we characterize the improvements it achieves relative to any given passive algorithm. As a byproduct, this provides concise sufficient conditions for exponential gains, addressing an open problem of Balcan, Hanneke, and Vaughan (2010). Specifically, consider the following definition, essentially similar to one explored by Balcan, Hanneke, and Vaughan (2010).

Definition 8 For a concept space C and distribution P, we say that (C, P) is learnable at an exponential rate if there exists an active learning algorithm achieving label complexity Λ such that ∀f ∈ C, Λ(ε, f, P) ∈ Polylog(1/ε).
We further say C is learnable at an exponential rate if there exists an active learning algorithm achieving label complexity Λ such that for all distributions P and all f ∈ C, Λ(ε, f, P) ∈ Polylog(1/ε). ⋄

5.1 The Label Complexity of Disagreement-Based Active Learning

As before, to establish a foundation to build upon, we begin by studying the label complexity gains achievable by disagreement-based active learning. From above, we already know that disagreement-based active learning is not sufficient to achieve the best possible gains; but as before, it will serve as a suitable starting place to gain intuition for how we might approach the problem of improving Meta-Algorithm 1 and quantifying the improvements achievable over passive learning by the resulting more sophisticated methods. The results on disagreement-based learning in this subsection are essentially already known, and available in the published literature (though in a slightly less general form). Specifically, we review (a modified version of) the method of Cohn, Atlas, and Ladner (1994), referred to as Meta-Algorithm 2 below, which was historically the original disagreement-based active learning algorithm. We then state the known results on the label complexities achievable by this method, in terms of a quantity known as the disagreement coefficient; that result is due to Hanneke (2011, 2007b).

5.1.1 The CAL Active Learning Algorithm

To begin, we consider the following simple disagreement-based method, typically referred to as CAL after its discoverers Cohn, Atlas, and Ladner (1994), though the version here is slightly modified compared to the original (see below). It essentially represents a refinement of Meta-Algorithm 0 to take greater advantage of the sequential aspects of active learning.
That is, rather than requesting only two batches of labels, as in Meta-Algorithm 0, this method updates the version space after every label request, thus focusing the region of disagreement (and therefore the region in which it requests labels) after each label request.

Meta-Algorithm 2
Input: passive algorithm A_p, label budget n
Output: classifier ĥ
0. V ← C, t ← 0, m ← 0, L ← {}
1. While t < ⌈n/2⌉ and m ≤ 2^n
2.   m ← m + 1
3.   If X_m ∈ DIS(V)
4.     Request the label Y_m of X_m and let t ← t + 1
5.     Let V ← V[(X_m, Y_m)]
6. Let Δ̂ ← P̂_m(DIS(V))
7. Do ⌊n/(6Δ̂)⌋ times
8.   m ← m + 1
9.   If X_m ∈ DIS(V) and t < n
10.    Request the label Y_m of X_m and let ŷ ← Y_m and t ← t + 1
11.  Else let ŷ = h(X_m) for an arbitrary h ∈ V
12.  Let L ← L ∪ {(X_m, ŷ)} and V ← V[(X_m, ŷ)]
13. Return A_p(L)

The procedure is specified in terms of an estimator P̂_m; for our purposes, we define this as in (14) of Appendix B.1 (with k = 1 there). Every example X_m added to the set L in Step 12 either has its label requested (Step 10) or inferred (Step 11). By the same Chernoff bound argument mentioned for the previous methods, we are guaranteed (with high probability) that the "t < n" constraint in Step 9 is always satisfied when X_m ∈ DIS(V). Since we assume f ∈ C, an inductive argument shows that we will always have f ∈ V as well; thus, every label requested or inferred will agree with f, and therefore the labels in L are all correct. As with Meta-Algorithm 0, this method has two stages to it: one in which we focus on reducing the version space V, and a second in which we focus on constructing a set of labeled examples to feed into the passive algorithm. The original algorithm of Cohn, Atlas, and Ladner (1994) essentially used only the first stage, and simply returned any classifier in V after exhausting its budget for label requests.
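For concreteness, the first stage (Steps 0-5) can be simulated for the thresholds problem (Example 1) under uniform P on [0,1]; this is a minimal sketch under those assumptions, where the version space of thresholds is tracked as an interval (lo, hi), so that DIS(V) = (lo, hi).

```python
import random

def cal_first_stage(n_requests, target=0.5, max_unlabeled=10**6, seed=1):
    """First stage of Meta-Algorithm 2 (CAL) for threshold classifiers
    h_z(x) = +1 iff x >= z, under uniform P on [0,1]. The version space
    is the set of thresholds in (lo, hi), so DIS(V) = (lo, hi); labels
    are requested only for points falling in DIS(V)."""
    rng = random.Random(seed)
    lo, hi, t = 0.0, 1.0, 0
    for _ in range(max_unlabeled):     # mirrors the m <= 2^n guard
        if t >= n_requests:
            break
        x = rng.random()
        if lo < x < hi:                # x in DIS(V): request its label
            t += 1
            if x >= target:            # label +1: true threshold <= x
                hi = x
            else:                      # label -1: true threshold > x
                lo = x
    return hi - lo                     # P(DIS(V)) = width of (lo, hi)
```

Running this, the width of DIS(V) decays roughly exponentially in the number of label requests, in line with the exp{−Ω(n/log(n/ε))} behavior analyzed below.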
Here we have added the second stage (Steps 6-13) so that we can guarantee a certain conditional independence (given |L|) among the examples fed into the passive algorithm, which is important for the general results (Theorem 10 below). Hanneke (2011) showed that the original (simpler) algorithm achieves the (less general) label complexity bound of Corollary 11 below.

5.1.2 Examples

Not surprisingly, by essentially the same argument as for Meta-Algorithm 0, one can show Meta-Algorithm 2 satisfies the claim in Theorem 5. That is, Meta-Algorithm 2 is a universal activizer for C if and only if P(∂f) = 0 for every P and f ∈ C. However, there are further results known on the label complexity achieved by Meta-Algorithm 2. Specifically, to illustrate the types of improvements achievable by Meta-Algorithm 2, consider our usual toy examples; as before, to simplify the explanation, for these examples we ignore the fact that P̂_m is only an estimate, as well as the "t < n" constraint in Step 9 (both of which will be addressed in the general results below). First, consider threshold classifiers (Example 1) under a uniform P on [0,1], and suppose f = h_z ∈ C. Suppose the given passive algorithm has label complexity Λ_p. To get expected error at most ε in Meta-Algorithm 2, it suffices to have |L| ≥ Λ_p(ε/2, f, P) with probability at least 1 − ε/2. Starting from any particular V set obtained in the algorithm, call it V_0, the set DIS(V_0) is simply the region between the largest negative example observed so far (say z_ℓ) and the smallest positive example observed so far (say z_r).
With probability at least 1 − ε/n, at least one of the next O(log(n/ε)) examples in this [z_ℓ, z_r] region will be in [z_ℓ + (1/3)(z_r − z_ℓ), z_r − (1/3)(z_r − z_ℓ)], so that after processing that example, we definitely have P(DIS(V)) ≤ (2/3)P(DIS(V_0)). Thus, upon reaching Step 6, since we have made n/2 label requests, a union bound implies that with probability 1 − ε/2, we have P(DIS(V)) ≤ exp{−Ω(n/log(n/ε))}, and therefore |L| ≥ exp{Ω(n/log(n/ε))}. Thus, for some value Λ_a(ε, f, P) = O(log(Λ_p(ε/2, f, P)) log(log(Λ_p(ε/2, f, P))/ε)), any n ≥ Λ_a(ε, f, P) gives |L| ≥ Λ_p(ε/2, f, P) with probability at least 1 − ε/2, so that the activized algorithm achieves label complexity Λ_a(ε, f, P) ∈ Polylog(Λ_p(ε/2, f, P)/ε).

Consider also the intervals problem (Example 2) under a uniform P on [0,1], and suppose f = h_{[a,b]} ∈ C, for b > a. In this case, as with any disagreement-based algorithm, until the algorithm observes the first positive example (i.e., the first X_m ∈ [a,b]), it will request the label of every example (see the reasoning above for Meta-Algorithm 0). However, at every time after observing this first positive point, say x, the region DIS(V) is restricted to the region between the largest negative point less than x and the smallest positive point, and the region between the largest positive point and the smallest negative point larger than x. For each of these two regions, the same arguments used for the threshold problem above can be applied to show that, with probability 1 − O(ε), the region of disagreement is reduced by at least a constant fraction every O(log(n/ε)) label requests, so that |L| ≥ exp{Ω(n/log(n/ε))}.
Thus, again the label complexity is of the form O(log(Λ_p(ε/2, f, P)) log(log(Λ_p(ε/2, f, P))/ε)), which is Polylog(Λ_p(ε/2, f, P)/ε), though this time there is a significant (additive) target-dependent constant (roughly ∝ (1/(b−a)) log(1/ε)), accounting for the length of the initial phase before observing any positive examples. On the other hand, as with any disagreement-based algorithm, when f = h_{[a,a]}, because the algorithm never observes a positive example, it requests the label of every example it considers; in this case, by the same argument given for Meta-Algorithm 0, upon reaching Step 6 we have P(DIS(V)) = 1, so that |L| = O(n), and we observe no improvements for some passive algorithms A_p.

A similar analysis can be performed for unions of i intervals under P uniform on [0,1]. In that case, we find that any h_z ∈ C not representable (up to probability-zero differences) by a union of i−1 or fewer intervals allows for the exponential improvements of the type observed in the previous two examples; this time, the phase of exponentially decreasing P(DIS(V)) only occurs after observing an example in each of the i intervals and each of the i−1 negative regions separating the intervals, resulting in an additive term of roughly ∝ (1/min_{1≤j<2i}(z_{j+1} − z_j)) log(i/ε) in the label complexity. However, any h_z ∈ C representable (up to probability-zero differences) by a union of i−1 or fewer intervals has P(∂h_z) = 1, which means |L| = O(n), and therefore (as with any disagreement-based algorithm) Meta-Algorithm 2 will not provide improvements for some passive algorithms A_p.

5.1.3 The Disagreement Coefficient

Toward generalizing the arguments from the above examples, consider the following definition of Hanneke (2007b).
Definition 9 For ε ≥ 0, the disagreement coefficient of a classifier f with respect to a concept space C under a distribution P is defined as

θ_f(ε) = 1 ∨ sup_{r>ε} P(DIS(B(f, r)))/r.

Also abbreviate θ_f = θ_f(0). ⋄

Informally, the disagreement coefficient describes the rate of collapse of the region of disagreement, relative to the distance from f. It has been useful in characterizing the label complexities achieved by several disagreement-based active learning algorithms (Hanneke, 2007b, 2011; Dasgupta, Hsu, and Monteleoni, 2007; Beygelzimer, Dasgupta, and Langford, 2009; Wang, 2009; Koltchinskii, 2010; Beygelzimer, Hsu, Langford, and Zhang, 2010), and itself has been studied and bounded for various families of learning problems (Hanneke, 2007b, 2011; Balcan, Hanneke, and Vaughan, 2010; Friedman, 2009; Beygelzimer, Dasgupta, and Langford, 2009; Mahalanabis, 2011; Wang, 2011). See the paper of Hanneke (2011) for a detailed discussion of the disagreement coefficient, including its relationships to several related quantities, as well as a variety of properties that it satisfies that can help to bound its value for any given learning problem. In particular, below we use the fact that, for any constant c ∈ [1, ∞), θ_f(ε) ≤ θ_f(ε/c) ≤ cθ_f(ε). Also note that P(∂f) = 0 if and only if θ_f(ε) = o(1/ε). See the papers of Friedman (2009) and Mahalanabis (2011) for some general conditions on C and P under which every f ∈ C has θ_f < ∞, which (as we explain below) has particularly interesting implications for active learning (Hanneke, 2007b, 2011). To build intuition about the behavior of the disagreement coefficient, we briefly go through its calculation for our usual toy examples from above.
The first two of these calculations are taken from Hanneke (2007b), and the last is from Balcan, Hanneke, and Vaughan (2010). First, consider the thresholds problem (Example 1), and for simplicity suppose the distribution P is uniform on [0,1]. In this case, as in Section 3.2, B(h_z, r) = {h_{z′} ∈ C : |z′ − z| ≤ r}, and DIS(B(h_z, r)) ⊆ [z − r, z + r) with equality for sufficiently small r. Therefore, P(DIS(B(h_z, r))) ≤ 2r (with equality for small r), and θ_{h_z}(ε) ≤ 2 with equality for sufficiently small ε. In particular, θ_{h_z} = 2.

On the other hand, consider the intervals problem (Example 2), again under P uniform on [0,1]. This time, for h_{[a,b]} ∈ C with b − a > 0, we have for 0 < r < b − a, B(h_{[a,b]}, r) = {h_{[a′,b′]} ∈ C : |a − a′| + |b − b′| ≤ r}, DIS(B(h_{[a,b]}, r)) ⊆ [a − r, a + r) ∪ (b − r, b + r], and P(DIS(B(h_{[a,b]}, r))) ≤ 4r (with equality for sufficiently small r). But for 0 < b − a ≤ r, we have B(h_{[a,b]}, r) ⊇ {h_{[a′,a′]} : a′ ∈ (0,1)}, so that DIS(B(h_{[a,b]}, r)) = (0,1) and P(DIS(B(h_{[a,b]}, r))) = 1. Thus, we generally have θ_{h_{[a,b]}}(ε) ≤ max{1/(b−a), 4}, with equality for sufficiently small ε. However, this last reasoning also indicates that ∀r > 0, B(h_{[a,a]}, r) ⊇ {h_{[a′,a′]} : a′ ∈ (0,1)}, so that DIS(B(h_{[a,a]}, r)) = (0,1) and P(DIS(B(h_{[a,a]}, r))) = 1; therefore, θ_{h_{[a,a]}}(ε) = 1/ε, the largest possible value for the disagreement coefficient; in particular, this also means θ_{h_{[a,a]}} = ∞.

Finally, consider the unions of i intervals problem (Example 3), again under P uniform on [0,1]. First take any h_z ∈ C such that any h_{z′} ∈ C representable as a union of i−1 intervals has P({x : h_z(x) ≠ h_{z′}(x)}) > 0.
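The closed forms just derived for P(DIS(B(f, r))) can be checked numerically. The sketch below (our own illustration, using a finite grid in place of the supremum over r > ε) recovers θ_{h_z} = 2 for thresholds, max{1/(b−a), 4} for a nonzero-width interval, and the 1/ε blow-up for a zero-width interval.

```python
def p_dis_thresholds(z, r):
    """P(DIS(B(h_z, r))) for thresholds under uniform P on [0,1]:
    the ball consists of thresholds z' with |z' - z| <= r, which
    disagree exactly on [z - r, z + r), clipped to [0, 1]."""
    return min(z + r, 1.0) - max(z - r, 0.0)

def p_dis_intervals(a, b, r):
    """P(DIS(B(h_[a,b], r))) for intervals under uniform P on [0,1]."""
    if b - a <= r:
        return 1.0   # ball contains all zero-width intervals: DIS = (0,1)
    # otherwise DIS is two bands of width 2r around each endpoint
    return min(4 * r, 1.0)

def theta(p_dis, eps, grid=10**5):
    """Approximate 1 v sup_{r > eps} P(DIS(B(f, r))) / r on a finite
    grid of r values in (eps, 1]."""
    best = 1.0
    for i in range(1, grid + 1):
        r = eps + i * (1.0 - eps) / grid
        best = max(best, p_dis(r) / r)
    return best
```

For instance, `theta(lambda r: p_dis_thresholds(0.5, r), 0.01)` evaluates to 2, and for the interval [0.2, 0.6] the grid supremum is 4 = max{1/0.4, 4}, while the zero-width interval gives a value on the order of 1/ε.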
Then for 0 < r < min_{1≤j<2i}(z_{j+1} − z_j), B(h_z, r) = {h_{z′} ∈ C : Σ_{1≤j≤2i} |z_j − z′_j| ≤ r}, so that P(DIS(B(h_z, r))) ≤ 4ir, with equality for sufficiently small r. For r > min_{1≤j<2i}(z_{j+1} − z_j), B(h_z, r) contains a set of classifiers that flips the labels (compared to h_z) in that smallest region and uses the resulting extra interval to disagree with h_z on a tiny region at an arbitrary location (either by encompassing some point with a small interval, or by splitting an interval into two intervals separated by a small gap). Thus, DIS(B(h_z, r)) = (0,1), and P(DIS(B(h_z, r))) = 1. So in total, θ_{h_z}(ε) ≤ max{1/min_{1≤j<2i}(z_{j+1} − z_j), 4i}, with equality for sufficiently small ε. On the other hand, if h_z ∈ C can be represented by a union of i−1 (or fewer) intervals, then we can use the extra interval to disagree with h_z on a tiny region at an arbitrary location, while still remaining in B(h_z, r), so that DIS(B(h_z, r)) = (0,1), P(DIS(B(h_z, r))) = 1, and θ_{h_z}(ε) = 1/ε; in particular, in this case we have θ_{h_z} = ∞.

5.1.4 General Upper Bounds on the Label Complexity of Meta-Algorithm 2

As mentioned, the disagreement coefficient has implications for the label complexities achievable by disagreement-based active learning. The intuitive reason for this is that, as the number of label requests increases, the diameter of the version space shrinks at a predictable rate. The disagreement coefficient then relates the diameter of the version space to the size of its region of disagreement, which in turn describes the probability of requesting a label.
Thus, the expected frequency of label requests in the data sequence decreases at a predictable rate related to the disagreement coefficient, so that |L| in Meta-Algorithm 2 can be lower bounded by a function of the disagreement coefficient. Specifically, the following result was essentially established by Hanneke (2011, 2007b), though actually the result below is slightly more general than the original.

Theorem 10 For any VC class C, and any passive learning algorithm A_p achieving label complexity Λ_p, the active learning algorithm obtained by applying Meta-Algorithm 2 with A_p as input achieves a label complexity Λ_a that, for any distribution P and classifier f ∈ C, satisfies

Λ_a(ε, f, P) = O(θ_f(Λ_p(ε/2, f, P)^{−1}) log²(Λ_p(ε/2, f, P)/ε)). ⋄

The proof of Theorem 10 is similar to the original result of Hanneke (2011, 2007b), with only minor modifications to account for using A_p instead of returning an arbitrary element of V. The formal details are implicit in the proof of Theorem 16 below (since Meta-Algorithm 2 is essentially identical to the k = 1 round of Meta-Algorithm 3, defined below). We also have the following simple corollaries.

Corollary 11 For any VC class C, there exists a passive learning algorithm A_p such that, for every f ∈ C and distribution P, the active learning algorithm obtained by applying Meta-Algorithm 2 with A_p as input achieves label complexity Λ_a(ε, f, P) = O(θ_f(ε) log²(1/ε)). ⋄

Proof The one-inclusion graph algorithm of Haussler, Littlestone, and Warmuth (1994) is a passive learning algorithm achieving label complexity Λ_p(ε, f, P) ≤ d/ε. Plugging this into Theorem 10, using the fact that θ_f(ε/2d) ≤ 2dθ_f(ε), and simplifying, we arrive at the result.
In fact, we will see in the proof of Theorem 16 that incurring this extra constant factor of d is not actually necessary.

Corollary 12 For any VC class C and distribution P, if ∀f ∈ C, θ_f < ∞, then (C, P) is learnable at an exponential rate. If this is true for all P, then C is learnable at an exponential rate. ⋄

Proof The first claim follows directly from Corollary 11, since θ_f(ε) ≤ θ_f. The second claim then follows from the fact that Meta-Algorithm 2 is adaptive to P (has no direct dependence on P except via the data).

Aside from the disagreement coefficient and Λ_p terms, the other constant factors hidden in the big-O in Theorem 10 are only C-dependent (i.e., independent of f and P). As mentioned, if we are only interested in achieving the label complexity bound of Corollary 11, we can obtain this result more directly by the simpler original algorithm of Cohn, Atlas, and Ladner (1994) via the analysis of Hanneke (2011, 2007b).

5.1.5 General Lower Bounds on the Label Complexity of Meta-Algorithm 2

It is also possible to prove a kind of lower bound on the label complexity of Meta-Algorithm 2 in terms of the disagreement coefficient, so that the dependence on the disagreement coefficient in Theorem 10 is unavoidable. Specifically, there are two simple observations that intuitively explain the possibility of such lower bounds. The first observation is that the expected number of label requests Meta-Algorithm 2 makes among the first ⌈1/r⌉ unlabeled examples is at least P(DIS(B(f, r)))/(2r) (assuming it does not halt first). Similarly, the second observation is that, to arrive at a region of disagreement with expected probability mass less than P(DIS(B(f, r)))/2, Meta-Algorithm 2 requires a budget n of size at least P(DIS(B(f, r)))/(2r).
These observations are formalized in Appendix C as Lemmas 47 and 48. Noting that, for unbounded θ_f(ε), P(DIS(B(f, ε)))/ε ≠ o(θ_f(ε)), the relevance of these observations in the context of deriving lower bounds based on the disagreement coefficient becomes clear. In particular, we can use the latter of these insights to arrive at the following theorem, which essentially complements Theorem 10, showing that it cannot generally be improved beyond reducing the constants and logarithmic factors, without altering the algorithm or introducing additional A_p-dependent quantities in the label complexity bound. The proof is included in Appendix C.

Theorem 13 For any set of classifiers C, f ∈ C, distribution P, and nonincreasing function λ : (0,1) → N, there exists a passive learning algorithm A_p achieving a label complexity Λ_p with Λ_p(ε, f, P) = λ(ε) for all ε > 0, such that if Meta-Algorithm 2, with A_p as its argument, achieves label complexity Λ_a, then

Λ_a(ε, f, P) ≠ o(θ_f(Λ_p(2ε, f, P)^{−1})). ⋄

Recall that there are many natural learning problems for which θ_f = ∞, and indeed where θ_f(ε) = Ω(1/ε): for instance, intervals with f = h_{[a,a]} under uniform P, or unions of i intervals under uniform P with f representable as i−1 or fewer intervals. Thus, since we have just seen that the improvements gained by disagreement-based methods are well-characterized by the disagreement coefficient, if we would like to achieve exponential improvements over passive learning for these problems, we will need to move beyond these disagreement-based methods. In the subsections that follow, we will use an alternative algorithm and analysis, and prove a general result that is always at least as good as Theorem 10 (in a big-O sense), and often significantly better (in a little-o sense).
In particular, it leads to a sufficient condition for learnability at an exponential rate, strictly more general than that of Corollary 12.

5.2 An Improved Activizer

In this subsection, we define a new active learning method based on shattering, as in Meta-Algorithm 1, but which also takes fuller advantage of the sequential aspect of active learning, as in Meta-Algorithm 2. We will see that this algorithm can be analyzed in a manner analogous to the disagreement coefficient analysis of Meta-Algorithm 2, leading to a new and often dramatically improved label complexity bound. Specifically, consider the following meta-algorithm.

Meta-Algorithm 3
Input: passive algorithm A_p, label budget n
Output: classifier ĥ
0. V ← V_0 = C, T_0 ← ⌈2n/3⌉, t ← 0, m ← 0
1. For k = 1, 2, ..., d + 1
2.   Let L_k ← {}, T_k ← T_{k−1} − t, and let t ← 0
3.   While t < ⌈T_k/4⌉ and m ≤ k · 2^n
4.     m ← m + 1
5.     If P̂_m(S ∈ X^{k−1} : V shatters S ∪ {X_m} | V shatters S) ≥ 1/2
6.       Request the label Y_m of X_m, and let ŷ ← Y_m and t ← t + 1
7.     Else let ŷ ← argmax_{y∈{−1,+1}} P̂_m(S ∈ X^{k−1} : V[(X_m, −y)] does not shatter S | V shatters S)
8.     Let V ← V_m = V_{m−1}[(X_m, ŷ)]
9.   Δ̂^{(k)} ← P̂_m(x : P̂(S ∈ X^{k−1} : V shatters S ∪ {x} | V shatters S) ≥ 1/2)
10.  Do ⌊T_k/(3Δ̂^{(k)})⌋ times
11.    m ← m + 1
12.    If P̂_m(S ∈ X^{k−1} : V shatters S ∪ {X_m} | V shatters S) ≥ 1/2 and t < ⌊3T_k/4⌋
13.      Request the label Y_m of X_m, and let ŷ ← Y_m and t ← t + 1
14.    Else, let ŷ ← argmax_{y∈{−1,+1}} P̂_m(S ∈ X^{k−1} : V[(X_m, −y)] does not shatter S | V shatters S)
15.    Let L_k ← L_k ∪ {(X_m, ŷ)} and V ← V_m = V_{m−1}[(X_m, ŷ)]
16. Return ActiveSelect({A_p(L_1), A_p(L_2), ..., A_p(L_{d+1})}, ⌊n/3⌋, {X_{m+1}, X_{m+2}, ...})

As before, the procedure is specified in terms of estimators P̂_m.
Again, these can be defined in a variety of ways, as long as they converge (at a fast enough rate) to their respective true probabilities. For the results below, we will use the definitions given in Appendix B.1: i.e., the same definitions used in Meta-Algorithm 1. Following the same argument as for Meta-Algorithm 1, one can show that Meta-Algorithm 3 is a universal activizer for C, for any VC class C. However, we can also obtain more detailed results in terms of a generalization of the disagreement coefficient given below.

As with Meta-Algorithm 1, this procedure has three main components: one in which we focus on reducing the version space V, one in which we focus on collecting a (conditionally) i.i.d. sample to feed into A_p, and one in which we select from among the d + 1 executions of A_p. However, unlike Meta-Algorithm 1, here the first stage is also broken up based on the value of k, so that each k has its own first and second stages, rather than sharing a single first stage. Again, the choice of the number of (unlabeled) examples processed in each second stage guarantees (by a Chernoff bound) that the "t < ⌊3T_k/4⌋" constraint in Step 12 is redundant. Depending on the type of label complexity result we wish to prove, this multistage architecture is sometimes avoidable. In particular, as with Corollary 11 above, to directly achieve the label complexity bound in Corollary 17 below, we can use a much simpler approach that replaces Steps 9-16, instead simply returning an arbitrary element of V upon termination.
Within each value of k, Meta-Algorithm 3 behaves analogously to Meta-Algorithm 2, requesting the label of an example only if it cannot infer the label from known information, and updating the version space V after every label request; however, unlike Meta-Algorithm 2, for values of k > 1, the mechanism for inferring a label is based on shatterable sets, as in Meta-Algorithm 1, and is motivated by the same argument of splitting V into subsets containing arbitrarily good classifiers (see the discussion in Section 4.1). Also unlike Meta-Algorithm 2, even the inferred labels can be used to reduce the set V (Steps 8 and 15), since they are not only correct but also potentially informative in the sense that x ∈ DIS(V). As with Meta-Algorithm 1, the key to obtaining improvement guarantees is that some value of k has |L_k| ≫ n, while maintaining that all of the labels in L_k are correct; ActiveSelect then guarantees the overall performance is not too much worse than that obtained by A_p(L_k) for this value of k.

To build intuition about the behavior of Meta-Algorithm 3, let us consider our usual toy examples, again under a uniform distribution P on [0,1]; as before, for simplicity we ignore the fact that P̂_m is only an estimate, as well as the constraint on t in Step 12 and the effectiveness of ActiveSelect, all of which will be addressed in the general analysis.
First, for the behavior of the algorithm for thresholds and nonzero-width intervals, we may simply refer to the discussion of Meta-Algorithm 2, since the k = 1 round of Meta-Algorithm 3 is essentially identical to Meta-Algorithm 2; in this case, we have already seen that |L_1| grows as exp{Ω(n/log(n/ε))} for thresholds, and does so for nonzero-width intervals after some initial period of slow growth related to the width of the target interval (i.e., the period before finding the first positive example). As with Meta-Algorithm 1, for zero-width intervals, we must look to the k = 2 round of Meta-Algorithm 3 to find improvements. Also as with Meta-Algorithm 1, for sufficiently large n, every X_m processed in the k = 2 round will have its label inferred (correctly) in Step 7 or 14 (i.e., it does not request any labels). But this means we reach Step 9 with m = 2 · 2^n + 1; furthermore, in these circumstances the definition of P̂_m from Appendix B.1 guarantees (for sufficiently large n) that Δ̂^{(2)} = 2/m, so that |L_2| ∝ n · m = Ω(n · 2^n). Thus, we expect the label complexity gains to be exponentially improved compared to A_p.

For a more involved example, consider unions of 2 intervals (Example 3), under uniform P on [0,1], and suppose f = h_{(a,b,a,b)} for b − a > 0; that is, the target function is representable as a single nonzero-width interval [a,b] ⊂ (0,1). As we have seen, ∂f = (0,1) in this case, so that disagreement-based methods are ineffective at improving over passive. This also means the k = 1 round of Meta-Algorithm 3 will not provide improvements (i.e., |L_1| = O(n)). However, consider the k = 2 round. As discussed in Section 4.2, for sufficiently large n, after the first round (k = 1) the set V is such that any label we infer in the k = 2 round will be correct.
Thus, it suffices to determine how large the set L_2 becomes. By the same reasoning as in Section 4.2, for sufficiently large n, the examples X_m whose labels are requested in Step 6 are precisely those not separated from both a and b by at least one of the m − 1 examples already processed (since V is consistent with the labels of all m − 1 of those examples). But this is the same set of points Meta-Algorithm 2 would query for the intervals example in Section 5.1; thus, the same argument used there implies that in this problem we have |L_2| ≥ exp{Ω(n/log(n/ε))} with probability 1 − ε/2, which means we should expect a label complexity of O(log(Λ_p(ε/2, f, P)) log(log(Λ_p(ε/2, f, P))/ε)), where Λ_p is the label complexity of A_p. For the case f = h_{(a,a,a,a)}, k = 3 is the relevant round, and the analysis goes similarly to the h_{[a,a]} scenario for intervals above. Unions of i > 2 intervals can be studied analogously, with the appropriate value of k to analyze being determined by the number of intervals required to represent the target up to probability-zero differences (see the discussion in Section 4.2).

5.3 Beyond the Disagreement Coefficient

In this subsection, we introduce a new quantity, a generalization of the disagreement coefficient, which we will later use to provide a general characterization of the improvements achievable by Meta-Algorithm 3, analogous to how the disagreement coefficient characterized the improvements achievable by Meta-Algorithm 2 in Theorem 10. First, let us define the following generalization of the disagreement core.

Definition 14 For an integer k ≥ 0, define the k-dimensional shatter core of a classifier f with respect to a set of classifiers H and distribution P as

∂^k_{H,P} f = lim_{r→0} {S ∈ X^k : B_{H,P}(f, r) shatters S}.
⋄

As before, when P = P, and P is clear from the context, we will abbreviate ∂^k_H f = ∂^k_{H,P} f, and when we also intend H = C, the full concept space, and C is clearly defined in the given context, we further abbreviate ∂^k f = ∂^k_C f = ∂^k_{C,P} f. We have the following definition, which will play a key role in the label complexity bounds below.

Definition 15 For any concept space C, distribution P, and classifier f, ∀k ∈ N, ∀ε ≥ 0, define

θ^{(k)}_f(ε) = 1 ∨ sup_{r>ε} P^k(S ∈ X^k : B(f, r) shatters S)/r.

Then define

d̃_f = min{k ∈ N : P^k(∂^k f) = 0}  and  θ̃_f(ε) = θ^{(d̃_f)}_f(ε).

Also abbreviate θ^{(k)}_f = θ^{(k)}_f(0) and θ̃_f = θ̃_f(0). ⋄

We might refer to the quantity θ^{(k)}_f(ε) as the order-k (or k-dimensional) disagreement coefficient, as it represents a direct generalization of the disagreement coefficient θ_f(ε). However, rather than merely measuring the rate of collapse of the probability of disagreement (one-dimensional shatterability), θ^{(k)}_f(ε) measures the rate of collapse of the probability of k-dimensional shatterability. In particular, we have θ̃_f(ε) = θ^{(d̃_f)}_f(ε) ≤ θ^{(1)}_f(ε) = θ_f(ε), so that this new quantity is never larger than the disagreement coefficient. However, unlike the disagreement coefficient, we always have θ̃_f(ε) = o(1/ε) for VC classes C. In fact, we could equivalently define θ̃_f(ε) as the value of θ^{(k)}_f(ε) for the smallest k with θ^{(k)}_f(ε) = o(1/ε). Additionally, we will see below that there are many interesting cases where θ_f = ∞ (even θ_f(ε) = Ω(1/ε)) but θ̃_f < ∞ (e.g., intervals with a zero-width target, or unions of i intervals where the target is representable as a union of i−1 or fewer intervals).
fewer intervals). As was the case for $\theta_f$, we will see that showing $\tilde\theta_f < \infty$ for a given learning problem has interesting implications for the label complexity of active learning (Corollary 18 below). In the process, we have also defined the quantity $\tilde d_f$, which may itself be of independent interest in the asymptotic analysis of learning in general. For VC classes, $\tilde d_f$ always exists, and in fact is at most $d + 1$ (since $\mathbb C$ cannot shatter any $d + 1$ points). When $d = \infty$, the quantity $\tilde d_f$ might not be defined (or defined as $\infty$), in which case $\tilde\theta_f(\varepsilon)$ is also not defined; in this work we restrict our discussion to VC classes, so that this issue never comes up; Section 7 discusses possible extensions to classes of infinite VC dimension. We should mention that the restriction of $\tilde\theta_f(\varepsilon) \ge 1$ in the definition is only for convenience, as it simplifies the theorem statements and proofs below. It is not fundamental to the definition, and can be removed (at the expense of slightly more complicated theorem statements). In fact, this only makes a difference to the value of $\tilde\theta_f(\varepsilon)$ in some (seemingly unusual) degenerate cases. The same is true of $\theta_f(\varepsilon)$ in Definition 9. The process of calculating $\tilde\theta_f(\varepsilon)$ is quite similar to that for the disagreement coefficient; we are interested in describing $\mathrm B(f, r)$, and specifically the variety of behaviors of elements of $\mathrm B(f, r)$ on points in $\mathcal X$, in this case with respect to shattering. To illustrate the calculation of $\tilde\theta_f(\varepsilon)$, consider our usual toy examples, again under $\mathcal P$ uniform on $[0, 1]$. For the thresholds example (Example 1), we have $\tilde d_f = 1$, so that $\tilde\theta_f(\varepsilon) = \theta^{(1)}_f(\varepsilon) = \theta_f(\varepsilon)$, which we have seen is equal to $2$ for small $\varepsilon$.
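The value $2$ for thresholds can be checked numerically. For $k = 1$, $\mathrm B(h_t, r)$ shatters a singleton $\{x\}$ exactly when two classifiers in the ball disagree at $x$; under $\mathcal P$ uniform on $[0,1]$ this disagreement region is $(t - r, t + r)$, so the ratio in Definition 15 is approximately $2$. A Monte Carlo sketch (function name is ours):

```python
import random

def theta_k1_estimate(t, r, n_samples=200_000, seed=0):
    """Monte Carlo estimate of P(x : B(h_t, r) shatters {x}) / r for the
    thresholds class under P uniform on [0,1].  For k = 1, shattering {x}
    is equivalent to x lying in the disagreement region (t - r, t + r)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if abs(rng.random() - t) < r)
    return hits / (n_samples * r)

est = theta_k1_estimate(t=0.5, r=0.05)
# est should be close to 2, matching theta_f = 2 for thresholds.
```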
Similarly, for the intervals example (Example 2), any $f = h_{[a,b]} \in \mathbb C$ with $b - a > 0$ has $\tilde d_f = 1$, so that $\tilde\theta_f(\varepsilon) = \theta^{(1)}_f(\varepsilon) = \theta_f(\varepsilon)$, which for sufficiently small $\varepsilon$ is equal to $\max\left\{ \frac{1}{b-a}, 4 \right\}$. Thus, for these two examples, $\tilde\theta_f(\varepsilon) = \theta_f(\varepsilon)$. However, continuing the intervals example, consider $f = h_{[a,a]} \in \mathbb C$. In this case, we have seen $\partial^1 f = \partial f = (0, 1)$, so that $\mathcal P(\partial^1 f) = 1 > 0$. For any $x_1, x_2 \in (0,1)$ with $0 < |x_1 - x_2| \le r$, $\mathrm B(f, r)$ can shatter $(x_1, x_2)$, specifically using the classifiers $\{h_{[x_1,x_2]}, h_{[x_1,x_1]}, h_{[x_2,x_2]}, h_{[x_3,x_3]}\}$ for any $x_3 \in (0,1) \setminus \{x_1, x_2\}$. However, for any $x_1, x_2 \in (0,1)$ with $|x_1 - x_2| > r$, no element of $\mathrm B(f, r)$ classifies both as $+1$ (as it would need width greater than $r$, and thus would have distance from $h_{[a,a]}$ greater than $r$). Therefore, $\{S \in \mathcal X^2 : \mathrm B(f, r) \text{ shatters } S\} = \{(x_1, x_2) \in (0,1)^2 : 0 < |x_1 - x_2| \le r\}$; this latter set has probability $2r(1-r) + r^2 = (2 - r) \cdot r$, which shrinks to $0$ as $r \to 0$. Therefore, $\tilde d_f = 2$. Furthermore, this shows $\tilde\theta_f(\varepsilon) = \theta^{(2)}_f(\varepsilon) = \sup_{r > \varepsilon} (2 - r) = 2 - \varepsilon \le 2$. Contrasting this with $\theta_f(\varepsilon) = 1/\varepsilon$, we see $\tilde\theta_f(\varepsilon)$ is significantly smaller than the disagreement coefficient; in particular, $\tilde\theta_f = 2 < \infty$, while $\theta_f = \infty$. Consider also the space of unions of $i$ intervals (Example 3) under $\mathcal P$ uniform on $[0,1]$. In this case, we have already seen that, for any $f = h_{\mathbf z} \in \mathbb C$ not representable (up to probability-zero differences) by a union of $i - 1$ or fewer intervals, we have $\mathcal P(\partial^1 f) = \mathcal P(\partial f) = 0$, so that $\tilde d_f = 1$, and $\tilde\theta_f = \theta^{(1)}_f = \theta_f = \max\left\{ \frac{1}{\min_{1 \le p < 2i} (z_{p+1} - z_p)}, 4i \right\}$.
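The probability $(2 - r) \cdot r$ computed above for the zero-width target $h_{[a,a]}$ can be sanity-checked by simulation, using the characterization of shatterable pairs as $\{(x_1, x_2) : 0 < |x_1 - x_2| \le r\}$. A minimal sketch (function name is ours):

```python
import random

def shatterable_pair_prob(r, n_samples=200_000, seed=1):
    """Estimate P^2 of the pairs (x1, x2) in (0,1)^2 with
    0 < |x1 - x2| <= r, i.e., the pairs shattered by B(h_[a,a], r)
    per the characterization in the text, under P uniform on [0,1]."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples)
               if 0 < abs(rng.random() - rng.random()) <= r)
    return hits / n_samples

r = 0.1
est = shatterable_pair_prob(r)
exact = (2 - r) * r   # = 2r(1 - r) + r^2 = 0.19 at r = 0.1
```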
To generalize this, suppose $f = h_{\mathbf z}$ is minimally representable as a union of some number $j \le i$ of intervals of nonzero width: $[z_1, z_2] \cup [z_3, z_4] \cup \cdots \cup [z_{2j-1}, z_{2j}]$, with $0 < z_1 < z_2 < \cdots < z_{2j} < 1$. For our purposes, this is fully general, since every element of $\mathbb C$ has distance zero to some $h_{\mathbf z}$ of this type, and $\tilde\theta_h = \tilde\theta_{h'}$ for any $h, h'$ with $\mathcal P(x : h(x) \ne h'(x)) = 0$. Now for any $k < i - j + 1$, and any $S = (x_1, \ldots, x_k) \in \mathcal X^k$ with all elements distinct and no element equal to any of the $z_p$ values, the set $\mathrm B(f, r)$ can shatter $S$, as follows. Begin with the intervals $[z_{2p-1}, z_{2p}]$ as above, and modify the classifier in the following way for each labeling of $S$. For any of the $x_\ell$ values we wish to label $+1$: if it is already in an interval $[z_{2p-1}, z_{2p}]$, we do nothing; if it is not in one of the $[z_{2p-1}, z_{2p}]$ intervals, we add the interval $[x_\ell, x_\ell]$ to the classifier. For any of the $x_\ell$ values we wish to label $-1$: if it is not in any interval $[z_{2p-1}, z_{2p}]$, we do nothing; if it is in some interval $[z_{2p-1}, z_{2p}]$, we split the interval by setting to $-1$ the labels in a small region $(x_\ell - \gamma, x_\ell + \gamma)$, for $\gamma < \min\{r/k, z_{2p} - z_{2p-1}\}$ chosen small enough so that $(x_\ell - \gamma, x_\ell + \gamma)$ does not contain any other element of $S$. These operations add at most $k$ new intervals to the minimal representation of the classifier as a union of intervals, which therefore has at most $j + k \le i$ intervals. Furthermore, the classifier disagrees with $f$ on a set of probability at most $r$, so that it is contained in $\mathrm B(f, r)$. We therefore have $\mathcal P^k\left( S \in \mathcal X^k : \mathrm B(f, r) \text{ shatters } S \right) = 1$.
However, note that for $0 < r < \min_{1 \le p < 2j} (z_{p+1} - z_p)$, for any $k$ and $S \in \mathcal X^k$ with all elements of $S \cup \{z_p : 1 \le p \le 2j\}$ separated by a distance greater than $r$, classifying the points in $S$ opposite to $f$ while remaining $r$-close to $f$ requires us to increase to a minimum of $j + k$ intervals. Thus, for $k = i - j + 1$, any $S = (x_1, \ldots, x_k) \in \mathcal X^k$ with $\min_{y_1, y_2 \in S \cup \{z_p\}_p : y_1 \ne y_2} |y_1 - y_2| > r$ is not shatterable by $\mathrm B(f, r)$. We therefore have
$$\{S \in \mathcal X^k : \mathrm B(f, r) \text{ shatters } S\} \subseteq \left\{ S \in \mathcal X^k : \min_{y_1, y_2 \in S \cup \{z_p\}_p : y_1 \ne y_2} |y_1 - y_2| \le r \right\}.$$
For $r < \min_{1 \le p < 2j} (z_{p+1} - z_p)$, we can bound the probability of this latter set by considering sampling the points $x_\ell$ sequentially; the probability the $\ell$th point is within $r$ of one of $x_1, \ldots, x_{\ell-1}, z_1, \ldots, z_{2j}$ is at most $2r(2j + \ell - 1)$, so (by a union bound) the probability that any of the $k$ points $x_1, \ldots, x_k$ is within $r$ of any other or of any of $z_1, \ldots, z_{2j}$ is at most $\sum_{\ell=1}^{k} 2r(2j + \ell - 1) = 2r\left(2jk + \binom{k}{2}\right) = (1 + i - j)(i + 3j)\, r$. Since this approaches zero as $r \to 0$, we have $\tilde d_f = i - j + 1$. Furthermore, this analysis shows $\tilde\theta_f = \theta^{(i-j+1)}_f \le \max\left\{ \frac{1}{\min_{1 \le p < 2j} (z_{p+1} - z_p)}, (1 + i - j)(i + 3j) \right\}$. In fact, careful further inspection reveals that this upper bound is tight (i.e., this is the exact value of $\tilde\theta_f$). Recalling that $\theta_f(\varepsilon) = 1/\varepsilon$ for $j < i$, we see that again $\tilde\theta_f(\varepsilon)$ is significantly smaller than the disagreement coefficient; in particular, $\tilde\theta_f < \infty$ while $\theta_f = \infty$. Of course, for the quantity $\tilde\theta_f(\varepsilon)$ to be truly useful, we need to be able to describe its behavior for families of learning problems beyond these simple toy problems. Fortunately, as with the disagreement coefficient, for learning problems with simple "geometric" interpretations, one can typically bound the value of $\tilde\theta_f$ without too much difficulty.
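The closed form $(1 + i - j)(i + 3j)$ for the union-bound coefficient above (with $k = i - j + 1$) can be verified mechanically:

```python
def union_bound_coeff(i, j):
    """Sum_{l=1}^{k} 2*(2j + l - 1) with k = i - j + 1, i.e., the
    coefficient of r in the union bound from the text."""
    k = i - j + 1
    return sum(2 * (2 * j + l - 1) for l in range(1, k + 1))

# The closed form claimed in the text, checked over a range of (i, j):
for i in range(1, 20):
    for j in range(1, i + 1):
        assert union_bound_coeff(i, j) == (1 + i - j) * (i + 3 * j)
```

Algebraically, $2r(2jk + \binom{k}{2}) = rk(4j + k - 1)$, which at $k = i - j + 1$ equals $r(i - j + 1)(i + 3j)$.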
For instance, consider $\mathcal X$ the surface of the unit hypersphere in $p$-dimensional Euclidean space (with $p \ge 3$), with $\mathcal P$ uniform on $\mathcal X$, and $\mathbb C$ the space of linear separators: $\mathbb C = \{h_{w,b}(x) = \mathbb 1^{\pm}_{[0,\infty)}(w \cdot x + b) : w \in \mathbb R^p, b \in \mathbb R\}$. Balcan, Hanneke, and Vaughan (2010) proved that $(\mathbb C, \mathcal P)$ is learnable at an exponential rate, by a specialized argument for this space. In the process, they established that for any $f \in \mathbb C$ with $\mathcal P(x : f(x) = +1) \in (0, 1)$, $\theta_f < \infty$; in fact, a similar argument shows $\theta_f \le 4\pi\sqrt{p} / \min_y \mathcal P(x : f(x) = y)$. Thus, in this case, $\tilde d_f = 1$, and $\tilde\theta_f = \theta_f < \infty$. However, consider $f \in \mathbb C$ with $\mathcal P(x : f(x) = y) = 1$ for some $y \in \{-1, +1\}$. In this case, every $h \in \mathbb C$ with $\mathcal P(x : h(x) = -y) \le r$ has $\mathcal P(x : h(x) \ne f(x)) \le r$ and is therefore contained in $\mathrm B(f, r)$. In particular, for any $x \in \mathcal X$, there is such an $h$ that disagrees with $f$ on only a small spherical cap containing $x$, so that $\mathrm{DIS}(\mathrm B(f, r)) = \mathcal X$ for all $r > 0$. But this means $\partial f = \mathcal X$, which implies $\theta_f(\varepsilon) = 1/\varepsilon$ and $\tilde d_f > 1$. However, let us examine the value of $\theta^{(2)}_f$. Let $A_p = \frac{2\pi^{p/2}}{\Gamma(p/2)}$ denote the surface area of the unit sphere in $\mathbb R^p$, and let $C_p(z) = \frac{1}{2} A_p I_{2z - z^2}\left(\frac{p-1}{2}, \frac{1}{2}\right)$ denote the surface area of a spherical cap of height $z$ (Li, 2011), where $I_x(a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int_0^x t^{a-1}(1-t)^{b-1}\, \mathrm d t$ is the regularized incomplete beta function.
In particular, since $\sqrt{\frac{p}{12}} \le \frac{\Gamma(\frac{p}{2})}{\Gamma(\frac{p-1}{2})\Gamma(\frac{1}{2})} \le \frac{1}{2}\sqrt{p - 2}$, the probability mass $\frac{C_p(z)}{A_p} = \frac{1}{2} \frac{\Gamma(\frac{p}{2})}{\Gamma(\frac{p-1}{2})\Gamma(\frac{1}{2})} \int_0^{2z - z^2} t^{\frac{p-3}{2}} (1 - t)^{-\frac{1}{2}}\, \mathrm d t$ contained in a spherical cap of height $z$ satisfies
$$\frac{C_p(z)}{A_p} \ge \frac{1}{2}\sqrt{\frac{p}{12}} \int_0^{2z - z^2} t^{\frac{p-3}{2}}\, \mathrm d t = \sqrt{\frac{p}{12}}\, \frac{(2z - z^2)^{\frac{p-1}{2}}}{p - 1} \ge \frac{(2z - z^2)^{\frac{p-1}{2}}}{\sqrt{12 p}}, \qquad (2)$$
and, letting $\bar z = \min\{z, 1/2\}$, also satisfies
$$\frac{C_p(z)}{A_p} \le 2\,\frac{C_p(\bar z)}{A_p} \le \frac{1}{2}\sqrt{p - 2} \int_0^{2\bar z - \bar z^2} t^{\frac{p-3}{2}} (1 - t)^{-\frac{1}{2}}\, \mathrm d t \le \sqrt{p - 2} \int_0^{2z - z^2} t^{\frac{p-3}{2}}\, \mathrm d t = \frac{2\sqrt{p - 2}}{p - 1} (2z - z^2)^{\frac{p-1}{2}} \le \frac{(2z - z^2)^{\frac{p-1}{2}}}{\sqrt{p/6}} \le \frac{(2z)^{\frac{p-1}{2}}}{\sqrt{p/6}}. \qquad (3)$$
Consider any linear separator $h \in \mathrm B(f, r)$ for $r < 1/2$, and let $z(h)$ denote the height of the spherical cap where $h(x) = -y$. Then (2) indicates the probability mass in this region is at least $\frac{(2z(h) - z(h)^2)^{\frac{p-1}{2}}}{\sqrt{12p}}$. Since $h \in \mathrm B(f, r)$, we know this probability mass is at most $r$, and we therefore have $2z(h) - z(h)^2 \le \left(\sqrt{12p}\, r\right)^{\frac{2}{p-1}}$. Now for any $x_1 \in \mathcal X$, the set of $x_2 \in \mathcal X$ for which $\mathrm B(f, r)$ shatters $(x_1, x_2)$ is equivalent to the set $\mathrm{DIS}(\{h \in \mathrm B(f, r) : h(x_1) = -y\})$. But if $h(x_1) = -y$, then $x_1$ is in the aforementioned spherical cap associated with $h$. A little trigonometry reveals that, for any spherical cap of height $z(h)$, any two points on the surface of this cap are within distance $2\sqrt{2z(h) - z(h)^2} \le 2\left(\sqrt{12p}\, r\right)^{\frac{1}{p-1}}$ of each other. Thus, any point $x_2$ farther than $2\left(\sqrt{12p}\, r\right)^{\frac{1}{p-1}}$ from $x_1$ must be outside the spherical cap associated with $h$, which means $h(x_2) = y$.
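As a numeric sanity check of the cap-mass bounds (2) and (3) as reconstructed above, one can evaluate the exact expression for $C_p(z)/A_p$ by simple quadrature and cross-check it against direct Monte Carlo sampling on the sphere. The sketch below (helper names and the midpoint rule are our choices, not the paper's) uses only the standard library:

```python
import math
import random

def cap_mass_exact(p, z, steps=20_000):
    """Probability mass of a spherical cap of height z on the unit sphere
    in R^p: (1/2) * Gamma(p/2)/(Gamma((p-1)/2)*Gamma(1/2)) *
    integral_0^{2z - z^2} t^{(p-3)/2} (1-t)^{-1/2} dt  (midpoint rule)."""
    u = 2 * z - z * z
    coef = 0.5 * math.gamma(p / 2) / (math.gamma((p - 1) / 2) * math.gamma(0.5))
    h = u / steps
    total = sum(((k + 0.5) * h) ** ((p - 3) / 2) * (1 - (k + 0.5) * h) ** -0.5
                for k in range(steps))
    return coef * total * h

def cap_mass_mc(p, z, n=100_000, seed=2):
    """Monte Carlo cross-check: uniform sphere points via normalized
    Gaussians; the cap of height z is {x : x_1 >= 1 - z}."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        nrm = math.sqrt(sum(v * v for v in x))
        hits += x[0] / nrm >= 1 - z
    return hits / n

p, z = 4, 0.3
mass = cap_mass_exact(p, z)
lower = (2 * z - z * z) ** ((p - 1) / 2) / math.sqrt(12 * p)   # bound (2)
upper = (2 * z) ** ((p - 1) / 2) / math.sqrt(p / 6)            # bound (3)
```

For $p = 4$, $z = 0.3$ the exact mass is roughly $0.094$, comfortably between the two bounds.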
But this is true for every $h \in \mathrm B(f, r)$ with $h(x_1) = -y$, so that $\mathrm{DIS}(\{h \in \mathrm B(f, r) : h(x_1) = -y\})$ is contained in the spherical cap of all elements of $\mathcal X$ within distance $2\left(\sqrt{12p}\, r\right)^{\frac{1}{p-1}}$ of $x_1$; a little more trigonometry reveals that the height of this spherical cap is $2\left(\sqrt{12p}\, r\right)^{\frac{2}{p-1}}$. Then (3) indicates the probability mass in this region is at most $2^{p-1}\sqrt{12p}\, r \big/ \sqrt{p/6} = 2^p \sqrt{18}\, r$. Thus, $\mathcal P^2\left((x_1, x_2) : \mathrm B(f, r) \text{ shatters } (x_1, x_2)\right) = \int \mathcal P\left(\mathrm{DIS}(\{h \in \mathrm B(f, r) : h(x_1) = -y\})\right)\, \mathcal P(\mathrm d x_1) \le 2^p \sqrt{18}\, r$. In particular, since this approaches zero as $r \to 0$, we have $\tilde d_f = 2$. This also shows that $\tilde\theta_f = \theta^{(2)}_f \le 2^p\sqrt{18}$, a finite constant (albeit a rather large one). Following similar reasoning, using the opposite inequalities as appropriate, and taking $r$ sufficiently small, one can also show $\tilde\theta_f \ge 2^p / (12\sqrt{2})$.

5.4 Bounds on the Label Complexity of Activized Learning

We have seen above that in the context of several examples, Meta-Algorithm 3 can offer significant advantages in label complexity over any given passive learning algorithm, and indeed also over disagreement-based active learning in many cases. In this subsection, we present a general result characterizing the magnitudes of these improvements over passive learning, in terms of $\tilde\theta_f(\varepsilon)$. Specifically, we have the following general theorem, along with two immediate corollaries. The proof is included in Appendix D.

Theorem 16 For any VC class $\mathbb C$, and any passive learning algorithm $\mathcal A_p$ achieving label complexity $\Lambda_p$, the (Meta-Algorithm 3)-activized $\mathcal A_p$ algorithm achieves a label complexity $\Lambda_a$ that, for any distribution $\mathcal P$ and classifier $f \in \mathbb C$, satisfies
$$\Lambda_a(\varepsilon, f, \mathcal P) = O\left( \tilde\theta_f\left( \Lambda_p(\varepsilon/4, f, \mathcal P)^{-1} \right) \log^2 \frac{\Lambda_p(\varepsilon/4, f, \mathcal P)}{\varepsilon} \right).$$
⋄

Corollary 17 For any VC class $\mathbb C$, there exists a passive learning algorithm $\mathcal A_p$ such that, for every $f \in \mathbb C$ and distribution $\mathcal P$, the (Meta-Algorithm 3)-activized $\mathcal A_p$ algorithm achieves label complexity
$$\Lambda_a(\varepsilon, f, \mathcal P) = O\left( \tilde\theta_f(\varepsilon) \log^2(1/\varepsilon) \right). \; \diamond$$

Proof The one-inclusion graph algorithm of Haussler, Littlestone, and Warmuth (1994) is a passive learning algorithm achieving label complexity $\Lambda_p(\varepsilon, f, \mathcal P) \le d/\varepsilon$. Plugging this into Theorem 16, using the fact that $\tilde\theta_f(\varepsilon/4d) \le 4d\, \tilde\theta_f(\varepsilon)$, and simplifying, we arrive at the result. In fact, in the proof of Theorem 16, we see that incurring this extra constant factor of $d$ is not actually necessary.

Corollary 18 For any VC class $\mathbb C$ and distribution $\mathcal P$, if $\forall f \in \mathbb C$, $\tilde\theta_f < \infty$, then $(\mathbb C, \mathcal P)$ is learnable at an exponential rate. If this is true for all $\mathcal P$, then $\mathbb C$ is learnable at an exponential rate. ⋄

Proof The first claim follows directly from Corollary 17, since $\tilde\theta_f(\varepsilon) \le \tilde\theta_f$. The second claim then follows from the fact that Meta-Algorithm 3 is adaptive to $\mathcal P$ (it has no direct dependence on $\mathcal P$ except via the data).

Actually, in the proof we arrive at a somewhat more general result, in that the bound of Theorem 16 holds for any target function $f$ in the "closure" of $\mathbb C$: that is, any $f$ such that $\forall r > 0$, $\mathrm B(f, r) \ne \emptyset$. As previously mentioned, if our goal is only to obtain the label complexity bound of Corollary 17 by a direct approach, then we can use a simpler procedure (which cuts out Steps 9-16, instead returning an arbitrary element of $V$), analogous to how the analysis of the original algorithm of Cohn, Atlas, and Ladner (1994) by Hanneke (2011) obtains the label complexity bound of Corollary 11 (see also Algorithm 5 below). However, the general result of Theorem 16 is interesting in that it applies to any passive algorithm.
Inspecting the proof, we see that it is also possible to state a result that separates the probability of success from the achieved error rate, similar to the PAC model of Valiant (1984) and the analysis of active learning by Balcan, Hanneke, and Vaughan (2010). Specifically, suppose $\mathcal A_p$ is a passive learning algorithm such that, $\forall \varepsilon, \delta \in (0, 1)$, there is a value $\lambda(\varepsilon, \delta, f, \mathcal P) \in \mathbb N$ such that $\forall n \ge \lambda(\varepsilon, \delta, f, \mathcal P)$, $\mathbb P\left(\mathrm{er}(\mathcal A_p(\mathcal Z_n)) > \varepsilon\right) \le \delta$. Suppose $\hat h_n$ is the classifier returned by the (Meta-Algorithm 3)-activized $\mathcal A_p$ with label budget $n$. Then for some $(\mathbb C, \mathcal P, f)$-dependent constant $c \in [1, \infty)$, $\forall \varepsilon, \delta \in (0, e^{-3})$, letting $\lambda = \lambda(\varepsilon/2, \delta/2, f, \mathcal P)$,
$$\forall n \ge c\, \tilde\theta_f\left( \lambda^{-1} \right) \log^2(\lambda/\delta), \quad \mathbb P\left( \mathrm{er}\left( \hat h_n \right) > \varepsilon \right) \le \delta.$$
For instance, if $\mathcal A_p$ is an empirical risk minimization algorithm, then this is $\propto \tilde\theta_f(\varepsilon)\, \mathrm{polylog}\left( \frac{1}{\varepsilon\delta} \right)$.

5.5 Limitations and Potential Improvements

Theorem 16 and its corollaries represent significant improvements over most known results for the label complexity of active learning, and in particular over Theorem 10 and its corollaries. As for whether this also represents the best possible label complexity gains achievable by any active learning algorithm, the answer is mixed. As with any algorithm and analysis, Meta-Algorithm 3, Theorem 16, and corollaries represent one set of solutions in a spectrum that trades strength of performance guarantees against simplicity. As such, there are several possible modifications one might make, which could potentially improve the performance guarantees. Here we sketch a few such possibilities. Even with Meta-Algorithm 3 as-is, various improvements to the bound of Theorem 16 should be possible, simply by being more careful in the analysis.
For instance, as mentioned, Meta-Algorithm 3 is a universal activizer for any VC class $\mathbb C$, so in particular we know that whenever $\tilde\theta_f(\varepsilon) \ne o(1/(\varepsilon \log(1/\varepsilon)))$, the above bound is not tight (see the work of Balcan, Hanneke, and Vaughan (2010) for a construction leading to such $\tilde\theta_f(\varepsilon)$ values), and indeed any bound of the form $\tilde\theta_f(\varepsilon)\, \mathrm{polylog}(1/\varepsilon)$ will not be tight in that case. Again, a more refined analysis may close this gap. Another type of potential improvement is in the constant factors. Specifically, in the case when $\tilde\theta_f < \infty$, if we are only interested in asymptotic label complexity guarantees in Corollary 17, we can replace "$\sup_{r > 0}$" in Definition 15 with "$\limsup_{r \to 0}$," which can sometimes be significantly smaller and/or easier to study. This is true for the disagreement coefficient in Corollary 11 as well. Additionally, the proof (in Appendix D) reveals that there are significant $(\mathbb C, \mathcal P, f)$-dependent constant factors other than $\tilde\theta_f(\varepsilon)$, and it is quite likely that these can be improved by a more careful analysis of Meta-Algorithm 3 (or in some cases, possibly an improved definition of the estimators $\hat{\mathcal P}_m$). However, even with such refinements to improve the results, the approach of using $\tilde\theta_f$ to prove learnability at an exponential rate has limits. For instance, it is known that any countable $\mathbb C$ is learnable at an exponential rate (Balcan, Hanneke, and Vaughan, 2010). However, there are countable VC classes $\mathbb C$ for which $\tilde\theta_f = \infty$ for some elements of $\mathbb C$ (e.g., take the tree-paths concept space of Balcan, Hanneke, and Vaughan (2010), except instead of all infinite-depth paths from the root, take all of the finite-depth paths from the root, but keep one infinite-depth path $f$; for this modified space $\mathbb C$, which is countable, every $h \in \mathbb C$ has $\tilde d_h = 1$, and for that one infinite-depth $f$ we have $\tilde\theta_f = \infty$).
Inspecting the proof reveals that it is possible to make the results slightly sharper by replacing $\tilde\theta_f(r_0)$ (for $r_0$ as in the results above) with a somewhat more complicated quantity: namely,
$$\min_{k < \tilde d_f}\, \sup_{r > r_0}\, r^{-1} \cdot \mathcal P\left( x \in \mathcal X : \mathcal P^k\left( S \in \mathcal X^k : \mathrm B(f, r) \text{ shatters } S \cup \{x\} \right) \ge \mathcal P^k\left( \partial^k f \right) / 16 \right). \qquad (4)$$
This quantity can be bounded in terms of $\tilde\theta_f(r_0)$ via Markov's inequality, but is sometimes smaller. As for improving Meta-Algorithm 3 itself, there are several possibilities. One immediate improvement one can make is to replace the condition in Steps 5 and 12 by $\min_{1 \le j \le k} \hat{\mathcal P}_m\left(S \in \mathcal X^{j-1} : V \text{ shatters } S \cup \{X_m\} \mid V \text{ shatters } S\right) \ge 1/2$, likewise replacing the corresponding quantity in Step 9, and substituting in Steps 7 and 14 the quantity $\max_{1 \le j \le k} \hat{\mathcal P}_m\left(S \in \mathcal X^{j-1} : V[(X_m, -y)] \text{ does not shatter } S \mid V \text{ shatters } S\right)$; in particular, the results stated for Meta-Algorithm 3 remain valid with this substitution, requiring only minor modifications to the proofs. However, it is not clear what gains in theoretical guarantees this achieves. Additionally, there are various quantities in this procedure that can be altered almost arbitrarily, allowing room for fine-tuning. Specifically, the $2/3$ in Step 0 and $1/3$ in Step 16 can be set to arbitrary constants summing to $1$. Likewise, the $1/4$ in Step 3, $1/3$ in Step 10, and $3/4$ in Step 12 can be changed to any constants in $(0, 1)$, possibly depending on $k$, such that the sum of the first two is strictly less than the third. Also, the $1/2$ in Steps 5, 9, and 12 can be set to any constant in $(0, 1)$.
Furthermore, the $k \cdot 2^n$ in Step 3 only prevents infinite looping, and can be set to any function growing superlinearly in $n$, though to get the largest possible improvements it should at least grow exponentially in $n$; typically, any active learning algorithm capable of exponential improvements over reasonable passive learning algorithms will require access to a number of unlabeled examples exponential in $n$, and Meta-Algorithm 3 is no exception to this. One major issue in the design of the procedure is an inherent trade-off between the achieved label complexity and the number of unlabeled examples used by the algorithm. This is noteworthy both because of the practical concerns of gathering such large quantities of unlabeled data, and also for computational efficiency reasons. In contrast to disagreement-based methods, the design of the estimators used in Meta-Algorithm 3 introduces such a trade-off, though in contrast to the splitting index analysis of Dasgupta (2005), the trade-off here seems to appear only in the constant factors. The choice of these $\hat{\mathcal P}_m$ estimators, both in their definition in Appendix B.1, and indeed in the very quantities they estimate, is such that we can (if desired) limit the number of unlabeled examples the main body of the algorithm uses (the actual number it needs to achieve Theorem 16 can be extracted from the proofs in Appendix D.1). However, if the number of unlabeled examples used by the algorithm is not a limiting factor, we can suggest more effective quantities. Specifically, following the original motivation for using shatterable sets, we might consider a greedily-constructed distribution over the set $\{S \in \mathcal X^j : V \text{ shatters } S, 1 \le j < k, \text{ and either } j = k - 1 \text{ or } \mathcal P(s : V \text{ shatters } S \cup \{s\}) = 0\}$. We can construct the distribution implicitly, via the following generative model. First we set $S = \{\}$.
Then repeat the following. If $|S| = k - 1$ or $\mathcal P(s \in \mathcal X : V \text{ shatters } S \cup \{s\}) = 0$, output $S$; otherwise, sample $s$ according to the conditional distribution of $X$ given that $V$ shatters $S \cup \{X\}$, and append $s$ to $S$. If we denote this distribution (over $S$) as $\tilde{\mathcal P}_k$, then replacing the estimator $\hat{\mathcal P}_m\left( S \in \mathcal X^{k-1} : V \text{ shatters } S \cup \{X_m\} \mid V \text{ shatters } S \right)$ in Meta-Algorithm 3 with an appropriately constructed estimator of $\tilde{\mathcal P}_k\left(S : V \text{ shatters } S \cup \{X_m\}\right)$ (and similarly replacing the other estimators) can lead to some improvements in the constant factors of the label complexity. However, such a modification can also dramatically increase the number of unlabeled examples required by the algorithm, since determining whether $\mathcal P(s \in \mathcal X : V \text{ shatters } S \cup \{s\}) \approx 0$ can be costly. Unlike Meta-Algorithm 1, there remain serious efficiency concerns surrounding Meta-Algorithm 3. If we knew the value of $\tilde d_f$, and $\tilde d_f \le c \log_2(d)$ for some constant $c$, then we could potentially design an efficient version of Meta-Algorithm 3 still achieving Corollary 17. Specifically, suppose we can find a classifier in $\mathbb C$ consistent with any given sample, or determine that no such classifier exists, in time polynomial in the sample size (and $d$), and also that $\mathcal A_p$ efficiently returns a classifier in $\mathbb C$ consistent with the sample it is given. Then, replacing the loop of Step 1 by simply running with $k = \tilde d_f$ and returning $\mathcal A_p(\mathcal L_{\tilde d_f})$, the algorithm becomes efficient, in the sense that with high probability, its running time is $\mathrm{poly}(d/\varepsilon)$, where $\varepsilon$ is the error rate guarantee obtained from inverting the label complexity at the value of $n$ given to the algorithm. To be clear, in some cases we may obtain values $m \propto \exp\{\Omega(n)\}$, but the error rate guaranteed by $\mathcal A_p$ is $\tilde O(1/m)$ in these cases, so that we still have $m$ polynomial in $d/\varepsilon$.
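The greedy generative model described above can be sketched concretely for a finite domain with point masses, where the conditional sampling step reduces to reweighting the points whose addition keeps the set shatterable. All names below are our own, and the finite-domain specialization is an illustration only:

```python
import random

def shatters(V, S):
    """A finite pool V shatters the tuple S iff every labeling in
    {-1,+1}^|S| is realized on S by some h in V."""
    return len({tuple(h(x) for x in S) for h in V}) == 2 ** len(S)

def sample_shatterable_set(V, domain, weights, k, seed=3):
    """Sketch of the greedy generative model from the text: grow S by
    sampling a point x conditioned on V shattering S + (x,), stopping
    once |S| = k - 1 or no single-point extension is shatterable."""
    rng = random.Random(seed)
    S = ()
    while True:
        ext = [(x, w) for x, w in zip(domain, weights)
               if shatters(V, S + (x,))]
        if len(S) == k - 1 or not ext:
            return S
        xs, ws = zip(*ext)
        S = S + (rng.choices(xs, weights=ws)[0],)

# Toy run: V = thresholds on a grid.  Thresholds can shatter single
# points in the interior but never a pair (the labeling (+1, -1) with
# x1 < x2 is unrealizable), so for k = 3 the sampler stops at a singleton.
def threshold(t):
    return lambda x: 1 if x >= t else -1

grid = [i / 10 for i in range(1, 10)]
V = [threshold(t) for t in grid]
S = sample_shatterable_set(V, grid, [1] * len(grid), k=3)
```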
However, in the absence of this access to $\tilde d_f$, the values of $k > \tilde d_f$ in Meta-Algorithm 3 may reach values of $m$ much larger than $\mathrm{poly}(d/\varepsilon)$, since the error rates obtained from these $\mathcal A_p(\mathcal L_k)$ evaluations are not guaranteed to be better than the $\mathcal A_p(\mathcal L_{\tilde d_f})$ evaluations, and yet we may have $|\mathcal L_k| \gg |\mathcal L_{\tilde d_f}|$. Thus, there remains a challenging problem of obtaining the results above (Theorem 16 and Corollary 17) via an efficient algorithm, adaptive to the value of $\tilde d_f$.

6. Toward Agnostic Activized Learning

The previous sections addressed learning in the realizable case, where there is a perfect classifier $f \in \mathbb C$ (i.e., $\mathrm{er}(f) = 0$). Moving beyond these scenarios, to problems in which $f$ is not a perfect classifier (i.e., stochastic labels) or not well-approximated by $\mathbb C$, requires a change in technique to make the algorithms more robust to such issues. As we will see in Subsection 6.2, the results we can prove in this more general setting are not quite as strong as those of the previous sections, but in some ways they are more interesting, both from a practical perspective, as we expect real learning problems to involve imperfect teachers or underspecified instance representations, and also from a theoretical perspective, as the class of problems addressed is significantly more general than those encompassed by the realizable case above. In this context, we will be largely interested in more general versions of the same types of questions as above, such as whether one can activize a given passive learning algorithm, in this case guaranteeing strictly improved label complexities for all nontrivial joint distributions over $\mathcal X \times \{-1, +1\}$. In Subsection 6.3, we present a general conjecture regarding this type of strong domination.
At the same time, to approach such questions, we will also need to focus on developing techniques to make the algorithms robust to label noise. For this, we will use a natural generalization of techniques developed for noise-robust disagreement-based active learning, analogous to how we generalized Meta-Algorithm 2 to arrive at Meta-Algorithm 3 above. For this purpose, as well as for the sake of comparison, we will review the known techniques and results for disagreement-based agnostic active learning in Subsection 6.5. We then extend these techniques in Subsection 6.6 to develop a new type of agnostic active learning algorithm, based on shatterable sets, which relates to the disagreement-based agnostic active learning algorithms in a way analogous to how Meta-Algorithm 3 relates to Meta-Algorithm 2. Furthermore, we present a bound on the label complexities achieved by this method, representing a natural generalization of both Corollary 17 and the known results on disagreement-based agnostic active learning (Hanneke, 2011). Although we present several new results, in some sense this section is less about what we know and more about what we do not yet know. As such, we will focus less on presenting a complete and elegant theory, and more on identifying potentially promising directions for exploration. In particular, Subsection 6.8 sketches out some interesting directions, which could potentially lead to a resolution of the aforementioned general conjecture from Subsection 6.3.

6.1 Definitions and Notation

In this setting, there is a joint distribution $\mathcal P_{XY}$ on $\mathcal X \times \{-1, +1\}$, with marginal distribution $\mathcal P$ on $\mathcal X$. For any classifier $h$, we denote by $\mathrm{er}(h) = \mathcal P_{XY}((x, y) : h(x) \ne y)$.
Also denote by $\nu^*(\mathcal P_{XY}) = \inf_{h : \mathcal X \to \{-1, +1\}} \mathrm{er}(h)$ the Bayes error rate, or simply $\nu^*$ when $\mathcal P_{XY}$ is clear from the context; also define the conditional label distribution $\eta(x; \mathcal P_{XY}) = \mathbb P(Y = +1 \mid X = x)$, where $(X, Y) \sim \mathcal P_{XY}$, or $\eta(x) = \eta(x; \mathcal P_{XY})$ when $\mathcal P_{XY}$ is clear from the context. For a given concept space $\mathbb C$, denote $\nu(\mathbb C; \mathcal P_{XY}) = \inf_{h \in \mathbb C} \mathrm{er}(h)$, called the noise rate of $\mathbb C$; when $\mathbb C$ and/or $\mathcal P_{XY}$ is clear from the context, we may abbreviate $\nu = \nu(\mathbb C) = \nu(\mathbb C; \mathcal P_{XY})$. For $\mathbb H \subseteq \mathbb C$, the diameter is defined as $\mathrm{diam}(\mathbb H; \mathcal P) = \sup_{h_1, h_2 \in \mathbb H} \mathcal P(x : h_1(x) \ne h_2(x))$. Also, for any $\varepsilon > 0$, define the $\varepsilon$-minimal set $\mathbb C(\varepsilon; \mathcal P_{XY}) = \{h \in \mathbb C : \mathrm{er}(h) \le \nu + \varepsilon\}$. For any set of classifiers $\mathbb H$, define the closure, denoted $\mathrm{cl}(\mathbb H; \mathcal P)$, as the set of all measurable $h : \mathcal X \to \{-1, +1\}$ such that $\forall r > 0$, $\mathrm B_{\mathbb H, \mathcal P}(h, r) \ne \emptyset$. When $\mathcal P_{XY}$ is clear from the context, we will simply refer to $\mathbb C(\varepsilon) = \mathbb C(\varepsilon; \mathcal P_{XY})$, and when $\mathcal P$ is clear, we write $\mathrm{diam}(\mathbb H) = \mathrm{diam}(\mathbb H; \mathcal P)$ and $\mathrm{cl}(\mathbb H) = \mathrm{cl}(\mathbb H; \mathcal P)$. In the noisy setting, rather than being a perfect classifier, we will let $f$ denote an arbitrary element of $\mathrm{cl}(\mathbb C; \mathcal P)$ with $\mathrm{er}(f) = \nu(\mathbb C; \mathcal P_{XY})$: that is, $f \in \bigcap_{\varepsilon > 0} \mathrm{cl}\left(\mathbb C(\varepsilon; \mathcal P_{XY}); \mathcal P\right)$. Such a classifier must exist, since $\mathrm{cl}(\mathbb C)$ is compact in the pseudo-metric $\rho(h, g) = \int |h - g|\, \mathrm d\mathcal P \propto \mathcal P(x : h(x) \ne g(x))$ (in the usual sense of the equivalence classes being compact in the $\rho$-induced metric). This can be seen by recalling that $\mathbb C$ is totally bounded (Haussler, 1992), and thus so is $\mathrm{cl}(\mathbb C)$, and that $\mathrm{cl}(\mathbb C)$ is a closed subset of $L_1(\mathcal P)$, which is complete (Dudley, 2002), so $\mathrm{cl}(\mathbb C)$ is also complete (Munkres, 2000).
Total boundedness and completeness together imply compactness (Munkres, 2000), and this implies the existence of $f$, since monotone sequences of nonempty closed subsets of a compact space have a nonempty limit set (Munkres, 2000). As before, in the learning problem there is a sequence $\mathcal Z = \{(X_1, Y_1), (X_2, Y_2), \ldots\}$, where the $(X_i, Y_i)$ are independent and identically distributed, and we denote by $\mathcal Z_m = \{(X_i, Y_i)\}_{i=1}^{m}$. As before, the $X_i \sim \mathcal P$, but rather than having each $Y_i$ value determined as a function of $X_i$, instead we have each pair $(X_i, Y_i) \sim \mathcal P_{XY}$. The learning protocol is defined identically as above; that is, the algorithm has direct access to the $X_i$ values, but must request the $Y_i$ (label) values one at a time, sequentially, and can request at most $n$ total labels, where $n$ is a budget provided as input to the algorithm. The label complexity is now defined just as before (Definition 1), but generalized by replacing $(f, \mathcal P)$ with the joint distribution $\mathcal P_{XY}$. Specifically, we have the following formal definition, which will be used throughout this section (and the corresponding appendices).

Definition 19 An active learning algorithm $\mathcal A$ achieves label complexity $\Lambda(\cdot, \cdot)$ if, for any joint distribution $\mathcal P_{XY}$, for any $\varepsilon \in (0, 1)$ and any integer $n \ge \Lambda(\varepsilon, \mathcal P_{XY})$, we have $\mathbb E[\mathrm{er}(\mathcal A(n))] \le \varepsilon$. ⋄

However, because there may not be any classifier with error rate less than any arbitrary $\varepsilon \in (0, 1)$, our objective changes here to achieving error rate at most $\nu + \varepsilon$ for any given $\varepsilon \in (0, 1)$. Thus, we are interested in the quantity $\Lambda(\nu + \varepsilon, \mathcal P_{XY})$, and will be particularly interested in this quantity's asymptotic dependence on $\varepsilon$, as $\varepsilon \to 0$. In particular, $\Lambda(\varepsilon, \mathcal P_{XY})$ may often be infinite for $\varepsilon < \nu$.
The label complexity for passive learning can be generalized analogously, again replacing $(f, \mathcal P)$ by $\mathcal P_{XY}$ in Definition 2 as follows.

Definition 20 A passive learning algorithm $\mathcal A$ achieves label complexity $\Lambda(\cdot, \cdot)$ if, for any joint distribution $\mathcal P_{XY}$, for any $\varepsilon \in (0, 1)$ and any integer $n \ge \Lambda(\varepsilon, \mathcal P_{XY})$, we have $\mathbb E[\mathrm{er}(\mathcal A(\mathcal Z_n))] \le \varepsilon$. ⋄

For any label complexity $\Lambda$ in the agnostic case, define the set $\mathrm{Nontrivial}(\Lambda; \mathbb C)$ as the set of all distributions $\mathcal P_{XY}$ on $\mathcal X \times \{-1, +1\}$ such that $\forall \varepsilon > 0$, $\Lambda(\nu + \varepsilon, \mathcal P_{XY}) < \infty$, and $\forall g \in \mathrm{Polylog}(1/\varepsilon)$, $\Lambda(\nu + \varepsilon, \mathcal P_{XY}) = \omega(g(\varepsilon))$. In this context, we can define an activizer for a given passive algorithm as follows.

Definition 21 We say an active meta-algorithm $\mathcal A_a$ activizes a passive algorithm $\mathcal A_p$ for $\mathbb C$ in the agnostic case if the following holds. For any label complexity $\Lambda_p$ achieved by $\mathcal A_p$, the active learning algorithm $\mathcal A_a(\mathcal A_p, \cdot)$ achieves a label complexity $\Lambda_a$ such that, for every distribution $\mathcal P_{XY} \in \mathrm{Nontrivial}(\Lambda_p; \mathbb C)$, there exists a constant $c \in [1, \infty)$ such that $\Lambda_a(\nu + c\varepsilon, \mathcal P_{XY}) = o(\Lambda_p(\nu + \varepsilon, \mathcal P_{XY}))$. In this case, $\mathcal A_a$ is called an activizer for $\mathcal A_p$ with respect to $\mathbb C$ in the agnostic case, and the active learning algorithm $\mathcal A_a(\mathcal A_p, \cdot)$ is called the $\mathcal A_a$-activized $\mathcal A_p$. ⋄

6.2 A Negative Result

First, the bad news: we cannot generally hope for universal activizers for VC classes in the agnostic case. In fact, there even exist passive algorithms that cannot be activized, even by any specialized active learning algorithm. Specifically, consider again Example 1, where $\mathcal X = [0, 1]$ and $\mathbb C$ is the class of threshold classifiers, and let $\check{\mathcal A}_p$ be a passive learning algorithm that behaves as follows. Given $n$ points $\mathcal Z_n = \{(X_1, Y_1), (X_2, Y_2), \ldots$
, (X_n, Y_n)}, ˇA_p(Z_n) returns the classifier h_ẑ ∈ C, where

ẑ = (1 − 2η̂_0) / (1 − η̂_0) and η̂_0 = ( |{i ∈ {1, . . . , n} : X_i = 0, Y_i = +1}| / |{i ∈ {1, . . . , n} : X_i = 0}| ∨ 1/8 ) ∧ 3/8,

taking η̂_0 = 1/8 if {i ∈ {1, . . . , n} : X_i = 0} = ∅. For most distributions P_XY, this algorithm clearly would not behave "reasonably," in that its error rate would be quite large; in particular, in the realizable case, the algorithm's worst-case expected error rate does not converge to zero as n → ∞. However, for certain distributions P_XY engineered specifically for this algorithm, it has near-optimal behavior in a strong sense. Specifically, we have the following result, the proof of which is included in Appendix E.1.

Theorem 22 There is no activizer for ˇA_p with respect to the space of threshold classifiers in the agnostic case. ⋄

Recall that threshold classifiers were, in some sense, one of the simplest scenarios for activized learning in the realizable case. Also, since threshold-like problems are embedded in most "geometric" concept spaces, this indicates we should generally not expect there to exist activizers for arbitrary passive algorithms in the agnostic case. However, this leaves open the question of whether certain families of passive learning algorithms can be activized in the agnostic case, a topic we turn to next.

6.3 A Conjecture: Activized Empirical Risk Minimization

The counterexample above is interesting, in that it exposes the limits on generality in the agnostic setting. However, the passive algorithm that cannot be activized there is in many ways not very reasonable, in that it has suboptimal worst-case expected excess error rate (among other deficiencies). It may therefore be more interesting to ask whether some family of "reasonable" passive learning algorithms can be activized in the agnostic case.
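To make the pathological construction of Section 6.2 concrete, here is a minimal sketch (our own illustrative reimplementation, not code from the paper, assuming the convention h_z(x) = +1 iff x ≥ z) of the passive algorithm ˇA_p for thresholds on [0, 1]:

```python
# Sketch of the passive algorithm A_p-check for threshold classifiers on [0, 1].
# Assumed convention: h_z(x) = +1 iff x >= z. Function names are ours.
def a_p_check(sample):
    """sample: list of (x, y) pairs with y in {-1, +1}.

    Estimates eta_0 = P(Y = +1 | X = 0) from the points with x == 0,
    clips it to [1/8, 3/8] (defaulting to 1/8 if no such points exist),
    and returns the threshold classifier h_z with z = (1 - 2*eta_0)/(1 - eta_0).
    """
    at_zero = [y for (x, y) in sample if x == 0]
    if not at_zero:
        eta_0 = 1 / 8
    else:
        frac = sum(1 for y in at_zero if y == +1) / len(at_zero)
        eta_0 = min(max(frac, 1 / 8), 3 / 8)  # (frac v 1/8) ^ 3/8
    z_hat = (1 - 2 * eta_0) / (1 - eta_0)
    return lambda x: +1 if x >= z_hat else -1
```

The clipping to [1/8, 3/8] keeps ẑ in a fixed subinterval of (0, 1), and the peculiar dependence on the labels at the single point x = 0 is exactly what makes the algorithm near-optimal only for distributions engineered around it.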
It seems that, unlike ˇA_p above, certain passive learning algorithms should not have too peculiar a dependence on the label noise, in that they use the Y_i values only to help determine f(X_i), and nothing more. In such cases, any Y_i value for which we can already infer the value f(X_i) should simply be ignored as redundant information, so that we needn't request such values. While this discussion is admittedly vague, consider the following formal conjecture.

Recall that an empirical risk minimization algorithm for C is a type of passive learning algorithm A, characterized by the fact that for any set L ∈ ⋃_m (X × {−1, +1})^m, A(L) ∈ argmin_{h∈C} er_L(h).

Conjecture 23 For any VC class, there exists an active meta-algorithm A_a and an empirical risk minimization algorithm A_p for C such that A_a activizes A_p for C in the agnostic case. ⋄

Resolution of this conjecture would be interesting for a variety of reasons. If the conjecture is correct, it means that the vast (and growing) literature on the label complexity of empirical risk minimization has direct implications for the potential performance of active learning under the same conditions. We might also expect activized empirical risk minimization to be quite effective in practical applications. While this conjecture remains open at this time, the remainder of this section might be viewed as partial evidence in its favor, as we show that active learning is able to achieve improvements over the known bounds on the label complexity of passive learning in many cases.

6.4 Low Noise Conditions

In the subsections below, we will be interested in stating bounds on the label complexity of active learning, analogous to those of Theorem 10 and Theorem 16, but for learning with label noise.
As in the realizable case, we should expect such bounds to have some explicit dependence on the distribution P_XY. Initially, one might hope that we could state interesting label complexity bounds purely in terms of a simple quantity such as ν(C; P_XY). However, it is known that any label complexity bound for a nontrivial C (for either passive or active learning) depending on P_XY only via ν(C; P_XY) will be Ω(ε^{−2}) when ν(C; P_XY) > 0 (Kääriäinen, 2006). Since passive learning can achieve a P_XY-independent O(ε^{−2}) label complexity bound for any VC class (Alexander, 1984), we will need to discuss label complexity bounds that depend on P_XY via more detailed quantities than merely ν(C; P_XY) if we are to characterize the improvements of active learning over passive.

In this subsection, we review an index commonly used to describe certain properties of P_XY relative to C: namely, the Mammen-Tsybakov margin conditions (Mammen and Tsybakov, 1999; Tsybakov, 2004; Koltchinskii, 2006). Specifically, we have the following formal condition from Koltchinskii (2006).

Condition 1 There exist constants μ, κ ∈ [1, ∞) such that ∀ε > 0, diam(C(ε; P_XY); P) ≤ μ · ε^{1/κ}. ⋄

This condition has recently been studied in depth in the passive learning literature, as it can be used to characterize scenarios where the label complexity of passive learning is between the worst-case Θ(1/ε²) and the realizable-case Θ(1/ε) (e.g., Mammen and Tsybakov, 1999; Tsybakov, 2004; Koltchinskii, 2006; Massart and Nédélec, 2006). The condition is implied by a variety of interesting special cases. For instance, it is satisfied when ∃μ′, κ ∈ [1, ∞) s.t. ∀h ∈ C, er(h) − ν(C; P_XY) ≥ μ′ · P(x : h(x) ≠ f(x))^κ. It is also satisfied when ν(C; P_XY) = ν*(P_XY) and ∃μ″, α ∈ (0, ∞) s.t.
∀ε > 0, P(x : |η(x; P_XY) − 1/2| ≤ ε) ≤ μ″ · ε^α, where κ and μ are functions of α and μ″ (Mammen and Tsybakov, 1999; Tsybakov, 2004); in particular, κ = (1 + α)/α. Special cases of this condition have also been studied in depth; for instance, bounded noise conditions, wherein ν(C; P_XY) = ν*(P_XY) and ∀x, |η(x; P_XY) − 1/2| > c for some constant c > 0 (e.g., Giné and Koltchinskii, 2006; Massart and Nédélec, 2006), are a special case of Condition 1 with κ = 1.

Condition 1 can be interpreted in a variety of ways, depending on the context. For instance, in certain concept spaces with a geometric interpretation, it can often be realized as a kind of large margin condition, under some condition relating the noisiness of a point's label to its distance from the optimal decision surface. That is, if the magnitude of noise (1/2 − |η(x; P_XY) − 1/2|) for a given point depends inversely on its distance from the optimal decision surface, so that points closer to the decision surface have noisier labels, a small value of κ in Condition 1 will occur if the distribution P has low density near the optimal decision surface (assuming ν(C; P_XY) = ν*(P_XY)) (e.g., Dekel, Gentile, and Sridharan, 2010). On the other hand, when there is high density near the optimal decision surface, the value of κ may be determined by how quickly η(x; P_XY) changes as x approaches the decision boundary (Castro and Nowak, 2008). See the works of Mammen and Tsybakov (1999); Tsybakov (2004); Koltchinskii (2006); Massart and Nédélec (2006); Castro and Nowak (2008); Dekel, Gentile, and Sridharan (2010); Bartlett, Jordan, and McAuliffe (2006) for further interpretations of Condition 1.
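To see where the exponent κ = (1 + α)/α comes from, the following standard computation (a sketch for intuition only; constants are not tracked, and f* denotes a Bayes optimal classifier) connects the noise exponent α to the diameter bound of Condition 1:

```latex
% Excess error as an integral of the pointwise noise margin over the
% disagreement region, then lower-bounded by splitting on |\eta - 1/2| \le t:
\mathrm{er}(h) - \nu^*(P_{XY})
  = \int_{\{x \,:\, h(x) \neq f^*(x)\}} 2\,\bigl|\eta(x; P_{XY}) - \tfrac{1}{2}\bigr|\, P(dx)
  \;\geq\; 2t \Bigl[ P\bigl(x : h(x) \neq f^*(x)\bigr) - \mu'' t^{\alpha} \Bigr]
  \qquad \text{for all } t > 0.
% Choosing t so that \mu'' t^{\alpha} = \tfrac{1}{2} P(x : h(x) \neq f^*(x)) yields
P\bigl(x : h(x) \neq f^*(x)\bigr)
  \;\leq\; c\,\bigl(\mathrm{er}(h) - \nu^*(P_{XY})\bigr)^{\frac{\alpha}{1+\alpha}},
% i.e., diam(C(\varepsilon); P) = O\bigl(\varepsilon^{1/\kappa}\bigr)
% with \kappa = (1+\alpha)/\alpha.
```

Applying the last display to every h ∈ C(ε) gives the diameter bound, with μ determined by μ″ and α.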
In the context of passive learning, one natural method to study is that of empirical risk minimization. Recall that a passive learning algorithm A is called an empirical risk minimization algorithm for C if it returns a classifier from C making the minimum number of mistakes on the labeled sample it is given as input. It is known that for any VC class C, and any P_XY satisfying Condition 1 for finite μ and κ, every empirical risk minimization algorithm for C achieves a label complexity

Λ(ν + ε, P_XY) = O( ε^{1/κ − 2} · log(1/ε) ).   (5)

This follows from the works of Koltchinskii (2006) and Massart and Nédélec (2006). Furthermore, for nontrivial concept spaces, one can show that inf_Λ sup_{P_XY} Λ(ν + ε, P_XY) = Ω( ε^{1/κ − 2} ), where the supremum ranges over all P_XY satisfying Condition 1 for the given μ and κ values, and the infimum ranges over all label complexities achievable by passive learning algorithms (Castro and Nowak, 2008; Hanneke, 2011); that is, the bound (5) cannot be significantly improved by any passive algorithm without allowing the label complexity to have a more refined dependence on P_XY than afforded by Condition 1.

In the context of active learning, a variety of results are presently known, which in some cases show improvements over (5). Specifically, for any VC class C and any P_XY satisfying Condition 1, a certain noise-robust disagreement-based active learning algorithm achieves label complexity

Λ(ν + ε, P_XY) = O( θ_f(ε^{1/κ}) · ε^{2/κ − 2} · log²(1/ε) ).   (6)

This general result was established by Hanneke (2011) (analyzing the algorithm of Dasgupta, Hsu, and Monteleoni (2007)), generalizing earlier C-specific results by Castro and Nowak (2008) and Balcan, Broder, and Zhang (2007), and was later simplified and refined in some cases by Koltchinskii (2010).
Comparing this to (5), when θ_f < ∞ this is an improvement over passive learning by a factor of ε^{1/κ} · log(1/ε). Note that this generalizes the label complexity bound of Corollary 11 above, since the realizable case entails Condition 1 with κ = μ/2 = 1. It is also known that this type of improvement is essentially the best we can hope for when we describe P_XY purely in terms of the parameters of Condition 1. Specifically, for any nontrivial concept space C,

inf_Λ sup_{P_XY} Λ(ν + ε, P_XY) = Ω( max{ ε^{2/κ − 2}, log(1/ε) } ),

where the supremum ranges over all P_XY satisfying Condition 1 for the given μ and κ values, and the infimum ranges over all label complexities achievable by active learning algorithms (Hanneke, 2011; Castro and Nowak, 2008).

In the following subsection, we review the established techniques and results for disagreement-based agnostic active learning; the algorithm presented there is slightly different from that originally analyzed by Hanneke (2011), but the label complexity bounds of Hanneke (2011) hold for this new algorithm as well. We follow this in Subsection 6.7 with a new agnostic active learning method that goes beyond disagreement-based learning, again generalizing the notion of disagreement to the notion of shatterability; this can be viewed as analogous to the generalization of Meta-Algorithm 2 represented by Meta-Algorithm 3, and as in that case the resulting label complexity bound replaces θ_f(·) with θ̃_f(·).

For both passive and active learning, results under Condition 1 are also known for more general scenarios than VC classes: namely, entropy conditions (Mammen and Tsybakov, 1999; Tsybakov, 2004; Koltchinskii, 2006, 2008; Massart and Nédélec, 2006; Castro and Nowak, 2008; Hanneke, 2011; Koltchinskii, 2010).
For a nonparametric class known as boundary fragments, Castro and Nowak (2008) find that active learning sometimes offers advantages over passive learning, under a special case of Condition 1. Furthermore, Hanneke (2011) shows a general result on the label complexity achievable by disagreement-based agnostic active learning, which sometimes exhibits an improved dependence on the parameters of Condition 1, under conditions on the disagreement coefficient and certain entropy conditions for (C, P) (see also Koltchinskii, 2010). These results will not play a role in the discussion below, as in the present work we restrict ourselves strictly to VC classes, leaving more general results for future investigations.

6.5 Disagreement-Based Agnostic Active Learning

Unlike the realizable case, here in the agnostic case we cannot eliminate a classifier from the version space after merely a single mistake, since even the best classifier is potentially imperfect. Rather, we take a collection of samples with labels, and eliminate those classifiers making significantly more mistakes relative to some others in the version space. This is the basic idea underlying most of the known agnostic active learning algorithms, including those discussed in the present work. The precise meaning of "significantly more," sufficient to guarantee the version space always contains some good classifier, is typically determined by established bounds on the deviation of excess empirical error rates from excess true error rates, taken from the passive learning literature.
The following disagreement-based algorithm is slightly different from any in the existing literature, but is similar in style to a method of Beygelzimer, Dasgupta, and Langford (2009); it also bears resemblance to the algorithms of Koltchinskii (2010); Dasgupta, Hsu, and Monteleoni (2007); Balcan, Beygelzimer, and Langford (2006a, 2009). It should be considered as representative of the family of disagreement-based agnostic active learning algorithms, and all results below concerning it have analogous results for variants of these other disagreement-based methods.

Algorithm 4
Input: label budget n, confidence parameter δ
Output: classifier ĥ
0. m ← 0, t ← 0, i ← 0, V_0 ← C, L_1 ← ∅
1. While t < n and m ≤ 2^n
2.   m ← m + 1
3.   If X_m ∈ DIS(V_i)
4.     Request the label Y_m of X_m, and let L_{i+1} ← L_{i+1} ∪ {(X_m, Y_m)} and t ← t + 1
5.   Else let ŷ be the label agreed upon by classifiers in V_i, and L_{i+1} ← L_{i+1} ∪ {(X_m, ŷ)}
6.   If m = 2^{i+1}
7.     V_{i+1} ← { h ∈ V_i : er_{L_{i+1}}(h) − min_{h′∈V_i} er_{L_{i+1}}(h′) ≤ Û_{i+1}(V_i, δ) }
8.     i ← i + 1, and then L_{i+1} ← ∅
9. Return any ĥ ∈ V_i

The algorithm is specified in terms of an estimator Û_i. The definition of Û_i should typically be based on generalization bounds known for passive learning. Inspired by the work of Koltchinskii (2006) and applications thereof in active learning (Hanneke, 2011; Koltchinskii, 2010), we will take a definition of Û_i based on a data-dependent Rademacher complexity, as follows. Let ξ_1, ξ_2, . . . denote a sequence of independent Rademacher random variables (i.e., uniform in {−1, +1}), also independent from all other random variables in the algorithm (i.e., Z).
Then for any set H ⊆ C, define

R̂_i(H) = sup_{h_1,h_2 ∈ H} 2^{−i} Σ_{m=2^{i−1}+1}^{2^i} ξ_m · (h_1(X_m) − h_2(X_m)),

D̂_i(H) = sup_{h_1,h_2 ∈ H} 2^{−i} Σ_{m=2^{i−1}+1}^{2^i} |h_1(X_m) − h_2(X_m)|,

Û_i(H, δ) = 12 R̂_i(H) + 34 √( D̂_i(H) ln(32 i²/δ) / 2^{i−1} ) + 752 ln(32 i²/δ) / 2^{i−1}.   (7)

Algorithm 4 operates by repeatedly doubling the sample size |L_{i+1}|, while only requesting the labels of the points in the region of disagreement of the version space. Each time it doubles the size of the sample L_{i+1}, it updates the version space by eliminating any classifiers that make significantly more mistakes on L_{i+1} relative to others in the version space. Since the labels of the examples we infer in Step 5 are agreed upon by all elements of the version space, the difference of empirical error rates in Step 7 is identical to the difference of empirical error rates under the true labels. This allows us to use established results on deviations of excess empirical error rates from excess true error rates to judge suboptimality of some of the classifiers in the version space in Step 7, thus reducing the version space.

As with Meta-Algorithm 2, for computational feasibility, the sets V_i and DIS(V_i) in Algorithm 4 can be represented implicitly by a set of constraints imposed by previous rounds of the loop. Also, the update to L_{i+1} in Step 5 is included only to make Step 7 somewhat simpler or more intuitive; it can be removed without altering the behavior of the algorithm, as long as we compensate by multiplying er_{L_{i+1}} by an appropriate renormalization constant in Step 7: namely, 2^{−i}|L_{i+1}|.

We have the following result about the label complexity of Algorithm 4; it is representative of the type of theorem one can prove about disagreement-based active learning under Condition 1.
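For intuition, the bound (7) can be sketched for a finite pool of classifiers as follows (our own illustration with assumed helper names; in the algorithm, H is the implicitly represented and typically infinite version space, and the suprema are constrained optimizations rather than enumerations). Note the result is a random quantity through the Rademacher signs ξ_m:

```python
import math
import random

# Sketch of the data-dependent Rademacher bound U^_i(H, delta) of Eq. (7),
# for a finite pool H of classifiers (functions x -> {-1, +1}).
def u_hat(H, xs, i, delta, rng=random):
    """xs: the unlabeled points X_{2^{i-1}+1}, ..., X_{2^i} (len(xs) == 2^(i-1))."""
    xi = [rng.choice((-1, 1)) for _ in xs]  # Rademacher signs, independent of the data
    pairs = [(h1, h2) for h1 in H for h2 in H]
    # R^_i(H) = sup_{h1,h2} 2^{-i} * sum_m xi_m * (h1(X_m) - h2(X_m))
    r_hat = max(2 ** -i * sum(s * (h1(x) - h2(x)) for s, x in zip(xi, xs))
                for h1, h2 in pairs)
    # D^_i(H) = sup_{h1,h2} 2^{-i} * sum_m |h1(X_m) - h2(X_m)|
    d_hat = max(2 ** -i * sum(abs(h1(x) - h2(x)) for x in xs)
                for h1, h2 in pairs)
    log_term = math.log(32 * i * i / delta)
    return (12 * r_hat
            + 34 * math.sqrt(d_hat * log_term / 2 ** (i - 1))
            + 752 * log_term / 2 ** (i - 1))
```

When H shrinks toward a set of near-identical classifiers, R̂_i and D̂_i shrink with it, so the threshold in Step 7 tightens as the version space focuses.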
Lemma 24 Let C be a VC class and suppose the joint distribution P_XY on X × {−1, +1} satisfies Condition 1 for finite parameters μ and κ. There is a (C, P_XY)-dependent constant c ∈ (0, ∞) such that, for any ε, δ ∈ (0, e^{−3}), and any integer

n ≥ c · θ_f(ε^{1/κ}) · ε^{2/κ − 2} · log²(1/(εδ)),

if ĥ_n is the output of Algorithm 4 when run with label budget n and confidence parameter δ, then on an event of probability at least 1 − δ, er(ĥ_n) ≤ ν + ε. ⋄

The proof of this result is essentially similar to the proof by Hanneke (2011), combined with some simplifying ideas from Koltchinskii (2010). It is also implicit in the proof of Lemma 26 below (by replacing "d̃_f" with "1" in the proof). The details are omitted. This result leads immediately to the following implication concerning the label complexity.

Theorem 25 Let C be a VC class and suppose the joint distribution P_XY on X × {−1, +1} satisfies Condition 1 for finite parameters μ, κ ∈ (1, ∞). With an appropriate (n, κ)-dependent setting of δ, Algorithm 4 achieves a label complexity Λ_a with Λ_a(ν + ε, P_XY) = O( θ_f(ε^{1/κ}) · ε^{2/κ − 2} · log²(1/ε) ). ⋄

Proof Taking δ = n^{−κ/(2κ−2)}, the result follows by simple algebra.

We should note that it is possible to design a kind of wrapper to adaptively determine an appropriate δ value, so that the algorithm achieves the label complexity guarantee of Theorem 25 without requiring any explicit dependence on the noise parameter κ. Specifically, one can use an idea similar to the model selection procedure of Hanneke (2011) for this purpose. However, as our focus in this work is on moving beyond disagreement-based active learning, we do not include the details of such a procedure here.
Note that Theorem 25 represents an improvement over the known results for passive learning (namely, (5)) whenever θ_f(ε) is small, and in particular this gap can be large when θ_f < ∞. The results of Lemma 24 and Theorem 25 represent the state of the art (up to logarithmic factors) in our understanding of the label complexity of agnostic active learning for VC classes. Thus, any significant improvement over these would advance our understanding of the fundamental capabilities of active learning in the presence of label noise. Next, we provide such an improvement.

6.6 A New Type of Agnostic Active Learning Algorithm Based on Shatterable Sets

Algorithm 4 and Theorem 25 represent natural extensions of Meta-Algorithm 2 and Theorem 10 to the agnostic setting. As such, they not only benefit from the advantages of those methods (small θ_f(ε) implies improved label complexity), but also suffer the same disadvantages (P(∂f) > 0 implies no strong improvements over passive). It is therefore natural to investigate whether the improvements offered by Meta-Algorithm 3 and the corresponding Theorem 16 can be extended to the agnostic setting in a similar way. In particular, as was possible for Theorem 16 with respect to Theorem 10, we might wonder whether it is possible to replace θ_f(ε^{1/κ}) in Theorem 25 with θ̃_f(ε^{1/κ}) by a modification of Algorithm 4 analogous to the modification of Meta-Algorithm 2 embodied in Meta-Algorithm 3. As we have seen, θ̃_f(ε^{1/κ}) is often significantly smaller in its asymptotic dependence on ε, compared to θ_f(ε^{1/κ}), in many cases even bounded by a finite constant when θ_f(ε^{1/κ}) is not. This would therefore represent a significant improvement over the known results for active learning under Condition 1. Toward this end, consider the following algorithm.
Algorithm 5
Input: label budget n, confidence parameter δ
Output: classifier ĥ
0. m ← 0, i_0 ← 0, V_0 ← C
1. For k = 1, 2, . . . , d + 1
2.   t ← 0, i_k ← i_{k−1}, m ← 2^{i_k}, V_{i_k+1} ← V_{i_k}, L_{i_k+1} ← ∅
3.   While t < ⌊2^{−k} n⌋ and m ≤ k · 2^n
4.     m ← m + 1
5.     If P̂_{4m}( S ∈ X^{k−1} : V_{i_k+1} shatters S ∪ {X_m} | V_{i_k+1} shatters S ) ≥ 1/2
6.       Request the label Y_m of X_m, and let L_{i_k+1} ← L_{i_k+1} ∪ {(X_m, Y_m)} and t ← t + 1
7.     Else ŷ ← argmax_{y∈{−1,+1}} P̂_{4m}( S ∈ X^{k−1} : V_{i_k+1}[(X_m, −y)] does not shatter S | V_{i_k+1} shatters S )
8.       L_{i_k+1} ← L_{i_k+1} ∪ {(X_m, ŷ)} and V_{i_k+1} ← V_{i_k+1}[(X_m, ŷ)]
9.     If m = 2^{i_k+1}
10.      V_{i_k+1} ← { h ∈ V_{i_k+1} : er_{L_{i_k+1}}(h) − min_{h′∈V_{i_k+1}} er_{L_{i_k+1}}(h′) ≤ Û_{i_k+1}(V_{i_k}, δ) }
11.      i_k ← i_k + 1, then V_{i_k+1} ← V_{i_k}, and L_{i_k+1} ← ∅
12. Return any ĥ ∈ V_{i_{d+1}+1}

For the argmax in Step 7, we break ties in favor of a ŷ value with V_{i_k+1}[(X_m, ŷ)] ≠ ∅ to maintain the invariant that V_{i_k+1} ≠ ∅ (see the proof of Lemma 59); when both y values satisfy this, we may break ties arbitrarily. The procedure is specified in terms of several estimators. The P̂_{4m} estimators, as usual, are defined in Appendix B.1. For Û_i, we again use the definition (7) above, based on a data-dependent Rademacher complexity.

Algorithm 5 is largely based on the same principles as Algorithm 4, combined with Meta-Algorithm 3. As in Algorithm 4, the algorithm proceeds by repeatedly doubling the size of a labeled sample L_{i+1}, while only requesting a subset of the labels in L_{i+1}, inferring the others. As before, it updates the version space every time it doubles the size of the sample L_{i+1}, and the update eliminates classifiers from the version space that make significantly more mistakes on L_{i+1} compared to others in the version space.
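The tests in Steps 5 and 7 are built on shatterability checks. For intuition, for a finite pool of classifiers V (our own simplification; in the algorithm, V is the implicitly represented version space and the check is a constrained feasibility problem), a set S is shattered when V realizes all 2^{|S|} sign patterns on S:

```python
# Illustrative helper (ours, not the paper's): a finite pool of classifiers
# V shatters a tuple of points S when it realizes all 2^|S| sign patterns on S.
def shatters(V, S):
    patterns = {tuple(h(x) for x in S) for h in V}
    return len(patterns) == 2 ** len(S)
```

For the threshold class of Example 1 (with the assumed convention h_z(x) = +1 iff x ≥ z), any single point on which the pool disagrees is shattered, but no pair of points ever is, since the pattern (+1, −1) on an ordered pair x_1 < x_2 is unrealizable by thresholds.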
In Algorithm 4, this is guaranteed to be effective, since the classifiers in the version space agree on all of the inferred labels, so that the differences of empirical error rates remain equal to the true differences of empirical error rates (i.e., under the true Y_m labels for all elements of L_{i+1}); thus, the established results from the passive learning literature bounding the deviations of excess empirical error rates from excess true error rates can be applied, showing that this does not eliminate the best classifiers. In Algorithm 5, the situation is somewhat more subtle, but the principle remains the same. In this case, we enforce that the classifiers in the version space agree on the inferred labels in L_{i+1} by explicitly removing the disagreeing classifiers in Step 8. Thus, as long as Step 8 does not eliminate all of the good classifiers, then neither will Step 10. To argue that Step 8 does not eliminate all good classifiers, we appeal to the same reasoning as for Meta-Algorithm 1 and Meta-Algorithm 3. That is, for k ≤ d̃_f and sufficiently large n, as long as there exist good classifiers in the version space, the labels ŷ inferred in Step 7 will agree with some good classifiers, and thus Step 8 will not eliminate all good classifiers. However, for k > d̃_f, the labels ŷ in Step 7 have no such guarantees, so that we are only guaranteed that some classifier in the version space is not eliminated. Thus, determining guarantees on the error rate of this algorithm hinges on bounding the worst excess error rate among all classifiers in the version space at the conclusion of the k = d̃_f round. This is essentially determined by the size of L_{i_k} at the conclusion of that round, which itself is largely determined by how frequently the algorithm requests labels during this k = d̃_f round.
Thus, once again the analysis rests on bounding the rate at which the frequency of label requests shrinks in the k = d̃_f round, which determines the rate of growth of |L_{i_k}|, and thus the final guarantee on the excess error rate.

As before, for computational feasibility, we can maintain the sets V_i implicitly as a set of constraints imposed by the previous updates, so that we may perform the various calculations required for the estimators P̂ as constrained optimizations. Also, the update to L_{i_k+1} in Step 8 is merely included to make the algorithm statement and the proofs somewhat more elegant; it can be omitted, as long as we compensate with an appropriate renormalization of the er_{L_{i_k+1}} values in Step 10 (i.e., multiplying by 2^{−i_k}|L_{i_k+1}|). Additionally, the same potential improvements we proposed in Section 5.5 for Meta-Algorithm 3 can be made to Algorithm 5 as well, again with only minor modifications to the proofs.

We should note that this is certainly not the only reasonable way to extend Meta-Algorithm 3 to the agnostic setting. For instance, another natural extension of Meta-Algorithm 1 to the agnostic setting, based on a completely different idea, appears in the author's doctoral dissertation (Hanneke, 2009b); that method can be improved in a natural way to take advantage of the sequential aspect of active learning, yielding an agnostic extension of Meta-Algorithm 3 differing from Algorithm 5 in several interesting ways. In the next subsection, we will see that the label complexities achieved by Algorithm 5 are often significantly better than the known results for passive learning. In fact, they are often significantly better than the presently-known results for any active learning algorithm in the published literature.
6.7 Improved Label Complexity Bounds for Active Learning with Noise

Under Condition 1, we can extend Lemma 24 and Theorem 25 in a way analogous to how Theorem 16 extends Theorem 10. Specifically, we have the following result, the proof of which is included in Appendix E.2.

Lemma 26 Let C be a VC class and suppose the joint distribution P_XY on X × {−1, +1} satisfies Condition 1 for finite parameters μ and κ. There is a (C, P_XY)-dependent constant c ∈ (0, ∞) such that, for any ε, δ ∈ (0, e^{−3}), and any integer

n ≥ c · θ̃_f(ε^{1/κ}) · ε^{2/κ − 2} · log²(1/(εδ)),

if ĥ_n is the output of Algorithm 5 when run with label budget n and confidence parameter δ, then on an event of probability at least 1 − δ, er(ĥ_n) ≤ ν + ε. ⋄

This has the following implication for the label complexity of Algorithm 5.

Theorem 27 Let C be a VC class and suppose the joint distribution P_XY on X × {−1, +1} satisfies Condition 1 for finite parameters μ, κ ∈ (1, ∞). With an appropriate (n, κ)-dependent setting of δ, Algorithm 5 achieves a label complexity Λ_a with Λ_a(ν + ε, P_XY) = O( θ̃_f(ε^{1/κ}) · ε^{2/κ − 2} · log²(1/ε) ). ⋄

Proof Taking δ = n^{−κ/(2κ−2)}, the result follows by simple algebra.

Theorem 27 represents an interesting generalization beyond the realizable case, and beyond the disagreement coefficient analysis. Note that if θ̃_f(ε) = o( ε^{−1} log^{−2}(1/ε) ), Theorem 27 represents an improvement over the known results for passive learning (Massart and Nédélec, 2006). As we always have θ̃_f(ε) = o( ε^{−1} ), we should typically expect such improvements for all but the most extreme learning problems. Recall that θ_f(ε) is often not o( ε^{−1} ), so that Theorem 27 is often a much stronger statement than Theorem 25.
In particular, this is a significant improvement over the known results for passive learning whenever θ̃_f < ∞, and an equally significant improvement over Theorem 25 whenever θ̃_f < ∞ but θ_f(ε) = Ω(1/ε) (see above for examples of this). However, note that unlike Meta-Algorithm 3, Algorithm 5 is not an activizer. Indeed, it is not clear (to the author) how to modify the algorithm to make it a universal activizer (even for the realizable case) while maintaining the guarantees of Theorem 27.

As with Theorem 16 and Corollary 17, Algorithm 5 and Theorem 27 can potentially be improved in a variety of ways, as outlined in Section 5.5. In particular, Theorem 27 can be made slightly sharper in some cases by replacing θ̃_f(ε^{1/κ}) with the sometimes-smaller (though more complicated) quantity (4) (with r_0 = ε^{1/κ}).

6.8 Beyond Condition 1

While Theorem 27 represents an improvement over the known results for agnostic active learning, Condition 1 is not fully general, and disallows many important and interesting scenarios. In particular, one key property of Condition 1, heavily exploited in the label complexity proofs for both passive learning and disagreement-based active learning, is that it implies diam(C(ε)) → 0 as ε → 0. In scenarios where this shrinking-diameter condition is not satisfied, the existing proofs of (5) for passive learning break down, and furthermore, the disagreement-based algorithms themselves cease to give significant improvements over passive learning, for essentially the same reasons leading to the "only if" part of Theorem 5 (i.e., the sampling region never focuses beyond some nonzero-probability region).
Even more alarming (at first glance) is the fact that this same problem can sometimes be observed in the k = d̃_f round of Algorithm 5; that is,

P( x : P^{d̃_f−1}( S ∈ X^{d̃_f−1} : V_{i_{d̃_f}+1} shatters S ∪ {x} | V_{i_{d̃_f}+1} shatters S ) ≥ 1/2 )

is no longer guaranteed to approach 0 as the budget n increases (as it does when diam(C(ε)) → 0). Thus, if we wish to approach an understanding of the improvements achievable by active learning in general, we must come to terms with scenarios where diam(C(ε)) does not shrink to zero. Toward this goal, it will be helpful to partition the distributions into two distinct categories, which we will refer to as the benign noise case and the misspecified model case. The P_XY in the benign noise case are characterized by the property that ν(C; P_XY) = ν*(P_XY); this is in some ways similar to the realizable case, in that C can approximate an optimal classifier, except that the labels are stochastic. In the benign noise case, the only reason diam(C(ε)) would not shrink to zero is if there is a nonzero-probability set of points x with η(x) = 1/2; that is, there are at least two classifiers achieving the Bayes error rate, and they are at nonzero distance from each other, which must mean they disagree on some points that have equal probability of either label occurring.

Interestingly, it seems that in the benign noise case, diam(C(ε)) ↛ 0 might not be a problem for algorithms based on shatterable sets, such as Algorithm 5. In particular, Algorithm 5 appears to continue exhibiting reasonable behavior in such scenarios. That is, even if there is a nonshrinking probability that the query condition in Step 5 is satisfied for k = d̃_f, on any given sequence Z there must be some smallest value of k for which this probability does shrink as n → ∞.
For this value of k, we should expect to observe good behavior from the algorithm, in that (for sufficiently large n) the inferred labels in Step 7 will tend to agree with some optimal classifier. Thus, the algorithm addresses the problem of multiple optimal classifiers by effectively selecting one of the optimal classifiers. To illustrate this phenomenon, consider learning with respect to the space of threshold classifiers (Example 1) with P uniform on [0, 1], and let (X, Y) ∼ P_XY satisfy P(Y = +1|X) = 0 for X < 1/3, P(Y = +1|X) = 1/2 for 1/3 ≤ X < 2/3, and P(Y = +1|X) = 1 for 2/3 ≤ X. As we know from above, d̃_f = 1 here. However, in this scenario we have DIS(C(ε)) → [1/3, 2/3] as ε → 0. Thus, Algorithm 4 never focuses its queries beyond a constant fraction of X, and therefore cannot improve over certain passive learning algorithms in terms of the asymptotic dependence of its label complexity on ε (assuming a worst-case choice of ĥ in Step 9). However, for k = 2 in Algorithm 5, every X_m will be assigned a label ŷ in Step 7 (since no 2 points are shattered); furthermore, for sufficiently large n we have (with high probability) DIS(V_{i_1}) not too much larger than [1/3, 2/3], so that most points in DIS(V_{i_1}) can be labeled either +1 or −1 by some optimal classifier. For us, this has two implications. First, the sets S ∈ [1/3, 2/3]^1 will (with high probability) dominate the votes for ŷ in Step 7, so that the ŷ inferred for any X_m ∉ [1/3, 2/3] will agree with all of the optimal classifiers. Second, the inferred labels ŷ for X_m ∈ [1/3, 2/3] will definitely agree with some optimal classifier.
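The geometry of this example is easy to verify numerically. The following sketch (an illustration only, not code from the paper) computes er(h_t) in closed form under the stated noise model and confirms that the set of (ν + ε)-optimal thresholds, and hence DIS(C(ε)), converges to [1/3, 2/3]:

```python
import numpy as np

# Threshold classifiers h_t(x) = +1 iff x >= t, with P uniform on [0,1] and
# eta(x) = P(Y=+1|x) equal to 0 on [0,1/3), 1/2 on [1/3,2/3), and 1 on [2/3,1].

def error_rate(t):
    """Exact error rate of the threshold classifier predicting +1 iff x >= t."""
    a = max(0.0, 1/3 - max(t, 0.0))   # [0,1/3): true label -1; h_t errs where x >= t
    b = (1/3) * 0.5                   # [1/3,2/3): coin-flip labels; any h errs w.p. 1/2
    c = max(0.0, min(t, 1.0) - 2/3)   # [2/3,1]: true label +1; h_t errs where x < t
    return a + b + c

nu = error_rate(0.5)   # best error in the class: 1/6, attained by every t in [1/3, 2/3]
ts = np.linspace(0.0, 1.0, 10001)
for eps in [0.1, 0.01, 0.001]:
    good = ts[np.array([error_rate(t) for t in ts]) <= nu + eps]
    # (nu+eps)-optimal thresholds occupy [1/3 - eps, 2/3 + eps], so
    # DIS(C(eps)) -> [1/3, 2/3]: queries never focus past this interval.
    print(eps, round(good.min(), 4), round(good.max(), 4))
```

The printed interval endpoints approach 1/3 and 2/3 as ε decreases, matching the claim that the region of disagreement stabilizes at a constant-probability set.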
Since we also impose the h(X_m) = ŷ constraint for V_{i_2+1} in Step 8, the inferred ŷ labels must all be consistent with the same optimal classifier, so that V_{i_2+1} will quickly converge to within a small neighborhood around that classifier, without any further label requests. Note, however, that the particular optimal classifier the algorithm converges to will be a random variable, determined by the particular sequence of data points processed by the algorithm; thus, it cannot be determined a priori, which significantly complicates any general attempt to analyze the label complexity achieved by the algorithm for arbitrary C and P_XY satisfying the benign noise condition. In particular, for some C and P_XY, even this minimal k for which convergence occurs may be a nondeterministic random variable. At this time, it is not entirely clear how general this phenomenon is (i.e., Algorithm 5 providing improvements over certain passive algorithms even for benign noise distributions with diam(C(ε)) ↛ 0), nor how to characterize the label complexity achieved by Algorithm 5 in general benign noise settings where diam(C(ε)) ↛ 0. However, as mentioned earlier, there are other natural ways to generalize Meta-Algorithm 3 to handle noise, some of which have more predictable behavior in the general benign noise setting. In particular, the original thesis work of Hanneke (2009b) explores a technique for active learning with benign noise which, unlike Algorithm 5, only uses the requested labels, not the inferred labels, and as a consequence never eliminates any optimal classifier from V. Because of this fact, the sampling region for each k converges to a predictable limiting region, so that we have an accurate a priori characterization of the algorithm's behavior.
However, it is not immediately clear (to the author) whether this alternative technique might lead to a method achieving results similar to Theorem 27.

In contrast to the benign noise case, in the misspecified model case we have ν(C; P_XY) > ν*(P_XY). In this case, if the diameter does not shrink, it is because of the existence of two classifiers h_1, h_2 ∈ cl(C) achieving error rate ν(C; P_XY), with P(x : h_1(x) ≠ h_2(x)) > 0. However, unlike above, since they do not achieve the Bayes error rate, it is possible that a significant fraction of the set of points they disagree on may have η(x) ≠ 1/2. Intuitively, this makes the active learning problem more difficult, as there is a worry that a method such as Algorithm 5 might infer the label h_2(x) for some point x when in fact h_1(x) is better for that particular x, and vice versa for the points x where h_2(x) would be better, thus getting the worst of both and potentially doubling the error rate in the process. However, it turns out that, for the purpose of exploring Conjecture 23, we can circumvent all of these issues by noting that there is a trivial solution to the misspecified model case. Specifically, since in our present context we are only interested in the label complexity for achieving error rate better than ν + ε, we can simply turn to any algorithm that asymptotically achieves an error rate strictly better than ν (e.g., Devroye et al., 1996), in which case the algorithm should require only a finite constant number of labels to achieve an expected error rate better than ν. To make the algorithm effective for the general case, we simply split our budget in three: one part for an active learning algorithm, such as Algorithm 5, for the benign noise case, one part for the method above handling the misspecified model case, and one part to select among their outputs.
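The three-part budget split just described can be sketched as follows. This is a minimal illustration with a hypothetical interface (`benign_learner`, `fallback_learner`, and `oracle` are stand-ins, not the paper's notation); the actual procedure is more careful in its selection step than this simple disagreement-region vote.

```python
import random

# Sketch: split the label budget n in three -- one share per candidate learner,
# and one share to select between their outputs by querying labels only on
# points where the two returned hypotheses disagree (they agree elsewhere).

def activize_with_fallback(benign_learner, fallback_learner, oracle, unlabeled, n):
    h1 = benign_learner(unlabeled, n // 3)     # e.g., an Algorithm-5-style method
    h2 = fallback_learner(unlabeled, n // 3)   # handles the misspecified model case
    disagree = [x for x in unlabeled if h1(x) != h2(x)]
    if not disagree:
        return h1                              # identical on the pool: either will do
    votes = 0
    for x in random.sample(disagree, min(n // 3, len(disagree))):
        votes += 1 if h1(x) == oracle(x) else -1   # third share: selection queries
    return h1 if votes >= 0 else h2
```

Restricting the selection queries to the disagreement region is what keeps the selection cheap: on points where the hypotheses agree, a label cannot distinguish them.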
The full details of such a procedure are specified in Appendix E.3, along with a proof of its performance guarantees, which are summarized as follows.

Theorem 28 Fix any concept space C. Suppose there exists an active learning algorithm A_a achieving a label complexity Λ_a. Then there exists an active learning algorithm A′_a achieving a label complexity Λ′_a such that, for any distribution P_XY on X × {−1, +1}, there exists a function λ(ε) ∈ Polylog(1/ε) such that

Λ′_a(ν + ε, P_XY) ≤ { max{2Λ_a(ν + ε/2, P_XY), λ(ε)},  in the benign noise case;
                      λ(ε),                              in the misspecified model case. ⋄

The main point of Theorem 28 is that, for our purposes, we can safely ignore the misspecified model case (as its solution is a trivial extension), and focus entirely on the performance of algorithms for the benign noise case. In particular, for any label complexity Λ_p, every P_XY ∈ Nontrivial(Λ_p; C) in the misspecified model case has Λ′_a(ν + ε, P_XY) = o(Λ_p(ν + ε, P_XY)), for Λ′_a as in Theorem 28. Thus, if there exists an active meta-algorithm achieving the strong improvement guarantees of an activizer for some passive learning algorithm A_p (Definition 21) for all distributions P_XY in the benign noise case, then there exists an activizer for A_p with respect to C in the agnostic case.

7. Open Problems

In some sense, this work raises more questions than it answers. Here, we list several problems that remain open at this time. Resolving any of these problems would make a significant contribution to our understanding of the fundamental capabilities of active learning.

• We have established the existence of universal activizers for VC classes in the realizable case. However, we have not made any serious attempt to characterize the properties that such activizers can possess.
In particular, as mentioned, it would be interesting to know whether activizers exist that preserve certain favorable properties of the given passive learning algorithm. For instance, we know that some passive learning algorithms (say, for linear separators) achieve a label complexity that is independent of the dimensionality of the space X, under a large margin condition on f and P (Balcan, Blum, and Vempala, 2006b). Is there an activizer for such algorithms that preserves this large-margin-based dimension-independence in the label complexity? Similarly, there are passive algorithms whose label complexity has a weak dependence on dimensionality, due to sparsity considerations (Bunea, Tsybakov, and Wegkamp, 2009; Wang and Shen, 2007). Is there an activizer for these algorithms that preserves this sparsity-based weak dependence on dimension? Is there an activizer that preserves adaptiveness to the dimension of the manifold to which P is restricted? What about an activizer that is sparsistent (Rocha, Wang, and Yu, 2009), given any sparsistent passive learning algorithm as input? Is there an activizer that preserves admissibility, in that given any admissible passive learning algorithm, the activized algorithm is an admissible active learning algorithm? Is there an activizer that, given any minimax optimal passive learning algorithm as input, produces a minimax optimal active learning algorithm? What about preserving other notions of optimality, or other properties?

• There may be some waste in the above activizers, since the label requests used in their initial phase (reducing the version space) are not used by the passive algorithm to produce the final classifier. This guarantees the examples fed into the passive algorithm are conditionally independent given the number of examples.
Intuitively, this seems necessary for the general results, since any dependence among the examples fed to the passive algorithm could influence its label complexity. However, it is not clear (to the author) how dramatic this effect can be, nor whether a simpler strategy (e.g., slightly randomizing the budget of label requests) might yield a similar effect while allowing a single-stage approach where all labels are used in the passive algorithm. It seems intuitively clear that some special types of passive algorithms should be able to use the full set of examples, from both phases, while still maintaining the strict improvements guaranteed in the main theorems above. What general properties must such passive algorithms possess?

• As previously mentioned, the vast majority of empirically-tested heuristic active learning algorithms in the published literature are designed in a reduction style, using a well-known passive learning algorithm as a subroutine, constructing sets of labeled examples and feeding them into the passive learning algorithm at various points in the execution of the active learning algorithm (e.g., Abe and Mamitsuka, 1998; McCallum and Nigam, 1998; Schohn and Cohn, 2000; Campbell, Cristianini, and Smola, 2000; Tong and Koller, 2001; Roy and McCallum, 2001; Muslea, Minton, and Knoblock, 2002; Lindenbaum, Markovitch, and Rusakov, 2004; Mitra, Murthy, and Pal, 2004; Roth and Small, 2006; Schein and Ungar, 2007; Har-Peled, Roth, and Zimak, 2007; Beygelzimer, Dasgupta, and Langford, 2009).
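As a concrete caricature of this reduction-style pattern, the following sketch (generic uncertainty sampling; a hypothetical composite, not any one of the cited algorithms) repeatedly queries the least-confident unlabeled point and feeds only the requested labels to the passive subroutine:

```python
# Reduction-style active learning sketch: the passive learner is a black box
# retrained on the set of (point, requested label) pairs after each query.

def uncertainty_sampling(passive_learner, confidence, oracle, unlabeled, n):
    """confidence(h, x): higher means h is more certain about x."""
    pool, labeled = list(unlabeled), []
    h = passive_learner(labeled)
    for _ in range(n):
        x = min(pool, key=lambda p: confidence(h, p))  # least-confident point
        pool.remove(x)
        labeled.append((x, oracle(x)))   # only requested labels are retained
        h = passive_learner(labeled)     # passive subroutine sees nothing else
    return h
```

Note the contrast with the methods in this paper: no inferred labels ever enter `labeled`, so the passive subroutine's input is exactly the set of queried examples.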
However, rather than including some examples whose labels are requested and other examples whose labels are inferred in the sets of labeled examples given to the passive learning algorithm (as in our rigorous methods above), these heuristic methods typically only input to the passive algorithm the examples whose labels were requested. We should expect that meta-algorithms of this type could not be universal activizers, but perhaps there do exist meta-algorithms of this type that are activizers for every passive learning algorithm of some special type. What are some general conditions on the passive learning algorithm so that some meta-algorithm of this type (i.e., feeding in only the requested labels) can activize every passive learning algorithm satisfying those conditions?

• As discussed earlier, the definition of "activizer" is based on a trade-off between the strength of claimed improvements for nontrivial scenarios, and ease of analysis within the framework. There are two natural questions regarding the possibility of stronger notions of "activizer." In Definition 3 we allow a constant factor c loss in the ε argument of the label complexity. In most scenarios, this loss is inconsequential (e.g., typically Λ_p(ε/c, f, P) = O(Λ_p(ε, f, P))), but one can construct scenarios where it does make a difference. In our proofs, we see that it is possible to achieve c = 3; in fact, a careful inspection of the proofs reveals we can even get c = (1 + o(1)), a function of ε, converging to 1. However, whether there exist universal activizers for every VC class that have c = 1 remains an open question. A second question regards our notion of "nontrivial problems." In Definition 3, we have chosen to think of any target and distribution with label complexity growing faster than Polylog(1/ε) as nontrivial, and do not require the activized algorithm to improve over the underlying passive algorithm for scenarios that are trivial for the passive algorithm. As mentioned, Definition 3 does have implications for the label complexities of these problems, as the label complexity of the activized algorithm will improve over every nontrivial upper bound on the label complexity of the passive algorithm. However, in order to allow for various operations in the meta-algorithm that may introduce additive Polylog(1/ε) terms due to exponentially small failure probabilities, such as the test that selects among hypotheses in ActiveSelect, we do not require the activized algorithm to achieve the same order of label complexity in trivial scenarios. For instance, there may be cases in which a passive algorithm achieves O(1) label complexity for a particular (f, P), but its activized counterpart has Θ(log(1/ε)) label complexity. The intention is to define a framework that focuses on nontrivial scenarios, where passive learning uses prohibitively many labels, rather than one that requires us to obsess over extra additive logarithmic terms. Nonetheless, there is a question of whether these losses in the label complexities of trivial problems are necessary to gain the improvements in the label complexities of nontrivial problems. There is also the question of how much the definition of "nontrivial" can be relaxed. Specifically, we have the following question: to what extent can we relax the notion of "nontrivial" in Definition 3, while still maintaining the existence of universal activizers for VC classes?
We see from our proofs that we can at least replace Polylog(1/ε) with log(1/ε). However, it is not clear whether we can go further than this in the realizable case (e.g., to say "nontrivial" means ω(1)). When there is noise, it is clear that we cannot relax the notion of "nontrivial" beyond replacing Polylog(1/ε) with log(1/ε). Specifically, whenever DIS(C) ≠ ∅, for any label complexity Λ_a achieved by an active learning algorithm, there must be some P_XY with Λ_a(ν + ε, P_XY) = Ω(log(1/ε)), even with the support of P restricted to a single point x ∈ DIS(C); the proof of this is via a reduction from sequential hypothesis testing for whether a coin has bias α or 1 − α, for some α ∈ (0, 1/2). Since passive learning via empirical risk minimization can achieve label complexity Λ_p(ν + ε, P_XY) = O(log(1/ε)) whenever the support of P is restricted to a single point, we cannot further relax the notion of "nontrivial" while preserving the possibility of a positive outcome for Conjecture 23. It is interesting to note that this entire issue vanishes if we are only interested in methods that achieve error at most ε with probability at least 1 − δ, where δ ∈ (0, 1) is some acceptable constant failure probability, as in the work of Balcan, Hanneke, and Vaughan (2010); in this case, we can simply take "nontrivial" to mean ω(1) label complexity, and both Meta-Algorithm 1 and Meta-Algorithm 3 remain universal activizers under this alternative definition, and achieve O(1) label complexity in trivial scenarios.

• Another interesting question concerns efficiency. Suppose there exists an algorithm to find an element of C consistent with any labeled sequence L in time polynomial in |L| and d, and that A_p(L) has running time polynomial in |L| and d.
Under these conditions, is there an activizer for A_p capable of achieving an error rate smaller than any ε in running time polynomial in 1/ε and d, given some appropriately large budget n? Recall that if we knew the value of d̃_f and d̃_f ≤ c log d, then Meta-Algorithm 1 could be made efficient, as discussed above. Therefore, this question is largely focused on the issue of adapting to the value of d̃_f. Another related question is whether there is an efficient active learning algorithm achieving the label complexity bound of Corollary 7 or Corollary 17.

• One question that comes up in the results above is the minimum number of batches of label requests necessary for a universal activizer. In Meta-Algorithm 0 and Theorem 5, we saw that sometimes two batches are sufficient: one to reduce the version space, and another to construct the labeled sample by requesting only those points in the region of disagreement. We certainly cannot use fewer than two batches in a universal activizer, for any nontrivial concept space, so that this represents the minimum. However, to get a universal activizer for every concept space, we increased the number of batches to three in Meta-Algorithm 1. The question is whether this increase is really necessary. Is there always a universal activizer using only two batches of label requests, for every VC class C?

• For some C, the learning process in the above methods might be viewed in two components: one component that performs active learning as usual (say, disagreement-based) under the assumption that the target function is very simple, and another component that searches for signs that the target function is in fact more complex. Thus, for some natural classes such as linear separators, it would be interesting to find simpler, more specialized methods which explicitly execute these two components.
For instance, for the first component, we might consider the usual margin-based active learning methods, which query near a current guess of the separator (Dasgupta, Kalai, and Monteleoni, 2005, 2009; Balcan, Broder, and Zhang, 2007), except that we bias toward simple hypotheses via a regularization penalty in the optimization that defines how we update the separator in response to a query. The second component might then be a simple random search for points whose correct classification requires larger values of the regularization term.

• Can we construct universal activizers for some concept spaces with infinite VC dimension? What about under some constraints on the distribution P or P_XY (e.g., the usual entropy conditions (van der Vaart and Wellner, 1996))? It seems we can still run Meta-Algorithm 1, Meta-Algorithm 3, and Algorithm 5 in this case, except we should increase the number of rounds (values of k) as a function of n; this may continue to have reasonable behavior even in some cases where d̃_f = ∞, especially when P^k(∂^k f) → 0 as k → ∞. However, it is not clear whether they will continue to guarantee the strict improvements over passive learning in the realizable case, nor what label complexity guarantees they will achieve. One specific question is whether there is a method always achieving label complexity o(ε^{(1−ρ)/κ − 2}), where ρ is from the entropy conditions (van der Vaart and Wellner, 1996) and κ is from Condition 1. This would be an improvement over the known results for passive learning (Mammen and Tsybakov, 1999; Tsybakov, 2004; Koltchinskii, 2006). Another related question is whether we can improve over the known results for active learning in these scenarios.
Specifically, Hanneke (2011) proved a bound of Õ(θ_f(ε^{1/κ}) · ε^{(2−ρ)/κ − 2}) on the label complexity of a certain disagreement-based active learning method, under entropy conditions and Condition 1. Do there exist active learning methods achieving asymptotically smaller label complexities than this, in particular improving the θ_f(ε^{1/κ}) factor? The quantity θ̃_f(ε^{1/κ}) is no longer defined when d̃_f = ∞, so this might not be a direct extension of Theorem 27, but we could perhaps use the sequence of θ_f^{(k)}(ε^{1/κ}) values in some other way to replace θ_f(ε^{1/κ}) in this case.

• There is also a question about generalizing this approach to label spaces other than {−1, +1}, and possibly other loss functions. It should be straightforward to extend these results to the setting of multiclass classification. However, it is not clear what the implications would be for general structured prediction problems, where the label space may be quite large (even infinite), and the loss function involves a notion of distance between labels. From a practical perspective, this question is particularly interesting, since problems with more complicated label spaces are often the scenarios where active learning is most needed, as it takes substantial time or effort to label each example. At this time, there are no published theoretical results on the label complexity improvements achievable for general structured prediction problems.

• All of the claims in this work also hold when A_p is a semi-supervised passive learning algorithm, simply by withholding a set of unlabeled data points in a preprocessing step, and feeding them into the passive algorithm along with the labeled set generated by the activizer.
However, it is not clear whether further claims are possible when activizing a semi-supervised algorithm, for instance by taking into account specific details of the learning bias used by the particular semi-supervised algorithm (e.g., a cluster assumption).

• The splitting index analysis of Dasgupta (2005) has the interesting feature of characterizing a trade-off between the number of label requests and the number of unlabeled examples used by the active learning algorithm. In the present work, we do not characterize any such trade-off. Indeed, the algorithms do not really have any parameter to adjust the number of unlabeled examples they use (aside from the precision of the P̂ estimators), so that they simply use as many as they need and then halt. This is true in both the realizable case and in the agnostic case. It would be interesting to try to modify these algorithms and their analysis so that, when there are more unlabeled examples available than would be used by the above methods, the algorithms can take advantage of this in a way that can be reflected in improved label complexity bounds, and when there are fewer unlabeled examples available, the algorithms can alter their behavior to compensate for this, at the cost of an increased label complexity. This would be interesting both for the realizable and agnostic cases. In fact, in the agnostic case, there are no known methods that exhibit this type of trade-off.

• Finally, as mentioned in the previous section, there is a serious question concerning what types of algorithms can be activized in the agnostic case, and how large the improvements in label complexity will be. In particular, Conjecture 23 hypothesizes that for any VC class, we can activize some empirical risk minimization algorithm in the agnostic case.
Resolving this conjecture (either positively or negatively) should significantly advance our understanding of the capabilities of active learning compared to passive learning.

Appendix A. Proofs Related to Section 3: Disagreement-Based Learning

The following result follows from a theorem of Anthony and Bartlett (1999), based on the classic results of Vapnik (1982) (with slightly better constant factors); see also the work of Blumer, Ehrenfeucht, Haussler, and Warmuth (1989).

Lemma 29 For any VC class C, m ∈ ℕ, and classifier f such that ∀r > 0, B(f, r) ≠ ∅, let V*_m = {h ∈ C : ∀i ≤ m, h(X_i) = f(X_i)}; for any δ ∈ (0, 1), there is an event H_m(δ) with P(H_m(δ)) ≥ 1 − δ such that, on H_m(δ), V*_m ⊆ B(f, φ(m; δ)), where

φ(m; δ) = (2/m)·( d·ln(2e·max{m, d}/d) + ln(2/δ) ). ⋄

A fact we will use repeatedly is that, for any N(ε) = ω(log(1/ε)), we have φ(N(ε); ε) = o(1).

Lemma 30 For P̂_n(DIS(V)) from (1), on an event J_n with P(J_n) ≥ 1 − 2·exp{−n/4},

max{P(DIS(V)), 4/n} ≤ P̂_n(DIS(V)) ≤ max{4·P(DIS(V)), 8/n}. ⋄

Proof Note that the sequence U_n from (1) is independent from both V and L. By a Chernoff bound, on an event J_n with P(J_n) ≥ 1 − 2·exp{−n/4},

P(DIS(V)) > 2/n  ⟹  (1 / (n²·P(DIS(V)))) · Σ_{x ∈ U_n} 𝟙_{DIS(V)}(x) ∈ [1/2, 2],

and

P(DIS(V)) ≤ 2/n  ⟹  (1/n²) · Σ_{x ∈ U_n} 𝟙_{DIS(V)}(x) ≤ 4/n.

This immediately implies the stated result.

Lemma 31 Let λ : (0, 1) → (0, ∞) and L : ℕ × (0, 1) → [0, ∞) be such that λ(ε) = ω(1), L(n, ε) is 0 at n = 1 and is diverging as n → ∞ for every ε ∈ (0, 1), and for any ℕ-valued N(ε) = ω(λ(ε)), L(N(ε), ε) = ω(N(ε)). Let L^{−1}(m; ε) = max{n ∈ ℕ : L(n, ε) < m}, for any m ∈ (0, ∞). Then for any Λ(ε) = ω(λ(ε)), L^{−1}(Λ(ε); ε) = o(Λ(ε)).
⋄ Proof First note that L^{−1} is well-defined and finite, due to the facts that L(n, ε) can be 0 and is diverging in n. Let Λ(ε) = ω(λ(ε)). It is fairly straightforward to show L^{−1}(Λ(ε); ε) ≠ Ω(Λ(ε)), but the stronger o(Λ(ε)) result takes slightly more work. Let L̄(n, ε) = min{ L(n, ε), n²/λ(ε) } for every n ∈ ℕ and ε ∈ (0, 1), and let L̄^{−1}(m; ε) = max{n ∈ ℕ : L̄(n, ε) < m}. We will first prove the result for L̄. Note that by definition of L̄^{−1}, we know

( L̄^{−1}(Λ(ε); ε) + 1 )² / λ(ε) ≥ L̄( L̄^{−1}(Λ(ε); ε) + 1, ε ) ≥ Λ(ε) = ω(λ(ε)),

which implies L̄^{−1}(Λ(ε); ε) = ω(λ(ε)). But, by definition of L̄^{−1} and the condition on L,

Λ(ε) > L̄( L̄^{−1}(Λ(ε); ε), ε ) = ω( L̄^{−1}(Λ(ε); ε) ).

Since L̄^{−1}(m; ε) ≥ L^{−1}(m; ε) for all m, this implies Λ(ε) = ω( L^{−1}(Λ(ε); ε) ), or equivalently L^{−1}(Λ(ε); ε) = o(Λ(ε)).

Lemma 32 For any VC class C and passive algorithm A_p, if A_p achieves label complexity Λ_p, then Meta-Algorithm 0, with A_p as its argument, achieves a label complexity Λ_a such that, for every f ∈ C and distribution P over X, if P(∂_{C,P} f) = 0 and ∞ > Λ_p(ε, f, P) = ω(log(1/ε)), then Λ_a(2ε, f, P) = o(Λ_p(ε, f, P)). ⋄

Proof This proof follows similar lines to a proof of a related result of Balcan, Hanneke, and Vaughan (2010). Suppose A_p achieves a label complexity Λ_p, and that f ∈ C and distribution P satisfy ∞ > Λ_p(ε, f, P) = ω(log(1/ε)) and P(∂_{C,P} f) = 0. Let ε ∈ (0, 1). For n ∈ ℕ, let Δ_n(ε) = P(DIS(B(f, φ(⌊n/2⌋; ε/2)))), L(n; ε) = ⌊n / max{32/n, 16·Δ_n(ε)}⌋, and for m ∈ (0, ∞) let L^{−1}(m; ε) = max{n ∈ ℕ : L(n; ε) < m}. Suppose

n ≥ max{ 12·ln(6/ε), 1 + L^{−1}(Λ_p(ε, f, P); ε) }.
Consider running Meta-Algorithm 0 with A_p and n as arguments, while f is the target function and P is the data distribution. Let V and L be as in Meta-Algorithm 0, and let ĥ_n = A_p(L) denote the classifier returned at the end. By Lemma 29, on the event H_{⌊n/2⌋}(ε/2), V ⊆ B(f, φ(⌊n/2⌋; ε/2)), so that P(DIS(V)) ≤ Δ_n(ε). Letting U = { X_{⌊n/2⌋+1}, …, X_{⌊n/2⌋+⌊n/(4Δ̂)⌋} }, by Lemma 30, on H_{⌊n/2⌋}(ε/2) ∩ J_n we have

⌊n / max{32/n, 16·Δ_n(ε)}⌋ ≤ |U| ≤ ⌊n / max{4·P(DIS(V)), 16/n}⌋. (8)

By a Chernoff bound, for an event K_n with P(K_n) ≥ 1 − exp{−n/12}, on H_{⌊n/2⌋}(ε/2) ∩ J_n ∩ K_n,

|U ∩ DIS(V)| ≤ 2·P(DIS(V)) · ⌊n / max{4·P(DIS(V)), 16/n}⌋ ≤ ⌈n/2⌉.

Defining the event G_n(ε) = H_{⌊n/2⌋}(ε/2) ∩ J_n ∩ K_n, we see that on G_n(ε), every time X_m ∈ DIS(V) in Step 5 of Meta-Algorithm 0, we have t < n; therefore, since f ∈ V implies that the inferred labels in Step 6 are correct as well, we have that on G_n(ε),

∀(x, ŷ) ∈ L, ŷ = f(x). (9)

Noting that

P(G_n(ε)^c) ≤ P(H_{⌊n/2⌋}(ε/2)^c) + P(J_n^c) + P(K_n^c) ≤ ε/2 + 2·exp{−n/4} + exp{−n/12} ≤ ε,

we have

E[er(ĥ_n)] ≤ E[ 𝟙_{G_n(ε)} · 𝟙[|L| ≥ Λ_p(ε, f, P)] · er(ĥ_n) ] + P(G_n(ε) ∩ {|L| < Λ_p(ε, f, P)}) + P(G_n(ε)^c)
           ≤ E[ 𝟙_{G_n(ε)} · 𝟙[|L| ≥ Λ_p(ε, f, P)] · er(A_p(L)) ] + P(G_n(ε) ∩ {|L| < Λ_p(ε, f, P)}) + ε. (10)

On G_n(ε), (8) implies |L| ≥ L(n; ε), and we chose n large enough so that L(n; ε) ≥ Λ_p(ε, f, P). Thus, the second term in (10) is zero, and we have

E[er(ĥ_n)] ≤ E[ 𝟙_{G_n(ε)} · 𝟙[|L| ≥ Λ_p(ε, f, P)] · er(A_p(L)) ] + ε
           = E[ E[ 𝟙_{G_n(ε)} · er(A_p(L)) | |L| ] · 𝟙[|L| ≥ Λ_p(ε, f, P)] ] + ε.
(11)

For any ℓ ∈ ℕ with P(|L| = ℓ) > 0, the conditional distribution of U given {|U| = ℓ} is a product distribution P^ℓ; that is, the samples in U are conditionally independent and identically distributed with distribution P, which is the same as the distribution of {X_1, X_2, …, X_ℓ}. Therefore, for any such ℓ with ℓ ≥ Λ_p(ε, f, P), by (9) we have

E[ 𝟙_{G_n(ε)} · er(A_p(L)) | {|L| = ℓ} ] ≤ E[er(A_p(Z_ℓ))] ≤ ε.

In particular, this means (11) is at most 2ε. This implies Meta-Algorithm 0, with A_p as its argument, achieves a label complexity Λ_a such that

Λ_a(2ε, f, P) ≤ max{ 12·ln(6/ε), 1 + L^{−1}(Λ_p(ε, f, P); ε) }.

Since Λ_p(ε, f, P) = ω(log(1/ε)) implies 12·ln(6/ε) = o(Λ_p(ε, f, P)), it remains only to show that L^{−1}(Λ_p(ε, f, P); ε) = o(Λ_p(ε, f, P)). Note that ∀ε ∈ (0, 1), L(1; ε) = 0 and L(n; ε) is diverging in n. Furthermore, by the assumption P(∂_{C,P} f) = 0, we know that for any N(ε) = ω(log(1/ε)), we have Δ_{N(ε)}(ε) = o(1) (by continuity of probability measures), which implies L(N(ε); ε) = ω(N(ε)). Thus, since Λ_p(ε, f, P) = ω(log(1/ε)), Lemma 31 implies L^{−1}(Λ_p(ε, f, P); ε) = o(Λ_p(ε, f, P)), as desired.

Lemma 33 For any VC class C, target function f ∈ C, and distribution P, if P(∂_{C,P} f) > 0, then there exists a passive learning algorithm A_p achieving a label complexity Λ_p such that (f, P) ∈ Nontrivial(Λ_p), and for any label complexity Λ_a achieved by running Meta-Algorithm 0 with A_p as its argument, and any constant c ∈ (0, ∞), Λ_a(cε, f, P) ≠ o(Λ_p(ε, f, P)). ⋄

Proof The proof can be broken down into three essential claims.
First, it follows from Lemma 35 below that, on an event $H'$ of probability one, $\mathcal{P}(\partial_V f) \ge \mathcal{P}(\partial_{\mathbb{C}} f)$; since $\mathcal{P}(\mathrm{DIS}(V)) \ge \mathcal{P}(\partial_V f)$, we have $\mathcal{P}(\mathrm{DIS}(V)) \ge \mathcal{P}(\partial_{\mathbb{C}} f)$ on $H'$.

The second claim is that on $H' \cap J_n$, $|\mathcal{L}| = O(n)$. This follows from Lemma 30 and our first claim by noting that, on $H' \cap J_n$,
\[ |\mathcal{L}| = \big\lfloor n/(4\hat{\Delta}) \big\rfloor \le n/(4\mathcal{P}(\mathrm{DIS}(V))) \le n/(4\mathcal{P}(\partial_{\mathbb{C}} f)). \]
Finally, we construct a passive algorithm $\mathcal{A}_p$ whose label complexity is not significantly improved when $|\mathcal{L}| = O(n)$. There is a fairly obvious randomized $\mathcal{A}_p$ with this property (simply returning $-f$ with probability $1/|\mathcal{L}|$, and otherwise $f$); however, we can even satisfy the property with a deterministic $\mathcal{A}_p$, as follows. Let $\mathcal{H}_f = \{h_i\}_{i=1}^{\infty}$ be any sequence of classifiers (not necessarily in $\mathbb{C}$) with $0 < \mathcal{P}(x : h_i(x) \ne f(x))$ strictly decreasing to $0$ (say with $h_1 = -f$). We know such a sequence must exist since $\mathcal{P}(\partial_{\mathbb{C}} f) > 0$. Now define, for nonempty $S$,
\[ \mathcal{A}_p(S) = \operatorname*{argmin}_{h_i \in \mathcal{H}_f} \mathcal{P}(x : h_i(x) \ne f(x)) + 2 \cdot \mathbb{1}_{[0, 1/|S|)}\big(\mathcal{P}(x : h_i(x) \ne f(x))\big). \]
$\mathcal{A}_p$ is constructed so that, in the special case that this particular $f$ is the target function and this particular $\mathcal{P}$ is the data distribution, $\mathcal{A}_p(S)$ returns the $h_i \in \mathcal{H}_f$ with minimal $\mathrm{er}(h_i)$ such that $\mathrm{er}(h_i) \ge 1/|S|$. For completeness, let $\mathcal{A}_p(\emptyset) = h_1$. Define $\varepsilon_i = \mathrm{er}(h_i) = \mathcal{P}(x : h_i(x) \ne f(x))$.

Now let $\hat{h}_n$ be the returned classifier from running Meta-Algorithm 0 with $\mathcal{A}_p$ and $n$ as inputs, let $\Lambda_p$ be the (minimal) label complexity achieved by $\mathcal{A}_p$, and let $\Lambda_a$ be the (minimal) label complexity achieved by Meta-Algorithm 0 with $\mathcal{A}_p$ as input. Take any $c \in (0, \infty)$, and $i$ sufficiently large so that $\varepsilon_{i-1} < 1/2$. Then we know that for any $\varepsilon \in [\varepsilon_i, \varepsilon_{i-1})$, $\Lambda_p(\varepsilon, f, \mathcal{P}) = \lceil 1/\varepsilon_i \rceil$. In particular, $\Lambda_p(\varepsilon, f, \mathcal{P}) \ge 1/\varepsilon$, so that $(f, \mathcal{P}) \in \mathrm{Nontrivial}(\Lambda_p)$.
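The deterministic passive learner constructed above admits a direct transcription. The sketch below is purely illustrative (hypothetical error values stand in for $\mathrm{er}(h_i) = \mathcal{P}(x : h_i(x) \ne f(x))$, and a finite list stands in for the infinite sequence $\mathcal{H}_f$); it shows how the penalized argmin picks the classifier with minimal error subject to $\mathrm{er}(h_i) \ge 1/|S|$:

```python
def select_index(errors, sample_size):
    # errors: strictly decreasing er(h_i) values in (0, 1]; sample_size: |S| >= 1.
    # A_p(S) = argmin_i  er(h_i) + 2 * 1[er(h_i) < 1/|S|].  Since every er(h_i) <= 1,
    # the +2 penalty rules out all h_i with er(h_i) < 1/|S| whenever an unpenalized
    # candidate exists, leaving the minimal er(h_i) that is still >= 1/|S|.
    threshold = 1.0 / sample_size
    return min(range(len(errors)),
               key=lambda i: errors[i] + (2 if errors[i] < threshold else 0))

# Hypothetical strictly decreasing error sequence er(h_1) > er(h_2) > ...
errs = [0.5, 0.25, 0.125, 0.0625, 0.03125]
assert select_index(errs, 10) == 2   # smallest er >= 1/10 is 0.125
assert select_index(errs, 20) == 3   # smallest er >= 1/20 is 0.0625
```

This is exactly the behavior the proof relies on: given a labeled sample of size $|S|$, the returned hypothesis has error at least $1/|S|$, so no passive learner of this form can beat the $O(1/|S|)$ rate.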
Also, by Markov's inequality and the above results on $|\mathcal{L}|$,
\[ \mathbb{E}\big[\mathrm{er}\big(\hat{h}_n\big)\big] \ge \mathbb{E}\left[\frac{1}{|\mathcal{L}|}\right] \ge \frac{4\mathcal{P}(\partial_{\mathbb{C}} f)}{n}\, \mathcal{P}\left(\frac{1}{|\mathcal{L}|} > \frac{4\mathcal{P}(\partial_{\mathbb{C}} f)}{n}\right) \ge \frac{4\mathcal{P}(\partial_{\mathbb{C}} f)}{n}\, \mathcal{P}(H' \cap J_n) \ge \frac{4\mathcal{P}(\partial_{\mathbb{C}} f)}{n} \big(1 - 2\exp\{-n/4\}\big). \]
This implies that for $4\ln(4) < n < \frac{2\mathcal{P}(\partial_{\mathbb{C}} f)}{c\varepsilon_i}$, we have $\mathbb{E}\big[\mathrm{er}\big(\hat{h}_n\big)\big] > c\varepsilon_i$, so that for all sufficiently large $i$,
\[ \Lambda_a(c\varepsilon_i, f, \mathcal{P}) \ge \frac{2\mathcal{P}(\partial_{\mathbb{C}} f)}{c\varepsilon_i} \ge \frac{\mathcal{P}(\partial_{\mathbb{C}} f)}{c} \left\lceil \frac{1}{\varepsilon_i} \right\rceil = \frac{\mathcal{P}(\partial_{\mathbb{C}} f)}{c}\, \Lambda_p(\varepsilon_i, f, \mathcal{P}). \]
Since this happens for all sufficiently large $i$, and thus for arbitrarily small $\varepsilon_i$ values, we have $\Lambda_a(c\varepsilon, f, \mathcal{P}) \ne o(\Lambda_p(\varepsilon, f, \mathcal{P}))$.

Proof [Theorem 5] Theorem 5 now follows directly from Lemmas 32 and 33, corresponding to the "if" and "only if" parts of the claim, respectively.

Appendix B. Proofs Related to Section 4: Basic Activizer

In this section, we provide detailed definitions, lemmas, and proofs related to Meta-Algorithm 1. In fact, we will develop slightly more general results here. Specifically, we fix an arbitrary constant $\gamma \in (0, 1)$, and will prove the result for a family of meta-algorithms parameterized by the value $\gamma$, used as the threshold in Steps 3 and 6 of Meta-Algorithm 1, which were set to $1/2$ above to simplify the algorithm. Thus, setting $\gamma = 1/2$ in the statements below will give the stated theorem.

Throughout this section, we will assume $\mathbb{C}$ is a VC class with VC dimension $d$, and let $\mathcal{P}$ denote the (arbitrary) marginal distribution of $X_i$ ($\forall i$). We also fix an arbitrary classifier $f \in \mathrm{cl}(\mathbb{C})$, where (as in Section 6) $\mathrm{cl}(\mathbb{C}) = \{h : \forall r > 0, \mathrm{B}(h, r) \ne \emptyset\}$ denotes the closure of $\mathbb{C}$. In the present context, $f$ corresponds to the target function when running Meta-Algorithm 1.
Thus, we will study the behavior of Meta-Algorithm 1 for this fixed $f$ and $\mathcal{P}$; since they are chosen arbitrarily, to establish Theorem 6 it will suffice to prove that for any passive $\mathcal{A}_p$, Meta-Algorithm 1 with $\mathcal{A}_p$ as input achieves superior label complexity compared to $\mathcal{A}_p$ for this $f$ and $\mathcal{P}$. In fact, because here we only assume $f \in \mathrm{cl}(\mathbb{C})$ (rather than $f \in \mathbb{C}$), we actually end up proving a slightly more general version of Theorem 6. But more importantly, this relaxation to $\mathrm{cl}(\mathbb{C})$ will also make the lemmas developed below more useful for subsequent proofs: namely, those in Appendix E.2. For this same reason, many of the lemmas of this section are substantially more general than is necessary for the proof of Theorem 6; the more general versions will be used in the proofs of results in later sections.

For any $m \in \mathbb{N}$, we define $V^\star_m = \{h \in \mathbb{C} : \forall i \le m, h(X_i) = f(X_i)\}$. Additionally, for $\mathcal{H} \subseteq \mathbb{C}$ and an integer $k \ge 0$, we will adopt the notation
\[ \mathcal{S}^k(\mathcal{H}) = \big\{ S \in \mathcal{X}^k : \mathcal{H} \text{ shatters } S \big\}, \qquad \bar{\mathcal{S}}^k(\mathcal{H}) = \mathcal{X}^k \setminus \mathcal{S}^k(\mathcal{H}), \]
and as in Section 5, we define the $k$-dimensional shatter core of $f$ with respect to $\mathcal{H}$ (and $\mathcal{P}$) as $\partial^k_{\mathcal{H}} f = \lim_{r \to 0} \mathcal{S}^k(\mathrm{B}_{\mathcal{H}}(f, r))$, and further define $\bar{\partial}^k_{\mathcal{H}} f = \mathcal{X}^k \setminus \partial^k_{\mathcal{H}} f$. Also as in Section 5, define
\[ \tilde{d}_f = \min\big\{ k \in \mathbb{N} : \mathcal{P}^k\big(\partial^k_{\mathbb{C}} f\big) = 0 \big\}. \]
For convenience, we also define the abbreviation $\tilde{\delta}_f = \mathcal{P}^{\tilde{d}_f - 1}\big(\partial^{\tilde{d}_f - 1}_{\mathbb{C}} f\big)$.

Also, recall that we are using the convention that $\mathcal{X}^0 = \{\emptyset\}$, $\mathcal{P}^0(\mathcal{X}^0) = 1$, and we say a set of classifiers $\mathcal{H}$ shatters $\emptyset$ iff $\mathcal{H} \ne \emptyset$. In particular, $\mathcal{S}^0(\mathcal{H}) \ne \emptyset$ iff $\mathcal{H} \ne \emptyset$, and $\partial^0_{\mathcal{H}} f \ne \emptyset$ iff $\inf_{h \in \mathcal{H}} \mathcal{P}(x : h(x) \ne f(x)) = 0$. For any measurable sets $S_1, S_2 \subseteq \mathcal{X}^k$ with $\mathcal{P}^k(S_2) > 0$, as usual we define $\mathcal{P}^k(S_1 | S_2) = \mathcal{P}^k(S_1 \cap S_2) / \mathcal{P}^k(S_2)$; in the situation where $\mathcal{P}^k(S_2) = 0$, it will be convenient to define $\mathcal{P}^k(S_1 | S_2) = 0$.
We use the definition of $\mathrm{er}(h)$ from above, and additionally define the conditional error rate $\mathrm{er}(h | S) = \mathcal{P}(\{x : h(x) \ne f(x)\} | S)$ for any measurable $S \subseteq \mathcal{X}$. We also adopt the usual shorthand for equalities and inequalities involving conditional expectations and probabilities given random variables, wherein, for instance, we write $\mathbb{E}[X | Y] = Z$ to mean that there is a version of $\mathbb{E}[X | Y]$ that is everywhere equal to $Z$, so that in particular any version of $\mathbb{E}[X | Y]$ equals $Z$ almost everywhere (see e.g., Ash and Doléans-Dade, 2000).

B.1 Definition of Estimators for Meta-Algorithm 1

While the estimated probabilities used in Meta-Algorithm 1 can be defined in a variety of ways to make it a universal activizer, in the statement of Theorem 6 above and the proof thereof below, we take the following specific definitions. After the definition, we discuss alternative possibilities.

Though it is a slight twist on the formal model, it will greatly simplify our discussion below to suppose we have access to two independent sequences of i.i.d. unlabeled examples $\mathcal{W}_1 = \{w_1, w_2, \ldots\}$ and $\mathcal{W}_2 = \{w'_1, w'_2, \ldots\}$, also independent from the main sequence $\{X_1, X_2, \ldots\}$, with $w_i, w'_i \sim \mathcal{P}$. Since the data sequence $\{X_1, X_2, \ldots\}$ is i.i.d., this is distributionally equivalent to supposing we partition the data sequence in a preprocessing step into three subsequences, alternately assigning each data point to either $\mathcal{Z}'_X$, $\mathcal{W}_1$, or $\mathcal{W}_2$. Then, if we suppose $\mathcal{Z}'_X = \{X'_1, X'_2, \ldots\}$, and we replace all references to $X_i$ with $X'_i$ in the algorithms and results, we obtain the equivalent statements holding for the model as originally stated. Thus, supposing the existence of these $\mathcal{W}_i$ sequences simply serves to simplify notation, and does not represent a further assumption on top of the previously stated framework.
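The preprocessing step described above — alternately routing each unlabeled point into one of three subsequences — can be sketched as follows (a minimal illustration; the function name and list representation are our own, and the paper simply treats $\mathcal{W}_1, \mathcal{W}_2$ as given i.i.d. sequences):

```python
def three_way_split(data_sequence):
    # Alternately assign each point to Z'_X, W1, or W2 (round-robin), so each
    # subsequence is itself an i.i.d. sample from the same distribution P.
    zx, w1, w2 = [], [], []
    buckets = (zx, w1, w2)
    for i, x in enumerate(data_sequence):
        buckets[i % 3].append(x)
    return zx, w1, w2

zx, w1, w2 = three_way_split(list(range(10)))
assert zx == [0, 3, 6, 9] and w1 == [1, 4, 7] and w2 == [2, 5, 8]
```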
For each $k \ge 2$, we partition $\mathcal{W}_2$ into subsets of size $k - 1$, as follows. For $i \in \mathbb{N}$, let
\[ S^{(k)}_i = \big\{ w'_{1 + (i-1)(k-1)}, \ldots, w'_{i(k-1)} \big\}. \]
We define the $\hat{\mathcal{P}}_m$ estimators in terms of three types of functions, defined below. For any $\mathcal{H} \subseteq \mathbb{C}$, $x \in \mathcal{X}$, $y \in \{-1, +1\}$, $m \in \mathbb{N}$, we define
\[ \hat{\mathcal{P}}_m\big(S \in \mathcal{X}^{k-1} : \mathcal{H} \text{ shatters } S \cup \{x\} \,\big|\, \mathcal{H} \text{ shatters } S\big) = \hat{\Delta}^{(k)}_m(x, \mathcal{W}_2, \mathcal{H}), \tag{12} \]
\[ \hat{\mathcal{P}}_m\big(S \in \mathcal{X}^{k-1} : \mathcal{H}[(x, -y)] \text{ does not shatter } S \,\big|\, \mathcal{H} \text{ shatters } S\big) = \hat{\Gamma}^{(k)}_m(x, y, \mathcal{W}_2, \mathcal{H}), \tag{13} \]
\[ \hat{\mathcal{P}}_m\Big(x : \hat{\mathcal{P}}\big(S \in \mathcal{X}^{k-1} : \mathcal{H} \text{ shatters } S \cup \{x\} \,\big|\, \mathcal{H} \text{ shatters } S\big) \ge \gamma\Big) = \hat{\Delta}^{(k)}_m(\mathcal{W}_1, \mathcal{W}_2, \mathcal{H}). \tag{14} \]
The quantities $\hat{\Delta}^{(k)}_m(x, \mathcal{W}_2, \mathcal{H})$, $\hat{\Gamma}^{(k)}_m(x, y, \mathcal{W}_2, \mathcal{H})$, and $\hat{\Delta}^{(k)}_m(\mathcal{W}_1, \mathcal{W}_2, \mathcal{H})$ are specified as follows.

For $k = 1$, $\hat{\Gamma}^{(1)}_m(x, y, \mathcal{W}_2, \mathcal{H})$ is simply an indicator for whether every $h \in \mathcal{H}$ has $h(x) = y$, while $\hat{\Delta}^{(1)}_m(x, \mathcal{W}_2, \mathcal{H})$ is an indicator for whether $x \in \mathrm{DIS}(\mathcal{H})$. Formally, they are defined as follows:
\[ \hat{\Gamma}^{(1)}_m(x, y, \mathcal{W}_2, \mathcal{H}) = \mathbb{1}_{\bigcap_{h \in \mathcal{H}} \{h(x)\}}(y), \qquad \hat{\Delta}^{(1)}_m(x, \mathcal{W}_2, \mathcal{H}) = \mathbb{1}_{\mathrm{DIS}(\mathcal{H})}(x). \]
For $k \ge 2$, we first define
\[ M^{(k)}_m(\mathcal{H}) = \max\left\{ 1, \sum_{i=1}^{m^3} \mathbb{1}_{\mathcal{S}^{k-1}(\mathcal{H})}\big(S^{(k)}_i\big) \right\}. \]
Then we take the following definitions for $\hat{\Gamma}^{(k)}$ and $\hat{\Delta}^{(k)}$:
\[ \hat{\Gamma}^{(k)}_m(x, y, \mathcal{W}_2, \mathcal{H}) = \frac{1}{M^{(k)}_m(\mathcal{H})} \sum_{i=1}^{m^3} \mathbb{1}_{\bar{\mathcal{S}}^{k-1}(\mathcal{H}[(x, -y)])}\big(S^{(k)}_i\big)\, \mathbb{1}_{\mathcal{S}^{k-1}(\mathcal{H})}\big(S^{(k)}_i\big), \tag{15} \]
\[ \hat{\Delta}^{(k)}_m(x, \mathcal{W}_2, \mathcal{H}) = \frac{1}{M^{(k)}_m(\mathcal{H})} \sum_{i=1}^{m^3} \mathbb{1}_{\mathcal{S}^{k}(\mathcal{H})}\big(S^{(k)}_i \cup \{x\}\big). \tag{16} \]
For the remaining estimator, for any $k$ we generally define
\[ \hat{\Delta}^{(k)}_m(\mathcal{W}_1, \mathcal{W}_2, \mathcal{H}) = \frac{2}{m} + \frac{1}{m^3} \sum_{i=1}^{m^3} \mathbb{1}_{[\gamma/4, \infty)}\big(\hat{\Delta}^{(k)}_m(w_i, \mathcal{W}_2, \mathcal{H})\big). \]
The above definitions will be used in the proofs below. However, there are certainly viable alternative definitions one can consider, some of which may have interesting theoretical properties.
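For concreteness, the $k \ge 2$ estimators in (15) and (16) can be sketched as empirical frequencies over the size-$(k-1)$ blocks $S^{(k)}_i$ of $\mathcal{W}_2$, given an abstract shattering predicate. Everything here beyond the formulas themselves is an illustrative assumption: the toy class of threshold classifiers, the brute-force `shatters` test, and all names are ours, not the paper's.

```python
from itertools import product

# Toy class: threshold classifiers h_t(x) = +1 if x >= t else -1,
# represented by their thresholds t. (Illustrative only.)
def predict(t, x):
    return 1 if x >= t else -1

def shatters(thresholds, S):
    # H shatters S iff every labeling of S is realized by some h in H.
    if not S:
        return bool(thresholds)
    return all(any(all(predict(t, x) == y for x, y in zip(S, labels))
                   for t in thresholds)
               for labels in product([-1, 1], repeat=len(S)))

def restrict(thresholds, x, y):
    # H[(x, y)]: the classifiers in H that label x as y.
    return [t for t in thresholds if predict(t, x) == y]

def gamma_delta_hat(thresholds, x, y, blocks):
    # Empirical versions of (15) and (16): M counts blocks shattered by H;
    # Gamma-hat is the fraction of those additionally NOT shattered by
    # H[(x, -y)]; Delta-hat is the count of blocks S with S u {x} shattered
    # by H, normalized by the same M.
    shattered = [S for S in blocks if shatters(thresholds, S)]
    M = max(1, len(shattered))
    gamma = sum(not shatters(restrict(thresholds, x, -y), S)
                for S in shattered) / M
    delta = sum(shatters(thresholds, S + [x]) for S in blocks) / M
    return gamma, delta

H = [0.2, 0.5, 0.8]                    # a hypothetical version space
blocks = [[0.1], [0.3], [0.6], [0.9]]  # k = 2, so blocks have size k - 1 = 1
g, d = gamma_delta_hat(H, 0.4, 1, blocks)
assert (g, d) == (0.5, 0.0)  # two shattered singletons; no shattered pair
```

Since thresholds have VC dimension 1, no block extended by $x$ is ever shattered here, so $\hat{\Delta}$ is zero, matching the intuition that $\hat{\Delta}^{(k)}$ detects $k$-dimensional shatterable structure near $x$.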
In general, one has the same sorts of trade-offs present whenever estimating a conditional probability. For instance, we could replace "$m^3$" in (15) and (16) by $\min\big\{\ell \in \mathbb{N} : M^{(k)}_\ell(\mathcal{H}) = m^3\big\}$, and then normalize by $m^3$ instead of $M^{(k)}_m(\mathcal{H})$; this would give us $m^3$ samples from the conditional distribution with which to estimate the conditional probability. The advantages of this approach would be its simplicity or elegance, and possibly some improvement in the constant factors in the label complexity bounds below. On the other hand, the drawback of this alternative definition would be that we do not know a priori how many unlabeled samples we will need to process in order to calculate it; indeed, for some values of $k$ and $\mathcal{H}$, we expect $\mathcal{P}^{k-1}\big(\mathcal{S}^{k-1}(\mathcal{H})\big) = 0$, so that $M^{(k)}_\ell(\mathcal{H})$ is bounded, and we might technically need to examine the entire sequence to distinguish this case from the case of very small $\mathcal{P}^{k-1}\big(\mathcal{S}^{k-1}(\mathcal{H})\big)$. Of course, these practical issues can be addressed with small modifications, but only at the expense of complicating the analysis, thus losing the elegance factor. For these reasons, we have opted for the slightly looser and less elegant, but more practical, definitions above in (15) and (16).

B.2 Proof of Theorem 6

At a high level, the structure of the proof is the following. The primary components of the proof are three lemmas: 34, 37, and 38. Setting aside, for a moment, the fact that we are using the $\hat{\mathcal{P}}_m$ estimators rather than the actual probability values they estimate, Lemma 38 indicates that the number of data points in $\mathcal{L}_{\tilde{d}_f}$ grows superlinearly in $n$ (the number of label requests), while Lemma 37 guarantees that the labels of these points are correct, and Lemma 34 tells us that the classifier returned in the end is never much worse than $\mathcal{A}_p\big(\mathcal{L}_{\tilde{d}_f}\big)$.
These three factors combine to prove the result. The rest of the proof is composed of supporting lemmas and details regarding the $\hat{\mathcal{P}}_m$ estimators. Specifically, Lemmas 35 and 36 serve a supporting role, with the purpose of showing that the set of $V$-shatterable $k$-tuples converges to the $k$-dimensional shatter core (up to probability-zero differences). The other lemmas below (39–45) are needed primarily to extend the above basic idea to the actual scenario where the $\hat{\mathcal{P}}_m$ estimators are used as surrogates for the probability values. Additionally, a sub-case of Lemma 45 is needed in order to guarantee the label request budget will not be reached prematurely. Again, in many cases we prove a more general lemma than is required for its use in the proof of Theorem 6; these more general results will be needed in subsequent proofs: namely, in the proofs of Theorem 16 and Lemma 26.

We begin with a lemma concerning the ActiveSelect subroutine.

Lemma 34 For any $k^*, M, N \in \mathbb{N}$ with $k^* \le N$, and $N$ classifiers $\{h_1, h_2, \ldots, h_N\}$ (themselves possibly random variables, independent from $\{X_M, X_{M+1}, \ldots\}$), ActiveSelect$(\{h_1, h_2, \ldots, h_N\}, m, \{X_M, X_{M+1}, \ldots\})$ makes at most $m$ label requests, and if $h_{\hat{k}}$ is the classifier it outputs, then with probability at least $1 - eN \cdot \exp\{-m/(72 k^* N \ln(eN))\}$, we have $\mathrm{er}\big(h_{\hat{k}}\big) \le 2\,\mathrm{er}(h_{k^*})$. ⋄

Proof This proof is essentially identical to a similar result of Balcan, Hanneke, and Vaughan (2010), but is included here for completeness. Let $M_k = \big\lfloor \frac{m}{k (N - k) \ln(eN)} \big\rfloor$. First note that the total number of label requests in ActiveSelect is at most $m$, since summing up the sizes of the batches of label requests made in all executions of Step 2 yields
\[ \sum_{j=1}^{N-1} \sum_{k=j+1}^{N} \left\lfloor \frac{m}{j (N - j) \ln(eN)} \right\rfloor \le \sum_{j=1}^{N-1} \frac{m}{j \ln(eN)} \le m. \]
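The budget accounting above — that the per-pair batch sizes $M_j = \lfloor m/(j(N-j)\ln(eN))\rfloor$, summed over all pairs, total at most $m$ — can be checked directly. A minimal numerical sketch (not the paper's code; the harmonic-sum bound $\sum_{j=1}^{N-1} 1/j \le \ln(eN)$ is what makes it work):

```python
import math

def total_label_requests(m: int, N: int) -> int:
    # Sum of batch sizes floor(m / (j (N - j) ln(eN))) over all pairs (j, k)
    # with 1 <= j < k <= N, as in Step 2 of ActiveSelect.
    return sum(math.floor(m / (j * (N - j) * math.log(math.e * N)))
               for j in range(1, N)
               for k in range(j + 1, N + 1))

# The harmonic-sum argument guarantees the total never exceeds m.
for N in [2, 3, 5, 10, 50]:
    for m in [1, 10, 100, 10_000]:
        assert total_label_requests(m, N) <= m
```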
Let $k^{**} = \operatorname*{argmin}_{k \in \{1, \ldots, k^*\}} \mathrm{er}(h_k)$. Now for any $j \in \{1, 2, \ldots, k^{**} - 1\}$ with $\mathcal{P}(x : h_j(x) \ne h_{k^{**}}(x)) > 0$, the law of large numbers implies that with probability one we will find at least $M_j$ examples remaining in the sequence for which $h_j(x) \ne h_{k^{**}}(x)$, and since $\mathrm{er}(h_{k^{**}} | \{x : h_j(x) \ne h_{k^{**}}(x)\}) \le 1/2$, Hoeffding's inequality implies that
\[ \mathcal{P}(m_{k^{**} j} > 7/12) \le \exp\{-M_j/72\} \le \exp\{1 - m/(72 k^* N \ln(eN))\}. \]
A union bound implies
\[ \mathcal{P}\Big(\max_{j < k^{**}} m_{k^{**} j} > 7/12\Big) \le k^{**} \cdot \exp\{1 - m/(72 k^* N \ln(eN))\}. \]
In particular, note that when $\max_{j < k^{**}} m_{k^{**} j} \le 7/12$, we have $\hat{k} \ge k^{**}$.

Now consider any $j > k^{**}$ with $\mathrm{er}(h_j) > 2\,\mathrm{er}(h_{k^{**}})$. In particular, this implies $\mathrm{er}(h_j | \{x : h_{k^{**}}(x) \ne h_j(x)\}) > 2/3$ and $\mathcal{P}(x : h_j(x) \ne h_{k^{**}}(x)) > 0$, which again means (with probability one) we will find at least $M_{k^{**}}$ examples in the sequence for which $h_j(x) \ne h_{k^{**}}(x)$. By Hoeffding's inequality, we have that
\[ \mathcal{P}(m_{j k^{**}} \le 7/12) \le \exp\{-M_{k^{**}}/72\} \le \exp\{1 - m/(72 k^* N \ln(eN))\}. \]
By a union bound, we have that
\[ \mathcal{P}\big(\exists j > k^{**} : \mathrm{er}(h_j) > 2\,\mathrm{er}(h_{k^{**}}) \text{ and } m_{j k^{**}} \le 7/12\big) \le (N - k^{**}) \cdot \exp\{1 - m/(72 k^* N \ln(eN))\}. \]
In particular, when $\hat{k} \ge k^{**}$ and $m_{j k^{**}} > 7/12$ for all $j > k^{**}$ with $\mathrm{er}(h_j) > 2\,\mathrm{er}(h_{k^{**}})$, it must be true that $\mathrm{er}\big(h_{\hat{k}}\big) \le 2\,\mathrm{er}(h_{k^{**}}) \le 2\,\mathrm{er}(h_{k^*})$.

So, by a union bound, with probability $\ge 1 - eN \cdot \exp\{-m/(72 k^* N \ln(eN))\}$, the $\hat{k}$ chosen by ActiveSelect has $\mathrm{er}\big(h_{\hat{k}}\big) \le 2\,\mathrm{er}(h_{k^*})$.

The next two lemmas describe the limiting behavior of $\mathcal{S}^k(V^\star_m)$. In particular, we see that its limiting value is precisely $\partial^k_{\mathbb{C}} f$ (up to probability-zero differences). Lemma 35 establishes that $\mathcal{S}^k(V^\star_m)$ does not decrease below $\partial^k_{\mathbb{C}} f$ (except for a probability-zero set), and Lemma 36 establishes that its limit is not larger than $\partial^k_{\mathbb{C}} f$ (again, except for a probability-zero set).
Lemma 35 There is an event $H'$ with $\mathcal{P}(H') = 1$ such that, on $H'$, $\forall m \in \mathbb{N}$, $\forall k \in \{0, \ldots, \tilde{d}_f - 1\}$, for any $\mathcal{H}$ with $V^\star_m \subseteq \mathcal{H} \subseteq \mathbb{C}$,
\[ \mathcal{P}^k\big(\mathcal{S}^k(\mathcal{H}) \,\big|\, \partial^k_{\mathbb{C}} f\big) = \mathcal{P}^k\big(\partial^k_{\mathcal{H}} f \,\big|\, \partial^k_{\mathbb{C}} f\big) = 1, \quad \text{and} \quad \forall i \in \mathbb{N},\ \mathbb{1}_{\partial^k_{\mathcal{H}} f}\big(S^{(k+1)}_i\big) = \mathbb{1}_{\partial^k_{\mathbb{C}} f}\big(S^{(k+1)}_i\big). \]
Also, on $H'$, every such $\mathcal{H}$ has $\mathcal{P}^k\big(\partial^k_{\mathcal{H}} f\big) = \mathcal{P}^k\big(\partial^k_{\mathbb{C}} f\big)$, and $M^{(k+1)}_\ell(\mathcal{H}) \to \infty$ as $\ell \to \infty$. ⋄

Proof We will show the first claim for the set $V^\star_m$, and the result will then hold for $\mathcal{H}$ by monotonicity. In particular, we will show this for any fixed $k \in \{0, \ldots, \tilde{d}_f - 1\}$ and $m \in \mathbb{N}$, and the existence of $H'$ then holds by a union bound.

Fix any set $S \in \partial^k_{\mathbb{C}} f$. Suppose $\mathrm{B}_{V^\star_m}(f, r)$ does not shatter $S$ for some $r > 0$. There is an infinite sequence of sets $\big\{\{h^{(i)}_1, h^{(i)}_2, \ldots, h^{(i)}_{2^k}\}\big\}_i$ with $\forall j \le 2^k$, $\mathcal{P}(x : h^{(i)}_j(x) \ne f(x)) \downarrow 0$, such that each $\{h^{(i)}_1, \ldots, h^{(i)}_{2^k}\} \subseteq \mathrm{B}(f, r)$ and shatters $S$. Since $\mathrm{B}_{V^\star_m}(f, r)$ does not shatter $S$,
\[ 1 = \inf_i \mathbb{1}\big[\exists j : h^{(i)}_j \notin \mathrm{B}_{V^\star_m}(f, r)\big] = \inf_i \mathbb{1}\big[\exists j : h^{(i)}_j(\mathcal{Z}_m) \ne f(\mathcal{Z}_m)\big]. \]
But
\[ \mathcal{P}\Big(\inf_i \mathbb{1}\big[\exists j : h^{(i)}_j(\mathcal{Z}_m) \ne f(\mathcal{Z}_m)\big] = 1\Big) \le \inf_i \mathcal{P}\big(\exists j : h^{(i)}_j(\mathcal{Z}_m) \ne f(\mathcal{Z}_m)\big) \le \lim_{i \to \infty} \sum_{j \le 2^k} m\, \mathcal{P}\big(x : h^{(i)}_j(x) \ne f(x)\big) = \sum_{j \le 2^k} m \lim_{i \to \infty} \mathcal{P}\big(x : h^{(i)}_j(x) \ne f(x)\big) = 0, \]
where the second inequality follows from the union bound. Therefore, $\forall r > 0$, $\mathcal{P}\big(S \notin \mathcal{S}^k(\mathrm{B}_{V^\star_m}(f, r))\big) = 0$. Furthermore, since $\bar{\mathcal{S}}^k\big(\mathrm{B}_{V^\star_m}(f, r)\big)$ is monotonic in $r$, the dominated convergence theorem gives us that
\[ \mathcal{P}\big(S \notin \partial^k_{V^\star_m} f\big) = \mathbb{E}\Big[\lim_{r \to 0} \mathbb{1}_{\bar{\mathcal{S}}^k(\mathrm{B}_{V^\star_m}(f, r))}(S)\Big] = \lim_{r \to 0} \mathcal{P}\big(S \notin \mathcal{S}^k(\mathrm{B}_{V^\star_m}(f, r))\big) = 0. \]
This implies that (letting $S \sim \mathcal{P}^k$ be independent from $V^\star_m$)
\begin{align*}
\mathcal{P}\Big(\mathcal{P}^k\big(\bar{\partial}^k_{V^\star_m} f \,\big|\, \partial^k_{\mathbb{C}} f\big) > 0\Big) &= \mathcal{P}\Big(\mathcal{P}^k\big(\bar{\partial}^k_{V^\star_m} f \cap \partial^k_{\mathbb{C}} f\big) > 0\Big) = \lim_{\xi \to 0} \mathcal{P}\Big(\mathcal{P}^k\big(\bar{\partial}^k_{V^\star_m} f \cap \partial^k_{\mathbb{C}} f\big) > \xi\Big) \\
&\le \lim_{\xi \to 0} \frac{1}{\xi} \mathbb{E}\Big[\mathcal{P}^k\big(\bar{\partial}^k_{V^\star_m} f \cap \partial^k_{\mathbb{C}} f\big)\Big] \quad \text{(Markov)} \\
&= \lim_{\xi \to 0} \frac{1}{\xi} \mathbb{E}\Big[\mathbb{1}_{\partial^k_{\mathbb{C}} f}(S)\, \mathcal{P}\big(S \notin \partial^k_{V^\star_m} f \,\big|\, S\big)\Big] \quad \text{(Fubini)} \\
&= \lim_{\xi \to 0} 0 = 0.
\end{align*}
This establishes the first claim for $V^\star_m$, on an event of probability $1$, and monotonicity extends the claim to any $\mathcal{H} \supseteq V^\star_m$. Also note that, on this event,
\[ \mathcal{P}^k\big(\partial^k_{\mathcal{H}} f\big) \ge \mathcal{P}^k\big(\partial^k_{\mathcal{H}} f \cap \partial^k_{\mathbb{C}} f\big) = \mathcal{P}^k\big(\partial^k_{\mathcal{H}} f \,\big|\, \partial^k_{\mathbb{C}} f\big)\, \mathcal{P}^k\big(\partial^k_{\mathbb{C}} f\big) = \mathcal{P}^k\big(\partial^k_{\mathbb{C}} f\big), \]
where the last equality follows from the first claim. Noting that for $\mathcal{H} \subseteq \mathbb{C}$, $\partial^k_{\mathcal{H}} f \subseteq \partial^k_{\mathbb{C}} f$, we must have $\mathcal{P}^k\big(\partial^k_{\mathcal{H}} f\big) = \mathcal{P}^k\big(\partial^k_{\mathbb{C}} f\big)$. This establishes the third claim.

From the first claim, for any given value of $i \in \mathbb{N}$ the second claim holds for $S^{(k+1)}_i$ (with $\mathcal{H} = V^\star_m$) on an additional event of probability $1$; taking a union bound over all $i \in \mathbb{N}$ extends this claim to every $S^{(k+1)}_i$ on an event of probability $1$. Monotonicity then implies
\[ \mathbb{1}_{\partial^k_{\mathbb{C}} f}\big(S^{(k+1)}_i\big) = \mathbb{1}_{\partial^k_{V^\star_m} f}\big(S^{(k+1)}_i\big) \le \mathbb{1}_{\partial^k_{\mathcal{H}} f}\big(S^{(k+1)}_i\big) \le \mathbb{1}_{\partial^k_{\mathbb{C}} f}\big(S^{(k+1)}_i\big), \]
extending the result to general $\mathcal{H}$. Also, as $k < \tilde{d}_f$, we know $\mathcal{P}^k\big(\partial^k_{\mathbb{C}} f\big) > 0$, and since we also know $V^\star_m$ is independent from $\mathcal{W}_2$, the strong law of large numbers implies the final claim (for $V^\star_m$) on an additional event of probability $1$; again, monotonicity extends this claim to any $\mathcal{H} \supseteq V^\star_m$. Intersecting the above events over values $m \in \mathbb{N}$ and $k < \tilde{d}_f$ gives the event $H'$, and as each of the above events has probability $1$ and there are countably many such events, a union bound implies $\mathcal{P}(H') = 1$.

Note that one specific implication of Lemma 35, obtained by taking $k = 0$, is that on $H'$, $V^\star_m \ne \emptyset$ (even if $f \in \mathrm{cl}(\mathbb{C}) \setminus \mathbb{C}$).
This is because, for $f \in \mathrm{cl}(\mathbb{C})$, we have $\partial^0_{\mathbb{C}} f = \mathcal{X}^0$ so that $\mathcal{P}^0\big(\partial^0_{\mathbb{C}} f\big) = 1$, which means $\mathcal{P}^0\big(\partial^0_{V^\star_m} f\big) = 1$ (on $H'$), so that we must have $\partial^0_{V^\star_m} f = \mathcal{X}^0$, which implies $V^\star_m \ne \emptyset$. In particular, this also means $f \in \mathrm{cl}(V^\star_m)$.

Lemma 36 There is a monotonic function $q(r) = o(1)$ (as $r \to 0$) such that, on event $H'$, for any $k \in \{0, \ldots, \tilde{d}_f - 1\}$, $m \in \mathbb{N}$, $r > 0$, and set $\mathcal{H}$ such that $V^\star_m \subseteq \mathcal{H} \subseteq \mathrm{B}(f, r)$,
\[ \mathcal{P}^k\big(\bar{\partial}^k_{\mathbb{C}} f \,\big|\, \mathcal{S}^k(\mathcal{H})\big) \le q(r). \]
In particular, for $\tau \in \mathbb{N}$ and $\delta > 0$, on $H_\tau(\delta) \cap H'$ (defined above), every $m \ge \tau$ and $k \in \{0, \ldots, \tilde{d}_f - 1\}$ has $\mathcal{P}^k\big(\bar{\partial}^k_{\mathbb{C}} f \,\big|\, \mathcal{S}^k(V^\star_m)\big) \le q(\phi(\tau; \delta))$. ⋄

Proof Fix any $k \in \{0, \ldots, \tilde{d}_f - 1\}$. By Lemma 35, we know that on event $H'$,
\[ \mathcal{P}^k\big(\bar{\partial}^k_{\mathbb{C}} f \,\big|\, \mathcal{S}^k(\mathcal{H})\big) = \frac{\mathcal{P}^k\big(\bar{\partial}^k_{\mathbb{C}} f \cap \mathcal{S}^k(\mathcal{H})\big)}{\mathcal{P}^k(\mathcal{S}^k(\mathcal{H}))} \le \frac{\mathcal{P}^k\big(\bar{\partial}^k_{\mathbb{C}} f \cap \mathcal{S}^k(\mathcal{H})\big)}{\mathcal{P}^k\big(\partial^k_{\mathcal{H}} f\big)} = \frac{\mathcal{P}^k\big(\bar{\partial}^k_{\mathbb{C}} f \cap \mathcal{S}^k(\mathcal{H})\big)}{\mathcal{P}^k\big(\partial^k_{\mathbb{C}} f\big)} \le \frac{\mathcal{P}^k\big(\bar{\partial}^k_{\mathbb{C}} f \cap \mathcal{S}^k(\mathrm{B}(f, r))\big)}{\mathcal{P}^k\big(\partial^k_{\mathbb{C}} f\big)}. \]
Define $q_k(r)$ as this latter quantity. Since $\mathcal{P}^k\big(\bar{\partial}^k_{\mathbb{C}} f \cap \mathcal{S}^k(\mathrm{B}(f, r))\big)$ is monotonic in $r$,
\[ \lim_{r \to 0} \frac{\mathcal{P}^k\big(\bar{\partial}^k_{\mathbb{C}} f \cap \mathcal{S}^k(\mathrm{B}(f, r))\big)}{\mathcal{P}^k\big(\partial^k_{\mathbb{C}} f\big)} = \frac{\mathcal{P}^k\big(\bar{\partial}^k_{\mathbb{C}} f \cap \lim_{r \to 0} \mathcal{S}^k(\mathrm{B}(f, r))\big)}{\mathcal{P}^k\big(\partial^k_{\mathbb{C}} f\big)} = \frac{\mathcal{P}^k\big(\bar{\partial}^k_{\mathbb{C}} f \cap \partial^k_{\mathbb{C}} f\big)}{\mathcal{P}^k\big(\partial^k_{\mathbb{C}} f\big)} = 0. \]
This proves $q_k(r) = o(1)$. Defining $q(r) = \max\big\{q_k(r) : k \in \{0, 1, \ldots, \tilde{d}_f - 1\}\big\} = o(1)$ completes the proof of the first claim. For the final claim, simply recall that by Lemma 29, on $H_\tau(\delta)$, every $m \ge \tau$ has $V^\star_m \subseteq V^\star_\tau \subseteq \mathrm{B}(f, \phi(\tau; \delta))$.

Lemma 37 For $\zeta \in (0, 1)$, define $r_\zeta = \sup\{r \in (0, 1) : q(r) < \zeta\}/2$. On $H'$, $\forall k \in \{0, \ldots, \tilde{d}_f - 1\}$, $\forall \zeta \in (0, 1)$, $\forall m \in \mathbb{N}$, for any set $\mathcal{H}$ such that $V^\star_m \subseteq \mathcal{H} \subseteq \mathrm{B}(f, r_\zeta)$,
\[ \mathcal{P}\Big(x : \mathcal{P}^k\big(\bar{\mathcal{S}}^k(\mathcal{H}[(x, f(x))]) \,\big|\, \mathcal{S}^k(\mathcal{H})\big) > \zeta\Big) = \mathcal{P}\Big(x : \mathcal{P}^k\big(\bar{\mathcal{S}}^k(\mathcal{H}[(x, f(x))]) \,\big|\, \partial^k_{\mathcal{H}} f\big) > \zeta\Big) = 0. \]
(17)
In particular, for $\delta \in (0, 1)$, defining $\tau(\zeta; \delta) = \min\big\{\tau \in \mathbb{N} : \sup_{m \ge \tau} \phi(m; \delta) \le r_\zeta\big\}$, for any $\tau \ge \tau(\zeta; \delta)$, and any $m \ge \tau$, on $H_\tau(\delta) \cap H'$, (17) holds for $\mathcal{H} = V^\star_m$. ⋄

Proof Fix $k, m, \mathcal{H}$ as described above, and suppose $q = \mathcal{P}^k\big(\bar{\partial}^k_{\mathbb{C}} f \,\big|\, \mathcal{S}^k(\mathcal{H})\big) < \zeta$; by Lemma 36, this happens on $H'$. Since $\partial^k_{\mathcal{H}} f \subseteq \mathcal{S}^k(\mathcal{H})$, we have that $\forall x \in \mathcal{X}$,
\[ \mathcal{P}^k\big(\bar{\mathcal{S}}^k(\mathcal{H}[(x, f(x))]) \,\big|\, \mathcal{S}^k(\mathcal{H})\big) = \mathcal{P}^k\big(\bar{\mathcal{S}}^k(\mathcal{H}[(x, f(x))]) \,\big|\, \partial^k_{\mathcal{H}} f\big)\, \mathcal{P}^k\big(\partial^k_{\mathcal{H}} f \,\big|\, \mathcal{S}^k(\mathcal{H})\big) + \mathcal{P}^k\big(\bar{\mathcal{S}}^k(\mathcal{H}[(x, f(x))]) \,\big|\, \mathcal{S}^k(\mathcal{H}) \cap \bar{\partial}^k_{\mathcal{H}} f\big)\, \mathcal{P}^k\big(\bar{\partial}^k_{\mathcal{H}} f \,\big|\, \mathcal{S}^k(\mathcal{H})\big). \]
Since all probability values are bounded by $1$, we have
\[ \mathcal{P}^k\big(\bar{\mathcal{S}}^k(\mathcal{H}[(x, f(x))]) \,\big|\, \mathcal{S}^k(\mathcal{H})\big) \le \mathcal{P}^k\big(\bar{\mathcal{S}}^k(\mathcal{H}[(x, f(x))]) \,\big|\, \partial^k_{\mathcal{H}} f\big) + \mathcal{P}^k\big(\bar{\partial}^k_{\mathcal{H}} f \,\big|\, \mathcal{S}^k(\mathcal{H})\big). \tag{18} \]
Isolating the right-most term in (18), by basic properties of probabilities we have
\[ \mathcal{P}^k\big(\bar{\partial}^k_{\mathcal{H}} f \,\big|\, \mathcal{S}^k(\mathcal{H})\big) = \mathcal{P}^k\big(\bar{\partial}^k_{\mathcal{H}} f \,\big|\, \mathcal{S}^k(\mathcal{H}) \cap \bar{\partial}^k_{\mathbb{C}} f\big)\, \mathcal{P}^k\big(\bar{\partial}^k_{\mathbb{C}} f \,\big|\, \mathcal{S}^k(\mathcal{H})\big) + \mathcal{P}^k\big(\bar{\partial}^k_{\mathcal{H}} f \,\big|\, \mathcal{S}^k(\mathcal{H}) \cap \partial^k_{\mathbb{C}} f\big)\, \mathcal{P}^k\big(\partial^k_{\mathbb{C}} f \,\big|\, \mathcal{S}^k(\mathcal{H})\big) \le \mathcal{P}^k\big(\bar{\partial}^k_{\mathbb{C}} f \,\big|\, \mathcal{S}^k(\mathcal{H})\big) + \mathcal{P}^k\big(\bar{\partial}^k_{\mathcal{H}} f \,\big|\, \mathcal{S}^k(\mathcal{H}) \cap \partial^k_{\mathbb{C}} f\big). \tag{19} \]
By assumption, the left term in (19) equals $q$. Examining the right term in (19), we see that
\[ \mathcal{P}^k\big(\bar{\partial}^k_{\mathcal{H}} f \,\big|\, \mathcal{S}^k(\mathcal{H}) \cap \partial^k_{\mathbb{C}} f\big) = \mathcal{P}^k\big(\mathcal{S}^k(\mathcal{H}) \cap \bar{\partial}^k_{\mathcal{H}} f \,\big|\, \partial^k_{\mathbb{C}} f\big) \big/ \mathcal{P}^k\big(\mathcal{S}^k(\mathcal{H}) \,\big|\, \partial^k_{\mathbb{C}} f\big) \le \mathcal{P}^k\big(\bar{\partial}^k_{\mathcal{H}} f \,\big|\, \partial^k_{\mathbb{C}} f\big) \big/ \mathcal{P}^k\big(\partial^k_{\mathcal{H}} f \,\big|\, \partial^k_{\mathbb{C}} f\big). \tag{20} \]
By Lemma 35, on $H'$ the denominator in (20) is $1$ and the numerator is $0$. Thus, combining this fact with (18) and (19), we have that on $H'$,
\[ \mathcal{P}\Big(x : \mathcal{P}^k\big(\bar{\mathcal{S}}^k(\mathcal{H}[(x, f(x))]) \,\big|\, \mathcal{S}^k(\mathcal{H})\big) > \zeta\Big) \le \mathcal{P}\Big(x : \mathcal{P}^k\big(\bar{\mathcal{S}}^k(\mathcal{H}[(x, f(x))]) \,\big|\, \partial^k_{\mathcal{H}} f\big) > \zeta - q\Big). \]
(21)
Note that proving the right side of (21) equals zero will suffice to establish the result, since it upper bounds both the first expression of (17) (as just established) and the second expression of (17) (by monotonicity of measures). Letting $X \sim \mathcal{P}$ be independent from the other random variables ($\mathcal{Z}$, $\mathcal{W}_1$, $\mathcal{W}_2$), by Markov's inequality, the right side of (21) is at most
\[ \frac{1}{\zeta - q} \mathbb{E}\Big[\mathcal{P}^k\big(\bar{\mathcal{S}}^k(\mathcal{H}[(X, f(X))]) \,\big|\, \partial^k_{\mathcal{H}} f\big) \,\Big|\, \mathcal{H}\Big] = \frac{\mathbb{E}\Big[\mathcal{P}^k\big(\bar{\mathcal{S}}^k(\mathcal{H}[(X, f(X))]) \cap \partial^k_{\mathcal{H}} f\big) \,\Big|\, \mathcal{H}\Big]}{(\zeta - q)\, \mathcal{P}^k\big(\partial^k_{\mathcal{H}} f\big)}, \]
and by Fubini's theorem, this is (letting $S \sim \mathcal{P}^k$ be independent from the other random variables)
\[ \frac{\mathbb{E}\Big[\mathbb{1}_{\partial^k_{\mathcal{H}} f}(S)\, \mathcal{P}\big(x : S \notin \mathcal{S}^k(\mathcal{H}[(x, f(x))])\big) \,\Big|\, \mathcal{H}\Big]}{(\zeta - q)\, \mathcal{P}^k\big(\partial^k_{\mathcal{H}} f\big)}. \]
Lemma 35 implies this equals
\[ \frac{\mathbb{E}\Big[\mathbb{1}_{\partial^k_{\mathcal{H}} f}(S)\, \mathcal{P}\big(x : S \notin \mathcal{S}^k(\mathcal{H}[(x, f(x))])\big) \,\Big|\, \mathcal{H}\Big]}{(\zeta - q)\, \mathcal{P}^k\big(\partial^k_{\mathbb{C}} f\big)}. \tag{22} \]
For any fixed $S \in \partial^k_{\mathcal{H}} f$, there is an infinite sequence of sets $\big\{\{h^{(i)}_1, h^{(i)}_2, \ldots, h^{(i)}_{2^k}\}\big\}_{i \in \mathbb{N}}$ with $\forall j \le 2^k$, $\mathcal{P}\big(x : h^{(i)}_j(x) \ne f(x)\big) \downarrow 0$, such that each $\{h^{(i)}_1, \ldots, h^{(i)}_{2^k}\} \subseteq \mathcal{H}$ and shatters $S$. If $\mathcal{H}[(x, f(x))]$ does not shatter $S$, then
\[ 1 = \inf_i \mathbb{1}\big[\exists j : h^{(i)}_j \notin \mathcal{H}[(x, f(x))]\big] = \inf_i \mathbb{1}\big[\exists j : h^{(i)}_j(x) \ne f(x)\big]. \]
In particular,
\[ \mathcal{P}\big(x : S \notin \mathcal{S}^k(\mathcal{H}[(x, f(x))])\big) \le \mathcal{P}\Big(x : \inf_i \mathbb{1}\big[\exists j : h^{(i)}_j(x) \ne f(x)\big] = 1\Big) = \mathcal{P}\left(\bigcap_i \big\{x : \exists j : h^{(i)}_j(x) \ne f(x)\big\}\right) \le \inf_i \mathcal{P}\big(x : \exists j \text{ s.t. } h^{(i)}_j(x) \ne f(x)\big) \le \lim_{i \to \infty} \sum_{j \le 2^k} \mathcal{P}\big(x : h^{(i)}_j(x) \ne f(x)\big) = \sum_{j \le 2^k} \lim_{i \to \infty} \mathcal{P}\big(x : h^{(i)}_j(x) \ne f(x)\big) = 0. \]
Thus (22) is zero, which establishes the result. The final claim is then implied by Lemma 29 and monotonicity of $V^\star_m$ in $m$: that is, on $H_\tau(\delta)$, $V^\star_m \subseteq V^\star_\tau \subseteq \mathrm{B}(f, \phi(\tau; \delta)) \subseteq \mathrm{B}(f, r_\zeta)$.
Lemma 38 For any $\zeta \in (0, 1)$, there are values $\big\{\Delta^{(\zeta)}_n(\varepsilon) : n \in \mathbb{N}, \varepsilon \in (0, 1)\big\}$ such that, for any $n \in \mathbb{N}$ and $\varepsilon > 0$, on event $H_{\lfloor n/3 \rfloor}(\varepsilon/2) \cap H'$, letting $V = V^\star_{\lfloor n/3 \rfloor}$,
\[ \mathcal{P}\Big(x : \mathcal{P}^{\tilde{d}_f - 1}\big(S \in \mathcal{X}^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\big|\, \mathcal{S}^{\tilde{d}_f - 1}(V)\big) \ge \zeta\Big) \le \Delta^{(\zeta)}_n(\varepsilon), \]
and for any $\mathbb{N}$-valued $N(\varepsilon) = \omega(\log(1/\varepsilon))$, $\Delta^{(\zeta)}_{N(\varepsilon)}(\varepsilon) = o(1)$. ⋄

Proof Throughout, we suppose the event $H_{\lfloor n/3 \rfloor}(\varepsilon/2) \cap H'$, and fix some $\zeta \in (0, 1)$. We have $\forall x$,
\begin{align*}
&\mathcal{P}^{\tilde{d}_f - 1}\big(S \in \mathcal{X}^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\big|\, \mathcal{S}^{\tilde{d}_f - 1}(V)\big) \\
&= \mathcal{P}^{\tilde{d}_f - 1}\big(S : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\big|\, \mathcal{S}^{\tilde{d}_f - 1}(V) \cap \partial^{\tilde{d}_f - 1}_{\mathbb{C}} f\big)\, \mathcal{P}^{\tilde{d}_f - 1}\big(\partial^{\tilde{d}_f - 1}_{\mathbb{C}} f \,\big|\, \mathcal{S}^{\tilde{d}_f - 1}(V)\big) \\
&\quad + \mathcal{P}^{\tilde{d}_f - 1}\big(S : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\big|\, \mathcal{S}^{\tilde{d}_f - 1}(V) \cap \bar{\partial}^{\tilde{d}_f - 1}_{\mathbb{C}} f\big)\, \mathcal{P}^{\tilde{d}_f - 1}\big(\bar{\partial}^{\tilde{d}_f - 1}_{\mathbb{C}} f \,\big|\, \mathcal{S}^{\tilde{d}_f - 1}(V)\big) \\
&\le \mathcal{P}^{\tilde{d}_f - 1}\big(S : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\big|\, \mathcal{S}^{\tilde{d}_f - 1}(V) \cap \partial^{\tilde{d}_f - 1}_{\mathbb{C}} f\big) + \mathcal{P}^{\tilde{d}_f - 1}\big(\bar{\partial}^{\tilde{d}_f - 1}_{\mathbb{C}} f \,\big|\, \mathcal{S}^{\tilde{d}_f - 1}(V)\big). \tag{23}
\end{align*}
By Lemma 35, the left term in (23) equals
\[ \mathcal{P}^{\tilde{d}_f - 1}\big(S : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\big|\, \mathcal{S}^{\tilde{d}_f - 1}(V) \cap \partial^{\tilde{d}_f - 1}_{\mathbb{C}} f\big)\, \mathcal{P}^{\tilde{d}_f - 1}\big(\mathcal{S}^{\tilde{d}_f - 1}(V) \,\big|\, \partial^{\tilde{d}_f - 1}_{\mathbb{C}} f\big) = \mathcal{P}^{\tilde{d}_f - 1}\big(S : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\big|\, \partial^{\tilde{d}_f - 1}_{\mathbb{C}} f\big), \]
and by Lemma 36, the right term in (23) is at most $q(\phi(\lfloor n/3 \rfloor; \varepsilon/2))$. Thus, we have
\[ \mathcal{P}\Big(x : \mathcal{P}^{\tilde{d}_f - 1}\big(S : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\big|\, \mathcal{S}^{\tilde{d}_f - 1}(V)\big) \ge \zeta\Big) \le \mathcal{P}\Big(x : \mathcal{P}^{\tilde{d}_f - 1}\big(S : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\big|\, \partial^{\tilde{d}_f - 1}_{\mathbb{C}} f\big) \ge \zeta - q(\phi(\lfloor n/3 \rfloor; \varepsilon/2))\Big). \tag{24} \]
For $n < 3\tau(\zeta/2; \varepsilon/2)$ (for $\tau(\cdot;\cdot)$ defined in Lemma 37), we define $\Delta^{(\zeta)}_n(\varepsilon) = 1$.
Otherwise, suppose $n \ge 3\tau(\zeta/2; \varepsilon/2)$, so that $q(\phi(\lfloor n/3 \rfloor; \varepsilon/2)) < \zeta/2$, and thus (24) is at most
\[ \mathcal{P}\Big(x : \mathcal{P}^{\tilde{d}_f - 1}\big(S : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\big|\, \partial^{\tilde{d}_f - 1}_{\mathbb{C}} f\big) \ge \zeta/2\Big). \]
By Lemma 29, this is at most
\[ \mathcal{P}\Big(x : \mathcal{P}^{\tilde{d}_f - 1}\big(S : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(\mathrm{B}(f, \phi(\lfloor n/3 \rfloor; \varepsilon/2))) \,\big|\, \partial^{\tilde{d}_f - 1}_{\mathbb{C}} f\big) \ge \zeta/2\Big). \]
Letting $X \sim \mathcal{P}$, by Markov's inequality this is at most
\begin{align*}
&\frac{2}{\zeta} \mathbb{E}\Big[\mathcal{P}^{\tilde{d}_f - 1}\big(S : S \cup \{X\} \in \mathcal{S}^{\tilde{d}_f}(\mathrm{B}(f, \phi(\lfloor n/3 \rfloor; \varepsilon/2))) \,\big|\, \partial^{\tilde{d}_f - 1}_{\mathbb{C}} f\big)\Big] \\
&= \frac{2}{\zeta \tilde{\delta}_f} \mathcal{P}^{\tilde{d}_f}\Big(S \cup \{x\} \in \mathcal{X}^{\tilde{d}_f} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(\mathrm{B}(f, \phi(\lfloor n/3 \rfloor; \varepsilon/2))) \text{ and } S \in \partial^{\tilde{d}_f - 1}_{\mathbb{C}} f\Big) \le \frac{2}{\zeta \tilde{\delta}_f} \mathcal{P}^{\tilde{d}_f}\Big(\mathcal{S}^{\tilde{d}_f}(\mathrm{B}(f, \phi(\lfloor n/3 \rfloor; \varepsilon/2)))\Big). \tag{25}
\end{align*}
Thus, defining $\Delta^{(\zeta)}_n(\varepsilon)$ as (25) for $n \ge 3\tau(\zeta/2; \varepsilon/2)$ establishes the first claim.

It remains only to prove the second claim. Let $N(\varepsilon) = \omega(\log(1/\varepsilon))$. Since
\[ \tau(\zeta/2; \varepsilon/2) \le \left\lceil \frac{4}{r_{\zeta/2}} \left( d \ln\left(\frac{4e}{r_{\zeta/2}}\right) + \ln\left(\frac{4}{\varepsilon}\right) \right) \right\rceil = O(\log(1/\varepsilon)), \]
we have that for all sufficiently small $\varepsilon > 0$, $N(\varepsilon) \ge 3\tau(\zeta/2; \varepsilon/2)$, so that $\Delta^{(\zeta)}_{N(\varepsilon)}(\varepsilon)$ equals (25) (with $n = N(\varepsilon)$). Furthermore, since $\tilde{\delta}_f > 0$, $\mathcal{P}^{\tilde{d}_f}\big(\partial^{\tilde{d}_f}_{\mathbb{C}} f\big) = 0$, and $\phi(\lfloor N(\varepsilon)/3 \rfloor; \varepsilon/2) = o(1)$, by continuity of probability measures we know (25) is $o(1)$ when $n = N(\varepsilon)$, so that we generally have $\Delta^{(\zeta)}_{N(\varepsilon)}(\varepsilon) = o(1)$.

For any $m \in \mathbb{N}$, define $\tilde{M}(m) = m^3 \tilde{\delta}_f / 2$.

Lemma 39 There is a $(\mathbb{C}, \mathcal{P}, f)$-dependent constant $c^{(i)} \in (0, \infty)$ such that, for any $\tau \in \mathbb{N}$, there is an event $H^{(i)}_\tau \subseteq H'$ with $\mathcal{P}\big(H^{(i)}_\tau\big) \ge 1 - c^{(i)} \cdot \exp\big\{-\tilde{M}(\tau)/4\big\}$ such that, on $H^{(i)}_\tau$, if $\tilde{d}_f \ge 2$, then $\forall k \in \{2, \ldots, \tilde{d}_f\}$, $\forall m \ge \tau$, $\forall \ell \in \mathbb{N}$, for any set $\mathcal{H}$ such that $V^\star_\ell \subseteq \mathcal{H} \subseteq \mathbb{C}$, $M^{(k)}_m(\mathcal{H}) \ge \tilde{M}(m)$.
⋄

Proof On $H'$, Lemma 35 implies every $\mathbb{1}_{\mathcal{S}^{k-1}(\mathcal{H})}\big(S^{(k)}_i\big) \ge \mathbb{1}_{\partial^{k-1}_{\mathcal{H}} f}\big(S^{(k)}_i\big) = \mathbb{1}_{\partial^{k-1}_{\mathbb{C}} f}\big(S^{(k)}_i\big)$, so we focus on showing $\big|\{S^{(k)}_i : i \le m^3\} \cap \partial^{k-1}_{\mathbb{C}} f\big| \ge \tilde{M}(m)$ on an appropriate event. We know
\begin{align*}
&\mathcal{P}\Big(\forall k \in \{2, \ldots, \tilde{d}_f\}, \forall m \ge \tau,\ \big|\{S^{(k)}_i : i \le m^3\} \cap \partial^{k-1}_{\mathbb{C}} f\big| \ge \tilde{M}(m)\Big) \\
&= 1 - \mathcal{P}\Big(\exists k \in \{2, \ldots, \tilde{d}_f\}, m \ge \tau : \big|\{S^{(k)}_i : i \le m^3\} \cap \partial^{k-1}_{\mathbb{C}} f\big| < \tilde{M}(m)\Big) \ge 1 - \sum_{m \ge \tau} \sum_{k=2}^{\tilde{d}_f} \mathcal{P}\Big(\big|\{S^{(k)}_i : i \le m^3\} \cap \partial^{k-1}_{\mathbb{C}} f\big| < \tilde{M}(m)\Big),
\end{align*}
where the last line follows by a union bound. Thus, we will focus on bounding
\[ \sum_{m \ge \tau} \sum_{k=2}^{\tilde{d}_f} \mathcal{P}\Big(\big|\{S^{(k)}_i : i \le m^3\} \cap \partial^{k-1}_{\mathbb{C}} f\big| < \tilde{M}(m)\Big). \tag{26} \]
Fix any $k \in \{2, \ldots, \tilde{d}_f\}$ and integer $m \ge \tau$. Since
\[ \mathbb{E}\Big[\big|\{S^{(k)}_i : i \le m^3\} \cap \partial^{k-1}_{\mathbb{C}} f\big|\Big] = \mathcal{P}^{k-1}\big(\partial^{k-1}_{\mathbb{C}} f\big)\, m^3 \ge \tilde{\delta}_f m^3, \]
a Chernoff bound implies that
\[ \mathcal{P}\Big(\big|\{S^{(k)}_i : i \le m^3\} \cap \partial^{k-1}_{\mathbb{C}} f\big| < \tilde{M}(m)\Big) \le \exp\big\{-m^3 \mathcal{P}^{k-1}\big(\partial^{k-1}_{\mathbb{C}} f\big)/8\big\} \le \exp\big\{-m^3 \tilde{\delta}_f/8\big\}. \]
Thus, we have that (26) is at most
\[ \sum_{m \ge \tau} \tilde{d}_f \cdot \exp\big\{-m^3 \tilde{\delta}_f/8\big\} \le \sum_{m \ge \tau^3} \tilde{d}_f \cdot \exp\big\{-m \tilde{\delta}_f/8\big\} \le \tilde{d}_f \cdot \exp\big\{-\tilde{M}(\tau)/4\big\} + \tilde{d}_f \int_{\tau^3}^{\infty} \exp\big\{-x \tilde{\delta}_f/8\big\}\, dx = \tilde{d}_f \big(1 + 8/\tilde{\delta}_f\big) \cdot \exp\big\{-\tilde{M}(\tau)/4\big\} \le \big(9 \tilde{d}_f / \tilde{\delta}_f\big) \cdot \exp\big\{-\tilde{M}(\tau)/4\big\}. \]
Note that since $\mathcal{P}(H') = 1$, defining
\[ H^{(i)}_\tau = \Big\{\forall k \in \{2, \ldots, \tilde{d}_f\}, \forall m \ge \tau,\ \big|\{S^{(k)}_i : i \le m^3\} \cap \partial^{k-1}_{\mathbb{C}} f\big| \ge \tilde{M}(m)\Big\} \cap H' \]
has the required properties.

Lemma 40 For any $\tau \in \mathbb{N}$, there is an event $G^{(i)}_\tau$ with $\mathcal{P}\big(H^{(i)}_\tau \setminus G^{(i)}_\tau\big) \le \big(121 \tilde{d}_f / \tilde{\delta}_f\big) \cdot \exp\big\{-\tilde{M}(\tau)/60\big\}$ such that, on $G^{(i)}_\tau$, if $\tilde{d}_f \ge 2$, then for every integer $s \ge \tau$ and $k \in \{2, \ldots, \tilde{d}_f\}$, $\forall r \in \big(0, r_{1/6}\big]$,
\[ M^{(k)}_s(\mathrm{B}(f, r)) \le (3/2)\, \big|\{S^{(k)}_i : i \le s^3\} \cap \partial^{k-1}_{\mathbb{C}} f\big|. \]
⋄

Proof Fix integers $s \ge \tau$ and $k \in \{2, \ldots, \tilde{d}_f\}$, and let $r = r_{1/6}$. Define the set
\[ \hat{\mathcal{S}}_{k-1} = \{S^{(k)}_i : i \le s^3\} \cap \mathcal{S}^{k-1}(\mathrm{B}(f, r)). \]
Note $\big|\hat{\mathcal{S}}_{k-1}\big| = M^{(k)}_s(\mathrm{B}(f, r))$, and the elements of $\hat{\mathcal{S}}_{k-1}$ are conditionally i.i.d. given $M^{(k)}_s(\mathrm{B}(f, r))$, each with conditional distribution equivalent to the conditional distribution of $S^{(k)}_1$ given $\big\{S^{(k)}_1 \in \mathcal{S}^{k-1}(\mathrm{B}(f, r))\big\}$. In particular,
\[ \mathbb{E}\Big[\big|\hat{\mathcal{S}}_{k-1} \cap \partial^{k-1}_{\mathbb{C}} f\big| \,\Big|\, M^{(k)}_s(\mathrm{B}(f, r))\Big] = \mathcal{P}^{k-1}\big(\partial^{k-1}_{\mathbb{C}} f \,\big|\, \mathcal{S}^{k-1}(\mathrm{B}(f, r))\big)\, M^{(k)}_s(\mathrm{B}(f, r)). \]
Define the event $G^{(i)}_\tau(k, s) = \big\{\big|\hat{\mathcal{S}}_{k-1}\big| \le (3/2)\big|\hat{\mathcal{S}}_{k-1} \cap \partial^{k-1}_{\mathbb{C}} f\big|\big\}$. By Lemma 36 (indeed by definition of $q(r)$ and $r_{1/6}$), we have
\begin{align*}
1 - \mathcal{P}\big(G^{(i)}_\tau(k, s) \,\big|\, M^{(k)}_s(\mathrm{B}(f, r))\big) &= \mathcal{P}\Big(\big|\hat{\mathcal{S}}_{k-1} \cap \partial^{k-1}_{\mathbb{C}} f\big| < (2/3)\, M^{(k)}_s(\mathrm{B}(f, r)) \,\Big|\, M^{(k)}_s(\mathrm{B}(f, r))\Big) \\
&\le \mathcal{P}\Big(\big|\hat{\mathcal{S}}_{k-1} \cap \partial^{k-1}_{\mathbb{C}} f\big| < (4/5)(1 - q(r))\, M^{(k)}_s(\mathrm{B}(f, r)) \,\Big|\, M^{(k)}_s(\mathrm{B}(f, r))\Big) \\
&\le \mathcal{P}\Big(\big|\hat{\mathcal{S}}_{k-1} \cap \partial^{k-1}_{\mathbb{C}} f\big| < (4/5)\, \mathcal{P}^{k-1}\big(\partial^{k-1}_{\mathbb{C}} f \,\big|\, \mathcal{S}^{k-1}(\mathrm{B}(f, r))\big)\, M^{(k)}_s(\mathrm{B}(f, r)) \,\Big|\, M^{(k)}_s(\mathrm{B}(f, r))\Big). \tag{27}
\end{align*}
By a Chernoff bound, (27) is at most
\[ \exp\big\{-M^{(k)}_s(\mathrm{B}(f, r))\, \mathcal{P}^{k-1}\big(\partial^{k-1}_{\mathbb{C}} f \,\big|\, \mathcal{S}^{k-1}(\mathrm{B}(f, r))\big)/50\big\} \le \exp\big\{-M^{(k)}_s(\mathrm{B}(f, r))(1 - q(r))/50\big\} \le \exp\big\{-M^{(k)}_s(\mathrm{B}(f, r))/60\big\}. \]
Thus, by Lemma 39,
\begin{align*}
\mathcal{P}\big(H^{(i)}_\tau \setminus G^{(i)}_\tau(k, s)\big) &\le \mathcal{P}\Big(\big\{M^{(k)}_s(\mathrm{B}(f, r)) \ge \tilde{M}(s)\big\} \setminus G^{(i)}_\tau(k, s)\Big) = \mathbb{E}\Big[\Big(1 - \mathcal{P}\big(G^{(i)}_\tau(k, s) \,\big|\, M^{(k)}_s(\mathrm{B}(f, r))\big)\Big)\, \mathbb{1}_{[\tilde{M}(s), \infty)}\big(M^{(k)}_s(\mathrm{B}(f, r))\big)\Big] \\
&\le \mathbb{E}\Big[\exp\big\{-M^{(k)}_s(\mathrm{B}(f, r))/60\big\}\, \mathbb{1}_{[\tilde{M}(s), \infty)}\big(M^{(k)}_s(\mathrm{B}(f, r))\big)\Big] \le \exp\big\{-\tilde{M}(s)/60\big\}.
\end{align*}
Now defining $G^{(i)}_\tau = \bigcap_{s \ge \tau} \bigcap_{k=2}^{\tilde d_f} G^{(i)}_\tau(k,s)$, a union bound implies
\[ P\left( H^{(i)}_\tau \setminus G^{(i)}_\tau \right) \le \sum_{s \ge \tau} \tilde d_f \cdot \exp\left\{-\tilde M(s)/60\right\} \le \tilde d_f \left( \exp\left\{-\tilde M(\tau)/60\right\} + \int_{\tau^3}^{\infty} \exp\left\{-x\tilde\delta_f/120\right\} dx \right) = \tilde d_f \left(1 + 120/\tilde\delta_f\right) \cdot \exp\left\{-\tilde M(\tau)/60\right\} \le \left(121\tilde d_f/\tilde\delta_f\right) \cdot \exp\left\{-\tilde M(\tau)/60\right\}. \]
This completes the proof for $r = r_{1/6}$. Monotonicity extends the result to any $r \in (0, r_{1/6}]$.

Lemma 41 There exist $(\mathbb C, P, f, \gamma)$-dependent constants $\tau^* \in \mathbb N$ and $c^{(ii)} \in (0,\infty)$ such that, for any integer $\tau \ge \tau^*$, there is an event $H^{(ii)}_\tau \subseteq G^{(i)}_\tau$ with
\[ P\left( H^{(i)}_\tau \setminus H^{(ii)}_\tau \right) \le c^{(ii)} \cdot \exp\left\{ -\tilde M(\tau)^{1/3}/60 \right\} \tag{28} \]
such that, on $H^{(i)}_\tau \cap H^{(ii)}_\tau$, for all $s, m, \ell, k \in \mathbb N$ with $\ell < m$ and $k \le \tilde d_f$, and for any set of classifiers $\mathcal H$ with $V^\star_\ell \subseteq \mathcal H$, if either $k = 1$, or $s \ge \tau$ and $\mathcal H \subseteq \mathrm B(f, r_{(1-\gamma)/6})$, then
\[ \hat\Delta^{(k)}_s(X_m, W_2, \mathcal H) < \gamma \implies \hat\Gamma^{(k)}_s(X_m, -f(X_m), W_2, \mathcal H) < \hat\Gamma^{(k)}_s(X_m, f(X_m), W_2, \mathcal H). \]
In particular, for $\delta \in (0,1)$ and $\tau \ge \max\{\tau((1-\gamma)/6;\, \delta), \tau^*\}$, on $H_\tau(\delta) \cap H^{(i)}_\tau \cap H^{(ii)}_\tau$, this is true for $\mathcal H = V^\star_\ell$ for every $k, \ell, m, s \in \mathbb N$ satisfying $\tau \le \ell < m$, $\tau \le s$, and $k \le \tilde d_f$. ⋄
Proof Let $\tau^* = (6/(1-\gamma)) \cdot \left(2/\tilde\delta_f\right)^{1/3}$, and consider any $\tau, k, \ell, m, s, \mathcal H$ as described above. If $k = 1$, the result clearly holds. In particular, Lemma 35 implies that on $H^{(i)}_\tau$, $\mathcal H[(X_m, f(X_m))] \supseteq V^\star_m \neq \emptyset$, so that some $h \in \mathcal H$ has $h(X_m) = f(X_m)$, and therefore $\hat\Gamma^{(1)}_s(X_m, -f(X_m), W_2, \mathcal H) = \mathbb 1_{\bigcap_{h \in \mathcal H} \{h(X_m)\}}(-f(X_m)) = 0$; and since $\hat\Delta^{(1)}_s(X_m, W_2, \mathcal H) = \mathbb 1_{\mathrm{DIS}(\mathcal H)}(X_m)$, if $\hat\Delta^{(1)}_s(X_m, W_2, \mathcal H) < \gamma$, then since $\gamma < 1$ we have $X_m \notin \mathrm{DIS}(\mathcal H)$, so that $\hat\Gamma^{(1)}_s(X_m, f(X_m), W_2, \mathcal H) = \mathbb 1_{\bigcap_{h \in \mathcal H} \{h(X_m)\}}(f(X_m)) = 1$. Otherwise, suppose $2 \le k \le \tilde d_f$.
Note that on $H^{(i)}_\tau \cap G^{(i)}_\tau$, for all $m \in \mathbb N$ and any $\mathcal H$ with $V^\star_\ell \subseteq \mathcal H \subseteq \mathrm B(f, r_{(1-\gamma)/6})$ for some $\ell \in \mathbb N$,
\begin{align*} \hat\Gamma^{(k)}_s(X_m, -f(X_m), W_2, \mathcal H) &= \frac{1}{M^{(k)}_s(\mathcal H)} \sum_{i=1}^{s^3} \mathbb 1_{\bar{\mathcal S}^{k-1}(\mathcal H[(X_m, f(X_m))])}\big(S^{(k)}_i\big)\, \mathbb 1_{\mathcal S^{k-1}(\mathcal H)}\big(S^{(k)}_i\big) \\ &\le \frac{1}{\left|\{S^{(k)}_i : i \le s^3\} \cap \partial^{k-1}_{\mathcal H} f\right|} \sum_{i=1}^{s^3} \mathbb 1_{\bar{\mathcal S}^{k-1}(V^\star_m)}\big(S^{(k)}_i\big)\, \mathbb 1_{\mathcal S^{k-1}(\mathrm B(f, r_{(1-\gamma)/6}))}\big(S^{(k)}_i\big) && \text{(monotonicity)} \\ &\le \frac{1}{\left|\{S^{(k)}_i : i \le s^3\} \cap \partial^{k-1}_{\mathcal H} f\right|} \sum_{i=1}^{s^3} \mathbb 1_{\bar\partial^{k-1}_{V^\star_m} f}\big(S^{(k)}_i\big)\, \mathbb 1_{\mathcal S^{k-1}(\mathrm B(f, r_{(1-\gamma)/6}))}\big(S^{(k)}_i\big) && \text{(monotonicity)} \\ &= \frac{1}{\left|\{S^{(k)}_i : i \le s^3\} \cap \partial^{k-1}_{\mathbb C} f\right|} \sum_{i=1}^{s^3} \mathbb 1_{\bar\partial^{k-1}_{\mathbb C} f}\big(S^{(k)}_i\big)\, \mathbb 1_{\mathcal S^{k-1}(\mathrm B(f, r_{(1-\gamma)/6}))}\big(S^{(k)}_i\big) && \text{(Lemma 35)} \\ &\le \frac{3}{2\, M^{(k)}_s(\mathrm B(f, r_{(1-\gamma)/6}))} \sum_{i=1}^{s^3} \mathbb 1_{\bar\partial^{k-1}_{\mathbb C} f}\big(S^{(k)}_i\big)\, \mathbb 1_{\mathcal S^{k-1}(\mathrm B(f, r_{(1-\gamma)/6}))}\big(S^{(k)}_i\big). && \text{(Lemma 40)} \end{align*}
For brevity, let $\hat\Gamma$ denote this last quantity, and let $M_{ks} = M^{(k)}_s\left( \mathrm B\left(f, r_{(1-\gamma)/6}\right) \right)$. By Hoeffding's inequality, we have
\[ P\left( (2/3)\hat\Gamma > P^{k-1}\left( \bar\partial^{k-1}_{\mathbb C} f \,\middle|\, \mathcal S^{k-1}\left( \mathrm B\left(f, r_{(1-\gamma)/6}\right) \right) \right) + M_{ks}^{-1/3} \,\middle|\, M_{ks} \right) \le \exp\left\{ -2 M_{ks}^{1/3} \right\}. \]
Thus, by Lemmas 36, 39, and 40,
\[ P\left( \left\{ (2/3)\hat\Gamma^{(k)}_s(X_m, -f(X_m), W_2, \mathcal H) > q\left(r_{(1-\gamma)/6}\right) + \tilde M(s)^{-1/3} \right\} \cap H^{(i)}_\tau \cap G^{(i)}_\tau \right) \le P\left( \left\{ (2/3)\hat\Gamma > P^{k-1}\left( \bar\partial^{k-1}_{\mathbb C} f \,\middle|\, \mathcal S^{k-1}\left( \mathrm B\left(f, r_{(1-\gamma)/6}\right) \right) \right) + \tilde M(s)^{-1/3} \right\} \cap H^{(i)}_\tau \right) \]
\[ \le P\left( \left\{ (2/3)\hat\Gamma > P^{k-1}\left( \bar\partial^{k-1}_{\mathbb C} f \,\middle|\, \mathcal S^{k-1}\left( \mathrm B\left(f, r_{(1-\gamma)/6}\right) \right) \right) + M_{ks}^{-1/3} \right\} \cap \left\{ M_{ks} \ge \tilde M(s) \right\} \right) = \mathbb E\left[ P\left( (2/3)\hat\Gamma > P^{k-1}\left( \bar\partial^{k-1}_{\mathbb C} f \,\middle|\, \mathcal S^{k-1}\left( \mathrm B\left(f, r_{(1-\gamma)/6}\right) \right) \right) + M_{ks}^{-1/3} \,\middle|\, M_{ks} \right) \mathbb 1_{[\tilde M(s), \infty)}(M_{ks}) \right] \]
\[ \le \mathbb E\left[ \exp\left\{ -2 M_{ks}^{1/3} \right\} \mathbb 1_{[\tilde M(s), \infty)}(M_{ks}) \right] \le \exp\left\{ -2\tilde M(s)^{1/3} \right\}. \]
Thus, there is an event $H^{(ii)}_\tau(k,s)$ with $P\left( H^{(i)}_\tau \cap G^{(i)}_\tau \setminus H^{(ii)}_\tau(k,s) \right) \le \exp\left\{ -2\tilde M(s)^{1/3} \right\}$ such that
\[ \hat\Gamma^{(k)}_s(X_m, -f(X_m), W_2, \mathcal H) \le (3/2)\left( q\left(r_{(1-\gamma)/6}\right) + \tilde M(s)^{-1/3} \right) \]
holds for these particular values of $k$ and $s$. To extend to the full range of values, we simply take $H^{(ii)}_\tau = G^{(i)}_\tau \cap \bigcap_{s \ge \tau} \bigcap_{k \le \tilde d_f} H^{(ii)}_\tau(k,s)$. Since $\tau \ge (2/\tilde\delta_f)^{1/3}$, we have $\tilde M(\tau) \ge 1$, so a union bound implies
\[ P\left( H^{(i)}_\tau \cap G^{(i)}_\tau \setminus H^{(ii)}_\tau \right) \le \sum_{s \ge \tau} \tilde d_f \cdot \exp\left\{ -2\tilde M(s)^{1/3} \right\} \le \tilde d_f \cdot \left( \exp\left\{ -2\tilde M(\tau)^{1/3} \right\} + \int_{\tau}^{\infty} \exp\left\{ -2\tilde M(x)^{1/3} \right\} dx \right) = \tilde d_f \left( 1 + 2^{-2/3}\tilde\delta_f^{-1/3} \right) \cdot \exp\left\{ -2\tilde M(\tau)^{1/3} \right\} \le 2\tilde d_f \tilde\delta_f^{-1/3} \cdot \exp\left\{ -2\tilde M(\tau)^{1/3} \right\}. \]
Then Lemma 40 and a union bound imply
\[ P\left( H^{(i)}_\tau \setminus H^{(ii)}_\tau \right) \le 2\tilde d_f \tilde\delta_f^{-1/3} \cdot \exp\left\{ -2\tilde M(\tau)^{1/3} \right\} + 121\tilde d_f \tilde\delta_f^{-1} \cdot \exp\left\{ -\tilde M(\tau)/60 \right\} \le 123\tilde d_f \tilde\delta_f^{-1} \cdot \exp\left\{ -\tilde M(\tau)^{1/3}/60 \right\}. \]
On $H^{(i)}_\tau \cap H^{(ii)}_\tau$, every such $s, m, \ell, k$, and $\mathcal H$ satisfy
\[ \hat\Gamma^{(k)}_s(X_m, -f(X_m), W_2, \mathcal H) \le (3/2)\left( q\left(r_{(1-\gamma)/6}\right) + \tilde M(s)^{-1/3} \right) < (3/2)\left( (1-\gamma)/6 + (1-\gamma)/6 \right) = (1-\gamma)/2, \tag{29} \]
where the second inequality follows from the definition of $r_{(1-\gamma)/6}$ and from $s \ge \tau \ge \tau^*$. If $\hat\Delta^{(k)}_s(X_m, W_2, \mathcal H) < \gamma$, then
\[ 1 - \gamma < 1 - \hat\Delta^{(k)}_s(X_m, W_2, \mathcal H) = \frac{1}{M^{(k)}_s(\mathcal H)} \sum_{i=1}^{s^3} \mathbb 1_{\mathcal S^{k-1}(\mathcal H)}\big(S^{(k)}_i\big)\, \mathbb 1_{\bar{\mathcal S}^{k}(\mathcal H)}\big(S^{(k)}_i \cup \{X_m\}\big). \]
(30)
Finally, noting that we always have
\[ \mathbb 1_{\bar{\mathcal S}^{k}(\mathcal H)}\big(S^{(k)}_i \cup \{X_m\}\big) \le \mathbb 1_{\bar{\mathcal S}^{k-1}(\mathcal H[(X_m, f(X_m))])}\big(S^{(k)}_i\big) + \mathbb 1_{\bar{\mathcal S}^{k-1}(\mathcal H[(X_m, -f(X_m))])}\big(S^{(k)}_i\big), \]
we have that, on the event $H^{(i)}_\tau \cap H^{(ii)}_\tau$, if $\hat\Delta^{(k)}_s(X_m, W_2, \mathcal H) < \gamma$, then
\begin{align*} \hat\Gamma^{(k)}_s(X_m, -f(X_m), W_2, \mathcal H) &< (1-\gamma)/2 = -(1-\gamma)/2 + (1-\gamma) && \text{by (29)} \\ &< -(1-\gamma)/2 + \frac{1}{M^{(k)}_s(\mathcal H)} \sum_{i=1}^{s^3} \mathbb 1_{\mathcal S^{k-1}(\mathcal H)}\big(S^{(k)}_i\big)\, \mathbb 1_{\bar{\mathcal S}^{k}(\mathcal H)}\big(S^{(k)}_i \cup \{X_m\}\big) && \text{by (30)} \\ &\le -(1-\gamma)/2 + \frac{1}{M^{(k)}_s(\mathcal H)} \sum_{i=1}^{s^3} \mathbb 1_{\mathcal S^{k-1}(\mathcal H)}\big(S^{(k)}_i\big)\, \mathbb 1_{\bar{\mathcal S}^{k-1}(\mathcal H[(X_m, f(X_m))])}\big(S^{(k)}_i\big) + \frac{1}{M^{(k)}_s(\mathcal H)} \sum_{i=1}^{s^3} \mathbb 1_{\mathcal S^{k-1}(\mathcal H)}\big(S^{(k)}_i\big)\, \mathbb 1_{\bar{\mathcal S}^{k-1}(\mathcal H[(X_m, -f(X_m))])}\big(S^{(k)}_i\big) \\ &= -(1-\gamma)/2 + \hat\Gamma^{(k)}_s(X_m, -f(X_m), W_2, \mathcal H) + \hat\Gamma^{(k)}_s(X_m, f(X_m), W_2, \mathcal H) \\ &< \hat\Gamma^{(k)}_s(X_m, f(X_m), W_2, \mathcal H). && \text{by (29)} \end{align*}
The final claim in the lemma statement is then implied by Lemma 29, since $V^\star_\ell \subseteq V^\star_\tau \subseteq \mathrm B(f, \phi(\tau; \delta)) \subseteq \mathrm B\left(f, r_{(1-\gamma)/6}\right)$ on $H_\tau(\delta)$.

For any $k, \ell, m \in \mathbb N$ and any $x \in \mathcal X$, define
\[ \hat p_x(k, \ell, m) = \hat\Delta^{(k)}_m(x, W_2, V^\star_\ell) \qquad \text{and} \qquad p_x(k, \ell) = P^{k-1}\left( \left\{ S \in \mathcal X^{k-1} : S \cup \{x\} \in \mathcal S^{k}(V^\star_\ell) \right\} \,\middle|\, \mathcal S^{k-1}(V^\star_\ell) \right). \]

Lemma 42 For any $\zeta \in (0,1)$, there is a $(\mathbb C, P, f, \zeta)$-dependent constant $c^{(iii)}(\zeta) \in (0,\infty)$ such that, for any $\tau \in \mathbb N$, there is an event $H^{(iii)}_\tau(\zeta)$ with
\[ P\left( H^{(i)}_\tau \setminus H^{(iii)}_\tau(\zeta) \right) \le c^{(iii)}(\zeta) \cdot \exp\left\{ -\zeta^2 \tilde M(\tau) \right\} \]
such that, on $H^{(i)}_\tau \cap H^{(iii)}_\tau(\zeta)$, for all $k, \ell, m \in \mathbb N$ with $\tau \le \ell \le m$ and $k \le \tilde d_f$,
\[ P\left( x : |p_x(k,\ell) - \hat p_x(k,\ell,m)| > \zeta \right) \le \exp\left\{ -\zeta^2 \tilde M(m) \right\}. \] ⋄
Proof Fix any $k, \ell, m \in \mathbb N$ with $\tau \le \ell \le m$ and $k \le \tilde d_f$.
Recall our convention that $\mathcal X^0 = \{\emptyset\}$ and $P^0\left(\mathcal X^0\right) = 1$; thus, if $k = 1$, $\hat p_x(k,\ell,m) = \mathbb 1_{\mathrm{DIS}(V^\star_\ell)}(x) = \mathbb 1_{\mathcal S^1(V^\star_\ell)}(x) = p_x(k,\ell)$, so the result clearly holds for $k = 1$. For the remaining case, suppose $2 \le k \le \tilde d_f$. To simplify notation, let $\tilde m = M^{(k)}_m(V^\star_\ell)$, $X = X_{\ell+1}$, $p_x = p_x(k,\ell)$, and $\hat p_x = \hat p_x(k,\ell,m)$. Consider the event
\[ H^{(iii)}(k,\ell,m,\zeta) = \left\{ P\left( x : |p_x - \hat p_x| > \zeta \right) \le \exp\left\{ -\zeta^2 \tilde M(m) \right\} \right\}. \]
We have
\[ P\left( H^{(i)}_\tau \setminus H^{(iii)}(k,\ell,m,\zeta) \,\middle|\, V^\star_\ell \right) \tag{31} \]
\[ \le P\left( \left\{ \tilde m \ge \tilde M(m) \right\} \setminus H^{(iii)}(k,\ell,m,\zeta) \,\middle|\, V^\star_\ell \right) \quad \text{(by Lemma 39)} \]
\[ = P\left( \left\{ \tilde m \ge \tilde M(m) \right\} \cap \left\{ P\left( e^{s\tilde m |p_X - \hat p_X|} > e^{s\tilde m\zeta} \,\middle|\, W_2, V^\star_\ell \right) > e^{-\zeta^2 \tilde M(m)} \right\} \,\middle|\, V^\star_\ell \right), \tag{32} \]
for any value $s > 0$. Proceeding as in Chernoff's bounding technique, by Markov's inequality, (32) is at most
\[ P\left( \left\{ \tilde m \ge \tilde M(m) \right\} \cap \left\{ e^{-s\tilde m\zeta}\, \mathbb E\left[ e^{s\tilde m |p_X - \hat p_X|} \,\middle|\, W_2, V^\star_\ell \right] > e^{-\zeta^2 \tilde M(m)} \right\} \,\middle|\, V^\star_\ell \right) \le P\left( \left\{ \tilde m \ge \tilde M(m) \right\} \cap \left\{ e^{-s\tilde m\zeta}\, \mathbb E\left[ e^{s\tilde m (p_X - \hat p_X)} + e^{s\tilde m (\hat p_X - p_X)} \,\middle|\, W_2, V^\star_\ell \right] > e^{-\zeta^2 \tilde M(m)} \right\} \,\middle|\, V^\star_\ell \right) \]
\[ = \mathbb E\left[ \mathbb 1_{[\tilde M(m),\infty)}(\tilde m)\, P\left( e^{-s\tilde m\zeta}\, \mathbb E\left[ e^{s\tilde m (p_X - \hat p_X)} + e^{s\tilde m (\hat p_X - p_X)} \,\middle|\, W_2, V^\star_\ell \right] > e^{-\zeta^2 \tilde M(m)} \,\middle|\, \tilde m, V^\star_\ell \right) \,\middle|\, V^\star_\ell \right]. \]
By Markov's inequality, this is at most
\[ \mathbb E\left[ \mathbb 1_{[\tilde M(m),\infty)}(\tilde m)\, e^{\zeta^2 \tilde M(m)}\, \mathbb E\left[ e^{-s\tilde m\zeta}\, \mathbb E\left[ e^{s\tilde m (p_X - \hat p_X)} + e^{s\tilde m (\hat p_X - p_X)} \,\middle|\, W_2, V^\star_\ell \right] \,\middle|\, \tilde m, V^\star_\ell \right] \,\middle|\, V^\star_\ell \right] = \mathbb E\left[ \mathbb 1_{[\tilde M(m),\infty)}(\tilde m)\, e^{\zeta^2 \tilde M(m)} e^{-s\tilde m\zeta}\, \mathbb E\left[ e^{s\tilde m (p_X - \hat p_X)} + e^{s\tilde m (\hat p_X - p_X)} \,\middle|\, \tilde m, V^\star_\ell \right] \,\middle|\, V^\star_\ell \right] \]
\[ = \mathbb E\left[ \mathbb 1_{[\tilde M(m),\infty)}(\tilde m)\, e^{\zeta^2 \tilde M(m)} e^{-s\tilde m\zeta}\, \mathbb E\left[ \mathbb E\left[ e^{s\tilde m (p_X - \hat p_X)} + e^{s\tilde m (\hat p_X - p_X)} \,\middle|\, X, \tilde m, V^\star_\ell \right] \,\middle|\, \tilde m, V^\star_\ell \right] \,\middle|\, V^\star_\ell \right]. \]
(33)
The conditional distribution of $\tilde m \hat p_X$ given $(X, \tilde m, V^\star_\ell)$ is $\mathrm{Binomial}(\tilde m, p_X)$, so letting $B_1(p_X), B_2(p_X), \ldots$ denote a sequence of random variables, conditionally independent with distribution $\mathrm{Bernoulli}(p_X)$ given $(X, \tilde m, V^\star_\ell)$, we have
\[ \mathbb E\left[ e^{s\tilde m (p_X - \hat p_X)} + e^{s\tilde m (\hat p_X - p_X)} \,\middle|\, X, \tilde m, V^\star_\ell \right] = \mathbb E\left[ \prod_{i=1}^{\tilde m} e^{s(p_X - B_i(p_X))} \,\middle|\, X, \tilde m, V^\star_\ell \right] + \mathbb E\left[ \prod_{i=1}^{\tilde m} e^{s(B_i(p_X) - p_X)} \,\middle|\, X, \tilde m, V^\star_\ell \right] = \mathbb E\left[ e^{s(p_X - B_1(p_X))} \,\middle|\, X, \tilde m, V^\star_\ell \right]^{\tilde m} + \mathbb E\left[ e^{s(B_1(p_X) - p_X)} \,\middle|\, X, \tilde m, V^\star_\ell \right]^{\tilde m}. \tag{34} \]
It is known that for $B \sim \mathrm{Bernoulli}(p)$, both $\mathbb E\left[ e^{s(B-p)} \right]$ and $\mathbb E\left[ e^{s(p-B)} \right]$ are at most $e^{s^2/8}$ (see, e.g., Lemma 8.1 of Devroye, Györfi, and Lugosi, 1996). Thus, taking $s = 4\zeta$, (34) is at most $2 e^{2\tilde m\zeta^2}$, and (33) is at most
\[ \mathbb E\left[ \mathbb 1_{[\tilde M(m),\infty)}(\tilde m)\, 2 e^{\zeta^2 \tilde M(m)} e^{-4\tilde m\zeta^2} e^{2\tilde m\zeta^2} \,\middle|\, V^\star_\ell \right] = \mathbb E\left[ \mathbb 1_{[\tilde M(m),\infty)}(\tilde m)\, 2 e^{\zeta^2 \tilde M(m)} e^{-2\tilde m\zeta^2} \,\middle|\, V^\star_\ell \right] \le 2\exp\left\{ -\zeta^2 \tilde M(m) \right\}. \]
Since this bound holds for (31), the law of total probability implies
\[ P\left( H^{(i)}_\tau \setminus H^{(iii)}(k,\ell,m,\zeta) \right) = \mathbb E\left[ P\left( H^{(i)}_\tau \setminus H^{(iii)}(k,\ell,m,\zeta) \,\middle|\, V^\star_\ell \right) \right] \le 2 \cdot \exp\left\{ -\zeta^2 \tilde M(m) \right\}. \]
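The Bernoulli moment bound invoked above can be checked directly: for $B \sim \mathrm{Bernoulli}(p)$, the centered moment generating function has the closed form $\mathbb E[e^{s(B-p)}] = p\,e^{s(1-p)} + (1-p)e^{-sp}$, so Hoeffding's lemma bound $e^{s^2/8}$ is easy to verify numerically. The sketch below does this on an illustrative grid of my choosing (not from the text); by symmetry, evaluating at $-s$ covers $\mathbb E[e^{s(p-B)}]$ as well.

```python
from math import exp

def bernoulli_mgf_centered(p: float, s: float) -> float:
    """E[exp(s*(B - p))] for B ~ Bernoulli(p), computed in closed form."""
    return p * exp(s * (1 - p)) + (1 - p) * exp(-s * p)

# Hoeffding's lemma: the centered MGF is at most exp(s^2 / 8).
# The grid of (p, s) values is illustrative; s ranges over both signs,
# which also covers E[exp(s*(p - B))] by symmetry.
for p in [0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]:
    for s in [x / 10 for x in range(-40, 41)]:
        assert bernoulli_mgf_centered(p, s) <= exp(s * s / 8) + 1e-12
```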
Defining $H^{(iii)}_\tau(\zeta) = \bigcap_{\ell \ge \tau} \bigcap_{m \ge \ell} \bigcap_{k=2}^{\tilde d_f} H^{(iii)}(k,\ell,m,\zeta)$, we have the required property for the claimed ranges of $k$, $\ell$, and $m$, and a union bound implies
\[ P\left( H^{(i)}_\tau \setminus H^{(iii)}_\tau(\zeta) \right) \le \sum_{\ell \ge \tau} \sum_{m \ge \ell} 2\tilde d_f \cdot \exp\left\{ -\zeta^2 \tilde M(m) \right\} \le 2\tilde d_f \cdot \sum_{\ell \ge \tau} \left( \exp\left\{ -\zeta^2 \tilde M(\ell) \right\} + \int_{\ell^3}^{\infty} \exp\left\{ -x\zeta^2 \tilde\delta_f/2 \right\} dx \right) = 2\tilde d_f \cdot \sum_{\ell \ge \tau} \left( 1 + 2\zeta^{-2}\tilde\delta_f^{-1} \right) \cdot \exp\left\{ -\zeta^2 \tilde M(\ell) \right\} \]
\[ \le 2\tilde d_f \cdot \left( 1 + 2\zeta^{-2}\tilde\delta_f^{-1} \right) \cdot \left( \exp\left\{ -\zeta^2 \tilde M(\tau) \right\} + \int_{\tau^3}^{\infty} \exp\left\{ -x\zeta^2 \tilde\delta_f/2 \right\} dx \right) = 2\tilde d_f \cdot \left( 1 + 2\zeta^{-2}\tilde\delta_f^{-1} \right)^2 \cdot \exp\left\{ -\zeta^2 \tilde M(\tau) \right\} \le 18\tilde d_f \zeta^{-4} \tilde\delta_f^{-2} \cdot \exp\left\{ -\zeta^2 \tilde M(\tau) \right\}. \]

For $k, \ell, m \in \mathbb N$ and $\zeta \in (0,1)$, define
\[ \bar p_\zeta(k,\ell,m) = P\left( x : \hat p_x(k,\ell,m) \ge \zeta \right). \tag{35} \]

Lemma 43 For any $\alpha, \zeta, \delta \in (0,1)$, $\beta \in \left(0, 1-\sqrt\alpha\right]$, and integer $\tau \ge \tau(\beta; \delta)$, on $H_\tau(\delta) \cap H^{(i)}_\tau \cap H^{(iii)}_\tau(\beta\zeta)$, for any $k, \ell, \ell', m \in \mathbb N$ with $\tau \le \ell \le \ell' \le m$ and $k \le \tilde d_f$,
\[ \bar p_\zeta(k,\ell',m) \le P\left( x : p_x(k,\ell) \ge \alpha\zeta \right) + \exp\left\{ -\beta^2\zeta^2 \tilde M(m) \right\}. \tag{36} \] ⋄
Proof Fix any $\alpha, \zeta, \delta \in (0,1)$, $\beta \in \left(0, 1-\sqrt\alpha\right]$, and $\tau, k, \ell, \ell', m \in \mathbb N$ with $\tau(\beta;\delta) \le \tau \le \ell \le \ell' \le m$ and $k \le \tilde d_f$. If $k = 1$, the result clearly holds. In particular, we have
\[ \bar p_\zeta(1,\ell',m) = P\left( \mathrm{DIS}(V^\star_{\ell'}) \right) \le P\left( \mathrm{DIS}(V^\star_{\ell}) \right) = P\left( x : p_x(1,\ell) \ge \alpha\zeta \right). \]
Otherwise, suppose $2 \le k \le \tilde d_f$. By a union bound,
\[ \bar p_\zeta(k,\ell',m) = P\left( x : \hat p_x(k,\ell',m) \ge \zeta \right) \le P\left( x : p_x(k,\ell') \ge \sqrt\alpha\,\zeta \right) + P\left( x : \left| p_x(k,\ell') - \hat p_x(k,\ell',m) \right| > \left( 1 - \sqrt\alpha \right)\zeta \right). \tag{37} \]
Since $P\left( x : |p_x(k,\ell') - \hat p_x(k,\ell',m)| > (1-\sqrt\alpha)\zeta \right) \le P\left( x : |p_x(k,\ell') - \hat p_x(k,\ell',m)| > \beta\zeta \right)$, Lemma 42 implies that, on $H^{(i)}_\tau \cap H^{(iii)}_\tau(\beta\zeta)$,
\[ P\left( x : \left| p_x(k,\ell') - \hat p_x(k,\ell',m) \right| > \left( 1 - \sqrt\alpha \right)\zeta \right) \le \exp\left\{ -\beta^2\zeta^2 \tilde M(m) \right\}. \]
(38)
It remains only to examine the first term on the right-hand side of (37). For this, if $P^{k-1}\left( \mathcal S^{k-1}(V^\star_{\ell'}) \right) = 0$, then the first term is $0$ by our aforementioned convention, and thus (36) holds; otherwise, since $\forall x \in \mathcal X$, $\left\{ S \in \mathcal X^{k-1} : S \cup \{x\} \in \mathcal S^{k}(V^\star_{\ell'}) \right\} \subseteq \mathcal S^{k-1}(V^\star_{\ell'})$, we have
\[ P\left( x : p_x(k,\ell') \ge \sqrt\alpha\,\zeta \right) = P\left( x : P^{k-1}\left( S \in \mathcal X^{k-1} : S \cup \{x\} \in \mathcal S^{k}(V^\star_{\ell'}) \,\middle|\, \mathcal S^{k-1}(V^\star_{\ell'}) \right) \ge \sqrt\alpha\,\zeta \right) = P\left( x : P^{k-1}\left( S \in \mathcal X^{k-1} : S \cup \{x\} \in \mathcal S^{k}(V^\star_{\ell'}) \right) \ge \sqrt\alpha\,\zeta\, P^{k-1}\left( \mathcal S^{k-1}(V^\star_{\ell'}) \right) \right). \tag{39} \]
By Lemma 35 and monotonicity, on $H^{(i)}_\tau \subseteq H'$, (39) is at most
\[ P\left( x : P^{k-1}\left( S \in \mathcal X^{k-1} : S \cup \{x\} \in \mathcal S^{k}(V^\star_{\ell'}) \right) \ge \sqrt\alpha\,\zeta\, P^{k-1}\left( \partial^{k-1}_{\mathbb C} f \right) \right), \]
and monotonicity implies this is at most
\[ P\left( x : P^{k-1}\left( S \in \mathcal X^{k-1} : S \cup \{x\} \in \mathcal S^{k}(V^\star_{\ell}) \right) \ge \sqrt\alpha\,\zeta\, P^{k-1}\left( \partial^{k-1}_{\mathbb C} f \right) \right). \tag{40} \]
By Lemma 36, for $\tau \ge \tau(\beta;\delta)$, on $H_\tau(\delta) \cap H^{(i)}_\tau$,
\[ P^{k-1}\left( \bar\partial^{k-1}_{\mathbb C} f \,\middle|\, \mathcal S^{k-1}(V^\star_\ell) \right) \le q(\phi(\tau;\delta)) < \beta \le 1 - \sqrt\alpha, \]
which implies
\[ P^{k-1}\left( \partial^{k-1}_{\mathbb C} f \right) \ge P^{k-1}\left( \partial^{k-1}_{\mathbb C} f \cap \mathcal S^{k-1}(V^\star_\ell) \right) = \left( 1 - P^{k-1}\left( \bar\partial^{k-1}_{\mathbb C} f \,\middle|\, \mathcal S^{k-1}(V^\star_\ell) \right) \right) P^{k-1}\left( \mathcal S^{k-1}(V^\star_\ell) \right) \ge \sqrt\alpha\, P^{k-1}\left( \mathcal S^{k-1}(V^\star_\ell) \right). \]
Altogether, for $\tau \ge \tau(\beta;\delta)$, on $H_\tau(\delta) \cap H^{(i)}_\tau$, (40) is at most
\[ P\left( x : P^{k-1}\left( S \in \mathcal X^{k-1} : S \cup \{x\} \in \mathcal S^{k}(V^\star_\ell) \right) \ge \alpha\zeta\, P^{k-1}\left( \mathcal S^{k-1}(V^\star_\ell) \right) \right) = P\left( x : p_x(k,\ell) \ge \alpha\zeta \right), \]
which, combined with (37) and (38), establishes (36).

Lemma 44 There are events $\left\{ H^{(iv)}_\tau : \tau \in \mathbb N \right\}$ with $P\left( H^{(iv)}_\tau \right) \ge 1 - 3\tilde d_f \cdot \exp\{-2\tau\}$ such that, for any $\xi \in (0, \gamma/16]$, $\delta \in (0,1)$, and integer $\tau \ge \tau^{(iv)}(\xi;\delta)$, where
\[ \tau^{(iv)}(\xi;\delta) = \max\left\{ \tau(4\xi/\gamma;\, \delta),\, \left( \frac{4}{\tilde\delta_f \xi^2} \ln \frac{4}{\tilde\delta_f \xi^2} \right)^{1/3} \right\}, \]
on $H_\tau(\delta) \cap H^{(i)}_\tau \cap H^{(iii)}_\tau(\xi) \cap H^{(iv)}_\tau$, $\forall k \in \{1, \ldots, \tilde d_f\}$
and $\forall \ell \in \mathbb N$ with $\ell \ge \tau$,
\[ P\left( x : p_x(k,\ell) \ge \gamma/2 \right) + \exp\left\{ -\gamma^2 \tilde M(\ell)/256 \right\} \le \hat\Delta^{(k)}_\ell(W_1, W_2, V^\star_\ell) \tag{41} \]
\[ \le P\left( x : p_x(k,\ell) \ge \gamma/8 \right) + 4\ell^{-1}. \tag{42} \] ⋄
Proof For any $k, \ell \in \mathbb N$, by Hoeffding's inequality and the law of total probability, on an event $G^{(iv)}(k,\ell)$ with $P\left( G^{(iv)}(k,\ell) \right) \ge 1 - 2\exp\{-2\ell\}$, we have
\[ \left| \bar p_{\gamma/4}(k,\ell,\ell) - \ell^{-3} \sum_{i=1}^{\ell^3} \mathbb 1_{[\gamma/4, \infty)}\left( \hat\Delta^{(k)}_\ell(w_i, W_2, V^\star_\ell) \right) \right| \le \ell^{-1}. \tag{43} \]
Define the event $H^{(iv)}_\tau = \bigcap_{\ell \ge \tau} \bigcap_{k=1}^{\tilde d_f} G^{(iv)}(k,\ell)$. By a union bound, we have
\[ 1 - P\left( H^{(iv)}_\tau \right) \le 2\tilde d_f \cdot \sum_{\ell \ge \tau} \exp\{-2\ell\} \le 2\tilde d_f \cdot \left( \exp\{-2\tau\} + \int_\tau^\infty \exp\{-2x\}\, dx \right) = 3\tilde d_f \cdot \exp\{-2\tau\}. \]
Now fix any $\ell \ge \tau$ and $k \in \{1, \ldots, \tilde d_f\}$. By a union bound,
\[ P\left( x : p_x(k,\ell) \ge \gamma/2 \right) \le P\left( x : \hat p_x(k,\ell,\ell) \ge \gamma/4 \right) + P\left( x : |p_x(k,\ell) - \hat p_x(k,\ell,\ell)| > \gamma/4 \right). \tag{44} \]
By Lemma 42, on $H^{(i)}_\tau \cap H^{(iii)}_\tau(\xi)$,
\[ P\left( x : |p_x(k,\ell) - \hat p_x(k,\ell,\ell)| > \gamma/4 \right) \le P\left( x : |p_x(k,\ell) - \hat p_x(k,\ell,\ell)| > \xi \right) \le \exp\left\{ -\xi^2 \tilde M(\ell) \right\}. \tag{45} \]
Also, on $H^{(iv)}_\tau$, (43) implies
\[ P\left( x : \hat p_x(k,\ell,\ell) \ge \gamma/4 \right) = \bar p_{\gamma/4}(k,\ell,\ell) \le \ell^{-1} + \ell^{-3} \sum_{i=1}^{\ell^3} \mathbb 1_{[\gamma/4, \infty)}\left( \hat\Delta^{(k)}_\ell(w_i, W_2, V^\star_\ell) \right) = \hat\Delta^{(k)}_\ell(W_1, W_2, V^\star_\ell) - \ell^{-1}. \tag{46} \]
Combining (44) with (45) and (46) yields
\[ P\left( x : p_x(k,\ell) \ge \gamma/2 \right) \le \hat\Delta^{(k)}_\ell(W_1, W_2, V^\star_\ell) - \ell^{-1} + \exp\left\{ -\xi^2 \tilde M(\ell) \right\}. \tag{47} \]
For $\tau \ge \tau^{(iv)}(\xi;\delta)$, $\exp\left\{ -\xi^2 \tilde M(\ell) \right\} - \ell^{-1} \le -\exp\left\{ -\gamma^2 \tilde M(\ell)/256 \right\}$, so that (47) implies the first inequality of the lemma: namely, (41). For the second inequality (i.e., (42)), on $H^{(iv)}_\tau$, (43) implies
\[ \hat\Delta^{(k)}_\ell(W_1, W_2, V^\star_\ell) \le \bar p_{\gamma/4}(k,\ell,\ell) + 3\ell^{-1}. \]
(48)
Also, by Lemma 43 (with $\alpha = 1/2$, $\zeta = \gamma/4$, and $\beta = \xi/\zeta < 1 - \sqrt\alpha$), for $\tau \ge \tau^{(iv)}(\xi;\delta)$, on $H_\tau(\delta) \cap H^{(i)}_\tau \cap H^{(iii)}_\tau(\xi)$,
\[ \bar p_{\gamma/4}(k,\ell,\ell) \le P\left( x : p_x(k,\ell) \ge \gamma/8 \right) + \exp\left\{ -\xi^2 \tilde M(\ell) \right\}. \tag{49} \]
Thus, combining (48) with (49) yields
\[ \hat\Delta^{(k)}_\ell(W_1, W_2, V^\star_\ell) \le P\left( x : p_x(k,\ell) \ge \gamma/8 \right) + 3\ell^{-1} + \exp\left\{ -\xi^2 \tilde M(\ell) \right\}. \]
For $\tau \ge \tau^{(iv)}(\xi;\delta)$, we have $\exp\left\{ -\xi^2 \tilde M(\ell) \right\} \le \ell^{-1}$, which establishes (42).

For $n \in \mathbb N$ and $k \in \{1, \ldots, d+1\}$, define the set
\[ \mathcal U^{(k)}_n = \left\{ m_n + 1, \ldots, m_n + \left\lfloor n \Big/ \left( 6 \cdot 2^k \hat\Delta^{(k)}_{m_n}(W_1, W_2, V) \right) \right\rfloor \right\}, \]
where $m_n = \lfloor n/3 \rfloor$; $\mathcal U^{(k)}_n$ represents the set of indices processed in the inner loop of Meta-Algorithm 1 for the specified value of $k$.

Lemma 45 There are $(f, \mathbb C, P, \gamma)$-dependent constants $\hat c_1, \hat c_2 \in (0,\infty)$ such that, for any $\varepsilon \in (0,1)$ and integer $n \ge \hat c_1 \ln(\hat c_2/\varepsilon)$, on an event $\hat H_n(\varepsilon)$ with
\[ P(\hat H_n(\varepsilon)) \ge 1 - (3/4)\varepsilon, \tag{50} \]
we have, for $V = V^\star_{m_n}$, $\forall k \in \{1, \ldots, \tilde d_f\}$,
\[ \left| \left\{ m \in \mathcal U^{(k)}_n : \hat\Delta^{(k)}_m(X_m, W_2, V) \ge \gamma \right\} \right| \le \left\lfloor n / \left( 3 \cdot 2^k \right) \right\rfloor, \tag{51} \]
\[ \hat\Delta^{(\tilde d_f)}_{m_n}(W_1, W_2, V) \le \Delta^{(\gamma/8)}_n(\varepsilon) + 4 m_n^{-1}, \tag{52} \]
and $\forall m \in \mathcal U^{(\tilde d_f)}_n$,
\[ \hat\Delta^{(\tilde d_f)}_m(X_m, W_2, V) < \gamma \implies \hat\Gamma^{(\tilde d_f)}_m(X_m, -f(X_m), W_2, V) < \hat\Gamma^{(\tilde d_f)}_m(X_m, f(X_m), W_2, V). \tag{53} \] ⋄
Proof Suppose $n \ge \hat c_1 \ln(\hat c_2/\varepsilon)$, where
\[ \hat c_1 = \max\left\{ \frac{2^{\tilde d_f + 12}}{\tilde\delta_f \gamma^2}, \frac{24}{r_{1/16}}, \frac{24}{r_{(1-\gamma)/6}}, 3\tau^* \right\} \quad \text{and} \quad \hat c_2 = \max\left\{ 4\left( c^{(i)} + c^{(ii)} + c^{(iii)}(\gamma/16) + 6\tilde d_f \right), 4\left( \frac{4e}{r_{1/16}} \right)^d, 4\left( \frac{4e}{r_{(1-\gamma)/6}} \right)^d \right\}. \]
In particular, we have chosen $\hat c_1$ and $\hat c_2$ large enough so that
\[ m_n \ge \max\left\{ \tau(1/16;\, \varepsilon/2),\, \tau^{(iv)}(\gamma/16;\, \varepsilon/2),\, \tau((1-\gamma)/6;\, \varepsilon/2),\, \tau^* \right\}. \]
We begin with (51).
By Lemmas 43 and 44, on the event $\hat H^{(1)}_n(\varepsilon) = H_{m_n}(\varepsilon/2) \cap H^{(i)}_{m_n} \cap H^{(iii)}_{m_n}(\gamma/16) \cap H^{(iv)}_{m_n}$, $\forall m \in \mathcal U^{(k)}_n$ and $\forall k \in \{1, \ldots, \tilde d_f\}$,
\[ \bar p_\gamma(k, m_n, m) \le P\left( x : p_x(k, m_n) \ge \gamma/2 \right) + \exp\left\{ -\gamma^2 \tilde M(m)/256 \right\} \le P\left( x : p_x(k, m_n) \ge \gamma/2 \right) + \exp\left\{ -\gamma^2 \tilde M(m_n)/256 \right\} \le \hat\Delta^{(k)}_{m_n}(W_1, W_2, V). \tag{54} \]
Recall that $\left\{ X_m : m \in \mathcal U^{(k)}_n \right\}$ is a sample of size $\left\lfloor n / \left( 6 \cdot 2^k \hat\Delta^{(k)}_{m_n}(W_1, W_2, V) \right) \right\rfloor$, conditionally i.i.d. (given $(W_1, W_2, V)$) with conditional distributions $P$. Thus, $\forall k \in \{1, \ldots, \tilde d_f\}$, on $\hat H^{(1)}_n(\varepsilon)$,
\[ P\left( \left| \left\{ m \in \mathcal U^{(k)}_n : \hat\Delta^{(k)}_m(X_m, W_2, V) \ge \gamma \right\} \right| > n / \left( 3 \cdot 2^k \right) \,\middle|\, W_1, W_2, V \right) \le P\left( \left| \left\{ m \in \mathcal U^{(k)}_n : \hat\Delta^{(k)}_m(X_m, W_2, V) \ge \gamma \right\} \right| > 2 \left| \mathcal U^{(k)}_n \right| \hat\Delta^{(k)}_{m_n}(W_1, W_2, V) \,\middle|\, W_1, W_2, V \right) \]
\[ \le P\left( B\left( \left| \mathcal U^{(k)}_n \right|, \hat\Delta^{(k)}_{m_n}(W_1, W_2, V) \right) > 2 \left| \mathcal U^{(k)}_n \right| \hat\Delta^{(k)}_{m_n}(W_1, W_2, V) \,\middle|\, W_1, W_2, V \right), \tag{55} \]
where this last inequality follows from (54), and $B(u, p) \sim \mathrm{Binomial}(u, p)$ is independent of $W_1, W_2, V$ (for any fixed $u$ and $p$). By a Chernoff bound, (55) is at most
\[ \exp\left\{ -\left\lfloor n / \left( 6 \cdot 2^k \hat\Delta^{(k)}_{m_n}(W_1, W_2, V) \right) \right\rfloor \hat\Delta^{(k)}_{m_n}(W_1, W_2, V) / 3 \right\} \le \exp\left\{ 1 - n / \left( 18 \cdot 2^k \right) \right\}. \]
By the law of total probability and a union bound, there exists an event $\hat H^{(2)}_n$ with $P\left( \hat H^{(1)}_n(\varepsilon) \setminus \hat H^{(2)}_n \right) \le \tilde d_f \cdot \exp\left\{ 1 - n / \left( 18 \cdot 2^{\tilde d_f} \right) \right\}$ such that, on $\hat H^{(1)}_n(\varepsilon) \cap \hat H^{(2)}_n$, (51) holds. Next, by Lemma 44, on $\hat H^{(1)}_n(\varepsilon)$,
\[ \hat\Delta^{(\tilde d_f)}_{m_n}(W_1, W_2, V) \le P\left( x : p_x\left( \tilde d_f, m_n \right) \ge \gamma/8 \right) + 4 m_n^{-1}, \]
and by Lemma 38, on $\hat H^{(1)}_n(\varepsilon)$, this is at most $\Delta^{(\gamma/8)}_n(\varepsilon) + 4 m_n^{-1}$, which establishes (52). Finally, Lemma 41 implies that on $\hat H^{(1)}_n(\varepsilon) \cap H^{(ii)}_{m_n}$, $\forall m \in \mathcal U^{(\tilde d_f)}_n$, (53) holds.
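The Chernoff bound applied to (55) is the multiplicative upper-tail form with $\delta = 1$: $P(B(u,p) > 2up) \le \exp\{-up/3\}$ (roughly, the extra factor of $e$ in the displayed bound allows for the floor in the sample size). A minimal numeric check of this form against the exact binomial tail, on an illustrative grid of $(u, p)$ values of my choosing:

```python
from math import comb, exp

def binom_upper_tail_strict(u: int, p: float, t: float) -> float:
    """Exact P(X > t) for X ~ Binomial(u, p)."""
    return sum(comb(u, i) * p**i * (1 - p)**(u - i)
               for i in range(u + 1) if i > t)

# Multiplicative Chernoff bound with delta = 1: P(X > 2*u*p) <= exp(-u*p/3).
for u, p in [(20, 0.05), (30, 0.1), (50, 0.2), (100, 0.5)]:
    mu = u * p
    assert binom_upper_tail_strict(u, p, 2 * mu) <= exp(-mu / 3)
```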
Thus, defining $\hat H_n(\varepsilon) = \hat H^{(1)}_n(\varepsilon) \cap \hat H^{(2)}_n \cap H^{(ii)}_{m_n}$, it remains only to establish (50). By a union bound, we have
\[ 1 - P\left( \hat H_n(\varepsilon) \right) \le \left( 1 - P(H_{m_n}(\varepsilon/2)) \right) + \left( 1 - P\left( H^{(i)}_{m_n} \right) \right) + P\left( H^{(i)}_{m_n} \setminus H^{(ii)}_{m_n} \right) + P\left( H^{(i)}_{m_n} \setminus H^{(iii)}_{m_n}(\gamma/16) \right) + \left( 1 - P\left( H^{(iv)}_{m_n} \right) \right) + P\left( \hat H^{(1)}_n(\varepsilon) \setminus \hat H^{(2)}_n \right) \]
\[ \le \varepsilon/2 + c^{(i)} \cdot \exp\left\{ -\tilde M(m_n)/4 \right\} + c^{(ii)} \cdot \exp\left\{ -\tilde M(m_n)^{1/3}/60 \right\} + c^{(iii)}(\gamma/16) \cdot \exp\left\{ -\tilde M(m_n)\gamma^2/256 \right\} + 3\tilde d_f \cdot \exp\{-2 m_n\} + \tilde d_f \cdot \exp\left\{ 1 - n / \left( 18 \cdot 2^{\tilde d_f} \right) \right\} \]
\[ \le \varepsilon/2 + \left( c^{(i)} + c^{(ii)} + c^{(iii)}(\gamma/16) + 6\tilde d_f \right) \cdot \exp\left\{ -n \tilde\delta_f \gamma^2 2^{-\tilde d_f - 12} \right\}. \tag{56} \]
We have chosen $n$ large enough so that (56) is at most $(3/4)\varepsilon$, which establishes (50).

The following result is a slightly stronger version of Theorem 6.

Lemma 46 For any passive learning algorithm $\mathcal A_p$, if $\mathcal A_p$ achieves a label complexity $\Lambda_p$ with $\infty > \Lambda_p(\varepsilon, f, P) = \omega(\log(1/\varepsilon))$, then Meta-Algorithm 1, with $\mathcal A_p$ as its argument, achieves a label complexity $\Lambda_a$ such that $\Lambda_a(3\varepsilon, f, P) = o(\Lambda_p(\varepsilon, f, P))$. ⋄
Proof Suppose $\mathcal A_p$ achieves label complexity $\Lambda_p$ with $\infty > \Lambda_p(\varepsilon, f, P) = \omega(\log(1/\varepsilon))$. Let $\varepsilon \in (0,1)$, define
\[ L(n;\varepsilon) = \left\lfloor n \Big/ \left( 6 \cdot 2^{\tilde d_f} \left( \Delta^{(\gamma/8)}_n(\varepsilon) + 4 m_n^{-1} \right) \right) \right\rfloor \]
(for any $n \in \mathbb N$), and let $L^{-1}(m;\varepsilon) = \max\{ n \in \mathbb N : L(n;\varepsilon) < m \}$ (for any $m \in (0,\infty)$). Define $c_1 = \max\left\{ \hat c_1, 2 \cdot 6^3 (d+1) \tilde d_f \ln(e(d+1)) \right\}$ and $c_2 = \max\{ \hat c_2, 4e(d+1) \}$, and suppose
\[ n \ge \max\left\{ c_1 \ln(c_2/\varepsilon),\, 1 + L^{-1}(\Lambda_p(\varepsilon, f, P); \varepsilon) \right\}. \]
Consider running Meta-Algorithm 1 with $\mathcal A_p$ and $n$ as inputs, while $f$ is the target function and $P$ is the data distribution.
Letting $\hat h_n$ denote the classifier returned by Meta-Algorithm 1, Lemma 34 implies that on an event $\hat E_n$ with
\[ P(\hat E_n) \ge 1 - e(d+1) \cdot \exp\left\{ -\lfloor n/3 \rfloor \Big/ \left( 72 \tilde d_f (d+1) \ln(e(d+1)) \right) \right\} \ge 1 - \varepsilon/4, \]
we have $\mathrm{er}(\hat h_n) \le 2\, \mathrm{er}\left( \mathcal A_p\left( \mathcal L_{\tilde d_f} \right) \right)$. By a union bound, the event $\hat G_n(\varepsilon) = \hat E_n \cap \hat H_n(\varepsilon)$ has $P\left( \hat G_n(\varepsilon) \right) \ge 1 - \varepsilon$. Thus,
\[ \mathbb E\left[ \mathrm{er}\left( \hat h_n \right) \right] \le \mathbb E\left[ \mathbb 1_{\hat G_n(\varepsilon)} \mathbb 1\left[ |\mathcal L_{\tilde d_f}| \ge \Lambda_p(\varepsilon, f, P) \right] \mathrm{er}\left( \hat h_n \right) \right] + P\left( \hat G_n(\varepsilon) \cap \left\{ |\mathcal L_{\tilde d_f}| < \Lambda_p(\varepsilon, f, P) \right\} \right) + P\left( \hat G_n(\varepsilon)^{\mathrm c} \right) \]
\[ \le \mathbb E\left[ \mathbb 1_{\hat G_n(\varepsilon)} \mathbb 1\left[ |\mathcal L_{\tilde d_f}| \ge \Lambda_p(\varepsilon, f, P) \right] 2\, \mathrm{er}\left( \mathcal A_p\left( \mathcal L_{\tilde d_f} \right) \right) \right] + P\left( \hat G_n(\varepsilon) \cap \left\{ |\mathcal L_{\tilde d_f}| < \Lambda_p(\varepsilon, f, P) \right\} \right) + \varepsilon. \tag{57} \]
On $\hat G_n(\varepsilon)$, (52) of Lemma 45 implies $|\mathcal L_{\tilde d_f}| \ge L(n;\varepsilon)$, and we chose $n$ large enough so that $L(n;\varepsilon) \ge \Lambda_p(\varepsilon, f, P)$. Thus, the second term in (57) is zero, and we have
\[ \mathbb E\left[ \mathrm{er}\left( \hat h_n \right) \right] \le 2 \cdot \mathbb E\left[ \mathbb 1_{\hat G_n(\varepsilon)} \mathbb 1\left[ |\mathcal L_{\tilde d_f}| \ge \Lambda_p(\varepsilon, f, P) \right] \mathrm{er}\left( \mathcal A_p\left( \mathcal L_{\tilde d_f} \right) \right) \right] + \varepsilon = 2 \cdot \mathbb E\left[ \mathbb E\left[ \mathbb 1_{\hat G_n(\varepsilon)} \mathrm{er}\left( \mathcal A_p\left( \mathcal L_{\tilde d_f} \right) \right) \,\middle|\, |\mathcal L_{\tilde d_f}| \right] \mathbb 1\left[ |\mathcal L_{\tilde d_f}| \ge \Lambda_p(\varepsilon, f, P) \right] \right] + \varepsilon. \tag{58} \]
Note that for any $\ell$ with $P(|\mathcal L_{\tilde d_f}| = \ell) > 0$, the conditional distribution of $\left\{ X_m : m \in \mathcal U^{(\tilde d_f)}_n \right\}$ given $\left\{ |\mathcal L_{\tilde d_f}| = \ell \right\}$ is simply the product $P^\ell$ (i.e., conditionally i.i.d.), which is the same as the distribution of $\{X_1, X_2, \ldots, X_\ell\}$. Furthermore, on $\hat G_n(\varepsilon)$, (51) implies that the $t < \lfloor 2n/3 \rfloor$ condition is always satisfied in Step 6 of Meta-Algorithm 1 while $k \le \tilde d_f$, and (53) implies that the inferred labels from Step 8 for $k = \tilde d_f$ are all correct. Therefore, for any such $\ell$ with $\ell \ge \Lambda_p(\varepsilon, f, P)$, we have
\[ \mathbb E\left[ \mathbb 1_{\hat G_n(\varepsilon)} \mathrm{er}\left( \mathcal A_p\left( \mathcal L_{\tilde d_f} \right) \right) \,\middle|\, \left\{ |\mathcal L_{\tilde d_f}| = \ell \right\} \right] \le \mathbb E\left[ \mathrm{er}\left( \mathcal A_p(\mathcal Z_\ell) \right) \right] \le \varepsilon. \]
In particular, this means (58) is at most $3\varepsilon$.
This implies that Meta-Algorithm 1, with $\mathcal A_p$ as its argument, achieves a label complexity $\Lambda_a$ such that
\[ \Lambda_a(3\varepsilon, f, P) \le \max\left\{ c_1 \ln(c_2/\varepsilon),\, 1 + L^{-1}(\Lambda_p(\varepsilon, f, P); \varepsilon) \right\}. \]
Since $\Lambda_p(\varepsilon, f, P) = \omega(\log(1/\varepsilon))$ implies $c_1 \ln(c_2/\varepsilon) = o(\Lambda_p(\varepsilon, f, P))$, it remains only to show that $L^{-1}(\Lambda_p(\varepsilon, f, P); \varepsilon) = o(\Lambda_p(\varepsilon, f, P))$. Note that $\forall \varepsilon \in (0,1)$, $L(1;\varepsilon) = 0$ and $L(n;\varepsilon)$ is diverging in $n$. Furthermore, by Lemma 38, we know that for any $\mathbb N$-valued $N(\varepsilon) = \omega(\log(1/\varepsilon))$, we have $\Delta^{(\gamma/8)}_{N(\varepsilon)}(\varepsilon) = o(1)$, which implies $L(N(\varepsilon); \varepsilon) = \omega(N(\varepsilon))$. Thus, since $\Lambda_p(\varepsilon, f, P) = \omega(\log(1/\varepsilon))$, Lemma 31 implies $L^{-1}(\Lambda_p(\varepsilon, f, P); \varepsilon) = o(\Lambda_p(\varepsilon, f, P))$, as desired. This establishes the result for an arbitrary $\gamma \in (0,1)$. To specialize to the specific procedure stated as Meta-Algorithm 1, we simply take $\gamma = 1/2$.

Proof [Theorem 6] Theorem 6 now follows immediately from Lemma 46. Specifically, we have proven Lemma 46 for an arbitrary distribution $P$ on $\mathcal X$, an arbitrary $f \in \mathrm{cl}(\mathbb C)$, and an arbitrary passive algorithm $\mathcal A_p$. Therefore, it certainly holds for every $P$ and $f \in \mathbb C$, and since every $(f, P) \in \mathrm{Nontrivial}(\Lambda_p)$ has $\infty > \Lambda_p(\varepsilon, f, P) = \omega(\log(1/\varepsilon))$, the implication that Meta-Algorithm 1 activizes every passive algorithm $\mathcal A_p$ for $\mathbb C$ follows.

Careful examination of the proofs above reveals that the "$3$" in Lemma 46 can be set to any arbitrary constant strictly larger than $1$, by an appropriate modification of the "$7/12$" threshold in ActiveSelect. In fact, if we were to replace Step 4 of ActiveSelect by instead selecting $\hat k = \operatorname*{argmin}_k \max_{j \neq k} m_{kj}$ (where $m_{kj} = \mathrm{er}_{Q_{kj}}(h_k)$ when $k < j$), then we could even make this a certain $(1 + o(1))$ function of $\varepsilon$, at the expense of larger constant factors in $\Lambda_a$.

Appendix C.
The Label Complexity of Meta-Algorithm 2

As mentioned, Theorem 10 is essentially implied by the details of the proof of Theorem 16 in Appendix D below. Here we present a proof of Theorem 13, along with two useful related lemmas. The first, Lemma 47, lower bounds the expected number of label requests Meta-Algorithm 2 would make while processing a given number of random unlabeled examples. The second, Lemma 48, bounds the amount by which each label request is expected to reduce the probability mass in the region of disagreement. Although we will only use Lemma 48 in our proof of Theorem 13, Lemma 47 may be of independent interest, as it provides additional insight into the behavior of disagreement-based methods as related to the disagreement coefficient, and is included for this reason. Throughout, we fix an arbitrary class $\mathbb C$, a target function $f \in \mathbb C$, and a distribution $P$, and we continue using the notational conventions of the proofs above, such as $V^\star_m = \{ h \in \mathbb C : \forall i \le m,\, h(X_i) = f(X_i) \}$ (with $V^\star_0 = \mathbb C$). Additionally, for $t \in \mathbb N$, define the random variable
\[ M(t) = \min\left\{ m \in \mathbb N : \sum_{\ell=1}^{m} \mathbb 1_{\mathrm{DIS}(V^\star_{\ell-1})}(X_\ell) = t \right\}, \]
which represents the index of the $t$-th unlabeled example Meta-Algorithm 2 would request the label of (assuming it has not yet halted). The two aforementioned lemmas are formally stated as follows.

Lemma 47 For any $r \in (0,1)$,
\[ \mathbb E\left[ \sum_{m=1}^{\lceil 1/r \rceil} \mathbb 1_{\mathrm{DIS}(V^\star_{m-1})}(X_m) \right] \ge \frac{P(\mathrm{DIS}(\mathrm B(f,r)))}{2r}. \] ⋄

Lemma 48 For any $r \in (0,1)$ and $n \in \mathbb N$,
\[ \mathbb E\left[ P\left( \mathrm{DIS}\left( V^\star_{M(n)} \right) \right) \right] \ge P(\mathrm{DIS}(\mathrm B(f,r))) - nr. \] ⋄

Before proving these lemmas, let us first mention their relevance to the disagreement coefficient analysis. Specifically, note that when $\theta_f(\varepsilon)$ is unbounded, there exist arbitrarily small values of $\varepsilon$ for which $P(\mathrm{DIS}(\mathrm B(f,\varepsilon)))/\varepsilon \approx \theta_f(\varepsilon)$, so that in particular $P(\mathrm{DIS}(\mathrm B(f,\varepsilon)))/\varepsilon \neq o(\theta_f(\varepsilon))$.
Therefore, Lemma 47 implies that the number of label requests Meta-Algorithm 2 makes among the first $\lceil 1/\varepsilon \rceil$ unlabeled examples is not $o(\theta_f(\varepsilon))$ (assuming it does not halt first). Likewise, one implication of Lemma 48 is that arriving at a region of disagreement with expected probability mass less than $P(\mathrm{DIS}(\mathrm B(f,\varepsilon)))/2$ requires a budget $n$ of at least $P(\mathrm{DIS}(\mathrm B(f,\varepsilon)))/(2\varepsilon)$, which is not $o(\theta_f(\varepsilon))$. We now present the proofs of Lemmas 47 and 48.

Proof [Lemma 47] Since
\[ \mathbb E\left[ \sum_{m=1}^{\lceil 1/r \rceil} \mathbb 1_{\mathrm{DIS}(V^\star_{m-1})}(X_m) \right] = \sum_{m=1}^{\lceil 1/r \rceil} \mathbb E\left[ P\left( X_m \in \mathrm{DIS}\left( V^\star_{m-1} \right) \,\middle|\, V^\star_{m-1} \right) \right] = \sum_{m=1}^{\lceil 1/r \rceil} \mathbb E\left[ P\left( \mathrm{DIS}\left( V^\star_{m-1} \right) \right) \right], \tag{59} \]
we focus on lower bounding $\mathbb E[P(\mathrm{DIS}(V^\star_m))]$ for $m \in \mathbb N \cup \{0\}$. Let $D_m = \mathrm{DIS}(V^\star_m \cap \mathrm B(f,r))$. Note that for any $x \in \mathrm{DIS}(\mathrm B(f,r))$, there exists some $h_x \in \mathrm B(f,r)$ with $h_x(x) \neq f(x)$, and if this $h_x \in V^\star_m$, then $x \in D_m$ as well. This means that, for all $x$,
\[ \mathbb 1_{D_m}(x) \ge \mathbb 1_{\mathrm{DIS}(\mathrm B(f,r))}(x) \cdot \mathbb 1_{V^\star_m}(h_x) = \mathbb 1_{\mathrm{DIS}(\mathrm B(f,r))}(x) \cdot \prod_{\ell=1}^{m} \mathbb 1_{\mathrm{DIS}(\{h_x, f\})^{\mathrm c}}(X_\ell). \]
Therefore,
\[ \mathbb E\left[ P(\mathrm{DIS}(V^\star_m)) \right] = P\left( X_{m+1} \in \mathrm{DIS}(V^\star_m) \right) \ge P(X_{m+1} \in D_m) = \mathbb E\left[ \mathbb E\left[ \mathbb 1_{D_m}(X_{m+1}) \,\middle|\, X_{m+1} \right] \right] \ge \mathbb E\left[ \mathbb E\left[ \mathbb 1_{\mathrm{DIS}(\mathrm B(f,r))}(X_{m+1}) \cdot \prod_{\ell=1}^{m} \mathbb 1_{\mathrm{DIS}(\{h_{X_{m+1}}, f\})^{\mathrm c}}(X_\ell) \,\middle|\, X_{m+1} \right] \right] \]
\[ = \mathbb E\left[ \prod_{\ell=1}^{m} P\left( h_{X_{m+1}}(X_\ell) = f(X_\ell) \,\middle|\, X_{m+1} \right) \mathbb 1_{\mathrm{DIS}(\mathrm B(f,r))}(X_{m+1}) \right] \tag{60} \]
\[ \ge \mathbb E\left[ (1-r)^m \mathbb 1_{\mathrm{DIS}(\mathrm B(f,r))}(X_{m+1}) \right] = (1-r)^m P(\mathrm{DIS}(\mathrm B(f,r))), \tag{61} \]
where the equality in (60) is by conditional independence of the $\mathbb 1_{\mathrm{DIS}(\{h_{X_{m+1}}, f\})^{\mathrm c}}(X_\ell)$ indicators given $X_{m+1}$, and the inequality in (61) is due to $h_{X_{m+1}} \in \mathrm B(f,r)$. This indicates that (59) is at least
\[ \sum_{m=1}^{\lceil 1/r \rceil} (1-r)^{m-1} P(\mathrm{DIS}(\mathrm B(f,r))) \ge \sum_{m=1}^{\lceil 1/r \rceil} (1 - (m-1)r) P(\mathrm{DIS}(\mathrm B(f,r))) = \lceil 1/r \rceil \left( 1 - \frac{\lceil 1/r \rceil - 1}{2}\, r \right) P(\mathrm{DIS}(\mathrm B(f,r))) \ge \frac{P(\mathrm{DIS}(\mathrm B(f,r)))}{2r}. \]
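The final chain of inequalities rests on the elementary fact that, with $N = \lceil 1/r \rceil$, $\sum_{m=1}^{N}(1-r)^{m-1} = (1-(1-r)^N)/r \ge 1/(2r)$, since $(1-r)^{1/r} \le e^{-1} < 1/2$. A short numeric check of this fact (the grid of $r$ values is illustrative):

```python
from math import ceil

def partial_geometric_sum(r: float) -> float:
    """sum_{m=1}^{ceil(1/r)} (1 - r)^(m - 1), as in the proof of Lemma 47."""
    n = ceil(1 / r)
    return sum((1 - r)**m for m in range(n))

# The partial sum is at least 1/(2r): equivalently, (1 - (1-r)^N)/r >= 1/(2r)
# whenever (1-r)^N <= 1/2, which holds since (1-r)^(1/r) <= 1/e < 1/2.
for r in [0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 0.9]:
    assert partial_geometric_sum(r) >= 1 / (2 * r)
```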
Proof [Lemma 48] For each $m \in \mathbb N \cup \{0\}$, let $D_m = \mathrm{DIS}(\mathrm B(f,r) \cap V^\star_m)$. For convenience, let $M(0) = 0$. We prove the result by induction. We clearly have $\mathbb E\left[ P\left( D_{M(0)} \right) \right] = \mathbb E[P(D_0)] = P(\mathrm{DIS}(\mathrm B(f,r)))$, which serves as our base case. Now fix any $n \in \mathbb N$, and take as the inductive hypothesis that $\mathbb E\left[ P\left( D_{M(n-1)} \right) \right] \ge P(\mathrm{DIS}(\mathrm B(f,r))) - (n-1)r$.

As in the proof of Lemma 47, for any $x \in D_{M(n-1)}$, there exists $h_x \in \mathrm B(f,r) \cap V^\star_{M(n-1)}$ with $h_x(x) \neq f(x)$; unlike the proof of Lemma 47, here $h_x$ is a random variable, determined by $V^\star_{M(n-1)}$. If $h_x$ is also in $V^\star_{M(n)}$, then $x \in D_{M(n)}$ as well. Thus, for all $x$,
\[ \mathbb 1_{D_{M(n)}}(x) \ge \mathbb 1_{D_{M(n-1)}}(x) \cdot \mathbb 1_{V^\star_{M(n)}}(h_x) = \mathbb 1_{D_{M(n-1)}}(x) \cdot \mathbb 1_{\mathrm{DIS}(\{h_x, f\})^{\mathrm c}}\left( X_{M(n)} \right), \]
where this last equality is due to the fact that every $m \in \{M(n-1)+1, \ldots, M(n)-1\}$ has $X_m \notin \mathrm{DIS}\left( V^\star_{m-1} \right)$, so that in particular $h_x(X_m) = f(X_m)$. Therefore, letting $X \sim P$ be independent of the data $\mathcal Z$,
\[ \mathbb E\left[ P\left( D_{M(n)} \right) \right] = \mathbb E\left[ \mathbb 1_{D_{M(n)}}(X) \right] \ge \mathbb E\left[ \mathbb 1_{D_{M(n-1)}}(X) \cdot \mathbb 1_{\mathrm{DIS}(\{h_X, f\})^{\mathrm c}}\left( X_{M(n)} \right) \right] = \mathbb E\left[ \mathbb 1_{D_{M(n-1)}}(X) \cdot P\left( h_X\left( X_{M(n)} \right) = f\left( X_{M(n)} \right) \,\middle|\, X, V^\star_{M(n-1)} \right) \right]. \tag{62} \]
The conditional distribution of $X_{M(n)}$ given $V^\star_{M(n-1)}$ is merely $P$, but with support restricted to $\mathrm{DIS}\left( V^\star_{M(n-1)} \right)$ and renormalized to a probability measure. Thus, since any $x \in D_{M(n-1)}$ has $\mathrm{DIS}(\{h_x, f\}) \subseteq \mathrm{DIS}\left( V^\star_{M(n-1)} \right)$, we have
\[ P\left( h_x\left( X_{M(n)} \right) \neq f\left( X_{M(n)} \right) \,\middle|\, V^\star_{M(n-1)} \right) = \frac{P(\mathrm{DIS}(\{h_x, f\}))}{P\left( \mathrm{DIS}\left( V^\star_{M(n-1)} \right) \right)} \le \frac{r}{P\left( D_{M(n-1)} \right)}, \]
where the inequality follows from $h_x \in \mathrm B(f,r)$ and $D_{M(n-1)} \subseteq \mathrm{DIS}\left( V^\star_{M(n-1)} \right)$.
Therefore, (62) is at least
\[ \mathbb E\left[ \mathbb 1_{D_{M(n-1)}}(X) \cdot \left( 1 - \frac{r}{P\left( D_{M(n-1)} \right)} \right) \right] = \mathbb E\left[ P\left( X \in D_{M(n-1)} \,\middle|\, D_{M(n-1)} \right) \cdot \left( 1 - \frac{r}{P\left( D_{M(n-1)} \right)} \right) \right] = \mathbb E\left[ P\left( D_{M(n-1)} \right) \cdot \left( 1 - \frac{r}{P\left( D_{M(n-1)} \right)} \right) \right] = \mathbb E\left[ P\left( D_{M(n-1)} \right) \right] - r. \]
By the inductive hypothesis, this is at least $P(\mathrm{DIS}(\mathrm B(f,r))) - nr$. Finally, noting that $\mathbb E\left[ P\left( \mathrm{DIS}\left( V^\star_{M(n)} \right) \right) \right] \ge \mathbb E\left[ P\left( D_{M(n)} \right) \right]$ completes the proof.

With Lemma 48 in hand, we are ready for the proof of Theorem 13.

Proof [Theorem 13] Let $\mathbb C$, $f$, $P$, and $\lambda$ be as in the theorem statement. For $m \in \mathbb N$, let $\lambda^{-1}(m) = \inf\{ \varepsilon > 0 : \lambda(\varepsilon) \le m \}$, or $1$ if this is not defined. We define $\mathcal A_p$ as a randomized algorithm such that, for $m \in \mathbb N$ and $\mathcal L \in (\mathcal X \times \{-1, +1\})^m$, $\mathcal A_p(\mathcal L)$ returns $f$ with probability $1 - \lambda^{-1}(|\mathcal L|)$ and returns $-f$ with probability $\lambda^{-1}(|\mathcal L|)$ (independent of the contents of $\mathcal L$). Note that, for any integer $m \ge \lambda(\varepsilon)$, $\mathbb E[\mathrm{er}(\mathcal A_p(\mathcal Z_m))] = \lambda^{-1}(m) \le \lambda^{-1}(\lambda(\varepsilon)) \le \varepsilon$. Therefore, $\mathcal A_p$ achieves some label complexity $\Lambda_p$ with $\Lambda_p(\varepsilon, f, P) = \lambda(\varepsilon)$ for all $\varepsilon > 0$.

If $\theta_f\left( \lambda(\varepsilon)^{-1} \right) \neq \omega(1)$, then since every label complexity $\Lambda_a$ is $\Omega(1)$, the result clearly holds. Otherwise, suppose $\theta_f\left( \lambda(\varepsilon)^{-1} \right) = \omega(1)$, and take any sequence of values $\varepsilon_i \to 0$ for which each $i$ has $\varepsilon_i \in (0, 1/2)$, $\theta_f\left( \lambda(2\varepsilon_i)^{-1} \right) \ge 12$, and $2\varepsilon_i$ a continuity point of $\lambda$; this is possible since $\lambda$ is monotone, and thus has at most countably many discontinuities. We have that $\theta_f\left( \lambda(2\varepsilon_i)^{-1} \right)$ diverges as $i \to \infty$, and thus so does $\lambda(2\varepsilon_i)$. This then implies that there exist values $r_i \to 0$ such that each $r_i > \lambda(2\varepsilon_i)^{-1}$ and $P(\mathrm{DIS}(\mathrm B(f, r_i)))/r_i \ge \theta_f\left( \lambda(2\varepsilon_i)^{-1} \right)/2$. Fix any $i \in \mathbb N$ and any $n \in \mathbb N$ with $n \le \theta_f\left( \lambda(2\varepsilon_i)^{-1} \right)/4$.
Consider running Meta-Algorithm 2 with arguments A_p and n; let L̂ denote the final value of the set L, and let m̌ denote the value of m upon reaching Step 6. Since 2ε_i is a continuity point of λ, any m < λ(2ε_i) and L ∈ (X × {−1, +1})^m has er(A_p(L)) = λ^{−1}(m) > 2ε_i. Therefore, we have

E[er(A_p(L̂))] ≥ 2ε_i P(|L̂| < λ(2ε_i)) = 2ε_i P(⌊n/(6Δ̂)⌋ < λ(2ε_i)) = 2ε_i P(Δ̂ > n/(6λ(2ε_i))) = 2ε_i (1 − P(Δ̂ ≤ n/(6λ(2ε_i)))). (63)

Since n ≤ θ_f(λ(2ε_i)^{−1})/4 ≤ P(DIS(B(f, r_i)))/(2r_i) < λ(2ε_i)P(DIS(B(f, r_i)))/2, we have

P(Δ̂ ≤ n/(6λ(2ε_i))) ≤ P(Δ̂ < P(DIS(B(f, r_i)))/12) ≤ P({P(DIS(V⋆_m̌)) < P(DIS(B(f, r_i)))/12} ∪ {Δ̂ < P(DIS(V⋆_m̌))}). (64)

Since m̌ ≤ M(⌈n/2⌉), monotonicity and a union bound imply this is at most

P(P(DIS(V⋆_{M(⌈n/2⌉)})) < P(DIS(B(f, r_i)))/12) + P(Δ̂ < P(DIS(V⋆_m̌))). (65)

Markov's inequality implies

P(P(DIS(V⋆_{M(⌈n/2⌉)})) < P(DIS(B(f, r_i)))/12) = P(P(DIS(B(f, r_i))) − P(DIS(V⋆_{M(⌈n/2⌉)})) > (11/12)P(DIS(B(f, r_i)))) ≤ E[P(DIS(B(f, r_i))) − P(DIS(V⋆_{M(⌈n/2⌉)}))] / ((11/12)P(DIS(B(f, r_i)))) = (12/11)(1 − E[P(DIS(V⋆_{M(⌈n/2⌉)}))]/P(DIS(B(f, r_i)))).

Lemma 48 implies this is at most (12/11)⌈n/2⌉r_i/P(DIS(B(f, r_i))) ≤ (12/11)⌈P(DIS(B(f, r_i)))/(4r_i)⌉ r_i/P(DIS(B(f, r_i))). Since any a ≥ 3/2 has ⌈a⌉ ≤ (3/2)a, and θ_f(λ(2ε_i)^{−1}) ≥ 12 implies P(DIS(B(f, r_i)))/(4r_i) ≥ 3/2, we have ⌈P(DIS(B(f, r_i)))/(4r_i)⌉ ≤ (3/8)P(DIS(B(f, r_i)))/r_i, so that (12/11)⌈P(DIS(B(f, r_i)))/(4r_i)⌉ r_i/P(DIS(B(f, r_i))) ≤ 9/22. Combining the above, we have

P(P(DIS(V⋆_{M(⌈n/2⌉)})) < P(DIS(B(f, r_i)))/12) ≤ 9/22.
(66)

Examining the second term in (65), Hoeffding's inequality and the definition of Δ̂ from (14) imply

P(Δ̂ < P(DIS(V⋆_m̌))) = E[P(Δ̂ < P(DIS(V⋆_m̌)) | V⋆_m̌, m̌)] ≤ E[e^{−8m̌}] ≤ e^{−8} < 1/11. (67)

Combining (63) through (67) implies E[er(A_p(L̂))] > 2ε_i(1 − 9/22 − 1/11) = ε_i. Thus, for any label complexity Λ_a achieved by running Meta-Algorithm 2 with A_p as its argument, we must have Λ_a(ε_i, f, P) > θ_f(λ(2ε_i)^{−1})/4. Since this is true for all i ∈ ℕ, and ε_i → 0 as i → ∞, this establishes the result.

Appendix D. The Label Complexity of Meta-Algorithm 3

As in Appendix B, we will assume C is a fixed VC class, P is some arbitrary distribution, and f ∈ cl(C) is an arbitrary fixed function. We continue using the notation introduced above: in particular, S^k(H) = {S ∈ X^k : H shatters S}, S̄^k(H) = X^k \ S^k(H), ∂̄^k_H f = X^k \ ∂^k_H f, and δ̃_f = P^{d̃_f−1}(∂^{d̃_f−1}_C f). Also, as above, we will prove a more general result replacing the "1/2" in Steps 5, 9, and 12 of Meta-Algorithm 3 with an arbitrary value γ ∈ (0, 1); thus, the specific result for the stated algorithm will be obtained by taking γ = 1/2. For the estimators P̂_m in Meta-Algorithm 3, we take precisely the same definitions as given in Appendix B.1 for the estimators in Meta-Algorithm 1. In particular, the quantities Δ̂^{(k)}_m(x, W2, H), Δ̂^{(k)}_m(W1, W2, H), Γ̂^{(k)}_m(x, y, W2, H), and M^{(k)}_m(H) are all defined as in Appendix B.1, and the P̂_m estimators are again defined as in (12), (13) and (14).
Also, we sometimes refer to quantities defined above, such as p̄_ζ(k, ℓ, m) (defined in (35)), as well as the various events from the lemmas of the previous appendix, such as H_τ(δ), H′, H^{(i)}_τ, H^{(ii)}_τ, H^{(iii)}_τ(ζ), H^{(iv)}_τ, and G^{(i)}_τ.

D.1 Proof of Theorem 16

Throughout the proof, we will make reference to the sets V_m defined in Meta-Algorithm 3. Also let V^{(k)} denote the final value of V obtained for the specified value of k in Meta-Algorithm 3. Both V_m and V^{(k)} are implicitly functions of the budget, n, given to Meta-Algorithm 3. As above, we continue to denote V⋆_m = {h ∈ C : ∀i ≤ m, h(X_i) = f(X_i)}. One important fact we will use repeatedly below is that if V_m = V⋆_m for some m, then since Lemma 35 implies that V⋆_m ≠ ∅ on H′, we must have that all of the previous ŷ values were consistent with f, which means that ∀ℓ ≤ m, V_ℓ = V⋆_ℓ. In particular, if V^{(k′)} = V⋆_m for the largest m value obtained while k = k′ in Meta-Algorithm 3, then V_ℓ = V⋆_ℓ for all ℓ obtained while k ≤ k′ in Meta-Algorithm 3. Additionally, define m̃_n = ⌊n/24⌋, and note that the value m = ⌈n/6⌉ is obtained while k = 1 in Meta-Algorithm 3. We also define the following quantities, which we will show are typically equal to related quantities in Meta-Algorithm 3. Define m̂_0 = 0, T⋆_0 = ⌈2n/3⌉, and t̂_0 = 0, and for each k ∈ {1, …
…, d + 1}, inductively define

T⋆_k = T⋆_{k−1} − t̂_{k−1},
I⋆_{mk} = 𝟙_{[γ,∞)}(Δ̂^{(k)}_m(X_m, W2, V⋆_{m−1})), ∀m ∈ ℕ,
m̌_k = min({m ≥ m̂_{k−1} : Σ_{ℓ=m̂_{k−1}+1}^{m} I⋆_{ℓk} = ⌈T⋆_k/4⌉} ∪ {max{k · 2^n + 1, m̂_{k−1}}}),
m̂_k = m̌_k + ⌊T⋆_k / (3Δ̂^{(k)}_{m̌_k}(W1, W2, V⋆_{m̌_k}))⌋,
Ǔ_k = (m̂_{k−1}, m̌_k] ∩ ℕ,  Û_k = (m̌_k, m̂_k] ∩ ℕ,
C⋆_{mk} = 𝟙_{[0, ⌊3T⋆_k/4⌋)}(Σ_{ℓ=m̂_{k−1}+1}^{m−1} I⋆_{ℓk}),
Q⋆_k = Σ_{m ∈ Û_k} I⋆_{mk} · C⋆_{mk}, and t̂_k = Q⋆_k + Σ_{m ∈ Ǔ_k} I⋆_{mk}.

The meaning of these values can be understood in the context of Meta-Algorithm 3, under the condition that V_m = V⋆_m for values of m obtained for the respective value of k. Specifically, under this condition, T⋆_k corresponds to T_k; t̂_k represents the final value of t for round k; m̌_k represents the value of m upon reaching Step 9 in round k, while m̂_k represents the value of m at the end of round k; Ǔ_k corresponds to the set of indices arrived at in Step 4 during round k, while Û_k corresponds to the set of indices arrived at in Step 11 during round k; for m ∈ Ǔ_k, I⋆_{mk} indicates whether the label of X_m is requested, while for m ∈ Û_k, I⋆_{mk} · C⋆_{mk} indicates whether the label of X_m is requested. Finally, Q⋆_k corresponds to the number of label requests in Step 13 during round k. In particular, note m̌_1 ≥ m̃_n.

Lemma 49 For any τ ∈ ℕ, on the event H′ ∩ G^{(i)}_τ, ∀k, ℓ, m ∈ ℕ with k ≤ d̃_f, ∀x ∈ X, for any sets H and H′ with V⋆_ℓ ⊆ H ⊆ H′ ⊆ B(f, r_{1/6}), if either k = 1 or m ≥ τ, then Δ̂^{(k)}_m(x, W2, H) ≤ (3/2)Δ̂^{(k)}_m(x, W2, H′). In particular, for any δ ∈ (0, 1) and τ ≥ τ(1/6; δ), on H′ ∩ H_τ(δ) ∩ G^{(i)}_τ, ∀k, ℓ, ℓ′, m ∈ ℕ with m ≥ τ, ℓ ≥ ℓ′ ≥ τ, and k ≤ d̃_f, ∀x ∈ X, Δ̂^{(k)}_m(x, W2, V⋆_ℓ) ≤ (3/2)Δ̂^{(k)}_m(x, W2, V⋆_{ℓ′}).
⋄

Proof First note that ∀m ∈ ℕ, ∀x ∈ X, Δ̂^{(1)}_m(x, W2, H) = 𝟙_{DIS(H)}(x) ≤ 𝟙_{DIS(H′)}(x) = Δ̂^{(1)}_m(x, W2, H′), so the result holds for k = 1. Lemma 35, Lemma 40, and monotonicity of M^{(k)}_m(·) imply that on H′ ∩ G^{(i)}_τ, for any m ≥ τ and k ∈ {2, …, d̃_f},

M^{(k)}_m(H) ≥ Σ_{i=1}^{m³} 𝟙_{∂^{k−1}_C f}(S^{(k)}_i) ≥ (2/3)M^{(k)}_m(B(f, r_{1/6})) ≥ (2/3)M^{(k)}_m(H′),

so that ∀x ∈ X,

Δ̂^{(k)}_m(x, W2, H) = M^{(k)}_m(H)^{−1} Σ_{i=1}^{m³} 𝟙_{S^k(H)}(S^{(k)}_i ∪ {x}) ≤ M^{(k)}_m(H)^{−1} Σ_{i=1}^{m³} 𝟙_{S^k(H′)}(S^{(k)}_i ∪ {x}) ≤ (3/2)M^{(k)}_m(H′)^{−1} Σ_{i=1}^{m³} 𝟙_{S^k(H′)}(S^{(k)}_i ∪ {x}) = (3/2)Δ̂^{(k)}_m(x, W2, H′).

The final claim follows from Lemma 29.

Lemma 50 For any k ∈ {1, …, d + 1}, if n ≥ 3 · 4^{k−1}, then T⋆_k ≥ 4^{1−k}(2n/3) and t̂_k ≤ ⌊3T⋆_k/4⌋. ⋄

Proof Recall T⋆_1 = ⌈2n/3⌉ ≥ 2n/3. If n ≥ 2, we also have ⌊3T⋆_1/4⌋ ≥ ⌈T⋆_1/4⌉, so that (due to the C⋆_{m1} factors) t̂_1 ≤ ⌊3T⋆_1/4⌋. For the purpose of induction, suppose some k ∈ {2, …, d + 1} has n ≥ 3 · 4^{k−1}, T⋆_{k−1} ≥ 4^{2−k}(2n/3), and t̂_{k−1} ≤ ⌊3T⋆_{k−1}/4⌋. Then T⋆_k = T⋆_{k−1} − t̂_{k−1} ≥ T⋆_{k−1}/4 ≥ 4^{1−k}(2n/3), and since n ≥ 3 · 4^{k−1}, we also have ⌊3T⋆_k/4⌋ ≥ ⌈T⋆_k/4⌉, so that t̂_k ≤ ⌊3T⋆_k/4⌋ (again, due to the C⋆_{mk} factors). Thus, by the principle of induction, this holds for all k ∈ {1, …, d + 1} with n ≥ 3 · 4^{k−1}.

The next lemma indicates that the "t < ⌊3T_k/4⌋" constraint in Step 12 is redundant for k ≤ d̃_f. It is similar to (51) in Lemma 45, but is made only slightly more complicated by the fact that the Δ̂^{(k)} estimate is calculated in Step 9 based on a set V_m different from the ones used to decide whether or not to request a label in Step 12.
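The recursion in Lemma 50 is easy to check numerically. The sketch below (an illustration, not part of the proof) takes the adversarial choice t̂_k = ⌊3T⋆_k/4⌋ at every round and confirms the bound T⋆_k ≥ 4^{1−k}(2n/3) whenever n ≥ 3 · 4^{k−1}:

```python
import math

def worst_case_T(n, K):
    # T*_1 = ceil(2n/3); adversarially set t_k = floor(3 T*_k / 4) each round,
    # so T*_{k+1} = T*_k - floor(3 T*_k / 4)
    T = math.ceil(2 * n / 3)
    Ts = [T]
    for _ in range(K - 1):
        T = T - (3 * T) // 4
        Ts.append(T)
    return Ts

for n in [10, 100, 1000, 12345]:
    for k, T in enumerate(worst_case_T(n, 6), start=1):
        if n >= 3 * 4 ** (k - 1):          # the lemma's condition on n
            assert T >= 4 ** (1 - k) * (2 * n / 3)
```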
Lemma 51 There exist (C, P, f, γ)-dependent constants c̃^{(i)}_1, c̃^{(i)}_2 ∈ [1, ∞) such that, for any δ ∈ (0, 1) and any integer n ≥ c̃^{(i)}_1 ln(c̃^{(i)}_2/δ), on an event H̃^{(i)}_n(δ) ⊆ G^{(i)}_{m̃_n} ∩ H_{m̃_n}(δ) ∩ H^{(i)}_{m̃_n} ∩ H^{(iii)}_{m̃_n}(γ/16) ∩ H^{(iv)}_{m̃_n} with P(H̃^{(i)}_n(δ)) ≥ 1 − 2δ, ∀k ∈ {1, …, d̃_f}, t̂_k = Σ_{m=m̂_{k−1}+1}^{m̂_k} I⋆_{mk} ≤ 3T⋆_k/4. ⋄

Proof Define the constants

c̃^{(i)}_1 = max{192d/r_{3/32}, 3 · 4^{d̃_f+6}/(δ̃_f γ²)},  c̃^{(i)}_2 = max{8e/r_{3/32}, (c^{(i)} + c^{(iii)}(γ/16) + 125 d̃_f δ̃_f^{−1})},

and let n^{(i)}(δ) = c̃^{(i)}_1 ln(c̃^{(i)}_2/δ). Fix any integer n ≥ n^{(i)}(δ) and consider the event

H̃^{(1)}_n(δ) = G^{(i)}_{m̃_n} ∩ H_{m̃_n}(δ) ∩ H^{(i)}_{m̃_n} ∩ H^{(iii)}_{m̃_n}(γ/16) ∩ H^{(iv)}_{m̃_n}.

By Lemma 49 and the fact that m̌_k ≥ m̃_n for all k ≥ 1, since n ≥ n^{(i)}(δ) ≥ 24τ(1/6; δ), on H̃^{(1)}_n(δ), ∀k ∈ {1, …, d̃_f}, ∀m ∈ Û_k,

Δ̂^{(k)}_m(X_m, W2, V⋆_{m−1}) ≤ (3/2)Δ̂^{(k)}_m(X_m, W2, V⋆_{m̌_k}). (68)

Now fix any k ∈ {1, …, d̃_f}. Since n ≥ n^{(i)}(δ) ≥ 27 · 4^{k−1}, Lemma 50 implies T⋆_k ≥ 18, which means that 3T⋆_k/4 − ⌈T⋆_k/4⌉ ≥ 4T⋆_k/9. Also note that Σ_{m ∈ Ǔ_k} I⋆_{mk} ≤ ⌈T⋆_k/4⌉. Let N_k = (4/3)Δ̂^{(k)}_{m̌_k}(W1, W2, V⋆_{m̌_k})|Û_k|; note that |Û_k| = ⌊T⋆_k/(3Δ̂^{(k)}_{m̌_k}(W1, W2, V⋆_{m̌_k}))⌋, so that N_k ≤ (4/9)T⋆_k. Thus, we have

P(H̃^{(1)}_n(δ) ∩ {Σ_{m=m̂_{k−1}+1}^{m̂_k} I⋆_{mk} > 3T⋆_k/4}) ≤ P(H̃^{(1)}_n(δ) ∩ {Σ_{m ∈ Û_k} I⋆_{mk} > 4T⋆_k/9}) ≤ P(H̃^{(1)}_n(δ) ∩ {Σ_{m ∈ Û_k} I⋆_{mk} > N_k}) ≤ P(H̃^{(1)}_n(δ) ∩ {Σ_{m ∈ Û_k} 𝟙_{[2γ/3,∞)}(Δ̂^{(k)}_m(X_m, W2, V⋆_{m̌_k})) > N_k}), (69)

where this last inequality is by (68).
To simplify notation, define Z̃_k = (T⋆_k, m̌_k, W1, W2, V⋆_{m̌_k}). By Lemmas 43 and 44 (with β = 3/32, ζ = 2γ/3, α = 3/4, and ξ = γ/16), since n ≥ n^{(i)}(δ) ≥ 24 · max{τ^{(iv)}(γ/16; δ), τ(3/32; δ)}, on H̃^{(1)}_n(δ), ∀m ∈ Û_k,

p̄_{2γ/3}(k, m̌_k, m) ≤ P(x : p_x(k, m̌_k) ≥ γ/2) + exp{−γ²M̃(m)/256} ≤ P(x : p_x(k, m̌_k) ≥ γ/2) + exp{−γ²M̃(m̌_k)/256} ≤ Δ̂^{(k)}_{m̌_k}(W1, W2, V⋆_{m̌_k}).

Letting G̃′_n(k) denote the event that p̄_{2γ/3}(k, m̌_k, m) ≤ Δ̂^{(k)}_{m̌_k}(W1, W2, V⋆_{m̌_k}), we see that G̃′_n(k) ⊇ H̃^{(1)}_n(δ). Thus, since the 𝟙_{[2γ/3,∞)}(Δ̂^{(k)}_m(X_m, W2, V⋆_{m̌_k})) variables are conditionally independent given Z̃_k for m ∈ Û_k, each with respective conditional distribution Bernoulli(p̄_{2γ/3}(k, m̌_k, m)), the law of total probability and a Chernoff bound imply that (69) is at most

P(G̃′_n(k) ∩ {Σ_{m ∈ Û_k} 𝟙_{[2γ/3,∞)}(Δ̂^{(k)}_m(X_m, W2, V⋆_{m̌_k})) > N_k}) = E[P(Σ_{m ∈ Û_k} 𝟙_{[2γ/3,∞)}(Δ̂^{(k)}_m(X_m, W2, V⋆_{m̌_k})) > N_k | Z̃_k) · 𝟙_{G̃′_n(k)}] ≤ E[exp{−Δ̂^{(k)}_{m̌_k}(W1, W2, V⋆_{m̌_k})|Û_k|/27}] ≤ E[exp{−T⋆_k/162}] ≤ exp{−n/(243 · 4^{k−1})},

where the last inequality is by Lemma 50. Thus, there exists G̃_n(k) with P(H̃^{(1)}_n(δ) \ G̃_n(k)) ≤ exp{−n/(243 · 4^{k−1})} such that, on H̃^{(1)}_n(δ) ∩ G̃_n(k), we have Σ_{m=m̂_{k−1}+1}^{m̂_k} I⋆_{mk} ≤ 3T⋆_k/4. Defining H̃^{(i)}_n(δ) = H̃^{(1)}_n(δ) ∩ ⋂_{k=1}^{d̃_f} G̃_n(k), a union bound implies

P(H̃^{(1)}_n(δ) \ H̃^{(i)}_n(δ)) ≤ d̃_f · exp{−n/(243 · 4^{d̃_f−1})}, (70)

and on H̃^{(i)}_n(δ), every k ∈ {1, …, d̃_f} has Σ_{m=m̂_{k−1}+1}^{m̂_k} I⋆_{mk} ≤ 3T⋆_k/4.
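The multiplicative Chernoff bound invoked here has the form P(B > (4/3)μ) ≤ exp{−μ/27} for a sum B of independent Bernoulli variables with total mean μ (take δ = 1/3 in P(B ≥ (1+δ)μ) ≤ exp{−δ²μ/3}). This can be checked exactly against binomial tails; the sketch below (illustrative parameter values only) does so for the i.i.d. case:

```python
import math

def binom_tail_gt(m, p, t):
    # exact P(X > t) for X ~ Binomial(m, p)
    return sum(math.comb(m, j) * p ** j * (1 - p) ** (m - j)
               for j in range(math.floor(t) + 1, m + 1))

for m in [30, 100, 400]:
    for p in [0.05, 0.2, 0.5]:
        mu = m * p
        # Chernoff form used for (69): P(B > (4/3) mu) <= exp(-mu/27)
        assert binom_tail_gt(m, p, (4 / 3) * mu) <= math.exp(-mu / 27)
```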
In particular, this means the C⋆_{mk} factors are redundant in Q⋆_k, so that t̂_k = Σ_{m=m̂_{k−1}+1}^{m̂_k} I⋆_{mk}. To get the stated probability bound, a union bound implies that

1 − P(H̃^{(1)}_n(δ)) ≤ (1 − P(H_{m̃_n}(δ))) + (1 − P(H^{(i)}_{m̃_n})) + P(H^{(i)}_{m̃_n} \ H^{(iii)}_{m̃_n}(γ/16)) + (1 − P(H^{(iv)}_{m̃_n})) + P(H^{(i)}_{m̃_n} \ G^{(i)}_{m̃_n}) ≤ δ + c^{(i)} · exp{−M̃(m̃_n)/4} + c^{(iii)}(γ/16) · exp{−M̃(m̃_n)γ²/256} + 3d̃_f · exp{−2m̃_n} + 121 d̃_f δ̃_f^{−1} · exp{−M̃(m̃_n)/60} ≤ δ + (c^{(i)} + c^{(iii)}(γ/16) + 124 d̃_f δ̃_f^{−1}) · exp{−m̃_n δ̃_f γ²/512}. (71)

Since n ≥ n^{(i)}(δ) ≥ 24, we have m̃_n ≥ n/48, so that summing (70) and (71) gives us

1 − P(H̃^{(i)}_n(δ)) ≤ δ + (c^{(i)} + c^{(iii)}(γ/16) + 125 d̃_f δ̃_f^{−1}) · exp{−n δ̃_f γ²/(512 · 48 · 4^{d̃_f−1})}. (72)

Finally, note that we have chosen n^{(i)}(δ) sufficiently large so that (72) is at most 2δ.

The next lemma indicates that the redundancy of the "t < ⌊3T_k/4⌋" constraint, just established in Lemma 51, implies that all ŷ labels obtained while k ≤ d̃_f are consistent with the target function.

Lemma 52 Consider running Meta-Algorithm 3 with a budget n ∈ ℕ, while f is the target function and P is the data distribution. There is an event H̃^{(ii)}_n and (C, P, f, γ)-dependent constants c̃^{(ii)}_1, c̃^{(ii)}_2 ∈ [1, ∞) such that, for any δ ∈ (0, 1), if n ≥ c̃^{(ii)}_1 ln(c̃^{(ii)}_2/δ), then P(H̃^{(i)}_n(δ) \ H̃^{(ii)}_n) ≤ δ, and on H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n, we have V^{(d̃_f)} = V_{m̂_{d̃_f}} = V⋆_{m̂_{d̃_f}}.
⋄

Proof Define c̃^{(ii)}_1 = max{c̃^{(i)}_1, 192d/r_{(1−γ)/6}, 2^{11}/δ̃_f^{1/3}}, c̃^{(ii)}_2 = max{c̃^{(i)}_2, 8e/r_{(1−γ)/6}, c^{(ii)}, exp{τ∗}}, let n^{(ii)}(δ) = c̃^{(ii)}_1 ln(c̃^{(ii)}_2/δ), suppose n ≥ n^{(ii)}(δ), and define the event H̃^{(ii)}_n = H^{(ii)}_{m̃_n}. By Lemma 41, since n ≥ n^{(ii)}(δ) ≥ 24 · max{τ((1−γ)/6; δ), τ∗}, on H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n, ∀m ∈ ℕ and k ∈ {1, …, d̃_f} with either k = 1 or m > m̃_n,

Δ̂^{(k)}_m(X_m, W2, V⋆_{m−1}) < γ ⟹ Γ̂^{(k)}_m(X_m, −f(X_m), W2, V⋆_{m−1}) < Γ̂^{(k)}_m(X_m, f(X_m), W2, V⋆_{m−1}). (73)

Recall that m̃_n ≤ min{⌈T_1/4⌉, 2^n} = ⌈⌈2n/3⌉/4⌉. Therefore, V_{m̃_n} is obtained purely by m̃_n executions of Step 8 while k = 1. Thus, for every m obtained in Meta-Algorithm 3, either k = 1 or m > m̃_n. We now proceed by induction on m. We already know V_0 = C = V⋆_0, so this serves as our base case. Now consider some value m ∈ ℕ obtained in Meta-Algorithm 3 while k ≤ d̃_f, and suppose every m′ < m has V_{m′} = V⋆_{m′}. But this means that T_k = T⋆_k and the value of t upon obtaining this particular m has t ≤ Σ_{ℓ=m̂_{k−1}+1}^{m−1} I⋆_{ℓk}. In particular, if Δ̂^{(k)}_m(X_m, W2, V_{m−1}) ≥ γ, then I⋆_{mk} = 1, so that t < Σ_{ℓ=m̂_{k−1}+1}^{m} I⋆_{ℓk}; by Lemma 51, on H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n, Σ_{ℓ=m̂_{k−1}+1}^{m} I⋆_{ℓk} ≤ Σ_{ℓ=m̂_{k−1}+1}^{m̂_k} I⋆_{ℓk} ≤ 3T⋆_k/4, so that t < 3T⋆_k/4, and therefore ŷ = Y_m = f(X_m); this implies V_m = V⋆_m. On the other hand, on H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n, if Δ̂^{(k)}_m(X_m, W2, V_{m−1}) < γ, then (73) implies ŷ = argmax_{y ∈ {−1,+1}} Γ̂^{(k)}_m(X_m, y, W2, V_{m−1}) = f(X_m), so that again V_m = V⋆_m.
Thus, by the principle of induction, on H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n, for every m ∈ ℕ obtained while k ≤ d̃_f, we have V_m = V⋆_m; in particular, this implies V^{(d̃_f)} = V_{m̂_{d̃_f}} = V⋆_{m̂_{d̃_f}}. The bound on P(H̃^{(i)}_n(δ) \ H̃^{(ii)}_n) then follows from Lemma 41, as we have chosen n^{(ii)}(δ) sufficiently large so that (28) (with τ = m̃_n) is at most δ.

Lemma 53 Consider running Meta-Algorithm 3 with a budget n ∈ ℕ, while f is the target function and P is the data distribution. There exist (C, P, f, γ)-dependent constants c̃^{(iii)}_1, c̃^{(iii)}_2 ∈ [1, ∞) such that, for any δ ∈ (0, e^{−3}), λ ∈ [1, ∞), and n ∈ ℕ, there is an event H̃^{(iii)}_n(δ, λ) with P(H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n \ H̃^{(iii)}_n(δ, λ)) ≤ δ with the property that, if

n ≥ c̃^{(iii)}_1 θ̃_f(d/λ) ln²(c̃^{(iii)}_2 λ/δ),

then on H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(iii)}_n(δ, λ), at the conclusion of Meta-Algorithm 3, |L_{d̃_f}| ≥ λ. ⋄

Proof Let c̃^{(iii)}_1 = max{c̃^{(i)}_1, c̃^{(ii)}_1, d · d̃_f · 4^{10+2d̃_f}/(γ³δ̃_f³), 192d/r_{3/32}}, c̃^{(iii)}_2 = max{c̃^{(i)}_2, c̃^{(ii)}_2, 8e/r_{3/32}}, fix any δ ∈ (0, e^{−3}) and λ ∈ [1, ∞), let n^{(iii)}(δ, λ) = c̃^{(iii)}_1 θ̃_f(d/λ) ln²(c̃^{(iii)}_2 λ/δ), and suppose n ≥ n^{(iii)}(δ, λ). Define a sequence ℓ_i = 2^i for integers i ≥ 0, and let ι̂ = ⌈log₂(4^{2+d̃_f} λ/(γδ̃_f))⌉. Also define φ̃(m, δ, λ) = max{φ(m; δ/2ι̂), d/λ}, where φ is defined in Lemma 29. Then define the events

H̃^{(3)}(δ, λ) = ⋂_{i=1}^{ι̂} H_{ℓ_i}(δ/2ι̂),  H̃^{(iii)}_n(δ, λ) = H̃^{(3)}(δ, λ) ∩ {m̌_{d̃_f} ≥ ℓ_ι̂}.

Note that ι̂ ≤ n, so that ℓ_ι̂ ≤ 2^n, and therefore the truncation in the definition of m̌_{d̃_f}, which enforces m̌_{d̃_f} ≤ max{d̃_f · 2^n + 1, m̂_{k−1}}, will never be a factor in whether or not m̌_{d̃_f} ≥ ℓ_ι̂ is satisfied.
Since n ≥ n^{(iii)}(δ, λ) ≥ c̃^{(ii)}_1 ln(c̃^{(ii)}_2/δ), Lemma 52 implies that on H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n, V_{m̂_{d̃_f}} = V⋆_{m̂_{d̃_f}}. Recall that this implies that all ŷ values obtained while m ≤ m̂_{d̃_f} are consistent with their respective f(X_m) values, so that every such m has V_m = V⋆_m as well. In particular, V_{m̌_{d̃_f}} = V⋆_{m̌_{d̃_f}}. Also note that n^{(iii)}(δ, λ) ≥ 24 · τ^{(iv)}(γ/16; δ), so that τ^{(iv)}(γ/16; δ) ≤ m̃_n, and recall we always have m̃_n ≤ m̌_{d̃_f}. Thus, on H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(iii)}_n(δ, λ) (taking Δ̂^{(k)} as in Meta-Algorithm 3),

Δ̂^{(d̃_f)} = Δ̂^{(d̃_f)}_{m̌_{d̃_f}}(W1, W2, V⋆_{m̌_{d̃_f}})  (Lemma 52)
≤ P(x : p_x(d̃_f, m̌_{d̃_f}) ≥ γ/8) + 4m̌_{d̃_f}^{−1}  (Lemma 44)
≤ 8P^{d̃_f}(S^{d̃_f}(V⋆_{m̌_{d̃_f}})) / (γ P^{d̃_f−1}(S^{d̃_f−1}(V⋆_{m̌_{d̃_f}}))) + 4m̌_{d̃_f}^{−1}  (Markov's ineq.)
≤ (8/(γδ̃_f)) P^{d̃_f}(S^{d̃_f}(V⋆_{m̌_{d̃_f}})) + 4m̌_{d̃_f}^{−1}  (Lemma 35)
≤ (8/(γδ̃_f)) P^{d̃_f}(S^{d̃_f}(V⋆_{ℓ_ι̂})) + 4ℓ_ι̂^{−1}  (defn of H̃^{(iii)}_n(δ, λ))
≤ (8/(γδ̃_f)) P^{d̃_f}(S^{d̃_f}(B(f, φ̃(ℓ_ι̂, δ, λ)))) + 4ℓ_ι̂^{−1}  (Lemma 29)
≤ (8/(γδ̃_f)) θ̃_f(d/λ) φ̃(ℓ_ι̂, δ, λ) + 4ℓ_ι̂^{−1}  (defn of θ̃_f(d/λ))
≤ (12/(γδ̃_f)) θ̃_f(d/λ) φ̃(ℓ_ι̂, δ, λ)  (φ̃(ℓ_ι̂, δ, λ) ≥ ℓ_ι̂^{−1})
= (12θ̃_f(d/λ)/(γδ̃_f)) max{(2d ln(2e max{ℓ_ι̂, d}/d) + ln(4ι̂/δ))/ℓ_ι̂, d/λ}. (74)

Plugging in the definitions of ι̂ and ℓ_ι̂,

(d ln(2e max{ℓ_ι̂, d}/d) + ln(4ι̂/δ))/ℓ_ι̂ ≤ (d/λ) γδ̃_f 4^{−1−d̃_f} ln(4^{1+d̃_f} λ/(δγδ̃_f)) ≤ (d/λ) ln(λ/δ).

Therefore, (74) is at most 24θ̃_f(d/λ)(d/λ) ln(λ/δ)/(γδ̃_f).
Thus, since n^{(iii)}(δ, λ) ≥ max{c̃^{(i)}_1 ln(c̃^{(i)}_2/δ), c̃^{(ii)}_1 ln(c̃^{(ii)}_2/δ)}, Lemmas 51 and 52 imply that on H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(iii)}_n(δ, λ),

|L_{d̃_f}| = ⌊T⋆_{d̃_f}/(3Δ̂^{(d̃_f)})⌋ ≥ ⌊4^{1−d̃_f} 2n/(9Δ̂^{(d̃_f)})⌋ ≥ 4^{1−d̃_f} γδ̃_f n/(9 · 24 · θ̃_f(d/λ)(d/λ) ln(λ/δ)) ≥ λ ln(λ/δ) ≥ λ.

Now we turn to bounding P(H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n \ H̃^{(iii)}_n(δ, λ)). By a union bound, we have

1 − P(H̃^{(3)}(δ, λ)) ≤ Σ_{i=1}^{ι̂} (1 − P(H_{ℓ_i}(δ/2ι̂))) ≤ δ/2. (75)

Thus, it remains only to bound P(H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(3)}(δ, λ) ∩ {m̌_{d̃_f} < ℓ_ι̂}). For each i ∈ {0, 1, …, ι̂−1}, let Q̌_i = |{m ∈ (ℓ_i, ℓ_{i+1}] ∩ Ǔ_{d̃_f} : I⋆_{m d̃_f} = 1}|. Now consider the set I of all i ∈ {0, 1, …, ι̂−1} with ℓ_i ≥ m̃_n and (ℓ_i, ℓ_{i+1}] ∩ Ǔ_{d̃_f} ≠ ∅. Note that n^{(iii)}(δ, λ) ≥ 48, so that ℓ_0 < m̃_n. Fix any i ∈ I. Since n^{(iii)}(δ, λ) ≥ 24 · τ(1/6; δ), we have m̃_n ≥ τ(1/6; δ), so that Lemma 49 implies that on H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(3)}(δ, λ), letting Q̄ = 2 · 4^{6+d̃_f}(d/(γ²δ̃_f²))θ̃_f(d/λ) ln(λ/δ),

P(H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(3)}(δ, λ) ∩ {Q̌_i > Q̄} | W2, V⋆_{ℓ_i}) ≤ P(|{m ∈ (ℓ_i, ℓ_{i+1}] ∩ ℕ : Δ̂^{(d̃_f)}_m(X_m, W2, V⋆_{ℓ_i}) ≥ 2γ/3}| > Q̄ | W2, V⋆_{ℓ_i}). (76)

For m > ℓ_i, the variables 𝟙_{[2γ/3,∞)}(Δ̂^{(d̃_f)}_m(X_m, W2, V⋆_{ℓ_i})) are conditionally (given W2, V⋆_{ℓ_i}) independent, each with respective conditional distribution Bernoulli with mean p̄_{2γ/3}(d̃_f, ℓ_i, m).
Since n^{(iii)}(δ, λ) ≥ 24 · τ(3/32; δ), we have m̃_n ≥ τ(3/32; δ), so that Lemma 43 (with ζ = 2γ/3, α = 3/4, and β = 3/32) implies that on H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(3)}(δ, λ), each of these m values has

p̄_{2γ/3}(d̃_f, ℓ_i, m) ≤ P(x : p_x(d̃_f, ℓ_i) ≥ γ/2) + exp{−M̃(m)γ²/256}
≤ 2P^{d̃_f}(S^{d̃_f}(V⋆_{ℓ_i})) / (γ P^{d̃_f−1}(S^{d̃_f−1}(V⋆_{ℓ_i}))) + exp{−M̃(ℓ_i)γ²/256}  (Markov's ineq.)
≤ (2/(γδ̃_f)) P^{d̃_f}(S^{d̃_f}(V⋆_{ℓ_i})) + exp{−M̃(ℓ_i)γ²/256}  (Lemma 35)
≤ (2/(γδ̃_f)) P^{d̃_f}(S^{d̃_f}(B(f, φ̃(ℓ_i, δ, λ)))) + exp{−M̃(ℓ_i)γ²/256}  (Lemma 29)
≤ (2/(γδ̃_f)) θ̃_f(d/λ) φ̃(ℓ_i, δ, λ) + exp{−M̃(ℓ_i)γ²/256}  (defn of θ̃_f(d/λ)).

Denote the expression in this last line by p_i, and let B(ℓ_i, p_i) be a Binomial(ℓ_i, p_i) random variable. Noting that ℓ_{i+1} − ℓ_i = ℓ_i, we have that on H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(3)}(δ, λ), (76) is at most P(B(ℓ_i, p_i) > Q̄). Next, note that

ℓ_i p_i = (2/(γδ̃_f)) θ̃_f(d/λ) ℓ_i φ̃(ℓ_i, δ, λ) + ℓ_i · exp{−ℓ_i³ δ̃_f γ²/512}.

Since u · exp{−u³} ≤ (3e)^{−1/3} for any u, letting u = ℓ_i δ̃_f γ/8 we have

ℓ_i · exp{−ℓ_i³ δ̃_f γ²/512} ≤ (8/(γδ̃_f)) u · exp{−u³} ≤ 8/(γδ̃_f (3e)^{1/3}) ≤ 4/(γδ̃_f).

Therefore, since φ̃(ℓ_i, δ, λ) ≥ ℓ_i^{−1}, we have that ℓ_i p_i is at most

(6/(γδ̃_f)) θ̃_f(d/λ) ℓ_i φ̃(ℓ_i, δ, λ) ≤ (6/(γδ̃_f)) θ̃_f(d/λ) max{2d ln(2eℓ_ι̂) + 2 ln(4ι̂/δ), ℓ_ι̂ d/λ} ≤ (6/(γδ̃_f)) θ̃_f(d/λ) max{2d ln(4^{3+d̃_f} eλ/(γδ̃_f)) + 2 ln(4^{3+d̃_f} 2λ/(γδ̃_f δ)), d · 4^{3+d̃_f}/(γδ̃_f)} ≤ (6/(γδ̃_f)) θ̃_f(d/λ) max{4d ln(4^{3+d̃_f} λ/(γδ̃_f δ)), d · 4^{3+d̃_f}/(γδ̃_f)} ≤ (6/(γδ̃_f)) θ̃_f(d/λ) · d · 4^{4+d̃_f}/(γδ̃_f) · ln(λ/δ) ≤ (4^{6+d̃_f} d/(γ²δ̃_f²)) θ̃_f(d/λ) ln(λ/δ) = Q̄/2.
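The elementary inequality u · exp{−u³} ≤ (3e)^{−1/3} used in the step above follows from maximizing u · exp{−u³} over u ≥ 0, where the derivative vanishes at u = 3^{−1/3}. A quick numerical check (illustrative, not part of the proof):

```python
import math

bound = (3 * math.e) ** (-1.0 / 3.0)            # (3e)^{-1/3} ≈ 0.4968
grid = [i / 1000.0 for i in range(0, 10001)]    # u in [0, 10]
vals = [u * math.exp(-u ** 3) for u in grid]
# the maximum of u * exp(-u^3) over u >= 0 is (3e)^{-1/3}
assert max(vals) <= bound + 1e-12
# ... and it is attained (up to grid resolution) at u = 3^{-1/3}
u_star = grid[vals.index(max(vals))]
assert abs(u_star - 3 ** (-1.0 / 3.0)) < 1e-2
```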
Therefore, a Chernoff bound implies P(B(ℓ_i, p_i) > Q̄) ≤ exp{−Q̄/6} ≤ δ/2ι̂, so that on H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(3)}(δ, λ), (76) is at most δ/2ι̂. The law of total probability implies there exists an event H̃^{(4)}_n(i, δ, λ) with P(H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(3)}(δ, λ) \ H̃^{(4)}_n(i, δ, λ)) ≤ δ/2ι̂ such that, on H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(3)}(δ, λ) ∩ H̃^{(4)}_n(i, δ, λ), Q̌_i ≤ Q̄. Note that

ι̂Q̄ ≤ log₂(4^{2+d̃_f} λ/(γδ̃_f)) · 4^{7+d̃_f}(d/(γ²δ̃_f²))θ̃_f(d/λ) ln(λ/δ) ≤ (d̃_f 4^{9+d̃_f}/(γ³δ̃_f³)) d θ̃_f(d/λ) ln²(λ/δ) ≤ 4^{1−d̃_f} n/12. (77)

Since Σ_{m ≤ 2m̃_n} I⋆_{m d̃_f} ≤ n/12, if d̃_f = 1 then (77) implies that on H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(3)}(δ, λ) ∩ ⋂_{i∈I} H̃^{(4)}_n(i, δ, λ), Σ_{m ≤ ℓ_ι̂} I⋆_{m1} ≤ n/12 + Σ_{i∈I} Q̌_i ≤ n/12 + ι̂Q̄ ≤ n/6 ≤ ⌈T⋆_1/4⌉, so that m̌_1 ≥ ℓ_ι̂. Otherwise, if d̃_f > 1, then every m ∈ Ǔ_{d̃_f} has m > 2m̃_n, so that Σ_{i ≤ ι̂} Q̌_i = Σ_{i∈I} Q̌_i; thus, on H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(3)}(δ, λ) ∩ ⋂_{i∈I} H̃^{(4)}_n(i, δ, λ), Σ_{i∈I} Q̌_i ≤ ι̂Q̄ ≤ 4^{1−d̃_f} n/12; Lemma 50 implies 4^{1−d̃_f} n/12 ≤ ⌈T⋆_{d̃_f}/4⌉, so that again we have m̌_{d̃_f} ≥ ℓ_ι̂. Thus, a union bound implies

P(H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(3)}(δ, λ) ∩ {m̌_{d̃_f} < ℓ_ι̂}) ≤ P(H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(3)}(δ, λ) \ ⋂_{i∈I} H̃^{(4)}_n(i, δ, λ)) ≤ Σ_{i∈I} P(H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(3)}(δ, λ) \ H̃^{(4)}_n(i, δ, λ)) ≤ δ/2. (78)

Therefore, P(H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n \ H̃^{(iii)}_n(δ, λ)) ≤ δ, obtained by summing (78) and (75).

Proof [Theorem 16] If Λ_p(ε/4, f, P) = ∞ then the result trivially holds.
Otherwise, suppose ε ∈ (0, 10e^{−3}), let δ = ε/10, λ = Λ_p(ε/4, f, P), c̃_2 = max{10c̃^{(i)}_2, 10c̃^{(ii)}_2, 10c̃^{(iii)}_2, 10e(d+1)}, and c̃_1 = max{c̃^{(i)}_1, c̃^{(ii)}_1, c̃^{(iii)}_1, 2 · 6³(d+1)d̃_f ln(e(d+1))}, and consider running Meta-Algorithm 3 with passive algorithm A_p and budget n ≥ c̃_1 θ̃_f(d/λ) ln²(c̃_2 λ/ε), while f is the target function and P is the data distribution. On the event H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(iii)}_n(δ, λ), Lemma 53 implies |L_{d̃_f}| ≥ λ, while Lemma 52 implies V^{(d̃_f)} = V⋆_{m̂_{d̃_f}}; recalling that Lemma 35 implies V⋆_{m̂_{d̃_f}} ≠ ∅ on this event, we must have er_{L_{d̃_f}}(f) = 0. Furthermore, if ĥ is the classifier returned by Meta-Algorithm 3, then Lemma 34 implies that er(ĥ) is at most 2er(A_p(L_{d̃_f})), on a high-probability event (call it Ê₂ in this context). Letting Ê₃(δ) = Ê₂ ∩ H̃^{(i)}_n(δ) ∩ H̃^{(ii)}_n ∩ H̃^{(iii)}_n(δ, λ), the total failure probability 1 − P(Ê₃(δ)) from all of these events is at most

4δ + e(d+1) · exp{−⌊n/3⌋/(72 d̃_f (d+1) ln(e(d+1)))} ≤ 5δ = ε/2.

Since, for ℓ ∈ ℕ with P(|L_{d̃_f}| = ℓ) > 0, the sequence of X_m values appearing in L_{d̃_f} is conditionally distributed as P^ℓ given |L_{d̃_f}| = ℓ, and this is the same as the (unconditional) distribution of {X_1, X_2, …, X_ℓ}, we have that

E[er(ĥ)] ≤ E[2er(A_p(L_{d̃_f})) 𝟙_{Ê₃(δ)}] + ε/2 = E[E[2er(A_p(L_{d̃_f})) 𝟙_{Ê₃(δ)} | |L_{d̃_f}|]] + ε/2 ≤ 2 sup_{ℓ ≥ Λ_p(ε/4, f, P)} E[er(A_p(Z_ℓ))] + ε/2 ≤ ε.

To specialize to the specific variant of Meta-Algorithm 3 stated in Section 5.2, take γ = 1/2.

Appendix E.
Proofs Related to Section 6: Agnostic Learning

E.1 Proof of Theorem 22: Negative Result for Agnostic Activized Learning

It suffices to show that Ǎ_p achieves a label complexity Λ_p such that, for any label complexity Λ_a achieved by any active learning algorithm A_a, there exists a distribution P_XY on X × {−1, +1} such that P_XY ∈ Nontrivial(Λ_p; C) and yet Λ_a(ν + cε, P_XY) ≠ o(Λ_p(ν + ε, P_XY)) for every constant c ∈ (0, ∞). Specifically, we will show that there is a distribution P_XY for which Λ_p(ν + ε, P_XY) = Θ(1/ε) and Λ_a(ν + ε, P_XY) ≠ o(1/ε).

Let P({0}) = 1/2, and for any measurable A ⊆ (0, 1], P(A) = λ(A)/2, where λ is Lebesgue measure. Let D be the family of distributions P_XY on X × {−1, +1} characterized by the properties that the marginal distribution on X is P, η(0; P_XY) ∈ (1/8, 3/8), and ∀x ∈ (0, 1], η(x; P_XY) = η(0; P_XY) + (x/2) · (1 − η(0; P_XY)). Thus, η(x; P_XY) is a linear function. For any P_XY ∈ D, since the point z∗ = (1 − 2η(0; P_XY))/(1 − η(0; P_XY)) has η(z∗; P_XY) = 1/2, we see that f = h_{z∗} is a Bayes optimal classifier. Furthermore, for any η₀ ∈ [1/8, 3/8],

|(1 − 2η₀)/(1 − η₀) − (1 − 2η(0; P_XY))/(1 − η(0; P_XY))| = |η(0; P_XY) − η₀| / ((1 − η₀)(1 − η(0; P_XY))),

and since (1 − η₀)(1 − η(0; P_XY)) ∈ (25/64, 49/64) ⊂ (1/3, 1), the value z = (1 − 2η₀)/(1 − η₀) satisfies

|η₀ − η(0; P_XY)| ≤ |z − z∗| ≤ 3|η₀ − η(0; P_XY)|.
(79)

Also note that under P_XY, since (1 − 2η(0; P_XY)) = (1 − η(0; P_XY))z∗, any z ∈ (0, 1) has

er(h_z) − er(h_{z∗}) = ∫_z^{z∗} (1 − 2η(x; P_XY)) dx = ∫_z^{z∗} (1 − 2η(0; P_XY) − x(1 − η(0; P_XY))) dx = (1 − η(0; P_XY)) ∫_z^{z∗} (z∗ − x) dx = ((1 − η(0; P_XY))/2)(z∗ − z)²,

so that

(5/16)(z − z∗)² ≤ er(h_z) − er(h_{z∗}) ≤ (7/16)(z − z∗)². (80)

Finally, note that any x, x′ ∈ (0, 1] with |x − z∗| < |x′ − z∗| has |1 − 2η(x; P_XY)| = |x − z∗|(1 − η(0; P_XY)) < |x′ − z∗|(1 − η(0; P_XY)) = |1 − 2η(x′; P_XY)|. Thus, for any q ∈ (0, 1/2], there exists z′_q ∈ [0, 1] such that z∗ ∈ [z′_q, z′_q + 2q] ⊆ [0, 1], and the classifier h′_q(x) = h_{z∗}(x) · (1 − 2·𝟙_{(z′_q, z′_q+2q]}(x)) has er(h) ≥ er(h′_q) for every classifier h with h(0) = −1 and P(x : h(x) ≠ h_{z∗}(x)) = q. Noting that

er(h′_q) − er(h_{z∗}) = (lim_{z↓z′_q} er(h_z) − er(h_{z∗})) + (er(h_{z′_q+2q}) − er(h_{z∗})),

(80) implies that er(h′_q) − er(h_{z∗}) ≥ (5/16)((z′_q − z∗)² + (z′_q + 2q − z∗)²), and since max{z∗ − z′_q, z′_q + 2q − z∗} ≥ q, this is at least (5/16)q². In general, any h with h(0) = +1 has er(h) − er(h_{z∗}) ≥ 1/2 − η(0; P_XY) > 1/8 ≥ (1/8)P(x : h(x) ≠ h_{z∗}(x))². Combining these facts, we see that any classifier h has

er(h) − er(h_{z∗}) ≥ (1/8)P(x : h(x) ≠ h_{z∗}(x))². (81)

Lemma 54 The passive learning algorithm Ǎ_p achieves a label complexity Λ_p such that, for every P_XY ∈ D, Λ_p(ν + ε, P_XY) = Θ(1/ε). ⋄

Proof Consider the values η̂₀ and ẑ from Ǎ_p(Z_n) for some n ∈ ℕ. Combining (79) and (80), we have er(h_ẑ) − er(h_{z∗}) ≤ (7/16)(ẑ − z∗)² ≤ (63/16)(η̂₀ − η(0; P_XY))² ≤ 4(η̂₀ − η(0; P_XY))². Let N_n = |{i ∈ {1, …
…, n} : X_i = 0}|, and η̄₀ = N_n^{−1}|{i ∈ {1, …, n} : X_i = 0, Y_i = +1}| if N_n > 0, or η̄₀ = 0 if N_n = 0. Note that η̂₀ = (η̄₀ ∨ 1/8) ∧ 3/8, and since η(0; P_XY) ∈ (1/8, 3/8), we have |η̂₀ − η(0; P_XY)| ≤ |η̄₀ − η(0; P_XY)|. Therefore, for any P_XY ∈ D,

E[er(h_ẑ) − er(h_{z∗})] ≤ 4E[(η̂₀ − η(0; P_XY))²] ≤ 4E[(η̄₀ − η(0; P_XY))²] ≤ 4E[E[(η̄₀ − η(0; P_XY))² | N_n] 𝟙_{[n/4, n]}(N_n)] + 4P(N_n < n/4). (82)

By a Chernoff bound, P(N_n < n/4) ≤ exp{−n/16}, and since the conditional distribution of N_n η̄₀ given N_n is Binomial(N_n, η(0; P_XY)), (82) is at most

4E[(1/(N_n ∨ n/4)) η(0; P_XY)(1 − η(0; P_XY))] + 4 exp{−n/16} ≤ 4 · (4/n) · (15/64) + 4 · (16/n) < 68/n.

For any n ≥ ⌈68/ε⌉, this is at most ε. Therefore, Ǎ_p achieves a label complexity Λ_p such that, for any P_XY ∈ D, Λ_p(ν + ε, P_XY) = ⌈68/ε⌉ = Θ(1/ε).

Next we establish a corresponding lower bound for any active learning algorithm. Note that this requires more than a simple minimax lower bound, since we must have an asymptotic lower bound for a fixed P_XY, rather than selecting a different P_XY for each ε value; this is akin to the strong minimax lower bounds proven by Antos and Lugosi (1998) for passive learning in the realizable case. For this, we proceed by reduction from the task of estimating a binomial mean; toward this end, the following lemma will be useful.

Lemma 55 For any nonempty (a, b) ⊂ [0, 1], and any sequence of estimators p̂_n : {0, 1}^n → [0, 1], there exists p ∈ (a, b) such that, if B₁, B₂, … are independent Bernoulli(p) random variables, also independent from every p̂_n, then E[(p̂_n(B₁, …, B_n) − p)²] ≠ o(1/n). ⋄

Proof We first establish the claim when a = 0 and b = 1.
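The algebra behind (79)–(80) and the constants closing the proof of Lemma 54 are easy to verify numerically. The sketch below (illustrative only) checks that η(z∗; P_XY) = 1/2, the quadratic excess-error identity, the coefficient range [5/16, 7/16], the tail bound exp{−n/16} ≤ 16/n, and the final 68/n arithmetic:

```python
import math

def eta(x, eta0):
    # conditional mean from the family D: eta(x) = eta0 + (x/2)(1 - eta0)
    return eta0 + (x / 2.0) * (1.0 - eta0)

for eta0 in [0.13, 0.25, 0.37]:
    z_star = (1 - 2 * eta0) / (1 - eta0)
    # eta crosses 1/2 exactly at the Bayes threshold z*
    assert abs(eta(z_star, eta0) - 0.5) < 1e-12
    # coefficient of (z - z*)^2 in (80) lies in [5/16, 7/16]
    assert 5 / 16 <= (1 - eta0) / 2 <= 7 / 16
    # identity: integral_z^{z*} (1 - 2 eta(x)) dx = ((1 - eta0)/2)(z* - z)^2
    # (midpoint rule is exact here since the integrand is linear in x)
    for z in [0.0, 0.3, 0.9]:
        lo, hi = min(z, z_star), max(z, z_star)
        h = (hi - lo) / 1000
        integral = sum((1 - 2 * eta(lo + (j + 0.5) * h, eta0)) * h
                       for j in range(1000))
        assert abs(abs(integral) - (1 - eta0) / 2 * (z - z_star) ** 2) < 1e-9

# constants from the proof of Lemma 54
assert 0.375 * (1 - 0.375) == 15 / 64       # worst case of eta0 (1 - eta0)
for n in range(1, 5001):
    assert math.exp(-n / 16) <= 16 / n      # Chernoff tail term
    assert 4 * (4 / n) * (15 / 64) + 4 * (16 / n) < 68 / n   # total 67.75/n
```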
For any $p \in [0,1]$, let $B_1(p), B_2(p), \ldots$ be i.i.d. $\mathrm{Bernoulli}(p)$ random variables, independent from any internal randomness of the $\hat p_n$ estimators. We proceed by reduction from hypothesis testing, for which there are known lower bounds. Specifically, it is known (e.g., Wald, 1945; Bar-Yossef, 2003) that for any $p, q \in (0,1)$, $\delta \in (0, e^{-1})$, any (possibly randomized) $\hat q : \{0,1\}^n \to \{p,q\}$, and any $n \in \mathbb{N}$,
\[ n < \frac{(1-8\delta)\ln(1/8\delta)}{8\,\mathrm{KL}(p\|q)} \implies \max_{p^* \in \{p,q\}} P\big(\hat q(B_1(p^*),\ldots,B_n(p^*)) \ne p^*\big) > \delta, \]
where $\mathrm{KL}(p\|q) = p\ln(p/q) + (1-p)\ln((1-p)/(1-q))$. It is also known (e.g., Poland and Hutter, 2006) that for $p,q \in [1/4,3/4]$, $\mathrm{KL}(p\|q) \le (8/3)(p-q)^2$. Combining this with the above fact, we have that for $p,q \in [1/4,3/4]$,
\[ \max_{p^* \in \{p,q\}} P\big(\hat q(B_1(p^*),\ldots,B_n(p^*)) \ne p^*\big) \ge (1/16)\cdot\exp\big\{-128(p-q)^2 n/3\big\}. \tag{83} \]
Given the estimator $\hat p_n$ from the lemma statement, we construct a sequence of hypothesis tests as follows. For $i \in \mathbb{N}$, let $\alpha_i = \exp\{-2^i\}$ and $n_i = \lfloor 1/\alpha_i^2 \rfloor$. Define $p^*_0 = 1/4$, and for $i \in \mathbb{N}$, inductively define
\[ \hat q_i(b_1,\ldots,b_{n_i}) = \operatorname*{argmin}_{p \in \{p^*_{i-1},\, p^*_{i-1}+\alpha_i\}} |\hat p_{n_i}(b_1,\ldots,b_{n_i}) - p| \quad \text{for } b_1,\ldots,b_{n_i} \in \{0,1\}, \]
and $p^*_i = \operatorname*{argmax}_{p \in \{p^*_{i-1},\, p^*_{i-1}+\alpha_i\}} P(\hat q_i(B_1(p),\ldots,B_{n_i}(p)) \ne p)$. Finally, define $p^* = \lim_{i\to\infty} p^*_i$. Note that $\forall i \in \mathbb{N}$, $p^*_i < 1/2$, $p^*_{i-1}, p^*_{i-1}+\alpha_i \in [1/4,3/4]$, and $0 \le p^* - p^*_i \le \sum_{j=i+1}^\infty \alpha_j < 2\alpha_{i+1} = 2\alpha_i^2$. We generally have
\[ \mathbb{E}\big[(\hat p_{n_i}(B_1(p^*),\ldots,B_{n_i}(p^*)) - p^*)^2\big] \ge \frac13\,\mathbb{E}\big[(\hat p_{n_i}(B_1(p^*),\ldots,B_{n_i}(p^*)) - p^*_i)^2\big] - (p^* - p^*_i)^2 \]
\[ \ge \frac13\,\mathbb{E}\big[(\hat p_{n_i}(B_1(p^*),\ldots,B_{n_i}(p^*)) - p^*_i)^2\big] - 4\alpha_i^4. \]
Furthermore, note that for any $m \in \{0,\ldots,n_i\}$,
\[ \frac{(p^*)^m(1-p^*)^{n_i-m}}{(p^*_i)^m(1-p^*_i)^{n_i-m}} \ge \left(\frac{1-p^*}{1-p^*_i}\right)^{n_i} \ge \left(\frac{1-p^*_i-2\alpha_i^2}{1-p^*_i}\right)^{n_i} \ge \big(1-4\alpha_i^2\big)^{n_i} \ge \exp\{-8\alpha_i^2 n_i\} \ge e^{-8}, \]
so that the probability mass function of $(B_1(p^*),\ldots,B_{n_i}(p^*))$ is never smaller than $e^{-8}$ times that of $(B_1(p^*_i),\ldots,B_{n_i}(p^*_i))$, which implies (by the law of the unconscious statistician)
\[ \mathbb{E}\big[(\hat p_{n_i}(B_1(p^*),\ldots,B_{n_i}(p^*)) - p^*_i)^2\big] \ge e^{-8}\,\mathbb{E}\big[(\hat p_{n_i}(B_1(p^*_i),\ldots,B_{n_i}(p^*_i)) - p^*_i)^2\big]. \]
By a triangle inequality, we have
\[ \mathbb{E}\big[(\hat p_{n_i}(B_1(p^*_i),\ldots,B_{n_i}(p^*_i)) - p^*_i)^2\big] \ge \frac{\alpha_i^2}{4}\, P\big(\hat q_i(B_1(p^*_i),\ldots,B_{n_i}(p^*_i)) \ne p^*_i\big). \]
By (83), this is at least $\frac{\alpha_i^2}{4}(1/16)\cdot\exp\{-128\alpha_i^2 n_i/3\} \ge 2^{-6}e^{-43}\alpha_i^2$. Combining the above, we have
\[ \mathbb{E}\big[(\hat p_{n_i}(B_1(p^*),\ldots,B_{n_i}(p^*)) - p^*)^2\big] \ge 3^{-1}2^{-6}e^{-51}\alpha_i^2 - 4\alpha_i^4 \ge 2^{-9}e^{-51}n_i^{-1} - 4n_i^{-2}. \]
For $i \ge 5$, this is larger than $2^{-11}e^{-51}n_i^{-1}$. Since $n_i$ diverges as $i \to \infty$, we have that $\mathbb{E}\big[(\hat p_{n_i}(B_1(p^*),\ldots,B_{n_i}(p^*)) - p^*)^2\big] \ne o(1/n)$, which establishes the result for $a = 0$ and $b = 1$.

To extend this result to general nonempty ranges $(a,b)$, we proceed by reduction from the above problem. Specifically, suppose $p' \in (0,1)$, and consider the following independent random variables (also independent from the $B_i(p')$ variables and $\hat p_n$ estimators). For each $i \in \mathbb{N}$, $C_{i1} \sim \mathrm{Bernoulli}(a)$, $C_{i2} \sim \mathrm{Bernoulli}((b-a)/(1-a))$. Then for $b_i \in \{0,1\}$, define $B'_i(b_i) = \max\{C_{i1}, C_{i2}\cdot b_i\}$. For any given $p' \in (0,1)$, the random variables $B'_i(B_i(p'))$ are i.i.d. $\mathrm{Bernoulli}(p)$, with $p = a + (b-a)p' \in (a,b)$ (which forms a bijection between $(0,1)$ and $(a,b)$). Defining $\hat p'_n(b_1,\ldots,b_n) = (\hat p_n(B'_1(b_1),\ldots,B'_n(b_n)) - a)/(b-a)$, we have
\[ \mathbb{E}\big[(\hat p_n(B_1(p),\ldots,B_n(p)) - p)^2\big] = (b-a)^2\cdot\mathbb{E}\big[(\hat p'_n(B_1(p'),\ldots,B_n(p')) - p')^2\big]. \tag{84} \]
We have already shown there exists a value of $p' \in (0,1)$ such that the right side of (84) is not $o(1/n)$. Therefore, the corresponding value of $p = a + (b-a)p' \in (a,b)$ has the left side of (84) not $o(1/n)$, which establishes the result.

We are now ready for the lower bound result for our setting.

Lemma 56 For any label complexity $\Lambda_a$ achieved by any active learning algorithm $\mathcal{A}_a$, there exists a $P_{XY} \in \mathbb{D}$ such that $\Lambda_a(\nu+\varepsilon, P_{XY}) \ne o(1/\varepsilon)$. ⋄

Proof The idea here is to reduce from the task of estimating the mean of i.i.d. Bernoulli trials, corresponding to the $Y_i$ values. Specifically, consider any active learning algorithm $\mathcal{A}_a$; we use $\mathcal{A}_a$ to construct an estimator for the mean of i.i.d. Bernoulli trials as follows. Suppose we have $B_1, B_2, \ldots, B_n$ i.i.d. $\mathrm{Bernoulli}(p)$, for some $p \in (1/8, 3/8)$ and $n \in \mathbb{N}$. We take the sequence of $X_1, X_2, \ldots$ random variables i.i.d. with distribution $P$ defined above (independent from the $B_j$ variables). For each $i$, we additionally have a random variable $C_i$ with conditional distribution $\mathrm{Bernoulli}(X_i/2)$ given $X_i$, where the $C_i$ are conditionally independent given the $X_i$ sequence, and independent from the $B_i$ sequence as well. We run $\mathcal{A}_a$ with this sequence of $X_i$ values. For the $t$th label request made by the algorithm, say for the $Y_i$ value corresponding to some $X_i$, if it has previously requested this $Y_i$ already, then we simply repeat the same answer for $Y_i$ again, and otherwise we return to the algorithm the value $2\max\{B_t, C_i\} - 1$ for $Y_i$.
Note that in the latter case, the conditional distribution of $\max\{B_t, C_i\}$ is $\mathrm{Bernoulli}(p + (1-p)X_i/2)$, given the $X_i$ that $\mathcal{A}_a$ requests the label of; thus, the $Y_i$ response has the same conditional distribution given $X_i$ as it would have for the $P_{XY} \in \mathbb{D}$ with $\eta(0;P_{XY}) = p$ (i.e., $\eta(X_i;P_{XY}) = p + (1-p)X_i/2$). Since this $Y_i$ value is conditionally (given $X_i$) independent from the previously returned labels and the $X_j$ sequence, this is distributionally equivalent to running $\mathcal{A}_a$ under the $P_{XY} \in \mathbb{D}$ with $\eta(0;P_{XY}) = p$.

Let $\hat h_n$ be the classifier returned by $\mathcal{A}_a(n)$ in the above context, and let $\hat z_n$ denote the value of $z \in [2/5, 6/7]$ with minimum $P(x : h_z(x) \ne \hat h_n(x))$. Then define $\hat p_n = \frac{1-\hat z_n}{2-\hat z_n} \in [1/8, 3/8]$ and $z^* = \frac{1-2p}{1-p} \in (2/5, 6/7)$. By a triangle inequality, we have
\[ |\hat z_n - z^*| = 2P(x : h_{\hat z_n}(x) \ne h_{z^*}(x)) \le 4P(x : \hat h_n(x) \ne h_{z^*}(x)). \]
Combining this with (81) and (79) implies that
\[ \operatorname{er}(\hat h_n) - \operatorname{er}(h_{z^*}) \ge \frac18\, P\big(x : \hat h_n(x) \ne h_{z^*}(x)\big)^2 \ge \frac{1}{128}(\hat z_n - z^*)^2 \ge \frac{1}{128}(\hat p_n - p)^2. \tag{85} \]
In particular, by Lemma 55, we can choose $p \in (1/8, 3/8)$ so that $\mathbb{E}\big[(\hat p_n - p)^2\big] \ne o(1/n)$, which, by (85), implies $\mathbb{E}\big[\operatorname{er}(\hat h_n)\big] - \nu \ne o(1/n)$. This means there is an increasing infinite sequence of values $n_k \in \mathbb{N}$, and a constant $c \in (0,\infty)$, such that $\forall k \in \mathbb{N}$, $\mathbb{E}\big[\operatorname{er}(\hat h_{n_k})\big] - \nu \ge c/n_k$. Supposing $\mathcal{A}_a$ achieves label complexity $\Lambda_a$, and taking the values $\varepsilon_k = c/(2n_k)$, we have $\Lambda_a(\nu+\varepsilon_k, P_{XY}) > n_k = c/(2\varepsilon_k)$. Since $\varepsilon_k > 0$ and approaches $0$ as $k \to \infty$, we have $\Lambda_a(\nu+\varepsilon, P_{XY}) \ne o(1/\varepsilon)$.

Proof [of Theorem 22] The result follows from Lemmas 54 and 56.
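The label-simulation step in the proof of Lemma 56 rests on a small closed-form identity: with $B \sim \mathrm{Bernoulli}(p)$ and $C \sim \mathrm{Bernoulli}(x/2)$ independent, the simulated label $2\max\{B,C\}-1$ equals $+1$ with probability exactly $\eta(x) = p + (1-p)x/2$. The following sketch checks this identity numerically; the grid values for $p$ and $x$ are illustrative choices, not taken from the paper.

```python
# Sanity check for the label-simulation step of the reduction: with
# B ~ Bernoulli(p) and C ~ Bernoulli(x/2) independent, the simulated label
# 2*max{B, C} - 1 is +1 with probability
#   P(max{B, C} = 1) = 1 - (1-p)*(1 - x/2),
# which should equal eta(x) = p + (1-p)*x/2 from the construction above.

def p_plus_one(p, x):
    """Probability that max{B, C} = 1 for independent B ~ Bern(p), C ~ Bern(x/2)."""
    return 1 - (1 - p) * (1 - x / 2)

def eta(p, x):
    """Target conditional label probability eta(x; P_XY) = p + (1-p)*x/2."""
    return p + (1 - p) * x / 2

for p in (0.15, 0.25, 0.35):          # illustrative p in (1/8, 3/8)
    for x in (0.0, 0.25, 0.5, 1.0):   # illustrative points of the instance space
        assert abs(p_plus_one(p, x) - eta(p, x)) < 1e-12

print("simulated labels match the target noise rate")
```

The check is purely algebraic: expanding $1 - (1-p)(1 - x/2)$ gives $p + (1-p)x/2$ term by term, so the two functions agree up to floating-point error.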
E.2 Proof of Lemma 26: Label Complexity of Algorithm 5

The proof of Lemma 26 essentially runs parallel to that of Theorem 16, with variants of each lemma from that proof adapted to the noise-robust Algorithm 5. As before, in this section we will fix a particular joint distribution $P_{XY}$ on $\mathcal{X} \times \{-1,+1\}$ with marginal $P$ on $\mathcal{X}$, and then analyze the label complexity achieved by Algorithm 5 for that particular distribution. For our purposes, we will suppose $P_{XY}$ satisfies Condition 1 for some finite parameters $\mu$ and $\kappa$. We also fix any $f \in \bigcap_{\varepsilon > 0} \operatorname{cl}(\mathbb{C}(\varepsilon))$. Furthermore, we will continue using the notation of Appendix B, such as $\mathcal{S}_k(\mathcal{H})$, etc., and in particular we continue to denote $V^\star_m = \{h \in \mathbb{C} : \forall \ell \le m,\, h(X_\ell) = f(X_\ell)\}$ (though note that in this case, we may sometimes have $f(X_\ell) \ne Y_\ell$, so that $V^\star_m \ne \mathbb{C}[\mathcal{Z}_m]$). As in the above proofs, we will prove a slightly more general result in which the "$1/2$" threshold in Step 5 can be replaced by an arbitrary constant $\gamma \in (0,1)$. For the estimators $\hat P_{4m}$ used in the algorithm, we take the same definitions as in Appendix B.1. To be clear, we assume the sequences $\mathcal{W}_1$ and $\mathcal{W}_2$ mentioned there are independent from the entire $(X_1,Y_1), (X_2,Y_2), \ldots$ sequence of data points; this is consistent with the earlier discussion of how these $\mathcal{W}_1$ and $\mathcal{W}_2$ sequences can be constructed in a preprocessing step. We will consider running Algorithm 5 with label budget $n \in \mathbb{N}$ and confidence parameter $\delta \in (0, e^{-3})$, and analyze properties of the internal sets $V_i$. We will denote by $\hat V_i$, $\hat L_i$, and $\hat i_k$ the final values of $V_i$, $L_i$, and $i_k$, respectively, for each $i$ and $k$ in Algorithm 5.
We also denote by $\hat m(k)$ and $\hat V(k)$ the final values of $m$ and $V_{i_k+1}$, respectively, obtained while $k$ has the specified value in Algorithm 5; $\hat V(k)$ may be smaller than $\hat V_{\hat i_k}$ when $\hat m(k)$ is not a power of $2$. Additionally, define $L^\star_i = \{(X_m, Y_m)\}_{m=2^{i-1}+1}^{2^i}$. After establishing a few results concerning these, we will show that for $n$ satisfying the condition in Lemma 26, the conclusion of the lemma holds.

First, we have a few auxiliary definitions. For $\mathcal{H} \subseteq \mathbb{C}$, and any $i \in \mathbb{N}$, define
\[ \phi_i(\mathcal{H}) = \mathbb{E}\sup_{h_1,h_2 \in \mathcal{H}} \Big[\big(\operatorname{er}(h_1) - \operatorname{er}_{L^\star_i}(h_1)\big) - \big(\operatorname{er}(h_2) - \operatorname{er}_{L^\star_i}(h_2)\big)\Big] \]
and
\[ \tilde U_i(\mathcal{H}, \delta) = \min\left\{\tilde K\left(\phi_i(\mathcal{H}) + \sqrt{\frac{\operatorname{diam}(\mathcal{H})\ln(32i^2/\delta)}{2^{i-1}}} + \frac{\ln(32i^2/\delta)}{2^{i-1}}\right),\; 1\right\}, \]
where for our purposes we can take $\tilde K = 8272$. It is known (see e.g., Massart and Nédélec, 2006; Giné and Koltchinskii, 2006) that for some universal constant $c' \in [2,\infty)$,
\[ \phi_{i+1}(\mathcal{H}) \le c'\max\left\{\sqrt{\operatorname{diam}(\mathcal{H})\,2^{-i}\,d\log_2\frac{2}{\operatorname{diam}(\mathcal{H})}},\; 2^{-i}\,d\,i\right\}. \tag{86} \]
We also generally have $\phi_i(\mathcal{H}) \le 2$ for every $i \in \mathbb{N}$. The next lemma is taken from the work of Koltchinskii (2006) on data-dependent Rademacher complexity bounds on the excess risk.

Lemma 57 For any $\delta \in (0, e^{-3})$, any $\mathcal{H} \subseteq \mathbb{C}$ with $f \in \operatorname{cl}(\mathcal{H})$, and any $i \in \mathbb{N}$, on an event $K_i$ with $P(K_i) \ge 1 - \delta/4i^2$, $\forall h \in \mathcal{H}$,
\[ \operatorname{er}_{L^\star_i}(h) - \min_{h' \in \mathcal{H}} \operatorname{er}_{L^\star_i}(h') \le \operatorname{er}(h) - \operatorname{er}(f) + \hat U_i(\mathcal{H}, \delta), \]
\[ \operatorname{er}(h) - \operatorname{er}(f) \le \operatorname{er}_{L^\star_i}(h) - \operatorname{er}_{L^\star_i}(f) + \hat U_i(\mathcal{H}, \delta), \]
\[ \min\big\{\hat U_i(\mathcal{H}, \delta),\, 1\big\} \le \tilde U_i(\mathcal{H}, \delta). \] ⋄

Lemma 57 essentially follows from a version of Talagrand's inequality. The details of the proof may be extracted from the proofs of Koltchinskii (2006), and related derivations have previously been presented by Hanneke (2011); Koltchinskii (2010).
The only minor twist here is that $f$ need only be in $\operatorname{cl}(\mathcal{H})$, rather than in $\mathcal{H}$ itself; this easily follows from Koltchinskii's original results, since the Borel–Cantelli lemma implies that with probability one, every $\varepsilon > 0$ has some $g \in \mathcal{H}(\varepsilon)$ (very close to $f$) with $\operatorname{er}_{L^\star_i}(g) = \operatorname{er}_{L^\star_i}(f)$. For our purposes, the important implications of Lemma 57 are summarized by the following lemma.

Lemma 58 For any $\delta \in (0, e^{-3})$ and any $n \in \mathbb{N}$, when running Algorithm 5 with label budget $n$ and confidence parameter $\delta$, on an event $J_n(\delta)$ with $P(J_n(\delta)) \ge 1 - \delta/2$, $\forall i \in \{0, 1, \ldots, \hat i_{d+1}\}$, if $V^\star_{2^i} \subseteq \hat V_i$ then $\forall h \in \hat V_i$,
\[ \operatorname{er}_{L^\star_{i+1}}(h) - \min_{h' \in \hat V_i} \operatorname{er}_{L^\star_{i+1}}(h') \le \operatorname{er}(h) - \operatorname{er}(f) + \hat U_{i+1}(\hat V_i, \delta) \tag{87} \]
\[ \operatorname{er}(h) - \operatorname{er}(f) \le \operatorname{er}_{L^\star_{i+1}}(h) - \operatorname{er}_{L^\star_{i+1}}(f) + \hat U_{i+1}(\hat V_i, \delta) \tag{88} \]
\[ \min\big\{\hat U_{i+1}(\hat V_i, \delta),\, 1\big\} \le \tilde U_{i+1}(\hat V_i, \delta). \tag{89} \] ⋄

Proof For each $i$, consider applying Lemma 57 under the conditional distribution given $\hat V_i$. The set $L^\star_{i+1}$ is independent from $\hat V_i$, as are the Rademacher variables in the definition of $\hat R_{i+1}(\hat V_i)$. Furthermore, by Lemma 35, on $H'$, $f \in \operatorname{cl}\big(V^\star_{2^i}\big)$, so that the conditions of Lemma 57 hold. The law of total probability then implies the existence of an event $J_i$ of probability $P(J_i) \ge 1 - \delta/4(i+1)^2$, on which the claimed inequalities hold for that value of $i$ if $i \le \hat i_{d+1}$. A union bound over values of $i$ then implies the existence of an event $J_n(\delta) = \bigcap_i J_i$ with probability $P(J_n(\delta)) \ge 1 - \sum_i \delta/4(i+1)^2 \ge 1 - \delta/2$ on which the claimed inequalities hold for all $i \le \hat i_{d+1}$.
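The union bound concluding the proof of Lemma 58 spends failure probability $\delta/4(i+1)^2$ at level $i$; the total over all $i \ge 0$ is $\delta\sum_{j\ge1} 1/(4j^2) = \delta\pi^2/24 \approx 0.411\,\delta < \delta/2$. The following snippet is only a numerical sanity check of that constant, not part of the proof.

```python
import math

# The union bound above allocates failure probability delta/(4*(i+1)^2) to
# level i; summing over all i >= 0 gives delta * sum_{j>=1} 1/(4 j^2),
# which equals delta * pi^2/24 and in particular is below delta/2.
partial = sum(1 / (4 * j**2) for j in range(1, 200001))

assert partial < 0.5                       # the delta/2 bound used in the proof
assert abs(partial - math.pi**2 / 24) < 1e-5
print("total failure probability factor:", partial)
```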
Lemma 59 For some $(\mathbb{C}, P_{XY}, \gamma)$-dependent constants $c, c^* \in [1,\infty)$, for any $\delta \in (0, e^{-3})$ and integer $n \ge c^*\ln(1/\delta)$, when running Algorithm 5 with label budget $n$ and confidence parameter $\delta$, on event $J_n(\delta) \cap H^{(i)}_n \cap H^{(ii)}_n$, every $i \in \{0, 1, \ldots, \hat i_{\tilde d_f}\}$ satisfies
\[ V^\star_{2^i} \subseteq \hat V_i \subseteq \mathbb{C}\left(c\left(\frac{di + \ln(1/\delta)}{2^i}\right)^{\frac{\kappa}{2\kappa-1}}\right), \]
and furthermore $V^\star_{\hat m(\tilde d_f)} \subseteq \hat V(\tilde d_f)$. ⋄

Proof Define
\[ c = \big(24\tilde K c'\sqrt{\mu}\big)^{\frac{2\kappa}{2\kappa-1}}, \qquad c^* = \max\left\{\tau^*,\; 8d\left(\frac{\mu c^{1/\kappa}}{r_{(1-\gamma)/6}}\right)^{2\kappa-1}\log_2\left(\frac{4\mu c^{1/\kappa}}{r_{(1-\gamma)/6}}\right)\right\}, \]
and suppose $n \ge c^*\ln(1/\delta)$. We now proceed by induction. As the right side equals $\mathbb{C}$ for $i = 0$, the claimed inclusions are certainly true for $\hat V_0 = \mathbb{C}$, which serves as our base case. Now suppose some $i \in \{0, 1, \ldots, \hat i_{\tilde d_f}\}$ satisfies
\[ V^\star_{2^i} \subseteq \hat V_i \subseteq \mathbb{C}\left(c\left(\frac{di + \ln(1/\delta)}{2^i}\right)^{\frac{\kappa}{2\kappa-1}}\right). \tag{90} \]
In particular, Condition 1 implies
\[ \operatorname{diam}(\hat V_i) \le \operatorname{diam}\left(\mathbb{C}\left(c\left(\frac{di + \ln(1/\delta)}{2^i}\right)^{\frac{\kappa}{2\kappa-1}}\right)\right) \le \mu c^{\frac1\kappa}\left(\frac{di + \ln(1/\delta)}{2^i}\right)^{\frac{1}{2\kappa-1}}. \tag{91} \]
If $i < \hat i_{\tilde d_f}$, then let $k$ be the integer for which $\hat i_{k-1} \le i < \hat i_k$, and otherwise let $k = \tilde d_f$. Note that we certainly have $\hat i_1 \ge \lfloor \log_2(n/2) \rfloor$, since $m = \lfloor n/2 \rfloor \ge 2^{\lfloor \log_2(n/2) \rfloor}$ is obtained while $k = 1$. Therefore, if $k > 1$,
\[ \frac{di + \ln(1/\delta)}{2^i} \le \frac{4d\log_2(n) + 4\ln(1/\delta)}{n}, \]
so that (91) implies
\[ \operatorname{diam}\big(\hat V_i\big) \le \mu c^{\frac1\kappa}\left(\frac{4d\log_2(n) + 4\ln(1/\delta)}{n}\right)^{\frac{1}{2\kappa-1}}. \]
By our choice of $c^*$, the right side is at most $r_{(1-\gamma)/6}$. Therefore, since Lemma 35 implies $f \in \operatorname{cl}\big(V^\star_{2^i}\big)$ on $H^{(i)}_n$, we have $\hat V_i \subseteq \mathrm{B}\big(f, r_{(1-\gamma)/6}\big)$ when $k > 1$. Combined with (90), we have that $V^\star_{2^i} \subseteq \hat V_i$, and either $k = 1$, or $\hat V_i \subseteq \mathrm{B}(f, r_{(1-\gamma)/6})$ and $4m > 4\lfloor n/2 \rfloor \ge n$.
Now consider any $m$ with $2^i + 1 \le m \le \min\{2^{i+1}, \hat m(\tilde d_f)\}$, and for the purpose of induction suppose $V^\star_{m-1} \subseteq V_{i+1}$ upon reaching Step 5 for that value of $m$ in Algorithm 5. Since $V_{i+1} \subseteq \hat V_i$ and $n \ge \tau^*$, Lemma 41 (with $\ell = m-1$) implies that on $H^{(i)}_n \cap H^{(ii)}_n$,
\[ \hat\Delta^{(k)}_{4m}(X_m, \mathcal{W}_2, V_{i+1}) < \gamma \implies \hat\Gamma^{(k)}_{4m}(X_m, -f(X_m), \mathcal{W}_2, V_{i+1}) < \hat\Gamma^{(k)}_{4m}(X_m, f(X_m), \mathcal{W}_2, V_{i+1}), \tag{92} \]
so that after Step 8 we have $V^\star_m \subseteq V_{i+1}$. Since (90) implies that the $V^\star_{m-1} \subseteq V_{i+1}$ condition holds if Algorithm 5 reaches Step 5 with $m = 2^i + 1$ (at which time $V_{i+1} = \hat V_i$), we have by induction that on $H^{(i)}_n \cap H^{(ii)}_n$, $V^\star_m \subseteq V_{i+1}$ upon reaching Step 9 with $m = \min\{2^{i+1}, \hat m(\tilde d_f)\}$. This establishes the final claim of the lemma, given that the first claim holds.

For the remainder of this inductive proof, suppose $i < \hat i_{\tilde d_f}$. Since Step 8 enforces that, upon reaching Step 9 with $m = 2^{i+1}$, every $h_1, h_2 \in V_{i+1}$ have $\operatorname{er}_{\hat L_{i+1}}(h_1) - \operatorname{er}_{\hat L_{i+1}}(h_2) = \operatorname{er}_{L^\star_{i+1}}(h_1) - \operatorname{er}_{L^\star_{i+1}}(h_2)$, on $J_n(\delta) \cap H^{(i)}_n \cap H^{(ii)}_n$ we have
\[ \hat V_{i+1} \subseteq \left\{h \in \hat V_i : \operatorname{er}_{L^\star_{i+1}}(h) - \min_{h' \in V^\star_{2^{i+1}}} \operatorname{er}_{L^\star_{i+1}}(h') \le \hat U_{i+1}\big(\hat V_i, \delta\big)\right\} \]
\[ \subseteq \left\{h \in \hat V_i : \operatorname{er}_{L^\star_{i+1}}(h) - \operatorname{er}_{L^\star_{i+1}}(f) \le \hat U_{i+1}\big(\hat V_i, \delta\big)\right\} \subseteq \hat V_i \cap \mathbb{C}\Big(2\hat U_{i+1}\big(\hat V_i, \delta\big)\Big) \subseteq \mathbb{C}\Big(2\tilde U_{i+1}\big(\hat V_i, \delta\big)\Big), \tag{93} \]
where the second line follows from Lemma 35 and the last two inclusions follow from Lemma 58. Focusing on (93), combining (91) with (86) (and the fact that $\phi_{i+1}(\hat V_i) \le 2$), we can bound $\tilde U_{i+1}\big(\hat V_i, \delta\big)$ as follows.
\[ \sqrt{\frac{\operatorname{diam}(\hat V_i)\ln(32(i+1)^2/\delta)}{2^i}} \le \sqrt{\mu}\,c^{\frac{1}{2\kappa}}\left(\frac{di+\ln(1/\delta)}{2^i}\right)^{\frac{1}{4\kappa-2}}\left(\frac{\ln(32(i+1)^2/\delta)}{2^i}\right)^{\frac12} \]
\[ \le \sqrt{\mu}\,c^{\frac{1}{2\kappa}}\left(\frac{2di+2\ln(1/\delta)}{2^{i+1}}\right)^{\frac{1}{4\kappa-2}}\left(\frac{8(i+1)+2\ln(1/\delta)}{2^{i+1}}\right)^{\frac12} \le 4\sqrt{\mu}\,c^{\frac{1}{2\kappa}}\left(\frac{d(i+1)+\ln(1/\delta)}{2^{i+1}}\right)^{\frac{\kappa}{2\kappa-1}}, \]
\[ \phi_{i+1}(\hat V_i) \le c'\sqrt{\mu}\,c^{\frac{1}{2\kappa}}\left(\frac{di+\ln(1/\delta)}{2^i}\right)^{\frac{1}{4\kappa-2}}\left(\frac{d(i+2)}{2^i}\right)^{\frac12} \le 4c'\sqrt{\mu}\,c^{\frac{1}{2\kappa}}\left(\frac{d(i+1)+\ln(1/\delta)}{2^{i+1}}\right)^{\frac{\kappa}{2\kappa-1}}, \]
and thus
\[ \tilde U_{i+1}(\hat V_i, \delta) \le \min\left\{8\tilde K c'\sqrt{\mu}\,c^{\frac{1}{2\kappa}}\left(\frac{d(i+1)+\ln(1/\delta)}{2^{i+1}}\right)^{\frac{\kappa}{2\kappa-1}} + \tilde K\,\frac{\ln(32(i+1)^2/\delta)}{2^i},\; 1\right\} \]
\[ \le 12\tilde K c'\sqrt{\mu}\,c^{\frac{1}{2\kappa}}\left(\frac{d(i+1)+\ln(1/\delta)}{2^{i+1}}\right)^{\frac{\kappa}{2\kappa-1}} = (c/2)\left(\frac{d(i+1)+\ln(1/\delta)}{2^{i+1}}\right)^{\frac{\kappa}{2\kappa-1}}. \]
Combining this with (93) now implies
\[ \hat V_{i+1} \subseteq \mathbb{C}\left(c\left(\frac{d(i+1)+\ln(1/\delta)}{2^{i+1}}\right)^{\frac{\kappa}{2\kappa-1}}\right). \]
To complete the inductive proof, it remains only to show $V^\star_{2^{i+1}} \subseteq \hat V_{i+1}$. Toward this end, recall we have shown above that on $H^{(i)}_n \cap H^{(ii)}_n$, $V^\star_{2^{i+1}} \subseteq V_{i+1}$ upon reaching Step 9 with $m = 2^{i+1}$, and that every $h_1, h_2 \in V_{i+1}$ at this point have $\operatorname{er}_{\hat L_{i+1}}(h_1) - \operatorname{er}_{\hat L_{i+1}}(h_2) = \operatorname{er}_{L^\star_{i+1}}(h_1) - \operatorname{er}_{L^\star_{i+1}}(h_2)$. Consider any $h \in V^\star_{2^{i+1}}$, and note that any other $g \in V^\star_{2^{i+1}}$ has $\operatorname{er}_{L^\star_{i+1}}(g) = \operatorname{er}_{L^\star_{i+1}}(h)$. Thus, on $H^{(i)}_n \cap H^{(ii)}_n$,
\[ \operatorname{er}_{\hat L_{i+1}}(h) - \min_{h' \in V_{i+1}} \operatorname{er}_{\hat L_{i+1}}(h') = \operatorname{er}_{L^\star_{i+1}}(h) - \min_{h' \in V_{i+1}} \operatorname{er}_{L^\star_{i+1}}(h') \le \operatorname{er}_{L^\star_{i+1}}(h) - \min_{h' \in \hat V_i} \operatorname{er}_{L^\star_{i+1}}(h') \]
\[ = \inf_{g \in V^\star_{2^{i+1}}} \operatorname{er}_{L^\star_{i+1}}(g) - \min_{h' \in \hat V_i} \operatorname{er}_{L^\star_{i+1}}(h'). \tag{94} \]
Lemma 58 and (90) imply that on $J_n(\delta) \cap H^{(i)}_n \cap H^{(ii)}_n$, the last expression in (94) is at most $\inf_{g \in V^\star_{2^{i+1}}} \operatorname{er}(g) - \operatorname{er}(f) + \hat U_{i+1}(\hat V_i, \delta)$, and Lemma 35 implies $f \in \operatorname{cl}\big(V^\star_{2^{i+1}}\big)$ on $H^{(i)}_n$, so that $\inf_{g \in V^\star_{2^{i+1}}} \operatorname{er}(g) = \operatorname{er}(f)$.
We therefore have $\operatorname{er}_{\hat L_{i+1}}(h) - \min_{h' \in V_{i+1}} \operatorname{er}_{\hat L_{i+1}}(h') \le \hat U_{i+1}(\hat V_i, \delta)$, so that $h \in \hat V_{i+1}$ as well. Since this holds for any $h \in V^\star_{2^{i+1}}$, we have $V^\star_{2^{i+1}} \subseteq \hat V_{i+1}$. The lemma now follows by the principle of induction.

Lemma 60 There exist $(\mathbb{C}, P_{XY}, \gamma)$-dependent constants $c^*_1, c^*_2 \in [1,\infty)$ such that, for any $\varepsilon, \delta \in (0, e^{-3})$ and integer
\[ n \ge c^*_1 + c^*_2\,\tilde\theta_f\big(\varepsilon^{\frac1\kappa}\big)\,\varepsilon^{\frac2\kappa - 2}\log_2^2\frac{1}{\varepsilon\delta}, \]
when running Algorithm 5 with label budget $n$ and confidence parameter $\delta$, on an event $J^*_n(\varepsilon,\delta)$ with $P(J^*_n(\varepsilon,\delta)) \ge 1 - \delta$, we have $\hat V_{\hat i_{\tilde d_f}} \subseteq \mathbb{C}(\varepsilon)$. ⋄

Proof Define
\[ c^*_1 = \max\left\{\left(\frac{2^{\tilde d_f+5}\mu c^{1/\kappa}}{r_{(1-\gamma)/6}}\right)^{2\kappa-1} d\log_2\frac{d\mu c^{1/\kappa}}{r_{(1-\gamma)/6}},\; \frac{2}{\tilde\delta_f^{1/3}}\ln\big(8c^{(i)}\big),\; \frac{120}{\tilde\delta_f^{1/3}}\ln\big(8c^{(ii)}\big)\right\} \]
and
\[ c^*_2 = \max\left\{c^*,\; 2^{\tilde d_f+5}\left(\frac{\mu c^{1/\kappa}}{r_{(1-\gamma)/6}}\right)^{2\kappa-1},\; 2^{\tilde d_f+15}\,\frac{\mu c^2 d}{\gamma\tilde\delta_f}\log_2^2(4dc)\right\}. \]
Fix any $\varepsilon, \delta \in (0, e^{-3})$ and integer $n \ge c^*_1 + c^*_2\,\tilde\theta_f\big(\varepsilon^{\frac1\kappa}\big)\,\varepsilon^{\frac2\kappa-2}\log_2^2\frac{1}{\varepsilon\delta}$. For each $i \in \{0, 1, \ldots\}$, let $\tilde r_i = \mu c^{\frac1\kappa}\big(\frac{di+\ln(1/\delta)}{2^i}\big)^{\frac{1}{2\kappa-1}}$. Also define
\[ \tilde i = \left\lceil \left(2 - \frac1\kappa\right)\log_2\frac{c}{\varepsilon} + \log_2\left(8d\log_2\frac{2dc}{\varepsilon\delta}\right) \right\rceil, \]
and let $\check i = \min\big\{i \in \mathbb{N} : \sup_{j \ge i} \tilde r_j < r_{(1-\gamma)/6}\big\}$. For any $i \in \big\{\check i, \ldots, \hat i_{\tilde d_f}\big\}$, let
\[ \mathcal{Q}_{i+1} = \left\{m \in \{2^i+1, \ldots, 2^{i+1}\} : \hat\Delta^{(\tilde d_f)}_{4m}\big(X_m, \mathcal{W}_2, \mathrm{B}(f, \tilde r_i)\big) \ge 2\gamma/3\right\}. \]
Also define
\[ \tilde Q = \frac{96}{\gamma\tilde\delta_f}\,\tilde\theta_f\big(\varepsilon^{\frac1\kappa}\big)\cdot 2\mu c^2\cdot\left(8d\log_2\frac{2dc}{\varepsilon\delta}\right)\cdot\varepsilon^{\frac2\kappa-2}. \]
By Lemma 59 and Condition 1, on $J_n(\delta) \cap H^{(i)}_n \cap H^{(ii)}_n$, if $i \le \hat i_{\tilde d_f}$,
\[ \hat V_i \subseteq \mathbb{C}\left(c\left(\frac{di+\ln(1/\delta)}{2^i}\right)^{\frac{\kappa}{2\kappa-1}}\right) \subseteq \mathrm{B}(f, \tilde r_i). \tag{95} \]
Lemma 59 also implies that, on $J_n(\delta) \cap H^{(i)}_n \cap H^{(ii)}_n$, for $i$ with $\hat i_{\tilde d_f - 1} \le i \le \hat i_{\tilde d_f}$, all of the sets $V_{i+1}$ obtained in Algorithm 5 while $k = \tilde d_f$ and $m \in \{2^i+1, \ldots, 2^{i+1}\}$ satisfy $V^\star_{2^i+1} \subseteq V_{i+1} \subseteq \hat V_i$.
Recall that $\hat i_1 \ge \lfloor \log_2(n/2) \rfloor$, so that we have either $\tilde d_f = 1$ or else every $m \in \{2^i+1, \ldots, 2^{i+1}\}$ has $4m > n$. Also recall that Lemma 49 implies that when the above conditions are satisfied, and $i \ge \check i$, on $H' \cap G^{(i)}_n$,
\[ \hat\Delta^{(\tilde d_f)}_{4m}(X_m, \mathcal{W}_2, V_{i+1}) \le (3/2)\,\hat\Delta^{(\tilde d_f)}_{4m}\big(X_m, \mathcal{W}_2, \mathrm{B}(f, \tilde r_i)\big), \]
so that $|\mathcal{Q}_{i+1}|$ upper bounds the number of $m \in \{2^i+1, \ldots, 2^{i+1}\}$ for which Algorithm 5 requests the label $Y_m$ in Step 6 of the $k = \tilde d_f$ round. Thus, on $J_n(\delta) \cap H^{(i)}_n \cap H^{(ii)}_n$,
\[ 2^{\check i} + \sum_{i=\max\{\check i,\, \hat i_{\tilde d_f - 1}\}}^{\hat i_{\tilde d_f}} |\mathcal{Q}_{i+1}| \]
upper bounds the total number of label requests by Algorithm 5 while $k = \tilde d_f$; therefore, by the constraint in Step 3, we know that either this quantity is at least as big as $\big\lfloor 2^{-\tilde d_f} n \big\rfloor$, or else we have $2^{\hat i_{\tilde d_f}+1} > \tilde d_f \cdot 2n$. In particular, on this event, if we can show that
\[ 2^{\check i} + \sum_{i=\max\{\check i,\, \hat i_{\tilde d_f-1}\}}^{\min\{\hat i_{\tilde d_f},\, \tilde i\}} |\mathcal{Q}_{i+1}| < \big\lfloor 2^{-\tilde d_f} n \big\rfloor \quad \text{and} \quad 2^{\tilde i + 1} \le \tilde d_f \cdot 2n, \tag{96} \]
then it must be true that $\tilde i < \hat i_{\tilde d_f}$. Next, we will focus on establishing this fact.

Consider any $i \in \big\{\max\{\check i, \hat i_{\tilde d_f-1}\}, \ldots, \min\{\hat i_{\tilde d_f}, \tilde i\}\big\}$ and any $m \in \{2^i+1, \ldots, 2^{i+1}\}$. If $\tilde d_f = 1$, then
\[ P\Big(\hat\Delta^{(\tilde d_f)}_{4m}\big(X_m, \mathcal{W}_2, \mathrm{B}(f,\tilde r_i)\big) \ge 2\gamma/3 \,\Big|\, \mathcal{W}_2\Big) = P^{\tilde d_f}\big(\mathcal{S}_{\tilde d_f}(\mathrm{B}(f, \tilde r_i))\big). \]
Otherwise, if $\tilde d_f > 1$, then by Markov's inequality and the definition of $\hat\Delta^{(\tilde d_f)}_{4m}(\cdot,\cdot,\cdot)$ from (16),
\[ P\Big(\hat\Delta^{(\tilde d_f)}_{4m}\big(X_m, \mathcal{W}_2, \mathrm{B}(f,\tilde r_i)\big) \ge 2\gamma/3 \,\Big|\, \mathcal{W}_2\Big) \le \frac{3}{2\gamma}\,\mathbb{E}\Big[\hat\Delta^{(\tilde d_f)}_{4m}\big(X_m, \mathcal{W}_2, \mathrm{B}(f,\tilde r_i)\big) \,\Big|\, \mathcal{W}_2\Big] \]
\[ = \frac{3}{2\gamma}\,\frac{1}{M^{(\tilde d_f)}_{4m}(\mathrm{B}(f,\tilde r_i))} \sum_{s=1}^{(4m)^3} P\Big(S^{(\tilde d_f)}_s \cup \{X_m\} \in \mathcal{S}_{\tilde d_f}(\mathrm{B}(f,\tilde r_i)) \,\Big|\, S^{(\tilde d_f)}_s\Big). \]
By Lemma 39, Lemma 59, and (95), on $J_n(\delta) \cap H^{(i)}_n \cap H^{(ii)}_n$, this is at most
\[ \frac{3}{\tilde\delta_f\gamma}\,\frac{1}{(4m)^3}\sum_{s=1}^{(4m)^3} P\Big(S^{(\tilde d_f)}_s \cup \{X_m\} \in \mathcal{S}_{\tilde d_f}(\mathrm{B}(f,\tilde r_i)) \,\Big|\, S^{(\tilde d_f)}_s\Big) \le \frac{24}{\tilde\delta_f\gamma}\,\frac{1}{4^3 2^{3i+3}}\sum_{s=1}^{4^3 2^{3i+3}} P\Big(S^{(\tilde d_f)}_s \cup \{X_m\} \in \mathcal{S}_{\tilde d_f}(\mathrm{B}(f,\tilde r_i)) \,\Big|\, S^{(\tilde d_f)}_s\Big). \]
Note that this value is invariant to the choice of $m \in \{2^i+1, \ldots, 2^{i+1}\}$. By Hoeffding's inequality, on an event $J^*_n(i)$ of probability $P(J^*_n(i)) \ge 1 - \delta/(16i^2)$, this is at most
\[ \frac{24}{\tilde\delta_f\gamma}\left(\sqrt{\frac{\ln(4i/\delta)}{4^3 2^{3i+3}}} + P^{\tilde d_f}\big(\mathcal{S}_{\tilde d_f}(\mathrm{B}(f,\tilde r_i))\big)\right). \tag{97} \]
Since $i \ge \hat i_1 > \log_2(n/4)$ and $n \ge \ln(1/\delta)$, we have
\[ \sqrt{\frac{\ln(4i/\delta)}{4^3 2^{3i+3}}} \le 2^{-i}\sqrt{\frac{\ln(4\log_2(n/4)/\delta)}{128n}} \le 2^{-i}\sqrt{\frac{\ln(n/\delta)}{128n}} \le 2^{-i}. \]
Thus, (97) is at most $\frac{24}{\tilde\delta_f\gamma}\big(2^{-i} + P^{\tilde d_f}(\mathcal{S}_{\tilde d_f}(\mathrm{B}(f,\tilde r_i)))\big)$. In either case ($\tilde d_f = 1$ or $\tilde d_f > 1$), by definition of $\tilde\theta_f\big(\varepsilon^{1/\kappa}\big)$, on $J_n(\delta) \cap H^{(i)}_n \cap H^{(ii)}_n \cap J^*_n(i)$, $\forall m \in \{2^i+1, \ldots, 2^{i+1}\}$ we have
\[ P\Big(\hat\Delta^{(\tilde d_f)}_{4m}\big(X_m, \mathcal{W}_2, \mathrm{B}(f,\tilde r_i)\big) \ge 2\gamma/3 \,\Big|\, \mathcal{W}_2\Big) \le \frac{24}{\gamma\tilde\delta_f}\Big(2^{-i} + \tilde\theta_f\big(\varepsilon^{\frac1\kappa}\big)\cdot\max\big\{\tilde r_i,\, \varepsilon^{\frac1\kappa}\big\}\Big). \tag{98} \]
Furthermore, the $\mathbb{1}_{[2\gamma/3,\infty)}\big(\hat\Delta^{(\tilde d_f)}_{4m}(X_m, \mathcal{W}_2, \mathrm{B}(f,\tilde r_i))\big)$ indicators are conditionally independent given $\mathcal{W}_2$, so that we may bound $P\big(|\mathcal{Q}_{i+1}| > \tilde Q \,\big|\, \mathcal{W}_2\big)$ via a Chernoff bound. Toward this end, note that on $J_n(\delta) \cap H^{(i)}_n \cap H^{(ii)}_n \cap J^*_n(i)$, (98) implies
\[ \mathbb{E}\big[|\mathcal{Q}_{i+1}| \,\big|\, \mathcal{W}_2\big] = \sum_{m=2^i+1}^{2^{i+1}} P\Big(\hat\Delta^{(\tilde d_f)}_{4m}\big(X_m, \mathcal{W}_2, \mathrm{B}(f,\tilde r_i)\big) \ge 2\gamma/3 \,\Big|\, \mathcal{W}_2\Big) \]
\[ \le 2^i\cdot\frac{24}{\gamma\tilde\delta_f}\Big(2^{-i} + \tilde\theta_f\big(\varepsilon^{\frac1\kappa}\big)\cdot\max\big\{\tilde r_i,\, \varepsilon^{\frac1\kappa}\big\}\Big) \le \frac{24}{\gamma\tilde\delta_f}\Big(1 + \tilde\theta_f\big(\varepsilon^{\frac1\kappa}\big)\cdot\max\big\{2^i\tilde r_i,\, 2^{\tilde i}\varepsilon^{\frac1\kappa}\big\}\Big). \]
(99)

Note that
\[ 2^i\tilde r_i = \mu c^{\frac1\kappa}\big(di + \ln(1/\delta)\big)^{\frac{1}{2\kappa-1}}\cdot 2^{i\left(1 - \frac{1}{2\kappa-1}\right)} \le \mu c^{\frac1\kappa}\big(d\tilde i + \ln(1/\delta)\big)^{\frac{1}{2\kappa-1}}\cdot 2^{\tilde i\left(1 - \frac{1}{2\kappa-1}\right)} \le \mu c^{\frac1\kappa}\left(8d\log_2\frac{2dc}{\varepsilon\delta}\right)^{\frac{1}{2\kappa-1}}\cdot 2^{\tilde i\left(1 - \frac{1}{2\kappa-1}\right)}. \]
Then since $2^{-\frac{\tilde i}{2\kappa-1}} \le \big(\frac{\varepsilon}{c}\big)^{\frac1\kappa}\cdot\big(8d\log_2\frac{2dc}{\varepsilon\delta}\big)^{-\frac{1}{2\kappa-1}}$, we have that the rightmost expression in (99) is at most
\[ \frac{24}{\gamma\tilde\delta_f}\Big(1 + \tilde\theta_f\big(\varepsilon^{\frac1\kappa}\big)\cdot\mu\cdot 2^{\tilde i}\varepsilon^{\frac1\kappa}\Big) \le \frac{24}{\gamma\tilde\delta_f}\left(1 + \tilde\theta_f\big(\varepsilon^{\frac1\kappa}\big)\cdot 2\mu c^2\cdot\left(8d\log_2\frac{2dc}{\varepsilon\delta}\right)\cdot\varepsilon^{\frac2\kappa-2}\right) \le \tilde Q/2. \]
Therefore, a Chernoff bound implies that on $J_n(\delta) \cap H^{(i)}_n \cap H^{(ii)}_n \cap J^*_n(i)$, we have
\[ P\big(|\mathcal{Q}_{i+1}| > \tilde Q \,\big|\, \mathcal{W}_2\big) \le \exp\big\{-\tilde Q/6\big\} \le \exp\left\{-8\log_2\frac{2dc}{\varepsilon\delta}\right\} \le \exp\left\{-\log_2\frac{48\log_2(2dc/\varepsilon\delta)}{\delta}\right\} \le \delta/(8\tilde i). \]
Combined with the law of total probability and a union bound over $i$ values, this implies there exists an event $J^*_n(\varepsilon,\delta) \subseteq J_n(\delta) \cap H^{(i)}_n \cap H^{(ii)}_n$ with
\[ P\Big(J_n(\delta) \cap H^{(i)}_n \cap H^{(ii)}_n \setminus J^*_n(\varepsilon,\delta)\Big) \le \sum_{i=\check i}^{\tilde i} \big(\delta/(16i^2) + \delta/(8\tilde i)\big) \le \delta/4, \]
on which every $i \in \big\{\max\{\check i, \hat i_{\tilde d_f-1}\}, \ldots, \min\{\hat i_{\tilde d_f}, \tilde i\}\big\}$ has $|\mathcal{Q}_{i+1}| \le \tilde Q$. We have chosen $c^*_1$ and $c^*_2$ large enough that $2^{\tilde i + 1} < \tilde d_f\cdot 2n$ and $2^{\check i} < 2^{-\tilde d_f - 2}n$. In particular, this means that on $J^*_n(\varepsilon,\delta)$,
\[ 2^{\check i} + \sum_{i=\max\{\check i,\, \hat i_{\tilde d_f-1}\}}^{\min\{\tilde i,\, \hat i_{\tilde d_f}\}} |\mathcal{Q}_{i+1}| < 2^{-\tilde d_f-2}n + \tilde i\tilde Q. \]
Furthermore, since $\tilde i \le 3\log_2\frac{4dc}{\varepsilon\delta}$, we have
\[ \tilde i\tilde Q \le 2^{13}\,\frac{\mu c^2 d}{\gamma\tilde\delta_f}\,\tilde\theta_f\big(\varepsilon^{\frac1\kappa}\big)\cdot\varepsilon^{\frac2\kappa-2}\cdot\log_2^2\frac{4dc}{\varepsilon\delta} \le 2^{13}\,\frac{\mu c^2 d\log_2^2(4dc)}{\gamma\tilde\delta_f}\,\tilde\theta_f\big(\varepsilon^{\frac1\kappa}\big)\cdot\varepsilon^{\frac2\kappa-2}\cdot\log_2^2\frac{1}{\varepsilon\delta} \le 2^{-\tilde d_f-2}n. \]
Combining the above, we have that (96) is satisfied on $J^*_n(\varepsilon,\delta)$, so that $\hat i_{\tilde d_f} > \tilde i$. Combined with Lemma 59, this implies that on $J^*_n(\varepsilon,\delta)$,
\[ \hat V_{\hat i_{\tilde d_f}} \subseteq \hat V_{\tilde i} \subseteq \mathbb{C}\left(c\left(\frac{d\tilde i + \ln(1/\delta)}{2^{\tilde i}}\right)^{\frac{\kappa}{2\kappa-1}}\right) \]
, and by definition of $\tilde i$ we have
\[ c\left(\frac{d\tilde i + \ln(1/\delta)}{2^{\tilde i}}\right)^{\frac{\kappa}{2\kappa-1}} \le c\left(8d\log_2\frac{2dc}{\varepsilon\delta}\right)^{\frac{\kappa}{2\kappa-1}}\cdot 2^{-\frac{\tilde i\kappa}{2\kappa-1}} \le c\left(8d\log_2\frac{2dc}{\varepsilon\delta}\right)^{\frac{\kappa}{2\kappa-1}}\cdot\frac{\varepsilon}{c}\cdot\left(8d\log_2\frac{2dc}{\varepsilon\delta}\right)^{-\frac{\kappa}{2\kappa-1}} = \varepsilon, \]
so that $\hat V_{\hat i_{\tilde d_f}} \subseteq \mathbb{C}(\varepsilon)$. Finally, to prove the stated bound on $P(J^*_n(\varepsilon,\delta))$, we have
\[ 1 - P(J^*_n(\varepsilon,\delta)) \le \big(1 - P(J_n(\delta))\big) + \Big(1 - P\big(H^{(i)}_n\big)\Big) + P\Big(H^{(i)}_n \setminus H^{(ii)}_n\Big) + P\Big(J_n(\delta) \cap H^{(i)}_n \cap H^{(ii)}_n \setminus J^*_n(\varepsilon,\delta)\Big) \]
\[ \le 3\delta/4 + c^{(i)}\cdot\exp\big\{-n^3\tilde\delta_f/8\big\} + c^{(ii)}\cdot\exp\big\{-n\tilde\delta_f^{1/3}/120\big\} \le \delta. \]

Finally, we are ready for the proof of Lemma 26.

Proof [Lemma 26] First, note that because we break ties in the argmax of Step 7 in favor of a $\hat y$ value with $V_{i_k+1}[(X_m, \hat y)] \ne \emptyset$, if $V_{i_k+1} \ne \emptyset$ before Step 8, then this remains true after Step 8. Furthermore, the $\hat U_{i_k+1}$ estimator is nonnegative, and thus the update in Step 10 never removes from $V_{i_k+1}$ the minimizer of $\operatorname{er}_{\hat L_{i_k+1}}(h)$ among $h \in V_{i_k+1}$. Therefore, by induction we have $V_{i_k} \ne \emptyset$ at all times in Algorithm 5. In particular, $\hat V_{\hat i_{d+1}+1} \ne \emptyset$, so that the returned classifier $\hat h$ exists. Also, by Lemma 60, for $n$ as in Lemma 60, on $J^*_n(\varepsilon,\delta)$, running Algorithm 5 with label budget $n$ and confidence parameter $\delta$ results in $\hat V_{\hat i_{\tilde d_f}} \subseteq \mathbb{C}(\varepsilon)$. Combining these two facts implies that for such a value of $n$, on $J^*_n(\varepsilon,\delta)$, $\hat h \in \hat V_{\hat i_{d+1}+1} \subseteq \hat V_{\hat i_{\tilde d_f}} \subseteq \mathbb{C}(\varepsilon)$, so that $\operatorname{er}\big(\hat h\big) \le \nu + \varepsilon$.

E.3 The Misspecified Model Case

Here we present a proof of Theorem 28, including a specification of the method $\mathcal{A}'_a$ from the theorem statement.

Proof [Theorem 28] Consider a weakly universally consistent passive learning algorithm $\mathcal{A}_u$ (Devroye, Györfi, and Lugosi, 1996).
Such a method must exist in our setting; for instance, Hoeffding's inequality and a union bound imply that it suffices to take
\[ \mathcal{A}_u(\mathcal{L}) = \operatorname*{argmin}_{\mathbb{1}_{\pm B_i}} \operatorname{er}_{\mathcal{L}}\big(\mathbb{1}_{\pm B_i}\big) + \sqrt{\frac{\ln(4i^2|\mathcal{L}|)}{2|\mathcal{L}|}}, \]
where $\{B_1, B_2, \ldots\}$ is a countable algebra that generates $\mathcal{F}_X$. Then $\mathcal{A}_u$ achieves a label complexity $\Lambda_u$ such that for any distribution $P_{XY}$ on $\mathcal{X} \times \{-1,+1\}$, $\forall\varepsilon \in (0,1)$, $\Lambda_u(\varepsilon + \nu^*(P_{XY}), P_{XY}) < \infty$. In particular, if $\nu^*(P_{XY}) < \nu(\mathbb{C}; P_{XY})$, then $\Lambda_u\big((\nu^*(P_{XY}) + \nu(\mathbb{C}; P_{XY}))/2,\, P_{XY}\big) < \infty$.

Fix any $n \in \mathbb{N}$, and describe the execution of $\mathcal{A}'_a(n)$ as follows. In a preprocessing step, withhold the first $m_{un} = n - \lfloor n/2 \rfloor - \lfloor n/3 \rfloor \ge n/6$ examples $\{X_1, \ldots, X_{m_{un}}\}$ and request their labels $\{Y_1, \ldots, Y_{m_{un}}\}$. Run $\mathcal{A}_a(\lfloor n/2 \rfloor)$ on the remainder of the sequence $\{X_{m_{un}+1}, X_{m_{un}+2}, \ldots\}$ (i.e., shift any index references in the algorithm by $m_{un}$), and let $h_a$ denote the classifier it returns. Also request the labels $Y_{m_{un}+1}, \ldots, Y_{m_{un}+\lfloor n/3 \rfloor}$, and let
\[ h_u = \mathcal{A}_u\big((X_{m_{un}+1}, Y_{m_{un}+1}), \ldots, (X_{m_{un}+\lfloor n/3 \rfloor}, Y_{m_{un}+\lfloor n/3 \rfloor})\big). \]
If $\operatorname{er}_{m_{un}}(h_a) - \operatorname{er}_{m_{un}}(h_u) > n^{-1/3}$, return $\hat h = h_u$; otherwise, return $\hat h = h_a$.

This method achieves the stated result, for the following reasons. First, let us examine the final step of this algorithm. By Hoeffding's inequality, with probability at least $1 - 2\exp\{-n^{1/3}/12\}$,
\[ \big|(\operatorname{er}_{m_{un}}(h_a) - \operatorname{er}_{m_{un}}(h_u)) - (\operatorname{er}(h_a) - \operatorname{er}(h_u))\big| \le n^{-1/3}. \]
When this is the case, a triangle inequality implies $\operatorname{er}(\hat h) \le \min\{\operatorname{er}(h_a),\, \operatorname{er}(h_u) + 2n^{-1/3}\}$.
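The final selection step of $\mathcal{A}'_a$ described above can be sketched as follows: compare the two candidates on the held-out labeled sample, and fall back to $h_u$ only when $h_a$ looks worse by more than the Hoeffding slack $n^{-1/3}$. The classifiers and holdout set below are toy stand-ins, invented purely for illustration of the selection rule.

```python
# Sketch of the model-selection step of A'_a: compare the two candidate
# classifiers on held-out labeled examples, and fall back to the universally
# consistent h_u only when h_a looks worse by more than the slack n^(-1/3).

def empirical_error(h, sample):
    """Fraction of (x, y) pairs in the sample that h misclassifies."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def select(h_a, h_u, holdout, n):
    """Return h_u if h_a's holdout error exceeds h_u's by more than n^(-1/3)."""
    threshold = n ** (-1 / 3)
    if empirical_error(h_a, holdout) - empirical_error(h_u, holdout) > threshold:
        return h_u
    return h_a

# Toy example: threshold classifiers on [0, 1] with a deterministic holdout set.
h_a = lambda x: 1 if x > 0.9 else -1     # badly placed threshold
h_u = lambda x: 1 if x > 0.5 else -1     # well placed threshold
holdout = [(t / 100, 1 if t / 100 > 0.5 else -1) for t in range(100)]
chosen = select(h_a, h_u, holdout, n=1000)
assert chosen is h_u   # h_a errs on ~40% of the holdout, far above 1000^(-1/3)
print("selected the fallback classifier h_u")
```

The asymmetric threshold reflects the design in the proof: $h_a$ is preferred by default (so its label-complexity guarantees survive in the benign case), and the slack $n^{-1/3}$ absorbs the deviation between empirical and true error gaps guaranteed by Hoeffding's inequality.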
If $P_{XY}$ satisfies the benign noise case, then for any $n \ge 2\Lambda_a(\varepsilon/2 + \nu(\mathbb{C}; P_{XY}), P_{XY})$, we have $\mathbb{E}[\operatorname{er}(h_a)] \le \nu(\mathbb{C}; P_{XY}) + \varepsilon/2$, so $\mathbb{E}[\operatorname{er}(\hat h)] \le \nu(\mathbb{C}; P_{XY}) + \varepsilon/2 + 2\exp\{-n^{1/3}/12\}$, which is at most $\nu(\mathbb{C}; P_{XY}) + \varepsilon$ if $n \ge 12^3\ln^3(4/\varepsilon)$. So in this case, we can take $\lambda(\varepsilon) = \big\lceil 12^3\ln^3(4/\varepsilon) \big\rceil$. On the other hand, if $P_{XY}$ is not in the benign noise case (i.e., the misspecified model case), then for any $n \ge 3\Lambda_u\big((\nu^*(P_{XY}) + \nu(\mathbb{C}; P_{XY}))/2,\, P_{XY}\big)$, $\mathbb{E}[\operatorname{er}(h_u)] \le (\nu^*(P_{XY}) + \nu(\mathbb{C}; P_{XY}))/2$, so that
\[ \mathbb{E}[\operatorname{er}(\hat h)] \le \mathbb{E}[\operatorname{er}(h_u)] + 2n^{-1/3} + 2\exp\{-n^{1/3}/12\} \le \frac{\nu^*(P_{XY}) + \nu(\mathbb{C}; P_{XY})}{2} + 2n^{-1/3} + 2\exp\{-n^{1/3}/12\}. \]
Again, this is at most $\nu(\mathbb{C}; P_{XY}) + \varepsilon$ if $n \ge \max\big\{12^3\ln^3(2/\varepsilon),\; 64\big(\nu(\mathbb{C}; P_{XY}) - \nu^*(P_{XY})\big)^{-3}\big\}$. So in this case, we can take
\[ \lambda(\varepsilon) = \left\lceil \max\left\{12^3\ln^3\frac{2}{\varepsilon},\; 3\Lambda_u\left(\frac{\nu^*(P_{XY}) + \nu(\mathbb{C}; P_{XY})}{2},\, P_{XY}\right),\; 64\big(\nu(\mathbb{C}; P_{XY}) - \nu^*(P_{XY})\big)^{-3}\right\} \right\rceil. \]
In either case, we have $\lambda(\varepsilon) \in \mathrm{Polylog}(1/\varepsilon)$.

Acknowledgments

I am grateful to Nina Balcan, Rui Castro, Sanjoy Dasgupta, Carlos Guestrin, Vladimir Koltchinskii, John Langford, Rob Nowak, Larry Wasserman, and Eric Xing for insightful discussions.

References

N. Abe and H. Mamitsuka. Query learning strategies using boosting and bagging. In Proceedings of the 15th International Conference on Machine Learning, 1998.

K. Alexander. Probability inequalities for empirical processes and a law of the iterated logarithm. Annals of Probability, 4:1041–1067, 1984.

M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

A. Antos and G. Lugosi. Strong minimax lower bounds for learning. Machine Learning, 30:31–56, 1998.

R. B. Ash and C. A. Doléans-Dade. Probability & Measure Theory. Academic Press, 2000.

P.
Auer and R. Ortner. A new PAC bound for intersection-closed concept classes. In Proceedings of the 17th Conference on Learning Theory, 2004.

M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, 2006a.

M.-F. Balcan, A. Blum, and S. Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. Machine Learning Journal, 65(1):79–94, 2006b.

M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Proceedings of the 20th Conference on Learning Theory, 2007.

M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.

M.-F. Balcan, S. Hanneke, and J. Wortman Vaughan. The true sample complexity of active learning. Machine Learning, 80(2–3):111–139, 2010.

J. Baldridge and A. Palmer. How well does active learning actually work? Time-based evaluation of cost-reduction strategies for language documentation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2009.

Z. Bar-Yossef. Sampling lower bounds via information theory. In Proceedings of the 35th Annual ACM Symposium on the Theory of Computing, 2003.

P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Proceedings of the International Conference on Machine Learning, 2009.

A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints.
In Advanc es in Neur al Informat ion Pr ocessin g Sys tems 23 , 2010. 1.1.1, 1.1.2, 5.1.3 A. Blumer , A. Ehrenfeucht , D. Haussler , and M. W armuth. Learna bility and the Vapnik- Cherv onen kis dimension . J ourna l of the Associ ation for Computing Machiner y , 36(4):9 29–965 , 1989. 3.3, A F . Bunea, A. B. Tsyba ko v , and M. W eg kamp. Sparsity oracle ineq ualities for the lasso. Electr onic J ournal of Statistic s , 1: 169–19 4, 2009. 7 114 A C T I V I Z E D L E A R N I N G C. Campbell, N. Cristianin i, and A. Smola. Query learning with lar ge mar gin classifiers . In Pr o- ceedin gs of th e 17 th Intern ational Confer ence on Machine Learning , 2000 . 7 R. Castro and R. D. Nowa k. Minimax bounds for acti ve learni ng. IEEE T ra nsactio ns on Information Theory , 54(5): 2339–2 353, 2008. 1.1.2, 6.4, 6.4, 6.4 D. C o hn, L. Atlas, and R. Ladner . Improv ing generalizati on with activ e learning. Machine Learning , 15(2): 201–22 1, 1994. 1.1.1, 3, 5.1, 5.1.1, 5.1.4, 5.4 S. Dasgupta . Coarse sample complexity bound s for acti ve learning. In Advanc es in Neur al Infor - mation Pr ocessing Systems 18 , 2005. 1.1.1, 2, 2.1, 5.5, 7 S. Dasgup ta, A. T . Kalai, and C. Monteleon i. Analysis of perce ptron-b ased acti ve learning. In Pr oceedings of the 18 th Confer ence on L ea rning Theory , 2005. 1.1.1, 7 S. Das gupta, D. Hsu , an d C. Monteleon i. A g eneral ag nostic a cti ve learnin g algori thm. In Advances in Neur al Information Pr ocess ing Sy stems 20 , 2007. 1.1.1, 1.1.2, 5.1.3, 6.4, 6.5 S. Das gupta, A. T . K al ai, and C. Montele oni. Analysis of percep tron-ba sed acti v e learnin g. Jou rnal of Mach ine Learning Resear ch , 10:281–299 , 2009. 1.1.1, 7 O. D e kel, C. Gentile, and K. Sridharan . Robu st selecti ve sampling from single and multiple teacher s. In Pr oceedings of the 23 rd Confer ence on Lea rning Theory , 2010. 1.1, 6.4 L. De vroy e, L . Gy ¨ orfi, and G. Lugosi. A P r obab ilistic Theory of P attern Reco gnition . 
Spri nger - V erlag New Y ork, Inc., 1996. 6.8, B.2, E.3 R. M. Dudle y . Real Analysis and Pr obability . Cambridge Univ ersity Press, 2002. 6.1 Y . Freund , H. S. Seung, E. Shamir , and N. T ishby . Selecti ve sampling using th e query by co mmittee algori thm. M a chin e Learning , 28:133–16 8, 1997. 1.1.1 E. Friedman . Acti ve learning for smooth prob lems. In Pr oceedings of the 22 nd Confer ence on Learning Theory , 2009. 1.1.1, 3.2, 3.3, 5.1.3 R. Gang adhara iah, R. D . Bro wn, and J. Carbon ell. A c ti ve learning in example-ba sed m a chine transla tion. In Pr oceeding s of th e 17 th Nor dic Con fer ence on Computat ional L i nguisti cs , 200 9. 1.1 E. Gin ´ e and V . K oltchinski i. C o ncentrat ion in equaliti es and a symptotic results for rati o type empir - ical proces ses. T he Annals of Pr oba bility , 34(3):114 3–121 6, 200 6. 6.4, E.2 S. A. Goldman and M. J. Kearns. On the comple xity of teaching. Jour nal of Computer and System Scienc es , 50:20–31, 1995. 1.1.1 S. Hannek e. T eaching dimension and the complexity of acti ve learnin g. In Pr oceedin gs of the 20 th Confer ence on L ea rning Theory , 2007a. 1.1.1, 1.1.2 S. Hanneke. A bound on the la bel complexity of a gnostic activ e learning. In Pr ocee dings of the 2 4 th Intern ational Confer ence on Mac hine Learnin g , 2007b. 1.1.1, 1.1.2, 3.2, 3.3, 5.1, 5.1.3, 5.1.3, 5.1.4, 5.1.4, 5.1.4 115 H A N N E K E S. H a nnek e. Adapti ve rates of con ver gence in activ e learni ng. In Pr oceedi ngs of the 22 nd Confer ence on Learning Theory , 2009a. 1.1.2 S. Hanneke. T h eor etical F ounda tions of Active Learning . PhD thesis, Machine Learn ing Depart- ment, School of Computer Science , C a rnegi e M e llon Uni versi ty , 2009b. ∗ , 3.3, 6.6, 6.8 S. Hanneke. Rates of con ver gence in activ e learni ng. The Annals of Statist ics , 39(1): 333–36 1, 20 11. 1.1.1, 1.1.2, 3.2, 3.3, 5.1, 5.1.1, 5.1.3, 5.1.4, 5.1.4, 5.1.4, 5.4, 6, 6.4, 6.4, 6.5, 6.5, 6.5, 7, E.2 S. Har-Peled , D. Roth, and D. Zimak. 
Maximum mar gin coresets for acti ve and noise tolerant learnin g. In Pr oceed ings of the 20 th Intern ational Join t Confer ence on A r tificial Intellige nce , 2007. 7 D. Haussler . Decision theoretic gen eralizat ions of the P A C mode l fo r n eural net and other learning applic ations. Information and C o mputation , 100 :78–15 0, 1992. 6.1 D. Haussle r , N. Littleston e, and M. W armuth. Predicting { 0 , 1 } -funct ions on randomly drawn points . In formation and Computation , 115:248 –292, 199 4. 4.3, 5.1.4, 5.4 T . He ged ¨ us. Generali zed tea ching dimension and the q uery comple xity of learning . In Pr ocee dings of the 8 th Confer ence on Compu tationa l Learning Theory , 1995. 1.1, 1.1.1 L. Heller stein, K. Pillaip akkamnatt , V . Ragha v an, and D. Wi lkins. H o w many queries are ne eded to learn? Jo urnal of the Association for Computing Machine ry , 43(5):840–8 62, 1996. 1.1.1 S. C. H . Hoi, R. Jin, J. Zhu, and M. R. L yu. Batch mode acti ve lear ning and its app lication to medical image clas sification. In Pr ocee dings of the 23 rd Intern ational Confer ence on Machine Learning , 200 6. 1.1 M. K ¨ a ¨ ari ¨ ainen. Acti v e learnin g in the non -realizab le case. In Pr oceed ings of the 17 th Intern ational Confer ence on A lg orithmic Learning Theory , 2006. 1.1.2, 6.4 N. K a rmarkar . A new polynomial-ti me algorithm for linear programming. Combinator ica , 4:373 – 395, 1984. 4.4 M. J. Ke arns and U. V azirani. An In tr oduction to Comp utationa l Learni ng T h eory . T h e MIT Press, 1994. 4.4 M. J. Ke arns, R. E. Schapir e, an d L. M. Sellie. T oward e ffici ent agnostic learning. Mach ine L e arn- ing , 17:115 –141, 1994. 1.1.2 L. G . Khachiyan. A polynomial algori thm in linear pro gramming. So viet Mathemati cs Doklady , 20:19 1–194, 197 9. 4.4 V . K oltchinskii. Local rademacher complexit ies and oracle inequal ities in risk minimization . T h e Annals of Statist ics , 34(6):2593 –2656, 2006. 6.4, 6.4, 6.4, 6.4, 6.5, 7, E.2, E.2 V . 
K oltchinskii. O r acle inequalitie s in empirical risk minimizati on and spars e reco ver y pro blems: Lecture notes . T echnical report , ´ Ecole d’ ´ et ´ e de P ro babilit ´ es de Saint-Flour , 2008 . 6.4 116 A C T I V I Z E D L E A R N I N G V . Kolt chinski i. Rademac her complexiti es and bou nding th e excess risk in acti v e learning. J ournal of Mach ine Learning Resear ch , 11:2457–24 85, 2010. 1.1.1, 1.1.2, 5.1.3, 6.4, 6.5, 6.5, E.2 S. L i. C o ncise formulas for the ar ea and vol ume of a hyperspher ical cap. Asian Journ al of Mathe- matics and Statis tics , 4(1 ):66–7 0, 2011 . 5.3 M. L i ndenbau m, S. Marko vitch , and D. R u sako v . Selec ti ve sampling for nearest neighbor classi- fiers. Mac hine Learning , 54: 125–15 2, 2004. 7 T . Luo, K. Kramer , D. B. G o ldgof, L. O. Hall, S. Samson , A. Remsen, and T . Hopkins. Acti ve learnin g to recogn ize multip le typ es of pla nkton. J ournal of M a chin e Learn ing R e sear ch , 6: 589–6 13, 200 5. 1.1 S. Mahalan abis. A note on activ e learnin g for smooth probl ems. arXiv :1103.30 95 , 2011. 1.1.1, 5.1.3 E. Mammen and A. B. Tsyb ako v . Smooth discrimin ation analysis. Annals of Statist ics , 27:1808 – 1829, 1999 . 1.1.2, 6.4, 6.4, 6.4, 7 P . Massart and ´ E. N ´ ed ´ elec. Risk bounds for stat istical learni ng. The Annals of Sta tistics , 34(5): 2326– 2366, 2006. 6.4, 6.4, 6.4, 6.7, E. 2 A. McCallum and K. Nigam. Emplo ying EM in poo l-based act iv e learning for text classification. In Pr oceedings of the 15 th Intern ational Confer ence on Machine Learnin g , 19 98. 1.1, 7 P . Mitra, C. A . Murthy , and S. K. Pal. A probabilist ic acti ve suppor t ve ctor learning a lgorith m. IEEE T ransac tions on P attern Analysis and Machin e Intellig ence , 26(3 ):413– 418, 2004. 7 J. R. Munkres . T opolo gy . Prentic e Hall , Inc., 2 nd editio n, 20 00. 6.1 I. Muslea, S. Minton, and C. A. Knoblock. Activ e + semi-sup ervised learnin g = robus t multi-vie w learnin g. 
In P r oceed ings of the 19 th Intern ational C o nfer ence on Mac hine Learning , 2002. 7 R. D. No wak. Generalized binary search. In Pr oceedings of the 46 th Annual Allerton Confe r ence on Communication , Contr ol, and Computing , 200 8. 1.1.1 J. Poland and M. Hutter . MD L c on ver gence speed for B er noulli sequences . Stati stics and Comput- ing , 16:161 –175, 2006. E .1 G. V . Rocha, X . W ang, and B . Y u. Asymptotic distrib ution and spars istenc y for l1-p enalized para- metric M-estimators with applicatio ns to linear SV M and logist ic r egres sion. arXiv:090 8.1940v1 , 2009. 7 D. Roth and K . Small. M a rg in-base d act iv e learning for stru ctured outp ut spaces. In Eur opean Confer ence on M a chin e Learning , 2006. 7 N. Roy and A. McCallum. T owa rd optima l act iv e learning through sampling estimation of error reduct ion. In P r oceed ings of the 18 th Intern ational Confer ence on Mac hine Lea rning , 200 1. 1.1, 7 117 H A N N E K E A. I. S ch ein and L. H. Ungar . Activ e lea rning for logis tic regressio n: An e v alua tion. M a chine Learning , 68( 3):235– 265, 2007. 7 G. Schohn an d D . Cohn. Less is m o re: Acti ve learning with suppo rt v ector m a chines. In P r oceed - ings of the 17 th Intern ational C o nfer ence on Mach ine Learning , 2000. 7 B. Settles. Acti v e learning literature surve y . http ://activ e-learning .net , 2010. 1.1 S. M. Sri v asta v a. A Course on Bor el Sets . S p ringer -V erlag, 1998 . 2 S. T ong and D . K oller . S up port v ector mach ine acti ve lea rning with applications to te xt clas sifica- tion. J ourn al of Mac hine Learnin g R es ear ch , 2, 2001. 1.1, 7 A. B . Tsybak ov . O p timal agg rega tion of cla ssifiers in statistica l learning. The Anna ls of Statistics , 32(1): 135–16 6, 2004. 6.4, 6.4, 6.4, 7 L. G. V aliant. A theory of the learnabl e. Communicatio ns of the Assoc iation for Computing Ma- chi nery , 27(11):113 4–1142 , 1984 . 1.1.1, 4.4, 5.4 A. W . v an der V aart and J . A. W ellner . 
W eak Con ver gence a nd E mpir ical Pr ocesse s . S p ringer , 1996 . 7 V . V apnik. Estimation of Dependenc ies B a sed on Empirical Data . Springer -V erlag, New Y ork, 1982. 3.3, A V . V apnik. Stat istical Lear ning Theory . John W iley & Sons, Inc., 1998. 2 A. W ald. Sequential tests of sta tistical hy pothese s. The Annals of Math ematical Statistics , 16(2): 117–1 86, 194 5. E.1 L. W ang. Suf ficient conditions for agnosti c activ e learnab le. In Advance s in Neural Infor mation Pr ocessing Systems 22 , 2009. 1.1.1, 1.1.2, 3.3, 5.1.3 L. W ang. Smoothness, dis agreement coe fficie nt, and the labe l comple xity of agnostic activ e lea rn- ing. J ourna l o f Machin e Lear ning Resear c h , 12:2269–2 292, 2011. 1.1.1, 3.3, 5.1.3 L. W ang and X. Shen. On L1-norm multiclass suppor t v ector machine s. J ourna l of the America n Statis tical A ss ociatio n , 102( 478):58 3–594, 2007. 7 L. Y ang, S. Hannek e, and J. Carbonell. The sample comple xity of self-ve rifying Bayesia n ac- ti ve learning . In P r oceed ings of the 14 th Intern ational Confer ence on Artifici al Intelli gence and Statis tics , 201 1. 1.1.1 118
