The Power of Localization for Efficiently Learning Linear Separators with Noise


Authors: Pranjal Awasthi (Rutgers University), Maria Florina Balcan (Carnegie Mellon University), Philip M. Long (Sentient Technologies)

We introduce a new approach for designing computationally efficient learning algorithms that are tolerant to noise, and demonstrate its effectiveness by designing algorithms with improved noise tolerance guarantees for learning linear separators. We consider both the malicious noise model of Valiant [Valiant 1985; Kearns and Li 1988] and the adversarial label noise model of Kearns, Schapire, and Sellie [1994]. For malicious noise, where the adversary can corrupt both the label and the features, we provide a polynomial-time algorithm for learning linear separators in $\Re^d$ under isotropic log-concave distributions that can tolerate a nearly information-theoretically optimal noise rate of $\eta = \Omega(\epsilon)$, improving on the $\Omega\left(\frac{\epsilon^3}{\log^2(d/\epsilon)}\right)$ noise tolerance of [Klivans et al. 2009a]. In the case that the distribution is uniform over the unit ball, this improves on the $\Omega\left(\frac{\epsilon}{d^{1/4}}\right)$ noise tolerance of [Kalai et al. 2005] and the $\Omega\left(\frac{\epsilon^2}{\log(d/\epsilon)}\right)$ of [Klivans et al. 2009a]. For the adversarial label noise model, where the distribution over the feature vectors is unchanged and the overall probability of a noisy label is constrained to be at most $\eta$, we also give a polynomial-time algorithm for learning linear separators in $\Re^d$ under isotropic log-concave distributions that can handle a noise rate of $\eta = \Omega(\epsilon)$. In the case of the uniform distribution, this improves over the results of [Kalai et al. 2005], which either required runtime super-exponential in $1/\epsilon$ (ours is polynomial in $1/\epsilon$) or tolerated less noise.

Our algorithms are also efficient in the active learning setting, where learning algorithms only receive the classifications of examples when they ask for them. We show that, in this model, our algorithms achieve a label complexity whose dependence on the error parameter $\epsilon$ is polylogarithmic (and thus exponentially better than that of any passive algorithm). This provides the first polynomial-time active learning algorithm for learning linear separators in the presence of malicious noise or adversarial label noise. Our algorithms and analysis combine several ingredients, including aggressive localization, minimization of a progressively rescaled hinge loss, and a novel localized and soft outlier removal procedure. We use localization techniques (previously used for obtaining better sample complexity results) in order to obtain better noise-tolerant polynomial-time algorithms.

Categories and Subject Descriptors: F.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity
General Terms: Algorithms, Theory
Authors' addresses: P. Awasthi, pranjal.awasthi@rutgers.edu; M. F. Balcan, ninamf@cs.cmu.edu; P. M. Long, phil.long@sentient.ai.

1. INTRODUCTION

Overview. Dealing with noisy data is one of the main challenges in machine learning and is an active area of research. In this work we study the noise-tolerant learning of linear separators, arguably the most popular class of functions used in practice [Cristianini and Shawe-Taylor 2000].
Learning linear separators from correctly labeled (non-noisy) examples is a very well understood problem with simple efficient algorithms that are effective both in the classical passive learning setting [Kearns and Vazirani 1994; Vapnik 1998] and in the more modern active learning framework [Dasgupta 2011]. However, for noisy settings, except for the special case of uniform random noise, very few positive algorithmic results exist even for passive learning. In the context of theoretical computer science more broadly, problems of noisy learning are related to seminal results in approximation-hardness [Arora et al. 1993; Guruswami and Raghavendra 2006] and cryptographic assumptions [Blum et al. 1994; Regev 2005], are connected to other classical questions in learning theory (e.g., learning DNF formulas [Kearns et al. 1994]), and appear as barriers in differential privacy [Gupta et al. 2011].

In this paper we present new techniques for designing efficient algorithms for learning linear separators in the presence of malicious noise and adversarial label noise. These models were originally proposed for a setting in which the algorithm must work for an arbitrary, unknown distribution. As we will see, bounds on the amount of noise tolerated for this distribution-free setting were weak, and no significant progress was made for many years. This motivated research investigating the role of the distribution generating the data on the tolerable level of noise: a breakthrough result of [Kalai et al. 2005] and subsequent work of [Klivans et al. 2009a] showed that indeed better bounds can be obtained for the uniform and isotropic log-concave distributions. In this paper, we continue this line of research.

For the malicious noise case, where the adversary can corrupt both the label and the features of the observation (and it has unbounded computational power and access to the entire history of the learning algorithm's computation), we design an efficient algorithm that can learn with accuracy $1 - \epsilon$ while tolerating an $\Omega(\epsilon)$ noise rate. This is within a constant factor of the statistical limit even in the case of the uniform distribution.
In particular, unlike previous works, our noise tolerance limit has no dependence on the dimension $d$ of the space. We also show similar improvements for adversarial label noise, and furthermore show that our algorithms can naturally exploit the power of active learning. Active learning is a widely studied modern learning paradigm, where the learning algorithm only receives the class labels of examples when it asks for them. We show that in this model our algorithms achieve a label complexity whose dependence on the error parameter $\epsilon$ is exponentially better than that of any passive algorithm. This provides the first polynomial-time active learning algorithm for learning linear separators in the presence of adversarial label noise, solving an open problem posed in [Balcan et al. 2006; Monteleoni 2006]. It also provides the first analysis showing the benefits of active learning over passive learning under the challenging malicious noise model.

Our work brings a new set of algorithmic and analysis techniques, including localization (previously used for obtaining better sample complexity results) and soft outlier removal, that we believe will have other applications in learning theory and optimization. Localization [Bartlett et al. 2005; Boucheron et al. 2005; Zhang 2006; Balcan et al. 2007; Bshouty et al. 2009; Koltchinskii 2010; Hanneke 2011; Balcan and Long 2013] refers to the practice of progressively narrowing the focus of a learning algorithm to an increasingly restricted range of possibilities (which are known to be safe given the information up to a certain point in time), thereby improving the stability of estimates of the quality of these possibilities based on random data.

In the following we start by formally defining the learning models we consider. We then present the most relevant prior work, and then our main results and techniques.

Passive and Active Learning. Noise Models. In this work we consider the problem of learning linear separators in two learning paradigms: the classical passive learning setting and the more modern active learning scenario. As is typical [Kearns and Vazirani 1994; Vapnik 1998], we assume that there exists a distribution $D$ over $\Re^d$ and a fixed unknown target function whose parameter vector is $w^*$. In the noise-free case, in the passive supervised learning model the algorithm is given access to a distribution oracle $EX(D, w^*)$ from which it can get training samples $(x, \mathrm{sign}(w^* \cdot x))$ where $x \sim D$. The goal of the algorithm is to output a hypothesis $w$ such that $\mathrm{err}_D(w) = \Pr_{x \sim D}[\mathrm{sign}(w^* \cdot x) \neq \mathrm{sign}(w \cdot x)] \leq \epsilon$. In the active learning model [Cohn et al. 1994; Dasgupta 2011] the learning algorithm is given as input a pool of unlabeled examples drawn from the distribution oracle. The algorithm can then query for the labels of examples of its choice from the pool. The goal is to produce a hypothesis of low error while also optimizing for the number of label queries (also known as label complexity).
The hope is that in the active learning setting we can output a classifier of small error by using many fewer label requests than in the passive learning setting, by actively directing the queries to informative examples (while keeping the number of unlabeled examples polynomial).

In this work we focus on two noise models. The first one is the malicious noise model of [Valiant 1985; Kearns and Li 1988], where samples are generated as follows: with probability $(1 - \eta)$ a random pair $(x, y)$ is output, where $x \sim D$ and $y = \mathrm{sign}(w^* \cdot x)$; with probability $\eta$ the adversary can output an arbitrary pair $(x, y) \in \Re^d \times \{-1, 1\}$. We will call $\eta$ the noise rate. Each of the adversary's examples can depend on the state of the learning algorithm and also the previous draws of the adversary. We will denote the malicious oracle as $EX_\eta(D, w^*)$. The goal remains, however, to output a hypothesis $w$ such that $\Pr_{x \sim D}[\mathrm{sign}(w^* \cdot x) \neq \mathrm{sign}(w \cdot x)] \leq \epsilon$.

In this paper, we consider an extension of the malicious noise model to the active learning model as follows. There are two oracles, an example generation oracle and a label revealing oracle. The example generation oracle works as usual in the malicious noise model: with probability $(1 - \eta)$ a random pair $(x, y)$ is generated, where $x \sim D$ and $y = \mathrm{sign}(w^* \cdot x)$; with probability $\eta$ the adversary can output an arbitrary pair $(x, y) \in \Re^d \times \{-1, 1\}$. In the active learning setting, unlike the standard malicious noise model, when an example $(x, y)$ is generated the algorithm only receives $x$, and must make a separate call to the label revealing oracle to get $y$. The goal of the algorithm is still to output a hypothesis $w$ such that $\Pr_{x \sim D}[\mathrm{sign}(w^* \cdot x) \neq \mathrm{sign}(w \cdot x)] \leq \epsilon$.

In the adversarial label noise model, before any examples are generated, the adversary may choose a joint distribution $P$ over $\Re^d \times \{-1, 1\}$ whose marginal distribution over $\Re^d$ is $D$ and such that $\Pr_{(x,y) \sim P}(\mathrm{sign}(w^* \cdot x) \neq y) \leq \eta$. In the active learning version of this model, once again we will have two oracles, an example generation oracle and a label revealing oracle. We note that the results from our theorems in this model translate immediately into similar guarantees for the agnostic model of [Kearns et al. 1994] (used commonly both in passive and active learning, e.g., [Kalai et al. 2005; Balcan et al. 2006; Hanneke 2007]); see Appendix C for details.

We will be interested in algorithms that run in time $\mathrm{poly}(d, 1/\epsilon)$ and use $\mathrm{poly}(d, 1/\epsilon)$ examples. In addition, for the active learning scenario we want our algorithms to also optimize for the number of label requests. In particular, we want the number of labeled examples to depend only polylogarithmically on $1/\epsilon$. The goal then is to quantify, for a given value of $\epsilon$, the tolerable noise rate $\eta(\epsilon)$ which would allow us to design an efficient (passive or active) learning algorithm.
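For readers who find a procedural view helpful, the example-generation oracle $EX_\eta(D, w^*)$ just described can be sketched as follows. This is only an illustration of the noise model, not part of the paper's algorithms; the isotropic Gaussian below stands in for a generic isotropic log-concave $D$, and `adversary` is any callable the malicious party controls.

```python
import numpy as np

def malicious_example_oracle(w_star, eta, adversary, rng):
    """One draw from EX_eta(D, w*): with probability 1 - eta, a clean pair
    (x, sign(w* . x)) with x ~ D; with probability eta, an arbitrary pair
    (x, y) in R^d x {-1, +1} chosen by the adversary, which may depend on
    the learner's state and on all previous draws."""
    if rng.random() >= eta:
        x = rng.standard_normal(w_star.shape[0])   # stand-in for an isotropic log-concave D
        y = 1.0 if w_star @ x >= 0 else -1.0       # sign(w* . x)
        return x, y
    return adversary()                              # adversarial (x, y), unrestricted

rng = np.random.default_rng(0)
```

In the active variants, the learner would receive only $x$ from such an oracle and would pay separately, through the label revealing oracle, to see $y$.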
Previous Work. In the context of passive learning, Kearns and Li's analysis [1988] implies that halfspaces can be efficiently learned with respect to arbitrary distributions in polynomial time while tolerating a malicious noise rate of $\tilde{\Omega}\left(\frac{\epsilon}{d}\right)$. Kearns and Li [1988] also showed that malicious noise at a rate greater than $\frac{\epsilon}{1+\epsilon}$ cannot be tolerated (and a slight variant of their construction shows that this remains true even when the distribution is uniform over the unit sphere). The $\tilde{\Omega}\left(\frac{\epsilon}{d}\right)$ bound for the distribution-free case was not improved for many years. Kalai et al. [2005] showed that,² when the distribution is uniform, the $\mathrm{poly}(d, 1/\epsilon)$-time averaging algorithm tolerates malicious noise at a rate $\Omega(\epsilon/\sqrt{d})$. They also described an improvement to $\tilde{\Omega}(\epsilon/d^{1/4})$ based on the observation that uniform examples will tend to be well-separated, so that pairs of examples that are too close to one another can be removed; this limits an adversary's ability to coordinate the effects of its noisy examples. [Klivans et al. 2009a] analyzed another approach to limiting the coordination of the noisy examples: they proposed an outlier removal procedure that used PCA to find any direction $u$ onto which projecting the training data led to suspiciously high variance, and removed examples with the most extreme values after projecting onto any such $u$. Their algorithm tolerates malicious noise at a rate $\Omega(\epsilon^2/\log(d/\epsilon))$ under the uniform distribution.

² These results from [Kalai et al. 2005] are most closely related to our work. We describe some of their other results, more prominently featured in their paper, later.

Motivated by the fact that many modern machine learning applications have massive amounts of unannotated or unlabeled data, there has been significant interest in designing active learning algorithms that most efficiently utilize the available data, while minimizing the need for human intervention. Over the past decade there has been substantial progress on understanding the underlying statistical principles of active learning, and several general characterizations have been developed for describing when active learning can have an advantage over the classical passive supervised learning paradigm, both in the noise-free setting and in the agnostic case [Freund et al. 1997; Dasgupta 2005; Balcan et al. 2006; Balcan et al. 2007; Hanneke 2007; Dasgupta et al. 2007; Castro and Nowak 2007; Balcan et al. 2008; Koltchinskii 2010; Beygelzimer et al. 2010; Wang 2011; Dasgupta 2011; Raginsky and Rakhlin 2011; Balcan and Hanneke 2012; Hanneke 2014]. However, despite many efforts, except for very simple noise models (random classification noise [Balcan and Feldman 2013] and linear noise [Dekel et al. 2012]), to date there are no known computationally efficient algorithms with provable guarantees in the presence of noise. In particular, there are no computationally efficient algorithms for the agnostic case, and furthermore no result exists showing the benefits of active learning over passive learning in the malicious noise model, where the adversary may also corrupt the features. We discuss additional related work in Appendix A.

1.1. Our Results

The following are our main results.
THEOREM 1.1. There is a polynomial-time algorithm $A_1$ for learning linear separators with respect to isotropic log-concave distributions in $\Re^d$ in the presence of adversarial label noise, and positive constants $C$ and $\epsilon_0$, such that, for all $0 < \epsilon < \epsilon_0$ and all $\delta > 0$, if $\eta < C\epsilon$, then the output $w$ of $A_1$ satisfies $\Pr_{(x,y) \sim D}[\mathrm{sign}(w \cdot x) \neq \mathrm{sign}(w^* \cdot x)] \leq \epsilon$ with probability at least $1 - \delta$. Further, $A_1$ uses at most $\mathrm{poly}(d, \log(1/\epsilon), \log(1/\delta))$ labeled examples.

THEOREM 1.2. There is a polynomial-time algorithm $A_2$ for learning linear separators with respect to isotropic log-concave distributions in $\Re^d$ in the presence of malicious noise, and positive constants $C$ and $\epsilon_0$, such that, for all $0 < \epsilon < \epsilon_0$ and all $\delta > 0$, if $\eta < C\epsilon$, then the output $w$ of $A_2$ satisfies $\Pr_{(x,y) \sim D}[\mathrm{sign}(w \cdot x) \neq \mathrm{sign}(w^* \cdot x)] \leq \epsilon$ with probability at least $1 - \delta$. $A_2$ uses at most $\mathrm{poly}(d, \log(1/\epsilon), \log(1/\delta))$ labeled examples.

As a restatement of Theorem 1.1, in the agnostic setting considered in [Kalai et al. 2005], we can output a halfspace of error at most $O(\eta + \alpha)$ in time $\mathrm{poly}(d, 1/\alpha)$. In the case of the uniform distribution, Kalai et al. achieved error $\eta + \alpha$ by learning a low-degree polynomial, in time whose dependence on the inverse accuracy is super-exponential. On the other hand, this result of [Kalai et al. 2005] applies when the target halfspace does not necessarily go through the origin.

Our algorithms naturally exploit the power of active learning. (Indeed, as we will see, an active learning algorithm proposed in [Balcan et al. 2007] provided the springboard for our work.) We show that in this model, the label complexity of both algorithms is polylogarithmic in $1/\epsilon$. Our efficient algorithm that tolerates adversarial label noise solves an open problem posed in [Balcan et al. 2006; Monteleoni 2006]. Furthermore, our paper provides the first active learning algorithm for learning linear separators in the presence of a non-trivial amount of adversarial noise that can affect not only the label, but also the features.

Our work exploits the power of localization for designing noise-tolerant polynomial-time algorithms. Such localization techniques have been used for analyzing sample complexity for passive learning (see [Bartlett et al. 2005; Boucheron et al. 2005; Zhang 2006; Bshouty et al. 2009; Balcan and Long 2013]) or for designing active learning algorithms (see [Balcan et al. 2007; Koltchinskii 2010; Hanneke 2011; Balcan and Long 2013]). Ideas useful for making such a localization strategy computationally efficient, and for tolerating malicious noise, are described in Section 1.2.

We note that all our algorithms are proper, in that they return a linear separator. (Linear models can be evaluated efficiently, and are otherwise easy to work with.) We summarize our results, and the most closely related previous work, in Tables I and II.

Table I: Comparison with previous poly$(d, 1/\epsilon)$-time algorithms for the uniform distribution

  Passive learning, malicious noise:
    Prior work: $\eta = \Omega\left(\frac{\epsilon}{d^{1/4}}\right)$ [Kalai et al. 2005]; $\eta = \Omega\left(\frac{\epsilon^2}{\log(d/\epsilon)}\right)$ [Klivans et al. 2009a]
    Our work:   $\eta = \Omega(\epsilon)$
  Passive learning, adversarial label noise:
    Prior work: $\eta = \Omega\left(\frac{\epsilon}{\sqrt{\log(1/\epsilon)}}\right)$ [Kalai et al. 2005]
    Our work:   $\eta = \Omega(\epsilon)$
  Active learning (malicious and adversarial):
    Prior work: none
    Our work:   $\eta = \Omega(\epsilon)$
Table II: Comparison with previous poly$(d, 1/\epsilon)$-time algorithms for isotropic log-concave distributions

  Passive learning, malicious noise:
    Prior work: $\eta = \Omega\left(\frac{\epsilon^3}{\log^2(d/\epsilon)}\right)$ [Klivans et al. 2009a]
    Our work:   $\eta = \Omega(\epsilon)$
  Passive learning, adversarial label noise:
    Prior work: $\eta = \Omega\left(\frac{\epsilon^3}{\log(1/\epsilon)}\right)$ [Klivans et al. 2009a]
    Our work:   $\eta = \Omega(\epsilon)$
  Active learning (malicious and adversarial):
    Prior work: none
    Our work:   $\eta = \Omega(\epsilon)$

1.2. Techniques

Hinge Loss Minimization. As minimizing the 0-1 loss in the presence of noise is NP-hard [Johnson and Preparata 1978; Garey and Johnson 1990], a natural approach is to minimize a surrogate convex loss that acts as a proxy for the 0-1 loss. A common choice in machine learning is the hinge loss: $\max(0, 1 - y(w \cdot x))$. In this paper, we use the slightly more general
$$\ell_\tau(w, x, y) = \max\left(0,\ 1 - \frac{y(w \cdot x)}{\tau}\right),$$
and, for a set $T$ of examples, we let $\ell_\tau(w, T) = \frac{1}{|T|}\sum_{(x,y) \in T} \ell_\tau(w, x, y)$. Here $\tau$ is a parameter that changes during training. It can be shown that minimizing hinge loss with an appropriate normalization factor can tolerate a noise rate of $\Omega(\epsilon^2/\sqrt{d})$ under isotropic log-concave distributions in $\Re^d$. This is also the limit for such a strategy, since a more powerful malicious adversary can concentrate all the noise directly opposite to the target vector $w^*$ and make sure that the hinge loss is no longer a faithful proxy for the 0-1 loss.

Localization in the instance and concept space. Our first key insight is that by using an iterative localization technique, we can limit the harm caused by an adversary at each stage and hence can still do hinge-loss minimization despite significantly more noise. In particular, the iterative algorithm we propose proceeds in stages, and at stage $k$ we have a hypothesis vector $w_k$ of a certain error rate. The goal in stage $k$ is to produce a new vector $w_{k+1}$ with error rate a constant factor smaller than $w_k$'s. In order to reduce the error rate, we focus on a band of size $b_k = e^{-ck}$ around the boundary of the linear classifier whose normal vector is $w_k$, i.e., $S_{w_k, b_k} = \{x : |w_k \cdot x| < b_k\}$. For the rest of the paper, we will repeatedly refer to this key region of borderline examples as "the band". The key observation made in [Balcan et al. 2007] is that outside the band, all the classifiers still under consideration (namely those hypotheses within radius $r_k$ of the previous weight vector $w_k$) will have very small error. Furthermore, the probability mass of this band under the original distribution is small enough that, in order to make the desired progress, we only need to find a hypothesis of constant error rate over the data distribution conditioned on being within margin $b_k$ of $w_k$. This idea was used in [Balcan et al. 2007] to obtain active learning algorithms with improved label complexity, ignoring computational complexity considerations.³ In this work, we build on this idea to produce polynomial-time algorithms with improved noise tolerance.

³ We note that the localization considered by [Balcan et al. 2007] is a more aggressive one than those considered in the disagreement-based active learning literature [Balcan et al. 2006; Hanneke 2007; Koltchinskii 2010; Hanneke 2011; Wang 2011] and earlier in passive learning [Bartlett et al. 2005; Boucheron et al. 2005; Zhang 2006].
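To fix the notation above in code form, the rescaled hinge loss $\ell_\tau$ and the band $S_{w,b}$ used throughout the paper can be written as the following small numpy helpers (the function names are ours, chosen for this sketch):

```python
import numpy as np

def rescaled_hinge_loss(w, X, y, tau):
    """ell_tau(w, T): average over examples of max(0, 1 - y_i (w . x_i) / tau)."""
    margins = y * (X @ w)                     # y_i (w . x_i) for each row x_i of X
    return np.mean(np.maximum(0.0, 1.0 - margins / tau))

def in_band(w, X, b):
    """Indicator of the band S_{w,b} = {x : |w . x| < b} for each row of X."""
    return np.abs(X @ w) < b
```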
To obtain our results, we exploit several new ideas: (1) the performance of rescaled hinge loss minimization in smaller and smaller bands, (2) an analysis of properties of the distribution obtained after conditioning on the band that enables us to more sensitively identify cases in which the adversary concentrates the effects of noisy examples, and (3) another type of localization: a novel soft outlier removal procedure.

We first show that if we minimize a variant of the hinge loss that is rescaled depending on the width of the band, it remains a faithful enough proxy for the 0-1 error even when there is significantly more noise. As a first step towards this goal, consider the setting where we pick $\tau_k$ proportionally to $b_k$, the size of the band, and $r_k$ proportionally to the error rate of $w_k$, and then minimize a normalized hinge loss function $\ell_{\tau_k}(w, x, y) = \max\left(0, 1 - \frac{y(w \cdot x)}{\tau_k}\right)$ over vectors $w \in B(w_k, r_k)$, the ball of radius $r_k$ centered at $w_k$. We first show that $w^*$ has small hinge loss within the band. Furthermore, within the band the adversarial examples cannot hurt the hinge loss of $w^*$ by a lot. To see this, notice that if the malicious noise rate is $\eta$, then within $S_{w_{k-1}, b_k}$ the effective noise rate is $O(\eta/b_k)$. Also, with high probability, the hinge loss for vectors $w \in B(w_k, r_k)$ is at most $\tilde{O}(\sqrt{d})$. Hence the maximum amount by which the adversary can affect the hinge loss is $\tilde{O}(\eta\sqrt{d}/b_k)$. Using this approach we get a noise tolerance of $\tilde{\Omega}(\epsilon/\sqrt{d})$.

In order to get better tolerance in the adversarial, or agnostic, setting, we note that examples $x$ for which $|w \cdot x|$ is large for $w$ close to $w_{k-1}$ are the most harmful, and, by analyzing the variance of $w \cdot x$ for such directions $w$, we can more effectively limit the amount by which an adversary can "hurt" the hinge loss. This then leads to an improved noise tolerance of $\Omega(\epsilon)$.

Our algorithm that tolerates adversarial label noise does not work for the malicious noise model: it can be foiled by an adversary that concentrates $\eta$ measure on an incorrectly labeled example within $\Theta(\epsilon)$ of the separating hyperplane of the target, but with a very large norm. If the norm of this noisy example is large enough, its hinge loss can overwhelm the hinge losses of clean examples. We cope with this using a soft localized outlier removal procedure at each stage (described next). This procedure assigns a weight to each data point indicating the algorithm's confidence that the point is not "noisy". We then minimize the weighted hinge loss. Combining this with the variance analysis mentioned above leads to a noise tolerance of $\Omega(\epsilon)$ in the malicious case.
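The counting behind the $\tilde{\Omega}(\epsilon/\sqrt{d})$ baseline earlier in this subsection can be restated as a one-line calculation (our paraphrase of the argument, not a bound quoted from the formal proofs):

```latex
% Adversary's maximum effect on the hinge loss inside the band:
\underbrace{O\!\Big(\tfrac{\eta}{b_k}\Big)}_{\text{effective noise rate in } S_{w_{k-1},b_k}}
\times
\underbrace{\tilde{O}(\sqrt{d})}_{\text{max hinge loss of } w \in B(w_k, r_k)}
= \tilde{O}\!\Big(\tfrac{\eta\sqrt{d}}{b_k}\Big).
% Requiring this to be at most a small constant, with b_k proportional to the
% current error, allows noise rates up to \eta = \tilde{\Omega}(\epsilon/\sqrt{d}).
```

The variance analysis and the soft outlier removal described next are what improve this baseline tolerance to $\Omega(\epsilon)$.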
Soft Localized Outlier Removal. Outlier removal has been used for learning linear classifiers before [Blum et al. 1997; Klivans et al. 2009a]. In [Klivans et al. 2009a], the goal of outlier removal was to limit the ability of the adversary to coordinate the effects of noisy examples: excessive such coordination was detected and removed. Our outlier removal procedure (Algorithm 3) is similar in spirit to that of [Klivans et al. 2009a], with two key differences. First, as in [Klivans et al. 2009a], we use the variance of the examples in a particular direction to measure their coordination. However, due to the fact that in round $k$ we are minimizing the hinge loss only with respect to vectors that are close to $w_{k-1}$, we only need to limit the variance in these directions. As training proceeds, the band is increasingly shaped like a pancake, with $w_{k-1}$ pointing in its flattest direction. Hypotheses that are close to $w_{k-1}$ also point in flat directions; the variance in those directions is $\Theta(b_k^2)$, which is much smaller than the variance found in a generic direction. This allows us to limit the harm of the adversary to a greater extent than was possible in the analysis of [Klivans et al. 2009a]. The second difference is that, unlike previous outlier removal techniques, rather than making discrete remove-or-not decisions, we instead weight the examples and then minimize the weighted hinge loss. Each weight indicates the algorithm's confidence that an example is not noisy. We show that these weights can be computed by solving a linear program with infinitely many constraints. We then show how to design an efficient separation oracle for the linear program using recent general-purpose optimization techniques [Sturm and Zhang 2003; Bienstock and Michalka 2014].

1.3. Recent developments

Subsequent to the publication of this work in preliminary form [Awasthi et al. 2014], Daniely [2015] combined the techniques of this paper with the polynomial regression technique of [Kalai et al. 2005] to achieve a PTAS for agnostic learning of halfspaces with respect to the uniform distribution. (Recall that agnostic learning is essentially equivalent to learning with adversarial label noise, as outlined in Appendix C.) Awasthi et al. [2015] provided efficient (active and passive) learning algorithms for learning linear separators in the presence of (sufficiently benign) bounded noise (a.k.a. Massart noise)⁴ to arbitrarily small excess error under the uniform distribution over the unit sphere in $R^d$. Awasthi et al. [2016] improved on this algorithm (to allow for any constant bounded noise), and extended the technique to apply to the related problems of attribute-efficient learning of linear separators and the popular signal processing problem of 1-bit compressed sensing (both in the passive learning model). The recent work of [Diakonikolas et al. 2017] has extended the results of this paper to also include non-homogeneous halfspaces.

⁴ Massart noise is widely studied in statistical learning theory (see, e.g., [Boucheron et al. 2005]) and can be thought of as a realistic generalization of random classification noise, where the label of each example $x$ is flipped independently with constant probability $\eta(x) < 1/2$.

2. PRELIMINARIES

Recall that $\ell_\tau(w, x, y) = \max\left(0, 1 - \frac{y(w \cdot x)}{\tau}\right)$ and $\ell_\tau(w, T) = \frac{1}{|T|}\sum_{(x,y) \in T}\ell_\tau(w, x, y)$. Similarly, the expected hinge loss with respect to $D$ is defined as $L_\tau(w, D) = E_{x \sim D}(\ell_\tau(w, x, \mathrm{sign}(w^* \cdot x)))$. Our analysis will also consider the distribution $D_{w,\gamma}$ obtained by conditioning $D$ on membership in the band, i.e., the set $\{x : |w \cdot x| \leq \gamma\}$.

We present our algorithms in the active learning model.
Since we will prove that our active algorithm uses only a polynomial number of unlabeled samples, this will imply a guarantee for the passive learning setting. At a high level, our algorithms are iterative learning algorithms that operate in rounds. In each round $k$ we focus on points that fall near the decision boundary of the current hypothesis $w_{k-1}$ and use them in order to obtain a new vector $w_k$ of lower error. In the malicious noise case, in round $k$ we first do a soft outlier removal and then minimize hinge loss normalized appropriately by $\tau_k$.

When analyzing the malicious noise model, we will refer to the examples generated by the adversary as the noisy examples, and the other examples as the clean examples. For vectors $u$ and $v$, denote the angle between them by $\theta(u, v)$. Let $B(u, r)$ be the ball of radius $r$ centered at $u$. The description of the algorithms and their analysis is simplified if we assume that the algorithm starts with a preliminary weight vector $w_0$ whose angle with the target $w^*$ is acute, i.e., that satisfies $\theta(w_0, w^*) < \pi/2$. We show in Appendix B that this is without loss of generality for the types of problems we consider.

A probability distribution is isotropic log-concave if its density can be written as $\exp(-\psi(x))$ for a convex function $\psi$, its mean is $0$, and its covariance matrix is $I$.

3. ADVERSARIAL LABEL NOISE

Algorithm 1 is our algorithm for learning in the presence of adversarial label noise. In the analysis below, we assume that the algorithm has access to $w_0$ such that $\theta(w_0, w^*) < \pi/2$. This can be shown to be without loss of generality (see Appendix B). Theorem 1.1 follows immediately from Theorem 3.1 below, which analyzes Algorithm 1.

Algorithm 1: Computationally Efficient Algorithm Tolerating Adversarial Label Noise

Input: allowed error rate $\epsilon$; probability of failure $\delta$; an oracle that returns $x$, for $(x, y)$ sampled from $EX_\eta(f, D)$, and an oracle for getting the label from an example; a sequence of sample sizes $m_k > 0$; a sequence of cut-off values $b_k > 0$; a sequence of hypothesis space radii $r_k > 0$; a precision value $\kappa > 0$.

(1) Draw $m_1$ labeled examples and put them into a working set $W$.
(2) For $k = 1, \ldots, s = \lceil \log_2(1/\epsilon) \rceil$:
  (a) Find $v_k \in B(w_{k-1}, r_k)$ to approximately minimize training hinge loss over $W$ subject to $\|v_k\|_2 \leq 1$:
      $\ell_{\tau_k}(v_k, W) \leq \min_{w \in B(w_{k-1}, r_k) \cap B(0,1)} \ell_{\tau_k}(w, W) + \kappa/8$.
  (b) Normalize $v_k$ to have unit length, yielding $w_k = v_k/\|v_k\|_2$.
  (c) Clear the working set $W$.
  (d) Until $m_{k+1}$ additional data points are put in $W$: given an unlabeled example $x$ for $(x, f(x))$ obtained from $EX_\eta(f, D)$, if $|w_k \cdot x| \geq b_k$ then reject $x$, else ask for the label of $x$ and put the example into $W$.

Output: Weight vector $w_s$ of error at most $\epsilon$ with probability $1 - \delta$.
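The control flow of Algorithm 1 can be sketched in Python as below. This is a schematic rendering, not a verbatim implementation: the constrained hinge-loss minimization of step 2(a) is delegated to a hypothetical solver `minimize_hinge_in_ball`, the oracles are passed in as plain callables, and the schedules `m`, `b`, `r`, `tau` are assumed to be indexed so that `m[k]`, `b[k]`, `r[k]`, `tau[k]` hold the round-$k$ values from Section 3.2.

```python
import numpy as np

def algorithm_1_sketch(w0, eps, draw_unlabeled, reveal_label, m, b, r, tau):
    """Localized hinge-loss minimization in shrinking bands (Algorithm 1, schematic).
    minimize_hinge_in_ball(W, center, radius, tau) is assumed to return a v with
    ||v|| <= 1 and ||v - center|| <= radius that approximately minimizes ell_tau(., W)."""
    w = w0 / np.linalg.norm(w0)
    s = int(np.ceil(np.log2(1.0 / eps)))
    # Step (1): an initial labeled working set, drawn without any band filter.
    W = [(x, reveal_label(x)) for x in (draw_unlabeled() for _ in range(m[1]))]
    for k in range(1, s + 1):
        v = minimize_hinge_in_ball(W, center=w, radius=r[k], tau=tau[k])  # step 2(a), hypothetical solver
        w = v / np.linalg.norm(v)                                         # step 2(b): renormalize
        W = []                                                            # step 2(c): clear working set
        if k < s:                                                         # step 2(d): label only band points
            while len(W) < m[k + 1]:
                x = draw_unlabeled()
                if abs(w @ x) < b[k]:
                    W.append((x, reveal_label(x)))
    return w
```

In a purely passive setting one would draw labeled examples and discard those outside the band; the active variant above only pays for labels of band points, which is where the polylogarithmic label complexity comes from.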
THEOREM 3.1. Let a distribution $D$ over $R^d$ be isotropic log-concave. Let $w^*$ be the (unit length) target weight vector. There are settings of the parameters of Algorithm 1, and positive constants $M$, $C$ and $\epsilon_0$, such that, for all $\epsilon < \epsilon_0$ and for any $\delta > 0$, if the rate $\eta$ of adversarial noise satisfies $\eta < C\epsilon$, then, using a number $n_k = \mathrm{poly}(d, M^k, \log(1/\delta))$ of unlabeled examples in round $k$, a number $m_k = O\left(d\log\left(\frac{d}{\epsilon\delta}\right)(d + \log(k/\delta))\right)$ of labeled examples in round $k \geq 1$, and $w_0$ such that $\theta(w_0, w^*) < \pi/2$, after $s = O(\log(1/\epsilon))$ iterations, Algorithm 1 finds $w_s$ satisfying $\mathrm{err}(w_s) \leq \epsilon$ with probability $\geq 1 - \delta$.

The rest of this section is dedicated to the proof of Theorem 3.1.

3.1. Relevant properties of isotropic log-concave distributions

We start by listing some properties of isotropic log-concave (i.l.c.) distributions that we will use in our analysis.

LEMMA 3.2 ([Lovász and Vempala 2007; Vempala 2010]). Assume that $D$ is isotropic log-concave in $R^d$ and let $f$ be its density function.
(a) $\Pr_{x \sim D}[\|x\|_2 \geq \alpha\sqrt{d}] \leq e^{-\alpha + 1}$.
(b) Projections of $D$ onto subspaces of $R^d$ are isotropic log-concave.
(c) If $d = 1$, then $\Pr_{x \sim D}[x \in [a, b]] \leq |b - a|$.
(d) There is an absolute constant $c_1$ such that, if $d = 1$, then $f(x) > c_1$ for all $x \in [-1/9, 1/9]$.
(e) There is an absolute constant $c_2$ such that for any two unit vectors $u$ and $v$ in $R^d$ we have $c_2\,\theta(v, u) \leq \Pr_{x \sim D}(\mathrm{sign}(u \cdot x) \neq \mathrm{sign}(v \cdot x))$.
(f) For any $d$, there are positive $c_3(d)$ and $c_4(d)$ such that $f(x) \leq c_3(d)\exp(-c_4(d)\|x\|)$.

Parts (a)-(d) are from [Lovász and Vempala 2007]. Part (e) is implicit in [Vempala 2010], and set out explicitly in [Balcan and Long 2013]. Part (f) is from [Klivans et al. 2009b].

We will use the following lemma as a tool to analyze the variance in directions close to the hypothesis at any given time.

LEMMA 3.3. For any $C > 0$, there exist constants $c, c'$ such that, for any isotropic log-concave distribution $D$, for any $a$ such that $\|a\|_2 \leq 1$ and $\|u - a\|_2 \leq r$, for any $0 < \gamma < C$, and for any $K \geq 4$, we have
$$\Pr_{x \sim D_{u,\gamma}}\left(|a \cdot x| > K\sqrt{r^2 + \gamma^2}\right) \leq c\, e^{-c'K\sqrt{1 + \gamma^2/r^2}}.$$

PROOF. W.l.o.g. we may assume that $u = (1, 0, 0, \cdots, 0)$. Let $a' = (a_2, \ldots, a_d)$, and, for a random $x = (x_1, x_2, \ldots, x_d)$ drawn from $D_{u,\gamma}$, let $x' = (x_2, \ldots, x_d)$. We may rewrite the probability that we want to bound as
$$\Pr_{x \sim D_{u,\gamma}}\left(|a \cdot x| > K\sqrt{r^2 + \gamma^2}\right) = \frac{\Pr_{x \sim D}\left(|a \cdot x| > K\sqrt{r^2 + \gamma^2} \text{ and } |x_1| \leq \gamma\right)}{\Pr_{x \sim D}(|x_1| \leq \gamma)}. \quad (1)$$
Lemma 3.2 implies that there is a positive constant $c_1$ such that the denominator satisfies the following lower bound:
$$\Pr_{x \sim D}(|x_1| \leq \gamma) \geq c_1\min\{\gamma, 1/9\} \geq \frac{c_1\gamma}{9C}. \quad (2)$$
So now we just need an upper bound on the numerator. We have
$$\Pr_{x \sim D}\left(|a \cdot x| > K\sqrt{r^2 + \gamma^2} \text{ and } |x_1| \leq \gamma\right) \leq \Pr_{x \sim D}\left(|a' \cdot x'| > K\sqrt{r^2 + \gamma^2} - \gamma \text{ and } |x_1| \leq \gamma\right) \leq \Pr_{x \sim D}\left(|a' \cdot x'| > (K-1)\sqrt{r^2 + \gamma^2} \text{ and } |x_1| \leq \gamma\right).$$
Define $a'' = a'/\|a'\|$. Define the random variable $Y$ to be $a'' \cdot x$ and the random variable $X$ to be $x_1$, where $x$ is drawn from $D$. Then we have $E[X] = E[x_1] = 0$ and $E[Y] = E[a'' \cdot x] = 0$. Furthermore, $E[X^2] = 1$, $E[Y^2] = 1$ and $E[XY] = 0$. Hence, the joint distribution of $X$ and $Y$ is isotropic log-concave. Let $f(X, Y)$ be the p.d.f.
of this distribution. Then the numerator can be upper bounded as follows:
$$4\Pr_{x \sim D}\left(a' \cdot x' > (K-1)\sqrt{r^2 + \gamma^2} \text{ and } 0 \leq x_1 \leq \gamma\right) \leq 4\int_0^\gamma\int_{(K-1)\sqrt{r^2+\gamma^2}/\|a'\|}^\infty f(X, Y)\, dY\, dX.$$
Applying Part (f) of Lemma 3.2 with $d = 2$, there are constants $c$ and $c'$ such that the numerator is at most
$$c\int_0^\gamma\int_{(K-1)\sqrt{r^2+\gamma^2}/\|a'\|}^\infty \exp\left(-c'\sqrt{X^2 + Y^2}\right) dY\, dX \leq c\int_0^\gamma\int_{(K-1)\sqrt{r^2+\gamma^2}/\|a'\|}^\infty \exp(-c'Y)\, dY\, dX \leq c''\gamma\exp\left(-c'(K-1)\frac{\sqrt{r^2+\gamma^2}}{\|a'\|}\right),$$
in part because the fact that $\|a'\| \leq r$ implies that $(K-1)\frac{\sqrt{r^2+\gamma^2}}{\|a'\|} > 3$. Hence the numerator of (1) is at most $c''\gamma\exp\left(-c'(K-1)\sqrt{1 + \frac{\gamma^2}{r^2}}\right)$, completing the proof.

Armed with Lemma 3.3, we are now ready for the variance bound. It improves on a bound from an earlier version of this paper [Awasthi et al. 2014], matching what was obtained in that version for the special case of the uniform distribution. This improvement is what leads to closing a log-factor gap in the tolerable rate of noise for i.l.c. distributions.

LEMMA 3.4. Assume that $D$ is isotropic log-concave. For any $c_3$, there is a constant $c_4$ such that, for all $0 < \gamma \leq c_3$ and for all $a$ such that $\|u - a\|_2 \leq r$ and $\|a\|_2 \leq 1$,
$$E_{x \sim D_{u,\gamma}}((a \cdot x)^2) \leq c_4(r^2 + \gamma^2).$$

PROOF. Let $z = \sqrt{r^2 + \gamma^2}$. Setting, with foresight, $t = 16z^2$, we have
$$E_{x \sim D_{u,\gamma}}((a \cdot x)^2) = \int_0^\infty \Pr_{x \sim D_{u,\gamma}}((a \cdot x)^2 \geq \alpha)\, d\alpha \leq t + \int_t^\infty \Pr_{x \sim D_{u,\gamma}}((a \cdot x)^2 \geq \alpha)\, d\alpha. \quad (3)$$
Since $t \geq 16z^2$, Lemma 3.3 implies that, for absolute constants $c$ and $c'$, we have
$$E_{x \sim D_{u,\gamma}}((a \cdot x)^2) \leq t + c\int_t^\infty \exp\left(-\frac{c'\sqrt{\alpha}}{r}\right) d\alpha.$$
Now we want to evaluate the integral. Using the change of variables $u^2 = \alpha$, we get
$$\int_t^\infty \exp\left(-c'\sqrt{\alpha}/r\right) d\alpha = 2\int_{\sqrt{t}}^\infty u\exp(-c'u/r)\, du = \frac{2r^2}{c'^2}\left(\frac{\sqrt{t}}{r} + 1\right)\exp\left(-c'\sqrt{t}/r\right).$$
Putting it together, we get
$$E_{x \sim D_{u,\gamma}}((a \cdot x)^2) \leq t + \frac{2cr^2}{c'^2}\left(\frac{\sqrt{t}}{r} + 1\right)\exp\left(-c'\sqrt{t}/r\right) \leq \left(1 + \frac{c}{2c'^2}\right)t + \frac{2cr^2}{c'^2}\exp\left(-\frac{4c'z}{r}\right),$$
and, since $t = 16z^2$ and $z/r \geq 1$, we get the desired bound.

Finally, we will use a lemma from [Balcan and Long 2013] that generalizes and strengthens a key lemma from [Balcan et al. 2007]. It is used to show that, during the learning process, most large-margin examples are classified correctly.

LEMMA 3.5 (Theorem 4 of [Balcan and Long 2013]). For any $c_5 > 0$, there is a $c_6 > 0$ such that the following holds. Let $u$ and $v$ be two unit vectors in $R^d$, and assume that $\theta(u, v) = \alpha < \pi/2$. If $D$ is isotropic log-concave in $R^d$, then
$$\Pr_{x \sim D}[\mathrm{sign}(u \cdot x) \neq \mathrm{sign}(v \cdot x) \text{ and } |v \cdot x| \geq c_6\alpha] \leq c_5\alpha.$$

3.2. Parameters for the algorithm

For easy reference throughout the proof, here we collect together the settings of the parameters of the algorithm. Let $M = \max\{2/(c_2\pi), 2\}$, where $c_2$ is from Lemma 3.2. Let $c_1'$ be the value of $c_6$ in Lemma 3.5 corresponding to the case where $c_5$ is $\frac{c_2}{4M}$; then let $b_k = c_1'M^{-k}$. Let $c_2'$ be $c_1$ from Lemma 3.2. Let $r_k = \min\{M^{-(k-1)}/c_2, \pi/2\}$, where $c_2$ is from Lemma 3.2, and $\kappa = \frac{1}{4c_1'M}$. Let $\tau_k = \frac{c_1\min\{b_{k-1}, 1/9\}\,\kappa}{6}$, where $c_1$ is the value from Lemma 3.2. Let $z_k^2 = r_k^2 + b_{k-1}^2$.
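A small helper that mirrors the schedule above may make the dependencies explicit; the constants $c_1$, $c_2$, $c_1'$ below are the unspecified absolute constants from Lemmas 3.2 and 3.5, so the default values of 1.0 are placeholders rather than values mandated by the analysis.

```python
import numpy as np

def round_k_parameters(k, c1=1.0, c2=1.0, c1_prime=1.0):
    """Round-k parameter schedule from Section 3.2 (constants are placeholders)."""
    M = max(2.0 / (c2 * np.pi), 2.0)
    b_k = c1_prime * M ** (-k)                          # band width for round k
    b_prev = c1_prime * M ** (-(k - 1))                 # band width from the previous round
    r_k = min(M ** (-(k - 1)) / c2, np.pi / 2.0)        # radius of the hypothesis ball
    kappa = 1.0 / (4.0 * c1_prime * M)                  # target error inside the band
    tau_k = c1 * min(b_prev, 1.0 / 9.0) * kappa / 6.0   # hinge-loss scale
    z_k = np.sqrt(r_k ** 2 + b_prev ** 2)
    return {"M": M, "b_k": b_k, "r_k": r_k, "kappa": kappa, "tau_k": tau_k, "z_k": z_k}
```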
3.3. The error within a band in each iteration

At each iteration, Algorithm 1 concentrates its attention on examples in the band. Our next theorem analyzes its error on these examples.

THEOREM 3.6. For $k \leq \lceil \log_M(1/\epsilon) \rceil$, if $\mathrm{err}_D(w_{k-1}) \leq M^{-(k-1)}$, then with probability $1 - \frac{\delta}{k + k^2}$ (over the random examples in round $k$), after round $k$ of Algorithm 1 we have $\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \leq \kappa$.

We will prove Theorem 3.6 using a series of lemmas below. First, we bound the hinge loss of the target $w^*$ within the band $S_{w_{k-1}, b_{k-1}}$. Since we are analyzing a particular round $k$, to reduce clutter in the formulas, for the rest of this section let us refer to $\ell_{\tau_k}$ as $\ell$, and $L_{\tau_k}(\cdot, D_{w_{k-1}, b_{k-1}})$ as $L(\cdot)$.

LEMMA 3.7. $L(w^*) \leq \kappa/6$.

PROOF. Notice that $y(w^* \cdot x)$ is never negative, so, on any clean example $(x, y)$, we have
$$\ell(w^*, x, y) = \max\left(0, 1 - \frac{y(w^* \cdot x)}{\tau_k}\right) \leq 1,$$
and, furthermore, $w^*$ will pay a non-zero hinge loss only inside the region where $|w^* \cdot x| < \tau_k$. Hence,
$$L(w^*) \leq \Pr_{D_{w_{k-1}, b_{k-1}}}(|w^* \cdot x| \leq \tau_k) = \frac{\Pr_{x \sim D}(|w^* \cdot x| \leq \tau_k \text{ and } |w_{k-1} \cdot x| \leq b_{k-1})}{\Pr_{x \sim D}(|w_{k-1} \cdot x| \leq b_{k-1})}.$$
Using Part (d) of Lemma 3.2, for the value of $c_1$ in that part, we can lower bound the denominator: $\Pr_{x \sim D}(|w_{k-1} \cdot x| < b_{k-1}) \geq 2c_1\min\{b_{k-1}, 1/9\}$. Part (c) of Lemma 3.2 implies that the numerator is at most $\Pr_{x \sim D}(|w^* \cdot x| \leq \tau_k) \leq 2\tau_k$. Hence, we have
$$L(w^*) \leq \frac{2\tau_k}{2c_1\min\{b_{k-1}, 1/9\}} = \kappa/6.$$

Let $\tilde{P}$ be the joint distribution used by the algorithm, which includes the noisy labels chosen by the adversary. Let $N = \{(x, y) : \mathrm{sign}(w^* \cdot x) \neq y\}$ consist of the noisy examples, so that $\tilde{P}(N) \leq \eta$. Let $P$ be the joint distribution obtained by applying the correct labels. Let $\tilde{P}_k$ be the distribution of the examples given to the algorithm in round $k$ (obtained by conditioning $\tilde{P}$ on examples that fall within the band), and let $P_k$ be the corresponding joint distribution with clean labels. The key lemma here bounds how far the expected loss with respect to the distribution $\tilde{P}_k$ given to the algorithm is from the expected loss with respect to the distribution $P_k$ with the cleaned labels. Informally, it shows that, to an extent, $E_{(x,y) \sim \tilde{P}_k}(\ell(w, x, y))$ is an effective proxy for $E_{(x,y) \sim P_k}(\ell(w, x, y))$.

LEMMA 3.8. There is an absolute positive constant $c$ such that, if we define $z_k = \sqrt{r_k^2 + b_{k-1}^2}$, then for any $w \in B(w_{k-1}, r_k)$ we have
$$\left| E_{(x,y) \sim P_k}(\ell(w, x, y)) - E_{(x,y) \sim \tilde{P}_k}(\ell(w, x, y)) \right| \leq c\sqrt{\frac{\eta}{\epsilon}}\,\frac{z_k}{\tau_k}. \quad (4)$$

PROOF. Fix an arbitrary $w \in B(w_{k-1}, r_k)$.
Recalling that $N$ is the set of noisy examples, and that the marginals of $P_k$ and $\tilde{P}_k$ on the inputs are the same, we have
$$\left| E_{(x,y) \sim P_k}(\ell(w, x, y)) - E_{(x,y) \sim \tilde{P}_k}(\ell(w, x, y)) \right| = \left| E_{(x,y) \sim \tilde{P}_k}\left(\ell(w, x, y) - \ell(w, x, \mathrm{sign}(w^* \cdot x))\right) \right| = \left| E_{(x,y) \sim \tilde{P}_k}\left(\mathbf{1}_{(x,y) \in N}\left(\ell(w, x, y) - \ell(w, x, -y)\right)\right) \right|$$
$$\leq E_{(x,y) \sim \tilde{P}_k}\left(\mathbf{1}_{(x,y) \in N}\left|\ell(w, x, y) - \ell(w, x, -y)\right|\right) \leq 2E_{(x,y) \sim \tilde{P}_k}\left(\mathbf{1}_{(x,y) \in N}\frac{|w \cdot x|}{\tau_k}\right) = \frac{2}{\tau_k}E_{(x,y) \sim \tilde{P}_k}\left(\mathbf{1}_{(x,y) \in N}|w \cdot x|\right) \leq \frac{2}{\tau_k}\sqrt{\Pr_{(x,y) \sim \tilde{P}_k}(N)}\times\sqrt{E_{(x,y) \sim \tilde{P}_k}((w \cdot x)^2)}$$
by the Cauchy-Schwarz inequality. Lemma 3.2 implies that, for an absolute constant $c'$,
$$\Pr_{(x,y) \sim \tilde{P}_k}(N) \leq \frac{\Pr_{(x,y) \sim \tilde{P}}(N)}{\Pr_{(x,y) \sim \tilde{P}}(S_{w_{k-1}, b_{k-1}})} \leq \frac{\eta}{c'M^{-k}} \leq \frac{\eta}{c'\epsilon/M},$$
since $k \leq \lceil \log_M(1/\epsilon) \rceil$, and Lemma 3.4 implies $E_{(x,y) \sim \tilde{P}_k}((w \cdot x)^2) \leq c'z_k^2$.

Finally, we need some bounds on estimates of the hinge loss.

LEMMA 3.9. Let $\mathrm{cleaned}(W) = \{(x, \mathrm{sign}(w^* \cdot x)) : (x, y) \in W\}$. With probability $1 - \frac{\delta}{k + k^2}$, for all $w \in B(w_{k-1}, r_k)$, we have
$$\left| E_{(x,y) \sim \tilde{P}_k}(\ell(w, x, y)) - \ell(w, W) \right| \leq \kappa/16 \quad \text{and} \quad \left| E_{(x,y) \sim P_k}(\ell(w, x, y)) - \ell(w, \mathrm{cleaned}(W)) \right| \leq \kappa/16. \quad (5)$$

PROOF. See Appendix D.

PROOF OF THEOREM 3.6. With probability $1 - \frac{\delta}{k + k^2}$, we have, for absolute constants $c_1$ and $c_2$, the following:
$$\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) = \mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(v_k) \leq E_{(x,y) \sim P_k}(\ell(v_k, x, y)) \qquad \text{(since for each error the hinge loss is at least 1)}$$
$$\leq E_{(x,y) \sim \tilde{P}_k}(\ell(v_k, x, y)) + c_1\sqrt{\frac{\eta}{\epsilon}}\times\frac{z_k}{\tau_k} \qquad \text{(by Lemma 3.8)}$$
$$\leq \ell(v_k, W) + c_1\sqrt{\frac{\eta}{\epsilon}}\times\frac{z_k}{\tau_k} + \kappa/16 \qquad \text{(by Lemma 3.9)}$$
$$\leq \ell(w^*, W) + c_1\sqrt{\frac{\eta}{\epsilon}}\times\frac{z_k}{\tau_k} + \kappa/8$$
$$\leq E_{(x,y) \sim \tilde{P}_k}(\ell(w^*, x, y)) + c_1\sqrt{\frac{\eta}{\epsilon}}\times\frac{z_k}{\tau_k} + \kappa/4 \qquad \text{(by Lemma 3.9)}$$
$$\leq E_{(x,y) \sim P_k}(\ell(w^*, x, y)) + c_2\sqrt{\frac{\eta}{\epsilon}}\times\frac{z_k}{\tau_k} + \kappa/4 \qquad \text{(by Lemma 3.8)}$$
$$\leq c_2\sqrt{\frac{\eta}{\epsilon}}\times\frac{z_k}{\tau_k} + \kappa/2,$$
since $L(w^*) \leq \kappa/6$. Since $z_k/\tau_k = \Theta(1)$, there is a constant $c_3$ such that $\eta \leq c_3\epsilon$ suffices for $\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \leq \kappa$, completing the proof.

3.4. Putting it together

Now we are ready to put everything together. The proof of Theorem 3.1 follows the high-level structure of the proof of [Balcan et al. 2007]; the new element is the application of Theorem 3.6, which analyzes the performance of the hinge loss minimization algorithm for learning inside the band.

Proof (of Theorem 3.1): We will prove by induction on $k$ that after $k \leq s$ iterations, we have $\mathrm{err}_D(w_k) \leq M^{-k}$ with probability $1 - \delta(1 - 1/(k+1))/2$. When $k = 0$, all that is required is $\mathrm{err}_D(w_0) \leq 1$.

Assume now the claim is true for $k - 1$ ($k \geq 1$). Then by the induction hypothesis, we know that with probability at least $1 - \delta(1 - 1/k)/2$, $w_{k-1}$ has error at most $M^{-(k-1)}$. Using Part (e) of Lemma 3.2, this implies that $\theta(w_{k-1}, w^*) \leq M^{-(k-1)}/c_2$. This in turn implies $\theta(w_{k-1}, w^*) \leq \pi/2$. (When $k = 1$, this is by assumption, and otherwise it is implied by Part (e) of Lemma 3.2.) Let us define $S_{w_{k-1}, b_{k-1}} = \{x : |w_{k-1} \cdot x| \leq b_{k-1}\}$ and $\bar{S}_{w_{k-1}, b_{k-1}} = \{x : |w_{k-1} \cdot x| > b_{k-1}\}$.
Since $w_{k-1}$ has unit length and $v_k \in B(w_{k-1}, r_k)$, we have $\theta(w_{k-1}, v_k) \leq r_k$, which in turn implies $\theta(w_{k-1}, w_k) \leq \min\{M^{-(k-1)}/c_2, \pi/2\}$. Applying Lemma 3.5 to bound the error rate outside the band, we have both
$$\Pr_x\left[(w_{k-1} \cdot x)(w_k \cdot x) < 0,\ x \in \bar{S}_{w_{k-1}, b_{k-1}}\right] \leq \frac{M^{-k}}{4} \quad \text{and} \quad \Pr_x\left[(w_{k-1} \cdot x)(w^* \cdot x) < 0,\ x \in \bar{S}_{w_{k-1}, b_{k-1}}\right] \leq \frac{M^{-k}}{4}.$$
Taking the sum, we obtain $\Pr_x\left[(w_k \cdot x)(w^* \cdot x) < 0,\ x \in \bar{S}_{w_{k-1}, b_{k-1}}\right] \leq \frac{M^{-k}}{2}$. Therefore, we have
$$\mathrm{err}(w_k) \leq \left(\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k)\right)\Pr(S_{w_{k-1}, b_{k-1}}) + \frac{M^{-k}}{2}.$$
Since $\Pr(S_{w_{k-1}, b_{k-1}}) \leq 2b_{k-1}$, this implies
$$\mathrm{err}(w_k) \leq \left(\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k)\right)2b_{k-1} + \frac{M^{-k}}{2} \leq M^{-k}\left(\left(\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k)\right)2c_1'M + 1/2\right).$$
Recall that $D_{w_{k-1}, b_{k-1}}$ is the distribution obtained by conditioning $D$ on the event that $x \in S_{w_{k-1}, b_{k-1}}$. Combining Theorem 3.6 with the induction hypothesis,
$$\Pr(\mathrm{err}(w_k) > 1/M^k) \leq \Pr\left(\mathrm{err}(w_k) > 1/M^k \mid \mathrm{err}(w_{k-1}) \leq 1/M^{k-1}\right) + \Pr\left(\mathrm{err}(w_{k-1}) > 1/M^{k-1}\right) \leq \frac{\delta}{2(k + k^2)} + \delta(1 - 1/k)/2 = \delta(1 - 1/(k+1))/2.$$
This completes the proof of the induction, and therefore shows that, with probability at least $1 - \delta$, $O(\log(1/\epsilon))$ iterations suffice to achieve $\mathrm{err}(w_k) \leq \epsilon$. A polynomial number of unlabeled samples are required by the algorithm, and the number of labeled examples required by the algorithm is $\sum_k m_k = O(d(d + \log\log(1/\epsilon) + \log(1/\delta))\log(1/\epsilon))$.

Remark: Following the publication of this work, it was brought to our attention by Steve Hanneke that a more careful analysis of the algorithm achieves the right labeled sample complexity of $O((d + \log\log(1/\epsilon) + \log(1/\delta))\log(1/\epsilon))$. This is implicit in [Hanneke et al. 2015].⁵

⁵ See [link] for a self-contained proof graciously communicated to us by Steve Hanneke.
4. LEARNING WITH MALICIOUS NOISE

The intuition in the case of malicious noise is the same as for adversarial label noise, except that, because the adversary can also change the marginal distribution over the instances, it is necessary to perform an additional outlier removal step at each stage of the algorithm. Furthermore, we need a different analysis, since in this case the marginal distribution over the examples can change. Theorem 1.2 follows immediately from Theorem 4.1 below, which analyzes Algorithm 2.

Algorithm 2: Computationally Efficient Algorithm Tolerating Malicious Noise

Input: allowed error rate $\epsilon$; probability of failure $\delta$; an oracle that returns $x$, for $(x, y)$ sampled from $EX_\eta(f, D)$, and an oracle for getting the label $y$ from an example; a sequence of unlabeled sample sizes $n_k > 0$, $k \in Z^+$; a sequence of labeled sample sizes $m_k > 0$; a sequence of cut-off values $b_k > 0$; a sequence of hypothesis space radii $r_k > 0$; a sequence of removal rates $\xi_k$; a sequence of variance bounds $\sigma_k^2$; a precision value $\kappa$; a weight vector $w_0$.

(1) Draw $n_1$ unlabeled examples and put them into a working set $W$.
(2) For $k = 1, \ldots, s = \lceil \log_2(1/\epsilon) \rceil$:
  (a) Apply Algorithm 3 to $W$ with parameters $u \leftarrow w_{k-1}$, $\gamma \leftarrow b_{k-1}$, $r \leftarrow r_k$, $\xi \leftarrow \xi_k$, $\sigma^2 \leftarrow \sigma_k^2$, and let $q$ be the output function $q : W \to [0, 1]$. Normalize $q$ to form a probability distribution $p$ over $W$.
  (b) Choose $m_k$ examples from $W$ according to $p$ and reveal their labels. Call this set $T$.
  (c) Find $v_k \in B(w_{k-1}, r_k)$ to approximately minimize training hinge loss over $T$ subject to $\|v_k\|_2 \leq 1$:
      $\ell_{\tau_k}(v_k, T) \leq \min_{w \in B(w_{k-1}, r_k) \cap B(0,1)} \ell_{\tau_k}(w, T) + \kappa/8$.
      Normalize $v_k$ to have unit length, yielding $w_k = v_k/\|v_k\|_2$.
  (d) Clear the working set $W$.
  (e) Until $n_{k+1}$ additional data points are put in $W$: given unlabeled $x$ for $(x, f(x))$ obtained from $EX_\eta(f, D)$, if $|w_k \cdot x| \geq b_k$ then reject $x$, else put it into $W$.

Output: weight vector $w_s$ of error at most $\epsilon$ with probability $1 - \delta$.

Algorithm 3: Localized Soft Outlier Removal Procedure

Input: a set $S = \{x_1, x_2, \ldots, x_n\}$ of samples; the reference unit vector $u$; desired radius $r$; a parameter $\xi$ specifying the desired bound on the fraction of clean examples removed; a variance bound $\sigma^2$.

(1) Find $q : S \to [0, 1]$ satisfying the following constraints:
  (a) for all $x \in S$, $0 \leq q(x) \leq 1$;
  (b) $\frac{1}{|S|}\sum_{x \in S} q(x) \geq 1 - \xi$;
  (c) for all $w \in B(u, r) \cap B(0, 1)$, $\frac{1}{|S|}\sum_{x \in S} q(x)(w \cdot x)^2 \leq \sigma^2$.

Output: A function $q : S \to [0, 1]$.

THEOREM 4.1. Let a distribution $D$ over $R^d$ be isotropic log-concave. Let $w^*$ be the (unit length) target weight vector. There are settings of the parameters of Algorithm 2, and positive constants $M$, $C$ and $\epsilon_0$, such that, for all $\epsilon < \epsilon_0$ and for any $\delta > 0$, if the rate $\eta$ of malicious noise satisfies $\eta < C\epsilon$, then, using a number $n_k = \mathrm{poly}(d, M^k, \log(1/\delta))$ of unlabeled examples in round $k$, a number $m_k = O\left(d\log\left(\frac{d}{\epsilon\delta}\right)(d + \log(k/\delta))\right)$ of labeled examples in round $k \geq 1$, and $w_0$ such that $\theta(w_0, w^*) < \pi/2$, after $s = O(\log(1/\epsilon))$ iterations, Algorithm 2 finds $w_s$ satisfying $\mathrm{err}(w_s) \leq \epsilon$ with probability $\geq 1 - \delta$.

The rest of this section is dedicated to the proof of Theorem 4.1.

4.1. Parameters for the algorithm

With the exception of the parameters $\sigma_k^2$ and $\xi_k$ of the outlier removal procedure, the parameters are set exactly as in Section 3.2. The values of $\sigma_k^2$ and $\xi_k$ are determined by our analysis: $\sigma_k^2$ is $c(r_k^2 + b_{k-1}^2)$, for the value of $c$ in Theorem 4.2 below that corresponds to the choice, in the statement of Theorem 4.2, of $C = c_1'$. Finally, $\xi_k = \min\left(\frac{\kappa}{2^7}, \frac{\kappa^2\tau_k^2}{c_4 2^{16} z_k^2}\right)$, for the value of $c_4$ in Lemma 3.4 corresponding to the choice $c_3 = b_0$.

4.2. Analysis of the outlier removal subroutine

The analysis of the learning algorithm uses the following lemma about Algorithm 3.

THEOREM 4.2. For any $C > 0$, there is a constant $c$ and a polynomial $p$ such that, for all $\xi > 2\eta'$ and all $0 < \gamma < C$, if $n \geq p(1/\eta', d, 1/\xi, 1/\delta, 1/\gamma, 1/r)$, then, with probability $1 - \delta$, the output $q$ of Algorithm 3 satisfies the following:
- $\sum_{x \in S} q(x) \geq (1 - \xi)|S|$ (a fraction $1 - \xi$ of the weight is retained);
- for all unit length $w$ such that $\|w - u\|_2 \leq r$,
$$\frac{1}{|S|}\sum_{x \in S} q(x)(w \cdot x)^2 \leq c(r^2 + \gamma^2). \quad (6)$$
Furthermore, the algorithm can be implemented in polynomial time.
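As a rough illustration of the separation-oracle idea behind Algorithm 3 and Theorem 4.2, checking constraint (c) amounts to asking whether any direction near $u$ has too much $q$-weighted variance. The numpy sketch below simplifies the paper's approach: it maximizes over all unit vectors via an eigendecomposition and only afterwards tests whether the maximizer lies in $B(u, r)$, whereas the actual procedure solves the trust-region problem over $B(u, r) \cap B(0, 1)$ exactly using the techniques of [Sturm and Zhang 2003; Bienstock and Michalka 2014].

```python
import numpy as np

def violated_variance_direction(X, q, u, r, sigma2):
    """Crude check of constraint (c): look for a unit direction w with
    ||w - u|| <= r whose q-weighted second moment (1/|S|) sum_x q(x)(w . x)^2
    exceeds sigma2. Returns such a w (a cutting plane) or None.
    Simplification: we maximize over the whole unit sphere (top eigenvector),
    so violations whose maximizer lies strictly inside B(u, r) can be missed."""
    n = X.shape[0]
    A = (X * q[:, None]).T @ X / n            # (1/|S|) sum_x q(x) x x^T
    eigvals, eigvecs = np.linalg.eigh(A)
    w = eigvecs[:, -1]                        # direction of maximum weighted variance
    if np.linalg.norm(-w - u) < np.linalg.norm(w - u):
        w = -w                                # pick the sign closer to u
    if np.linalg.norm(w - u) <= r and eigvals[-1] > sigma2:
        return w
    return None
```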
Our proof of Theorem 4.2 proceeds through a series of lemmas. Lemma 3.2 implies that we may assume without loss of generality that the instances $x_1, \ldots, x_n$ from $S$ are distinct. Obviously, a feasible $q$ satisfies the requirements of the lemma. So all we need to show is that (i) there is a feasible solution $q$, and (ii) we can simulate a separation oracle: given a provisional solution $\hat{q}$, we can find a linear constraint violated by $\hat{q}$ in polynomial time.

We will start by proving that there is a feasible $q$. First of all, a Chernoff bound implies that $n \geq \mathrm{poly}(1/\eta', 1/\delta)$ suffices for it to be the case that, with probability $1 - \delta$, at most $2\eta'$ members of $S$ are noisy. Let us assume from now on that this is the case. We will show that $q^*$, which sets $q^*(x) = 0$ for each noisy point and $q^*(x) = 1$ for each non-noisy point, is feasible.

First, we use VC tools to show that, if enough examples are chosen, a bound like Lemma 3.4, but averaged over the clean examples, likely holds for all relevant directions.

LEMMA 4.3. If we draw $\ell$ times i.i.d. from $D$ to form $X_C$, with probability $1 - \delta$ we have that, for any unit length $a$,
$$\frac{1}{\ell}\sum_{x \in X_C}(a \cdot x)^2 \leq E[(a \cdot x)^2] + \sqrt{\frac{O(d\log(\ell/\delta)(d + \log(1/\delta)))}{\ell}}.$$

PROOF. See Appendix D.

Lemma 4.3 and Lemma 3.4 together directly imply that
$$n = \mathrm{poly}\left(d, 1/\eta', 1/\delta, \frac{1}{c(r^2 + \gamma^2)}\right) = \mathrm{poly}(d, 1/\eta', 1/\delta, 1/\gamma, 1/r)$$
suffices for it to be the case that, for all $w \in B(u, r)$,
$$\frac{1}{|S|}\sum_{x \in S} q^*(x)(w \cdot x)^2 \leq 2E[(w \cdot x)^2] \leq 2c_4(r^2 + \gamma^2),$$
where $c_4$ is the value in Lemma 3.4 corresponding to setting $c_3 = C$. If $c = 2c_4$, we have that $q^*$ is feasible.

So what is left to prove is that a separation oracle for the convex program can be computed in polynomial time. Very roughly, there is a linear constraint for each of a set of directions, limiting the variance in that direction. We can find a violated constraint, if there is one, by finding the direction with maximum variance, using something like PCA, but taking appropriate account of the fact that we are only considering directions near $u$.

In detail, we may compute the separation oracle as follows. First, it is easy to check whether, for all $x$, $0 \leq q(x) \leq 1$, and whether $\sum_{x \in S} q(x) \geq (1 - \xi)|S|$. An algorithm can first do that. If these checks pass, then it needs to check whether there is a $w \in B(u, r)$ with $\|w\|_2 \leq 1$ such that
$$\frac{1}{|S|}\sum_{x \in S} q(x)(w \cdot x)^2 > c(r^2 + \gamma^2).$$
This can be done by finding the $w \in B(u, r)$ with $\|w\|_2 \leq 1$ that maximizes $\sum_{x \in S} q(x)(w \cdot x)^2$, and checking it. Suppose $X$ is a matrix with a row for each $x \in S$, where the row is $\sqrt{q(x)}\,x$. Then $\sum_{x \in S} q(x)(w \cdot x)^2 = w^T X^T X w$, and maximizing this over $w$ is an equivalent problem to minimizing $w^T(-X^T X)w$ subject to $\|w - u\|_2 \leq r$ and $\|w\| \leq 1$. Since $-X^T X$ is symmetric, problems of this form are known to be solvable in polynomial time [Sturm and Zhang 2003] (see [Bienstock and Michalka 2014]).

4.3. The error within a band in each iteration

At each iteration, Algorithm 2 concentrates its attention on examples in the band. Our next theorem analyzes its error on these examples.

THEOREM 4.4. For $k \leq \lceil \log_M(1/\epsilon) \rceil$, if $\mathrm{err}_D(w_{k-1}) \leq M^{-(k-1)}$, then with probability $1 - \frac{\delta}{k + k^2}$ (over the random examples in round $k$), after round $k$ of Algorithm 2 we have $\mathrm{err}_{D_{w_{k-1}, b_{k-1}}}(w_k) \leq \kappa$.
4.3. The error within a band in each iteration

At each iteration, Algorithm 2 concentrates its attention on examples in the band. Our next theorem analyzes its error on these examples.

THEOREM 4.4. For $k \le \lceil \log_M(1/\epsilon)\rceil$, if $\mathrm{err}_D(w_{k-1}) \le M^{-(k-1)}$, then, with probability $1 - \frac{\delta}{k + k^2}$ (over the random examples in round $k$), after round $k$ of Algorithm 2 we have $\mathrm{err}_{D_{w_{k-1},b_{k-1}}}(w_k) \le \kappa$.

We will prove Theorem 4.4 using a series of lemmas below. First, we bound the hinge loss of the target $w^*$ within the band $S_{w_{k-1},b_{k-1}}$. Since we are analyzing a particular round $k$, to reduce clutter in the formulas, for the rest of this section let us refer to $\ell_{\tau_k}$ as $\ell$, and to $L_{\tau_k}(\cdot, D_{w_{k-1},b_{k-1}})$ as $L(\cdot)$. First, Lemma 3.7, that $L(w^*) \le \kappa/6$, also applies here, using exactly the same proof.

During round $k$ we can decompose the working set $W$ into the set of "clean" examples $W_C$, which are drawn from $D_{w_{k-1},b_{k-1}}$, and the set of "dirty" or malicious examples $W_D$, which are output by the adversary. We will next show that the fraction of dirty examples in round $k$ is not too large.

LEMMA 4.5. There is an absolute positive constant $c$ such that, with probability $1 - \frac{\delta}{6(k + k^2)}$,
\[
|W_D| \le c\eta n_k M^k \le \frac{cM\eta n_k}{\epsilon}. \tag{7}
\]

PROOF. From Lemma 3.2 and the setting of our parameters, the probability that an example falls in $S_{w_{k-1},b_{k-1}}$ is at least $\Omega(M^{-k})$. Therefore, with probability $1 - \frac{\delta}{12(k + k^2)}$, the number of examples we must draw before we encounter $n_k$ examples that fall within $S_{w_{k-1},b_{k-1}}$ is at most $O(n_k M^k)$. The probability that each unlabeled example we draw is noisy is at most $\eta$. Applying a Chernoff bound, with probability at least $1 - \frac{\delta}{12(k + k^2)}$, we have $|W_D| \le c\eta n_k M^k$. Since $k \le \lceil\log_M(1/\epsilon)\rceil$, this completes the proof.

Recall that the total variation distance between two probability distributions is the maximum difference between the probabilities that they assign to any event. We can think of $q$ as a soft indicator function for "keeping" examples, and so interpret the inequality $\sum_{x \in W} q(x) \ge (1 - \xi)|W|$ as roughly akin to saying that most examples are kept. This means that the distribution $p$ obtained by normalizing $q$ is close to the uniform distribution over $W$. We make this precise in the following lemma.

LEMMA 4.6. The total variation distance between $p$ and the uniform distribution over $W$ is at most $\xi$.

PROOF. Lemma 1 of [Long and Servedio 2006] implies that the total variation distance $\rho$ between $p$ and the uniform distribution over $W$ satisfies
\[
\rho = 1 - \sum_{x \in W}\min\left(p(x), \frac{1}{|W|}\right) = 1 - \sum_{x \in W}\min\left(\frac{q(x)}{\sum_{u \in W} q(u)}, \frac{1}{|W|}\right).
\]
Since $q(u) \le 1$ for all $u$, we have $\sum_{u \in W} q(u) \le |W|$, so that
\[
\rho \le 1 - \frac{1}{|W|}\sum_{x \in W}\min\{q(x), 1\}.
\]
Again, since $q(x) \le 1$, we have
\[
\rho \le 1 - \frac{(1 - \xi)|W|}{|W|} = \xi.
\]
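As a quick numerical sanity check of Lemma 4.6 (purely illustrative; the weights below are synthetic and not produced by Algorithm 3), one can compute the total variation distance between the normalized weights and the uniform distribution exactly as in the proof and compare it with $\xi$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, xi = 1000, 0.05
# Hypothetical output of the soft outlier removal step: most weights are 1,
# a few are reduced, and the average weight stays at least 1 - xi.
q = np.ones(n)
q[:40] = rng.uniform(0.0, 1.0, size=40)
assert q.mean() >= 1 - xi

p = q / q.sum()                             # distribution obtained by normalizing q
tv = 1.0 - np.minimum(p, 1.0 / n).sum()     # total variation distance to uniform over W
print(tv, "<=", xi, ":", tv <= xi)          # Lemma 4.6 predicts tv <= xi
```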
Next, we will relate the average hinge loss when examples are weighted according to $p$, i.e., $\ell(w, p)$, to the hinge loss averaged over the clean examples $W_C$, i.e., $\ell(w, W_C)$. Here $\ell(w, W_C)$ and $\ell(w, p)$ are defined with respect to the unrevealed labels that the adversary has committed to.

LEMMA 4.7. There are absolute constants $C_1$, $C_2$ and $C_3$ such that, with probability $1 - \frac{\delta}{2(k + k^2)}$, if we define $z_k = \sqrt{r_k^2 + b_{k-1}^2}$, then for any $w \in B(w_{k-1}, r_k)$ we have
\[
\ell(w, W_C) \le \ell(w, p) + C_1\frac{\eta}{\epsilon}\left(1 + \frac{z_k}{\tau_k}\right) + \kappa/32 \tag{8}
\]
and
\[
\ell(w, p) \le 2\,\ell(w, W_C) + \kappa/32 + C_2\frac{\eta}{\epsilon} + C_3\sqrt{\frac{\eta}{\epsilon}} \cdot \frac{z_k}{\tau_k}. \tag{9}
\]

PROOF. Assume without loss of generality that each element $(x, y) \in W$ is distinct. Fix an arbitrary $w \in B(w_{k-1}, r_k)$. By Theorem 4.2, Lemma 4.5, Lemma 3.2, Lemma 3.4, and Lemma 4.3, we know that, with probability $1 - \frac{\delta}{2(k + k^2)}$, there are absolute constants $K_1$, $K_2$ and $K_3$ such that
\[
\frac{1}{|W|}\sum_{x \in W} q(x)(w \cdot x)^2 \le K_1 z_k^2 \tag{10}
\]
\[
|W_D| \le K_2\frac{\eta n_k}{\epsilon} \tag{11}
\]
\[
\frac{1}{|W_C|}\sum_{(x,y) \in W_C}(w \cdot x)^2 \le K_3 z_k^2. \tag{12}
\]
(We will need the value of $K_3$ later: we may use
\[
K_3 = 2c_4 \tag{13}
\]
for the value of $c_4$ in Lemma 3.4 corresponding to $c_3 = b_0$.) Assume that (10), (11) and (12) all hold.

Since $\sum_{x \in W} q(x) \ge (1 - \xi_k)|W| \ge |W|/2$, (10) implies
\[
\sum_{x \in W} p(x)(w \cdot x)^2 \le 2K_1 z_k^2. \tag{14}
\]
First, let us bound the weighted loss on the noisy examples in the training set. In particular, we will show that
\[
\sum_{(x,y) \in W_D} p(x)\,\ell(w, x, y) \le K_2\eta/\epsilon + \xi_k + \sqrt{2K_1\left(K_2\eta/\epsilon + \xi_k\right)}\left(\frac{z_k}{\tau_k}\right). \tag{15}
\]
To see this, notice that
\[
\begin{aligned}
\sum_{(x,y) \in W_D} p(x)\,\ell(w, x, y)
  &= \sum_{(x,y) \in W_D} p(x)\max\left(0,\, 1 - \frac{y(w \cdot x)}{\tau_k}\right)\\
  &\le \Pr_p(W_D) + \frac{1}{\tau_k}\sum_{(x,y) \in W_D} p(x)|w \cdot x|\\
  &= \Pr_p(W_D) + \frac{1}{\tau_k}\sum_{(x,y) \in W} p(x)\mathbf{1}_{W_D}(x, y)|w \cdot x|\\
  &\le \Pr_p(W_D) + \frac{1}{\tau_k}\sqrt{\sum_{(x,y) \in W} p(x)\mathbf{1}_{W_D}(x, y)}\sqrt{\sum_{(x,y) \in W} p(x)(w \cdot x)^2} \quad\text{(by the Cauchy-Schwarz inequality)}\\
  &\le \Pr_p(W_D) + \sqrt{2K_1\Pr_p(W_D)}\left(\frac{z_k}{\tau_k}\right)\\
  &\le K_2\frac{\eta}{\epsilon} + \xi_k + \sqrt{2K_1\left(K_2\eta/\epsilon + \xi_k\right)}\left(\frac{z_k}{\tau_k}\right),
\end{aligned}
\]
where the second-to-last inequality follows from (14) and the last one from Lemma 4.6 and (11).

Similarly, we will show that
\[
\sum_{(x,y) \in W} p(x)\,\ell(w, x, y) \le 1 + \sqrt{2K_1}\left(\frac{z_k}{\tau_k}\right). \tag{16}
\]
To see this, notice that
\[
\begin{aligned}
\sum_{(x,y) \in W} p(x)\,\ell(w, x, y)
  &= \sum_{(x,y) \in W} p(x)\max\left(0,\, 1 - \frac{y(w \cdot x)}{\tau_k}\right)\\
  &\le 1 + \frac{1}{\tau_k}\sum_{(x,y) \in W} p(x)|w \cdot x|\\
  &\le 1 + \frac{1}{\tau_k}\sqrt{\sum_{(x,y) \in W} p(x)(w \cdot x)^2}\\
  &\le 1 + \sqrt{2K_1}\left(\frac{z_k}{\tau_k}\right),
\end{aligned}
\]
by (14).

Next, we have
\[
\begin{aligned}
\ell(w, W_C)
  &= \frac{1}{|W_C|}\sum_{(x,y) \in W}\Big(q(x)\,\ell(w, x, y) + (\mathbf{1}_{W_C}(x, y) - q(x))\,\ell(w, x, y)\Big)\\
  &\le \frac{1}{|W_C|}\left(\sum_{(x,y) \in W} q(x)\,\ell(w, x, y) + \sum_{(x,y) \in W_C}(1 - q(x))\,\ell(w, x, y)\right)\\
  &\le \frac{1}{|W_C|}\left(\sum_{(x,y) \in W} q(x)\,\ell(w, x, y) + \sum_{(x,y) \in W_C}(1 - q(x))\left(1 + \frac{|w \cdot x|}{\tau_k}\right)\right)\\
  &\le \frac{1}{|W_C|}\left(\sum_{(x,y) \in W} q(x)\,\ell(w, x, y) + \xi_k|W| + \frac{1}{\tau_k}\sum_{(x,y) \in W_C}(1 - q(x))|w \cdot x|\right)\\
  &\le \frac{1}{|W_C|}\left(\sum_{(x,y) \in W} q(x)\,\ell(w, x, y) + \xi_k|W| + \frac{1}{\tau_k}\sqrt{\sum_{(x,y) \in W_C}(1 - q(x))^2}\sqrt{\sum_{(x,y) \in W_C}(w \cdot x)^2}\right)
\end{aligned}
\]
by the Cauchy-Schwarz inequality. Recall that $0 \le q(x) \le 1$ and $\sum_{(x,y) \in W} q(x) \ge (1 - \xi_k)|W|$. Thus,
\[
\begin{aligned}
\ell(w, W_C)
  &\le \frac{1}{|W_C|}\left(\sum_{(x,y) \in W} q(x)\,\ell(w, x, y) + \xi_k|W| + \frac{1}{\tau_k}\sqrt{\xi_k|W|}\sqrt{\sum_{(x,y) \in W_C}(w \cdot x)^2}\right)\\
  &\le \frac{1}{|W_C|}\left(\sum_{(x,y) \in W} q(x)\,\ell(w, x, y) + \xi_k|W| + \sqrt{\xi_k|W||W_C|K_3}\left(\frac{z_k}{\tau_k}\right)\right)
\end{aligned}
\]
by (12). Since $|W_C| \ge |W|/2$, we have
\[
\ell(w, W_C) \le \frac{1}{|W_C|}\left(\sum_{(x,y) \in W} q(x)\,\ell(w, x, y)\right) + 2\xi_k + \sqrt{2\xi_k K_3}\left(\frac{z_k}{\tau_k}\right).
\]
We have chosen $\xi_k$ small enough that
\[
\begin{aligned}
\ell(w, W_C)
  &\le \frac{1}{|W_C|}\left(\sum_{(x,y) \in W} q(x)\,\ell(w, x, y)\right) + \kappa/32\\
  &= \frac{\sum_{(x,y) \in W} q(x)}{|W_C|}\left(\sum_{(x,y) \in W} p(x)\,\ell(w, x, y)\right) + \kappa/32\\
  &= \ell(w, p) + \left(\frac{\sum_{(x,y) \in W} q(x)}{|W_C|} - 1\right)\left(\sum_{(x,y) \in W} p(x)\,\ell(w, x, y)\right) + \kappa/32\\
  &\le \ell(w, p) + \left(\frac{|W|}{|W_C|} - 1\right)\left(\sum_{(x,y) \in W} p(x)\,\ell(w, x, y)\right) + \kappa/32\\
  &\le \ell(w, p) + \left(\frac{|W|}{|W_C|} - 1\right)\left(1 + \sqrt{2K_1}\left(\frac{z_k}{\tau_k}\right)\right) + \kappa/32,
\end{aligned}
\]
by (16). Applying (11) yields (8).

Also,
\[
\begin{aligned}
\ell(w, p) &= \sum_{(x,y) \in W} p(x)\,\ell(w, x, y)\\
  &= \sum_{(x,y) \in W_C} p(x)\,\ell(w, x, y) + \sum_{(x,y) \in W_D} p(x)\,\ell(w, x, y)\\
  &\le \sum_{(x,y) \in W_C} p(x)\,\ell(w, x, y) + K_2\eta/\epsilon + \xi_k + \sqrt{2K_1(K_2\eta/\epsilon + \xi_k)}\left(\frac{z_k}{\tau_k}\right) &&\text{(by (15))}\\
  &= \frac{\sum_{(x,y) \in W_C} q(x)\,\ell(w, x, y)}{\sum_{(x,y) \in W} q(x)} + K_2\eta/\epsilon + \xi_k + \sqrt{2K_1(K_2\eta/\epsilon + \xi_k)}\left(\frac{z_k}{\tau_k}\right)\\
  &\le \frac{\sum_{(x,y) \in W_C} \ell(w, x, y)}{\sum_{(x,y) \in W_C} q(x)} + K_2\eta/\epsilon + \xi_k + \sqrt{2K_1(K_2\eta/\epsilon + \xi_k)}\left(\frac{z_k}{\tau_k}\right) &&\text{(since } q(x) \le 1 \text{ for all } x\text{)}\\
  &\le \frac{\sum_{(x,y) \in W_C} \ell(w, x, y)}{|W_C| - \xi_k|W|} + K_2\eta/\epsilon + \xi_k + \sqrt{2K_1(K_2\eta/\epsilon + \xi_k)}\left(\frac{z_k}{\tau_k}\right)\\
  &\le 2\,\ell(w, W_C) + K_2\eta/\epsilon + \xi_k + \sqrt{2K_1(K_2\eta/\epsilon + \xi_k)}\left(\frac{z_k}{\tau_k}\right),
\end{aligned}
\]
by (11), which in turn implies (9).

PROOF OF THEOREM 4.4. Exploiting the fact that, with high probability, $\ell(w, x, y) = O\!\left(\sqrt{d\log\!\left(\frac{d}{\epsilon\delta}\right)}\right)$ for all $(x, y) \in S_{w_{k-1},b_{k-1}}$ and $w \in B(w_{k-1}, r_k)$, as in the proof of Lemma 3.9, with probability $1 - \frac{\delta}{2(k+k^2)}$, for all $w \in B(w_{k-1}, r_k)$,
\[
|L(w) - \ell(w, W_C)| \le \kappa/32 \tag{17}
\]
and
\[
|\ell(w, p) - \ell(w, T)| \le \kappa/32. \tag{18}
\]
Also, with probability $1 - \frac{\delta}{2(k+k^2)}$, both (8) and (9) hold. Let us assume from here on that all of these hold. Then we have
\[
\begin{aligned}
\mathrm{err}_{D_{w_{k-1},b_{k-1}}}(w_k) &= \mathrm{err}_{D_{w_{k-1},b_{k-1}}}(v_k)\\
&\le L(v_k) &&\text{(since, for each error, the hinge loss is at least 1)}\\
&\le \ell(v_k, W_C) + \kappa/16 &&\text{(by (17))}\\
&\le \ell(v_k, p) + C_1\frac{\eta}{\epsilon}\left(1 + \frac{z_k}{\tau_k}\right) + \kappa/8 &&\text{(by (8))}\\
&\le \ell(v_k, T) + C_1\frac{\eta}{\epsilon}\left(1 + \frac{z_k}{\tau_k}\right) + \kappa/4 &&\text{(by (18))}\\
&\le \ell(w^*, T) + C_1\frac{\eta}{\epsilon}\left(1 + \frac{z_k}{\tau_k}\right) + \kappa/4 &&\text{(since } w^* \in B(w_{k-1}, r_k)\text{)}\\
&\le \ell(w^*, p) + C_1\frac{\eta}{\epsilon}\left(1 + \frac{z_k}{\tau_k}\right) + \kappa/3 &&\text{(by (18))}.
\end{aligned}
\]
This, together with (9) and (17), gives
\[
\begin{aligned}
\mathrm{err}_{D_{w_{k-1},b_{k-1}}}(w_k)
&\le 2\,\ell(w^*, W_C) + C_2\frac{\eta}{\epsilon} + C_3\sqrt{\frac{\eta}{\epsilon}}\cdot\frac{z_k}{\tau_k} + C_1\frac{\eta}{\epsilon}\left(1 + \frac{z_k}{\tau_k}\right) + 2\kappa/5\\
&\le 2L(w^*) + C_2\frac{\eta}{\epsilon} + C_3\sqrt{\frac{\eta}{\epsilon}}\cdot\frac{z_k}{\tau_k} + C_1\frac{\eta}{\epsilon}\left(1 + \frac{z_k}{\tau_k}\right) + \kappa/2\\
&\le \kappa/3 + C_2\frac{\eta}{\epsilon} + C_3\sqrt{\frac{\eta}{\epsilon}}\cdot\frac{z_k}{\tau_k} + C_1\frac{\eta}{\epsilon}\left(1 + \frac{z_k}{\tau_k}\right) + \kappa/2,
\end{aligned}
\]
by Lemma 3.7. Now notice that $z_k/\tau_k$ is $\Theta(1)$. Hence requiring $\eta$ to be at most a sufficiently small constant times $\epsilon$, i.e., a noise tolerance of $\Omega(\epsilon)$, suffices to imply that $\mathrm{err}_{D_{w_{k-1},b_{k-1}}}(w_k) \le \kappa$ with probability $1 - \frac{\delta}{k + k^2}$. The rest of the analysis is exactly the same as for the case of adversarial label noise.

5. DISCUSSION

We note that the idea of localization in the concept space is traditionally used in statistical learning theory, both in supervised and active learning, for obtaining sharper rates [Boucheron et al. 2005; Bshouty et al. 2009; Koltchinskii 2010]. Furthermore, the idea of localization in the instance space has been used in margin-based analysis of active learning [Balcan et al. 2007; Balcan and Long 2013]. In this work we used localization in both senses in order to get polynomial-time algorithms with better noise tolerance. It would be interesting to further exploit this idea for other concept spaces.

Our algorithms run in polynomial time, and therefore use a polynomial number of examples. Notably, they use only polylogarithmically many class labels.
Our bounds on the total number of examples used by our algorithms are, however, somewhat worse than the best bounds known for the noise-free case. In order to find and remove outliers, the precision with which we need statistics on the training data to match properties of the underlying distribution gets finer as the number of variables increases. When combined with the usual effect in VC analyses regarding growth of the richness of behavior with the number of variables (which could be partially mitigated using localized analysis in place of the VC tools that we have used here), this leads to the increased requirement on the number of examples. Substantially improving the sample complexity and finding more computationally efficient noise-tolerant algorithms is a potentially useful topic for future research.

While we have chosen to focus on isotropic log-concave distributions to present our techniques in a clean setting, it appears that, using tools from [Balcan and Long 2013; Awasthi et al. 2014], our analysis can be applied to a broader class of distributions with minor changes, including "nearly log-concave distributions", defined as in [Applegate and Kannan 1991]. One property of the distribution that is needed for our analysis is that it is fairly likely that a random example falls fairly close to the separating hyperplane of the target. While this may not be the case in some applications, such applications are typically easier, and might be handled separately. Provably noise-tolerant learning of linear classifiers for natural classes of distributions that include such cases is another important topic for future work.

Acknowledgments

We thank Steve Hanneke for helpful communications. We also thank anonymous reviewers for their helpful comments. This work was supported in part by NSF grants CCF-0953192, CCF-1101283, and CCF-1422910, AFOSR grant FA9550-09-1-0538, ONR grant N00014-09-1-0751, and a Microsoft Research Faculty Fellowship.

REFERENCES

M. Anthony and P. L. Bartlett. 1999. Neural Network Learning: Theoretical Foundations. Cambridge University Press.
D. Applegate and R. Kannan. 1991. Sampling and integration of near log-concave functions. In STOC.
S. Arora, L. Babai, J. Stern, and Z. Sweedyk. 1993. The hardness of approximate optima in lattices, codes, and systems of linear equations. In Proceedings of the 1993 IEEE 34th Annual Foundations of Computer Science.
P. Awasthi, M.-F. Balcan, and P. M. Long. 2014. The power of localization for efficiently learning linear separators with noise. In STOC. 449–458. See also Arxiv paper 1307.8371v7.
P. Awasthi, M.-F. Balcan, N. Haghtalab, and R. Urner. 2015. Efficient Learning of Linear Separators under Bounded Noise. See also Arxiv paper 1503.03594.
P. Awasthi, M.-F. Balcan, N. Haghtalab, and H. Zhang. 2016. Learning and 1-bit Compressed Sensing under Asymmetric Noise. In COLT.
P. Awasthi, A. Blum, and O. Sheffet. 2010. Improved guarantees for agnostic learning of disjunctions. In COLT.
M.-F. Balcan, A. Beygelzimer, and J. Langford. 2006. Agnostic active learning. In ICML.
M.-F. Balcan, A. Broder, and T. Zhang. 2007. Margin based active learning. In COLT.
M.-F. Balcan and V. Feldman. 2013. Statistical Active Learning Algorithms. In NIPS.
M.-F. Balcan and S. Hanneke. 2012. Robust Interactive Learning. In COLT.
M.-F. Balcan, S. Hanneke, and J. Wortman. 2008. The True Sample Complexity of Active Learning. In COLT.
M.-F. Balcan and P. M. Long. 2013. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory.
P. L. Bartlett, O. Bousquet, and S. Mendelson. 2005. Local Rademacher complexities. Annals of Statistics 33, 4 (2005), 1497–1537.
A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. 2010. Agnostic Active Learning Without Constraints. In NIPS.
D. Bienstock and A. Michalka. 2014. Polynomial solvability of variants of the trust-region subproblem. In SODA.
A. Birnbaum and S. Shalev-Shwartz. 2012. Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs. In NIPS.
A. Blum, A. Frieze, R. Kannan, and S. Vempala. 1997. A polynomial time algorithm for learning noisy linear threshold functions. Algorithmica 22, 1/2 (1997), 35–52.
A. Blum, M. L. Furst, M. J. Kearns, and R. J. Lipton. 1994. Cryptographic Primitives Based on Hard Learning Problems. In Proceedings of the 13th Annual International Cryptology Conference on Advances in Cryptology.
S. Boucheron, O. Bousquet, and G. Lugosi. 2005. Theory of Classification: a Survey of Recent Advances. ESAIM: Probability and Statistics 9 (2005), 323–375.
N. H. Bshouty, Y. Li, and P. M. Long. 2009. Using the doubling dimension to analyze the generalization of learning algorithms. JCSS (2009).
T. Bylander. 1994. Learning linear threshold functions in the presence of classification noise. In Conference on Computational Learning Theory.
R. Castro and R. Nowak. 2007. Minimax Bounds for Active Learning. In COLT.
N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. 2010. Learning noisy linear classifiers via adaptive and selective sampling. Machine Learning (2010).
D. Cohn, L. Atlas, and R. Ladner. 1994. Improving Generalization with Active Learning. Machine Learning 15, 2 (1994).
N. Cristianini and J. Shawe-Taylor. 2000. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press.
A. Daniely. 2015. A PTAS for Agnostically Learning Halfspaces. In COLT. See also Arxiv paper arXiv:1410.7050.
S. Dasgupta. 2005. Coarse sample complexity bounds for active learning. In NIPS.
S. Dasgupta. 2011. Active Learning. Encyclopedia of Machine Learning (2011).
S. Dasgupta, D. J. Hsu, and C. Monteleoni. 2007. A general agnostic active learning algorithm. In NIPS.
O. Dekel, C. Gentile, and K. Sridharan. 2012. Selective Sampling and Active Learning from Single and Multiple Teachers. JMLR (2012).
I. Diakonikolas, D. M. Kane, and A. Stewart. 2017. Learning geometric concepts with nasty noise. arXiv preprint arXiv:1707.01242 (2017).
V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. 2006. New Results for Learning Noisy Parities and Halfspaces. In FOCS. 563–576.
Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. 1997. Selective sampling using the query by committee algorithm. Machine Learning 28, 2-3 (1997), 133–168.
M. R. Garey and D. S. Johnson. 1990. Computers and Intractability; A Guide to the Theory of NP-Completeness.
A. Gonen, S. Sabato, and S. Shalev-Shwartz. 2013. Efficient Pool-Based Active Learning of Halfspaces. In ICML.
A. Gupta, M. Hardt, A. Roth, and J. Ullman. 2011. Privately releasing conjunctions and the statistical query barrier. In Proceedings of the 43rd annual ACM symposium on Theory of computing.
V. Guruswami and P. Raghavendra. 2006. Hardness of Learning Halfspaces with Noise. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science.
V. Guruswami and P. Raghavendra. 2009. Hardness of Learning Halfspaces with Noise. SIAM J. Comput. 39, 2 (2009), 742–765.
S. Hanneke. 2007. A Bound on the Label Complexity of Agnostic Active Learning. In ICML.
S. Hanneke. 2011. Rates of Convergence in Active Learning. The Annals of Statistics 39, 1 (2011), 333–361.
S. Hanneke. 2014. Theory of Disagreement-Based Active Learning. Foundations and Trends in Machine Learning.
S. Hanneke, V. Kanade, and L. Yang. 2015. Learning with a Drifting Target Concept. In ALT.
D. S. Johnson and F. Preparata. 1978. The densest hemisphere problem. Theoretical Computer Science 6, 1 (1978), 93–107.
A. T. Kalai, A. R. Klivans, Y. Mansour, and R. A. Servedio. 2005. Agnostically Learning Halfspaces. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science.
M. Kearns and M. Li. 1988. Learning in the presence of malicious errors. In Proceedings of the twentieth annual ACM symposium on Theory of computing.
M. Kearns, R. Schapire, and L. Sellie. 1994. Toward Efficient Agnostic Learning. Mach. Learn. 17, 2-3 (Nov. 1994).
M. Kearns and U. Vazirani. 1994. An introduction to computational learning theory. MIT Press, Cambridge, MA.
A. R. Klivans, P. M. Long, and R. A. Servedio. 2009a. Learning Halfspaces with Malicious Noise. Journal of Machine Learning Research 10 (2009).
A. R. Klivans, P. M. Long, and A. Tang. 2009b. Baum's Algorithm Learns Intersections of Halfspaces with respect to Log-Concave Distributions. In RANDOM.
V. Koltchinskii. 2010. Rademacher Complexities and Bounding the Excess Risk in Active Learning. Journal of Machine Learning Research 11 (2010), 2457–2485.
P. M. Long and R. A. Servedio. 2006. Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions. NIPS (2006).
P. M. Long and R. A. Servedio. 2011. Learning large-margin halfspaces with more malicious noise. In NIPS.
L. Lovász and S. Vempala. 2007. The geometry of logconcave functions and sampling algorithms. Random Structures and Algorithms 30, 3 (2007), 307–358.
C. Monteleoni. 2006. Efficient algorithms for general active learning. In Proceedings of the 19th annual conference on Learning Theory.
D. Pollard. 2011. Convergence of Stochastic Processes.
M. Raginsky and A. Rakhlin. 2011. Lower Bounds for Passive and Active Learning. In NIPS.
O. Regev. 2005. On lattices, learning with errors, random linear codes, and cryptography. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing.
R. A. Servedio. 2001. Smooth Boosting and Learning with Malicious Noise. In 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory.
J. Sturm and S. Zhang. 2003. On cones of nonnegative quadratic functions. Mathematics of Operations Research 28 (2003), 246–267.
L. G. Valiant. 1985. Learning disjunction of conjunctions. In Proceedings of the 9th International Joint Conference on Artificial intelligence.
V. Vapnik. 1998. Statistical Learning Theory. Wiley-Interscience.
S. Vempala. 2010. A random-sampling-based algorithm for learning intersections of halfspaces. JACM 57, 6 (2010).
L. Wang. 2011. Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning. JMLR (2011).
C. Zhang and K. Chaudhuri. 2014. Beyond Disagreement-Based Agnostic Active Learning. In NIPS.
T. Zhang. 2006. Information Theoretical Upper and Lower Bounds for Statistical Estimation. IEEE Transactions on Information Theory 52, 4 (2006), 1307–1321.

A. ADDITIONAL RELATED WORK

Passive Learning. Blum et al. [Blum et al. 1997] considered noise-tolerant learning of halfspaces under a more idealized noise model, known as the random classification noise model, in which the label of each example is flipped with a certain probability, independently of the feature vector. Some other, less closely related, work on efficient noise-tolerant learning of halfspaces includes [Bylander 1994; Blum et al. 1997; Feldman et al. 2006; Guruswami and Raghavendra 2009; Servedio 2001; Awasthi et al. 2010; Long and Servedio 2011; Birnbaum and Shalev-Shwartz 2012].

Active Learning. As we have mentioned, most prior theoretical work on active learning focuses either on sample complexity bounds (without regard for efficiency) or on providing polynomial-time algorithms in the noiseless case or under simple noise models (random classification noise [Balcan and Feldman 2013] or linear noise [Cesa-Bianchi et al. 2010; Dekel et al. 2012]). In [Cesa-Bianchi et al. 2010; Dekel et al. 2012], online learning algorithms in the selective sampling framework are presented, where labels must be actively queried before they are revealed. Under the assumption that the label conditional distribution is a linear function determined by a fixed target vector, they provide bounds on the regret of the algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. As pointed out in [Dekel et al. 2012], these results can also be converted to a distributional PAC setting where the instances $x_t$ are drawn i.i.d. In this setting they obtain an exponential improvement in label complexity over passive learning. These interesting results and techniques are not directly comparable to ours. One important difference is that (as pointed out in [Gonen et al. 2013]) the exponential improvement they give is not possible in the noiseless version of their setting. In other words, the addition of linear noise defined by the target makes the problem easier for active sampling. By contrast, random classification noise can only make the classification task harder than in the realizable case. Recently, [Balcan and Feldman 2013] showed the first polynomial-time algorithms for actively learning thresholds, balanced rectangles, and homogeneous linear separators under log-concave distributions in the presence of random classification noise. Active learning with respect to isotropic log-concave distributions in the absence of noise was studied in [Balcan and Long 2013].
An algorithm for active learning with a general hypothesis space was proposed and analyzed by Zhang and Chaudhuri [2014]. Efficient algorithms for tracking a drifting linear classifier when the distribution is uniform were described by Hanneke, Kanade and Yang [2015].

B. ACUTE INITIALIZATION

We will prove that we may assume without loss of generality that the algorithm receives as input a $w_0$ whose angle with the target $w^*$ is acute. Suppose we have an algorithm $B$ as a subroutine that satisfies the guarantee of Theorem 3.1, given access to such a $w_0$. Then we can arrive at an algorithm $A$ which works without it as follows. With probability 1, for a random $u$, either $u$ or $-u$ has an acute angle with $w^*$. We may then run $B$ with both choices, and with $\epsilon$ set to $\frac{\pi c_2}{4}$, where $c_2$ is the constant in Part (e) of Lemma 3.2. Then we can use hypothesis testing on $O(\log(1/\delta))$ examples and, with high probability, find a hypothesis $w'$ with error less than $\frac{\pi c_2}{4}$. Part (e) of Lemma 3.2 then implies that $A$ may set $w_0 = w'$ and call $B$ again.

C. RELATING ADVERSARIAL LABEL NOISE AND THE AGNOSTIC SETTING

In this section we study the agnostic setting of [Kearns et al. 1994; Kalai et al. 2005] and describe how our results imply constant-factor approximations in that model. In the agnostic model, data $(x, y)$ is generated from a distribution $D$ over $\mathbb{R}^d \times \{1, -1\}$. For a given concept class $\mathcal{C}$, let OPT be the error of the best classifier in $\mathcal{C}$. In other words,
\[
\mathrm{OPT} = \min_{f \in \mathcal{C}} \mathrm{err}_D(f) = \min_{f \in \mathcal{C}} \Pr_{(x,y)\sim D}[f(x) \ne y].
\]
The goal of the learning algorithm is to output a hypothesis $h$ which is nearly as good as the best classifier in $\mathcal{C}$, i.e., given $\epsilon > 0$, we want $\mathrm{err}_D(h) \le c \cdot \mathrm{OPT} + \epsilon$, where $c$ is the approximation factor. Any result in the adversarial label noise model that we study translates into a result for the agnostic setting via the following lemma.

LEMMA C.1. For a given concept class $\mathcal{C}$ and distribution $D$, if there exists an algorithm in the adversarial label noise model which runs in time $\mathrm{poly}(d, 1/\epsilon)$ and tolerates a noise rate of $\eta = \Omega(\epsilon)$, then there exists an algorithm for $(\mathcal{C}, D)$ in the agnostic setting which runs in time $\mathrm{poly}(d, 1/\epsilon)$ and achieves error $O(\mathrm{OPT} + \epsilon)$.

PROOF. Let $f^*$ be the optimal halfspace, with error OPT. In the adversarial setting, with respect to $f^*$, the noise rate $\eta$ is exactly OPT. Set $\epsilon' = c(\mathrm{OPT} + \epsilon)$ as the accuracy parameter given to the algorithm for the adversarial model. By the guarantee of the algorithm, we will get a hypothesis $h$ such that $\Pr_{(x,y)\sim D}[h(x) \ne f^*(x)] \le \epsilon' = c(\mathrm{OPT} + \epsilon)$. Hence, by the triangle inequality, we have $\mathrm{err}_D(h) \le \mathrm{err}_D(f^*) + c(\mathrm{OPT} + \epsilon) = O(\mathrm{OPT} + \epsilon)$.

For the case when $\mathcal{C}$ is the class of origin-centered halfspaces in $\mathbb{R}^d$ and the marginal of $D$ is the uniform distribution over $S^{d-1}$, the above lemma along with Theorem 1.1 implies that we can output a halfspace of accuracy $O(\mathrm{OPT} + \epsilon)$ in time $\mathrm{poly}(d, 1/\epsilon)$. The work of [Kalai et al. 2005] achieves a guarantee of $O(\mathrm{OPT} + \epsilon)$ in time exponential in $1/\epsilon$ by doing $L_2$ regression to learn a low-degree polynomial, and shows that $L_1$ regression can achieve a stronger guarantee of $\mathrm{OPT} + \epsilon$.
As noted above, their approach also does not require that the halfspace to be learned pass through the origin.

D. PROOF OF VC LEMMAS

In this section, we apply some standard VC tools to establish some lemmas about estimates of expectations.

Definition D.1. Say that a set $F$ of real-valued functions with a common domain $X$ shatters $x_1, \ldots, x_d \in X$ if there are thresholds $t_1, \ldots, t_d$ such that
\[
\{(\mathrm{sign}(f(x_1) - t_1), \ldots, \mathrm{sign}(f(x_d) - t_d)) : f \in F\} = \{-1, 1\}^d.
\]
The pseudo-dimension of $F$ is the size of the largest set shattered by $F$.

We will use the following bound.

LEMMA D.2 (SEE [ANTHONY AND BARTLETT 1999]). Let $F$ be a set of functions from a common domain $X$ to $[a, b]$, let $d$ be the pseudo-dimension of $F$, and let $D$ be a probability distribution over $X$. Then, for $m = O\!\left(\frac{(b-a)^2}{\alpha^2}(d + \log(1/\delta))\right)$, if $x_1, \ldots, x_m$ are drawn independently at random according to $D$, with probability $1 - \delta$, for all $f \in F$,
\[
\left|\mathbf{E}_{x\sim D}[f(x)] - \frac{1}{m}\sum_{t=1}^m f(x_t)\right| \le \alpha.
\]

D.1. Proof of Lemma 3.9

The pseudo-dimension of the set of linear combinations of $d$ variables is known to be $d$ [Pollard 2011]. Since, for any non-increasing function $\psi: \mathbb{R} \to \mathbb{R}$ and any $F$, the pseudo-dimension of $\{\psi \circ f : f \in F\}$ is at most that of $F$ (see [Pollard 2011]), the pseudo-dimension of $\{\ell(w, \cdot) : w \in \mathbb{R}^d\}$ is at most $d$.

Now, to apply Lemma D.2, we want an upper bound on the loss. The first step is a bound in terms of the norm.

LEMMA D.3. There is a constant $c$ such that, for any $w \in B(w_{k-1}, r_k)$ and all $x$, $\ell(w, x, y) \le c(1 + \|x\|_2)$.

PROOF.
\[
\ell(w, x, y) \le 1 + \frac{|w \cdot x|}{\tau_k} \le 1 + \frac{|w_{k-1} \cdot x| + \|w - w_{k-1}\|_2\|x\|_2}{\tau_k} \le 1 + \frac{b_{k-1} + r_k\|x\|_2}{\tau_k} = 1 + \frac{c_1' M^{-k} + \min\{M^{-(k-1)}/c_6,\ \pi/2\}\,\|x\|_2}{c_2\min\{c_1' M^{-k}, c_1\}\,\frac{\kappa}{6 c_3}}.
\]

If the support of $D$ is bounded, Lemma D.3 gives a useful worst-case bound on the loss. Next, we give a high-probability bound that holds for all isotropic log-concave distributions.

LEMMA D.4. For an absolute constant $c$, with probability $1 - \frac{\delta}{6(k+k^2)}$,
\[
\max_{x \in W_C}\|x\|_2 \le c\sqrt{d}\ln\left(\frac{|W_C|\,k}{\delta}\right). \tag{19}
\]

PROOF. Applying Part (a) of Lemma 3.2 together with a union bound, we have
\[
\Pr(\exists x \in W_C,\ \|x\| > \alpha) \le c_9|W_C|\exp(-\alpha/\sqrt{d}),
\]
and $\alpha = \sqrt{d}\ln\left(\frac{12 c_9|W_C|k^2}{\delta}\right)$ makes the right-hand side at most $\frac{\delta}{6(k+k^2)}$.

Let $D'$ be the distribution obtained by conditioning $D$ on the event that $\|x\| < R$, where $R$ is the right-hand side of (19). By Lemma D.4, the total variation distance between drawing the members of $W_C$ independently at random from $D$ and drawing them from $D'$ is at most $\frac{\delta}{6(k+k^2)}$, so it suffices to prove (5) with respect to $D'$. Applying Lemma D.3 and Lemma D.2 then completes the proof of (5).

D.2. Proof of Lemma 4.3

Define $f_a$ by $f_a(x) = (a \cdot x)^2$. The pseudo-dimension of the set of all such functions is $O(d)$ [Klivans et al. 2009a]. As in the proof of Lemma 3.9, without loss of generality all $x$ have $\|x\|_2 \le O(\sqrt{d}\log(\ell/\delta))$, and applying Lemma D.2 completes the proof.
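To make the flavor of Lemma 4.3 and Lemma D.2 concrete, the following small experiment (ours, for illustration only) uses a standard Gaussian as a convenient isotropic log-concave distribution and checks, over a few hundred random unit directions standing in for "all unit length $a$", how far the empirical second moment of $a \cdot x$ drifts from its true value of 1 as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20
for ell in (100, 1_000, 10_000, 100_000):
    X = rng.standard_normal((ell, d))       # isotropic log-concave sample (Gaussian, for illustration)
    A = rng.standard_normal((200, d))       # random unit directions standing in for all unit-length a
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    emp = ((X @ A.T) ** 2).mean(axis=0)     # empirical estimate of E[(a.x)^2] for each direction
    worst = np.abs(emp - 1.0).max()         # true value is 1 for every unit-length a
    print(f"ell={ell:>7}  worst deviation over 200 directions = {worst:.4f}")
```

The deviation shrinks as $\ell$ grows, consistent with the $\sqrt{O(d\,\mathrm{polylog}/\ell)}$ error term in Lemma 4.3, although a sample of random directions of course only suggests, and does not certify, the uniform bound over all directions.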