Statistical Active Learning Algorithms for Noise Tolerance and Differential Privacy

Maria Florina Balcan
Carnegie Mellon University
ninamf@cs.cmu.edu

Vitaly Feldman
IBM Research - Almaden
vitaly@post.harvard.edu

Abstract

We describe a framework for designing efficient active learning algorithms that are tolerant to random classification noise and are differentially-private. The framework is based on active learning algorithms that are statistical in the sense that they rely on estimates of expectations of functions of filtered random examples. It builds on the powerful statistical query framework of Kearns [Kea98].

We show that any efficient active statistical learning algorithm can be automatically converted to an efficient active learning algorithm which is tolerant to random classification noise as well as other forms of "uncorrelated" noise. The complexity of the resulting algorithms has information-theoretically optimal quadratic dependence on 1/(1 − 2η), where η is the noise rate.

We show that commonly studied concept classes including thresholds, rectangles, and linear separators can be efficiently actively learned in our framework. These results combined with our generic conversion lead to the first computationally-efficient algorithms for actively learning some of these concept classes in the presence of random classification noise that provide exponential improvement in the dependence on the error ε over their passive counterparts.

In addition, we show that our algorithms can be automatically converted to efficient active differentially-private algorithms. This leads to the first differentially-private active learning algorithms with exponential label savings over the passive case.

1 Introduction

Most classic machine learning methods depend on the assumption that humans can annotate all the data available for training.
However, many modern machine learning applications have massive amounts of unannotated or unlabeled data. As a consequence, there has been tremendous interest both in machine learning and its application areas in designing algorithms that most efficiently utilize the available data, while minimizing the need for human intervention. An extensively used and studied technique is active learning, where the algorithm is presented with a large pool of unlabeled examples and can interactively ask for the labels of examples of its own choosing from the pool, with the goal to drastically reduce labeling effort. This has been a major area of machine learning research in the past decade [Das11, Han], with several exciting developments on understanding its underlying statistical principles [FSST97, Das05, BBL06, BBZ07, Han07, DHM07, CN07, BHW08, Kol10, BHLZ10, Wan11, RR11, BH12]. In particular, several general characterizations have been developed for describing when active learning can in principle have an advantage over the classic passive supervised learning paradigm, and by how much. While the label complexity aspect of active learning has been intensively studied and is currently well understood, the question of providing computationally efficient noise-tolerant active learning algorithms has remained largely open. In particular, prior to this work, there were no known efficient active algorithms for concept classes of super-constant VC-dimension that are provably robust to random and independent noise while giving improvements over the passive case.

1.1 Our Results

We propose a framework for designing efficient (polynomial time) active learning algorithms which is based on restricting the way in which examples (both labeled and unlabeled) are accessed by the algorithm.
These restricted algorithms can be easily simulated using active sampling and, in addition, possess a number of other useful properties. The main property we will consider is tolerance to random classification noise of rate η (each label is flipped randomly and independently with probability η [AL88]). Further, as we will show, the algorithms are tolerant to other forms of noise and can be simulated in a differentially-private way.

In our restriction, instead of access to random examples from some distribution P over X × Y, the learning algorithm only gets "active" estimates of the statistical properties of P in the following sense. The algorithm can choose any filter function χ(x) : X → [0, 1] and any query function φ : X × Y → [−1, 1]. For simplicity we can think of χ as an indicator function of some set χ_S ⊆ X of "informative" points and of φ as some useful property of the target function. For this pair of functions the learning algorithm can get an estimate of E_{(x,y)∼P}[φ(x, y) | x ∈ χ_S]. For τ and τ_0 chosen by the algorithm the estimate is provided to within tolerance τ as long as E_{(x,y)∼P}[x ∈ χ_S] ≥ τ_0 (nothing is guaranteed otherwise). The key point is that when we simulate this query from random examples, the inverse of τ corresponds to the label complexity of the algorithm and the inverse of τ_0 corresponds to its unlabeled sample complexity. Such a query is referred to as an active statistical query (SQ) and algorithms using active SQs are referred to as active statistical algorithms.

Our framework builds on the classic statistical query (SQ) learning framework of Kearns [Kea98] defined in the context of the PAC learning model [Val84].
The SQ model is based on estimates of expectations of functions of examples (but without the additional filter function) and was defined in order to design efficient noise-tolerant algorithms in the PAC model. Despite the restrictive form, most of the learning algorithms in the PAC model and other standard techniques in machine learning and statistics used for problems over distributions have SQ analogues [Kea98, BFKV97, BDMN05, CKL+06, FGR+13].¹ Further, statistical algorithms enjoy additional properties: they can be simulated in a differentially-private way [BDMN05], automatically parallelized on multi-core architectures [CKL+06] and have known information-theoretic characterizations of query complexity [BFJ+94, Fel12]. As we show, our framework inherits the strengths of the SQ model while, as we will argue, capturing the power of active learning.

At a first glance being active and statistical appear to be incompatible requirements on the algorithm. Active algorithms typically make label query decisions on the basis of examining individual samples (for example as in binary search for learning a threshold or the algorithms in [FSST97, DHM07, DKM09]). At the same time statistical algorithms can only examine properties of the underlying distribution. But there also exist a number of active learning algorithms that can be seen as applying passive learning techniques to batches of examples that are obtained from querying labels of samples that satisfy the same filter. These include the general A² algorithm [BBL06] and, for example, algorithms in [BBZ07, DH08, BDL09, BL13]. As we show, we can build on these techniques to provide algorithms that fit our framework.

¹ The sample complexity of the SQ analogues might be polynomially larger though.
We start by presenting a general reduction showing that any efficient active statistical learning algorithm can be automatically converted to an efficient active learning algorithm which is tolerant to random classification noise as well as other forms of "uncorrelated" noise. The sample complexity of the resulting algorithms depends just quadratically on 1/(1 − 2η), where η is the noise rate.

We then demonstrate the generality of our framework by showing that the most commonly studied concept classes, including thresholds, balanced rectangles, and homogeneous linear separators, can be efficiently actively learned via active statistical algorithms. For these concept classes, we design efficient active learning algorithms that are statistical and provide the same exponential improvements in the dependence on the error ε over passive learning as their non-statistical counterparts.

The primary problem we consider is active learning of homogeneous halfspaces, a problem that has attracted a lot of interest in the theory of active learning [FSST97, Das05, BBZ07, BDL09, DKM09, CGZ10, DGS12, BL13, GSSS13]. We describe two algorithms for the problem. First, building on insights from margin-based analysis of active learning [BBZ07, BL13], we give an active statistical learning algorithm for homogeneous halfspaces over all isotropic log-concave distributions, a wide class of distributions that includes many well-studied density functions and has played an important role in several areas including sampling, optimization, integration, and learning [LV07].
Our algorithm for this setting proceeds in rounds; in round t we build a better approximation w_t to the target function by using a passive SQ learning algorithm (e.g., the one of [DV04]) over a distribution D_t that is a mixture of distributions in which each component is the original distribution conditioned on being within a certain distance from the hyperplane defined by previous approximations w_i. To perform passive statistical queries relative to D_t we use active SQs with a corresponding real-valued filter. This algorithm is computationally efficient and uses only poly(d, log(1/ε)) active statistical queries of tolerance inverse-polynomial in the dimension d and log(1/ε).

For the special case of the uniform distribution over the unit ball we give a new, simpler and substantially more efficient active statistical learning algorithm. Our algorithm is based on measuring the error of a halfspace conditioned on being within some margin of that halfspace. We show that such measurements performed on the perturbations of the current hypothesis along the d basis vectors can be combined to derive a better hypothesis. This approach differs substantially from the previous algorithms for this problem [BBZ07, DKM09]. The algorithm is computationally efficient and uses d log(1/ε) active SQs with tolerance of Ω(1/√d) and filter tolerance of Ω(ε).

These results, combined with our generic simulation of active statistical algorithms in the presence of random classification noise (RCN), lead to the first known computationally efficient algorithms for actively learning halfspaces which are RCN-tolerant and give provable label savings over the passive case.
For the uniform distribution case this leads to an algorithm with sample complexity of O((1 − 2η)^{-2} · d^2 log(1/ε) log(d log(1/ε))), and for the general isotropic log-concave case we get sample complexity of poly(d, log(1/ε), 1/(1 − 2η)). This is worse than the sample complexity in the noiseless case, which is just O((d + log log(1/ε)) log(1/ε)) [BL13]. However, compared to passive learning in the presence of RCN, our algorithms have exponentially better dependence on ε and essentially the same dependence on d and 1/(1 − 2η). One issue with the generic simulation is that it requires knowledge of η (or an almost precise estimate). The standard approach to dealing with this issue does not always work in the active setting, and for our log-concave and uniform distribution algorithms we give a specialized argument that preserves the exponential improvement in the dependence on ε.

Differentially-private active learning: In many applications of machine learning, such as medical and financial record analysis, data is both sensitive and expensive to label. However, to the best of our knowledge, there are no formal results addressing both of these constraints. We address the problem by defining a natural model of differentially-private active learning. In our model we assume that a learner has full access to the unlabeled portion of some database of n examples S ⊆ X × Y which correspond to records of individual participants in the database. In addition, for every element of the database S the learner can request the label of that element. As usual, the goal is to minimize the number of label requests (such a setup is referred to as pool-based active learning [MN98]). In addition, we would like to preserve the differential privacy of the participants in the database, a now-standard notion of privacy introduced in [DMNS06].
Informally speaking, an algorithm is differentially private if adding any record to S (or removing a record from S) does not significantly affect the probability that any specific hypothesis will be output by the algorithm. As first shown by Blum et al. [BDMN05], SQ algorithms can be automatically translated into differentially-private algorithms by using the so-called Laplace mechanism (see also [KLN+11]). Using a similar approach, we show that active SQ learning algorithms can be automatically transformed into differentially-private active learning algorithms. As a consequence, for all the classes for which we provide statistical active learning algorithms that can be simulated using only poly(d, log(1/ε)) labeled examples (including thresholds and halfspaces), we can learn and preserve privacy with far fewer label requests than those required by even non-private classic passive learning algorithms, and can do so even when the privacy parameter of our model is very small. Note that while we focus on the number of label requests, the algorithms also preserve the differential privacy of the unlabeled points.

1.2 Additional Related Work

As we have mentioned, most prior theoretical work on active learning focuses on either sample complexity bounds (without regard for efficiency) or the noiseless case. For random classification noise in particular, [BH12] provides a sample complexity analysis based on the notion of splitting index that is optimal up to polylog factors and works for general concept classes and distributions, but it is not computationally efficient.
In addition, several works give active learning algorithms with empirical evidence of robustness to certain types of noise [BDL09, GSSS13].

In [CGZ10, DGS12] online learning algorithms in the selective sampling framework are presented, where labels must be actively queried before they are revealed. Under the assumption that the label conditional distribution is a linear function determined by a fixed target vector, they provide bounds on the regret of the algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. As pointed out in [DGS12], these results can also be converted to a distributional PAC setting where instances x_t are drawn i.i.d. In this setting they obtain exponential improvement in label complexity over passive learning. These interesting results and techniques are not directly comparable to ours. Our framework is not restricted to halfspaces. Another important difference is that (as pointed out in [GSSS13]) the exponential improvement they give is not possible in the noiseless version of their setting. In other words, the addition of linear noise defined by the target makes the problem easier for active sampling. By contrast, RCN can only make the classification task harder than in the realizable case.

Among the so-called disagreement-based algorithms that provably work under very general noise models (adversarial label noise) and for general concept classes [BBL06, Kol10, DHM07, BHLZ10, Wan11, RR11, BH12, Han], those of Dasgupta, Hsu, and Monteleoni [DHM07] and Beygelzimer, Hsu, Langford, and Zhang [BHLZ10] are most amenable to implementation.
While more amenable to implementation than other disagreement-based techniques, these algorithms assume the existence of a computationally efficient passive learning algorithm (for the concept class at hand) that can minimize the empirical error under adversarial label noise; however, such algorithms are not known to exist for most concept classes, including linear separators.

Following the original publication of our work, Awasthi et al. [ABL14] gave a polynomial-time active learning algorithm for learning linear separators in the presence of adversarial forms of noise. Their algorithm is the first one that can tolerate both adversarial label noise and malicious noise (where the adversary can corrupt both the instance part and the label part of the examples) as long as the rate of noise is η = O(ε). We note that these results are not comparable to ours, as we need the noise to be "uncorrelated" but can deal with noise of any rate (with complexity growing with 1/(1 − 2η)).

Organization: Our model, its properties and several illustrative examples (including threshold functions and balanced rectangles) are given in Section 2. Our algorithms for learning homogeneous halfspaces over log-concave and uniform distributions are given in Sections 3 and 4, respectively. The formal statement of the differentially-private simulation is given in Section 5.

2 Active Statistical Algorithms

Let X be a domain and P be a distribution over labeled examples on X. We represent such a distribution by a pair (D, ψ) where D is the marginal distribution of P on X and ψ : X → [−1, 1] is a function defined as ψ(z) = E_{(x,ℓ)∼P}[ℓ | x = z]. We will be considering learning in the PAC model (realizable case) where ψ is a boolean function, possibly corrupted by random noise.
When learning with respect to a distribution P = (D, ψ), an active statistical learner has access to active statistical queries. A query of this type is a pair of functions (χ, φ), where χ : X → [0, 1] is the filter function which, for a point x, specifies the probability with which the label of x should be queried. The function φ : X × {−1, 1} → [−1, 1] is the query function and depends on both the point and the label. The filter function χ defines the distribution D conditioned on χ as follows: for each x the density function D|χ(x) is defined as D|χ(x) = D(x)χ(x)/E_D[χ(x)]. Note that if χ is an indicator function of some set S then D|χ is exactly D conditioned on x being in S. Let P|χ denote the conditioned distribution (D|χ, ψ).

In addition, a query has two tolerance parameters: filter tolerance τ_0 and query tolerance τ. In response to such a query the algorithm obtains a value μ such that if E_D[χ(x)] ≥ τ_0 then |μ − E_{P|χ}[φ(x, ℓ)]| ≤ τ (and nothing is guaranteed when E_D[χ(x)] < τ_0).

An active statistical learning algorithm can also ask target-independent queries with tolerance τ, which are just queries over unlabeled samples. That is, for a query ϕ : X → [−1, 1] the algorithm obtains a value μ such that |μ − E_D[ϕ(x)]| ≤ τ. Such queries are not necessary when D is known to the learner.

For the purposes of obtaining noise-tolerant algorithms one can relax the requirements of the model and give the learning algorithm access to unlabeled samples. A similar variant of the model was considered in the context of the SQ model [Kea98, BFKV97]. We refer to this variant as label-statistical. Label-statistical algorithms do not need access to target-independent queries as they can simulate those using unlabeled samples.
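As a concrete illustration of the conditioned distribution, the following minimal sketch computes D|χ(x) = D(x)χ(x)/E_D[χ(x)] on a small discrete domain; the domain, D, and χ below are made-up examples, not from the paper.

```python
# Hypothetical discrete example of the conditioned density D|chi.
domain = [0, 1, 2, 3]
D = [0.4, 0.3, 0.2, 0.1]       # marginal distribution D over the domain
chi = [0.0, 0.5, 1.0, 1.0]     # filter: never query x=0, query x=1 half the time

E_chi = sum(d * c for d, c in zip(D, chi))        # E_D[chi(x)], here 0.45
D_chi = [d * c / E_chi for d, c in zip(D, chi)]   # D|chi(x) = D(x)*chi(x)/E_D[chi(x)]

print(E_chi)   # approximately 0.45
print(D_chi)   # a proper distribution: entries sum to 1
```

Since this χ is not an indicator, D|χ reweights rather than restricts D; with a 0/1-valued χ the same formula recovers exact conditioning on the set {x : χ(x) = 1}.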
Our definition generalizes the statistical query framework of Kearns [Kea98], which does not include the filtering function; in other words, a query is just a function φ : X × {−1, 1} → [−1, 1] and it has a single tolerance parameter τ. By definition, an active SQ (χ, φ) with tolerance τ relative to P is the same as a passive statistical query φ with tolerance τ relative to the distribution P|χ. In particular, a (passive) SQ is equivalent to an active SQ with filter χ ≡ 1 and filter tolerance 1.

Finally we note that from the definition of active SQ we can see that

E_{P|χ}[φ(x, ℓ)] = E_P[φ(x, ℓ) · χ(x)] / E_P[χ(x)].

This implies that an active statistical query can be estimated using two passive statistical queries. However, to estimate E_{P|χ}[φ(x, ℓ)] with tolerance τ one needs to estimate E_P[φ(x, ℓ) · χ(x)] with tolerance τ · E_P[χ(x)], which can be much lower than τ. The tolerance of a SQ directly corresponds to the number of examples needed to evaluate it, and therefore simulating active SQs passively might require many more examples.

2.1 Simulating Active Statistical Queries

In our model, the algorithm operates via statistical queries. In this section we describe how the answers to these queries can be simulated from random examples, which immediately implies that our algorithms can be transformed into active learning algorithms in the usual model [Das11].

We first note that a valid response to a target-independent query with tolerance τ can be obtained, with probability at least 1 − δ, using O(τ^{-2} log(1/δ)) unlabeled samples. A natural way of simulating an active SQ is by filtering points drawn randomly from D: draw a random point x, let B be drawn from a Bernoulli distribution with probability χ(x) of being 1; ask for the label of x when B = 1.
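This filtering procedure can be sketched as follows. This is an illustrative sketch only: the constant in the sample size t, the cap on unlabeled draws, the example distribution, and the function names are our own choices, and the confidence-boosting repetition used in the formal analysis is omitted.

```python
import math
import random

def simulate_active_sq(draw_example, chi, phi, tau, tau0, delta=0.01):
    """Estimate E_{P|chi}[phi(x, l)]: draw x ~ D, pass it through a
    Bernoulli(chi(x)) filter, and query labels only for surviving points."""
    t = int(math.ceil(2 * math.log(2 / delta) / tau ** 2))  # labeled examples
    budget = int(10 * t / tau0)   # cap on unlabeled draws (illustrative choice)
    total, kept = 0.0, 0
    for _ in range(budget):
        x, label = draw_example()         # the label is read only if the filter fires
        if random.random() < chi(x):      # B ~ Bernoulli(chi(x)); keep x when B = 1
            total += phi(x, label)
            kept += 1
            if kept == t:
                break
    return total / max(kept, 1)

# Made-up target: D uniform on [0, 1], labels sign(x - 1/2); the filter keeps
# the band |x - 1/2| < 0.2, where the conditional expectation of the label is 0
# by symmetry.
random.seed(0)

def draw():
    x = random.random()
    return x, (1 if x >= 0.5 else -1)

mu = simulate_active_sq(draw,
                        chi=lambda x: 1.0 if abs(x - 0.5) < 0.2 else 0.0,
                        phi=lambda x, l: l,
                        tau=0.1, tau0=0.3)
print(mu)   # close to 0 for this symmetric example
```

Note how the two tolerances surface in the sketch: the number of labeled examples scales with 1/τ^2, while the unlabeled budget additionally pays a 1/τ_0 factor for points rejected by the filter.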
The points for which we ask for a label are distributed according to D|χ. This implies that the empirical average of φ(x, ℓ) on O(τ^{-2} log(1/δ)) labeled examples will then give μ. Formally, we get the following theorem.

Theorem 2.1. Let P = (D, ψ) be a distribution over X × {−1, 1}. There exists an active sampling algorithm that, given functions χ : X → [0, 1], φ : X × {−1, 1} → [−1, 1], values τ_0 > 0, τ > 0, δ > 0, and access to samples from P, with probability at least 1 − δ, outputs a valid response to active statistical query (χ, φ) with tolerance parameters (τ_0, τ). The algorithm uses O(τ^{-2} log(1/δ)) labeled examples from P and O(τ_0^{-1} τ^{-2} log(1/δ)) unlabeled samples from D.

Proof. The Chernoff-Hoeffding bounds imply that for some t = O(τ^{-2} log(1/δ)), the empirical mean of φ on t examples that are drawn randomly from P|χ will, with probability at least 1 − δ/2, be within τ of E_{P|χ}[φ(x, ℓ)]. We can also assume that E_D[χ(x)] ≥ τ_0, since any value would be a valid response to the query when this assumption does not hold. By the standard multiplicative form of the Chernoff bound we also know that given t_0 = O(τ_0^{-1} t · log(1/δ)) = O(τ_0^{-1} τ^{-2} log(1/δ)^2) random samples from D, with probability at least 1 − δ/2, at least t of the samples will pass the filter χ. Therefore, with probability at least 1 − δ, we will obtain at least t samples from D filtered using χ(x), and labeled examples on these points will give an estimate of E_{P|χ}[φ(x, ℓ)] with tolerance τ.

This procedure gives log(1/δ)^2 dependence on the confidence (and not the claimed log(1/δ)). To get the claimed dependence we can use a standard confidence-boosting technique. We run the above procedure with δ′ = 1/3, k times, and let μ_1, μ_2, ..., μ_k denote the results.
The simulation returns the median of the μ_i's. The Chernoff bound implies that for k = O(log(1/δ)), with probability at least 1 − δ, at least half of the μ_i's satisfy the condition |μ_i − E_{P|χ}[φ(x, ℓ)]| ≤ τ. In particular, the median satisfies this condition. The dependence of the sample complexity on δ is now as claimed.

We remark that in some cases better sample complexity bounds can be obtained using multiplicative forms of the Chernoff-Hoeffding bounds (e.g. [AD98]).

A direct way to simulate all the queries of an active SQ algorithm is to estimate the response to each query using fresh samples and use the union bound to ensure that, with probability at least 1 − δ, all queries are answered correctly. Such direct simulation of an algorithm that uses at most q queries can be done using O(q τ^{-2} log(q/δ)) labeled examples and O(q τ_0^{-1} τ^{-2} log(q/δ)) unlabeled samples. However, in many cases a more careful analysis can be used to reduce the sample complexity of simulation. Labeled examples can be shared to simulate queries that use the same filter χ and do not depend on each other. This implies that the sample size sufficient for simulating q non-adaptive queries with the same filter scales logarithmically with q. More generally, given a set of q query functions (possibly chosen adaptively) which belong to some set Q of low complexity (such as VC dimension), one can reduce the sample complexity of estimating the answers to all q queries (with the same filter) by invoking the standard bounds based on uniform convergence (e.g. [BEHW89, Vap98]).

2.2 Noise tolerance

An important property of the simulation described in Theorem 2.1 is that it can be easily adapted to the case when the labels are corrupted by random classification noise [AL88].
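The idea behind the adaptation can be previewed numerically: under random classification noise of rate η the label-correlated part of a query shrinks by exactly (1 − 2η), so an estimate computed from noisy examples can be rescaled by 1/(1 − 2η). The distribution, target, and query below are illustrative choices of ours, not from the paper.

```python
import random

random.seed(1)
eta = 0.3          # noise rate, assumed known to the simulator
n = 200_000        # labeled examples used for the estimate

def draw_noisy():
    """x ~ uniform[0, 1], true label sign(x - 0.3), flipped with probability eta."""
    x = random.random()
    label = 1 if x >= 0.3 else -1
    if random.random() < eta:
        label = -label
    return x, label

# Query phi(x, l) = l with trivial filter chi == 1; the clean target value is
# E_P[l] = Pr[x >= 0.3] - Pr[x < 0.3] = 0.4, while the noisy expectation is
# (1 - 2*eta) * 0.4 = 0.16.
noisy_avg = sum(label for _, label in (draw_noisy() for _ in range(n))) / n
corrected = noisy_avg / (1 - 2 * eta)
print(corrected)   # close to the clean value 0.4
```

The rescaling also explains the cost of noise tolerance: hitting tolerance τ after division requires estimating the noisy quantity to tolerance (1 − 2η)τ, which inflates the sample complexity by a factor of (1 − 2η)^{-2}.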
For a distribution P = (D, ψ), let P^η denote the distribution P with the label flipped with probability η randomly and independently of the example. It is easy to see that P^η = (D, (1 − 2η)ψ). We now show that, as in the SQ model [Kea98], active statistical queries can be simulated given examples from P^η.

Theorem 2.2. Let P = (D, ψ) be a distribution over examples and let η ∈ [0, 1/2) be a noise rate. There exists an active sampling algorithm that, given functions χ : X → [0, 1], φ : X × {−1, 1} → [−1, 1], values η, τ_0 > 0, τ > 0, δ > 0, and access to samples from P^η, with probability at least 1 − δ, outputs a valid response to active statistical query (χ, φ) with tolerance parameters (τ_0, τ). The algorithm uses O(τ^{-2}(1 − 2η)^{-2} log(1/δ)) labeled examples from P^η and O(τ_0^{-1} τ^{-2}(1 − 2η)^{-2} log(1/δ)) unlabeled samples from D.

Proof. Using a simple observation from [BF02], we first decompose the statistical query φ into two parts: one that computes a correlation with the label and one that does not depend on the label altogether. Namely,

φ(x, ℓ) = φ(x, 1) · (1 + ℓ)/2 + φ(x, −1) · (1 − ℓ)/2 = [(φ(x, 1) − φ(x, −1))/2] · ℓ + [φ(x, 1) + φ(x, −1)]/2.   (1)

Clearly, to estimate the value of E_{P|χ}[φ(x, ℓ)] with tolerance τ it is sufficient to estimate the values of E_{P|χ}[(φ(x, 1) − φ(x, −1))/2 · ℓ] and E_{P|χ}[(φ(x, 1) + φ(x, −1))/2] with tolerance τ/2. The latter expression does not depend on the label and, in particular, is not affected by noise. Therefore it can be estimated as before using P^η in place of P. At the same time we can use the independence of the noise to conclude²

E_{P^η|χ}[(φ(x, 1) − φ(x, −1))/2 · ℓ] = (1 − 2η) E_{P|χ}[(φ(x, 1) − φ(x, −1))/2 · ℓ].
This means that we can estimate E_{P|χ}[(φ(x, 1) − φ(x, −1))/2 · ℓ] with tolerance τ/2 by estimating E_{P^η|χ}[(φ(x, 1) − φ(x, −1))/2 · ℓ] with tolerance (1 − 2η)τ/2 and then multiplying the result by 1/(1 − 2η). The estimation of E_{P^η|χ}[(φ(x, 1) − φ(x, −1))/2 · ℓ] with tolerance (1 − 2η)τ/2 can be done exactly as in Theorem 2.1.

Note that the sample complexity of the resulting active sampling algorithm has information-theoretically optimal quadratic dependence on 1/(1 − 2η), where η is the noise rate. Note also that RCN does not affect the unlabeled samples, so algorithms which are only label-statistical can also be simulated in the presence of RCN.

Remark 2.3. This simulation assumes that η is given to the algorithm exactly. It is easy to see from the proof that any value η′ such that (1 − 2η)/(1 − 2η′) ∈ [1 − τ/4, 1 + τ/4] can be used in place of η (with the tolerance of estimating E_{P^η|χ}[(φ(x, 1) − φ(x, −1))/2 · ℓ] set to (1 − 2η)τ/4). In some learning scenarios even an approximate value of η is not known, but it is known that η ≤ η_0 < 1/2. To address this issue one can construct a sequence η_1, ..., η_k of guesses of η, run the learning algorithm with each of those guesses in place of the true η, and let h_1, ..., h_k be the resulting hypotheses [Kea98]. One can then return the hypothesis h_i among those that has the best agreement with a suitably large sample. It is not hard to see that k = O(τ^{-1} · log(1/(1 − 2η_0))) guesses will suffice for this strategy to work [AD98]. Passive hypothesis testing requires Ω(1/ε) labeled examples and might be too expensive to be used with active learning algorithms.
It is unclear if there exists a general approach for dealing with unknown η in the active learning setting that does not substantially increase the labeled example complexity. However, as we will demonstrate, in the context of specific active learning algorithms variants of this approach can be used to solve the problem.

We now show that more general types of noise can be tolerated as long as they are "uncorrelated" with the queries and the target function. Namely, we represent label noise using a function Λ : X → [0, 1], where Λ(x) gives the probability that the label of x is flipped. The rate of Λ when learning with respect to marginal distribution D over X is E_D[Λ(x)]. For a distribution P = (D, ψ) over examples, we denote by P^Λ the distribution P corrupted by label noise Λ. It is easy to see that P^Λ = (D, ψ · (1 − 2Λ)). Intuitively, Λ is "uncorrelated" with a query if the way that Λ deviates from its rate is almost orthogonal to the query on the target distribution.

² For any function f(x) that does not depend on the label, we have E_{P^η|χ}[f(x) · ℓ] = (1 − η) E_{P|χ}[f(x) · ℓ] + η · E_{P|χ}[f(x) · (−ℓ)] = (1 − 2η) E_{P|χ}[f(x) · ℓ]. The first equality follows from the fact that under P^η|χ, for any given x, there is a (1 − η) chance that the label is the same as under P|χ, and an η chance that the label is the negation of the label obtained from P|χ.

Definition 2.4. Let P = (D, ψ) be a distribution over examples and τ′ > 0. For functions χ : X → [0, 1], φ : X × {−1, 1} → [−1, 1], we say that a noise function Λ : X → [0, 1] is (η, τ′)-uncorrelated with φ and χ over P if

| E_{D|χ}[ (φ(x, 1) − φ(x, −1))/2 · ψ(x) · (1 − 2(Λ(x) − η)) ] | ≤ τ′.
In this definition $1-2\Lambda(x)$ is the expectation of a $\{-1,1\}$ coin that is flipped with probability $\Lambda(x)$, so $(1-2\Lambda(x)) - (1-2\eta) = -2(\Lambda(x)-\eta)$ measures how the noise deviates from its rate, whereas $(\phi(x,1)-\phi(x,-1))\psi(x)$ is the part of the query which measures the correlation with the label. We now give an analogue of Theorem 2.2 for this more general setting.

Theorem 2.5. Let $P = (D,\psi)$ be a distribution over examples, let $\chi : X \to [0,1]$ and $\phi : X \times \{-1,1\} \to [-1,1]$ be a filter function and a query function, $\eta \in [0,1/2)$, $\tau > 0$, and let $\Lambda$ be a noise function that is $(\eta, (1-2\eta)\tau/4)$-uncorrelated with $\phi$ and $\chi$ over $P$. There exists an active sampling algorithm that, given functions $\chi$ and $\phi$, values $\eta$, $\tau_0 > 0$, $\tau > 0$, $\delta > 0$, and access to samples from $P^\Lambda$, with probability at least $1-\delta$, outputs a valid response to the active statistical query $(\chi,\phi)$ with tolerance parameters $(\tau_0,\tau)$. The algorithm uses $O(\tau^{-2}(1-2\eta)^{-2}\log(1/\delta))$ labeled examples from $P^\Lambda$ and $O(\tau_0^{-1}\tau^{-2}(1-2\eta)^{-2}\log(1/\delta))$ unlabeled samples from $D$.

Proof. As in the proof of Theorem 2.2, we note that it is sufficient to estimate the value
$$\lambda = E_{P|\chi}\left[\frac{1}{2}(\phi(x,1)-\phi(x,-1))\cdot\ell\right] = E_{D|\chi}\left[\frac{\phi(x,1)-\phi(x,-1)}{2}\,\psi(x)\right]$$
within tolerance $\tau/2$ (since $E_{P|\chi}[\frac{1}{2}(\phi(x,1)+\phi(x,-1))]$ does not depend on the label and can be estimated as before). Now
$$E_{P^\Lambda|\chi}\left[\frac{\phi(x,1)-\phi(x,-1)}{2}\cdot\ell\right] = E_{D|\chi}\left[\frac{\phi(x,1)-\phi(x,-1)}{2}\,\psi(x)\,(1-2\Lambda(x))\right]$$
$$= (1-2\eta)\,E_{D|\chi}\left[\frac{\phi(x,1)-\phi(x,-1)}{2}\,\psi(x)\right] + E_{D|\chi}\left[\frac{\phi(x,1)-\phi(x,-1)}{2}\,\psi(x)\big((1-2\Lambda(x))-(1-2\eta)\big)\right]$$
$$= (1-2\eta)\lambda + \tau',$$
where $|\tau'| \leq (1-2\eta)\tau/4$, since $\Lambda$ is $(\eta,(1-2\eta)\tau/4)$-uncorrelated with $\phi$ and $\chi$ over $P$.
This means that we can estimate $E_{P|\chi}[\frac{1}{2}(\phi(x,1)-\phi(x,-1))\cdot\ell]$ with tolerance $\tau/2$ by estimating $E_{P^\Lambda|\chi}[\frac{1}{2}(\phi(x,1)-\phi(x,-1))\cdot\ell]$ with tolerance $(1-2\eta)\tau/4$ and then multiplying the result by $1/(1-2\eta)$. The estimation of $E_{P^\Lambda|\chi}[\frac{1}{2}(\phi(x,1)-\phi(x,-1))\cdot\ell]$ with tolerance $(1-2\eta)\tau/4$ can be done exactly as in Theorem 2.1.

An immediate implication of Theorem 2.5 is that one can simulate an active SQ algorithm $A$ using examples corrupted by noise $\Lambda$ as long as $\Lambda$ is $(\eta,(1-2\eta)\tau/4)$-uncorrelated with all of $A$'s queries of tolerance $\tau$ for some fixed $\eta$. Clearly, random classification noise of rate $\eta$ has noise function $\Lambda(x) = \eta$ for all $x \in X$. It is therefore $(\eta,0)$-uncorrelated with any query over any distribution. Another simple type of noise that is uncorrelated with most queries over most distributions is one where the noise function is chosen randomly: for every point $x$ the noise rate $\Lambda(x)$ is chosen randomly and independently from some distribution with expectation $\eta$ (not necessarily the same distribution for all points). For any fixed query and target distribution, the expected correlation is 0. If the probability mass of every single point of the domain is small enough compared to (the inverse of the logarithm of) the size of the space of queries and target distributions, then standard concentration inequalities imply that the correlation will be small with high probability.

We would like to note that the noise models considered here are not directly comparable to the well-studied Tsybakov and Massart noise conditions [BBL05]. However, it appears that from a computational point of view our noise model is significantly more benign than these conditions, as they do not impose any structure on the noise and only limit its rate.
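The concentration argument above can be illustrated with a small simulation; this is a hedged sketch (the uniform choice of per-point noise rates and the particular query are arbitrary illustrative assumptions, not from the paper):

```python
import random

random.seed(1)
eta, n = 0.2, 100000

# Hypothetical setup: per-point noise rates Lambda(x) drawn independently
# with expectation eta (uniform in [eta - 0.1, eta + 0.1] is an arbitrary
# choice), and a fixed bounded query part q(x) in [-1, 1].
points = [random.random() for _ in range(n)]
lam = [eta + random.uniform(-0.1, 0.1) for _ in range(n)]
q = lambda x: 2 * x - 1

# Empirical correlation between the query and the deviation of the noise
# from its rate; for independently drawn rates it concentrates around 0,
# which is the "(eta, tau')-uncorrelated" condition with small tau'.
corr = sum(q(points[i]) * 2 * (lam[i] - eta) for i in range(n)) / n
```

Since each term has mean 0 and is bounded, standard concentration gives $|corr| = O(1/\sqrt{n})$ with high probability.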
2.3 Simple examples

Thresholds: We show that the classic example of actively learning a threshold function on an interval can be easily expressed using active SQs. For simplicity and without loss of generality we can assume that the interval is $[0,1]$ and the distribution is uniform over it. Assume that we know that the threshold $\theta$ belongs to the interval $[a,b] \subseteq [0,1]$. We ask a query $\phi(x,\ell) = (\ell+1)/2$ with filter $\chi(x)$ which is the indicator function of the interval $[a,b]$, with tolerance $1/4$ and filter tolerance $b-a$. Let $v$ be the response to the query. By definition, $E[\chi(x)] = b-a$ and therefore we have $|v - E[\phi(x,\ell) \mid x \in [a,b]]| \leq 1/4$. Note that $E[\phi(x,\ell) \mid x \in [a,b]] = (b-\theta)/(b-a)$. We can therefore conclude that $(b-\theta)/(b-a) \in [v-1/4,\, v+1/4]$, which means that $\theta \in [b-(v+1/4)(b-a),\, b-(v-1/4)(b-a)] \cap [a,b]$. Note that the length of this interval is at most $(b-a)/2$. This means that after at most $\log_2(1/\epsilon)+1$ iterations we will reach an interval $[a,b]$ of length at most $\epsilon$. In each iteration only constant tolerance $1/4$ is necessary, and the filter tolerance is never below $\epsilon$. A direct simulation of this algorithm can be done using $\log(1/\epsilon)\cdot\log(\log(1/\epsilon)/\delta)$ labeled examples and $\tilde{O}(1/\epsilon)\cdot\log(1/\delta)$ unlabeled samples.

Axis-aligned rectangles: Next we show that learning of thresholds can be used to obtain a simple algorithm for learning axis-aligned rectangles whose weight under the target distribution is not too small. Namely, we assume that the target function satisfies $E_D[f(x)] \geq \beta$. In the one-dimensional case, we just need to learn an interval. After scaling the distribution to be uniform on $[0,1]$ we know that the target interval $[\theta_1,\theta_2]$ has length at least $\beta$. We first need to find a point inside that interval.
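The interval-halving step for thresholds can be sketched as follows; this is a hedged illustration, with an exact (zero-error) oracle standing in for the active SQ response — any response within tolerance $1/4$ would work equally well:

```python
def learn_threshold(active_sq, eps):
    """Actively learn a threshold on [0, 1] under the uniform marginal.

    active_sq(a, b) must return, within additive tolerance 1/4, the fraction
    of positive labels conditioned on x in [a, b]; for a threshold theta with
    label +1 above theta this conditional expectation is (b - theta)/(b - a).
    """
    a, b = 0.0, 1.0
    while b - a > eps:
        v = active_sq(a, b)
        # theta lies in [b - (v + 1/4)(b - a), b - (v - 1/4)(b - a)],
        # an interval of length at most (b - a)/2.
        a, b = (max(a, b - (v + 0.25) * (b - a)),
                min(b, b - (v - 0.25) * (b - a)))
    return (a + b) / 2

# Exact oracle for a hypothetical threshold at 0.42.
theta = 0.42
oracle = lambda a, b: (b - min(max(theta, a), b)) / (b - a)
est = learn_threshold(oracle, 1e-6)
```

Each loop iteration uses one active SQ of constant tolerance, matching the $\log_2(1/\epsilon)+1$ query bound above.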
To do this we consider the $2/\beta$ intervals $[(i-1)\beta/2,\, i\beta/2]$ for $1 \leq i \leq 2/\beta$. At least one of these intervals is fully included in $[\theta_1,\theta_2]$. Hence, using an active statistical query with query function $\phi(x,\ell) = (\ell+1)/2$ conditioned on being in interval $[(i-1)\beta/2,\, i\beta/2]$ for each $1 \leq i \leq 2/\beta$, each with tolerance $1/4$, we are guaranteed to find an interval for which the answer is at least $3/4$. The midpoint of any interval for which the answer to the query is at least $3/4$ must be inside the target interval. Let this midpoint be $a$. We can now use two binary searches with accuracy $\epsilon/2$ to find the lower and upper endpoints of the target interval in the intervals $[0,a]$ and $[a,1]$, respectively. This will require $2/\beta + \log_2(2/\epsilon)$ active SQs of tolerance $1/4$. As usual, the $d$-dimensional axis-aligned rectangles can be reduced to $d$ interval learning problems with error $\epsilon/d$ [KV94]. This gives an active statistical algorithm using $2d/\beta + \log_2(2d/\epsilon)$ active SQs of tolerance $1/4$ and filter tolerance $\geq \min\{\beta/2,\, \epsilon/2\}$.

[Footnote: As usual, we can bring the distribution close enough to this form using unlabeled samples or $O(b/\epsilon)$ target-independent queries, where $b$ is the number of bits needed to represent our examples.]

A²: We now note that the general and well-studied A² algorithm of [BBL06] falls naturally into our framework. At a high level, the A² algorithm is an iterative, disagreement-based active learning algorithm. It maintains a set of surviving classifiers $C_i \subseteq C$, and in each round the algorithm asks for the labels of a few random points that fall in the current region of disagreement of the surviving classifiers.
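The point-finding step above can be sketched as follows (a hedged illustration, not the paper's code; the target interval and the exact oracle for the conditional positive fraction are assumptions made for the example):

```python
def find_point_inside(frac_positive, beta):
    """Find a point inside a target interval [t1, t2] of length >= beta,
    under the uniform distribution on [0, 1].

    frac_positive(a, b) returns, within tolerance 1/4, the probability that
    the label is +1 conditioned on x in [a, b]. Some window of width beta/2
    lies fully inside [t1, t2], so its (noisy) answer is at least 3/4; and
    any window answering >= 3/4 has true positive fraction >= 1/2, which
    forces its midpoint to lie inside the target interval.
    """
    k = int(2 / beta)
    for i in range(1, k + 1):
        a, b = (i - 1) * beta / 2, i * beta / 2
        if frac_positive(a, b) >= 0.75:
            return (a + b) / 2
    return None  # unreachable when the target interval has length >= beta

# Exact oracle for a hypothetical target interval [0.3, 0.62].
t1, t2 = 0.3, 0.62
overlap = lambda a, b: max(0.0, min(b, t2) - max(a, t1)) / (b - a)
p = find_point_inside(overlap, 0.25)
```

The returned midpoint can then seed the two binary searches for the endpoints described above.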
Formally, the region of disagreement $\mathrm{DIS}(C_i)$ of a set of classifiers $C_i$ is the set of instances $x$ such that there exist two classifiers $f, g \in C_i$ that disagree about the label of $x$. Based on the queried labels, the algorithm then eliminates hypotheses that were still under consideration, but only if it is statistically confident (given the labels queried in the last round) that they are suboptimal. In essence, in each round A² only needs to estimate the error rates (of hypotheses still under consideration) under the conditional distribution of being in the region of disagreement. The key point is that this can easily be done via active statistical queries. Note that while the number of active statistical queries needed to do this could be large, the number of labeled examples needed to simulate these queries is essentially the same as the number of labeled examples needed by the known A² analyses [Han07, Han]. While in general the required computation of the disagreement region and manipulations of the hypothesis space cannot be done efficiently, efficient implementation is possible in a number of simple cases, such as when the VC dimension of the concept class is a constant. It is not hard to see that in these cases the implementation can also be done using a statistical algorithm.

3 Learning halfspaces with respect to log-concave distributions

In this section we present a reduction from active learning to passive learning of homogeneous linear separators under log-concave distributions. Combining it with the SQ algorithm for learning halfspaces in the passive learning setting due to Dunagan and Vempala [DV04], we obtain the first efficient noise-tolerant active learning algorithm for homogeneous halfspaces over any isotropic log-concave distribution.
Our reduction proceeds in rounds; in round $t$ we build a better approximation $w_t$ to the target function by using the passive SQ learning algorithm [DV04] over a distribution $D_t$ that is a mixture of distributions, in which each component is the original distribution conditioned on being within a certain distance of the hyperplane defined by a previous approximation $w_i$. To perform passive statistical queries relative to $D_t$ we use active SQs with a corresponding real-valued filter. Our analysis builds on the analysis of the margin-based algorithms due to [BBZ07, BL13]. However, note that in the standard margin-based analysis only points close to the current hypothesis $w_t$ are queried in round $t$. As a result, the analysis of our algorithm is somewhat different from that in earlier work [BBZ07, BL13].

3.1 Preliminaries

For a unit vector $v \in \mathbb{R}^d$ we denote by $h_v(x)$ the function defined by the homogeneous hyperplane orthogonal to $v$, that is, $h_v(x) = \mathrm{sign}(\langle v, x\rangle)$. Let $H_d$ denote the concept class of all homogeneous halfspaces.

Definition 3.1. A distribution over $\mathbb{R}^d$ is log-concave if $\log f(\cdot)$ is concave, where $f$ is its associated density function. It is isotropic if its mean is the origin and its covariance matrix is the identity.

Log-concave distributions form a broad class of distributions: for example, the Gaussian, Logistic, and Exponential distributions and the uniform distribution over any convex set are log-concave.

Next, we state several simple properties of log-concave densities from [LV07].

Lemma 3.2. There exists a constant $c_m$ such that for any isotropic log-concave distribution $D$ on $\mathbb{R}^d$, every unit vector $v$ and $a \in [0,1]$, $c_m a \leq \Pr_D[x \cdot v \in [-a,a]] \leq 2a$.

Lemma 3.3.
There exists a constant $c$ such that for any isotropic log-concave $D$ on $\mathbb{R}^d$ and any two unit vectors $u$ and $v$ in $\mathbb{R}^d$ we have $c\,\theta(u,v) \leq \Pr_D[h_u(x) \neq h_v(x)]$, where $\theta(u,v)$ denotes the angle between $u$ and $v$.

For our applications the key property of log-concave densities, proved in [BL13], is given in the following lemma.

Lemma 3.4. For any constant $c_1 > 0$, there exists a constant $c_2 > 0$ such that the following holds. Let $u$ and $v$ be two unit vectors in $\mathbb{R}^d$, and assume that $\theta(u,v) = \alpha < \pi/2$. Assume that $D$ is isotropic log-concave in $\mathbb{R}^d$. Then
$$\Pr_D[h_u(x) \neq h_v(x) \text{ and } |v \cdot x| \geq c_2\alpha] \leq c_1\alpha. \quad (2)$$

We now state the passive SQ algorithm for learning halfspaces which will be the basis of our active SQ algorithm.

Theorem 3.5. There exists a SQ algorithm LearnHS that learns $H_d$ to accuracy $1-\epsilon$ over any distribution $D|_\chi$, where $D$ is an isotropic log-concave distribution and $\chi : \mathbb{R}^d \to [0,1]$ is a filter function. Further, LearnHS outputs a homogeneous halfspace, runs in time polynomial in $d$, $1/\epsilon$ and $\log(1/\lambda)$, and uses SQs of tolerance $\geq 1/\mathrm{poly}(d, 1/\epsilon, \log(1/\lambda))$, where $\lambda = E_D[\chi(x)]$.

We use the Dunagan–Vempala algorithm for learning halfspaces to prove this theorem [DV04]. The bounds on the complexity of the algorithm follow easily from the properties of log-concave distributions. Further details of the analysis and related discussion appear in Appendix A.

3.2 Active learning algorithm

Theorem 3.6. There exists an active SQ algorithm ActiveLearnHS-LogC (Algorithm 1) that, for any isotropic log-concave distribution $D$ on $\mathbb{R}^d$, learns $H_d$ over $D$ to accuracy $1-\epsilon$ in time $\mathrm{poly}(d, \log(1/\epsilon))$, using active SQs of tolerance $\geq 1/\mathrm{poly}(d, \log(1/\epsilon))$ and filter tolerance $\Omega(\epsilon)$.

Proof. Let $c$ be the constant given by Lemma 3.3 and let $C_1$ be the constant $c_2$ given by Lemma 3.4 when $c_1 = c/16$.
Let $C_2 = c/(8C_1)$ and $C_3 = c_m \cdot C_2 \cdot c$. For every $k \leq s = \lceil \log_2(1/(c\epsilon)) \rceil$ define $b_k = C_1/2^k$. Let $h_w$ denote the target halfspace, and for any unit vector $v$ and distribution $D'$ define $\mathrm{err}_{D'}(v) = \Pr_{D'}[h_w(x) \neq h_v(x)]$. We define $w_0, w_1, \ldots, w_s$ via the iterative process described in Algorithm 1.

Algorithm 1 ActiveLearnHS-LogC: Active SQ learning of homogeneous halfspaces over isotropic log-concave densities
1: %% Constants $c$, $C_1$, $C_2$ and $C_3$ are determined by the analysis.
2: Run LearnHS with error $C_2$ to obtain $w_0$.
3: for $k = 1$ to $s = \lceil \log_2(1/(c\epsilon)) \rceil$ do
4:   Let $b_{k-1} = C_1/2^{k-1}$
5:   Let $\mu_k$ equal the indicator function of being within margin $b_{k-1}$ of $w_{k-1}$
6:   Let $\chi_k = (\sum_{i \leq k} \mu_i)/k$
7:   Run LearnHS over $D_k = D|_{\chi_k}$ with error $C_2/k$, using active queries with filter $\chi_k$ and filter tolerance $C_3\epsilon$, to obtain $w_k$
8: end for
9: return $w_s$

Note that active SQs are used to allow us to execute LearnHS on $D_k$. That is, a SQ $\phi$ of tolerance $\tau$ asked by LearnHS (relative to $D_k$) is replaced with an active SQ $(\chi_k, \phi)$ of tolerance $(C_3\epsilon, \tau)$. The response to the active SQ is a valid response to the query of LearnHS as long as $E_D[\chi_k] \geq C_3\epsilon$; we will prove later that this condition indeed holds.

We now prove by induction on $k$ that after $k \leq s$ iterations, every $\hat{w}$ such that $\mathrm{err}_{D|\mu_i}(\hat{w}) \leq C_2$ for all $i \leq k$ satisfies $\mathrm{err}_D(\hat{w}) \leq c/2^k$; in addition, $w_k$ satisfies this condition. The case $k=0$ follows from the properties of LearnHS (without loss of generality $C_2 \leq c$). Assume now that the claim is true for $k-1$ ($k \geq 1$). Let $S_k^1 = \{x : |w_{k-1} \cdot x| \leq b_{k-1}\}$ and $S_k^2 = \{x : |w_{k-1} \cdot x| > b_{k-1}\}$. Note that $\mu_k$ is defined to be the indicator function of $S_k^1$. By the inductive hypothesis we know that $\mathrm{err}_D(w_{k-1}) \leq c/2^{k-1}$.
Consider an arbitrary separator $\hat{w}$ that satisfies $\mathrm{err}_{D|\mu_i}(\hat{w}) \leq C_2$ for all $i \leq k$. By the inductive hypothesis, we know that $\mathrm{err}_D(\hat{w}) \leq c/2^{k-1}$. By Lemma 3.3 we have $\theta(\hat{w}, w) \leq 2^{-k+1}$ and $\theta(w_{k-1}, w) \leq 2^{-k+1}$. This implies $\theta(w_{k-1}, \hat{w}) \leq 2^{-k+2}$. By our choice of $C_1$ and Lemma 3.4, we obtain:
$$\Pr_D[\mathrm{sign}(w_{k-1} \cdot x) \neq \mathrm{sign}(\hat{w} \cdot x),\ x \in S_k^2] \leq c\,2^{-k}/4$$
$$\Pr_D[\mathrm{sign}(w_{k-1} \cdot x) \neq \mathrm{sign}(w \cdot x),\ x \in S_k^2] \leq c\,2^{-k}/4.$$
Therefore,
$$\Pr_D[\mathrm{sign}(\hat{w} \cdot x) \neq \mathrm{sign}(w \cdot x),\ x \in S_k^2] \leq c\,2^{-k}/2. \quad (3)$$
By the inductive hypothesis, we also have
$$\mathrm{err}_{D|\mu_k}(\hat{w}) = \Pr_D[\mathrm{sign}(\hat{w} \cdot x) \neq \mathrm{sign}(w \cdot x) \mid x \in S_k^1] \leq C_2.$$
The set $S_k^1$ consists of points $x$ such that $x \cdot w_{k-1}$ falls into the interval $[-b_{k-1}, b_{k-1}]$. By Lemma 3.2, this implies that $\Pr_D[x \in S_k^1] \leq 2b_{k-1}$ and therefore
$$\Pr_D[\mathrm{sign}(\hat{w} \cdot x) \neq \mathrm{sign}(w \cdot x),\ x \in S_k^1] = \Pr_D[\mathrm{sign}(\hat{w} \cdot x) \neq \mathrm{sign}(w \cdot x) \mid x \in S_k^1] \cdot \Pr_D[x \in S_k^1] \leq 2C_2 \cdot b_{k-1} = c\,2^{-k}/2. \quad (4)$$
Now, by combining eq. (3) and eq. (4), we get that $\mathrm{err}_D(\hat{w}) \leq c/2^k$, as needed to establish the first part of the inductive hypothesis. By the properties of LearnHS, $\mathrm{err}_{D_k}(w_k) \leq C_2/k$. By the definition of $\chi_k$,
$$\mathrm{err}_{D_k}(w_k) = \frac{1}{k}\sum_{i \leq k} \mathrm{err}_{D|\mu_i}(w_k).$$
This implies that for every $i \leq k$, $\mathrm{err}_{D|\mu_i}(w_k) \leq C_2$, establishing the second part of the inductive hypothesis.

The inductive hypothesis immediately implies that $\mathrm{err}_D(w_s) \leq \epsilon$. Therefore, to finish the proof we only need to establish the bounds on the running time and query complexity of the algorithm. To establish the lower bound on the filter tolerance we observe that, by Lemma 3.2, for every $k \leq s$,
$$E_D[\mu_k] \geq c_m \cdot b_{k-1} \geq c_m \cdot C_2 \cdot c \cdot \epsilon = C_3\epsilon.$$
This implies that for every $k \leq s$,
$$E_D[\chi_k] = \frac{1}{k}\sum_{i \leq k} E_D[\mu_i] = \Omega(\epsilon).$$
Each execution of LearnHS is with error $C_2/k = \Omega(1/\log(1/\epsilon))$, and there are at most $O(\log(1/\epsilon))$ such executions. Now, by Theorem A, this implies that the total running time, the number of queries, and the inverse of the query tolerance are upper bounded by a polynomial in $d$ and $\log(1/\epsilon)$.

We remark that, as usual, we can first bring the distribution to an isotropic position by using target-independent queries to estimate the mean and the covariance matrix of the distribution [LV07]. Therefore our algorithm can be used to learn halfspaces over general log-concave densities as long as the target halfspace passes through the mean of the density.

We can now apply Theorem 2.2 (or, more generally, Theorem 2.5) to obtain an efficient active learning algorithm for homogeneous halfspaces over log-concave densities in the presence of random classification noise of known rate. Further, since our algorithm relies on LearnHS, which can also be simulated when the noise rate is unknown (see Remark 2.3), we obtain an active algorithm which does not require knowledge of the noise rate.

Corollary 3.7. There exists a polynomial-time active learning algorithm that, for any $\eta \in [0,1/2)$, learns $H_d$ over any log-concave distribution with random classification noise of rate $\eta$ to error $\epsilon$ using $\mathrm{poly}(d, \log(1/\epsilon), 1/(1-2\eta))$ labeled examples and a polynomial number of unlabeled samples.

4 Learning halfspaces over the uniform distribution

The algorithm presented in Section 3 relies on the relatively involved and computationally costly algorithm of Dunagan and Vempala [DV04] for learning halfspaces over general distributions. Similarly, other active learning algorithms for halfspaces often rely on computationally costly linear programming [BBZ07, BL13].
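The filter construction of Algorithm 1 can be sketched as a control-flow skeleton; this is a hedged sketch in which LearnHS is treated as a black box and the constant values are placeholders, not the analysis-determined constants:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def active_learn_hs_logc(learn_hs, eps, C1=0.5, C2=0.25, c=0.1):
    """Skeleton of the margin-filter construction in Algorithm 1.

    learn_hs(filter_fn, error) stands in for the passive SQ learner LearnHS
    run over the distribution fractionally conditioned by filter_fn.
    """
    w = learn_hs(lambda x: 1.0, C2)  # w_0: learn with the trivial filter
    mus = []
    s = math.ceil(math.log2(1 / (c * eps)))
    for k in range(1, s + 1):
        b = C1 / 2 ** (k - 1)
        w_prev = w
        # mu_k: indicator of being within margin b_{k-1} of w_{k-1}
        mus.append(lambda x, w_prev=w_prev, b=b:
                   1.0 if abs(dot(w_prev, x)) <= b else 0.0)
        # chi_k: average of the margin indicators accumulated so far
        chi = lambda x, ms=tuple(mus): sum(m(x) for m in ms) / len(ms)
        w = learn_hs(chi, C2 / k)  # run LearnHS over D | chi_k
    return w

# Stub learner that records the requested errors; it exercises the control
# flow without an actual implementation of LearnHS.
calls = []
def stub_learn_hs(chi, err):
    calls.append(err)
    return [1.0, 0.0]

w_final = active_learn_hs_logc(stub_learn_hs, 0.01)
```

Note how the error parameter passed to the learner shrinks as $C_2/k$ while the filter $\chi_k$ averages in one new margin indicator per round, exactly as in lines 6–7 of Algorithm 1.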
For the special case of the uniform distribution on the unit sphere we now give a substantially simpler and more efficient algorithm in terms of both sample and computational complexity. This setting was studied in [BBZ07, DKM09]. We remark that the uniform distribution over the unit sphere is not log-concave and therefore, in general, an algorithm for the isotropic log-concave case might not imply an algorithm for the uniform distribution over the unit sphere. However, a more careful look at the known active algorithms for the isotropic log-concave case [BBZ07, BL13] and at the algorithms in this work shows that minimization of error is performed over homogeneous halfspaces. For any homogeneous halfspace $h_v$, any $x \in \mathbb{R}^d$ and $\alpha > 0$, $h_v(x) = h_v(\alpha x)$. This implies that, for algorithms optimizing the error over homogeneous halfspaces, any two spherically symmetric distributions are equivalent. In particular, the uniform distribution over the sphere is equivalent to the uniform distribution over the unit ball, an isotropic and log-concave distribution.

For a dimension $d$, let $X = S^{d-1}$, the unit sphere in $d$ dimensions. Let $U_d$ denote the uniform distribution over $S^{d-1}$. Unless specified otherwise, in this section all probabilities and expectations are relative to $U_d$. We would also like to mention explicitly the following trivial lemma relating the accuracy of an estimate of $f(\alpha)$ to the accuracy of an estimate of $\alpha$.

Lemma 4.1. Let $f : \mathbb{R} \to \mathbb{R}$ be a differentiable function and let $\alpha$ and $\tilde{\alpha}$ be any values in some interval $[a,b]$. Then $|f(\alpha) - f(\tilde{\alpha})| \leq |\alpha - \tilde{\alpha}| \cdot \sup_{\beta \in [a,b]} |f'(\beta)|$.

The lemma follows directly from the mean value theorem.
Also note that, given an estimate $\tilde{\alpha}$ for $\alpha \in [a,b]$, we can always assume that $\tilde{\alpha} \in [a,b]$, since otherwise $\tilde{\alpha}$ can be replaced with the closest point in $[a,b]$, which will be at least as close to $\alpha$ as $\tilde{\alpha}$.

We start with an outline of a non-active and simpler version of the algorithm that demonstrates one of the ideas of the active SQ algorithm. To the best of our knowledge, the algorithm we present is also the simplest and most efficient (passive) SQ algorithm for the problem. A less efficient algorithm is given in [KVV10].

4.1 Learning using (passive) SQs

Let $w$ denote the normal vector of the target hyperplane and let $v$ be any unit vector. Instead of arguing about the disagreement between $h_w$ and $h_v$ directly, we will use the (Euclidean) distance between $v$ and $w$ as a proxy for disagreement. It is easy to see that, up to a small constant factor, this distance behaves like the disagreement.

Lemma 4.2. For any unit vectors $v$ and $w$:
1. The error is upper bounded by half the distance: $\Pr[h_v(x) \neq h_w(x)] \leq \|w-v\|/2$;
2. To estimate the distance it is sufficient to estimate the error: for every value $\alpha \in [0,1]$, $\big|\|w-v\| - 2\sin(\pi\alpha/2)\big| \leq \pi\,|\Pr[h_v(x) \neq h_w(x)] - \alpha|$.

Proof. The angle between $v$ and $w$ equals $\gamma = \pi\Pr[h_v(x) \neq h_w(x)]$. Hence $\|w-v\| = 2\sin(\gamma/2) = 2\sin(\pi\Pr[h_v(x) \neq h_w(x)]/2)$ and $\Pr[h_v(x) \neq h_w(x)] = 2\arcsin(\|w-v\|/2)/\pi$. The first claim follows by observing that $\frac{2\arcsin(x/2)}{\pi x}$ is a monotone function on $[0,2]$ that equals $1/2$ at $x = 2$. The derivative of $2\sin(\pi x/2)$ is at most $\pi$ in absolute value, and therefore the second claim follows from Lemma 4.1.

The main idea of our algorithm is as follows. Given a current hypothesis represented by its normal vector $v$, we estimate the distance from the target vector $w$ to $v$ perturbed in the direction of each of the $d$ basis vectors.
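The two identities underlying Lemma 4.2 can be checked numerically; this is an illustrative sanity check (the random Gaussian vectors and dimension are arbitrary choices), using the facts that for unit vectors $\|w-v\| = 2\sin(\theta/2)$ and the disagreement over a spherically symmetric distribution is $\theta/\pi$:

```python
import math
import random

random.seed(2)

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def angle(u, v):
    return math.acos(max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v)))))

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# For unit vectors, ||w - v|| = 2 sin(theta/2) where theta is the angle, and
# Pr[h_v != h_w] = theta/pi over U_d; so claim 1 of Lemma 4.2 amounts to
# theta/pi <= sin(theta/2) on [0, pi].
for _ in range(100):
    v = unit([random.gauss(0, 1) for _ in range(5)])
    w = unit([random.gauss(0, 1) for _ in range(5)])
    th = angle(v, w)
    assert abs(dist(v, w) - 2 * math.sin(th / 2)) < 1e-9
    assert th / math.pi <= dist(v, w) / 2 + 1e-12
```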
By combining the distance measurements in these directions we can find an estimate of $w$. Specifically, let $\{x_1, x_2, \ldots, x_d\}$ be the unit vectors of the standard basis. Let $v_i = (v + \beta x_i)/\|v + \beta x_i\|$ for some $\beta \in (0, 1/2]$. Then the distance from $w$ to $v_i$ can be used to (approximately) find $w \cdot x_i$. Namely, we rely on the following simple lemma.

Lemma 4.3. Let $u$, $v$ and $w$ be any unit vectors and $\beta \in (0,1/2]$. Then for $v' = (v+\beta u)/\|v+\beta u\|$ it holds that
$$\langle u, w\rangle = \frac{\|v+\beta u\|\,(2 - \|v'-w\|^2) - 2 + \|v-w\|^2}{2\beta}.$$

Proof. By definition, $\langle v', w\rangle = (\langle v, w\rangle + \beta\langle u, w\rangle)/\|v+\beta u\|$ and therefore
$$\langle u, w\rangle = \frac{\|v+\beta u\|\,\langle v', w\rangle - \langle v, w\rangle}{\beta}. \quad (5)$$
For every pair of unit vectors $u$ and $u'$, $\langle u, u'\rangle = \frac{\|u\|^2 + \|u'\|^2 - \|u-u'\|^2}{2} = 1 - \|u-u'\|^2/2$, and therefore we get that $\langle v, w\rangle = 1 - \|w-v\|^2/2$ and $\langle v', w\rangle = 1 - \|w-v'\|^2/2$. Substituting these into eq. (5) gives the claim.

Using approximate values of $\langle x_i, w\rangle$ for all $i \in [d]$, one can easily approximate $w$. Our (non-active) SQ learning algorithm provides a simple example of how such reconstruction can be used for learning.

Theorem 4.4. There exists a polynomial-time SQ algorithm LearnHS-U that learns $H_d$ over $U_d$ using $d+1$ statistical queries, each of tolerance $\Omega(\epsilon/\sqrt{d})$.

Proof. Let $v$ be any unit vector (e.g. $x_1$) and for $i \in [d]$ define $v_i = (v + x_i/2)/\|v + x_i/2\|$ (that is, $\beta = 1/2$). Let $h_w$ denote the unknown target halfspace. For every $v_i$, we ask a statistical query with tolerance $\epsilon/(10\pi\sqrt{d})$ to obtain $\alpha_i$ such that $|\Pr[h_{v_i} \neq h_w] - \alpha_i| \leq \epsilon/(10\pi\sqrt{d})$, and similarly get $\alpha$ such that $|\Pr[h_v \neq h_w] - \alpha| \leq \epsilon/(10\pi\sqrt{d})$. We define $\gamma = 1 - (2\sin(\pi\alpha/2))^2/2$ and, for every $i \in [d]$, $\gamma_i = \|v + x_i/2\|\,(2 - (2\sin(\pi\alpha_i/2))^2) - 2\gamma$.
By Lemma 4.2(2), we get that $\big|\|v-w\| - 2\sin(\pi\alpha/2)\big| \leq \epsilon/(10\sqrt{d})$. Clearly $\|v-w\| \leq 2$ and therefore, by Lemma 4.1,
$$|\gamma - \langle v, w\rangle| = \big|(2\sin(\pi\alpha/2))^2 - \|v-w\|^2\big|/2 \leq 4\cdot\big|2\sin(\pi\alpha/2) - \|v-w\|\big|/2 \leq \epsilon/(5\sqrt{d}).$$
Note that $\|v + x_i/2\| \leq 3/2$ and therefore, by Lemmas 4.3 and 4.1,
$$\big|\gamma_i - \langle x_i, w\rangle\big| \leq \frac{3}{2}\cdot 4\cdot\epsilon/(10\sqrt{d}) + 2\epsilon/(5\sqrt{d}) = \epsilon/\sqrt{d}.$$
Now let $w' = \sum_{i \in [d]} \gamma_i x_i$. Parseval's identity implies that
$$\|w - w'\|^2 = \sum_{i \in [d]} |\gamma_i - \langle x_i, w\rangle|^2 \leq d\cdot\frac{\epsilon^2}{d} = \epsilon^2.$$
Let $w^* = w'/\|w'\|$. Clearly $\|w^* - w'\| \leq \|w - w'\| \leq \epsilon$ and therefore, by the triangle inequality, $\|w^* - w\| \leq 2\epsilon$. By Lemma 4.2(1) this implies that $\Pr[h_{w^*} \neq h_w] \leq \epsilon$. It is easy to see that this algorithm uses $d+1$ statistical queries of tolerance $\Omega(\epsilon/\sqrt{d})$ and runs in time linear in $d$.

Remark 4.5. It is not hard to see that an even simpler way to find each of the coordinates of $w$ is by measuring the error of each of the standard basis vectors themselves and using the fact that $\langle w, x_i\rangle = \cos(\pi\cdot\Pr[h_{x_i} \neq h_w])$. The variant we presented in Theorem 4.4 is more useful as a warm-up for the analysis of the active version of the algorithm.

4.2 Active learning of halfspaces over $U_d$

Our active SQ learning algorithm is based on two main ideas. First, as in Theorem 4.4, we rely on measuring the error of hypotheses which are perturbations of the current hypothesis in the direction of each of the basis vectors. We then combine the measurements to obtain a new hypothesis. Second, as in previous active learning algorithms for the problem [DKM09, BBZ07], we only use labeled examples which are within a certain margin of the current hypothesis. The margin we use to filter the examples is a function of the current error rate.
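The reconstruction at the heart of Theorem 4.4 can be sketched as follows; this is a hedged illustration in which exact distances stand in for the SQ-based estimates (so the recovery is exact up to floating-point error), and the particular target vector is an arbitrary example:

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def reconstruct(w, beta=0.5):
    """Recover w from distance measurements via the identity of Lemma 4.3:
    <u, w> = (||v + beta*u||(2 - ||v' - w||^2) - 2 + ||v - w||^2) / (2*beta).
    """
    d = len(w)
    v = [1.0] + [0.0] * (d - 1)  # arbitrary starting unit vector
    coords = []
    for i in range(d):
        e = [1.0 if j == i else 0.0 for j in range(d)]
        vb = [v[j] + beta * e[j] for j in range(d)]
        nb = math.sqrt(sum(x * x for x in vb))
        vi = [x / nb for x in vb]  # v_i = (v + beta*x_i) / ||v + beta*x_i||
        coords.append((nb * (2 - dist(vi, w) ** 2) - 2 + dist(v, w) ** 2)
                      / (2 * beta))
    return unit(coords)

w = unit([0.3, -0.5, 0.2, 0.7, -0.1])
w_hat = reconstruct(w)  # equals w up to floating-point error
```

In the actual algorithm the distances are only known to accuracy $O(\epsilon/\sqrt{d})$ per coordinate, and Parseval's identity turns these $d$ per-coordinate errors into an overall error of $\epsilon$.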
It is implicit in previous work [DKM09, BBZ07, BL13] that for an appropriate choice of margin, a constant fraction of the error region is within the margin, while the total probability of a point being within the margin is linear in the error of the current hypothesis. Together these conditions allow approximating the error of a hypothesis using tolerance that has no dependence on $\epsilon$.

We start by computing the error of a hypothesis $v$ whose distance from the target $w$ is $\Delta$, conditioned on being within margin $\gamma$ of $v$. Let $A_{d-1}$ denote the surface area of $S^{d-1}$. First, the surface area within margin $\gamma$ of any homogeneous halfspace $v$ is
$$2\int_0^\gamma A_{d-2}\,(1-r^2)^{(d-2)/2}\cdot\frac{1}{\sqrt{1-r^2}}\,dr = 2A_{d-2}\int_0^\gamma (1-r^2)^{(d-3)/2}\,dr. \quad (6)$$
We now observe that for any $v$ and $w$ such that $\|v-w\| = \Delta$, $\Pr[h_v(x) \neq h_w(x) \mid |\langle v, x\rangle| \leq \gamma]$ is a function that depends only on $\Delta$ and $\gamma$.

Lemma 4.6. For any $v, w \in S^{d-1}$ such that $\|v-w\| = \Delta \leq \sqrt{2}$ and $\gamma > 0$,
$$\Pr[h_v(x) \neq h_w(x) \mid |\langle v, x\rangle| \leq \gamma] = \frac{A_{d-3}\int_0^\gamma (1-r^2)^{(d-3)/2}\int_{\frac{r\sqrt{2-\Delta^2}}{\Delta\sqrt{1-r^2}}}^{1} (1-s^2)^{(d-4)/2}\,ds\,dr}{A_{d-2}\int_0^\gamma (1-r^2)^{(d-3)/2}\,dr}.$$
We denote this probability by $\mathrm{cp}_d(\gamma, \Delta)$. The proof of this lemma can be found in Appendix B.

The analysis of our algorithm is based on relating the effect that a change in the distance of a hypothesis has on the conditional probability of error. This effect can easily be expressed using the derivative of the conditional probability as a function of the distance. Specifically, we prove the following lower bound on the derivative (the proof can be found in Appendix B).

Lemma 4.7. For $\Delta \leq \sqrt{2}$, any $d \geq 4$, and $\gamma \geq \Delta/(2\sqrt{d})$,
$$\partial_\Delta\,\mathrm{cp}_d(\gamma, \Delta) \geq \frac{1}{56\,\gamma\sqrt{d}}.$$
An important corollary of Lemma 4.7 is that, given a hypothesis $h_v$ and $\Delta'$ such that $\|v-w\| \leq \Delta'$, we can estimate $\|v-w\|$ to accuracy $\rho$ using an active statistical query with query tolerance $\Omega(\rho/\Delta')$. Specifically:

Lemma 4.8. Let $h_w$ be the target hypothesis. There is an algorithm MeasureDistance$(v, \rho, \Delta')$ that, given a unit vector $v$, $\rho > 0$ and $\Delta'$ such that $\|v-w\| \leq \Delta'$, outputs a value $\tilde{\Delta}$ satisfying $\big|\|v-w\| - \tilde{\Delta}\big| \leq \rho$. The algorithm asks a single active SQ with filter tolerance $\tau_0 = \Omega(\Delta')$ and query tolerance $\tau = \Omega(\rho/\Delta')$, and runs in time $\mathrm{poly}(\log(1/(\Delta'\rho)))$.

Proof. Let $\gamma = \Delta'/(2\sqrt{d})$. This implies that $\gamma \geq \|v-w\|/(2\sqrt{d})$. Lemma 4.7 together with the mean value theorem implies that if $\Delta_1, \Delta_2 \leq \Delta'$ and $\Delta_1 - \Delta_2 \geq \rho$ then, for some $\hat{\Delta} \in [\Delta_2, \Delta_1]$,
$$\mathrm{cp}_d(\gamma, \Delta_1) - \mathrm{cp}_d(\gamma, \Delta_2) \geq \rho\cdot\partial_\Delta\,\mathrm{cp}_d(\gamma, \hat{\Delta}) \geq \frac{\rho}{56\,\gamma\sqrt{d}} = \frac{\rho}{28\Delta'}. \quad (7)$$
This implies that in order to estimate $\|v-w\|$ within tolerance $\rho$ it is sufficient to estimate $\mathrm{cp}_d(\gamma, \|v-w\|)$ within $\rho/(28\Delta')$. To see this, note that an estimate of $\mathrm{cp}_d(\gamma, \|v-w\|)$ within $\rho/(28\Delta')$ is a value $\mu$ such that $|\mu - \mathrm{cp}_d(\gamma, \|v-w\|)| \leq \rho/(28\Delta')$. Let $\tilde{\Delta}$ be such that $\mathrm{cp}_d(\gamma, \tilde{\Delta}) = \mu$. Note that Lemma 4.6 does not give an explicit mapping from $\mathrm{cp}_d(\gamma, \Delta)$ to $\Delta$, but $\mathrm{cp}_d(\gamma, \Delta)$ is a monotone function of $\Delta$ and can be computed efficiently given $\Delta$. Therefore we can efficiently invert $\mathrm{cp}_d(\gamma, \Delta)$ using a simple binary search. This computation will give us a value $\tilde{\Delta}$ such that $|\mathrm{cp}_d(\gamma, \tilde{\Delta}) - \mathrm{cp}_d(\gamma, \|v-w\|)| \leq \rho/(28\Delta')$. Using this together with eq. (7), we obtain that $|\tilde{\Delta} - \|v-w\|| \leq \rho$.

Let "$|\langle v, x\rangle| \leq \gamma$" denote the function (of $x$) that outputs 1 when the condition is satisfied and 0 otherwise.
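The binary-search inversion used by MeasureDistance can be sketched generically; this is a hedged illustration in which a simple monotone function stands in for the actual $\mathrm{cp}_d(\gamma, \cdot)$ integral of Lemma 4.6:

```python
import math

def invert_monotone(f, mu, lo, hi, tol=1e-10):
    """Binary-search inversion of a monotone increasing function, as used by
    MeasureDistance to find a Delta~ with cp_d(gamma, Delta~) close to the
    estimate mu of cp_d(gamma, ||v - w||). Assumes f(lo) < mu < f(hi).
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < mu:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Illustrative monotone stand-in for cp_d(gamma, .) on [0, sqrt(2)].
f = lambda t: math.sin(t) / 2
delta = invert_monotone(f, 0.3, 0.0, math.sqrt(2))
```

The number of iterations is logarithmic in the required accuracy, which is where the $\mathrm{poly}(\log(1/(\Delta'\rho)))$ running time of Lemma 4.8 comes from.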
By definition, $\mathrm{cp}_d(\gamma, \|v-w\|)$ can be estimated to within $\rho/(28\Delta')$ using an active SQ ("$|\langle v, x\rangle| \leq \gamma$"; $h_v(x)\cdot\ell$) with query tolerance $\tau = \rho/(28\Delta')$ and filter tolerance $\tau_0 = \Delta'/8 \leq \Pr_{U_d}[|\langle v, x\rangle| \leq \gamma]$ (see, for example, [DKM09]).

We can now use the estimates of the distance of a vector to $w$ (the normal vector of the target hyperplane) and Lemma 4.3 to obtain a vector which is close to $w$. We perform this iteratively until we obtain a vector $v$ giving a hypothesis with error of at most $\epsilon$.

Theorem 4.9. There exists an active statistical algorithm ActiveLearnHS-U that learns $H_d$ over $U_d$ to accuracy $1-\epsilon$, uses $(d+1)\log(1/\epsilon)$ active SQs with tolerance $\Omega(1/\sqrt{d})$ and filter tolerance $\Omega(\epsilon)$, and runs in time $d\cdot\mathrm{poly}(\log(d/\epsilon))$.

Proof. Our algorithm works by finding a vector $v$ that is at distance at most $2\epsilon$ from the normal vector of the target hyperplane, which we denote by $w$. We do this via an iterative process such that at step $t$ we construct a vector $u_t$ satisfying $\|w - u_t\| \leq 2^{-t}$. In step 1 we construct a vector $u_1$ such that $\|u_1 - w\| \leq 1/2$ by using the (non-active) algorithm LearnHS-U (Theorem 4.4) with error parameter $\epsilon' = 1/(2\pi)$. By Lemma 4.2(2) we get that $\Pr[h_w \neq h_{u_1}] \leq 1/(2\pi)$ implies $\|u_1 - w\| \leq 1/2$. Now, given a vector $v = u_t$ such that $\|w - v\| \leq 2^{-t}$, we construct a unit vector $v^*$ such that $\|w - v^*\| \leq 2^{-t-1}$ and set $u_{t+1} = v^*$. Clearly, for $w^* = u_{\lceil\log(1/\epsilon)\rceil - 1}$ we will get $\|w - w^*\| \leq 2\epsilon$ and hence, by Lemma 4.2, $\Pr[h_w \neq h_{w^*}] \leq \epsilon$.

Let $\Delta' = 2^{-t}$, $\beta = 2^{-t}$ and define $v_i = (v + \beta x_i)/\|v + \beta x_i\|$. For every $v_i$ we know that $\|v - v_i\| \leq \beta = \Delta'$, which means $\|w - v_i\| \leq 2\Delta'$.
Algorithm 2 ActiveLearnHS-U: Active SQ Learning of Homogeneous Halfspaces over the Uniform Distribution
1: Run LearnHS-U with error parameter ǫ′ = 1/(2π) to obtain u₁
2: for t = 1 to ⌈log(1/ǫ)⌉ − 2 do
3:   Set α = MeasureDistance(u_t, 1/(8·2^t·√d), 2^{−t})
4:   for i = 1 to d do
5:     Set v_i = (u_t + 2^{−t}x_i)/‖u_t + 2^{−t}x_i‖
6:     Set α_i = MeasureDistance(v_i, 1/(24·2^t·√d), 2^{−t+1})
7:     Set γ_i = 2^{t−1}·(‖u_t + 2^{−t}x_i‖·(2 − α_i²) − 2 + α²)
8:   end for
9:   Set v′ = Σ_{i∈[d]} γ_i x_i
10:  Set u_{t+1} = v′/‖v′‖
11: end for
12: return u_{⌈log(1/ǫ)⌉−1}

We use MeasureDistance (Lemma 4.8) for v with distance bound ∆′ and accuracy parameter ρ = ∆′/(8√d) to obtain α such that |‖v − w‖ − α| ≤ ∆′/(8√d). Similarly, for each i ∈ [d] we use MeasureDistance for v_i with distance bound 2∆′ and parameter ρ = ∆′/(24√d) to obtain α_i such that |‖v_i − w‖ − α_i| ≤ ∆′/(24√d). For i ∈ [d], we define

  γ_i = (‖v + βx_i‖·(2 − α_i²) − 2 + α²) / (2β).

We view γ_i as a function of α and α_i and observe that for α ∈ [0, ∆′], |∂_α γ_i| = α/β ≤ 1, and for α_i ∈ [0, 2∆′], |∂_{α_i} γ_i| = ‖v + βx_i‖·α_i/β ≤ 2‖v + βx_i‖ ≤ 3. Therefore, by Lemma 4.3 and Lemma 4.1,

  |γ_i − ⟨x_i, w⟩| ≤ 3·∆′/(24√d) + ∆′/(8√d) = ∆′/(4√d).

Now let v′ = Σ_{i∈[d]} γ_i x_i. Parseval's identity implies that

  ‖w − v′‖² = Σ_{i∈[d]} |γ_i − ⟨x_i, w⟩|² ≤ d·∆′²/(16d) = ∆′²/16.

Let v* = v′/‖v′‖. Clearly, ‖v* − v′‖ ≤ ‖w − v′‖ ≤ ∆′/4 and therefore, by the triangle inequality, ‖v* − w‖ ≤ ∆′/2 = 2^{−t−1}.

All that is left to prove are the claimed bounds on the active SQs used by the algorithm and its running time. First note that each step uses d + 1 active SQs and there are at most log(1/ǫ) steps.
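The key identity behind line 7 of Algorithm 2 is that γ_i recovers the coefficient ⟨x_i, w⟩ exactly when the distance estimates are exact, since ‖u − w‖² = 2 − 2⟨u, w⟩ for unit vectors. The following sketch runs one iteration of the update with exact distances standing in for MeasureDistance (an idealized, noise-free illustration, not the SQ implementation):

```python
import math
import random

def norm(u):
    return math.sqrt(sum(x * x for x in u))

def refine(v, w, beta):
    """One iteration of the Algorithm-2 update. Exact distances play the
    role of MeasureDistance; gamma_i then equals <x_i, w> exactly and the
    renormalized reconstruction recovers w."""
    d = len(v)
    alpha = norm([a - b for a, b in zip(v, w)])          # MeasureDistance(v, ...)
    gammas = []
    for i in range(d):
        u = list(v)
        u[i] += beta                                      # u = v + beta * x_i
        nu = norm(u)
        v_i = [x / nu for x in u]
        alpha_i = norm([a - b for a, b in zip(v_i, w)])   # MeasureDistance(v_i, ...)
        gammas.append((nu * (2 - alpha_i ** 2) - 2 + alpha ** 2) / (2 * beta))
    nv = norm(gammas)                                     # v' = sum_i gamma_i x_i
    return [g / nv for g in gammas]

random.seed(0)
w = [random.gauss(0, 1) for _ in range(5)]
w = [x / norm(w) for x in w]
v = [x + 0.1 * random.gauss(0, 1) for x in w]
v = [x / norm(v) for x in v]
v_new = refine(v, w, beta=norm([a - b for a, b in zip(v, w)]))
err = norm([a - b for a, b in zip(v_new, w)])
print(err < 1e-9)  # exact distances recover w, so this prints True
```

With noisy estimates (as in the actual algorithm), the derivative bounds on γ_i translate the measurement error ∆′/O(√d) per coordinate into a total error of ∆′/4 via Parseval, which is why the distance to w halves each round.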
By Lemma 4.8, the tolerance of each query at step t is Ω((∆′/√d)/∆′) = Ω(1/√d) and the filter tolerance is Ω(2^{−t}) = Ω(ǫ). Lemma 4.8 together with the bound on the number of stages also implies the claimed running time bound.

An immediate corollary of Theorems 4.9 and 2.2 is an active learning algorithm for H_d that works in the presence of random classification noise.

Corollary 4.10. There exists a polynomial-time active learning algorithm that, given any η ∈ [0, 1/2), learns H_d over U_d with random classification noise of rate η to error ǫ using O((1 − 2η)^{−2} · d² log(1/ǫ) log(d log(1/ǫ))) labeled examples and O((1 − 2η)^{−2} · d² · ǫ^{−1} log(1/ǫ) log(d log(1/ǫ))) unlabeled samples.

4.3 Learning with unknown noise rate

One limitation of Corollary 4.10 is that the simulation requires knowing the noise rate η. We show that this limitation can be overcome by giving a procedure that approximately finds the noise rate, which can then be used in the simulation of ActiveLearnHS-U. The idea for estimating the noise rate is to measure the agreement rate of random halfspaces with the target halfspace. The agreement rate of a fixed halfspace is a linear function of the noise rate, and therefore, by comparing the distribution of agreement rates in the presence of noise to the distribution of agreement rates in the noiseless case, we can factor out the noise rate.

Lemma 4.11. There is an algorithm B that, for every unit vector w and values η ∈ [0, 1/2), τ, δ ∈ (0, 1), given τ, δ and access to random examples from distribution P_η = (U_d, (1 − 2η)h_w), will, with probability at least 1 − δ, output a value η′ such that (1 − 2η)/(1 − 2η′) ∈ [1 − τ, 1 + τ]. Further, B runs in time polynomial in d, 1/τ, 1/(1 − 2η) and log(1/δ), and uses O(dτ^{−2}(1 − 2η)^{−2} · log(d/((1 − 2η)τδ))) random examples.

Proof.
We consider the expected correlation of a randomly chosen halfspace with some fixed unknown halfspace h_u. Namely, let

  ν = E_{v∼U_d}[ |E_{x∼U_d}[h_u(x) · h_v(x)]| ].

Spherical symmetry implies that ν does not depend on u. First note that E_{x∼U_d}[h_u(x) · h_v(x)] = 2 arcsin(⟨u, v⟩)/π ≥ 2⟨u, v⟩/π. A well-known fact is that for a randomly and uniformly chosen unit vector v, with probability at least 1/8, ⟨u, v⟩ ≥ 1/√d. This implies that with probability at least 1/8, E_{x∼U_d}[h_u(x) · h_v(x)] ≥ 2/(π√d), and hence ν ≥ c/√d for some fixed constant c. Henceforth we can assume that ν is known exactly (it is easy to estimate the necessary integral with accuracy sufficient for our use). At the same time we know that E_{P_η}[ℓ · h_v(x)] = (1 − 2η) E_{U_d}[h_w(x) · h_v(x)]. This means that

  ν_η = E_{v∼U_d}[ |E_{(x,ℓ)∼P_η}[ℓ · h_v(x)]| ] = (1 − 2η)ν.

Therefore, in order to estimate 1 − 2η we estimate ν_η given samples from P_η. To estimate ν_η we draw a set V of random unit vectors and a set S of random examples from P_η. For each v ∈ V we estimate |E_{(x,ℓ)∼P_η}[ℓ · h_v(x)]| using the random examples in S, and let α_v denote the corresponding estimate. The average of the α_v's is an estimate of ν_η. Chernoff-Hoeffding bounds imply that for t₁(θ, δ′) = O(θ^{−2} log(1/δ′)) and t₂(θ, δ′) = O(θ^{−2} log(t₁(θ, δ′)/δ′)), the estimation procedure above with |V| = t₁(θ, δ′) and |S| = t₂(θ, δ′) will, with probability 1 − δ′, return an estimate of ν_η within θ.

We first find a good lower bound on 1 − 2η via a simple guess-estimate-and-double process. For i = 1, 2, 3, . . . we estimate ν_η with tolerance ν · 2^{−i} and confidence δ · 2^{−i−1} until we get an estimate that is at least 2 · ν · 2^{−i}.
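The estimator for ν_η can be sketched directly: average the absolute empirical correlation of the noisy labels with random directions. The sketch below builds its own synthetic noisy-label oracle (`sample` is an assumed stand-in for drawing from P_η, not part of the paper's interface), using Gaussian examples, which are spherically symmetric like U_d:

```python
import random

def estimate_nu_eta(d, sample, n_dirs=200, n_ex=2000, seed=1):
    """Estimate nu_eta = E_v |E_{(x,l)~P_eta}[l * h_v(x)]| by averaging the
    absolute empirical correlation over random directions v."""
    rng = random.Random(seed)
    pts = sample(n_ex)
    total = 0.0
    for _ in range(n_dirs):
        v = [rng.gauss(0, 1) for _ in range(d)]  # direction; norm is irrelevant to h_v
        corr = sum(l * (1 if sum(vi * xi for vi, xi in zip(v, x)) >= 0 else -1)
                   for x, l in pts) / len(pts)
        total += abs(corr)
    return total / n_dirs

# Synthetic oracle: labels of the halfspace h_w flipped with probability eta.
d, eta = 4, 0.2
rng = random.Random(2)

def sample(n):
    out = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(d)]
        l = 1 if x[0] >= 0 else -1            # h_w with w = e_1
        if rng.random() < eta:
            l = -l                            # random classification noise
        out.append((x, l))
    return out

nu_eta = estimate_nu_eta(d, sample)
nu = estimate_nu_eta(d, lambda n: [(x, 1 if x[0] >= 0 else -1)
                                   for x, _ in sample(n)])  # noiseless labels
print(nu_eta / nu)  # close to 1 - 2*eta = 0.6
```

Dividing the noisy estimate by the noiseless ν factors out 1 − 2η, exactly as in the proof; the lemma's sample bounds come from making both Monte Carlo averages concentrate.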
Let i_η denote the first step at which this condition is satisfied. We claim that with probability at least 1 − δ/2, 2^{−i_η} ≤ (1 − 2η) ≤ 3 · 2^{−i_η+1}. First, we know that the estimates are successful for every i with probability at least 1 − δ/2. The stopping condition implies that ν_η ≥ ν · 2^{−i_η} and, in particular, (1 − 2η) ≥ 2^{−i_η}. Now, to prove that (1 − 2η) ≤ 3 · 2^{−i_η+1} we show that i_η ≤ ⌈log(3/(1 − 2η))⌉. This is true since for k = ⌈log(3/(1 − 2η))⌉ we get that (1 − 2η) ≥ 3 · 2^{−k} and hence ν_η = (1 − 2η)ν ≥ 3ν · 2^{−k}. Therefore an estimate of ν_η with tolerance ν · 2^{−k} must be at least 2ν · 2^{−k}. This means that i_η ≤ k.

Given the lower bound 2^{−i_η}, we estimate ν_η to accuracy ντ·2^{−i_η}/2 ≤ (1 − 2η)ντ/2 with confidence 1 − δ/2 and let ν′_η denote the estimate. We set 1 − 2η′ = ν′_η/ν. We first note that |(1 − 2η′) − (1 − 2η)| ≤ 2^{−i_η}·τ/2 ≤ (1 − 2η)·τ/2 and therefore

  (1 − 2η)/(1 − 2η′) ∈ [1/(1 + τ/2), 1/(1 − τ/2)] ⊆ [1 − τ, 1 + τ].

Using the fact that ν ≥ c/√d and i_η ≤ ⌈log(3/(1 − 2η))⌉, we can conclude that the first step of the estimation procedure requires O(d(1 − 2η)^{−2} · log(d/((1 − 2η)δ))) examples and the second step requires O(dτ^{−2}(1 − 2η)^{−2} · log(d/((1 − 2η)δτ))) examples. The straightforward implementation has running time O(d³τ^{−4}(1 − 2η)^{−4} · log(d/((1 − 2η)δτ))).

Note that by Theorem 4.9 our algorithm for learning halfspaces uses τ = Ω(1/√d). We can apply Lemma 4.11 together with Remark 2.3 and Corollary 4.10 to obtain a version of the algorithm that does not require knowledge of η.

Corollary 4.12.
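The guess-estimate-and-double loop from the proof can be sketched abstractly. Here `estimate(tol)` is an assumed oracle returning ν_η to within `tol` (idealized as exact in the demo values, which are themselves hypothetical):

```python
def lower_bound_gap(estimate, nu):
    """Guess-estimate-and-double from the proof of Lemma 4.11: estimate
    nu_eta with tolerance nu * 2**-i until the estimate is at least
    2 * nu * 2**-i; then 2**-i is a valid lower bound on 1 - 2*eta."""
    i = 0
    while True:
        i += 1
        tol = nu * 2 ** -i
        if estimate(tol) >= 2 * tol:
            return 2 ** -i

# Idealized oracle: returns the exact value nu_eta (error 0 <= tol).
nu, eta = 0.5, 0.3            # hypothetical values
nu_eta = (1 - 2 * eta) * nu   # = 0.2
lb = lower_bound_gap(lambda tol: nu_eta, nu)
print(lb)                      # 0.125, a valid lower bound on 1 - 2*eta = 0.4
```

The loop halts at i = 3 here: the stopping test first succeeds when 2ν·2^{−i} drops below ν_η, which is exactly the bracketing argument (2^{−i_η} ≤ 1 − 2η ≤ 3·2^{−i_η+1}) used in the proof.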
There exists a polynomial-time active learning algorithm that, for any η ∈ [0, 1/2), learns H_d over U_d with random classification noise of rate η to error ǫ using O((1 − 2η)^{−2} · d² · (log(d/((1 − 2η)δτ)) + log(1/ǫ) log(d log(1/ǫ)))) labeled examples and O((1 − 2η)^{−2} · d² · ǫ^{−1} log(1/ǫ) log(d log(1/ǫ))) unlabeled samples.

5 Differentially-private active learning

In this section we show that active SQ learning algorithms can also be used to obtain differentially-private active learning algorithms. We assume that the learner has full access to the unlabeled portion of some database of n examples S ⊆ X × Y, which correspond to records of individual participants in the database. In addition, for every element of the database S, the learner can request the label of that element. As usual, the goal is to minimize the number of label requests. In addition, we would like to preserve the differential privacy of the participants in the database, a now-standard notion of privacy introduced in [DMNS06].

A simple scenario in which active differentially-private learning could be valuable is medical research where the goal is to create an automatic predictor of whether a person has a certain medical condition. It is often the case that while many unlabeled patient records are available, discovering the label requires work by a medical expert or an expensive test (or both). In such a scenario an active learning algorithm could significantly reduce the cost of producing a good predictor of the condition, while differential privacy ensures that the predictor does not reveal any information about the patients whose data was used by the algorithm.

Formally, for some domain X × Y, we call S ⊆ X × Y a database. Databases S, S′ ⊂ X × Y are adjacent if one can be obtained from the other by modifying a single element.
Here we will always have Y = {−1, 1}. In the following, A is an algorithm that takes as input a database D and outputs an element of some finite set R.

Definition 5.1 (Differential privacy [DMNS06]). A (randomized) algorithm A: 2^{X×Y} → R is α-differentially-private if for all r ∈ R and every pair of adjacent databases S, S′, we have Pr[A(S) = r] ≤ e^α · Pr[A(S′) = r].

Here we consider algorithms that operate on S in an active way. That is, the learning algorithm receives the unlabeled part of each point in S as input and can only obtain the label of a point upon request. The total number of requests is the label complexity of the algorithm. We note that the definition of differential privacy we use does not make any distinction between the entries of the database for which labels were requested and the other ones. In particular, the privacy of all entries is preserved. Further, in our setting the indices of entries for which the labels are requested are not part of the output of the algorithm and hence do not need to be differentially private.

As first shown by Blum et al. [BDMN05], SQ algorithms can be automatically translated into differentially-private⁴ algorithms. We now show that, analogously, active SQ learning algorithms can be automatically transformed into differentially-private active learning algorithms.

Theorem 5.2. Let A be an algorithm that learns a class of functions H to accuracy 1 − ǫ over distribution D using M₁ active SQs of tolerance τ and filter tolerance τ₀, and M₂ target-independent queries of tolerance τ_u. There exists a learning algorithm A′ that, given α > 0, δ > 0 and active access to a database S ⊆ X × {−1, 1}, is α-differentially-private and uses at most O([M₁/(ατ) + M₁/τ²] · log(M₁/δ)) labels.
Further, for some n = O([M₁/(ατ₀τ) + M₁/(τ₀τ²) + M₂/(ατ_u) + M₂/τ_u²] · log((M₁ + M₂)/δ)), if S consists of at least n examples drawn randomly from D then, with probability at least 1 − δ, A′ outputs a hypothesis with accuracy ≥ 1 − ǫ (relative to distribution D). The running time of A′ is the same as the running time of A plus O(n).

⁴ In [BDMN05] a related but different definition of privacy was used. However, as pointed out in [KLN+11], the same translation can be used to achieve differential privacy.

Proof. We first consider the active SQs of A. We will answer each such query using a disjoint set of O([1/(ατ₀τ) + 1/(τ₀τ²)] · log(M₁/δ)) unlabeled examples. The subset T of these examples that satisfy the filter will be queried for their labels and used to compute an answer to the statistical query (by taking the empirical average in the usual way). Additional noise drawn from a Laplace distribution will then be added to the answer in order to preserve privacy.

We begin by analyzing the amount of noise needed to achieve the desired privacy guarantee. First, since each query is answered using a disjoint set of examples, changing any given example can affect the answer to at most one query; so it suffices to answer each query with α-differential privacy. Second, modifying any given example can change the empirical answer to a query by at most 1/|T| (there are three cases: the example was already in T and remains in T after the modification, the example was in T and is removed from T by the modification, or the example was not in T and is added to T by the modification; each changes the empirical answer by at most 1/|T|). Therefore, α-privacy can be achieved by adding a quantity ξ drawn from a Laplace distribution of width O(1/(α|T|)) to the empirical answer over the labeled sample.
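The private answering step is the classic Laplace mechanism: empirical average plus Laplace noise scaled to the per-example sensitivity. A minimal sketch (the function name and the scale constant are illustrative choices, not the paper's exact calibration):

```python
import math
import random

def private_sq_answer(values, alpha, rng=None):
    """Answer a statistical query from the filtered labeled sample T as the
    empirical average plus Laplace noise of scale O(1/(alpha*|T|)), as in
    the proof of Theorem 5.2. `values` are query values in [-1, 1]."""
    rng = rng or random.Random(0)
    t = len(values)
    avg = sum(values) / t
    scale = 2.0 / (alpha * t)  # sensitivity of the average is at most 2/|T|
    # Sample Laplace(scale) via inverse CDF from a uniform draw.
    u = rng.random() - 0.5
    noise = -scale * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))
    return avg + noise

vals = [1, 1, -1, 1] * 250       # |T| = 1000, true average 0.5
ans = private_sq_answer(vals, alpha=1.0)
print(abs(ans - 0.5) < 0.1)      # noise scale is 0.002, so this prints True
```

With |T| = Θ(log(M₁/δ)/(ατ)), the noise is below τ/2 with high probability, which is exactly the calculation carried out next in the proof.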
Finally, we solve for the size of T needed to ensure that, with sufficiently high probability, |ξ| ≤ τ/2, so that the effect on the active SQ after correction for noise is at most τ/2. Specifically, the Laplace distribution has the property that with probability at least 1 − δ′ the magnitude of ξ is at most O(log(1/δ′)/(α|T|)). Setting this to τ/2 and using δ′ = δ/(6M₁), we have that privacy α can be guaranteed with perturbation at most τ/2 as long as |T| ≥ c · log(M₁/δ)/(ατ) for a sufficiently large constant c. Next, we also need to ensure that T is large enough so that, with probability at least 1 − δ′, even without the added Laplace noise, the empirical average of the query function over T is within τ/2 of the true value. By Hoeffding bounds, this is ensured if |T| ≥ c · log(M₁/δ)/τ² for a sufficiently large constant c. Finally, we need the unlabeled sample to be large enough so that, with probability at least 1 − δ′, the labeled sample T satisfies both of the above conditions. By Hoeffding bounds, this is ensured by an unlabeled sample of size O((1/τ₀) · [1/(ατ) + 1/τ²] · log(M₁/δ)).

The above analysis was for each active SQ. There are M₁ active SQs in total, so the total sample size is a factor M₁ larger, and by a union bound over all M₁ queries, with probability at least 1 − δ/2 all are answered within their desired tolerance levels. Now we analyze the M₂ target-independent queries. Here, by standard analysis (which is also a special case of the analysis above), we get that it is sufficient to use O([1/(ατ_u) + 1/τ_u²] · log(M₂/δ)) unlabeled samples to answer all the queries with probability at least 1 − δ/2. Finally, summing up the sample sizes and applying a union bound over the failure probabilities, we get the claimed bounds on the sample complexity and running time.
We remark that the algorithm is α-differentially-private even when the samples are not drawn from distribution D. The simulation above can easily be made tolerant to random classification (or uncorrelated) noise in exactly the same way as in Theorem 2.2.

In our setting it is also natural to treat the privacy of labeled and unlabeled parts differently. For much the same reason that unlabeled data is often much more plentiful than labeled data, in many cases the label information is much more sensitive, in a privacy sense, than the unlabeled feature vector. For example, the unlabeled data may be fully public (obtained by crawling the web or from a public address book) and the labels obtained from a questionnaire. To reflect this, one can define two privacy parameters α_ℓ and α, with α_ℓ denoting the (high) sensitivity of the label information and α denoting the (lower) sensitivity of the feature vector alone. More formally, in addition to requiring α-differential privacy, we can require α_ℓ-differential privacy on databases which differ only in a single label (for α_ℓ < α). A special case of this model, where only label privacy matters, was studied in [CH11] (a model with a related but weaker requirement, in which labeled points are private and unlabeled ones are not, was recently considered in [JMP13]). It is not hard to see that with this definition our analysis gives an algorithm that uses O([M₁/(α_ℓτ) + M₁/τ²] · log(M₁/δ)) labels and requires a database of size n for some n = Ω([M₁/(α_ℓτ₀τ) + M₁/(τ₀τ²) + M₂/(ατ_u) + M₂/τ_u²] · log((M₁ + M₂)/δ)). Note that in this result the privacy constraint on labels does not affect the number of samples required to simulate target-independent queries.
Improvement over passive differentially-private learning. An immediate consequence of Theorem 5.2 is that for learning homogeneous halfspaces over uniform or log-concave distributions we can obtain differential privacy while essentially preserving the label complexity. For example, by combining Theorems 5.2 and 4.9, we can efficiently and differentially-privately learn homogeneous halfspaces under the uniform distribution with privacy parameter α and error parameter ǫ using only Õ(d^{3/2} log(1/ǫ)/α + d² log(1/ǫ)) labels. However, it is known that any passive learning algorithm, even ignoring privacy considerations and noise, requires Ω(d/ǫ) labeled examples [Lon95]. So for α ≥ 1/√d and small enough ǫ we get better label complexity.

6 Discussion

We described a framework for designing efficient active learning algorithms that are tolerant to random classification noise. We used our framework to obtain the first computationally-efficient algorithm for actively learning homogeneous linear separators over log-concave distributions with exponential improvement in the dependence on the error ǫ over its passive counterpart. In addition, we showed that our algorithms can be automatically converted to efficient active differentially-private algorithms.

Our work suggests that, as in passive learning, active statistical algorithms might be essentially as powerful as example-based efficient active learning algorithms. It would be interesting to find more general evidence supporting this claim or, alternatively, a counterexample. An important aspect of (passive) statistical learning algorithms is that it is possible to prove unconditional lower bounds on such algorithms using the SQ dimension [BFJ+94] and its extensions.
It would be interesting to develop an active analogue of these techniques and give meaningful lower bounds based on them. This could provide a useful tool for understanding the sample complexity of differentially-private active learning algorithms.

Acknowledgments

We thank Avrim Blum and Santosh Vempala for useful discussions. This work was supported in part by NSF grants CCF-0953192, CCF-110128, and CCF-1422910, AFOSR grant FA9550-09-1-0538, ONR grant N00014-09-1-0751, and a Microsoft Research Faculty Fellowship.

References

[ABL14] P. Awasthi, M.-F. Balcan, and P. Long. The power of localization for efficiently learning linear separators with noise. In Proc. 46th ACM Symposium on Theory of Computing, 2014.
[AD98] J. Aslam and S. Decatur. Specification and simulation of statistical query algorithms for efficiency and noise tolerance. JCSS, 56:191–208, 1998.
[AL88] D. Angluin and P. Laird. Learning from noisy examples. Machine Learning, 2:343–370, 1988.
[BBL05] O. Bousquet, S. Boucheron, and G. Lugosi. Theory of classification: A survey of recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
[BBL06] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, 2006.
[BBZ07] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In COLT, pages 35–50, 2007.
[BDL09] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML, pages 49–56, 2009.
[BDMN05] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. In Proceedings of PODS, pages 128–138, 2005.
[BEHW89] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
[BF02] N. Bshouty and V. Feldman. On using extended statistical queries to avoid membership queries.
JMLR, 2:359–395, 2002.
[BFJ+94] A. Blum, M. Furst, J. Jackson, M. Kearns, Y. Mansour, and S. Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In STOC, pages 253–262, 1994.
[BFKV97] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial time algorithm for learning noisy linear threshold functions. Algorithmica, 22(1/2):35–52, 1997.
[BH12] M.-F. Balcan and S. Hanneke. Robust interactive learning. In COLT, 2012.
[BHLZ10] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, 2010.
[BHW08] M.-F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In COLT, 2008.
[BL13] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. COLT, 2013.
[Byl94] T. Bylander. Learning linear threshold functions in the presence of classification noise. In Proceedings of COLT, pages 340–347, 1994.
[CGZ10] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Learning noisy linear classifiers via adaptive and selective sampling. Machine Learning, 2010.
[CH11] K. Chaudhuri and D. Hsu. Sample complexity bounds for differentially private learning. JMLR - COLT Proceedings, 19:155–186, 2011.
[CKL+06] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In Proceedings of NIPS, pages 281–288, 2006.
[CN07] R. Castro and R. Nowak. Minimax bounds for active learning. In COLT, 2007.
[Das05] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, volume 18, 2005.
[Das11] S. Dasgupta. Active learning. Encyclopedia of Machine Learning, 2011.
[DGS12] O. Dekel, C. Gentile, and K. Sridharan. Selective sampling and active learning from single and multiple teachers. JMLR, 2012.
[DH08] S. Dasgupta and D. Hsu.
Hierarchical sampling for active learning. In ICML, pages 208–215, 2008.
[DHM07] S. Dasgupta, D. J. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. NIPS, 20, 2007.
[DKM09] S. Dasgupta, A. Tauman Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. Journal of Machine Learning Research, 10:281–299, 2009.
[DMNS06] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
[DV04] J. Dunagan and S. Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. In STOC, pages 315–320, 2004.
[Fel12] V. Feldman. A complete characterization of statistical query learning with applications to evolvability. Journal of Computer System Sciences, 78(5):1444–1459, 2012.
[FGR+13] V. Feldman, E. Grigorescu, L. Reyzin, S. Vempala, and Y. Xiao. Statistical algorithms and a lower bound for detecting planted cliques. In ACM STOC, 2013.
[FSST97] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.
[GSSS13] A. Gonen, S. Sabato, and S. Shalev-Shwartz. Efficient pool-based active learning of halfspaces. In ICML, 2013.
[Han] S. Hanneke. Theory of active learning.
[Han07] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.
[JMP13] G. Jagannathan, C. Monteleoni, and K. Pillaipakkamnatt. A semi-supervised learning approach to differential privacy. In Proceedings of the 2013 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE Workshop on Privacy Aspects of Data Mining (PADM), 2013.
[Kea98] M. Kearns. Efficient noise-tolerant learning from statistical queries. JACM, 45(6):983–1006, 1998.
[KLN+11] Shiva Prasad Kasiviswanathan, Homin K.
Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM J. Comput., 40(3):793–826, June 2011.
[Kol10] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. JMLR, 11:2457–2485, 2010.
[KV94] M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, 1994.
[KVV10] V. Kanade, L. G. Valiant, and J. Wortman Vaughan. Evolution with drifting targets. In Proceedings of COLT, pages 155–167, 2010.
[Lon95] P. M. Long. On the sample complexity of PAC learning halfspaces against the uniform distribution. IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995.
[LV07] L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Struct. Algorithms, 30(3):307–358, 2007.
[MN98] A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. In ICML, pages 350–358, 1998.
[Ros58] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
[RR11] M. Raginsky and A. Rakhlin. Lower bounds for passive and active learning. In NIPS, pages 1026–1034, 2011.
[Val84] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[Vap98] V. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
[Vem13] S. Vempala. Personal communication, 2013.
[Wan11] L. Wang. Smoothness, disagreement coefficient, and the label complexity of agnostic active learning. JMLR, 2011.

A Passive SQ learning of halfspaces

The first SQ algorithm for learning general halfspaces was given by Blum et al. [BFKV97]. This algorithm requires access to unlabeled samples from the unknown distribution and therefore is only label-statistical.
This algorithm could be used as a basis for our active SQ algorithm, but the resulting active algorithm would also be only label-statistical. As we have noted in Section 2, this is sufficient to obtain our RCN-tolerant active learning algorithm given in Corollary 3.7. However, our differentially-private simulation needs the algorithm to be (fully) statistical. Therefore we base our algorithm on the algorithm of Dunagan and Vempala for learning halfspaces [DV04]. While [DV04] does not contain an explicit statement of the SQ version of the algorithm, it is known and easy to verify that the algorithm has an SQ version [Vem13]. This follows from the fact that the algorithm in [DV04] relies on a combination of the Perceptron [Ros58] and the modified Perceptron [BFKV97] algorithms, both of which have SQ versions [Byl94, BFKV97].

Another small issue that we need to take care of in order to apply the algorithm is that its running time and tolerance depend polynomially (in fact, linearly) on log(1/ρ₀), where ρ₀ is the margin of the points given to the algorithm. Namely, ρ₀ = min_{x∈S} |w·x|/‖x‖, where h_w is the target homogeneous halfspace and S is the set of points given to the algorithm. We are dealing with continuous distributions, for which the margin is 0, and therefore we make the following observation. In place of ρ₀ we can use any margin ρ₁ such that the probability of being within margin ≤ ρ₁ of the target hyperplane is small enough to be absorbed into the tolerance of the statistical queries of the Dunagan-Vempala algorithm for margin ρ₁. Formally:

Definition A.1.
For positive δ < 1 and distribution D, we denote

  γ(D, δ) = inf_{‖w‖=1} sup{ γ > 0 : Pr_D[ |w·x|/‖x‖ ≤ γ ] ≤ δ },

namely, the smallest value of γ such that for every halfspace h_w, γ is the largest value for which the probability of being within margin γ of h_w under D is at most δ. Let τ_DV(ρ, ǫ) be the tolerance of the SQ version of the Dunagan-Vempala algorithm when the initial margin equals ρ and the error is set to ǫ. Let

  ρ₁(D, ǫ) = (1/2) · sup_{ρ≥0} { ρ : γ(D, τ_DV(ρ, ǫ/2)/3) ≥ ρ }.

Now, for ρ₁ = ρ₁(D, ǫ) we know that γ(D, τ_DV(ρ₁, ǫ/2)/3) ≥ ρ₁. Let D′ be distribution D conditioned on having margin at least ρ₁ around the target hyperplane h_w. By the definition of the function γ, the probability of being within margin ≤ ρ₁ is at most τ_DV(ρ₁, ǫ/2)/3. Therefore, for any query function g: X × {−1, 1} → [−1, 1],

  |E_D[g(x, h_w(x))] − (1 − τ_DV(ρ₁, ǫ/2)/3) · E_{D′}[g(x, h_w(x))]| ≤ τ_DV(ρ₁, ǫ/2)/3

and hence |E_D[g(x, h_w(x))] − E_{D′}[g(x, h_w(x))]| ≤ 2τ_DV(ρ₁, ǫ/2)/3. This implies that we can obtain an answer to any SQ relative to D′ with tolerance τ_DV(ρ₁, ǫ/2) by using the same SQ relative to D with tolerance τ_DV(ρ₁, ǫ/2)/3. This means that by running the Dunagan-Vempala algorithm in this way we will obtain a hypothesis with error at most ǫ/2 relative to D′. This hypothesis has error at most ǫ/2 + 2τ_DV(ρ₁, ǫ/2)/3 which, without loss of generality, is at most ǫ. Combining these observations about the Dunagan-Vempala algorithm, we obtain the following statement.

Theorem A.2 ([DV04]). There exists a SQ algorithm LearnHS-DV that learns H_d to accuracy 1 − ǫ over any distribution D.
Further, LearnHS-DV outputs a homogeneous halfspace, runs in time polynomial in d, 1/ǫ and log(1/ρ₁), and uses SQs of tolerance ≥ 1/poly(d, 1/ǫ, log(1/ρ₁)), where ρ₁ = ρ₁(D, ǫ).

To apply Theorem A.2 we need to obtain bounds on ρ₁(D, ǫ) for any distribution D on which we might run the Dunagan-Vempala algorithm.

Lemma A.3. Let D be an isotropic log-concave distribution. Then for any δ ∈ (0, 1/20), γ(D, δ) ≥ δ/(6 ln(1/δ)).

Proof. Let γ ∈ (0, 1/16) and let w be any unit vector. We first upper-bound Pr_D[|w·x|/‖x‖ ≤ γ]:

  Pr_D[|w·x|/‖x‖ ≤ γ] ≤ Pr_D[‖x‖ ≤ ln(1/γ) and |w·x|/‖x‖ ≤ γ] + Pr_D[‖x‖ > ln(1/γ)]
   ≤ Pr_D[|w·x| ≤ γ·ln(1/γ)] + Pr_D[‖x‖ > ln(1/γ)].  (8)

By Lemma 5.7 in [LV07], for an isotropic log-concave D and any R > 1, Pr_D[‖x‖ > R] < e^{−R+1}. Therefore Pr_D[‖x‖ > ln(1/γ)] ≤ e·γ. Further, by Lemma 3.2, Pr_D[|w·x| ≤ γ·ln(1/γ)] ≤ 2γ·ln(1/γ). Substituting these inequalities into eq. (8), we obtain that for γ ∈ (0, 1/16),

  Pr_D[|w·x|/‖x‖ ≤ γ] ≤ 2γ·ln(1/γ) + e·γ ≤ 3γ·ln(1/γ).

This implies that for γ = δ/(6 ln(1/δ)) and any unit vector w,

  Pr_D[|w·x|/‖x‖ ≤ γ] ≤ 3·(δ/(6 ln(1/δ)))·(ln(1/δ) + ln(6 ln(1/δ))) ≤ δ,

where we used that for δ < 1/20, 6 ln(1/δ) ≤ 1/δ. By the definition of γ(D, δ), this implies that γ(D, δ) ≥ δ/(6 ln(1/δ)).

We are now ready to prove Theorem A. There exists a SQ algorithm LearnHS that learns H_d to accuracy 1 − ǫ over any distribution D|_χ, where D is an isotropic log-concave distribution and χ: ℝ^d → [0, 1] is a filter function. Further, LearnHS outputs a homogeneous halfspace, runs in time polynomial in d, 1/ǫ and log(1/λ), and uses SQs of tolerance ≥ 1/poly(d, 1/ǫ, log(1/λ)), where λ = E_D[χ(x)].

Proof of Thm. A.
To prove the theorem we bound $\rho_1 = \rho_1(D|_\chi, \epsilon)$ and then apply Theorem A.2. We first observe that for any event $\Lambda$, $\Pr_{D|_\chi}[\Lambda] \leq \Pr_D[\Lambda]/\mathbf{E}_D[\chi]$. Applying this to the event $|w \cdot x|/\|x\| \leq \gamma$ in Definition A.1, we obtain that $\gamma(D|_\chi, \delta) \geq \gamma(D, \delta \cdot \mathbf{E}_D[\chi])$. By Lemma A.3, we get that $\gamma(D|_\chi, \delta) = \Omega(\lambda\delta/\log(1/(\lambda\delta)))$. In addition, by Theorem A.2, $\tau_{DV}(\rho, \epsilon) \geq 1/p(d, 1/\epsilon, \log(1/\rho))$ for some polynomial $p$. This implies that
\[ \gamma(D|_\chi, \tau_{DV}(\rho, \epsilon/2)/3) \geq \gamma\left(D|_\chi, \Omega\left(\frac{1}{p(d, 1/\epsilon, \log(1/\rho))}\right)\right) = \tilde{\Omega}\left(\frac{\lambda}{p(d, 1/\epsilon, \log(1/\rho))}\right). \]
Therefore we obtain that $\rho_1(D|_\chi, \epsilon) = \tilde{\Omega}(\lambda/p(d, 1/\epsilon, \log(1/\lambda)))$. By plugging this bound into Theorem A.2 we obtain the claim.

B Proofs from Section 4

We now prove Lemmas 4.6 and 4.7, which we restate for convenience.

Lemma B.1 (Lem. 4.6 restated). For any $v, w \in S^{d-1}$ such that $\|v - w\| = \Delta \leq \sqrt{2}$ and $\gamma > 0$,
\[ \Pr\left[h_v(x) \neq h_w(x) \;\middle|\; |\langle v, x \rangle| \leq \gamma\right] = \frac{A_{d-3} \int_0^\gamma (1 - r^2)^{(d-3)/2} \int_{\frac{r\sqrt{2 - \Delta^2}}{\Delta\sqrt{1 - r^2}}}^1 (1 - s^2)^{(d-4)/2}\, ds\, dr}{A_{d-2} \int_0^\gamma (1 - r^2)^{(d-3)/2}\, dr}. \]
We denote this probability by $\mathrm{cp}_d(\gamma, \Delta)$.

Proof. By using spherical symmetry, we can assume without loss of generality that $v = (1, 0, 0, \ldots, 0)$ and $w = (\sqrt{1 - \Delta^2/2}, \Delta/\sqrt{2}, 0, 0, \ldots, 0)$. We now examine the surface area of the points that satisfy $h_w(x) = -1$ and $0 \leq \langle v, x \rangle \leq \gamma$ (which is half of the error region at distance at most $\gamma$ from $v$). To compute it we consider the points on $S^{d-1}$ that satisfy $\langle v, x \rangle = r$. These points form a hypersphere $\sigma$ of dimension $d-2$ and radius $\sqrt{1 - r^2}$. In this hypersphere, the points that satisfy $h_w(x) = -1$ are the points $(r, s, x_3, \ldots, x_d) \in S^{d-1}$ for which $r\sqrt{1 - \Delta^2/2} + s\Delta/\sqrt{2} \leq 0$. In other words, $s \geq r\sqrt{2 - \Delta^2}/\Delta$, that is, the points of $\sigma$ which are at least $r\sqrt{2 - \Delta^2}/\Delta$ away from the hyperplane with normal $(0, 1, 0, 0, \ldots, 0)$ passing through the origin of $\sigma$ (also referred to as a hyperspherical cap). As in equation (6), we obtain that its $(d-2)$-dimensional surface area is
\[ (1 - r^2)^{(d-2)/2} \int_{\frac{r\sqrt{2 - \Delta^2}}{\Delta\sqrt{1 - r^2}}}^1 A_{d-3} (1 - s^2)^{(d-4)/2}\, ds. \]
Integrating over all $r$ from $0$ to $\gamma$ gives the surface area of the region $h_w(x) = -1$ and $0 \leq \langle v, x \rangle \leq \gamma$:
\[ \int_0^\gamma (1 - r^2)^{(d-3)/2} \int_{\frac{r\sqrt{2 - \Delta^2}}{\Delta\sqrt{1 - r^2}}}^1 A_{d-3} (1 - s^2)^{(d-4)/2}\, ds\, dr. \]
Hence the conditional probability is as claimed.

Lemma B.2 (Lem. 4.7 restated). For $\Delta \leq \sqrt{2}$, any $d \geq 4$, and $\gamma \geq \Delta/(2\sqrt{d})$,
\[ \partial_\Delta \mathrm{cp}_d(\gamma, \Delta) \geq \frac{1}{56\gamma\sqrt{d}}. \]

Proof. First note that
\[ \tau(\gamma) = \frac{A_{d-3}}{A_{d-2} \int_0^\gamma (1 - r^2)^{(d-3)/2}\, dr} \]
is independent of $\Delta$, and therefore it is sufficient to differentiate
\[ \theta(\gamma, \Delta) = \int_0^\gamma (1 - r^2)^{(d-3)/2} \int_{\frac{r\sqrt{2 - \Delta^2}}{\Delta\sqrt{1 - r^2}}}^1 (1 - s^2)^{(d-4)/2}\, ds\, dr. \]
Let $\gamma' = \Delta/(2\sqrt{d})$ (note that by our assumption $\gamma' \leq \gamma$). By the Leibniz integral rule,
\begin{align*}
\partial_\Delta \theta(\gamma, \Delta) &= \int_0^\gamma (1 - r^2)^{(d-3)/2} \, \partial_\Delta \int_{\frac{r\sqrt{2 - \Delta^2}}{\Delta\sqrt{1 - r^2}}}^1 (1 - s^2)^{(d-4)/2}\, ds\, dr \\
&= \int_0^\gamma (1 - r^2)^{(d-3)/2} \left( 1 - \frac{r^2(2 - \Delta^2)}{\Delta^2(1 - r^2)} \right)^{\frac{d-4}{2}} \cdot \frac{2r}{\Delta^2 \sqrt{1 - r^2} \sqrt{2 - \Delta^2}}\, dr \\
&\geq \int_0^\gamma (1 - r^2)^{(d-4)/2} \left( 1 - \frac{2r^2}{\Delta^2(1 - r^2)} \right)^{\frac{d-4}{2}} \cdot \frac{2r}{\sqrt{2}\,\Delta^2}\, dr \\
&\geq \int_0^{\gamma'} (1 - r^2)^{(d-4)/2} \left( 1 - \frac{2r^2}{\Delta^2(1 - r^2)} \right)^{\frac{d-4}{2}} \cdot \frac{\sqrt{2}\, r}{\Delta^2}\, dr.
\end{align*}
Now using the conditions $\Delta \leq \sqrt{2}$ and $d \geq 4$, we obtain that $\gamma' \leq 1/(2\sqrt{2})$ and hence for all $r \in [0, \gamma']$, $1 - r^2 \geq 7/8$ and $r^2/\Delta^2 \leq \gamma'^2/\Delta^2 = 1/(4d)$. This implies that for all $r \in [0, \gamma']$,
\[ 1 - \frac{2r^2}{\Delta^2(1 - r^2)} \geq 1 - \frac{2}{\frac{7}{8} \cdot 4d} = 1 - \frac{4}{7d}. \]
Now,
\[ \left(1 - \frac{4}{7d}\right)^{(d-4)/2} \geq 1 - \frac{4(d-4)}{14d} \geq \frac{5}{7}. \]
Substituting this into our expression for $\partial_\Delta \theta(\gamma, \Delta)$ we get
\begin{align*}
\partial_\Delta \theta(\gamma, \Delta) &\geq \int_0^{\gamma'} (1 - r^2)^{(d-4)/2} \cdot \frac{\sqrt{2} \cdot 5r}{7\Delta^2}\, dr \geq \frac{1}{\Delta^2} \int_0^{\gamma'} (1 - r^2)^{(d-4)/2} \cdot r\, dr \\
&= \frac{1}{\Delta^2(d-2)} \left( 1 - (1 - \gamma'^2)^{(d-2)/2} \right) \geq \frac{1}{\Delta^2(d-2)} \left( 1 - e^{-\gamma'^2(d-2)/2} \right) \\
&\overset{(*)}{\geq} \frac{1}{\Delta^2(d-2)} \left( 1 - \left(1 - \frac{(d-2)\gamma'^2}{4}\right) \right) = \frac{\gamma'^2}{4\Delta^2} = \frac{1}{16d},
\end{align*}
where to derive $(*)$ we use the fact that $e^{-\gamma'^2(d-2)/2} \leq 1 - \gamma'^2(d-2)/4$, since $e^{-x} \leq 1 - x/2$ for every $x \in [0, 1]$ and $\gamma'^2(d-2)/2 = \frac{\Delta^2(d-2)}{8d} < 1$. At the same time, $\int_0^\gamma (1 - r^2)^{(d-3)/2}\, dr \leq \gamma$, and therefore
\[ \partial_\Delta \mathrm{cp}_d(\gamma, \Delta) = \tau(\gamma) \cdot \partial_\Delta \theta(\gamma, \Delta) \geq \frac{A_{d-3}}{16 d \gamma A_{d-2}} \geq \frac{1}{32\sqrt{3}\,\gamma\sqrt{d}} > \frac{1}{56\gamma\sqrt{d}}, \]
where we used that $A_{d-3}/A_{d-2} \geq \sqrt{d}/(2\sqrt{3})$ (see, e.g., [DKM09]).
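The calculus in the proof of Lemma B.2 can be spot-checked numerically. The sketch below uses hypothetical test values $d = 6$, $\gamma = 0.12$, $\Delta = 0.5$ (chosen only so that the lemma's condition $\gamma \geq \Delta/(2\sqrt{d})$ holds); it compares a finite-difference estimate of $\partial_\Delta \theta(\gamma, \Delta)$ against the closed-form Leibniz-rule integrand, confirms the intermediate bound $\partial_\Delta \theta \geq 1/(16d)$, and checks the elementary inequality $e^{-x} \leq 1 - x/2$ used at $(*)$:

```python
import math

# Numeric spot-check of Lemma B.2 (hypothetical test values, not from the paper).
d, gamma, Delta = 6, 0.12, 0.5
assert gamma >= Delta / (2 * math.sqrt(d))   # condition of the lemma

def inner(r, Del, n=400):
    """Midpoint-rule estimate of the inner integral of (1-s^2)^((d-4)/2) from l(r) to 1."""
    lo = r * math.sqrt(2 - Del ** 2) / (Del * math.sqrt(1 - r ** 2))
    if lo >= 1.0:
        return 0.0
    h = (1.0 - lo) / n
    return h * sum((1 - (lo + (i + 0.5) * h) ** 2) ** ((d - 4) / 2) for i in range(n))

def theta(Del, n=400):
    """theta(gamma, Delta): the double integral differentiated in the proof."""
    h = gamma / n
    return h * sum((1 - ((i + 0.5) * h) ** 2) ** ((d - 3) / 2)
                   * inner((i + 0.5) * h, Del) for i in range(n))

def dtheta(Del, n=4000):
    """partial_Delta theta, via the closed-form integrand from the Leibniz rule."""
    h, tot = gamma / n, 0.0
    for i in range(n):
        r = (i + 0.5) * h
        bracket = 1 - r ** 2 * (2 - Del ** 2) / (Del ** 2 * (1 - r ** 2))
        tot += h * ((1 - r ** 2) ** ((d - 3) / 2) * bracket ** ((d - 4) / 2)
                    * 2 * r / (Del ** 2 * math.sqrt(1 - r ** 2) * math.sqrt(2 - Del ** 2)))
    return tot

fd = (theta(Delta + 1e-4) - theta(Delta - 1e-4)) / 2e-4  # central finite difference
assert abs(fd - dtheta(Delta)) / dtheta(Delta) < 1e-2    # Leibniz-rule form matches
assert dtheta(Delta) >= 1 / (16 * d)                     # intermediate bound 1/(16 d)
# Elementary inequality used at (*): e^{-x} <= 1 - x/2 for x in [0, 1].
assert all(math.exp(-x / 100) <= 1 - x / 200 + 1e-12 for x in range(101))
```

The check is only for one parameter setting; it verifies the algebra of the differentiated integrand, not the lemma in full generality.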