Learning Poisson Binomial Distributions

Constantinos Daskalakis∗ (MIT, costis@csail.mit.edu), Ilias Diakonikolas† (University of Edinburgh, ilias.d@ed.ac.uk), Rocco A. Servedio‡ (Columbia University, rocco@cs.columbia.edu)

February 18, 2015

Abstract. We consider a basic problem in unsupervised learning: learning an unknown Poisson Binomial Distribution. A Poisson Binomial Distribution (PBD) over $\{0, 1, \dots, n\}$ is the distribution of a sum of $n$ independent Bernoulli random variables which may have arbitrary, potentially non-equal, expectations. These distributions were first studied by S. Poisson in 1837 [Poi37] and are a natural $n$-parameter generalization of the familiar Binomial Distribution. Surprisingly, prior to our work this basic learning problem was poorly understood, and known results for it were far from optimal. We essentially settle the complexity of the learning problem for this basic class of distributions. As our first main result we give a highly efficient algorithm which learns to $\epsilon$-accuracy (with respect to the total variation distance) using $\tilde{O}(1/\epsilon^3)$ samples independent of $n$. The running time of the algorithm is quasilinear in the size of its input data, i.e., $\tilde{O}(\log(n)/\epsilon^3)$ bit-operations.¹ (Observe that each draw from the distribution is a $\log(n)$-bit string.) Our second main result is a proper learning algorithm that learns to $\epsilon$-accuracy using $\tilde{O}(1/\epsilon^2)$ samples, and runs in time $(1/\epsilon)^{\mathrm{poly}(\log(1/\epsilon))} \cdot \log n$. This sample complexity is nearly optimal, since any algorithm for this problem must use $\Omega(1/\epsilon^2)$ samples. We also give positive and negative results for some extensions of this learning problem to weighted sums of independent Bernoulli random variables.

1 Introduction

We begin by considering a somewhat fanciful scenario: You are the manager of an independent weekly newspaper in a city of $n$ people.
Each week the $i$-th inhabitant of the city independently picks up a copy of your paper with probability $p_i$. Of course you do not know the values $p_1, \dots, p_n$; each week you only see the total number of papers that have been picked up. For many reasons (advertising, production, revenue analysis, etc.) you would like to have a detailed "snapshot" of the probability distribution (pdf) describing how many readers you have each week. Is there an efficient algorithm to construct a high-accuracy approximation of the pdf from a number of observations that is independent of the population $n$? We show that the answer is "yes."

A Poisson Binomial Distribution of order $n$ is the distribution of a sum $X = \sum_{i=1}^n X_i$, where $X_1, \dots, X_n$ are independent Bernoulli (0/1) random variables. The expectations $(\mathbf{E}[X_i] = p_i)_i$ need not all be the same, and thus these distributions generalize the Binomial distribution $\mathrm{Bin}(n, p)$ and, indeed, comprise a much richer class of distributions. (See Section 1.2 below.)

∗ Research supported by a Sloan Foundation Fellowship, a Microsoft Research Faculty Fellowship, and NSF Award CCF-0953960 (CAREER) and CCF-1101491.
† Research supported by a Simons Foundation Postdoctoral Fellowship. Some of this work was done while at Columbia University, supported by NSF grant CCF-0728736, and by an Alexander S. Onassis Foundation Fellowship.
‡ Supported by NSF grants CNS-0716245, CCF-1115703, and CCF-1319788 and by DARPA award HR0011-08-1-0069.
¹ We write $\tilde{O}(\cdot)$ to hide factors which are polylogarithmic in the argument to $\tilde{O}(\cdot)$; thus, for example, $\tilde{O}(a \log b)$ denotes a quantity which is $O(a \log b \cdot \log^c(a \log b))$ for some absolute constant $c$.
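The newspaper scenario doubles as a description of how to sample from a PBD; a minimal sketch (function and variable names are ours, not the paper's):

```python
import random

def sample_pbd(ps, rng):
    """One draw from the PBD with parameter vector ps = (p_1, ..., p_n):
    each of the n readers independently picks up a paper with probability p_i,
    and we observe only the total count."""
    return sum(1 for p in ps if rng.random() < p)
```

Each draw reveals only the total count, which is exactly the observation model of the learning problem.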
It is believed that Poisson [Poi37] was the first to consider this extension of the Binomial distribution² and the distribution is sometimes referred to as "Poisson's Binomial Distribution" in his honor; we shall simply call these distributions PBDs.

PBDs are one of the most basic classes of discrete distributions; indeed, they are arguably the simplest $n$-parameter probability distribution that has some nontrivial structure. As such they have been intensely studied in probability and statistics (see Section 1.2) and arise in many settings; for example, we note here that tail bounds on PBDs form an important special case of Chernoff/Hoeffding bounds [Che52, Hoe63, DP09]. In application domains, PBDs have many uses in research areas such as survey sampling, case-control studies, and survival analysis; see e.g. [CL97] for a survey of the many uses of these distributions in applications. Given the simplicity and ubiquity of these distributions, it is quite surprising that the problem of density estimation for PBDs (i.e., learning an unknown PBD from independent samples) is not well understood in the statistics or learning theory literature. This is the problem we consider, and essentially settle, in this paper.

We work in a natural PAC-style model of learning an unknown discrete probability distribution which is essentially the model of [KMR+94]. In this learning framework for our problem, the learner is provided with the value of $n$ and with independent samples drawn from an unknown PBD $X$. Using these samples, the learner must with probability at least $1 - \delta$ output a hypothesis distribution $\hat{X}$ such that the total variation distance $d_{TV}(X, \hat{X})$ is at most $\epsilon$, where $\epsilon, \delta > 0$ are accuracy and confidence parameters that are provided to the learner.³
A proper learning algorithm in this framework outputs a distribution that is itself a Poisson Binomial Distribution, i.e., a vector $\hat{p} = (\hat{p}_1, \dots, \hat{p}_n)$ which describes the hypothesis PBD $\hat{X} = \sum_{i=1}^n \hat{X}_i$ where $\mathbf{E}[\hat{X}_i] = \hat{p}_i$.

1.1 Our results.

Our main result is an efficient algorithm for learning PBDs from $\tilde{O}(1/\epsilon^2)$ many samples, independent of $n$. Since PBDs are an $n$-parameter family of distributions over the domain $[n]$, we view such a tight bound as a surprising result. We prove:

Theorem 1 (Main Theorem). Let $X = \sum_{i=1}^n X_i$ be an unknown PBD.

1. [Learning PBDs from constantly many samples] There is an algorithm with the following properties: given $n, \epsilon, \delta$ and access to independent draws from $X$, the algorithm uses $\tilde{O}(1/\epsilon^3) \cdot \log(1/\delta)$ samples from $X$, performs $\tilde{O}(1/\epsilon^3) \cdot \log n \cdot \log^2 \frac{1}{\delta}$ bit operations, and with probability at least $1 - \delta$ outputs a (succinct description of a) distribution $\hat{X}$ over $[n]$ which is such that $d_{TV}(\hat{X}, X) \le \epsilon$.

2. [Properly learning PBDs from constantly many samples] There is an algorithm with the following properties: given $n, \epsilon, \delta$ and access to independent draws from $X$, the algorithm uses $\tilde{O}(1/\epsilon^2) \cdot \log(1/\delta)$ samples from $X$, performs $(1/\epsilon)^{O(\log^2(1/\epsilon))} \cdot \tilde{O}(\log n \cdot \log \frac{1}{\delta})$ bit operations, and with probability at least $1 - \delta$ outputs a (succinct description of a) vector $\hat{p} = (\hat{p}_1, \dots, \hat{p}_n)$ defining a PBD $\hat{X}$ such that $d_{TV}(\hat{X}, X) \le \epsilon$.

² We thank Yuval Peres and Sam Watson for this information [PW11].
³ [KMR+94] used the Kullback-Leibler divergence as their distance measure but we find it more natural to use variation distance.

We note that, since every sample drawn from $X$ is a $\log(n)$-bit string, for constant $\delta$ the number of bit-operations performed by our first algorithm is quasilinear in the length of its input.
Moreover, the sample complexity of both algorithms is close to optimal, since $\Omega(1/\epsilon^2)$ samples are required even to distinguish the (simpler) Binomial distributions $\mathrm{Bin}(n, 1/2)$ and $\mathrm{Bin}(n, 1/2 + \epsilon/\sqrt{n})$, which have total variation distance $\Omega(\epsilon)$. Indeed, in view of this observation, our second algorithm is essentially sample-optimal.

Motivated by these strong learning results for PBDs, we also consider learning a more general class of distributions, namely distributions of the form $X = \sum_{i=1}^n w_i X_i$ which are weighted sums of independent Bernoulli random variables. We give an algorithm which uses $O(\log n)$ samples and runs in $\mathrm{poly}(n)$ time if there are only constantly many different weights in the sum:

Theorem 2 (Learning sums of weighted independent Bernoulli random variables). Let $X = \sum_{i=1}^n a_i X_i$ be a weighted sum of unknown independent Bernoullis such that there are at most $k$ different values among $a_1, \dots, a_n$. Then there is an algorithm with the following properties: given $n, \epsilon, \delta, a_1, \dots, a_n$ and access to independent draws from $X$, it uses $\tilde{O}(k/\epsilon^2) \cdot \log(n) \cdot \log(1/\delta)$ samples from $X$, runs in time $\mathrm{poly}\left(n^k \cdot \epsilon^{-k \log^2(1/\epsilon)} \cdot \log(1/\delta)\right)$, and with probability at least $1 - \delta$ outputs a hypothesis vector $\hat{p} \in [0, 1]^n$ defining independent Bernoulli random variables $\hat{X}_i$ with $\mathbf{E}[\hat{X}_i] = \hat{p}_i$ such that $d_{TV}(\hat{X}, X) \le \epsilon$, where $\hat{X} = \sum_{i=1}^n a_i \hat{X}_i$.

To complement Theorem 2, we also show that if there are many distinct weights in the sum, then even for weights with a very simple structure any learning algorithm must use many samples:

Theorem 3 (Sample complexity lower bound for learning sums of weighted independent Bernoullis). Let $X = \sum_{i=1}^n i \cdot X_i$ be a weighted sum of unknown independent Bernoullis (where the $i$-th weight is simply $i$).
Let $L$ be any learning algorithm which, given $n$ and access to independent draws from $X$, outputs a hypothesis distribution $\hat{X}$ such that $d_{TV}(\hat{X}, X) \le 1/25$ with probability at least $e^{-o(n)}$. Then $L$ must use $\Omega(n)$ samples.

1.2 Related work.

At a high level, there has been a recent surge of interest in the theoretical computer science community on fundamental algorithmic problems involving basic types of probability distributions, see e.g. [KMV10, MV10, BS10, VV11] and other recent papers; our work may be considered as an extension of this theme. More specifically, there is a broad literature in probability theory studying various properties of PBDs; see [Wan93] for an accessible introduction to some of this work. In particular, many results study approximations to the Poisson Binomial distribution via simpler distributions. In a well-known result, Le Cam [Cam60] shows that for any PBD $X = \sum_{i=1}^n X_i$ with $\mathbf{E}[X_i] = p_i$, it holds that

$d_{TV}\left(X, \mathrm{Poi}\left(\sum_{i=1}^n p_i\right)\right) \le 2 \sum_{i=1}^n p_i^2,$

where $\mathrm{Poi}(\lambda)$ is the Poisson distribution with parameter $\lambda$. Subsequently many other proofs of this result and similar ones were given using a range of different techniques; [HC60, Che74, DP86, BHJ92] is a sampling of work along these lines, and Steele [Ste94] gives an extensive list of relevant references. Much work has also been done on approximating PBDs by normal distributions (see e.g. [Ber41, Ess42, Mik93, Vol95]) and by Binomial distributions (see e.g. [Ehm91, Soo96, Roo00]). These results provide structural information about PBDs that can be well-approximated via simpler distributions, but fall short of our goal of obtaining approximations of an unknown PBD up to arbitrary accuracy.
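Le Cam's bound is easy to verify numerically: the exact PBD pmf can be computed by convolving the Bernoulli factors one at a time. A sketch (the dynamic program and helper names are ours, for illustration):

```python
import math

def pbd_pmf(ps):
    """Exact pmf of a PBD: convolve the Bernoulli(p_i) factors one by one."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)      # X_i = 0
            nxt[k + 1] += mass * p        # X_i = 1
        pmf = nxt
    return pmf

def poisson_pmf(lam, m):
    """Poisson(lam) pmf on {0, ..., m-1} (the truncated tail is negligible here)."""
    return [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(m)]

def tv(p, q):
    m = max(len(p), len(q))
    p, q = p + [0.0] * (m - len(p)), q + [0.0] * (m - len(q))
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

ps = [0.05, 0.1, 0.02, 0.08, 0.03]
lecam_lhs = tv(pbd_pmf(ps), poisson_pmf(sum(ps), 40))   # d_TV(X, Poi(sum p_i))
lecam_rhs = 2 * sum(p * p for p in ps)                  # Le Cam's bound
```

As expected for small $p_i$'s, the computed distance sits well below Le Cam's bound $2\sum_i p_i^2$.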
Indeed, the approximations obtained in the probability literature (such as the Poisson, Normal and Binomial approximations) typically depend only on the first few moments of the target PBD. This is attractive from a learning perspective because it is possible to efficiently estimate such moments from random samples, but higher moments are crucial for arbitrary approximation [Roo00].

Taking a different perspective, it is easy to show (see Section 2 of [KG71]) that every PBD is a unimodal distribution over $[n]$. (Recall that a distribution $p$ over $[n]$ is unimodal if there is a value $\ell \in \{0, \dots, n\}$ such that $p(i) \le p(i+1)$ for $i \le \ell$ and $p(i) \ge p(i+1)$ for $i > \ell$.) The learnability of general unimodal distributions over $[n]$ is well understood: Birgé [Bir87a, Bir97] has given a computationally efficient algorithm that can learn any unimodal distribution over $[n]$ to variation distance $\epsilon$ from $O(\log(n)/\epsilon^3)$ samples, and has shown that any algorithm must use $\Omega(\log(n)/\epsilon^3)$ samples. (The [Bir87a, Bir97] upper and lower bounds are stated for continuous unimodal distributions, but the arguments are easily adapted to the discrete case.) Our main result, Theorem 1, shows that the additional PBD assumption can be leveraged to obtain sample complexity independent of $n$ with a computationally highly efficient algorithm.

So, how might one leverage the structure of PBDs to remove $n$ from the sample complexity? A first observation is that a PBD assigns $1 - \epsilon$ of its mass to $O_\epsilon(\sqrt{n})$ points. So one could draw samples to (approximately) identify these points and then try to estimate the probability assigned to each such point, but clearly such an approach, if followed naïvely, would give $\mathrm{poly}(n)$ sample complexity.
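The concentration of a PBD's mass on $O_\epsilon(\sqrt{n})$ points is easy to see numerically in the special case $\mathrm{Bin}(n, 1/2)$; a sketch (helper names are ours, computed in log-space to avoid overflow):

```python
import math

def binom_pmf(n, p):
    """Exact Bin(n, p) pmf via log-gamma; assumes 0 < p < 1."""
    return [math.exp(math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                     + k * math.log(p) + (n - k) * math.log(1 - p))
            for k in range(n + 1)]

def mass_within(pmf, center, radius):
    """Probability mass inside the interval [center - radius, center + radius]."""
    lo, hi = max(0, center - radius), min(len(pmf) - 1, center + radius)
    return sum(pmf[lo:hi + 1])
```

For $n = 10000$, an interval of width $3\sqrt{n} = 300$ around the mean already captures over 99% of the mass of $\mathrm{Bin}(n, 1/2)$, while the full domain has $n + 1$ points.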
Alternatively, one could run Birgé's algorithm on the restricted support of size $O_\epsilon(\sqrt{n})$, but that will not improve the asymptotic sample complexity. A different approach would be to construct a small $\epsilon$-cover (under the total variation distance) of the space of all PBDs on $n$ variables. Indeed, if such a cover has size $N$, it can be shown (see Lemma 10 in Section 3.1, or Chapter 7 of [DL01]) that a target PBD can be learned from $O(\log(N)/\epsilon^2)$ samples. Still, it is easy to argue that any cover needs to have size $\Omega(n)$, so this approach too gives a $\log(n)$ dependence in the sample complexity. Our approach, which removes $n$ completely from the sample complexity, requires a refined understanding of the structure of the set of all PBDs on $n$ variables, in fact one that is more refined than the understanding provided by the aforementioned results (approximating a PBD by a Poisson, Normal, or Binomial distribution). We give an outline of the approach in the next section.

1.3 Our approach.

The starting point of our algorithm for learning PBDs is a theorem of [DP11, Das08] that gives detailed information about the structure of a small $\epsilon$-cover (under the total variation distance) of the space of all PBDs on $n$ variables (see Theorem 4). Roughly speaking, this result says that every PBD is either close to a PBD whose support is sparse, or is close to a translated "heavy" Binomial distribution. Our learning algorithm exploits this structure of the cover; it has two subroutines corresponding to these two different types of distributions that the cover contains.
First, assuming that the target PBD is close to a sparsely supported distribution, it runs Birgé's unimodal distribution learner over a carefully selected subinterval of $[n]$ to construct a hypothesis $H_S$; the (purported) sparsity of the distribution makes it possible for this algorithm to use $\tilde{O}(1/\epsilon^3)$ samples independent of $n$. Then, assuming that the target PBD is close to a translated "heavy" Binomial distribution, the algorithm constructs a hypothesis Translated Poisson distribution $H_P$ [R07] whose mean and variance match the estimated mean and variance of the target PBD; we show that $H_P$ is close to the target PBD if the target PBD is not close to any sparse distribution in the cover. At this point the algorithm has two hypothesis distributions, $H_S$ and $H_P$, one of which should be good; it remains to select one as the final output hypothesis. This is achieved using a form of "hypothesis testing" for probability distributions.

The above sketch captures the main ingredients of Part (1) of Theorem 1, but additional work needs to be done to get the proper learning algorithm of Part (2). For the non-sparse case, first note that the Translated Poisson hypothesis $H_P$ is not a PBD. Via a sequence of transformations we are able to show that the Translated Poisson hypothesis $H_P$ can be converted to a Binomial distribution $\mathrm{Bin}(n', p)$ for some $n' \le n$. To handle the sparse case, we use an alternate learning approach: instead of using Birgé's unimodal algorithm (which would incur a sample complexity of $\Omega(1/\epsilon^3)$), we first show that, in this case, there exists an efficiently constructible $O(\epsilon)$-cover of size $(1/\epsilon)^{O(\log^2(1/\epsilon))}$, and then apply a general learning result that we now describe.
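The "hypothesis testing" step of selecting between two candidate distributions can be implemented with a Scheffé-style comparison. The sketch below is our own simplification: it operates on explicit pmfs rather than the succinct descriptions used in the paper, and is not the paper's exact Choose-Hypothesis routine, but it conveys the idea:

```python
def choose_hypothesis(h1, h2, samples):
    """Scheffé-style test between two candidate pmfs (dicts point -> mass).
    On the set W where h1 outweighs h2, compare each hypothesis's predicted
    mass with the empirical mass of the samples; keep the closer one."""
    support = set(h1) | set(h2)
    W = {x for x in support if h1.get(x, 0.0) > h2.get(x, 0.0)}
    empirical = sum(1 for s in samples if s in W) / len(samples)
    mass1 = sum(h1.get(x, 0.0) for x in W)
    mass2 = sum(h2.get(x, 0.0) for x in W)
    return h1 if abs(mass1 - empirical) <= abs(mass2 - empirical) else h2
```

If either hypothesis is close to the target, standard arguments (as in [DL01]) show the winner of this comparison is also close, which is exactly the guarantee needed from the selection step.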
The general learning result that we use (Lemma 10) is the following: We show that for any class $\mathcal{S}$ of target distributions, if $\mathcal{S}$ has an $\epsilon$-cover of size $N$ then there is a generic algorithm for learning an unknown distribution from $\mathcal{S}$ to accuracy $O(\epsilon)$ that uses $O((\log N)/\epsilon^2)$ samples. Our approach is rather similar to the algorithm of [DL01] for choosing a density estimate (but different in some details); it works by carrying out a tournament that matches every pair of distributions in the cover against each other. Our analysis shows that with high probability some $\epsilon$-accurate distribution in the cover will survive the tournament undefeated, and that any undefeated distribution will with high probability be $O(\epsilon)$-accurate. Applying this general result to the $O(\epsilon)$-cover of size $(1/\epsilon)^{O(\log^2(1/\epsilon))}$ described above, we obtain a PBD that is $O(\epsilon)$-close to the target (this accounts for the increased running time in Part (2) versus Part (1)).

We stress that for both the non-proper and proper learning algorithms sketched above, many technical subtleties and challenges arise in implementing the high-level plan given above, requiring a careful and detailed analysis.

We prove Theorem 2 using the general approach of Lemma 10 specialized to weighted sums of independent Bernoullis with constantly many distinct weights. We show how the tournament can be implemented efficiently for the class $\mathcal{S}$ of weighted sums of independent Bernoullis with constantly many distinct weights, and thus obtain Theorem 2. Finally, the lower bound of Theorem 3 is proved by a direct information-theoretic argument.

1.4 Preliminaries.

Distributions. For a distribution $X$ supported on [n] = {0, 1, ...
, n} we write $X(i)$ to denote the value $\Pr[X = i]$ of the probability density function (pdf) at point $i$, and $X(\le i)$ to denote the value $\Pr[X \le i]$ of the cumulative density function (cdf) at point $i$. For $S \subseteq [n]$, we write $X(S)$ to denote $\sum_{i \in S} X(i)$ and $X_S$ to denote the conditional distribution of $X$ restricted to $S$. Sometimes we write $X(I)$ and $X_I$ for a subset $I \subseteq [0, n]$, meaning $X(I \cap [n])$ and $X_{I \cap [n]}$ respectively.

Total Variation Distance. Recall that the total variation distance between two distributions $X$ and $Y$ over a finite domain $D$ is

$d_{TV}(X, Y) := (1/2) \cdot \sum_{\alpha \in D} |X(\alpha) - Y(\alpha)| = \max_{S \subseteq D} [X(S) - Y(S)].$

Similarly, if $X$ and $Y$ are two random variables ranging over a finite set, their total variation distance $d_{TV}(X, Y)$ is defined as the total variation distance between their distributions. For convenience, we will often blur the distinction between a random variable and its distribution.

Covers. Fix a finite domain $D$, and let $\mathcal{P}$ denote some set of distributions over $D$. Given $\delta > 0$, a subset $\mathcal{Q} \subseteq \mathcal{P}$ is said to be a $\delta$-cover of $\mathcal{P}$ (w.r.t. the total variation distance) if for every distribution $P$ in $\mathcal{P}$ there exists some distribution $Q$ in $\mathcal{Q}$ such that $d_{TV}(P, Q) \le \delta$. We sometimes say that distributions $P$, $Q$ are $\delta$-neighbors if $d_{TV}(P, Q) \le \delta$. If this holds, we also say that $P$ is $\delta$-close to $Q$ and vice versa.

Poisson Binomial Distribution. A Poisson binomial distribution of order $n \in \mathbb{N}$ is a sum $\sum_{i=1}^n X_i$ of $n$ mutually independent Bernoulli random variables $X_1, \dots, X_n$. We denote the set of all Poisson binomial distributions of order $n$ by $\mathcal{S}_n$ and, if $n$ is clear from context, just $\mathcal{S}$. A Poisson binomial distribution $D \in \mathcal{S}_n$ can be represented uniquely as a vector $(p_i)_{i=1}^n$ satisfying $0 \le p_1 \le p_2 \le \dots \le p_n \le 1$. To go from $D \in \mathcal{S}_n$ to its corresponding vector, we find a collection X_1, ...
, X_n of mutually independent Bernoullis such that $\sum_{i=1}^n X_i$ is distributed according to $D$ and $\mathbf{E}[X_1] \le \dots \le \mathbf{E}[X_n]$. (Such a collection exists by the definition of a Poisson binomial distribution.) Then we set $p_i = \mathbf{E}[X_i]$ for all $i$. Lemma 1 of [DP13] shows that the resulting vector $(p_1, \dots, p_n)$ is unique.

We denote by $\mathrm{PBD}(p_1, \dots, p_n)$ the distribution of the sum $\sum_{i=1}^n X_i$ of mutually independent indicators $X_1, \dots, X_n$ with expectations $p_i = \mathbf{E}[X_i]$, for all $i$. Given the above discussion, $\mathrm{PBD}(p_1, \dots, p_n)$ is unique up to permutation of the $p_i$'s. We also sometimes write $\{X_i\}$ to denote the distribution of $\sum_{i=1}^n X_i$. Note the difference between $\{X_i\}$, which refers to the distribution of $\sum_i X_i$, and $\{X_i\}_i$, which refers to the underlying collection of mutually independent Bernoulli random variables.

Translated Poisson Distribution. We will make use of the translated Poisson distribution for approximating the Poisson Binomial distribution. We define the translated Poisson distribution, and state a known result on how well it approximates the Poisson Binomial distribution.

Definition 1 ([R07]). We say that an integer random variable $Y$ is distributed according to the translated Poisson distribution with parameters $\mu$ and $\sigma^2$, denoted $TP(\mu, \sigma^2)$, iff $Y$ can be written as

$Y = \lfloor \mu - \sigma^2 \rfloor + Z,$

where $Z$ is a random variable distributed according to $\mathrm{Poisson}(\sigma^2 + \{\mu - \sigma^2\})$, where $\{\mu - \sigma^2\}$ represents the fractional part of $\mu - \sigma^2$.

The following lemma gives a useful bound on the variation distance between a Poisson Binomial Distribution and a suitable translated Poisson distribution. Note that if the variance of the Poisson Binomial Distribution is large, then the lemma gives a strong bound.

Lemma 1 (see (3.4) of [R07]). Let J_1, ...
, J_n be independent random indicators with $\mathbf{E}[J_i] = p_i$. Then

$d_{TV}\left(\sum_{i=1}^n J_i,\; TP(\mu, \sigma^2)\right) \le \frac{\sqrt{\sum_{i=1}^n p_i^3 (1 - p_i)} + 2}{\sum_{i=1}^n p_i (1 - p_i)},$

where $\mu = \sum_{i=1}^n p_i$ and $\sigma^2 = \sum_{i=1}^n p_i (1 - p_i)$.

The following bound on the total variation distance between translated Poisson distributions will be useful.

Lemma 2 (Lemma 2.1 of [BL06]). For $\mu_1, \mu_2 \in \mathbb{R}$ and $\sigma_1^2, \sigma_2^2 \in \mathbb{R}_+$ with $\lfloor \mu_1 - \sigma_1^2 \rfloor \le \lfloor \mu_2 - \sigma_2^2 \rfloor$, we have

$d_{TV}(TP(\mu_1, \sigma_1^2), TP(\mu_2, \sigma_2^2)) \le \frac{|\mu_1 - \mu_2|}{\sigma_1} + \frac{|\sigma_1^2 - \sigma_2^2| + 1}{\sigma_1^2}.$

Running Times, and Bit Complexity. Throughout this paper, we measure the running times of our algorithms in numbers of bit operations. For a positive integer $n$, we denote by $\langle n \rangle$ its description complexity in binary, namely $\langle n \rangle = \lceil \log_2 n \rceil$. Moreover, we represent a positive rational number $q$ as $\frac{q_1}{q_2}$, where $q_1$ and $q_2$ are relatively prime positive integers. The description complexity of $q$ is defined to be $\langle q \rangle = \langle q_1 \rangle + \langle q_2 \rangle$. We will assume that all $\epsilon$'s and $\delta$'s input to our algorithms are rational numbers.

2 Learning a sum of Bernoulli random variables from $\mathrm{poly}(1/\epsilon)$ samples

In this section, we prove Theorem 1 by providing a sample- and time-efficient algorithm for learning an unknown PBD $X = \sum_{i=1}^n X_i$. We start with an important ingredient in our analysis.

A cover for PBDs. We make use of the following theorem, which provides a cover of the set $\mathcal{S} = \mathcal{S}_n$ of all PBDs of order $n$. The theorem was given implicitly in [DP11] and explicitly as Theorem 1 in [DP13].

Theorem 4 (Cover for PBDs). For all $\epsilon > 0$, there exists an $\epsilon$-cover $\mathcal{S}_\epsilon \subseteq \mathcal{S}$ of $\mathcal{S}$ such that

1. $|\mathcal{S}_\epsilon| \le n^2 + n \cdot \left(\frac{1}{\epsilon}\right)^{O(\log^2(1/\epsilon))}$; and

2. $\mathcal{S}_\epsilon$ can be constructed in time linear in its representation size, i.e., $O(n^2 \log n) + O(n \log n) \cdot \left(\frac{1}{\epsilon}\right)^{O(\log^2(1/\epsilon))}$.
Moreover, if $\{Y_i\} \in \mathcal{S}_\epsilon$, then the collection of $n$ Bernoulli random variables $\{Y_i\}_{i=1,\dots,n}$ has one of the following forms, where $k = k(\epsilon) \le C/\epsilon$ is a positive integer, for some absolute constant $C > 0$:

(i) ($k$-Sparse Form) There is some $\ell \le k^3 = O(1/\epsilon^3)$ such that, for all $i \le \ell$, $\mathbf{E}[Y_i] \in \left\{\frac{1}{k^2}, \frac{2}{k^2}, \dots, \frac{k^2 - 1}{k^2}\right\}$ and, for all $i > \ell$, $\mathbf{E}[Y_i] \in \{0, 1\}$.

(ii) ($k$-heavy Binomial Form) There is some $\ell \in \{1, \dots, n\}$ and $q \in \left\{\frac{1}{n}, \frac{2}{n}, \dots, \frac{n}{n}\right\}$ such that, for all $i \le \ell$, $\mathbf{E}[Y_i] = q$ and, for all $i > \ell$, $\mathbf{E}[Y_i] = 0$; moreover, $\ell, q$ satisfy $\ell q \ge k^2$ and $\ell q (1 - q) \ge k^2 - k - 1$.

Finally, for every $\{X_i\} \in \mathcal{S}$ for which there is no $\epsilon$-neighbor in $\mathcal{S}_\epsilon$ that is in sparse form, there exists some $\{Y_i\} \in \mathcal{S}_\epsilon$ in $k$-heavy Binomial form such that

(iii) $d_{TV}\left(\sum_i X_i, \sum_i Y_i\right) \le \epsilon$; and

(iv) if $\mu = \mathbf{E}[\sum_i X_i]$, $\mu' = \mathbf{E}[\sum_i Y_i]$, $\sigma^2 = \mathrm{Var}[\sum_i X_i]$ and $\sigma'^2 = \mathrm{Var}[\sum_i Y_i]$, then $|\mu - \mu'| = O(1)$ and $|\sigma^2 - \sigma'^2| = O(1 + \epsilon \cdot (1 + \sigma^2))$.

We remark that the cover theorem as stated in [DP13] does not include the part of the above statement following "finally." We provide a proof of this extension in Appendix A.

The Basic Learning Algorithm. The high-level structure of our learning algorithms which give Theorem 1 is provided in Algorithm Learn-PBD of Figure 1. We instantiate this high-level structure, with appropriate technical modifications, in Section 2.4, where we give more detailed descriptions of the non-proper and proper algorithms that give parts (1) and (2) of Theorem 1.

Learn-PBD$(n, \epsilon, \delta)$
1. Run Learn-Sparse$^X(n, \epsilon, \delta/3)$ to get hypothesis distribution $H_S$.
2. Run Learn-Poisson$^X(n, \epsilon, \delta/3)$ to get hypothesis distribution $H_P$.
3. Return the distribution which is the output of Choose-Hypothesis$^X(H_S, H_P, \epsilon, \delta/3)$.
Figure 1: Learn-PBD$(n, \epsilon, \delta)$

At a high level, the subroutine Learn-Sparse is given sample access to $X$ and is designed to find an $\epsilon$-accurate hypothesis $H_S$ with probability at least $1 - \delta/3$, if the unknown PBD $X$ is $\epsilon$-close to some sparse form PBD inside the cover $\mathcal{S}_\epsilon$. Similarly, Learn-Poisson is designed to find an $\epsilon$-accurate hypothesis $H_P$, if $X$ is not $\epsilon$-close to a sparse form PBD (in this case, Theorem 4 implies that $X$ must be $\epsilon$-close to some $k(\epsilon)$-heavy Binomial form PBD). Finally, Choose-Hypothesis is designed to choose one of the two hypotheses $H_S$, $H_P$ as being $\epsilon$-close to $X$. The following subsections specify these subroutines, as well as how the algorithm can be used to establish Theorem 1. We note that Learn-Sparse and Learn-Poisson do not return the distributions $H_S$ and $H_P$ as a list of probabilities for every point in $[n]$. They return instead a succinct description of these distributions in order to keep the running time of the algorithm logarithmic in $n$. Similarly, Choose-Hypothesis operates with succinct descriptions of these distributions.

2.1 Learning when $X$ is close to a sparse form PBD.

Our starting point here is the simple observation that any PBD is a unimodal distribution over the domain $\{0, 1, \dots, n\}$. (There is a simple inductive proof of this, or see Section 2 of [KG71].) This enables us to use the algorithm of Birgé [Bir97] for learning unimodal distributions. We recall Birgé's result, and refer the reader to Appendix B for an explanation of how Theorem 5 as stated below follows from [Bir97].

Theorem 5 ([Bir97]).
For all $n, \epsilon, \delta > 0$, there is an algorithm that draws

$O\left(\frac{\log n}{\epsilon^3} \log\frac{1}{\delta} + \frac{1}{\epsilon^2} \log\frac{1}{\delta} \log\log\frac{1}{\delta}\right)$

samples from an unknown unimodal distribution $X$ over $[n]$, does $\tilde{O}\left(\frac{\log^2 n}{\epsilon^3} \log^2\frac{1}{\delta}\right)$ bit-operations, and outputs a (succinct description of a) hypothesis distribution $H$ over $[n]$ that has the following form: $H$ is uniform over subintervals $[a_1, b_1], [a_2, b_2], \dots, [a_k, b_k]$, whose union $\cup_{i=1}^k [a_i, b_i] = [n]$, where $k = O\left(\frac{\log n}{\epsilon}\right)$. In particular, the algorithm outputs the lists $a_1$ through $a_k$ and $b_1$ through $b_k$, as well as the total probability mass that $H$ assigns to each subinterval $[a_i, b_i]$, $i = 1, \dots, k$. Finally, with probability at least $1 - \delta$, $d_{TV}(X, H) \le \epsilon$.

The main result of this subsection is the following:

Lemma 3. For all $n, \epsilon', \delta' > 0$, there is an algorithm Learn-Sparse$^X(n, \epsilon', \delta')$ that draws

$O\left(\frac{1}{\epsilon'^3} \log\frac{1}{\epsilon'} \log\frac{1}{\delta'} + \frac{1}{\epsilon'^2} \log\frac{1}{\delta'} \log\log\frac{1}{\delta'}\right)$

samples from a target PBD $X$ over $[n]$, does $\log n \cdot \tilde{O}\left(\frac{1}{\epsilon'^3} \log^2\frac{1}{\delta'}\right)$ bit operations, and outputs a (succinct description of a) hypothesis distribution $H_S$ over $[n]$ that has the following form: its support is contained in an explicitly specified interval $[a, b] \subset [n]$, where $|b - a| = O(1/\epsilon'^3)$, and for every point in $[a, b]$ the algorithm explicitly specifies the probability assigned to that point by $H_S$.⁴ The algorithm has the following guarantee: if $X$ is $\epsilon'$-close to some sparse form PBD $Y$ in the cover $\mathcal{S}_{\epsilon'}$ of Theorem 4, then with probability at least $1 - \delta'$, $d_{TV}(X, H_S) \le c_1 \epsilon'$, for some absolute constant $c_1 \ge 1$, and the support of $H_S$ lies in the support of $Y$.

The high-level idea of Lemma 3 is quite simple. We truncate $O(\epsilon')$ of the probability mass from each end of $X$ to obtain a conditional distribution $X_{[\hat{a}, \hat{b}]}$; since $X$ is unimodal so is $X_{[\hat{a}, \hat{b}]}$.
If $\hat{b} - \hat{a}$ is larger than $O(1/\epsilon'^3)$ then the algorithm outputs "fail" (and $X$ could not have been close to a sparse-form distribution in the cover). Otherwise, we use Birgé's algorithm to learn the unimodal distribution $X_{[\hat{a}, \hat{b}]}$. A detailed description of the algorithm is given in Figure 2 below.

Proof of Lemma 3: As described in Figure 2, algorithm Learn-Sparse$^X(n, \epsilon', \delta')$ first draws $M = 32\log(8/\delta')/\epsilon'^2$ samples from $X$ and sorts them to obtain a list of values $0 \le s_1 \le \dots \le s_M \le n$. We claim the following about the values $\hat{a}$ and $\hat{b}$ defined in Step 2 of the algorithm:

Claim 4. With probability at least $1 - \delta'/2$, we have $X(\le \hat{a}) \in [3\epsilon'/2, 5\epsilon'/2]$ and $X(\le \hat{b}) \in [1 - 5\epsilon'/2, 1 - 3\epsilon'/2]$.

Proof. We only show that $X(\le \hat{a}) \ge 3\epsilon'/2$ with probability at least $1 - \delta'/8$, since the arguments for $X(\le \hat{a}) \le 5\epsilon'/2$, $X(\le \hat{b}) \le 1 - 3\epsilon'/2$ and $X(\le \hat{b}) \ge 1 - 5\epsilon'/2$ are identical. Given that each of these conditions is met with probability at least $1 - \delta'/8$, the union bound establishes our claim.

⁴ In particular, our algorithm will output a list of pointers, mapping every point in $[a, b]$ to some memory location where the probability assigned to that point by $H_S$ is written.

Learn-Sparse$^X(n, \epsilon', \delta')$
1. Draw $M = 32\log(8/\delta')/\epsilon'^2$ samples from $X$ and sort them to obtain a list of values $0 \le s_1 \le \dots \le s_M \le n$.
2. Define $\hat{a} := s_{\lceil 2\epsilon' M \rceil}$ and $\hat{b} := s_{\lfloor (1 - 2\epsilon') M \rfloor}$.
3. If $\hat{b} - \hat{a} > (C/\epsilon')^3$ (where $C$ is the constant in the statement of Theorem 4), output "fail" and return the (trivial) hypothesis which puts probability mass 1 on the point 0.
4. Otherwise, run Birgé's unimodal distribution learner (Theorem 5) on the conditional distribution $X_{[\hat{a}, \hat{b}]}$ and output the hypothesis that it returns.
Figure 2: Learn-Sparse$^X(n,\epsilon',\delta')$

To show that $X(\le \hat{a}) \ge 3\epsilon'/2$ is satisfied with probability at least $1-\delta'/8$ we argue as follows. Let $\alpha' = \max\{i \mid X(\le i) < 3\epsilon'/2\}$. Clearly, $X(\le \alpha') < 3\epsilon'/2$ while $X(\le \alpha'+1) \ge 3\epsilon'/2$. Given this, if $M$ samples are drawn from $X$ then the expected number of them that are $\le \alpha'$ is at most $3\epsilon' M/2$. It then follows from the Chernoff bound that the probability that more than $\frac{7}{4}\epsilon' M$ samples are $\le \alpha'$ is at most $e^{-(\epsilon'/4)^2 M/2} \le \delta'/8$. Hence, except with this failure probability, we have $\hat{a} \ge \alpha'+1$, which implies that $X(\le \hat{a}) \ge 3\epsilon'/2$.

As specified in Steps 3 and 4, if $\hat{b}-\hat{a} > (C/\epsilon')^3$, where $C$ is the constant in the statement of Theorem 4, the algorithm outputs "fail", returning the trivial hypothesis which puts probability mass $1$ on the point $0$. Otherwise, the algorithm runs Birgé's unimodal distribution learner (Theorem 5) on the conditional distribution $X_{[\hat{a},\hat{b}]}$ and outputs the result of Birgé's algorithm. Since $X$ is unimodal, it follows that $X_{[\hat{a},\hat{b}]}$ is also unimodal, hence Birgé's algorithm is appropriate for learning it. The way we apply Birgé's algorithm to learn $X_{[\hat{a},\hat{b}]}$ given samples from the original distribution $X$ is the obvious one: we draw samples from $X$, ignoring all samples that fall outside of $[\hat{a},\hat{b}]$, until the right $O(\log(1/\delta')\log(1/\epsilon')/\epsilon'^3)$ number of samples fall inside $[\hat{a},\hat{b}]$, as required by Birgé's algorithm for learning a distribution of support of size $(C/\epsilon')^3$ with probability at least $1-\delta'/4$. Once we have the right number of samples in $[\hat{a},\hat{b}]$, we run Birgé's algorithm to learn the conditional distribution $X_{[\hat{a},\hat{b}]}$.
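The rejection-sampling loop described above, drawing from $X$ and discarding samples outside $[\hat a,\hat b]$, can be sketched as follows (`draw_sample` again stands in for sample access to $X$; all names are ours):

```python
import random

def collect_conditional_samples(draw_sample, a_hat, b_hat, k):
    """Draw samples from X, discarding those outside [a_hat, b_hat], until
    k samples of the conditional distribution X_[a_hat, b_hat] remain."""
    samples = []
    while len(samples) < k:
        x = draw_sample()
        if a_hat <= x <= b_hat:
            samples.append(x)
    return samples
```

Since $X([\hat a,\hat b]) = 1 - O(\epsilon')$, only an $O(\epsilon')$ fraction of draws is discarded in expectation, which is why the total number of draws remains of the same order as $k$.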
Note that the number of samples we need to draw from $X$ until the right $O(\log(1/\delta')\log(1/\epsilon')/\epsilon'^3)$ number of samples fall inside $[\hat a,\hat b]$ is still $O(\log(1/\delta')\log(1/\epsilon')/\epsilon'^3)$, with probability at least $1-\delta'/4$. Indeed, since $X([\hat a,\hat b]) = 1 - O(\epsilon')$, it follows from the Chernoff bound that with probability at least $1-\delta'/4$, if $K = \Theta(\log(1/\delta')\log(1/\epsilon')/\epsilon'^3)$ samples are drawn from $X$, at least $K(1-O(\epsilon'))$ fall inside $[\hat a,\hat b]$.

Analysis: It is easy to see that the sample complexity of our algorithm is as promised. For the running time, notice that, if Birgé's algorithm is invoked, it will return two lists of numbers $a_1$ through $a_k$ and $b_1$ through $b_k$, as well as a list of probability masses $q_1,\ldots,q_k$ assigned to each subinterval $[a_i,b_i]$, $i=1,\ldots,k$, by the hypothesis distribution $H_S$, where $k = O(\log(1/\epsilon')/\epsilon')$. In linear time, we can compute a list of probabilities $\hat q_1,\ldots,\hat q_k$, representing the probability assigned by $H_S$ to every point of subinterval $[a_i,b_i]$, for $i=1,\ldots,k$. So we can represent our output hypothesis $H_S$ via a data structure that maintains $O(1/\epsilon'^3)$ pointers, having one pointer per point inside $[a,b]$. The pointers map points to probabilities assigned by $H_S$ to these points. Thus turning the output of Birgé's algorithm into an explicit distribution over $[a,b]$ incurs linear overhead in our running time, and hence the running time of our algorithm is also as promised. (See Appendix B for an explanation of the running time of Birgé's algorithm.) Moreover, we also note that the output distribution has the promised structure, since in one case it has a single atom at $0$ and in the other case it is the output of Birgé's algorithm on a distribution of support of size $(C/\epsilon')^3$.
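The linear-time conversion from the interval representation to explicit per-point probabilities can be sketched as follows (a sketch under the assumption that the learner's output is given as parallel lists of intervals $[a_i,b_i]$ and total masses $q_i$, with the hypothesis uniform within each interval; the function name is ours):

```python
def expand_to_pointwise(intervals, masses):
    """Expand a flattened-histogram hypothesis -- subintervals [a_i, b_i]
    carrying total mass q_i, uniform within each subinterval -- into an
    explicit map from each support point to its probability."""
    probs = {}
    for (a, b), q in zip(intervals, masses):
        per_point = q / (b - a + 1)   # spread q_i over the b_i - a_i + 1 points
        for x in range(a, b + 1):
            probs[x] = per_point
    return probs
```

The total work is linear in the support size, matching the linear overhead claimed in the analysis above.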
It only remains to justify the last part of the lemma. Let $Y$ be the sparse-form PBD that $X$ is close to; say that $Y$ is supported on $\{a',\ldots,b'\}$ where $b'-a' \le (C/\epsilon')^3$. Since $X$ is $\epsilon'$-close to $Y$ in total variation distance, it must be the case that $X(\le a'-1) \le \epsilon'$. Since $X(\le \hat a) \ge 3\epsilon'/2$ by Claim 4, it must be the case that $\hat a \ge a'$. Similar arguments give that $\hat b \le b'$. So the interval $[\hat a,\hat b]$ is contained in $[a',b']$ and has length at most $(C/\epsilon')^3$. This means that Birgé's algorithm is indeed used correctly by our algorithm to learn $X_{[\hat a,\hat b]}$, with probability at least $1-\delta'/2$ (that is, unless Claim 4 fails). Now it follows from the correctness of Birgé's algorithm (Theorem 5) and the discussion above that the hypothesis $H_S$ output when Birgé's algorithm is invoked satisfies $d_{TV}(H_S, X_{[\hat a,\hat b]}) \le \epsilon'$, with probability at least $1-\delta'/2$, i.e., unless either Birgé's algorithm fails, or we fail to get the right number of samples landing inside $[\hat a,\hat b]$. To conclude the proof of the lemma we note that:
$$2\, d_{TV}(X, X_{[\hat a,\hat b]}) = \sum_{i \in [\hat a,\hat b]} |X_{[\hat a,\hat b]}(i) - X(i)| + \sum_{i \notin [\hat a,\hat b]} |X_{[\hat a,\hat b]}(i) - X(i)|$$
$$= \sum_{i \in [\hat a,\hat b]} \left( \frac{1}{X([\hat a,\hat b])} X(i) - X(i) \right) + \sum_{i \notin [\hat a,\hat b]} X(i)$$
$$= \sum_{i \in [\hat a,\hat b]} \left( \frac{1}{1-O(\epsilon')} X(i) - X(i) \right) + O(\epsilon') = \frac{O(\epsilon')}{1-O(\epsilon')} \sum_{i \in [\hat a,\hat b]} X(i) + O(\epsilon') = O(\epsilon').$$
So the triangle inequality gives $d_{TV}(H_S, X) = O(\epsilon')$, and Lemma 3 is proved.

2.2 Learning when X is close to a k-heavy Binomial Form PBD.

Lemma 5. For all $n, \epsilon', \delta' > 0$, there is an algorithm Learn-Poisson$^X(n,\epsilon',\delta')$ that draws $O(\log(1/\delta')/\epsilon'^2)$ samples from a target PBD $X$ over $[n]$, performs $O(\log n \cdot \log(1/\delta')/\epsilon'^2)$ bit operations, and returns two parameters $\hat\mu$ and $\hat\sigma^2$.
The algorithm has the following guarantee: Suppose $X$ is not $\epsilon'$-close to any sparse form PBD in the cover $\mathcal{S}_{\epsilon'}$ of Theorem 4. Let $H_P = TP(\hat\mu,\hat\sigma^2)$ be the translated Poisson distribution with parameters $\hat\mu$ and $\hat\sigma^2$. Then with probability at least $1-\delta'$ we have $d_{TV}(X, H_P) \le c_2\epsilon'$ for some absolute constant $c_2 \ge 1$.

Our proof plan is to exploit the structure of the cover of Theorem 4. In particular, if $X$ is not $\epsilon'$-close to any sparse form PBD in the cover, it must be $\epsilon'$-close to a PBD in heavy Binomial form with approximately the same mean and variance as $X$, as specified by the final part of the cover theorem. Hence, a natural strategy is to obtain estimates $\hat\mu$ and $\hat\sigma^2$ of the mean and variance of the unknown PBD $X$, and output as a hypothesis a translated Poisson distribution with parameters $\hat\mu$ and $\hat\sigma^2$. We show that this strategy is a successful one. Before providing the details, we highlight two facts that we will establish in the subsequent analysis and that will be used later. The first is that, assuming $X$ is not $\epsilon'$-close to any sparse form PBD in the cover $\mathcal{S}_{\epsilon'}$, its variance $\sigma^2$ satisfies
$$\sigma^2 = \Omega(1/\epsilon'^2) \ge \theta^2 \quad \text{for some universal constant } \theta. \quad (1)$$
The second is that under the same assumption, the estimates $\hat\mu$ and $\hat\sigma^2$ of the mean $\mu$ and variance $\sigma^2$ of $X$ that we obtain satisfy the following bounds with probability at least $1-\delta$:
$$|\mu - \hat\mu| \le \epsilon' \cdot \sigma \quad \text{and} \quad |\sigma^2 - \hat\sigma^2| \le \epsilon' \cdot \sigma^2. \quad (2)$$

Learn-Poisson$^X(n,\epsilon',\delta')$
1. Let $\epsilon = \epsilon'/\sqrt{4 + \frac{1}{\theta^2}}$ and $\delta = \delta'$.
2. Run algorithm $A(n,\epsilon,\delta)$ to obtain an estimate $\hat\mu$ of $E[X]$ and an estimate $\hat\sigma^2$ of $\mathrm{Var}[X]$.
3. Output the translated Poisson distribution $TP(\hat\mu,\hat\sigma^2)$.

Figure 3: Learn-Poisson$^X(n,\epsilon',\delta')$. The value $\theta$ used in Line 1 is the universal constant specified in the proof of Lemma 5.

$A(n,\epsilon,\delta)$
1. Let $r = O(\log 1/\delta)$. For $i = 1,\ldots,r$ repeat the following:
   (a) Draw $m = \lceil 3/\epsilon^2 \rceil$ independent samples $Z_{i,1},\ldots,Z_{i,m}$ from $X$.
   (b) Let $\hat\mu_i = \frac{\sum_j Z_{i,j}}{m}$ and $\hat\sigma_i^2 = \frac{\sum_j (Z_{i,j} - \frac{1}{m}\sum_k Z_{i,k})^2}{m-1}$.
2. Set $\hat\mu$ to be the median of $\hat\mu_1,\ldots,\hat\mu_r$ and set $\hat\sigma^2$ to be the median of $\hat\sigma_1^2,\ldots,\hat\sigma_r^2$.
3. Output $\hat\mu$ and $\hat\sigma^2$.

Figure 4: $A(n,\epsilon,\delta)$

See Figure 3 and the associated Figure 4 for a detailed description of the Learn-Poisson$^X(n,\epsilon',\delta')$ algorithm.

Proof of Lemma 5: We start by showing that we can estimate the mean and variance of the target PBD $X$.

Lemma 6. For all $n, \epsilon, \delta > 0$, there exists an algorithm $A(n,\epsilon,\delta)$ with the following properties: given access to a PBD $X$ of order $n$, it produces estimates $\hat\mu$ and $\hat\sigma^2$ for $\mu = E[X]$ and $\sigma^2 = \mathrm{Var}[X]$ respectively, such that with probability at least $1-\delta$:
$$|\mu - \hat\mu| \le \epsilon \cdot \sigma \quad \text{and} \quad |\sigma^2 - \hat\sigma^2| \le \epsilon \cdot \sigma^2 \sqrt{4 + \frac{1}{\sigma^2}}.$$
The algorithm uses $O(\log(1/\delta)/\epsilon^2)$ samples and runs in time $O(\log n \log(1/\delta)/\epsilon^2)$.

Proof. We treat the estimation of $\mu$ and $\sigma^2$ separately. For both estimation problems we show how to use $O(1/\epsilon^2)$ samples to obtain estimates $\hat\mu$ and $\hat\sigma^2$ achieving the required guarantees with probability at least $2/3$ (we refer to these as "weak estimators"). Then a routine procedure allows us to boost the success probability to $1-\delta$ at the expense of a multiplicative factor $O(\log 1/\delta)$ on the number of samples. While we omit the details of the routine boosting argument, we remind the reader that it involves running the weak estimator $O(\log 1/\delta)$ times to obtain estimates $\hat\mu_1,\ldots,\hat\mu_{O(\log 1/\delta)}$ and outputting the median of these estimates, and similarly for estimating $\sigma^2$. We proceed to specify and analyze the weak estimators for $\mu$ and $\sigma^2$ separately:

• Weak estimator for $\mu$: Let $Z_1,\ldots,Z_m$ be independent samples from $X$, and let $\hat\mu = \frac{\sum_i Z_i}{m}$.
Then $E[\hat\mu] = \mu$ and $\mathrm{Var}[\hat\mu] = \frac{1}{m}\mathrm{Var}[X] = \frac{1}{m}\sigma^2$. So Chebyshev's inequality implies that
$$\Pr[|\hat\mu - \mu| \ge t\sigma/\sqrt{m}] \le \frac{1}{t^2}.$$
Choosing $t = \sqrt{3}$ and $m = \lceil 3/\epsilon^2 \rceil$, the above imply that $|\hat\mu - \mu| \le \epsilon\sigma$ with probability at least $2/3$.

• Weak estimator for $\sigma^2$: Let $Z_1,\ldots,Z_m$ be independent samples from $X$, and let $\hat\sigma^2 = \frac{\sum_i (Z_i - \frac{1}{m}\sum_i Z_i)^2}{m-1}$ be the unbiased sample variance. (Note the use of Bessel's correction.) Then it can be checked [Joh03] that $E[\hat\sigma^2] = \sigma^2$ and
$$\mathrm{Var}[\hat\sigma^2] = \sigma^4 \left( \frac{2}{m-1} + \frac{\kappa}{m} \right),$$
where $\kappa$ is the excess kurtosis of the distribution of $X$ (i.e., $\kappa = \frac{E[(X-\mu)^4]}{\sigma^4} - 3$). To bound $\kappa$ in terms of $\sigma^2$, suppose that $X = \sum_{i=1}^n X_i$, where $E[X_i] = p_i$ for all $i$. Then
$$\kappa = \frac{1}{\sigma^4} \sum_i (1 - 6p_i(1-p_i))(1-p_i)p_i \ \text{(see [NJ05])} \ \le \frac{1}{\sigma^4} \sum_i (1-p_i)p_i = \frac{1}{\sigma^2}.$$
Hence, $\mathrm{Var}[\hat\sigma^2] = \sigma^4\left(\frac{2}{m-1} + \frac{\kappa}{m}\right) \le \frac{\sigma^4}{m}\left(4 + \frac{1}{\sigma^2}\right)$. So Chebyshev's inequality implies that
$$\Pr\left[ |\hat\sigma^2 - \sigma^2| \ge \frac{t\sigma^2}{\sqrt{m}} \sqrt{4 + \frac{1}{\sigma^2}} \right] \le \frac{1}{t^2}.$$
Choosing $t = \sqrt{3}$ and $m = \lceil 3/\epsilon^2 \rceil$, the above imply that $|\hat\sigma^2 - \sigma^2| \le \epsilon\sigma^2\sqrt{4 + \frac{1}{\sigma^2}}$ with probability at least $2/3$.

We proceed to prove Lemma 5. Learn-Poisson$^X(n,\epsilon',\delta')$ runs $A(n,\epsilon,\delta)$ from Lemma 6 with appropriately chosen $\epsilon = \epsilon(\epsilon')$ and $\delta = \delta(\delta')$, given below, and then outputs the translated Poisson distribution $TP(\hat\mu,\hat\sigma^2)$, where $\hat\mu$ and $\hat\sigma^2$ are the estimated mean and variance of $X$ output by $A$. Next, we show how to choose $\epsilon$ and $\delta$, as well as why the desired guarantees are satisfied by the output distribution. If $X$ is not $\epsilon'$-close to any PBD in sparse form inside the cover $\mathcal{S}_{\epsilon'}$ of Theorem 4, there exists a PBD $Z$ in $(k = O(1/\epsilon'))$-heavy Binomial form inside $\mathcal{S}_{\epsilon'}$ that is within total variation distance $\epsilon'$ from $X$. We use the existence of such a $Z$ to obtain lower bounds on the mean and variance of $X$.
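A direct Python rendering of Figure 4, combining the two weak estimators with the median-boosting step, might look as follows (a sketch: the hidden constant in $r = O(\log 1/\delta)$ is set to $1$ for concreteness, and `draw_sample` stands in for sample access to $X$):

```python
import math
import random

def estimate_mean_and_variance(draw_sample, eps, delta):
    """Algorithm A of Figure 4: run the weak mean/variance estimators
    r = O(log 1/delta) times and output the medians of the estimates."""
    r = max(1, math.ceil(math.log(1 / delta)))   # number of repetitions
    m = math.ceil(3 / eps ** 2)                  # samples per repetition
    mu_estimates, var_estimates = [], []
    for _ in range(r):
        z = [draw_sample() for _ in range(m)]
        mu_i = sum(z) / m
        # unbiased sample variance, with Bessel's correction (divide by m - 1)
        var_i = sum((x - mu_i) ** 2 for x in z) / (m - 1)
        mu_estimates.append(mu_i)
        var_estimates.append(var_i)
    mu_estimates.sort()
    var_estimates.sort()
    return mu_estimates[r // 2], var_estimates[r // 2]
```

Each repetition succeeds with probability at least $2/3$ by Chebyshev's inequality, so by a Chernoff bound the median over $O(\log 1/\delta)$ repetitions succeeds with probability at least $1-\delta$.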
Indeed, suppose that the distribution of $Z$ is $\mathrm{Bin}(\ell, q)$, a Binomial with parameters $\ell, q$. Then Theorem 4 certifies that the following conditions are satisfied by the parameters $\ell, q$, $\mu = E[X]$ and $\sigma^2 = \mathrm{Var}[X]$:
(a) $\ell q \ge k^2$;
(b) $\ell q(1-q) \ge k^2 - k - 1$;
(c) $|\ell q - \mu| = O(1)$; and
(d) $|\ell q(1-q) - \sigma^2| = O(1 + \epsilon' \cdot (1 + \sigma^2))$.
In particular, conditions (b) and (d) above imply that $\sigma^2 = \Omega(k^2) = \Omega(1/\epsilon'^2) \ge \theta^2$, for some universal constant $\theta$, establishing (1). In terms of this $\theta$, we choose $\epsilon = \epsilon'/\sqrt{4 + \frac{1}{\theta^2}}$ and $\delta = \delta'$ for the application of Lemma 6, to obtain, from $O(\log(1/\delta')/\epsilon'^2)$ samples, estimates $\hat\mu$ and $\hat\sigma^2$ of $\mu$ and $\sigma^2$. From our choice of parameters and the guarantees of Lemma 6, it follows that, if $X$ is not $\epsilon'$-close to any PBD in sparse form inside the cover $\mathcal{S}_{\epsilon'}$, then with probability at least $1-\delta'$ the estimates $\hat\mu$ and $\hat\sigma^2$ satisfy
$$|\mu - \hat\mu| \le \epsilon' \cdot \sigma \quad \text{and} \quad |\sigma^2 - \hat\sigma^2| \le \epsilon' \cdot \sigma^2,$$
establishing (2). Moreover, if $Y$ is a random variable distributed according to the translated Poisson distribution $TP(\hat\mu,\hat\sigma^2)$, we show that $X$ and $Y$ are within $O(\epsilon')$ in total variation distance, concluding the proof of Lemma 5.

Claim 7. If $X$ and $Y$ are as above, then $d_{TV}(X,Y) \le O(\epsilon')$.

Proof. We make use of Lemma 1. Suppose that $X = \sum_{i=1}^n X_i$, where $E[X_i] = p_i$ for all $i$. Lemma 1 implies that
$$d_{TV}(X, TP(\mu,\sigma^2)) \le \frac{\sqrt{\sum_i p_i^3(1-p_i)} + 2}{\sum_i p_i(1-p_i)} \le \frac{\sqrt{\sum_i p_i(1-p_i)} + 2}{\sum_i p_i(1-p_i)} \le \frac{1}{\sqrt{\sum_i p_i(1-p_i)}} + \frac{2}{\sum_i p_i(1-p_i)} = \frac{1}{\sigma} + \frac{2}{\sigma^2} = O(\epsilon'). \quad (3)$$
It remains to bound the total variation distance between the translated Poisson distributions $TP(\mu,\sigma^2)$ and $TP(\hat\mu,\hat\sigma^2)$. For this we use Lemma 2.
Lemma 2 implies that
$$d_{TV}(TP(\mu,\sigma^2), TP(\hat\mu,\hat\sigma^2)) \le \frac{|\mu - \hat\mu|}{\min(\sigma,\hat\sigma)} + \frac{|\sigma^2 - \hat\sigma^2| + 1}{\min(\sigma^2,\hat\sigma^2)} \le \frac{\epsilon'\sigma}{\min(\sigma,\hat\sigma)} + \frac{\epsilon' \cdot \sigma^2 + 1}{\min(\sigma^2,\hat\sigma^2)} \le \frac{\epsilon'\sigma}{\sigma\sqrt{1-\epsilon'}} + \frac{\epsilon' \cdot \sigma^2 + 1}{\sigma^2(1-\epsilon')} = O(\epsilon') + \frac{O(1)}{\sigma^2} = O(\epsilon') + O(\epsilon'^2) = O(\epsilon'), \quad (4)$$
where the third inequality uses that $\hat\sigma^2 \ge (1-\epsilon')\sigma^2$ by (2), and the final step uses $1/\sigma^2 = O(\epsilon'^2)$ from (1). The claim follows from (3), (4) and the triangle inequality.

The proof of Lemma 5 is concluded. We remark that the algorithm described above does not need to know a priori whether or not $X$ is $\epsilon'$-close to a PBD in sparse form inside the cover $\mathcal{S}_{\epsilon'}$ of Theorem 4. The algorithm simply runs the estimator of Lemma 6 with $\epsilon = \epsilon'/\sqrt{4 + \frac{1}{\theta^2}}$ and $\delta = \delta'$ and outputs whatever estimates $\hat\mu$ and $\hat\sigma^2$ the algorithm of Lemma 6 produces.

2.3 Hypothesis testing.

Our hypothesis testing routine Choose-Hypothesis$^X$ uses samples from the unknown distribution $X$ to run a "competition" between two candidate hypothesis distributions $H_1$ and $H_2$ over $[n]$ that are given in the input. We show that if at least one of the two candidate hypotheses is close to the unknown distribution $X$, then with high probability over the samples drawn from $X$ the routine selects as winner a candidate that is close to $X$. This basic approach of running a competition between candidate hypotheses is quite similar to the "Scheffé estimate" proposed by Devroye and Lugosi (see [DL96b, DL96a] and Chapter 6 of [DL01], as well as [Yat85]), but our notion of competition here is different.

We obtain the following lemma, postponing all running-time analysis to the next section.

Lemma 8. There is an algorithm Choose-Hypothesis$^X(H_1, H_2, \epsilon', \delta')$ which is given sample access to distribution $X$, two hypothesis distributions $H_1, H_2$ for $X$, an accuracy parameter $\epsilon' > 0$, and a confidence parameter $\delta' > 0$. It makes $m = O(\log(1/\delta')/\epsilon'^2)$ draws from $X$ and returns some $H \in \{H_1, H_2\}$.
If $d_{TV}(H_i, X) \le \epsilon'$ for some $i \in \{1,2\}$, then with probability at least $1-\delta'$ the distribution $H$ that Choose-Hypothesis returns has $d_{TV}(H, X) \le 6\epsilon'$.

Proof of Lemma 8: Figure 5 describes how the competition between $H_1$ and $H_2$ is carried out.

Choose-Hypothesis$(H_1, H_2, \epsilon', \delta')$

INPUT: Sample access to distribution $X$; a pair of hypothesis distributions $(H_1, H_2)$; $\epsilon', \delta' > 0$.

Let $\mathcal{W}$ be the support of $X$, $\mathcal{W}_1 = \mathcal{W}_1(H_1, H_2) := \{w \in \mathcal{W} \mid H_1(w) > H_2(w)\}$, and $p_1 = H_1(\mathcal{W}_1)$, $p_2 = H_2(\mathcal{W}_1)$. /* Clearly, $p_1 > p_2$ and $d_{TV}(H_1, H_2) = p_1 - p_2$. */

1. If $p_1 - p_2 \le 5\epsilon'$, declare a draw and return either $H_i$. Otherwise:
2. Draw $m = \frac{2\log(1/\delta')}{\epsilon'^2}$ samples $s_1,\ldots,s_m$ from $X$, and let $\tau = \frac{1}{m}|\{i \mid s_i \in \mathcal{W}_1\}|$ be the fraction of samples that fall inside $\mathcal{W}_1$.
3. If $\tau > p_1 - \frac{3}{2}\epsilon'$, declare $H_1$ as winner and return $H_1$; otherwise,
4. if $\tau < p_2 + \frac{3}{2}\epsilon'$, declare $H_2$ as winner and return $H_2$; otherwise,
5. declare a draw and return either $H_i$.

Figure 5: Choose-Hypothesis$(H_1, H_2, \epsilon', \delta')$

The correctness of Choose-Hypothesis is an immediate consequence of the following claim. (In fact, for Lemma 8 we only need item (i) below, but item (ii) will be handy later in the proof of Lemma 10.)

Claim 9. Suppose that $d_{TV}(X, H_i) \le \epsilon'$ for some $i \in \{1,2\}$. Then:
(i) if $d_{TV}(X, H_{3-i}) > 6\epsilon'$, the probability that Choose-Hypothesis$^X(H_1, H_2, \epsilon', \delta')$ does not declare $H_i$ as the winner is at most $2e^{-m\epsilon'^2/2}$, where $m$ is chosen as in the description of the algorithm. (Intuitively, if $H_{3-i}$ is very bad then it is very likely that $H_i$ will be declared winner.)
(ii) if $d_{TV}(X, H_{3-i}) > 4\epsilon'$, the probability that Choose-Hypothesis$^X(H_1, H_2, \epsilon', \delta')$ declares $H_{3-i}$ as the winner is at most $2e^{-m\epsilon'^2/2}$.
(Intuitively, if $H_{3-i}$ is only moderately bad then a draw is possible, but it is very unlikely that $H_{3-i}$ will be declared winner.)

Proof. Let $r = X(\mathcal{W}_1)$. The definition of the total variation distance implies that $|r - p_i| \le \epsilon'$. Let us define independent indicators $\{Z_j\}_{j=1}^m$ such that, for all $j$, $Z_j = 1$ iff $s_j \in \mathcal{W}_1$. Clearly, $\tau = \frac{1}{m}\sum_{j=1}^m Z_j$ and $E[\tau] = E[Z_j] = r$. Since the $Z_j$'s are mutually independent, it follows from the Chernoff bound that $\Pr[|\tau - r| \ge \epsilon'/2] \le 2e^{-m\epsilon'^2/2}$. Using $|r - p_i| \le \epsilon'$ we get that $\Pr[|\tau - p_i| \ge 3\epsilon'/2] \le 2e^{-m\epsilon'^2/2}$. Hence:

• For part (i): If $d_{TV}(X, H_{3-i}) > 6\epsilon'$, from the triangle inequality we get that $p_1 - p_2 = d_{TV}(H_1, H_2) > 5\epsilon'$. Hence, the algorithm will go beyond Step 1, and with probability at least $1 - 2e^{-m\epsilon'^2/2}$ it will stop at Step 3 (when $i = 1$) or Step 4 (when $i = 2$), declaring $H_i$ as the winner of the competition between $H_1$ and $H_2$.

• For part (ii): If $p_1 - p_2 \le 5\epsilon'$ then the competition declares a draw, hence $H_{3-i}$ is not the winner. Otherwise we have $p_1 - p_2 > 5\epsilon'$ and the above arguments imply that the competition between $H_1$ and $H_2$ will declare $H_{3-i}$ as the winner with probability at most $2e^{-m\epsilon'^2/2}$.

This concludes the proof of Claim 9. In view of Claim 9, the proof of Lemma 8 is concluded.

Our Choose-Hypothesis algorithm implies a generic learning algorithm of independent interest.

Lemma 10. Let $\mathcal{S}$ be an arbitrary set of distributions over a finite domain. Moreover, let $\mathcal{S}_\epsilon \subseteq \mathcal{S}$ be an $\epsilon$-cover of $\mathcal{S}$ of size $N$, for some $\epsilon > 0$. For all $\delta > 0$, there is an algorithm that uses $O(\epsilon^{-2}\log N \log(1/\delta))$ samples from an unknown distribution $X \in \mathcal{S}$ and, with probability at least $1-\delta$, outputs a distribution $Z \in \mathcal{S}_\epsilon$ that satisfies $d_{TV}(X, Z) \le 6\epsilon$.
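The competition of Figure 5, which the tournament below uses as a black box, can be sketched concretely as follows (a sketch for finitely supported hypotheses given as dictionaries; `draw_sample` stands in for sample access to $X$, and returning `h1` on a draw is an arbitrary choice permitted by Steps 1 and 5):

```python
import math
import random

def choose_hypothesis(draw_sample, h1, h2, support, eps, delta):
    """The competition of Figure 5 between hypothesis pmfs h1 and h2.
    Returns the winning pmf; on a draw, returns h1 (either is allowed)."""
    w1 = {w for w in support if h1.get(w, 0.0) > h2.get(w, 0.0)}
    p1 = sum(h1.get(w, 0.0) for w in w1)
    p2 = sum(h2.get(w, 0.0) for w in w1)
    if p1 - p2 <= 5 * eps:                # Step 1: hypotheses too close, a draw
        return h1
    m = math.ceil(2 * math.log(1 / delta) / eps ** 2)
    tau = sum(draw_sample() in w1 for _ in range(m)) / m
    if tau > p1 - 1.5 * eps:              # Step 3: h1 wins
        return h1
    if tau < p2 + 1.5 * eps:              # Step 4: h2 wins
        return h2
    return h1                             # Step 5: a draw
```

Note that $\mathcal{W}_1$, $p_1$ and $p_2$ are computed from the hypotheses alone; samples from $X$ are used only to estimate $X(\mathcal{W}_1)$.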
Proof. The algorithm performs a tournament by running Choose-Hypothesis$^X(H_i, H_j, \epsilon, \delta/(4N))$ for every pair $(H_i, H_j)$, $i < j$, of distributions in $\mathcal{S}_\epsilon$. Then it outputs any distribution $Y^\star \in \mathcal{S}_\epsilon$ that was never a loser (i.e., won or tied against all other distributions in the cover). If no such distribution exists in $\mathcal{S}_\epsilon$ then the algorithm says "failure," and outputs an arbitrary distribution from $\mathcal{S}_\epsilon$.

Since $\mathcal{S}_\epsilon$ is an $\epsilon$-cover of $\mathcal{S}$, there exists some $Y \in \mathcal{S}_\epsilon$ such that $d_{TV}(X, Y) \le \epsilon$. We first argue that with high probability this distribution $Y$ never loses a competition against any other $Y' \in \mathcal{S}_\epsilon$ (so the algorithm does not output "failure"). Consider any $Y' \in \mathcal{S}_\epsilon$. If $d_{TV}(X, Y') > 4\epsilon$, by Claim 9 (ii) the probability that $Y$ loses to $Y'$ is at most $2e^{-m\epsilon^2/2} \le \frac{\delta}{2N}$. On the other hand, if $d_{TV}(X, Y') \le 4\epsilon$, the triangle inequality gives that $d_{TV}(Y, Y') \le 5\epsilon$ and thus $Y$ draws against $Y'$. A union bound over all $N-1$ distributions in $\mathcal{S}_\epsilon \setminus \{Y\}$ shows that with probability at least $1 - \delta/2$, the distribution $Y$ never loses a competition.

We next argue that with probability at least $1 - \delta/2$, every distribution $Y' \in \mathcal{S}_\epsilon$ that never loses must be close to $X$. Fix a distribution $Y'$ such that $d_{TV}(Y', X) > 6\epsilon$. Claim 9 (i) implies that $Y'$ loses to $Y$ with probability at least $1 - 2e^{-m\epsilon^2/2} \ge 1 - \delta/(2N)$. A union bound gives that with probability at least $1 - \delta/2$, every distribution $Y'$ that has $d_{TV}(Y', X) > 6\epsilon$ loses some competition.

Thus, with overall probability at least $1-\delta$, the tournament does not output "failure" and outputs some distribution $Y^\star$ such that $d_{TV}(X, Y^\star) \le 6\epsilon$. This proves the lemma.

Remark 11. We note that Devroye and Lugosi (Chapter 7 of [DL01]) prove a similar result, but there are some differences.
They also have all pairs of distributions in the cover compete against each other, but they use a different notion of competition between every pair. Moreover, their approach chooses a distribution in the cover that wins the maximum number of competitions, whereas our algorithm chooses a distribution that is never defeated (i.e., won or tied against all other distributions in the cover).

Remark 12. Recent work [DK14, AJOS14, SOAJ14] improves the running time of the tournament approaches of Lemma 10, Devroye-Lugosi and other related tournaments to have a quasilinear dependence of $O(N \log N)$ on the size $N = |\mathcal{S}_\epsilon|$. In particular, they avoid running Choose-Hypothesis for all pairs of distributions in $\mathcal{S}_\epsilon$.

2.4 Proof of Theorem 1.

Non-Proper-Learn-PBD$(n, \epsilon, \delta)$
1. Run Learn-Sparse$^X(n, \frac{\epsilon}{12\max\{c_1,c_2\}}, \delta/3)$ to get hypothesis distribution $H_S$.
2. Run Learn-Poisson$^X(n, \frac{\epsilon}{12\max\{c_1,c_2\}}, \delta/3)$ to get hypothesis distribution $H_P$.
3. Run Choose-Hypothesis$^X(H_S, \widehat{H}_P, \epsilon/8, \delta/3)$. If it returns $H_S$ then return $H_S$, and if it returns $\widehat{H}_P$ then return $H_P$.

Figure 6: Non-Proper-Learn-PBD$(n,\epsilon,\delta)$. The values $c_1, c_2$ are the absolute constants from Lemmas 3 and 5. $\widehat{H}_P$ is defined in terms of $H_P$ as described in Definition 2.

We first show Part (1) of the theorem, where the learning algorithm may output any distribution over $[n]$ and not necessarily a PBD. The algorithm for this part of the theorem, Non-Proper-Learn-PBD, is given in Figure 6.
This algorithm follows the high-level structure outlined in Figure 1, with the following modifications: (a) first, if the total variation distance to within which we want to learn $X$ is $\epsilon$, the second argument of both Learn-Sparse and Learn-Poisson is set to $\frac{\epsilon}{12\max\{c_1,c_2\}}$, where $c_1$ and $c_2$ are respectively the constants from Lemmas 3 and 5; (b) the third step of Learn-PBD is replaced by Choose-Hypothesis$^X(H_S, \widehat{H}_P, \epsilon/8, \delta/3)$, where $\widehat{H}_P$ is defined in terms of $H_P$ as described in Definition 2 below; and (c) if Choose-Hypothesis returns $H_S$, then Learn-PBD also returns $H_S$, while if Choose-Hypothesis returns $\widehat{H}_P$, then Learn-PBD returns $H_P$.

Definition 2. (Definition of $\widehat{H}_P$:) $\widehat{H}_P$ is defined in terms of $H_P$ and the support of $H_S$ in three steps: (i) for all points $i$ such that $H_S(i) = 0$, we let $\widehat{H}_P(i) = H_P(i)$; (ii) for all points $i$ such that $H_S(i) \neq 0$, we describe in Appendix C an efficient deterministic algorithm that numerically approximates $H_P(i)$ to within an additive error of $\pm\epsilon/48s$, where $s = O(1/\epsilon^3)$ is the cardinality of the support of $H_S$. If $\widehat{H}_{P,i}$ is the approximation to $H_P(i)$ output by the algorithm, we set $\widehat{H}_P(i) = \max\{0, \widehat{H}_{P,i} - \epsilon/48s\}$; notice then that $H_P(i) - \epsilon/24s \le \widehat{H}_P(i) \le H_P(i)$; finally, (iii) for an arbitrary point $i$ such that $H_S(i) = 0$, we set $\widehat{H}_P(i) = 1 - \sum_{j \neq i} \widehat{H}_P(j)$, to make sure that $\widehat{H}_P$ is a probability distribution.

Observe that $\widehat{H}_P$ satisfies $d_{TV}(\widehat{H}_P, H_P) \le \epsilon/24$, and therefore $|d_{TV}(\widehat{H}_P, X) - d_{TV}(X, H_P)| \le \epsilon/24$. Hence, if $d_{TV}(X, H_P) \le \frac{\epsilon}{12}$, then $d_{TV}(X, \widehat{H}_P) \le \frac{\epsilon}{8}$ and, if $d_{TV}(X, \widehat{H}_P) \le \frac{6\epsilon}{8}$, then $d_{TV}(X, H_P) \le \epsilon$. We remark that the reason why we do not wish to use $H_P$ directly in Choose-Hypothesis is purely computational.
In particular, since $H_P$ is a translated Poisson distribution, we cannot compute its probabilities $H_P(i)$ exactly, and we need to approximate them. On the other hand, we need to make sure that using approximate values will not cause Choose-Hypothesis to make a mistake. Our $\widehat{H}_P$ is carefully defined so as to make sure that Choose-Hypothesis selects a probability distribution that is close to the unknown $X$, and that all probabilities that Choose-Hypothesis needs to compute can be computed without much overhead. In particular, we remark that, in running Choose-Hypothesis, we do not a priori compute the value of $\widehat{H}_P$ at every point; we instead do a lazy evaluation of $\widehat{H}_P$, as explained in the running-time analysis below.

We now proceed to the analysis of our modified algorithm Learn-PBD. The sample complexity bound and correctness of our algorithm are immediate consequences of Lemmas 3, 5 and 8, taking into account the precise choice of constants and the distance between $H_P$ and $\widehat{H}_P$. Next, let us bound the running time. Lemmas 3 and 5 bound the running time of Steps 1 and 2 of the algorithm, so it remains to bound the running time of the Choose-Hypothesis step. Notice that $\mathcal{W}_1(H_S, \widehat{H}_P)$ is a subset of the support of the distribution $H_S$. Hence to compute $\mathcal{W}_1(H_S, \widehat{H}_P)$ it suffices to determine the probabilities $H_S(i)$ and $\widehat{H}_P(i)$ for every point $i$ in the support of $H_S$. For every such $i$, $H_S(i)$ is explicitly given in the output of Learn-Sparse, so we only need to compute $\widehat{H}_P(i)$. It follows from Theorem 6 (Appendix C) that the time needed to compute $\widehat{H}_P(i)$ is $\tilde{O}(\log^3(1/\epsilon) + \log(1/\epsilon) \cdot (\log n + \langle\hat\mu\rangle + \langle\hat\sigma^2\rangle))$, where $\langle\hat\mu\rangle$ and $\langle\hat\sigma^2\rangle$ denote the bit complexities of $\hat\mu$ and $\hat\sigma^2$. Since $\hat\mu$ and $\hat\sigma^2$ are output by Learn-Poisson, by inspection of that algorithm it is easy to see that they each have bit complexity at most $O(\log n + \log(1/\epsilon))$ bits.
Hence, given that the support of $H_S$ has cardinality $O(1/\epsilon^3)$, the overall time spent computing the probabilities $\widehat{H}_P(i)$ for every point $i$ in the support of $H_S$ is $\tilde{O}(\frac{1}{\epsilon^3}\log n)$. After $\mathcal{W}_1$ is computed, the computation of the values $p_1 = H_S(\mathcal{W}_1)$, $q_1 = \widehat{H}_P(\mathcal{W}_1)$ and $p_1 - q_1$ takes time linear in the data produced by the algorithm so far, as these computations merely involve adding and subtracting probabilities that have already been explicitly computed by the algorithm. Computing the fraction of samples from $X$ that fall inside $\mathcal{W}_1$ takes time $O(\log n \cdot \log(1/\delta)/\epsilon^2)$, and the rest of Choose-Hypothesis takes time linear in the size of the data that have been written down so far. Hence the overall running time of our algorithm is $\tilde{O}(\frac{1}{\epsilon^3}\log n \log^2\frac{1}{\delta})$. This gives Part (1) of Theorem 1.

Now we turn to Part (2) of Theorem 1, the proper learning result. The algorithm for this part of the theorem, Proper-Learn-PBD, is given in Figure 7. The algorithm is essentially the same as Non-Proper-Learn-PBD but with the following modifications, to produce a PBD that is within $O(\epsilon)$ of the unknown $X$: First, we replace Learn-Sparse with a different learning algorithm, Proper-Learn-Sparse, which is based on Lemma 10 and always outputs a PBD. Second, we add a post-processing step to Learn-Poisson that converts the translated Poisson distribution $H_P$ output by this procedure to a PBD (in fact, to a Binomial distribution). After we describe these new ingredients in detail, we explain and analyze our proper learning algorithm.

Proper-Learn-PBD$(n, \epsilon, \delta)$
1. Run Proper-Learn-Sparse$^X(n, \frac{\epsilon}{12\max\{c_1,c_2\}}, \delta/3)$ to get hypothesis distribution $H_S$.
2. Run Learn-Poisson$^X(n, \frac{\epsilon}{12\max\{c_1,c_2\}}, \delta/3)$ to get hypothesis distribution $H_P = TP(\hat\mu, \hat\sigma^2)$.
3. Run Choose-Hypothesis$^X(H_S, \widehat{H}_P, \epsilon/8, \delta/3)$.
   (a) If it returns $H_S$, then return $H_S$.
   (b) Otherwise, if it returns $\widehat{H}_P$, then run Locate-Binomial$(\hat\mu, \hat\sigma^2, n)$ to obtain a Binomial distribution $H_B = \mathrm{Bin}(\hat n, \hat p)$ with $\hat n \le n$, and return $H_B$.

Figure 7: Proper-Learn-PBD$(n,\epsilon,\delta)$. The values $c_1, c_2$ are the absolute constants from Lemmas 3 and 5. $\widehat{H}_P$ is defined in terms of $H_P$ as described in Definition 2.

1. Proper-Learn-Sparse$^X(n, \epsilon, \delta)$: This procedure draws $\tilde{O}(1/\epsilon^2) \cdot \log(1/\delta)$ samples from $X$, performs $(1/\epsilon)^{O(\log^2(1/\epsilon))} \cdot \tilde{O}(\log n \cdot \log\frac{1}{\delta})$ bit operations, and outputs a PBD $H_S$ in sparse form. The guarantee is similar to that of Learn-Sparse. Namely, if $X$ is $\epsilon$-close to some sparse form PBD $Y$ in the cover $\mathcal{S}_\epsilon$ of Theorem 4, then, with probability at least $1-\delta$ over the samples drawn from $X$, $d_{TV}(X, H_S) \le 6\epsilon$.

The procedure Proper-Learn-Sparse$^X(n,\epsilon,\delta)$ is given in Figure 8; we explain the procedure in tandem with a proof of correctness. As in Learn-Sparse, we start by truncating $\Theta(\epsilon)$ of the probability mass from each end of $X$ to obtain a conditional distribution $X_{[\hat a,\hat b]}$. In particular, we compute $\hat a$ and $\hat b$ as described in the beginning of the proof of Lemma 3 (setting $\epsilon' = \epsilon$ and $\delta' = \delta$). Claim 4 implies that, with probability at least $1-\delta/2$, $X(\le \hat a), 1 - X(\le \hat b) \in [3\epsilon/2, 5\epsilon/2]$. (Let us denote this event by $G$.) We distinguish the following cases:

Proper-Learn-Sparse$(n, \epsilon, \delta)$
1. Draw $M = 32\log(8/\delta)/\epsilon^2$ samples from $X$ and sort them to obtain a list of values $0 \le s_1 \le \cdots \le s_M \le n$.
2. Define $\hat a := s_{\lceil 2\epsilon M\rceil}$ and $\hat b := s_{\lfloor (1-2\epsilon)M\rfloor}$.
3. If $\hat b - \hat a > (C/\epsilon)^3$ (where $C$ is the constant in the statement of Theorem 4), output "fail" and return the (trivial) hypothesis which puts probability mass $1$ on the point $0$.
4. Otherwise,
   (a) Construct $\mathcal{S}'_\epsilon$, an $\epsilon$-cover of the set of all PBDs of order $(C/\epsilon)^3$ (see Theorem 4).
   (b) Let $\tilde{\mathcal{S}}_\epsilon$ be the set of all distributions of the form $A(x - \beta)$, where $A$ is a distribution from $\mathcal{S}'_\epsilon$ and $\beta$ is an integer in the range $[\hat a - (C/\epsilon)^3, \ldots, \hat b]$.
   (c) Run the tournament described in the proof of Lemma 10 on $\tilde{\mathcal{S}}_\epsilon$, using confidence parameter $\delta/2$. Return the (sparse PBD) hypothesis that this tournament outputs.

Figure 8: Proper-Learn-Sparse$(n,\epsilon,\delta)$.

• If $\hat b - \hat a > \omega = (C/\epsilon)^3$, where $C$ is the constant in the statement of Theorem 4, the algorithm outputs "fail," returning the trivial hypothesis that puts probability mass $1$ on the point $0$. Observe that, if $\hat b - \hat a > \omega$ and $X(\le \hat a), 1 - X(\le \hat b) \in [3\epsilon/2, 5\epsilon/2]$, then $X$ cannot be $\epsilon$-close to a sparse-form distribution in the cover.

• If $\hat b - \hat a \le \omega$, then the algorithm proceeds as follows. Let $\mathcal{S}'_\epsilon$ be an $\epsilon$-cover of the set of all PBDs of order $\omega$, i.e., all PBDs which are sums of just $\omega$ Bernoulli random variables. By Theorem 4, it follows that $|\mathcal{S}'_\epsilon| = (1/\epsilon)^{O(\log^2(1/\epsilon))}$ and that $\mathcal{S}'_\epsilon$ can be constructed in time $(1/\epsilon)^{O(\log^2(1/\epsilon))}$. Now, let $\tilde{\mathcal{S}}_\epsilon$ be the set of all distributions of the form $A(x-\beta)$, where $A$ is a distribution from $\mathcal{S}'_\epsilon$ and $\beta$ is an integer "shift" which is in the range $[\hat a - \omega, \ldots, \hat b]$. Observe that there are $O(1/\epsilon^3)$ possibilities for $\beta$ and $|\mathcal{S}'_\epsilon|$ possibilities for $A$, so we similarly get that $|\tilde{\mathcal{S}}_\epsilon| = (1/\epsilon)^{O(\log^2(1/\epsilon))}$ and that $\tilde{\mathcal{S}}_\epsilon$ can be constructed in time $(1/\epsilon)^{O(\log^2(1/\epsilon))}\log n$. Our algorithm Proper-Learn-Sparse constructs the set $\tilde{\mathcal{S}}_\epsilon$ and runs the tournament described in the proof of Lemma 10 (using $\tilde{\mathcal{S}}_\epsilon$ in place of $\mathcal{S}_\epsilon$, and $\delta/2$ in place of $\delta$). We will show that, if $X$ is $\epsilon$-close to some sparse form PBD $Y \in \mathcal{S}_\epsilon$ and event $G$ happens, then, with probability at least $1 - \frac{\delta}{2}$, the output of the tournament is a sparse PBD that is $6\epsilon$-close to $X$.
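Forming $\tilde{\mathcal{S}}_\epsilon$ from $\mathcal{S}'_\epsilon$ in Step 4(b) is a simple translation loop, which can be sketched as follows (distributions represented as point-to-mass dictionaries, a representation of our own choosing):

```python
def shifted_cover(base_cover, a_hat, b_hat, omega):
    """Form the shifted cover: every distribution A in base_cover,
    translated by every integer shift beta in [a_hat - omega, b_hat]."""
    shifted = []
    for dist in base_cover:
        for beta in range(a_hat - omega, b_hat + 1):
            shifted.append({x + beta: p for x, p in dist.items()})
    return shifted
```

The resulting set has $|\mathcal{S}'_\epsilon| \cdot (\hat b - \hat a + \omega + 1)$ members, matching the $O(1/\epsilon^3)$ shift count times $|\mathcal{S}'_\epsilon|$ bound in the analysis.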
Analysis: The sample complexity and running time of Proper-Learn-Sparse follow immediately from Claim 4 and Lemma 10. To show correctness, it suffices to argue that, if $X$ is $\epsilon$-close to some sparse form PBD $Y \in \mathcal{S}_\epsilon$ and event $G$ happens, then $X$ is $\epsilon$-close to some distribution in $\tilde{\mathcal{S}}_\epsilon$. Indeed, suppose that $Y$ is an order-$\omega$ PBD $Z$ translated by some $\beta$, and suppose that $X(\le \hat{a}),\, 1 - X(\le \hat{b}) \in [3\epsilon/2, 5\epsilon/2]$. Since at least $1 - O(\epsilon)$ of the mass of $X$ is in $[\hat{a}, \hat{b}]$, it is clear that $\beta$ must be in the range $[\hat{a} - \omega, \ldots, \hat{b}]$, as otherwise $X$ could not be $\epsilon$-close to $Y$. So $Y \in \tilde{\mathcal{S}}_\epsilon$.

2. Locate-Binomial$(\hat{\mu}, \hat{\sigma}^2, n)$: This routine takes as input the output $(\hat{\mu}, \hat{\sigma}^2)$ of Learn-Poisson$^X(n, \epsilon, \delta)$ and computes a Binomial distribution $H_B$, without any additional samples from $X$. The guarantee is that, if $X$ is not $\epsilon$-close to any sparse form distribution in the cover $\mathcal{S}_\epsilon$ of Theorem 4, then, with probability at least $1 - \delta$ (over the randomness in the output of Learn-Poisson), $H_B$ will be $O(\epsilon)$-close to $X$.

Locate-Binomial$(\hat{\mu}, \hat{\sigma}^2, n)$
(a) If $\hat{\sigma}^2 \le \frac{n}{4}$, set $\sigma_1^2 = \hat{\sigma}^2$; otherwise, set $\sigma_1^2 = \frac{n}{4}$.
(b) If $\hat{\mu}^2 \le n(\hat{\mu} - \sigma_1^2)$, set $\sigma_2^2 = \sigma_1^2$; otherwise, set $\sigma_2^2 = \frac{n\hat{\mu} - \hat{\mu}^2}{n}$.
(c) Return the hypothesis distribution $H_B = \mathrm{Bin}(\hat{n}, \hat{p})$, where $\hat{n} = \hat{\mu}^2/(\hat{\mu} - \sigma_2^2)$ and $\hat{p} = (\hat{\mu} - \sigma_2^2)/\hat{\mu}$.

Figure 9: Locate-Binomial$(\hat{\mu}, \hat{\sigma}^2, n)$.

Let $\mu$ and $\sigma^2$ be the (unknown) mean and variance of the distribution $X$, and assume that $X$ is not $\epsilon$-close to any sparse form distribution in $\mathcal{S}_\epsilon$. Our analysis from Section 2.2 shows that, with probability at least $1 - \delta$, the output $(\hat{\mu}, \hat{\sigma}^2)$ of Learn-Poisson$^X(n, \epsilon, \delta)$ satisfies $d_{TV}(X, TP(\hat{\mu}, \hat{\sigma}^2)) = O(\epsilon)$ as well as the bounds (1) and (2) of Section 2.2 (with $\epsilon$ in place of $\epsilon'$). We will call all these conditions our "working assumptions.
" We provide no guarantees when the working assumptions are not satisfied.

Locate-Binomial is presented in Figure 9; we proceed to explain the algorithm and establish its correctness. This routine has three steps: the first two eliminate corner cases in the values of $\hat{\mu}$ and $\hat{\sigma}^2$, while the last step defines a Binomial distribution $H_B \equiv \mathrm{Bin}(\hat{n}, \hat{p})$ with $\hat{n} \le n$ that is $O(\epsilon)$-close to $H_P \equiv TP(\hat{\mu}, \hat{\sigma}^2)$, and hence to $X$ under our working assumptions. (We note that a significant portion of the work below is to ensure that $\hat{n} \le n$, which does not seem to follow from a more direct approach. Getting $\hat{n} \le n$ is necessary in order for our learning algorithm for order-$n$ PBDs to be truly proper.) Throughout (a), (b) and (c) below we assume that our working assumptions hold. In particular, our assumptions are used every time we employ the bounds (1) and (2) of Section 2.2.

(a) Tweaking $\hat{\sigma}^2$: If $\hat{\sigma}^2 \le \frac{n}{4}$, we set $\sigma_1^2 = \hat{\sigma}^2$; otherwise, we set $\sigma_1^2 = \frac{n}{4}$. (As intuition for this tweak, observe that the largest possible variance of a Binomial distribution $\mathrm{Bin}(n, \cdot)$ is $n/4$.) We note for future reference that in both cases (2) gives
$$(1 - \epsilon)\sigma^2 \le \sigma_1^2 \le (1 + \epsilon)\sigma^2, \qquad (5)$$
where the lower bound follows from (2) and the fact that any PBD satisfies $\sigma^2 \le \frac{n}{4}$.

We prove next that our setting of $\sigma_1^2$ results in $d_{TV}(TP(\hat{\mu}, \hat{\sigma}^2), TP(\hat{\mu}, \sigma_1^2)) \le O(\epsilon)$. Indeed, if $\hat{\sigma}^2 \le \frac{n}{4}$ then this distance is zero and the claim certainly holds. Otherwise we have that
$$(1 + \epsilon)\sigma^2 \ge \hat{\sigma}^2 > \sigma_1^2 = \frac{n}{4} \ge \sigma^2,$$
where we used (2). Hence, by Lemma 2 we get:
$$d_{TV}(TP(\hat{\mu}, \hat{\sigma}^2), TP(\hat{\mu}, \sigma_1^2)) \le \frac{|\hat{\sigma}^2 - \sigma_1^2| + 1}{\hat{\sigma}^2} \le \frac{\epsilon\sigma^2 + 1}{\sigma^2} = O(\epsilon), \qquad (6)$$
where we used the fact that $\sigma^2 = \Omega(1/\epsilon^2)$ from (1).

(b) Tweaking $\sigma_1^2$: If $\hat{\mu}^2 \le n(\hat{\mu} - \sigma_1^2)$ (equivalently, $\sigma_1^2 \le \frac{n\hat{\mu} - \hat{\mu}^2}{n}$), set $\sigma_2^2 = \sigma_1^2$; otherwise, set $\sigma_2^2 = \frac{n\hat{\mu} - \hat{\mu}^2}{n}$.
(As intuition for this tweak, observe that the variance of a $\mathrm{Bin}(n, \cdot)$ distribution with mean $\hat{\mu}$ cannot exceed $\frac{n\hat{\mu} - \hat{\mu}^2}{n}$.) We claim that this results in $d_{TV}(TP(\hat{\mu}, \sigma_1^2), TP(\hat{\mu}, \sigma_2^2)) \le O(\epsilon)$. Indeed, if $\hat{\mu}^2 \le n(\hat{\mu} - \sigma_1^2)$, then clearly the distance is zero and the claim holds. Otherwise:

• Observe first that $\sigma_1^2 > \sigma_2^2$ and $\sigma_2^2 \ge 0$, where the last assertion follows from the fact that $\hat{\mu} \le n$ by construction.

• Next, suppose that $X = PBD(p_1, \ldots, p_n)$. Then from Cauchy-Schwarz we get that
$$\mu^2 = \Big(\sum_{i=1}^n p_i\Big)^2 \le n\Big(\sum_{i=1}^n p_i^2\Big) = n(\mu - \sigma^2).$$
Rearranging this yields
$$\frac{\mu(n - \mu)}{n} \ge \sigma^2. \qquad (7)$$
We now have that
$$\sigma_2^2 = \frac{n\hat{\mu} - \hat{\mu}^2}{n} \ge \frac{n(\mu - \epsilon\sigma) - (\mu + \epsilon\sigma)^2}{n} = \frac{n\mu - \mu^2 - \epsilon^2\sigma^2 - \epsilon\sigma(n + 2\mu)}{n} \ge \sigma^2 - \frac{\epsilon^2}{n}\sigma^2 - 3\epsilon\sigma \ge (1 - \epsilon^2)\sigma^2 - 3\epsilon\sigma \ge (1 - O(\epsilon))\sigma^2, \qquad (8)$$
where the first inequality follows from (2), the second inequality follows from (7) and the fact that any PBD over $n$ variables satisfies $\mu \le n$, and the last one from (1).

• Given the above, we get by Lemma 2 that:
$$d_{TV}(TP(\hat{\mu}, \sigma_1^2), TP(\hat{\mu}, \sigma_2^2)) \le \frac{\sigma_1^2 - \sigma_2^2 + 1}{\sigma_1^2} \le \frac{(1 + \epsilon)\sigma^2 - (1 - O(\epsilon))\sigma^2 + 1}{(1 - \epsilon)\sigma^2} = O(\epsilon), \qquad (9)$$
where we used that $\sigma^2 = \Omega(1/\epsilon^2)$ from (1).

(c) Constructing a Binomial Distribution: We construct a Binomial distribution $H_B$ that is $O(\epsilon)$-close to $TP(\hat{\mu}, \sigma_2^2)$. If we do this then, by (6), (9), our working assumption that $d_{TV}(H_P, X) = O(\epsilon)$, and the triangle inequality, we have that $d_{TV}(H_B, X) = O(\epsilon)$ and we are done. The Binomial distribution $H_B$ that we construct is $\mathrm{Bin}(\hat{n}, \hat{p})$, where $\hat{n} = \hat{\mu}^2/(\hat{\mu} - \sigma_2^2)$ and $\hat{p} = (\hat{\mu} - \sigma_2^2)/\hat{\mu}$. Note that, from the way that $\sigma_2^2$ is set in Step (b) above, we have that $\hat{n} \le n$ and $\hat{p} \in [0, 1]$, as required for $\mathrm{Bin}(\hat{n}, \hat{p})$ to be a valid Binomial distribution and a valid output for Part 2 of Theorem 1.
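The arithmetic of steps (a)–(c) is simple enough to state as code. The sketch below assumes the working assumptions hold (in particular $\hat{\mu} > \sigma_2^2$, so the divisions are well-defined) and treats $\hat{n}$ as a real number, exactly as in the displayed formulas; the function name is ours.

```python
def locate_binomial(mu_hat, sigma2_hat, n):
    """Sketch of Locate-Binomial (Figure 9): clip the variance estimate to
    values achievable by a Binomial with n trials and mean mu_hat, then
    match the first two moments to read off (n_hat, p_hat)."""
    # (a) Bin(n, .) has variance at most n/4.
    sigma2_1 = min(sigma2_hat, n / 4)
    # (b) A Binomial with mean mu_hat has variance at most (n*mu_hat - mu_hat^2)/n.
    sigma2_2 = min(sigma2_1, (n * mu_hat - mu_hat ** 2) / n)
    # (c) Solve n_hat * p_hat = mu_hat and n_hat * p_hat * (1 - p_hat) = sigma2_2.
    n_hat = mu_hat ** 2 / (mu_hat - sigma2_2)
    p_hat = (mu_hat - sigma2_2) / mu_hat
    return n_hat, p_hat

# Moment-matching a Bin(100, 1/2)-like estimate recovers (100, 0.5);
# an overestimated variance (40 > n/4 = 25) is clipped first in step (a).
print(locate_binomial(50, 25, 100))   # → (100.0, 0.5)
print(locate_binomial(50, 40, 100))   # → (100.0, 0.5)
```

Both clipping steps can only decrease the variance estimate, which is what forces $\hat{n} \le n$ and $\hat{p} \in [0, 1]$ in step (c).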
Let us bound the total variation distance between $\mathrm{Bin}(\hat{n}, \hat{p})$ and $TP(\hat{\mu}, \sigma_2^2)$. First, using Lemma 1 we have:
$$d_{TV}(\mathrm{Bin}(\hat{n}, \hat{p}), TP(\hat{n}\hat{p}, \hat{n}\hat{p}(1 - \hat{p}))) \le \frac{1}{\sqrt{\hat{n}\hat{p}(1 - \hat{p})}} + \frac{2}{\hat{n}\hat{p}(1 - \hat{p})}. \qquad (10)$$
Notice that
$$\hat{n}\hat{p}(1 - \hat{p}) \ge \Big(\frac{\hat{\mu}^2}{\hat{\mu} - \sigma_2^2} - 1\Big) \cdot \frac{\hat{\mu} - \sigma_2^2}{\hat{\mu}} \cdot \frac{\sigma_2^2}{\hat{\mu}} = \sigma_2^2 - \hat{p}(1 - \hat{p}) \ge (1 - O(\epsilon))\sigma^2 - 1 = \Omega(1/\epsilon^2),$$
where the second inequality uses (8) (or (5), depending on which case of Step (b) we fell into) and the last step uses the fact that $\sigma^2 = \Omega(1/\epsilon^2)$ from (1). So plugging this into (10) we get:
$$d_{TV}(\mathrm{Bin}(\hat{n}, \hat{p}), TP(\hat{n}\hat{p}, \hat{n}\hat{p}(1 - \hat{p}))) = O(\epsilon).$$
The next step is to compare $TP(\hat{n}\hat{p}, \hat{n}\hat{p}(1 - \hat{p}))$ and $TP(\hat{\mu}, \sigma_2^2)$. Lemma 2 gives:
$$d_{TV}(TP(\hat{n}\hat{p}, \hat{n}\hat{p}(1 - \hat{p})), TP(\hat{\mu}, \sigma_2^2)) \le \frac{|\hat{n}\hat{p} - \hat{\mu}|}{\min(\sqrt{\hat{n}\hat{p}(1 - \hat{p})}, \sigma_2)} + \frac{|\hat{n}\hat{p}(1 - \hat{p}) - \sigma_2^2| + 1}{\min(\hat{n}\hat{p}(1 - \hat{p}), \sigma_2^2)} \le \frac{1}{\sqrt{\hat{n}\hat{p}(1 - \hat{p})}} + \frac{2}{\hat{n}\hat{p}(1 - \hat{p})} = O(\epsilon).$$
By the triangle inequality we get $d_{TV}(\mathrm{Bin}(\hat{n}, \hat{p}), TP(\hat{\mu}, \sigma_2^2)) = O(\epsilon)$, which was our ultimate goal.

3. Proper-Learn-PBD: Given the Proper-Learn-Sparse and Locate-Binomial routines described above, we are ready to describe our proper learning algorithm. The algorithm is similar to our non-proper learning one, Learn-PBD, with the following modifications: In the first step, instead of running Learn-Sparse, we run Proper-Learn-Sparse to get a sparse form PBD $H_S$. In the second step, we still run Learn-Poisson as we did before to get a translated Poisson distribution $H_P$. Then we run Choose-Hypothesis, feeding it $H_S$ and $H_P$ as input. If the distribution returned by Choose-Hypothesis is $H_S$, we just output $H_S$. If it returns $H_P$ instead, then we run Locate-Binomial to convert it to a Binomial distribution that is still close to the unknown distribution $X$.
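The control flow just described can be summarized as follows. All four subroutines are passed in as black boxes; they are stand-ins for the paper's routines, which this sketch does not implement.

```python
def proper_learn_pbd(samples, n, eps, delta, proper_learn_sparse,
                     learn_poisson, choose_hypothesis, locate_binomial):
    """Top-level control flow of Proper-Learn-PBD (Figure 7): obtain a
    sparse-form candidate and a translated-Poisson candidate, let the
    hypothesis-testing routine pick one, and convert a translated-Poisson
    winner into a proper Binomial via Locate-Binomial."""
    h_s = proper_learn_sparse(samples, n, eps, delta)   # sparse-form PBD candidate
    mu_hat, sigma2_hat = learn_poisson(samples, n, eps, delta)
    h_p = ("TP", mu_hat, sigma2_hat)                    # translated Poisson candidate
    if choose_hypothesis(samples, h_s, h_p) == h_s:
        return h_s                                      # already a sparse-form PBD
    return locate_binomial(mu_hat, sigma2_hat, n)       # Bin(n_hat, p_hat), n_hat <= n
```

With stub subroutines one can check the routing: a chooser that prefers the sparse candidate returns it unchanged, while one that prefers the translated Poisson triggers the Locate-Binomial conversion.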
We tune the parameters $\epsilon$ and $\delta$ based on the above analyses to guarantee that, with probability at least $1 - \delta$, the distribution output by our overall algorithm is $\epsilon$-close to the unknown distribution $X$. The number of samples we need is $\tilde{O}(1/\epsilon^2)\log(1/\delta)$, and the running time is $(1/\epsilon)^{O(\log^2(1/\epsilon))} \cdot \tilde{O}(\log n \cdot \log\frac{1}{\delta})$. This concludes the proof of Part 2 of Theorem 1, and thus of the entire theorem.

3 Learning weighted sums of independent Bernoullis

In this section we consider a generalization of the problem of learning an unknown PBD, by studying the learnability of weighted sums of independent Bernoulli random variables $X = \sum_{i=1}^n w_i X_i$. (Throughout this section we assume for simplicity that the weights are "known" to the learning algorithm.) In Section 3.1 we show that if there are only constantly many different weights, then such distributions can be learned by an algorithm that uses $O(\log n)$ samples and runs in time $\mathrm{poly}(n)$. In Section 3.2 we show that if there are $n$ distinct weights, then even if those weights have an extremely simple structure – the $i$-th weight is simply $i$ – any algorithm must use $\Omega(n)$ samples.

3.1 Learning sums of weighted independent Bernoulli random variables with few distinct weights

Recall Theorem 2:

THEOREM 2. Let $X = \sum_{i=1}^n a_i X_i$ be a weighted sum of unknown independent Bernoulli random variables such that there are at most $k$ different values in the set $\{a_1, \ldots, a_n\}$. Then there is an algorithm with the following properties: given $n$, $a_1, \ldots$
$, a_n$ and access to independent draws from $X$, it uses $\tilde{O}(k/\epsilon^2) \cdot \log(n) \cdot \log(1/\delta)$ samples from the target distribution $X$, runs in time $\mathrm{poly}\big(n^k \cdot (k/\epsilon)^{k\log^2(k/\epsilon)}\big) \cdot \log(1/\delta)$, and with probability at least $1 - \delta$ outputs a hypothesis vector $\hat{p} \in [0, 1]^n$ defining independent Bernoulli random variables $\hat{X}_i$ with $E[\hat{X}_i] = \hat{p}_i$ such that $d_{TV}(\hat{X}, X) \le \epsilon$, where $\hat{X} = \sum_{i=1}^n a_i \hat{X}_i$.

Remark 13. A special case of a more general recent result [DDO+13] implies a highly efficient algorithm for the special case of Theorem 2 in which the $k$ distinct values that $a_1, \ldots, a_n$ can have are just $\{0, 1, \ldots, k - 1\}$. In this case, the algorithm of [DDO+13] draws $\mathrm{poly}(k, 1/\epsilon)$ samples from the target distribution and, in the bit complexity model of this paper, has running time $\mathrm{poly}(k, 1/\epsilon, \log n)$; thus its running time and sample complexity are both significantly better than those of Theorem 2. However, even the most general version of the [DDO+13] result cannot handle the full generality of Theorem 2, which imposes no conditions of any sort on the $k$ distinct weights: they may be any real values. The [DDO+13] result leverages known central limit theorems for total variation distance from probability theory that deal with sums of independent (small integer)-valued random variables. We are not aware of such central limit theorems for the more general setting of arbitrary real values, and thus we take a different approach to Theorem 2, via covers for PBDs, as described below.

Given a vector $a = (a_1, \ldots, a_n)$ of weights, we refer to a distribution $X = \sum_{i=1}^n a_i X_i$ (where $X_1, \ldots, X_n$ are independent Bernoullis which may have arbitrary means) as an $a$-weighted sum of Bernoullis, and we write $\mathcal{S}_a$ to denote the space of all such distributions.
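As an executable illustration of these definitions (the function names are ours, not the paper's): an $a$-weighted sum of Bernoullis groups into $X = \sum_j b_j S_j$ over the $k$ distinct weights, with each $S_j$ a PBD on $n_j$ coordinates, and the pmf of each $S_j$ can be computed by the standard $O(n_j^2)$ dynamic program invoked later in the proof of Theorem 2.

```python
from collections import defaultdict

def group_by_weight(a):
    """Partition the indices of a = (a_1, ..., a_n) by distinct weight,
    so that X = sum_j b_j * S_j with S_j a PBD on n_j coordinates."""
    groups = defaultdict(list)
    for i, a_i in enumerate(a):
        groups[a_i].append(i)
    return dict(groups)

def pbd_pmf(ps):
    """pmf of a PBD sum_i X_i with X_i ~ Bernoulli(p_i), via the standard
    O(n^2) dynamic program: after processing p, dp[m] = Pr[partial sum = m]."""
    dp = [1.0]                        # the empty sum equals 0 with probability 1
    for p in ps:
        new = [0.0] * (len(dp) + 1)
        for m, prob in enumerate(dp):
            new[m] += prob * (1 - p)  # current X_i = 0
            new[m + 1] += prob * p    # current X_i = 1
        dp = new
    return dp

print(group_by_weight([1, 1, 5, 5, 5, 2]))  # → {1: [0, 1], 5: [2, 3, 4], 2: [5]}
print(pbd_pmf([0.5, 0.5]))                  # → [0.25, 0.5, 0.25]
```

Combining the two, $H(w)$ for a candidate $H = \sum_j b_j S_j$ is a sum of products of the per-group pmf values, exactly as in the proof below.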
To prove Theorem 2 we first show that $\mathcal{S}_a$ has an $\epsilon$-cover that is not too large. We then show that, by running a "tournament" between all pairs of distributions in the cover, using the hypothesis testing subroutine from Section 2.3, it is possible to identify a distribution in the cover that is close to the target $a$-weighted sum of Bernoullis.

Lemma 14. There is an $\epsilon$-cover $\mathcal{S}_{a,\epsilon} \subset \mathcal{S}_a$ of size $|\mathcal{S}_{a,\epsilon}| \le (n/k)^{3k} \cdot (k/\epsilon)^{k \cdot O(\log^2(k/\epsilon))}$ that can be constructed in time $\mathrm{poly}(|\mathcal{S}_{a,\epsilon}|)$.

Proof. Let $\{b_j\}_{j=1}^k$ denote the set of distinct weights in $a_1, \ldots, a_n$, and let $n_j = |\{i \in [n] \mid a_i = b_j\}|$. With this notation, we can write $X = \sum_{j=1}^k b_j S_j = g(S)$, where $S = (S_1, \ldots, S_k)$ with each $S_j$ a sum of $n_j$ many independent Bernoulli random variables and $g(y_1, \ldots, y_k) = \sum_{j=1}^k b_j y_j$. Clearly we have $\sum_{j=1}^k n_j = n$. By Theorem 4, for each $j \in \{1, \ldots, k\}$ the space of all possible $S_j$'s has an explicit $(\epsilon/k)$-cover $\mathcal{S}^j_{\epsilon/k}$ of size $|\mathcal{S}^j_{\epsilon/k}| \le n_j^2 + n_j \cdot (k/\epsilon)^{O(\log^2(k/\epsilon))}$. By independence across the $S_j$'s, the product $\mathcal{Q} = \prod_{j=1}^k \mathcal{S}^j_{\epsilon/k}$ is an $\epsilon$-cover for the space of all possible $S$'s, and hence the set
$$\Big\{ \sum_{j=1}^k b_j S_j : (S_1, \ldots, S_k) \in \mathcal{Q} \Big\}$$
is an $\epsilon$-cover for $\mathcal{S}_a$. So $\mathcal{S}_a$ has an explicit $\epsilon$-cover of size $|\mathcal{Q}| = \prod_{j=1}^k |\mathcal{S}^j_{\epsilon/k}| \le (n/k)^{2k} \cdot (k/\epsilon)^{k \cdot O(\log^2(k/\epsilon))}$.

Proof of Theorem 2: We claim that the algorithm of Lemma 10 has the desired sample complexity and can be implemented to run in the claimed time bound. The sample complexity bound follows directly from Lemma 10. It remains to argue about the time complexity. Note that the running time of the algorithm is $\mathrm{poly}(|\mathcal{S}_{a,\epsilon}|)$ times the running time of a competition. We will show that a competition between $H_1, H_2 \in \mathcal{S}_{a,\epsilon}$ can be carried out by an efficient algorithm.
This amounts to efficiently computing the probabilities $p_1 = H_1(\mathcal{W}_1)$ and $q_1 = H_2(\mathcal{W}_1)$, and efficiently computing $H_1(x)$ and $H_2(x)$ for each of the $m$ samples $x$ drawn in Step (2) of the competition. Note that each element $w \in \mathcal{W}$ (the support of $X$ in the competition Choose-Hypothesis) is a value $w = \sum_{j=1}^k b_j n'_j$ where $n'_j \in \{0, \ldots, n_j\}$. Clearly, $|\mathcal{W}| \le \prod_{j=1}^k (n_j + 1) = O((n/k)^k)$. It is thus easy to see that $p_1$, $q_1$ and each of $H_1(x), H_2(x)$ can be efficiently computed as long as there is an efficient algorithm for the following problem: given $H = \sum_{j=1}^k b_j S_j \in \mathcal{S}_{a,\epsilon}$ and $w \in \mathcal{W}$, compute $H(w)$. Indeed, fix any such $H, w$. We have that
$$H(w) = \sum_{m_1, \ldots, m_k} \prod_{j=1}^k \Pr_H[S_j = m_j],$$
where the sum is over all $k$-tuples $(m_1, \ldots, m_k)$ such that $0 \le m_j \le n_j$ for all $j$ and $b_1 m_1 + \cdots + b_k m_k = w$ (as noted above, there are at most $O((n/k)^k)$ such $k$-tuples). To complete the proof of Theorem 2, we note that $\Pr_H[S_j = m_j]$ can be computed in $O(n_j^2)$ time by standard dynamic programming.

We close this subsection with the following remark: In [DDS12b] the authors have given a $\mathrm{poly}(\ell, \log(n), 1/\epsilon)$-time algorithm that learns any $\ell$-modal distribution over $[n]$ (i.e., a distribution whose pdf has at most $\ell$ "peaks" and "valleys") using $O(\ell\log(n)/\epsilon^3 + (\ell/\epsilon)^3\log(\ell/\epsilon))$ samples. It is natural to wonder whether this algorithm could be used to efficiently learn a sum of $n$ weighted independent Bernoulli random variables with $k$ distinct weights, and thus give an alternate algorithm for Theorem 2, perhaps with better asymptotic guarantees. However, it is easy to construct a sum $X = \sum_{i=1}^n a_i X_i$ of $n$ weighted independent Bernoulli random variables with $k$ distinct weights such that $X$ is $2^k$-modal.
Thus, a naive application of the [DDS12b] result would only give an algorithm with sample complexity exponential in $k$, rather than the quasilinear sample complexity of our current algorithm. If the $2^k$-modality of the above-mentioned example is the worst case (which we do not know), then the [DDS12b] algorithm would give a $\mathrm{poly}(2^k, \log(n), 1/\epsilon)$-time algorithm for our problem that uses $O(2^k\log(n)/\epsilon^3) + 2^{O(k)} \cdot \tilde{O}(1/\epsilon^3)$ samples (so, comparing with Theorem 2, exponentially worse sample complexity as a function of $k$, but exponentially better running time as a function of $n$). Finally, in the context of this question (how many modes can there be for a sum of $n$ weighted independent Bernoulli random variables with $k$ distinct weights), it is interesting to recall the result of K.-I. Sato [Sat93], which shows that for any $N$ there are two unimodal distributions $X, Y$ such that $X + Y$ has at least $N$ modes.

3.2 Sample complexity lower bound for learning sums of weighted independent Bernoulli random variables

Recall Theorem 3:

THEOREM 3. Let $X = \sum_{i=1}^n i \cdot X_i$ be a weighted sum of unknown independent Bernoulli random variables (where the $i$-th weight is simply $i$). Let $L$ be any learning algorithm which, given $n$ and access to independent draws from $X$, outputs a hypothesis distribution $\hat{X}$ such that $d_{TV}(\hat{X}, X) \le 1/25$ with probability at least $e^{-o(n)}$. Then $L$ must use $\Omega(n)$ samples.

The intuition underlying this lower bound is straightforward: Suppose there are $n/100$ variables $X_i$, chosen uniformly at random, which have $p_i = 100/n$ (call these the "relevant variables"), and the rest of the $p_i$'s are zero.
Given at most $c \cdot n$ draws from $X$ for a small constant $c$, with high probability some constant fraction of the relevant $X_i$'s will not have been "revealed" as relevant, and from this it is not difficult to show that any hypothesis must have constant error. A detailed argument follows.

Proof of Theorem 3: We define a probability distribution over possible target probability distributions $X$ as follows: A subset $S \subset \{n/2 + 1, \ldots, n\}$ of size $|S| = n/100$ is drawn uniformly at random from all $\binom{n/2}{n/100}$ possible outcomes. The vector $p = (p_1, \ldots, p_n)$ is defined as follows: for each $i \in S$ the value $p_i$ equals $100/n = 1/|S|$, and for all other $i$ the value $p_i$ equals 0. The $i$-th Bernoulli random variable $X_i$ has $E[X_i] = p_i$, and the target distribution is $X = X_p = \sum_{i=1}^n iX_i$.

We will need two easy lemmas:

Lemma 15. Fix any $S, p$ as described above. For any $j \in \{n/2 + 1, \ldots, n\}$ we have $X_p(j) \neq 0$ if and only if $j \in S$. For any $j \in S$ the value $X_p(j)$ is exactly $(100/n)(1 - 100/n)^{n/100 - 1} > 35/n$ (for $n$ sufficiently large), and hence $X_p(\{n/2 + 1, \ldots, n\}) > 0.35$ (again for $n$ sufficiently large).

The first claim of the lemma holds because any set of $c \ge 2$ numbers from $\{n/2 + 1, \ldots, n\}$ must sum to more than $n$. The second claim holds because the only way a draw $x$ from $X_p$ can have $x = j$ is if $X_j = 1$ and all other $X_i$ are 0 (here we are using $\lim_{x \to \infty}(1 - 1/x)^x = 1/e$).

The next lemma is an easy consequence of Chernoff bounds:

Lemma 16. Fix any $p$ as defined above, and consider a sequence of $n/2000$ independent draws from $X_p = \sum_i iX_i$. With probability $1 - e^{-\Omega(n)}$, the total number of indices $j \in [n]$ such that $X_j$ is ever 1 in any of the $n/2000$ draws is at most $n/1000$.

We are now ready to prove Theorem 3. Let $L$ be a learning algorithm that receives $n/2000$ samples. Let $S \subset \{n/2 + 1, \ldots$
$, n\}$ and $p$ be chosen randomly as defined above, and set the target to $X = X_p$. We consider an augmented learner $L'$ that is given "extra information": for each point in the sample, instead of receiving the value of that draw from $X$, the learner $L'$ is given the entire vector $(X_1, \ldots, X_n) \in \{0, 1\}^n$. Let $T$ denote the set of elements $j \in \{n/2 + 1, \ldots, n\}$ for which the learner is ever given a vector $(X_1, \ldots, X_n)$ that has $X_j = 1$. By Lemma 16 we have $|T| \le n/1000$ with probability at least $1 - e^{-\Omega(n)}$; we condition on the event $|T| \le n/1000$ going forth.

Fix any value $\ell \le n/1000$. Conditioned on $|T| = \ell$, the set $T$ is equally likely to be any $\ell$-element subset of $S$, and all possible "completions" of $T$ with an additional $n/100 - \ell \ge 9n/1000$ elements of $\{n/2 + 1, \ldots, n\} \setminus T$ are equally likely to be the true set $S$.

Let $H$ denote the hypothesis distribution over $[n]$ that algorithm $L$ outputs. Let $R$ denote the set $\{n/2 + 1, \ldots, n\} \setminus T$; note that since $|T| = \ell \le n/1000$, we have $|R| \ge 499n/1000$. Let $U$ denote the set $\{i \in R : H(i) \ge 30/n\}$. Since $H$ is a distribution we must have $|U| \le n/30$. It is easy to verify that we have $d_{TV}(X, H) \ge \frac{5}{n}|S \setminus U|$. Since $S$ is a uniform random extension of $T$ with at most $n/100 - \ell \in [9n/1000, n/100]$ unknown elements of $R$, and $|R| \ge 499n/1000$, an easy calculation shows that $\Pr[|S \setminus U| > 8n/1000]$ is $1 - e^{-\Omega(n)}$. This means that with probability $1 - e^{-\Omega(n)}$ we have $d_{TV}(X, H) \ge \frac{8n}{1000} \cdot \frac{5}{n} = 1/25$, and the theorem is proved.

4 Conclusion and open problems

Since the initial conference publication of this work [DDS12a], some progress has been made on problems related to learning Poisson Binomial Distributions.
The initial conference version [DDS12a] asked whether log-concave distributions over $[n]$ (a generalization of Poisson Binomial Distributions) can be learned to accuracy $\epsilon$ with $\mathrm{poly}(1/\epsilon)$ samples independent of $n$. An affirmative answer to this question was subsequently provided in [CDSS13]. More recently, [DDO+13] studied a different generalization of Poisson Binomial Distributions by considering random variables of the form $X = \sum_{i=1}^n X_i$ where the $X_i$'s are mutually independent (not necessarily identical) distributions that are each supported on the integers $\{0, 1, \ldots, k - 1\}$ (so the $k = 2$ case corresponds to Poisson Binomial Distributions). [DDO+13] gave an algorithm for learning these distributions to accuracy $\epsilon$ using $\mathrm{poly}(k, 1/\epsilon)$ samples (independent of $n$).

While our results in this paper essentially settle the sample complexity of learning an unknown Poisson Binomial Distribution, several goals remain for future work. Our non-proper learning algorithm is computationally more efficient than our proper learning algorithm, but uses a factor of $1/\epsilon$ more samples. An obvious goal is to obtain "the best of both worlds" by coming up with an $O(1/\epsilon^2)$-sample algorithm which performs $\tilde{O}(\log(n)/\epsilon^2)$ bit operations and learns an unknown PBD to accuracy $\epsilon$ (ideally, such an algorithm would even be proper and output a PBD as its hypothesis). Another goal is to sharpen the sample complexity bounds of [DDO+13] and determine the correct polynomial dependence on $k$ and $1/\epsilon$ for the generalized problem studied in that work.

References

[AJOS14] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Sorting with adversarial comparators and application to density estimation. In the IEEE International Symposium on Information Theory (ISIT), 2014.

[Ber41] Andrew C. Berry.
The Accuracy of the Gaussian Approximation to the Sum of Independent Variates. Transactions of the American Mathematical Society, 49(1):122–136, 1941.

[BHJ92] A.D. Barbour, L. Holst, and S. Janson. Poisson Approximation. Oxford University Press, New York, NY, 1992.

[Bir87a] L. Birgé. Estimating a density under order restrictions: Nonasymptotic minimax risk. Annals of Statistics, 15(3):995–1012, 1987.

[Bir87b] L. Birgé. On the risk of histograms for estimating decreasing densities. Annals of Statistics, 15(3):1013–1022, 1987.

[Bir97] L. Birgé. Estimation of unimodal densities without smoothness assumptions. Annals of Statistics, 25(3):970–981, 1997.

[BL06] A.D. Barbour and T. Lindvall. Translated Poisson Approximation for Markov Chains. Journal of Theoretical Probability, 19, 2006.

[Bre75] R.P. Brent. Multiple-precision zero-finding methods and the complexity of elementary function evaluation. Analytic Computational Complexity (J.F. Traub, ed.), pages 151–176, 1975. Academic Press, New York.

[Bre76] R.P. Brent. Fast multiple-precision evaluation of elementary functions. Journal of the ACM, 23(2):242–251, 1976.

[BS10] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In FOCS, pages 103–112, 2010.

[Cam60] L. Le Cam. An approximation theorem for the Poisson binomial distribution. Pacific J. Math, 10:1181–1197, 1960.

[CDSS13] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Learning mixtures of structured distributions over discrete domains. In SODA, pages 1380–1394, 2013.

[Che52] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Statist., 23:493–507, 1952.

[Che74] L.H.Y. Chen. On the convergence of Poisson binomial to Poisson distributions. Ann. Probab.
, 2:178–180, 1974.

[CL97] S.X. Chen and J.S. Liu. Statistical applications of the Poisson-Binomial and Conditional Bernoulli Distributions. Statistica Sinica, 7:875–892, 1997.

[Das08] Constantinos Daskalakis. An Efficient PTAS for Two-Strategy Anonymous Games. WINE 2008, pp. 186–197. Full version available as ArXiV report, 2008.

[DDO+13] C. Daskalakis, I. Diakonikolas, R. O'Donnell, R. Servedio, and L.-Y. Tan. Learning Sums of Independent Integer Random Variables. In FOCS, 2013.

[DDS12a] C. Daskalakis, I. Diakonikolas, and R. Servedio. Learning Poisson Binomial Distributions. In STOC, pages 709–728, 2012.

[DDS12b] C. Daskalakis, I. Diakonikolas, and R.A. Servedio. Learning $k$-modal distributions via testing. In SODA, pages 1371–1385, 2012.

[DK14] Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of gaussians. In the 27th Conference on Learning Theory (COLT), 2014.

[DL96a] L. Devroye and G. Lugosi. Nonasymptotic universal smoothing factors, kernel complexity and Yatracos classes. Annals of Statistics, 25:2626–2637, 1996.

[DL96b] L. Devroye and G. Lugosi. A universally acceptable smoothing factor for kernel density estimation. Annals of Statistics, 24:2499–2512, 1996.

[DL01] L. Devroye and G. Lugosi. Combinatorial methods in density estimation. Springer Series in Statistics, Springer, 2001.

[DP86] P. Deheuvels and D. Pfeifer. A semigroup approach to Poisson approximation. Ann. Probab., 14:663–676, 1986.

[DP09] D. Dubhashi and A. Panconesi. Concentration of measure for the analysis of randomized algorithms. Cambridge University Press, Cambridge, 2009.

[DP11] C. Daskalakis and C. Papadimitriou. On Oblivious PTAS's for Nash Equilibrium. STOC 2009, pp. 75–84. Full version available as ArXiV report, 2011.
[DP13] C. Daskalakis and C. Papadimitriou. Sparse Covers for Sums of Indicators. Arxiv Report, 2013. http://arxiv.org/abs/1306.1265.

[Ehm91] Werner Ehm. Binomial approximation to the Poisson binomial distribution. Statistics and Probability Letters, 11:7–16, 1991.

[Ess42] Carl-Gustav Esseen. On the Liapunoff limit of error in the theory of probability. Arkiv för matematik, astronomi och fysik, A:1–19, 1942.

[Fil92] Sandra Fillebrown. Faster computation of Bernoulli numbers. Journal of Algorithms, 13(3):431–445, 1992.

[HC60] J.L. Hodges and L. Le Cam. The Poisson approximation to the Poisson binomial distribution. Ann. Math. Statist., 31:737–740, 1960.

[Hoe63] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.

[Joh03] J.L. Johnson. Probability and Statistics for Computer Science. John Wiley & Sons, Inc., New York, NY, USA, 2003.

[KG71] J. Keilson and H. Gerber. Some Results for Discrete Unimodality. J. American Statistical Association, 66(334):386–389, 1971.

[KMR+94] M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. Schapire, and L. Sellie. On the learnability of discrete distributions. In Proceedings of the 26th Symposium on Theory of Computing, pages 273–282, 1994.

[KMV10] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures of two Gaussians. In STOC, pages 553–562, 2010.

[Knu81] Donald E. Knuth. The Art of Computer Programming, Volume II: Seminumerical Algorithms, 2nd Edition. Addison-Wesley, 1981.

[Mik93] V.G. Mikhailov. On a refinement of the central limit theorem for sums of independent random indicators. Theory Probab. Appl., 38:479–489, 1993.

[MV10] Ankur Moitra and Gregory Valiant.
Settling the Polynomial Learnability of Mixtures of Gaussians. In FOCS, pages 93–102, 2010.

[NJ05] N.L. Johnson, A.W. Kemp, and S. Kotz. Univariate discrete distributions. John Wiley & Sons, Inc., New York, NY, USA, 2005.

[Poi37] S.D. Poisson. Recherches sur la Probabilité des jugements en matière criminelle et en matière civile. Bachelier, Paris, 1837.

[PW11] Y. Peres and S. Watson. Personal communication, 2011.

[Röl07] A. Röllin. Translated Poisson Approximation Using Exchangeable Pair Couplings. Annals of Applied Probability, 17(5/6):1596–1614, 2007.

[Roo00] B. Roos. Binomial approximation to the Poisson binomial distribution: The Krawtchouk expansion. Theory Probab. Appl., 45:328–344, 2000.

[Sal76] Eugene Salamin. Computation of pi using arithmetic-geometric mean. Mathematics of Computation, 30(135):565–570, 1976.

[Sat93] Ken-Iti Sato. Convolution of unimodal distributions can produce any number of modes. Annals of Probability, 21(3):1543–1549, 1993.

[SOAJ14] Ananda Theertha Suresh, Alon Orlitsky, Jayadev Acharya, and Ashkan Jafarpour. Near-optimal-sample estimators for spherical gaussian mixtures. In the Annual Conference on Neural Information Processing Systems (NIPS), 2014.

[Soo96] S.Y.T. Soon. Binomial approximation for dependent indicators. Statist. Sinica, 6:703–714, 1996.

[SS71] A. Schönhage and V. Strassen. Schnelle multiplikation grosser zahlen. Computing, 7:281–292, 1971.

[Ste94] J.M. Steele. Le Cam's Inequality and Poisson Approximation. Amer. Math. Monthly, 101:48–54, 1994.

[Vol95] A.Yu. Volkova. A refinement of the central limit theorem for sums of independent random indicators. Theory Probab. Appl., 40:791–794, 1995.

[VV11] Gregory Valiant and Paul Valiant.
Estimating the unseen: an $n/\log(n)$-sample estimator for entropy and support size, shown optimal via new CLTs. In STOC, pages 685–694, 2011.

[Wan93] Y.H. Wang. On the number of successes in independent trials. Statistica Sinica, 3:295–312, 1993.

[Whi80] E.T. Whittaker. A course of modern analysis. Cambridge University Press, 1980.

[Yat85] Y.G. Yatracos. Rates of convergence of minimum distance estimators and Kolmogorov's entropy. Annals of Statistics, 13:768–774, 1985.

A Extension of the Cover Theorem: Proof of Theorem 4

Theorem 4 restates the main cover theorem (Theorem 1) of [DP13], except that it claims an additional property, namely what follows the word "finally" in the statement of the theorem. (We will sometimes refer to this property as the last part of Theorem 4 in the following discussion.) Our goal is to show that the cover of [DP13] already satisfies this property without any modifications, thereby establishing Theorem 4. To avoid reproducing the involved constructions of [DP13], we will assume that the reader has some familiarity with them. Still, our proof here will be self-contained.

First, we note that the $\epsilon$-cover $\mathcal{S}_\epsilon$ of Theorem 1 of [DP13] is a subset of a larger $\epsilon/2$-cover $\mathcal{S}'_{\epsilon/2}$ of size $n^2 + n \cdot (1/\epsilon)^{O(1/\epsilon^2)}$, which includes all the $k$-sparse and all the $k$-heavy Binomial PBDs (up to permutations of the underlying $p_i$'s), for some $k = O(1/\epsilon)$. Let us call $\mathcal{S}'_{\epsilon/2}$ the "large $\epsilon/2$-cover" to distinguish it from $\mathcal{S}_\epsilon$, which we will call the "small $\epsilon$-cover." The reader is referred to Theorem 2 in [DP13] (and the discussion following that theorem) for a description of the large $\epsilon/2$-cover, and to Section 3.2 of [DP13] for how this cover is used to construct the small $\epsilon$-cover.
In particular, the small $\epsilon$-cover is a subset of the large $\epsilon/2$-cover, including only a subset of the sparse form distributions in the large $\epsilon/2$-cover. Moreover, for every sparse form distribution in the large $\epsilon/2$-cover, the small $\epsilon$-cover includes at least one sparse form distribution that is $\epsilon/2$-close in total variation distance. Hence, if the large $\epsilon/2$-cover satisfies the last part of Theorem 4 (with $\epsilon/2$ instead of $\epsilon$, and $S'_{\epsilon/2}$ instead of $S_\epsilon$), it follows that the small $\epsilon$-cover also satisfies the last part of Theorem 4. So we proceed to argue that, for all $\epsilon$, the large $\epsilon$-cover implied by Theorem 2 of [DP13] satisfies the last part of Theorem 4.

Let us first review how the large cover is constructed. (See Section 4 of [DP13] for the details.) For every collection of indicators $\{X_i\}_{i=1}^n$ with expectations $\{\mathrm{E}[X_i] = p_i\}_i$, the collection is subjected to two filters, called the Stage 1 and Stage 2 filters, described respectively in Sections 4.1 and 4.2 of [DP13]. Using the same notation as [DP13], let us denote by $\{Z_i\}_i$ the collection output by the Stage 1 filter and by $\{Y_i\}_i$ the collection output by the Stage 2 filter. The collection $\{Y_i\}_i$ output by the Stage 2 filter satisfies $d_{TV}(\sum_i X_i, \sum_i Y_i) \le \epsilon$, and is included in the cover (possibly after permuting the $Y_i$'s). Moreover, it is in sparse or heavy Binomial form. This ensures that, for every $\{X_i\}_i$, there exists some $\{Y_i\}_i$ in the cover that is $\epsilon$-close and is in sparse or heavy Binomial form. We proceed to show that the cover thus defined satisfies the last part of Theorem 4.

For $\{X_i\}_i$, $\{Y_i\}_i$ and $\{Z_i\}_i$ as above, let $(\mu, \sigma^2)$, $(\mu_Z, \sigma_Z^2)$ and $(\mu_Y, \sigma_Y^2)$ denote respectively the (mean, variance) pairs of the variables $X = \sum_i X_i$, $Z = \sum_i Z_i$ and $Y = \sum_i Y_i$.
We argue first that the pair $(\mu_Z, \sigma_Z^2)$ satisfies $|\mu - \mu_Z| = O(\epsilon)$ and $|\sigma^2 - \sigma_Z^2| = O(\epsilon \cdot (1 + \sigma^2))$. Next we argue that, if the collection $\{Y_i\}_i$ output by the Stage 2 filter is in heavy Binomial form, then $(\mu_Y, \sigma_Y^2)$ satisfies $|\mu - \mu_Y| = O(1)$ and $|\sigma^2 - \sigma_Y^2| = O(1 + \epsilon \cdot (1 + \sigma^2))$, concluding the proof.

- Proof for $(\mu_Z, \sigma_Z^2)$: The Stage 1 filter only modifies the indicators $X_i$ with $p_i \in (0, 1/k) \cup (1 - 1/k, 1)$, for some well-chosen $k = O(1/\epsilon)$. For convenience let us define $L_k = \{i : p_i \in (0, 1/k)\}$ and $H_k = \{i : p_i \in (1 - 1/k, 1)\}$ as in [DP13]. The Stage 1 filter rounds the expectations of the indicators indexed by $L_k$ to some value in $\{0, 1/k\}$ so that no single expectation is altered by more than an additive $1/k$, and the sum of these expectations is not modified by more than an additive $1/k$. Similarly, the expectations of the indicators indexed by $H_k$ are rounded to some value in $\{1 - 1/k, 1\}$. See Section 4.1 of [DP13] for the details of how the rounding is performed. Let us denote by $\{p'_i\}_i$ the expectations of the indicators $\{Z_i\}_i$ resulting from the rounding. We argue that the mean and variance of $Z = \sum_i Z_i$ are close to the mean and variance of $X$. Indeed,
$$|\mu - \mu_Z| = \Big|\sum_i p_i - \sum_i p'_i\Big| = \Big|\sum_{i \in L_k \cup H_k} p_i - \sum_{i \in L_k \cup H_k} p'_i\Big| \le O(1/k) = O(\epsilon). \quad (11)$$
Similarly,
$$|\sigma^2 - \sigma_Z^2| = \Big|\sum_i p_i(1 - p_i) - \sum_i p'_i(1 - p'_i)\Big| \le \Big|\sum_{i \in L_k} p_i(1 - p_i) - \sum_{i \in L_k} p'_i(1 - p'_i)\Big| + \Big|\sum_{i \in H_k} p_i(1 - p_i) - \sum_{i \in H_k} p'_i(1 - p'_i)\Big|.$$
We proceed to bound the two terms of the right-hand side separately. Since the argument is symmetric for $L_k$ and $H_k$, we only treat $L_k$.
We have
$$\Big|\sum_{i \in L_k} p_i(1 - p_i) - \sum_{i \in L_k} p'_i(1 - p'_i)\Big| = \Big|\sum_{i \in L_k} (p_i - p'_i)\big(1 - (p_i + p'_i)\big)\Big|$$
$$= \Big|\sum_{i \in L_k} (p_i - p'_i) - \sum_{i \in L_k} (p_i - p'_i)(p_i + p'_i)\Big| \le \Big|\sum_{i \in L_k} (p_i - p'_i)\Big| + \Big|\sum_{i \in L_k} (p_i - p'_i)(p_i + p'_i)\Big|$$
$$\le \frac{1}{k} + \sum_{i \in L_k} |p_i - p'_i|(p_i + p'_i) \le \frac{1}{k} + \frac{1}{k}\sum_{i \in L_k}(p_i + p'_i) \le \frac{1}{k} + \frac{1}{k}\Big(2\sum_{i \in L_k} p_i + \frac{1}{k}\Big)$$
$$= \frac{1}{k} + \frac{1}{k^2} + \frac{2}{k(1 - 1/k)}\sum_{i \in L_k} p_i(1 - 1/k) \le \frac{1}{k} + \frac{1}{k^2} + \frac{2}{k - 1}\sum_{i \in L_k} p_i(1 - p_i),$$
where the last inequality uses that $p_i < 1/k$ for $i \in L_k$, hence $1 - 1/k < 1 - p_i$. Using the above (and a symmetric argument for the index set $H_k$) we obtain:
$$|\sigma^2 - \sigma_Z^2| \le \frac{2}{k} + \frac{2}{k^2} + \frac{2}{k - 1}\sigma^2 = O(\epsilon)(1 + \sigma^2). \quad (12)$$

- Proof for $(\mu_Y, \sigma_Y^2)$: After the Stage 1 filter is applied to the collection $\{X_i\}_i$, the resulting collection of random variables $\{Z_i\}_i$ has expectations $p'_i \in \{0, 1\} \cup [1/k, 1 - 1/k]$, for all $i$. The Stage 2 filter has different form depending on the cardinality of the set $\mathcal{M} = \{i \mid p'_i \in [1/k, 1 - 1/k]\}$. In particular, if $|\mathcal{M}| > k^3$ the output of the Stage 2 filter is in heavy Binomial form, while if $|\mathcal{M}| \le k^3$ the output is in sparse form. As we only seek a guarantee for the distributions in heavy Binomial form, it suffices to consider the former case.

  - $|\mathcal{M}| > k^3$: Let $\{Y_i\}_i$ be the collection produced by Stage 2 and let $Y = \sum_i Y_i$. Then Lemma 4 of [DP13] implies that $|\mu_Z - \mu_Y| = O(1)$ and $|\sigma_Z^2 - \sigma_Y^2| = O(1)$. Combining this with (11) and (12) gives $|\mu - \mu_Y| = O(1)$ and $|\sigma^2 - \sigma_Y^2| = O(1 + \epsilon \cdot (1 + \sigma^2))$.

This concludes the proof of Theorem 4.

B Birgé's theorem: Learning unimodal distributions

Here we briefly explain how Theorem 5 follows from [Bir97]. We assume that the reader is moderately familiar with the paper [Bir97].
Birgé (see his Theorem 1 and Corollary 1) upper bounds the expected variation distance between the target distribution (which he denotes $f$) and the hypothesis distribution constructed by his algorithm (which he denotes $\hat{f}_n$; note, though, that his "$n$" parameter denotes the number of samples used by the algorithm, while we will denote this by "$m$", reserving "$n$" for the domain $\{1, \dots, n\}$ of the distribution). More precisely, [Bir97] shows that this expected variation distance is at most that of the Grenander estimator (applied to learn a unimodal distribution when the mode is known) plus a lower-order term. For our Theorem 5 we take Birgé's "$\eta$" parameter to be $\epsilon$. With this choice of $\eta$, by the results of [Bir87a, Bir87b] bounding the expected error of the Grenander estimator, if $m = O(\log(n)/\epsilon^3)$ samples are used in Birgé's algorithm, then the expected variation distance between the target distribution and his hypothesis distribution is at most $O(\epsilon)$.

To go from expected error $O(\epsilon)$ to an $O(\epsilon)$-accurate hypothesis with probability at least $1 - \delta$, we run the above-described algorithm $O(\log(1/\delta))$ times, so that with probability at least $1 - \delta$ some hypothesis obtained is $O(\epsilon)$-accurate. Then we use our hypothesis testing procedure of Lemma 8, or, more precisely, the extension provided in Lemma 10, to identify an $O(\epsilon)$-accurate hypothesis from within this pool of $O(\log(1/\delta))$ hypotheses. (The use of Lemma 10 is why the running time of Theorem 5 depends quadratically on $\log(1/\delta)$ and why the sample complexity contains the second $\frac{1}{\epsilon^2}\log\frac{1}{\delta}\log\log\frac{1}{\delta}$ term.)

It remains only to argue that a single run of Birgé's algorithm on a sample of size $m = O(\log(n)/\epsilon^3)$ can be carried out in $\tilde{O}(\log^2(n)/\epsilon^3)$ bit operations (recall that each sample is a $\log(n)$-bit string).
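The selection step above (run the learner $O(\log(1/\delta))$ times, then pick a good hypothesis from the pool using a fresh sample) can be illustrated with a simplified Scheffé-style round-robin tournament. This is only a sketch in the spirit of Lemma 8, not the paper's exact procedure; `scheffe_choose` and its scoring rule are our own illustrative names and simplifications.

```python
import random

def scheffe_choose(hypotheses, sample):
    """Pick a hypothesis pmf (dict: point -> prob) via pairwise Scheffe tests.

    For each pair (p, q) we look at the witness set W = {x : p(x) > q(x)} and
    award the win to whichever pmf assigns W a mass closer to its empirical
    mass under a fresh sample. The candidate with the most wins is returned.
    """
    m = len(sample)
    wins = [0] * len(hypotheses)
    for a in range(len(hypotheses)):
        for b in range(a + 1, len(hypotheses)):
            p, q = hypotheses[a], hypotheses[b]
            support = set(p) | set(q)
            W = {x for x in support if p.get(x, 0.0) > q.get(x, 0.0)}
            emp = sum(1 for x in sample if x in W) / m   # empirical mass of W
            pW = sum(p.get(x, 0.0) for x in W)
            qW = sum(q.get(x, 0.0) for x in W)
            if abs(pW - emp) <= abs(qW - emp):
                wins[a] += 1
            else:
                wins[b] += 1
    return max(range(len(hypotheses)), key=lambda i: wins[i])
```

With a uniform target over $\{0, \dots, 9\}$ and one grossly wrong candidate (a point mass), a few hundred fresh samples suffice for the tournament to return the accurate hypothesis.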
His algorithm begins by locating an $r \in [n]$ that approximately minimizes the value of his function $d(r)$ (see Section 3 of [Bir97]) to within an additive $\eta = \epsilon$ (see Definition 3 of his paper); intuitively, this $r$ represents his algorithm's "guess" at the true mode of the distribution. To locate such an $r$, following Birgé's suggestion in Section 3 of his paper, we begin by identifying two consecutive points in the sample such that $r$ lies between those two sample points. This can be done using $\log m$ stages of binary search over the (sorted) points in the sample, where at each stage of the binary search we compute the two functions $d^-$ and $d^+$ and proceed in the appropriate direction. To compute the function $d^-(j)$ at a given point $j$ (the computation of $d^+$ is analogous), we recall that $d^-(j)$ is defined as the maximum difference over $[1, j]$ between the empirical cdf and its convex minorant over $[1, j]$. The convex minorant of the empirical cdf (over $m$ points) can be computed in $\tilde{O}((\log n)m)$ bit operations (where the $\log n$ comes from the fact that each sample point is an element of $[n]$), and then by enumerating over all points in the sample that lie in $[1, j]$ (in time $O((\log n)m)$) we can compute $d^-(j)$. Thus it is possible to identify two adjacent points in the sample such that $r$ lies between them in time $\tilde{O}((\log n)m)$. Finally, as Birgé explains in the last paragraph of Section 3 of his paper, once two such points have been identified it is possible to again use binary search to find a point $r$ in that interval where $d(r)$ is minimized to within an additive $\eta$. Since the maximum difference between $d^-$ and $d^+$ can never exceed 1, at most $\log(1/\eta) = \log(1/\epsilon)$ stages of binary search are required here to find the desired $r$.
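The objects in this step can be made concrete with a small sketch: view the empirical cdf as points $(x, \widehat{F}(x))$ at the sorted, distinct sample values, compute the greatest convex minorant as the lower convex hull of those points, and take the maximum gap to obtain $d^-$. The function names below are our own; this illustrates the quantities involved, not Birgé's bit-level implementation.

```python
def _cross(o, a, b):
    """Cross product of vectors o->a and o->b; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_minorant(points):
    """Lower convex hull of (x, y) points given in increasing x order."""
    hull = []
    for p in points:
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

def d_minus(points):
    """Maximum gap between cdf values and their greatest convex minorant."""
    hull = convex_minorant(points)

    def minorant(x):
        # linear interpolation along the hull segments
        for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
            if x1 <= x <= x2:
                return y1 + (y2 - y1) * (x - x1) / (x2 - x1)
        return hull[-1][1]

    return max(y - minorant(x) for x, y in points)
```

For example, the cdf points $(0, 0), (1, 0.5), (2, 0.6), (3, 1)$ bulge above the segment from $(0, 0)$ to $(2, 0.6)$, so the hull omits $(1, 0.5)$ and the maximum gap is attained there.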
Finally, once the desired $r$ has been obtained, it is straightforward to output the final hypothesis (which Birgé denotes $\hat{f}_n$). As explained in his Definition 3, this hypothesis is the derivative of $\tilde{F}^r_n$, which is essentially the convex minorant of the empirical cdf to the left of $r$ and the convex majorant of the empirical cdf to the right of $r$. As described above, given a value of $r$ these convex majorants and minorants can be computed in $\tilde{O}((\log n)m)$ time, and the derivative is simply a collection of uniform distributions as claimed. This concludes our sketch of how Theorem 5 follows from [Bir97].

C Efficient Evaluation of the Poisson Distribution

In this section we provide an efficient algorithm to compute an additive approximation to the Poisson probability mass function. It seems that this should be a basic operation in numerical analysis, but we were not able to find it explicitly in the literature. Our main result for this section is the following.

Theorem 6. There is an algorithm that, on input a rational number $\lambda > 0$, and integers $k \ge 0$ and $t > 0$, produces an estimate $\hat{p}_k$ such that
$$|\hat{p}_k - p_k| \le \frac{1}{t},$$
where $p_k = \frac{\lambda^k e^{-\lambda}}{k!}$ is the probability that the Poisson distribution of parameter $\lambda$ assigns to integer $k$. The running time of the algorithm is $\tilde{O}(\langle t\rangle^3 + \langle k\rangle \cdot \langle t\rangle + \langle\lambda\rangle \cdot \langle t\rangle)$.

Proof. Clearly we cannot just compute $e^{-\lambda}$, $\lambda^k$ and $k!$ separately, as this would take time exponential in the description complexity of $k$ and $\lambda$. We follow instead an indirect approach. We start by rewriting the target probability as
$$p_k = e^{-\lambda + k\ln(\lambda) - \ln(k!)}.$$
Motivated by this formula, let $E_k := -\lambda + k\ln(\lambda) - \ln(k!)$. Note that $E_k \le 0$. Our goal is to approximate $E_k$ to within high enough accuracy and then use this approximation to approximate $p_k$.
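The indirect approach, working with the exponent $E_k$ rather than with $e^{-\lambda}$, $\lambda^k$ and $k!$ individually, is easy to sketch in floating point (which stands in here for the paper's exact rational arithmetic; `math.lgamma` plays the role of the Stirling-based approximation of $\ln k!$ developed below):

```python
import math

def poisson_pmf(k, lam):
    """Poisson pmf evaluated via its exponent: p_k = exp(-lam + k*ln(lam) - ln(k!)).

    Computing e^{-lam}, lam^k and k! separately overflows or underflows for
    large k and lam, whereas the exponent E_k stays a moderate-size number.
    """
    E_k = -lam + k * math.log(lam) - math.lgamma(k + 1)
    return math.exp(E_k)
```

For instance, `poisson_pmf(10**6, 1e6)` evaluates without overflow (the direct formula would involve integers with millions of digits), and for small inputs it agrees with the direct formula to machine precision.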
In particular, the main part of the argument involves an efficient algorithm to compute an approximation $\widehat{\widehat{E}}_k$ to $E_k$ satisfying
$$\Big|\widehat{\widehat{E}}_k - E_k\Big| \le \frac{1}{4t} \le \frac{1}{2t} - \frac{1}{8t^2}. \quad (13)$$
This approximation will have bit complexity $\tilde{O}(\langle k\rangle + \langle\lambda\rangle + \langle t\rangle)$ and be computable in time $\tilde{O}(\langle k\rangle\cdot\langle t\rangle + \langle\lambda\rangle + \langle t\rangle^3)$.

We first show that such an approximation suffices to complete the proof. For this, we claim that it is enough to approximate $e^{\widehat{\widehat{E}}_k}$ to within an additive error $\frac{1}{2t}$. Indeed, if $\hat{p}_k$ were the result of this approximation, then we would have:
$$\hat{p}_k \le e^{\widehat{\widehat{E}}_k} + \frac{1}{2t} \le e^{E_k + \frac{1}{2t} - \frac{1}{8t^2}} + \frac{1}{2t} \le e^{E_k + \ln(1 + \frac{1}{2t})} + \frac{1}{2t} \le e^{E_k}\Big(1 + \frac{1}{2t}\Big) + \frac{1}{2t} \le p_k + \frac{1}{t};$$
and similarly
$$\hat{p}_k \ge e^{\widehat{\widehat{E}}_k} - \frac{1}{2t} \ge e^{E_k - (\frac{1}{2t} - \frac{1}{8t^2})} - \frac{1}{2t} \ge e^{E_k - \ln(1 + \frac{1}{2t})} - \frac{1}{2t} \ge \frac{e^{E_k}}{1 + \frac{1}{2t}} - \frac{1}{2t} \ge e^{E_k}\Big(1 - \frac{1}{2t}\Big) - \frac{1}{2t} \ge p_k - \frac{1}{t}.$$
(The third inequality in each chain uses $\ln(1 + x) \ge x - x^2/2$ with $x = \frac{1}{2t}$, and the last uses $p_k = e^{E_k} \le 1$.)

To approximate $e^{\widehat{\widehat{E}}_k}$ given $\widehat{\widehat{E}}_k$, we need the following lemma:

Lemma 17. Let $\alpha \le 0$ be a rational number. There is an algorithm that computes an estimate $\widehat{e^\alpha}$ such that
$$\big|\widehat{e^\alpha} - e^\alpha\big| \le \frac{1}{2t},$$
and has running time $\tilde{O}(\langle\alpha\rangle\cdot\langle t\rangle + \langle t\rangle^2)$.

Proof. Since $e^\alpha \in [0, 1]$, the point of the additive grid $\{\frac{i}{4t}\}_{i=1}^{4t}$ closest to $e^\alpha$ achieves error at most $1/(4t)$. Equivalently, on a logarithmic scale, consider the grid $\{\ln\frac{i}{4t}\}_{i=1}^{4t}$ and let
$$j^* := \arg\min_j \Big|\alpha - \ln\Big(\frac{j}{4t}\Big)\Big|.$$
Then we have that $\big|\frac{j^*}{4t} - e^\alpha\big| \le \frac{1}{4t}$.

The idea of the algorithm is to approximately identify the point $j^*$ by computing approximations to the points of the logarithmic grid, combined with a binary search procedure. Indeed, consider the "rounded" grid $\big\{\widehat{\ln\frac{i}{4t}}\big\}_{i=1}^{4t}$, where each $\widehat{\ln(\frac{i}{4t})}$ is an approximation to $\ln(\frac{i}{4t})$ that is accurate to within an additive $\frac{1}{16t}$. Notice that, for $i = 1, \dots, 4t$:
$$\ln\Big(\frac{i+1}{4t}\Big) - \ln\Big(\frac{i}{4t}\Big) = \ln\Big(1 + \frac{1}{i}\Big) \ge \ln\Big(1 + \frac{1}{4t}\Big) > \frac{1}{8t}.$$
Given that our approximations are accurate to within an additive $1/(16t)$, it follows that the rounded grid $\big\{\widehat{\ln\frac{i}{4t}}\big\}_{i=1}^{4t}$ is monotonic in $i$. The algorithm does not construct the points of this grid explicitly, but adaptively, as it needs them. In particular, it performs a binary search in the set $\{1, \dots, 4t\}$ to find the point
$$i^* := \arg\min_i \Big|\alpha - \widehat{\ln\Big(\frac{i}{4t}\Big)}\Big|.$$
In every iteration of the search, when the algorithm examines the point $j$, it needs to compute the approximation $g_j = \widehat{\ln(\frac{j}{4t})}$ and evaluate the distance $|\alpha - g_j|$. It is known that the logarithm of a number $x$ with a binary fraction of $L$ bits and an exponent of $o(L)$ bits can be computed to within a relative error $O(2^{-L})$ in time $\tilde{O}(L)$ [Bre75]. It follows that $g_j$ has $O(\langle t\rangle)$ bits and can be computed in time $\tilde{O}(\langle t\rangle)$. The subtraction takes linear time, i.e., it uses $O(\langle\alpha\rangle + \langle t\rangle)$ bit operations. Therefore, each step of the binary search can be done in time $O(\langle\alpha\rangle) + \tilde{O}(\langle t\rangle)$, and thus the overall algorithm has $O(\langle\alpha\rangle\cdot\langle t\rangle) + \tilde{O}(\langle t\rangle^2)$ running time.

The algorithm outputs $\frac{i^*}{4t}$ as its final approximation to $e^\alpha$. We argue next that the achieved error is at most an additive $\frac{1}{2t}$. Since the distance between two consecutive points of the grid $\{\ln\frac{i}{4t}\}_{i=1}^{4t}$ is more than $1/(8t)$, and our approximations are accurate to within an additive $1/(16t)$, a little thought reveals that $i^* \in \{j^* - 1, j^*, j^* + 1\}$. This implies that $\frac{i^*}{4t}$ is within an additive $\frac{1}{2t}$ of $e^\alpha$, as desired, and the proof of the lemma is complete.

Given Lemma 17, we describe how to approximate $e^{\widehat{\widehat{E}}_k}$ given $\widehat{\widehat{E}}_k$. Recall that we want to output an estimate $\hat{p}_k$ such that $\big|\hat{p}_k - e^{\widehat{\widehat{E}}_k}\big| \le 1/(2t)$. We distinguish the following cases:

- If $\widehat{\widehat{E}}_k \ge 0$, we output $\hat{p}_k := 1$.
  Indeed, given that $\big|\widehat{\widehat{E}}_k - E_k\big| \le \frac{1}{4t}$ and $E_k \le 0$, if $\widehat{\widehat{E}}_k \ge 0$ then $\widehat{\widehat{E}}_k \in [0, \frac{1}{4t}]$. Hence, because $t \ge 1$, $e^{\widehat{\widehat{E}}_k} \in [1, 1 + \frac{1}{2t}]$, so $1$ is within an additive $\frac{1}{2t}$ of the right answer.

- Otherwise, $\hat{p}_k$ is defined to be the estimate obtained by applying Lemma 17 with $\alpha := \widehat{\widehat{E}}_k$. Given the bit complexity of $\widehat{\widehat{E}}_k$, the running time of this procedure is $\tilde{O}(\langle k\rangle\cdot\langle t\rangle + \langle\lambda\rangle\cdot\langle t\rangle + \langle t\rangle^2)$.

Hence, the overall running time is $\tilde{O}(\langle k\rangle\cdot\langle t\rangle + \langle\lambda\rangle\cdot\langle t\rangle + \langle t\rangle^3)$. In view of the above, we only need to show how to compute $\widehat{\widehat{E}}_k$. There are several steps to our approximation:

1. (Stirling's Asymptotic Approximation): Recall Stirling's asymptotic approximation (see e.g. [Whi80], p. 193), which says that $\ln k!$ equals
$$k\ln(k) - k + (1/2)\cdot\ln(2\pi k) + \sum_{j=2}^{m} \frac{B_j\cdot(-1)^j}{j(j-1)\cdot k^{j-1}} + O(1/k^m),$$
where the $B_j$ are the Bernoulli numbers. We define an approximation of $\ln k!$ as follows:
$$\widehat{\ln k!} := k\ln(k) - k + (1/2)\cdot\ln(2\pi k) + \sum_{j=2}^{m_0} \frac{B_j\cdot(-1)^j}{j(j-1)\cdot k^{j-1}}$$
for $m_0 := O\big(\big\lceil\frac{\langle t\rangle}{\langle k\rangle}\big\rceil\big) + 1$.

2. (Definition of an approximate exponent $\widehat{E}_k$): Define $\widehat{E}_k := -\lambda + k\ln(\lambda) - \widehat{\ln(k!)}$. Given the above discussion, we can bound the distance of $\widehat{E}_k$ from the true exponent $E_k$ as follows:
$$|E_k - \widehat{E}_k| \le \big|\ln(k!) - \widehat{\ln(k!)}\big| \le O(1/k^{m_0}) \quad (14)$$
$$\le \frac{1}{10t}. \quad (15)$$
So we can focus our attention on approximating $\widehat{E}_k$. Note that $\widehat{E}_k$ is the sum of $m_0 + 2 = O\big(\frac{\log t}{\log k}\big)$ terms. To approximate it within error $1/(10t)$, it suffices to approximate each summand to within an additive error of $O(1/(t\cdot\log t))$. Indeed, we so approximate each summand, and our final approximation $\widehat{\widehat{E}}_k$ will be the sum of these approximations. We proceed with the analysis:

3. (Estimating $2\pi$): Since $2\pi$ shows up in the above expression, we should try to approximate it.
It is known that the first $\ell$ digits of $\pi$ can be computed exactly in time $O(\log\ell\cdot M(\ell))$, where $M(\ell)$ is the time required to multiply two $\ell$-bit integers [Sal76, Bre76]. For example, if we use the Schönhage–Strassen algorithm for multiplication [SS71], we get $M(\ell) = O(\ell\cdot\log\ell\cdot\log\log\ell)$. Hence, choosing $\ell := \lceil\log_2(12t\cdot\log t)\rceil$, we can obtain in time $\tilde{O}(\langle t\rangle)$ an approximation $\widehat{2\pi}$ of $2\pi$ that has a binary fraction of $\ell$ bits and satisfies
$$\big|\widehat{2\pi} - 2\pi\big| \le 2^{-\ell} \;\Rightarrow\; (1 - 2^{-\ell})2\pi \le \widehat{2\pi} \le (1 + 2^{-\ell})2\pi.$$
Note that, with this approximation, we have
$$\big|\ln(2\pi) - \ln(\widehat{2\pi})\big| \le \big|\ln(1 - 2^{-\ell})\big| \le 2^{-\ell} \le 1/(12t\cdot\log t).$$

4. (Floating-Point Representation): We will also need accurate approximations to $\ln\widehat{2\pi}$, $\ln k$ and $\ln\lambda$. We think of $\widehat{2\pi}$ and $k$ as multiple-precision floating point numbers base 2. In particular,

- $\widehat{2\pi}$ can be described with a binary fraction of $\ell + 3$ bits and a constant-size exponent; and
- $k \equiv 2^{\lceil\log k\rceil}\cdot\frac{k}{2^{\lceil\log k\rceil}}$ can be described with a binary fraction of $\lceil\log k\rceil$, i.e., $\langle k\rangle$, bits and an exponent of length $O(\log\log k)$, i.e., $O(\log\langle k\rangle)$, bits.

Also, since $\lambda$ is a positive rational number, $\lambda = \frac{\lambda_1}{\lambda_2}$, where $\lambda_1$ and $\lambda_2$ are positive integers of at most $\langle\lambda\rangle$ bits each. Hence, for $i = 1, 2$, we can think of $\lambda_i$ as a multiple-precision floating point number base 2 with a binary fraction of $\langle\lambda\rangle$ bits and an exponent of length $O(\log\langle\lambda\rangle)$. Hence, if we choose
$$L = \lceil\log_2(12(3k+1)t^2\cdot k\cdot\lambda_1\cdot\lambda_2)\rceil = O(\langle k\rangle + \langle\lambda\rangle + \langle t\rangle),$$
we can represent all of the numbers $\widehat{2\pi}, \lambda_1, \lambda_2, k$ as multiple-precision floating point numbers with a binary fraction of $L$ bits and an exponent of $O(\log L)$ bits.

5. (Estimating the logs): It is known that the logarithm of a number $x$ with a binary fraction of $L$ bits and an exponent of $o(L)$ bits can be computed to within a relative error $O(2^{-L})$ in time $\tilde{O}(L)$ [Bre75].
Hence, in time $\tilde{O}(L)$ we can obtain approximations $\widehat{\ln\widehat{2\pi}}$, $\widehat{\ln k}$, $\widehat{\ln\lambda_1}$, $\widehat{\ln\lambda_2}$ such that:

- $\big|\widehat{\ln k} - \ln k\big| \le 2^{-L}\ln k \le \frac{1}{12(3k+1)t^2}$;
- similarly, $\big|\widehat{\ln\lambda_i} - \ln\lambda_i\big| \le \frac{1}{12(3k+1)t^2}$, for $i = 1, 2$; and
- $\big|\widehat{\ln\widehat{2\pi}} - \ln\widehat{2\pi}\big| \le \frac{1}{12(3k+1)t^2}$.

6. (Estimating the terms of the series): To complete the analysis, we also need to approximate each term of the form $c_j = \frac{B_j}{j(j-1)\cdot k^{j-1}}$ up to an additive error of $O(1/(t\cdot\log t))$. We do this as follows: we compute the numbers $B_j$ and $k^{j-1}$ exactly, and we perform the division approximately. Clearly, the positive integer $k^{j-1}$ has description complexity $j\cdot\langle k\rangle = O(m_0\cdot\langle k\rangle) = O(\langle t\rangle + \langle k\rangle)$, since $j = O(m_0)$. We compute $k^{j-1}$ exactly using repeated squaring in time $\tilde{O}(j\cdot\langle k\rangle) = \tilde{O}(\langle t\rangle + \langle k\rangle)$. It is known [Fil92] that the rational number $B_j$ has $\tilde{O}(j)$ bits and can be computed in $\tilde{O}(j^2) = \tilde{O}(\langle t\rangle^2)$ time. Hence, the approximate evaluation of the term $c_j$ (up to the desired additive error of $1/(t\log t)$) can be done in $\tilde{O}(\langle t\rangle^2 + \langle k\rangle)$ time, by a rational division operation (see e.g. [Knu81]). Summing the approximate terms takes linear time, hence the approximate evaluation of the entire truncated series (comprising at most $m_0 \le \langle t\rangle$ terms) can be done in $\tilde{O}(\langle t\rangle^3 + \langle k\rangle\cdot\langle t\rangle)$ time overall. Let $\widehat{\widehat{E}}_k$ be the approximation arising if we use all of the aforementioned approximations. It follows from the above computations that
$$\Big|\widehat{\widehat{E}}_k - \widehat{E}_k\Big| \le \frac{1}{10t}.$$

7. (Overall Error): Combining the above computations we get:
$$\Big|\widehat{\widehat{E}}_k - E_k\Big| \le \frac{1}{4t}.$$
The overall time needed to obtain $\widehat{\widehat{E}}_k$ was $\tilde{O}(\langle k\rangle\cdot\langle t\rangle + \langle\lambda\rangle + \langle t\rangle^3)$, and the proof of Theorem 6 is complete.
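The truncated Stirling series of Step 1 is easy to check numerically. The sketch below computes the Bernoulli numbers exactly via the standard recurrence $B_n = -\frac{1}{n+1}\sum_{j=0}^{n-1}\binom{n+1}{j}B_j$ (exact rational arithmetic via `fractions`, standing in for the [Fil92] algorithm), truncates the series at $m_0$ terms, and compares against `math.lgamma`; the function names are our own.

```python
import math
from fractions import Fraction

def bernoulli(m):
    """Bernoulli numbers B_0..B_m (convention B_1 = -1/2), computed exactly."""
    B = [Fraction(0)] * (m + 1)
    B[0] = Fraction(1)
    for n in range(1, m + 1):
        B[n] = -sum(Fraction(math.comb(n + 1, j)) * B[j] for j in range(n)) / (n + 1)
    return B

def ln_factorial(k, m0):
    """Truncated Stirling series for ln(k!), with truncation error O(1/k^m0)."""
    B = bernoulli(m0)
    s = k * math.log(k) - k + 0.5 * math.log(2 * math.pi * k)
    for j in range(2, m0 + 1):
        # term B_j * (-1)^j / (j*(j-1)*k^(j-1)); odd B_j vanish for j >= 3
        s += (-1) ** j * float(B[j]) / (j * (j - 1) * k ** (j - 1))
    return s
```

Already for $k = 10$ and $m_0 = 8$ the truncated series matches $\ln(10!)$ to better than $10^{-9}$, consistent with the $O(1/k^{m_0})$ truncation error.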