Near-Optimal Density Estimation in Near-Linear Time Using Variable-Width Histograms


Authors: Siu-On Chan, Ilias Diakonikolas, Rocco A. Servedio, Xiaorui Sun

Near-Optimal Density Estimation in Near-Linear Time Using Variable-Width Histograms

Siu-On Chan (Microsoft Research, New England; sochan@gmail.com)
Ilias Diakonikolas* (University of Edinburgh; ilias.d@ed.ac.uk)
Rocco A. Servedio† (Columbia University; rocco@cs.columbia.edu)
Xiaorui Sun‡ (Columbia University; xiaoruisun@cs.columbia.edu)

May 2, 2022

*Supported by EPSRC grant EP/L021749/1 and a Marie Curie Career Integration Grant.
†Supported by NSF grants CCF-0915929 and CCF-1115703.
‡Supported by NSF grant CCF-1149257.

Abstract

Let $p$ be an unknown and arbitrary probability distribution over $[0,1)$. We consider the problem of density estimation, in which a learning algorithm is given i.i.d. draws from $p$ and must (with high probability) output a hypothesis distribution that is close to $p$. The main contribution of this paper is a highly efficient density estimation algorithm for learning using a variable-width histogram, i.e., a hypothesis distribution with a piecewise constant probability density function. In more detail, for any $k$ and $\varepsilon$, we give an algorithm that makes $\tilde{O}(k/\varepsilon^2)$ draws from $p$, runs in $\tilde{O}(k/\varepsilon^2)$ time, and outputs a hypothesis distribution $h$ that is piecewise constant with $O(k \log^2(1/\varepsilon))$ pieces. With high probability the hypothesis $h$ satisfies $d_{TV}(p,h) \le C \cdot \mathrm{opt}_k(p) + \varepsilon$, where $d_{TV}$ denotes the total variation distance (statistical distance), $C$ is a universal constant, and $\mathrm{opt}_k(p)$ is the smallest total variation distance between $p$ and any $k$-piecewise constant distribution. The sample size and running time of our algorithm are optimal up to logarithmic factors. The "approximation factor" $C$ in our result is inherent in the problem, as we prove that no algorithm with sample size bounded in terms of $k$ and $\varepsilon$ can achieve $C < 2$ regardless of what kind of hypothesis distribution it uses.

1 Introduction

Consider the following fundamental statistical task: Given independent draws from an unknown probability distribution, what is the minimum sample size needed to obtain an accurate estimate of the distribution? This is the question of density estimation, a classical problem in statistics with a rich history and an extensive literature (see e.g., [BBBB72, DG85, Sil86, Sco92, DL01]). While this broad question has mostly been studied from an information-theoretic perspective, it is an inherently algorithmic question as well, since the ultimate goal is to describe and understand algorithms that are both computationally and information-theoretically efficient. The need for computationally efficient learning algorithms is only becoming more acute with the recent flood of data across the sciences; the "gold standard" in this "big data" context is an algorithm with information-theoretically (near-)optimal sample size and running time (near-)linear in its sample size.

In this paper we consider learning scenarios in which an algorithm is given an input data set which is a sample of i.i.d. draws from an unknown probability distribution. It is natural to expect (and can be easily formalized) that, if the underlying distribution of the data is inherently "complex", it may be hard to even approximately reconstruct the distribution. But what if the underlying distribution is "simple" or "succinct" – can we then reconstruct the distribution to high accuracy in a computationally and sample-efficient way?
In this paper we answer this question in the affirmative for the problem of learning "noisy" histograms, arguably one of the most basic density estimation problems in the literature. To motivate our results, we begin by briefly recalling the role of histograms in density estimation. Histograms constitute "the oldest and most widely used method for density estimation" [Sil86], first introduced by Karl Pearson in [Pea95]. Given a sample from a probability density function (pdf) $p$, the method partitions the domain into a number of intervals (bins) $B_1, \ldots, B_k$, and outputs the "empirical" pdf which is constant within each bin. A $k$-histogram is a piecewise constant distribution over bins $B_1, \ldots, B_k$, where the probability mass of each interval $B_j$, $j \in [k]$, equals the fraction of observations in the interval. Thus, the goal of the "histogram method" is to approximate an unknown pdf $p$ by an appropriate $k$-histogram. It should be emphasized that the number $k$ of bins to be used and the "width" and location of each bin are unspecified; they are parameters of the estimation problem and are typically selected in an ad hoc manner.

We study the following distribution learning question: Suppose that there exists a $k$-histogram that provides an accurate approximation to the unknown target distribution. Can we efficiently find such an approximation? In this paper, we provide a fairly complete affirmative answer to this basic question. Given a bound $k$ on the number of intervals, we give an algorithm that uses a near-optimal sample size, runs in near-linear time (in its sample size), and approximates the target distribution nearly as accurately as the best $k$-histogram.

To formally state our main result, we will need a few definitions. We work in a standard model of learning an unknown probability distribution from samples, essentially that of [KMR+94], which is a natural analogue of Valiant's well-known PAC model for learning Boolean functions [Val84] to the unsupervised setting of learning an unknown probability distribution.¹ A distribution learning problem is defined by a class $\mathcal{C}$ of distributions over a domain $\Omega$. The algorithm has access to independent draws from an unknown pdf $p$, and its goal is to output a hypothesis distribution $h$ that is "close" to the target distribution $p$. We measure the closeness between distributions using the statistical distance or total variation distance. In the "noiseless" setting, we are promised that $p \in \mathcal{C}$ and the goal is to construct a hypothesis $h$ such that (with high probability) the total variation distance $d_{TV}(h, p)$ between $h$ and $p$ is at most $\varepsilon$, where $\varepsilon > 0$ is the accuracy parameter. The more challenging "noisy" or agnostic model captures the situation of having arbitrary (or even adversarial) noise in the data. In this setting, we do not make any assumptions about the target density $p$, and the goal is to find a hypothesis $h$ that is almost as accurate as the "best" approximation of $p$ by any distribution in $\mathcal{C}$.

¹We remark that our model is essentially equivalent to the "minimax rate of convergence under the $L_1$ distance" in statistics [DL01], and our results carry over to this setting as well.
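To make the classical histogram method concrete, here is a minimal Python sketch (illustrative, not from the paper; the bin edges, function name, and sample below are placeholder choices) that computes the empirical $k$-histogram for a fixed set of bins:

```python
import numpy as np

def empirical_histogram(samples, bin_edges):
    """Empirical k-histogram for fixed bins: each bin B_j gets probability
    mass equal to the fraction of observations in it, and the pdf is
    constant on each bin (mass divided by bin width)."""
    samples = np.asarray(samples)
    counts, _ = np.histogram(samples, bins=bin_edges)
    mass = counts / len(samples)          # probability mass per bin
    heights = mass / np.diff(bin_edges)   # constant pdf value per bin

    def pdf(x):
        j = np.searchsorted(bin_edges, x, side="right") - 1
        inside = (j >= 0) & (j < len(heights))
        return np.where(inside, heights[np.clip(j, 0, len(heights) - 1)], 0.0)

    return pdf

# Example: a 4-histogram fit to draws from a triangular density on [0, 1).
rng = np.random.default_rng(0)
sample = rng.triangular(0.0, 0.2, 1.0, size=10_000)
h = empirical_histogram(sample, np.linspace(0.0, 1.0, 5))
print(h(np.array([0.1, 0.5, 0.9])))
```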
Formally, given sample access to a (potentially arbitrary) target distribution $p$ and $\varepsilon > 0$, the goal of an agnostic learning algorithm for $\mathcal{C}$ is to compute a hypothesis distribution $h$ such that $d_{TV}(h, p) \le \alpha \cdot \mathrm{opt}_{\mathcal{C}}(p) + \varepsilon$, where $\mathrm{opt}_{\mathcal{C}}(p) := \inf_{q \in \mathcal{C}} d_{TV}(q, p)$ – i.e., $\mathrm{opt}_{\mathcal{C}}(p)$ is the statistical distance between $p$ and the closest distribution to it in $\mathcal{C}$ – and $\alpha \ge 1$ is a constant (that may depend on the class $\mathcal{C}$). We will call such a learning algorithm an $\alpha$-agnostic learning algorithm for $\mathcal{C}$; when $\alpha > 1$ we sometimes refer to this as a semi-agnostic learning algorithm.

A distribution $f$ over a finite interval $I \subseteq \mathbb{R}$ is called $k$-flat if there exists a partition of $I$ into $k$ intervals $I_1, \ldots, I_k$ such that the pdf $f$ is constant within each such interval. We henceforth (without loss of generality for densities with bounded support) restrict ourselves to the case $I = [0,1)$. Let $\mathcal{C}_k$ be the class of all $k$-flat distributions over $[0,1)$. For a (potentially arbitrary) distribution $p$ over $[0,1)$ we denote $\mathrm{opt}_k(p) := \inf_{f \in \mathcal{C}_k} d_{TV}(f, p)$. In this terminology, our learning problem is exactly the problem of agnostically learning the class of $k$-flat distributions. Our main positive result is a near-optimal algorithm for this problem, i.e., a semi-agnostic learning algorithm that has near-optimal sample size and near-linear running time. More precisely, we prove the following:

Theorem 1 (Main). There is an algorithm $A$ with the following property: Given $k \ge 1$, $\varepsilon > 0$, and sample access to a target distribution $p$, algorithm $A$ uses $\tilde{O}(k/\varepsilon^2)$ independent draws from $p$, runs in time $\tilde{O}(k/\varepsilon^2)$, and outputs an $O(k \log^2(1/\varepsilon))$-flat hypothesis distribution $h$ that satisfies $d_{TV}(h, p) \le O(\mathrm{opt}_k(p)) + \varepsilon$ with probability at least $9/10$.

Using standard techniques, the confidence probability can be boosted to $1 - \delta$, for any $\delta > 0$, with a (necessary) overhead of $O(\log(1/\delta))$ in the sample size and the running time. We emphasize that the difficulty of our result lies in the fact that the "optimal" piecewise constant decomposition of the domain is both unknown and approximate (in the sense that $\mathrm{opt}_k(p) > 0$); and that our algorithm is both sample-optimal and runs in (near-)linear time. Even in the (significantly easier) case that the target $p \in \mathcal{C}_k$ (i.e., $\mathrm{opt}_k(p) = 0$) and the optimal partition is explicitly given to the algorithm, it is known that a sample of size $\Omega(k/\varepsilon^2)$ is information-theoretically necessary. (This lower bound can, e.g., be deduced from the standard fact that learning an unknown discrete distribution over a $k$-element set to statistical distance $\varepsilon$ requires an $\Omega(k/\varepsilon^2)$ size sample.) Hence, our algorithm has provably optimal sample complexity (up to a logarithmic factor), runs in essentially sample-linear time, and is $\alpha$-agnostic for a universal constant $\alpha > 1$. It should be noted that the sample size required for our problem is well understood; it follows from the VC theorem (Theorem 3) that $O(k/\varepsilon^2)$ draws from $p$ are information-theoretically sufficient.
However, the theorem is non-constructive, and the "obvious" algorithm following from it has running time exponential in $k$ and $1/\varepsilon$. In recent work, Chan et al. [CDSS14] presented an approach employing an intricate combination of dynamic programming and linear programming which yields a $\mathrm{poly}(k/\varepsilon)$ time algorithm for the above problem. However, the running time of the [CDSS14] algorithm is $\Omega(k^3)$ even for constant values of $\varepsilon$, making it impractical for applications. As discussed below, our algorithmic approach is significantly different from that of [CDSS14], using neither dynamic nor linear programming.

Applications. Nonparametric density estimation for shape restricted classes has been a subject of study in statistics since the 1950's (see [BBBB72] for an early book on the topic and [Gre56, Bru58, Rao69, Weg70, HP76, Gro85, Bir87] for some of the early literature), and has applications to a range of areas including reliability theory (see [Reb05] and references therein). By using the structural approximation results of Chan et al. [CDSS13], as an immediate corollary of Theorem 1 we obtain sample optimal and near-linear time estimators for various well-studied classes of shape restricted densities, including monotone, unimodal, and multimodal densities (with unknown mode locations), monotone hazard rate (MHR) distributions, and others (because of space constraints we do not enumerate the exact descriptions of these classes or statements of these results here, but instead refer the interested reader to [CDSS13]). Birgé [Bir87] obtained a sample optimal and linear time estimator for monotone densities, but prior to our work, no linear time and sample optimal estimator was known for any of the other classes.

Our algorithm from Theorem 1 is $\alpha$-agnostic for a constant $\alpha > 1$. It is natural to ask whether a significantly stronger accuracy guarantee is efficiently achievable; in particular, is there an agnostic algorithm with similar running time and sample complexity and $\alpha = 1$? Perhaps surprisingly, we provide a negative answer to this question. Even in the simplest nontrivial case that $k = 2$, and the target distribution is defined over a discrete domain $[N] = \{1, \ldots, N\}$, any $\alpha$-agnostic algorithm with $\alpha < 2$ requires a large sample size:

Theorem 2 (Lower bound, informal statement). Any $1.99$-agnostic learning algorithm for $2$-flat distributions over $[N]$ requires a sample of size $\Omega(\sqrt{N})$.

See Theorem 7 in Section 4 for a precise statement. Note that there is an exact correspondence between distributions over the discrete domain $[N]$ and pdf's over $[0,1)$ which are piecewise constant on each interval of the form $[k/N, (k+1)/N)$ for $k \in \{0, 1, \ldots, N-1\}$. Thus, Theorem 2 implies that no finite sample algorithm can $1.99$-agnostically learn even $2$-flat distributions over $[0,1)$. (See Corollary 4.3 in Section 4 for a detailed statement.)

Related work. A number of techniques for density estimation have been developed in the mathematical statistics literature, including kernels and variants thereof, nearest neighbor estimators, orthogonal series estimators, maximum likelihood estimators (MLE), and others (see Chapter 2 of [Sil86] for a survey of existing methods).
The main focus of these methods has been on the statistical rate of convergence, as opposed to the running time of the corresponding estimators. We remark that the MLE does not exist for very simple classes of distributions (e.g., unimodal distributions with an unknown mode, see e.g., [Bir97]). We note that the notion of agnostic learning is related to the literature on model selection and oracle inequalities [MP07]; however, that work is of a different flavor and is not technically related to our results. Histograms have also been studied extensively in various areas of computer science, including databases and streaming [JKM+98, GKS06, CMN98, GGI+02], under various assumptions about the input data and the precise objective. Recently, Indyk et al. [ILR12] studied the problem of learning a $k$-flat distribution over $[N]$ under the $L_2$ norm and gave an efficient algorithm with sample complexity $O(k^2 \log(N)/\varepsilon^4)$. Since the $L_1$ distance is a stronger metric, Theorem 1 implies an improved sample and time bound of $\tilde{O}(k/\varepsilon^2)$ for their setting.

2 Preliminaries

Throughout the paper we assume that the underlying distributions have Lebesgue measurable densities. For a pdf $p : [0,1) \to \mathbb{R}^+$ and a Lebesgue measurable subset $A \subseteq [0,1)$, i.e., $A \in \mathcal{L}([0,1))$, we use $p(A)$ to denote $\int_{z \in A} p(z)$. The statistical distance or total variation distance between two densities $p, q : [0,1) \to \mathbb{R}^+$ is $d_{TV}(p,q) := \sup_{A \in \mathcal{L}([0,1))} |p(A) - q(A)|$. The statistical distance satisfies the identity $d_{TV}(p,q) = \frac{1}{2}\|p - q\|_1$, where $\|p - q\|_1$, the $L_1$ distance between $p$ and $q$, is $\int_{[0,1)} |p(x) - q(x)|\,dx$; for convenience, in the rest of the paper we work with $L_1$ distance. We refer to a nonnegative function $p$ over an interval (which need not necessarily integrate to one over the interval) as a "sub-distribution."

Given a value $\kappa > 0$, we say that a (sub-)distribution $p$ over $[0,1)$ is $\kappa$-well-behaved if $\sup_{x \in [0,1)} \Pr_{x \sim p}[x] \le \kappa$, i.e., no individual real value is assigned more than $\kappa$ probability under $p$. Any probability distribution with no atoms is $\kappa$-well-behaved for all $\kappa > 0$. Our results apply for general distributions over $[0,1)$ which may have an atomic part as well as a non-atomic part. Given $m$ independent draws $s_1, \ldots, s_m$ from a distribution $p$ over $[0,1)$, the empirical distribution $\hat{p}_m$ over $[0,1)$ is the discrete distribution supported on $\{s_1, \ldots, s_m\}$ defined as follows: for all $z \in [0,1)$, $\Pr_{x \sim \hat{p}_m}[x = z] = |\{j \in [m] \mid s_j = z\}|/m$.

The VC inequality. Let $p : [0,1) \to \mathbb{R}$ be a Lebesgue measurable function. Given a family of subsets $\mathcal{A} \subseteq \mathcal{L}([0,1))$ over $[0,1)$, define $\|p\|_{\mathcal{A}} = \sup_{A \in \mathcal{A}} |p(A)|$. The VC dimension of $\mathcal{A}$ is the maximum size of a subset $X \subseteq [0,1)$ that is shattered by $\mathcal{A}$ (a set $X$ is shattered by $\mathcal{A}$ if for every $Y \subseteq X$, some $A \in \mathcal{A}$ satisfies $A \cap X = Y$). If there is a shattered subset of size $s$ for all $s \in \mathbb{Z}^+$, then we say that the VC dimension of $\mathcal{A}$ is $\infty$. The well-known Vapnik–Chervonenkis (VC) inequality states the following:

Theorem 3 (VC inequality, [DL01, p. 31]). Let $p : I \to \mathbb{R}^+$ be a probability density function over $I \subseteq \mathbb{R}$ and $\hat{p}_m$ be the empirical distribution obtained after drawing $m$ points from $p$. Let $\mathcal{A} \subseteq 2^I$ be a family of subsets with VC dimension $d$. Then $\mathbb{E}[\|p - \hat{p}_m\|_{\mathcal{A}}] \le O(\sqrt{d/m})$.
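Because all distributions in play are piecewise constant, the $L_1$ distance (and hence $d_{TV} = \frac{1}{2}\|\cdot\|_1$) between two $k$-flat densities can be computed exactly on the common refinement of their breakpoint grids. A small illustrative Python sketch (not from the paper):

```python
import numpy as np

def l1_distance_flat(edges_p, heights_p, edges_q, heights_q):
    """Exact L1 distance between two piecewise constant pdfs on [0, 1).

    Each density is given by its breakpoints (including 0 and 1) and its
    constant value on each resulting interval.  On the common refinement
    of the two grids both densities are constant, so the integral of
    |p - q| reduces to a finite sum."""
    grid = np.union1d(edges_p, edges_q)
    mids, widths = (grid[:-1] + grid[1:]) / 2, np.diff(grid)

    def at(edges, heights, x):            # pdf value on each refined cell
        return np.asarray(heights)[np.searchsorted(edges, x, side="right") - 1]

    diff = np.abs(at(edges_p, heights_p, mids) - at(edges_q, heights_q, mids))
    return float(np.sum(diff * widths))

# A 2-flat density vs. the uniform density: L1 = 0.5, so d_TV = 0.25.
l1 = l1_distance_flat([0.0, 0.5, 1.0], [1.5, 0.5], [0.0, 1.0], [1.0])
print(l1, l1 / 2)
```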
Partitioning into intervals of approximately equal mass. As a basic primitive, given access to a sample drawn from a $\kappa$-well-behaved target distribution $p$ over $[0,1)$, we will need to partition $[0,1)$ into $\Theta(1/\kappa)$ intervals each of which has probability $\Theta(\kappa)$ under $p$. There is a simple algorithm, based on order statistics, which does this and has the following performance guarantee (see Appendix A.2 of [CDSS14]):

Lemma 2.1. Given $\kappa \in (0,1)$ and access to points drawn from a $\kappa/64$-well-behaved distribution $p$ over $[0,1)$, the procedure Approximately-Equal-Partition draws $O((1/\kappa)\log(1/\kappa))$ points from $p$, runs in time $\tilde{O}(1/\kappa)$, and with probability at least $99/100$ outputs a partition of $[0,1)$ into $\ell = \Theta(1/\kappa)$ intervals such that $p(I_j) \in [\kappa/2, 3\kappa]$ for all $1 \le j \le \ell$.

3 The algorithm and its analysis

In this section we prove our main algorithmic result, Theorem 1. Our approach has the following high-level structure: In Section 3.1 we give an algorithm for agnostically learning a target distribution $p$ that is "nice" in two senses: (i) $p$ is well-behaved (i.e., it does not have any heavy atomic elements), and (ii) $\mathrm{opt}_k(p)$ is bounded from above by the error parameter $\varepsilon$. In Section 3.2 we give a general efficient reduction showing how the second assumption can be removed, and in Section 3.3 we briefly explain how the first assumption can be removed, thus yielding Theorem 1.

3.1 The main algorithm

In this section we give our main algorithmic result, which handles well-behaved distributions $p$ for which $\mathrm{opt}_k(p)$ is not too large:

Theorem 4. There is an algorithm Learn-WB-small-opt-$k$-histogram that, given as input $\tilde{O}(k/\varepsilon^2)$ i.i.d. draws from a target distribution $p$ and a parameter $\varepsilon > 0$, runs in time $\tilde{O}(k/\varepsilon^2)$ and has the following performance guarantee: If (i) $p$ is $\frac{\varepsilon}{384\,k\log(1/\varepsilon)}$-well-behaved, and (ii) $\mathrm{opt}_k(p) \le \varepsilon$, then with probability at least $19/20$, it outputs an $O(k \cdot \log^2(1/\varepsilon))$-flat distribution $h$ such that $d_{TV}(p,h) \le 2 \cdot \mathrm{opt}_k(p) + 3\varepsilon$.

We require some notation and terminology. Let $r$ be a distribution over $[0,1)$, and let $\mathcal{P}$ be a set of disjoint intervals contained in $[0,1)$. The $\mathcal{P}$-flattening of $r$, denoted $(r)_{\mathcal{P}}$, is the sub-distribution defined as
$$(r)_{\mathcal{P}}(v) = \begin{cases} r(I)/|I| & \text{if } v \in I,\ I \in \mathcal{P}, \\ 0 & \text{if } v \text{ does not belong to any } I \in \mathcal{P}. \end{cases}$$
Observe that if $\mathcal{P}$ is a partition of $[0,1)$, then (since $r$ is a distribution) $(r)_{\mathcal{P}}$ is a distribution. We say that two intervals $I, I'$ are consecutive if $I = [a,b)$ and $I' = [b,c)$. Given two consecutive intervals $I, I'$ contained in $[0,1)$ and a sub-distribution $r$, we use $\alpha_r(I, I')$ to denote the $L_1$ distance between $(r)_{\{I, I'\}}$ and $(r)_{\{I \cup I'\}}$, i.e.,
$$\alpha_r(I, I') = \int_{I \cup I'} \left| (r)_{\{I, I'\}}(x) - (r)_{\{I \cup I'\}}(x) \right| \, dx.$$
Note here that $\{I \cup I'\}$ is a set that contains one element, the interval $[a,c)$.

3.1.1 Intuition for the algorithm

We begin with a high-level intuitive explanation of the Learn-WB-small-opt-$k$-histogram algorithm. It starts in Step 1 by constructing a partition of $[0,1)$ into $z = \Theta(k/\varepsilon')$ intervals $I_1, \ldots, I_z$ (where $\varepsilon' = \tilde{\Theta}(\varepsilon)$) such that $p$ has weight $\Theta(\varepsilon'/k)$ on each subinterval.
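A minimal sketch of the order-statistics idea behind Approximately-Equal-Partition (the actual procedure and its constants are given in Appendix A.2 of [CDSS14]; the function name, sample size, and example density here are illustrative):

```python
import numpy as np

def approx_equal_partition(sample, kappa):
    """Cut [0, 1) into roughly 1/kappa intervals of mass roughly kappa:
    sort the sample and place a breakpoint at every floor(m*kappa)-th
    order statistic, so each cell captures about a kappa fraction of the
    draws -- and hence, for a large enough sample from a well-behaved p,
    about kappa probability mass under p."""
    s = np.sort(np.asarray(sample))
    step = max(1, int(len(s) * kappa))
    edges = np.concatenate(([0.0], s[step::step], [1.0]))
    return np.unique(edges)               # drop any duplicate endpoints

rng = np.random.default_rng(1)
sample = rng.beta(2.0, 5.0, size=20_000)  # an atomless (well-behaved) target
edges = approx_equal_partition(sample, kappa=0.1)
# The empirical mass of every interval should be close to kappa = 0.1.
print(np.round(np.diff([np.mean(sample < e) for e in edges]), 3))
```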
In Step 2 the algorithm draws a sample of $\tilde{O}(k/\varepsilon^2)$ points from $p$ and uses them to define an empirical distribution $\hat{p}_m$. This is the only step in which points are drawn from $p$. For the rest of this intuitive explanation we pretend that the weight $\hat{p}_m(I)$ that the empirical distribution assigns to each interval $I$ is actually the same as the true weight $p(I)$ (Lemma 3.1 below shows that this is not too far from the truth).

Before continuing with our explanation of the algorithm, let us digress briefly by imagining for a moment that the target distribution $p$ actually is a $k$-flat distribution (i.e., that $\mathrm{opt}_k(p) = 0$). In this case there are at most $k$ "breakpoints", and hence at most $k$ intervals $I_j$ for which $\alpha_{\hat{p}_m}(I_j, I_{j+1}) > 0$, so computing the $\alpha_{\hat{p}_m}(I_j, I_{j+1})$ values would be an easy way to identify the true breakpoints (and given these it is not difficult to construct a high-accuracy hypothesis). In reality, we may of course have $\mathrm{opt}_k(p) > 0$; this means that if we try to use the $\alpha_{\hat{p}_m}(I_j, I_{j+1})$ criterion to identify "breakpoints" of the optimal $k$-flat distribution that is closest to $p$ (call this $k$-flat distribution $q$), we may sometimes be "fooled" into thinking that $q$ has a breakpoint in an interval $I_j$ where it does not (but rather the value $\alpha_{\hat{p}_m}(I_j, I_{j+1})$ is large because of the difference between $q$ and $p$). However, recall that by assumption we have $\mathrm{opt}_k(p) \le \varepsilon$; this bound can be used to show that there cannot be too many intervals $I_j$ for which a large value of $\alpha_{\hat{p}_m}(I_j, I_{j+1})$ suggests a "spurious breakpoint" (see the proof of Lemma 3.3). This is helpful, but in and of itself not enough; since our partition $I_1, \ldots, I_z$ divides $[0,1)$ into $k/\varepsilon'$ intervals, a naive approach based on this would result in a $(k/\varepsilon')$-flat hypothesis distribution, which in turn would necessitate a sample complexity of $\tilde{O}(k/\varepsilon'^3)$, which is unacceptably high.

Instead, our algorithm performs a careful process of iteratively merging consecutive intervals for which the $\alpha_{\hat{p}_m}(I_j, I_{j+1})$ criterion indicates that a merge will not adversely affect the final accuracy by too much. As a result of this process we end up with $k \cdot \mathrm{polylog}(1/\varepsilon)$ intervals for the final hypothesis, which enables us to output a $(k \cdot \mathrm{polylog}(1/\varepsilon'))$-flat final hypothesis using $\tilde{O}(k/\varepsilon'^2)$ draws from $p$.

In more detail, this iterative merging is carried out by the main loop of the algorithm in Step 4. Going into the $t$-th iteration of the loop, the algorithm has a partition $\mathcal{P}_{t-1}$ of $[0,1)$ into disjoint subintervals, and a set $\mathcal{F}_{t-1} \subseteq \mathcal{P}_{t-1}$ (i.e., every interval belonging to $\mathcal{F}_{t-1}$ also belongs to $\mathcal{P}_{t-1}$). Initially $\mathcal{P}_0$ contains all the intervals $I_1, \ldots, I_z$ and $\mathcal{F}_0$ is empty. Intuitively, the intervals in $\mathcal{P}_{t-1} \setminus \mathcal{F}_{t-1}$ are still being "processed"; such an interval may possibly be merged with a consecutive interval from $\mathcal{P}_{t-1} \setminus \mathcal{F}_{t-1}$ if doing so would only incur a small "cost" (see condition (ii) of Step 4(b) of the algorithm). The intervals in $\mathcal{F}_{t-1}$ have been "frozen" and will not be altered or used subsequently in the algorithm.
3.1.2 The algorithm

Algorithm Learn-WB-small-opt-$k$-histogram:

Input: parameters $k \ge 1$, $\varepsilon > 0$; access to i.i.d. draws from target distribution $p$ over $[0,1)$.

Output: If (i) $p$ is $\frac{\varepsilon}{384\,k\log(1/\varepsilon)}$-well-behaved and (ii) $\mathrm{opt}_k(p) \le \varepsilon$, then with probability at least $99/100$ the output is a distribution $q$ such that $d_{TV}(p,q) \le 2\,\mathrm{opt}_k(p) + 3\varepsilon$.

1. Let $\varepsilon' = \varepsilon/\log(1/\varepsilon)$. Run Algorithm Approximately-Equal-Partition with input parameter $\frac{\varepsilon'}{6k}$ to partition $[0,1)$ into $z = \Theta(k/\varepsilon')$ intervals $I_1 = [i_0, i_1), \ldots, I_z = [i_{z-1}, i_z)$, where $i_0 = 0$ and $i_z = 1$, such that with probability at least $99/100$, for each $j \in \{1, \ldots, z\}$ we have $p([i_{j-1}, i_j)) \in [\varepsilon'/12k,\ \varepsilon'/2k]$ (assuming $p$ is $\varepsilon'/(384k)$-well-behaved).

2. Draw $m = \tilde{O}(k/\varepsilon'^2)$ points from $p$ and let $\hat{p}_m$ be the resulting empirical distribution.

3. Set $\mathcal{P}_0 = \{I_1, I_2, \ldots, I_z\}$ and $\mathcal{F}_0 = \emptyset$.

4. Let $s = \log_2 \frac{1}{\varepsilon'}$. Repeat for $t = 1, \ldots$ until $t = s$:

   (a) Initialize $\mathcal{P}_t$ to $\emptyset$ and $\mathcal{F}_t$ to $\mathcal{F}_{t-1}$.

   (b) Without loss of generality, assume $\mathcal{P}_{t-1} = \{I_{t-1,1}, \ldots, I_{t-1,z_{t-1}}\}$, where interval $I_{t-1,i}$ is to the left of $I_{t-1,i+1}$ for all $i$. Scan left to right across the intervals in $\mathcal{P}_{t-1}$ (i.e., iterate over $i = 1, \ldots, z_{t-1} - 1$). If intervals $I_{t-1,i}, I_{t-1,i+1}$ are (i) both not in $\mathcal{F}_{t-1}$, and (ii) $\alpha_{\hat{p}_m}(I_{t-1,i}, I_{t-1,i+1}) > \varepsilon'/(2k)$, then add both $I_{t-1,i}$ and $I_{t-1,i+1}$ into $\mathcal{F}_t$.

   (c) Initialize $i$ to 1, and repeatedly execute one of the following four (mutually exclusive and exhaustive) cases until $i > z_{t-1}$:

   [Case 1] $i \le z_{t-1} - 1$ and $I_{t-1,i} = [a,b)$, $I_{t-1,i+1} = [b,c)$ are consecutive intervals both not in $\mathcal{F}_t$. Add the merged interval $I_{t-1,i} \cup I_{t-1,i+1} = [a,c)$ into $\mathcal{P}_t$. Set $i \leftarrow i + 2$.

   [Case 2] $i \le z_{t-1} - 1$ and $I_{t-1,i} \in \mathcal{F}_t$. Set $i \leftarrow i + 1$.

   [Case 3] $i \le z_{t-1} - 1$, $I_{t-1,i} \notin \mathcal{F}_t$ and $I_{t-1,i+1} \in \mathcal{F}_t$. Add $I_{t-1,i}$ into $\mathcal{F}_t$ and set $i \leftarrow i + 2$.

   [Case 4] $i = z_{t-1}$. Add $I_{t-1,z_{t-1}}$ into $\mathcal{F}_t$ if it is not already in $\mathcal{F}_t$ and set $i \leftarrow i + 1$.

   (d) Set $\mathcal{P}_t \leftarrow \mathcal{P}_t \cup \mathcal{F}_t$.

5. Output the $|\mathcal{P}_s|$-flat hypothesis distribution $(\hat{p}_m)_{\mathcal{P}_s}$.
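The following Python sketch (illustrative only) mirrors the heart of the main loop, working directly with empirical interval masses: Step 4(b) freezes adjacent unfrozen pairs whose $\alpha$ value exceeds the threshold $\varepsilon'/(2k)$, and Step 4(c) merges the surviving adjacent unfrozen pairs, so the number of unfrozen intervals halves each round; the bookkeeping of Cases 2–4 is compressed into a single "keep and freeze" branch:

```python
def alpha(I, J):
    """alpha_r(I, J) for adjacent intervals given as (left, right, mass),
    via the closed form of equation (1) below."""
    lenI, lenJ = I[1] - I[0], J[1] - J[0]
    return 2.0 / (lenI + lenJ) * abs(I[2] * lenJ - J[2] * lenI)

def merge_rounds(intervals, thresh, rounds):
    """Halving rounds of Step 4 on a list of adjacent (left, right, mass)
    cells covering [0, 1).  Returns the final partition, as in P_s."""
    frozen = set()                                 # indices of F_t intervals
    for _ in range(rounds):
        prev = frozen.copy()                       # F_{t-1}
        # Step 4(b): freeze adjacent pairs (not in F_{t-1}) with large alpha.
        for i in range(len(intervals) - 1):
            if i not in prev and i + 1 not in prev:
                if alpha(intervals[i], intervals[i + 1]) > thresh:
                    frozen.update((i, i + 1))
        # Step 4(c): scan left to right, merging adjacent unfrozen pairs.
        new, new_frozen, i = [], set(), 0
        while i < len(intervals):
            if (i not in frozen and i + 1 < len(intervals)
                    and i + 1 not in frozen):      # Case 1: merge the pair
                a, b = intervals[i], intervals[i + 1]
                new.append((a[0], b[1], a[2] + b[2]))
                i += 2
            else:                                  # Cases 2-4: keep, freeze
                new_frozen.add(len(new))
                new.append(intervals[i])
                i += 1
        intervals, frozen = new, new_frozen
    return intervals
```

In the paper's setting one would call this with `thresh` $= \varepsilon'/(2k)$ and `rounds` $= \log_2(1/\varepsilon')$ on the Step 1 partition, and then output the flattening $(\hat{p}_m)_{\mathcal{P}_s}$ of the empirical distribution over the returned intervals.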
3.1.3 Analysis of the algorithm and proof of Theorem 4

It is straightforward to verify the claimed running time given Lemma 2.1, which bounds the running time of Approximately-Equal-Partition. Indeed, we note that Step 2, which simply draws $\tilde{O}(k/\varepsilon'^2)$ points and constructs the resulting empirical distribution, dominates the overall running time. In the rest of this subsection we prove correctness.

We first observe that with high probability the empirical distribution $\hat{p}_m$ defined in Step 2 gives a high-accuracy estimate of the true probability of any union of consecutive intervals from $I_1, \ldots, I_z$. The following lemma from [CDSS14] follows from the standard multiplicative Chernoff bound:

Lemma 3.1 (Lemma 12, [CDSS14]). With probability $99/100$ over the sample drawn in Step 2, for every $0 \le a < b \le z$ we have that $|\hat{p}_m([i_a, i_b)) - p([i_a, i_b))| \le \sqrt{\varepsilon'(b-a)} \cdot \varepsilon'/(10k)$.

We henceforth assume that this $99/100$-likely event indeed takes place, so the above inequality holds for all $0 \le a < b \le z$. We use this to show that the $\alpha_{\hat{p}_m}(I_{t-1,i}, I_{t-1,i+1})$ value that the algorithm uses in Step 4(b) is a good proxy for the actual value $\alpha_p(I_{t-1,i}, I_{t-1,i+1})$ (which of course is not accessible to the algorithm):

Lemma 3.2. Fix $1 \le t \le s$. Then we have
$$\left| \alpha_{\hat{p}_m}(I_{t-1,i}, I_{t-1,i+1}) - \alpha_p(I_{t-1,i}, I_{t-1,i+1}) \right| \le 2\varepsilon'/(5k).$$

Proof. Observe that in iteration $t$, two consecutive intervals $I_{t-1,i}$ and $I_{t-1,i+1}$ correspond to two unions of consecutive intervals $I_a \cup \cdots \cup I_b$ and $I_{b+1} \cup \cdots \cup I_c$ respectively from the original partition $\mathcal{P}_0$. Moreover, since each interval in $\mathcal{P}_{t-1} \setminus \mathcal{F}_{t-1}$, $t > 1$, is formed by merging two consecutive intervals from $\mathcal{P}_{t-2} \setminus \mathcal{F}_{t-2}$, it must be the case that $b - a + 1,\ c - b + 1 \le 2^{t-1} \le 2^{s-1} \le 1/(2\varepsilon')$. Hence, by Lemma 3.1, we have
$$|p(I_{t-1,i}) - \hat{p}_m(I_{t-1,i})| \le \sqrt{\varepsilon' \cdot 2^{s-1}} \cdot \frac{\varepsilon'}{10k} \le \frac{\varepsilon'}{10\sqrt{2}\,k},$$
and similarly $|p(I_{t-1,i+1}) - \hat{p}_m(I_{t-1,i+1})| \le \frac{\varepsilon'}{10\sqrt{2}\,k}$. To simplify notation, let $I = I_{t-1,i}$ and $J = I_{t-1,i+1}$. By definition of $\alpha$,
$$\alpha_p(I,J) = \left| \frac{p(I)}{|I|} - \frac{p(I)+p(J)}{|I|+|J|} \right| |I| + \left| \frac{p(J)}{|J|} - \frac{p(I)+p(J)}{|I|+|J|} \right| |J| = \frac{2}{|I|+|J|} \Big| p(I)\,|J| - p(J)\,|I| \Big|. \tag{1}$$
A straightforward calculation now gives that
$$|\alpha_p(I,J) - \alpha_{\hat{p}_m}(I,J)| = \frac{2}{|I|+|J|} \Big| \big(p(I)\,|J| - p(J)\,|I|\big) - \big(\hat{p}_m(I)\,|J| - \hat{p}_m(J)\,|I|\big) \Big| \le \frac{2}{|I|+|J|} \Big( \big|p(I) - \hat{p}_m(I)\big|\,|J| + \big|p(J) - \hat{p}_m(J)\big|\,|I| \Big) \le 2\varepsilon'/(5k). \qquad \Box$$
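As a quick numerical sanity check on the algebra in (1), the following snippet (with arbitrary interval endpoints and masses) compares the integral definition of $\alpha_r(I, I')$ against the closed form:

```python
# Check equation (1): the integral definition of alpha equals the closed
# form 2|r(I)|J| - r(J)|I|| / (|I| + |J|).  All values here are arbitrary.
a, b, c = 0.0, 0.3, 1.0                    # I = [a, b), J = [b, c)
rI, rJ = 0.2, 0.5                          # masses r(I) and r(J)
lenI, lenJ = b - a, c - b
sep_I, sep_J = rI / lenI, rJ / lenJ        # the flattening (r)_{I, J}
merged = (rI + rJ) / (lenI + lenJ)         # the flattening (r)_{I U J}
integral = abs(sep_I - merged) * lenI + abs(sep_J - merged) * lenJ
closed = 2.0 / (lenI + lenJ) * abs(rI * lenJ - rJ * lenI)
print(integral, closed)                    # both 0.02
```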
For the rest of the analysis, let $q$ denote a fixed $k$-flat distribution that is closest to $p$, so $\|p - q\|_1 = \mathrm{opt}_k(p)$. (We note that while $\mathrm{opt}_k(p)$ is defined as $\inf_{q \in \mathcal{C}_k} \|p - q\|_1$, standard closure arguments can be used to show that the infimum is actually achieved by some $k$-flat distribution $q$.) Let $\mathcal{Q}$ be the partition of $[0,1)$ corresponding to the intervals on which $q$ is piecewise constant. We say that a breakpoint of $\mathcal{Q}$ is a value in $[0,1]$ that is an endpoint of one of the (at most) $k$ intervals in $\mathcal{Q}$. The following important lemma bounds the number of intervals in the final partition $\mathcal{P}_s$:

Lemma 3.3. $\mathcal{P}_s$ contains at most $O(k \log^2(1/\varepsilon))$ intervals.

Proof. We start by recording a basic fact that will be useful in the proof of the lemma. Let $p$ be a distribution over an interval $I$ and let $q$ be any sub-distribution over $I$. Perhaps contrary to initial intuition, the optimal scaling $c \cdot q$, $c > 0$, of $q$ to approximate $p$ (with respect to the $L_1$ distance) is not necessarily obtained by scaling $q$ so that $c \cdot q$ is a distribution over $I$. However, a simple argument (see e.g., Appendix A.1 of [CDSS14]) shows that scaling so that $c \cdot q$ is a distribution cannot result in $L_1$-error more than twice that of the optimal scaling:

Claim 3.4. Let $p, g : I \to \mathbb{R}_{\ge 0}$ be probability distributions over $I$ (so $\int_I p(x)\,dx = \int_I g(x)\,dx = 1$). Then, writing $\|f\|_1$ to denote $\int_I |f(x)|\,dx$, for every $a > 0$ we have that $\|p - g\|_1 \le 2\,\|p - ag\|_1$.

We now proceed with the proof of Lemma 3.3. We first show that a total of at most $O(k \log(1/\varepsilon'))$ intervals are ever added into $\mathcal{F}_t$ across all executions of Step 4(b). Suppose that intervals $I_{t-1,i}, I_{t-1,i+1}$ are added into $\mathcal{F}_t$ in some execution of Step 4(b). We consider the following two cases:

Case 1: $I_{t-1,i} \cup I_{t-1,i+1}$ contains at least one breakpoint of $\mathcal{Q}$. Since $\mathcal{Q}$ has at most $k$ breakpoints, this can happen at most $k$ times in total.

Case 2: $I_{t-1,i} \cup I_{t-1,i+1}$ does not contain any breakpoint of $\mathcal{Q}$. Then $I_{t-1,i} \cup I_{t-1,i+1}$ is a subset of an interval in $\mathcal{Q}$. Recalling that intervals $I_{t-1,i}, I_{t-1,i+1}$ were added into $\mathcal{F}_t$ in an execution of Step 4(b), we have that $\alpha_{\hat{p}_m}(I_{t-1,i}, I_{t-1,i+1}) > \varepsilon'/(2k)$, and hence by Lemma 3.2 we have that $\alpha_p(I_{t-1,i}, I_{t-1,i+1}) \ge \frac{1}{5} \cdot \frac{\varepsilon'}{k}$. Claim 3.4 now implies that the contribution to the $L_1$ distance between $p$ and $q$ from $I_{t-1,i} \cup I_{t-1,i+1}$, i.e., $\int_{I_{t-1,i} \cup I_{t-1,i+1}} |p(x) - q(x)|\,dx$, is at least $\frac{1}{10} \cdot \frac{\varepsilon'}{k}$. Since $\|p - q\|_1 = \mathrm{opt}_k(p)$, there can be at most
$$k + O\!\left( \frac{\mathrm{opt}_k(p) \cdot k}{\varepsilon'} \right) = O\!\left( k \cdot \log \frac{1}{\varepsilon} \right)$$
intervals ever added into $\mathcal{F}_t$ across all executions of Step 4(b) (note that for the last equality we have used the assumption that $\mathrm{opt}_k(p) \le \varepsilon$).

Next, we argue that each $\mathcal{F}_t$ satisfies $|\mathcal{F}_t| \le O(k \log^2(1/\varepsilon))$. We have bounded the number of intervals added into $\mathcal{F}_t$ in Step 4(b) by $O(k \log(1/\varepsilon'))$, so it remains to bound the number of intervals added in Step 4(c)(Case 3) and 4(c)(Case 4). It is clear that a total of at most $O(\log(1/\varepsilon'))$ intervals are ever added in 4(c)(Case 4). Inspection of Step 4(c)(Case 3) shows that for a given value of $t$, the number of intervals that this step adds to $\mathcal{F}_t$ is at most the number of "blocks" of consecutive $\mathcal{F}_t$-intervals. Since each interval added in Step 4(c)(Case 3) extends a block of consecutive $\mathcal{F}_t$-intervals but does not create a new one (and hence does not increase their number), across the $s = \log(1/\varepsilon')$ stages, the total number of intervals that can be added in executions of Step 4(c)(Case 3) is at most $O(k \log^2(1/\varepsilon'))$. It follows that we have $|\mathcal{F}_s| = O(k \log^2(1/\varepsilon))$ as claimed.

To bound $|\mathcal{P}_t \setminus \mathcal{F}_t|$, we observe that by inspection of the algorithm, for each $t$ we have $|\mathcal{P}_t \setminus \mathcal{F}_t| \le \frac{1}{2} |\mathcal{P}_{t-1} \setminus \mathcal{F}_{t-1}|$. Since $|\mathcal{P}_0| = \Theta(k/\varepsilon')$, it follows that $|\mathcal{P}_s \setminus \mathcal{F}_s| = O(k)$, and the lemma is proved. $\Box$

The following definition will be useful:

Definition 5. Let $\mathcal{P}$ denote any partition of $[0,1)$. We say that partition $\mathcal{P}$ is $\varepsilon'$-good for $(p,q)$ if for every breakpoint $v$ of $\mathcal{Q}$, the interval $I$ in $\mathcal{P}$ containing $v$ satisfies $p(I) \le \varepsilon'/(2k)$.

The above definition is justified by the following lemma:

Lemma 3.5. If $\mathcal{P}$ is $\varepsilon'$-good for $(p,q)$, then $\|p - (p)_{\mathcal{P}}\|_1 \le 2\,\mathrm{opt}_k(p) + \varepsilon'$.

Proof. Fix an interval $I$ in $\mathcal{P}$. If there does not exist an interval $J$ in $\mathcal{Q}$ such that $I \subseteq J$, then $I$ must contain a breakpoint of $\mathcal{Q}$, and hence since $\mathcal{P}$ is $\varepsilon'$-good for $(p,q)$, we have $p(I) \le \varepsilon'/(2k)$. This implies that the contribution to $\|(p)_{\mathcal{P}} - q\|_1$ that comes from $I$, namely $\int_I |(p)_{\mathcal{P}}(x) - q(x)|\,dx$, satisfies
$$\int_I |(p)_{\mathcal{P}}(x) - q(x)|\,dx \le \int_I |(p)_{\mathcal{P}}(x) - p(x)|\,dx + \int_I |p(x) - q(x)|\,dx \le \int_I |p(x) - q(x)|\,dx + 2p(I) \le \int_I |p(x) - q(x)|\,dx + \frac{\varepsilon'}{k}.$$
The other possibility is that there exists an interval $J$ in $\mathcal{Q}$ such that $I \subseteq J$. In this case, we have that
$$\int_I |(p)_{\mathcal{P}}(x) - q(x)|\,dx \le \int_I |p(x) - q(x)|\,dx.$$
Since there are at most $k$ intervals in $\mathcal{P}$ containing breakpoints of $\mathcal{Q}$, summing the above inequalities over all intervals $I$ in $\mathcal{P}$, we get that $\|(p)_{\mathcal{P}} - q\|_1 \le \|p - q\|_1 + \varepsilon' = \mathrm{opt}_k(p) + \varepsilon'$, and hence $\|(p)_{\mathcal{P}} - p\|_1 \le \|(p)_{\mathcal{P}} - q\|_1 + \|p - q\|_1 \le 2\,\mathrm{opt}_k(p) + \varepsilon'$. $\Box$

We are now in a position to prove the following:
Lemma 3.6. There exists a partition $\mathcal{R}$ of $[0,1)$ that is $\varepsilon'$-good for $(p,q)$ and satisfies $\|(p)_{\mathcal{P}_s} - (p)_{\mathcal{R}}\|_1 \le \varepsilon$.

Proof. We construct the claimed $\mathcal{R}$ based on $\mathcal{P}_s, \mathcal{P}_{s-1}, \ldots, \mathcal{P}_0$ as follows: (i) If $I$ is an interval in $\mathcal{P}_s$ not containing a breakpoint of $\mathcal{Q}$, then $I$ is also in $\mathcal{R}$. (ii) If $I$ is an interval in $\mathcal{P}_s$ that does contain a breakpoint of $\mathcal{Q}$, then we further partition $I$ into a set of intervals $S$ by calling procedure Refine-Partition($s$, $I$). This recursive procedure exploits the local structure of the earlier, finer partitions $\mathcal{P}_{s-1}, \mathcal{P}_{s-2}, \ldots$ as described below.

Procedure Refine-Partition:

Input: integer $t$, interval $J$
Output: $S$, a partition of interval $J$

1. If $t = 0$, then output $\{J\}$.
2. If $J$ is an interval in $\mathcal{P}_t$, then
   (a) If $J$ contains a breakpoint of $\mathcal{Q}$, then output Refine-Partition($t-1$, $J$).
   (b) Otherwise output $\{J\}$.
3. Otherwise, $J$ is a union of two intervals in $\mathcal{P}_t$. Let $J_1$ and $J_2$ denote the two intervals in $\mathcal{P}_t$ such that $J_1 \cup J_2 = J$. Output Refine-Partition($t$, $J_1$) $\cup$ Refine-Partition($t$, $J_2$).

We claim that $|\mathcal{R}|$ (the number of intervals in $\mathcal{R}$) is at most $|\mathcal{P}_s| + O(k \cdot \log \frac{1}{\varepsilon})$. To see this, note that each interval $I \in \mathcal{P}_s$ not containing a breakpoint of $\mathcal{Q}$ (corresponding to (i) above) translates directly to a single interval of $\mathcal{R}$. For the intervals of type (ii) in $\mathcal{P}_s$, inspection of the Refine-Partition procedure shows that these intervals are collectively partitioned into at most $O(k \log(1/\varepsilon))$ intervals in $\mathcal{R}$.

In the rest of the proof, we show that for any interval $J$ in $\mathcal{P}_s$ containing at least one breakpoint of $\mathcal{Q}$, the contribution to the $L_1$ distance between $(p)_{\mathcal{P}_s}$ and $(p)_{\mathcal{R}}$ coming from interval $J$ is at most $|b_J| \cdot \frac{\varepsilon' \log(1/\varepsilon)}{k}$, where $b_J$ is the set of breakpoints of $\mathcal{Q}$ in $J$. Consider a fixed breakpoint $v$ of $\mathcal{Q}$. Let $I_{t,v}$ denote the interval containing $v$ in the partition $\mathcal{P}_t$. If $I_{t,v}$ merges with another interval in $\mathcal{P}_t$ in Case 1 of Step 4(c), we denote that other interval as $I'_{t,v}$. Since $I_{t,v}$ merges with $I'_{t,v}$ in Case 1 of Step 4(c), these intervals are both not in $\mathcal{F}_t$ and hence were both not in $\mathcal{F}_{t-1}$ in Step 4(b). Consequently, when $t > 1$ it must be the case that condition (ii) of Step 4(b) does not hold for these intervals, i.e., $\alpha_{\hat{p}_m}(I_{t,v}, I'_{t,v}) \le \varepsilon'/(2k)$. It follows by Lemma 3.2 that $\alpha_p(I_{t,v}, I'_{t,v})$ is at most $\frac{4\varepsilon'}{5k}$. When $t = 1$, we have a similar bound $\alpha_p(I_{t,v}, I'_{t,v}) \le \varepsilon'/k$, by using (1) and the fact that $p(I_{t,v}), p(I'_{t,v}) \le \varepsilon'/2k$ when $I_{t,v}, I'_{t,v} \in \mathcal{P}_0$. On the other hand, inspection of the procedure Refine-Partition gives that if two intervals in $\mathcal{P}_t$ are unions of some intervals in Refine-Partition($s$, $I$), and their union is an interval in $\mathcal{P}_{t+1}$, then there exists $v$ which is a breakpoint of $\mathcal{Q}$ such that the two intervals are $I_{t,v}$ and $I'_{t,v}$. Thus, the contribution to the $L_1$ distance between $(p)_{\mathcal{P}_s}$ and $(p)_{\mathcal{R}}$ coming from interval $J$ is at most $\frac{\varepsilon'}{k} \cdot \log \frac{1}{\varepsilon'} \cdot |b_J|$. Summing over all intervals $J$ that contain at least one breakpoint and recalling that the total number of breakpoints is at most $k$, we get that the overall $L_1$ distance between $(p)_{\mathcal{P}_s}$ and $(p)_{\mathcal{R}}$ is at most $\varepsilon$. $\Box$

Finally, by putting everything together we can prove Theorem 4:
Proof of Theorem 4. By Lemma 3.5 applied to $\mathcal{R}$, we have that $\|p - (p)_{\mathcal{R}}\|_1 \le 2\,\mathrm{opt}_k(p) + \varepsilon'$. By Lemma 3.6, we have that $\|(p)_{\mathcal{P}_s} - (p)_{\mathcal{R}}\|_1 \le \varepsilon$; thus the triangle inequality gives that $\|p - (p)_{\mathcal{P}_s}\|_1 \le 2\,\mathrm{opt}_k(p) + 2\varepsilon$. By Lemma 3.3 the partition $\mathcal{P}_s$ contains at most $O(k \log^2(1/\varepsilon))$ intervals, so both $(p)_{\mathcal{P}_s}$ and $(\hat{p}_m)_{\mathcal{P}_s}$ are $O(k \log^2(1/\varepsilon))$-flat distributions. Thus, $\|(p)_{\mathcal{P}_s} - (\hat{p}_m)_{\mathcal{P}_s}\|_1 = \|(p)_{\mathcal{P}_s} - (\hat{p}_m)_{\mathcal{P}_s}\|_{\mathcal{A}_\ell}$, where $\ell = O(k \log^2(1/\varepsilon))$ and $\mathcal{A}_\ell$ is the family of all subsets of $[0,1)$ that consist of unions of up to $\ell$ intervals (which has VC dimension $2\ell$). Consequently, by the VC inequality (Theorem 3), for a suitable choice of $m = \tilde{O}(k/\varepsilon'^2)$, we have that $\mathbb{E}[\|(p)_{\mathcal{P}_s} - (\hat{p}_m)_{\mathcal{P}_s}\|_1] \le 4\varepsilon'/100$. Markov's inequality now gives that with probability at least $96/100$ we have $\|(p)_{\mathcal{P}_s} - (\hat{p}_m)_{\mathcal{P}_s}\|_1 \le \varepsilon'$. Hence, with overall probability at least $19/20$ (recall the $1/100$ error probability incurred in Lemma 3.1), we have that $\|p - (\hat{p}_m)_{\mathcal{P}_s}\|_1 \le 2\,\mathrm{opt}_k(p) + 3\varepsilon$, and the theorem is proved. $\Box$

3.2 A general reduction to the case of small opt for semi-agnostic learning

In this section we show that under mild conditions, the general problem of agnostic distribution learning for a class $\mathcal{C}$ can be efficiently reduced to the special case when $\mathrm{opt}_{\mathcal{C}}$ is not too large compared with $\varepsilon$. While the reduction is simple and generic, we have not previously encountered it in the literature on density estimation, so we provide a proof in the following. A precise statement of the reduction follows:

Theorem 6. Let $A$ be an algorithm with the following behavior: $A$ is given as input i.i.d. points drawn from $p$ and a parameter $\varepsilon > 0$. $A$ uses $m(\varepsilon) = \Omega(1/\varepsilon)$ draws from $p$, runs in time $t(\varepsilon) = \Omega(1/\varepsilon)$, and satisfies the following: if $\mathrm{opt}_{\mathcal{C}}(p) \le 10\varepsilon$, then with probability at least $19/20$ it outputs a hypothesis distribution $q$ such that (i) $\|p - q\|_1 \le \alpha \cdot \mathrm{opt}_{\mathcal{C}}(p) + \varepsilon$, where $\alpha$ is an absolute constant, and (ii) given any $r \in [0,1)$, the value $q(r)$ of the pdf of $q$ at $r$ can be efficiently computed in $T$ time steps.

Then there is an algorithm $A'$ with the following performance guarantee: $A'$ is given as input i.i.d. draws from $p$ and a parameter $\varepsilon > 0$.² Algorithm $A'$ uses $O(m(\varepsilon/10) + \log\log(1/\varepsilon)/\varepsilon^2)$ draws from $p$, runs in time $O(t(\varepsilon/10)) + T \cdot \tilde{O}(1/\varepsilon^2)$, and outputs a hypothesis distribution $q'$ such that with probability at least $39/40$ we have $\|p - q'\|_1 \le 10(\alpha + 2) \cdot \mathrm{opt}_{\mathcal{C}}(p) + \varepsilon$.

²Note that now there is no guarantee that $\mathrm{opt}_{\mathcal{C}}(p) \le \varepsilon$; indeed, the point here is that $\mathrm{opt}_{\mathcal{C}}(p)$ may be arbitrary.
Proof. The algorithm $A'$ works in two stages, which we describe and analyze below. In the first stage, $A'$ iterates over $\lceil \log(20/\varepsilon) \rceil$ "guesses" for the value of $\mathrm{opt}_{\mathcal{C}}(p)$, where the $i$-th guess $g_i$ is $\frac{\varepsilon}{10} \cdot 2^{i-1}$ (so $g_1 = \frac{\varepsilon}{10}$ and $g_{\lceil \log(20/\varepsilon) \rceil} \ge 1$). For each value of $g_i$, it performs $r = O(1)$ runs of Algorithm $A$ (using a fresh sample from $p$ for each run), using parameter $g_i$ as the "$\varepsilon$" parameter for each run; let $h_{1,i}, \ldots, h_{r,i}$ be the $r$ hypotheses thus obtained for the $i$-th guess. It is clear that this stage uses $O(m(\varepsilon/10) + m(2\varepsilon/10) + \cdots) = O(m(\varepsilon))$ draws from $p$, and similarly that it runs in time $O(t(\varepsilon))$. If $\mathrm{opt}_{\mathcal{C}}(p) \le \varepsilon$, then (for a suitable choice of $r = O(1)$) we get that with probability at least $39/40$, some hypothesis $h_{1,\ell}$ satisfies $\|p - h_{1,\ell}\|_1 \le \alpha \cdot \mathrm{opt}_{\mathcal{C}}(p) + \varepsilon/10$. Otherwise, there must be some $i \in \{2, \ldots, \lceil \log(20/\varepsilon) \rceil\}$ such that $g_i/2 < \mathrm{opt}_{\mathcal{C}}(p) \le g_i$; in this case, for a suitable choice of $r = O(1)$ we get that with probability at least $39/40$, there is some hypothesis $h_{i,\ell}$ that satisfies $\|p - h_{i,\ell}\|_1 \le \alpha \cdot \mathrm{opt}_{\mathcal{C}}(p) + g_i \le (\alpha + 2) \cdot \mathrm{opt}_{\mathcal{C}}(p)$. Thus in either event, with probability at least $39/40$ some $h_{i,\ell}$ satisfies $\|p - h_{i,\ell}\|_1 \le (\alpha + 2) \cdot \mathrm{opt}_{\mathcal{C}}(p) + \varepsilon/10$.

In the second stage, $A'$ runs a hypothesis selection procedure to choose one of the candidate hypotheses $h_{i,\ell}$. A number of such procedures are known (see e.g., Section 6.6 of [DL01] or [DDS12, DK14, AJOS14]); all of them work by running some sort of "tournament" over the hypotheses, and all have the guarantee that with high probability they will output a hypothesis from the pool of candidates which has $L_1$ error (with respect to the target distribution $p$) not much worse than that of the best candidate in the pool. We use the classic Scheffé algorithm (see [DL01]) as described and analyzed in [AJOS14] (see Algorithm SCHEFFE* in Appendix B of that paper). Adapted to our context, this algorithm has the following performance guarantee:

Proposition 3.7. Let $p$ be a target distribution over $[0,1)$ and let $D_\tau = \{p_j\}_{j=1}^N$ be a collection of $N$ distributions over $[0,1)$ with the property that there exists $i \in [N]$ such that $\|p - p_i\|_1 \le \tau$. There is a procedure SCHEFFE which is given as input a parameter $\varepsilon > 0$ and a confidence parameter $\delta > 0$, and is provided with access to (i) i.i.d. draws from $p$ and from $p_i$ for all $i \in [N]$, and (ii) an evaluation oracle $\mathrm{eval}_{p_i}$ for each $i \in [N]$; this is a procedure which, on input $r \in [0,1)$, outputs the value $p_i(r)$ of the pdf of $p_i$ at the point $r$. The procedure SCHEFFE has the following behavior: It makes $s = O\big((1/\varepsilon^2) \cdot (\log N + \log(1/\delta))\big)$ draws from $p$ and from each $p_i$, $i \in [N]$, and $O(s)$ calls to each oracle $\mathrm{eval}_{p_i}$, $i \in [N]$, and performs $O(sN^2)$ arithmetic operations. With probability at least $1 - \delta$ it outputs an index $i^\star \in [N]$ that satisfies $\|p - p_{i^\star}\|_1 \le 10 \max\{\tau, \varepsilon\}$.

The algorithm $A'$ runs the procedure SCHEFFE using the $N = O(\log(1/\varepsilon))$ hypotheses $h_{i,\ell}$, with its "$\varepsilon$" parameter set to $\frac{1}{10}$ times the input parameter $\varepsilon$ that is given to $A'$ and its "$\delta$" parameter set to $1/40$. By Proposition 3.7, with overall probability at least $19/20$ the output is a hypothesis $h_{i,\ell}$ satisfying $\|p - h_{i,\ell}\|_1 \le 10(\alpha + 2)\,\mathrm{opt}_{\mathcal{C}}(p) + \varepsilon$. The overall running time and sample complexity are easily seen to be as claimed, and the theorem is proved. $\Box$
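A minimal sketch of the two-stage wrapper $A'$ (illustrative: `learner` stands for the assumed algorithm $A$, `select` for any tournament-style selection procedure in the spirit of SCHEFFE, and the interfaces below are hypothetical):

```python
import math

def agnostic_wrapper(draw, learner, select, eps, runs_per_guess=3):
    """Two-stage reduction behind Theorem 6 (sketch).

    draw(n)          -- n i.i.d. draws from the target p
    learner(draw, g) -- the assumed algorithm A run with accuracy g,
                        drawing its own fresh sample via `draw`
    select(hyps, draw, eps, delta) -- tournament hypothesis selection

    Stage 1 runs A a constant number of times for each geometric guess
    g_i = (eps/10) * 2^(i-1) of opt_C(p); stage 2 selects a hypothesis
    whose L1 error is within a constant factor of the best in the pool.
    """
    hypotheses = []
    for i in range(math.ceil(math.log2(20.0 / eps))):
        g = (eps / 10.0) * 2.0 ** i                # guess for opt_C(p)
        for _ in range(runs_per_guess):            # r = O(1) fresh runs
            hypotheses.append(learner(draw, g))
    return select(hypotheses, draw, eps / 10.0, 1.0 / 40.0)
```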
3.3 Dealing with distributions that are not well behaved

The assumption that the target distribution $p$ is $\tilde{\Theta}(\varepsilon/k)$-well-behaved can be straightforwardly removed by following the approach in Section 3.6 of [CDSS14]. That paper presents a simple linear-time sampling-based procedure, using $\tilde{O}(k/\varepsilon)$ samples, that with high probability identifies all the "heavy" elements (atoms which cause $p$ to not be well-behaved, if any such points exist). Our overall algorithm first runs this procedure to find the set $S$ of "heavy" elements, and then runs the algorithm presented above (which succeeds for well-behaved distributions, i.e., distributions that have no "heavy" elements), using as its target distribution the conditional distribution of $p$ over $[0,1) \setminus S$ (let us denote this conditional distribution by $p'$). A straightforward analysis given in [CDSS14] shows that (i) $\mathrm{opt}_k(p) \ge \mathrm{opt}_k(p')$, and moreover (ii) $d_{TV}(p, p') \le \mathrm{opt}_k(p)$. Thus, by the triangle inequality, any hypothesis $h$ satisfying $d_{TV}(h, p') \le C\,\mathrm{opt}_k(p') + \varepsilon$ will also satisfy $d_{TV}(h, p) \le (C+1)\,\mathrm{opt}_k(p) + \varepsilon$ as desired.

4 Lower bounds on agnostic learning

In this section we establish that $\alpha$-agnostic learning with $\alpha < 2$ is information-theoretically impossible, thus establishing Theorem 2. Fix any $0 < t < 1/2$. We define a probability distribution $\mathcal{D}_t$ over a finite set of discrete distributions over the domain $[2N] = \{1, \ldots, 2N\}$ as follows. (We assume without loss of generality below that $t$ is rational and that $tN$ is an integer.) A draw of $p_{S_1, S_2, t}$ from $\mathcal{D}_t$ is obtained as follows.

1. A set $S_1 \subset [N]$ is chosen uniformly at random from all subsets of $[N]$ that contain precisely $tN$ elements. For $i \in [N]$, the distribution $p_{S_1, S_2, t}$ assigns probability weight as follows:
$$p_{S_1,S_2,t}(i) = \frac{1}{4N} \ \text{ if } i \in S_1, \qquad p_{S_1,S_2,t}(i) = \frac{1}{2N}\left(1 + \frac{t}{2(1-t)}\right) \ \text{ if } i \in [N] \setminus S_1.$$

2. A set $S_2 \subset [N+1, \ldots, 2N]$ is chosen uniformly at random from all subsets of $[N+1, \ldots, 2N]$ that contain precisely $tN$ elements. For $i \in [N+1, \ldots, 2N]$, the distribution $p_{S_1, S_2, t}$ assigns probability weight as follows:
$$p_{S_1,S_2,t}(i) = \frac{3}{4N} \ \text{ if } i \in S_2, \qquad p_{S_1,S_2,t}(i) = \frac{1}{2N}\left(1 - \frac{t}{2(1-t)}\right) \ \text{ if } i \in [N+1, \ldots, 2N] \setminus S_2.$$

Using a birthday paradox type argument, we show that no $o(\sqrt{N})$-sample algorithm can successfully distinguish between a distribution $p_{S_1, S_2, t} \sim \mathcal{D}_t$ and the uniform distribution over $[2N]$. We then leverage this indistinguishability to show that any $(2 - \delta)$-semi-agnostic learning algorithm, even for $2$-flat distributions, must use a sample of size $\Omega(\sqrt{N})$:
F or a distr ibution p we write A p to ind icate that algorithm A is give n access to i.i.d. p oin ts dra wn f r om p . The follo wing simple pr op osition states that no algorithm can successfu lly distinguish b et w een a distribution p S 1 ,S 2 ,t ∼ D t and U 2 N using few er than (essent ially) √ N dra ws: Prop osition 4.2. Ther e is an absolute c onstant c > 0 such that the fol lowing holds: Fix any 0 < t < 1 / 2 , and let B b e any “distinguishing algorithm” which r e c eives c √ N i.i.d. dr aws fr om a distribution over [2 N ] and outputs either “ uniform” or “non-uniform”. Then    Pr [ B U [2 N ] outputs “ u niform” ] − Pr p S 1 ,S 2 ,t ∼D t [ B p S 1 ,S 2 ,t outputs “uniform” ]    ≤ 0 . 01 . (2) The pr o of is an easy consequence of th e fact that in b oth cases (the distribu tion is U [2 N ] , or the distribu tion is p S 1 ,S 2 ,t ∼ D t ), with probability at least 0.99 the c √ N draws receiv ed by A are a uniform rand om set of c √ N d istinct elements from [2 N ] (this can b e shown straighforw ardly using a birthday parado x t yp e argument) . No w w e u se Pr op osition 4.2 to show that any (2 − δ )-semi-agnostic learning algorithm ev en for 2-flat d istributions must use a sample of size Ω( √ N ), an d thereb y pr o v e Th eorem 7 . Fix a v alue of δ > 0 and supp ose, for the sak e of contradicti on, that th ere exists su c h an algorithm A . W e describ e ho w the existe nce of su c h an algorithm A yields a distinguishing alg orithm B that violates Prop osition 4.2 . The algorithm B w orks as follo ws, giv en access to i.i.d. draws from an unknown d istribution p . It first run s algorithm A with its “ ε ” p arameter set to ε := δ 3 12(2+ δ ) , obtaining (with p robabilit y at least 51 / 100) a hyp othesis distr ib ution h o v er [2 N ] su c h that k h − p k 1 ≤ (2 − δ )opt 2 ( p ) + ε. It then computes the v alue k h − U 2 N k 1 of the L 1 -distance b et w een h and the uniform distribution 14 (note that this step u ses no dra ws from th e distribution). If k h − U 2 N k 1 < 3 ε/ 2 then it outpu ts “uniform” and otherwise it outpu ts “non-uniform .” Since δ (and hence ε ) is indep endent of N , the algorithm B mak es fewe r than c √ N d ra ws fr om p (for N sufficiently large). T o see that the ab o v e-describ ed algorithm B violates ( 2 ), consider first the case that p is U [2 N ] . In this case opt 2 ( p ) = 0 an d so w ith probability at least 51/100 the h yp othesis h satisfies k h − U 2 N k 1 ≤ ε , and hence algorithm B outputs “uniform ” with probabilit y at least 51 / 100 . On the other hand, supp ose that p = p S 1 ,S 2 ,t is dra wn from D t , where t = δ 2+ δ . In this case, with probabilit y at least 51/100 the h yp othesis h s atisfies k h − p S 1 ,S 2 ,t k 1 ≤ (2 − δ )opt 2 ( p S 1 ,S 2 ,t ) + ε ≤ (2 − δ ) · t 2 ·  1 + t 1 − t  + ε, b y part (2.) of Prop osition 4.1 . Sin ce by p art (1.) of Prop osition 4.1 we ha v e kU 2 N − p S 1 ,S 2 ,t k 1 = t , the triangle inequalit y giv es th at k h − U 2 N k 1 ≥ t − (2 − δ ) · t 2 ·  1 + t 1 − t  − ε = 2 ε, where to obtain the final equalit y w e recalled th e settings ε = δ 3 12(2+ δ ) , t = δ 2+ δ . Hence algorithm B outputs “uniform” with probab ility at most 49 / 100. Thus w e hav e    Pr [ B U [2 N ] outputs “uniform”] − Pr p S 1 ,S 2 ,t ∼D t [ B p S 1 ,S 2 ,t outputs “uniform”]    ≥ 0 . 02 whic h cont radicts ( 2 ) and pro ves the th eorem. 
As describ ed in the Int ro duction, via th e obvious corresp ondence that maps d istributions o ve r [ N ] to distributions o v er [0 , 1), we get the follo wing: Corollary 4.3. Fix any δ > 0 and any function f ( · ) . Ther e is no algorithm A with the fol lowing pr op erty: given ε > 0 and ac c ess to indep endent dr aws fr om an unknown distribution p over [0 , 1) , algorithm A makes f ( ε ) dr aw s fr om p and with pr ob ability at le ast 51 / 100 outputs a hyp othesis distribution h over [0 , 1) satisfying k h − p k 1 ≤ (2 − δ )opt 2 ( p ) + ε . References [AJOS14] J. Achary a, A. J afarp our, A. Orlitsky , and A.T. Su resh. Near-optimal-sample estima- tors for spherical gaussian mixtur es. T ec hnical Rep ort h ttp://arxiv.org/abs/140 2.4746, 19 F eb 2014. 3.2 [BBBB7 2] R.E. Barlo w, D.J. Bartholomew, J.M. Bremner, and H.D. Brunk. Statistic al Infer enc e under Or der R estrictions . Wiley , New Y ork, 1972. 1 , 1 [Bir87] L. Birg ´ e. Estimating a d en sit y un der order restrictions: Nonasymp totic minimax risk. Anna ls of Statistics , 15(3):995– 1012, 1987. 1 [Bir97] L. Birg´ e. Estimation of unimo dal densities without s m o othness assumptions. Annals of Statistics , 25(3):9 70–981, 1997. 1 [Bru58] H. D. Brunk. On the estimation of parameters restricted by inequalities. Ann. Math. Statist. , 29(2):pp. 437–45 4, 1958 . 1 15 [CDSS13] S. Chan, I. Diak onik olas, R. Serv edio, and X. S un. Learning mixtures of structured distributions o v er discrete domains. In SODA , pages 1380–1 394, 2013. 1 [CDSS14] S. C han, I. Diak onik olas, R. Serv edio, and X. Sun. Efficien t density estimation via piece- wise p olynomial appr oximati on. T ec hnical Rep ort h ttp://arxiv.org/abs/13 05.3207, con- ference v ersion in STOC, pages 604-61 3, 2014. 1 , 2 , 3.1.3 , 3.1 , 3.1.3 , 3.3 [CMN98] S. Chaudhuri, R. Mot w ani, and V. Narasa yya. Random sampling for h istogram con- struction: Ho w muc h is enough? In SIGMOD Confer enc e , pages 436–44 7, 1998. 1 [DDS12] A. De, I. Diak onik olas, and R. S er vedio. Inv erse problems in app ro ximate u niform generation. Av ailable at h ttp://arxiv.org/p df/1211.17 22v1.p df, 2012. 3.2 [DG85] L. Devro y e and L. Gy¨ orfi. Nonp ar ametric Density Estimation: The L 1 View . John Wiley & Sons, 1985. 1 [DK14] C. Dask alakis and G. Kamath. F aster and sample n ear-optimal algorithms for p rop er learning mixtures of gaussians. In COL T , pages 1183–12 13, 2014. 3.2 [DL01] L. Devro y e and G. Lugosi. Combinatorial metho ds in density e stimation . Sprin ger Series in Statistics, Sprin ger, 2001. 1 , 1 , 3 , 3.2 [GGI + 02] A. Gilb ert, S. Guha, P . Indyk, Y. Kotidis, S. Muth uk r ishnan, and M. Strauss . F ast, small-space algorithms for approximat e h istogram main tenance. In STOC , pages 389– 398, 2002. 1 [GKS06] S. Guh a, N. Koudas, and K. Shim. Approxima tion and streaming algorithms for his- togram constr u ction problems. ACM T r ans. Datab ase Syst. , 31(1):396 –438, 2006. 1 [Gre56] U. Grenand er. O n the theory of m ortalit y measuremen t. Skand. Aktuarietidskr. , 39:1 25– 153, 1956. 1 [Gro85] P . Gro eneb o om. Estimating a monotone densit y . In Pr o c. of the Berkeley Confer enc e in H onor of Jerzy Neyman and Jack Kiefer , p ages 539–555 , 1985. 1 [HP76] D. L. Hanson and G. Pledger. Consistency in conca v e regression. The Annals of Statistics , 4(6):pp. 1038–10 50, 1976. 1 [ILR12] P . I n dyk, R. Levi, and R. Ru binfeld. Appro ximating and T esting k -Histogram Distri- butions in Sub -linear Time. In PODS , pages 15–22 , 2012. 1 [JKM + 98] H. V. 
As described in the Introduction, via the obvious correspondence that maps distributions over $[N]$ to distributions over $[0,1)$, we get the following:

Corollary 4.3. Fix any $\delta > 0$ and any function $f(\cdot)$. There is no algorithm $A$ with the following property: given $\varepsilon > 0$ and access to independent draws from an unknown distribution $p$ over $[0,1)$, algorithm $A$ makes $f(\varepsilon)$ draws from $p$ and with probability at least $51/100$ outputs a hypothesis distribution $h$ over $[0,1)$ satisfying $\|h - p\|_1 \le (2 - \delta)\,\mathrm{opt}_2(p) + \varepsilon$.

References

[AJOS14] J. Acharya, A. Jafarpour, A. Orlitsky, and A. T. Suresh. Near-optimal-sample estimators for spherical Gaussian mixtures. Technical report, http://arxiv.org/abs/1402.4746, 19 Feb 2014.

[BBBB72] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk. Statistical Inference under Order Restrictions. Wiley, New York, 1972.

[Bir87] L. Birgé. Estimating a density under order restrictions: Nonasymptotic minimax risk. Annals of Statistics, 15(3):995–1012, 1987.

[Bir97] L. Birgé. Estimation of unimodal densities without smoothness assumptions. Annals of Statistics, 25(3):970–981, 1997.

[Bru58] H. D. Brunk. On the estimation of parameters restricted by inequalities. Ann. Math. Statist., 29(2):437–454, 1958.

[CDSS13] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Learning mixtures of structured distributions over discrete domains. In SODA, pages 1380–1394, 2013.

[CDSS14] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Efficient density estimation via piecewise polynomial approximation. Technical report, http://arxiv.org/abs/1305.3207; conference version in STOC, pages 604–613, 2014.

[CMN98] S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for histogram construction: How much is enough? In SIGMOD Conference, pages 436–447, 1998.

[DDS12] A. De, I. Diakonikolas, and R. Servedio. Inverse problems in approximate uniform generation. Available at http://arxiv.org/pdf/1211.1722v1.pdf, 2012.

[DG85] L. Devroye and L. Györfi. Nonparametric Density Estimation: The L1 View. John Wiley & Sons, 1985.

[DK14] C. Daskalakis and G. Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of Gaussians. In COLT, pages 1183–1213, 2014.

[DL01] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer Series in Statistics, Springer, 2001.

[GGI+02] A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Fast, small-space algorithms for approximate histogram maintenance. In STOC, pages 389–398, 2002.

[GKS06] S. Guha, N. Koudas, and K. Shim. Approximation and streaming algorithms for histogram construction problems. ACM Trans. Database Syst., 31(1):396–438, 2006.

[Gre56] U. Grenander. On the theory of mortality measurement. Skand. Aktuarietidskr., 39:125–153, 1956.

[Gro85] P. Groeneboom. Estimating a monotone density. In Proc. of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, pages 539–555, 1985.

[HP76] D. L. Hanson and G. Pledger. Consistency in concave regression. The Annals of Statistics, 4(6):1038–1050, 1976.

[ILR12] P. Indyk, R. Levi, and R. Rubinfeld. Approximating and testing k-histogram distributions in sub-linear time. In PODS, pages 15–22, 2012.

[JKM+98] H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel. Optimal histograms with quality guarantees. In VLDB, pages 275–286, 1998.

[KMR+94] M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. Schapire, and L. Sellie. On the learnability of discrete distributions. In Proc. 26th STOC, pages 273–282, 1994.

[MP07] P. Massart and J. Picard. Concentration Inequalities and Model Selection. Lecture Notes in Mathematics 33, Saint-Flour, Cantal, 2003; Springer, 2007.

[Pea95] K. Pearson. Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material. Philosophical Trans. of the Royal Society of London, 186:343–414, 1895.

[Rao69] B. L. S. Prakasa Rao. Estimation of a unimodal density. Sankhya Ser. A, 31:23–36, 1969.

[Reb05] L. Reboul. Estimation of a function under shape restrictions. Applications to reliability. Ann. Statist., 33(3):1330–1356, 2005.

[Sco92] D. W. Scott. Multivariate Density Estimation: Theory, Practice and Visualization. Wiley, New York, 1992.

[Sil86] B. W. Silverman. Density Estimation. Chapman and Hall, London, 1986.

[Val84] L. G. Valiant. A theory of the learnable. In Proc. 16th Annual ACM Symposium on Theory of Computing (STOC), pages 436–445. ACM Press, 1984.

[Weg70] E. J. Wegman. Maximum likelihood estimation of a unimodal density. I. and II. Ann. Math. Statist., 41:457–471, 2169–2174, 1970.
