Local Privacy, Data Processing Inequalities, and Statistical Minimax Rates

Working under a model of privacy in which data remains private even from the statistician, we study the tradeoff between privacy guarantees and the utility of the resulting statistical estimators. We prove bounds on information-theoretic quantities, …

Authors: John C. Duchi, Michael I. Jordan, Martin J. Wainwright

John C. Duchi† (jduchi@stanford.edu), Michael I. Jordan∗ (jordan@stat.berkeley.edu), Martin J. Wainwright∗ (wainwrig@stat.berkeley.edu)
†Stanford University, Stanford, CA 94305; ∗University of California, Berkeley, Berkeley, CA 94720

Abstract

Working under a model of privacy in which data remains private even from the statistician, we study the tradeoff between privacy guarantees and the utility of the resulting statistical estimators. We prove bounds on information-theoretic quantities, including mutual information and Kullback-Leibler divergence, that depend on the privacy guarantees. When combined with standard minimax techniques, including the Le Cam, Fano, and Assouad methods, these inequalities allow for a precise characterization of statistical rates under local privacy constraints. We provide a treatment of several canonical families of problems: mean estimation, parameter estimation in fixed-design regression, multinomial probability estimation, and nonparametric density estimation. For all of these families, we provide lower and upper bounds that match up to constant factors, and exhibit new (optimal) privacy-preserving mechanisms and computationally efficient estimators that achieve the bounds.

1 Introduction

A major challenge in statistical inference is that of characterizing and balancing statistical utility with the privacy of individuals from whom data is obtained [20, 21, 28]. Such a characterization requires a formal definition of privacy, and differential privacy has been put forth as one such formalization [e.g., 24, 10, 25, 34, 35].
In the database and cryptography literatures from which differential privacy arose, early research was mainly algorithmic in focus, and researchers have used differential privacy to evaluate privacy-retaining mechanisms for transporting, indexing, and querying data. More recent work aims to link differential privacy to statistical concerns [22, 51, 33, 48, 16, 46]; in particular, researchers have developed algorithms for private robust statistical estimators, point and histogram estimation, and principal components analysis. Guarantees of optimality in this line of work have often been non-inferential, aiming to approximate a class of statistics under privacy-respecting transformations of the data at hand and not with respect to an underlying population. There has also been recent work within the context of classification problems and the "probably approximately correct" framework of statistical learning theory [e.g., 37, 8] that treats the data as random and aims to recover aspects of the underlying population; we discuss this work in Section 6.

In this paper, we take a fully inferential point of view on privacy, bringing differential privacy into contact with statistical decision theory. Our focus is on the fundamental limits of differentially private estimation. By treating differential privacy as an abstract constraint on estimators, we obtain independence from specific estimation procedures and privacy-preserving mechanisms. Within this framework, we derive both lower bounds and matching upper bounds on minimax risk. We obtain our lower bounds by integrating differential privacy into the classical

[Figure 1. Left: graphical structure of private Z_i and non-private data X_i in the interactive case. Right: graphical structure of the channel in the non-interactive case.]
paradigms for bounding minimax risk via the inequalities of Le Cam, Fano, and Assouad, while we obtain matching upper bounds by proposing and analyzing specific private procedures.

We study the setting of local privacy, in which providers do not even trust the statistician collecting the data. Although local privacy is a relatively stringent requirement, we view this setting as a natural step in identifying minimax risk bounds under privacy constraints. Indeed, local privacy is one of the oldest forms of privacy: its essential form dates to Warner [50], who proposed it as a remedy for what he termed "evasive answer bias" in survey sampling. We hope that we can leverage deeper understanding of this classical setting to treat other privacy-preserving approaches to data analysis.

More formally, let X_1, ..., X_n ∈ 𝒳 be observations drawn according to a distribution P, and let θ = θ(P) be a parameter of this unknown distribution. We wish to estimate θ based on access to obscured views Z_1, ..., Z_n ∈ 𝒵 of the original data. The original random variables {X_i}_{i=1}^n and the privatized observations {Z_i}_{i=1}^n are linked via a family of conditional distributions Q_i(Z_i | X_i = x, Z_{1:i−1} = z_{1:i−1}). To simplify notation, we typically omit the subscript in Q_i. We refer to Q as a channel distribution, as it acts as a conduit from the original to the privatized data, and we assume it is sequentially interactive, meaning the channel has the conditional independence structure

\[
\{X_i, Z_1, \ldots, Z_{i-1}\} \to Z_i \quad \text{and} \quad Z_i \perp X_j \mid \{X_i, Z_1, \ldots, Z_{i-1}\} \ \text{for } j \neq i,
\]

illustrated on the left of Figure 1. A special case of such a channel is the non-interactive case, in which each Z_i depends only on X_i (Figure 1, right). Our work is based on the following definition of privacy.
For a given privacy parameter α ≥ 0, we say that Z_i is an α-differentially locally private view of X_i if for all z_1, ..., z_{i−1} and x, x′ ∈ 𝒳 we have

\[
\sup_{S \in \sigma(\mathcal{Z})} \frac{Q_i(Z_i \in S \mid X_i = x, Z_1 = z_1, \ldots, Z_{i-1} = z_{i-1})}{Q_i(Z_i \in S \mid X_i = x', Z_1 = z_1, \ldots, Z_{i-1} = z_{i-1})} \le \exp(\alpha), \tag{1}
\]

where σ(𝒵) denotes an appropriate σ-field on 𝒵. Definition (1) does not constrain Z_i to be a release of data based exclusively on X_i: the channel Q_i may be interactive [24], changing based on prior private observations Z_j. We also consider the non-interactive case [50, 27] where Z_i depends only on X_i (see the right side of Figure 1); here the bound (1) reduces to

\[
\sup_{S \in \sigma(\mathcal{Z})} \sup_{x, x' \in \mathcal{X}} \frac{Q(Z_i \in S \mid X_i = x)}{Q(Z_i \in S \mid X_i = x')} \le \exp(\alpha). \tag{2}
\]

These definitions capture a type of plausible deniability: no matter what data Z is released, it is nearly equally as likely to have come from one point x ∈ 𝒳 as any other. It is also possible to interpret differential privacy within a hypothesis testing framework, where α controls the error rate in tests for the presence or absence of individual data points in a dataset [51]. Such guarantees against discovery, together with the treatment of issues of side information or adversarial strength that are problematic for other formalisms, have been used to make the case for differential privacy within the computer science literature; see, for example, the papers [27, 24, 6, 30].

Although differential privacy provides an elegant formalism for limiting disclosure and protecting against many forms of privacy breach, it is a stringent measure of privacy, and it is conceivably overly stringent for statistical practice. Indeed, Fienberg et al.
[29] criticize the use of differential privacy in releasing contingency tables, arguing that known mechanisms for differentially private data release can give unacceptably poor performance. As a consequence, they advocate, in some cases, recourse to weaker privacy guarantees to maintain the utility and usability of released data. There are results that are more favorable for differential privacy; for example, Smith [48] shows that the non-local form of differential privacy [24] can be satisfied while yielding asymptotically optimal parametric rates of convergence for some point estimators. Resolving such differing perspectives requires investigating whether particular methods have optimality properties that would allow a general criticism of the framework, and characterizing the trade-offs between privacy and statistical efficiency. Such are the goals of the current paper.

1.1 Our contributions

The main contribution of this work is to provide general techniques for deriving minimax bounds under local privacy constraints and to illustrate these techniques by computing minimax rates for several canonical problems: (a) mean estimation; (b) parameter estimation in fixed-design regression; (c) multinomial probability estimation; and (d) density estimation. We now outline our main contributions. (Because a deeper comparison of the current work with prior research requires a formal definition of our minimax framework and presentation of our main results, we defer a full discussion of related work to Section 6.
We note here, however, that our minimax rates are for estimation of population quantities, in accordance with our connections to statistical decision theory; by way of comparison, most prior work in the privacy literature focuses on accurate approximation of statistics in a conditional analysis in which the data are treated as fixed.)

Many methods for obtaining minimax bounds involve information-theoretic quantities relating data-generating distributions [53, 52, 49]. In particular, let P_1 and P_2 denote two distributions on the observations X_i, and for ν ∈ {1, 2}, define the marginal distribution M^n_ν on 𝒵^n by

\[
M_\nu^n(S) := \int Q^n(S \mid x_1, \ldots, x_n)\, dP_\nu(x_1, \ldots, x_n) \quad \text{for } S \in \sigma(\mathcal{Z}^n). \tag{3}
\]

Here Q^n(· | x_1, ..., x_n) denotes the joint distribution on 𝒵^n of the private sample Z_{1:n}, conditioned on X_{1:n} = x_{1:n}. The mutual information of samples drawn according to distributions of the form (3) and the KL divergence between such distributions are key objects in statistical discriminability and minimax rates [36, 9, 53, 52, 49], where they are often applied in one of three lower-bounding techniques: Le Cam's, Fano's, and Assouad's methods. Keeping in mind the centrality of these information-theoretic quantities, we summarize our main results at a high level as follows. Theorem 1 bounds the KL divergence between distributions M^n_1 and M^n_2, as defined by the marginal (3), by a quantity depending on the differential privacy parameter α and the total variation distance between P_1 and P_2. The essence of Theorem 1 is that

\[
D_{\mathrm{kl}}(M_1^n \,\|\, M_2^n) \lesssim \alpha^2 n \,\|P_1 - P_2\|_{\mathrm{TV}}^2,
\]

where ≲ denotes inequality up to numerical constants.
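As a toy numerical check of this contraction (our own illustration, not an experiment from the paper): take the binary randomized response channel, which releases the true bit with probability e^α/(1 + e^α) and is therefore α-locally private, push two Bernoulli distributions through it, and compare the symmetrized KL divergence of the induced marginals with the quantity 4(e^α − 1)²‖P_1 − P_2‖²_TV, which is of order α²‖P_1 − P_2‖²_TV for small α:

```python
import math

def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

alpha = 0.5
keep = math.exp(alpha) / (1 + math.exp(alpha))  # randomized response: P(Z = X)
# The channel's worst-case likelihood ratio is keep/(1 - keep) = exp(alpha),
# so it meets the alpha-local privacy definition exactly.
assert abs(keep / (1 - keep) - math.exp(alpha)) < 1e-12

p1, p2 = 0.2, 0.7                       # two Bernoulli data distributions
m1 = p1 * keep + (1 - p1) * (1 - keep)  # induced marginal P(Z = 1) under P_1
m2 = p2 * keep + (1 - p2) * (1 - keep)

sym_kl = bernoulli_kl(m1, m2) + bernoulli_kl(m2, m1)
bound = 4 * (math.exp(alpha) - 1) ** 2 * abs(p1 - p2) ** 2  # ~ alpha^2 * TV^2
assert sym_kl <= bound  # the marginals contract, as the text describes
```

Here the marginals (M_1, M_2) are far closer together than (P_1, P_2): privatization shrinks the statistical distance between data distributions.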
When α² < 1, which is the usual region of interest, this result shows that for statistical procedures whose minimax rate of convergence can be determined by classical information-theoretic methods, the additional requirement of α-local differential privacy causes the effective sample size of any statistical procedure to be reduced from n to at most α²n. Section 3.1 contains the formal statement of this theorem, while Section 3.2 provides corollaries showing its application to minimax risk bounds. We follow this in Section 3.3 with applications of these results to estimation of one-dimensional means and fixed-design regression problems, providing corresponding upper bounds on the minimax risk. In addition to our general analysis, we exhibit some striking difficulties of locally private estimation in non-compact spaces: if we wish to estimate the mean of a random variable X satisfying Var(X) ≤ 1, the minimax rate of estimation of E[X] decreases from the parametric 1/n rate to 1/√(nα²).

Theorem 1 is appropriate for many one-dimensional problems, but it does not address difficulties inherent in higher-dimensional problems. With this motivation, our next two main results (Theorems 2 and 3) generalize Theorem 1 and incorporate dimensionality in an essential way: each provides bounds on information-theoretic quantities by dimension-dependent analogues of total variation. More specifically, Theorem 2 provides bounds on mutual information quantities essential in information-theoretic techniques such as Fano's method [53, 52], while Theorem 3 provides analogous bounds on summed pairs of KL divergences useful in applications of Assouad's method [5, 53, 4].
As a consequence of Theorems 2 and 3, we obtain that for many d-dimensional estimation problems the effective sample size is reduced from n to nα²/d; as our examples illustrate, this dimension-dependent reduction in sample size can have dramatic consequences. We provide the main statement and consequences of Theorem 2 in Section 4, showing its application to obtaining minimax rates for mean estimation in both classical and high-dimensional settings. In Section 5, we present Theorem 3, showing how it provides (sharp) minimax lower bounds for multinomial and probability density estimation. Our results enable us to derive (often new) optimal mechanisms for these problems. One interesting consequence of our results is that Warner's randomized response procedure [50] from the 1960s is an optimal mechanism for multinomial estimation.

Notation: For distributions P and Q defined on a space 𝒳, each absolutely continuous with respect to a distribution μ (with corresponding densities p and q), the KL divergence between P and Q is

\[
D_{\mathrm{kl}}(P \,\|\, Q) := \int_{\mathcal{X}} dP \log \frac{dP}{dQ} = \int_{\mathcal{X}} p \log \frac{p}{q}\, d\mu.
\]

Letting σ(𝒳) denote an appropriate σ-field on 𝒳, the total variation distance between two distributions P and Q is

\[
\|P - Q\|_{\mathrm{TV}} := \sup_{S \in \sigma(\mathcal{X})} |P(S) - Q(S)| = \frac{1}{2} \int_{\mathcal{X}} |p(x) - q(x)|\, d\mu(x).
\]

Let P and P_Y denote the marginal distributions of random vectors X and Y, and let P_Y(· | X) denote the distribution of Y conditional on X. The mutual information between X and Y is

\[
I(X; Y) = \mathbb{E}_P\big[D_{\mathrm{kl}}\big(P_Y(\cdot \mid X) \,\|\, P_Y(\cdot)\big)\big] = \int D_{\mathrm{kl}}\big(P_Y(\cdot \mid X = x) \,\|\, P_Y(\cdot)\big)\, dP(x).
\]

A random variable Y has the Laplace(α) distribution if its density is p_Y(y) = (α/2) exp(−α|y|). For matrices A, B ∈ R^{d×d}, the notation A ⪯ B means that B − A is positive semidefinite. For real sequences {a_n} and {b_n}, we use a_n ≲
b_n to mean that there is a universal constant C < ∞ such that a_n ≤ C b_n for all n, and a_n ≍ b_n to denote that a_n ≲ b_n and b_n ≲ a_n.

2 Background and problem formulation

We first establish the minimax framework we use throughout this paper; see references [52, 53, 49] for further background. Let 𝒫 denote a class of distributions on the sample space 𝒳, and let θ(P) ∈ Θ denote a function defined on 𝒫. The space Θ in which the parameter θ(P) takes values depends on the underlying statistical model (for univariate mean estimation, it is a subset of the real line). Let ρ denote a semi-metric on the space Θ, which we use to measure the error of an estimator for the parameter θ, and let Φ : R_+ → R_+ be a non-decreasing function with Φ(0) = 0 (for example, Φ(t) = t²).

In the classical setting, the statistician is given direct access to i.i.d. observations X_i drawn according to some P ∈ 𝒫. The local privacy setting involves an additional ingredient, namely, a conditional distribution Q that transforms the sample {X_i}_{i=1}^n into the private sample {Z_i}_{i=1}^n taking values in 𝒵. Based on these Z_i, our goal is to estimate the unknown parameter θ(P) ∈ Θ. An estimator θ̂ is a measurable function θ̂ : 𝒵^n → Θ, and we assess the quality of the estimate θ̂(Z_1, ..., Z_n) in terms of the risk

\[
\mathbb{E}_{P,Q}\Big[\Phi\big(\rho(\hat{\theta}(Z_1, \ldots, Z_n), \theta(P))\big)\Big].
\]

For instance, for a univariate mean problem with ρ(θ, θ′) = |θ − θ′| and Φ(t) = t², this risk is the mean-squared error. For any fixed conditional distribution Q, the minimax rate is

\[
\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho, Q) := \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P,Q}\Big[\Phi\big(\rho(\hat{\theta}(Z_1, \ldots, Z_n), \theta(P))\big)\Big], \tag{4}
\]

where we take the supremum over distributions P ∈ 𝒫, and the infimum is taken over all estimators θ̂.
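For discrete distributions, the divergences in the notation above reduce to finite sums. A minimal self-contained helper (our own illustration, not from the paper), which also checks Pinsker's inequality ‖P − Q‖²_TV ≤ ½ D_kl(P ‖ Q), invoked when comparing the Le Cam bounds in Section 3.2:

```python
import math

def kl_divergence(p, q):
    """D_kl(P || Q) for discrete distributions given as probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv_distance(p, q):
    """Total variation distance: half the L1 distance between the densities."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
# Pinsker's inequality relates the two quantities: TV^2 <= (1/2) * KL.
assert tv_distance(p, q) ** 2 <= 0.5 * kl_divergence(p, q)
```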
For α > 0, let 𝒬_α denote the set of all conditional distributions guaranteeing α-local privacy (1). By minimizing the minimax risk (4) over all Q ∈ 𝒬_α, we obtain the central object of study for this paper, a functional which characterizes the optimal rate of estimation in terms of the privacy parameter α.

Definition 1. Given a family of distributions θ(𝒫) and a privacy parameter α > 0, the α-minimax rate in the metric ρ is

\[
\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho, \alpha) := \inf_{Q \in \mathcal{Q}_\alpha} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P,Q}\Big[\Phi\big(\rho(\hat{\theta}(Z_1, \ldots, Z_n), \theta(P))\big)\Big]. \tag{5}
\]

From estimation to testing: A standard first step in proving minimax bounds is to reduce the estimation problem to a testing problem [53, 52, 49]. We use two types of testing problems: one a multiple hypothesis test, the second based on multiple binary hypothesis tests. We begin with the first of the two.

Given an index set 𝒱 of finite cardinality, consider a family of distributions {P_ν, ν ∈ 𝒱} contained within 𝒫. This family induces a collection of parameters {θ(P_ν), ν ∈ 𝒱}; it is a 2δ-packing in the ρ-semimetric if

\[
\rho(\theta(P_\nu), \theta(P_{\nu'})) \ge 2\delta \quad \text{for all } \nu \neq \nu'. \tag{6}
\]

We use this family to define the canonical hypothesis testing problem:

• first, nature chooses V according to the uniform distribution over 𝒱;
• second, conditioned on the choice V = ν, the random sample X = (X_1, ..., X_n) is drawn from the n-fold product distribution P^n_ν.

In the classical setting, the statistician directly observes the sample X, while the local privacy constraint means that a new random sample Z = (Z_1, ..., Z_n) is generated by sampling Z_i from the distribution Q(· | X_{1:n}). By construction, conditioned on the choice V = ν, the private sample Z is distributed according to the marginal measure M^n_ν defined in equation (3).
Given the observed vector Z, the goal is to determine the value of the underlying index ν. We refer to any measurable mapping ψ : 𝒵^n → 𝒱 as a test function. Its associated error probability is P(ψ(Z_1, ..., Z_n) ≠ V), where P denotes the joint distribution over the random index V and Z. The classical reduction from estimation to testing [e.g., 49, Section 2.2] guarantees that the minimax error (4) has lower bound

\[
\mathfrak{M}_n(\Theta, \Phi \circ \rho, Q) \ge \Phi(\delta) \inf_\psi \mathbb{P}(\psi(Z_1, \ldots, Z_n) \neq V). \tag{7}
\]

The remaining challenge is to lower bound the probability of error in the underlying multi-way hypothesis testing problem. There are a variety of techniques for this, and we focus on bounds on the probability of error (7) due to Le Cam and Fano. The simplest form of Le Cam's inequality [e.g., 53, Lemma 1] is applicable when there are two values ν, ν′ in 𝒱. In this case,

\[
\inf_\psi \mathbb{P}(\psi(Z_1, \ldots, Z_n) \neq V) = \frac{1}{2} - \frac{1}{2}\big\|M_\nu^n - M_{\nu'}^n\big\|_{\mathrm{TV}}, \tag{8}
\]

where the marginal M is defined as in expression (3). More generally, Fano's inequality [52, 32, Lemma 4.2.1] holds when nature chooses uniformly at random from a set 𝒱 of cardinality larger than two, and takes the form

\[
\inf_\psi \mathbb{P}(\psi(Z_1, \ldots, Z_n) \neq V) \ge 1 - \frac{I(Z_1, \ldots, Z_n; V) + \log 2}{\log |\mathcal{V}|}. \tag{9}
\]

The second reduction we consider, which transforms estimation problems into multiple binary hypothesis testing problems, uses the structure of the hypercube in an essential way. For some d ∈ N, we set 𝒱 = {−1, 1}^d. We say that the family {P_ν} induces a 2δ-Hamming separation for Φ ∘ ρ if there exists a function v : θ(𝒫) → {−1, 1}^d satisfying

\[
\Phi\big(\rho(\theta, \theta(P_\nu))\big) \ge 2\delta \sum_{j=1}^d \mathbf{1}\{[v(\theta)]_j \neq \nu_j\}. \tag{10}
\]

Letting P_{±j} denote the joint distribution over the random index V and Z conditional on the jth coordinate V_j = ±1, we are able to establish the following sharpening of Assouad's lemma [5, 4] (see Appendix F.1 for a proof).

Lemma 1. Under the conditions of the preceding paragraph, we have

\[
\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho, Q) \ge \delta \sum_{j=1}^d \inf_\psi \big[\mathbb{P}_{+j}(\psi(Z_{1:n}) \neq +1) + \mathbb{P}_{-j}(\psi(Z_{1:n}) \neq -1)\big].
\]

With the definition of the marginals M^n_{±j} = 2^{−d+1} Σ_{ν : ν_j = ±1} M^n_ν, expression (8) shows that Lemma 1 is equivalent to the lower bound

\[
\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho, Q) \ge \delta \sum_{j=1}^d \Big[1 - \big\|M_{+j}^n - M_{-j}^n\big\|_{\mathrm{TV}}\Big]. \tag{11}
\]

As a consequence of the preceding reductions to testing and the error bounds (8), (9), and (11), we obtain bounds on the private minimax rate (5) by controlling variation distances of the form ‖M^n_1 − M^n_2‖_TV or the mutual information between the random parameter index V and the sequence of random variables Z_1, ..., Z_n. We devote the following sections to these tasks.

3 Pairwise bounds under privacy: Le Cam and local Fano methods

We begin with results that upper bound the symmetrized Kullback-Leibler divergence under a privacy constraint, developing consequences of this result for both Le Cam's method and a local form of Fano's method. Using these methods, we derive sharp minimax rates under local privacy for estimating one-dimensional means and for d-dimensional fixed-design regression.

3.1 Pairwise upper bounds on Kullback-Leibler divergences

Many statistical problems depend on comparisons between a pair of distributions P_1 and P_2 defined on a common space 𝒳. Any conditional distribution Q transforms such a pair of distributions into a new pair (M_1, M_2) via the marginalization (3); that is, M_j(S) = ∫_𝒳 Q(S | x) dP_j(x) for j = 1, 2.
Our first main result bounds the symmetrized Kullback-Leibler (KL) divergence between these induced marginals as a function of the privacy parameter α > 0 associated with the conditional distribution Q and the total variation distance between P_1 and P_2.

Theorem 1. For any α ≥ 0, let Q be a conditional distribution that guarantees α-differential privacy. Then for any pair of distributions P_1 and P_2, the induced marginals M_1 and M_2 satisfy the bound

\[
D_{\mathrm{kl}}(M_1 \,\|\, M_2) + D_{\mathrm{kl}}(M_2 \,\|\, M_1) \le \min\{4, e^{2\alpha}\}\, (e^\alpha - 1)^2\, \|P_1 - P_2\|_{\mathrm{TV}}^2. \tag{12}
\]

Remarks: Theorem 1 is a type of strong data processing inequality [3], providing a quantitative relationship from the divergence ‖P_1 − P_2‖_TV to the KL divergence D_kl(M_1 ‖ M_2) that arises after applying the channel Q. The result of Theorem 1 is similar to a result due to Dwork et al. [25, Lemma III.2], who show that D_kl(Q(· | x) ‖ Q(· | x′)) ≤ α(e^α − 1) for any x, x′ ∈ 𝒳, which implies D_kl(M_1 ‖ M_2) ≤ α(e^α − 1) by convexity. This upper bound is weaker than Theorem 1 since it lacks the term ‖P_1 − P_2‖²_TV. This total variation term is essential to our minimax lower bounds: more than providing a bound on KL divergence, Theorem 1 shows that differential privacy acts as a contraction on the space of probability measures. This contractivity holds in a strong sense: indeed, the bound (12) shows that even if we start with a pair of distributions P_1 and P_2 whose KL divergence is infinite, the induced marginals M_1 and M_2 always have finite KL divergence.

We provide the proof of Theorem 1 in Section 7. Here we develop a corollary that has useful consequences for minimax theory under local privacy constraints. Suppose that, conditionally on V = ν, we draw a sample X_1, ...
, X_n from the product measure ∏_{i=1}^n P_{ν,i}, and that we draw the α-locally private sample Z_1, ..., Z_n according to the channel Q(· | X_{1:n}). Conditioned on V = ν, the private sample is distributed according to the measure M^n_ν defined previously (3). Because we allow interactive protocols, the distribution M^n_ν need not be a product distribution in general. Given this setup, we have the following:

Corollary 1. For any α-locally differentially private (1) conditional distribution Q and any paired sequences of distributions {P_{ν,i}} and {P_{ν′,i}},

\[
D_{\mathrm{kl}}(M_\nu^n \,\|\, M_{\nu'}^n) + D_{\mathrm{kl}}(M_{\nu'}^n \,\|\, M_\nu^n) \le 4(e^\alpha - 1)^2 \sum_{i=1}^n \big\|P_{\nu,i} - P_{\nu',i}\big\|_{\mathrm{TV}}^2. \tag{13}
\]

See Section 7.2 for the proof, which requires a few intermediate steps to obtain the additive inequality. Inequality (13) also immediately implies a mutual information bound, which may be useful in applications of Fano's inequality. In particular, if we define the mean distribution M̄^n = (1/|𝒱|) Σ_{ν∈𝒱} M^n_ν, then by the definition of mutual information, we have

\[
I(Z_1, \ldots, Z_n; V) = \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} D_{\mathrm{kl}}\big(M_\nu^n \,\|\, \bar{M}^n\big)
\le \frac{1}{|\mathcal{V}|^2} \sum_{\nu, \nu'} D_{\mathrm{kl}}(M_\nu^n \,\|\, M_{\nu'}^n)
\le 4(e^\alpha - 1)^2 \sum_{i=1}^n \frac{1}{|\mathcal{V}|^2} \sum_{\nu, \nu' \in \mathcal{V}} \big\|P_{\nu,i} - P_{\nu',i}\big\|_{\mathrm{TV}}^2, \tag{14}
\]

the first inequality following from the joint convexity of the KL divergence and the final inequality from Corollary 1.

Remarks: Mutual information bounds under local privacy have appeared previously. McGregor et al. [43] study relationships between communication complexity and differential privacy, showing that differentially private schemes allow low communication. They provide a result [43, Prop. 7] guaranteeing I(X_{1:n}; Z_{1:n}) ≤ 3αn; they strengthen this bound to I(X_{1:n}; Z_{1:n}) ≤ (3/2)α²n when the X_i are i.i.d. uniform Bernoulli variables.
Since the total variation distance is at most 1, our result also implies this scaling (for arbitrary X_i), but it is stronger since it involves the total variation terms ‖P_{ν,i} − P_{ν′,i}‖_TV, which are essential in our minimax results. In addition, Corollary 1 allows for any (sequentially) interactive channel Q; each Z_i may depend on the private answers Z_{1:i−1} of other data providers.

3.2 Consequences for minimax theory under local privacy constraints

We now turn to some consequences of Theorem 1 for minimax theory under local privacy constraints. For ease of presentation, we analyze the case of independent and identically distributed (i.i.d.) samples, meaning that P_{ν,i} ≡ P_ν for i = 1, ..., n. We show that in both Le Cam's inequality and the local version of Fano's method, the constraint of α-local differential privacy reduces the effective sample size (at least) from n to 4α²n.

Consequence for Le Cam's method: The classical non-private version of Le Cam's method bounds the usual minimax risk

\[
\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho) := \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_P\Big[\Phi\big(\rho(\hat{\theta}(X_1, \ldots, X_n), \theta(P))\big)\Big],
\]

for estimators θ̂ : 𝒳^n → Θ by a binary hypothesis test. One version of Le Cam's lemma (8) asserts that, for any pair of distributions {P_1, P_2} such that ρ(θ(P_1), θ(P_2)) ≥ 2δ, we have

\[
\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho) \ge \Phi(\delta)\Big\{\frac{1}{2} - \frac{1}{2\sqrt{2}}\sqrt{n D_{\mathrm{kl}}(P_1 \,\|\, P_2)}\Big\}. \tag{15}
\]

Returning to the α-locally private setting, in which the estimator θ̂ depends only on the private variables (Z_1, ..., Z_n), we measure the α-private minimax risk (5). By applying Le Cam's method to the pair (M_1, M_2) along with Corollary 1 in the form of inequality (13), we find:

Corollary 2 (Private form of Le Cam bound).
Given observations from an α-locally differentially private channel for some α ∈ [0, 22/35], the α-private minimax risk is lower bounded as

\[
\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho, \alpha) \ge \Phi(\delta)\Big\{\frac{1}{2} - \frac{1}{2\sqrt{2}}\sqrt{8 n \alpha^2 \|P_1 - P_2\|_{\mathrm{TV}}^2}\Big\}. \tag{16}
\]

Using the fact that ‖P_1 − P_2‖²_TV ≤ ½ D_kl(P_1 ‖ P_2), comparison with the original Le Cam bound (15) shows that for α ∈ [0, 22/35], the effect of α-local differential privacy is to reduce the effective sample size from n to 4α²n. We illustrate use of this private version of Le Cam's bound in our analysis of the one-dimensional mean problem to follow.

Consequences for local Fano's method: We now turn to consequences for the so-called local form of Fano's method. This method is based on constructing a family of distributions {P_ν, ν ∈ 𝒱} that defines a 2δ-packing, meaning ρ(θ(P_ν), θ(P_{ν′})) ≥ 2δ for all ν ≠ ν′, and that satisfies

\[
D_{\mathrm{kl}}(P_\nu \,\|\, P_{\nu'}) \le \kappa^2 \delta^2 \quad \text{for some fixed } \kappa > 0. \tag{17}
\]

We refer to any such construction as a (δ, κ) local packing. Recalling Fano's inequality (9), the pairwise upper bounds (17) imply I(X_1, ..., X_n; V) ≤ n κ² δ² by a convexity argument. We thus obtain the local Fano lower bound [36, 9] on the classical minimax risk:

\[
\mathfrak{M}_n(\theta(\mathcal{P}), \Phi \circ \rho) \ge \Phi(\delta)\Big\{1 - \frac{n \kappa^2 \delta^2 + \log 2}{\log |\mathcal{V}|}\Big\}. \tag{18}
\]

We now state the extension of this bound to the α-locally private setting.

Corollary 3 (Private form of local Fano inequality). Consider observations from an α-locally differentially private channel for some α ∈ [0, 22/35]. Given any (δ, κ) local packing, the α-private minimax risk has lower bound

\[
\mathfrak{M}_n(\Theta, \Phi \circ \rho, \alpha) \ge \Phi(\delta)\Big\{1 - \frac{4 n \alpha^2 \kappa^2 \delta^2 + \log 2}{\log |\mathcal{V}|}\Big\}. \tag{19}
\]

Once again, by comparison to the classical version (18), we see that, for all α ∈ [0, 22/35], the price for privacy is a reduction in the effective sample size from n to 4α²n.
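The private bound (19) is a closed-form expression, so it is straightforward to evaluate numerically; a small calculator with hypothetical parameter values of our own choosing, illustrating that taking δ² of order 1/(nα²) keeps the Fano term bounded away from zero:

```python
import math

def private_local_fano_bound(n, alpha, kappa, delta, packing_size,
                             Phi=lambda t: t * t):
    """Evaluate the lower bound (19), clamped at zero (a negative value is
    a vacuous lower bound)."""
    fano_term = 1 - (4 * n * alpha ** 2 * kappa ** 2 * delta ** 2
                     + math.log(2)) / math.log(packing_size)
    return Phi(delta) * max(0.0, fano_term)

n, alpha, kappa, packing_size = 10_000, 0.5, 1.0, 16
# Choose delta^2 proportional to log|V| / (n * alpha^2): here the bracketed
# Fano term equals 1/2, so the bound scales as delta^2 ~ 1/(n * alpha^2).
delta = math.sqrt(math.log(packing_size) / (16 * n * alpha ** 2 * kappa ** 2))
bound = private_local_fano_bound(n, alpha, kappa, delta, packing_size)
assert 0 < bound <= delta ** 2
```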
The proof is again straightforward using Theorem 1. By Pinsker's inequality, the pairwise bound (17) implies that ‖P_ν − P_{ν′}‖²_TV ≤ ½ κ²δ² for all ν ≠ ν′. We find that I(Z_1, ..., Z_n; V) ≤ 4nα²κ²δ² for all α ∈ [0, 22/35] by combining this inequality with the upper bound (14) from Corollary 1. The claim (19) follows by combining this upper bound with the usual local Fano bound (18).

3.3 Some applications of Theorem 1

In this section, we illustrate the use of the α-private versions of Le Cam's and Fano's inequalities, established in the previous section as Corollaries 2 and 3 of Theorem 1. First, we study the problem of one-dimensional mean estimation. In addition to demonstrating how the minimax rate changes as a function of α, we also reveal some interesting (and perhaps disturbing) effects of enforcing α-local differential privacy: the effective sample size may be even polynomially smaller than α²n. Our second example studies fixed-design linear regression, where we again see the reduction in effective sample size from n to α²n. We state each of our bounds assuming α ∈ [0, 1]; the bounds hold (with different numerical constants) whenever α ∈ [0, C] for some universal constant C.

3.3.1 One-dimensional mean estimation

For some k > 1, consider the family

\[
\mathcal{P}_k := \big\{\text{distributions } P \text{ such that } \mathbb{E}_P[X] \in [-1, 1] \text{ and } \mathbb{E}_P[|X|^k] \le 1\big\},
\]

and suppose that our goal is to estimate the mean θ(P) = E_P[X]. The next proposition characterizes the α-private minimax risk in squared ℓ₂-error:

\[
\mathfrak{M}_n\big(\theta(\mathcal{P}_k), (\cdot)^2, \alpha\big) := \inf_{Q \in \mathcal{Q}_\alpha} \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}_k} \mathbb{E}\Big[\big(\hat{\theta}(Z_1, \ldots, Z_n) - \theta(P)\big)^2\Big].
\]

Proposition 1.
There exist universal constants $0 < c_\ell \leq c_u < \infty$ such that for all $k > 1$ and $\alpha \in [0, 1]$, the minimax error $\mathfrak{M}_n(\theta(\mathcal{P}_k), (\cdot)^2, \alpha)$ is bounded as

$$c_\ell \min\Big\{ 1, \big( n\alpha^2 \big)^{-\frac{k-1}{k}} \Big\} \;\leq\; \mathfrak{M}_n\big(\theta(\mathcal{P}_k), (\cdot)^2, \alpha\big) \;\leq\; c_u \min\Big\{ 1, u_k \big( n\alpha^2 \big)^{-\frac{k-1}{k}} \Big\}, \tag{20}$$

where $u_k = \max\{1, (k-1)^{-2}\}$.

We prove this result using the $\alpha$-private version (16) of Le Cam's inequality, as stated in Corollary 2. See Section 7.3 for the details.

To understand the bounds (20), it is worthwhile to consider some special cases, beginning with the usual setting of random variables with finite variance ($k = 2$). In the non-private setting in which the original sample $(X_1, \ldots, X_n)$ is observed, the sample mean $\widehat{\theta} = \frac{1}{n}\sum_{i=1}^n X_i$ has mean-squared error at most $1/n$. When we require $\alpha$-local differential privacy, Proposition 1 shows that the minimax rate worsens to $1/\sqrt{n\alpha^2}$. More generally, for any $k > 1$, the minimax rate scales as $\mathfrak{M}_n(\theta(\mathcal{P}_k), (\cdot)^2, \alpha) \asymp (n\alpha^2)^{-\frac{k-1}{k}}$, ignoring $k$-dependent pre-factors. As $k \uparrow \infty$, the moment condition $E[|X|^k] \leq 1$ becomes equivalent to the boundedness constraint $|X| \leq 1$ a.s., and we obtain the more standard parametric rate $(n\alpha^2)^{-1}$, where there is no reduction in the exponent.

More generally, the behavior of the $\alpha$-private minimax rates (20) helps demarcate situations in which local differential privacy may or may not be acceptable. In particular, for bounded domains (where we may take $k \uparrow \infty$), local differential privacy may be quite reasonable. However, in situations in which the sample takes values in an unbounded space, local differential privacy imposes much stricter constraints. Indeed, in Appendix G, we discuss an example that illustrates the pathological consequences of providing (local) differential privacy on non-compact spaces.
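To make the exponent in (20) concrete, the following small helper (our naming, not the paper's) evaluates the rate $\min\{1, (n\alpha^2)^{-(k-1)/k}\}$ up to the $k$-dependent constants:

```python
def private_mean_rate(n, alpha, k):
    """Minimax rate (20) for one-dimensional mean estimation over P_k,
    up to constants: min{1, (n * alpha^2)^(-(k-1)/k)}."""
    effective = n * alpha ** 2
    return min(1.0, effective ** (-(k - 1) / k))
```

For $k = 2$ this gives the slow rate $(n\alpha^2)^{-1/2}$, while as $k$ grows the rate approaches the parametric $(n\alpha^2)^{-1}$, matching the discussion above.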
3.3.2 Linear regression with fixed design

We turn now to the problem of linear regression. Concretely, for a given design matrix $X \in \mathbb{R}^{n \times d}$, consider the standard linear model

$$Y = X\theta^* + \varepsilon, \tag{21}$$

where $\varepsilon \in \mathbb{R}^n$ is a vector of independent, zero-mean random variables. By rescaling as needed, we may assume that $\theta^* \in \Theta = \mathbb{B}_2(1)$, the Euclidean ball of radius one. Moreover, we assume that there is a scaling constant $\sigma < \infty$ such that the noise sequence satisfies $|\varepsilon_i| \leq \sigma$ for all $i$. Given the challenges of non-compactness exhibited by the location family estimation problems (cf. Proposition 1), this type of assumption is required for non-trivial results. We also assume that $X$ has rank $d$; otherwise, the design matrix $X$ has a non-trivial nullspace and $\theta^*$ cannot be estimated even when $\sigma = 0$.

With the model (21) in place, let us consider estimation of $\theta^*$ in the squared $\ell_2$-error, where we provide $\alpha$-locally differentially private views of the response $Y = \{Y_i\}_{i=1}^n$. By following the outline established in Section 3.2, we provide a sharp characterization of the $\alpha$-private minimax rate. In stating the result, we let $\rho_j(A)$ denote the $j$th singular value of a matrix $A$. (See Section 7.4 for the proof.)

Proposition 2. In the fixed-design regression model where the variables $\{Y_i\}$ are $\alpha$-locally differentially private for some $\alpha \in [0, 1]$,

$$\min\Big\{ 1, \frac{\sigma^2 d}{n\alpha^2 \rho_{\max}^2(X/\sqrt{n})} \Big\} \;\lesssim\; \mathfrak{M}_n\big(\Theta, \|\cdot\|_2^2, \alpha\big) \;\lesssim\; \min\Big\{ 1, \frac{\sigma^2 d}{\alpha^2 n \, \rho_{\min}^2(X/\sqrt{n})} \Big\}. \tag{22}$$

To interpret the bounds (22), it is helpful to consider some special cases. First consider the case of an orthonormal design, meaning that $\frac{1}{n} X^\top X = I_{d \times d}$. The bounds (22) imply that $\mathfrak{M}_n(\Theta, \|\cdot\|_2^2, \alpha) \asymp \sigma^2 d / (n\alpha^2)$, so that the $\alpha$-private minimax rate is fully determined (up to constant pre-factors).
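The upper bound in Proposition 2 is achieved by perturbing the responses with Laplace noise and solving the resulting normal equations. A minimal stdlib sketch follows; the function names are ours, and the noise scale is left as a parameter, since the paper's proof calibrates it to $\alpha$ and the range of the $Y_i$:

```python
import random

def solve_normal_equations(X, z):
    """Least squares via (X^T X) theta = X^T z, solved by Gaussian
    elimination with partial pivoting (fine for small d)."""
    d = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(d)] for i in range(d)]
    b = [sum(row[i] * zi for row, zi in zip(X, z)) for i in range(d)]
    for i in range(d):                       # forward elimination
        p = max(range(i, d), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, d):
            m = A[r][i] / A[i][i]
            for c in range(i, d):
                A[r][c] -= m * A[i][c]
            b[r] -= m * b[i]
    theta = [0.0] * d
    for i in reversed(range(d)):             # back substitution
        theta[i] = (b[i] - sum(A[i][c] * theta[c]
                               for c in range(i + 1, d))) / A[i][i]
    return theta

def private_regression(X, y, noise_scale, rng=random):
    """Laplace mechanism for fixed-design regression: privatize each
    response with Laplace noise (difference of two exponentials), then
    solve the normal equations on the privatized responses."""
    z = [yi + rng.expovariate(1 / noise_scale) - rng.expovariate(1 / noise_scale)
         for yi in y]
    return solve_normal_equations(X, z)
```

This is only a sketch of the mechanism's shape, not a privacy-calibrated implementation: choosing `noise_scale` to guarantee $\alpha$-local differential privacy requires the sensitivity analysis carried out in the paper's proof.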
Standard minimax rates for linear regression problems scale as $\sigma^2 d / n$; thus, by comparison, we see that requiring differential privacy indeed causes an effective sample size reduction from $n$ to $n\alpha^2$. More generally, up to the difference between the maximum and minimum singular values of the design $X$, Proposition 2 provides a sharp characterization of the $\alpha$-private rate for fixed-design linear regression. As the proof makes clear, the upper bounds are attained by adding Laplacian noise to the response variables $Y_i$ and solving the resulting normal equations, as in standard linear regression. In this case, the standard Laplacian mechanism [24] is optimal.

4 Mutual information under local privacy: Fano's method

As we have previously noted, Theorem 1 provides indirect upper bounds on the mutual information. However, since the resulting bounds involve pairwise distances only, as in Corollary 1, they must be used with local packings. Exploiting Fano's inequality in its full generality requires a more sophisticated upper bound on the mutual information under local privacy, which is the main topic of this section. We illustrate this more powerful technique by deriving lower bounds for mean estimation problems in both classical as well as high-dimensional settings under the non-interactive privacy model (2).

4.1 Variational bounds on mutual information

We begin by introducing some definitions needed to state the result. Let $V$ be a discrete random variable uniformly distributed over some finite set $\mathcal{V}$. Given a family of distributions $\{P_\nu, \nu \in \mathcal{V}\}$, we define the mixture distribution

$$\overline{P} := \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} P_\nu.$$

A sample $X \sim \overline{P}$ can be obtained by first drawing $V$ from the uniform distribution over $\mathcal{V}$, and then, conditionally on $V = \nu$, drawing $X$ from the distribution $P_\nu$.
By definition, the mutual information between the random index $V$ and the sample $X$ is

$$I(X; V) = \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} D_{\rm kl}\big( P_\nu \,\|\, \overline{P} \big),$$

a representation that plays an important role in our theory. As in the definition (3), any conditional distribution $Q$ induces the family of marginal distributions $\{M_\nu, \nu \in \mathcal{V}\}$ and the associated mixture $\overline{M} := \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} M_\nu$. Our goal is to upper bound the mutual information $I(Z_1, \ldots, Z_n; V)$, where, conditioned on $V = \nu$, the random variables $Z_i$ are drawn according to $M_\nu$.

Our upper bound is variational in nature: it involves optimization over a subset of the space

$$L^\infty(\mathcal{X}) := \big\{ f : \mathcal{X} \to \mathbb{R} \;\big|\; \|f\|_\infty < \infty \big\}$$

of uniformly bounded functions, equipped with the usual norm $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$. We define the 1-ball of the supremum norm

$$\mathbb{B}_\infty(\mathcal{X}) := \big\{ \gamma \in L^\infty(\mathcal{X}) \;\big|\; \|\gamma\|_\infty \leq 1 \big\}. \tag{23}$$

We show that this set describes the maximal amount of perturbation allowed in the conditional $Q$. Since the set $\mathcal{X}$ is generally clear from context, we typically omit this dependence. For each $\nu \in \mathcal{V}$, we define the linear functional $\varphi_\nu : L^\infty(\mathcal{X}) \to \mathbb{R}$ by

$$\varphi_\nu(\gamma) = \int_{\mathcal{X}} \gamma(x) \big( dP_\nu(x) - d\overline{P}(x) \big).$$

With these definitions, we have the following result:

Theorem 2. Let $\{P_\nu\}_{\nu \in \mathcal{V}}$ be an arbitrary collection of probability measures on $\mathcal{X}$, and let $\{M_\nu\}_{\nu \in \mathcal{V}}$ be the set of marginal distributions induced by an $\alpha$-differentially private distribution $Q$. Then

$$\frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} \Big[ D_{\rm kl}\big( M_\nu \,\|\, \overline{M} \big) + D_{\rm kl}\big( \overline{M} \,\|\, M_\nu \big) \Big] \;\leq\; \frac{(e^\alpha - 1)^2}{|\mathcal{V}|} \sup_{\gamma \in \mathbb{B}_\infty(\mathcal{X})} \sum_{\nu \in \mathcal{V}} \big( \varphi_\nu(\gamma) \big)^2. \tag{24}$$

It is important to note that, at least up to constant factors, Theorem 2 is never weaker than the results provided by Theorem 1, including the bounds of Corollary 1.
By definition of the linear functional $\varphi_\nu$, we have

$$\sup_{\gamma \in \mathbb{B}_\infty(\mathcal{X})} \sum_{\nu \in \mathcal{V}} \big( \varphi_\nu(\gamma) \big)^2 \;\overset{(i)}{\leq}\; \sum_{\nu \in \mathcal{V}} \sup_{\gamma \in \mathbb{B}_\infty(\mathcal{X})} \big( \varphi_\nu(\gamma) \big)^2 = 4 \sum_{\nu \in \mathcal{V}} \big\| P_\nu - \overline{P} \big\|_{\rm TV}^2,$$

where inequality (i) follows by interchanging the summation and supremum. Overall, we have

$$I(Z; V) \leq 4 (e^\alpha - 1)^2 \frac{1}{|\mathcal{V}|^2} \sum_{\nu, \nu' \in \mathcal{V}} \| P_\nu - P_{\nu'} \|_{\rm TV}^2.$$

The strength of Theorem 2 arises from the fact that inequality (i), the interchange of the order of supremum and summation, may be quite loose.

We now present a corollary that extends Theorem 2 to the setting of repeated sampling, providing a tensorization inequality analogous to Corollary 1. Let $V$ be distributed uniformly at random in $\mathcal{V}$, and assume that, given $V = \nu$, the observations $X_i$ are sampled independently according to the distribution $P_\nu$ for $i = 1, \ldots, n$. For this corollary, we require the non-interactive setting (2) of local privacy, in which each private variable $Z_i$ depends only on $X_i$.

Corollary 4. Suppose that the distributions $\{Q_i\}_{i=1}^n$ are $\alpha$-locally differentially private in the non-interactive setting (2). Then

$$I(Z_1, \ldots, Z_n; V) \;\leq\; n (e^\alpha - 1)^2 \frac{1}{|\mathcal{V}|} \sup_{\gamma \in \mathbb{B}_\infty} \sum_{\nu \in \mathcal{V}} \big( \varphi_\nu(\gamma) \big)^2. \tag{25}$$

We provide the proof of Corollary 4 in Section 8.2. We conjecture that the bound (25) also holds in the fully interactive setting, but given well-known difficulties of characterizing multiple channel capacities with feedback [17, Chapter 15], it may be challenging to verify this conjecture.

Theorem 2 and Corollary 4 relate the amount of mutual information between the randomly perturbed views $Z$ of the data to geometric or variational properties of the underlying packing $\mathcal{V}$ of the parameter space $\Theta$.
In particular, Theorem 2 and Corollary 4 show that if we can find a packing set $\mathcal{V}$ that yields linear functionals $\varphi_\nu$ whose sum has good "spectral" properties (meaning a small operator norm when taking suprema over $L^\infty$-type spaces), we can provide sharper results.

4.2 Applications of Theorem 2 to mean estimation

In this section, we show how Theorem 2, coupled with Corollary 4, leads to sharp characterizations of the $\alpha$-private minimax rates for classical and high-dimensional mean estimation problems. Our results show that for $d$-dimensional mean-estimation problems, the requirement of $\alpha$-local differential privacy causes a reduction in effective sample size from $n$ to $n\alpha^2/d$. Throughout this section, we assume that the channel $Q$ is non-interactive, meaning that each random variable $Z_i$ depends only on $X_i$, so that local privacy takes the simpler form (2). We also state each of our results for privacy parameter $\alpha \in [0, 1]$, but note that all of our bounds hold for any constant $\alpha$, with appropriate changes in the numerical pre-factors.

Before proceeding, we describe two sampling mechanisms for enforcing $\alpha$-local differential privacy. Our methods for achieving the upper bounds in minimax rates are based on unbiased estimators. Let us assume that we wish to construct an $\alpha$-private unbiased estimate $Z$ of the vector $v \in \mathbb{R}^d$. The following sampling strategies are based on a radius $r > 0$ and a bound $B > 0$ specified for each problem, and they require the Bernoulli random variable $T \sim \text{Bernoulli}(\pi_\alpha)$, where $\pi_\alpha := e^\alpha / (e^\alpha + 1)$.

Figure 2. Private sampling strategies. (a) Strategy (26a) for the $\ell_2$-ball: the outer boundary of the highlighted region is sampled uniformly with probability $e^\alpha/(e^\alpha + 1)$. (b) Strategy (26b) for the $\ell_\infty$-ball: the circled point set is sampled uniformly with probability $e^\alpha/(e^\alpha + 1)$.
Strategy A: Given a vector $v$ with $\|v\|_2 \leq r$, set $\tilde{v} = r v / \|v\|_2$ with probability $\frac{1}{2} + \frac{\|v\|_2}{2r}$ and $\tilde{v} = -r v / \|v\|_2$ with probability $\frac{1}{2} - \frac{\|v\|_2}{2r}$. Then sample $T \sim \text{Bernoulli}(\pi_\alpha)$ and set

$$Z \sim \begin{cases} \text{Uniform}\big( z \in \mathbb{R}^d : \langle z, \tilde{v} \rangle > 0, \; \|z\|_2 = B \big) & \text{if } T = 1 \\ \text{Uniform}\big( z \in \mathbb{R}^d : \langle z, \tilde{v} \rangle \leq 0, \; \|z\|_2 = B \big) & \text{if } T = 0. \end{cases} \tag{26a}$$

Strategy B: Given a vector $v$ with $\|v\|_\infty \leq r$, construct $\tilde{v} \in \mathbb{R}^d$ with coordinates $\tilde{v}_j$ sampled independently from $\{-r, r\}$ with probabilities $\frac{1}{2} - \frac{v_j}{2r}$ and $\frac{1}{2} + \frac{v_j}{2r}$, respectively. Then sample $T \sim \text{Bernoulli}(\pi_\alpha)$ and set

$$Z \sim \begin{cases} \text{Uniform}\big( z \in \{-B, B\}^d : \langle z, \tilde{v} \rangle > 0 \big) & \text{if } T = 1 \\ \text{Uniform}\big( z \in \{-B, B\}^d : \langle z, \tilde{v} \rangle \leq 0 \big) & \text{if } T = 0. \end{cases} \tag{26b}$$

See Figure 2 for visualizations of these sampling strategies. By inspection, each is $\alpha$-differentially private for any vector satisfying $\|v\|_2 \leq r$ or $\|v\|_\infty \leq r$ for Strategy A or B, respectively. Moreover, each strategy is efficiently implementable: Strategy A by normalizing a sample from the $N(0, I_{d \times d})$ distribution, and Strategy B by rejection sampling over the scaled hypercube $\{-B, B\}^d$.

Given these sampling strategies, we study the $d$-dimensional problem of estimating the mean $\theta(P) := E_P[X]$ of a random vector. We consider a few different metrics for the error of an estimator of the mean to flesh out the testing reduction in Section 2. Due to the difficulties associated with differential privacy on non-compact spaces (recall Section 3.3.1), we focus on distributions with compact support. We defer all proofs to Appendix A; they use a combination of Theorem 2 with Fano's method.

4.2.1 Minimax rates

We begin by bounding the minimax rate in the squared $\ell_2$-metric. For a parameter $p \in [1, 2]$ and radius $r < \infty$, consider the family

$$\mathcal{P}_{p,r} := \big\{ \text{distributions } P \text{ supported on } \mathbb{B}_p(r) \subset \mathbb{R}^d \big\}, \tag{27}$$

where $\mathbb{B}_p(r) = \{ x \in \mathbb{R}^d \mid \|x\|_p \leq r \}$ is the $\ell_p$-ball of radius $r$.
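Before turning to the minimax rates, here is a minimal sketch of Strategy A (26a). The function name is ours; the bound $B$ is left as a parameter, since it is specified per problem (an explicit unbiasedness-inducing choice appears in Section 4.2.2). We implement the uniform hemisphere draw by rejection sampling from normalized Gaussians, one simple realization of the normalization approach described above:

```python
import math
import random

def sample_strategy_a(v, r, B, alpha, rng=random):
    """One draw of the l2-ball mechanism (26a): returns Z uniform on the
    hemisphere of the radius-B sphere selected by the rounded direction of v."""
    d = len(v)
    norm_v = math.sqrt(sum(x * x for x in v))
    # Round v to +/- r * v/||v||_2; for v = 0, any unit direction works.
    u = [x / norm_v for x in v] if norm_v > 0 else [1.0] + [0.0] * (d - 1)
    sign = 1.0 if rng.random() < 0.5 + norm_v / (2 * r) else -1.0
    v_tilde = [sign * x for x in u]
    # Biased coin T ~ Bernoulli(pi_alpha), pi_alpha = e^a / (e^a + 1).
    t = rng.random() < math.exp(alpha) / (math.exp(alpha) + 1)
    # Draw uniform directions on the sphere (normalized Gaussians) until
    # the half-space constraint <z, v_tilde> > 0 matches the coin T.
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm_g = math.sqrt(sum(x * x for x in g))
        z = [B * x / norm_g for x in g]
        inner = sum(zi * vi for zi, vi in zip(z, v_tilde))
        if (inner > 0) == t:
            return z
```

Each rejection round accepts with probability one half, so the loop terminates after two draws in expectation; a reflection of the rejected sample would avoid the loop entirely, at the cost of slightly more care.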
Proposition 3. For the mean estimation problem, for all $p \in [1, 2]$ and privacy levels $\alpha \in [0, 1]$,

$$r^2 \min\Big\{ 1, \frac{1}{\sqrt{n\alpha^2}}, \frac{d}{n\alpha^2} \Big\} \;\lesssim\; \mathfrak{M}_n\big( \theta(\mathcal{P}_{p,r}), \|\cdot\|_2^2, \alpha \big) \;\lesssim\; r^2 \min\Big\{ \frac{d}{n\alpha^2}, 1 \Big\}.$$

This bound does not depend on the norm bounding $X$ so long as $p \in [1, 2]$, which is consistent with the classical mean estimation problem. Proposition 3 demonstrates the substantial difference between $d$-dimensional mean estimation in private and non-private settings: more precisely, the privacy constraint leads to a multiplicative penalty of $d/\alpha^2$ in terms of mean-squared error. Indeed, in the non-private setting, the standard mean estimator $\widehat{\theta} = \frac{1}{n}\sum_{i=1}^n X_i$ has mean-squared error at most $r^2/n$, since $\|X\|_2 \leq \|X\|_p \leq r$ by assumption. Thus, Proposition 3 exhibits an effective sample size reduction of $n \mapsto n\alpha^2/d$.

To show the applicability of the general metric construction in Section 2, we now consider estimation in the $\ell_\infty$-norm; estimation in this metric is natural in scenarios where one wishes only to guarantee that the maximum error of any particular component of the vector $\theta$ is small. We focus in this scenario on the family $\mathcal{P}_{\infty,r}$ of distributions $P$ supported on $\mathbb{B}_\infty(r) \subset \mathbb{R}^d$.

Proposition 4. For the mean estimation problem, for all $\alpha \in [0, 1]$,

$$\min\Big\{ r, \; \frac{r\sqrt{d \log(2d)}}{\sqrt{n\alpha^2}} \Big\} \;\lesssim\; \mathfrak{M}_n\big( \theta(\mathcal{P}_{\infty,r}), \|\cdot\|_\infty, \alpha \big) \;\lesssim\; \min\Big\{ r, \; \frac{r\sqrt{d \log(2d)}}{\sqrt{n\alpha^2}} \Big\}.$$

Proposition 4 provides a similar message to Proposition 3 on the loss of statistical efficiency. This is clearest from an example: let the $X_i$ be random vectors bounded by one in $\ell_\infty$-norm. Then classical results on sub-Gaussian random variables [e.g., 12] immediately imply that the standard non-private mean $\widehat{\theta} = \frac{1}{n}\sum_{i=1}^n X_i$ satisfies $E[\|\widehat{\theta} - E[X]\|_\infty] \leq \sqrt{\log(2d)/n}$. Comparing this result to the rate $\sqrt{d \log(2d)/(n\alpha^2)}$ of Proposition 4, we again see the effective sample size reduction $n \mapsto n\alpha^2/d$.
Recently, there has been substantial interest in high-dimensional problems, in which the dimension $d$ is larger than the sample size $n$, but a low-dimensional latent structure makes inference possible. (See the paper by Negahban et al. [44] for a general overview.) Accordingly, let us consider an idealized version of the high-dimensional mean estimation problem, in which we assume that $\theta(P) = E[X] \in \mathbb{R}^d$ has (at most) one non-zero entry, so that $\|E[X]\|_0 \leq 1$. In the non-private case, estimation of such an $s$-sparse predictor in the squared $\ell_2$-norm is possible at the rate $E[\|\widehat{\theta} - \theta\|_2^2] \leq s \log(d/s)/n$, so that the dimension $d$ can be exponentially larger than the sample size $n$. With this context, the next result shows that local privacy can have a dramatic impact in the high-dimensional setting. Consider the family

$$\mathcal{P}_{\infty,r}^s := \big\{ \text{distributions } P \text{ supported on } \mathbb{B}_\infty(r) \subset \mathbb{R}^d \text{ with } \|E_P[X]\|_0 \leq s \big\}.$$

Proposition 5. For the 1-sparse means problem, for all $\alpha \in [0, 1]$,

$$\min\Big\{ r^2, \; \frac{r^2 d \log(2d)}{n\alpha^2} \Big\} \;\lesssim\; \mathfrak{M}_n\big( \theta(\mathcal{P}_{\infty,r}^1), \|\cdot\|_2^2, \alpha \big) \;\lesssim\; \min\Big\{ r^2, \; \frac{r^2 d \log(2d)}{n\alpha^2} \Big\}.$$

See Section A.3 for a proof. From Proposition 5, it becomes clear that in locally private but non-interactive (2) settings, high-dimensional estimation is effectively impossible.

4.2.2 Optimal mechanisms: attainability for mean estimation

In this section, we describe how to achieve matching upper bounds in Propositions 3 and 4 using simple and practical algorithms, namely the "right" type of stochastic perturbation of the observations $X_i$ coupled with a standard mean estimator. We show the optimality of privatizing via the sampling strategies (26a) and (26b); interestingly, we also show that privatizing via Laplace perturbation is strictly sub-optimal.
To give a private mechanism, we must specify the conditional distribution $Q$, satisfying $\alpha$-local differential privacy, that is used to construct $Z$. In this case, given an observation $X_i$, we construct $Z_i$ by perturbing $X_i$ in such a way that $E[Z_i \mid X_i = x] = x$. Each of the strategies (26a) and (26b) also requires a constant $B$, and we show how to choose $B$ for each strategy to satisfy the unbiasedness condition $E[Z \mid X = x] = x$.

We begin with the mean estimation problem for the distributions $\mathcal{P}_{p,r}$ of Proposition 3, for which we use the sampling scheme (26a). That is, let $X = x \in \mathbb{R}^d$ satisfy $\|x\|_2 \leq \|x\|_p \leq r$. Then we construct the random vector $Z$ according to strategy (26a), where we set the initial vector $v = x$ in the sampling scheme. To achieve the unbiasedness condition $E[Z \mid x] = x$, we define the bound

$$B = r \, \frac{e^\alpha + 1}{e^\alpha - 1} \, \frac{d \sqrt{\pi} \, \Gamma\big(\frac{d-1}{2} + 1\big)}{\Gamma\big(\frac{d}{2} + 1\big)}. \tag{28}$$

(See Appendix F.2 for a proof that $E[Z \mid x] = x$ with this choice of $B$.) Notably, the choice (28) implies $B \leq c r \sqrt{d}/\alpha$ for a universal constant $c < \infty$, since $d \, \Gamma(\frac{d-1}{2} + 1)/\Gamma(\frac{d}{2} + 1) \lesssim \sqrt{d}$ and $e^\alpha - 1 = \alpha + O(\alpha^2)$. As a consequence, generating each $Z_i$ by this perturbation strategy and using the mean estimator $\widehat{\theta} = \frac{1}{n}\sum_{i=1}^n Z_i$, the estimator $\widehat{\theta}$ is unbiased for $E[X]$ and satisfies

$$E\Big[ \|\widehat{\theta} - E[X]\|_2^2 \Big] = \frac{1}{n^2} \sum_{i=1}^n \text{Var}(Z_i) \leq \frac{B^2}{n} \leq c \, \frac{r^2 d}{n\alpha^2}$$

for a universal constant $c$.

In Proposition 4, we consider the family $\mathcal{P}_{\infty,r}$ of distributions supported on the $\ell_\infty$-ball of radius $r$. In our mechanism for attaining the upper bound, we use the sampling scheme (26b) to generate the private $Z_i$, so that for an observation $X = x \in \mathbb{R}^d$ with $\|x\|_\infty \leq r$, we resample $Z$ (from the initial vector $v = x$) according to strategy (26b). Again, we would like to guarantee the unbiasedness condition $E[Z \mid X = x] = x$, for which we use a result of Duchi et al. [19].
That paper shows that taking

$$B = c \, \frac{r \sqrt{d}}{\alpha} \tag{29}$$

for a (particular) universal constant $c$ yields the desired unbiasedness [19, Corollary 3]. Since the random variable $Z$ is bounded in $\ell_\infty$-norm with probability 1, each coordinate $[Z]_j$ of $Z$ is sub-Gaussian. As a consequence, we obtain via standard bounds [12] that

$$E\big[ \|\widehat{\theta} - \theta\|_\infty^2 \big] \leq \frac{B^2 \log(2d)}{n} = c^2 \, \frac{r^2 d \log(2d)}{n\alpha^2}$$

for a universal constant $c$, proving the upper bound in Proposition 4.

To conclude this section, we note that the strategy of adding Laplacian noise to the vectors $X$ is sub-optimal. Indeed, consider the family $\mathcal{P}_{2,1}$ of distributions supported on $\mathbb{B}_2(1) \subset \mathbb{R}^d$, as in Proposition 3. To guarantee $\alpha$-differential privacy using independent Laplace noise vectors for $x \in \mathbb{B}_2(1)$, we take $Z = x + W$, where $W \in \mathbb{R}^d$ has components $W_j$ that are independent and distributed as $\text{Laplace}(\alpha/\sqrt{d})$. We have the following information-theoretic result: if the $Z_i$ are constructed via this Laplace noise mechanism, then

$$\inf_{\widehat{\theta}} \sup_{P \in \mathcal{P}} E_P\Big[ \|\widehat{\theta}(Z_1, \ldots, Z_n) - E_P[X]\|_2^2 \Big] \;\gtrsim\; \min\Big\{ \frac{d^2}{n\alpha^2}, 1 \Big\}. \tag{30}$$

See Appendix A.4 for the proof of this claim. The poorer dimension dependence exhibited by the Laplace mechanism (30), in comparison to Proposition 3, demonstrates that sampling mechanisms must be chosen carefully, as in the strategies (26a)–(26b), in order to obtain statistically optimal rates.

5 Bounds on multiple pairwise divergences: Assouad's method

Thus far, we have seen how Le Cam's method and Fano's method, in the form of Theorem 2 and Corollary 4, can give sharp minimax rates for various problems. However, their application appears to be limited to problems whose minimax rates can be controlled via reductions to binary hypothesis tests (Le Cam's method) or to non-interactive channels satisfying the simpler definition (2) of local privacy (Fano's method).
In this section, we show that a privatized form of Assouad's method (in the form of Lemma 1) can be used to obtain sharp minimax rates in interactive settings. In particular, it can be applied when the loss is sufficiently "decomposable," so that the coordinate-wise nature of the Assouad construction can be brought to bear. Concretely, we show that an upper bound on a sum of paired KL-divergences, when combined with Assouad's method, provides sharp lower bounds for several problems, including multinomial probability estimation and nonparametric density estimation. Each of these problems can be characterized in terms of an effective dimension $d$, and our results (paralleling those of Section 4) show that the requirement of $\alpha$-local differential privacy causes a reduction in effective sample size from $n$ to $n\alpha^2/d$.

5.1 Variational bounds on paired divergences

For a fixed $d \in \mathbb{N}$, we consider collections of distributions indexed using the Boolean hypercube $\mathcal{V} = \{-1, 1\}^d$. For each $i \in [n]$ and $\nu \in \mathcal{V}$, we let the distribution $P_{\nu,i}$ be supported on the fixed set $\mathcal{X}$, and we define the product distribution $P_\nu^n = \prod_{i=1}^n P_{\nu,i}$. Then for $j \in [d]$ we define the paired mixtures

$$P_{+j}^n = \frac{1}{2^{d-1}} \sum_{\nu : \nu_j = 1} P_\nu^n, \qquad P_{-j}^n = \frac{1}{2^{d-1}} \sum_{\nu : \nu_j = -1} P_\nu^n, \qquad P_{\pm j, i} = \frac{1}{2^{d-1}} \sum_{\nu : \nu_j = \pm 1} P_{\nu,i}. \tag{31}$$

(Note that $P_{+j}^n$ is not necessarily a product distribution.) Recalling the marginal channel (3), we may then define the marginal mixtures

$$M_{+j}^n(S) := \frac{1}{2^{d-1}} \sum_{\nu : \nu_j = 1} M_\nu^n(S) = \int Q^n(S \mid x_{1:n}) \, dP_{+j}^n(x_{1:n}) \qquad \text{for } j = 1, \ldots, d,$$

with the distributions $M_{-j}^n$ defined analogously. For a given pair of distributions $(M, M')$, we let $D_{\rm kl}^{\rm sy}(M \| M') = D_{\rm kl}(M \| M') + D_{\rm kl}(M' \| M)$ denote the symmetrized KL-divergence.
Recalling the 1-ball $\mathbb{B}_\infty(\mathcal{X})$ of the supremum norm (23), with these definitions we have the following theorem:

Theorem 3. Under the conditions of the previous paragraph, for any $\alpha$-locally differentially private (1) channel $Q$, we have

$$\sum_{j=1}^d D_{\rm kl}^{\rm sy}\big( M_{+j}^n \,\|\, M_{-j}^n \big) \;\leq\; 2 (e^\alpha - 1)^2 \sum_{i=1}^n \sup_{\gamma \in \mathbb{B}_\infty(\mathcal{X})} \sum_{j=1}^d \Big( \int_{\mathcal{X}} \gamma(x) \big( dP_{+j,i}(x) - dP_{-j,i}(x) \big) \Big)^2.$$

Theorem 3 generalizes Theorem 1, which corresponds to the special case $d = 1$, though it also has parallels with Theorem 2, as taking the supremum outside the summation is essential to obtain sharp results. We provide the proof of Theorem 3 in Section 9.

Theorem 3 allows us to prove sharper lower bounds on the minimax risk. A combination of Pinsker's inequality and the Cauchy–Schwarz inequality implies

$$\sum_{j=1}^d \big\| M_{+j}^n - M_{-j}^n \big\|_{\rm TV} \;\leq\; \frac{\sqrt{d}}{2} \Big[ \sum_{j=1}^d D_{\rm kl}\big( M_{+j}^n \,\|\, M_{-j}^n \big) + D_{\rm kl}\big( M_{-j}^n \,\|\, M_{+j}^n \big) \Big]^{\frac{1}{2}}.$$

Thus, in combination with the sharper Assouad inequality (11), whenever $\{P_\nu\}$ induces a $2\delta$-Hamming separation for $\Phi \circ \rho$, we have

$$\mathfrak{M}_n\big(\theta(\mathcal{P}), \Phi \circ \rho\big) \;\geq\; d\delta \Bigg[ 1 - \Big( \frac{1}{4d} \sum_{j=1}^d D_{\rm kl}^{\rm sy}\big( M_{+j}^n \,\|\, M_{-j}^n \big) \Big)^{\frac{1}{2}} \Bigg]. \tag{32}$$

The combination of inequality (32) with Theorem 3 is the foundation for the remainder of this section.

5.2 Multinomial estimation under local privacy

For our first application of Theorem 3, we return to the original motivation for local privacy [50]: avoiding survey answer bias. Consider the probability simplex

$$\Delta_d := \Big\{ \theta \in \mathbb{R}^d \;\Big|\; \theta \geq 0 \text{ and } \sum_{j=1}^d \theta_j = 1 \Big\}.$$

Any vector $\theta \in \Delta_d$ specifies a multinomial random variable taking $d$ states, in particular with probabilities $P_\theta(X = j) = \theta_j$ for $j \in \{1, \ldots, d\}$. Given a sample from this distribution, our goal is to estimate the probability vector $\theta$.
Warner [50] studied the Bernoulli variant of this problem (corresponding to $d = 2$), proposing a mechanism known as randomized response: for a given survey question, respondents answer truthfully with probability $p > 1/2$ and lie with probability $1 - p$. Here we show that an extension of this mechanism is optimal for $\alpha$-locally differentially private multinomial estimation.

5.2.1 Minimax rates of convergence for multinomial estimation

Our first result provides bounds on the minimax error measured in either the squared $\ell_2$-norm or the $\ell_1$-norm for (sequentially) interactive channels. The $\ell_1$-norm is sometimes more appropriate for probability estimation due to its connections with total variation distance and testing.

Proposition 6. For the multinomial estimation problem, for any $\alpha$-locally differentially private channel (1), there exist universal constants $0 < c_\ell \leq c_u < 5$ such that for all $\alpha \in [0, 1]$,

$$c_\ell \min\Big\{ 1, \frac{1}{\sqrt{n\alpha^2}}, \frac{d}{n\alpha^2} \Big\} \;\leq\; \mathfrak{M}_n\big( \Delta_d, \|\cdot\|_2^2, \alpha \big) \;\leq\; c_u \min\Big\{ 1, \frac{d}{n\alpha^2} \Big\}, \tag{33}$$

and

$$c_\ell \min\Big\{ 1, \frac{d}{\sqrt{n\alpha^2}} \Big\} \;\leq\; \mathfrak{M}_n\big( \Delta_d, \|\cdot\|_1, \alpha \big) \;\leq\; c_u \min\Big\{ 1, \frac{d}{\sqrt{n\alpha^2}} \Big\}. \tag{34}$$

See Appendix B for the proofs of the lower bounds. We provide simple estimation strategies achieving the upper bounds in the next section.

As in the previous section, let us compare the private rates to the classical rate in which there is no privacy. The maximum likelihood estimate sets $\widehat{\theta}_j$ equal to the proportion of samples taking value $j$; it has mean-squared error

$$E\Big[ \|\widehat{\theta} - \theta\|_2^2 \Big] = \sum_{j=1}^d E\Big[ (\widehat{\theta}_j - \theta_j)^2 \Big] = \frac{1}{n} \sum_{j=1}^d \theta_j (1 - \theta_j) \leq \frac{1}{n}\Big( 1 - \frac{1}{d} \Big) < \frac{1}{n}.$$

An analogous calculation for the $\ell_1$-norm yields

$$E\big[ \|\widehat{\theta} - \theta\|_1 \big] \leq \sum_{j=1}^d E\big[ |\widehat{\theta}_j - \theta_j| \big] \leq \sum_{j=1}^d \sqrt{\text{Var}(\widehat{\theta}_j)} \leq \frac{1}{\sqrt{n}} \sum_{j=1}^d \sqrt{\theta_j (1 - \theta_j)} < \frac{\sqrt{d}}{\sqrt{n}}.$$
Consequently, for estimation in the $\ell_1$- or $\ell_2$-norm, the effect of providing $\alpha$-differential privacy is to cause the effective sample size to decrease as $n \mapsto n\alpha^2/d$.

5.2.2 Optimal mechanisms: attainability for multinomial estimation

An interesting consequence of the lower bound (33) is the following: a minor variant of Warner's randomized response strategy is an optimal mechanism. There are also other relatively simple estimation strategies that achieve the convergence rate $d/(n\alpha^2)$; the Laplace perturbation approach [24] is another. Nonetheless, its ease of use, coupled with our optimality results, provides support for randomized response as a desirable probability estimation method.

Let us demonstrate that these strategies attain the optimal rate of convergence. Since there is a bijection between multinomial observations $x \in \{1, \ldots, d\}$ and the $d$ standard basis vectors $e_1, \ldots, e_d \in \mathbb{R}^d$, we abuse notation and represent observations in either form when designing estimation strategies. In randomized response, we construct the private vector $Z \in \{0, 1\}^d$ from a multinomial observation $x \in \{e_1, \ldots, e_d\}$ by sampling the $d$ coordinates independently via the procedure

$$[Z]_j = \begin{cases} x_j & \text{with probability } \frac{\exp(\alpha/2)}{1 + \exp(\alpha/2)} \\ 1 - x_j & \text{with probability } \frac{1}{1 + \exp(\alpha/2)}. \end{cases} \tag{35}$$

The distribution (35) is $\alpha$-differentially private: indeed, for $x, x' \in \Delta_d$ and any $z \in \{0, 1\}^d$, we have

$$\frac{Q(Z = z \mid x)}{Q(Z = z \mid x')} = \exp\Big( \frac{\alpha}{2} \big( \|z - x\|_1 - \|z - x'\|_1 \big) \Big) \in \big[ \exp(-\alpha), \exp(\alpha) \big],$$

where the triangle inequality guarantees $\big| \|z - x\|_1 - \|z - x'\|_1 \big| \leq 2$.

We now compute the expected value and variance of the random variables $Z$. Using the definition (35), we have

$$E[Z \mid x] = \frac{e^{\alpha/2}}{1 + e^{\alpha/2}} \, x + \frac{1}{1 + e^{\alpha/2}} (\mathbf{1} - x) = \frac{e^{\alpha/2} - 1}{e^{\alpha/2} + 1} \, x + \frac{1}{1 + e^{\alpha/2}} \mathbf{1}.$$
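The mechanism (35), together with the affine map that inverts the conditional expectation just computed, can be sketched as follows (function names are ours):

```python
import math
import random

def randomized_response(x_index, d, alpha, rng=random):
    """Mechanism (35): privatize a multinomial observation x in {e_1,...,e_d}.
    Each coordinate of e_{x_index} is kept with probability
    e^{a/2}/(1 + e^{a/2}) and flipped otherwise."""
    p = math.exp(alpha / 2) / (1 + math.exp(alpha / 2))
    x = [1.0 if j == x_index else 0.0 for j in range(d)]
    return [xj if rng.random() < p else 1.0 - xj for xj in x]

def debias(z_mean, alpha):
    """Invert the affine map E[Z | x] = ((e^{a/2}-1) x + 1) / (e^{a/2}+1):
    applied to an average of privatized vectors, this recovers an unbiased
    estimate of x (or of theta, given many observations)."""
    e = math.exp(alpha / 2)
    return [(zj - 1.0 / (1 + e)) * (e + 1) / (e - 1) for zj in z_mean]
```

Applying `debias` to the sample average of the $Z_i$ yields exactly the unprojected estimator appearing next; the debiasing map sends the conditional mean displayed above back to $x$.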
Since the random variables $Z$ are Bernoulli, we have the variance bound $E[\|Z\|_2^2] \leq d$. Letting $\Pi_{\Delta_d}$ denote the projection operator onto the simplex, we arrive at the natural estimator

$$\widehat{\theta}_{\rm part} := \frac{1}{n} \sum_{i=1}^n \Big( Z_i - \frac{1}{1 + e^{\alpha/2}} \mathbf{1} \Big) \frac{e^{\alpha/2} + 1}{e^{\alpha/2} - 1} \qquad \text{and} \qquad \widehat{\theta} := \Pi_{\Delta_d}\big( \widehat{\theta}_{\rm part} \big). \tag{36}$$

The projection of $\widehat{\theta}_{\rm part}$ onto the probability simplex can be done in time linear in the dimension $d$ of the problem [11], so the estimator (36) is efficiently computable. Since projections onto convex sets are non-expansive, any pair of vectors in the simplex is at most $\ell_2$-distance $\sqrt{2}$ apart, and $E_\theta[\widehat{\theta}_{\rm part}] = \theta$ by construction, we have

$$E\Big[ \|\widehat{\theta} - \theta\|_2^2 \Big] \leq \min\Big\{ 2, \; E\big[ \|\widehat{\theta}_{\rm part} - \theta\|_2^2 \big] \Big\} \leq \min\Bigg\{ 2, \; \frac{d}{n} \Big( \frac{e^{\alpha/2} + 1}{e^{\alpha/2} - 1} \Big)^2 \Bigg\} \lesssim \min\Big\{ 1, \frac{d}{n\alpha^2} \Big\}.$$

Similar results hold for the $\ell_1$-norm: using the same estimator, since Euclidean projections onto the simplex are non-expansive for the $\ell_1$-distance,

$$E\Big[ \|\widehat{\theta} - \theta\|_1 \Big] \leq \min\Big\{ 1, \; \sum_{j=1}^d E\big[ |\widehat{\theta}_{{\rm part},j} - \theta_j| \big] \Big\} \lesssim \min\Big\{ 1, \frac{d}{\sqrt{n\alpha^2}} \Big\}.$$

5.3 Density estimation under local privacy

In this section, we show that the effects of local differential privacy are more severe for nonparametric density estimation: instead of just a multiplicative loss in the effective sample size, as in previous sections, imposing local differential privacy leads to a different convergence rate. This result holds even though we solve a problem in which both the function being estimated and the observations themselves belong to compact spaces.

Definition 2 (Elliptical Sobolev space). For a given orthonormal basis $\{\varphi_j\}$ of $L^2([0,1])$, smoothness parameter $\beta > 1/2$, and radius $C$, the Sobolev class of order $\beta$ is given by

$$\mathcal{F}_\beta[C] := \Big\{ f \in L^2([0,1]) \;\Big|\; f = \sum_{j=1}^\infty \theta_j \varphi_j \text{ such that } \sum_{j=1}^\infty j^{2\beta} \theta_j^2 \leq C^2 \Big\}.$$
20 If w e c ho ose the trignometric basis as our orth on orm a l b a sis, mem b ership in the class F β [ C ] corresp onds to smo o thness constraint s on the deriv ativ es of f . More precisely , for j ∈ N , consider the orthonormal basis for L 2 ([0 , 1]) of trigo nometric functions: ϕ 0 ( t ) = 1 , ϕ 2 j ( t ) = √ 2 cos(2 π j t ) , ϕ 2 j +1 ( t ) = √ 2 sin(2 π j t ) . ( 37) Let f b e a β -times almost ev erywhere differentia ble function for whic h | f ( β ) ( x ) | ≤ C for almost ev ery x ∈ [0 , 1] satisfying f ( k ) (0) = f ( k ) (1) for k ≤ β − 1. Then, uniformly ov er all suc h f , there is a univ ersal co nstan t c ≤ 2 suc h that that f ∈ F β [ cC ] (see, for instance, [49, Lemma A.3]). Supp ose our goal is to estimate a density fun ct ion f ∈ F β [ C ] and that qu ality is measured in terms of the squ ared error (squared L 2 [0 , 1]-norm) k b f − f k 2 2 := Z 1 0 ( b f ( x ) − f ( x )) 2 dx. The w ell-kno wn [53, 52, 49] (non-priv ate) minimax squ ared risk scales as M n  F β , k·k 2 2 , ∞  ≍ n − 2 β 2 β +1 . (38) The goal of this section is to u nderstand ho w this minimax rate change s when we add an α -priv acy constrain t to the problem. Ou r main result is to demonstrate t hat the classica l rate (38) is no longer attai nable wh e n w e require α -lo ca l differentia l priv acy . 5.3.1 Lo wer b ounds on density estimation W e b egin b y giving our main lo w er b ound on the minimax rate of estimation o f densities when observ ations from the density are differen tially p riv ate. W e provide the pro of of the follo wing prop osition in Section C.1. Prop o sition 7. Consider the class of densities F β define d using the trigonometric b asis (37) . Ther e exi sts a c onstant c β > 0 such that for any α -lo c al ly differ ential ly priva te channel (1) with α ∈ [0 , 1] , the private minimax risk has lower b ound M n  F β [1] , k ·k 2 2 , α  ≥ c β  nα 2  − 2 β 2 β +2 . 
The most important feature of the lower bound (39) is that it involves a different polynomial exponent than the classical minimax rate (38). Whereas the exponent in the classical case (38) is $2\beta/(2\beta+1)$, it reduces to $2\beta/(2\beta+2)$ in the locally private setting. For example, when we estimate Lipschitz densities ($\beta = 1$), the rate degrades from $n^{-2/3}$ to $n^{-1/2}$.

Interestingly, no estimator based on Laplace (or exponential) perturbation of the observations $X_i$ themselves can attain the rate of convergence (39). This fact follows from results of Carroll and Hall [13] on nonparametric deconvolution. They show that if observations $X_i$ are perturbed by additive noise $W$, where the characteristic function $\phi_W$ of the additive noise has tails behaving as $|\phi_W(t)| = O(|t|^{-a})$ for some $a > 0$, then no estimator can deconvolve $X + W$ and attain a rate of convergence better than $n^{-2\beta/(2\beta+2a+1)}$. Since the characteristic function of the Laplace distribution has tails decaying as $t^{-2}$, no estimator based on the Laplace mechanism (applied directly to the observations) can attain a rate of convergence better than $n^{-2\beta/(2\beta+5)}$. In order to attain the lower bound (39), we must thus study alternative privacy mechanisms.

5.3.2 Achievability by histogram estimators

We now turn to the mean-squared errors achieved by specific practical schemes, beginning with the special case of Lipschitz density functions ($\beta = 1$). In this special case, it suffices to consider a private version of a classical histogram estimate. For a fixed positive integer $k \in \mathbb{N}$, let $\{\mathcal{X}_j\}_{j=1}^k$ denote the partition of $\mathcal{X} = [0,1]$ into the intervals $\mathcal{X}_j = [(j-1)/k, j/k)$ for $j = 1, 2, \ldots, k-1$, and $\mathcal{X}_k = [(k-1)/k, 1]$.
Any histogram estimate of the density based on these $k$ bins can be specified by a vector $\theta \in k\Delta_k$, where we recall that $\Delta_k \subset \mathbb{R}_+^k$ is the probability simplex. Letting $\mathbf{1}_E$ denote the characteristic (indicator) function of the set $E$, any such vector $\theta \in \mathbb{R}^k$ defines a density estimate via the sum
$$f_\theta := \sum_{j=1}^k \theta_j \mathbf{1}_{\mathcal{X}_j}.$$
Let us now describe a mechanism that guarantees $\alpha$-local differential privacy. Given a sample $\{X_1, \ldots, X_n\}$ from the distribution $f$, consider the vectors
$$Z_i := e_k(X_i) + W_i, \quad \text{for } i = 1, 2, \ldots, n, \tag{40}$$
where $e_k(X_i) \in \Delta_k$ is the $k$-vector with $j$th entry equal to one if $X_i \in \mathcal{X}_j$ and zeros in all other entries, and $W_i$ is a random vector with i.i.d. Laplace$(\alpha/2)$ entries. The variables $\{Z_i\}_{i=1}^n$ defined in this way are $\alpha$-locally differentially private for $\{X_i\}_{i=1}^n$.

Using these private variables, we form the density estimate $\hat f := f_{\hat\theta} = \sum_{j=1}^k \hat\theta_j \mathbf{1}_{\mathcal{X}_j}$ based on the vector
$$\hat\theta := \Pi_k\Bigl(\frac{k}{n}\sum_{i=1}^n Z_i\Bigr),$$
where $\Pi_k$ denotes the Euclidean projection operator onto the set $k\Delta_k$. By construction, we have $\hat f \ge 0$ and $\int_0^1 \hat f(x)\,dx = 1$, so $\hat f$ is a valid density estimate. The following result characterizes its mean-squared estimation error:

Proposition 8. Consider the estimate $\hat f$ based on $k = (n\alpha^2)^{1/4}$ bins in the histogram. For any 1-Lipschitz density $f : [0,1] \to \mathbb{R}_+$, the MSE is upper bounded as
$$\mathbb{E}_f\bigl[\|\hat f - f\|_2^2\bigr] \le 5\,(\alpha^2 n)^{-\frac{1}{2}} + \sqrt{\alpha}\,n^{-3/4}. \tag{41}$$

For any fixed $\alpha > 0$, the first term in the bound (41) dominates, and the $O((\alpha^2 n)^{-\frac{1}{2}})$ rate matches the minimax lower bound (39) in the case $\beta = 1$. Consequently, the privatized histogram estimator is minimax-optimal for Lipschitz densities, providing a private analog of the classical result that histogram estimators are minimax-optimal for Lipschitz densities. See Section C.2 for a proof of Proposition 8.
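As an illustration, here is a minimal simulation sketch of the mechanism (40) and the projected histogram estimator. It assumes that Laplace$(\alpha/2)$ denotes Laplace noise with density proportional to $e^{-(\alpha/2)|w|}$ (scale $2/\alpha$), computes the projection onto $k\Delta_k$ by rescaling a standard simplex projection, and uses the hypothetical 1-Lipschitz test density $f(x) = x + 1/2$, none of which are prescribed by the paper.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection onto the probability simplex
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - 1.0) / idx > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

def private_histogram(X, alpha, k, rng):
    """Mechanism (40): one-hot bin vectors plus Laplace noise, then projection
    of (k/n) * sum_i Z_i onto the scaled simplex k * Delta_k."""
    n = len(X)
    bins = np.minimum((X * k).astype(int), k - 1)        # bin index of e_k(X_i)
    E = np.zeros((n, k))
    E[np.arange(n), bins] = 1.0
    Z = E + rng.laplace(scale=2.0 / alpha, size=(n, k))  # private releases (40)
    v = (k / n) * Z.sum(axis=0)
    return k * project_simplex(v / k)                    # projection onto k * Delta_k

rng = np.random.default_rng(1)
n, alpha = 50_000, 1.0
k = max(1, int((n * alpha ** 2) ** 0.25))                # bin choice from Proposition 8
# draw from the 1-Lipschitz density f(x) = x + 1/2 by inverting its CDF x^2/2 + x/2
u = rng.uniform(size=n)
X = (-1.0 + np.sqrt(1.0 + 8.0 * u)) / 2.0
theta_hat = private_histogram(X, alpha, k, rng)

# crude squared-L2 error via bin-midpoint values of f
grid = (np.arange(k) + 0.5) / k
mse = float(np.mean((theta_hat - (grid + 0.5)) ** 2))
```

The per-bin noise level scales as $k/\sqrt{n\alpha^2}$, so with $k = (n\alpha^2)^{1/4}$ the simulated error sits at the $(n\alpha^2)^{-1/2}$ scale of Proposition 8.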
We remark that a randomized response scheme parallel to that of Section 5.2.2 achieves the same rate of convergence, showing that this classical mechanism is again an optimal scheme.

5.3.3 Achievability by orthogonal projection estimators

For higher degrees of smoothness ($\beta > 1$), standard histogram estimators no longer achieve optimal rates in the classical setting [47]. Accordingly, we now turn to developing estimators based on orthogonal series expansions, and show that even in the setting of local privacy, they can achieve the lower bound (39) for all orders of smoothness $\beta \ge 1$.

Recall the elliptical Sobolev space (Definition 2), in which a function $f$ is represented in terms of its basis expansion $f = \sum_{j=1}^\infty \theta_j \varphi_j$. This representation underlies the orthonormal series estimator, as follows. Given a sample $X_{1:n}$ drawn i.i.d. according to a density $f \in L^2([0,1])$, compute the empirical basis coefficients
$$\hat\theta_j = \frac{1}{n}\sum_{i=1}^n \varphi_j(X_i) \quad \text{for } j \in \{1, \ldots, k\}, \tag{42}$$
where the value $k \in \mathbb{N}$ is chosen either a priori, based on known properties of the estimation problem, or adaptively, for example using cross-validation [26, 49]. Using these empirical coefficients, the density estimate is $\hat f = \sum_{j=1}^k \hat\theta_j \varphi_j$.

In our local privacy setting, we consider a mechanism that, instead of releasing the vector of coefficients $(\varphi_1(X_i), \ldots, \varphi_k(X_i))$ for each data point, employs a random vector $Z_i = (Z_{i,1}, \ldots, Z_{i,k})$ satisfying $\mathbb{E}[Z_{i,j} \mid X_i] = \varphi_j(X_i)$ for each $j \in [k]$. We assume the basis functions are $B_0$-uniformly bounded, that is, $\sup_j \sup_x |\varphi_j(x)| \le B_0 < \infty$. This boundedness condition holds for many standard bases, including the trigonometric basis (37) that underlies the classical Sobolev classes, and the Walsh basis.
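To illustrate the non-private series estimator (42) with the trigonometric basis (37), consider the following sketch. The test density $f = \varphi_0 + \tfrac{1}{2}\varphi_2$ and the rejection-sampling step are hypothetical choices for illustration, not part of the paper's construction; the private mechanism replaces the raw coefficient vector with an unbiased randomized release.

```python
import numpy as np

def phi(j, t):
    """Trigonometric basis (37): phi_0 = 1, phi_{2m}(t) = sqrt(2) cos(2 pi m t),
    phi_{2m+1}(t) = sqrt(2) sin(2 pi m t)."""
    if j == 0:
        return np.ones_like(t)
    m = j // 2
    if j % 2 == 0:
        return np.sqrt(2.0) * np.cos(2 * np.pi * m * t)
    return np.sqrt(2.0) * np.sin(2 * np.pi * m * t)

rng = np.random.default_rng(2)
# hypothetical test density f = phi_0 + 0.5 * phi_2, i.e. f(t) = 1 + (sqrt(2)/2) cos(2 pi t)
f = lambda t: 1.0 + 0.5 * np.sqrt(2.0) * np.cos(2 * np.pi * t)

# rejection sampling from f on [0, 1] (envelope constant 1.8 > max f ~ 1.707)
samples = []
while len(samples) < 20_000:
    t = rng.uniform(size=4096)
    keep = rng.uniform(size=4096) * 1.8 <= f(t)
    samples.extend(t[keep])
X = np.array(samples[:20_000])

# empirical basis coefficients (42); true values are 1, 0.5, 0, 0 respectively
theta_hat = {j: float(phi(j, X).mean()) for j in (0, 2, 3, 4)}
```

Each $\hat\theta_j$ is an average of bounded terms, so it concentrates around $\theta_j$ at rate $n^{-1/2}$; the private mechanism described next degrades exactly this coefficient-release step.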
We generate the random variables from the vector $v \in \mathbb{R}^k$ defined by $v_j = \varphi_j(X)$ in the hypercube-based sampling scheme (26b), where we assume that the outer bound $B > B_0$. With this sampling strategy, iterated expectation yields
$$\mathbb{E}\bigl[[Z]_j \mid X = x\bigr] = c_k \frac{B}{B_0\sqrt{k}}\Bigl(\frac{e^\alpha}{e^\alpha+1} - \frac{1}{e^\alpha+1}\Bigr)\varphi_j(x), \tag{43}$$
where $c_k > 0$ is a constant (which is bounded independently of $k$). Consequently, it suffices to take $B = O(B_0\sqrt{k}/\alpha)$ to guarantee the unbiasedness condition $\mathbb{E}[[Z_i]_j \mid X_i] = \varphi_j(X_i)$.

Overall, the privacy mechanism and estimator perform the following steps:

• given a data point $X_i$, set the vector $v = [\varphi_j(X_i)]_{j=1}^k$;
• sample $Z_i$ according to the strategy (26b), starting from the vector $v$ and using the bound $B = B_0\sqrt{k}(e^\alpha+1)/c_k(e^\alpha-1)$;
• compute the density estimate
$$\hat f := \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^k Z_{i,j}\varphi_j. \tag{44}$$

The resulting estimate enjoys the following guarantee, which (along with Proposition 8) makes clear that the private minimax lower bound (39) is sharp, providing a variant of the classical rates with a polynomially worse sample complexity. (See Section C.3 for a proof.)

Proposition 9. Let $\{\varphi_j\}$ be a $B_0$-uniformly bounded orthonormal basis for $L^2([0,1])$. There exists a constant $c$ (depending only on $C$ and $B_0$) such that, for any $f$ in the Sobolev space $\mathcal{F}_\beta[C]$, the estimator (44) with $k = (n\alpha^2)^{1/(2\beta+2)}$ has an MSE that is upper bounded as follows:
$$\mathbb{E}_f\bigl[\|f - \hat f\|_2^2\bigr] \le c\bigl(n\alpha^2\bigr)^{-\frac{2\beta}{2\beta+2}}. \tag{45}$$

Before concluding our exposition, we make a few remarks on other potential density estimators. Our orthogonal series estimator (44) and sampling scheme (43), while similar in spirit to the estimator proposed by Wasserman and Zhou [51, Sec.
6], differ in that they are locally private and require a different noise strategy to obtain both $\alpha$-local privacy and the optimal convergence rate. Lastly, similarly to our remarks on the insufficiency of standard Laplace noise addition for mean estimation, it is worth noting that density estimators based on orthogonal series and Laplace perturbation are sub-optimal: they achieve (at best) rates of $(n\alpha^2)^{-\frac{2\beta}{2\beta+3}}$. This rate is polynomially worse than the sharp result provided by Proposition 9. Again, we see that appropriately chosen noise mechanisms are crucial for obtaining optimal results.

6 Comparison to related work

There has been a substantial amount of work on developing differentially private mechanisms, both in local and non-local settings, and a number of authors have attempted to characterize optimal mechanisms. For example, Kasiviswanathan et al. [37], working within a local differential privacy setting, study Probably-Approximately-Correct (PAC) learning problems and show that the statistical query model [38] and local learning are equivalent up to polynomial changes in the sample size. In our work, we are concerned with a finer-grained assessment of inferential procedures: rates of convergence and their optimality. In the remainder of this section, we discuss further connections of our work to previous research on optimality, global (non-local) differential privacy, and errors-in-variables models.

6.1 Sample versus population estimation

The standard definition of differential privacy, due to Dwork et al. [24], is somewhat less restrictive than the local privacy formulation considered here.
In particular, a conditional distribution $Q$ with output space $\mathcal{Z}$ is $\alpha$-differentially private if
$$\sup\Bigl\{\frac{Q(S \mid x_{1:n})}{Q(S \mid x'_{1:n})} \;\Big|\; x_i, x'_i \in \mathcal{X},\; S \in \sigma(\mathcal{Z}),\; d_{\text{ham}}(x_{1:n}, x'_{1:n}) \le 1\Bigr\} \le \exp(\alpha), \tag{46}$$
where $d_{\text{ham}}$ denotes the Hamming distance between the samples. Several researchers have considered quantities similar to our minimax criteria under local (2) or non-local (46) differential privacy [7, 35, 33, 18]. However, their objective has often been quite different from ours: instead of bounding errors based on population quantities, they provide bounds in which the data are assumed to be held fixed. More precisely, let $\theta : \mathcal{X}^n \to \Theta$ denote an estimator, and let $\theta(x_{1:n})$ be a sample quantity based on $x_{1:n}$. Prior work is based on conditional minimax risks of the form
$$\mathfrak{M}^{\text{cond}}_n\bigl(\theta(\mathcal{X}), \Phi \circ \rho, \alpha\bigr) := \inf_Q \sup_{x_{1:n} \in \mathcal{X}^n} \mathbb{E}_Q\Bigl[\Phi\bigl(\rho\bigl(\theta(x_{1:n}), \hat\theta\bigr)\bigr) \;\Big|\; X_{1:n} = x_{1:n}\Bigr], \tag{47}$$
where $\hat\theta$ is drawn according to $Q(\cdot \mid x_{1:n})$, the infimum is taken over all $\alpha$-differentially private channels $Q$, and the supremum is taken over all possible samples of size $n$. The only randomness in this conditional minimax risk is provided by the channel; the data are held fixed, so there is no randomness from an underlying population distribution. A partial list of papers that use definitions of this type includes Beimel et al. [7, Section 2.4], Hardt and Talwar [35, Definition 2.4], Hall et al. [33, Section 3], and De [18].

The conditional (47) and population minimax risks (5) can differ substantially, and such differences are critical to address within a statistical approach to privacy-constrained inference. The goal of inference is to draw conclusions about the population-based quantity $\theta(P)$ based on the sample.
Moreover, lower bounds on the conditional minimax risk (47) do not imply bounds on the rate of estimation for the population quantity $\theta(P)$. In fact, since the conditional minimax risk (47) involves a supremum over all possible samples $x_{1:n} \in \mathcal{X}^n$, the opposite is usually true: population risks provide lower bounds on the conditional minimax risk, as we show presently.

An illustrative example is useful to understand the differences. Consider estimation of the mean of a normal distribution with known variance $\sigma^2$, in which the mean $\theta = \mathbb{E}[X] \in [-1,1]$ is assumed to belong to the unit interval. As our Proposition 1 shows, it is possible to estimate the mean of a normally distributed random variable even under $\alpha$-local differential privacy (1). In sharp contrast, the following result shows that the conditional minimax risk is infinite for this problem:

Lemma 2. Consider the normal location family $\{N(\theta, \sigma^2) \mid \theta \in [-1,1]\}$ under $\alpha$-differential privacy (46). The conditional minimax risk of the mean statistic is $\mathfrak{M}^{\text{cond}}_n(\theta(\mathbb{R}), (\cdot)^2, \alpha) = \infty$.

Proof. Assume for the sake of contradiction that $\delta > 0$ satisfies $Q(|\hat\theta - \theta(x_{1:n})| > \delta \mid x_{1:n}) \le \frac{1}{2}$ for all samples $x_{1:n} \in \mathbb{R}^n$. Fix $N(\delta) \in \mathbb{N}$ and choose $2\delta$-separated points $\theta_\nu$, $\nu \in [N(\delta)]$, that is, points with $|\theta_\nu - \theta_{\nu'}| \ge 2\delta$ for $\nu \neq \nu'$. Then the sets $\{\theta \in \mathbb{R} \mid |\theta - \theta_\nu| \le \delta\}$ are all disjoint, so for any pair of samples $x_{1:n}$ and $x^\nu_{1:n}$ with $d_{\text{ham}}(x_{1:n}, x^\nu_{1:n}) \le 1$,
$$Q\bigl(\exists\,\nu \in [N(\delta)] \text{ s.t. } |\hat\theta - \theta_\nu| \le \delta \mid x_{1:n}\bigr) = \sum_{\nu=1}^{N(\delta)} Q\bigl(|\hat\theta - \theta_\nu| \le \delta \mid x_{1:n}\bigr) \ge e^{-\alpha}\sum_{\nu=1}^{N(\delta)} Q\bigl(|\hat\theta - \theta_\nu| \le \delta \mid x^\nu_{1:n}\bigr).$$
We may take each sample $x^\nu_{1:n}$ such that $\theta(x^\nu_{1:n}) = \frac{1}{n}\sum_{i=1}^n x^\nu_i = \theta_\nu$ (for example, for each $\nu \in [N(\delta)]$ set $x^\nu_1 = n\theta_\nu - \sum_{i=2}^n x_i$), and by assumption,
$$1 \ge Q\bigl(\exists\,\nu \in [N(\delta)] \text{ s.t. } |\hat\theta - \theta_\nu| \le \delta \mid x_{1:n}\bigr) \ge e^{-\alpha} N(\delta)\,\frac{1}{2}.$$
Taking $N(\delta) > 2e^\alpha$ yields a contradiction. The argument applies to an arbitrary $\delta > 0$, so the claim follows.

There are variations on this result. For instance, even if the output of the mean estimator is restricted to $[-1,1]$, the conditional minimax risk remains constant. Similar arguments apply to weakenings of differential privacy (e.g., $\delta$-approximate $\alpha$-differential privacy [23]). Conditional and population risks are thus very different quantities.

More generally, the population minimax risk usually lower bounds the conditional minimax risk. Suppose we measure minimax risks in some given metric $\rho$ (so the loss is $\Phi(t) = t$). Let $\tilde\theta$ be any estimator based on the original sample $X_{1:n}$, and let $\hat\theta$ be any estimator based on the privatized sample. We then have the following series of inequalities:
$$\mathbb{E}_{Q,P}[\rho(\theta(P), \hat\theta)] \le \mathbb{E}_{Q,P}[\rho(\theta(P), \tilde\theta)] + \mathbb{E}_{Q,P}[\rho(\tilde\theta, \hat\theta)] \le \mathbb{E}_P[\rho(\theta(P), \tilde\theta)] + \sup_{x_{1:n} \in \mathcal{X}^n} \mathbb{E}_{Q,P}\bigl[\rho(\tilde\theta(x_{1:n}), \hat\theta) \mid X_{1:n} = x_{1:n}\bigr]. \tag{48}$$
The population minimax risk (5) thus lower bounds the conditional minimax risk (47) via
$$\mathfrak{M}^{\text{cond}}_n(\tilde\theta(\mathcal{X}), \rho, \alpha) \ge \mathfrak{M}_n(\theta(P), \rho, \alpha) - \mathbb{E}_P[\rho(\theta(P), \tilde\theta)].$$
In particular, if there exists an estimator $\tilde\theta$ based on the original (non-private) data such that $\mathbb{E}_P[\rho(\theta(P), \tilde\theta)] \le \frac{1}{2}\mathfrak{M}_n(\theta(P), \rho, \alpha)$, we are guaranteed that
$$\mathfrak{M}^{\text{cond}}_n(\tilde\theta(\mathcal{X}), \rho, \alpha) \ge \frac{1}{2}\mathfrak{M}_n(\theta(P), \rho, \alpha),$$
so the conditional minimax risk is lower bounded by a constant multiple of the population minimax risk. This lower bound holds for each of the examples in Sections 3–5; lower bounds on the $\alpha$-private population minimax risk (5) are thus stronger than lower bounds on the conditional minimax risk.

To illustrate one application of the lower bound (48), consider estimation of the sample mean of a data set $x_{1:n} \in \{0,1\}^n$ under $\alpha$-local privacy.
This problem has been considered before; for instance, Beimel et al. [7] study distributed protocols for this problem. In Theorem 2 of their work, they show that if a protocol has $\ell$ rounds of communication, the squared error in estimating the sample mean $(1/n)\sum_{i=1}^n x_i$ is $\Omega(1/(n\alpha^2\ell^2))$. The standard mean estimator $\tilde\theta(x_{1:n}) = (1/n)\sum_{i=1}^n x_i$ has error $\mathbb{E}[|\tilde\theta(x_{1:n}) - \theta|] \le n^{-\frac{1}{2}}$. Consequently, the lower bound (48) combined with Proposition 1 implies
$$c\,\frac{1}{\sqrt{n\alpha^2}} - \frac{1}{\sqrt{n}} \le \mathfrak{M}_n(\theta(P), |\cdot|, \alpha) - \sup_{\theta \in [-1,1]} \mathbb{E}\bigl[|\tilde\theta(x_{1:n}) - \theta|\bigr] \le \mathfrak{M}^{\text{cond}}_n\bigl(\theta(\{0,1\}^n), |\cdot|, \alpha\bigr),$$
for some numerical constant $c > 0$. A corollary of our results is thus an $\Omega(1/(n\alpha^2))$ lower bound on the conditional minimax risk for mean estimation, allowing for sequential interactivity but not multiple "rounds." An inspection of Beimel et al.'s proof technique [7, Section 4.2] shows that their lower bound also implies a lower bound of $1/(n\alpha^2)$ for estimation of the population mean $\mathbb{E}[X]$ in one dimension in non-interactive (2) settings; it is, however, unclear how to extend their technique to other settings.

6.2 Local versus non-local privacy

It is also worthwhile to make some comparisons to work on non-local forms of differential privacy, mainly to understand the differences between local and global forms of privacy. Chaudhuri and Hsu [15] provide lower bounds for the estimation of certain one-dimensional statistics based on a two-point family of problems. Their techniques differ from those of the current paper, and they do not appear to provide bounds on the statistic being estimated, but rather on one that is near to it. Beimel et al. [8] provide some bounds on sample complexity in the "probably approximately correct" (PAC) framework of learning theory, though extensions to other inferential tasks are unclear.
Other work on non-local privacy [e.g., 33, 16, 48] shows that for various types of estimation problems, adding Laplace noise degrades convergence rates in at most lower-order terms. In contrast, our work shows that the Laplace mechanism may be highly sub-optimal under local privacy.

To understand convergence rates for non-local privacy, let us return to the estimation of a multinomial distribution in $\Delta_d$, based on observations $X_i \in \{e_j\}_{j=1}^d$. In this case, adding a noise vector $W \in \mathbb{R}^d$ with i.i.d. entries distributed as Laplace$(\alpha n)$ provides differential privacy [23]; the associated mean-squared error is at most
$$\mathbb{E}_\theta\Biggl[\Bigl\|\frac{1}{n}\sum_{i=1}^n X_i + W - \theta\Bigr\|_2^2\Biggr] = \mathbb{E}\Biggl[\Bigl\|\frac{1}{n}\sum_{i=1}^n X_i - \theta\Bigr\|_2^2\Biggr] + \mathbb{E}\bigl[\|W\|_2^2\bigr] \le \frac{1}{n} + \frac{d}{n^2\alpha^2}.$$
In particular, in the asymptotic regime $n \gg d$, there is no penalty from providing differential privacy except in higher-order terms. Similar results hold for histogram estimation [33], classification problems [16], and classical point estimation problems [48]; in this sense, local and global forms of differential privacy can be rather different.

6.3 Errors-in-variables models

As a final remark on related work, we touch briefly on errors-in-variables models [14, 31], which have been the subject of extensive study. In such problems, one observes a corrupted version $Z_i$ of the true covariate $X_i$. Privacy analysis is one of the few settings in which it is possible to know the conditional distribution $Q(\cdot \mid X_i)$ precisely. However, the mechanisms that are optimal in our analysis—in particular, those in strategies (26a) and (26b)—are more complicated than adding noise directly to the covariates, which leads to complications.
Known (statistically) efficient errors-in-variables estimation procedures often require either solving certain integral or estimating equations, or solving non-convex optimization problems [e.g., 39, 41]. Some recent work [40] shows that certain types of non-convex programs arising from errors-in-variables can be solved efficiently. In density estimation (as noted in Section 5.3.1), corrupted observations lead to nonparametric deconvolution problems that appear harder than estimation under privacy constraints. Further investigation of computationally efficient procedures for nonlinear errors-in-variables models for privacy preservation is an interesting direction for future research.

7 Proof of Theorem 1 and related results

We now turn to the proofs of our results, beginning with Theorem 1 and related results. In all cases, we defer the proofs of the more technical lemmas to the appendices.

7.1 Proof of Theorem 1

Observe that $M_1$ and $M_2$ are absolutely continuous with respect to one another, and there is a measure $\mu$ with respect to which they have densities $m_1$ and $m_2$, respectively. The channel probabilities $Q(\cdot \mid x)$ and $Q(\cdot \mid x')$ are likewise absolutely continuous, so we may assume they have densities $q(\cdot \mid x)$ and write $m_i(z) = \int q(z \mid x)\,dP_i(x)$. In terms of these densities, we have
$$D_{\text{kl}}(M_1 \| M_2) + D_{\text{kl}}(M_2 \| M_1) = \int m_1(z)\log\frac{m_1(z)}{m_2(z)}\,d\mu(z) + \int m_2(z)\log\frac{m_2(z)}{m_1(z)}\,d\mu(z) = \int \bigl(m_1(z) - m_2(z)\bigr)\log\frac{m_1(z)}{m_2(z)}\,d\mu(z).$$
Consequently, we must bound both the difference $m_1 - m_2$ and the log ratio of the marginal densities. The following two auxiliary lemmas are useful:

Lemma 3. For any $\alpha$-locally differentially private conditional distribution, we have
$$|m_1(z) - m_2(z)| \le c_\alpha \inf_x q(z \mid x)\,(e^\alpha - 1)\,\|P_1 - P_2\|_{\text{TV}}, \tag{49}$$
where $c_\alpha = \min\{2, e^\alpha\}$.

Lemma 4. Let $a, b \in \mathbb{R}_+$.
Then   log a b   ≤ | a − b | min { a,b } . W e pro ve these tw o results at the end of this section. With the lemmas in hand, let u s no w complete the pro of of the theorem. F rom Lemma 4, the log ratio is b ound e d as     log m 1 ( z ) m 2 ( z )     ≤ | m 1 ( z ) − m 2 ( z ) | min { m 1 ( z ) , m 2 ( z ) } . Applying Lemma 3 to the numerator yields     log m 1 ( z ) m 2 ( z )     ≤ c α ( e α − 1) k P 1 − P 2 k TV inf x q ( z | x ) min { m 1 ( z ) , m 2 ( z ) } ≤ c α ( e α − 1) k P 1 − P 2 k TV inf x q ( z | x ) inf x q ( z | x ) , where the final step uses the in equali t y min { m 1 ( z ) , m 2 ( z ) } ≥ in f x q ( z | x ). Pu tt ing together the pieces leads to the b oun d     log m 1 ( z ) m 2 ( z )     ≤ c α ( e α − 1) k P 1 − P 2 k TV . Com b ining with inequalit y (49) yields D kl ( M 1 k M 2 ) + D kl ( M 2 k M 1 ) ≤ c 2 α ( e α − 1) 2 k P 1 − P 2 k 2 TV Z inf x q ( z | x ) dµ ( z ) . The final in tegral is at most one, w hic h completes the pro of of the theorem. It remains to pro v e Lemmas 3 and 4. W e b egi n with the form e r. F or an y z ∈ Z , w e ha v e m 1 ( z ) − m 2 ( z ) = Z X q ( z | x ) [ dP 1 ( x ) − dP 2 ( x )] = Z X q ( z | x ) [ dP 1 ( x ) − dP 2 ( x )] + + Z X q ( z | x ) [ dP 1 ( x ) − dP 2 ( x )] − ≤ sup x ∈X q ( z | x ) Z X [ dP 1 ( x ) − dP 2 ( x )] + + inf x ∈X q ( z | x ) Z X [ dP 1 ( x ) − dP 2 ( x )] − =  sup x ∈X q ( z | x ) − inf x ∈X q ( z | x )  Z X [ dP 1 ( x ) − dP 2 ( x )] + . By definition of th e total v ariation norm, w e ha v e R [ dP 1 − dP 2 ] + = k P 1 − P 2 k TV , and hence | m 1 ( z ) − m 2 ( z ) | ≤ sup x,x ′   q ( z | x ) − q ( z | x ′ )   k P 1 − P 2 k TV . 
For any $\hat x \in \mathcal{X}$, we may add and subtract $q(z \mid \hat x)$ from the quantity inside the supremum, which implies that
$$\sup_{x, x'}\bigl|q(z \mid x) - q(z \mid x')\bigr| = \inf_{\hat x}\sup_{x, x'}\bigl|q(z \mid x) - q(z \mid \hat x) + q(z \mid \hat x) - q(z \mid x')\bigr| \le 2\inf_{\hat x}\sup_x |q(z \mid x) - q(z \mid \hat x)| = 2\inf_{\hat x}\, q(z \mid \hat x)\sup_x\Bigl|\frac{q(z \mid x)}{q(z \mid \hat x)} - 1\Bigr|.$$
Similarly, for any $x, x'$ we have
$$|q(z \mid x) - q(z \mid x')| = q(z \mid x')\Bigl|\frac{q(z \mid x)}{q(z \mid x')} - 1\Bigr| \le e^\alpha \inf_{\hat x}\, q(z \mid \hat x)\Bigl|\frac{q(z \mid x)}{q(z \mid x')} - 1\Bigr|.$$
Since for any choice of $x, \hat x$ we have $q(z \mid x)/q(z \mid \hat x) \in [e^{-\alpha}, e^\alpha]$, we find that (since $e^\alpha - 1 \ge 1 - e^{-\alpha}$)
$$\sup_{x, x'}\bigl|q(z \mid x) - q(z \mid x')\bigr| \le \min\{2, e^\alpha\}\inf_x q(z \mid x)\,(e^\alpha - 1).$$
Combining with the earlier inequality (50) yields the claim (49).

To see Lemma 4, note that for any $x > 0$, the concavity of the logarithm implies that $\log(x) \le x - 1$. Setting alternately $x = a/b$ and $x = b/a$, we obtain the inequalities
$$\log\frac{a}{b} \le \frac{a}{b} - 1 = \frac{a-b}{b} \quad \text{and} \quad \log\frac{b}{a} \le \frac{b}{a} - 1 = \frac{b-a}{a}.$$
Using the first inequality for $a \ge b$ and the second for $a < b$ completes the proof.

7.2 Proof of Corollary 1

Let us recall the definition of the induced marginal distribution (3), given by
$$M_\nu(S) = \int_{\mathcal{X}^n} Q(S \mid x_{1:n})\,dP^n_\nu(x_{1:n}) \quad \text{for } S \in \sigma(\mathcal{Z}^n).$$
For each $i = 2, \ldots, n$, we let $M_{\nu,i}(\cdot \mid Z_1 = z_1, \ldots, Z_{i-1} = z_{i-1}) = M_{\nu,i}(\cdot \mid z_{1:i-1})$ denote the (marginal over $X_i$) distribution of the variable $Z_i$ conditioned on $Z_1 = z_1, \ldots, Z_{i-1} = z_{i-1}$. In addition, we use the shorthand notation
$$D_{\text{kl}}\bigl(M_{\nu,i} \| M_{\nu',i}\bigr) := \int_{\mathcal{Z}^{i-1}} D_{\text{kl}}\bigl(M_{\nu,i}(\cdot \mid z_{1:i-1}) \| M_{\nu',i}(\cdot \mid z_{1:i-1})\bigr)\,dM^{i-1}_\nu(z_1, \ldots, z_{i-1})$$
to denote the integrated KL divergence of the conditional distributions of $Z_i$.
By the chain rule for KL divergences [32, Chapter 5.3], we obtain
$$D_{\text{kl}}(M^n_\nu \| M^n_{\nu'}) = \sum_{i=1}^n D_{\text{kl}}\bigl(M_{\nu,i} \| M_{\nu',i}\bigr).$$
By assumption (1), the distribution $Q_i(\cdot \mid X_i, Z_{1:i-1})$ on $Z_i$ is $\alpha$-differentially private for the sample $X_i$. As a consequence, if we let $P_{\nu,i}(\cdot \mid Z_1 = z_1, \ldots, Z_{i-1} = z_{i-1})$ denote the conditional distribution of $X_i$ given the first $i-1$ values $Z_1, \ldots, Z_{i-1}$ and the packing index $V = \nu$, then from the chain rule and Theorem 1 we obtain
$$D_{\text{kl}}(M^n_\nu \| M^n_{\nu'}) = \sum_{i=1}^n \int_{\mathcal{Z}^{i-1}} D_{\text{kl}}\bigl(M_{\nu,i}(\cdot \mid z_{1:i-1}) \| M_{\nu',i}(\cdot \mid z_{1:i-1})\bigr)\,dM^{i-1}_\nu(z_{1:i-1}) \le \sum_{i=1}^n 4(e^\alpha - 1)^2 \int_{\mathcal{Z}^{i-1}} \bigl\|P_{\nu,i}(\cdot \mid z_{1:i-1}) - P_{\nu',i}(\cdot \mid z_{1:i-1})\bigr\|_{\text{TV}}^2\,dM^{i-1}_\nu(z_1, \ldots, z_{i-1}).$$
By the construction of our sampling scheme, the random variables $X_i$ are conditionally independent given $V = \nu$; thus $P_{\nu,i}(\cdot \mid z_{1:i-1}) = P_{\nu,i}$, where $P_{\nu,i}$ denotes the distribution of $X_i$ conditioned on $V = \nu$. Consequently, we have
$$\bigl\|P_{\nu,i}(\cdot \mid z_{1:i-1}) - P_{\nu',i}(\cdot \mid z_{1:i-1})\bigr\|_{\text{TV}} = \bigl\|P_{\nu,i} - P_{\nu',i}\bigr\|_{\text{TV}},$$
which gives the claimed result.

7.3 Proof of Proposition 1

The minimax rate characterized by equation (20) involves both a lower and an upper bound, and we divide our proof accordingly. We provide the proof for $\alpha \in (0,1]$, but note that a similar result (modulo different constants) holds for any finite value of $\alpha$.

Lower bound: We use Le Cam's method to prove the lower bound in equation (20). Fix a constant $\delta \in (0,1]$, with a precise value to be specified later. For $\nu \in \mathcal{V} = \{-1, 1\}$, define the distribution $P_\nu$ with support $\{-\delta^{-1/k}, 0, \delta^{-1/k}\}$ by
$$P_\nu(X = \delta^{-1/k}) = \frac{\delta(1+\nu)}{2}, \quad P_\nu(X = 0) = 1 - \delta, \quad \text{and} \quad P_\nu(X = -\delta^{-1/k}) = \frac{\delta(1-\nu)}{2}.$$
By construction, we have $\mathbb{E}[|X|^k] = \delta(\delta^{-1/k})^k = 1$ and $\theta_\nu = \mathbb{E}_\nu[X] = \delta^{\frac{k-1}{k}}\nu$, whence the mean difference is given by $\theta_1 - \theta_{-1} = 2\delta^{\frac{k-1}{k}}$. Applying Le Cam's method (8) and the minimax bound (7) yields
$$\mathfrak{M}_n(\Theta, (\cdot)^2, Q) \ge \bigl(\delta^{\frac{k-1}{k}}\bigr)^2\Bigl(\frac{1}{2} - \frac{1}{2}\bigl\|M^n_1 - M^n_{-1}\bigr\|_{\text{TV}}\Bigr),$$
where $M^n_\nu$ denotes the marginal distribution of the samples $Z_1, \ldots, Z_n$ conditioned on $\theta = \theta_\nu$. Now Pinsker's inequality implies that $\|M^n_1 - M^n_{-1}\|_{\text{TV}}^2 \le \frac{1}{2}D_{\text{kl}}(M^n_1 \| M^n_{-1})$, and Corollary 1 yields
$$D_{\text{kl}}\bigl(M^n_1 \| M^n_{-1}\bigr) \le 4(e^\alpha - 1)^2 n\,\|P_1 - P_{-1}\|_{\text{TV}}^2 = 4(e^\alpha - 1)^2 n\delta^2.$$
Putting together the pieces yields $\|M^n_1 - M^n_{-1}\|_{\text{TV}} \le (e^\alpha - 1)\delta\sqrt{2n}$. For $\alpha \in (0,1]$, we have $e^\alpha - 1 \le 2\alpha$, and thus our earlier application of Le Cam's method implies
$$\mathfrak{M}_n(\Theta, (\cdot)^2, \alpha) \ge \bigl(\delta^{\frac{k-1}{k}}\bigr)^2\Bigl(\frac{1}{2} - \alpha\delta\sqrt{2n}\Bigr).$$
Substituting $\delta = \min\{1, 1/\sqrt{32n\alpha^2}\}$ yields the claim (20).

Upper bound: We must demonstrate an $\alpha$-locally private conditional distribution $Q$ and an estimator that achieves the upper bound in equation (20). We do so via a combination of truncation and addition of Laplace noise. Define the truncation function $[\cdot]_T : \mathbb{R} \to [-T, T]$ by
$$[x]_T := \max\{-T, \min\{x, T\}\},$$
where the truncation level $T$ is to be chosen. Let $W_i$ be independent Laplace$(\alpha/(2T))$ random variables, and for each index $i = 1, \ldots, n$, define $Z_i := [X_i]_T + W_i$. By construction, the random variable $Z_i$ is $\alpha$-differentially private for $X_i$. For the mean estimator $\hat\theta := \frac{1}{n}\sum_{i=1}^n Z_i$, we have
$$\mathbb{E}\bigl[(\hat\theta - \theta)^2\bigr] = \operatorname{Var}(\hat\theta) + \bigl(\mathbb{E}[\hat\theta] - \theta\bigr)^2 = \frac{4T^2}{n\alpha^2} + \frac{1}{n}\operatorname{Var}([X_1]_T) + \bigl(\mathbb{E}[Z_1] - \theta\bigr)^2. \tag{51}$$
We claim that
$$\mathbb{E}[Z] = \mathbb{E}\bigl[[X]_T\bigr] \in \Bigl[\mathbb{E}[X] - \frac{1}{(k-1)T^{k-1}},\; \mathbb{E}[X] + \frac{1}{(k-1)T^{k-1}}\Bigr]. \tag{52}$$
Indeed, by the assumption that $\mathbb{E}[|X|^k] \le 1$, we have by a change of variables that
$$\int_T^\infty (x - T)\,dP(x) = \int_T^\infty P(X \ge x)\,dx \le \int_T^\infty \frac{1}{x^k}\,dx = \frac{1}{(k-1)T^{k-1}}.$$
Thus
$$\mathbb{E}\bigl[[X]_T\bigr] \ge \mathbb{E}[\min\{X, T\}] = \mathbb{E}\bigl[\min\{X, T\} + [X - T]_+ - [X - T]_+\bigr] = \mathbb{E}[X] - \int_T^\infty (x - T)\,dP(x) \ge \mathbb{E}[X] - \frac{1}{(k-1)T^{k-1}}.$$
A similar argument yields the upper bound in equation (52). From the bound (51), the inclusion $[X]_T \in [-T, T]$, and the inequality $\alpha^2 \le 1$, we have
$$\mathbb{E}\bigl[(\hat\theta - \theta)^2\bigr] \le \frac{5T^2}{n\alpha^2} + \frac{1}{(k-1)^2 T^{2k-2}},$$
valid for any $T > 0$. Choosing $T = (5(k-1))^{-\frac{1}{2k}}(n\alpha^2)^{1/(2k)}$ yields
$$\mathbb{E}\bigl[(\hat\theta - \theta)^2\bigr] \le \frac{5(5(k-1))^{-\frac{1}{k}}(n\alpha^2)^{\frac{1}{k}}}{n\alpha^2} + \frac{(5(k-1))^{1-\frac{1}{k}}}{(k-1)^2(n\alpha^2)^{1-\frac{1}{k}}} = 5^{1-\frac{1}{k}}\Bigl(1 + \frac{1}{k-1}\Bigr)\frac{1}{(k-1)^{\frac{1}{k}}(n\alpha^2)^{1-\frac{1}{k}}}.$$
Since $(1 + (k-1)^{-1})(k-1)^{-\frac{1}{k}} < (k-1)^{-1} + (k-1)^{-2}$ for $k \in (1,2)$ and is bounded by $1 + (k-1)^{-1} \le 2$ for $k \in [2, \infty]$, the upper bound (20) follows.

7.4 Proof of Proposition 2

We now turn to the proof of minimax rates for fixed-design linear regression.

Lower bound: We use a slight generalization of the $\alpha$-private form (19) of the local Fano inequality from Corollary 3. For concreteness, we assume throughout that $\alpha \in [0, \frac{23}{35}]$, but analogous arguments hold for any bounded $\alpha$ with changes only in the constant pre-factors.

Consider an instance of the linear regression model (21) in which the noise variables $\{\varepsilon_i\}_{i=1}^n$ are drawn i.i.d. from the uniform distribution on $[-\sigma, +\sigma]$. Our first step is to construct a suitable packing of the unit sphere $\mathbb{S}^{d-1} = \{u \in \mathbb{R}^d : \|u\|_2 = 1\}$ in $\ell_2$-norm:

Lemma 5. There exists a 1-packing $\mathcal{V} = \{\nu_1, \ldots, \nu_N\}$ of the unit sphere $\mathbb{S}^{d-1}$ with $N \ge \exp(d/8)$.

See Appendix D.1 for the proof of this claim. For a fixed $\delta \in (0,1]$ to be chosen shortly, define the family of vectors $\{\theta_\nu,\ \nu \in \mathcal{V}\}$ with $\theta_\nu := \delta\nu$.
Since $\|\nu\|_2 \le 1$, we have $\|\theta_\nu - \theta_{\nu'}\|_2 \le 2\delta$. Let $P_{\nu,i}$ denote the distribution of $Y_i$ conditioned on $\theta^* = \theta_\nu$. By the form of the linear regression model (21) and our assumption on the noise variable $\varepsilon_i$, the distribution $P_{\nu,i}$ is uniform on the interval $[\langle\theta_\nu, x_i\rangle - \sigma, \langle\theta_\nu, x_i\rangle + \sigma]$. Consequently, for $\nu \neq \nu' \in \mathcal{V}$, we have
$$\bigl\|P_{\nu,i} - P_{\nu',i}\bigr\|_{\text{TV}} = \frac{1}{2}\int |p_{\nu,i}(y) - p_{\nu',i}(y)|\,dy \le \frac{1}{2}\Bigl(\frac{1}{2\sigma}|\langle\theta_\nu, x_i\rangle - \langle\theta_{\nu'}, x_i\rangle| + \frac{1}{2\sigma}|\langle\theta_\nu, x_i\rangle - \langle\theta_{\nu'}, x_i\rangle|\Bigr) = \frac{1}{2\sigma}\bigl|\langle\theta_\nu - \theta_{\nu'}, x_i\rangle\bigr|.$$
Letting $V$ denote a random sample from the uniform distribution on $\mathcal{V}$, Corollary 1 implies that the mutual information is upper bounded as
$$I(Z_1, \ldots, Z_n; V) \le 4(e^\alpha - 1)^2\sum_{i=1}^n \frac{1}{|\mathcal{V}|^2}\sum_{\nu,\nu' \in \mathcal{V}}\bigl\|P_{\nu,i} - P_{\nu',i}\bigr\|_{\text{TV}}^2 \le \frac{(e^\alpha - 1)^2}{\sigma^2}\sum_{i=1}^n \frac{1}{|\mathcal{V}|^2}\sum_{\nu,\nu' \in \mathcal{V}}\bigl(\langle\theta_\nu - \theta_{\nu'}, x_i\rangle\bigr)^2 = \frac{(e^\alpha - 1)^2}{\sigma^2}\,\frac{1}{|\mathcal{V}|^2}\sum_{\nu,\nu' \in \mathcal{V}}(\theta_\nu - \theta_{\nu'})^\top X^\top X(\theta_\nu - \theta_{\nu'}).$$
Since $\theta_\nu = \delta\nu$, we have by definition of the maximum singular value that
$$(\theta_\nu - \theta_{\nu'})^\top X^\top X(\theta_\nu - \theta_{\nu'}) \le \delta^2\bigl\|\nu - \nu'\bigr\|_2^2\,\rho_{\max}(X^\top X) \le 4\delta^2\rho_{\max}^2(X) = 4n\delta^2\rho_{\max}^2(X/\sqrt{n}).$$
Putting together the pieces, we find that
$$I(Z_1, \ldots, Z_n; V) \le \frac{4n\delta^2(e^\alpha - 1)^2}{\sigma^2}\rho_{\max}^2(X/\sqrt{n}) \le \frac{8n\alpha^2\delta^2}{\sigma^2}\rho_{\max}^2(X/\sqrt{n}),$$
where the second inequality is valid for $\alpha \in [0, \frac{23}{35}]$. Consequently, Fano's inequality combined with the packing set $\mathcal{V}$ from Lemma 5 implies that
$$\mathfrak{M}_n\bigl(\Theta, \|\cdot\|_2^2, \alpha\bigr) \ge \frac{\delta^2}{4}\Bigl(1 - \frac{8n\delta^2\alpha^2\rho_{\max}^2(X/\sqrt{n})/\sigma^2 + \log 2}{d/8}\Bigr).$$
We split the remainder of the analysis into cases.

Case 1: First suppose that $d \ge 16$. Then setting $\delta^2 = \min\bigl\{1, \frac{d\sigma^2}{128n\rho_{\max}^2(X/\sqrt{n})}\bigr\}$ implies that
$$\frac{8n\delta^2\alpha^2\rho_{\max}^2(X/\sqrt{n})/\sigma^2 + \log 2}{d/8} \le \frac{8\log 2}{d} + \frac{64}{128} < \frac{7}{8}.$$
As a consequence, we have the lower bound
$$\mathfrak{M}_n\bigl(\Theta, \|\cdot\|_2^2, \alpha\bigr) \ge \frac{1}{4}\min\Bigl\{1, \frac{d\sigma^2}{128n\rho_{\max}^2(X/\sqrt{n})}\Bigr\}\cdot\frac{1}{8},$$
which yields the claim for $d \ge 16$.
Case 2: Otherwise, we may assume that $d < 16$. In this case, a lower bound for the case $d = 1$ is sufficient, since apart from constant factors, the same bound holds for all $d < 16$. We use the Le Cam method based on a two-point comparison. Indeed, let $\theta_1 = \delta$ and $\theta_2 = -\delta$, so that the total variation distance is upper bounded as $\|P_{1,i} - P_{2,i}\|_{\rm TV} \le \frac{\delta}{\sigma}|x_i|$. By Corollary 2, we have
\[
\mathfrak{M}_n\big(\Theta, (\cdot)^2, \alpha\big) \ge \delta^2\Bigg(\frac{1}{2} - \frac{\delta(e^\alpha - 1)}{\sigma}\Big(\sum_{i=1}^n x_i^2\Big)^{\frac{1}{2}}\Bigg).
\]
Letting $x = (x_1, \ldots, x_n)$ and setting $\delta^2 = \min\{1, \sigma^2/(16(e^\alpha - 1)^2\|x\|_2^2)\}$ gives the desired result.

Upper bound: We now turn to the upper bound, for which we need to specify a private conditional $Q$ and an estimator $\widehat{\theta}$ that achieves the stated upper bound on the mean-squared error. Let $W_i$ be independent ${\rm Laplace}(\alpha/(2\sigma))$ random variables. Then the additively perturbed random variable $Z_i = Y_i + W_i$ is $\alpha$-differentially private for $Y_i$, since by assumption the response $Y_i \in [\langle\theta, x_i\rangle - \sigma, \langle\theta, x_i\rangle + \sigma]$. We now claim that the standard least-squares estimator of $\theta^*$, applied to the privatized responses $Z = (Z_1, \ldots, Z_n)$, achieves the stated upper bound. Indeed, the least-squares estimate is given by
\[
\widehat{\theta} = (X^\top X)^{-1} X^\top Z = (X^\top X)^{-1} X^\top (X\theta^* + \varepsilon + W).
\]
Moreover, from the independence of $W$ and $\varepsilon$, we have
\[
\mathbb{E}\big[\|\widehat{\theta} - \theta^*\|_2^2\big] = \mathbb{E}\big[\|(X^\top X)^{-1}X^\top(\varepsilon + W)\|_2^2\big]
= \mathbb{E}\big[\|(X^\top X)^{-1}X^\top\varepsilon\|_2^2\big] + \mathbb{E}\big[\|(X^\top X)^{-1}X^\top W\|_2^2\big].
\]
Since $\varepsilon \in [-\sigma, \sigma]^n$, we know that $\mathbb{E}[\varepsilon\varepsilon^\top] \preceq \sigma^2 I_{n\times n}$, and for the given choice of $W$, we have $\mathbb{E}[WW^\top] = (4\sigma^2/\alpha^2) I_{n\times n}$. Since $\alpha \le 1$, we thus find
\[
\mathbb{E}\big[\|\widehat{\theta} - \theta^*\|_2^2\big] \le \frac{5\sigma^2}{\alpha^2}\,{\rm tr}\big(X(X^\top X)^{-2}X^\top\big) = \frac{5\sigma^2}{\alpha^2}\,{\rm tr}\big((X^\top X)^{-1}\big).
\]
Noting that ${\rm tr}((X^\top X)^{-1}) \le d/\rho_{\min}^2(X) = d/(n\rho_{\min}^2(X/\sqrt{n}))$ gives the claimed upper bound.
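A minimal simulation of this upper-bound mechanism is sketched below for a one-dimensional fixed design (all names and parameter values are ours, for illustration only). Each response is perturbed with Laplace noise of scale $2\sigma/\alpha$, our reading of ${\rm Laplace}(\alpha/(2\sigma))$, which is the standard Laplace-mechanism scale for a quantity ranging over an interval of length $2\sigma$; ordinary least squares is then run on the privatized responses:

```python
import math
import random

random.seed(0)

def laplace(scale):
    # Inverse-CDF sampling of a zero-mean Laplace variate with the given scale.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

# One-dimensional fixed-design regression: Y_i = theta* x_i + eps_i, eps_i ~ U[-sigma, sigma].
n, sigma, alpha, theta_star = 2000, 1.0, 1.0, 1.7
x = [2.0 * i / (n - 1) - 1.0 for i in range(n)]  # fixed design spread over [-1, 1]
z = [theta_star * xi + random.uniform(-sigma, sigma) + laplace(2.0 * sigma / alpha)
     for xi in x]

# Ordinary least squares on the privatized responses Z.
theta_hat = sum(xi * zi for xi, zi in zip(x, z)) / sum(xi * xi for xi in x)
assert abs(theta_hat - theta_star) < 0.5  # loose check; error ~ sigma/(alpha sqrt(n))
```

The noise added for privacy inflates the variance of $\widehat{\theta}$ by a factor on the order of $1/\alpha^2$, mirroring the $\sigma^2/\alpha^2$ factor in the bound above.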
8 Proof of Theorem 2 and related results

In this section, we collect together the proof of Theorem 2 and related corollaries.

8.1 Proof of Theorem 2

Let $\mathcal{Z}$ denote the domain of the random variable $Z$. We begin by reducing the problem to the case when $\mathcal{Z} = \{1, 2, \ldots, k\}$ for an arbitrary positive integer $k$. Indeed, in the general setting, we let $\mathcal{K} = \{K_i\}_{i=1}^k$ be any (measurable) finite partition of $\mathcal{Z}$, where for $z \in \mathcal{Z}$ we let $[z]_{\mathcal{K}} = K_i$ for the $K_i$ such that $z \in K_i$. The KL divergence $D_{\rm kl}(M_\nu \,\|\, \bar{M})$ can be defined as the supremum of the (discrete) KL divergences between the random variables $[Z]_{\mathcal{K}}$ sampled according to $M_\nu$ and $\bar{M}$, taken over all partitions $\mathcal{K}$ of $\mathcal{Z}$; for instance, see Gray [32, Chapter 5]. Consequently, we can prove the claim for $\mathcal{Z} = \{1, 2, \ldots, k\}$, and then take the supremum over $k$ to recover the general case. Accordingly, we can work with the probability mass functions $m(z \mid \nu) = M_\nu(Z = z)$ and $\bar{m}(z) = \bar{M}(Z = z)$, and we may write
\[
D_{\rm kl}\big(M_\nu \,\|\, \bar{M}\big) + D_{\rm kl}\big(\bar{M} \,\|\, M_\nu\big) = \sum_{z=1}^k \big(m(z \mid \nu) - \bar{m}(z)\big)\log\frac{m(z \mid \nu)}{\bar{m}(z)}. \tag{53}
\]
Throughout, we will also use (without loss of generality) the probability mass function $q(z \mid x) = Q(Z = z \mid X = x)$, where we note that $m(z \mid \nu) = \int q(z \mid x)\, dP_\nu(x)$.

Now we use Lemma 4 from the proof of Theorem 1 to complete the proof of Theorem 2. Starting with equality (53), we have
\[
\frac{1}{|\mathcal{V}|}\sum_{\nu \in \mathcal{V}}\Big(D_{\rm kl}\big(M_\nu \,\|\, \bar{M}\big) + D_{\rm kl}\big(\bar{M} \,\|\, M_\nu\big)\Big)
\le \sum_{\nu \in \mathcal{V}}\frac{1}{|\mathcal{V}|}\sum_{z=1}^k |m(z \mid \nu) - \bar{m}(z)|\,\Big|\log\frac{m(z \mid \nu)}{\bar{m}(z)}\Big|
\le \sum_{\nu \in \mathcal{V}}\frac{1}{|\mathcal{V}|}\sum_{z=1}^k |m(z \mid \nu) - \bar{m}(z)|\,\frac{|m(z \mid \nu) - \bar{m}(z)|}{\min\{\bar{m}(z), m(z \mid \nu)\}}.
\]
Now, we define the measure $m_0$ on $\mathcal{Z} = \{1, \ldots, k\}$ by $m_0(z) := \inf_{x \in \mathcal{X}} q(z \mid x)$.
It is clear that $\min\{\bar{m}(z), m(z \mid \nu)\} \ge m_0(z)$, whence we find
\[
\frac{1}{|\mathcal{V}|}\sum_{\nu \in \mathcal{V}}\Big(D_{\rm kl}\big(M_\nu \,\|\, \bar{M}\big) + D_{\rm kl}\big(\bar{M} \,\|\, M_\nu\big)\Big)
\le \sum_{\nu \in \mathcal{V}}\frac{1}{|\mathcal{V}|}\sum_{z=1}^k \frac{(m(z \mid \nu) - \bar{m}(z))^2}{m_0(z)}.
\]
It remains to bound the final sum. For any constant $c \in \mathbb{R}$, we have
\[
m(z \mid \nu) - \bar{m}(z) = \int_{\mathcal{X}} (q(z \mid x) - c)\big(dP_\nu(x) - d\bar{P}(x)\big).
\]
We define a set of functions $f : \mathcal{Z} \times \mathcal{X} \to \mathbb{R}$ (depending implicitly on $q$) by
\[
\mathcal{F}_\alpha := \big\{f \mid f(z, x) \in [1, e^\alpha]\, m_0(z) \text{ for all } z \in \mathcal{Z} \text{ and } x \in \mathcal{X}\big\}.
\]
By the definition of differential privacy, when viewed as a joint mapping from $\mathcal{Z} \times \mathcal{X}$ to $\mathbb{R}$, the conditional p.m.f. $q$ satisfies $\{(z, x) \mapsto q(z \mid x)\} \in \mathcal{F}_\alpha$. Since constant (with respect to $x$) shifts do not change the above integral, we can modify the range of functions in $\mathcal{F}_\alpha$ by subtracting $m_0(z)$ from each, yielding the set
\[
\mathcal{F}'_\alpha := \big\{f \mid f(z, x) \in [0, e^\alpha - 1]\, m_0(z) \text{ for all } z \in \mathcal{Z} \text{ and } x \in \mathcal{X}\big\}.
\]
As a consequence, we find that
\[
\sum_{\nu \in \mathcal{V}} \big(m(z \mid \nu) - \bar{m}(z)\big)^2
\le \sup_{f \in \mathcal{F}_\alpha}\Bigg\{\sum_{\nu \in \mathcal{V}}\Big(\int_{\mathcal{X}} f(z, x)\big(dP_\nu(x) - d\bar{P}(x)\big)\Big)^2\Bigg\}
= \sup_{f \in \mathcal{F}'_\alpha}\Bigg\{\sum_{\nu \in \mathcal{V}}\Big(\int_{\mathcal{X}} f(z, x)\big(dP_\nu(x) - d\bar{P}(x)\big)\Big)^2\Bigg\}.
\]
By inspection, when we divide by $m_0(z)$ and recall the definition of the set $\mathcal{B}_\infty \subset L^\infty(\mathcal{X})$ in the statement of Theorem 2, we obtain
\[
\sum_{\nu \in \mathcal{V}} \big(m(z \mid \nu) - \bar{m}(z)\big)^2 \le \big(m_0(z)\big)^2 (e^\alpha - 1)^2 \sup_{\gamma \in \mathcal{B}_\infty}\sum_{\nu \in \mathcal{V}}\Big(\int_{\mathcal{X}} \gamma(x)\big(dP_\nu(x) - d\bar{P}(x)\big)\Big)^2.
\]
Putting together our bounds, we have
\[
\frac{1}{|\mathcal{V}|}\sum_{\nu \in \mathcal{V}}\Big(D_{\rm kl}\big(M_\nu \,\|\, \bar{M}\big) + D_{\rm kl}\big(\bar{M} \,\|\, M_\nu\big)\Big)
\le (e^\alpha - 1)^2 \sum_{z=1}^k \frac{(m_0(z))^2}{|\mathcal{V}|\, m_0(z)} \sup_{\gamma \in \mathcal{B}_\infty}\sum_{\nu \in \mathcal{V}}\Big(\int_{\mathcal{X}} \gamma(x)\big(dP_\nu(x) - d\bar{P}(x)\big)\Big)^2
\le \frac{(e^\alpha - 1)^2}{|\mathcal{V}|} \sup_{\gamma \in \mathcal{B}_\infty}\sum_{\nu \in \mathcal{V}}\Big(\int_{\mathcal{X}} \gamma(x)\big(dP_\nu(x) - d\bar{P}(x)\big)\Big)^2,
\]
since $\sum_z m_0(z) \le 1$, which is the statement of the theorem.

8.2 Proof of Corollary 4

In the non-interactive setting (2), the marginal distribution $M_\nu^n$ is a product measure, and $Z_i$ is conditionally independent of $Z_{1:i-1}$ given $V$.
Thus, by the chain rule for mutual information [32, Chapter 5] and the fact (as in the proof of Theorem 2) that we may assume w.l.o.g. that $Z$ has finite range,
\[
I(Z_1, \ldots, Z_n; V) = \sum_{i=1}^n I(Z_i; V \mid Z_{1:i-1}) = \sum_{i=1}^n \big[H(Z_i \mid Z_{1:i-1}) - H(Z_i \mid V, Z_{1:i-1})\big].
\]
Since conditioning reduces entropy and $Z_{1:i-1}$ is conditionally independent of $Z_i$ given $V$, we have $H(Z_i \mid Z_{1:i-1}) \le H(Z_i)$ and $H(Z_i \mid V, Z_{1:i-1}) = H(Z_i \mid V)$. In particular, we have
\[
I(Z_1, \ldots, Z_n; V) \le \sum_{i=1}^n I(Z_i; V) = \sum_{i=1}^n \frac{1}{|\mathcal{V}|}\sum_{\nu \in \mathcal{V}} D_{\rm kl}\big(M_{\nu,i} \,\|\, \bar{M}_i\big).
\]
Applying Theorem 2 completes the proof.

9 Proof of Theorem 3

The proof of this theorem combines the techniques we used in the proofs of Theorems 1 and 2: the first handles interactivity, while the techniques used to derive the variational bounds are reminiscent of those used in Theorem 2. Our first step is to note a consequence of the independence structure in Fig. 1 that is essential to our tensorization steps. More precisely, we claim that for any set $S \in \sigma(\mathcal{Z})$,
\[
M_{\pm j}(Z_i \in S \mid z_{1:i-1}) = \int Q(Z_i \in S \mid Z_{1:i-1} = z_{1:i-1}, X_i = x)\, dP_{\pm j, i}(x). \tag{54}
\]
We postpone the proof of this intermediate claim to the end of this section.

Now consider the summed KL divergences. Let $M_{\pm j, i}(\cdot \mid z_{1:i-1})$ denote the conditional distribution of $Z_i$ under $P_{\pm j}$, conditional on $Z_{1:i-1} = z_{1:i-1}$. As in the proof of Corollary 1, the chain rule for KL divergences [e.g., 32, Chapter 5] implies
\[
D_{\rm kl}\big(M^n_{+j} \,\|\, M^n_{-j}\big) = \sum_{i=1}^n \int_{\mathcal{Z}^{i-1}} D_{\rm kl}\big(M_{+j}(\cdot \mid z_{1:i-1}) \,\|\, M_{-j}(\cdot \mid z_{1:i-1})\big)\, dM^{i-1}_{+j}(z_{1:i-1}).
\]
For notational convenience in the remainder of the proof, let us define the symmetrized KL divergence between measures $M$ and $M'$ as $D^{\rm sy}_{\rm kl}(M \,\|\, M') := D_{\rm kl}(M \,\|\, M') + D_{\rm kl}(M' \,\|\, M)$.
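Two devices used repeatedly above, working with a finite range for $Z$ and with symmetrized divergences, rest on the fact that merging cells of a partition never increases KL divergence, so the divergence over a general space is the supremum of its discretizations. A tiny check of this monotonicity (the distributions are made up for illustration):

```python
import math

def kl(p, q):
    # Discrete KL divergence D(p || q); assumes q(z) > 0 wherever p(z) > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sym_kl(p, q):
    # The symmetrized divergence D_sy(p || q) = D(p || q) + D(q || p).
    return kl(p, q) + kl(q, p)

# Coarsening the partition (merging the last two cells) can only decrease
# the divergence, by the log-sum inequality.
p = [0.1, 0.2, 0.3, 0.4]
q = [0.25, 0.25, 0.25, 0.25]
p2 = [p[0], p[1], p[2] + p[3]]
q2 = [q[0], q[1], q[2] + q[3]]
assert kl(p2, q2) <= kl(p, q) + 1e-12
assert sym_kl(p2, q2) <= sym_kl(p, q) + 1e-12
```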
Defining $\bar{P} := 2^{-d}\sum_{\nu \in \mathcal{V}} P^n_\nu$, we have $2\bar{P} = P_{+j} + P_{-j}$ for each $j$ simultaneously. We also introduce $\bar{M}(S) = \int Q(S \mid x_{1:n})\, d\bar{P}(x_{1:n})$, and let $\mathbb{E}_{\pm j}$ denote the expectation taken under the marginals $M_{\pm j}$. We then have
\[
D_{\rm kl}\big(M^n_{+j} \,\|\, M^n_{-j}\big) + D_{\rm kl}\big(M^n_{-j} \,\|\, M^n_{+j}\big)
= \sum_{i=1}^n \Big(\mathbb{E}_{+j}\big[D_{\rm kl}(M_{+j,i}(\cdot \mid Z_{1:i-1}) \,\|\, M_{-j,i}(\cdot \mid Z_{1:i-1}))\big] + \mathbb{E}_{-j}\big[D_{\rm kl}(M_{-j,i}(\cdot \mid Z_{1:i-1}) \,\|\, M_{+j,i}(\cdot \mid Z_{1:i-1}))\big]\Big)
\]
\[
\le \sum_{i=1}^n \Big(\mathbb{E}_{+j}\big[D^{\rm sy}_{\rm kl}(M_{+j,i}(\cdot \mid Z_{1:i-1}) \,\|\, M_{-j,i}(\cdot \mid Z_{1:i-1}))\big] + \mathbb{E}_{-j}\big[D^{\rm sy}_{\rm kl}(M_{+j,i}(\cdot \mid Z_{1:i-1}) \,\|\, M_{-j,i}(\cdot \mid Z_{1:i-1}))\big]\Big)
= 2\sum_{i=1}^n \int_{\mathcal{Z}^{i-1}} D^{\rm sy}_{\rm kl}\big(M_{+j,i}(\cdot \mid z_{1:i-1}) \,\|\, M_{-j,i}(\cdot \mid z_{1:i-1})\big)\, d\bar{M}^{i-1}(z_{1:i-1}),
\]
where we have used the definition of $\bar{M}$ and the fact that $2\bar{P} = P_{+j} + P_{-j}$ for all $j$. Summing over $j \in [d]$ yields
\[
\sum_{j=1}^d D^{\rm sy}_{\rm kl}\big(M^n_{+j} \,\|\, M^n_{-j}\big)
\le 2\sum_{i=1}^n \int_{\mathcal{Z}^{i-1}} \sum_{j=1}^d \underbrace{D^{\rm sy}_{\rm kl}\big(M_{+j,i}(\cdot \mid z_{1:i-1}) \,\|\, M_{-j,i}(\cdot \mid z_{1:i-1})\big)}_{=: T_{j,i}}\, d\bar{M}^{i-1}(z_{1:i-1}). \tag{55}
\]
We bound the terms $T_{j,i}$ appearing in inequality (55). Without loss of generality (as in the proof of Theorem 2), we may assume $\mathcal{Z}$ is finite, and that $\mathcal{Z} = \{1, 2, \ldots, k\}$ for some positive integer $k$. Using the probability mass functions $m_{\pm j, i}$ and omitting the index $i$ when it is clear from context, Lemma 4 implies
\[
T_{j,i} = \sum_{z=1}^k \big(m_{+j}(z \mid z_{1:i-1}) - m_{-j}(z \mid z_{1:i-1})\big)\log\frac{m_{+j}(z \mid z_{1:i-1})}{m_{-j}(z \mid z_{1:i-1})}
\le \sum_{z=1}^k \frac{\big(m_{+j}(z \mid z_{1:i-1}) - m_{-j}(z \mid z_{1:i-1})\big)^2}{\min\{m_{+j}(z \mid z_{1:i-1}),\, m_{-j}(z \mid z_{1:i-1})\}}.
\]
For each fixed $z_{1:i-1}$, define the infimal measure $m_0(z \mid z_{1:i-1}) := \inf_{x \in \mathcal{X}} q(z \mid X_i = x, z_{1:i-1})$.
By construction, we have $\min\{m_{+j}(z \mid z_{1:i-1}),\, m_{-j}(z \mid z_{1:i-1})\} \ge m_0(z \mid z_{1:i-1})$, and hence
\[
T_{j,i} \le \sum_{z=1}^k \frac{\big(m_{+j}(z \mid z_{1:i-1}) - m_{-j}(z \mid z_{1:i-1})\big)^2}{m_0(z \mid z_{1:i-1})}.
\]
Recalling equality (54), we have
\[
m_{+j}(z \mid z_{1:i-1}) - m_{-j}(z \mid z_{1:i-1}) = \int_{\mathcal{X}} q(z \mid x, z_{1:i-1})\big(dP_{+j,i}(x) - dP_{-j,i}(x)\big)
= m_0(z \mid z_{1:i-1}) \int_{\mathcal{X}} \Big(\frac{q(z \mid x, z_{1:i-1})}{m_0(z \mid z_{1:i-1})} - 1\Big)\big(dP_{+j,i}(x) - dP_{-j,i}(x)\big).
\]
From this point, the proof is similar to that of Theorem 2. Define the collection of functions
\[
\mathcal{F}_\alpha := \big\{f : \mathcal{X} \times \mathcal{Z}^i \to [0, e^\alpha - 1]\big\}.
\]
Using the definition of differential privacy, we have $\frac{q(z \mid x, z_{1:i-1})}{m_0(z \mid z_{1:i-1})} \in [1, e^\alpha]$, so there exists $f \in \mathcal{F}_\alpha$ such that
\[
\sum_{j=1}^d T_{j,i} \le \sum_{j=1}^d \sum_{z=1}^k \frac{\big(m_0(z \mid z_{1:i-1})\big)^2}{m_0(z \mid z_{1:i-1})}\Big(\int_{\mathcal{X}} f(x, z, z_{1:i-1})\big(dP_{+j,i}(x) - dP_{-j,i}(x)\big)\Big)^2
= \sum_{z=1}^k m_0(z \mid z_{1:i-1}) \sum_{j=1}^d \Big(\int_{\mathcal{X}} f(x, z, z_{1:i-1})\big(dP_{+j,i}(x) - dP_{-j,i}(x)\big)\Big)^2.
\]
Taking a supremum over $\mathcal{F}_\alpha$, we find the further upper bound
\[
\sum_{j=1}^d T_{j,i} \le \sum_{z=1}^k m_0(z \mid z_{1:i-1}) \sup_{f \in \mathcal{F}_\alpha} \sum_{j=1}^d \Big(\int_{\mathcal{X}} f(x, z, z_{1:i-1})\big(dP_{+j,i}(x) - dP_{-j,i}(x)\big)\Big)^2.
\]
The inner supremum may be taken independently of $z$ and $z_{1:i-1}$, so we rescale by $(e^\alpha - 1)$ to obtain our penultimate inequality
\[
\sum_{j=1}^d D^{\rm sy}_{\rm kl}\big(M_{+j,i}(\cdot \mid z_{1:i-1}) \,\|\, M_{-j,i}(\cdot \mid z_{1:i-1})\big)
\le (e^\alpha - 1)^2 \sum_{z=1}^k m_0(z \mid z_{1:i-1}) \sup_{\gamma \in \mathcal{B}_\infty(\mathcal{X})} \sum_{j=1}^d \Big(\int_{\mathcal{X}} \gamma(x)\big(dP_{+j,i}(x) - dP_{-j,i}(x)\big)\Big)^2.
\]
Noting that $m_0$ sums to a quantity at most 1 and substituting the preceding expression into inequality (55) completes the proof. Finally, we return to prove our intermediate marginalization claim (54).
We have
\[
M_{\pm j}(Z_i \in S \mid z_{1:i-1}) = \int Q(Z_i \in S \mid z_{1:i-1}, x_{1:n})\, dP_{\pm j}(x_{1:n} \mid z_{1:i-1})
\stackrel{(i)}{=} \int Q(Z_i \in S \mid z_{1:i-1}, x_i)\, dP_{\pm j}(x_{1:n} \mid z_{1:i-1})
\stackrel{(ii)}{=} \int Q(Z_i \in S \mid Z_{1:i-1} = z_{1:i-1}, X_i = x)\, dP_{\pm j, i}(x),
\]
where equality (i) follows from the assumed conditional independence structure of $Q$ (recall Figure 1), and equality (ii) is a consequence of the independence of $X_i$ and $Z_{1:i-1}$ under $P_{\pm j}$. That is, we have $P_{+j}(X_i \in S \mid Z_{1:i-1} = z_{1:i-1}) = P_{+j,i}(S)$ by the definition of $P^n_\nu$ as a product measure and the fact that the $P_{\pm j}$ are mixtures of the products $P^n_\nu$.

10 Conclusions

We have linked minimax analysis from statistical decision theory with differential privacy, bringing some of their respective foundational principles into close contact. Our main technique, in the form of the divergence inequalities in Theorems 1 and 2 and their Corollaries 1–4, shows that applying differentially private sampling schemes essentially acts as a contraction on distributions. These contractive inequalities allow us to give sharp minimax rates for estimation in locally private settings, and we think such results may be more generally applicable. With our examples in Sections 4.2, 5.2, and 5.3, we have developed a framework showing that, roughly, if one can construct a family of distributions $\{P_\nu\}$ on the sample space $\mathcal{X}$ that is not well "correlated" with any function $f \in L^\infty(\mathcal{X})$ for which $f(x) \in \{-1, 1\}$, then providing privacy is costly: the contraction provided by Theorems 2 and 3 is strong. By providing sharp convergence rates for many standard statistical estimation procedures under local differential privacy, we have developed and explored some tools that may be used to better understand privacy-preserving statistical inference.
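The contraction phenomenon can be seen concretely for the simplest locally private channel, binary randomized response (report the truth with probability $e^\alpha/(1+e^\alpha)$). The sketch below (distributions chosen by us for illustration) checks that the symmetrized KL divergence between the privatized marginals falls under the $4(e^\alpha - 1)^2\|P_1 - P_2\|_{\rm TV}^2$ form in which Corollary 1 is applied in Section 7.4, and that it is strictly smaller than the divergence between the raw distributions:

```python
import math

def kl_bern(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q).
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

alpha = 0.5
keep = math.exp(alpha) / (1 + math.exp(alpha))  # randomized response: tell truth w.p. keep

def marginal(p):
    # P(Z = 1) when X ~ Bernoulli(p) passes through the randomized-response channel.
    return p * keep + (1 - p) * (1 - keep)

p1, p2 = 0.7, 0.3
tv = abs(p1 - p2)                                   # TV distance between the Bernoullis
m1, m2 = marginal(p1), marginal(p2)
sym_kl_out = kl_bern(m1, m2) + kl_bern(m2, m1)      # divergence after privatization
sym_kl_in = kl_bern(p1, p2) + kl_bern(p2, p1)       # divergence before privatization
bound = 4 * (math.exp(alpha) - 1) ** 2 * tv ** 2
assert sym_kl_out <= bound + 1e-12                  # the contraction bound holds
assert sym_kl_out < sym_kl_in                       # privatization strictly contracts here
```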
We have identified a fundamental continuum along which privacy may be traded for utility in the form of accurate statistical estimates, providing a way to adjust statistical procedures to meet the privacy or utility needs of the statistician and the population being sampled. There are a number of open questions raised by our work. It is natural to wonder whether it is possible to obtain tensorized inequalities of the form of Corollary 4 even for interactive mechanisms. Another important question is whether the results we have provided can be extended to settings in which standard (non-local) differential privacy holds. Such extensions could yield insights into optimal mechanisms for differentially private procedures.

Acknowledgments

We are very thankful to Shuheng Zhou for pointing out errors in Corollaries 1 and 4 in an earlier version of this manuscript. We also thank Guy Rothblum for helpful discussions. JCD was partially supported by a Facebook Graduate Fellowship and an NDSEG fellowship. Our work was supported in part by the U.S. Army Research Office under grant number W911NF-11-1-0391, and Office of Naval Research MURI grant N00014-11-1-0688.

A Proofs of multi-dimensional mean-estimation results

At a high level, our proofs of these results consist of three steps, the first of which is relatively standard, while the second two exploit specific aspects of the local privacy setting. We outline them here:

(1) The first step is a standard reduction, based on inequalities (7)–(9) in Section 2, from an estimation problem to a multi-way testing problem that involves discriminating between indices $\nu$ contained within some subset $\mathcal{V}$ of $\mathbb{R}^d$.
(2) The second step is an appropriate construction of a maximal $\delta$-packing, meaning a set $\mathcal{V} \subset \mathbb{R}^d$ such that each pair of elements is $\delta$-separated and the resulting set is as large as possible. In addition, our arguments require that, for a random variable $V$ uniformly distributed over $\mathcal{V}$, the covariance ${\rm Cov}(V)$ has relatively small operator norm.

(3) The final step is to apply Theorem 2 in order to control the mutual information associated with the testing problem. Doing so requires bounding the supremum in Corollary 4 via the operator norm of ${\rm Cov}(V)$.

The estimation-to-testing reduction of Step 1 was previously described in Section 2. Accordingly, the proofs to follow are devoted to the second and third steps in each case.

A.1 Proof of Proposition 3

We provide a proof of the lower bound, as we provided the argument for the upper bound in Section 4.2.2.

Constructing a good packing: Let $k$ be an arbitrary integer in $\{1, 2, \ldots, d\}$. The following auxiliary result provides a building block for the packing set underlying our proof:

Lemma 6. For each integer $k$, there exists a packing $\mathcal{V}_k$ of the $k$-dimensional hypercube $\{-1, 1\}^k$ with $\|\nu - \nu'\|_1 \ge k/2$ for each $\nu, \nu' \in \mathcal{V}_k$ with $\nu \ne \nu'$, such that $|\mathcal{V}_k| \ge \lceil\exp(k/16)\rceil$ and
\[
\frac{1}{|\mathcal{V}_k|}\sum_{\nu \in \mathcal{V}_k} \nu\nu^\top \preceq 25\, I_{k\times k}.
\]

See Appendix D.2 for the proof. For a given $k \le d$, we extend the set $\mathcal{V}_k \subseteq \mathbb{R}^k$ to a subset of $\mathbb{R}^d$ by setting $\mathcal{V} = \mathcal{V}_k \times \{0\}^{d-k}$. For a parameter $\delta \in (0, 1/2]$ to be chosen, we define a family of probability distributions $\{P_\nu\}_{\nu \in \mathcal{V}}$ constructively. In particular, the random vector $X \sim P_\nu$ (a single observation) is formed by the following procedure: choose an index $j \in \{1, \ldots, k\}$ uniformly at random and set
\[
X = \begin{cases} \phantom{-}r e_j & \text{with probability } \frac{1 + \delta\nu_j}{2} \\ -r e_j & \text{with probability } \frac{1 - \delta\nu_j}{2}. \end{cases} \tag{56}
\]
By construction, these distributions have mean vectors
\[
\theta_\nu := \mathbb{E}_{P_\nu}[X] = \frac{\delta r}{k}\,\nu.
\]
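The mean computation for the scheme (56) can be verified by exhaustively enumerating its $2k$ outcomes. The helper below (names are ours) uses exact rational arithmetic so the check is not subject to floating-point error:

```python
from fractions import Fraction as F

def mean_of_scheme(nu, r, delta, k):
    # Enumerate the 2k outcomes of the sampling scheme (56) and return E[X] exactly:
    # index j is uniform on {1, ..., k}; X = +/- r e_j with probabilities (1 +/- delta nu_j)/2.
    mean = [F(0)] * len(nu)
    for j in range(k):
        p_plus = F(1, k) * (1 + delta * nu[j]) / 2    # X = +r e_j
        p_minus = F(1, k) * (1 - delta * nu[j]) / 2   # X = -r e_j
        mean[j] += r * (p_plus - p_minus)
    return mean

k, r, delta = 4, 1, F(1, 3)
nu = [1, -1, 1, 1] + [0, 0]                 # a vertex of V_k, padded to dimension d = 6
expected = [delta * r * F(v, k) for v in nu]  # the claimed mean (delta r / k) nu
assert mean_of_scheme(nu, r, delta, k) == expected
```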
Consequently, given the properties of the packing $\mathcal{V}$, we have $X \in \mathbb{B}_1(r)$ with probability 1 and $\|\theta_\nu - \theta_{\nu'}\|_2^2 \ge r^2\delta^2/k$. Thus the mean vectors $\{\theta_\nu\}_{\nu \in \mathcal{V}}$ provide us with an $r\delta/\sqrt{k}$-packing of the ball.

Upper bounding the mutual information: Our next step is to bound the mutual information $I(Z_1, \ldots, Z_n; V)$ when the observations $X$ come from the distribution (56) and $V$ is uniform on the set $\mathcal{V}$. We have the following lemma, which applies so long as the channel $Q$ is non-interactive and $\alpha$-locally private (2). See Appendix E.1 for the proof.

Lemma 7. Fix $k \in \{1, \ldots, d\}$. Let $Z_i$ be $\alpha$-locally differentially private for $X_i$, and let $X$ be sampled according to the distribution (56) conditional on $V = \nu$. Then
\[
I(Z_1, \ldots, Z_n; V) \le n\,\frac{25 e^\alpha}{16}\,\frac{\delta^2}{k}\,\big(e^\alpha - e^{-\alpha}\big)^2.
\]

Applying testing inequalities: We now show how a combination of the hypercube packing specified by Lemma 6 and the sampling scheme (56) gives us our desired lower bound. Fix $k \le d$ and let $\mathcal{V} = \mathcal{V}_k \times \{0\}^{d-k}$ be the packing of $\{-1, 1\}^k \times \{0\}^{d-k}$ defined following Lemma 6. Combining Lemma 7 and the fact that the vectors $\theta_\nu$ provide an $r\delta/\sqrt{k}$-packing of cardinality at least $\exp(k/16)$, Fano's inequality implies that for any $k \in \{1, \ldots, d\}$,
\[
\mathfrak{M}_n\big(\theta(\mathcal{P}), \|\cdot\|_2^2, \alpha\big) \ge \frac{r^2\delta^2}{4k}\Bigg(1 - \frac{25\, n e^\alpha \delta^2 (e^\alpha - e^{-\alpha})^2/(16k) + \log 2}{k/16}\Bigg).
\]
Because of the one-dimensional mean-estimation lower bounds provided in Section 3.3.1, we may assume w.l.o.g. that $k \ge 32$. Setting $\delta^2_{n,\alpha,k} = \min\{1, k^2/(50\, n e^\alpha (e^\alpha - e^{-\alpha})^2)\}$, we obtain
\[
\mathfrak{M}_n\big(\theta(\mathcal{P}), \|\cdot\|_2^2, \alpha\big) \ge \frac{r^2\delta^2_{n,\alpha,k}}{4k}\Big(1 - \frac{1}{2} - \frac{\log 2}{2}\Big)
\ge c\, r^2 \min\Big\{\frac{1}{k}, \frac{k}{n e^\alpha (e^\alpha - e^{-\alpha})^2}\Big\}
\]
for a universal (numerical) constant $c$.
Since $e^\alpha(e^\alpha - e^{-\alpha})^2 < 16\alpha^2$ for $\alpha \in [0, 1]$, we obtain the lower bound
\[
\mathfrak{M}_n\big(\theta(\mathcal{P}), \|\cdot\|_2^2, \alpha\big) \ge c\, r^2 \max_{k \in [d]}\Big\{\min\Big\{\frac{1}{k}, \frac{k}{n\alpha^2}\Big\}\Big\}
\]
for $\alpha \in [0, 1]$ and a universal constant $c > 0$. Setting $k$ in the preceding display to be the integer in $\{1, \ldots, d\}$ nearest $\sqrt{n\alpha^2}$ gives the result of the proposition.

A.2 Proof of Proposition 4

Since the upper bound was established in Section 4.2.2, we focus on the lower bound.

Constructing a good packing: In this case, the packing set is very simple: set $\mathcal{V} = \{\pm e_j\}_{j=1}^d$, so that $|\mathcal{V}| = 2d$. Fix some $\delta \in [0, 1]$, and for $\nu \in \mathcal{V}$, define a distribution $P_\nu$ supported on $\mathcal{X} = \{-r, r\}^d$ via
\[
P_\nu(X = x) = \frac{1 + \delta\,\nu^\top x/r}{2^d}.
\]
In words, for $\nu = \pm e_j$, the coordinates of $X$ are independent and uniform on $\{-r, r\}$, except for coordinate $j$, for which $X_j = r$ with probability $1/2 + \delta\nu_j/2$ and $X_j = -r$ with probability $1/2 - \delta\nu_j/2$. With this scheme, we have $\theta(P_\nu) = r\delta\nu$, and since $\|\delta r\nu - \delta r\nu'\|_\infty \ge \delta r$ for $\nu \ne \nu'$, we have constructed a $\delta r$-packing in $\ell_\infty$-norm.

Upper bounding the mutual information: Let $V$ be drawn uniformly from the packing set $\mathcal{V} = \{\pm e_j\}_{j=1}^d$. With the sampling scheme of the previous paragraph, we may provide the following upper bound on the mutual information $I(Z_1, \ldots, Z_n; V)$ for any non-interactive private distribution (2):

Lemma 8. For any non-interactive $\alpha$-differentially private distribution $Q$, we have
\[
I(Z_1, \ldots, Z_n; V) \le n\,\frac{e^\alpha}{4d}\,\big(e^\alpha - e^{-\alpha}\big)^2\,\delta^2.
\]

See Appendix E.2 for a proof.

Applying testing inequalities: Finally, we turn to the application of the testing inequalities. Lemma 8, in conjunction with the standard testing reduction and Fano's inequality (9), implies that
\[
\mathfrak{M}_n\big(\theta(\mathcal{P}), \|\cdot\|_\infty, \alpha\big) \ge \frac{r\delta}{2}\Bigg(1 - \frac{e^\alpha\delta^2 n (e^\alpha - e^{-\alpha})^2/(4d) + \log 2}{\log(2d)}\Bigg).
\]
There is no loss of generality in assuming that $d \ge 2$, in which case the choice
\[
\delta^2 = \min\Big\{1, \frac{d\log(2d)}{e^\alpha(e^\alpha - e^{-\alpha})^2\, n}\Big\}
\]
yields the proposition.

A.3 Proof of Proposition 5

For this proposition, the construction of the packing and the lower bound used in the proof of Proposition 4 also apply. Under these packing and sampling procedures, note that the separation of the points $\theta(P_\nu) = r\delta\nu$ in $\ell_2$-norm is $r\delta$. It thus remains to prove the upper bound. In this case, we use the sampling strategy (26b), as in Proposition 4 and Section 4.2.2, noting that we may take the bound $B$ on $\|Z\|_\infty$ to be $B = c\sqrt{d}\,r/\alpha$ for a constant $c$. Let $\theta^*$ denote the true mean, assumed to be $s$-sparse. Now consider estimating $\theta^*$ by the $\ell_1$-regularized optimization problem
\[
\widehat{\theta} := \mathop{\rm argmin}_{\theta \in \mathbb{R}^d}\Bigg\{\frac{1}{2}\Big\|\frac{1}{n}\sum_{i=1}^n Z_i - \theta\Big\|_2^2 + \lambda\|\theta\|_1\Bigg\}.
\]
Defining the error vector $W = \theta^* - \frac{1}{n}\sum_{i=1}^n Z_i$, we claim that $\lambda \ge 2\|W\|_\infty$ implies that
\[
\|\widehat{\theta} - \theta^*\|_2 \le 3\lambda\sqrt{s}. \tag{57}
\]
This result is a consequence of standard results on sparse estimation (e.g., Negahban et al. [44, Theorem 1 and Corollary 1]). Now we note that if $W_i = \theta^* - Z_i$, then $W = \frac{1}{n}\sum_{i=1}^n W_i$, and by construction of the sampling mechanism (26b) we have $\|W_i\|_\infty \le c\sqrt{d}\,r/\alpha$ for a constant $c$. By Hoeffding's inequality and a union bound, we thus have for some (different) universal constant $c$ that
\[
\mathbb{P}\big(\|W\|_\infty \ge t\big) \le 2d\exp\Big(-\frac{c\, n\alpha^2 t^2}{r^2 d}\Big) \quad \text{for } t \ge 0.
\]
By taking $t^2 = r^2 d(\log(2d) + \epsilon^2)/(c\, n\alpha^2)$, we find that $\|W\|_\infty^2 \le r^2 d(\log(2d) + \epsilon^2)/(c\, n\alpha^2)$ with probability at least $1 - \exp(-\epsilon^2)$, which gives the claimed minimax upper bound with the choice $\lambda = c\sqrt{d\log d/(n\alpha^2)}$ in inequality (57).

A.4 Proof of inequality (30)

We prove the bound by an argument using the private form of Fano's inequality from Corollary 3.
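Reading the quadratic term of the $\ell_1$-regularized problem in Appendix A.3 as $\frac{1}{2}\|\frac{1}{n}\sum_i Z_i - \theta\|_2^2$ (the reading consistent with the condition $\lambda \ge 2\|W\|_\infty$), the problem decouples coordinatewise, and each coordinate is solved by soft-thresholding the noisy mean. A minimal sketch (function names ours), cross-checked against a brute-force grid search on one coordinate:

```python
def soft_threshold(v, lam):
    # Closed form of argmin_t 0.5*(v - t)^2 + lam*|t|.
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

v, lam = 1.5, 1.0
obj = lambda t: 0.5 * (v - t) ** 2 + lam * abs(t)
grid = [i / 1000.0 - 2.0 for i in range(4001)]  # t ranging over [-2, 2]
brute = min(grid, key=obj)
assert abs(soft_threshold(v, lam) - brute) < 1e-2
assert soft_threshold(-0.3, 1.0) == 0.0         # inside the threshold -> exactly zero
```

The thresholding explains why the estimator is exactly sparse: coordinates of the noisy mean smaller than $\lambda$ in magnitude are set to zero.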
The proof makes use of the classical Varshamov–Gilbert bound (e.g., [53, Lemma 4]):

Lemma 9 (Varshamov–Gilbert). There is a packing $\mathcal{V}$ of the $d$-dimensional hypercube $\{-1, 1\}^d$ of size $|\mathcal{V}| \ge \exp(d/8)$ such that $\|\nu - \nu'\|_1 \ge d/2$ for all distinct pairs $\nu, \nu' \in \mathcal{V}$.

Now, let $\delta \in [0, 1]$, and let the distribution $P_\nu$ be a point mass at $\delta\nu/\sqrt{d}$. Then $\theta(P_\nu) = \delta\nu/\sqrt{d}$ and $\|\theta(P_\nu) - \theta(P_{\nu'})\|_2^2 \ge \delta^2$. In addition, a calculation implies that if $M_1$ and $M_2$ are $d$-dimensional ${\rm Laplace}(\kappa)$ distributions with means $\theta_1$ and $\theta_2$, respectively, then
\[
D_{\rm kl}\big(M_1 \,\|\, M_2\big) = \sum_{j=1}^d \big(\exp(-\kappa|\theta_{1,j} - \theta_{2,j}|) + \kappa|\theta_{1,j} - \theta_{2,j}| - 1\big) \le \frac{\kappa^2}{2}\|\theta_1 - \theta_2\|_2^2.
\]
As a consequence, under our Laplacian sampling scheme for the $Z_i$ and with $V$ chosen uniformly from $\mathcal{V}$, we have
\[
I(Z_1, \ldots, Z_n; V) \le \frac{n}{|\mathcal{V}|^2}\sum_{\nu,\nu' \in \mathcal{V}} D_{\rm kl}\big(M_\nu \,\|\, M_{\nu'}\big)
\le \frac{n\alpha^2}{2d}\,\frac{1}{|\mathcal{V}|^2}\sum_{\nu,\nu' \in \mathcal{V}} \big\|(\delta/\sqrt{d})(\nu - \nu')\big\|_2^2 \le \frac{2n\alpha^2\delta^2}{d}.
\]
Now, applying Fano's inequality (9) in the context of the testing inequality (7), we find that
\[
\inf_{\widehat{\theta}}\sup_{\nu \in \mathcal{V}} \mathbb{E}_{P_\nu}\Big[\|\widehat{\theta}(Z_1, \ldots, Z_n) - \theta(P_\nu)\|_2^2\Big] \ge \frac{\delta^2}{4}\Bigg(1 - \frac{2n\alpha^2\delta^2/d + \log 2}{d/8}\Bigg).
\]
Based on our one-dimensional results in Proposition 1, we may assume w.l.o.g. that $d \ge 10$. Taking $\delta^2 = d^2/(48\, n\alpha^2)$ then implies the result (30).

B Proofs of multinomial estimation results

In this section, we prove the lower bounds in Proposition 6. Before proving the bounds, however, we outline our technique, which borrows from that in Section A, and which we also use to prove the lower bounds on density estimation. The outline is as follows:

(1) As in step (1) of Section A, our first step is a standard reduction using the sharper version of Assouad's method (Lemma 1) from estimation to a multiple binary hypothesis testing problem. Specifically, we perform an (essentially standard) reduction of the form (10).
(2) Having constructed appropriately separated binary hypothesis tests, we apply Theorem 3 via inequality (32) to control the testing error in the binary testing problem. Applying the theorem requires bounding certain suprema related to the covariance structure of randomly selected elements of $\mathcal{V} = \{-1, 1\}^d$, as in the arguments in Section A. In this case, though, the symmetry of the binary hypothesis testing problems eliminates the need for the carefully constructed packings of step A(2).

With this outline in mind, we turn to the proofs of inequalities (33) and (34). As we proved the upper bounds in Section 5.2.2, this section focuses on the argument for the lower bound. We provide the full proof for the mean-squared Euclidean error, after which we show how the result for the $\ell_1$-error follows.

Our first step is to provide a lower bound of the form (10), giving a Hamming separation for the squared error. To that end, fix $\delta \in [0, 1]$, and for simplicity, let us assume that $d$ is even. In this case, we set $\mathcal{V} = \{-1, 1\}^{d/2}$, and for $\nu \in \mathcal{V}$ let $P_\nu$ be the multinomial distribution with parameter
\[
\theta_\nu := \frac{1}{d}\mathbf{1} + \frac{\delta}{d}\begin{bmatrix} \nu \\ -\nu \end{bmatrix} \in \Delta_d.
\]
For any estimator $\widehat{\theta}$, by defining $\widehat{\nu}_j = {\rm sign}(\widehat{\theta}_j - 1/d)$ for $j \in [d/2]$, we have the lower bound
\[
\|\widehat{\theta} - \theta_\nu\|_2^2 \ge \frac{\delta^2}{d^2}\sum_{j=1}^{d/2} \mathbf{1}\{\widehat{\nu}_j \ne \nu_j\},
\]
so that by the sharper variant (32) of Assouad's Lemma, we obtain
\[
\max_{\nu \in \mathcal{V}} \mathbb{E}_{P_\nu}\big[\|\widehat{\theta} - \theta_\nu\|_2^2\big]
\ge \frac{\delta^2}{4d}\Bigg[1 - \Bigg(\frac{1}{2d}\sum_{j=1}^{d/2}\Big(D_{\rm kl}\big(M^n_{+j} \,\|\, M^n_{-j}\big) + D_{\rm kl}\big(M^n_{-j} \,\|\, M^n_{+j}\big)\Big)\Bigg)^{\frac{1}{2}}\Bigg]. \tag{58}
\]
Now we apply Theorem 3, which requires bounding sums of integrals $\int \gamma\,(dP_{+j} - dP_{-j})$, where $P_{+j}$ is defined in expression (31). We claim the following inequality:
\[
\sup_{\gamma \in \mathcal{B}_\infty(\mathcal{X})}\sum_{j=1}^{d/2}\Big(\int_{\mathcal{X}} \gamma(x)\big(dP_{+j}(x) - dP_{-j}(x)\big)\Big)^2 \le \frac{8\delta^2}{d}. \tag{59}
\]
Indeed, by construction $P_{+j}$ is the multinomial with parameter $\frac{1}{d}\mathbf{1} + \frac{\delta}{d}\big[e_j^\top\ \ {-e_j^\top}\big]^\top \in \Delta_d$, and similarly for $P_{-j}$, where $e_j \in \{0, 1\}^{d/2}$ denotes the $j$th standard basis vector. Abusing notation and identifying $\gamma$ with vectors $\gamma \in [-1, 1]^d$, we have
\[
\int_{\mathcal{X}} \gamma(x)\big(dP_{+j}(x) - dP_{-j}(x)\big) = \frac{2\delta}{d}\,\gamma^\top\begin{bmatrix} e_j \\ -e_j \end{bmatrix},
\]
whence we find
\[
\sum_{j=1}^{d/2}\Big(\int_{\mathcal{X}} \gamma(x)\big(dP_{+j}(x) - dP_{-j}(x)\big)\Big)^2
= \frac{4\delta^2}{d^2}\,\gamma^\top\sum_{j=1}^{d/2}\begin{bmatrix} e_j \\ -e_j \end{bmatrix}\begin{bmatrix} e_j \\ -e_j \end{bmatrix}^\top\gamma
= \frac{4\delta^2}{d^2}\,\gamma^\top\begin{bmatrix} I & -I \\ -I & I \end{bmatrix}\gamma \le \frac{8\delta^2}{d},
\]
because the operator norm of the matrix is bounded by 2. This gives the claim (59).

Substituting the bound (59) into the bound (58) via Theorem 3, we obtain
\[
\max_{\nu \in \mathcal{V}} \mathbb{E}_{P_\nu}\big[\|\widehat{\theta} - \theta_\nu\|_2^2\big] \ge \frac{\delta^2}{4d}\Big[1 - \big(4n(e^\alpha - 1)^2\delta^2/d^2\big)^{\frac{1}{2}}\Big].
\]
Choosing $\delta^2 = \min\{1, d^2/(16n(e^\alpha - 1)^2)\}$ gives the lower bound
\[
\mathfrak{M}_n\big(\Delta_d, \|\cdot\|_2^2, \alpha\big) \ge \min\Big\{\frac{1}{4d}, \frac{d}{64n(e^\alpha - 1)^2}\Big\}.
\]
To complete the proof, we note that we can prove the preceding lower bound for any even $d_0 \in \{2, \ldots, d\}$; this requires choosing $\nu \in \mathcal{V} = \{-1, 1\}^{d_0/2}$ and constructing the multinomial vectors
\[
\theta_\nu = \frac{1}{d_0}\begin{bmatrix} \mathbf{1}_{d_0} \\ 0_{d-d_0} \end{bmatrix} + \frac{\delta}{d_0}\begin{bmatrix} \nu \\ -\nu \\ 0_{d-d_0} \end{bmatrix} \in \Delta_d,
\]
where $\mathbf{1}_{d_0} = [1\ 1\ \cdots\ 1]^\top \in \mathbb{R}^{d_0}$. Repeating the proof mutatis mutandis gives the bound
\[
\mathfrak{M}_n\big(\Delta_d, \|\cdot\|_2^2, \alpha\big) \ge \max_{d_0 \in \{2, 4, \ldots, 2\lfloor d/2\rfloor\}}\min\Big\{\frac{1}{4d_0}, \frac{d_0}{64n(e^\alpha - 1)^2}\Big\}.
\]
Choosing $d_0$ to be the even integer closest to $\sqrt{n\alpha^2}$ in $\{1, \ldots, d\}$ and noting that $(e^\alpha - 1)^2 \le 3\alpha^2$ for $\alpha \in [0, 1]$ gives the claimed result (33).

In the case of measuring error in the $\ell_1$-norm, the proof is completely identical, except that we have the separation $\|\widehat{\theta} - \theta_\nu\|_1 \ge (\delta/d)\sum_{j=1}^{d/2}\mathbf{1}\{\widehat{\nu}_j \ne \nu_j\}$, and thus inequality (58) holds with the initial multiplier $\delta^2/(4d)$ replaced by $\delta/(4d)$.
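The operator-norm claim for the block matrix $M = \big[\begin{smallmatrix} I & -I \\ -I & I \end{smallmatrix}\big]$ follows from the identity $M^2 = 2M$, which forces every eigenvalue of the symmetric matrix $M$ into $\{0, 2\}$. A small check of the identity (the block size $k$ is illustrative):

```python
def block_matrix(k):
    # The 2k x 2k matrix [[I, -I], [-I, I]] appearing in the display above.
    n = 2 * k
    M = [[0] * n for _ in range(n)]
    for i in range(k):
        M[i][i] = 1
        M[k + i][k + i] = 1
        M[i][k + i] = -1
        M[k + i][i] = -1
    return M

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)] for i in range(n)]

k = 3
M = block_matrix(k)
# M^2 = 2M, so eigenvalues are 0 or 2 and the operator norm is exactly 2.
assert matmul(M, M) == [[2 * e for e in row] for row in M]
```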
Parallel reasoning to the $\ell_2^2$ case then gives the minimax lower bound
\[
\mathfrak{M}_n\big(\Delta_d, \|\cdot\|_1, \alpha\big) \ge \frac{\delta}{4d_0}\Big[1 - \big(4n(e^\alpha - 1)^2\delta^2/d_0^2\big)^{\frac{1}{2}}\Big]
\]
for any even $d_0 \in \{2, \ldots, d\}$. Choosing $\delta^2 = \min\{1, d_0^2/(16n(e^\alpha - 1)^2)\}$ gives the claim (34).

C Proofs of density estimation results

In this section, we provide the proofs of the results stated in Section 5.3 on density estimation. We defer the proofs of more technical results to later appendices. Throughout all proofs, we use $c$ to denote a universal constant whose value may change from line to line.

Figure 3. Panel (a): illustration of the 1-Lipschitz continuous bump function $g_1$ used to pack $\mathcal{F}_\beta$ when $\beta = 1$. Panel (b): bump function $g_2$ with $|g_2''(x)| \le 1$ used to pack $\mathcal{F}_\beta$ when $\beta = 2$.

C.1 Proof of Proposition 7

As with our proof for multinomial estimation, the argument follows the general outline described at the beginning of Section B. We remark that our proof is based on an explicit construction of densities identified with corners of the hypercube, a more classical approach than the global metric entropy approach of Yang and Barron [52] (cf. [53]). We use the local packing approach since it is better suited to the privacy constraints and information contractions that we have developed. In comparison with our proofs of previous propositions, the construction of a suitable packing of $\mathcal{F}_\beta$ is somewhat more challenging: the identification of densities with finite-dimensional vectors, which we require for our application of Theorem 3, is not immediately obvious. In all cases, we guarantee that our density functions $f$ belong to the trigonometric Sobolev space, so we may work directly with smooth density functions $f$.
Constructing well-separated densities: We begin by describing a standard framework for defining local packings of density functions. Let $g_\beta : [0, 1] \to \mathbb{R}$ be a function satisfying the following properties:

(a) The function $g_\beta$ is $\beta$-times differentiable with $0 = g_\beta^{(i)}(0) = g_\beta^{(i)}(1/2) = g_\beta^{(i)}(1)$ for all $i < \beta$.

(b) The function $g_\beta$ is centered with $\int_0^1 g_\beta(x)\, dx = 0$, and there exist constants $c, c_{1/2} > 0$ such that
\[
\int_0^{1/2} g_\beta(x)\, dx = -\int_{1/2}^1 g_\beta(x)\, dx = c_{1/2} \quad \text{and} \quad \int_0^1 \big(g_\beta^{(i)}(x)\big)^2\, dx \ge c \quad \text{for all } i < \beta.
\]

(c) The function $g_\beta$ is non-negative on $[0, 1/2]$ and non-positive on $[1/2, 1]$, and Lebesgue measure is absolutely continuous with respect to the measures $G_j$, $j = 1, 2$, given by
\[
G_1(A) = \int_{A \cap [0, 1/2]} g_\beta(x)\, dx \quad \text{and} \quad G_2(A) = -\int_{A \cap [1/2, 1]} g_\beta(x)\, dx. \tag{60}
\]

(d) Lastly, for almost every $x \in [0, 1]$, we have $|g_\beta^{(\beta)}(x)| \le 1$ and $|g_\beta(x)| \le 1$.

As illustrated in Figure 3, the functions $g_\beta$ are smooth "bump" functions.

Fix a positive integer $k$ (to be specified in the sequel). Our first step is to construct a family of "well-separated" densities for which we can reduce the density estimation problem to one of identifying corners of a hypercube, which allows application of Lemma 1. Specifically, we must exhibit a condition similar to the separation condition (10). For each $j \in \{1, \ldots, k\}$, define the function
\[
g_{\beta,j}(x) := \frac{1}{k^\beta}\, g_\beta\Big(k\Big(x - \frac{j-1}{k}\Big)\Big)\,\mathbf{1}\Big\{x \in \Big[\frac{j-1}{k}, \frac{j}{k}\Big]\Big\}.
\]
Based on this definition, we define the family of densities
\[
\Big\{f_\nu := 1 + \sum_{j=1}^k \nu_j g_{\beta,j} \ \text{ for } \nu \in \mathcal{V}\Big\} \subseteq \mathcal{F}_\beta. \tag{61}
\]
It is a standard fact [53, 49] that for any $\nu \in \mathcal{V}$, the function $f_\nu$ is $\beta$-times differentiable and satisfies $|f_\nu^{(\beta)}(x)| \le 1$ for all $x$.
Now, based on some density $f \in \mathcal{F}_\beta$, let us define the sign vector $v(f) \in \{-1,1\}^k$ to have entries
\[
v_j(f) := \mathop{\rm argmin}_{s \in \{-1,1\}} \int_{[\frac{j-1}{k}, \frac{j}{k}]} \big(f(x) - s\, g_{\beta,j}(x)\big)^2\,dx.
\]
Then by construction of the $g_\beta$ and $v$, we have for a numerical constant $c$ (whose value may depend on $\beta$) that
\[
\|f - f_\nu\|_2^2 \ge c \sum_{j=1}^k 1\{v_j(f) \neq \nu_j\} \int_{[\frac{j-1}{k}, \frac{j}{k}]} (g_{\beta,j}(x))^2\,dx
= \frac{c}{k^{2\beta+1}} \sum_{j=1}^k 1\{v_j(f) \neq \nu_j\}.
\]
By inspection, this is the Hamming separation required in inequality (10), whence the sharper version (32) of Assouad's Lemma 1 gives the result
\[
\mathcal{M}_n\big(\mathcal{F}_\beta[1], \|\cdot\|_2^2, \alpha\big) \ge \frac{c}{k^{2\beta}}
\Bigg[1 - \Bigg(\frac{1}{4k} \sum_{j=1}^k \Big(D_{\rm kl}\big(M^n_{+j} \,\|\, M^n_{-j}\big) + D_{\rm kl}\big(M^n_{-j} \,\|\, M^n_{+j}\big)\Big)\Bigg)^{1/2}\Bigg], \tag{62}
\]
where we have defined $P_{\pm j}$ to be the probability distribution associated with the averaged densities $f_{\pm j} = 2^{1-k} \sum_{\nu : \nu_j = \pm 1} f_\nu$.

Applying divergence inequalities: Now we must control the summed KL divergences. To do so, we note that by the construction (61), symmetry implies that
\[
f_{+j} = 1 + g_{\beta,j} \quad \mbox{and} \quad f_{-j} = 1 - g_{\beta,j} \quad \mbox{for each } j \in [k]. \tag{63}
\]
We then obtain the following result, which bounds the averaged KL divergences.

Lemma 10. For any $\alpha$-locally private conditional distribution $Q$, the summed KL divergences are bounded as
\[
\sum_{j=1}^k \Big(D_{\rm kl}\big(M^n_{+j} \,\|\, M^n_{-j}\big) + D_{\rm kl}\big(M^n_{-j} \,\|\, M^n_{+j}\big)\Big) \le \frac{4 c_{1/2}^2\, n (e^\alpha - 1)^2}{k^{2\beta+1}}.
\]

The proof of this lemma is fairly involved, so we defer it to Appendix E.3. We note that, for $\alpha \le 1$, we have $(e^\alpha - 1)^2 \le 3\alpha^2$, so we may replace the bound in Lemma 10 with the quantity $c n \alpha^2 / k^{2\beta+1}$ for a constant $c$. We remark that standard divergence bounds using Assouad's lemma [53, 49] provide a bound of roughly $n/k^{2\beta}$; our bound is thus essentially a factor of the "dimension" $k$ tighter. The remainder of the proof is an application of inequality (62).
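The remark that $(e^\alpha - 1)^2 \le 3\alpha^2$ for $\alpha \le 1$ is easy to sanity-check numerically. The snippet below is a minimal grid check of our own (not a substitute for the calculus argument); since $(e^a - 1)/a$ is increasing in $a$, the worst case is $\alpha = 1$, where the ratio equals $(e - 1)^2 \approx 2.95 < 3$.

```python
import math

# Check (e^a - 1)^2 <= 3 a^2 on a fine grid of alpha in (0, 1].
# The ratio (e^a - 1)^2 / a^2 is increasing in a, so the maximum
# over the grid sits at alpha = 1, where it equals (e - 1)^2.
worst_ratio = 0.0
for i in range(1, 10_001):
    a = i / 10_000
    worst_ratio = max(worst_ratio, (math.exp(a) - 1) ** 2 / a ** 2)
```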
In particular, by applying Lemma 10, we find that for any $\alpha$-locally private channel $Q$, there are constants $c_0, c_1$ (whose values may depend on $\beta$) such that
\[
\mathcal{M}_n\big(\mathcal{F}_\beta, \|\cdot\|_2^2, Q\big) \ge \frac{c_0}{k^{2\beta}}
\left[1 - \left(\frac{c_1 n \alpha^2}{k^{2\beta+2}}\right)^{1/2}\right].
\]
Choosing $k_{n,\alpha,\beta} = \big(4 c_1 n \alpha^2\big)^{\frac{1}{2\beta+2}}$ ensures that the quantity inside the parentheses is at least $1/2$. Substituting for $k$ in the preceding display proves the proposition.

C.2 Proof of Proposition 8

Note that the operator $\Pi_k$ performs a Euclidean projection of the vector $(k/n) \sum_{i=1}^n Z_i$ onto the scaled probability simplex, thus projecting $\hat{f}$ onto the set of probability densities. Given the non-expansivity of Euclidean projection, this operation can only decrease the error $\|\hat{f} - f\|_2^2$. Consequently, it suffices to bound the error of the unprojected estimator; to reduce notational overhead we retain our previous notation of $\hat{\theta}$ for the unprojected version. Using this notation, we have
\[
\mathbb{E}\big[\|\hat{f} - f\|_2^2\big] \le \sum_{j=1}^k \mathbb{E}_f\Bigg[\int_{\frac{j-1}{k}}^{\frac{j}{k}} (f(x) - \hat{\theta}_j)^2\,dx\Bigg].
\]
By expanding this expression and noting that the independent noise variables $W_{ij} \sim \mathop{\rm Laplace}(\alpha/2)$ have zero mean, we obtain
\[
\mathbb{E}\big[\|\hat{f} - f\|_2^2\big]
\le \sum_{j=1}^k \mathbb{E}_f\Bigg[\int_{\frac{j-1}{k}}^{\frac{j}{k}} \Big(f(x) - \frac{k}{n}\sum_{i=1}^n [e_k(X_i)]_j\Big)^2 dx\Bigg]
+ \sum_{j=1}^k \int_{\frac{j-1}{k}}^{\frac{j}{k}} \mathbb{E}\Bigg[\Big(\frac{k}{n}\sum_{i=1}^n W_{ij}\Big)^2\Bigg] dx
= \sum_{j=1}^k \int_{\frac{j-1}{k}}^{\frac{j}{k}} \mathbb{E}_f\Bigg[\Big(f(x) - \frac{k}{n}\sum_{i=1}^n [e_k(X_i)]_j\Big)^2\Bigg] dx + \frac{4k^2}{n\alpha^2}. \tag{64}
\]
Next we bound the error term inside the expectation (64). Defining $p_j := P_f(X \in \mathcal{X}_j) = \int_{\mathcal{X}_j} f(x)\,dx$, we have
\[
k\,\mathbb{E}_f\big[[e_k(X)]_j\big] = k p_j = k \int_{\mathcal{X}_j} f(x)\,dx \in \Big[f(x) - \frac{1}{k},\, f(x) + \frac{1}{k}\Big] \quad \mbox{for any } x \in \mathcal{X}_j,
\]
by the Lipschitz continuity of $f$. Thus, expanding the bias and variance of the integrated expectation above, we find that
\[
\mathbb{E}_f\Bigg[\Big(f(x) - \frac{k}{n}\sum_{i=1}^n [e_k(X_i)]_j\Big)^2\Bigg]
\le \frac{1}{k^2} + \mathop{\rm Var}\Big(\frac{k}{n}\sum_{i=1}^n [e_k(X_i)]_j\Big).
\]
Expanding the variance term,
\[
\frac{1}{k^2} + \mathop{\rm Var}\Big(\frac{k}{n}\sum_{i=1}^n [e_k(X_i)]_j\Big)
= \frac{1}{k^2} + \frac{k^2}{n} \mathop{\rm Var}\big([e_k(X)]_j\big)
= \frac{1}{k^2} + \frac{k^2}{n}\, p_j (1 - p_j).
\]
Recalling the inequality (64), we obtain
\[
\mathbb{E}_f\big[\|\hat{f} - f\|_2^2\big]
\le \sum_{j=1}^k \int_{\frac{j-1}{k}}^{\frac{j}{k}} \Big(\frac{1}{k^2} + \frac{k^2}{n} p_j(1 - p_j)\Big) dx + \frac{4k^2}{n\alpha^2}
= \frac{1}{k^2} + \frac{4k^2}{n\alpha^2} + \frac{k}{n} \sum_{j=1}^k p_j(1 - p_j).
\]
Since $\sum_{j=1}^k p_j = 1$, we find that
\[
\mathbb{E}_f\big[\|\hat{f} - f\|_2^2\big] \le \frac{1}{k^2} + \frac{4k^2}{n\alpha^2} + \frac{k}{n},
\]
and choosing $k = (n\alpha^2)^{1/4}$ yields the claim.

C.3 Proof of Proposition 9

We begin by fixing $k \in \mathbb{N}$; we will optimize the choice of $k$ shortly. Recall that, since $f \in \mathcal{F}_\beta[C]$, we have $f = \sum_{j=1}^\infty \theta_j \varphi_j$ for $\theta_j = \int f \varphi_j$. Thus we may define $\bar{Z}_j = \frac{1}{n} \sum_{i=1}^n Z_{i,j}$ for each $j \in \{1, \ldots, k\}$, and we have
\[
\|\hat{f} - f\|_2^2 = \sum_{j=1}^k (\theta_j - \bar{Z}_j)^2 + \sum_{j=k+1}^\infty \theta_j^2.
\]
Since $f \in \mathcal{F}_\beta[C]$, we are guaranteed that $\sum_{j=1}^\infty j^{2\beta} \theta_j^2 \le C^2$, and hence
\[
\sum_{j > k} \theta_j^2 = \sum_{j > k} \frac{j^{2\beta} \theta_j^2}{j^{2\beta}} \le \frac{1}{k^{2\beta}} \sum_{j > k} j^{2\beta} \theta_j^2 \le \frac{C^2}{k^{2\beta}}.
\]
For the indices $j \le k$, we note that by assumption, $\mathbb{E}[Z_{i,j}] = \int \varphi_j f = \theta_j$, and since $|Z_{i,j}| \le B$, we have
\[
\mathbb{E}\big[(\theta_j - \bar{Z}_j)^2\big] = \frac{1}{n} \mathop{\rm Var}(Z_{1,j}) \le \frac{B^2}{n} = \frac{B_0^2 c_k k}{n} \Big(\frac{e^\alpha + 1}{e^\alpha - 1}\Big)^2,
\]
where $c_k = \Omega(1)$ is the constant in expression (43). Putting together the pieces, the mean-squared $L^2$-error is upper bounded as
\[
\mathbb{E}_f\big[\|\hat{f} - f\|_2^2\big] \le c \Big(\frac{k^2}{n\alpha^2} + \frac{1}{k^{2\beta}}\Big),
\]
where $c$ is a constant depending on $B_0$, $c_k$, and $C$. Choose $k = (n\alpha^2)^{1/(2\beta+2)}$ to complete the proof.

C.4 Insufficiency of Laplace noise for density estimation

Finally, we consider the insufficiency of standard Laplace noise addition for estimation in the setting of this section. Consider the vector $[\varphi_j(X_i)]_{j=1}^k \in [-B_0, B_0]^k$. To make this vector $\alpha$-differentially private by adding an independent Laplace noise vector $W \in \mathbb{R}^k$, we must take $W_j \sim \mathop{\rm Laplace}(\alpha/(B_0 k))$.
The natural orthogonal series estimator [e.g., 51] is to take $Z_i = [\varphi_j(X_i)]_{j=1}^k + W_i$, where the $W_i \in \mathbb{R}^k$ are independent Laplace noise vectors. We then use the density estimator (44), except that we use the Laplace-perturbed $Z_i$. However, this estimator suffers the following drawback:

Observation 1. Let $\hat{f} = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^k Z_{i,j} \varphi_j$, where the $Z_i$ are the Laplace-perturbed vectors of the previous paragraph. Assume the orthonormal basis $\{\varphi_j\}$ of $L^2([0,1])$ contains the constant function. Then there is a constant $c$ such that for any $k \in \mathbb{N}$, there is an $f \in \mathcal{F}_\beta[2]$ such that
\[
\mathbb{E}_f\big[\|f - \hat{f}\|_2^2\big] \ge c\, (n\alpha^2)^{-\frac{2\beta}{2\beta+3}}.
\]

Proof. We begin by noting that for $f = \sum_j \theta_j \varphi_j$, by definition of $\hat{f} = \sum_j \hat{\theta}_j \varphi_j$ we have
\[
\mathbb{E}\big[\|f - \hat{f}\|_2^2\big] = \sum_{j=1}^k \mathbb{E}\big[(\theta_j - \hat{\theta}_j)^2\big] + \sum_{j \ge k+1} \theta_j^2
= \sum_{j=1}^k \frac{B_0^2 k^2}{n\alpha^2} + \sum_{j \ge k+1} \theta_j^2
= \frac{B_0^2 k^3}{n\alpha^2} + \sum_{j \ge k+1} \theta_j^2.
\]
Without loss of generality, let us assume $\varphi_1 \equiv 1$ is the constant function. Then $\int \varphi_j = 0$ for all $j > 1$, and by defining the true function $f = \varphi_1 + (k+1)^{-\beta} \varphi_{k+1}$, we have $f \in \mathcal{F}_\beta[2]$ and $\int f = 1$, and moreover,
\[
\mathbb{E}\big[\|f - \hat{f}\|_2^2\big] \ge \frac{B_0^2 k^3}{n\alpha^2} + (k+1)^{-2\beta} \ge C_{\beta, B_0}\, (n\alpha^2)^{-\frac{2\beta}{2\beta+3}},
\]
where $C_{\beta,B_0}$ is a constant depending on $\beta$ and $B_0$. This final lower bound follows by minimizing over all $k$. (If $(k+1)^{-\beta} B_0 > 1$, we can rescale $\varphi_{k+1}$ by $B_0$ to achieve the same result and guarantee that $f \ge 0$.)

This lower bound shows that standard estimators based on adding Laplace noise to appropriate basis expansions of the data fail: there is a degradation in rate from $n^{-\frac{2\beta}{2\beta+2}}$ to $n^{-\frac{2\beta}{2\beta+3}}$. While this is not a formal proof that no approach based on Laplace perturbation can provide optimal convergence rates in our setting, it does suggest that finding such an estimator is non-trivial.
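In contrast with the Laplace-perturbation failure just described, the bin-indicator mechanism of Proposition 8 (Section C.2) is simple enough to simulate directly. The sketch below uses our own illustrative choices (uniform true density, $n = 20{,}000$, $\alpha = 1$), reads $\mathop{\rm Laplace}(\alpha/2)$ as noise with density proportional to $\exp(-\alpha|w|/2)$, i.e. scale $2/\alpha$, and omits the simplex projection $\Pi_k$, which can only decrease the error.

```python
import numpy as np

def private_histogram(samples, k, alpha, rng):
    """Locally private histogram density estimate on [0, 1]: each sample
    is encoded as a one-hot bin indicator e_k(X_i), perturbed with
    independent Laplace noise of scale 2/alpha, then averaged and
    scaled by k to give a piecewise-constant density estimate.
    (Simplex projection is skipped; it only reduces the L2 error.)"""
    n = len(samples)
    bins = np.clip((samples * k).astype(int), 0, k - 1)
    one_hot = np.zeros((n, k))
    one_hot[np.arange(n), bins] = 1.0
    noisy = one_hot + rng.laplace(scale=2.0 / alpha, size=(n, k))
    return k * noisy.mean(axis=0)

rng = np.random.default_rng(0)
n, alpha = 20_000, 1.0
k = max(1, int(round((n * alpha ** 2) ** 0.25)))  # bin choice from C.2
X = rng.uniform(0.0, 1.0, size=n)                 # true density f = 1
theta = private_histogram(X, k, alpha, rng)
# Squared L2 error of the piecewise-constant estimate against f = 1:
sq_err = float(np.sum((theta - 1.0) ** 2) / k)
```

With the bin choice $k \approx (n\alpha^2)^{1/4}$ from the proof, the dominant error term is the privacy noise contribution of order $k^2/(n\alpha^2)$, which is small at this sample size.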
D Packing set constructions

In this appendix, we collect proofs of the constructions of our packing sets.

D.1 Proof of Lemma 5

By the Varshamov-Gilbert bound [e.g., 53, Lemma 4], there is a packing $\mathcal{H}_d$ of the $d$-dimensional hypercube $\{-1,1\}^d$ of size $|\mathcal{H}_d| \ge \exp(d/8)$ satisfying $\|u - v\|_1 \ge d/2$ for all $u, v \in \mathcal{H}_d$ with $u \neq v$. For each $u \in \mathcal{H}_d$, set $\nu_u = u/\sqrt{d}$, so that $\|\nu_u\|_2 = 1$ and $\|\nu_u - \nu_v\|_2^2 \ge d/d = 1$ for $u \neq v \in \mathcal{H}_d$. Setting $\mathcal{V} = \{\nu_u \mid u \in \mathcal{H}_d\}$ gives the desired result.

D.2 Proof of Lemma 6

We use the probabilistic method [2], showing that for random draws from the Boolean hypercube, a collection of vectors as claimed in the lemma exists with positive probability. Consider a set of $N$ vectors $\nu^i \in \{-1,1\}^k$ sampled uniformly at random from the Boolean hypercube, and for a fixed $t > 0$, define the two "bad" events
\[
B_1 := \Big\{\exists\, i \neq j \ \Big|\ \|\nu^i - \nu^j\|_1 < k/2\Big\}
\quad \mbox{and} \quad
B_2(t) := \Bigg\{\frac{1}{N} \sum_{i=1}^N \nu^i (\nu^i)^\top \not\preceq (t+1) I_{k \times k}\Bigg\}.
\]
We begin by analyzing $B_1$. Letting $\{W_\ell\}_{\ell=1}^k$ denote a sequence of i.i.d. Bernoulli $\{0,1\}$ variables, for any $i \neq j$, the event $\{\|\nu^i - \nu^j\|_1 < k/2\}$ is equivalent to the event $\{\sum_{\ell=1}^k W_\ell < k/4\}$. Consequently, by combining the union bound with the Hoeffding bound, we find
\[
P(B_1) \le \binom{N}{2} P\big(\|\nu^i - \nu^j\|_1 < k/2\big) \le \binom{N}{2} \exp(-k/8). \tag{65}
\]
Turning to the event $B_2(t)$, we have $\frac{1}{N} \sum_{i=1}^N \nu^i(\nu^i)^\top \not\preceq (t+1) I_{k \times k}$ if and only if the maximum eigenvalue $\lambda_{\max}\big(\frac{1}{N}\sum_{i=1}^N \nu^i(\nu^i)^\top - I_{k \times k}\big)$ is larger than $t$. Using sharp versions of the Ahlswede-Winter inequalities [1] (see Corollary 4.2 in the paper [42]), we obtain
\[
P(B_2(t)) \le k \exp\Big(-\frac{N t^2}{k^2}\Big). \tag{66}
\]
Finally, combining the union bound with inequalities (65) and (66), we find that
\[
P\big(B_1 \cup B_2(t)\big) \le \frac{N(N-1)}{2} \exp(-k/8) + k \exp\Big(-\frac{N t^2}{k^2}\Big).
\]
By inspection, if we choose $t = 24$ and $N = \lceil \exp(k/16) \rceil$, the above bound is strictly less than 1, so a packing satisfying the constraints must exist.

E Information bounds

In this appendix, we collect the proofs of lemmas providing mutual information and KL-divergence bounds.

E.1 Proof of Lemma 7

Our strategy is to apply Theorem 2 to bound the mutual information. Without loss of generality, we may assume that $r = 1$, so the set $\mathcal{X} = \{\pm e_j\}_{j=1}^k$, where $e_j \in \mathbb{R}^d$. Thus, under the notation of Theorem 2, we may identify vectors $\gamma \in L^\infty(\mathcal{X})$ with vectors $\gamma \in \mathbb{R}^{2k}$. If we define $\bar{\nu} = \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} \nu$ to be the mean element of the packing set, the linear functional $\varphi_\nu$ defined in Theorem 2 is
\[
\varphi_\nu(\gamma)
= \frac{1}{2k} \Bigg[\sum_{j=1}^k \gamma(e_j) \frac{1 + \nu_j \delta}{2} + \sum_{j=1}^k \gamma(-e_j) \frac{1 - \nu_j \delta}{2}\Bigg]
- \frac{1}{2k} \Bigg[\sum_{j=1}^k \gamma(e_j) \frac{1 + \bar{\nu}_j \delta}{2} + \sum_{j=1}^k \gamma(-e_j) \frac{1 - \bar{\nu}_j \delta}{2}\Bigg]
= \frac{1}{2k} \sum_{j=1}^k \Big[\frac{\delta}{2} \gamma(e_j)(\nu_j - \bar{\nu}_j) - \frac{\delta}{2} \gamma(-e_j)(\nu_j - \bar{\nu}_j)\Big]
= \frac{\delta}{4k}\, \gamma^\top \begin{bmatrix} I_{k \times k} & 0_{k \times (d-k)} \\ -I_{k \times k} & 0_{k \times (d-k)} \end{bmatrix} (\nu - \bar{\nu}).
\]
Define the matrix
\[
A := \begin{bmatrix} I_{k \times k} & 0_{k \times (d-k)} \\ -I_{k \times k} & 0_{k \times (d-k)} \end{bmatrix} \in \{-1, 0, 1\}^{2k \times d}.
\]
Then we have that
\[
\frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} \varphi_\nu(\gamma)^2
= \frac{\delta^2}{(4k)^2}\, \gamma^\top A \Bigg(\frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} (\nu - \bar{\nu})(\nu - \bar{\nu})^\top\Bigg) A^\top \gamma
= \frac{\delta^2}{(4k)^2}\, \gamma^\top A \Bigg(\frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} \nu\nu^\top - \bar{\nu}\bar{\nu}^\top\Bigg) A^\top \gamma
\le \frac{\delta^2}{(4k)^2}\, \gamma^\top A \Bigg(\frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} \nu\nu^\top\Bigg) A^\top \gamma
\le \frac{25\,\delta^2}{16 k^2}\, \gamma^\top A A^\top \gamma
= \Big(\frac{5\delta}{4k}\Big)^2 \gamma^\top \begin{bmatrix} I_{k\times k} & -I_{k\times k} \\ -I_{k\times k} & I_{k\times k} \end{bmatrix} \gamma. \tag{67}
\]
Here the final inequality used our assumption on the sum of outer products in $\mathcal{V}$.

We complete our proof using the bound (67). The operator norm of the matrix specified in (67) is 2. As a consequence, since we have the containment
\[
B_\infty = \big\{\gamma \in \mathbb{R}^{2k} : \|\gamma\|_\infty \le 1\big\} \subset \big\{\gamma \in \mathbb{R}^{2k} : \|\gamma\|_2^2 \le 2k\big\},
\]
we have the inequality
\[
\sup_{\gamma \in B_\infty} \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} \varphi_\nu(\gamma)^2
\le \frac{25\,\delta^2}{16 k^2} \cdot 2 \cdot 2k = \frac{25\,\delta^2}{4k}.
\]
Applying Theorem 2 completes the proof.
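Returning briefly to the packing construction of Appendix D: the probabilistic argument in the proof of Lemma 6 is non-constructive, but it suggests a direct computational search, resampling until neither bad event occurs. The sketch below does this for an illustrative small $k$ (our own choice), with $t = 24$ and $N = \lceil \exp(k/16) \rceil$ as in the proof; the retry cap is also our assumption.

```python
import numpy as np

def find_packing(k, t=24, max_tries=1000, seed=0):
    """Search for N = ceil(exp(k/16)) sign vectors in {-1, 1}^k whose
    pairwise ell_1 distances are at least k/2 and whose empirical
    second moment (1/N) sum nu nu^T has largest eigenvalue at most
    t + 1, mirroring the bad events B_1 and B_2(t) of Lemma 6."""
    rng = np.random.default_rng(seed)
    N = int(np.ceil(np.exp(k / 16)))
    for _ in range(max_tries):
        V = rng.choice([-1.0, 1.0], size=(N, k))
        dists = np.abs(V[:, None, :] - V[None, :, :]).sum(axis=2)
        np.fill_diagonal(dists, np.inf)
        if dists.min() < k / 2:
            continue  # bad event B_1 occurred; resample
        if np.linalg.eigvalsh(V.T @ V / N).max() > t + 1:
            continue  # bad event B_2(t) occurred; resample
        return V
    raise RuntimeError("no packing found within max_tries")

V = find_packing(k=32)  # N = ceil(exp(2)) = 8 vectors in {-1, 1}^32
```

For moderate $k$ the bad events are rare, so the search typically succeeds on the first draw; the point of the lemma's careful union bound is that success probability stays positive even as $N$ grows exponentially in $k$.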
E.2 Proof of Lemma 8

It is no loss of generality to assume the radius $r = 1$. We use the notation of Theorem 2, recalling the linear functionals $\varphi_\nu : L^\infty(\mathcal{X}) \to \mathbb{R}$. Because the set $\mathcal{X} = \{-1,1\}^d$, we can identify vectors $\gamma \in L^\infty(\mathcal{X})$ with vectors $\gamma \in \mathbb{R}^{2^d}$. Moreover, we have (by construction) that
\[
\varphi_\nu(\gamma) = \sum_{x \in \{-1,1\}^d} \gamma(x) p_\nu(x) - \sum_{x \in \{-1,1\}^d} \gamma(x) p(x)
= \frac{1}{2^d} \sum_{x \in \mathcal{X}} \gamma(x) \big(1 + \delta \nu^\top x - 1\big)
= \frac{\delta}{2^d} \sum_{x \in \mathcal{X}} \gamma(x)\, \nu^\top x.
\]
For each $\nu \in \mathcal{V}$, we may construct a vector $u_\nu \in \{-1,1\}^{2^d}$, indexed by $x \in \{-1,1\}^d$, with
\[
u_\nu(x) = \nu^\top x = \begin{cases} 1 & \mbox{if } \nu = \pm e_j \mbox{ and } \mathop{\rm sign}(\nu_j) = \mathop{\rm sign}(x_j) \\ -1 & \mbox{if } \nu = \pm e_j \mbox{ and } \mathop{\rm sign}(\nu_j) \neq \mathop{\rm sign}(x_j). \end{cases}
\]
For $\nu = e_j$, we see that $u_{e_1}, \ldots, u_{e_d}$ are the first $d$ columns of the standard Hadamard transform matrix (and the $u_{-e_j}$ are their negatives). Then we have $\sum_{x \in \mathcal{X}} \gamma(x)\, \nu^\top x = \gamma^\top u_\nu$, so that $\varphi_\nu(\gamma) = \frac{\delta}{2^d} \gamma^\top u_\nu$ and $\varphi_\nu(\gamma)^2 = \frac{\delta^2}{4^d} \gamma^\top u_\nu u_\nu^\top \gamma$. Note also that $u_\nu u_\nu^\top = u_{-\nu} u_{-\nu}^\top$, and as a consequence we have
\[
\sum_{\nu \in \mathcal{V}} \varphi_\nu(\gamma)^2 = \frac{\delta^2}{4^d}\, \gamma^\top \sum_{\nu \in \mathcal{V}} u_\nu u_\nu^\top\, \gamma
= \frac{2\delta^2}{4^d}\, \gamma^\top \sum_{j=1}^d u_{e_j} u_{e_j}^\top\, \gamma. \tag{68}
\]
But now, studying the quadratic form (68), we note that the vectors $u_{e_j}$ are orthogonal. As a consequence, the vectors $u_{e_j}$ (up to scaling) are the only eigenvectors corresponding to positive eigenvalues of the positive semidefinite matrix $\sum_{j=1}^d u_{e_j} u_{e_j}^\top$. Thus, since the set
\[
B_\infty = \big\{\gamma \in \mathbb{R}^{2^d} : \|\gamma\|_\infty \le 1\big\} \subset \big\{\gamma \in \mathbb{R}^{2^d} : \|\gamma\|_2^2 \le 2^d\big\},
\]
we have via an eigenvalue calculation that
\[
\sup_{\gamma \in B_\infty} \sum_{\nu \in \mathcal{V}} \varphi_\nu(\gamma)^2
\le \frac{2\delta^2}{4^d} \sup_{\gamma : \|\gamma\|_2^2 \le 2^d} \gamma^\top \sum_{j=1}^d u_{e_j} u_{e_j}^\top\, \gamma
= \frac{2\delta^2}{4^d}\, \|u_{e_1}\|_2^4 = 2\delta^2,
\]
since $\|u_{e_j}\|_2^2 = 2^d$ for each $j$. Applying Theorem 2 and Corollary 4 completes the proof.

E.3 Proof of Lemma 10

This result relies on Theorem 3, along with a careful argument to understand the extreme points of $\gamma \in L^\infty([0,1])$ that we use when applying the result.
First, we take the packing $\mathcal{V} = \{-1,1\}^k$ and densities $f_\nu$ for $\nu \in \mathcal{V}$ as in the construction (61). Overall, our first step is to show that, for the purposes of applying Theorem 3, it is no loss of generality to identify $\gamma \in L^\infty([0,1])$ with vectors $\gamma \in \mathbb{R}^{2k}$, where $\gamma$ is constant on intervals of the form $[i/2k, (i+1)/2k]$. With this identification complete, we can then provide a bound on the correlation of any $\gamma \in B_\infty$ with the densities $f_{\pm j}$ defined in (63), which completes the proof.

With this outline in mind, let the sets $D_i$, $i \in \{1, 2, \ldots, 2k\}$, be defined as $D_i = [(i-1)/2k, i/2k)$, except that $D_{2k} = [(2k-1)/2k, 1]$, so the collection $\{D_i\}_{i=1}^{2k}$ forms a partition of the unit interval $[0,1]$. By construction of the densities $f_\nu$, the sign of $f_\nu - 1$ remains constant on each $D_i$. Let us define (for shorthand) the linear functionals $\varphi_j : L^\infty([0,1]) \to \mathbb{R}$ for each $j \in \{1, \ldots, k\}$ via
\[
\varphi_j(\gamma) := \int \gamma\, (dP_{+j} - dP_{-j})
= \sum_{i=1}^{2k} \int_{D_i} \gamma(x) \big(f_{+j}(x) - f_{-j}(x)\big)\,dx
= 2 \int_{D_{2j-1} \cup D_{2j}} \gamma(x)\, g_{\beta,j}(x)\,dx,
\]
where we recall the definitions (63) of the mixture densities $f_{\pm j} = 1 \pm g_{\beta,j}$.

Since the set $B_\infty$ from Theorem 3 is compact, convex, and Hausdorff, the Krein-Milman theorem [45, Proposition 1.2] guarantees that it is equal to the convex hull of its extreme points; moreover, since the functionals $\gamma \mapsto \varphi_j^2(\gamma)$ are convex, the supremum in Theorem 3 must be attained at the extreme points of $B_\infty([0,1])$. As a consequence, when applying the divergence bound
\[
\sum_{j=1}^k \Big(D_{\rm kl}\big(M^n_{+j} \,\|\, M^n_{-j}\big) + D_{\rm kl}\big(M^n_{-j} \,\|\, M^n_{+j}\big)\Big)
\le 2 n (e^\alpha - 1)^2 \sup_{\gamma \in B_\infty} \sum_{j=1}^k \varphi_j^2(\gamma), \tag{69}
\]
we can restrict our attention to $\gamma \in B_\infty$ for which $\gamma(x) \in \{-1,1\}$. Now we argue that it is no loss of generality to assume that $\gamma$, when restricted to $D_i$, is constant (apart from a measure zero set).
Fix $i \in [2k]$, and assume for the sake of contradiction that there exist sets $B_i, C_i \subset D_i$ such that $\gamma(B_i) = \{1\}$ and $\gamma(C_i) = \{-1\}$, while $\mu(B_i) > 0$ and $\mu(C_i) > 0$, where $\mu$ denotes Lebesgue measure.¹ We will construct vectors $\gamma^1, \gamma^2 \in B_\infty$ and a value $\lambda \in (0,1)$ such that
\[
\int_{D_i} \gamma(x)\, g_{\beta,j}(x)\,dx
= \lambda \int_{D_i} \gamma^1(x)\, g_{\beta,j}(x)\,dx + (1 - \lambda) \int_{D_i} \gamma^2(x)\, g_{\beta,j}(x)\,dx
\]
simultaneously for all $j \in [k]$, while on $D_i^c = [0,1] \setminus D_i$, we will have the equivalence $\gamma^1|_{D_i^c} \equiv \gamma^2|_{D_i^c} \equiv \gamma|_{D_i^c}$. Indeed, set $\gamma^1(D_i) = \{1\}$ and $\gamma^2(D_i) = \{-1\}$, otherwise setting $\gamma^1(x) = \gamma^2(x) = \gamma(x)$ for $x \notin D_i$. For the unique index $j \in [k]$ such that $[(j-1)/k, j/k] \supset D_i$, we define
\[
\lambda := \frac{\int_{B_i} g_{\beta,j}(x)\,dx}{\int_{D_i} g_{\beta,j}(x)\,dx}
\quad \mbox{so that} \quad
1 - \lambda = \frac{\int_{C_i} g_{\beta,j}(x)\,dx}{\int_{D_i} g_{\beta,j}(x)\,dx}.
\]
By the construction of the function $g_\beta$, the functions $g_{\beta,j}$ do not change sign on $D_i$, and the absolute continuity conditions on $g_\beta$ specified in equation (60) guarantee $0 < \lambda < 1$, since $\mu(B_i) > 0$ and $\mu(C_i) > 0$. We thus find that for any $j \in [k]$,
\[
\int_{D_i} \gamma(x)\, g_{\beta,j}(x)\,dx
= \int_{B_i} \gamma^1(x)\, g_{\beta,j}(x)\,dx + \int_{C_i} \gamma^2(x)\, g_{\beta,j}(x)\,dx
= \int_{B_i} g_{\beta,j}(x)\,dx - \int_{C_i} g_{\beta,j}(x)\,dx
= \lambda \int_{D_i} g_{\beta,j}(x)\,dx - (1 - \lambda) \int_{D_i} g_{\beta,j}(x)\,dx
= \lambda \int \gamma^1(x)\, g_{\beta,j}(x)\,dx + (1 - \lambda) \int \gamma^2(x)\, g_{\beta,j}(x)\,dx.
\]
(Notably, for $j$ such that $g_{\beta,j}$ is identically 0 on $D_i$, this equality is trivial.) By linearity and the strict convexity of the function $x \mapsto x^2$, we then find that, for the sets $E_j := D_{2j-1} \cup D_{2j}$,
\[
\sum_{j=1}^k \varphi_j^2(\gamma)
= \sum_{j=1}^k \Bigg(\int_{E_j} \gamma(x)\, g_{\beta,j}(x)\,dx\Bigg)^2
< \lambda \sum_{j=1}^k \Bigg(\int_{E_j} \gamma^1(x)\, g_{\beta,j}(x)\,dx\Bigg)^2
+ (1 - \lambda) \sum_{j=1}^k \Bigg(\int_{E_j} \gamma^2(x)\, g_{\beta,j}(x)\,dx\Bigg)^2.
\]
Thus one of $\gamma^1$ or $\gamma^2$ must have a larger objective value than $\gamma$.
This is our desired contradiction, which shows that (up to measure zero sets) any $\gamma$ attaining the supremum in the information bound (69) must be constant on each of the $D_i$.

¹For a function $f$ and set $A$, the notation $f(A)$ denotes the image $f(A) = \{f(x) \mid x \in A\}$.

Having shown that $\gamma$ is constant on each of the intervals $D_i$, we conclude that the supremum (69) can be reduced to a finite-dimensional problem over the subset
\[
\mathcal{B}_{1,2k} := \big\{u \in \mathbb{R}^{2k} \mid \|u\|_\infty \le 1\big\}
\]
of $\mathbb{R}^{2k}$. In terms of this subset, the supremum (69) can be rewritten as the upper bound
\[
\sup_{\gamma \in B_\infty} \sum_{j=1}^k \varphi_j(\gamma)^2
\le \sup_{\gamma \in \mathcal{B}_{1,2k}} \sum_{j=1}^k \Bigg(\gamma_{2j-1} \int_{D_{2j-1}} g_{\beta,j}(x)\,dx + \gamma_{2j} \int_{D_{2j}} g_{\beta,j}(x)\,dx\Bigg)^2.
\]
By construction of the function $g_\beta$, we have the equality
\[
\int_{D_{2j-1}} g_{\beta,j}(x)\,dx = -\int_{D_{2j}} g_{\beta,j}(x)\,dx
= \int_0^{\frac{1}{2k}} g_{\beta,1}(x)\,dx
= \int_0^{\frac{1}{2k}} \frac{1}{k^\beta}\, g_\beta(kx)\,dx
= \frac{c_{1/2}}{k^{\beta+1}}.
\]
This implies that
\[
\frac{1}{2 n (e^\alpha - 1)^2} \sum_{j=1}^k \Big(D_{\rm kl}\big(M^n_{+j}\,\|\,M^n_{-j}\big) + D_{\rm kl}\big(M^n_{-j}\,\|\,M^n_{+j}\big)\Big)
\le \sup_{\gamma \in B_\infty} \sum_{j=1}^k \varphi_j(\gamma)^2
\le \sup_{\gamma \in \mathcal{B}_{1,2k}} \sum_{j=1}^k \Big(\frac{c_{1/2}}{k^{\beta+1}}\, \gamma^\top (e_{2j-1} - e_{2j})\Big)^2
= \frac{c_{1/2}^2}{k^{2\beta+2}} \sup_{\gamma \in \mathcal{B}_{1,2k}} \gamma^\top \sum_{j=1}^k (e_{2j-1} - e_{2j})(e_{2j-1} - e_{2j})^\top\, \gamma, \tag{70}
\]
where $e_j \in \mathbb{R}^{2k}$ denotes the $j$th standard basis vector. Rewriting this using the Kronecker product $\otimes$, we have
\[
\sum_{j=1}^k (e_{2j-1} - e_{2j})(e_{2j-1} - e_{2j})^\top = I_{k \times k} \otimes \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \preceq 2 I_{2k \times 2k}.
\]
Combining this bound with our inequality (70), we obtain
\[
\sum_{j=1}^k \Big(D_{\rm kl}\big(M^n_{+j}\,\|\,M^n_{-j}\big) + D_{\rm kl}\big(M^n_{-j}\,\|\,M^n_{+j}\big)\Big)
\le 4 n (e^\alpha - 1)^2\, \frac{c_{1/2}^2}{k^{2\beta+2}} \sup_{\gamma \in \mathcal{B}_{1,2k}} \|\gamma\|_2^2
= \frac{4 c_{1/2}^2\, n (e^\alpha - 1)^2}{k^{2\beta+1}}.
\]

F Technical arguments

In this appendix, we collect proofs of technical lemmas and results needed for completeness.

F.1 Proof of Lemma 1

Fix an (arbitrary) estimator $\hat{\theta}$.
By assumption (10), we have
\[
\Phi\big(\rho(\hat{\theta}, \theta(P_\nu))\big) \ge 2\delta \sum_{j=1}^d 1\big\{[v(\hat{\theta})]_j \neq \nu_j\big\}.
\]
Taking expectations, we see that
\[
\sup_{P \in \mathcal{P}} \mathbb{E}_P\Big[\Phi\big(\rho(\hat{\theta}(Z_1, \ldots, Z_n), \theta(P))\big)\Big]
\ge \max_{\nu \in \mathcal{V}} \mathbb{E}_{P_\nu}\Big[\Phi\big(\rho(\hat{\theta}(Z_1, \ldots, Z_n), \theta_\nu)\big)\Big]
\ge \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} \mathbb{E}_{P_\nu}\Big[\Phi\big(\rho(\hat{\theta}(Z_1, \ldots, Z_n), \theta_\nu)\big)\Big]
\ge \frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} 2\delta \sum_{j=1}^d \mathbb{E}_{P_\nu}\Big[1\big\{[v(\hat{\theta})]_j \neq \nu_j\big\}\Big],
\]
since the average is smaller than the maximum of a set, and using the separation assumption (10). Recalling the definition (31) of the mixtures $P_{\pm j}$, we swap the summation orders to see that
\[
\frac{1}{|\mathcal{V}|} \sum_{\nu \in \mathcal{V}} P_\nu\big([v(\hat{\theta})]_j \neq \nu_j\big)
= \frac{1}{|\mathcal{V}|} \sum_{\nu : \nu_j = 1} P_\nu\big([v(\hat{\theta})]_j \neq \nu_j\big)
+ \frac{1}{|\mathcal{V}|} \sum_{\nu : \nu_j = -1} P_\nu\big([v(\hat{\theta})]_j \neq \nu_j\big)
= \frac{1}{2} P_{+j}\big([v(\hat{\theta})]_j \neq \nu_j\big) + \frac{1}{2} P_{-j}\big([v(\hat{\theta})]_j \neq \nu_j\big).
\]
This gives the statement claimed in the lemma, while taking an infimum over all testing procedures $\psi : \mathcal{Z}^n \to \{-1, +1\}$ gives the claim (11).

F.2 Proof of unbiasedness for sampling strategy (26a)

We compute the expectation of a random variable $Z$ sampled according to the strategy (26a); that is, we compute $\mathbb{E}[Z \mid v]$ for a vector $v \in \mathbb{R}^d$. By scaling, it is no loss of generality to assume that $\|v\|_2 = 1$, and using the rotational symmetry of the $\ell_2$-ball, we see it is no loss of generality to assume that $v = e_1$, the first standard basis vector.

Let the function $s_d$ denote the surface area of the sphere in $\mathbb{R}^d$, so that
\[
s_d(r) = \frac{d\, \pi^{d/2}}{\Gamma(d/2 + 1)}\, r^{d-1}
\]
is the surface area of the sphere of radius $r$. (We use $s_d$ as a shorthand for $s_d(1)$ when convenient.) Then for a random variable $W$ sampled uniformly from the half of the $\ell_2$-ball with first coordinate $W_1 \ge 0$, symmetry implies that, by integrating over the radii of the ball,
\[
\mathbb{E}[W] = e_1\, \frac{2}{s_d} \int_0^1 s_{d-1}\big(\sqrt{1 - r^2}\big)\, r\, dr.
\]
Making the change of variables to spherical coordinates (we use $\phi$ as the angle), we have
\[
\frac{2}{s_d} \int_0^1 s_{d-1}\big(\sqrt{1 - r^2}\big)\, r\, dr
= \frac{2}{s_d} \int_0^{\pi/2} s_{d-1}(\cos\phi) \sin\phi\, d\phi
= \frac{2 s_{d-1}}{s_d} \int_0^{\pi/2} \cos^{d-2}(\phi) \sin(\phi)\, d\phi.
\]
Noting that $\frac{d}{d\phi} \cos^{d-1}(\phi) = -(d-1)\cos^{d-2}(\phi)\sin(\phi)$, we obtain
\[
\int_0^{\pi/2} \cos^{d-2}(\phi)\sin(\phi)\, d\phi
= \Big[-\frac{\cos^{d-1}(\phi)}{d-1}\Big]_0^{\pi/2} = \frac{1}{d-1},
\]
or that
\[
\mathbb{E}[W] = e_1\, \frac{(d-1)\pi^{\frac{d-1}{2}}\, \Gamma(\frac{d}{2}+1)}{d\, \pi^{\frac{d}{2}}\, \Gamma(\frac{d-1}{2}+1)} \cdot \frac{1}{d-1}
= e_1 \underbrace{\frac{\Gamma(\frac{d}{2}+1)}{\sqrt{\pi}\, d\, \Gamma(\frac{d-1}{2}+1)}}_{=:\, c_d}, \tag{71}
\]
where we define the constant $c_d$ to be the final ratio. Allowing again $\|v\|_2 \le r$, with the expression (71), we see that for our sampling strategy for $Z$, we have
\[
\mathbb{E}[Z \mid v] = \frac{v}{r}\, B\, c_d \Big(\frac{e^\alpha}{e^\alpha + 1} - \frac{1}{e^\alpha + 1}\Big)
= \frac{B\, c_d}{r} \cdot \frac{e^\alpha - 1}{e^\alpha + 1}\, v.
\]
Consequently, the choice
\[
B = \frac{e^\alpha + 1}{e^\alpha - 1} \cdot \frac{r}{c_d}
= \frac{e^\alpha + 1}{e^\alpha - 1} \cdot \frac{r\, \sqrt{\pi}\, d\, \Gamma(\frac{d-1}{2}+1)}{\Gamma(\frac{d}{2}+1)}
\]
yields $\mathbb{E}[Z \mid v] = v$. Moreover, we have
\[
\|Z\|_2 = B \le r\, \frac{e^\alpha + 1}{e^\alpha - 1} \cdot \frac{3\sqrt{\pi}\sqrt{d}}{2}
\]
by Stirling's approximation to the $\Gamma$-function. By noting that $(e^\alpha + 1)/(e^\alpha - 1) \le 3/\alpha$ for $\alpha \le 1$, we see that $\|Z\|_2 \le 8 r \sqrt{d}/\alpha$.

G Effects of differential privacy in non-compact spaces

In this appendix, we present a somewhat pathological example that demonstrates the effects of differential privacy in non-compact spaces. Let us assume only that $\theta \in \mathbb{R}$ and $\alpha < \infty$, and let $\mathcal{P}_\theta$ denote the collection of probability measures with variance 1 having $\theta$ as a mean. In contrast to the non-private case, where the risk of the sample mean scales as $1/n$, we obtain
\[
\mathcal{M}_n\big(\mathbb{R}, (\cdot)^2, \alpha\big) = \infty \tag{72}
\]
for all $n \in \mathbb{N}$. To see this, consider the Fano inequality version (9). Fix $\delta > 0$ and choose $\{\theta_1 = 0, \theta_2 = 2\delta, \ldots, \theta_N = 2N\delta\}$, where
\[
N = N(\delta, n) = \max\big\{\lceil \exp(64 n (e^\alpha - 1)^2) \rceil,\ 2^4\big\}.
\]
Then by applying Corollary 1, we have for $\mathcal{V} = [N]$ that
\[
\mathcal{M}_n\big(\mathbb{R}, (\cdot)^2, \alpha\big)
\ge \delta^2 \Bigg(1 - \frac{4 n (e^\alpha - 1)^2 \sum_{\nu, \nu' \in \mathcal{V}} \|P_\nu - P_{\nu'}\|_{\rm TV}^2 / |\mathcal{V}|^2 + \log 2}{\log N(\delta, n)}\Bigg).
\]
We have $\|P_\nu - P_{\nu'}\|_{\rm TV} \le 1$ for any two distributions $P_\nu$ and $P_{\nu'}$, which implies
\[
\mathcal{M}_n\big(\mathbb{R}, (\cdot)^2, \alpha\big)
\ge \delta^2 \Bigg(1 - \frac{16 n (e^\alpha - 1)^2 + \log 2}{\log N(\delta, n)}\Bigg)
\ge \delta^2 \Big(1 - \frac{1}{2}\Big) = \frac{\delta^2}{2}.
\]
Since $\delta > 0$ was arbitrary, this proves the infinite minimax risk bound (72). The construction achieving (72) is somewhat contrived, but it suggests that care is needed when designing differentially private inference procedures, and it shows that even in cases when it is possible to attain a parametric rate of convergence, there may be no (locally) differentially private inference procedure.

References

[1] R. Ahlswede and A. Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48(3):569–579, March 2002.
[2] N. Alon and J. H. Spencer. The Probabilistic Method. Wiley-Interscience, second edition, 2000.
[3] V. Anantharam, A. Gohari, S. Kamath, and C. Nair. On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover. arXiv:1304.6133 [cs.IT], 2013. URL http://arxiv.org/abs/1304.6133.
[4] E. Arias-Castro, E. Candès, and M. Davenport. On the fundamental limits of adaptive sensing. IEEE Transactions on Information Theory, 59(1):472–481, 2013.
[5] P. Assouad. Deux remarques sur l'estimation. C. R. Académie Scientifique Paris Séries I Mathématiques, 296(23):1021–1024, 1983.
[6] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. In Proceedings of the 26th ACM Symposium on Principles of Database Systems, 2007.
[7] A. Beimel, K. Nissim, and E. Omri.
Distributed private data analysis: Simultaneously solving how and what. In Advances in Cryptology, volume 5157 of Lecture Notes in Computer Science, pages 451–468. Springer, 2008.
[8] A. Beimel, S. P. Kasiviswanathan, and K. Nissim. Bounds on the sample complexity for private learning and private data release. In Proceedings of the 7th Theory of Cryptography Conference, pages 437–454, 2010.
[9] L. Birgé. Approximation dans les espaces métriques et théorie de l'estimation. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 65:181–238, 1983.
[10] A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In Proceedings of the Fortieth Annual ACM Symposium on the Theory of Computing, 2008.
[11] P. Brucker. An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 3(3):163–166, 1984.
[12] V. Buldygin and Y. Kozachenko. Metric Characterization of Random Variables and Random Processes, volume 188 of Translations of Mathematical Monographs. American Mathematical Society, 2000.
[13] R. Carroll and P. Hall. Optimal rates of convergence for deconvolving a density. Journal of the American Statistical Association, 83(404):1184–1186, 1988.
[14] R. Carroll, D. Ruppert, L. Stefanski, and C. Crainiceanu. Measurement Error in Nonlinear Models: A Modern Perspective. Chapman and Hall, second edition, 2006.
[15] K. Chaudhuri and D. Hsu. Convergence rates for differentially private statistical estimation. In Proceedings of the 29th International Conference on Machine Learning, 2012.
[16] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, 2011.
[17] T. M. Cover and J. A. Thomas. Elements of Information Theory, Second Edition.
Wiley, 2006.
[18] A. De. Lower bounds in differential privacy. In Proceedings of the Ninth Theory of Cryptography Conference, 2012. URL http://arxiv.org/abs/1107.2183.
[19] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Privacy aware learning. arXiv:1210.2085 [stat.ML], 2012. URL http://arxiv.org/abs/1210.2085.
[20] G. T. Duncan and D. Lambert. Disclosure-limited data dissemination. Journal of the American Statistical Association, 81(393):10–18, 1986.
[21] G. T. Duncan and D. Lambert. The risk of disclosure for microdata. Journal of Business and Economic Statistics, 7(2):207–217, 1989.
[22] C. Dwork and J. Lei. Differential privacy and robust statistics. In Proceedings of the Forty-First Annual ACM Symposium on the Theory of Computing, 2009.
[23] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology (EUROCRYPT 2006), 2006.
[24] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Theory of Cryptography Conference, pages 265–284, 2006.
[25] C. Dwork, G. N. Rothblum, and S. P. Vadhan. Boosting and differential privacy. In 51st Annual Symposium on Foundations of Computer Science, pages 51–60, 2010.
[26] S. Efromovich. Nonparametric Curve Estimation: Methods, Theory, and Applications. Springer-Verlag, 1999.
[27] A. V. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the Twenty-Second Symposium on Principles of Database Systems, pages 211–222, 2003.
[28] S. E. Fienberg, U. E. Makov, and R. J. Steele. Disclosure limitation using perturbation and related methods for categorical data. Journal of Official Statistics, 14(4):485–502, 1998.
[29] S. E. Fienberg, A.
Rinaldo, and X. Yang. Differential privacy and the risk-utility tradeoff for multi-dimensional contingency tables. In The International Conference on Privacy in Statistical Databases, 2010.
[30] S. R. Ganta, S. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In Proceedings of the 14th ACM SIGKDD Conference on Knowledge and Data Discovery (KDD), 2008.
[31] L. J. Gleser. Estimation in a multivariate "errors in variables" regression model: large sample results. Annals of Statistics, 9(1):24–44, 1981.
[32] R. M. Gray. Entropy and Information Theory. Springer, 1990.
[33] R. Hall, A. Rinaldo, and L. Wasserman. Random differential privacy. arXiv:1112.2680 [stat.ME], 2011. URL http://arxiv.org/abs/1112.2680.
[34] M. Hardt and G. N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In 51st Annual Symposium on Foundations of Computer Science, 2010.
[35] M. Hardt and K. Talwar. On the geometry of differential privacy. In Proceedings of the Forty-Second Annual ACM Symposium on the Theory of Computing, pages 705–714, 2010. URL http://arxiv.org/abs/0907.3754.
[36] R. Z. Has'minskii. A lower bound on the risks of nonparametric estimates of densities in the uniform metric. Theory of Probability and Applications, 23:794–798, 1978.
[37] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.
[38] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the Association for Computing Machinery, 45(6):983–1006, 1998.
[39] H. Ling and R. Li. Variable selection for partially linear models with measurement errors. Journal of the American Statistical Association, 104(485):234–248, 2009.
[40] P.-L. Loh and M. J. Wainwright.
High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. Annals of Statistics, 40(3):1637–1664, 2012.
[41] Y. Ma and R. Li. Variable selection in measurement error models. Bernoulli, 16(1):274–300, 2010.
[42] L. W. Mackey, M. I. Jordan, R. Y. Chen, B. Farrell, and J. A. Tropp. Matrix concentration inequalities via the method of exchangeable pairs. arXiv:1201.6002 [math.PR], 2012. URL http://arxiv.org/abs/1201.6002.
[43] A. McGregor, I. Mironov, T. Pitassi, O. Reingold, K. Talwar, and S. Vadhan. The limits of two-party differential privacy. In 51st Annual Symposium on Foundations of Computer Science, 2010.
[44] S. Negahban, P. Ravikumar, M. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
[45] R. R. Phelps. Lectures on Choquet's Theorem, Second Edition. Springer, 2001.
[46] B. I. P. Rubinstein, P. L. Bartlett, L. Huang, and N. Taft. Learning in a large function space: privacy-preserving mechanisms for SVM learning. Journal of Privacy and Confidentiality, 4(1):65–100, 2012.
[47] D. Scott. On optimal and data-based histograms. Biometrika, 66(3):605–610, 1979.
[48] A. Smith. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the Forty-Third Annual ACM Symposium on the Theory of Computing, 2011.
[49] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
[50] S. Warner. Randomized response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
[51] L. Wasserman and S. Zhou. A statistical framework for differential privacy. Journal of the American Statistical Association, 105(489):375–389, 2010.
[52] Y. Yang and A. Barron.
Information-theoretic determination of minimax rates of convergence. Annals of Statistics, 27(5):1564–1599, 1999.
[53] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer-Verlag, 1997.
