Sample Complexity Bounds for Robust Mean Estimation with Mean-Shift Contamination


Authors: Ilias Diakonikolas, Giannis Iakovidis, Daniel M. Kane, Sihan Liu

Sample Complexity Bounds for Robust Mean Estimation with Mean-Shift Contamination

Ilias Diakonikolas* (UW Madison, ilias@cs.wisc.edu), Giannis Iakovidis† (UW Madison, iakovidis@wisc.edu), Daniel M. Kane‡ (UC San Diego, dakane@ucsd.edu), Sihan Liu (UC San Diego, sil046@ucsd.edu)

Abstract. We study the basic task of mean estimation in the presence of mean-shift contamination. In the mean-shift contamination model, an adversary is allowed to replace a small constant fraction of the clean samples by samples drawn from arbitrarily shifted versions of the base distribution. Prior work characterized the sample complexity of this task for the special cases of the Gaussian and Laplace distributions. Specifically, it was shown that consistent estimation is possible in these cases, a property that is provably impossible in Huber's contamination model. An open question posed in earlier work was to determine the sample complexity of mean estimation in the mean-shift contamination model for general base distributions. In this work, we study and essentially resolve this open question. Specifically, we show that, under mild spectral conditions on the characteristic function of the (potentially multivariate) base distribution, there exists a sample-efficient algorithm that estimates the target mean to any desired accuracy. We complement our upper bound with a qualitatively matching sample complexity lower bound. Our techniques make critical use of Fourier analysis, and in particular introduce the notion of a Fourier witness as an essential ingredient of our upper and lower bounds.

* Supported by NSF Medium Award CCF-2107079, ONR award number N00014-25-1-2268, and an H.I. Romnes Faculty Fellowship.
† Supported by ONR award number N00014-25-1-2268.
‡ Supported by NSF Medium Award CCF-2107547.
1 Introduction

Robust statistics [HR09] is the field that studies the design of accurate estimators in the presence of data contamination. This study is motivated by a range of practical applications, where the standard i.i.d. assumption holds only approximately. For example, the input data may be systematically manipulated by various sources or include out-of-distribution points; see, e.g., [BNJT10, BNL12, SKL17, TLM18, DKK+19a, DFDL24]. Originating in the pioneering works of Tukey and Huber [Tuk60, Hub64], the field developed minimax-optimal robust estimators for classical problems. A recent line of work in computer science has led to a revival of robust statistics from an algorithmic viewpoint by providing polynomial-time estimators in high dimensions [DKK+19b, LRV16, DK23].

The classical contamination model in the field of robust statistics, known as Huber's contamination model [Hub64], is defined as follows. For an inlier distribution D over R^d and a contamination proportion α ∈ (0, 1/2), each sample x is drawn from D with probability 1 − α, and from an unknown (potentially arbitrary) distribution Q with probability α. Huber's model has been the main focus of prior work in the field, both from the information-theoretic and the algorithmic standpoints. As the model allows for an arbitrary outlier distribution Q, it suffers from inherent information-theoretic limitations. Specifically, even for the basic task of estimating the mean of the simplest inlier distributions (e.g., univariate Gaussians), no estimator can achieve error better than Ω(α), independent of the sample size. That is, Huber's model does not allow for consistent estimators, i.e., estimators whose error becomes arbitrarily small as the sample size increases. To achieve consistency, some assumptions on the structure of the outlier distribution Q are necessary.
In this work, we focus on the basic task of robust mean estimation in a prominent data contamination model, known as mean-shift contamination, in which consistency may be possible. In the mean-shift contamination model, the outlier distribution is assumed to consist of arbitrary mean-shifts of the inlier distribution. Formally, we have the following definition.

Definition 1.1 (Mean-Shift Contamination Model). Let α ∈ (0, 1/2), and let D, Q be distributions over R^d with E_{x∼D}[x] = 0. An α-mean-shift contaminated sample from D_µ is generated as follows:
1. With probability 1 − α, output a clean sample x = µ + y, where y ∼ D.
2. With probability α, an adversary draws a shift vector z from Q, and outputs an outlier sample x = z + y, where y ∼ D.
We denote by D_µ^(α) the resulting observed distribution.

The mean-shift contamination model has been intensively studied in recent years, in both the statistics [CJ10, CD19, CDRV21, KG25] and computer science [Li23, DIKP25] literature. Beyond mean estimation, variants of the model have also been considered in the context of regression tasks [STB01, Gan07, MW07, SO11]. Historically, this model traces back to the multiple-hypothesis-testing literature, which treats the null parameters as unknown and thus estimates them prior to testing [Efr04, Efr07, Efr08].

Prior work has studied the sample and computational complexity of mean-shift contamination in two benchmark settings, Gaussian and Laplace base distributions, providing both information-theoretic [CDRV21, KG25] and algorithmic [Li23, DIKP25] results. An immediate corollary of these works is that for these two prototypical base distributions, consistent estimation is indeed possible. Specifically, for the Gaussian case, [KG25] determined the minimax error rates and [DIKP25] gave the first polynomial-time estimator with near-minimax rates in high dimensions.
Despite this recent progress, several fundamental questions regarding estimation in this model remain poorly understood when we depart from the Gaussian/Laplace regimes. Under what conditions on the distribution D is consistent mean estimation with mean-shift contamination possible? Can we qualitatively characterize the sample complexity of estimating the mean of a general distribution D in this model? Understanding these questions is of fundamental interest and has been posed as an open problem in prior work. Specifically, the work of [KG25] states that "an interesting problem is to prove a matching minimax lower bound for estimating the mean with a generic base distribution D." As our main contribution, we answer this question by providing a qualitative characterization of the sample complexity of this problem.

1.1 Our Results

For a distribution D with characteristic function ϕ_D (see Definition 1.4), we define the following quantity:

    δ = δ(ϵ, α, D) := inf_{∥v∥ ≥ ϵ} sup_{ω : dist(ω·v, Z) ≥ α} |ϕ_D(ω)| .    (1)

In words, δ is the worst case, over all mean-shift directions v with ∥v∥ ≥ ϵ, of the largest Fourier magnitude |ϕ_D(ω)| over frequencies ω whose projection ω·v stays at least α away from every integer. Intuitively, {ω : dist(ω·v, Z) ≥ α} is the set of frequencies which the adversary cannot fully corrupt. As we will establish, the parameter δ characterizes the sample complexity of estimating the mean µ of the contaminated distribution D_µ^(α). In particular, in our sample complexity lower bound (Lemma 4.6), we show that the adversary can zero out the characteristic function ϕ_{D_µ^(α)} throughout the set dist(ω·v, Z) ≤ α. Conversely, in our sample complexity upper bound (Theorem 3.2), we prove that every ω outside this set is suitable for identifying the difference between µ̂ and µ when µ̂ − µ = v and ∥v∥ = ϵ.
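As a concrete illustration of Eq. (1), the quantity δ can be approximated numerically by a grid search. The sketch below is our code, not from the paper: it treats the standard Gaussian in d = 1, whose characteristic function under the e^{2πiω·x} convention is ϕ(ω) = e^{−2π²ω²}, with the illustrative choices ϵ = 0.5 and α = 0.1.

```python
import numpy as np

# Numerical sketch (our illustration): approximate delta(eps, alpha, D)
# from Eq. (1) for the standard Gaussian in d = 1, whose characteristic
# function (e^{2 pi i w x} convention) is phi(w) = exp(-2 pi^2 w^2).
eps, alpha = 0.5, 0.1

def phi(w):
    return np.exp(-2 * np.pi**2 * w**2)

def dist_to_Z(t):
    # distance of each entry of t to the nearest integer
    return np.abs(t - np.round(t))

ws = np.linspace(-3, 3, 12001)        # frequency grid
vs = np.linspace(eps, 5.0, 200)       # shift magnitudes |v| >= eps

sups = []
for v in vs:
    ok = dist_to_Z(ws * v) >= alpha   # frequencies the adversary cannot fully corrupt
    sups.append(phi(ws[ok]).max())
delta = min(sups)

# For the Gaussian, the sup is attained at the smallest admissible |w| = alpha/|v|
# and the inf at |v| = eps, giving the closed form below.
delta_closed = np.exp(-2 * np.pi**2 * (alpha / eps)**2)
```

The grid value should essentially match the closed form e^{−2π²(α/ϵ)²}, consistent with the e^{O((α/ϵ)²)} rate for the Gaussian in Table 1.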
The difficulty of identifying this difference is governed by the magnitude |ϕ_D(ω)|; and since the mean shift could lie in an arbitrary direction v, we take the infimum over v. Our main result is informally stated below (see Theorems 3.2 and 4.1 for more detailed formal statements and Remark 4.3 for a comparison of our upper and lower bounds):

Theorem 1.2 (Informal Main Result). Let D be a distribution over R^d with characteristic function ϕ_D, let α ∈ (0, 1/2) be the contamination parameter, and let ϵ be the target error. Then, under mild technical assumptions on D, there exists an algorithm that estimates the mean of D_µ from Õ_α(d/δ²) i.i.d. α-mean-shift contaminated samples. Moreover, even for d = 1, any algorithm requires at least (1/δ)^{Ω(1)} samples.

A few remarks are in order. First, we note that our univariate sample complexity lower bound can be straightforwardly extended to a (d/δ)^{Ω(1)}-sample lower bound in d dimensions for product distributions; see Remark 4.2. Second, our Fourier characterization applies to a broad range of base distributions well beyond the Gaussian and Laplace families. We illustrate our results with a few basic examples in Table 1 (see also Section D).

Remark 1.3 (Consistency). Theorem 1.2 yields an easily verifiable condition under which consistency is impossible. Namely, no algorithm can estimate the mean to error ϵ if the set {ω : dist(ωϵ, Z) ≥ α} has zero Fourier mass (i.e., if δ = 0 in the theorem's notation). Any distribution with a band-limited characteristic function, i.e., such that there exists B with ϕ(ω) = 0 for all |ω| ≥ B, satisfies this condition. For example, consider the distribution with density proportional to sinc²(x). Note that sinc² is the Fourier transform of the triangular window, namely the function equal to 1 − |ω| for |ω| < 1 and 0 otherwise (i.e., the convolution of two rectangular windows).
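The band-limited example can be verified numerically: the density proportional to sinc²(x) should have characteristic function equal to the triangular window. A minimal check (our code; the truncation range and grid step are arbitrary choices):

```python
import numpy as np

# Our numerical check (not from the paper): the density proportional to
# sinc^2(x) has characteristic function tri(w) = max(1 - |w|, 0),
# hence band-limited with B = 1.
dx = 0.01
x = np.arange(-400.0, 400.0, dx)
p = np.sinc(x)**2              # numpy's sinc is sin(pi x)/(pi x); integrates to 1

def phi(w):
    # Riemann-sum approximation of the characteristic function at frequency w
    return (np.sum(p * np.exp(2j * np.pi * w * x)) * dx).real

vals = {w: phi(w) for w in (0.0, 0.25, 0.5, 1.5)}
```

Up to truncation error from the 1/x² tails, phi(w) should track 1 − |w| inside (−1, 1) and vanish outside, e.g. phi(1.5) ≈ 0.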
Since both functions are even, the converse relationship holds as well, which gives us an example of a distribution with a band-limited characteristic function.

Table 1: Sample complexity for mean estimation under α-mean-shift contamination.

    Distribution D          | Upper bound                 | Lower bound (d = 1)
    N(0, I_d)               | Õ(d · e^{O((α/ϵ)²)})        | e^{Ω((α/ϵ)²)}
    Lap_d(0, I_d)           | Õ(d α²/ϵ⁴)                  | Ω((α/ϵ)^{1/2})
    Unif([−1, 1])           | Õ(1/ϵ)                      | Ω((α/ϵ)^{1/6})
    (Unif([−1, 1]))^{∗m}    | Õ(α⁻² (O(α/ϵ))^{2m})        | Ω((α/ϵ)^{(2m−1)/6})

Independent Work. Contemporaneous work [KKLZ26] also studies robust mean estimation under mean-shift contamination. The focus of their work is on developing computationally efficient (i.e., sample- and polynomial-time) algorithms in high dimensions, whereas the current work aims to characterize the sample complexity of the problem for general distributions. While [KKLZ26] provide a sample- and computationally-efficient algorithm for interesting families of distributions, their bounds can be qualitatively far from optimal in many cases. Specifically, for the prototypical case of multivariate Gaussians, their algorithm has sample complexity 2^{O(d/ϵ²)} (i.e., exponential in the dimension d), while the sample complexity of the problem in this case is O(d · 2^{O(1/ϵ²)}).

For the sake of completeness, we provide two brief technical points of comparison. First, the sample complexity of the algorithm in [KKLZ26] (as appearing in their theorem statement) depends on a Fourier-theoretic quantity that, in our notation, essentially corresponds to the worst-case Fourier witness of the base distribution's characteristic function over a large radius T. As a result, their theorem statement provides no guarantees when the distribution D, for example, is uniform over an interval.
While a careful analysis shows that the uniform distribution is not necessarily a true barrier for their approach, it can be seen that their algorithm inherently incurs a substantially suboptimal sample complexity for distributions whose characteristic functions have a non-trivial Fourier witness but are small on a set of reasonably large measure. Second, though both the algorithm of [KKLZ26] and ours rely on one-dimensional projections for mean estimation of multivariate distributions, their algorithm performs the projections in polynomially many randomly chosen directions (for the sake of computational efficiency), while ours utilizes an exponential-size cover. As the price of the relatively small number of random projections, their estimates along the projected directions need to be of accuracy ≪ ϵ/poly(d). Consequently, when the sample complexity of the univariate estimation problem has an exponential dependence on the error parameter (as is the case for the Gaussian distribution), their approach results in a sample complexity with an (often unnecessary) exponential dependence on the dimension.

1.2 Notation and Preliminaries

We use Z₊ for the set of positive integers. For n ∈ Z₊, we denote [n] := {1, …, n}. For a vector x, we denote by ∥x∥ its Euclidean norm. We define by δ_z the Dirac measure at z ∈ R^d, i.e., δ_z(A) = 1{z ∈ A} for all Borel sets A ⊆ R^d. For distributions P, Q, we denote by P ∗ Q their convolution. For a distribution D over R^d and µ ∈ R^d, we use D_µ for the shifted distribution D ∗ δ_µ. We also denote by P_z the distribution δ_z ∗ P. For A ⊆ R^d and x ∈ R^d, the distance of x to A is dist(x, A) = min_{y∈A} ∥x − y∥. Let B_d(R) := {θ ∈ R^d : ∥θ∥ ≤ R} be the Euclidean ball of radius R in R^d. For a measurable function f : R → C and 1 ≤ p < ∞, define ∥f∥_{L_p} := (∫_R |f(x)|^p dx)^{1/p}, with value +∞ if the integral diverges.
Denote by (h(x))₊ := max(h(x), 0) and by (h(x))₋ := max(−h(x), 0). Denote by sinc the function defined as sinc(x) := sin(πx)/(πx) for x ≠ 0 and sinc(0) = 1. We use a ≲ b to denote that there exists an absolute constant C > 0 (independent of the variables or parameters on which a and b depend) such that a ≤ C·b.

Definition 1.4 (Characteristic function). Let P be a distribution over R^d. We define its characteristic function by ϕ_P(ω) := E_{x∼P}[e^{2πi(ω·x)}], ω ∈ R^d. Note that the characteristic function is the Fourier transform of the underlying probability measure.

2 Our Techniques

Upper Bound. In the mean-shift contamination model, the observed distribution is a convolution of the form D_µ^(α) = D ∗ Q, where D is the base distribution (centered at 0) and Q is the distribution over the mean vector (designed by an adversary) that is guaranteed to put at least (1 − α) mass on the target clean mean vector µ. This structure naturally invites Fourier analysis, yielding the characteristic function identity ϕ_{D_µ^(α)}(ω) = ϕ_D(ω) ϕ_Q(ω). Conveniently, ϕ_{D_µ^(α)}(ω) can be approximated using samples and ϕ_D(ω) has a known functional form, thereby allowing us to approximate ϕ_Q(ω) via the ratio. If Q were a point mass on µ, its characteristic function would be exactly exp(2πiµ·ω). By the assumption of the contamination model, Q has to put at least (1 − α) mass on the clean mean vector µ. As a result, it is not hard to show that ϕ_Q(ω) is pointwise 2α-close to exp(2πiµ·ω). Next, we proceed to show how the above observations allow us to distinguish between the cases where a candidate vector µ̂ is close to the clean vector µ or is at least ϵ-far from µ. If we can do so, then we can solve the mean estimation task by scanning over all candidate mean vectors lying in a sufficiently fine cover.
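Both ingredients above, the factorization ϕ_{D_µ^(α)} = ϕ_D ϕ_Q and the 2α-pointwise closeness of the ratio estimate to exp(2πiµ·ω), can be sanity-checked empirically. A sketch with D = N(0, 1) in one dimension (our code, not the paper's; the adversary, shift size, and test frequency are illustrative choices):

```python
import numpy as np

# Sanity-check sketch (ours) for D = N(0, 1) in d = 1:
#   (a) phi_{D_mu^(alpha)} = phi_D * phi_Q, since observed samples are a
#       draw from the convolution D * Q; and
#   (b) the ratio (empirical cf) / phi_D estimates phi_Q, which is
#       pointwise 2*alpha-close to exp(2 pi i mu w).
rng = np.random.default_rng(0)
n, alpha, mu, shift = 400_000, 0.1, 1.3, 5.0

y = rng.normal(0.0, 1.0, n)
z = np.where(rng.random(n) < alpha, mu + shift, mu)   # adversarial Q: alpha mass moved to mu + shift
x = y + z                                             # contaminated sample from D * Q

w = 0.25
phi_D = np.exp(-2 * np.pi**2 * w**2)                  # known cf of N(0, 1)
phi_Q = (1 - alpha) * np.exp(2j * np.pi * w * mu) \
        + alpha * np.exp(2j * np.pi * w * (mu + shift))
ecf = np.mean(np.exp(2j * np.pi * w * x))             # empirical characteristic function

factor_err = abs(ecf - phi_D * phi_Q)                 # fact (a), up to sampling noise
dev = abs(ecf / phi_D - np.exp(2j * np.pi * w * mu))  # fact (b): at most 2*alpha + noise
```

Here dev is nonzero (the adversary contributes about α·|e^{2πiω·shift} − 1|), yet it stays below the 2α ceiling, which is exactly the slack the algorithm has to tolerate.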
Towards distinguishing whether µ̂ is close to µ or far from it, we want to decide whether ϕ_Q(ω) is close to exp(2πiµ̂·ω) (as it would be if Q placed most of its mass near µ̂) or noticeably different. For this, it suffices to find a witness frequency ω* for which the difference |exp(2πiµ̂·ω*) − exp(2πiµ·ω*)| is large. This happens whenever (µ̂ − µ)·ω* is far from an integer. Since |(µ̂ − µ)·ω*| ≤ ∥µ̂ − µ∥∥ω*∥, for close candidates this requires a very large ∥ω*∥, whereas for ϵ-far candidates, choosing ∥ω*∥ on the order of 1/ϵ suffices. Once we have such a witness, it remains to construct a sufficiently good estimate of ϕ_Q(ω*). Our algorithm tries to estimate ϕ_Q via the ratio form ϕ_{D_µ^(α)}/ϕ_D. Thus, for the estimator to have reasonable concentration, the denominator ϕ_D at ω* necessarily needs to be bounded away from 0. In particular, provided that the witness satisfies that (µ̂ − µ)·ω* is at least A-far from integers and ϕ_{D_µ^(α)}(ω*) is at least δ, we can estimate ϕ_{D_µ^(α)}, and consequently ϕ_Q(ω*), up to sufficient accuracy with roughly O(A⁻¹δ⁻²) samples. In summary, in order to learn the mean vector µ up to ϵ accuracy, all we need is that for any error vector v = µ̂ − µ of ℓ₂ norm at least ϵ, there exists a witness frequency ω_v such that (i) v·ω_v is bounded away from integers, and (ii) ϕ_{D_µ^(α)}(ω_v) is bounded away from 0. We call this the frequency-witness condition of the distribution D; its formal definition can be found in Definition 3.1.

Lower Bound. Our lower bound shows that an L² version of the above frequency-witness condition is essentially necessary for mean estimation under the mean-shift contamination model. Consider the basic case of univariate mean estimation.
In this case, the negation of the frequency-witness condition with respect to the error v = ϵ simplifies into saying that the characteristic function, restricted to points far from the integer multiples of ϵ⁻¹, has small L^∞ norm, i.e., ∥ϕ_D(ω) 1{dist(ϵω, Z) > Θ(α)}∥_{L∞} ≤ δ. In Theorem 4.1, we show that if the base distribution D satisfies an L² version of the above bound, i.e., ∥ϕ_D(ω) 1{dist(ϵω, Z) > Θ(α)}∥_{L²} ≤ δ (together with some natural tail bound condition), then poly(δ⁻¹) many samples are also statistically necessary for mean estimation up to error ϵ under mean-shift contamination. At a high level, we would like to design a pair of univariate distributions Y₁, Y₂, where Y_i is a univariate distribution that puts at least (1 − α) amount of mass on ±ϵ respectively, such that the convolutions Q_i = D ∗ Y_i are statistically hard to distinguish, i.e., ∥Q₁ − Q₂∥_{L¹} ≤ poly(δ). Our starting point is the famous Plancherel theorem, which allows us to equate the L² norm of a function to that of its Fourier transform. Provided that Q₁ and Q₂ both satisfy sufficiently strong tail bounds, we can then bound their total variation distance by their L² distance, and consequently by the L² distance between their characteristic functions, Δϕ = ϕ_{Q₁} − ϕ_{Q₂}. See Lemma 4.5 for the details of this argument. To control Δϕ, we rewrite it as

    Δϕ = ϕ_{D∗Y₁} − ϕ_{D∗Y₂} = ( (1 − α)(e^{πiϵω} − e^{−πiϵω}) + α(ϕ_{Y₁} − ϕ_{Y₂}) ) ϕ_D .

Note that by our assumption, ϕ_D already has small L² norm when restricted to points far from integer multiples of ϵ⁻¹. Thus, to ensure that Δϕ globally has small L² norm, it suffices to ensure that (1 − α)(e^{πiϵω} − e^{−πiϵω}) + α(ϕ_{Y₁} − ϕ_{Y₂}) has small L² norm on points close to integer multiples of ϵ⁻¹.
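The Plancherel step used above can be checked numerically. A minimal sketch (ours, with two shifted Gaussian densities standing in for the pair of distributions): under the e^{2πiωx} convention, the L² norm of a density difference equals the L² norm of the characteristic-function difference, with no extra 2π factors.

```python
import numpy as np

# Numerical check (ours) of the Plancherel step: with the e^{2 pi i w x}
# convention, ||p - q||_{L2} equals ||phi_P - phi_Q||_{L2} exactly.
dx = 0.001
x = np.arange(-12.0, 12.0, dx)   # spatial grid (Gaussian tails negligible beyond)
w = np.arange(-2.0, 2.0, dx)     # frequency grid (cf decays like exp(-2 pi^2 w^2))

p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)          # density of N(0, 1)
q = np.exp(-(x - 0.3)**2 / 2) / np.sqrt(2 * np.pi)  # density of N(0.3, 1)

phi_p = np.exp(-2 * np.pi**2 * w**2)
phi_q = phi_p * np.exp(2j * np.pi * 0.3 * w)        # shift becomes a phase factor

lhs = np.sqrt(np.sum((p - q)**2) * dx)              # ||p - q||_{L2}
rhs = np.sqrt(np.sum(np.abs(phi_p - phi_q)**2) * dx)  # ||phi_P - phi_Q||_{L2}
```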
More precisely, we want to design the pair Y₁, Y₂ such that (ϕ_{Y₂} − ϕ_{Y₁}) matches the Fourier signal f̂(ω) := α⁻¹(1 − α)(e^{πiϵω} − e^{−πiϵω}) on the set {ω : dist(ωϵ, Z) < Θ(α)}. It is tempting to consider directly the inverse Fourier transform of f̂(ω) and check whether we can write it as a difference of two probability measures. Unfortunately, f̂ can be as large as 2(1 − α)/α > 2, which implies that it cannot be expressed as the difference of any pair of probability measures. To mitigate the issue, we note that the extreme values of f̂ only occur periodically, at ω that are far from integer multiples of ϵ⁻¹. Recall that this is precisely the range in which (ϕ_{Y₂} − ϕ_{Y₁}) does not need to approximate f̂. Hence, we can simply truncate f̂ to be 0 at these periodic bands with extreme values. Naively, one would just multiply f̂ with a periodic window function to mask out its extreme values. Yet, doing so would make the resulting function no longer smooth, which is a required condition for its inverse Fourier transform to have rapidly decaying tail bounds. To ensure smoothness, we further mollify the window function by convolving it with various scaled versions of itself. With some standard Fourier analysis, one can show that the convolved window function has globally bounded derivatives and interpolates smoothly from 0 at points far from the window area to 1 within the window area. Consequently, the derivative bound in the frequency domain gives rise to the desired tail bounds on Y₂ − Y₁. The details of this argument are given in the proof of Lemma 4.6.

3 Sample-Efficient Algorithm

In this section, we show that under certain regularity assumptions on the characteristic function of D, we can design a sample-efficient algorithm. In particular, the assumption that we will use is the following:

Definition 3.1 (Frequency-witness condition).
We say that a distribution D over R^d satisfies the (ϵ, A, δ)-frequency-witness condition if: for all v ∈ R^d with ∥v∥ ≥ ϵ, there exists ω ∈ R^d such that |sin(πv·ω)| ≥ A and |ϕ_D(ω)| ≥ δ. Moreover, we call any such ω an (ϵ, A, δ)-frequency-witness for the vector v.

Denote by D_µ^(α) the α-mean-shift contaminated version of D_µ (see Definition 1.1). Let µ̂ ∈ R^d be a candidate mean.

Input: ϵ, α ∈ (0, 1/2), oracle access to ϕ_D, sample access to D_µ^(α) (the α-mean-shift contamination of D_µ with ∥µ∥ ≤ R), parameters R > 1, L, δ > 0 and A, c ∈ (0, 1), and M₁ > 0 the derivative bound in Theorem 3.2. Assume that D satisfies the (ϵ, A, δ)-frequency-witness condition and that ϕ_D is L-Lipschitz.
Output: µ̂ with ∥µ̂ − µ∥ ≤ ϵ with probability ≥ 2/3.
1. Set B_δ ← √d M₁ / (2πδ).
2. Set n ← ⌈ C d log(B_δ R L / (δA)) · ( 1/(((1 − α)A − 2α)δ) )² ⌉, for a sufficiently large constant C > 0.
3. Using Fact A.3, construct C_µ, an ϵ′-cover of B_d(R), and C_ω, an η-cover of B_d(B_δ), where ϵ′ ← min(α/(2(1 − α)πB_δ), ϵ) and η ← min(δ/(2L), A/(2πR)).
4. Define the set S_ω ← {ω ∈ C_ω : |ϕ_D(ω)| ≥ δ/2}.
5. Draw x₁, …, x_n ∼ D_µ^(α) once; and for each ω ∈ S_ω compute the empirical characteristic function of the contaminated samples, ϕ̂(ω) = (1/n) Σ_{j=1}^n e^{2πiω·x_j}.
6. For each ω ∈ S_ω, set ψ̂(ω) ← ϕ̂(ω)/ϕ_D(ω).
7. For each µ̂ ∈ C_µ and ω ∈ S_ω, set T̂_µ̂(ω) ← (1 − α)e^{2πiω·µ̂} − ψ̂(ω).
8. For each µ̂ ∈ C_µ, compute the score s(µ̂) ← max_{ω∈S_ω} |T̂_µ̂(ω)|.
9. Output µ̂ ∈ argmin_{θ∈C_µ} s(θ).

Algorithm 1: Robust mean estimation via frequency witnesses.
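To make the procedure concrete, here is a one-dimensional sketch of Algorithm 1 (our simplification, not the paper's implementation: D = N(0, 1) with its characteristic function known in closed form, uniform grids in place of the covers from Fact A.3, and illustrative parameter choices):

```python
import numpy as np

# One-dimensional sketch (our simplification) of Algorithm 1 for D = N(0, 1):
# scan a cover of candidate means and score each candidate by its worst-case
# discrepancy over a cover of witness frequencies.
rng = np.random.default_rng(2)
n, alpha, mu = 200_000, 0.1, 0.7

y = rng.normal(0.0, 1.0, n)
x = y + np.where(rng.random(n) < alpha, mu + 3.0, mu)  # alpha-mean-shift contamination

phi_D = lambda w: np.exp(-2 * np.pi**2 * w**2)         # known characteristic function
cand = np.arange(-1.0, 1.0001, 0.05)                   # cover of B(R), R = 1
freqs = np.arange(0.05, 0.4001, 0.01)                  # cover of frequencies
delta = 0.04
S = freqs[phi_D(freqs) >= delta]                       # keep only usable frequencies

ecf = np.array([np.mean(np.exp(2j * np.pi * w * x)) for w in S])
psi_hat = ecf / phi_D(S)                               # ratio estimator for phi_Q

def score(m):
    return np.max(np.abs((1 - alpha) * np.exp(2j * np.pi * S * m) - psi_hat))

mu_hat = min(cand, key=score)
```

Far candidates incur a score of roughly 2(1 − α)|sin(πωv)| − α at their witness frequency, while candidates near µ score about α plus sampling noise, so the argmin lands within the target accuracy.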
This condition essentially guarantees that every incorrect direction v = µ̂ − µ with ∥v∥ ≥ ϵ has an associated frequency ω where D has nontrivial Fourier mass (|ϕ_D(ω)| ≥ δ) and where the phase shift induced by v is detectable: |sin(πω·v)| ≥ A. At such a frequency, the population characteristic function of the contaminated model,

    ϕ_{D_µ^(α)}(ω) = ϕ_D(ω) ( (1 − α)e^{2πiω·µ} + α ϕ_Q(ω) ) ,

cannot be well-approximated by (1 − α)e^{2πiω·µ̂} ϕ_D(ω) if µ̂ is far from µ. Thus, if we scan a finite cover of bounded frequencies and pick the µ̂ that minimizes the worst-case discrepancy between the empirical characteristic function and (1 − α)e^{2πiω·µ̂} ϕ_D(ω), the frequency witnesses force all far candidates to incur a large penalty, while all near candidates have uniformly small discrepancy. We will assume Lipschitzness of ϕ_D, which lets us discretize frequencies without losing the witness property. This is exactly what Algorithm 1 does. We remark that, although Algorithm 1 uses a value oracle for ϕ_D (available for most well-known distributions) for simplicity, a sampling oracle for D also suffices (see Remark B.1 for details). We proceed with the proof of our upper bound.

Theorem 3.2 (Upper bound via frequency witnesses). Let α ∈ (0, 1/2), ϵ ∈ (0, 1), and let µ ∈ R^d be an unknown vector satisfying ∥µ∥ ≤ R, for a known R > 1. Fix parameters L > 0 and A ∈ (0, 1] with (1 − α)A − 2α > 0, and δ > 0. Let D be a distribution over R^d with density p_D. Assume that:
1. D satisfies the (ϵ, A, δ)-frequency-witness condition.
2. ∂p_D/∂x_j ∈ L¹(R^d) for all j ∈ [d], and define M₁ := max_{j∈[d]} ∥∂p_D/∂x_j∥_{L¹(R^d)}.
3. We have access to a value oracle for ϕ_D, and ϕ_D is L-Lipschitz over B_d(√d M₁/(2πδ)).

Then, Algorithm 1 draws

    n = O( d log( √d M₁ R L / (δ²A) ) · ( 1/(((1 − α)A − 2α)δ) )² )

i.i.d.
α-mean-shift contaminated samples from D_µ, and outputs an estimator µ̂ satisfying ∥µ̂ − µ∥ ≤ ϵ with probability at least 2/3.

Before we proceed with the proof, we briefly comment on the conditions. First, the bounded-mean assumption is easy to reduce to algorithmically for many distributions. For instance, when the distribution has bounded covariance, one can use adversarially robust estimators and then center the data to obtain a bounded mean (see Fact A.4). Second, the Lipschitz continuity of ϕ_D is implied by a finite absolute first moment of D. Indeed, by the definition of the characteristic function and standard inequalities for the exponential function, one obtains the required Lipschitz bound.

Proof of Theorem 3.2. First, we show that every (ϵ, A, δ)-frequency witness has bounded norm; this will help us bound the frequency cover radius. The proof uses the fact that smoothness of a function results in faster decay of its Fourier transform (see Section B for the proof).

Claim 3.3 (Every frequency witness has bounded norm). Fix ϵ, A, δ ∈ (0, 1). If ω is an (ϵ, A, δ)-frequency-witness for some direction v (i.e., |sin(πv·ω)| ≥ A and |ϕ_D(ω)| ≥ δ), then necessarily ∥ω∥ ≤ B_δ, where B_δ := √d M₁/(2πδ).

Now note that since any frequency witness ω necessarily satisfies ∥ω∥ ≤ B_δ (by Claim 3.3), it suffices to construct the frequency cover inside B_d(B_δ). Let C_µ be the ϵ′-cover of B_d(R) with ϵ′ = min(α/(2(1 − α)πB_δ), ϵ), and C_ω be the η-cover of B_d(B_δ) with η = min(δ/(2L), A/(2πR)), constructed in Line 3. By Fact A.3 we have that |C_µ| ≤ (5R/ϵ′)^d and |C_ω| ≤ (5B_δ/η)^d. Let ϕ̂ : R^d → C be the empirical characteristic function of the contaminated samples and ϕ its population counterpart. For the set of ω ∈ R^d where ϕ_D(ω) ≠ 0, define the function ψ̂(ω) := ϕ̂(ω)/ϕ_D(ω).
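The decay fact behind Claim 3.3 is the standard integration-by-parts bound |ϕ_D(ω)| ≤ M₁/(2π|ω|) in one dimension, which immediately forces any frequency with |ϕ_D(ω)| ≥ δ to satisfy |ω| ≤ M₁/(2πδ). A quick numerical check for the standard normal (our code; here M₁ = ∥p′∥_{L¹} = √(2/π)):

```python
import numpy as np

# Numerical illustration (ours) of the decay bound behind Claim 3.3:
# integration by parts gives |phi_D(w)| <= M_1 / (2 pi |w|) with
# M_1 = ||p'||_{L1}; hence |phi_D(w)| >= delta forces |w| <= M_1 / (2 pi delta).
M1 = np.sqrt(2 / np.pi)                  # ||p'||_{L1} for the standard normal

ws = np.linspace(0.05, 3.0, 500)
phi = np.exp(-2 * np.pi**2 * ws**2)      # cf of N(0, 1), e^{2 pi i w x} convention
bound = M1 / (2 * np.pi * ws)
margin = np.min(bound - phi)             # bound should dominate phi everywhere
```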
Note that ψ(ω) := E[ψ̂(ω)] = (1 − α)e^{2πiω·µ} + αϕ_Q(ω). Also denote by T̂_µ̂(ω) := (1 − α)e^{2πiω·µ̂} − ψ̂(ω) the test function computed in Line 7, and by T_µ̂(ω) := E[T̂_µ̂(ω)] = (1 − α)e^{2πiω·µ̂} − ψ(ω) its population counterpart. Denote the set Ω(δ, B_δ) := {ω ∈ R^d : |ϕ_D(ω)| > δ, ∥ω∥ ≤ B_δ}. Note that Algorithm 1 essentially consists of finding a µ̂ ∈ C_µ such that max_{ω∈Ω(δ,B_δ)} T_µ̂(ω) is small.

First we prove that for a candidate mean µ̂, if the direction µ̂ − µ exhibits a frequency witness ω, then T_µ̂(ω) is large. For the proof, we refer the reader to Section B.

Claim 3.4 (Distant candidates have a large T_µ̂). Let µ̂ ∈ R^d be a candidate mean such that ∥µ̂ − µ∥ > ϵ. If ω ∈ R^d is an (ϵ, A, δ)-frequency-witness for the direction v := µ̂ − µ, then |T_µ̂(ω)| ≥ 2(1 − α)A − α.

Now we prove that if a candidate mean µ̂ is sufficiently close, then T_µ̂(ω) is small for all ω in a large ball. For the proof, we refer the reader to Section B.

Claim 3.5 (Close candidates have small T_µ̂). Let µ̂ ∈ R^d be a candidate mean and set v := µ̂ − µ. Then, |T_µ̂(ω)| ≤ 2(1 − α)π∥ω∥∥v∥ + α.

The next claim shows that if we already have a good frequency witness for a direction v, then passing to an η-cover does not destroy it. Instead, we can find a nearby point in the cover that remains a frequency witness, with only small losses in the parameters. For the proof, we refer the reader to Section B.

Claim 3.6 (Covering over ω preserves frequency witnesses). Fix v ∈ R^d. Let C_ω be an η-cover of B_d(B_δ). If there exists an (ϵ, A, δ)-frequency-witness for the direction v, then there exists an ω ∈ C_ω that is an (ϵ, A − πη∥v∥, δ − ηL)-frequency-witness for v.

We now prove that we can compute T̂_µ̂(ω) efficiently with samples for all µ̂ ∈ C_µ, ω ∈ C_ω.
This essentially follows from the fact that the characteristic function is the average of complex numbers of unit absolute value (for the proof, we refer the reader to Section B).

Claim 3.7 (Concentration of T̂). Fix a sufficiently large universal constant C > 0 and η, τ ∈ (0, 1). If n ≥ C log(|C_ω|/τ)/(ηδ)², then with probability at least 1 − τ, for any ω ∈ C_ω such that |ϕ_D(ω)| ≥ δ and any µ̂ ∈ C_µ, it holds that |T_µ̂(ω) − T̂_µ̂(ω)| ≤ η.

Finally, we can proceed with the proof of the main theorem. Denote by S_ω the search set of frequencies used by the algorithm, S_ω = {ω ∈ C_ω : ϕ_D(ω) > δ} (see Line 4). Note that since C_ω is an η-cover of B_d(B_δ) with η = min(δ/(2L), A/(2πR)), by Claim 3.6 we have that for every v ∈ B_d(R) there exists an (ϵ, A/2, δ/2)-frequency-witness in C_ω. Hence, from Claim 3.4 we have that for any µ̂ such that ∥µ̂ − µ∥ ≥ ϵ there exists an ω ∈ S_ω such that T_µ̂(ω) > T_bad := (1 − α)A − α. From Claim 3.5, we have that for a µ̂ such that ∥µ̂ − µ∥ ≤ ϵ′ (note that ϵ′ ≤ ϵ), for any ω ∈ S_ω, T_µ̂(ω) < T_good := 2(1 − α)πB_δϵ′ + α. Denote for convenience c = (1 − α)A/2 − α. Note that since ϵ′ < ((1 − α)A − 2α)/(4(1 − α)πB_δ), we have that T_bad > T_good + c. Therefore, by Claim 3.7 and the fact that n ≥ C d log(B_δ/η)/(cδ)² for a sufficiently large constant C, we have that max_{ω∈S_ω} T̂_µ̂′(ω) < max_{ω∈S_ω} T̂_µ̂(ω) for any µ̂, µ̂′ ∈ C_µ such that ∥µ̂ − µ∥ ≥ ϵ and ∥µ̂′ − µ∥ ≤ ϵ′. This implies that no µ̂ with ∥µ̂ − µ∥ ≥ ϵ can be returned by the algorithm, since there exists a candidate µ̂′ ∈ C_µ with ∥µ̂′ − µ∥ ≤ ϵ′, and hence with max_{ω∈S_ω} T̂_µ̂′(ω) < max_{ω∈S_ω} T̂_µ̂(ω).
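The concentration behind Claim 3.7 is easy to visualize empirically: the empirical characteristic function is an average of unit-modulus numbers, so its maximal deviation over a frequency cover shrinks at the 1/√n rate (up to the log|C_ω| factor from the union bound). A small sketch (ours; the grid and sample sizes are illustrative):

```python
import numpy as np

# Empirical illustration (ours) of the concentration in Claim 3.7: the
# empirical cf is an average of unit-modulus complex numbers, so its maximal
# deviation over |C_w| frequencies is O(sqrt(log|C_w| / n)).
rng = np.random.default_rng(3)
ws = np.linspace(0.05, 0.5, 46)                       # toy frequency cover

def max_dev(n):
    x = rng.normal(0.0, 1.0, n)                       # clean samples from N(0, 1)
    ecf = np.array([np.mean(np.exp(2j * np.pi * w * x)) for w in ws])
    true = np.exp(-2 * np.pi**2 * ws**2)              # population cf
    return np.max(np.abs(ecf - true))

d_small, d_large = max_dev(1_000), max_dev(100_000)   # deviation shrinks with n
```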
4 Sample Complexity Lower Bound

In this section, we derive a lower bound for one-dimensional distributions under the negation of the assumption used for the algorithm, together with some other mild tail bounds on the distributions. Recall that the frequency-witness condition (Definition 3.1) used for the upper bound states that for all directions v ∈ R^d with ∥v∥ ≥ ϵ there exists a frequency ω ∈ R^d such that v·ω is far from an integer and |ϕ_D(ω)| is bounded away from 0. Our 1-d hard instances satisfy (a form of) the negation of this condition: ϕ_D(ω) has negligible L² mass whenever ϵω is far from an integer, i.e., its Fourier mass concentrates in thin bands around the lattice {ω ≈ k/ϵ : k ∈ Z}. Additionally, we impose an assumption controlling the tails of our distribution. This assumption ensures that the distribution has sufficient tail decay to relate the total variation distance (the L¹ norm) to the L² norm, and hence to convert Fourier-transform closeness into L² closeness via Parseval's identity. Moreover, for our applications, the parameter quantifying this decay is constant and therefore does not significantly affect our lower-bound complexity. In particular, we present the following theorem:

Theorem 4.1 (1-Dimensional Lower Bound). There exists a sufficiently large universal constant C > 0 such that the following holds. Fix α ∈ (0, 1/2) and ϵ, δ, σ > 0 with σ > Cδ. Let D be a distribution over R that satisfies the following conditions:
1. ∥ϕ_D(ω) 1{dist(ϵω, Z) > cα}∥_{L²} ≤ δ, for a sufficiently small constant c > 0.
2. For all R > 0, Pr_{x∼D}[|x| ≥ R] ≤ σ/R.

Then, any algorithm that estimates the mean to error at most ϵ, with probability at least 2/3, from α-mean-shift contaminated samples (as in Definition 1.1), must use

    Ω( 1 / ( (δσ)^{1/3} + (ϵ/α)³ (δ/σ)² ) )

samples.
In particular, if (ϵ/α)³(δ/σ)² ≤ (δσ)^{1/3}, then this simplifies to Ω(1/(δσ)^{1/3}) samples.

Before we move to the proof, we make some important remarks about the statement of Theorem 4.1.

Remark 4.2 (High-Dimensional Lower Bound). While Theorem 4.1 is a lower bound against univariate distributions, it also induces a lower bound against multivariate distributions by considering product distributions. In particular, let D be a product distribution over ℝ^d whose coordinate marginals are independent and each satisfy the conditions of Theorem 4.1. In all but rare cases, estimating the mean to error ϵ requires at least Ω(d) samples. Moreover, since the coordinates are independent, the one-dimensional lower bound (denoted by lb) still applies: we may consider an instance in which both the mean displacement and the noise lie entirely in the first coordinate. As a result, max(lb, Ω(d)) ≥ (lb + Ω(d))/2 is a lower bound. By Hölder's inequality, this implies a lower bound of (Ω(d))^{1/p} · lb^{1/q} for 1/p + 1/q = 1. Concretely, by taking p close to 1, this yields a lower bound of d^{0.99} e^{Ω((α/ϵ)²)} against d-dimensional Gaussian distributions, which is tight up to polynomial factors.

Remark 4.3 (Conditions of Upper & Lower Bound). We remark that the condition |sin(πω·v)| ≥ A is equivalent to requiring that v·ω be bounded away from integers: indeed, |sin(πt)| = sin(π dist(t, ℤ)), so |sin(πω·v)| ≥ A iff dist(v·ω, ℤ) ≥ arcsin(A)/π. Thus, the frequency-witness condition demands frequencies ω where ω·v is far from an integer and D has nontrivial Fourier mass. In contrast, the lower bound assumes that the L2 mass of ϕ_D concentrates within small bands around integer points {ω : dist(v·ω, ℤ) ≤ cα}, i.e., regions where |sin(πω·v)| ≤ sin(πcα) is small.
Therefore, the regions that both conditions consider are essentially the same. Now, the only mismatch is that the witness condition uses the L∞ norm of ϕ_D over S := {ω ∈ ℝ : dist(ϵω, ℤ) > cα}, whereas the lower-bound condition uses the L2 norm. However, under an additional derivative assumption (which holds for all the examples studied in Section D), Claim C.1 implies that ∥ϕ_D 1_S∥_{L2} ≲ √(∥ϕ_D 1_S∥_{L∞}). Therefore, the upper- and lower-bound conditions relate polynomially to each other. More generally, stronger smoothness yields faster Fourier decay and an even tighter link between pointwise and L2 control.

Remark 4.4 (Tail Condition). Condition (2) of Theorem 4.1 is a mild tail-decay requirement: it bounds the mass of D outside [−R, R] by σ/R, which is exactly what we need to control the truncation error in Lemma 4.5. In particular, if E_D[|x|] ≤ σ, then Pr_D(|x| ≥ R) ≤ σ/R for all R > 0 by Markov's inequality. Bounded variance also implies Condition (2), as it implies bounded E_D[|x|].

To derive our lower bound, we construct two distributions P and Q whose characteristic functions are nearly indistinguishable on a large set of frequencies (the complement of the set in Condition (1) of Theorem 4.1). Let p, q be the density functions of P, Q. To convert this Fourier-space closeness into a statistical complexity lower bound, we need an inequality that controls the distributional distance between two distributions by the Fourier discrepancies of their characteristic functions. The next lemma provides the key bridge: the total variation distance d_TV(P, Q) can be bounded in terms of the L2 norm of the characteristic-function difference, ∥Δϕ∥_{L2}, and the L1 discrepancy between the two densities on the tails, ∥(p − q)1{|x| > R}∥_{L1}. The proof is fairly standard and is deferred to Section C.
Lemma 4.5 (Characteristic function to TV distance closeness). Let P and Q be distributions over ℝ with densities p and q respectively. Denote by Δϕ := ϕ_P − ϕ_Q the difference of their characteristic functions. Then, for every R > 0,
d_TV(P, Q) ≤ (1/2)∥(p − q)1{|x| > R}∥_{L1} + √(R/2) ∥Δϕ∥_{L2}.

Next, we present the Fourier construction that we will use for our lower bound. By Lemma 4.5, it is enough to control ∥Δϕ∥_{L2} and ∥(p − q)1{|x| > R}∥_{L1}. The hypothesis on D controls the L2 contribution away from the lattice bands, since on that region ϕ_D is small in L2. Thus, the main difficulty is controlling Δϕ on the bands where |ϕ_D| may be large. Fix A > 0 (to be chosen) and suppose that |ϵω − k| ≤ A for some k ∈ ℤ. Note that, without noise, we have
|ϕ_{D_{ϵ/2}}(ω) − ϕ_{D_{−ϵ/2}}(ω)| = |ϕ_D(ω)(e^{πiϵω} − e^{−πiϵω})| = 2|ϕ_D(ω) sin(πϵω)| ≤ 2|ϕ_D(ω)| |sin(πA)|.
While the factor |sin(πA)| is small for small A, this does not by itself control the L2 norm on the union of bands, because ϕ_D is otherwise unconstrained there and may have large (and even infinite) L2 mass (e.g., for D = N(0, σ²), ϕ_D(ω) = e^{−2π²σ²ω²} and ∥ϕ_D∥²_{L2} = √π/(2πσ), which can be large for small σ). However, for any point ω in this region the amplitude bound |ϕ_D(ω)| |sin(πA)| ≤ |sin(πA)| is small when A is small; in our setting we will take A on the order of the noise rate α. This makes it possible to design two noise distributions such that the difference of their characteristic functions, multiplied by the noise rate α, approximates e^{πiϵω} − e^{−πiϵω} arbitrarily well on this region, essentially canceling its contribution to d_TV. Crucially, this matching is performed via a smooth periodic window, which also yields the tail control needed in Lemma 4.5. Together, these two properties yield the desired result.
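Lemma 4.5 can be sanity-checked numerically. The sketch below does so for two unit Gaussians shifted to ±ϵ/2 (so Δϕ is exactly the "without noise" expression above); the grids and the choices ϵ = 0.3, R = 10 are purely illustrative.

```python
import numpy as np

# Numerical sanity check of Lemma 4.5 for P = N(eps/2, 1), Q = N(-eps/2, 1).
eps, R = 0.3, 10.0

xs = np.arange(-30.0, 30.0, 3e-4)
dx = xs[1] - xs[0]
p = np.exp(-(xs - eps / 2) ** 2 / 2) / np.sqrt(2 * np.pi)
q = np.exp(-(xs + eps / 2) ** 2 / 2) / np.sqrt(2 * np.pi)
tv = 0.5 * np.sum(np.abs(p - q)) * dx                    # d_TV(P, Q)
tail = np.sum(np.abs(p - q)[np.abs(xs) > R]) * dx        # ||(p-q)1{|x|>R}||_L1

ws = np.arange(-5.0, 5.0, 1e-4)
dw = ws[1] - ws[0]
# Delta_phi(w) = phi_D(w) * (e^{pi i eps w} - e^{-pi i eps w}), phi_D the CF of N(0,1)
dphi = np.exp(-2 * np.pi**2 * ws**2) * 2j * np.sin(np.pi * eps * ws)
l2 = np.sqrt(np.sum(np.abs(dphi) ** 2) * dw)             # ||Delta_phi||_L2

assert tv <= 0.5 * tail + np.sqrt(R / 2) * l2            # the lemma's inequality
```

Here the right-hand side is dominated by the √(R/2)∥Δϕ∥_{L2} term, and the inequality holds with room to spare.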
The next lemma formalizes this Fourier matching construction (see Section C for the proof).

Lemma 4.6 (Fourier Matching). Let ϵ, α ∈ (0, 1), f̂(ω) := (1 − α)(e^{πiϵω} − e^{−πiϵω})/α, and f := F⁻¹[f̂]. For every w > 0 there exists a function ρ̂_w : ℝ → ℂ with ρ̂_w(ω) = 1 for all ω with |ω − i/ϵ| ≤ w for some i ∈ ℤ, ĝ := f̂ · ρ̂_w, and g := F⁻¹[ĝ], such that the following hold:
1. g is a real-valued signed measure.
2. ∫_{−∞}^{∞} g(x) dx = 0.
3. ∥g∥_{L1} ≲ ϵw/α.
4. For every R > max{2ϵ, 2/w}, it holds that ∥g 1{|x| > R}∥_{L1} ≲ (ϵ/α) · 1/(w²R³).

We remark on the difference between the above construction and previous work:

Remark 4.7 (Construction of [KG25]). Our Fourier matching differs from [KG25]. Their work focuses only on the Gaussian setting and enforces P̂_0(ω) = P̂_1(ω) on an interval around the origin. For general distributions, matching only around ω = 0 is not sufficient. Instead, we apply a smooth window that equals 1 on [−w, w] and vanishes outside [−2w, 2w], and then periodize over multiples of 1/ϵ. In this way, the discrepancy is canceled across the entire set {ω : dist(ωϵ, ℤ) ≤ cα}. Moreover, the smoothness of the window ensures that our construction is realized by distributions and provides the tail control needed for our total-variation bounds.

4.1 Proof of Theorem 4.1

In this section, we prove our main lower bound result.

Proof of Theorem 4.1. Let Q_0, Q_1 be arbitrary distributions over ℝ. We define the distributions
P_0 := (1 − α) δ_{ϵ/2} ∗ D + α Q_0 ∗ D and P_1 := (1 − α) δ_{−ϵ/2} ∗ D + α Q_1 ∗ D.
Note that
Δϕ(ω) := ϕ_{P_0}(ω) − ϕ_{P_1}(ω) = ϕ_D(ω)[(1 − α)(e^{πiϵω} − e^{−πiϵω}) + α(ϕ_{Q_0}(ω) − ϕ_{Q_1}(ω))].
Note that any finite signed measure h on ℝ with ∥h∥_{L1} = 2 and ∫_ℝ h(x) dx = 0 can be realized as the difference of two probability distributions, by setting Q_0 to be h_+ and Q_1 to be h_−, where h_+ and h_− form the Jordan decomposition of h. Therefore, by applying Lemma 4.6 with w = cα/ϵ for a sufficiently small constant c > 0, we get the existence of a signed measure g with ∥g∥_{L1} ≤ 2. Let m := ∥g∥_{L1}/2 < 1 and define Q_0 and Q_1 as the probability measures
Q_0 := g_− + (1 − m) δ_0, Q_1 := g_+ + (1 − m) δ_0,
respectively. Note that Q_0 − Q_1 = −g. Therefore, we obtain the existence of Q_0, Q_1 such that Δϕ(ω) = 0 for all ω ∈ ℝ satisfying |ω − i/ϵ| ≤ cα/ϵ for some i ∈ ℤ. Now we use this fact, along with our assumptions, to bound the total variation distance between P_0 and P_1. For notational simplicity, let E_0 := (1 − α)δ_{ϵ/2} + αQ_0 and E_1 := (1 − α)δ_{−ϵ/2} + αQ_1, and denote Δϕ_E(ω) = ϕ_{E_0}(ω) − ϕ_{E_1}(ω). Thus, we have Δϕ = ϕ_D Δϕ_E and P_j = D ∗ E_j. We will use Lemma 4.5 to bound d_TV(P_0, P_1). Let p_0, p_1 be the densities of P_0, P_1, and write h := p_0 − p_1. By Lemma 4.5, for every R > 0,
d_TV(P_0, P_1) ≤ (1/2)∥h 1{|x| > R}∥_{L1} + √(R/2) ∥Δϕ∥_{L2}.
We first bound the Fourier term. Since |Δϕ_E(ω)| ≤ 2 by Fact A.1, we have
∥Δϕ∥_{L2} = ∥ϕ_D(ω) Δϕ_E(ω) 1{dist(ϵω, ℤ) > cα}∥_{L2} ≤ 2∥ϕ_D(ω) 1{dist(ϵω, ℤ) > cα}∥_{L2} ≤ 2δ,
where in the last inequality we used the assumption that ∥ϕ_D(ω) 1{dist(ϵω, ℤ) > cα}∥_{L2} ≤ δ. Therefore, (R/2)^{1/2} ∥Δϕ∥_{L2} ≲ √R δ. It remains to bound the tail term ∥h 1{|x| > R}∥_{L1}. Since P_j = D ∗ E_j, we have h = p_0 − p_1 = D ∗ (E_0 − E_1). Let ΔE := E_0 − E_1 (a finite signed measure). Using |D ∗ ΔE| ≤ D ∗ |ΔE|, we get
∥h 1{|x| > R}∥_{L1} = ∫_{|x|>R} |(D ∗ ΔE)(x)| dx ≤ ∫_{|x|>R} (D ∗ |ΔE|)(x) dx.
Multiplying and dividing by ∫_ℝ |ΔE|(y) dy (so that |ΔE|/∫_ℝ |ΔE|(y) dy is a probability measure) and using a union bound (if x = y + z with |x| > R, then either |y| > R/2 or |z| > R/2), we get
∫_{|x|>R} (D ∗ |ΔE|)(x) dx ≤ Pr_D[|x| > R/2] ∫_ℝ |ΔE| dy + ∫_{|y|>R/2} |ΔE| dy.
Since E_0, E_1 are probability distributions, ∫_ℝ |ΔE| dy ≤ 2. Therefore,
∥h 1{|x| > R}∥_{L1} ≤ 2 Pr_D[|x| > R/2] + ∫_{|y|>R/2} |ΔE| dy.
Next, we bound ∫_{|y|>R/2} |ΔE| dy using the structure of ΔE. By definition, ΔE = (1 − α)(δ_{ϵ/2} − δ_{−ϵ/2}) + α(Q_0 − Q_1). By our choice of Q_0, Q_1, we have ΔE = (1 − α)(δ_{ϵ/2} − δ_{−ϵ/2}) − αg. Thus |ΔE| ≤ (1 − α)(δ_{ϵ/2} + δ_{−ϵ/2}) + α|g|. In particular, for R > ϵ,
∫_{|y|>R/2} |ΔE| dy ≤ α ∫_{|y|>R/2} |g(y)| dy = α∥g 1{|y| > R/2}∥_{L1}.
Consider R greater than a sufficiently large constant. By item (4) of Lemma 4.6 with radius R/2 (note that R > max(4ϵ, 4/w), since 1/w = O(ϵ/α) = O(1) and R is greater than a sufficiently large constant), we obtain
∥g 1{|y| > R/2}∥_{L1} ≲ (ϵ/α) · 1/(w²(R/2)³) ≲ (ϵ/(αR))³.
Combining the above bounds gives
∥h 1{|x| > R}∥_{L1} ≤ 2 Pr_{x∼D}[|x| > R/2] + O((ϵ/(αR))³).
Therefore,
d_TV(P_0, P_1) ≲ σ/R + (ϵ/(αR))³ + √R δ.
Choosing R = (σ/δ)^{2/3} (note that δ/σ is assumed to be less than a sufficiently small constant), we have
d_TV(P_0, P_1) ≲ (δσ)^{1/3} + (ϵ/α)³(δ/σ)².
Finally, consider the decision problem of receiving samples generated from either P_0 or P_1 and deciding which of the two generated them. This problem is no harder than estimating the mean to error ϵ/2 under the mean-shift contamination model, since the means associated with P_0, P_1 have distance ϵ. Given that P_0, P_1 are each drawn with probability 1/2 to generate the samples, using n i.i.d.
samples, the optimal error for the decision problem is p⋆(n) = (1/2)(1 − d_TV(P_0^⊗n, P_1^⊗n)). By tensorization, d_TV(P_0^⊗n, P_1^⊗n) ≤ 1 − (1 − d_TV(P_0, P_1))^n, hence p⋆(n) ≥ (1/2)(1 − d_TV(P_0, P_1))^n. Hence, to achieve error less than 1/3, it is necessary that (1 − d_TV(P_0, P_1))^n ≤ 2/3, which implies that n ≥ log(3/2)/log(1/(1 − d_TV(P_0, P_1))) = Ω(1/d_TV(P_0, P_1)). Combining with the bound above concludes the proof of Theorem 4.1.

References

[BNJT10] M. Barreno, B. Nelson, A. D. Joseph, and J. D. Tygar. The security of machine learning. Machine Learning, 81(2):121–148, 2010.
[BNL12] B. Biggio, B. Nelson, and P. Laskov. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, 2012.
[CD19] O. Collier and A. Dalalyan. Multidimensional linear functional estimation in sparse Gaussian models and robust estimation of the mean. Electronic Journal of Statistics, 13:2830–2864, 2019.
[CDRV21] A. Carpentier, S. Delattre, E. Roquain, and N. Verzelen. Estimating minimum effect with outlier selection. The Annals of Statistics, 49(1):272–294, 2021.
[CJ10] T. T. Cai and J. Jin. Optimal rates of convergence for estimating the null density and proportion of nonnull effects in large-scale multiple testing. The Annals of Statistics, 38(1):100–145, 2010.
[DFDL24] X. Du, Z. Fang, I. Diakonikolas, and Y. Li. How does unlabeled data provably help out-of-distribution detection? In The Twelfth International Conference on Learning Representations (ICLR), 2024.
[DIKP25] I. Diakonikolas, G. Iakovidis, D. Kane, and T. Pittas. Efficient multivariate robust mean estimation under mean-shift contamination. In International Conference on Machine Learning, pages 13570–13600. PMLR, 2025.
[DK23] I. Diakonikolas and D. M. Kane. Algorithmic High-Dimensional Robust Statistics.
Cambridge University Press, 2023.
[DKK+19a] I. Diakonikolas, G. Kamath, D. Kane, J. Li, J. Steinhardt, and A. Stewart. Sever: A robust meta-algorithm for stochastic optimization. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, pages 1596–1606, 2019.
[DKK+19b] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742–864, 2019. Conference version appeared in FOCS 2016.
[Efr04] B. Efron. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the American Statistical Association, 99(465):96–104, 2004.
[Efr07] B. Efron. Correlation and large-scale simultaneous significance testing. Journal of the American Statistical Association, 102(477):93–103, 2007.
[Efr08] B. Efron. Microarrays, empirical Bayes and the two-groups model. 2008.
[G+08] L. Grafakos et al. Classical Fourier Analysis, volume 2. Springer, 2008.
[Gan07] I. Gannaz. Robust estimation and wavelet thresholding in partially linear models. Statistics and Computing, 17:293–310, 2007.
[HR09] P. Huber and E. M. Ronchetti. Robust Statistics. Wiley, New York, 2009.
[Hub64] P. J. Huber. Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101, 1964.
[KG25] S. Kotekal and C. Gao. Optimal estimation of the null distribution in large-scale inference. IEEE Transactions on Information Theory, 2025.
[KKLZ26] A. Kalavasis, P. K. Kothari, S. Li, and M. Zampetakis. Learning mixture models via efficient high-dimensional sparse Fourier transforms. arXiv preprint arXiv:2601.05157, 2026.
[Li23] S. Li. Robust mean estimation against oblivious adversaries. Master's thesis, Carnegie Mellon University, Pittsburgh, PA, 2023.
[LRV16] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance.
In Proceedings of FOCS'16, 2016.
[MW07] L. McCann and R. E. Welsch. Robust variable selection using least angle regression and elemental set sampling. Computational Statistics & Data Analysis, 52(1):249–257, 2007.
[SKL17] J. Steinhardt, P. W. Koh, and P. Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 3517–3529, 2017.
[SO11] Y. She and A. B. Owen. Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106(494):626–639, 2011.
[SS11] E. M. Stein and R. Shakarchi. Fourier Analysis: An Introduction, volume 1. Princeton University Press, 2011.
[STB01] S. Sardy, P. Tseng, and A. Bruce. Robust wavelet denoising. IEEE Transactions on Signal Processing, 49(6):1146–1152, 2001.
[TLM18] B. Tran, J. Li, and A. Madry. Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, pages 8011–8021, 2018.
[Tuk60] J. W. Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, pages 448–485, 1960.
[Ver10] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices, 2010.

Appendix

The appendix is structured as follows: First, Section A includes additional preliminaries required in subsequent technical sections. Section B contains the content omitted from our upper-bound section (Section 3). Section C contains the proofs of the technical lemmas of the lower bound (Section 4). Finally, Section D contains applications of our theorems to well-known distributions.

A Omitted Facts and Preliminaries

Additional Notation: We define the Dirac delta function δ(·) as the distribution with the property that for any measurable f : ℝ^d → ℝ, we have ∫_{ℝ^d} f(x)δ(x) dx = f(0).
We denote by χ_w : ℝ → ℝ the rectangular window function χ_w(x) := 1{x ∈ [−w, w]}. For a complex number z = a + bi ∈ ℂ, we denote by z̄ its conjugate z̄ = a − bi. We extend the L1 notation to finite signed measures (and, more generally, distributions) by duality: for such g, define ∥g∥_{L1} := sup_{∥f∥_∞ ≤ 1} ⟨g, f⟩, where ⟨g, f⟩ = ∫ f dg when g is a (signed) measure, so in particular ∥δ_z∥_{L1} = 1; when g admits a density g(x) dx, this agrees with ∫_{ℝ^d} |g(x)| dx. Let f^{∗m} denote the m-fold convolution of f with itself, i.e., f^{∗m} = f ∗ ⋯ ∗ f with m factors.

Fact A.1 (Useful Fourier Transform facts (see [SS11])). Let f be a finite measure and let ϕ(ω) = F[f]. Then the following properties hold:
(i) If ϕ is a function with ϕ(−ω) equal to the complex conjugate of ϕ(ω) for all ω ∈ ℝ, then f is a real-valued measure.
(ii) If ∫_ℝ |f(x)| dx ≤ 1, then |ϕ(ω)| ≤ 1 for all ω ∈ ℝ.
(iii) If ϕ is T-periodic for some T > 0, then the inverse Fourier transform is the discrete measure f(x) = Σ_{k∈ℤ} a_k δ(x − k/T), with a_k = (1/T) ∫_0^T ϕ(ω) e^{−2πikω/T} dω, k ∈ ℤ.

Fact A.2 (Fourier Derivatives [G+08]). Let f : ℝ^d → ℝ be an absolutely integrable function with absolutely integrable partial derivatives, and denote f̂ := F[f]. Then, for all j ∈ [d], it holds that F[∂f/∂x_j] = −2πiω_j f̂(ω).

Fact A.3 (ϵ-Cover of a Ball (see [Ver10])). Let R > 0 and d ∈ ℤ_+. There exists an ϵ-cover G of B_d(R) in ℓ_2 such that log|G| ≤ Cd log(R/ϵ), for a universal constant C > 0. In particular, one may take |G| ≤ (1 + 4R/ϵ)^d.

Fact A.4 (Adversarially Robust Estimator [DK23]). Let 0 < α < 1/3 and let D be a distribution over ℝ^d with µ := E_{x∼D}[x] such that Cov(D) ⪯ σ²I. There exists an algorithm that, given α, σ, and N = O(d log(1/δ)/α) i.i.d.
Huber α-corrupted samples (samples from a distribution of the form (1 − α)D + αN for an arbitrary distribution N over ℝ^d), in poly(Nd) time returns µ̂ such that, with probability 1 − δ, it holds that ∥µ̂ − µ∥ = O(σ√α).

B Omitted Content from Section 3

In this section, we provide the content omitted from our upper bound section.

Remark B.1 (Access to distribution D). We remark that even though Algorithm 1 uses, for simplicity, a value oracle for ϕ_D (which is available for most well-known distributions), a sampling oracle for D suffices. In particular, because |ϕ_D(ω)| ≤ 1 and all evaluations occur on a finite grid C_ω, the empirical characteristic function ϕ̂_D concentrates uniformly on C_ω with a modest number of clean samples. Hence, both uses of ϕ_D can be handled from data: (i) to construct S_ω, we can threshold ϕ̂_D at a slightly smaller level, so that frequency witnesses are retained and every included ω has |ϕ_D(ω)| bounded away from 0 w.h.p.; and (ii) to compute ψ̂, we can divide by ϕ̂_D(ω) instead of ϕ_D(ω), and ψ̂ turns out to be stable on the new S_ω since |ϕ_D(ω)| is bounded away from 0. Consequently, with an additional clean-sample budget on the order of the contaminated budget, a sampling oracle fully replaces the value oracle without altering the stated rates.

In what follows, we provide the proofs of the claims in Theorem 3.2. We remind the reader that B_δ, ϕ, ϕ̂, T_µ̂, T̂_µ̂, ψ, ψ̂, D, Q, ϕ_D, and ϕ_Q are the quantities defined in the proof of Theorem 3.2.

Claim B.2 (Every frequency witness has bounded norm). Fix ϵ, A, δ ∈ (0, 1). If ω is an (ϵ, A, δ)-frequency-witness for some direction v (i.e., |sin(πv·ω)| ≥ A and |ϕ_D(ω)| ≥ δ), then necessarily ∥ω∥ ≤ B_δ, where B_δ := √d M_1/(2πδ).

Proof of Claim B.2. Since A > 0, any frequency witness must satisfy ω ≠ 0. Let ω ∈ ℝ^d and choose an index j ∈ [d].
By Fact A.2, we have
∫_{ℝ^d} (∂/∂x_j) p_D(x) e^{2πiω·x} dx = 2πiω_j ϕ_D(ω),
which implies
ϕ_D(ω) = (1/(2πiω_j)) ∫_{ℝ^d} (∂/∂x_j) p_D(x) e^{2πiω·x} dx.
Choosing j such that |ω_j| = ∥ω∥_∞, we have
|ϕ_D(ω)| ≤ (1/(2π|ω_j|)) ∥∂p_D/∂x_j∥_{L1(ℝ^d)} ≤ M_1/(2π∥ω∥_∞) ≤ √d M_1/(2π∥ω∥),
which concludes the proof of Claim B.2.

Claim B.3 (Distant candidates have a large T_µ̂). Let µ̂ ∈ ℝ^d be a candidate mean such that ∥µ̂ − µ∥ > ϵ. If ω ∈ ℝ^d is an (ϵ, A, δ)-frequency-witness for the direction v := µ̂ − µ, then |T_µ̂(ω)| ≥ 2(1 − α)A − α.

Proof. Note that
T_µ̂(ω) = e^{2πiω·(µ+v/2)} (1 − α)(e^{πiω·v} − e^{−πiω·v}) − αϕ_Q(ω) = e^{2πiω·(µ+v/2)} 2i(1 − α) sin(πω·v) − αϕ_Q(ω).
Now, since ω is a frequency-witness for the direction v, we have |sin(πω·v)| ≥ A. Therefore, by the reverse triangle inequality, |T_µ̂(ω)| ≥ 2(1 − α)A − α, which concludes the proof of Claim B.3.

Claim B.4 (Close candidates have small T_µ̂). Let µ̂ ∈ ℝ^d be a candidate mean and set v := µ̂ − µ. Then |T_µ̂(ω)| ≤ 2(1 − α)π∥ω∥∥v∥ + α.

Proof of Claim B.4. Recall that
T_µ̂(ω) = (1 − α)e^{2πiω·µ̂} − ψ(ω) = (1 − α)(e^{2πiω·µ̂} − e^{2πiω·µ}) − αϕ_Q(ω).
Now, since
e^{2πiω·µ̂} − e^{2πiω·µ} = e^{2πiω·(µ+v/2)}(e^{πiω·v} − e^{−πiω·v}) = 2i e^{2πiω·(µ+v/2)} sin(πω·v),
we get |T_µ̂(ω)| ≤ 2(1 − α)|sin(πω·v)| + α|ϕ_Q(ω)|. Because |ϕ_Q(ω)| ≤ 1 for all ω ∈ ℝ^d, this yields |T_µ̂(ω)| ≤ 2(1 − α)|sin(πω·v)| + α. Finally, using that |sin t| ≤ |t| and |ω·v| ≤ ∥ω∥∥v∥ gives |T_µ̂(ω)| ≤ 2(1 − α)π∥ω∥∥v∥ + α, which concludes the proof of Claim B.4.

Claim B.5 (Covering over ω preserves frequency witnesses). Fix v ∈ ℝ^d. Let C_ω be an η-cover of B_d(B_δ).
If there exists an (ϵ, A, δ)-frequency-witness for the direction v, then there exists an ω ∈ C_ω that is an (ϵ, A − πη∥v∥, δ − ηL)-frequency-witness for v.

Proof of Claim B.5. Let ω_v be an (ϵ, A, δ)-frequency-witness for the direction v. By the definition of C_ω, there exists an ω ∈ C_ω such that ∥ω_v − ω∥ ≤ η. Therefore, from the Lipschitzness assumption on ϕ_D, it follows that |ϕ_D(ω)| ≥ δ − ηL. Moreover, since |sin(πv·ω_v)| ≥ A and sin is 1-Lipschitz, we obtain |sin(πv·ω)| ≥ A − πη∥v∥. Thus, ω is an (ϵ, A − πη∥v∥, δ − ηL)-frequency-witness for v, which completes the proof of Claim B.5.

Claim B.6 (Concentration of T̂). Fix a sufficiently large universal constant C > 0 and η, τ ∈ (0, 1). Let C_ω, C_µ be finite subsets of ℝ^d. If n ≥ C log(|C_ω|/τ)/(ηδ)², then with probability at least 1 − τ, for any ω ∈ C_ω such that |ϕ_D(ω)| ≥ δ and any µ̂ ∈ C_µ, it holds that |T_µ̂(ω) − T̂_µ̂(ω)| ≤ η.

Proof of Claim B.6. Let x_1, …, x_n be i.i.d. contaminated samples. For fixed ω, write y_j = e^{2πiω·x_j}, so that both Re(y_j) and Im(y_j) lie in [−1, 1] and ϕ̂(ω) = (1/n) Σ_{j=1}^n y_j. Hoeffding's inequality on the real and imaginary parts gives
Pr[|ϕ̂(ω) − ϕ(ω)| ≥ t] ≤ 4e^{−nt²/4}.
Similarly, this also holds for ϕ_D, i.e., Pr[|ϕ̂_D(ω) − ϕ_D(ω)| ≥ t] ≤ 4e^{−mt²/4}. Since T̂_µ̂(ω) − T_µ̂(ω) = ψ(ω) − ψ̂(ω) = (ϕ(ω) − ϕ̂(ω))/ϕ_D(ω) and |ϕ_D(ω)| ≥ δ, the event |T_µ̂(ω) − T̂_µ̂(ω)| ≤ η holds whenever |ϕ̂(ω) − ϕ(ω)| ≤ ηδ. A union bound over ω ∈ C_ω yields failure probability at most 4|C_ω|e^{−nη²δ²/4}, which we make ≤ τ by the stated choice of n. Note that the deviation is independent of µ̂, so no extra |C_µ| factor is needed. This concludes the proof of Claim B.6.
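The Hoeffding step in Claim B.6 is easy to observe empirically: the empirical characteristic function deviates from the true one by O(1/√n), uniformly over a small frequency grid. The sketch below uses clean N(0, 1) samples and an illustrative stand-in for the grid C_ω; all parameter values are our own choices.

```python
import numpy as np

# Empirical illustration of the concentration of phi_hat (Claim B.6).
rng = np.random.default_rng(0)
n, trials = 4000, 200
freqs = np.linspace(0.1, 1.0, 10)            # stand-in for the finite grid C_w
phi = np.exp(-2 * np.pi**2 * freqs**2)       # CF of N(0,1), convention E[e^{2 pi i w x}]
devs = []
for _ in range(trials):
    x = rng.normal(size=n)                   # clean samples, for illustration
    phi_hat = np.exp(2j * np.pi * np.outer(freqs, x)).mean(axis=1)
    devs.append(np.max(np.abs(phi_hat - phi)))  # sup over the grid
med = float(np.median(devs))                 # typical uniform deviation
```

The typical value of `med` is a small constant multiple of 1/√n, consistent with the 4|C_ω|e^{−nη²δ²/4} failure probability above.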
C Omitted Content from Section 4

In this section, we provide the content omitted from our lower bound section.

Claim C.1 (Derivative bounds imply a polynomial L2–L∞ relationship). Let D be a distribution on ℝ with density p such that p′ ∈ L1(ℝ). Assume ∥p′∥_{L1} = O(1). Then, for any measurable set S ⊆ ℝ, it holds that ∥ϕ_D 1_S∥_{L2} ≲ √(∥ϕ_D 1_S∥_{L∞}).

Proof. First, by Fact A.2 we have
ϕ_D(ω) = −(1/(2πiω)) ∫_ℝ p′(x) e^{2πiωx} dx,
hence |ϕ_D(ω)| ≤ ∥p′∥_{L1}/(2π|ω|) for all ω ≠ 0. Denote M := sup_{ω∈S} |ϕ_D(ω)|. Now fix any T > 0 and write
∥ϕ_D 1_S∥²_{L2} = ∫_{S∩{|ω|≤T}} |ϕ_D(ω)|² dω + ∫_{S∩{|ω|>T}} |ϕ_D(ω)|² dω.
The first term is at most 2TM². For the second term, use the 1/|ω| decay:
∫_{|ω|>T} |ϕ_D(ω)|² dω ≤ ∫_{|ω|>T} (∥p′∥_{L1}/(2π|ω|))² dω ≲ ∥p′∥²_{L1}/T.
Thus, ∥ϕ_D 1_S∥²_{L2} ≲ TM² + ∥p′∥²_{L1}/T. Choosing T = ∥p′∥_{L1}/M yields ∥ϕ_D 1_S∥²_{L2} ≲ M∥p′∥_{L1}, proving the claim.

Lemma C.2 (Characteristic function to TV distance closeness). Let P and Q be distributions over ℝ with densities p and q respectively. Denote by Δϕ := ϕ_P − ϕ_Q the difference of their characteristic functions. Then, for every R > 0,
d_TV(P, Q) ≤ (1/2)∥(p − q)1{|x| > R}∥_{L1} + √(R/2)∥Δϕ∥_{L2}.

Proof. Define h := p − q. Note that d_TV(P, Q) = ∥h∥_{L1}/2. For R > 0, we write ∥h∥_{L1} = I_out + I_in, where I_out := ∫_{|x|>R} |h(x)| dx and I_in := ∫_{|x|≤R} |h(x)| dx. We leave the outside-region contribution, I_out, as is. For the inside-region contribution, by Cauchy–Schwarz, I_in ≤ (2R)^{1/2}∥h∥_{L2}. Using Plancherel's theorem, this yields I_in ≤ (2R)^{1/2}∥Δϕ∥_{L2}, which completes the proof of Lemma C.2.

Lemma C.3 (Fourier Matching). Let ϵ, α ∈ (0, 1), f̂(ω) := (1 − α)(e^{πiϵω} − e^{−πiϵω})/α, and f := F⁻¹[f̂].
For every w > 0 there exists a function ρ̂_w : ℝ → ℂ with ρ̂_w(ω) = 1 for all ω with |ω − i/ϵ| ≤ w for some i ∈ ℤ, ĝ := f̂ · ρ̂_w, and g := F⁻¹[ĝ], such that the following hold:
1. g is a real-valued signed measure.
2. ∫_{−∞}^{∞} g(x) dx = 0.
3. ∥g∥_{L1} ≲ ϵw/α.
4. For every R > max{2ϵ, 2/w}, it holds that ∥g 1{|x| > R}∥_{L1} ≲ (ϵ/α) · 1/(w²R³).

Proof of Lemma C.3. Let b̂_w : ℝ → ℂ be the window function obtained by convolving four rectangular windows (recall that χ_w(ω) := 1{ω ∈ [−w, w]}), one of width 3w/2 and three of width w/6, and then normalizing the resulting convolved function:
b̂_w(ω) = (3/w)³ (χ_{3w/2} ∗ χ_{w/6} ∗ χ_{w/6} ∗ χ_{w/6})(ω).

Claim C.4 (Window Function Properties). The function b̂_w satisfies the following properties:
1. b̂_w is an even function.
2. b̂_w(ω) = 1 for all ω ∈ [−w, w] and b̂_w(ω) = 0 for all ω ∉ [−2w, 2w].

Proof of Claim C.4. First, note that b̂_w is even, as each component of the convolution is even. Next, note that for r, s > 0 we have
(χ_r ∗ χ_s)(ω) = ∫_{ω−s}^{ω+s} χ_r(t) dt.
Therefore, supp(χ_r ∗ χ_s) = [−(r+s), r+s]. If r ≥ s, then for |ω| ≤ r − s the limits of the above integration lie fully inside [−r, r]. Hence, the integral is constantly equal to 2s for |ω| ≤ r − s, i.e., (χ_r ∗ χ_s)(ω) = 2s for |ω| ≤ r − s. We apply this reasoning iteratively. Let a = 3w/2 and b = c = d = w/6. First, h_1 := χ_a ∗ χ_b has support |ω| ≤ a + b and a flat plateau of height 2b for |ω| ≤ a − b. Next, h_2 := h_1 ∗ χ_c has support |ω| ≤ a + b + c and, since h_1 is constant on |ω| ≤ a − b, h_2 has a flat plateau of height (2b)(2c) for |ω| ≤ a − b − c. Finally, h_3 := h_2 ∗ χ_d has support |ω| ≤ a + b + c + d and a flat plateau of height (2b)(2c)(2d) = 8bcd on |ω| ≤ a − b − c − d.
With a = 3w/2 and b = c = d = w/6, we have a + b + c + d = 2w and a − b − c − d = w, hence supp(h_3) = [−2w, 2w] and h_3(ω) = 8bcd on [−w, w]. Since 8bcd = 8(w/6)³ = w³/27, the stated normalization gives b̂_w(ω) = (3/w)³ h_3(ω) = 1 on [−w, w] and b̂_w(ω) = 0 for |ω| > 2w. This completes the proof of Claim C.4.

Write b_w = F⁻¹[b̂_w]. Since b̂_w is a convolution of rectangular windows and the inverse Fourier transform of the rectangular window 1{ω ∈ [−w, w]} is 2w sinc(2wx), we have that
b_w(x) = C_0 w sinc(3wx) sinc³(wx/3)
for a universal constant C_0 > 0. Define the 1/ϵ-periodized window
ρ̂_w(ω) := Σ_{m∈ℤ} b̂_w(ω − m/ϵ),
and set ĝ(ω) := f̂(ω)ρ̂_w(ω), g := F⁻¹[ĝ]. Note that since ĝ(ω) is periodic with period 2/ϵ (f̂ has period 2/ϵ and ρ̂_w has period 1/ϵ), g is a collection of δ-functions (see Fact A.1).

First, we prove that g is a real signed measure. Since ρ̂_w is even and real-valued, and f̂(−ω) is the complex conjugate of f̂(ω) (indeed, f̂(ω) = 2(1 − α)i sin(πϵω)/α), we get that ĝ(−ω) = f̂(−ω)ρ̂_w(−ω) is the complex conjugate of ĝ(ω). Hence, by Fact A.1, g is a real-valued measure, i.e., the coefficients of the δ's are real-valued. This proves the first item of Lemma 4.6.

Second, note that by the definition of the Fourier transform we have ∫_{−∞}^{∞} g(x) dx = ĝ(0) = f̂(0)ρ̂_w(0). Since f̂(0) = 0, we have ∫_{−∞}^{∞} g(x) dx = 0. This proves the second item of Lemma 4.6.

Now we bound the L1 norm of g. To compute the L1 norm, it suffices to compute the coefficients of the δ-functions. To compute the inverse g, note that ρ̂_w is the convolution of comb_{1/ϵ}(ω) := Σ_{m∈ℤ} δ(ω − m/ϵ) and b̂_w, with F⁻¹[comb_{1/ϵ}](x) = ϵ Σ_{n∈ℤ} δ(x − nϵ).
As a result, the inverse Fourier transform of ρ̂_w is
ρ_w(x) = ϵ Σ_{n∈ℤ} b_w(nϵ) δ(x − nϵ).
Finally, to compute g, we can convolve f and ρ_w, which gives
g(x) = ((1 − α)ϵ/α) Σ_{k∈ℤ} (b_w((k+1)ϵ) − b_w(kϵ)) δ(x − kϵ − ϵ/2).
Therefore, we have that
∥g∥_{L1} = ((1 − α)ϵ/α) Σ_{k∈ℤ} |b_w((k+1)ϵ) − b_w(kϵ)|.
By the fundamental theorem of calculus and the triangle inequality, for each k ∈ ℤ,
|b_w((k+1)ϵ) − b_w(kϵ)| = |∫_{kϵ}^{(k+1)ϵ} b′_w(x) dx| ≤ ∫_{kϵ}^{(k+1)ϵ} |b′_w(x)| dx.
Summing over k ∈ ℤ gives
Σ_{k∈ℤ} |b_w((k+1)ϵ) − b_w(kϵ)| ≤ ∫_ℝ |b′_w(x)| dx.
Substituting the derivative and changing variables gives
∫_ℝ |b′_w(x)| dx ≲ w ∫_ℝ (|sinc′(3y)| |sinc(y/3)|³ + |sinc(3y)| |sinc(y/3)|² |sinc′(y/3)|) dy.
We use the bounds |sinc(t)| ≤ min{1, 1/(π|t|)} and |sinc′(t)| ≤ min{C, 2/|t|} for all t ∈ ℝ and a universal constant C > 0. Applying these bounds, we get that
∫_ℝ |b′_w(x)| dx ≲ w.
Hence,
∥g∥_{L1} = ((1 − α)ϵ/α) Σ_{k∈ℤ} |b_w((k+1)ϵ) − b_w(kϵ)| ≲ (ϵ/α) w.
Now it suffices to bound the tail behavior of g. Fix R > 0. Similarly to the above, we have
∥g 1{|x| > R}∥_{L1} = ((1 − α)ϵ/α) Σ_{k : |kϵ+ϵ/2| > R} |b_w((k+1)ϵ) − b_w(kϵ)|.
Using the fundamental theorem of calculus and summing only over those k with |kϵ + ϵ/2| > R, we obtain
∥g 1{|x| > R}∥_{L1} ≤ ((1 − α)ϵ/α) ∫_{|x| > R−ϵ} |b′_w(x)| dx.
Recall that b_w(x) = C_0 w sinc(3wx) sinc³(wx/3) for a universal constant C_0. Differentiating and changing variables y = wx gives
∫_{|x|>R−ϵ} |b′_w(x)| dx ≲ w ∫_{|y|>w(R−ϵ)} (|sinc′(3y)| |sinc(y/3)|³ + |sinc(3y)| |sinc(y/3)|² |sinc′(y/3)|) dy.
Now using the tail bound for sinc, similarly to the proof of (3), we have for $w(R - \epsilon) \ge 1$,
$$\int_{|x| > R - \epsilon} |b_w'(x)|\, dx \lesssim w \int_{|y| > w(R-\epsilon)} \frac{dy}{|y|^4} \lesssim \frac{1}{w^2 (R - \epsilon)^3} \lesssim \frac{1}{w^2 R^3}.$$
Substituting back gives
$$\|g\, \mathbf{1}\{|x| > R\}\|_{L^1} \lesssim \frac{\epsilon}{\alpha} \cdot \frac{1}{w^2 R^3},$$
which completes the proof of Lemma C.3.

D Applications to Well-Known Distributions

In this section, we provide the applications of our theorems to several well-studied distribution families. We summarize our results in Table 1. We begin by instantiating the upper bound, Theorem 3.2.

Corollary D.1 (Estimating the mean of a standard Gaussian). Let $\alpha \in (0, 1/4)$, $\epsilon \in (0, 1)$, and $d \in \mathbb{Z}_+$. There exists an algorithm that, given $\widetilde{O}\!\left(d\, e^{\widetilde{O}((\alpha/\epsilon)^2)}\right)$ i.i.d. $\alpha$-mean-shift contaminated samples from $\mathcal{N}(\mu, I_d)$, returns an estimate $\widehat{\mu}$ such that $\|\widehat{\mu} - \mu\| \le \epsilon$ with probability at least $2/3$.

Proof. First, using $d/\alpha^2$ samples and polynomial time, we estimate the mean to $\ell_2$-error $O(1)$ using an adversarially robust estimator (see Fact A.4). We use this estimate to recenter the distribution so that $\|\mu\| = O(1)$. Next, we verify the conditions of Theorem 3.2 for the Gaussian. For $D = \mathcal{N}(0, I_d)$ we have $\phi_D(\omega) = e^{-2\pi^2 \|\omega\|^2}$. Hence $\nabla \phi_D(\omega) = -4\pi^2 \omega\, e^{-2\pi^2 \|\omega\|^2}$, and thus $\|\nabla \phi_D(\omega)\| \le 4\pi^2 \|\omega\|\, e^{-2\pi^2 \|\omega\|^2} = O(1)$, so $\phi_D$ is $L$-Lipschitz with $L = O(1)$. Now fix $v \in \mathbb{R}^d$ with $\|v\| \ge \epsilon$ and set $A := 4\alpha$ and $\omega := \frac{\arcsin(A)}{\pi \|v\|} \cdot \frac{v}{\|v\|}$. Then $|\sin(\pi v \cdot \omega)| = \sin(\arcsin(A)) = A$ and $\|\omega\| = \frac{\arcsin(A)}{\pi \|v\|} \lesssim \alpha/\epsilon$. Therefore $|\phi_D(\omega)| = e^{-2\pi^2 \|\omega\|^2} \ge e^{-c(\alpha/\epsilon)^2}$ for a universal constant $c > 0$. Setting $\delta := e^{-c(\alpha/\epsilon)^2}$, we conclude that $D$ satisfies the $(\epsilon, A, \delta)$-frequency-witness condition. The density of $D$ is $p_D(x) = (2\pi)^{-d/2} e^{-\|x\|^2/2}$.
For each $j \in [d]$ we have $\frac{\partial}{\partial x_j} p_D(x) = -x_j\, p_D(x)$, so $\frac{\partial}{\partial x_j} p_D \in L^1(\mathbb{R}^d)$ and
$$\left\| \frac{\partial}{\partial x_j} p_D \right\|_{L^1(\mathbb{R}^d)} = \int_{\mathbb{R}^d} |x_j|\, p_D(x)\, dx = \mathbb{E}_{x \sim \mathcal{N}(0, I_d)}[|x_j|] = \sqrt{\frac{2}{\pi}}.$$
Hence, $M_1 := \max_{j \in [d]} \left\| \frac{\partial}{\partial x_j} p_D \right\|_{L^1(\mathbb{R}^d)} = \sqrt{2/\pi} = O(1)$. Finally, applying Theorem 3.2 with $L = O(1)$, $A = 4\alpha$, and $\delta = e^{-c(\alpha/\epsilon)^2}$ yields sample complexity
$$n = O\!\left( d \log\!\left( \frac{\sqrt{d}\, M_1 R L}{\delta^2 A} \right) \cdot \frac{1}{(((1-\alpha)A - 2\alpha)\delta)^2} \right) = \widetilde{O}\!\left( d\, e^{\widetilde{O}((\alpha/\epsilon)^2)} \right),$$
and Algorithm 1 returns $\widehat{\mu}$ with $\|\widehat{\mu} - \mu\| \le \epsilon$ with probability at least $2/3$.

Corollary D.2 (Estimating the mean of a standard Laplace). Let $\alpha \in (0, 1/4)$, $\epsilon \in (0, 1)$, and $d \in \mathbb{Z}_+$. There exists an algorithm that, given $n = \widetilde{O}\!\left( \frac{d \alpha^2}{\epsilon^4} \right)$ i.i.d. $\alpha$-mean-shift contaminated samples from the $d$-dimensional product Laplace distribution with unit covariance (i.e., each coordinate $\operatorname{Lap}(0, 1/\sqrt{2})$), returns an estimate $\widehat{\mu}$ such that $\|\widehat{\mu} - \mu\| \le \epsilon$ with probability at least $2/3$.

Proof. As in the Gaussian corollary, since $\alpha < 1/4$, using $O(d)$ samples and polynomial time we obtain an $O(1)$-accurate robust estimate of $\mu$ (see Fact A.4) and recenter so that $\|\mu\| = O(1)$. For $D = \operatorname{Lap}^d(0, I_d)$ (i.i.d. coordinates with $b = 1/\sqrt{2}$), the characteristic function is $\phi_D(\omega) = \prod_{j=1}^d (1 + 2\pi^2 \omega_j^2)^{-1}$. Fix any $v$ with $\|v\| \ge \epsilon$ and set $A := 4\alpha$. Choose $\omega = \frac{\arcsin(A)}{\pi \|v\|} \cdot \frac{v}{\|v\|}$, so $|\sin(\pi v \cdot \omega)| = \sin(\arcsin A) = A$ and $\|\omega\| \le \arcsin(A)/(\pi\epsilon) = \Theta(\alpha/\epsilon)$. Let $\delta := \frac{1}{1 + 2\pi^2 \|\omega\|^2}$. Since $\|\omega\| = \Theta(\alpha/\epsilon)$, we have $\delta = \Theta\!\left( \frac{1}{1 + (\alpha/\epsilon)^2} \right)$, and moreover $|\phi_D(\omega)| \ge \prod_{j=1}^d \frac{1}{1 + 2\pi^2 \omega_j^2} \ge \frac{1}{1 + 2\pi^2 \|\omega\|^2} = \delta$. Also, $\phi_D$ is $L$-Lipschitz on $\{\|\xi\| \le \|\omega\|\}$ with $L \lesssim 2\pi^2 \|\omega\| = \widetilde{O}(\alpha/\epsilon)$. Therefore $D$ satisfies the $(\epsilon, A, \delta)$-frequency-witness condition.
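The witness frequency chosen above can be illustrated numerically. A hedged sketch (assuming NumPy; the values of $d$, $\alpha$, $\epsilon$, and the direction of $v$ are arbitrary examples, not values fixed by the proof):

```python
import numpy as np

# Illustration of the witness frequency omega chosen above (a sketch; d,
# alpha, eps, and the direction of v are arbitrary example values).
rng = np.random.default_rng(0)
d, alpha, eps = 5, 0.05, 0.2
A = 4 * alpha

v = rng.normal(size=d)
v *= 2 * eps / np.linalg.norm(v)               # some v with ||v|| >= eps
omega = (np.arcsin(A) / (np.pi * np.linalg.norm(v))) * v / np.linalg.norm(v)

sin_term = abs(np.sin(np.pi * np.dot(v, omega)))
norm_ok = np.linalg.norm(omega) <= np.arcsin(A) / (np.pi * eps) + 1e-12
print(sin_term, A, bool(norm_ok))   # sin term equals A; ||omega|| = O(alpha/eps)
```

The printout confirms $|\sin(\pi v \cdot \omega)| = A$ exactly (up to rounding) and that $\|\omega\|$ obeys the stated $\Theta(\alpha/\epsilon)$ bound.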
The one-dimensional Laplace density with $b = 1/\sqrt{2}$ is $p_1(x) = \frac{1}{\sqrt{2}} e^{-\sqrt{2}|x|}$, and the $d$-dimensional product density is $p_D(x) = \prod_{j=1}^d p_1(x_j)$. For $j \in [d]$ and all $x$ with $x_j \neq 0$,
$$\frac{\partial}{\partial x_j} p_D(x) = \frac{p_1'(x_j)}{p_1(x_j)}\, p_D(x) = -\sqrt{2}\, \operatorname{sign}(x_j)\, p_D(x),$$
so $\frac{\partial}{\partial x_j} p_D \in L^1(\mathbb{R}^d)$ and
$$\left\| \frac{\partial}{\partial x_j} p_D \right\|_{L^1(\mathbb{R}^d)} = \int_{\mathbb{R}^d} \sqrt{2}\, p_D(x)\, dx = \sqrt{2}.$$
Hence $M_1 := \max_{j \in [d]} \left\| \frac{\partial}{\partial x_j} p_D \right\|_{L^1(\mathbb{R}^d)} = \sqrt{2} = O(1)$. Finally, applying Theorem 3.2 with $A = 4\alpha$ and $\delta = \Theta\!\left( \frac{1}{1 + (\alpha/\epsilon)^2} \right)$ yields
$$n = O\!\left( d \log\!\left( \frac{\sqrt{d}\, M_1 R L}{\delta^2 A} \right) \cdot \frac{1}{(((1-\alpha)A - 2\alpha)\delta)^2} \right) = \widetilde{O}\!\left( \frac{d \alpha^2}{\epsilon^4} \right),$$
and running Algorithm 1 returns $\widehat{\mu}$ with $\|\widehat{\mu} - \mu\| \le \epsilon$ with probability at least $2/3$.

Corollary D.3 (Estimating the mean of a uniform distribution). Let $\alpha \in (0, 1/4)$ and $\epsilon \in (0, 1)$. Let $D$ be the uniform distribution on $[-1, 1]$ (with density $p_D(x) = \frac{1}{2} \mathbf{1}\{|x| \le 1\}$), and let $D_\mu$ denote its shift by an unknown mean $\mu \in \mathbb{R}$. There exists an algorithm that, given $n = \widetilde{O}(1/\epsilon^2)$ i.i.d. $\alpha$-mean-shift contaminated samples from $D_\mu$, returns an estimate $\widehat{\mu}$ such that $|\widehat{\mu} - \mu| \le \epsilon$ with probability at least $2/3$.

Proof. As in the Gaussian/Laplace corollaries, using $O(1)$ samples and polynomial time, we can obtain an $O(1)$-accurate robust estimate of $\mu$ (Fact A.4) and recenter so that $|\mu| = O(1)$. For $D = \operatorname{Unif}([-1, 1])$ and every $\omega \in \mathbb{R}$, the characteristic function is
$$\phi_D(\omega) = \mathbb{E}_{x \sim D}\!\left[ e^{2\pi i \omega x} \right] = \frac{1}{2} \int_{-1}^{1} e^{2\pi i \omega x}\, dx = \frac{\sin(2\pi\omega)}{2\pi\omega} = \operatorname{sinc}(2\omega).$$
Differentiation gives
$$\phi_D'(\omega) = \frac{2\pi\omega \cos(2\pi\omega) - \sin(2\pi\omega)}{2\pi\omega^2},$$
hence $\sup_{\omega \in \mathbb{R}} |\phi_D'(\omega)| = O(1)$, and thus $\phi_D$ is $L$-Lipschitz with $L = O(1)$. Fix any $v \in \mathbb{R}$ with $|v| \ge \epsilon$. Consider the interval $I_v := \left[ \frac{1}{4|v|}, \frac{3}{4|v|} \right]$. For any $\omega \in I_v$ we have $|v\omega| \in [1/4, 3/4]$, and therefore $|\sin(\pi v \omega)| \ge \sin(\pi/4) = \frac{1}{\sqrt{2}}$.
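The last claim can be spot-checked on a grid. A sketch (assuming NumPy; the test values of $v$ are arbitrary):

```python
import numpy as np

# Spot check of the claim above: for omega in I_v = [1/(4|v|), 3/(4|v|)],
# |v * omega| lies in [1/4, 3/4], hence |sin(pi v omega)| >= sin(pi/4).
# (Sketch; the test values of v are arbitrary.)
worst = np.inf
for v in (0.05, 0.3, 1.0, 7.5):
    omega = np.linspace(1 / (4 * v), 3 / (4 * v), 10_001)
    worst = min(worst, np.abs(np.sin(np.pi * v * omega)).min())
print(bool(worst >= np.sin(np.pi / 4) - 1e-12))
```

The minimum is attained at the endpoints of $I_v$, where $|v\omega| = 1/4$ or $3/4$ and the sine equals $1/\sqrt{2}$ exactly.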
Set $A := 1/\sqrt{2}$ (note that $(1-\alpha)A - 2\alpha > 0$ for all $\alpha < 1/4$). We now lower bound $|\phi_D(\omega)|$ for a suitable choice of $\omega \in I_v$. If $|v| \ge 1$, then $\omega_0 := 1/(4|v|) \in (0, 1/4]$, and using $\sin t \ge (2/\pi) t$ for $t \in [0, \pi/2]$,
$$|\phi_D(\omega_0)| = \frac{\sin(2\pi\omega_0)}{2\pi\omega_0} \ge \frac{(2/\pi) \cdot 2\pi\omega_0}{2\pi\omega_0} = \frac{2}{\pi}.$$
If instead $|v| \in [\epsilon, 1)$, then $|I_v| = 1/(2|v|) \ge 1/2$; since $\omega \mapsto \sin(2\pi\omega)$ has period $1$, there exists $\omega_1 \in I_v$ with $|\sin(2\pi\omega_1)| \ge 1/2$. For this $\omega_1$ we have $|\omega_1| \le 3/(4|v|)$, and hence
$$|\phi_D(\omega_1)| = \frac{|\sin(2\pi\omega_1)|}{2\pi|\omega_1|} \ge \frac{1/2}{2\pi \cdot (3/(4|v|))} = \frac{|v|}{3\pi} \ge \frac{\epsilon}{3\pi}.$$
Combining the two cases, for every $v$ with $|v| \ge \epsilon$ there exists $\omega \in I_v$ such that $|\sin(\pi v \omega)| \ge A$ and $|\phi_D(\omega)| \ge \delta$ with $\delta := \epsilon/(3\pi)$. In particular, $D$ satisfies the $(\epsilon, A, \delta)$-frequency-witness condition (see Definition 3.1) with $A = 1/\sqrt{2}$ and $\delta = \epsilon/(3\pi)$. Also, $p_D$ has the distributional derivative $\frac{d}{dx} p_D = \frac{1}{2}\delta(x+1) - \frac{1}{2}\delta(x-1)$, whose total variation is $M_1 := \left\| \frac{d}{dx} p_D \right\|_{L^1} = \frac{1}{2} + \frac{1}{2} = 1$. Applying Theorem 3.2, we have that with
$$n = \widetilde{O}\!\left( \frac{1}{((1-\alpha)A - 2\alpha)^2 \delta^2} \right) = \widetilde{O}\!\left( \frac{1}{\epsilon^2} \right)$$
samples, Algorithm 1 returns $\widehat{\mu}$ with $|\widehat{\mu} - \mu| \le \epsilon$ with probability at least $2/3$.

Corollary D.4 (Estimating the mean of a sum of $m$ uniforms). Let $\alpha \in (0, 1/8)$, $\epsilon \in (0, \alpha)$, and $m \in \mathbb{Z}_+$. Let $U_1, \ldots, U_m$ be i.i.d. $\operatorname{Unif}([-1, 1])$, let $D^{(m)}$ be the distribution of $\sum_{i=1}^m U_i$, and let $D^{(m)}_\mu$ denote its shift by an unknown mean $\mu \in \mathbb{R}$. There exists an algorithm that, given $n = \widetilde{O}\!\left( \alpha^{-2} (O(\alpha/\epsilon))^{2m} \right)$ i.i.d. $\alpha$-mean-shift contaminated samples from $D^{(m)}_\mu$, returns an estimate $\widehat{\mu}$ such that $|\widehat{\mu} - \mu| \le \epsilon$ with probability at least $2/3$.

Proof.
As in the previous corollaries, since $\alpha < 1/8$, using $O(1)$ samples and polynomial time we obtain a robust estimate of $\mu$ (Fact A.4) to absolute error $O(\sqrt{m})$ (since $\operatorname{Var}(\sum_{i=1}^m U_i) = m/3$), and recenter so that $|\mu| = O(\sqrt{m})$. For $D^{(m)}$ we have, for every $\omega \in \mathbb{R}$,
$$\phi_{D^{(m)}}(\omega) = \prod_{i=1}^m \mathbb{E}\!\left[ e^{2\pi i \omega U_i} \right] = \left( \frac{\sin(2\pi\omega)}{2\pi\omega} \right)^m = \operatorname{sinc}(2\omega)^m.$$
Write $s(\omega) := \operatorname{sinc}(2\omega)$. Since $\sup_{\omega \in \mathbb{R}} |s'(\omega)| = O(1)$ and $|s(\omega)| \le 1$,
$$\sup_{\omega \in \mathbb{R}} \left| \phi_{D^{(m)}}'(\omega) \right| = \sup_{\omega \in \mathbb{R}} \left| m\, s(\omega)^{m-1} s'(\omega) \right| = O(m),$$
so $\phi_{D^{(m)}}$ is $L$-Lipschitz with $L = O(m)$. Fix any $v \in \mathbb{R}$ with $|v| \ge \epsilon$ and set $A := 4\alpha$ (so $(1-\alpha)A - 2\alpha = 2\alpha(1 - 2\alpha) > 0$). Let $\omega^\star_v := \frac{\arcsin(A)}{\pi v}$, so that $|\sin(\pi v \omega^\star_v)| = A$. We now show that for every $|v| \ge \epsilon$ there exists $\omega$ with $|\sin(\pi v \omega)| \ge A$ and $|\phi_{D^{(m)}}(\omega)| \ge (\Omega(|v|/A))^m$. This implies the $(\epsilon, A, \delta)$-frequency-witness condition with $\delta \ge (\Omega(\epsilon/\alpha))^m$.

If $|v| \ge 4A$, then $|\omega^\star_v| = \frac{\arcsin(A)}{\pi |v|} \le \frac{A}{|v|} \le \frac{1}{4}$, and using $|\sin t| \ge (2/\pi)|t|$ for $|t| \le \pi/2$ we get $\left| \frac{\sin(2\pi\omega^\star_v)}{2\pi\omega^\star_v} \right| \ge \frac{2}{\pi}$, hence $|\phi_{D^{(m)}}(\omega^\star_v)| \ge (2/\pi)^m$.

If $|v| < 4A$: let $\theta := \arcsin(A) \in (0, \pi/2)$ and define $\ell_v := \min\!\left\{ 1, \frac{\pi/2 - \theta}{\pi |v|} \right\}$. Consider the interval $I_v := [\omega^\star_v, \omega^\star_v + \ell_v]$. For any $\omega = \omega^\star_v + t$ with $t \in [0, \ell_v]$ we have $\pi |v| t \le \pi/2 - \theta$, which implies $|\sin(\pi v \omega)| = \sin(\theta + \pi |v| t) \ge \sin(\theta) = A$, so the entire interval $I_v$ satisfies $|\sin(\pi v \omega)| \ge A$.

Now choose $\omega \in I_v$ maximizing $|\sin(2\pi\omega)|$ over $I_v$. A simple periodicity/compactness argument gives the quantitative bound
$$\max_{\omega \in I_v} |\sin(2\pi\omega)| \ge \begin{cases} 1, & \ell_v \ge 1/2, \\ \sin(\pi \ell_v) \ge 2\ell_v, & \ell_v < 1/2, \end{cases}$$
and in either case $|\sin(2\pi\omega)| \ge \min\{1, 2\ell_v\}$. Furthermore, $A = 4\alpha < 1/2$ implies $\theta = \arcsin(A) \le \arcsin(1/2) = \pi/6$, thus $\pi/2 - \theta \ge \pi/3$.
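The displayed bound on $\max_{\omega \in I_v} |\sin(2\pi\omega)|$ can be checked on random intervals. A sketch (assuming NumPy; the number of trials, interval lengths, and grid size are arbitrary):

```python
import numpy as np

# Numerical check of the displayed bound: on any interval of length l,
# max |sin(2 pi omega)| >= min{1, 2 l}.  (Sketch with random test intervals.)
rng = np.random.default_rng(1)
worst_slack = np.inf
for _ in range(1000):
    start = rng.uniform(-10.0, 10.0)
    l = rng.uniform(0.01, 2.0)
    omega = np.linspace(start, start + l, 4001)
    slack = np.abs(np.sin(2 * np.pi * omega)).max() - min(1.0, 2 * l)
    worst_slack = min(worst_slack, slack)
print(worst_slack)   # nonnegative, up to grid discretization error
```

The worst case is an interval of length $\ell < 1/2$ centered at a zero of $\sin(2\pi\omega)$, where the maximum equals $\sin(\pi\ell) \ge 2\ell$.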
Hence, if $|v| \le 4A < 1/2$, then $(\pi/2 - \theta)/(\pi|v|) \ge (\pi/3)/(\pi \cdot 1/2) = 2/3$, so $\ell_v = \min\{1, (\pi/2 - \theta)/(\pi|v|)\} \ge 2/3$. Also, $|\omega| \le |\omega^\star_v| + \ell_v \le |\omega^\star_v| + 1 = O(|\omega^\star_v|)$, using that $|v| < 4A$ gives $|\omega^\star_v| \gtrsim 1$ and hence $|\omega^\star_v| + 1 = O(|\omega^\star_v|)$. Therefore,
$$|s(\omega)| = \left| \frac{\sin(2\pi\omega)}{2\pi\omega} \right| \ge \frac{\min\{1, 2\ell_v\}}{2\pi (|\omega^\star_v| + 1)} \gtrsim \frac{|v|}{A},$$
so $|\phi_{D^{(m)}}(\omega)| = |s(\omega)|^m \ge (c|v|/A)^m$ for a sufficiently small constant $c > 0$. Combining the two cases with the fact that $\epsilon/\alpha \le 1$, we have that for every $|v| \ge \epsilon$ there is some $\omega$ with $|\sin(\pi v \omega)| \ge A$ and $|\phi_{D^{(m)}}(\omega)| \ge (c\epsilon/\alpha)^m$. Thus $D^{(m)}$ satisfies the $(\epsilon, A, \delta)$-frequency-witness condition with $\delta \ge (\Omega(\epsilon/\alpha))^m$.

Now let $p_U$ be the density of the uniform distribution on $[-1, 1]$. Since $(f * g)' = f' * g = f * g'$, we have $(p_U^{*m})' = (p_U^{*(m-1)} * p_U)' = p_U^{*(m-1)} * p_U'$. Then
$$\|(p_U^{*m})'\|_1 = \|p_U^{*(m-1)} * p_U'\|_1 \le \|p_U^{*(m-1)}\|_1\, \|p_U'\|_1 = 1 \cdot \|p_U'\|_1 \le 1.$$
This gives $M_1 \le 1$ for the density of $D^{(m)}$. Applying Theorem 3.2 yields that with
$$n = \widetilde{O}\!\left( \frac{1}{((1-\alpha)A - 2\alpha)^2 \delta^2} \right) = \widetilde{O}\!\left( \frac{C^m}{\alpha^2} \cdot \frac{\alpha^{2m}}{\epsilon^{2m}} \right) = \widetilde{O}\!\left( \frac{C^m \alpha^{2m-2}}{\epsilon^{2m}} \right)$$
samples, for a sufficiently large universal constant $C > 0$, Algorithm 1 returns $\widehat{\mu}$ with $|\widehat{\mu} - \mu| \le \epsilon$ with probability at least $2/3$.

Moreover, using Theorem 4.1 yields the following lower bounds, which are tight up to polynomial factors.

Corollary D.5 (Standard Gaussian lower bound). Fix $\alpha, \epsilon \in (0, 1/2)$ with $\epsilon < \alpha$ and $\epsilon/\alpha < c$ for a sufficiently small universal constant $c$. Then any algorithm that estimates the mean of a standard Gaussian to absolute error $\epsilon$ with probability at least $2/3$ from $\alpha$-mean-shift contaminated samples (as in Definition 1.1) must use at least $e^{\Omega((\alpha/\epsilon)^2)}$ samples.

Proof. The characteristic function of $D = \mathcal{N}(0, 1)$ is $\phi_D(\omega) = \exp(-2\pi^2 \omega^2)$.
Let $S := \{\omega : \operatorname{dist}(\epsilon\omega, \mathbb{Z}) > c\alpha\}$ (for the constant $c > 0$ from Theorem 4.1). Note that $S \subseteq \{|\omega| > c\alpha/\epsilon\}$, since $|\omega| \le c\alpha/\epsilon$ implies $\operatorname{dist}(\epsilon\omega, \mathbb{Z}) \le |\epsilon\omega| \le c\alpha$. Hence,
$$\left\| \phi_D(\omega)\, \mathbf{1}\{\omega \in S\} \right\|_{L^2}^2 \le \int_{|\omega| > c\alpha/\epsilon} e^{-4\pi^2\omega^2}\, d\omega = 2\int_{c\alpha/\epsilon}^{\infty} e^{-4\pi^2\omega^2}\, d\omega \lesssim \frac{\epsilon}{\alpha} \exp\!\left( -4\pi^2 c^2 \left( \frac{\alpha}{\epsilon} \right)^2 \right).$$
Therefore,
$$\delta := \left\| \phi_D(\omega)\, \mathbf{1}\{\omega \in S\} \right\|_{L^2} \lesssim \sqrt{\frac{\epsilon}{\alpha}} \exp\!\left( -2\pi^2 c^2 \left( \frac{\alpha}{\epsilon} \right)^2 \right).$$
For Condition (2) of Theorem 4.1, let $\sigma := \mathbb{E}_{x \sim D}[|x|] = \sqrt{2/\pi}$. By Markov's inequality, for all $R > 0$, $\Pr_{x \sim D}[|x| \ge R] \le \sigma/R$. Also, since $\delta$ is exponentially small in $(\alpha/\epsilon)^2$ and $\sigma = \Theta(1)$, the requirement $\sigma > C\delta$ in Theorem 4.1 holds (for the universal constant $C$). Applying Theorem 4.1 gives
$$n = \Omega\!\left( \frac{1}{(\delta\sigma)^{1/3} + (\epsilon/\alpha)^3 (\delta/\sigma)^2} \right) = \Omega\!\left( \frac{1}{\delta^{1/3}} \right) = e^{\Omega((\alpha/\epsilon)^2)},$$
where the last step uses the above bound on $\delta$ (the prefactor $\sqrt{\epsilon/\alpha}$ is absorbed into the exponent).

Corollary D.6 (Laplace lower bound). Fix $\alpha, \epsilon \in (0, 1/2)$ with $\epsilon < \alpha$ and $\epsilon/\alpha < c$ for a sufficiently small universal constant $c$. Let $D = \operatorname{Lap}(0, 1)$ be the standard Laplace distribution with density $p(x) = \frac{1}{2} e^{-|x|}$. Then any algorithm that estimates the mean of a standard Laplace to absolute error $\epsilon$ with probability at least $2/3$ from $\alpha$-mean-shift contaminated samples (as in Definition 1.1) must use at least $\Omega(\sqrt{\alpha/\epsilon})$ samples.

Proof. The characteristic function of $D = \operatorname{Lap}(0, 1)$ is
$$\phi_D(\omega) = \int_{\mathbb{R}} \frac{1}{2} e^{-|x|} e^{2\pi i \omega x}\, dx = \frac{1}{1 + (2\pi\omega)^2}.$$
Let $S := \{\omega : \operatorname{dist}(\epsilon\omega, \mathbb{Z}) > c\alpha\}$ (for the constant $c > 0$ from Theorem 4.1). As in the Gaussian proof, $S \subseteq \{|\omega| > c\alpha/\epsilon\}$, and therefore
$$\left\| \phi_D(\omega)\, \mathbf{1}\{\omega \in S\} \right\|_{L^2}^2 \le \int_{|\omega| > c\alpha/\epsilon} \frac{d\omega}{\left( 1 + (2\pi\omega)^2 \right)^2} = 2\int_{c\alpha/\epsilon}^{\infty} \frac{d\omega}{\left( 1 + (2\pi\omega)^2 \right)^2} \lesssim \int_{c\alpha/\epsilon}^{\infty} \frac{d\omega}{\omega^4} \lesssim \left( \frac{\epsilon}{\alpha} \right)^3.$$
Hence,
$$\delta := \left\| \phi_D(\omega)\, \mathbf{1}\{\omega \in S\} \right\|_{L^2} \lesssim \left( \frac{\epsilon}{\alpha} \right)^{3/2}.$$
For Condition (2) of Theorem 4.1, let $\sigma := \mathbb{E}_{x \sim D}[|x|] = 1$.
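The Laplace tail estimate just derived (decay like $(\epsilon/\alpha)^3$, i.e., like $T^{-3}$ in the truncation point $T = c\alpha/\epsilon$) can be sanity-checked by numerical integration. A sketch (assuming NumPy; the cutoff, grid sizes, and test values of $T$ are arbitrary):

```python
import numpy as np

# Sanity check: the truncated L2 mass 2 * int_T^inf (1 + (2 pi w)^2)^(-2) dw
# decays like T^(-3), matching the (eps/alpha)^3 bound with T = c*alpha/eps.
# (Sketch; cutoff and grid sizes are arbitrary.)
def tail_mass(T, upper=1e4, n=2_000_001):
    w = np.linspace(T, upper, n)
    f = (1.0 + (2.0 * np.pi * w) ** 2) ** -2
    return 2.0 * np.sum((f[:-1] + f[1:]) / 2.0) * (w[1] - w[0])  # trapezoid rule

ratios = [tail_mass(T) * T**3 for T in (2.0, 4.0, 8.0)]
print(ratios)   # roughly constant across T, confirming the T^(-3) decay
```

The limiting constant is $2/(3(2\pi)^4)$, coming from the envelope $(2\pi\omega)^{-4}$ of the integrand.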
By Markov's inequality, for all $R > 0$, $\Pr_{x \sim D}[|x| \ge R] \le \sigma/R$. Also, since $\epsilon/\alpha < c$, the bound $\delta \lesssim (\epsilon/\alpha)^{3/2}$ makes $\delta$ small enough that the requirement $\sigma > C\delta$ in Theorem 4.1 holds (for the universal constant $C$). Applying Theorem 4.1 gives
$$n = \Omega\!\left( \frac{1}{(\delta\sigma)^{1/3} + (\epsilon/\alpha)^3 (\delta/\sigma)^2} \right).$$
With $\sigma = 1$ and $\delta \lesssim (\epsilon/\alpha)^{3/2}$, we have $(\delta\sigma)^{1/3} \lesssim (\epsilon/\alpha)^{1/2}$ and $(\epsilon/\alpha)^3 (\delta/\sigma)^2 \lesssim (\epsilon/\alpha)^6$, hence the first term dominates, yielding
$$n = \Omega\!\left( \delta^{-1/3} \right) = \Omega\!\left( \left( \frac{\alpha}{\epsilon} \right)^{1/2} \right),$$
as claimed.

Corollary D.7 (Uniform lower bound). Fix $\alpha, \epsilon \in (0, 1/2)$ with $\epsilon < \alpha$ and $\epsilon/\alpha < c$ for a sufficiently small universal constant $c$. Let $D = \operatorname{Unif}([-1, 1])$ and let $D_\mu$ denote its shift by an unknown mean $\mu \in \mathbb{R}$. Then any algorithm that estimates the mean to absolute error $\epsilon$ with probability at least $2/3$ from $\alpha$-mean-shift contaminated samples (as in Definition 1.1) must use at least $\Omega((\alpha/\epsilon)^{1/6})$ samples.

Proof. For $D = \operatorname{Unif}([-1, 1])$ we have $\phi_D(\omega) = \mathbb{E}_{x \sim D}[e^{2\pi i \omega x}] = \frac{\sin(2\pi\omega)}{2\pi\omega}$. Let $S := \{\omega : \operatorname{dist}(\epsilon\omega, \mathbb{Z}) > c\alpha\}$ (for the constant $c > 0$ from Theorem 4.1). As before, $S \subseteq \{|\omega| > c\alpha/\epsilon\}$. Hence,
$$\left\| \phi_D(\omega)\, \mathbf{1}\{\omega \in S\} \right\|_{L^2}^2 \le 2\int_{c\alpha/\epsilon}^{\infty} \left( \frac{\sin(2\pi\omega)}{2\pi\omega} \right)^2 d\omega \le 2\int_{c\alpha/\epsilon}^{\infty} \frac{d\omega}{(2\pi\omega)^2} \lesssim \frac{\epsilon}{\alpha}.$$
Therefore, taking $\delta := \left\| \phi_D(\omega)\, \mathbf{1}\{\omega \in S\} \right\|_{L^2} \lesssim (\epsilon/\alpha)^{1/2}$ verifies Condition (1) of Theorem 4.1. For Condition (2), since $|x| \le 1$ almost surely under $x \sim D$, we have $\Pr_{x \sim D}[|x| \ge R] \le 1/R$ for all $R > 0$, so the tail condition holds with $\sigma := 1$ (and in particular $\sigma > C\delta$ in the regime $\epsilon/\alpha < c$ for a sufficiently small constant $c > 0$). Applying Theorem 4.1 (with $\sigma = 1$) yields
$$n = \Omega\!\left( \frac{1}{\delta^{1/3} + (\epsilon/\alpha)^3 \delta^2} \right) = \Omega\!\left( \delta^{-1/3} \right) = \Omega\!\left( \left( \frac{\alpha}{\epsilon} \right)^{1/6} \right),$$
where we used $\epsilon < \alpha$ to note that $(\epsilon/\alpha)^3 \delta^2 \lesssim (\epsilon/\alpha)^4 \le \delta^{1/3}$.

Corollary D.8 (Sum of $m$ uniforms lower bound). Fix $\alpha, \epsilon \in (0, 1/2)$ with $\epsilon < \alpha$ and $\epsilon/\alpha < c$ for a sufficiently small universal constant $c$. Let $U_1, \ldots, U_m$ be i.i.d. $\operatorname{Unif}([-1, 1])$, let $D^{(m)}$ be the distribution of $\sum_{i=1}^m U_i$, and let $D^{(m)}_\mu$ denote its shift by an unknown mean $\mu \in \mathbb{R}$. Then any algorithm that estimates the mean to absolute error $\epsilon$ with probability at least $2/3$ from $\alpha$-mean-shift contaminated samples (as in Definition 1.1) must use at least $\Omega\!\left( (\alpha/\epsilon)^{(2m-1)/6} \right)$ samples.

Proof. For $D^{(m)}$ we have $\phi_{D^{(m)}}(\omega) = \left( \frac{\sin(2\pi\omega)}{2\pi\omega} \right)^m$. Let $S := \{\omega : \operatorname{dist}(\epsilon\omega, \mathbb{Z}) > c\alpha\}$ (for the constant $c > 0$ from Theorem 4.1); as before, $S \subseteq \{|\omega| > c\alpha/\epsilon\}$. Using $|\sin(2\pi\omega)| \le 1$, for $|\omega| > 0$ we have $|\phi_{D^{(m)}}(\omega)| \le \left( \frac{1}{2\pi|\omega|} \right)^m$, and therefore, letting $T := c\alpha/\epsilon$,
$$\left\| \phi_{D^{(m)}}(\omega)\, \mathbf{1}\{\omega \in S\} \right\|_{L^2}^2 \le 2\int_{T}^{\infty} \left( \frac{1}{2\pi\omega} \right)^{2m} d\omega = 2\left( \frac{1}{2\pi} \right)^{2m} \frac{1}{(2m-1)\, T^{2m-1}} \lesssim \left( \frac{\epsilon}{\alpha} \right)^{2m-1}.$$
Hence,
$$\delta := \left\| \phi_{D^{(m)}}(\omega)\, \mathbf{1}\{\omega \in S\} \right\|_{L^2} \lesssim \left( \frac{\epsilon}{\alpha} \right)^{(2m-1)/2}.$$
For Condition (2) of Theorem 4.1, we have $\operatorname{Var}_{D^{(m)}}(x) = m/3$, so $\sigma := \mathbb{E}_{D^{(m)}}[|x|] \le \sqrt{\mathbb{E}_{D^{(m)}}[x^2]} = \sqrt{m/3}$, and Markov's inequality gives $\Pr_{D^{(m)}}[|x| \ge R] \le \sigma/R$ for all $R > 0$. Since $\delta$ is exponentially small in $m$ (as $\epsilon/\alpha < 1$) and $\sigma = \Theta(\sqrt{m})$, the requirement $\sigma > C\delta$ holds when $\epsilon/\alpha$ is less than a sufficiently small constant. Applying Theorem 4.1 yields
$$n = \Omega\!\left( \frac{1}{(\delta\sigma)^{1/3} + (\epsilon/\alpha)^3 (\delta/\sigma)^2} \right) = \Omega\!\left( \delta^{-1/3} \right) = \Omega\!\left( (\alpha/\epsilon)^{(2m-1)/6} \right),$$
where we used $\epsilon < \alpha$ to note that $(\epsilon/\alpha)^3 (\delta/\sigma)^2 \ll \delta^{1/3}$ for all $m \ge 1$.
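The two estimates driving the last proof, the pointwise envelope $|\phi_{D^{(m)}}(\omega)| \le (2\pi|\omega|)^{-m}$ and the closed-form tail bound it implies, can be sanity-checked numerically. A sketch (assuming NumPy; the value of $T$, the upper cutoff, and the grid are arbitrary):

```python
import numpy as np

# Sanity check for D^(m): the pointwise envelope |phi(w)| <= (2 pi |w|)^(-m),
# and the closed-form bound 2 (2 pi)^(-2m) T^(1-2m) / (2m-1) on the truncated
# L2 mass.  (Sketch; T, the cutoff, and the grid are arbitrary.)
T = 3.0
w = np.linspace(T, 200.0, 2_000_001)
dw = w[1] - w[0]
results = {}
for m in (1, 2, 3):
    phi = (np.sin(2 * np.pi * w) / (2 * np.pi * w)) ** m
    envelope_ok = bool(np.all(np.abs(phi) <= (2 * np.pi * w) ** -m + 1e-15))
    f = phi ** 2
    tail = 2.0 * np.sum((f[:-1] + f[1:]) / 2.0) * dw          # trapezoid rule
    closed = 2.0 * (2.0 * np.pi) ** (-2 * m) / ((2 * m - 1) * T ** (2 * m - 1))
    results[m] = (envelope_ok, bool(tail <= closed))
print(results)
```

The numerically integrated tail comes out at roughly half the closed-form bound, since $\sin^2$ averages to $1/2$ under the envelope.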
