High-Dimensional Gaussian Mean Estimation under Realizable Contamination

We study mean estimation for a Gaussian distribution with identity covariance in $\mathbb{R}^d$ under a missing data scheme termed the realizable $ε$-contamination model. In this model, an adversary chooses a function $r(x)$ taking values between 0 and $ε$, and each …

Authors: Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas

High-Dimensional Gaussian Mean Estimation under Realizable Contamination

Ilias Diakonikolas* (University of Wisconsin-Madison, ilias@cs.wisc.edu), Daniel M. Kane† (University of California, San Diego, dakane@cs.ucsd.edu), Thanasis Pittas‡ (University of Wisconsin-Madison, pittas@wisc.edu)

March 18, 2026

Abstract

We study mean estimation for a Gaussian distribution with identity covariance in $\mathbb{R}^d$ under a missing data scheme termed the realizable $\varepsilon$-contamination model. In this model an adversary can choose a function $r(x)$ between $0$ and $\varepsilon$, and each sample $x$ goes missing with probability $r(x)$. Recent work [MVB+24] proposed this model as an intermediate-strength setting between Missing Completely At Random (MCAR)—where missingness is independent of the data—and Missing Not At Random (MNAR)—where missingness may depend arbitrarily on the sample values and can lead to non-identifiability issues. That work established information-theoretic upper and lower bounds for mean estimation in the realizable contamination model. Their proposed estimators incur runtime exponential in the dimension, leaving open the possibility of computationally efficient algorithms in high dimensions. In this work, we establish an information–computation gap in the Statistical Query model (and, as a corollary, for Low-Degree Polynomials and PTF tests), showing that algorithms must either use substantially more samples than information-theoretically necessary or incur exponential runtime. We complement our SQ lower bound with an algorithm whose sample–time tradeoff nearly matches our lower bound. Together, these results qualitatively characterize the complexity of Gaussian mean estimation under $\varepsilon$-realizable contamination.

* Supported by NSF Medium Award CCF-2107079, ONR award number N00014-25-1-2268, and an H.I. Romnes Faculty Fellowship.
† Supported by NSF Medium Award CCF-2107547.
‡ Supported by NSF Medium Award CCF-2107079.

1 Introduction

The fundamental assumption underlying much of classical statistics is that datasets consist of i.i.d. samples drawn from the distribution we aim to learn. In practice, however, this assumption is often violated. A common violation arises when the observations are incomplete. This can occur for a variety of reasons, for example due to data collection via crowdsourcing [VdVE11] or peer grading [PHC+, KWL+13], and has led to the development of a broad literature studying estimators that are robust to missing data (see, e.g., [Rub76, Tsi, LR19] for standard references).

Different kinds of missingness patterns can be classified based on their nature. The simplest and most benign form of missingness is known as "Missing Completely at Random" (MCAR), meaning that the mechanism causing the missingness is independent of the data itself. A large body of work studies statistical estimation and inference under MCAR assumptions. Examples include sparse linear regression [LW, BRT17], classification [CZ19, SBC24], PCA [EvdG19, ZWS22, YCF24], covariance and precision matrix estimation [Lou14, LT18], and changepoint estimation [XHW12, FWS22]. However, the MCAR assumption fails to capture many settings of interest where missingness is systematic.
For example, individuals with depression are more likely to submit incomplete questionnaires [CMW+21], and data may be missing for patients who discontinue treatment or go off protocol due to poor tolerability [LDC+12]. Other commonly studied missingness mechanisms include MAR, in which the dependence of the missingness on the data values is only through the observed portion of the data [LR19, SGJC13, FDS22], and MNAR, where the missingness may depend in any way on the data [Rob97, RR97, SRR99, SMP15, ALJY20, DDK+25]. While these models are more expressive, the resulting statistical guarantees are sometimes weak, for example failing to ensure identifiability of the target parameters [MVB+24].

Recent work [MVB+24] proposed and studied a different model—which they termed realizable contamination; see Definition 1.1. This missingness model is not MCAR, but it is milder than MAR and MNAR: the missingness depends on the data, but in a more structured manner. As we explain below, the realizable contamination model can also be viewed as an analogue of Massart noise [MN06]—a widely studied label-corruption model in supervised learning that lies between purely random and adversarial corruptions—in the unsupervised setting.

Definition 1.1 (Realizable $\varepsilon$-contamination model). Let $\varepsilon \in (0,1)$ be a contamination parameter. Let $P$ be a distribution on a domain $X$ with probability density function (pdf) $p : X \to \mathbb{R}_+$. An $\varepsilon$-corrupted version $\widetilde{P}$ of $P$ is any distribution that can be obtained as follows: First, an adversary chooses a function $f : X \to \mathbb{R}_+$ such that $(1-\varepsilon)p(x) \le f(x) \le p(x)$. $\widetilde{P}$ is then defined to be the distribution whose samples are generated as follows:
- With probability $\int_X f(x)\,\mathrm{d}x$, the sample is drawn from the distribution with pdf $f(x)/\int_X f(x)\,\mathrm{d}x$.
- With probability $1 - \int_X f(x)\,\mathrm{d}x$, the sample is set to the special symbol $\bot$.

[MVB+24] define this model using a somewhat different formalism that is equivalent to Definition 1.1. The equivalence of the definitions is discussed in Section A.1. Below, we discuss connections between Definition 1.1 and preexisting concepts in the statistics and ML literature.

First, the realizable contamination model shares some similarities with Huber's contamination model from classical robust statistics [Tuk60, Hub64]. In Huber's model, the observed samples are drawn i.i.d. from a mixture that with probability $1-\varepsilon$ outputs a sample from the inlier (clean) distribution and with probability $\varepsilon$ outputs a sample from an arbitrary outlier distribution. As explained in Section A.1, the realizable contamination model of Definition 1.1 can equivalently be described as a mixture where, with probability $1-\varepsilon$, a sample is drawn from $P$, and otherwise from an MNAR version of $P$. Thus, the inlier component is the same in both models. That said, Huber's model is more powerful due to its ability to introduce arbitrary outlier samples.

Second, the model of Definition 1.1 can be viewed as an unsupervised analogue of Massart noise [MN06]. In supervised learning, Random Classification Noise [AL88] and adversarial noise represent two extremes, and Massart noise was introduced as a realistic intermediate model. There, the label of $x$ is flipped with probability $\eta(x)$, with $\int \eta(x)\,\mathrm{d}x$ equal to the total corruption rate. As shown in Section A.1, Definition 1.1 admits an equivalent description: a sample $x \sim P$ is discarded with probability $1 - f(x)/p(x)$, closely mirroring the Massart noise model.
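To make Definition 1.1 concrete, the following minimal Python sketch (not from the paper) draws from an $\varepsilon$-corrupted one-dimensional Gaussian using the rejection view just described. The specific deletion rule below—censor the right tail at the maximal allowed rate $\varepsilon$—is a toy adversary we introduce purely for illustration; any measurable $r : X \to [0, \varepsilon]$ is permitted by the model.

```python
# Toy sampler for the realizable eps-contamination model (Definition 1.1),
# via the equivalent view: draw x ~ P, censor it with prob. r(x) in [0, eps].
import numpy as np

rng = np.random.default_rng(0)

def sample_realizable_contamination(n, eps, mu=0.0):
    """Draw n observations from an eps-corrupted N(mu, 1).

    The deletion rate r(x) here (maximal rate eps on the right tail) is an
    arbitrary illustrative choice of adversary.
    """
    x = rng.normal(mu, 1.0, size=n)
    r = np.where(x > mu, eps, 0.0)          # adversary's deletion rate r(x)
    censored = rng.random(n) < r            # each sample goes missing w.p. r(x)
    return np.where(censored, np.nan, x)    # np.nan plays the role of the symbol ⊥

obs = sample_realizable_contamination(10_000, eps=0.3)
print("fraction missing:", np.isnan(obs).mean())    # at most ~eps
print("mean of visible samples:", np.nanmean(obs))  # biased below mu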
Lastly, as observed in [MVB+24], realizable $\varepsilon$-contamination generalizes several previous approaches for studying restricted forms of MNAR, including certain biased sampling models [Var85, GVW88, BR91, AL13, SLW22], related restrictions known as sensitivity conditions in the causal inference literature [Ros87, ZSB19], and the missingness model used in the truncated statistics literature [DGTZ18, KTZ19, DKPZ24]. The latter can be viewed, at a high level, as corresponding to Definition 1.1 with $\varepsilon = 1$, albeit with important differences between the two models. A more detailed discussion of related work appears in Section A.1.

In this paper, we study the complexity of arguably the most fundamental statistical task in the presence of realizable $\varepsilon$-contamination: estimating the mean of a multivariate Gaussian with identity covariance. Specifically, given access to $\varepsilon$-corrupted samples from $P = N(\mu, I_d)$ on $\mathbb{R}^d$, where the mean vector $\mu$ is unknown, and a desired accuracy $\delta$, the goal is to compute an estimate $\widehat{\mu}$ that lies within Euclidean distance $\delta$ of the true mean $\mu$. The only known results for this problem are information-theoretic. [MVB+24] established matching upper and lower bounds on the task's sample complexity. For high constant probability of success, it is

$$ n \;=\; \Theta\!\left( \frac{d}{\delta^2 (1-\varepsilon)} \right) \;+\; \exp\!\left( \Theta\!\left( \frac{\log\big(1 + \frac{\varepsilon}{1-\varepsilon}\big)}{\delta} \right)^{\!2}\, \right). \qquad (1) $$

We note that the first term in (1) corresponds to the standard sample complexity for Gaussian mean estimation from the $n(1-\varepsilon)$ clean samples, while the second term captures the additional cost due to realizable $\varepsilon$-contamination. Remarkably, these results hold for all $\varepsilon, \delta \in (0,1)$, highlighting two important differences from Huber's contamination model (and other robust statistics/missing data models). First, consistent estimation is possible as $\delta \to 0$; that is, one can achieve arbitrarily small error, whereas under Huber's model estimation is only possible for $\delta \gg \varepsilon$. Second, estimation remains possible even for $\varepsilon \ge 1/2$, i.e., when the majority of samples are corrupted.
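For intuition about the bound (1), the back-of-the-envelope Python snippet below (illustrative only; all hidden $\Theta(\cdot)$ constants are set to 1, which the true bound does not license) evaluates its two terms, showing the contamination term overtaking the usual $d/\delta^2$ term as $\delta$ shrinks:

```python
# Illustrative evaluation of the two terms of bound (1), constants suppressed.
import math

def n_terms(d, eps, delta):
    clean = d / (delta**2 * (1 - eps))
    contamination = math.exp((math.log(1 + eps / (1 - eps)) / delta) ** 2)
    return clean, contamination

for delta in (0.5, 0.2, 0.1):
    clean, contam = n_terms(d=100, eps=0.5, delta=delta)
    print(f"delta={delta}: clean term ~ {clean:.1e}, contamination term ~ {contam:.1e}")
```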
For the sake of completeness, the Appendix provides a self-contained and more concise proof of the sample complexity bounds, albeit with slightly weaker guarantees: Appendix C establishes a lower bound in the one-dimensional case, while Appendix D presents a (computationally inefficient) algorithm using $\frac{d}{\varepsilon^2}\exp\big((\log(1 + \varepsilon/(1-\varepsilon))/\delta)^2\big)$ samples.

Although the information-theoretic aspects of our problem are well understood, much less is known about the computational aspects (the focus of this paper). Specifically, essentially the only known multivariate estimator is based on computing a cover of the unit sphere with $2^{\Theta(d)}$ directions and applying the one-dimensional estimator to the projections of the samples along each direction. This has runtime $2^{\Theta(d)}\mathrm{poly}(n, d)$ and therefore raises the following questions:

Is there a sample (near-)optimal and polynomial-time mean estimator in the presence of realizable contamination? More broadly, what is the computational complexity of this task?

In this paper, we address and essentially resolve these questions. First, we give formal evidence (in the form of Statistical Query/Low-Degree Polynomial or PTF lower bounds) that the problem exhibits an information–computation gap—meaning that either the runtime or the sample size required is inherently large. Second, we give an algorithm whose sample–time tradeoff nearly matches our lower bound. Together, these results qualitatively characterize the complexity of Gaussian mean estimation with realizable contamination.

1.1 Our Results

We begin with our first main result: an information–computation gap for Gaussian mean estimation in the $\varepsilon$-realizable contamination model. As is typical for such gaps, rather than proving them unconditionally, one usually establishes them under complexity-theoretic hardness assumptions or for restricted (yet natural) families of algorithms. We state the result for the family of Statistical Query (SQ) algorithms below (and discuss other models later on). Before stating the result, we provide a brief summary of SQ.

SQ Model Basics The model, introduced by [Kea98] and extensively studied since (see, e.g., [FGR+13]), considers algorithms that, instead of drawing individual samples from the target distribution, have indirect access to the distribution via the following oracle:

Definition 1.2 (STAT Oracle). Let $D$ be a distribution on $\mathbb{R}^d$. A statistical query is a bounded function $f : \mathbb{R}^d \to [-1, 1]$. For $\tau > 0$, the $\mathrm{STAT}(\tau)$ oracle responds to the query $f$ with a value $v$ such that $|v - \mathbf{E}_{X \sim D}[f(X)]| \le \tau$. We call $\tau$ the tolerance of the statistical query.

An SQ lower bound for a learning problem is an unconditional statement that any SQ algorithm for the problem either needs to perform a large number $q$ of queries, or at least one query with very small tolerance $\tau$. Note that, by Hoeffding–Chernoff bounds, a query of tolerance $\tau$ is implementable by non-SQ algorithms by drawing $O(1/\tau^2)$ samples and averaging them. Thus, an SQ lower bound intuitively corresponds to a tradeoff between a runtime of $\Omega(q)$ and a sample complexity of $\Omega(1/\tau)$.

Our SQ lower bound for Gaussian mean estimation under $\varepsilon$-realizable contamination shows that any SQ algorithm must either make an exponential number of queries or make at least one query with tolerance $d^{-\widetilde{\Omega}\big(\frac{\log(1+\varepsilon/(1-\varepsilon))}{\delta}\big)^2}$ (corresponding to a sample complexity of $d^{\widetilde{\Omega}\big(\frac{\log(1+\varepsilon/(1-\varepsilon))}{\delta}\big)^2}$ for sample-based algorithms). The lower bound is established for a testing version of the problem: distinguishing between two means that differ by $\delta$ in $\ell_2$ distance. Since any estimator with accuracy $\delta/2$ solves this testing problem, the SQ hardness directly carries over to the estimation setting.

Theorem 1.3 (SQ lower bound). There exists a sufficiently small absolute constant $c > 0$ such that the following holds for all $\varepsilon \in (0,1)$ and $\delta \in (0, c\varepsilon)$. Define $m = \lfloor c\gamma^2/\log\gamma \rfloor$ for $\gamma := \frac{1}{\delta}\log\big(\frac{1+\varepsilon/2}{1-\varepsilon/2}\big)$. For any dimension $d \ge (m \log d)^2$ we have the following: any SQ algorithm that distinguishes between $N(0, I_d)$ and $N(\delta u, I_d)$ for $u$ a unit vector, under the contamination model of Definition 1.1, needs either $2^{d^{\Omega(1)}}$ queries or at least one query with tolerance $d^{-m/16}$.
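The tolerance–sample correspondence above can be made concrete. Here is a short sketch (ours; helper names are assumptions, not the paper's) of simulating a single STAT$(\tau)$ query from i.i.d. samples via a Hoeffding-sized empirical average:

```python
# Simulating a STAT(tau) query by averaging O(1/tau^2) i.i.d. samples.
import numpy as np

rng = np.random.default_rng(1)

def simulate_stat_query(f, draw, tau, failure_prob=0.01):
    """Answer E[f(X)] within +/- tau with prob. >= 1 - failure_prob.

    Hoeffding for f taking values in [-1, 1]:
    n >= 2 * log(2 / failure_prob) / tau^2 samples suffice.
    """
    n = int(np.ceil(2 * np.log(2 / failure_prob) / tau**2))
    return float(np.mean(f(draw(n))))

draw = lambda n: rng.normal(0.1, 1.0, size=n)   # distribution with unknown mean
f = lambda x: np.clip(x, -1.0, 1.0)             # a bounded statistical query
print(simulate_stat_query(f, draw, tau=0.01))   # close to E[clip(X, -1, 1)]
```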
The above result is based on framing our problem as a special case of Non-Gaussian Component Analysis (NGCA), a general testing problem that is known to be hard in many restricted models of computation beyond the SQ model—including low-degree polynomial methods [BBH+21], PTFs [DKLP25b], and the Sum-of-Squares framework [DKPP24]. As such, we highlight that qualitatively similar hardness results to those in Theorem 1.3 also hold in these models. The formal versions of these statements are deferred to Appendix E.

Our second main result presents an algorithm whose leading term in the sample complexity is $d^{O\big(\log(1+\frac{\varepsilon}{1-\varepsilon})/\delta\big)^2}$, with a runtime that is nearly polynomial in $n$ and $d$. In light of our SQ lower bound, this conceptually means that (up to certain factors that we explain below) our algorithm achieves the optimal tradeoff between information and computation.

Theorem 1.4 (Algorithmic result). Let $\varepsilon \in (0,1)$ and $\delta \le \varepsilon$ be parameters.¹ There exists an algorithm that takes as input samples from an $\varepsilon$-corrupted version of $N(\mu, I_d)$ and the parameters $\varepsilon, \delta$, and returns $\widehat{\mu} \in \mathbb{R}^d$ such that $\|\widehat{\mu} - \mu\|_2 \le \delta$ with probability at least $0.9$. Moreover, the sample complexity of the algorithm is $n = (kd)^{O(k)}/\varepsilon^2$, where $k := \big(\frac{1}{\delta}\log(1 + \frac{\varepsilon}{1-\varepsilon})\big)^2$, and the runtime of the algorithm is $\exp\big(e^{\widetilde{O}(k)}/\varepsilon^2\big)\,\mathrm{poly}(n, d)$.

¹ If $\delta \gg \varepsilon$, pre-existing algorithms from robust statistics (see, e.g., [DK23]) obtain error $O(\delta)$.

Up to the $k^{O(k)}/\varepsilon^2$ factor (which is independent of the dimension), the sample complexity achieved by our algorithm matches our SQ lower bound. The runtime of our algorithm is polynomial in $n$ and $d$, up to the term $\exp\big(e^{\widetilde{O}(k)}/\varepsilon^2\big)$ (which is also dimension-independent).

1.2 Our Techniques

SQ lower bound Our SQ lower bound builds on the framework of [DKS17], which establishes the following. If $A$ is a one-dimensional distribution whose first $m$ moments match those of $N(0,1)$, then distinguishing between $N(0, I)$ and the $d$-dimensional distribution that agrees with $A$ along an unknown direction and is standard Gaussian in all orthogonal directions requires either $q = 2^{d^{\Omega(1)}}$ queries or tolerance $\tau < d^{-\Omega(m)}$ in the SQ model. We show that the robust mean estimation problem we consider can be cast in this form. The main challenge is to design an adversary, as in Definition 1.1, that corrupts $N(\delta, 1)$ so that it matches $m = \widetilde{\Omega}\big(\log(1 + \frac{\varepsilon}{1-\varepsilon})/\delta\big)^2$ moments of $N(0,1)$.

We accomplish this via a two-step approach. First, we construct a function $f$ for the adversary in Definition 1.1 that uses $\varepsilon/2$ of the corruption budget to match the standard Gaussian over a large interval $x \in [-B, B]$, where $B = \log(1 + \frac{\varepsilon}{1-\varepsilon})/\delta$. This relies on the observation that the probability density functions of two unit-variance Gaussians whose means differ by $\delta$ are within a $(1 \pm \varepsilon)$ multiplicative factor of each other, except in the $B$-tails. After this step, the moments do not exactly match those of $N(0,1)$ due to discrepancies outside the interval $[-B, B]$. However, for a small constant $c$, the difference in the first $cB^2$ moments is exponentially small, because the mass in the tails decays exponentially for Gaussians. Consequently, by adding a very small polynomial $p(x)$ to $f(x)$ on $[-1, 1]$, we can eliminate these remaining discrepancies. By imposing the moment-matching constraints and expanding the polynomial $p(x)$ in the basis of Legendre polynomials, we show that such a polynomial $p(\cdot)$ indeed exists. Adding this polynomial correction corresponds to censoring each point $x$ with probability $\varepsilon/4 + p(x)(1 - \varepsilon/4)\mathbf{1}(|x| < 1)/\phi(x)$, where $\phi$ denotes the standard Gaussian pdf.
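The key observation in the first step—that the densities of $N(0,1)$ and $N(\delta,1)$ agree up to a $(1 \pm \varepsilon)$-type factor on $[-B, B]$—can be checked directly, since the log-density ratio is linear in $x$. A small illustrative script (ours, not the paper's; the window holds up to the negligible $e^{-\delta^2/2}$ shift):

```python
# Check: log( phi(x - delta) / phi(x) ) = delta*x - delta^2/2, so the ratio
# stays in roughly [1 - eps, 1/(1 - eps)] on [-B, B] with
# B = log(1 + eps/(1 - eps)) / delta.
import numpy as np

eps, delta = 0.4, 0.05
B = np.log(1 + eps / (1 - eps)) / delta
x = np.linspace(-B, B, 10_001)
ratio = np.exp(delta * x - delta**2 / 2)
print(f"B = {B:.1f}")
print(f"ratio range on [-B, B]: [{ratio.min():.3f}, {ratio.max():.3f}]")
print(f"target window:          [{1 - eps:.3f}, {1 / (1 - eps):.3f}]")
```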
Algorithmic result As discussed in the introduction, from an information-theoretic perspective, a number of samples as in (1) suffices to estimate the mean up to error $\delta$. The algorithm achieving this guarantee relies on the fact that two Gaussians whose means differ by $\delta$ have probability density functions that are within a multiplicative factor of $1 \pm \varepsilon$ of each other, except in the $\big(\log(1 + \frac{\varepsilon}{1-\varepsilon})/\delta\big)$-tails. Consequently, by accurately estimating the densities in these tails using $\exp\big((\log(1 + \frac{\varepsilon}{1-\varepsilon})/\delta)^2\big)$ samples, one can rule out any candidate mean $\mu'$ that lies at distance more than $\delta$ from the true mean $\mu$. This yields a one-dimensional estimator (see also Appendix D for more details). A multivariate extension can be obtained by repeating this procedure along every direction in a fine cover of the unit sphere, thereby learning an $\ell_2$-approximation of $\mu$. However, such a cover must have size $2^{\Theta(d)}$, rendering the resulting algorithm computationally infeasible.

For computational efficiency, rather than computing tails, we rely on moments as a substitute. Our approach is motivated by the following observation. Let $k > \big(\log(1 + \frac{\varepsilon}{1-\varepsilon})/\delta\big)^2$. Then, for two unit-variance Gaussians whose means differ by $\delta$ (e.g., $N(0,1)$ and $N(\delta,1)$), there must exist a moment of order $t \in [k]$ that differs by at least $\varepsilon$ (Corollary 4.2). This structural result motivates the following algorithm: First, we compute a rough estimate $\widehat{\mu}$ of the true mean $\mu$ and translate all samples via the transformation $x \leftarrow x - \widehat{\mu}$. After this preprocessing step, the samples behave as if they were drawn from a distribution with mean $\mu - \widehat{\mu}$, and the remaining goal is to estimate this difference to error $\delta$. For each $t \in [k]$, we compute the order-$t$ moment tensor $T_t$. By the structural result of Corollary 4.2, we can certify that for every $v$ such that $\langle v^{\otimes t}, T_t \rangle$ is below a certain threshold $\eta$, it holds that $|v^\top(\mu - \widehat{\mu})| \le \delta/2$ (Claim 4.6). It therefore suffices to restrict attention to the subspace $V$ corresponding to directions with large moments, and to apply the inefficient estimator described above only within this subspace to estimate $\mu - \widehat{\mu}$ up to error $\delta/2$. By the triangle inequality, the resulting estimate has error at most $\delta$.

The runtime of this approach is $2^{O(\dim(V))}$. Crucially, we show that $\dim(V)$ is small—bounded solely as a function of $\varepsilon$ and $\delta$, with no dependence on $d$. This follows from the fact that the $\ell_2$-norm of the full moment tensor is bounded in terms of $\varepsilon, \delta$, a property that continues to hold under our corruption model (cf. Lemma 4.7). A small norm implies that only few eigenvalues can be large, which in turn bounds $\dim(V)$.

There are two technical complications. First, identifying the subspace $V$ is computationally hard: even finding a single direction $v$ with a large projection of a degree-4 moment tensor is intractable.
To circumvent this, instead of searching for directions $v$ with small $\langle v^{\otimes t}, T_t \rangle$, we flatten $T_t$ into a matrix $M \in \mathbb{R}^{d \times d^{t-1}}$ and define $V$ as the span of the singular vectors of $M$ in $\mathbb{R}^d$ with singular values exceeding $\eta$, which is computable in polynomial time. This relaxation suffices since, for any $v$ orthogonal to $V$, $\langle v^{\otimes t}, T_t \rangle \le \sup_{u \in \mathbb{R}^{d^{t-1}} : \|u\|_2 = 1} \langle v, M u \rangle \le \eta$.

The second challenge concerns the initial rough estimate $\widehat{\mu}$. While the robust statistics literature offers many black-box estimators, most assume stronger contamination models and require $\varepsilon < 1/2$, failing when a majority of samples are corrupted. To our knowledge, no polynomial-time rough estimator exists for Definition 1.1 when $\varepsilon > 1/2$. We therefore use a list-decodable estimator, which tolerates a majority of outliers but, due to non-identifiability, outputs a list of candidate means, one of which is close to $\mu$, while all others could be arbitrarily inaccurate. Fortunately, there is a tournament-based procedure known in the literature that we can use to identify an element of the list whose error is comparable to the best among all of them. This gives us a single warm-start vector and completes the proof sketch of the algorithm.
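The following toy numpy sketch (our illustration; the paper's actual algorithm uses Hermite tensors, a recentering step, and different thresholds) shows the flattening-plus-SVD step for a third-order raw-moment tensor:

```python
# Toy version of the dimension-reduction step: flatten a t-way moment tensor
# into a d x d^{t-1} matrix and keep directions with singular value > eta.
import numpy as np

rng = np.random.default_rng(2)

d, t, eta = 5, 3, 2.0
mu = np.zeros(d); mu[0] = 1.0                      # planted mean direction
x = rng.normal(mu, 1.0, size=(20_000, d))

# Empirical 3rd raw-moment tensor T = E[x ⊗ x ⊗ x] (no recentering here).
T = np.einsum('ni,nj,nk->ijk', x, x, x) / x.shape[0]

M = T.reshape(d, d ** (t - 1))                     # flatten to d x d^{t-1}
U, s, _ = np.linalg.svd(M, full_matrices=False)    # columns of U live in R^d
V = U[:, s > eta]                                  # subspace of large directions
print("singular values:", np.round(s, 2))
print("dim(V) =", V.shape[1])                      # small; aligned with mu
```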
2 Preliminaries

Basic Notation We use $\mathbb{Z}_+$ for the set of positive integers and denote $[n] = \{1, \ldots, n\}$. For a vector $x$, we denote by $\|x\|_2$ its Euclidean norm. Let $I_d$ denote the $d \times d$ identity matrix (omitting the subscript when it is clear from the context). We use $\top$ for the transpose of matrices and vectors. For a tensor $T$, we define $\|T\|_2 = \sqrt{\sum_i T_i^2}$, the $\ell_2$ (or Frobenius) norm. We use $a \lesssim b$ to denote that there exists an absolute universal constant $C > 0$ (independent of the variables or parameters on which $a$ and $b$ depend) such that $a \le C b$. In our notation, $a = O(b)$ has the same meaning as $a \lesssim b$ (similarly for the $\Omega(\cdot)$ notation). We use $\widetilde{O}$ and $\widetilde{\Omega}$ to hide polylogarithmic factors in the argument.

2.1 Non-Gaussian Component Analysis (NGCA)

We give brief background on the SQ hardness of the Non-Gaussian Component Analysis (NGCA) problem. First, the testing version of NGCA is defined as distinguishing between a standard Gaussian and a Gaussian that has a non-Gaussian component planted in an unknown direction, defined below.

Definition 2.1 (Hidden direction distribution). Let $A$ be a distribution on $\mathbb{R}$. For a unit vector $v$, we denote by $P_{A,v}$ the distribution with density $P_{A,v}(x) := A(v^\top x)\,\phi_{\perp v}(x)$, where $\phi_{\perp v}(x) = \exp\big(-\|x - (v^\top x)v\|_2^2/2\big)/(2\pi)^{(d-1)/2}$; i.e., the distribution that coincides with $A$ in the direction $v$ and is standard Gaussian in every orthogonal direction.

Problem 2.2 (Non-Gaussian Component Analysis (NGCA)). Let $A$ be a distribution on $\mathbb{R}$ and $P_{A,v}$ the distribution from Definition 2.1. We define the following hypothesis testing problem:
- $H_0$: The data distribution is $N(0, I_d)$.
- $H_1$: The data distribution is $P_{A,v}$, for some vector $v \in \mathbb{S}^{d-1}$ in the unit sphere.

A known result is that NGCA is hard in the SQ model if $A$ matches many moments with the standard Gaussian. Here we use the statement of Theorem 1.5 in [DKRS23] with $\lambda = 1/2$, $c = (1-\lambda)/8 = 1/16$, and $\nu = 0$. Hardness results for other models beyond SQ are summarized in Appendix E.

Condition 2.3 (Moment matching condition). $\mathbf{E}_{x \sim A}[x^i] - \mathbf{E}_{x \sim N(0,1)}[x^i] = 0$ for all $i \in [m]$.

Proposition 2.4 (Theorem 1.5 in [DKRS23]). Let $d, m$ be positive integers with $d \ge (m \log d)^2$. Any SQ algorithm that solves Problem 2.2 for a distribution $A$ satisfying Condition 2.3 requires either $2^{d^{\Omega(1)}}$ many queries or at least one query with accuracy $d^{-m/16}$.

3 Statistical Query Lower Bound: Proof of Theorem 1.3

In order to prove Theorem 1.3, we will establish the following moment-matching proposition.

Proposition 3.1 (Moment Matching). There exists a sufficiently small absolute constant $c > 0$ such that the following holds. For every $\varepsilon \in (0,1)$ and $\delta \in (0, c\varepsilon)$, there exists a distribution $A$ on $\mathbb{R}$ such that the following statements are satisfied for $m = \lfloor c\gamma^2/\log\gamma \rfloor$, where $\gamma := \frac{1}{\delta}\log\big(\frac{1+\varepsilon/2}{1-\varepsilon/2}\big)$:
- $A$ is the conditional distribution on the visible (non-deleted) samples of an $\varepsilon$-corrupted version of $N(\delta, 1)$ according to Definition 1.1.
- It holds that $\mathbf{E}_{x \sim A}[x^i] = \mathbf{E}_{x \sim N(0,1)}[x^i]$ for $i = 1, \ldots, m$.

We briefly explain how Proposition 3.1 yields Theorem 1.3. For $v$ the unit vector in the direction of $\mu$, let $P_{A,v}$ be the distribution as in Problem 2.2. Then $P_{A,v}$ is an $\varepsilon$-corrupted version of $N(\delta v, I)$, and the hypothesis testing problem of Theorem 1.3 is an instance of Non-Gaussian Component Analysis testing with $A$ as the hidden direction distribution. An application of Proposition 2.4 with $m = \lfloor c\gamma^2/\log\gamma \rfloor$ then yields Theorem 1.3. Finally, note that Theorem 1.3 shows hardness of distinguishing between the ground truth $\mu = 0$ and $\|\mu\|_2 \ge \delta$. This immediately implies hardness of estimating $\mu$ up to error $\delta/2$: given such an estimator, one can run it to obtain $\widehat{\mu}$ with $\|\widehat{\mu} - \mu\|_2 \le \delta/2$ and reject the null hypothesis iff $\|\widehat{\mu}\|_2 > \delta/2$.

3.1 Proof of Proposition 3.1

Denote by $\phi$ the pdf of $N(0,1)$. Following Definition 1.1, we need to find a function $f : \mathbb{R} \to \mathbb{R}_+$ such that $(1-\varepsilon)\phi(x-\delta) \le f(x) \le \phi(x-\delta)$ and the distribution with pdf $f(x)/\int_{\mathbb{R}} f(x)\,\mathrm{d}x$ matches the first $m$ moments of $N(0,1)$. The argument consists of two parts:

1. We will show that there exists $g : \mathbb{R} \to \mathbb{R}_+$ such that $(1-\varepsilon)\phi(x-\delta) \le g(x) \le \phi(x-\delta)$, $\int_{\mathbb{R}} g(x)\,\mathrm{d}x = 1 - \varepsilon/2$, and $\frac{g(x)}{1-\varepsilon/2} = \phi(x)$ for all $x \in [-B, B]$, where $B := \frac{1}{\delta}\log\big(\frac{1+\varepsilon/2}{1-\varepsilon/2}\big)$. This means that $g$ defines an $\varepsilon$-corrupted version of $N(\delta, 1)$ that matches $N(0,1)$ exactly on the entire range $[-B, B]$. Due to the mismatch outside this interval, the first $m$ moments will not match exactly, but they will differ by only a small amount.

2. In order to correct the moments, we will find an appropriate polynomial $p(x)$ to add to the function $g$ from the previous step on $[-1, 1]$, so that (i) the distribution with pdf proportional to $f(x) := g(x) + p(x)\mathbf{1}(|x| \le 1)$ now matches the first $m$ moments of $N(0,1)$ exactly, and (ii) $f$ satisfies $(1-\varepsilon)\phi(x-\delta) \le f(x) \le \phi(x-\delta)$, i.e., it is still a valid $\varepsilon$-corruption of $N(\delta, 1)$.

We now present the proofs of the two steps in Lemma 3.2 and Lemma 3.3, respectively.

Lemma 3.2. Denote by $\phi(x)$ the pdf of $N(0,1)$. For any $\varepsilon, \delta \in (0,1)$ there exists a function $g : \mathbb{R} \to \mathbb{R}_+$ such that $(1-\varepsilon)\phi(x-\delta) \le g(x) \le \phi(x-\delta)$, $\int_{\mathbb{R}} g(x)\,\mathrm{d}x = 1 - \varepsilon/2$, and $\frac{g(x)}{1-\varepsilon/2} = \phi(x)$ for all $x \in [-B + \delta/2, B + \delta/2]$, where $B := \frac{1}{\delta}\log\big(\frac{1+\varepsilon/2}{1-\varepsilon/2}\big)$.

Proof.
For convenience, we will prove the claim with everything shifted by $\delta/2$ to the left; i.e., we will show that

$$ (1-\varepsilon)\,\phi(x - \delta/2) \;\le\; g(x) \;\le\; \phi(x - \delta/2), \qquad (2) $$

as well as $\int_{\mathbb{R}} g(x)\,\mathrm{d}x = 1 - \varepsilon/2$ and $\frac{g(x)}{1-\varepsilon/2} = \phi(x + \delta/2)$ for all $x \in [-B, B]$. For simplicity of notation, we will use $p_+(x) := \phi(x - \delta/2)$ and $p_-(x) := \phi(x + \delta/2)$ to denote the two Gaussian densities for the rest of the proof.

The main idea is to let $g(x) = (1 - \varepsilon/2)p_-(x)$ for all $x$ in an interval around zero which is as large as possible without violating condition (2). Once we find the largest such interval, we will need to define $g(x)$ outside of it so that it still respects condition (2).

For the first part of the argument (finding the largest interval on which setting $g(x) = (1-\varepsilon/2)p_-(x)$ satisfies condition (2)), we solve the equations $(1-\varepsilon/2)p_-(x) = p_+(x)$ and $(1-\varepsilon/2)p_-(x) = (1-\varepsilon)p_+(x)$. The two solutions are $x_+ = \frac{1}{\delta}\log\big(\frac{1-\varepsilon/2}{1-\varepsilon}\big)$ and $x_- = \frac{1}{\delta}\log\big(1 - \frac{\varepsilon}{2}\big)$. This means that if we define $B := \frac{1}{\delta}\log\big(\frac{1+\varepsilon/2}{1-\varepsilon/2}\big)$, the function $g(x) := (1-\varepsilon/2)p_-(x)$ satisfies condition (2) for all $x \in [-B, B]$.

We now need to show how to define $g(x)$ outside of $[-B, B]$. We show that it is possible to extend the definition outside of $[-B, B]$ in a way that condition (2) continues to hold and $\int_{\mathbb{R}} g(x)\,\mathrm{d}x = 1 - \varepsilon/2$. A first, unsuccessful approach would be to set $g(x) = (1-\varepsilon)p_+(x)$ for $x > B$ and $g(x) = p_+(x)$ for $x < -B$. Although this ensures condition (2), the other desideratum $\int_{\mathbb{R}} g(x)\,\mathrm{d}x = 1 - \varepsilon/2$ is not satisfied. To see this, define $A_1$ and $A_2$ to be the following areas:

$$ A_1 := \int_B^{+\infty} \big( (1-\varepsilon)p_+(x) - (1-\varepsilon/2)p_-(x) \big)\,\mathrm{d}x, \qquad A_2 := \int_{-\infty}^{-B} \big( (1-\varepsilon/2)p_-(x) - p_+(x) \big)\,\mathrm{d}x. $$

Then the integral of $g$ is

$$ \int_{\mathbb{R}} g(x)\,\mathrm{d}x = \int_{-\infty}^{+\infty} (1-\varepsilon/2)p_-(x)\,\mathrm{d}x + \int_{x_+}^{B} \big( (1-\varepsilon/2)p_-(x) - (1-\varepsilon)p_+(x) \big)\,\mathrm{d}x + A_1 - A_2 \qquad (3) $$
$$ < \; 1 - \tfrac{\varepsilon}{2} + A_1 - A_2, $$

where the first integral is simply $1 - \varepsilon/2$ and the integral from $x_+$ to $B$ is negative (by the definition of $x_+$). We can finally check that $A_2 > A_1$, concluding the proof that $\int_{\mathbb{R}} g(x)\,\mathrm{d}x < 1 - \varepsilon/2$. To see this, first note that we can rewrite $A_2 = \int_B^{\infty} \big( (1-\varepsilon/2)p_+(x) - p_-(x) \big)\,\mathrm{d}x$, using the change of variables $x \to -x$ and the fact that $p_+(-x) = p_-(x)$. Then,

$$ A_2 - A_1 = \int_B^{\infty} \frac{\varepsilon}{2}\big( p_+(x) - p_-(x) \big)\,\mathrm{d}x > 0. $$

However, the above choice of $g(x)$ for $x > B$ is not the only one allowed by condition (2): we could alternatively choose any $g(x) \in [(1-\varepsilon)p_+(x),\, p_+(x)]$ for $x > B$. We just saw that the first extreme choice $g(x) = (1-\varepsilon)p_+(x)$ results in $\int_{\mathbb{R}} g(x)\,\mathrm{d}x < 1 - \varepsilon/2$. We will now show that the other extreme choice, setting $g(x) = p_+(x)$ for $x > B$, results in $\int_{\mathbb{R}} g(x)\,\mathrm{d}x > 1 - \varepsilon/2$. By continuity, this means that there exists a way of defining $g(x)$ on $x > B$ that achieves $\int_{\mathbb{R}} g(x)\,\mathrm{d}x = 1 - \varepsilon/2$ exactly.

We now show the remaining claim: the choice $g(x) = p_+(x)$ for all $x > B$ (with $g(x) = p_+(x)$ for $x < -B$ and $g(x) = (1-\varepsilon/2)p_-(x)$ for $x \in [-B, B]$ as before) results in $\int_{\mathbb{R}} g(x)\,\mathrm{d}x > 1 - \varepsilon/2$.
In this case, similarly to inequality (3), we have

$$ \int_{\mathbb{R}} g(x)\,\mathrm{d}x = \int_{-\infty}^{+\infty} (1-\varepsilon/2)p_-(x)\,\mathrm{d}x + \widetilde{A}_1 - A_2, $$

where $A_2$ is the same as before, but $\widetilde{A}_1 = \int_B^{\infty} \big( p_+(x) - (1-\varepsilon/2)p_-(x) \big)\,\mathrm{d}x$. Now, this gives

$$ \widetilde{A}_1 - A_2 = \int_B^{\infty} \frac{\varepsilon}{2}\big( p_+(x) + p_-(x) \big)\,\mathrm{d}x > 0. $$

Lemma 3.3. Let $\varepsilon \in (0,1)$, $\delta \ll \varepsilon$, and let $g(x)$ be as in Lemma 3.2. There exists a polynomial $p(x)$ such that the function $f(x) := g(x) + p(x)\mathbf{1}(|x| \le 1)$ satisfies $(1-\varepsilon)\phi(x-\delta) \le f(x) \le \phi(x-\delta)$, and the distribution with pdf $f(x)/\int_{\mathbb{R}} f(x)\,\mathrm{d}x$ matches the first $m$ moments of $N(0,1)$ for some $m = \Omega(\gamma^2/\log\gamma)$, where $\gamma := \frac{1}{\delta}\log\big(\frac{1+\varepsilon/2}{1-\varepsilon/2}\big)$.

Proof. The function $g(x)$ from Lemma 3.2 satisfies the following:
1. $(1-\varepsilon)\phi(x-\delta) \le g(x) \le \phi(x-\delta)$ for all $x \in \mathbb{R}$;
2. $\int_{\mathbb{R}} g(x)\,\mathrm{d}x = 1 - \varepsilon/2$;
3. $g(x) = \phi(x)(1-\varepsilon/2)$ for $x \in [-B + \delta/2, B + \delta/2]$, where $B := \frac{1}{\delta}\log\big(\frac{1+\varepsilon/2}{1-\varepsilon/2}\big)$.

We will show the existence of a polynomial $p$ such that:
1. $|p(x)| \le c\varepsilon$, where $c$ is a sufficiently small absolute constant;
2. $\int_{-1}^{1} p(x)\,\mathrm{d}x = 0$;
3. $\int_{\mathbb{R}} x^i\,\frac{g(x) + p(x)\mathbf{1}(|x| \le 1)}{1 - \varepsilon/2}\,\mathrm{d}x = \int_{\mathbb{R}} x^i \phi(x)\,\mathrm{d}x$ for all $i \in [m]$.

Before showing that such a polynomial exists, we first show how Lemma 3.3 follows from the above points. The moment-matching property in the conclusion of Lemma 3.3 follows directly from the third item. We now show how the requirement $(1-\varepsilon)\phi(x-\delta) \le f(x) \le \phi(x-\delta)$ for all $x \in \mathbb{R}$ follows. Since $|p(x)| \le c\varepsilon$, it suffices to verify that (i) $\phi(x-\delta) - g(x) = \Omega(\varepsilon)$ for all $x \in [-1, 1]$ and (ii) $g(x) - (1-\varepsilon)\phi(x-\delta) = \Omega(\varepsilon)$ for all $x \in [-1, 1]$. We show part (i); the other part follows by identical arguments. The smallest value of $\phi(x-\delta) - g(x)$ occurs at $x = -1$. At that point,

$$ \phi(x-\delta) - (1-\varepsilon/2)\phi(x) = \phi(x-\delta) - \phi(x) + \frac{\varepsilon}{2}\phi(x) \ge -O(\delta) + \Omega(\varepsilon) \ge \Omega(\varepsilon), $$

where the first inequality uses the fact that $\phi(x-\delta) - \phi(x) = x\phi(x)\delta + \frac{\xi^2-1}{2}\phi(\xi)\delta^2$ for some $x - \delta \le \xi \le x$ (by Taylor's theorem), together with $\phi(x) = \Omega(1)$ for $x \in [-1, 1]$; the last inequality uses $\delta \ll \varepsilon$.

We now turn to showing the existence of the polynomial $p$. This part of the proof follows an argument similar to the one in [DKS17]. Recall the moment-matching condition that we want to ensure: $\int_{-1}^{1} x^i p(x)\,\mathrm{d}x = \int_{-\infty}^{\infty} x^i \phi(x)\,\mathrm{d}x - \int_{-\infty}^{\infty} x^i \frac{g(x)}{1-\varepsilon/2}\,\mathrm{d}x$. Using the fact that $g(x)/(1-\varepsilon/2) = \phi(x)$ on the interval $[-B+\delta, B+\delta]$, the moment-matching condition becomes

$$ \int_{-1}^{1} x^i p(x)\,\mathrm{d}x = \int_{\mathbb{R} \setminus [-B+\delta, B+\delta]} x^i \phi(x)\,\mathrm{d}x - \int_{\mathbb{R} \setminus [-B+\delta, B+\delta]} x^i \frac{g(x)}{1-\varepsilon/2}\,\mathrm{d}x \qquad (4) $$

for $i = 1, \ldots, m$. First, we express $p(x)$ as a linear combination of Legendre polynomials $P_k$:

Fact 3.4. We can write $p(x) = \sum_{k=0}^{m} a_k P_k(x)$, where $a_k = \frac{2k+1}{2}\int_{-1}^{1} P_k(x)p(x)\,\mathrm{d}x$.

By properties of Legendre polynomials (Fact B.9), we have $|p(x)| \le \sum_{k=0}^{m} |a_k|$ for all $x \in [-1, 1]$; thus it suffices to bound the coefficients $a_k$.
Towards that end,

$$ \left| \int_{-1}^{1} P_k(x)p(x)\,\mathrm{d}x \right| = \left| \int_{\mathbb{R} \setminus [-B+\delta, B+\delta]} P_k(x)\phi(x)\,\mathrm{d}x - \int_{\mathbb{R} \setminus [-B+\delta, B+\delta]} P_k(x)\frac{g(x)}{1-\varepsilon/2}\,\mathrm{d}x \right| \quad \text{(due to (4))} $$
$$ \le \left| \int_{\mathbb{R} \setminus [-B+\delta, B+\delta]} P_k(x)\phi(x)\,\mathrm{d}x \right| + \left| \int_{\mathbb{R} \setminus [-B+\delta, B+\delta]} P_k(x)\frac{g(x)}{1-\varepsilon/2}\,\mathrm{d}x \right| \qquad (5) $$

We will show how to bound the first term (the proof for the second term is almost identical). First,

$$ \left| \int_{\mathbb{R} \setminus [-B+\delta, B+\delta]} P_k(x)\phi(x)\,\mathrm{d}x \right| \le \int_{\mathbb{R} \setminus [-(B-\delta), B-\delta]} 4^k |x|^k \phi(x)\,\mathrm{d}x \quad \text{(by Fact B.9)} $$
$$ = 2\int_{B-\delta}^{\infty} 4^k x^k \phi(x)\,\mathrm{d}x \;\lesssim\; 4^k \int_{\beta}^{\infty} x^k e^{-x^2/2}\,\mathrm{d}x \quad (\text{denote } \beta := B-\delta) $$
$$ \le 4^k \int_{\beta}^{\infty} x^k e^{-\beta^2/2 + \beta}\, e^{-x}\,\mathrm{d}x \quad (\text{since } -x^2/2 + x \text{ is decreasing for } x > 1) $$
$$ \le 4^k e^{-\beta^2/2 + \beta} \int_{0}^{\infty} x^k e^{-x}\,\mathrm{d}x = 4^k e^{-\beta^2/2 + \beta}\,\Gamma(k+1) \quad (\Gamma(\cdot) \text{ is the Gamma function}) $$
$$ \le 4^k e^{-\beta^2/4}\, k^k. $$

Combining the inequality $|p(x)| \le \sum_{k=0}^{m} |a_k|$ with the formula for $a_k$ and the above bound, we have

$$ |p(x)| \le \sum_{k=0}^{m} |a_k| \lesssim \sum_{k=0}^{m} \frac{2k+1}{2}\, 4^k e^{-\beta^2/4} k^k \lesssim 4^m e^{-\beta^2/4} m^m \sum_{k=0}^{m} \frac{2k+1}{2} \lesssim 4^m e^{-\beta^2/4} m^{m+2}. $$

If $m$ is a sufficiently small constant multiple of $\frac{\beta^2 - \log(1/\varepsilon)}{\log(\beta^2 - \log(1/\varepsilon))}$, then the right-hand side above is at most $\varepsilon$. Note that by our assumption $\delta \ll \varepsilon$ we have $\beta := B - \delta = \frac{1}{\delta}\log\big(\frac{1+\varepsilon/2}{1-\varepsilon/2}\big) - \delta = \Theta\big(\frac{1}{\delta}\log\big(\frac{1+\varepsilon/2}{1-\varepsilon/2}\big)\big)$ and $\beta^2 - \log(1/\varepsilon) = \Theta\Big(\big(\frac{1}{\delta}\log\big(\frac{1+\varepsilon/2}{1-\varepsilon/2}\big)\big)^2\Big)$. Thus, we can further simplify $\frac{\beta^2 - \log(1/\varepsilon)}{\log(\beta^2 - \log(1/\varepsilon))} = \Omega\big(\frac{\gamma^2}{\log\gamma}\big)$, where $\gamma := \frac{1}{\delta}\log\big(\frac{1+\varepsilon/2}{1-\varepsilon/2}\big)$. Therefore, the moment matching can be achieved for $m$ as high as a sufficiently small constant multiple of $\gamma^2/\log\gamma$.
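As a quick numerical illustration (ours, not the paper's code) of the Legendre machinery in Fact 3.4: for a made-up test polynomial, the formula $a_k = \frac{2k+1}{2}\int_{-1}^{1}P_k(x)p(x)\,\mathrm{d}x$ recovers the expansion coefficients, and $|p(x)| \le \sum_k |a_k|$ holds on $[-1,1]$ as used in the proof.

```python
# Verify the Legendre coefficient formula and the bound |p| <= sum |a_k|.
import numpy as np
from numpy.polynomial import legendre

p = np.polynomial.Polynomial([0.1, -0.3, 0.0, 0.2])   # test polynomial, degree 3
nodes, weights = legendre.leggauss(20)                 # Gauss-Legendre quadrature

a = np.array([(2 * k + 1) / 2
              * np.sum(weights * legendre.Legendre.basis(k)(nodes) * p(nodes))
              for k in range(4)])

p_rec = legendre.Legendre(a)                           # re-synthesize from the a_k
xs = np.linspace(-1, 1, 1001)
print(np.allclose(p(xs), p_rec(xs)))                            # True
print(float(np.max(np.abs(p(xs)))) <= float(np.sum(np.abs(a)))) # True
```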
4 Efficient Mean Estimation with Realizable Contamination: Proof of Theorem 1.4

4.1 Structural Lemma

The algorithm is based on the lemma below, establishing that if the ground-truth Gaussian has a mean that deviates from zero, then some sufficiently high-order moment must necessarily deviate from the corresponding standard Gaussian moment. Background on Hermite polynomials can be found in Section B.2.

Proposition 4.1 (Structural Result). Let $\varepsilon \in (0,1)$ and $\delta \in \big(0, \log(1 + \frac{2\varepsilon}{1-\varepsilon})\big)$ be parameters. Let $k$ be an even integer satisfying $k \ge 3\big(\frac{1}{\delta}\log(1 + \frac{2\varepsilon}{1-\varepsilon})\big)^2$ and let $P = N(\delta, 1)$ be a Gaussian distribution. Then for any $\varepsilon$-corrupted version $\widetilde{P}$ of $P$ under the model of Definition 1.1, if $P'$ denotes the conditional distribution of $\widetilde{P}$ on the non-missing samples, it holds that $\mathbf{E}_{x \sim P'}[x^k] - \mathbf{E}_{z \sim N(0,1)}[z^k] > \varepsilon$.

Proof. We will prove this by showing the following two claims:
1. $\mathbf{E}_{y \sim P}[y^k] > \big(1 + \frac{2\varepsilon}{1-\varepsilon}\big)\mathbf{E}_{z \sim N(0,1)}[z^k]$.
2. $\mathbf{E}_{x \sim P'}[x^k] \ge (1-\varepsilon)\mathbf{E}_{y \sim P}[y^k]$.

These two claims suffice: combining them, we obtain

$$ \mathbf{E}_{x \sim P'}[x^k] \ge (1-\varepsilon)\mathbf{E}_{y \sim P}[y^k] \ge (1-\varepsilon)\Big(1 + \frac{2\varepsilon}{1-\varepsilon}\Big)\mathbf{E}_{z \sim N(0,1)}[z^k] \ge (1+\varepsilon)\mathbf{E}_{z \sim N(0,1)}[z^k]. $$

Rearranging, this means that $\mathbf{E}_{x \sim P'}[x^k] - \mathbf{E}_{z \sim N(0,1)}[z^k] \ge \varepsilon\,\mathbf{E}_{z \sim N(0,1)}[z^k] \ge \varepsilon$.

We now show the two claims. The second claim follows directly from the definition of the contamination model: if $f$ denotes the function used in Definition 1.1, then $f(x) \ge (1-\varepsilon)p(x)$ (where $p$ is the pdf of $P$). Thus

$$ \mathbf{E}_{x \sim P'}[x^k] = \int_{\mathbb{R}} x^k \frac{f(x)}{\int_{\mathbb{R}} f(z)\,\mathrm{d}z}\,\mathrm{d}x \ge (1-\varepsilon)\int_{\mathbb{R}} x^k \frac{p(x)}{\int_{\mathbb{R}} f(z)\,\mathrm{d}z}\,\mathrm{d}x \ge (1-\varepsilon)\int_{\mathbb{R}} x^k p(x)\,\mathrm{d}x = (1-\varepsilon)\mathbf{E}_{y \sim P}[y^k], $$

where the last inequality uses that $\int_{\mathbb{R}} f(z)\,\mathrm{d}z \le 1$ and $x^k \ge 0$ for even $k$.

We now move to the first claim, i.e., that $\mathbf{E}_{z \sim N(0,1)}[(z+\delta)^k] \ge \mathbf{E}_{z \sim N(0,1)}[z^k]\big(1 + \frac{2\varepsilon}{1-\varepsilon}\big)$ when $k \ge 3\big(\frac{1}{\delta}\log(1 + \frac{2\varepsilon}{1-\varepsilon})\big)^2$. First, recall that $\mathbf{E}_{z \sim N(0,1)}[z^k] = (k-1)!!$. Using the binomial theorem, we can write the non-centered moment as follows:

$$ \mathbf{E}_{z \sim N(0,1)}[(z+\delta)^k] = \sum_{\substack{j=0 \\ j \text{ even}}}^{k} \binom{k}{j}\delta^j\,\mathbf{E}_{z \sim N(0,1)}[z^{k-j}] = \sum_{\substack{j=0 \\ j \text{ even}}}^{k} \binom{k}{j}\delta^j (k-j-1)!!. \qquad (6) $$

We now rewrite the right-hand side. For $(k-j-1)!!$, peeling off the first $j/2$ odd factors of $(k-1)!!$ gives

$$ (k-j-1)!! = \frac{(k-1)!!}{(k-1)(k-3)\cdots(k-j+1)} \ge \frac{(k-1)!!}{k^{j/2}}, \qquad (7) $$

where we used that every factor in the denominator is at most $k$. We also have the following for the binomial coefficient:

$$ \binom{k}{j} = \frac{k(k-1)\cdots(k-j+1)}{j!} = \frac{k^j}{j!}\prod_{i=0}^{j-1}\Big(1 - \frac{i}{k}\Big) \ge \frac{k^j}{j!}\prod_{i=0}^{j-1}\Big(1 - \frac{i}{j}\Big) = \frac{k^j}{j!}\cdot\frac{j!}{j^j} = \frac{k^j}{j^j}. \qquad (8) $$

Combining equation (6) with inequalities (7) and (8), we have $\mathbf{E}_{z \sim N(0,1)}[(z+\delta)^k] \ge (k-1)!!\sum_{j=0,\,j \text{ even}}^{k}\big(\frac{\sqrt{k}\,\delta}{j}\big)^j$. The sum can be lower bounded by any single term; choosing $j$ to be a sufficiently small multiple of $\sqrt{k}\,\delta$, we obtain $\mathbf{E}_{z \sim N(0,1)}[(z+\delta)^k] \ge (k-1)!!\, e^{\sqrt{k}\delta/3} = \mathbf{E}_{z \sim N(0,1)}[z^k]\, e^{\sqrt{k}\delta/3}$. Thus, if $k \ge 3\big(\frac{1}{\delta}\log(1 + \frac{2\varepsilon}{1-\varepsilon})\big)^2$, then $\mathbf{E}_{z \sim N(0,1)}[(z+\delta)^k] \ge \big(1 + \frac{2\varepsilon}{1-\varepsilon}\big)\mathbf{E}_{z \sim N(0,1)}[z^k]$.

We will need a Hermite-polynomial version of Proposition 4.1, obtained by expanding in the Hermite basis and applying Cauchy–Schwarz.

Corollary 4.2. Let $\varepsilon \in (0,1)$ and $\delta \in \big(0, \log(1 + \frac{2\varepsilon}{1-\varepsilon})\big)$ be parameters. Let $k$ be an even integer satisfying $k \ge 3\big(\frac{1}{\delta}\log(1 + \frac{2\varepsilon}{1-\varepsilon})\big)^2$ and let $P = N(\delta, 1)$ be a Gaussian distribution. Then for any $\varepsilon$-corrupted version $\widetilde{P}$ of $P$ under the model of Definition 1.1, if $P'$ denotes the conditional distribution of $\widetilde{P}$ on the non-missing samples, there exists $t \in \{0, \ldots, k\}$ such that $|\mathbf{E}_{x \sim P'}[h_t(x)]| > \varepsilon/(k+1)^{k/2}$, where $h_t$ is the normalized probabilist's Hermite polynomial.

Proof. Proposition 4.1 states that $\mathbf{E}_{x \sim P'}[x^k] - \mathbf{E}_{z \sim N(0,1)}[z^k] > \varepsilon$. We expand the function $x^k$ in the Hermite basis, i.e., $x^k = \sum_{t=0}^{k} a_t h_t(x)$ where $a_t := \mathbf{E}_{x \sim N(0,1)}[x^k h_t(x)]$. Combining Proposition 4.1 with Cauchy–Schwarz gives

$$ \sqrt{\sum_{t=0}^{k} a_t^2}\,\sqrt{(k+1)\max_{t=0,\ldots,k}\Big|\mathbf{E}_{x \sim P'}[h_t(x)] - \mathbf{E}_{z \sim N(0,1)}[h_t(z)]\Big|^2} \ge \sqrt{\sum_{t=0}^{k} a_t^2}\,\sqrt{\sum_{t=0}^{k}\Big(\mathbf{E}_{x \sim P'}[h_t(x)] - \mathbf{E}_{z \sim N(0,1)}[h_t(z)]\Big)^2} $$
$$ \ge \sum_{t=0}^{k} a_t\Big(\mathbf{E}_{x \sim P'}[h_t(x)] - \mathbf{E}_{z \sim N(0,1)}[h_t(z)]\Big) = \mathbf{E}_{x \sim P'}[x^k] - \mathbf{E}_{z \sim N(0,1)}[z^k] > \varepsilon. $$

Rearranging, we have

$$ \max_{t=0,\ldots,k}\Big|\mathbf{E}_{x \sim P'}[h_t(x)] - \mathbf{E}_{z \sim N(0,1)}[h_t(z)]\Big| \ge \frac{\varepsilon}{\sqrt{k+1}\sqrt{\sum_{t=0}^{k} a_t^2}} = \frac{\varepsilon}{\sqrt{k+1}\sqrt{\mathbf{E}_{z \sim N(0,1)}[z^{2k}]}} = \frac{\varepsilon}{\sqrt{(k+1)\frac{(2k)!}{2^k k!}}} \ge \frac{\varepsilon}{(k+1)^{k/2}}. $$

Finally, noting that $\mathbf{E}_{z \sim N(0,1)}[h_t(z)] = 0$ for $t \ge 1$ (and that the $t = 0$ terms on the two sides coincide) concludes the proof.
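The growth of the shifted moment in the first claim can be checked numerically via the exact identity (6). The script below (ours; the proposition's constant is asymptotic, so we simply display the ratio for several multiples of the base quantity and watch it cross the threshold):

```python
# E[(z+delta)^k] / E[z^k] for z ~ N(0,1), computed exactly from identity (6),
# versus the target factor 1 + 2*eps/(1-eps).
import math

def double_fact(n):                            # (-1)!! = 0!! = 1
    return 1 if n <= 0 else n * double_fact(n - 2)

def shifted_moment(delta, k):                  # E[(z + delta)^k]
    return sum(math.comb(k, j) * delta**j * double_fact(k - j - 1)
               for j in range(0, k + 1, 2))

eps, delta = 0.2, 0.1
threshold = 1 + 2 * eps / (1 - eps)
base = (math.log(threshold) / delta) ** 2
for mult in (3, 6, 9, 12):
    k = 2 * math.ceil(mult * base / 2)         # an even k, mult * base or so
    ratio = shifted_moment(delta, k) / double_fact(k - 1)
    print(f"k = {k:3d}: moment ratio = {ratio:.3f} (threshold {threshold:.2f})")
```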
4.2 Algorithm Description

The pseudocode for the algorithm is given in Algorithm 1. It begins by invoking a list-decodable mean estimation procedure as a black box, i.e., a procedure that returns a poly-sized list $L$ of candidate means such that at least one element $\widetilde{\mu}_0$ is within $O(\sqrt{\log(1/(1-\varepsilon))})$ of the true mean. As explained in Section 1.2, this vector would be a good warm start for the subsequent steps. However, since the identity of this good candidate is unknown, we use a standard tournament-style selection procedure from robust statistics [DKLP25a] to select a vector $\widehat{\mu}_0$ from the list $L$ whose error is approximately the best among the errors of $L$'s candidates.

Having this warm-start vector $\widehat{\mu}_0$ in hand, the algorithm draws a dataset and shifts all samples by $\widehat{\mu}_0$. After this translation, the data may be viewed as having ground-truth mean $\mu - \widehat{\mu}_0$, and the task reduces to estimating this offset. We then have the main, dimension-reduction part of the algorithm. For each of the values $t = 0, \ldots, k$, it computes the $t$-th order moment tensor and considers the subspace $V_t$ spanned by eigenvectors whose corresponding eigenvalues exceed a carefully chosen threshold. The algorithm then sets $V$ to be the span of all the $V_t$'s. By the structural result in Corollary 4.2, we show that $\mu - \widehat{\mu}_0$ is largely contained inside $V$. Finally, the algorithm applies a computationally inefficient estimator restricted to $V$ to recover the component of the ground-truth mean that lies in this subspace. In the proof of correctness, we will show that the dimension of $V$ is small, and thus the complexity of this step is as stated in the theorem.

4.3 Useful Subroutines from Robust Statistics

We require two subroutines used in Algorithm 1. The first handles mean estimation when more than half the samples are corrupted and, since exact recovery is then impossible, outputs a list of candidate means containing one close to the truth.

Fact 4.3 (Mean list-decoding algorithm (see, e.g., [DKK+22])). Let $\varepsilon \in (0,1)$ be a corruption rate parameter and $\tau \in (0,1)$ be a probability of failure parameter. There exists an algorithm that uses $\varepsilon$-corrupted samples from $N(\mu, I)$ in the strong contamination model² and finds a list $L$ of candidate means such that, with probability at least 0.99, there is at least one $\widehat{\mu}_0 \in L$ with $\|\widehat{\mu}_0 - \mu\|_2 = O(\sqrt{\log(1/(1-\varepsilon))})$. The sample complexity of the algorithm is $n = (\log(1/(1-\varepsilon))\,d)^{O(\log(1/(1-\varepsilon)))}$, the size of the returned list is $|L| = O(\frac{1}{1-\varepsilon})$, and the runtime of the algorithm is $\mathrm{poly}(n)$.

² Unlike Definition 1.1, in the strong contamination model $(1-\varepsilon)n$ samples are drawn from $N(\mu, I)$ and an adversary can add the remaining $\varepsilon n$ points arbitrarily.

The second component is a pruning procedure that selects a near-optimal estimate from the list. The procedure runs a one-dimensional robust mean estimator along the line connecting each pair of vectors in the list. For each pair, it disqualifies the element that lies farther from the estimated mean along that line. At the end, any remaining element can be returned.
Algorithm 1 Spectral algorithm for multivariate mean estimation
1: Input: Sample access to the distribution of Definition 1.1, parameters $\varepsilon, \delta \in (0,1)$.
2: Output: Vector $\widehat{\mu} \in \mathbb{R}^d$.
3: Let $C$ be a sufficiently large absolute constant, $k := \big\lceil C\big(\frac{1}{\delta}\log\big(1 + \frac{\varepsilon}{1-\varepsilon}\big)\big)^2 \big\rceil$ and $\eta := \frac{1}{C}\cdot\frac{\varepsilon}{(k+1)^{k/2}}$.
4: Use a list-decoding algorithm to compute a list $L$ of candidate mean estimates in $\mathbb{R}^d$ such that $L$ contains at least one $\widetilde{\mu}_0$ with $\|\widetilde{\mu}_0 - \mu\|_2 = O(\sqrt{\log(1/(1-\varepsilon))})$. ▷ Using Fact 4.3
5: $\widehat{\mu}_0 \leftarrow \mathrm{TournamentImprove}(L, \varepsilon, O(\sqrt{\log(1/(1-\varepsilon))}))$. ▷ Using Fact 4.4
6: Draw samples $S = \{x_1, \ldots, x_n\}$ from Definition 1.1 for $n = (kd)^{Ck}/\varepsilon^2$.
7: Recenter the dataset: $S' \leftarrow \{x - \widehat{\mu}_0 : x \in S\}$.
8: for $t = 0, 1, \ldots, k$ do
9:   Compute the empirical Hermite tensor $\widehat{T}_t \leftarrow \frac{1}{n}\sum_{x \in S'} H_t(x)$. ▷ cf. Definition B.10
10:  Let $M(\widehat{T}_t)$ be the flattened $d \times d^{t-1}$ matrix.
11:  Compute the left singular vectors $v_1, \ldots, v_d$ of $M(\widehat{T}_t)$ with singular values $\sigma_1, \ldots, \sigma_d$.
12:  Define $I_t = \{i \in [d] : \sigma_i > \eta\}$ and $V_t = \mathrm{span}(\{v_i : i \in I_t\})$.
13: Let $V = \mathrm{span}(V_1, \ldots, V_k)$.
14: $\widehat{\mu} \leftarrow \mathrm{BruteForce}(\mathrm{Proj}_V(S'), \varepsilon, \delta/2)$. ▷ Using Theorem D.5
15: return $\widehat{\mu} + \widehat{\mu}_0$.

The proof is identical to that of [DKLP25a], with the only difference being the choice of the one-dimensional robust mean estimator. Here, we may use an estimator designed for Gaussian mean estimation under an $\varepsilon$-fraction of arbitrary corruptions (unlike Definition 1.1, arbitrary corruptions allow the adversary to modify an $\varepsilon$-fraction of the samples arbitrarily). In particular, the median or the trimmed mean suffices, with sample complexity $\log(1/\tau)/\delta^2$ in one dimension. We set $\tau = 1/k^2$ to allow a union bound over all pairs in a list of size $k$.

Fact 4.4 (Tournament pruning (see Lemma 4.1 in [DKLP25a])). Let $C$ be a sufficiently large absolute constant. Let $\varepsilon, \delta \in (0,1)$ be parameters. Let $L = \{\mu_1, \ldots, \mu_k\} \subset \mathbb{R}^d$ be a set of candidate estimates of $\mu \in \mathbb{R}^d$. There exists an algorithm TournamentImprove that takes as input the list $L$ and the parameters $\varepsilon, \delta$, draws $n = O\big(\frac{\log k}{\delta^2}\big)$ samples according to the data generation model of Definition 1.1 with mean $\mu \in \mathbb{R}^d$ and corruption rate $\varepsilon$, and outputs an estimate $\mu_j \in L$ such that $\|\mu_j - \mu\|_2 \le 2\min_{i \in [k]}\|\mu_i - \mu\|_2 + \delta/2$ with probability at least 0.99. The runtime of the algorithm is $\mathrm{poly}(n, k, d)$.
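A minimal sketch of the pairwise comparison idea behind Fact 4.4 (ours; the median stands in for the one-dimensional robust estimator, and the fallback branch is our own guard, not part of the guarantee):

```python
# Tournament pruning sketch: for each pair of candidates, project fresh
# samples onto the connecting line and disqualify the candidate whose
# projection is farther from a robust 1-D estimate (here, the median).
import numpy as np

rng = np.random.default_rng(3)

def tournament_improve(candidates, samples):
    alive = set(range(len(candidates)))
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            u = candidates[j] - candidates[i]
            if np.linalg.norm(u) < 1e-12:
                continue
            u = u / np.linalg.norm(u)
            est = np.median(samples @ u)       # robust 1-D mean estimate
            loser = (i if abs(candidates[i] @ u - est) > abs(candidates[j] @ u - est)
                     else j)
            alive.discard(loser)
    # A near-best candidate survives w.h.p.; fall back to 0 if none does.
    return candidates[min(alive)] if alive else candidates[0]

d = 10
mu = np.ones(d)
samples = rng.normal(mu, 1.0, size=(2000, d))
candidates = np.stack([mu + 0.05 * rng.normal(size=d),  # one good candidate
                       rng.normal(size=d),               # two bad candidates
                       5.0 + rng.normal(size=d)])
print(np.linalg.norm(tournament_improve(candidates, samples) - mu))  # small
```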
4.4 Proof of Correctness

In the remainder of this section we prove Theorem 1.4, with some details deferred to the Appendices. We first argue that the final output has error $\delta$; we then bound the complexity of the algorithm.

Error analysis By line 4 of Algorithm 1, the list $L$ contains at least one estimate within $O(\sqrt{\log(1/(1-\varepsilon))})$ of $\mu$; hence, by Fact 4.4, the selected warm start satisfies $\|\widehat{\mu}_0 - \mu\|_2 = O(\sqrt{\log(1/(1-\varepsilon))})$. In what follows we analyze the for loop and show that the estimator produced at its end has $\ell_2$-error at most $\delta$. Because of the centering transformation, we may equivalently carry out the analysis as if there were no re-centering, but instead all samples come from the model with a mean $\mu$ of bounded norm $\|\mu\|_2 = O(\sqrt{\log(1/(1-\varepsilon))})$.

First, we will require the following concentration lemma, which we prove in Section B.2.

Lemma 4.5 (Hermite tensor concentration). Let $\eta, \varepsilon \in (0,1)$ be parameters, $C$ be a sufficiently large absolute constant, and $\mu \in \mathbb{R}^d$ be a vector with $\|\mu\|_2 = O\big(\sqrt{\log(\frac{1}{1-\varepsilon})}\big)$. Let $\widetilde{P}$ be an $\varepsilon$-corrupted version of $N(\mu, I)$ (cf. Definition 1.1) and let $P'$ denote the conditional distribution of $\widetilde{P}$ on the non-missing samples. Let $x_1, \ldots, x_n \sim P'$ be i.i.d. samples and define $\widehat{T} := \frac{1}{n}\sum_{i=1}^{n} H_k(x_i)$ and $T := \mathbf{E}_{x \sim P'}[H_k(x)]$, where $H_k(x)$ denotes the Hermite tensor from Definition B.10. If $n > C\,d^{3k}\,2^{O(k)}\big(k\log\big(\frac{1}{1-\varepsilon}\big)\big)^{k/2}\big/\big((1-\varepsilon)\,\eta^2\,\tau\big)$, then with probability at least $1 - \tau$ we have $\|\widehat{T} - T\|_2 \le \eta$.

We will use $\eta := \frac{1}{C}\cdot\frac{\varepsilon}{(k+1)^{k/2}}$ as in the pseudocode. The number of samples in Theorem 1.4 has been chosen to allow a union bound over the iterations $t = 0, \ldots, k$ of the for loop, so that

$$ \big\|\widehat{T}_t - T_t\big\|_2 \le \eta \quad \text{for all } t = 0, \ldots, k \qquad (9) $$

holds with probability at least 0.99. Using the formula for the sample complexity in Lemma 4.5 and simplifying it a bit, it can be seen that this concentration event can be achieved with $(kd)^{O(k)}\varepsilon^{-2}$ samples.

We now focus on showing that the estimate $\widehat{\mu}$ produced at the end of the for loop has error $\delta$. We will first argue that the mean has a small component, of size at most $\delta/2$, in the subspace $V^\perp$. Since we estimate the mean up to error $\delta/2$ within the subspace $V$, this immediately implies that the total error is at most $\delta$.

Claim 4.6 (Mean certification). Consider the notation of Algorithm 1 and suppose that the event of (9) holds. The subspace $V$ in Algorithm 1 satisfies: $v \in V^\perp \Rightarrow |v^\top\mu| \le \delta/2$.

Proof. Let $\widetilde{P}$ denote the $\varepsilon$-corrupted version of $N(\mu, I)$ according to Definition 1.1 and let $P'$ denote the conditional distribution on the non-missing samples. We prove the claim by contradiction. Suppose that $|v^\top\mu| > \delta/2$. Then by Corollary 4.2 there exists a $t \in \{0, \ldots, k\}$ with $|\mathbf{E}_{x \sim P'}[h_t(v^\top x)]| > C\eta$. Using the Hermite tensor property (see Section B.2 for background on Hermite tensors) $\mathbf{E}_{x \sim P'}[h_t(v^\top x)] = \langle v^{\otimes t}, \mathbf{E}_{x \sim P'}[H_t(x)]\rangle$, this means that $|\langle v^{\otimes t}, T_t\rangle| > C\eta$. By the concentration event in (9) (and taking $C \ge 2$), we have $|\langle v^{\otimes t}, \widehat{T}_t\rangle| > \eta$. Therefore, $v$ must have a nonzero component in $V_t$, which contradicts our assumption $v \in V^\perp$.

Claim 4.6 shows that $\|\mathrm{Proj}_{V^\perp}(\mu)\|_2 \le \delta/2$. On the subspace $V$ itself, our algorithm computes a $\widehat{\mu} \in V$ with $\|\widehat{\mu} + \widehat{\mu}_0 - \mathrm{Proj}_V(\mu)\|_2 \le \delta/2$ (by the guarantee stated in Theorem D.5). Thus, by the Pythagorean theorem, our estimator at the end of the for loop has error at most $\delta$. This concludes the error analysis part of the proof. In the remainder of this section, we analyze the complexity of the algorithm.

Complexity of the list-decoding and tournament subroutines The sample complexity of this part is given by Fact 4.3, and it can be checked to be smaller than the $(kd)^{O(k)}$ bound in the statement of Theorem 1.4. The runtime of list-decoding is $\mathrm{poly}(n)$. It can also be checked that the sample complexity and runtime of the tournament subroutine, stated in Fact 4.4, are smaller than those stated in Theorem 1.4.

Complexity of the main for loop We now bound the runtime of the main for loop, conditioned on the event that $\widehat{\mu}_0$ is within $O(\sqrt{\log(1/(1-\varepsilon))})$ of the true mean. The runtime of every step except the application of the brute-force algorithm is polynomial in all the parameters $n, d$ and $1/(1-\varepsilon)$, so it remains to analyze the runtime of the brute-force algorithm. By Theorem D.5, that runtime is $2^{O(\dim(V))}\mathrm{poly}(n, d)$; thus we need a bound on $\dim(V)$. Suppose that we have shown a bound $\|\widehat{T}_t\|_2 \le \gamma$.
Having such a bound allows us to argue as follows. If $\sigma_i$ denote the singular values of $M(\widehat{T}_t)$, then $\sqrt{\sum_{i=1}^{d}\sigma_i^2} = \|M(\widehat{T}_t)\|_F = \|\widehat{T}_t\|_2 \le \gamma$, which means that for the set $I_t = \{i : \sigma_i \ge \eta\}$ it holds that $|I_t| \le \gamma^2/\eta^2$. This means that $\dim(V_t) \le \gamma^2/\eta^2$, and thus $\dim(V) \le \sum_{t=0}^{k}\dim(V_t) \le (k+1)\gamma^2/\eta^2$.

We will show in Lemma 4.7 the $\ell_2$-norm bound $\|\mathbf{E}[\widehat{T}_t]\|_2 \le e^{\widetilde{O}(k)}$. Under the event (9) that we conditioned on in the beginning, this also implies $\|\widehat{T}_t\|_2 \le e^{\widetilde{O}(k)}$. Plugging $\gamma = e^{\widetilde{O}(k)}$ and $\eta = O(\varepsilon/(k+1)^{k/2})$ into the bounds of the previous paragraph, we finally obtain the bounds below, which imply that the runtime of the algorithm is $\exp\big(e^{\widetilde{O}(k)}/\varepsilon^2\big)\mathrm{poly}(n, d)$:

$$ \dim(V) \le \frac{(k+1)\gamma^2}{\eta^2} \le \frac{k\,e^{\widetilde{O}(k)}}{\eta^2} \le \frac{e^{\widetilde{O}(k)}}{\eta^2} \le \frac{e^{\widetilde{O}(k)}(k+1)^k}{\varepsilon^2} \le \frac{k^{\widetilde{O}(k)}}{\varepsilon^2} \le \frac{e^{\widetilde{O}(k)}}{\varepsilon^2}. $$

Lemma 4.7 (Moment tensor norm bound). Let $\widetilde{P}$ be the $\varepsilon$-corrupted version of $N(\mu, I)$ from the statement of Theorem 1.4 and let $P'$ denote the conditional distribution on the non-missing samples. Assume that $\|\mu\|_2 = O(\sqrt{\log(1/(1-\varepsilon))})$. Let $T_t = \mathbf{E}_{x \sim P'}[H_t(x)]$ denote the tensors used in Algorithm 1. We have that

$$ \|T_t\|_2 \;\le\; \frac{1}{1-\varepsilon}\,O\big(\log(1/(1-\varepsilon))\big)^{t/2} \;+\; \frac{1}{1-\varepsilon}\,\exp\big(O\big(t\log(1/(1-\varepsilon))\big)\big). \qquad (10) $$

We prove this lemma below. Note that the right-hand side of (10) can be further bounded from above by the simpler expression $e^{\widetilde{O}(k)}$, using $t \le k$, $k = \big(\frac{1}{\delta}\log(1 + \varepsilon/(1-\varepsilon))\big)^2$, and $\delta \le \varepsilon$.

Proof. Using the variational characterization of the $\ell_2$-norm, we have $\|T_t\|_2 = \sup_{\|A\|_2=1}\langle A, T_t\rangle = \sup_{\|A\|_2=1}\mathbf{E}_{x \sim P'}[\langle A, H_t(x)\rangle]$. Since $H_t(x)$ is a symmetric $t$-tensor and $\langle A, H_t(x)\rangle$ depends only on the symmetrization of $A$, we may restrict without loss of generality to symmetric $A$. For such $A$, the function $p_A(x) := \langle A, H_t(x)\rangle$ is a degree-$t$ polynomial satisfying $\mathbf{E}_{z \sim N(0,I)}[p_A(z)^2] = \|A\|_2^2 = 1$. Moreover, every degree-$t$ polynomial that is orthonormal with respect to the Gaussian measure can be written in this way for a unique symmetric tensor $A$. Therefore,

$$ \|T_t\|_2 = \sup_{\substack{p \text{ of degree } t \\ \mathbf{E}_{z \sim N(0,I)}[p(z)^2] = 1}} \mathbf{E}_{x \sim P'}[p(x)]. \qquad (11) $$

Thus, we need to show that $\mathbf{E}_{x \sim P'}[p(x)]$ is bounded for every unit-norm polynomial $p$ of degree $t$ (we may assume this expectation is nonnegative, replacing $p$ by $-p$ otherwise). To this end, recall the definition of the distribution $P'$ from Definition A.1:

$$ \mathbf{E}_{x \sim P'}[p(x)] = \mathbf{E}_{x \sim N(\mu,I)}[p(x) \mid x \text{ not missing}] = \frac{\mathbf{E}[p(x)\mathbf{1}(x \text{ not missing})]}{\mathbf{P}_{x \sim N(\mu,I)}[x \text{ not missing}]} \le \frac{1}{1-\varepsilon}\,\mathbf{E}_{x \sim N(\mu,I)}[p(x)\mathbf{1}(x \text{ not missing})] $$
$$ = \frac{1}{1-\varepsilon}\,\mathbf{E}_{x \sim N(\mu,I)}[p(x)] - \frac{1}{1-\varepsilon}\,\mathbf{E}_{x \sim N(\mu,I)}[p(x)\mathbf{1}(x \text{ missing})] \le \frac{1}{1-\varepsilon}\Big(\big|\mathbf{E}_{x \sim N(\mu,I)}[p(x)]\big| + \big|\mathbf{E}_{x \sim N(\mu,I)}[p(x)\mathbf{1}(x \text{ missing})]\big|\Big). $$

We will analyze each term separately. For the first term we have:

$$ \mathbf{E}_{x \sim N(\mu,I)}[p(x)] \le \big\|\mathbf{E}_{x \sim N(\mu,I)}[H_t(x)]\big\|_2 \quad \text{(by the variational characterization of the } \ell_2\text{-norm)} $$
$$ = \big\|\mu^{\otimes t}\big\|_2/\sqrt{t!} \le \|\mu\|_2^t/\sqrt{t!} = O\big(\log(1/(1-\varepsilon))\big)^{t/2} \quad \text{(by Fact B.12 and } \|\mu\|_2 = O(\sqrt{\log(1/(1-\varepsilon))})\text{)}. $$

For the second term, we have:

$$ \big|\mathbf{E}_{x \sim N(\mu,I)}[p(x)\mathbf{1}(x \text{ missing})]\big| \le \sqrt{\mathbf{E}_{x \sim N(\mu,I)}[p^2(x)]}\,\sqrt{\mathbf{P}[x \text{ missing}]} \le \sqrt{\mathbf{E}_{x \sim N(\mu,I)}[p^2(x)]}\,\sqrt{\varepsilon}. $$

It remains to bound $\mathbf{E}_{x \sim N(\mu,I)}[p^2(x)]$.
We know that $\mathbf{E}_{x \sim N(0,I)}[p^2(x)] = 1$; however, we need to bound the expectation over a translated Gaussian. By Claim B.14, we have $\mathbf{E}_{x \sim N(\mu,I)}[p^2(x)] \le e^{t\|\mu\|_2^2} \le e^{O(t\log(1/(1-\varepsilon)))}$ (where we used that $\|\mu\|_2 = O(\sqrt{\log(1/(1-\varepsilon))})$). Overall, putting everything together, we have

$$ \mathbf{E}_{x \sim P'}[p(x)] \le \frac{1}{1-\varepsilon}\,O\big(\log(1/(1-\varepsilon))\big)^{t/2} + \frac{1}{1-\varepsilon}\,\exp\big(O\big(t\log(1/(1-\varepsilon))\big)\big). $$

5 Conclusions

This work studied and essentially characterized the complexity of mean estimation of a spherical Gaussian in the realizable contamination model. Our results suggest several natural questions for future work. A broad direction is to understand the tradeoffs between sample and computational complexity for other statistical tasks in the presence of realizable contamination. A second direction is to investigate broader distributional assumptions. While the Gaussian setting serves as the canonical starting point, a natural next step is to consider subgaussian distributions. However, in full generality, consistency is known to be impossible in this model [MVB+24]. This raises the question of identifying structured subclasses of subgaussian distributions, or alternative distribution families, for which consistent estimation remains achievable.

Acknowledgments

We thank Chao Gao for bringing the realizable contamination model to our attention.

References

[AAR99] G. E. Andrews, R. Askey, and R. Roy. Special Functions. 1999.

[AL88] D. Angluin and P. Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.

[AL13] P. M. Aronow and D. K. K. Lee. Interval estimation of population means under unknown but bounded probabilities of sample selection. Biometrika, 100(1):235–240, 2013.

[ALJY20] M. F. Adak, P. Lieberzeit, P. Jarujamrus, and N. Yumusak. Classification of alcohols obtained by QCM sensors with different characteristics using ABC based neural network. Engineering Science and Technology, an International Journal, 23(3):463–469, 2020.

[BBH+21] M. S. Brennan, G. Bresler, S. Hopkins, J. Li, and T. Schramm. Statistical query algorithms and low degree tests are almost equivalent. In Proceedings of the Thirty Fourth Conference on Learning Theory, pages 774–774. PMLR, 2021.

[BBV08] M.-F. Balcan, A. Blum, and S. Vempala. A discriminative framework for clustering via similarity functions. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 671–680, 2008.

[BR91] P. J. Bickel and J. Ritov. Large sample theory of estimation in biased sampling regression models. I. The Annals of Statistics, 19(2):797–816, 1991.

[BRT17] A. Belloni, M. Rosenbaum, and A. B. Tsybakov. Linear and conic programming estimators in high dimensional errors-in-variables models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(3):939–956, 2017.

[CAT+20] Y. Cherapanamjeri, E. Aras, N. Tripuraneni, M. I. Jordan, N. Flammarion, and P. L. Bartlett. Optimal robust linear regression in nearly linear time. 2020.

[CMW+21] G. Carreras, G. Miccinesi, A. Wilcock, N. Preston, D. Nieboer, L. Deliens, M. Groenvold, U. Lunder, A. van der Heide, M. Baccini, et al. Missing not at random in end of life care studies: multiple imputation and sensitivity analysis on data from the ACTION study. BMC Medical Research Methodology, 21(1):13, 2021.

[CSV17] M. Charikar, J. Steinhardt, and G. Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 47–60, 2017.
[CZ19] T. Tony Cai and L. Zhang. High dimensional linear discriminant analysis: optimality, adaptive algorithm and missing data. Journal of the Royal Statistical Society Series B: Statistical Methodology, 81(4):675–705, 2019.

[DDK+25] I. Diakonikolas, J. Diakonikolas, D. M. Kane, J. C. H. Lee, and T. Pittas. Linear regression under missing or corrupted coordinates. arXiv preprint arXiv:2509.19242, 2025.

[DGTZ18] C. Daskalakis, T. Gouleakis, C. Tzamos, and M. Zampetakis. Efficient statistics, in high dimensions, from truncated samples. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 639–649. IEEE, 2018.

[DGTZ19] C. Daskalakis, T. Gouleakis, C. Tzamos, and M. Zampetakis. Computationally and statistically efficient truncated regression. In Proceedings of the Thirty-Second Conference on Learning Theory, pages 955–960. PMLR, 2019.

[DK23] I. Diakonikolas and D. M. Kane. Algorithmic High-Dimensional Robust Statistics. Cambridge University Press, 1st edition, 2023.

[DKK+18] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robustly learning a Gaussian: getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2683–2702. SIAM, 2018.

[DKK+19] I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 2019.

[DKK+22] I. Diakonikolas, D. M. Kane, S. Karmalkar, A. Pensia, and T. Pittas. List-decodable sparse mean estimation via difference-of-pairs filtering. Advances in Neural Information Processing Systems, 35:13947–13960, 2022.

[DKLP25a] I. Diakonikolas, D. M. Kane, S. Liu, and T. Pittas. Entangled mean estimation in high dimensions. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, pages 1680–1688, 2025.

[DKLP25b] I. Diakonikolas, D. M. Kane, S. Liu, and T. Pittas. PTF testing lower bounds for non-Gaussian component analysis. In Proceedings of the 66th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2025.

[DKPP22] I. Diakonikolas, D. M. Kane, A. Pensia, and T. Pittas. Streaming algorithms for high-dimensional robust statistics. In International Conference on Machine Learning, pages 5061–5117. PMLR, 2022.

[DKPP24] I. Diakonikolas, S. Karmalkar, S. Pang, and A. Potechin. Sum-of-squares lower bounds for non-Gaussian component analysis. In 2024 IEEE 65th Annual Symposium on Foundations of Computer Science (FOCS), pages 949–958. IEEE, 2024.

[DKPZ24] I. Diakonikolas, D. M. Kane, T. Pittas, and N. Zarifis. Statistical query lower bounds for learning truncated Gaussians. In The Thirty Seventh Annual Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pages 1336–1363. PMLR, 2024.

[DKRS23] I. Diakonikolas, D. Kane, L. Ren, and Y. Sun. SQ lower bounds for non-Gaussian component analysis with weaker assumptions. Advances in Neural Information Processing Systems, 2023.

[DKS17] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 73–84, 2017.
[DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. List-decodable robust mean estimation and learning mixtures of spherical Gaussians. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1047–1060, 2018.

[DKS19] I. Diakonikolas, W. Kong, and A. Stewart. Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, pages 2745–2754. SIAM, 2019.

[DRZ] C. Daskalakis, D. Rohatgi, and E. Zampetakis. Truncated linear regression in high dimensions. In Advances in Neural Information Processing Systems, volume 33, pages 10338–10347. Curran Associates, Inc.

[DSYZ] C. Daskalakis, P. Stefanou, R. Yao, and E. Zampetakis. Efficient truncated linear regression with unknown noise variance. In Advances in Neural Information Processing Systems, volume 34, pages 1952–1963. Curran Associates, Inc.

[EvdG19] A. Elsener and S. van de Geer. Sparse spectral estimation with missing and corrupted measurements. Stat, 8(1):e229, 2019.

[FDS22] D. M. Farewell, R. M. Daniel, and S. R. Seaman. Missing at random: a stochastic process perspective. Biometrika, 109(1):227–241, 2022.

[FGR+13] V. Feldman, E. Grigorescu, L. Reyzin, S. Vempala, and Y. Xiao. Statistical algorithms and a lower bound for detecting planted cliques. In Proceedings of STOC'13, pages 655–664, 2013. Full version in Journal of the ACM, 2017.

[FWS22] B. Follain, T. Wang, and R. J. Samworth. High-dimensional changepoint estimation with heterogeneous missingness. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(3):1023–1055, 2022.

[Gal97] F. Galton. An examination into the registered speeds of American trotting horses, with remarks on their value as hereditary data. Proceedings of the Royal Society of London, 62:310–315, 1897.

[Gor41] R. D. Gordon. Values of Mills' ratio of area to bounding ordinate and of the normal probability integral for large values of the argument. The Annals of Mathematical Statistics, 12(3):364–366, 1941.

[GVW88] R. D. Gill, Y. Vardi, and J. A. Wellner. Large sample theory of empirical distributions in biased sampling models. The Annals of Statistics, pages 1069–1112, 1988.

[Hub64] P. J. Huber. Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101, 1964.

[Kea98] M. J. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, 1998.

[KKK19] S. Karmalkar, A. Klivans, and P. Kothari. List-decodable linear regression. Advances in Neural Information Processing Systems, 32, 2019.

[KKM18] A. R. Klivans, P. K. Kothari, and R. Meka. Efficient algorithms for outlier-robust regression. In Conference On Learning Theory, COLT 2018, volume 75 of Proceedings of Machine Learning Research, pages 1420–1430. PMLR, 2018.

[KS17] P. K. Kothari and D. Steurer. Outlier-robust moment-estimation via sum-of-squares. CoRR, abs/1711.11581, 2017.

[KTZ19] V. Kontonis, C. Tzamos, and M. Zampetakis. Efficient truncated statistics with unknown truncation. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 1578–1595. IEEE, 2019.

[KWB22] D. Kunisky, A. S. Wein, and A. S. Bandeira. Notes on computational hardness of hypothesis testing: predictions using the low-degree likelihood ratio. In ISAAC Congress (International Society for Analysis, its Applications and Computation), pages 1–50. Springer, 2022.
[KWL+13] C. Kulkarni, K. P. Wei, H. Le, D. Chia, K. Papadopoulos, J. Cheng, D. Koller, and S. R. Klemmer. Peer and self assessment in massive online classes. ACM Trans. Comput.-Hum. Interact., 20(6):33:1–33:31, 2013.

[LDC+12] R. J. Little, R. D'Agostino, M. L. Cohen, K. Dickersin, S. S. Emerson, J. T. Farrar, C. Frangakis, J. W. Hogan, G. Molenberghs, S. A. Murphy, et al. The prevention and treatment of missing data in clinical trials. New England Journal of Medicine, 367(14):1355–1360, 2012.

[Lou14] K. Lounici. High-dimensional covariance matrix estimation with missing observations. Bernoulli, 20(3):1029–1058, 2014.

[LR19] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, 2019.

[LRV16] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016.

[LT18] P.-L. Loh and X. L. Tan. High-dimensional robust precision matrix estimation: cellwise corruption under epsilon-contamination. Electronic Journal of Statistics, 12:1429–1467, 2018.

[LW] P.-L. Loh and M. J. Wainwright. High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. The Annals of Statistics, 40(3):1637–1664.

[MN06] P. Massart and E. Nedelec. Risk bounds for statistical learning. Ann. Statist., 34(5):2326–2366, 2006.

[MVB+24] T. Ma, K. A. Verchand, T. B. Berrett, T. Wang, and R. J. Samworth. Estimation beyond missing (completely) at random, 2024.

[Pea02] K. Pearson. On the systematic fitting of curves to observations and measurements. Biometrika, 1(3):265–303, 1902.

[PHC+] C. Piech, J. Huang, Z. Chen, C. B. Do, A. Y. Ng, and D. Koller. Tuned models of peer assessment in MOOCs. In Proceedings of the 6th International Conference on Educational Data Mining, pages 153–160. International Educational Data Mining Society.

[PJL24] A. Pensia, V. Jog, and P.-L. Loh. Robust regression with covariate filtering: heavy tails and adversarial contamination. Journal of the American Statistical Association, pages 1–12, 2024.

[PL08] K. Pearson and A. Lee. On the generalised probable error in multiple normal correlation. Biometrika, 6(1):59–68, 1908.

[Rob97] J. M. Robins. Non-response models for the analysis of non-monotone non-ignorable missing data. Statistics in Medicine, 16(1):21–37, 1997.

[Roc24] S. Roch. Modern Discrete Probability: An Essential Toolkit. Cambridge University Press, 1st edition, 2024.

[Ros87] P. R. Rosenbaum. Sensitivity analysis for certain permutation inferences in matched observational studies. Biometrika, 74(1):13–26, 1987.

[RR97] A. Rotnitzky and J. Robins. Analysis of semi-parametric regression models with non-ignorable non-response. Statistics in Medicine, 16(1):81–102, 1997.

[Rub76] D. B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.

[RY20] P. Raghavendra and M. Yau. List decodable learning via sum of squares. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 161–180. SIAM, 2020.

[SBC24] T. Sell, T. B. Berrett, and T. I. Cannings. Nonparametric classification with missing data. The Annals of Statistics, 52(3):1178–1200, 2024.
[SGJC13] S. Seaman, J. Galati, D. Jackson, and J. Carlin. What is meant by "missing at random"? Statistical Science, pages 257–268, 2013.

[SLW22] R. Sahoo, L. Lei, and S. Wager. Learning from a biased sample. arXiv preprint arXiv:2209.01754, 2022.

[SMP15] I. Shpitser, K. Mohan, and J. Pearl. Missing data as a causal and probabilistic problem. Technical report, 2015.

[SRR99] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins. Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096–1120, 1999.

[Sze67] G. Szegö. Orthogonal Polynomials. Number 23 in American Mathematical Society Colloquium Publications. American Mathematical Society, 1967.

[Tsi] A. Tsiatis. Semiparametric Theory and Missing Data. Springer Series in Statistics. Springer New York.

[Tuk60] J. W. Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, 2:448–485, 1960.

[Var85] Y. Vardi. Empirical distributions in selection bias models. The Annals of Statistics, pages 178–203, 1985.

[VdVE11] J. Vuurens, A. P. de Vries, and C. Eickhoff. How much spam can you take? An analysis of crowdsourcing results to increase accuracy. In Proc. ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR'11), pages 21–26, 2011.

[Ver18] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.

[XHW12] Y. Xie, J. Huang, and R. Willett. Change-point detection for high-dimensional time series with missing data. IEEE Journal of Selected Topics in Signal Processing, 7(1):12–27, 2012.

[YCF24] Y. Yan, Y. Chen, and J. Fan. Inference for heteroskedastic PCA with missing data. The Annals of Statistics, 52(2):729–756, 2024.

[ZSB19] Q. Zhao, D. S. Small, and B. B. Bhattacharya. Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap. Journal of the Royal Statistical Society Series B: Statistical Methodology, 81(4):735–761, 2019.

[ZWS22] Z. Zhu, T. Wang, and R. J. Samworth. High-dimensional principal component analysis with heterogeneous missingness. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(5):2000–2031, 2022.

Appendix

A Omitted Details from Section 1

A.1 Additional Discussion on Related Work

Comparison of Definition 1.1 with the definition in [MVB+24]. As described in Section 1, the contamination model (Definition 1.1) studied in this paper was proposed in the recent work of [MVB+24] as an attempt to formalize missingness mechanisms that are non-MCAR yet milder than MAR or MNAR. At first glance, the model appears to be defined slightly differently in eq. (6) of [MVB+24]. The $\varepsilon$-corrupted version $\widetilde P$ of the original distribution $P$ there is defined as any mixture of the form (as a note regarding notation in that paper, we are using $q = 1$ in their notation, i.e., there is no MCAR component in eq. (6))
$$(1-\varepsilon)P + \varepsilon Q, \tag{12}$$
where $Q$ is any MNAR version of $P$ that the adversary can choose. However, [MVB+24, Proposition 2] provides a characterization that establishes its equivalence with the definition used in this paper.
As explained following Proposition 2, if samples are interpreted as being generated by first drawing a value and then applying a missingness mechanism, and if $h(x)$ denotes the probability that a sample is not missing conditional on the original value being $x$, then Proposition 2 shows that the realizable $\varepsilon$-contamination model is equivalent to the condition $1-\varepsilon \le h(x) \le 1$. In the language of our paper, this characterization leads to the alternative definition of the realizable $\varepsilon$-contamination model stated in Definition A.1 below.

Definition A.1 (Contamination model; alternate definition). Let $P$ be a distribution on a domain $X$ with pdf $p : X \to \mathbb R_+$. An $\varepsilon$-corrupted version $\widetilde P$ of $P$ is any distribution that can be obtained as follows. First, an adversary chooses a function $f : X \to \mathbb R_+$ with $(1-\varepsilon)p(x) \le f(x) \le p(x)$. $\widetilde P$ is then defined to be the distribution whose samples are generated as follows:
1. Draw $x$ from $P$.
2. With probability $1 - f(x)/p(x)$, replace $x$ by a special symbol $\perp$.

As shown below, it is then straightforward to check that Definition A.1 is equivalent to Definition 1.1.

Claim A.2. Definition 1.1 and Definition A.1 are equivalent.

Proof. Let us denote by $X$ the initial value of the sample before the missingness pattern is applied and by $Z$ the sample after its application. In Definition A.1 we have
$$\operatorname{P}[Z \neq \perp] = \int_X \operatorname{P}[Z \neq \perp \mid X = x]\,\operatorname{P}[X = x]\,\mathrm dx = \int_X \frac{f(x)}{p(x)}\,p(x)\,\mathrm dx = \int_X f(x)\,\mathrm dx,$$
and we also have
$$\operatorname{P}[X = x \mid Z \neq \perp] = \frac{\operatorname{P}[Z \neq \perp \mid X = x]\,\operatorname{P}[X = x]}{\operatorname{P}[Z \neq \perp]} = \frac{(1 - \operatorname{P}[Z = \perp \mid X = x])\,p(x)}{\int_X f(x)\,\mathrm dx} = \frac{\frac{f(x)}{p(x)}p(x)}{\int_X f(x)\,\mathrm dx} = \frac{f(x)}{\int_X f(x)\,\mathrm dx},$$
which agrees with Definition 1.1.
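To make Definition A.1 concrete, the following minimal Python sketch (illustrative only; the particular rejection rule is a hypothetical adversary of our choosing) simulates an $\varepsilon$-corrupted version of $N(\mu, 1)$: each sample is retained with probability $f(x)/p(x) \in [1-\varepsilon, 1]$ and otherwise replaced by $\perp$ (here None).

```python
import numpy as np

def sample_corrupted(mu, eps, n, rng):
    """Simulate Definition A.1 for P = N(mu, 1).

    The adversary here (an arbitrary illustrative choice) thins the left
    half: f(x)/p(x) = 1 - eps for x < mu, and 1 otherwise. This satisfies
    (1 - eps) * p(x) <= f(x) <= p(x) pointwise."""
    x = rng.normal(mu, 1.0, size=n)
    keep_prob = np.where(x < mu, 1.0 - eps, 1.0)   # = f(x) / p(x)
    kept = rng.random(n) < keep_prob
    return [xi if k else None for xi, k in zip(x, kept)]

rng = np.random.default_rng(0)
data = sample_corrupted(mu=1.0, eps=0.3, n=10_000, rng=rng)
observed = [z for z in data if z is not None]
print(len(observed) / len(data))   # approx integral of f: here 1 - eps/2
print(np.mean(observed))           # biased upward: the left tail is thinned
```

The empirical mean of the observed samples is visibly biased, which is exactly why the estimators in this paper cannot simply average the non-missing data.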
Robust Statistics. The field of robust statistics was initiated in the 1960s through the seminal works of Huber and Tukey [Tuk60, Hub64], with the goal of developing estimators that are robust to data contamination. In this setting, data corruptions are formalized by allowing a small fraction $\varepsilon < 1/2$ of the samples to come from an arbitrary distribution. While early work in the 1960s studied the information-theoretic aspects of one-dimensional inference in this model (including optimal error rates and sample complexity), extensions of these algorithms to higher dimensions required exponential time. It was not until 2016 that the first polynomial-time algorithms were obtained [LRV16, DKK+19]. This led to a revitalization of the field, with improved algorithms for a variety of problems such as mean estimation [KS17, DKK+18, DKPP22] and linear regression [KKM18, DKS19, PJL24, CAT+20]. Apart from the fact that robust statistics considers data corruptions rather than missingness, the differences with the current work are as follows: (i) Due to the arbitrary nature of outliers, identifiability is only possible when the corruption fraction satisfies $\varepsilon < 1/2$. If a majority of the samples are corrupted, the dataset may consist of two equally sized subsets corresponding to different underlying distributions, in which case it is impossible to determine which subset contains the inliers. The robust statistics literature has nevertheless considered the regime $\varepsilon > 1/2$. Since identifiability is impossible in this case, the goal shifts to outputting a list of candidate solutions with the guarantee that at least one is close to the ground truth. Algorithms of this type are known as list-decoding algorithms [BBV08, CSV17], and such algorithms have been developed for several tasks, including mean estimation [CSV17, DKS18, DKK+22] and linear regression [KKK19, RY20]. (ii) Even in the regime $\varepsilon < 1/2$, where identifiability is possible, consistency—i.e., the property that the estimation error vanishes as the number of samples tends to infinity—is still unattainable in robust statistics. For example, for Gaussian mean estimation with an $\varepsilon$ fraction of arbitrary corruptions, the optimal error is $\Theta(\varepsilon)$, regardless of the sample size (and analogous lower bounds hold for other distributions, with different dependencies on $\varepsilon$). In contrast, Definition 1.1 allows for consistency due to the additional structure imposed on the missingness pattern.

Truncated Statistics. Although it is a form of Missing Not At Random, this large subfield of statistics—tracing back to Galton, Pearson, and Lee [Gal97, Pea02, PL08]—developed largely orthogonally to the rest of the missing data literature. Truncated statistics concerns scenarios in which there is a truncation set (which may be known or unknown to the learning algorithm), and only samples that fall within this set are observed. Despite early work on this problem, efficient algorithms for fundamental tasks were obtained only in the last decade, including Gaussian mean estimation [DGTZ18, KTZ19] and linear regression [DGTZ19, DRZ, DSYZ]. More concretely, the notion of missingness used in truncated statistics is defined as follows (we state it for identity-covariance Gaussians to match the inlier distribution considered in this paper). Samples are drawn from $N(\mu, I)$ but are revealed only if they fall in some subset $S \subseteq \mathbb R^d$ whose probability mass is assumed to be lower bounded by a parameter $\alpha > 0$; otherwise, the samples are hidden (e.g., represented by a special symbol $\perp$). More generally, the literature also considers a setting in which hidden samples are completely unobserved, so the algorithm does not know the ratio of missing to visible samples. At a high level, this can be viewed as a special case of the $\varepsilon$-realizable contamination model considered in this paper (Definition 1.1) with $\varepsilon = 1$, but there are important differences. First, there is no analogue in Definition 1.1 of the requirement that the truncation set have probability mass bounded away from zero. As a result, the adversary in Definition 1.1 could make all samples missing, which is the main reason the problem is unsolvable when $\varepsilon = 1$ (the sample complexity in (1) diverges as $\varepsilon \to 1$). Second, even when such a lower-bound condition is imposed, additional nuances arise in the truncated statistics literature depending on whether the truncation set $S$ is known to the algorithm. If the set is unknown, mean estimation remains information-theoretically impossible to achieve to arbitrary accuracy [DGTZ18]. Estimation becomes possible only if (i) the algorithm has oracle access to $S$ [DGTZ18], or (ii) the set has bounded complexity, for example bounded VC dimension or bounded Gaussian surface area [KTZ19]. In the latter case, there is an information–computation trade-off, providing evidence that polynomial-time algorithms often require more samples than the information-theoretic optimum [DKPZ24].

B Omitted Details from Section 2

B.1 Useful Facts
Fact B.1 (Gaussian tail bound). Let $Z \sim N(0,1)$. Then for all $x > 0$, $\operatorname{P}[Z \ge x] \le \frac{1}{x\sqrt{2\pi}}e^{-x^2/2}$.

Fact B.2 (Mills ratio inequality [Gor41]). Let $\phi, \Phi$ denote the pdf and cdf of $N(0,1)$ respectively. The following holds for all $x > 0$: $x < \frac{\phi(x)}{1-\Phi(x)} < x + \frac 1x$.

Fact B.3 (Dvoretzky–Kiefer–Wolfowitz (DKW) inequality). Let $X_1, \ldots, X_n$ be i.i.d. real-valued random variables with cumulative distribution function $F$, and let the empirical cumulative distribution function be $F_n(x) := \frac 1n\sum_{i=1}^n \mathbb 1\{X_i \le x\}$. Then, for all $\varepsilon > 0$,
$$\operatorname{P}\Big[\sup_{x\in\mathbb R}\big|F_n(x) - F(x)\big| > \varepsilon\Big] \le 2e^{-2n\varepsilon^2}.$$

Fact B.4 (Maximal coupling (see, e.g., [Roc24])). Let $P$ and $Q$ be distributions on some domain $X$. It holds that $D_{\mathrm{TV}}(P, Q) = \inf_\Pi \operatorname{P}_{(X,Y)\sim\Pi}[X \neq Y]$, where the infimum is over all possible couplings between $P$ and $Q$. Moreover, there exists $\Pi$ such that $\operatorname{P}_{(X,Y)\sim\Pi}[X \neq Y] = D_{\mathrm{TV}}(P, Q)$.

Fact B.5 (see, e.g., Corollary 4.2.13 in [Ver18]). Let $\xi > 0$. There exists a set $\mathcal C$ of unit vectors of $\mathbb R^d$ such that $|\mathcal C| < (1+2/\xi)^d$ and for every $u \in \mathbb R^d$ with $\|u\|_2 = 1$ it holds $\min_{y\in\mathcal C}\|y - u\|_2 \le \xi$.

Corollary B.6 (see, e.g., Exercise 4.4.3(b) in [Ver18]). There exists a subset $\mathcal C$ of the $d$-dimensional unit ball with $|\mathcal C| \le 7^d$ such that $\|x\|_2 \le 2\max_{v\in\mathcal C}|v^\top x|$ for all $x \in \mathbb R^d$, and $\|A\|_{\mathrm{op}} \le 3\max_{x\in\mathcal C} x^\top A x$ for every symmetric $A \in \mathbb R^{d\times d}$.

Fact B.7. Let $x \sim N(0, I_d)$ and let $j \ge 1$. Then $\operatorname{E}[\|x\|^j] \le (2\sqrt{jd})^j$.

Proof. For a standard normal $z \sim N(0,1)$ it is well known that $\operatorname{E}|z|^j \le (2\sqrt j)^j$. Since $\|x\|^j = (\sum_{i=1}^d x_i^2)^{j/2}$ and for $r \ge 1$ we have $(\sum_{i=1}^d a_i)^r \le d^{r-1}\sum_{i=1}^d a_i^r$, applying this with $a_i = x_i^2$ and $r = j/2$ gives $\|x\|^j \le d^{j/2-1}\sum_{i=1}^d |x_i|^j$. Taking expectations and using identical marginals, $\operatorname{E}\|x\|^j \le d^{j/2}\operatorname{E}|z|^j \le d^{j/2}(2\sqrt j)^j = (2\sqrt{jd})^j$.

Proposition B.8 (Le Cam's lemma). For any distributions $P_1$ and $P_2$ on $X$, we have
$$\inf_\Psi \big(\operatorname{P}_{X\sim P_1}(\Psi(X) \neq 1) + \operatorname{P}_{X\sim P_2}(\Psi(X) \neq 2)\big) = 1 - D_{\mathrm{TV}}(P_1, P_2),$$
where the infimum is taken over all tests $\Psi : X \to \{1, 2\}$.

Legendre Polynomials. In this work, we make use of the Legendre polynomials, which are orthogonal polynomials over $[-1, 1]$. Some of their properties are:

Fact B.9 ([Sze67]). The Legendre polynomials $P_k$, $k \in \mathbb Z_{\ge 0}$, satisfy the following properties:
1. $P_k$ is a $k$-degree polynomial with $P_0(x) = 1$ and $P_1(x) = x$.
2. $\int_{-1}^1 P_i(x)P_j(x)\,\mathrm dx = \frac{2}{2i+1}\mathbb 1\{i = j\}$ for all $i, j \in \mathbb Z_{\ge 0}$.
3. $|P_k(x)| \le 1$ for all $|x| \le 1$.
4. $|P_k(x)| \le (4|x|)^k$ for $|x| \ge 1$.
5. $P_k(x) = (-1)^k P_k(-x)$.
6. $P_k(x) = 2^{-k}\sum_{i=0}^{\lfloor k/2\rfloor}(-1)^i\binom{k}{i}\binom{2k-2i}{k}x^{k-2i}$.

B.2 Hermite Analysis

Definition B.10 (Hermite tensor). For $k \in \mathbb N$ and $x \in \mathbb R^n$, we define the $k$-th Hermite tensor as
$$(H_k(x))_{i_1, i_2, \ldots, i_k} = \frac{1}{\sqrt{k!}}\sum_{\substack{\text{partitions } P \text{ of } [k]\\ \text{into sets of size 1 and 2}}}\;\bigotimes_{\{a,b\}\in P}(-I_{i_a, i_b})\;\bigotimes_{\{c\}\in P} x_{i_c}.$$

Fact B.11. If $v \in \mathbb R^d$ is a unit vector, it holds $H_k(v^\top x) = \langle v^{\otimes k}, H_k(x)\rangle$.

Fact B.12. $\operatorname{E}_{x\sim N(\mu, I)}[H_k(x)] = \mu^{\otimes k}/\sqrt{k!}$.
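As a quick numerical sanity check of Definition B.10 and Fact B.12 in the simplest nontrivial case $k = 2$, where the definition reduces to $H_2(x) = (xx^\top - I)/\sqrt 2$, the following sketch (illustrative only) verifies $\operatorname{E}[H_2(x)] \approx \mu^{\otimes 2}/\sqrt{2!}$ by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 200_000
mu = rng.standard_normal(d)

# H_2(x) = (x x^T - I) / sqrt(2!), the k = 2 case of Definition B.10.
x = rng.standard_normal((n, d)) + mu
H2_mean = (np.einsum('ni,nj->ij', x, x) / n - np.eye(d)) / np.sqrt(2)

target = np.outer(mu, mu) / np.sqrt(2)     # Fact B.12: mu^{otimes 2}/sqrt(2!)
print(np.max(np.abs(H2_mean - target)))    # O(1/sqrt(n)) Monte Carlo error
```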
Hermite polynomials form a complete orthogonal basis of the vector space $L^2(\mathbb R, N(0,1))$ of all functions $f : \mathbb R \to \mathbb R$ such that $\operatorname{E}_{x\sim N(0,1)}[f^2(x)] < \infty$. There are two commonly used types of Hermite polynomials. The physicist's Hermite polynomials, denoted by $H_k$ for $k \in \mathbb Z_{\ge 0}$, satisfy the following orthogonality property with respect to the weight function $e^{-x^2}$: for all $k, m \in \mathbb Z_{\ge 0}$, $\int_{\mathbb R} H_k(x)H_m(x)e^{-x^2}\,\mathrm dx = \sqrt\pi\,2^k k!\,\mathbb 1(k = m)$. The probabilist's Hermite polynomials $He_k$, $k \in \mathbb Z_{\ge 0}$, satisfy $\int_{\mathbb R} He_k(x)He_m(x)e^{-x^2/2}\,\mathrm dx = k!\sqrt{2\pi}\,\mathbb 1(k = m)$ and are related to the physicist's polynomials through $He_k(x) = 2^{-k/2}H_k(x/\sqrt 2)$. We will mostly use the normalized probabilist's Hermite polynomials $h_k(x) = He_k(x)/\sqrt{k!}$, $k \in \mathbb Z_{\ge 0}$, for which $\int_{\mathbb R} h_k(x)h_m(x)e^{-x^2/2}\,\mathrm dx = \sqrt{2\pi}\,\mathbb 1(k = m)$. These polynomials are the ones obtained by Gram–Schmidt orthonormalization of the basis $\{1, x, x^2, \ldots\}$ with respect to the inner product $\langle f, g\rangle_{N(0,1)} = \operatorname{E}_{x\sim N(0,1)}[f(x)g(x)]$. Every function $f \in L^2(\mathbb R, N(0,1))$ can be uniquely written as $f(x) = \sum_i a_i h_i(x)$, and we have $\lim_{n\to\infty}\operatorname{E}_{x\sim N(0,1)}\big[(f(x) - \sum_{i=0}^n a_i h_i(x))^2\big] = 0$ (see, e.g., [AAR99]). Extending the normalized probabilist's Hermite polynomials to higher dimensions, an orthonormal basis of $L^2(\mathbb R^d, N(0, I_d))$ (with respect to the inner product $\langle f, g\rangle = \operatorname{E}_{x\sim N(0,I_d)}[f(x)g(x)]$) can be formed by all products of one-dimensional Hermite polynomials, i.e., $h_a(x) = \prod_{i=1}^d h_{a_i}(x_i)$ for all multi-indices $a \in \mathbb Z_{\ge 0}^d$ (we are now slightly overloading notation by using multi-indices as subscripts). The total degree of $h_a$ is $|a| = \sum_{i=1}^d a_i$.

Claim B.13 (Univariate Gaussian shift bound). Let $p : \mathbb R \to \mathbb R$ be a polynomial of degree at most $d$ satisfying $\operatorname{E}_{x\sim N(0,1)}[p(x)^2] = 1$. Then for every $\mu \in \mathbb R$, $\operatorname{E}_{x\sim N(0,1)}[p(x+\mu)^2] \le e^{d\mu^2}$.

Proof. Let $h_k(x) = He_k(x)/\sqrt{k!}$ denote the normalized probabilists' Hermite polynomials. Expand $p$ in the orthonormal Hermite basis:
$$p(x) = \sum_{k=0}^d c_k h_k(x), \qquad \sum_{k=0}^d c_k^2 = \operatorname{E}_{x\sim N(0,1)}[p(x)^2] = 1.$$
The following identity holds for the normalized probabilists' Hermite polynomials:
$$h_k(x+\mu) = \sum_{r=0}^k \binom{k}{r}\mu^{k-r}\sqrt{\frac{r!}{k!}}\,h_r(x).$$
Using orthonormality of $\{h_r\}$, we have that
$$\operatorname{E}_{x\sim N(0,1)}[h_k(x+\mu)^2] = \sum_{r=0}^k \binom{k}{r}^2\mu^{2(k-r)}\frac{r!}{k!} = \sum_{s=0}^k \binom{k}{s}\frac{\mu^{2s}}{s!},$$
where $s = k - r$. Bounding $\binom{k}{s} \le k^s/s!$ yields $\binom{k}{s}\frac{\mu^{2s}}{s!} \le \frac{(k\mu^2)^s}{s!}$. Therefore
$$\operatorname{E}_{x\sim N(0,1)}[h_k(x+\mu)^2] \le \sum_{s=0}^\infty\frac{(k\mu^2)^s}{s!} = e^{k\mu^2}.$$
Now expand $p(x+\mu)$:
$$\operatorname{E}_{x\sim N(0,1)}[p(x+\mu)^2] = \sum_{k=0}^d c_k^2\operatorname{E}_{x\sim N(0,1)}[h_k(x+\mu)^2] \le \sum_{k=0}^d c_k^2 e^{k\mu^2} \le e^{d\mu^2}\sum_{k=0}^d c_k^2 = e^{d\mu^2}.$$
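Claim B.13 is easy to check numerically. The sketch below (illustrative only; it uses numpy's probabilists' Hermite module, and the coefficient scaling matches the $h_k = He_k/\sqrt{k!}$ convention above) draws a random unit-norm polynomial and compares $\operatorname{E}[p(x+\mu)^2]$ against the bound $e^{d\mu^2}$.

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, exp

rng = np.random.default_rng(2)
deg, mu, n = 4, 0.7, 500_000

# Random coefficients in the *normalized* Hermite basis, scaled to norm 1,
# then converted to He-basis coefficients: p = sum_k c_k He_k / sqrt(k!).
c = rng.standard_normal(deg + 1)
c /= np.linalg.norm(c)                 # ensures E_{N(0,1)}[p^2] = sum c_k^2 = 1
he_coef = c / np.sqrt([factorial(k) for k in range(deg + 1)])

x = rng.standard_normal(n)
lhs = np.mean(He.hermeval(x + mu, he_coef) ** 2)   # Monte Carlo E[p(x+mu)^2]
print(lhs, "<=", exp(deg * mu ** 2))                # Claim B.13 bound
```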
Claim B.14 (Multivariate Gaussian shift bound). Let $p : \mathbb R^n \to \mathbb R$ be a polynomial of total degree at most $D$ satisfying $\operatorname{E}_{x\sim N(0,I_n)}[p(x)^2] = 1$. For a multi-index $\alpha \in \mathbb N^n$, define $h_\alpha(x) = \prod_{i=1}^n h_{\alpha_i}(x_i)$ and $|\alpha| = \sum_{i=1}^n \alpha_i$. Then for every $\mu \in \mathbb R^n$, it holds $\operatorname{E}_{x\sim N(0,I_n)}[p(x+\mu)^2] \le e^{D\|\mu\|_2^2}$.

Proof. Expand $p$ in the multivariate orthonormal Hermite basis:
$$p(x) = \sum_{|\alpha|\le D} c_\alpha h_\alpha(x), \qquad \sum_{|\alpha|\le D} c_\alpha^2 = \operatorname{E}_{x\sim N(0,I_n)}[p(x)^2] = 1.$$
Because $h_\alpha(x) = \prod_i h_{\alpha_i}(x_i)$ and the coordinates of $x$ are independent, we have
$$\operatorname{E}_{x\sim N(0,I_n)}[h_\alpha(x+\mu)^2] = \prod_{i=1}^n \operatorname{E}_{x_i\sim N(0,1)}[h_{\alpha_i}(x_i+\mu_i)^2].$$
Applying the univariate bound of Claim B.13 to each coordinate, $\operatorname{E}[h_{\alpha_i}(x_i+\mu_i)^2] \le e^{\alpha_i\mu_i^2}$, so
$$\operatorname{E}[h_\alpha(x+\mu)^2] \le \exp\Big(\sum_{i=1}^n \alpha_i\mu_i^2\Big) \le \exp\big(|\alpha|\,\|\mu\|_2^2\big) \le e^{D\|\mu\|_2^2}.$$
Finally,
$$\operatorname{E}[p(X+\mu)^2] = \sum_{|\alpha|\le D} c_\alpha^2\operatorname{E}[h_\alpha(X+\mu)^2] \le \sum_\alpha c_\alpha^2 e^{D\|\mu\|_2^2} = e^{D\|\mu\|_2^2}.$$

Claim B.15. Let $H_k$ denote the $k$-th Hermite tensor in $d$ dimensions. Then the following bound holds: $\|H_k(x)\|_2 \le d^{k/2}(1+\|x\|^k)2^{O(k)}$.

Proof. For a degree-$k$ tensor $A$ and a permutation $\pi$ of $[k]$, we use $A^\pi$ to denote the tensor with $A^\pi_{i_1,\ldots,i_k} = A_{\pi(i_1,\ldots,i_k)}$. Note that $\|A\|_2 = \|A^\pi\|_2$. From the definition of the Hermite tensor, we have that
$$H_k(x) = \frac{1}{\sqrt{k!}}\sum_{t=0}^{\lfloor k/2\rfloor}\;\sum_{\text{permutations }\pi\text{ of }[k]}\frac{1}{2^t\,t!\,(k-2t)!}\big((-I)^{\otimes t}\otimes x^{\otimes(k-2t)}\big)^\pi.$$
Thus the norm is
$$\|H_k(x)\|_2 \le \sum_{t=0}^{\lfloor k/2\rfloor}\frac{\sqrt{k!}}{2^t\,t!\,(k-2t)!}\max\big(\|I^{\otimes t}\|_2\|x\|_2^{k-2t},\,1\big) \le \sum_{t=0}^{\lfloor k/2\rfloor}\frac{\sqrt{k!}}{2^t\,t!\,(k-2t)!}\max\big(d^{t/2}\|x\|_2^{k-2t},\,1\big)$$
$$\le \sum_{t=0}^{\lfloor k/2\rfloor}\frac{\sqrt{k!}}{2^t\,t!\,(k-2t)!}\big(d^{t/2}\|x\|_2^{k-2t}+1\big) \le \sum_{t=0}^{\lfloor k/2\rfloor}\frac{\sqrt{k!}}{2^t\,t!\,(k-2t)!}\big(d^{t/2}\max(\|x\|_2^k, 1)+1\big) \le 2d^{k/2}(1+\|x\|^k)\sum_{t=0}^{\lfloor k/2\rfloor}\frac{\sqrt{k!}}{2^t\,t!\,(k-2t)!}.$$
One can see that the denominator is minimized when $t = k/2 - O(\sqrt k)$. Using that, we have that the right-hand side above is at most $d^{k/2}(1+\|x\|^k)2^{O(k)}$.

We restate and prove the following concentration of Hermite moments.

Lemma 4.5 (Hermite tensor concentration). Let $\eta, \varepsilon \in (0,1)$ be parameters, $C$ be a sufficiently large absolute constant, and $\mu \in \mathbb R^d$ be a vector with $\|\mu\|_2 = O(\sqrt{\log(\frac{1}{1-\varepsilon})})$. Let $\widetilde P$ be an $\varepsilon$-corrupted version of $N(\mu, I)$ (cf. Definition 1.1) and let $P'$ denote the conditional distribution of $\widetilde P$ on the non-missing samples. Let $x_1, \ldots, x_n \sim P'$ be i.i.d. samples and define $\widehat T := \frac 1n\sum_{i=1}^n H_k(x_i)$ and $T := \operatorname{E}_{x\sim P'}[H_k(x)]$, where $H_k(x)$ denotes the Hermite tensor from Definition B.10. If
$$n > C\,\frac{d^{3k}\,2^{O(k)}\big(k\log(\frac{1}{1-\varepsilon})\big)^{k/2}}{(1-\varepsilon)\,\eta^2\,\tau},$$
then with probability at least $1-\tau$ we have that $\|\widehat T - T\|_2 \le \eta$.

Proof. Consider one entry $\widehat T_{i_1 i_2\cdots i_k}$ of the estimator. It holds
$$\operatorname{Var}(\widehat T_{i_1\cdots i_k}) = \frac 1n\operatorname{Var}_{x\sim P'}\big(H_k(x)_{i_1\cdots i_k}\big) \le \frac 1n\operatorname{E}_{x\sim P'}\big[(H_k(x)_{i_1\cdots i_k})^2\big] \le \frac{1}{n(1-\varepsilon)}\operatorname{E}_{x\sim N(\mu,I)}\big[(H_k(x)_{i_1\cdots i_k})^2\big] \quad\text{(by Definition 1.1)}$$
$$\le \frac{1}{n(1-\varepsilon)}\operatorname{E}_{x\sim N(\mu,I)}\big[\|H_k(x)\|_2^2\big] \le \frac{d^{k/2}2^{O(k)}}{n(1-\varepsilon)}\Big(1+\operatorname{E}_{x\sim N(\mu,I)}\big[\|x\|_2^k\big]\Big) \quad\text{(by Claim B.15)}$$
$$\le \frac{d^k\,2^{O(k)}\,k^{k/2}\log(1/(1-\varepsilon))^{k/2}}{n(1-\varepsilon)}, \tag{13}$$
where the last step can be shown as follows:
$$\operatorname{E}_{x\sim N(\mu,I)}\big[\|x\|_2^k\big] = \operatorname{E}_{z\sim N(0,I)}\big[\|z+\mu\|_2^k\big] \le \operatorname{E}_{z\sim N(0,I)}\big[(\|z\|_2+\|\mu\|_2)^k\big] \le 2^{k-1}\Big(\operatorname{E}_{z\sim N(0,I)}\big[\|z\|_2^k\big]+\|\mu\|_2^k\Big)$$
$$\le 2^{O(k)}\Big((kd)^{k/2}+2^{O(k)}\log(1/(1-\varepsilon))^{k/2}\Big) \le 2^{O(k)}k^{k/2}d^{k/2}\log(1/(1-\varepsilon))^{k/2} \quad\text{(using Fact B.7)}.$$
Having the variance bound of inequality (13), an application of Chebyshev's inequality yields that if the number of samples is $n > C\frac{d^{2k}2^{O(k)}k^{k/2}\log(1/(1-\varepsilon))^{k/2}}{(1-\varepsilon)\eta^2\tau'}$, then
$$\operatorname{P}\Big[\big|\widehat T_{i_1\cdots i_k}-T_{i_1\cdots i_k}\big| > \frac{\eta}{d^{k/2}}\Big] \le \frac{d^{2k}2^{O(k)}k^{k/2}\log(1/(1-\varepsilon))^{k/2}}{n(1-\varepsilon)\eta^2} \le \tau'.$$
We will use $\tau' = \tau d^{-k}$. By a union bound over all $d^k$ entries, the probability of having $|\widehat T_{i_1\cdots i_k}-T_{i_1\cdots i_k}| \le \eta/d^{k/2}$ for all entries simultaneously is at least $1-\tau$. In that event we have that $\|\widehat T - T\|_2 = \sqrt{\sum_{i_1\cdots i_k}|\widehat T_{i_1\cdots i_k}-T_{i_1\cdots i_k}|^2} \le \eta$, which completes the proof.
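As an illustration of Lemma 4.5, the following Monte Carlo sketch checks the concentration of the empirical Hermite tensor for $k = 2$; for simplicity it uses uncontaminated samples (i.e., it takes $P' = N(\mu, I)$), which is purely an assumption of this demo.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
mu = rng.standard_normal(d)
T = np.outer(mu, mu) / np.sqrt(2)            # E[H_2(x)] = mu mu^T / sqrt(2!)

for n in [1_000, 10_000, 100_000]:
    x = rng.standard_normal((n, d)) + mu
    # Empirical T_hat = (1/n) sum_i H_2(x_i), H_2(x) = (x x^T - I)/sqrt(2).
    T_hat = (np.einsum('ni,nj->ij', x, x) / n - np.eye(d)) / np.sqrt(2)
    print(n, np.linalg.norm(T_hat - T))      # shrinks roughly like 1/sqrt(n)
```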
C Sample Complexity Lower Bound

We restate and prove the following result.

Theorem C.1 (Sample complexity lower bound). For every $\varepsilon \in (0,1)$, $\delta \in (0, \log^{1/2}(1+\frac{\varepsilon}{1-\varepsilon}))$, and $n \in \mathbb Z_+$ the following holds. If $A$ is an algorithm that uses $n$ samples from an $\varepsilon$-corrupted version of a Gaussian $N(\mu, 1)$ and outputs $\widehat\mu$ such that $|\mu - \widehat\mu| \le \delta$ with probability at least $0.9$, then the sample complexity of $A$ is
$$n \ge \frac{1}{1-\varepsilon}\exp\Bigg(\Omega\bigg(\frac{\log\big(1+\frac{\varepsilon}{1-\varepsilon}\big)}{\delta}\bigg)^2\Bigg). \tag{14}$$

Remark C.2. Some remarks follow:
- The bound in (14) agrees with the sample complexity upper and lower bounds shown in [MVB+24] (Theorems 5 and 6 therein): our model coincides with theirs when $\sigma = 1$ and $q = 1$. Solving for the second term (which is the dominant term) in Theorem 5 or 6 to be equal to $\delta^2$ yields (up to constant factors) the same expression as the right-hand side of (14).
- (Small $\varepsilon$ regime) When $\varepsilon \to 0$ the bound becomes $\exp(\Omega(\varepsilon/\delta)^2)$.
- (Large $\varepsilon$ regime) When $\varepsilon \to 1$ the bound behaves like $\exp\big(\Omega(\frac 1\delta\log(\frac{1}{1-\varepsilon}))^2\big)$.

The argument for showing the theorem consists of showing that there exist two distributions in this contamination model that are close in total variation distance.

Lemma C.3. For every $\varepsilon \in (0,1)$, $\delta > 0$, $n \in \mathbb Z_+$ the following holds. Consider the two Gaussians $P_1 = N(-\delta/2, 1)$ and $P_2 = N(\delta/2, 1)$. There exist distributions $Q_1, Q_2$ on $\mathbb R$ such that:
- $Q_1$ is an $\varepsilon$-corrupted version of $P_1$ according to Definition 1.1 and $Q_2$ is an $\varepsilon$-corrupted version of $P_2$.
- $D_{\mathrm{TV}}(Q_1^{\otimes n}, Q_2^{\otimes n}) \le \frac{n}{1-\varepsilon}e^{-\Omega(\log(1+\varepsilon/(1-\varepsilon))/\delta)^2}$.

We first show how Theorem C.1 follows given the lemma.

Proof of Theorem C.1. Define the following hypothesis testing problem: with probability $1/2$ all samples come from $Q_1$ and with probability $1/2$ all samples come from $Q_2$. If a mean estimator existed that had accuracy $\delta/2$ with probability $0.9$, then we would be able to solve the testing problem with probability $0.9$. However, by Le Cam's lemma (Proposition B.8), every testing algorithm has probability of failure at least $\frac 12\big(1 - D_{\mathrm{TV}}(Q_1^{\otimes n}, Q_2^{\otimes n})\big)$. In order for that probability of failure to be less than $0.1$ we need $n > \frac{1}{1-\varepsilon}e^{\Omega(\log(1+\varepsilon/(1-\varepsilon))/\delta)^2}$.

Proof of Lemma C.3. Fix the threshold $t := \log\big(1+\frac{\varepsilon}{1-\varepsilon}\big)/\delta$ throughout this proof. Let $p_1(x), p_2(x)$ denote the pdfs of $P_1, P_2$. We define the functions $q_1, q_2$ as shown below:
$$q_1(x) = \begin{cases} p_1(x) & x \in (0, \infty) \\ p_2(x) & x \in [-t, 0] \\ (1-\varepsilon)p_1(x) & x \in (-\infty, -t) \end{cases} \qquad\qquad q_2(x) = \begin{cases} (1-\varepsilon)p_2(x) & x \in (t, \infty) \\ p_1(x) & x \in [0, t] \\ p_2(x) & x \in (-\infty, 0) \end{cases}$$

Claim C.4. It holds $(1-\varepsilon)p_1(x) \le q_1(x) \le p_1(x)$ and $(1-\varepsilon)p_2(x) \le q_2(x) \le p_2(x)$ for all $x \in \mathbb R$.

Proof. We do the check for the first part of the claim, involving $p_1$ and $q_1$; the check for the second part is identical. The only non-trivial part of the check is showing that $p_2(x) \ge (1-\varepsilon)p_1(x)$ for all $x \in [-t, 0]$. Recall that $p_2$ is the pdf of $N(\delta/2, 1)$ and $p_1$ is the pdf of $N(-\delta/2, 1)$. We thus want to solve for $(1-\varepsilon)p_1(x) \le p_2(x)$. Plugging in the pdfs of the two Gaussians,
$$\exp\Big(-\frac{(x-\delta/2)^2}{2}+\frac{(x+\delta/2)^2}{2}\Big) \ge 1-\varepsilon.$$
Solving the above yields $x \ge -\log\big(1+\frac{\varepsilon}{1-\varepsilon}\big)/\delta$. Therefore, $p_2(x) \ge (1-\varepsilon)p_1(x)$ for all $x \in [-t, 0]$.
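The sandwich property of Claim C.4 can also be verified pointwise on a grid; a minimal sketch (scipy supplies the Gaussian pdf, and the small tolerance absorbs floating-point error at the boundary $x = -t$, where the inequality is tight):

```python
import numpy as np
from scipy.stats import norm

eps, delta = 0.4, 0.5
t = np.log(1 + eps / (1 - eps)) / delta
p1 = lambda x: norm.pdf(x, loc=-delta / 2)
p2 = lambda x: norm.pdf(x, loc=+delta / 2)

def q1(x):  # the piecewise (sub-)density underlying Q_1
    return np.where(x > 0, p1(x),
           np.where(x >= -t, p2(x), (1 - eps) * p1(x)))

x = np.linspace(-10, 10, 100_001)
assert np.all((1 - eps) * p1(x) <= q1(x) + 1e-12)
assert np.all(q1(x) <= p1(x) + 1e-12)
print("Claim C.4 sandwich holds on the grid")
```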
Since $(1-\varepsilon)p_1(x) \le q_1(x) \le p_1(x)$, the function $q_1$ induces a definition of an $\varepsilon$-corruption of $P_1$ according to Definition 1.1. That is, a sample from $Q_1$ is generated according to the following procedure: with probability $\int_{\mathbb R} q_1(x)\,\mathrm dx$ the sample is drawn from $q_1(x)/\int_{\mathbb R} q_1(x)\,\mathrm dx$, and with probability $1-\int_{\mathbb R} q_1(x)\,\mathrm dx$ the sample is set to the special symbol $\perp$. Similarly, $(1-\varepsilon)p_2(x) \le q_2(x) \le p_2(x)$, and $q_2$ induces an $\varepsilon$-corrupted version of $P_2$, which we denote by $Q_2$. Samples from $Q_2$ are generated as follows: with probability $\int_{\mathbb R} q_2(x)\,\mathrm dx$ the sample is drawn from $q_2(x)/\int_{\mathbb R} q_2(x)\,\mathrm dx$, and with probability $1-\int_{\mathbb R} q_2(x)\,\mathrm dx$ the sample is set to $\perp$.

By symmetry of our setup, the probability of the sample not being deleted (set to $\perp$) is the same in both cases: $\int_{\mathbb R} q_1(x)\,\mathrm dx = \int_{\mathbb R} q_2(x)\,\mathrm dx$. Denote by $\alpha$ this probability. Also denote by $\widetilde Q_1$ the conditional distribution of $Q_1$ conditioned on the sample not being $\perp$, and let $\widetilde Q_2$ denote the corresponding conditional distribution for $Q_2$.

In the following we will show that $D_{\mathrm{TV}}(Q_1^{\otimes n}, Q_2^{\otimes n}) \le \frac{n}{1-\varepsilon}e^{-\Omega(\log(1+\varepsilon/(1-\varepsilon))/\delta)^2}$. By Fact B.4 it suffices to find a coupling $\Pi$ between the two joint distributions $Q_1^{\otimes n}, Q_2^{\otimes n}$ with probability of disagreement at most $\frac{n}{1-\varepsilon}e^{-\Omega(\log(1+\varepsilon/(1-\varepsilon))/\delta)^2}$. That is, we need to define a joint distribution $\Pi$ on two sets of $n$ samples $\big((X_1,\ldots,X_n),(Y_1,\ldots,Y_n)\big)$ such that (i) the marginals are $(X_1,\ldots,X_n)\sim Q_1^{\otimes n}$ and $(Y_1,\ldots,Y_n)\sim Q_2^{\otimes n}$ respectively (i.e., it is a valid coupling), and (ii) the probability of disagreement is $\operatorname{P}_\Pi[(X_1,\ldots,X_n)\neq(Y_1,\ldots,Y_n)] \le \frac{n}{1-\varepsilon}e^{-\Omega(\log(1+\varepsilon/(1-\varepsilon))/\delta)^2}$.

We define the coupling by specifying the data generation process for $\big((X_1,\ldots,X_n),(Y_1,\ldots,Y_n)\big)$ below. In the construction below, we will assume that we already have a coupling $\Pi_0$ for the conditional distributions of single samples, i.e., a distribution $\Pi_0$ such that if $(X,Y)\sim\Pi_0$ it holds $X\sim\widetilde Q_1$, $Y\sim\widetilde Q_2$ (i.e., $\Pi_0$ is a coupling between $\widetilde Q_1$ and $\widetilde Q_2$) and $\operatorname{P}_{(X,Y)\sim\Pi_0}[X\neq Y] \le \frac{1}{1-\varepsilon}e^{-\Omega(\log(1+\varepsilon/(1-\varepsilon))/\delta)^2}$. We will show why $\Pi_0$ exists at the end; for now we will conclude the construction of $\Pi$ using $\Pi_0$. We define the sample generation process for $\Pi$ as follows:
1. Draw $c_i \sim \operatorname{Ber}(\alpha)$ for $i \in [n]$ (recall that $\alpha$ is the probability that a sample is not missing).
2. For each $i \in [n]$:
   (a) If $c_i = 1$, then draw $(X_i, Y_i) \sim \Pi_0$.
   (b) Else, set $(X_i, Y_i) = (\perp, \perp)$.

Note that in the above construction each $X_i$ is distributed according to $Q_1$ and each $Y_i$ follows $Q_2$; thus the above defines a valid coupling between $Q_1^{\otimes n}, Q_2^{\otimes n}$. For the probability of disagreement we have the following:
$$\operatorname{P}[(X_1,\ldots,X_n)\neq(Y_1,\ldots,Y_n)] \le \sum_{i=1}^n\operatorname{P}[X_i\neq Y_i] = \sum_{i=1}^n\Big(\operatorname{P}[X_i\neq Y_i\mid c_i=1]\operatorname{P}[c_i=1]+\operatorname{P}[X_i\neq Y_i\mid c_i=0]\operatorname{P}[c_i=0]\Big)$$
$$\le \sum_{i=1}^n\operatorname{P}[X_i\neq Y_i\mid c_i=1] = \sum_{i=1}^n\operatorname{P}_{(X_i,Y_i)\sim\Pi_0}[X_i\neq Y_i] \le \frac{n}{1-\varepsilon}\exp\Bigg(-\Omega\bigg(\frac{\log\big(1+\frac{\varepsilon}{1-\varepsilon}\big)}{\delta}\bigg)^2\Bigg).$$
It remains to show that a coupling $\Pi_0$ with $\operatorname{P}_{(X,Y)\sim\Pi_0}[X\neq Y] \le \frac{1}{1-\varepsilon}e^{-\Omega(\log(1+\varepsilon/(1-\varepsilon))/\delta)^2}$ exists. We show this by bounding the TV distance between the conditional distributions $\widetilde Q_1, \widetilde Q_2$ and defining $\Pi_0$ to be the maximal coupling (cf. Fact B.4):
$$D_{\mathrm{TV}}(\widetilde Q_1, \widetilde Q_2) = \frac 12\int_{-\infty}^{+\infty}\bigg|\frac{q_1(x)}{\int_{\mathbb R}q_1(x)\,\mathrm dx}-\frac{q_2(x)}{\int_{\mathbb R}q_2(x)\,\mathrm dx}\bigg|\,\mathrm dx \tag{15}$$
$$= \frac{1}{2\alpha}\int_{-\infty}^\infty|q_1(x)-q_2(x)|\,\mathrm dx \tag{16}$$
$$= \frac{1}{2\alpha}\bigg(\int_{-\infty}^{-t}\varepsilon p_1(x)\,\mathrm dx+\int_t^{+\infty}\varepsilon p_2(x)\,\mathrm dx\bigg) \quad\text{(by definition of }q_1, q_2\text{)}$$
$$= \frac{1}{2\alpha}\bigg(\int_{-\infty}^{-t}\varepsilon\phi(x+\delta/2)\,\mathrm dx+\int_t^{+\infty}\varepsilon\phi(x-\delta/2)\,\mathrm dx\bigg) \quad(\phi\text{ is the pdf of }N(0,1)\text{)}$$
$$\le \frac{1}{2\alpha}\,2\,e^{-\Omega\big(\frac 1\delta\log(1+\frac{\varepsilon}{1-\varepsilon})-\delta/2\big)^2} \le \frac{1}{1-\varepsilon}\,e^{-\Omega\big(\frac 1\delta\log(1+\frac{\varepsilon}{1-\varepsilon})\big)^2},$$
where the second-to-last step uses the standard Gaussian tail bound $\operatorname{P}_{z\sim N(0,1)}[z > r] \le e^{-r^2/2}$ for every $r \ge 0$, applied with $r := \frac 1\delta\log(1+\frac{\varepsilon}{1-\varepsilon})-\delta/2$; note that this is non-negative because $\varepsilon \in (0,1)$ and $\delta^2 \le \log(1+\varepsilon/(1-\varepsilon))$. The final step uses $\alpha \ge 1-\varepsilon$ together with, again, $\varepsilon \in (0,1)$ and $\delta^2 \le \log(1+\varepsilon/(1-\varepsilon))$.

D Sample Complexity Upper Bound

We restate the main result for the sample complexity of one-dimensional estimation below. We will then use this together with a cover argument to obtain a multivariate estimator in Theorem D.5.

Theorem D.1 (Sample complexity upper bound). There exists a computationally efficient algorithm such that the following holds for any $\varepsilon \in (0,1)$, $\delta \in (0, \log^{1/2}(1+\frac{\varepsilon}{1-\varepsilon}))$, and $\tau \in (0,1)$. The algorithm takes as input $\varepsilon, \delta, \tau$, draws
$$n = \exp\Bigg(O\bigg(\frac{\log\big(1+\frac{\varepsilon}{1-\varepsilon}\big)}{\delta}\bigg)^2\Bigg)\,\frac{\log(1/\tau)}{\varepsilon^2(1-\varepsilon)}$$
samples from an $\varepsilon$-corrupted version of $N(\mu, 1)$ under the contamination model of Definition 1.1, and it returns $\widehat\mu$ such that $|\widehat\mu-\mu| \le \delta$ with probability at least $1-\tau$.

Remark D.2. Some remarks follow:
- If $\varepsilon$ is not known to the algorithm, it can easily be estimated by taking the fraction of samples that are equal to $\perp$.
- The algorithm's sample complexity matches the lower bound of Theorem C.1 and the sample complexity upper bound of [MVB+24] up to the $\varepsilon^{-2}$ factor.

We start with the claim that if the cdfs of two Gaussians are multiplicatively close to each other, then the means of the Gaussians must also be appropriately close. The following structural lemma quantifies this.

Lemma D.3. Let $\xi, t, \varepsilon$ be real numbers with $\xi > 0$, $t < -\xi/2$, and $\varepsilon \in (0,1)$. Let $P_+ = N(\xi/2, 1)$ and $P_- = N(-\xi/2, 1)$ be two Gaussians and denote by $F_+(x)$ and $F_-(x)$ their cumulative distribution functions (cdfs). If $t$ is a point for which $F_+(t) \ge (1-\varepsilon)F_-(t)$, then $\xi \le \log\big(1+\frac{\varepsilon}{1-\varepsilon}\big)/|t|$.

Proof. It suffices to prove the claim for the extreme case, i.e., that $F_+(t) = (1-\varepsilon)F_-(t)$ implies $\xi \le \log(1+\frac{\varepsilon}{1-\varepsilon})/|t|$. Let $\Phi(x)$ denote the cdf of $N(0,1)$. Then the equation $F_+(t) = (1-\varepsilon)F_-(t)$ is equivalent to $\Phi(t-\xi/2)/\Phi(t+\xi/2) = 1-\varepsilon$. Taking logarithms on both sides and doing some further rewriting, we have
$$\log(1-\varepsilon) = \log\bigg(\frac{\Phi(t-\xi/2)}{\Phi(t+\xi/2)}\bigg) = \log\big(\Phi(t-\xi/2)\big)-\log\big(\Phi(t+\xi/2)\big) = -\int_{t-\xi/2}^{t+\xi/2}\frac{\mathrm d}{\mathrm dy}\log\Phi(y)\,\mathrm dy = -\int_{t-\xi/2}^{t+\xi/2}\frac{\phi(y)}{\Phi(y)}\,\mathrm dy, \tag{17}$$
where $\phi(y)$ denotes the pdf of $N(0,1)$. Recall the Mills ratio inequality (cf. Fact B.2): $x \le \frac{\phi(x)}{1-\Phi(x)} \le x+\frac 1x$ for all $x > 0$. However, due to our assumption $t < -\xi/2$, the variable $y$ inside the integral in (17) is always negative. We can obtain a version of Mills' ratio inequality for negative reals by using the symmetry properties $\phi(x) = \phi(-x)$ and $1-\Phi(x) = \Phi(-x)$:
$$-y \le \frac{\phi(y)}{\Phi(y)} \le -y-\frac 1y \qquad \forall\, y < 0.$$
Combining the left part of the above inequality with (17), we obtain
$$\int_{t-\xi/2}^{t+\xi/2}y\,\mathrm dy \ge -\int_{t-\xi/2}^{t+\xi/2}\frac{\phi(y)}{\Phi(y)}\,\mathrm dy = \log(1-\varepsilon).$$
Using $\int_{t-\xi/2}^{t+\xi/2}y\,\mathrm dy = \frac 12\big((t+\xi/2)^2-(t-\xi/2)^2\big) = t\xi = -|t|\xi$ and rearranging the above inequality, we finally obtain
$$\xi \le \frac{-\log(1-\varepsilon)}{|t|} = \frac{\log\big(1+\frac{\varepsilon}{1-\varepsilon}\big)}{|t|}.$$
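A quick numerical check of Lemma D.3 (illustrative only; scipy supplies $\Phi$, and the root-finding bracket is a hypothetical choice that keeps $t < -\xi/2$): solve the extreme-case equation $F_+(t) = (1-\varepsilon)F_-(t)$ for $\xi$ and compare against the bound.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

eps, t = 0.3, -2.0

# Find xi with F_+(t) = (1 - eps) F_-(t),
# i.e. Phi(t - xi/2) = (1 - eps) Phi(t + xi/2).
g = lambda xi: norm.cdf(t - xi / 2) - (1 - eps) * norm.cdf(t + xi / 2)
xi = brentq(g, 0.0, 2 * abs(t) - 1e-9)   # bracket keeps t < -xi/2
print(xi, "<=", np.log(1 + eps / (1 - eps)) / abs(t))   # Lemma D.3 bound
```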
We will use the contrapositive and shifted version of Lemma D.3 that is stated below. It is contrapositive because it says that a large difference in the means of two Gaussians translates to a large multiplicative gap between their cdfs, and it is shifted because it includes an arbitrary shift $\mu$ in the means of both Gaussians.

Corollary D.4. Let $\mu, \xi, t, \varepsilon$ be reals with $\xi > 0$, $t+\xi/2 < 0$, and $\varepsilon \in (0,1)$. If $P_1 = N(\mu, 1)$ and $P_2 = N(\mu+\xi, 1)$ are two Gaussians with $\xi > \log\big(1+\frac{\varepsilon}{1-\varepsilon}\big)/|t|$, then their cdfs $F_1, F_2$ satisfy $F_2(t+\mu+\xi/2) < (1-\varepsilon)F_1(t+\mu+\xi/2)$.

The corollary follows by applying the contrapositive of Lemma D.3 to the recentered Gaussians $N(\pm\xi/2, 1)$. We can now prove Theorem D.1.

Proof of Theorem D.1. First, with $\frac{n}{1-\varepsilon}$ samples one can learn an approximation $\widehat F$ to the cumulative distribution function $F$ of the conditional distribution of the non-missing samples (Fact B.3). That is, with probability at least $1-2e^{-2n\eta^2}$ we have
$$\big|\widehat F(x)-F(x)\big| \le \eta \tag{18}$$
for all $x \in \mathbb R$. For the remainder of the proof fix $t := \log\big(1+\frac{\varepsilon}{1-\varepsilon}\big)/\delta$. We will use $\eta := \varepsilon\Phi(-t)$ (where $\Phi$ denotes the cdf of $N(0,1)$) and we will set $n$ to be a sufficiently large multiple of $\eta^{-2}\log(1/\tau)$, so that the probability of failure is at most $\tau$.

Let $u := \widehat F^{-1}(\Phi(-t))$, i.e., the point for which it holds $\widehat F(u) = \Phi(-t)$. Consider the Gaussian distribution $N(\mu_0, 1)$ where $\mu_0 := u+t$. This is exactly the Gaussian whose cdf $F_0$ satisfies $F_0(u) = \Phi(-t)$. The algorithm then is this: we simply return $\widehat\mu = \mu_0 = u+t$.

Given Corollary D.4, it is easy to see why this has accuracy $O(\delta)$. Let $F^*$ be the cdf of the ground truth Gaussian (inlier distribution). By the fact that $\widehat F(u) = \Phi(-t)$, (18), and the definition of our contamination model, we have that $(1-O(\varepsilon))\Phi(-t) \le F^*(u) \le (1+O(\varepsilon))\Phi(-t)$. By Corollary D.4, if $\widetilde F$ is the cdf of a unit-variance Gaussian with mean larger than $\mu_0 + C\log\big(1+\frac{\varepsilon}{1-\varepsilon}\big)/|u-\mu_0-\delta/2|$, then it would hold that $\widetilde F(u) < (1-C\varepsilon)\Phi(-t)$. Thus, this Gaussian could not be the ground truth one, as its cdf evaluated at the point $u$ falls outside the interval $[(1-O(\varepsilon))\Phi(-t),\,(1+O(\varepsilon))\Phi(-t)]$. Similarly, we can rule out any Gaussian with mean smaller than $\mu_0 - C\log\big(1+\frac{\varepsilon}{1-\varepsilon}\big)/|u-\mu_0-\delta/2|$ from being the ground truth. This means that the point $\mu_0$ is within $2C\log\big(1+\frac{\varepsilon}{1-\varepsilon}\big)/|u-\mu_0-\delta/2| = 2C\log\big(1+\frac{\varepsilon}{1-\varepsilon}\big)/(t+\delta/2) = O(\delta)$ of the ground truth Gaussian's mean, where in the last step we used that $t = \log\big(1+\frac{\varepsilon}{1-\varepsilon}\big)/\delta$ and $\delta^2 \le \log\big(1+\frac{\varepsilon}{1-\varepsilon}\big)$. By adjusting the constants we can turn this $O(\delta)$ into just $\delta$.
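The one-dimensional estimator from the proof above is short enough to state in code. A minimal sketch under our own naming conventions ($\perp$ is represented by NaN, and an empirical quantile stands in for $\widehat F^{-1}$; this is a rendering of the idea, not the algorithm as analyzed):

```python
import numpy as np
from scipy.stats import norm

def estimate_mean_1d(samples, eps, delta):
    """Quantile-matching estimator sketched from the proof of Theorem D.1.

    samples: draws from an eps-corrupted N(mu, 1); missing entries are NaN.
    Returns mu_hat = u + t, where u is the empirical Phi(-t)-quantile of
    the non-missing samples and t = log(1 + eps/(1-eps)) / delta."""
    t = np.log(1 + eps / (1 - eps)) / delta
    observed = samples[~np.isnan(samples)]
    u = np.quantile(observed, norm.cdf(-t))   # u = F_hat^{-1}(Phi(-t))
    return u + t

# Toy run with the left-tail adversary from the earlier sketch.
rng = np.random.default_rng(4)
mu, eps, delta = 1.5, 0.3, 0.5
x = rng.normal(mu, 1.0, size=200_000)
x[(x < mu) & (rng.random(x.size) < eps)] = np.nan   # adversarial deletion
print(estimate_mean_1d(x, eps, delta))   # close to mu (within delta here)
```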
We now show the extension of the algorithm to multiple dimensions.

Theorem D.5 (Multivariate estimator). Let $d \in \mathbb Z_+$ denote the dimension and let $C$ be a sufficiently large absolute constant. Let $\varepsilon \in (0,1)$ be a contamination parameter and $\delta \in (0, \log^{1/2}(1+\frac{\varepsilon}{1-\varepsilon}))$ be an accuracy parameter. Let $\mu \in \mathbb R^d$ be an (unknown) vector. There exists an algorithm that takes as input $\varepsilon, \delta$, draws
$$n = \exp\Bigg(O\bigg(\frac{\log\big(1+\frac{\varepsilon}{1-\varepsilon}\big)}{\delta}\bigg)^2\Bigg)\,\frac{d+\log(1/\tau)}{\varepsilon^2(1-\varepsilon)}$$
points from an $\varepsilon$-corrupted version of $N(\mu, I_d)$ under the contamination model of Definition 1.1, and outputs $\widehat\mu$ such that $\|\widehat\mu-\mu\|_2 \le \delta$ with probability at least $1-\tau$. Moreover, it runs in time $2^{O(d)}\,\mathrm{poly}(n, d)$.

Proof of Theorem D.5. Denote by $T = \{x_i\}_{i=1}^n$, $x_i \in \mathbb R^d$, the points from the $\varepsilon$-corrupted version of $N(\mu, I)$, and denote by $\mathcal C$ the cover set of Corollary B.6. The algorithm is the following. First, using the algorithm from Theorem D.1, calculate an $m_v$ for each $v \in \mathcal C$ such that $|m_v - v^\top\mu| \le \delta/8$ (see below for more details on this step). Then, output a solution of the following linear program (note that the program always has a solution, as it is satisfied by $\widehat\mu = \mu$):
$$\text{Find } \widehat\mu \in \mathbb R^d \quad\text{s.t.}\quad |v^\top\widehat\mu - m_v| \le \delta/4, \quad \forall v \in \mathcal C.$$
The claim is that this solution $\widehat\mu$ is indeed close to the target $\mu$, since
$$\|\mu-\widehat\mu\|_2 \le 2\max_{v\in\mathcal C}|v^\top(\mu-\widehat\mu)| \le 2\max_{v\in\mathcal C}\big(|v^\top\mu-m_v|+|m_v-v^\top\widehat\mu|\big) \le 2(\delta/8+\delta/4) < \delta, \tag{19}$$
where the first step uses Corollary B.6.

We now explain how to obtain the approximations $m_v$ with the guarantee $|m_v - v^\top\mu| \le \delta/8$. Fixing a direction $v \in \mathcal C$, we note that $v^\top x \sim N(v^\top\mu, 1)$; thus $\{v^\top x_i\}_{i=1}^n$ is a set of samples from an $\varepsilon$-corrupted version of $N(v^\top\mu, 1)$. Thus, if we apply the algorithm from Theorem D.1 with probability of failure $\tau' = \tau/|\mathcal C|$, the event $|m_v - v^\top\mu| \le \delta/8$ will hold with probability at least $1-\tau/|\mathcal C|$. By a union bound, the probability that all the events for $v \in \mathcal C$ hold simultaneously is at least $1-\tau$. The number of samples for this application of Theorem D.1 is
$$2^{O(\log(1+\frac{\varepsilon}{1-\varepsilon})/\delta)^2}\,\frac{\log(1/\tau')}{\varepsilon^2(1-\varepsilon)} = 2^{O(\log(1+\frac{\varepsilon}{1-\varepsilon})/\delta)^2}\,\frac{\log(|\mathcal C|/\tau)}{\varepsilon^2(1-\varepsilon)} = 2^{O(\log(1+\frac{\varepsilon}{1-\varepsilon})/\delta)^2}\,\frac{d+\log(1/\tau)}{\varepsilon^2(1-\varepsilon)}.$$

We conclude with the runtime analysis. The runtime to find the $m_v$'s is $O(|\mathcal C|\,\mathrm{poly}(nd)) = 2^{O(d)}\mathrm{poly}(nd)$, since for each fixed $v \in \mathcal C$ we need $\mathrm{poly}(nd)$ time to calculate the projections $\{x_i^\top v\}$ of our dataset onto $v$ and $\mathrm{poly}(n)$ time to run the one-dimensional estimator. The linear program can be solved using the ellipsoid algorithm. Consider the separation oracle that exhaustively checks all $2^{O(d)}$ constraints. We need $\mathrm{poly}(d)\log(R/r)$ calls to that separation oracle, where $R, r$ are the outer and inner radii of the bounding spheres of the feasible region. First, $R \le \delta$, because we have already shown in (19) that the feasible set belongs to a ball of radius $\delta$ around $\mu$. Regarding the inner radius $r$, note that all $\widehat\mu$ inside a ball of radius $\delta/8$ around $\mu$ are feasible, since $|v^\top\widehat\mu-m_v| \le |v^\top\widehat\mu-v^\top\mu|+|v^\top\mu-m_v| \le \|\widehat\mu-\mu\|_2+\delta/8 \le \delta/4$. This means that we can take $r = \delta/8$. Hence the total runtime for solving the LP is $2^{O(d)}\mathrm{poly}(d)$, or simply $2^{O(d)}$.
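A toy rendition of the multivariate estimator follows (a sketch, not the algorithm as analyzed: random unit directions stand in for the deterministic cover $\mathcal C$ of Corollary B.6, and a least-squares solve stands in for the feasibility LP; it reuses estimate_mean_1d from the previous sketch):

```python
import numpy as np

def estimate_mean_multi(X, eps, delta, n_dirs=200, rng=None):
    """Sketch of Theorem D.5: project onto directions, estimate each 1-D
    mean with estimate_mean_1d, then recover mu_hat by solving the
    least-squares problem min_mu ||V mu - m||_2 (in place of the LP)."""
    rng = rng or np.random.default_rng()
    V = rng.standard_normal((n_dirs, X.shape[1]))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    # NaN rows (missing samples) propagate through the projection X @ v.
    m = np.array([estimate_mean_1d(X @ v, eps, delta) for v in V])
    mu_hat, *_ = np.linalg.lstsq(V, m, rcond=None)
    return mu_hat

rng = np.random.default_rng(5)
d, n, eps, delta = 8, 100_000, 0.2, 0.6
mu = rng.standard_normal(d)
X = rng.standard_normal((n, d)) + mu
drop = (X[:, 0] < mu[0]) & (rng.random(n) < eps)   # a valid adversary
X[drop] = np.nan
print(np.linalg.norm(estimate_mean_multi(X, eps, delta, rng=rng) - mu))
# error should be on the order of delta
```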
E Hardness in Other Restricted Models of Computation

We give a brief summary of known information-computation gaps for the Non-Gaussian Component Analysis problem (Problem 2.2) in different restricted models of computation. Since we have already shown that our problem is an instance of NGCA, we immediately obtain corollaries for Low-Degree Polynomials (Corollary E.3) and PTFs (Corollary E.5). The result for SoS has a few mild but cumbersome-to-verify conditions, so we refer the reader directly to [DKPP24] for the formal statements.

E.1 Hardness in the Low-Degree Polynomial Class of Algorithms

We start with the Low-Degree Polynomial (LDP) model, which we describe in more detail. We will consider tests that are thresholded polynomials of low degree, i.e., they output $H_1$ if the value of the polynomial exceeds a threshold and $H_0$ otherwise. We need the following notation and definitions. For a distribution $D$ over $X$, we use $D^{\otimes n}$ to denote the joint distribution of $n$ i.i.d. samples from $D$. For two functions $f : X \to \mathbb R$, $g : X \to \mathbb R$ and a distribution $D$, we use $\langle f, g\rangle_D$ to denote the inner product $\operatorname{E}_{X\sim D}[f(X)g(X)]$. We use $\|f\|_D$ to denote $\sqrt{\langle f, f\rangle_D}$. We say that a polynomial $f(x_1,\ldots,x_n) : \mathbb R^{n\times d} \to \mathbb R$ has sample-wise degree $(r, \ell)$ if each monomial uses at most $\ell$ different samples from $x_1, \ldots, x_n$ and uses degree at most $r$ for each of them. Let $\mathcal C_{r,\ell}$ be the linear space of all polynomials of sample-wise degree $(r, \ell)$ with respect to the inner product defined above. For a function $f : \mathbb R^{n\times d} \to \mathbb R$, we use $f^{\le r,\ell}$ to denote the orthogonal projection of $f$ onto $\mathcal C_{r,\ell}$ with respect to the inner product $\langle\cdot,\cdot\rangle_{D_0^{\otimes n}}$. Finally, for the null distribution $D_0$ and a distribution $P$, define the likelihood ratio $\overline{P}^{\otimes n}(x) := P^{\otimes n}(x)/D_0^{\otimes n}(x)$.

Definition E.1 ($n$-sample $\tau$-distinguisher). For the hypothesis testing problem between $D_0$ (null distribution) and $D_1$ (alternate distribution) over $X$, we say that a function $p : X^n \to \mathbb R$ is an $n$-sample $\tau$-distinguisher if
$$\big|\operatorname{E}_{X\sim D_0^{\otimes n}}[p(X)]-\operatorname{E}_{X\sim D_1^{\otimes n}}[p(X)]\big| \ge \tau\sqrt{\operatorname{Var}_{X\sim D_0^{\otimes n}}[p(X)]}.$$
We call $\tau$ the advantage of the polynomial $p$.

Note that if a function $p$ has advantage $\tau$, then Chebyshev's inequality implies that one can furnish a test $p' : X^n \to \{D_0, D_1\}$ by thresholding $p$ such that the probability of error under the null distribution is at most $O(1/\tau^2)$. We will think of the advantage $\tau$ as a proxy for the inverse of the probability of error, and we will show that the advantage of all polynomials up to a certain degree is $O(1)$. It can be shown that for hypothesis testing problems of the form of Problem 2.2, the best possible advantage among all polynomials in $\mathcal C_{r,\ell}$ is captured by the low-degree likelihood ratio (see, e.g., [BBH+21, KWB22]):
$$\Big\|\operatorname{E}_{v\sim U(\mathcal S)}\big[\big(\overline{P}_{A,v}^{\otimes n}\big)^{\le r,\ell}\big]-1\Big\|_{D_0^{\otimes n}},$$
where in our case $D_0 = N(0, I)$. It has been known since [BBH+21] that a lower bound on the SQ dimension translates to an upper bound on the low-degree likelihood ratio. Given this, one can obtain the following corollary regarding the hardness of NGCA:

Theorem E.2 (Information-computation gap for NGCA in LDP). Let $c$ be a sufficiently small positive constant and consider the hypothesis testing problem of Problem 2.2, where the distribution $A$ matches the first $t$ moments with $N(0,1)$. For any $d \in \mathbb Z_+$ with $d = t^{\Omega(1/c)}$, any $n \le \Omega(d)^{(t+1)/10}/\chi^2(A, N(0,1))$, and any even integer $\ell < d^c$, we have that
$$\Big\|\operatorname{E}_{v\sim U(\mathcal S)}\big[\big(\overline{P}_{A,v}^{\otimes n}\big)^{\le\infty,\ell}\big]-1\Big\|_{D_0^{\otimes n}} \le 1.$$

The interpretation of this result is that unless the number of samples $n$ used is greater than $\Omega(d)^{(t+1)/10}/\chi^2(A, N(0,1))$, any polynomial of degree roughly up to $d^c$ fails to be a good test (note that any polynomial of degree $\ell$ has sample-wise degree at most $(\ell, \ell)$).
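To make Definition E.1 concrete, here is a small Monte Carlo computation of the advantage of the simple degree-2 statistic $p(X) = \frac 1n\sum_i\|x_i\|^2$ for distinguishing $N(0, I_d)^{\otimes n}$ from a mean-shifted alternative with a hidden direction. This is an illustrative toy of our own choosing, not one of the hard instances of Theorem E.2:

```python
import numpy as np

rng = np.random.default_rng(6)
d, n, delta, trials = 20, 50, 0.5, 2_000

def stat(X):                  # degree-2 statistic: mean squared norm
    return (X ** 2).sum(axis=-1).mean(axis=-1)

null = np.array([stat(rng.standard_normal((n, d))) for _ in range(trials)])
alt = []
for _ in range(trials):
    v = rng.standard_normal(d); v /= np.linalg.norm(v)   # hidden direction
    alt.append(stat(rng.standard_normal((n, d)) + delta * v))

# Advantage per Definition E.1: |E_D1[p] - E_D0[p]| / sqrt(Var_D0[p]).
tau = abs(np.mean(alt) - np.mean(null)) / np.std(null)
print(tau)    # approx delta^2 / sqrt(2 d / n), about 0.28 here
```

Because the mean shift only changes $\operatorname{E}\|x\|^2$ by $\delta^2$ regardless of the hidden direction, this statistic has small advantage in high dimensions, which is the qualitative phenomenon the lower bounds formalize.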
We now show the corollary for the robust mean estimation problem of this paper. Recall that the hypothesis testing problem of Theorem 1.3 includes as a special case the NGCA problem with $A$ being the $\varepsilon$-corrupted version of $N(\delta v, I)$, where $v$ is a unit vector and the corruption adversary from Definition 1.1 uses as $f(x)$ the function from Lemma 3.3. For this distribution $A$ we have that (i) it matches the first $\Omega(\gamma^2/\log\gamma)$ moments with $N(0,1)$, where $\gamma := \frac 1\delta\log\big(1+\frac{\varepsilon/2}{1-\varepsilon/2}\big)$, and (ii) $\chi^2(A, N(0,1)) = O(\frac{1}{1-\varepsilon})$; this second part is not included in Lemma 3.3 because it was not needed for the SQ lower bound, but it immediately follows by using that $f(x)/\int f(x)\,\mathrm dx \le \frac{g(x)+p(x)\mathbb 1(|x|\le 1)}{1-\varepsilon} \le \frac{\phi(x-\delta)+\varepsilon}{1-\varepsilon}$ (in the proofs of Lemmata 3.2 and 3.3 it can be seen that $g(x)$ is bounded by a translated Gaussian and that the polynomial $p(x)$ has small absolute value in $[-1, 1]$).

Corollary E.3 (Hardness of mean estimation against Low-Degree Polynomials). Consider the same hypothesis testing problem as in Theorem 1.3 and let $m$ be defined as in Theorem 1.3. There is a way for the adversary of Definition 1.1 to corrupt the alternative hypothesis $P = N(\delta u, I)$ into a distribution $\widetilde P$ such that the following holds. Let $c$ be a sufficiently small positive constant. For any $d \in \mathbb Z_+$ with $d = m^{\Omega(1/c)}$, any $n \le \Omega(d)^{(m+1)/10}(1-\varepsilon)$, and any even integer $\ell < d^c$, we have that
$$\Big\|\operatorname{E}_{v\sim U(\mathcal S)}\big[\big(\overline{\widetilde P}{}^{\otimes n}\big)^{\le\infty,\ell}\big]-1\Big\|_{D_0^{\otimes n}} \le 1.$$
This is interpreted as a tradeoff between $d^{\Omega(m)}(1-\varepsilon)$ samples and super-polynomial runtime.

E.2 Hardness against PTF Tests

In the LDP class of the previous section, the goodness of a test is quantified by the advantage, defined in Definition E.1. Hardness of a problem in this class is shown by ruling out the existence of polynomials with large advantage. That definition is based on the idea that one can obtain a test by thresholding the polynomial at the midpoint of its expectations under the two distributions. Thus, ruling out the existence of polynomials with large advantage rules out the construction of such tests. However, it still leaves open the possibility that some other kind of thresholded polynomial might succeed. As such, a more natural class of tests is the one consisting of all possible thresholded polynomials (i.e., with arbitrary thresholds). The information-computation gap for NGCA in this class is the one below.

Theorem E.4 (Information-computation gap for NGCA against PTF tests ([DKLP25b])). There exists a sufficiently large absolute constant $C_*$ such that the following holds. For any $c_* \in (0, 1/4)$ and $d, k, n, m \in \mathbb Z_+$ such that (i) $m$ is even, (ii) $\max(k, m) < d^{c_*}/C_*$, and (iii) $n < d^{(1/4-c_*)m}$, we have that if $p : \mathbb R^{n\times d} \to \mathbb R$ is a degree-$k$ polynomial and $A$ is a distribution on $\mathbb R$ that matches the first $m$ moments with $N(0,1)$, then
$$\bigg|\operatorname{E}_{\substack{v\sim U(\mathcal S)\\ x^{(1)},\ldots,x^{(n)}\sim P_{A,v}}}\Big[\operatorname{sgn}\big(p(x^{(1)},\ldots,x^{(n)})\big)\Big]-\operatorname{E}_{x^{(1)},\ldots,x^{(n)}\sim N(0,I)}\Big[\operatorname{sgn}\big(p(x^{(1)},\ldots,x^{(n)})\big)\Big]\bigg| \le 0.11, \tag{20}$$
where $P_{A,v}$ denotes the hidden direction distribution from Definition 2.1, and $\operatorname{sgn} : \mathbb R \to \{0,1\}$ is the sign function with $\operatorname{sgn}(x) = 1$ if and only if $x \ge 0$.
The corollary for robust mean estimation is stated below.

Corollary E.5 (Hardness of mean estimation against PTFs). Consider the same hypothesis testing problem as in Theorem 1.3 and let $m$ be defined as in Theorem 1.3. There is a way for the adversary of Definition 1.1 to corrupt the alternative hypothesis $P = N(\delta u, I)$ into a distribution $\widetilde P$ such that the following holds: there exists a sufficiently large absolute constant $C_*$ such that for any $d, k, n, m \in \mathbb Z_+$ with (i) $m$ even, (ii) $\max(k, m) < d^{c_*}/C_*$, and (iii) $n < d^{\Omega(m)}$, if $p : \mathbb R^{n\times d} \to \mathbb R$ is a degree-$k$ polynomial, then
$$\bigg|\operatorname{E}_{\substack{v\sim U(\mathcal S)\\ x^{(1)},\ldots,x^{(n)}\sim P_{A,v}}}\Big[\operatorname{sgn}\big(p(x^{(1)},\ldots,x^{(n)})\big)\Big]-\operatorname{E}_{x^{(1)},\ldots,x^{(n)}\sim N(0,I)}\Big[\operatorname{sgn}\big(p(x^{(1)},\ldots,x^{(n)})\big)\Big]\bigg| \le 0.11, \tag{21}$$
where $P_{A,v}$ denotes the hidden direction distribution from Definition 2.1, and $\operatorname{sgn} : \mathbb R \to \{0,1\}$ is the sign function with $\operatorname{sgn}(x) = 1$ if and only if $x \ge 0$.

For an arbitrary polynomial $p$, the runtime for computing such a test is on the order of $\mathrm{poly}((nd)^k)$, and thus Theorem E.4 implies an inherent trade-off between the exponential runtime $(nd)^{d^{\Omega(1)}}$ and the sample complexity $d^{\Omega(m)}$ for the family of PTF tests.
