High-dimensional estimation with missing data: Statistical and computational limits

Kabir Aladin Verchand†, Ankit Pensia◦, Saminul Haque⋆, and Rohith Kuditipudi⋆

† Department of Data Sciences and Operations, University of Southern California
◦ Department of Statistics and Data Science, Carnegie Mellon University
⋆ Department of Computer Science, Stanford University

March 18, 2026

(Authors are listed in random order.)

Abstract

We consider computationally-efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data, in which an $\epsilon$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence towards statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that in order to obtain error at most $\rho$, for any constant contamination $\epsilon \in (0,1)$, (roughly) $n \gtrsim d e^{1/\rho^2}$ samples are necessary, and that there is a computationally-inefficient algorithm which achieves this error. On the other hand, we show that any computationally-efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) $n \gtrsim d^{1/\rho^2}$, and that there exists a polynomial-time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.

Contents

1 Introduction
  1.1 Contributions
  1.2 Further related work
2 Background
  2.1 Sum-of-squares algorithms
  2.2 SQ lower bounds
  2.3 Minimax lower bounds
3 Information-theoretic limits of mean and covariance estimation
  3.1 Algorithms via variational principles
  3.2 Mean estimation
  3.3 Covariance estimation
4 Computationally-efficient mean and covariance estimation
  4.1 Warm-up: Efficient testing via polynomials
  4.2 Preliminaries
  4.3 Mean estimation
  4.4 Covariance estimation
5 Nearly optimal, computationally-efficient linear regression
  5.1 Model preliminaries
  5.2 Warm-up: The population setting
  5.3 Main result
6 Discussion
A Proofs of information-theoretic upper bounds for mean and covariance estimation
  A.1 Mean estimation: Proof of Theorem 3.1 upper bound
  A.2 Covariance estimation: Proof of Theorem 3.2 upper bound
  A.3 Univariate upper bounds
    A.3.1 Univariate mean estimation: Proof of Corollary A.1
    A.3.2 Univariate variance estimation: Proof of Lemma A.2
B Proof of upper bound for linear regression
  B.1 Linear regression: Proof of Theorem 5.1 upper bound
    B.1.1 Proof of Lemma B.1
    B.1.2 Proof of Lemma B.2
  B.2 Efficient algorithm
C Proofs of information-theoretic lower bounds
  C.1 Lower bound constructions
  C.2 Information-theoretic lower bounds
    C.2.1 Mean estimation: Proof of Theorem 3.1 lower bound
    C.2.2 Covariance estimation: Proof of Theorem 3.2 lower bound
    C.2.3 Linear regression: Proof of Theorem 5.1 lower bound
D Proofs of sum-of-squares upper bounds
  D.1 Mean estimation: Proof of Theorem 4.5
  D.2 Covariance estimation: Proof of Theorem 4.10
    D.2.1 Proof of Lemma 4.9
E Proofs of computational lower bounds
  E.1 Preliminaries
  E.2 Mean estimation: Proof of Theorem 4.7
    E.2.1 Proof of Lemma E.5
    E.2.2 Covariance estimation: Proof of Theorem 4.12
F Extension to multiple missingness patterns for mean and covariance estimation
  F.1 Lower bounds
  F.2 Upper bounds
    F.2.1 Mean estimation
    F.2.2 Covariance estimation
G Auxiliary lemmas
  G.1 Reduction to q = 1: Proof of Lemma 1.2
  G.2 Useful general lemmas
  G.3 Univariate Gaussian concentration and moment properties
  G.4 Multivariate Gaussian concentration and moment properties
    G.4.1 Proof of Lemma 4.3
  G.5 Sum-of-squares facts

1 Introduction

Statistical procedures are typically defined under the idealized assumption that each observation $\{X_i\}_{i\in[n]}$ in a sample of size $n$ is an independent outcome from a population distribution $P$. In practice, it is often the case that observations depart in some way from this idealized scenario.
One common departure is that of missing data, in which each observation may only be partially revealed. Of key importance when handling missing data is the nature of the mechanism by which data is missing. Typically, these "missingness mechanisms" are classified into one of three categories of increasing flexibility: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In brief, the MCAR assumption ensures that the missingness mechanism is independent of the underlying observations; the MAR assumption ensures that the missingness mechanism depends only on the observed data (see [SGJC13] for a precise definition); and the MNAR assumption places no restriction on the missingness mechanism. For a detailed exposition of each of these three categories, we refer the reader to the book [LR14]. While the former two assumptions often enable identification of population parameters, they are (i) often too strong to hold in practice and (ii) impossible to test for [GVR97]. On the other hand, the MNAR assumption is typically too flexible to enable identification of the population parameters [Man03]. In particular, when data are subject to missing not at random mechanisms, the resulting observations may be significantly biased.

Motivated by this discrepancy, [MVBWS24] recently introduced the realizable contamination framework, which considers Huber-style contamination tailored to the setting of missing data. In this paper, we adopt this model, which we describe next. Throughout the main text we will consider a simplified all-or-nothing setting in which observations have either no missing data or are fully missing. Later, in Appendix F, we show how our results naturally extend to the setting of multiple missingness patterns.

[Figure 1: Types of missing data patterns. Each row indicates a single sample, where the entries in gray indicate an observed value and the ⋆ entries indicate missingness. Panel (a): all-or-nothing missingness. Panel (b): multiple missingness patterns. In order to simplify the results in the main text, we focus on the all-or-nothing patterns described in panel (a), deferring the extension to the more general setting to Appendix F.]

To concretely define the model, let $P$ denote a probability distribution on $\mathbb{R}^d$ and suppose that we are interested in estimating some quantity $\theta(P)$ from incomplete observations. Following [MVBWS24], we define the following sets of distributions, specialized to the all-or-nothing setting:

$$\mathrm{MCAR}(P, q) := \Big\{ \mathrm{Law}(X \star \Omega) : X \sim P,\ \Omega \in \{(0)^d, (1)^d\},\ \Omega \perp\!\!\!\perp X,\ \mathbb{P}(\Omega = (1)^d) = q \Big\} \tag{1a}$$

$$\mathrm{MNAR}_P := \Big\{ \mathrm{Law}(X \star \Omega) : X \sim P,\ \mathrm{Law}(\Omega) \in \mathcal{P}\big(\{(0)^d, (1)^d\}\big) \Big\}. \tag{1b}$$

We point the reader to the notation section at the end of the introduction for a precise definition of the modified Hadamard product $\star$. We emphasize that these distributions are on the extended space $\mathbb{R}^d \cup \{\star^d\}$. In words, $\mathrm{MCAR}(P,q)$ describes the missing completely at random set of distributions, in which each observation is seen with probability $q$. On the other hand, $\mathrm{MNAR}_P$ describes the set of missing not at random distributions, which can be obtained by applying any missingness mechanism to the observations. Since consistent estimation under the assumption-lean setting of $\mathrm{MNAR}_P$ is in general impossible, we consider the realizable contamination model introduced by [MVBWS24].
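To make the all-or-nothing model concrete, the following is a minimal sketch of one possible data-generating process in this framework: a clean Gaussian sample is drawn, an MCAR mask reveals each observation with probability $q$, and an $\epsilon$-fraction of observations instead pass through a value-dependent masking rule. The specific masking rule below (hiding large-norm points) is our own illustrative choice, not one from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_realizable(n, d, eps, q, theta):
    """Draw n observations from the all-or-nothing realizable contamination
    model R(N(theta, I_d), eps, q): each clean sample X is either fully
    observed or replaced by the fully-missing symbol (a row of NaNs)."""
    X = rng.standard_normal((n, d)) + theta
    out = np.full((n, d), np.nan)                 # NaN plays the role of the star symbol
    mnar = rng.random(n) < eps                    # eps-fraction routed to the MNAR component
    # MCAR component: reveal independently of X with probability q.
    reveal_mcar = ~mnar & (rng.random(n) < q)
    # MNAR component: an arbitrary value-dependent rule; as an illustration,
    # hide the samples with the largest norms (this rule is ours, not the paper's).
    reveal_mnar = mnar & (np.linalg.norm(X - np.median(X, axis=0), axis=1) < np.sqrt(d))
    revealed = reveal_mcar | reveal_mnar
    out[revealed] = X[revealed]
    return out

T = sample_realizable(n=1000, d=5, eps=0.2, q=0.9, theta=np.zeros(5))
print("fraction fully observed:", np.mean(~np.isnan(T[:, 0])))
```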
In particular, given a contamination parameter $\epsilon \in [0,1]$, the realizable set of distributions is given by

$$\mathcal{R}(P, \epsilon, q) := (1-\epsilon)\,\mathrm{MCAR}(P, q) + \epsilon\,\mathrm{MNAR}_P. \tag{2}$$

When $q = 1$, we omit it, i.e., $\mathcal{R}(P,\epsilon) := \mathcal{R}(P,\epsilon,1)$. The following straightforward lemma (which is a corollary of [MVBWS24, Proposition 2]) characterizes the set of realizable distributions by upper and lower bounds on likelihood ratios.

Lemma 1.1. The distribution $Q \in \mathcal{R}(P,\epsilon,q)$ if and only if, for all $z \in \mathbb{R}^d$,
$$q(1-\epsilon) \le \frac{dQ}{dP}(z) \le q(1-\epsilon) + \epsilon.$$

We emphasize that, restricted to $\mathbb{R}^d$, this lemma implies that $Q$ is a sub-probability measure whose density is upper and lower bounded by that of $P$. From this relation, we can read off upper and lower bounds on the conditional distribution $Q_R$ of the observations, given that they were observed. In particular, note that $q(1-\epsilon) \le Q(\mathbb{R}^d) \le q(1-\epsilon) + \epsilon$. This in turn implies that if $Q \in \mathcal{R}(P,\epsilon,q)$, then

$$1 - \frac{\epsilon}{q(1-\epsilon)+\epsilon} \le \frac{dQ_R}{dP}(z) \le 1 + \frac{\epsilon}{q(1-\epsilon)} \quad \text{for all } z \in \mathbb{R}^d. \tag{3}$$

Most of our algorithms in the sequel will be based on this characterization. We will occasionally use the notation $\mathcal{R}_R(P,\epsilon,q)$ to denote the corresponding set of conditional distributions (e.g., on $\mathbb{R}^d$).

We note in passing that in this special case, the condition is similar to familiar sensitivity conditions considered in the causal inference literature, e.g., [Ros87; ZSB19], as well as models of sampling bias [SLW22]. In particular, if we let $\Gamma = 1 + \frac{\epsilon}{q(1-\epsilon)} \ge 1$, this is equivalent to a biased sampling model in which all of the observations are biased with likelihood ratios bounded below and above by $1/\Gamma$ and $\Gamma$, respectively. Indeed, we note that under an appropriate change of variables, all of our algorithms (and lower bounds) apply in this setting.

We will assume throughout the remainder of this text that $q = 1$. This is without loss of generality by an appropriate rescaling of $\epsilon$ and $n$. In particular, if both $q$ and $\epsilon$ are known, one can remove a $(1-q')$-fraction of the completely missing observations and apply an algorithm designed for observations from the set $\mathcal{R}(P, \epsilon', 1)$. The following lemma makes this precise.

Lemma 1.2. Let $\epsilon \in [0,1)$ and $q \in [0,1]$, and using these define
$$\epsilon' = \frac{\epsilon}{\epsilon + q(1-\epsilon)} \quad \text{and} \quad q' = \epsilon + q(1-\epsilon).$$
Then,
$$\mathcal{R}(P,\epsilon,q) = \Big\{ q' \cdot Q' + (1-q')\,\delta_{\{\star^d\}} \;:\; Q' \in \mathcal{R}(P, \epsilon', 1) \Big\}.$$

We provide the proof of Lemma 1.2 in Section G.1.
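As a quick sanity check on Lemma 1.2, the likelihood-ratio envelope of Lemma 1.1 for $\mathcal{R}(P,\epsilon,q)$ should match that of the rescaled model $q'\cdot\mathcal{R}(P,\epsilon',1)$. The minimal sketch below verifies the two identities $q'(1-\epsilon') = q(1-\epsilon)$ and $q' = q(1-\epsilon)+\epsilon$ numerically; it is a check of the arithmetic, not a proof.

```python
import numpy as np

def reparameterize(eps, q):
    """Lemma 1.2 rescaling from R(P, eps, q) to R(P, eps', 1)."""
    q_prime = eps + q * (1.0 - eps)
    eps_prime = eps / q_prime
    return eps_prime, q_prime

# Lemma 1.1 envelope for R(P, eps, q): q(1-eps) <= dQ/dP <= q(1-eps) + eps.
# For Q = q' Q' with Q' in R(P, eps', 1), the envelope is [q'(1-eps'), q'].
for eps, q in [(0.1, 0.5), (0.3, 0.9), (0.7, 0.2)]:
    eps_p, q_p = reparameterize(eps, q)
    lower, upper = q * (1 - eps), q * (1 - eps) + eps
    assert np.isclose(q_p * (1 - eps_p), lower)   # lower envelopes agree
    assert np.isclose(q_p, upper)                 # upper envelopes agree
print("Lemma 1.2 envelope identities verified on test cases.")
```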
1.1 Contributions

We consider three particular estimation problems and overview our main results in each.

Mean estimation. We consider the family of base distributions $\{P_\theta\}_{\theta\in\mathbb{R}^d}$, where $P_\theta = N(\theta, \sigma^2 I_d)$ with known $\sigma > 0$. We show that in the realizable contamination model (2), any estimator $\hat\theta$ of $\theta$ from $n$ observations which succeeds with probability at least $1-\delta$ must satisfy
$$\big\|\hat\theta - \theta\big\|_2 \gtrsim \sigma\cdot\frac{\log\left(\frac{1}{1-\epsilon}\right)}{\sqrt{\log\left(1 + \frac{\epsilon^2}{1-\epsilon}\cdot\frac{n}{d+\log(1/\delta)}\right)}},$$
and we provide an estimator which achieves this error. See Theorem 3.1 for a precise statement. We note that in terms of sample complexity, if $\delta, \sigma, q, \epsilon$ are fixed constants, then this implies that to reach error $\|\hat\theta-\theta\|_2 \le \rho$, we require (roughly) $n \gtrsim d e^{1/\rho^2}$ samples.

Turning to computationally-efficient rates, we show that, for $\epsilon$ a small enough constant, a polynomial-time algorithm based on the sum-of-squares method achieves error $\rho$ as long as (roughly) $n \gtrsim d^{1/\rho^2}$ observations are available, where we hide multiplicative factors depending on $\rho$. See Theorem 4.5 for a precise statement. We complement this, in Theorem 4.7, with a nearly-matching computational lower bound against statistical query (SQ) algorithms, low-degree polynomial tests, the sum-of-squares hierarchy, and polynomial threshold functions. For the remainder of this section, we focus on statistical query algorithms for simplicity. Together, these results provide evidence towards a large statistical–computational gap. We summarize this state of affairs in Figure 2.

[Figure 2: Sample complexity phase diagram for mean estimation with $\epsilon$ a fixed constant; the sample-size axis $n$ is partitioned into an "impossible" regime, an "SQ hard" regime starting at $n \asymp d e^{1/\rho^2}$, and an "easy" regime starting at $n \asymp d^{1/\rho^2}$. In order to achieve $\ell_2$ norm error $\rho$, it is information-theoretically necessary and sufficient to take $n \asymp d e^{1/\rho^2}$ many samples. On the other hand, any statistical query algorithm must take (roughly) $d^{1/\rho^2}$ many samples, and a polynomial-time algorithm (nearly) saturates this lower bound.]

Covariance estimation. We show that a parallel set of claims holds in the covariance estimation setting. In particular, considering the family $\mathcal{P}_{\mathrm{cov}}$, where $P_\Sigma = N(0,\Sigma)$ for each $\Sigma\in\mathcal{C}^d_{++}$, we show that any estimator $\hat\Sigma$ that succeeds with probability at least $1-\delta$ must satisfy
$$\left\|\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2} - I_d\right\|_{\mathrm{op}} \gtrsim \sqrt{\frac{d+\log(1/\delta)}{n(1-\epsilon)}} + \frac{\log\left(\frac{1}{1-\epsilon}\right)}{\log\left(1 + \frac{\epsilon n}{d+\log(1/\delta)}\right)},$$
and that a computationally-inefficient algorithm achieves nearly the same rate. See Theorem 3.2 for an exact statement. On the other hand, we show that for $\epsilon$ a small enough constant, in order to achieve error $\rho$ in relative operator norm, roughly $n \gtrsim d^{1/\rho}$ samples are sufficient for a polynomial-time sum-of-squares based estimator to succeed (Theorem 4.10), and that nearly the same sample complexity is required of any statistical query algorithm (Theorem 4.12).

Linear regression. In the setting of linear regression with missing observations, in which the missingness may depend arbitrarily on both the response $Y$ and the covariate $X$, we show that a large statistical–computational gap does not persist. In particular, we consider the class $\mathcal{P}_{\mathrm{LR}}(\sigma^2)$, which consists of covariate–response pairs $(X,Y)$ distributed as $X\sim N(0, I_d)$ and $Y\mid X\sim N(X^\top\theta, \sigma^2)$. In this setting, we show that any estimator $\hat\theta$ of the coefficients $\theta$ which succeeds with probability at least $1-\delta$ must incur error at least
$$\big\|\hat\theta-\theta\big\|_2 \gtrsim \sigma\cdot\frac{\log(1/(1-\epsilon))}{\sqrt{\log\left(1+\frac{\epsilon^2}{1-\epsilon}\cdot\frac{n}{d+\log(1/\delta)}\right)}},$$
and that there exists a simple polynomial-time estimator based on minimizing a strongly convex empirical risk which achieves the nearly matching error
$$\big\|\hat\theta-\theta\big\|_2 \lesssim \sigma\cdot\frac{\epsilon}{1-\epsilon}\cdot\frac{\log\log\left(\frac{n(1-\epsilon)}{d+\log(1/\delta)}\right)}{\sqrt{\log\left(1+\frac{\epsilon^2}{1-\epsilon}\cdot\frac{n}{d+\log(1/\delta)}\right)}}.$$
Observe that for $\epsilon$ less than a sufficiently small constant, the upper and lower bounds differ only by a mild $\log\log$ term. See Theorem 5.1 for details.

1.2 Further related work

The model and methods considered in this paper lie at the intersection of the study of missing data and robust estimation; here we describe the relation to the most related results and models in each of these fields.
Missing data. Missing data is a mature field and we refer the interested reader to the book [LR14] for many of the classical models and results. The realizable contamination model which we study in this manuscript was recently introduced by [MVBWS24], although we note that similar models have been proposed in the literature in restricted settings by [HM95; BK22]. As mentioned in the discussion following the sandwich relation in (3), in the all-or-nothing setting, there is a direct connection between realizable contamination and biased sampling [SLW22; AL13] and sensitivity analysis in causal inference [Ros87].

An advantage of the realizable contamination model is that it does not preclude consistent estimation even when a constant fraction of the observations are contaminated. We note that constrained forms of MNAR which similarly allow identification of the population parameters have been considered in the literature; we direct the interested reader to, e.g., [MST22; MPT13; RRS98], and the references therein for a small subset of this literature. We emphasize that the realizable contamination model encapsulates all of these modeling choices when $\epsilon = 1$, as it places no assumptions on the contamination component other than that it masks the original data. On the other hand, we note that stronger forms of contamination (which can mask and alter the base distribution) have been considered in the missing data literature. For instance, [HR21] consider an analog of the strong contamination model [DK23] and derive nearly optimal mean estimators in this setting. In the setting of linear regression, [DDKLP25] consider adversarial contamination and provide optimal and computationally-efficient algorithms for estimating the regression coefficients. We emphasize that a key distinction between these stronger forms of contamination and the realizable setting considered here is that our restricted model often enables identification in settings where the strong contamination model does not.

Our focus is on computationally-efficient methods. This has been a focus of the literature on truncated statistics over the last several years (as we discuss in the next paragraph). We remark here that even in the more benign MCAR setting, missingness can induce computational difficulties. For instance, [LW12] consider sparse linear regression with MCAR covariates and provide a computationally-efficient algorithm for estimation despite nonconvexity of a natural plug-in estimator. Moving beyond MCAR, there has been a flurry of work on the computational aspects of estimation from truncated statistics, which we detail next.

Truncated statistics. Estimation from truncated samples is a classical problem of estimating a distribution, or its parameters, when given i.i.d. samples conditioned on falling in an observation set $S$. Recent literature [DGTZ18] revisited this classical problem, under the assumption of a Gaussian base distribution, with an eye towards computationally-efficient algorithms. These authors show that in many learning settings, when given oracle access to the truncation set $S$, computationally-efficient estimation of population parameters is possible; by contrast, the same authors show that without such oracle access, it is information-theoretically impossible to perform estimation without further assumptions on the set $S$.
Motivated by this, subsequent work [KTZ19] further explored the setting of an unknown truncation set $S$, under the additional assumption that $S$ has additional regularity, such as belonging to a class of bounded VC dimension or Gaussian surface area. For instance, these authors show that if the truncation set $S$ belongs to a collection $\mathcal{C}$, then it is possible to estimate the underlying mean with a sample complexity scaling as $\tilde O(\mathrm{VC}(\mathcal{C})/\rho + d^2/\rho^2)$, albeit with an inefficient algorithm. Following this, [DKPZ24] show that there exist truncation sets with small VC dimension and Gaussian surface area such that a superpolynomial sample complexity is required for efficient algorithms in the statistical query model.

The setting with unknown truncation is more closely related to the all-or-nothing setting considered in our paper, where we can understand the contamination set as the set of all possible truncated distributions, that is, without any structural assumptions on $S$. Given the similarity between the two settings, it is worthwhile to open a small parenthesis to discuss the differences, which are crucial and lead to different types of algorithms. In the truncated statistics problem, the assumptions imposed on $S$ permit truncating the entirety of the tails, so the tail behaviour within the model cannot be used to estimate the distribution. By contrast, in our MNAR setting, we allow the likelihood ratio $\frac{dQ}{dP}$ to vary arbitrarily within $[1-\epsilon, 1]$, independently across points. When $\epsilon$ is close to 1, this renders the bulk of the distribution unhelpful for estimation, as $Q$ can hide all the potential variation of $\frac{dP'}{dP}(z)$ for $P'$ in a neighborhood of $P$. However, when $P$ belongs to a known class of distributions with rapidly decaying tails (e.g., Gaussian), a small perturbation can lead to a large likelihood ratio in the tails. It is this feature of the tails that we exploit in our problems of interest. Thus, while estimation from truncated samples is achieved by focusing on the bulk of the distribution, estimation from MNAR samples is achieved by focusing on the tails of the distribution. Despite this contrast, the computationally-efficient sample complexities of mean and covariance estimation are similar in the two settings. One major difference is that in linear regression, we obtain a computationally-efficient algorithm that requires a number of samples only linear in $d$, while in the truncated-sample setting, existing algorithms either incur $d^{O(1/\rho^2)}$ samples [LMZ24] or require truncation which can only depend on the response [KMKC26].

Robust estimation. For a thorough background on robust estimation, we refer the reader to the books [HR09] (for its historical treatment) and [DK23] (for its algorithmic aspects). The prototypical contamination model considered in robust statistics is the Huber contamination model, where the statistician observes $Q = (1-\epsilon_{\mathrm{Huber}})P + \epsilon_{\mathrm{Huber}}R$, where $R$ is an arbitrary distribution and $\epsilon_{\mathrm{Huber}} \in [0, 1/2)$ is the contamination rate. Even when $P$ belongs to a nice family of distributions such as Gaussians, the Huber contamination model is strong enough to preclude consistent estimation of parameters such as the mean, the covariance, and linear regression coefficients.
Furthermore, realizable contamination can be seen as a special case of Huber contamination: the first inequality in (3) implies that the conditional distribution $Q_R$ of $Q \in \mathcal{R}(P,\epsilon,1)$ is a valid Huber contamination as long as $\epsilon_{\mathrm{Huber}} \ge \epsilon$. However, realizable contamination contains additional structural information (the second inequality in (3)) that leads to consistency. Our work can also be seen as part of a broader research agenda that studies practical restrictions of the Huber contamination model which lead to improved statistical rates. We discuss some related works below.

In the context of mean estimation, multiple recent works have studied the mean-shift contamination model [Li23; KG25; DIKP25; KKLZ26; DIKL26], which places a different kind of constraint on the contamination distribution. In the basic setting, one observes samples from $(1-\epsilon)N(\mu, I_d) + \epsilon\, R * N(0, I_d)$, where $R$ is an arbitrary distribution and $*$ denotes the convolution operator. These works have shown that consistent mean estimation is possible in this model: for any constant $\epsilon\in(0,1)$, the sample complexity to get error $\rho$ is roughly $d/\rho^2 + e^{\tilde\Theta(1/\rho^2)}$, and furthermore there is no substantial statistical–computational gap. While this rate is similar to that of realizable contamination for $d = 1$, the statistical sample complexity for realizable contamination is much higher for $d > 1$, scaling as $d e^{\Theta(1/\rho^2)}$ (and the computational sample complexity is significantly higher, scaling as $d^{\Theta(1/\rho^2)}$).

In the context of linear regression, the mean-shift contamination model has the following analog:
$$X \sim N(0, I_d), \quad \text{and} \quad y \mid X \sim (1-\epsilon)\cdot N(X^\top\theta_\star, \sigma^2) + \epsilon\cdot R_X * D_{X^\top\theta_\star}, \tag{4}$$
where $R_X$ is a univariate distribution that may depend on the covariate $X$. When the contaminating distribution $R_X$ is oblivious to $X$, i.e., $R_X = R$, this is termed the oblivious contamination model. For the oblivious contamination model, multiple works have shown that consistent estimation is possible [TJSO14; JTK14; BJK15; BJKK17; SBRJ19; dNS21], and there exist (at most quadratic) statistical–computational gaps [DGKLP25]. When the distribution $R_X$ is allowed to depend on $X$, it is termed adaptive contamination. The forthcoming work [DGKPX26] studies this adaptive contamination model from both the computational and statistical perspectives. While (4) can be seen as a more general contamination model than realizable contamination with missing responses (see Eq. (11)), the statistical rates surprisingly coincide; this follows by combining our lower bounds (which build on their work) and their upper bounds. By contrast, the computational rates differ widely, pointing to the computational benefits of realizable contamination over adaptive contamination. To elaborate, our work gives a computationally-efficient algorithm with nearly-matching rate for realizable contamination, whereas [DGKPX26] establishes a statistical query lower bound of $d^{1/\rho^2}$ for adaptive contamination, for any constant $\epsilon$. While their SQ lower bound of $d^{1/\rho^2}$ for (4) looks similar to our SQ lower bound for realizable mean estimation, it is unclear if there is a deeper connection in terms of a formal reduction.

Notation

We let $\mathbb{R}$, $\mathbb{R}_+$, $\mathbb{R}_{++}$ denote the set of real numbers, non-negative real numbers, and positive real numbers, respectively. We let $\mathbb{N}$ denote the set of natural numbers.
We let $S^{d-1} = \{u\in\mathbb{R}^d : \|u\|_2 = 1\}$ denote the unit sphere. We let $\mathcal{C}^d_+ = \{X\in\mathbb{R}^{d\times d} : X = X^\top \text{ and } X\succeq 0\}$ and $\mathcal{C}^d_{++} = \{X\in\mathbb{R}^{d\times d} : X = X^\top \text{ and } X\succ 0\}$ denote the cones of positive semidefinite and positive definite matrices, respectively. We use $\mathbb{1}_A$ and $\mathbb{1}\{A\}$ interchangeably to denote the indicator function of the set $A$. We let $\mathbb{R}_\star = \mathbb{R}\cup\{\star\}$ denote an extended space where the value $\star$ denotes a missing entry, and let $\mathbb{R}^k_\star = \mathbb{R}_\star\times\cdots\times\mathbb{R}_\star$ ($k$ times). We say that the support of a binary vector $\omega\in\{0,1\}^k$, denoted by $\mathrm{supp}(\omega)$, is the set of coordinates on which $\omega$ takes the value 1. We define the operation $\star : \mathbb{R}^k\times\{0,1\}^k \to \mathbb{R}^k_\star$, where the $j$-th component of $x\star\omega$ is defined by
$$(x\star\omega)_j := \begin{cases} x_j & \text{if } \omega_j = 1, \\ \star & \text{if } \omega_j = 0, \end{cases}$$
for $j\in[k]$. We use $S \pm T$ between multisets $S$ and $T$ to denote, respectively, multiset union and difference. For a multiset $S$, we use $\mathbb{E}_{x\sim S}[f(x)]$ to denote the expectation of $f$ with respect to the empirical distribution of $S$. For $n\in\mathbb{N}$, we use the notation $f(n)\lesssim g(n)$ (or $f(n) = O(g(n))$) to mean $|f(n)| \le C|g(n)|$ for some universal positive constant $C$. Analogously, we use the notation $f(n)\gtrsim g(n)$ (or $f(n) = \Omega(g(n))$) to mean $|f(n)| \ge c|g(n)|$ for some universal positive constant $c$. If $f(n)\lesssim g(n)$ and $f(n)\gtrsim g(n)$, we denote this relationship by $f(n)\asymp g(n)$ (or $f(n) = \Theta(g(n))$). We use $\tilde O$, $\tilde\Omega$, and $\tilde\Theta$ to hide terms that are poly-logarithmic in the parameters. We let $N(\mu,\Sigma)$ denote the Gaussian distribution with mean $\mu$ and covariance $\Sigma$, and let $\phi_d(\cdot\,;\mu,\Sigma)$ denote its density. We omit the subscript from $\phi_d$ when $d = 1$. We often omit the parameter $\Sigma$ from $\phi_d$ when it is the identity matrix and omit $\mu$ when it is the zero vector, so that $\phi_d(\cdot)$ denotes the density of the standard $d$-dimensional Gaussian. We let $\chi^2_d$ denote the chi-squared distribution with $d$ degrees of freedom.

2 Background

In this section, we provide background on some of the tools we use throughout the paper.

2.1 Sum-of-squares algorithms

We briefly review sum-of-squares proofs and pseudo-expectations, referring to [BS16; FKP19] for a more complete overview. Throughout, we will refer to a polynomial as a sum of squares if it is equal to a sum of squared polynomials. We use $\mathbb{R}[X]$ to denote the set of polynomials over the indeterminates/variables $X = (X_1,\ldots,X_N)$ and $\mathbb{R}[X]_{\le t}$ to denote those polynomials with degree at most $t$. For $p\in\mathbb{R}[X]$, we use $\|p\|_2$ to denote the Euclidean norm of the associated coefficient vector (in the monomial basis). Consider a system of polynomial inequalities $\mathcal{A} = \{q_i(X)\ge 0\}_{i=1}^m$ over the $N$-dimensional indeterminates $X = (X_1,\ldots,X_N)$.

Definition 2.1 (Sum-of-squares proofs). Let $p, p'\in\mathbb{R}[X]$. The inequality $p(X)\ge p'(X)$ has a sum-of-squares proof given $\mathcal{A}$ if there exist sum-of-squares polynomials $p_S\in\mathbb{R}[X]$ for $S\subseteq[m]$ satisfying
$$p(X) = p'(X) + \sum_{S\subseteq[m]} p_S(X)\,\Pi_{i\in S}\,q_i(X),$$
and we say the proof has degree $t$ if each summand has degree at most $t$; we denote this by $\mathcal{A} \vdash^X_t p \ge p'$ (we may omit the variable when it is clear from the context).

Definition 2.2 (Pseudo-expectations). A degree-$t$ pseudo-expectation $\tilde{\mathbb{E}} : \mathbb{R}[X]_{\le t}\to\mathbb{R}$ is a linear map satisfying $\tilde{\mathbb{E}}[1] = 1$ and $\tilde{\mathbb{E}}[p(X)^2]\ge 0$ for any $p\in\mathbb{R}[X]_{\le t/2}$.
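For intuition, the condition $\tilde{\mathbb{E}}[p(X)^2]\ge 0$ for all $p$ of degree at most $t/2$ is equivalent to positive semidefiniteness of the moment matrix $(\tilde{\mathbb{E}}[m_i m_j])_{i,j}$ indexed by monomials of degree at most $t/2$. The following is a minimal numerical sketch of this check for a degree-2 pseudo-expectation in one variable; the helper name is ours.

```python
import numpy as np

def is_degree2_pseudo_expectation(m1, m2):
    """Check whether the linear map with E~[1]=1, E~[X]=m1, E~[X^2]=m2 is a
    degree-2 pseudo-expectation: E~[(a + bX)^2] = [a b] M [a b]^T >= 0 for all
    a, b, which holds iff the moment matrix M is positive semidefinite."""
    M = np.array([[1.0, m1],
                  [m1, m2]])
    return bool(np.all(np.linalg.eigvalsh(M) >= -1e-12))

# An honest expectation (e.g., X ~ N(0,1): m1 = 0, m2 = 1) always qualifies ...
print(is_degree2_pseudo_expectation(0.0, 1.0))   # True
# ... while "moments" violating E~[X^2] >= E~[X]^2 do not.
print(is_degree2_pseudo_expectation(1.0, 0.5))   # False
```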
Given a set of polynomial inequalities $\mathcal{A}$, we say a degree-$t$ pseudo-expectation $\tilde{\mathbb{E}}$ satisfies $\mathcal{A}$ if $\tilde{\mathbb{E}}[p(X)]\ge 0$ for all $p\in\mathbb{R}[X]_{\le t}$ such that $p(X) = s(X)^2\,\Pi_{i\in S}\,q_i(X)$ for some $S\subseteq[m]$ and $s\in\mathbb{R}[X]$. Under the same conditions, we say $\tilde{\mathbb{E}}$ approximately satisfies $\mathcal{A}$ if $\tilde{\mathbb{E}}[p(X)] \ge -2^{-N^{\Omega(t)}}\|s\|_2\,\Pi_{i\in S}\|q_i\|_2$.

Under mild conditions on the bit complexity of the constraint set and boundedness of the feasible region, we can efficiently find an approximately satisfying pseudo-expectation. Moreover, this pseudo-expectation will approximately satisfy all polynomial inequalities that have a sum-of-squares proof (of sufficiently small degree), again under mild conditions on the bit complexity of the proof. These conditions hold for all the problems we consider in this paper with negligible approximation error, so for simplicity we take the following assumptions as facts throughout.

Assumption 2.3. We can compute a degree-$d$ pseudo-expectation satisfying $\mathcal{A}$ (or refute its satisfiability) in time $(m+N)^{O(d)}$.

Assumption 2.4. Let $\mathcal{A} \vdash_{t'} p \ge 0$ for $p\in\mathbb{R}[X]$ and let $\tilde{\mathbb{E}}$ be a degree-$t$ pseudo-expectation that satisfies $\mathcal{A}$. Let $h\in\mathbb{R}[X]$ be a sum of squares with $\deg(h) + t' \le t$. Then $\tilde{\mathbb{E}}[h\cdot p]\ge 0$.

Finally, we will use $x\in(1\pm y)z$ as shorthand to denote the intersection of the constraints $x\le(1+y)z$ and $x\ge(1-y)z$. We will sometimes consider sum-of-squares proofs involving matrix-valued variables. For a matrix-valued polynomial $A$, we use the constraint $A\preceq B$ to denote $A = B - CC^\top$, where $C$ is an implicit dummy matrix-valued indeterminate (and likewise for $A\succeq B$).

Quantifier elimination. Let $Z = (Z_1,\ldots,Z_d)$ be indeterminates. Let $\mathcal{A}'$ be a set of $m'$ polynomial inequalities over $Z$. We will frequently work with a set of constraints $\mathcal{A}$ over indeterminates $X$ that includes the following constraint on the variable $X$:
$$\mathcal{A}' \vdash^Z_t p(X, Z) \ge 0, \tag{5}$$
which may be read as: there exists a sum-of-squares proof, under the axioms $\mathcal{A}'$ in the indeterminates $Z$, that $p(X,Z)\ge 0$. For example, we will routinely use constraints of the following form: $\{\sum_i Z_i^2 = 1\} \vdash^Z_t \sum_{j=1}^n \langle Z, y - X\rangle^{2t} \le B$. Even though (5) does not seem to be representable as $\mathcal{A} = \{q_i(X)\ge 0\}_{i=0}^m$ for a small $m$, it turns out that it can be compactly represented. In particular, if $p$ is a polynomial of degree at most $t$ and $Z$ has dimension $d$, then we can encode it as $\mathcal{A}$ over indeterminates $X$ and $X' = (X'_1,\ldots,X'_{N'})$, where $N' \le (O(N+d+m'))^{O(t)}$ and $|\mathcal{A}| = (O(N'+m+d))^{O(t)}$. We refer the reader to [KS17] for further details (see also [DKKPP22, Appendix A.2]).

2.2 SQ lower bounds

In this section, we give a brief overview of the statistical query (SQ) framework [Kea98; FGRVX17]. In the sequel, we show that our algorithms for mean and covariance estimation are qualitatively optimal in the SQ framework. While our focus is on SQ algorithms, our structural results also rule out polynomial-time algorithms from other popular families of algorithms: sum-of-squares hierarchies [DKPP24], low-degree polynomial tests [BBHLS21], and polynomial threshold functions [DKLP25]. An SQ algorithm does not take in samples from a distribution $Q$; instead, it interacts with the underlying distribution $Q$ through an oracle of the following form:

Definition 2.5 (STAT oracle). Let $P$ be a distribution on the domain $\mathcal{X}$. A statistical query is a bounded (measurable) function $f : \mathcal{X}\to[-1,1]$. For a tolerance $\kappa\in(0,1)$, the $\mathrm{STAT}_{P,\kappa}$ oracle responds to the query $f$ with a value $v = \mathrm{STAT}_{P,\kappa}(f)$ such that $|v - \mathbb{E}_{X\sim P}[f(X)]| \le \kappa$. We call $\kappa$ the tolerance of the SQ oracle.
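To fix ideas, here is a minimal simulation of a STAT oracle (our own illustrative construction, not part of the paper): it answers each query with the population expectation perturbed by an amount within the tolerance, here simulated by uniform noise, with the expectation approximated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_stat_oracle(sample_P, kappa, n_mc=500_000):
    """Simulate STAT_{P,kappa}: answer a query f: X -> [-1, 1] with a value v
    satisfying |v - E_P[f(X)]| <= kappa. The within-tolerance perturbation is
    simulated with uniform noise; E_P is approximated by Monte Carlo."""
    X = sample_P(n_mc)
    def oracle(f):
        v = np.mean(f(X))
        return float(np.clip(v + rng.uniform(-kappa, kappa), -1.0, 1.0))
    return oracle

# Query a bounded statistic of N(0, 1) with tolerance 0.01.
oracle = make_stat_oracle(lambda n: rng.standard_normal(n), kappa=0.01)
print(oracle(np.tanh))   # close to E[tanh(X)] = 0, up to tolerance
```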
Many popular algorithms for statistical estimation use samples only to approximate $\mathbb{E}_P[f_1(X)],\ldots,\mathbb{E}_P[f_m(X)]$ for a sequence of (adaptively chosen) functions $f_i$. An SQ algorithm could directly obtain these approximations using the $\mathrm{STAT}_{P,\kappa}$ oracle above, up to error $\kappa$. Observe that $\kappa$ here corresponds to the sampling error, and thus $\mathrm{poly}(1/\kappa)$ is interpreted as the "effective sample complexity" of the SQ algorithm. Our focus will be on showing the hardness of testing problems; standard arguments imply that the corresponding estimation task is no easier (either statistically or computationally). We will consider a testing problem of the following kind:

Definition 2.6 (Generic testing problem). Let $P_{\mathrm{null}}$ be a distribution and let $\mathcal{A}$ be a set of distributions. Suppose the tester has access (either through i.i.d. samples $(X_1,\ldots,X_n)\sim Q^{\otimes n}$ or an SQ oracle $\mathrm{STAT}_{Q,\kappa}$) to the distribution $Q$. Consider the following testing problem:
$$H_0 : Q = P_{\mathrm{null}} \quad\text{vs.}\quad H_1 : Q\in\mathcal{A}.$$
We say an SQ algorithm solves the testing problem above with $q$ queries and tolerance $\kappa$ if it interacts with a $\mathrm{STAT}_{Q,\kappa}$ oracle by making at most $q$ adaptive queries and outputs a decision $\hat\phi$ such that, with probability at least $2/3$, $\hat\phi = \mathbb{1}\{Q\in\mathcal{A}\}$.

An SQ hardness result is parameterized by two parameters, a query parameter $q_0$ and a tolerance parameter $\kappa_0$, and is of the following flavor: any SQ algorithm that makes fewer than $q_0$ queries and successfully solves the inference task of interest must use at least a single query with tolerance less than $\kappa_0$. This is interpreted as an information–computation gap stating that every successful SQ algorithm must use either $\Omega(q_0)$ "runtime" or $\mathrm{poly}(1/\kappa_0)$ "samples". We refer the reader to [DK23, Chapter 8] for further details.

The influential work of [DKS17] proved SQ hardness results for a generic hypothesis testing problem known as non-Gaussian component analysis (NGCA). NGCA asks us to distinguish whether $Q = N(0, I)$ or whether $Q$ has a non-Gaussian component distribution $A$ in a hidden direction, defined below.

Definition 2.7 (High-dimensional hidden direction distribution). Let $A$ be a univariate distribution on $\mathbb{R}$; we use $a(x)$ for the density of $A$ at $x$. For a unit vector $v$, we denote by $P_{A,v}$ the distribution with the density $P_{A,v}(x) := a(v^\top x)\,\phi_{v^\perp}(x)$, where $\phi_{v^\perp}(x) := \exp\left(-\|x - (v^\top x)v\|_2^2/2\right)/(2\pi)^{(d-1)/2}$; i.e., the distribution that coincides with $A$ in the direction $v$ and is an independent standard Gaussian in every orthogonal direction.

Definition 2.8 (NGCA). Let $A$ be a given univariate distribution. Consider the distribution testing problem (see Definition 2.6) given access to a distribution $Q$ over $\mathbb{R}^d$:
$$H_0 : Q = N(0, I_d) \quad\text{vs.}\quad H_1 : Q \in \{P_{A,v} \mid v\in S^{d-1}\}.$$
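Sampling from the hidden-direction distribution $P_{A,v}$ of Definition 2.7 is straightforward: draw a standard Gaussian, then replace its component along $v$ with an independent draw from $A$. The sketch below does this for an illustrative choice of $A$ (a symmetric two-point mixture, our choice purely for demonstration).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hidden_direction(n, v, sample_A):
    """Draw n samples from P_{A,v}: the marginal along the unit vector v is A,
    and every orthogonal direction is an independent standard Gaussian."""
    d = v.shape[0]
    X = rng.standard_normal((n, d))
    X -= np.outer(X @ v, v)            # remove the Gaussian component along v
    X += np.outer(sample_A(n), v)      # substitute the component drawn from A
    return X

d = 10
v = np.ones(d) / np.sqrt(d)
# Illustrative non-Gaussian A: a symmetric two-point mixture at +/- 1.
two_point = lambda n: rng.choice([-1.0, 1.0], size=n)
X = sample_hidden_direction(100_000, v, two_point)
u = np.zeros(d); u[0] = 1.0
u -= (u @ v) * v; u /= np.linalg.norm(u)            # a direction orthogonal to v
print("variance along v:", np.var(X @ v))           # ~1, matching A's second moment
print("variance along u:", np.var(X @ u))           # ~1, standard Gaussian
```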
[DKS17] established the following SQ lower bound, showing that if $A$ matches $m$ moments with $N(0,1)$, then any SQ algorithm either needs exponentially many queries ("runtime") or an inverse tolerance ("samples") of at least $d^{\Omega(m)}$.

Lemma 2.9 (SQ lower bounds for NGCA [DKS17]). Suppose $A$ matches $m$ moments with $N(0,1)$ and $\chi^2(A, N(0,1)) < \infty$. Then any SQ algorithm that solves the testing problem in Definition 2.8 must use either
- a number of queries at least $\frac{2^{\Omega(d^{\Omega(1)})}}{d^{O(m)}}$, or
- a tolerance less than $\kappa \lesssim \frac{\sqrt{\chi^2(A, N(0,1))}}{d^{\Omega(m)}}$.

In particular, if $m\gtrsim 1$, $d\gtrsim(m\log d)^{\Omega(1)}$, and $\chi^2(A, N(0,1))\lesssim m$, then any successful SQ algorithm either uses $2^{d^{\Omega(1)}}$ many queries or a tolerance of at most $d^{-\Omega(m)}$.

Computational lower bounds for the NGCA problem under the same moment-matching condition on $A$ have also been established for sum-of-squares hierarchies [DKPP24], low-degree polynomial tests [BBHLS21], and polynomial threshold functions [DKLP25]. Since our computational lower bounds for mean and covariance estimation under realizable contamination are proved using the NGCA framework, we immediately obtain lower bounds for these families of algorithms as well.

2.3 Minimax lower bounds

Consider a parameter space $\Theta$ and an observation space $\mathcal{X}$. Let $L : \Theta\times\Theta\to\mathbb{R}_+$ denote a metric on the parameter space. For each $\theta\in\Theta$, let $P_\theta\in\mathcal{P}(\mathcal{X})$ denote a probability distribution on the space of observations $\mathcal{X}$. We define the $1-\delta$ quantile as
$$\mathrm{Quantile}_{1-\delta, P_\theta, L}(\hat\theta, \theta) := \inf\Big\{ r\ge 0 \;:\; \mathbb{P}_{P_\theta}\big( L(\hat\theta, \theta)\le r \big) \ge 1-\delta \Big\}.$$
Our information-theoretic lower bounds will be stated in terms of the minimax $1-\delta$ quantile, defined as
$$\mathfrak{M}(\delta, \mathcal{P}_\Theta, L) := \inf_{\hat\theta\in\hat\Theta}\,\sup_{\theta\in\Theta}\,\sup_{P_\theta\in\mathcal{P}_\theta} \mathrm{Quantile}_{1-\delta, P_\theta, L}(\hat\theta, \theta). \tag{6}$$
We will commonly consider the setting in which $n$ samples are drawn i.i.d. from an underlying distribution and use the shorthand $\mathfrak{M}_n(\delta, \mathcal{P}_\Theta, L)$ accordingly. Our lower bounds will use the following variant of Fano's inequality, the main statement and proof of which can be found in several standard statistics textbooks [Wai19; Duc25; Tsy09].

Lemma 2.10 (Fano's inequality). Suppose that $H\in\mathcal{P}(\mathcal{X})$ and $\theta_1,\ldots,\theta_M \subseteq \Theta$ satisfy
1. (Loss separation) For all $\ell \ne k\in[M]$ and all $\theta\in\Theta$, $\max\{L(\theta, \theta_k), L(\theta, \theta_\ell)\}\ge\gamma$;
2. (Bounded KL) $\frac{1}{M}\sum_{\ell\in[M]}\mathrm{KL}(P_{\theta_\ell}, H) + \log(2 - M^{-1}) \le \frac{1}{2}\log(M)$.
Then, for all $\delta\le 1/2$, it holds that $\mathfrak{M}(\delta, \mathcal{P}_\Theta, L)\ge\gamma$.

3 Information-theoretic limits of mean and covariance estimation

3.1 Algorithms via variational principles

Our information-theoretically optimal algorithms use a familiar technique from the robust statistics literature in which optimal univariate estimators are combined using a variational principle to obtain an optimal multivariate estimator. This idea dates back at least to the Tukey median and the notion of Tukey depth [Tuk75] and has been deployed in problems beyond mean estimation [Gao20] and using a variety of variational principles, such as the PAC-Bayes method [MZ25]. We next briefly describe the variational principle we use for mean estimation and for covariance estimation in turn.

Mean estimation. Let $\theta, \theta_\star\in\mathbb{R}^d$ and note that the $\ell_2$ norm error admits the variational characterization
$$\|\theta - \theta_\star\|_2 = \sup_{v\in S^{d-1}} \langle v, \theta-\theta_\star\rangle. \tag{7}$$
Suppose that for each direction $v\in S^{d-1}$, we had an estimator $\hat\theta_v$ of the mean $\langle v, \theta_\star\rangle$. In this case, we could define an estimator $\hat\theta$ to be any element
$$\hat\theta \in \arg\min_{\theta\in\mathbb{R}^d}\,\sup_{v\in S^{d-1}} \big| \langle v,\theta\rangle - \hat\theta_v \big|.$$
We then have the oracle inequality
$$\sup_{v\in S^{d-1}} \big|\langle v, \hat\theta\rangle - \hat\theta_v\big| \le \sup_{v\in S^{d-1}} \big|\langle v, \theta_\star\rangle - \hat\theta_v\big|,$$
from which we combine the characterization (7) with the triangle inequality to obtain
$$\big\|\hat\theta - \theta_\star\big\|_2 = \sup_{v\in S^{d-1}} \langle v, \hat\theta-\theta_\star\rangle \le \sup_{v\in S^{d-1}} \big|\langle v,\hat\theta\rangle - \hat\theta_v\big| + \sup_{v\in S^{d-1}} \big|\langle v, \theta_\star\rangle - \hat\theta_v\big| \le 2\sup_{v\in S^{d-1}} \big|\langle v,\theta_\star\rangle - \hat\theta_v\big|.$$
It thus follows that if each of the univariate estimators $\hat\theta_v$ is information-theoretically optimal, we may be able to combine these estimators into a multivariate estimator with similar statistical guarantees. In the setting of mean estimation, we note that the optimal univariate estimator is due to the forthcoming work [MVGS26]; for the reader's convenience, we provide an alternative method which achieves the same rate. ♣

Covariance estimation. For simplicity, we consider the operator norm in this paragraph, although our main guarantees follow a slightly different path, as we are interested in relative error. Following similar steps, let $\Sigma, \Sigma_\star\in\mathcal{C}^d_{++}$ and note that the operator norm admits the variational representation
$$\|\Sigma - \Sigma_\star\|_{\mathrm{op}} = \sup_{v\in S^{d-1}} \big|\langle v, (\Sigma-\Sigma_\star)v\rangle\big|.$$
As before, suppose we had an estimator $\hat\sigma^2_v$ of the variance $v^\top\Sigma_\star v$ in each direction $v\in S^{d-1}$. We then define an estimator $\hat\Sigma$ as any element
$$\hat\Sigma \in \arg\min_{\Sigma\in\mathcal{C}^d_{++}}\,\sup_{v\in S^{d-1}} \big|\langle v, \Sigma v\rangle - \hat\sigma^2_v\big|.$$
Combining the preceding two displays with the triangle inequality yields
$$\big\|\hat\Sigma - \Sigma_\star\big\|_{\mathrm{op}} = \sup_{v\in S^{d-1}} \big|\langle v, (\hat\Sigma-\Sigma_\star)v\rangle\big| \le \sup_{v\in S^{d-1}} \big|\langle v, \hat\Sigma v\rangle - \hat\sigma^2_v\big| + \sup_{v\in S^{d-1}} \big|\langle v, \Sigma_\star v\rangle - \hat\sigma^2_v\big| \le 2\sup_{v\in S^{d-1}} \big|\langle v,\Sigma_\star v\rangle - \hat\sigma^2_v\big|.$$
Hence, as in the mean estimation setting, this suggests that it suffices to find optimal univariate variance estimators in each direction and that these can be converted into covariance estimators. In the sequel, we show that a variant of the minimum Kolmogorov distance estimator introduced by [MVBWS24] yields a suitable univariate estimator in each direction. ♣

We emphasize that, as written, these estimators cannot be computed in finite time. A straightforward solution is to approximate the unit sphere by a suitable $\epsilon$-net and compute univariate estimators on each element of the net. While this strategy yields information-theoretically optimal estimators, we note that the size of these nets scales exponentially in the dimension and hence, in general, yields computationally-inefficient estimators.
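The following is a minimal sketch of this net-based construction for the mean, under our own simplifications: the net is random rather than a deterministic $\epsilon$-net, and the univariate routine is a plain median (a stand-in, not the optimal univariate estimator of [MVGS26]). Direction-wise estimates are formed along each net direction, and the multivariate estimate minimizes the worst-case discrepancy over the net.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def net_mean_estimator(X, n_directions=200):
    """Combine direction-wise univariate estimates over a random net of the
    sphere into a multivariate estimate, following the variational principle."""
    n, d = X.shape
    V = rng.standard_normal((n_directions, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)      # directions on the sphere
    t_hat = np.median(X @ V.T, axis=0)                 # univariate estimate per direction
    # theta_hat minimizes the worst discrepancy sup_v |<v, theta> - t_hat(v)|.
    objective = lambda theta: np.max(np.abs(V @ theta - t_hat))
    res = minimize(objective, x0=X.mean(axis=0), method="Nelder-Mead")
    return res.x

X = rng.standard_normal((2000, 3)) + np.array([1.0, -2.0, 0.5])
print(net_mean_estimator(X))   # close to the true mean (1, -2, 0.5)
```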
3.2 Mean estimation

We first consider information-theoretic rates for Gaussian mean estimation. We take the parameter space and loss to be
$$\mathcal{P}_{\mathrm{mean}}(\sigma^2,\epsilon) = \big\{ \mathcal{R}(N(\theta, \sigma^2 I_d), \epsilon) \mid \theta\in\mathbb{R}^d \big\} \quad\text{and}\quad L(\theta, \theta_\star) = \|\theta - \theta_\star\|_2.$$
In the theorem below, we give tight minimax rates for this problem. We provide the proof of the upper bound in Section A.1 and of the lower bound in Section C.2.1. As mentioned in the previous subsection, the upper bound is a corollary of the optimal univariate estimator of [MVGS26]; our main contribution is a matching lower bound in the high-dimensional and high-confidence setting.

Theorem 3.1. Let $\sigma^2 > 0$, $\epsilon < 1$, $\delta\le 0.5$, and $n\in\mathbb{N}$ be such that
$$n \gtrsim \frac{d+\log(1/\delta)}{1-\epsilon}.$$
The minimax $1-\delta$ quantile $\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{mean}}(\sigma^2,\epsilon), L)$ (6) satisfies
$$\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{mean}}(\sigma^2,\epsilon), L) \asymp \sigma\cdot\frac{\log\left(\frac{1}{1-\epsilon}\right)}{\sqrt{\log\left(1 + \frac{\epsilon^2}{1-\epsilon}\cdot\frac{n}{d+\log(1/\delta)}\right)}}.$$

A few remarks are in order. First, let us simplify the expression of the minimax rate in the regime of interest where $\delta$ is constant and $\epsilon\le 1-c$ for some small constant $c$. In this setting, the rate simplifies to
$$\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{mean}}(\sigma^2,\epsilon), L) \asymp \frac{\sigma\epsilon}{\sqrt{\log(1+\epsilon^2 n/d)}}.$$
Note that if $\epsilon\lesssim\sqrt{d/n}$, this recovers the parametric rate of estimation $\sigma\sqrt{d/n}$. On the other hand, if the contamination fraction $\epsilon$ is at least a constant, the bound further simplifies to $\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{mean}}(\sigma^2,\epsilon), L) \asymp \sigma/\sqrt{\log(n/d)}$. In turn, this implies that in terms of sample complexity, in order to reach error $\sigma\rho > 0$, it is both necessary and sufficient to take $n \asymp d e^{\Theta(1/\rho^2)}$ many samples.

Next, let us compare this rate to similar results in the literature, starting with [MVBWS24, Theorem 8], which considers the same setting. In particular, re-arranging our sample size condition, we require that $\delta \ge \exp\big(-\big(c\,n(1-\epsilon) - d\big)\big)$. By contrast, [MVBWS24] impose the condition that
$$\delta \ge \exp\left\{-\left(c\,\frac{\{n(1-\epsilon)\}^{31/36}}{\log(n(1-\epsilon))} - d\right)\right\},$$
which additionally imposes the sub-optimal sample complexity requirement $n\gtrsim d^{36/31}/(1-\epsilon)$. As we show, the correct sample complexity scales linearly in the dimension $d$.

Let us next briefly isolate the dependence on the failure probability $\delta$. Taking $d = 1$ and $\sigma, \epsilon$ to be constants, we see that
$$\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{mean}}(\sigma^2,\epsilon), L) \asymp \frac{1}{\sqrt{\log(n) - \log\log(1/\delta)}},$$
so that the dependence on the failure probability $\delta$ is doubly logarithmic in the denominator.

Finally, we comment briefly on the proof technique. As previously mentioned, our main contribution in Theorem 3.1 is the lower bound, and we restrict our commentary to this result. We note that it is typical in robust estimation problems to obtain an error which decomposes as the sum of two terms: one "clean" term corresponding to $\epsilon = 0$ and one capturing the dependence on $\epsilon$. Typically, the clean term interacts with the dimension $d$, the sample size $n$, and the confidence $\delta$, whereas the second term is a function solely of the contamination fraction $\epsilon$ [DKKLMS16]. From the perspective of lower bounds, these two terms can typically be captured by (i) an application of Fano's inequality or Assouad's method for the "clean" term and (ii) an application of Le Cam's two-point method for the contamination term. By contrast, the realizable contamination model considered here has a contamination term which depends nontrivially on the dimension $d$. In order to capture this dependence, we build on the proof technique of the forthcoming work of [DGKPX26] (who applied it in the context of linear regression): we construct a family of exponentially many hard distributions (using the high-dimensional hidden direction distributions of Definition 2.7) and conclude via Fano's inequality (Lemma 2.10).
3.3 Covariance estimation

We now consider information-theoretic rates for Gaussian covariance estimation in relative operator norm error. Specifically, we take the parameter space and loss to be
$$\mathcal{P}_{\mathrm{cov}}(\epsilon) = \big\{ \mathcal{R}(N(0,\Sigma), \epsilon) \mid \Sigma\in\mathcal{C}^d_{++} \big\} \quad\text{and}\quad L(\Sigma, \Sigma_\star) = \left\| \Sigma_\star^{-1/2}\Sigma\Sigma_\star^{-1/2} - I_d \right\|_{\mathrm{op}}.$$
In the theorem below, we give upper and lower bounds on the minimax rates. We provide the proof of the upper bound in Section A.2 and of the lower bound in Section C.2.2, which follow techniques similar to the proof of Theorem 3.1.

Theorem 3.2. Let $\epsilon < 1$, $\delta\le 0.5$, and $n\in\mathbb{N}$ be such that
$$n \gtrsim \frac{d\left(1+\log\left(\frac{1}{1-\epsilon}\right)\right) + \log(1/\delta)}{1-\epsilon}.$$
The minimax $1-\delta$ quantile (6) $\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L)$ satisfies
$$\sqrt{\frac{d+\log(1/\delta)}{n(1-\epsilon)}} + \frac{\log\left(\frac{1}{1-\epsilon}\right)}{\log\left(1+\frac{\epsilon n}{d+\log(1/\delta)}\right)} \;\lesssim\; \mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L) \;\lesssim\; \frac{\log\left(\frac{1}{1-\epsilon}\right)}{\log\left(1+\sqrt{\frac{\epsilon^2}{1-\epsilon}\cdot\frac{n}{d\left(1+\log\left(\frac{1}{1-\epsilon}\right)\right)+\log(1/\delta)}}\right)}.$$

We analyze the upper and lower bounds of Theorem 3.2 in a few different regimes. In order to ease notation, we let
$$\alpha = \sqrt{\frac{d+\log(1/\delta)}{n(1-\epsilon)}} \quad\text{and}\quad \beta = \log\left(\frac{1}{1-\epsilon}\right).$$
In the low-to-constant contamination regime $\epsilon\le\frac12$, we have $\beta\asymp\epsilon$, and so the bounds of Theorem 3.2 simplify to
$$\alpha + \frac{\epsilon}{\log(1+\epsilon/\alpha^2)} \;\lesssim\; \mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L) \;\lesssim\; \frac{\epsilon}{\log(1+\epsilon/\alpha)}.$$
When $\epsilon\lesssim\alpha$, both the upper and lower bounds yield $\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L)\asymp\alpha$, which we recognize as the parametric rate. When $\epsilon\gtrsim\alpha^{1-c}$ for some small constant $c$ (which contains the regime when $\epsilon$ is at least a small constant), both bounds agree and yield the rate $\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L)\asymp\epsilon/\log(1/\alpha)$. In the intermediate range between these two settings, where $\epsilon = \alpha^{1-\gamma}$ for $\gamma = o(1)$, our bounds read as
$$\frac{\epsilon}{\log(1/\alpha)} \;\lesssim\; \mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L) \;\lesssim\; \frac{\epsilon}{\gamma\log(1/\alpha)},$$
representing a gap on the order of $\frac{1}{\gamma}$. For example, when $\epsilon = \alpha\log(1/\alpha)$ and $\delta\gtrsim e^{-d}$, we observe a gap in our bounds on the order of $\frac{\log(1/\alpha)}{\log\log(1/\alpha)} \asymp \frac{\log(n/d)}{\log\log(n/d)}$.

Now we consider the large-contamination regime $\epsilon\ge\frac12$. The upper and lower bounds of Theorem 3.2 simplify to
$$\alpha + \frac{\beta}{\beta+\log(1/\alpha)} \;\lesssim\; \mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L) \;\lesssim\; \frac{\beta}{\beta - \log(1+\beta) + \log(1/\alpha)}.$$
Because $\beta$ is bounded away from zero in this regime, the upper and lower bounds simplify to $\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L) \asymp \frac{\beta}{\beta+\log(1/\alpha)}$.

Finally, note that in the nearly missing completely at random setting (i.e., when $\epsilon\lesssim\alpha$), the rates of estimation for mean estimation in $\ell_2$ norm and covariance estimation in operator norm coincide. On the other hand, when $\epsilon > \alpha$, the rate of estimation for covariance estimation is quadratically faster. In particular, if we fix the parameters $\epsilon$ and $\delta$ as constants, we see that
$$\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L) \asymp \frac{1}{\log(n/d)}, \quad\text{whereas}\quad \mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{mean}}(1,\epsilon), L) \asymp \frac{1}{\sqrt{\log(n/d)}}.$$
In terms of sample complexity, this implies that in order to achieve covariance norm error $\Theta(\rho)$, we require $n\gtrsim de^{1/\rho}$ many samples, as opposed to $de^{1/\rho^2}$ samples for mean estimation.

4 Computationally-efficient mean and covariance estimation

In the previous section, we established (nearly) tight statistical rates for mean and covariance estimation, but the algorithms we employed required exponential computation time. In this section, we study polynomial-time algorithms and the rates of estimation they can achieve. In Section 4.1, we preview the key idea underlying our approach in the simpler setting of hypothesis testing. Then, in Section 4.3, we provide an efficient algorithm for mean estimation and demonstrate that its sample complexity is optimal among various well-studied algorithmic classes. Finally, in Section 4.4, we provide a parallel story for covariance estimation.

4.1 Warm-up: Efficient testing via polynomials

In order to build intuition, let us begin with the univariate testing problem. In particular, we will try to test
$$H_0 : Q\in\mathcal{R}(N(0,1), \epsilon) \quad\text{vs.}\quad H_1 : Q\in\mathcal{R}(N(\rho,1), \epsilon)$$
for $\rho\in\mathbb{R}$. Note that the hardest distributions to test put the same amount of mass on $\{\star\}$, so it suffices to test between the conditional distributions $Q_R := \mathrm{Law}(X\mid X\in\mathbb{R})$.
The minimum distance estimators studied in Section 3 use fine-grained information from the distribution functions under each hypothesis. Here, we note that, at the population level, thresholding a simple polynomial yields a consistent test. Throughout the derivation, we will use the notation $\tau := \frac{\epsilon}{1-\epsilon}$. Indeed, from the discussion following Lemma 1.1, we see that
$$\frac{1}{1+\tau} = 1-\epsilon \le \frac{dQ_R}{dP}(z) \le 1 + \frac{\epsilon}{1-\epsilon} = 1+\tau \quad\text{for all } z\in\mathbb{R}. \tag{8}$$
At the population level, consider the family of testing functions, indexed by $k\in\mathbb{N}$, $\psi^{(k)} : \mathbb{R}\to\mathbb{R}$ defined as $\psi^{(k)}(X) = X^{2k}$. It is a straightforward consequence of the sandwich relation (8) that
$$\frac{1}{1+\tau}\cdot\mathbb{E}_P\big[\psi^{(k)}(X)\big] \le \mathbb{E}_{Q_R}\big[\psi^{(k)}(X)\big] \le (1+\tau)\cdot\mathbb{E}_P\big[\psi^{(k)}(X)\big].$$
Consequently, under $H_0$, we see that
$$\mathbb{E}_{Q_R}\big[\psi^{(k)}(X)\big] \le (1+\tau)\,\mathbb{E}_{X\sim N(0,1)}\big[\psi^{(k)}(X)\big] = (1+\tau)\,\mathbb{E}\big[G^{2k}\big] = (1+\tau)(2k-1)!!,$$
where we have used the notation $G\sim N(0,1)$. On the other hand, under $H_1$,
$$\mathbb{E}_{Q_R}\big[\psi^{(k)}(X)\big] \ge \frac{1}{1+\tau}\,\mathbb{E}_{X\sim N(\rho,1)}\big[\psi^{(k)}(X)\big] = \frac{1}{1+\tau}\,\mathbb{E}\big[(G+\rho)^{2k}\big] = \frac{1}{1+\tau}\sum_{\ell=0}^{2k}\binom{2k}{\ell}\rho^\ell\,\mathbb{E}\big[G^{2k-\ell}\big] = \frac{1}{1+\tau}\sum_{\ell=0}^{k}\binom{2k}{2\ell}\rho^{2\ell}\,\mathbb{E}\big[G^{2k-2\ell}\big] \ge \frac{1}{1+\tau}\cdot\left\{\mathbb{E}\big[G^{2k}\big] + \binom{2k}{2}\rho^2\,\mathbb{E}\big[G^{2k-2}\big]\right\} = \frac{1}{1+\tau}\cdot\left\{\mathbb{E}\big[G^{2k}\big] + k\rho^2(2k-1)!!\right\},$$
where in the final display we used the fact that $\mathbb{E}[G^{2k-2}] = (2k-3)!!$. Re-arranging, we deduce that any separation
$$\rho \ge \sqrt{\frac{\{(1+\tau)^2 - 1\}\cdot\mathbb{E}[G^{2k}]}{k(2k-1)!!}} = \sqrt{\frac{\tau^2+2\tau}{k}}$$
suffices to test between $H_0$ and $H_1$. In particular, the larger we take $k$ (the degree of the polynomial), the better our test performs.

Our main goal in this section is to develop computationally-efficient algorithms in higher dimension, and towards this goal, we can use a similar technique as in Section 3.1 to obtain a natural analog in higher dimension. In particular, we consider the testing function
$$\psi^{(k)}_d := \sup_{v\in S^{d-1}} \left\{ \mathbb{E}_{Q_R}\big[\langle v, X\rangle^{2k}\big] \right\},$$
and its empirical counterpart
$$\psi^{(k)}_{d,n}(Z_1,\ldots,Z_n) := \sup_{v\in S^{d-1}} \left\{ \frac{1}{n}\sum_{i=1}^n \langle v, Z_i\rangle^{2k} \right\}.$$
We recognize the above quantity as the injective tensor norm of the tensor $\frac{1}{n}\sum_{i=1}^n Z_i^{\otimes 2k}$. The key insight on which we build is that while the computation of the polynomial optimization problem $\psi^{(k)}_{d,n}$ is believed to be intractable in the worst case [BBHKSZ12], under certain average-case assumptions on the data, a good (additive) approximation of it can be computed in polynomial time using the sum-of-squares technique [KSS18; HL18]. Indeed, our algorithms for mean and covariance estimation to follow build directly on this intuition.
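Returning to the univariate warm-up, the following is a minimal Monte Carlo sketch of the test, under our own illustrative parameter choices: the empirical $2k$-th moment is compared against the null threshold $(1+\tau)(2k-1)!!$, and the contamination is a worst-case-flavored rejection scheme that keeps tail points with probability $1-\epsilon$, which keeps the conditional likelihood ratio inside the band of (8).

```python
import numpy as np
from scipy.special import factorial2

rng = np.random.default_rng(0)

def adversarial_sample(n, mean, eps, keep_radius=2.0):
    """Rejection-sample a worst-case-flavored conditional distribution Q_R:
    tail points (the informative ones) are kept only with probability 1 - eps,
    which keeps the likelihood ratio dQ_R/dP inside [1-eps, 1/(1-eps)]."""
    xs = []
    while len(xs) < n:
        x = rng.standard_normal(4 * n) + mean
        keep = (np.abs(x - mean) < keep_radius) | (rng.random(4 * n) < 1 - eps)
        xs.extend(x[keep])
    return np.array(xs[:n])

eps, k, n = 0.2, 4, 200_000
tau = eps / (1 - eps)
threshold = (1 + tau) * factorial2(2 * k - 1)       # null acceptance level for E[X^{2k}]
rho = 1.5 * np.sqrt((tau**2 + 2 * tau) / k)         # separation above the derived bound
for mean, label in [(0.0, "H0"), (rho, "H1")]:
    stat = np.mean(adversarial_sample(n, mean, eps) ** (2 * k))
    print(label, "statistic:", round(float(stat), 2), "reject:", stat > threshold)
```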
Before describing our computationally-efficient algorithms, we begin with some preliminaries.

4.2 Preliminaries

Equipped with the intuition from the previous section, we turn now to our main algorithmic development. We focus in this section on the all-or-nothing setting. The mean and covariance estimators we propose in this section will therefore discard all missing data and only consider the remaining observations. We will mainly consider the following data generation process (Definition 4.1), which is strong enough to simulate $n$ i.i.d. samples from any $Q\in\mathcal{R}(P,\epsilon)$.

Definition 4.1 (Data generating process). Let $n\in\mathbb{N}$, $\epsilon\in[0,1]$, and $P\in\mathcal{P}(\mathbb{R}^d)$. We generate $n$ samples in $\mathbb{R}^d\cup\{\star^d\}$ as follows:
1. Let $T_{\mathrm{MCAR}}\subseteq\mathbb{R}^d$ be a multiset of $m$ i.i.d. samples from $P$ for $m\sim\mathrm{Bin}(n, 1-\epsilon)$ (the MCAR observations).
2. Let $T_{\mathrm{MCAR},\star}\subseteq\mathbb{R}^d$ be a multiset of $n-m$ i.i.d. samples from $P$ and let $T_{\mathrm{MNAR}}\subseteq T_{\mathrm{MCAR},\star}$ be an arbitrary subset (the non-missing MNAR observations).
3. Let $T := T_{\mathrm{MCAR}} + T_{\mathrm{MNAR}} \subseteq \mathbb{R}^d$. The estimator observes $T$ and $n - |T|$ copies of $\star^d$.

We will take $P = N(\mu,\Sigma)$ for some $\mu\in\mathbb{R}^d$ and $\Sigma\succ 0$. In particular, for the mean estimation setting we will assume $\Sigma = I_d$ and $\mu$ is unknown, while for the covariance estimation setting we will assume $\Sigma$ is unknown and $\mu = 0$.

Definition 4.2 (Good event). Consider the sets $T_{\mathrm{MCAR}}$ and $T_{\mathrm{MCAR},\star}$ from Definition 4.1. Let $\mathcal{E}$ be the event that:
1. $\frac{n}{m} \le \frac{1+\epsilon}{1-\epsilon}$;
2. For all $\ell\in[2k]$ and $S\in\{T_{\mathrm{MCAR}},\ T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}\}$:
$$\left\| \mathbb{E}_{x\sim S}\left[\left(\Sigma^{-1/2}(x-\mu)\right)^{\otimes\ell}\right] - \mathbb{E}_{z\sim N(0,I_d)}\left[z^{\otimes\ell}\right] \right\|_\infty \le \frac{\epsilon}{(4d)^k 2^k}.$$

We will prove our mean and covariance estimators are accurate assuming $\mathcal{E}$ holds. The following lemma thus bounds the sample complexity of both estimators.

Lemma 4.3 (Sample complexity of good event $\mathcal{E}$). Consider Definitions 4.1 and 4.2 and suppose that $P = N(\mu,\Sigma)$. There exists a pair of universal positive constants $c_1, c_2$ such that
$$n \ge \frac{(c_1 kd\log(1/\delta))^{c_2 k}}{(1-\epsilon)\epsilon^2}$$
implies that the event $\mathcal{E}$ occurs with probability at least $1-3\delta$.

We defer the proof of this lemma to Section G.4.1. Equipped with these preliminaries, we turn next to studying efficient algorithms and computational lower bounds for mean estimation.

4.3 Mean estimation

We first describe our sum-of-squares based estimator and provide an upper bound on its sample complexity in Theorem 4.5. We defer the proof of this theorem to Section D.1.

Definition 4.4 (SoS program for mean estimation). Consider the following constraint on the $d$-dimensional (indeterminate) variable $\theta$ (refer to the preliminaries for notation):
$$\left\{\|v\|_2^2 = 1\right\} \vdash^v_{4k}\; \mathbb{E}_{x\sim T}\left[\langle v, x-\theta\rangle^{2k}\right] \le \frac{(1+\epsilon)^2}{1-\epsilon}\,\mathbb{E}_{G\sim N(0,1)}\big[G^{2k}\big]. \tag{9}$$

Algorithm 1 (Mean estimator)
1: function MEAN-ESTIMATOR($T$, $k$, $\epsilon$)
2:   Find a pseudo-expectation $\tilde{\mathbb{E}}$ of degree $4k$ which satisfies the system of Definition 4.4. If no such $\tilde{\mathbb{E}}$ exists, return the all-zeros vector.
3:   return $\hat\theta := \tilde{\mathbb{E}}[\theta]$.
4: end function

Theorem 4.5 (SoS algorithm for mean estimation). Let $P = N(\theta_\star, I_d)$ and generate $T\subseteq\mathbb{R}^d$ by Definition 4.1. Let $k\in\mathbb{N}$ satisfy $k\le\sqrt d$ and let $\hat\theta$ be the output of Algorithm 1. Suppose $n$ satisfies the conditions of Lemma 4.3; i.e.,
$$n \gtrsim \frac{(\Theta(kd\log(1/\delta)))^{\Theta(k)}}{(1-\epsilon)\epsilon^2}.$$
Then, with probability at least $1-3\delta$, the following two conditions both hold:
1. (Satisfiability) The constraint (9) is satisfiable.
2. (Accuracy) Let $\gamma = \frac{(1+\epsilon)^2 - (1-\epsilon)^3}{(1-\epsilon)^2}$. Any degree-$4k$ pseudo-expectation $\tilde{\mathbb{E}}$ satisfying this constraint also satisfies
$$\big\|\hat\theta - \theta_\star\big\|_2 \le 8\left(\sqrt{\frac{\gamma}{k}} + \frac{\gamma}{k}\right).$$

Let us compare this result with the information-theoretic results established in Theorem 3.1. First, the algorithm above runs in time $\mathrm{poly}(n^k, d^k, 1/\epsilon)$, as opposed to the algorithm in Theorem 3.1, which needs $\mathrm{poly}(n, e^d)$ time. To facilitate the comparison of sample complexities, consider the case when $\epsilon\in(0,1)$ is a constant. Then $\gamma = O(1)$ and Theorem 4.5 implies that to obtain error $\rho\in(0,1)$, one can take $k\asymp 1/\rho^2$, and the resulting sample complexity would be $d^{O(1/\rho^2)}\cdot f(\rho)$ for some function $f$. Contrast this with the information-theoretic rate (Theorem 3.1), where one needs only $d\cdot\tilde f(\rho)$ samples for some function $\tilde f$.
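For intuition about constraint (9), here is a minimal numerical sketch of its simplest instance $k = 1$ (our own simplification, not the algorithm itself): for $k = 1$ the supremum over directions is an eigenvalue computation, so one can check directly whether a candidate $\theta$ keeps the largest directional second moment of $T-\theta$ below the threshold $(1+\epsilon)^2/(1-\epsilon)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def satisfies_degree2_constraint(T, theta, eps):
    """k = 1 analog of constraint (9): sup_{||v||=1} E_{x~T} <v, x - theta>^2
    equals the top eigenvalue of the empirical second-moment matrix, which
    must stay below (1+eps)^2/(1-eps) * E[G^2], with E[G^2] = 1."""
    centered = T - theta
    M = centered.T @ centered / len(T)
    return np.linalg.eigvalsh(M)[-1] <= (1 + eps) ** 2 / (1 - eps)

d, eps, theta_star = 5, 0.2, np.full(5, 0.3)
T = rng.standard_normal((5000, d)) + theta_star
print(satisfies_degree2_constraint(T, theta_star, eps))        # True: the truth is feasible
print(satisfies_degree2_constraint(T, theta_star + 3.0, eps))  # False: far candidates are not
```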
Given this gap, it is natural to wonder whether there are computationally-efficient algorithms which improve upon the sample complexity required by Algorithm 1. The next result shows that such an improvement is impossible for any computationally-efficient statistical query algorithm.

Definition 4.6 (Testing problem for mean estimation). Let $P$ be the underlying (unknown) distribution, equal to $N(\mu, I_d)$ for $\mu\in\mathbb{R}^d$. Let the contamination rate be $\epsilon$ and the signal strength be $\rho$. Let $P'$ be a contaminated conditional distribution $P'\in\mathcal{R}_{\mathrm{R}}(P,\epsilon)$. Suppose the tester has access to the distribution $P'$, either through i.i.d. samples $(X_1,\ldots,X_n)\sim P'^{\otimes n}$ or through an SQ oracle $\mathrm{STAT}_{P'}(\kappa)$ with tolerance $\kappa$. The testing problem is the following:
$$H_0: \mu = 0 \qquad\text{vs.}\qquad H_1: \|\mu\|_2 \ge \rho.$$

Theorem 4.7 (SQ lower bound for mean estimation). There exist a large constant $C>0$ and small constants $c, c'>0$ such that the following holds. Consider Definition 4.6 with (i) $\rho\le\epsilon/C$ and (ii) $d\ge ((\epsilon/\rho)\log(2d))^C$. Suppose that $\epsilon\le c$ and $1-q\le c$. Then any SQ algorithm that solves Definition 4.6 with fewer than $2^{d^{c'}}$ many queries must use tolerance $\kappa\le d^{-m}$ for
$$m \ge \frac{c'\,\epsilon^2/\rho^2}{\log(\epsilon^2/\rho^2)}.$$

The proof of this theorem uses the NGCA framework (Definition 2.8) and is deferred to Section E. Taken together, Theorems 4.5 and 4.7 essentially settle the computational sample complexity for constant $\epsilon$. For vanishing $\epsilon\to 0$, there is a gap: the lower bound on the effective sample complexity scales as $d^{\epsilon^2/\rho^2}$, whereas the upper bound requires roughly $d^{\epsilon/\rho^2}$ samples. In the next section, we turn our attention to developing a parallel story for covariance estimation.

4.4 Covariance estimation

Our sum-of-squares based covariance estimator solves for a pseudo-expectation over variables $M, B\in\mathbb{R}^{d\times d}$ satisfying the following constraints.

Definition 4.8 (SoS program for covariance estimation). Let $\alpha>0$. Let $\mathcal{A}_{\mathrm{covariance}}$ be the union of the following sets of constraints (refer to the preliminaries for notation) over $M, B\in\mathbb{R}^{d\times d}$:
1. Let $\mathcal{A}_{\mathrm{moments}}$ be the set
$$\Big\{\big\{\|v\|_2^2 = 1\big\} \;\vdash^{v}_{8k}\; \mathbb{E}_{x\sim T}\,\big\langle v,\, M\big(\mathbb{E}_{x\sim T}[xx^\top]\big)^{-1/2} x\big\rangle^{2\ell} \in (1\pm 10\epsilon)\,\mathbb{E}_{G\sim N(0,1)}\big[G^{2\ell}\big]\Big\}_{\ell\in[k]}. \qquad (10)$$
2. Let $\mathcal{A}_{\mathrm{deviation}} := \{M\succeq 0\}\cup\{B = M - I_d\}\cup\{\alpha I_d\succeq B\succeq-\alpha I_d\}\cup\{B^\top B\preceq\alpha^2 I_d\}$.

Analogous to the mean estimation setting, it is relatively straightforward to show that any pseudo-expectation $\tilde{\mathbb{E}}$ that satisfies the constraint (10) must also satisfy the relation $\tilde{\mathbb{E}}\,\|\Sigma^{1/2}Mv\|_2^k\approx 1$, from which we may conclude that $\|\tilde{\mathbb{E}}[M\Sigma M] - I\|_{\mathrm{op}}$ is small. However, this in and of itself does not guarantee that $\|\tilde{\mathbb{E}}[M] - \Sigma\|_{\mathrm{op}}$ is small (i.e., $\tilde{\mathbb{E}}[M]$ may not necessarily be a good estimator). Thus, we take the additional step of solving for an explicit matrix $A$ for which $\|\tilde{\mathbb{E}}[MAM] - I\|_{\mathrm{op}}$ is small. This step motivates the additional constraints in $\mathcal{A}_{\mathrm{deviation}}$: if $\tilde{\mathbb{E}}$ satisfies all of these constraints, then we can guarantee that $A$ is a good estimator (modulo the whitening step implicit in constraint (10)). We make this key fact precise in Lemma 4.9, whose proof we provide in Section D.2.1. We then use it to prove an upper bound on the sample complexity of our estimator in Theorem 4.10.
Algorithm 2 Covariance Estimator
1: function Covariance-estimator($T$, $k$, $\epsilon$, $\alpha$)
2:   Find a pseudo-expectation $\tilde{\mathbb{E}}$ of degree $8k$ which satisfies the system of Definition 4.8. If no such $\tilde{\mathbb{E}}$ exists, return $\mathbb{E}_{x\sim T}[xx^\top]$.
3:   Let $A^*$ be a matrix attaining $\min_{A:\,A\succeq 0}\big\|\tilde{\mathbb{E}}[MAM] - I_d\big\|_{\mathrm{op}}$.
4:   return $\hat\Sigma = \big(\mathbb{E}_{x\sim T}[xx^\top]\big)^{1/2}\, A^*\, \big(\mathbb{E}_{x\sim T}[xx^\top]\big)^{1/2}$.
5: end function

Lemma 4.9. Let $\tilde{\mathbb{E}}$ be a degree-4 pseudo-expectation satisfying $\mathcal{A}_{\mathrm{deviation}}(\alpha)$, and for $\beta>0$ let $H\in\mathbb{R}^{d\times d}$ be an arbitrary symmetric matrix satisfying $\|\tilde{\mathbb{E}}[MHM]\|_{\mathrm{op}}\le\beta$. Then, if $1-2\alpha-\alpha^2>0$, we have
$$\|H\|_{\mathrm{op}} \le \frac{\beta}{1-2\alpha-\alpha^2}.$$

We describe our estimator and upper bound its sample complexity in Theorem 4.10.

Theorem 4.10. Fix $\epsilon<1/100$. Let $P = N(0,\Sigma)$ for $\Sigma\succ 0$. Generate $T\subseteq\mathbb{R}^d$ by Definition 4.1. Let $k = 2^p$ for some $p\in\mathbb{N}$ and let $\alpha = 10\epsilon$ (Definition 4.8). Suppose $n$ satisfies the conditions of Lemma 4.3, i.e., $n\gtrsim\big(\Theta(kd\log(1/\delta))\big)^{\Theta(k)}/\epsilon^2$. Then, with probability at least $1-3\delta$, the following conditions both hold:
1. (Satisfiability) The constraints $\mathcal{A}_{\mathrm{covariance}}$ are satisfiable.
2. (Accuracy) Let $\hat\Sigma$ be the output of Algorithm 2. Then
$$\big\|\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2} - I_d\big\|_{\mathrm{op}} \lesssim \frac{\epsilon}{k}.$$

Let us compare this result with the information-theoretic rate established in Theorem 3.2. First, the range of permissible $\epsilon$ in Theorem 4.10 is restricted to be bounded away from 1; by contrast, Theorem 3.2 demonstrates an estimator which works for all $\epsilon<1$. Next, let us discuss the sample complexity to obtain error $\rho$ when $\epsilon$ is a constant. In Theorem 4.10, we can set $k\asymp 1/\rho$, and the sample complexity of the algorithm above is $d^{\Theta(1/\rho)}\cdot f(\rho)$, as opposed to the information-theoretic sample complexity of $d\cdot\tilde f(\rho)$ in Theorem 3.2; here $f$ and $\tilde f$ are dimension-independent functions. We next demonstrate that such a gap is inherent for all computationally-efficient SQ algorithms.

Definition 4.11 (Testing problem for covariance estimation). Let $P$ be the underlying (unknown) distribution, equal to $N(0, I_d + ww^\top)$ for an unknown vector $w\in\mathbb{R}^d$. Let the contamination rate be $\epsilon\in(0,1/2)$ and the signal strength be $\rho\in(0,1/2)$. Let $P'$ be a contaminated conditional distribution $P'\in\mathcal{R}_{\mathrm{R}}(P,\epsilon)$. Suppose the tester has access to the distribution $P'$, either through i.i.d. samples $(X_1,\ldots,X_n)\sim P'^{\otimes n}$ or through an SQ oracle $\mathrm{STAT}_{P'}$. The testing problem is the following:
$$H_0: \|ww^\top\|_{\mathrm{op}} = 0 \qquad\text{vs.}\qquad H_1: \|ww^\top\|_{\mathrm{op}} \ge \rho.$$

The following theorem provides an SQ lower bound for the above testing problem, again relying on the NGCA framework (Definition 2.8).

Theorem 4.12 (Computational lower bound for covariance estimation). There exist a large constant $C>0$ and a small constant $c>0$ such that the following holds. Consider Definition 4.11 with (i) $\epsilon/\rho\ge C$ and (ii) $d\ge((\epsilon/\rho)\log(2d))^C$. Then any SQ algorithm that solves Definition 4.11 with fewer than $2^{d^c}$ many queries must use tolerance $\kappa\le d^{-m}$ for
$$m \ge \frac{c\,\epsilon/\rho}{\log(\epsilon/\rho)}.$$

It is worthwhile to contrast this guarantee with those of Theorems 4.5 and 4.7: even for vanishing $\epsilon\to 0$, we see that both the upper bound in Theorem 4.10 and the SQ lower bound in Theorem 4.12 scale roughly as $d^{\Theta(\epsilon/\rho)}$.
5 Nearly optimal, computationally-efficient linear regression

In contrast with Gaussian mean and covariance estimation, we show that in Gaussian linear regression with missing observations, there is (nearly) no statistical-computational gap in many parameter regimes of interest. Before presenting the efficient estimator in Section 5.3, we first specify the model in Section 5.1 and then provide a glimpse of the technique in Section 5.2.

5.1 Model preliminaries

In this section, we consider linear regression in the setting with isotropic Gaussian covariates and Gaussian noise. That is, for $\theta_\star\in\mathbb{R}^d$, we take the base distribution $P_{\theta_\star}$ to be the law of $(X,Y)$ generated as
$$X\sim N(0, I_d), \qquad\text{and}\qquad Y\mid X\sim N\big(X^\top\theta_\star,\, \sigma^2\big). \qquad (11a)$$
We take the parameter space and loss to be
$$\mathcal{P}_{\mathrm{LR}}(\sigma^2,\epsilon) = \big\{\mathcal{R}(P_\theta,\epsilon)\;\big|\;\theta\in\mathbb{R}^d\big\} \qquad\text{and}\qquad L(\theta,\theta') = \|\theta-\theta'\|_2. \qquad (11b)$$
Let us emphasize that in this setting, when an observation is masked, the entire observation is masked, rather than just the response. We also emphasize that in the MNAR component, the missingness can depend arbitrarily on both the response $Y$ and the covariate $X$.

In order to further clarify the setting, it will be helpful to introduce auxiliary random variables to represent contaminated distributions. To this end, for a distribution $R\in\mathcal{R}(P_\theta,\epsilon)$, we know that there exist random variables
$$\Omega\in\{0,1\},\qquad B\sim\mathrm{Bern}(\epsilon),\qquad (X,Y)\sim P \qquad\text{such that}\qquad (X,Y)\,\star^{(1-B)+B\Omega}\sim R,$$
where $B$ is independent of the triple $(X,Y,\Omega)$. Moreover, the response $Y$ can be generated as $X^\top\theta_\star + \sigma G$ for $G\sim N(0,1)$ independent of $X$. Observe that $(X,Y)\neq\star$ if and only if $(1-B)+B\Omega = 1$. Equipped with these preliminaries, we next provide the key intuition behind our estimator.

5.2 Warm-up: The population setting

The natural estimator to consider is ordinary least squares (OLS), which minimizes the sum of squared residuals. Unfortunately, in the realizable contamination setting, the OLS estimator can suffer from bias. In order to mitigate this, we consider an analogue of the polynomial estimators developed in Section 4. In particular, rather than minimizing the sum of squared residuals, we minimize a loss which raises the residuals to a larger power $2k>2$. This magnifies the error resulting from deviation away from $\theta_\star$, regardless of the contamination. By taking $k\uparrow\infty$, the effects of the contamination vanish and the population estimator converges to $\theta_\star$. This intuition can be made precise via a short calculation in the population setting, which we provide presently. For $k\in\mathbb{N}$, we define the loss $F^{(k)}:\mathbb{R}^d\to\mathbb{R}$ and its minimizer $\theta^{(k)}_{\mathrm{pop}}$ as
$$F^{(k)}(\theta) := \mathbb{E}\Big[\big(Y - X^\top\theta\big)^{2k}\cdot\mathbf{1}\{(X,Y)\neq\star\}\Big] \qquad\text{and}\qquad \theta^{(k)}_{\mathrm{pop}} := \arg\min_{\theta\in\mathbb{R}^d} F^{(k)}(\theta).$$
Our strategy will be to control the error $\|\theta^{(k)}_{\mathrm{pop}} - \theta_\star\|_2$ by applying the inequality
$$\big\|\theta^{(k)}_{\mathrm{pop}} - \theta_\star\big\|_2 \le \frac{\|\nabla F^{(k)}(\theta_\star)\|_2}{\mu},$$
where $\mu$ denotes the modulus of strong convexity of the population loss. Let us begin by computing $\mu$. We compute the strong convexity parameter by lower bounding the Hessian at every $\theta$:
$$\begin{aligned}
\nabla^2 F^{(k)}(\theta) &= 2k(2k-1)\,\mathbb{E}\Big[\big(Y - X^\top\theta\big)^{2k-2}XX^\top\cdot\mathbf{1}\{(X,Y)\neq\star\}\Big]\\
&\succeq 2k(2k-1)\,\mathbb{E}\Big[(1-B)\cdot\big(X^\top(\theta_\star-\theta) + \sigma G\big)^{2k-2}XX^\top\Big]\\
&\overset{(a)}{=} (1-\epsilon)\,2k(2k-1)\sum_{\ell=0}^{2k-2}\binom{2k-2}{\ell}\sigma^{2k-2-\ell}\,\mathbb{E}\big[G^{2k-2-\ell}\big]\cdot\mathbb{E}\Big[\big(X^\top(\theta_\star-\theta)\big)^{\ell}XX^\top\Big]\\
&\overset{(b)}{\succeq} (1-\epsilon)\,2k(2k-1)\,\sigma^{2k-2}\,\mathbb{E}\big[G^{2k-2}\big]\cdot I_d \;=\; (1-\epsilon)\,2k(2k-1)\,\sigma^{2k-2}(2k-3)!!\cdot I_d.
\end{aligned}$$
Above, step (a) follows by mutual independence of $X$, $G$, and $B$; and step (b) follows from the fact that odd moments of the standard Gaussian distribution are zero, so that only even powers $\ell$ contribute, and each such term is positive semidefinite.
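Before turning to the gradient, a quick numerical sanity check of the strong convexity bound is instructive. The following sketch (Python with NumPy; this is our own illustrative check, evaluated at $\theta_\star$ and with a crude MCAR-only mask, not part of the paper's analysis) compares the smallest eigenvalue of the empirical Hessian against $(1-\epsilon)\,2k(2k-1)\,\sigma^{2k-2}(2k-3)!!$:

```python
import numpy as np

def empirical_hessian_min_eig(X, y, keep, theta, k):
    """Smallest eigenvalue of the Hessian of the empirical 2k-th power loss
    (1/n) * sum_i 1{observed_i} * (y_i - x_i^T theta)^{2k}."""
    r = (y - X @ theta) * keep                 # zero out masked residuals
    w = 2 * k * (2 * k - 1) * r ** (2 * k - 2) * keep
    H = (X * w[:, None]).T @ X / X.shape[0]    # weighted Gram matrix
    return np.linalg.eigvalsh(H)[0]

rng = np.random.default_rng(2)
n, d, k, sigma, eps = 50_000, 5, 2, 1.0, 0.1
theta_star = np.ones(d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ theta_star + sigma * rng.normal(size=n)
keep = (rng.random(n) > eps).astype(float)     # crude MCAR-only mask
dfact = lambda m: float(np.prod(np.arange(m, 0, -2))) if m > 0 else 1.0
bound = (1 - eps) * 2 * k * (2 * k - 1) * sigma ** (2 * k - 2) * dfact(2 * k - 3)
# The two numbers should be comparable at this sample size.
print(empirical_hessian_min_eig(X, y, keep, theta_star, k), bound)
```

Because the lower bound retains only the clean portion of the data, the empirical minimum eigenvalue concentrates near the bound under this mechanism and can only increase when non-missing MNAR observations are added.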
On the other hand, we upper bound the norm of the gradient at the true parameter $\theta_\star$ as
$$\begin{aligned}
\big\|\nabla F^{(k)}(\theta_\star)\big\|_2 &= 2k\cdot\Big\|\mathbb{E}\big[(Y - X^\top\theta_\star)^{2k-1}X\cdot\mathbf{1}\{(X,Y)\neq\star\}\big]\Big\|_2\\
&= 2k\sigma^{2k-1}\cdot\Big\|\mathbb{E}\big[(1-B)\,G^{2k-1}X\big] + \mathbb{E}\big[B\,\Omega\,G^{2k-1}X\big]\Big\|_2\\
&= 2k\sigma^{2k-1}\epsilon\cdot\Big\|\mathbb{E}\big[G^{2k-1}X\cdot\Omega\big]\Big\|_2\\
&= 2k\sigma^{2k-1}\epsilon\cdot\sup_{v\in\mathbb{S}^{d-1}}\mathbb{E}\big[G^{2k-1}(v^\top X)\,\Omega\big]\\
&\le 2k\sigma^{2k-1}\epsilon\cdot\sup_{v\in\mathbb{S}^{d-1}}\mathbb{E}\big[|G|^{2k-1}\,|v^\top X|\big]\\
&\le 2\sqrt{2/\pi}\cdot k\sigma^{2k-1}\epsilon\cdot(2k-2)!!,
\end{aligned}$$
where again we make use of the mutual independence of $X$, $G$, and $B$. Putting the pieces together, we deduce that
$$\big\|\theta^{(k)}_{\mathrm{pop}} - \theta_\star\big\|_2 \le \frac{\|\nabla F^{(k)}(\theta_\star)\|_2}{(1-\epsilon)\,2k(2k-1)\,\sigma^{2k-2}(2k-3)!!} \lesssim \sigma\cdot\frac{\epsilon}{1-\epsilon}\cdot\frac{(2k-2)!!}{(2k-1)!!} \lesssim \sigma\cdot\frac{\epsilon}{(1-\epsilon)\sqrt{k}},$$
where the final inequality follows from Stirling's inequality (see Fact G.2). Therefore, while the quadratic loss (corresponding to $k=1$) yields an error guarantee of $O(\sigma\frac{\epsilon}{1-\epsilon})$, taking $k$ to be large gives a population quantity with arbitrarily small error. In the next section, we derive finite-sample guarantees for the empirical variant of this estimator by carefully choosing the value of $k$.

5.3 Main result

In the population setting, we established that raising residuals to arbitrarily large polynomial powers has a monotone effect on the estimation error. In the finite-sample setting, this is no longer the case, as raising the residuals to higher powers renders the empirical risk heavier-tailed. Our main result demonstrates that selecting a value of $k$ which balances the population error against the heavy-tailed nature of the risk yields an estimator that is nearly optimal. Formally, given samples $\{(x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)\}\subseteq\mathbb{R}^{d+1}_\star$, we define the empirical loss
$$F^{(k)}_n(\theta) := \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{(x_i,y_i)\neq\star\}\cdot\big(y_i - x_i^\top\theta\big)^{2k} \qquad (12a)$$
as well as its minimizer
$$\hat\theta^{(k)}_n = \arg\min_{\theta} F^{(k)}_n(\theta). \qquad (12b)$$
The following theorem provides information-theoretic upper bounds for the linear regression problem, as well as the upper bound obtained by a computationally-efficient estimator that approximates $\hat\theta^{(k)}_n$. We provide the proof of the upper bound in Section B.1 and the lower bound, which builds on the construction in [DGKPX26], in Section C.2.3.

Theorem 5.1. There exist constants $c, C>0$ such that the following is true. Let $\sigma^2>0$, $\epsilon<1$, $n\in\mathbb{N}$, $\delta\in\big[\{(1-\epsilon)n\}^{-c},\, 0.5\big]$, and let
$$\alpha = \sqrt{\frac{d+\log(\frac1\delta)}{(1-\epsilon)\,n}}.$$
If $n\ge C\cdot\frac{d+\log(\frac1\delta)}{1-\epsilon}$, then the minimax $1-\delta$ quantile (6) satisfies
$$\sigma\cdot\frac{\log\big(\frac{1}{1-\epsilon}\big)}{\sqrt{\log\Big(1+\frac{\epsilon}{(1-\epsilon)\alpha^2}\Big)}} \;\lesssim\; \mathfrak{M}\big(\delta,\mathcal{P}_{\mathrm{LR}}(\sigma^2,\epsilon), L\big) \;\lesssim\; \sigma\cdot\frac{\epsilon}{1-\epsilon}\cdot\frac{1\vee\log\log\big(\frac{\epsilon}{\alpha(1-\epsilon)}\big)}{\sqrt{\log\Big(1+\frac{\epsilon}{(1-\epsilon)\alpha^2}\Big)}}.$$
Moreover, the upper bound is achieved by the estimator $\hat\theta^{(k)}_n$ for some $k\lesssim\log\big(\frac{\epsilon}{\alpha(1-\epsilon)}\big)$, which can be approximated to accuracy of the same order in $\mathrm{poly}(k,d,n)$ time.

We next provide several remarks about this theorem. First, let us emphasize that the estimator $\hat\theta^{(k)}_n$ can be approximated to sufficient accuracy in polynomial time. In particular, in Section B.2 we show that (for instance) the ellipsoid method can compute an approximate minimizer in polynomial time. We contrast this with the estimator provided in [MVBWS24, Section 5], which is based on projections in the Kolmogorov distance and requires exponential time to compute.
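Since each summand of (12a) is a convex function of $\theta$, the loss $F^{(k)}_n$ can be handed to any off-the-shelf convex solver. The following sketch (assuming NumPy and SciPy; the formal guarantee above is via the ellipsoid method, whereas we use L-BFGS purely for illustration, and the MNAR rule in the usage example is an arbitrary choice of our own) computes an approximate minimizer:

```python
import numpy as np
from scipy.optimize import minimize

def fit_power_loss(X, y, observed, k):
    """Approximately minimize F_n^{(k)}(theta) =
    (1/n) * sum_i 1{observed_i} * (y_i - x_i^T theta)^{2k}."""
    Xo, yo, n = X[observed], y[observed], len(y)

    def loss_and_grad(theta):
        r = yo - Xo @ theta
        loss = np.sum(r ** (2 * k)) / n
        grad = -(2 * k / n) * (Xo.T @ (r ** (2 * k - 1)))
        return loss, grad

    theta0 = np.linalg.lstsq(Xo, yo, rcond=None)[0]   # warm start at OLS
    res = minimize(loss_and_grad, theta0, jac=True, method="L-BFGS-B")
    return res.x

# Illustrative usage with an MNAR rule of our own making:
rng = np.random.default_rng(3)
n, d, k, sigma, eps = 20_000, 5, 3, 1.0, 0.2
theta_star = np.zeros(d); theta_star[0] = 1.0
X = rng.normal(size=(n, d))
y = X @ theta_star + sigma * rng.normal(size=n)
mnar = rng.random(n) < eps                 # contaminated fraction
observed = ~mnar | (y > 1.0)               # MNAR keeps only large responses
theta_hat = fit_power_loss(X, y, observed, k)
ols = np.linalg.lstsq(X[observed], y[observed], rcond=None)[0]
print(np.linalg.norm(theta_hat - theta_star), np.linalg.norm(ols - theta_star))
```

In this toy run, the higher-power loss visibly shrinks the selection bias that OLS incurs on the observed data, in line with the population calculation of Section 5.2.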
Second, given the results of the previous two sections and the fact that the estimator of [MVBWS24] is based on Kolmogorov projections, it is tempting to guess that there may be a statistical–computational gap in the regression setting as well. Indeed, in our well-specified setting, the linear regression problem can be reduced to that of covariance estimation: the covariate–response pair $(X,Y)$ is jointly Gaussian,
$$(X,Y)\sim N\left(0,\; \begin{pmatrix} I_d & \theta_\star\\ \theta_\star^\top & \|\theta_\star\|_2^2 + \sigma^2\end{pmatrix}\right).$$
We can thus apply the efficient estimator developed in Section 4 to obtain an estimate of the coefficients $\theta_\star$. In order to obtain an estimate of $\theta_\star$ with error at most $\rho$, this would require at least $n\gtrsim d^{\Omega(1/\rho^2)}$ many samples. A similar dimension dependence was obtained by [LMZ24, Corollary 3.20] in the related setting of efficient linear regression with truncated statistics, where the sample complexity is also $n\gtrsim d^{\mathrm{poly}(1/\rho)}$. On the other hand, our empirical risk minimization procedure is computable in polynomial time and requires a significantly smaller sample complexity, which we discuss next.

Next, in order to simplify the upper and lower bounds, let us consider the parameters $\epsilon,\delta,\sigma$ to be constants. The upper and lower bounds then simplify to
$$\frac{1}{\sqrt{\log(1/\alpha^2)}} \;\lesssim\; \mathfrak{M}_n\big(\delta,\mathcal{P}_{\mathrm{LR}}(\sigma^2,\epsilon), L\big) \;\lesssim\; \frac{\log\log(1/\alpha)}{\sqrt{\log(1/\alpha^2)}}.$$
This reveals that in this setting, our method is optimal up to a multiplicative factor which is a doubly-iterated logarithm in $\alpha$. Moreover, for constant $\epsilon$ and $\delta$, we find that $\alpha\asymp\sqrt{d/n}$, so that the upper and lower bounds further simplify to
$$\frac{1}{\sqrt{\log(n/d)}} \;\lesssim\; \mathfrak{M}_n\big(\delta,\mathcal{P}_{\mathrm{LR}}(\sigma^2,\epsilon), L\big) \;\lesssim\; \frac{\log\log(n/d)}{\sqrt{\log(n/d)}}.$$
In particular, this implies that our estimator requires a sample complexity of $n\gtrsim d\,e^{\Omega(\log^2(1/\rho)/\rho^2)}$ to obtain error $\rho$, which is linear in the dimension. We note that this sample complexity is polynomially smaller in dimension than that required by the inefficient algorithm of [MVBWS24], which requires $n\gtrsim d^{36/31}$.

Next, we note that, while we proved the upper bound for covariates distributed as $X\sim N(0,I_d)$, only mild assumptions are needed for this estimator to work. Indeed, even if $X$ is heavy-tailed (with at least a constant number of bounded moments), our algorithm, with a modified analysis, continues to succeed with similar error rates in the constant failure probability ($\delta$) regime.

Let us mention two limitations of our result. First, we note that, in contrast with Theorems 3.1 and 3.2, we require the lower bound $\delta\ge\{(1-\epsilon)n\}^{-c}$ for a small enough constant $c>0$. This is not a huge limitation, as one could take the guarantee of Theorem 5.1 for constant failure probability $1/4$ with sample complexity $n_0$ and use a simple boosting procedure to turn it into a (computationally-efficient) estimator that works for arbitrary $\delta\in(0,1/4)$ with a sample complexity of $\log(1/\delta)\cdot n_0$. The boosting procedure is as follows: run the geometric median-of-means [Min15] procedure on $k\asymp\log(1/\delta)$ independent estimates (obtained by running Theorem 5.1 on disjoint subsets), each of which is correct with probability $3/4$ (guaranteed by Theorem 5.1 if each subset is of size $n_0$). Still, the resulting sample complexity has a multiplicative dependence on $\log(1/\delta)$, as opposed to an additive dependence in the lower bound. We leave further investigation of the extremely high-confidence setting as an interesting open direction.
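A minimal sketch of this boosting step (assuming NumPy; Weiszfeld's iteration is a standard way to compute the geometric median, though [Min15] does not prescribe a particular solver, and the `base_estimator` signature below is our own convention, e.g. a wrapper around `fit_power_loss` from the earlier sketch):

```python
import numpy as np

def geometric_median(estimates, n_iter=100, tol=1e-9):
    """Weiszfeld's algorithm: minimize the sum of Euclidean distances
    to the rows of `estimates`."""
    z = estimates.mean(axis=0)
    for _ in range(n_iter):
        dist = np.maximum(np.linalg.norm(estimates - z, axis=1), tol)
        w = 1.0 / dist
        z_new = (w[:, None] * estimates).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z

def boost(base_estimator, X, y, observed, delta, rng):
    """Run the constant-confidence estimator on ~log(1/delta) disjoint
    chunks of the data and aggregate the results by geometric median."""
    k = max(1, int(np.ceil(np.log(1.0 / delta))))
    chunks = np.array_split(rng.permutation(len(y)), k)
    ests = np.stack([base_estimator(X[c], y[c], observed[c]) for c in chunks])
    return geometric_median(ests)
```

The geometric median converts many constant-confidence estimates into a single high-confidence one, which is exactly the source of the multiplicative $\log(1/\delta)$ factor noted above.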
Second, our algorithm yields an error which scales linearly in $\tau = \epsilon/(1-\epsilon)$. By contrast, the lower bound (as well as the rates for mean and covariance estimation) scales logarithmically in $\tau$. While these match for $\epsilon<c$ for a small enough universal constant $c>0$, this exponentially large gap becomes more striking as $\epsilon\to 1$.

Finally, we remark that the recent work [KMKC26] also considers linear regression with truncated statistics. The authors focus on the regime where the truncation depends only on the response variable and demonstrate a polynomial-time estimator that achieves error $\rho$ with a sample complexity of $\mathrm{poly}(d, 1/\rho, k)$, where $k$ is the number of intervals that form the truncation set. This sample complexity admits a dimension dependence which is polynomially larger than our estimator's, but a dependence on $1/\rho$ which is exponentially smaller than ours. We re-emphasize that, in general, these models are incomparable (see Section 1.2) and that our information-theoretic lower bound precludes such improvements to our dependence on $1/\rho$.

6 Discussion

In this paper, we studied several estimation problems with missing data under the realizable contamination model. We focused on the so-called all-or-nothing setting, in which each observation is either completely observed or not at all. Under this setting we showed:
• In mean estimation, to achieve error in $\ell_2$ norm at most $\rho$, there appears to be a sample complexity gap: ignoring all other problem parameters, $n\gtrsim d\,e^{1/\rho^2}$ samples are necessary and sufficient information-theoretically, but computationally-efficient statistical query algorithms (along with other restricted families of algorithms mentioned in Section 2.2) require $n\gtrsim d^{\Omega(1/\rho^2)}$ many observations. This latter lower bound is achieved by a sum-of-squares estimator.
• In covariance estimation, to achieve error $\rho$ in relative operator norm, analogous claims hold: $n\gtrsim d\,e^{1/\rho}$ samples are necessary and sufficient information-theoretically, but statistical query algorithms require $n\gtrsim d^{1/\rho}$ samples.
• In linear regression, we show that for most parameters, these statistical–computational gaps do not exist. In particular, we show that the sample complexity $n\gtrsim d\,e^{1/\rho^2}$ is necessary information-theoretically and can be matched by a computationally-efficient algorithm.

There remain several open questions, a few of which we discuss here. First, our computationally-efficient algorithm for linear regression does not match the information-theoretic rate for extremely low failure probabilities $\delta$ or when $\epsilon\to 1$. Going further, it would be interesting to understand how the rates of convergence change, both computationally and statistically, for regression models more complicated than linear regression. Next, we note that the all-or-nothing setting of missingness is less common than the multiple pattern setting (see Figure 1). In Appendix F, we show how to extend our algorithms and lower bounds to the setting with multiple patterns, and we show that for both mean and covariance estimation, our all-or-nothing algorithms can be adapted to achieve error which is a multiplicative factor of $C_{|S|}$ larger than the optimal error, where $C_{|S|}$ is a positive constant depending only on the number of patterns.
While this error inflation may be reasonable for a small number of patterns, it may be suboptimal for a large number of missingness patterns. An important direction for future work is to determine the fundamental limits of estimation, even for simple tasks like mean and covariance estimation, in this multiple pattern setting.

References

[AL13] P. M. Aronow and D. K. K. Lee. "Interval estimation of population means under unknown but bounded probabilities of sample selection". In: Biometrika 100.1 (2013).
[BBHKSZ12] B. Barak, F. G. S. L. Brandao, A. W. Harrow, J. Kelner, D. Steurer, and Y. Zhou. "Hypercontractivity, sum-of-squares proofs, and their applications". In: Proc. 44th Annual ACM Symposium on Theory of Computing (STOC). 2012.
[BBHLS21] M. Brennan, G. Bresler, S. B. Hopkins, J. Li, and T. Schramm. "Statistical query algorithms and low-degree tests are almost equivalent". In: Proc. 34th Annual Conference on Learning Theory (COLT). 2021.
[BEHW89] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. "Learnability and the Vapnik–Chervonenkis dimension". In: Journal of the ACM 4 (1989).
[BJK15] K. Bhatia, P. Jain, and P. Kar. "Robust Regression via Hard Thresholding". In: Advances in Neural Information Processing Systems 28 (NeurIPS). 2015.
[BJKK17] K. Bhatia, P. Jain, P. Kamalaruban, and P. Kar. "Consistent Robust Regression". In: Advances in Neural Information Processing Systems 30 (NeurIPS). 2017.
[BK22] M. Bonvini and E. H. Kennedy. "Sensitivity analysis via the proportion of unmeasured confounding". In: Journal of the American Statistical Association 117.539 (2022).
[BLM13] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[BS16] B. Barak and D. Steurer. "Proofs, beliefs, and algorithms through the lens of sum-of-squares". 2016.
[Bub15] S. Bubeck. "Convex Optimization: Algorithms and Complexity". In: Foundations and Trends in Machine Learning 8.3-4 (2015), pp. 231–357.
[DDKLP25] I. Diakonikolas, J. Diakonikolas, D. M. Kane, J. C. H. Lee, and T. Pittas. "Linear Regression under Missing or Corrupted Coordinates". In: arXiv preprint arXiv:2509.19242 (2025).
[DGKLP25] I. Diakonikolas, C. Gao, D. M. Kane, J. Lafferty, and A. Pensia. "Information-Computation Tradeoffs for Noiseless Linear Regression with Oblivious Contamination". In: Advances in Neural Information Processing Systems 38 (NeurIPS). 2025.
[DGKPX26] I. Diakonikolas, C. Gao, D. M. Kane, A. Pensia, and D. Xie. "Robust Regression with Adaptive Contamination in Response: Optimal Rates and Computational Barrier". In preparation (2026).
[DGTZ18] C. Daskalakis, T. Gouleakis, C. Tzamos, and M. Zampetakis. "Efficient Statistics, in High Dimensions, from Truncated Samples". In: Proc. 59th IEEE Symposium on Foundations of Computer Science (FOCS). 2018, pp. 639–649.
[DIKL26] I. Diakonikolas, G. Iakovidis, D. M. Kane, and S. Liu. "Sample Complexity Bounds for Robust Mean Estimation with Mean-Shift Contamination". In: arXiv preprint arXiv:2602.22130 (2026).
[DIKP25] I. Diakonikolas, G. Iakovidis, D. M. Kane, and T. Pittas. "Efficient Multivariate Robust Mean Estimation Under Mean-Shift Contamination". In: Proc. 42nd International Conference on Machine Learning (ICML). 2025.
[DK23] I. Diakonikolas and D. M. Kane. Algorithmic High-Dimensional Robust Statistics. Cambridge University Press, 2023.
[DKKLMS16] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. "Robust Estimators in High Dimensions without the Computational Intractability". In: Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS). 2016.
[DKKPP22] I. Diakonikolas, D. M. Kane, S. Karmalkar, A. Pensia, and T. Pittas. "Robust Sparse Mean Estimation via Sum of Squares". In: Proc. 35th Annual Conference on Learning Theory (COLT). 2022.
[DKLP25] I. Diakonikolas, D. M. Kane, S. Liu, and T. Pittas. "PTF Testing Lower Bounds for Non-Gaussian Component Analysis". In: Proc. 66th IEEE Symposium on Foundations of Computer Science (FOCS). 2025.
[DKPP24] I. Diakonikolas, S. Karmalkar, S. Pang, and A. Potechin. "Sum-of-Squares Lower Bounds for Non-Gaussian Component Analysis". In: Proc. 65th IEEE Symposium on Foundations of Computer Science (FOCS). 2024.
[DKPZ24] I. Diakonikolas, D. M. Kane, T. Pittas, and N. Zarifis. "Statistical query lower bounds for learning truncated Gaussians". In: Proc. 37th Annual Conference on Learning Theory (COLT). 2024.
[DKS17] I. Diakonikolas, D. M. Kane, and A. Stewart. "Statistical Query Lower Bounds for Robust Estimation of High-Dimensional Gaussians and Gaussian Mixtures". In: Proc. 58th IEEE Symposium on Foundations of Computer Science (FOCS). 2017.
[dNS21] T. d'Orsi, G. Novikov, and D. Steurer. "Consistent regression when oblivious outliers overwhelm". In: Proc. 38th International Conference on Machine Learning (ICML). 2021.
[Duc25] J. Duchi. Statistics and Information Theory. 2025.
[FGRVX17] V. Feldman, E. Grigorescu, L. Reyzin, S. S. Vempala, and Y. Xiao. "Statistical Algorithms and a Lower Bound for Detecting Planted Cliques". In: Journal of the ACM 64.2 (2017).
[FKP19] N. Fleming, P. Kothari, and T. Pitassi. "Semialgebraic Proofs and Efficient Algorithm Design". In: Foundations and Trends in Theoretical Computer Science (2019).
[Gao20] C. Gao. "Robust regression via multivariate regression depth". In: Bernoulli 26.2 (2020).
[GVR97] R. D. Gill, M. J. Van Der Laan, and J. M. Robins. "Coarsening at random: Characterizations, conjectures, counter-examples". In: Proc. 1st Seattle Symposium in Biostatistics: Survival Analysis. 1997.
[HL18] S. B. Hopkins and J. Li. "Mixture Models, Robustness, and Sum of Squares Proofs". In: Proc. 50th Annual ACM Symposium on Theory of Computing (STOC). 2018.
[HM95] J. L. Horowitz and C. F. Manski. "Identification and robustness with contaminated and corrupted data". In: Econometrica: Journal of the Econometric Society (1995).
[HR09] P. J. Huber and E. M. Ronchetti. Robust Statistics. John Wiley & Sons, 2009.
[HR21] L. Hu and O. Reingold. "Robust mean estimation on highly incomplete data with arbitrary outliers". In: Proc. 24th International Conference on Artificial Intelligence and Statistics (AISTATS). 2021.
[JTK14] P. Jain, A. Tewari, and P. Kar. "On Iterative Hard Thresholding Methods for High-Dimensional M-Estimation". In: Advances in Neural Information Processing Systems 27 (NeurIPS). 2014.
[KC18] A. K. Kuchibhotla and A. Chakrabortty. "Moving Beyond Sub-Gaussianity in High-Dimensional Statistics: Applications in Covariance Estimation and Linear Regression". In: arXiv preprint arXiv:1804.02605 (2018).
[Kea98] M. J. Kearns. "Efficient noise-tolerant Learning from Statistical Queries". In: Journal of the ACM 45.6 (1998), pp. 983–1006.
[KG25] S. Kotekal and C. Gao. "Optimal Estimation of the Null Distribution in Large-Scale Inference". In: IEEE Transactions on Information Theory (2025).
[KKLZ26] A. Kalavasis, P. K. Kothari, S. Li, and M. Zampetakis. "Learning Mixture Models via Efficient High-dimensional Sparse Fourier Transforms". In: Proc. 58th Annual ACM Symposium on Theory of Computing (STOC). 2026.
[KMKC26] A. Kouridakis, A. Mehrotra, A. Kalavasis, and C. Caramanis. "Linear Regression with Unknown Truncation Beyond Gaussian Features". In: arXiv preprint arXiv:2602.12534 (2026).
[KS17] P. K. Kothari and D. Steurer. "Outlier-robust moment-estimation via sum-of-squares". In: arXiv preprint arXiv:1711.11581 (2017).
[KSS18] P. K. Kothari, J. Steinhardt, and D. Steurer. "Robust Moment Estimation and Improved Clustering via Sum of Squares". In: Proc. 50th Annual ACM Symposium on Theory of Computing (STOC). 2018.
[KTZ19] V. Kontonis, C. Tzamos, and M. Zampetakis. "Efficient truncated statistics with unknown truncation". In: Proc. 60th IEEE Symposium on Foundations of Computer Science (FOCS). 2019.
[Li23] S. Li. "Robust Mean Estimation Against Oblivious Adversaries". MA thesis. Carnegie Mellon University, 2023.
[LMZ24] J. H. Lee, A. Mehrotra, and M. Zampetakis. "Efficient Statistics With Unknown Truncation, Polynomial Time Algorithms, Beyond Gaussians". In: Proc. 65th IEEE Symposium on Foundations of Computer Science (FOCS). 2024.
[LR14] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, 2014.
[LW12] P. Loh and M. J. Wainwright. "High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity". In: The Annals of Statistics 40.3 (2012).
[Man03] C. F. Manski. Partial identification of probability distributions. Springer, 2003.
[Mas90] P. Massart. "The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality". In: The Annals of Probability (1990).
[Min15] S. Minsker. "Geometric Median and Robust Estimation in Banach Spaces". In: Bernoulli 21.4 (2015), pp. 2308–2335.
[MPT13] K. Mohan, J. Pearl, and J. Tian. "Graphical models for inference with missing data". In: Advances in Neural Information Processing Systems 26 (NeurIPS). 2013.
[MST22] D. Malinsky, I. Shpitser, and E. J. Tchetgen Tchetgen. "Semiparametric inference for nonmonotone missing-not-at-random data: the no self-censoring model". In: Journal of the American Statistical Association 117.539 (2022).
[MVBWS24] T. Ma, K. A. Verchand, T. B. Berrett, T. Wang, and R. J. Samworth. "Estimation beyond Missing (Completely) at Random". In: arXiv preprint (2024).
[MVGS26] T. Ma, K. A. Verchand, C. Gao, and R. J. Samworth. "Adaptive confidence intervals with missing data". In preparation (2026+).
[MVS24] T. Ma, K. A. Verchand, and R. J. Samworth. "High-probability minimax lower bounds". In: arXiv preprint arXiv:2406.13447 (2024).
[MZ25] A. Minasyan and N. Zhivotovskiy. "Statistically optimal robust mean and covariance estimation for anisotropic Gaussians". In: Mathematical Statistics & Learning 8 (2025).
[Ros87] P. R. Rosenbaum. "Sensitivity analysis for certain permutation inferences in matched observational studies". In: Biometrika 74.1 (1987).
[RRS98] A. Rotnitzky, J. M. Robins, and D. O. Scharfstein. "Semiparametric regression for repeated outcomes with nonignorable nonresponse". In: Journal of the American Statistical Association 93.444 (1998).
[SBRJ19] A. S. Suggala, K. Bhatia, P. Ravikumar, and P. Jain. "Adaptive Hard Thresholding for Near-optimal Consistent Robust Regression". In: Proc. 32nd Annual Conference on Learning Theory (COLT). 2019.
[SGJC13] S. Seaman, J. Galati, D. Jackson, and J. Carlin. "What Is Meant by "Missing at Random"?" In: Statistical Science 28.2 (2013).
[SLW22] R. Sahoo, L. Lei, and S. Wager. "Learning from a biased sample". In: arXiv preprint arXiv:2209.01754 (2022).
[TJSO14] E. Tsakonas, J. Jaldén, N. D. Sidiropoulos, and B. Ottersten. "Convergence of the Huber Regression M-Estimate in the Presence of Dense Outliers". In: IEEE Signal Processing Letters 21.10 (2014).
[Tsy09] A. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
[Tuk75] J. W. Tukey. "Mathematics and the picturing of data". In: Proc. International Congress of Mathematicians. Vol. 2. 1975.
[Ver25] R. Vershynin. High-dimensional Probability: An Introduction with Applications in Data Science. 2nd ed. Cambridge University Press, 2025.
[Wai19] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
[ZSB19] Q. Zhao, D. S. Small, and B. B. Bhattacharya. "Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap". In: Journal of the Royal Statistical Society, Series B: Statistical Methodology 81.4 (2019).

A Proofs of information-theoretic upper bounds for mean and covariance estimation

A.1 Mean estimation: Proof of Theorem 3.1 upper bound

We prove the claim for general $q\in(0,1]$. To establish the upper bound, we rely on the following minimax-optimal univariate estimator, due to [MVGS26, Theorem 1]. For completeness, we provide a self-contained proof (using a different estimator) in Section A.3.1.

Corollary A.1 ([MVGS26], Theorem 1). Let $\epsilon\in[0,1]$, $q\in(0,1]$, $n\in\mathbb{N}$, $\delta\in(0,1)$, and define
$$\alpha = \sqrt{\frac{\log(\frac1\delta)}{nq(1-\epsilon)}}.$$
Further, let $\theta_\star\in\mathbb{R}$ and suppose that the observations $Z_1,\ldots,Z_n\overset{\mathrm{iid}}{\sim} R\in\mathcal{R}(P_{\theta_\star},\epsilon,q)$.
There exist universal, positive constants $C_1, C_2$ such that, if $n\ge C_1\frac{\log(\frac1\delta)}{q(1-\epsilon)}$, then there exists an estimator $\hat\theta:\mathbb{R}^n_\star\to\mathbb{R}$ which satisfies, with probability at least $1-\delta$,
$$\big|\hat\theta - \theta_\star\big| \le C_2\,\frac{\log(1+\tau)}{\sqrt{\log(1+\tau^2/\alpha^2)}}.$$

Equipped with this univariate estimator, we turn to the proof of our upper bound, which follows a standard approximation argument on an $\epsilon$-net. To this end, let $\mathcal{V}$ denote a $\frac12$-packing of the unit sphere $\mathbb{S}^{d-1}$, which by, e.g., [Ver25, Corollary 4.2.11] exists and satisfies the cardinality bound $|\mathcal{V}|\le 5^d$. Now, for each unit vector $v\in\mathcal{V}$, let $\{Z^v_i\}_{i\in[n]}$ denote the projections of the observations onto $v$, noting that if $Z_i = \star_d$, then $Z^v_i = \star$. Importantly, if $\mathrm{Law}(Z_i)\in\mathcal{R}(P_{\theta_\star},\epsilon,q)$, it follows that $\mathrm{Law}(Z^v_i)\in\mathcal{R}(P_{\theta_\star^\top v},\epsilon,q)$. Next, for each $v\in\mathcal{V}$, let $\hat\theta_v$ denote the univariate estimator implicit in Corollary A.1. Armed with each of these univariate estimators, we define our multivariate mean estimator $\hat\theta\in\mathbb{R}^d$ to be any element
$$\hat\theta\in\arg\min_{\theta\in\mathbb{R}^d}\;\max_{v\in\mathcal{V}}\;\big|\theta^\top v - \hat\theta_v\big|. \qquad (13)$$
For parameters $n, q, \epsilon, \delta$, let $r(n,q,\epsilon,\delta)$ denote the corresponding error rate in the corollary. Next, take $\delta' = \delta/5^d$ and let $\Omega_v$ denote the event that
$$\big|\hat\theta_v - \theta_\star^\top v\big| \le r(n,q,\epsilon,\delta'). \qquad (14)$$
Note that by Corollary A.1, $\mathbb{P}(\Omega^c_v)\le\delta/5^d$, whence with $\Omega := \bigcap_{v\in\mathcal{V}}\Omega_v$, an application of the union bound implies that $\mathbb{P}(\Omega)\ge 1-\delta$. Now, working on the event $\Omega$, we have
$$\big\|\hat\theta - \theta_\star\big\|_2 \overset{(a)}{\le} 2\cdot\max_{v\in\mathcal{V}}\big|(\hat\theta-\theta_\star)^\top v\big| \le 2\cdot\max_{v\in\mathcal{V}}\big|\hat\theta^\top v - \hat\theta_v\big| + 2\cdot\max_{v\in\mathcal{V}}\big|\hat\theta_v - \theta_\star^\top v\big| \overset{(b)}{\le} 4\cdot\max_{v\in\mathcal{V}}\big|\hat\theta_v - \theta_\star^\top v\big| \le 4\,r(n,q,\epsilon,\delta'),$$
where step (a) follows from an argument identical to the approximation in [Ver25, Lemma 4.4.1], step (b) follows from the definition of $\hat\theta$ in Eq. (13), and the final inequality follows from Eq. (14). Finally, we see that
$$r(n,q,\epsilon,\delta') \le C_2\,\frac{\log(1+\tau)}{\sqrt{\log\Big(1+\tau^2\,\frac{nq(1-\epsilon)}{d+\log(\frac1\delta)}\Big)}},$$
which completes the proof.
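We remark in passing that the aggregation step (13) is itself a small linear program: introduce a slack variable $t$ and minimize it subject to $|\theta^\top v - \hat\theta_v|\le t$ for every direction in the net. A minimal sketch (assuming NumPy and SciPy; in practice one would use a subsampled set of directions, since $|\mathcal{V}|\le 5^d$ is exponential in $d$, which is precisely why this estimator is computationally inefficient):

```python
import numpy as np
from scipy.optimize import linprog

def aggregate_directions(V, b):
    """Solve min_theta max_v |theta^T v - b_v| as an LP.

    Variables are (theta, t); the constraints are +-(V theta - b) <= t.
    V: (m, d) array of unit directions; b: (m,) directional estimates."""
    m, d = V.shape
    c = np.zeros(d + 1); c[-1] = 1.0                  # minimize the slack t
    ones = np.ones((m, 1))
    A_ub = np.vstack([np.hstack([V, -ones]),          #  V theta - t <=  b
                      np.hstack([-V, -ones])])        # -V theta - t <= -b
    b_ub = np.concatenate([b, -b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))
    return res.x[:d]

# Illustrative usage with random directions standing in for the packing:
rng = np.random.default_rng(4)
d, m = 5, 400
V = rng.normal(size=(m, d)); V /= np.linalg.norm(V, axis=1, keepdims=True)
theta_star = rng.normal(size=d)
b = V @ theta_star + 0.05 * rng.normal(size=m)        # noisy 1-d estimates
print(np.linalg.norm(aggregate_directions(V, b) - theta_star))
```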
A.2 Covariance estimation: Proof of Theorem 3.2 upper bound

We prove the claim for general $q\in(0,1]$. We devise a two-step estimator: with one half of the data, we compute the unnormalized empirical covariance $\tilde\Sigma$; we scale the remaining data by $M$; and we finally combine one-dimensional variance estimates of the scaled data over many directions, as in the mean estimation algorithm. We define
$$\tilde\Sigma = \frac{2}{n}\sum_{i=1}^{n/2} Z_iZ_i^\top\,\mathbf{1}\{Z_i\neq\star\}.$$
Observe that $\Sigma_\star^{-1/2}Z\,\mathbf{1}\{Z\neq\star\}$ is an $O(1)$-subgaussian random vector, so by the concentration of i.i.d. subgaussian random vectors (see, e.g., [Ver25, Theorem 4.7.1]) and our assumption that $n\gtrsim d+\log(1/\delta)$, we have with probability at least $1-\delta/2$ that
$$\Big\|\Sigma_\star^{-1/2}\tilde\Sigma\Sigma_\star^{-1/2} - \Sigma_\star^{-1/2}\,\mathbb{E}\big[ZZ^\top\mathbf{1}\{Z\neq\star\}\big]\,\Sigma_\star^{-1/2}\Big\|_{\mathrm{op}} \le \frac12\Big\|\Sigma_\star^{-1/2}\,\mathbb{E}\big[ZZ^\top\mathbf{1}\{Z\neq\star\}\big]\,\Sigma_\star^{-1/2}\Big\|_{\mathrm{op}} \le \frac12.$$
Furthermore, our contamination model implies that
$$q(1-\epsilon)\,\Sigma_\star \preceq \mathbb{E}\big[ZZ^\top\mathbf{1}\{Z\neq\star\}\big] \preceq \big(q(1-\epsilon)+\epsilon\big)\,\Sigma_\star,$$
and so, combining the above two displays, we have with probability at least $1-\delta/2$ that
$$\frac{q(1-\epsilon)}{2}\,\Sigma_\star \preceq \tilde\Sigma \preceq 2\big(q(1-\epsilon)+\epsilon\big)\,\Sigma_\star.$$
We condition on this event for the remainder of the proof. Letting $M = \tilde\Sigma^{-1/2}$, we have by rearranging the above display that
$$\frac{1}{2(q(1-\epsilon)+\epsilon)}\, I_d \preceq M\Sigma_\star M \preceq \frac{2}{q(1-\epsilon)}\, I_d. \qquad (15)$$
Now define the scaled data $\tilde Z_i = MZ_{i+n/2}$ for $i\in\{1,\ldots,n/2\}$, over which we will apply a univariate variance estimator. For this, we require the following lemma, whose proof we defer to Section A.3.2.

Lemma A.2. Let $\epsilon\in[0,1]$, $q\in(0,1]$, $n\in\mathbb{N}$, and $\delta\in(0,1)$, and define
$$\alpha = \sqrt{\frac{\log(\frac1\delta)}{q(1-\epsilon)n}}.$$
Further, let $\sigma_\star\in\mathbb{R}_{++}$, let $P_\sigma = \sigma^2\chi^2$ denote the scaled chi-squared distribution, and suppose that the observations $Z_1,\ldots,Z_n\overset{\mathrm{iid}}{\sim}R\in\mathcal{R}(P_{\sigma_\star},\epsilon,q)$. There exists a pair of universal, positive constants $C_1, C_2$ such that, if $n\ge C_1\frac{\log(\frac1\delta)}{q(1-\epsilon)}$, then there exists an estimator $\hat\sigma:\mathbb{R}^n_\star\to\mathbb{R}$ which satisfies, with probability at least $1-\delta$,
$$\big|\hat\sigma^2 - \sigma_\star^2\big| \le C_2\,\frac{\log(1+\tau)}{\log(1+\tau/\alpha)}\cdot\sigma_\star^2.$$

Continuing, let $\mathcal{N}$ denote a $\beta$-net of the unit sphere $\mathbb{S}^{d-1}$ for $\beta = \frac{1}{16\sqrt{1+\tau}}$, which has size $e^{O(d\log(1/\beta))}$. For each $v\in\mathcal{N}$, let $\hat\sigma^2_v$ be the estimator from Lemma A.2 applied to the observations $\{\tilde Z_i\cdot v\}_{i\in[n/2]}$ with failure probability $\delta/(2|\mathcal{N}|)$. Applying Lemma A.2 for each $v\in\mathcal{N}$ and a union bound, we have with probability at least $1-\delta/2$, for all $v\in\mathcal{N}$,
$$\big|\hat\sigma^2_v - v^\top M\Sigma_\star Mv\big| \le C_2\,\frac{\log(1+\tau)}{\log\Big(1+\tau\sqrt{\frac{nq(1-\epsilon)}{d\log(1/\beta)+\log(8/\delta)}}\Big)}\cdot v^\top M\Sigma_\star Mv.$$
Next, we define $\hat\Sigma$ as any $\Sigma$ such that, for all $v\in\mathcal{N}$,
$$\big|\hat\sigma^2_v - v^\top M\Sigma Mv\big| \le \underbrace{C_2\,\frac{\log(1+\tau)}{\log\Big(1+\tau\sqrt{\frac{nq(1-\epsilon)}{d\log(1/\beta)+\log(8/\delta)}}\Big)}}_{=:\,\gamma}\cdot\, v^\top M\Sigma Mv,$$
which is feasible, as $\Sigma_\star$ is one such matrix. Our choice of $\beta$ satisfies $\log(1/\beta)\asymp 1+\log(1+\tau)$, so it suffices to establish that $L(\hat\Sigma,\Sigma_\star) = O(\gamma)$. We can assume that $\gamma\le\frac14$; otherwise, we can instead take $\hat\Sigma = 0$, which satisfies $L(\hat\Sigma,\Sigma_\star) = 1\le 4\gamma$. By the guarantee of our estimator and the triangle inequality, we have for all $v\in\mathcal{N}$ that
$$\big|v^\top M\hat\Sigma Mv - v^\top M\Sigma_\star Mv\big| \le \gamma\, v^\top M\hat\Sigma Mv + \gamma\, v^\top M\Sigma_\star Mv.$$
Re-arranging this, we obtain that for all $v\in\mathcal{N}$,
$$\frac{1-\gamma}{1+\gamma}\,v^\top M\Sigma_\star Mv \;\le\; v^\top M\hat\Sigma Mv \;\le\; \frac{1+\gamma}{1-\gamma}\,v^\top M\Sigma_\star Mv.$$
For $\gamma\le\frac14$, we have $\frac{1-\gamma}{1+\gamma}\ge 1-3\gamma$ and $\frac{1+\gamma}{1-\gamma}\le 1+3\gamma$, and so the above guarantee implies that for all $v\in\mathcal{N}$,
$$\big|v^\top M(\hat\Sigma-\Sigma_\star)Mv\big| \le 3\gamma\, v^\top M\Sigma_\star Mv.$$
To turn this into a bound on the relative spectral norm, first observe that
$$L(\hat\Sigma,\Sigma_\star) = \sup_{v\neq 0}\frac{\big|v^\top\big(\Sigma_\star^{-1/2}\hat\Sigma\Sigma_\star^{-1/2} - I_d\big)v\big|}{v^\top v} = \sup_{v\neq 0}\frac{\big|v^\top\big(M\hat\Sigma M - M\Sigma_\star M\big)v\big|}{v^\top M\Sigma_\star Mv}.$$
To bound this, we define the matrix $E = \Sigma_\star^{-1/2}\hat\Sigma\Sigma_\star^{-1/2} - I_d$ and note both that $L(\hat\Sigma,\Sigma_\star) = \|E\|_{\mathrm{op}}$ and that for any $v\in\mathbb{R}^d$,
$$\frac{v^\top(M\hat\Sigma M - M\Sigma_\star M)v}{v^\top M\Sigma_\star Mv} = \tilde v^\top E\tilde v, \qquad\text{where}\qquad \tilde v = \big\|\Sigma_\star^{1/2}Mv\big\|^{-1}\,\Sigma_\star^{1/2}Mv.$$
For any $v\in\mathbb{S}^{d-1}$, we know there exists $u\in\mathcal{N}$ such that $\|v-u\|\le\beta$. Let $\tilde v$ and $\tilde u$ be defined as above for $v$ and $u$, respectively. Then we have
$$\left|\frac{v^\top(M\hat\Sigma M - M\Sigma_\star M)v}{v^\top M\Sigma_\star Mv} - \frac{u^\top(M\hat\Sigma M - M\Sigma_\star M)u}{u^\top M\Sigma_\star Mu}\right| = \big|\tilde v^\top E\tilde v - \tilde u^\top E\tilde u\big| \le \big|\tilde v^\top E\tilde v - \tilde u^\top E\tilde v\big| + \big|\tilde u^\top E\tilde v - \tilde u^\top E\tilde u\big| \le 2\,\|E\|_{\mathrm{op}}\cdot\|\tilde v - \tilde u\|_2. \qquad (16)$$
Furthermore, we can bound the distance between $\tilde v$ and $\tilde u$ as
$$\|\tilde v - \tilde u\| = \left\|\frac{\Sigma_\star^{1/2}Mv}{\|\Sigma_\star^{1/2}Mv\|} - \frac{\Sigma_\star^{1/2}Mu}{\|\Sigma_\star^{1/2}Mu\|}\right\| \le \left\|\frac{\Sigma_\star^{1/2}Mv}{\|\Sigma_\star^{1/2}Mv\|} - \frac{\Sigma_\star^{1/2}Mv}{\|\Sigma_\star^{1/2}Mu\|}\right\| + \left\|\frac{\Sigma_\star^{1/2}Mv}{\|\Sigma_\star^{1/2}Mu\|} - \frac{\Sigma_\star^{1/2}Mu}{\|\Sigma_\star^{1/2}Mu\|}\right\| \le \frac{2\,\|\Sigma_\star^{1/2}M(v-u)\|}{\|\Sigma_\star^{1/2}Mu\|} \le 4\beta\sqrt{\frac{q(1-\epsilon)+\epsilon}{q(1-\epsilon)}} = 4\beta\sqrt{1+\tau},$$
where the final inequality follows from (15).
Plugging this bound into (16) and taking a supremum over $v\in\mathbb{S}^{d-1}$, we obtain that
$$L(\hat\Sigma,\Sigma_\star) \le 3\gamma + 8\beta\sqrt{1+\tau}\cdot L(\hat\Sigma,\Sigma_\star) \;\implies\; L(\hat\Sigma,\Sigma_\star) \le 6\gamma,$$
where the final implication follows from our choice of $\beta$.

A.3 Univariate upper bounds

Our information-theoretically optimal mean estimation algorithm follows the well-worn path of applying optimal univariate estimators over a covering. For our univariate estimators, we use the minimum Kolmogorov distance estimator introduced by [MVBWS24]: a minimum distance estimator using the Kolmogorov distance $d_{\mathrm{K}}:\mathcal{P}(\mathbb{R}_\star)\times\mathcal{P}(\mathbb{R}_\star)\to\mathbb{R}$, defined as
$$d_{\mathrm{K}}(R_1, R_2) := \sup_{A\in\mathcal{A}}\big|R_1(A) - R_2(A)\big|,$$
where we take the collection $\mathcal{A}$ to denote all upper half-intervals $\mathcal{A} := \{(-\infty, r]\mid r\in\mathbb{R}\}$. We note in passing that this definition of the Kolmogorov distance is a symmetrized variant of the usual Kolmogorov distance.

Given a collection of probability measures $\mathcal{R}'\subseteq\mathcal{P}(\mathbb{R}_\star)$ and a single probability measure $R\in\mathcal{P}(\mathbb{R}_\star)$, we define the projection in Kolmogorov distance onto the set $\mathcal{R}'$ as
$$d_{\mathrm{K}}(R,\mathcal{R}') := \inf_{R'\in\mathcal{R}'} d_{\mathrm{K}}(R, R').$$
Next, given observations $Z_1,\ldots,Z_n\in\mathbb{R}_\star$, we let
$$\hat R_n := \frac1n\sum_{i=1}^n\delta_{Z_i}$$
denote the empirical distribution of the observations. Importantly, the empirical distribution converges in Kolmogorov distance to its population version at the parametric rate, as the following lemma establishes.

Lemma A.3. Let $\epsilon\in[0,1)$, $q\in(0,1]$, and $P\in\mathcal{P}(\mathbb{R})$. Suppose that $Z_1,\ldots,Z_n\overset{\mathrm{iid}}{\sim}R\in\mathcal{R}(P,\epsilon,q)$ and let $\hat R_n$ denote the empirical distribution of the observations. For any $\delta\in(0,1)$, it holds that
$$\mathbb{P}\left(d_{\mathrm{K}}\big(\hat R_n,\,\mathcal{R}(P,\epsilon,q)\big) > 3\sqrt{\frac{\{q(1-\epsilon)+\epsilon\}\log(\frac1\delta)}{n}}\right) \le \delta.$$
We omit the proof of this lemma, as it follows an identical sequence of steps to those leading to [MVBWS24, Ineq. (56)] (in particular, applying the DKW inequality [Mas90] in conjunction with Bernstein's inequality [Ver25]). Note that here we depart slightly from the setting of [MVBWS24] in that we consider the symmetrized Kolmogorov distance, but a careful inspection of their proof shows that the result remains unchanged.

A.3.1 Univariate mean estimation: Proof of Corollary A.1

Let $P_\theta = N(\theta,1)$. We consider the minimum distance estimator $\hat\theta$, defined to be any element
$$\hat\theta(Z_1,\ldots,Z_n)\in\arg\min_{\theta\in\mathbb{R}}\; d_{\mathrm{K}}\big(\hat R_n,\,\mathcal{R}(P_\theta,\epsilon,q)\big). \qquad (17)$$
In words, the estimator $\hat\theta\in\mathbb{R}$ is taken as the parameter whose set of realizable contaminations contains the element closest to the empirical distribution $\hat R_n$.
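For intuition, the projection distance in (17) can be lower-bounded by checking, pointwise in $r$, how far the empirical CDF falls outside the envelope $[q(1-\epsilon)\Phi(r-\theta),\,(q(1-\epsilon)+\epsilon)\Phi(r-\theta)]$ implied by the contamination model (see the sandwich relation used in Case 4 below). This pointwise relaxation ignores joint feasibility across different $r$, so it is only a heuristic surrogate for $d_{\mathrm{K}}(\hat R_n,\mathcal{R}(P_\theta,\epsilon,q))$ and not the quantity our analysis controls. A sketch, assuming NumPy and SciPy, with function names of our own choosing:

```python
import numpy as np
from scipy.stats import norm

def envelope_distance(z, theta, eps, q):
    """Pointwise-envelope surrogate for d_K(R_hat_n, R(P_theta, eps, q)).

    `z` uses np.nan as the missingness marker; the empirical CDF divides by
    the full sample size n, so missing entries never fall in (-inf, r]."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    r = np.sort(z[~np.isnan(z)])              # candidate thresholds suffice
    if r.size == 0:
        return 0.0
    ecdf = np.searchsorted(r, r, side="right") / n
    phi = norm.cdf(r - theta)
    lo = q * (1 - eps) * phi                  # smallest feasible mass
    hi = (q * (1 - eps) + eps) * phi          # largest feasible mass
    return float(max(0.0, np.max(np.maximum(lo - ecdf, ecdf - hi))))

def min_distance_estimate(z, eps, q, grid):
    """Grid-search stand-in for the minimum-distance estimator (17)."""
    vals = [envelope_distance(z, t, eps, q) for t in grid]
    return grid[int(np.argmin(vals))]
```

One would pass a grid of candidate means around a pilot estimate to `min_distance_estimate`; we stress again that this surrogate is illustrative only.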
Applying Lemma A.3 in conjunction with the triangle inequality, we deduce that
$$d_{\mathrm{K}}\big(\mathcal{R}(P_{\hat\theta},\epsilon,q),\,\mathcal{R}(P_{\theta_\star},\epsilon,q)\big) \le 6\sqrt{\frac{\{q(1-\epsilon)+\epsilon\}\log(\frac1\delta)}{n}} = 6\,q(1-\epsilon)\,\alpha\sqrt{1+\tau}.$$
Note further that, for any $r\in\mathbb{R}$,
$$d_{\mathrm{K}}\big(\mathcal{R}(P_\theta,\epsilon,q),\,\mathcal{R}(P_{\theta_\star},\epsilon,q)\big) \ge q(1-\epsilon)\cdot\mathbb{P}(G+\theta\le\theta_\star - r) - \{q(1-\epsilon)+\epsilon\}\cdot\mathbb{P}(G+\theta_\star\le\theta_\star - r) = q(1-\epsilon)\cdot\big\{\underbrace{\bar\Phi(r-t) - (1+\tau)\bar\Phi(r)}_{=:\,g(t;r)}\big\},$$
where above $G\sim N(0,1)$, and recall that $\bar\Phi$ is the survival function of the standard Gaussian. The final relation follows by taking $\theta = \theta_\star - t$. The function $g:\mathbb{R}_+\times\mathbb{R}\to\mathbb{R}$ defined above is strictly increasing in its first argument. Hence,
$$\big|\hat\theta - \theta_\star\big| \le \sup\Big\{t\ge 0 : |\theta-\theta_\star|\le t \text{ and } d_{\mathrm{K}}\big(\mathcal{R}(P_\theta,\epsilon,q),\mathcal{R}(P_{\theta_\star},\epsilon,q)\big) \le 6q(1-\epsilon)\alpha\sqrt{1+\tau}\Big\} \le \sup\Big\{t\ge 0 : \sup_{r\in\mathbb{R}} g(t;r) \le 6\alpha\sqrt{1+\tau}\Big\} = \inf\Big\{t\ge 0 : \sup_{r\in\mathbb{R}} g(t;r) > 6\alpha\sqrt{1+\tau}\Big\}.$$
Re-arranging and invoking the decreasing nature of $\bar\Phi^{-1}$, we see that for any choice of $r>0$, the condition is satisfied if
$$t \ge r - \bar\Phi^{-1}\big((1+\tau)\bar\Phi(r) + 6\alpha\sqrt{1+\tau}\big). \qquad (18)$$
We next consider several cases for the value of $\tau$, select values of $r$, and upper bound the right-hand side of (18). To facilitate the analysis, we introduce the shorthand $\Delta := \tau\bar\Phi(r) + 6\alpha\sqrt{1+\tau}$. Applying the mean value theorem in conjunction with Lemma G.11, we deduce that
$$\bar\Phi^{-1}\big(\bar\Phi(r)+\Delta\big) \;\ge\; r - \Delta\cdot\sup_{x'\in[\bar\Phi(r),\,\bar\Phi(r)+\Delta]}\frac{1}{\phi\big(\bar\Phi^{-1}(x')\big)} \;\ge\; r - \Delta\cdot\sup_{x'\in[\bar\Phi(r),\,\bar\Phi(r)+\Delta]}\frac{1}{x'\,\bar\Phi^{-1}(x')} \;\ge\; r - \frac{\Delta}{\bar\Phi(r)\cdot\bar\Phi^{-1}\big(\bar\Phi(r)+\Delta\big)}. \qquad (19)$$
We proceed by considering several cases on $\tau$. Before doing so, first observe that our assumption $n\ge C_1\frac{\log(\frac1\delta)}{q(1-\epsilon)}$ implies that $\alpha^2\le 1/C_1$.

Case 1: $\tau\le 125\alpha$. In this case, we set $r=2$. Expanding the definition of $\Delta$ and taking $C_1$ sufficiently large yields the conclusion $\Delta\le C\alpha$ for a sufficiently large constant $C$. Applying the first inequality in (19), we find that
$$\bar\Phi^{-1}\big(\bar\Phi(r)+\Delta\big) \ge r - \Delta\cdot\sup_{x'\in[\bar\Phi(r),\,\bar\Phi(r)+\Delta]}\frac{1}{\phi\big(\bar\Phi^{-1}(x')\big)} \ge r - \frac{\Delta}{\phi(r)} = r - \Delta\sqrt{2\pi}\,e^{r^2/2} = r - \Delta\, e^2\sqrt{2\pi},$$
where the second inequality follows because the Gaussian PDF $\phi$ is decreasing on the non-negative reals. It thus follows that any separation
$$t \ge r - \big(r - C e^2\sqrt{2\pi}\,\alpha\big) = C e^2\sqrt{2\pi}\,\alpha \asymp \frac{\log(1+\tau)}{\sqrt{\log(1+\tau^2/\alpha^2)}}$$
suffices, where the final equivalence (up to constant factors) holds because $\tau$ and $\tau/\alpha$ are both bounded by a constant in this regime.

Case 2: $125\alpha<\tau\le 8$. In this case, we set
$$r = \sqrt{\log\Big(\frac{\tau}{6\sqrt{2\pi(1+\tau)}\,\alpha}\Big)} \;\implies\; 6\sqrt{2\pi(1+\tau)}\cdot\alpha\, e^{r^2} = \tau,$$
and note that $r\ge 1$ for this range of $\tau$. Moreover, applying the standard Mills' ratio lower bound (see Lemma G.10) yields
$$\bar\Phi(r) \ge \frac{e^{-r^2/2}}{r\sqrt{2\pi}} \ge \frac{e^{-r^2}}{\sqrt{2\pi}}.$$
It thus follows that
$$\frac{\Delta}{\bar\Phi(r)} \le \tau + 6\sqrt{2\pi(1+\tau)}\cdot\alpha\, e^{r^2} = 2\tau.$$
On the other hand, applying the first inequality in (19) in conjunction with the decreasing nature of the Gaussian PDF $\phi$, we deduce that
$$\bar\Phi^{-1}\big(\bar\Phi(r)+\Delta\big) \ge r - \frac{\Delta}{\phi(r)} \ge r - \frac{\tau}{r} - 6\sqrt{2\pi(1+\tau)}\cdot\alpha\, e^{r^2/2} \ge r - \frac{2\tau}{r},$$
where the penultimate inequality follows from the Mills' ratio upper bound $\bar\Phi(x)\le\phi(x)/x$ (see Lemma G.10) and the final inequality follows because $r\ge 1$. Combining the inequalities in the previous two displays with the final inequality in (19), we find that any separation
$$t \ge r - \Big(r - \frac{2\tau}{r}\Big) = \frac{2\tau}{r} = \frac{2\tau}{\sqrt{\log(\tau/\alpha) - \log\big(6\sqrt{2\pi(1+\tau)}\big)}}$$
suffices. To conclude this case, we have that $\log(1+\tau)\asymp\tau$ because $\tau<8$, and
$$\log(\tau/\alpha) - \log\big(6\sqrt{2\pi(1+\tau)}\big) \asymp \log(\tau/\alpha) \asymp \log(1+\tau^2/\alpha^2)$$
because $\frac{\tau}{\alpha}>125$ and $6\sqrt{2\pi(1+\tau)}<\frac{125}{2}$.

Case 3: $8<\tau\le\frac{1}{20}\alpha^{-1/4}$. In this case, we take $r = \sqrt{\log\big(\frac{1}{6\sqrt{2\pi}\,\alpha}\big)}$. Note that under this setting, taking $C_1$ large enough implies that $r\ge 1$, which, combined with the Mills' ratio lower bound (see Lemma G.10), gives
$$\bar\Phi(r) \ge \frac{r}{r^2+1}\cdot\frac{e^{-r^2/2}}{\sqrt{2\pi}} \ge \frac{1}{2\sqrt{2\pi}}\,e^{-r^2} = 3\alpha.$$
Then, because $\tau>8$, we find that $6\alpha\sqrt{1+\tau}\le\tau\bar\Phi(r)$. In turn, this implies that
$$\bar\Phi^{-1}\big(\bar\Phi(r)+\Delta\big) \ge \bar\Phi^{-1}\big(2\tau\bar\Phi(r)\big).$$
We claim that, with the shorthand $\zeta := \bar\Phi^{-1}\big(2\tau\bar\Phi(r)\big)$, the following inequality holds, deferring its proof to the end of the case:
$$\zeta^2 \ge r^2 - 2\log(2\tau) + 2\log\Big(\frac{r}{\zeta + 1/\zeta}\Big). \qquad (20)$$
We further note the weaker inequality
$$\zeta \;\ge\; \frac12\sqrt{\log\Big(\frac{1}{2\tau\bar\Phi(r)}\Big)} \;\overset{(a)}{\ge}\; \frac12\sqrt{\log\Big(\frac{e^{r^2/2}}{2\tau}\Big)} \;=\; \frac12\sqrt{\frac{r^2}{2} + \log\big(1/(2\tau)\big)} \;\overset{(b)}{\ge}\; \frac{r}{4} \;\ge\; \frac14,$$
where step (a) follows from the bound $\bar\Phi(r)\le e^{-r^2/2}$, and step (b) follows since, for the considered range of $\tau$ and the setting of $r$, we have $2\tau\le e^{r^2/4}$. On the other hand, we have that $\zeta\le r$ because $2\tau>1$ and $\bar\Phi^{-1}$ is decreasing. Combining these inequalities, we deduce that
$$\frac{r}{\zeta+1/\zeta} \ge \frac{r}{r+4} \ge \frac15.$$
Substituting this back into the inequality (20) then yields
$$\zeta^2 \ge r^2 - 2\log(10\tau) \ge r^2/2,$$
where the final inequality follows because $10\tau\le e^{r^2/4}$. Hence, we find that
$$r - \sqrt{r^2 - 2\log(\tau/4)} \le \frac{4\log(\tau/4)}{r},$$
and consequently deduce that any separation
$$t \ge \frac{4\log(\tau/4)}{r} = \frac{4\log(\tau/4)}{\sqrt{\log\big(\frac{1}{6\sqrt{2\pi}\,\alpha}\big)}}$$
suffices. Then we have that $\log(1+\tau)\asymp\log(\tau/4)$ because $\tau>8$, and $\log\big(\frac{1}{6\sqrt{2\pi}\alpha}\big)\asymp\log(\tau/\alpha)$ because, taking $C_1$ large enough, $\log(1/\alpha)$ is bounded below by a sufficiently large constant and $\log 8\le\log(\tau)\le\frac14\log(1/\alpha) - \log(20)$. It remains to establish the inequality (20).

Proof of the inequality (20): Note that $2\tau\bar\Phi(r)\le\frac12$, so that $\zeta\ge 0$. We thus apply the standard Mills' ratio bounds (see Lemma G.10) to obtain the sandwich relation
$$\frac{\zeta}{\zeta^2+1}\,\phi(\zeta) \;\le\; \bar\Phi(\zeta) = 2\tau\bar\Phi(r) \;\le\; \frac{2\tau}{r}\,\phi(r).$$
It thus follows that
$$\frac{1}{\zeta+1/\zeta}\,e^{-\zeta^2/2} \le \frac{2\tau}{r}\,e^{-r^2/2}.$$
Hence, taking logarithms and re-arranging yields the conclusion
$$\zeta^2 \ge r^2 - 2\log(2\tau) + 2\log\Big(\frac{r}{\zeta+1/\zeta}\Big),$$
as desired.

Case 4: $\tau>\frac{1}{20}\alpha^{-1/4}$. In this case, we take a slightly modified version of the previous estimator. Rather than the Kolmogorov distance, we take the conditional Kolmogorov distance
$$d_{\mathrm{K}}(P, Q\mid\mathbb{R}) := d_{\mathrm{K}}\big(P(\cdot\mid\mathbb{R}),\,Q(\cdot\mid\mathbb{R})\big),$$
and we take $\hat\theta$ to be
$$\hat\theta(Z_1,\ldots,Z_n)\in\arg\min_{\theta\in\mathbb{R}}\; d_{\mathrm{K}}\big(\hat R_n,\,\mathcal{R}(P_\theta,\epsilon,q)\mid\mathbb{R}\big). \qquad (21)$$
Similarly to the previous estimator, we can easily bound the conditional Kolmogorov distance between $P_{\hat\theta}$ and $P_{\theta_\star}$. Let $\bar q = R(\mathbb{R})\ge q(1-\epsilon)$, let $\bar R = R(\cdot\mid\mathbb{R})$, and let $S\subseteq[n]$ be the random set of indices $S = \{i\in[n]\mid Z_i\neq\star\}$. Observe that $|S|\sim\mathrm{Bin}(n,\bar q)$. By Bernstein's inequality and our assumption that $n\gtrsim\frac{\log(1/\delta)}{q(1-\epsilon)}$, we have $|S|\ge\bar q n/2$ with probability at least $1-\delta/2$. Now, conditioned on $S$, observe that $\{Z_i\}_{i\in S}\overset{\mathrm{iid}}{\sim}\bar R^{\otimes|S|}$, so we can apply (e.g.) the DKW inequality [Mas90] to obtain that with probability at least $1-\delta/2$,
$$d_{\mathrm{K}}\big(\hat R_n,\,\mathcal{R}(P_{\theta_\star},\epsilon,q)\mid\mathbb{R}\big) \le d_{\mathrm{K}}\big(\hat R_n,\, R\mid\mathbb{R}\big) = d_{\mathrm{K}}\Big(\frac{1}{|S|}\sum_{i\in S}\delta_{Z_i},\;\bar R\Big) \le \sqrt{\frac{\log(\frac1\delta)}{2|S|}}.$$
Thus, by a union bound and the triangle inequality, we have with probability at least $1-\delta$ that
$$d_{\mathrm{K}}\big(\mathcal{R}(P_{\hat\theta},\epsilon,q),\,\mathcal{R}(P_{\theta_\star},\epsilon,q)\mid\mathbb{R}\big) \le 2\sqrt{\frac{\log(\frac1\delta)}{q(1-\epsilon)n}} \le 2\,C_1^{-1/2}. \qquad (22)$$
We will be done once we establish a lower bound on $d_{\mathrm{K}}(\mathcal{R}(P_{\theta_\star},\epsilon,q),\mathcal{R}(P_\theta,\epsilon,q)\mid\mathbb{R})$ for any $\theta$ such that $|\theta-\theta_\star|$ is sufficiently large. Let $R\in\mathcal{R}(P_{\theta_\star},\epsilon,q)$ and $R'\in\mathcal{R}(P_\theta,\epsilon,q)$ be arbitrary contaminations. By translating and negating the observations if necessary, we may assume without loss of generality that $\theta_\star = 0$ and $\theta>0$. Then, by the definition of our contamination set, we have for all $S\subseteq\mathbb{R}$ that
$$q(1-\epsilon)\,\Phi(S) \;\le\; R(S) \;\le\; \big(q(1-\epsilon)+\epsilon\big)\,\Phi(S) = (1+\tau)\,q(1-\epsilon)\,\Phi(S),$$
and similarly for $R'$ with $\Phi(S-\theta)$ in place of $\Phi(S)$.
Noting that $(a,b)\mapsto a/(a+b)$ is increasing in $a$ and decreasing in $b$, we have that for any $r\in\mathbb{R}_+$,
$$\begin{aligned}
d_{\mathrm{K}}(R, R'\mid\mathbb{R}) &\ge R(\cdot\le r\mid\mathbb{R}) - R'(\cdot\le r\mid\mathbb{R})\\
&= \frac{R(\cdot\le r)}{R(\cdot\le r) + R(\cdot>r)} - \frac{R'(\cdot\le r)}{R'(\cdot\le r) + R'(\cdot>r)}\\
&\ge \frac{\Phi(r)}{\Phi(r) + (1+\tau)(1-\Phi(r))} - \frac{(1+\tau)\Phi(r-\theta)}{(1+\tau)\Phi(r-\theta) + (1-\Phi(r-\theta))}\\
&= \frac{\Phi(r)\big(1-\Phi(r-\theta)\big) - (1+\tau)^2\,\Phi(r-\theta)\big(1-\Phi(r)\big)}{\big[\Phi(r)+(1+\tau)(1-\Phi(r))\big]\,\big[(1+\tau)\Phi(r-\theta)+(1-\Phi(r-\theta))\big]}\\
&= \frac{\Phi(r)\,\Phi(\theta-r) - (1+\tau)^2\,\Phi(r-\theta)\,\Phi(-r)}{\big[\Phi(r)+(1+\tau)\Phi(-r)\big]\,\big[(1+\tau)\Phi(r-\theta)+\Phi(\theta-r)\big]},
\end{aligned}$$
where the final equality follows by the symmetry of the Gaussian distribution. Now, for $\theta$ such that $\Phi(-\theta/2)\le\frac14(1+\tau)^{-1}$, by taking $r = \theta/2$ we have
$$d_{\mathrm{K}}(R,R'\mid\mathbb{R}) \ge \frac{\Phi(\theta/2)^2 - (1+\tau)^2\,\Phi(-\theta/2)^2}{\big[\Phi(\theta/2) + (1+\tau)\Phi(-\theta/2)\big]^2} \ge \frac{\frac14 - \frac{1}{16}}{\big(\frac12+\frac14\big)^2} \ge 3\,C_1^{-1/2},$$
where we used that $\Phi(\theta/2)\ge\frac12$ and $\Phi(-\theta/2)\le\frac14(1+\tau)^{-1}$ in the second inequality, and took $C_1$ sufficiently large in the final inequality. By a Gaussian tail bound, $\Phi(-\theta/2)\le e^{-\theta^2/8}$, so $\Phi(-\theta/2)\le\frac14(1+\tau)^{-1}$ holds for all $\theta\ge\sqrt{8\log(4(1+\tau))}$. Therefore, by (22), we have that with probability at least $1-\delta$,
$$\big|\hat\theta - \theta_\star\big| \le \sqrt{8\log\big(4(1+\tau)\big)} \asymp \frac{\log(1+\tau)}{\sqrt{\log(1+\tau^2/\alpha^2)}},$$
where the final asymptotic equivalence holds because $\tau>\frac{1}{20}\alpha^{-1/4}$. This proves the desired result in this case and concludes the proof of the corollary.

A.3.2 Univariate variance estimation: Proof of Lemma A.2

We note that this proof is nearly identical to that of Corollary A.1; hence, we omit several details already provided in that proof. We consider the minimum distance estimator $\hat\sigma$, defined as the supremum
$$\hat\sigma(Z_1,\ldots,Z_n) = \sup\Big\{\sigma\in\mathbb{R}_{++} \;\Big|\; d_{\mathrm{K}}\big(\hat R_n,\,\mathcal{R}(P_\sigma,\epsilon,q)\big) \le 3\,q(1-\epsilon)\,\alpha\sqrt{1+\tau}\Big\}. \qquad (23)$$
Applying Lemma A.3 in conjunction with the triangle inequality, we deduce that $\hat\sigma\ge\sigma_\star$ and
$$d_{\mathrm{K}}\big(\mathcal{R}(P_{\hat\sigma},\epsilon,q),\,\mathcal{R}(P_{\sigma_\star},\epsilon,q)\big) \le 6\,q(1-\epsilon)\,\alpha\sqrt{1+\tau},$$
so it remains to upper bound $\hat\sigma$. Note further that, for any $r\in\mathbb{R}$,
$$d_{\mathrm{K}}\big(\mathcal{R}(P_\sigma,\epsilon,q),\,\mathcal{R}(P_{\sigma_\star},\epsilon,q)\big) \ge \frac12\Big[q(1-\epsilon)\cdot\mathbb{P}\big(\sigma^2G^2\ge r\big) - \{q(1-\epsilon)+\epsilon\}\cdot\mathbb{P}\big(\sigma_\star^2G^2\ge r\big)\Big] = q(1-\epsilon)\,\bar\Phi\Big(\sqrt{\tfrac{r}{\sigma^2}}\Big) - \{q(1-\epsilon)+\epsilon\}\,\bar\Phi\Big(\sqrt{\tfrac{r}{\sigma_\star^2}}\Big),$$
where above we have used the notation $G\sim N(0,1)$ and let $\bar\Phi$ denote the survival function of the standard Gaussian (that is, $\bar\Phi(x) = \mathbb{P}(G\ge x)$). Letting $\sigma^2 = \sigma_\star^2(1+t)^2$ and re-scaling, we see that
$$d_{\mathrm{K}}\big(\mathcal{R}(P_\sigma,\epsilon,q),\,\mathcal{R}(P_{\sigma_\star},\epsilon,q)\big) \ge q(1-\epsilon)\,\Big\{\underbrace{\bar\Phi\Big(\frac{r}{1+t}\Big) - (1+\tau)\,\bar\Phi(r)}_{=:\,g(t,r)}\Big\}.$$
Note that the function $g$ defined above is strictly increasing in its first argument. It thus follows that, with
$$\psi(\epsilon,q,n,\delta) := \sup\Big\{t\ge 0 : \sup_{r\in\mathbb{R}} g(t;r) \le 6\alpha\sqrt{1+\tau}\Big\} = \inf\Big\{t\ge 0 : \sup_{r\in\mathbb{R}} g(t;r) > 6\alpha\sqrt{1+\tau}\Big\},$$
we have
$$\big|\hat\sigma^2 - \sigma_\star^2\big| \le \sigma_\star^2\cdot\Big(2\,\psi(\epsilon,q,n,\delta) + \{\psi(\epsilon,q,n,\delta)\}^2\Big).$$
Re-arranging and invoking the decreasing nature of $\bar\Phi^{-1}$, we see that for any choice of $r>0$, the condition is satisfied if
$$t \ge \frac{r}{\bar\Phi^{-1}\big((1+\tau)\bar\Phi(r) + 6\alpha\sqrt{1+\tau}\big)} - 1. \qquad (24)$$
Let us now bound this term in several cases, depending on the value of $\tau$.

Case 1: $\tau\le 125\alpha$. In this case, we set $r=2$. The same sequence of steps as in Case 1 of the proof of Corollary A.1 yields
$$\bar\Phi^{-1}\big((1+\tau)\bar\Phi(r) + 6\alpha\sqrt{1+\tau}\big) \ge r - \Delta\, e^2\sqrt{2\pi},$$
where $\Delta = \tau\bar\Phi(r) + 6\alpha\sqrt{1+\tau}$.
Hence, since $\Delta\lesssim\alpha$ is smaller than a sufficiently small constant,
$$\frac{r}{r - \Delta e^2\sqrt{2\pi}} - 1 \le \frac{\Delta e^2\sqrt{2\pi}}{r - \Delta e^2\sqrt{2\pi}} \le C\alpha,$$
and so any separation $\psi(\epsilon,q,n,\delta) = C\alpha$ suffices. In turn, for large enough $C_1$, this implies that $\psi(\epsilon,q,n,\delta)<1$, so that in this case
$$\big|\hat\sigma^2 - \sigma_\star^2\big| \le C\alpha\,\sigma_\star^2 \asymp \frac{\log(1+\tau)}{\log(1+\tau/\alpha)}\,\sigma_\star^2,$$
where the last equivalence is because $\tau\lesssim\alpha\lesssim 1$.

Case 2: $125\alpha<\tau\le 8$. As in Corollary A.1, in this case we set $r = \sqrt{\log(\tau/12\alpha)}$ and deduce the lower bound
$$\bar\Phi^{-1}\big((1+\tau)\bar\Phi(r) + 6\alpha\sqrt{1+\tau}\big) \ge r - \frac{2\tau}{r/2}.$$
Hence, since
$$\frac{r}{r - \frac{2\tau}{r/2}} - 1 \le \frac{C\tau}{r^2} < 1,$$
we deduce that in this case
$$\big|\hat\sigma^2 - \sigma_\star^2\big| \le \frac{C\tau}{\log(\tau/12\alpha)}\,\sigma_\star^2 \asymp \frac{\log(1+\tau)}{\log(1+\tau/\alpha)}\,\sigma_\star^2,$$
where the last equivalence is because $\alpha\lesssim\tau\lesssim 1$.

Case 3: $8<\tau\le\frac{1}{20}\alpha^{-1/4}$. As in Corollary A.1, in this case we take $r = \sqrt{\log\big(\frac{1}{12\sqrt{10}\,\alpha}\big)}$. Under this setting, following the same steps as in Corollary A.1, we find that
$$\bar\Phi^{-1}\big((1+\tau)\bar\Phi(r) + 6\alpha\sqrt{1+\tau}\big) \ge \sqrt{r^2 - 2\log(\tau/4)}.$$
Thus, since
$$\frac{r}{\sqrt{r^2 - 2\log(\tau/4)}} - 1 \le \frac{C\log(\tau/4)}{r^2},$$
we deduce that in this case
$$\big|\hat\sigma^2 - \sigma_\star^2\big| \le \frac{C\log\tau}{\log\big(\frac{1}{12\sqrt{10}\,\alpha}\big)}\,\sigma_\star^2 \asymp \frac{\log(1+\tau)}{\log(1+\tau/\alpha)}\,\sigma_\star^2,$$
where the last equivalence is because $\alpha\lesssim 1$ and $1\lesssim\tau\lesssim\alpha^{-1/4}$.

Case 4: $\tau>\frac{1}{20}\alpha^{-1/4}$. In this case, instead of our previous estimator, we simply take $\hat\sigma = 0$. Because $\alpha^2\le\frac{1}{C_1}$ for sufficiently large $C_1$ and $\tau>\frac{1}{20}\alpha^{-1/4}$, we have that
$$\big|\sigma_\star^2 - \hat\sigma^2\big| = \sigma_\star^2 \le C_2\,\frac{\log(1+\tau)}{\log(1+\tau/\alpha)}\,\sigma_\star^2$$
by picking $C_2$ sufficiently large. This proves the desired result in all cases.

B Proof of upper bound for linear regression

B.1 Linear regression: Proof of Theorem 5.1 upper bound

We prove the claim for general $q\in(0,1]$. We show that our choice of estimator $\hat\theta^{(k)}_n$ achieves the desired rate. To do so, we introduce some key lemmas to control relevant quantities with high probability. We defer the proofs of Lemmata B.1 and B.2 to the subsequent subsections.

Lemma B.1. Let $\alpha$ be as in Theorem 5.1, let $m = q(1-\epsilon)n$, and suppose that $\delta\ge m^{-c}$ for some constant $c>0$. With probability at least $1-O(\delta)$, it holds that
$$\big\|\nabla F^{(k)}_n(\theta_\star)\big\| \lesssim q(1-\epsilon)\,k\,\sigma^{2k-1}\cdot\Big\{\tau\,(2k-2)!! + \alpha(1+\sqrt{\tau})\sqrt{(4k-3)!!} + \gamma_1 m^{-1/4} + \gamma_2 m^{-1/2} + \gamma_3 m^{-1}\Big\}, \qquad (25)$$
where
$$\begin{aligned}
\gamma_1 &= \alpha\,\log^{1/2}(m)\sqrt{(4k-3)!!} + \alpha\big(1+\tau^{1/4}\big)\,O\big(\log\{(1+\tau)m\}\big)^{k/2},\\
\gamma_2 &= \alpha\Big[\log^{1/2}(m)\sqrt{(4k-3)!!} + O\big(\log\{(1+\tau)m\}\big)^{\frac k2+\frac14}\Big] + \sqrt{\tau}\,O\big(\log\{(1+\tau)m\}\big)^{k/2},\\
\gamma_3 &= \log(m)\,(2k-2)!! + O\big(\log\{(1+\tau)m\}\big)^{\frac{k+1}{2}}.
\end{aligned}$$

Lemma B.2. Let $\alpha$ be as in Theorem 5.1. Suppose $\alpha^{-2}\gtrsim e^{10k\log^2 k}$. With probability at least $1-O(\delta)$, the empirical risk $F^{(k)}_n$ is uniformly strongly convex:
$$\inf_{\theta\in\mathbb{R}^d}\;\inf_{v\in\mathbb{S}^{d-1}}\; v^\top\nabla^2 F^{(k)}_n(\theta)\,v \;\gtrsim\; q(1-\epsilon)\,\sigma^{2k-2}\,(2k+1)!!.$$

Proof of Theorem 5.1. Suppose first that we choose $k$ satisfying $e^{10k\log^2k}\lesssim\alpha^{-2}$. By Lemma B.2, we deduce that with probability at least $1-\delta/2$, $F^{(k)}_n$ is $\mu_n$-strongly convex. From strong convexity of $F^{(k)}_n$, we obtain the inequality
$$\Big(\nabla F^{(k)}_n\big(\hat\theta^{(k)}_n\big) - \nabla F^{(k)}_n(\theta_\star)\Big)^\top\big(\hat\theta^{(k)}_n - \theta_\star\big) \ge \mu_n\cdot\big\|\hat\theta^{(k)}_n - \theta_\star\big\|_2^2.$$
Since, by definition, $\nabla F^{(k)}_n\big(\hat\theta^{(k)}_n\big) = 0$, applying the Cauchy–Schwarz inequality to the left-hand side and re-arranging yields
$$\big\|\theta_\star - \hat\theta^{(k)}_n\big\|_2 \le \frac{\big\|\nabla F^{(k)}_n(\theta_\star)\big\|_2}{\mu_n}.$$
Applying Lemmas B.1 and B.2, which hold with probability $1-\delta$, then yields
$$\frac{1}{\sigma}\cdot\big\|\hat\theta^{(k)}_n - \theta_\star\big\| \;\lesssim\; k\cdot\frac{\tau(2k-2)!! + \alpha(1+\sqrt{\tau})\sqrt{(4k-3)!!} + \gamma_1m^{-1/4} + \gamma_2m^{-1/2} + \gamma_3m^{-1}}{(2k+1)!!} \;\lesssim\; \frac{\tau}{\sqrt k} + \frac{2^k\,\alpha(1+\sqrt\tau)}{\sqrt k} + \frac{\gamma_1m^{-1/4} + \gamma_2m^{-1/2} + \gamma_3m^{-1}}{(2k-1)!!}, \qquad (26)$$
\[
\frac1\sigma\,\big\|\hat\theta_n^{(k)}-\theta_\star\big\| \lesssim k\cdot\frac{\tau(2k-2)!!+\alpha(1+\sqrt\tau)\sqrt{(4k-3)!!}+\gamma_1m^{-\frac14}+\gamma_2m^{-\frac12}+\gamma_3m^{-1}}{(2k+1)!!}
\lesssim \frac\tau{\sqrt k}+\frac{2^k\alpha(1+\sqrt\tau)}{\sqrt k}+\frac{\gamma_1m^{-\frac14}+\gamma_2m^{-\frac12}+\gamma_3m^{-1}}{(2k-1)!!}, \tag{26}
\]
where the first two terms in the final step follow from Stirling's inequality (see Fact G.2). We analyze this bound in different cases for $\tau$. To do so, let $C'\ge1$ be a sufficiently large constant and $c<1$ a sufficiently small constant that we will specify later.

Small $\tau$: $\tau\le C'\alpha$. In this case we take $k=1$, and we have that $\alpha^{-2}\gtrsim e^{10k\log^2k}$ because $\alpha\le1/\sqrt C$ by our assumption $m\ge C(d+\log(1/\delta))$. Because $\tau\le C'\alpha\lesssim1$, the error bound (26) simplifies to
\[
\frac1\sigma\,\big\|\hat\theta_n^{(k)}-\theta_\star\big\| \lesssim \alpha\cdot\Big(1+m^{-\frac14}\sqrt{\log m}+m^{-\frac12}\log^{\frac34}(m)+m^{-1}\log m\Big) \lesssim \alpha,
\]
where the final step follows because $\alpha\ge\frac1{\sqrt m}$. In this regime, $\frac{\tau(1\vee\log\log(\tau/\alpha))}{\sqrt{\log(1+\tau^2/\alpha^2)}}\asymp\alpha$, so this matches our desired rate.

Large $\tau$: $\tau\ge C'\alpha$. In this case, we take
\[
k = \frac{c\log(\tau/\alpha)}{(\log\log(\tau/\alpha))^2}.
\]
By choosing $C'$ sufficiently large (as a function of $c$), we have that $k\ge1$ because $\tau/\alpha\ge C'$. This implies $|\log k|=\log k\le\log\log(1/\alpha)$ and so $k\log^2k\le c\log(1/\alpha)$. By picking $c=0.1$, we have that $k$ is small enough to apply Lemma B.2. Substituting this into (26), we have
\[
\frac1\sigma\,\big\|\hat\theta_n^{(k)}-\theta_\star\big\| \lesssim \frac{\tau\log\log(\tau/\alpha)}{\sqrt{\log(\tau/\alpha)}} + \alpha\,(\tau/\alpha)^{\frac{c\log2}{(\log\log(\tau/\alpha))^2}}\,\frac{\log\log(\tau/\alpha)}{\sqrt{\log(\tau/\alpha)}} + \frac{\gamma_1m^{-\frac14}+\gamma_2m^{-\frac12}+\gamma_3m^{-1}}{(2k-1)!!}. \tag{27}
\]
Enlarging $C'$ if necessary, we see that $\frac{c\log2}{(\log\log(\tau/\alpha))^2}\le1$, which implies that the second term is dominated by the first. We now bound $\frac{\gamma_1m^{-\frac14}}{(2k-1)!!}$. We have
\[
\frac{\gamma_1m^{-\frac14}}{(2k-1)!!} \asymp \frac{\alpha\,2^k\log^{\frac12}(m)}{m^{\frac14}} + \frac{\alpha\,O(\log m)^{\frac k2}}{m^{\frac14}(2k-1)!!}.
\]
Because $(2k-1)!!=O(k)^k$, we see that $\frac{O(\log m)^{\frac k2}}{(2k-1)!!}\le e^{O(\sqrt{\log m})}\le m^{\frac18}$ for $m\ge C''$, where $C''$ is a sufficiently large constant. Now, because $\alpha^{-2}\le m$, we have that $m^{\frac18}\gtrsim\sqrt{\log(1/\alpha)}$. Applying these bounds to the above display, we see that, as long as $m\ge C''\vee e^{(8\log2)k}$,
\[
\frac{\gamma_1m^{-\frac14}}{(2k-1)!!} \lesssim \frac\alpha{\sqrt{\log(1/\alpha)}}.
\]
Note that, by taking $C_1$ to be a sufficiently large constant, the assumption that $\alpha\le1/\sqrt{C_1}$ implies that $\frac{8c\log2}{(\log\log(1/\alpha))^2}\le2$. This in turn implies $m\ge e^{(8\log2)k}$ because
\[
e^{(8\log2)k} = \alpha^{-\frac{8c\log2}{(\log\log(1/\alpha))^2}} \le \alpha^{-2} \le m.
\]
The other two terms in (27) follow similarly, so we have the error bound
\[
\frac1\sigma\,\big\|\hat\theta_n^{(k)}-\theta_\star\big\| \lesssim \frac{\tau\log\log(\tau/\alpha)}{\sqrt{\log(\tau/\alpha)}} + \frac\alpha{\sqrt{\log(1/\alpha)}} \asymp \frac{\tau\,(1\vee\log\log(\tau/\alpha))}{\sqrt{\log(1+\tau^2/\alpha^2)}},
\]
as desired.

It remains to prove the above lemmas. We introduce some notation that will be useful in both lemmas. In particular, by the usual decomposition, we know that there exist $\Omega^{\mathrm{MCAR}}\sim\mathrm{Bern}(q)$, $\Omega^{\mathrm{MNAR}}\in\{0,1\}$, $B\sim\mathrm{Bern}(\epsilon)$, and $(X,Y)\sim P$ such that $\mathcal R=\mathrm{Law}(Z)$, where
\[
Z = (X,Y)\star\big((1-B)\,\Omega^{\mathrm{MCAR}}+B\,\Omega^{\mathrm{MNAR}}\big),
\]
and $\Omega^{\mathrm{MCAR}}$, $B$, and $\big((X,Y),\Omega^{\mathrm{MNAR}}\big)$ are mutually independent. Notice that this slightly generalizes the development in Section 5, where we specify $q=1$. Moreover, $Y$ can be generated as $X^\top\theta_\star+\sigma g$ for $g\sim\mathcal N(0,1)$ independent of $X$. Observe that $Z\in\mathbb{R}^{d+1}$ if and only if $(1-B)\Omega^{\mathrm{MCAR}}+B\,\Omega^{\mathrm{MNAR}}=1$. Let $\omega_i^{\mathrm{MCAR}},\omega_i^{\mathrm{MNAR}},b_i$ be the realizations of these random variables in the sample.
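As a concrete (purely illustrative) view of this decomposition, the following minimal simulation sketch generates samples $Z$ exactly as in the display above; it assumes only NumPy, all parameter values are hypothetical, and the particular MNAR rule (censoring large-noise observations) is just one arbitrary choice of adversary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
q, eps, sigma = 0.8, 0.1, 1.0              # hypothetical parameters
theta_star = np.ones(d) / np.sqrt(d)

X = rng.standard_normal((n, d))
g = rng.standard_normal(n)
Y = X @ theta_star + sigma * g             # Y = X^T theta_star + sigma * g

omega_mcar = rng.random(n) < q             # Bern(q), independent of (X, Y)
B = rng.random(n) < eps                    # Bern(eps), independent of everything
omega_mnar = np.abs(g) < 1.0               # arbitrary MNAR rule; may depend on the data

observed = (1 - B) * omega_mcar + B * omega_mnar   # observation indicator
mask = observed.astype(bool)
Z_X = np.where(mask[:, None], X, np.nan)   # nan plays the role of the star symbol
Z_Y = np.where(mask, Y, np.nan)
print(f"observed fraction: {mask.mean():.3f}")
```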
B.1.1 Proof of Lemma B.1

Expanding out $\nabla F_n^{(k)}(\theta_\star)$ and using the fact that $y_i=X_i^\top\theta_\star+\sigma g_i$, we have
\[
\nabla F_n^{(k)}(\theta_\star) = \frac{2k\sigma^{2k-1}}n\sum_{i=1}^n\big((1-b_i)\omega_i^{\mathrm{MCAR}}+b_i\omega_i^{\mathrm{MNAR}}\big)\,g_i^{2k-1}X_i,
\]
so that
\[
\big\|\nabla F_n^{(k)}(\theta_\star)\big\| \le 2k\sigma^{2k-1}\Big\|\frac1n\sum_{i:\,b_i=0}\omega_i^{\mathrm{MCAR}}g_i^{2k-1}X_i\Big\| + 2k\sigma^{2k-1}\Big\|\frac1n\sum_{i:\,b_i=1}\omega_i^{\mathrm{MNAR}}g_i^{2k-1}X_i\Big\|. \tag{28}
\]
We will be done once we bound each vector norm in (28), with probability at least $1-O(\delta)$, by
\[
B := q(1-\epsilon)\cdot\Big\{\tau(2k-2)!!+\alpha(1+\sqrt\tau)\sqrt{(4k-3)!!}+\gamma_1m^{-\frac14}+\gamma_2m^{-\frac12}+\gamma_3m^{-1}\Big\}, \tag{29}
\]
where we recall that
\[
\gamma_1 = \alpha\log^{\frac12}(m)\sqrt{(4k-3)!!}+\alpha\big(1+\tau^{\frac14}\big)O\big(\log\{(1+\tau)m\}\big)^{\frac k2},\qquad
\gamma_2 = \alpha\Big[\log^{\frac12}(m)\sqrt{(4k-3)!!}+O\big(\log\{(1+\tau)m\}\big)^{\frac k2+\frac14}\Big]+\sqrt\tau\,O\big(\log\{(1+\tau)m\}\big)^{\frac k2},
\]
\[
\gamma_3 = \log(m)(2k-2)!!+O\big(\log\{(1+\tau)m\}\big)^{\frac{k+1}2}.
\]

Bounding the first term of (28). Observe that $(1-b_i)\omega_i^{\mathrm{MCAR}}\sim\mathrm{Bern}(q(1-\epsilon))$ are i.i.d. and independent of $\{(g_i,X_i)\}$. Thus, conditioning on $n_0=\sum_{i=1}^n(1-b_i)\omega_i^{\mathrm{MCAR}}$, we can assume without loss of generality that the first $n_0$ samples are the samples for which $(1-b_i)\omega_i^{\mathrm{MCAR}}=1$, and then bound $\big\|\frac1n\sum_{i=1}^{n_0}g_i^{2k-1}X_i\big\|$. Now we condition on $g=\{g_i\}$ and observe that $\frac1n\sum_{i=1}^{n_0}g_i^{2k-1}X_i\sim\mathcal N(0,\sigma^2(g)I_d)$, where
\[
\sigma^2(g) = \frac1{n^2}\sum_{i=1}^{n_0}g_i^{4k-2}.
\]
Thus a standard Gaussian tail bound yields that, with probability at least $1-\delta$,
\[
\Big\|\frac1n\sum_{i=1}^{n_0}g_i^{2k-1}X_i\Big\| \le \frac1n\cdot\sqrt{\sum_{i=1}^{n_0}g_i^{4k-2}}\cdot\Big(\sqrt d+\sqrt{2\log(1/\delta)}\Big). \tag{30}
\]
We have, by Lemma G.9 and the assumption that $\delta\ge m^{-c}$, that with probability at least $1-\delta$,
\[
\sum_{i=1}^{n_0}g_i^{4k-2} \le n_0(4k-3)!!+\sqrt{n_0\,2^{4k-3}\log(2/\delta)}\,\log(2n_0/\delta)^{2k-1} \lesssim n_0(4k-3)!!+\big[O(\log n_0+\log m)\big]^k\sqrt{n_0}. \tag{31}
\]
Now, because $n_0\sim\mathrm{Bin}(n,q(1-\epsilon))$, we apply Bernstein's inequality to obtain that, with probability at least $1-e^{-\Omega(m)}\ge1-\delta$, $n_0\le2m$. Putting it all together, we deduce that with probability at least $1-O(\delta)$,
\[
\Big\|\frac1n\sum_{i=1}^{n_0}g_i^{2k-1}X_i\Big\| \lesssim \frac{\sqrt{d+\log(1/\delta)}}n\cdot\Big(\sqrt{m(4k-3)!!}+O(\log m)^{\frac k2}m^{\frac14}\Big) = q(1-\epsilon)\,\alpha\Big(\sqrt{(4k-3)!!}+m^{-\frac14}O(\log m)^{\frac k2}\Big).
\]
This is bounded by $O(B)$ because $\gamma_1\ge\alpha\cdot O(\log m)^{\frac k2}$.

Bounding the second term of (28). Again observe that $\{b_i\}$ is independent of everything else, so we can condition on $n_1=\sum_{i=1}^nb_i$ and assume without loss of generality that the first $n_1$ values of $b_i$ are equal to 1. Then we have
\[
\Big\|\frac1n\sum_{i:\,b_i=1}\omega_i^{\mathrm{MNAR}}g_i^{2k-1}X_i\Big\| = \frac1n\sup_{v\in\mathbb S^{d-1}}\sum_{i=1}^{n_1}\omega_i^{\mathrm{MNAR}}g_i^{2k-1}X_i^\top v \le \frac1n\sup_{v\in\mathbb S^{d-1}}\sum_{i=1}^{n_1}|g_i^{2k-1}|\cdot|X_i^\top v|.
\]
Now we condition on $g=\{g_i\}_{i=1}^n$ and let $a_i=|g_i^{2k-1}|$. We define $T_v:=\sum_{i=1}^{n_1}a_i|X_i^\top v|$ and let $\mathcal N$ be a $1/2$-packing of $\mathbb S^{d-1}$ of size at most $5^d$. Then, for any $v\in\mathbb S^{d-1}$, let $v'\in\mathcal N$ satisfy $\|v-v'\|\le1/2$ and observe that
\[
|T_v-T_{v'}| \le \sum_{i=1}^{n_1}a_i\big||X_i^\top v|-|X_i^\top v'|\big| \le \sum_{i=1}^{n_1}a_i|X_i^\top(v-v')| \le \|v-v'\|\cdot\sum_{i=1}^{n_1}a_i|X_i^\top u|,
\]
where $u\in\mathbb S^{d-1}$ is such that $v=v'+\|v-v'\|u$. We thus have that
\[
T_v \le T_{v'}+\|v-v'\|\sum_{i=1}^{n_1}a_i|X_i^\top u| \le \sup_{u\in\mathcal N}T_u+\frac12\sup_{u\in\mathbb S^{d-1}}T_u.
\]
Taking a supremum over $v\in\mathbb S^{d-1}$ and rearranging yields $\sup_{v\in\mathbb S^{d-1}}T_v\le2\sup_{v\in\mathcal N}T_v$. Since $X_i\overset{\text{iid}}\sim\mathcal N(0,I_d)$, we have for each $v\in\mathcal N$ that $\mathbb E[|X_i^\top v|]=\sqrt{2/\pi}$ and that $T_v-\sqrt{2/\pi}\sum_{i=1}^{n_1}a_i$ is $\sum_{i=1}^{n_1}a_i^2$-subgaussian.
Thus,
\[
\mathbb P\Big(\sup_{v\in\mathcal N}T_v-\sqrt{\tfrac2\pi}\sum_{i=1}^{n_1}a_i\ge t\Big) \le \sum_{v\in\mathcal N}\mathbb P\Big(T_v-\sqrt{\tfrac2\pi}\sum_{i=1}^{n_1}a_i\ge t\Big) \le 5^d\exp\Big(-\frac{t^2}{2\sum_{i=1}^{n_1}a_i^2}\Big).
\]
Picking $t=\sqrt{2\big(d\log5+\log(1/\delta)\big)\sum_{i=1}^{n_1}a_i^2}$, we have with probability at least $1-\delta$ that
\[
\sup_{v\in\mathcal N}T_v \le \sqrt{\tfrac2\pi}\sum_{i=1}^{n_1}a_i+\sqrt{\sum_{i=1}^{n_1}a_i^2\,\big(d\log5+\log(1/\delta)\big)}.
\]
Assembling the pieces, we have, conditional on $n_1$ and $g$, that with probability at least $1-\delta$,
\[
\Big\|\frac1n\sum_{i:\,b_i=1}\omega_i^{\mathrm{MNAR}}g_i^{2k-1}X_i\Big\| \lesssim \frac1n\bigg(\sum_{i=1}^{n_1}|g_i|^{2k-1}+\sqrt{\sum_{i=1}^{n_1}g_i^{4k-2}\,\big(d+\log(1/\delta)\big)}\bigg). \tag{32}
\]
Again by Lemma G.9, we have with probability at least $1-\delta$ that
\[
\sum_{i=1}^{n_1}|g_i|^{2k-1} \le n_1(2k-2)!!+O\big(\log(n_1/\delta)\big)^{\frac k2}\sqrt{n_1}
\qquad\text{and}\qquad
\sum_{i=1}^{n_1}|g_i|^{4k-2} \le n_1(4k-3)!!+O\big(\log(n_1/\delta)\big)^k\sqrt{n_1}.
\]
Recalling that $n_1\sim\mathrm{Bin}(n,\epsilon)$, we have with probability at least $1-\delta$ that $n_1\lesssim\epsilon n+\log(1/\delta)=\tau m+O(\log m)$. Now we condition on all of these events, which occur simultaneously with probability at least $1-O(\delta)$. Note that $\log(n_1/\delta)=\log\big(\tau m+O(\log m)\big)+O(\log m)\asymp\log((1+\tau)m)$. Then the first term in (32) is upper bounded by
\[
\frac1n\sum_{i=1}^{n_1}|g_i|^{2k-1} \lesssim \frac1n\Big((\tau m+\log m)(2k-2)!!+O\big(\log\{(1+\tau)m\}\big)^{\frac k2}\big(\sqrt{\tau m}+\sqrt{\log m}\big)\Big)
\]
\[
\asymp q(1-\epsilon)\Big[\tau(2k-2)!!+m^{-1}\log(m)(2k-2)!!+m^{-\frac12}O\big(\log\{(1+\tau)m\}\big)^{\frac k2}\sqrt\tau+m^{-1}O\big(\log\{(1+\tau)m\}\big)^{\frac{k+1}2}\Big].
\]
Collecting terms with the same exponent on $m$ and recalling the definitions of $\gamma_2$ and $\gamma_3$, we have that the above display is $O(B)$, because $\gamma_2m^{-\frac12}\ge m^{-\frac12}O(\log\{(1+\tau)m\})^{\frac k2}\sqrt\tau$ and $\gamma_3m^{-1}\ge m^{-1}\big(\log(m)(2k-2)!!+O(\log\{(1+\tau)m\})^{\frac{k+1}2}\big)$. The second term in (32) is bounded above by
\[
\frac1n\sqrt{\sum_{i=1}^{n_1}g_i^{4k-2}\big(d+\log(1/\delta)\big)} \lesssim \frac{\sqrt{d+\log(1/\delta)}}n\cdot\Big[\sqrt{\tau m(4k-3)!!}+\log^{\frac12}(m)\sqrt{(4k-3)!!}+O\big(\log\{(1+\tau)m\}\big)^{\frac k2}\big((\tau m)^{\frac14}+\log^{\frac14}(m)\big)\Big]
\]
\[
\asymp q(1-\epsilon)\,\alpha\Big[\sqrt{\tau(4k-3)!!}+m^{-\frac12}\log^{\frac12}(m)\sqrt{(4k-3)!!}+O\big(\log\{(1+\tau)m\}\big)^{\frac k2}\big(m^{-\frac14}\tau^{\frac14}+m^{-\frac12}\log^{\frac14}(m)\big)\Big].
\]
Similar to the previous term, this display is also $O(B)$, because $\gamma_1m^{-\frac14}\ge\alpha\tau^{\frac14}O(\log((1+\tau)m))^{\frac k2}m^{-\frac14}$ and $\gamma_2m^{-\frac12}\ge\alpha\big(\log^{\frac12}(m)\sqrt{(4k-3)!!}+\log^{\frac14}(m)O(\log((1+\tau)m))^{\frac k2}\big)m^{-\frac12}$. We have thus bounded all terms by $O(B)$, proving the claim.

B.1.2 Proof of Lemma B.2

Expanding out $\nabla^2F_n^{(k)}$, we have for any unit vector $v$ that
\[
v^{\mathsf T}\nabla^2F_n^{(k)}(\theta)\,v = \frac{(2k)(2k-1)}n\sum_{i=1}^n\big((1-b_i)\omega_i^{\mathrm{MCAR}}+b_i\omega_i^{\mathrm{MNAR}}\big)\big(x_i^\top(\theta_\star-\theta)+\sigma g_i\big)^{2k-2}(x_i^\top v)^2
\ge \frac{(2k)(2k-1)}n\sum_{i=1}^n(1-b_i)\omega_i^{\mathrm{MCAR}}\big(x_i^\top(\theta_\star-\theta)+\sigma g_i\big)^{2k-2}(x_i^\top v)^2,
\]
where the inequality is because the summands are nonnegative. Defining the random set $\mathcal I=\{i\in[n]\mid(1-b_i)\omega_i^{\mathrm{MCAR}}=1\}$, we equivalently have
\[
v^{\mathsf T}\nabla^2F_n^{(k)}(\theta)\,v \ge \frac{(2k)(2k-1)}n\sum_{i\in\mathcal I}\big(x_i^\top(\theta_\star-\theta)+\sigma g_i\big)^{2k-2}(x_i^\top v)^2. \tag{33}
\]
We condition on $\mathcal I$ and let $n_0=|\mathcal I|$. Because $((1-b_i)\omega_i^{\mathrm{MCAR}})_{i\in[n]}$ is independent of $((x_i,g_i))_{i\in[n]}$, we have that, conditional on $\mathcal I$, $((x_i,g_i))_{i\in\mathcal I}$ are $n_0$ i.i.d. samples from $\mathcal N(0,I_d)\otimes\mathcal N(0,\sigma^2)$. We now introduce a finite sequence of noise bins indexed by $\ell=0,\dots,L$, where $t_\ell=(1+\frac1{2k})^\ell$ and $L$ is the unique integer such that $2\sqrt{k\log(3k)}\le t_L<(1+\frac1{2k})\cdot2\sqrt{k\log(3k)}$.
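To make the binning concrete, the short sketch below (illustrative only; assuming NumPy/SciPy, with a hypothetical value of $k$) computes $t_\ell$, the index $L$, and the bin masses $p_\ell=\mathbb P(G\in[t_{\ell-1},t_\ell))$, and compares the smallest mass against the $e^{-5k\log^2k}$ lower bound derived in (34) below.

```python
import numpy as np
from scipy.stats import norm

k = 3                                        # hypothetical moment parameter
r = 1 + 1 / (2 * k)
target = 2 * np.sqrt(k * np.log(3 * k))
L = int(np.ceil(np.log(target) / np.log(r)))  # smallest L with t_L >= 2*sqrt(k log 3k)
t = r ** np.arange(L + 1)                     # t_0 = 1, ..., t_L
p = norm.cdf(t[1:]) - norm.cdf(t[:-1])        # p_l = P(G in [t_{l-1}, t_l))
print(f"L = {L}, t_L = {t[-1]:.3f}")
print(f"min bin mass = {p.min():.3e}  vs  exp(-5 k log^2 k) = {np.exp(-5 * k * np.log(k) ** 2):.3e}")
```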
Then, for each $\ell\in[L]$, $\theta\in\mathbb R^d$, and unit vector $v$, we define the subsets
\[
S_{\ell,\theta,v} := \Big\{i\in\mathcal I \;:\; x_i^\top(\theta_\star-\theta)\ge0,\ |x_i^\top v|\ge\sqrt{\pi/8},\ \text{and}\ g_i\in[t_{\ell-1},t_\ell)\Big\}.
\]
For $i\in S_{\ell,\theta,v}$, we have the lower bounds
\[
(y_i-x_i^\top\theta) = (x_i^\top\theta_\star+\sigma g_i)-x_i^\top\theta \ge \sigma t_{\ell-1}
\qquad\text{and}\qquad
(x_i^\top v)^2 \ge \frac\pi8 > \frac14.
\]
Now observe that, for each $\theta$ and $v$, each index $i\in\mathcal I$ is in $S_{\ell,\theta,v}$ for at most one $\ell\in[L]$. Applying these lower bounds to (33), we have that
\[
v^\top\nabla^2F_n^{(k)}(\theta)\,v \ge \frac{(2k)(2k-1)}n\sum_{\ell=1}^L\sum_{i\in S_{\ell,\theta,v}}(\sigma t_{\ell-1})^{2k-2}(x_i^\top v)^2 \ge \frac{k^2\sigma^{2k-2}}{2n}\sum_{\ell=1}^L|S_{\ell,\theta,v}|\,t_{\ell-1}^{2k-2}.
\]
To prove the claim, we find uniform lower bounds on $|S_{\ell,\theta,v}|$ that hold with probability at least $1-O(\delta)$. For $\ell\in[L]$, let $p_\ell:=\mathbb P(G\in[t_{\ell-1},t_\ell))$, where $G\sim\mathcal N(0,1)$. Observe that for $\ell\in[L]$,
\[
p_\ell = \frac1{\sqrt{2\pi}}\int_{t_{\ell-1}}^{t_\ell}e^{-t^2/2}\,dt \ge \frac{t_\ell-t_{\ell-1}}{\sqrt{2\pi}}e^{-t_\ell^2/2} = \frac1{(2k+1)\sqrt{2\pi}}\,t_\ell\,e^{-t_\ell^2/2}.
\]
Now, because $t\mapsto te^{-t^2/2}$ is decreasing for $t\ge1$ and because $t_\ell\le3\sqrt{k\log(3k)}$ for all $\ell\in[L]$, we have that
\[
p_\ell \ge \frac{3\sqrt{k\log(3k)}}{(2k+1)\sqrt{2\pi}}\,e^{-\frac92k\log(3k)} \gtrsim e^{-5k\log^2k}. \tag{34}
\]
We now have the following claim, which we prove later.

Claim B.3. Let $C>0$ be a sufficiently large constant. With probability at least $1-\delta$, it holds simultaneously for all $\ell\in[L]$, $\theta\in\mathbb R^d$, and $v\in\mathbb S^{d-1}$ that
\[
|S_{\ell,\theta,v}| \ge \frac{p_\ell n_0}4 - C\sqrt{\big(d+\log(1/\delta)\big)n_0}. \tag{35}
\]
We condition on the event defined in the claim. Now recall that $n_0\sim\mathrm{Bin}(n,q(1-\epsilon))$. Then $\alpha^2\le1/C_1$ implies $q(1-\epsilon)n\gtrsim\log(1/\delta)$, and so with probability at least $1-\delta$ we have that $q(1-\epsilon)n/2\le n_0\le2q(1-\epsilon)n$. Further conditioning on this event, we have, by combining it with (35), that for all $\ell\in[L]$,
\[
|S_{\ell,\theta,v}| \ge \frac{p_\ell q(1-\epsilon)n}8 - C\sqrt{2\big(d+\log(1/\delta)\big)q(1-\epsilon)n}.
\]
Now, applying the assumption $\alpha^{-2}\ge C'e^{10k\log^2k}$ combined with (34) implies
\[
C\sqrt{2\big(d+\log(1/\delta)\big)q(1-\epsilon)n} \le \frac{C\sqrt2}{\sqrt{C'}}\cdot e^{-5k\log^2k}\,q(1-\epsilon)n \le \frac{p_\ell q(1-\epsilon)n}{16},
\]
where the final inequality follows by taking $C'$ to be a sufficiently large constant. Thus, it holds for all $\ell\in[L]$ that $|S_{\ell,\theta,v}|\ge\frac{q(1-\epsilon)p_\ell n}{16}$. Consequently, for all $\theta\in\mathbb R^d$ and all unit vectors $v$, we have
\[
v^{\mathsf T}\nabla^2F_n^{(k)}(\theta)\,v \ge \frac{q(1-\epsilon)k^2\sigma^{2k-2}}{32}\sum_{\ell=1}^Lp_\ell\,t_{\ell-1}^{2k-2}.
\]
Recalling that $t_{\ell-1}=(1+\frac1{2k})^{-1}t_\ell$ and $p_\ell=\mathbb P(G\in[t_{\ell-1},t_\ell))$, we have
\[
\sum_{\ell=1}^Lp_\ell\,t_{\ell-1}^{2k-2} = \Big(1+\frac1{2k}\Big)^{-(2k-2)}\sum_{\ell=1}^Lp_\ell\,t_\ell^{2k-2} = \Big(1+\frac1{2k}\Big)^{-(2k-2)}\sum_{\ell=1}^L\int_{t_{\ell-1}}^{t_\ell}t_\ell^{2k-2}\phi(t)\,dt \ge \frac23\sum_{\ell=1}^L\int_{t_{\ell-1}}^{t_\ell}t^{2k-2}\phi(t)\,dt = \frac23\int_1^{t_L}t^{2k-2}\phi(t)\,dt.
\]
The last expression is a truncated Gaussian moment, and Lemma G.7 gives that $\int_1^{t_L}t^{2k-2}\phi(t)\,dt\ge\frac{(2k-3)!!}3$ because $t_L\ge2\sqrt{k\log(3k)}$. Combining the pieces, we have for all $\theta\in\mathbb R^d$ and all unit vectors $v$ that
\[
v^{\mathsf T}\nabla^2F_n^{(k)}(\theta)\,v \ge \frac{q(1-\epsilon)\,k\,(2k-3)!!\,\sigma^{2k-2}}{36} \asymp q(1-\epsilon)\,\sigma^{2k-2}(2k+1)!!\,,
\]
proving the claim.

Proof of Claim B.3. The proof of this bound follows from a standard VC dimension argument. For each $\ell\in[L]$, we define the function classes
\[
\mathcal F_\ell := \Big\{(x,g)\mapsto\mathbb 1\Big\{x^\top(\theta_\star-\theta)\ge0,\ |x^\top v|\ge\sqrt{\tfrac\pi8},\ \text{and}\ t_\ell\le g<t_{\ell+1}\Big\} \;:\; v\in\mathbb S^{d-1},\ \theta\in\theta_\star+\mathbb S^{d-1}\Big\}
\]
and also define $\mathcal F=\cup_{\ell\in[L]}\mathcal F_\ell$.
Observe that $\mathcal F$ is a subclass of the 3-fold intersection of the following function classes:
\[
\mathcal F_1 = \big\{(x,g)\mapsto\mathbb 1\{x^\top(\theta_\star-\theta)\ge0\} : \theta\in\theta_\star+\mathbb S^{d-1}\big\},\qquad
\mathcal F_2 = \big\{(x,g)\mapsto\mathbb 1\{|x^\top v|\ge\sqrt{\pi/8}\} : v\in\mathbb S^{d-1}\big\},\qquad
\mathcal F_3 = \big\{(x,g)\mapsto\mathbb 1\{a\le g<b\} : a,b\in\mathbb R\big\}.
\]
Common VC-dimension calculations give that $\mathrm{VC}(\mathcal F_1)\le d+1$, $\mathrm{VC}(\mathcal F_2)\le2(d+1)$, and $\mathrm{VC}(\mathcal F_3)\le2$. We can then apply Lemma G.5 to deduce that $\mathrm{VC}(\mathcal F)\le12\log(18)(d+1)\asymp d$. Using this VC-dimension bound, we can bound the expected worst-case deviation using [Ver25, Theorem 8.3.5] to obtain
\[
\mathbb E\sup_{f\in\mathcal F}\Big|\frac1{n_0}\sum_{i=1}^{n_0}f(x_i,g_i)-\mathbb Ef(x,g)\Big| \lesssim \sqrt{\frac dn}.
\]
To convert this to a high-probability bound on the deviation, we note that the random variable $\sup_{f\in\mathcal F}\frac1{n_0}\big|\sum_{i=1}^{n_0}f(x_i,g_i)-\mathbb Ef(x,g)\big|$ satisfies the bounded differences property with constant $1/n_0$, so, combining the preceding display with, e.g., [BLM13, Theorem 6.2], we obtain
\[
\sup_{f\in\mathcal F}\Big|\frac1{n_0}\sum_{i=1}^{n_0}f(x_i,g_i)-\mathbb Ef(x,g)\Big| \lesssim \sqrt{\frac dn}+\sqrt{\frac{\log(2/\delta)}{n_0}} \qquad\text{with probability}\ \ge1-\delta. \tag{36}
\]
Now, since $x\sim\mathcal N(0,I_d)$, we have for all $\theta\in\theta_\star+\mathbb S^{d-1}$ and all $v\in\mathbb S^{d-1}$ that
\[
\mathbb P\big(x^\top(\theta_\star-\theta)<0\big)=\frac12
\qquad\text{and}\qquad
\mathbb P\Big(|x^\top v|<\sqrt{\tfrac\pi8}\Big)\le\frac1{\sqrt{2\pi}}\cdot\sqrt{\tfrac\pi8}=\frac14
\quad\implies\quad
\mathbb P\Big(x^\top(\theta_\star-\theta)\ge0\ \text{and}\ |x^\top v|\ge\sqrt{\tfrac\pi8}\Big)\ge\frac14.
\]
Further, because $x$ and $g$ are independent, we have that $\mathbb Ef(x,g)\ge\frac14p_\ell$ for every $f\in\mathcal F_\ell$. Combining this with (36) proves the bound.

B.2 Efficient algorithm

So far we have shown the rates achievable by some $\hat\theta^{(k)}$. Now we show that an efficient estimator has a similar rate. We consider a two-step procedure. First, we observe that, because the loss which $\hat\theta^{(1)}$ minimizes is a strongly convex quadratic function, we can efficiently compute it to any desired accuracy. By the analysis of the proof of the upper bound of Theorem 5.1, we know that both
\[
\big\|\hat\theta^{(1)}-\theta_\star\big\|_2 \le \underbrace{C\sigma\big(1+\mathrm{poly}(\tau)\big)}_{=:R}
\qquad\text{and}\qquad
\big\|\hat\theta^{(k)}-\theta_\star\big\|_2 \le \underbrace{C\sigma\,\frac{\tau\log\log(1/\alpha)}{\sqrt{\log(1/\alpha)}}}_{=:\beta},
\]
for a sufficiently large constant $C$. Then, for the first step of the procedure, we compute $\tilde\theta_1$ satisfying $\|\tilde\theta_1-\hat\theta^{(1)}\|_2\le R$, which implies that $\|\tilde\theta_1-\hat\theta^{(k)}\|_2\le2R+\beta$. Recall, by Lemma B.2, that with probability at least $1-\delta$, $F_n^{(k)}$ is $\mu_n$-strongly convex. We then run the ellipsoid algorithm (see, e.g., [Bub15, Theorem 2.4]) on the domain $\mathcal B=B_2(\tilde\theta_1,2(R+\beta))$ until we obtain $\tilde\theta_2$ satisfying an excess risk bound on $F_n^{(k)}$ of at most $\mu_n\beta^2/2$, using a gradient oracle. Strong convexity implies that $\|\tilde\theta_2-\hat\theta^{(k)}\|_2\le\beta$, and so $\tilde\theta_2$ has the same error, up to constant factors, as $\hat\theta^{(k)}$. To establish the runtime, all that remains is to show that the set
\[
\mathcal C := \Big\{\theta : F_n^{(k)}(\theta)-F_n^{(k)}\big(\hat\theta_n^{(k)}\big)\le\frac{\mu_n\beta^2}2\Big\}
\]
contains a ball of large enough size. To this end, on the domain $\mathcal B$, we can crudely upper bound the operator norm of the Hessian $\nabla^2F_n^{(k)}(\theta)$ as
\[
\sup_{\theta\in\mathcal B}\big\|\nabla^2F_n^{(k)}(\theta)\big\|_2 \lesssim k^2\sup_{i\in[n]}\big(2(R+\beta)\|x_i\|_2+\sigma|g_i|\big)^{2k-2}\|x_i\|_2^2 \lesssim \Big(O(R+\beta+1)\sqrt{d+\log(n/\delta)}\Big)^{2k}.
\]
Thus $B_2\big(\hat\theta^{(k)},r\big)\subseteq\mathcal C$, where
\[
r \gtrsim \frac{q(1-\epsilon)\,\sigma^{2k-2}(2k+1)!!\,\beta^2}{\Big(O\big((R+\beta+1)\sqrt{d+\log(n/\delta)}\big)\Big)^{2k}}.
\]
Therefore, the ellipsoid portion of the algorithm has an oracle complexity of
\[
O\Bigg(d^2\log\Bigg(\frac{(R+\beta)\,\Big(O\big((R+\beta+1)\sqrt{d+\log(n/\delta)}\big)\Big)^{2k}}{q(1-\epsilon)\,\sigma^{2k-2}(2k+1)!!\,\beta^2}\Bigg)\Bigg), \tag{37}
\]
which is polynomial in $k$, $d$, and $n$.
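The two-step procedure can be sketched numerically. The following is a minimal illustration, assuming NumPy, with hypothetical values of $k$, the step sizes, and the iteration counts; for simplicity it uses gradient descent in place of the ellipsoid method for the polishing step, which is only a stand-in for the algorithm analyzed above.

```python
import numpy as np

def grad_F(theta, X, y, obs, k):
    """Gradient of F_n^(k)(theta) = (1/n) * sum_i obs_i * (y_i - x_i^T theta)^(2k)."""
    res = y - X @ theta
    return -(2 * k / len(y)) * (obs * res ** (2 * k - 1)) @ X

def two_step_estimator(X, y, obs, k, iters=2000, lr=0.05):
    """Warm-start on the k = 1 quadratic risk, then polish on the degree-2k risk."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):                  # step 1: strongly convex quadratic
        theta -= lr * grad_F(theta, X, y, obs, 1)
    lr2 = lr / (k * (2 * k - 1))            # crude smoothness rescaling of the step
    for _ in range(iters):                  # step 2: stand-in for the ellipsoid method
        theta -= lr2 * grad_F(theta, X, y, obs, k)
    return theta

# usage on synthetic data (hypothetical sizes):
rng = np.random.default_rng(1)
n, d, k = 2000, 5, 2
X = rng.standard_normal((n, d))
theta_star = np.ones(d) / np.sqrt(d)
y = X @ theta_star + rng.standard_normal(n)
obs = (rng.random(n) < 0.9).astype(float)
print(np.linalg.norm(two_step_estimator(X, y, obs, k) - theta_star))
```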
C Proofs of information-theoretic lower bounds

C.1 Lower bound constructions

For all lower bounds, we rely on hidden-direction distributions of the form $P_{A,v}$ (see Definition 2.7), in both the information-theoretic and SQ lower bounds. Crucial to proving these is the following lemma, which characterizes the bounds on the density of distributions in the contamination set of a given distribution.

Corollary C.1 (Corollary of Lemma 1.1). Let $P$ be a distribution over $\mathbb R^d$. Let $L=q(1-\epsilon)$ and $U=q(1-\epsilon)+\epsilon$. Let $b\in[L,U]$ be arbitrary. Then $Q\in\mathcal R_{\mathbb R}(P,\epsilon,q)$ if, for all $z\in\mathbb R^d$,
\[
\frac Lb \le \frac{dQ}{dP}(z) \le \frac Ub.
\]
Furthermore, suppose $Q$ satisfies the condition above for some $b\in[L,U]$. Consider the distribution $Q'$ over $\mathbb R_\star^d$ which outputs $\star^d$ with probability $1-b$ and $X\sim Q$ with probability $b$. Then $Q'\in\mathcal R(P,\epsilon,q)$.

Lemma C.2 (Mean estimation hard instance). Let $q\in(0,1]$, $\epsilon\in[0,1)$, and $\tau=\frac\epsilon{q(1-\epsilon)}$. Let $v\in\mathbb S^{d-1}$ and let $b,\gamma,R\in\mathbb R_{++}$ be such that
\[
b=q(1-\epsilon)\sqrt{1+\tau},\qquad \gamma^2\le0.25\log(1+\tau),\qquad R\ge2\gamma,\qquad\text{and}\qquad R\le\frac{\log(1+\tau)}{8\gamma}.
\]
With $\beta=\frac{\mathbb P(|G+\gamma|\le R)}{\mathbb P(|G|\le R)}$, let $A$ denote the distribution over $\mathbb R$ with density
\[
A(x) := \begin{cases}\beta\,\phi(x)&|x|\le R\\ \phi(x;\gamma)&\text{otherwise},\end{cases}\tag{38}
\]
and define $Q\in\mathcal P(\mathbb R_\star^d)$ as $Q(\{\star^d\})=1-b$ and $Q(x)=b\cdot P_{A,v}(x)$. Then $Q\in\mathcal R\big(\mathcal N(\gamma v,I_d),\epsilon,q\big)$. Furthermore, $|\beta-1|\lesssim\gamma e^{-R^2/4}$.

Proof. First observe that $A$ is a valid distribution, since it integrates to 1 on $\mathbb R$. Furthermore, we have that $\beta\le1$ and
\[
\beta \ge \min_{|x|\le R}\frac{\phi(x;\gamma)}{\phi(x)} = \min_{|x|\le R}\exp(\gamma x-\gamma^2/2) = \exp(-\gamma R-\gamma^2/2).
\]
Let $f(x)$ be the density of $P_{A,v}$ at a point $x$. Corollary C.1 states that $Q$ is a valid realizable contamination as long as $\mathbb P_Q(X=\star^d)=1-b$ with $b\in(L,U)$ and $\frac{f(x)}{\phi(x;\gamma)}\in(L',U')$ for $L'=L/b$ and $U'=U/b$, where $L=q(1-\epsilon)$ and $U=q(1-\epsilon)+\epsilon$. Observe that our choice of $b$ satisfies $b^2=LU$, and thus $|\log L'|=|\log U'|=\log\sqrt{1+\tau}$. Therefore, for $Q$ to be a valid realizable contamination of $\mathcal N(\gamma v,I)$, we require that
\[
\Big|\log\frac{f(x)}{\phi(x;\gamma)}\Big| \le 0.5\log(1+\tau)\qquad\text{for all }x\in\mathbb R^d.
\]
In fact, we shall establish the stronger condition that $\big|\log\frac{f(x)}{\phi(x;\gamma)}\big|\le0.25\log(1+\tau)$ for all $x\in\mathbb R^d$; this stronger requirement will be useful in proving SQ lower bounds. Observe that
\[
\frac{f(x)}{\phi(x;\gamma)} = \frac{\phi_{v^\perp}(x)\,A(v^\top x)}{\phi_{v^\perp}(x)\,\phi(v^\top x;\gamma)} = \frac{A(v^\top x)}{\phi(v^\top x;\gamma)}.
\]
We denote $x'=v^\top x$ in the rest of the proof for brevity. We thus require that $\big|\log\frac{A(x')}{\phi(x';\gamma)}\big|\le0.5\log(1+\tau)$ for all $x'\in\mathbb R$. By the definition of $A$ in (38), this obviously holds on the domain $|x'|>R$. For $|x'|\le R$, this ratio is equal to
\[
\max_{|x'|\le R}\log\frac{A(x')}{\phi(x';\gamma)} = \max_{|x'|\le R}\log\frac{\beta\phi(x')}{\phi(x';\gamma)} = \max_{|x'|\le R}\log\big(\beta e^{\gamma^2/2-\gamma x'}\big) = \log\big(\beta e^{\gamma^2/2+\gamma R}\big) \le \big|\log\beta e^{\gamma^2/2}\big|+\gamma R.
\]
It suffices if each term is at most $0.125\log(1+\tau)$, and observe that the second term, $\gamma R$, indeed satisfies this under the condition $R\le\log(1+\tau)/(8\gamma)$. For the first term, observe that $\log\beta e^{\gamma^2/2}\le\gamma^2/2$ since $\beta\le1$, which is also appropriately bounded under the condition on $\gamma^2$ listed in the statement. Finally, $\log\beta e^{\gamma^2/2}\ge\log\big(e^{-\gamma R-\gamma^2/2}e^{\gamma^2/2}\big)=-\gamma R$, and thus $-\log\beta e^{\gamma^2/2}\le\gamma R$, which, as we argued earlier, is indeed small. Therefore,
\[
\max_{x'\in\mathbb R}\Big|\log\frac{A(x')}{\phi(x';\gamma)}\Big| \le 0.25\log(1+\tau).
\]
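The uniform density-ratio bound just derived can be checked numerically for concrete parameters. Below is a minimal sketch, assuming SciPy; the values of $\epsilon$, $q$, and $\gamma$ are hypothetical, chosen to respect the constraints of Lemma C.2.

```python
import numpy as np
from scipy.stats import norm

eps, q = 0.05, 0.9                                 # hypothetical contamination parameters
tau = eps / (q * (1 - eps))
gamma = 0.4 * np.sqrt(0.25 * np.log(1 + tau))      # respects gamma^2 <= 0.25 log(1+tau)
R = np.log(1 + tau) / (8 * gamma)                  # respects R <= log(1+tau)/(8 gamma), R >= 2 gamma
beta = (norm.cdf(R - gamma) - norm.cdf(-R - gamma)) / (norm.cdf(R) - norm.cdf(-R))

def A(x):
    """Density (38): beta * phi(x) on |x| <= R, phi(x - gamma) outside."""
    return np.where(np.abs(x) <= R, beta * norm.pdf(x), norm.pdf(x - gamma))

x = np.linspace(-10, 10, 200001)
log_ratio = np.log(A(x)) - norm.logpdf(x - gamma)
print(f"max |log A/phi(.;gamma)| = {np.abs(log_ratio).max():.4f}"
      f"  vs  0.25*log(1+tau) = {0.25 * np.log(1 + tau):.4f}")
```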
Finally, we state an upper bound on $|\beta-1|$, which is equal to
\[
|\beta-1| = \frac{\big|\mathbb P(|G+\gamma|\le R)-\mathbb P(|G|\le R)\big|}{\mathbb P(|G|\le R)} \lesssim \big|\mathbb P\big(G\in[-R-\gamma,R-\gamma]\big)-\mathbb P\big(G\in[-R,R]\big)\big| \lesssim \mathbb P\big(G\in[R-\gamma,R]\big) \le \gamma\,\phi(R/2) \lesssim \gamma e^{-R^2/4},
\]
where we use that $R\ge2\gamma$.

Lemma C.3 (Covariance estimation hard instance). Let $q\in(0,1]$, $\epsilon\in[0,1)$, $v\in\mathbb S^{d-1}$, and let $b,\gamma,R\in\mathbb R_{++}$ satisfy
\[
b=q(1-\epsilon)\sqrt{1+\tau},\qquad\gamma\le\tau,\qquad\text{and}\qquad R^2\le\frac{1+\gamma}\gamma\log(1+\tau).
\]
With $\beta=\frac{\mathbb P(\sqrt{1+\gamma}\,|G|\le R)}{\mathbb P(|G|\le R)}$, let $A$ denote the distribution over $\mathbb R$ with density
\[
A(x) := \begin{cases}\beta\,\phi(x)&|x|\le R\\ \phi(x;0,1+\gamma)&\text{otherwise},\end{cases}\tag{39}
\]
and define $Q\in\mathcal P(\mathbb R_\star^d)$ as $Q(\{\star^d\})=1-b$ and $Q(x)=b\cdot P_{A,v}(x)$. Then $Q\in\mathcal R\big(\mathcal N(0,I_d+\gamma vv^{\mathsf T}),\epsilon,q\big)$. Furthermore,
\[
|\beta-1| \lesssim \min\Big\{1,\;R\cdot\Big(1-\frac1{\sqrt{1+\gamma}}\Big)\Big\}\,e^{-\Omega(R^2/(1+\gamma))}.
\]

Proof. Observe that $\beta$ is chosen to ensure that $A$ is a valid probability distribution. Moreover, by definition, $\beta$ satisfies the bounds
\[
\beta\le1\qquad\text{and}\qquad\beta\ge\min_{|x|\le R}\frac{\frac1{\sqrt{1+\gamma}}e^{-x^2/2(1+\gamma)}}{e^{-x^2/2}}=\frac1{\sqrt{1+\gamma}}.\tag{40}
\]
With $\Sigma=I_d+\gamma vv^\top$ and following the proof of Lemma C.2, by our choice of $b$ it suffices to show that, for all $x\in\mathbb R^d$,
\[
\Big|\log\frac{f(x)}{\phi(x;0,\Sigma)}\Big| \le \log\sqrt{1+\tau},
\]
where $f$ denotes the density of the hidden-direction distribution $P_{A,v}$. Using the definition of $P_{A,v}$, the likelihood ratio is equal to
\[
\frac{f(x)}{\phi(x;0,\Sigma)} = \frac{A(x')}{\phi(x';0,1+\gamma)},\qquad\text{where }x'=v^\top x.
\]
From the definition of $A$, it is clear that the above likelihood ratio is equal to 1 when $|x'|>R$, and thus the desired bound is satisfied there. On the other hand, for $|x'|\le R$, we have
\[
\log\frac{A(x')}{\phi(x';0,1+\gamma)} = \log\big(\beta\sqrt{1+\gamma}\big)+\log e^{-\frac{x'^2\gamma}{2(1+\gamma)}} = \log\big(\beta\sqrt{1+\gamma}\big)-\frac{x'^2\gamma}{2(1+\gamma)}.
\]
In order to show that $Q\in\mathcal R\big(\mathcal N(0,I_d+\gamma vv^{\mathsf T}),\epsilon,q\big)$, it remains to establish the pair of inequalities
\[
\log\big(\beta\sqrt{1+\gamma}\big) \le \log\sqrt{1+\tau}
\qquad\text{and}\qquad
\log\big(\beta\sqrt{1+\gamma}\big)-\frac{R^2\gamma}{2(1+\gamma)} \ge -\log\sqrt{1+\tau}.
\]
The first inequality follows by applying the upper bound $\beta\le1$ in (40) in conjunction with the assumption $\gamma\le\tau$. Turning to the second inequality, we have
\[
\log\big(\beta\sqrt{1+\gamma}\big)-\frac{R^2\gamma}{2(1+\gamma)} \ge -\frac{R^2\gamma}{2(1+\gamma)} \ge -\log\sqrt{1+\tau},
\]
where the first step follows from the lower bound $\beta\ge\frac1{\sqrt{1+\gamma}}$ in (40), and the final step follows from the assumption $R^2\le\frac{1+\gamma}\gamma\log(1+\tau)$. Finally, we obtain a bound on $|\beta-1|$, which holds under the condition $R\gtrsim\sqrt{1+\gamma}$. In particular, expanding yields
\[
|\beta-1| = \frac{\mathbb P(|G|\le R)-\mathbb P\big(|G|\le\frac R{\sqrt{1+\gamma}}\big)}{\mathbb P(|G|\le R)} = \frac{\mathbb P\big(|G|\in\big[\frac R{\sqrt{1+\gamma}},R\big]\big)}{\mathbb P(|G|\le R)} \lesssim \mathbb P\Big(|G|\in\Big[\frac R{\sqrt{1+\gamma}},R\Big]\Big)
\]
\[
\lesssim \min\Big\{\mathbb P\Big(|G|\ge\frac R{\sqrt{1+\gamma}}\Big),\;\Big(R-\frac R{\sqrt{1+\gamma}}\Big)\cdot\phi\Big(\frac R{\sqrt{1+\gamma}}\Big)\Big\} \lesssim \min\Big\{1,\;R\cdot\Big(1-\frac1{\sqrt{1+\gamma}}\Big)\Big\}e^{-\Omega(R^2/(1+\gamma))}.
\]

Lemma C.4 (Linear regression hard instance). Let $q\in(0,1]$, $\epsilon\in[0,1)$, and $\tau=\frac\epsilon{q(1-\epsilon)}$. Let $v\in\mathbb S^{d-1}$ and let $b,\gamma,r,R\in\mathbb R_{++}$ be such that
\[
b=q(1-\epsilon)\sqrt{1+\tau},\qquad\gamma^2r^2\le0.5\log(1+\tau),\qquad\text{and}\qquad R\le\frac{\log(1+\tau)}{8\gamma r}.
\]
Let $P_*$ denote the distribution over $(X,y)$ with $X\sim\mathcal N(0,I_d)$ and $y\mid X\sim\mathcal N(\gamma v^\top X,1)$. For all $x\in\mathbb R^d$, with $\beta_x\in(0,1]$ defined implicitly below, let $A_x$ denote the distribution over $\mathbb R$ with density
\[
A_x(y) := \begin{cases}\beta_x\,\phi(y)&|v^\top x|\le r\text{ and }|y|\le R\\ \phi(y;\gamma v^\top x,1)&\text{otherwise},\end{cases}\tag{41}
\]
and define $Q\in\mathcal P(\mathbb R_\star^{d+1})$ as $Q(\{\star^{d+1}\})=1-b$ and $Q(x,y)=b\cdot\phi(x)A_x(y)$. Then $Q\in\mathcal R(P_*,\epsilon,q)$.

Proof.
For any $x$, $A_x$ is a valid distribution since it integrates to 1: (i) this is obvious if $|v^\top x|>r$, and (ii) for $|v^\top x|\le r$, it is satisfied if we choose
\[
\beta_x\,\mathbb P(|G|\le R)+1-\mathbb P\big(|G+\gamma x^\top v|\le R\big)=1 \iff \beta_x=\frac{\mathbb P(|G+\gamma x^\top v|\le R)}{\mathbb P(|G|\le R)}.
\]
Note that, for all $x$, $\beta_x\in(0,1]$. Using the same argument as in the proof of Lemma C.2 for a fixed $x$, we get that
\[
1 \ge \beta_x \ge \min_{|y|\le R}\frac{\phi(y;\gamma x^\top v,1)}{\phi(y)} = \min_{|y|\le R}\frac{e^{-(y-\gamma x^\top v)^2/2}}{e^{-y^2/2}} = \min_{|y|\le R}e^{-\gamma^2(x^\top v)^2/2+\gamma yx^\top v} = e^{-\gamma^2(x^\top v)^2/2-\gamma R|x^\top v|}.
\]
Let $f(x,y)$ be the conditional density of $Q$ under the event $\{(x,y)\ne\star^{d+1}\}$, and let $p(x,y)$ denote the density of the linear model $X\sim\mathcal N(0,I_d)$, $y\sim\mathcal N(\gamma v^\top X,1)$. As earlier, we want the following likelihood ratio to lie in the range prescribed by Corollary C.1:
\[
\frac{f(x,y)}{p(x,y)} = \frac{A_x(y)}{\phi(y;\gamma v^\top x,1)} = \begin{cases}\dfrac{\beta_x\phi(y)}{\phi(y;\gamma v^\top x,1)}&\text{if }|y|\le R\text{ and }|v^\top x|\le r\\[2mm] 1&\text{otherwise}.\end{cases}
\]
By Corollary C.1 and our choice of $b$, we are done once we establish that the absolute value of the logarithm of this density ratio is uniformly upper bounded by $0.5\log(1+\tau)$. As per the definition, this needs to be checked only when $|y|\le R$ and $|v^\top x|\le r$. For such $(x,y)$, the ratio is
\[
\frac{f(x,y)}{p(x,y)} = \beta_x\frac{e^{-y^2/2}}{e^{-(y-\gamma x^\top v)^2/2}} = \beta_x\,e^{\gamma^2(x^\top v)^2/2}\,e^{-\gamma yx^\top v},
\]
and therefore the logarithm satisfies
\[
\log\frac{f(x,y)}{p(x,y)} = \log\big(\beta_xe^{\gamma^2(x^\top v)^2/2}\big)-\gamma yx^\top v.
\]
It would suffice if the absolute value of each term is at most $0.25\log(1+\tau)$. The second term is at most $\gamma Rr$ in absolute value, which by assumption is bounded appropriately. For the first term, we use that $e^{-\gamma^2(x^\top v)^2/2-\gamma R|x^\top v|}\le\beta_x\le1$ to get
\[
\log\big(\beta_xe^{\gamma^2(x^\top v)^2/2}\big) \le \gamma^2r^2/2
\qquad\text{and}\qquad
\log\big(\beta_xe^{\gamma^2(x^\top v)^2/2}\big) \ge \log e^{-2\gamma R|x^\top v|} = -2\gamma R|x^\top v|.
\]
Thus, $\big|\log\big(\beta_xe^{\gamma^2(x^\top v)^2/2}\big)\big|\le\max(\gamma^2r^2/2,\,2\gamma Rr)$, which is also bounded appropriately.

C.2 Information-theoretic lower bounds

C.2.1 Mean estimation: Proof of Theorem 3.1 lower bound

We prove the claim for general $q\in(0,1]$. We introduce the quantities
\[
\tau=\frac\epsilon{q(1-\epsilon)},\qquad\tau'=\log(1+\tau),\qquad\text{and}\qquad b=q(1-\epsilon)\sqrt{1+\tau}.
\]
We claim that it suffices to prove that
\[
\mathfrak M := \mathfrak M_n\big(\delta,\mathcal P_{\mathrm{mean}}(\sigma^2,\epsilon,q),L\big) \gtrsim T_{\mathrm{parametric}}\vee T_{\mathrm{dimension}}\vee T_{\mathrm{confidence}},\tag{42}
\]
where
\[
T_{\mathrm{parametric}} := \sqrt{\frac{d+\log(1/\delta)}{nq(1-\epsilon)}},\qquad
T_{\mathrm{dimension}} := \sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{cnb\tau'^2}d\big)}},\qquad
T_{\mathrm{confidence}} := \sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{Cnb\tau'^2}{\log(1/2\delta)}\big)}},
\]
for a sufficiently large constant $C$. We defer the proof of this reduction to the end of the section. We next prove each lower bound in turn.

Proof of $\mathfrak M\gtrsim T_{\mathrm{parametric}}$: We consider the contamination that censors everything, so that the resulting distribution is MCAR with probability of observation equal to $q(1-\epsilon)$. Then $\mathfrak M\gtrsim T_{\mathrm{parametric}}$ is essentially the standard minimax mean estimation lower bound combined with the fact that there are $O(nq(1-\epsilon))$ non-missing samples with high probability; see [MVBWS24, Prop. 47] for a formal argument.

Proof of $\mathfrak M\gtrsim T_{\mathrm{dimension}}$: This case forms the technical bulk of the proof and proceeds via Fano's inequality. We begin by defining several parameters. We will let $\gamma:=\gamma(\epsilon,q,n,d)$ denote our separation parameter. Throughout, we will impose the constraint
\[
\gamma \le \frac{\sqrt{\tau'}}{10},\tag{43}
\]
and set a truncation threshold $R=\tau'/(10\gamma)$.
Note that, from the constraint (43), we deduce the bound
\[
\gamma \le \frac{\sqrt{\tau'}}{10} \le \frac R{10} \le \frac R2.\tag{44}
\]
Equipped with these parameters, we let $\mathcal N$ denote a $1/2$-packing of $\mathbb S^{d-1}$, which, by, e.g., [Ver25, Corollary 4.2.11], exists and satisfies the cardinality bound $|\mathcal N|\le5^d$. For each $v\in\mathcal N$, it suffices to construct distributions $P_v$ which satisfy the pair of desiderata:

1. (Valid contamination) $P_v\in\mathcal R\big(\mathcal N(\gamma v,I_d),\epsilon,q\big)$.
2. (Equal probability of missing mass) The missingness probability $P_v\big(\{\star\}^d\big)=1-b$.

We will additionally define a central distribution $H\in\mathcal P\big(\mathbb R^d\cup\{\star^d\}\big)$ as
\[
H\big(\{\star^d\}\big)=1-b\qquad\text{and}\qquad H(x)=b\cdot\phi_d(x)\ \ \text{for all }x\in\mathbb R^d.\tag{45}
\]
By Fano's inequality (Lemma 2.10), it suffices to construct distributions $P_v$ which satisfy the two desiderata above and such that, for all $v\in\mathcal N$, it holds that $\mathrm{KL}(P_v,H)\le\frac d{2n}$. Indeed, if this KL bound holds, then, by tensorization of the KL divergence,
\[
\frac{\frac1{|\mathcal N|}\sum_{v\in\mathcal N}\mathrm{KL}\big(P_v^{\otimes n},H^{\otimes n}\big)-\log(2-|\mathcal N|^{-1})}{\log|\mathcal N|} \le \frac{d/2}{d\log5} < \frac12.
\]
We take $P_v$ to be the hard distribution $Q$ from Lemma C.2, with all lemma parameters chosen identically to this theorem. The remainder of this proof is dedicated to establishing the bound on $\mathrm{KL}(P_v,H)$. Since the probability of observing $\{\star^d\}$ is identical under both $P_v$ and $H$, and since $P_v$ and $H$ are supported on $\{\star^d\}\cup\mathbb R^d$, we have
\[
\mathrm{KL}(P_v,H) = b\cdot\mathrm{KL}(P'_v,H'),\tag{46}
\]
where $P'_v=P_{A,v}$ and $H'=\mathcal N(0,I_d)$ denote the respective conditional distributions over $\mathbb R^d$. Denoting $x':=v^\top x$ and $\bar x$ the projection of $x$ onto $v^\perp$, observe that the density of $P'_v$ is $p'_v(x)=\phi_{d-1}(\bar x)A(v^\top x)$, and that of $H'$ is $h'(x)=\phi_{d-1}(\bar x)\phi(v^\top x)$. Applying the definition of $A$ in (38) yields
\[
\frac{p'_v(x)}{h'(x)} = \frac{A(x')}{\phi(x')} = \begin{cases}\beta&\text{if }|x'|\le R\\ e^{-0.5\gamma^2+\gamma x'}&\text{if }|x'|>R.\end{cases}
\]
In turn, this implies that the log-likelihood ratio satisfies
\[
\log\frac{p'_v(x)}{h'(x)} = (\log\beta)\mathbb 1_{|x'|\le R}+\big(-0.5\gamma^2+\gamma x'\big)\mathbb 1_{|x'|>R} \le \gamma x'\,\mathbb 1_{|x'|>R},
\]
where we have used the bound $\beta\le1$. The expectation of $\log\frac{p'_v(x)}{h'(x)}$ under $P'_v$ is exactly equal to $\mathrm{KL}(P'_v,H')$. Since the upper bound depends only on $x'\sim A$, it is equivalent to take an expectation of this term over $x'\sim A$, which, as per (38), yields
\[
\mathrm{KL}(P'_v,H') \le \mathbb E_{X\sim A}\big[\gamma X\mathbb 1_{|X|>R}\big] = \gamma\,\mathbb E\big[(G+\gamma)\mathbb 1_{|G+\gamma|>R}\big] = \gamma^2\,\mathbb P(|G+\gamma|>R)+\gamma\,\mathbb E\big[G\mathbb 1_{|G+\gamma|>R}\big].\tag{47}
\]
To upper bound the first term, we use that $\gamma<R/2$ and obtain $\mathbb P(|G+\gamma|>R)\le\mathbb P(|G|>R/2)\lesssim e^{-\Omega(R^2)}$. For the second term, we use the following inequalities, which rely on symmetry:
\[
\mathbb E\big[G\mathbb 1_{|G+\gamma|>R}\big] = \mathbb E\big[G\mathbb 1_{|G|>R}\big]+\mathbb E\big[G\big(\mathbb 1_{|G+\gamma|>R}-\mathbb 1_{|G|\ge R}\big)\big] = \mathbb E\big[G\big(\mathbb 1_{|G+\gamma|>R}-\mathbb 1_{|G|\ge R}\big)\big]
\]
\[
\le \mathbb E\big[|G|\,\big|\mathbb 1_{|G+\gamma|>R}-\mathbb 1_{|G|\ge R}\big|\big] \le \mathbb E\big[|G|\mathbb 1_{|G|\in[R-\gamma,R+\gamma]}\big] \le 4\gamma\max_{x\in[R/2,\,3R/2]}x\phi(x) \lesssim \gamma e^{-\Omega(R^2)},\tag{48}
\]
where we use that $xe^{-x^2/2}\lesssim e^{-x^2/4}$ for all $x>0$. Combining Equations (46) to (48), we deduce the inequality
\[
\mathrm{KL}(P_v,H) \lesssim b\gamma^2e^{-\Omega(R^2)} \lesssim b\gamma^2e^{-c\tau'^2/\gamma^2} \lesssim b\tau'^2\,ye^{-1/y},\tag{49}
\]
where we use the value $R=\tau'/(10\gamma)$ and define $y:=\frac{\gamma^2}{c\tau'^2}$, for a sufficiently small constant $c$. For this to be less than $\frac d{2n}$, it suffices that
\[
ye^{-1/y} \le \rho' := \frac d{Cb\tau'^2n}
\]
for a sufficiently large constant $C$.
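This scalar condition can also be inverted numerically, before appealing to the Lambert $W$ function below. A minimal sketch, assuming SciPy; the value of $\rho'$ is hypothetical:

```python
import numpy as np
from scipy.special import lambertw
from scipy.optimize import brentq

rho_p = 1e-3                                  # hypothetical value of rho'
# Largest y in (0, 1] with y * exp(-1/y) <= rho', found by root-finding
# (the map y -> y * exp(-1/y) is strictly increasing on (0, infty)):
y_star = brentq(lambda y: y * np.exp(-1 / y) - rho_p, 1e-6, 1.0)
# Analytic inversion via the principal Lambert W branch, as in the text:
y_W = 1 / np.real(lambertw(1 / rho_p))
print(f"root-finding: {y_star:.6f},  1/W0(1/rho'): {y_W:.6f},"
      f"  1/log(3 + 1/rho'): {1 / np.log(3 + 1 / rho_p):.6f}")
```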
That is, $y\le\frac1{W_0(1/\rho')}$, where $W_0(x)$ is the Lambert $W$ function (principal branch), which is the unique positive value such that $W_0(x)e^{W_0(x)}=x$ for $x\ge0$. Observe that $\log(3+x)\,e^{\log(3+x)}>x+3>x$, and thus $W_0(x)\le\log(3+x)$. This condition is therefore always satisfied whenever
\[
\gamma \lesssim \tau'\sqrt y = \frac{\tau'}{\sqrt{W_0\big(\frac{Cnb\tau'^2}d\big)}}.
\]
And since $\gamma$ must also satisfy $\gamma\lesssim\sqrt{\tau'}$ for all of these calculations to be valid, we get the desired lower bound for mean estimation:
\[
\mathfrak M \gtrsim \gamma \gtrsim \sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{Cnb\tau'^2}d\big)}} = T_{\mathrm{dimension}},\tag{50}
\]
which concludes this part of the proof.

Proof of $\mathfrak M\gtrsim T_{\mathrm{confidence}}$: We will use the same constructions as in the previous part of the proof. To this end, note that $H\in\mathcal R(\mathcal N(0,I_d),\epsilon,q)$. We let $v$ be any unit vector and let $Q$ be the distribution from Lemma C.2. Applying the Bretagnolle–Huber inequality in conjunction with the KL bound from (49) yields
\[
\mathrm{TV}\big(Q^{\otimes n},H^{\otimes n}\big) \le \sqrt{1-e^{-n\,\mathrm{KL}(Q,H)}} \le \sqrt{1-\exp\big(-n\cdot b\tau'^2ye^{-1/y}\big)}.
\]
Following the same steps as in the previous part, ensuring $ye^{-1/y}\lesssim\frac{\log(1/2\delta)}{b\tau'^2n}$ implies that $\mathrm{TV}(Q^{\otimes n},H^{\otimes n})\le1-2\delta$. Hence, we apply [MVS24, Lemma 5] to deduce that
\[
\mathfrak M \gtrsim \sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{Cnb\tau'^2}{\log(1/2\delta)}\big)}},\tag{51}
\]
as desired.

Proof of the reduction (42): Re-phrasing (42), we have the minimax lower bound
\[
\mathfrak M \gtrsim \alpha+\sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{C\sqrt{1+\tau}\,\tau'^2}{\alpha^2}\big)}},
\]
where $\alpha=\sqrt{\frac{d+\log(1/\delta)}{nq(1-\epsilon)}}\le1/\sqrt{C_1}$ and $C$ is a sufficiently large constant. Let $C'$ be a sufficiently large constant that we will specify later. Then, by considering several cases, we will show that this is equivalent to the claimed lower bound in Theorem 3.1.

• Case 1: $\tau\le C'\alpha$. This regime is immediate, because $\frac{\log(1+\tau)}{\sqrt{\log(1+\tau^2/\alpha^2)}}\asymp\frac\tau{\sqrt{\tau^2/\alpha^2}}=\alpha\lesssim\mathfrak M$.

• Case 2: $C'\alpha\le\tau\le1$. Then, because $\tau'\le\tau\le1$, we have that
\[
\sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{C\sqrt{1+\tau}\tau'^2}{\alpha^2}\big)}} = \frac{\tau'}{\sqrt{\log\big(3+\frac{C\sqrt{1+\tau}\tau'^2}{\alpha^2}\big)}} \ge \frac{\log(1+\tau)}{\sqrt{\log\big(3+C\sqrt2\,\tau^2/\alpha^2\big)}} \ge \frac{\log(1+\tau)}{\sqrt{2\log(1+\tau^2/\alpha^2)}},
\]
where the final inequality follows from $\tau/\alpha\ge C'$ by picking $C'$ large enough (depending on $C$).

• Case 3: $1\le\tau\le\alpha^{-2}/C_1$. In this regime, $\log\big(3+\frac{C\sqrt{1+\tau}\tau'^2}{\alpha^2}\big)\asymp\log(1/\alpha)$, so that
\[
\sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{C\sqrt{1+\tau}\tau'^2}{\alpha^2}\big)}} \asymp \sqrt{\log(1+\tau)}\cdot\bigg[1\wedge\sqrt{\frac{\log(1+\tau)}{\log(1/\alpha)}}\bigg].
\]
Then, because $\tau\le\alpha^{-2}/C_1$, we have $\log(1+\tau)\lesssim\log(1/\alpha)$, and hence
\[
\sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{C\sqrt{1+\tau}\tau'^2}{\alpha^2}\big)}} \gtrsim \frac{\log(1+\tau)}{\sqrt{\log(1/\alpha)}} \asymp \frac{\log(1+\tau)}{\sqrt{\log(1+\tau^2/\alpha^2)}},
\]
where the final equality (up to constant factors) follows because $1\le\tau\le\alpha^{-2}/C_1$.

• Case 4: $\tau>\alpha^{-2}/C_1$. In this regime, $\log\big(3+\frac{C\sqrt{1+\tau}\tau'^2}{\alpha^2}\big)\asymp\log(1+\tau)$, so that
\[
\sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{C\sqrt{1+\tau}\tau'^2}{\alpha^2}\big)}} \asymp \sqrt{\log(1+\tau)}.
\]
Moreover, $\log(1+\tau^2/\alpha^2)\asymp\log(1+\tau)$ because $\tau>\alpha^{-2}/C_1$. Thus,
\[
\sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{C\sqrt{1+\tau}\tau'^2}{\alpha^2}\big)}} \asymp \frac{\log(1+\tau)}{\sqrt{\log(1+\tau^2/\alpha^2)}}.
\]
This completes the reduction, proving the claim.

C.2.2 Covariance estimation: Proof of Theorem 3.2 lower bound

We prove the claim for general $q\in(0,1]$. The proof will closely mirror the structure of the proof of the lower bound of Theorem 3.1. We introduce the quantities
\[
\tau=\frac\epsilon{q(1-\epsilon)},\qquad\tau'=\log(1+\tau),\qquad\text{and}\qquad b=q(1-\epsilon)\sqrt{1+\tau}.
\]
We claim that it suffices to prove that
\[
\mathfrak M := \mathfrak M_n\big(\delta,\mathcal P_{\mathrm{cov}}(\epsilon,q),L\big) \ge T_{\mathrm{parametric}}\vee T_{\mathrm{dimension}}\vee T_{\mathrm{confidence}},\tag{52}
\]
where
\[
T_{\mathrm{parametric}} := \sqrt{\frac{d+\log(1/\delta)}{nq(1-\epsilon)}},\qquad
T_{\mathrm{dimension}} := \tau'\wedge\frac{\tau'}{\log\big(3+\frac{cnq(1-\epsilon)\sqrt{1+\tau}\,\tau'}d\big)},\qquad
T_{\mathrm{confidence}} := \tau'\wedge\frac{\tau'}{\log\big(3+\frac{cnq(1-\epsilon)\sqrt{1+\tau}\,\tau'}{\log(1/4\delta)}\big)},
\]
for a sufficiently large constant $c$. As in the mean estimation case, the lower bound $\mathfrak M\ge T_{\mathrm{parametric}}$ is standard, and we omit its proof. We next prove each of the lower bounds $\mathfrak M\ge T_{\mathrm{dimension}}$ and $\mathfrak M\ge T_{\mathrm{confidence}}$ in turn.

Proof of $\mathfrak M\ge T_{\mathrm{dimension}}$: Let $\gamma\in[0,1]$ denote the target minimax error, to be specified later. Let $\mathcal N$ be a $1/2$-packing of unit vectors, as in the proof of Theorem 3.1(b), satisfying $|\mathcal N|\le5^d$. For each $v\in\mathcal N$, we will construct hypotheses $P_v$ which satisfy the pair of desiderata
\[
P_v\in\mathcal R\big(\mathcal N(0,I_d+\gamma vv^\top),\epsilon,q\big)\qquad\text{and}\qquad P_v\big(\{\star^d\}\big)=1-b.
\]
By Lemma G.4, for any distinct $v,\bar v\in\mathcal N$, we have that, for all matrices $M$, $L(M,I_d+\gamma vv^\top)\vee L(M,I_d+\gamma\bar v\bar v^\top)>\gamma/16$. Hence, any estimator $\hat\Sigma$ with worst-case risk at most $\gamma/64$ must satisfy $\mathbb P_{P_v}\big(L(\hat\Sigma,I_d+\gamma vv^\top)\ge\gamma/16\big)\le1/4$ for all $v\in\mathcal N$, and can be used to recover $v\in\mathcal N$ with probability at least $3/4$. To prove that recovering $v\in\mathcal N$ with probability at least $3/4$ is impossible, it will suffice, by Fano's inequality (Lemma 2.10), to show that there exists a distribution $H$ such that $\mathrm{KL}(P_v,H)\lesssim\frac dn$ for all $v\in\mathcal N$. We will take $P_v$ to be the distribution $Q$ from Lemma C.3 with $R^2=\frac{1+\gamma}\gamma\log(1+\tau)$, and the common distribution $H$ to be defined as
\[
H\big(\{\star^d\}\big)=1-b\qquad\text{and}\qquad H(x)=b\cdot\phi_d(x)\ \ \text{for all }x\in\mathbb R^d.\tag{53}
\]
Since $P_v(\{\star^d\})=H(\{\star^d\})=1-b$ and both distributions are supported on $\mathbb R^d\cup\{\star^d\}$, we deduce that
\[
\mathrm{KL}(P_v,H) = b\cdot\mathrm{KL}(P'_v,H'),\tag{54}
\]
where $P'_v$ and $H'$ denote the distributions of $P_v$ and $H$ conditioned on belonging to $\mathbb R^d$. We next introduce the shorthand $x':=v^\top x$ and let $p'_v$ and $h'$ denote the densities of $P'_v$ and $H'$. Expanding their definitions, we simplify the likelihood ratio as
\[
\frac{p'_v(x)}{h'(x)} = \frac{A(x')}{\phi(x')} = \begin{cases}\beta&\text{if }|x'|\le R\\ \frac1{\sqrt{1+\gamma}}e^{\frac{\gamma|x'|^2}{2(1+\gamma)}}&\text{if }|x'|>R.\end{cases}
\]
Hence, we find that the log-likelihood ratio is bounded as
\[
\log\frac{p'_v(x)}{h'(x)} = (\log\beta)\mathbb 1_{|x'|\le R}+\Big(\frac{\gamma|x'|^2}{2(1+\gamma)}-\log\sqrt{1+\gamma}\Big)\mathbb 1_{|x'|>R} \le \frac{\gamma|x'|^2}{2(1+\gamma)}\mathbb 1_{|x'|>R},\tag{55}
\]
where we have additionally used the inequality $\beta\le1$, which holds by construction. Moreover, our construction implies that, if $X'\sim A$,
\[
X'\cdot\mathbb 1_{|X'|>R} \overset{\mathrm{dist}}= \sqrt{1+\gamma}\cdot G\,\mathbb 1_{\sqrt{1+\gamma}\,|G|>R}.
\]
Taking expectations with respect to the random variable $X'\sim A$, it thus follows from the upper bound (55) that
\[
\mathrm{KL}(P'_v,H') \le \frac\gamma2\,\mathbb E\big[G^2\mathbb 1_{\sqrt{1+\gamma}\,|G|>R}\big] \overset{(a)}\lesssim \gamma\sqrt{\mathbb P\big(\sqrt{1+\gamma}\,|G|>R\big)} \lesssim \gamma e^{-\Omega(R^2/(1+\gamma))},\tag{56}
\]
where step $(a)$ follows upon applying the Cauchy–Schwarz inequality. Combining Equations (54) and (56) then yields the upper bound
\[
\mathrm{KL}(P_v,H) \lesssim b\gamma e^{-\Omega(R^2/(1+\gamma))} \lesssim b\gamma e^{-\Omega(\log(1+\tau)/\gamma)},\tag{57}
\]
where the final inequality follows since, by construction, $R^2=\frac{1+\gamma}\gamma\log(1+\tau)$. To obtain the desired lower bound, we set
\[
\gamma = \min\bigg\{c,\;\frac{\log(1+\tau)}{\log\big(\frac{q(1-\epsilon)n}{cd}\big)}\bigg\},
\]
for a sufficiently small constant $c\le1$. Note that, from the numeric inequality $\log(1+x)\le x$, this setting ensures that $\gamma\le\tau$.
Hence, substituting this into the inequality (57), we obtain
\[
\mathrm{KL}(P_v,H) \lesssim b\gamma e^{-\Omega(\log(1+\tau)/\gamma)} \lesssim q(1-\epsilon)\sqrt{1+\tau}\,\exp\bigg(-\Omega\Big(\frac{\log(1+\tau)}c\vee\log\Big(\frac{q(1-\epsilon)n}{cd}\Big)\Big)\bigg) \lesssim \frac dn,
\]
where the last inequality follows by taking $c$ sufficiently small and because $\frac d{nq(1-\epsilon)}\le1$. It thus follows from Fano's inequality that $\mathfrak M\ge T_{\mathrm{dimension}}$.

Proof of $\mathfrak M\ge T_{\mathrm{confidence}}$: We will use the same constructions as in the previous part of the proof. To this end, note that $H\in\mathcal R(\mathcal N(0,I_d),\epsilon,q)$. We let $v$ be any unit vector, and let $Q$ be the distribution from Lemma C.3. In this case, we take
\[
\gamma = \min\bigg\{\tau',\;\frac{\tau'}{\log\big(\frac{b\tau'n}{C\log(1/(4\delta))}\big)}\bigg\} = \min\bigg\{\tau',\;\frac{\tau'}{\log\big(\frac{q(1-\epsilon)\sqrt{1+\tau}\,\tau'n}{C\log(1/(4\delta))}\big)}\bigg\}.
\]
Applying the KL bound (57) and substituting the setting of $\gamma$ above yields
\[
\mathrm{KL}(Q,H) \lesssim b\gamma e^{-\Omega(\tau'/\gamma)} = q(1-\epsilon)\sqrt{1+\tau}\,\gamma e^{-\Omega(\tau'/\gamma)} \le \frac{c\log(1/(4\delta))}n.
\]
Hence, adjusting constants and applying the Bretagnolle–Huber inequality, we obtain
\[
\mathrm{TV}\big(Q^{\otimes n},H^{\otimes n}\big) \le \sqrt{1-e^{-n\,\mathrm{KL}(Q,H)}} \le 1-2\delta.
\]
We conclude by applying a high-probability variant of Le Cam's method [MVS24, Lemma 5], which yields the lower bound $\mathfrak M\ge\gamma=T_{\mathrm{confidence}}$, as desired.

Proof of the reduction (52): Re-phrasing (52), we have the minimax lower bound
\[
\mathfrak M \gtrsim \alpha+\min\bigg\{c,\;\frac{\log(1+\tau)}{\log(1/c\alpha^2)}\bigg\} \asymp \alpha+\frac{\log(1+\tau)}{\log(1+\tau)+\log(1/\alpha)},
\]
where $\alpha=\sqrt{\frac{d+\log(1/\delta)}{nq(1-\epsilon)}}\le1/\sqrt{C_1}$ and $c$ is a sufficiently small constant. By considering several cases, we will show that this is equivalent to the claimed lower bound in Theorem 3.2.

• Case 1: $\tau\lesssim\alpha\log(1/\alpha)$. This regime is immediate, because $\frac{\log(1+\tau)}{\log(1+\tau/\alpha^2)}\asymp\alpha\lesssim\mathfrak M$.

• Case 2: $\alpha\log(1/\alpha)\lesssim\tau\lesssim\alpha^{-2}$. This regime follows because both $\log(1/c\alpha^2)$ and $\log(1+\tau/\alpha^2)$ are within constant factors of $\log(1/\alpha)$.

• Case 3: $\tau\gtrsim\alpha^{-2}$. This regime is immediate, because $\log(1+\tau)\asymp\log(1+\tau/\alpha^2)\gtrsim\log(1/\alpha)$.

Because we have covered all cases of $\tau$, the claim follows.

C.2.3 Linear regression: Proof of Theorem 5.1 lower bound

Proof. We prove the claim for general $q\in(0,1]$. We can assume without loss of generality that $\sigma^2=1$, as otherwise we can scale the responses by $\sigma^{-1}$. We introduce the quantities
\[
\tau=\frac\epsilon{q(1-\epsilon)},\qquad\tau'=\log(1+\tau),\qquad\text{and}\qquad b=q(1-\epsilon)\sqrt{1+\tau}.
\]
By the same argument as in the proof of the lower bound of Theorem 3.1, it suffices to show that
\[
\mathfrak M := \mathfrak M\big(\delta,\mathcal P_{\mathrm{LR}}(1,\epsilon),L\big) \gtrsim T_{\mathrm{clean}}\vee T_{\mathrm{dimension}}\vee T_{\mathrm{confidence}},\tag{58}
\]
where
\[
T_{\mathrm{clean}} := \sqrt{\frac{d+\log(1/\delta)}{nq(1-\epsilon)}},\qquad
T_{\mathrm{dimension}} := \sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{Cnb\tau'^2}d\big)}},\qquad
T_{\mathrm{confidence}} := \sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{Cnb\tau'^2}{\log(1/2\delta)}\big)}},
\]
for a sufficiently large constant $C$. The lower bounds $T_{\mathrm{clean}}$ and $T_{\mathrm{confidence}}$ follow by appropriate modifications to the arguments in the proof of the lower bound of Theorem 3.1, so we focus on establishing $T_{\mathrm{dimension}}$. The final claim then follows from (58) by an identical argument as in Theorem 3.1. Let $\gamma:=\gamma(\epsilon,q,n,d)$ denote the resulting error bound. Let $r\in\mathbb R_+$ be the $x$-truncation parameter, chosen such that $\gamma^2r^2\le0.5\tau'$, and let $R\in\mathbb R_+$ be the $y$-truncation parameter with $R:=\frac{\tau'}{10\gamma r}$; observe that $\gamma r\le R/2$. We shall choose $r=1$ in the remainder of the proof. Thus, the following arguments will be applicable as long as $\gamma\lesssim\sqrt{\tau'}$. Let $F_\theta$ be the distribution over $(X,y)$ such that $X\sim\mathcal N(0,I_d)$ and $y\sim\mathcal N(\theta^\top X,1)$. Let $\mathcal N$ be a $1/2$-packing of unit vectors satisfying $|\mathcal N|\le5^d$.
For each $v\in\mathcal N$, we will define $P_v\in\mathcal R(F_{\gamma v},\epsilon,q)$ with the additional property that $P_v(\star^{d+1})=1-b$. To prove our desired lower bound of $\Omega(\gamma)$, it will suffice, by Fano's inequality (Lemma 2.10), to show that there exists a distribution $H$ such that $\mathrm{KL}(P_v,H)\lesssim\frac dn$ for all $v\in\mathcal N$. We will take $P_v$ to be the distribution $Q$ from Lemma C.4 with parameters $b,\gamma,r,R$ (noting that the parameter requirements are satisfied), and $H$ to be defined as
\[
H\big(\{\star^{d+1}\}\big)=1-b\qquad\text{and}\qquad H(x,y)=b\cdot F_0(x,y)\ \ \text{for all }x\in\mathbb R^d,\ y\in\mathbb R.\tag{59}
\]
Because $P_v(\star^{d+1})=H(\star^{d+1})=1-b$ and both distributions are supported on $\mathbb R^{d+1}\cup\{\star^{d+1}\}$, we have that
\[
\mathrm{KL}(P_v,H) = b\cdot\mathrm{KL}(P'_v,H'),\tag{60}
\]
where $P'_v$ and $H'$ are the corresponding conditional distributions on $\mathbb R^{d+1}$. Observe that the density ratio satisfies (with $p_0$ being the density of $F_0$)
\[
\frac{p'_v(x,y)}{p_0(x,y)} = \frac{\phi(x)A_x(y)}{\phi(x)\phi(y)} = \begin{cases}\beta_x&\text{if }|x^\top v|\le r\text{ and }|y|\le R\\ \dfrac{e^{-(y-\gamma v^\top x)^2/2}}{e^{-y^2/2}}&\text{otherwise}\end{cases}
= \begin{cases}\beta_x&\text{if }|x^\top v|\le r\text{ and }|y|\le R\\ e^{\gamma yx^\top v-\gamma^2(v^\top x)^2/2}&\text{otherwise}.\end{cases}
\]
Therefore,
\[
\log\frac{p'_v(x,y)}{p_0(x,y)} = \begin{cases}\log\beta_x&\text{if }|x^\top v|\le r\text{ and }|y|\le R\\ \gamma yx^\top v-\frac{\gamma^2(v^\top x)^2}2&\text{otherwise}\end{cases}
\le \gamma\big(yx^\top v-\gamma(v^\top x)^2\big)\mathbb 1_{|x^\top v|>r\text{ or }|y|>R},\tag{61}
\]
where we use that $\beta_x\le1$. Taking expectations under $P'_v$, we get that the KL divergence is upper bounded by
\[
\mathrm{KL}(P'_v,F_0)/\gamma \le \mathbb E_{(x,y)\sim P'_v}\Big[\big(yx^\top v-\gamma(v^\top x)^2\big)\mathbb 1_{|x^\top v|>r\text{ or }|y|>R}\Big]
= \int_{\{|x^\top v|>r\}\cup\{|y|>R\}}\big(yx^\top v-\gamma(v^\top x)^2\big)\phi(x)A_x(y)\,dx\,dy
\]
\[
= \int_{\{|z|>r\}\cup\{|y|>R\}}(yz-\gamma z^2)\,\phi(z)\,\phi(y;\gamma z,1)\,dz\,dy\qquad(\text{definition of }A_x\text{ and }z:=x^\top v)
\]
\[
= \mathbb E_{G,Z}\big[\big((\gamma Z+G)Z-\gamma Z^2\big)\mathbb 1_{|Z|>r\text{ or }|\gamma Z+G|>R}\big] = \mathbb E_{G,Z}\big[GZ\,\mathbb 1_{|Z|>r\text{ or }|\gamma Z+G|>R}\big],
\]
where $Z$ and $G$ are independent standard normals, and the equality follows from the representation $Y=\gamma Z+G$ for the Gaussian linear model, which coincides with $A_x$ on that region. Continuing, we get
\[
\mathrm{KL}(P'_v,F_0)/\gamma \le \mathbb E_{G,Z}\big[GZ\mathbb 1_{|Z|>r\text{ or }|\gamma Z+G|>R}\big]-\mathbb E_{G,Z}\big[GZ\mathbb 1_{|Z|>r\text{ or }|G|>R}\big]\qquad\big(\text{by symmetry, }\mathbb E_{G,Z}[GZ\mathbb 1_{|Z|>r\text{ or }|G|>R}]=0\big)
\]
\[
= \mathbb E_{G,Z}\big[GZ\big(\mathbb 1_{|Z|>r\text{ or }|\gamma Z+G|>R}-\mathbb 1_{|Z|>r\text{ or }|G|>R}\big)\big]
= \mathbb E_{G,Z}\big[GZ\,\mathbb 1_{|Z|\le r}\big(\mathbb 1_{|\gamma Z+G|>R}-\mathbb 1_{|G|>R}\big)\big]
\le \mathbb E_{G,Z}\Big[|G|\cdot|Z|\,\mathbb 1_{|Z|\le r}\big|\mathbb 1_{|\gamma Z+G|>R}-\mathbb 1_{|G|>R}\big|\Big].
\]
Since $\big|\mathbb 1_{|x+y|\ge a}-\mathbb 1_{|x|\ge a}\big|\le\mathbb 1_{x\in[-|a|-|y|,-|a|+|y|]}+\mathbb 1_{x\in[|a|-|y|,|a|+|y|]}$, we get that
\[
\mathrm{KL}(P'_v,F_0)/\gamma \le \mathbb E_{G,Z}\Big[|G||Z|\,\mathbb 1_{|Z|\le r}\,\mathbb 1_{G\in[-R-\gamma|Z|,-R+\gamma|Z|]\cup[R-\gamma|Z|,R+\gamma|Z|]}\Big]
= 2\,\mathbb E_{G,Z}\Big[|G||Z|\,\mathbb 1_{|Z|\le r}\,\mathbb 1_{G\in[R-\gamma|Z|,R+\gamma|Z|]}\Big]\qquad(\text{by symmetry})
\]
\[
\lesssim (R+\gamma r)\,\mathbb E_Z\Big[|Z|\,\mathbb 1_{|Z|\le r}\,\mathbb P\big(G\in[R-\gamma|Z|,R+\gamma|Z|]\big)\Big]
\le (R+\gamma r)\,\mathbb E_Z\Big[|Z|\,\mathbb 1_{|Z|\le r}\cdot\gamma|Z|\,e^{-(R-\gamma|Z|)^2/2}\Big]
\lesssim R\,\gamma\,\mathbb E_Z[Z^2]\,e^{-\Omega(R^2)} \lesssim R\gamma e^{-\Omega(R^2)} \lesssim \gamma e^{-\Omega(R^2)},\tag{62}
\]
where we use $\gamma r\le R/2$ and the fact that $xe^{-\Omega(x^2)}\lesssim e^{-\Omega(x^2)}$ for all $x>0$. Combining Equations (60) and (62), the KL divergence is upper bounded by
\[
\mathrm{KL}(P_v,H) \le b\gamma^2e^{-\Omega(R^2)} \le b\gamma^2e^{-c\tau'^2/\gamma^2},
\]
where we use that $R\asymp\tau'/\gamma$. Since the expressions and parameter restrictions are identical to those encountered in the proof of Theorem 3.1 (see Section C.2.1), we obtain a lower bound of
\[
\min\bigg\{\sqrt{\tau'},\;\frac{\tau'}{\sqrt{\log\big(3+\frac{Cnb\tau'^2}d\big)}}\bigg\}.
\]
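For intuition about the scale of the resulting lower bound, the following sketch (assuming only NumPy; the problem sizes, $\epsilon$, $q$, and the stand-in constant $C$ are all hypothetical) evaluates the three terms appearing in (58):

```python
import numpy as np

n, d, delta = 10_000, 50, 0.05             # hypothetical problem size
eps, q = 0.1, 0.9
tau = eps / (q * (1 - eps))
tau_p = np.log(1 + tau)
b = q * (1 - eps) * np.sqrt(1 + tau)
C = 10.0                                   # stand-in for the unspecified constant

T_clean = np.sqrt((d + np.log(1 / delta)) / (n * q * (1 - eps)))
T_dim = min(np.sqrt(tau_p),
            tau_p / np.sqrt(np.log(3 + C * n * b * tau_p**2 / d)))
T_conf = min(np.sqrt(tau_p),
             tau_p / np.sqrt(np.log(3 + C * n * b * tau_p**2 / np.log(1 / (2 * delta)))))
print(f"T_clean = {T_clean:.4f}, T_dimension = {T_dim:.4f}, T_confidence = {T_conf:.4f}")
print(f"lower bound ~ {max(T_clean, T_dim, T_conf):.4f}")
```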
D Proofs of sum-of-squares upper bounds

D.1 Mean estimation: Proof of Theorem 4.5

We will prove that both claims hold given $\mathcal E$. The result then follows from Lemma 4.3.

Satisfiability. We first show that the constraint (9) is satisfiable. To this end, let $\theta=\theta_\star$. Let $\xi=\mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}[x^{\otimes2k}]-\mathbb E_{z\sim\mathcal N(0,I_d)}[z^{\otimes2k}]$ (recall that $T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}$ consists of $n$ i.i.d. standard Gaussian observations). Applying Lemma G.23 (with $t=d^k\|\xi\|_\infty$) gives
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\langle v,x-\theta\rangle^{2k} \le \mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}]+d^k\|\xi\|_\infty,
\]
from which $\mathcal E$ implies
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\langle v,x-\theta\rangle^{2k} \le \mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}]+\epsilon \le (1+\epsilon)\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}].
\]
Because the difference (up to normalization) between $\mathbb E_{x\sim T}\langle v,x-\theta\rangle^{2k}$ and $\mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\langle v,x-\theta\rangle^{2k}$ is a sum-of-squares, i.e.,
\[
\frac{|T|}n\,\mathbb E_{x\sim T}\langle v,x-\theta\rangle^{2k}+\frac{|T_{\mathrm{MCAR},\star}-T_{\mathrm{MNAR}}|}n\,\mathbb E_{x\sim T_{\mathrm{MCAR},\star}-T_{\mathrm{MNAR}}}\langle v,x-\theta\rangle^{2k} = \mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\langle v,x-\theta\rangle^{2k},
\]
it then follows that
\[
\mathbb E_{x\sim T}\langle v,x-\theta\rangle^{2k} \le \frac{|T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}|}{|T|}\,(1+\epsilon)\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}].
\]
Then $\mathcal E$ implies
\[
\mathbb E_{x\sim T}\langle v,x-\theta\rangle^{2k} \le \frac{|T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}|}{|T|}\,(1+\epsilon)\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}] \le \frac{(1+\epsilon)^2}{1-\epsilon}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}].
\]

Accuracy. Let $\tilde{\mathbb E}$ be a degree-$4k$ pseudoexpectation over $\theta\in\mathbb R^d$ satisfying the constraint (9). Note that
\[
\frac{(1+\epsilon)^2}{1-\epsilon}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}] \ge \tilde{\mathbb E}\big[\mathbb E_{x\sim T}\langle v,x-\theta\rangle^{2k}\big] \ge \tilde{\mathbb E}\Big[\frac{|T_{\mathrm{MCAR}}|}{|T|}\mathbb E_{x\sim T_{\mathrm{MCAR}}}\langle v,x-\theta\rangle^{2k}\Big] \ge \tilde{\mathbb E}\big[(1-\epsilon)\,\mathbb E_{x\sim T_{\mathrm{MCAR}}}\langle v,x-\theta\rangle^{2k}\big] \ge (1-\epsilon)\,\mathbb E_{x\sim T_{\mathrm{MCAR}}}\big\langle v,x-\tilde{\mathbb E}[\theta]\big\rangle^{2k},
\]
where the last step follows by combining Lemmas G.21 and G.22. For convenience, let $\hat p_\ell=\inf_{v\in\mathbb S^{d-1}}\mathbb E_{x\sim T_{\mathrm{MCAR}}}\langle v,x-\theta_\star\rangle^\ell$ and $\Delta=\big\|\theta_\star-\tilde{\mathbb E}[\theta]\big\|_2$. Then, taking $v=\frac{\tilde{\mathbb E}[\theta]-\theta_\star}\Delta$ and dividing the previous display by $1-\epsilon$ on both sides gives
\[
\Big(\frac{1+\epsilon}{1-\epsilon}\Big)^2\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}] \ge \hat p_{2k}+\sum_{i=1}^{k-1}\bigg[\binom{2k}{2i}\hat p_{2k-2i}\Delta^{2i}-\binom{2k}{2i-1}\hat p_{2k-2i+1}\Delta^{2i-1}\bigg]+\Delta^{2k}.
\]
The event $\mathcal E$ implies, via Lemma G.16, that $|\hat p_j-\mathbb E_{G\sim\mathcal N(0,1)}[G^j]|\le d^{j/2-k}\epsilon$ for $j\in[2k]$. Recall $\gamma=\frac{(1+\epsilon)^2-(1-\epsilon)^3}{(1-\epsilon)^2}$. Rearranging the previous display by moving $\hat p_{2k}$ to the other side gives
\[
\sum_{i=1}^{k-1}\bigg[\binom{2k}{2i}\hat p_{2k-2i}\Delta^{2i}-\binom{2k}{2i-1}\hat p_{2k-2i+1}\Delta^{2i-1}\bigg]+\Delta^{2k} \le \gamma\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}].\tag{63}
\]
Let
\[
\Delta_0 = \max_{i\in[k]}\ \frac{2\hat p_{2k-2i+1}}{\hat p_{2k-2i}}\cdot\frac{2i}{2k-2i+1},
\]
and suppose that $\Delta>\Delta_0$. Then each bracketed summand in the display (63) is at least $\binom{2k}{2i}\hat p_{2k-2i}\Delta^{2i}\big(1-\frac{\Delta_0}{2\Delta}\big)$. Recall that $\mathcal E$ implies $\hat p_{2k-2i}\ge0$ for all $i\in[k]$ (since these are even-degree polynomials with expectation at least 1 regardless of $i$), so taking $i=1$ and discarding all other terms gives
\[
\frac{2k(2k-1)\,\hat p_{2k-2}}{16}\,\Delta^2 \le \gamma\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}].
\]
Thus, we obtain
\[
\Delta \le \max\bigg\{\Delta_0,\;4\sqrt{\frac{\gamma\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}]}{k\,\hat p_{2k-2}}}\bigg\} \le \max\Big\{\Delta_0,\;8\sqrt{\frac\gamma k}\Big\},
\]
where the last step follows from the fact that the event $\mathcal E$ implies $\hat p_{2k-2}\ge0.5(2k-3)!!$ (recall $\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}]=(2k-1)!!$). All that remains is to upper bound $\Delta_0$. Recall $\mathcal E$ implies $|\hat p_{2k-2i+1}|\le\epsilon d^{1/2-i}$ for all $i\in[k]$ (since these are odd-degree polynomials with expectation zero). Likewise, we can lower bound $\hat p_{2k-2i}\ge0.5(2k-2i-1)!!$. Finally, recall $k\le\sqrt d$ by assumption. Putting these facts together gives $\Delta_0\le\frac{8\epsilon}k\le\frac{8\gamma}k$.
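As a numerical illustration of how the display (63) pins down $\Delta$, the sketch below (assuming SciPy; the values of $k$ and $\gamma$ are hypothetical) evaluates the left-hand side with the population Gaussian moments $p_j=\mathbb E[G^j]$ substituted for $\hat p_j$, and locates the largest $\Delta$ consistent with the right-hand side:

```python
import numpy as np
from scipy.special import comb, factorial2

def lhs(delta, k):
    """Left-hand side of (63) with population Gaussian moments p_j = E[G^j]."""
    p = lambda j: factorial2(j - 1) if j % 2 == 0 else 0.0  # only even j >= 2 occur
    total = delta ** (2 * k)
    for i in range(1, k):
        total += comb(2 * k, 2 * i) * p(2 * k - 2 * i) * delta ** (2 * i) \
               - comb(2 * k, 2 * i - 1) * p(2 * k - 2 * i + 1) * delta ** (2 * i - 1)
    return total

k, gamma = 3, 0.05                        # hypothetical degree and slack
rhs = gamma * factorial2(2 * k - 1)       # gamma * E[G^(2k)] = gamma * (2k-1)!!
deltas = np.linspace(0, 1, 10001)
feasible = deltas[np.array([lhs(t, k) for t in deltas]) <= rhs]
print(f"largest feasible Delta = {feasible.max():.4f}"
      f"  vs  8*sqrt(gamma/k) = {8 * np.sqrt(gamma / k):.4f}")
```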
D.2 Covariance estimation: Proof of Theorem 4.10

We will prove that both claims hold given $\mathcal E$. The result then follows from Lemma 4.3. Throughout the proof, we let $\tilde\Sigma=\mathbb E_{x\sim T}[xx^{\mathsf T}]$ be the empirical covariance of the observed data.

Satisfiability. We first show that $\mathcal A_{\mathrm{deviation}}$ is satisfiable by $M=(\tilde\Sigma^{1/2}\Sigma^{-1}\tilde\Sigma^{1/2})^{1/2}$ and $B=M-I_d$ for $\alpha=10\epsilon$. First, observe that both matrices are symmetric and $M\succeq0$ (recall that we assume $\Sigma\succ0$). The event $\mathcal E$ implies, for $S\in\{T_{\mathrm{MCAR}},T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}\}$, that
\[
\big\|\Sigma^{-1/2}\mathbb E_{x\sim S}[xx^{\mathsf T}]\Sigma^{-1/2}-I_d\big\|_{\mathrm{op}} \le d\,\big\|\Sigma^{-1/2}\mathbb E_{x\sim S}[xx^{\mathsf T}]\Sigma^{-1/2}-I_d\big\|_\infty \le \epsilon,
\]
and thus $(1-\epsilon)\Sigma\preceq\mathbb E_{x\sim S}[xx^{\mathsf T}]\preceq(1+\epsilon)\Sigma$. Also, $\mathcal E$ implies $m\ge(1-10\epsilon)n$. Thus, after renormalization, we have
\[
\frac{1-\epsilon}{1+\epsilon}\Sigma \preceq \mathbb E_{x\sim T}[xx^{\mathsf T}] \preceq \frac{1+\epsilon}{1-\epsilon}\Sigma,
\]
which after whitening becomes
\[
\frac{1-\epsilon}{1+\epsilon}\tilde\Sigma^{-1/2}\Sigma\tilde\Sigma^{-1/2} \preceq I_d \preceq \frac{1+\epsilon}{1-\epsilon}\tilde\Sigma^{-1/2}\Sigma\tilde\Sigma^{-1/2},
\]
and after inverting becomes
\[
\frac{1-\epsilon}{1+\epsilon}\tilde\Sigma^{1/2}\Sigma^{-1}\tilde\Sigma^{1/2} \preceq I_d \preceq \frac{1+\epsilon}{1-\epsilon}\tilde\Sigma^{1/2}\Sigma^{-1}\tilde\Sigma^{1/2}.
\]
Because $\epsilon<1/100$, we can bound $\sqrt{\frac{1+\epsilon}{1-\epsilon}}\le1+5\epsilon$. So, from the fact that $f(t)=\sqrt t$ is operator monotone, it follows, by taking square roots of the above display, that $M$ and $B$ satisfy the constraints in $\mathcal A_{\mathrm{deviation}}$ for $\alpha=10\epsilon$. Now all that remains is to show satisfiability of the $\mathcal A_{\mathrm{moments}}$ constraints for the same choice of $M$. Let $\ell\in[k]$ and denote $\xi_S=\mathbb E_{x\sim S}[(\Sigma^{-1/2}x)^{\otimes2\ell}]-\mathbb E_{z\sim\mathcal N(0,I_d)}[z^{\otimes2\ell}]$. We prove the upper bound by taking $S=T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}$; the lower bound follows analogously by taking $S=T_{\mathrm{MCAR}}$. Because $\tilde\Sigma^{-1/2}x\sim\mathcal N(0,M^{-2})$, applying Lemma G.23 (with $t=d^\ell\|\xi_{T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\|_\infty$) gives
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\langle v,M\tilde\Sigma^{-1/2}x\big\rangle^{2\ell} \le \mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+d^\ell\big\|\xi_{T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\|_\infty,
\]
from which $\mathcal E$ implies
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\langle v,M\tilde\Sigma^{-1/2}x\big\rangle^{2\ell} \le \mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+\epsilon \le (1+\epsilon)\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}].
\]
Because the difference between $\frac{|T|}n\,\mathbb E_{x\sim T}\langle v,M\tilde\Sigma^{-1/2}x\rangle^{2\ell}$ and $\mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\langle v,M\tilde\Sigma^{-1/2}x\rangle^{2\ell}$ is a sum-of-squares, it then follows that
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T}\big\langle v,M\tilde\Sigma^{-1/2}x\big\rangle^{2\ell} \le \frac n{|T|}(1+\epsilon)\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}] \le \frac{(1+\epsilon)^2}{1-\epsilon}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}].
\]
Using our assumption that $\epsilon<1/100$, we can bound $\frac{(1+\epsilon)^2}{1-\epsilon}\le1+10\epsilon$. Thus, $M$ and $B$ satisfy the upper bound in the constraints (10) over $\ell\in[k]$.

Accuracy. Let $\beta=10\epsilon$ and $A_{\mathrm{OPT}}=\tilde\Sigma^{-1/2}\Sigma\tilde\Sigma^{-1/2}$ for convenience. We will show (Lemma D.1) that the event $\mathcal E$ implies
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T}\big\langle v,M\tilde\Sigma^{-1/2}x\big\rangle^{2k} \in (1\pm8\epsilon)\,\big\|\Sigma^{1/2}\tilde\Sigma^{-1/2}Mv\big\|_2^{2k}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}]\pm2\epsilon.
\]
Together with the fact that $\tilde{\mathbb E}$ satisfies constraint (10), this implies, for $\beta=10\epsilon$, that
\[
\tilde{\mathbb E}\Big[\big\|\Sigma^{1/2}\tilde\Sigma^{-1/2}Mv\big\|_2^{2k}\Big] \in \Big[\frac{1-\beta}{1+\beta},\;\frac{1+\beta}{1-\beta}\Big],
\]
i.e., $\tilde{\mathbb E}\big[\|\Sigma^{1/2}\tilde\Sigma^{-1/2}Mv\|_2^{2k}\big]\in1\pm10\beta$ (since $\beta<1/2$). Combining Lemmas G.26 and G.24 then gives $\tilde{\mathbb E}\big[\|\Sigma^{1/2}\tilde\Sigma^{-1/2}Mv\|_2^2\big]\in1\pm\gamma$. Because this holds for all $v\in\mathbb S^{d-1}$, we obtain $\big\|\tilde{\mathbb E}[MA_{\mathrm{OPT}}M]-I_d\big\|_{\mathrm{op}}\le\gamma$. Thus, there exists $A^*$ satisfying $\|\tilde{\mathbb E}[MA^*M]-I_d\|_{\mathrm{op}}\le\gamma$, and moreover, by the triangle inequality, any such $A^*$ must satisfy $\|\tilde{\mathbb E}[M(A^*-A_{\mathrm{OPT}})M]\|_{\mathrm{op}}\le2\gamma$. Applying Lemma 4.9 with $H=A^*-A_{\mathrm{OPT}}$ then gives $\|A^*-A_{\mathrm{OPT}}\|_{\mathrm{op}}\lesssim\epsilon/k$.
It thus follows, for any $v\in\mathbb S^{d-1}$, that
\[
\big\|\Sigma^{-1/2}\tilde\Sigma^{1/2}(A^*-A_{\mathrm{OPT}})\tilde\Sigma^{1/2}\Sigma^{-1/2}v\big\|_2 \lesssim \frac\epsilon k\,\big\|\Sigma^{-1/2}\tilde\Sigma^{1/2}\big\|_{\mathrm{op}}\big\|\tilde\Sigma^{1/2}\Sigma^{-1/2}\big\|_{\mathrm{op}} \lesssim \frac\epsilon k,
\]
where the last step follows from the fact that $\mathcal E$ implies $\frac12\Sigma\preceq\tilde\Sigma\preceq2\Sigma$. The main claim then follows immediately.

Lemma D.1. The event $\mathcal E$ implies
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T}\big\langle v,M\tilde\Sigma^{-1/2}x\big\rangle^{2k} \in (1\pm\beta)\,\big\|\Sigma^{1/2}\tilde\Sigma^{-1/2}Mv\big\|_2^{2k}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}].
\]

Proof. The proof follows a similar structure to the proof of satisfiability. As before, let $\ell\in[k]$ and denote $\xi_S=\mathbb E_{x\sim S}[(\Sigma^{-1/2}x)^{\otimes2\ell}]-\mathbb E_{z\sim\mathcal N(0,I_d)}[z^{\otimes2\ell}]$. We prove the upper bound by taking $S=T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}$; the lower bound follows analogously by taking $S=T_{\mathrm{MCAR}}$. Let $u=\Sigma^{1/2}\tilde\Sigma^{-1/2}Mv$. Lemma G.23 implies, for all $t>0$, that
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\langle v,M\tilde\Sigma^{-1/2}x\big\rangle^{2\ell} = \mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\langle u,\Sigma^{-1/2}x\big\rangle^{2\ell} \le \|u\|_2^{2\ell}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+\frac t2\|u\|_2^{4\ell}+\frac{d^{2\ell}}{2t}\big\|\xi_{T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\|_\infty^2.
\]
Using the fact that $M\preceq2I$ and taking $t=(d/4)^\ell\|\xi_{T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\|_\infty$, we then obtain
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\langle v,M\tilde\Sigma^{-1/2}x\big\rangle^{2\ell} \le \|u\|_2^{2\ell}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+2^{4\ell-1}t+\frac{d^{2\ell}}{2t}\big\|\xi_{T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\|_\infty^2
\]
\[
\le \|u\|_2^{2\ell}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+(4d)^\ell\big\|\xi_{T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\|_\infty \le \|u\|_2^{2\ell}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+\epsilon,
\]
where the last step follows from $\mathcal E$. Because the difference between $\frac{|T|}n\,\mathbb E_{x\sim T}\langle v,M\tilde\Sigma^{-1/2}x\rangle^{2\ell}$ and $\mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\langle v,M\tilde\Sigma^{-1/2}x\rangle^{2\ell}$ is a sum-of-squares, it then follows that
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T}\big\langle v,M\tilde\Sigma^{-1/2}x\big\rangle^{2\ell} \le \frac n{|T|}\Big(\|u\|_2^{2\ell}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+\epsilon\Big) \le \frac{1+\epsilon}{1-\epsilon}\Big(\|u\|_2^{2\ell}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+\epsilon\Big) \le (1+8\epsilon)\,\|u\|_2^{2\ell}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+2\epsilon,
\]
where the last step follows from our assumption that $\epsilon<1/100$.

D.2.1 Proof of Lemma 4.9

Since $\tilde{\mathbb E}$ satisfies $B=M-I_d$, $\tilde{\mathbb E}$ must satisfy $MHM=(I+B)H(I+B)$. Thus, for any unit vector $v\in\mathbb S^{d-1}$ we have
\[
v^{\mathsf T}Hv = v^{\mathsf T}\tilde{\mathbb E}\big[MHM-(BH+HB)-BHB\big]v \le \beta+\tilde{\mathbb E}\big[v^{\mathsf T}(BH+HB)v\big]+\tilde{\mathbb E}\big[v^{\mathsf T}BHBv\big].\tag{64}
\]
We will proceed to separately bound the two terms in the last display. First, we have
\[
\tilde{\mathbb E}\big[v^{\mathsf T}BHv\big] \le \sqrt{\tilde{\mathbb E}\big[(v^{\mathsf T}BHv)^2\big]} \le \sqrt{\tilde{\mathbb E}\big[\|Bv\|_2^2\big]\,\tilde{\mathbb E}\big[\|Hv\|_2^2\big]} \le \alpha\|H\|_{\mathrm{op}},
\]
where the first step follows from Jensen's inequality (Lemma G.22), the second from Cauchy–Schwarz (Lemma G.19), and the last from our constraint that $B^{\mathsf T}B\preceq\alpha^2I$. We can similarly bound $\tilde{\mathbb E}[v^{\mathsf T}HBv]$ and both negations (i.e., $-\tilde{\mathbb E}[v^{\mathsf T}BHv]$ and $-\tilde{\mathbb E}[v^{\mathsf T}HBv]$), implying $\big|\tilde{\mathbb E}[v^{\mathsf T}(BH+HB)v]\big|\le2\alpha\|H\|_{\mathrm{op}}$. For the second term, observe that $\|H\|_{\mathrm{op}}^2\|Bv\|_2^2-\|HBv\|_2^2$ is a sum of squares; i.e., for $C=\|H\|_{\mathrm{op}}^2I-H^{\mathsf T}H\succeq0$ and $DD^{\mathsf T}=C$,
\[
\|H\|_{\mathrm{op}}^2\|Bv\|_2^2-\|HBv\|_2^2 = v^{\mathsf T}B^{\mathsf T}DD^{\mathsf T}Bv.
\]
Thus, we can bound
\[
\tilde{\mathbb E}\big[v^{\mathsf T}BHBv\big] \le \sqrt{\tilde{\mathbb E}\big[(v^{\mathsf T}BHBv)^2\big]} \le \sqrt{\tilde{\mathbb E}\big[\|Bv\|_2^2\big]\,\tilde{\mathbb E}\big[\|HBv\|_2^2\big]} \le \alpha^2\|H\|_{\mathrm{op}}.
\]
Combining our bounds on the two terms and taking the supremum over $v$ on the left-hand side of equation (64) (recall $H$ is symmetric) gives
\[
\|H\|_{\mathrm{op}} \le \beta+2\alpha\|H\|_{\mathrm{op}}+\alpha^2\|H\|_{\mathrm{op}},
\]
from which the claim follows immediately.

E Proofs of computational lower bounds

E.1 Preliminaries

The following lemma controls the difference in moments between the hard univariate distributions used for the information-theoretic rate and the standard Gaussian.
Lemma E.1 (Approximate moment matching). For a distribution $A$ on $\mathbb R$ and $k\in\mathbb N$, define
\[
\delta_{A,k} := \big|\mathbb E_{X\sim A}[X^k]-\mathbb E_{G\sim\mathcal N(0,1)}[G^k]\big|.
\]
1. If $A$ is the instance from (38), and $|\gamma|\le R/2$ and $R\gtrsim1$, then $\delta_{A,k}\lesssim|\gamma|e^{-\Omega(R^2)}e^{O(k\log\max(k,R))}$.
2. If $A$ is the instance from (39) with $R\gtrsim1$, then $\delta_{A,k}\lesssim\gamma e^{-\Omega(R^2)+O(k\log\max(k,R))}$.

Proof. We first start with $A$ as in (38). Observe that the difference in moments is exactly equal to
\[
(\beta-1)\,\mathbb E\big[G^k\mathbb 1_{|G|\le R}\big]+\mathbb E\big[(G+\gamma)^k\mathbb 1_{|G+\gamma|>R}-G^k\mathbb 1_{|G|>R}\big].
\]
The absolute value of the first term is at most $|\beta-1|(k-1)!!\lesssim|\gamma|e^{-\Omega(R^2)}e^{O(k\log k)}$, where we use Lemma C.2 to control $|\beta-1|$. For the second term, we upper bound its absolute value as follows (using that $|a^k-b^k|\le k|a-b|\max(|a|^{k-1},|b|^{k-1})$):
\[
\Big|\mathbb E\big[(G+\gamma)^k\mathbb 1_{|G+\gamma|>R}-G^k\mathbb 1_{|G|>R}\big]\Big| \le \Big|\mathbb E\big[\big((G+\gamma)^k-G^k\big)\mathbb 1_{|G|>R}\big]\Big|+\Big|\mathbb E\big[(G+\gamma)^k\big(\mathbb 1_{|G+\gamma|>R}-\mathbb 1_{|G|>R}\big)\big]\Big|
\]
\[
\le \mathbb E\big[k|\gamma|(|G|+|\gamma|)^{k-1}\mathbb 1_{|G|>R}\big]+\mathbb E\big[(|G|+|\gamma|)^k\mathbb 1_{|G|\in[R-|\gamma|,R+|\gamma|]}\big]
\le \mathbb E\big[k|\gamma|(|G|+|\gamma|)^{k-1}\mathbb 1_{|G|>R}\big]+\mathbb E\big[\big(O(R+|\gamma|)\big)^k\mathbb 1_{|G|\in[R-|\gamma|,R+|\gamma|]}\big].
\]
Now, the second term is at most $(R+|\gamma|)^{O(k)}\,\mathbb P\big(|G|\in[R-|\gamma|,R+|\gamma|]\big)$, which is at most $|\gamma|R^{O(k)}e^{-\Omega(R^2)}$, where we use that $\gamma\lesssim R$. For the first term, we apply the Cauchy–Schwarz inequality to get that it is at most $|\gamma|(k+|\gamma|)^{O(k)}e^{-\Omega(R^2)}$, and then use that $\gamma\le R$. Combining everything, we have shown the desired upper bound on $\delta_{A,k}$.

Next, we consider the case when $A$ is from (39). The difference in moments is
\[
(\beta-1)\,\mathbb E\big[G^k\mathbb 1_{|G|\le R}\big]+\mathbb E\big[(G\sqrt{1+\gamma})^k\mathbb 1_{|G\sqrt{1+\gamma}|>R}-G^k\mathbb 1_{|G|>R}\big].
\]
The first term is at most $\gamma e^{-\Omega(R^2)}e^{O(k\log k)}$. For the second term, we perform a similar decomposition as before:
\[
\Big|\mathbb E\big[(G\sqrt{1+\gamma})^k\mathbb 1_{|G\sqrt{1+\gamma}|>R}-G^k\mathbb 1_{|G|>R}\big]\Big| \le \Big|\mathbb E\big[\big((G\sqrt{1+\gamma})^k-G^k\big)\mathbb 1_{|G|>R}\big]\Big|+\Big|\mathbb E\big[(G\sqrt{1+\gamma})^k\big(\mathbb 1_{|G\sqrt{1+\gamma}|>R}-\mathbb 1_{|G|>R}\big)\big]\Big|
\]
\[
\le \big((\sqrt{1+\gamma})^k-1\big)\,\mathbb E\big[|G|^k\mathbb 1_{|G|\ge R}\big]+O(R)^k\,\mathbb E\big[\big|\mathbb 1_{|G\sqrt{1+\gamma}|>R}-\mathbb 1_{|G|>R}\big|\big]
\le O(k\gamma)e^k\sqrt{\mathbb E[G^{2k}]}\,\sqrt{\mathbb P(|G|>R)}+R^k\,\mathbb P\big(G\in\big[R/\sqrt{1+\gamma},R\big]\big),
\]
which is at most $\gamma e^{k\log k-\Omega(R^2)}+R^kO(\gamma)e^{-\Omega(R^2)}$.

We will also use the following technical result, which perturbs a distribution $A$ so as to turn a distribution that approximately matches moments into one that exactly matches moments.

Lemma E.2 (Fixing the moments [DK23, Exercise 8.3]). There exists a constant $c>0$ such that the following holds. Let $k\in\mathbb N$ and let $A$ be any distribution over $\mathbb R$ with $k$ finite moments such that $|\mathbb E_A[X^i]-\mathbb E_{G\sim\mathcal N(0,1)}[G^i]|\le\alpha$ for all $i\in[k]$. Suppose that $\inf_{|x|\le1}A(x)\ge2\alpha k^c$. Then there is a distribution $A'$ such that
\[
|A(x)-A'(x)| \le \begin{cases}\alpha k^c&\text{if }|x|\le1\\0&\text{otherwise},\end{cases}
\]
and $A'$ matches $k$ moments with $\mathcal N(0,1)$ exactly. In particular, $|A(x)-A'(x)|\le\frac12|A(x)|$ for all $x$.

Claim E.3. Let $Q\in\mathcal P_{\mathcal R}(P,q,\epsilon)$. Let $R$ be an arbitrary distribution with finite $\chi^2(P,R)$. Then
\[
\chi^2(Q,R) \lesssim \max(1,\tau^2)\,\big(1+\chi^2(P,R)\big).
\]

Proof. By Corollary C.1, we have that, for all $x\in\mathbb R^d$,
\[
\frac LU \le \frac Lb \le \frac{q(x)}{p(x)} \le \frac Ub \le \frac UL,
\]
where $b=\mathbb P(X\ne\star)\in(L,U)$, $L=q(1-\epsilon)$, and $U=q(1-\epsilon)+\epsilon$. Therefore, $q(x)/p(x)\le(1+\tau)$.
Claim E.3. Let $Q \in \mathcal{R}_{\mathbb{R}}(P, \epsilon, q)$. Let $R$ be an arbitrary distribution with finite $\chi^2(P, R)$. Then $\chi^2(Q, R) \lesssim \max(1, \tau^2)\big(1 + \chi^2(P, R)\big)$.

Proof. By Corollary C.1, we have that for all $x$:
\[
\frac{L}{U} \le \frac{L}{b} \le \frac{q(x)}{p(x)} \le \frac{U}{b} \le \frac{U}{L},
\]
where $b = \mathbb{P}(X = \star) \in (L, U)$, $L = q(1-\epsilon)$, and $U = q(1-\epsilon) + \epsilon$. Therefore, $q(x)/p(x) \le 1+\tau$, and
\[
\chi^2(Q, R) = \mathbb{E}_R\bigg[\Big(\frac{q(X)}{r(X)}\Big)^2\bigg] - 1 = \mathbb{E}_R\bigg[\Big(\frac{q(X)}{p(X)}\Big)^2\Big(\frac{p(X)}{r(X)}\Big)^2\bigg] - 1 \le (1+\tau)^2\big(1+\chi^2(P,R)\big) - 1 \lesssim \max(1,\tau^2)\big(1+\chi^2(P,R)\big).
\]

High-level proof sketch for Theorems 4.7 and 4.12. Both of our lower bound instances are based on NGCA. In particular, the null instance has no contamination, and we set $P' = \mathcal{N}(0, I_d)$ for both Definitions 4.6 and 4.11. For the alternative instance, we set $P' := P_{A,v}$ for some univariate distribution $A$, chosen separately for the mean and the covariance testing problems. Observe that the corresponding clean distribution in the mean case is equal to $\mathcal{N}(\rho v, I_d)$ and in the covariance setting is equal to $\mathcal{N}(0, I + \rho vv^\top)$. These can be written compactly as $P_{F,v}$ for $F = \mathcal{N}(\delta, 1+\gamma)$, where $\gamma = 0, \delta = \rho$ in the mean case and $\delta = 0, \gamma = \rho$ in the covariance case. The following claim is immediate.

Claim E.4. If $A \in \mathcal{R}_{\mathbb{R}}(F, \epsilon, q)$ for a univariate distribution $F$, then $P_{A,v} \in \mathcal{R}(P_{F,v}, \epsilon, q)$.

We get the desired SQ hardness from Lemma 2.9 if $A \in \mathcal{R}_{\mathbb{R}}(F, \epsilon, q)$ matches $m$ moments with $\mathcal{N}(0,1)$, where $m = \widetilde{\Theta}(\epsilon^2/\rho^2)$ for the task of mean estimation and $m = \widetilde{\Theta}(\epsilon/\rho)$ for the task of covariance estimation (while ensuring that $\chi^2(A, \mathcal{N}(0,1)) \lesssim m$).

E.2 Mean estimation: Proof of Theorem 4.7

As mentioned above, we shall use Claim E.4 to reduce our problem to an NGCA instance. We make a change of variables and use $\gamma = \rho$ in what follows. Given the generic SQ hardness in Lemma 2.9, our goal reduces to finding a univariate distribution $A' \in \mathcal{R}_{\mathbb{R}}(\mathcal{N}(\gamma,1), \epsilon, q)$ that matches $m$ moments and satisfies $\chi^2(A', \mathcal{N}(0,1)) \lesssim 1$. We start by establishing the existence of a moment-matching distribution, providing its proof in Section E.2.1.

Lemma E.5. Let $\epsilon$ and $q$ be such that $\tau := \frac{\epsilon}{q(1-\epsilon)} \le c$ for a small enough constant. There exists a univariate distribution $A' \in \mathcal{R}_{\mathbb{R}}(\mathcal{N}(\gamma,1), \epsilon, q)$ that matches $m = \widetilde{\Theta}(\epsilon^2/\gamma^2)$ many moments with $\mathcal{N}(0,1)$.

To complete the proof using Lemma 2.9, it remains to show that the $\chi^2$ divergence is at most $O(m)$. Applying Claim E.3, we get that
\[
\chi^2(A', \mathcal{N}(0,1)) \lesssim 1 + \chi^2(\mathcal{N}(\gamma,1), \mathcal{N}(0,1)) = 1 + (e^{\gamma^2} - 1) \lesssim 1,
\]
where we use the identity $\chi^2(\mathcal{N}(\gamma,1), \mathcal{N}(0,1)) = e^{\gamma^2} - 1$ together with $\gamma \lesssim 1$ and $\tau \lesssim 1$.

E.2.1 Proof of Lemma E.5

Proof of Lemma E.5. Recall that we are in the regime where $\tau = \frac{\epsilon}{q(1-\epsilon)} \le c_0$ for a small enough constant. In this regime, $\tau = \Theta(\tau')$. Our starting point for constructing $A'$ is the distribution $A$ from Lemma C.2, which is a valid realizable contamination as per the lemma statement. The parameter choices are the same as in the proof of the lower bound in Theorem 3.1: we choose $R$ to be $\Theta(\tau'/\gamma) = \Theta(\tau/\gamma)$. Since $\gamma/\tau \le c'$ for a small enough absolute constant, we have that $R$ is at least a large enough constant. Furthermore, $R \le \sqrt{\tau}/10$ since $\gamma \le \tau/c$. Observe that the target number of moments, $m = \widetilde{\Theta}(\tau^2/\gamma^2)$, is $\Theta(R^2/\log R)$. Recall that to get the desired SQ hardness, we want an $A$ that matches $m$ moments for
\[
m \gtrsim \frac{\tau^2}{\gamma^2}\cdot\frac{1}{\log(\tau/\gamma)}.
\]
As shown in Lemma E.1, $A$ matches roughly $m_0 \asymp R^2/\log R$ many moments approximately. To be precise, there exists a small constant $c$ such that if $m \le cR^2/\log R$, then for all $i \in [m]$: $|\mathbb{E}_A[X^i] - \mathbb{E}[G^i]| \lesssim |\gamma|\, e^{-\Omega(R^2)}$. To match $m$ moments exactly, we perturb this distribution to another $A'$ using Lemma E.2.
For this lemma to be applicable, we want that
\[
\inf_{x : |x| \le 1} A(x) \gtrsim |\gamma|\, e^{-\Omega(R^2)}\,\mathrm{poly}(m). \tag{65}
\]
To establish (65), we shall show that, for $R \gtrsim 1$, the left-hand side is at least an absolute constant, while the right-hand side is upper bounded by $O(e^{-\Omega(R^2)})$, so that (65) is satisfied for $R$ large enough. For the left-hand side, the definition of $A$ tells us that $\inf_{x:|x|\le 1} A(x) = \beta\varphi(1) \gtrsim 1$, since $\beta \ge 1/2$ for $R$ at least a large enough constant (by Lemma C.2). For the right-hand side, observe that since $m \le R$, the right-hand side is at most $O(e^{-\Omega(R^2)})$. Thus, the resulting $A'$ matches $m$ moments exactly.

However, we still need to show that $A' \in \mathcal{R}_{\mathbb{R}}(\mathcal{N}(\gamma,1), \epsilon, q)$, i.e., that $A'$ is a valid realizable contamination of $\mathcal{N}(\gamma,1)$. By Corollary C.1, it suffices to establish that
\[
\max_{x\in\mathbb{R}} \log\frac{A'(x)}{\varphi(x;\gamma,1)} \le 0.5\log(1+\tau). \tag{66}
\]
For $|x| > 1$, this is true because $A'$ is exactly equal to $A$. For $|x| \le 1$, we need to ensure that the perturbations are small enough. We have that for any $|x| \le 1$,
\[
\log\frac{A'(x)}{\varphi(x;\gamma,1)} = \log\bigg[\frac{A(x)}{\varphi(x;\gamma)}\Big(1 + \frac{A'(x)-A(x)}{A(x)}\Big)\bigg] = \log\frac{A(x)}{\varphi(x;\gamma)} + \log\Big(1 + \frac{A'(x)-A(x)}{A(x)}\Big).
\]
By Lemma E.2, we have $\big|\frac{A(x)-A'(x)}{A(x)}\big| \le \frac{1}{2}$, and thus the inequality $|\log(1+y)| \lesssim |y|$ for $|y| \le \frac{1}{2}$ implies that
\[
\log\frac{A'(x)}{\varphi(x;\gamma)} \le \log\frac{A(x)}{\varphi(x;\gamma)} + O\Big(\frac{|A'(x)-A(x)|}{A(x)}\Big) \le 0.25\log(1+\tau) + O\big(|\gamma|\, e^{-\Omega(R^2)}\big),
\]
where we use that, in the proof of Lemma C.2, we had established the stronger inequality $\log\frac{A(x)}{\varphi(x;\gamma)} \le 0.25\log(1+\tau)$, as opposed to a bound of $\log\frac{A(x)}{\varphi(x;\gamma)} \le 0.5\log(1+\tau)$. To establish (66), we thus want that $\gamma\, e^{-\Omega(R^2)} \lesssim \log(1+\tau) \lesssim \tau$, meaning $e^{-\Omega(R^2)} \lesssim \tau/\gamma = \Theta(R)$, which is satisfied as long as $R \gtrsim 1$.
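The moment-fixing device of Lemma E.2 can also be seen concretely: perturb the density on $[-1,1]$ by a polynomial whose coefficients are chosen (via a linear system) to cancel the moment gaps. The Python sketch below is a minimal stand-in for the construction in [DK23, Exercise 8.3], not its actual proof; all helper names are ours, and for the perturbed function to remain a density the gaps must be small relative to $\inf_{|x|\le 1} A(x)$, mirroring the lemma's hypothesis.

import numpy as np
from scipy import integrate
from scipy.stats import norm

def fix_moments(A, K):
    # Corrections needed so the first K moments match N(0,1) exactly
    # (index 0 keeps the total mass unchanged).
    targets = np.zeros(K + 1)
    for i in range(1, K + 1):
        mA = integrate.quad(lambda x: x**i * A(x), -np.inf, np.inf)[0]
        targets[i] = norm.moment(i) - mA
    # Solve for Delta(x) = sum_j c_j x^j supported on [-1, 1], using
    # M[i, j] = int_{-1}^{1} x^{i + j} dx.
    idx = np.arange(K + 1)
    powers = idx[:, None] + idx[None, :]
    M = np.where(powers % 2 == 0, 2.0 / (powers + 1), 0.0)
    c = np.linalg.solve(M, targets)
    def A_prime(x):
        x = np.asarray(x, dtype=float)
        return A(x) + np.where(np.abs(x) <= 1, np.polyval(c[::-1], x), 0.0)
    return A_prime

A = lambda x: norm.pdf(x) * (1 + 0.01 * np.tanh(x))  # a valid density near N(0,1)
A2 = fix_moments(A, K=6)
for i in range(1, 7):
    m = sum(integrate.quad(lambda x: x**i * A2(x), a, b)[0]
            for a, b in [(-np.inf, -1), (-1, 1), (1, np.inf)])
    print(i, m, norm.moment(i))  # now agree up to numerical error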
E.2.2 Covariance estimation: Proof of Theorem 4.12

We now provide the proof of Theorem 4.12.

Proof of Theorem 4.12. We again make a change of variables and use $\gamma = \rho$ in the remainder of the proof. Using the same arguments as in the proof of Theorem 4.7, it suffices to establish a moment-matching distribution that is a valid realizable contamination.

Lemma E.6. Let $\epsilon$ and $q$ be such that $\tau := \frac{\epsilon}{q(1-\epsilon)} \le c$ for a small enough absolute constant. Let $\gamma \in (0, 1/2)$ be such that $\tau/\gamma \gtrsim 1$. There exists a univariate distribution $A' \in \mathcal{R}_{\mathbb{R}}(\mathcal{N}(0, 1+\gamma), \epsilon, q)$ that matches $m = \widetilde{\Theta}(\tau/\gamma)$ many moments with $\mathcal{N}(0,1)$.

Proof. We start from the $A$ given in Lemma C.3, where we take $R^2 = \frac{1+\gamma}{8\gamma}\log(1+\tau) \asymp \tau/\gamma$. According to Lemma E.1, the resulting $A$ matches roughly $m_0 = \widetilde{\Theta}(R^2)$ many moments with error at most $O(\gamma\, e^{-\Omega(R^2)})$. Following the same arguments as in the proof of Lemma E.5, we take this approximate moment-matching distribution $A$ and perturb it using Lemma E.2 to get an $m_0$-moment-matching distribution $A'$. This can be done as long as the density of $A$ is bounded from below by $O(|\gamma|\, e^{-\Omega(R^2)})\,\mathrm{poly}(m)$. As in the proof of Lemma E.5, this is satisfied as long as $R \gtrsim 1$.

Next we need to verify that $A'$ is a valid realizable contamination of $\mathcal{N}(0, 1+\gamma)$ in the sense of (66). For this we need to show that the likelihood ratio between $A'$ and $\mathcal{N}(0, 1+\gamma)$ is bounded appropriately, meaning that $\log\frac{A'(x)}{\varphi(x;0,1+\gamma)} \le 0.5\log(1+\tau)$ for all $x \in \mathbb{R}$. Doing the same calculations as in Lemma E.5, we get that
\[
\log\frac{A'(x)}{\varphi(x;0,1+\gamma)} \le \log\frac{A(x)}{\varphi(x;0,1+\gamma)} + O\Big(\frac{|A'(x)-A(x)|}{A(x)}\Big) \le 0.25\log(1+\tau) + O\big(|\gamma|\, e^{-\Omega(R^2)}\big),
\]
where we use that $\log\frac{A(x)}{\varphi(x;0,1+\gamma)} \le 0.25\log(1+\tau)$ if (i) $\log(1+\gamma) \le 0.5\log(1+\tau)$, which is equivalent to $\gamma \lesssim \tau \lesssim 1$ in our regime, and (ii) $R^2 \le \frac{1+\gamma}{2\gamma}\log(1+\tau)$; this can be seen by a simple inspection of the proof of Lemma C.3. For the second term to be less than $0.25\log(1+\tau)$, it suffices that $e^{-\Omega(R^2)} \lesssim \tau/\gamma \asymp R^2$, which holds as long as $R \gtrsim 1$; by our parameter choice, this is equivalent to requiring $\tau/\gamma \gtrsim 1$.

So far, we have established that $A'$ matches the desired number of moments while being a valid realizable contamination. Finally, it remains to establish an upper bound on $\chi^2(A', \mathcal{N}(0,1))$. Using Claim E.3, we have that
\[
\chi^2(A', \mathcal{N}(0,1)) \lesssim 1 + \chi^2(\mathcal{N}(0,1+\gamma), \mathcal{N}(0,1)) \lesssim 1,
\]
whenever $\gamma \le 1/2$ and $\tau \lesssim 1$.

F Extension to multiple missingness patterns for mean and covariance estimation

In this section, we demonstrate how our lower bounds and algorithms extend to the multiple-pattern setting depicted in Figure 1b. Let us begin with a generalization of the contamination model in (2). To this end, let $\mathcal{S}$ be a collection of subsets of $[d]$. We extend the definitions of MCAR (1a) and MNAR (1b) as
\[
\mathrm{MCAR}(P, \mathcal{S}, \pi) := \big\{\, \mathrm{Law}(X \star \Omega) : X \sim P,\; \Omega \in \mathbb{1}_{\mathcal{S}},\; \Omega \perp\!\!\!\perp X,\; \mathbb{P}(\Omega = \mathbb{1}_S) = \pi_S \,\big\}, \tag{67a}
\]
\[
\mathrm{MNAR}(P, \mathcal{S}) := \big\{\, \mathrm{Law}(X \star \Omega) : X \sim P \text{ and } \mathrm{Law}(\Omega) \in \mathcal{P}(\mathbb{1}_{\mathcal{S}}) \,\big\}, \tag{67b}
\]
where above we have let $\mathbb{1}_{\mathcal{S}} = \{\mathbb{1}_S\}_{S\in\mathcal{S}}$, with $\mathbb{1}_S \in \{0,1\}^d$ denoting the vector which takes the value 1 on the index set $S$ and 0 elsewhere, and $\mathcal{P}(\mathbb{1}_{\mathcal{S}})$ denotes the set of probability distributions supported on $\mathbb{1}_{\mathcal{S}}$. We then define the natural extension of the all-or-nothing realizable contamination model (2) as
\[
\mathcal{R}(P, \epsilon, \mathcal{S}, \pi) := (1-\epsilon)\,\mathrm{MCAR}(P, \mathcal{S}, \pi) + \epsilon\,\mathrm{MNAR}(P, \mathcal{S}). \tag{68}
\]
In the subsequent sections, we show that, at the cost of multiplicative factors scaling polynomially in the cardinality $|\mathcal{S}|$, our algorithmic guarantees and lower bounds can be extended to this more general setting.
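For concreteness, the following Python sketch draws samples from the multiple-pattern model (68). The NaN encoding of $\star$ and all function names are illustrative choices, not the paper's notation; the adversary may choose its pattern after seeing $X$, but only from $\mathcal{S}$, as required by (67b).

import numpy as np

rng = np.random.default_rng(0)

def sample_multi_pattern(n, mu, Sigma, patterns, pi, eps, adversary):
    # With prob. 1 - eps the pattern is drawn from pi independently of X
    # (MCAR); with prob. eps it is chosen by `adversary`, possibly as a
    # function of X (MNAR). Unobserved coordinates are set to NaN.
    d = len(mu)
    X = rng.multivariate_normal(mu, Sigma, size=n)
    out = np.full((n, d), np.nan)
    for i in range(n):
        if rng.random() < eps:
            S = adversary(X[i])                            # MNAR pattern in S
        else:
            S = patterns[rng.choice(len(patterns), p=pi)]  # MCAR pattern
        obs = list(S)
        out[i, obs] = X[i, obs]
    return out

# Two overlapping patterns covering [d] (Assumption F.4), each observed a
# constant fraction of the time (Assumption F.3).
patterns = [{0, 1, 2}, {2, 3}]
obs = sample_multi_pattern(1000, np.zeros(4), np.eye(4), patterns,
                           pi=[0.6, 0.4], eps=0.05,
                           adversary=lambda x: patterns[0] if x[0] > 0 else patterns[1])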
F.1 Lower bounds

In order to extend our lower bounds, we rely on the intuition that revealing more data in each observation makes the problem easier. We make this notion precise through an inflation map, defined presently.

Definition F.1 (Inflation map). For $\mathcal{S} \subseteq 2^{[d]}$, we call the map $f: \mathcal{S} \to 2^{[d]}$ an inflation map if, for all $S \in \mathcal{S}$, it holds that $S \subseteq f(S)$.

Equipped with this definition, we have the following lemma.

Lemma F.2 (Enlarging observations makes the problem easier). Let $\mathcal{S} \subseteq 2^{[d]}$ and let $f: \mathcal{S} \to 2^{[d]}$ be an inflation map as in Definition F.1. Let $\mathcal{A}$ denote an algorithm designed for observations from a distribution in $\mathcal{R}(P, \epsilon, \mathcal{S}, \pi)$. Given observations from $\mathcal{R}(P, \epsilon, f(\mathcal{S}), f_*\pi)$, there exists an efficient algorithm which simulates $\mathcal{A}$.

Proof. Let $f_*\pi$ denote the pushforward measure of $\pi$ under the inflation map $f$, so that for any set $S' \in \mathrm{Image}(f)$, $(f_*\pi)(S') = \pi(f^{-1}(S'))$. In turn, let $g$ denote a potentially randomized mapping which satisfies $g(S') \sim \pi$ when $S' \sim f_*\pi$, and note that it is possible to construct this map so that $g(S') \subseteq S'$, almost surely. Now, consider any measure $R$ contained in the inflated realizable contamination set $\mathcal{R}(P, \epsilon, f(\mathcal{S}), f_*\pi)$. By definition, there exist random variables $X \sim P$, $\Omega_{\mathrm{MNAR}}$, $\Omega_{\mathrm{MCAR}}$, and $B$ such that
\[
X \star \big((1-B)\,\Omega_{\mathrm{MCAR}} + B\,\Omega_{\mathrm{MNAR}}\big) \sim R.
\]
Now, since $g(S') \subseteq S'$ and $g(S') \sim \pi$, we see that the random variable
\[
X \star \big((1-B)\,g(\Omega_{\mathrm{MCAR}}) + B\,g(\Omega_{\mathrm{MNAR}})\big) \sim R_0 \in \mathcal{R}(P, \epsilon, \mathcal{S}, \pi),
\]
where $g$ acts on masks through the corresponding index sets. It follows that any algorithm designed for observations in $\mathcal{R}(P, \epsilon, \mathcal{S}, \pi)$ can be simulated by masking observed data from $\mathcal{R}(P, \epsilon, f(\mathcal{S}), f_*\pi)$ using $g$.

Let us use this reduction to provide information-theoretic lower bounds as well as SQ lower bounds. Both corollaries use the following instantiation of the inflation map $f_I$ for $I \subseteq [d]$:
\[
f_I : S \mapsto \begin{cases} S \cup I & \text{if } I \cap S = \emptyset \\ S & \text{otherwise.} \end{cases}
\]
Note that for any set of missingness patterns $\mathcal{S}$ and any $I$, either all or none of the subset $I$ is observed in any given sample. Next, consider the error metrics $\|\theta_1 - \theta_2\|_2$ and $\|\Sigma_1^{-1/2}\Sigma_2\Sigma_1^{-1/2} - I_d\|_{\mathrm{op}}$, and note that for any index set $I' \subseteq [d]$, it holds that
\[
\|\theta_1 - \theta_2\|_2 \ge \|(\theta_1)_{I'} - (\theta_2)_{I'}\|_2 \quad\text{and}\quad \|\Sigma_1^{-1/2}\Sigma_2\Sigma_1^{-1/2} - I_d\|_{\mathrm{op}} \ge \big\|(\Sigma_1)_{I',I'}^{-1/2}(\Sigma_2)_{I',I'}(\Sigma_1)_{I',I'}^{-1/2} - I_{|I'|}\big\|_{\mathrm{op}}.
\]
Hence, taking the maximum over all subsets $I \subseteq [d]$ in both the minimax quantile lower bounds of Theorems 3.1 and 3.2 and the SQ lower bounds of Theorems 4.7 and 4.12, and applying the reduction in Lemma F.2, we conclude that the same lower bounds hold in the multiple-pattern setting.

F.2 Upper bounds

In the previous section, we showed that our lower bounds for mean and covariance estimation continue to hold in the multiple-pattern setting. Here, we use our algorithms to develop quantitative upper bounds for mean and covariance estimation in turn, beginning with mean estimation. In order to obtain quantitative guarantees, we require the following pair of regularity conditions.

Assumption F.3 (Constant observation probability). The pair $(\mathcal{S}, \pi_{\mathcal{S}})$ satisfies the constant observation probability assumption if there exists a constant $c_0 > 0$ such that $\min_{S\in\mathcal{S}} \pi_S > c_0$.

Assumption F.3 ensures that each pattern is observed a constant fraction of the time. The second assumption concerns the coverage of the patterns in $\mathcal{S}$.

Assumption F.4 (Coverage regularity). The set of patterns $\mathcal{S}$ has regular coverage if $\bigcup_{S\in\mathcal{S}} S = [d]$.

Note that Assumption F.4 is necessary for non-trivial mean estimation.

F.2.1 Mean estimation

We begin by describing how to adapt estimators designed for the all-or-nothing setting to the multiple-pattern setting, proceeding in the following steps (a code sketch of Step 3 appears at the end of this subsection):

1. For each $S \in \mathcal{S}$, form the modified input that replaces every observation whose missingness pattern is not $S$ with the fully censored vector $(\star)^d$ and discards all coordinates in $[d]\setminus S$. This corresponds to an all-or-nothing dataset in $\mathbb{R}^S_\star$.

2. Run an all-or-nothing estimator (e.g., the estimator implicit in Theorem 3.1 or Algorithm 1) on each dataset to obtain an estimate $\widehat\theta_S \in \mathbb{R}^S$ with a high-probability error bound $r_S$, i.e., $\|\widehat\theta_S - (\theta_\star)_S\|_2 \le r_S$. This implies that the true mean $\theta_\star$ lies in the cylinder
\[
C_S := \big\{\theta \in \mathbb{R}^d : \|\widehat\theta_S - \theta_S\|_2 \le 2r_S\big\}.
\]

3. Since the guarantee above holds for all patterns $S \in \mathcal{S}$ (after taking a union bound), we deduce that $\theta_\star \in \bigcap_{S\in\mathcal{S}} C_S =: C$. Moreover, since each of the cylinders $C_S$ is convex, we can find some element $\widehat\theta \in C$ via convex optimization.

After producing an estimator $\widehat\theta$ using the steps above, we note that since both $\widehat\theta \in C$ and $\theta_\star \in C$, we can produce an error bound by computing a bound on the diameter of the set $C$. In order to make the above steps precise, three items remain.
First, in order to use our all-or-nothing algorithms, we must show that the reduction in Step 1 preserves realizability. Second, we provide more detail on how to produce an element of $C$ via convex optimization. Finally, we provide a bound on the diameter of the set $C$. We provide the details for each of these three items in order.

Projection preserves realizability. Let us begin with some notation. For $S \subseteq [d]$, let $\Pi_S : \mathbb{R}^d_\star \to \mathbb{R}^{|S|}_\star$ be the projection onto the coordinates in $S$. Additionally, define the censoring function $F_S : \mathbb{R}^d_\star \to \mathbb{R}^{|S|}_\star$ as
\[
F_S(x) = \begin{cases} x_S & \text{if } x_j \ne \star \text{ for all } j \in S \text{ and } x_j = \star \text{ for all } j \in S^c \\ (\star)^{|S|} & \text{otherwise;} \end{cases}
\]
i.e., we take the coordinates in $S$ when the missingness pattern is exactly $S$, and fully censor otherwise. The following straightforward lemma demonstrates that this censoring map preserves realizability.

Lemma F.5. Suppose that $Q \in \mathcal{R}(P, \epsilon, \mathcal{S}, \pi)$. Then, for every $S \in \mathcal{S}$, the pushforward measure $(F_S)_*Q$ satisfies the inclusion $(F_S)_*Q \in \mathcal{R}\big((\Pi_S)_*P, \epsilon, \pi_S\big)$.

Proof. Note that by definition there exists a tuple of random variables $X \sim P$, $\Omega_{\mathrm{MCAR}}$, $\Omega_{\mathrm{MNAR}}$, and $B \sim \mathrm{Bern}(\epsilon)$ such that
\[
X \star \big((1-B)\,\Omega_{\mathrm{MCAR}} + B\,\Omega_{\mathrm{MNAR}}\big) \sim Q.
\]
Now, consider the map $g : \{0,1\}^d \to \{0,1\}^{|S|}$ defined so that $g(\mathbb{1}_S) = \mathbb{1}_{|S|}$ (the all-ones vector) and $g(v) = 0$ (the all-zeros vector) for all other $v$. Note that
\[
\Pi_S X \star \big((1-B)\,g(\Omega_{\mathrm{MCAR}}) + B\,g(\Omega_{\mathrm{MNAR}})\big) \sim (F_S)_*Q.
\]
Further, it holds that $B$, $g(\Omega_{\mathrm{MCAR}})$, and $(\Pi_S X, g(\Omega_{\mathrm{MNAR}}))$ are mutually independent, with $\mathbb{P}(g(\Omega_{\mathrm{MCAR}}) = \mathbb{1}_{|S|}) = \pi_S$. This proves the claim.

Computing an element $\widehat\theta \in C$. We will show that an element of $C$ can be computed in polynomial time by the ellipsoid method. In order for this to hold, we require inner and outer radii $r$ and $R$ such that
\[
B_2(\theta, r) \subseteq C \subseteq B_2(0, R)
\]
for some element $\theta \in C$, together with a separation oracle for the set $C$.

Let us begin with the separation oracle. Note that if $\theta \notin C$, then there exists $S \in \mathcal{S}$ such that $\theta \notin C_S$. A separation oracle for balls in $\mathbb{R}^{|S|}$ provides a separation oracle for $C_S$ by filling the coordinates in $[d]\setminus S$ with zeros. This holds for all $S \in \mathcal{S}$ and in turn provides a separation oracle for the intersection $C$.

Next, we show that $C \subseteq B_2(0, R)$. Note that by Assumption F.4, it holds that $\cup_{S\in\mathcal{S}} S = [d]$. Hence, by the triangle inequality, we deduce that for any $\theta \in C$,
\[
\|\theta\|_2 \le \sum_{S\in\mathcal{S}} \|\theta_S\|_2 \le \sum_{S\in\mathcal{S}} \big(2r_S + \|\widehat\theta_S\|_2\big) =: R.
\]
Finally, we show that there exists an $r > 0$ such that $B_2(\theta_\star, r) \subseteq C$. To this end, we take $r := r_{\min} = \min_{S\in\mathcal{S}} r_S$. We then see that for any $\theta \in B_2(\theta_\star, r)$,
\[
\|\theta_S - \widehat\theta_S\|_2 \le \|(\theta_\star)_S - \widehat\theta_S\|_2 + \|\theta_S - (\theta_\star)_S\|_2 \le r_S + r_{\min} \le 2r_S.
\]
Since the above inequality holds for all $S \in \mathcal{S}$, it holds that $\theta \in C$, whence we deduce that $B_2(\theta_\star, r) \subseteq C$. It remains to bound the diameter of the set $C$.

Bounding the diameter of the set $C$. Let $w : \mathcal{S} \to [0,1]$ denote a weighting function such that for all $j \in [d]$,
\[
\sum_{S\,:\, j \in S} w(S) \ge 1.
\]
Note that under the coverage condition in Assumption F.4, such a weighting function exists. Let $I_S = \mathrm{diag}(\mathbb{1}_S) \in \mathcal{C}^d_+$ denote the diagonal matrix which contains ones on the indices contained in the set $S$, and note that the previous display ensures that $I_d \preceq \sum_{S\in\mathcal{S}} w(S)\, I_S$. It follows that
\[
\big\|\widehat\theta - \theta_\star\big\|_2^2 = \big(\widehat\theta - \theta_\star\big)^\top I_d \big(\widehat\theta - \theta_\star\big) \le \sum_{S\in\mathcal{S}} w(S)\cdot\big(\widehat\theta - \theta_\star\big)^\top I_S \big(\widehat\theta - \theta_\star\big) = \sum_{S\in\mathcal{S}} w(S)\cdot\big\|\big(\widehat\theta - \theta_\star\big)_S\big\|_2^2 \le 4\sum_{S\in\mathcal{S}} w(S)\, r_S^2.
\]
Note that it is always valid to take the weighting function $w(S) = 1$ for all $S \in \mathcal{S}$, so that
\[
\big\|\widehat\theta - \theta_\star\big\|_2^2 \le 4|\mathcal{S}|\cdot\max_{S\in\mathcal{S}} r_S^2.
\]
Combining this bound with the lower bound in the previous section, we see that under Assumption F.3 (which ensures $|\mathcal{S}|$ is of constant order), our algorithms are optimal in the multi-pattern setting up to a multiplicative factor of $|\mathcal{S}|$.
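As an illustration of Step 3, the Python sketch below computes a point in the intersection $C = \cap_S C_S$ of the cylinders. In place of the ellipsoid method described above, it uses cyclic alternating projections as a lightweight stand-in; for convex cylinders with nonempty intersection this also converges to an element of $C$ (approximately, after finitely many passes).

import numpy as np

def project_cylinder(theta, S, theta_hat_S, radius):
    # Project theta onto C_S = {theta : ||theta_S - theta_hat_S||_2 <= radius}.
    # Only the coordinates in S move; the remaining ones are unconstrained.
    theta = theta.copy()
    diff = theta[S] - theta_hat_S
    nrm = np.linalg.norm(diff)
    if nrm > radius:
        theta[S] = theta_hat_S + diff * (radius / nrm)
    return theta

def intersect_cylinders(d, estimates, n_passes=500):
    # `estimates` is a list of triples (S, theta_hat_S, r_S) from Step 2.
    theta = np.zeros(d)
    for _ in range(n_passes):
        for S, th_S, r_S in estimates:
            theta = project_cylinder(theta, np.asarray(list(S)), th_S, 2 * r_S)
    return theta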
F.2.2 Covariance estimation

We propose a very similar algorithm to that of the mean estimation setting. Although in the main text we provide guarantees for our estimators in relative operator norm, in order to simplify the development here, we consider guarantees in the (weaker) operator norm. In particular, for each $S \in \mathcal{S}$, our algorithms yield estimators $\widetilde\Sigma_S \in \mathcal{C}^{|S|}_+$ which satisfy the guarantee
\[
\big\|\widetilde\Sigma_S - (\Sigma_\star)_{S,S}\big\|_{\mathrm{op}} \le r_S, \quad\text{for all } S \in \mathcal{S}.
\]
Proceeding as before, we define the cylinders
\[
C_S := \big\{\Sigma \in \mathcal{C}^d_+ : \big\|\widetilde\Sigma_S - P_S \Sigma P_S^\top\big\|_{\mathrm{op}} \le r_S\big\},
\]
where the matrices $P_S \in \mathbb{R}^{|S|\times d}$ take the form $P_S = ((e_j)_{j\in S})^\top$ for $e_j$ the $j$-th elementary vector. We again define the intersection $C = \cap_{S\in\mathcal{S}} C_S$ and produce an estimate $\widehat\Sigma$ as any element of $C$. As in the mean estimation setting, we produce such an estimate via the ellipsoid method; the inner radius $r$, outer radius $R$, and separation oracle can be obtained in a manner parallel to the mean estimation setting.

The primary departure from the setting of mean estimation comes when we bound the diameter of the set $C$. In particular, we require the following coverage condition.

Assumption F.6 (Coverage regularity for covariance). The set of patterns $\mathcal{S}$ has regular coverage if, for every pair $i, j \in [d]$, there exists $S \in \mathcal{S}$ such that $\{i, j\} \subseteq S$.

Henceforth, we will assume that Assumption F.6 is in force. Next, let $I_1, \dots, I_k$ be a partition of the coordinate set $[d]$ such that for every pair $\ell_1, \ell_2 \in [k]$, there exists a pattern $S \in \mathcal{S}$ such that $I_{\ell_1} \cup I_{\ell_2} \subseteq S$. Note that under Assumption F.6, such a partition always exists; for instance, one can take the partition into singleton sets. Given such a partition, let $S_{\ell_1,\ell_2} \in \mathcal{S}$ satisfy $I_{\ell_1} \cup I_{\ell_2} \subseteq S_{\ell_1,\ell_2}$. Now, for any $\widehat\Sigma \in C$ and $v \in S^{d-1}$, writing $\Pi_\ell := P_{I_\ell}^\top P_{I_\ell}$ (so that $\sum_{\ell=1}^k \Pi_\ell = I_d$), we have
\[
v^\top\big(\widehat\Sigma - \Sigma_\star\big)v = \sum_{\ell_1,\ell_2\in[k]} (\Pi_{\ell_1}v)^\top\big(\widehat\Sigma - \Sigma_\star\big)(\Pi_{\ell_2}v) \le \sum_{\ell_1,\ell_2\in[k]} \big\|P_{I_{\ell_1}}v\big\|_2\,\big\|P_{I_{\ell_2}}v\big\|_2\; r_{S_{\ell_1,\ell_2}} \le r_{\max}\cdot\Big(\sum_{\ell\in[k]}\big\|P_{I_\ell}v\big\|_2\Big)^2 \le k\, r_{\max},
\]
where $r_{\max} = \max_{S\in\mathcal{S}} r_S$; here each $(\ell_1,\ell_2)$ term involves only the $(S_{\ell_1,\ell_2}, S_{\ell_1,\ell_2})$ block of $\widehat\Sigma - \Sigma_\star$, and the final step follows from Cauchy–Schwarz together with $\sum_{\ell}\|P_{I_\ell}v\|_2^2 = 1$. We thus see that minimizing the number of parts in the partition yields better rates.

Let us conclude by providing a coarse upper bound on the size of a minimum partition. As mentioned, the singleton partition always satisfies the desired condition; this, however, gives the dimension-dependent upper bound $k \le d$. When the number of patterns $|\mathcal{S}|$ is a constant, we can provide a smaller partition with $k \le 2^{|\mathcal{S}|}$. To each coordinate of $[d]$, associate its binary membership signature, recording which sets in $\mathcal{S}$ it is a member of, and let the parts $I_\ell$ collect the coordinates sharing a common signature. By construction, this is a partition of $[d]$. Moreover, it satisfies the pairwise union condition. To see this, consider any elements $x_{\ell_1} \in I_{\ell_1}$ and $x_{\ell_2} \in I_{\ell_2}$ and note that by Assumption F.6, there exists some set $S \in \mathcal{S}$ such that both $x_{\ell_1}, x_{\ell_2} \in S$; since every element of $I_{\ell_1}$ shares the signature of $x_{\ell_1}$ (and likewise for $I_{\ell_2}$), it follows that both $I_{\ell_1}, I_{\ell_2} \subseteq S$, so that $I_{\ell_1} \cup I_{\ell_2} \subseteq S$. Hence, we deduce the bound $k \le 2^{|\mathcal{S}|}$, so that
\[
\big\|\widehat\Sigma - \Sigma_\star\big\|_{\mathrm{op}} \le \min\big(d,\; 2^{|\mathcal{S}|}\big)\, r_{\max} \le C_{|\mathcal{S}|}\, r_{\max},
\]
where the final inequality follows from Assumption F.3 and $C_{|\mathcal{S}|}$ is a constant depending only on the number of patterns.
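The signature partition just described is simple to compute. The sketch below groups coordinates by their pattern-membership signature, realizing the bound $k \le 2^{|\mathcal{S}|}$; under Assumption F.6 any two resulting parts are jointly contained in some pattern.

def signature_partition(d, patterns):
    # Coordinates i and j share a part iff they belong to exactly the same
    # patterns, so there are at most 2^{|patterns|} parts.
    parts = {}
    for j in range(d):
        sig = frozenset(i for i, S in enumerate(patterns) if j in S)
        parts.setdefault(sig, []).append(j)
    return list(parts.values())

# Three patterns on d = 6 coordinates satisfying Assumption F.6:
print(signature_partition(6, [{0, 1, 2, 3}, {2, 3, 4, 5}, {0, 1, 4, 5}]))
# -> [[0, 1], [2, 3], [4, 5]]  (k = 3, versus the singleton bound k = d = 6)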
G Auxiliary lemmas

G.1 Reduction to $q = 1$: Proof of Lemma 1.2

Proof. First, suppose that $Q \in \mathcal{R}(P, \epsilon, q)$, so that by Lemma 1.1, for all $z \in \mathbb{R}^d$,
\[
q(1-\epsilon) \le \frac{dQ}{dP}(z) \le q(1-\epsilon) + \epsilon = q'. \tag{69}
\]
We then define the distribution $Q'$ on $\mathbb{R}^d \cup \{\star^d\}$ so that
\[
\frac{dQ'}{dP}(z) = \frac{1}{q'}\,\frac{dQ}{dP}(z) \text{ for } z \in \mathbb{R}^d, \quad\text{and}\quad Q'\big(\{\star^d\}\big) = 1 - \frac{1}{q'}\cdot Q\big(\mathbb{R}^d\big).
\]
Note that by construction, $Q'$ is a probability distribution. Moreover, observe that the invariant relation
\[
q(1-\epsilon) = q'\cdot(1-\epsilon')
\]
holds by construction. Consequently, applying the sandwich relation (69), we obtain the inequalities
\[
\frac{dQ'}{dP}(z) = \frac{1}{q'}\cdot\frac{dQ}{dP}(z) \le 1 \quad\text{and}\quad \frac{dQ'}{dP}(z) = \frac{1}{q'}\cdot\frac{dQ}{dP}(z) \ge 1-\epsilon'.
\]
By Lemma 1.1, this implies that $Q' \in \mathcal{R}(P, \epsilon', 1)$. To conclude this part, note that $q'\cdot Q'$ and $Q$ agree on the entirety of $\mathbb{R}^d$, so that
\[
Q = q'\cdot Q' + (1-q')\,\delta_{\{\star^d\}},
\]
since $Q$ must integrate to 1 on $\mathbb{R}^d \cup \{\star^d\}$. To obtain the reverse implication, suppose that $Q' \in \mathcal{R}(P, \epsilon', 1)$. Applying Lemma 1.1 once more yields
\[
q(1-\epsilon) = q'\cdot(1-\epsilon') \le q'\cdot\frac{dQ'}{dP}(z) \le q' = q(1-\epsilon) + \epsilon.
\]
Now, consider $Q = q'\cdot Q' + (1-q')\cdot\delta_{\{\star^d\}}$. By construction, $Q$ is a probability distribution. Hence, from the above display, we conclude that $Q \in \mathcal{R}(P, \epsilon, q)$.

G.2 Useful general lemmas

Lemma G.1. Let $a, b \ge 0$ and $\ell \le 1$. Then $(a+b)^\ell \le a^\ell + b^\ell$.

Proof. The case $a = b = 0$ is trivial, so for the remainder we will assume $a + b > 0$. Observe
\[
a^\ell + b^\ell = (a+b)^\ell\bigg(\Big(\frac{a}{a+b}\Big)^\ell + \Big(\frac{b}{a+b}\Big)^\ell\bigg) \ge (a+b)^\ell,
\]
where the last step follows from the fact that $x^\ell + (1-x)^\ell \ge 1$ for $x \in [0,1]$.

Fact G.2. For all $k \in \mathbb{N}$, $\sqrt{2\pi k}\,(k/e)^k \le k! \le \sqrt{2\pi k}\,(k/e)^k\, e^{1/(12k)}$.

Lemma G.3. For $n \in \mathbb{N}$ and $\epsilon \in [0,1)$, let $m \sim \mathrm{Bin}(n, 1-\epsilon)$. For $\delta \in (0,1)$, suppose
\[
n \ge \frac{(1+\epsilon)^2}{\epsilon^2(1-\epsilon)^2}\log\frac{1}{\delta}.
\]
Then with probability at least $1-\delta$ we have $m \ge \frac{(1-\epsilon)n}{1+\epsilon}$.

Proof. Observe $\mathbb{E}[m] = (1-\epsilon)n$. Hoeffding's inequality thus implies for any $t > 0$ that
\[
\mathbb{P}\big(m - (1-\epsilon)n \le -t\big) \le \exp\Big(-\frac{2t^2}{n}\Big).
\]
The result then follows by taking $t = \big(1 - \tfrac{1}{1+\epsilon}\big)(1-\epsilon)n = \tfrac{\epsilon}{1+\epsilon}(1-\epsilon)n$.

Lemma G.4. Let $L(\Sigma, \Sigma_\star) = \|\Sigma_\star^{-1/2}\Sigma\Sigma_\star^{-1/2} - I\|_{\mathrm{op}}$. Let $\gamma \in [0,1]$ and $v, \bar v \in S^{d-1}$. Then for all $M \in \mathcal{C}^d_+$ we have
\[
L(M, I_d + \gamma vv^\top) \vee L(M, I_d + \gamma\bar v\bar v^\top) > \frac{\gamma}{16}\|v - \bar v\|_2^2.
\]
Proof. Suppose $L(M, I_d + \gamma vv^\top) < \frac{\gamma}{16}\|v-\bar v\|_2^2$, so that $M = I_d + \gamma vv^\top + E$ for some $E$ with
\[
\frac{1}{1+\gamma}\|E\|_{\mathrm{op}} \le \big\|(I + \gamma vv^\top)^{-1/2}E(I + \gamma vv^\top)^{-1/2}\big\|_{\mathrm{op}} < \frac{\gamma}{16}\|v-\bar v\|_2^2.
\]
Then,
\[
\big\|(I_d + \gamma\bar v\bar v^\top)^{-1/2}M(I_d + \gamma\bar v\bar v^\top)^{-1/2} - I_d\big\|_{\mathrm{op}} = \big\|(I_d + \gamma\bar v\bar v^\top)^{-1/2}\big(I_d + \gamma vv^\top + E\big)(I_d + \gamma\bar v\bar v^\top)^{-1/2} - I_d\big\|_{\mathrm{op}}
\]
\[
\ge \gamma\,\big\|(I_d + \gamma\bar v\bar v^\top)^{-1/2}\big(vv^\top - \bar v\bar v^\top\big)(I_d + \gamma\bar v\bar v^\top)^{-1/2}\big\|_{\mathrm{op}} - \big\|(I_d + \gamma\bar v\bar v^\top)^{-1/2}E(I_d + \gamma\bar v\bar v^\top)^{-1/2}\big\|_{\mathrm{op}}
\]
\[
\ge \frac{\gamma}{1+\gamma}\big\|vv^\top - \bar v\bar v^\top\big\|_{\mathrm{op}} - \frac{\gamma(1+\gamma)}{16}\|v-\bar v\|_2^2.
\]
Now let $u \in S^{d-1}$ be orthogonal to $v$ and such that $\bar v = \alpha v + \sqrt{1-\alpha^2}\,u$ for some $\alpha \in [0,1]$. Then $\|vv^\top - \bar v\bar v^\top\|_{\mathrm{op}} \ge (\bar v\cdot u)^2 = 1-\alpha^2$ and $\|v - \bar v\|_2^2 = 2(1-\alpha)$. Thus, we have the lower bound
\[
\|vv^\top - \bar v\bar v^\top\|_{\mathrm{op}} \ge 1-\alpha^2 \ge 1-\alpha = \frac{1}{2}\|v-\bar v\|_2^2.
\]
Plugging this in gives
\[
L(M, I_d + \gamma\bar v\bar v^\top) \ge \frac{\gamma}{2(1+\gamma)}\|v-\bar v\|_2^2 - \frac{\gamma(1+\gamma)}{16}\|v-\bar v\|_2^2 \ge \frac{\gamma}{8}\|v-\bar v\|_2^2 > \frac{\gamma}{16}\|v-\bar v\|_2^2,
\]
where the last steps follow from $\gamma \in [0,1]$. This proves the claim.

Below we present a slightly modified version of [BEHW89, Lemma 3.2.3]. The only difference is that we allow the $s$-fold intersection to be across different classes, but this does not change the result.

Lemma G.5 (VC dimension of $s$-fold intersections). Let $C_1, \dots, C_s \subseteq 2^X$ be concept classes such that $\mathrm{VC}(C_i) \le d$ for all $i \in [s]$. Then the concept class $C_\cap = \{\cap_{i=1}^s c_i \mid c_i \in C_i\ \forall i \in [s]\}$ has VC dimension less than $2ds\log(3s)$.

G.3 Univariate Gaussian concentration and moment properties

Throughout, we denote the density of a standard univariate Gaussian by the function $\varphi$. We will also use $G$ to denote a standard Gaussian random variable.

Fact G.6 (Gaussian moments). For all $k \in \mathbb{N}$,
\[
\mathbb{E}[|G|^{2k-1}] = \sqrt{\frac{2}{\pi}}\,(2k-2)!! \quad\text{and}\quad \mathbb{E}[G^{2k}] = (2k-1)!!.
\]

Lemma G.7. Let $k \ge 1$ and $T \ge 2\sqrt{k}\log(3k)$. Then $\mathbb{E}\big[G^{2k-2}\mathbf{1}\{G \in [1,T]\}\big] \ge \frac{(2k-3)!!}{3}$.

Proof. From Fact G.6, we have $\mathbb{E}[G^{2k-2}] = (2k-3)!!$, and by symmetry of the Gaussian distribution we have that
\[
\frac{(2k-3)!!}{2} = \int_0^\infty t^{2k-2}\varphi(t)\,dt = \underbrace{\int_0^1 t^{2k-2}\varphi(t)\,dt}_{=:A} + \underbrace{\int_1^T t^{2k-2}\varphi(t)\,dt}_{=:B} + \underbrace{\int_T^\infty t^{2k-2}\varphi(t)\,dt}_{=:C}.
\]
We want to lower bound $B$, which we will do by upper bounding $A$ and $C$. To upper bound $A$,
\[
A = \int_0^1 t^{2k-2}\varphi(t)\,dt \le \int_0^1 \frac{e^{-t^2/2}}{\sqrt{2\pi}}\,dt \le \frac{1}{\sqrt{2\pi}}.
\]
To upper bound $C$,
\[
C = \int_T^\infty t^{2k-2}\varphi(t)\,dt \le \frac{1}{T}\int_T^\infty t^{2k-1}\varphi(t)\,dt.
\]
Noting that $\frac{d}{dt}(-\varphi(t)) = t\varphi(t)$, we apply integration by parts with $u = t^{2k-2}$ and $v = -\varphi(t)$ to obtain
\[
\int_T^\infty t^{2k-1}\varphi(t)\,dt = \int_T^\infty t^{2k-2}\cdot t\varphi(t)\,dt = \big[-t^{2k-2}\varphi(t)\big]_T^\infty + (2k-2)\int_T^\infty t^{2k-3}\varphi(t)\,dt \le T^{2k-2}\varphi(T) + \frac{2k-2}{T}\,C.
\]
Thus,
\[
C \le \frac{1}{T}\Big(T^{2k-2}\varphi(T) + \frac{2k-2}{T}\,C\Big) \quad\Longrightarrow\quad C \le \frac{1}{\sqrt{2\pi}}\Big(1 - \frac{2k-2}{T^2}\Big)^{-1} T^{2k-3}\, e^{-T^2/2}.
\]
Observe that $t \mapsto t^{2k-3}e^{-t^2/2}$ is decreasing for $t \ge \sqrt{2k-3}$, so for $T \ge 2\sqrt{k}\log(3k)$ we have that
\[
T^{2k-3}e^{-T^2/2} \le \big(2\sqrt{k}\log(3k)\big)^{2k-3}\, e^{-(4k\log^2(3k))/2} \le \exp\big(-2k\log^2(3k) + 2k\log(3k)\big) \le 1.
\]
Also, for $T \ge 2\sqrt{k-1}$, we have $\big(1 - \frac{2k-2}{T^2}\big)^{-1} \le 2$. Thus, $C \le \frac{2}{\sqrt{2\pi}}$. This implies
\[
B \ge \frac{(2k-3)!!}{2} - \frac{3}{\sqrt{2\pi}} \ge \frac{(2k-3)!!}{3}.
\]

Lemma G.8. Let $\ell \in \mathbb{N}$. Then $\mathbb{E}[G^\ell] \le 2(\ell/e)^{\ell/2}$.

Proof. The claim is trivial when $\ell$ is odd, so for the remainder let $\ell = 2k$ for some $k \in \mathbb{N}$. Stirling's formula (Fact G.2) implies
\[
\sqrt{2\pi k}\,(k/e)^k \le k! \le \sqrt{2\pi k}\,(k/e)^k\, e^{1/(12k)}. \tag{70}
\]
Thus, we have
\[
\mathbb{E}[G^{2k}] = \frac{(2k)!}{2^k\, k!} \le \frac{\sqrt{4\pi k}\,(2k/e)^{2k}\, e^{1/(24k)}}{2^k\sqrt{2\pi k}\,(k/e)^k} = \sqrt{2}\, e^{1/(24k)}\,(2k/e)^k \le 2\,(2k/e)^k.
\]
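A quick numerical sanity check of Fact G.6 and Lemma G.8 (reading the latter's bound as $\mathbb{E}[G^\ell] \le 2(\ell/e)^{\ell/2}$, our reconstruction of the garbled statement):

import math
from scipy.stats import norm

def double_factorial(n):
    return math.prod(range(n, 0, -2)) if n > 0 else 1

for k in range(1, 8):
    even = norm.moment(2 * k)             # E[G^{2k}]
    dfact = double_factorial(2 * k - 1)   # (2k - 1)!!
    bound = 2 * (2 * k / math.e) ** k     # Lemma G.8 at l = 2k
    print(k, even, dfact, bound)          # even matches dfact and sits below bound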
Lemma G.9. Let $G_1, \dots, G_n \overset{\mathrm{iid}}{\sim} \mathcal{N}(0,1)$ and $m \in \mathbb{N}$. Then with probability at least $1-\delta$, we have that
\[
\sum_{i=1}^n |G_i|^m \le n(m-1)!! + \sqrt{n\, 2^{m-1}\log(2/\delta)}\;\big(\log(2n/\delta)\big)^{m/2}.
\]
Proof. Let $T > 0$ be a threshold we will specify later, let $\mu_m = \mathbb{E}|G_1|^m$, and define $A_i = |G_i|^m\mathbf{1}\{|G_i| \le T\}$. Then by a union bound,
\[
\mathbb{P}\bigg(\frac{1}{n}\sum_{i=1}^n |G_i|^m \ge \mu_m + \Delta_1\bigg) \le \mathbb{P}\bigg(\frac{1}{n}\sum_{i=1}^n A_i \ge \mu_m + \Delta_1\bigg) + \mathbb{P}\Big(\max_{i\in[n]}|G_i| > T\Big).
\]
For the first term, note that because $A_i$ is supported on $[0, T^m]$, it is $\frac{1}{4}T^{2m}$-subgaussian. Furthermore, $\mathbb{E}[A_1] \le \mu_m$. Thus,
\[
\mathbb{P}\bigg(\frac{1}{n}\sum_{i=1}^n A_i \ge \mu_m + \Delta_1\bigg) \le \mathbb{P}\bigg(\frac{1}{n}\sum_{i=1}^n A_i - \mathbb{E}[A_1] \ge \Delta_1\bigg) \le \exp\big(-2n\Delta_1^2\, T^{-2m}\big).
\]
This is bounded by $\delta/2$ if we choose $T$ such that
\[
T^2 \le \bigg(\frac{2n\Delta_1^2}{\log(2/\delta)}\bigg)^{1/m}.
\]
For the second term, we have by a union bound that $\mathbb{P}(\max_{i\in[n]}|G_i| > T) \le 2n\exp(-\tfrac{1}{2}T^2)$, which is at most $\delta/2$ for $T \ge 2\sqrt{\log(2n/\delta)}$. We can always find some $T$ satisfying both constraints as long as
\[
\frac{2n\Delta_1^2}{\log(2/\delta)} \ge 2^m\big(\log(2n/\delta)\big)^m, \quad\text{i.e.,}\quad \Delta_1 \ge \sqrt{\frac{2^{m-1}\log(2/\delta)\big(\log(2n/\delta)\big)^m}{n}},
\]
proving the claim (using that $\mu_m \le (m-1)!!$).

The next two lemmas provide basic properties of the Gaussian survival function $\bar\Phi$.

Lemma G.10 (Mills' ratio bound). For all $t \ge 0$,
\[
\frac{t}{t^2+1}\cdot\varphi(t) \le \bar\Phi(t) \le \frac{1}{t}\cdot\varphi(t).
\]

Lemma G.11. Let $\bar\Phi : \mathbb{R} \to [0,1]$ be defined as $\bar\Phi(x) = \mathbb{P}(G \ge x)$, where $G \sim \mathcal{N}(0,1)$. The following hold:

(a) The inverse function $\bar\Phi^{-1}$ is differentiable and its derivative is given by
\[
\frac{d}{dx}\,\bar\Phi^{-1}(x) = -\frac{1}{\varphi(\bar\Phi^{-1}(x))}.
\]
(b) For $0 \le t \le 1/2$,
\[
t\cdot\bar\Phi^{-1}(t) \le \varphi\big(\bar\Phi^{-1}(t)\big) \le t\cdot\Big(\bar\Phi^{-1}(t) + \frac{1}{\bar\Phi^{-1}(t)}\Big).
\]
Proof. We prove each part in turn.

Proof of part (a). Since $\bar\Phi$ is continuously differentiable, it follows from the inverse function theorem that
\[
\frac{d}{dx}\,\bar\Phi^{-1}(x) = \frac{1}{\bar\Phi'\big(\bar\Phi^{-1}(x)\big)} = -\frac{1}{\varphi(\bar\Phi^{-1}(x))},
\]
as desired.

Proof of part (b). The desired sandwich relation follows from the Mills' ratio bounds (Lemma G.10) applied at the point $\bar\Phi^{-1}(t)$ for $0 \le t \le 1/2$.

G.4 Multivariate Gaussian concentration and moment properties

Lemma G.12. Let $z \sim \mathcal{N}(0, I_d)$ and $\ell \in \mathbb{N}$. For an index $\alpha = (i_1, \dots, i_\ell) \in [d]^\ell$, let $X = (z^{\otimes\ell})_\alpha$. Then for any $\gamma > 1/2$ we have $\mathbb{E}\big[\exp\big(\tfrac{1}{8\gamma}|X|^{2/\ell}\big)\big] \le e^{1/(2\gamma)}$.

Proof. We can write $X = \Pi_{i\in\mathrm{supp}(\alpha)}\, z_i^{m_i}$ for multiplicities $m_i \in \mathbb{N}$ with $\sum_i m_i = \ell$. Observe
\[
\frac{1}{8\gamma}|X|^{2/\ell} = \frac{1}{8\gamma}\Big(\Pi_{i\in\mathrm{supp}(\alpha)}\, z_i^{2m_i}\Big)^{1/\ell} \le \frac{1}{8\gamma\ell}\sum_{i\in\mathrm{supp}(\alpha)} m_i z_i^2,
\]
where the last step follows from the AM–GM inequality. Recall that for $t < 1/4$ we have $\mathbb{E}_{G\sim\mathcal{N}(0,1)}[\exp(tG^2)] \le (1-2t)^{-1} \le \exp(4t)$. Our assumption on $\gamma$ implies $\frac{m_i}{8\gamma\ell} < \frac{1}{4}$ for all $i \in \mathrm{supp}(\alpha)$, and so it follows that
\[
\mathbb{E}\Big[\exp\Big(\frac{1}{8\gamma}|X|^{2/\ell}\Big)\Big] \le \Pi_{i\in\mathrm{supp}(\alpha)}\,\mathbb{E}\Big[\exp\Big(\frac{m_i}{8\gamma\ell}\, z_i^2\Big)\Big] \le \Pi_{i\in\mathrm{supp}(\alpha)}\exp\Big(\frac{m_i}{2\gamma\ell}\Big) = e^{1/(2\gamma)}.
\]

Lemma G.13. Let $z \sim \mathcal{N}(0,I_d)$ and $\ell \in \mathbb{N}$. For an index $\alpha = (i_1, \dots, i_\ell) \in [d]^\ell$, let $X = (z^{\otimes\ell})_\alpha$. Then we have $\mathbb{E}\big[\exp\big(\tfrac{1}{8\ell}|X - \mathbb{E}[X]|^{2/\ell}\big)\big] \le 2$.

Proof. Lemma G.1 implies
\[
\mathbb{E}\Big[\exp\Big(\frac{1}{8\ell}|X - \mathbb{E}[X]|^{2/\ell}\Big)\Big] \le \mathbb{E}\Big[\exp\Big(\frac{1}{8\ell}\big(|X|^{2/\ell} + |\mathbb{E}[X]|^{2/\ell}\big)\Big)\Big] = \exp\Big(\frac{1}{8\ell}|\mathbb{E}[X]|^{2/\ell}\Big)\,\mathbb{E}\Big[\exp\Big(\frac{1}{8\ell}|X|^{2/\ell}\Big)\Big]. \tag{71}
\]
Lemmas G.8 and G.12 respectively imply that we can bound both terms in equation (71) by $e^{1/4}$. Combining, we can bound their product by $e^{1/2} \le 2$.

Lemma G.14. Let $\ell \in \mathbb{N}$. Let $z_1, \dots, z_n \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, I_d)$ and $\xi = \frac{1}{n}\sum_{i=1}^n z_i^{\otimes\ell} - \mathbb{E}_{z\sim\mathcal{N}(0,I_d)}[z^{\otimes\ell}]$. Then for $\delta \in (0,1)$ we have
\[
\|\xi\|_\infty \le (100\ell)^{\ell/2}\Bigg(\sqrt{\frac{\ell\log d + \log(1/\delta)}{n}} + \frac{\big(\ell\log d + \log(1/\delta)\big)^{\ell/2}}{n}\Bigg)
\]
with probability at least $1-\delta$.

Proof. For any index $\alpha = (i_1, \dots, i_\ell) \in [d]^\ell$, combining Lemma G.13 (setting $\ell = 2k$) with Theorem 3.1 of [KC18] gives
\[
\mathbb{P}\Bigg(|\xi_\alpha| \ge (100\ell)^{\ell/2}\bigg(\sqrt{\frac{t}{n}} + \frac{t^{\ell/2}}{n}\bigg)\Bigg) \le 2e^{-t}.
\]
The result then follows from taking a union bound over all $d^\ell$ elements.
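For intuition, the $\ell = 2$ case of Lemma G.14 is easy to check by simulation, since the population tensor $\mathbb{E}[zz^\top] = I_d$ is known exactly; $\|\xi\|_\infty$ should shrink roughly like $\sqrt{\log d / n}$ in the first regime of the bound.

import numpy as np

rng = np.random.default_rng(1)

def xi_inf_norm(n, d):
    # xi = (1/n) sum_i z_i z_i^T - I_d, the l = 2 case of Lemma G.14.
    z = rng.standard_normal((n, d))
    xi = (z.T @ z) / n - np.eye(d)
    return np.abs(xi).max()

for n in [1_000, 10_000, 100_000]:
    print(n, xi_inf_norm(n, d=50))  # decreases roughly like n^{-1/2}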
Corollary G.15. Let $\ell \in \mathbb{N}$. Let $z_1, \dots, z_n \overset{\mathrm{iid}}{\sim} \mathcal{N}(0,I_d)$ and $\xi = \frac{1}{n}\sum_{i=1}^n z_i^{\otimes\ell} - \mathbb{E}_{z\sim\mathcal{N}(0,I_d)}[z^{\otimes\ell}]$. For $\beta > 0$, suppose
\[
n \ge \max\Bigg\{\frac{4(100\ell)^\ell\big(\ell\log d + \log(1/\delta)\big)}{\beta^2},\;\; \frac{2(100\ell)^{\ell/2}\big(\ell\log d + \log(1/\delta)\big)^{\ell/2}}{\beta}\Bigg\}.
\]
Then $\|\xi\|_\infty \le \beta$ with probability at least $1-\delta$.

Proof. The result follows immediately from inverting the bound of Lemma G.14, setting both summands in the bound to $\beta/2$.

Lemma G.16. Let $z_1, \dots, z_n \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0,I_d)$. For $\ell \in \mathbb{N}$, let $\xi = \frac{1}{n}\sum_{i=1}^n z_i^{\otimes\ell} - \mathbb{E}_{z\sim\mathcal{N}(0,I_d)}[z^{\otimes\ell}]$. Then
\[
\sup_{\|v\|_2 \le 1}\bigg|\frac{1}{n}\sum_{i=1}^n \langle z_i, v\rangle^\ell - \mathbb{E}_{z\sim\mathcal{N}(0,I_d)}[\langle z, v\rangle^\ell]\bigg| \le d^{\ell/2}\,\|\xi\|_\infty.
\]
Proof. Observe that for any $x, y \in \mathbb{R}^d$ we have $\langle x, y\rangle^\ell = \langle x^{\otimes\ell}, y^{\otimes\ell}\rangle$. Moreover, for any unit vector $v \in \mathbb{R}^d$ we have
\[
\big\|v^{\otimes\ell}\big\|_2^2 = \sum_{\alpha\in[d]^\ell}\big(\Pi_{i\in\alpha}\, v_i\big)^2 = \bigg(\sum_{i=1}^d v_i^2\bigg)^{\ell} = 1.
\]
It thus follows for any such $v$ that
\[
\bigg|\frac{1}{n}\sum_{i=1}^n \langle z_i, v\rangle^\ell - \mathbb{E}_{z\sim\mathcal{N}(0,I_d)}[\langle z, v\rangle^\ell]\bigg| = \bigg|\Big\langle \frac{1}{n}\sum_{i=1}^n z_i^{\otimes\ell} - \mathbb{E}_{z\sim\mathcal{N}(0,I_d)}[z^{\otimes\ell}],\; v^{\otimes\ell}\Big\rangle\bigg| \le \|\xi\|_2\,\big\|v^{\otimes\ell}\big\|_2 \le d^{\ell/2}\,\|\xi\|_\infty.
\]

G.4.1 Proof of Lemma 4.3

Proof. Our assumption on $n$ satisfies the conditions of Lemma G.3, implying $\frac{n}{|T|} \le \frac{1+\epsilon}{1-\epsilon}$ with probability at least $1-\delta$. This then implies that both $|T|$ and $n$ satisfy the conditions of Corollary G.15 with failure probability $\delta/(2k)$ and $\beta = \epsilon/(d^k k^k)$. Taking a union bound over $\ell \in [2k]$ gives the desired result.

G.5 Sum-of-squares facts

Lemma G.17. Let $r \in \mathbb{N}$. Let $x, y$ be polynomials of degree at most $k$. Then
\[
\{x \ge 0,\; y \ge 0,\; y - x \ge 0\} \;\vdash^{x,y}_{rk}\; y^r - x^r \ge 0. \tag{72}
\]
Proof. We can write $y^r - x^r$ as the sum of products of non-negative polynomials as follows:
\[
y^r - x^r = \sum_{i=0}^{r-1}\big(y^{r-i}x^i - y^{r-i-1}x^{i+1}\big) = (y-x)\sum_{i=0}^{r-1} y^{r-i-1}x^i.
\]

Lemma G.18. Let $x, y$ be vector-valued polynomials of degree at most $k$. Then for any $t > 0$ we have
\[
\vdash^{x,y}_{2k}\; \langle x, y\rangle \le \frac{t}{2}\|x\|_2^2 + \frac{1}{2t}\|y\|_2^2.
\]
Proof. Observe
\[
\frac{t}{2}\|x\|_2^2 + \frac{1}{2t}\|y\|_2^2 = \langle x, y\rangle + \frac{1}{2}\Big\|\sqrt{t}\,x - \frac{1}{\sqrt{t}}\,y\Big\|_2^2.
\]

Corollary G.19. Let $x, y$ be vector-valued polynomials of degree at most $k$ and let $\widetilde{\mathbb{E}}$ be a pseudoexpectation of degree at least $2k$. Then
\[
\widetilde{\mathbb{E}}[\langle x, y\rangle] \le \sqrt{\widetilde{\mathbb{E}}[\|x\|_2^2]\;\widetilde{\mathbb{E}}[\|y\|_2^2]}.
\]
Proof. The result follows immediately from Lemma G.18 with $t = \sqrt{\widetilde{\mathbb{E}}[\|y\|_2^2]\,/\,\widetilde{\mathbb{E}}[\|x\|_2^2]}$.

Lemma G.20. Let $c \in \mathbb{N}$ with $r = 2c$ and $r' = 2(c+1)$. Let $x$ be a polynomial of degree at most $k$ and let $\widetilde{\mathbb{E}}$ be a pseudoexpectation of degree at least $2k(c+1)$. Then $\widetilde{\mathbb{E}}[x^{r'}] \ge \widetilde{\mathbb{E}}[x^r]^{r'/r}$.

Proof. We proceed by induction on $c$. The base case $c = 1$ follows from Cauchy–Schwarz (i.e., combining Lemmas G.19 and G.17). Now suppose the claim holds for $c' = c - 1$, i.e., $\widetilde{\mathbb{E}}[x^{2(c'+1)}] \ge \widetilde{\mathbb{E}}[x^{2c'}]^{\frac{c'+1}{c'}}$. We will show the claim holds for $c = c' + 1$. Applying Cauchy–Schwarz to $x^{c'+2}$ and $x^{c'}$ gives
\[
\widetilde{\mathbb{E}}[x^{2(c'+2)}] \ge \frac{\widetilde{\mathbb{E}}[x^{2(c'+1)}]^2}{\widetilde{\mathbb{E}}[x^{2c'}]} \ge \frac{\widetilde{\mathbb{E}}[x^{2(c'+1)}]^2}{\widetilde{\mathbb{E}}[x^{2(c'+1)}]^{\frac{c'}{c'+1}}} = \widetilde{\mathbb{E}}[x^{2(c'+1)}]^{\frac{c'+2}{c'+1}},
\]
with the second step following from the induction hypothesis.

Corollary G.21. Let $r, r' \in \mathbb{N}$ be even with $r' \ge r$ and let $x$ be a polynomial of degree at most $k$. Let $\widetilde{\mathbb{E}}$ be a pseudoexpectation of degree at least $kr'$. Then $\widetilde{\mathbb{E}}[x^{r'}] \ge \widetilde{\mathbb{E}}[x^r]^{r'/r}$.

Proof. The result follows from repeatedly applying Lemma G.20.

Lemma G.22. Let $x$ be a polynomial of degree at most $k$ and let $\widetilde{\mathbb{E}}$ be a pseudoexpectation of degree at least $2k$. Then $\widetilde{\mathbb{E}}[x]^2 \le \widetilde{\mathbb{E}}[x^2]$.

Proof. Observe $\widetilde{\mathbb{E}}[x^2] - \widetilde{\mathbb{E}}[x]^2 = \widetilde{\mathbb{E}}\big[(x - \widetilde{\mathbb{E}}[x])^2\big] \ge 0$.
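As a concrete instance of the base case of Lemma G.20 (the pseudoexpectation Hölder inequality with $r = 2$, $r' = 4$), combining Corollary G.19 (applied with $x \mapsto x^2$ and $y \mapsto 1$) with $\widetilde{\mathbb{E}}[1] = 1$ gives
\[
\widetilde{\mathbb{E}}[x^2] = \widetilde{\mathbb{E}}[x^2\cdot 1] \le \sqrt{\widetilde{\mathbb{E}}[x^4]\;\widetilde{\mathbb{E}}[1]} = \sqrt{\widetilde{\mathbb{E}}[x^4]}, \quad\text{and hence}\quad \widetilde{\mathbb{E}}[x^4] \ge \widetilde{\mathbb{E}}[x^2]^2.
\]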
Lemma G.23. Let $z_1, \dots, z_n \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0,I_d)$, with $\xi = \frac{1}{n}\sum_{i=1}^n z_i^{\otimes 2k} - \mathbb{E}_{z\sim\mathcal{N}(0,I_d)}[z^{\otimes 2k}]$. Then it follows for $k \in \mathbb{N}$ and all $t > 0$ that
\[
\vdash^{v}_{4k}\;\; \frac{1}{n}\sum_{i=1}^n \langle v, z_i\rangle^{2k} \in \|v\|_2^{2k}\,\mathbb{E}_{G\sim\mathcal{N}(0,1)}[G^{2k}] \pm \bigg(\frac{t}{2}\|v\|_2^{4k} + \frac{d^{2k}}{2t}\|\xi\|_\infty^2\bigg).
\]
Proof. For the upper bound, observe
\[
\vdash^{v}_{4k}\;\; \frac{1}{n}\sum_{i=1}^n \langle v, z_i\rangle^{2k} = \|v\|_2^{2k}\,\mathbb{E}_{G\sim\mathcal{N}(0,1)}[G^{2k}] + \langle v^{\otimes 2k}, \xi\rangle \tag{73}
\]
\[
\le \|v\|_2^{2k}\,\mathbb{E}_{G\sim\mathcal{N}(0,1)}[G^{2k}] + \frac{t}{2}\langle v^{\otimes 2k}, v^{\otimes 2k}\rangle + \frac{1}{2t}\langle\xi, \xi\rangle \tag{74}
\]
\[
= \|v\|_2^{2k}\,\mathbb{E}_{G\sim\mathcal{N}(0,1)}[G^{2k}] + \frac{t}{2}\|v\|_2^{4k} + \frac{1}{2t}\langle\xi,\xi\rangle \le \|v\|_2^{2k}\,\mathbb{E}_{G\sim\mathcal{N}(0,1)}[G^{2k}] + \frac{t}{2}\|v\|_2^{4k} + \frac{d^{2k}}{2t}\|\xi\|_\infty^2,
\]
where equation (73) follows from the fact that $\mathbb{E}_{z\sim\mathcal{N}(0,I_d)}[\langle v, z\rangle^{2k}] = \|v\|_2^{2k}\,\mathbb{E}_{G\sim\mathcal{N}(0,1)}[G^{2k}]$ for all $v \in \mathbb{R}^d$, and inequality (74) follows from SoS Cauchy–Schwarz (Lemma G.18). The lower bound follows analogously.

Lemma G.24. Let $x \in \mathbb{R}$ be a polynomial of degree at most $k$. Let $r \in \mathbb{N}$ be even and let $\widetilde{\mathbb{E}}$ be a pseudoexpectation of degree at least $kr$ satisfying $\widetilde{\mathbb{E}}[x^r] \le 1 + \alpha$ for some $\alpha > 0$. Then it follows that $\widetilde{\mathbb{E}}[x^2] \le 1 + \frac{2\alpha}{r}$.

Proof. Hölder's inequality for pseudoexpectations (Corollary G.21) implies
\[
\widetilde{\mathbb{E}}[x^2] \le \widetilde{\mathbb{E}}[x^r]^{2/r} \le (1+\alpha)^{2/r} \le 1 + \frac{2\alpha}{r}.
\]

Lemma G.25. Let $x \in \mathbb{R}$ be a polynomial of degree at most $k$. Let $r \in \mathbb{N}$ be even and let $\widetilde{\mathbb{E}}$ be a pseudoexpectation of degree at least $2kr$ satisfying both $\widetilde{\mathbb{E}}[x^r] \ge 1 - \beta$ and $\widetilde{\mathbb{E}}[x^{2r}] \ge 1 - \alpha$ for some $\alpha, \beta > 0$. Then it follows that $\widetilde{\mathbb{E}}[x^r] \ge 1 - \frac{\alpha}{2-\beta}$.

Proof. Observe
\[
\widetilde{\mathbb{E}}[1 - x^{2r}] = \widetilde{\mathbb{E}}[(1-x^r)(1+x^r)] \ge (2-\beta)\,\widetilde{\mathbb{E}}[1-x^r].
\]
Thus, $\widetilde{\mathbb{E}}[1-x^{2r}] \le \alpha$ implies $\widetilde{\mathbb{E}}[1-x^r] \le \frac{\alpha}{2-\beta}$.

Corollary G.26. Let $x \in \mathbb{R}$ be a polynomial of degree at most $k$. Let $r = 2^p$ for some $p \in \mathbb{N}$ and $\alpha \in (0, 1/2)$. Let $\widetilde{\mathbb{E}}$ be a pseudoexpectation of degree at least $2kr$ satisfying $\widetilde{\mathbb{E}}[x^{2^\ell}] \ge 1 - \alpha$ for $\ell \in [p]$. Then we have $\widetilde{\mathbb{E}}[x^2] \ge 1 - \frac{2e^{3/2}\alpha}{r}$.

Proof. Let $\beta_1 = \frac{\alpha}{2-\alpha}$ and $\beta_i = \frac{\beta_{i-1}}{2-\alpha}$ for $2 \le i \le p$. Applying Lemma G.25 recursively over $\ell \in [p]$ then gives
\[
\widetilde{\mathbb{E}}[x^2] \ge 1 - \frac{\alpha}{\Pi_{i=1}^{p-1}(2-\beta_i)} \ge 1 - \frac{\alpha}{\Pi_{i=1}^{p-1} 2\exp(-\beta_i)} = 1 - \frac{2\alpha}{r}\exp\bigg(\sum_{i=1}^{p-1}\beta_i\bigg),
\]
where the second step follows from the fact that $\exp(-2x) \le 1 - x$ for $x < 1/2$ (taking $x = \beta_i/2$). Our premise that $\alpha < 1/2$ implies $\beta_{i+1} \le \frac{2}{3}\beta_i$ for $i \in [p-1]$, and so $\sum_{i=1}^{p-1}\beta_i \le 3\alpha \le \frac{3}{2}$. The desired result follows immediately from combining this fact with the previous display.