High-dimensional estimation with missing data: Statistical and computational limits

Kabir Aladin Verchand†, Ankit Pensia◦, Saminul Haque⋆, and Rohith Kuditipudi⋆

† Department of Data Sciences and Operations, University of Southern California
◦ Department of Statistics and Data Science, Carnegie Mellon University
⋆ Department of Computer Science, Stanford University

March 18, 2026

(Authors are listed in random order.)

Abstract

We consider computationally-efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data, in which an $\epsilon$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence towards statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that in order to obtain error at most $\rho$, for any constant contamination $\epsilon \in (0,1)$, (roughly) $n \gtrsim d e^{1/\rho^2}$ samples are necessary, and that there is a computationally-inefficient algorithm which achieves this error. On the other hand, we show that any computationally-efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) $n \gtrsim d^{1/\rho^2}$, and that there exists a polynomial-time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.

Contents

1 Introduction
  1.1 Contributions
  1.2 Further related work
2 Background
  2.1 Sum-of-squares algorithms
  2.2 SQ lower bounds
  2.3 Minimax lower bounds
3 Information-theoretic limits of mean and covariance estimation
  3.1 Algorithms via variational principles
  3.2 Mean estimation
  3.3 Covariance estimation
4 Computationally-efficient mean and covariance estimation
  4.1 Warm-up: Efficient testing via polynomials
  4.2 Preliminaries
  4.3 Mean estimation
  4.4 Covariance estimation
5 Nearly optimal, computationally-efficient linear regression
  5.1 Model preliminaries
  5.2 Warm-up: The population setting
  5.3 Main result
6 Discussion
A Proofs of information-theoretic upper bounds for mean and covariance estimation
  A.1 Mean estimation: Proof of Theorem 3.1 upper bound
  A.2 Covariance estimation: Proof of Theorem 3.2 upper bound
  A.3 Univariate upper bounds
    A.3.1 Univariate mean estimation: Proof of Corollary A.1
    A.3.2 Univariate variance estimation: Proof of Lemma A.2
B Proof of upper bound for linear regression
  B.1 Linear regression: Proof of Theorem 5.1 upper bound
    B.1.1 Proof of Lemma B.1
    B.1.2 Proof of Lemma B.2
  B.2 Efficient algorithm
C Proofs of information-theoretic lower bounds
  C.1 Lower bound constructions
  C.2 Information-theoretic lower bounds
    C.2.1 Mean estimation: Proof of Theorem 3.1 lower bound
    C.2.2 Covariance estimation: Proof of Theorem 3.2 lower bound
    C.2.3 Linear regression: Proof of Theorem 5.1 lower bound
D Proofs of sum-of-squares upper bounds
  D.1 Mean estimation: Proof of Theorem 4.5
  D.2 Covariance estimation: Proof of Theorem 4.10
    D.2.1 Proof of Lemma 4.9
E Proofs of computational lower bounds
  E.1 Preliminaries
  E.2 Mean estimation: Proof of Theorem 4.7
    E.2.1 Proof of Lemma E.5
    E.2.2 Covariance estimation: Proof of Theorem 4.12
F Extension to multiple missingness patterns for mean and covariance estimation
  F.1 Lower bounds
  F.2 Upper bounds
    F.2.1 Mean estimation
    F.2.2 Covariance estimation
G Auxiliary lemmas
  G.1 Reduction to q = 1: Proof of Lemma 1.2
  G.2 Useful general lemmas
  G.3 Univariate Gaussian concentration and moment properties
  G.4 Multivariate Gaussian concentration and moment properties
    G.4.1 Proof of Lemma 4.3
  G.5 Sum-of-squares facts

1 Introduction

Statistical procedures are typically defined under the idealized assumption that each observation $\{X_i\}_{i\in[n]}$ in a sample of size $n$ is an independent outcome from a population distribution $P$. In practice, it is often the case that observations depart in some way from this idealized scenario.
One common departure is that of missing data, in which each observation may only be partially revealed. Of key importance when handling missing data is the nature of the mechanism by which data is missing. Typically, these "missingness mechanisms" are classified into one of three categories of increasing flexibility: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In brief, the MCAR assumption ensures that the missingness mechanism is independent of the underlying observations; the MAR assumption ensures that the missingness mechanism depends only on the observed data (see [SGJC13] for a precise definition); and the MNAR assumption places no restriction on the missingness mechanism. For a detailed exposition of each of these three categories, we refer the reader to the book [LR14]. While the former two assumptions often enable identification of population parameters, they are (i) often too strong to hold in practice and (ii) impossible to test for [GVR97]. On the other hand, the MNAR assumption is typically too flexible to enable identification of the population parameters [Man03]. In particular, when data are subject to missing not at random mechanisms, the resulting observations may be significantly biased.

Motivated by this discrepancy, [MVBWS24] recently introduced the realizable contamination framework, which considers Huber-style contamination tailored to the setting of missing data. In this paper, we adopt this model, which we describe next. Throughout the main text we will consider a simplified all-or-nothing setting in which observations have either no missing data or are fully missing. Later, in Appendix F, we show how our results naturally extend to the setting of multiple missingness patterns.

[Figure 1: Types of missing data patterns. Each row indicates a single sample, where the entries in gray indicate an observed value and the ⋆ entries indicate missingness. Panel (a): all-or-nothing missingness. Panel (b): multiple missingness patterns. In order to simplify the results in the main text, we focus on the all-or-nothing patterns described in panel (a), deferring the extension to the more general setting to Appendix F.]

To concretely define the model, let $P$ denote a probability distribution on $\mathbb{R}^d$ and suppose that we are interested in estimating some quantity $\theta(P)$ from incomplete observations. Following [MVBWS24], we define the following sets of distributions, specialized to the all-or-nothing setting:

$$\mathrm{MCAR}(P, q) := \Big\{ \mathrm{Law}(X \star \Omega) : X \sim P,\ \Omega \in \{(0)^d, (1)^d\},\ \Omega \perp\!\!\!\perp X,\ \mathbb{P}(\Omega = (1)^d) = q \Big\} \tag{1a}$$

$$\mathrm{MNAR}_P := \Big\{ \mathrm{Law}(X \star \Omega) : X \sim P,\ \mathrm{Law}(\Omega) \in \mathcal{P}\big(\{(0)^d, (1)^d\}\big) \Big\}. \tag{1b}$$

We point the reader to the notation section at the end of the introduction for a precise definition of the modified Hadamard product $\star$. We emphasize that these distributions are on the extended space $\mathbb{R}^d \cup \{\star^d\}$. In words, $\mathrm{MCAR}(P,q)$ describes the missing completely at random set of distributions, in which each observation is seen with probability $q$. On the other hand, $\mathrm{MNAR}_P$ describes the set of missing not at random distributions, which can be obtained by applying any missingness mechanism to the observations. Since consistent estimation under the assumption-lean setting of $\mathrm{MNAR}_P$ is in general impossible, we consider the realizable contamination model introduced by [MVBWS24].
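To make the all-or-nothing model concrete, the following is a minimal sketch of one possible data-generating process in this framework: a clean Gaussian sample is drawn, an MCAR mask reveals each observation with probability $q$, and an $\epsilon$-fraction of observations instead pass through a value-dependent masking rule. The specific masking rule below (hiding large-norm points) is our own illustrative choice, not one from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_realizable(n, d, eps, q, theta):
    """Draw n observations from the all-or-nothing realizable contamination
    model R(N(theta, I_d), eps, q): each clean sample X is either fully
    observed or replaced by the fully-missing symbol (a row of NaNs)."""
    X = rng.standard_normal((n, d)) + theta
    out = np.full((n, d), np.nan)                 # NaN plays the role of the star symbol
    mnar = rng.random(n) < eps                    # eps-fraction routed to the MNAR component
    # MCAR component: reveal independently of X with probability q.
    reveal_mcar = ~mnar & (rng.random(n) < q)
    # MNAR component: an arbitrary value-dependent rule; as an illustration,
    # hide the samples with the largest norms (this rule is ours, not the paper's).
    reveal_mnar = mnar & (np.linalg.norm(X - np.median(X, axis=0), axis=1) < np.sqrt(d))
    revealed = reveal_mcar | reveal_mnar
    out[revealed] = X[revealed]
    return out

T = sample_realizable(n=1000, d=5, eps=0.2, q=0.9, theta=np.zeros(5))
print("fraction fully observed:", np.mean(~np.isnan(T[:, 0])))
```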
In particular, given a contamination parameter $\epsilon \in [0,1]$, the realizable set of distributions is given by

$$\mathcal{R}(P, \epsilon, q) := (1-\epsilon)\,\mathrm{MCAR}(P, q) + \epsilon\,\mathrm{MNAR}_P. \tag{2}$$

When $q = 1$, we omit it, i.e., $\mathcal{R}(P,\epsilon) := \mathcal{R}(P,\epsilon,1)$. The following straightforward lemma (which is a corollary of [MVBWS24, Proposition 2]) characterizes the set of realizable distributions by upper and lower bounds on likelihood ratios.

Lemma 1.1. The distribution $Q \in \mathcal{R}(P,\epsilon,q)$ if and only if, for all $z \in \mathbb{R}^d$,
$$q(1-\epsilon) \le \frac{dQ}{dP}(z) \le q(1-\epsilon) + \epsilon.$$

We emphasize that, restricted to $\mathbb{R}^d$, this lemma implies that $Q$ is a sub-probability measure whose density is upper and lower bounded by that of $P$. From this relation, we can read off upper and lower bounds on the conditional distribution $Q_R$ of the observations, given that they were observed. In particular, note that $q(1-\epsilon) \le Q(\mathbb{R}^d) \le q(1-\epsilon) + \epsilon$. This in turn implies that if $Q \in \mathcal{R}(P,\epsilon,q)$, then

$$1 - \frac{\epsilon}{q(1-\epsilon)+\epsilon} \le \frac{dQ_R}{dP}(z) \le 1 + \frac{\epsilon}{q(1-\epsilon)} \quad \text{for all } z \in \mathbb{R}^d. \tag{3}$$

Most of our algorithms in the sequel will be based on this characterization. We will occasionally use the notation $\mathcal{R}_R(P,\epsilon,q)$ to denote the corresponding set of conditional distributions (e.g., on $\mathbb{R}^d$).

We note in passing that in this special case, the condition is similar to familiar sensitivity conditions considered in the causal inference literature, e.g., [Ros87; ZSB19], as well as models of sampling bias [SLW22]. In particular, if we let $\Gamma = 1 + \frac{\epsilon}{q(1-\epsilon)} \ge 1$, this is equivalent to a biased sampling model in which all of the observations are biased with likelihood ratios bounded below and above by $1/\Gamma$ and $\Gamma$, respectively. Indeed, we note that under an appropriate change of variables, all of our algorithms (and lower bounds) apply in this setting.

We will assume throughout the remainder of this text that $q = 1$. This is without loss of generality by an appropriate rescaling of $\epsilon$ and $n$. In particular, if both $q$ and $\epsilon$ are known, one can remove a $(1-q')$-fraction of the completely missing observations and apply an algorithm designed for observations from the set $\mathcal{R}(P, \epsilon', 1)$. The following lemma makes this precise.

Lemma 1.2. Let $\epsilon \in [0,1)$ and $q \in [0,1]$, and using these define
$$\epsilon' = \frac{\epsilon}{\epsilon + q(1-\epsilon)} \quad \text{and} \quad q' = \epsilon + q(1-\epsilon).$$
Then,
$$\mathcal{R}(P,\epsilon,q) = \Big\{ q' \cdot Q' + (1-q')\,\delta_{\{\star^d\}} \;:\; Q' \in \mathcal{R}(P, \epsilon', 1) \Big\}.$$

We provide the proof of Lemma 1.2 in Section G.1.
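As a quick sanity check on Lemma 1.2, the likelihood-ratio envelope of Lemma 1.1 for $\mathcal{R}(P,\epsilon,q)$ should match that of the rescaled model $q'\cdot\mathcal{R}(P,\epsilon',1)$. The minimal sketch below verifies the two identities $q'(1-\epsilon') = q(1-\epsilon)$ and $q' = q(1-\epsilon)+\epsilon$ numerically; it is a check of the arithmetic, not a proof.

```python
import numpy as np

def reparameterize(eps, q):
    """Lemma 1.2 rescaling from R(P, eps, q) to R(P, eps', 1)."""
    q_prime = eps + q * (1.0 - eps)
    eps_prime = eps / q_prime
    return eps_prime, q_prime

# Lemma 1.1 envelope for R(P, eps, q): q(1-eps) <= dQ/dP <= q(1-eps) + eps.
# For Q = q' Q' with Q' in R(P, eps', 1), the envelope is [q'(1-eps'), q'].
for eps, q in [(0.1, 0.5), (0.3, 0.9), (0.7, 0.2)]:
    eps_p, q_p = reparameterize(eps, q)
    lower, upper = q * (1 - eps), q * (1 - eps) + eps
    assert np.isclose(q_p * (1 - eps_p), lower)   # lower envelopes agree
    assert np.isclose(q_p, upper)                 # upper envelopes agree
print("Lemma 1.2 envelope identities verified on test cases.")
```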
1.1 Contributions

We consider three particular estimation problems and overview our main results in each.

Mean estimation. We consider the family of base distributions $\{P_\theta\}_{\theta\in\mathbb{R}^d}$, where $P_\theta = N(\theta, \sigma^2 I_d)$ with known $\sigma > 0$. We show that in the realizable contamination model (2), any estimator $\hat\theta$ of $\theta$ from $n$ observations which succeeds with probability at least $1-\delta$ must satisfy
$$\big\|\hat\theta - \theta\big\|_2 \gtrsim \sigma\cdot\frac{\log\left(\frac{1}{1-\epsilon}\right)}{\sqrt{\log\left(1 + \frac{\epsilon^2}{1-\epsilon}\cdot\frac{n}{d+\log(1/\delta)}\right)}},$$
and we provide an estimator which achieves this error. See Theorem 3.1 for a precise statement. We note that in terms of sample complexity, if $\delta, \sigma, q, \epsilon$ are fixed constants, then this implies that to reach error $\|\hat\theta-\theta\|_2 \le \rho$, we require (roughly) $n \gtrsim d e^{1/\rho^2}$ samples.

Turning to computationally-efficient rates, we show that, for $\epsilon$ a small enough constant, a polynomial-time algorithm based on the sum-of-squares method achieves error $\rho$ as long as (roughly) $n \gtrsim d^{1/\rho^2}$ observations are available, where we hide multiplicative factors depending on $\rho$. See Theorem 4.5 for a precise statement. We complement this, in Theorem 4.7, with a nearly-matching computational lower bound against statistical query (SQ) algorithms, low-degree polynomial tests, the sum-of-squares hierarchy, and polynomial threshold functions. For the remainder of this section, we focus on statistical query algorithms for simplicity. Together, these results provide evidence towards a large statistical–computational gap. We summarize this state of affairs in Figure 2.

[Figure 2: Sample complexity phase diagram for mean estimation with $\epsilon$ a fixed constant; the sample-size axis $n$ is partitioned into an "impossible" regime, an "SQ hard" regime starting at $n \asymp d e^{1/\rho^2}$, and an "easy" regime starting at $n \asymp d^{1/\rho^2}$. In order to achieve $\ell_2$ norm error $\rho$, it is information-theoretically necessary and sufficient to take $n \asymp d e^{1/\rho^2}$ many samples. On the other hand, any statistical query algorithm must take (roughly) $d^{1/\rho^2}$ many samples, and a polynomial-time algorithm (nearly) saturates this lower bound.]

Covariance estimation. We show that a parallel set of claims holds in the covariance estimation setting. In particular, considering the family $\mathcal{P}_{\mathrm{cov}}$, where $P_\Sigma = N(0,\Sigma)$ for each $\Sigma\in\mathcal{C}^d_{++}$, we show that any estimator $\hat\Sigma$ that succeeds with probability at least $1-\delta$ must satisfy
$$\left\|\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2} - I_d\right\|_{\mathrm{op}} \gtrsim \sqrt{\frac{d+\log(1/\delta)}{n(1-\epsilon)}} + \frac{\log\left(\frac{1}{1-\epsilon}\right)}{\log\left(1 + \frac{\epsilon n}{d+\log(1/\delta)}\right)},$$
and that a computationally-inefficient algorithm achieves nearly the same rate. See Theorem 3.2 for an exact statement. On the other hand, we show that for $\epsilon$ a small enough constant, in order to achieve error $\rho$ in relative operator norm, roughly $n \gtrsim d^{1/\rho}$ samples are sufficient for a polynomial-time sum-of-squares based estimator to succeed (Theorem 4.10), and that nearly the same sample complexity is required of any statistical query algorithm (Theorem 4.12).

Linear regression. In the setting of linear regression with missing observations, in which the missingness may depend arbitrarily on both the response $Y$ and the covariate $X$, we show that a large statistical–computational gap does not persist. In particular, we consider the class $\mathcal{P}_{\mathrm{LR}}(\sigma^2)$, which consists of covariate–response pairs $(X,Y)$ distributed as $X\sim N(0, I_d)$ and $Y\mid X\sim N(X^\top\theta, \sigma^2)$. In this setting, we show that any estimator $\hat\theta$ of the coefficients $\theta$ which succeeds with probability at least $1-\delta$ must incur error at least
$$\big\|\hat\theta-\theta\big\|_2 \gtrsim \sigma\cdot\frac{\log(1/(1-\epsilon))}{\sqrt{\log\left(1+\frac{\epsilon^2}{1-\epsilon}\cdot\frac{n}{d+\log(1/\delta)}\right)}},$$
and that there exists a simple polynomial-time estimator based on minimizing a strongly convex empirical risk which achieves the nearly matching error
$$\big\|\hat\theta-\theta\big\|_2 \lesssim \sigma\cdot\frac{\epsilon}{1-\epsilon}\cdot\frac{\log\log\left(\frac{n(1-\epsilon)}{d+\log(1/\delta)}\right)}{\sqrt{\log\left(1+\frac{\epsilon^2}{1-\epsilon}\cdot\frac{n}{d+\log(1/\delta)}\right)}}.$$
Observe that for $\epsilon$ less than a sufficiently small constant, the upper and lower bounds differ only by a mild $\log\log$ term. See Theorem 5.1 for details.

1.2 Further related work

The model and methods considered in this paper lie at the intersection of the study of missing data and robust estimation; here we describe the relation to the most related results and models in each of these fields.
Missing data. Missing data is a mature field and we refer the interested reader to the book [LR14] for many of the classical models and results. The realizable contamination model which we study in this manuscript was recently introduced by [MVBWS24], although we note that similar models have been proposed in the literature in restricted settings by [HM95; BK22]. As mentioned in the discussion following the sandwich relation in (3), in the all-or-nothing setting, there is a direct connection between realizable contamination and biased sampling [SLW22; AL13] and sensitivity analysis in causal inference [Ros87].

An advantage of the realizable contamination model is that it does not preclude consistent estimation even when a constant fraction of the observations are contaminated. We note that constrained forms of MNAR which similarly allow identification of the population parameters have been considered in the literature; we direct the interested reader to, e.g., [MST22; MPT13; RRS98], and the references therein for a small subset of this literature. We emphasize that the realizable contamination model encapsulates all of these modeling choices when $\epsilon = 1$, as it places no assumptions on the contamination component other than that it masks the original data. On the other hand, we note that stronger forms of contamination (which can mask and alter the base distribution) have been considered in the missing data literature. For instance, [HR21] consider an analog of the strong contamination model [DK23] and derive nearly optimal mean estimators in this setting. In the setting of linear regression, [DDKLP25] consider adversarial contamination and provide optimal and computationally-efficient algorithms for estimating the regression coefficients. We emphasize that a key distinction between these stronger forms of contamination and the realizable setting considered here is that our restricted model often enables identification in settings where the strong contamination model does not.

Our focus is on computationally-efficient methods. This has been a focus of the literature on truncated statistics over the last several years (as we discuss in the next paragraph). We remark here that even in the more benign MCAR setting, missingness can induce computational difficulties. For instance, [LW12] consider sparse linear regression with MCAR covariates and provide a computationally-efficient algorithm for estimation despite nonconvexity of a natural plug-in estimator. Moving beyond MCAR, there has been a flurry of work on the computational aspects of estimation from truncated statistics, which we detail next.

Truncated statistics. Estimation from truncated samples is a classical problem of estimating a distribution, or its parameters, when given i.i.d. samples conditioned on falling in an observation set $S$. Recent literature [DGTZ18] revisited this classical problem, under the assumption of a Gaussian base distribution, with an eye towards computationally-efficient algorithms. These authors show that in many learning settings, when given oracle access to the truncation set $S$, computationally-efficient estimation of population parameters is possible; by contrast, the same authors show that without such oracle access, it is information-theoretically impossible to perform estimation without further assumptions on the set $S$.
Motivated by this, subsequent work [KTZ19] further explored the setting of an unknown truncation set $S$, under the additional assumption that $S$ has additional regularity, such as belonging to a class of bounded VC dimension or Gaussian surface area. For instance, these authors show that if the truncation set $S$ belongs to a collection $\mathcal{C}$, then it is possible to estimate the underlying mean with a sample complexity scaling as $\tilde O(\mathrm{VC}(\mathcal{C})/\rho + d^2/\rho^2)$, albeit with an inefficient algorithm. Following this, [DKPZ24] show that there exist truncation sets with small VC dimension and Gaussian surface area such that a superpolynomial sample complexity is required for efficient algorithms in the statistical query model.

The setting with unknown truncation is more closely related to the all-or-nothing setting considered in our paper, where we can understand the contamination set as the set of all possible truncated distributions, that is, without any structural assumptions on $S$. Given the similarity between the two settings, it is worthwhile to open a small parenthesis to discuss the differences, which are crucial and lead to different types of algorithms. In the truncated statistics problem, the assumptions imposed on $S$ permit truncating the entirety of the tails, so the tail behaviour within the model cannot be used to estimate the distribution. By contrast, in our MNAR setting, we allow the likelihood ratio $\frac{dQ}{dP}$ to vary arbitrarily within $[1-\epsilon, 1]$, independently across points. When $\epsilon$ is close to 1, this renders the bulk of the distribution unhelpful for estimation, as $Q$ can hide all the potential variation of $\frac{dP'}{dP}(z)$ for $P'$ in a neighborhood of $P$. However, when $P$ belongs to a known class of distributions with rapidly decaying tails (e.g., Gaussian), a small perturbation can lead to a large likelihood ratio in the tails. It is this feature of the tails that we exploit in our problems of interest. Thus, while estimation from truncated samples is achieved by focusing on the bulk of the distribution, estimation from MNAR samples is achieved by focusing on the tails of the distribution. Despite this contrast, the computationally-efficient sample complexities of mean and covariance estimation are similar in the two settings. One major difference is that in linear regression, we obtain a computationally-efficient algorithm that requires a number of samples only linear in $d$, while in the truncated-sample setting, existing algorithms either incur $d^{O(1/\rho^2)}$ samples [LMZ24] or require truncation which can only depend on the response [KMKC26].

Robust estimation. For a thorough background on robust estimation, we refer the reader to the books [HR09] (for its historical treatment) and [DK23] (for its algorithmic aspects). The prototypical contamination model considered in robust statistics is the Huber contamination model, where the statistician observes $Q = (1-\epsilon_{\mathrm{Huber}})P + \epsilon_{\mathrm{Huber}}R$, where $R$ is an arbitrary distribution and $\epsilon_{\mathrm{Huber}} \in [0, 1/2)$ is the contamination rate. Even when $P$ belongs to a nice family of distributions such as Gaussians, the Huber contamination model is strong enough to preclude consistent estimation of parameters such as the mean, the covariance, and linear regression coefficients.
Furthermore, realizable contamination can be seen as a special case of Huber contamination: the first inequality in (3) implies that the conditional distribution $Q_R$ of $Q \in \mathcal{R}(P,\epsilon,1)$ is a valid Huber contamination as long as $\epsilon_{\mathrm{Huber}} \ge \epsilon$. However, realizable contamination contains additional structural information (the second inequality in (3)) that leads to consistency. Our work can also be seen as part of a broader research agenda that studies practical restrictions of the Huber contamination model which lead to improved statistical rates. We discuss some related works below.

In the context of mean estimation, multiple recent works have studied the mean-shift contamination model [Li23; KG25; DIKP25; KKLZ26; DIKL26], which places a different kind of constraint on the contamination distribution. In the basic setting, one observes samples from $(1-\epsilon)N(\mu, I_d) + \epsilon\, R * N(0, I_d)$, where $R$ is an arbitrary distribution and $*$ denotes the convolution operator. These works have shown that consistent mean estimation is possible in this model: for any constant $\epsilon\in(0,1)$, the sample complexity to get error $\rho$ is roughly $d/\rho^2 + e^{\tilde\Theta(1/\rho^2)}$, and furthermore there is no substantial statistical–computational gap. While this rate is similar to that of realizable contamination for $d = 1$, the statistical sample complexity for realizable contamination is much higher for $d > 1$, scaling as $d e^{\Theta(1/\rho^2)}$ (and the computational sample complexity is significantly higher, scaling as $d^{\Theta(1/\rho^2)}$).

In the context of linear regression, the mean-shift contamination model has the following analog:
$$X \sim N(0, I_d), \quad \text{and} \quad y \mid X \sim (1-\epsilon)\cdot N(X^\top\theta_\star, \sigma^2) + \epsilon\cdot R_X * D_{X^\top\theta_\star}, \tag{4}$$
where $R_X$ is a univariate distribution that may depend on the covariate $X$. When the contaminating distribution $R_X$ is oblivious to $X$, i.e., $R_X = R$, this is termed the oblivious contamination model. For the oblivious contamination model, multiple works have shown that consistent estimation is possible [TJSO14; JTK14; BJK15; BJKK17; SBRJ19; dNS21], and there exist (at most quadratic) statistical–computational gaps [DGKLP25]. When the distribution $R_X$ is allowed to depend on $X$, it is termed adaptive contamination. The forthcoming work [DGKPX26] studies this adaptive contamination model from both the computational and statistical perspectives. While (4) can be seen as a more general contamination model than realizable contamination with missing responses (see Eq. (11)), the statistical rates surprisingly coincide; this follows by combining our lower bounds (which build on their work) and their upper bounds. By contrast, the computational rates differ widely, pointing to the computational benefits of realizable contamination over adaptive contamination. To elaborate, our work gives a computationally-efficient algorithm with nearly-matching rate for realizable contamination, whereas [DGKPX26] establishes a statistical query lower bound of $d^{1/\rho^2}$ for adaptive contamination, for any constant $\epsilon$. While their SQ lower bound of $d^{1/\rho^2}$ for (4) looks similar to our SQ lower bound for realizable mean estimation, it is unclear if there is a deeper connection in terms of a formal reduction.

Notation

We let $\mathbb{R}$, $\mathbb{R}_+$, $\mathbb{R}_{++}$ denote the set of real numbers, non-negative real numbers, and positive real numbers, respectively. We let $\mathbb{N}$ denote the set of natural numbers.
We let $S^{d-1} = \{u\in\mathbb{R}^d : \|u\|_2 = 1\}$ denote the unit sphere. We let $\mathcal{C}^d_+ = \{X\in\mathbb{R}^{d\times d} : X = X^\top \text{ and } X\succeq 0\}$ and $\mathcal{C}^d_{++} = \{X\in\mathbb{R}^{d\times d} : X = X^\top \text{ and } X\succ 0\}$ denote the cones of positive semidefinite and positive definite matrices, respectively. We use $\mathbb{1}_A$ and $\mathbb{1}\{A\}$ interchangeably to denote the indicator function of the set $A$. We let $\mathbb{R}_\star = \mathbb{R}\cup\{\star\}$ denote an extended space where the value $\star$ denotes a missing entry, and let $\mathbb{R}^k_\star = \mathbb{R}_\star\times\cdots\times\mathbb{R}_\star$ ($k$ times). We say that the support of a binary vector $\omega\in\{0,1\}^k$, denoted by $\mathrm{supp}(\omega)$, is the set of coordinates on which $\omega$ takes the value 1. We define the operation $\star : \mathbb{R}^k\times\{0,1\}^k \to \mathbb{R}^k_\star$, where the $j$-th component of $x\star\omega$ is defined by
$$(x\star\omega)_j := \begin{cases} x_j & \text{if } \omega_j = 1, \\ \star & \text{if } \omega_j = 0, \end{cases}$$
for $j\in[k]$. We use $S \pm T$ between multisets $S$ and $T$ to denote, respectively, multiset union and difference. For a multiset $S$, we use $\mathbb{E}_{x\sim S}[f(x)]$ to denote the expectation of $f$ with respect to the empirical distribution of $S$. For $n\in\mathbb{N}$, we use the notation $f(n)\lesssim g(n)$ (or $f(n) = O(g(n))$) to mean $|f(n)| \le C|g(n)|$ for some universal positive constant $C$. Analogously, we use the notation $f(n)\gtrsim g(n)$ (or $f(n) = \Omega(g(n))$) to mean $|f(n)| \ge c|g(n)|$ for some universal positive constant $c$. If $f(n)\lesssim g(n)$ and $f(n)\gtrsim g(n)$, we denote this relationship by $f(n)\asymp g(n)$ (or $f(n) = \Theta(g(n))$). We use $\tilde O$, $\tilde\Omega$, and $\tilde\Theta$ to hide terms that are poly-logarithmic in the parameters. We let $N(\mu,\Sigma)$ denote the Gaussian distribution with mean $\mu$ and covariance $\Sigma$, and let $\phi_d(\cdot\,;\mu,\Sigma)$ denote its density. We omit the subscript from $\phi_d$ when $d = 1$. We often omit the parameter $\Sigma$ from $\phi_d$ when it is the identity matrix and omit $\mu$ when it is the zero vector, so that $\phi_d(\cdot)$ denotes the density of the standard $d$-dimensional Gaussian. We let $\chi^2_d$ denote the chi-squared distribution with $d$ degrees of freedom.

2 Background

In this section, we provide background on some of the tools we use throughout the paper.

2.1 Sum-of-squares algorithms

We briefly review sum-of-squares proofs and pseudo-expectations, referring to [BS16; FKP19] for a more complete overview. Throughout, we will refer to a polynomial as a sum of squares if it is equal to a sum of squared polynomials. We use $\mathbb{R}[X]$ to denote the set of polynomials over the indeterminates/variables $X = (X_1,\ldots,X_N)$ and $\mathbb{R}[X]_{\le t}$ to denote those polynomials with degree at most $t$. For $p\in\mathbb{R}[X]$, we use $\|p\|_2$ to denote the Euclidean norm of the associated coefficient vector (in the monomial basis). Consider a system of polynomial inequalities $\mathcal{A} = \{q_i(X)\ge 0\}_{i=1}^m$ over the $N$-dimensional indeterminates $X = (X_1,\ldots,X_N)$.

Definition 2.1 (Sum-of-squares proofs). Let $p, p'\in\mathbb{R}[X]$. The inequality $p(X)\ge p'(X)$ has a sum-of-squares proof given $\mathcal{A}$ if there exist sum-of-squares polynomials $p_S\in\mathbb{R}[X]$ for $S\subseteq[m]$ satisfying
$$p(X) = p'(X) + \sum_{S\subseteq[m]} p_S(X)\,\Pi_{i\in S}\,q_i(X),$$
and we say the proof has degree $t$ if each summand has degree at most $t$; we denote this by $\mathcal{A} \vdash^X_t p \ge p'$ (we may omit the variable when it is clear from the context).

Definition 2.2 (Pseudo-expectations). A degree-$t$ pseudo-expectation $\tilde{\mathbb{E}} : \mathbb{R}[X]_{\le t}\to\mathbb{R}$ is a linear map satisfying $\tilde{\mathbb{E}}[1] = 1$ and $\tilde{\mathbb{E}}[p(X)^2]\ge 0$ for any $p\in\mathbb{R}[X]_{\le t/2}$.
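For intuition, the condition $\tilde{\mathbb{E}}[p(X)^2]\ge 0$ for all $p$ of degree at most $t/2$ is equivalent to positive semidefiniteness of the moment matrix $(\tilde{\mathbb{E}}[m_i m_j])_{i,j}$ indexed by monomials of degree at most $t/2$. The following is a minimal numerical sketch of this check for a degree-2 pseudo-expectation in one variable; the helper name is ours.

```python
import numpy as np

def is_degree2_pseudo_expectation(m1, m2):
    """Check whether the linear map with E~[1]=1, E~[X]=m1, E~[X^2]=m2 is a
    degree-2 pseudo-expectation: E~[(a + bX)^2] = [a b] M [a b]^T >= 0 for all
    a, b, which holds iff the moment matrix M is positive semidefinite."""
    M = np.array([[1.0, m1],
                  [m1, m2]])
    return bool(np.all(np.linalg.eigvalsh(M) >= -1e-12))

# An honest expectation (e.g., X ~ N(0,1): m1 = 0, m2 = 1) always qualifies ...
print(is_degree2_pseudo_expectation(0.0, 1.0))   # True
# ... while "moments" violating E~[X^2] >= E~[X]^2 do not.
print(is_degree2_pseudo_expectation(1.0, 0.5))   # False
```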
Given a set of polynomial inequalities $\mathcal{A}$, we say a degree-$t$ pseudo-expectation $\tilde{\mathbb{E}}$ satisfies $\mathcal{A}$ if $\tilde{\mathbb{E}}[p(X)]\ge 0$ for all $p\in\mathbb{R}[X]_{\le t}$ such that $p(X) = s(X)^2\,\Pi_{i\in S}\,q_i(X)$ for some $S\subseteq[m]$ and $s\in\mathbb{R}[X]$. Under the same conditions, we say $\tilde{\mathbb{E}}$ approximately satisfies $\mathcal{A}$ if $\tilde{\mathbb{E}}[p(X)] \ge -2^{-N^{\Omega(t)}}\|s\|_2\,\Pi_{i\in S}\|q_i\|_2$.

Under mild conditions on the bit complexity of the constraint set and boundedness of the feasible region, we can efficiently find an approximately satisfying pseudo-expectation. Moreover, this pseudo-expectation will approximately satisfy all polynomial inequalities that have a sum-of-squares proof (of sufficiently small degree), again under mild conditions on the bit complexity of the proof. These conditions hold for all the problems we consider in this paper with negligible approximation error, so for simplicity we take the following assumptions as facts throughout.

Assumption 2.3. We can compute a degree-$d$ pseudo-expectation satisfying $\mathcal{A}$ (or refute its satisfiability) in time $(m+N)^{O(d)}$.

Assumption 2.4. Let $\mathcal{A} \vdash_{t'} p \ge 0$ for $p\in\mathbb{R}[X]$ and let $\tilde{\mathbb{E}}$ be a degree-$t$ pseudo-expectation that satisfies $\mathcal{A}$. Let $h\in\mathbb{R}[X]$ be a sum of squares with $\deg(h) + t' \le t$. Then $\tilde{\mathbb{E}}[h\cdot p]\ge 0$.

Finally, we will use $x\in(1\pm y)z$ as shorthand to denote the intersection of the constraints $x\le(1+y)z$ and $x\ge(1-y)z$. We will sometimes consider sum-of-squares proofs involving matrix-valued variables. For a matrix-valued polynomial $A$, we use the constraint $A\preceq B$ to denote $A = B - CC^\top$, where $C$ is an implicit dummy matrix-valued indeterminate (and likewise for $A\succeq B$).

Quantifier elimination. Let $Z = (Z_1,\ldots,Z_d)$ be indeterminates. Let $\mathcal{A}'$ be a set of $m'$ polynomial inequalities over $Z$. We will frequently work with a set of constraints $\mathcal{A}$ over indeterminates $X$ that includes the following constraint on the variable $X$:
$$\mathcal{A}' \vdash^Z_t p(X, Z) \ge 0, \tag{5}$$
which may be read as: there exists a sum-of-squares proof, under the axioms $\mathcal{A}'$ in the indeterminates $Z$, that $p(X,Z)\ge 0$. For example, we will routinely use constraints of the following form: $\{\sum_i Z_i^2 = 1\} \vdash^Z_t \sum_{j=1}^n \langle Z, y - X\rangle^{2t} \le B$. Even though (5) does not seem to be representable as $\mathcal{A} = \{q_i(X)\ge 0\}_{i=0}^m$ for a small $m$, it turns out that it can be compactly represented. In particular, if $p$ is a polynomial of degree at most $t$ and $Z$ has dimension $d$, then we can encode it as $\mathcal{A}$ over indeterminates $X$ and $X' = (X'_1,\ldots,X'_{N'})$, where $N' \le (O(N+d+m'))^{O(t)}$ and $|\mathcal{A}| = (O(N'+m+d))^{O(t)}$. We refer the reader to [KS17] for further details (see also [DKKPP22, Appendix A.2]).

2.2 SQ lower bounds

In this section, we give a brief overview of the statistical query (SQ) framework [Kea98; FGRVX17]. In the sequel, we show that our algorithms for mean and covariance estimation are qualitatively optimal in the SQ framework. While our focus is on SQ algorithms, our structural results also rule out polynomial-time algorithms from other popular families of algorithms: sum-of-squares hierarchies [DKPP24], low-degree polynomial tests [BBHLS21], and polynomial threshold functions [DKLP25]. An SQ algorithm does not take in samples from a distribution $Q$; instead, it interacts with the underlying distribution $Q$ through an oracle of the following form:

Definition 2.5 (STAT oracle). Let $P$ be a distribution on the domain $\mathcal{X}$. A statistical query is a bounded (measurable) function $f : \mathcal{X}\to[-1,1]$. For a tolerance $\kappa\in(0,1)$, the $\mathrm{STAT}_{P,\kappa}$ oracle responds to the query $f$ with a value $v = \mathrm{STAT}_{P,\kappa}(f)$ such that $|v - \mathbb{E}_{X\sim P}[f(X)]| \le \kappa$. We call $\kappa$ the tolerance of the SQ oracle.
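To fix ideas, here is a minimal simulation of a STAT oracle (our own illustrative construction, not part of the paper): it answers each query with the population expectation perturbed by an amount within the tolerance, here simulated by uniform noise, with the expectation approximated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_stat_oracle(sample_P, kappa, n_mc=500_000):
    """Simulate STAT_{P,kappa}: answer a query f: X -> [-1, 1] with a value v
    satisfying |v - E_P[f(X)]| <= kappa. The within-tolerance perturbation is
    simulated with uniform noise; E_P is approximated by Monte Carlo."""
    X = sample_P(n_mc)
    def oracle(f):
        v = np.mean(f(X))
        return float(np.clip(v + rng.uniform(-kappa, kappa), -1.0, 1.0))
    return oracle

# Query a bounded statistic of N(0, 1) with tolerance 0.01.
oracle = make_stat_oracle(lambda n: rng.standard_normal(n), kappa=0.01)
print(oracle(np.tanh))   # close to E[tanh(X)] = 0, up to tolerance
```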
Many popular algorithms for statistical estimation use samples only to approximate $\mathbb{E}_P[f_1(X)],\ldots,\mathbb{E}_P[f_m(X)]$ for a sequence of (adaptively chosen) functions $f_i$. An SQ algorithm could directly obtain these approximations using the $\mathrm{STAT}_{P,\kappa}$ oracle above, up to error $\kappa$. Observe that $\kappa$ here corresponds to the sampling error, and thus $\mathrm{poly}(1/\kappa)$ is interpreted as the "effective sample complexity" of the SQ algorithm. Our focus will be on showing the hardness of testing problems; standard arguments imply that the corresponding estimation task is no easier (either statistically or computationally). We will consider a testing problem of the following kind:

Definition 2.6 (Generic testing problem). Let $P_{\mathrm{null}}$ be a distribution and let $\mathcal{A}$ be a set of distributions. Suppose the tester has access (either through i.i.d. samples $(X_1,\ldots,X_n)\sim Q^{\otimes n}$ or an SQ oracle $\mathrm{STAT}_{Q,\kappa}$) to the distribution $Q$. Consider the following testing problem:
$$H_0 : Q = P_{\mathrm{null}} \quad\text{vs.}\quad H_1 : Q\in\mathcal{A}.$$
We say an SQ algorithm solves the testing problem above with $q$ queries and tolerance $\kappa$ if it interacts with a $\mathrm{STAT}_{Q,\kappa}$ oracle by making at most $q$ adaptive queries and outputs a decision $\hat\phi$ such that, with probability at least $2/3$, $\hat\phi = \mathbb{1}\{Q\in\mathcal{A}\}$.

An SQ hardness result is parameterized by two parameters, a query parameter $q_0$ and a tolerance parameter $\kappa_0$, and is of the following flavor: any SQ algorithm that makes fewer than $q_0$ queries and successfully solves the inference task of interest must use at least a single query with tolerance less than $\kappa_0$. This is interpreted as an information–computation gap stating that every successful SQ algorithm must use either $\Omega(q_0)$ "runtime" or $\mathrm{poly}(1/\kappa_0)$ "samples". We refer the reader to [DK23, Chapter 8] for further details.

The influential work of [DKS17] proved SQ hardness results for a generic hypothesis testing problem known as non-Gaussian component analysis (NGCA). NGCA asks us to distinguish whether $Q = N(0, I)$ or whether $Q$ has a non-Gaussian component distribution $A$ in a hidden direction, defined below.

Definition 2.7 (High-dimensional hidden direction distribution). Let $A$ be a univariate distribution on $\mathbb{R}$; we use $a(x)$ for the density of $A$ at $x$. For a unit vector $v$, we denote by $P_{A,v}$ the distribution with the density $P_{A,v}(x) := a(v^\top x)\,\phi_{v^\perp}(x)$, where $\phi_{v^\perp}(x) := \exp\left(-\|x - (v^\top x)v\|_2^2/2\right)/(2\pi)^{(d-1)/2}$; i.e., the distribution that coincides with $A$ in the direction $v$ and is an independent standard Gaussian in every orthogonal direction.

Definition 2.8 (NGCA). Let $A$ be a given univariate distribution. Consider the distribution testing problem (see Definition 2.6) given access to a distribution $Q$ over $\mathbb{R}^d$:
$$H_0 : Q = N(0, I_d) \quad\text{vs.}\quad H_1 : Q \in \{P_{A,v} \mid v\in S^{d-1}\}.$$
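Sampling from the hidden-direction distribution $P_{A,v}$ of Definition 2.7 is straightforward: draw a standard Gaussian, then replace its component along $v$ with an independent draw from $A$. The sketch below does this for an illustrative choice of $A$ (a symmetric two-point mixture, our choice purely for demonstration).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hidden_direction(n, v, sample_A):
    """Draw n samples from P_{A,v}: the marginal along the unit vector v is A,
    and every orthogonal direction is an independent standard Gaussian."""
    d = v.shape[0]
    X = rng.standard_normal((n, d))
    X -= np.outer(X @ v, v)            # remove the Gaussian component along v
    X += np.outer(sample_A(n), v)      # substitute the component drawn from A
    return X

d = 10
v = np.ones(d) / np.sqrt(d)
# Illustrative non-Gaussian A: a symmetric two-point mixture at +/- 1.
two_point = lambda n: rng.choice([-1.0, 1.0], size=n)
X = sample_hidden_direction(100_000, v, two_point)
u = np.zeros(d); u[0] = 1.0
u -= (u @ v) * v; u /= np.linalg.norm(u)            # a direction orthogonal to v
print("variance along v:", np.var(X @ v))           # ~1, matching A's second moment
print("variance along u:", np.var(X @ u))           # ~1, standard Gaussian
```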
[DKS17] established the following SQ lower bound, showing that if $A$ matches $m$ moments with $N(0,1)$, then any SQ algorithm either needs exponentially many queries ("runtime") or an inverse tolerance ("samples") of at least $d^{\Omega(m)}$.

Lemma 2.9 (SQ lower bounds for NGCA [DKS17]). Suppose $A$ matches $m$ moments with $N(0,1)$ and $\chi^2(A, N(0,1)) < \infty$. Then any SQ algorithm that solves the testing problem in Definition 2.8 must use either
- a number of queries at least $\frac{2^{\Omega(d^{\Omega(1)})}}{d^{O(m)}}$, or
- a tolerance less than $\kappa \lesssim \frac{\sqrt{\chi^2(A, N(0,1))}}{d^{\Omega(m)}}$.

In particular, if $m\gtrsim 1$, $d\gtrsim(m\log d)^{\Omega(1)}$, and $\chi^2(A, N(0,1))\lesssim m$, then any successful SQ algorithm either uses $2^{d^{\Omega(1)}}$ many queries or a tolerance of at most $d^{-\Omega(m)}$.

Computational lower bounds for the NGCA problem under the same moment-matching condition on $A$ have also been established for sum-of-squares hierarchies [DKPP24], low-degree polynomial tests [BBHLS21], and polynomial threshold functions [DKLP25]. Since our computational lower bounds for mean and covariance estimation under realizable contamination are proved using the NGCA framework, we immediately obtain lower bounds for these families of algorithms as well.

2.3 Minimax lower bounds

Consider a parameter space $\Theta$ and an observation space $\mathcal{X}$. Let $L : \Theta\times\Theta\to\mathbb{R}_+$ denote a metric on the parameter space. For each $\theta\in\Theta$, let $P_\theta\in\mathcal{P}(\mathcal{X})$ denote a probability distribution on the space of observations $\mathcal{X}$. We define the $1-\delta$ quantile as
$$\mathrm{Quantile}_{1-\delta, P_\theta, L}(\hat\theta, \theta) := \inf\Big\{ r\ge 0 \;:\; \mathbb{P}_{P_\theta}\big( L(\hat\theta, \theta)\le r \big) \ge 1-\delta \Big\}.$$
Our information-theoretic lower bounds will be stated in terms of the minimax $1-\delta$ quantile, defined as
$$\mathfrak{M}(\delta, \mathcal{P}_\Theta, L) := \inf_{\hat\theta\in\hat\Theta}\,\sup_{\theta\in\Theta}\,\sup_{P_\theta\in\mathcal{P}_\theta} \mathrm{Quantile}_{1-\delta, P_\theta, L}(\hat\theta, \theta). \tag{6}$$
We will commonly consider the setting in which $n$ samples are drawn i.i.d. from an underlying distribution and use the shorthand $\mathfrak{M}_n(\delta, \mathcal{P}_\Theta, L)$ accordingly. Our lower bounds will use the following variant of Fano's inequality, the main statement and proof of which can be found in several standard statistics textbooks [Wai19; Duc25; Tsy09].

Lemma 2.10 (Fano's inequality). Suppose that $H\in\mathcal{P}(\mathcal{X})$ and $\theta_1,\ldots,\theta_M \subseteq \Theta$ satisfy
1. (Loss separation) For all $\ell \ne k\in[M]$ and all $\theta\in\Theta$, $\max\{L(\theta, \theta_k), L(\theta, \theta_\ell)\}\ge\gamma$;
2. (Bounded KL) $\frac{1}{M}\sum_{\ell\in[M]}\mathrm{KL}(P_{\theta_\ell}, H) + \log(2 - M^{-1}) \le \frac{1}{2}\log(M)$.
Then, for all $\delta\le 1/2$, it holds that $\mathfrak{M}(\delta, \mathcal{P}_\Theta, L)\ge\gamma$.

3 Information-theoretic limits of mean and covariance estimation

3.1 Algorithms via variational principles

Our information-theoretically optimal algorithms use a familiar technique from the robust statistics literature in which optimal univariate estimators are combined using a variational principle to obtain an optimal multivariate estimator. This idea dates back at least to the Tukey median and the notion of Tukey depth [Tuk75] and has been deployed in problems beyond mean estimation [Gao20] and using a variety of variational principles, such as the PAC-Bayes method [MZ25]. We next briefly describe the variational principle we use for mean estimation and for covariance estimation in turn.

Mean estimation. Let $\theta, \theta_\star\in\mathbb{R}^d$ and note that the $\ell_2$ norm error admits the variational characterization
$$\|\theta - \theta_\star\|_2 = \sup_{v\in S^{d-1}} \langle v, \theta-\theta_\star\rangle. \tag{7}$$
Suppose that for each direction $v\in S^{d-1}$, we had an estimator $\hat\theta_v$ of the mean $\langle v, \theta_\star\rangle$. In this case, we could define an estimator $\hat\theta$ to be any element
$$\hat\theta \in \arg\min_{\theta\in\mathbb{R}^d}\,\sup_{v\in S^{d-1}} \big| \langle v,\theta\rangle - \hat\theta_v \big|.$$
We then have the oracle inequality
$$\sup_{v\in S^{d-1}} \big|\langle v, \hat\theta\rangle - \hat\theta_v\big| \le \sup_{v\in S^{d-1}} \big|\langle v, \theta_\star\rangle - \hat\theta_v\big|,$$
from which we combine the characterization (7) with the triangle inequality to obtain
$$\big\|\hat\theta - \theta_\star\big\|_2 = \sup_{v\in S^{d-1}} \langle v, \hat\theta-\theta_\star\rangle \le \sup_{v\in S^{d-1}} \big|\langle v,\hat\theta\rangle - \hat\theta_v\big| + \sup_{v\in S^{d-1}} \big|\langle v, \theta_\star\rangle - \hat\theta_v\big| \le 2\sup_{v\in S^{d-1}} \big|\langle v,\theta_\star\rangle - \hat\theta_v\big|.$$
It thus follows that if each of the univariate estimators $\hat\theta_v$ is information-theoretically optimal, we may be able to combine these estimators into a multivariate estimator with similar statistical guarantees. In the setting of mean estimation, we note that the optimal univariate estimator is due to the forthcoming work [MVGS26]; for the reader's convenience, we provide an alternative method which achieves the same rate. ♣

Covariance estimation. For simplicity, we consider the operator norm in this paragraph, although our main guarantees follow a slightly different path, as we are interested in relative error. Following similar steps, let $\Sigma, \Sigma_\star\in\mathcal{C}^d_{++}$ and note that the operator norm admits the variational representation
$$\|\Sigma - \Sigma_\star\|_{\mathrm{op}} = \sup_{v\in S^{d-1}} \big|\langle v, (\Sigma-\Sigma_\star)v\rangle\big|.$$
As before, suppose we had an estimator $\hat\sigma^2_v$ of the variance $v^\top\Sigma_\star v$ in each direction $v\in S^{d-1}$. We then define an estimator $\hat\Sigma$ as any element
$$\hat\Sigma \in \arg\min_{\Sigma\in\mathcal{C}^d_{++}}\,\sup_{v\in S^{d-1}} \big|\langle v, \Sigma v\rangle - \hat\sigma^2_v\big|.$$
Combining the preceding two displays with the triangle inequality yields
$$\big\|\hat\Sigma - \Sigma_\star\big\|_{\mathrm{op}} = \sup_{v\in S^{d-1}} \big|\langle v, (\hat\Sigma-\Sigma_\star)v\rangle\big| \le \sup_{v\in S^{d-1}} \big|\langle v, \hat\Sigma v\rangle - \hat\sigma^2_v\big| + \sup_{v\in S^{d-1}} \big|\langle v, \Sigma_\star v\rangle - \hat\sigma^2_v\big| \le 2\sup_{v\in S^{d-1}} \big|\langle v,\Sigma_\star v\rangle - \hat\sigma^2_v\big|.$$
Hence, as in the mean estimation setting, this suggests that it suffices to find optimal univariate variance estimators in each direction and that these can be converted into covariance estimators. In the sequel, we show that a variant of the minimum Kolmogorov distance estimator introduced by [MVBWS24] yields a suitable univariate estimator in each direction. ♣

We emphasize that, as written, these estimators cannot be computed in finite time. A straightforward solution is to approximate the unit sphere by a suitable $\epsilon$-net and compute univariate estimators on each element of the net. While this strategy yields information-theoretically optimal estimators, we note that the size of these nets scales exponentially in the dimension and hence, in general, yields computationally-inefficient estimators.
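The following is a minimal sketch of this net-based construction for the mean, under our own simplifications: the net is random rather than a deterministic $\epsilon$-net, and the univariate routine is a plain median (a stand-in, not the optimal univariate estimator of [MVGS26]). Direction-wise estimates are formed along each net direction, and the multivariate estimate minimizes the worst-case discrepancy over the net.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def net_mean_estimator(X, n_directions=200):
    """Combine direction-wise univariate estimates over a random net of the
    sphere into a multivariate estimate, following the variational principle."""
    n, d = X.shape
    V = rng.standard_normal((n_directions, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)      # directions on the sphere
    t_hat = np.median(X @ V.T, axis=0)                 # univariate estimate per direction
    # theta_hat minimizes the worst discrepancy sup_v |<v, theta> - t_hat(v)|.
    objective = lambda theta: np.max(np.abs(V @ theta - t_hat))
    res = minimize(objective, x0=X.mean(axis=0), method="Nelder-Mead")
    return res.x

X = rng.standard_normal((2000, 3)) + np.array([1.0, -2.0, 0.5])
print(net_mean_estimator(X))   # close to the true mean (1, -2, 0.5)
```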
3.2 Mean estimation

We first consider information-theoretic rates for Gaussian mean estimation. We take the parameter space and loss to be
$$\mathcal{P}_{\mathrm{mean}}(\sigma^2,\epsilon) = \big\{ \mathcal{R}(N(\theta, \sigma^2 I_d), \epsilon) \mid \theta\in\mathbb{R}^d \big\} \quad\text{and}\quad L(\theta, \theta_\star) = \|\theta - \theta_\star\|_2.$$
In the theorem below, we give tight minimax rates for this problem. We provide the proof of the upper bound in Section A.1 and of the lower bound in Section C.2.1. As mentioned in the previous subsection, the upper bound is a corollary of the optimal univariate estimator of [MVGS26]; our main contribution is a matching lower bound in the high-dimensional and high-confidence setting.

Theorem 3.1. Let $\sigma^2 > 0$, $\epsilon < 1$, $\delta\le 0.5$, and $n\in\mathbb{N}$ be such that
$$n \gtrsim \frac{d+\log(1/\delta)}{1-\epsilon}.$$
The minimax $1-\delta$ quantile $\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{mean}}(\sigma^2,\epsilon), L)$ (6) satisfies
$$\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{mean}}(\sigma^2,\epsilon), L) \asymp \sigma\cdot\frac{\log\left(\frac{1}{1-\epsilon}\right)}{\sqrt{\log\left(1 + \frac{\epsilon^2}{1-\epsilon}\cdot\frac{n}{d+\log(1/\delta)}\right)}}.$$

A few remarks are in order. First, let us simplify the expression of the minimax rate in the regime of interest where $\delta$ is constant and $\epsilon\le 1-c$ for some small constant $c$. In this setting, the rate simplifies to
$$\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{mean}}(\sigma^2,\epsilon), L) \asymp \frac{\sigma\epsilon}{\sqrt{\log(1+\epsilon^2 n/d)}}.$$
Note that if $\epsilon\lesssim\sqrt{d/n}$, this recovers the parametric rate of estimation $\sigma\sqrt{d/n}$. On the other hand, if the contamination fraction $\epsilon$ is at least a constant, the bound further simplifies to $\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{mean}}(\sigma^2,\epsilon), L) \asymp \sigma/\sqrt{\log(n/d)}$. In turn, this implies that in terms of sample complexity, in order to reach error $\sigma\rho > 0$, it is both necessary and sufficient to take $n \asymp d e^{\Theta(1/\rho^2)}$ many samples.

Next, let us compare this rate to similar results in the literature, starting with [MVBWS24, Theorem 8], which considers the same setting. In particular, re-arranging our sample size condition, we require that $\delta \ge \exp\big(-\big(c\,n(1-\epsilon) - d\big)\big)$. By contrast, [MVBWS24] impose the condition that
$$\delta \ge \exp\left\{-\left(c\,\frac{\{n(1-\epsilon)\}^{31/36}}{\log(n(1-\epsilon))} - d\right)\right\},$$
which additionally imposes the sub-optimal sample complexity requirement $n\gtrsim d^{36/31}/(1-\epsilon)$. As we show, the correct sample complexity scales linearly in the dimension $d$.

Let us next briefly isolate the dependence on the failure probability $\delta$. Taking $d = 1$ and $\sigma, \epsilon$ to be constants, we see that
$$\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{mean}}(\sigma^2,\epsilon), L) \asymp \frac{1}{\sqrt{\log(n) - \log\log(1/\delta)}},$$
so that the dependence on the failure probability $\delta$ is doubly logarithmic in the denominator.

Finally, we comment briefly on the proof technique. As previously mentioned, our main contribution in Theorem 3.1 is the lower bound, and we restrict our commentary to this result. We note that it is typical in robust estimation problems to obtain an error which decomposes as the sum of two terms: one "clean" term corresponding to $\epsilon = 0$ and one capturing the dependence on $\epsilon$. Typically, the clean term interacts with the dimension $d$, the sample size $n$, and the confidence $\delta$, whereas the second term is a function solely of the contamination fraction $\epsilon$ [DKKLMS16]. From the perspective of lower bounds, these two terms can typically be captured by (i) an application of Fano's inequality or Assouad's method for the "clean" term and (ii) an application of Le Cam's two-point method for the contamination term. By contrast, the realizable contamination model considered here has a contamination term which depends nontrivially on the dimension $d$. In order to capture this dependence, we build on the proof technique of the forthcoming work of [DGKPX26] (who applied it in the context of linear regression): we construct a family of exponentially many hard distributions (using the high-dimensional hidden direction distributions of Definition 2.7) and conclude via Fano's inequality (Lemma 2.10).
3.3 Covariance estimation

We now consider information-theoretic rates for Gaussian covariance estimation in relative operator norm error. Specifically, we take the parameter space and loss to be
$$\mathcal{P}_{\mathrm{cov}}(\epsilon) = \big\{ \mathcal{R}(N(0,\Sigma), \epsilon) \mid \Sigma\in\mathcal{C}^d_{++} \big\} \quad\text{and}\quad L(\Sigma, \Sigma_\star) = \left\| \Sigma_\star^{-1/2}\Sigma\Sigma_\star^{-1/2} - I_d \right\|_{\mathrm{op}}.$$
In the theorem below, we give upper and lower bounds on the minimax rates. We provide the proof of the upper bound in Section A.2 and of the lower bound in Section C.2.2, which follow techniques similar to the proof of Theorem 3.1.

Theorem 3.2. Let $\epsilon < 1$, $\delta\le 0.5$, and $n\in\mathbb{N}$ be such that
$$n \gtrsim \frac{d\left(1+\log\left(\frac{1}{1-\epsilon}\right)\right) + \log(1/\delta)}{1-\epsilon}.$$
The minimax $1-\delta$ quantile (6) $\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L)$ satisfies
$$\sqrt{\frac{d+\log(1/\delta)}{n(1-\epsilon)}} + \frac{\log\left(\frac{1}{1-\epsilon}\right)}{\log\left(1+\frac{\epsilon n}{d+\log(1/\delta)}\right)} \;\lesssim\; \mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L) \;\lesssim\; \frac{\log\left(\frac{1}{1-\epsilon}\right)}{\log\left(1+\sqrt{\frac{\epsilon^2}{1-\epsilon}\cdot\frac{n}{d\left(1+\log\left(\frac{1}{1-\epsilon}\right)\right)+\log(1/\delta)}}\right)}.$$

We analyze the upper and lower bounds of Theorem 3.2 in a few different regimes. In order to ease notation, we let
$$\alpha = \sqrt{\frac{d+\log(1/\delta)}{n(1-\epsilon)}} \quad\text{and}\quad \beta = \log\left(\frac{1}{1-\epsilon}\right).$$
In the low-to-constant contamination regime $\epsilon\le\frac12$, we have $\beta\asymp\epsilon$, and so the bounds of Theorem 3.2 simplify to
$$\alpha + \frac{\epsilon}{\log(1+\epsilon/\alpha^2)} \;\lesssim\; \mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L) \;\lesssim\; \frac{\epsilon}{\log(1+\epsilon/\alpha)}.$$
When $\epsilon\lesssim\alpha$, both the upper and lower bounds yield $\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L)\asymp\alpha$, which we recognize as the parametric rate. When $\epsilon\gtrsim\alpha^{1-c}$ for some small constant $c$ (which contains the regime when $\epsilon$ is at least a small constant), both bounds agree and yield the rate $\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L)\asymp\epsilon/\log(1/\alpha)$. In the intermediate range between these two settings, where $\epsilon = \alpha^{1-\gamma}$ for $\gamma = o(1)$, our bounds read as
$$\frac{\epsilon}{\log(1/\alpha)} \;\lesssim\; \mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L) \;\lesssim\; \frac{\epsilon}{\gamma\log(1/\alpha)},$$
representing a gap on the order of $\frac{1}{\gamma}$. For example, when $\epsilon = \alpha\log(1/\alpha)$ and $\delta\gtrsim e^{-d}$, we observe a gap in our bounds on the order of $\frac{\log(1/\alpha)}{\log\log(1/\alpha)} \asymp \frac{\log(n/d)}{\log\log(n/d)}$.

Now we consider the large-contamination regime $\epsilon\ge\frac12$. The upper and lower bounds of Theorem 3.2 simplify to
$$\alpha + \frac{\beta}{\beta+\log(1/\alpha)} \;\lesssim\; \mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L) \;\lesssim\; \frac{\beta}{\beta - \log(1+\beta) + \log(1/\alpha)}.$$
Because $\beta$ is bounded away from zero in this regime, the upper and lower bounds simplify to $\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L) \asymp \frac{\beta}{\beta+\log(1/\alpha)}$.

Finally, note that in the nearly missing completely at random setting (i.e., when $\epsilon\lesssim\alpha$), the rates of estimation for mean estimation in $\ell_2$ norm and covariance estimation in operator norm coincide. On the other hand, when $\epsilon > \alpha$, the rate of estimation for covariance estimation is quadratically faster. In particular, if we fix the parameters $\epsilon$ and $\delta$ as constants, we see that
$$\mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{cov}}(\epsilon), L) \asymp \frac{1}{\log(n/d)}, \quad\text{whereas}\quad \mathfrak{M}_n(\delta, \mathcal{P}_{\mathrm{mean}}(1,\epsilon), L) \asymp \frac{1}{\sqrt{\log(n/d)}}.$$
In terms of sample complexity, this implies that in order to achieve covariance norm error $\Theta(\rho)$, we require $n\gtrsim de^{1/\rho}$ many samples, as opposed to $de^{1/\rho^2}$ samples for mean estimation.

4 Computationally-efficient mean and covariance estimation

In the previous section, we established (nearly) tight statistical rates for mean and covariance estimation, but the algorithms we employed required exponential computation time. In this section, we study polynomial-time algorithms and the rates of estimation they can achieve. In Section 4.1, we preview the key idea underlying our approach in the simpler setting of hypothesis testing. Then, in Section 4.3, we provide an efficient algorithm for mean estimation and demonstrate that its sample complexity is optimal among various well-studied algorithmic classes. Finally, in Section 4.4, we provide a parallel story for covariance estimation.

4.1 Warm-up: Efficient testing via polynomials

In order to build intuition, let us begin with the univariate testing problem. In particular, we will try to test
$$H_0 : Q\in\mathcal{R}(N(0,1), \epsilon) \quad\text{vs.}\quad H_1 : Q\in\mathcal{R}(N(\rho,1), \epsilon)$$
for $\rho\in\mathbb{R}$. Note that the hardest distributions to test put the same amount of mass on $\{\star\}$, so it suffices to test between the conditional distributions $Q_R := \mathrm{Law}(X\mid X\in\mathbb{R})$.
The minimum distance estimators studied in Section 3 use fine-grained information from the distribution functions under each hypothesis. Here, we note that, at the population level, thresholding a simple polynomial yields a consistent test. Throughout the derivation, we will use the notation $\tau := \frac{\epsilon}{1-\epsilon}$. Indeed, from the discussion following Lemma 1.1, we see that
$$\frac{1}{1+\tau} = 1-\epsilon \le \frac{dQ_R}{dP}(z) \le 1 + \frac{\epsilon}{1-\epsilon} = 1+\tau \quad\text{for all } z\in\mathbb{R}. \tag{8}$$
At the population level, consider the family of testing functions, indexed by $k\in\mathbb{N}$, $\psi^{(k)} : \mathbb{R}\to\mathbb{R}$ defined as $\psi^{(k)}(X) = X^{2k}$. It is a straightforward consequence of the sandwich relation (8) that
$$\frac{1}{1+\tau}\cdot\mathbb{E}_P\big[\psi^{(k)}(X)\big] \le \mathbb{E}_{Q_R}\big[\psi^{(k)}(X)\big] \le (1+\tau)\cdot\mathbb{E}_P\big[\psi^{(k)}(X)\big].$$
Consequently, under $H_0$, we see that
$$\mathbb{E}_{Q_R}\big[\psi^{(k)}(X)\big] \le (1+\tau)\,\mathbb{E}_{X\sim N(0,1)}\big[\psi^{(k)}(X)\big] = (1+\tau)\,\mathbb{E}\big[G^{2k}\big] = (1+\tau)(2k-1)!!,$$
where we have used the notation $G\sim N(0,1)$. On the other hand, under $H_1$,
$$\mathbb{E}_{Q_R}\big[\psi^{(k)}(X)\big] \ge \frac{1}{1+\tau}\,\mathbb{E}_{X\sim N(\rho,1)}\big[\psi^{(k)}(X)\big] = \frac{1}{1+\tau}\,\mathbb{E}\big[(G+\rho)^{2k}\big] = \frac{1}{1+\tau}\sum_{\ell=0}^{2k}\binom{2k}{\ell}\rho^\ell\,\mathbb{E}\big[G^{2k-\ell}\big] = \frac{1}{1+\tau}\sum_{\ell=0}^{k}\binom{2k}{2\ell}\rho^{2\ell}\,\mathbb{E}\big[G^{2k-2\ell}\big] \ge \frac{1}{1+\tau}\cdot\left\{\mathbb{E}\big[G^{2k}\big] + \binom{2k}{2}\rho^2\,\mathbb{E}\big[G^{2k-2}\big]\right\} = \frac{1}{1+\tau}\cdot\left\{\mathbb{E}\big[G^{2k}\big] + k\rho^2(2k-1)!!\right\},$$
where in the final display we used the fact that $\mathbb{E}[G^{2k-2}] = (2k-3)!!$. Re-arranging, we deduce that any separation
$$\rho \ge \sqrt{\frac{\{(1+\tau)^2 - 1\}\cdot\mathbb{E}[G^{2k}]}{k(2k-1)!!}} = \sqrt{\frac{\tau^2+2\tau}{k}}$$
suffices to test between $H_0$ and $H_1$. In particular, the larger we take $k$ (the degree of the polynomial), the better our test performs.

Our main goal in this section is to develop computationally-efficient algorithms in higher dimension, and towards this goal, we can use a similar technique as in Section 3.1 to obtain a natural analog in higher dimension. In particular, we consider the testing function
$$\psi^{(k)}_d := \sup_{v\in S^{d-1}} \left\{ \mathbb{E}_{Q_R}\big[\langle v, X\rangle^{2k}\big] \right\},$$
and its empirical counterpart
$$\psi^{(k)}_{d,n}(Z_1,\ldots,Z_n) := \sup_{v\in S^{d-1}} \left\{ \frac{1}{n}\sum_{i=1}^n \langle v, Z_i\rangle^{2k} \right\}.$$
We recognize the above quantity as the injective tensor norm of the tensor $\frac{1}{n}\sum_{i=1}^n Z_i^{\otimes 2k}$. The key insight on which we build is that while the computation of the polynomial optimization problem $\psi^{(k)}_{d,n}$ is believed to be intractable in the worst case [BBHKSZ12], under certain average-case assumptions on the data, a good (additive) approximation of it can be computed in polynomial time using the sum-of-squares technique [KSS18; HL18]. Indeed, our algorithms for mean and covariance estimation to follow build directly on this intuition.
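Returning to the univariate warm-up, the following is a minimal Monte Carlo sketch of the test, under our own illustrative parameter choices: the empirical $2k$-th moment is compared against the null threshold $(1+\tau)(2k-1)!!$, and the contamination is a worst-case-flavored rejection scheme that keeps tail points with probability $1-\epsilon$, which keeps the conditional likelihood ratio inside the band of (8).

```python
import numpy as np
from scipy.special import factorial2

rng = np.random.default_rng(0)

def adversarial_sample(n, mean, eps, keep_radius=2.0):
    """Rejection-sample a worst-case-flavored conditional distribution Q_R:
    tail points (the informative ones) are kept only with probability 1 - eps,
    which keeps the likelihood ratio dQ_R/dP inside [1-eps, 1/(1-eps)]."""
    xs = []
    while len(xs) < n:
        x = rng.standard_normal(4 * n) + mean
        keep = (np.abs(x - mean) < keep_radius) | (rng.random(4 * n) < 1 - eps)
        xs.extend(x[keep])
    return np.array(xs[:n])

eps, k, n = 0.2, 4, 200_000
tau = eps / (1 - eps)
threshold = (1 + tau) * factorial2(2 * k - 1)       # null acceptance level for E[X^{2k}]
rho = 1.5 * np.sqrt((tau**2 + 2 * tau) / k)         # separation above the derived bound
for mean, label in [(0.0, "H0"), (rho, "H1")]:
    stat = np.mean(adversarial_sample(n, mean, eps) ** (2 * k))
    print(label, "statistic:", round(float(stat), 2), "reject:", stat > threshold)
```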
Before describing our computationally-efficient algorithms, we begin with some preliminaries.

4.2 Preliminaries

Equipped with the intuition from the previous section, we turn now to our main algorithmic development. We focus in this section on the all-or-nothing setting. The mean and covariance estimators we propose in this section will therefore discard all missing data and only consider the remaining observations. We will mainly consider the following data generation process (Definition 4.1), which is strong enough to simulate $n$ i.i.d. samples from any $Q\in\mathcal{R}(P,\epsilon)$.

Definition 4.1 (Data generating process). Let $n\in\mathbb{N}$, $\epsilon\in[0,1]$, and $P\in\mathcal{P}(\mathbb{R}^d)$. We generate $n$ samples in $\mathbb{R}^d\cup\{\star^d\}$ as follows:
1. Let $T_{\mathrm{MCAR}}\subseteq\mathbb{R}^d$ be a multiset of $m$ i.i.d. samples from $P$ for $m\sim\mathrm{Bin}(n, 1-\epsilon)$ (the MCAR observations).
2. Let $T_{\mathrm{MCAR},\star}\subseteq\mathbb{R}^d$ be a multiset of $n-m$ i.i.d. samples from $P$ and let $T_{\mathrm{MNAR}}\subseteq T_{\mathrm{MCAR},\star}$ be an arbitrary subset (the non-missing MNAR observations).
3. Let $T := T_{\mathrm{MCAR}} + T_{\mathrm{MNAR}} \subseteq \mathbb{R}^d$. The estimator observes $T$ and $n - |T|$ copies of $\star^d$.

We will take $P = N(\mu,\Sigma)$ for some $\mu\in\mathbb{R}^d$ and $\Sigma\succ 0$. In particular, for the mean estimation setting we will assume $\Sigma = I_d$ and $\mu$ is unknown, while for the covariance estimation setting we will assume $\Sigma$ is unknown and $\mu = 0$.

Definition 4.2 (Good event). Consider the sets $T_{\mathrm{MCAR}}$ and $T_{\mathrm{MCAR},\star}$ from Definition 4.1. Let $\mathcal{E}$ be the event that:
1. $\frac{n}{m} \le \frac{1+\epsilon}{1-\epsilon}$;
2. For all $\ell\in[2k]$ and $S\in\{T_{\mathrm{MCAR}},\ T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}\}$:
$$\left\| \mathbb{E}_{x\sim S}\left[\left(\Sigma^{-1/2}(x-\mu)\right)^{\otimes\ell}\right] - \mathbb{E}_{z\sim N(0,I_d)}\left[z^{\otimes\ell}\right] \right\|_\infty \le \frac{\epsilon}{(4d)^k 2^k}.$$

We will prove our mean and covariance estimators are accurate assuming $\mathcal{E}$ holds. The following lemma thus bounds the sample complexity of both estimators.

Lemma 4.3 (Sample complexity of good event $\mathcal{E}$). Consider Definitions 4.1 and 4.2 and suppose that $P = N(\mu,\Sigma)$. There exists a pair of universal positive constants $c_1, c_2$ such that
$$n \ge \frac{(c_1 kd\log(1/\delta))^{c_2 k}}{(1-\epsilon)\epsilon^2}$$
implies that the event $\mathcal{E}$ occurs with probability at least $1-3\delta$.

We defer the proof of this lemma to Section G.4.1. Equipped with these preliminaries, we turn next to studying efficient algorithms and computational lower bounds for mean estimation.

4.3 Mean estimation

We first describe our sum-of-squares based estimator and provide an upper bound on its sample complexity in Theorem 4.5. We defer the proof of this theorem to Section D.1.

Definition 4.4 (SoS program for mean estimation). Consider the following constraint on the $d$-dimensional (indeterminate) variable $\theta$ (refer to the preliminaries for notation):
$$\left\{\|v\|_2^2 = 1\right\} \vdash^v_{4k}\; \mathbb{E}_{x\sim T}\left[\langle v, x-\theta\rangle^{2k}\right] \le \frac{(1+\epsilon)^2}{1-\epsilon}\,\mathbb{E}_{G\sim N(0,1)}\big[G^{2k}\big]. \tag{9}$$

Algorithm 1 (Mean estimator)
1: function MEAN-ESTIMATOR($T$, $k$, $\epsilon$)
2:   Find a pseudo-expectation $\tilde{\mathbb{E}}$ of degree $4k$ which satisfies the system of Definition 4.4. If no such $\tilde{\mathbb{E}}$ exists, return the all-zeros vector.
3:   return $\hat\theta := \tilde{\mathbb{E}}[\theta]$.
4: end function

Theorem 4.5 (SoS algorithm for mean estimation). Let $P = N(\theta_\star, I_d)$ and generate $T\subseteq\mathbb{R}^d$ by Definition 4.1. Let $k\in\mathbb{N}$ satisfy $k\le\sqrt d$ and let $\hat\theta$ be the output of Algorithm 1. Suppose $n$ satisfies the conditions of Lemma 4.3; i.e.,
$$n \gtrsim \frac{(\Theta(kd\log(1/\delta)))^{\Theta(k)}}{(1-\epsilon)\epsilon^2}.$$
Then, with probability at least $1-3\delta$, the following two conditions both hold:
1. (Satisfiability) The constraint (9) is satisfiable.
2. (Accuracy) Let $\gamma = \frac{(1+\epsilon)^2 - (1-\epsilon)^3}{(1-\epsilon)^2}$. Any degree-$4k$ pseudo-expectation $\tilde{\mathbb{E}}$ satisfying this constraint also satisfies
$$\big\|\hat\theta - \theta_\star\big\|_2 \le 8\left(\sqrt{\frac{\gamma}{k}} + \frac{\gamma}{k}\right).$$

Let us compare this result with the information-theoretic results established in Theorem 3.1. First, the algorithm above runs in time $\mathrm{poly}(n^k, d^k, 1/\epsilon)$, as opposed to the algorithm in Theorem 3.1, which needs $\mathrm{poly}(n, e^d)$ time. To facilitate the comparison of sample complexities, consider the case when $\epsilon\in(0,1)$ is a constant. Then $\gamma = O(1)$ and Theorem 4.5 implies that to obtain error $\rho\in(0,1)$, one can take $k\asymp 1/\rho^2$, and the resulting sample complexity would be $d^{O(1/\rho^2)}\cdot f(\rho)$ for some function $f$. Contrast this with the information-theoretic rate (Theorem 3.1), where one needs only $d\cdot\tilde f(\rho)$ samples for some function $\tilde f$.
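For intuition about constraint (9), here is a minimal numerical sketch of its simplest instance $k = 1$ (our own simplification, not the algorithm itself): for $k = 1$ the supremum over directions is an eigenvalue computation, so one can check directly whether a candidate $\theta$ keeps the largest directional second moment of $T-\theta$ below the threshold $(1+\epsilon)^2/(1-\epsilon)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def satisfies_degree2_constraint(T, theta, eps):
    """k = 1 analog of constraint (9): sup_{||v||=1} E_{x~T} <v, x - theta>^2
    equals the top eigenvalue of the empirical second-moment matrix, which
    must stay below (1+eps)^2/(1-eps) * E[G^2], with E[G^2] = 1."""
    centered = T - theta
    M = centered.T @ centered / len(T)
    return np.linalg.eigvalsh(M)[-1] <= (1 + eps) ** 2 / (1 - eps)

d, eps, theta_star = 5, 0.2, np.full(5, 0.3)
T = rng.standard_normal((5000, d)) + theta_star
print(satisfies_degree2_constraint(T, theta_star, eps))        # True: the truth is feasible
print(satisfies_degree2_constraint(T, theta_star + 3.0, eps))  # False: far candidates are not
```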
Given this gap, it is natural to wonder whether there are computationally-efficient algorithms which improve upon the sample complexity required by Algorithm 1. The next result shows that such an improvement is impossible for any computationally-efficient statistical query algorithm.

Definition 4.6 (Testing problem for mean estimation). Let $P$ be the underlying (unknown) distribution, equal to $N(\mu, I_d)$ for $\mu\in\mathbb{R}^d$. Let the contamination rate be $\epsilon$ and the signal strength be $\rho$. Let $P'$ be a contaminated conditional distribution $P'\in\mathcal{R}_{\mathrm{R}}(P,\epsilon)$. Suppose the tester has access to the distribution $P'$, either through i.i.d. samples $(X_1,\ldots,X_n)\sim P'^{\otimes n}$ or through an SQ oracle $\mathrm{STAT}_{P'}(\kappa)$ with tolerance $\kappa$. The testing problem is the following:
$$H_0: \mu = 0 \qquad\text{vs.}\qquad H_1: \|\mu\|_2 \ge \rho.$$

Theorem 4.7 (SQ lower bound for mean estimation). There exist a large constant $C>0$ and small constants $c, c'>0$ such that the following holds. Consider Definition 4.6 with (i) $\rho\le\epsilon/C$ and (ii) $d\ge ((\epsilon/\rho)\log(2d))^C$. Suppose that $\epsilon\le c$ and $1-q\le c$. Then any SQ algorithm that solves Definition 4.6 with fewer than $2^{d^{c'}}$ many queries must use tolerance $\kappa\le d^{-m}$ for
$$m \ge \frac{c'\,\epsilon^2/\rho^2}{\log(\epsilon^2/\rho^2)}.$$

The proof of this theorem uses the NGCA framework (Definition 2.8) and is deferred to Section E. Taken together, Theorems 4.5 and 4.7 essentially settle the computational sample complexity for constant $\epsilon$. For vanishing $\epsilon\to 0$, there is a gap: the lower bound on the effective sample complexity scales as $d^{\epsilon^2/\rho^2}$, whereas the upper bound requires roughly $d^{\epsilon/\rho^2}$ samples. In the next section, we turn our attention to developing a parallel story for covariance estimation.

4.4 Covariance estimation

Our sum-of-squares based covariance estimator solves for a pseudo-expectation over variables $M, B\in\mathbb{R}^{d\times d}$ satisfying the following constraints.

Definition 4.8 (SoS program for covariance estimation). Let $\alpha>0$. Let $\mathcal{A}_{\mathrm{covariance}}$ be the union of the following sets of constraints (refer to the preliminaries for notation) over $M, B\in\mathbb{R}^{d\times d}$:
1. Let $\mathcal{A}_{\mathrm{moments}}$ be the set
$$\Big\{\big\{\|v\|_2^2 = 1\big\} \;\vdash^{v}_{8k}\; \mathbb{E}_{x\sim T}\,\big\langle v,\, M\big(\mathbb{E}_{x\sim T}[xx^\top]\big)^{-1/2} x\big\rangle^{2\ell} \in (1\pm 10\epsilon)\,\mathbb{E}_{G\sim N(0,1)}\big[G^{2\ell}\big]\Big\}_{\ell\in[k]}. \qquad (10)$$
2. Let $\mathcal{A}_{\mathrm{deviation}} := \{M\succeq 0\}\cup\{B = M - I_d\}\cup\{\alpha I_d\succeq B\succeq-\alpha I_d\}\cup\{B^\top B\preceq\alpha^2 I_d\}$.

Analogous to the mean estimation setting, it is relatively straightforward to show that any pseudo-expectation $\tilde{\mathbb{E}}$ that satisfies the constraint (10) must also satisfy the relation $\tilde{\mathbb{E}}\,\|\Sigma^{1/2}Mv\|_2^k\approx 1$, from which we may conclude that $\|\tilde{\mathbb{E}}[M\Sigma M] - I\|_{\mathrm{op}}$ is small. However, this in and of itself does not guarantee that $\|\tilde{\mathbb{E}}[M] - \Sigma\|_{\mathrm{op}}$ is small (i.e., $\tilde{\mathbb{E}}[M]$ may not necessarily be a good estimator). Thus, we take the additional step of solving for an explicit matrix $A$ for which $\|\tilde{\mathbb{E}}[MAM] - I\|_{\mathrm{op}}$ is small. This step motivates the additional constraints in $\mathcal{A}_{\mathrm{deviation}}$: if $\tilde{\mathbb{E}}$ satisfies all of these constraints, then we can guarantee that $A$ is a good estimator (modulo the whitening step implicit in constraint (10)). We make this key fact precise in Lemma 4.9, whose proof we provide in Section D.2.1. We then use it to prove an upper bound on the sample complexity of our estimator in Theorem 4.10.
Algorithm 2 Covariance Estimator
1: function Covariance-estimator($T$, $k$, $\epsilon$, $\alpha$)
2:   Find a pseudo-expectation $\tilde{\mathbb{E}}$ of degree $8k$ which satisfies the system of Definition 4.8. If no such $\tilde{\mathbb{E}}$ exists, return $\mathbb{E}_{x\sim T}[xx^\top]$.
3:   Let $A^*$ be a matrix attaining $\min_{A:\,A\succeq 0}\big\|\tilde{\mathbb{E}}[MAM] - I_d\big\|_{\mathrm{op}}$.
4:   return $\hat\Sigma = \big(\mathbb{E}_{x\sim T}[xx^\top]\big)^{1/2}\, A^*\, \big(\mathbb{E}_{x\sim T}[xx^\top]\big)^{1/2}$.
5: end function

Lemma 4.9. Let $\tilde{\mathbb{E}}$ be a degree-4 pseudo-expectation satisfying $\mathcal{A}_{\mathrm{deviation}}(\alpha)$, and for $\beta>0$ let $H\in\mathbb{R}^{d\times d}$ be an arbitrary symmetric matrix satisfying $\|\tilde{\mathbb{E}}[MHM]\|_{\mathrm{op}}\le\beta$. Then, if $1-2\alpha-\alpha^2>0$, we have
$$\|H\|_{\mathrm{op}} \le \frac{\beta}{1-2\alpha-\alpha^2}.$$

We describe our estimator and upper bound its sample complexity in Theorem 4.10.

Theorem 4.10. Fix $\epsilon<1/100$. Let $P = N(0,\Sigma)$ for $\Sigma\succ 0$. Generate $T\subseteq\mathbb{R}^d$ by Definition 4.1. Let $k = 2^p$ for some $p\in\mathbb{N}$ and let $\alpha = 10\epsilon$ (Definition 4.8). Suppose $n$ satisfies the conditions of Lemma 4.3, i.e., $n\gtrsim\big(\Theta(kd\log(1/\delta))\big)^{\Theta(k)}/\epsilon^2$. Then, with probability at least $1-3\delta$, the following conditions both hold:
1. (Satisfiability) The constraints $\mathcal{A}_{\mathrm{covariance}}$ are satisfiable.
2. (Accuracy) Let $\hat\Sigma$ be the output of Algorithm 2. Then
$$\big\|\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2} - I_d\big\|_{\mathrm{op}} \lesssim \frac{\epsilon}{k}.$$

Let us compare this result with the information-theoretic rate established in Theorem 3.2. First, the range of permissible $\epsilon$ in Theorem 4.10 is restricted to be bounded away from 1; by contrast, Theorem 3.2 demonstrates an estimator which works for all $\epsilon<1$. Next, let us discuss the sample complexity to obtain error $\rho$ when $\epsilon$ is a constant. In Theorem 4.10, we can set $k\asymp 1/\rho$, and the sample complexity of the algorithm above is $d^{\Theta(1/\rho)}\cdot f(\rho)$, as opposed to the information-theoretic sample complexity of $d\cdot\tilde f(\rho)$ in Theorem 3.2; here $f$ and $\tilde f$ are dimension-independent functions. We next demonstrate that such a gap is inherent for all computationally-efficient SQ algorithms.

Definition 4.11 (Testing problem for covariance estimation). Let $P$ be the underlying (unknown) distribution, equal to $N(0, I_d + ww^\top)$ for an unknown vector $w\in\mathbb{R}^d$. Let the contamination rate be $\epsilon\in(0,1/2)$ and the signal strength be $\rho\in(0,1/2)$. Let $P'$ be a contaminated conditional distribution $P'\in\mathcal{R}_{\mathrm{R}}(P,\epsilon)$. Suppose the tester has access to the distribution $P'$, either through i.i.d. samples $(X_1,\ldots,X_n)\sim P'^{\otimes n}$ or through an SQ oracle $\mathrm{STAT}_{P'}$. The testing problem is the following:
$$H_0: \|ww^\top\|_{\mathrm{op}} = 0 \qquad\text{vs.}\qquad H_1: \|ww^\top\|_{\mathrm{op}} \ge \rho.$$

The following theorem provides an SQ lower bound for the above testing problem, again relying on the NGCA framework (Definition 2.8).

Theorem 4.12 (Computational lower bound for covariance estimation). There exist a large constant $C>0$ and a small constant $c>0$ such that the following holds. Consider Definition 4.11 with (i) $\epsilon/\rho\ge C$ and (ii) $d\ge((\epsilon/\rho)\log(2d))^C$. Then any SQ algorithm that solves Definition 4.11 with fewer than $2^{d^c}$ many queries must use tolerance $\kappa\le d^{-m}$ for
$$m \ge \frac{c\,\epsilon/\rho}{\log(\epsilon/\rho)}.$$

It is worthwhile to contrast this guarantee with those of Theorems 4.5 and 4.7: even for vanishing $\epsilon\to 0$, we see that both the upper bound in Theorem 4.10 and the SQ lower bound in Theorem 4.12 scale roughly as $d^{\Theta(\epsilon/\rho)}$.
5 Nearly optimal, computationally-efficient linear regression

In contrast with Gaussian mean and covariance estimation, we show that in Gaussian linear regression with missing observations, there is (nearly) no statistical-computational gap in many parameter regimes of interest. Before presenting the efficient estimator in Section 5.3, we first specify the model in Section 5.1 and then provide a glimpse of the technique in Section 5.2.

5.1 Model preliminaries

In this section, we consider linear regression in the setting with isotropic Gaussian covariates and Gaussian noise. That is, for $\theta_\star\in\mathbb{R}^d$, we take the base distribution $P_{\theta_\star}$ to be the law of $(X,Y)$ generated as
$$X\sim N(0, I_d), \qquad\text{and}\qquad Y\mid X\sim N\big(X^\top\theta_\star,\, \sigma^2\big). \qquad (11a)$$
We take the parameter space and loss to be
$$\mathcal{P}_{\mathrm{LR}}(\sigma^2,\epsilon) = \big\{\mathcal{R}(P_\theta,\epsilon)\;\big|\;\theta\in\mathbb{R}^d\big\} \qquad\text{and}\qquad L(\theta,\theta') = \|\theta-\theta'\|_2. \qquad (11b)$$
Let us emphasize that in this setting, when an observation is masked, the entire observation is masked, rather than just the response. We also emphasize that in the MNAR component, the missingness can depend arbitrarily on both the response $Y$ and the covariate $X$.

In order to further clarify the setting, it will be helpful to introduce auxiliary random variables to represent contaminated distributions. To this end, for a distribution $R\in\mathcal{R}(P_\theta,\epsilon)$, we know that there exist random variables
$$\Omega\in\{0,1\},\qquad B\sim\mathrm{Bern}(\epsilon),\qquad (X,Y)\sim P \qquad\text{such that}\qquad (X,Y)\,\star^{(1-B)+B\Omega}\sim R,$$
where $B$ is independent of the triple $(X,Y,\Omega)$. Moreover, the response $Y$ can be generated as $X^\top\theta_\star + \sigma G$ for $G\sim N(0,1)$ independent of $X$. Observe that $(X,Y)\neq\star$ if and only if $(1-B)+B\Omega = 1$. Equipped with these preliminaries, we next provide the key intuition behind our estimator.

5.2 Warm-up: The population setting

The natural estimator to consider is ordinary least squares (OLS), which minimizes the sum of squared residuals. Unfortunately, in the realizable contamination setting, the OLS estimator can suffer from bias. In order to mitigate this, we consider an analogue of the polynomial estimators developed in Section 4. In particular, rather than minimizing the sum of squared residuals, we minimize a loss which raises the residuals to a larger power $2k>2$. This magnifies the error resulting from deviation away from $\theta_\star$, regardless of the contamination. By taking $k\uparrow\infty$, the effects of the contamination vanish and the population estimator converges to $\theta_\star$. This intuition can be made precise via a short calculation in the population setting, which we provide presently. For $k\in\mathbb{N}$, we define the loss $F^{(k)}:\mathbb{R}^d\to\mathbb{R}$ and its minimizer $\theta^{(k)}_{\mathrm{pop}}$ as
$$F^{(k)}(\theta) := \mathbb{E}\Big[\big(Y - X^\top\theta\big)^{2k}\cdot\mathbf{1}\{(X,Y)\neq\star\}\Big] \qquad\text{and}\qquad \theta^{(k)}_{\mathrm{pop}} := \arg\min_{\theta\in\mathbb{R}^d} F^{(k)}(\theta).$$
Our strategy will be to control the error $\|\theta^{(k)}_{\mathrm{pop}} - \theta_\star\|_2$ by applying the inequality
$$\big\|\theta^{(k)}_{\mathrm{pop}} - \theta_\star\big\|_2 \le \frac{\|\nabla F^{(k)}(\theta_\star)\|_2}{\mu},$$
where $\mu$ denotes the modulus of strong convexity of the population loss. Let us begin by computing $\mu$. We compute the strong convexity parameter by lower bounding the Hessian at every $\theta$:
$$\begin{aligned}
\nabla^2 F^{(k)}(\theta) &= 2k(2k-1)\,\mathbb{E}\Big[\big(Y - X^\top\theta\big)^{2k-2}XX^\top\cdot\mathbf{1}\{(X,Y)\neq\star\}\Big]\\
&\succeq 2k(2k-1)\,\mathbb{E}\Big[(1-B)\cdot\big(X^\top(\theta_\star-\theta) + \sigma G\big)^{2k-2}XX^\top\Big]\\
&\overset{(a)}{=} (1-\epsilon)\,2k(2k-1)\sum_{\ell=0}^{2k-2}\binom{2k-2}{\ell}\sigma^{2k-2-\ell}\,\mathbb{E}\big[G^{2k-2-\ell}\big]\cdot\mathbb{E}\Big[\big(X^\top(\theta_\star-\theta)\big)^{\ell}XX^\top\Big]\\
&\overset{(b)}{\succeq} (1-\epsilon)\,2k(2k-1)\,\sigma^{2k-2}\,\mathbb{E}\big[G^{2k-2}\big]\cdot I_d \;=\; (1-\epsilon)\,2k(2k-1)\,\sigma^{2k-2}(2k-3)!!\cdot I_d.
\end{aligned}$$
Above, step (a) follows by mutual independence of $X$, $G$, and $B$; and step (b) follows from the fact that odd moments of the standard Gaussian distribution are zero, so that only even powers $\ell$ contribute, and each such term is positive semidefinite.
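Before turning to the gradient, a quick numerical sanity check of the strong convexity bound is instructive. The following sketch (Python with NumPy; this is our own illustrative check, evaluated at $\theta_\star$ and with a crude MCAR-only mask, not part of the paper's analysis) compares the smallest eigenvalue of the empirical Hessian against $(1-\epsilon)\,2k(2k-1)\,\sigma^{2k-2}(2k-3)!!$:

```python
import numpy as np

def empirical_hessian_min_eig(X, y, keep, theta, k):
    """Smallest eigenvalue of the Hessian of the empirical 2k-th power loss
    (1/n) * sum_i 1{observed_i} * (y_i - x_i^T theta)^{2k}."""
    r = (y - X @ theta) * keep                 # zero out masked residuals
    w = 2 * k * (2 * k - 1) * r ** (2 * k - 2) * keep
    H = (X * w[:, None]).T @ X / X.shape[0]    # weighted Gram matrix
    return np.linalg.eigvalsh(H)[0]

rng = np.random.default_rng(2)
n, d, k, sigma, eps = 50_000, 5, 2, 1.0, 0.1
theta_star = np.ones(d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ theta_star + sigma * rng.normal(size=n)
keep = (rng.random(n) > eps).astype(float)     # crude MCAR-only mask
dfact = lambda m: float(np.prod(np.arange(m, 0, -2))) if m > 0 else 1.0
bound = (1 - eps) * 2 * k * (2 * k - 1) * sigma ** (2 * k - 2) * dfact(2 * k - 3)
# The two numbers should be comparable at this sample size.
print(empirical_hessian_min_eig(X, y, keep, theta_star, k), bound)
```

Because the lower bound retains only the clean portion of the data, the empirical minimum eigenvalue concentrates near the bound under this mechanism and can only increase when non-missing MNAR observations are added.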
On the other hand, we upper bound the norm of the gradient at the true parameter $\theta_\star$ as
$$\begin{aligned}
\big\|\nabla F^{(k)}(\theta_\star)\big\|_2 &= 2k\cdot\Big\|\mathbb{E}\big[(Y - X^\top\theta_\star)^{2k-1}X\cdot\mathbf{1}\{(X,Y)\neq\star\}\big]\Big\|_2\\
&= 2k\sigma^{2k-1}\cdot\Big\|\mathbb{E}\big[(1-B)\,G^{2k-1}X\big] + \mathbb{E}\big[B\,\Omega\,G^{2k-1}X\big]\Big\|_2\\
&= 2k\sigma^{2k-1}\epsilon\cdot\Big\|\mathbb{E}\big[G^{2k-1}X\cdot\Omega\big]\Big\|_2\\
&= 2k\sigma^{2k-1}\epsilon\cdot\sup_{v\in\mathbb{S}^{d-1}}\mathbb{E}\big[G^{2k-1}(v^\top X)\,\Omega\big]\\
&\le 2k\sigma^{2k-1}\epsilon\cdot\sup_{v\in\mathbb{S}^{d-1}}\mathbb{E}\big[|G|^{2k-1}\,|v^\top X|\big]\\
&\le 2\sqrt{2/\pi}\cdot k\sigma^{2k-1}\epsilon\cdot(2k-2)!!,
\end{aligned}$$
where again we make use of the mutual independence of $X$, $G$, and $B$. Putting the pieces together, we deduce that
$$\big\|\theta^{(k)}_{\mathrm{pop}} - \theta_\star\big\|_2 \le \frac{\|\nabla F^{(k)}(\theta_\star)\|_2}{(1-\epsilon)\,2k(2k-1)\,\sigma^{2k-2}(2k-3)!!} \lesssim \sigma\cdot\frac{\epsilon}{1-\epsilon}\cdot\frac{(2k-2)!!}{(2k-1)!!} \lesssim \sigma\cdot\frac{\epsilon}{(1-\epsilon)\sqrt{k}},$$
where the final inequality follows from Stirling's inequality (see Fact G.2). Therefore, while the quadratic loss (corresponding to $k=1$) yields an error guarantee of $O(\sigma\frac{\epsilon}{1-\epsilon})$, taking $k$ to be large gives a population quantity with arbitrarily small error. In the next section, we derive finite-sample guarantees for the empirical variant of this estimator by carefully choosing the value of $k$.

5.3 Main result

In the population setting, we established that raising residuals to arbitrarily large polynomial powers has a monotone effect on the estimation error. In the finite-sample setting, this is no longer the case, as raising the residuals to higher powers renders the empirical risk heavier-tailed. Our main result demonstrates that selecting a value of $k$ which balances the population error against the heavy-tailed nature of the risk yields an estimator that is nearly optimal. Formally, given samples $\{(x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)\}\subseteq\mathbb{R}^{d+1}_\star$, we define the empirical loss
$$F^{(k)}_n(\theta) := \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{(x_i,y_i)\neq\star\}\cdot\big(y_i - x_i^\top\theta\big)^{2k} \qquad (12a)$$
as well as its minimizer
$$\hat\theta^{(k)}_n = \arg\min_{\theta} F^{(k)}_n(\theta). \qquad (12b)$$
The following theorem provides information-theoretic upper bounds for the linear regression problem, as well as the upper bound obtained by a computationally-efficient estimator that approximates $\hat\theta^{(k)}_n$. We provide the proof of the upper bound in Section B.1 and the lower bound, which builds on the construction in [DGKPX26], in Section C.2.3.

Theorem 5.1. There exist constants $c, C>0$ such that the following is true. Let $\sigma^2>0$, $\epsilon<1$, $n\in\mathbb{N}$, $\delta\in\big[\{(1-\epsilon)n\}^{-c},\, 0.5\big]$, and let
$$\alpha = \sqrt{\frac{d+\log(\frac1\delta)}{(1-\epsilon)\,n}}.$$
If $n\ge C\cdot\frac{d+\log(\frac1\delta)}{1-\epsilon}$, then the minimax $1-\delta$ quantile (6) satisfies
$$\sigma\cdot\frac{\log\big(\frac{1}{1-\epsilon}\big)}{\sqrt{\log\Big(1+\frac{\epsilon}{(1-\epsilon)\alpha^2}\Big)}} \;\lesssim\; \mathfrak{M}\big(\delta,\mathcal{P}_{\mathrm{LR}}(\sigma^2,\epsilon), L\big) \;\lesssim\; \sigma\cdot\frac{\epsilon}{1-\epsilon}\cdot\frac{1\vee\log\log\big(\frac{\epsilon}{\alpha(1-\epsilon)}\big)}{\sqrt{\log\Big(1+\frac{\epsilon}{(1-\epsilon)\alpha^2}\Big)}}.$$
Moreover, the upper bound is achieved by the estimator $\hat\theta^{(k)}_n$ for some $k\lesssim\log\big(\frac{\epsilon}{\alpha(1-\epsilon)}\big)$, which can be approximated to accuracy of the same order in $\mathrm{poly}(k,d,n)$ time.

We next provide several remarks about this theorem. First, let us emphasize that the estimator $\hat\theta^{(k)}_n$ can be approximated to sufficient accuracy in polynomial time. In particular, in Section B.2 we show that (for instance) the ellipsoid method can compute an approximate minimizer in polynomial time. We contrast this with the estimator provided in [MVBWS24, Section 5], which is based on projections in the Kolmogorov distance and requires exponential time to compute.
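Since each summand of (12a) is a convex function of $\theta$, the loss $F^{(k)}_n$ can be handed to any off-the-shelf convex solver. The following sketch (assuming NumPy and SciPy; the formal guarantee above is via the ellipsoid method, whereas we use L-BFGS purely for illustration, and the MNAR rule in the usage example is an arbitrary choice of our own) computes an approximate minimizer:

```python
import numpy as np
from scipy.optimize import minimize

def fit_power_loss(X, y, observed, k):
    """Approximately minimize F_n^{(k)}(theta) =
    (1/n) * sum_i 1{observed_i} * (y_i - x_i^T theta)^{2k}."""
    Xo, yo, n = X[observed], y[observed], len(y)

    def loss_and_grad(theta):
        r = yo - Xo @ theta
        loss = np.sum(r ** (2 * k)) / n
        grad = -(2 * k / n) * (Xo.T @ (r ** (2 * k - 1)))
        return loss, grad

    theta0 = np.linalg.lstsq(Xo, yo, rcond=None)[0]   # warm start at OLS
    res = minimize(loss_and_grad, theta0, jac=True, method="L-BFGS-B")
    return res.x

# Illustrative usage with an MNAR rule of our own making:
rng = np.random.default_rng(3)
n, d, k, sigma, eps = 20_000, 5, 3, 1.0, 0.2
theta_star = np.zeros(d); theta_star[0] = 1.0
X = rng.normal(size=(n, d))
y = X @ theta_star + sigma * rng.normal(size=n)
mnar = rng.random(n) < eps                 # contaminated fraction
observed = ~mnar | (y > 1.0)               # MNAR keeps only large responses
theta_hat = fit_power_loss(X, y, observed, k)
ols = np.linalg.lstsq(X[observed], y[observed], rcond=None)[0]
print(np.linalg.norm(theta_hat - theta_star), np.linalg.norm(ols - theta_star))
```

In this toy run, the higher-power loss visibly shrinks the selection bias that OLS incurs on the observed data, in line with the population calculation of Section 5.2.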
Second, given the results of the previous two sections and the fact that the estimator of [MVBWS24] is based on Kolmogorov projections, it is tempting to guess that there may be a statistical–computational gap in the regression setting as well. Indeed, in our well-specified setting, the linear regression problem can be reduced to that of covariance estimation: the covariate–response pair $(X,Y)$ is jointly Gaussian,
$$(X,Y)\sim N\left(0,\; \begin{pmatrix} I_d & \theta_\star\\ \theta_\star^\top & \|\theta_\star\|_2^2 + \sigma^2\end{pmatrix}\right).$$
We can thus apply the efficient estimator developed in Section 4 to obtain an estimate of the coefficients $\theta_\star$. In order to obtain an estimate of $\theta_\star$ with error at most $\rho$, this would require at least $n\gtrsim d^{\Omega(1/\rho^2)}$ many samples. A similar dimension dependence was obtained by [LMZ24, Corollary 3.20] in the related setting of efficient linear regression with truncated statistics, where the sample complexity is also $n\gtrsim d^{\mathrm{poly}(1/\rho)}$. On the other hand, our empirical risk minimization procedure is computable in polynomial time and requires a significantly smaller sample complexity, which we discuss next.

Next, in order to simplify the upper and lower bounds, let us consider the parameters $\epsilon,\delta,\sigma$ to be constants. The upper and lower bounds then simplify to
$$\frac{1}{\sqrt{\log(1/\alpha^2)}} \;\lesssim\; \mathfrak{M}_n\big(\delta,\mathcal{P}_{\mathrm{LR}}(\sigma^2,\epsilon), L\big) \;\lesssim\; \frac{\log\log(1/\alpha)}{\sqrt{\log(1/\alpha^2)}}.$$
This reveals that in this setting, our method is optimal up to a multiplicative factor which is a doubly-iterated logarithm in $\alpha$. Moreover, for constant $\epsilon$ and $\delta$, we find that $\alpha\asymp\sqrt{d/n}$, so that the upper and lower bounds further simplify to
$$\frac{1}{\sqrt{\log(n/d)}} \;\lesssim\; \mathfrak{M}_n\big(\delta,\mathcal{P}_{\mathrm{LR}}(\sigma^2,\epsilon), L\big) \;\lesssim\; \frac{\log\log(n/d)}{\sqrt{\log(n/d)}}.$$
In particular, this implies that our estimator requires a sample complexity of $n\gtrsim d\,e^{\Omega(\log^2(1/\rho)/\rho^2)}$ to obtain error $\rho$, which is linear in the dimension. We note that this sample complexity is polynomially smaller in dimension than that required by the inefficient algorithm of [MVBWS24], which requires $n\gtrsim d^{36/31}$.

Next, we note that, while we proved the upper bound for covariates distributed as $X\sim N(0,I_d)$, only mild assumptions are needed for this estimator to work. Indeed, even if $X$ is heavy-tailed (with at least a constant number of bounded moments), our algorithm, with a modified analysis, continues to succeed with similar error rates in the constant failure probability ($\delta$) regime.

Let us mention two limitations of our result. First, we note that, in contrast with Theorems 3.1 and 3.2, we require the lower bound $\delta\ge\{(1-\epsilon)n\}^{-c}$ for a small enough constant $c>0$. This is not a huge limitation, as one could take the guarantee of Theorem 5.1 for constant failure probability $1/4$ with sample complexity $n_0$ and use a simple boosting procedure to turn it into a (computationally-efficient) estimator that works for arbitrary $\delta\in(0,1/4)$ with a sample complexity of $\log(1/\delta)\cdot n_0$. The boosting procedure is as follows: run the geometric median-of-means [Min15] procedure on $k\asymp\log(1/\delta)$ independent estimates (obtained by running Theorem 5.1 on disjoint subsets), each of which is correct with probability $3/4$ (guaranteed by Theorem 5.1 if each subset is of size $n_0$). Still, the resulting sample complexity has a multiplicative dependence on $\log(1/\delta)$, as opposed to an additive dependence in the lower bound. We leave further investigation of the extremely high-confidence setting as an interesting open direction.
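A minimal sketch of this boosting step (assuming NumPy; Weiszfeld's iteration is a standard way to compute the geometric median, though [Min15] does not prescribe a particular solver, and the `base_estimator` signature below is our own convention, e.g. a wrapper around `fit_power_loss` from the earlier sketch):

```python
import numpy as np

def geometric_median(estimates, n_iter=100, tol=1e-9):
    """Weiszfeld's algorithm: minimize the sum of Euclidean distances
    to the rows of `estimates`."""
    z = estimates.mean(axis=0)
    for _ in range(n_iter):
        dist = np.maximum(np.linalg.norm(estimates - z, axis=1), tol)
        w = 1.0 / dist
        z_new = (w[:, None] * estimates).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z

def boost(base_estimator, X, y, observed, delta, rng):
    """Run the constant-confidence estimator on ~log(1/delta) disjoint
    chunks of the data and aggregate the results by geometric median."""
    k = max(1, int(np.ceil(np.log(1.0 / delta))))
    chunks = np.array_split(rng.permutation(len(y)), k)
    ests = np.stack([base_estimator(X[c], y[c], observed[c]) for c in chunks])
    return geometric_median(ests)
```

The geometric median converts many constant-confidence estimates into a single high-confidence one, which is exactly the source of the multiplicative $\log(1/\delta)$ factor noted above.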
Second, our algorithm yields an error which scales linearly in $\tau = \epsilon/(1-\epsilon)$. By contrast, the lower bound (as well as the rates for mean and covariance estimation) scales logarithmically in $\tau$. While these match for $\epsilon<c$ for a small enough universal constant $c>0$, this exponentially large gap becomes more striking as $\epsilon\to 1$.

Finally, we remark that the recent work [KMKC26] also considers linear regression with truncated statistics. The authors focus on the regime where the truncation depends only on the response variable and demonstrate a polynomial-time estimator that achieves error $\rho$ with a sample complexity of $\mathrm{poly}(d, 1/\rho, k)$, where $k$ is the number of intervals that form the truncation set. This sample complexity admits a dimension dependence which is polynomially larger than our estimator's, but a dependence on $1/\rho$ which is exponentially smaller than ours. We re-emphasize that, in general, these models are incomparable (see Section 1.2) and that our information-theoretic lower bound precludes such improvements to our dependence on $1/\rho$.

6 Discussion

In this paper, we studied several estimation problems with missing data under the realizable contamination model. We focused on the so-called all-or-nothing setting, in which each observation is either completely observed or not at all. Under this setting we showed:
• In mean estimation, to achieve error in $\ell_2$ norm at most $\rho$, there appears to be a sample complexity gap: ignoring all other problem parameters, $n\gtrsim d\,e^{1/\rho^2}$ samples are necessary and sufficient information-theoretically, but computationally-efficient statistical query algorithms (along with other restricted families of algorithms mentioned in Section 2.2) require $n\gtrsim d^{\Omega(1/\rho^2)}$ many observations. This latter lower bound is achieved by a sum-of-squares estimator.
• In covariance estimation, to achieve error $\rho$ in relative operator norm, analogous claims hold: $n\gtrsim d\,e^{1/\rho}$ samples are necessary and sufficient information-theoretically, but statistical query algorithms require $n\gtrsim d^{1/\rho}$ samples.
• In linear regression, we show that for most parameters, these statistical–computational gaps do not exist. In particular, we show that the sample complexity $n\gtrsim d\,e^{1/\rho^2}$ is necessary information-theoretically and can be matched by a computationally-efficient algorithm.

There remain several open questions, a few of which we discuss here. First, our computationally-efficient algorithm for linear regression does not match the information-theoretic rate for extremely low failure probabilities $\delta$ or when $\epsilon\to 1$. Going further, it would be interesting to understand how the rates of convergence change, both computationally and statistically, for regression models more complicated than linear regression. Next, we note that the all-or-nothing setting of missingness is less common than the multiple pattern setting (see Figure 1). In Appendix F, we show how to extend our algorithms and lower bounds to the setting with multiple patterns, and we show that for both mean and covariance estimation, our all-or-nothing algorithms can be adapted to achieve error which is a multiplicative factor of $C_{|S|}$ larger than the optimal error, where $C_{|S|}$ is a positive constant depending only on the number of patterns.
While this error inflation may be reasonable for a small number of patterns, it may be suboptimal for a large number of missingness patterns. An important direction for future work is to determine the fundamental limits of estimation, even for simple tasks like mean and covariance estimation, in this multiple pattern setting.

References

[AL13] P. M. Aronow and D. K. K. Lee. "Interval estimation of population means under unknown but bounded probabilities of sample selection". In: Biometrika 100.1 (2013).
[BBHKSZ12] B. Barak, F. G. S. L. Brandao, A. W. Harrow, J. Kelner, D. Steurer, and Y. Zhou. "Hypercontractivity, sum-of-squares proofs, and their applications". In: Proc. 44th Annual ACM Symposium on Theory of Computing (STOC). 2012.
[BBHLS21] M. Brennan, G. Bresler, S. B. Hopkins, J. Li, and T. Schramm. "Statistical query algorithms and low-degree tests are almost equivalent". In: Proc. 34th Annual Conference on Learning Theory (COLT). 2021.
[BEHW89] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. "Learnability and the Vapnik–Chervonenkis dimension". In: Journal of the ACM 4 (1989).
[BJK15] K. Bhatia, P. Jain, and P. Kar. "Robust Regression via Hard Thresholding". In: Advances in Neural Information Processing Systems 28 (NeurIPS). 2015.
[BJKK17] K. Bhatia, P. Jain, P. Kamalaruban, and P. Kar. "Consistent Robust Regression". In: Advances in Neural Information Processing Systems 30 (NeurIPS). 2017.
[BK22] M. Bonvini and E. H. Kennedy. "Sensitivity analysis via the proportion of unmeasured confounding". In: Journal of the American Statistical Association 117.539 (2022).
[BLM13] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[BS16] B. Barak and D. Steurer. "Proofs, beliefs, and algorithms through the lens of sum-of-squares". 2016.
[Bub15] S. Bubeck. "Convex Optimization: Algorithms and Complexity". In: Foundations and Trends in Machine Learning 8.3-4 (2015), pp. 231–357.
[DDKLP25] I. Diakonikolas, J. Diakonikolas, D. M. Kane, J. C. H. Lee, and T. Pittas. "Linear Regression under Missing or Corrupted Coordinates". In: arXiv preprint arXiv:2509.19242 (2025).
[DGKLP25] I. Diakonikolas, C. Gao, D. M. Kane, J. Lafferty, and A. Pensia. "Information-Computation Tradeoffs for Noiseless Linear Regression with Oblivious Contamination". In: Advances in Neural Information Processing Systems 38 (NeurIPS). 2025.
[DGKPX26] I. Diakonikolas, C. Gao, D. M. Kane, A. Pensia, and D. Xie. "Robust Regression with Adaptive Contamination in Response: Optimal Rates and Computational Barrier". In preparation (2026).
[DGTZ18] C. Daskalakis, T. Gouleakis, C. Tzamos, and M. Zampetakis. "Efficient Statistics, in High Dimensions, from Truncated Samples". In: Proc. 59th IEEE Symposium on Foundations of Computer Science (FOCS). 2018, pp. 639–649.
[DIKL26] I. Diakonikolas, G. Iakovidis, D. M. Kane, and S. Liu. "Sample Complexity Bounds for Robust Mean Estimation with Mean-Shift Contamination". In: arXiv preprint arXiv:2602.22130 (2026).
[DIKP25] I. Diakonikolas, G. Iakovidis, D. M. Kane, and T. Pittas. "Efficient Multivariate Robust Mean Estimation Under Mean-Shift Contamination". In: Proc. 42nd International Conference on Machine Learning (ICML). 2025.
[DK23] I. Diakonikolas and D. M. Kane. Algorithmic High-Dimensional Robust Statistics. Cambridge University Press, 2023.
[DKKLMS16] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. "Robust Estimators in High Dimensions without the Computational Intractability". In: Proc. 57th IEEE Symposium on Foundations of Computer Science (FOCS). 2016.
[DKKPP22] I. Diakonikolas, D. M. Kane, S. Karmalkar, A. Pensia, and T. Pittas. "Robust Sparse Mean Estimation via Sum of Squares". In: Proc. 35th Annual Conference on Learning Theory (COLT). 2022.
[DKLP25] I. Diakonikolas, D. M. Kane, S. Liu, and T. Pittas. "PTF Testing Lower Bounds for Non-Gaussian Component Analysis". In: Proc. 66th IEEE Symposium on Foundations of Computer Science (FOCS). 2025.
[DKPP24] I. Diakonikolas, S. Karmalkar, S. Pang, and A. Potechin. "Sum-of-Squares Lower Bounds for Non-Gaussian Component Analysis". In: Proc. 65th IEEE Symposium on Foundations of Computer Science (FOCS). 2024.
[DKPZ24] I. Diakonikolas, D. M. Kane, T. Pittas, and N. Zarifis. "Statistical query lower bounds for learning truncated Gaussians". In: Proc. 37th Annual Conference on Learning Theory (COLT). 2024.
[DKS17] I. Diakonikolas, D. M. Kane, and A. Stewart. "Statistical Query Lower Bounds for Robust Estimation of High-Dimensional Gaussians and Gaussian Mixtures". In: Proc. 58th IEEE Symposium on Foundations of Computer Science (FOCS). 2017.
[dNS21] T. d'Orsi, G. Novikov, and D. Steurer. "Consistent regression when oblivious outliers overwhelm". In: Proc. 38th International Conference on Machine Learning (ICML). 2021.
[Duc25] J. Duchi. Statistics and Information Theory. 2025.
[FGRVX17] V. Feldman, E. Grigorescu, L. Reyzin, S. S. Vempala, and Y. Xiao. "Statistical Algorithms and a Lower Bound for Detecting Planted Cliques". In: Journal of the ACM 64.2 (2017).
[FKP19] N. Fleming, P. Kothari, and T. Pitassi. "Semialgebraic Proofs and Efficient Algorithm Design". In: Foundations and Trends in Theoretical Computer Science (2019).
[Gao20] C. Gao. "Robust regression via multivariate regression depth". In: Bernoulli 26.2 (2020).
[GVR97] R. D. Gill, M. J. Van Der Laan, and J. M. Robins. "Coarsening at random: Characterizations, conjectures, counter-examples". In: Proc. 1st Seattle Symposium in Biostatistics: Survival Analysis. 1997.
[HL18] S. B. Hopkins and J. Li. "Mixture Models, Robustness, and Sum of Squares Proofs". In: Proc. 50th Annual ACM Symposium on Theory of Computing (STOC). 2018.
[HM95] J. L. Horowitz and C. F. Manski. "Identification and robustness with contaminated and corrupted data". In: Econometrica: Journal of the Econometric Society (1995).
[HR09] P. J. Huber and E. M. Ronchetti. Robust Statistics. John Wiley & Sons, 2009.
[HR21] L. Hu and O. Reingold. "Robust mean estimation on highly incomplete data with arbitrary outliers". In: Proc. 24th International Conference on Artificial Intelligence and Statistics (AISTATS). 2021.
[JTK14] P. Jain, A. Tewari, and P. Kar. "On Iterative Hard Thresholding Methods for High-Dimensional M-Estimation". In: Advances in Neural Information Processing Systems 27 (NeurIPS). 2014.
[KC18] A. K. Kuchibhotla and A. Chakrabortty. "Moving Beyond Sub-Gaussianity in High-Dimensional Statistics: Applications in Covariance Estimation and Linear Regression". In: arXiv preprint arXiv:1804.02605 (2018).
[Kea98] M. J. Kearns. "Efficient noise-tolerant Learning from Statistical Queries". In: Journal of the ACM 45.6 (1998), pp. 983–1006.
[KG25] S. Kotekal and C. Gao. "Optimal Estimation of the Null Distribution in Large-Scale Inference". In: IEEE Transactions on Information Theory (2025).
[KKLZ26] A. Kalavasis, P. K. Kothari, S. Li, and M. Zampetakis. "Learning Mixture Models via Efficient High-dimensional Sparse Fourier Transforms". In: Proc. 58th Annual ACM Symposium on Theory of Computing (STOC). 2026.
[KMKC26] A. Kouridakis, A. Mehrotra, A. Kalavasis, and C. Caramanis. "Linear Regression with Unknown Truncation Beyond Gaussian Features". In: arXiv preprint arXiv:2602.12534 (2026).
[KS17] P. K. Kothari and D. Steurer. "Outlier-robust moment-estimation via sum-of-squares". In: arXiv preprint arXiv:1711.11581 (2017).
[KSS18] P. K. Kothari, J. Steinhardt, and D. Steurer. "Robust Moment Estimation and Improved Clustering via Sum of Squares". In: Proc. 50th Annual ACM Symposium on Theory of Computing (STOC). 2018.
[KTZ19] V. Kontonis, C. Tzamos, and M. Zampetakis. "Efficient truncated statistics with unknown truncation". In: Proc. 60th IEEE Symposium on Foundations of Computer Science (FOCS). 2019.
[Li23] S. Li. "Robust Mean Estimation Against Oblivious Adversaries". MA thesis. Carnegie Mellon University, 2023.
[LMZ24] J. H. Lee, A. Mehrotra, and M. Zampetakis. "Efficient Statistics With Unknown Truncation, Polynomial Time Algorithms, Beyond Gaussians". In: Proc. 65th IEEE Symposium on Foundations of Computer Science (FOCS). 2024.
[LR14] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, 2014.
[LW12] P. Loh and M. J. Wainwright. "High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity". In: The Annals of Statistics 40.3 (2012).
[Man03] C. F. Manski. Partial identification of probability distributions. Springer, 2003.
[Mas90] P. Massart. "The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality". In: The Annals of Probability (1990).
[Min15] S. Minsker. "Geometric Median and Robust Estimation in Banach Spaces". In: Bernoulli 21.4 (2015), pp. 2308–2335.
[MPT13] K. Mohan, J. Pearl, and J. Tian. "Graphical models for inference with missing data". In: Advances in Neural Information Processing Systems 26 (NeurIPS). 2013.
[MST22] D. Malinsky, I. Shpitser, and E. J. Tchetgen Tchetgen. "Semiparametric inference for nonmonotone missing-not-at-random data: the no self-censoring model". In: Journal of the American Statistical Association 117.539 (2022).
[MVBWS24] T. Ma, K. A. Verchand, T. B. Berrett, T. Wang, and R. J. Samworth. "Estimation beyond Missing (Completely) at Random". In: arXiv preprint (2024).
[MVGS26] T. Ma, K. A. Verchand, C. Gao, and R. J. Samworth. "Adaptive confidence intervals with missing data". In preparation (2026+).
[MVS24] T. Ma, K. A. Verchand, and R. J. Samworth. "High-probability minimax lower bounds". In: arXiv preprint arXiv:2406.13447 (2024).
[MZ25] A. Minasyan and N. Zhivotovskiy. "Statistically optimal robust mean and covariance estimation for anisotropic Gaussians". In: Mathematical Statistics & Learning 8 (2025).
[Ros87] P. R. Rosenbaum. "Sensitivity analysis for certain permutation inferences in matched observational studies". In: Biometrika 74.1 (1987).
[RRS98] A. Rotnitzky, J. M. Robins, and D. O. Scharfstein. "Semiparametric regression for repeated outcomes with nonignorable nonresponse". In: Journal of the American Statistical Association 93.444 (1998).
[SBRJ19] A. S. Suggala, K. Bhatia, P. Ravikumar, and P. Jain. "Adaptive Hard Thresholding for Near-optimal Consistent Robust Regression". In: Proc. 32nd Annual Conference on Learning Theory (COLT). 2019.
[SGJC13] S. Seaman, J. Galati, D. Jackson, and J. Carlin. "What Is Meant by "Missing at Random"?" In: Statistical Science 28.2 (2013).
[SLW22] R. Sahoo, L. Lei, and S. Wager. "Learning from a biased sample". In: arXiv preprint arXiv:2209.01754 (2022).
[TJSO14] E. Tsakonas, J. Jaldén, N. D. Sidiropoulos, and B. Ottersten. "Convergence of the Huber Regression M-Estimate in the Presence of Dense Outliers". In: IEEE Signal Processing Letters 21.10 (2014).
[Tsy09] A. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
[Tuk75] J. W. Tukey. "Mathematics and the picturing of data". In: Proc. International Congress of Mathematicians. Vol. 2. 1975.
[Ver25] R. Vershynin. High-dimensional Probability: An Introduction with Applications in Data Science. 2nd ed. Cambridge University Press, 2025.
[Wai19] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
[ZSB19] Q. Zhao, D. S. Small, and B. B. Bhattacharya. "Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap". In: Journal of the Royal Statistical Society, Series B: Statistical Methodology 81.4 (2019).

A Proofs of information-theoretic upper bounds for mean and covariance estimation

A.1 Mean estimation: Proof of Theorem 3.1 upper bound

We prove the claim for general $q\in(0,1]$. To establish the upper bound, we rely on the following minimax-optimal univariate estimator, due to [MVGS26, Theorem 1]. For completeness, we provide a self-contained proof (using a different estimator) in Section A.3.1.

Corollary A.1 ([MVGS26], Theorem 1). Let $\epsilon\in[0,1]$, $q\in(0,1]$, $n\in\mathbb{N}$, $\delta\in(0,1)$, and define
$$\alpha = \sqrt{\frac{\log(\frac1\delta)}{nq(1-\epsilon)}}.$$
Further, let $\theta_\star\in\mathbb{R}$ and suppose that the observations $Z_1,\ldots,Z_n\overset{\mathrm{iid}}{\sim} R\in\mathcal{R}(P_{\theta_\star},\epsilon,q)$.
There exist universal, positive constants $C_1, C_2$ such that, if $n\ge C_1\frac{\log(\frac1\delta)}{q(1-\epsilon)}$, then there exists an estimator $\hat\theta:\mathbb{R}^n_\star\to\mathbb{R}$ which satisfies, with probability at least $1-\delta$,
$$\big|\hat\theta - \theta_\star\big| \le C_2\,\frac{\log(1+\tau)}{\sqrt{\log(1+\tau^2/\alpha^2)}}.$$

Equipped with this univariate estimator, we turn to the proof of our upper bound, which follows a standard approximation argument on an $\epsilon$-net. To this end, let $\mathcal{V}$ denote a $\frac12$-packing of the unit sphere $\mathbb{S}^{d-1}$, which by, e.g., [Ver25, Corollary 4.2.11] exists and satisfies the cardinality bound $|\mathcal{V}|\le 5^d$. Now, for each unit vector $v\in\mathcal{V}$, let $\{Z^v_i\}_{i\in[n]}$ denote the projections of the observations onto $v$, noting that if $Z_i = \star_d$, then $Z^v_i = \star$. Importantly, if $\mathrm{Law}(Z_i)\in\mathcal{R}(P_{\theta_\star},\epsilon,q)$, it follows that $\mathrm{Law}(Z^v_i)\in\mathcal{R}(P_{\theta_\star^\top v},\epsilon,q)$. Next, for each $v\in\mathcal{V}$, let $\hat\theta_v$ denote the univariate estimator implicit in Corollary A.1. Armed with each of these univariate estimators, we define our multivariate mean estimator $\hat\theta\in\mathbb{R}^d$ to be any element
$$\hat\theta\in\arg\min_{\theta\in\mathbb{R}^d}\;\max_{v\in\mathcal{V}}\;\big|\theta^\top v - \hat\theta_v\big|. \qquad (13)$$
For parameters $n, q, \epsilon, \delta$, let $r(n,q,\epsilon,\delta)$ denote the corresponding error rate in the corollary. Next, take $\delta' = \delta/5^d$ and let $\Omega_v$ denote the event that
$$\big|\hat\theta_v - \theta_\star^\top v\big| \le r(n,q,\epsilon,\delta'). \qquad (14)$$
Note that by Corollary A.1, $\mathbb{P}(\Omega^c_v)\le\delta/5^d$, whence with $\Omega := \bigcap_{v\in\mathcal{V}}\Omega_v$, an application of the union bound implies that $\mathbb{P}(\Omega)\ge 1-\delta$. Now, working on the event $\Omega$, we have
$$\big\|\hat\theta - \theta_\star\big\|_2 \overset{(a)}{\le} 2\cdot\max_{v\in\mathcal{V}}\big|(\hat\theta-\theta_\star)^\top v\big| \le 2\cdot\max_{v\in\mathcal{V}}\big|\hat\theta^\top v - \hat\theta_v\big| + 2\cdot\max_{v\in\mathcal{V}}\big|\hat\theta_v - \theta_\star^\top v\big| \overset{(b)}{\le} 4\cdot\max_{v\in\mathcal{V}}\big|\hat\theta_v - \theta_\star^\top v\big| \le 4\,r(n,q,\epsilon,\delta'),$$
where step (a) follows from an argument identical to the approximation in [Ver25, Lemma 4.4.1], step (b) follows from the definition of $\hat\theta$ in Eq. (13), and the final inequality follows from Eq. (14). Finally, we see that
$$r(n,q,\epsilon,\delta') \le C_2\,\frac{\log(1+\tau)}{\sqrt{\log\Big(1+\tau^2\,\frac{nq(1-\epsilon)}{d+\log(\frac1\delta)}\Big)}},$$
which completes the proof.
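We remark in passing that the aggregation step (13) is itself a small linear program: introduce a slack variable $t$ and minimize it subject to $|\theta^\top v - \hat\theta_v|\le t$ for every direction in the net. A minimal sketch (assuming NumPy and SciPy; in practice one would use a subsampled set of directions, since $|\mathcal{V}|\le 5^d$ is exponential in $d$, which is precisely why this estimator is computationally inefficient):

```python
import numpy as np
from scipy.optimize import linprog

def aggregate_directions(V, b):
    """Solve min_theta max_v |theta^T v - b_v| as an LP.

    Variables are (theta, t); the constraints are +-(V theta - b) <= t.
    V: (m, d) array of unit directions; b: (m,) directional estimates."""
    m, d = V.shape
    c = np.zeros(d + 1); c[-1] = 1.0                  # minimize the slack t
    ones = np.ones((m, 1))
    A_ub = np.vstack([np.hstack([V, -ones]),          #  V theta - t <=  b
                      np.hstack([-V, -ones])])        # -V theta - t <= -b
    b_ub = np.concatenate([b, -b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))
    return res.x[:d]

# Illustrative usage with random directions standing in for the packing:
rng = np.random.default_rng(4)
d, m = 5, 400
V = rng.normal(size=(m, d)); V /= np.linalg.norm(V, axis=1, keepdims=True)
theta_star = rng.normal(size=d)
b = V @ theta_star + 0.05 * rng.normal(size=m)        # noisy 1-d estimates
print(np.linalg.norm(aggregate_directions(V, b) - theta_star))
```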
A.2 Covariance estimation: Proof of Theorem 3.2 upper bound

We prove the claim for general $q\in(0,1]$. We devise a two-step estimator: with one half of the data, we compute the unnormalized empirical covariance $\tilde\Sigma$; we scale the remaining data by $M$; and we finally combine one-dimensional variance estimates of the scaled data over many directions, as in the mean estimation algorithm. We define
$$\tilde\Sigma = \frac{2}{n}\sum_{i=1}^{n/2} Z_iZ_i^\top\,\mathbf{1}\{Z_i\neq\star\}.$$
Observe that $\Sigma_\star^{-1/2}Z\,\mathbf{1}\{Z\neq\star\}$ is an $O(1)$-subgaussian random vector, so by the concentration of i.i.d. subgaussian random vectors (see, e.g., [Ver25, Theorem 4.7.1]) and our assumption that $n\gtrsim d+\log(1/\delta)$, we have with probability at least $1-\delta/2$ that
$$\Big\|\Sigma_\star^{-1/2}\tilde\Sigma\Sigma_\star^{-1/2} - \Sigma_\star^{-1/2}\,\mathbb{E}\big[ZZ^\top\mathbf{1}\{Z\neq\star\}\big]\,\Sigma_\star^{-1/2}\Big\|_{\mathrm{op}} \le \frac12\Big\|\Sigma_\star^{-1/2}\,\mathbb{E}\big[ZZ^\top\mathbf{1}\{Z\neq\star\}\big]\,\Sigma_\star^{-1/2}\Big\|_{\mathrm{op}} \le \frac12.$$
Furthermore, our contamination model implies that
$$q(1-\epsilon)\,\Sigma_\star \preceq \mathbb{E}\big[ZZ^\top\mathbf{1}\{Z\neq\star\}\big] \preceq \big(q(1-\epsilon)+\epsilon\big)\,\Sigma_\star,$$
and so, combining the above two displays, we have with probability at least $1-\delta/2$ that
$$\frac{q(1-\epsilon)}{2}\,\Sigma_\star \preceq \tilde\Sigma \preceq 2\big(q(1-\epsilon)+\epsilon\big)\,\Sigma_\star.$$
We condition on this event for the remainder of the proof. Letting $M = \tilde\Sigma^{-1/2}$, we have by rearranging the above display that
$$\frac{1}{2(q(1-\epsilon)+\epsilon)}\, I_d \preceq M\Sigma_\star M \preceq \frac{2}{q(1-\epsilon)}\, I_d. \qquad (15)$$
Now define the scaled data $\tilde Z_i = MZ_{i+n/2}$ for $i\in\{1,\ldots,n/2\}$, over which we will apply a univariate variance estimator. For this, we require the following lemma, whose proof we defer to Section A.3.2.

Lemma A.2. Let $\epsilon\in[0,1]$, $q\in(0,1]$, $n\in\mathbb{N}$, and $\delta\in(0,1)$, and define
$$\alpha = \sqrt{\frac{\log(\frac1\delta)}{q(1-\epsilon)n}}.$$
Further, let $\sigma_\star\in\mathbb{R}_{++}$, let $P_\sigma = \sigma^2\chi^2$ denote the scaled chi-squared distribution, and suppose that the observations $Z_1,\ldots,Z_n\overset{\mathrm{iid}}{\sim}R\in\mathcal{R}(P_{\sigma_\star},\epsilon,q)$. There exists a pair of universal, positive constants $C_1, C_2$ such that, if $n\ge C_1\frac{\log(\frac1\delta)}{q(1-\epsilon)}$, then there exists an estimator $\hat\sigma:\mathbb{R}^n_\star\to\mathbb{R}$ which satisfies, with probability at least $1-\delta$,
$$\big|\hat\sigma^2 - \sigma_\star^2\big| \le C_2\,\frac{\log(1+\tau)}{\log(1+\tau/\alpha)}\cdot\sigma_\star^2.$$

Continuing, let $\mathcal{N}$ denote a $\beta$-net of the unit sphere $\mathbb{S}^{d-1}$ for $\beta = \frac{1}{16\sqrt{1+\tau}}$, which has size $e^{O(d\log(1/\beta))}$. For each $v\in\mathcal{N}$, let $\hat\sigma^2_v$ be the estimator from Lemma A.2 applied to the observations $\{\tilde Z_i\cdot v\}_{i\in[n/2]}$ with failure probability $\delta/(2|\mathcal{N}|)$. Applying Lemma A.2 for each $v\in\mathcal{N}$ and a union bound, we have with probability at least $1-\delta/2$, for all $v\in\mathcal{N}$,
$$\big|\hat\sigma^2_v - v^\top M\Sigma_\star Mv\big| \le C_2\,\frac{\log(1+\tau)}{\log\Big(1+\tau\sqrt{\frac{nq(1-\epsilon)}{d\log(1/\beta)+\log(8/\delta)}}\Big)}\cdot v^\top M\Sigma_\star Mv.$$
Next, we define $\hat\Sigma$ as any $\Sigma$ such that, for all $v\in\mathcal{N}$,
$$\big|\hat\sigma^2_v - v^\top M\Sigma Mv\big| \le \underbrace{C_2\,\frac{\log(1+\tau)}{\log\Big(1+\tau\sqrt{\frac{nq(1-\epsilon)}{d\log(1/\beta)+\log(8/\delta)}}\Big)}}_{=:\,\gamma}\cdot\, v^\top M\Sigma Mv,$$
which is feasible, as $\Sigma_\star$ is one such matrix. Our choice of $\beta$ satisfies $\log(1/\beta)\asymp 1+\log(1+\tau)$, so it suffices to establish that $L(\hat\Sigma,\Sigma_\star) = O(\gamma)$. We can assume that $\gamma\le\frac14$; otherwise, we can instead take $\hat\Sigma = 0$, which satisfies $L(\hat\Sigma,\Sigma_\star) = 1\le 4\gamma$. By the guarantee of our estimator and the triangle inequality, we have for all $v\in\mathcal{N}$ that
$$\big|v^\top M\hat\Sigma Mv - v^\top M\Sigma_\star Mv\big| \le \gamma\, v^\top M\hat\Sigma Mv + \gamma\, v^\top M\Sigma_\star Mv.$$
Re-arranging this, we obtain that for all $v\in\mathcal{N}$,
$$\frac{1-\gamma}{1+\gamma}\,v^\top M\Sigma_\star Mv \;\le\; v^\top M\hat\Sigma Mv \;\le\; \frac{1+\gamma}{1-\gamma}\,v^\top M\Sigma_\star Mv.$$
For $\gamma\le\frac14$, we have $\frac{1-\gamma}{1+\gamma}\ge 1-3\gamma$ and $\frac{1+\gamma}{1-\gamma}\le 1+3\gamma$, and so the above guarantee implies that for all $v\in\mathcal{N}$,
$$\big|v^\top M(\hat\Sigma-\Sigma_\star)Mv\big| \le 3\gamma\, v^\top M\Sigma_\star Mv.$$
To turn this into a bound on the relative spectral norm, first observe that
$$L(\hat\Sigma,\Sigma_\star) = \sup_{v\neq 0}\frac{\big|v^\top\big(\Sigma_\star^{-1/2}\hat\Sigma\Sigma_\star^{-1/2} - I_d\big)v\big|}{v^\top v} = \sup_{v\neq 0}\frac{\big|v^\top\big(M\hat\Sigma M - M\Sigma_\star M\big)v\big|}{v^\top M\Sigma_\star Mv}.$$
To bound this, we define the matrix $E = \Sigma_\star^{-1/2}\hat\Sigma\Sigma_\star^{-1/2} - I_d$ and note both that $L(\hat\Sigma,\Sigma_\star) = \|E\|_{\mathrm{op}}$ and that for any $v\in\mathbb{R}^d$,
$$\frac{v^\top(M\hat\Sigma M - M\Sigma_\star M)v}{v^\top M\Sigma_\star Mv} = \tilde v^\top E\tilde v, \qquad\text{where}\qquad \tilde v = \big\|\Sigma_\star^{1/2}Mv\big\|^{-1}\,\Sigma_\star^{1/2}Mv.$$
For any $v\in\mathbb{S}^{d-1}$, we know there exists $u\in\mathcal{N}$ such that $\|v-u\|\le\beta$. Let $\tilde v$ and $\tilde u$ be defined as above for $v$ and $u$, respectively. Then we have
$$\left|\frac{v^\top(M\hat\Sigma M - M\Sigma_\star M)v}{v^\top M\Sigma_\star Mv} - \frac{u^\top(M\hat\Sigma M - M\Sigma_\star M)u}{u^\top M\Sigma_\star Mu}\right| = \big|\tilde v^\top E\tilde v - \tilde u^\top E\tilde u\big| \le \big|\tilde v^\top E\tilde v - \tilde u^\top E\tilde v\big| + \big|\tilde u^\top E\tilde v - \tilde u^\top E\tilde u\big| \le 2\,\|E\|_{\mathrm{op}}\cdot\|\tilde v - \tilde u\|_2. \qquad (16)$$
Furthermore, we can bound the distance between $\tilde v$ and $\tilde u$ as
$$\|\tilde v - \tilde u\| = \left\|\frac{\Sigma_\star^{1/2}Mv}{\|\Sigma_\star^{1/2}Mv\|} - \frac{\Sigma_\star^{1/2}Mu}{\|\Sigma_\star^{1/2}Mu\|}\right\| \le \left\|\frac{\Sigma_\star^{1/2}Mv}{\|\Sigma_\star^{1/2}Mv\|} - \frac{\Sigma_\star^{1/2}Mv}{\|\Sigma_\star^{1/2}Mu\|}\right\| + \left\|\frac{\Sigma_\star^{1/2}Mv}{\|\Sigma_\star^{1/2}Mu\|} - \frac{\Sigma_\star^{1/2}Mu}{\|\Sigma_\star^{1/2}Mu\|}\right\| \le \frac{2\,\|\Sigma_\star^{1/2}M(v-u)\|}{\|\Sigma_\star^{1/2}Mu\|} \le 4\beta\sqrt{\frac{q(1-\epsilon)+\epsilon}{q(1-\epsilon)}} = 4\beta\sqrt{1+\tau},$$
where the final inequality follows from (15).
Plugging this bound into (16) and taking a supremum over $v\in\mathbb{S}^{d-1}$, we obtain that
$$L(\hat\Sigma,\Sigma_\star) \le 3\gamma + 8\beta\sqrt{1+\tau}\cdot L(\hat\Sigma,\Sigma_\star) \;\implies\; L(\hat\Sigma,\Sigma_\star) \le 6\gamma,$$
where the final implication follows from our choice of $\beta$.

A.3 Univariate upper bounds

Our information-theoretically optimal mean estimation algorithm follows the well-worn path of applying optimal univariate estimators over a covering. For our univariate estimators, we use the minimum Kolmogorov distance estimator introduced by [MVBWS24]: a minimum distance estimator using the Kolmogorov distance $d_{\mathrm{K}}:\mathcal{P}(\mathbb{R}_\star)\times\mathcal{P}(\mathbb{R}_\star)\to\mathbb{R}$, defined as
$$d_{\mathrm{K}}(R_1, R_2) := \sup_{A\in\mathcal{A}}\big|R_1(A) - R_2(A)\big|,$$
where we take the collection $\mathcal{A}$ to denote all upper half-intervals $\mathcal{A} := \{(-\infty, r]\mid r\in\mathbb{R}\}$. We note in passing that this definition of the Kolmogorov distance is a symmetrized variant of the usual Kolmogorov distance.

Given a collection of probability measures $\mathcal{R}'\subseteq\mathcal{P}(\mathbb{R}_\star)$ and a single probability measure $R\in\mathcal{P}(\mathbb{R}_\star)$, we define the projection in Kolmogorov distance onto the set $\mathcal{R}'$ as
$$d_{\mathrm{K}}(R,\mathcal{R}') := \inf_{R'\in\mathcal{R}'} d_{\mathrm{K}}(R, R').$$
Next, given observations $Z_1,\ldots,Z_n\in\mathbb{R}_\star$, we let
$$\hat R_n := \frac1n\sum_{i=1}^n\delta_{Z_i}$$
denote the empirical distribution of the observations. Importantly, the empirical distribution converges in Kolmogorov distance to its population version at the parametric rate, as the following lemma establishes.

Lemma A.3. Let $\epsilon\in[0,1)$, $q\in(0,1]$, and $P\in\mathcal{P}(\mathbb{R})$. Suppose that $Z_1,\ldots,Z_n\overset{\mathrm{iid}}{\sim}R\in\mathcal{R}(P,\epsilon,q)$ and let $\hat R_n$ denote the empirical distribution of the observations. For any $\delta\in(0,1)$, it holds that
$$\mathbb{P}\left(d_{\mathrm{K}}\big(\hat R_n,\,\mathcal{R}(P,\epsilon,q)\big) > 3\sqrt{\frac{\{q(1-\epsilon)+\epsilon\}\log(\frac1\delta)}{n}}\right) \le \delta.$$
We omit the proof of this lemma, as it follows an identical sequence of steps to those leading to [MVBWS24, Ineq. (56)] (in particular, applying the DKW inequality [Mas90] in conjunction with Bernstein's inequality [Ver25]). Note that here we depart slightly from the setting of [MVBWS24] in that we consider the symmetrized Kolmogorov distance, but a careful inspection of their proof shows that the result remains unchanged.

A.3.1 Univariate mean estimation: Proof of Corollary A.1

Let $P_\theta = N(\theta,1)$. We consider the minimum distance estimator $\hat\theta$, defined to be any element
$$\hat\theta(Z_1,\ldots,Z_n)\in\arg\min_{\theta\in\mathbb{R}}\; d_{\mathrm{K}}\big(\hat R_n,\,\mathcal{R}(P_\theta,\epsilon,q)\big). \qquad (17)$$
In words, the estimator $\hat\theta\in\mathbb{R}$ is taken as the parameter whose set of realizable contaminations contains the element closest to the empirical distribution $\hat R_n$.
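For intuition, the projection distance in (17) can be lower-bounded by checking, pointwise in $r$, how far the empirical CDF falls outside the envelope $[q(1-\epsilon)\Phi(r-\theta),\,(q(1-\epsilon)+\epsilon)\Phi(r-\theta)]$ implied by the contamination model (see the sandwich relation used in Case 4 below). This pointwise relaxation ignores joint feasibility across different $r$, so it is only a heuristic surrogate for $d_{\mathrm{K}}(\hat R_n,\mathcal{R}(P_\theta,\epsilon,q))$ and not the quantity our analysis controls. A sketch, assuming NumPy and SciPy, with function names of our own choosing:

```python
import numpy as np
from scipy.stats import norm

def envelope_distance(z, theta, eps, q):
    """Pointwise-envelope surrogate for d_K(R_hat_n, R(P_theta, eps, q)).

    `z` uses np.nan as the missingness marker; the empirical CDF divides by
    the full sample size n, so missing entries never fall in (-inf, r]."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    r = np.sort(z[~np.isnan(z)])              # candidate thresholds suffice
    if r.size == 0:
        return 0.0
    ecdf = np.searchsorted(r, r, side="right") / n
    phi = norm.cdf(r - theta)
    lo = q * (1 - eps) * phi                  # smallest feasible mass
    hi = (q * (1 - eps) + eps) * phi          # largest feasible mass
    return float(max(0.0, np.max(np.maximum(lo - ecdf, ecdf - hi))))

def min_distance_estimate(z, eps, q, grid):
    """Grid-search stand-in for the minimum-distance estimator (17)."""
    vals = [envelope_distance(z, t, eps, q) for t in grid]
    return grid[int(np.argmin(vals))]
```

One would pass a grid of candidate means around a pilot estimate to `min_distance_estimate`; we stress again that this surrogate is illustrative only.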
Applying Lemma A.3 in conjunction with the triangle inequality, we deduce that
$$d_{\mathrm{K}}\big(\mathcal{R}(P_{\hat\theta},\epsilon,q),\,\mathcal{R}(P_{\theta_\star},\epsilon,q)\big) \le 6\sqrt{\frac{\{q(1-\epsilon)+\epsilon\}\log(\frac1\delta)}{n}} = 6\,q(1-\epsilon)\,\alpha\sqrt{1+\tau}.$$
Note further that, for any $r\in\mathbb{R}$,
$$d_{\mathrm{K}}\big(\mathcal{R}(P_\theta,\epsilon,q),\,\mathcal{R}(P_{\theta_\star},\epsilon,q)\big) \ge q(1-\epsilon)\cdot\mathbb{P}(G+\theta\le\theta_\star - r) - \{q(1-\epsilon)+\epsilon\}\cdot\mathbb{P}(G+\theta_\star\le\theta_\star - r) = q(1-\epsilon)\cdot\big\{\underbrace{\bar\Phi(r-t) - (1+\tau)\bar\Phi(r)}_{=:\,g(t;r)}\big\},$$
where above $G\sim N(0,1)$, and recall that $\bar\Phi$ is the survival function of the standard Gaussian. The final relation follows by taking $\theta = \theta_\star - t$. The function $g:\mathbb{R}_+\times\mathbb{R}\to\mathbb{R}$ defined above is strictly increasing in its first argument. Hence,
$$\big|\hat\theta - \theta_\star\big| \le \sup\Big\{t\ge 0 : |\theta-\theta_\star|\le t \text{ and } d_{\mathrm{K}}\big(\mathcal{R}(P_\theta,\epsilon,q),\mathcal{R}(P_{\theta_\star},\epsilon,q)\big) \le 6q(1-\epsilon)\alpha\sqrt{1+\tau}\Big\} \le \sup\Big\{t\ge 0 : \sup_{r\in\mathbb{R}} g(t;r) \le 6\alpha\sqrt{1+\tau}\Big\} = \inf\Big\{t\ge 0 : \sup_{r\in\mathbb{R}} g(t;r) > 6\alpha\sqrt{1+\tau}\Big\}.$$
Re-arranging and invoking the decreasing nature of $\bar\Phi^{-1}$, we see that for any choice of $r>0$, the condition is satisfied if
$$t \ge r - \bar\Phi^{-1}\big((1+\tau)\bar\Phi(r) + 6\alpha\sqrt{1+\tau}\big). \qquad (18)$$
We next consider several cases for the value of $\tau$, select values of $r$, and upper bound the right-hand side of (18). To facilitate the analysis, we introduce the shorthand $\Delta := \tau\bar\Phi(r) + 6\alpha\sqrt{1+\tau}$. Applying the mean value theorem in conjunction with Lemma G.11, we deduce that
$$\bar\Phi^{-1}\big(\bar\Phi(r)+\Delta\big) \;\ge\; r - \Delta\cdot\sup_{x'\in[\bar\Phi(r),\,\bar\Phi(r)+\Delta]}\frac{1}{\phi\big(\bar\Phi^{-1}(x')\big)} \;\ge\; r - \Delta\cdot\sup_{x'\in[\bar\Phi(r),\,\bar\Phi(r)+\Delta]}\frac{1}{x'\,\bar\Phi^{-1}(x')} \;\ge\; r - \frac{\Delta}{\bar\Phi(r)\cdot\bar\Phi^{-1}\big(\bar\Phi(r)+\Delta\big)}. \qquad (19)$$
We proceed by considering several cases on $\tau$. Before doing so, first observe that our assumption $n\ge C_1\frac{\log(\frac1\delta)}{q(1-\epsilon)}$ implies that $\alpha^2\le 1/C_1$.

Case 1: $\tau\le 125\alpha$. In this case, we set $r=2$. Expanding the definition of $\Delta$ and taking $C_1$ sufficiently large yields the conclusion $\Delta\le C\alpha$ for a sufficiently large constant $C$. Applying the first inequality in (19), we find that
$$\bar\Phi^{-1}\big(\bar\Phi(r)+\Delta\big) \ge r - \Delta\cdot\sup_{x'\in[\bar\Phi(r),\,\bar\Phi(r)+\Delta]}\frac{1}{\phi\big(\bar\Phi^{-1}(x')\big)} \ge r - \frac{\Delta}{\phi(r)} = r - \Delta\sqrt{2\pi}\,e^{r^2/2} = r - \Delta\, e^2\sqrt{2\pi},$$
where the second inequality follows because the Gaussian PDF $\phi$ is decreasing on the non-negative reals. It thus follows that any separation
$$t \ge r - \big(r - C e^2\sqrt{2\pi}\,\alpha\big) = C e^2\sqrt{2\pi}\,\alpha \asymp \frac{\log(1+\tau)}{\sqrt{\log(1+\tau^2/\alpha^2)}}$$
suffices, where the final equivalence (up to constant factors) holds because $\tau$ and $\tau/\alpha$ are both bounded by a constant in this regime.

Case 2: $125\alpha<\tau\le 8$. In this case, we set
$$r = \sqrt{\log\Big(\frac{\tau}{6\sqrt{2\pi(1+\tau)}\,\alpha}\Big)} \;\implies\; 6\sqrt{2\pi(1+\tau)}\cdot\alpha\, e^{r^2} = \tau,$$
and note that $r\ge 1$ for this range of $\tau$. Moreover, applying the standard Mills' ratio lower bound (see Lemma G.10) yields
$$\bar\Phi(r) \ge \frac{e^{-r^2/2}}{r\sqrt{2\pi}} \ge \frac{e^{-r^2}}{\sqrt{2\pi}}.$$
It thus follows that
$$\frac{\Delta}{\bar\Phi(r)} \le \tau + 6\sqrt{2\pi(1+\tau)}\cdot\alpha\, e^{r^2} = 2\tau.$$
On the other hand, applying the first inequality in (19) in conjunction with the decreasing nature of the Gaussian PDF $\phi$, we deduce that
$$\bar\Phi^{-1}\big(\bar\Phi(r)+\Delta\big) \ge r - \frac{\Delta}{\phi(r)} \ge r - \frac{\tau}{r} - 6\sqrt{2\pi(1+\tau)}\cdot\alpha\, e^{r^2/2} \ge r - \frac{2\tau}{r},$$
where the penultimate inequality follows from the Mills' ratio upper bound $\bar\Phi(x)\le\phi(x)/x$ (see Lemma G.10) and the final inequality follows because $r\ge 1$. Combining the inequalities in the previous two displays with the final inequality in (19), we find that any separation
$$t \ge r - \Big(r - \frac{2\tau}{r}\Big) = \frac{2\tau}{r} = \frac{2\tau}{\sqrt{\log(\tau/\alpha) - \log\big(6\sqrt{2\pi(1+\tau)}\big)}}$$
suffices. To conclude this case, we have that $\log(1+\tau)\asymp\tau$ because $\tau<8$, and
$$\log(\tau/\alpha) - \log\big(6\sqrt{2\pi(1+\tau)}\big) \asymp \log(\tau/\alpha) \asymp \log(1+\tau^2/\alpha^2)$$
because $\frac{\tau}{\alpha}>125$ and $6\sqrt{2\pi(1+\tau)}<\frac{125}{2}$.

Case 3: $8<\tau\le\frac{1}{20}\alpha^{-1/4}$. In this case, we take $r = \sqrt{\log\big(\frac{1}{6\sqrt{2\pi}\,\alpha}\big)}$. Note that under this setting, taking $C_1$ large enough implies that $r\ge 1$, which, combined with the Mills' ratio lower bound (see Lemma G.10), gives
$$\bar\Phi(r) \ge \frac{r}{r^2+1}\cdot\frac{e^{-r^2/2}}{\sqrt{2\pi}} \ge \frac{1}{2\sqrt{2\pi}}\,e^{-r^2} = 3\alpha.$$
Then, because $\tau>8$, we find that $6\alpha\sqrt{1+\tau}\le\tau\bar\Phi(r)$. In turn, this implies that
$$\bar\Phi^{-1}\big(\bar\Phi(r)+\Delta\big) \ge \bar\Phi^{-1}\big(2\tau\bar\Phi(r)\big).$$
We claim that, with the shorthand $\zeta := \bar\Phi^{-1}\big(2\tau\bar\Phi(r)\big)$, the following inequality holds, deferring its proof to the end of the case:
$$\zeta^2 \ge r^2 - 2\log(2\tau) + 2\log\Big(\frac{r}{\zeta + 1/\zeta}\Big). \qquad (20)$$
We further note the weaker inequality
$$\zeta \;\ge\; \frac12\sqrt{\log\Big(\frac{1}{2\tau\bar\Phi(r)}\Big)} \;\overset{(a)}{\ge}\; \frac12\sqrt{\log\Big(\frac{e^{r^2/2}}{2\tau}\Big)} \;=\; \frac12\sqrt{\frac{r^2}{2} + \log\big(1/(2\tau)\big)} \;\overset{(b)}{\ge}\; \frac{r}{4} \;\ge\; \frac14,$$
where step (a) follows from the bound $\bar\Phi(r)\le e^{-r^2/2}$, and step (b) follows since, for the considered range of $\tau$ and the setting of $r$, we have $2\tau\le e^{r^2/4}$. On the other hand, we have that $\zeta\le r$ because $2\tau>1$ and $\bar\Phi^{-1}$ is decreasing. Combining these inequalities, we deduce that
$$\frac{r}{\zeta+1/\zeta} \ge \frac{r}{r+4} \ge \frac15.$$
Substituting this back into the inequality (20) then yields
$$\zeta^2 \ge r^2 - 2\log(10\tau) \ge r^2/2,$$
where the final inequality follows because $10\tau\le e^{r^2/4}$. Hence, we find that
$$r - \sqrt{r^2 - 2\log(\tau/4)} \le \frac{4\log(\tau/4)}{r},$$
and consequently deduce that any separation
$$t \ge \frac{4\log(\tau/4)}{r} = \frac{4\log(\tau/4)}{\sqrt{\log\big(\frac{1}{6\sqrt{2\pi}\,\alpha}\big)}}$$
suffices. Then we have that $\log(1+\tau)\asymp\log(\tau/4)$ because $\tau>8$, and $\log\big(\frac{1}{6\sqrt{2\pi}\alpha}\big)\asymp\log(\tau/\alpha)$ because, taking $C_1$ large enough, $\log(1/\alpha)$ is bounded below by a sufficiently large constant and $\log 8\le\log(\tau)\le\frac14\log(1/\alpha) - \log(20)$. It remains to establish the inequality (20).

Proof of the inequality (20): Note that $2\tau\bar\Phi(r)\le\frac12$, so that $\zeta\ge 0$. We thus apply the standard Mills' ratio bounds (see Lemma G.10) to obtain the sandwich relation
$$\frac{\zeta}{\zeta^2+1}\,\phi(\zeta) \;\le\; \bar\Phi(\zeta) = 2\tau\bar\Phi(r) \;\le\; \frac{2\tau}{r}\,\phi(r).$$
It thus follows that
$$\frac{1}{\zeta+1/\zeta}\,e^{-\zeta^2/2} \le \frac{2\tau}{r}\,e^{-r^2/2}.$$
Hence, taking logarithms and re-arranging yields the conclusion
$$\zeta^2 \ge r^2 - 2\log(2\tau) + 2\log\Big(\frac{r}{\zeta+1/\zeta}\Big),$$
as desired.

Case 4: $\tau>\frac{1}{20}\alpha^{-1/4}$. In this case, we take a slightly modified version of the previous estimator. Rather than the Kolmogorov distance, we take the conditional Kolmogorov distance
$$d_{\mathrm{K}}(P, Q\mid\mathbb{R}) := d_{\mathrm{K}}\big(P(\cdot\mid\mathbb{R}),\,Q(\cdot\mid\mathbb{R})\big),$$
and we take $\hat\theta$ to be
$$\hat\theta(Z_1,\ldots,Z_n)\in\arg\min_{\theta\in\mathbb{R}}\; d_{\mathrm{K}}\big(\hat R_n,\,\mathcal{R}(P_\theta,\epsilon,q)\mid\mathbb{R}\big). \qquad (21)$$
Similarly to the previous estimator, we can easily bound the conditional Kolmogorov distance between $P_{\hat\theta}$ and $P_{\theta_\star}$. Let $\bar q = R(\mathbb{R})\ge q(1-\epsilon)$, let $\bar R = R(\cdot\mid\mathbb{R})$, and let $S\subseteq[n]$ be the random set of indices $S = \{i\in[n]\mid Z_i\neq\star\}$. Observe that $|S|\sim\mathrm{Bin}(n,\bar q)$. By Bernstein's inequality and our assumption that $n\gtrsim\frac{\log(1/\delta)}{q(1-\epsilon)}$, we have $|S|\ge\bar q n/2$ with probability at least $1-\delta/2$. Now, conditioned on $S$, observe that $\{Z_i\}_{i\in S}\overset{\mathrm{iid}}{\sim}\bar R^{\otimes|S|}$, so we can apply (e.g.) the DKW inequality [Mas90] to obtain that with probability at least $1-\delta/2$,
$$d_{\mathrm{K}}\big(\hat R_n,\,\mathcal{R}(P_{\theta_\star},\epsilon,q)\mid\mathbb{R}\big) \le d_{\mathrm{K}}\big(\hat R_n,\, R\mid\mathbb{R}\big) = d_{\mathrm{K}}\Big(\frac{1}{|S|}\sum_{i\in S}\delta_{Z_i},\;\bar R\Big) \le \sqrt{\frac{\log(\frac1\delta)}{2|S|}}.$$
Thus, by a union bound and the triangle inequality, we have with probability at least $1-\delta$ that
$$d_{\mathrm{K}}\big(\mathcal{R}(P_{\hat\theta},\epsilon,q),\,\mathcal{R}(P_{\theta_\star},\epsilon,q)\mid\mathbb{R}\big) \le 2\sqrt{\frac{\log(\frac1\delta)}{q(1-\epsilon)n}} \le 2\,C_1^{-1/2}. \qquad (22)$$
We will be done once we establish a lower bound on $d_{\mathrm{K}}(\mathcal{R}(P_{\theta_\star},\epsilon,q),\mathcal{R}(P_\theta,\epsilon,q)\mid\mathbb{R})$ for any $\theta$ such that $|\theta-\theta_\star|$ is sufficiently large. Let $R\in\mathcal{R}(P_{\theta_\star},\epsilon,q)$ and $R'\in\mathcal{R}(P_\theta,\epsilon,q)$ be arbitrary contaminations. By translating and negating the observations if necessary, we may assume without loss of generality that $\theta_\star = 0$ and $\theta>0$. Then, by the definition of our contamination set, we have for all $S\subseteq\mathbb{R}$ that
$$q(1-\epsilon)\,\Phi(S) \;\le\; R(S) \;\le\; \big(q(1-\epsilon)+\epsilon\big)\,\Phi(S) = (1+\tau)\,q(1-\epsilon)\,\Phi(S),$$
and similarly for $R'$ with $\Phi(S-\theta)$ in place of $\Phi(S)$.
Noting that $(a,b)\mapsto a/(a+b)$ is increasing in $a$ and decreasing in $b$, we have that for any $r\in\mathbb{R}_+$,
$$\begin{aligned}
d_{\mathrm{K}}(R, R'\mid\mathbb{R}) &\ge R(\cdot\le r\mid\mathbb{R}) - R'(\cdot\le r\mid\mathbb{R})\\
&= \frac{R(\cdot\le r)}{R(\cdot\le r) + R(\cdot>r)} - \frac{R'(\cdot\le r)}{R'(\cdot\le r) + R'(\cdot>r)}\\
&\ge \frac{\Phi(r)}{\Phi(r) + (1+\tau)(1-\Phi(r))} - \frac{(1+\tau)\Phi(r-\theta)}{(1+\tau)\Phi(r-\theta) + (1-\Phi(r-\theta))}\\
&= \frac{\Phi(r)\big(1-\Phi(r-\theta)\big) - (1+\tau)^2\,\Phi(r-\theta)\big(1-\Phi(r)\big)}{\big[\Phi(r)+(1+\tau)(1-\Phi(r))\big]\,\big[(1+\tau)\Phi(r-\theta)+(1-\Phi(r-\theta))\big]}\\
&= \frac{\Phi(r)\,\Phi(\theta-r) - (1+\tau)^2\,\Phi(r-\theta)\,\Phi(-r)}{\big[\Phi(r)+(1+\tau)\Phi(-r)\big]\,\big[(1+\tau)\Phi(r-\theta)+\Phi(\theta-r)\big]},
\end{aligned}$$
where the final equality follows by the symmetry of the Gaussian distribution. Now, for $\theta$ such that $\Phi(-\theta/2)\le\frac14(1+\tau)^{-1}$, by taking $r = \theta/2$ we have
$$d_{\mathrm{K}}(R,R'\mid\mathbb{R}) \ge \frac{\Phi(\theta/2)^2 - (1+\tau)^2\,\Phi(-\theta/2)^2}{\big[\Phi(\theta/2) + (1+\tau)\Phi(-\theta/2)\big]^2} \ge \frac{\frac14 - \frac{1}{16}}{\big(\frac12+\frac14\big)^2} \ge 3\,C_1^{-1/2},$$
where we used that $\Phi(\theta/2)\ge\frac12$ and $\Phi(-\theta/2)\le\frac14(1+\tau)^{-1}$ in the second inequality, and took $C_1$ sufficiently large in the final inequality. By a Gaussian tail bound, $\Phi(-\theta/2)\le e^{-\theta^2/8}$, so $\Phi(-\theta/2)\le\frac14(1+\tau)^{-1}$ holds for all $\theta\ge\sqrt{8\log(4(1+\tau))}$. Therefore, by (22), we have that with probability at least $1-\delta$,
$$\big|\hat\theta - \theta_\star\big| \le \sqrt{8\log\big(4(1+\tau)\big)} \asymp \frac{\log(1+\tau)}{\sqrt{\log(1+\tau^2/\alpha^2)}},$$
where the final asymptotic equivalence holds because $\tau>\frac{1}{20}\alpha^{-1/4}$. This proves the desired result in this case and concludes the proof of the corollary.

A.3.2 Univariate variance estimation: Proof of Lemma A.2

We note that this proof is nearly identical to that of Corollary A.1; hence, we omit several details already provided in that proof. We consider the minimum distance estimator $\hat\sigma$, defined as the supremum
$$\hat\sigma(Z_1,\ldots,Z_n) = \sup\Big\{\sigma\in\mathbb{R}_{++} \;\Big|\; d_{\mathrm{K}}\big(\hat R_n,\,\mathcal{R}(P_\sigma,\epsilon,q)\big) \le 3\,q(1-\epsilon)\,\alpha\sqrt{1+\tau}\Big\}. \qquad (23)$$
Applying Lemma A.3 in conjunction with the triangle inequality, we deduce that $\hat\sigma\ge\sigma_\star$ and
$$d_{\mathrm{K}}\big(\mathcal{R}(P_{\hat\sigma},\epsilon,q),\,\mathcal{R}(P_{\sigma_\star},\epsilon,q)\big) \le 6\,q(1-\epsilon)\,\alpha\sqrt{1+\tau},$$
so it remains to upper bound $\hat\sigma$. Note further that, for any $r\in\mathbb{R}$,
$$d_{\mathrm{K}}\big(\mathcal{R}(P_\sigma,\epsilon,q),\,\mathcal{R}(P_{\sigma_\star},\epsilon,q)\big) \ge \frac12\Big[q(1-\epsilon)\cdot\mathbb{P}\big(\sigma^2G^2\ge r\big) - \{q(1-\epsilon)+\epsilon\}\cdot\mathbb{P}\big(\sigma_\star^2G^2\ge r\big)\Big] = q(1-\epsilon)\,\bar\Phi\Big(\sqrt{\tfrac{r}{\sigma^2}}\Big) - \{q(1-\epsilon)+\epsilon\}\,\bar\Phi\Big(\sqrt{\tfrac{r}{\sigma_\star^2}}\Big),$$
where above we have used the notation $G\sim N(0,1)$ and let $\bar\Phi$ denote the survival function of the standard Gaussian (that is, $\bar\Phi(x) = \mathbb{P}(G\ge x)$). Letting $\sigma^2 = \sigma_\star^2(1+t)^2$ and re-scaling, we see that
$$d_{\mathrm{K}}\big(\mathcal{R}(P_\sigma,\epsilon,q),\,\mathcal{R}(P_{\sigma_\star},\epsilon,q)\big) \ge q(1-\epsilon)\,\Big\{\underbrace{\bar\Phi\Big(\frac{r}{1+t}\Big) - (1+\tau)\,\bar\Phi(r)}_{=:\,g(t,r)}\Big\}.$$
Note that the function $g$ defined above is strictly increasing in its first argument. It thus follows that, with
$$\psi(\epsilon,q,n,\delta) := \sup\Big\{t\ge 0 : \sup_{r\in\mathbb{R}} g(t;r) \le 6\alpha\sqrt{1+\tau}\Big\} = \inf\Big\{t\ge 0 : \sup_{r\in\mathbb{R}} g(t;r) > 6\alpha\sqrt{1+\tau}\Big\},$$
we have
$$\big|\hat\sigma^2 - \sigma_\star^2\big| \le \sigma_\star^2\cdot\Big(2\,\psi(\epsilon,q,n,\delta) + \{\psi(\epsilon,q,n,\delta)\}^2\Big).$$
Re-arranging and invoking the decreasing nature of $\bar\Phi^{-1}$, we see that for any choice of $r>0$, the condition is satisfied if
$$t \ge \frac{r}{\bar\Phi^{-1}\big((1+\tau)\bar\Phi(r) + 6\alpha\sqrt{1+\tau}\big)} - 1. \qquad (24)$$
Let us now bound this term in several cases, depending on the value of $\tau$.

Case 1: $\tau\le 125\alpha$. In this case, we set $r=2$. The same sequence of steps as in Case 1 of the proof of Corollary A.1 yields
$$\bar\Phi^{-1}\big((1+\tau)\bar\Phi(r) + 6\alpha\sqrt{1+\tau}\big) \ge r - \Delta\, e^2\sqrt{2\pi},$$
where $\Delta = \tau\bar\Phi(r) + 6\alpha\sqrt{1+\tau}$.
Hence, since $\Delta\lesssim\alpha$ is smaller than a sufficiently small constant,
$$\frac{r}{r - \Delta e^2\sqrt{2\pi}} - 1 \le \frac{\Delta e^2\sqrt{2\pi}}{r - \Delta e^2\sqrt{2\pi}} \le C\alpha,$$
and so any separation $\psi(\epsilon,q,n,\delta) = C\alpha$ suffices. In turn, for large enough $C_1$, this implies that $\psi(\epsilon,q,n,\delta)<1$, so that in this case
$$\big|\hat\sigma^2 - \sigma_\star^2\big| \le C\alpha\,\sigma_\star^2 \asymp \frac{\log(1+\tau)}{\log(1+\tau/\alpha)}\,\sigma_\star^2,$$
where the last equivalence is because $\tau\lesssim\alpha\lesssim 1$.

Case 2: $125\alpha<\tau\le 8$. As in Corollary A.1, in this case we set $r = \sqrt{\log(\tau/12\alpha)}$ and deduce the lower bound
$$\bar\Phi^{-1}\big((1+\tau)\bar\Phi(r) + 6\alpha\sqrt{1+\tau}\big) \ge r - \frac{2\tau}{r/2}.$$
Hence, since
$$\frac{r}{r - \frac{2\tau}{r/2}} - 1 \le \frac{C\tau}{r^2} < 1,$$
we deduce that in this case
$$\big|\hat\sigma^2 - \sigma_\star^2\big| \le \frac{C\tau}{\log(\tau/12\alpha)}\,\sigma_\star^2 \asymp \frac{\log(1+\tau)}{\log(1+\tau/\alpha)}\,\sigma_\star^2,$$
where the last equivalence is because $\alpha\lesssim\tau\lesssim 1$.

Case 3: $8<\tau\le\frac{1}{20}\alpha^{-1/4}$. As in Corollary A.1, in this case we take $r = \sqrt{\log\big(\frac{1}{12\sqrt{10}\,\alpha}\big)}$. Under this setting, following the same steps as in Corollary A.1, we find that
$$\bar\Phi^{-1}\big((1+\tau)\bar\Phi(r) + 6\alpha\sqrt{1+\tau}\big) \ge \sqrt{r^2 - 2\log(\tau/4)}.$$
Thus, since
$$\frac{r}{\sqrt{r^2 - 2\log(\tau/4)}} - 1 \le \frac{C\log(\tau/4)}{r^2},$$
we deduce that in this case
$$\big|\hat\sigma^2 - \sigma_\star^2\big| \le \frac{C\log\tau}{\log\big(\frac{1}{12\sqrt{10}\,\alpha}\big)}\,\sigma_\star^2 \asymp \frac{\log(1+\tau)}{\log(1+\tau/\alpha)}\,\sigma_\star^2,$$
where the last equivalence is because $\alpha\lesssim 1$ and $1\lesssim\tau\lesssim\alpha^{-1/4}$.

Case 4: $\tau>\frac{1}{20}\alpha^{-1/4}$. In this case, instead of our previous estimator, we simply take $\hat\sigma = 0$. Because $\alpha^2\le\frac{1}{C_1}$ for sufficiently large $C_1$ and $\tau>\frac{1}{20}\alpha^{-1/4}$, we have that
$$\big|\sigma_\star^2 - \hat\sigma^2\big| = \sigma_\star^2 \le C_2\,\frac{\log(1+\tau)}{\log(1+\tau/\alpha)}\,\sigma_\star^2$$
by picking $C_2$ sufficiently large. This proves the desired result in all cases.

B Proof of upper bound for linear regression

B.1 Linear regression: Proof of Theorem 5.1 upper bound

We prove the claim for general $q\in(0,1]$. We show that our choice of estimator $\hat\theta^{(k)}_n$ achieves the desired rate. To do so, we introduce some key lemmas to control relevant quantities with high probability. We defer the proofs of Lemmata B.1 and B.2 to the subsequent subsections.

Lemma B.1. Let $\alpha$ be as in Theorem 5.1, let $m = q(1-\epsilon)n$, and suppose that $\delta\ge m^{-c}$ for some constant $c>0$. With probability at least $1-O(\delta)$, it holds that
$$\big\|\nabla F^{(k)}_n(\theta_\star)\big\| \lesssim q(1-\epsilon)\,k\,\sigma^{2k-1}\cdot\Big\{\tau\,(2k-2)!! + \alpha(1+\sqrt{\tau})\sqrt{(4k-3)!!} + \gamma_1 m^{-1/4} + \gamma_2 m^{-1/2} + \gamma_3 m^{-1}\Big\}, \qquad (25)$$
where
$$\begin{aligned}
\gamma_1 &= \alpha\,\log^{1/2}(m)\sqrt{(4k-3)!!} + \alpha\big(1+\tau^{1/4}\big)\,O\big(\log\{(1+\tau)m\}\big)^{k/2},\\
\gamma_2 &= \alpha\Big[\log^{1/2}(m)\sqrt{(4k-3)!!} + O\big(\log\{(1+\tau)m\}\big)^{\frac k2+\frac14}\Big] + \sqrt{\tau}\,O\big(\log\{(1+\tau)m\}\big)^{k/2},\\
\gamma_3 &= \log(m)\,(2k-2)!! + O\big(\log\{(1+\tau)m\}\big)^{\frac{k+1}{2}}.
\end{aligned}$$

Lemma B.2. Let $\alpha$ be as in Theorem 5.1. Suppose $\alpha^{-2}\gtrsim e^{10k\log^2 k}$. With probability at least $1-O(\delta)$, the empirical risk $F^{(k)}_n$ is uniformly strongly convex:
$$\inf_{\theta\in\mathbb{R}^d}\;\inf_{v\in\mathbb{S}^{d-1}}\; v^\top\nabla^2 F^{(k)}_n(\theta)\,v \;\gtrsim\; q(1-\epsilon)\,\sigma^{2k-2}\,(2k+1)!!.$$

Proof of Theorem 5.1. Suppose first that we choose $k$ satisfying $e^{10k\log^2k}\lesssim\alpha^{-2}$. By Lemma B.2, we deduce that with probability at least $1-\delta/2$, $F^{(k)}_n$ is $\mu_n$-strongly convex. From strong convexity of $F^{(k)}_n$, we obtain the inequality
$$\Big(\nabla F^{(k)}_n\big(\hat\theta^{(k)}_n\big) - \nabla F^{(k)}_n(\theta_\star)\Big)^\top\big(\hat\theta^{(k)}_n - \theta_\star\big) \ge \mu_n\cdot\big\|\hat\theta^{(k)}_n - \theta_\star\big\|_2^2.$$
Since, by definition, $\nabla F^{(k)}_n\big(\hat\theta^{(k)}_n\big) = 0$, applying the Cauchy–Schwarz inequality to the left-hand side and re-arranging yields
$$\big\|\theta_\star - \hat\theta^{(k)}_n\big\|_2 \le \frac{\big\|\nabla F^{(k)}_n(\theta_\star)\big\|_2}{\mu_n}.$$
Applying Lemmas B.1 and B.2, which hold with probability $1-\delta$, then yields
$$\frac{1}{\sigma}\cdot\big\|\hat\theta^{(k)}_n - \theta_\star\big\| \;\lesssim\; k\cdot\frac{\tau(2k-2)!! + \alpha(1+\sqrt{\tau})\sqrt{(4k-3)!!} + \gamma_1m^{-1/4} + \gamma_2m^{-1/2} + \gamma_3m^{-1}}{(2k+1)!!} \;\lesssim\; \frac{\tau}{\sqrt k} + \frac{2^k\,\alpha(1+\sqrt\tau)}{\sqrt k} + \frac{\gamma_1m^{-1/4} + \gamma_2m^{-1/2} + \gamma_3m^{-1}}{(2k-1)!!}, \qquad (26)$$
\[
\frac1\sigma\,\big\|\hat\theta_n^{(k)}-\theta_\star\big\| \lesssim k\cdot\frac{\tau(2k-2)!!+\alpha(1+\sqrt\tau)\sqrt{(4k-3)!!}+\gamma_1m^{-\frac14}+\gamma_2m^{-\frac12}+\gamma_3m^{-1}}{(2k+1)!!}
\lesssim \frac\tau{\sqrt k}+\frac{2^k\alpha(1+\sqrt\tau)}{\sqrt k}+\frac{\gamma_1m^{-\frac14}+\gamma_2m^{-\frac12}+\gamma_3m^{-1}}{(2k-1)!!}, \tag{26}
\]
where the first two terms in the final step follow from Stirling's inequality (see Fact G.2). We analyze this bound in different cases for $\tau$. To do so, let $C'\ge1$ be a sufficiently large constant and $c<1$ a sufficiently small constant that we will specify later.

Small $\tau$: $\tau\le C'\alpha$. In this case we take $k=1$, and we have that $\alpha^{-2}\gtrsim e^{10k\log^2k}$ because $\alpha\le1/\sqrt C$ by our assumption $m\ge C(d+\log(1/\delta))$. Because $\tau\le C'\alpha\lesssim1$, the error bound (26) simplifies to
\[
\frac1\sigma\,\big\|\hat\theta_n^{(k)}-\theta_\star\big\| \lesssim \alpha\cdot\Big(1+m^{-\frac14}\sqrt{\log m}+m^{-\frac12}\log^{\frac34}(m)+m^{-1}\log m\Big) \lesssim \alpha,
\]
where the final step follows because $\alpha\ge\frac1{\sqrt m}$. In this regime, $\frac{\tau(1\vee\log\log(\tau/\alpha))}{\sqrt{\log(1+\tau^2/\alpha^2)}}\asymp\alpha$, so this matches our desired rate.

Large $\tau$: $\tau\ge C'\alpha$. In this case, we take
\[
k = \frac{c\log(\tau/\alpha)}{(\log\log(\tau/\alpha))^2}.
\]
By choosing $C'$ sufficiently large (as a function of $c$), we have that $k\ge1$ because $\tau/\alpha\ge C'$. This implies $|\log k|=\log k\le\log\log(1/\alpha)$ and so $k\log^2k\le c\log(1/\alpha)$. By picking $c=0.1$, we have that $k$ is small enough to apply Lemma B.2. Substituting this into (26), we have
\[
\frac1\sigma\,\big\|\hat\theta_n^{(k)}-\theta_\star\big\| \lesssim \frac{\tau\log\log(\tau/\alpha)}{\sqrt{\log(\tau/\alpha)}} + \alpha\,(\tau/\alpha)^{\frac{c\log2}{(\log\log(\tau/\alpha))^2}}\,\frac{\log\log(\tau/\alpha)}{\sqrt{\log(\tau/\alpha)}} + \frac{\gamma_1m^{-\frac14}+\gamma_2m^{-\frac12}+\gamma_3m^{-1}}{(2k-1)!!}. \tag{27}
\]
Enlarging $C'$ if necessary, we see that $\frac{c\log2}{(\log\log(\tau/\alpha))^2}\le1$, which implies that the second term is dominated by the first. We now bound $\frac{\gamma_1m^{-\frac14}}{(2k-1)!!}$. We have
\[
\frac{\gamma_1m^{-\frac14}}{(2k-1)!!} \asymp \frac{\alpha\,2^k\log^{\frac12}(m)}{m^{\frac14}} + \frac{\alpha\,O(\log m)^{\frac k2}}{m^{\frac14}(2k-1)!!}.
\]
Because $(2k-1)!!=O(k)^k$, we see that $\frac{O(\log m)^{\frac k2}}{(2k-1)!!}\le e^{O(\sqrt{\log m})}\le m^{\frac18}$ for $m\ge C''$, where $C''$ is a sufficiently large constant. Now, because $\alpha^{-2}\le m$, we have that $m^{\frac18}\gtrsim\sqrt{\log(1/\alpha)}$. Applying these bounds to the above display, we see that, as long as $m\ge C''\vee e^{(8\log2)k}$,
\[
\frac{\gamma_1m^{-\frac14}}{(2k-1)!!} \lesssim \frac\alpha{\sqrt{\log(1/\alpha)}}.
\]
Note that, by taking $C_1$ to be a sufficiently large constant, the assumption that $\alpha\le1/\sqrt{C_1}$ implies that $\frac{8c\log2}{(\log\log(1/\alpha))^2}\le2$. This in turn implies $m\ge e^{(8\log2)k}$ because
\[
e^{(8\log2)k} = \alpha^{-\frac{8c\log2}{(\log\log(1/\alpha))^2}} \le \alpha^{-2} \le m.
\]
The other two terms in (27) follow similarly, so we have the error bound
\[
\frac1\sigma\,\big\|\hat\theta_n^{(k)}-\theta_\star\big\| \lesssim \frac{\tau\log\log(\tau/\alpha)}{\sqrt{\log(\tau/\alpha)}} + \frac\alpha{\sqrt{\log(1/\alpha)}} \asymp \frac{\tau\,(1\vee\log\log(\tau/\alpha))}{\sqrt{\log(1+\tau^2/\alpha^2)}},
\]
as desired.

It remains to prove the above lemmas. We introduce some notation that will be useful in both lemmas. In particular, by the usual decomposition, we know that there exist $\Omega^{\mathrm{MCAR}}\sim\mathrm{Bern}(q)$, $\Omega^{\mathrm{MNAR}}\in\{0,1\}$, $B\sim\mathrm{Bern}(\epsilon)$, and $(X,Y)\sim P$ such that $\mathcal R=\mathrm{Law}(Z)$, where
\[
Z = (X,Y)\star\big((1-B)\,\Omega^{\mathrm{MCAR}}+B\,\Omega^{\mathrm{MNAR}}\big),
\]
and $\Omega^{\mathrm{MCAR}}$, $B$, and $\big((X,Y),\Omega^{\mathrm{MNAR}}\big)$ are mutually independent. Notice that this slightly generalizes the development in Section 5, where we specify $q=1$. Moreover, $Y$ can be generated as $X^\top\theta_\star+\sigma g$ for $g\sim\mathcal N(0,1)$ independent of $X$. Observe that $Z\in\mathbb{R}^{d+1}$ if and only if $(1-B)\Omega^{\mathrm{MCAR}}+B\,\Omega^{\mathrm{MNAR}}=1$. Let $\omega_i^{\mathrm{MCAR}},\omega_i^{\mathrm{MNAR}},b_i$ be the realizations of these random variables in the sample.
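As a concrete (purely illustrative) view of this decomposition, the following minimal simulation sketch generates samples $Z$ exactly as in the display above; it assumes only NumPy, all parameter values are hypothetical, and the particular MNAR rule (censoring large-noise observations) is just one arbitrary choice of adversary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
q, eps, sigma = 0.8, 0.1, 1.0              # hypothetical parameters
theta_star = np.ones(d) / np.sqrt(d)

X = rng.standard_normal((n, d))
g = rng.standard_normal(n)
Y = X @ theta_star + sigma * g             # Y = X^T theta_star + sigma * g

omega_mcar = rng.random(n) < q             # Bern(q), independent of (X, Y)
B = rng.random(n) < eps                    # Bern(eps), independent of everything
omega_mnar = np.abs(g) < 1.0               # arbitrary MNAR rule; may depend on the data

observed = (1 - B) * omega_mcar + B * omega_mnar   # observation indicator
mask = observed.astype(bool)
Z_X = np.where(mask[:, None], X, np.nan)   # nan plays the role of the star symbol
Z_Y = np.where(mask, Y, np.nan)
print(f"observed fraction: {mask.mean():.3f}")
```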
B.1.1 Proof of Lemma B.1

Expanding out $\nabla F_n^{(k)}(\theta_\star)$ and using the fact that $y_i=X_i^\top\theta_\star+\sigma g_i$, we have
\[
\nabla F_n^{(k)}(\theta_\star) = \frac{2k\sigma^{2k-1}}n\sum_{i=1}^n\big((1-b_i)\omega_i^{\mathrm{MCAR}}+b_i\omega_i^{\mathrm{MNAR}}\big)\,g_i^{2k-1}X_i,
\]
so that
\[
\big\|\nabla F_n^{(k)}(\theta_\star)\big\| \le 2k\sigma^{2k-1}\Big\|\frac1n\sum_{i:\,b_i=0}\omega_i^{\mathrm{MCAR}}g_i^{2k-1}X_i\Big\| + 2k\sigma^{2k-1}\Big\|\frac1n\sum_{i:\,b_i=1}\omega_i^{\mathrm{MNAR}}g_i^{2k-1}X_i\Big\|. \tag{28}
\]
We will be done once we bound each vector norm in (28), with probability at least $1-O(\delta)$, by
\[
B := q(1-\epsilon)\cdot\Big\{\tau(2k-2)!!+\alpha(1+\sqrt\tau)\sqrt{(4k-3)!!}+\gamma_1m^{-\frac14}+\gamma_2m^{-\frac12}+\gamma_3m^{-1}\Big\}, \tag{29}
\]
where we recall that
\[
\gamma_1 = \alpha\log^{\frac12}(m)\sqrt{(4k-3)!!}+\alpha\big(1+\tau^{\frac14}\big)O\big(\log\{(1+\tau)m\}\big)^{\frac k2},\qquad
\gamma_2 = \alpha\Big[\log^{\frac12}(m)\sqrt{(4k-3)!!}+O\big(\log\{(1+\tau)m\}\big)^{\frac k2+\frac14}\Big]+\sqrt\tau\,O\big(\log\{(1+\tau)m\}\big)^{\frac k2},
\]
\[
\gamma_3 = \log(m)(2k-2)!!+O\big(\log\{(1+\tau)m\}\big)^{\frac{k+1}2}.
\]

Bounding the first term of (28). Observe that $(1-b_i)\omega_i^{\mathrm{MCAR}}\sim\mathrm{Bern}(q(1-\epsilon))$ are i.i.d. and independent of $\{(g_i,X_i)\}$. Thus, conditioning on $n_0=\sum_{i=1}^n(1-b_i)\omega_i^{\mathrm{MCAR}}$, we can assume without loss of generality that the first $n_0$ samples are the samples for which $(1-b_i)\omega_i^{\mathrm{MCAR}}=1$, and then bound $\big\|\frac1n\sum_{i=1}^{n_0}g_i^{2k-1}X_i\big\|$. Now we condition on $g=\{g_i\}$ and observe that $\frac1n\sum_{i=1}^{n_0}g_i^{2k-1}X_i\sim\mathcal N(0,\sigma^2(g)I_d)$, where
\[
\sigma^2(g) = \frac1{n^2}\sum_{i=1}^{n_0}g_i^{4k-2}.
\]
Thus a standard Gaussian tail bound yields that, with probability at least $1-\delta$,
\[
\Big\|\frac1n\sum_{i=1}^{n_0}g_i^{2k-1}X_i\Big\| \le \frac1n\cdot\sqrt{\sum_{i=1}^{n_0}g_i^{4k-2}}\cdot\Big(\sqrt d+\sqrt{2\log(1/\delta)}\Big). \tag{30}
\]
We have, by Lemma G.9 and the assumption that $\delta\ge m^{-c}$, that with probability at least $1-\delta$,
\[
\sum_{i=1}^{n_0}g_i^{4k-2} \le n_0(4k-3)!!+\sqrt{n_0\,2^{4k-3}\log(2/\delta)}\,\log(2n_0/\delta)^{2k-1} \lesssim n_0(4k-3)!!+\big[O(\log n_0+\log m)\big]^k\sqrt{n_0}. \tag{31}
\]
Now, because $n_0\sim\mathrm{Bin}(n,q(1-\epsilon))$, we apply Bernstein's inequality to obtain that, with probability at least $1-e^{-\Omega(m)}\ge1-\delta$, $n_0\le2m$. Putting it all together, we deduce that with probability at least $1-O(\delta)$,
\[
\Big\|\frac1n\sum_{i=1}^{n_0}g_i^{2k-1}X_i\Big\| \lesssim \frac{\sqrt{d+\log(1/\delta)}}n\cdot\Big(\sqrt{m(4k-3)!!}+O(\log m)^{\frac k2}m^{\frac14}\Big) = q(1-\epsilon)\,\alpha\Big(\sqrt{(4k-3)!!}+m^{-\frac14}O(\log m)^{\frac k2}\Big).
\]
This is bounded by $O(B)$ because $\gamma_1\ge\alpha\cdot O(\log m)^{\frac k2}$.

Bounding the second term of (28). Again observe that $\{b_i\}$ is independent of everything else, so we can condition on $n_1=\sum_{i=1}^nb_i$ and assume without loss of generality that the first $n_1$ values of $b_i$ are equal to 1. Then we have
\[
\Big\|\frac1n\sum_{i:\,b_i=1}\omega_i^{\mathrm{MNAR}}g_i^{2k-1}X_i\Big\| = \frac1n\sup_{v\in\mathbb S^{d-1}}\sum_{i=1}^{n_1}\omega_i^{\mathrm{MNAR}}g_i^{2k-1}X_i^\top v \le \frac1n\sup_{v\in\mathbb S^{d-1}}\sum_{i=1}^{n_1}|g_i^{2k-1}|\cdot|X_i^\top v|.
\]
Now we condition on $g=\{g_i\}_{i=1}^n$ and let $a_i=|g_i^{2k-1}|$. We define $T_v:=\sum_{i=1}^{n_1}a_i|X_i^\top v|$ and let $\mathcal N$ be a $1/2$-packing of $\mathbb S^{d-1}$ of size at most $5^d$. Then, for any $v\in\mathbb S^{d-1}$, let $v'\in\mathcal N$ satisfy $\|v-v'\|\le1/2$ and observe that
\[
|T_v-T_{v'}| \le \sum_{i=1}^{n_1}a_i\big||X_i^\top v|-|X_i^\top v'|\big| \le \sum_{i=1}^{n_1}a_i|X_i^\top(v-v')| \le \|v-v'\|\cdot\sum_{i=1}^{n_1}a_i|X_i^\top u|,
\]
where $u\in\mathbb S^{d-1}$ is such that $v=v'+\|v-v'\|u$. We thus have that
\[
T_v \le T_{v'}+\|v-v'\|\sum_{i=1}^{n_1}a_i|X_i^\top u| \le \sup_{u\in\mathcal N}T_u+\frac12\sup_{u\in\mathbb S^{d-1}}T_u.
\]
Taking a supremum over $v\in\mathbb S^{d-1}$ and rearranging yields $\sup_{v\in\mathbb S^{d-1}}T_v\le2\sup_{v\in\mathcal N}T_v$. Since $X_i\overset{\text{iid}}\sim\mathcal N(0,I_d)$, we have for each $v\in\mathcal N$ that $\mathbb E[|X_i^\top v|]=\sqrt{2/\pi}$ and that $T_v-\sqrt{2/\pi}\sum_{i=1}^{n_1}a_i$ is $\sum_{i=1}^{n_1}a_i^2$-subgaussian.
Thus,
\[
\mathbb P\Big(\sup_{v\in\mathcal N}T_v-\sqrt{\tfrac2\pi}\sum_{i=1}^{n_1}a_i\ge t\Big) \le \sum_{v\in\mathcal N}\mathbb P\Big(T_v-\sqrt{\tfrac2\pi}\sum_{i=1}^{n_1}a_i\ge t\Big) \le 5^d\exp\Big(-\frac{t^2}{2\sum_{i=1}^{n_1}a_i^2}\Big).
\]
Picking $t=\sqrt{2\big(d\log5+\log(1/\delta)\big)\sum_{i=1}^{n_1}a_i^2}$, we have with probability at least $1-\delta$ that
\[
\sup_{v\in\mathcal N}T_v \le \sqrt{\tfrac2\pi}\sum_{i=1}^{n_1}a_i+\sqrt{\sum_{i=1}^{n_1}a_i^2\,\big(d\log5+\log(1/\delta)\big)}.
\]
Assembling the pieces, we have, conditional on $n_1$ and $g$, that with probability at least $1-\delta$,
\[
\Big\|\frac1n\sum_{i:\,b_i=1}\omega_i^{\mathrm{MNAR}}g_i^{2k-1}X_i\Big\| \lesssim \frac1n\bigg(\sum_{i=1}^{n_1}|g_i|^{2k-1}+\sqrt{\sum_{i=1}^{n_1}g_i^{4k-2}\,\big(d+\log(1/\delta)\big)}\bigg). \tag{32}
\]
Again by Lemma G.9, we have with probability at least $1-\delta$ that
\[
\sum_{i=1}^{n_1}|g_i|^{2k-1} \le n_1(2k-2)!!+O\big(\log(n_1/\delta)\big)^{\frac k2}\sqrt{n_1}
\qquad\text{and}\qquad
\sum_{i=1}^{n_1}|g_i|^{4k-2} \le n_1(4k-3)!!+O\big(\log(n_1/\delta)\big)^k\sqrt{n_1}.
\]
Recalling that $n_1\sim\mathrm{Bin}(n,\epsilon)$, we have with probability at least $1-\delta$ that $n_1\lesssim\epsilon n+\log(1/\delta)=\tau m+O(\log m)$. Now we condition on all of these events, which occur simultaneously with probability at least $1-O(\delta)$. Note that $\log(n_1/\delta)=\log\big(\tau m+O(\log m)\big)+O(\log m)\asymp\log((1+\tau)m)$. Then the first term in (32) is upper bounded by
\[
\frac1n\sum_{i=1}^{n_1}|g_i|^{2k-1} \lesssim \frac1n\Big((\tau m+\log m)(2k-2)!!+O\big(\log\{(1+\tau)m\}\big)^{\frac k2}\big(\sqrt{\tau m}+\sqrt{\log m}\big)\Big)
\]
\[
\asymp q(1-\epsilon)\Big[\tau(2k-2)!!+m^{-1}\log(m)(2k-2)!!+m^{-\frac12}O\big(\log\{(1+\tau)m\}\big)^{\frac k2}\sqrt\tau+m^{-1}O\big(\log\{(1+\tau)m\}\big)^{\frac{k+1}2}\Big].
\]
Collecting terms with the same exponent on $m$ and recalling the definitions of $\gamma_2$ and $\gamma_3$, we have that the above display is $O(B)$, because $\gamma_2m^{-\frac12}\ge m^{-\frac12}O(\log\{(1+\tau)m\})^{\frac k2}\sqrt\tau$ and $\gamma_3m^{-1}\ge m^{-1}\big(\log(m)(2k-2)!!+O(\log\{(1+\tau)m\})^{\frac{k+1}2}\big)$. The second term in (32) is bounded above by
\[
\frac1n\sqrt{\sum_{i=1}^{n_1}g_i^{4k-2}\big(d+\log(1/\delta)\big)} \lesssim \frac{\sqrt{d+\log(1/\delta)}}n\cdot\Big[\sqrt{\tau m(4k-3)!!}+\log^{\frac12}(m)\sqrt{(4k-3)!!}+O\big(\log\{(1+\tau)m\}\big)^{\frac k2}\big((\tau m)^{\frac14}+\log^{\frac14}(m)\big)\Big]
\]
\[
\asymp q(1-\epsilon)\,\alpha\Big[\sqrt{\tau(4k-3)!!}+m^{-\frac12}\log^{\frac12}(m)\sqrt{(4k-3)!!}+O\big(\log\{(1+\tau)m\}\big)^{\frac k2}\big(m^{-\frac14}\tau^{\frac14}+m^{-\frac12}\log^{\frac14}(m)\big)\Big].
\]
Similar to the previous term, this display is also $O(B)$, because $\gamma_1m^{-\frac14}\ge\alpha\tau^{\frac14}O(\log((1+\tau)m))^{\frac k2}m^{-\frac14}$ and $\gamma_2m^{-\frac12}\ge\alpha\big(\log^{\frac12}(m)\sqrt{(4k-3)!!}+\log^{\frac14}(m)O(\log((1+\tau)m))^{\frac k2}\big)m^{-\frac12}$. We have thus bounded all terms by $O(B)$, proving the claim.

B.1.2 Proof of Lemma B.2

Expanding out $\nabla^2F_n^{(k)}$, we have for any unit vector $v$ that
\[
v^{\mathsf T}\nabla^2F_n^{(k)}(\theta)\,v = \frac{(2k)(2k-1)}n\sum_{i=1}^n\big((1-b_i)\omega_i^{\mathrm{MCAR}}+b_i\omega_i^{\mathrm{MNAR}}\big)\big(x_i^\top(\theta_\star-\theta)+\sigma g_i\big)^{2k-2}(x_i^\top v)^2
\ge \frac{(2k)(2k-1)}n\sum_{i=1}^n(1-b_i)\omega_i^{\mathrm{MCAR}}\big(x_i^\top(\theta_\star-\theta)+\sigma g_i\big)^{2k-2}(x_i^\top v)^2,
\]
where the inequality is because the summands are nonnegative. Defining the random set $\mathcal I=\{i\in[n]\mid(1-b_i)\omega_i^{\mathrm{MCAR}}=1\}$, we equivalently have
\[
v^{\mathsf T}\nabla^2F_n^{(k)}(\theta)\,v \ge \frac{(2k)(2k-1)}n\sum_{i\in\mathcal I}\big(x_i^\top(\theta_\star-\theta)+\sigma g_i\big)^{2k-2}(x_i^\top v)^2. \tag{33}
\]
We condition on $\mathcal I$ and let $n_0=|\mathcal I|$. Because $((1-b_i)\omega_i^{\mathrm{MCAR}})_{i\in[n]}$ is independent of $((x_i,g_i))_{i\in[n]}$, we have that, conditional on $\mathcal I$, $((x_i,g_i))_{i\in\mathcal I}$ are $n_0$ i.i.d. samples from $\mathcal N(0,I_d)\otimes\mathcal N(0,\sigma^2)$. We now introduce a finite sequence of noise bins indexed by $\ell=0,\dots,L$, where $t_\ell=(1+\frac1{2k})^\ell$ and $L$ is the unique integer such that $2\sqrt{k\log(3k)}\le t_L<(1+\frac1{2k})\cdot2\sqrt{k\log(3k)}$.
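To make the binning concrete, the short sketch below (illustrative only; assuming NumPy/SciPy, with a hypothetical value of $k$) computes $t_\ell$, the index $L$, and the bin masses $p_\ell=\mathbb P(G\in[t_{\ell-1},t_\ell))$, and compares the smallest mass against the $e^{-5k\log^2k}$ lower bound derived in (34) below.

```python
import numpy as np
from scipy.stats import norm

k = 3                                        # hypothetical moment parameter
r = 1 + 1 / (2 * k)
target = 2 * np.sqrt(k * np.log(3 * k))
L = int(np.ceil(np.log(target) / np.log(r)))  # smallest L with t_L >= 2*sqrt(k log 3k)
t = r ** np.arange(L + 1)                     # t_0 = 1, ..., t_L
p = norm.cdf(t[1:]) - norm.cdf(t[:-1])        # p_l = P(G in [t_{l-1}, t_l))
print(f"L = {L}, t_L = {t[-1]:.3f}")
print(f"min bin mass = {p.min():.3e}  vs  exp(-5 k log^2 k) = {np.exp(-5 * k * np.log(k) ** 2):.3e}")
```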
Then, for each $\ell\in[L]$, $\theta\in\mathbb R^d$, and unit vector $v$, we define the subsets
\[
S_{\ell,\theta,v} := \Big\{i\in\mathcal I \;:\; x_i^\top(\theta_\star-\theta)\ge0,\ |x_i^\top v|\ge\sqrt{\pi/8},\ \text{and}\ g_i\in[t_{\ell-1},t_\ell)\Big\}.
\]
For $i\in S_{\ell,\theta,v}$, we have the lower bounds
\[
(y_i-x_i^\top\theta) = (x_i^\top\theta_\star+\sigma g_i)-x_i^\top\theta \ge \sigma t_{\ell-1}
\qquad\text{and}\qquad
(x_i^\top v)^2 \ge \frac\pi8 > \frac14.
\]
Now observe that, for each $\theta$ and $v$, each index $i\in\mathcal I$ is in $S_{\ell,\theta,v}$ for at most one $\ell\in[L]$. Applying these lower bounds to (33), we have that
\[
v^\top\nabla^2F_n^{(k)}(\theta)\,v \ge \frac{(2k)(2k-1)}n\sum_{\ell=1}^L\sum_{i\in S_{\ell,\theta,v}}(\sigma t_{\ell-1})^{2k-2}(x_i^\top v)^2 \ge \frac{k^2\sigma^{2k-2}}{2n}\sum_{\ell=1}^L|S_{\ell,\theta,v}|\,t_{\ell-1}^{2k-2}.
\]
To prove the claim, we find uniform lower bounds on $|S_{\ell,\theta,v}|$ that hold with probability at least $1-O(\delta)$. For $\ell\in[L]$, let $p_\ell:=\mathbb P(G\in[t_{\ell-1},t_\ell))$, where $G\sim\mathcal N(0,1)$. Observe that for $\ell\in[L]$,
\[
p_\ell = \frac1{\sqrt{2\pi}}\int_{t_{\ell-1}}^{t_\ell}e^{-t^2/2}\,dt \ge \frac{t_\ell-t_{\ell-1}}{\sqrt{2\pi}}e^{-t_\ell^2/2} = \frac1{(2k+1)\sqrt{2\pi}}\,t_\ell\,e^{-t_\ell^2/2}.
\]
Now, because $t\mapsto te^{-t^2/2}$ is decreasing for $t\ge1$ and because $t_\ell\le3\sqrt{k\log(3k)}$ for all $\ell\in[L]$, we have that
\[
p_\ell \ge \frac{3\sqrt{k\log(3k)}}{(2k+1)\sqrt{2\pi}}\,e^{-\frac92k\log(3k)} \gtrsim e^{-5k\log^2k}. \tag{34}
\]
We now have the following claim, which we prove later.

Claim B.3. Let $C>0$ be a sufficiently large constant. With probability at least $1-\delta$, it holds simultaneously for all $\ell\in[L]$, $\theta\in\mathbb R^d$, and $v\in\mathbb S^{d-1}$ that
\[
|S_{\ell,\theta,v}| \ge \frac{p_\ell n_0}4 - C\sqrt{\big(d+\log(1/\delta)\big)n_0}. \tag{35}
\]
We condition on the event defined in the claim. Now recall that $n_0\sim\mathrm{Bin}(n,q(1-\epsilon))$. Then $\alpha^2\le1/C_1$ implies $q(1-\epsilon)n\gtrsim\log(1/\delta)$, and so with probability at least $1-\delta$ we have that $q(1-\epsilon)n/2\le n_0\le2q(1-\epsilon)n$. Further conditioning on this event, we have, by combining it with (35), that for all $\ell\in[L]$,
\[
|S_{\ell,\theta,v}| \ge \frac{p_\ell q(1-\epsilon)n}8 - C\sqrt{2\big(d+\log(1/\delta)\big)q(1-\epsilon)n}.
\]
Now, applying the assumption $\alpha^{-2}\ge C'e^{10k\log^2k}$ combined with (34) implies
\[
C\sqrt{2\big(d+\log(1/\delta)\big)q(1-\epsilon)n} \le \frac{C\sqrt2}{\sqrt{C'}}\cdot e^{-5k\log^2k}\,q(1-\epsilon)n \le \frac{p_\ell q(1-\epsilon)n}{16},
\]
where the final inequality follows by taking $C'$ to be a sufficiently large constant. Thus, it holds for all $\ell\in[L]$ that $|S_{\ell,\theta,v}|\ge\frac{q(1-\epsilon)p_\ell n}{16}$. Consequently, for all $\theta\in\mathbb R^d$ and all unit vectors $v$, we have
\[
v^{\mathsf T}\nabla^2F_n^{(k)}(\theta)\,v \ge \frac{q(1-\epsilon)k^2\sigma^{2k-2}}{32}\sum_{\ell=1}^Lp_\ell\,t_{\ell-1}^{2k-2}.
\]
Recalling that $t_{\ell-1}=(1+\frac1{2k})^{-1}t_\ell$ and $p_\ell=\mathbb P(G\in[t_{\ell-1},t_\ell))$, we have
\[
\sum_{\ell=1}^Lp_\ell\,t_{\ell-1}^{2k-2} = \Big(1+\frac1{2k}\Big)^{-(2k-2)}\sum_{\ell=1}^Lp_\ell\,t_\ell^{2k-2} = \Big(1+\frac1{2k}\Big)^{-(2k-2)}\sum_{\ell=1}^L\int_{t_{\ell-1}}^{t_\ell}t_\ell^{2k-2}\phi(t)\,dt \ge \frac23\sum_{\ell=1}^L\int_{t_{\ell-1}}^{t_\ell}t^{2k-2}\phi(t)\,dt = \frac23\int_1^{t_L}t^{2k-2}\phi(t)\,dt.
\]
The last expression is a truncated Gaussian moment, and Lemma G.7 gives that $\int_1^{t_L}t^{2k-2}\phi(t)\,dt\ge\frac{(2k-3)!!}3$ because $t_L\ge2\sqrt{k\log(3k)}$. Combining the pieces, we have for all $\theta\in\mathbb R^d$ and all unit vectors $v$ that
\[
v^{\mathsf T}\nabla^2F_n^{(k)}(\theta)\,v \ge \frac{q(1-\epsilon)\,k\,(2k-3)!!\,\sigma^{2k-2}}{36} \asymp q(1-\epsilon)\,\sigma^{2k-2}(2k+1)!!\,,
\]
proving the claim.

Proof of Claim B.3. The proof of this bound follows from a standard VC dimension argument. For each $\ell\in[L]$, we define the function classes
\[
\mathcal F_\ell := \Big\{(x,g)\mapsto\mathbb 1\Big\{x^\top(\theta_\star-\theta)\ge0,\ |x^\top v|\ge\sqrt{\tfrac\pi8},\ \text{and}\ t_\ell\le g<t_{\ell+1}\Big\} \;:\; v\in\mathbb S^{d-1},\ \theta\in\theta_\star+\mathbb S^{d-1}\Big\}
\]
and also define $\mathcal F=\cup_{\ell\in[L]}\mathcal F_\ell$.
Observe that $\mathcal F$ is a subclass of the 3-fold intersection of the following function classes:
\[
\mathcal F_1 = \big\{(x,g)\mapsto\mathbb 1\{x^\top(\theta_\star-\theta)\ge0\} : \theta\in\theta_\star+\mathbb S^{d-1}\big\},\qquad
\mathcal F_2 = \big\{(x,g)\mapsto\mathbb 1\{|x^\top v|\ge\sqrt{\pi/8}\} : v\in\mathbb S^{d-1}\big\},\qquad
\mathcal F_3 = \big\{(x,g)\mapsto\mathbb 1\{a\le g<b\} : a,b\in\mathbb R\big\}.
\]
Common VC-dimension calculations give that $\mathrm{VC}(\mathcal F_1)\le d+1$, $\mathrm{VC}(\mathcal F_2)\le2(d+1)$, and $\mathrm{VC}(\mathcal F_3)\le2$. We can then apply Lemma G.5 to deduce that $\mathrm{VC}(\mathcal F)\le12\log(18)(d+1)\asymp d$. Using this VC-dimension bound, we can bound the expected worst-case deviation using [Ver25, Theorem 8.3.5] to obtain
\[
\mathbb E\sup_{f\in\mathcal F}\Big|\frac1{n_0}\sum_{i=1}^{n_0}f(x_i,g_i)-\mathbb Ef(x,g)\Big| \lesssim \sqrt{\frac dn}.
\]
To convert this to a high-probability bound on the deviation, we note that the random variable $\sup_{f\in\mathcal F}\frac1{n_0}\big|\sum_{i=1}^{n_0}f(x_i,g_i)-\mathbb Ef(x,g)\big|$ satisfies the bounded differences property with constant $1/n_0$, so, combining the preceding display with, e.g., [BLM13, Theorem 6.2], we obtain
\[
\sup_{f\in\mathcal F}\Big|\frac1{n_0}\sum_{i=1}^{n_0}f(x_i,g_i)-\mathbb Ef(x,g)\Big| \lesssim \sqrt{\frac dn}+\sqrt{\frac{\log(2/\delta)}{n_0}} \qquad\text{with probability}\ \ge1-\delta. \tag{36}
\]
Now, since $x\sim\mathcal N(0,I_d)$, we have for all $\theta\in\theta_\star+\mathbb S^{d-1}$ and all $v\in\mathbb S^{d-1}$ that
\[
\mathbb P\big(x^\top(\theta_\star-\theta)<0\big)=\frac12
\qquad\text{and}\qquad
\mathbb P\Big(|x^\top v|<\sqrt{\tfrac\pi8}\Big)\le\frac1{\sqrt{2\pi}}\cdot\sqrt{\tfrac\pi8}=\frac14
\quad\implies\quad
\mathbb P\Big(x^\top(\theta_\star-\theta)\ge0\ \text{and}\ |x^\top v|\ge\sqrt{\tfrac\pi8}\Big)\ge\frac14.
\]
Further, because $x$ and $g$ are independent, we have that $\mathbb Ef(x,g)\ge\frac14p_\ell$ for every $f\in\mathcal F_\ell$. Combining this with (36) proves the bound.

B.2 Efficient algorithm

So far we have shown the rates achievable by some $\hat\theta^{(k)}$. Now we show that an efficient estimator has a similar rate. We consider a two-step procedure. First, we observe that, because the loss which $\hat\theta^{(1)}$ minimizes is a strongly convex quadratic function, we can efficiently compute it to any desired accuracy. By the analysis of the proof of the upper bound of Theorem 5.1, we know that both
\[
\big\|\hat\theta^{(1)}-\theta_\star\big\|_2 \le \underbrace{C\sigma\big(1+\mathrm{poly}(\tau)\big)}_{=:R}
\qquad\text{and}\qquad
\big\|\hat\theta^{(k)}-\theta_\star\big\|_2 \le \underbrace{C\sigma\,\frac{\tau\log\log(1/\alpha)}{\sqrt{\log(1/\alpha)}}}_{=:\beta},
\]
for a sufficiently large constant $C$. Then, for the first step of the procedure, we compute $\tilde\theta_1$ satisfying $\|\tilde\theta_1-\hat\theta^{(1)}\|_2\le R$, which implies that $\|\tilde\theta_1-\hat\theta^{(k)}\|_2\le2R+\beta$. Recall, by Lemma B.2, that with probability at least $1-\delta$, $F_n^{(k)}$ is $\mu_n$-strongly convex. We then run the ellipsoid algorithm (see, e.g., [Bub15, Theorem 2.4]) on the domain $\mathcal B=B_2(\tilde\theta_1,2(R+\beta))$ until we obtain $\tilde\theta_2$ satisfying an excess risk bound on $F_n^{(k)}$ of at most $\mu_n\beta^2/2$, using a gradient oracle. Strong convexity implies that $\|\tilde\theta_2-\hat\theta^{(k)}\|_2\le\beta$, and so $\tilde\theta_2$ has the same error, up to constant factors, as $\hat\theta^{(k)}$. To establish the runtime, all that remains is to show that the set
\[
\mathcal C := \Big\{\theta : F_n^{(k)}(\theta)-F_n^{(k)}\big(\hat\theta_n^{(k)}\big)\le\frac{\mu_n\beta^2}2\Big\}
\]
contains a ball of large enough size. To this end, on the domain $\mathcal B$, we can crudely upper bound the operator norm of the Hessian $\nabla^2F_n^{(k)}(\theta)$ as
\[
\sup_{\theta\in\mathcal B}\big\|\nabla^2F_n^{(k)}(\theta)\big\|_2 \lesssim k^2\sup_{i\in[n]}\big(2(R+\beta)\|x_i\|_2+\sigma|g_i|\big)^{2k-2}\|x_i\|_2^2 \lesssim \Big(O(R+\beta+1)\sqrt{d+\log(n/\delta)}\Big)^{2k}.
\]
Thus $B_2\big(\hat\theta^{(k)},r\big)\subseteq\mathcal C$, where
\[
r \gtrsim \frac{q(1-\epsilon)\,\sigma^{2k-2}(2k+1)!!\,\beta^2}{\Big(O\big((R+\beta+1)\sqrt{d+\log(n/\delta)}\big)\Big)^{2k}}.
\]
Therefore, the ellipsoid portion of the algorithm has an oracle complexity of
\[
O\Bigg(d^2\log\Bigg(\frac{(R+\beta)\,\Big(O\big((R+\beta+1)\sqrt{d+\log(n/\delta)}\big)\Big)^{2k}}{q(1-\epsilon)\,\sigma^{2k-2}(2k+1)!!\,\beta^2}\Bigg)\Bigg), \tag{37}
\]
which is polynomial in $k$, $d$, and $n$.
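The two-step procedure can be sketched numerically. The following is a minimal illustration, assuming NumPy, with hypothetical values of $k$, the step sizes, and the iteration counts; for simplicity it uses gradient descent in place of the ellipsoid method for the polishing step, which is only a stand-in for the algorithm analyzed above.

```python
import numpy as np

def grad_F(theta, X, y, obs, k):
    """Gradient of F_n^(k)(theta) = (1/n) * sum_i obs_i * (y_i - x_i^T theta)^(2k)."""
    res = y - X @ theta
    return -(2 * k / len(y)) * (obs * res ** (2 * k - 1)) @ X

def two_step_estimator(X, y, obs, k, iters=2000, lr=0.05):
    """Warm-start on the k = 1 quadratic risk, then polish on the degree-2k risk."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):                  # step 1: strongly convex quadratic
        theta -= lr * grad_F(theta, X, y, obs, 1)
    lr2 = lr / (k * (2 * k - 1))            # crude smoothness rescaling of the step
    for _ in range(iters):                  # step 2: stand-in for the ellipsoid method
        theta -= lr2 * grad_F(theta, X, y, obs, k)
    return theta

# usage on synthetic data (hypothetical sizes):
rng = np.random.default_rng(1)
n, d, k = 2000, 5, 2
X = rng.standard_normal((n, d))
theta_star = np.ones(d) / np.sqrt(d)
y = X @ theta_star + rng.standard_normal(n)
obs = (rng.random(n) < 0.9).astype(float)
print(np.linalg.norm(two_step_estimator(X, y, obs, k) - theta_star))
```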
C Proofs of information-theoretic lower bounds

C.1 Lower bound constructions

For all lower bounds, we rely on hidden-direction distributions of the form $P_{A,v}$ (see Definition 2.7), in both the information-theoretic and SQ lower bounds. Crucial to proving these is the following lemma, which characterizes the bounds on the density of distributions in the contamination set of a given distribution.

Corollary C.1 (Corollary of Lemma 1.1). Let $P$ be a distribution over $\mathbb R^d$. Let $L=q(1-\epsilon)$ and $U=q(1-\epsilon)+\epsilon$. Let $b\in[L,U]$ be arbitrary. Then $Q\in\mathcal R_{\mathbb R}(P,\epsilon,q)$ if, for all $z\in\mathbb R^d$,
\[
\frac Lb \le \frac{dQ}{dP}(z) \le \frac Ub.
\]
Furthermore, suppose $Q$ satisfies the condition above for some $b\in[L,U]$. Consider the distribution $Q'$ over $\mathbb R_\star^d$ which outputs $\star^d$ with probability $1-b$ and $X\sim Q$ with probability $b$. Then $Q'\in\mathcal R(P,\epsilon,q)$.

Lemma C.2 (Mean estimation hard instance). Let $q\in(0,1]$, $\epsilon\in[0,1)$, and $\tau=\frac\epsilon{q(1-\epsilon)}$. Let $v\in\mathbb S^{d-1}$ and let $b,\gamma,R\in\mathbb R_{++}$ be such that
\[
b=q(1-\epsilon)\sqrt{1+\tau},\qquad \gamma^2\le0.25\log(1+\tau),\qquad R\ge2\gamma,\qquad\text{and}\qquad R\le\frac{\log(1+\tau)}{8\gamma}.
\]
With $\beta=\frac{\mathbb P(|G+\gamma|\le R)}{\mathbb P(|G|\le R)}$, let $A$ denote the distribution over $\mathbb R$ with density
\[
A(x) := \begin{cases}\beta\,\phi(x)&|x|\le R\\ \phi(x;\gamma)&\text{otherwise},\end{cases}\tag{38}
\]
and define $Q\in\mathcal P(\mathbb R_\star^d)$ as $Q(\{\star^d\})=1-b$ and $Q(x)=b\cdot P_{A,v}(x)$. Then $Q\in\mathcal R\big(\mathcal N(\gamma v,I_d),\epsilon,q\big)$. Furthermore, $|\beta-1|\lesssim\gamma e^{-R^2/4}$.

Proof. First observe that $A$ is a valid distribution, since it integrates to 1 on $\mathbb R$. Furthermore, we have that $\beta\le1$ and
\[
\beta \ge \min_{|x|\le R}\frac{\phi(x;\gamma)}{\phi(x)} = \min_{|x|\le R}\exp(\gamma x-\gamma^2/2) = \exp(-\gamma R-\gamma^2/2).
\]
Let $f(x)$ be the density of $P_{A,v}$ at a point $x$. Corollary C.1 states that $Q$ is a valid realizable contamination as long as $\mathbb P_Q(X=\star^d)=1-b$ with $b\in(L,U)$ and $\frac{f(x)}{\phi(x;\gamma)}\in(L',U')$ for $L'=L/b$ and $U'=U/b$, where $L=q(1-\epsilon)$ and $U=q(1-\epsilon)+\epsilon$. Observe that our choice of $b$ satisfies $b^2=LU$, and thus $|\log L'|=|\log U'|=\log\sqrt{1+\tau}$. Therefore, for $Q$ to be a valid realizable contamination of $\mathcal N(\gamma v,I)$, we require that
\[
\Big|\log\frac{f(x)}{\phi(x;\gamma)}\Big| \le 0.5\log(1+\tau)\qquad\text{for all }x\in\mathbb R^d.
\]
In fact, we shall establish the stronger condition that $\big|\log\frac{f(x)}{\phi(x;\gamma)}\big|\le0.25\log(1+\tau)$ for all $x\in\mathbb R^d$; this stronger requirement will be useful in proving SQ lower bounds. Observe that
\[
\frac{f(x)}{\phi(x;\gamma)} = \frac{\phi_{v^\perp}(x)\,A(v^\top x)}{\phi_{v^\perp}(x)\,\phi(v^\top x;\gamma)} = \frac{A(v^\top x)}{\phi(v^\top x;\gamma)}.
\]
We denote $x'=v^\top x$ in the rest of the proof for brevity. We thus require that $\big|\log\frac{A(x')}{\phi(x';\gamma)}\big|\le0.5\log(1+\tau)$ for all $x'\in\mathbb R$. By the definition of $A$ in (38), this obviously holds on the domain $|x'|>R$. For $|x'|\le R$, this ratio is equal to
\[
\max_{|x'|\le R}\log\frac{A(x')}{\phi(x';\gamma)} = \max_{|x'|\le R}\log\frac{\beta\phi(x')}{\phi(x';\gamma)} = \max_{|x'|\le R}\log\big(\beta e^{\gamma^2/2-\gamma x'}\big) = \log\big(\beta e^{\gamma^2/2+\gamma R}\big) \le \big|\log\beta e^{\gamma^2/2}\big|+\gamma R.
\]
It suffices if each term is at most $0.125\log(1+\tau)$, and observe that the second term, $\gamma R$, indeed satisfies this under the condition $R\le\log(1+\tau)/(8\gamma)$. For the first term, observe that $\log\beta e^{\gamma^2/2}\le\gamma^2/2$ since $\beta\le1$, which is also appropriately bounded under the condition on $\gamma^2$ listed in the statement. Finally, $\log\beta e^{\gamma^2/2}\ge\log\big(e^{-\gamma R-\gamma^2/2}e^{\gamma^2/2}\big)=-\gamma R$, and thus $-\log\beta e^{\gamma^2/2}\le\gamma R$, which, as we argued earlier, is indeed small. Therefore,
\[
\max_{x'\in\mathbb R}\Big|\log\frac{A(x')}{\phi(x';\gamma)}\Big| \le 0.25\log(1+\tau).
\]
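The uniform density-ratio bound just derived can be checked numerically for concrete parameters. Below is a minimal sketch, assuming SciPy; the values of $\epsilon$, $q$, and $\gamma$ are hypothetical, chosen to respect the constraints of Lemma C.2.

```python
import numpy as np
from scipy.stats import norm

eps, q = 0.05, 0.9                                 # hypothetical contamination parameters
tau = eps / (q * (1 - eps))
gamma = 0.4 * np.sqrt(0.25 * np.log(1 + tau))      # respects gamma^2 <= 0.25 log(1+tau)
R = np.log(1 + tau) / (8 * gamma)                  # respects R <= log(1+tau)/(8 gamma), R >= 2 gamma
beta = (norm.cdf(R - gamma) - norm.cdf(-R - gamma)) / (norm.cdf(R) - norm.cdf(-R))

def A(x):
    """Density (38): beta * phi(x) on |x| <= R, phi(x - gamma) outside."""
    return np.where(np.abs(x) <= R, beta * norm.pdf(x), norm.pdf(x - gamma))

x = np.linspace(-10, 10, 200001)
log_ratio = np.log(A(x)) - norm.logpdf(x - gamma)
print(f"max |log A/phi(.;gamma)| = {np.abs(log_ratio).max():.4f}"
      f"  vs  0.25*log(1+tau) = {0.25 * np.log(1 + tau):.4f}")
```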
Finally, we state an upper bound on $|\beta-1|$, which is equal to
\[
|\beta-1| = \frac{\big|\mathbb P(|G+\gamma|\le R)-\mathbb P(|G|\le R)\big|}{\mathbb P(|G|\le R)} \lesssim \big|\mathbb P\big(G\in[-R-\gamma,R-\gamma]\big)-\mathbb P\big(G\in[-R,R]\big)\big| \lesssim \mathbb P\big(G\in[R-\gamma,R]\big) \le \gamma\,\phi(R/2) \lesssim \gamma e^{-R^2/4},
\]
where we use that $R\ge2\gamma$.

Lemma C.3 (Covariance estimation hard instance). Let $q\in(0,1]$, $\epsilon\in[0,1)$, $v\in\mathbb S^{d-1}$, and let $b,\gamma,R\in\mathbb R_{++}$ satisfy
\[
b=q(1-\epsilon)\sqrt{1+\tau},\qquad\gamma\le\tau,\qquad\text{and}\qquad R^2\le\frac{1+\gamma}\gamma\log(1+\tau).
\]
With $\beta=\frac{\mathbb P(\sqrt{1+\gamma}\,|G|\le R)}{\mathbb P(|G|\le R)}$, let $A$ denote the distribution over $\mathbb R$ with density
\[
A(x) := \begin{cases}\beta\,\phi(x)&|x|\le R\\ \phi(x;0,1+\gamma)&\text{otherwise},\end{cases}\tag{39}
\]
and define $Q\in\mathcal P(\mathbb R_\star^d)$ as $Q(\{\star^d\})=1-b$ and $Q(x)=b\cdot P_{A,v}(x)$. Then $Q\in\mathcal R\big(\mathcal N(0,I_d+\gamma vv^{\mathsf T}),\epsilon,q\big)$. Furthermore,
\[
|\beta-1| \lesssim \min\Big\{1,\;R\cdot\Big(1-\frac1{\sqrt{1+\gamma}}\Big)\Big\}\,e^{-\Omega(R^2/(1+\gamma))}.
\]

Proof. Observe that $\beta$ is chosen to ensure that $A$ is a valid probability distribution. Moreover, by definition, $\beta$ satisfies the bounds
\[
\beta\le1\qquad\text{and}\qquad\beta\ge\min_{|x|\le R}\frac{\frac1{\sqrt{1+\gamma}}e^{-x^2/2(1+\gamma)}}{e^{-x^2/2}}=\frac1{\sqrt{1+\gamma}}.\tag{40}
\]
With $\Sigma=I_d+\gamma vv^\top$ and following the proof of Lemma C.2, by our choice of $b$ it suffices to show that, for all $x\in\mathbb R^d$,
\[
\Big|\log\frac{f(x)}{\phi(x;0,\Sigma)}\Big| \le \log\sqrt{1+\tau},
\]
where $f$ denotes the density of the hidden-direction distribution $P_{A,v}$. Using the definition of $P_{A,v}$, the likelihood ratio is equal to
\[
\frac{f(x)}{\phi(x;0,\Sigma)} = \frac{A(x')}{\phi(x';0,1+\gamma)},\qquad\text{where }x'=v^\top x.
\]
From the definition of $A$, it is clear that the above likelihood ratio is equal to 1 when $|x'|>R$, and thus the desired bound is satisfied there. On the other hand, for $|x'|\le R$, we have
\[
\log\frac{A(x')}{\phi(x';0,1+\gamma)} = \log\big(\beta\sqrt{1+\gamma}\big)+\log e^{-\frac{x'^2\gamma}{2(1+\gamma)}} = \log\big(\beta\sqrt{1+\gamma}\big)-\frac{x'^2\gamma}{2(1+\gamma)}.
\]
In order to show that $Q\in\mathcal R\big(\mathcal N(0,I_d+\gamma vv^{\mathsf T}),\epsilon,q\big)$, it remains to establish the pair of inequalities
\[
\log\big(\beta\sqrt{1+\gamma}\big) \le \log\sqrt{1+\tau}
\qquad\text{and}\qquad
\log\big(\beta\sqrt{1+\gamma}\big)-\frac{R^2\gamma}{2(1+\gamma)} \ge -\log\sqrt{1+\tau}.
\]
The first inequality follows by applying the upper bound $\beta\le1$ in (40) in conjunction with the assumption $\gamma\le\tau$. Turning to the second inequality, we have
\[
\log\big(\beta\sqrt{1+\gamma}\big)-\frac{R^2\gamma}{2(1+\gamma)} \ge -\frac{R^2\gamma}{2(1+\gamma)} \ge -\log\sqrt{1+\tau},
\]
where the first step follows from the lower bound $\beta\ge\frac1{\sqrt{1+\gamma}}$ in (40), and the final step follows from the assumption $R^2\le\frac{1+\gamma}\gamma\log(1+\tau)$. Finally, we obtain a bound on $|\beta-1|$, which holds under the condition $R\gtrsim\sqrt{1+\gamma}$. In particular, expanding yields
\[
|\beta-1| = \frac{\mathbb P(|G|\le R)-\mathbb P\big(|G|\le\frac R{\sqrt{1+\gamma}}\big)}{\mathbb P(|G|\le R)} = \frac{\mathbb P\big(|G|\in\big[\frac R{\sqrt{1+\gamma}},R\big]\big)}{\mathbb P(|G|\le R)} \lesssim \mathbb P\Big(|G|\in\Big[\frac R{\sqrt{1+\gamma}},R\Big]\Big)
\]
\[
\lesssim \min\Big\{\mathbb P\Big(|G|\ge\frac R{\sqrt{1+\gamma}}\Big),\;\Big(R-\frac R{\sqrt{1+\gamma}}\Big)\cdot\phi\Big(\frac R{\sqrt{1+\gamma}}\Big)\Big\} \lesssim \min\Big\{1,\;R\cdot\Big(1-\frac1{\sqrt{1+\gamma}}\Big)\Big\}e^{-\Omega(R^2/(1+\gamma))}.
\]

Lemma C.4 (Linear regression hard instance). Let $q\in(0,1]$, $\epsilon\in[0,1)$, and $\tau=\frac\epsilon{q(1-\epsilon)}$. Let $v\in\mathbb S^{d-1}$ and let $b,\gamma,r,R\in\mathbb R_{++}$ be such that
\[
b=q(1-\epsilon)\sqrt{1+\tau},\qquad\gamma^2r^2\le0.5\log(1+\tau),\qquad\text{and}\qquad R\le\frac{\log(1+\tau)}{8\gamma r}.
\]
Let $P_*$ denote the distribution over $(X,y)$ with $X\sim\mathcal N(0,I_d)$ and $y\mid X\sim\mathcal N(\gamma v^\top X,1)$. For all $x\in\mathbb R^d$, with $\beta_x\in(0,1]$ defined implicitly below, let $A_x$ denote the distribution over $\mathbb R$ with density
\[
A_x(y) := \begin{cases}\beta_x\,\phi(y)&|v^\top x|\le r\text{ and }|y|\le R\\ \phi(y;\gamma v^\top x,1)&\text{otherwise},\end{cases}\tag{41}
\]
and define $Q\in\mathcal P(\mathbb R_\star^{d+1})$ as $Q(\{\star^{d+1}\})=1-b$ and $Q(x,y)=b\cdot\phi(x)A_x(y)$. Then $Q\in\mathcal R(P_*,\epsilon,q)$.

Proof.
For any $x$, $A_x$ is a valid distribution since it integrates to 1: (i) this is obvious if $|v^\top x|>r$, and (ii) for $|v^\top x|\le r$, it is satisfied if we choose
\[
\beta_x\,\mathbb P(|G|\le R)+1-\mathbb P\big(|G+\gamma x^\top v|\le R\big)=1 \iff \beta_x=\frac{\mathbb P(|G+\gamma x^\top v|\le R)}{\mathbb P(|G|\le R)}.
\]
Note that, for all $x$, $\beta_x\in(0,1]$. Using the same argument as in the proof of Lemma C.2 for a fixed $x$, we get that
\[
1 \ge \beta_x \ge \min_{|y|\le R}\frac{\phi(y;\gamma x^\top v,1)}{\phi(y)} = \min_{|y|\le R}\frac{e^{-(y-\gamma x^\top v)^2/2}}{e^{-y^2/2}} = \min_{|y|\le R}e^{-\gamma^2(x^\top v)^2/2+\gamma yx^\top v} = e^{-\gamma^2(x^\top v)^2/2-\gamma R|x^\top v|}.
\]
Let $f(x,y)$ be the conditional density of $Q$ under the event $\{(x,y)\ne\star^{d+1}\}$, and let $p(x,y)$ denote the density of the linear model $X\sim\mathcal N(0,I_d)$, $y\sim\mathcal N(\gamma v^\top X,1)$. As earlier, we want the following likelihood ratio to lie in the range prescribed by Corollary C.1:
\[
\frac{f(x,y)}{p(x,y)} = \frac{A_x(y)}{\phi(y;\gamma v^\top x,1)} = \begin{cases}\dfrac{\beta_x\phi(y)}{\phi(y;\gamma v^\top x,1)}&\text{if }|y|\le R\text{ and }|v^\top x|\le r\\[2mm] 1&\text{otherwise}.\end{cases}
\]
By Corollary C.1 and our choice of $b$, we are done once we establish that the absolute value of the logarithm of this density ratio is uniformly upper bounded by $0.5\log(1+\tau)$. As per the definition, this needs to be checked only when $|y|\le R$ and $|v^\top x|\le r$. For such $(x,y)$, the ratio is
\[
\frac{f(x,y)}{p(x,y)} = \beta_x\frac{e^{-y^2/2}}{e^{-(y-\gamma x^\top v)^2/2}} = \beta_x\,e^{\gamma^2(x^\top v)^2/2}\,e^{-\gamma yx^\top v},
\]
and therefore the logarithm satisfies
\[
\log\frac{f(x,y)}{p(x,y)} = \log\big(\beta_xe^{\gamma^2(x^\top v)^2/2}\big)-\gamma yx^\top v.
\]
It would suffice if the absolute value of each term is at most $0.25\log(1+\tau)$. The second term is at most $\gamma Rr$ in absolute value, which by assumption is bounded appropriately. For the first term, we use that $e^{-\gamma^2(x^\top v)^2/2-\gamma R|x^\top v|}\le\beta_x\le1$ to get
\[
\log\big(\beta_xe^{\gamma^2(x^\top v)^2/2}\big) \le \gamma^2r^2/2
\qquad\text{and}\qquad
\log\big(\beta_xe^{\gamma^2(x^\top v)^2/2}\big) \ge \log e^{-2\gamma R|x^\top v|} = -2\gamma R|x^\top v|.
\]
Thus, $\big|\log\big(\beta_xe^{\gamma^2(x^\top v)^2/2}\big)\big|\le\max(\gamma^2r^2/2,\,2\gamma Rr)$, which is also bounded appropriately.

C.2 Information-theoretic lower bounds

C.2.1 Mean estimation: Proof of Theorem 3.1 lower bound

We prove the claim for general $q\in(0,1]$. We introduce the quantities
\[
\tau=\frac\epsilon{q(1-\epsilon)},\qquad\tau'=\log(1+\tau),\qquad\text{and}\qquad b=q(1-\epsilon)\sqrt{1+\tau}.
\]
We claim that it suffices to prove that
\[
\mathfrak M := \mathfrak M_n\big(\delta,\mathcal P_{\mathrm{mean}}(\sigma^2,\epsilon,q),L\big) \gtrsim T_{\mathrm{parametric}}\vee T_{\mathrm{dimension}}\vee T_{\mathrm{confidence}},\tag{42}
\]
where
\[
T_{\mathrm{parametric}} := \sqrt{\frac{d+\log(1/\delta)}{nq(1-\epsilon)}},\qquad
T_{\mathrm{dimension}} := \sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{cnb\tau'^2}d\big)}},\qquad
T_{\mathrm{confidence}} := \sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{Cnb\tau'^2}{\log(1/2\delta)}\big)}},
\]
for a sufficiently large constant $C$. We defer the proof of this reduction to the end of the section. We next prove each lower bound in turn.

Proof of $\mathfrak M\gtrsim T_{\mathrm{parametric}}$: We consider the contamination that censors everything, so that the resulting distribution is MCAR with probability of observation equal to $q(1-\epsilon)$. Then $\mathfrak M\gtrsim T_{\mathrm{parametric}}$ is essentially the standard minimax mean estimation lower bound combined with the fact that there are $O(nq(1-\epsilon))$ non-missing samples with high probability; see [MVBWS24, Prop. 47] for a formal argument.

Proof of $\mathfrak M\gtrsim T_{\mathrm{dimension}}$: This case forms the technical bulk of the proof and proceeds via Fano's inequality. We begin by defining several parameters. We will let $\gamma:=\gamma(\epsilon,q,n,d)$ denote our separation parameter. Throughout, we will impose the constraint
\[
\gamma \le \frac{\sqrt{\tau'}}{10},\tag{43}
\]
and set a truncation threshold $R=\tau'/(10\gamma)$.
Note that, from the constraint (43), we deduce the bound
\[
\gamma \le \frac{\sqrt{\tau'}}{10} \le \frac R{10} \le \frac R2.\tag{44}
\]
Equipped with these parameters, we let $\mathcal N$ denote a $1/2$-packing of $\mathbb S^{d-1}$, which, by, e.g., [Ver25, Corollary 4.2.11], exists and satisfies the cardinality bound $|\mathcal N|\le5^d$. For each $v\in\mathcal N$, it suffices to construct distributions $P_v$ which satisfy the pair of desiderata:

1. (Valid contamination) $P_v\in\mathcal R\big(\mathcal N(\gamma v,I_d),\epsilon,q\big)$.
2. (Equal probability of missing mass) The missingness probability $P_v\big(\{\star\}^d\big)=1-b$.

We will additionally define a central distribution $H\in\mathcal P\big(\mathbb R^d\cup\{\star^d\}\big)$ as
\[
H\big(\{\star^d\}\big)=1-b\qquad\text{and}\qquad H(x)=b\cdot\phi_d(x)\ \ \text{for all }x\in\mathbb R^d.\tag{45}
\]
By Fano's inequality (Lemma 2.10), it suffices to construct distributions $P_v$ which satisfy the two desiderata above and such that, for all $v\in\mathcal N$, it holds that $\mathrm{KL}(P_v,H)\le\frac d{2n}$. Indeed, if this KL bound holds, then, by tensorization of the KL divergence,
\[
\frac{\frac1{|\mathcal N|}\sum_{v\in\mathcal N}\mathrm{KL}\big(P_v^{\otimes n},H^{\otimes n}\big)-\log(2-|\mathcal N|^{-1})}{\log|\mathcal N|} \le \frac{d/2}{d\log5} < \frac12.
\]
We take $P_v$ to be the hard distribution $Q$ from Lemma C.2, with all lemma parameters chosen identically to this theorem. The remainder of this proof is dedicated to establishing the bound on $\mathrm{KL}(P_v,H)$. Since the probability of observing $\{\star^d\}$ is identical under both $P_v$ and $H$, and since $P_v$ and $H$ are supported on $\{\star^d\}\cup\mathbb R^d$, we have
\[
\mathrm{KL}(P_v,H) = b\cdot\mathrm{KL}(P'_v,H'),\tag{46}
\]
where $P'_v=P_{A,v}$ and $H'=\mathcal N(0,I_d)$ denote the respective conditional distributions over $\mathbb R^d$. Denoting $x':=v^\top x$ and $\bar x$ the projection of $x$ onto $v^\perp$, observe that the density of $P'_v$ is $p'_v(x)=\phi_{d-1}(\bar x)A(v^\top x)$, and that of $H'$ is $h'(x)=\phi_{d-1}(\bar x)\phi(v^\top x)$. Applying the definition of $A$ in (38) yields
\[
\frac{p'_v(x)}{h'(x)} = \frac{A(x')}{\phi(x')} = \begin{cases}\beta&\text{if }|x'|\le R\\ e^{-0.5\gamma^2+\gamma x'}&\text{if }|x'|>R.\end{cases}
\]
In turn, this implies that the log-likelihood ratio satisfies
\[
\log\frac{p'_v(x)}{h'(x)} = (\log\beta)\mathbb 1_{|x'|\le R}+\big(-0.5\gamma^2+\gamma x'\big)\mathbb 1_{|x'|>R} \le \gamma x'\,\mathbb 1_{|x'|>R},
\]
where we have used the bound $\beta\le1$. The expectation of $\log\frac{p'_v(x)}{h'(x)}$ under $P'_v$ is exactly equal to $\mathrm{KL}(P'_v,H')$. Since the upper bound depends only on $x'\sim A$, it is equivalent to take an expectation of this term over $x'\sim A$, which, as per (38), yields
\[
\mathrm{KL}(P'_v,H') \le \mathbb E_{X\sim A}\big[\gamma X\mathbb 1_{|X|>R}\big] = \gamma\,\mathbb E\big[(G+\gamma)\mathbb 1_{|G+\gamma|>R}\big] = \gamma^2\,\mathbb P(|G+\gamma|>R)+\gamma\,\mathbb E\big[G\mathbb 1_{|G+\gamma|>R}\big].\tag{47}
\]
To upper bound the first term, we use that $\gamma<R/2$ and obtain $\mathbb P(|G+\gamma|>R)\le\mathbb P(|G|>R/2)\lesssim e^{-\Omega(R^2)}$. For the second term, we use the following inequalities, which rely on symmetry:
\[
\mathbb E\big[G\mathbb 1_{|G+\gamma|>R}\big] = \mathbb E\big[G\mathbb 1_{|G|>R}\big]+\mathbb E\big[G\big(\mathbb 1_{|G+\gamma|>R}-\mathbb 1_{|G|\ge R}\big)\big] = \mathbb E\big[G\big(\mathbb 1_{|G+\gamma|>R}-\mathbb 1_{|G|\ge R}\big)\big]
\]
\[
\le \mathbb E\big[|G|\,\big|\mathbb 1_{|G+\gamma|>R}-\mathbb 1_{|G|\ge R}\big|\big] \le \mathbb E\big[|G|\mathbb 1_{|G|\in[R-\gamma,R+\gamma]}\big] \le 4\gamma\max_{x\in[R/2,\,3R/2]}x\phi(x) \lesssim \gamma e^{-\Omega(R^2)},\tag{48}
\]
where we use that $xe^{-x^2/2}\lesssim e^{-x^2/4}$ for all $x>0$. Combining Equations (46) to (48), we deduce the inequality
\[
\mathrm{KL}(P_v,H) \lesssim b\gamma^2e^{-\Omega(R^2)} \lesssim b\gamma^2e^{-c\tau'^2/\gamma^2} \lesssim b\tau'^2\,ye^{-1/y},\tag{49}
\]
where we use the value $R=\tau'/(10\gamma)$ and define $y:=\frac{\gamma^2}{c\tau'^2}$, for a sufficiently small constant $c$. For this to be less than $\frac d{2n}$, it suffices that
\[
ye^{-1/y} \le \rho' := \frac d{Cb\tau'^2n}
\]
for a sufficiently large constant $C$.
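This scalar condition can also be inverted numerically, before appealing to the Lambert $W$ function below. A minimal sketch, assuming SciPy; the value of $\rho'$ is hypothetical:

```python
import numpy as np
from scipy.special import lambertw
from scipy.optimize import brentq

rho_p = 1e-3                                  # hypothetical value of rho'
# Largest y in (0, 1] with y * exp(-1/y) <= rho', found by root-finding
# (the map y -> y * exp(-1/y) is strictly increasing on (0, infty)):
y_star = brentq(lambda y: y * np.exp(-1 / y) - rho_p, 1e-6, 1.0)
# Analytic inversion via the principal Lambert W branch, as in the text:
y_W = 1 / np.real(lambertw(1 / rho_p))
print(f"root-finding: {y_star:.6f},  1/W0(1/rho'): {y_W:.6f},"
      f"  1/log(3 + 1/rho'): {1 / np.log(3 + 1 / rho_p):.6f}")
```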
That is, $y\le\frac1{W_0(1/\rho')}$, where $W_0(x)$ is the Lambert $W$ function (principal branch), which is the unique positive value such that $W_0(x)e^{W_0(x)}=x$ for $x\ge0$. Observe that $\log(3+x)\,e^{\log(3+x)}>x+3>x$, and thus $W_0(x)\le\log(3+x)$. This condition is therefore always satisfied whenever
\[
\gamma \lesssim \tau'\sqrt y = \frac{\tau'}{\sqrt{W_0\big(\frac{Cnb\tau'^2}d\big)}}.
\]
And since $\gamma$ must also satisfy $\gamma\lesssim\sqrt{\tau'}$ for all of these calculations to be valid, we get the desired lower bound for mean estimation:
\[
\mathfrak M \gtrsim \gamma \gtrsim \sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{Cnb\tau'^2}d\big)}} = T_{\mathrm{dimension}},\tag{50}
\]
which concludes this part of the proof.

Proof of $\mathfrak M\gtrsim T_{\mathrm{confidence}}$: We will use the same constructions as in the previous part of the proof. To this end, note that $H\in\mathcal R(\mathcal N(0,I_d),\epsilon,q)$. We let $v$ be any unit vector and let $Q$ be the distribution from Lemma C.2. Applying the Bretagnolle–Huber inequality in conjunction with the KL bound from (49) yields
\[
\mathrm{TV}\big(Q^{\otimes n},H^{\otimes n}\big) \le \sqrt{1-e^{-n\,\mathrm{KL}(Q,H)}} \le \sqrt{1-\exp\big(-n\cdot b\tau'^2ye^{-1/y}\big)}.
\]
Following the same steps as in the previous part, ensuring $ye^{-1/y}\lesssim\frac{\log(1/2\delta)}{b\tau'^2n}$ implies that $\mathrm{TV}(Q^{\otimes n},H^{\otimes n})\le1-2\delta$. Hence, we apply [MVS24, Lemma 5] to deduce that
\[
\mathfrak M \gtrsim \sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{Cnb\tau'^2}{\log(1/2\delta)}\big)}},\tag{51}
\]
as desired.

Proof of the reduction (42): Re-phrasing (42), we have the minimax lower bound
\[
\mathfrak M \gtrsim \alpha+\sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{C\sqrt{1+\tau}\,\tau'^2}{\alpha^2}\big)}},
\]
where $\alpha=\sqrt{\frac{d+\log(1/\delta)}{nq(1-\epsilon)}}\le1/\sqrt{C_1}$ and $C$ is a sufficiently large constant. Let $C'$ be a sufficiently large constant that we will specify later. Then, by considering several cases, we will show that this is equivalent to the claimed lower bound in Theorem 3.1.

• Case 1: $\tau\le C'\alpha$. This regime is immediate, because $\frac{\log(1+\tau)}{\sqrt{\log(1+\tau^2/\alpha^2)}}\asymp\frac\tau{\sqrt{\tau^2/\alpha^2}}=\alpha\lesssim\mathfrak M$.

• Case 2: $C'\alpha\le\tau\le1$. Then, because $\tau'\le\tau\le1$, we have that
\[
\sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{C\sqrt{1+\tau}\tau'^2}{\alpha^2}\big)}} = \frac{\tau'}{\sqrt{\log\big(3+\frac{C\sqrt{1+\tau}\tau'^2}{\alpha^2}\big)}} \ge \frac{\log(1+\tau)}{\sqrt{\log\big(3+C\sqrt2\,\tau^2/\alpha^2\big)}} \ge \frac{\log(1+\tau)}{\sqrt{2\log(1+\tau^2/\alpha^2)}},
\]
where the final inequality follows from $\tau/\alpha\ge C'$ by picking $C'$ large enough (depending on $C$).

• Case 3: $1\le\tau\le\alpha^{-2}/C_1$. In this regime, $\log\big(3+\frac{C\sqrt{1+\tau}\tau'^2}{\alpha^2}\big)\asymp\log(1/\alpha)$, so that
\[
\sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{C\sqrt{1+\tau}\tau'^2}{\alpha^2}\big)}} \asymp \sqrt{\log(1+\tau)}\cdot\bigg[1\wedge\sqrt{\frac{\log(1+\tau)}{\log(1/\alpha)}}\bigg].
\]
Then, because $\tau\le\alpha^{-2}/C_1$, we have $\log(1+\tau)\lesssim\log(1/\alpha)$, and hence
\[
\sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{C\sqrt{1+\tau}\tau'^2}{\alpha^2}\big)}} \gtrsim \frac{\log(1+\tau)}{\sqrt{\log(1/\alpha)}} \asymp \frac{\log(1+\tau)}{\sqrt{\log(1+\tau^2/\alpha^2)}},
\]
where the final equality (up to constant factors) follows because $1\le\tau\le\alpha^{-2}/C_1$.

• Case 4: $\tau>\alpha^{-2}/C_1$. In this regime, $\log\big(3+\frac{C\sqrt{1+\tau}\tau'^2}{\alpha^2}\big)\asymp\log(1+\tau)$, so that
\[
\sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{C\sqrt{1+\tau}\tau'^2}{\alpha^2}\big)}} \asymp \sqrt{\log(1+\tau)}.
\]
Moreover, $\log(1+\tau^2/\alpha^2)\asymp\log(1+\tau)$ because $\tau>\alpha^{-2}/C_1$. Thus,
\[
\sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{C\sqrt{1+\tau}\tau'^2}{\alpha^2}\big)}} \asymp \frac{\log(1+\tau)}{\sqrt{\log(1+\tau^2/\alpha^2)}}.
\]
This completes the reduction, proving the claim.

C.2.2 Covariance estimation: Proof of Theorem 3.2 lower bound

We prove the claim for general $q\in(0,1]$. The proof will closely mirror the structure of the proof of the lower bound of Theorem 3.1. We introduce the quantities
\[
\tau=\frac\epsilon{q(1-\epsilon)},\qquad\tau'=\log(1+\tau),\qquad\text{and}\qquad b=q(1-\epsilon)\sqrt{1+\tau}.
\]
We claim that it suffices to prove that
\[
\mathfrak M := \mathfrak M_n\big(\delta,\mathcal P_{\mathrm{cov}}(\epsilon,q),L\big) \ge T_{\mathrm{parametric}}\vee T_{\mathrm{dimension}}\vee T_{\mathrm{confidence}},\tag{52}
\]
where
\[
T_{\mathrm{parametric}} := \sqrt{\frac{d+\log(1/\delta)}{nq(1-\epsilon)}},\qquad
T_{\mathrm{dimension}} := \tau'\wedge\frac{\tau'}{\log\big(3+\frac{cnq(1-\epsilon)\sqrt{1+\tau}\,\tau'}d\big)},\qquad
T_{\mathrm{confidence}} := \tau'\wedge\frac{\tau'}{\log\big(3+\frac{cnq(1-\epsilon)\sqrt{1+\tau}\,\tau'}{\log(1/4\delta)}\big)},
\]
for a sufficiently large constant $c$. As in the mean estimation case, the lower bound $\mathfrak M\ge T_{\mathrm{parametric}}$ is standard, and we omit its proof. We next prove each of the lower bounds $\mathfrak M\ge T_{\mathrm{dimension}}$ and $\mathfrak M\ge T_{\mathrm{confidence}}$ in turn.

Proof of $\mathfrak M\ge T_{\mathrm{dimension}}$: Let $\gamma\in[0,1]$ denote the target minimax error, to be specified later. Let $\mathcal N$ be a $1/2$-packing of unit vectors, as in the proof of Theorem 3.1(b), satisfying $|\mathcal N|\le5^d$. For each $v\in\mathcal N$, we will construct hypotheses $P_v$ which satisfy the pair of desiderata
\[
P_v\in\mathcal R\big(\mathcal N(0,I_d+\gamma vv^\top),\epsilon,q\big)\qquad\text{and}\qquad P_v\big(\{\star^d\}\big)=1-b.
\]
By Lemma G.4, for any distinct $v,\bar v\in\mathcal N$, we have that, for all matrices $M$, $L(M,I_d+\gamma vv^\top)\vee L(M,I_d+\gamma\bar v\bar v^\top)>\gamma/16$. Hence, any estimator $\hat\Sigma$ with worst-case risk at most $\gamma/64$ must satisfy $\mathbb P_{P_v}\big(L(\hat\Sigma,I_d+\gamma vv^\top)\ge\gamma/16\big)\le1/4$ for all $v\in\mathcal N$, and can be used to recover $v\in\mathcal N$ with probability at least $3/4$. To prove that recovering $v\in\mathcal N$ with probability at least $3/4$ is impossible, it will suffice, by Fano's inequality (Lemma 2.10), to show that there exists a distribution $H$ such that $\mathrm{KL}(P_v,H)\lesssim\frac dn$ for all $v\in\mathcal N$. We will take $P_v$ to be the distribution $Q$ from Lemma C.3 with $R^2=\frac{1+\gamma}\gamma\log(1+\tau)$, and the common distribution $H$ to be defined as
\[
H\big(\{\star^d\}\big)=1-b\qquad\text{and}\qquad H(x)=b\cdot\phi_d(x)\ \ \text{for all }x\in\mathbb R^d.\tag{53}
\]
Since $P_v(\{\star^d\})=H(\{\star^d\})=1-b$ and both distributions are supported on $\mathbb R^d\cup\{\star^d\}$, we deduce that
\[
\mathrm{KL}(P_v,H) = b\cdot\mathrm{KL}(P'_v,H'),\tag{54}
\]
where $P'_v$ and $H'$ denote the distributions of $P_v$ and $H$ conditioned on belonging to $\mathbb R^d$. We next introduce the shorthand $x':=v^\top x$ and let $p'_v$ and $h'$ denote the densities of $P'_v$ and $H'$. Expanding their definitions, we simplify the likelihood ratio as
\[
\frac{p'_v(x)}{h'(x)} = \frac{A(x')}{\phi(x')} = \begin{cases}\beta&\text{if }|x'|\le R\\ \frac1{\sqrt{1+\gamma}}e^{\frac{\gamma|x'|^2}{2(1+\gamma)}}&\text{if }|x'|>R.\end{cases}
\]
Hence, we find that the log-likelihood ratio is bounded as
\[
\log\frac{p'_v(x)}{h'(x)} = (\log\beta)\mathbb 1_{|x'|\le R}+\Big(\frac{\gamma|x'|^2}{2(1+\gamma)}-\log\sqrt{1+\gamma}\Big)\mathbb 1_{|x'|>R} \le \frac{\gamma|x'|^2}{2(1+\gamma)}\mathbb 1_{|x'|>R},\tag{55}
\]
where we have additionally used the inequality $\beta\le1$, which holds by construction. Moreover, our construction implies that, if $X'\sim A$,
\[
X'\cdot\mathbb 1_{|X'|>R} \overset{\mathrm{dist}}= \sqrt{1+\gamma}\cdot G\,\mathbb 1_{\sqrt{1+\gamma}\,|G|>R}.
\]
Taking expectations with respect to the random variable $X'\sim A$, it thus follows from the upper bound (55) that
\[
\mathrm{KL}(P'_v,H') \le \frac\gamma2\,\mathbb E\big[G^2\mathbb 1_{\sqrt{1+\gamma}\,|G|>R}\big] \overset{(a)}\lesssim \gamma\sqrt{\mathbb P\big(\sqrt{1+\gamma}\,|G|>R\big)} \lesssim \gamma e^{-\Omega(R^2/(1+\gamma))},\tag{56}
\]
where step $(a)$ follows upon applying the Cauchy–Schwarz inequality. Combining Equations (54) and (56) then yields the upper bound
\[
\mathrm{KL}(P_v,H) \lesssim b\gamma e^{-\Omega(R^2/(1+\gamma))} \lesssim b\gamma e^{-\Omega(\log(1+\tau)/\gamma)},\tag{57}
\]
where the final inequality follows since, by construction, $R^2=\frac{1+\gamma}\gamma\log(1+\tau)$. To obtain the desired lower bound, we set
\[
\gamma = \min\bigg\{c,\;\frac{\log(1+\tau)}{\log\big(\frac{q(1-\epsilon)n}{cd}\big)}\bigg\},
\]
for a sufficiently small constant $c\le1$. Note that, from the numeric inequality $\log(1+x)\le x$, this setting ensures that $\gamma\le\tau$.
Hence, substituting this into the inequality (57), we obtain
\[
\mathrm{KL}(P_v,H) \lesssim b\gamma e^{-\Omega(\log(1+\tau)/\gamma)} \lesssim q(1-\epsilon)\sqrt{1+\tau}\,\exp\bigg(-\Omega\Big(\frac{\log(1+\tau)}c\vee\log\Big(\frac{q(1-\epsilon)n}{cd}\Big)\Big)\bigg) \lesssim \frac dn,
\]
where the last inequality follows by taking $c$ sufficiently small and because $\frac d{nq(1-\epsilon)}\le1$. It thus follows from Fano's inequality that $\mathfrak M\ge T_{\mathrm{dimension}}$.

Proof of $\mathfrak M\ge T_{\mathrm{confidence}}$: We will use the same constructions as in the previous part of the proof. To this end, note that $H\in\mathcal R(\mathcal N(0,I_d),\epsilon,q)$. We let $v$ be any unit vector, and let $Q$ be the distribution from Lemma C.3. In this case, we take
\[
\gamma = \min\bigg\{\tau',\;\frac{\tau'}{\log\big(\frac{b\tau'n}{C\log(1/(4\delta))}\big)}\bigg\} = \min\bigg\{\tau',\;\frac{\tau'}{\log\big(\frac{q(1-\epsilon)\sqrt{1+\tau}\,\tau'n}{C\log(1/(4\delta))}\big)}\bigg\}.
\]
Applying the KL bound (57) and substituting the setting of $\gamma$ above yields
\[
\mathrm{KL}(Q,H) \lesssim b\gamma e^{-\Omega(\tau'/\gamma)} = q(1-\epsilon)\sqrt{1+\tau}\,\gamma e^{-\Omega(\tau'/\gamma)} \le \frac{c\log(1/(4\delta))}n.
\]
Hence, adjusting constants and applying the Bretagnolle–Huber inequality, we obtain
\[
\mathrm{TV}\big(Q^{\otimes n},H^{\otimes n}\big) \le \sqrt{1-e^{-n\,\mathrm{KL}(Q,H)}} \le 1-2\delta.
\]
We conclude by applying a high-probability variant of Le Cam's method [MVS24, Lemma 5], which yields the lower bound $\mathfrak M\ge\gamma=T_{\mathrm{confidence}}$, as desired.

Proof of the reduction (52): Re-phrasing (52), we have the minimax lower bound
\[
\mathfrak M \gtrsim \alpha+\min\bigg\{c,\;\frac{\log(1+\tau)}{\log(1/c\alpha^2)}\bigg\} \asymp \alpha+\frac{\log(1+\tau)}{\log(1+\tau)+\log(1/\alpha)},
\]
where $\alpha=\sqrt{\frac{d+\log(1/\delta)}{nq(1-\epsilon)}}\le1/\sqrt{C_1}$ and $c$ is a sufficiently small constant. By considering several cases, we will show that this is equivalent to the claimed lower bound in Theorem 3.2.

• Case 1: $\tau\lesssim\alpha\log(1/\alpha)$. This regime is immediate, because $\frac{\log(1+\tau)}{\log(1+\tau/\alpha^2)}\asymp\alpha\lesssim\mathfrak M$.

• Case 2: $\alpha\log(1/\alpha)\lesssim\tau\lesssim\alpha^{-2}$. This regime follows because both $\log(1/c\alpha^2)$ and $\log(1+\tau/\alpha^2)$ are within constant factors of $\log(1/\alpha)$.

• Case 3: $\tau\gtrsim\alpha^{-2}$. This regime is immediate, because $\log(1+\tau)\asymp\log(1+\tau/\alpha^2)\gtrsim\log(1/\alpha)$.

Because we have covered all cases of $\tau$, the claim follows.

C.2.3 Linear regression: Proof of Theorem 5.1 lower bound

Proof. We prove the claim for general $q\in(0,1]$. We can assume without loss of generality that $\sigma^2=1$, as otherwise we can scale the responses by $\sigma^{-1}$. We introduce the quantities
\[
\tau=\frac\epsilon{q(1-\epsilon)},\qquad\tau'=\log(1+\tau),\qquad\text{and}\qquad b=q(1-\epsilon)\sqrt{1+\tau}.
\]
By the same argument as in the proof of the lower bound of Theorem 3.1, it suffices to show that
\[
\mathfrak M := \mathfrak M\big(\delta,\mathcal P_{\mathrm{LR}}(1,\epsilon),L\big) \gtrsim T_{\mathrm{clean}}\vee T_{\mathrm{dimension}}\vee T_{\mathrm{confidence}},\tag{58}
\]
where
\[
T_{\mathrm{clean}} := \sqrt{\frac{d+\log(1/\delta)}{nq(1-\epsilon)}},\qquad
T_{\mathrm{dimension}} := \sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{Cnb\tau'^2}d\big)}},\qquad
T_{\mathrm{confidence}} := \sqrt{\tau'}\wedge\frac{\tau'}{\sqrt{\log\big(3+\frac{Cnb\tau'^2}{\log(1/2\delta)}\big)}},
\]
for a sufficiently large constant $C$. The lower bounds $T_{\mathrm{clean}}$ and $T_{\mathrm{confidence}}$ follow by appropriate modifications to the arguments in the proof of the lower bound of Theorem 3.1, so we focus on establishing $T_{\mathrm{dimension}}$. The final claim then follows from (58) by an identical argument as in Theorem 3.1. Let $\gamma:=\gamma(\epsilon,q,n,d)$ denote the resulting error bound. Let $r\in\mathbb R_+$ be the $x$-truncation parameter, chosen such that $\gamma^2r^2\le0.5\tau'$, and let $R\in\mathbb R_+$ be the $y$-truncation parameter with $R:=\frac{\tau'}{10\gamma r}$; observe that $\gamma r\le R/2$. We shall choose $r=1$ in the remainder of the proof. Thus, the following arguments will be applicable as long as $\gamma\lesssim\sqrt{\tau'}$. Let $F_\theta$ be the distribution over $(X,y)$ such that $X\sim\mathcal N(0,I_d)$ and $y\sim\mathcal N(\theta^\top X,1)$. Let $\mathcal N$ be a $1/2$-packing of unit vectors satisfying $|\mathcal N|\le5^d$.
For each $v\in\mathcal N$, we will define $P_v\in\mathcal R(F_{\gamma v},\epsilon,q)$ with the additional property that $P_v(\star^{d+1})=1-b$. To prove our desired lower bound of $\Omega(\gamma)$, it will suffice, by Fano's inequality (Lemma 2.10), to show that there exists a distribution $H$ such that $\mathrm{KL}(P_v,H)\lesssim\frac dn$ for all $v\in\mathcal N$. We will take $P_v$ to be the distribution $Q$ from Lemma C.4 with parameters $b,\gamma,r,R$ (noting that the parameter requirements are satisfied), and $H$ to be defined as
\[
H\big(\{\star^{d+1}\}\big)=1-b\qquad\text{and}\qquad H(x,y)=b\cdot F_0(x,y)\ \ \text{for all }x\in\mathbb R^d,\ y\in\mathbb R.\tag{59}
\]
Because $P_v(\star^{d+1})=H(\star^{d+1})=1-b$ and both distributions are supported on $\mathbb R^{d+1}\cup\{\star^{d+1}\}$, we have that
\[
\mathrm{KL}(P_v,H) = b\cdot\mathrm{KL}(P'_v,H'),\tag{60}
\]
where $P'_v$ and $H'$ are the corresponding conditional distributions on $\mathbb R^{d+1}$. Observe that the density ratio satisfies (with $p_0$ being the density of $F_0$)
\[
\frac{p'_v(x,y)}{p_0(x,y)} = \frac{\phi(x)A_x(y)}{\phi(x)\phi(y)} = \begin{cases}\beta_x&\text{if }|x^\top v|\le r\text{ and }|y|\le R\\ \dfrac{e^{-(y-\gamma v^\top x)^2/2}}{e^{-y^2/2}}&\text{otherwise}\end{cases}
= \begin{cases}\beta_x&\text{if }|x^\top v|\le r\text{ and }|y|\le R\\ e^{\gamma yx^\top v-\gamma^2(v^\top x)^2/2}&\text{otherwise}.\end{cases}
\]
Therefore,
\[
\log\frac{p'_v(x,y)}{p_0(x,y)} = \begin{cases}\log\beta_x&\text{if }|x^\top v|\le r\text{ and }|y|\le R\\ \gamma yx^\top v-\frac{\gamma^2(v^\top x)^2}2&\text{otherwise}\end{cases}
\le \gamma\big(yx^\top v-\gamma(v^\top x)^2\big)\mathbb 1_{|x^\top v|>r\text{ or }|y|>R},\tag{61}
\]
where we use that $\beta_x\le1$. Taking expectations under $P'_v$, we get that the KL divergence is upper bounded by
\[
\mathrm{KL}(P'_v,F_0)/\gamma \le \mathbb E_{(x,y)\sim P'_v}\Big[\big(yx^\top v-\gamma(v^\top x)^2\big)\mathbb 1_{|x^\top v|>r\text{ or }|y|>R}\Big]
= \int_{\{|x^\top v|>r\}\cup\{|y|>R\}}\big(yx^\top v-\gamma(v^\top x)^2\big)\phi(x)A_x(y)\,dx\,dy
\]
\[
= \int_{\{|z|>r\}\cup\{|y|>R\}}(yz-\gamma z^2)\,\phi(z)\,\phi(y;\gamma z,1)\,dz\,dy\qquad(\text{definition of }A_x\text{ and }z:=x^\top v)
\]
\[
= \mathbb E_{G,Z}\big[\big((\gamma Z+G)Z-\gamma Z^2\big)\mathbb 1_{|Z|>r\text{ or }|\gamma Z+G|>R}\big] = \mathbb E_{G,Z}\big[GZ\,\mathbb 1_{|Z|>r\text{ or }|\gamma Z+G|>R}\big],
\]
where $Z$ and $G$ are independent standard normals, and the equality follows from the representation $Y=\gamma Z+G$ for the Gaussian linear model, which coincides with $A_x$ on that region. Continuing, we get
\[
\mathrm{KL}(P'_v,F_0)/\gamma \le \mathbb E_{G,Z}\big[GZ\mathbb 1_{|Z|>r\text{ or }|\gamma Z+G|>R}\big]-\mathbb E_{G,Z}\big[GZ\mathbb 1_{|Z|>r\text{ or }|G|>R}\big]\qquad\big(\text{by symmetry, }\mathbb E_{G,Z}[GZ\mathbb 1_{|Z|>r\text{ or }|G|>R}]=0\big)
\]
\[
= \mathbb E_{G,Z}\big[GZ\big(\mathbb 1_{|Z|>r\text{ or }|\gamma Z+G|>R}-\mathbb 1_{|Z|>r\text{ or }|G|>R}\big)\big]
= \mathbb E_{G,Z}\big[GZ\,\mathbb 1_{|Z|\le r}\big(\mathbb 1_{|\gamma Z+G|>R}-\mathbb 1_{|G|>R}\big)\big]
\le \mathbb E_{G,Z}\Big[|G|\cdot|Z|\,\mathbb 1_{|Z|\le r}\big|\mathbb 1_{|\gamma Z+G|>R}-\mathbb 1_{|G|>R}\big|\Big].
\]
Since $\big|\mathbb 1_{|x+y|\ge a}-\mathbb 1_{|x|\ge a}\big|\le\mathbb 1_{x\in[-|a|-|y|,-|a|+|y|]}+\mathbb 1_{x\in[|a|-|y|,|a|+|y|]}$, we get that
\[
\mathrm{KL}(P'_v,F_0)/\gamma \le \mathbb E_{G,Z}\Big[|G||Z|\,\mathbb 1_{|Z|\le r}\,\mathbb 1_{G\in[-R-\gamma|Z|,-R+\gamma|Z|]\cup[R-\gamma|Z|,R+\gamma|Z|]}\Big]
= 2\,\mathbb E_{G,Z}\Big[|G||Z|\,\mathbb 1_{|Z|\le r}\,\mathbb 1_{G\in[R-\gamma|Z|,R+\gamma|Z|]}\Big]\qquad(\text{by symmetry})
\]
\[
\lesssim (R+\gamma r)\,\mathbb E_Z\Big[|Z|\,\mathbb 1_{|Z|\le r}\,\mathbb P\big(G\in[R-\gamma|Z|,R+\gamma|Z|]\big)\Big]
\le (R+\gamma r)\,\mathbb E_Z\Big[|Z|\,\mathbb 1_{|Z|\le r}\cdot\gamma|Z|\,e^{-(R-\gamma|Z|)^2/2}\Big]
\lesssim R\,\gamma\,\mathbb E_Z[Z^2]\,e^{-\Omega(R^2)} \lesssim R\gamma e^{-\Omega(R^2)} \lesssim \gamma e^{-\Omega(R^2)},\tag{62}
\]
where we use $\gamma r\le R/2$ and the fact that $xe^{-\Omega(x^2)}\lesssim e^{-\Omega(x^2)}$ for all $x>0$. Combining Equations (60) and (62), the KL divergence is upper bounded by
\[
\mathrm{KL}(P_v,H) \le b\gamma^2e^{-\Omega(R^2)} \le b\gamma^2e^{-c\tau'^2/\gamma^2},
\]
where we use that $R\asymp\tau'/\gamma$. Since the expressions and parameter restrictions are identical to those encountered in the proof of Theorem 3.1 (see Section C.2.1), we obtain a lower bound of
\[
\min\bigg\{\sqrt{\tau'},\;\frac{\tau'}{\sqrt{\log\big(3+\frac{Cnb\tau'^2}d\big)}}\bigg\}.
\]
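For intuition about the scale of the resulting lower bound, the following sketch (assuming only NumPy; the problem sizes, $\epsilon$, $q$, and the stand-in constant $C$ are all hypothetical) evaluates the three terms appearing in (58):

```python
import numpy as np

n, d, delta = 10_000, 50, 0.05             # hypothetical problem size
eps, q = 0.1, 0.9
tau = eps / (q * (1 - eps))
tau_p = np.log(1 + tau)
b = q * (1 - eps) * np.sqrt(1 + tau)
C = 10.0                                   # stand-in for the unspecified constant

T_clean = np.sqrt((d + np.log(1 / delta)) / (n * q * (1 - eps)))
T_dim = min(np.sqrt(tau_p),
            tau_p / np.sqrt(np.log(3 + C * n * b * tau_p**2 / d)))
T_conf = min(np.sqrt(tau_p),
             tau_p / np.sqrt(np.log(3 + C * n * b * tau_p**2 / np.log(1 / (2 * delta)))))
print(f"T_clean = {T_clean:.4f}, T_dimension = {T_dim:.4f}, T_confidence = {T_conf:.4f}")
print(f"lower bound ~ {max(T_clean, T_dim, T_conf):.4f}")
```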
D Proofs of sum-of-squares upper bounds

D.1 Mean estimation: Proof of Theorem 4.5

We will prove that both claims hold given $\mathcal E$. The result then follows from Lemma 4.3.

Satisfiability. We first show that the constraint (9) is satisfiable. To this end, let $\theta=\theta_\star$. Let $\xi=\mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}[x^{\otimes2k}]-\mathbb E_{z\sim\mathcal N(0,I_d)}[z^{\otimes2k}]$ (recall that $T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}$ consists of $n$ i.i.d. standard Gaussian observations). Applying Lemma G.23 (with $t=d^k\|\xi\|_\infty$) gives
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\langle v,x-\theta\rangle^{2k} \le \mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}]+d^k\|\xi\|_\infty,
\]
from which $\mathcal E$ implies
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\langle v,x-\theta\rangle^{2k} \le \mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}]+\epsilon \le (1+\epsilon)\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}].
\]
Because the difference (up to normalization) between $\mathbb E_{x\sim T}\langle v,x-\theta\rangle^{2k}$ and $\mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\langle v,x-\theta\rangle^{2k}$ is a sum-of-squares, i.e.,
\[
\frac{|T|}n\,\mathbb E_{x\sim T}\langle v,x-\theta\rangle^{2k}+\frac{|T_{\mathrm{MCAR},\star}-T_{\mathrm{MNAR}}|}n\,\mathbb E_{x\sim T_{\mathrm{MCAR},\star}-T_{\mathrm{MNAR}}}\langle v,x-\theta\rangle^{2k} = \mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\langle v,x-\theta\rangle^{2k},
\]
it then follows that
\[
\mathbb E_{x\sim T}\langle v,x-\theta\rangle^{2k} \le \frac{|T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}|}{|T|}\,(1+\epsilon)\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}].
\]
Then $\mathcal E$ implies
\[
\mathbb E_{x\sim T}\langle v,x-\theta\rangle^{2k} \le \frac{|T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}|}{|T|}\,(1+\epsilon)\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}] \le \frac{(1+\epsilon)^2}{1-\epsilon}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}].
\]

Accuracy. Let $\tilde{\mathbb E}$ be a degree-$4k$ pseudoexpectation over $\theta\in\mathbb R^d$ satisfying the constraint (9). Note that
\[
\frac{(1+\epsilon)^2}{1-\epsilon}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}] \ge \tilde{\mathbb E}\big[\mathbb E_{x\sim T}\langle v,x-\theta\rangle^{2k}\big] \ge \tilde{\mathbb E}\Big[\frac{|T_{\mathrm{MCAR}}|}{|T|}\mathbb E_{x\sim T_{\mathrm{MCAR}}}\langle v,x-\theta\rangle^{2k}\Big] \ge \tilde{\mathbb E}\big[(1-\epsilon)\,\mathbb E_{x\sim T_{\mathrm{MCAR}}}\langle v,x-\theta\rangle^{2k}\big] \ge (1-\epsilon)\,\mathbb E_{x\sim T_{\mathrm{MCAR}}}\big\langle v,x-\tilde{\mathbb E}[\theta]\big\rangle^{2k},
\]
where the last step follows by combining Lemmas G.21 and G.22. For convenience, let $\hat p_\ell=\inf_{v\in\mathbb S^{d-1}}\mathbb E_{x\sim T_{\mathrm{MCAR}}}\langle v,x-\theta_\star\rangle^\ell$ and $\Delta=\big\|\theta_\star-\tilde{\mathbb E}[\theta]\big\|_2$. Then, taking $v=\frac{\tilde{\mathbb E}[\theta]-\theta_\star}\Delta$ and dividing the previous display by $1-\epsilon$ on both sides gives
\[
\Big(\frac{1+\epsilon}{1-\epsilon}\Big)^2\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}] \ge \hat p_{2k}+\sum_{i=1}^{k-1}\bigg[\binom{2k}{2i}\hat p_{2k-2i}\Delta^{2i}-\binom{2k}{2i-1}\hat p_{2k-2i+1}\Delta^{2i-1}\bigg]+\Delta^{2k}.
\]
The event $\mathcal E$ implies, via Lemma G.16, that $|\hat p_j-\mathbb E_{G\sim\mathcal N(0,1)}[G^j]|\le d^{j/2-k}\epsilon$ for $j\in[2k]$. Recall $\gamma=\frac{(1+\epsilon)^2-(1-\epsilon)^3}{(1-\epsilon)^2}$. Rearranging the previous display by moving $\hat p_{2k}$ to the other side gives
\[
\sum_{i=1}^{k-1}\bigg[\binom{2k}{2i}\hat p_{2k-2i}\Delta^{2i}-\binom{2k}{2i-1}\hat p_{2k-2i+1}\Delta^{2i-1}\bigg]+\Delta^{2k} \le \gamma\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}].\tag{63}
\]
Let
\[
\Delta_0 = \max_{i\in[k]}\ \frac{2\hat p_{2k-2i+1}}{\hat p_{2k-2i}}\cdot\frac{2i}{2k-2i+1},
\]
and suppose that $\Delta>\Delta_0$. Then each bracketed summand in the display (63) is at least $\binom{2k}{2i}\hat p_{2k-2i}\Delta^{2i}\big(1-\frac{\Delta_0}{2\Delta}\big)$. Recall that $\mathcal E$ implies $\hat p_{2k-2i}\ge0$ for all $i\in[k]$ (since these are even-degree polynomials with expectation at least 1 regardless of $i$), so taking $i=1$ and discarding all other terms gives
\[
\frac{2k(2k-1)\,\hat p_{2k-2}}{16}\,\Delta^2 \le \gamma\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}].
\]
Thus, we obtain
\[
\Delta \le \max\bigg\{\Delta_0,\;4\sqrt{\frac{\gamma\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}]}{k\,\hat p_{2k-2}}}\bigg\} \le \max\Big\{\Delta_0,\;8\sqrt{\frac\gamma k}\Big\},
\]
where the last step follows from the fact that the event $\mathcal E$ implies $\hat p_{2k-2}\ge0.5(2k-3)!!$ (recall $\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}]=(2k-1)!!$). All that remains is to upper bound $\Delta_0$. Recall $\mathcal E$ implies $|\hat p_{2k-2i+1}|\le\epsilon d^{1/2-i}$ for all $i\in[k]$ (since these are odd-degree polynomials with expectation zero). Likewise, we can lower bound $\hat p_{2k-2i}\ge0.5(2k-2i-1)!!$. Finally, recall $k\le\sqrt d$ by assumption. Putting these facts together gives $\Delta_0\le\frac{8\epsilon}k\le\frac{8\gamma}k$.
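As a numerical illustration of how the display (63) pins down $\Delta$, the sketch below (assuming SciPy; the values of $k$ and $\gamma$ are hypothetical) evaluates the left-hand side with the population Gaussian moments $p_j=\mathbb E[G^j]$ substituted for $\hat p_j$, and locates the largest $\Delta$ consistent with the right-hand side:

```python
import numpy as np
from scipy.special import comb, factorial2

def lhs(delta, k):
    """Left-hand side of (63) with population Gaussian moments p_j = E[G^j]."""
    p = lambda j: factorial2(j - 1) if j % 2 == 0 else 0.0  # only even j >= 2 occur
    total = delta ** (2 * k)
    for i in range(1, k):
        total += comb(2 * k, 2 * i) * p(2 * k - 2 * i) * delta ** (2 * i) \
               - comb(2 * k, 2 * i - 1) * p(2 * k - 2 * i + 1) * delta ** (2 * i - 1)
    return total

k, gamma = 3, 0.05                        # hypothetical degree and slack
rhs = gamma * factorial2(2 * k - 1)       # gamma * E[G^(2k)] = gamma * (2k-1)!!
deltas = np.linspace(0, 1, 10001)
feasible = deltas[np.array([lhs(t, k) for t in deltas]) <= rhs]
print(f"largest feasible Delta = {feasible.max():.4f}"
      f"  vs  8*sqrt(gamma/k) = {8 * np.sqrt(gamma / k):.4f}")
```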
D.2 Covariance estimation: Proof of Theorem 4.10

We will prove that both claims hold given $\mathcal E$. The result then follows from Lemma 4.3. Throughout the proof, we let $\tilde\Sigma=\mathbb E_{x\sim T}[xx^{\mathsf T}]$ be the empirical covariance of the observed data.

Satisfiability. We first show that $\mathcal A_{\mathrm{deviation}}$ is satisfiable by $M=(\tilde\Sigma^{1/2}\Sigma^{-1}\tilde\Sigma^{1/2})^{1/2}$ and $B=M-I_d$ for $\alpha=10\epsilon$. First, observe that both matrices are symmetric and $M\succeq0$ (recall that we assume $\Sigma\succ0$). The event $\mathcal E$ implies, for $S\in\{T_{\mathrm{MCAR}},T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}\}$, that
\[
\big\|\Sigma^{-1/2}\mathbb E_{x\sim S}[xx^{\mathsf T}]\Sigma^{-1/2}-I_d\big\|_{\mathrm{op}} \le d\,\big\|\Sigma^{-1/2}\mathbb E_{x\sim S}[xx^{\mathsf T}]\Sigma^{-1/2}-I_d\big\|_\infty \le \epsilon,
\]
and thus $(1-\epsilon)\Sigma\preceq\mathbb E_{x\sim S}[xx^{\mathsf T}]\preceq(1+\epsilon)\Sigma$. Also, $\mathcal E$ implies $m\ge(1-10\epsilon)n$. Thus, after renormalization, we have
\[
\frac{1-\epsilon}{1+\epsilon}\Sigma \preceq \mathbb E_{x\sim T}[xx^{\mathsf T}] \preceq \frac{1+\epsilon}{1-\epsilon}\Sigma,
\]
which after whitening becomes
\[
\frac{1-\epsilon}{1+\epsilon}\tilde\Sigma^{-1/2}\Sigma\tilde\Sigma^{-1/2} \preceq I_d \preceq \frac{1+\epsilon}{1-\epsilon}\tilde\Sigma^{-1/2}\Sigma\tilde\Sigma^{-1/2},
\]
and after inverting becomes
\[
\frac{1-\epsilon}{1+\epsilon}\tilde\Sigma^{1/2}\Sigma^{-1}\tilde\Sigma^{1/2} \preceq I_d \preceq \frac{1+\epsilon}{1-\epsilon}\tilde\Sigma^{1/2}\Sigma^{-1}\tilde\Sigma^{1/2}.
\]
Because $\epsilon<1/100$, we can bound $\sqrt{\frac{1+\epsilon}{1-\epsilon}}\le1+5\epsilon$. So, from the fact that $f(t)=\sqrt t$ is operator monotone, it follows, by taking square roots of the above display, that $M$ and $B$ satisfy the constraints in $\mathcal A_{\mathrm{deviation}}$ for $\alpha=10\epsilon$. Now all that remains is to show satisfiability of the $\mathcal A_{\mathrm{moments}}$ constraints for the same choice of $M$. Let $\ell\in[k]$ and denote $\xi_S=\mathbb E_{x\sim S}[(\Sigma^{-1/2}x)^{\otimes2\ell}]-\mathbb E_{z\sim\mathcal N(0,I_d)}[z^{\otimes2\ell}]$. We prove the upper bound by taking $S=T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}$; the lower bound follows analogously by taking $S=T_{\mathrm{MCAR}}$. Because $\tilde\Sigma^{-1/2}x\sim\mathcal N(0,M^{-2})$, applying Lemma G.23 (with $t=d^\ell\|\xi_{T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\|_\infty$) gives
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\langle v,M\tilde\Sigma^{-1/2}x\big\rangle^{2\ell} \le \mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+d^\ell\big\|\xi_{T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\|_\infty,
\]
from which $\mathcal E$ implies
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\langle v,M\tilde\Sigma^{-1/2}x\big\rangle^{2\ell} \le \mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+\epsilon \le (1+\epsilon)\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}].
\]
Because the difference between $\frac{|T|}n\,\mathbb E_{x\sim T}\langle v,M\tilde\Sigma^{-1/2}x\rangle^{2\ell}$ and $\mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\langle v,M\tilde\Sigma^{-1/2}x\rangle^{2\ell}$ is a sum-of-squares, it then follows that
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T}\big\langle v,M\tilde\Sigma^{-1/2}x\big\rangle^{2\ell} \le \frac n{|T|}(1+\epsilon)\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}] \le \frac{(1+\epsilon)^2}{1-\epsilon}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}].
\]
Using our assumption that $\epsilon<1/100$, we can bound $\frac{(1+\epsilon)^2}{1-\epsilon}\le1+10\epsilon$. Thus, $M$ and $B$ satisfy the upper bound in the constraints (10) over $\ell\in[k]$.

Accuracy. Let $\beta=10\epsilon$ and $A_{\mathrm{OPT}}=\tilde\Sigma^{-1/2}\Sigma\tilde\Sigma^{-1/2}$ for convenience. We will show (Lemma D.1) that the event $\mathcal E$ implies
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T}\big\langle v,M\tilde\Sigma^{-1/2}x\big\rangle^{2k} \in (1\pm8\epsilon)\,\big\|\Sigma^{1/2}\tilde\Sigma^{-1/2}Mv\big\|_2^{2k}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}]\pm2\epsilon.
\]
Together with the fact that $\tilde{\mathbb E}$ satisfies constraint (10), this implies, for $\beta=10\epsilon$, that
\[
\tilde{\mathbb E}\Big[\big\|\Sigma^{1/2}\tilde\Sigma^{-1/2}Mv\big\|_2^{2k}\Big] \in \Big[\frac{1-\beta}{1+\beta},\;\frac{1+\beta}{1-\beta}\Big],
\]
i.e., $\tilde{\mathbb E}\big[\|\Sigma^{1/2}\tilde\Sigma^{-1/2}Mv\|_2^{2k}\big]\in1\pm10\beta$ (since $\beta<1/2$). Combining Lemmas G.26 and G.24 then gives $\tilde{\mathbb E}\big[\|\Sigma^{1/2}\tilde\Sigma^{-1/2}Mv\|_2^2\big]\in1\pm\gamma$. Because this holds for all $v\in\mathbb S^{d-1}$, we obtain $\big\|\tilde{\mathbb E}[MA_{\mathrm{OPT}}M]-I_d\big\|_{\mathrm{op}}\le\gamma$. Thus, there exists $A^*$ satisfying $\|\tilde{\mathbb E}[MA^*M]-I_d\|_{\mathrm{op}}\le\gamma$, and moreover, by the triangle inequality, any such $A^*$ must satisfy $\|\tilde{\mathbb E}[M(A^*-A_{\mathrm{OPT}})M]\|_{\mathrm{op}}\le2\gamma$. Applying Lemma 4.9 with $H=A^*-A_{\mathrm{OPT}}$ then gives $\|A^*-A_{\mathrm{OPT}}\|_{\mathrm{op}}\lesssim\epsilon/k$.
It thus follows, for any $v\in\mathbb S^{d-1}$, that
\[
\big\|\Sigma^{-1/2}\tilde\Sigma^{1/2}(A^*-A_{\mathrm{OPT}})\tilde\Sigma^{1/2}\Sigma^{-1/2}v\big\|_2 \lesssim \frac\epsilon k\,\big\|\Sigma^{-1/2}\tilde\Sigma^{1/2}\big\|_{\mathrm{op}}\big\|\tilde\Sigma^{1/2}\Sigma^{-1/2}\big\|_{\mathrm{op}} \lesssim \frac\epsilon k,
\]
where the last step follows from the fact that $\mathcal E$ implies $\frac12\Sigma\preceq\tilde\Sigma\preceq2\Sigma$. The main claim then follows immediately.

Lemma D.1. The event $\mathcal E$ implies
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T}\big\langle v,M\tilde\Sigma^{-1/2}x\big\rangle^{2k} \in (1\pm\beta)\,\big\|\Sigma^{1/2}\tilde\Sigma^{-1/2}Mv\big\|_2^{2k}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2k}].
\]

Proof. The proof follows a similar structure to the proof of satisfiability. As before, let $\ell\in[k]$ and denote $\xi_S=\mathbb E_{x\sim S}[(\Sigma^{-1/2}x)^{\otimes2\ell}]-\mathbb E_{z\sim\mathcal N(0,I_d)}[z^{\otimes2\ell}]$. We prove the upper bound by taking $S=T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}$; the lower bound follows analogously by taking $S=T_{\mathrm{MCAR}}$. Let $u=\Sigma^{1/2}\tilde\Sigma^{-1/2}Mv$. Lemma G.23 implies, for all $t>0$, that
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\langle v,M\tilde\Sigma^{-1/2}x\big\rangle^{2\ell} = \mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\langle u,\Sigma^{-1/2}x\big\rangle^{2\ell} \le \|u\|_2^{2\ell}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+\frac t2\|u\|_2^{4\ell}+\frac{d^{2\ell}}{2t}\big\|\xi_{T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\|_\infty^2.
\]
Using the fact that $M\preceq2I$ and taking $t=(d/4)^\ell\|\xi_{T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\|_\infty$, we then obtain
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\langle v,M\tilde\Sigma^{-1/2}x\big\rangle^{2\ell} \le \|u\|_2^{2\ell}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+2^{4\ell-1}t+\frac{d^{2\ell}}{2t}\big\|\xi_{T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\|_\infty^2
\]
\[
\le \|u\|_2^{2\ell}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+(4d)^\ell\big\|\xi_{T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\big\|_\infty \le \|u\|_2^{2\ell}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+\epsilon,
\]
where the last step follows from $\mathcal E$. Because the difference between $\frac{|T|}n\,\mathbb E_{x\sim T}\langle v,M\tilde\Sigma^{-1/2}x\rangle^{2\ell}$ and $\mathbb E_{x\sim T_{\mathrm{MCAR}}+T_{\mathrm{MCAR},\star}}\langle v,M\tilde\Sigma^{-1/2}x\rangle^{2\ell}$ is a sum-of-squares, it then follows that
\[
\big\{\|v\|_2^2=1\big\}\ \vdash_{4k}\ \mathbb E_{x\sim T}\big\langle v,M\tilde\Sigma^{-1/2}x\big\rangle^{2\ell} \le \frac n{|T|}\Big(\|u\|_2^{2\ell}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+\epsilon\Big) \le \frac{1+\epsilon}{1-\epsilon}\Big(\|u\|_2^{2\ell}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+\epsilon\Big) \le (1+8\epsilon)\,\|u\|_2^{2\ell}\,\mathbb E_{G\sim\mathcal N(0,1)}[G^{2\ell}]+2\epsilon,
\]
where the last step follows from our assumption that $\epsilon<1/100$.

D.2.1 Proof of Lemma 4.9

Since $\tilde{\mathbb E}$ satisfies $B=M-I_d$, $\tilde{\mathbb E}$ must satisfy $MHM=(I+B)H(I+B)$. Thus, for any unit vector $v\in\mathbb S^{d-1}$ we have
\[
v^{\mathsf T}Hv = v^{\mathsf T}\tilde{\mathbb E}\big[MHM-(BH+HB)-BHB\big]v \le \beta+\tilde{\mathbb E}\big[v^{\mathsf T}(BH+HB)v\big]+\tilde{\mathbb E}\big[v^{\mathsf T}BHBv\big].\tag{64}
\]
We will proceed to separately bound the two terms in the last display. First, we have
\[
\tilde{\mathbb E}\big[v^{\mathsf T}BHv\big] \le \sqrt{\tilde{\mathbb E}\big[(v^{\mathsf T}BHv)^2\big]} \le \sqrt{\tilde{\mathbb E}\big[\|Bv\|_2^2\big]\,\tilde{\mathbb E}\big[\|Hv\|_2^2\big]} \le \alpha\|H\|_{\mathrm{op}},
\]
where the first step follows from Jensen's inequality (Lemma G.22), the second from Cauchy–Schwarz (Lemma G.19), and the last from our constraint that $B^{\mathsf T}B\preceq\alpha^2I$. We can similarly bound $\tilde{\mathbb E}[v^{\mathsf T}HBv]$ and both negations (i.e., $-\tilde{\mathbb E}[v^{\mathsf T}BHv]$ and $-\tilde{\mathbb E}[v^{\mathsf T}HBv]$), implying $\big|\tilde{\mathbb E}[v^{\mathsf T}(BH+HB)v]\big|\le2\alpha\|H\|_{\mathrm{op}}$. For the second term, observe that $\|H\|_{\mathrm{op}}^2\|Bv\|_2^2-\|HBv\|_2^2$ is a sum of squares; i.e., for $C=\|H\|_{\mathrm{op}}^2I-H^{\mathsf T}H\succeq0$ and $DD^{\mathsf T}=C$,
\[
\|H\|_{\mathrm{op}}^2\|Bv\|_2^2-\|HBv\|_2^2 = v^{\mathsf T}B^{\mathsf T}DD^{\mathsf T}Bv.
\]
Thus, we can bound
\[
\tilde{\mathbb E}\big[v^{\mathsf T}BHBv\big] \le \sqrt{\tilde{\mathbb E}\big[(v^{\mathsf T}BHBv)^2\big]} \le \sqrt{\tilde{\mathbb E}\big[\|Bv\|_2^2\big]\,\tilde{\mathbb E}\big[\|HBv\|_2^2\big]} \le \alpha^2\|H\|_{\mathrm{op}}.
\]
Combining our bounds on the two terms and taking the supremum over $v$ on the left-hand side of equation (64) (recall $H$ is symmetric) gives
\[
\|H\|_{\mathrm{op}} \le \beta+2\alpha\|H\|_{\mathrm{op}}+\alpha^2\|H\|_{\mathrm{op}},
\]
from which the claim follows immediately.

E Proofs of computational lower bounds

E.1 Preliminaries

The following lemma controls the difference in moments between the hard univariate distributions used for the information-theoretic rate and the standard Gaussian.
Lemma E.1 (Approximate moment matching). For a distribution $A$ on $\mathbb R$ and $k\in\mathbb N$, define
\[
\delta_{A,k} := \big|\mathbb E_{X\sim A}[X^k]-\mathbb E_{G\sim\mathcal N(0,1)}[G^k]\big|.
\]
1. If $A$ is the instance from (38), and $|\gamma|\le R/2$ and $R\gtrsim1$, then $\delta_{A,k}\lesssim|\gamma|e^{-\Omega(R^2)}e^{O(k\log\max(k,R))}$.
2. If $A$ is the instance from (39) with $R\gtrsim1$, then $\delta_{A,k}\lesssim\gamma e^{-\Omega(R^2)+O(k\log\max(k,R))}$.

Proof. We first start with $A$ as in (38). Observe that the difference in moments is exactly equal to
\[
(\beta-1)\,\mathbb E\big[G^k\mathbb 1_{|G|\le R}\big]+\mathbb E\big[(G+\gamma)^k\mathbb 1_{|G+\gamma|>R}-G^k\mathbb 1_{|G|>R}\big].
\]
The absolute value of the first term is at most $|\beta-1|(k-1)!!\lesssim|\gamma|e^{-\Omega(R^2)}e^{O(k\log k)}$, where we use Lemma C.2 to control $|\beta-1|$. For the second term, we upper bound its absolute value as follows (using that $|a^k-b^k|\le k|a-b|\max(|a|^{k-1},|b|^{k-1})$):
\[
\Big|\mathbb E\big[(G+\gamma)^k\mathbb 1_{|G+\gamma|>R}-G^k\mathbb 1_{|G|>R}\big]\Big| \le \Big|\mathbb E\big[\big((G+\gamma)^k-G^k\big)\mathbb 1_{|G|>R}\big]\Big|+\Big|\mathbb E\big[(G+\gamma)^k\big(\mathbb 1_{|G+\gamma|>R}-\mathbb 1_{|G|>R}\big)\big]\Big|
\]
\[
\le \mathbb E\big[k|\gamma|(|G|+|\gamma|)^{k-1}\mathbb 1_{|G|>R}\big]+\mathbb E\big[(|G|+|\gamma|)^k\mathbb 1_{|G|\in[R-|\gamma|,R+|\gamma|]}\big]
\le \mathbb E\big[k|\gamma|(|G|+|\gamma|)^{k-1}\mathbb 1_{|G|>R}\big]+\mathbb E\big[\big(O(R+|\gamma|)\big)^k\mathbb 1_{|G|\in[R-|\gamma|,R+|\gamma|]}\big].
\]
Now, the second term is at most $(R+|\gamma|)^{O(k)}\,\mathbb P\big(|G|\in[R-|\gamma|,R+|\gamma|]\big)$, which is at most $|\gamma|R^{O(k)}e^{-\Omega(R^2)}$, where we use that $\gamma\lesssim R$. For the first term, we apply the Cauchy–Schwarz inequality to get that it is at most $|\gamma|(k+|\gamma|)^{O(k)}e^{-\Omega(R^2)}$, and then use that $\gamma\le R$. Combining everything, we have shown the desired upper bound on $\delta_{A,k}$.

Next, we consider the case when $A$ is from (39). The difference in moments is
\[
(\beta-1)\,\mathbb E\big[G^k\mathbb 1_{|G|\le R}\big]+\mathbb E\big[(G\sqrt{1+\gamma})^k\mathbb 1_{|G\sqrt{1+\gamma}|>R}-G^k\mathbb 1_{|G|>R}\big].
\]
The first term is at most $\gamma e^{-\Omega(R^2)}e^{O(k\log k)}$. For the second term, we perform a similar decomposition as before:
\[
\Big|\mathbb E\big[(G\sqrt{1+\gamma})^k\mathbb 1_{|G\sqrt{1+\gamma}|>R}-G^k\mathbb 1_{|G|>R}\big]\Big| \le \Big|\mathbb E\big[\big((G\sqrt{1+\gamma})^k-G^k\big)\mathbb 1_{|G|>R}\big]\Big|+\Big|\mathbb E\big[(G\sqrt{1+\gamma})^k\big(\mathbb 1_{|G\sqrt{1+\gamma}|>R}-\mathbb 1_{|G|>R}\big)\big]\Big|
\]
\[
\le \big((\sqrt{1+\gamma})^k-1\big)\,\mathbb E\big[|G|^k\mathbb 1_{|G|\ge R}\big]+O(R)^k\,\mathbb E\big[\big|\mathbb 1_{|G\sqrt{1+\gamma}|>R}-\mathbb 1_{|G|>R}\big|\big]
\le O(k\gamma)e^k\sqrt{\mathbb E[G^{2k}]}\,\sqrt{\mathbb P(|G|>R)}+R^k\,\mathbb P\big(G\in\big[R/\sqrt{1+\gamma},R\big]\big),
\]
which is at most $\gamma e^{k\log k-\Omega(R^2)}+R^kO(\gamma)e^{-\Omega(R^2)}$.

We will also use the following technical result, which perturbs a distribution $A$ so as to turn a distribution that approximately matches moments into one that exactly matches moments.

Lemma E.2 (Fixing the moments [DK23, Exercise 8.3]). There exists a constant $c>0$ such that the following holds. Let $k\in\mathbb N$ and let $A$ be any distribution over $\mathbb R$ with $k$ finite moments such that $|\mathbb E_A[X^i]-\mathbb E_{G\sim\mathcal N(0,1)}[G^i]|\le\alpha$ for all $i\in[k]$. Suppose that $\inf_{|x|\le1}A(x)\ge2\alpha k^c$. Then there is a distribution $A'$ such that
\[
|A(x)-A'(x)| \le \begin{cases}\alpha k^c&\text{if }|x|\le1\\0&\text{otherwise},\end{cases}
\]
and $A'$ matches $k$ moments with $\mathcal N(0,1)$ exactly. In particular, $|A(x)-A'(x)|\le\frac12|A(x)|$ for all $x$.

Claim E.3. Let $Q\in\mathcal P_{\mathcal R}(P,q,\epsilon)$. Let $R$ be an arbitrary distribution with finite $\chi^2(P,R)$. Then
\[
\chi^2(Q,R) \lesssim \max(1,\tau^2)\,\big(1+\chi^2(P,R)\big).
\]

Proof. By Corollary C.1, we have that, for all $x\in\mathbb R^d$,
\[
\frac LU \le \frac Lb \le \frac{q(x)}{p(x)} \le \frac Ub \le \frac UL,
\]
where $b=\mathbb P(X\ne\star)\in(L,U)$, $L=q(1-\epsilon)$, and $U=q(1-\epsilon)+\epsilon$. Therefore, $q(x)/p(x)\le(1+\tau)$.
Claim E.3. Let $Q \in \mathcal{R}_{\mathbb{R}}(P, \epsilon, q)$. Let $R$ be an arbitrary distribution with finite $\chi^2(P, R)$. Then $\chi^2(Q, R) \lesssim \max(1, \tau^2)\big(1 + \chi^2(P, R)\big)$.

Proof. By Corollary C.1, we have that for all $x$:
\[
\frac{L}{U} \le \frac{L}{b} \le \frac{q(x)}{p(x)} \le \frac{U}{b} \le \frac{U}{L},
\]
where $b = \mathbb{P}(X = \star) \in (L, U)$, $L = q(1-\epsilon)$, and $U = q(1-\epsilon) + \epsilon$. Therefore, $q(x)/p(x) \le 1+\tau$, and
\[
\chi^2(Q, R) = \mathbb{E}_R\bigg[\Big(\frac{q(X)}{r(X)}\Big)^2\bigg] - 1 = \mathbb{E}_R\bigg[\Big(\frac{q(X)}{p(X)}\Big)^2\Big(\frac{p(X)}{r(X)}\Big)^2\bigg] - 1 \le (1+\tau)^2\big(1+\chi^2(P,R)\big) - 1 \lesssim \max(1,\tau^2)\big(1+\chi^2(P,R)\big).
\]

High-level proof sketch for Theorems 4.7 and 4.12. Both of our lower bound instances are based on NGCA. In particular, the null instance has no contamination, and we set $P' = \mathcal{N}(0, I_d)$ for both Definitions 4.6 and 4.11. For the alternative instance, we set $P' := P_{A,v}$ for some univariate distribution $A$, chosen separately for the mean and the covariance testing problems. Observe that the corresponding clean distribution in the mean case is equal to $\mathcal{N}(\rho v, I_d)$ and in the covariance setting is equal to $\mathcal{N}(0, I + \rho vv^\top)$. These can be written compactly as $P_{F,v}$ for $F = \mathcal{N}(\delta, 1+\gamma)$, where $\gamma = 0, \delta = \rho$ in the mean case and $\delta = 0, \gamma = \rho$ in the covariance case. The following claim is immediate.

Claim E.4. If $A \in \mathcal{R}_{\mathbb{R}}(F, \epsilon, q)$ for a univariate distribution $F$, then $P_{A,v} \in \mathcal{R}(P_{F,v}, \epsilon, q)$.

We get the desired SQ hardness from Lemma 2.9 if $A \in \mathcal{R}_{\mathbb{R}}(F, \epsilon, q)$ matches $m$ moments with $\mathcal{N}(0,1)$, where $m = \widetilde{\Theta}(\epsilon^2/\rho^2)$ for the task of mean estimation and $m = \widetilde{\Theta}(\epsilon/\rho)$ for the task of covariance estimation (while ensuring that $\chi^2(A, \mathcal{N}(0,1)) \lesssim m$).

E.2 Mean estimation: Proof of Theorem 4.7

As mentioned above, we shall use Claim E.4 to reduce our problem to an NGCA instance. We make a change of variables and use $\gamma = \rho$ in what follows. Given the generic SQ hardness in Lemma 2.9, our goal reduces to finding a univariate distribution $A' \in \mathcal{R}_{\mathbb{R}}(\mathcal{N}(\gamma,1), \epsilon, q)$ that matches $m$ moments and satisfies $\chi^2(A', \mathcal{N}(0,1)) \lesssim 1$. We start by establishing the existence of a moment-matching distribution, providing its proof in Section E.2.1.

Lemma E.5. Let $\epsilon$ and $q$ be such that $\tau := \frac{\epsilon}{q(1-\epsilon)} \le c$ for a small enough constant. There exists a univariate distribution $A' \in \mathcal{R}_{\mathbb{R}}(\mathcal{N}(\gamma,1), \epsilon, q)$ that matches $m = \widetilde{\Theta}(\epsilon^2/\gamma^2)$ many moments with $\mathcal{N}(0,1)$.

To complete the proof using Lemma 2.9, it remains to show that the $\chi^2$ divergence is at most $O(m)$. Applying Claim E.3, we get that
\[
\chi^2(A', \mathcal{N}(0,1)) \lesssim 1 + \chi^2(\mathcal{N}(\gamma,1), \mathcal{N}(0,1)) = 1 + (e^{\gamma^2} - 1) \lesssim 1,
\]
where we use the identity $\chi^2(\mathcal{N}(\gamma,1), \mathcal{N}(0,1)) = e^{\gamma^2} - 1$ together with $\gamma \lesssim 1$ and $\tau \lesssim 1$.

E.2.1 Proof of Lemma E.5

Proof of Lemma E.5. Recall that we are in the regime where $\tau = \frac{\epsilon}{q(1-\epsilon)} \le c_0$ for a small enough constant. In this regime, $\tau = \Theta(\tau')$. Our starting point for constructing $A'$ is the distribution $A$ from Lemma C.2, which is a valid realizable contamination as per the lemma statement. The parameter choices are the same as in the proof of the lower bound in Theorem 3.1: we choose $R$ to be $\Theta(\tau'/\gamma) = \Theta(\tau/\gamma)$. Since $\gamma/\tau \le c'$ for a small enough absolute constant, we have that $R$ is at least a large enough constant. Furthermore, $R \le \sqrt{\tau}/10$ since $\gamma \le \tau/c$. Observe that the target number of moments, $m = \widetilde{\Theta}(\tau^2/\gamma^2)$, is $\Theta(R^2/\log R)$. Recall that to get the desired SQ hardness, we want an $A$ that matches $m$ moments for
\[
m \gtrsim \frac{\tau^2}{\gamma^2}\cdot\frac{1}{\log(\tau/\gamma)}.
\]
As shown in Lemma E.1, $A$ matches roughly $m_0 \asymp R^2/\log R$ many moments approximately. To be precise, there exists a small constant $c$ such that if $m \le cR^2/\log R$, then for all $i \in [m]$: $|\mathbb{E}_A[X^i] - \mathbb{E}[G^i]| \lesssim |\gamma|\, e^{-\Omega(R^2)}$. To match $m$ moments exactly, we perturb this distribution to another $A'$ using Lemma E.2.
For this lemma to be applicable, we want that
\[
\inf_{x : |x| \le 1} A(x) \gtrsim |\gamma|\, e^{-\Omega(R^2)}\,\mathrm{poly}(m). \tag{65}
\]
To establish (65), we shall show that, for $R \gtrsim 1$, the left-hand side is at least an absolute constant, while the right-hand side is upper bounded by $O(e^{-\Omega(R^2)})$, so that (65) is satisfied for $R$ large enough. For the left-hand side, the definition of $A$ tells us that $\inf_{x:|x|\le 1} A(x) = \beta\varphi(1) \gtrsim 1$, since $\beta \ge 1/2$ for $R$ at least a large enough constant (by Lemma C.2). For the right-hand side, observe that since $m \le R$, the right-hand side is at most $O(e^{-\Omega(R^2)})$. Thus, the resulting $A'$ matches $m$ moments exactly.

However, we still need to show that $A' \in \mathcal{R}_{\mathbb{R}}(\mathcal{N}(\gamma,1), \epsilon, q)$, i.e., that $A'$ is a valid realizable contamination of $\mathcal{N}(\gamma,1)$. By Corollary C.1, it suffices to establish that
\[
\max_{x\in\mathbb{R}} \log\frac{A'(x)}{\varphi(x;\gamma,1)} \le 0.5\log(1+\tau). \tag{66}
\]
For $|x| > 1$, this is true because $A'$ is exactly equal to $A$. For $|x| \le 1$, we need to ensure that the perturbations are small enough. We have that for any $|x| \le 1$,
\[
\log\frac{A'(x)}{\varphi(x;\gamma,1)} = \log\bigg[\frac{A(x)}{\varphi(x;\gamma)}\Big(1 + \frac{A'(x)-A(x)}{A(x)}\Big)\bigg] = \log\frac{A(x)}{\varphi(x;\gamma)} + \log\Big(1 + \frac{A'(x)-A(x)}{A(x)}\Big).
\]
By Lemma E.2, we have $\big|\frac{A(x)-A'(x)}{A(x)}\big| \le \frac{1}{2}$, and thus the inequality $|\log(1+y)| \lesssim |y|$ for $|y| \le \frac{1}{2}$ implies that
\[
\log\frac{A'(x)}{\varphi(x;\gamma)} \le \log\frac{A(x)}{\varphi(x;\gamma)} + O\Big(\frac{|A'(x)-A(x)|}{A(x)}\Big) \le 0.25\log(1+\tau) + O\big(|\gamma|\, e^{-\Omega(R^2)}\big),
\]
where we use that, in the proof of Lemma C.2, we had established the stronger inequality $\log\frac{A(x)}{\varphi(x;\gamma)} \le 0.25\log(1+\tau)$, as opposed to a bound of $\log\frac{A(x)}{\varphi(x;\gamma)} \le 0.5\log(1+\tau)$. To establish (66), we thus want that $\gamma\, e^{-\Omega(R^2)} \lesssim \log(1+\tau) \lesssim \tau$, meaning $e^{-\Omega(R^2)} \lesssim \tau/\gamma = \Theta(R)$, which is satisfied as long as $R \gtrsim 1$.
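The moment-fixing device of Lemma E.2 can also be seen concretely: perturb the density on $[-1,1]$ by a polynomial whose coefficients are chosen (via a linear system) to cancel the moment gaps. The Python sketch below is a minimal stand-in for the construction in [DK23, Exercise 8.3], not its actual proof; all helper names are ours, and for the perturbed function to remain a density the gaps must be small relative to $\inf_{|x|\le 1} A(x)$, mirroring the lemma's hypothesis.

import numpy as np
from scipy import integrate
from scipy.stats import norm

def fix_moments(A, K):
    # Corrections needed so the first K moments match N(0,1) exactly
    # (index 0 keeps the total mass unchanged).
    targets = np.zeros(K + 1)
    for i in range(1, K + 1):
        mA = integrate.quad(lambda x: x**i * A(x), -np.inf, np.inf)[0]
        targets[i] = norm.moment(i) - mA
    # Solve for Delta(x) = sum_j c_j x^j supported on [-1, 1], using
    # M[i, j] = int_{-1}^{1} x^{i + j} dx.
    idx = np.arange(K + 1)
    powers = idx[:, None] + idx[None, :]
    M = np.where(powers % 2 == 0, 2.0 / (powers + 1), 0.0)
    c = np.linalg.solve(M, targets)
    def A_prime(x):
        x = np.asarray(x, dtype=float)
        return A(x) + np.where(np.abs(x) <= 1, np.polyval(c[::-1], x), 0.0)
    return A_prime

A = lambda x: norm.pdf(x) * (1 + 0.01 * np.tanh(x))  # a valid density near N(0,1)
A2 = fix_moments(A, K=6)
for i in range(1, 7):
    m = sum(integrate.quad(lambda x: x**i * A2(x), a, b)[0]
            for a, b in [(-np.inf, -1), (-1, 1), (1, np.inf)])
    print(i, m, norm.moment(i))  # now agree up to numerical error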
E.2.2 Covariance estimation: Proof of Theorem 4.12

We now provide the proof of Theorem 4.12.

Proof of Theorem 4.12. We again make a change of variables and use $\gamma = \rho$ in the remainder of the proof. Using the same arguments as in the proof of Theorem 4.7, it suffices to establish a moment-matching distribution that is a valid realizable contamination.

Lemma E.6. Let $\epsilon$ and $q$ be such that $\tau := \frac{\epsilon}{q(1-\epsilon)} \le c$ for a small enough absolute constant. Let $\gamma \in (0, 1/2)$ be such that $\tau/\gamma \gtrsim 1$. There exists a univariate distribution $A' \in \mathcal{R}_{\mathbb{R}}(\mathcal{N}(0, 1+\gamma), \epsilon, q)$ that matches $m = \widetilde{\Theta}(\tau/\gamma)$ many moments with $\mathcal{N}(0,1)$.

Proof. We start from the $A$ given in Lemma C.3, where we take $R^2 = \frac{1+\gamma}{8\gamma}\log(1+\tau) \asymp \tau/\gamma$. According to Lemma E.1, the resulting $A$ matches roughly $m_0 = \widetilde{\Theta}(R^2)$ many moments with error at most $O(\gamma\, e^{-\Omega(R^2)})$. Following the same arguments as in the proof of Lemma E.5, we take this approximate moment-matching distribution $A$ and perturb it using Lemma E.2 to get an $m_0$-moment-matching distribution $A'$. This can be done as long as the density of $A$ is bounded from below by $O(|\gamma|\, e^{-\Omega(R^2)})\,\mathrm{poly}(m)$. As in the proof of Lemma E.5, this is satisfied as long as $R \gtrsim 1$.

Next we need to verify that $A'$ is a valid realizable contamination of $\mathcal{N}(0, 1+\gamma)$ in the sense of (66). For this we need to show that the likelihood ratio between $A'$ and $\mathcal{N}(0, 1+\gamma)$ is bounded appropriately, meaning that $\log\frac{A'(x)}{\varphi(x;0,1+\gamma)} \le 0.5\log(1+\tau)$ for all $x \in \mathbb{R}$. Doing the same calculations as in Lemma E.5, we get that
\[
\log\frac{A'(x)}{\varphi(x;0,1+\gamma)} \le \log\frac{A(x)}{\varphi(x;0,1+\gamma)} + O\Big(\frac{|A'(x)-A(x)|}{A(x)}\Big) \le 0.25\log(1+\tau) + O\big(|\gamma|\, e^{-\Omega(R^2)}\big),
\]
where we use that $\log\frac{A(x)}{\varphi(x;0,1+\gamma)} \le 0.25\log(1+\tau)$ if (i) $\log(1+\gamma) \le 0.5\log(1+\tau)$, which is equivalent to $\gamma \lesssim \tau \lesssim 1$ in our regime, and (ii) $R^2 \le \frac{1+\gamma}{2\gamma}\log(1+\tau)$; this can be seen by a simple inspection of the proof of Lemma C.3. For the second term to be less than $0.25\log(1+\tau)$, it suffices that $e^{-\Omega(R^2)} \lesssim \tau/\gamma \asymp R^2$, which holds as long as $R \gtrsim 1$; by our parameter choice, this is equivalent to requiring $\tau/\gamma \gtrsim 1$.

So far, we have established that $A'$ matches the desired number of moments while being a valid realizable contamination. Finally, it remains to establish an upper bound on $\chi^2(A', \mathcal{N}(0,1))$. Using Claim E.3, we have that
\[
\chi^2(A', \mathcal{N}(0,1)) \lesssim 1 + \chi^2(\mathcal{N}(0,1+\gamma), \mathcal{N}(0,1)) \lesssim 1,
\]
whenever $\gamma \le 1/2$ and $\tau \lesssim 1$.

F Extension to multiple missingness patterns for mean and covariance estimation

In this section, we demonstrate how our lower bounds and algorithms extend to the multiple-pattern setting depicted in Figure 1b. Let us begin with a generalization of the contamination model in (2). To this end, let $\mathcal{S}$ be a collection of subsets of $[d]$. We extend the definitions of MCAR (1a) and MNAR (1b) as
\[
\mathrm{MCAR}(P, \mathcal{S}, \pi) := \big\{\, \mathrm{Law}(X \star \Omega) : X \sim P,\; \Omega \in \mathbb{1}_{\mathcal{S}},\; \Omega \perp\!\!\!\perp X,\; \mathbb{P}(\Omega = \mathbb{1}_S) = \pi_S \,\big\}, \tag{67a}
\]
\[
\mathrm{MNAR}(P, \mathcal{S}) := \big\{\, \mathrm{Law}(X \star \Omega) : X \sim P \text{ and } \mathrm{Law}(\Omega) \in \mathcal{P}(\mathbb{1}_{\mathcal{S}}) \,\big\}, \tag{67b}
\]
where above we have let $\mathbb{1}_{\mathcal{S}} = \{\mathbb{1}_S\}_{S\in\mathcal{S}}$, with $\mathbb{1}_S \in \{0,1\}^d$ denoting the vector which takes the value 1 on the index set $S$ and 0 elsewhere, and $\mathcal{P}(\mathbb{1}_{\mathcal{S}})$ denotes the set of probability distributions supported on $\mathbb{1}_{\mathcal{S}}$. We then define the natural extension of the all-or-nothing realizable contamination model (2) as
\[
\mathcal{R}(P, \epsilon, \mathcal{S}, \pi) := (1-\epsilon)\,\mathrm{MCAR}(P, \mathcal{S}, \pi) + \epsilon\,\mathrm{MNAR}(P, \mathcal{S}). \tag{68}
\]
In the subsequent sections, we show that, at the cost of multiplicative factors scaling polynomially in the cardinality $|\mathcal{S}|$, our algorithmic guarantees and lower bounds can be extended to this more general setting.
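For concreteness, the following Python sketch draws samples from the multiple-pattern model (68). The NaN encoding of $\star$ and all function names are illustrative choices, not the paper's notation; the adversary may choose its pattern after seeing $X$, but only from $\mathcal{S}$, as required by (67b).

import numpy as np

rng = np.random.default_rng(0)

def sample_multi_pattern(n, mu, Sigma, patterns, pi, eps, adversary):
    # With prob. 1 - eps the pattern is drawn from pi independently of X
    # (MCAR); with prob. eps it is chosen by `adversary`, possibly as a
    # function of X (MNAR). Unobserved coordinates are set to NaN.
    d = len(mu)
    X = rng.multivariate_normal(mu, Sigma, size=n)
    out = np.full((n, d), np.nan)
    for i in range(n):
        if rng.random() < eps:
            S = adversary(X[i])                            # MNAR pattern in S
        else:
            S = patterns[rng.choice(len(patterns), p=pi)]  # MCAR pattern
        obs = list(S)
        out[i, obs] = X[i, obs]
    return out

# Two overlapping patterns covering [d] (Assumption F.4), each observed a
# constant fraction of the time (Assumption F.3).
patterns = [{0, 1, 2}, {2, 3}]
obs = sample_multi_pattern(1000, np.zeros(4), np.eye(4), patterns,
                           pi=[0.6, 0.4], eps=0.05,
                           adversary=lambda x: patterns[0] if x[0] > 0 else patterns[1])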
F.1 Lower bounds

In order to extend our lower bounds, we rely on the intuition that revealing more data in each observation makes the problem easier. We make this notion precise through an inflation map, defined presently.

Definition F.1 (Inflation map). For $\mathcal{S} \subseteq 2^{[d]}$, we call the map $f: \mathcal{S} \to 2^{[d]}$ an inflation map if, for all $S \in \mathcal{S}$, it holds that $S \subseteq f(S)$.

Equipped with this definition, we have the following lemma.

Lemma F.2 (Enlarging observations makes the problem easier). Let $\mathcal{S} \subseteq 2^{[d]}$ and let $f: \mathcal{S} \to 2^{[d]}$ be an inflation map as in Definition F.1. Let $\mathcal{A}$ denote an algorithm designed for observations from a distribution in $\mathcal{R}(P, \epsilon, \mathcal{S}, \pi)$. Given observations from $\mathcal{R}(P, \epsilon, f(\mathcal{S}), f_*\pi)$, there exists an efficient algorithm which simulates $\mathcal{A}$.

Proof. Let $f_*\pi$ denote the pushforward measure of $\pi$ under the inflation map $f$, so that for any set $S' \in \mathrm{Image}(f)$, $(f_*\pi)(S') = \pi(f^{-1}(S'))$. In turn, let $g$ denote a potentially randomized mapping which satisfies $g(S') \sim \pi$ when $S' \sim f_*\pi$, and note that it is possible to construct this map so that $g(S') \subseteq S'$, almost surely. Now, consider any measure $R$ contained in the inflated realizable contamination set $\mathcal{R}(P, \epsilon, f(\mathcal{S}), f_*\pi)$. By definition, there exist random variables $X \sim P$, $\Omega_{\mathrm{MNAR}}$, $\Omega_{\mathrm{MCAR}}$, and $B$ such that
\[
X \star \big((1-B)\,\Omega_{\mathrm{MCAR}} + B\,\Omega_{\mathrm{MNAR}}\big) \sim R.
\]
Now, since $g(S') \subseteq S'$ and $g(S') \sim \pi$, we see that the random variable
\[
X \star \big((1-B)\,g(\Omega_{\mathrm{MCAR}}) + B\,g(\Omega_{\mathrm{MNAR}})\big) \sim R_0 \in \mathcal{R}(P, \epsilon, \mathcal{S}, \pi),
\]
where $g$ acts on masks through the corresponding index sets. It follows that any algorithm designed for observations in $\mathcal{R}(P, \epsilon, \mathcal{S}, \pi)$ can be simulated by masking observed data from $\mathcal{R}(P, \epsilon, f(\mathcal{S}), f_*\pi)$ using $g$.

Let us use this reduction to provide information-theoretic lower bounds as well as SQ lower bounds. Both corollaries use the following instantiation of the inflation map $f_I$ for $I \subseteq [d]$:
\[
f_I : S \mapsto \begin{cases} S \cup I & \text{if } I \cap S = \emptyset \\ S & \text{otherwise.} \end{cases}
\]
Note that for any set of missingness patterns $\mathcal{S}$ and any $I$, either all or none of the subset $I$ is observed in any given sample. Next, consider the error metrics $\|\theta_1 - \theta_2\|_2$ and $\|\Sigma_1^{-1/2}\Sigma_2\Sigma_1^{-1/2} - I_d\|_{\mathrm{op}}$, and note that for any index set $I' \subseteq [d]$, it holds that
\[
\|\theta_1 - \theta_2\|_2 \ge \|(\theta_1)_{I'} - (\theta_2)_{I'}\|_2 \quad\text{and}\quad \|\Sigma_1^{-1/2}\Sigma_2\Sigma_1^{-1/2} - I_d\|_{\mathrm{op}} \ge \big\|(\Sigma_1)_{I',I'}^{-1/2}(\Sigma_2)_{I',I'}(\Sigma_1)_{I',I'}^{-1/2} - I_{|I'|}\big\|_{\mathrm{op}}.
\]
Hence, taking the maximum over all subsets $I \subseteq [d]$ in both the minimax quantile lower bounds of Theorems 3.1 and 3.2 and the SQ lower bounds of Theorems 4.7 and 4.12, and applying the reduction in Lemma F.2, we conclude that the same lower bounds hold in the multiple-pattern setting.

F.2 Upper bounds

In the previous section, we showed that our lower bounds for mean and covariance estimation continue to hold in the multiple-pattern setting. Here, we use our algorithms to develop quantitative upper bounds for mean and covariance estimation in turn, beginning with mean estimation. In order to obtain quantitative guarantees, we require the following pair of regularity conditions.

Assumption F.3 (Constant observation probability). The pair $(\mathcal{S}, \pi_{\mathcal{S}})$ satisfies the constant observation probability assumption if there exists a constant $c_0 > 0$ such that $\min_{S\in\mathcal{S}} \pi_S > c_0$.

Assumption F.3 ensures that each pattern is observed a constant fraction of the time. The second assumption concerns the coverage of the patterns in $\mathcal{S}$.

Assumption F.4 (Coverage regularity). The set of patterns $\mathcal{S}$ has regular coverage if $\bigcup_{S\in\mathcal{S}} S = [d]$.

Note that Assumption F.4 is necessary for non-trivial mean estimation.

F.2.1 Mean estimation

We begin by describing how to adapt estimators designed for the all-or-nothing setting to the multiple-pattern setting, proceeding in the following steps (a code sketch of Step 3 appears at the end of this subsection):

1. For each $S \in \mathcal{S}$, form the modified input that replaces every observation whose missingness pattern is not $S$ with the fully censored vector $(\star)^d$ and discards all coordinates in $[d]\setminus S$. This corresponds to an all-or-nothing dataset in $\mathbb{R}^S_\star$.

2. Run an all-or-nothing estimator (e.g., the estimator implicit in Theorem 3.1 or Algorithm 1) on each dataset to obtain an estimate $\widehat\theta_S \in \mathbb{R}^S$ with a high-probability error bound $r_S$, i.e., $\|\widehat\theta_S - (\theta_\star)_S\|_2 \le r_S$. This implies that the true mean $\theta_\star$ lies in the cylinder
\[
C_S := \big\{\theta \in \mathbb{R}^d : \|\widehat\theta_S - \theta_S\|_2 \le 2r_S\big\}.
\]

3. Since the guarantee above holds for all patterns $S \in \mathcal{S}$ (after taking a union bound), we deduce that $\theta_\star \in \bigcap_{S\in\mathcal{S}} C_S =: C$. Moreover, since each of the cylinders $C_S$ is convex, we can find some element $\widehat\theta \in C$ via convex optimization.

After producing an estimator $\widehat\theta$ using the steps above, we note that since both $\widehat\theta \in C$ and $\theta_\star \in C$, we can produce an error bound by computing a bound on the diameter of the set $C$. In order to make the above steps precise, three items remain.
First, in order to use our all-or-nothing algorithms, we must show that the reduction in Step 1 preserves realizability. Second, we provide more detail on how to produce an element of $C$ via convex optimization. Finally, we provide a bound on the diameter of the set $C$. We provide the details for each of these three items in order.

Projection preserves realizability. Let us begin with some notation. For $S \subseteq [d]$, let $\Pi_S : \mathbb{R}^d_\star \to \mathbb{R}^{|S|}_\star$ be the projection onto the coordinates in $S$. Additionally, define the censoring function $F_S : \mathbb{R}^d_\star \to \mathbb{R}^{|S|}_\star$ as
\[
F_S(x) = \begin{cases} x_S & \text{if } x_j \ne \star \text{ for all } j \in S \text{ and } x_j = \star \text{ for all } j \in S^c \\ (\star)^{|S|} & \text{otherwise;} \end{cases}
\]
i.e., we take the coordinates in $S$ when the missingness pattern is exactly $S$, and fully censor otherwise. The following straightforward lemma demonstrates that this censoring map preserves realizability.

Lemma F.5. Suppose that $Q \in \mathcal{R}(P, \epsilon, \mathcal{S}, \pi)$. Then, for every $S \in \mathcal{S}$, the pushforward measure $(F_S)_*Q$ satisfies the inclusion $(F_S)_*Q \in \mathcal{R}\big((\Pi_S)_*P, \epsilon, \pi_S\big)$.

Proof. Note that by definition there exists a tuple of random variables $X \sim P$, $\Omega_{\mathrm{MCAR}}$, $\Omega_{\mathrm{MNAR}}$, and $B \sim \mathrm{Bern}(\epsilon)$ such that
\[
X \star \big((1-B)\,\Omega_{\mathrm{MCAR}} + B\,\Omega_{\mathrm{MNAR}}\big) \sim Q.
\]
Now, consider the map $g : \{0,1\}^d \to \{0,1\}^{|S|}$ defined so that $g(\mathbb{1}_S) = \mathbb{1}_{|S|}$ (the all-ones vector) and $g(v) = 0$ (the all-zeros vector) for all other $v$. Note that
\[
\Pi_S X \star \big((1-B)\,g(\Omega_{\mathrm{MCAR}}) + B\,g(\Omega_{\mathrm{MNAR}})\big) \sim (F_S)_*Q.
\]
Further, it holds that $B$, $g(\Omega_{\mathrm{MCAR}})$, and $(\Pi_S X, g(\Omega_{\mathrm{MNAR}}))$ are mutually independent, with $\mathbb{P}(g(\Omega_{\mathrm{MCAR}}) = \mathbb{1}_{|S|}) = \pi_S$. This proves the claim.

Computing an element $\widehat\theta \in C$. We will show that an element of $C$ can be computed in polynomial time by the ellipsoid method. In order for this to hold, we require inner and outer radii $r$ and $R$ such that
\[
B_2(\theta, r) \subseteq C \subseteq B_2(0, R)
\]
for some element $\theta \in C$, together with a separation oracle for the set $C$.

Let us begin with the separation oracle. Note that if $\theta \notin C$, then there exists $S \in \mathcal{S}$ such that $\theta \notin C_S$. A separation oracle for balls in $\mathbb{R}^{|S|}$ provides a separation oracle for $C_S$ by filling the coordinates in $[d]\setminus S$ with zeros. This holds for all $S \in \mathcal{S}$ and in turn provides a separation oracle for the intersection $C$.

Next, we show that $C \subseteq B_2(0, R)$. Note that by Assumption F.4, it holds that $\cup_{S\in\mathcal{S}} S = [d]$. Hence, by the triangle inequality, we deduce that for any $\theta \in C$,
\[
\|\theta\|_2 \le \sum_{S\in\mathcal{S}} \|\theta_S\|_2 \le \sum_{S\in\mathcal{S}} \big(2r_S + \|\widehat\theta_S\|_2\big) =: R.
\]
Finally, we show that there exists an $r > 0$ such that $B_2(\theta_\star, r) \subseteq C$. To this end, we take $r := r_{\min} = \min_{S\in\mathcal{S}} r_S$. We then see that for any $\theta \in B_2(\theta_\star, r)$,
\[
\|\theta_S - \widehat\theta_S\|_2 \le \|(\theta_\star)_S - \widehat\theta_S\|_2 + \|\theta_S - (\theta_\star)_S\|_2 \le r_S + r_{\min} \le 2r_S.
\]
Since the above inequality holds for all $S \in \mathcal{S}$, it holds that $\theta \in C$, whence we deduce that $B_2(\theta_\star, r) \subseteq C$. It remains to bound the diameter of the set $C$.

Bounding the diameter of the set $C$. Let $w : \mathcal{S} \to [0,1]$ denote a weighting function such that for all $j \in [d]$,
\[
\sum_{S\,:\, j \in S} w(S) \ge 1.
\]
Note that under the coverage condition in Assumption F.4, such a weighting function exists. Let $I_S = \mathrm{diag}(\mathbb{1}_S) \in \mathcal{C}^d_+$ denote the diagonal matrix which contains ones on the indices contained in the set $S$, and note that the previous display ensures that $I_d \preceq \sum_{S\in\mathcal{S}} w(S)\, I_S$. It follows that
\[
\big\|\widehat\theta - \theta_\star\big\|_2^2 = \big(\widehat\theta - \theta_\star\big)^\top I_d \big(\widehat\theta - \theta_\star\big) \le \sum_{S\in\mathcal{S}} w(S)\cdot\big(\widehat\theta - \theta_\star\big)^\top I_S \big(\widehat\theta - \theta_\star\big) = \sum_{S\in\mathcal{S}} w(S)\cdot\big\|\big(\widehat\theta - \theta_\star\big)_S\big\|_2^2 \le 4\sum_{S\in\mathcal{S}} w(S)\, r_S^2.
\]
Note that it is always valid to take the weighting function $w(S) = 1$ for all $S \in \mathcal{S}$, so that
\[
\big\|\widehat\theta - \theta_\star\big\|_2^2 \le 4|\mathcal{S}|\cdot\max_{S\in\mathcal{S}} r_S^2.
\]
Combining this bound with the lower bound in the previous section, we see that under Assumption F.3 (which ensures $|\mathcal{S}|$ is of constant order), our algorithms are optimal in the multi-pattern setting up to a multiplicative factor of $|\mathcal{S}|$.
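As an illustration of Step 3, the Python sketch below computes a point in the intersection $C = \cap_S C_S$ of the cylinders. In place of the ellipsoid method described above, it uses cyclic alternating projections as a lightweight stand-in; for convex cylinders with nonempty intersection this also converges to an element of $C$ (approximately, after finitely many passes).

import numpy as np

def project_cylinder(theta, S, theta_hat_S, radius):
    # Project theta onto C_S = {theta : ||theta_S - theta_hat_S||_2 <= radius}.
    # Only the coordinates in S move; the remaining ones are unconstrained.
    theta = theta.copy()
    diff = theta[S] - theta_hat_S
    nrm = np.linalg.norm(diff)
    if nrm > radius:
        theta[S] = theta_hat_S + diff * (radius / nrm)
    return theta

def intersect_cylinders(d, estimates, n_passes=500):
    # `estimates` is a list of triples (S, theta_hat_S, r_S) from Step 2.
    theta = np.zeros(d)
    for _ in range(n_passes):
        for S, th_S, r_S in estimates:
            theta = project_cylinder(theta, np.asarray(list(S)), th_S, 2 * r_S)
    return theta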
F.2.2 Covariance estimation

We propose a very similar algorithm to that of the mean estimation setting. Although in the main text we provide guarantees for our estimators in relative operator norm, in order to simplify the development here, we consider guarantees in the (weaker) operator norm. In particular, for each $S \in \mathcal{S}$, our algorithms yield estimators $\widetilde\Sigma_S \in \mathcal{C}^{|S|}_+$ which satisfy the guarantee
\[
\big\|\widetilde\Sigma_S - (\Sigma_\star)_{S,S}\big\|_{\mathrm{op}} \le r_S, \quad\text{for all } S \in \mathcal{S}.
\]
Proceeding as before, we define the cylinders
\[
C_S := \big\{\Sigma \in \mathcal{C}^d_+ : \big\|\widetilde\Sigma_S - P_S \Sigma P_S^\top\big\|_{\mathrm{op}} \le r_S\big\},
\]
where the matrices $P_S \in \mathbb{R}^{|S|\times d}$ take the form $P_S = ((e_j)_{j\in S})^\top$ for $e_j$ the $j$-th elementary vector. We again define the intersection $C = \cap_{S\in\mathcal{S}} C_S$ and produce an estimate $\widehat\Sigma$ as any element of $C$. As in the mean estimation setting, we produce such an estimate via the ellipsoid method; the inner radius $r$, outer radius $R$, and separation oracle can be obtained in a manner parallel to the mean estimation setting.

The primary departure from the setting of mean estimation comes when we bound the diameter of the set $C$. In particular, we require the following coverage condition.

Assumption F.6 (Coverage regularity for covariance). The set of patterns $\mathcal{S}$ has regular coverage if, for every pair $i, j \in [d]$, there exists $S \in \mathcal{S}$ such that $\{i, j\} \subseteq S$.

Henceforth, we will assume that Assumption F.6 is in force. Next, let $I_1, \dots, I_k$ be a partition of the coordinate set $[d]$ such that for every pair $\ell_1, \ell_2 \in [k]$, there exists a pattern $S \in \mathcal{S}$ such that $I_{\ell_1} \cup I_{\ell_2} \subseteq S$. Note that under Assumption F.6, such a partition always exists; for instance, one can take the partition into singleton sets. Given such a partition, let $S_{\ell_1,\ell_2} \in \mathcal{S}$ satisfy $I_{\ell_1} \cup I_{\ell_2} \subseteq S_{\ell_1,\ell_2}$. Now, for any $\widehat\Sigma \in C$ and $v \in S^{d-1}$, writing $\Pi_\ell := P_{I_\ell}^\top P_{I_\ell}$ (so that $\sum_{\ell=1}^k \Pi_\ell = I_d$), we have
\[
v^\top\big(\widehat\Sigma - \Sigma_\star\big)v = \sum_{\ell_1,\ell_2\in[k]} (\Pi_{\ell_1}v)^\top\big(\widehat\Sigma - \Sigma_\star\big)(\Pi_{\ell_2}v) \le \sum_{\ell_1,\ell_2\in[k]} \big\|P_{I_{\ell_1}}v\big\|_2\,\big\|P_{I_{\ell_2}}v\big\|_2\; r_{S_{\ell_1,\ell_2}} \le r_{\max}\cdot\Big(\sum_{\ell\in[k]}\big\|P_{I_\ell}v\big\|_2\Big)^2 \le k\, r_{\max},
\]
where $r_{\max} = \max_{S\in\mathcal{S}} r_S$; here each $(\ell_1,\ell_2)$ term involves only the $(S_{\ell_1,\ell_2}, S_{\ell_1,\ell_2})$ block of $\widehat\Sigma - \Sigma_\star$, and the final step follows from Cauchy–Schwarz together with $\sum_{\ell}\|P_{I_\ell}v\|_2^2 = 1$. We thus see that minimizing the number of parts in the partition yields better rates.

Let us conclude by providing a coarse upper bound on the size of a minimum partition. As mentioned, the singleton partition always satisfies the desired condition; this, however, gives the dimension-dependent upper bound $k \le d$. When the number of patterns $|\mathcal{S}|$ is a constant, we can provide a smaller partition with $k \le 2^{|\mathcal{S}|}$. To each coordinate of $[d]$, associate its binary membership signature, recording which sets in $\mathcal{S}$ it is a member of, and let the parts $I_\ell$ collect the coordinates sharing a common signature. By construction, this is a partition of $[d]$. Moreover, it satisfies the pairwise union condition. To see this, consider any elements $x_{\ell_1} \in I_{\ell_1}$ and $x_{\ell_2} \in I_{\ell_2}$ and note that by Assumption F.6, there exists some set $S \in \mathcal{S}$ such that both $x_{\ell_1}, x_{\ell_2} \in S$; since every element of $I_{\ell_1}$ shares the signature of $x_{\ell_1}$ (and likewise for $I_{\ell_2}$), it follows that both $I_{\ell_1}, I_{\ell_2} \subseteq S$, so that $I_{\ell_1} \cup I_{\ell_2} \subseteq S$. Hence, we deduce the bound $k \le 2^{|\mathcal{S}|}$, so that
\[
\big\|\widehat\Sigma - \Sigma_\star\big\|_{\mathrm{op}} \le \min\big(d,\; 2^{|\mathcal{S}|}\big)\, r_{\max} \le C_{|\mathcal{S}|}\, r_{\max},
\]
where the final inequality follows from Assumption F.3 and $C_{|\mathcal{S}|}$ is a constant depending only on the number of patterns.
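The signature partition just described is simple to compute. The sketch below groups coordinates by their pattern-membership signature, realizing the bound $k \le 2^{|\mathcal{S}|}$; under Assumption F.6 any two resulting parts are jointly contained in some pattern.

def signature_partition(d, patterns):
    # Coordinates i and j share a part iff they belong to exactly the same
    # patterns, so there are at most 2^{|patterns|} parts.
    parts = {}
    for j in range(d):
        sig = frozenset(i for i, S in enumerate(patterns) if j in S)
        parts.setdefault(sig, []).append(j)
    return list(parts.values())

# Three patterns on d = 6 coordinates satisfying Assumption F.6:
print(signature_partition(6, [{0, 1, 2, 3}, {2, 3, 4, 5}, {0, 1, 4, 5}]))
# -> [[0, 1], [2, 3], [4, 5]]  (k = 3, versus the singleton bound k = d = 6)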
G Auxiliary lemmas

G.1 Reduction to $q = 1$: Proof of Lemma 1.2

Proof. First, suppose that $Q \in \mathcal{R}(P, \epsilon, q)$, so that by Lemma 1.1, for all $z \in \mathbb{R}^d$,
\[
q(1-\epsilon) \le \frac{dQ}{dP}(z) \le q(1-\epsilon) + \epsilon = q'. \tag{69}
\]
We then define the distribution $Q'$ on $\mathbb{R}^d \cup \{\star^d\}$ so that
\[
\frac{dQ'}{dP}(z) = \frac{1}{q'}\,\frac{dQ}{dP}(z) \text{ for } z \in \mathbb{R}^d, \quad\text{and}\quad Q'\big(\{\star^d\}\big) = 1 - \frac{1}{q'}\cdot Q\big(\mathbb{R}^d\big).
\]
Note that by construction, $Q'$ is a probability distribution. Moreover, observe that the invariant relation
\[
q(1-\epsilon) = q'\cdot(1-\epsilon')
\]
holds by construction. Consequently, applying the sandwich relation (69), we obtain the inequalities
\[
\frac{dQ'}{dP}(z) = \frac{1}{q'}\cdot\frac{dQ}{dP}(z) \le 1 \quad\text{and}\quad \frac{dQ'}{dP}(z) = \frac{1}{q'}\cdot\frac{dQ}{dP}(z) \ge 1-\epsilon'.
\]
By Lemma 1.1, this implies that $Q' \in \mathcal{R}(P, \epsilon', 1)$. To conclude this part, note that $q'\cdot Q'$ and $Q$ agree on the entirety of $\mathbb{R}^d$, so that
\[
Q = q'\cdot Q' + (1-q')\,\delta_{\{\star^d\}},
\]
since $Q$ must integrate to 1 on $\mathbb{R}^d \cup \{\star^d\}$. To obtain the reverse implication, suppose that $Q' \in \mathcal{R}(P, \epsilon', 1)$. Applying Lemma 1.1 once more yields
\[
q(1-\epsilon) = q'\cdot(1-\epsilon') \le q'\cdot\frac{dQ'}{dP}(z) \le q' = q(1-\epsilon) + \epsilon.
\]
Now, consider $Q = q'\cdot Q' + (1-q')\cdot\delta_{\{\star^d\}}$. By construction, $Q$ is a probability distribution. Hence, from the above display, we conclude that $Q \in \mathcal{R}(P, \epsilon, q)$.

G.2 Useful general lemmas

Lemma G.1. Let $a, b \ge 0$ and $\ell \le 1$. Then $(a+b)^\ell \le a^\ell + b^\ell$.

Proof. The case $a = b = 0$ is trivial, so for the remainder we will assume $a + b > 0$. Observe
\[
a^\ell + b^\ell = (a+b)^\ell\bigg(\Big(\frac{a}{a+b}\Big)^\ell + \Big(\frac{b}{a+b}\Big)^\ell\bigg) \ge (a+b)^\ell,
\]
where the last step follows from the fact that $x^\ell + (1-x)^\ell \ge 1$ for $x \in [0,1]$.

Fact G.2. For all $k \in \mathbb{N}$, $\sqrt{2\pi k}\,(k/e)^k \le k! \le \sqrt{2\pi k}\,(k/e)^k\, e^{1/(12k)}$.

Lemma G.3. For $n \in \mathbb{N}$ and $\epsilon \in [0,1)$, let $m \sim \mathrm{Bin}(n, 1-\epsilon)$. For $\delta \in (0,1)$, suppose
\[
n \ge \frac{(1+\epsilon)^2}{\epsilon^2(1-\epsilon)^2}\log\frac{1}{\delta}.
\]
Then with probability at least $1-\delta$ we have $m \ge \frac{(1-\epsilon)n}{1+\epsilon}$.

Proof. Observe $\mathbb{E}[m] = (1-\epsilon)n$. Hoeffding's inequality thus implies for any $t > 0$ that
\[
\mathbb{P}\big(m - (1-\epsilon)n \le -t\big) \le \exp\Big(-\frac{2t^2}{n}\Big).
\]
The result then follows by taking $t = \big(1 - \tfrac{1}{1+\epsilon}\big)(1-\epsilon)n = \tfrac{\epsilon}{1+\epsilon}(1-\epsilon)n$.

Lemma G.4. Let $L(\Sigma, \Sigma_\star) = \|\Sigma_\star^{-1/2}\Sigma\Sigma_\star^{-1/2} - I\|_{\mathrm{op}}$. Let $\gamma \in [0,1]$ and $v, \bar v \in S^{d-1}$. Then for all $M \in \mathcal{C}^d_+$ we have
\[
L(M, I_d + \gamma vv^\top) \vee L(M, I_d + \gamma\bar v\bar v^\top) > \frac{\gamma}{16}\|v - \bar v\|_2^2.
\]
Proof. Suppose $L(M, I_d + \gamma vv^\top) < \frac{\gamma}{16}\|v-\bar v\|_2^2$, so that $M = I_d + \gamma vv^\top + E$ for some $E$ with
\[
\frac{1}{1+\gamma}\|E\|_{\mathrm{op}} \le \big\|(I + \gamma vv^\top)^{-1/2}E(I + \gamma vv^\top)^{-1/2}\big\|_{\mathrm{op}} < \frac{\gamma}{16}\|v-\bar v\|_2^2.
\]
Then,
\[
\big\|(I_d + \gamma\bar v\bar v^\top)^{-1/2}M(I_d + \gamma\bar v\bar v^\top)^{-1/2} - I_d\big\|_{\mathrm{op}} = \big\|(I_d + \gamma\bar v\bar v^\top)^{-1/2}\big(I_d + \gamma vv^\top + E\big)(I_d + \gamma\bar v\bar v^\top)^{-1/2} - I_d\big\|_{\mathrm{op}}
\]
\[
\ge \gamma\,\big\|(I_d + \gamma\bar v\bar v^\top)^{-1/2}\big(vv^\top - \bar v\bar v^\top\big)(I_d + \gamma\bar v\bar v^\top)^{-1/2}\big\|_{\mathrm{op}} - \big\|(I_d + \gamma\bar v\bar v^\top)^{-1/2}E(I_d + \gamma\bar v\bar v^\top)^{-1/2}\big\|_{\mathrm{op}}
\]
\[
\ge \frac{\gamma}{1+\gamma}\big\|vv^\top - \bar v\bar v^\top\big\|_{\mathrm{op}} - \frac{\gamma(1+\gamma)}{16}\|v-\bar v\|_2^2.
\]
Now let $u \in S^{d-1}$ be orthogonal to $v$ and such that $\bar v = \alpha v + \sqrt{1-\alpha^2}\,u$ for some $\alpha \in [0,1]$. Then $\|vv^\top - \bar v\bar v^\top\|_{\mathrm{op}} \ge (\bar v\cdot u)^2 = 1-\alpha^2$ and $\|v - \bar v\|_2^2 = 2(1-\alpha)$. Thus, we have the lower bound
\[
\|vv^\top - \bar v\bar v^\top\|_{\mathrm{op}} \ge 1-\alpha^2 \ge 1-\alpha = \frac{1}{2}\|v-\bar v\|_2^2.
\]
Plugging this in gives
\[
L(M, I_d + \gamma\bar v\bar v^\top) \ge \frac{\gamma}{2(1+\gamma)}\|v-\bar v\|_2^2 - \frac{\gamma(1+\gamma)}{16}\|v-\bar v\|_2^2 \ge \frac{\gamma}{8}\|v-\bar v\|_2^2 > \frac{\gamma}{16}\|v-\bar v\|_2^2,
\]
where the last steps follow from $\gamma \in [0,1]$. This proves the claim.

Below we present a slightly modified version of [BEHW89, Lemma 3.2.3]. The only difference is that we allow the $s$-fold intersection to be across different classes, but this does not change the result.

Lemma G.5 (VC dimension of $s$-fold intersections). Let $C_1, \dots, C_s \subseteq 2^X$ be concept classes such that $\mathrm{VC}(C_i) \le d$ for all $i \in [s]$. Then the concept class $C_\cap = \{\cap_{i=1}^s c_i \mid c_i \in C_i\ \forall i \in [s]\}$ has VC dimension less than $2ds\log(3s)$.

G.3 Univariate Gaussian concentration and moment properties

Throughout, we denote the density of a standard univariate Gaussian by the function $\varphi$. We will also use $G$ to denote a standard Gaussian random variable.

Fact G.6 (Gaussian moments). For all $k \in \mathbb{N}$,
\[
\mathbb{E}[|G|^{2k-1}] = \sqrt{\frac{2}{\pi}}\,(2k-2)!! \quad\text{and}\quad \mathbb{E}[G^{2k}] = (2k-1)!!.
\]

Lemma G.7. Let $k \ge 1$ and $T \ge 2\sqrt{k}\log(3k)$. Then $\mathbb{E}\big[G^{2k-2}\mathbf{1}\{G \in [1,T]\}\big] \ge \frac{(2k-3)!!}{3}$.

Proof. From Fact G.6, we have $\mathbb{E}[G^{2k-2}] = (2k-3)!!$, and by symmetry of the Gaussian distribution we have that
\[
\frac{(2k-3)!!}{2} = \int_0^\infty t^{2k-2}\varphi(t)\,dt = \underbrace{\int_0^1 t^{2k-2}\varphi(t)\,dt}_{=:A} + \underbrace{\int_1^T t^{2k-2}\varphi(t)\,dt}_{=:B} + \underbrace{\int_T^\infty t^{2k-2}\varphi(t)\,dt}_{=:C}.
\]
We want to lower bound $B$, which we will do by upper bounding $A$ and $C$. To upper bound $A$,
\[
A = \int_0^1 t^{2k-2}\varphi(t)\,dt \le \int_0^1 \frac{e^{-t^2/2}}{\sqrt{2\pi}}\,dt \le \frac{1}{\sqrt{2\pi}}.
\]
To upper bound $C$,
\[
C = \int_T^\infty t^{2k-2}\varphi(t)\,dt \le \frac{1}{T}\int_T^\infty t^{2k-1}\varphi(t)\,dt.
\]
Noting that $\frac{d}{dt}(-\varphi(t)) = t\varphi(t)$, we apply integration by parts with $u = t^{2k-2}$ and $v = -\varphi(t)$ to obtain
\[
\int_T^\infty t^{2k-1}\varphi(t)\,dt = \int_T^\infty t^{2k-2}\cdot t\varphi(t)\,dt = \big[-t^{2k-2}\varphi(t)\big]_T^\infty + (2k-2)\int_T^\infty t^{2k-3}\varphi(t)\,dt \le T^{2k-2}\varphi(T) + \frac{2k-2}{T}\,C.
\]
Thus,
\[
C \le \frac{1}{T}\Big(T^{2k-2}\varphi(T) + \frac{2k-2}{T}\,C\Big) \quad\Longrightarrow\quad C \le \frac{1}{\sqrt{2\pi}}\Big(1 - \frac{2k-2}{T^2}\Big)^{-1} T^{2k-3}\, e^{-T^2/2}.
\]
Observe that $t \mapsto t^{2k-3}e^{-t^2/2}$ is decreasing for $t \ge \sqrt{2k-3}$, so for $T \ge 2\sqrt{k}\log(3k)$ we have that
\[
T^{2k-3}e^{-T^2/2} \le \big(2\sqrt{k}\log(3k)\big)^{2k-3}\, e^{-(4k\log^2(3k))/2} \le \exp\big(-2k\log^2(3k) + 2k\log(3k)\big) \le 1.
\]
Also, for $T \ge 2\sqrt{k-1}$, we have $\big(1 - \frac{2k-2}{T^2}\big)^{-1} \le 2$. Thus, $C \le \frac{2}{\sqrt{2\pi}}$. This implies
\[
B \ge \frac{(2k-3)!!}{2} - \frac{3}{\sqrt{2\pi}} \ge \frac{(2k-3)!!}{3}.
\]

Lemma G.8. Let $\ell \in \mathbb{N}$. Then $\mathbb{E}[G^\ell] \le 2(\ell/e)^{\ell/2}$.

Proof. The claim is trivial when $\ell$ is odd, so for the remainder let $\ell = 2k$ for some $k \in \mathbb{N}$. Stirling's formula (Fact G.2) implies
\[
\sqrt{2\pi k}\,(k/e)^k \le k! \le \sqrt{2\pi k}\,(k/e)^k\, e^{1/(12k)}. \tag{70}
\]
Thus, we have
\[
\mathbb{E}[G^{2k}] = \frac{(2k)!}{2^k\, k!} \le \frac{\sqrt{4\pi k}\,(2k/e)^{2k}\, e^{1/(24k)}}{2^k\sqrt{2\pi k}\,(k/e)^k} = \sqrt{2}\, e^{1/(24k)}\,(2k/e)^k \le 2\,(2k/e)^k.
\]
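A quick numerical sanity check of Fact G.6 and Lemma G.8 (reading the latter's bound as $\mathbb{E}[G^\ell] \le 2(\ell/e)^{\ell/2}$, our reconstruction of the garbled statement):

import math
from scipy.stats import norm

def double_factorial(n):
    return math.prod(range(n, 0, -2)) if n > 0 else 1

for k in range(1, 8):
    even = norm.moment(2 * k)             # E[G^{2k}]
    dfact = double_factorial(2 * k - 1)   # (2k - 1)!!
    bound = 2 * (2 * k / math.e) ** k     # Lemma G.8 at l = 2k
    print(k, even, dfact, bound)          # even matches dfact and sits below bound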
Lemma G.9. Let $G_1, \dots, G_n \overset{\mathrm{iid}}{\sim} \mathcal{N}(0,1)$ and $m \in \mathbb{N}$. Then with probability at least $1-\delta$, we have that
\[
\sum_{i=1}^n |G_i|^m \le n(m-1)!! + \sqrt{n\, 2^{m-1}\log(2/\delta)}\;\big(\log(2n/\delta)\big)^{m/2}.
\]
Proof. Let $T > 0$ be a threshold we will specify later, let $\mu_m = \mathbb{E}|G_1|^m$, and define $A_i = |G_i|^m\mathbf{1}\{|G_i| \le T\}$. Then by a union bound,
\[
\mathbb{P}\bigg(\frac{1}{n}\sum_{i=1}^n |G_i|^m \ge \mu_m + \Delta_1\bigg) \le \mathbb{P}\bigg(\frac{1}{n}\sum_{i=1}^n A_i \ge \mu_m + \Delta_1\bigg) + \mathbb{P}\Big(\max_{i\in[n]}|G_i| > T\Big).
\]
For the first term, note that because $A_i$ is supported on $[0, T^m]$, it is $\frac{1}{4}T^{2m}$-subgaussian. Furthermore, $\mathbb{E}[A_1] \le \mu_m$. Thus,
\[
\mathbb{P}\bigg(\frac{1}{n}\sum_{i=1}^n A_i \ge \mu_m + \Delta_1\bigg) \le \mathbb{P}\bigg(\frac{1}{n}\sum_{i=1}^n A_i - \mathbb{E}[A_1] \ge \Delta_1\bigg) \le \exp\big(-2n\Delta_1^2\, T^{-2m}\big).
\]
This is bounded by $\delta/2$ if we choose $T$ such that
\[
T^2 \le \bigg(\frac{2n\Delta_1^2}{\log(2/\delta)}\bigg)^{1/m}.
\]
For the second term, we have by a union bound that $\mathbb{P}(\max_{i\in[n]}|G_i| > T) \le 2n\exp(-\tfrac{1}{2}T^2)$, which is at most $\delta/2$ for $T \ge 2\sqrt{\log(2n/\delta)}$. We can always find some $T$ satisfying both constraints as long as
\[
\frac{2n\Delta_1^2}{\log(2/\delta)} \ge 2^m\big(\log(2n/\delta)\big)^m, \quad\text{i.e.,}\quad \Delta_1 \ge \sqrt{\frac{2^{m-1}\log(2/\delta)\big(\log(2n/\delta)\big)^m}{n}},
\]
proving the claim (using that $\mu_m \le (m-1)!!$).

The next two lemmas provide basic properties of the Gaussian survival function $\bar\Phi$.

Lemma G.10 (Mills' ratio bound). For all $t \ge 0$,
\[
\frac{t}{t^2+1}\cdot\varphi(t) \le \bar\Phi(t) \le \frac{1}{t}\cdot\varphi(t).
\]

Lemma G.11. Let $\bar\Phi : \mathbb{R} \to [0,1]$ be defined as $\bar\Phi(x) = \mathbb{P}(G \ge x)$, where $G \sim \mathcal{N}(0,1)$. The following hold:

(a) The inverse function $\bar\Phi^{-1}$ is differentiable and its derivative is given by
\[
\frac{d}{dx}\,\bar\Phi^{-1}(x) = -\frac{1}{\varphi(\bar\Phi^{-1}(x))}.
\]
(b) For $0 \le t \le 1/2$,
\[
t\cdot\bar\Phi^{-1}(t) \le \varphi\big(\bar\Phi^{-1}(t)\big) \le t\cdot\Big(\bar\Phi^{-1}(t) + \frac{1}{\bar\Phi^{-1}(t)}\Big).
\]
Proof. We prove each part in turn.

Proof of part (a). Since $\bar\Phi$ is continuously differentiable, it follows from the inverse function theorem that
\[
\frac{d}{dx}\,\bar\Phi^{-1}(x) = \frac{1}{\bar\Phi'\big(\bar\Phi^{-1}(x)\big)} = -\frac{1}{\varphi(\bar\Phi^{-1}(x))},
\]
as desired.

Proof of part (b). The desired sandwich relation follows from the Mills' ratio bounds (Lemma G.10) applied at the point $\bar\Phi^{-1}(t)$ for $0 \le t \le 1/2$.

G.4 Multivariate Gaussian concentration and moment properties

Lemma G.12. Let $z \sim \mathcal{N}(0, I_d)$ and $\ell \in \mathbb{N}$. For an index $\alpha = (i_1, \dots, i_\ell) \in [d]^\ell$, let $X = (z^{\otimes\ell})_\alpha$. Then for any $\gamma > 1/2$ we have $\mathbb{E}\big[\exp\big(\tfrac{1}{8\gamma}|X|^{2/\ell}\big)\big] \le e^{1/(2\gamma)}$.

Proof. We can write $X = \Pi_{i\in\mathrm{supp}(\alpha)}\, z_i^{m_i}$ for multiplicities $m_i \in \mathbb{N}$ with $\sum_i m_i = \ell$. Observe
\[
\frac{1}{8\gamma}|X|^{2/\ell} = \frac{1}{8\gamma}\Big(\Pi_{i\in\mathrm{supp}(\alpha)}\, z_i^{2m_i}\Big)^{1/\ell} \le \frac{1}{8\gamma\ell}\sum_{i\in\mathrm{supp}(\alpha)} m_i z_i^2,
\]
where the last step follows from the AM–GM inequality. Recall that for $t < 1/4$ we have $\mathbb{E}_{G\sim\mathcal{N}(0,1)}[\exp(tG^2)] \le (1-2t)^{-1} \le \exp(4t)$. Our assumption on $\gamma$ implies $\frac{m_i}{8\gamma\ell} < \frac{1}{4}$ for all $i \in \mathrm{supp}(\alpha)$, and so it follows that
\[
\mathbb{E}\Big[\exp\Big(\frac{1}{8\gamma}|X|^{2/\ell}\Big)\Big] \le \Pi_{i\in\mathrm{supp}(\alpha)}\,\mathbb{E}\Big[\exp\Big(\frac{m_i}{8\gamma\ell}\, z_i^2\Big)\Big] \le \Pi_{i\in\mathrm{supp}(\alpha)}\exp\Big(\frac{m_i}{2\gamma\ell}\Big) = e^{1/(2\gamma)}.
\]

Lemma G.13. Let $z \sim \mathcal{N}(0,I_d)$ and $\ell \in \mathbb{N}$. For an index $\alpha = (i_1, \dots, i_\ell) \in [d]^\ell$, let $X = (z^{\otimes\ell})_\alpha$. Then we have $\mathbb{E}\big[\exp\big(\tfrac{1}{8\ell}|X - \mathbb{E}[X]|^{2/\ell}\big)\big] \le 2$.

Proof. Lemma G.1 implies
\[
\mathbb{E}\Big[\exp\Big(\frac{1}{8\ell}|X - \mathbb{E}[X]|^{2/\ell}\Big)\Big] \le \mathbb{E}\Big[\exp\Big(\frac{1}{8\ell}\big(|X|^{2/\ell} + |\mathbb{E}[X]|^{2/\ell}\big)\Big)\Big] = \exp\Big(\frac{1}{8\ell}|\mathbb{E}[X]|^{2/\ell}\Big)\,\mathbb{E}\Big[\exp\Big(\frac{1}{8\ell}|X|^{2/\ell}\Big)\Big]. \tag{71}
\]
Lemmas G.8 and G.12 respectively imply that we can bound both terms in equation (71) by $e^{1/4}$. Combining, we can bound their product by $e^{1/2} \le 2$.

Lemma G.14. Let $\ell \in \mathbb{N}$. Let $z_1, \dots, z_n \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, I_d)$ and $\xi = \frac{1}{n}\sum_{i=1}^n z_i^{\otimes\ell} - \mathbb{E}_{z\sim\mathcal{N}(0,I_d)}[z^{\otimes\ell}]$. Then for $\delta \in (0,1)$ we have
\[
\|\xi\|_\infty \le (100\ell)^{\ell/2}\Bigg(\sqrt{\frac{\ell\log d + \log(1/\delta)}{n}} + \frac{\big(\ell\log d + \log(1/\delta)\big)^{\ell/2}}{n}\Bigg)
\]
with probability at least $1-\delta$.

Proof. For any index $\alpha = (i_1, \dots, i_\ell) \in [d]^\ell$, combining Lemma G.13 (setting $\ell = 2k$) with Theorem 3.1 of [KC18] gives
\[
\mathbb{P}\Bigg(|\xi_\alpha| \ge (100\ell)^{\ell/2}\bigg(\sqrt{\frac{t}{n}} + \frac{t^{\ell/2}}{n}\bigg)\Bigg) \le 2e^{-t}.
\]
The result then follows from taking a union bound over all $d^\ell$ elements.
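For intuition, the $\ell = 2$ case of Lemma G.14 is easy to check by simulation, since the population tensor $\mathbb{E}[zz^\top] = I_d$ is known exactly; $\|\xi\|_\infty$ should shrink roughly like $\sqrt{\log d / n}$ in the first regime of the bound.

import numpy as np

rng = np.random.default_rng(1)

def xi_inf_norm(n, d):
    # xi = (1/n) sum_i z_i z_i^T - I_d, the l = 2 case of Lemma G.14.
    z = rng.standard_normal((n, d))
    xi = (z.T @ z) / n - np.eye(d)
    return np.abs(xi).max()

for n in [1_000, 10_000, 100_000]:
    print(n, xi_inf_norm(n, d=50))  # decreases roughly like n^{-1/2}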
Corollary G.15. Let $\ell \in \mathbb{N}$. Let $z_1, \dots, z_n \overset{\mathrm{iid}}{\sim} \mathcal{N}(0,I_d)$ and $\xi = \frac{1}{n}\sum_{i=1}^n z_i^{\otimes\ell} - \mathbb{E}_{z\sim\mathcal{N}(0,I_d)}[z^{\otimes\ell}]$. For $\beta > 0$, suppose
\[
n \ge \max\Bigg\{\frac{4(100\ell)^\ell\big(\ell\log d + \log(1/\delta)\big)}{\beta^2},\;\; \frac{2(100\ell)^{\ell/2}\big(\ell\log d + \log(1/\delta)\big)^{\ell/2}}{\beta}\Bigg\}.
\]
Then $\|\xi\|_\infty \le \beta$ with probability at least $1-\delta$.

Proof. The result follows immediately from inverting the bound of Lemma G.14, setting both summands in the bound to $\beta/2$.

Lemma G.16. Let $z_1, \dots, z_n \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0,I_d)$. For $\ell \in \mathbb{N}$, let $\xi = \frac{1}{n}\sum_{i=1}^n z_i^{\otimes\ell} - \mathbb{E}_{z\sim\mathcal{N}(0,I_d)}[z^{\otimes\ell}]$. Then
\[
\sup_{\|v\|_2 \le 1}\bigg|\frac{1}{n}\sum_{i=1}^n \langle z_i, v\rangle^\ell - \mathbb{E}_{z\sim\mathcal{N}(0,I_d)}[\langle z, v\rangle^\ell]\bigg| \le d^{\ell/2}\,\|\xi\|_\infty.
\]
Proof. Observe that for any $x, y \in \mathbb{R}^d$ we have $\langle x, y\rangle^\ell = \langle x^{\otimes\ell}, y^{\otimes\ell}\rangle$. Moreover, for any unit vector $v \in \mathbb{R}^d$ we have
\[
\big\|v^{\otimes\ell}\big\|_2^2 = \sum_{\alpha\in[d]^\ell}\big(\Pi_{i\in\alpha}\, v_i\big)^2 = \bigg(\sum_{i=1}^d v_i^2\bigg)^{\ell} = 1.
\]
It thus follows for any such $v$ that
\[
\bigg|\frac{1}{n}\sum_{i=1}^n \langle z_i, v\rangle^\ell - \mathbb{E}_{z\sim\mathcal{N}(0,I_d)}[\langle z, v\rangle^\ell]\bigg| = \bigg|\Big\langle \frac{1}{n}\sum_{i=1}^n z_i^{\otimes\ell} - \mathbb{E}_{z\sim\mathcal{N}(0,I_d)}[z^{\otimes\ell}],\; v^{\otimes\ell}\Big\rangle\bigg| \le \|\xi\|_2\,\big\|v^{\otimes\ell}\big\|_2 \le d^{\ell/2}\,\|\xi\|_\infty.
\]

G.4.1 Proof of Lemma 4.3

Proof. Our assumption on $n$ satisfies the conditions of Lemma G.3, implying $\frac{n}{|T|} \le \frac{1+\epsilon}{1-\epsilon}$ with probability at least $1-\delta$. This then implies that both $|T|$ and $n$ satisfy the conditions of Corollary G.15 with failure probability $\delta/(2k)$ and $\beta = \epsilon/(d^k k^k)$. Taking a union bound over $\ell \in [2k]$ gives the desired result.

G.5 Sum-of-squares facts

Lemma G.17. Let $r \in \mathbb{N}$. Let $x, y$ be polynomials of degree at most $k$. Then
\[
\{x \ge 0,\; y \ge 0,\; y - x \ge 0\} \;\vdash^{x,y}_{rk}\; y^r - x^r \ge 0. \tag{72}
\]
Proof. We can write $y^r - x^r$ as the sum of products of non-negative polynomials as follows:
\[
y^r - x^r = \sum_{i=0}^{r-1}\big(y^{r-i}x^i - y^{r-i-1}x^{i+1}\big) = (y-x)\sum_{i=0}^{r-1} y^{r-i-1}x^i.
\]

Lemma G.18. Let $x, y$ be vector-valued polynomials of degree at most $k$. Then for any $t > 0$ we have
\[
\vdash^{x,y}_{2k}\; \langle x, y\rangle \le \frac{t}{2}\|x\|_2^2 + \frac{1}{2t}\|y\|_2^2.
\]
Proof. Observe
\[
\frac{t}{2}\|x\|_2^2 + \frac{1}{2t}\|y\|_2^2 = \langle x, y\rangle + \frac{1}{2}\Big\|\sqrt{t}\,x - \frac{1}{\sqrt{t}}\,y\Big\|_2^2.
\]

Corollary G.19. Let $x, y$ be vector-valued polynomials of degree at most $k$ and let $\widetilde{\mathbb{E}}$ be a pseudoexpectation of degree at least $2k$. Then
\[
\widetilde{\mathbb{E}}[\langle x, y\rangle] \le \sqrt{\widetilde{\mathbb{E}}[\|x\|_2^2]\;\widetilde{\mathbb{E}}[\|y\|_2^2]}.
\]
Proof. The result follows immediately from Lemma G.18 with $t = \sqrt{\widetilde{\mathbb{E}}[\|y\|_2^2]\,/\,\widetilde{\mathbb{E}}[\|x\|_2^2]}$.

Lemma G.20. Let $c \in \mathbb{N}$ with $r = 2c$ and $r' = 2(c+1)$. Let $x$ be a polynomial of degree at most $k$ and let $\widetilde{\mathbb{E}}$ be a pseudoexpectation of degree at least $2k(c+1)$. Then $\widetilde{\mathbb{E}}[x^{r'}] \ge \widetilde{\mathbb{E}}[x^r]^{r'/r}$.

Proof. We proceed by induction on $c$. The base case $c = 1$ follows from Cauchy–Schwarz (i.e., combining Lemmas G.19 and G.17). Now suppose the claim holds for $c' = c - 1$, i.e., $\widetilde{\mathbb{E}}[x^{2(c'+1)}] \ge \widetilde{\mathbb{E}}[x^{2c'}]^{\frac{c'+1}{c'}}$. We will show the claim holds for $c = c' + 1$. Applying Cauchy–Schwarz to $x^{c'+2}$ and $x^{c'}$ gives
\[
\widetilde{\mathbb{E}}[x^{2(c'+2)}] \ge \frac{\widetilde{\mathbb{E}}[x^{2(c'+1)}]^2}{\widetilde{\mathbb{E}}[x^{2c'}]} \ge \frac{\widetilde{\mathbb{E}}[x^{2(c'+1)}]^2}{\widetilde{\mathbb{E}}[x^{2(c'+1)}]^{\frac{c'}{c'+1}}} = \widetilde{\mathbb{E}}[x^{2(c'+1)}]^{\frac{c'+2}{c'+1}},
\]
with the second step following from the induction hypothesis.

Corollary G.21. Let $r, r' \in \mathbb{N}$ be even with $r' \ge r$ and let $x$ be a polynomial of degree at most $k$. Let $\widetilde{\mathbb{E}}$ be a pseudoexpectation of degree at least $kr'$. Then $\widetilde{\mathbb{E}}[x^{r'}] \ge \widetilde{\mathbb{E}}[x^r]^{r'/r}$.

Proof. The result follows from repeatedly applying Lemma G.20.

Lemma G.22. Let $x$ be a polynomial of degree at most $k$ and let $\widetilde{\mathbb{E}}$ be a pseudoexpectation of degree at least $2k$. Then $\widetilde{\mathbb{E}}[x]^2 \le \widetilde{\mathbb{E}}[x^2]$.

Proof. Observe $\widetilde{\mathbb{E}}[x^2] - \widetilde{\mathbb{E}}[x]^2 = \widetilde{\mathbb{E}}\big[(x - \widetilde{\mathbb{E}}[x])^2\big] \ge 0$.
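As a concrete instance of the base case of Lemma G.20 (the pseudoexpectation Hölder inequality with $r = 2$, $r' = 4$), combining Corollary G.19 (applied with $x \mapsto x^2$ and $y \mapsto 1$) with $\widetilde{\mathbb{E}}[1] = 1$ gives
\[
\widetilde{\mathbb{E}}[x^2] = \widetilde{\mathbb{E}}[x^2\cdot 1] \le \sqrt{\widetilde{\mathbb{E}}[x^4]\;\widetilde{\mathbb{E}}[1]} = \sqrt{\widetilde{\mathbb{E}}[x^4]}, \quad\text{and hence}\quad \widetilde{\mathbb{E}}[x^4] \ge \widetilde{\mathbb{E}}[x^2]^2.
\]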
Lemma G.23. Let $z_1, \dots, z_n \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0,I_d)$, with $\xi = \frac{1}{n}\sum_{i=1}^n z_i^{\otimes 2k} - \mathbb{E}_{z\sim\mathcal{N}(0,I_d)}[z^{\otimes 2k}]$. Then it follows for $k \in \mathbb{N}$ and all $t > 0$ that
\[
\vdash^{v}_{4k}\;\; \frac{1}{n}\sum_{i=1}^n \langle v, z_i\rangle^{2k} \in \|v\|_2^{2k}\,\mathbb{E}_{G\sim\mathcal{N}(0,1)}[G^{2k}] \pm \bigg(\frac{t}{2}\|v\|_2^{4k} + \frac{d^{2k}}{2t}\|\xi\|_\infty^2\bigg).
\]
Proof. For the upper bound, observe
\[
\vdash^{v}_{4k}\;\; \frac{1}{n}\sum_{i=1}^n \langle v, z_i\rangle^{2k} = \|v\|_2^{2k}\,\mathbb{E}_{G\sim\mathcal{N}(0,1)}[G^{2k}] + \langle v^{\otimes 2k}, \xi\rangle \tag{73}
\]
\[
\le \|v\|_2^{2k}\,\mathbb{E}_{G\sim\mathcal{N}(0,1)}[G^{2k}] + \frac{t}{2}\langle v^{\otimes 2k}, v^{\otimes 2k}\rangle + \frac{1}{2t}\langle\xi, \xi\rangle \tag{74}
\]
\[
= \|v\|_2^{2k}\,\mathbb{E}_{G\sim\mathcal{N}(0,1)}[G^{2k}] + \frac{t}{2}\|v\|_2^{4k} + \frac{1}{2t}\langle\xi,\xi\rangle \le \|v\|_2^{2k}\,\mathbb{E}_{G\sim\mathcal{N}(0,1)}[G^{2k}] + \frac{t}{2}\|v\|_2^{4k} + \frac{d^{2k}}{2t}\|\xi\|_\infty^2,
\]
where equation (73) follows from the fact that $\mathbb{E}_{z\sim\mathcal{N}(0,I_d)}[\langle v, z\rangle^{2k}] = \|v\|_2^{2k}\,\mathbb{E}_{G\sim\mathcal{N}(0,1)}[G^{2k}]$ for all $v \in \mathbb{R}^d$, and inequality (74) follows from SoS Cauchy–Schwarz (Lemma G.18). The lower bound follows analogously.

Lemma G.24. Let $x \in \mathbb{R}$ be a polynomial of degree at most $k$. Let $r \in \mathbb{N}$ be even and let $\widetilde{\mathbb{E}}$ be a pseudoexpectation of degree at least $kr$ satisfying $\widetilde{\mathbb{E}}[x^r] \le 1 + \alpha$ for some $\alpha > 0$. Then it follows that $\widetilde{\mathbb{E}}[x^2] \le 1 + \frac{2\alpha}{r}$.

Proof. Hölder's inequality for pseudoexpectations (Corollary G.21) implies
\[
\widetilde{\mathbb{E}}[x^2] \le \widetilde{\mathbb{E}}[x^r]^{2/r} \le (1+\alpha)^{2/r} \le 1 + \frac{2\alpha}{r}.
\]

Lemma G.25. Let $x \in \mathbb{R}$ be a polynomial of degree at most $k$. Let $r \in \mathbb{N}$ be even and let $\widetilde{\mathbb{E}}$ be a pseudoexpectation of degree at least $2kr$ satisfying both $\widetilde{\mathbb{E}}[x^r] \ge 1 - \beta$ and $\widetilde{\mathbb{E}}[x^{2r}] \ge 1 - \alpha$ for some $\alpha, \beta > 0$. Then it follows that $\widetilde{\mathbb{E}}[x^r] \ge 1 - \frac{\alpha}{2-\beta}$.

Proof. Observe
\[
\widetilde{\mathbb{E}}[1 - x^{2r}] = \widetilde{\mathbb{E}}[(1-x^r)(1+x^r)] \ge (2-\beta)\,\widetilde{\mathbb{E}}[1-x^r].
\]
Thus, $\widetilde{\mathbb{E}}[1-x^{2r}] \le \alpha$ implies $\widetilde{\mathbb{E}}[1-x^r] \le \frac{\alpha}{2-\beta}$.

Corollary G.26. Let $x \in \mathbb{R}$ be a polynomial of degree at most $k$. Let $r = 2^p$ for some $p \in \mathbb{N}$ and $\alpha \in (0, 1/2)$. Let $\widetilde{\mathbb{E}}$ be a pseudoexpectation of degree at least $2kr$ satisfying $\widetilde{\mathbb{E}}[x^{2^\ell}] \ge 1 - \alpha$ for $\ell \in [p]$. Then we have $\widetilde{\mathbb{E}}[x^2] \ge 1 - \frac{2e^{3/2}\alpha}{r}$.

Proof. Let $\beta_1 = \frac{\alpha}{2-\alpha}$ and $\beta_i = \frac{\beta_{i-1}}{2-\alpha}$ for $2 \le i \le p$. Applying Lemma G.25 recursively over $\ell \in [p]$ then gives
\[
\widetilde{\mathbb{E}}[x^2] \ge 1 - \frac{\alpha}{\Pi_{i=1}^{p-1}(2-\beta_i)} \ge 1 - \frac{\alpha}{\Pi_{i=1}^{p-1} 2\exp(-\beta_i)} = 1 - \frac{2\alpha}{r}\exp\bigg(\sum_{i=1}^{p-1}\beta_i\bigg),
\]
where the second step follows from the fact that $\exp(-2x) \le 1 - x$ for $x < 1/2$ (taking $x = \beta_i/2$). Our premise that $\alpha < 1/2$ implies $\beta_{i+1} \le \frac{2}{3}\beta_i$ for $i \in [p-1]$, and so $\sum_{i=1}^{p-1}\beta_i \le 3\alpha \le \frac{3}{2}$. The desired result follows immediately from combining this fact with the previous display.