Separating Oblivious and Adaptive Models of Variable Selection
Authors: Ziyun Chen, Jerry Li, Kevin Tian, Yusong Zhu
Ziyun Chen∗, Jerry Li†, Kevin Tian‡, Yusong Zhu§

Abstract

Sparse recovery is among the most well-studied problems in learning theory and high-dimensional statistics. In this work, we investigate the statistical and computational landscapes of sparse recovery with $\ell_\infty$ error guarantees. This variant of the problem is motivated by variable selection tasks, where the goal is to estimate the support of a $k$-sparse signal in $\mathbb{R}^d$. Our main contribution is a provable separation between the oblivious ("for each") and adaptive ("for all") models of $\ell_\infty$ sparse recovery. We show that under an oblivious model, the optimal $\ell_\infty$ error is attainable in near-linear time with $\approx k \log d$ samples, whereas in an adaptive model, $\gtrsim k^2$ samples are necessary for any algorithm to achieve this bound. This establishes a surprising contrast with the standard $\ell_2$ setting, where $\approx k \log d$ samples suffice even for adaptive sparse recovery. We conclude with a preliminary examination of a partially-adaptive model, where we show nontrivial variable selection guarantees are possible with $\approx k \log d$ measurements.

∗University of Washington. ziyuncc@cs.washington.edu.
†University of Washington. jerryzli@cs.washington.edu.
‡University of Texas at Austin. kjtian@cs.utexas.edu.
§University of Texas at Austin. zhuys@utexas.edu.

Contents

1 Introduction
  1.1 Problem statements
  1.2 Our results
  1.3 Related work

2 Preliminaries
  2.1 Notation
  2.2 Sparse recovery preliminaries
3 Oblivious $\ell_\infty$ Sparse Recovery
  3.1 Helper technical lemmas
  3.2 Support identification
  3.3 In-support estimation
  3.4 Expanding the $r$ range in Theorem 3

4 Adaptive $\ell_\infty$ Sparse Recovery
  4.1 $\ell_\infty$-RIP
  4.2 IHT under $\ell_\infty$-RIP
  4.3 Lower bound

5 Variable Selection with Adaptive Measurements
  5.1 Few false positives in thresholding
  5.2 Recursive support estimation

A Discussion of Error Metrics

B Tight Error Guarantee for Gaussian Noise Model

C Application to Spike-and-Slab Posterior Sampling

1 Introduction

We consider the problem of sparse recovery, a cornerstone problem in learning theory and high-dimensional statistics, with applications to many diverse fields, including medical imaging [LDP07, GS15], computational photography [DDT+08, GJP20], and wireless communication [DE11, HLY13]. In this problem, we assume there is some underlying ground-truth $k$-sparse signal $\theta^\star \in \mathbb{R}^d$, and our goal is to recover it given $n$ (potentially noisy) linear measurements, i.e., from $y := X\theta^\star + \xi$ for some measurement matrix $X \in \mathbb{R}^{n \times d}$ and some noise vector $\xi \in \mathbb{R}^n$. Typically, we are interested in the case where the number of measurements $n$ is much smaller than $d$, and the main statistical measure of merit is how large $n$ has to be to achieve good estimation error for $\theta^\star$.
In this paper, we investigate the question of learning $\theta^\star$ to $\ell_\infty$ error, a task which is closely related to the well-studied question of variable selection for sparse linear models [Tib96, FL01, CT07, MB10, BC15]. In many real-world applications of sparse recovery, a primary goal is to select which features of the regression model have significant explanatory power [YSY+08, BCH14, CT20, Aky23]. In other words, the task is to find the support of the large elements of the unknown $\theta^\star$. This problem is of particular import in overparameterized, high-dimensional settings where $d \gg |\mathrm{supp}(\theta^\star)|$. By a thresholding argument, observe that this task is more or less equivalent to learning $\theta^\star$ to good $\ell_\infty$ error. Indeed, recovery in $\ell_\infty$ immediately implies that we can also learn the support of the heavy elements of $\theta^\star$; conversely, if we can identify this support efficiently, it is (in many natural settings) straightforward to recover $\theta^\star$, since by focusing on those coordinates, we can reduce the problem to standard (i.e., dense) linear regression as long as $n \gtrsim |\mathrm{supp}(\theta^\star)|$.

In the most commonly-studied setting, where $X$ is an entrywise Gaussian measurement matrix and the goal is to learn $\theta^\star$ to good $\ell_2$-norm error, the statistical complexity of sparse recovery is by now fairly well-understood. The seminal works [CT05, CT06] demonstrated that in the noiseless setting, i.e., $\xi = 0_n$, exact recovery is possible when $n \approx k \log \frac{d}{k}$, and moreover, this is achievable with an efficient algorithm ($\ell_1$ minimization). This sample complexity is tight up to a logarithmic factor, simply by a rank argument. Follow-up work [CRT06] demonstrated that for general noise vectors $\xi \in \mathbb{R}^n$, there is an efficiently-computable estimator $\theta$ which achieves $\ell_2$-norm error $\|\theta^\star - \theta\|_2 = O(\|\xi\|_2)$ with the same asymptotic number of measurements, and this recovery rate is optimal.
However, despite the large literature on sparse recovery, the sample complexity landscape is significantly less well-understood for recovery in the $\ell_\infty$ norm, and for variable selection in general. While a number of papers [Lou08, YZ10, CW11, HJLL17, LYP+19, Wai19] demonstrate upper bounds for this problem, including several that prove error rates for popular algorithms such as LASSO [Lou08, YZ10, Wai19], very few lower bounds are known (see Section 1.3 for a more detailed discussion), and moreover, several of these results require additional assumptions on $\theta^\star$ and/or the noise. For instance, while [Wai19] proves that one can achieve good $\ell_\infty$ error with LASSO with $n = O(k \log \frac{d}{k})$ measurements, their results require (among other things) that the support of $\theta^\star$ is random and independent of $X$. This is in stark contrast to the landscape for learning in $\ell_2$, where one can obtain a "for all" guarantee for learning any $k$-sparse vector $\theta^\star$ with the same $X$. Additionally, there are very limited lower bounds for learning in $\ell_\infty$ error, and they do not typically match the existing upper bounds. This state of affairs raises the natural question:

Can we characterize the statistical landscape of learning sparse linear models in $\ell_\infty$ error? Relatedly, can we understand the sample complexity of variable selection for sparse linear regression?

In this work, we make significant progress on understanding these fundamental questions. Our main contributions are new sample complexity upper and lower bounds for variable selection and $\ell_\infty$ sparse recovery, under various natural generative models. Before we go into detail about our results, we wish to emphasize two main conceptual contributions of our investigation.

Adaptivity matters for $\ell_\infty$ sparse recovery.
As mentioned, prior works on variable selection and $\ell_\infty$ sparse recovery often required additional assumptions on how the support of the unknown $k$-sparse vector $\theta^\star$ is chosen. We show that this is inherent: if $\theta^\star$ and $\xi$ are chosen independently of the measurement matrix $X$ (the "oblivious" or "for each" model), then recovery is possible with $n = O(k \log \frac{d}{k})$ measurements in nearly-linear time, but if they can be chosen with knowledge of $X$ (the "adaptive" or "for all" model),¹ then $n = \Omega(k^2)$ measurements are both necessary and (up to a $\log d$ factor) sufficient. In other words, unlike for recovery in $\ell_2$, adaptivity in the choice of the unknown parameters $(\theta^\star, \xi)$ provably makes the problem statistically harder.

A new, canonical choice of error metric. For sparse recovery in the $\ell_2$ norm, it is well-known that the best achievable recovery error is $\|\xi\|_2$ (up to constant factors). However, no such characterization was previously known for $\ell_\infty$. Various error metrics have been proposed by prior work, discussed thoroughly in Appendix A; however, none were known to yield a tight rate for the achievable error. In this work, we show strong evidence that the correct error metric in $\ell_\infty$ is

$$\mathrm{err}(X, \xi) := \left\|X^\top \xi\right\|_\infty. \quad (1)$$

We give the following justifications for this error metric. First, in the oblivious model, we show that this quantity is equivalent to several others considered in the literature, up to appropriate scaling (Lemma 20). Second, in the adaptive model, we demonstrate nearly-matching upper and lower bounds under the metric (1), and we demonstrate that other quantities considered in the literature are provably impossible to achieve in the adaptive model (Lemma 21).

1.1 Problem statements

We now formally define the problems we study in this paper. Throughout this introduction, we primarily consider the standard setting where $X$ has i.i.d.
entries $\sim \mathcal{N}(0, \frac{1}{n})$; this scaling is convenient as it ensures that $\mathbb{E}[X^\top X] = I_d$. With this, we now state the variable selection problem we study.

Problem 1 (Variable selection). Let $(n, d) \in \mathbb{N}^2$ and $k \in [d]$. Let $X \in \mathbb{R}^{n \times d}$ be a known measurement matrix from Model 3,² and let $(\theta^\star, \xi) \in \mathbb{R}^d \times \mathbb{R}^n$ be unknown, so that $\mathrm{nnz}(\theta^\star) \le k$ and

$$\min_{i \in \mathrm{supp}(\theta^\star)} |\theta^\star_i| > C \left\|X^\top \xi\right\|_\infty, \quad (2)$$

for a universal constant $C > 0$, represent a signal and noise vector. We observe $(X, y)$, where $y := X\theta^\star + \xi$. Our goal is to output $\mathrm{supp}(\theta^\star) \subseteq [d]$.

Problem 1 is a support recovery problem, which asks to select the relevant variables of the signal vector $\theta^\star$ from the observations $(X, y)$, under an appropriate signal-to-noise ratio condition (2). As justified previously (and in more detail in Appendix B), we believe that the choice of $\|X^\top \xi\|_\infty$ on the right-hand side of (2) is the correct parameterization for this problem.

¹The literature sometimes uses the term "for all" model to describe settings with a dependent signal $\theta^\star$ and independent noise $\xi$, i.e., our "partially-adaptive" Model 4. For disambiguation, we primarily refer to our "for each" and "for all" models as the oblivious and adaptive models throughout.

²We state a general model of sub-Gaussian matrix ensembles to capture the full generality of our results; the reader may primarily consider the i.i.d. entrywise $\mathcal{N}(0, \frac{1}{n})$ model of $X$ for simplicity, which falls under Model 3.

Next, we formally define the problem of sparse recovery with $\ell_\infty$ error.

Problem 2 ($\ell_\infty$ sparse recovery). Let $(n, d) \in \mathbb{N}^2$ and $k \in [d]$. Let $X \in \mathbb{R}^{n \times d}$ be a known measurement matrix from Model 3, and let $(\theta^\star, \xi) \in \mathbb{R}^d \times \mathbb{R}^n$ be unknown so that $\mathrm{nnz}(\theta^\star) \le k$. We observe $(X, y)$ where $y := X\theta^\star + \xi$. Our goal is to output $\theta \in \mathbb{R}^d$ satisfying, for a universal constant $C > 0$,

$$\|\theta - \theta^\star\|_\infty \le C \left\|X^\top \xi\right\|_\infty, \quad \mathrm{nnz}(\theta) \le k. \quad (3)$$
To solve an instance of Problem 1 with constant $C$, it suffices to solve Problem 2 with constant $\frac{C}{2}$, and then threshold the coordinates of $\theta$ at $\frac{C}{2}\|X^\top \xi\|_\infty$ to recover $\mathrm{supp}(\theta^\star)$. Thus, Problem 2 is a more general problem (up to the choice of $C$), and is largely the rest of the paper's focus.

We next define various modeling assumptions used when studying Problem 2. As mentioned, a key conceptual contribution of our paper is that the modeling assumptions play an important role in characterizing the statistical complexity of $\ell_\infty$ sparse recovery.

Model 1 (Oblivious model). In the oblivious model, $(\theta^\star, \xi)$ are chosen independently of $X$.

Model 2 (Adaptive model). In the adaptive model, we make no independence assumptions on the triple $(\theta^\star, \xi, X)$.

In Model 1, an equivalent viewpoint is that $(\theta^\star, \xi)$ are first fixed (possibly as samples from a distribution), and then $X$ is independently sampled. In Model 2, the order is intuitively reversed: first $X$ is sampled, and then $(\theta^\star, \xi)$ can be arbitrarily defined depending on its outcome. An algorithm succeeding under Model 2 is powerful: it can be used in arbitrary adaptively-defined instances of Problem 2, with the same $X$. This is particularly useful when Problem 2 is used as a subroutine, e.g., in hyperparameter search or a wrapper optimization algorithm involving $X$.

Understanding Problem 2 under Models 1 and 2 is our main focus, but there are various more fine-grained independence assumptions one could impose. For example, Section 5 investigates a partially-adaptive model, where only the noise is viewed as benign.

1.2 Our results

Oblivious sparse recovery. Our first main result is a new algorithm for $\ell_\infty$ sparse recovery in the oblivious model (Model 1). We show that it is possible to solve Problem 2 in this setting with a number of samples matching that required for optimal $\ell_2$ sparse recovery.
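To make the thresholding reduction from Problem 2 to Problem 1 concrete, the following NumPy sketch (illustrative only; the instance sizes, the noise scale, and the choice $C = 2$ are ours, not the paper's) thresholds an estimate satisfying the $\ell_\infty$ guarantee (3) at $\frac{C}{2}\|X^\top \xi\|_\infty$ to recover the support:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 1000, 5

# Ground-truth k-sparse signal with entries well above the noise scale.
theta_star = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
theta_star[support] = rng.choice([-1.0, 1.0], size=k)

X = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, d))  # i.i.d. N(0, 1/n) entries
xi = rng.normal(0.0, 0.01, size=n)                   # small oblivious noise
y = X @ theta_star + xi

# The error metric err(X, xi) = ||X^T xi||_inf from equation (1).
err = np.linalg.norm(X.T @ xi, ord=np.inf)

# Stand-in for a Problem 2 solver with C = 2: theta is within (C/2)*err of
# theta_star in l_inf (here we simply perturb theta_star by that much).
theta = theta_star + rng.uniform(-err, err, size=d)

# Thresholding at (C/2)*err recovers supp(theta_star), given condition (2).
recovered = set(np.flatnonzero(np.abs(theta) > err))
assert recovered == set(support)
```

Since condition (2) forces every in-support coordinate above $C\,\mathrm{err}(X,\xi)$, the threshold cleanly separates signal from spurious coordinates.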
Theorem 1 (informal, see Theorem 3). Let $n = \Omega(k \log d)$, and let $X \in \mathbb{R}^{n \times d}$ have i.i.d. $\mathcal{N}(0, \frac{1}{n})$ entries. There is an estimator which solves Problem 2 under Model 1 with high probability. Moreover, the estimator can be computed in nearly-linear time.

We prove Theorem 3 through a simple three-stage method (Algorithm 2). Our algorithm first uses iterative hard thresholding (IHT) [BD09], an $\ell_2$ sparse recovery algorithm, to obtain a warm start. It then estimates the support via thresholding, and solves ordinary least squares on the learned support. For general sub-Gaussian measurements, the error of Theorem 1 can exceed $\|X^\top \xi\|_\infty$, but by at most a logarithmic factor; we give a detailed discussion in Sections 3.3 and 3.4. To our knowledge, this is the first estimator achieving these guarantees for $\ell_\infty$ sparse recovery which runs in nearly-linear time.³

We note that qualitatively similar guarantees were known for classical estimators such as LASSO [Wai19]; however, their results were stated for a different notion of error (although we show that these notions are equivalent in Appendix B under the oblivious Model 1). Moreover, LASSO-style estimators are based on linear programming, which can be computationally cumbersome compared to our result in Theorem 1. Finally, Theorem 3 implies an improved runtime for a recent state-of-the-art algorithm for Bayesian sparse linear regression by [KSTZ25]. We discuss this application in more detail in Appendix C.

Adaptive sparse recovery. Our second main result is a new set of nearly-matching upper and lower bounds for the adaptive model (Model 2) of variable selection and $\ell_\infty$ sparse recovery.

Theorem 2 (informal, see Theorems 4, 5, and 6). Let $n = \Omega(k \log d)$, and let $X \in \mathbb{R}^{n \times d}$ have i.i.d. $\mathcal{N}(0, \frac{1}{n})$ entries.
There is an estimator which solves Problem 2 under Model 2 with high probability when $n = \Omega(k^2 \log \frac{d}{k})$, computable in nearly-linear time. Moreover, there is no algorithm which can solve either Problem 1 or Problem 2 under Model 2 with probability $> \frac{1}{2}$ if $n = o(k^2)$.

Up to a logarithmic factor in the dimension, Theorem 2 settles both the computational and statistical complexity of adaptive $\ell_\infty$ sparse recovery. Notably, the sample complexity of both parts of Theorem 2 scales quadratically with $k$, as opposed to the linear-in-$k$ scaling in Theorem 1, as well as in the standard $\ell_2$ sparse recovery setting. To our knowledge, no such separation had been previously demonstrated in any similar setting. We conjecture that the lower bound in Theorem 2 can be extended to any $n = o(k^2 \log \frac{d}{k})$, that is, that the tight measurement complexity is $n = \Theta(k^2 \log \frac{d}{k})$; we leave this interesting problem open for future work.

The upper bound in Theorem 2 is once again achieved via an application of IHT. Our main conceptual contribution in demonstrating this result is to define a new notion of the well-studied restricted isometry property (RIP), which we call $\ell_\infty$-RIP (Definition 4), and to demonstrate that Gaussian (and sub-Gaussian) matrices satisfy this condition with high probability when $n = \Omega(k^2 \log \frac{d}{k})$.

The more technically interesting part of Theorem 2 is the lower bound. At a high level, we demonstrate that the inverse of the Gram matrix $X^\top X$ restricted to any sufficiently small submatrix has a large $\ell_\infty$ operator norm, unless $n = \Omega(k^2)$ (Lemma 18). We then exhibit a sparse noise vector which has a very large $\ell_\infty$ norm, but which is effectively killed off by the measurements $X$ as long as $n = o(k^2)$, and hence cannot be detected unless we have sufficiently many measurements.

Variable selection with partial adaptivity.
A natural question is whether or not one can circumvent the quadratic lower bounds of the adaptive model in an intermediate setting which interpolates between Models 1 and 2. Towards understanding this possibility, we demonstrate nontrivial recovery guarantees for variable selection (Problem 1) in a model we call the partially-adaptive model, where the noise $\xi$ is independent of $X$, but $\theta^\star$ may be adaptive (Model 4).

We demonstrate in Theorem 7 that if the learner is allowed to mask the effect of certain coordinates when querying observations (Algorithm 4), then $\approx k \log d \log k$ measurements suffice to solve variable selection under partial adaptivity. Our algorithm iteratively applies thresholding after masking an estimated support, which we show makes geometric progress on the residual. Although our result holds under a nonstandard observation model, we believe it highlights the possibility of going beyond Theorem 2 even under partial adaptivity. Indeed, a key structural result that we leverage in our algorithm is that threshold-based support learning has few false positives under Model 4; we believe this observation will prove useful in future investigations of the partially-adaptive setting.

³Nearly-linear time methods have also been developed based on decoding expander graph-based measurements, but these typically require explicit design of the measurement matrix. Moreover, such results often operate in the noiseless setting, or only give $\ell_\infty$ recovery under a minimum signal strength assumption; see Section 1.3 for a discussion.

1.3 Related work

$\ell_1$-convex relaxation. Since [Nat95] showed that $\ell_0$-minimization for linear regression is NP-hard in general, extensive efforts [Tib96, CDS01, CT05, CT06, CT07] have focused on $\ell_1$ minimization as a tractable convex surrogate.
Under RIP-type assumptions and with measurement complexity $n = \Theta(k \log \frac{d}{k})$, both the Lasso [Wai19] and the Dantzig selector [CT07] are known to achieve optimal $\ell_2$ estimation error rates. Beyond $\ell_2$ recovery, [ZY06] showed that the Lasso achieves exact support recovery when the nonzero coefficients are sufficiently strong and the measurements satisfy an irrepresentability condition, requiring $n = \Omega(k \log d)$ in the oblivious model and $n = \Omega(k^2 \log d)$ in the adaptive model. For $\ell_\infty$ guarantees, [Lou08] proved that both the Lasso and the Dantzig selector attain $\approx \sigma$ error under a pairwise incoherence condition, which requires $n = \Omega(k^2)$ (Lemma 16). Similarly, [YZ10] derived comparable $\ell_\infty$ bounds under an $\ell_\infty$-curvature condition, again with $n = \Omega(k^2 \log d)$. More recently, [Wai19] showed that under mutual incoherence, the Lasso satisfies a refined $\ell_\infty$ error bound, which is achieved with $n = \Theta(k \log d)$ in the oblivious model and $n = \Theta(k^2 \log d)$ in the adaptive model for sub-Gaussian designs. As we show in Section 4.3, this bound matches the minimax-optimal $\approx \sigma$ $\ell_\infty$ error only when $n = \Omega(k^2)$.

Greedy selection under $\ell_0$ constraints. Beginning with the seminal works of [MZ93, PRK93], a substantial line of research [NV08, NT08, CW11, JTD11, DTDS12] has developed greedy heuristics for approximating the intractable $\ell_0$ minimization problem. [PRK93] introduced Orthogonal Matching Pursuit (OMP), which iteratively selects the column most correlated with the current residual and removes its contribution via orthogonal projection. OMP [Zha11] and its variants, including ROMP [NV08], CoSaMP [NT08], and OMPR [JTD11], admit similar RIP-based analyses and achieve $\ell_2$ estimation error $\approx \|\xi\|_2$ with measurement complexity $n = \Omega(k \log^{O(1)} d)$. These guarantees are comparable to, or sometimes slightly weaker than, the sharp bounds obtained for IHT [Pri21].
On the other hand, [CW11] and [HJLL17] establish $\ell_\infty$ recovery guarantees for OMP and its variant SDAR under incoherence-type assumptions and a Gaussian noise model, which lead to a higher measurement requirement of $n = \Omega(k^2 \log d)$. These greedy methods typically incur higher computational costs than IHT and our Algorithm 3, as they require solving a least-squares problem over a size-$k$ support at each iteration, whereas IHT performs only simple gradient descent and thresholding operations, and Algorithm 3 only solves OLS once on top of calling IHT.

Expander-based methods. A parallel line of work [XH07, JXHC09, IR08, BIR08] uses sparse binary measurement matrices, typically the adjacency matrices of bipartite expanders, to enable sparse recovery, primarily in noiseless settings. For example, [XH07] showed that exact recovery with a "for each" guarantee (Model 1) is possible using $n = O(k \log d)$ measurements and decoding time $T = O(d \log d)$. Subsequent work improved efficiency: [SBB06] reduced decoding time to $T = O(k \log d \log k)$, while [WV12] sharpened the measurement complexity to $n = O(k)$ under additional signal model assumptions. Notably, such results require strong control of the measurement matrix. Despite these favorable guarantees in the noiseless regime, expander-based decoders are generally less robust to noise, and analyses in noisy settings remain limited [JXHC09, ABK17, LYP+19]. Notably, [JXHC09] established robustness for approximately $k$-sparse signals, while [ABK17] showed that $n = \Theta(k^2 \log d)$ measurements are both necessary and sufficient for support recovery in the adaptive model under binary measurements. Furthermore, [LYP+19] provided $\ell_\infty$ guarantees in the oblivious model, but required all signal coordinates to exceed the noise level by a significant margin.
Iterative expander-based algorithms such as EMP [IR08] and SSMP [BIR08] achieve optimal $\ell_2$ recovery in adaptive models, but their computational cost is comparable to greedy methods like IHT and OMP, and substantially higher than that of combinatorial decoders.

Lower bounds on support recovery. Lower bounds for support recovery remain comparatively less explored. Classic works such as [CD13, SC19, Wai09] establish information-theoretic lower bounds on the $\ell_2$ estimation error via Fano's inequality, implying that $n \ge k \log \frac{d}{k}$ measurements are necessary under their respective scalings. In contrast, under our normalization this line of analysis yields a unified risk lower bound without an explicit measurement constraint (see Appendix B). For support recovery, [Wai06] shows that the Lasso consistently identifies the true support under Gaussian designs only if $n = \Omega(k \log \frac{d}{k})$. This result was extended by [FRG09] to maximum-likelihood estimators, yielding a necessary condition $n = \Omega(\frac{k \log(d/k)}{\mathrm{SNR} \cdot \mathrm{MAR}})$, where the denominator quantifies the signal strength. More recently, [GZ17] identified an additional "all-or-nothing" threshold for support recovery of binary signals, occurring at $n = k \log d \cdot \frac{\log k}{1 + \sigma^2}$.

2 Preliminaries

In this section we develop some preliminaries for the rest of the paper.

2.1 Notation

General notation. We denote matrices in capital boldface and vectors in lowercase boldface. We define $[n] := \{i \in \mathbb{N} : i \le n\}$. When $S$ is a subset of a larger set clear from context, $S^c$ denotes its complement. For random variables $X$ and $Y$, $X \perp Y$ denotes that $(X, Y)$ are independent. For an event $E$, we use $\mathbb{I}_E$ to denote its associated $0$-$1$ indicator variable. To simplify expressions, we henceforth assume $k$ is at least a sufficiently large universal constant in future proofs.
We also always let $\{x_i \in \mathbb{R}^n\}_{i \in [d]}$ denote the columns of the measurement matrix $X$ when clear from context.

Vectors. For a vector $v \in \mathbb{R}^d$, we let $\mathrm{supp}(v) := \{i \in [d] : v_i \neq 0\}$, and let $\|v\|_{2,k}$ be the $\ell_2$-norm of its top-$k$ largest elements in absolute value. We let $0_d$ and $1_d$ denote the all-zeroes and all-ones vectors in $\mathbb{R}^d$. We also define $e_i$ as the $i$-th standard basis vector. For $v \in \mathbb{R}^d$ and $k \in [d]$, we let $H_k(v) \in \mathbb{R}^d$ keep the $k$ largest coordinates of $v$ by magnitude (breaking ties in lexicographical order), i.e., the "head," and set all other coordinates to $0$.

Matrices. We let $I_d$ denote the $d \times d$ identity matrix, and $I_S$ is the identity on $\mathbb{R}^S$ for an index set $S$. For $p \ge 1$ (including $p = \infty$), applied to a vector argument, $\|\cdot\|_p$ denotes the $\ell_p$ norm. For $p, q \ge 1$ and a matrix $M \in \mathbb{R}^{n \times d}$, we also use the notation $\|M\|_{p \to q} := \max_{v \in \mathbb{R}^d : \|v\|_p \le 1} \|Mv\|_q$. We use $\|\cdot\|_F$ and $\|\cdot\|_{\mathrm{op}}$ to denote the Frobenius and ($2 \to 2$) operator norms of a matrix argument. A helpful observation used throughout is that $\|\cdot\|_{\infty \to \infty}$ is the largest $\ell_1$ norm of a row.

Indexing. For any $i \in [d]$, $v \in \mathbb{R}^d$, $X \in \mathbb{R}^{n \times d}$, we let $v_{-i} \in \mathbb{R}^{d-1}$ drop the $i$-th element of $v$ and $X_{:-i} \in \mathbb{R}^{n \times (d-1)}$ drop the $i$-th column of $X$. For a vector $v \in \mathbb{R}^d$ and $S \subseteq [d]$, we use $v_S \in \mathbb{R}^S$ to denote its restriction to its coordinates in $S$. We let $\mathrm{nnz}(\cdot)$ denote the number of nonzero entries in a matrix or vector argument. For $M \in \mathbb{R}^{n \times d}$, we use $M_{i:}$ to denote its $i$-th row for $i \in [n]$, and $M_{:j}$ to denote its $j$-th column for $j \in [d]$. When $S \subseteq [n]$ and $T \subseteq [d]$ are row and column indices, we let $M_{S \times T}$ be the submatrix indexed by $S$ and $T$; if $T = [d]$ we simply use $M_{S:}$, and similarly, we define $M_{:T}$. We fix the convention that transposition is done prior to indexing, i.e., $M^\top_{S \times T} := [M^\top]_{S \times T}$.

Sub-Gaussian distributions. We give a simplified introduction to sub-Gaussian distributions.
For a detailed discussion, we refer to Chapters 2 and 3 of [Ver18].

Definition 1 (sub-Gaussian distribution). We say that a random variable $X \in \mathbb{R}$ is $\sigma^2$-sub-Gaussian with mean $\mu$, which we denote by $X \sim \mathrm{subG}(\mu, \sigma^2)$, if $\mathbb{E}[X] = \mu$ and

$$\mathbb{E}[\exp(\lambda(X - \mu))] \le \exp\left(\frac{\lambda^2 \sigma^2}{2}\right), \text{ for all } \lambda \in \mathbb{R}.$$

The following facts will be very helpful in manipulating sub-Gaussian random variables. The first follows simply by applying Definition 1 and applying independence appropriately.

Fact 1. If $X \sim \mathrm{subG}(\mu, \sigma^2)$ and $a \in \mathbb{R}$, then $aX \sim \mathrm{subG}(a\mu, a^2\sigma^2)$, and if $Y \sim \mathrm{subG}(\nu, \tau^2)$ where $X \perp Y$, then $X + Y \sim \mathrm{subG}(\mu + \nu, \sigma^2 + \tau^2)$.

Lemma 1 (Hoeffding's inequality, Theorem 2.2.1, [Ver18]). If $X \sim \mathrm{subG}(\mu, \sigma^2)$, then for all $t \ge 0$,

$$\mathbb{P}[|X - \mu| \ge t] \le 2\exp\left(-\frac{t^2}{2\sigma^2}\right).$$

Lemma 2 (Proposition 2.6.1, [Ver18]). Each of these statements implies each of the others, for some constants $C_1, C_2 > 0$ (where the constants may change in the different directions of implication).

1. $X \sim \mathrm{subG}(\mu, \sigma^2)$.
2. $\mathbb{E}[X] = \mu$, and for any $p \in \mathbb{N}$, $\mathbb{E}[|X - \mu|^p] \le (C_1 \sigma \sqrt{p})^p$.
3. $\mathbb{E}[X] = \mu$, and $\mathbb{E}[\exp(\frac{(X - \mu)^2}{C_2 \sigma^2})] \le C_2$.⁴

Lemma 3. If $X \sim \mathrm{subG}(0, C\sigma^2)$ has variance $\sigma^2$ for some constant $C > 0$, then there exist constants $c_1, c_2 > 0$ such that $|X| \sim \mathrm{subG}(\mu, c_2\sigma^2)$ for some $\mu \ge c_1\sigma$.

Proof. By Item 2 of Lemma 2, we know that $\mathbb{E}[X^4] = O(\sigma^4)$. Applying the Paley-Zygmund inequality to $Y = X^2$, there exists a constant $c > 0$ such that

$$\mathbb{P}\left[X^2 \ge \frac{1}{2}\mathbb{E}[X^2]\right] \ge \frac{(\mathbb{E}[X^2])^2}{4\,\mathbb{E}[X^4]} \ge c.$$

Hence, $\mathbb{E}[|X|] \ge c_1\sigma$ for some $c_1 > 0$, proving the claim about the mean. Next, by Item 3 of Lemma 2, we know that there exists some $C_2 > 0$ such that

$$\mathbb{E}\left[\exp\left(\frac{(|X| - \mathbb{E}|X|)^2}{2C_2\sigma^2}\right)\right] \le \mathbb{E}\left[\exp\left(\frac{2X^2 + 2(\mathbb{E}|X|)^2}{2C_2\sigma^2}\right)\right] \le C_2 \exp\left(\frac{1}{C_2}\right).$$

Thus, there is a constant such that Item 3 holds, so the sub-Gaussian parameter is $O(\sigma^2)$.
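As a quick numerical sanity check of Hoeffding's inequality (Lemma 1) — an illustration of ours, not part of the paper's analysis — note that a standard Gaussian is $\mathrm{subG}(0, 1)$, so its empirical tails should sit below the bound $2\exp(-t^2/2\sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
samples = rng.normal(0.0, sigma, size=1_000_000)  # N(0, 1) is subG(0, 1)

for t in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(samples) >= t)          # empirical tail P[|X| >= t]
    hoeffding = 2.0 * np.exp(-t**2 / (2.0 * sigma**2))  # Lemma 1's bound
    # The empirical tail never exceeds the Hoeffding bound.
    assert empirical <= hoeffding
```

The Gaussian tail is in fact substantially smaller than the bound at large $t$; the factor-2 slack in Lemma 1 is what makes it hold for every sub-Gaussian law.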
⁴In [Ver18], Item 3 is listed with a bound of $2$ on the right-hand side. However, examining the proof shows any constant bound suffices for concluding sub-Gaussianity (through an appropriate tail bound).

2.2 Sparse recovery preliminaries

In this section, we recall some standard results from the sparse recovery literature.

Regularity assumptions for sparse recovery. We introduce two structural properties of $X$ that are commonly used to make sparse recovery tractable. For a more detailed introduction to these properties, we refer the reader to Chapter 7 of [Wai19].

Definition 2 (Restricted isometry property). Let $(\epsilon, s) \in (0, 1) \times [d]$. We say $X \in \mathbb{R}^{n \times d}$ satisfies the $(\epsilon, s)$-restricted isometry property, or $X$ is $(\epsilon, s)$-RIP, if for all $\theta \in \mathbb{R}^d$ with $\mathrm{nnz}(\theta) \le s$,

$$(1 - \epsilon)\|\theta\|_2^2 \le \|X\theta\|_2^2 \le (1 + \epsilon)\|\theta\|_2^2.$$

An equivalent condition is that $\lambda([X^\top X]_{S \times S}) \in [1 - \epsilon, 1 + \epsilon]^s$ for all $S \subseteq [d]$ with $|S| \le s$.

Intuitively, RIP implies $X$ acts as an approximate isometry on sparse vectors, so $[X^\top X]_{S \times S}$ is well-conditioned for any sparse support $S$. It is well-known that Definition 2 is satisfied by various random matrix ensembles. In this paper, we primarily focus on $X$ drawn from sub-Gaussian ensembles, which we formally define here for ease of reference.

Model 3. Let $X \in \mathbb{R}^{n \times d}$ have i.i.d. entries $\sim \mathrm{subG}(0, \frac{C}{n})$, with variance $\frac{1}{n}$, for a constant $C > 0$.

Next, we state a useful property of sub-Gaussian matrix ensembles.

Proposition 1 (Theorem 9.2, [FR13]). Let $(\delta, \epsilon) \in (0, \frac{1}{2})^2$. Under Model 3, for any $s \in [d]$, $X$ is $(\epsilon, s)$-RIP with probability $\ge 1 - \delta$ if, for an appropriate constant,

$$n = \Omega\left(\frac{s \log \frac{d}{s} + \log \frac{1}{\delta}}{\epsilon^2}\right).$$

For more RIP matrix ensembles, including those based on sampling bounded orthonormal systems and real trigonometric polynomials, we refer the reader to Chapter 12 of [FR13].
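To build intuition for Definition 2 and Proposition 1, one can sample a Gaussian matrix under Model 3 and verify the eigenvalue condition $\lambda([X^\top X]_{S \times S}) \in [1 - \epsilon, 1 + \epsilon]^s$ on random supports. This is an illustrative experiment of ours (a full RIP certificate would require checking all $\binom{d}{s}$ supports, which is intractable; the constants below are chosen ad hoc):

```python
import numpy as np

rng = np.random.default_rng(2)
d, s = 500, 5
n = 2000  # well above the s*log(d/s) scale of Proposition 1 for this (d, s)

X = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, d))  # Model 3, entry variance 1/n

# Spot-check the restricted eigenvalue condition on random size-s supports.
worst = 0.0
for _ in range(200):
    S = rng.choice(d, size=s, replace=False)
    G = X[:, S].T @ X[:, S]          # [X^T X]_{S x S}
    eigs = np.linalg.eigvalsh(G)     # sorted eigenvalues of the restricted Gram
    worst = max(worst, abs(eigs[0] - 1.0), abs(eigs[-1] - 1.0))

# Eigenvalues concentrate near 1, consistent with (eps, s)-RIP for moderate eps.
assert worst < 0.5
```

In contrast, pairwise incoherence (introduced next) controls individual entries of $X^\top X - I_d$ rather than restricted spectra, and requires the larger sample size $n \approx \alpha^{-2} \log d$.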
Another common condition for sparse recovery is through the lens of (approximate) orthogonality.

Definition 3 (Pairwise incoherence). Let $\alpha \in (0, 1)$. We say $X \in \mathbb{R}^{n \times d}$ with columns $\{x_i\}_{i \in [d]}$ satisfies $\alpha$-pairwise incoherence, or $X$ is $\alpha$-PI, if

$$\max_{(i, j) \in [d] \times [d]} \left|\left[X^\top X - I_d\right]_{ij}\right| \le \alpha.$$

Proposition 2 (Lemma 6.26, [Wai19]). Let $(\alpha, \delta) \in (0, \frac{1}{2})^2$. Under Model 3, $X$ is $\alpha$-PI with probability $\ge 1 - \delta$ if, for an appropriate constant,

$$n = \Omega\left(\frac{\log \frac{d}{\delta}}{\alpha^2}\right).$$

$\ell_2$ sparse recovery in nearly-linear time. Our algorithms use $\ell_2$ sparse recovery, an extensively studied primitive, as a subroutine. In Algorithm 1, we recall one famous sparse recovery algorithm, iterative hard thresholding (IHT) [BD09], whose $\ell_2$ error guarantees are well-understood.

Algorithm 1: IHT($X$, $y$, $k$, $R$, $r$)
1. $\theta^{(0)} \leftarrow 0_d$
2. if $r \ge R$ then return $\theta^{(0)}$
3. $T \leftarrow \lceil \log_2 \frac{R}{r} \rceil$
4. for $t = 0, 1, \ldots, T - 1$ do
5.   $\theta^{(t+1)} \leftarrow H_k\left(\theta^{(t)} + X^\top(y - X\theta^{(t)})\right)$
6. end
7. return $\theta \leftarrow \theta^{(T)}$

Lemma 4 (Theorem 4.8, [Pri21]). In Problem 2, if $X$ is $(0.14, 3k)$-RIP, then for any $R \ge \|\theta^\star\|_2$, the output $\theta$ of Algorithm 1 satisfies⁵

$$\|\theta - \theta^\star\|_2 \le r + 5\left\|X^\top \xi\right\|_{2,3k} \le r + 5\sqrt{3k}\left\|X^\top \xi\right\|_\infty, \quad |\mathrm{supp}(\theta)| \le k.$$

3 Oblivious $\ell_\infty$ Sparse Recovery

In this section, we develop our estimation algorithm to solve Problem 2 under the oblivious Model 1. In Algorithm 2, we decompose the task into three phases.

1. In the first phase, we run IHT (Algorithm 1) to obtain a warm start and finer control on $\|\theta^\star\|_2$.
2. In the second phase, we use a simple thresholding procedure to estimate the support of $\theta^\star$.
3. In the third phase, we perform an ordinary least squares (OLS) step on the learned support.

The analysis of the first phase applies Lemma 4 in a black-box manner. After stating some technical preliminaries in Section 3.1, we analyze the second phase in Section 3.2 and the third phase in Section 3.3.
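Algorithm 1 transcribes almost line-for-line into code. The following NumPy sketch is ours, for illustration: the instance sizes are ad hoc, and the unit step size in $X^\top(y - X\theta^{(t)})$ implicitly assumes $X$ is normalized as in Model 3, so that $X^\top X \approx I_d$ on sparse supports:

```python
import numpy as np

def hard_threshold(v, k):
    """H_k(v): keep the k largest-magnitude coordinates of v, zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(-np.abs(v), kind="stable")[:k]  # stable sort breaks ties
    out[keep] = v[keep]
    return out

def iht(X, y, k, R, r):
    """Iterative hard thresholding, mirroring Algorithm 1's pseudocode."""
    theta = np.zeros(X.shape[1])
    if r >= R:
        return theta
    T = int(np.ceil(np.log2(R / r)))
    for _ in range(T):
        theta = hard_threshold(theta + X.T @ (y - X @ theta), k)
    return theta

# Tiny demonstration on a noiseless instance.
rng = np.random.default_rng(3)
n, d, k = 500, 1000, 5
X = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, d))
theta_star = np.zeros(d)
theta_star[rng.choice(d, size=k, replace=False)] = rng.normal(0.0, 1.0, size=k)
y = X @ theta_star

# Lemma 4 requires R >= ||theta_star||_2; r sets the target resolution.
theta = iht(X, y, k, R=np.linalg.norm(theta_star) + 1.0, r=1e-9)
assert np.linalg.norm(theta - theta_star) <= 1e-2
assert np.count_nonzero(theta) <= k
```

Each iteration costs $O(nd)$ time for the matrix-vector products plus $O(d \log d)$ for the thresholding sort, and the output is always $k$-sparse by construction.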
Since Algorithm 2 proceeds in multiple phases, the estimation in later phases generally depends on the outcomes of the preceding ones. To handle this dependence cleanly, we first establish a basic equivalence principle: bounds proved for every fixed design are equivalent to bounds proved for any independent random design. We will use these two viewpoints interchangeably henceforth.

Lemma 5. Let $(\Omega_1, \mathcal{F}_1)$ and $(\Omega_2, \mathcal{F}_2)$ be two measurable spaces, and let $R_1 : \Omega_1 \to \mathcal{X}_1$ and $R_2 : \Omega_2 \to \mathcal{X}_2$ be two random variables. Let $E \subseteq \mathcal{X}_1 \times \mathcal{X}_2$ be a measurable event. Then
$$\Pr_{R_1}\left[(R_1, R_2) \in E\right] \le \delta \text{ for all fixed } R_2 \in \mathcal{X}_2 \iff \Pr_{R_1, R_2}\left[(R_1, R_2) \in E\right] \le \delta \text{ for any } R_2 \perp R_1.$$

Proof. The $\Leftarrow$ direction follows by taking $R_2$ to be deterministic (a Dirac measure). To show the $\Rightarrow$ direction, notice that
$$\Pr_{R_1, R_2}\left[(R_1, R_2) \in E\right] = \mathbb{E}_{R_2}\left[\Pr_{R_1}\left[(R_1, R_2) \in E \mid R_2\right]\right] = \mathbb{E}_{R_2}\left[\Pr_{R_1}\left[(R_1, R_2) \in E\right]\right],$$
where the last equality is by the fact that $R_1 \perp R_2$. By the assumption, we obtain
$$\Pr_{R_1, R_2}\left[(R_1, R_2) \in E\right] = \mathbb{E}_{R_2}\left[\Pr_{R_1}\left[(R_1, R_2) \in E\right]\right] \le \mathbb{E}_{R_2}[\delta] = \delta.$$

^5 A minor difference is that [Pri21] derives a bound in terms of $\|\xi\|_2$, whereas we require a bound on $\|X^\top \xi\|_{2, 3k}$. This mismatch can be resolved by refining Lemma 4.7 of [Pri21]: instead of applying the coarse bound $\|X^\top_{S:} \xi\|_2 \le \|X^\top_{S:}\|_{\mathrm{op}} \|\xi\|_2$, one can directly control $\|X^\top_{S:} \xi\|_2$ via the tighter quantity $\|X^\top \xi\|_{2, 3k}$.
Algorithm 2: $\mathsf{ObliviousSparseRecovery}(X, y, k, R, r, c)$
1: $(X, y) \leftarrow (\sqrt{3} X, \sqrt{3} y)$
2: Evenly divide observations $(X, y)$ into $X = (X^{(1)}; X^{(2)}; X^{(3)})$, $y = (y^{(1)}; y^{(2)}; y^{(3)})$
3: $\widehat{\theta} \leftarrow \mathsf{IHT}(X^{(1)}, y^{(1)}, k, R, \sqrt{k} r)$
4: $(r^{(2)}, r^{(3)}) \leftarrow (y^{(2)} - X^{(2)} \widehat{\theta},\ y^{(3)} - X^{(3)} \widehat{\theta})$
5: $L \leftarrow \{i \in [d] : |X^{(2)\top}_{i:} r^{(2)}| \ge \frac{r}{c}\}$
6: $\theta \leftarrow \widehat{\theta} + [X^{(3)\top} X^{(3)}]^{-1}_{L \times L} X^{(3)\top}_{L:} r^{(3)}$
7: return $\theta$

3.1 Helper technical lemmas

Before we dive into the analysis of Algorithm 2, we first introduce several helper lemmas about sub-Gaussian ensembles (Model 3), which will be used later.

Lemma 6. Let $\delta \in (0, \frac{1}{2})$ and let $v \in \mathbb{R}^d$, $u \in \mathbb{R}^n$ be fixed. Under Model 3, with probability $\ge 1 - \delta$,
$$|\langle u, Xv \rangle| \le 2\|u\|_2 \|v\|_2 \sqrt{\frac{C \log\frac{1}{\delta}}{n}}.$$

Proof. By Fact 1 applied entrywise, $\langle u, Xv \rangle \sim \mathrm{subG}(0, \frac{C}{n}\|v\|_2^2 \|u\|_2^2)$. The claim now follows from Lemma 1.

We conclude that the following corollary holds, using Lemma 5.

Corollary 1. Let $\delta \in (0, \frac{1}{2})$, and under Model 3, let $v \in \mathbb{R}^d$, $u \in \mathbb{R}^n$ be random variables satisfying $(u, v) \perp X$ and $\|v\|_2 \le \alpha$, $\|u\|_2 \le \beta$ with probability $1$. Then with probability $\ge 1 - \delta$,
$$|\langle u, Xv \rangle| \le 2\alpha\beta \sqrt{\frac{C \log\frac{1}{\delta}}{n}}.$$

Next, we show an important technical observation: intuitively, the $\ell_\infty$-operator norm bound on small submatrices of $X^\top X$ is far from tight when applied to "typical" (independent) vectors. For a $k \times k$ submatrix of $X^\top X$ (or its inverse), Proposition 2 suggests that we need $n \approx k^2$ for the $\ell_\infty$-operator norm to be bounded by $O(1)$. The following result shows that if we parameterize the operator norm differently, $n \approx k$ samples suffice for a sharper bound to hold on an independent vector.

Lemma 7. Let $\delta \in (0, \frac{1}{2})$, $k \in [d]$, and $n = \Omega(k \log\frac{d}{\delta})$ for an appropriate constant.
Under Model 3, for any fixed $S \subseteq [d]$ with $|S| = s \le k$, and fixed $v \in \mathbb{R}^n$, with probability $\ge 1 - \delta$ the following hold:
$$\left\|[X^\top X]^{-1}_{S \times S} X^\top_{S:} v\right\|_\infty \le 8\|v\|_2 \sqrt{\frac{C \log\frac{d}{\delta}}{n}}, \quad (4)$$
$$\frac{1}{6}\left\|X^\top_{S:} v\right\|_\infty \le \left\|[X^\top X]^{-1}_{S \times S} X^\top_{S:} v\right\|_\infty \le 6\left\|X^\top_{S:} v\right\|_\infty. \quad (5)$$

Proof. By Proposition 1, with probability $1 - \frac{\delta}{4}$, $X$ satisfies $(\frac{1}{2}, k)$-RIP, and under this event, applying the RIP definition to standard basis vectors yields that for all $i \in [d]$, we have $\|x_i\|_2^2 \in [\frac{1}{2}, \frac{3}{2}]$. To prove the statement, it suffices to control the quantity $e_i^\top [X^\top X]^{-1}_{S \times S} X^\top_{S:} v$ for all $i \in S$. For any $i \in S$, we let $X_{:S_{-i}} \in \mathbb{R}^{n \times (s-1)}$ be the submatrix of $X_{:S}$ without its $i$-th column. Now fix an index $i \in S$. Displaying $S \times S$ matrices so that index $i$ is in the top-left corner gives
$$[X^\top X]_{S \times S} = \begin{pmatrix} a & b^\top \\ b & C \end{pmatrix}, \text{ where } a := x_i^\top x_i,\ b := X_{:S_{-i}}^\top x_i,\ C := X_{:S_{-i}}^\top X_{:S_{-i}}.$$
By the Schur complement formula,
$$[X^\top X]^{-1}_{S \times S} = \frac{1}{a - b^\top C^{-1} b} \begin{pmatrix} 1 & -b^\top C^{-1} \\ -C^{-1} b & C^{-1} b b^\top C^{-1} + (a - b^\top C^{-1} b) C^{-1} \end{pmatrix}.$$
We introduce the helpful notation
$$P_i := I_n - X_{:S_{-i}}(X_{:S_{-i}}^\top X_{:S_{-i}})^{-1} X_{:S_{-i}}^\top, \quad u_i := X_{:S_{-i}}(X_{:S_{-i}}^\top X_{:S_{-i}})^{-1} X_{:S_{-i}}^\top v,$$
so $P_i$ is the orthogonal projector onto the orthogonal complement of the range of $X_{:S_{-i}}$. A direct calculation yields
$$a - b^\top C^{-1} b = x_i^\top P_i x_i, \quad e_i^\top [X^\top X]^{-1}_{S \times S} X^\top_{S:} v = \frac{x_i^\top P_i v}{x_i^\top P_i x_i} = \frac{x_i^\top v - x_i^\top u_i}{x_i^\top P_i x_i}.$$
Thus,
$$\left\|[X^\top X]^{-1}_{S \times S} X^\top_{S:} v\right\|_\infty = \max_{i \in S} \frac{|x_i^\top P_i v|}{x_i^\top P_i x_i} = \max_{i \in S} \frac{|x_i^\top v - x_i^\top u_i|}{x_i^\top P_i x_i}. \quad (6)$$
We proceed to establish concentration of both the numerator and denominator of (6).

Controlling the denominator of (6). The matrix $P_i$ is an orthogonal projection with rank $r = n - s + 1 \ge \frac{n}{2}$, and $x_i \perp P_i$. Hence, for all $i \in S$,
$$\mathbb{E}\left[x_i^\top P_i x_i\right] = \left\langle \mathbb{E}\left[x_i x_i^\top\right], P_i \right\rangle = \frac{1}{n} \mathrm{Tr}(P_i) = \frac{r}{n} \ge \frac{1}{2}.$$
Next, the Hanson-Wright inequality (Theorem 6.2.2, [Ver18]) states that there is a constant $c > 0$ such that for all $t \ge 0$,
$$\Pr\left[x_i^\top P_i x_i \le \mathbb{E}[x_i^\top P_i x_i] - t\right] \le \exp\left(-c \min\left(\frac{t^2 n^2}{C^2 r}, \frac{tn}{C}\right)\right).$$
Combining the above two displays, and applying a union bound over $i \in S$, we have
$$\Pr\left[\exists i \in S : x_i^\top P_i x_i \le \frac{1}{4}\right] \le s \exp\left(-c \min\left(\frac{n}{16 C^2}, \frac{n}{4C}\right)\right) \le \frac{\delta}{4}, \quad (7)$$
for sufficiently large $n = \Omega(\log\frac{d}{\delta})$ as stated, using $s \le d$. Also, RIP implies $x_i^\top P_i x_i \le \|x_i\|_2^2 \le \frac{3}{2}$.

Control of $x_i^\top P_i v$. Since $P_i$ is a projection operator, $\|P_i v\|_2 \le \|v\|_2$. Conditioning on $(P_i, v)$ and using $x_i \perp (P_i, v)$, we thus have $x_i^\top P_i v \sim \mathrm{subG}(0, \frac{C}{n}\|P_i v\|_2^2)$. Hence,
$$\Pr\left[|x_i^\top P_i v| > t\right] \le 2\exp\left(-\frac{n t^2}{2C\|v\|_2^2}\right) \text{ for all } i \in S,$$
by Lemma 1. Now applying a union bound over $i \in S$ yields
$$\Pr\left[\exists i \in S : |x_i^\top P_i v| > 2\|v\|_2 \sqrt{\frac{C \log\frac{d}{\delta}}{n}}\right] \le \frac{\delta}{4},$$
which further implies (4) when plugged into (6), as long as the event in (7) fails.

Control of $x_i^\top u_i$. Conditioned on $X$ satisfying $(\frac{1}{2}, k)$-RIP, we have
$$\|u_i\|_2 \le \left\|X_{:S_{-i}}\right\|_{2 \to 2} \cdot \left\|[X_{:S_{-i}}^\top X_{:S_{-i}}]^{-1}\right\|_{\mathrm{op}} \cdot \left\|X_{:S_{-i}}^\top v\right\|_2 \le 4\sqrt{s}\left\|X^\top_{S:} v\right\|_\infty \le 4\sqrt{k}\left\|X^\top_{S:} v\right\|_\infty.$$
Since $u_i$ is independent of $x_i$, we have by Fact 1 that $x_i^\top u_i \sim \mathrm{subG}(0, \frac{C}{n}\|u_i\|_2^2)$, and thus
$$\Pr\left[|x_i^\top u_i| > t\right] \le 2\exp\left(-\frac{n t^2}{2C\|u_i\|_2^2}\right) \le 2\exp\left(-\frac{n t^2}{32 k C \left\|X^\top_{S:} v\right\|_\infty^2}\right).$$
Taking $t = \frac{1}{2}\|X^\top_{S:} v\|_\infty$ and applying a union bound over $i \in S$ yields
$$\Pr\left[\exists i \in S : |x_i^\top u_i| > \frac{1}{2}\left\|X^\top_{S:} v\right\|_\infty\right] \le \frac{\delta}{4},$$
for sufficiently large $n = \Omega(k \log\frac{d}{\delta})$. Conditioning on all four random events used so far in the proof gives the failure probability.
Now letting $j \in S$ be the index that achieves $|x_j^\top v| = \|X^\top_{S:} v\|_\infty$, and letting $i \in S$ be the index that achieves the maximum in (6), we conclude that (5) holds:
$$\frac{|x_i^\top v - x_i^\top u_i|}{x_i^\top P_i x_i} \ge \frac{|x_j^\top v - x_j^\top u_j|}{x_j^\top P_j x_j} \ge \frac{2}{3}|x_j^\top v - x_j^\top u_j| \ge \frac{1}{6}\left\|X^\top_{S:} v\right\|_\infty,$$
$$\frac{|x_i^\top v - x_i^\top u_i|}{x_i^\top P_i x_i} \le 4|x_i^\top v| + 4|x_i^\top u_i| \le 6\left\|X^\top_{S:} v\right\|_\infty.$$

We again state a corollary due to Lemma 5.

Corollary 2. Let $\delta \in (0, \frac{1}{2})$, $k \in [d]$, and $n = \Omega(k \log\frac{d}{\delta})$ for an appropriate constant. Under Model 3, for any $S \perp X$ such that $|S| \le k$, and $v \perp X_{:S}$, with probability $\ge 1 - \delta$ we have
$$\left\|[X^\top X]^{-1}_{S \times S} X^\top_{S:} v\right\|_\infty \le 8\min\left(\left\|X^\top_{S:} v\right\|_\infty, \|v\|_2 \sqrt{\frac{C \log\frac{d}{\delta}}{n}}\right). \quad (8)$$

Proof. This follows from Lemmas 5 and 7, where we note that the proof of Lemma 7 only required the independence assumption between $v$ and $X_{:S}$ (rather than all of $X$).

3.2 Support identification

We proceed to analyze the support identification step in Line 5 of Algorithm 2. Direct calculation shows that in Problem 2, for any $i \in \mathrm{supp}(\theta^\star)$ and $j \in \mathrm{supp}(\theta^\star)^c$, we have
$$x_i^\top y = \|x_i\|_2^2 \theta^\star_i + \left\langle x_i, X_{:-i} \theta^\star_{-i} \right\rangle + x_i^\top \xi, \quad x_j^\top y = \langle x_j, X\theta^\star \rangle + x_j^\top \xi. \quad (9)$$
Our goal is to separate such a pair $(i, j)$ whenever $|\theta^\star_i| \gg \|X^\top \xi\|_\infty$ is large (intuitively, a "heavy" coordinate of the support). We obtain signal from $\|x_i\|_2^2 \theta^\star_i \gg \|X^\top \xi\|_\infty$, and the pure noise terms $|x_i^\top \xi|$, $|x_j^\top \xi|$ are trivially bounded by $\|X^\top \xi\|_\infty$. Thus, our goal is to ensure the leftover terms, $\langle x_i, X_{:-i} \theta^\star_{-i} \rangle$ and $\langle x_j, X\theta^\star \rangle$, are small in magnitude. If we can ensure this holds, then a simple thresholding operation on $X^\top y$ will separate large coordinates of $\theta^\star$ from zero coordinates. We now formalize this idea, crucially leveraging independence of $X$ and $\theta^\star$.

Lemma 8. Let $\delta \in (0, \frac{1}{2})$, $k \in [d]$, and $n = \Omega(k \log\frac{d}{\delta})$ for an appropriate constant.
Under Model 3, suppose that $L := \{i \in [d] : |x_i^\top y| \ge r_\infty\}$, for
$$r_\infty \ge \frac{\|\theta^\star\|_2}{\sqrt{k}} + 3\left\|X^\top \xi\right\|_\infty.$$
Then the following hold with probability $\ge 1 - \delta$:
$$L \subseteq \mathrm{supp}(\theta^\star), \quad \left\|\theta^\star_{L^c}\right\|_\infty \le 4 r_\infty.$$

Proof. Throughout the proof, denote $S := \mathrm{supp}(\theta^\star)$. We claim both of the following hold, with probability at least $1 - \delta$: $X$ is $(\frac{1}{2}, 1)$-RIP, and
$$\max\left(\max_{i \in S} \left|\left\langle x_i, X_{:-i} \theta^\star_{-i} \right\rangle\right|, \max_{j \in S^c} |\langle x_j, X\theta^\star \rangle|\right) \le \frac{\|\theta^\star\|_2}{\sqrt{k}}.$$
To see that this is correct, RIP holds with probability $\ge 1 - \frac{\delta}{2}$ by Proposition 1, which implies all $i \in [d]$ have $\|x_i\|_2^2 \in [\frac{1}{2}, \frac{3}{2}]$. Hence, because $(i, j) \in S \times S^c$ satisfy $x_i \perp X_{:-i} \theta^\star_{-i}$ and $x_j \perp X\theta^\star$, applying Corollary 1 with failure probability $\delta \leftarrow \frac{\delta}{2d}$ gives the bound for large enough $n$. Condition on the above events holding henceforth. Now recalling the decomposition (9),
$$|x_j^\top y| \le \frac{\|\theta^\star\|_2}{\sqrt{k}} + \left\|X^\top \xi\right\|_\infty \text{ for all } j \in S^c,$$
so that $L \subseteq S$. Finally, supposing some $i \in [d]$ with $|\theta^\star_i| > 4 r_\infty$ is not contained in $L$ gives a contradiction:
$$|x_i^\top y| > \frac{1}{2}|\theta^\star_i| - \frac{\|\theta^\star\|_2}{\sqrt{k}} - \left\|X^\top \xi\right\|_\infty \ge r_\infty.$$

3.3 In-support estimation

Once the large coordinates of $\theta^\star$ are isolated, the remaining task reduces to a low-dimensional regression problem restricted to the recovered support. As our main helper lemma, we show that running ordinary least squares (OLS) on the learned support directly gives us accurate estimation in the $\ell_\infty$ norm, by crucially leveraging independence in Model 1.

Lemma 9. Let $\delta \in (0, \frac{1}{2})$, $k \in [d]$, and $n = \Omega(k \log\frac{d}{\delta})$ for an appropriate constant. Under Models 1 and 3, suppose that $L \subseteq \mathrm{supp}(\theta^\star)$, and that $L \perp X$. Then with probability $\ge 1 - \delta$,
$$\left\|\theta^\star_L - [X^\top X]^{-1}_{L \times L} X^\top_{L:} y\right\|_\infty \le 8\left\|X^\top \xi\right\|_\infty + \frac{1}{\sqrt{k}}\|\theta^\star\|_2.$$

Proof.
Because we assumed $L \subseteq S := \mathrm{supp}(\theta^\star)$, we can decompose $X\theta^\star = X_{:L}\theta^\star_L + X_{:L^c}\theta^\star_{L^c}$, so the OLS solution also admits the following decomposition:
$$[X^\top X]^{-1}_{L \times L} X^\top_{L:} y = \theta^\star_L + [X^\top X]^{-1}_{L \times L} X^\top_{L:} X_{:L^c} \theta^\star_{L^c} + [X^\top X]^{-1}_{L \times L} X^\top_{L:} \xi.$$
In the remainder of the proof we bound the last two terms on the right-hand side. First, condition on $X$ satisfying $(\frac{1}{2}, k)$-RIP, which fails with probability at most $\frac{\delta}{3}$ by Proposition 1. Next, letting $w := X_{:L^c}\theta^\star_{L^c}$, we have by RIP that $\|w\|_2 \le 2\|\theta^\star\|_2$, and $(L, w) \perp X_{:L}$ by assumption. Thus, by Corollary 2,
$$\left\|[X^\top X]^{-1}_{L \times L} X^\top_{L:} w\right\|_\infty \le 8\|w\|_2 \sqrt{\frac{C \log\frac{d}{\delta}}{n}} \le \frac{1}{\sqrt{k}}\|\theta^\star\|_2,$$
except with probability $\frac{\delta}{3}$, by our choice of $n$. Finally, again by Corollary 2,
$$\left\|[X^\top X]^{-1}_{L \times L} X^\top_{L:} \xi\right\|_\infty \le 8\left\|X^\top_{L:} \xi\right\|_\infty \le 8\left\|X^\top \xi\right\|_\infty,$$
except with probability $\frac{\delta}{3}$, because $\xi \perp X$. The conclusion follows from combining the above two displays and applying a union bound for the failure probability.

Finally, we give an end-to-end guarantee for Algorithm 2.

Theorem 3 (Oblivious $\ell_\infty$ sparse recovery). Let $\delta \in (0, \frac{1}{2})$, $k \in [d]$, and $n = \Omega(k \log\frac{d}{\delta})$ for an appropriate constant. Under Models 1 and 3, there is a universal constant $c > 0$ such that if Algorithm 2 is called with parameters $c$, $R \ge \|\theta^\star\|_2$, and
$$r \ge \sigma := \left\|X^\top \xi\right\|_\infty \sqrt{\log\frac{n}{\delta}},$$
then with probability $\ge 1 - \delta$, its output satisfies
$$\|\theta - \theta^\star\|_\infty = O(r), \quad |\mathrm{supp}(\theta)| = O(k). \quad (10)$$
Moreover, the algorithm runs in time $O(nd \log\frac{R}{r} + nk \log\frac{Rk}{r})$.

Proof. By Proposition 1, with probability $\ge 1 - \frac{\delta}{4}$, $X^{(1)}, X^{(2)}, X^{(3)}$ are all $(0.14, 3k)$-RIP. Moreover, we claim that with probability $\ge 1 - \frac{\delta}{4}$, there exists a constant $A > 0$ such that for all $i \in [3]$,
$$\left\|X^{(i)\top} \xi^{(i)}\right\|_\infty \le A\sigma. \quad (11)$$
To see this, each of the relevant quantities is the maximum of either $\frac{n}{3}$ or $n$ i.i.d. sub-Gaussian random variables.
Standard sub-Gaussian concentration (see the derivation of (b) in Lemma 20) shows that with probability $\ge 1 - \frac{\delta}{4}$,
$$\left\|X^{(i)\top} \xi^{(i)}\right\|_\infty = O\left(\left\|\xi^{(i)}\right\|_2 \sqrt{\frac{\log\frac{n}{\delta}}{n}}\right) = O\left(\|\xi\|_2 \sqrt{\frac{\log\frac{n}{\delta}}{n}}\right) = O\left(\left\|X^\top \xi\right\|_\infty \sqrt{\log\frac{n}{\delta}}\right). \quad (12)$$

Next, we analyze the three stages of Algorithm 2 in turn. By Lemma 4, $\widehat{\theta}$ returned on Line 3 has
$$\left\|\widehat{\theta} - \theta^\star\right\|_2 \le 5\sqrt{3k} A\sigma + \sqrt{k} r \le 10 A \sqrt{k} r. \quad (13)$$
In the remainder of the proof, let $\bar{\theta} := \theta^\star - \widehat{\theta}$. Note that the pairs $(X^{(2)}, r^{(2)})$ and $(X^{(3)}, r^{(3)})$ are exactly instances of Model 1 with sparsity parameter $k \leftarrow 2k$, and the signal-noise pairs $(\bar{\theta}, \xi^{(2)})$ and $(\bar{\theta}, \xi^{(3)})$ respectively. Moreover, the learned support $L$ from Line 5 satisfies $L \perp X^{(3)}$.

Next we apply Lemma 8. We first check its preconditions hold: indeed,
$$3\left\|X^{(2)\top} \xi^{(2)}\right\|_\infty + \frac{\|\bar{\theta}\|_2}{\sqrt{k}} \le 13 A r,$$
so it is enough to choose $\frac{1}{c} \ge 13 A$. Then, with probability $\ge 1 - \frac{\delta}{4}$, Lemma 8 gives
$$L \subseteq \mathrm{supp}(\bar{\theta}), \quad \left\|\bar{\theta}_{L^c}\right\|_\infty \le \frac{4r}{c}.$$
Finally, applying Lemma 9, with probability $\ge 1 - \frac{\delta}{4}$,
$$\left\|\bar{\theta}_L - [X^{(3)\top} X^{(3)}]^{-1}_{L \times L} X^{(3)\top}_{L:} r^{(3)}\right\|_\infty \le 8 C' \sigma + 10 A r \le (10A + 8C') r.$$
Summing the above two displays thus yields for the output $\theta$,
$$\|\theta - \theta^\star\|_\infty = \left\|\theta - (\bar{\theta} + \widehat{\theta})\right\|_\infty \le \left\|\bar{\theta}_{L^c}\right\|_\infty + \left\|\bar{\theta}_L - (\theta - \widehat{\theta})_L\right\|_\infty = O(r).$$
Finally, to prove the sparsity claim, it suffices to note that both the support of $\widehat{\theta}$ output by IHT, and the learned support $L$, have cardinality at most $O(k)$.

We conclude by discussing the runtime. All steps in Algorithm 2 clearly take time at most $O(nd \log\frac{R}{r})$ except for the OLS step, Line 6. Whenever Lemma 8 holds, $|L| \le k$, so under our stated sample complexity, the matrix $A := [X^{(3)\top} X^{(3)}]_{L \times L}$ is $O(1)$-well-conditioned. Our goal is to compute $A^{-1} b$ up to $O(r)$ $\ell_\infty$ error, where $b = X^{(3)\top}_{L:} r^{(3)}$.
Further,
$$\|b\|_2 \le \left\|X^{(3)\top} X^{(3)} \widehat{\theta}\right\|_2 + \left\|X^{(3)\top} \xi^{(3)}\right\|_2 \le 2\left\|\widehat{\theta}\right\|_2 + A r \le 2R + 20 A \sqrt{k} r + A r,$$
where we used the triangle inequality, RIP and the assumption (11), and the recovery guarantee (13), respectively, in the three inequalities. Standard bounds on gradient descent (e.g., Theorem 2.1.15 of [Nes03]) then show that $O(\log\frac{kR}{r})$ iterations yield a sufficiently accurate solution. Each iteration requires $O(1)$ matrix-vector multiplications through $X^{(3)}_{:L}$, each taking $O(nk)$ time.

The polylogarithmic overhead in the error of Theorem 3 is due to our use of a small amount of adaptivity, leading to a requirement of the form (11) to compare each phase's error to the overall observations. Examining the proof, it is clear we can replace the error bound with
$$O\left(\max_{i \in [3]} \left\|X^{(i)\top} \xi^{(i)}\right\|_\infty\right).$$
In general, as long as each individual $\|X^{(i)\top} \xi^{(i)}\|_\infty$ is comparable to $\|X^\top \xi\|_\infty$, then we have the tight requirement $r = \Omega(\|X^\top \xi\|_\infty)$. We also point out that for certain popular random matrix models, we can obtain this improvement directly in some parameter regimes. For example, if $X$ is entrywise a (scaled) Gaussian, and $\delta = \mathrm{poly}(\frac{1}{k})$, then using tight bounds for the maximum of $k$ Gaussians in place of (12) again yields a tight bound $r = \Omega(\|X^\top \xi\|_\infty)$.

We also mention here that it is simple to boost any improper hypothesis for Problem 2 into a proper (i.e., exactly $k$-sparse) hypothesis, at a constant factor overhead in the recovery guarantee.

Lemma 10. Let $\theta^\star \in \mathbb{R}^d$ have $\mathrm{nnz}(\theta^\star) \le k$. Then for any $\theta \in \mathbb{R}^d$ and $p \in \mathbb{R}_{\ge 1} \cup \{\infty\}$,
$$\|H_k(\theta) - \theta^\star\|_p \le 2\|\theta - \theta^\star\|_p.$$

Proof. Because $H_k(\theta)$ is the projection of $\theta$ onto the set of $k$-sparse vectors under the $\ell_p$ norm, and $\theta^\star$ is itself $k$-sparse, we have $\|H_k(\theta) - \theta\|_p \le \|\theta^\star - \theta\|_p$, so
$$\|H_k(\theta) - \theta^\star\|_p \le \|H_k(\theta) - \theta\|_p + \|\theta - \theta^\star\|_p \le 2\|\theta - \theta^\star\|_p.$$
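The boosting step of Lemma 10 is one line in practice: replacing any (possibly dense) estimate $\theta$ by $H_k(\theta)$ at most doubles the $\ell_p$ error, simultaneously for every $p$. A small numerical sketch (the helper names and test values here are our own, chosen for illustration):

```python
import numpy as np

def hard_threshold(v, k):
    """H_k: keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def check_boosting(theta, theta_star, k, p):
    """Check the Lemma 10 bound ||H_k(theta) - theta*||_p <= 2 ||theta - theta*||_p."""
    lhs = np.linalg.norm(hard_threshold(theta, k) - theta_star, ord=p)
    rhs = 2 * np.linalg.norm(theta - theta_star, ord=p)
    return lhs <= rhs + 1e-12

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    d, k = 50, 4
    theta_star = np.zeros(d)
    theta_star[rng.choice(d, k, replace=False)] = rng.normal(0, 3, k)
    theta = theta_star + rng.normal(0, 0.5, d)  # dense, "improper" estimate
    print(all(check_boosting(theta, theta_star, k, p) for p in (1, 2, np.inf)))
```

The key structural fact used in the proof, that the top-$k$ selection is the $\ell_p$-nearest $k$-sparse vector for every $p \ge 1$ at once, is what makes a single thresholding rule suffice for all norms.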
3.4 Expanding the range of $r$ in Theorem 3

A potential shortcoming of Theorem 3 is that it requires the lower bound $r \gtrsim \|X^\top \xi\|_\infty$ to hold, and its guarantees do not formally apply otherwise. In this section, we give a simple black-box reduction that lifts this requirement, yielding an error guarantee $\lesssim r + \|X^\top \xi\|_\infty$ for any $r > 0$. This matches the form of our IHT bounds in Lemma 4 and Theorem 4.

Our reduction (Algorithm 3) maintains a threshold $\rho$ and applies Theorem 3 in phases. In each phase, it uses holdout samples to check whether a certain test passes. Finally, it outputs the estimated parameters obtained in the last phase that passed.

Algorithm 3: $\mathsf{OSRReduction}(X, y, k, R, r, c')$
1: $\theta^{(0)} \leftarrow 0_d$
2: if $r \ge R$ then
3:   return $\theta^{(0)}$
4: end
5: $T \leftarrow \lceil \log_2 \frac{R}{r} \rceil$
6: $c \leftarrow$ constant in the statement of Theorem 3
7: $\rho^{(0)} \leftarrow R$
8: $(X, y) \leftarrow \sqrt{2T}(X, y)$
9: Evenly divide observations $(X, y)$ into $X = (X^{(1)}; \ldots; X^{(2T)})$, $y = (y^{(1)}; \ldots; y^{(2T)})$
10: for $t = 0, 1, \ldots, T - 1$ do
11:   $\rho^{(t+1)} \leftarrow \frac{\rho^{(t)}}{2}$
12:   $\theta^{(t+1)} \leftarrow H_k\left(\mathsf{ObliviousSparseRecovery}(X^{(2t+1)}, y^{(2t+1)}, k, R, \rho^{(t+1)}, c)\right)$
13:   if $\left\|(X^{(2t+2)})^\top (y^{(2t+2)} - X^{(2t+2)} \theta^{(t+1)})\right\|_\infty > \frac{1}{c'} \rho^{(t+1)}$ then
14:     return $\theta \leftarrow \theta^{(t)}$
15:   end
16: end
17: return $\theta \leftarrow \theta^{(T)}$

We require one helper result in the analysis of Algorithm 3.

Lemma 11. Let $\delta \in (0, 1)$, and let $v \in \mathbb{R}^d$ be fixed with $\mathrm{nnz}(v) \le k$. If $n = \Omega(k \log\frac{d}{\delta})$ for an appropriate constant, then under Model 3, with probability $\ge 1 - \delta$,
$$\frac{\left\|X^\top X v\right\|_\infty}{\|v\|_\infty} \in \left[\frac{1}{2}, 2\right].$$

Proof. The argument follows the same structure as Lemma 7, but is simpler. By Proposition 1, with probability at least $1 - \frac{\delta}{2}$, the matrix $X$ is $(\frac{1}{4}, k)$-RIP. Let $S = \mathrm{supp}(v)$ with $|S| \le k$. Under $(\frac{1}{4}, k)$-RIP,
$$\|Xv\|_2^2 \le 2\|v\|_2^2 \le 2k\|v\|_\infty^2.$$
For any $i \notin S$, since $x_i \perp X_{:S}$, by Lemma 1,
$$\Pr\left[\left|[X^\top X v]_i\right| > \frac{1}{4}\|v\|_\infty\right] \le 2\exp\left(-\frac{n}{2Ck}\right) \le \frac{\delta}{2d}, \text{ for all } i \notin S, \quad (14)$$
where $C$ is as in Model 3, and the last inequality takes $n = \Omega(k \log\frac{d}{\delta})$ sufficiently large. Next, fix $j \in S$ and write $T_j = S \setminus \{j\}$. Then
$$\langle x_j, X_{:S} v \rangle = \|x_j\|_2^2 v_j + \left\langle x_j, X_{:T_j} v_{-j} \right\rangle.$$
Since $x_j \perp X_{:T_j}$, the second term satisfies the same sub-Gaussian bound as above, hence
$$\Pr\left[\left|\left\langle x_j, X_{:T_j} v_{-j} \right\rangle\right| > \frac{1}{4}\|v\|_\infty\right] \le \frac{\delta}{2d}, \text{ for all } j \in S. \quad (15)$$
By a union bound over all $d$ events in (14), (15), and using $\|x_j\|_2^2 \le \frac{5}{4}$ for all $j \in S$, we then have
$$\max_{i \notin S} \left|[X^\top X v]_i\right| \le \frac{1}{4}\|v\|_\infty, \quad \max_{j \in S} \left|[X^\top X v]_j\right| \le \frac{3}{2}\|v\|_\infty,$$
which yields the claimed upper bound. For the lower bound, let $j^\star \in S$ satisfy $|v_{j^\star}| = \|v\|_\infty$. Using $\|x_{j^\star}\|_2^2 \ge \frac{3}{4}$ and the deviation bound in (15),
$$\left|[X^\top X v]_{j^\star}\right| \ge \frac{3}{4}|v_{j^\star}| - \frac{1}{4}\|v\|_\infty = \frac{1}{2}\|v\|_\infty.$$

We now prove that Algorithm 3 obtains an $\ell_\infty$ recovery guarantee of the form $\approx \|X^\top \xi\|_\infty + r$. For simplicity, we assume the error bound in Theorem 3 only applies in the regime $r \ge \|X^\top \xi\|_\infty \sqrt{\log\frac{n}{\delta}}$, although we inherit similar improvements under certain models (e.g., Gaussian measurements) where Theorem 3 offers a tighter error bound; see the discussion after the proof.

Corollary 3. Let $\delta \in (0, \frac{1}{2})$, $k \in [d]$, and $n = \Omega(kT \log\frac{dT}{\delta})$ for an appropriate constant, where $T = O(\log\frac{R}{r})$ is as in Algorithm 3. Under Models 1 and 3, there is a universal constant $c' > 0$ such that if Algorithm 3 is called with parameters $c'$, $R \ge \|\theta^\star\|_2$, and $r > 0$, then with probability $\ge 1 - \delta$, its output satisfies
$$\|\theta - \theta^\star\|_\infty = O\left(r + \left\|X^\top \xi\right\|_\infty \sqrt{\log\frac{n}{\delta} \log\frac{R}{r}}\right). \quad (16)$$

Proof. We begin by noting that we scaled all of the divided measurements, $(X^{(i)}, y^{(i)})$ for $i \in [2T]$, by $\sqrt{2T}$, to ensure that they fall in the setting of Model 3 after adjusting for the smaller sample size $\frac{n}{2T}$.
Let $C > 0$ be such that Theorem 3 and Lemma 10 guarantee that
$$\rho^{(t)} \ge \sigma := \left\|X^\top \xi\right\|_\infty \sqrt{\log\frac{n}{\delta} \log\frac{R}{r}} \implies \left\|\theta^{(t)} - \theta^\star\right\|_\infty \le C\rho^{(t)}, \quad (17)$$
and such that in every iteration $t \in [T]$,
$$\left\|X^{(2t)\top} \xi^{(2t)}\right\|_\infty \le C\sigma, \quad \frac{\left\|X^{(2t)\top} X^{(2t)} (\theta^{(t)} - \theta^\star)\right\|_\infty}{\left\|\theta^{(t)} - \theta^\star\right\|_\infty} \in \left[\frac{1}{2}, 2\right]. \quad (18)$$
Note that (17) holds with probability $\ge 1 - \frac{\delta}{3}$ by taking a union bound over Theorem 3 for $T \le n$ rounds. The first bound in (18) also holds with probability $\ge 1 - \frac{\delta}{3}$ by a union bound for an appropriate $C$, via (12). Finally, the second bound in (18) holds with probability $\ge 1 - \frac{\delta}{3}$ by Lemma 11 and $\mathrm{nnz}(\theta^{(t)} - \theta^\star) \le 2k$. We henceforth condition on all of these events, and let $\frac{1}{c'} \ge 3C$.

We next claim that in every round where $\rho^{(t+1)} \ge \sigma$, the test in Line 13 will pass. This is because
$$\left\|X^{(2t+2)\top} (y^{(2t+2)} - X^{(2t+2)} \theta^{(t+1)})\right\|_\infty \le \left\|X^{(2t+2)\top} X^{(2t+2)} (\theta^\star - \theta^{(t+1)})\right\|_\infty + \left\|X^{(2t+2)\top} \xi^{(2t+2)}\right\|_\infty$$
$$\le 2\left\|\theta^\star - \theta^{(t+1)}\right\|_\infty + \left\|X^{(2t+2)\top} \xi^{(2t+2)}\right\|_\infty \le 3C\rho^{(t+1)} \le \frac{1}{c'}\rho^{(t+1)}.$$
Therefore, the last round where the test passes will satisfy $\rho^{(t+1)} \le \max(r, 2\sigma)$. On this round,
$$\left\|\theta^{(t)} - \theta^\star\right\|_\infty \le 2\left\|X^{(2t)\top} X^{(2t)} (\theta^\star - \theta^{(t)})\right\|_\infty \le 2\left\|X^{(2t)\top} (y^{(2t)} - X^{(2t)} \theta^{(t)})\right\|_\infty + 2\left\|X^{(2t)\top} \xi^{(2t)}\right\|_\infty$$
$$\le \frac{\max(2\rho^{(t)}, 4\sigma)}{c'} + 2C\sigma = O\left(r + \left\|X^\top \xi\right\|_\infty \sqrt{\log\frac{n}{\delta} \log\frac{R}{r}}\right).$$

4 Adaptive $\ell_\infty$ Sparse Recovery

In this section, we study fundamental limits and algorithms for Problem 2 under Model 2. We first establish that $n = \Omega(k^2 \log d)$ sub-Gaussian measurements suffice to solve adaptive $\ell_\infty$ sparse recovery. We in fact prove our upper bound result under a more general regularity condition, an $\ell_\infty$-norm variant of RIP, defined and studied in Section 4.1. Under this condition, we give our main algorithm for solving Problem 2 in the adaptive model in Section 4.2.
In Section 4.3, we then construct a nearly-matching lower bound, which shows that sample sizes of order $n = o(k^2)$ can be fooled by adversarial instances. This rules out the possibility of any algorithm solving the adaptive variant of Problem 2 in this sample complexity regime.

4.1 $\ell_\infty$-RIP

Motivated by the classical RIP (Definition 2), we introduce an $\ell_\infty$ analog that controls the Gram matrix $X^\top X$ under a different norm. While RIP bounds the $\ell_2$ operator norm of $X^\top X - I_d$ on sparse submatrices, $\ell_\infty$-RIP instead bounds the $\ell_\infty$ operator norm of the same restrictions.

Definition 4 ($\ell_\infty$-RIP). Let $(\epsilon, s) \in (0, 1) \times [d]$. We say $X \in \mathbb{R}^{n \times d}$ is $(\epsilon, s)$-$\ell_\infty$-RIP if, for all $S \subseteq [d]$ with $|S| \le s$,
$$\left\|[X^\top X - I_d]_{S \times S}\right\|_{\infty \to \infty} \le \epsilon. \quad (19)$$

The following observation shows that $\ell_\infty$-RIP implies $\ell_2$-RIP with the same parameters.

Lemma 12. For any symmetric matrix $M \in \mathbb{R}^{d \times d}$, we have $\|M\|_{2 \to 2} \le \|M\|_{\infty \to \infty}$.

Proof. This follows because $\lambda = \|M\|_{2 \to 2}$ is witnessed by some eigenvector $v$ with either $Mv = \lambda v$ or $Mv = -\lambda v$, both of which witness the same blowup in the $\ell_\infty$ norm: $\|Mv\|_\infty = \lambda \|v\|_\infty$.

The $\ell_\infty$-RIP condition is also closely related to pairwise incoherence (PI, Definition 3). Our next lemma shows that $\frac{\epsilon}{s}$-PI implies $(\epsilon, s)$-$\ell_\infty$-RIP. The converse, however, does not hold: $\ell_\infty$-RIP permits structured correlations that are not excluded by the pairwise bounds of PI.

Lemma 13 (PI implies $\ell_\infty$-RIP). For any $(\epsilon, s) \in (0, 1) \times [d]$, if $X \in \mathbb{R}^{n \times d}$ is $\frac{\epsilon}{s}$-PI, then $X$ is also $(\epsilon, s)$-$\ell_\infty$-RIP. Moreover, the converse statement is false.

Proof. As remarked in Section 2.1, the $\ell_\infty \to \ell_\infty$ norm of a matrix is the largest $\ell_1$ norm of any of its rows, so for any $|S| \le s$, the bound (19) is immediate using $\frac{\epsilon}{s}$-PI. To disprove the converse statement, take $X = I_d + \epsilon e_i e_j^\top$ for any $i \ne j$ and $s \ge 2$.

Given Lemma 13 and Proposition 2, sub-Gaussian ensembles are $\ell_\infty$-RIP with high probability.
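Both norms appearing here are easy to compute directly: $\|M\|_{\infty \to \infty}$ is the largest $\ell_1$ norm of a row, so the restricted quantity in (19) is a one-liner, and Lemma 12's comparison against $\|M\|_{2 \to 2}$ can be checked numerically (a small illustrative sketch; the function names are ours):

```python
import numpy as np

def op_norm_inf(M):
    """||M||_{inf -> inf}: the largest l1 norm of any row of M."""
    return np.abs(M).sum(axis=1).max()

def linf_rip_constant(X, S):
    """Left-hand side of (19): ||[X^T X - I]_{S x S}||_{inf -> inf} for support S."""
    G = X[:, S].T @ X[:, S] - np.eye(len(S))
    return op_norm_inf(G)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    M = rng.normal(size=(8, 8))
    M = (M + M.T) / 2  # symmetric, as required by Lemma 12
    print(np.linalg.norm(M, 2) <= op_norm_inf(M) + 1e-12)  # Lemma 12 holds
```

Note that checking (19) exhaustively over all supports $|S| \le s$ is combinatorial; in experiments one typically samples random supports, which is exactly the union-bound structure the next lemma exploits.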
We show that directly applying a union bound gives a slightly tighter guarantee.

Lemma 14. Let $(\delta, \epsilon) \in (0, \frac{1}{2})^2$ and $s \in [d]$. Under Model 3, if $n = \Omega(\frac{s^2}{\epsilon^2}(\log\frac{d}{s} + \log\frac{1}{\delta}))$ for an appropriate constant, then with probability $1 - \delta$, $X$ satisfies $(\epsilon, s)$-$\ell_\infty$-RIP.

Proof. Write $G := X^\top X - I_d$. We control the diagonal and off-diagonal terms separately. For each $i$, $G_{ii} = \|x_i\|_2^2 - 1$ is a centered $\chi^2$ random variable, so Lemma 1 of [LM00] shows
$$\Pr\left[|G_{ii}| \ge \frac{\epsilon}{2}\right] \le 2\exp(-\Omega(\epsilon n)).$$
A union bound over $i \in [d]$ yields $\max_i |G_{ii}| \le \frac{\epsilon}{2}$ with probability at least $1 - \frac{\delta}{3}$, provided $n = \Omega(\frac{1}{\epsilon}\log\frac{d}{\delta})$, which falls within our parameter regime. Next, by Proposition 1, the event $\|x_i\|_2 \le 2$ for all $i \in [d]$ occurs with probability $\ge 1 - \frac{\delta}{3}$. Condition on both events henceforth.

Finally, for the off-diagonal terms, we first fix a subset $S \subseteq [d]$ with $|S| = s$, $i \in S$, and $v \in \{\pm 1\}^S$. Conditioning on $x_i$, the $\{x_i^\top x_j : j \in S \setminus \{i\}\}$ are i.i.d. $\mathrm{subG}(0, \frac{2C}{n})$ random variables. Hence,
$$\sum_{j \in S \setminus \{i\}} v_j x_i^\top x_j \sim \mathrm{subG}\left(0, \frac{2Cs}{n}\right)$$
by Fact 1, and therefore by Lemma 1,
$$\Pr\left[\left|\sum_{j \in S \setminus \{i\}} v_j x_i^\top x_j\right| \ge \frac{\epsilon}{2}\right] \le 2\exp\left(-\Omega\left(\frac{\epsilon^2 n}{s}\right)\right). \quad (20)$$
Note that on the complement of the event in (20), combined with the diagonal bound, $|e_i^\top G_{S \times S} v| \le \epsilon$. Now to obtain $\ell_\infty$-RIP, we must show
$$\max_{\substack{S \subseteq [d] : |S| \le s \\ i \in S,\ v \in \{\pm 1\}^S}} \left|e_i^\top G_{S \times S} v\right| \le \epsilon$$
over all of the $N$ certificates $(S, i, v)$. The number of events is at most $N \le s \cdot 2^s \cdot \binom{d}{s}$, with $\log N = \Theta(s \log\frac{d}{s})$. Setting the right-hand side of (20) to $\frac{\delta}{3N}$ and solving for $n$ gives the claim.

If we instead combine Lemma 13 with Proposition 2 to analyze sub-Gaussian ensembles, the resulting argument yields the sufficient condition $n \gtrsim \frac{s^2}{\epsilon^2}\log\frac{d}{\delta}$, which is slightly worse than Lemma 14.
This gap in the logarithmic dependence can be viewed as direct evidence that PI imposes a more restrictive structural requirement than $\ell_\infty$-RIP. A natural question is whether the $\approx s^2$ sample complexity established in Lemma 14 for sub-Gaussian ensembles can be further improved by alternative matrix ensembles. Unfortunately, as we shortly demonstrate, this dependence is essentially tight: for any matrix with suitably-normalized columns, $(\epsilon, s)$-$\ell_\infty$-RIP cannot hold when $n = o(s^2)$. Consequently, sub-Gaussian ensembles achieve the optimal sample complexity for constant $\epsilon$, up to logarithmic factors. To prove this claim, we first recall a classic inequality known as the Welch bound, which lower bounds averaged correlations among a set of unit vectors.

Lemma 15 ([Wel74]). Let $\{v_i\}_{i \in [d]}$ be a set of unit vectors in $\mathbb{R}^n$. Then,
$$\frac{1}{d^2} \sum_{(i, j) \in [d] \times [d]} \langle v_i, v_j \rangle^2 \ge \frac{1}{n}.$$

We then prove our tightness result as follows.

Lemma 16. For any $\epsilon \in (0, \frac{1}{2})$, $s \ge 2$, and $d \ge \frac{s^3}{\epsilon^2}$, if $X$ satisfies $(\epsilon, s)$-$\ell_\infty$-RIP, then $n \ge \frac{s^2}{144\epsilon^2}$.

Proof. Throughout the proof, we use Lemma 12, which shows $X$ also satisfies $(\epsilon, s)$-$\ell_2$-RIP, so $\|x_i\|_2^2 \in [1 - \epsilon, 1 + \epsilon] \subseteq [\frac{1}{2}, \frac{3}{2}]$ for all $i \in [d]$. Now applying Lemma 15 to the $\{x_i \|x_i\|_2^{-1}\}_{i \in [d]}$,
$$\sum_{(i, j) \in [d] \times [d]} \frac{|\langle x_i, x_j \rangle|^2}{\|x_i\|_2^2 \|x_j\|_2^2} \ge \frac{d^2}{n}.$$
Since all $\|x_i\|_2^2 \ge \frac{1}{2}$, it follows that $\sum_{(i, j) \in [d] \times [d]} |\langle x_i, x_j \rangle|^2 \ge \frac{d^2}{4n}$. Next, define $S_i := \sum_{j \in [d]} |\langle x_i, x_j \rangle|^2$. Averaging over $i$ yields $\frac{1}{d}\sum_{i \in [d]} S_i \ge \frac{d}{4n}$. Hence there exists an index $i_0$ such that $S_{i_0} \ge \frac{d}{4n}$. We define an index set $T$ through a thresholding rule:
$$T := \left\{j \in [d] : |\langle x_{i_0}, x_j \rangle| \ge \frac{1}{2\sqrt{2n}}\right\}.$$
Moreover, the Cauchy-Schwarz inequality gives a trivial bound for pairwise incoherence: $|\langle x_{i_0}, x_j \rangle| \le \|x_{i_0}\|_2 \|x_j\|_2 \le \frac{3}{2}$, so we have
$$S_{i_0} \le |T| \cdot \frac{9}{4} + (d - |T|) \cdot \frac{1}{8n}.$$
Combining with $S_{i_0} \ge \frac{d}{4n}$ yields $|T| \ge \frac{d}{18n}$. If $n \ge \frac{s^2}{18\epsilon^2}$, the desired claim already holds. Otherwise, $|T| \ge s$, since $d \ge \frac{s^3}{\epsilon^2}$. We now select a subset $S \subseteq T$ with $|S| = s$ and $i_0 \in S$, and obtain that
$$\left\|[X^\top X - I]_{S \times S}\right\|_{\infty \to \infty} \ge \sum_{j \in S, j \ne i_0} |\langle x_{i_0}, x_j \rangle| - \left|\|x_{i_0}\|_2^2 - 1\right|.$$
Using the definition of $S \subseteq T$ and that $|\|x_{i_0}\|_2^2 - 1| \le \epsilon$, we obtain
$$\epsilon \ge \left\|[X^\top X - I]_{S \times S}\right\|_{\infty \to \infty} \ge \frac{s - 1}{2\sqrt{2n}} - \epsilon \ge \frac{s}{6\sqrt{n}} - \epsilon.$$
Now, rearranging $2\epsilon \ge \frac{s}{6\sqrt{n}}$ gives the claim.

4.2 IHT under $\ell_\infty$-RIP

We next show that under Definition 4, IHT (Algorithm 1) solves Problem 2 even in an adaptive setting. The proof roadmap is similar to the classic $\ell_2$ case, cf. [Pri21], where the RIP condition is used to argue contraction of a potential based on the $\ell_2$ norm of the residual, and then the sparsity of $\theta^\star$ bounds the lossiness of the top-$k$ selection step with respect to the potential. We start with Lemma 17, which shows that before the top-$k$ truncation step, the $\ell_\infty$ error of our estimate contracts by a constant factor in each iteration.

Lemma 17. Following the notation of Algorithm 1, for all $t \in [0, T - 1]$, define
$$\theta^{(t + 0.5)} = \theta^{(t)} + X^\top (y - X\theta^{(t)}).$$
Then in the setting of Problem 2, if $X$ is $(\epsilon, 2k)$-$\ell_\infty$-RIP, we have
$$\left\|\theta^{(t + 0.5)} - \theta^\star\right\|_\infty \le \epsilon\left\|\theta^{(t)} - \theta^\star\right\|_\infty + \left\|X^\top \xi\right\|_\infty.$$

Proof. Rearranging the definition of $\theta^{(t + 0.5)}$ yields
$$\theta^{(t + 0.5)} - \theta^\star = (I - X^\top X)(\theta^{(t)} - \theta^\star) + X^\top \xi.$$
Thus, letting $S := \mathrm{supp}(\theta^\star) \cup \mathrm{supp}(\theta^{(t)})$ with $|S| \le 2k$,
$$\left\|\theta^{(t + 0.5)} - \theta^\star\right\|_\infty \le \left\|[I - X^\top X]_{S \times S}(\theta^{(t)} - \theta^\star)\right\|_\infty + \left\|X^\top \xi\right\|_\infty$$
$$\le \left\|[I - X^\top X]_{S \times S}\right\|_{\infty \to \infty} \left\|\theta^{(t)} - \theta^\star\right\|_\infty + \left\|X^\top \xi\right\|_\infty \le \epsilon\left\|\theta^{(t)} - \theta^\star\right\|_\infty + \left\|X^\top \xi\right\|_\infty.$$
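The one-step $\ell_\infty$ contraction of Lemma 17 is easy to observe numerically once $n$ is large relative to $k^2$ (a rough, noiseless sanity check with illustrative parameters of our choosing; it is not part of the formal argument):

```python
import numpy as np

def linf_step_ratio(n, d, k, seed=0):
    """Ratio ||theta_half - theta*||_inf / ||theta_t - theta*||_inf for one
    pre-truncation gradient step theta_half = theta_t + X^T (y - X theta_t),
    starting from theta_t = 0 on a noiseless k-sparse instance."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, d))
    theta_star = np.zeros(d)
    theta_star[:k] = rng.normal(0, 1, k)
    theta_t = np.zeros(d)  # current iterate; the error vector is theta_star itself
    y = X @ theta_star
    theta_half = theta_t + X.T @ (y - X @ theta_t)
    return (np.linalg.norm(theta_half - theta_star, np.inf)
            / np.linalg.norm(theta_t - theta_star, np.inf))

if __name__ == "__main__":
    # With n comparable to k^2 log d, the pre-truncation error contracts.
    print(linf_step_ratio(n=4000, d=200, k=10))
```

Here $\theta^{(t+0.5)} - \theta^\star = (I - X^\top X)(\theta^{(t)} - \theta^\star)$ exactly as in the proof, so the printed ratio is an empirical proxy for $\|[I - X^\top X]_{S \times S}\|_{\infty \to \infty}$ on this instance.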
Combining Lemma 17 with our earlier observation in Lemma 10, we conclude that the $\ell_\infty$-norm error of IHT iterates admits an exponential decay toward $\|X^\top \xi\|_\infty$, as $\theta^{(t+1)} = H_k(\theta^{(t + 0.5)})$.

Corollary 4. In the setting of Lemma 17,
$$\left\|\theta^{(t+1)} - \theta^\star\right\|_\infty \le 2\epsilon\left\|\theta^{(t)} - \theta^\star\right\|_\infty + 2\left\|X^\top \xi\right\|_\infty.$$

With this one-step contraction, we finally give our $\ell_\infty$ sparse recovery guarantee for IHT.

Theorem 4 (Adaptive $\ell_\infty$ sparse recovery). Let $\delta \in (0, \frac{1}{2})$, $k \in [d]$, and $n = \Omega(k^2 \log\frac{d}{\delta})$ for an appropriate constant. Under Models 2 and 3, if Algorithm 1 is called with parameters $R \ge \|\theta^\star\|_\infty$ and $r > 0$, then with probability $\ge 1 - \delta$, its output satisfies
$$\|\theta - \theta^\star\|_\infty \le r + 4\left\|X^\top \xi\right\|_\infty.$$
Moreover, the algorithm runs in time $O(nd \log\frac{R}{r})$.

Proof. By Lemma 14, with probability $1 - \delta$, $X$ is $(\frac{1}{4}, 2k)$-$\ell_\infty$-RIP. Now, by telescoping Corollary 4 with $\epsilon = \frac{1}{4}$,
$$\|\theta - \theta^\star\|_\infty \le \left(\frac{1}{2}\right)^T \|\theta^\star\|_\infty + 2\left\|X^\top \xi\right\|_\infty \cdot \sum_{t = 0}^{\infty} \left(\frac{1}{2}\right)^t \le \frac{R}{2^T} + 4\left\|X^\top \xi\right\|_\infty.$$
Taking $T \ge \log_2\frac{R}{r}$, we obtain the desired claim, and the runtime is immediate. As noted, Theorem 4 extends beyond Model 3, as it only uses the sufficient condition of $\ell_\infty$-RIP.

4.3 Lower bound

The previous Sections 4.1 and 4.2 show that if we go through the sufficient condition of $\ell_\infty$-RIP, then we must take $n = \Omega(k^2)$ for an algorithm to solve Problem 2. However, this does not formally preclude a lower sample complexity of $n = o(k^2)$ that solves the problem under Model 3 directly. The main goal of this section is to rule out such a possibility.

The key intuition behind the construction is to identify a $k$-sparse $v \in \mathbb{R}^d$ satisfying $\|v\|_\infty \gg \sigma$ while still having $\|X^\top X v\|_\infty = \sigma$. Setting $\xi = Xv$ then produces an adversarial noise vector that significantly perturbs the underlying signal yet contributes negligibly to $X^\top \xi$, effectively masking the true support.
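This masking construction is easy to instantiate numerically (an illustrative sketch with parameters of our choosing, in the regime $k \log d \lesssim n \ll k^2$): pushing a sign vector through $[X^\top X]^{-1}_{S \times S}$ yields $v$ whose largest entry dwarfs every coordinate of $X^\top X v$.

```python
import numpy as np

def adversarial_v(n, d, k, seed=0):
    """Build v supported on S = [k] with v_S = [X^T X]_{SxS}^{-1} u, where the
    sign vector u is chosen to witness the inf->inf norm (largest-l1 row).
    Returns (v, ||X^T X v||_inf)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, d))
    S = np.arange(k)
    Ginv = np.linalg.inv(X[:, S].T @ X[:, S])
    row = np.abs(Ginv).sum(axis=1).argmax()
    u = np.sign(Ginv[row])            # u in {-1, +1}^S
    v = np.zeros(d)
    v[S] = Ginv @ u                   # so [X^T X v]_S = u exactly
    return v, np.linalg.norm(X.T @ (X @ v), np.inf)

if __name__ == "__main__":
    v, response = adversarial_v(n=1000, d=2000, k=100)
    # ||v||_inf grows like k / sqrt(n), while ||X^T X v||_inf stays O(1).
    print(np.abs(v).max(), response)
```

By construction $[X^\top X v]_S = u$, so the on-support response is exactly $1$; the off-support coordinates stay $O(1)$ by independence, which is precisely the content of Corollary 5 below.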
To formalize this idea, we first establish that under suitable conditions, we can control the $\ell_\infty$ operator norm of the inverse of a Gram matrix from Model 3.

Lemma 18. Let $\delta \in (0, \frac{1}{2})$, $\Omega(k \log\frac{d}{\delta}) = n = o(k^2)$, and let $S \subseteq [d]$ be any fixed subset such that $|S| = k = \Omega(\log\frac{1}{\delta})$. Under Model 3, with probability $\ge 1 - \delta$,
$$\left\|[X^\top X]^{-1}_{S \times S}\right\|_{\infty \to \infty} = \Omega\left(\frac{k}{\sqrt{n}}\right).$$

Proof. By Proposition 1, with probability $\ge 1 - \frac{\delta}{3}$, $X$ is $(\frac{1}{2}, 1)$-RIP, so $\|x_i\|_2^2 \in [\frac{1}{2}, \frac{3}{2}]$ for all $i \in [d]$. Next, define $E := X^\top X - I_d$. Standard inequalities (Exercise 4.7.3, [Ver18]) yield that for large enough $n$ as specified, denoting $E_S := E_{S \times S}$ for short,
$$\|E_S\|_{\mathrm{op}} \le \frac{1}{2}, \text{ with probability} \ge 1 - \frac{\delta}{3}.$$
Finally, conditioned on the realization of $x_i$ with $\|x_i\|_2^2 \in [\frac{1}{2}, \frac{3}{2}]$ for a fixed $i \in S$, Fact 1 shows that for any $j \ne i$,
$$E_{ij} = x_i^\top x_j \sim \mathrm{subG}\left(0, \frac{3C}{2n}\right), \quad \mathrm{Var}[E_{ij}] = \frac{1}{n}\|x_i\|_2^2 \ge \frac{1}{2n}.$$
By Lemma 3,
$$\mathbb{E}[|E_{ij}|] \ge \frac{c_1}{\sqrt{n}}, \quad \mathrm{Var}[|E_{ij}|] \le \frac{c_2}{n^2},$$
for some constants $c_1, c_2$, where we bounded the variance using Item 2 in Lemma 2. Now Cantelli's inequality shows that there exists a constant $c$ such that
$$\Pr\left[|E_{ij}| \ge \frac{4c}{\sqrt{n}}\right] \ge c. \quad (21)$$
For the stated range of $k$, a Chernoff bound gives that at least a $\frac{2c}{3}$ fraction of the entries $j \in S \setminus \{i\}$ satisfy the above event, with probability $\ge 1 - \frac{\delta}{3}$. Thus, under the assumed parameter regimes,
$$\sum_{j \in S \setminus \{i\}} |E_{ij}| \ge \frac{8c(k - 1)}{3\sqrt{n}} \ge \frac{2ck}{\sqrt{n}} \implies \sum_{j \in S} \left|[I_S - E_S]_{ij}\right| \ge \frac{2ck}{\sqrt{n}} - \frac{1}{2} \ge \frac{ck}{\sqrt{n}},$$
where the last inequality uses $n = o(k^2)$. In the remainder of the proof, condition on the above three events holding, which gives the failure probability. Next, consider the power series
$$[X^\top X]^{-1}_{S \times S} = (I_S + E_S)^{-1} = I_S - E_S + \sum_{t = 2}^{\infty} (-E_S)^t.$$
Letting $R_S := \sum_{t = 2}^{\infty} (-E_S)^t$, we have
$$\|R_S\|_{\mathrm{op}} \le \sum_{t = 2}^{\infty} \|E_S\|_{\mathrm{op}}^t = \frac{\|E_S\|_{\mathrm{op}}^2}{1 - \|E_S\|_{\mathrm{op}}} \le 2\|E_S\|_{\mathrm{op}}^2 \le \frac{1}{2}.$$
The conclusion then follows by combining the above three displays and applying Lemma 12:
$$\left\|\left[X^\top X\right]^{-1}_{S\times S}\right\|_{\infty\to\infty} \ge \|I_S - E_S\|_{\infty\to\infty} - \|R_S\|_{\infty\to\infty} \ge \frac{ck}{\sqrt{n}} - \frac{1}{2} = \Omega\left(\frac{k}{\sqrt{n}}\right).$$

Corollary 5. In the setting of Lemma 18, there is a vector $v \in \mathbb{R}^d$ with $\textup{supp}(v) = S$, satisfying
$$\|v\|_\infty = \Omega\left(\frac{k}{\sqrt{n}}\right),\quad \left\|X^\top X v\right\|_\infty = 1.$$

Proof. By Lemma 18, with probability $\ge 1 - \frac{\delta}{3}$, there exists $u \in \{-1, +1\}^S$ such that
$$\left\|\left[X^\top X\right]^{-1}_{S\times S} u\right\|_\infty \ge \frac{ck}{\sqrt{n}},$$
for a constant $c$. Also, henceforth condition on $X$ being $(\frac{1}{2}, k)$-RIP, which occurs with probability $\ge 1 - \frac{\delta}{3}$. We take $v \in \mathbb{R}^d$ to be the vector supported on $S$, satisfying
$$v_S = \left[X^\top X\right]^{-1}_{S\times S} u.$$
Thus, the lower bound on $\|v\|_\infty$ is trivially satisfied. For the second claim,
$$\left\|X^\top X v\right\|_\infty = \max\left\{\left\|X_{:S}^\top X v\right\|_\infty,\; \left\|X_{:S^c}^\top X v\right\|_\infty\right\}.$$
Since $\textup{supp}(v) = S$, the first term is $1$. For the second term, notice that $v$ depends only on $X_{:S}$, so $X_{:S^c} \perp (X_{:S}, v)$. Thus, for each $j \in S^c$, conditioned on $(X_{:S}, v)$, we have
$$x_j^\top X_{:S} v_S \sim \textup{subG}\left(0, \frac{\|X_{:S}v_S\|_2^2}{n}\right),\quad \|X_{:S}v_S\|_2^2 = u^\top\left[X^\top X\right]^{-1}_{S\times S} u \le 2\|u\|_2^2 = 2k.$$
Thus, applying Lemma 1 and taking a union bound over all $j \in S^c$ shows $\|X^\top X v\|_\infty = O(1)$ with probability $\ge 1 - \frac{\delta}{2}$. The result follows by adjusting both bounds by a constant.

Theorem 5 (Lower bound for adaptive $\ell_\infty$ sparse recovery). Under Models 2 and 3, there is no algorithm that solves Problem 2 with probability $\ge \frac{3}{4}$ when $\Omega(k\log d) = n = o(k^2)$.

Proof. Fix two arbitrary disjoint sets $S, T \subseteq [d]$, each of size $\frac{k}{2}$. In this regime, Corollary 5 (applied with $\delta = \frac{1}{2}$) holds with probability $\frac{1}{2}$ on support $S$. Let $\bar\theta \in \mathbb{R}^d$ be an arbitrary vector supported on $T$, and let $v$ be the vector guaranteed by Corollary 5 on support $S$, whenever it succeeds. Now let
$$\theta^\star_1 := \bar\theta,\quad \theta^\star_2 := \bar\theta + v,\quad \xi_1 := Xv,\quad \xi_2 := 0_n.$$
Any algorithm solving Problem 2 must then be able to distinguish $\theta^\star_1$ and $\theta^\star_2$, because
$$\left\|X^\top \xi_1\right\|_\infty = 1,\quad \left\|X^\top \xi_2\right\|_\infty = 0,\quad \text{and } \|\theta^\star_1 - \theta^\star_2\|_\infty = \omega(1)$$
in the stated parameter regime. However, this is a contradiction: the observations $(X, y)$ are identical under either pair $(\theta^\star_1, \xi_1)$ or $(\theta^\star_2, \xi_2)$, since $y = X\bar\theta + Xv = X(\bar\theta + v)$, so no algorithm can do better than random guessing. Thus, the success probability is upper bounded by $\frac{1}{2} + \frac{1}{2}\cdot\frac{1}{2}$.

We note that Theorem 5 extends to a lower bound against variable selection as well.

Theorem 6 (Lower bound for adaptive variable selection). In the setting of Theorem 5, there is no algorithm that solves Problem 1 with probability $\ge \frac{3}{4}$.

Proof. Instate the same lower bound instance as in Theorem 5. Observe that solving Problem 1 would allow us to distinguish whether the support is $T$ or $S \cup T$, i.e., it would let us distinguish $\theta^\star_1$ from $\theta^\star_2$ whenever Corollary 5 succeeds. Also, for $\bar\theta$ with sufficiently large entries, the signal-to-noise assumption (2) clearly holds for both of our instances. Thus, the same contradiction holds.

5 Variable Selection with Adaptive Measurements

In this section, we consider a partial model of adaptivity that sits between Models 1 and 2. Intuitively, it treats the noise $\xi$ as benign, but the signal $\theta^\star$ as adaptive.

Model 4 (Partially-adaptive model). In the partially-adaptive model of Problems 1 and 2, $\xi \perp X$, and we make no independence assumptions on $(\theta^\star, \xi)$ or $(\theta^\star, X)$.

We note that Model 1 is an instance of Model 4, which in turn is an instance of Model 2. This motivates a natural question: does the sample complexity of Problems 1 and 2 under the intermediate Model 4 scale as $\approx k$, as in Theorem 3, or as $\gtrsim k^2$, as in Theorem 5? While we do not fully resolve this question, in this section we demonstrate that nontrivial recovery guarantees are possible under Model 4.
Specifically, if we allow for adaptive measurements in which we can exclude certain coordinates of $\theta^\star$ from observation, then $\approx k$ samples suffice to solve Problem 1. This result is shown in Theorem 7. While our result works in a different (and potentially unrealistic in some scenarios) observation model than standard sparse recovery, we hope it may serve as a stepping stone toward understanding the landscape of partially-adaptive variable selection. In Section 5.1, we give a helper lemma showing how to isolate a constant fraction of the undiscovered large coordinates of $\theta^\star$. We then recurse on this procedure in Section 5.2 to prove Theorem 7.

5.1 Few false positives in thresholding

Recall that the support estimation phase of Algorithm 2 (Line 5) is based on thresholding $x_i^\top y$. In the oblivious model, $X \perp \theta^\star$, so the interference term $\langle x_i, X_{:S}\theta^\star\rangle$ concentrates at the scale $\|X^\top\xi\|_\infty$. This no longer applies in the adaptive model, where $\theta^\star$ may be chosen adversarially after observing $X$. Indeed, we exploit this adaptivity in our worst-case lower bound in Theorem 5.

Despite this worst-case construction, we show that a single round of thresholding based on $x_i^\top y$ still admits a meaningful guarantee in the setting of Problem 1, where all $i \in \textup{supp}(\theta^\star)$ satisfy $|\theta^\star_i| = \Omega(\|X^\top\xi\|_\infty)$. In other words, every nonzero signal coordinate is sufficiently heavy. In this setting, we show that the number of falsely-selected coordinates via thresholding is $O(k)$, and that the identified coordinates contain at least a constant fraction of the total $\ell_2$ energy of $\theta^\star$. We remark that this proof is inspired by a similar observation in Lemma 2.1, [BBKS24].

Lemma 19. Let $\delta \in (0, \frac{1}{2})$ and assume that for a universal constant $C' > 0$,
$$\|\theta^\star\|_2 \le C'\sqrt{k}\left\|X^\top\xi\right\|_\infty.$$
In the setting of Problem 1 and Models 3 and 4, let
$$S_{\textup{FP}} = \left\{i \notin \textup{supp}(\theta^\star) : \left|x_i^\top y\right| \ge \frac{C}{2}\left\|X^\top\xi\right\|_\infty\right\},\quad S_{\textup{FN}} = \left\{i \in \textup{supp}(\theta^\star) : \left|x_i^\top y\right| < \frac{C}{2}\left\|X^\top\xi\right\|_\infty\right\},$$
where $C$ is the constant from Problem 1. Then if $n = \Omega(k\log\frac{d}{\delta})$ for an appropriate constant, with probability $\ge 1 - \delta$, we have
$$|S_{\textup{FP}}| = O(k),\quad \frac{\left\|\theta^\star_{S_{\textup{FN}}}\right\|_2}{\|\theta^\star\|_2} \le 0.95.$$

Proof. We prove the first claim by contradiction. Suppose there is $T \subseteq S_{\textup{FP}}$ such that
$$|T| = \frac{64(C')^2}{C^2}\cdot k + 1, \quad (22)$$
which exists whenever $|S_{\textup{FP}}| > \frac{64(C')^2}{C^2}\cdot k$. By the thresholding rule, for any $i \in T$, $|x_i^\top y| \ge \frac{C}{2}\|X^\top\xi\|_\infty$. Define a vector $v \in \mathbb{R}^T$ such that $v_i = \textup{sign}(x_i^\top y)$. Then,
$$\left\langle v, X_{T:}^\top y\right\rangle = \sum_{i\in T}\left|x_i^\top y\right| \ge \frac{C}{2}|T|\cdot\left\|X^\top\xi\right\|_\infty.$$
On the other hand, notice that
$$\left\langle v, X_{T:}^\top y\right\rangle = \left\langle v, X_{T:}^\top X\theta^\star\right\rangle + \left\langle v, X_{T:}^\top\xi\right\rangle = \left\langle X_{:T}v, X\theta^\star\right\rangle + \left\langle v, X_{T:}^\top\xi\right\rangle.$$
By the Cauchy–Schwarz inequality, we have
$$\left|\left\langle X_{:T}v, X\theta^\star\right\rangle\right| \le \|X_{:T}v\|_2\cdot\|X\theta^\star\|_2,\quad \left\langle v, X_{T:}^\top\xi\right\rangle \le \|v\|_2\cdot\left\|X_{T:}^\top\xi\right\|_2 \le |T|\cdot\left\|X^\top\xi\right\|_\infty.$$
For sufficiently large $n$ as stated, $X$ satisfies $(\frac{1}{2}, |T| + k)$-RIP, so
$$\|X_{:T}v\|_2^2 \le 2\|v\|_2^2 = 2|T|,\quad \|X\theta^\star\|_2^2 \le 2\|\theta^\star\|_2^2.$$
This yields
$$\left\langle v, X_{T:}^\top y\right\rangle \le 2\|\theta^\star\|_2\sqrt{|T|} + |T|\cdot\left\|X^\top\xi\right\|_\infty.$$
Combining the two displays gives the contradiction, for sufficiently large $C > 4$:
$$2\|\theta^\star\|_2\sqrt{|T|} + |T|\cdot\left\|X^\top\xi\right\|_\infty \ge \frac{C}{2}|T|\cdot\left\|X^\top\xi\right\|_\infty \implies |T| \le \frac{64\|\theta^\star\|_2^2}{C^2\left\|X^\top\xi\right\|_\infty^2} \le \frac{64(C')^2}{C^2}\cdot k.$$
To prove the second claim, we use a similar strategy. By the thresholding rule, for any $i \in S_{\textup{FN}}$, we know $|x_i^\top y| < \frac{C}{2}\|X^\top\xi\|_\infty$. Then,
$$\left\|X_{S_{\textup{FN}}:}^\top y\right\|_2 = \sqrt{\sum_{i\in S_{\textup{FN}}}\left(x_i^\top y\right)^2} < \frac{C}{2}\sqrt{|S_{\textup{FN}}|}\left\|X^\top\xi\right\|_\infty.$$
On the other hand, letting $S_{\textup{TP}} = S \setminus S_{\textup{FN}}$, we notice that
$$\left\|X_{S_{\textup{FN}}:}^\top y\right\|_2 \ge \left\|X_{S_{\textup{FN}}:}^\top X_{:S_{\textup{FN}}}\theta^\star_{S_{\textup{FN}}}\right\|_2 - \left\|X_{S_{\textup{FN}}:}^\top X_{:S_{\textup{TP}}}\theta^\star_{S_{\textup{TP}}}\right\|_2 - \left\|X_{S_{\textup{FN}}:}^\top\xi\right\|_2.$$
A direct calculation also gives, under $(\frac{1}{4}, k)$-RIP,
$$\left\|X_{S_{\textup{FN}}:}^\top X_{:S_{\textup{FN}}}\theta^\star_{S_{\textup{FN}}}\right\|_2 \ge \frac{3}{4}\left\|\theta^\star_{S_{\textup{FN}}}\right\|_2,\quad \left\|X_{S_{\textup{FN}}:}^\top X_{:S_{\textup{TP}}}\theta^\star_{S_{\textup{TP}}}\right\|_2 \le \frac{5}{4}\left\|\theta^\star_{S_{\textup{TP}}}\right\|_2,\quad \left\|X_{S_{\textup{FN}}:}^\top\xi\right\|_2 \le \sqrt{|S_{\textup{FN}}|}\left\|X^\top\xi\right\|_\infty.$$
Therefore, we have the following lower bound:
$$\left\|X_{S_{\textup{FN}}:}^\top y\right\|_2 \ge \left\|\theta^\star_{S_{\textup{FN}}}\right\|_2 - \left\|\theta^\star_{S_{\textup{TP}}}\right\|_2 - \frac{1}{2}\|\theta^\star\|_2 - \sqrt{|S_{\textup{FN}}|}\left\|X^\top\xi\right\|_\infty.$$
To avoid a contradiction, we require
$$\left\|\theta^\star_{S_{\textup{FN}}}\right\|_2 - \left\|\theta^\star_{S_{\textup{TP}}}\right\|_2 - \frac{1}{2}\|\theta^\star\|_2 - \sqrt{|S_{\textup{FN}}|}\left\|X^\top\xi\right\|_\infty \le \frac{C}{2}\sqrt{|S_{\textup{FN}}|}\left\|X^\top\xi\right\|_\infty,$$
which for $C'$ sufficiently large in Problem 1 is equivalent to
$$\alpha - \sqrt{1 - \alpha^2} \le \frac{1}{2} + \left(\frac{C}{2} + 1\right)\frac{\sqrt{|S_{\textup{FN}}|}\left\|X^\top\xi\right\|_\infty}{\|\theta^\star\|_2} \le \frac{3}{5},\quad \text{for } \alpha := \frac{\left\|\theta^\star_{S_{\textup{FN}}}\right\|_2}{\|\theta^\star\|_2}.$$
Finally, solving a quadratic shows that for $\alpha \in (0, 1)$, this implies $\alpha \le 0.95$ as claimed.

Algorithm 4: AdaptiveObserve($n$, $S$)
1 Generate $X \in \mathbb{R}^{n\times d}$, $\xi \in \mathbb{R}^n$ under Models 3 and 4
2 $X_{:S} \leftarrow 0_{n\times|S|}$
3 Observe $y \leftarrow X\theta^\star + \xi$
4 return $(X, y)$

5.2 Recursive support estimation

We now describe an adaptive algorithm that solves Problem 1 using $\approx k$ measurements. The specific form of adaptivity we use is the ability to query new observations using Algorithm 4. In other words, we fix the signal $\theta^\star$ in all adaptive calls to Algorithm 4, but each call generates a new independent $(X, \xi)$ respecting the setting of Problem 1. Note that $\theta^\star$ can depend arbitrarily on all of these measurement and noise generations under Model 4; for example, we can then reuse these same measurements and noise on future instances of Problem 1. However, our adaptive observation model in Algorithm 4 also assumes the ability to mute specified columns of $X$. This can be a strong assumption, and obtaining a comparable result without this access is an interesting open direction. We give our variable selection algorithm in Algorithm 5, which assumes access to Algorithm 4.
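As a sanity check on the muting mechanism, the following NumPy sketch runs one thresholding round in the style of Lemma 19 and then mutes the selected columns. It is illustrative only: the sizes, threshold, Gaussian design, and unit-magnitude signal are assumptions of this sketch, not the paper's parameter choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k, sigma = 3000, 600, 20, 0.05    # illustrative sizes

theta_star = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
theta_star[support] = 1.0               # every signal coordinate is heavy

def adaptive_observe(n_obs, muted):
    """Sketch of Algorithm 4: a fresh pair (X, xi), with columns in `muted` zeroed."""
    X = rng.normal(scale=1.0 / np.sqrt(n_obs), size=(n_obs, d))
    X[:, sorted(muted)] = 0.0
    xi = rng.normal(scale=sigma, size=n_obs)
    return X, X @ theta_star + xi

# Round 1: threshold the correlations |x_j^T y| to estimate heavy coordinates.
muted = set()
X1, y1 = adaptive_observe(n, muted)
selected = set(np.flatnonzero(np.abs(X1.T @ y1) >= 0.5).tolist())
muted |= selected

# Round 2: with the found coordinates muted, their interference is removed,
# so the residual correlations drop to the noise floor.
X2, y2 = adaptive_observe(n, muted)
print(len(selected), np.linalg.norm(X2.T @ y2, np.inf))
```

Muting realizes the residual linear model used in the analysis of Theorem 7: once a coordinate enters the estimated support, it no longer contributes interference to later rounds.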
Algorithm 5: AdaptiveObservationSparseRecovery($k$, $n$, $d$, $N$, $R$, $r_2$, $r_\infty$, $\delta$)
1 $X^{(0)}, y^{(0)} \leftarrow$ AdaptiveObserve($\frac{n}{3}$, $\emptyset$)
2 $\hat\theta_0 \leftarrow$ IHT($X^{(0)}, y^{(0)}, k, R, r_2$)
3 $T_0 \leftarrow \textup{supp}(\hat\theta_0)$
4 for $i \in [N]$ do
5   $X^{(i)}, y^{(i)} \leftarrow$ AdaptiveObserve($\frac{n}{3N}$, $T_{i-1}$)
6   $T_i \leftarrow T_{i-1} \cup \{j : |x_j^{(i)\top}y^{(i)}| \ge r_\infty\}$
7 end
8 $S \leftarrow T_N$
9 $X^{(N+1)}, y^{(N+1)} \leftarrow$ AdaptiveObserve($\frac{n}{3}$, $S^c$)
10 $\theta \leftarrow H_k\left(\left[X^{(N+1)\top}X^{(N+1)}\right]^{-1}_{S\times S}X^{(N+1)\top}_{S:}y^{(N+1)}\right)$
11 return $\theta$

The algorithm consists of three phases.

1. In the first phase, we run IHT to obtain a warm start and finer control on $\|\theta^\star\|_2$.

2. In the second phase, we perform a sequence of adaptive queries (leveraging our access to Algorithm 4) to identify a bounded-size superset of $\textup{supp}(\theta^\star)$.

3. In the third phase, once the support is recovered, we run simple OLS on the union of all estimated supports to obtain the desired $\ell_\infty$ estimation guarantee.

The adaptive querying phase is the core of the algorithm and is built directly upon Lemma 19. At a high level, this phase maintains a growing set $T_i$ of coordinates identified as heavy up to round $i$. In subsequent rounds, the columns of the design matrix indexed by $T_i$ are muted, so that their contribution no longer interferes with future measurements. Lemma 19 ensures two key properties at each round: (1) false discovery control, which guarantees that the number of incorrectly identified coordinates is proportional to the number of remaining true coordinates, and hence that the final estimated support does not grow excessively; and (2) $\ell_2$-energy contraction, which guarantees that the remaining signal energy decreases geometrically across rounds. We use these two properties to show that the algorithm finds all signal coordinates in $O(\log k)$ adaptive rounds.

Theorem 7. Let $\delta \in (0, \frac{1}{2})$, $n = \Omega(k\log(k)\log(\frac{d}{\delta}))$ for an appropriate constant, and $k \in [d]$.
Under Models 3 and 4, if Algorithm 5 is called with parameters $R \ge \|\theta^\star\|_2$, $r_2 < \sigma$ where
$$\sigma := \max_{i\in\{0\}\cup[N+1]}\left\|X^{(i)\top}\xi^{(i)}\right\|_\infty,\quad r_\infty \ge \frac{C\sigma}{2},\quad N = \Omega(\log k),$$
then in the setting of Problem 1, with probability $\ge 1 - \delta$, the output of Algorithm 5 satisfies
$$\textup{supp}(\theta) = \textup{supp}(\theta^\star),\quad \|\theta - \theta^\star\|_\infty = O(\sigma).$$

Proof. By Proposition 1, for large enough $n$, with probability $\ge 1 - \frac{\delta}{4}$, $X^{(0)}, X^{(1)}, \ldots, X^{(N)}$ all satisfy $(0.1, O(k))$-RIP, and $X^{(N+1)}$ satisfies $(0.1, O(k\log k))$-RIP, for appropriate constants. We condition on these events henceforth, and denote $S^\star := \textup{supp}(\theta^\star)$.

Phase I: Warm start. By Lemma 4, with probability at least $1 - \frac{\delta}{5}$, our IHT estimator $\hat\theta_0$ satisfies
$$\left\|\hat\theta_0 - \theta^\star\right\|_2 \le 10\sqrt{k}\sigma.$$
Define the residual support $S_0 = S^\star \setminus T_0$. Since $\hat\theta_0$ vanishes outside $T_0$,
$$\left\|\theta^\star_{S_0}\right\|_2 \le \left\|\hat\theta_0 - \theta^\star\right\|_2 \le 10\sqrt{k}\sigma.$$
This warm start ensures that when entering the adaptive Phase II, the truncated signal $\theta^\star_{S_0}$ satisfies the $\ell_2$-norm condition required by Lemma 19.

Phase II: Adaptive support identification. For each round $i \ge 1$, let $S_i = S^\star \setminus T_i$. By construction of the algorithm, all columns $X_{:T_i}$ are muted in future rounds. Therefore, the observation $(X^{(i)}, y^{(i)})$ is equivalent to the linear model
$$y^{(i)} = X^{(i)}\theta^{(i)} + \xi^{(i)},\quad \theta^{(i)} = \theta^\star_{S_{i-1}}.$$
We apply Lemma 19 to each round. Since $\min_{j\in S^\star}|\theta^\star_j| \ge C\sigma$ under Problem 1, and $\|\theta^\star_{S_i}\|_2 \le \|\theta^\star_{S_0}\|_2 \le 10\sqrt{k}\sigma$, the lemma is applicable at every iteration. Note that the conclusion still clearly holds if the truncation threshold is set larger than $\frac{C\sigma}{2}$, as both conclusions are monotone in the threshold. Then, Lemma 19 yields two uniform guarantees with probability $1 - \frac{\delta}{4N}$ per round.

1. (False discovery control.) The number of false inclusions at round $i$ is at most $k$.

2. ($\ell_2$-energy contraction.) $\left\|\theta^\star_{S_{i+1}}\right\|_2 \le 0.95\left\|\theta^\star_{S_i}\right\|_2$.
Here, we note that the constant in the false discovery control property was selected by taking $C \ge 80$ and $C' = 10$ in (22). By the assumption in Problem 1, we obtain
$$|S_i|\cdot C^2\sigma^2 \le \left\|\theta^\star_{S_i}\right\|_2^2 \le 0.95^{2i}\cdot 100k\sigma^2,$$
which implies a geometric decay of the support size:
$$|S_i| \le 0.95^{2i}\cdot\frac{100k}{C^2}.$$
Thus, after $N = O(\log k)$ rounds, we have $|S_N| < 1$, i.e., $S_N = \emptyset$ and all true coordinates have been identified. Moreover, the total size of the learned support $S := T_N$ is at most $O(k\log k)$.

Phase III: OLS solution. Once a superset $S$ of the exact support $S^\star$ is recovered, we run ordinary least squares on the identified support and then truncate. Denote $(X, \xi) := (X^{(N+1)}, \xi^{(N+1)})$ for this part of the proof. We obtain
$$\theta' := \left[X^\top X\right]^{-1}_{S\times S}X_{S:}^\top y = \theta^\star + \left[X^\top X\right]^{-1}_{S\times S}X_{S:}^\top\xi.$$
By Lemma 10, the top-$k$ truncation only affects the $\ell_\infty$ error by a factor of $2$:
$$\|\theta - \theta^\star\|_\infty = \left\|H_k(\theta') - \theta^\star\right\|_\infty \le 2\left\|\theta' - \theta^\star\right\|_\infty = 2\left\|\left[X^\top X\right]^{-1}_{S\times S}X_{S:}^\top\xi\right\|_\infty = O(\sigma),$$
with probability $\ge 1 - \frac{\delta}{4}$, where the last bound applied Corollary 2 and leveraged our independence assumption. Taking a union bound over all failure events concludes the proof.

Acknowledgements

YZ and KT would like to thank Eric Price for helpful conversations on the $\ell_\infty$ sparse recovery literature and the partially-adaptive model in Section 5.

References

[ABK17] Jayadev Acharya, Arnab Bhattacharyya, and Pritish Kamath. Improved bounds for universal one-bit compressive sensing. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 2353–2357. IEEE, 2017.

[Aky23] Berkay Akyapı. Machine learning and feature selection: Applications in economics and climate change. Environmental Data Science, 2:e47, 2023.

[BBKS24] Jarosław Błasiok, Rares-Darius Buhai, Pravesh K Kothari, and David Steurer. Semirandom planted clique and the restricted isometry property.
In 2024 IEEE 65th Annual Symposium on Foundations of Computer Science (FOCS), pages 959–969. IEEE, 2024.

[BC15] Rina Foygel Barber and Emmanuel J Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5), 2015.

[BCH14] Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. High-dimensional methods and inference on structural and treatment effects. Journal of Economic Perspectives, 28(2):29–50, 2014.

[BD09] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.

[BIR08] Radu Berinde, Piotr Indyk, and Milan Ruzic. Practical near-optimal sparse recovery in the l1 norm. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pages 198–205. IEEE, 2008.

[CD13] Emmanuel J. Candès and Mark A. Davenport. How well can we estimate a sparse vector?, 2013.

[CDS01] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.

[CRT06] Emmanuel J Candès, Justin Romberg, and Terence Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.

[CT05] Emmanuel Candes and Terence Tao. Decoding by linear programming, 2005.

[CT06] Emmanuel Candes and Terence Tao. Near optimal signal recovery from random projections: Universal encoding strategies?, 2006.

[CT07] Emmanuel Candes and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, pages 2313–2351, 2007.

[CT20] Mohammad Ziaul Islam Chowdhury and Tanvir C Turin. Variable selection strategies and its importance in clinical prediction modelling. Family Medicine and Community Health, 8(1):e000262, 2020.

[CW11] T. Tony Cai and Lie Wang.
Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Transactions on Information Theory, 57(7):4680–4688, 2011.

[DDT+08] Marco F Duarte, Mark A Davenport, Dharmpal Takhar, Jason N Laska, Ting Sun, Kevin F Kelly, and Richard G Baraniuk. Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):83–91, 2008.

[DE11] Marco F Duarte and Yonina C Eldar. Structured compressed sensing: From theory to applications. IEEE Transactions on Signal Processing, 59(9):4053–4085, 2011.

[DTDS12] David L. Donoho, Yaakov Tsaig, Iddo Drori, and Jean-Luc Starck. Sparse solution of underdetermined systems of linear equations by stagewise orthogonal matching pursuit. IEEE Transactions on Information Theory, 58(2):1094–1121, 2012.

[FL01] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

[FR13] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Birkhäuser, 2013.

[FRG09] Alyson K Fletcher, Sundeep Rangan, and Vivek K Goyal. Necessary and sufficient conditions for sparsity pattern recovery. IEEE Transactions on Information Theory, 55(12):5758–5772, 2009.

[GJP20] Graham M Gibson, Steven D Johnson, and Miles J Padgett. Single-pixel imaging 12 years on: a review. Optics Express, 28(19):28190–28208, 2020.

[GS15] Christian G Graff and Emil Y Sidky. Compressive sensing in medical imaging. Applied Optics, 54(8):C23–C44, 2015.

[GZ17] David Gamarnik and Ilias Zadik. High-dimensional regression with binary coefficients. Estimating squared error and a phase transition. arXiv preprint arXiv:1701.04455, 2017.

[HJLL17] Jian Huang, Yuling Jiao, Yanyan Liu, and Xiliang Lu. A constructive approach to high-dimensional regression, 2017.

[HLY13] Zhu Han, Husheng Li, and Wotao Yin.
Compressive Sensing for Wireless Networks. Cambridge University Press, 2013.

[IR08] Piotr Indyk and Milan Ruzic. Near-optimal sparse recovery in the l1 norm. In 2008 49th Annual IEEE Symposium on Foundations of Computer Science, pages 199–207. IEEE, 2008.

[JTD11] Prateek Jain, Ambuj Tewari, and Inderjit Dhillon. Orthogonal matching pursuit with replacement. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.

[JXHC09] Sina Jafarpour, Weiyu Xu, Babak Hassibi, and Robert Calderbank. Efficient and robust compressed sensing using optimized expander graphs. IEEE Transactions on Information Theory, 55(9):4299–4308, 2009.

[KSTZ25] Symantak Kumar, Purnamrita Sarkar, Kevin Tian, and Yusong Zhu. Spike-and-slab posterior sampling in high dimensions. In The Thirty Eighth Annual Conference on Learning Theory, volume 291 of Proceedings of Machine Learning Research, pages 3407–3462. PMLR, 2025.

[LDP07] Michael Lustig, David Donoho, and John M Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58(6):1182–1195, 2007.

[LM00] Béatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.

[Lou08] Karim Lounici. Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electronic Journal of Statistics, 2, 2008.

[LYP+19] Xiao Li, Dong Yin, Sameer Pawar, Ramtin Pedarsani, and Kannan Ramchandran. Sub-linear time support recovery for compressed sensing using sparse-graph codes. IEEE Transactions on Information Theory, 65(10):6580–6619, 2019.
[MB10] Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 72(4):417–473, 2010.

[MW24] Andrea Montanari and Yuchen Wu. Provably efficient posterior sampling for sparse linear regression via measure decomposition, 2024.

[MZ93] Stéphane G Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

[Nat95] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.

[Nes03] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume I. 2003.

[NT08] D. Needell and J. A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples, 2008.

[NV08] Deanna Needell and Roman Vershynin. Greedy signal recovery and uncertainty principles. In Computational Imaging VI, volume 6814, pages 139–150. SPIE, 2008.

[Pri21] Eric Price. Sparse recovery. In Tim Roughgarden, editor, Beyond the Worst-Case Analysis of Algorithms, pages 140–164. Cambridge University Press, 2021.

[PRK93] Yagyensh Chandra Pati, Ramin Rezaiifar, and Perinkulam Sambamurthy Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of 27th Asilomar Conference on Signals, Systems and Computers, pages 40–44. IEEE, 1993.

[SBB06] Shriram Sarvotham, Dror Baron, and Richard G Baraniuk. Sudocodes–fast measurement and reconstruction of sparse signals. In 2006 IEEE International Symposium on Information Theory, pages 2804–2808. IEEE, 2006.

[SC19] Jonathan Scarlett and Volkan Cevher. An introductory guide to Fano's inequality with applications in statistical estimation. arXiv preprint arXiv:1901.00555, 2019.

[Tib96] Robert Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.

[Ver18] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

[Wai06] Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity, 2006.

[Wai09] Martin J Wainwright. Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Transactions on Information Theory, 55(12):5728–5741, 2009.

[Wai19] Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

[Wel74] Lloyd Welch. Lower bounds on the maximum cross correlation of signals (corresp.). IEEE Transactions on Information Theory, 20(3):397–399, 1974.

[WV12] Yihong Wu and Sergio Verdú. Optimal phase transitions in compressed sensing. IEEE Transactions on Information Theory, 58(10):6241–6263, 2012.

[XH07] Weiyu Xu and Babak Hassibi. Efficient compressive sensing with deterministic guarantees using expander graphs. In 2007 IEEE Information Theory Workshop, pages 414–419. IEEE, 2007.

[YSY+08] Okito Yamashita, Masa-aki Sato, Taku Yoshioka, Frank Tong, and Yukiyasu Kamitani. Sparse estimation automatically selects voxels relevant for the decoding of fMRI activity patterns. NeuroImage, 42(4):1414–1429, 2008.

[YWJ16] Yun Yang, Martin J. Wainwright, and Michael I. Jordan. On the computational complexity of high-dimensional Bayesian variable selection. The Annals of Statistics, pages 2497–2532, 2016.

[YZ10] Fei Ye and Cun-Hui Zhang. Rate minimaxity of the lasso and Dantzig selector for the lq loss in lr balls. The Journal of Machine Learning Research, 11:3519–3540, 2010.

[Zha11] Tong Zhang. Sparse recovery with orthogonal matching pursuit under RIP.
IEEE Transactions on Information Theory, 57(9):6215–6221, 2011.

[ZY06] Peng Zhao and Bin Yu. On model selection consistency of Lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.

A Discussion of Error Metrics

While our analysis naturally leads to bounds in terms of the quantity $\|X^\top\xi\|_\infty$, there are other common ways to parameterize the error of algorithms for Problem 2. These include
$$\left\|\left[X^\top X\right]^{-1}_{S\times S}X_{S:}^\top\xi\right\|_\infty \text{ for } S := \textup{supp}(\theta^\star), \quad (23)$$
used by, e.g., [Wai19], and
$$\frac{\|\xi\|_2}{\sqrt{n}},\quad \|\xi\|_\infty,$$
as sometimes seen in works from the heavy hitters literature. In this section, we show that in the oblivious setting of Model 1, these $\ell_\infty$-type error metrics are equivalent up to constant factors. In contrast, under the adversarial setting of Model 2, no such guarantee is achievable for the other error metrics in general. Lemma 20 establishes conversions between these different error metrics in the oblivious setting.

Lemma 20. Let $\delta \in (0, \frac{1}{2})$. In the context of Problem 2 under Models 1 and 3, if $n = \Omega(k\log\frac{d}{\delta})$ and $k = \Omega(\log\frac{1}{\delta})$ for appropriate constants, then with probability $\ge 1 - \delta$, we have for $S := \textup{supp}(\theta^\star)$,
$$\left\|\left[X^\top X\right]^{-1}_{S\times S}X_{S:}^\top\xi\right\|_\infty \overset{(a)}{=} \Theta\left(\left\|X_{S:}^\top\xi\right\|_\infty\right) \overset{(b)}{=} O\left(\|\xi\|_2\sqrt{\frac{\log\frac{k}{\delta}}{n}}\right) \overset{(c)}{=} O\left(\|\xi\|_\infty\sqrt{\log\frac{k}{\delta}}\right),$$
$$\left\|X_{S:}^\top\xi\right\|_\infty \overset{(d)}{=} \Omega\left(\|\xi\|_2\sqrt{\frac{1}{n}}\right) \overset{(e)}{=} \Omega\left(\|\xi\|_\infty\sqrt{\frac{1}{n}}\right).$$

Proof. The identity $(a)$ follows directly from Lemma 7, and the bounds $(c)$, $(e)$ hold for all $\xi \in \mathbb{R}^n$. It remains to establish $(b)$ and $(d)$. For any $i \in S$, conditioned on $\xi$, we have $x_i^\top\xi \sim \textup{subG}(0, \frac{C}{n}\|\xi\|_2^2)$ with variance $\frac{1}{n}\|\xi\|_2^2$. The maximum of $k$ such variables can be bounded by Lemma 1 and a union bound, yielding $(b)$. Finally, there is a constant probability that each such draw $x_i^\top\xi$ obeys the bound in $(d)$, by the same proof strategy as used in establishing (21).
Under the stated parameter range of $k$, the probability that $(d)$ fails under $k$ independent draws is at most $\delta$.

We mention that both $(b)$ and $(d)$ can be tight: if $x_i$ is entrywise a (scaled) Gaussian, then $(b)$ is tight, and if $x_i$ is entrywise a (scaled) Rademacher, then for $1$-sparse $\xi$, $(d)$ is tight. Regardless, Lemma 21 establishes a complementary impossibility result: in the adaptive setting, no algorithm can guarantee an $\ell_\infty$ estimation error on the order of any of the other candidate metrics, even taking the more conservative of the pairs $(b)$, $(d)$ and $(a)$, $(e)$.

Lemma 21. In the context of Problem 2 under Models 2 and 3, denoting $S := \textup{supp}(\theta^\star)$, there is no algorithm outputting an estimate $\theta \in \mathbb{R}^d$ that can guarantee, with probability $\ge \frac{2}{3}$,
$$\|\theta - \theta^\star\|_\infty = O(\|\xi\|_\infty) \text{ or } O\left(\|\xi\|_2\sqrt{\frac{\log k}{n}}\right) \text{ or } O\left(\left\|\left[X^\top X\right]^{-1}_{S\times S}X_{S:}^\top\xi\right\|_\infty\right).$$

Proof. Consider the two signal-noise pairs $(\theta^\star_1, \xi_1) = (0_d, Xe_i)$ and $(\theta^\star_2, \xi_2) = (e_i, 0_n)$, for any $i \in [d]$. For the third error metric, we take the $S$ in the first pair to be any set not containing $i$, and the $S$ in the second pair to be $\{i\}$. Both induce the same observations:
$$y = X\theta^\star_1 + \xi_1 = Xe_i = X\theta^\star_2 + \xi_2,$$
and hence they are statistically indistinguishable (i.e., no algorithm can beat random guessing). The parameters satisfy $\|\theta^\star_1 - \theta^\star_2\|_\infty = 1$. On the other hand, by Lemma 1, standard $\chi^2$ concentration bounds (e.g., Lemma 1, [LM00]), and Corollary 2, respectively, with probability at least $\frac{2}{3}$,
$$\|\xi_1\|_\infty = \|Xe_i\|_\infty = O\left(\sqrt{\frac{\log n}{n}}\right),\quad \|\xi_1\|_2 = O(1),\quad \left\|\left(X^\top X\right)^{-1}_{S\times S}X_{S:}^\top\xi_1\right\|_\infty = O\left(\sqrt{\frac{\log d}{n}}\right),$$
and trivially $\xi_2 = 0_n$ satisfies the same bounds. Note that in our application of Corollary 2, we used that $v \leftarrow Xe_i = x_i$ is independent from $X_{:S}$ for $i \notin S$. Whenever the above bounds hold, any of these error metrics allows for exact recovery of $\theta^\star_1$ or $\theta^\star_2$ when $n$ is large enough.
This contradicts the maximum achievable success probability of $\frac{2}{3}\cdot\frac{1}{2} + \frac{1}{3}$.

B Tight Error Guarantee for the Gaussian Noise Model

In this section, we give an information-theoretic lower bound for $\ell_\infty$ sparse recovery under a Gaussian noise model. We start with a simple lemma to instantiate the general results of Sections 3 and 5.

Lemma 22. For any $\delta \in (0, \frac{1}{2})$, if $X \in \mathbb{R}^{n\times d}$ satisfies $(\frac{1}{2}, 1)$-RIP, then for $\xi \sim \mathcal{N}(0, \sigma^2 I_n)$, with probability at least $1 - \delta$,
$$\left\|X^\top\xi\right\|_\infty \le O\left(\sigma\sqrt{\log\frac{d}{\delta}}\right).$$

Proof. For any $i \in [d]$, we have $x_i^\top\xi \sim \mathcal{N}(0, \sigma^2\|x_i\|_2^2)$. Since $\|x_i\|_2^2 \le \frac{3}{2}$, we have $\mathbb{P}[|x_i^\top\xi| \ge t] \le 2\exp(-\frac{t^2}{3\sigma^2})$. Taking $t = O(\sigma\sqrt{\log(d/\delta)})$ and a union bound over $i \in [d]$ gives the conclusion.

We next quantify the intrinsic difficulty of $\ell_\infty$ sparse recovery using two standard notions of risk. The minimax risk captures the worst-case expected error over the $k$-sparse parameter class, corresponding to the partially-adaptive Model 4, while the Bayes risk measures the average error under a prior on $\theta^\star$, corresponding to the oblivious Model 1. A basic but useful fact is that the Bayes risk under any prior lower bounds the minimax risk. Our proof roadmap mainly follows [SC19].

Definition 5. Let $\Theta$ be a parameter space such that every $\theta \in \Theta$ indexes a data distribution $\mathcal{P}_\theta$. Suppose we want to estimate an unknown $\theta \in \Theta$ from data $Z \sim \mathcal{P}_\theta$. We define the minimax risk as
$$\mathcal{M}_{\textup{minimax}}(\Theta, Z) = \inf_{\hat\theta}\sup_{\theta^\star\in\Theta}\mathbb{E}_{Z\sim\mathcal{P}_{\theta^\star}}\left[\left\|\hat\theta(Z) - \theta^\star\right\|_\infty\right],$$
where $\hat\theta$ is any estimator that is a function^6 of the observations $Z$. Similarly, if $\theta$ is sampled from some prior $\pi$ over $\Theta$, we define the Bayes risk as
$$\mathcal{M}_{\textup{Bayes}}(\pi, \Theta, Z) = \inf_{\hat\theta}\mathbb{E}_{\theta^\star\sim\pi,\;Z\sim\mathcal{P}_{\theta^\star}}\left[\left\|\hat\theta(Z) - \theta^\star\right\|_\infty\right].$$

^6 We restrict to deterministic $\hat\theta$ without loss of generality: otherwise, outputting $\mathbb{E}[\hat\theta]$ improves the risk, since $\|\cdot\|_\infty$ is convex.

Our lower bound follows a standard information-theoretic route.
We construct a finite, well-separated subset of parameters and reduce estimation to identifying the correct candidate parameter. Applying Fano's inequality relates the achievable estimation accuracy to the mutual information between the observations and the underlying index, yielding a quantitative lower bound on the Bayes (and therefore, minimax) risk.

Lemma 23 (Fano's inequality). Let $V$ and $\hat V$ be discrete random variables on a common set $\mathcal{V}$, where $V \sim \textup{unif.}\ \mathcal{V}$. Then letting $I$ denote the mutual information,
$$\mathbb{P}\left[\hat V \neq V\right] \ge 1 - \frac{I(V; \hat V) + \log 2}{\log|\mathcal{V}|}.$$

The following calculation helps simplify applications of Lemma 23.

Lemma 24 (Lemma 4, [SC19]). Let $P_V$, $P_y$, and $P_{y|V}$ be the marginal and conditional distributions with respect to a pair of jointly-distributed random variables $(V, y)$. Then for any auxiliary distribution $Q_y$, we have
$$I(V; y) = \sum_{v\in\mathcal{V}} P_V(v)\,D_{\textup{KL}}\left(P_{y|V}(\cdot\mid v)\ \big\|\ P_y\right) \le \sum_{v\in\mathcal{V}} P_V(v)\,D_{\textup{KL}}\left(P_{y|V}(\cdot\mid v)\ \big\|\ Q_y\right).$$

Lemma 25 (Risk lower bound via exact recovery). Under Model 1, fix $\epsilon > 0$, and let $\Theta_{\mathcal{V}} = \{\theta_v\}_{v\in\mathcal{V}}$ be a finite subset of $\Theta$ such that
$$\|\theta_v - \theta_{v'}\|_\infty \ge \epsilon, \text{ for all } (v, v') \in \mathcal{V}^2,\ v \neq v'. \quad (24)$$
Then if $\pi$ is uniform over $\Theta_{\mathcal{V}}$, and $y$ are observations generated from a distribution indexed by $\theta_V$,
$$\mathcal{M}_{\textup{minimax}}(\Theta, y) \ge \mathcal{M}_{\textup{Bayes}}(\pi, \Theta, y) \ge \frac{\epsilon}{2}\left(1 - \frac{I(V; y) + \log 2}{\log|\mathcal{V}|}\right).$$

Proof. By Markov's inequality, we have
$$\mathbb{E}\left[\left\|\hat\theta - \theta^\star\right\|_\infty\right] \ge t\cdot\mathbb{P}\left[\left\|\hat\theta - \theta^\star\right\|_\infty \ge t\right].$$
For any estimator $\hat\theta(y)$, let $\hat V = \textup{argmin}_{v\in\mathcal{V}}\|\hat\theta - \theta_v\|_\infty$. Using the triangle inequality and our assumption (24), if $\|\hat\theta - \theta_V\|_\infty < \frac{\epsilon}{2}$, then $\hat V = V$; hence
$$\mathbb{P}\left[\left\|\hat\theta - \theta_V\right\|_\infty \ge \frac{\epsilon}{2}\right] \ge \mathbb{P}\left[\hat V \neq V\right].$$
Thus, taking $t \leftarrow \frac{\epsilon}{2}$ and applying Lemma 23 concludes the proof.

Lemma 26. Under Model 1, let $d = \Omega(k)$, $\sigma > 0$, $\epsilon := \frac{\sigma}{2}\sqrt{\log\frac{d}{k}}$, and define
$$\Omega_k = \left\{\theta^\star \in \mathbb{R}^d : |\textup{supp}(\theta^\star)| \le k\right\},\quad \mathcal{V} = \left\{\epsilon v : v \in \{-1, 0, 1\}^d,\ \|v\|_1 = k\right\}.$$
Let π be uniform over Θ_𝒱, where θ_v := v. If X satisfies (1/2, 1)-RIP and ξ ∼ N(0, σ²I_n), then

M_minimax(Ω_k, y) ≥ M_Bayes(π, Ω_k, y) = Ω(σ√(log(d/k))).

Proof. Our construction of 𝒱 trivially satisfies the bound (24), so by Lemma 25,

M_Bayes(π, Ω_k, y) ≥ (ε/2)(1 − (I(V; y) + log 2)/log |𝒱|). (25)

Similarly, by Lemma 24,

I(V; y) ≤ Σ_{v ∈ 𝒱} P_V(v) D_KL(P_{y|V}(· | v) ∥ Q_y).

Taking Q_y = N(0, σ²I_n), we have

I(V; y) ≤ (1/|𝒱|) Σ_{v ∈ 𝒱} ∥Xθ_v∥₂²/(2σ²) = (ε²/(2σ²)) E_u[∥Xu∥₂²] = (ε²/(2σ²)) Tr(X Cov[u] X^⊤),

where u := θ_V/ε ∈ {−1, 0, 1}^d denotes the random sign pattern. It is straightforward by a symmetry argument that Cov[u] = (k/d)I_d, and thus

I(V; y) ≤ (ε²k/(2dσ²)) ∥X∥²_F.

Since X satisfies (1/2, 1)-RIP, ∥x_i∥₂² ≤ 2 for all i ∈ [d], which implies ∥X∥²_F ≤ 2d. We then obtain

I(V; y) ≤ ε²k/σ². (26)

At this point, we can conclude the result by plugging (26) back into (25):

(I(V; y) + log 2)/log |𝒱| ≤ (k log(d/k))/(2 log |𝒱|) ≤ 1/2,

where we used that the cardinality of 𝒱 satisfies |𝒱| = (d choose k) · 2^k ≥ 2^k (d/k)^k.

Lemmas 22 and 26 show that the minimax risk of ℓ∞ sparse recovery, given an RIP matrix X and Gaussian noise ξ, even in the limit of infinite observations, is Ω(∥X^⊤ξ∥_∞) with high probability. This is matched by all of our upper bounds up to (at most) a sublogarithmic overhead.

C Application to Spike-and-Slab Posterior Sampling

In this appendix, we briefly discuss one application of our upper bound results for ℓ∞ sparse recovery. Recently, [KSTZ25] gave an algorithm for Bayesian sparse linear regression in the following model.

Model 5 (Spike-and-slab posterior sampling).
In the spike-and-slab posterior sampling problem, X ∈ ℝ^{n×d} is drawn under Model 3, ξ ∼ N(0, σ²I_n) independently, and θ⋆ ∈ ℝ^d is independently drawn from the following spike-and-slab prior, for q ∈ [0, 1]^d with ∥q∥₁ = k:

π := ⊗_{i ∈ [d]} ((1 − q_i)δ_0 + q_i N(0, 1)),

where δ_0 is a Dirac density at 0. For some δ ∈ (0, 1), the goal is to produce a sample from the posterior π(· | X, y) within total variation distance δ, where y = Xθ⋆ + ξ are the observations.

Under Model 5, the parameter k can be viewed as an expected sparsity level for the signal θ⋆ ∼ π, because coordinate i ∈ [d] is only nonzero with probability q_i. The key challenge in solving Model 5 is that for noise levels σ ≈ 1, there are many coordinates that could plausibly either be part of the signal θ⋆, or masked by the random noise ξ. To model this uncertainty, the posterior density π(· | X, y) places nontrivial mass on d^{Ω(k)} candidate supports, and therefore the algorithm must successfully sample from this exponentially-sized candidate set.

Prior works by [YWJ16, MW24] respectively designed algorithms for restricted variants of Model 5 that succeeded when σ is relatively large or small, and when the number of observations satisfies n ≳ d. The algorithm by [KSTZ25] lifts these restrictions, and solves the sampling problem in Model 5 for any σ > 0 using n = poly(k, log(d/δ)) samples. In particular, [KSTZ25] gives two algorithms: the first is based on ℓ₂ sparse recovery, and thus runs in nearly-linear time (e.g., using Lemma 4), but uses n ≈ k⁵ observations to compensate for the weaker recovery guarantee. The second algorithm leverages ℓ∞ sparse recovery, and uses an improved n ≈ k³, but its runtime was previously based on the LASSO (see Theorem 1, [KSTZ25], which claims a runtime of ≈ n²d^{1.5}).
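The spike-and-slab prior in Model 5 is straightforward to sample from coordinatewise, which also illustrates why ∥q∥₁ = k acts as an expected sparsity level. The following sketch uses uniform inclusion probabilities q_i = k/d purely as an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 500, 10

# Hypothetical inclusion probabilities q with ||q||_1 = k; the uniform
# choice q_i = k/d is an assumption for illustration.
q = np.full(d, k / d)

def sample_prior(q, rng):
    # Coordinate i is an exact zero (the "spike") w.p. 1 - q_i,
    # and a standard Gaussian draw (the "slab") w.p. q_i.
    slab = rng.random(len(q)) < q
    theta = np.zeros(len(q))
    theta[slab] = rng.standard_normal(slab.sum())
    return theta

# The expected support size is sum_i q_i = ||q||_1 = k.
support_sizes = [np.count_nonzero(sample_prior(q, rng)) for _ in range(2000)]
print(f"mean support size = {np.mean(support_sizes):.2f}, target k = {k}")
```

Note this only samples the prior π; the algorithmic difficulty in Model 5 lies entirely in sampling the posterior π(· | X, y), whose mass spreads over exponentially many candidate supports.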
Substituting our Theorem 3 or 4 in place of the LASSO (Proposition 3, [KSTZ25]) immediately gives the following improved runtime for spike-and-slab posterior sampling.

Corollary 6. In the setting of Model 5, if n = Ω(k³ polylog(d/δ)), there is an algorithm that returns θ ∼ π′ such that D_TV(π′, π(· | X, y)) ≤ δ with probability ≥ 1 − δ over the randomness of (X, θ⋆, ξ), which runs in time

O(nd log(log(1/δ)/min(1, σ))).

Proof. The runtime of, e.g., Theorem 4 meets the described bound, where we may take r = Ω(σ) and R² = O(log(1/δ)) with probability ≥ 1 − δ under our modeling assumptions. The remainder of the proof follows identically to the proof of Theorem 1 in [KSTZ25].
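To give a feel for the magnitude of the improvement in Corollary 6, one can plug sample parameter values into the two runtime expressions. The constants below are placeholders chosen for illustration, not values from the paper:

```python
import math

# Illustrative comparison of the prior LASSO-based runtime ~ n^2 d^{1.5}
# from [KSTZ25] against the Corollary 6 bound
# ~ n d log(log(1/delta) / min(1, sigma)).
# All parameter values here are placeholders for illustration only.
k, d, delta, sigma = 20, 10**6, 1e-3, 0.5
n = k**3 * math.ceil(math.log(d / delta))  # n ~ k^3 polylog(d/delta)

lasso_time = n**2 * d**1.5
ours_time = n * d * math.log(math.log(1 / delta) / min(1.0, sigma))
print(f"LASSO-based: {lasso_time:.2e}, Corollary 6 bound: {ours_time:.2e}")
```

Even at this moderate scale, the gap is several orders of magnitude, driven by the extra factors of n and √d in the LASSO-based bound.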