Statistical Query Lower Bounds for Smoothed Agnostic Learning

We study the complexity of smoothed agnostic learning, recently introduced by [CKK+24], in which the learner competes with the best classifier in a target class under slight Gaussian perturbations of the inputs. Specifically, we focus on the prototypical task of agnostically learning halfspaces under subgaussian distributions in the smoothed model. The best known upper bound for this problem relies on $L_1$-polynomial regression and has complexity $d^{\tilde{O}(1/\sigma^2)\log(1/\epsilon)}$, where $\sigma$ is the smoothing parameter and $\epsilon$ is the excess error. Our main result is a Statistical Query (SQ) lower bound providing formal evidence that this upper bound is close to best possible. In more detail, we show that (even for Gaussian marginals) any SQ algorithm for smoothed agnostic learning of halfspaces requires complexity $d^{\Omega(1/\sigma^2 + \log(1/\epsilon))}$. This is the first non-trivial lower bound on the complexity of this task and it nearly matches the known upper bound. Roughly speaking, we show that applying $L_1$-polynomial regression to a smoothed version of the target function is essentially best possible. Our techniques involve finding a moment-matching hard distribution by way of linear programming duality. This dual program corresponds exactly to finding a low-degree approximating polynomial to the smoothed version of the target function (which turns out to be the same condition required for $L_1$-polynomial regression to work). Our explicit SQ lower bound then comes from proving lower bounds on this approximation degree for the class of halfspaces.

Authors: Ilias Diakonikolas, Daniel M. Kane

Ilias Diakonikolas (University of Wisconsin-Madison, ilias@cs.wisc.edu)* and Daniel M. Kane (University of California, San Diego, dakane@cs.ucsd.edu)†

February 25, 2026

* Supported by NSF Medium Award CCF-2107079 and an H.I. Romnes Faculty Fellowship.
† Supported by NSF Medium Award CCF-2107547.

1 Introduction

In the agnostic model [Hau92, KSS94], the learner is given samples from a distribution $D$ on labeled examples in $\mathbb{R}^d \times \{\pm 1\}$, and the goal is to compute a hypothesis that is competitive with the best-fit function in a known concept class $\mathcal{C}$. In more detail, given $n$ i.i.d. samples $(x^{(i)}, y_i)$ drawn from $D$ and a desired excess error $\epsilon > 0$, an agnostic learner for $\mathcal{C}$ outputs a hypothesis $h: \mathbb{R}^d \to \{\pm 1\}$ such that $\Pr_{(x,y)\sim D}[h(x) \neq y] \leq \mathrm{OPT} + \epsilon$, where $\mathrm{OPT} \stackrel{\mathrm{def}}{=} \inf_{f \in \mathcal{C}} \Pr_{(x,y)\sim D}[f(x) \neq y]$. Importantly, in the agnostic model, no assumptions are made on the observed labels. As such, agnostic learning can be viewed as a general framework to study supervised learning in the presence of arbitrary misspecification or (potentially adversarial) label noise. It is worth recalling that the agnostic model generalizes Valiant's (realizable) PAC model [Val84], where the labels are promised to be consistent with an unknown function in the class $\mathcal{C}$ (i.e., $\mathrm{OPT} = 0$).

In the original definition of agnostic learning, no assumptions are made on the marginal distribution of $D$ on $x$ (henceforth denoted by $D_x$). Learning in this distribution-free agnostic model turns out to be computationally challenging.
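To make the quantities above concrete, the following small Python sketch (ours, not from the paper) estimates the 0-1 error of a hypothesis on labeled samples and compares it to the best error over a finite pool of random halfspaces, a crude empirical stand-in for $\mathrm{OPT}$. The data-generating process, noise rate, and pool size are arbitrary illustrative choices.

```python
# A small illustrative sketch (ours, not from the paper) of the agnostic-learning
# quantities defined above: the 0-1 error of a hypothesis h and a crude empirical
# stand-in for OPT obtained by searching a finite pool of random halfspaces.
import numpy as np

rng = np.random.default_rng(5)
d, n = 5, 20_000

# Labeled samples (x, y): a halfspace corrupted by 5% adversarial-style label noise.
w_true = rng.standard_normal(d)
x = rng.standard_normal((n, d))
y = np.sign(x @ w_true)
flip = rng.random(n) < 0.05
y[flip] *= -1

def zero_one_error(w, t=0.0):
    return np.mean(np.sign(x @ w - t) != y)

# Empirical stand-in for OPT: best error over a pool of random candidate halfspaces.
pool = rng.standard_normal((2000, d))
pool /= np.linalg.norm(pool, axis=1, keepdims=True)
opt_hat = min(zero_one_error(w) for w in pool)

h = w_true + 0.3 * rng.standard_normal(d)     # some hypothesis returned by a learner
print(f"err(h) = {zero_one_error(h):.3f},  OPT_hat = {opt_hat:.3f},  excess = {zero_one_error(h) - opt_hat:.3f}")
```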
While the model is statistically tractable for all concept classes with bounded VC dimension (see, e.g., [BEHW89]), prior work established super-polynomial computational lower bounds, even for weak agnostic learning of simple concept classes such as halfspaces [Dan16, DKMR22, Tie23].

One approach to circumvent these computational limits involves making natural assumptions about the underlying distribution $D_x$ on feature vectors. Examples include structured continuous distributions, such as the Gaussian distribution or, more broadly, any log-concave distribution, and discrete distributions, such as the uniform distribution on the Boolean hypercube. While certain computational limitations persist even in this distribution-specific setting (see, e.g., [DKZ20, GGK20, DKPZ21, DKR23]), it turns out that non-trivial positive results are possible. Specifically, a line of work has developed efficient learning algorithms with both exact (see, e.g., [KKMS08, KOS08, DKK+21]) and approximate (see, e.g., [ABL17, DKS18, DKTZ20, DKTZ22]) error guarantees in the distribution-specific setting.

A complementary approach, recently introduced in [CKK+24] and the focus of the current paper, involves relaxing the notion of optimality by working in an appropriate smoothed setting. This approach was inspired by a classical line of work in the smoothed analysis of algorithms [ST04]. For concreteness, we define the smoothed agnostic learning framework below.

Definition 1.1 (Smoothed Agnostic Learning [CKK+24]). Let $\mathcal{C}$ be a Boolean concept class on $\mathbb{R}^d$ and $\mathcal{D}_x$ be a family of distributions on $\mathbb{R}^d$. We say that an algorithm $A$ agnostically learns $\mathcal{C}$ in the smoothed setting if it satisfies the following property. Given an error parameter $\epsilon > 0$, a smoothing parameter $\sigma > 0$, and access to i.i.d. samples from a distribution $D$ on $\mathbb{R}^d \times \{\pm 1\}$ such that $D_x \in \mathcal{D}_x$, $A$ outputs a hypothesis $h: \mathbb{R}^d \to \{\pm 1\}$ such that with high probability it holds that $\Pr_{(x,y)\sim D}[h(x) \neq y] \leq \mathrm{OPT}_\sigma + \epsilon$, where

\[
\mathrm{OPT}_\sigma \stackrel{\mathrm{def}}{=} \inf_{f \in \mathcal{C}} \mathbb{E}_{z \sim \mathcal{N}_d}\Big[\Pr_{(x,y)\sim D}[f(x + \sigma z) \neq y]\Big]. \tag{1}
\]

Some comments are in order. First note that for $\sigma = 0$ one recovers the standard agnostic model. The hope is that when $\sigma > 0$ is sufficiently large, smoothed agnostic learning becomes computationally tractable in settings where vanilla agnostic learning is not. Specifically, the goal is to characterize (in terms of both upper and lower bounds) the computational complexity of smoothed agnostic learning for natural concept classes $\mathcal{C}$ and distributions $\mathcal{D}_x$, as a function of $d$, $1/\epsilon$, and $1/\sigma$.

It turns out that it is possible to develop efficient smoothed agnostic learners for natural concept classes $\mathcal{C}$ with respect to a broad class of distribution families $\mathcal{D}_x$ for which vanilla agnostic learning is computationally hard. The first positive results were given in [CKK+24] (who defined the smoothed framework). The latter work focused on the natural concept class of Multi-Index Models (MIMs)^1 that satisfy an additional mild condition (namely, having bounded Gaussian surface area). The distribution class $\mathcal{D}_x$ consists of all distributions whose univariate projections have subgaussian (or, more generally, strictly sub-exponential) tails. In the following discussion, we focus on the case of subgaussian distributions for concreteness.
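For a fixed halfspace $f(x) = \mathrm{sign}(w \cdot x - t)$, the inner expectation in (1) has a per-sample closed form, since $w \cdot z$ is Gaussian. The following small sketch (ours, not from the paper) averages this closed form over a labeled sample; the data below is an arbitrary illustrative choice, and $\mathrm{OPT}_\sigma$ itself is an infimum over all halfspaces, which this sketch does not attempt to compute.

```python
# A small sketch (ours, not from the paper) of the smoothed error in Definition 1.1:
# for f(x) = sign(w.x - t), E_z[ Pr[f(x + sigma*z) != y] ] per sample equals
# Phi(-y*(w.x - t) / (sigma*||w||)), which we average over labeled examples.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
d, n, sigma = 5, 20_000, 0.3

w_true = rng.standard_normal(d)
x = rng.standard_normal((n, d))
y = np.sign(x @ w_true)
y[rng.random(n) < 0.05] *= -1            # 5% flipped labels

def smoothed_error(w, t=0.0):
    """Average over samples of Pr_z[ sign(w.(x + sigma z) - t) != y ]."""
    margin = y * (x @ w - t)
    return np.mean(norm.cdf(-margin / (sigma * np.linalg.norm(w))))

print(f"smoothed error of the true halfspace: {smoothed_error(w_true):.3f}")
print(f"plain 0-1 error of the true halfspace: {np.mean(np.sign(x @ w_true) != y):.3f}")
```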
In this context, [CKK+24] shows that the $L_1$-polynomial regression algorithm [KKMS08] is a smoothed agnostic learner for $K$-MIMs with Gaussian surface area at most $\Gamma > 0$ with (sample and computational) complexity $d^{\mathrm{poly}(K, \Gamma, 1/\sigma, 1/\epsilon)}$. This upper bound was recently improved in the same context by [KW25], who showed that the same algorithm has complexity $d^{\mathrm{poly}(K, 1/\sigma, \log(1/\epsilon))}$, independent of the surface area bound. More precisely, the exponent of $d$ in the upper bound of [KW25] is $\tilde{O}(K/\sigma^2)\log(1/\epsilon)$.

Motivated by the aforementioned positive results, here we ask whether the known upper bounds for smoothed agnostic learning can be qualitatively improved. Interestingly, all known algorithmic works on the topic (including [CKK+24, KW25] and the work of [KM25] studying a Boolean version of smoothing) essentially rely on the $L_1$ polynomial regression method, the standard agnostic learner in the "non-smoothed" setting. The technical novelty of these works lies in the analysis: they establish appropriate polynomial approximation results that yield their upper bounds. Given this state of affairs, it is natural to ask whether the use of $L_1$ polynomial regression is inherent or whether algorithms with qualitatively better complexity are possible in the smoothed setting. More concretely, focusing on the subgaussian case for single-index models ($K = 1$) and halfspaces in particular, the best known upper bound is $d^{\tilde{O}(1/\sigma^2)\log(1/\epsilon)}$. Is there a $\mathrm{poly}(d, 1/\epsilon, 1/\sigma)$ time algorithm? Or is the exponential dependence on $1/\sigma^2$ and $\log(1/\epsilon)$ inherent? Motivated by this gap in our understanding, we study the following question:

What is the complexity of smoothed agnostic learning for halfspaces under subgaussian distributions?

As our main result, we give formal evidence, in the form of a Statistical Query lower bound, that the complexity of known algorithms for this problem is essentially optimal. Before we formally state our results, we need basic background on the Statistical Query (SQ) Model and the class of Halfspaces.

Statistical Query Model.  Statistical Query (SQ) algorithms are a class of algorithms that are allowed to query expectations of bounded functions of the underlying distribution rather than directly access samples. Formally, an SQ algorithm has access to the following oracle.

Definition 1.2 (STAT Oracle). Let $D$ be a distribution over $\mathbb{R}^d \times \{\pm 1\}$. A statistical query is a function $q: \mathbb{R}^d \times \{\pm 1\} \to [-1, 1]$. We define $\mathrm{STAT}(\tau)$ to be the oracle that, given a query $q(\cdot, \cdot)$, outputs a value $v$ such that $|v - \mathbb{E}_{(x,y)\sim D}[q(x, y)]| \leq \tau$, where $\tau > 0$ is the tolerance of the query.

The SQ model was introduced in [Kea98] as a natural restriction of the PAC model and has been extensively studied in learning theory; see, e.g., [FGR+13, FGV17, Fel17] and references therein. The class of SQ algorithms is fairly broad: a wide range of known algorithmic techniques in machine learning are known to be implementable using SQs (see, e.g., [CKL+06, FGR+13, FGV17]).

^1 A function $f: \mathbb{R}^d \to \{\pm 1\}$ is called a $K$-MIM if there exists a $K$-dimensional subspace $V$ such that for each $x \in \mathbb{R}^d$, $f(x)$ only depends on the projection of $x$ onto $V$.
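The following minimal simulation (ours, not from the paper) of the $\mathrm{STAT}(\tau)$ oracle from Definition 1.2 may help fix ideas: queries are bounded functions $q(x, y)$ and the oracle may return any value within $\tau$ of the true expectation. Here the adversarial slack is modeled by a uniform perturbation of an empirical mean; the class name, example query, and distribution are illustrative only.

```python
# A minimal simulation (ours, not from the paper) of the STAT(tau) oracle from
# Definition 1.2.  Population expectations are approximated by a large fixed
# sample, and the adversary's slack is modeled as a uniform perturbation.
import numpy as np

rng = np.random.default_rng(7)

class StatOracle:
    def __init__(self, xs, ys, tau):
        self.xs, self.ys, self.tau = xs, ys, tau

    def query(self, q):
        vals = q(self.xs, self.ys)
        assert np.all(np.abs(vals) <= 1.0), "statistical queries must map into [-1, 1]"
        estimate = np.mean(vals)
        return estimate + rng.uniform(-self.tau, self.tau)   # adversary's slack

# Example: D has x ~ N(0,1)^2 and y a noisy halfspace label.
xs = rng.standard_normal((100_000, 2))
ys = np.sign(xs[:, 0]) * np.where(rng.random(100_000) < 0.1, -1, 1)
oracle = StatOracle(xs, ys, tau=0.01)

# The correlation query q(x, y) = y * sign(x_1), which takes values in [-1, 1].
print("STAT answer:", oracle.query(lambda x, y: y * np.sign(x[:, 0])))
```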
Halfspaces.  A halfspace is any Boolean-valued function $f: \mathbb{R}^d \to \{\pm 1\}$ of the form $f(x) = \mathrm{sign}(w \cdot x - t)$, for a weight vector $w \in \mathbb{R}^d$ and a threshold $t \in \mathbb{R}$. The algorithmic task of learning halfspaces from labeled examples is one of the most basic and extensively studied problems in machine learning [Ros58, Nov62, MP68, FS97, Vap98, STC00]. Halfspaces are a prototypical family of single-index models. It is known that even weak agnostic learning of halfspaces is computationally hard in the distribution-free setting [Dan16]. Moreover, in the distribution-specific setting where the distribution is Gaussian, achieving error $\mathrm{OPT} + \epsilon$ requires complexity $d^{\Omega(1/\epsilon^2)}$ [DKPZ21, DKR23].

1.1 Our Results and Techniques

We are now ready to state our main result:

Theorem 1.3 (Main Result). Any smoothed agnostic learner for the class $\mathcal{H}$ of halfspaces on $\mathbb{R}^d$ under the Gaussian distribution, with excess error $\epsilon$ and smoothing parameter $\sigma > 0$, requires SQ complexity $d^{\Omega(\log(1/\epsilon) + (\sigma + \epsilon)^{-2})}$. In more detail, any SQ algorithm for this task must either make $2^{d^{\Omega(1)}}$ queries or a query of accuracy better than $d^{-\Omega(\log(1/\epsilon) + (\sigma + \epsilon)^{-2})}$.

Some comments are in order. First note that for $\sigma \geq \epsilon$, our SQ lower bound nearly matches the upper bound of [KW25], which holds for all subgaussian distributions. That is, the special case of the Gaussian distribution seems to capture the hardness of the subgaussian family. In the special case where $\sigma = 0$, one recovers the known (optimal) SQ lower bound of $d^{\Omega(1/\epsilon^2)}$ [DKPZ21] for vanilla agnostic learning under Gaussian marginals. On the other hand, when $0 < \sigma = \Omega(1)$, we obtain an SQ lower bound of $d^{\Omega(\log(1/\epsilon))}$.

The fact that our SQ lower bound for smoothed agnostic learning of halfspaces qualitatively matches the upper bound obtained via $L_1$ polynomial regression is not a coincidence. Roughly speaking, we show the following more general result (Theorem 3.1): for any concept class $\mathcal{C}$ with natural closure properties (including halfspaces and more general MIMs), the degree of $L_1$ polynomial approximation to the smoothed function $T_\sigma f := \mathbb{E}_{z\sim\mathcal{N}}[f(x + \sigma z)]$ (directly related to the definition of the smoothed optimum (1)) over $f \in \mathcal{C}$ determines the SQ complexity of smoothed agnostic learning. Roughly speaking, if the optimal polynomial approximation degree for $\mathcal{C}$ is $m^*$, then the SQ complexity of smoothed agnostic learning $\mathcal{C}$ is $d^{\Omega(m^*)}$, matching the complexity of $L_1$-polynomial regression.

It is natural to ask whether one can show stronger SQ lower bounds for agnostically learning more general multi-index models in the smoothed setting. Interestingly, we show that this is not possible when the marginal distribution is promised to be Gaussian (as in Theorem 1.3). Specifically, we show in Proposition 4.1 (given in Section 4) that $L_1$-polynomial regression is a smoothed agnostic learner for any distribution $(X, y)$ with $X$ Gaussian, with complexity $d^{O(\log(1/\epsilon)/\sigma^2)}$.
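The smoothed target $T_\sigma f$ plays a central role below. For the univariate sign function it has a simple closed form, $T_\sigma \mathrm{sign}(x) = \mathbb{E}_{z\sim\mathcal{N}}[\mathrm{sign}(x + \sigma z)] = \mathrm{erf}(x/(\sigma\sqrt{2}))$. The following short sketch (ours, not from the paper) checks this formula numerically; the helper name and parameter values are illustrative only.

```python
# A minimal numerical check (ours, not from the paper) of the closed form for the
# smoothed sign function: T_sigma sign(x) = E_{z~N(0,1)}[sign(x + sigma*z)]
#                                         = erf(x / (sigma*sqrt(2))).
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
sigma = 0.3

def T_sigma_sign_mc(x, sigma, n_samples=200_000):
    """Monte Carlo estimate of E_z[sign(x + sigma*z)] for z ~ N(0,1)."""
    z = rng.standard_normal(n_samples)
    return np.mean(np.sign(x + sigma * z))

for x in [-1.0, -0.1, 0.0, 0.2, 1.5]:
    closed_form = erf(x / (sigma * np.sqrt(2.0)))
    mc = T_sigma_sign_mc(x, sigma)
    print(f"x={x:+.2f}  closed form={closed_form:+.4f}  Monte Carlo={mc:+.4f}")
```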
Technical Overview.  The general framework that we leverage to obtain our results follows a line of work on SQ lower bounds for Non-Gaussian Component Analysis (NGCA) and its various generalizations, developed in a series of papers over the past decade; see, e.g., [DKS17, DKPZ21, DKRS23, DIKR25].

Specifically, our proof proceeds in two steps. First, we show a generic SQ lower bound (Theorem 3.1) that is applicable to any concept class $\mathcal{C}$ satisfying natural closure properties. In the context of smoothed agnostic learning of halfspaces under the Gaussian distribution, this result shows the following: if $f(x) = \mathrm{sign}(x)$ (the univariate sign function), then any SQ algorithm for the problem has complexity $d^{\Omega(m)}$, where $d$ is the dimension and $m$ is the minimum degree of $L_1$-polynomial approximation with error $O(\epsilon)$ of the smoothed function $T_\sigma f(x) = \mathbb{E}_{z\sim\mathcal{N}}[f(x + \sigma z)]$. The proof of our generic lower bound follows the overall strategy of [DKPZ21] (who prove our bound in the special case of $\sigma = 0$). The basic idea is to construct a moment-matching distribution on $y$ so that $\mathrm{OPT}_\sigma + \epsilon$ is still somewhat less than $1/2$. Thus, any algorithm that produces a hypothesis with error $\mathrm{OPT}_\sigma + \epsilon$ would allow one to distinguish between this $y$ and a $y$ that is independent of $x$. We then show that this distinguishing (testing) problem is SQ-hard, as it corresponds to a hard instance of conditional NGCA [DIKR25]. To construct the desired moment-matching distribution on $y$, we use LP duality and show that the relevant dual program corresponds exactly to finding a low-degree $L_1$ polynomial approximation of $T_\sigma f$.

The second step of our proof involves proving explicit degree lower bounds for $T_\sigma f$. Here we establish a degree lower bound of $c(\log(1/\epsilon) + 1/(\sigma + \epsilon)^2)$ for some universal constant $c > 0$.^2 To establish a degree lower bound of $\log(1/\epsilon)$ (Lemma 3.8), we proceed as follows. We start by noting the identity $T_\sigma f = U_a f$ for the function $f(x) = \mathrm{sign}(x)$, where $U_a$ is the Ornstein-Uhlenbeck (OU) operator and $a = 1/\sqrt{1 + \sigma^2}$. An explicit analysis, using properties of Hermite polynomials and the OU operator, allows us to show that the $L_2$ error of any degree-$k$ polynomial approximation to $T_\sigma f$ must be at least $2^{-O(k)}$. In fact, we can even certify this error by noting that the inner product of $T_\sigma f - p$ with an appropriate Hermite polynomial must be large. The next step is to compare the $L_1$ and $L_2$ polynomial approximations, which we carry out by using results about the concentration of $p$ and this Hermite polynomial. As a result, we show that the $L_1$ error of any degree-$k$ polynomial approximation to $T_\sigma f$ must be $2^{-O(k)}$ as well (for a different universal constant hidden in the $O(\cdot)$).

To establish the $L_1$ degree lower bound of $(\sigma + \epsilon)^{-2}$ (Lemma 3.5), we prove the following new structural result (Proposition 3.6) that may be of broader interest: there exists a $k$-wise independent family of univariate Gaussians (that we explicitly construct) so that their sum mod $1$ is close to any specified distribution. The sum of these Gaussians divided by $\sqrt{k}$ will be a distribution that matches $k$ moments with the standard Gaussian but has a very different distribution when taken modulo $1/\sqrt{k}$. By letting $X \mid y = 1$ be such a distribution missing a large fraction of the possible values modulo $1/\sqrt{k}$, it is easy to see that for some threshold $f$ we have $|T_\sigma f(X) - T_\sigma f(G)| \gg \epsilon$ so long as $\epsilon, \sigma \ll 1/\sqrt{k}$. This gives us an example of a moment-matching distribution $(X, y)$ with $\mathrm{OPT}_\sigma$ substantially less than $1/2$, which can be plugged into our generic SQ lower bound.

^2 Note that the second term is roughly $1/\sigma^2$, as long as $\sigma \geq \epsilon$. For $\sigma = 0$, degree $O(1/\epsilon^2)$ is known to suffice.
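The certificate mentioned above can be computed explicitly for $f = \mathrm{sign}$: for odd $k$ and any polynomial $p$ of degree less than $k$, the inner product of $T_\sigma f - p$ with the degree-$k$ normalized Hermite polynomial equals $a^k\,\mathbb{E}[h_k(G)\,\mathrm{sign}(G)]$ with $a = 1/\sqrt{1+\sigma^2}$. The following sketch (ours, not from the paper) evaluates this quantity numerically and exhibits its $2^{-O(k)}$ decay; the function name, $\sigma$, and the range of $k$ are arbitrary illustrative choices.

```python
# A small numerical sketch (ours, not from the paper) of the Hermite-coefficient
# certificate described above: |<h_k, T_sigma sign - p>| = a^k * |E[h_k(G) sign(G)]|
# for odd k and any polynomial p of degree < k, with a = 1/sqrt(1 + sigma^2).
import numpy as np
from math import factorial, sqrt
from numpy.polynomial import hermite_e as He
from scipy.integrate import quad
from scipy.stats import norm

def hermite_sign_coeff(k):
    """E[h_k(G) * sign(G)] for G ~ N(0,1), via numerical integration (odd k)."""
    coeffs = np.zeros(k + 1); coeffs[k] = 1.0
    integrand = lambda x: He.hermeval(x, coeffs) * norm.pdf(x)
    val, _ = quad(integrand, 0, np.inf)
    return 2.0 * val / sqrt(factorial(k))   # sign and h_k are both odd for odd k

sigma = 0.5
a = 1.0 / sqrt(1.0 + sigma**2)
for k in range(1, 16, 2):
    certificate = a**k * abs(hermite_sign_coeff(k))
    print(f"k={k:2d}   |<h_k, T_sigma sign>| = {certificate:.3e}")
```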
Regarding our upper bound for the smoothed Gaussian case (Proposition 4.1), it suffices to prove an upper bound on the degree of $L_2$ polynomial approximation of $T_\sigma f$. To do this, we note that $T_\sigma f = U g$ for $g(x) = f(\sqrt{1 + \sigma^2}\, x)$ and $U$ the relevant element of the Ornstein-Uhlenbeck semigroup. But if $g$ has Hermite transform $g = \sum_k g^{[k]}$, then $U g = \sum_k g^{[k]} (1 + \sigma^2)^{-k/2}$. Since $\|g^{[k]}\|_2 \leq \|g\|_2 \leq 1$, taking the first $O(\log(1/\epsilon)/\sigma^2)$ terms suffices to give error $\epsilon$.

2 Preliminaries

Notation.  For $n \in \mathbb{Z}_+$, we denote $[n] \stackrel{\mathrm{def}}{=} \{1, \ldots, n\}$. We typically use small letters to denote random variables when the underlying distribution is clear from the context. We use $\mathbb{E}[x]$ for the expectation of the random variable $x$ and $\Pr[E]$ for the probability of event $E$. Let $\mathcal{N}$ denote the standard univariate Gaussian distribution and $\mathcal{N}_d$ denote the standard $d$-dimensional Gaussian distribution. We use $\phi_d$ to denote the pdf of $\mathcal{N}_d$. Small letters are used for vectors and capital letters are used for matrices. Let $\|x\|_2$ denote the $\ell_2$-norm of the vector $x \in \mathbb{R}^d$. We use $u \cdot v$ for the inner product of vectors $u, v \in \mathbb{R}^d$. For a matrix $P \in \mathbb{R}^{m \times n}$, let $\|P\|_2$ denote its spectral norm and $\|P\|_F$ denote its Frobenius norm. We use $I_d$ to denote the $d \times d$ identity matrix.

Hermite Analysis.  Consider $L^2(\mathbb{R}^d, \mathcal{N}_d)$, the vector space of all functions $f: \mathbb{R}^d \to \mathbb{R}$ such that $\mathbb{E}_{x\sim\mathcal{N}_d}[f(x)^2] < \infty$. This is an inner product space under the inner product $\langle f, g\rangle = \mathbb{E}_{x\sim\mathcal{N}_d}[f(x) g(x)]$. This inner product space has a complete orthogonal basis given by the Hermite polynomials. In the univariate case, we will work with normalized Hermite polynomials, defined below.

Definition 2.1 (Normalized Probabilist's Hermite Polynomial). For $k \in \mathbb{N}$, the $k$-th probabilist's Hermite polynomial $He_k: \mathbb{R} \to \mathbb{R}$ is defined as $He_k(t) = (-1)^k e^{t^2/2} \cdot \frac{\mathrm{d}^k}{\mathrm{d}t^k} e^{-t^2/2}$. We define the $k$-th normalized probabilist's Hermite polynomial $h_k: \mathbb{R} \to \mathbb{R}$ as $h_k(t) = He_k(t)/\sqrt{k!}$.

Note that for $G \sim \mathcal{N}$ we have $\mathbb{E}[h_n(G) h_m(G)] = \delta_{n,m}$, and $\sqrt{m+1}\, h_{m+1}(t) = t h_m(t) - h'_m(t)$. We will use various well-known properties of these polynomials in our proofs.

Smoothing Operators.  We will also need the following smoothing operators:

• For a function $f: \mathbb{R}^k \to \mathbb{R}$ and $\sigma > 0$, we define the operator $T_\sigma f(x) = \mathbb{E}_{z\sim\mathcal{N}_k}[f(x + \sigma z)]$. This operator is motivated by the definition of our smoothing model.

• For a function $f: \mathbb{R}^k \to \mathbb{R}$ and $\rho \in [0, 1]$, the Ornstein-Uhlenbeck (OU) operator is defined as $U_\rho f(x) = \mathbb{E}_{z\sim\mathcal{N}_k}\big[f\big(\rho x + \sqrt{1 - \rho^2}\, z\big)\big]$. This operator arises in some of our analysis.

We note that for homogeneous functions these two operators are essentially equivalent up to scaling. Specifically, we have that $T_\sigma f = U_a g$, where $a = 1/\sqrt{1 + \sigma^2}$ and $g(x) = f(\sqrt{1 + \sigma^2}\, x)$; a short verification is given below.
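The identity $T_\sigma f = U_a g$ is used repeatedly in what follows; the following one-line computation (ours, included for the reader's convenience) verifies it directly from the definitions above.

```latex
% Verification of T_sigma f = U_a g, with a = 1/sqrt(1+sigma^2) and g(x) = f(sqrt(1+sigma^2) x).
\[
U_a g(x)
= \mathbb{E}_{z \sim \mathcal{N}_k}\!\left[ g\!\left( a x + \sqrt{1 - a^2}\, z \right) \right]
= \mathbb{E}_{z \sim \mathcal{N}_k}\!\left[ f\!\left( \sqrt{1+\sigma^2}\Big( \tfrac{x}{\sqrt{1+\sigma^2}} + \tfrac{\sigma}{\sqrt{1+\sigma^2}}\, z \Big) \right) \right]
= \mathbb{E}_{z \sim \mathcal{N}_k}\!\left[ f(x + \sigma z) \right]
= T_\sigma f(x),
\]
\[
\text{using } a = \tfrac{1}{\sqrt{1+\sigma^2}} \quad \text{and} \quad \sqrt{1 - a^2} = \tfrac{\sigma}{\sqrt{1+\sigma^2}}.
\]
```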
We will use the following standard properties of the OU operator:

Fact 2.2 (See, e.g., [Bog98] and [O'D14] (Chapter 11)). We have the following:

1. (Self-adjointness) For $f, g \in L^2(\mathcal{N})$ and $\rho \in (0, 1)$, $\mathbb{E}_{x\sim\mathcal{N}}[(U_\rho f(x))\, g(x)] = \mathbb{E}_{x\sim\mathcal{N}}[(U_\rho g(x))\, f(x)]$.

2. (Hermite eigenfunctions) For $i \in \mathbb{Z}_+$, $U_\rho h_i(z) = \rho^i h_i(z)$.

Analytic Facts.  For a polynomial $p: \mathbb{R}^k \to \mathbb{R}$, we consider the random variable $p(x)$, where $x \sim \mathcal{N}_k$. We will use $\|p\|_r$, for $r \geq 1$, to denote the $L_r$-norm of the random variable $p(x)$, i.e., $\|p\|_r \stackrel{\mathrm{def}}{=} \mathbb{E}_{x\sim\mathcal{N}_k}[|p(x)|^r]^{1/r}$. We will need the following analytic and probabilistic facts. We first recall the following moment bound for low-degree polynomials, which is equivalent to the hypercontractive inequality [Bon70, Gro75]:

Fact 2.3 (Hypercontractive Inequality). Let $p: \mathbb{R}^n \to \mathbb{R}$ be a degree-$d$ polynomial and $q > 2$. Then $\|p\|_q \leq (q-1)^{d/2} \|p\|_2$.

As a simple corollary, we can relate the $L_1$-norm of a polynomial with the $L_2$-norm. Specifically, we obtain the following folklore fact:

Fact 2.4. Let $p: \mathbb{R}^n \to \mathbb{R}$ be a degree-$d$ polynomial. Then $\|p\|_2 \leq 2^{d/2} \|p\|_1$.

Proof. Using the hypercontractive inequality for $q = 3$ gives $\|p\|_3 \leq 2^{d/2} \|p\|_2$. By the Cauchy-Schwarz inequality, we have that $\|p\|_2 \leq \|p\|_1^{1/2} \|p\|_3^{1/2}$. Combining the two bounds gives the proof.
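The following small Monte Carlo sketch (ours, not from the paper) estimates the Gaussian $L_1$, $L_2$, and $L_4$ norms of a low-degree polynomial, the kind of norm comparison appearing in Facts 2.3 and 2.4. The test polynomial and sample size are arbitrary choices, and the printed ratios are only an empirical illustration, not a verification of the stated constants.

```python
# A Monte Carlo sketch (ours, not from the paper) of the Gaussian L1/L2/L4 norms
# of a random low-degree polynomial, illustrating the comparisons in Facts 2.3-2.4.
import numpy as np
from numpy.polynomial import hermite_e as He

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)

degree = 4
coeffs = rng.standard_normal(degree + 1)       # random degree-4 HermiteE series
p_vals = He.hermeval(x, coeffs)

l1 = np.mean(np.abs(p_vals))
l2 = np.sqrt(np.mean(p_vals**2))
l4 = np.mean(p_vals**4) ** 0.25

print(f"||p||_1 ~ {l1:.3f}   ||p||_2 ~ {l2:.3f}   ||p||_4 ~ {l4:.3f}")
print(f"||p||_2 / ||p||_1 ~ {l2 / l1:.3f}   (Fact 2.4 bound: 2^(d/2) = {2**(degree/2):.3f})")
print(f"||p||_4 / ||p||_2 ~ {l4 / l2:.3f}   (Fact 2.3 bound: 3^(d/2) = {3**(degree/2):.3f})")
```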
3 Main Results

The structure of this section is as follows. In Section 3.1, we prove our generic SQ lower bound (Theorem 3.1) for smoothed agnostic learning of any concept class $\mathcal{C}$ satisfying appropriate closure properties. Roughly speaking, this result shows that the SQ complexity of the task is $d^{\Omega(m)}$, where $m$ is the optimal degree of $L_1$-polynomial approximation of a smoothed version of the target functions. In Section 3.2, we prove explicit degree lower bounds for the class of halfspaces, thereby establishing Theorem 1.3.

3.1 Generic SQ Lower Bound

Theorem 3.1 (Generic SQ Lower Bound for Smoothed Agnostic Learning). Let $\mathcal{C}$ be a class of Boolean functions on $\mathbb{R}^d$ that, for some $f: \mathbb{R}^k \to \{\pm 1\}$, contains all functions $f(P(x))$ for any projection $P: \mathbb{R}^d \to \mathbb{R}^k$ (i.e., a linear map with $P P^\top = I_k$). Suppose that for some positive integer $m$ the following holds for some $\epsilon, \sigma > 0$: for any polynomial $p: \mathbb{R}^k \to \mathbb{R}$ of degree at most $m$,

\[
\mathbb{E}_{x\sim\mathcal{N}_k}\big[|T_\sigma f(x) - p(x)|\big] > \epsilon + 2\tau, \tag{2}
\]

where $\tau \stackrel{\mathrm{def}}{=} \Theta_{k,m}(d)^{-m/5}$ with the implied constant sufficiently small. Then any SQ algorithm that $\sigma$-smoothed-agnostically learns $\mathcal{C}$ to excess error $\epsilon$ must either make $2^{d^{\Omega(1)}}$ queries or a query of accuracy better than $\tau$.

To prove Theorem 3.1, we need the following proposition:

Proposition 3.2. In the context of Theorem 3.1, there exists a distribution $(X, Y)$ on $\mathbb{R}^k \times \{\pm 1\}$ such that:

(i) The $X$-marginal is a standard Gaussian $\mathcal{N}_k$.

(ii) The expectation of $Y$ is $0$.

(iii) The conditional distributions $X \mid Y = 1$ and $X \mid Y = -1$ each match their first $m$ moments with the standard Gaussian $\mathcal{N}_k$.

(iv) If $Z$ is an independent standard Gaussian on $\mathbb{R}^k$, then $\Pr[f(X + \sigma Z) \neq Y] < 1/2 - \epsilon - 2\tau$.

We start by showing how Proposition 3.2 can be used to prove Theorem 3.1.

Proof of Theorem 3.1. We will use the distribution $(X, Y)$, whose existence is established in Proposition 3.2, to produce a conditional non-Gaussian component analysis construction [DIKR25]. We begin by defining a hard ensemble. In particular, for a random projection $P: \mathbb{R}^d \to \mathbb{R}^k$, we define the distribution on $\mathbb{R}^d \times \{\pm 1\}$ as $(P^\top(X) + X', Y)$, where $(X, Y)$ is the distribution from Proposition 3.2 and $X'$ is a standard Gaussian supported on the subspace $V^\perp$, where $V$ is the image of $P^\top$. Notice that this gives an instance of a Relativized Hidden Subspace Distribution for $A = (X, y)$ and $U = P^\top$ from Definition 3.1 of [DIKR25]. Furthermore, as $X \mid Y = 1$ and $X \mid Y = -1$ both match $m$ moments with a standard Gaussian, Theorem 3.3 of [DIKR25] implies that any Statistical Query algorithm distinguishing $(x, Y)$ (with randomly chosen $P$) from $(x, Y')$ (where $Y'$ is an independent, uniform $\{\pm 1\}$ random variable) must either use a query of accuracy better than $\tau$ or use $2^{d^{\Omega(1)}}$ queries.

On the other hand, we claim that given a smoothed-agnostic learner for $\mathcal{C}$ with excess error $\epsilon$, an additional statistical query of tolerance $\tau$ would allow us to correctly solve this distinguishing problem. This is because the learned hypothesis $h: \mathbb{R}^d \to \{\pm 1\}$ would have the property that

\[
\Pr[h(x) \neq Y] \leq \mathrm{OPT}_\sigma + \epsilon \leq \Pr[f(P(x + \sigma z)) \neq Y] + \epsilon = \Pr[f(X + \sigma Z) \neq Y] + \epsilon < 1/2 - 2\tau.
\]

On the other hand, since $Y'$ is independent of $x$, we have $\Pr[h(x) \neq Y'] = 1/2$. As a single statistical query can distinguish between $\Pr[h(x) \neq Y]$ being $1/2$ or less than $1/2 - 2\tau$, this completes the reduction and the proof of Theorem 3.1.
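The following schematic sketch (ours, not from the paper) illustrates the hard ensemble used in the proof above: a planted $k$-dimensional distribution is embedded along a hidden random subspace of $\mathbb{R}^d$, with independent Gaussian noise in the orthogonal complement. The planted sampler is a placeholder; the actual $(X, Y)$ is the moment-matching distribution given by Proposition 3.2.

```python
# A schematic sketch (ours, not from the paper) of the hidden-subspace ensemble
# in the proof of Theorem 3.1: embed a planted k-dimensional distribution (X, Y)
# along a random subspace of R^d, with Gaussian noise in the orthogonal complement.
import numpy as np

rng = np.random.default_rng(2)
d, k = 20, 1

def random_projection(d, k):
    """A random projection P in R^{k x d} with P P^T = I_k (orthonormal rows)."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return Q.T

def planted_sample(k):
    """Placeholder sample from the planted distribution (X, Y) on R^k x {±1}."""
    Y = rng.choice([-1, 1])
    X = rng.standard_normal(k)        # stand-in; Proposition 3.2 prescribes X | Y
    return X, Y

P = random_projection(d, k)           # the hidden subspace is drawn once

def hard_instance_sample():
    """One labeled example (P^T X + X', Y), with X' Gaussian on the complement of im(P^T)."""
    X, Y = planted_sample(k)
    g = rng.standard_normal(d)
    X_perp = g - P.T @ (P @ g)        # Gaussian noise projected onto the orthogonal complement
    return P.T @ X + X_perp, Y

x, y = hard_instance_sample()
print(x.shape, y)
```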
To prove Proposition 3.2, we use linear programming duality techniques, similar to those from [DKPZ21] (which corresponds to the $\sigma = 0$ special case of our proof below).

Proof of Proposition 3.2. To begin with, we note that finding a distribution $(X, Y)$ is equivalent to specifying the function $g(x) = \mathbb{E}[Y \mid X = x]$. In particular, note that $g$ must be a function from $\mathbb{R}^k$ to $[-1, 1]$. We now translate the desired properties of $(X, Y)$ to corresponding properties of the function $g$. We start with the following simple claim:

Claim 3.3. Conditions (ii) and (iii) on $(X, Y)$ together are equivalent to the condition that $g$ satisfies $\mathbb{E}[g(X) p(X)] = 0$ for any degree at most $m$ polynomial $p: \mathbb{R}^k \to \mathbb{R}$.

Proof. One direction of the equivalence is straightforward. Suppose that conditions (ii) and (iii) on $(X, Y)$ are satisfied. Since $g(x) = \mathbb{E}[Y \mid X = x]$, it follows that for any degree at most $m$ polynomial $p$ we have that

\[
\mathbb{E}[g(X) p(X)] = \mathbb{E}[Y p(X)] = \Pr(Y = 1)\,\mathbb{E}[p(X) \mid Y = 1] - \Pr(Y = -1)\,\mathbb{E}[p(X) \mid Y = -1] = \big(\mathbb{E}[p(X) \mid Y = 1] - \mathbb{E}[p(X) \mid Y = -1]\big)/2 = 0.
\]

For the opposite direction, suppose that $\mathbb{E}[g(X) p(X)] = 0$ for any degree at most $m$ polynomial $p: \mathbb{R}^k \to \mathbb{R}$. Setting $p \equiv 1$ gives condition (ii). Note that the latter implies that $\Pr[Y = 1] = \Pr[Y = -1] = 1/2$. Given this, we have that for any degree at most $m$ polynomial $p$, it holds

\[
\mathbb{E}[p(X) \mid Y = 1] = \frac{\mathbb{E}[p(X)\mathbf{1}\{Y = 1\}]}{\Pr[Y = 1]} = 2\,\mathbb{E}[p(X)\mathbf{1}\{Y = 1\}] = \mathbb{E}[p(X)(g(X) + 1)] = \mathbb{E}[p(X)].
\]

An identical argument can be used for $\mathbb{E}[p(X) \mid Y = -1]$, giving (iii). This completes the proof of Claim 3.3.

We finally translate condition (iv). We can write

\[
\Pr[f(X + \sigma Z) \neq Y] = (1 - \mathbb{E}[f(X + \sigma Z) Y])/2 = (1 - \mathbb{E}[(T_\sigma f)(X) Y])/2 = (1 - \mathbb{E}[(T_\sigma f)(X) g(X)])/2.
\]

Thus, condition (iv) is equivalent to the condition

\[
\mathbb{E}[T_\sigma f(X) g(X)] > \epsilon + 2\tau. \tag{3}
\]

In summary, to prove the proposition, it suffices to find a function $g: \mathbb{R}^k \to \mathbb{R}$ such that:

(A) $\|g\|_\infty \leq 1$.

(B) $\mathbb{E}_{X\sim\mathcal{N}_k}[g(X) p(X)] = 0$ for all degree at most $m$ polynomials $p$.

(C) $\mathbb{E}_{X\sim\mathcal{N}_k}[T_\sigma f(X) g(X)] > \epsilon + 2\tau$.

We note that constraints (A)-(C) define an infinite linear program. By a generalization of standard LP duality to this setting, this linear program is feasible if and only if the corresponding dual program is infeasible. This would follow immediately from the standard LP duality result were it a finite linear program. In our case, we need slightly more complicated analytic tools, and the statement follows directly from Corollary 55 of [DKPZ21]. The dual asks whether there is a linear combination of constraints that reaches a contradiction. In particular, the constraints in part (A) combine to form constraints of the form $\mathbb{E}[g(X) a(X)] \leq \|a\|_1$ for all $a$. The constraints in part (B) in general are of the form $\mathbb{E}[g(X) p(X)] = 0$. The constraints in part (C) are just multiples of the single constraint that $\mathbb{E}[T_\sigma f(X) g(X)] > \epsilon + 2\tau$. Thus, a solution to the dual program must consist of a function $a$, a polynomial $p$, and a negative real number $c$ such that $a(X) + p(X) + c\, T_\sigma f(X)$ is identically $0$ and $\|a\|_1 + c(\epsilon + 2\tau) < 0$. By homogeneity, we can take $c = -1$, so $a(X) = T_\sigma f(X) - p(X)$ and $\|a\|_1 < \epsilon + 2\tau$. In particular, there is a solution to the dual program if and only if there is a degree at most $m$ polynomial $p$ such that $\|T_\sigma f(X) - p(X)\|_1 < \epsilon + 2\tau$. However, this is ruled out by assumption. Therefore, our primal LP must have a solution, proving Proposition 3.2.

3.2 Smoothed Agnostic Learning of Halfspaces: Proof of Theorem 1.3

Our main application is for smoothed agnostic learning of halfspaces, when the distribution $D_x$ is the standard Gaussian. This corresponds to the case $k = 1$ and $f(x) = \mathrm{sign}(x)$ in Theorem 3.1. Theorem 1.3 follows directly from Theorem 3.1 and the following result, which is established in this section.

Theorem 3.4 (Smoothed $L_1$-Degree Lower Bound for the Sign Function). Let $f(x) = \mathrm{sign}(x)$. For $\sigma, \epsilon > 0$, we have that for any polynomial $p: \mathbb{R} \to \mathbb{R}$ of degree at most a small constant multiple of $m = (\sigma + \epsilon)^{-2} + \log(1/\epsilon)$, it holds that $\mathbb{E}_{x\sim\mathcal{N}}[|T_\sigma f(x) - p(x)|] > \epsilon$.

We start by proving a degree lower bound of $(\sigma + \epsilon)^{-2}$.

Lemma 3.5. Theorem 3.4 holds instead taking $m = (\sigma + \epsilon)^{-2}$.

The key technical statement that we require is the following proposition.

Proposition 3.6. For $k > 0$ and any subset $S$ of $[0, 1]$, there exists a probability distribution $X$ on $\mathbb{R}$ so that:

(A) $X$ matches its first $k$ moments with the standard Gaussian.

(B) $X(x)/\phi(x) \leq 1/|S|$, where $|S|$ denotes the measure of $S$.

(C) For some constant $C > 0$, all but $2^{-k}$ of the probability mass of $X$ is supported on points $x$ so that $\mathrm{FractionalPart}(x\sqrt{Ck}) \in S$.
Proof. Let $C$ be a large enough integer. We will let $X = \frac{1}{\sqrt{Ck}} \sum_{i=1}^{Ck} X_i$, where the $X_i$'s are non-independent Gaussians. In particular, one can write the standard Gaussian $G$ as a mixture of $U([0,1])$ with some other distribution $E$: namely, $G = \alpha\, U([0,1]) + (1 - \alpha) E$. We sample from $X$ in the following way (an illustrative numerical sketch of this sampling procedure is given after the proof of Lemma 3.5 below):

(1) Sample $y_i$ i.i.d. from $\{0, 1\}$ with $y_i = 1$ with probability $\alpha$.

(2) For each $y_i = 0$, set $X_i$ to be an independent sample from $E$.

(3) If $\sum_i y_i < k + 1$, set each $X_i$ with $y_i = 1$ to be an independent sample from $U([0,1])$.

(4) Otherwise, set the $X_i$'s with $y_i = 1$ to be independent samples from $U([0,1])$ conditioned on the sum of all $X_i$'s mod $1$ lying in $S$.

(5) Let $X = \frac{1}{\sqrt{Ck}} \sum_{i=1}^{Ck} X_i$.

To analyze this, we begin by noting that if in step (4) the $X_i$'s with $y_i = 1$ are instead taken to be independent $U([0,1])$ random variables (without the conditioning), then the $X_i$'s are independent Gaussians, so $X$ is a standard Gaussian. To prove (A), we note that in step (4) the $X_i$'s with $y_i = 1$ are each still uniformly distributed and $k$-wise independent, so they have the same first $k$ moments as if they were independent; and thus $X$ has the same first $k$ moments as a standard Gaussian. For (B), we note that we obtain $X$ by taking a sampling process that generates a standard Gaussian and, in step (4), conditioning on an event of probability $|S|$. Finally, to prove (C), we note that if $\sum_i y_i \geq k + 1$, then $\mathrm{FractionalPart}(x\sqrt{Ck}) \in S$. Since the sum of the $y_i$'s is distributed as $\mathrm{Bin}(Ck, \alpha)$, this happens with probability at least $1 - 2^{-k}$ when $C$ is large enough. This completes the proof of Proposition 3.6.

Remark 3.7. Conditions (B) and (C) in the statement of Proposition 3.6 can be replaced by the statement that $X$ is $2^{-\Omega(k)}$ close in total variation distance to the standard Gaussian conditioned on the event that $\mathrm{FractionalPart}(x\sqrt{Ck}) \in S$. This statement may be useful for other applications.

Proof of Lemma 3.5. Given Proposition 3.6, let $k$ be a small enough constant multiple of $(\sigma + \epsilon)^{-2}$, let $S = [1/2, 1]$, and consider the corresponding distribution $X$. Let $t = 0$ and let $t' = 1/(2\sqrt{Ck})$ with $C$ as in Proposition 3.6. Let $f(x) = \mathrm{sign}(x - t)$ and $f'(x) = \mathrm{sign}(x - t')$. Let $g = T_\sigma f$ and $g' = T_\sigma f'$. We note that $t$ and $t'$ differ by $\Omega(k^{-1/2})$, which is a large constant times $(\sigma + \epsilon)$. From this it is easy to see that $\mathbb{E}_{G\sim\mathcal{N}}[g(G)] - \mathbb{E}_{G\sim\mathcal{N}}[g'(G)]$ is at least a constant times the probability that $G$ lies in $[(2t + t')/3, (t + 2t')/3]$, which is at least a large constant times $(\sigma + \epsilon)$. On the other hand, it is not hard to see that $\mathbb{E}[g(X)] - \mathbb{E}[g'(X)] = O(\sigma)$, because, discounting the region between $t$ and $t'$ (a region where $X$ assigns exponentially small mass), the $L_1$-norm (with respect to $G$) of $g - g'$ is $O(\sigma)$. However, since the density of $X$ is pointwise at most twice that of $G$, where $G \sim \mathcal{N}$ (by property (B)), and since $X$ matches its first $k$ moments with a Gaussian (by property (A)), if $p$ is a degree at most $k$ polynomial with $\|g - p\|_1 < \delta$, then

\[
\mathbb{E}[g(X)] = \mathbb{E}[p(X)] + \mathbb{E}[(g - p)(X)] = \mathbb{E}[p(G)] + O(\delta) = \mathbb{E}[g(G)] + \mathbb{E}[(g - p)(G)] + O(\delta) = \mathbb{E}[g(G)] + O(\delta).
\]

A similar bound holds for $g'$. However, if such approximations exist for both $g$ and $g'$, we have that

\[
\mathbb{E}_{G\sim\mathcal{N}}[g(G) - g'(G)] = \mathbb{E}[g(X) - g'(X)] + O(\delta),
\]

which by the above is a contradiction unless $\delta$ is at least a large constant times $\sigma + \epsilon$. This shows that there is a $T_\sigma$-smoothed LTF that is not well approximable by degree-$k$ polynomials. This completes the proof of Lemma 3.5.
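The following simulation (ours, not from the paper) implements the sampling procedure from the proof of Proposition 3.6 with $S = [1/2, 1]$ and empirically checks properties (A) and (C). The constants $\alpha$, $C$, $k$, and the sample size are arbitrary illustrative choices; $\alpha$ is any value below $\min_{x \in [0,1]} \phi(x)$, so that $\alpha\, U([0,1])$ fits under the Gaussian density, and the asymptotic $1 - 2^{-k}$ guarantee of (C) requires $C$ large enough.

```python
# An illustrative simulation (ours, not from the paper) of the sampling procedure
# in the proof of Proposition 3.6, with S = [1/2, 1].
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
alpha, C, k = 0.24, 8, 4
n = C * k                                # number of summands X_1, ..., X_{Ck}
S_lo, S_hi = 0.5, 1.0                     # S = [1/2, 1]

def sample_E():
    """Rejection sampler for E, where N(0,1) = alpha*U([0,1]) + (1-alpha)*E."""
    while True:
        g = rng.standard_normal()
        keep = 1.0 - (alpha / norm.pdf(g) if 0.0 <= g <= 1.0 else 0.0)
        if rng.random() < keep:
            return g

def sample_X():
    y = rng.random(n) < alpha             # which summands are "uniform" slots
    vals = np.array([sample_E() for _ in range(n)])
    idx = np.flatnonzero(y)
    vals[idx] = rng.random(len(idx))      # unconditioned uniforms for y_i = 1
    if len(idx) >= k + 1:                 # step (4): condition the sum mod 1 on S
        rest = vals.sum() - vals[idx[-1]]
        s = rng.uniform(S_lo, S_hi)       # target value of (sum of all X_i) mod 1
        vals[idx[-1]] = (s - rest) % 1.0
    return vals.sum() / np.sqrt(n)

xs = np.array([sample_X() for _ in range(50_000)])
gs = rng.standard_normal(50_000)
for moment in range(1, k + 1):            # property (A): low moments match N(0,1)
    print(f"E[X^{moment}] ~ {np.mean(xs**moment):+.3f}   E[G^{moment}] ~ {np.mean(gs**moment):+.3f}")
frac = (xs * np.sqrt(n)) % 1.0            # property (C): fractional part lands in S
print("fraction with frac(x*sqrt(Ck)) in S:", np.mean((frac >= S_lo) & (frac <= S_hi)))
```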
Lemma 3.8. Theorem 3.4 holds instead taking $m = \log(1/\epsilon)$, for $\sigma = \Theta(1)$.

Proof. Note that for $f(x) = \mathrm{sign}(x)$, $T_\sigma f = U_a f$, where $U_a$ is the Ornstein-Uhlenbeck operator and $a = 1/\sqrt{1 + \sigma^2}$, which is $\Omega(1)$ if $\sigma = O(1)$. For $p$ a polynomial of degree less than $k$, with $k$ odd, and $h_k$ the degree-$k$ normalized probabilist's Hermite polynomial, we have that

\[
\mathbb{E}_{G\sim\mathcal{N}}[h_k(G)(U_a f - p)(G)] = \mathbb{E}_{G\sim\mathcal{N}}[h_k(G)\, U_a f(G)] = \mathbb{E}_{G\sim\mathcal{N}}[U_a h_k(G)\, f(G)] = a^k\, \mathbb{E}_{G\sim\mathcal{N}}[h_k(G) f(G)] = 2^{-O(k)}\, \mathbb{E}_{G\sim\mathcal{N}}[h_k(G) f(G)],
\]

where the first equality uses the orthogonality of Hermite polynomials, the second equality is by self-adjointness of the OU operator, the third equality uses Mehler's formula (namely, that $U_a h_k(x) = a^k h_k(x)$), and the fourth follows from the fact that $1 > a = \Omega(1)$. Since $k$ is odd, $|\mathbb{E}_{G\sim\mathcal{N}}[h_k(G) f(G)]| = \Omega(1/\mathrm{poly}(k))$, and we get that

\[
|\mathbb{E}_{G\sim\mathcal{N}}[h_k(G)(U_a f - p)(G)]| = 2^{-O(k)}. \tag{4}
\]

Suppose that the polynomial $p$ satisfies $\|U_a f - p\|_1 < \epsilon < 1$. Then we have that $\|p\|_1 < 2$. By Fact 2.4, this implies that $\|p\|_2 < 2^k \|p\|_1 \leq 2^{k+1}$. Thus, by hypercontractivity (Fact 2.3), we get that $\|U_a f - p\|_4 \ll \|U_a f\|_4 + \|p\|_4 = 2^{O(k)}$. On the other hand, since $|\mathbb{E}[h_k(G)(U_a f(G) - p(G))]| > 2^{-O(k)}$, this implies that $\|U_a f - p\|_2 > 2^{-O(k)}$. Combined with the above, the Paley-Zygmund inequality implies that $|U_a f(x) - p(x)| > 2^{-O(k)}$ with probability at least $1/2^{O(k)}$ over $x \sim \mathcal{N}$, which implies $\|U_a f - p\|_1 > 2^{-O(k)}$. Thus, if $\|U_a f - p\|_1 < \epsilon$, we must have $k \gg \log(1/\epsilon)$, completing the proof.

4 Upper Bound for Smoothed Agnostic Learning under the Gaussian Distribution

In this section, we prove the following simple proposition, ruling out stronger SQ lower bounds for smoothed agnostic learning of any concept class under Gaussian marginals.

Proposition 4.1. Let $(X, y)$ be any distribution on $\mathbb{R}^d \times \{\pm 1\}$ so that $X$ is distributed as a standard Gaussian, and let $1/2 > \sigma, \epsilon > 0$. Then there is an algorithm that learns $y$ to 0-1 error $\mathrm{OPT}_\sigma + \epsilon$ in $d^{O(\log(1/\epsilon)/\sigma^2)}$ samples and time.

Proof. We will simply use the $L_1$-polynomial regression algorithm. For this, it suffices to prove that for some $m = O(\log(1/\epsilon)/\sigma^2)$ there is a degree-$m$ polynomial $p$ so that $\|p - T_\sigma f\|_1 < \epsilon/2$, where $f(x) = \mathbb{E}[y \mid X = x]$. In fact, we will show that if $p$ is the degree-$m$ part of $T_\sigma f$, then $\|p - T_\sigma f\|_2 < \epsilon/2$. We prove this using Hermite analysis. In particular, we note that $T_\sigma f = U_a g$, where $a = 1/\sqrt{1 + \sigma^2} = 1 - \Omega(\sigma^2)$ and $g(x) = f(x/a)$. If we write $g$ in terms of its Hermite expansion, we find that $g(x) = \sum_{k=0}^\infty g^{[k]}(x)$, where $g^{[k]}$ is the degree-$k$ part of the Hermite expansion. We have that

\[
\sum_{k=0}^\infty \|g^{[k]}\|_2^2 = \|g\|_2^2 \leq 1.
\]

We also have that

\[
T_\sigma f = U_a g = \sum_{k=0}^\infty a^k g^{[k]}.
\]

Letting $p = \sum_{k=0}^m a^k g^{[k]}$, we have that

\[
T_\sigma f - p = \sum_{k=m+1}^\infty a^k g^{[k]}.
\]

So

\[
\|T_\sigma f - p\|_2^2 = \sum_{k=m+1}^\infty a^{2k} \|g^{[k]}\|_2^2 \leq a^{2m}.
\]

Since $a = 1 - \Omega(\sigma^2)$, if $m$ is at least a sufficiently large constant multiple of $\log(1/\epsilon)/\sigma^2$ this is less than $\epsilon/2$. This completes the proof.
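For concreteness, the following compact sketch (ours, not from the paper) shows one-dimensional $L_1$ polynomial regression [KKMS08], the algorithm referred to throughout: a low-degree polynomial is fit to the $\pm 1$ labels under the absolute loss via a linear program and then thresholded. The data-generating process, degree, and sample sizes are arbitrary illustrative choices, and the full algorithm additionally optimizes the final threshold rather than simply taking the sign.

```python
# A compact sketch (ours, not from the paper) of 1-D L1 polynomial regression:
# minimize sum_i |y_i - p(x_i)| over degree-m polynomials p via an LP, then
# threshold the fitted polynomial to obtain a ±1 hypothesis.
import numpy as np
from numpy.polynomial import hermite_e as He
from scipy.optimize import linprog
from math import factorial, sqrt

rng = np.random.default_rng(4)

def hermite_features(x, m):
    """n x (m+1) matrix of normalized probabilist's Hermite polynomials h_0..h_m."""
    cols = []
    for j in range(m + 1):
        c = np.zeros(j + 1); c[j] = 1.0
        cols.append(He.hermeval(x, c) / sqrt(factorial(j)))
    return np.column_stack(cols)

def l1_poly_regression(x, y, m):
    """Minimize sum_i |y_i - p(x_i)| over degree-m polynomials p, via an LP."""
    Phi = hermite_features(x, m)
    n, q = Phi.shape
    obj = np.concatenate([np.zeros(q), np.ones(n)])            # minimize sum of slacks
    A_ub = np.block([[-Phi, -np.eye(n)], [Phi, -np.eye(n)]])   # |y - Phi c| <= t
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * q + [(0, None)] * n
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:q]                                            # polynomial coefficients

# Labels: a noisy threshold of a standard Gaussian (illustrative stand-in).
x_train, x_test = rng.standard_normal(1000), rng.standard_normal(2000)
noise = rng.random(1000) < 0.1
y_train = np.where(noise, -1, 1) * np.sign(x_train - 0.3)
y_test = np.sign(x_test - 0.3)

coef = l1_poly_regression(x_train, y_train, m=6)
pred = np.sign(hermite_features(x_test, 6) @ coef)
print("test 0-1 error:", np.mean(pred != y_test))
```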
5 Conclusions

In this work, we obtained the first SQ lower bounds for smoothed agnostic learning of halfspaces. Specifically, our SQ lower bounds nearly match known upper bounds for subgaussian distributions. Such lower bounds can be naturally extended to various other families of single-index models.

A number of open questions suggest themselves. Since a fully-polynomial time algorithm with optimal error may be unachievable in the smoothed setting, it would be of interest to develop $\mathrm{poly}(d, 1/\sigma, 1/\epsilon)$-time smoothed agnostic learners with approximate error guarantees, e.g., error $O(\mathrm{OPT}_\sigma) + \epsilon$ (or give evidence of hardness even for such a weaker guarantee). In a different direction, we note that previous upper bounds require that the distribution on examples has subexponential tails (or better). Is this inherent? Namely, can we prove strong computational lower bounds when the input distribution is heavy-tailed?

References

[ABL17] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1-50:27, 2017.

[BEHW89] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929-965, 1989.

[Bog98] V. Bogachev. Gaussian measures. Mathematical Surveys and Monographs, vol. 62, 1998.

[Bon70] A. Bonami. Etude des coefficients de Fourier des fonctions de L^p(G). Ann. Inst. Fourier (Grenoble), 20(2):335-402, 1970.

[CKK+24] G. Chandrasekaran, A. R. Klivans, V. Kontonis, R. Meka, and K. Stavropoulos. Smoothed analysis for learning concepts with low intrinsic dimension. In The Thirty Seventh Annual Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pages 876-922. PMLR, 2024.

[CKL+06] C.-T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS'06, pages 281-288, Cambridge, MA, USA, 2006. MIT Press.

[Dan16] A. Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the 48th Annual Symposium on Theory of Computing, STOC 2016, pages 105-117, 2016.

[DIKR25] I. Diakonikolas, G. Iakovidis, D. M. Kane, and L. Ren. Algorithms and SQ lower bounds for robustly learning real-valued multi-index models. CoRR, abs/2505.21475, 2025. Conference version in NeurIPS'25.

[DKK+21] I. Diakonikolas, D. M. Kane, V. Kontonis, C. Tzamos, and N. Zarifis. Agnostic proper learning of halfspaces under Gaussian marginals. In Conference on Learning Theory, COLT 2021, volume 134 of Proceedings of Machine Learning Research, pages 1522-1551. PMLR, 2021.

[DKMR22] I. Diakonikolas, D. Kane, P. Manurangsi, and L. Ren. Cryptographic hardness of learning halfspaces with Massart noise. Advances in Neural Information Processing Systems, 35:3624-3636, 2022.

[DKPZ21] I. Diakonikolas, D. M. Kane, T. Pittas, and N. Zarifis. The optimality of polynomial regression for agnostic learning under Gaussian marginals in the SQ model. In Conference on Learning Theory, COLT 2021, volume 134 of Proceedings of Machine Learning Research, pages 1552-1584. PMLR, 2021.

[DKR23] I. Diakonikolas, D. Kane, and L. Ren. Near-optimal cryptographic hardness of agnostically learning halfspaces and ReLU regression under Gaussian marginals. In International Conference on Machine Learning, ICML 2023, volume 202 of Proceedings of Machine Learning Research, pages 7922-7938. PMLR, 2023.
[DKRS23] I. Diakonikolas, D. Kane, L. Ren, and Y. Sun. SQ lower bounds for non-Gaussian component analysis with weaker assumptions. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, 2023.

[DKS17] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, pages 73-84, 2017. Full version at http://arxiv.org/abs/1611.03473.

[DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1061-1073, 2018.

[DKTZ20] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Non-convex SGD learns halfspaces with adversarial label noise. In Advances in Neural Information Processing Systems, NeurIPS, 2020.

[DKTZ22] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Learning general halfspaces with adversarial label noise via online gradient descent. In International Conference on Machine Learning, ICML 2022, volume 162 of Proceedings of Machine Learning Research, pages 5118-5141. PMLR, 2022.

[DKZ20] I. Diakonikolas, D. Kane, and N. Zarifis. Near-optimal SQ lower bounds for agnostically learning halfspaces and ReLUs under Gaussian marginals. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, 2020.

[Fel17] V. Feldman. A general characterization of the statistical query complexity. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, volume 65 of Proceedings of Machine Learning Research, pages 785-830. PMLR, 2017.

[FGR+13] V. Feldman, E. Grigorescu, L. Reyzin, S. Vempala, and Y. Xiao. Statistical algorithms and a lower bound for detecting planted cliques. In Proceedings of STOC'13, pages 655-664, 2013. Full version in Journal of the ACM, 2017.

[FGV17] V. Feldman, C. Guzman, and S. S. Vempala. Statistical query algorithms for mean vector estimation and stochastic convex optimization. In Philip N. Klein, editor, Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, pages 1265-1277. SIAM, 2017.

[FS97] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

[GGK20] S. Goel, A. Gollakota, and A. R. Klivans. Statistical-query lower bounds via functional gradients. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, 2020.

[Gro75] L. Gross. Logarithmic Sobolev inequalities. Amer. J. Math., 97(4):1061-1083, 1975.

[Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78-150, 1992.

[Kea98] M. J. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983-1006, 1998.
[KKMS08] A. Kalai, A. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777-1805, 2008.

[KM25] Y. Kou and R. Meka. Smoothed agnostic learning of halfspaces over the hypercube. arXiv, November 2025. Conference version in NeurIPS'25.

[KOS08] A. Klivans, R. O'Donnell, and R. Servedio. Learning geometric concepts via Gaussian surface area. In FOCS 2008, pages 541-550, 2008.

[KSS94] M. Kearns, R. Schapire, and L. Sellie. Toward efficient agnostic learning. Machine Learning, 17(2/3):115-141, 1994.

[KW25] F. Koehler and B. Wu. Constructive approximation under Carleman's condition, with applications to smoothed analysis. arXiv, December 2025.

[MP68] M. Minsky and S. Papert. Perceptrons: an introduction to computational geometry. MIT Press, Cambridge, MA, 1968.

[Nov62] A. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on Mathematical Theory of Automata, volume XII, pages 615-622, 1962.

[O'D14] R. O'Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.

[Ros58] F. Rosenblatt. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-407, 1958.

[ST04] D. A. Spielman and S.-H. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. J. ACM, 51(3):385-463, 2004.

[STC00] J. Shawe-Taylor and N. Cristianini. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

[Tie23] S. Tiegel. Hardness of agnostically learning halfspaces from worst-case lattice problems. In The Thirty Sixth Annual Conference on Learning Theory, COLT 2023, volume 195 of Proceedings of Machine Learning Research, pages 3029-3064. PMLR, 2023.

[Val84] L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.

[Vap98] V. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
