The Conjugate Domain Dichotomy: Exact Risk of M-Estimators under Infinite-Variance Noise in High Dimensions

Charalampos Agiropoulos
Department of Economics, University of Piraeus
hagyropoylos@unipi.gr

March 31, 2026

Abstract

This paper studies high-dimensional M-estimation in the proportional asymptotic regime ($p/n \to \gamma > 0$) when the noise distribution has infinite variance. For noise with regularly-varying tails of index $\alpha \in (1, 2)$, we establish that the asymptotic behavior of a regularized M-estimator is governed by a single geometric property of the loss function: the boundedness of the domain of its Fenchel conjugate. When this conjugate domain is bounded, as is the case for the Huber, absolute-value, and quantile loss functions, the dual variable in the min-max formulation of the estimator is confined, the effective noise reduces to the finite first absolute moment of the noise distribution, and the estimator achieves bounded risk without recourse to external information. When the conjugate domain is unbounded, as for the squared loss, the dual variable scales with the noise, the effective noise involves the diverging second moment, and bounded risk can be achieved only through transfer regularization toward an external prior. For the squared-loss class specifically, we derive the exact asymptotic risk via the Convex Gaussian Minimax Theorem under a noise-adapted regularization scaling $\lambda_n = \tilde\lambda\sigma_n^2$. The resulting risk converges to a universal floor $q_\Sigma = p^{-1}\|\beta^* - \beta_0\|_{\Sigma_x}^2$ that is independent of the regularizer, yielding a loss–risk trichotomy: squared-loss estimators without transfer diverge; Huber-loss estimators achieve bounded but non-vanishing risk; transfer-regularized estimators attain $q_\Sigma$.

Keywords: Conjugate domain, Convex Gaussian minimax theorem, heavy-tailed noise, high-dimensional regression, M-estimation, phase transition, proportional asymptotics, regular variation, transfer learning, Winsorization.

MSC2020: 62J07, 62F35, 60B20, 62C20.

Contents

1 Introduction
  1.1 Background and motivation
  1.2 A structural tension
  1.3 The conjugate domain as the governing mechanism
  1.4 Summary of contributions
  1.5 Scope and limitations
  1.6 Related work
2 The Conjugate Domain Dichotomy
  2.1 Assumptions and definitions
  2.2 Statement of the dichotomy
3 Exact risk characterization
  3.1 The truncation bridge
  3.2 Failure of squared-loss estimation
  3.3 Noise-adapted regularization and exact risk
  3.4 The universal risk floor
  3.5 The loss–risk trichotomy
  3.6 Design universality
4 Proofs
  4.1 Proof architecture for the Conjugate Domain Dichotomy
  4.2 Proof of the exact risk formula
5 Numerical experiments
6 Discussion
  6.1 The two-penalty decomposition
  6.2 Open problems
A Truncation universality: full details
B CGMT reduction and compactification
C AO convergence and the universal floor
D Conjugate Domain Dichotomy: proof details
E Design universality: condition verification
1 Introduction

1.1 Background and motivation

Consider the linear model
\[
y = X\beta^* + w, \tag{1}
\]
where $X \in \mathbb{R}^{n \times p}$ is a design matrix with independent rows drawn from $N(0, \Sigma_x)$, $\beta^* \in \mathbb{R}^p$ is the unknown regression parameter, and $w \in \mathbb{R}^n$ is a noise vector with independent and identically distributed entries. In the proportional asymptotic regime $p/n \to \gamma \in (0, \infty)$, a substantial body of recent work has established precise characterizations of the estimation risk for regularized least-squares estimators under finite noise variance. The foundational results of Dobriban and Wager [2018] and Hastie et al. [2022] for ridge regression, the extension to general convex regularizers via the Convex Gaussian Minimax Theorem (CGMT) by Thrampoulidis et al. [2015, 2018], and the analysis of benign overfitting by Bartlett et al. [2020] and Tsigler and Bartlett [2023] all rely on the assumption that the noise variance is finite and fixed as $n \to \infty$. The present work considers what can be said when this assumption fails.

Distributions with regularly-varying tails, including Pareto distributions, Student-$t$ distributions with few degrees of freedom, and $\alpha$-stable laws with stability index $\alpha < 2$, arise naturally in the modeling of financial returns, insurance claims, network traffic, and seismological recordings [Resnick, 2007, Samorodnitsky and Taqqu, 1994]. For such distributions with tail index $\alpha \in (1, 2)$, the first moment is finite but the variance is infinite. The classical bias–variance decomposition, which underpins much of the theory of high-dimensional regression, is no longer applicable in this setting.

1.2 A structural tension

A standard approach to handling infinite-variance noise is Winsorization: clipping the noise at a sample-size-dependent threshold $\tau_n = n^{1/\alpha}$ to produce a proxy model with finite but growing variance. We show (Theorem 3.1) that this procedure yields an effective variance
\[
\sigma_n^2 = \frac{2c}{2-\alpha}\, n^{(2-\alpha)/\alpha}\,(1 + o(1)) \tag{2}
\]
for any noise distribution satisfying $P(|w| > t) \sim c\,t^{-\alpha}$ as $t \to \infty$ with $c > 0$, where the asymptotic constant and the rate depend only on the tail parameters $\alpha$ and $c$.
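As a quick numerical illustration of (2) (an editorial sketch, not part of the paper's development), one can compare the exact Winsorized second moment for symmetric Pareto noise, where $\bar F(t) = \min(1, t^{-\alpha})$ so that $c = 1$, against the asymptotic formula:

```python
import numpy as np

# Sanity check of formula (2), assuming symmetric Pareto noise with
# Fbar(t) = min(1, t^{-alpha}), i.e. tail constant c = 1. Integrating the
# tail gives the closed form
#   E[(w^(tau))^2] = 1 + (2 / (2 - alpha)) * (tau^(2 - alpha) - 1).
alpha, c = 1.5, 1.0

for n in [10**3, 10**5, 10**7, 10**9]:
    tau_n = n ** (1 / alpha)                                   # threshold
    exact = 1 + 2 / (2 - alpha) * (tau_n ** (2 - alpha) - 1)
    asympt = 2 * c / (2 - alpha) * n ** ((2 - alpha) / alpha)  # formula (2)
    print(f"n={n:>10}  exact={exact:11.2f}  asymptotic={asympt:11.2f}  "
          f"ratio={exact / asympt:.4f}")
# The ratio tends to 1, matching the (1 + o(1)) factor in (2).
```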
This diverging effective variance gives rise to a noteworthy tension. The total Fisher information contained in the observations satisfies $\mathcal{I}_n = n/\sigma_n^2 = \Theta(n^{(2\alpha-2)/\alpha}) \to \infty$ for all $\alpha > 1$ (Theorem 3.2), so the data are, in an information-theoretic sense, increasingly informative about $\beta^*$. It is therefore not the case that the observations lack the information necessary for consistent estimation. At the same time, any estimator constructed by minimizing a squared-loss objective $\frac{1}{2n}\|y - X\beta\|^2$ inherits the diverging noise variance directly. The risk of the ordinary least-squares estimator grows as $\Theta(\sigma_n^2)$. Ridge regression with a fixed regularization parameter exhibits the same divergent behavior. Even when the regularization strength is adapted to the noise level, specifically when $\lambda_n = \Theta(\sigma_n^2)$, the estimator merely exchanges diverging variance for an irreducible bias that depends on the distance between the truth and the regularization center (Theorem 3.3).

The difficulty, then, is not a lack of information in the data, but rather a mismatch between the loss function and the noise structure. The squared loss, through its sensitivity to the second moment of the noise, amplifies the diverging variance into the estimation risk.

1.3 The conjugate domain as the governing mechanism

The central contribution of this paper is the identification of the precise geometric property of the loss function that determines whether an M-estimator can tolerate infinite-variance noise. This property is the boundedness of the domain of the Fenchel conjugate. Recall that any proper closed convex loss function $L$ admits the dual representation $L(a) = \sup_{u \in \mathrm{dom}(L^*)} \{ua - L^*(u)\}$, where $L^*(u) = \sup_t \{ut - L(t)\}$ is the Fenchel conjugate. In the min-max formulation of the M-estimator that arises within the CGMT framework, the dual variable $u_i$ associated with the $i$-th observation is constrained to $\mathrm{dom}(L^*)$. The interaction between this constraint and the noise $w_i$ is the mechanism through which the loss function either amplifies or attenuates the effect of heavy tails:

• When $\mathrm{dom}(L^*) \subseteq [-K, K]$ for some $K > 0$, each dual variable satisfies $|u_i| \le K$. The noise coupling term $n^{-1}\sum_{i=1}^n u_i w_i$ is then bounded by $K \cdot n^{-1}\sum_i |w_i|$, which involves only the first absolute moment of the noise. Since $\mathbb{E}[|w|] < \infty$ whenever $\alpha > 1$, this quantity remains bounded, and the estimator achieves finite risk.

• When $\mathrm{dom}(L^*) = \mathbb{R}$, the optimal dual variable scales with the noise (for the squared loss, $u_i^* \propto w_i$), and the noise coupling involves the second moment $n^{-1}\sum_i w_i^2$, which diverges when $\alpha < 2$.

The squared loss, with $L^*(u) = u^2/2$ and $\mathrm{dom}(L^*) = \mathbb{R}$, is the prototypical loss with unbounded conjugate domain. The Huber loss, with $L^*(u) = u^2/2$ for $|u| \le k$ and $L^*(u) = +\infty$ for $|u| > k$, is the prototypical loss with bounded conjugate domain. More generally, a loss function has bounded conjugate domain if and only if it grows at least linearly at infinity: $\liminf_{|t|\to\infty} L(t)/|t| > 0$. This characterization encompasses the absolute-value loss, the quantile loss, the log-cosh loss, and all Huber-type losses. We formalize this classification as the Conjugate Domain Dichotomy (Theorem 2.5).
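The dichotomy can be seen directly by tabulating conjugates numerically. The following sketch (illustrative only; the supremum is taken over a finite grid, so values that are truly $+\infty$ show up as growth toward the grid edge) evaluates $L^*(u) = \sup_t\{ut - L(t)\}$ for the squared, Huber, and absolute-value losses:

```python
import numpy as np

# Numerical Fenchel conjugates L*(u) = sup_t { u t - L(t) } on a finite
# t-grid; genuinely infinite values appear as growth to the grid edge.
t = np.linspace(-200.0, 200.0, 400001)
k = 1.0
losses = {
    "squared":  0.5 * t**2,                                  # dom(L*) = R
    "huber":    np.where(np.abs(t) <= k,
                         0.5 * t**2, k * np.abs(t) - 0.5 * k**2),
    "absolute": np.abs(t),                                   # dom(L*) = [-1, 1]
}
for u in [0.5, 1.0, 1.5, 3.0]:
    row = "  ".join(f"{name}*={np.max(u * t - L):9.2f}"
                    for name, L in losses.items())
    print(f"u={u:4.1f}: {row}")
# squared*: u^2/2 for every u (unbounded conjugate domain). huber* and
# absolute*: finite for |u| <= 1, then they blow up with the grid edge,
# the numerical signature of a bounded conjugate domain.
```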
1.4 Summary of contributions

The paper makes the following contributions.

(i) The Conjugate Domain Dichotomy (Theorem 2.5). We prove that, under regularly-varying noise with index $\alpha \in (1,2)$, a regularized M-estimator achieves bounded risk if and only if its loss function has bounded conjugate domain. The proof identifies the first-versus-second moment mechanism described above as the quantitative explanation (Corollary 2.6).

(ii) The moment hierarchy (Remark 2.7). We extend the dichotomy to a continuous classification: a loss whose conjugate grows as $|u|^q$ requires finiteness of the $q/(q-1)$-th noise moment. This yields a complete correspondence between conjugate growth rate and noise moment requirements.

(iii) Truncation universality (Theorem 3.1). For any noise distribution in the domain of attraction of an $\alpha$-stable law, Winsorization at the threshold $\tau_n = n^{1/\alpha}$ produces the effective variance (2), with concentration of the empirical squared norm around this deterministic equivalent.

(iv) Exact risk characterization via the CGMT (Theorem 3.5). For squared-loss estimators with noise-adapted regularization $\lambda_n = \tilde\lambda\sigma_n^2$, we derive the exact asymptotic risk as the solution of a fixed-point system involving the proximal operator of the regularizer and the spectral distribution of $\Sigma_x$.

(v) Universal risk floor (Theorem 3.6). In the diverging-noise limit, the fixed-point system admits an explicit solution: the risk converges to $q_\Sigma = p^{-1}\|\beta^* - \beta_0\|_{\Sigma_x}^2$, independent of the choice of coercive convex regularizer. This universality arises from the proximal operator collapsing under a diverging step size.

(vi) The loss–risk trichotomy (Theorem 3.7). Combining the dichotomy with the exact risk characterization yields three distinct regimes: squared-loss estimators without transfer have diverging risk; Huber-loss estimators achieve bounded risk $\mathcal{R}_H < \infty$ but cannot eliminate the proportional-regime penalty; transfer-regularized squared-loss estimators attain $q_\Sigma \ll \mathcal{R}_H$ when the prior is informative.

(vii) Design universality (Theorem 3.9). All results extend from Gaussian to sub-Gaussian designs by verification of the conditions in the universality framework of Montanari and Saeed [2022].

1.5 Scope and limitations

The results apply to proper closed convex loss functions paired with coercive convex regularizers. The analysis does not cover non-convex penalties such as SCAD or MCP, nor does it address the case of heavy-tailed covariates, which would require tools beyond the CGMT. The tail index is restricted to $\alpha \in (1,2)$; for $\alpha \le 1$ the mean is infinite and the truncation bridge of Theorem 3.1 does not apply. We do not address exact support recovery, which requires techniques distinct from the $L_2$-risk analysis pursued here, nor data-driven selection of the regularization parameter.

1.6 Related work

Proportional-regime risk characterization. The precise risk of ridge regression in the proportional regime was established by Dobriban and Wager [2018] and Hastie et al. [2022], building on the Marchenko–Pastur law [Marchenko and Pastur, 1967] and companion Stieltjes transform techniques [Bai and Silverstein, 2010]. The CGMT, developed by Thrampoulidis et al. [2015, 2018] from Gordon's Gaussian comparison inequality [Gordon, 1988], extended these results to general convex regularizers and M-estimators. The phenomena of benign overfitting [Bartlett et al., 2020, Tsigler and Bartlett, 2023] and double descent [Belkin et al., 2019, Advani and Saxe, 2020, Mei and Montanari, 2022] have been analyzed within this framework. All of these works assume finite and fixed noise variance.

Heavy-tailed regression. Estimation under heavy-tailed noise has been studied from several perspectives. Lugosi and Mendelson [2019] established sub-Gaussian estimation rates via median-of-means techniques, and Adomaitytė et al. [2024] derived exact asymptotics under heavy-tailed data in a related high-dimensional setting. The robust statistics literature, originating with Huber [1964] and developed in Huber and Ronchetti [2009], provides the classical foundation for M-estimation with bounded influence functions.
High-dimensional robust regression was analyzed by El Karoui et al. [2013] for fixed noise variance.

Transfer learning. Transfer learning in high-dimensional linear models was studied by Dar and Baraniuk [2022], who characterized the double descent phenomenon under task transfer, and by Li et al. [2022], who established minimax rates for transfer estimation.

Universality. Design universality for empirical risk minimization was established by Montanari and Saeed [2022] via a continuous interpolation argument, and by Hu and Lu [2023] for random feature models.

The present work differs from the above in two principal respects. First, we consider a regime in which the noise variance diverges with the sample size, a qualitatively different setting from the fixed-variance regime studied in the CGMT and proportional-asymptotics literatures. Second, we identify the conjugate domain of the loss function as the geometric property that governs whether an estimator can tolerate this divergence, providing a classification that applies to all convex losses simultaneously rather than analyzing specific estimators individually.

2 The Conjugate Domain Dichotomy

2.1 Assumptions and definitions

We work throughout with the observation model (1) under the following conditions.

Assumption 2.1 (Design and dimensionality). The design matrix satisfies $X = Z\Sigma_x^{1/2}$, where $Z \in \mathbb{R}^{n\times p}$ has independent standard Gaussian entries, and $\Sigma_x \in \mathbb{R}^{p\times p}$ is positive definite with empirical spectral distribution converging weakly to a compactly supported probability measure $\mu$ on $(0,\infty)$. The dimensions satisfy $p/n \to \gamma \in (0,\infty)$ as $n, p \to \infty$.

Assumption 2.2 (Regularly-varying noise). The noise variables $w_1, w_2, \ldots$ are independent and identically distributed with $\mathbb{E}[w_i] = 0$ and tail behavior $P(|w_i| > t) = c\,t^{-\alpha}(1 + o(1))$ as $t \to \infty$, for some $\alpha \in (1,2)$ and $c > 0$.

Assumption 2.2 places the noise distribution in the domain of attraction of an $\alpha$-stable law. The condition $\alpha > 1$ ensures that $\mathbb{E}[|w_i|] < \infty$, while $\alpha < 2$ implies $\mathbb{E}[w_i^2] = \infty$. The requirement that the slowly varying component of the tail converges to a constant (rather than being an arbitrary slowly varying function) is imposed for clarity; it is satisfied by Student-$t$, Pareto, and $\alpha$-stable distributions.

Definition 2.3 (Winsorization). For a threshold $\tau > 0$, the Winsorized noise is
\[
w_i^{(\tau)} = w_i\,\mathbf{1}_{\{|w_i| \le \tau\}} + \tau\,\mathrm{sign}(w_i)\,\mathbf{1}_{\{|w_i| > \tau\}}.
\]
We set $\tau_n = n^{1/\alpha}$ and write $\sigma_n^2 = \mathbb{E}[(w_i^{(\tau_n)})^2]$ for the effective variance.
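In code, Definition 2.3 is a single clipping operation. A minimal sketch, assuming symmetric Pareto noise (chosen because $\mathbb{E}|w| = \alpha/(\alpha-1)$ is then known in closed form):

```python
import numpy as np

# Definition 2.3 in code: Winsorization clips to [-tau, tau] while keeping
# the sign, which is exactly np.clip (simple truncation would instead zero
# out the exceedances). Symmetric Pareto noise: E|w| = alpha/(alpha-1) = 3
# when alpha = 1.5.
rng = np.random.default_rng(0)
n, alpha = 10**6, 1.5
tau_n = n ** (1 / alpha)

u = 1.0 - rng.uniform(size=n)                             # u in (0, 1]
w = u ** (-1 / alpha) * rng.choice([-1.0, 1.0], size=n)   # P(|w| > t) = t^{-alpha}
w_wins = np.clip(w, -tau_n, tau_n)                        # w^(tau_n) of Definition 2.3

print("clipped fraction :", np.mean(np.abs(w) > tau_n))   # ~ tau_n^{-alpha} = 1/n
print("mean |w^(tau_n)| :", np.mean(np.abs(w_wins)))      # ~ E|w| = 3 since alpha > 1
```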
Definition 2.4 (Conjugate domain classes). Let $L : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ be a proper closed convex function with Fenchel conjugate $L^*(u) = \sup_t \{ut - L(t)\}$. We say that $L$ has bounded conjugate domain if there exists $K > 0$ such that $\mathrm{dom}(L^*) = \{u : L^*(u) < \infty\} \subseteq [-K, K]$, and unbounded conjugate domain if $\mathrm{dom}(L^*) = \mathbb{R}$.

A standard result in convex analysis relates the conjugate domain to the growth rate of the loss: $\mathrm{dom}(L^*)$ is bounded if and only if $\liminf_{|t|\to\infty} L(t)/|t| > 0$. Thus the class of losses with bounded conjugate domain consists precisely of those that grow at least linearly at infinity. Among common loss functions, the squared loss $L(t) = t^2/2$ is the unique standard example with unbounded conjugate domain. The Huber loss
\[
\rho_k(t) = \begin{cases} t^2/2 & |t| \le k, \\ k|t| - k^2/2 & |t| > k, \end{cases}
\]
the absolute-value loss $L(t) = |t|$, the log-cosh loss $L(t) = \log\cosh(t)$, and the quantile loss $L(t) = t(\tau - \mathbf{1}_{\{t<0\}})$ all have bounded conjugate domain.

2.2 Statement of the dichotomy

We consider the general regularized M-estimator
\[
\hat\beta_L = \arg\min_{\beta \in \mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n L(y_i - x_i^\top \beta) + \lambda_n R(\beta), \tag{3}
\]
where $R : \mathbb{R}^p \to \mathbb{R} \cup \{+\infty\}$ is a proper closed coercive convex regularizer (meaning $R(\beta) \to \infty$ as $\|\beta\| \to \infty$) and $\lambda_n > 0$ is the regularization parameter.

Theorem 2.5 (Conjugate Domain Dichotomy). Under Assumptions 2.1–2.2, the following hold for the estimator (3):

(B) If $L$ has bounded conjugate domain ($\mathrm{dom}(L^*) \subseteq [-K, K]$), then for any fixed $\lambda > 0$ and any coercive convex $R$,
\[
\frac{1}{p}\|\hat\beta_L - \beta^*\|_{\Sigma_x}^2 \xrightarrow{P} \mathcal{R}_L(K, \lambda, \gamma, \mu) < \infty,
\]
where $\mathcal{R}_L$ does not depend on $\sigma_n^2$.

(U) If $L$ has unbounded conjugate domain ($\mathrm{dom}(L^*) = \mathbb{R}$), then:

(U1) Without transfer regularization ($R(\beta)$ centered at the origin), the risk either diverges (if $\lambda_n = o(\sigma_n^2)$) or converges to $p^{-1}\|\beta^*\|_{\Sigma_x}^2$ (if $\lambda_n = \Theta(\sigma_n^2)$).

(U2) With transfer regularization $R(\beta - \beta_0)$ and $\lambda_n = \tilde\lambda\sigma_n^2$ for fixed $\tilde\lambda > 0$, the risk converges to $q_\Sigma = p^{-1}\|\beta^* - \beta_0\|_{\Sigma_x}^2$.

The proof is given in Section 4.1 and Appendix D. The following corollary makes the mechanism explicit.

Corollary 2.6 (The moment mechanism). Part (B) of Theorem 2.5 holds because the constraint $u_i \in [-K, K]$ ensures that the noise coupling in the CGMT dual satisfies
\[
\Big| n^{-1}\sum_i u_i w_i^{(\tau_n)} \Big| \le K \cdot n^{-1}\sum_i |w_i^{(\tau_n)}| \xrightarrow{P} K\,\mathbb{E}[|w|] < \infty.
\]
Part (U) holds because the unconstrained dual $u_i^* \propto w_i$ yields a coupling that involves $n^{-1}\sum_i w_i^2 \asymp \sigma_n^2 \to \infty$.

Remark 2.7 (The moment hierarchy). The dichotomy generalizes to a continuous classification. If the conjugate satisfies $L^*(u) \sim c_q|u|^q$ as $|u| \to \infty$ for some $q \ge 1$, then the optimal dual variable has magnitude $|u_i^*| = O(|w_i|^{1/(q-1)})$ (by the first-order condition of the dual), and the noise coupling involves the $q/(q-1)$-th moment of $w$. The estimator therefore achieves bounded risk if and only if $\alpha \ge q/(q-1)$, or equivalently $q \ge \alpha/(\alpha-1)$. For $q = 1$ (bounded domain, e.g., Huber), the requirement is $\alpha > 1$ (finite mean). For $q = 2$ (quadratic conjugate, i.e., squared loss), the requirement is $\alpha \ge 2$ (finite variance).
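The following reduced-scale simulation sketch illustrates the dichotomy empirically (isotropic design so the $\Sigma_x$-norm is Euclidean, ridge penalty, plain gradient descent for the Huber objective, and the conventional threshold $k = 1.345$; none of these choices come from the paper's experiments):

```python
import numpy as np

# Ridge-regularized Huber vs ridge least squares under t(1.5) noise.
rng = np.random.default_rng(1)
n, p, k, lam = 400, 200, 1.345, 0.1
beta_star = rng.normal(size=p) / np.sqrt(p)

def psi(r, k):
    # derivative of the Huber loss: r on [-k, k], k*sign(r) outside
    return np.clip(r, -k, k)

risks = {"huber": [], "squared": []}
for _ in range(20):
    X = rng.normal(size=(n, p))
    w = rng.standard_t(df=1.5, size=n)          # infinite-variance noise
    y = X @ beta_star + w

    # ridge least squares, closed form
    b_sq = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

    # ridge Huber via gradient descent (the objective is lam-strongly convex)
    b_h = np.zeros(p)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n + lam)
    for _ in range(2000):
        b_h -= step * (-X.T @ psi(y - X @ b_h, k) / n + lam * b_h)

    risks["squared"].append(np.sum((b_sq - beta_star) ** 2) / p)
    risks["huber"].append(np.sum((b_h - beta_star) ** 2) / p)

for name, r in risks.items():
    print(f"{name:8s} median risk {np.median(r):8.4f}   worst {np.max(r):10.4f}")
# The Huber risks stay tightly clustered; the squared-loss risks are
# typically far larger in the worst replications, driven by a few extreme
# noise draws.
```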
3 Exact risk characterization

This section develops the quantitative predictions that underlie the dichotomy. We begin with the truncation bridge that connects regularly-varying noise to a diverging-variance proxy model, then present the exact risk formulas for the squared-loss class, and conclude with the trichotomy that combines both classes.

3.1 The truncation bridge

Theorem 3.1 (Truncation universality). Under Assumption 2.2, with $\tau_n = n^{1/\alpha}$ and $\sigma_n^2 = \mathbb{E}[(w^{(\tau_n)})^2]$:

(i) $\sigma_n^2 = \frac{2c}{2-\alpha}\, n^{(2-\alpha)/\alpha}\,(1 + o(1))$.

(ii) $n^{-1}\|w^{(\tau_n)}\|^2/\sigma_n^2 \xrightarrow{P} 1$.

Proof of (i). By definition of the Winsorized variable, $\mathbb{E}[(w^{(\tau)})^2] = \mathbb{E}[w^2\mathbf{1}_{\{|w|\le\tau\}}] + \tau^2 P(|w| > \tau)$. Applying integration by parts to the first term with the layer-cake representation yields $\mathbb{E}[w^2\mathbf{1}_{\{|w|\le\tau\}}] = \int_0^\tau 2t\,\bar F(t)\,dt - \tau^2\bar F(\tau)$, where $\bar F(t) = P(|w| > t)$. Adding the Winsorization boundary term $\tau^2\bar F(\tau)$ produces a cancellation:
\[
\mathbb{E}[(w^{(\tau)})^2] = \int_0^\tau 2t\,\bar F(t)\,dt. \tag{4}
\]
This cancellation is specific to Winsorization; simple truncation ($w\mathbf{1}_{\{|w|\le\tau\}}$) retains the boundary term. The integrand $2t\bar F(t) = 2c\,t^{1-\alpha}(1+o(1))$ is regularly varying with index $1-\alpha > -1$, so Karamata's theorem [Bingham et al., 1987, Theorem 1.5.11] gives $\int_0^\tau 2t\bar F(t)\,dt \sim \frac{2c}{2-\alpha}\tau^{2-\alpha}$. Setting $\tau = n^{1/\alpha}$ yields the result. The mean correction is handled by the dominated convergence theorem: since $|w^{(\tau_n)}| \le |w|$ and $\mathbb{E}[|w|] < \infty$ (as $\alpha > 1$), we have $|\mathbb{E}[w^{(\tau_n)}]| \le \mathbb{E}[|w|\mathbf{1}_{\{|w|>\tau_n\}}] \to 0$, so $(\mathbb{E}[w^{(\tau_n)}])^2 = o(1)$ is negligible compared to the diverging second moment.

Proof of (ii). The concentration of $n^{-1}\sum_i (w_i^{(\tau_n)})^2$ around $\sigma_n^2$ requires care, because the standard fourth-moment method is exactly borderline. Indeed, the same integration-by-parts and Karamata argument gives $\mathbb{E}[(w^{(\tau_n)})^4] \sim \frac{4c}{4-\alpha}\, n^{(4-\alpha)/\alpha}$, and the relative variance
\[
n^{-1}\,\frac{\mathbb{E}[(w^{(\tau_n)})^4]}{(\sigma_n^2)^2} \asymp n^{-1 + (4-\alpha)/\alpha - 2(2-\alpha)/\alpha} = n^0 = O(1),
\]
which does not tend to zero. This borderline behavior is not a proof artifact: the ratio $\tau_n^2/\sigma_n^2 \asymp n$ grows linearly, placing the problem at the exact threshold where CLT-type normalization produces non-degenerate fluctuations. We circumvent this obstruction via a truncation-based double-limit argument on the triangular array $\{(w_i^{(\tau_n)})^2/\sigma_n^2\}_{1\le i\le n}$. Define $Y_{n,i} = (w_i^{(\tau_n)})^2/\sigma_n^2 - 1$ and $S_n = n^{-1}\sum_i Y_{n,i}$. For any $M > 0$, decompose $S_n = A_n(M) + B_n(M)$, where $A_n(M) = n^{-1}\sum_i Y_{n,i}\mathbf{1}_{\{|Y_{n,i}|\le M\}}$. The bounded part satisfies $\mathrm{Var}(A_n(M)) \le M^2/n \to 0$ by Chebyshev's inequality. The tail part satisfies $P(|B_n(M)| > \delta) \le \delta^{-1}\mathbb{E}[|Y_{n,1}|\mathbf{1}_{\{|Y_{n,1}|>M\}}]$ by Markov's inequality, and the Karamata structure of the tail ensures that $\limsup_n \mathbb{E}[|Y_{n,1}|\mathbf{1}_{\{|Y_{n,1}|>M\}}] \to 0$ as $M \to \infty$. Choosing $M$ large and then $n$ large completes the argument. Full details are in Appendix A.

3.2 Failure of squared-loss estimation

Theorem 3.2 (Fisher information diverges). Under Assumption 2.2, $\mathcal{I}_n = n/\sigma_n^2 = \Theta(n^{(2\alpha-2)/\alpha}) \to \infty$.

Proof. The per-observation Fisher information for a location family with variance $\sigma_n^2$ satisfies $\mathcal{I}_1 = O(1/\sigma_n^2)$. With $n$ independent observations, $\mathcal{I}_n = n \cdot \mathcal{I}_1 = \Theta(n/\sigma_n^2)$. Since $(2\alpha - 2)/\alpha > 0$ for $\alpha > 1$, the result follows.

Theorem 3.3 (Divergence of $L_2$-estimation without transfer). Under Assumptions 2.1–2.2 with $p/n \to \gamma > 0$ and $\sigma_n^2 \to \infty$:

(i) The OLS estimator (when $\gamma < 1$) has risk $\Theta(\sigma_n^2) \to \infty$.

(ii) Ridge regression with fixed $\lambda > 0$ has risk $\Theta(\sigma_n^2) \to \infty$.

(iii) Any regularized estimator with $\lambda_n = o(\sigma_n^2)$ has diverging risk.

(iv) With noise-adapted regularization $\lambda_n = \tilde\lambda\sigma_n^2$ centered at the origin, the risk converges to $p^{-1}\|\beta^*\|_{\Sigma_x}^2$.

Proof. Parts (i)–(ii) follow from the standard bias–variance decomposition in the proportional regime, where the variance component is proportional to $\sigma_n^2$. Part (iii) follows from the failure of the compactification argument (Lemma 4.1) when $\tilde\lambda = \lambda_n/\sigma_n^2 \to 0$: the regularizer cannot confine the estimator. Part (iv) is a special case of Theorem 3.6 with $\beta_0 = 0$.

Theorems 3.2 and 3.3 together reveal that the failure of squared-loss estimation under diverging noise is not an information-theoretic limitation but a structural vulnerability of the loss function. The data contain sufficient information for consistent estimation; the squared loss simply cannot extract it. This observation is central to the motivation for the Conjugate Domain Dichotomy.
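Part (iv) of Theorem 3.3 is easy to visualize numerically. A sketch, using Gaussian noise of variance $\sigma_n^2$ as a stand-in for the Winsorized proxy model of Section 3.1, and an isotropic design (so the $\Sigma_x$-norm is the Euclidean norm):

```python
import numpy as np

# Ridge centered at the origin with noise-adapted strength
# lam_n = lam_tilde * sigma_n^2: the risk flattens at p^{-1}||beta*||^2.
rng = np.random.default_rng(2)
n, p, lam_tilde = 2000, 1000, 1.0
beta_star = rng.normal(size=p) / np.sqrt(p)     # ||beta*||^2 close to 1

X = rng.normal(size=(n, p))
for sigma2 in [1.0, 10.0, 100.0, 1000.0]:
    y = X @ beta_star + np.sqrt(sigma2) * rng.normal(size=n)
    lam_n = lam_tilde * sigma2
    b = np.linalg.solve(X.T @ X / n + lam_n * np.eye(p), X.T @ y / n)
    print(f"sigma_n^2={sigma2:7.1f}   risk={np.sum((b - beta_star)**2) / p:.5f}")
print("limit p^-1 ||beta*||^2 =", round(np.sum(beta_star**2) / p, 5))
```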
3.3 Noise-adapted regularization and exact risk

To obtain meaningful risk bounds for the squared-loss class, the regularization strength must scale with the noise power.

Definition 3.4 (Noise-adapted regularization). The effective regularization strength is $\tilde\lambda = \lambda_n/\sigma_n^2$, treated as an $O(1)$ constant.

This scaling ensures that the regularization grows in step with the noise, producing a well-defined trade-off between bias and variance in the limit. When $\alpha = 2$ (finite variance), $\sigma_n^2 = \Theta(1)$ and the standard fixed-$\lambda$ regime is recovered.

Theorem 3.5 (Exact risk at finite noise level). Under Assumptions 2.1–2.2, for the transfer-regularized estimator
\[
\hat\beta = \arg\min_{\beta\in\mathbb{R}^p}\ \frac{1}{2n}\|y - X\beta\|^2 + \tilde\lambda\sigma_n^2\, R(\beta - \beta_0)
\]
with coercive convex $R$ and $p/n \to \gamma$, the risk $p^{-1}\|\hat\beta - \beta^*\|_{\Sigma_x}^2$ converges in probability to a deterministic limit $\mathcal{R}$ characterized by the fixed-point equation $\tau^2 = 1 + \gamma\,\mathcal{R}(\tau)$, where
\[
\mathcal{R}(\tau) = \mathbb{E}_{S,\zeta}\left[ S\left( \mathrm{prox}_{\tilde\lambda\sigma_n^2 S^{-1} r}\!\left( \Delta_S - \frac{\sigma_n\tau}{\sqrt{S}}\,\zeta \right) - \Delta_S \right)^{\!2} \right], \tag{5}
\]
with $S \sim \mu$ (the limiting spectral distribution of $\Sigma_x$), $\zeta \sim N(0,1)$ independent, and $\Delta_S$ denoting the projection of $\Delta = \beta^* - \beta_0$ onto the eigenspace corresponding to eigenvalue $S$. For ridge regularization ($R = \frac{1}{2}\|\cdot\|^2$), the risk admits the closed-form Stieltjes representation
\[
\mathcal{R}_{\mathrm{ridge}} = \mu^2\,\mathbb{E}_S\!\left[\frac{S\,\Delta_S^2}{(Sv+\mu)^2}\right] + \frac{\sigma_n^2}{n}\, v\,\mathbb{E}_S\!\left[\frac{S^2}{(Sv+\mu)^2}\right], \tag{6}
\]
where $\mu = \tilde\lambda\sigma_n^2$ and $v$ is the unique positive solution of $v^{-1} = 1 + \gamma\,\mathbb{E}_S[S(Sv+\mu)^{-1}]$.

The proof proceeds through the CGMT framework and is given in Section 4.2 and Appendices B–C.
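To show the mechanics of the fixed point, here is a sketch that evaluates (6) for a spectral measure concentrated at 1 (so $S \equiv 1$); an AR(1) spectrum as in Section 5 would replace the scalar $S$ by an eigenvalue grid, and `delta2` stands in for $p^{-1}\|\Delta\|^2$. The settings are illustrative only:

```python
import numpy as np

# Ridge fixed point and risk formula (6) with S = 1 identically.
def ridge_risk(sigma2_n, n, gamma, lam_tilde, delta2):
    mu = lam_tilde * sigma2_n                  # noise-adapted strength
    v = 1.0
    for _ in range(500):                       # iterate 1/v = 1 + gamma/(v + mu)
        v = 1.0 / (1.0 + gamma / (v + mu))
    bias = mu**2 * delta2 / (v + mu) ** 2      # first term of (6) with S = 1
    var = (sigma2_n / n) * v / (v + mu) ** 2   # second term of (6) with S = 1
    return bias + var

for s2 in [1e-1, 1e1, 1e3, 1e5]:
    r = ridge_risk(s2, n=2000, gamma=0.5, lam_tilde=1.0, delta2=1e-3)
    print(f"sigma_n^2={s2:9.1f}   R_ridge={r:.6f}")
# As sigma_n^2 grows, R_ridge tends to delta2, i.e. to the universal floor
# q_Sigma of Theorem 3.6.
```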
3.4 The universal risk floor

Theorem 3.6 (Universal risk floor). Under Assumptions 2.1–2.2, for any coercive convex $R$ with $\arg\min R = \{0\}$ and any $\tilde\lambda > 0$:
\[
\frac{1}{p}\|\hat\beta - \beta^*\|_{\Sigma_x}^2 \xrightarrow{P} q_\Sigma = \frac{1}{p}\|\beta^* - \beta_0\|_{\Sigma_x}^2.
\]
This limit is independent of the regularizer $R$, the effective strength $\tilde\lambda$, and the noise distribution beyond the tail index $\alpha$.

Proof. In the CGMT auxiliary optimization, the proximal step size for coordinate $j$ in the eigenbasis of $\Sigma_x$ is $\eta_j = \tilde\lambda\sigma_n^2/s_j$, where $s_j$ is the $j$-th eigenvalue of $\Sigma_x$. Since $\sigma_n^2 \to \infty$ and $s_j$ is bounded away from zero (by the compact support of $\mu$), we have $\eta_j \to \infty$ for every $j$. The asymptotic behavior of the proximal operator under diverging step size is well known: for any proper closed convex function $r$ with $\arg\min r = \{0\}$, $\mathrm{prox}_{\eta r}(x) \to 0$ as $\eta \to \infty$ for any $x$ [Bauschke and Combettes, 2017, Proposition 12.29]. In our setting, the argument of the proximal operator is $\Delta_{s_j} - \sigma_n\tau\zeta_j/\sqrt{s_j} = O(\sigma_n)$, while the step size is $O(\sigma_n^2)$. Therefore the proximal output is $O(\sigma_n/\sigma_n^2) = O(1/\sigma_n) \to 0$. Setting $\phi_j = v_j + \Delta_{s_j}$ (the estimator's deviation from the prior in coordinate $j$), we obtain $\phi_j^* \to 0$, i.e., $v_j^* \to -\Delta_{s_j}$. The risk becomes
\[
\frac{1}{p}\sum_j s_j (v_j^*)^2 \to \frac{1}{p}\sum_j s_j \Delta_{s_j}^2 = \frac{1}{p}\|\Delta\|_{\Sigma_x}^2 = q_\Sigma.
\]

The universality of the risk floor across regularizers has a clear interpretation: under diverging noise, the proximal operator with diverging step size maps all inputs to the minimizer of $r$, erasing any dependence on the regularizer's geometry. Ridge, Lasso, elastic net, and group Lasso all produce the same limiting risk. The regularizer matters only in the transient regime at finite noise levels, where it governs the rate of convergence to $q_\Sigma$.
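The proximal collapse is elementary to verify for the two most common penalties. A sketch that mirrors the proof's scaling, with input of size $O(\sigma_n)$ and step size of size $O(\sigma_n^2)$:

```python
import numpy as np

# Ridge prox x/(1 + eta) and lasso prox sign(x) * max(|x| - eta, 0) are
# both driven to zero when eta diverges faster than the input.
def prox_ridge(x, eta):
    return x / (1.0 + eta)

def prox_lasso(x, eta):
    return np.sign(x) * np.maximum(np.abs(x) - eta, 0.0)

for sigma_n in [1.0, 10.0, 100.0, 1000.0]:
    x, eta = 3.0 * sigma_n, sigma_n**2       # input O(sigma_n), step O(sigma_n^2)
    print(f"sigma_n={sigma_n:7.1f}   prox_ridge={prox_ridge(x, eta):9.5f}   "
          f"prox_lasso={prox_lasso(x, eta):9.5f}")
# Both outputs are O(1/sigma_n): the two penalties disagree at small
# sigma_n (the transient regime) but share the same collapse to 0, which
# is why ridge and lasso share the same risk floor.
```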
3.5 The loss–risk trichotomy

Combining the bounded-domain result (Theorem 2.5(B)) with the exact squared-loss characterization (Theorems 3.5–3.6) and the Huber analysis yields the following trichotomy.

Theorem 3.7 (Loss–risk trichotomy). Under Assumptions 2.1–2.2:

(a) Squared loss without transfer. The risk either diverges or converges to $p^{-1}\|\beta^*\|_{\Sigma_x}^2 > 0$.

(b) Huber loss without transfer. The risk converges to $\mathcal{R}_H(k, \lambda, \gamma, \mu) < \infty$, which does not depend on $\sigma_n^2$ but satisfies $\mathcal{R}_H > 0$ in the proportional regime.

(c) Squared loss with transfer. The risk converges to $q_\Sigma = p^{-1}\|\beta^* - \beta_0\|_{\Sigma_x}^2$. When the prior is informative, $q_\Sigma \ll \mathcal{R}_H$.

The trichotomy reveals that two distinct penalties impede estimation under infinite-variance noise. The heavy-tail penalty arises from the interaction between the loss function's conjugate geometry and the noise moment structure; it is eliminated by any loss with bounded conjugate domain. The high-dimensional penalty $\mathcal{R}_H > 0$ arises from the proportional regime $p/n \to \gamma > 0$; it persists even when the effective noise is bounded and is bypassed only when external structural information (the transfer prior) is available.

Remark 3.8 (When the Huber estimator dominates). If the transfer prior is poor ($q_\Sigma > \mathcal{R}_H$), the Huber estimator strictly dominates the transfer-regularized squared-loss estimator. The practitioner therefore faces a genuine trade-off: accurate prior information yields lower risk via transfer, but absent such information, a robust loss function provides a safer alternative.

3.6 Design universality

Theorem 3.9 (Design universality). All results of Theorems 2.5–3.7 hold when Assumption 2.1 is relaxed to allow $Z_{ij}$ to be independent and identically distributed sub-Gaussian random variables with $\mathbb{E}[Z_{ij}] = 0$, $\mathbb{E}[Z_{ij}^2] = 1$, and $\|Z_{ij}\|_{\psi_2} \le K$.

Proof. The result follows from the universality framework of Montanari and Saeed [2022]. The five conditions of their Corollary 3 are verified in Appendix E: (i) the squared loss has locally Lipschitz gradient; (ii) the feasible set is compact by Lemma 4.1; (iii) the regularizer is locally Lipschitz on bounded sets; (iv) the design $x_i = \Sigma_x^{1/2}\bar x_i$ with sub-Gaussian $\bar x_i$ satisfies the pointwise normality condition; (v) the Winsorized noise is bounded (hence sub-Gaussian) and independent of the design. The key observation is that the universality mechanism acts on the design matrix, while the noise passes through unchanged regardless of its magnitude or distribution.

4 Proofs

4.1 Proof architecture for the Conjugate Domain Dichotomy

The proof of Theorem 2.5 proceeds through the CGMT in three stages.

Stage I: Compactification. The following lemma ensures that the optimizer remains bounded.

Lemma 4.1 (Compactification). Under noise-adapted regularization $\lambda_n = \tilde\lambda\sigma_n^2$ with coercive convex $R$, there exists a deterministic constant $C > 0$, independent of $n$, such that $P(\|\hat\beta - \beta^*\|_{\Sigma_x} > C) \to 0$.

Proof. The objective at $v = 0$ equals $(2n\sigma_n^2)^{-1}\|w^{(\tau_n)}\|^2 + \tilde\lambda R(\Delta)$, which converges in probability to $\frac{1}{2} + \tilde\lambda R(\Delta)$ by Theorem 3.1(ii). On the sphere $\|v\|_{\Sigma_x} = C$, the coercivity of $R$ ensures $\tilde\lambda R(v + \Delta) > \frac{1}{2} + \tilde\lambda R(\Delta)$ for $C$ sufficiently large. Since the minimizer achieves a value at most the baseline, it must lie inside the ball.

An important structural observation emerges from the proof: the quadratic confinement provided by the design matrix, $(2n\sigma_n^2)^{-1}\|Xv\|^2 \to \|v\|_{\Sigma_x}^2/(2\sigma_n^2) \to 0$, vanishes under the $\sigma_n^2$ normalization. The regularizer is therefore the sole mechanism confining the estimator under diverging noise.

Stage II: CGMT reduction. With the feasible set compact, the CGMT [Thrampoulidis et al., 2015] replaces the bilinear form $u^\top Z(\Sigma_x^{1/2}v)$ in the Primary Optimization with independent Gaussian vectors $g \sim N(0, I_n)$ and $h \sim N(0, \Sigma_x)$, producing the Auxiliary Optimization. The noise vector $w^{(\tau_n)}$ enters both formulations identically as an additive term; it is not affected by Gordon's comparison inequality, which operates exclusively on the Gaussian bilinear form.

Stage III: Scalarization and the moment mechanism. In the AO, optimizing over the dual variable $u$ yields coordinate-wise updates. For bounded $\mathrm{dom}(L^*)$: each $u_i$ is constrained to $[-K, K]$, so $u_i^* = \mathrm{clip}(w_i^{(\tau_n)} + \|\tilde v\| g_i, -K, K)$. The effective noise $n^{-1}\|u^*\|^2 \le K^2$ is bounded deterministically. Meanwhile, $n^{-1}\sum_i |w_i^{(\tau_n)}| \xrightarrow{P} \mathbb{E}[|w|] < \infty$ by the law of large numbers and the dominated convergence theorem (since $|w^{(\tau_n)}| \le |w|$ and $\mathbb{E}[|w|] < \infty$ for $\alpha > 1$). The AO scalarization produces a fixed-point system with all terms $O(1)$, yielding finite risk. For unbounded $\mathrm{dom}(L^*)$ (e.g., $L^*(u) = u^2/2$): $u_i^* = w_i^{(\tau_n)} + \|\tilde v\| g_i$ is unconstrained, so $n^{-1}\|u^*\|^2 \asymp \sigma_n^2 \to \infty$. After $\sigma_n^2$ normalization, the AO reduces to the fixed-point system of Theorem 3.5, with the universal floor following from the proximal collapse of Theorem 3.6. Full details are in Appendices B–D.
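The Stage III moment mechanism can be replayed in miniature: with the dual clipped to $[-K, K]$, the empirical noise coupling settles at a finite limit bounded by $K\,\mathbb{E}|w|$, while the unconstrained coupling tracks the diverging empirical second moment. A sketch on raw $t(1.5)$ noise (without the Winsorization step, to show the raw effect):

```python
import numpy as np

# Bounded vs unbounded dual coupling under infinite-variance noise.
rng = np.random.default_rng(3)
K = 1.0
for n in [10**4, 10**5, 10**6]:
    w = rng.standard_t(df=1.5, size=n)
    u_bounded = np.clip(w, -K, K)             # dual confined: dom(L*) = [-K, K]
    print(f"n={n:>8}   bounded coupling={np.mean(u_bounded * w):8.3f}   "
          f"unbounded coupling={np.mean(w * w):12.1f}")
# The bounded column stabilizes (it is dominated by K * mean|w|); the
# unbounded column keeps growing, up to heavy-tailed fluctuations across
# seeds, since its population value is infinite.
```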
[Figure 1: Risk as a function of noise scale for three squared-loss estimators. OLS and fixed-$\lambda$ ridge exhibit power-law divergence; transfer ridge is constant at $q_\Sigma \approx 0.001$.]

4.2 Proof of the exact risk formula

The derivation of Theorem 3.5 follows the CGMT scalarization after Stages I–II above. The Primary Optimization, after normalization by $\sigma_n^2$, takes the form
\[
\min_{v\in S_C}\ \max_{u\in\mathbb{R}^n}\ \frac{u^\top w^{(\tau_n)}}{n\sigma_n} - \frac{u^\top Z\Sigma_x^{1/2}v}{n\sigma_n} - \frac{\|u\|^2}{2n} + \tilde\lambda R(v + \Delta).
\]
Gordon's comparison replaces the bilinear term, and the empirical quantities concentrate by Theorem 3.1 and standard Gaussian concentration. After optimizing over $u$ in closed form and passing to the deterministic limit (using uniform convergence on the compact set $S_C$), the risk is characterized by the fixed-point system (5). For ridge regression, the proximal operator reduces to the linear map $x \mapsto x/(1+\eta)$, yielding the companion Stieltjes representation (6). Full details appear in Appendix C.

5 Numerical experiments

We present Monte Carlo simulations that verify the theoretical predictions. All experiments use $n = 2000$ observations, $p = 1000$ covariates ($\gamma = 0.5$), an AR(1) covariance structure $(\Sigma_x)_{ij} = 0.5^{|i-j|}$, and Student-$t$ noise with 1.5 degrees of freedom ($\alpha = 1.5$). The signal $\beta^*$ has 10% non-zero entries; the transfer prior $\beta_0 = \beta^* + \Delta$ has misalignment $\|\Delta\| = 1$. Each point represents the average of 500 independent replications.

Figure 1 displays the structural tension of Section 1.2: the OLS and fixed-regularization ridge estimators exhibit power-law growth in risk as the noise scale increases, while the transfer-regularized ridge estimator remains constant at $q_\Sigma$.

Figure 2 illustrates the universal risk floor of Theorem 3.6: both ridge and Lasso transfer-regularized estimators converge to the same limit $q_\Sigma$, confirming that the regularizer geometry is irrelevant in the diverging-noise limit.

[Figure 2: Transfer ridge and transfer Lasso estimators converge to the same universal risk floor $q_\Sigma$ as the noise scale increases.]

Figure 3 validates the finite-noise-level predictions of Theorem 3.5. The theoretical risk curve, computed from the Stieltjes fixed-point system (6), is overlaid on the empirical risk from 500-trial Monte Carlo simulations across 25 values of the effective noise variance $\sigma_n^2$. The median relative error between theory and simulation is 0.03%.

[Figure 3: Finite-$\sigma_n$ risk of transfer ridge: theoretical prediction (solid) versus Monte Carlo simulation (dots, 500 trials each). Median relative error: 0.03%.]

Figure 4 displays the complete trichotomy of Theorem 3.7. The squared-loss estimators (OLS, fixed ridge) diverge; the Huber estimator ($k = 1.5$, $\lambda = 0.1$, no transfer prior) plateaus at $\mathcal{R}_H \approx 0.19$; and the transfer ridge estimator achieves $q_\Sigma \approx 0.001$. The ratio $\mathcal{R}_H/q_\Sigma \approx 179$ quantifies the gap between the two finite-risk regimes.

[Figure 4: The loss–risk trichotomy. OLS and fixed ridge diverge; the Huber estimator plateaus at $\mathcal{R}_H \approx 0.19$; transfer ridge achieves $q_\Sigma \approx 0.001$.]
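For readers wishing to reproduce the qualitative picture without the full 500-replication experiment, here is a reduced-scale sketch (one replication per noise scale, $n = 400$, $p = 200$; the settings loosely follow Section 5, but the code is not the authors'):

```python
import numpy as np

# Fixed ridge vs transfer ridge under rescaled t(1.5) noise, AR(1) design.
rng = np.random.default_rng(4)
n, p, lam_tilde = 400, 200, 1.0
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
chol = np.linalg.cholesky(Sigma)
beta_star = np.where(rng.uniform(size=p) < 0.1, rng.normal(size=p), 0.0)
delta = rng.normal(size=p)
beta0 = beta_star + delta / np.linalg.norm(delta)       # misalignment ||Delta|| = 1
q_sigma = (beta_star - beta0) @ Sigma @ (beta_star - beta0) / p

X = rng.normal(size=(n, p)) @ chol.T                    # rows ~ N(0, Sigma)
A = X.T @ X / n
for s in [1.0, 10.0, 100.0]:
    y = X @ beta_star + s * rng.standard_t(df=1.5, size=n)
    # fixed ridge (lambda = 1, centered at the origin)
    b_fix = np.linalg.solve(A + np.eye(p), X.T @ y / n)
    # transfer ridge: lambda_n = lam_tilde * s^2, centered at beta0
    b_tr = beta0 + np.linalg.solve(A + lam_tilde * s**2 * np.eye(p),
                                   X.T @ (y - X @ beta0) / n)
    r_fix = (b_fix - beta_star) @ Sigma @ (b_fix - beta_star) / p
    r_tr = (b_tr - beta_star) @ Sigma @ (b_tr - beta_star) / p
    print(f"s={s:6.1f}   fixed ridge={r_fix:12.5f}   transfer ridge={r_tr:9.5f}")
print(f"floor q_Sigma = {q_sigma:.5f}")
# Fixed ridge degrades as s grows; transfer ridge stays at or below
# q_Sigma and approaches it from below.
```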
6 Discussion

6.1 The two-penalty decomposition

The trichotomy of Theorem 3.7 reveals that two structurally distinct penalties impede estimation under infinite-variance noise. The heavy-tail penalty is a property of the loss–noise interaction: it arises when the conjugate domain of the loss is unbounded, exposing the estimator to a diverging noise moment. It is eliminated by adopting any loss function with bounded conjugate domain. The high-dimensional penalty $\mathcal{R}_H > 0$ is a property of the proportional regime $p/n \to \gamma > 0$: it persists even when the effective noise is bounded and reflects the fundamental cost of estimating $p$ parameters from $n$ observations. This penalty is bypassed only by incorporating external structural information through a transfer prior.

This decomposition provides a reusable diagnostic: when a practitioner observes poor estimation performance under heavy-tailed data, the first question is whether the degradation originates from the loss function (addressable by changing to a robust loss) or from the high-dimensional geometry (addressable only through additional information).

6.2 Open problems

Several directions for future investigation emerge from this work.

Optimal Huber threshold. The risk $\mathcal{R}_H(k)$ depends on the threshold parameter $k$ of the Huber loss. Characterizing the optimal threshold $k^*(\alpha, \gamma)$ as a function of the tail index and the aspect ratio is an important optimization problem with direct practical implications.

Non-convex regularizers. The CGMT framework requires convexity of both the loss and the regularizer. Extending the analysis to non-convex penalties such as SCAD or MCP, which are widely used in practice, would require alternative tools.

Heavy-tailed covariates. When the design matrix itself has heavy-tailed entries, the sub-Gaussian assumption (Theorem 3.9) is violated. Developing comparison inequalities for heavy-tailed random matrices remains a substantial open problem.

Intermediate growth rates. The moment hierarchy of Remark 2.7 predicts that losses with conjugate growth $|u|^q$ for $q \in (1, 2)$ interpolate between the Huber and squared-loss behaviors. A quantitative verification of this prediction for specific loss functions is a natural next step.

Acknowledgments

The author thanks the anonymous referees for their feedback. Simulation code is available upon request.

A Truncation universality: full details

The complete proof of Theorem 3.1(ii), including the double-limit argument on the triangular array, is given in the main text (Section 3.1). We provide here additional details on the Karamata bound used in the tail-part estimate. For the truncated variables $Y_{n,i} = (w_i^{(\tau_n)})^2/\sigma_n^2 - 1$, the tail expectation $\mathbb{E}[|Y_{n,1}|\mathbf{1}_{\{|Y_{n,1}|>M\}}]$ is bounded using the structure of the regularly-varying tail. Setting $a_n(M) = ((M-1)\sigma_n^2)^{1/2}$, Karamata's theorem applied to the ratio of partial integrals gives
\[
\int_{a_n}^{\tau_n} 2t\,\bar F(t)\,dt \Big/ \int_0^{\tau_n} 2t\,\bar F(t)\,dt \;\to\; (a_n/\tau_n)^{2-\alpha},
\]
which vanishes as $n \to \infty$ for each fixed $M$ since $a_n/\tau_n \asymp M^{1/2} n^{-1/2} \to 0$. This establishes that $\limsup_n \mathbb{E}[|Y_{n,1}|\mathbf{1}_{\{|Y_{n,1}|>M\}}] \le h(M)$ with $h(M) \to 0$.

B CGMT reduction and compactification

The CGMT framework of Thrampoulidis et al. [2015] requires three conditions: (i) the feasible set is compact (provided by Lemma 4.1); (ii) the objective is convex–concave (by construction); (iii) the random matrix has i.i.d. Gaussian entries. The noise vector $w^{(\tau_n)}$ enters the Primary Optimization as a fixed additive vector and is identical in both the PO and the Auxiliary Optimization; it does not interact with Gordon's comparison inequality. The compactification proof (Lemma 4.1) is given in full in Section 4.1. The structural observation that the design's quadratic confinement vanishes under the $\sigma_n^2$ normalization is discussed there.

C AO convergence and the universal floor

The AO scalarization and the derivation of the fixed-point system proceed as follows.
After optimizing over the dual variable $u$ in closed form, the AO reduces to a minimization over $v$ involving three empirical quantities: $n^{-1}\|w^{(\tau_n)}\|^2/\sigma_n^2$, $n^{-1}\|g\|^2$, and $g^\top w^{(\tau_n)}/(\sqrt{n}\,\sigma_n)$. All three concentrate (the first by Theorem 3.1(ii), the second by the law of large numbers, the third by conditional Gaussianity), so the empirical AO converges uniformly on $S_C$ to a deterministic limit. The minimizer of the deterministic limit satisfies the fixed-point system (5).

The proof of the universal risk floor (Theorem 3.6) is given in full in Section 3.4. The fixed-point uniqueness at leading order follows from the observation that $\mathcal{R}(\tau) \to q_\Sigma$ uniformly in $\tau$ on compact sets (since the proximal operator with diverging step size kills the $\tau$-dependence), so the equation $\tau^2 = 1 + \gamma q_\Sigma$ has the unique positive solution $\tau^* = \sqrt{1 + \gamma q_\Sigma}$.

D Conjugate Domain Dichotomy: proof details

The proof of Theorem 2.5 is presented in Section 4.1. We provide here the verification that $n^{-1}\sum_i |w_i^{(\tau_n)}|$ converges. Since $|w_i^{(\tau_n)}| \le |w_i|$ pointwise and $\mathbb{E}[|w|] < \infty$ (as $\alpha > 1$), the dominated convergence theorem gives $\mathbb{E}[|w^{(\tau_n)}|] \to \mathbb{E}[|w|]$. By the strong law of large numbers for the i.i.d. bounded random variables $|w_i^{(\tau_n)}|$ (bounded by $\tau_n$, with convergent expectations),
\[
n^{-1}\sum_{i=1}^n |w_i^{(\tau_n)}| \xrightarrow{a.s.} \mathbb{E}[|w|] < \infty.
\]
This ensures that the noise coupling under bounded conjugate domain is $O(1)$.

E Design universality: condition verification

We verify the five conditions of Montanari and Saeed [2022, Corollary 3]:

(1) Loss function. The squared loss $\ell(u; y) = (y - u)^2/2$ has locally Lipschitz gradient $\nabla_u \ell = -(y - u)$, satisfying Assumption 1′ of Montanari and Saeed [2022].

(2) Compact constraint set. $S_C = \{\beta : \|\beta - \beta^*\|_{\Sigma_x} \le C\}$ is compact by Lemma 4.1.

(3) Regularizer. After $\sigma_n^2$ normalization, $\tilde\lambda R(\beta - \beta_0)$ is locally Lipschitz on bounded sets with $O(1)$ constant.

(4) Sub-Gaussian design. $x_i = \Sigma_x^{1/2}\bar x_i$ with i.i.d. sub-Gaussian $\bar x_{ij}$ satisfies pointwise normality (Assumption 5 of Montanari and Saeed [2022]).

(5) Noise independence. $w^{(\tau_n)}$ is independent of $X$ and bounded by $\tau_n$ (hence sub-Gaussian). The universality acts on the design, not the noise.

By Montanari and Saeed [2022, Theorem 1], the training error is universal. By Montanari and Saeed [2022, Theorem 3(a)], the test error is universal for strongly convex regularizers; the Lasso case follows by $\varepsilon$-perturbation.
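Condition (4) can also be probed empirically: swapping the Gaussian design for a Rademacher design (entries $\pm 1$, sub-Gaussian) should leave the limiting risk unchanged. A small sketch of this design-universality check (illustrative settings, not from the paper):

```python
import numpy as np

# Transfer-ridge risk under Gaussian vs Rademacher designs at matched settings.
rng = np.random.default_rng(5)
n, p, s = 1000, 500, 30.0
beta_star = rng.normal(size=p) / np.sqrt(p)
beta0 = beta_star + 0.03 * rng.normal(size=p)           # informative prior

def transfer_ridge_risk(X):
    y = X @ beta_star + s * rng.standard_t(df=1.5, size=n)
    lam_n = s**2                                        # noise-adapted strength
    v = np.linalg.solve(X.T @ X / n + lam_n * np.eye(p),
                        X.T @ (y - X @ beta0) / n)
    return np.sum((beta0 + v - beta_star) ** 2) / p

gauss = [transfer_ridge_risk(rng.normal(size=(n, p))) for _ in range(10)]
rade = [transfer_ridge_risk(rng.choice([-1.0, 1.0], size=(n, p))) for _ in range(10)]
print(f"Gaussian design:   {np.mean(gauss):.6f}")
print(f"Rademacher design: {np.mean(rade):.6f}")
# The two designs yield closely matching risks, as Theorem 3.9 predicts.
```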
References

Urte Adomaitytė, Emanuele Defilippis, Bruno Loureiro, and Gabriele Sicuro. High-dimensional robust regression under heavy-tailed data. arXiv preprint arXiv:2309.16476v2, 2024.

Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132:428–446, 2020. doi: 10.1016/j.neunet.2020.08.022.

Zhidong Bai and Jack W. Silverstein. Spectral Analysis of Large Dimensional Random Matrices. Springer, 2nd edition, 2010. doi: 10.1007/978-1-4419-0661-8.

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020. doi: 10.1073/pnas.1907378117.

Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2nd edition, 2017. doi: 10.1007/978-3-319-48311-5.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019. doi: 10.1073/pnas.1903070116.

N. H. Bingham, C. M. Goldie, and J. L. Teugels. Regular Variation, volume 27 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1987. doi: 10.1017/CBO9780511721434.

Yehonathan Dar and Richard G. Baraniuk. Double double descent: On generalization errors in transfer learning between linear regression tasks. SIAM Journal on Mathematics of Data Science, 4(4):1447–1472, 2022. doi: 10.1137/21M1458788.

Edgar Dobriban and Stefan Wager. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247–279, 2018. doi: 10.1214/17-AOS1549.

Noureddine El Karoui, Derek Bean, Peter J. Bickel, Chinghway Lim, and Bin Yu. On robust regression with high-dimensional predictors. Proceedings of the National Academy of Sciences, 110(36):14557–14562, 2013. doi: 10.1073/pnas.1307842110.

Yehoram Gordon. On Milman's inequality and random subspaces which escape through a mesh in R^n. In Geometric Aspects of Functional Analysis, volume 1317 of Lecture Notes in Mathematics, pages 84–106. Springer, 1988.

Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949–986, 2022. doi: 10.1214/21-AOS2133.

Hong Hu and Yue M. Lu. Universality laws for high-dimensional learning with random features. IEEE Transactions on Information Theory, 69(3):1932–1964, 2023. doi: 10.1109/TIT.2022.3217698.

Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964. doi: 10.1214/aoms/1177703732.

Peter J. Huber and Elvezio M. Ronchetti. Robust Statistics. John Wiley & Sons, 2nd edition, 2009. doi: 10.1002/9780470434697.

Sai Li, T. Tony Cai, and Hongzhe Li. Transfer learning for high-dimensional linear regression. Journal of the Royal Statistical Society: Series B, 84(1):149–173, 2022. doi: 10.1111/rssb.12479.

Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190, 2019. doi: 10.1007/s10208-019-09427-x.

Vladimir A. Marchenko and Leonid A. Pastur. Distribution of eigenvalues for some sets of random matrices. Matematicheskii Sbornik, 1(4):457–483, 1967. doi: 10.1070/SM1967v001n04ABEH001994.

Song Mei and Andrea Montanari. The generalization error of random features regression. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022. doi: 10.1002/cpa.22008.

Andrea Montanari and Basil Saeed. Universality of empirical risk minimization. In Proceedings of the 35th Conference on Learning Theory (COLT), volume 178 of PMLR, 2022.

Sidney I. Resnick. Heavy-Tail Phenomena: Probabilistic and Statistical Modeling. Springer, 2007. doi: 10.1007/978-0-387-45024-7.

Gennady Samorodnitsky and Murad S. Taqqu. Stable Non-Gaussian Random Processes. Chapman & Hall, 1994.

Christos Thrampoulidis, Samet Oymak, and Babak Hassibi. Regularized linear regression: A precise analysis of the estimation error. In Proceedings of the 28th Conference on Learning Theory (COLT), pages 1683–1709, 2015.
Christos Thrampoulidis, Samet Oymak, and Babak Hassibi. Precise error analysis of regularized M-estimators in high dimensions. IEEE Transactions on Information Theory, 64(8):5592–5628, 2018. doi: 10.1109/TIT.2018.2840720.

Alexander Tsigler and Peter L. Bartlett. Benign overfitting in ridge regression. Journal of Machine Learning Research, 24(123):1–76, 2023.
