On the Generalization and Robustness in Conditional Value-at-Risk



Dinesh Karthik Mulumudi, Dept. of Mathematics, IISER, Pune. dineshkarthikforml@gmail.com
Piyushi Manupriya, Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore. piyushim@iisc.ac.in
Gholamali Aminian, The Alan Turing Institute, Greater London, England, UK. gaminian@turing.ac.uk
Anant Raj, Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore. anantraj@iisc.ac.in

February 23, 2026

Abstract

Conditional Value-at-Risk (CVaR) is a widely used risk-sensitive objective for learning under rare but high-impact losses, yet its statistical behavior under heavy-tailed data remains poorly understood. Unlike expectation-based risk, CVaR depends on an endogenous, data-dependent quantile, which couples tail averaging with threshold estimation and fundamentally alters both generalization and robustness properties. In this work, we develop a learning-theoretic analysis of CVaR-based empirical risk minimization under heavy-tailed and contaminated data. We establish sharp, high-probability generalization and excess risk bounds under minimal moment assumptions, covering fixed hypotheses, finite and infinite classes, and extending to β-mixing dependent data; we further show that these rates are minimax optimal. To capture the intrinsic quantile sensitivity of CVaR, we derive a uniform Bahadur-Kiefer type expansion that isolates a threshold-driven error term absent in mean-risk ERM and essential in heavy-tailed regimes. We complement these results with robustness guarantees by proposing a truncated median-of-means CVaR estimator that achieves optimal rates under adversarial contamination. Finally, we show that CVaR decisions themselves can be intrinsically unstable under heavy tails, establishing a fundamental limitation on decision robustness even when the population optimum is well separated. Together, our results provide a principled characterization of when CVaR learning generalizes and is robust, and when instability is unavoidable due to tail scarcity.

1 Introduction

Statistical Learning Theory (SLT) provides the standard framework for analyzing generalization in learning algorithms. Its central paradigm, Empirical Risk Minimization (ERM), selects a hypothesis h ∈ H by minimizing the empirical risk, yielding generalization under suitable complexity control Shalev-Shwartz and Ben-David [2014]. Classical ERM, however, is risk-neutral and optimizes expected loss, which can be inadequate in high-stakes domains such as finance, healthcare, and safety-critical learning, where rare but extreme losses dominate performance Howard and Matheson [1972], Shen et al. [2014], Cardoso and Xu [2019].

These limitations have motivated the study of risk-sensitive learning objectives that emphasize tail behavior, including Entropic Risk, Mean-Variance, and Conditional Value-at-Risk (CVaR). From a learning-theoretic standpoint, several works have begun to investigate generalization guarantees for such objectives. For example, generalization bounds for tilted empirical risk minimization under finite-moment assumptions were derived in Aminian et al., while Lee et al. [2021] studied CVaR and related risk measures under bounded-loss and light-tailed assumptions.
However, these settings do not fully capture the regimes in which CVaR is most relevant in practice, where data are often heavy-tailed and may exhibit temporal dependence. This gap motivates a systematic study of CVaR generalization beyond bounded or light-tailed models, which we address in this work by developing guarantees under heavy-tailed losses for both i.i.d. and dependent data.

Another key challenge specific to CVaR-based learning is structural. CVaR depends on an endogenous, data-dependent threshold, the Value-at-Risk (VaR), which couples tail averages with empirical quantile estimation. Unlike smooth risk functionals, this coupling creates an intrinsic interaction between tail fluctuations and threshold estimation error. As a consequence, standard generalization analyses that treat the objective as a fixed functional of the loss distribution are insufficient. Controlling CVaR generalization therefore requires explicitly accounting for how fluctuations of the empirical quantile propagate into fluctuations of the tail risk. To address this issue, we develop a refined empirical-process analysis based on uniform Bahadur-Kiefer-type expansions Bahadur [1966], Kiefer [1967], adapted to the CVaR setting to capture the effects of this endogenous threshold.

Robustness presents a closely related challenge and arises at multiple levels. At the functional level, although CVaR is often viewed as robust due to its emphasis on tail outcomes, its dependence on an empirical quantile can amplify sensitivity to perturbations near the tail. At the estimator level, robustness under heavy-tailed data and adversarial contamination has been extensively studied through trimmed means and median-of-means estimators Lugosi and Mendelson [2019a,b, 2021], yielding sharp guarantees for estimation primitives and robust ERM, but not directly for tail-based risk functionals such as CVaR. At the decision level, the stability of CVaR-optimal solutions under distributional perturbations remains comparatively less understood.

In this work, we study both the generalization and robustness properties of CVaR-based learning under heavy-tailed and contaminated data. On the generalization side, we derive non-asymptotic excess risk guarantees under minimal moment assumptions and complement them with refined empirical-process tools that explicitly account for quantile sensitivity. On the robustness side, we analyze the stability of CVaR objectives and solutions under distributional perturbations. Together, these results clarify the statistical regimes in which empirical CVaR reliably approximates its population counterpart, as well as settings in which instability is intrinsic.

Contributions: We make the following contributions in this paper:

• Sharp generalization theory for heavy-tailed CVaR (Sections 3.1 and 3.2). We derive high-probability generalization and excess risk bounds for empirical CVaR minimization under minimal (1 + λ)-moment assumptions (0 < λ ≤ 1), covering fixed hypotheses and finite and infinite classes via VC/pseudo-dimension and Rademacher complexity. The bounds explicitly characterize the dependence on the tail level α, the tail exponent, hypothesis complexity, and sample size, extend CVaR theory beyond bounded and sub-Gaussian regimes, and are shown to be minimax optimal. We further extend the theory to β-mixing dependent data with matching lower bounds.
• Uniform Bahadur-Kiefer expansions for CVaR (Section 3.3). We establish the first uniform Bahadur-Kiefer-type expansion for CVaR that explicitly accounts for its endogenous, data-dependent threshold. The result captures the nonstandard coupling between empirical quantile fluctuations and tail averaging, yielding a sharp decomposition of CVaR error into a classical empirical process term and an additional threshold-driven component governed by local tail geometry. This identifies a second-order source of generalization error that is absent in mean-risk ERM and essential in heavy-tailed regimes.

• Robust CVaR-ERM under adversarial contamination (Section 4.2). We propose a robust CVaR-ERM estimator based on truncation and median-of-means aggregation applied to the Rockafellar-Uryasev lift, jointly robustifying estimation over the decision variable and the endogenous threshold. We prove non-asymptotic excess risk guarantees under heavy-tailed losses and oblivious adversarial contamination, with rates that optimally decompose into a statistical term and an unavoidable contamination-dependent term. This provides the first learning-theoretic robustness guarantees for CVaR-ERM without bounded-loss assumptions.

• Fundamental limits of decision robustness (Section 4.3). We analyze the stability of CVaR decisions themselves and show that, unlike mean-risk ERM, CVaR minimization can be intrinsically unstable under heavy-tailed losses. Even with a unique and well-separated population minimizer, a single observation can flip the empirical CVaR-ERM decision with polynomially small but unavoidable probability. This establishes a sharp impossibility result for decision robustness in tail-scarce regimes.

1.1 Related Work

Optimized certainty equivalents include expectation, CVaR, entropic, and mean-variance risks. Early generalization bounds were obtained under bounded-loss assumptions Lee et al. [2021], and later extended to unbounded losses via tilted ERM under bounded (1 + λ)-moment conditions, achieving rates O(n^{−λ/(1+λ)}) Li et al. [2021]. For CVaR, concentration inequalities under light- and heavy-tailed losses were established in Kolla et al. [2019], Prashanth et al. [2019] under a strictly increasing CDF assumption, but learning-theoretic generalization and robustness guarantees in the genuinely heavy-tailed regime remain largely open. Our fixed-hypothesis analysis builds on truncation-based techniques Brownlees et al. [2015], Prashanth et al. [2019], with extensions to finite and infinite hypothesis classes via standard learning-theoretic tools Shalev-Shwartz and Ben-David [2014], and matching minimax lower bounds obtained through classical constructions Tsybakov [2008]. Related work on CVaR and spectral risk learning under heavy tails Holland and Haress [2021, 2022] is restricted to finite-variance i.i.d. settings. Results on robust mean estimation under heavy tails and contamination Laforgue et al. [2021], de Juan and Mazuelas [2025] focus on estimation primitives rather than tail-based risks, while learning under heavy-tailed dependence has primarily addressed expected-risk objectives Roy et al. [2021], Shen et al. [2026]. Minimax guarantees for quantile-based risks largely assume light-tailed regimes El Hanchi et al. [2024].

The asymptotic behavior of sample quantiles is classically characterized by the Bahadur and Bahadur-Kiefer representations.
Bahadur [1966] and Kiefer [1967] established linear and uniform expansions that form a foundation of empirical process theory Van der Vaart [2000], but rely on smoothness and light-tail assumptions. These results do not directly apply to CVaR, where the quantile is endogenous and coupled with tail averages. Our work develops a uniform Bahadur-Kiefer expansion tailored to CVaR, explicitly accounting for the random active set induced by the data-dependent threshold and allowing for heavy-tailed losses.

On the robustness side, robust ERM under weak moment assumptions has been studied in Mathieu and Minsker [2021]. Trimmed-mean and median-of-means estimators Lugosi and Mendelson [2019a,b] admit finite-sample and minimax-optimal guarantees Oliveira et al. [2025], Oliveira and Resende [2025], Lugosi and Mendelson [2021], but do not directly address tail-based risk functionals such as CVaR.

2 Background and Problem Formulation

2.1 Conditional Value-at-Risk (CVaR)

Let L be a real-valued random variable representing the loss of a decision or portfolio. For a tail probability α ∈ (0, 1), the Value-at-Risk at level α is defined as the upper (1 − α)-quantile of L:

VaR_α(L) := inf{ t ∈ R : P(L ≤ t) ≥ 1 − α }.

Equivalently, P(L > VaR_α(L)) ≤ α. The Conditional Value-at-Risk at level α is defined as the expected loss in the α-tail of the distribution:

CVaR_α(L) := E[ L | L ≥ VaR_α(L) ],

whenever the conditional expectation is well-defined. More generally, CVaR admits the variational representation Rockafellar and Uryasev [2000]:

CVaR_α(L) = inf_{t ∈ R} { t + (1/α) E[(L − t)_+] },

which remains valid without continuity assumptions and is particularly convenient for learning and optimization. Unlike VaR, CVaR is a coherent risk measure: it is monotone, translation invariant, positively homogeneous, and subadditive. In particular, CVaR is convex in the underlying loss distribution, which makes it amenable to empirical risk minimization and generalization analysis.

2.2 Setup, Notation and Assumptions

Let H be a hypothesis class and let ℓ(h, x) ≥ 0 be a non-negative loss of hypothesis h ∈ H on example x. Let Z_1, . . . , Z_n be i.i.d. samples drawn from a distribution P, and let P_n = (1/n) Σ_{i=1}^n δ_{Z_i} denote the empirical measure. We define the Rockafellar-Uryasev (RU) population and empirical objectives as

Φ_P(h, θ) = θ + (1/α) E_P[(ℓ(h, Z) − θ)_+]  and  Φ_{P_n}(h, θ) = θ + (1/(nα)) Σ_{i=1}^n (ℓ(h, Z_i) − θ)_+.

Define the population Conditional Value-at-Risk (CVaR) at level α ∈ (0, 1) by

R^P_α(h) = inf_{θ ∈ R} { θ + (1/α) E[(ℓ(h, Z_1) − θ)_+] }   (1)

and its empirical counterpart

R̂^P_α(h) = inf_{θ ∈ R} { θ + (1/(αn)) Σ_{i=1}^n (ℓ(h, Z_i) − θ)_+ },   (2)

where (e)_+ ≡ max{e, 0} (Rockafellar and Uryasev [2002]). We omit the superscript denoting the underlying distribution whenever it is clear from context. Our objective is to learn the population CVaR minimizer h* ∈ arg min_{h ∈ H} R_α(h), and we denote by ĥ ∈ arg min_{h ∈ H} R̂_α(h) the empirical CVaR minimizer.
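Because the empirical RU objective in (2) is piecewise linear and convex in θ, its infimum is attained at an empirical upper α-quantile, so R̂_α(h) can be computed exactly by sorting. The following minimal sketch (ours, for illustration only; the Pareto losses and all constants are assumptions, not the paper's) cross-checks the sorted-tail formula against a direct grid search over θ.

```python
import numpy as np

def empirical_cvar(losses, alpha):
    """Empirical CVaR at tail level alpha via the Rockafellar-Uryasev
    representation (2): inf_theta theta + mean((L - theta)_+) / alpha.
    The infimum is attained at the empirical upper alpha-quantile."""
    L = np.sort(np.asarray(losses, dtype=float))
    n = len(L)
    theta = L[int(np.ceil((1 - alpha) * n)) - 1]   # empirical VaR_alpha
    return theta + np.maximum(L - theta, 0.0).mean() / alpha

# Cross-check against a brute-force grid search over the RU threshold theta.
rng = np.random.default_rng(0)
L = rng.pareto(2.5, size=10_000)                   # illustrative heavy-tailed losses
alpha = 0.05
grid = np.linspace(0.0, L.max(), 2_000)
ru = grid + np.array([np.maximum(L - t, 0.0).mean() for t in grid]) / alpha
print(empirical_cvar(L, alpha), ru.min())          # should agree up to grid error
```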
Assumption 2.1 (Bounded (1 + λ)-th Moment). There exist a constant M ∈ R_+ and λ ∈ (0, 1) such that the loss function (h, Z) ↦ ℓ(h, Z) satisfies E[(ℓ(h, Z))^{1+λ}] ≤ M uniformly for all h ∈ H, where Z = X × Y is the instance space.

Assumption 2.2 (β-mixing dependence). The process {Z_i}_{i ∈ Z} is strictly stationary and β-mixing with exponentially decaying coefficients: there exist constants c_0, c_1, γ > 0 such that

β(k) := sup_{t ∈ Z} β( σ(Z_{−∞}^t), σ(Z_{t+k}^∞) ) ≤ c_0 e^{−c_1 k^γ},  k ≥ 1.

Assumption 2.3 (Hypothesis class complexity). The hypothesis class H has finite pseudo-dimension: Pdim(H) ≤ d < ∞.

Assumption 2.4 (Statistical complexity of the truncated class). For any truncation level B > 0, define the truncated function class F_B = { min(ϕ(·; h, θ), B) : h ∈ H, θ ∈ Θ }. We assume that for any probability measure Q, the uniform covering number satisfies

log N(F_B, ∥·∥_{L_2(Q)}, u) ≤ d log(C_0 B / u)  for all u ∈ (0, B].

3 Generalization in CVaR

This section establishes non-asymptotic generalization guarantees for Conditional Value-at-Risk (CVaR) under heavy-tailed losses, ranging from fixed-hypothesis concentration to sharp excess risk bounds for infinite classes, with matching minimax lower bounds and extensions to dependent data. We further develop a random active-set theory for CVaR and derive uniform Bahadur-Kiefer expansions that explicitly capture the role of the endogenous threshold. The core difficulty is structural: CVaR depends on tail behavior through a data-dependent quantile, requiring joint control of tail fluctuations and threshold instability.

3.1 Generalization with i.i.d. Data

We begin with the i.i.d. setting. The first step is to understand how the empirical CVaR concentrates around its population counterpart for a fixed hypothesis. The key technical ingredient is a concentration inequality for heavy-tailed random variables with finite (1 + λ)-moment, as developed in Brownlees et al. [2015], Prashanth et al. [2019]. Applying these tools to the Rockafellar-Uryasev variational representation of CVaR yields the following fixed-hypothesis deviation bound.

Theorem 3.1. Under Assumption 2.1, for any fixed h ∈ H and δ ∈ (0, 1), with probability at least 1 − δ,

R̂_α(h) − R_α(h) ≤ (2/α) M^{1/(1+λ)} ( log(2/δ) / n )^{λ/(1+λ)}.   (3)

For a finite H, a union bound extends the guarantee to all hypotheses, yielding for the empirical minimizer ĥ, with probability at least 1 − δ,

R_α(ĥ) − R_α(h*) ≤ (4/α) M^{1/(1+λ)} ( log(4|H|/δ) / n )^{λ/(1+λ)}.   (4)

While the finite-class bound follows directly from a union bound, it does not capture the true statistical complexity of learning when H is infinite. We therefore turn to localized complexity arguments that yield dimension-dependent rates without incurring unnecessary logarithmic penalties. For a fixed natural number n ≥ 1, consider the space {0, 1}^n endowed with the Hamming metric, and let N denote its packing number.

Theorem 3.2 (Localized VC bounds for heavy-tailed CVaR). Assume H has pseudo-dimension d < ∞ and Assumption 2.1 holds. Then for any δ ∈ (0, 1), with probability at least 1 − δ,

R_α(ĥ) − R_α(h*) ≤ (C_λ/α) M^{1/(1+λ)} ( (d log n + log(1/δ)) / n )^{λ/(1+λ)},   (5)

where C_λ > 0 depends only on λ. Moreover, this rate is minimax optimal: there exist universal constants c, C > 0 (depending only on λ) such that for any α ∈ (0, 1), M > 0, λ ∈ (0, 1], and n ≥ C (log N)/α,

inf_A sup_{P ∈ P(M,λ)} E[ R_α(A(S); P) − R_α(h*_P; P) ] ≥ c (M^{1/(1+λ)}/α) ( (log N) / n )^{λ/(1+λ)}.   (6)

Discussion: This theorem shows that empirical CVaR minimization achieves the optimal tradeoff between model complexity, sample size, and tail heaviness. Importantly, the rate matches the minimax lower bound up to constants, demonstrating that no improvement is possible without stronger assumptions.
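As a purely illustrative sanity check of the n^{−λ/(1+λ)} scaling (ours, not an experiment from the paper), one can track fixed-hypothesis CVaR deviations for Pareto-type losses whose (1 + λ)-th moment is finite; the tail index, sample sizes, and replication counts below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, lam = 0.1, 0.5
a = 1.0 + lam + 0.05     # Lomax tail index just above 1 + lam, so E[L^(1+lam)] < inf

def empirical_cvar(L, alpha):
    L = np.sort(L)
    theta = L[int(np.ceil((1 - alpha) * len(L))) - 1]
    return theta + np.maximum(L - theta, 0.0).mean() / alpha

# Large-sample proxy for the population CVaR of the Lomax(a) loss.
pop = empirical_cvar(rng.pareto(a, size=4_000_000), alpha)

for n in [250, 1_000, 4_000, 16_000]:
    dev = [abs(empirical_cvar(rng.pareto(a, size=n), alpha) - pop)
           for _ in range(200)]
    # Theorem 3.1 predicts high-probability deviations of order n**(-lam/(1+lam)).
    print(n, round(float(np.quantile(dev, 0.9)), 4), round(n ** (-lam / (1 + lam)), 4))
```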
Truncated Empirical CVaR Estimation. Although empirical CVaR minimization is statistically optimal, finite-sample instability can arise from extreme losses. We therefore introduce a truncated CVaR estimator that limits the influence of large observations while preserving optimal rates and remaining amenable to first-order optimization (see the sketch following the takeaway below). Truncation simultaneously stabilizes estimation and enables standard complexity-based analysis, with the induced bias controlled under finite-moment assumptions.

Theorem 3.3 (High-probability generalization bound). Suppose Assumption 2.1 holds. Let R_n(ℓ ∘ H) be the Rademacher complexity of the loss class and fix δ ∈ (0, 1). If we set the truncation level B = (Mn)^{1/(1+λ)}, then with probability at least 1 − δ,

R_α(ĥ_B) − R_α(h*) ≤ (4/α) R_n(ℓ ∘ H) + (C_{λ,δ}/α) M^{1/(1+λ)} (1/n)^{λ/(1+λ)},   (7)

where the constant C_{λ,δ} is given by

C_{λ,δ} = 1/λ + √(8 log(1/δ)) + (2/3) log(1/δ).   (8)

Takeaway: Truncated empirical CVaR achieves optimal high-probability excess risk bounds while offering improved robustness to heavy-tailed noise.
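A minimal sketch of the truncated estimator (ours; it treats the moment constant M as known, which in practice would itself have to be estimated) instantiates the truncation level B = (Mn)^{1/(1+λ)} from Theorem 3.3:

```python
import numpy as np

def truncated_empirical_cvar(losses, alpha, lam, M):
    """Empirical CVaR computed on losses capped at B = (M * n)**(1 / (1 + lam)),
    the truncation level of Theorem 3.3; capping bounds the influence of any
    single extreme observation while keeping the bias of order M / B**lam."""
    L = np.asarray(losses, dtype=float)
    n = len(L)
    B = (M * n) ** (1.0 / (1.0 + lam))
    Lc = np.sort(np.minimum(L, B))
    theta = Lc[int(np.ceil((1 - alpha) * n)) - 1]   # truncated empirical VaR_alpha
    return theta + np.maximum(Lc - theta, 0.0).mean() / alpha
```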
3.2 Generalization with Dependent Data

We now extend the analysis to dependent observations. Let {Z_i}_{i ∈ Z} be a strictly stationary stochastic process, and suppose only a finite segment Z_1, . . . , Z_n is observed. Dependence reduces the effective sample size and requires refined concentration arguments.

Theorem 3.4 (Upper bound for CVaR under β-mixing heavy-tailed data). Under Assumptions 2.1-2.4, for any δ ∈ (0, 1), with probability at least 1 − δ − n^{−O(1)},

sup_{h ∈ H} | R̂_α(h) − R_α(h) | ≤ (C_{λ,γ}/α) M^{1/(1+λ)} ( (d log n + log(1/δ)) / (n/log n) )^{λ/(1+λ)},

and consequently, for the empirical CVaR minimizer ĥ,

R_α(ĥ) − R_α(h*) ≤ (2C_{λ,γ}/α) M^{1/(1+λ)} ( (d log n + log(1/δ)) / (n/log n) )^{λ/(1+λ)},

where C_{λ,γ} > 0 depends only on λ, γ and the mixing constants.

We also establish a matching minimax lower bound. Let P(λ, M, β) be the class of strictly stationary β-mixing distributions with β(k) ≤ c_0 e^{−c_1 k^γ} and sup_{h ∈ H} E[ℓ(h, Z)^{1+λ}] ≤ M, and assume VCdim(H) ≥ d ≥ 1. Then for any α ∈ (0, 1) and n satisfying n ≥ C_0 d/α,

inf_{ĥ} sup_{P ∈ P} E_P[ R_α(ĥ) − R_α(h*_P) ] ≥ (c/α) M^{1/(1+λ)} ( d / (n/log n) )^{λ/(1+λ)},

where C_0, c > 0 depend only on λ, γ and the mixing constants.

Discussion. Dependence introduces an unavoidable logarithmic degradation in the effective sample size, but the fundamental tail-dependent rate remains unchanged. The matching lower bound shows this loss is information-theoretically necessary.

3.3 Random Active-Set Theory and Uniform Bahadur-Kiefer Expansions for CVaR

While the preceding results bound uniform CVaR deviations under heavy tails, they do not explain how these fluctuations affect the empirical minimizer. The challenge is structural: CVaR depends on an endogenous quantile, coupling threshold estimation with tail averaging. We make this coupling explicit via uniform Bahadur-Kiefer expansions for CVaR-ERM, revealing two distinct error sources: one from tail moments and another from the local geometry at the CVaR threshold.

Fix α ∈ (0, 1). For h ∈ H and θ ∈ R, we write X_h := ℓ(h, Z) and X_{h,i} := ℓ(h, Z_i), and we define the population RU threshold

θ⋆(h) := inf{ θ ∈ R : P(X_h > θ) ≤ α },

and let the empirical RU threshold be the minimal empirical minimizer

θ̂_n(h) := inf{ θ ∈ R : P_n(X_h > θ) ≤ α },  P_n := (1/n) Σ_{i=1}^n δ_{Z_i}.

By convexity of θ ↦ Φ_{P_n}(h, θ), θ̂_n(h) is always a minimizer of Φ_{P_n}(h, ·). We also assume that for all h ∈ H, P(X_h > θ⋆(h)) = α (equivalently, P(X_h = θ⋆(h)) = 0). Before stating the main theorem, we collect the required assumptions.

Assumption 3.5 (Uniform tail-indicator deviation). There exist constants V ≥ 1 and C_0 > 0 such that for all δ ∈ (0, 1), with probability at least 1 − δ,

sup_{h ∈ H} sup_{θ ∈ R} | P_n(X_h > θ) − P(X_h > θ) | ≤ ε_n(δ) := C_0 √( (V log(en) + log(2/δ)) / n ).
whenever the tail map θ 7→ P ( X h > θ ) crosses the level α without a flat region at the minimizing threshold. When P ( X h > θ ⋆ ( h )) < α , the set of minimizers of θ 7→ Φ( h, θ ) can b e non-singleton (a “flat” region in θ ), and b θ n ( h ) selects an endp oin t of the empirical minimizer set. The correction term captures precisely this endogenous selection effect at first order. 4 Robustness of CV aR This section dev elops a robustness theory for CV aR under heavy-tailed losses with only finite-moment control. Robustness arises at three levels: functional robustness of CV aR under distributional p erturbations; estimator robustness, pro viding uniform guaran tees for empirical CV aR under sampling noise and contamination; and de ci- sion robustness, which concerns stability of the CV aR minimizer and is fundamen tally shap ed by the endogenous R U threshold. W e treat these in turn, highlighting how threshold sensitivity and tail scarcity induce intrinsic instabilit y . 7 4.1 F unctional Robustness W e first analyze CV aR as a distributional functional, yielding metric-dependent contin uit y results indep endent of estimation. Under geometric metrics (e.g., W asserstein), stability follows from loss regularit y in z , whereas under w eaker metrics (e.g., Lévy–Prokhorov or total v ariation), finite momen ts imply only tail-Hölder con tinuit y with exp onen t set b y the momen t order. This distinction separates intrinsic (functional) robustness from estimation- induced effects. Prop osition 4.1 (T ail-Hölder robustness of CV aR ) . Fix α ∈ (0 , 1) , λ ∈ (0 , 1] , and set κ α = (1 − α ) − 1 . L et P , Q b e distributions on Z and ℓ : H × Z → R . F or fixe d h ∈ H , let L P = ℓ ( h, Z ) , Z ∼ P , and L Q = ℓ ( h, Z ′ ) , Z ′ ∼ Q . Assume E P [ | L P | λ +1 ] ≤ M p , E Q [ | L Q | λ +1 ] ≤ M p . Given, W r ( P , Q ) and π ( P , Q ) ar e r -W asserstein and L évy-Pr okhor ov metric b etwe en P and Q r esp e ctively, then (i) W asserstein stability under Hölder losses. If ℓ ( θ, · ) is β -Hölder in z uniformly over θ with c onstant L β and r ≥ β , then for al l P , Q with finite r -moments, sup h ∈H   R P α ( h ) − R Q α ( h )   ≤ κ α L β W r ( P , Q ) β . (ii) L évy-Pr okhor ov tail-Hölder c ontinuity. If Z = R m and ℓ ( h, · ) is L h -Lipschitz, then for π ( P , Q ) ≤ ε ,   R P α ( h ) − R Q α ( h )   ≤ κ α L h  2 ε + 2 M 1 / ( λ +1) p ε λ/ ( λ +1)  , and the tail term ε λ 1+ λ is unavoidable. Discussion. P art (i) shows that when the loss is regular in z , CV aR is stable in W asserstein distance under finite moments. P art (ii) reveals an una voidable degradation under weak er p erturbations: small distributional c hanges can shift the upp er tail enough to alter CV aR at a Hölder rate ε λ/ ( λ +1) . This tail-scarcity effect also go verns estimator and decision robustness, where stability depends on mass near the threshold. An analogous Hölder bound holds in total v ariation distance (Prop osition 4.2). 4.2 Estimator Robustness In this part, w e dev elops a robust estimator theory for CV aR under hea vy-tailed losses with only finite-momen t con trol. Building on the previous section, w e next study robustness at the lev el of the CV aR obje ctive o v er a h yp othesis class. Here the goal is to b ound the worst-case sensitivity sup h ∈H | R P α ( h ) − R Q α ( h ) | , whic h captures b oth distributional robustness and the effect of adversarial p erturbations when Q is a contaminated v ersion of P . 
The next theorem provides suc h a b ound under total v ariation p erturbations when the contaminated distribution Q also has similar moment cotrol as P and establishes that the resulting exp onent is minimax optimal under finite-momen t con trol. Prop osition 4.2 (Robustness of CV aR ) . Under Assumption 2.1, for any α ∈ (0 , 1) , sup h ∈H   R P α ( h ) − R Q α ( h )   ≤ C α d TV ( P , Q ) λ 1+ λ , (12) wher e d TV ( P , Q ) = sup A | P ( A ) − Q ( A ) | and C = (2 M ) 1 1+ λ  1 + 1 λ  . Mor e over, this dep endenc e is minimax optimal. Discussion. Prop osition 4.2 characterizes the optimal sensitivit y of CV aR to total v ariation p erturbations under hea vy tails. Because total v ariation is b ounded, this result combines directly with our generalization bounds to yield excess risk guarantees under domain shift Aminian et al.. The Hölder exp onen t λ/ (1 + λ ) reflects the in trinsic limitation imp osed by finite-momen t control, and the matc hing minimax low er b ound shows that no sharp er dep endence on d TV ( P , Q ) is achiev able without stronger tail assumptions. 8 Robust Estimation under Adv ersarial Con tamination Unlik e the previous setting—where the contami- nated distribution retained a finite (1 + λ ) -momen t—adversarial contamination may in tro duce arbitrarily hea vy tails. W e therefore study estimator robustness under oblivious adversarial c ontamination where ϵn datap oints ha ve b een corrupted, deriving uniform generalization guaran tees via a truncated median-of-means (MoM) con- struction. Our analysis sim ultaneously controls heavy-tailed sampling v ariability and adversarial bias b y com bining truncation of extreme losses with median aggregation across blo cks. These ideas are applied to the Ro ck afel- lar–Ury asev lift, treating ( h, θ ) jointly and robustifying the empirical estimation of E [( ℓ ( h, Z ) − θ ) + ] through blo c kwise truncation and MoM aggregation. The resulting bound decomposes in to a statistical term driv en b y complexity and effective sample size, and a con tamination term prop ortional to ϵ , b oth scaling with the optimal hea vy-tail exponent λ/ (1 + λ ) . Theorem 4.3 (Robust Generalization Bound) . Supp ose Assumptions 2.1 and 2.4 hold. L et γ ∈ (0 , 1 / 2) b e fixe d such that ϵ ≤ 1 / 2 − γ . Supp ose the sample size satisfies n ≥ C d log n γ 2 . With pr ob ability at le ast 1 − δ : sup h ∈H | b R α ( h ) − R α ( h ) | ≤ C 1  M ϕ d log n n  λ 1+ λ + C 2 M 1 1+ λ ϕ ϵ λ 1+ λ ≍ 1 α  M d log n n  λ 1+ λ + ( M ϵ ) λ 1+ λ ! . Her e C 1 , C 2 , M and M ϕ ar e universal c onstants dep ending only on λ . Discussion. Theorem 4.3 clarifies the robustness–generalization tradeoff under contamination: the first term ac hieves the optimal hea vy-tail statistical rate (up to logs and complexity), while the second captures the unav oid- able loss from an ϵ -fraction of adv ersarial corruption. The final ≍ form highlights the intrinsic 1 /α amplification of tail effects induced by the R U lift. On the Algorithmic Asp ect : W e construct a robust CV aR estimator via a truncated median-of-means (T-MoM) sc heme for hea vy-tailed data with oblivious adv ersarial con tamination. The sample is partitioned in to blo c ks, the Ro ck afellar–Ury asev loss is truncated within eac h blo ck to control v ariance and adversarial bias, and blockwise estimates are aggregated b y a median, yielding robustness to both heavy tails and an ϵ -fraction of corrupted samples. 
The CV aR estimate is obtained b y minimizing this robust ob jective ov er the auxiliary threshold. While statistically principled, this approach is not directly implementable in high dimensions: MoM tourna- men ts o v er η -nets are feasible only in lo w-dimensional settings. W e therefore fo cus on statistical guarantees and defer algorithmic issues to App endix H. 4.3 Decision Robustness W e no w study de cision r obustness for CV aR minimization, focusing on the stability of the argmin rather than the ob jective v alue. This is subtle for CV aR, as its endogenous, distribution-dep endent threshold can amplify small p erturbations into decision-lev el changes. W e denote the p opulation solution set and the (p ossibly set-v alued) Ro c k afellar–Uryasev threshold minimizers (deined in section 2.2) by S ( P ) := arg min h ∈H R P α ( h ) , T ⋆ ( h, P ) := arg min θ ∈ R Φ P ( h, θ ) Decision robustness concerns the con tinuit y of S ( P ) under p erturbations of P . Unlike mean- risk ERM, CV aR introduces an endogenous threshold θ ⋆ ( h, P ) ∈ T ⋆ ( h, P ) determined by the upper α -tail, whose stabilit y is the primary driver of decision robustness. Endogenous Threshold Stability: Fix h and write X := ℓ ( h, Z ) under P . Define ϕ P ( θ ) := Φ P ( h, θ ) = θ + α − 1 E [( X − θ ) + ] . The characterization 0 ∈ ∂ ϕ P ( θ ) yields θ ∈ T ⋆ ( h, P ) ⇐ ⇒ P ( X > θ ) ≤ α ≤ P ( X ≥ θ ) . (13) Th us T ⋆ ( h, P ) is the α -upp er-quantile set of X . 9 Definition 4.4 (Quan tile margin and generalized density-at-quan tile) . Assume T ⋆ ( h, P ) is a singleton { θ ⋆ ( h, P ) } and write θ ⋆ for short. Define the lo cal mass near the threshold κ h,P ( r ) := P  | ℓ ( h, Z ) − θ ⋆ | ≤ r  , r > 0 , and the generalized densit y-at-quantile (quantile margin parameter) m α ( h, P ) := lim inf r ↓ 0 κ h,P ( r ) r ∈ [0 , ∞ ] . The next theorem shows that a p ositiv e quantile margin guarantees stabilit y of the threshold under weak distributional perturbations, while its absence leads to in trinsic instabilit y . Theorem 4.5 (Threshold stability under Lévy-Prokhoro v p erturbations) . Fix h and assume T ⋆ ( h, P ) = { θ ⋆ ( h, P ) } is a singleton. L et Q b e another law for X = ℓ ( h, Z ) and let d LP ( P , Q ) denote the L évy-Pr okhor ov distanc e b etwe en these one-dimensional laws. If m α ( h, P ) ≥ m 0 > 0 , then for al l sufficiently smal l δ := d LP ( P , Q ) , | θ ⋆ ( h, Q ) − θ ⋆ ( h, P ) | ≤ C δ /m 0 . (14) Conversely, if m α ( h, P ) = 0 , then for every δ > 0 ther e exists Q with d LP ( P , Q ) ≤ δ such that T ⋆ ( h, Q ) c ontains ˜ θ with | ˜ θ − θ ⋆ | ≥ c 0 (an O (1) jump). Remark. Theorem 4.5 reveals an in trinsic fragility of CV aR: when the loss distribution has negligible mass near the V aR, the optimal threshold can change discontin uously under arbitrarily small p erturbations, regardless of sample size. Influence F unction of the CV aR Decision: T o quantify how thre shold instabilit y propagates to decisions, w e analyze the RU stationarity system and deriv e the influence function of the CV aR minimizer under gross-error con tamination, treating ( h, θ ) join tly . Define the stationarity map H ( h, θ ; P ) :=  ∇ h Φ P ( h, θ ) ∂ θ Φ P ( h, θ )  . (15) When P ( ℓ ( h, Z ) = θ ) = 0 , the deriv ative in θ is classical: ∂ θ Φ P ( h, θ ) = 1 − 1 α P ( ℓ ( h, Z ) > θ ) , and ∇ h Φ P ( h, θ ) = 1 α E P  ∇ h ℓ ( h, Z ) 1 { ℓ ( h, Z ) > θ }  . 
W e imp ose a lo cal regularity condition that is strictly weak er than global strong conv exity and fits noncon v ex but stable minimizers (e.g. ‘tilt stabilit y”; here we keep a concrete inv ertibilit y condition). Assumption 4.6 (Local strong regularity / inv ertible Jacobian) . Let ( h ⋆ , θ ⋆ ) b e a lo cally unique stationary p oin t: H ( h ⋆ , θ ⋆ ; P ) = 0 . Assume: (i) ℓ ( h, z ) is C 2 in h near h ⋆ for P -a.e. z ; (ii) P ( ℓ ( h ⋆ , Z ) = θ ⋆ ) = 0 ; (iii) the Jacobian J := ∇ ( h,θ ) H ( h ⋆ , θ ⋆ ; P ) exists and is in v ertible. Assumption 4.7 (Lo cal RU stability at the p opulation optimum) . Assume there exists ( h ⋆ , θ ⋆ ) such that (i) h ⋆ ∈ S ( P ) and θ ⋆ ∈ T ⋆ ( h ⋆ , P ) ; (ii) T ⋆ ( h ⋆ , P ) = { θ ⋆ } is a singleton; (iv) the quantile margin m α ( h ⋆ , P ) ≥ m 0 > 0 . Under these assumptions, the influence function admits a closed-form expression and exhibits a sharp depen- dence on the quan tile margin. Theorem 4.8 (Influence function of the CV aR decision: tail supp ort and quantile-margin blow-up) . Assume Assumption 4.6. Consider the gr oss-err or p ath P ε = (1 − ε ) P + ε ∆ z , ε ≥ 0 , and let ( h ε , θ ε ) b e the lo c al ly unique stationary solution to H ( h, θ ; P ε ) = 0 ne ar ( h ⋆ , θ ⋆ ) . Then ε 7→ ( h ε , θ ε ) is differ entiable at ε = 0 and d dε  h ε θ ε      ε =0 = − J − 1  g ( z ; h ⋆ , θ ⋆ ) − E P [ g ( Z ; h ⋆ , θ ⋆ )]  , 10 wher e g ( z ; h, θ ) = 1 α ∇ h ℓ ( h, z ) 1 { ℓ ( h, z ) > θ } 1 − 1 α 1 { ℓ ( h, z ) > θ } ! . (16) Mor e over, the θ -c omp onent sensitivity satisfies the b ound     d dε θ ε     ε =0     ≥ c m α ( h ⋆ , P ) ×    1 { ℓ ( h ⋆ , z ) > θ ⋆ } − P ( ℓ ( h ⋆ , Z ) > θ ⋆ )    , in the sense that the r elevant Jac obian blo ck involves the gener alize d density-at-quantile; henc e the influenc e blows up as m α ( h ⋆ , P ) ↓ 0 . Remark. Theorems 4.5 and 4.8 together identify a precise mechanism: CV aR decisions are stable only when the threshold is stable. The quantile margin simultaneously gov erns threshold contin uit y and the conditioning of the R U stationarit y system. The influence-function c haracterization naturally leads to a notion of lo cal robustness: the largest con tamina- tion lev el under whic h the decision remains within a prescrib ed tolerance. Definition 4.9 (Lo cal decision robustness radius) . Fix a tolerance r > 0 . Under Assumption 4.7, define ε rob ( r ) := sup n ε ∈ [0 , 1) : ∥ h ε − h ⋆ ∥ ≤ r for all con tamination directions ∆ z o , where ( h ε , θ ε ) is the stationary solution under P ε = (1 − ε ) P + ε ∆ z selected near ( h ⋆ , θ ⋆ ) . Corollary 4.10 (Lo cal decision radius via tail-supp orted IF) . Under Assumption 4.7, ther e exists r 0 > 0 such that for al l r ∈ (0 , r 0 ) , ε rob ( r ) ≥ r C ∥ J − 1 ∥ · sup z ∈Z    g ( z ; h ⋆ , θ ⋆ ) − E P [ g ( Z ; h ⋆ , θ ⋆ )]    ! − 1 , (17) with g fr om (16) . Mor e over, as m α ( h ⋆ , P ) ↓ 0 , the b ound de gener ates b e c ause ∥ J − 1 ∥ → ∞ , i.e. the lo c al r obustness r adius c ol lapses at quantile critic ality. Pr o of. By Theorem 4.8, for each fixed direction ∆ z , the decision map ε 7→ h ε is differen tiable at 0 with de riv ativ e h ′ (0) = − [ J − 1 ( · )] h applied to the perturbation vector. Thus for sufficiently small ε , a first-order expansion yields ∥ h ε − h ⋆ ∥ ≤ ε sup z ∥ IF( z ; h ⋆ , P ) ∥ + o ( ε ) . Bounding sup z ∥ IF( z ) ∥ using Theorem 4.8 yields the righ t-hand side of (17). Finally , ∥ J − 1 ∥ → ∞ as m α ↓ 0 by Theorem 4.8, so the radius collapses in the quan tile-critical regime. 
In trinsic Limits (T ail Scarcit y in i.i.d. Data): CV aR dep ends on losses exceeding an endogenous threshold θ ⋆ ( h, P ) determined b y the upp er α -tail. Consequently , the empirical CV aR ob jective can b e dominated by a few rare observ ations near this threshold, even when the population minimizer is strict and well separated. This effect is structural rather than a concentration artifact: under finite p -momen t assumptions, tail exceedances o ccur with p olynomial probability and can induce O (1) p erturbations in the empirical ob jective. The result b elow formalizes this tail-sc ar city instability , showing that, with una voidable p olynomial probability (up to logarithmic factors), a single observ ation can flip the CV aR-ERM decision despite strict population optimalit y . Prop osition 4.11. Fix α ∈ (0 , 1) and 1 < p < 2 . Ther e exist a distribution P on R + with E P [ Z p ] < ∞ and E P [ Z q ] = ∞ for al l q > p , and a two-p oint hyp othesis class H = { h A , h B } with losses ℓ n : H × R + → R + , such that for al l sufficiently lar ge n , CV aR α ( ℓ n ( h A , Z ); P ) < CV aR α ( ℓ n ( h B , Z ); P ) , yet ther e exists a c onstant c > 0 for which Pr D ∼ P ⊗ n  ∃ D ′ with | D △ D ′ | = 1 for which arg min h ∈H b F n ( h ; D )  = arg min h ∈H b F n ( h ; D ′ )  ≥ c n − λ (log n ) 2 , wher e b F n ( h ; D ) = CV aR α ( ℓ n ( h, Z ); P n ) . 11 Remark: This result exp oses a fundamen tal limitation of CV aR-ERM under hea vy-tailed losses: even with a strict and unique p opulation minimizer and finite p -th moments ( 1 < p < 2 ), the empirical CV aR decision map is not uniformly stable. With an unav oidable p olynomial probabilit y of order n 1 − p , a single observ ation can flip the empirical CV aR-ERM solution. The proofs are deferred to Appendix F. 5 Conclusion W e developed a learning-theoretic framew ork for CV aR under hea vy-tailed and contaminated data, addressing generalization, robustness, and decision stability . Our analysis identifies the endogenous quantile as a fundamental source of b oth statistical difficulty and instability , captured via refined empirical-pro cess to ols. T ogether, our results clarify when CV aR learning is reliable and when in trinsic limitations arise, pro viding principled guidance for risk-sensitiv e learning in heavy-tailed regimes. A ckno wledgmen ts Piyushi Manupriy a w as initially supp orted by a grant from Ittiam Systems Priv ate Limited through the Ittiam Equitable AI Lab and then b y ANRF-NPDF (PDF/2025/005277) grant. Anan t Ra j is supported by a gran t from Ittiam Systems Priv ate Limited through the Ittiam Equitable AI Lab, ANRF’s Prime Minister Early Career Gran t (ANRF/ECR G/2024/003259) and Pratiksha T rust’s Y oung In vestigator A w ard. References Gholamali Aminian, Amir R Asadi, Tian Li, Ahmad Beirami, Gesine Reinert, and Samuel N Cohen. Gener- alization and robustness of the tilted empirical risk. In F orty-se c ond International Confer enc e on Machine L e arning . R Ra j Bahadur. A note on quantiles in large samples. The Annals of Mathematic al Statistics , 1966. Christian Brownlees, Emilien Joly , and Gáb or Lugosi. Empirical risk minimization for hea vy-tailed losses. The A nnals of Statistics , 2015. A drian Riv era Cardoso and Huan Xu. Risk-a verse sto c hastic conv ex bandit. In Pr o c e e dings of the Twenty-Se c ond International Confer enc e on Artificial Intel ligenc e and Statistics , 2019. Xabier de Juan and Santiago Mazuelas. 
On the optimalit y of the median-of-means estimator under adversarial con tamination. arXiv pr eprint arXiv:2510.07867 , 2025. A youb El Hanc hi, Chris Maddison, and Murat Erdogdu. Minimax linear regression under the quantile risk. In The Thirty Seventh Annual Confer enc e on L e arning The ory , 2024. Matthew Holland and El Mehdi Haress. Learning with risk-a v erse feedback under p otentially heavy tails. In International Confer enc e on Artificial Intel ligenc e and Statistics , 2021. Matthew J Holland and El Mehdi Haress. Sp ectral risk-based learning using un b ounded losses. In International c onfer enc e on artificial intel ligenc e and statistics , 2022. Ronald A. How ard and James E. Matheson. Risk-sensitive mark o v decision pro cesses. Management Scienc e , 1972. Jac k Kiefer. On bahadur’s representation of sample quan tiles. The Annals of Mathematic al Statistics , 1967. Ra vi Kumar Kolla, LA Prashanth, Sanjay P Bhat, and Krishna Jagannathan. Concen tration b ounds for empirical conditional v alue-at-risk: The un b ounded case. Op er ations R ese ar ch L etters , 2019. Pierre Laforgue, Guillaume Staerman, and Stephan Clémençon. Generalization b ounds in the presence of outliers: a median-of-means study . In International c onfer enc e on machine le arning , 2021. 12 Jaeho Lee, Sejun P ark, and Jinw o o Shin. Learning bounds for risk-sensitiv e learning, 2021. Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. Tilted empirical risk minimization, 2021. Gáb or Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey . F oundations of Computational Mathematics , 2019a. Gab or Lugosi and Shahar Mendelson. Risk minimization by median-of-means tournamen ts. Journal of the Eur op e an Mathematic al So ciety , 2019b. Gab or Lugosi and Shahar Mendelson. Robust multiv ariate mean estimation: the optimalit y of trimmed mean. 2021. Timothée Mathieu and Stanisla v Minsker. Excess risk b ounds in robust empirical risk minimization. Information and Infer enc e: A Journal of the IMA , 2021. Rob erto I Oliveira and Lucas Resende. T rimmed sample means for robust uniform mean estimation and regression. The Annals of Statistics , 2025. Rob erto I Oliveira, P aulo Orenstein, and Zoraida F Rico. Finite-sample prop erties of the trimmed mean. arXiv pr eprint arXiv:2501.03694 , 2025. L. A. Prashanth, K. Jagannathan, and R. K. Kolla. Concentration b ounds for CV aR estimation: The cases of ligh t-tailed and heavy-tailed distributions. arXiv pr eprint arXiv:1901.00997 , 2019. R. T yrrell Ro c k afellar and Stanislav Uryasev. Optimization of conditional v alue-at risk. Journal of R isk , 2000. R. T yrrell Ro c k afellar and Stanisla v Ury asev. Conditional v alue-at-risk for general loss distributions. Journal of Banking & Financ e , 2002. Abhishek Roy , Krishnakumar Balasubramanian, and Murat A Erdogdu. On empirical risk minimization with dep enden t and heavy-tailed data. A dvanc es in Neur al Information Pr o c essing Systems , 2021. Shai Shalev-Shw artz and Shai Ben-David. Understanding Machine L e arning: F r om The ory to Algorithms . Cam- bridge Univ ersity Press, 2014. Yinan Shen, Yic hen Zhang, and W en-Xin Zhou. Sgd with dep endent data: Optimal estimation, regret, and inference. arXiv pr eprint arXiv:2601.01371 , 2026. Y un Shen, Michael J. T obia, T obias Sommer, and Klaus Ob ermay er. Risk-sensitiv e reinforcement learning. Neur al Computation , 2014. Alexandre B. T sybako v. Intr o duction to Nonp ar ametric Estimation . 
Springer Publishing Company , Incorporated, 2008. ISBN 0387790519. Aad W V an der V aart. Asymptotic statistics , v olume 3. Cambridge Universit y Press, 2000. 13 P art App endix T able of Con ten ts A Justification of Assumptions 14 B Generalization in CV aR 15 B.1 Fixed Hyp othesis h . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 B.2 Uniform Bounds for Finite H . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 B.3 CV aR Generalization for Infinite Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B.4 T runcated Empirical CV aR Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 B.5 Generalization with Dep endent Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 C Random activ e-set theory and uniform Bahadur-Kiefer expansions for CV aR 28 D F unctional Robustness 32 E Estimator Robustness 33 E.1 T runcated CV aR Loss based ERM is Optimal . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 E.2 Robust estimation under adversarial con tamination under oblivious adversaries . . . . . . . . . 36 F Decision Robustness 39 F.1 T ail-scarcity instabilit y for CV aR-ERM under finite p -moment . . . . . . . . . . . . . . . . . . 39 G Auxiliary results 42 G.1 Boundedness of CV aR minimizer under a b ounded moment . . . . . . . . . . . . . . . . . . . . 42 G.2 Concen tration b ound for heavy-tailed random v ariables . . . . . . . . . . . . . . . . . . . . . . 44 H On the Algorithmic Aspects of Robust CV aR-ERM Estimator 50 H.1 η -co v er based Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 A Justification of Assumptions W e clarify the v alidity and necessit y of our theoretical assumptions b elo w: 1. Hea vy-T ailed Moments (Assumption 2.1): This assumption relaxes standard requirements for strictly b ounded losses or finite v ariances. By requiring only a finite (1 + λ ) -th momen t, w e accommodate hea vy-tailed distributions where the loss ma y occasionally take v ery large v alues (noting that infinite v ariance is possible if λ < 1 ). This approach allo ws the theory to remain applicable to realistic, risk-sensitive settings and aligns with recen t literature such as Aminian et al. and Lee et al. [2021]. 2. β -Mixing Dependence (Assumption 2.2): The standard i.i.d. assumption is insufficien t for sequential applications suc h as Reinforcemen t Learning, financial forecasting, and signal pro cessing. W e assume β -mixing b ecause it offers a crucial balance betw een mo deling flexibility and mathematical tractabilit y . Sp ecifically , β - mixing facilitates the use of blo c king techniques (e.g., Y u’s method) and coupling to ols (e.g., Berb ee’s Lemma). These tools allow us to decomp ose the dependent sequence into "nearly indep endent" blo cks, thereby enabling the application of standard concentration inequalities. 14 3. Hyp othesis Complexity (Assumption 2.3): The pseudo-dimension is the natural extension of the V apnik- Cherv onenkis (V C) dimension to real-v alued function classes. Assuming a finite pseudo-dimension restricts the "capacit y" of the h yp othesis class H , which is a standard and necessary condition in statistical learning theory to guaran tee uniform con vergence and prev ent ov erfitting. 4. 
Complexit y of T runcated Class (Assumption 2.4): Given the hea vy-tailed nature of the data (Assump- tion 2.1), classical concentration inequalities for bounded v ariables (such as Ho effding’s inequality) are not directly applicable. T o circumv ent this, we emplo y a truncation argument (capping the loss at B ). Assump- tion 2.4 ensures that this truncation op eration do es not artificially "explo de" the complexit y of the hypothesis class, ensuring the co v ering n umbers b eha v e w ell enough to derive generalization bounds. F urther justification for the remaining tec hnical assumptions is provided in the discussion section of the main pap er. B Generalization in CV aR B.1 Fixed Hyp othesis h Lemma B.1 (Fixed-hypothesis low er deviation) . Under Assumption 2.1, with pr ob ability at le ast 1 − δ , b R α ( h ) − R α ( h ) ≤ 2 α M 1 1+ ε  log(2 /δ ) n  ε 1+ ε . (18) Pr o of. of Theorem 3.1,In particular result 3 Fix h ∈ H . Let θ ∗ = arg min θ  θ + 1 α E [( ℓ ( h, Z 1 ) − θ ) + ]  , so R α ( h ) = θ ∗ + 1 α E [( ℓ ( h, Z 1 ) − θ ∗ ) + ] . (19) Let b θ = arg min θ  θ + 1 αn P n i =1 ( ℓ ( h, Z i ) − θ ) +  , so b R α ( h ) = b θ + 1 αn n X i =1 ( ℓ ( h, Z i ) − b θ ) + . The goal is to b ound the follo wing: b R α ( h ) − R α ( h ) = b θ + 1 αn n X i =1 ( ℓ ( h, Z i ) − b θ ) + ! −  θ ∗ + 1 α E [( ℓ ( h, Z 1 ) − θ ∗ ) + ]  . (20) Use population minimizer prop erty , Since θ ∗ minimizes the p opulation CV aR: b R α ( h ) ≤ θ ∗ + 1 αn n X i =1 ( ℓ ( h, Z i ) − θ ∗ ) (21) implies that − b R α ( h ) ≥ − ( θ ∗ + 1 αn n X i =1 ( ℓ ( h, Z i ) − θ ∗ )) (22) No w A dd equation (2) and (5) and multiply by -1 to get. b R α ( h ) − R α ( h ) ≤ 1 α 1 n n X i =1 ( ℓ ( h, Z i ) − θ ∗ ) + − E [( ℓ ( h, Z 1 ) − θ ∗ ) + ] ! . (23) No w w e define Y i = ( ℓ ( h, Z i ) − θ ∗ ) + , so Y i ≥ 0 . Also c heck the momen t condition that: E [ Y 1+ ε i ] ≤ E [ ℓ ( h, Z i ) 1+ ε ] ≤ M , 15 since ( ℓ ( h, Z i ) − θ ∗ ) + ≤ ℓ ( h, Z i ) . Apply Prop osition G.2 to get: 1 n n X i =1 Y i − E [ Y i ] ≤ 2 M 1 1+ ε  log(2 /δ ) n  ε 1+ ε . Th us: b R α ( h ) − R α ( h ) ≤ 2 α M 1 1+ ε  log(2 /δ ) n  ε 1+ ε . Lemma B.2 (Fixed-hypothesis upp er deviation) . Under Assumption 2.1, for any fixe d h ∈ H and δ ∈ (0 , 1) , with pr ob ability at le ast 1 − δ , R α ( h ) − b R α ( h ) ≤ 2 α M 1 / (1+ λ )  log(2 /δ ) n  λ/ (1+ λ ) . Pr o of. The pro of pro ceeds in three steps: reducing the difference to an empirical pro cess supremum, establishing p oin t wise concentration using the momen t assumption, and extending this to a uniform b ound ov er the v ariational parameter θ via a co vering argument. R e duction to empiric al pr o c ess. By Lemma G.3, the deviation of the CV aR risk is bounded b y the supremum of the empirical process indexed b y θ : R α ( h ) − b R α ( h ) ≤ 1 α sup θ ∈ R ( E − P n ) g h,θ , (24) where g h,θ ( z ) := ( ℓ ( h, z ) − θ ) + . It suffices to b ound the term sup θ ( E − P n ) g h,θ . Moment c ontr ol. The Momen t Condition states that E [ ℓ ( h, Z ) 1+ λ ] ≤ M . Since 0 ≤ g h,θ ( z ) ≤ ℓ ( h, z ) holds for all θ ∈ R and z ∈ Z , w e ha ve uniform momen t control ov er the parametric class: E [ g h,θ ( Z ) 1+ λ ] ≤ M , ∀ θ ∈ R . Pointwise c onc entr ation. Fix θ ∈ R . W e apply a standard heavy-tailed concentration inequality (Via trunca- tion and Bernstein’s inequality similar to Theorem G.2) for random v ariables with finite (1 + λ ) -th momen ts. 
F or an y failure probability δ ′ ∈ (0 , 1) , with probabilit y at least 1 − δ ′ : ( E − P n ) g h,θ ≤ C 0 M 1 1+ λ  log(1 /δ ′ ) n  λ 1+ λ . (25) Uniform b ound via c overing. By Lemma G.4, the map θ 7→ ( E − P n ) g h,θ is 2 -Lipsc hitz. T o handle the range of θ , w e use a standard p eeling argument (or random truncation). With probabilit y at least 1 − δ / 2 , we hav e max i ℓ ( h, Z i ) ≤ B n where B n = C 1 ( M n/δ ) 1 1+ λ . W e restrict θ to the compact interv al [0 , B n ] ; outside this in terv al, the suprem um is either zero (if θ is very large) or dominated by the case θ = 0 (since deviations stabilize). Let { θ 1 , . . . , θ N } be an η -net of [0 , B n ] . Then for an y θ in the in terv al: ( E − P n ) g h,θ ≤ ( E − P n ) g h,θ j + 2 η . W e c ho ose the spacing η = M 1 1+ λ ( n − 1 log(2 /δ )) λ 1+ λ . The cov ering n um b er N ≈ B n /η grows polynomially in n . Applying the p oint wise bound (25) on the grid and a union b ound yields that with probabilit y at least 1 − δ : sup θ ∈ R ( E − P n ) g h,θ ≤ C M 1 1+ λ  log(1 /δ ) n  λ 1+ λ . Substituting this upp er bound bac k in to inequalit y (24) completes the proof. 16 Corollary B.3 (T ail bounds for fixed hypothesis) . Under Assumption 2.1, for any fixe d h ∈ H and ϵ > 0 , P  R α ( h ) − b R α ( h ) > ϵ  ≤ 2 exp  − nC λ  αϵ M 1 / (1+ λ )  (1+ λ ) /λ  , wher e C λ = 2 − (1+ λ ) /λ . Pr o of. Start from the fixed- h high-probabilit y inequality: for an y δ ∈ (0 , 1) , R α ( h ) − b R α ( h ) ≤ 2 α M 1 1+ λ  log(2 /δ ) n  λ 1+ λ with probabilit y at least 1 − δ. (26) Let ϵ > 0 and define δ ( ϵ ) := 2 exp  − n  αϵ 2 M 1 / (1+ λ )  1+ λ λ  . (27) W e v erify that this c hoice inv erts the deviation bound. Indeed, solving ϵ = 2 α M 1 1+ λ  log(2 /δ ) n  λ 1+ λ (28) for δ yields  αϵ 2 M 1 / (1+ λ )  1+ λ λ = log(2 /δ ) n , (29) and hence log  2 δ  = n  αϵ 2 M 1 / (1+ λ )  1+ λ λ , (30) whic h implies δ = 2 exp  − n  αϵ 2 M 1 / (1+ λ )  1+ λ λ  = δ ( ϵ ) . (31) Therefore, substituting δ ( ϵ ) in to the fixed- h bound gives P  R α ( h ) − b R α ( h ) > ϵ  ≤ δ ( ϵ ) = 2 exp  − n  αϵ 2 M 1 / (1+ λ )  1+ λ λ  . (32) Finally , observ e that  αϵ 2 M 1 / (1+ λ )  1+ λ λ = 2 − (1+ λ ) /λ  αϵ M 1 / (1+ λ )  1+ λ λ = C λ  αϵ M 1 / (1+ λ )  1+ λ λ , (33) where C λ = 2 − (1+ λ ) /λ . This yields the stated b ound. This also matc hes the rates of Prashan th et al. [2019]. But they assume Assumption 2.1 with a strictly increasing CDF. B.2 Uniform Bounds for Finite H F or a finite H , w e extend the bound to all hypotheses using the union bound. Prop osition B.4. Uniform L ower Derivation With pr ob ability at le ast 1 − δ , sup h ∈H  R α ( h ) − b R α ( h )  ≤ 2 α M 1 1+ λ  log(2 |H| /δ ) n  λ 1+ λ . 17 Pr o of. of result B.4 F or eac h h ∈ H , apply result B.2 with failure probabilit y δ ′ = δ / |H| : R α ( h ) − b R α ( h ) ≤ 2 α M 1 1+ λ  log(2 |H| /δ ) n  λ 1+ λ . b y Union b ound,The probabilit y that all bounds hold is at least 1 − |H| · δ |H| = 1 − δ . Th us: sup h ∈H  R α ( h ) − b R α ( h )  ≤ 2 α M 1 1+ λ  log(2 |H| /δ ) n  λ 1+ λ . Corollary B.5. (Uniform Absolute Err or) With pr ob ability at le ast 1 − δ , for al l h ∈ H :    R α ( h ) − b R α ( h )    ≤ 2 α M 1 1+ λ  log(4 |H| /δ ) n  λ 1+ λ . Pr o of. of result B.5 Com bine the b oth directions to get the absolute error as follow:    R α ( h ) − b R α ( h )    = max n R α ( h ) − b R α ( h ) , b R α ( h ) − R α ( h ) o . 
Use Prop osition B.2 and Proposition B.4 with δ ′ = δ / (2 |H| ) , so log (2 /δ ′ ) = log(4 |H | /δ ) . Both directions hold with probabilit y at least 1 − δ . Lemma B.6. Exc ess Risk Bound Under Assumption 1, with pr ob ability at le ast 1 − δ , R α ( b h ) − R α ( h ∗ ) ≤ 4 α M 1 1+ λ  log(4 |H| /δ ) n  λ 1+ λ . Pr o of. of Theorem 3.1,In particular result 4 R α ( b h ) − R α ( h ∗ ) = h R α ( b h ) − b R α ( b h ) i + h b R α ( b h ) − b R α ( h ∗ ) i + h b R α ( h ∗ ) − R α ( h ∗ ) i . (34) W e bound the middle term b R α ( b h ) ≤ b R α ( h ∗ ) as b elow: R α ( b h ) − R α ( h ∗ ) ≤    R α ( b h ) − b R α ( b h )    +    R α ( h ∗ ) − b R α ( h ∗ )    (35) Apply result 2 to get R α ( b h ) − R α ( h ∗ ) ≤ 2 · 2 α M 1 1+ λ  log(4 |H| /δ ) n  λ 1+ λ . B.3 CV aR Generalization for Infinite Classes W e will b e pro ving Theorem 3.2, in particular result 5 in this subsection. Let θ ∗ b e an optimal CV aR threshold for h ∗ : R α ( h ∗ ) = θ ∗ + 1 α E [( ℓ ( h ∗ , Z ) − θ ∗ ) + ] . Define the exc ess loss function f h ( Z ) = ( ℓ ( h, Z ) − θ ∗ ) + − ( ℓ ( h ∗ , Z ) − θ ∗ ) + . 18 F or truncation level B > 0 , define the trunc ate d exc ess loss f B h ( Z ) = [( ℓ ( h, Z ) − θ ∗ ) + ] B − [( ℓ ( h ∗ , Z ) − θ ∗ ) + ] B , where [ x ] B = min( x, B ) . Pr o of. By definition of R α ( h ) as an infim um, R α ( h ) ≤ θ ∗ + 1 α E [( ℓ ( h, Z ) − θ ∗ ) + ] for the optimal threshold θ ∗ of h ∗ . Since R α ( h ∗ ) = θ ∗ + 1 α E [( ℓ ( h ∗ , Z ) − θ ∗ ) + ] , w e ha ve R α ( h ) − R α ( h ∗ ) ≤ 1 α E [ f h ( Z )] . (36) Fix B > 0 . By the triangle inequalit y , | f h ( Z ) − f B h ( Z ) | ≤   ( ℓ ( h, Z ) − θ ∗ ) + − [( ℓ ( h, Z ) − θ ∗ ) + ] B   +   ( ℓ ( h ∗ , Z ) − θ ∗ ) + − [( ℓ ( h ∗ , Z ) − θ ∗ ) + ] B   . F or x ≥ 0 , w e hav e | x − [ x ] B | = x · 1 { x > B } . Thus, E   ( ℓ ( h, Z ) − θ ∗ ) + − [( ℓ ( h, Z ) − θ ∗ ) + ] B   = E  ( ℓ ( h, Z ) − θ ∗ ) + · 1 { ( ℓ ( h, Z ) − θ ∗ ) + > B }  ≤ E  ℓ ( h, Z ) · 1 { ℓ ( h, Z ) > B }  . W rite E [ ℓ ( h, Z ) · 1 { ℓ ( h, Z ) > B } ] = E [ ℓ ( h, Z ) 1+ λ · ℓ ( h, Z ) − λ · 1 { ℓ ( h, Z ) > B } ] . By Hölder’s inequality with exponents (1 + λ, 1+ λ λ ) : E [ ℓ ( h, Z ) 1+ λ · ℓ ( h, Z ) − λ · 1 { ℓ ( h, Z ) > B } ] ≤  E [ ℓ ( h, Z ) 1+ λ ]  1 / (1+ λ ) ·  E [ 1 { ℓ ( h, Z ) > B } ]  λ/ (1+ λ ) =  E [ ℓ ( h, Z ) 1+ λ ]  1 / (1+ λ ) · P ( ℓ ( h, Z ) > B ) λ/ (1+ λ ) . By Mark ov’s inequality , P ( ℓ ( h, Z ) > B ) ≤ E [ ℓ ( h, Z ) 1+ λ ] B 1+ λ . Therefore, E [ ℓ ( h, Z ) · 1 { ℓ ( h, Z ) > B } ] ≤  E [ ℓ ( h, Z ) 1+ λ ]  1 / (1+ λ ) ·  E [ ℓ ( h, Z ) 1+ λ ] B 1+ λ  λ/ (1+ λ ) = E [ ℓ ( h, Z ) 1+ λ ] B λ ≤ M B λ . The same b ound holds for h ∗ , so E | f h ( Z ) − f B h ( Z ) | ≤ 2 M B λ . (37) Com bining (36) and (37): R α ( h ) − R α ( h ∗ ) ≤ 1 α  E [ f B h ( Z )] + 2 M B λ  . (38) 19 Since E [ f B h ( Z )] ≥ 0 , V ar( f B h ( Z )) ≤ E [( f B h ( Z )) 2 ] . Using ( a − b ) 2 ≤ 2 a 2 + 2 b 2 : ( f B h ( Z )) 2 ≤ 2  [( ℓ ( h, Z ) − θ ∗ ) + ] B  2 + 2  [( ℓ ( h ∗ , Z ) − θ ∗ ) + ] B  2 . Since [( ℓ ( h, Z ) − θ ∗ ) + ] B ≤ B ,  [( ℓ ( h, Z ) − θ ∗ ) + ] B  2 ≤ B · [( ℓ ( h, Z ) − θ ∗ ) + ] B ≤ B · ℓ ( h, Z ) . F or x ≥ 0 , B x ≤ B 1 − λ x 1+ λ (b y AM-GM or Y oung’s inequalit y). Thus, B · ℓ ( h, Z ) ≤ B 1 − λ · ℓ ( h, Z ) 1+ λ . T aking expectations: E  [( ℓ ( h, Z ) − θ ∗ ) + ] B  2  ≤ B 1 − λ E [ ℓ ( h, Z ) 1+ λ ] ≤ M B 1 − λ . Similarly for h ∗ , so V ar( f B h ( Z )) ≤ 4 M B 1 − λ . (39) Also, | f B h ( Z ) | ≤ 2 B uniformly . 
Define $\mathcal{F}_B = \{ f_h^B : h \in \mathcal{H} \}$, a class with pseudo-dimension $O(d)$, envelope $2B$, and variance bound $4MB^{1-\lambda}$. By variance-sensitive VC theory (the local Rademacher complexity bounds of Bartlett et al. [2005]), for $\delta \in (0,1)$, with probability at least $1-\delta$, uniformly over all $h\in\mathcal{H}$,
$$\mathbb{E}[f_h^B(Z)] - \frac{1}{n}\sum_{i=1}^n f_h^B(Z_i) \le C\left( \sqrt{\frac{MB^{1-\lambda}(d\log n + \log(1/\delta))}{n}} + \frac{B(d\log n + \log(1/\delta))}{n} \right).$$
Let $D = d\log n + \log(1/\delta)$. If $\mathbb{E}[f_{\widehat h}^B(Z)] > C\left( \sqrt{MB^{1-\lambda}D/n} + BD/n \right)$, the empirical average $\frac{1}{n}\sum_i f_{\widehat h}^B(Z_i)$ would be positive, contradicting the fact that $\widehat h$ minimizes the empirical risk. Therefore,
$$\mathbb{E}[f_{\widehat h}^B(Z)] \le C\left( \sqrt{\frac{MB^{1-\lambda}D}{n}} + \frac{BD}{n} \right).$$
We now optimize the truncation level $B$ by balancing the main terms:
$$\sqrt{\frac{MB^{1-\lambda}}{n}} \asymp \frac{M}{B^\lambda}.$$
This gives $M^{1/2}B^{(1-\lambda)/2} n^{-1/2} \asymp M B^{-\lambda}$, so $B^{(1-\lambda)/2+\lambda} = B^{(1+\lambda)/2} \asymp M^{1/2} n^{1/2}$, and therefore
$$B \asymp (Mn)^{1/(1+\lambda)} \asymp M^{1/(1+\lambda)} \left( \frac{n}{D} \right)^{1/(1+\lambda)},$$
incorporating the logarithmic factor. At this optimal $B$, using $1 - \frac{1-\lambda}{1+\lambda} = \frac{2\lambda}{1+\lambda}$ and $1 + \frac{1-\lambda}{1+\lambda} = \frac{2}{1+\lambda}$,
$$\sqrt{\frac{MB^{1-\lambda}D}{n}} = \sqrt{ M^{1+\frac{1-\lambda}{1+\lambda}} \left( \frac{D}{n} \right)^{1-\frac{1-\lambda}{1+\lambda}} } = M^{\frac{1}{1+\lambda}} \left( \frac{D}{n} \right)^{\frac{\lambda}{1+\lambda}}.$$
Similarly,
$$\frac{M}{B^\lambda} = \frac{M}{M^{\lambda/(1+\lambda)}(n/D)^{\lambda/(1+\lambda)}} = M^{\frac{1}{1+\lambda}} \left( \frac{D}{n} \right)^{\frac{\lambda}{1+\lambda}},$$
and the linear term $BD/n = M^{1/(1+\lambda)}(D/n)^{\lambda/(1+\lambda)}$ is of the same order, so it is absorbed into the constant. Therefore,
$$R_\alpha(\widehat h) - R_\alpha(h^*) \le \frac{C_\lambda}{\alpha} M^{\frac{1}{1+\lambda}} \left( \frac{d\log n + \log(1/\delta)}{n} \right)^{\frac{\lambda}{1+\lambda}}. \quad (40)$$
This proves Theorem 3.2.
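The bias-variance balancing above is easy to mirror numerically. The following small helper is a sketch with all constants suppressed; the function name and inputs are ours, not part of the formal development. It returns the optimal truncation level and the predicted excess-risk rate of Theorem 3.2.

```python
import numpy as np

def truncation_and_rate(M, lam, n, d, delta):
    """Balance the truncation bias M / B^lam against the variance term
    sqrt(M B^{1-lam} D / n) with D = d log n + log(1/delta); this gives
    B ~ (M n / D)^{1/(1+lam)} and the rate M^{1/(1+lam)} (D/n)^{lam/(1+lam)}."""
    D = d * np.log(n) + np.log(1.0 / delta)
    B = (M * n / D) ** (1.0 / (1.0 + lam))
    rate = M ** (1.0 / (1.0 + lam)) * (D / n) ** (lam / (1.0 + lam))
    return B, rate

print(truncation_and_rate(M=1.0, lam=0.5, n=10_000, d=10, delta=0.05))
```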
B.3.1 Lower Bound for i.i.d. Data

Proof of Theorem 3.2, in particular result 6. The proof sketch is as follows. We construct a finite family $\{P_v\}_{v\in\mathcal{V}}$, $|\mathcal{V}| = N$, of distributions in $\mathcal{P}(M,\lambda)$, together with hypotheses $\{h_v\}_{v\in\mathcal{V}} \subset \mathcal{H}$, such that:
1. For each $v \ne u$, the excess CVaR of $h_u$ under $P_v$ is at least a positive gap $\Delta$, which we express in terms of $p, t, M, \lambda, \alpha$.
2. The pairwise KL divergences $\mathrm{KL}(P_v \| P_u)$ are $\lesssim p$ (a constant times $p$), uniformly in $v, u$.
3. We apply Fano's inequality, with the mutual information bounded via averaging of the pairwise KLs.
Combining these yields the desired minimax lower bound after enforcing the moment constraint, which links $t$ and $p$.

1. Hypercube construction. Let $m = \lceil \log_2 N \rceil$. We construct a packing set $\mathcal{V} \subset \{0,1\}^m$ of size $N$: we select $N$ binary vectors $v$ of equal Hamming weight $m/2$ (assuming $m$ even; odd $m$ requires trivial adjustments) whose pairwise Hamming distances satisfy $d_H(u,v) \ge m/4$ for all distinct $u,v\in\mathcal{V}$. The existence of such a set is guaranteed by the Gilbert-Varshamov bound: the volume of a Hamming ball of radius $m/8$ is exponentially smaller than the number $\binom{m}{m/2}$ of constant-weight vectors, allowing us to pack $N$ such vectors for $m \approx \log_2 N$. For any distinct $u, v \in \mathcal{V}$, let $a_{vu} := \#\{j : v_j = 1, u_j = 0\}$. Since both vectors have weight $m/2$,
$$a_{vu} = \#\{j : v_j = 0, u_j = 1\} = \frac{d_H(u,v)}{2} \ge \frac{m}{8}.$$
We now define a well-defined probability distribution on $\mathcal{Z}$ for each $v \in \mathcal{V}$.

2. Distribution and hypothesis definitions. Fix a constant $\theta \in (0, 1/4)$. Let the sample space be $\mathcal{Z} = \{0, 1, \dots, m\}$, let $p \in (0, \alpha/2]$ be a probability parameter, and let $t > 0$ be a loss magnitude. For each $v \in \mathcal{V}$, define the distribution $P_v$ on $\mathcal{Z}$:
$$P_v(0) = 1 - p, \qquad P_v(j) = \frac{p}{m}\left( 1 + 2\theta(v_j - 1/2) \right), \quad j = 1,\dots,m.$$
Note that $\sum_{j=1}^m P_v(j) = p$ because $\sum_j v_j = m/2$, so $P_v$ is a valid distribution. Define the hypotheses $\{h_v\}_{v\in\mathcal{V}}$ with loss function $\ell$:
$$\ell(h_v, 0) = 0, \qquad \ell(h_v, j) = t\,(1 - v_j), \quad j = 1,\dots,m.$$
The hypothesis $h_v$ suffers loss $t$ on atom $j$ if and only if $v_j = 0$.

3. CVaR gap analysis. The probability that $h_u$ incurs a nonzero loss under $P_v$ is
$$\pi_{v,u} := \sum_{j=1}^m P_v(j)(1 - u_j).$$
Since $P_v(j) \le \frac{p}{m}(1+\theta)$ and $\sum_j (1-u_j) = m/2$, we have $\pi_{v,u} \le \frac{p}{2}(1+\theta)$, and since $\theta < 1/4$, $\pi_{v,u} < p$. We enforce $p \le \alpha/2$, so the probability of a nonzero loss is strictly less than $\alpha$; hence the $(1-\alpha)$-quantile is $0$. In this regime, CVaR is simply the expected loss scaled by $1/\alpha$:
$$R_\alpha(h_u; P_v) = \frac{1}{\alpha}\mathbb{E}_{P_v}[\ell(h_u, Z)].$$

Lemma B.7 (Expected losses). For any $v, u \in \mathcal{V}$:
1. If $u = v$: $\mathbb{E}_{P_v}[\ell(h_v, Z)] = \frac{tp}{2}(1 - \theta)$.
2. If $u \ne v$: $\mathbb{E}_{P_v}[\ell(h_u, Z)] \ge \frac{tp}{2}\left(1 - \frac{\theta}{2}\right)$.

Proof. For the first case, sum over the indices where $v_j = 0$. For the second case, we use the Hamming separation: the term $\sum_{j=1}^m (v_j - 1/2)(1 - u_j)$ evaluates to $a_{vu} - m/4$. Since $a_{vu} \ge m/8$, this sum is lower bounded by $-m/8$. Substituting into the expectation formula yields the result.

Theorem B.8 (CVaR separation). For any distinct $v, u \in \mathcal{V}$,
$$\Delta := R_\alpha(h_u; P_v) - R_\alpha(h_v; P_v) \ge \frac{\theta}{4} \cdot \frac{tp}{\alpha}.$$

Proof. Subtracting the expectations from Lemma B.7 and dividing by $\alpha$:
$$\Delta \ge \frac{1}{\alpha}\left( \frac{tp}{2}\left(1-\frac{\theta}{2}\right) - \frac{tp}{2}(1-\theta) \right) = \frac{tp}{2\alpha}\cdot\frac{\theta}{2} = \frac{\theta t p}{4\alpha}.$$

4. Moment constraint. To ensure $P_v \in \mathcal{P}(M,\lambda)$ we require $\mathbb{E}[\ell^{1+\lambda}] \le M$. Since losses take values in $\{0, t\}$,
$$\mathbb{E}[\ell^{1+\lambda}] = t^{1+\lambda}\,\mathbb{P}(\ell = t) \le t^{1+\lambda} p.$$
We set $t^{1+\lambda} p = M$, i.e., $t = (M/p)^{1/(1+\lambda)}$. Substituting $t$ into the gap:
$$\Delta \ge \frac{\theta}{4\alpha} M^{1/(1+\lambda)} p^{\lambda/(1+\lambda)}.$$

5. KL divergence and Fano's inequality. For $v \ne u$, the KL divergence is bounded by the $\chi^2$-divergence:
$$\mathrm{KL}(P_v \| P_u) \le \sum_{z\in\mathcal{Z}} \frac{(P_v(z)-P_u(z))^2}{P_u(z)}.$$
Using the bounds on $P_v(j)$, we obtain
$$\mathrm{KL}(P_v\|P_u) \le \frac{4\theta^2}{1-\theta}\, p.$$
For $n$ i.i.d. samples, $\mathrm{KL}(P_v^{\otimes n} \| P_u^{\otimes n}) \le n\frac{4\theta^2}{1-\theta} p$. By Fano's inequality, to ensure the error probability $\mathbb{P}(\widehat V \ne V) \ge 1/2$ it suffices to bound the mutual information $I(V; X^n) \le \frac{1}{2}\log N$ (assuming $\log N \ge 2\log 2$). This is satisfied if
$$n\frac{4\theta^2}{1-\theta}p \le \frac{1}{8}\log N \iff p \le \frac{1-\theta}{32\theta^2}\cdot\frac{\log N}{n}.$$
We define $p$ to satisfy both the Fano condition and the CVaR-regime condition ($p \le \alpha/2$):
$$p := \min\left\{ \frac{\alpha}{2},\; \frac{1-\theta}{32\theta^2}\frac{\log N}{n} \right\}.$$
The sample-size condition $n \ge \frac{1-\theta}{16\theta^2}\frac{\log N}{\alpha}$ ensures that the second term is the minimum, so we set $p = \frac{1-\theta}{32\theta^2}\frac{\log N}{n}$.

6. Final lower bound. For any algorithm $\mathcal{A}$, let $\widehat V$ be the index of the hypothesis closest to $\mathcal{A}(S)$. Then
$$\sup_{P\in\mathcal{P}} \mathbb{E}[\text{excess risk}] \ge \mathbb{P}(\widehat V \ne V)\cdot\Delta \ge \frac{1}{2}\Delta.$$
Substituting the chosen $p$ into $\Delta$:
$$\frac{1}{2}\Delta = \frac{1}{2}\cdot\frac{\theta M^{1/(1+\lambda)}}{4\alpha}\left( \frac{1-\theta}{32\theta^2}\frac{\log N}{n} \right)^{\lambda/(1+\lambda)}.$$
Rearranging terms yields the claimed bound with $c(\theta,\lambda) = \frac{\theta}{8}\left( \frac{1-\theta}{32\theta^2} \right)^{\lambda/(1+\lambda)}$.

B.4 Truncated Empirical CVaR Estimator

We prove Theorem 3.3 in this subsection.

Proof. Let $R_\alpha^B(h)$ be the population CVaR defined on the truncated loss $\ell_B$. Since $\ell_B(h,z) \le \ell(h,z)$ pointwise, we have $R_\alpha^B(h) \le R_\alpha(h)$ for all $h$.
We decompose the excess risk as
$$R_\alpha(\widehat h_B) - R_\alpha(h^*) = \underbrace{\left( R_\alpha(\widehat h_B) - R_\alpha^B(\widehat h_B) \right)}_{\text{bias}\ \ge\ 0} + \underbrace{\left( R_\alpha^B(\widehat h_B) - R_\alpha^B(h^*) \right)}_{\text{estimation}} + \underbrace{\left( R_\alpha^B(h^*) - R_\alpha(h^*) \right)}_{\le\ 0}.$$
The third term is nonpositive, so we may drop it from the upper bound. The estimation term is bounded by $2\sup_{h\in\mathcal{H}} |R_\alpha^B(h) - \widehat R_\alpha^B(h)|$ via the usual ERM argument. Thus
$$R_\alpha(\widehat h_B) - R_\alpha(h^*) \le \sup_{h\in\mathcal{H}}\left( R_\alpha(h) - R_\alpha^B(h) \right) + 2\sup_{h\in\mathcal{H}}\left| R_\alpha^B(h) - \widehat R_\alpha^B(h) \right|. \quad (41)$$

Bias bound. We bound the error introduced by truncating the loss distribution. Using the variational definition and the fact that CVaR is $\frac{1}{\alpha}$-Lipschitz w.r.t. the $L_1$ norm,
$$\sup_{h\in\mathcal{H}}\left( R_\alpha(h) - R_\alpha^B(h) \right) \le \frac{1}{\alpha}\mathbb{E}[\ell(h,Z) - \ell_B(h,Z)] = \frac{1}{\alpha}\mathbb{E}[(\ell(h,Z)-B)\,\mathbf{1}(\ell(h,Z) > B)].$$
Using the integral identity $\mathbb{E}[X] = \int_0^\infty \mathbb{P}(X>t)\,dt$,
$$\mathbb{E}[(\ell - B)_+] = \int_B^\infty \mathbb{P}(\ell(h,Z) > t)\,dt. \quad (42)$$
By Markov's inequality on the $(1+\lambda)$-moment (Assumption 2.1), $\mathbb{P}(\ell > t) \le M/t^{1+\lambda}$. Integrating this tail,
$$\int_B^\infty \frac{M}{t^{1+\lambda}}\,dt = M\left[ \frac{t^{-\lambda}}{-\lambda} \right]_B^\infty = \frac{M}{\lambda B^\lambda}. \quad (43)$$
Thus we obtain the rigorous bias bound
$$\sup_{h\in\mathcal{H}}\left( R_\alpha(h) - R_\alpha^B(h) \right) \le \frac{M}{\alpha\lambda B^\lambda}. \quad (44)$$

Estimation error. We control the uniform deviation $Z_n = \sup_{h\in\mathcal{H}} |R_\alpha^B(h) - \widehat R_\alpha^B(h)|$. Define the function class associated with the variational CVaR objective:
$$\mathcal{G}_B = \left\{ z \mapsto \phi_{h,\theta}(z) := \theta + \tfrac{1}{\alpha}(\ell_B(h,z) - \theta)_+ \;\middle|\; h\in\mathcal{H},\ \theta\in[0,B] \right\}. \quad (45)$$
We apply Bousquet's concentration inequality for suprema of empirical processes; we first verify its conditions.
1. Uniform upper bound ($K$): for any $g\in\mathcal{G}_B$, since $\ell_B \in [0,B]$ and $\theta\in[0,B]$,
$$0 \le \phi_{h,\theta}(z) \le \sup_{\theta\in[0,B]}\left( \theta + \tfrac{1}{\alpha}(B-\theta) \right) = \frac{B}{\alpha}, \quad (46)$$
the maximum occurring at $\theta = 0$ since $\alpha \le 1$. Thus we set $K = B/\alpha$.
2. Variance bound ($\sigma^2$): the variance of $\phi_{h,\theta}(Z)$ depends only on the random part $\frac{1}{\alpha}(\ell_B(h,Z)-\theta)_+$:
$$\mathrm{Var}(\phi_{h,\theta}) = \frac{1}{\alpha^2}\mathrm{Var}\left( (\ell_B - \theta)_+ \right) \le \frac{1}{\alpha^2}\mathbb{E}\left[ (\ell_B-\theta)_+^2 \right] \le \frac{1}{\alpha^2}\mathbb{E}[\ell_B^2]. \quad (47)$$
Using the inequality $x^2 \le x^{1+\lambda}B^{1-\lambda}$ for $x\in[0,B]$ and Assumption 2.1,
$$\sigma_\mathcal{G}^2 = \sup_{g\in\mathcal{G}_B}\mathrm{Var}(g) \le \frac{MB^{1-\lambda}}{\alpha^2}. \quad (48)$$
By Bousquet's inequality, for any $\delta\in(0,1)$, with probability at least $1-\delta$,
$$Z_n \le \mathbb{E}[Z_n] + \sqrt{\frac{2\sigma_\mathcal{G}^2\log(1/\delta)}{n}} + \frac{K\log(1/\delta)}{3n}. \quad (49)$$
Bounding the expectation (Rademacher complexity): the class $\mathcal{G}_B$ involves a supremum over $\theta$. Since the objective is convex in $\theta$ and $1/\alpha$-Lipschitz in $\ell_B$, standard contraction results (e.g., Levy et al. [2020]) imply
$$\mathbb{E}[Z_n] \le 2\mathfrak{R}_n(\mathcal{G}_B) \le \frac{2}{\alpha}\mathfrak{R}_n(\ell\circ\mathcal{H}) + O(n^{-1/2}). \quad (50)$$
Substituting the bounds for $\sigma^2$ and $K$ into Bousquet's inequality,
$$Z_n \le \frac{2}{\alpha}\mathfrak{R}_n(\ell\circ\mathcal{H}) + \sqrt{\frac{2MB^{1-\lambda}\log(1/\delta)}{\alpha^2 n}} + \frac{B\log(1/\delta)}{3\alpha n}. \quad (51)$$
The total estimation-error contribution is $2Z_n$:
$$\mathrm{Est} \le \frac{4}{\alpha}\mathfrak{R}_n(\ell\circ\mathcal{H}) + \frac{1}{\alpha}\sqrt{\frac{8MB^{1-\lambda}\log(1/\delta)}{n}} + \frac{2B\log(1/\delta)}{3\alpha n}. \quad (52)$$
We minimize the total error bound (bias + estimation) with respect to $B$:
$$\mathrm{Error}(B) \approx \frac{M}{\alpha\lambda B^\lambda} + \frac{1}{\alpha}\sqrt{\frac{8MB^{1-\lambda}\log(1/\delta)}{n}}. \quad (53)$$
Balancing the orders of the bias ($B^{-\lambda}$) and the variance ($B^{(1-\lambda)/2}n^{-1/2}$) yields the optimal scaling
$$n \asymp B^{1+\lambda} \implies B = (Mn)^{\frac{1}{1+\lambda}}. \quad (54)$$
Define the rate factor $\Delta_n = M^{\frac{1}{1+\lambda}} n^{-\frac{\lambda}{1+\lambda}}$. Substituting $B$ back into the three terms:
1. Bias term:
$$\frac{M}{\alpha\lambda(Mn)^{\frac{\lambda}{1+\lambda}}} = \frac{1}{\alpha\lambda}\Delta_n.$$
2. Variance term (Bousquet, main):
$$\frac{1}{\alpha}\sqrt{8\log(1/\delta)}\sqrt{\frac{M(Mn)^{\frac{1-\lambda}{1+\lambda}}}{n}} = \frac{\sqrt{8\log(1/\delta)}}{\alpha}\Delta_n.$$
3. Variance term (Bousquet, linear): since $\frac{B}{n} = \frac{(Mn)^{\frac{1}{1+\lambda}}}{n} = \Delta_n$,
$$\frac{2B\log(1/\delta)}{3\alpha n} = \frac{2\log(1/\delta)}{3\alpha}\Delta_n.$$
Summing the coefficients gives the final constant $C_{\lambda,\delta}$:
$$C_{\lambda,\delta} = \frac{1}{\lambda} + \sqrt{8\log(1/\delta)} + \frac{2}{3}\log(1/\delta). \quad (55)$$
This completes the proof of Theorem 3.3.

B.5 Generalization with Dependent Data

Proof of the upper bound in Theorem 3.4. For any $h$, by the variational form of CVaR and the Lipschitz property $|R_\alpha(X) - R_\alpha(Y)| \le \frac{1}{\alpha}\mathbb{E}|X-Y|$ (Lemma G.8),
$$\sup_h |\widehat R_\alpha(h) - R_\alpha(h)| \le \frac{1}{\alpha}\sup_{h\in\mathcal{H},\,\theta\in\mathbb{R}} \left| \widehat L_\alpha(h,\theta) - L_\alpha(h,\theta) \right|, \quad (56)$$
where $\widehat L_\alpha(h,\theta) = \frac{1}{n}\sum_{i=1}^n (\ell(h,Z_i)-\theta)_+$ and $L_\alpha(h,\theta) = \mathbb{E}[(\ell(h,Z)-\theta)_+]$. From the moment condition, any population CVaR minimizer $\theta_h^*$ satisfies $0 \le \theta_h^* \le R := M^{1/(1+\lambda)}/\alpha$ (Theorem G.1), hence we may restrict $\theta$ to $\Theta = [0, R]$. Fix $B > 0$ and define the truncated loss class
$$\mathcal{F}_B = \left\{ f_{h,\theta}^B(z) = [(\ell(h,z)-\theta)_+]\wedge B : h\in\mathcal{H},\ \theta\in\Theta \right\}. \quad (57)$$
Using the Markov and Hölder inequalities,
$$\sup_{h,\theta}\left| L_\alpha(h,\theta) - \mathbb{E}[f_{h,\theta}^B(Z)] \right| \le \frac{M}{B^\lambda}. \quad (58)$$
Thus
$$\sup_{h,\theta}|\widehat L_\alpha(h,\theta) - L_\alpha(h,\theta)| \le \sup_{f\in\mathcal{F}_B}\left| \frac{1}{n}\sum f(Z_i) - \mathbb{E}f \right| + \frac{2M}{B^\lambda}. \quad (59)$$
Assume the process $(Z_i)$ is $\beta$-mixing with exponential decay $\beta(k) \le \exp(-ck^\gamma)$ for some $\gamma > 0$. Choose $a_n \asymp \log n$ and $b_n \asymp (\log n)^{1/\gamma}$, and partition $\{1,\dots,n\}$ into $\mu_n \asymp n/\log n$ blocks of size $a_n$ separated by gaps of size $b_n$; let $N = \mu_n \asymp n/\log n$. Define the block sums $S_j(f) = \sum_{i\in B_j} f(Z_i)$. By Berbee's lemma, there exist independent blocks $\widetilde B_1,\dots,\widetilde B_N$ with the same marginals such that $\mathbb{P}((B_j) \ne (\widetilde B_j)) \le N\beta(b_n) \le n^{-O(1)}$. Hence, with probability $\ge 1 - n^{-O(1)}$,
$$\sup_{f\in\mathcal{F}_B}\left| \frac{1}{n}\sum f(Z_i) - \mathbb{E}f \right| \le \sup_{f\in\mathcal{F}_B}\left| \frac{1}{N}\sum_{j=1}^N\left( \frac{1}{a_n}\widetilde S_j(f) - \mathbb{E}f \right) \right|. \quad (60)$$
Define the centered block variables $\widetilde Z_j(f) = \frac{1}{a_n}\widetilde S_j(f) - \mathbb{E}f$. The $\widetilde Z_j(f)$ are i.i.d. across $j$, mean zero, bounded by $B$, and satisfy $\mathbb{E}[|\widetilde Z_j(f)|^{1+\lambda}] \le C_\lambda M$, since $0 \le f \le \ell$ pointwise, by Jensen's inequality combined with the von Bahr-Esseen inequality. The class $\mathcal{F}_B$ has pseudo-dimension $\mathrm{Pdim}(\mathcal{F}_B) \le C(d+1)$ (Lemma G.9). Applying the heavy-tailed uniform deviation inequality for i.i.d. data (Theorem 3.2) with envelope $B$ and moment bound $C_\lambda M$, we obtain that with probability $\ge 1-\delta$,
$$\sup_{f\in\mathcal{F}_B}\left| \frac{1}{N}\sum_{j=1}^N \widetilde Z_j(f) \right| \le C_\lambda M^{\frac{1}{1+\lambda}}\left( \frac{d\log N + \log(1/\delta)}{N} \right)^{\frac{\lambda}{1+\lambda}}. \quad (61)$$
Combining the bounds, with probability $\ge 1-\delta-n^{-O(1)}$,
$$\sup_h|\widehat R_\alpha(h) - R_\alpha(h)| \le \frac{1}{\alpha}\left[ C_\lambda M^{\frac{1}{1+\lambda}}\left( \frac{d\log n + \log(1/\delta)}{N} \right)^{\frac{\lambda}{1+\lambda}} + \frac{2M}{B^\lambda} \right]. \quad (62)$$
Choosing $B$ large enough that the truncation term matches the deviation term,
$$B \asymp M^{\frac{1}{1+\lambda}}\left( \frac{N}{d\log n + \log(1/\delta)} \right)^{\frac{1}{1+\lambda}}, \quad (63)$$
we have $2M/B^\lambda \asymp M^{\frac{1}{1+\lambda}}\big( (d\log n + \log(1/\delta))/N \big)^{\frac{\lambda}{1+\lambda}}$, and substituting back yields the dominant scaling
$$\sup_h|\widehat R_\alpha(h) - R_\alpha(h)| \le \frac{C_{\lambda,\gamma}}{\alpha} M^{\frac{1}{1+\lambda}}\left( \frac{d\log n + \log(1/\delta)}{N} \right)^{\frac{\lambda}{1+\lambda}}. \quad (64)$$
Recalling $N \asymp n/\log n$ gives the first claim. For the empirical CVaR minimizer $\widehat h$, the usual ERM argument gives
$$R_\alpha(\widehat h) - R_\alpha(h^*) \le 2\sup_h |\widehat R_\alpha(h) - R_\alpha(h)|, \quad (65)$$
which yields the second inequality.

Proof of the lower bound in Theorem 3.4. Let $\{Z_i\}_{i=1}^n$ be drawn from any $P \in \mathcal{P}(\lambda, M, \beta)$. Using a blocking scheme as in the upper bound, partition $\{1,\dots,n\}$ into $\mu_n$ disjoint blocks of size $a_n \asymp \log n$, separated by gaps of length $b_n \asymp (\log n)^{1/\gamma}$, and let $N := \mu_n \asymp n/\log n$ denote the number of retained blocks. By Berbee's coupling,
there exist independent blocks $\widetilde B_1, \dots, \widetilde B_N$ with the same marginals as the original blocks such that
$$\mathbb{P}\left( (B_1,\dots,B_N) \ne (\widetilde B_1,\dots,\widetilde B_N) \right) \le N\beta(b_n) \le n^{-A}$$
for any fixed $A > 0$ and all sufficiently large $n$. Therefore, for minimax lower bounds it suffices to work with the independent-block model: any estimator based on the original data induces an estimator based on $(\widetilde B_1,\dots,\widetilde B_N)$ whose risk differs by at most $n^{-A}\sup_h R_\alpha(h)$, which is negligible compared to the target rate. Henceforth we assume we observe $N$ i.i.d. samples.

Since $\mathrm{VCdim}(\mathcal{H}) \ge d$, there exist points $z_1,\dots,z_d \in \mathcal{Z}$ and hypotheses $h_1,\dots,h_d \in \mathcal{H}$ shattered by $\mathcal{H}$. Using the same construction as in the proof of Section B.3.1, we build a family of $2^d$ distributions $\{P_v : v\in\{0,1\}^d\}$ and hypotheses $\{h_v : v\in\{0,1\}^d\}$ such that:
1. For all $u \ne v$, $R_\alpha(h_u; P_v) - R_\alpha(h_v; P_v) \ge \frac{\theta}{4\alpha} t p$.
2. The heavy-tail condition is satisfied: $\sup_{h\in\mathcal{H}} \mathbb{E}_{P_v}[\ell(h,Z)^{1+\lambda}] \le t^{1+\lambda} p \le M$.
3. The pairwise Kullback-Leibler divergences are controlled: $\mathrm{KL}(P_v\|P_u) \le C_\theta p$ for all $u\ne v$, where $C_\theta > 0$ depends only on $\theta$.
Let $V$ be uniformly distributed over $\{0,1\}^d$ and let $X^N = (Z_1,\dots,Z_N)$ be drawn from $P_V^{\otimes N}$. By the chain rule and the KL bound,
$$I(V; X^N) \le \frac{1}{2^d}\sum_{u,v}\mathrm{KL}\left( P_v^{\otimes N}\,\middle\|\,P_u^{\otimes N} \right) \le N\max_{u\ne v}\mathrm{KL}(P_v\|P_u) \le C_\theta N p.$$
Choose $p = c_1\frac{d}{N}$ with $c_1 > 0$ sufficiently small so that $I(V;X^N) \le \frac{d}{8}$. By Fano's inequality,
$$\inf_{\widehat V}\mathbb{P}(\widehat V \ne V) \ge 1 - \frac{I(V;X^N) + \log 2}{\log(2^d)} \ge \frac{1}{2}$$
for all sufficiently large $d$. Let $\widehat h = \widehat h(X^N)$ be any estimator and define $\widehat V$ by $\widehat h = h_{\widehat V}$ (ties broken arbitrarily). Then
$$\sup_{P\in\mathcal{P}}\mathbb{E}_P\left[ R_\alpha(\widehat h) - R_\alpha(h_P^*) \right] \ge \frac{1}{2^d}\sum_v \mathbb{E}_{P_v}\left[ R_\alpha(h_{\widehat V}; P_v) - R_\alpha(h_v; P_v) \right] \ge \mathbb{P}(\widehat V\ne V)\cdot\frac{\theta}{4\alpha}tp \ge \frac{\theta}{8\alpha}tp.$$
Finally, enforce the moment constraint with equality: $t = (M/p)^{\frac{1}{1+\lambda}}$. Substituting gives
$$\sup_{P\in\mathcal{P}}\mathbb{E}_P\left[ R_\alpha(\widehat h) - R_\alpha(h_P^*) \right] \ge \frac{c}{\alpha}M^{\frac{1}{1+\lambda}}p^{\frac{\lambda}{1+\lambda}} = \frac{c}{\alpha}M^{\frac{1}{1+\lambda}}\left( \frac{d}{N} \right)^{\frac{\lambda}{1+\lambda}}.$$
Recalling $N \asymp n/\log n$ completes the proof.
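The blocking scheme used in both directions of Theorem 3.4 is simple to write down explicitly. The sketch below is illustrative (the index conventions and rounding are ours): it returns the retained blocks of size $a_n \asymp \log n$ separated by gaps of size $b_n \asymp (\log n)^{1/\gamma}$, so that $N \asymp n/\log n$ blocks are kept.

```python
import numpy as np

def blocked_indices(n, gamma=1.0):
    """Blocks of length ~log n separated by gaps of length ~(log n)^{1/gamma};
    Berbee's coupling lets the retained blocks be treated as independent."""
    a = max(1, int(np.log(n)))                    # block length a_n
    b = max(1, int(np.log(n) ** (1.0 / gamma)))   # gap length b_n
    blocks, i = [], 0
    while i + a <= n:
        blocks.append(np.arange(i, i + a))
        i += a + b
    return blocks

print(len(blocked_indices(100_000)))  # N ~ n / log n retained blocks
```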
C Random Active-Set Theory and Uniform Bahadur-Kiefer Expansions for CVaR

Setup and notation. Let $Z\sim P$ on $(\mathcal{Z},\mathcal{A})$ and let $\ell : \mathcal{H}\times\mathcal{Z}\to\mathbb{R}_+$ be measurable. Fix $\alpha\in(0,1)$. For each $h\in\mathcal{H}$ define the nonnegative loss random variable $X_h := \ell(h,Z) \ge 0$. For a scalar threshold $\theta\in\mathbb{R}$, define the Rockafellar-Uryasev (RU) lift
$$\Phi(h,\theta) := \theta + \frac{1}{\alpha}\mathbb{E}\left[ (X_h-\theta)_+ \right]. \quad (66)$$
Given i.i.d. data $Z_1,\dots,Z_n\sim P$ and the empirical measure $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{Z_i}$, let $X_{h,i} := \ell(h,Z_i)$ and define the empirical RU lift
$$\widehat\Phi_n(h,\theta) := \theta + \frac{1}{\alpha n}\sum_{i=1}^n (X_{h,i}-\theta)_+. \quad (67)$$
The population and empirical CVaR objectives are
$$R_\alpha(h) := \inf_{\theta\in\mathbb{R}}\Phi(h,\theta), \qquad \widehat R_{\alpha,n}(h) := \inf_{\theta\in\mathbb{R}}\widehat\Phi_n(h,\theta). \quad (68)$$
Endogenous threshold selections (minimal RU minimizers). Define the (minimal) population RU threshold and the (minimal) empirical RU threshold, respectively, by
$$\theta^\star(h) := \inf\left\{ \theta\in\mathbb{R} : P(X_h > \theta) \le \alpha \right\}, \qquad \widehat\theta_n(h) := \inf\left\{ \theta\in\mathbb{R} : P_n(X_h > \theta) \le \alpha \right\}. \quad (69)$$
It is standard that $\widehat\theta_n(h)$ is always a minimizer of $\theta\mapsto\widehat\Phi_n(h,\theta)$, and similarly $\theta^\star(h)$ is a minimizer of $\theta\mapsto\Phi(h,\theta)$; we use these minimal selections to keep the arguments deterministic and monotone.

Tail maps and empirical tail maps. For $h\in\mathcal{H}$ and $\theta\in\mathbb{R}$, define the (strict) tail probability and its empirical counterpart:
$$T_h(\theta) := P(X_h > \theta), \qquad \widehat T_{h,n}(\theta) := P_n(X_h > \theta) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_{h,i} > \theta\}. \quad (70)$$
Each $T_h(\cdot)$ and $\widehat T_{h,n}(\cdot)$ is nonincreasing and right-continuous. By construction,
$$\widehat T_{h,n}\left( \widehat\theta_n(h) \right) \le \alpha, \qquad \text{and if } \theta < \widehat\theta_n(h) \text{ then } \widehat T_{h,n}(\theta) > \alpha. \quad (71)$$
We now restate our assumptions to make this section self-contained.

Assumption A (uniform tail empirical-process control). There exists a function $\varepsilon_n : (0,1)\to(0,\infty)$ such that for all $\delta\in(0,1)$, with probability at least $1-\delta$,
$$\sup_{h\in\mathcal{H}}\sup_{\theta\in\mathbb{R}} \left| \widehat T_{h,n}(\theta) - T_h(\theta) \right| \le \varepsilon_n(\delta). \quad (72)$$
Define the corresponding event
$$\mathcal{E}_1(\delta) := \left\{ \sup_{h\in\mathcal{H}}\sup_{\theta\in\mathbb{R}}\left| \widehat T_{h,n}(\theta) - T_h(\theta) \right| \le \varepsilon_n(\delta) \right\}. \quad (73)$$
Assumption B (local quantile margin at level $\alpha$). There exist constants $u_0 > 0$, $\kappa \ge 1$, and $0 < c_- \le c_+ < \infty$ such that for all $h\in\mathcal{H}$ and all $u\in(0,u_0]$,
$$\alpha + c_- u^\kappa \le T_h\left( \theta^\star(h) - u \right) \le \alpha + c_+ u^\kappa, \quad (74)$$
and
$$\alpha - c_+ u^\kappa \le T_h\left( \theta^\star(h) + u \right) \le \alpha - c_- u^\kappa. \quad (75)$$
This is the standard "two-sided quantile margin" formulation centered at the target level $\alpha$; it rules out arbitrarily flat tails at level $\alpha$ and, in particular, prevents the lower-deviation argument from failing in the presence of atoms at $\theta^\star(h)$.

Assumption C (uniform hinge empirical process at the population threshold). There exists a function $\eta_n : (0,1)\to(0,\infty)$ such that for all $\delta\in(0,1)$, with probability at least $1-\delta$,
$$\sup_{h\in\mathcal{H}}\left| (P_n - P)\left[ (X_h - \theta^\star(h))_+ \right] \right| \le \eta_n(\delta). \quad (76)$$
Define the corresponding event
$$\mathcal{E}_2(\delta) := \left\{ \sup_{h\in\mathcal{H}}\left| (P_n-P)\left[ (X_h-\theta^\star(h))_+ \right] \right| \le \eta_n(\delta) \right\}. \quad (77)$$
A convenient threshold-deviation envelope. Whenever $\varepsilon_n(\delta) \le (c_-/2)u_0^\kappa$, define
$$\Delta_n(\delta) := \left( \frac{2\varepsilon_n(\delta)}{c_-} \right)^{1/\kappa}. \quad (78)$$

Theorem C.1 (Uniform deviation of the empirical RU threshold). Fix $\alpha\in(0,1)$ and let $Z_1,\dots,Z_n \overset{\mathrm{iid}}{\sim} P$ with empirical measure $P_n$. Assume (72) and (74)-(75). If $\varepsilon_n(\delta) \le (c_-/2)u_0^\kappa$, then with probability at least $1-\delta$,
$$\sup_{h\in\mathcal{H}}\left| \widehat\theta_n(h) - \theta^\star(h) \right| \le \Delta_n(\delta) = \left( \frac{2\varepsilon_n(\delta)}{c_-} \right)^{1/\kappa}. \quad (79)$$

Proof. Fix $h\in\mathcal{H}$ and abbreviate $\theta^\star := \theta^\star(h)$, $\widehat\theta := \widehat\theta_n(h)$, $T(\theta) := T_h(\theta)$, $\widehat T(\theta) := \widehat T_{h,n}(\theta)$. Work on the event $\mathcal{E}_1(\delta)$, so that $\sup_{\theta\in\mathbb{R}}|\widehat T(\theta) - T(\theta)| \le \varepsilon_n(\delta)$. Recall the minimality characterization (71).

Upper deviation: $\widehat\theta \le \theta^\star + u$. Let $u\in(0,u_0]$ and suppose $c_- u^\kappa \ge 2\varepsilon_n(\delta)$. By the right-side margin bound (75), $T(\theta^\star + u) \le \alpha - c_- u^\kappa$. On $\mathcal{E}_1(\delta)$,
$$\widehat T(\theta^\star + u) \le T(\theta^\star+u) + \varepsilon_n(\delta) \le \alpha - c_-u^\kappa + \varepsilon_n(\delta) \le \alpha - \varepsilon_n(\delta) \le \alpha.$$
Hence $\widehat T(\theta^\star+u) \le \alpha$, and minimality (71) gives $\widehat\theta \le \theta^\star + u$.

Lower deviation: $\widehat\theta \ge \theta^\star - u$. Let $u\in(0,u_0]$ and suppose $c_-u^\kappa \ge 2\varepsilon_n(\delta)$. By the left-side margin bound (74), $T(\theta^\star - u) \ge \alpha + c_-u^\kappa$. On $\mathcal{E}_1(\delta)$,
$$\widehat T(\theta^\star-u) \ge T(\theta^\star-u) - \varepsilon_n(\delta) \ge \alpha + c_-u^\kappa - \varepsilon_n(\delta) \ge \alpha + \varepsilon_n(\delta) > \alpha.$$
Therefore $\widehat T(\theta^\star-u) > \alpha$, and minimality (71) implies $\widehat\theta \ge \theta^\star - u$.

Choice of $u$ and uniformization. Choose $u := \Delta_n(\delta) = (2\varepsilon_n(\delta)/c_-)^{1/\kappa}$. Under $\varepsilon_n(\delta) \le (c_-/2)u_0^\kappa$ we have $u \le u_0$ and $c_-u^\kappa = 2\varepsilon_n(\delta)$. The two deviation bounds yield $|\widehat\theta - \theta^\star| \le u$. Since the argument holds for each $h$ on $\mathcal{E}_1(\delta)$, $\sup_{h\in\mathcal{H}} |\widehat\theta_n(h) - \theta^\star(h)| \le \Delta_n(\delta)$ on $\mathcal{E}_1(\delta)$. Finally, $\mathbb{P}(\mathcal{E}_1(\delta)) \ge 1-\delta$ by (72).
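The minimal empirical RU threshold in (69) is an order statistic, which makes Theorem C.1 easy to probe by simulation. The sketch below is an illustration under an assumed lognormal loss (so the margin holds with $\kappa = 1$); the distribution and sample sizes are placeholder choices.

```python
import numpy as np

def empirical_ru_threshold(losses, alpha):
    """Minimal empirical RU threshold: the smallest theta with
    P_n(X > theta) <= alpha, i.e. the ceil((1-alpha) n)-th order statistic."""
    x = np.sort(np.asarray(losses, dtype=float))
    k = int(np.ceil((1.0 - alpha) * len(x)))
    return x[k - 1]

rng = np.random.default_rng(1)
alpha = 0.1
theta_star = np.exp(1.2816)  # 90% quantile of the standard lognormal
devs = [abs(empirical_ru_threshold(rng.lognormal(size=4000), alpha) - theta_star)
        for _ in range(200)]
print(np.mean(devs))  # shrinks like n^{-1/2}, i.e. Delta_n with kappa = 1
```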
We now provide the proof of Theorem 3.8, restating the theorem first.

Theorem (Restatement of Theorem 3.8). Fix $\alpha\in(0,1)$ and let $Z_1,\dots,Z_n \overset{\mathrm{iid}}{\sim} P$ with empirical measure $P_n$. Assume (72), (74)-(75), and (76), and assume further that $\varepsilon_n(\delta/2) \le (c_-/2)u_0^\kappa$. Then, with probability at least $1-\delta$,
$$\sup_{h\in\mathcal{H}}\left| \widehat R_{\alpha,n}(h) - R_\alpha(h) - \frac{1}{\alpha}(P_n-P)\left[ (X_h-\theta^\star(h))_+ \right] - \frac{1}{\alpha}\left( \widehat\theta_n(h)-\theta^\star(h) \right)\left( \alpha - P(X_h > \theta^\star(h)) \right) \right| \le \frac{C_1}{\alpha}\,\varepsilon_n(\delta/2)^{\frac{\kappa+1}{\kappa}}, \quad (80)$$
where one may take, for instance,
$$C_1 := \left( \frac{2}{c_-} \right)^{1/\kappa} + \frac{c_+}{\kappa+1}\left( \frac{2}{c_-} \right)^{\frac{\kappa+1}{\kappa}}. \quad (81)$$
In particular, on the same event,
$$\sup_{h\in\mathcal{H}}\left| \widehat R_{\alpha,n}(h) - R_\alpha(h) \right| \le \frac{1}{\alpha}\eta_n(\delta/2) + \frac{1}{\alpha}\left( \varepsilon_n(\delta/2)\Delta_n(\delta/2) + \frac{c_+}{\kappa+1}\Delta_n(\delta/2)^{\kappa+1} \right). \quad (82)$$

Proof of Theorem 3.8. Fix $h\in\mathcal{H}$ and abbreviate $\theta^\star := \theta^\star(h)$, $\widehat\theta := \widehat\theta_n(h)$, $X := X_h$, $\Phi(\theta) := \Phi(h,\theta)$, $\widehat\Phi(\theta) := \widehat\Phi_n(h,\theta)$. By definition (68) and the fact that $\theta^\star$ and $\widehat\theta$ are minimizers, $R_\alpha(h) = \Phi(\theta^\star)$ and $\widehat R_{\alpha,n}(h) = \widehat\Phi(\widehat\theta)$. We begin from the algebraic decomposition
$$\widehat R_{\alpha,n}(h) - R_\alpha(h) = \widehat\Phi(\widehat\theta) - \Phi(\theta^\star) = \underbrace{\left( \widehat\Phi(\theta^\star) - \Phi(\theta^\star) \right)}_{(A)} + \underbrace{\left( \Phi(\widehat\theta) - \Phi(\theta^\star) \right)}_{(C)} + \underbrace{\left[ \widehat\Phi(\widehat\theta) - \widehat\Phi(\theta^\star) \right] - \left[ \Phi(\widehat\theta) - \Phi(\theta^\star) \right]}_{(B)}; \quad (83)$$
adding the three brackets indeed telescopes to $\widehat\Phi(\widehat\theta) - \Phi(\theta^\star)$.

Term (A): the leading empirical-process term. From (66)-(67),
$$(A) = \widehat\Phi(\theta^\star) - \Phi(\theta^\star) = \frac{1}{\alpha}(P_n - P)\left[ (X-\theta^\star)_+ \right]. \quad (84)$$

A hinge integral identity. For every $\theta\in\mathbb{R}$,
$$(X-\theta)_+ = \int_\theta^\infty \mathbf{1}\{X>s\}\,ds. \quad (85)$$
Consequently, for any $a, b\in\mathbb{R}$,
$$(X-a)_+ - (X-b)_+ = \int_a^b \mathbf{1}\{X>s\}\,ds, \quad (86)$$
where the integral is oriented, i.e., $\int_a^b = -\int_b^a$ when $a > b$. Applying (86) in (66) and (67) yields
$$\Phi(\widehat\theta) - \Phi(\theta^\star) = (\widehat\theta - \theta^\star) - \frac{1}{\alpha}\int_{\theta^\star}^{\widehat\theta} P(X>s)\,ds, \quad (87)$$
$$\widehat\Phi(\widehat\theta) - \widehat\Phi(\theta^\star) = (\widehat\theta - \theta^\star) - \frac{1}{\alpha}\int_{\theta^\star}^{\widehat\theta} P_n(X>s)\,ds. \quad (88)$$
Subtracting (87) from (88) gives the exact coupling identity
$$(B) = -\frac{1}{\alpha}\int_{\theta^\star}^{\widehat\theta}\left( P_n(X>s) - P(X>s) \right) ds. \quad (89)$$

Linearization of (C) and the endogenous-threshold correction. Write $T(s) := P(X>s)$. Starting from (87), add and subtract $T(\theta^\star)$ inside the integral:
$$\Phi(\widehat\theta) - \Phi(\theta^\star) = (\widehat\theta-\theta^\star) - \frac{1}{\alpha}\int_{\theta^\star}^{\widehat\theta}\left[ T(\theta^\star) + \left( T(s)-T(\theta^\star) \right) \right] ds = \frac{1}{\alpha}(\widehat\theta-\theta^\star)\left( \alpha - T(\theta^\star) \right) \underbrace{-\ \frac{1}{\alpha}\int_{\theta^\star}^{\widehat\theta}\left( T(s)-T(\theta^\star) \right) ds}_{=:\ R_{\mathrm{pop}}(h)}. \quad (90)$$
The first term in (90) is precisely the endogenous-threshold correction in (80).

Bounding the coupling remainder (B). Work on the event $\mathcal{E}_1(\delta/2)$ from (73). Then $\sup_{h,\theta}|\widehat T_{h,n}(\theta) - T_h(\theta)| \le \varepsilon_n(\delta/2)$, and from (89),
$$|(B)| \le \frac{1}{\alpha}\left| \int_{\theta^\star}^{\widehat\theta} \varepsilon_n(\delta/2)\,ds \right| = \frac{1}{\alpha}\,\varepsilon_n(\delta/2)\,|\widehat\theta-\theta^\star|. \quad (91)$$

Bounding the population remainder $R_{\mathrm{pop}}(h)$. Assume $\varepsilon_n(\delta/2) \le (c_-/2)u_0^\kappa$ and work on $\mathcal{E}_1(\delta/2)$. By Theorem C.1,
$$\sup_{h\in\mathcal{H}}|\widehat\theta_n(h) - \theta^\star(h)| \le \Delta_n(\delta/2) \le u_0. \quad (92)$$
Fix $h$ and consider any $s$ between $\theta^\star$ and $\widehat\theta$. By (74)-(75), the upper side of the margin condition gives $|T(s) - T(\theta^\star)| \le c_+|s-\theta^\star|^\kappa$. Therefore
$$|R_{\mathrm{pop}}(h)| \le \frac{1}{\alpha}\left| \int_{\theta^\star}^{\widehat\theta} c_+|s-\theta^\star|^\kappa\,ds \right| = \frac{c_+}{\alpha(\kappa+1)}\,|\widehat\theta-\theta^\star|^{\kappa+1}. \quad (93)$$

Assembling the expansion. Plug (84), (90), (91), and (93) into (83).
On $\mathcal{E}_1(\delta/2)$ we obtain
$$\left| \widehat R_{\alpha,n}(h) - R_\alpha(h) - \frac{1}{\alpha}(P_n-P)\left[ (X-\theta^\star)_+ \right] - \frac{1}{\alpha}(\widehat\theta-\theta^\star)\left( \alpha - P(X>\theta^\star) \right) \right| \le \frac{1}{\alpha}\left( \varepsilon_n(\delta/2)\,|\widehat\theta-\theta^\star| + \frac{c_+}{\kappa+1}|\widehat\theta-\theta^\star|^{\kappa+1} \right). \quad (94)$$
Uniformization and conversion to an $\varepsilon_n$-rate. Take $\sup_{h\in\mathcal{H}}$ in (94) and use (92):
$$\sup_{h\in\mathcal{H}}\left| \cdots \right| \le \frac{1}{\alpha}\left( \varepsilon_n(\delta/2)\Delta_n(\delta/2) + \frac{c_+}{\kappa+1}\Delta_n(\delta/2)^{\kappa+1} \right) \quad \text{on } \mathcal{E}_1(\delta/2).$$
Now substitute $\Delta_n(\delta/2) = (2\varepsilon_n(\delta/2)/c_-)^{1/\kappa}$:
$$\varepsilon_n(\delta/2)\Delta_n(\delta/2) = \left( \frac{2}{c_-} \right)^{1/\kappa}\varepsilon_n(\delta/2)^{\frac{\kappa+1}{\kappa}}, \qquad \Delta_n(\delta/2)^{\kappa+1} = \left( \frac{2}{c_-} \right)^{\frac{\kappa+1}{\kappa}}\varepsilon_n(\delta/2)^{\frac{\kappa+1}{\kappa}}.$$
This yields (80) with the constant $C_1$ in (81).

Deriving the "in particular" bound and the probability statement. Work on the intersection $\mathcal{E}_1(\delta/2)\cap\mathcal{E}_2(\delta/2)$. By (72), (76), and a union bound, $\mathbb{P}(\mathcal{E}_1(\delta/2)\cap\mathcal{E}_2(\delta/2)) \ge 1-\delta$. On $\mathcal{E}_2(\delta/2)$,
$$\sup_{h\in\mathcal{H}}\left| \frac{1}{\alpha}(P_n-P)\left[ (X_h-\theta^\star(h))_+ \right] \right| \le \frac{1}{\alpha}\eta_n(\delta/2).$$
Combining this with the uniform expansion bound gives (82).

D Functional Robustness

Proof of Proposition 4.1.

Wasserstein bound with moment control. Assume $(\mathcal{Z},d)$ is a metric space and that $z\mapsto\ell(h,z)$ is $\beta$-Hölder with constant $L_\beta$, uniformly in $h$. We use the Rockafellar-Uryasev representation
$$\mathrm{CVaR}_\alpha(X) = \inf_{\theta\in\mathbb{R}}\left\{ \theta + \frac{1}{\alpha}\mathbb{E}[(X-\theta)_+] \right\}.$$
Fix $h\in\mathcal{H}$ and $\theta\in\mathbb{R}$ and define $\varphi_{h,\theta}(z) := (\ell(h,z)-\theta)_+$. Since $x\mapsto(x-t)_+$ is $1$-Lipschitz on $\mathbb{R}$, the Hölder condition implies
$$|\varphi_{h,\theta}(Z) - \varphi_{h,\theta}(Z')| \le |\ell(h,Z)-\ell(h,Z')| \le L_\beta\,d(Z,Z')^\beta,$$
so $\varphi_{h,\theta}$ is $\beta$-Hölder with constant $L_\beta$ (uniformly in $h,\theta$). Let $\pi$ be any coupling of $(P,Q)$. Then
$$\left| \mathbb{E}_P[\varphi_{h,\theta}(Z)] - \mathbb{E}_Q[\varphi_{h,\theta}(Z')] \right| = \left| \mathbb{E}_\pi[\varphi_{h,\theta}(Z) - \varphi_{h,\theta}(Z')] \right| \le L_\beta\,\mathbb{E}_\pi[d(Z,Z')^\beta].$$
Take the infimum over couplings. For any $r\ge\beta$, Jensen's inequality yields
$$\inf_\pi \mathbb{E}_\pi[d(Z,Z')^\beta] \le \inf_\pi\left( \mathbb{E}_\pi[d(Z,Z')^r] \right)^{\beta/r} = W_r(P,Q)^\beta.$$
Hence, for every $\theta$,
$$\left| \left( \theta + \tfrac{1}{\alpha}\mathbb{E}_P[\varphi_{h,\theta}] \right) - \left( \theta + \tfrac{1}{\alpha}\mathbb{E}_Q[\varphi_{h,\theta}] \right) \right| \le \frac{L_\beta}{\alpha}\,W_r(P,Q)^\beta.$$
Taking the infimum over $\theta$ on both sides of the RU representation yields the claim.

Lévy-Prokhorov bound with moment control. Assume $\mathcal{Z} = \mathbb{R}^d$ with the Euclidean metric and $\pi_{\mathrm{LP}}(P,Q) \le \varepsilon$. By Strassen's theorem for the Prokhorov metric, there exists a coupling $(Z,\widetilde Z)$ with marginals $P, Q$ such that
$$\mathbb{P}\left( \|Z-\widetilde Z\| > \varepsilon \right) \le \varepsilon.$$
Let $G := \{\|Z-\widetilde Z\| \le \varepsilon\}$ and $B := G^c$, so $\mathbb{P}(B) \le \varepsilon$. Fix $\theta\in\mathbb{R}$ and write $X = \ell(h,Z)$, $Y = \ell(h,\widetilde Z)$. Using again that the positive part is $1$-Lipschitz, $|(X-\theta)_+ - (Y-\theta)_+| \le |X-Y|$. Split on $G$ and $B$:
$$\mathbb{E}|X-Y| \le \mathbb{E}[|X-Y|\mathbf{1}_G] + \mathbb{E}[|X-Y|\mathbf{1}_B].$$
On $G$, Lipschitzness (the $\beta=1$ case, with constant $L_\ell$) gives $|X-Y| \le L_\ell\|Z-\widetilde Z\| \le L_\ell\varepsilon$, so $\mathbb{E}[|X-Y|\mathbf{1}_G] \le L_\ell\varepsilon$. On $B$, use $|X-Y| \le |X|+|Y|$ and Hölder with $p = 1+\lambda$:
$$\mathbb{E}[|X|\mathbf{1}_B] \le \left( \mathbb{E}|X|^p \right)^{1/p}\mathbb{P}(B)^{1-1/p} \le M_p^{1/p}\varepsilon^{1-1/p},$$
and similarly for $Y$. Thus $\mathbb{E}|X-Y| \le L_\ell\varepsilon + 2M_p^{1/p}\varepsilon^{1-1/p}$. Consequently, for each fixed $\theta$,
$$\left| \left( \theta + \tfrac{1}{\alpha}\mathbb{E}_P[(X-\theta)_+] \right) - \left( \theta + \tfrac{1}{\alpha}\mathbb{E}_Q[(Y-\theta)_+] \right) \right| \le \frac{1}{\alpha}\left( L_\ell\varepsilon + 2M_p^{1/p}\varepsilon^{1-1/p} \right).$$
Taking the infimum over $\theta$ yields
$$\left| \mathrm{CVaR}_\alpha(\mathcal{L}_P) - \mathrm{CVaR}_\alpha(\mathcal{L}_Q) \right| \le \frac{1}{\alpha}\left( L_\ell\varepsilon + 2M_p^{1/(\lambda+1)}\varepsilon^{\lambda/(\lambda+1)} \right).$$
Replacing $\varepsilon$ by $2\varepsilon$ and symmetrizing (a standard padding) gives
$$\left| \mathrm{CVaR}_\alpha(\mathcal{L}_P) - \mathrm{CVaR}_\alpha(\mathcal{L}_Q) \right| \le \frac{1}{\alpha}\left( 2L_\ell\varepsilon + 2M_p^{1/p}\varepsilon^{1-1/p} \right).$$
For the matching lower bound, use the two-point construction as in the TV lower bound: $P = \delta_0$ and $Q = (1-\varepsilon)\delta_0 + \varepsilon\delta_b$ with $b = (M_p/\varepsilon)^{1/p}$ and $\varepsilon \le \alpha$. For every Borel set $A\subset\mathbb{R}$ we have
$$Q(A) \le P(A^\varepsilon) + \varepsilon \qquad\text{and}\qquad P(A) \le Q(A^\varepsilon) + \varepsilon,$$
because the only mass unmatched within an $\varepsilon$-enlargement is the $\varepsilon$ mass at $b$, which is absorbed by the additive $\varepsilon$ slack. Hence $\pi_{\mathrm{LP}}(P,Q) \le \varepsilon$. As computed above,
$$\left| \mathrm{CVaR}_\alpha(Q) - \mathrm{CVaR}_\alpha(P) \right| = \frac{1}{\alpha}M_p^{1/p}\varepsilon^{1-1/p},$$
which shows the bound is tight.

E Estimator Robustness

E.1 Truncated CVaR-Loss ERM is Optimal

Proof of Theorem 4.2 (upper bound). First we decompose the CVaR difference. Fix $h\in\mathcal{H}$ and let $\theta_Q^*$ minimize $R_\alpha^Q(h)$, so that
$$R_\alpha^Q(h) = \theta_Q^* + \frac{1}{\alpha}\mathbb{E}_Q[(\ell(h,Z)-\theta_Q^*)_+]. \quad (95)$$
By definition of the infimum,
$$R_\alpha^P(h) \le \theta_Q^* + \frac{1}{\alpha}\mathbb{E}_P[(\ell(h,Z)-\theta_Q^*)_+]. \quad (96)$$
Subtracting gives
$$R_\alpha^P(h) - R_\alpha^Q(h) \le \frac{1}{\alpha}\left( \mathbb{E}_P[(\ell(h,Z)-\theta_Q^*)_+] - \mathbb{E}_Q[(\ell(h,Z)-\theta_Q^*)_+] \right). \quad (97)$$
Analogously, using $\theta_P^*$,
$$R_\alpha^Q(h) - R_\alpha^P(h) \le \frac{1}{\alpha}\left( \mathbb{E}_Q[(\ell(h,Z)-\theta_P^*)_+] - \mathbb{E}_P[(\ell(h,Z)-\theta_P^*)_+] \right). \quad (98)$$
Thus
$$|R_\alpha^P(h) - R_\alpha^Q(h)| \le \frac{1}{\alpha}\max\left\{ \left| \mathbb{E}_P[(\ell-\theta_Q^*)_+] - \mathbb{E}_Q[(\ell-\theta_Q^*)_+] \right|,\ \left| \mathbb{E}_P[(\ell-\theta_P^*)_+] - \mathbb{E}_Q[(\ell-\theta_P^*)_+] \right| \right\}. \quad (99)$$
Define $f_\theta(z) = (\ell(h,z)-\theta)_+$. For a truncation level $T>0$, split
$$f_\theta(z) = \min(f_\theta(z), T) + (f_\theta(z) - T)_+. \quad (100)$$
Thus
$$\Delta(h,\theta) := \left| \mathbb{E}_P[f_\theta(Z)] - \mathbb{E}_Q[f_\theta(Z)] \right| \le \left| \mathbb{E}_P[\min(f_\theta,T)] - \mathbb{E}_Q[\min(f_\theta,T)] \right| + \mathbb{E}_P[(f_\theta-T)_+] + \mathbb{E}_Q[(f_\theta-T)_+]. \quad (101)$$
Since $0 \le \min(f_\theta(z),T) \le T$, the total-variation inequality gives
$$\left| \mathbb{E}_P[\min(f_\theta,T)] - \mathbb{E}_Q[\min(f_\theta,T)] \right| \le T\cdot d_{TV}(P,Q). \quad (102)$$
We have $f_\theta(z) \le \ell(h,z)$ from Corollary G.7. This step is crucial for the rest of the proof, and a detailed justification is given after the present proof. The motivation: because CVaR's inner optimization never benefits from a negative threshold when losses are nonnegative (Proposition G.6), the optimal threshold obeys $\theta_P^* \ge 0$. In this natural regime, the shifted loss $f_\theta = (\ell-\theta)_+$ is pointwise dominated by the raw loss $\ell$ (Lemma G.5). This dominance legitimizes replacing the tail of $f_\theta$ by the tail of $\ell$ in expectation bounds, letting one control the truncation remainder via the $(1+\lambda)$-moment (Corollary G.7). Since $f_\theta(z) \le \ell(h,z)$, we have $(f_\theta - T)_+ \le (\ell - T)_+$, so
$$\mathbb{E}_P[(f_\theta-T)_+] \le \mathbb{E}_P[(\ell(h,Z)-T)_+]. \quad (103)$$
Using Markov's inequality,
$$\mathbb{P}(\ell(h,Z) > T) \le \frac{\mathbb{E}_P[\ell(h,Z)^{1+\lambda}]}{T^{1+\lambda}} \le \frac{M}{T^{1+\lambda}}. \quad (104)$$
Therefore, by the tail-integral identity,
$$\mathbb{E}_P[(\ell(h,Z)-T)_+] = \int_T^\infty \mathbb{P}(\ell(h,Z)>t)\,dt \le \int_T^\infty \frac{M}{t^{1+\lambda}}\,dt = \frac{M}{\lambda T^\lambda}. \quad (105)\text{-}(107)$$
Similarly,
$$\mathbb{E}_Q[(\ell(h,Z)-T)_+] \le \frac{M}{\lambda T^\lambda}, \quad (108)$$
so
$$\mathbb{E}_P[(f_\theta-T)_+] + \mathbb{E}_Q[(f_\theta-T)_+] \le \frac{2M}{\lambda T^\lambda}. \quad (109)$$
Thus
$$\Delta(h,\theta) \le T\cdot d_{TV}(P,Q) + \frac{2M}{\lambda T^\lambda}. \quad (110)$$
Optimize over $T$. Let $\delta = d_{TV}(P,Q)$ and define
$$g(T) = T\delta + \frac{2M}{\lambda T^\lambda}. \quad (111)$$
The derivative is
$$g'(T) = \delta - \frac{2M}{T^{1+\lambda}}. \quad (112)$$
Setting $g'(T) = 0$,
$$T^* = \left( \frac{2M}{\delta} \right)^{1/(1+\lambda)}. \quad (113)$$
At $T^*$,
$$g(T^*) = (2M)^{1/(1+\lambda)}\,\delta^{\lambda/(1+\lambda)}\left( 1 + \frac{1}{\lambda} \right). \quad (114)$$
Therefore,
$$|R_\alpha^P(h) - R_\alpha^Q(h)| \le \frac{(2M)^{1/(1+\lambda)}\left( 1+\frac{1}{\lambda} \right)}{\alpha}\,d_{TV}(P,Q)^{\lambda/(1+\lambda)}. \quad (115)$$
Since the bound is uniform over $h\in\mathcal{H}$, the theorem holds.
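The truncation trade-off (111)-(114) can be checked directly. The following sketch (names are ours) evaluates $g(T) = T\,d_{TV} + 2M/(\lambda T^\lambda)$ at the optimizer $T^* = (2M/d_{TV})^{1/(1+\lambda)}$ and exhibits the $d_{TV}^{\lambda/(1+\lambda)}$ scaling of (115).

```python
def optimal_truncation_tv(M, lam, tv):
    """g(T) = T * d_TV + 2 M / (lam * T^lam) is minimized at
    T* = (2 M / d_TV)^{1/(1+lam)}, cf. equations (111)-(114)."""
    T = (2.0 * M / tv) ** (1.0 / (1.0 + lam))
    return T, T * tv + 2.0 * M / (lam * T ** lam)

for tv in (1e-1, 1e-2, 1e-3):
    print(tv, optimal_truncation_tv(M=1.0, lam=0.5, tv=tv)[1])
```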
Proof of the lower bound in Theorem 4.2. We construct an explicit pair of distributions achieving the claimed scaling. Fix $\varepsilon\in(0,\alpha)$ and consider a scalar loss $Z$. Let $P$ be the Dirac distribution at zero, $P(Z=0)=1$; clearly $R_\alpha^P(h) = 0$. Next, define $Q$ by shifting an $\varepsilon$-fraction of mass to a positive value $z>0$:
$$Q(Z=z) = \varepsilon, \qquad Q(Z=0) = 1-\varepsilon.$$
It is immediate that $d_{TV}(P,Q) = \varepsilon$. To satisfy Assumption 2.1 we require $\mathbb{E}_Q[|Z|^{1+\lambda}] = \varepsilon z^{1+\lambda} \le M$; we choose the largest admissible value,
$$z = \left( \frac{M}{\varepsilon} \right)^{\frac{1}{1+\lambda}}.$$
We now compute the CVaR of $Q$. Since $\varepsilon < \alpha$, the worst $\alpha$-fraction of outcomes consists of the entire $\varepsilon$-mass at $z$ together with an additional $(\alpha-\varepsilon)$-mass at $0$. The $(1-\alpha)$-quantile of $Q$ is therefore zero, and the CVaR reduces to the average loss over this tail:
$$R_\alpha^Q(h) = \frac{1}{\alpha}\left( \varepsilon z + (\alpha-\varepsilon)\cdot 0 \right) = \frac{\varepsilon}{\alpha}\left( \frac{M}{\varepsilon} \right)^{\frac{1}{1+\lambda}} = \frac{M^{\frac{1}{1+\lambda}}}{\alpha}\,\varepsilon^{\frac{\lambda}{1+\lambda}}.$$
Since $R_\alpha^P(h) = 0$, we obtain
$$|R_\alpha^P(h) - R_\alpha^Q(h)| = \frac{M^{\frac{1}{1+\lambda}}}{\alpha}\,\varepsilon^{\frac{\lambda}{1+\lambda}}.$$
This pair $(P,Q)$ is feasible for the supremum in Theorem 4.2, which proves the minimax lower bound.

E.2 Robust Estimation under Adversarial Contamination with Oblivious Adversaries

Proof. The proof proceeds in four main steps: establishing boundedness of the auxiliary variable $\theta$; decomposing the error into bias and estimation terms; bounding the error within a single block via uniform concentration and corruption control; and aggregating the block estimates via the median.

Applying Theorem G.1 pointwise with $l = \ell(h,Z)$, we may restrict the optimization to the compact domain
$$\Theta := [0, R], \qquad R = \frac{M^{1/(1+\lambda)}}{\alpha}.$$
Moment bound for the variational loss. Define the variational loss
$$\phi(Z; h,\theta) = \theta + \frac{1}{\alpha}(\ell(h,Z)-\theta)_+, \qquad \theta\in\Theta.$$

Lemma E.1 (Moment inflation bound). Under Assumption 2.1 and Theorem G.1, there exists a constant $C_\lambda > 0$ such that
$$\sup_{h\in\mathcal{H},\,\theta\in\Theta}\mathbb{E}\left[ |\phi(Z;h,\theta)|^{1+\lambda} \right] \le C_\lambda\left( \frac{M}{\alpha^{1+\lambda}} + R^{1+\lambda} \right) =: M_\phi.$$
In particular, since $R = M^{1/(1+\lambda)}/\alpha$, we obtain $M_\phi \lesssim_\lambda \frac{M}{\alpha^{1+\lambda}}$.

Proof. Using $(a+b)^{1+\lambda} \le 2^\lambda(a^{1+\lambda} + b^{1+\lambda})$ for $a,b\ge 0$ and $\phi(Z;h,\theta) \le \theta + \ell(h,Z)/\alpha$, we have
$$|\phi(Z;h,\theta)|^{1+\lambda} \le C_\lambda\left( |\theta|^{1+\lambda} + \alpha^{-(1+\lambda)}\ell(h,Z)^{1+\lambda} \right).$$
Taking expectations and using $\theta\in[0,R]$ and Assumption 2.1 yields the result.

Error decomposition. Let $R^B(h,\theta) = \mathbb{E}[\min(\phi(Z;h,\theta), B)]$ be the expected truncated risk. We decompose the total error:
$$\sup_h|\widehat R_\alpha(h) - R_\alpha(h)| \le \sup_{h,\theta}|\widehat R_\alpha(h,\theta) - R(h,\theta)| \le \underbrace{\sup_{h,\theta}|R(h,\theta) - R^B(h,\theta)|}_{\text{truncation bias}} + \underbrace{\sup_{h,\theta}|\widehat R_\alpha(h,\theta) - R^B(h,\theta)|}_{\text{estimation error}}.$$
Bias control: since $\phi\ge 0$, $|R - R^B| \le \mathbb{E}[\phi\,\mathbf{1}_{\phi>B}]$. Using Hölder's inequality, Markov's inequality, and the moment bound $M_\phi$,
$$\mathbb{E}[\phi\,\mathbf{1}_{\phi>B}] \le \left( \mathbb{E}\phi^{1+\lambda} \right)^{\frac{1}{1+\lambda}}\left( \mathbb{P}(\phi>B) \right)^{\frac{\lambda}{1+\lambda}} \le M_\phi B^{-\lambda}.$$

Analysis of a single block. Fix a block index $j\in\{1,\dots,K\}$. Recall that
$$\widehat\mu_j(h,\theta) = \frac{1}{m}\sum_{i\in B_j}\phi_B(Z_i; h,\theta), \qquad \phi_B = \min(\phi, B),$$
and define the truncated population risk $R^B(h,\theta) = \mathbb{E}[\phi_B(Z;h,\theta)]$. Let $S_{\mathrm{clean}}\subset\{1,\dots,n\}$ denote the indices of uncorrupted points, and define
$$B_j^{\mathrm{clean}} = B_j\cap S_{\mathrm{clean}}, \qquad N_j = |B_j\setminus B_j^{\mathrm{clean}}|,$$
the set of clean indices and the number of outliers in block $j$, respectively.
We decompose:
$$\widehat\mu_j(h,\theta) - R^B(h,\theta) = \frac{1}{m}\sum_{i\in B_j^{\mathrm{clean}}}\left( \phi_B(Z_i;h,\theta) - R^B(h,\theta) \right) + \frac{1}{m}\sum_{i\in B_j\setminus B_j^{\mathrm{clean}}}\left( \phi_B(Z_i';h,\theta) - R^B(h,\theta) \right) =: \xi_j(h,\theta) + \Delta_j(h,\theta),$$
where $Z_i'$ denotes the (possibly adversarially) corrupted values. Thus $\xi_j$ captures the sampling fluctuation of the clean data and $\Delta_j$ captures the corruption bias.

(a) Uniform concentration of the clean part. Conditional on the corruption pattern and the random permutation, the set $B_j^{\mathrm{clean}}$ consists of points drawn without replacement from the clean sample. By Hoeffding's reduction principle, concentration inequalities for sampling without replacement are dominated by those for i.i.d. sampling, so it suffices to analyze the i.i.d. case.

Lemma E.2 (Uniform concentration on a single clean block). There exists a universal constant $C>0$ such that, conditional on $B_j^{\mathrm{clean}}$, with probability at least $1 - 0.1$,
$$\sup_{h\in\mathcal{H},\,\theta\in\Theta}|\xi_j(h,\theta)| \le C\left( \sqrt{\frac{M_\phi B^{1-\lambda}\,d\log m}{m}} + \frac{B\,d\log m}{m} \right) =: E_{\mathrm{stat}}(B).$$

Proof. The function class $\mathcal{F}_B$ is uniformly bounded by $B$. Moreover, by Lemma G.10, $\sup_{f\in\mathcal{F}_B}\mathrm{Var}(f(Z)) \le M_\phi B^{1-\lambda}$. By Assumption 2.4, the $L_2(Q)$ covering numbers of $\mathcal{F}_B$ satisfy $\log\mathcal{N}(\mathcal{F}_B, \|\cdot\|_{L_2(Q)}, u) \le d\log(C_0 B/u)$. Therefore, by Bousquet's version of Talagrand's inequality combined with standard entropy-integral bounds, for i.i.d. samples we obtain
$$\sup_{f\in\mathcal{F}_B}\left| \frac{1}{m}\sum_{i=1}^m\left( f(Z_i) - \mathbb{E}f \right) \right| \le C\left( \sqrt{\frac{M_\phi B^{1-\lambda}\,d\log m}{m}} + \frac{B\,d\log m}{m} \right)$$
with probability at least $0.9$. By Hoeffding's reduction principle, the same bound holds for sampling without replacement from the clean data.

(b) Control of the corruption level. Since the adversary is oblivious and the learner shuffles the data uniformly at random, the number of corrupted points in block $j$ satisfies $N_j \sim \mathrm{Hypergeo}(n, \epsilon n, m)$.

Lemma E.3 (Outlier proportion in a block). Assume $\epsilon \le 1/2 - \gamma$. Then for all sufficiently large $m$,
$$\mathbb{P}\left( \frac{N_j}{m} \le \epsilon + \frac{\gamma}{4} \right) \ge 1 - 0.1.$$

Proof. By Chvátal's hypergeometric tail bound, $\mathbb{P}(N_j \ge (\epsilon+\gamma/4)m) \le \exp(-2m(\gamma/4)^2)$. For $m \ge C/\gamma^2$, the right-hand side is at most $0.1$.

(c) Corruption bias bound. On the event of Lemma E.3, since $0\le\phi_B\le B$,
$$\sup_{h,\theta}|\Delta_j(h,\theta)| \le \frac{N_j}{m}B \le \left( \epsilon + \frac{\gamma}{4} \right)B.$$

(d) Conclusion for a single block. Combining Lemmas E.2 and E.3, with probability at least $0.8$ a block simultaneously satisfies
$$\sup_{h,\theta}|\widehat\mu_j(h,\theta) - R^B(h,\theta)| \le E_{\mathrm{stat}}(B) + \left( \epsilon+\frac{\gamma}{4} \right)B.$$
Such a block will be called Good.

Robust aggregation via the median. Recall from the previous step that for each block $j$ we defined the events
$$\mathcal{E}_j^{\mathrm{stat}} = \left\{ \sup_{h,\theta}|\xi_j(h,\theta)| \le E_{\mathrm{stat}}(B) \right\}, \qquad \mathcal{E}_j^{\mathrm{corr}} = \left\{ \frac{N_j}{m} \le \epsilon+\frac{\gamma}{4} \right\}.$$
A block $j$ is called Good if $\mathcal{E}_j^{\mathrm{stat}}\cap\mathcal{E}_j^{\mathrm{corr}}$ holds. From Lemmas E.2 and E.3, by the union bound,
$$\mathbb{P}(\text{block } j \text{ is Good}) \ge 1 - (0.1+0.1) = 0.8.$$
Define the indicator variables $I_j = \mathbf{1}\{\text{block } j \text{ is Good}\}$, $j = 1,\dots,K$.

Lemma E.4 (A majority of blocks are good). Assume $\epsilon \le 1/2-\gamma$. There exists a universal constant $c>0$ such that if $K \ge \frac{8}{\gamma^2}\log\left( \frac{4}{\delta} \right)$, then with probability at least $1-\delta$,
$$\sum_{j=1}^K I_j > \frac{K}{2}.$$

Proof. The indicators $I_j$ are functions of a random partition of a finite population; they form a negatively associated family.
For negatively associated Bernoulli random variables, Chernoff-Hoeffding inequalities hold in the same form as for independent variables (see Dubhashi and Ranjan [1998]). Since $\mathbb{E}[I_j] \ge 0.8$, Hoeffding's inequality implies
$$\mathbb{P}\left( \sum_{j=1}^K I_j \le \frac{K}{2} \right) \le \exp\left( -2K(0.8-0.5)^2 \right) \le \exp\left( -\frac{\gamma^2 K}{2} \right),$$
where the last step uses $\gamma \le 1/2$, so that $\gamma^2/2 \le 1/8 < 2(0.3)^2$. Choosing $K \ge \frac{8}{\gamma^2}\log(4/\delta)$ ensures the right-hand side is at most $\delta$.

On the event of Lemma E.4, strictly more than half the blocks are Good. For any Good block $j$ we simultaneously have
$$\sup_{h,\theta}|\xi_j(h,\theta)| \le E_{\mathrm{stat}}(B), \qquad \sup_{h,\theta}|\Delta_j(h,\theta)| \le \left( \epsilon+\frac{\gamma}{4} \right)B,$$
hence
$$\sup_{h,\theta}|\widehat\mu_j(h,\theta) - R^B(h,\theta)| \le E_{\mathrm{stat}}(B) + \left( \epsilon+\frac{\gamma}{4} \right)B.$$
Since the median of $K$ numbers lies between the minimum and maximum of any subset of more than $K/2$ elements, it follows deterministically that
$$\sup_{h,\theta}|\widehat R_\alpha(h,\theta) - R^B(h,\theta)| \le \max_{j: I_j = 1}|\widehat\mu_j(h,\theta) - R^B(h,\theta)| \le E_{\mathrm{stat}}(B) + \left( \epsilon+\frac{\gamma}{4} \right)B.$$

Final error bound and balancing. Combining the previous step with the truncation-bias bound from Lemma G.10, we obtain that with probability at least $1-\delta$,
$$\sup_{h,\theta}|\widehat R_\alpha(h,\theta) - R(h,\theta)| \le M_\phi B^{-\lambda} + E_{\mathrm{stat}}(B) + \left( \epsilon+\frac{\gamma}{4} \right)B.$$
Recalling the definition of $E_{\mathrm{stat}}(B)$ and using $m\asymp n/K$ with $K = O(\gamma^{-2}\log(1/\delta))$, we may rewrite the bound (absorbing constants) as
$$\mathrm{Err}(B) \lesssim M_\phi B^{-\lambda} + \sqrt{\frac{M_\phi B^{1-\lambda}\,d\log n}{n}} + \frac{B\,d\log n}{n} + \epsilon B.$$
Assuming $n\gtrsim d\log n$, the linear term $\frac{B\,d\log n}{n}$ is of smaller order than the variance term under the optimal choice of $B$ and may be absorbed. We therefore balance the remaining three dominant terms.
(i) Statistical regime. Balancing bias and variance,
$$M_\phi B^{-\lambda} \asymp \sqrt{\frac{M_\phi B^{1-\lambda}d}{n}} \implies B_{\mathrm{stat}} \asymp \left( \frac{M_\phi n}{d} \right)^{\frac{1}{1+\lambda}},$$
and substituting yields $\mathrm{Err} \lesssim M_\phi^{\frac{1}{1+\lambda}}\left( \frac{d}{n} \right)^{\frac{\lambda}{1+\lambda}}$.
(ii) Adversarial regime. Balancing bias and corruption,
$$M_\phi B^{-\lambda} \asymp \epsilon B \implies B_{\mathrm{adv}} \asymp \left( \frac{M_\phi}{\epsilon} \right)^{\frac{1}{1+\lambda}},$$
and substituting yields $\mathrm{Err} \lesssim M_\phi^{\frac{1}{1+\lambda}}\epsilon^{\frac{\lambda}{1+\lambda}}$.
Taking $B = \min(B_{\mathrm{stat}}, B_{\mathrm{adv}})$ and recalling that $\sup_h|\widehat R_\alpha(h) - R_\alpha(h)| \le \sup_{h,\theta}|\widehat R_\alpha(h,\theta) - R(h,\theta)|$, we conclude the proof of Theorem 4.3.
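For concreteness, here is a minimal Python sketch of the truncated median-of-means CVaR estimator analyzed above, for a single fixed hypothesis (so only the $\theta$-search remains). The truncation level, block count, grid resolution, and contamination in the demo are placeholder choices, and the grid search stands in for the exact minimization over $\Theta$.

```python
import numpy as np

def mom_truncated_cvar(losses, alpha, B, K, n_grid=200, seed=0):
    """Truncate losses at B, shuffle into K equal blocks, average the RU
    objective theta + (loss_B - theta)_+ / alpha per block, aggregate the
    K block values by a median, and minimize over a theta grid in [0, B]."""
    x = np.minimum(np.asarray(losses, dtype=float), B)
    x = np.random.default_rng(seed).permutation(x)  # random blocks vs. oblivious adversary
    m = len(x) // K
    blocks = x[: m * K].reshape(K, m)
    thetas = np.linspace(0.0, B, n_grid)
    hinge = np.maximum(blocks[:, :, None] - thetas[None, None, :], 0.0)  # (K, m, grid)
    block_obj = thetas[None, :] + hinge.mean(axis=1) / alpha             # (K, grid)
    return float(np.min(np.median(block_obj, axis=0)))

# A few gross outliers barely move the estimate on heavy-tailed data.
rng = np.random.default_rng(3)
sample = np.concatenate([rng.pareto(2.0, size=5000), np.full(50, 1e6)])
print(mom_truncated_cvar(sample, alpha=0.1, B=50.0, K=25))
```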
F Decision Robustness

F.1 Tail-Scarcity Instability for CVaR-ERM under a Finite $p$-th Moment

Throughout, $\alpha\in(0,1)$ is fixed and
$$\mathrm{CVaR}_\alpha(X; Q) = \inf_{\theta\in\mathbb{R}}\left\{ \theta + \frac{1}{\alpha}\mathbb{E}_Q[(X-\theta)_+] \right\}.$$
Empirical CVaR objective. Given i.i.d. samples $Z_1,\dots,Z_n\sim P$, let $P_n := \frac{1}{n}\sum_{i=1}^n\delta_{Z_i}$. For $h\in\mathcal{H}$ define the empirical RU functional and empirical CVaR
$$\widehat\Phi_n(h,\theta) := \theta + \frac{1}{\alpha}\mathbb{E}_{P_n}\left[ (\ell(h,Z)-\theta)_+ \right] = \theta + \frac{1}{\alpha n}\sum_{i=1}^n(\ell(h,Z_i)-\theta)_+, \quad \theta\in\mathbb{R}, \quad (116)$$
$$\widehat R_n(h) := \inf_{\theta\in\mathbb{R}}\widehat\Phi_n(h,\theta) = \mathrm{CVaR}_\alpha(\ell(h,Z); P_n). \quad (117)$$
The population objective is $R_P(h) := \mathrm{CVaR}_\alpha(\ell(h,Z); P)$.

Lemma F.1 (RU minimizer at $\theta=0$; scaled-mean identity). Let $X\ge 0$ be integrable and suppose $Q(X=0) \ge 1-\alpha$. Then $\theta^\star = 0$ minimizes $\theta\mapsto\theta+\frac{1}{\alpha}\mathbb{E}_Q[(X-\theta)_+]$ and $\mathrm{CVaR}_\alpha(X;Q) = \frac{1}{\alpha}\mathbb{E}_Q[X]$. In particular, if $x_1,\dots,x_n\ge 0$ satisfy $\#\{i: x_i = 0\} \ge (1-\alpha)n$, then the empirical CVaR satisfies
$$\mathrm{CVaR}_\alpha(x_1,\dots,x_n) = \frac{1}{\alpha n}\sum_{i=1}^n x_i.$$

Proof. Define $\phi(\theta) := \theta + \frac{1}{\alpha}\mathbb{E}_Q[(X-\theta)_+]$. For $\theta<0$, $\phi'(\theta) = 1 - \frac{1}{\alpha}Q(X>\theta) = 1-\frac{1}{\alpha} < 0$, so no minimizer lies below $0$. For $\theta\ge0$, the right derivative equals $\phi_+'(\theta) = 1 - \frac{1}{\alpha}Q(X>\theta) \ge 1-\frac{1}{\alpha}Q(X>0) \ge 0$ because $Q(X>0)\le\alpha$. Hence $\phi$ is minimized at $\theta=0$ with $\phi(0) = \frac{1}{\alpha}\mathbb{E}_Q[X]$. The empirical statement is the special case $Q = P_n$.

Theorem (Restatement of Theorem 4.11, detailed). Fix $\alpha\in(0,1)$ and $1<p<2$. There exist a fixed distribution $P$ on $\mathcal{Z} = \mathbb{R}_+$ and constants $\varepsilon\in(0,\alpha/4)$, $\gamma>0$, and $C\in(0,1)$ with $C > \alpha\gamma$ such that the following holds. For every sufficiently large $n$, define $\mathcal{H} = \{h_A, h_B\}$ and the (sample-size-dependent) losses
$$\ell_n(h_A, z) := z, \qquad \ell_n(h_B, z) := \begin{cases} 0, & z = 0, \\ z + \gamma - Cn\,\mathbf{1}\{z\in(n,2n]\}, & z>0. \end{cases} \quad (118)$$
Let $R_P^{(n)}(h) := \mathrm{CVaR}_\alpha(\ell_n(h,Z); P)$ and $\widehat R_n(h) := \mathrm{CVaR}_\alpha(\ell_n(h,Z); P_n)$. Then the following hold:
1. ($p$-th moment only.) For each fixed $n$ and $h\in\mathcal{H}$, $\mathbb{E}_P|\ell_n(h,Z)|^p < \infty$ and $\mathbb{E}_P|\ell_n(h,Z)|^q = \infty$ for all $q>p$.
2. (Strict population optimality, uniformly for large $n$.) There exist $\gamma_0>0$ and $n_0$ such that for all $n\ge n_0$, $R_P^{(n)}(h_B) - R_P^{(n)}(h_A) \ge \gamma_0$, so $S(P) = \{h_A\}$.
3. (One-point flip with sharp probability.) There exist constants $c>0$ and $n_0$ such that for all $n\ge n_0$,
$$\Pr_{D\sim P^{\otimes n}}\left( \exists D' \text{ differing from } D \text{ in exactly one sample s.t. } \arg\min_{\mathcal{H}}\widehat R_n(\cdot; D) = \{h_B\},\ \arg\min_{\mathcal{H}}\widehat R_n(\cdot; D') = \{h_A\} \right) \ge \frac{c\,n^{1-p}}{(\log n)^2}. \quad (119)$$
4. (Sharpness.) Under $\sup_h\mathbb{E}_P|\ell(h,Z)|^p < \infty$ and a fixed population margin, any such flip probability is $O(n^{1-p})$ up to logarithmic factors.

Proof. Let $Y$ satisfy the tail law
$$\mathbb{P}(Y>y) = \frac{1}{y^p(\log y)^2}, \qquad y\ge e. \quad (120)$$
Then $\mathbb{E}Y^p < \infty$ and $\mathbb{E}Y^q = \infty$ for all $q>p$ (tail-integral test). Define $Z\sim P$ by the mixture
$$Z = \begin{cases} 0, & \text{with probability } 1-\alpha+\varepsilon, \\ Y, & \text{with probability } \alpha-\varepsilon. \end{cases} \quad (121)$$
Then $P(Z=0) = 1-\alpha+\varepsilon > 1-\alpha$, and $Z$ inherits $\mathbb{E}Z^p < \infty$ and $\mathbb{E}Z^q = \infty$ for all $q>p$.

For $h_A$, $\ell_n(h_A,Z) = Z$, so $\mathbb{E}|\ell_n(h_A,Z)|^p < \infty$ and $\mathbb{E}|\ell_n(h_A,Z)|^q = \infty$ for all $q>p$. For $h_B$, note that $\ell_n(h_B,0) = 0$; for $z>0$ with $z\notin(n,2n]$ we have $\ell_n(h_B,z) = z+\gamma > 0$, while for $z\in(n,2n]$ we have $\ell_n(h_B,z) = z+\gamma-Cn \ge (1-C)n+\gamma > 0$ because $C<1$. Also $\ell_n(h_B,z) \le z+\gamma$, hence $\mathbb{E}|\ell_n(h_B,Z)|^p < \infty$. Finally, on $\{Z>0,\ Z\notin(n,2n]\}$ we have $\ell_n(h_B,Z) = Z+\gamma \ge Z$, and since $P(Z>0) = \alpha-\varepsilon > 0$ and $\mathbb{E}Z^q = \infty$ for all $q>p$, it follows that $\mathbb{E}|\ell_n(h_B,Z)|^q = \infty$ for all $q>p$. This proves (1).

Because $P(\ell_n(h_A,Z) = 0) = P(Z=0) = 1-\alpha+\varepsilon > 1-\alpha$ and also $P(\ell_n(h_B,Z)=0) \ge 1-\alpha+\varepsilon$, and because the losses are nonnegative, Lemma F.1 applies to both actions and yields $R_P^{(n)}(h) = \frac{1}{\alpha}\mathbb{E}[\ell_n(h,Z)]$. Therefore
$$R_P^{(n)}(h_A) = \frac{1}{\alpha}\mathbb{E}[Z] = \frac{\alpha-\varepsilon}{\alpha}\mathbb{E}[Y], \quad (122)$$
$$R_P^{(n)}(h_B) = \frac{\alpha-\varepsilon}{\alpha}\mathbb{E}\left[ Y+\gamma-Cn\,\mathbf{1}\{Y\in(n,2n]\} \right]. \quad (123)$$
Hence
$$R_P^{(n)}(h_B) - R_P^{(n)}(h_A) = \frac{\alpha-\varepsilon}{\alpha}\left( \gamma - Cn\,\mathbb{P}(Y\in(n,2n]) \right). \quad (124)$$
Using (120), $\mathbb{P}(Y\in(n,2n]) = \mathbb{P}(Y>n) - \mathbb{P}(Y>2n) \le \mathbb{P}(Y>n) = \frac{1}{n^p(\log n)^2}$. Thus
$$Cn\,\mathbb{P}(Y\in(n,2n]) \le \frac{C}{n^{p-1}(\log n)^2}\to 0.$$
Fix $\gamma>0$. Then there exists $n_0$ such that for all $n\ge n_0$, $Cn\,\mathbb{P}(Y\in(n,2n]) \le \gamma/2$. Plugging into (124) yields
$$R_P^{(n)}(h_B) - R_P^{(n)}(h_A) \ge \frac{\alpha-\varepsilon}{\alpha}\cdot\frac{\gamma}{2} =: \gamma_0 > 0,$$
proving (2).

Let $D = (Z_1,\dots,Z_n)\sim P^{\otimes n}$ and define $N_0 := \sum_{i=1}^n\mathbf{1}\{Z_i = 0\}$ and $N_+ := \sum_{i=1}^n\mathbf{1}\{Z_i>0\}$. Since $\mathbb{E}[N_+] = (\alpha-\varepsilon)n$, Hoeffding's inequality implies $\mathbb{P}(N_+\ge 2) \ge 1-e^{-c_+n}$ for all large $n$; similarly, $\mathbb{P}(N_0 \ge (1-\alpha)n+1) \ge 1-e^{-c_0 n}$. Define the tail-bin event
$$G_n := \left\{ \exists!\,i : Z_i\in(n,2n] \right\}\cap\left\{ \max_{1\le i\le n}Z_i \le 2n \right\}. \quad (125)$$
Let $q_n := \mathbb{P}(Z\in(n,2n]) = (\alpha-\varepsilon)\mathbb{P}(Y\in(n,2n])$ and $r_n := \mathbb{P}(Z>2n) = (\alpha-\varepsilon)\mathbb{P}(Y>2n)$. Independence gives
$$\Pr(G_n) = n\,q_n(1-q_n-r_n)^{n-1}.$$
Since $q_n + r_n = O(n^{-p}(\log n)^{-2})$, we have $n(q_n+r_n)\to0$; hence for all large $n$, $(1-q_n-r_n)^{n-1} \ge 1/2$ and therefore
$$\Pr(G_n) \ge \frac{1}{2}n q_n. \quad (126)$$
Moreover, for all large $n$,
$$\mathbb{P}(Y\in(n,2n]) = \mathbb{P}(Y>n) - \mathbb{P}(Y>2n) \ge \frac{1}{n^p(\log n)^2} - \frac{1}{(2n)^p(\log(2n))^2} \ge \left( 1-2^{-p} \right)\frac{1}{n^p(\log(2n))^2} \ge \frac{c_1}{n^p(\log n)^2},$$
so $q_n \ge (\alpha-\varepsilon)c_1 n^{-p}(\log n)^{-2}$. Combining with (126) yields
$$\mathbb{P}(G_n) \ge \frac{c\,n^{1-p}}{(\log n)^2}. \quad (127)$$
We work on the event $\mathcal{E}_n := G_n\cap\{N_0\ge(1-\alpha)n+1\}\cap\{N_+\ge2\}$. By (127) and the exponential tails for $N_0, N_+$, we still have $\mathbb{P}(\mathcal{E}_n) \ge \frac{c\,n^{1-p}}{(\log n)^2}$ for all large $n$ (possibly with a smaller $c$). On $\{N_0\ge(1-\alpha)n\}$, Lemma F.1 implies
$$\widehat R_n(h) = \frac{1}{\alpha n}\sum_{i=1}^n\ell_n(h,Z_i), \qquad h\in\{h_A,h_B\}. \quad (128)$$
Hence, on $\mathcal{E}_n$,
$$\widehat R_n(h_B) - \widehat R_n(h_A) = \frac{1}{\alpha n}\sum_{i=1}^n\left( \ell_n(h_B,Z_i) - \ell_n(h_A,Z_i) \right) = \frac{1}{\alpha n}\sum_{i:\,Z_i>0}\left( \gamma - Cn\,\mathbf{1}\{Z_i\in(n,2n]\} \right). \quad (129)$$
On $G_n$ there is exactly one index $i^\star$ with $Z_{i^\star}\in(n,2n]$, so
$$\widehat R_n(h_B) - \widehat R_n(h_A) = \frac{1}{\alpha n}\left( N_+\gamma - Cn \right) \le \gamma - \frac{C}{\alpha} < 0,$$
because $N_+ \le \alpha n$ on $\mathcal{E}_n$ and $C > \alpha\gamma$. Thus, on $\mathcal{E}_n$, $\arg\min_{\mathcal{H}}\widehat R_n(\cdot;D) = \{h_B\}$.

Define $D'$ by replacing the unique bin point by $0$: set $Z'_{i^\star} := 0$ and $Z'_i := Z_i$ for $i\ne i^\star$. Since $N_0\ge(1-\alpha)n+1$ on $\mathcal{E}_n$, after the replacement we still have $N'_0\ge(1-\alpha)n$, so (128) holds for $D'$. Moreover, under $D'$ there are no samples in $(n,2n]$, hence
$$\widehat R_n(h_B; D') - \widehat R_n(h_A; D') = \frac{1}{\alpha n}\sum_{i:\,Z'_i>0}\gamma = \frac{N'_+}{\alpha n}\gamma.$$
On $\mathcal{E}_n$ we have $N_+\ge2$, so after replacing one positive point $N'_+\ge1$; the difference is strictly positive and therefore $\arg\min_{\mathcal{H}}\widehat R_n(\cdot;D') = \{h_A\}$. This proves (119).

For the sharpness claim (4): under $\sup_h\mathbb{E}_P|\ell(h,Z)|^p < \infty$ and a fixed population margin, flipping the empirical minimizer by changing one sample requires an observation of magnitude $\Omega(n)$, since the CVaR objective changes by at most $|\ell|/(\alpha n)$ per sample. By Markov's inequality and a union bound, $\mathbb{P}(\max_i|\ell_i| \gtrsim n) = O(n^{1-p})$, proving the upper bound up to logarithms.
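A toy numerical instance of the construction (118) makes the one-point flip tangible. All specific values below ($\alpha$, $\gamma$, $C$, $n$, and the planted samples) are illustrative choices satisfying $C\in(\alpha\gamma,1)$ and $N_0\ge(1-\alpha)n$, under which Lemma F.1 reduces the empirical CVaR to a scaled mean.

```python
import numpy as np

def cvar_scaled_mean(losses, alpha):
    # Lemma F.1: with at least (1 - alpha) n zeros, empirical CVaR = mean / alpha.
    return np.mean(losses) / alpha

alpha, gamma, C, n = 0.2, 0.5, 0.9, 1000            # C > alpha * gamma, C < 1
z = np.zeros(n); z[:3] = [1.5 * n, 7.0, 9.0]        # one tail-bin point in (n, 2n]

def loss_B(z):
    return np.where(z > 0, z + gamma - C * n * ((z > n) & (z <= 2 * n)), 0.0)

print(cvar_scaled_mean(loss_B(z), alpha) - cvar_scaled_mean(z, alpha))   # < 0: h_B wins
z2 = z.copy(); z2[0] = 0.0                          # change exactly one sample
print(cvar_scaled_mean(loss_B(z2), alpha) - cvar_scaled_mean(z2, alpha)) # > 0: h_A wins
```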
G Auxiliary Results

G.1 Boundedness of the CVaR Minimizer under a Bounded Moment

Theorem G.1 (Boundedness of the CVaR minimizer under a bounded moment). Let $l\ge0$ be a nonnegative random variable on a probability space $(\Omega,\mathcal{F},\mathbb{P})$ satisfying $\mathbb{E}[l^{1+\varepsilon}] \le M < \infty$ for some $\varepsilon\in(0,1]$ and $M>0$. Fix $\alpha\in(0,1)$ and define, for $\theta\in\mathbb{R}$,
$$F(\theta) := \theta + \frac{1}{\alpha}\mathbb{E}\left[ (l-\theta)_+ \right], \qquad (x)_+ := \max\{x,0\}.$$
Then $F$ attains its minimum on $\mathbb{R}$, and every minimizer $\theta^*\in\arg\min_{\theta\in\mathbb{R}}F(\theta)$ satisfies
$$0\le\theta^*\le R, \qquad \text{where one may take } R = \frac{M^{1/(1+\varepsilon)}}{\alpha}.$$
In particular, the set of minimizers is nonempty and bounded, contained in $[0,R]$.

Proof. We divide the argument into steps: finiteness and continuity of $F$, behavior as $\theta\to\pm\infty$ (coercivity), nonnegativity of minimizers, and the explicit upper bound.

(A) $F$ is finite and convex. For any fixed $\theta\in\mathbb{R}$ we have the pointwise inequality $(l-\theta)_+ \le l+|\theta|$. Since $\mathbb{E}[l] \le (\mathbb{E}[l^{1+\varepsilon}])^{1/(1+\varepsilon)} \le M^{1/(1+\varepsilon)} < \infty$ (see step (E) below), it follows that $\mathbb{E}[(l-\theta)_+] \le \mathbb{E}[l]+|\theta| < \infty$, so $F(\theta)$ is finite for every $\theta$. For each fixed $\omega\in\Omega$ the map $\theta\mapsto(l(\omega)-\theta)_+$ is convex (the positive part of an affine function), and expectation preserves convexity; therefore $F$ is convex on $\mathbb{R}$. A convex function that is finite on all of $\mathbb{R}$ is continuous (hence lower semicontinuous). Thus $F$ is continuous and finite-valued on $\mathbb{R}$.

(B) Coercivity: $F(\theta)\to+\infty$ as $|\theta|\to\infty$. First, for every $\theta\in\mathbb{R}$, $F(\theta) = \theta + \frac{1}{\alpha}\mathbb{E}[(l-\theta)_+] \ge \theta$ because $(l-\theta)_+\ge0$; hence $F(\theta)\to+\infty$ as $\theta\to+\infty$. Next, for $\theta\le0$ we have $l-\theta\ge l\ge0$ (since $l\ge0$), so $(l-\theta)_+ = l-\theta$, and thus for $\theta\le0$
$$F(\theta) = \theta + \frac{1}{\alpha}\mathbb{E}[l-\theta] = \frac{\mathbb{E}[l]}{\alpha} + \theta\left( 1-\frac{1}{\alpha} \right).$$
Because $\alpha\in(0,1)$, the coefficient $1-1/\alpha$ is negative, so as $\theta\to-\infty$ the term $\theta(1-1/\alpha)\to+\infty$; hence $F(\theta)\to+\infty$ as $\theta\to-\infty$. Combining the two directions shows $F$ is coercive.

(C) Existence of a minimizer. Since $F$ is continuous on $\mathbb{R}$ and coercive (tends to $+\infty$ at $\pm\infty$), it attains its minimum on $\mathbb{R}$; thus $\arg\min_{\theta\in\mathbb{R}} F(\theta)$ is nonempty (lower semicontinuity and coercivity imply existence of a minimizer).

(D) No minimizer is negative. For $\theta\le0$ we have the formula $F(\theta) = \frac{\mathbb{E}[l]}{\alpha} + \theta\left( 1-\frac{1}{\alpha} \right)$. Evaluating at $\theta=0$ gives $F(0) = \mathbb{E}[l]/\alpha$. For any $\theta<0$,
$$F(\theta)-F(0) = \theta\left( 1-\frac{1}{\alpha} \right) > 0,$$
since $1-1/\alpha<0$ and $\theta<0$; hence $F(\theta)>F(0)$. Thus no $\theta<0$ can be a global minimizer, and every minimizer satisfies $\theta^*\ge0$.

(E) Upper bound on $\mathbb{E}[l]$. By Hölder's inequality (or the monotonicity of $L_p$-norms on a probability space) with $p = 1+\varepsilon>1$,
$$\mathbb{E}[l] \le \left( \mathbb{E}[l^{1+\varepsilon}] \right)^{1/(1+\varepsilon)} \le M^{1/(1+\varepsilon)}.$$
Consequently,
$$F(0) = \frac{\mathbb{E}[l]}{\alpha} \le \frac{M^{1/(1+\varepsilon)}}{\alpha}. \quad (*)$$

(F) Upper bound on minimizers. For any $\theta\ge0$ we have $F(\theta)\ge\theta$. If $\theta > F(0)$, then $F(\theta)\ge\theta>F(0)$, so such a $\theta$ cannot be a minimizer. Therefore every minimizer $\theta^*$ satisfies $\theta^*\le F(0)$, and combining with $(*)$ yields
$$0 \le \theta^* \le F(0) \le \frac{M^{1/(1+\varepsilon)}}{\alpha}.$$
Hence one may take $R = M^{1/(1+\varepsilon)}/\alpha$. We have shown that $F$ is finite, continuous, and coercive, so it attains a minimum; moreover, every minimizer satisfies $0\le\theta^*\le R$ with $R = M^{1/(1+\varepsilon)}/\alpha$. This completes the boundedness-of-the-minimizer proof.

G.2 Concentration Bound for Heavy-Tailed Random Variables

Proposition G.2. Let $X_1,\dots,X_n$ be independent and identically distributed (i.i.d.) nonnegative random variables satisfying the moment condition $\mathbb{E}[X_i^{1+\lambda}] \le M$ for some constants $\lambda\in(0,1)$ and $M>0$. Then, with probability at least $1-\delta$, the sample mean satisfies
$$\frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}[X_i] \le 2M^{\frac{1}{1+\lambda}}\left( \frac{\log(2/\delta)}{n} \right)^{\frac{\lambda}{1+\lambda}}.$$
This result is adapted from Brownlees et al. [2015], which provides concentration bounds for empirical risk minimization under heavy-tailed losses.

Proof. We use a truncation approach combined with concentration inequalities for bounded random variables. The key idea is to control the large values of the random variables by truncating them at a carefully chosen level and then applying standard concentration bounds to the truncated variables while accounting for the probability of truncation. Since the $X_i$ are nonnegative and potentially heavy-tailed, their large values can dominate the behavior of the sample mean.
To handle this, we introduce a truncation level $B>0$ and define the truncated random variables
$$X_i^B = \min(X_i, B) = X_i\,\mathbf{1}\{X_i\le B\} + B\,\mathbf{1}\{X_i>B\}.$$
Thus $X_i^B$ equals $X_i$ when $X_i\le B$ and equals $B$ otherwise; note $0\le X_i^B\le B$, so $X_i^B$ is bounded. Let $S_n = \sum_{i=1}^n X_i$ and let $\mu = \mathbb{E}[X_i]$ be the common expected value. We can write the sample mean as
$$\frac{S_n}{n} = \frac{1}{n}\sum_{i=1}^n X_i = \frac{1}{n}\sum_{i=1}^n X_i^B + \frac{1}{n}\sum_{i=1}^n(X_i - X_i^B). \quad (130)$$
Since $X_i - X_i^B = (X_i-B)\,\mathbf{1}\{X_i>B\}\ge0$, it follows that
$$\frac{S_n}{n}-\mu \le \left( \frac{1}{n}\sum_{i=1}^n X_i^B - \mathbb{E}[X_i^B] \right) + \left( \mathbb{E}[X_i^B]-\mu \right) + \frac{1}{n}\sum_{i=1}^n(X_i-X_i^B). \quad (131)$$
Since $\mathbb{E}[X_i^B]\le\mu$, the term $\mathbb{E}[X_i^B]-\mu\le0$ can be dropped:
$$\frac{S_n}{n}-\mu \le \left( \frac{1}{n}\sum_{i=1}^n X_i^B - \mathbb{E}[X_i^B] \right) + \frac{1}{n}\sum_{i=1}^n(X_i-X_i^B). \quad (132)$$
We aim to bound $\mathbb{P}(S_n/n - \mu > t)$ for some $t>0$. Using the decomposition above and the nonnegativity of $\frac{1}{n}\sum_i(X_i-X_i^B)$,
$$\mathbb{P}\left( \frac{S_n}{n}-\mu > t \right) \le \mathbb{P}\left( \frac{1}{n}\sum_{i=1}^n\left( X_i^B - \mathbb{E}[X_i^B] \right) > t - \frac{1}{n}\sum_{i=1}^n(X_i-X_i^B) \right). \quad (133)$$
Let $A = \{\max_{i=1,\dots,n}X_i\le B\}$. On $A$ we have $X_i = X_i^B$ for all $i$, so $\frac{1}{n}\sum_i(X_i-X_i^B) = 0$; on $A^c$, at least one $X_i>B$. Therefore,
$$\mathbb{P}\left( \frac{S_n}{n}-\mu>t \right) \le \mathbb{P}\left( \frac{1}{n}\sum_{i=1}^n\left( X_i^B-\mathbb{E}[X_i^B] \right) > t,\ A \right) + \mathbb{P}(A^c) \le \mathbb{P}\left( \frac{1}{n}\sum_{i=1}^n\left( X_i^B-\mathbb{E}[X_i^B] \right) > t \right) + \mathbb{P}(A^c). \quad (134)\text{-}(135)$$
Bounding $\mathbb{P}(A^c)$:
$$\mathbb{P}(A^c) = \mathbb{P}\left( \bigcup_{i=1}^n\{X_i>B\} \right) \le n\,\mathbb{P}(X_1>B). \quad (136)$$
Using Markov's inequality and the moment condition,
$$\mathbb{P}(X_1>B) = \mathbb{P}\left( X_1^{1+\lambda}>B^{1+\lambda} \right) \le \frac{\mathbb{E}[X_1^{1+\lambda}]}{B^{1+\lambda}} \le \frac{M}{B^{1+\lambda}}, \quad (137)$$
so $\mathbb{P}(A^c)\le nM/B^{1+\lambda}$. Choose $B$ such that $nM/B^{1+\lambda} = \delta/2$, i.e.,
$$B = \left( \frac{2nM}{\delta} \right)^{\frac{1}{1+\lambda}}, \quad (138)$$
so that $\mathbb{P}(A^c)\le\delta/2$. Next, apply Hoeffding's inequality to the i.i.d. bounded variables $X_i^B$:
$$\mathbb{P}\left( \frac{1}{n}\sum_{i=1}^n\left( X_i^B-\mathbb{E}[X_i^B] \right) > s \right) \le \exp\left( -\frac{2ns^2}{B^2} \right). \quad (139)$$
Requiring this probability to be at most $\delta/2$ forces
$$s \ge B\sqrt{\frac{\log(2/\delta)}{2n}}. \quad (140)$$
Substituting $B$ from (138) gives
$$s \ge \left( \frac{2nM}{\delta} \right)^{\frac{1}{1+\lambda}}\sqrt{\frac{\log(2/\delta)}{2n}} = (2M)^{\frac{1}{1+\lambda}}\, n^{\frac{1}{1+\lambda}-\frac{1}{2}}\,\delta^{-\frac{1}{1+\lambda}}\left( \frac{\log(2/\delta)}{2} \right)^{\frac{1}{2}}. \quad (141)\text{-}(142)$$
Combining the two events and simplifying (absorbing the polynomial dependence on $\delta$ into the stated constant, as in Brownlees et al. [2015]), known results for i.i.d. nonnegative random variables with $\mathbb{E}[X_i^{1+\lambda}]\le M$ yield
$$\mathbb{P}\left( \frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}[X_i] > 2M^{\frac{1}{1+\lambda}}\left( \frac{\log(2/\delta)}{n} \right)^{\frac{\lambda}{1+\lambda}} \right) \le \delta.$$
Thus, with probability at least $1-\delta$,
$$\frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}[X_i] \le 2M^{\frac{1}{1+\lambda}}\left( \frac{\log(2/\delta)}{n} \right)^{\frac{\lambda}{1+\lambda}}. \quad (143)$$
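Proposition G.2's estimator is just a clipped mean with a $\delta$-dependent clipping level; a minimal sketch (the function name is ours) follows.

```python
import numpy as np

def truncated_mean(x, M, lam, delta):
    """Clip at B = (2 n M / delta)^{1/(1+lam)} as in (138), so that
    P(max_i X_i > B) <= delta / 2, then average the clipped sample."""
    x = np.asarray(x, dtype=float)
    B = (2.0 * len(x) * M / delta) ** (1.0 / (1.0 + lam))
    return float(np.minimum(x, B).mean())
```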
Lemma G.3 (Reduction to an empirical process). For any hypothesis $h\in\mathcal{H}$ and any risk level $\alpha\in(0,1)$, the deviation between the population CVaR risk $R_\alpha(h)$ and the empirical CVaR risk $\widehat R_\alpha(h)$ satisfies
$$|R_\alpha(h) - \widehat R_\alpha(h)| \le \frac{1}{\alpha}\sup_{\theta\in\mathbb{R}}|(\mathbb{E}-P_n)\,g_{h,\theta}|, \qquad g_{h,\theta}(z) := (\ell(h,z)-\theta)_+.$$

Proof. Recall the variational definitions $R_\alpha(h) = \inf_{\theta\in\mathbb{R}} G(\theta)$ and $\widehat R_\alpha(h) = \inf_{\theta\in\mathbb{R}}\widehat G(\theta)$, where
$$G(\theta) := \theta + \frac{1}{\alpha}\mathbb{E}[g_{h,\theta}(Z)], \qquad \widehat G(\theta) := \theta+\frac{1}{\alpha}P_n[g_{h,\theta}(Z)].$$
We use the elementary property that the difference between the infima of two functions is bounded by the supremum of their difference: for any two functions $f, g : \mathbb{R}\to\mathbb{R}$,
$$\left| \inf_\theta f(\theta) - \inf_\theta g(\theta) \right| \le \sup_\theta|f(\theta)-g(\theta)|.$$
Applying this to $G$ and $\widehat G$, $|R_\alpha(h)-\widehat R_\alpha(h)| \le \sup_{\theta\in\mathbb{R}}|G(\theta)-\widehat G(\theta)|$. Substituting the definitions of $G(\theta)$ and $\widehat G(\theta)$, the linear term $\theta$ cancels:
$$|G(\theta)-\widehat G(\theta)| = \left| \left( \theta + \tfrac{1}{\alpha}\mathbb{E}[g_{h,\theta}] \right) - \left( \theta + \tfrac{1}{\alpha}P_n[g_{h,\theta}] \right) \right| = \frac{1}{\alpha}|(\mathbb{E}-P_n)\,g_{h,\theta}|.$$
Taking the supremum over $\theta$ yields the result.

Lemma G.4 (Lipschitz continuity in $\theta$). Fix any hypothesis $h\in\mathcal{H}$ and define the empirical-process deviation map $\Phi_h:\mathbb{R}\to\mathbb{R}$ by $\Phi_h(\theta) := (\mathbb{E}-P_n)g_{h,\theta}$, where $g_{h,\theta}(z) := (\ell(h,z)-\theta)_+$. Then $\Phi_h$ is $2$-Lipschitz with respect to $\theta$: for all $\theta_1,\theta_2\in\mathbb{R}$,
$$|\Phi_h(\theta_1)-\Phi_h(\theta_2)| \le 2|\theta_1-\theta_2|.$$

Proof. The function $x\mapsto(x)_+ = \max(0,x)$ is $1$-Lipschitz. Therefore, for any fixed $z\in\mathcal{Z}$ and fixed $h\in\mathcal{H}$, the map $\theta\mapsto g_{h,\theta}(z)$ is $1$-Lipschitz:
$$|g_{h,\theta_1}(z) - g_{h,\theta_2}(z)| = |(\ell(h,z)-\theta_1)_+ - (\ell(h,z)-\theta_2)_+| \le |(\ell(h,z)-\theta_1)-(\ell(h,z)-\theta_2)| = |\theta_1-\theta_2|.$$
Now consider the deviation term. By the triangle inequality,
$$|\Phi_h(\theta_1)-\Phi_h(\theta_2)| \le |\mathbb{E}[g_{h,\theta_1}-g_{h,\theta_2}]| + |P_n[g_{h,\theta_1}-g_{h,\theta_2}]|.$$
Using the pointwise Lipschitz property derived above:
1. Population term: $|\mathbb{E}[g_{h,\theta_1}-g_{h,\theta_2}]| \le \mathbb{E}|g_{h,\theta_1}(Z)-g_{h,\theta_2}(Z)| \le |\theta_1-\theta_2|$.
2. Empirical term: $|P_n[g_{h,\theta_1}-g_{h,\theta_2}]| \le \frac{1}{n}\sum_{i=1}^n|g_{h,\theta_1}(Z_i)-g_{h,\theta_2}(Z_i)| \le |\theta_1-\theta_2|$.
Summing these bounds yields $|\Phi_h(\theta_1)-\Phi_h(\theta_2)| \le 2|\theta_1-\theta_2|$.

Lemma G.5 (Pointwise dominance of the shifted loss). Let $\ell(h,z)\ge0$ and $\theta\ge0$, and define $f_\theta(z) := (\ell(h,z)-\theta)_+$. Then $0\le f_\theta(z)\le\ell(h,z)$ for all $z$.

Proof. If $\ell(h,z)\le\theta$, then $f_\theta(z) = 0\le\ell(h,z)$. If $\ell(h,z)>\theta$, then $f_\theta(z) = \ell(h,z)-\theta\le\ell(h,z)$ since $\theta\ge0$. In both cases the claim holds.

Proposition G.6 (Nonnegativity of the optimal CVaR threshold). Let $L = \ell(h,Z)\ge0$ and $\alpha\in(0,1)$. For a probability measure $P$, define
$$F(\theta;P) := \theta+\frac{1}{\alpha}\mathbb{E}_P[(L-\theta)_+], \qquad R_\alpha^P(h) := \inf_{\theta\in\mathbb{R}}F(\theta;P).$$
Any minimizer $\theta_P^*\in\arg\min_\theta F(\theta;P)$ satisfies $\theta_P^*\ge0$.

Proof. Follows directly from Theorem G.1.

Corollary G.7 (Justification of the truncation comparison). With $\theta = \theta_P^*\ge0$ as in Proposition G.6, Lemma G.5 gives $f_{\theta_P^*}(z) = (\ell(h,z)-\theta_P^*)_+\le\ell(h,z)$. Consequently, for any $T>0$,
$$\mathbb{E}_P\left[ f_{\theta_P^*}(Z)\,\mathbf{1}\{f_{\theta_P^*}(Z)>T\} \right] \le \mathbb{E}_P\left[ \ell(h,Z)\,\mathbf{1}\{\ell(h,Z)>T\} \right],$$
which is the key step enabling tail control by the $(1+\lambda)$-moment bound.

Lemma G.8 (Lipschitz continuity of CVaR). For any random variables $X, Y\ge0$ and $\alpha\in(0,1)$,
$$|R_\alpha(X)-R_\alpha(Y)| \le \frac{1}{\alpha}\mathbb{E}|X-Y|.$$

Proof. Using the dual representation $R_\alpha(X) = \inf_\theta\{\theta+\frac{1}{\alpha}\mathbb{E}[(X-\theta)_+]\}$, let $\theta_Y^*$ be optimal for $Y$. Then
$$R_\alpha(X)-R_\alpha(Y) \le \theta_Y^*+\frac{1}{\alpha}\mathbb{E}[(X-\theta_Y^*)_+] - \theta_Y^*-\frac{1}{\alpha}\mathbb{E}[(Y-\theta_Y^*)_+] = \frac{1}{\alpha}\mathbb{E}\left[ (X-\theta_Y^*)_+-(Y-\theta_Y^*)_+ \right] \le \frac{1}{\alpha}\mathbb{E}|X-Y|.$$
Symmetrically, $R_\alpha(Y)-R_\alpha(X) \le \frac{1}{\alpha}\mathbb{E}|X-Y|$.
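Lemma G.8 holds for any coupling, in particular for paired samples under the empirical measure, so it can be sanity-checked directly; the perturbation below is an arbitrary illustrative choice.

```python
import numpy as np

def cvar(x, alpha):
    t = np.quantile(x, 1.0 - alpha)               # an RU minimizer (Lemma G.3 setup)
    return t + np.mean(np.maximum(x - t, 0.0)) / alpha

rng = np.random.default_rng(7)
x = rng.pareto(2.5, size=20_000)
y = x + rng.normal(0.0, 0.05, size=x.size) ** 2   # small nonnegative perturbation
alpha = 0.1
lhs = abs(cvar(x, alpha) - cvar(y, alpha))
rhs = np.mean(np.abs(x - y)) / alpha
print(lhs <= rhs)  # True: |R_a(X) - R_a(Y)| <= E|X - Y| / alpha
```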
Lemma G.9 (Pseudo-Dimension of Truncated CVaR Class). Let $\mathcal{H}$ be a hypothesis class with $\mathrm{Pdim}(\mathcal{H}) \le d$. For $B > 0$, define
$$\mathcal{F}_B = \Big\{ z \mapsto \min\big((\ell(h,z) - \theta)_+,\, B\big) : h \in \mathcal{H},\ \theta \in [0, R] \Big\},$$
where $R = M^{1/(1+\lambda)}/\alpha$. Then there exists an absolute constant $C > 0$ such that $\mathrm{Pdim}(\mathcal{F}_B) \le C(d+1)$.

Proof. Define the base class $\mathcal{L} := \{z \mapsto \ell(h,z) : h \in \mathcal{H}\}$, so that $\mathrm{Pdim}(\mathcal{L}) \le d$. We construct $\mathcal{F}_B$ from $\mathcal{L}$ by a finite sequence of operations, each of which increases the pseudo-dimension by at most a constant factor. Consider the augmented class
$$\mathcal{G}_1 := \{(z, t) \mapsto \ell(h,z) - t : h \in \mathcal{H},\ t \in [0, R]\}.$$
By standard results on pseudo-dimension under the addition of a real parameter, $\mathrm{Pdim}(\mathcal{G}_1) \le d + 1$. Next define
$$\mathcal{G}_2 := \{(z, t) \mapsto (\ell(h,z) - t)_+ : h \in \mathcal{H},\ t \in [0, R]\}.$$
The map $x \mapsto x_+ = \max\{x, 0\}$ is the maximum of two affine functions; by closure of the pseudo-dimension under finite maxima, $\mathrm{Pdim}(\mathcal{G}_2) \le C_1(d+1)$ for a universal constant $C_1$. Finally, applying the standard truncation argument used before,
$$\mathcal{F}_B = \{(z, t) \mapsto \min\{g(z, t), B\} : g \in \mathcal{G}_2\}.$$
Since $x \mapsto \min\{x, B\}$ is the minimum of $x$ and a constant function,
$$\mathrm{Pdim}(\mathcal{F}_B) \le C_2\, \mathrm{Pdim}(\mathcal{G}_2) \le C(d+1)$$
for an absolute constant $C$. Combining the steps completes the proof. $\square$

Lemma G.10 (Bias and variance of truncation). Let $X$ be a nonnegative real-valued random variable such that $\mathbb{E}[|X|^{1+\lambda}] \le M'$ for some $\lambda \in (0, 1]$ and $M' > 0$. Define the truncated variable $X^B := \min\{X, B\}$ for $B > 0$. Then:
(i) (Bias) $\big|\mathbb{E}[X^B] - \mathbb{E}[X]\big| \le M' B^{-\lambda}$.
(ii) (Variance) $\mathrm{Var}(X^B) \le M' B^{1-\lambda}$.

Proof. We treat the two claims separately. Observe that
$$\mathbb{E}[X^B] - \mathbb{E}[X] = \mathbb{E}\big[(X^B - X)\,\mathbf{1}\{X > B\}\big] = -\mathbb{E}\big[(X - B)\,\mathbf{1}\{X > B\}\big].$$
Hence $\big|\mathbb{E}[X^B] - \mathbb{E}[X]\big| \le \mathbb{E}\big[|X|\,\mathbf{1}\{X > B\}\big]$. On the event $\{X > B\}$ we have $1 \le (X/B)^\lambda$, and therefore
$$|X| = |X|^{1+\lambda}\,|X|^{-\lambda} \le |X|^{1+\lambda} B^{-\lambda}.$$
Substituting this inequality yields
$$\mathbb{E}\big[|X|\,\mathbf{1}\{X > B\}\big] \le B^{-\lambda}\,\mathbb{E}[|X|^{1+\lambda}] \le M' B^{-\lambda}.$$
For the variance, since $\mathrm{Var}(X^B) \le \mathbb{E}[(X^B)^2]$, it suffices to bound the second moment. Write $(X^B)^2 = |X^B|^{1-\lambda}\,|X^B|^{1+\lambda}$. Using $|X^B| \le B$ and $|X^B| \le |X|$, we obtain $|X^B|^{1-\lambda} \le B^{1-\lambda}$ and $|X^B|^{1+\lambda} \le |X|^{1+\lambda}$. Thus
$$\mathbb{E}[(X^B)^2] \le B^{1-\lambda}\,\mathbb{E}[|X|^{1+\lambda}] \le M' B^{1-\lambda}. \qquad \square$$
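Lemma G.10 quantifies the bias-variance tradeoff in the truncation level $B$: raising $B$ shrinks the bias at rate $B^{-\lambda}$ while the variance bound grows like $B^{1-\lambda}$. The following small Monte Carlo sanity check is our illustration (not the paper's experiment); the Pareto tail index and the empirical estimate of $M'$ are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sanity check of Lemma G.10. The tail index 1 + lam + 0.5 ensures that
# E[X^{1+lam}] is finite; M_prime below is a Monte Carlo estimate of it.
lam = 0.5
X = rng.pareto(1.0 + lam + 0.5, size=10**6) + 1.0   # classical Pareto on [1, inf)
M_prime = np.mean(X ** (1.0 + lam))                 # estimate of E[X^{1+lam}]

for B in (5.0, 20.0, 100.0):
    XB = np.minimum(X, B)                           # truncated variable X^B
    bias = abs(XB.mean() - X.mean())                # observed truncation bias
    print(f"B={B:6.1f}  bias={bias:.4f} (bound {M_prime * B**(-lam):.4f})  "
          f"var={XB.var():.3f} (bound {M_prime * B**(1.0 - lam):.3f})")
```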
Lemma G.11 (Uniform MoM concentration under contamination). Let $\mathcal{G}$ be a class of functions mapping into $[0, B]$ such that $\sup_{g \in \mathcal{G}} \mathrm{Var}(g(Z)) \le \sigma^2$, and assume its covering numbers satisfy $\log N(\mathcal{G}, \|\cdot\|_\infty, \eta) \le d \log(C/\eta)$. Consider the $\epsilon$-contamination model. Partition the sample into $K$ blocks of size $m = n/K$, with $K \ge 8\log(1/\delta)$ and $m \ge C_0 d$ for a sufficiently large constant $C_0$. Then, with probability at least $1 - \delta$,
$$\sup_{g \in \mathcal{G}} \Big| \operatorname*{Median}_{j \in [K]}\, \widehat{\mu}_j(g) - \mathbb{E}[g] \Big| \le C\Big(\sigma\sqrt{\frac{d}{m}} + \frac{Bd}{m} + \epsilon B\Big),$$
where $\widehat{\mu}_j(g) = \frac{1}{m}\sum_{i \in \mathcal{B}_j} g(z_i)$.

Proof. Let $N_j$ denote the number of corrupted points in block $\mathcal{B}_j$. Since the adversary corrupts exactly $\epsilon n = \epsilon K m$ samples and the data are randomly permuted, $(N_1, \dots, N_K)$ follows a multivariate hypergeometric distribution, and deterministically
$$\sum_{j=1}^K N_j = \epsilon K m.$$
Hence, by a counting argument, at most $0.1K$ blocks can satisfy $N_j > 10\epsilon m$. Define $\mathcal{J}_{\mathrm{low}} := \{j : N_j \le 10\epsilon m\}$, so that $|\mathcal{J}_{\mathrm{low}}| \ge 0.9K$. For any $j \in \mathcal{J}_{\mathrm{low}}$ and any $g \in \mathcal{G}$, since $g \in [0, B]$,
$$\Big| \frac{1}{m}\sum_{i \in \mathcal{B}_j} g(z_i) - \frac{1}{m}\sum_{i \in \mathcal{B}_j \cap \mathrm{clean}} g(z_i) \Big| \le \frac{N_j}{m}\, B \le 10\epsilon B.$$
Thus, corruption introduces a deterministic bias of at most $10\epsilon B$ on these blocks.

For uniform concentration on clean data, fix a block $j$ and consider only its clean samples. By Bernstein's inequality combined with a union bound over an $\eta$-net of $\mathcal{G}$ and standard chaining, there exists a constant $C$ such that if $m \ge C_0 d$, then with probability at least $0.9$,
$$\sup_{g \in \mathcal{G}} \Big| \frac{1}{m}\sum_{i \in \mathcal{B}_j \cap \mathrm{clean}} g(z_i) - \mathbb{E}[g] \Big| \le C\Big(\sigma\sqrt{\frac{d}{m}} + \frac{Bd}{m}\Big).$$
Call this event $E_j$. The random permutation ensures approximate independence across blocks, and a Chernoff bound yields that, with probability at least $1 - \delta$, at least $0.8K$ blocks satisfy $E_j$. Let $\mathcal{J}_{\mathrm{valid}} := \mathcal{J}_{\mathrm{low}} \cap \{j : E_j \text{ holds}\}$. With probability at least $1 - \delta$, $|\mathcal{J}_{\mathrm{valid}}| \ge 0.7K > K/2$. For any $j \in \mathcal{J}_{\mathrm{valid}}$ and any $g \in \mathcal{G}$,
$$|\widehat{\mu}_j(g) - \mathbb{E}[g]| \le C\Big(\sigma\sqrt{\frac{d}{m}} + \frac{Bd}{m}\Big) + 10\epsilon B.$$
Since a strict majority of blocks satisfies this bound, the median must lie within the same range. Therefore, after absorbing constants,
$$\sup_{g \in \mathcal{G}} \Big| \operatorname*{Median}_{j}\, \widehat{\mu}_j(g) - \mathbb{E}[g] \Big| \le C\Big(\sigma\sqrt{\frac{d}{m}} + \frac{Bd}{m} + \epsilon B\Big). \qquad \square$$

H On the Algorithmic Aspects of the Robust CVaR-ERM Estimator

H.1 $\eta$-cover based Algorithm

We discretize both $\mathcal{H}$ and the $\theta$-range.

Finite $\eta$-net for $\mathcal{H}$. For $\eta_h > 0$, let $N_h(\eta_h)$ be any finite set such that for every $h \in \mathcal{H}$ there exists $h' \in N_h(\eta_h)$ with $\|h - h'\|_2 \le \eta_h$. Since $\mathcal{H}$ is contained in the Euclidean ball of radius $R$, one can take $N_h(\eta_h)$ with cardinality bounded by
$$|N_h(\eta_h)| \le \Big(1 + \frac{2R}{\eta_h}\Big)^{d}.$$
(For example, take a lattice grid of mesh $\eta_h/\sqrt{d}$ and intersect it with the ball.)

$\eta_\theta$-grid for $\theta$. For $\eta_\theta > 0$, define
$$N_\theta(\eta_\theta) := \{0, \eta_\theta, 2\eta_\theta, \dots\} \cap [0, T], \qquad |N_\theta(\eta_\theta)| \le 1 + \frac{T}{\eta_\theta}.$$
Throughout, $\varphi_B(h, \theta; Z)$ denotes the truncated CVaR loss for the sample $Z$ at the parameters $(h, \theta)$ with truncation threshold $B$.

Algorithm 1: Discretized MOM-CVaR-ERM with truncation (implementable)
1: Input: data $Z_1, \dots, Z_n$; $\alpha \in (0,1)$; truncation level $B$; block count $K \ge 8$ with $n = Km$; net radii $\eta_h, \eta_\theta$; known bounds $R, T$.
2: Draw a uniform random permutation $\pi$ and form blocks $\mathcal{B}_1, \dots, \mathcal{B}_K$ of size $m$.
3: Construct a finite $\eta_h$-net $N_h(\eta_h)$ of $\mathcal{H}$ inside the ball of radius $R$.
4: Construct the grid $N_\theta(\eta_\theta)$ of $[0, T]$.
5: for each $h \in N_h(\eta_h)$ do
6:  for each $\theta \in N_\theta(\eta_\theta)$ do
7:   Compute the block risks $\widehat{R}_j(h, \theta) = \frac{1}{m}\sum_{i \in \mathcal{B}_j} \varphi_B(h, \theta; Z_i)$ for $j = 1, \dots, K$.
8:   Compute $\widehat{R}_{\mathrm{MOM}}(h, \theta) = \mathrm{median}\big(\widehat{R}_1, \dots, \widehat{R}_K\big)$.
9:  end for
10: end for
11: Output $(\widehat{h}, \widehat{\theta}) \in \arg\min_{h \in N_h(\eta_h),\, \theta \in N_\theta(\eta_\theta)} \widehat{R}_{\mathrm{MOM}}(h, \theta)$.

Implementability. The search set is finite, hence the algorithm terminates in finite time and returns an exact minimizer over the discretization.
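To make the discretized procedure concrete, here is a minimal sketch of Algorithm 1 in NumPy. The function name, the interface, and in particular the concrete form $\varphi_B(h, \theta; z) = \theta + \frac{1}{\alpha}\min\big((\ell(h,z) - \theta)_+, B\big)$ are our assumptions, chosen to be consistent with the truncated class $\mathcal{F}_B$ of Lemma G.9; the pseudocode above fixes $\varphi_B$ only abstractly.

```python
import numpy as np

def mom_cvar_erm(Z, losses_fn, H_net, theta_grid, alpha, B, K, seed=0):
    """Sketch of Algorithm 1 (discretized MOM-CVaR-ERM with truncation).
    Assumed loss: phi_B(h, theta; z) = theta + min((ell(h,z) - theta)_+, B) / alpha.
    `H_net` is a finite eta_h-net of the hypothesis class, `theta_grid` the
    eta_theta-grid of [0, T]; `losses_fn(h, Z)` returns the n per-sample losses."""
    n = len(Z)
    m = n // K
    rng = np.random.default_rng(seed)
    blocks = rng.permutation(n)[: K * m].reshape(K, m)    # random blocks B_1..B_K

    best = (None, None, np.inf)
    for h in H_net:                                       # loop over the eta_h-net
        ell = np.asarray(losses_fn(h, Z), dtype=float)    # ell(h, z_i), i = 1..n
        for theta in theta_grid:                          # loop over the theta-grid
            phi = theta + np.minimum(np.maximum(ell - theta, 0.0), B) / alpha
            r_mom = np.median(phi[blocks].mean(axis=1))   # median of K block risks
            if r_mom < best[2]:
                best = (h, theta, r_mom)
    return best                                           # (h_hat, theta_hat, MOM risk)

# Hypothetical usage with a scalar quadratic loss on heavy-tailed data:
# rng = np.random.default_rng(3)
# Z = rng.standard_t(df=2, size=500)
# h_hat, theta_hat, risk = mom_cvar_erm(
#     Z, lambda h, Z: (Z - h) ** 2, np.linspace(-1, 1, 21),
#     np.linspace(0.0, 10.0, 50), alpha=0.1, B=50.0, K=10)
```

The double loop makes the finite-search structure of Algorithm 1 explicit; its cost is $|N_h(\eta_h)| \cdot |N_\theta(\eta_\theta)| \cdot O(n)$, matching the implementability remark above.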
