On the Generalization and Robustness in Conditional Value-at-Risk



Dinesh Karthik Mulumudi, Dept. of Mathematics, IISER, Pune. dineshkarthikforml@gmail.com
Piyushi Manupriya, Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore. piyushim@iisc.ac.in
Gholamali Aminian, The Alan Turing Institute, Greater London, England, UK. gaminian@turing.ac.uk
Anant Raj, Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore. anantraj@iisc.ac.in

February 23, 2026

Abstract

Conditional Value-at-Risk (CVaR) is a widely used risk-sensitive objective for learning under rare but high-impact losses, yet its statistical behavior under heavy-tailed data remains poorly understood. Unlike expectation-based risk, CVaR depends on an endogenous, data-dependent quantile, which couples tail averaging with threshold estimation and fundamentally alters both generalization and robustness properties. In this work, we develop a learning-theoretic analysis of CVaR-based empirical risk minimization under heavy-tailed and contaminated data. We establish sharp, high-probability generalization and excess risk bounds under minimal moment assumptions, covering fixed hypotheses, finite and infinite classes, and extending to β-mixing dependent data; we further show that these rates are minimax optimal. To capture the intrinsic quantile sensitivity of CVaR, we derive a uniform Bahadur-Kiefer type expansion that isolates a threshold-driven error term absent in mean-risk ERM and essential in heavy-tailed regimes. We complement these results with robustness guarantees by proposing a truncated median-of-means CVaR estimator that achieves optimal rates under adversarial contamination. Finally, we show that CVaR decisions themselves can be intrinsically unstable under heavy tails, establishing a fundamental limitation on decision robustness even when the population optimum is well separated. Together, our results provide a principled characterization of when CVaR learning generalizes and is robust, and when instability is unavoidable due to tail scarcity.

1 Introduction

Statistical Learning Theory (SLT) provides the standard framework for analyzing generalization in learning algorithms. Its central paradigm, Empirical Risk Minimization (ERM), selects a hypothesis h ∈ H by minimizing the empirical risk, yielding generalization under suitable complexity control Shalev-Shwartz and Ben-David [2014]. Classical ERM, however, is risk-neutral and optimizes expected loss, which can be inadequate in high-stakes domains such as finance, healthcare, and safety-critical learning, where rare but extreme losses dominate performance Howard and Matheson [1972], Shen et al. [2014], Cardoso and Xu [2019].

These limitations have motivated the study of risk-sensitive learning objectives that emphasize tail behavior, including Entropic Risk, Mean-Variance, and Conditional Value-at-Risk (CVaR). From a learning-theoretic standpoint, several works have begun to investigate generalization guarantees for such objectives. For example, generalization bounds for tilted empirical risk minimization under finite-moment assumptions were derived in Aminian et al., while Lee et al. [2021] studied CVaR and related risk measures under bounded-loss and light-tailed assumptions.
However, these settings do not fully capture the regimes in which CVaR is most relevant in practice, where data are often heavy-tailed and may exhibit temporal dependence. This gap motivates a systematic study of CVaR generalization beyond bounded or light-tailed models, which we address in this work by developing guarantees under heavy-tailed losses for both i.i.d. and dependent data.

Another key challenge specific to CVaR-based learning is structural. CVaR depends on an endogenous, data-dependent threshold, the Value-at-Risk (VaR), which couples tail averages with empirical quantile estimation. Unlike smooth risk functionals, this coupling creates an intrinsic interaction between tail fluctuations and threshold estimation error. As a consequence, standard generalization analyses that treat the objective as a fixed functional of the loss distribution are insufficient. Controlling CVaR generalization therefore requires explicitly accounting for how fluctuations of the empirical quantile propagate into fluctuations of the tail risk. To address this issue, we develop a refined empirical-process analysis based on uniform Bahadur-Kiefer-type expansions Bahadur [1966], Kiefer [1967], adapted to the CVaR setting to capture the effects of this endogenous threshold.

Robustness presents a closely related challenge and arises at multiple levels. At the functional level, although CVaR is often viewed as robust due to its emphasis on tail outcomes, its dependence on an empirical quantile can amplify sensitivity to perturbations near the tail. At the estimator level, robustness under heavy-tailed data and adversarial contamination has been extensively studied through trimmed means and median-of-means estimators Lugosi and Mendelson [2019a,b, 2021], yielding sharp guarantees for estimation primitives and robust ERM, but not directly for tail-based risk functionals such as CVaR. At the decision level, the stability of CVaR-optimal solutions under distributional perturbations remains comparatively less understood.

In this work, we study both the generalization and robustness properties of CVaR-based learning under heavy-tailed and contaminated data. On the generalization side, we derive non-asymptotic excess risk guarantees under minimal moment assumptions and complement them with refined empirical-process tools that explicitly account for quantile sensitivity. On the robustness side, we analyze the stability of CVaR objectives and solutions under distributional perturbations. Together, these results clarify the statistical regimes in which empirical CVaR reliably approximates its population counterpart, as well as settings in which instability is intrinsic.

Contributions: We make the following contributions in this paper:

• Sharp generalization theory for heavy-tailed CVaR (Sections 3.1 and 3.2). We derive high-probability generalization and excess risk bounds for empirical CVaR minimization under minimal (1 + λ)-moment assumptions (0 < λ ≤ 1), covering fixed hypotheses and finite and infinite classes via VC/pseudo-dimension and Rademacher complexity. The bounds explicitly characterize the dependence on the tail level α, the tail exponent, hypothesis complexity, and sample size, extend CVaR theory beyond bounded and sub-Gaussian regimes, and are shown to be minimax optimal. We further extend the theory to β-mixing dependent data with matching lower bounds.
• Uniform Bahadur-Kiefer expansions for CVaR (Section 3.3). We establish the first uniform Bahadur-Kiefer-type expansion for CVaR that explicitly accounts for its endogenous, data-dependent threshold. The result captures the nonstandard coupling between empirical quantile fluctuations and tail averaging, yielding a sharp decomposition of CVaR error into a classical empirical process term and an additional threshold-driven component governed by local tail geometry. This identifies a second-order source of generalization error that is absent in mean-risk ERM and essential in heavy-tailed regimes.

• Robust CVaR-ERM under adversarial contamination (Section 4.2). We propose a robust CVaR-ERM estimator based on truncation and median-of-means aggregation applied to the Rockafellar-Uryasev lift, jointly robustifying estimation over the decision variable and the endogenous threshold. We prove non-asymptotic excess risk guarantees under heavy-tailed losses and oblivious adversarial contamination, with rates that optimally decompose into a statistical term and an unavoidable contamination-dependent term. This provides the first learning-theoretic robustness guarantees for CVaR-ERM without bounded-loss assumptions.

• Fundamental limits of decision robustness (Section 4.3). We analyze the stability of CVaR decisions themselves and show that, unlike mean-risk ERM, CVaR minimization can be intrinsically unstable under heavy-tailed losses. Even with a unique and well-separated population minimizer, a single observation can flip the empirical CVaR-ERM decision with polynomially small but unavoidable probability. This establishes a sharp impossibility result for decision robustness in tail-scarce regimes.

1.1 Related Work

Optimized certainty equivalents include expectation, CVaR, entropic, and mean-variance risks. Early generalization bounds were obtained under bounded-loss assumptions Lee et al. [2021], and later extended to unbounded losses via tilted ERM under bounded (1 + λ)-moment conditions, achieving rates O(n^{−λ/(1+λ)}) Li et al. [2021]. For CVaR, concentration inequalities under light- and heavy-tailed losses were established in Kolla et al. [2019], Prashanth et al. [2019] under a strictly increasing CDF assumption, but learning-theoretic generalization and robustness guarantees in the genuinely heavy-tailed regime remain largely open. Our fixed-hypothesis analysis builds on truncation-based techniques Brownlees et al. [2015], Prashanth et al. [2019], with extensions to finite and infinite hypothesis classes via standard learning-theoretic tools Shalev-Shwartz and Ben-David [2014], and matching minimax lower bounds obtained through classical constructions Tsybakov [2008]. Related work on CVaR and spectral risk learning under heavy tails Holland and Haress [2021, 2022] is restricted to finite-variance i.i.d. settings. Results on robust mean estimation under heavy tails and contamination Laforgue et al. [2021], de Juan and Mazuelas [2025] focus on estimation primitives rather than tail-based risks, while learning under heavy-tailed dependence has primarily addressed expected-risk objectives Roy et al. [2021], Shen et al. [2026]. Minimax guarantees for quantile-based risks largely assume light-tailed regimes El Hanchi et al. [2024].

The asymptotic behavior of sample quantiles is classically characterized by the Bahadur and Bahadur-Kiefer representations.
Bahadur [1966] and Kiefer [1967] established linear and uniform expansions that form a foundation of empirical process theory Van der Vaart [2000], but rely on smoothness and light-tail assumptions. These results do not directly apply to CVaR, where the quantile is endogenous and coupled with tail averages. Our work develops a uniform Bahadur-Kiefer expansion tailored to CVaR, explicitly accounting for the random active set induced by the data-dependent threshold and allowing for heavy-tailed losses.

On the robustness side, robust ERM under weak moment assumptions has been studied in Mathieu and Minsker [2021]. Trimmed-mean and median-of-means estimators Lugosi and Mendelson [2019a,b] admit finite-sample and minimax-optimal guarantees Oliveira et al. [2025], Oliveira and Resende [2025], Lugosi and Mendelson [2021], but do not directly address tail-based risk functionals such as CVaR.

2 Background and Problem Formulation

2.1 Conditional Value-at-Risk (CVaR)

Let L be a real-valued random variable representing the loss of a decision or portfolio. For a tail probability α ∈ (0, 1), the Value-at-Risk at level α is defined as the upper (1 − α)-quantile of L:

VaR_α(L) := inf{ t ∈ R : P(L ≤ t) ≥ 1 − α }.

Equivalently, P(L > VaR_α(L)) ≤ α. The Conditional Value-at-Risk at level α is defined as the expected loss in the α-tail of the distribution:

CVaR_α(L) := E[ L | L ≥ VaR_α(L) ],

whenever the conditional expectation is well-defined. More generally, CVaR admits the variational representation Rockafellar and Uryasev [2000]:

CVaR_α(L) = inf_{t ∈ R} { t + (1/α) E[(L − t)_+] },

which remains valid without continuity assumptions and is particularly convenient for learning and optimization. Unlike VaR, CVaR is a coherent risk measure: it is monotone, translation invariant, positively homogeneous, and subadditive. In particular, CVaR is convex in the underlying loss distribution, which makes it amenable to empirical risk minimization and generalization analysis.

2.2 Setup, Notation and Assumptions

Let H be a hypothesis class and let ℓ(h, x) ≥ 0 be a non-negative loss of hypothesis h ∈ H on example x. Let Z_1, . . . , Z_n be i.i.d. samples drawn from a distribution P, and let P_n = (1/n) Σ_{i=1}^n δ_{Z_i} denote the empirical measure. We define the Rockafellar-Uryasev (RU) population and empirical objectives as

Φ_P(h, θ) = θ + (1/α) E_P[(ℓ(h, Z) − θ)_+]  and  Φ_{P_n}(h, θ) = θ + (1/(nα)) Σ_{i=1}^n (ℓ(h, Z_i) − θ)_+.

Define the population Conditional Value-at-Risk (CVaR) at level α ∈ (0, 1) by

R^P_α(h) = inf_{θ ∈ R} { θ + (1/α) E[(ℓ(h, Z_1) − θ)_+] }   (1)

and its empirical counterpart

R̂^P_α(h) = inf_{θ ∈ R} { θ + (1/(αn)) Σ_{i=1}^n (ℓ(h, Z_i) − θ)_+ },   (2)

where (e)_+ ≡ max{e, 0} (Rockafellar and Uryasev [2002]). We omit the superscript denoting the underlying distribution whenever it is clear from context. Our objective is to learn the population CVaR minimizer h* ∈ arg min_{h ∈ H} R_α(h), and we denote by ĥ ∈ arg min_{h ∈ H} R̂_α(h) the empirical CVaR minimizer.
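Because the empirical RU objective in (2) is piecewise linear and convex in θ, its infimum is attained at an empirical upper α-quantile, so R̂_α(h) can be computed exactly by sorting. The following minimal sketch (ours, for illustration only; the Pareto losses and all constants are assumptions, not the paper's) cross-checks the sorted-tail formula against a direct grid search over θ.

```python
import numpy as np

def empirical_cvar(losses, alpha):
    """Empirical CVaR at tail level alpha via the Rockafellar-Uryasev
    representation (2): inf_theta theta + mean((L - theta)_+) / alpha.
    The infimum is attained at the empirical upper alpha-quantile."""
    L = np.sort(np.asarray(losses, dtype=float))
    n = len(L)
    theta = L[int(np.ceil((1 - alpha) * n)) - 1]   # empirical VaR_alpha
    return theta + np.maximum(L - theta, 0.0).mean() / alpha

# Cross-check against a brute-force grid search over the RU threshold theta.
rng = np.random.default_rng(0)
L = rng.pareto(2.5, size=10_000)                   # illustrative heavy-tailed losses
alpha = 0.05
grid = np.linspace(0.0, L.max(), 2_000)
ru = grid + np.array([np.maximum(L - t, 0.0).mean() for t in grid]) / alpha
print(empirical_cvar(L, alpha), ru.min())          # should agree up to grid error
```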
Assumption 2.1 (Bounded (1 + λ)-th Moment). There exist a constant M ∈ R_+ and λ ∈ (0, 1) such that the loss function (h, Z) ↦ ℓ(h, Z) satisfies E[(ℓ(h, Z))^{1+λ}] ≤ M uniformly for all h ∈ H, where Z = X × Y is the instance space.

Assumption 2.2 (β-mixing dependence). The process {Z_i}_{i ∈ Z} is strictly stationary and β-mixing with exponentially decaying coefficients: there exist constants c_0, c_1, γ > 0 such that

β(k) := sup_{t ∈ Z} β( σ(Z_{−∞}^t), σ(Z_{t+k}^∞) ) ≤ c_0 e^{−c_1 k^γ},  k ≥ 1.

Assumption 2.3 (Hypothesis class complexity). The hypothesis class H has finite pseudo-dimension: Pdim(H) ≤ d < ∞.

Assumption 2.4 (Statistical complexity of the truncated class). For any truncation level B > 0, define the truncated function class F_B = { min(ϕ(·; h, θ), B) : h ∈ H, θ ∈ Θ }. We assume that for any probability measure Q, the uniform covering number satisfies

log N(F_B, ∥·∥_{L_2(Q)}, u) ≤ d log(C_0 B / u)  for all u ∈ (0, B].

3 Generalization in CVaR

This section establishes non-asymptotic generalization guarantees for Conditional Value-at-Risk (CVaR) under heavy-tailed losses, ranging from fixed-hypothesis concentration to sharp excess risk bounds for infinite classes, with matching minimax lower bounds and extensions to dependent data. We further develop a random active-set theory for CVaR and derive uniform Bahadur-Kiefer expansions that explicitly capture the role of the endogenous threshold. The core difficulty is structural: CVaR depends on tail behavior through a data-dependent quantile, requiring joint control of tail fluctuations and threshold instability.

3.1 Generalization with i.i.d. Data

We begin with the i.i.d. setting. The first step is to understand how the empirical CVaR concentrates around its population counterpart for a fixed hypothesis. The key technical ingredient is a concentration inequality for heavy-tailed random variables with finite (1 + λ)-moment, as developed in Brownlees et al. [2015], Prashanth et al. [2019]. Applying these tools to the Rockafellar-Uryasev variational representation of CVaR yields the following fixed-hypothesis deviation bound.

Theorem 3.1. Under Assumption 2.1, for any fixed h ∈ H and δ ∈ (0, 1), with probability at least 1 − δ,

R̂_α(h) − R_α(h) ≤ (2/α) M^{1/(1+λ)} ( log(2/δ) / n )^{λ/(1+λ)}.   (3)

For a finite H, a union bound extends the guarantee to all hypotheses, yielding for the empirical minimizer ĥ, with probability at least 1 − δ,

R_α(ĥ) − R_α(h*) ≤ (4/α) M^{1/(1+λ)} ( log(4|H|/δ) / n )^{λ/(1+λ)}.   (4)

While the finite-class bound follows directly from a union bound, it does not capture the true statistical complexity of learning when H is infinite. We therefore turn to localized complexity arguments that yield dimension-dependent rates without incurring unnecessary logarithmic penalties. For a fixed natural number n ≥ 1, consider the space {0, 1}^n endowed with the Hamming metric, and let N denote its packing number.

Theorem 3.2 (Localized VC bounds for heavy-tailed CVaR). Assume H has pseudo-dimension d < ∞ and Assumption 2.1 holds. Then for any δ ∈ (0, 1), with probability at least 1 − δ,

R_α(ĥ) − R_α(h*) ≤ (C_λ/α) M^{1/(1+λ)} ( (d log n + log(1/δ)) / n )^{λ/(1+λ)},   (5)

where C_λ > 0 depends only on λ. Moreover, this rate is minimax optimal: there exist universal constants c, C > 0 (depending only on λ) such that for any α ∈ (0, 1), M > 0, λ ∈ (0, 1], and n ≥ C (log N)/α,

inf_A sup_{P ∈ P(M,λ)} E[ R_α(A(S); P) − R_α(h*_P; P) ] ≥ c (M^{1/(1+λ)}/α) ( (log N) / n )^{λ/(1+λ)}.   (6)

Discussion: This theorem shows that empirical CVaR minimization achieves the optimal tradeoff between model complexity, sample size, and tail heaviness. Importantly, the rate matches the minimax lower bound up to constants, demonstrating that no improvement is possible without stronger assumptions.
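As a purely illustrative sanity check of the n^{−λ/(1+λ)} scaling (ours, not an experiment from the paper), one can track fixed-hypothesis CVaR deviations for Pareto-type losses whose (1 + λ)-th moment is finite; the tail index, sample sizes, and replication counts below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, lam = 0.1, 0.5
a = 1.0 + lam + 0.05     # Lomax tail index just above 1 + lam, so E[L^(1+lam)] < inf

def empirical_cvar(L, alpha):
    L = np.sort(L)
    theta = L[int(np.ceil((1 - alpha) * len(L))) - 1]
    return theta + np.maximum(L - theta, 0.0).mean() / alpha

# Large-sample proxy for the population CVaR of the Lomax(a) loss.
pop = empirical_cvar(rng.pareto(a, size=4_000_000), alpha)

for n in [250, 1_000, 4_000, 16_000]:
    dev = [abs(empirical_cvar(rng.pareto(a, size=n), alpha) - pop)
           for _ in range(200)]
    # Theorem 3.1 predicts high-probability deviations of order n**(-lam/(1+lam)).
    print(n, round(float(np.quantile(dev, 0.9)), 4), round(n ** (-lam / (1 + lam)), 4))
```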
Truncated Empirical CVaR Estimation. Although empirical CVaR minimization is statistically optimal, finite-sample instability can arise from extreme losses. We therefore introduce a truncated CVaR estimator that limits the influence of large observations while preserving optimal rates and remaining amenable to first-order optimization (see the sketch following the takeaway below). Truncation simultaneously stabilizes estimation and enables standard complexity-based analysis, with the induced bias controlled under finite-moment assumptions.

Theorem 3.3 (High-probability generalization bound). Suppose Assumption 2.1 holds. Let R_n(ℓ ∘ H) be the Rademacher complexity of the loss class and fix δ ∈ (0, 1). If we set the truncation level B = (Mn)^{1/(1+λ)}, then with probability at least 1 − δ,

R_α(ĥ_B) − R_α(h*) ≤ (4/α) R_n(ℓ ∘ H) + (C_{λ,δ}/α) M^{1/(1+λ)} (1/n)^{λ/(1+λ)},   (7)

where the constant C_{λ,δ} is given by

C_{λ,δ} = 1/λ + √(8 log(1/δ)) + (2/3) log(1/δ).   (8)

Takeaway: Truncated empirical CVaR achieves optimal high-probability excess risk bounds while offering improved robustness to heavy-tailed noise.
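A minimal sketch of the truncated estimator (ours; it treats the moment constant M as known, which in practice would itself have to be estimated) instantiates the truncation level B = (Mn)^{1/(1+λ)} from Theorem 3.3:

```python
import numpy as np

def truncated_empirical_cvar(losses, alpha, lam, M):
    """Empirical CVaR computed on losses capped at B = (M * n)**(1 / (1 + lam)),
    the truncation level of Theorem 3.3; capping bounds the influence of any
    single extreme observation while keeping the bias of order M / B**lam."""
    L = np.asarray(losses, dtype=float)
    n = len(L)
    B = (M * n) ** (1.0 / (1.0 + lam))
    Lc = np.sort(np.minimum(L, B))
    theta = Lc[int(np.ceil((1 - alpha) * n)) - 1]   # truncated empirical VaR_alpha
    return theta + np.maximum(Lc - theta, 0.0).mean() / alpha
```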
3.2 Generalization with Dependent Data

We now extend the analysis to dependent observations. Let {Z_i}_{i ∈ Z} be a strictly stationary stochastic process, and suppose only a finite segment Z_1, . . . , Z_n is observed. Dependence reduces the effective sample size and requires refined concentration arguments.

Theorem 3.4 (Upper bound for CVaR under β-mixing heavy-tailed data). Under Assumptions 2.1-2.4, for any δ ∈ (0, 1), with probability at least 1 − δ − n^{−O(1)},

sup_{h ∈ H} | R̂_α(h) − R_α(h) | ≤ (C_{λ,γ}/α) M^{1/(1+λ)} ( (d log n + log(1/δ)) / (n/log n) )^{λ/(1+λ)},

and consequently, for the empirical CVaR minimizer ĥ,

R_α(ĥ) − R_α(h*) ≤ (2C_{λ,γ}/α) M^{1/(1+λ)} ( (d log n + log(1/δ)) / (n/log n) )^{λ/(1+λ)},

where C_{λ,γ} > 0 depends only on λ, γ and the mixing constants.

We also establish a matching minimax lower bound. Let P(λ, M, β) be the class of strictly stationary β-mixing distributions with β(k) ≤ c_0 e^{−c_1 k^γ} and sup_{h ∈ H} E[ℓ(h, Z)^{1+λ}] ≤ M, and assume VCdim(H) ≥ d ≥ 1. Then for any α ∈ (0, 1) and n satisfying n ≥ C_0 d/α,

inf_{ĥ} sup_{P ∈ P} E_P[ R_α(ĥ) − R_α(h*_P) ] ≥ (c/α) M^{1/(1+λ)} ( d / (n/log n) )^{λ/(1+λ)},

where C_0, c > 0 depend only on λ, γ and the mixing constants.

Discussion. Dependence introduces an unavoidable logarithmic degradation in the effective sample size, but the fundamental tail-dependent rate remains unchanged. The matching lower bound shows this loss is information-theoretically necessary.

3.3 Random Active-Set Theory and Uniform Bahadur-Kiefer Expansions for CVaR

While the preceding results bound uniform CVaR deviations under heavy tails, they do not explain how these fluctuations affect the empirical minimizer. The challenge is structural: CVaR depends on an endogenous quantile, coupling threshold estimation with tail averaging. We make this coupling explicit via uniform Bahadur-Kiefer expansions for CVaR-ERM, revealing two distinct error sources: one from tail moments and another from the local geometry at the CVaR threshold.

Fix α ∈ (0, 1). For h ∈ H and θ ∈ R, we write X_h := ℓ(h, Z) and X_{h,i} := ℓ(h, Z_i), and we define the population RU threshold

θ⋆(h) := inf{ θ ∈ R : P(X_h > θ) ≤ α },

and let the empirical RU threshold be the minimal empirical minimizer

θ̂_n(h) := inf{ θ ∈ R : P_n(X_h > θ) ≤ α },  P_n := (1/n) Σ_{i=1}^n δ_{Z_i}.

By convexity of θ ↦ Φ_{P_n}(h, θ), θ̂_n(h) is always a minimizer of Φ_{P_n}(h, ·). We also assume that for all h ∈ H, P(X_h > θ⋆(h)) = α (equivalently, P(X_h = θ⋆(h)) = 0). Before stating the main theorem, we collect the required assumptions.

Assumption 3.5 (Uniform tail-indicator deviation). There exist constants V ≥ 1 and C_0 > 0 such that for all δ ∈ (0, 1), with probability at least 1 − δ,

sup_{h ∈ H} sup_{θ ∈ R} | P_n(X_h > θ) − P(X_h > θ) | ≤ ε_n(δ) := C_0 √( (V log(en) + log(2/δ)) / n ).
whenever the tail map θ 7→ P ( X h > θ ) crosses the level α without a flat region at the minimizing threshold. When P ( X h > θ ⋆ ( h )) < α , the set of minimizers of θ 7→ Φ( h, θ ) can b e non-singleton (a “flat” region in θ ), and b θ n ( h ) selects an endp oin t of the empirical minimizer set. The correction term captures precisely this endogenous selection effect at first order. 4 Robustness of CV aR This section dev elops a robustness theory for CV aR under heavy-tailed losses with only finite-moment control. Robustness arises at three levels: functional robustness of CV aR under distributional p erturbations; estimator robustness, pro viding uniform guaran tees for empirical CV aR under sampling noise and contamination; and de ci- sion robustness, which concerns stability of the CV aR minimizer and is fundamen tally shap ed by the endogenous R U threshold. W e treat these in turn, highlighting how threshold sensitivity and tail scarcity induce intrinsic instabilit y . 7 4.1 F unctional Robustness W e first analyze CV aR as a distributional functional, yielding metric-dependent contin uit y results indep endent of estimation. Under geometric metrics (e.g., W asserstein), stability follows from loss regularit y in z , whereas under w eaker metrics (e.g., Lévy–Prokhorov or total v ariation), finite momen ts imply only tail-Hölder con tinuit y with exp onen t set b y the momen t order. This distinction separates intrinsic (functional) robustness from estimation- induced effects. Prop osition 4.1 (T ail-Hölder robustness of CV aR ) . Fix α ∈ (0 , 1) , λ ∈ (0 , 1] , and set κ α = (1 − α ) − 1 . L et P , Q b e distributions on Z and ℓ : H × Z → R . F or fixe d h ∈ H , let L P = ℓ ( h, Z ) , Z ∼ P , and L Q = ℓ ( h, Z ′ ) , Z ′ ∼ Q . Assume E P [ | L P | λ +1 ] ≤ M p , E Q [ | L Q | λ +1 ] ≤ M p . Given, W r ( P , Q ) and π ( P , Q ) ar e r -W asserstein and L évy-Pr okhor ov metric b etwe en P and Q r esp e ctively, then (i) W asserstein stability under Hölder losses. If ℓ ( θ, · ) is β -Hölder in z uniformly over θ with c onstant L β and r ≥ β , then for al l P , Q with finite r -moments, sup h ∈H   R P α ( h ) − R Q α ( h )   ≤ κ α L β W r ( P , Q ) β . (ii) L évy-Pr okhor ov tail-Hölder c ontinuity. If Z = R m and ℓ ( h, · ) is L h -Lipschitz, then for π ( P , Q ) ≤ ε ,   R P α ( h ) − R Q α ( h )   ≤ κ α L h  2 ε + 2 M 1 / ( λ +1) p ε λ/ ( λ +1)  , and the tail term ε λ 1+ λ is unavoidable. Discussion. P art (i) shows that when the loss is regular in z , CV aR is stable in W asserstein distance under finite moments. P art (ii) reveals an una voidable degradation under weak er p erturbations: small distributional c hanges can shift the upp er tail enough to alter CV aR at a Hölder rate ε λ/ ( λ +1) . This tail-scarcity effect also go verns estimator and decision robustness, where stability depends on mass near the threshold. An analogous Hölder bound holds in total v ariation distance (Prop osition 4.2). 4.2 Estimator Robustness In this part, w e dev elops a robust estimator theory for CV aR under hea vy-tailed losses with only finite-momen t con trol. Building on the previous section, w e next study robustness at the lev el of the CV aR obje ctive o v er a h yp othesis class. Here the goal is to b ound the worst-case sensitivity sup h ∈H | R P α ( h ) − R Q α ( h ) | , whic h captures b oth distributional robustness and the effect of adversarial p erturbations when Q is a contaminated v ersion of P . 
The next theorem provides suc h a b ound under total v ariation p erturbations when the contaminated distribution Q also has similar moment cotrol as P and establishes that the resulting exp onent is minimax optimal under finite-momen t con trol. Prop osition 4.2 (Robustness of CV aR ) . Under Assumption 2.1, for any α ∈ (0 , 1) , sup h ∈H   R P α ( h ) − R Q α ( h )   ≤ C α d TV ( P , Q ) λ 1+ λ , (12) wher e d TV ( P , Q ) = sup A | P ( A ) − Q ( A ) | and C = (2 M ) 1 1+ λ  1 + 1 λ  . Mor e over, this dep endenc e is minimax optimal. Discussion. Prop osition 4.2 characterizes the optimal sensitivit y of CV aR to total v ariation p erturbations under hea vy tails. Because total v ariation is b ounded, this result combines directly with our generalization bounds to yield excess risk guarantees under domain shift Aminian et al.. The Hölder exp onen t λ/ (1 + λ ) reflects the in trinsic limitation imp osed by finite-momen t control, and the matc hing minimax low er b ound shows that no sharp er dep endence on d TV ( P , Q ) is achiev able without stronger tail assumptions. 8 Robust Estimation under Adv ersarial Con tamination Unlik e the previous setting—where the contami- nated distribution retained a finite (1 + λ ) -momen t—adversarial contamination may in tro duce arbitrarily hea vy tails. W e therefore study estimator robustness under oblivious adversarial c ontamination where ϵn datap oints ha ve b een corrupted, deriving uniform generalization guaran tees via a truncated median-of-means (MoM) con- struction. Our analysis sim ultaneously controls heavy-tailed sampling v ariability and adversarial bias b y com bining truncation of extreme losses with median aggregation across blo cks. These ideas are applied to the Ro ck afel- lar–Ury asev lift, treating ( h, θ ) jointly and robustifying the empirical estimation of E [( ℓ ( h, Z ) − θ ) + ] through blo c kwise truncation and MoM aggregation. The resulting bound decomposes in to a statistical term driv en b y complexity and effective sample size, and a con tamination term prop ortional to ϵ , b oth scaling with the optimal hea vy-tail exponent λ/ (1 + λ ) . Theorem 4.3 (Robust Generalization Bound) . Supp ose Assumptions 2.1 and 2.4 hold. L et γ ∈ (0 , 1 / 2) b e fixe d such that ϵ ≤ 1 / 2 − γ . Supp ose the sample size satisfies n ≥ C d log n γ 2 . With pr ob ability at le ast 1 − δ : sup h ∈H | b R α ( h ) − R α ( h ) | ≤ C 1  M ϕ d log n n  λ 1+ λ + C 2 M 1 1+ λ ϕ ϵ λ 1+ λ ≍ 1 α  M d log n n  λ 1+ λ + ( M ϵ ) λ 1+ λ ! . Her e C 1 , C 2 , M and M ϕ ar e universal c onstants dep ending only on λ . Discussion. Theorem 4.3 clarifies the robustness–generalization tradeoff under contamination: the first term ac hieves the optimal hea vy-tail statistical rate (up to logs and complexity), while the second captures the unav oid- able loss from an ϵ -fraction of adv ersarial corruption. The final ≍ form highlights the intrinsic 1 /α amplification of tail effects induced by the R U lift. On the Algorithmic Asp ect : W e construct a robust CV aR estimator via a truncated median-of-means (T-MoM) sc heme for hea vy-tailed data with oblivious adv ersarial con tamination. The sample is partitioned in to blo c ks, the Ro ck afellar–Ury asev loss is truncated within eac h blo ck to control v ariance and adversarial bias, and blockwise estimates are aggregated b y a median, yielding robustness to both heavy tails and an ϵ -fraction of corrupted samples. 
The CV aR estimate is obtained b y minimizing this robust ob jective ov er the auxiliary threshold. While statistically principled, this approach is not directly implementable in high dimensions: MoM tourna- men ts o v er η -nets are feasible only in lo w-dimensional settings. W e therefore fo cus on statistical guarantees and defer algorithmic issues to App endix H. 4.3 Decision Robustness W e no w study de cision r obustness for CV aR minimization, focusing on the stability of the argmin rather than the ob jective v alue. This is subtle for CV aR, as its endogenous, distribution-dep endent threshold can amplify small p erturbations into decision-lev el changes. W e denote the p opulation solution set and the (p ossibly set-v alued) Ro c k afellar–Uryasev threshold minimizers (deined in section 2.2) by S ( P ) := arg min h ∈H R P α ( h ) , T ⋆ ( h, P ) := arg min θ ∈ R Φ P ( h, θ ) Decision robustness concerns the con tinuit y of S ( P ) under p erturbations of P . Unlike mean- risk ERM, CV aR introduces an endogenous threshold θ ⋆ ( h, P ) ∈ T ⋆ ( h, P ) determined by the upper α -tail, whose stabilit y is the primary driver of decision robustness. Endogenous Threshold Stability: Fix h and write X := ℓ ( h, Z ) under P . Define ϕ P ( θ ) := Φ P ( h, θ ) = θ + α − 1 E [( X − θ ) + ] . The characterization 0 ∈ ∂ ϕ P ( θ ) yields θ ∈ T ⋆ ( h, P ) ⇐ ⇒ P ( X > θ ) ≤ α ≤ P ( X ≥ θ ) . (13) Th us T ⋆ ( h, P ) is the α -upp er-quantile set of X . 9 Definition 4.4 (Quan tile margin and generalized density-at-quan tile) . Assume T ⋆ ( h, P ) is a singleton { θ ⋆ ( h, P ) } and write θ ⋆ for short. Define the lo cal mass near the threshold κ h,P ( r ) := P  | ℓ ( h, Z ) − θ ⋆ | ≤ r  , r > 0 , and the generalized densit y-at-quantile (quantile margin parameter) m α ( h, P ) := lim inf r ↓ 0 κ h,P ( r ) r ∈ [0 , ∞ ] . The next theorem shows that a p ositiv e quantile margin guarantees stabilit y of the threshold under weak distributional perturbations, while its absence leads to in trinsic instabilit y . Theorem 4.5 (Threshold stability under Lévy-Prokhoro v p erturbations) . Fix h and assume T ⋆ ( h, P ) = { θ ⋆ ( h, P ) } is a singleton. L et Q b e another law for X = ℓ ( h, Z ) and let d LP ( P , Q ) denote the L évy-Pr okhor ov distanc e b etwe en these one-dimensional laws. If m α ( h, P ) ≥ m 0 > 0 , then for al l sufficiently smal l δ := d LP ( P , Q ) , | θ ⋆ ( h, Q ) − θ ⋆ ( h, P ) | ≤ C δ /m 0 . (14) Conversely, if m α ( h, P ) = 0 , then for every δ > 0 ther e exists Q with d LP ( P , Q ) ≤ δ such that T ⋆ ( h, Q ) c ontains ˜ θ with | ˜ θ − θ ⋆ | ≥ c 0 (an O (1) jump). Remark. Theorem 4.5 reveals an in trinsic fragility of CV aR: when the loss distribution has negligible mass near the V aR, the optimal threshold can change discontin uously under arbitrarily small p erturbations, regardless of sample size. Influence F unction of the CV aR Decision: T o quantify how thre shold instabilit y propagates to decisions, w e analyze the RU stationarity system and deriv e the influence function of the CV aR minimizer under gross-error con tamination, treating ( h, θ ) join tly . Define the stationarity map H ( h, θ ; P ) :=  ∇ h Φ P ( h, θ ) ∂ θ Φ P ( h, θ )  . (15) When P ( ℓ ( h, Z ) = θ ) = 0 , the deriv ative in θ is classical: ∂ θ Φ P ( h, θ ) = 1 − 1 α P ( ℓ ( h, Z ) > θ ) , and ∇ h Φ P ( h, θ ) = 1 α E P  ∇ h ℓ ( h, Z ) 1 { ℓ ( h, Z ) > θ }  . 
W e imp ose a lo cal regularity condition that is strictly weak er than global strong conv exity and fits noncon v ex but stable minimizers (e.g. ‘tilt stabilit y”; here we keep a concrete inv ertibilit y condition). Assumption 4.6 (Local strong regularity / inv ertible Jacobian) . Let ( h ⋆ , θ ⋆ ) b e a lo cally unique stationary p oin t: H ( h ⋆ , θ ⋆ ; P ) = 0 . Assume: (i) ℓ ( h, z ) is C 2 in h near h ⋆ for P -a.e. z ; (ii) P ( ℓ ( h ⋆ , Z ) = θ ⋆ ) = 0 ; (iii) the Jacobian J := ∇ ( h,θ ) H ( h ⋆ , θ ⋆ ; P ) exists and is in v ertible. Assumption 4.7 (Lo cal RU stability at the p opulation optimum) . Assume there exists ( h ⋆ , θ ⋆ ) such that (i) h ⋆ ∈ S ( P ) and θ ⋆ ∈ T ⋆ ( h ⋆ , P ) ; (ii) T ⋆ ( h ⋆ , P ) = { θ ⋆ } is a singleton; (iv) the quantile margin m α ( h ⋆ , P ) ≥ m 0 > 0 . Under these assumptions, the influence function admits a closed-form expression and exhibits a sharp depen- dence on the quan tile margin. Theorem 4.8 (Influence function of the CV aR decision: tail supp ort and quantile-margin blow-up) . Assume Assumption 4.6. Consider the gr oss-err or p ath P ε = (1 − ε ) P + ε ∆ z , ε ≥ 0 , and let ( h ε , θ ε ) b e the lo c al ly unique stationary solution to H ( h, θ ; P ε ) = 0 ne ar ( h ⋆ , θ ⋆ ) . Then ε 7→ ( h ε , θ ε ) is differ entiable at ε = 0 and d dε  h ε θ ε      ε =0 = − J − 1  g ( z ; h ⋆ , θ ⋆ ) − E P [ g ( Z ; h ⋆ , θ ⋆ )]  , 10 wher e g ( z ; h, θ ) = 1 α ∇ h ℓ ( h, z ) 1 { ℓ ( h, z ) > θ } 1 − 1 α 1 { ℓ ( h, z ) > θ } ! . (16) Mor e over, the θ -c omp onent sensitivity satisfies the b ound     d dε θ ε     ε =0     ≥ c m α ( h ⋆ , P ) ×    1 { ℓ ( h ⋆ , z ) > θ ⋆ } − P ( ℓ ( h ⋆ , Z ) > θ ⋆ )    , in the sense that the r elevant Jac obian blo ck involves the gener alize d density-at-quantile; henc e the influenc e blows up as m α ( h ⋆ , P ) ↓ 0 . Remark. Theorems 4.5 and 4.8 together identify a precise mechanism: CV aR decisions are stable only when the threshold is stable. The quantile margin simultaneously gov erns threshold contin uit y and the conditioning of the R U stationarit y system. The influence-function c haracterization naturally leads to a notion of lo cal robustness: the largest con tamina- tion lev el under whic h the decision remains within a prescrib ed tolerance. Definition 4.9 (Lo cal decision robustness radius) . Fix a tolerance r > 0 . Under Assumption 4.7, define ε rob ( r ) := sup n ε ∈ [0 , 1) : ∥ h ε − h ⋆ ∥ ≤ r for all con tamination directions ∆ z o , where ( h ε , θ ε ) is the stationary solution under P ε = (1 − ε ) P + ε ∆ z selected near ( h ⋆ , θ ⋆ ) . Corollary 4.10 (Lo cal decision radius via tail-supp orted IF) . Under Assumption 4.7, ther e exists r 0 > 0 such that for al l r ∈ (0 , r 0 ) , ε rob ( r ) ≥ r C ∥ J − 1 ∥ · sup z ∈Z    g ( z ; h ⋆ , θ ⋆ ) − E P [ g ( Z ; h ⋆ , θ ⋆ )]    ! − 1 , (17) with g fr om (16) . Mor e over, as m α ( h ⋆ , P ) ↓ 0 , the b ound de gener ates b e c ause ∥ J − 1 ∥ → ∞ , i.e. the lo c al r obustness r adius c ol lapses at quantile critic ality. Pr o of. By Theorem 4.8, for each fixed direction ∆ z , the decision map ε 7→ h ε is differen tiable at 0 with de riv ativ e h ′ (0) = − [ J − 1 ( · )] h applied to the perturbation vector. Thus for sufficiently small ε , a first-order expansion yields ∥ h ε − h ⋆ ∥ ≤ ε sup z ∥ IF( z ; h ⋆ , P ) ∥ + o ( ε ) . Bounding sup z ∥ IF( z ) ∥ using Theorem 4.8 yields the righ t-hand side of (17). Finally , ∥ J − 1 ∥ → ∞ as m α ↓ 0 by Theorem 4.8, so the radius collapses in the quan tile-critical regime. 
In trinsic Limits (T ail Scarcit y in i.i.d. Data): CV aR dep ends on losses exceeding an endogenous threshold θ ⋆ ( h, P ) determined b y the upp er α -tail. Consequently , the empirical CV aR ob jective can b e dominated by a few rare observ ations near this threshold, even when the population minimizer is strict and well separated. This effect is structural rather than a concentration artifact: under finite p -momen t assumptions, tail exceedances o ccur with p olynomial probability and can induce O (1) p erturbations in the empirical ob jective. The result b elow formalizes this tail-sc ar city instability , showing that, with una voidable p olynomial probability (up to logarithmic factors), a single observ ation can flip the CV aR-ERM decision despite strict population optimalit y . Prop osition 4.11. Fix α ∈ (0 , 1) and 1 < p < 2 . Ther e exist a distribution P on R + with E P [ Z p ] < ∞ and E P [ Z q ] = ∞ for al l q > p , and a two-p oint hyp othesis class H = { h A , h B } with losses ℓ n : H × R + → R + , such that for al l sufficiently lar ge n , CV aR α ( ℓ n ( h A , Z ); P ) < CV aR α ( ℓ n ( h B , Z ); P ) , yet ther e exists a c onstant c > 0 for which Pr D ∼ P ⊗ n  ∃ D ′ with | D △ D ′ | = 1 for which arg min h ∈H b F n ( h ; D )  = arg min h ∈H b F n ( h ; D ′ )  ≥ c n − λ (log n ) 2 , wher e b F n ( h ; D ) = CV aR α ( ℓ n ( h, Z ); P n ) . 11 Remark: This result exp oses a fundamen tal limitation of CV aR-ERM under hea vy-tailed losses: even with a strict and unique p opulation minimizer and finite p -th moments ( 1 < p < 2 ), the empirical CV aR decision map is not uniformly stable. With an unav oidable p olynomial probabilit y of order n 1 − p , a single observ ation can flip the empirical CV aR-ERM solution. The proofs are deferred to Appendix F. 5 Conclusion W e developed a learning-theoretic framew ork for CV aR under hea vy-tailed and contaminated data, addressing generalization, robustness, and decision stability . Our analysis identifies the endogenous quantile as a fundamental source of b oth statistical difficulty and instability , captured via refined empirical-pro cess to ols. T ogether, our results clarify when CV aR learning is reliable and when in trinsic limitations arise, pro viding principled guidance for risk-sensitiv e learning in heavy-tailed regimes. A ckno wledgmen ts Piyushi Manupriy a w as initially supp orted by a grant from Ittiam Systems Priv ate Limited through the Ittiam Equitable AI Lab and then b y ANRF-NPDF (PDF/2025/005277) grant. Anan t Ra j is supported by a gran t from Ittiam Systems Priv ate Limited through the Ittiam Equitable AI Lab, ANRF’s Prime Minister Early Career Gran t (ANRF/ECR G/2024/003259) and Pratiksha T rust’s Y oung In vestigator A w ard. References Gholamali Aminian, Amir R Asadi, Tian Li, Ahmad Beirami, Gesine Reinert, and Samuel N Cohen. Gener- alization and robustness of the tilted empirical risk. In F orty-se c ond International Confer enc e on Machine L e arning . R Ra j Bahadur. A note on quantiles in large samples. The Annals of Mathematic al Statistics , 1966. Christian Brownlees, Emilien Joly , and Gáb or Lugosi. Empirical risk minimization for hea vy-tailed losses. The A nnals of Statistics , 2015. A drian Riv era Cardoso and Huan Xu. Risk-a verse sto c hastic conv ex bandit. In Pr o c e e dings of the Twenty-Se c ond International Confer enc e on Artificial Intel ligenc e and Statistics , 2019. Xabier de Juan and Santiago Mazuelas. 
On the optimalit y of the median-of-means estimator under adversarial con tamination. arXiv pr eprint arXiv:2510.07867 , 2025. A youb El Hanc hi, Chris Maddison, and Murat Erdogdu. Minimax linear regression under the quantile risk. In The Thirty Seventh Annual Confer enc e on L e arning The ory , 2024. Matthew Holland and El Mehdi Haress. Learning with risk-a v erse feedback under p otentially heavy tails. In International Confer enc e on Artificial Intel ligenc e and Statistics , 2021. Matthew J Holland and El Mehdi Haress. Sp ectral risk-based learning using un b ounded losses. In International c onfer enc e on artificial intel ligenc e and statistics , 2022. Ronald A. How ard and James E. Matheson. Risk-sensitive mark o v decision pro cesses. Management Scienc e , 1972. Jac k Kiefer. On bahadur’s representation of sample quan tiles. The Annals of Mathematic al Statistics , 1967. Ra vi Kumar Kolla, LA Prashanth, Sanjay P Bhat, and Krishna Jagannathan. Concen tration b ounds for empirical conditional v alue-at-risk: The un b ounded case. Op er ations R ese ar ch L etters , 2019. Pierre Laforgue, Guillaume Staerman, and Stephan Clémençon. Generalization b ounds in the presence of outliers: a median-of-means study . In International c onfer enc e on machine le arning , 2021. 12 Jaeho Lee, Sejun P ark, and Jinw o o Shin. Learning bounds for risk-sensitiv e learning, 2021. Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. Tilted empirical risk minimization, 2021. Gáb or Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey . F oundations of Computational Mathematics , 2019a. Gab or Lugosi and Shahar Mendelson. Risk minimization by median-of-means tournamen ts. Journal of the Eur op e an Mathematic al So ciety , 2019b. Gab or Lugosi and Shahar Mendelson. Robust multiv ariate mean estimation: the optimalit y of trimmed mean. 2021. Timothée Mathieu and Stanisla v Minsker. Excess risk b ounds in robust empirical risk minimization. Information and Infer enc e: A Journal of the IMA , 2021. Rob erto I Oliveira and Lucas Resende. T rimmed sample means for robust uniform mean estimation and regression. The Annals of Statistics , 2025. Rob erto I Oliveira, P aulo Orenstein, and Zoraida F Rico. Finite-sample prop erties of the trimmed mean. arXiv pr eprint arXiv:2501.03694 , 2025. L. A. Prashanth, K. Jagannathan, and R. K. Kolla. Concentration b ounds for CV aR estimation: The cases of ligh t-tailed and heavy-tailed distributions. arXiv pr eprint arXiv:1901.00997 , 2019. R. T yrrell Ro c k afellar and Stanislav Uryasev. Optimization of conditional v alue-at risk. Journal of R isk , 2000. R. T yrrell Ro c k afellar and Stanisla v Ury asev. Conditional v alue-at-risk for general loss distributions. Journal of Banking & Financ e , 2002. Abhishek Roy , Krishnakumar Balasubramanian, and Murat A Erdogdu. On empirical risk minimization with dep enden t and heavy-tailed data. A dvanc es in Neur al Information Pr o c essing Systems , 2021. Shai Shalev-Shw artz and Shai Ben-David. Understanding Machine L e arning: F r om The ory to Algorithms . Cam- bridge Univ ersity Press, 2014. Yinan Shen, Yic hen Zhang, and W en-Xin Zhou. Sgd with dep endent data: Optimal estimation, regret, and inference. arXiv pr eprint arXiv:2601.01371 , 2026. Y un Shen, Michael J. T obia, T obias Sommer, and Klaus Ob ermay er. Risk-sensitiv e reinforcement learning. Neur al Computation , 2014. Alexandre B. T sybako v. Intr o duction to Nonp ar ametric Estimation . 
Springer Publishing Company , Incorporated, 2008. ISBN 0387790519. Aad W V an der V aart. Asymptotic statistics , v olume 3. Cambridge Universit y Press, 2000. 13 P art App endix T able of Con ten ts A Justification of Assumptions 14 B Generalization in CV aR 15 B.1 Fixed Hyp othesis h . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 B.2 Uniform Bounds for Finite H . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 B.3 CV aR Generalization for Infinite Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B.4 T runcated Empirical CV aR Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 B.5 Generalization with Dep endent Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 C Random activ e-set theory and uniform Bahadur-Kiefer expansions for CV aR 28 D F unctional Robustness 32 E Estimator Robustness 33 E.1 T runcated CV aR Loss based ERM is Optimal . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 E.2 Robust estimation under adversarial con tamination under oblivious adversaries . . . . . . . . . 36 F Decision Robustness 39 F.1 T ail-scarcity instabilit y for CV aR-ERM under finite p -moment . . . . . . . . . . . . . . . . . . 39 G Auxiliary results 42 G.1 Boundedness of CV aR minimizer under a b ounded moment . . . . . . . . . . . . . . . . . . . . 42 G.2 Concen tration b ound for heavy-tailed random v ariables . . . . . . . . . . . . . . . . . . . . . . 44 H On the Algorithmic Aspects of Robust CV aR-ERM Estimator 50 H.1 η -co v er based Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 A Justification of Assumptions W e clarify the v alidity and necessit y of our theoretical assumptions b elo w: 1. Hea vy-T ailed Moments (Assumption 2.1): This assumption relaxes standard requirements for strictly b ounded losses or finite v ariances. By requiring only a finite (1 + λ ) -th momen t, w e accommodate hea vy-tailed distributions where the loss ma y occasionally take v ery large v alues (noting that infinite v ariance is possible if λ < 1 ). This approach allo ws the theory to remain applicable to realistic, risk-sensitive settings and aligns with recen t literature such as Aminian et al. and Lee et al. [2021]. 2. β -Mixing Dependence (Assumption 2.2): The standard i.i.d. assumption is insufficien t for sequential applications suc h as Reinforcemen t Learning, financial forecasting, and signal pro cessing. W e assume β -mixing b ecause it offers a crucial balance betw een mo deling flexibility and mathematical tractabilit y . Sp ecifically , β - mixing facilitates the use of blo c king techniques (e.g., Y u’s method) and coupling to ols (e.g., Berb ee’s Lemma). These tools allow us to decomp ose the dependent sequence into "nearly indep endent" blo cks, thereby enabling the application of standard concentration inequalities. 14 3. Hyp othesis Complexity (Assumption 2.3): The pseudo-dimension is the natural extension of the V apnik- Cherv onenkis (V C) dimension to real-v alued function classes. Assuming a finite pseudo-dimension restricts the "capacit y" of the h yp othesis class H , which is a standard and necessary condition in statistical learning theory to guaran tee uniform con vergence and prev ent ov erfitting. 4. 
Complexit y of T runcated Class (Assumption 2.4): Given the hea vy-tailed nature of the data (Assump- tion 2.1), classical concentration inequalities for bounded v ariables (such as Ho effding’s inequality) are not directly applicable. T o circumv ent this, we emplo y a truncation argument (capping the loss at B ). Assump- tion 2.4 ensures that this truncation op eration do es not artificially "explo de" the complexit y of the hypothesis class, ensuring the co v ering n umbers b eha v e w ell enough to derive generalization bounds. F urther justification for the remaining tec hnical assumptions is provided in the discussion section of the main pap er. B Generalization in CV aR B.1 Fixed Hyp othesis h Lemma B.1 (Fixed-hypothesis low er deviation) . Under Assumption 2.1, with pr ob ability at le ast 1 − δ , b R α ( h ) − R α ( h ) ≤ 2 α M 1 1+ ε  log(2 /δ ) n  ε 1+ ε . (18) Pr o of. of Theorem 3.1,In particular result 3 Fix h ∈ H . Let θ ∗ = arg min θ  θ + 1 α E [( ℓ ( h, Z 1 ) − θ ) + ]  , so R α ( h ) = θ ∗ + 1 α E [( ℓ ( h, Z 1 ) − θ ∗ ) + ] . (19) Let b θ = arg min θ  θ + 1 αn P n i =1 ( ℓ ( h, Z i ) − θ ) +  , so b R α ( h ) = b θ + 1 αn n X i =1 ( ℓ ( h, Z i ) − b θ ) + . The goal is to b ound the follo wing: b R α ( h ) − R α ( h ) = b θ + 1 αn n X i =1 ( ℓ ( h, Z i ) − b θ ) + ! −  θ ∗ + 1 α E [( ℓ ( h, Z 1 ) − θ ∗ ) + ]  . (20) Use population minimizer prop erty , Since θ ∗ minimizes the p opulation CV aR: b R α ( h ) ≤ θ ∗ + 1 αn n X i =1 ( ℓ ( h, Z i ) − θ ∗ ) (21) implies that − b R α ( h ) ≥ − ( θ ∗ + 1 αn n X i =1 ( ℓ ( h, Z i ) − θ ∗ )) (22) No w A dd equation (2) and (5) and multiply by -1 to get. b R α ( h ) − R α ( h ) ≤ 1 α 1 n n X i =1 ( ℓ ( h, Z i ) − θ ∗ ) + − E [( ℓ ( h, Z 1 ) − θ ∗ ) + ] ! . (23) No w w e define Y i = ( ℓ ( h, Z i ) − θ ∗ ) + , so Y i ≥ 0 . Also c heck the momen t condition that: E [ Y 1+ ε i ] ≤ E [ ℓ ( h, Z i ) 1+ ε ] ≤ M , 15 since ( ℓ ( h, Z i ) − θ ∗ ) + ≤ ℓ ( h, Z i ) . Apply Prop osition G.2 to get: 1 n n X i =1 Y i − E [ Y i ] ≤ 2 M 1 1+ ε  log(2 /δ ) n  ε 1+ ε . Th us: b R α ( h ) − R α ( h ) ≤ 2 α M 1 1+ ε  log(2 /δ ) n  ε 1+ ε . Lemma B.2 (Fixed-hypothesis upp er deviation) . Under Assumption 2.1, for any fixe d h ∈ H and δ ∈ (0 , 1) , with pr ob ability at le ast 1 − δ , R α ( h ) − b R α ( h ) ≤ 2 α M 1 / (1+ λ )  log(2 /δ ) n  λ/ (1+ λ ) . Pr o of. The pro of pro ceeds in three steps: reducing the difference to an empirical pro cess supremum, establishing p oin t wise concentration using the momen t assumption, and extending this to a uniform b ound ov er the v ariational parameter θ via a co vering argument. R e duction to empiric al pr o c ess. By Lemma G.3, the deviation of the CV aR risk is bounded b y the supremum of the empirical process indexed b y θ : R α ( h ) − b R α ( h ) ≤ 1 α sup θ ∈ R ( E − P n ) g h,θ , (24) where g h,θ ( z ) := ( ℓ ( h, z ) − θ ) + . It suffices to b ound the term sup θ ( E − P n ) g h,θ . Moment c ontr ol. The Momen t Condition states that E [ ℓ ( h, Z ) 1+ λ ] ≤ M . Since 0 ≤ g h,θ ( z ) ≤ ℓ ( h, z ) holds for all θ ∈ R and z ∈ Z , w e ha ve uniform momen t control ov er the parametric class: E [ g h,θ ( Z ) 1+ λ ] ≤ M , ∀ θ ∈ R . Pointwise c onc entr ation. Fix θ ∈ R . W e apply a standard heavy-tailed concentration inequality (Via trunca- tion and Bernstein’s inequality similar to Theorem G.2) for random v ariables with finite (1 + λ ) -th momen ts. 
F or an y failure probability δ ′ ∈ (0 , 1) , with probabilit y at least 1 − δ ′ : ( E − P n ) g h,θ ≤ C 0 M 1 1+ λ  log(1 /δ ′ ) n  λ 1+ λ . (25) Uniform b ound via c overing. By Lemma G.4, the map θ 7→ ( E − P n ) g h,θ is 2 -Lipsc hitz. T o handle the range of θ , w e use a standard p eeling argument (or random truncation). With probabilit y at least 1 − δ / 2 , we hav e max i ℓ ( h, Z i ) ≤ B n where B n = C 1 ( M n/δ ) 1 1+ λ . W e restrict θ to the compact interv al [0 , B n ] ; outside this in terv al, the suprem um is either zero (if θ is very large) or dominated by the case θ = 0 (since deviations stabilize). Let { θ 1 , . . . , θ N } be an η -net of [0 , B n ] . Then for an y θ in the in terv al: ( E − P n ) g h,θ ≤ ( E − P n ) g h,θ j + 2 η . W e c ho ose the spacing η = M 1 1+ λ ( n − 1 log(2 /δ )) λ 1+ λ . The cov ering n um b er N ≈ B n /η grows polynomially in n . Applying the p oint wise bound (25) on the grid and a union b ound yields that with probabilit y at least 1 − δ : sup θ ∈ R ( E − P n ) g h,θ ≤ C M 1 1+ λ  log(1 /δ ) n  λ 1+ λ . Substituting this upp er bound bac k in to inequalit y (24) completes the proof. 16 Corollary B.3 (T ail bounds for fixed hypothesis) . Under Assumption 2.1, for any fixe d h ∈ H and ϵ > 0 , P  R α ( h ) − b R α ( h ) > ϵ  ≤ 2 exp  − nC λ  αϵ M 1 / (1+ λ )  (1+ λ ) /λ  , wher e C λ = 2 − (1+ λ ) /λ . Pr o of. Start from the fixed- h high-probabilit y inequality: for an y δ ∈ (0 , 1) , R α ( h ) − b R α ( h ) ≤ 2 α M 1 1+ λ  log(2 /δ ) n  λ 1+ λ with probabilit y at least 1 − δ. (26) Let ϵ > 0 and define δ ( ϵ ) := 2 exp  − n  αϵ 2 M 1 / (1+ λ )  1+ λ λ  . (27) W e v erify that this c hoice inv erts the deviation bound. Indeed, solving ϵ = 2 α M 1 1+ λ  log(2 /δ ) n  λ 1+ λ (28) for δ yields  αϵ 2 M 1 / (1+ λ )  1+ λ λ = log(2 /δ ) n , (29) and hence log  2 δ  = n  αϵ 2 M 1 / (1+ λ )  1+ λ λ , (30) whic h implies δ = 2 exp  − n  αϵ 2 M 1 / (1+ λ )  1+ λ λ  = δ ( ϵ ) . (31) Therefore, substituting δ ( ϵ ) in to the fixed- h bound gives P  R α ( h ) − b R α ( h ) > ϵ  ≤ δ ( ϵ ) = 2 exp  − n  αϵ 2 M 1 / (1+ λ )  1+ λ λ  . (32) Finally , observ e that  αϵ 2 M 1 / (1+ λ )  1+ λ λ = 2 − (1+ λ ) /λ  αϵ M 1 / (1+ λ )  1+ λ λ = C λ  αϵ M 1 / (1+ λ )  1+ λ λ , (33) where C λ = 2 − (1+ λ ) /λ . This yields the stated b ound. This also matc hes the rates of Prashan th et al. [2019]. But they assume Assumption 2.1 with a strictly increasing CDF. B.2 Uniform Bounds for Finite H F or a finite H , w e extend the bound to all hypotheses using the union bound. Prop osition B.4. Uniform L ower Derivation With pr ob ability at le ast 1 − δ , sup h ∈H  R α ( h ) − b R α ( h )  ≤ 2 α M 1 1+ λ  log(2 |H| /δ ) n  λ 1+ λ . 17 Pr o of. of result B.4 F or eac h h ∈ H , apply result B.2 with failure probabilit y δ ′ = δ / |H| : R α ( h ) − b R α ( h ) ≤ 2 α M 1 1+ λ  log(2 |H| /δ ) n  λ 1+ λ . b y Union b ound,The probabilit y that all bounds hold is at least 1 − |H| · δ |H| = 1 − δ . Th us: sup h ∈H  R α ( h ) − b R α ( h )  ≤ 2 α M 1 1+ λ  log(2 |H| /δ ) n  λ 1+ λ . Corollary B.5. (Uniform Absolute Err or) With pr ob ability at le ast 1 − δ , for al l h ∈ H :    R α ( h ) − b R α ( h )    ≤ 2 α M 1 1+ λ  log(4 |H| /δ ) n  λ 1+ λ . Pr o of. of result B.5 Com bine the b oth directions to get the absolute error as follow:    R α ( h ) − b R α ( h )    = max n R α ( h ) − b R α ( h ) , b R α ( h ) − R α ( h ) o . 
Use Prop osition B.2 and Proposition B.4 with δ ′ = δ / (2 |H| ) , so log (2 /δ ′ ) = log(4 |H | /δ ) . Both directions hold with probabilit y at least 1 − δ . Lemma B.6. Exc ess Risk Bound Under Assumption 1, with pr ob ability at le ast 1 − δ , R α ( b h ) − R α ( h ∗ ) ≤ 4 α M 1 1+ λ  log(4 |H| /δ ) n  λ 1+ λ . Pr o of. of Theorem 3.1,In particular result 4 R α ( b h ) − R α ( h ∗ ) = h R α ( b h ) − b R α ( b h ) i + h b R α ( b h ) − b R α ( h ∗ ) i + h b R α ( h ∗ ) − R α ( h ∗ ) i . (34) W e bound the middle term b R α ( b h ) ≤ b R α ( h ∗ ) as b elow: R α ( b h ) − R α ( h ∗ ) ≤    R α ( b h ) − b R α ( b h )    +    R α ( h ∗ ) − b R α ( h ∗ )    (35) Apply result 2 to get R α ( b h ) − R α ( h ∗ ) ≤ 2 · 2 α M 1 1+ λ  log(4 |H| /δ ) n  λ 1+ λ . B.3 CV aR Generalization for Infinite Classes W e will b e pro ving Theorem 3.2, in particular result 5 in this subsection. Let θ ∗ b e an optimal CV aR threshold for h ∗ : R α ( h ∗ ) = θ ∗ + 1 α E [( ℓ ( h ∗ , Z ) − θ ∗ ) + ] . Define the exc ess loss function f h ( Z ) = ( ℓ ( h, Z ) − θ ∗ ) + − ( ℓ ( h ∗ , Z ) − θ ∗ ) + . 18 F or truncation level B > 0 , define the trunc ate d exc ess loss f B h ( Z ) = [( ℓ ( h, Z ) − θ ∗ ) + ] B − [( ℓ ( h ∗ , Z ) − θ ∗ ) + ] B , where [ x ] B = min( x, B ) . Pr o of. By definition of R α ( h ) as an infim um, R α ( h ) ≤ θ ∗ + 1 α E [( ℓ ( h, Z ) − θ ∗ ) + ] for the optimal threshold θ ∗ of h ∗ . Since R α ( h ∗ ) = θ ∗ + 1 α E [( ℓ ( h ∗ , Z ) − θ ∗ ) + ] , w e ha ve R α ( h ) − R α ( h ∗ ) ≤ 1 α E [ f h ( Z )] . (36) Fix B > 0 . By the triangle inequalit y , | f h ( Z ) − f B h ( Z ) | ≤   ( ℓ ( h, Z ) − θ ∗ ) + − [( ℓ ( h, Z ) − θ ∗ ) + ] B   +   ( ℓ ( h ∗ , Z ) − θ ∗ ) + − [( ℓ ( h ∗ , Z ) − θ ∗ ) + ] B   . F or x ≥ 0 , w e hav e | x − [ x ] B | = x · 1 { x > B } . Thus, E   ( ℓ ( h, Z ) − θ ∗ ) + − [( ℓ ( h, Z ) − θ ∗ ) + ] B   = E  ( ℓ ( h, Z ) − θ ∗ ) + · 1 { ( ℓ ( h, Z ) − θ ∗ ) + > B }  ≤ E  ℓ ( h, Z ) · 1 { ℓ ( h, Z ) > B }  . W rite E [ ℓ ( h, Z ) · 1 { ℓ ( h, Z ) > B } ] = E [ ℓ ( h, Z ) 1+ λ · ℓ ( h, Z ) − λ · 1 { ℓ ( h, Z ) > B } ] . By Hölder’s inequality with exponents (1 + λ, 1+ λ λ ) : E [ ℓ ( h, Z ) 1+ λ · ℓ ( h, Z ) − λ · 1 { ℓ ( h, Z ) > B } ] ≤  E [ ℓ ( h, Z ) 1+ λ ]  1 / (1+ λ ) ·  E [ 1 { ℓ ( h, Z ) > B } ]  λ/ (1+ λ ) =  E [ ℓ ( h, Z ) 1+ λ ]  1 / (1+ λ ) · P ( ℓ ( h, Z ) > B ) λ/ (1+ λ ) . By Mark ov’s inequality , P ( ℓ ( h, Z ) > B ) ≤ E [ ℓ ( h, Z ) 1+ λ ] B 1+ λ . Therefore, E [ ℓ ( h, Z ) · 1 { ℓ ( h, Z ) > B } ] ≤  E [ ℓ ( h, Z ) 1+ λ ]  1 / (1+ λ ) ·  E [ ℓ ( h, Z ) 1+ λ ] B 1+ λ  λ/ (1+ λ ) = E [ ℓ ( h, Z ) 1+ λ ] B λ ≤ M B λ . The same b ound holds for h ∗ , so E | f h ( Z ) − f B h ( Z ) | ≤ 2 M B λ . (37) Com bining (36) and (37): R α ( h ) − R α ( h ∗ ) ≤ 1 α  E [ f B h ( Z )] + 2 M B λ  . (38) 19 Since E [ f B h ( Z )] ≥ 0 , V ar( f B h ( Z )) ≤ E [( f B h ( Z )) 2 ] . Using ( a − b ) 2 ≤ 2 a 2 + 2 b 2 : ( f B h ( Z )) 2 ≤ 2  [( ℓ ( h, Z ) − θ ∗ ) + ] B  2 + 2  [( ℓ ( h ∗ , Z ) − θ ∗ ) + ] B  2 . Since [( ℓ ( h, Z ) − θ ∗ ) + ] B ≤ B ,  [( ℓ ( h, Z ) − θ ∗ ) + ] B  2 ≤ B · [( ℓ ( h, Z ) − θ ∗ ) + ] B ≤ B · ℓ ( h, Z ) . F or x ≥ 0 , B x ≤ B 1 − λ x 1+ λ (b y AM-GM or Y oung’s inequalit y). Thus, B · ℓ ( h, Z ) ≤ B 1 − λ · ℓ ( h, Z ) 1+ λ . T aking expectations: E  [( ℓ ( h, Z ) − θ ∗ ) + ] B  2  ≤ B 1 − λ E [ ℓ ( h, Z ) 1+ λ ] ≤ M B 1 − λ . Similarly for h ∗ , so V ar( f B h ( Z )) ≤ 4 M B 1 − λ . (39) Also, | f B h ( Z ) | ≤ 2 B uniformly . 
Define $\mathcal{F}_B = \{ f_h^B : h \in \mathcal{H} \}$, a class with pseudo-dimension $O(d)$, envelope $2B$, and variance bound $4MB^{1-\lambda}$. By variance-sensitive VC theory (the local Rademacher complexity bounds of Bartlett et al. [2005]), for $\delta \in (0,1)$, with probability at least $1-\delta$, uniformly over all $h\in\mathcal{H}$,
$$\mathbb{E}[f_h^B(Z)] - \frac{1}{n}\sum_{i=1}^n f_h^B(Z_i) \le C\left( \sqrt{\frac{MB^{1-\lambda}(d\log n + \log(1/\delta))}{n}} + \frac{B(d\log n + \log(1/\delta))}{n} \right).$$
Let $D = d\log n + \log(1/\delta)$. If $\mathbb{E}[f_{\widehat h}^B(Z)] > C\left( \sqrt{MB^{1-\lambda}D/n} + BD/n \right)$, the empirical average $\frac{1}{n}\sum_i f_{\widehat h}^B(Z_i)$ would be positive, contradicting the fact that $\widehat h$ minimizes the empirical risk. Therefore,
$$\mathbb{E}[f_{\widehat h}^B(Z)] \le C\left( \sqrt{\frac{MB^{1-\lambda}D}{n}} + \frac{BD}{n} \right).$$
We now optimize the truncation level $B$ by balancing the main terms:
$$\sqrt{\frac{MB^{1-\lambda}}{n}} \asymp \frac{M}{B^\lambda}.$$
This gives $M^{1/2}B^{(1-\lambda)/2} n^{-1/2} \asymp M B^{-\lambda}$, so $B^{(1-\lambda)/2+\lambda} = B^{(1+\lambda)/2} \asymp M^{1/2} n^{1/2}$, and therefore
$$B \asymp (Mn)^{1/(1+\lambda)} \asymp M^{1/(1+\lambda)} \left( \frac{n}{D} \right)^{1/(1+\lambda)},$$
incorporating the logarithmic factor. At this optimal $B$, using $1 - \frac{1-\lambda}{1+\lambda} = \frac{2\lambda}{1+\lambda}$ and $1 + \frac{1-\lambda}{1+\lambda} = \frac{2}{1+\lambda}$,
$$\sqrt{\frac{MB^{1-\lambda}D}{n}} = \sqrt{ M^{1+\frac{1-\lambda}{1+\lambda}} \left( \frac{D}{n} \right)^{1-\frac{1-\lambda}{1+\lambda}} } = M^{\frac{1}{1+\lambda}} \left( \frac{D}{n} \right)^{\frac{\lambda}{1+\lambda}}.$$
Similarly,
$$\frac{M}{B^\lambda} = \frac{M}{M^{\lambda/(1+\lambda)}(n/D)^{\lambda/(1+\lambda)}} = M^{\frac{1}{1+\lambda}} \left( \frac{D}{n} \right)^{\frac{\lambda}{1+\lambda}},$$
and the linear term $BD/n = M^{1/(1+\lambda)}(D/n)^{\lambda/(1+\lambda)}$ is of the same order, so it is absorbed into the constant. Therefore,
$$R_\alpha(\widehat h) - R_\alpha(h^*) \le \frac{C_\lambda}{\alpha} M^{\frac{1}{1+\lambda}} \left( \frac{d\log n + \log(1/\delta)}{n} \right)^{\frac{\lambda}{1+\lambda}}. \quad (40)$$
This proves Theorem 3.2.
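The bias-variance balancing above is easy to mirror numerically. The following small helper is a sketch with all constants suppressed; the function name and inputs are ours, not part of the formal development. It returns the optimal truncation level and the predicted excess-risk rate of Theorem 3.2.

```python
import numpy as np

def truncation_and_rate(M, lam, n, d, delta):
    """Balance the truncation bias M / B^lam against the variance term
    sqrt(M B^{1-lam} D / n) with D = d log n + log(1/delta); this gives
    B ~ (M n / D)^{1/(1+lam)} and the rate M^{1/(1+lam)} (D/n)^{lam/(1+lam)}."""
    D = d * np.log(n) + np.log(1.0 / delta)
    B = (M * n / D) ** (1.0 / (1.0 + lam))
    rate = M ** (1.0 / (1.0 + lam)) * (D / n) ** (lam / (1.0 + lam))
    return B, rate

print(truncation_and_rate(M=1.0, lam=0.5, n=10_000, d=10, delta=0.05))
```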
B.3.1 Lower Bound for i.i.d. Data

Proof of Theorem 3.2, in particular result 6. The proof sketch is as follows. We construct a finite family $\{P_v\}_{v\in\mathcal{V}}$, $|\mathcal{V}| = N$, of distributions in $\mathcal{P}(M,\lambda)$, together with hypotheses $\{h_v\}_{v\in\mathcal{V}} \subset \mathcal{H}$, such that:
1. For each $v \ne u$, the excess CVaR of $h_u$ under $P_v$ is at least a positive gap $\Delta$, which we express in terms of $p, t, M, \lambda, \alpha$.
2. The pairwise KL divergences $\mathrm{KL}(P_v \| P_u)$ are $\lesssim p$ (a constant times $p$), uniformly in $v, u$.
3. We apply Fano's inequality, with the mutual information bounded via averaging of the pairwise KLs.
Combining these yields the desired minimax lower bound after enforcing the moment constraint, which links $t$ and $p$.

1. Hypercube construction. Let $m = \lceil \log_2 N \rceil$. We construct a packing set $\mathcal{V} \subset \{0,1\}^m$ of size $N$: we select $N$ binary vectors $v$ of equal Hamming weight $m/2$ (assuming $m$ even; odd $m$ requires trivial adjustments) whose pairwise Hamming distances satisfy $d_H(u,v) \ge m/4$ for all distinct $u,v\in\mathcal{V}$. The existence of such a set is guaranteed by the Gilbert-Varshamov bound: the volume of a Hamming ball of radius $m/8$ is exponentially smaller than the number $\binom{m}{m/2}$ of constant-weight vectors, allowing us to pack $N$ such vectors for $m \approx \log_2 N$. For any distinct $u, v \in \mathcal{V}$, let $a_{vu} := \#\{j : v_j = 1, u_j = 0\}$. Since both vectors have weight $m/2$,
$$a_{vu} = \#\{j : v_j = 0, u_j = 1\} = \frac{d_H(u,v)}{2} \ge \frac{m}{8}.$$
We now define a well-defined probability distribution on $\mathcal{Z}$ for each $v \in \mathcal{V}$.

2. Distribution and hypothesis definitions. Fix a constant $\theta \in (0, 1/4)$. Let the sample space be $\mathcal{Z} = \{0, 1, \dots, m\}$, let $p \in (0, \alpha/2]$ be a probability parameter, and let $t > 0$ be a loss magnitude. For each $v \in \mathcal{V}$, define the distribution $P_v$ on $\mathcal{Z}$:
$$P_v(0) = 1 - p, \qquad P_v(j) = \frac{p}{m}\left( 1 + 2\theta(v_j - 1/2) \right), \quad j = 1,\dots,m.$$
Note that $\sum_{j=1}^m P_v(j) = p$ because $\sum_j v_j = m/2$, so $P_v$ is a valid distribution. Define the hypotheses $\{h_v\}_{v\in\mathcal{V}}$ with loss function $\ell$:
$$\ell(h_v, 0) = 0, \qquad \ell(h_v, j) = t\,(1 - v_j), \quad j = 1,\dots,m.$$
The hypothesis $h_v$ suffers loss $t$ on atom $j$ if and only if $v_j = 0$.

3. CVaR gap analysis. The probability that $h_u$ incurs a nonzero loss under $P_v$ is
$$\pi_{v,u} := \sum_{j=1}^m P_v(j)(1 - u_j).$$
Since $P_v(j) \le \frac{p}{m}(1+\theta)$ and $\sum_j (1-u_j) = m/2$, we have $\pi_{v,u} \le \frac{p}{2}(1+\theta)$, and since $\theta < 1/4$, $\pi_{v,u} < p$. We enforce $p \le \alpha/2$, so the probability of a nonzero loss is strictly less than $\alpha$; hence the $(1-\alpha)$-quantile is $0$. In this regime, CVaR is simply the expected loss scaled by $1/\alpha$:
$$R_\alpha(h_u; P_v) = \frac{1}{\alpha}\mathbb{E}_{P_v}[\ell(h_u, Z)].$$

Lemma B.7 (Expected losses). For any $v, u \in \mathcal{V}$:
1. If $u = v$: $\mathbb{E}_{P_v}[\ell(h_v, Z)] = \frac{tp}{2}(1 - \theta)$.
2. If $u \ne v$: $\mathbb{E}_{P_v}[\ell(h_u, Z)] \ge \frac{tp}{2}\left(1 - \frac{\theta}{2}\right)$.

Proof. For the first case, sum over the indices where $v_j = 0$. For the second case, we use the Hamming separation: the term $\sum_{j=1}^m (v_j - 1/2)(1 - u_j)$ evaluates to $a_{vu} - m/4$. Since $a_{vu} \ge m/8$, this sum is lower bounded by $-m/8$. Substituting into the expectation formula yields the result.

Theorem B.8 (CVaR separation). For any distinct $v, u \in \mathcal{V}$,
$$\Delta := R_\alpha(h_u; P_v) - R_\alpha(h_v; P_v) \ge \frac{\theta}{4} \cdot \frac{tp}{\alpha}.$$

Proof. Subtracting the expectations from Lemma B.7 and dividing by $\alpha$:
$$\Delta \ge \frac{1}{\alpha}\left( \frac{tp}{2}\left(1-\frac{\theta}{2}\right) - \frac{tp}{2}(1-\theta) \right) = \frac{tp}{2\alpha}\cdot\frac{\theta}{2} = \frac{\theta t p}{4\alpha}.$$

4. Moment constraint. To ensure $P_v \in \mathcal{P}(M,\lambda)$ we require $\mathbb{E}[\ell^{1+\lambda}] \le M$. Since losses take values in $\{0, t\}$,
$$\mathbb{E}[\ell^{1+\lambda}] = t^{1+\lambda}\,\mathbb{P}(\ell = t) \le t^{1+\lambda} p.$$
We set $t^{1+\lambda} p = M$, i.e., $t = (M/p)^{1/(1+\lambda)}$. Substituting $t$ into the gap:
$$\Delta \ge \frac{\theta}{4\alpha} M^{1/(1+\lambda)} p^{\lambda/(1+\lambda)}.$$

5. KL divergence and Fano's inequality. For $v \ne u$, the KL divergence is bounded by the $\chi^2$-divergence:
$$\mathrm{KL}(P_v \| P_u) \le \sum_{z\in\mathcal{Z}} \frac{(P_v(z)-P_u(z))^2}{P_u(z)}.$$
Using the bounds on $P_v(j)$, we obtain
$$\mathrm{KL}(P_v\|P_u) \le \frac{4\theta^2}{1-\theta}\, p.$$
For $n$ i.i.d. samples, $\mathrm{KL}(P_v^{\otimes n} \| P_u^{\otimes n}) \le n\frac{4\theta^2}{1-\theta} p$. By Fano's inequality, to ensure the error probability $\mathbb{P}(\widehat V \ne V) \ge 1/2$ it suffices to bound the mutual information $I(V; X^n) \le \frac{1}{2}\log N$ (assuming $\log N \ge 2\log 2$). This is satisfied if
$$n\frac{4\theta^2}{1-\theta}p \le \frac{1}{8}\log N \iff p \le \frac{1-\theta}{32\theta^2}\cdot\frac{\log N}{n}.$$
We define $p$ to satisfy both the Fano condition and the CVaR-regime condition ($p \le \alpha/2$):
$$p := \min\left\{ \frac{\alpha}{2},\; \frac{1-\theta}{32\theta^2}\frac{\log N}{n} \right\}.$$
The sample-size condition $n \ge \frac{1-\theta}{16\theta^2}\frac{\log N}{\alpha}$ ensures that the second term is the minimum, so we set $p = \frac{1-\theta}{32\theta^2}\frac{\log N}{n}$.

6. Final lower bound. For any algorithm $\mathcal{A}$, let $\widehat V$ be the index of the hypothesis closest to $\mathcal{A}(S)$. Then
$$\sup_{P\in\mathcal{P}} \mathbb{E}[\text{excess risk}] \ge \mathbb{P}(\widehat V \ne V)\cdot\Delta \ge \frac{1}{2}\Delta.$$
Substituting the chosen $p$ into $\Delta$:
$$\frac{1}{2}\Delta = \frac{1}{2}\cdot\frac{\theta M^{1/(1+\lambda)}}{4\alpha}\left( \frac{1-\theta}{32\theta^2}\frac{\log N}{n} \right)^{\lambda/(1+\lambda)}.$$
Rearranging terms yields the claimed bound with $c(\theta,\lambda) = \frac{\theta}{8}\left( \frac{1-\theta}{32\theta^2} \right)^{\lambda/(1+\lambda)}$.

B.4 Truncated Empirical CVaR Estimator

We prove Theorem 3.3 in this subsection.

Proof. Let $R_\alpha^B(h)$ be the population CVaR defined on the truncated loss $\ell_B$. Since $\ell_B(h,z) \le \ell(h,z)$ pointwise, we have $R_\alpha^B(h) \le R_\alpha(h)$ for all $h$.
We decompose the excess risk as
$$R_\alpha(\widehat h_B) - R_\alpha(h^*) = \underbrace{\left( R_\alpha(\widehat h_B) - R_\alpha^B(\widehat h_B) \right)}_{\text{bias}\ \ge\ 0} + \underbrace{\left( R_\alpha^B(\widehat h_B) - R_\alpha^B(h^*) \right)}_{\text{estimation}} + \underbrace{\left( R_\alpha^B(h^*) - R_\alpha(h^*) \right)}_{\le\ 0}.$$
The third term is nonpositive, so we may drop it from the upper bound. The estimation term is bounded by $2\sup_{h\in\mathcal{H}} |R_\alpha^B(h) - \widehat R_\alpha^B(h)|$ via the usual ERM argument. Thus
$$R_\alpha(\widehat h_B) - R_\alpha(h^*) \le \sup_{h\in\mathcal{H}}\left( R_\alpha(h) - R_\alpha^B(h) \right) + 2\sup_{h\in\mathcal{H}}\left| R_\alpha^B(h) - \widehat R_\alpha^B(h) \right|. \quad (41)$$

Bias bound. We bound the error introduced by truncating the loss distribution. Using the variational definition and the fact that CVaR is $\frac{1}{\alpha}$-Lipschitz w.r.t. the $L_1$ norm,
$$\sup_{h\in\mathcal{H}}\left( R_\alpha(h) - R_\alpha^B(h) \right) \le \frac{1}{\alpha}\mathbb{E}[\ell(h,Z) - \ell_B(h,Z)] = \frac{1}{\alpha}\mathbb{E}[(\ell(h,Z)-B)\,\mathbf{1}(\ell(h,Z) > B)].$$
Using the integral identity $\mathbb{E}[X] = \int_0^\infty \mathbb{P}(X>t)\,dt$,
$$\mathbb{E}[(\ell - B)_+] = \int_B^\infty \mathbb{P}(\ell(h,Z) > t)\,dt. \quad (42)$$
By Markov's inequality on the $(1+\lambda)$-moment (Assumption 2.1), $\mathbb{P}(\ell > t) \le M/t^{1+\lambda}$. Integrating this tail,
$$\int_B^\infty \frac{M}{t^{1+\lambda}}\,dt = M\left[ \frac{t^{-\lambda}}{-\lambda} \right]_B^\infty = \frac{M}{\lambda B^\lambda}. \quad (43)$$
Thus we obtain the rigorous bias bound
$$\sup_{h\in\mathcal{H}}\left( R_\alpha(h) - R_\alpha^B(h) \right) \le \frac{M}{\alpha\lambda B^\lambda}. \quad (44)$$

Estimation error. We control the uniform deviation $Z_n = \sup_{h\in\mathcal{H}} |R_\alpha^B(h) - \widehat R_\alpha^B(h)|$. Define the function class associated with the variational CVaR objective:
$$\mathcal{G}_B = \left\{ z \mapsto \phi_{h,\theta}(z) := \theta + \tfrac{1}{\alpha}(\ell_B(h,z) - \theta)_+ \;\middle|\; h\in\mathcal{H},\ \theta\in[0,B] \right\}. \quad (45)$$
We apply Bousquet's concentration inequality for suprema of empirical processes; we first verify its conditions.
1. Uniform upper bound ($K$): for any $g\in\mathcal{G}_B$, since $\ell_B \in [0,B]$ and $\theta\in[0,B]$,
$$0 \le \phi_{h,\theta}(z) \le \sup_{\theta\in[0,B]}\left( \theta + \tfrac{1}{\alpha}(B-\theta) \right) = \frac{B}{\alpha}, \quad (46)$$
the maximum occurring at $\theta = 0$ since $\alpha \le 1$. Thus we set $K = B/\alpha$.
2. Variance bound ($\sigma^2$): the variance of $\phi_{h,\theta}(Z)$ depends only on the random part $\frac{1}{\alpha}(\ell_B(h,Z)-\theta)_+$:
$$\mathrm{Var}(\phi_{h,\theta}) = \frac{1}{\alpha^2}\mathrm{Var}\left( (\ell_B - \theta)_+ \right) \le \frac{1}{\alpha^2}\mathbb{E}\left[ (\ell_B-\theta)_+^2 \right] \le \frac{1}{\alpha^2}\mathbb{E}[\ell_B^2]. \quad (47)$$
Using the inequality $x^2 \le x^{1+\lambda}B^{1-\lambda}$ for $x\in[0,B]$ and Assumption 2.1,
$$\sigma_\mathcal{G}^2 = \sup_{g\in\mathcal{G}_B}\mathrm{Var}(g) \le \frac{MB^{1-\lambda}}{\alpha^2}. \quad (48)$$
By Bousquet's inequality, for any $\delta\in(0,1)$, with probability at least $1-\delta$,
$$Z_n \le \mathbb{E}[Z_n] + \sqrt{\frac{2\sigma_\mathcal{G}^2\log(1/\delta)}{n}} + \frac{K\log(1/\delta)}{3n}. \quad (49)$$
Bounding the expectation (Rademacher complexity): the class $\mathcal{G}_B$ involves a supremum over $\theta$. Since the objective is convex in $\theta$ and $1/\alpha$-Lipschitz in $\ell_B$, standard contraction results (e.g., Levy et al. [2020]) imply
$$\mathbb{E}[Z_n] \le 2\mathfrak{R}_n(\mathcal{G}_B) \le \frac{2}{\alpha}\mathfrak{R}_n(\ell\circ\mathcal{H}) + O(n^{-1/2}). \quad (50)$$
Substituting the bounds for $\sigma^2$ and $K$ into Bousquet's inequality,
$$Z_n \le \frac{2}{\alpha}\mathfrak{R}_n(\ell\circ\mathcal{H}) + \sqrt{\frac{2MB^{1-\lambda}\log(1/\delta)}{\alpha^2 n}} + \frac{B\log(1/\delta)}{3\alpha n}. \quad (51)$$
The total estimation-error contribution is $2Z_n$:
$$\mathrm{Est} \le \frac{4}{\alpha}\mathfrak{R}_n(\ell\circ\mathcal{H}) + \frac{1}{\alpha}\sqrt{\frac{8MB^{1-\lambda}\log(1/\delta)}{n}} + \frac{2B\log(1/\delta)}{3\alpha n}. \quad (52)$$
We minimize the total error bound (bias + estimation) with respect to $B$:
$$\mathrm{Error}(B) \approx \frac{M}{\alpha\lambda B^\lambda} + \frac{1}{\alpha}\sqrt{\frac{8MB^{1-\lambda}\log(1/\delta)}{n}}. \quad (53)$$
Balancing the orders of the bias ($B^{-\lambda}$) and the variance ($B^{(1-\lambda)/2}n^{-1/2}$) yields the optimal scaling
$$n \asymp B^{1+\lambda} \implies B = (Mn)^{\frac{1}{1+\lambda}}. \quad (54)$$
Define the rate factor $\Delta_n = M^{\frac{1}{1+\lambda}} n^{-\frac{\lambda}{1+\lambda}}$. Substituting $B$ back into the three terms:
1. Bias term:
$$\frac{M}{\alpha\lambda(Mn)^{\frac{\lambda}{1+\lambda}}} = \frac{1}{\alpha\lambda}\Delta_n.$$
2. Variance term (Bousquet, main):
$$\frac{1}{\alpha}\sqrt{8\log(1/\delta)}\sqrt{\frac{M(Mn)^{\frac{1-\lambda}{1+\lambda}}}{n}} = \frac{\sqrt{8\log(1/\delta)}}{\alpha}\Delta_n.$$
3. Variance term (Bousquet, linear): since $\frac{B}{n} = \frac{(Mn)^{\frac{1}{1+\lambda}}}{n} = \Delta_n$,
$$\frac{2B\log(1/\delta)}{3\alpha n} = \frac{2\log(1/\delta)}{3\alpha}\Delta_n.$$
Summing the coefficients gives the final constant $C_{\lambda,\delta}$:
$$C_{\lambda,\delta} = \frac{1}{\lambda} + \sqrt{8\log(1/\delta)} + \frac{2}{3}\log(1/\delta). \quad (55)$$
This completes the proof of Theorem 3.3.

B.5 Generalization with Dependent Data

Proof of the upper bound in Theorem 3.4. For any $h$, by the variational form of CVaR and the Lipschitz property $|R_\alpha(X) - R_\alpha(Y)| \le \frac{1}{\alpha}\mathbb{E}|X-Y|$ (Lemma G.8),
$$\sup_h |\widehat R_\alpha(h) - R_\alpha(h)| \le \frac{1}{\alpha}\sup_{h\in\mathcal{H},\,\theta\in\mathbb{R}} \left| \widehat L_\alpha(h,\theta) - L_\alpha(h,\theta) \right|, \quad (56)$$
where $\widehat L_\alpha(h,\theta) = \frac{1}{n}\sum_{i=1}^n (\ell(h,Z_i)-\theta)_+$ and $L_\alpha(h,\theta) = \mathbb{E}[(\ell(h,Z)-\theta)_+]$. From the moment condition, any population CVaR minimizer $\theta_h^*$ satisfies $0 \le \theta_h^* \le R := M^{1/(1+\lambda)}/\alpha$ (Theorem G.1), hence we may restrict $\theta$ to $\Theta = [0, R]$. Fix $B > 0$ and define the truncated loss class
$$\mathcal{F}_B = \left\{ f_{h,\theta}^B(z) = [(\ell(h,z)-\theta)_+]\wedge B : h\in\mathcal{H},\ \theta\in\Theta \right\}. \quad (57)$$
Using the Markov and Hölder inequalities,
$$\sup_{h,\theta}\left| L_\alpha(h,\theta) - \mathbb{E}[f_{h,\theta}^B(Z)] \right| \le \frac{M}{B^\lambda}. \quad (58)$$
Thus
$$\sup_{h,\theta}|\widehat L_\alpha(h,\theta) - L_\alpha(h,\theta)| \le \sup_{f\in\mathcal{F}_B}\left| \frac{1}{n}\sum f(Z_i) - \mathbb{E}f \right| + \frac{2M}{B^\lambda}. \quad (59)$$
Assume the process $(Z_i)$ is $\beta$-mixing with exponential decay $\beta(k) \le \exp(-ck^\gamma)$ for some $\gamma > 0$. Choose $a_n \asymp \log n$ and $b_n \asymp (\log n)^{1/\gamma}$, and partition $\{1,\dots,n\}$ into $\mu_n \asymp n/\log n$ blocks of size $a_n$ separated by gaps of size $b_n$; let $N = \mu_n \asymp n/\log n$. Define the block sums $S_j(f) = \sum_{i\in B_j} f(Z_i)$. By Berbee's lemma, there exist independent blocks $\widetilde B_1,\dots,\widetilde B_N$ with the same marginals such that $\mathbb{P}((B_j) \ne (\widetilde B_j)) \le N\beta(b_n) \le n^{-O(1)}$. Hence, with probability $\ge 1 - n^{-O(1)}$,
$$\sup_{f\in\mathcal{F}_B}\left| \frac{1}{n}\sum f(Z_i) - \mathbb{E}f \right| \le \sup_{f\in\mathcal{F}_B}\left| \frac{1}{N}\sum_{j=1}^N\left( \frac{1}{a_n}\widetilde S_j(f) - \mathbb{E}f \right) \right|. \quad (60)$$
Define the centered block variables $\widetilde Z_j(f) = \frac{1}{a_n}\widetilde S_j(f) - \mathbb{E}f$. The $\widetilde Z_j(f)$ are i.i.d. across $j$, mean zero, bounded by $B$, and satisfy $\mathbb{E}[|\widetilde Z_j(f)|^{1+\lambda}] \le C_\lambda M$, since $0 \le f \le \ell$ pointwise, by Jensen's inequality combined with the von Bahr-Esseen inequality. The class $\mathcal{F}_B$ has pseudo-dimension $\mathrm{Pdim}(\mathcal{F}_B) \le C(d+1)$ (Lemma G.9). Applying the heavy-tailed uniform deviation inequality for i.i.d. data (Theorem 3.2) with envelope $B$ and moment bound $C_\lambda M$, we obtain that with probability $\ge 1-\delta$,
$$\sup_{f\in\mathcal{F}_B}\left| \frac{1}{N}\sum_{j=1}^N \widetilde Z_j(f) \right| \le C_\lambda M^{\frac{1}{1+\lambda}}\left( \frac{d\log N + \log(1/\delta)}{N} \right)^{\frac{\lambda}{1+\lambda}}. \quad (61)$$
Combining the bounds, with probability $\ge 1-\delta-n^{-O(1)}$,
$$\sup_h|\widehat R_\alpha(h) - R_\alpha(h)| \le \frac{1}{\alpha}\left[ C_\lambda M^{\frac{1}{1+\lambda}}\left( \frac{d\log n + \log(1/\delta)}{N} \right)^{\frac{\lambda}{1+\lambda}} + \frac{2M}{B^\lambda} \right]. \quad (62)$$
Choosing $B$ large enough that the truncation term matches the deviation term,
$$B \asymp M^{\frac{1}{1+\lambda}}\left( \frac{N}{d\log n + \log(1/\delta)} \right)^{\frac{1}{1+\lambda}}, \quad (63)$$
we have $2M/B^\lambda \asymp M^{\frac{1}{1+\lambda}}\big( (d\log n + \log(1/\delta))/N \big)^{\frac{\lambda}{1+\lambda}}$, and substituting back yields the dominant scaling
$$\sup_h|\widehat R_\alpha(h) - R_\alpha(h)| \le \frac{C_{\lambda,\gamma}}{\alpha} M^{\frac{1}{1+\lambda}}\left( \frac{d\log n + \log(1/\delta)}{N} \right)^{\frac{\lambda}{1+\lambda}}. \quad (64)$$
Recalling $N \asymp n/\log n$ gives the first claim. For the empirical CVaR minimizer $\widehat h$, the usual ERM argument gives
$$R_\alpha(\widehat h) - R_\alpha(h^*) \le 2\sup_h |\widehat R_\alpha(h) - R_\alpha(h)|, \quad (65)$$
which yields the second inequality.

Proof of the lower bound in Theorem 3.4. Let $\{Z_i\}_{i=1}^n$ be drawn from any $P \in \mathcal{P}(\lambda, M, \beta)$. Using a blocking scheme as in the upper bound, partition $\{1,\dots,n\}$ into $\mu_n$ disjoint blocks of size $a_n \asymp \log n$, separated by gaps of length $b_n \asymp (\log n)^{1/\gamma}$, and let $N := \mu_n \asymp n/\log n$ denote the number of retained blocks. By Berbee's coupling,
there exist independent blocks $\widetilde B_1, \dots, \widetilde B_N$ with the same marginals as the original blocks such that
$$\mathbb{P}\left( (B_1,\dots,B_N) \ne (\widetilde B_1,\dots,\widetilde B_N) \right) \le N\beta(b_n) \le n^{-A}$$
for any fixed $A > 0$ and all sufficiently large $n$. Therefore, for minimax lower bounds it suffices to work with the independent-block model: any estimator based on the original data induces an estimator based on $(\widetilde B_1,\dots,\widetilde B_N)$ whose risk differs by at most $n^{-A}\sup_h R_\alpha(h)$, which is negligible compared to the target rate. Henceforth we assume we observe $N$ i.i.d. samples.

Since $\mathrm{VCdim}(\mathcal{H}) \ge d$, there exist points $z_1,\dots,z_d \in \mathcal{Z}$ and hypotheses $h_1,\dots,h_d \in \mathcal{H}$ shattered by $\mathcal{H}$. Using the same construction as in the proof of Section B.3.1, we build a family of $2^d$ distributions $\{P_v : v\in\{0,1\}^d\}$ and hypotheses $\{h_v : v\in\{0,1\}^d\}$ such that:
1. For all $u \ne v$, $R_\alpha(h_u; P_v) - R_\alpha(h_v; P_v) \ge \frac{\theta}{4\alpha} t p$.
2. The heavy-tail condition is satisfied: $\sup_{h\in\mathcal{H}} \mathbb{E}_{P_v}[\ell(h,Z)^{1+\lambda}] \le t^{1+\lambda} p \le M$.
3. The pairwise Kullback-Leibler divergences are controlled: $\mathrm{KL}(P_v\|P_u) \le C_\theta p$ for all $u\ne v$, where $C_\theta > 0$ depends only on $\theta$.
Let $V$ be uniformly distributed over $\{0,1\}^d$ and let $X^N = (Z_1,\dots,Z_N)$ be drawn from $P_V^{\otimes N}$. By the chain rule and the KL bound,
$$I(V; X^N) \le \frac{1}{2^d}\sum_{u,v}\mathrm{KL}\left( P_v^{\otimes N}\,\middle\|\,P_u^{\otimes N} \right) \le N\max_{u\ne v}\mathrm{KL}(P_v\|P_u) \le C_\theta N p.$$
Choose $p = c_1\frac{d}{N}$ with $c_1 > 0$ sufficiently small so that $I(V;X^N) \le \frac{d}{8}$. By Fano's inequality,
$$\inf_{\widehat V}\mathbb{P}(\widehat V \ne V) \ge 1 - \frac{I(V;X^N) + \log 2}{\log(2^d)} \ge \frac{1}{2}$$
for all sufficiently large $d$. Let $\widehat h = \widehat h(X^N)$ be any estimator and define $\widehat V$ by $\widehat h = h_{\widehat V}$ (ties broken arbitrarily). Then
$$\sup_{P\in\mathcal{P}}\mathbb{E}_P\left[ R_\alpha(\widehat h) - R_\alpha(h_P^*) \right] \ge \frac{1}{2^d}\sum_v \mathbb{E}_{P_v}\left[ R_\alpha(h_{\widehat V}; P_v) - R_\alpha(h_v; P_v) \right] \ge \mathbb{P}(\widehat V\ne V)\cdot\frac{\theta}{4\alpha}tp \ge \frac{\theta}{8\alpha}tp.$$
Finally, enforce the moment constraint with equality: $t = (M/p)^{\frac{1}{1+\lambda}}$. Substituting gives
$$\sup_{P\in\mathcal{P}}\mathbb{E}_P\left[ R_\alpha(\widehat h) - R_\alpha(h_P^*) \right] \ge \frac{c}{\alpha}M^{\frac{1}{1+\lambda}}p^{\frac{\lambda}{1+\lambda}} = \frac{c}{\alpha}M^{\frac{1}{1+\lambda}}\left( \frac{d}{N} \right)^{\frac{\lambda}{1+\lambda}}.$$
Recalling $N \asymp n/\log n$ completes the proof.
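The blocking scheme used in both directions of Theorem 3.4 is simple to write down explicitly. The sketch below is illustrative (the index conventions and rounding are ours): it returns the retained blocks of size $a_n \asymp \log n$ separated by gaps of size $b_n \asymp (\log n)^{1/\gamma}$, so that $N \asymp n/\log n$ blocks are kept.

```python
import numpy as np

def blocked_indices(n, gamma=1.0):
    """Blocks of length ~log n separated by gaps of length ~(log n)^{1/gamma};
    Berbee's coupling lets the retained blocks be treated as independent."""
    a = max(1, int(np.log(n)))                    # block length a_n
    b = max(1, int(np.log(n) ** (1.0 / gamma)))   # gap length b_n
    blocks, i = [], 0
    while i + a <= n:
        blocks.append(np.arange(i, i + a))
        i += a + b
    return blocks

print(len(blocked_indices(100_000)))  # N ~ n / log n retained blocks
```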
C Random Active-Set Theory and Uniform Bahadur-Kiefer Expansions for CVaR

Setup and notation. Let $Z\sim P$ on $(\mathcal{Z},\mathcal{A})$ and let $\ell : \mathcal{H}\times\mathcal{Z}\to\mathbb{R}_+$ be measurable. Fix $\alpha\in(0,1)$. For each $h\in\mathcal{H}$ define the nonnegative loss random variable $X_h := \ell(h,Z) \ge 0$. For a scalar threshold $\theta\in\mathbb{R}$, define the Rockafellar-Uryasev (RU) lift
$$\Phi(h,\theta) := \theta + \frac{1}{\alpha}\mathbb{E}\left[ (X_h-\theta)_+ \right]. \quad (66)$$
Given i.i.d. data $Z_1,\dots,Z_n\sim P$ and the empirical measure $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{Z_i}$, let $X_{h,i} := \ell(h,Z_i)$ and define the empirical RU lift
$$\widehat\Phi_n(h,\theta) := \theta + \frac{1}{\alpha n}\sum_{i=1}^n (X_{h,i}-\theta)_+. \quad (67)$$
The population and empirical CVaR objectives are
$$R_\alpha(h) := \inf_{\theta\in\mathbb{R}}\Phi(h,\theta), \qquad \widehat R_{\alpha,n}(h) := \inf_{\theta\in\mathbb{R}}\widehat\Phi_n(h,\theta). \quad (68)$$
Endogenous threshold selections (minimal RU minimizers). Define the (minimal) population RU threshold and the (minimal) empirical RU threshold, respectively, by
$$\theta^\star(h) := \inf\left\{ \theta\in\mathbb{R} : P(X_h > \theta) \le \alpha \right\}, \qquad \widehat\theta_n(h) := \inf\left\{ \theta\in\mathbb{R} : P_n(X_h > \theta) \le \alpha \right\}. \quad (69)$$
It is standard that $\widehat\theta_n(h)$ is always a minimizer of $\theta\mapsto\widehat\Phi_n(h,\theta)$, and similarly $\theta^\star(h)$ is a minimizer of $\theta\mapsto\Phi(h,\theta)$; we use these minimal selections to keep the arguments deterministic and monotone.

Tail maps and empirical tail maps. For $h\in\mathcal{H}$ and $\theta\in\mathbb{R}$, define the (strict) tail probability and its empirical counterpart:
$$T_h(\theta) := P(X_h > \theta), \qquad \widehat T_{h,n}(\theta) := P_n(X_h > \theta) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_{h,i} > \theta\}. \quad (70)$$
Each $T_h(\cdot)$ and $\widehat T_{h,n}(\cdot)$ is nonincreasing and right-continuous. By construction,
$$\widehat T_{h,n}\left( \widehat\theta_n(h) \right) \le \alpha, \qquad \text{and if } \theta < \widehat\theta_n(h) \text{ then } \widehat T_{h,n}(\theta) > \alpha. \quad (71)$$
We now restate our assumptions to make this section self-contained.

Assumption A (uniform tail empirical-process control). There exists a function $\varepsilon_n : (0,1)\to(0,\infty)$ such that for all $\delta\in(0,1)$, with probability at least $1-\delta$,
$$\sup_{h\in\mathcal{H}}\sup_{\theta\in\mathbb{R}} \left| \widehat T_{h,n}(\theta) - T_h(\theta) \right| \le \varepsilon_n(\delta). \quad (72)$$
Define the corresponding event
$$\mathcal{E}_1(\delta) := \left\{ \sup_{h\in\mathcal{H}}\sup_{\theta\in\mathbb{R}}\left| \widehat T_{h,n}(\theta) - T_h(\theta) \right| \le \varepsilon_n(\delta) \right\}. \quad (73)$$
Assumption B (local quantile margin at level $\alpha$). There exist constants $u_0 > 0$, $\kappa \ge 1$, and $0 < c_- \le c_+ < \infty$ such that for all $h\in\mathcal{H}$ and all $u\in(0,u_0]$,
$$\alpha + c_- u^\kappa \le T_h\left( \theta^\star(h) - u \right) \le \alpha + c_+ u^\kappa, \quad (74)$$
and
$$\alpha - c_+ u^\kappa \le T_h\left( \theta^\star(h) + u \right) \le \alpha - c_- u^\kappa. \quad (75)$$
This is the standard "two-sided quantile margin" formulation centered at the target level $\alpha$; it rules out arbitrarily flat tails at level $\alpha$ and, in particular, prevents the lower-deviation argument from failing in the presence of atoms at $\theta^\star(h)$.

Assumption C (uniform hinge empirical process at the population threshold). There exists a function $\eta_n : (0,1)\to(0,\infty)$ such that for all $\delta\in(0,1)$, with probability at least $1-\delta$,
$$\sup_{h\in\mathcal{H}}\left| (P_n - P)\left[ (X_h - \theta^\star(h))_+ \right] \right| \le \eta_n(\delta). \quad (76)$$
Define the corresponding event
$$\mathcal{E}_2(\delta) := \left\{ \sup_{h\in\mathcal{H}}\left| (P_n-P)\left[ (X_h-\theta^\star(h))_+ \right] \right| \le \eta_n(\delta) \right\}. \quad (77)$$
A convenient threshold-deviation envelope. Whenever $\varepsilon_n(\delta) \le (c_-/2)u_0^\kappa$, define
$$\Delta_n(\delta) := \left( \frac{2\varepsilon_n(\delta)}{c_-} \right)^{1/\kappa}. \quad (78)$$

Theorem C.1 (Uniform deviation of the empirical RU threshold). Fix $\alpha\in(0,1)$ and let $Z_1,\dots,Z_n \overset{\mathrm{iid}}{\sim} P$ with empirical measure $P_n$. Assume (72) and (74)-(75). If $\varepsilon_n(\delta) \le (c_-/2)u_0^\kappa$, then with probability at least $1-\delta$,
$$\sup_{h\in\mathcal{H}}\left| \widehat\theta_n(h) - \theta^\star(h) \right| \le \Delta_n(\delta) = \left( \frac{2\varepsilon_n(\delta)}{c_-} \right)^{1/\kappa}. \quad (79)$$

Proof. Fix $h\in\mathcal{H}$ and abbreviate $\theta^\star := \theta^\star(h)$, $\widehat\theta := \widehat\theta_n(h)$, $T(\theta) := T_h(\theta)$, $\widehat T(\theta) := \widehat T_{h,n}(\theta)$. Work on the event $\mathcal{E}_1(\delta)$, so that $\sup_{\theta\in\mathbb{R}}|\widehat T(\theta) - T(\theta)| \le \varepsilon_n(\delta)$. Recall the minimality characterization (71).

Upper deviation: $\widehat\theta \le \theta^\star + u$. Let $u\in(0,u_0]$ and suppose $c_- u^\kappa \ge 2\varepsilon_n(\delta)$. By the right-side margin bound (75), $T(\theta^\star + u) \le \alpha - c_- u^\kappa$. On $\mathcal{E}_1(\delta)$,
$$\widehat T(\theta^\star + u) \le T(\theta^\star+u) + \varepsilon_n(\delta) \le \alpha - c_-u^\kappa + \varepsilon_n(\delta) \le \alpha - \varepsilon_n(\delta) \le \alpha.$$
Hence $\widehat T(\theta^\star+u) \le \alpha$, and minimality (71) gives $\widehat\theta \le \theta^\star + u$.

Lower deviation: $\widehat\theta \ge \theta^\star - u$. Let $u\in(0,u_0]$ and suppose $c_-u^\kappa \ge 2\varepsilon_n(\delta)$. By the left-side margin bound (74), $T(\theta^\star - u) \ge \alpha + c_-u^\kappa$. On $\mathcal{E}_1(\delta)$,
$$\widehat T(\theta^\star-u) \ge T(\theta^\star-u) - \varepsilon_n(\delta) \ge \alpha + c_-u^\kappa - \varepsilon_n(\delta) \ge \alpha + \varepsilon_n(\delta) > \alpha.$$
Therefore $\widehat T(\theta^\star-u) > \alpha$, and minimality (71) implies $\widehat\theta \ge \theta^\star - u$.

Choice of $u$ and uniformization. Choose $u := \Delta_n(\delta) = (2\varepsilon_n(\delta)/c_-)^{1/\kappa}$. Under $\varepsilon_n(\delta) \le (c_-/2)u_0^\kappa$ we have $u \le u_0$ and $c_-u^\kappa = 2\varepsilon_n(\delta)$. The two deviation bounds yield $|\widehat\theta - \theta^\star| \le u$. Since the argument holds for each $h$ on $\mathcal{E}_1(\delta)$, $\sup_{h\in\mathcal{H}} |\widehat\theta_n(h) - \theta^\star(h)| \le \Delta_n(\delta)$ on $\mathcal{E}_1(\delta)$. Finally, $\mathbb{P}(\mathcal{E}_1(\delta)) \ge 1-\delta$ by (72).
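The minimal empirical RU threshold in (69) is an order statistic, which makes Theorem C.1 easy to probe by simulation. The sketch below is an illustration under an assumed lognormal loss (so the margin holds with $\kappa = 1$); the distribution and sample sizes are placeholder choices.

```python
import numpy as np

def empirical_ru_threshold(losses, alpha):
    """Minimal empirical RU threshold: the smallest theta with
    P_n(X > theta) <= alpha, i.e. the ceil((1-alpha) n)-th order statistic."""
    x = np.sort(np.asarray(losses, dtype=float))
    k = int(np.ceil((1.0 - alpha) * len(x)))
    return x[k - 1]

rng = np.random.default_rng(1)
alpha = 0.1
theta_star = np.exp(1.2816)  # 90% quantile of the standard lognormal
devs = [abs(empirical_ru_threshold(rng.lognormal(size=4000), alpha) - theta_star)
        for _ in range(200)]
print(np.mean(devs))  # shrinks like n^{-1/2}, i.e. Delta_n with kappa = 1
```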
We now provide the proof of Theorem 3.8, restating the theorem first.

Theorem (Restatement of Theorem 3.8). Fix $\alpha\in(0,1)$ and let $Z_1,\dots,Z_n \overset{\mathrm{iid}}{\sim} P$ with empirical measure $P_n$. Assume (72), (74)-(75), and (76), and assume further that $\varepsilon_n(\delta/2) \le (c_-/2)u_0^\kappa$. Then, with probability at least $1-\delta$,
$$\sup_{h\in\mathcal{H}}\left| \widehat R_{\alpha,n}(h) - R_\alpha(h) - \frac{1}{\alpha}(P_n-P)\left[ (X_h-\theta^\star(h))_+ \right] - \frac{1}{\alpha}\left( \widehat\theta_n(h)-\theta^\star(h) \right)\left( \alpha - P(X_h > \theta^\star(h)) \right) \right| \le \frac{C_1}{\alpha}\,\varepsilon_n(\delta/2)^{\frac{\kappa+1}{\kappa}}, \quad (80)$$
where one may take, for instance,
$$C_1 := \left( \frac{2}{c_-} \right)^{1/\kappa} + \frac{c_+}{\kappa+1}\left( \frac{2}{c_-} \right)^{\frac{\kappa+1}{\kappa}}. \quad (81)$$
In particular, on the same event,
$$\sup_{h\in\mathcal{H}}\left| \widehat R_{\alpha,n}(h) - R_\alpha(h) \right| \le \frac{1}{\alpha}\eta_n(\delta/2) + \frac{1}{\alpha}\left( \varepsilon_n(\delta/2)\Delta_n(\delta/2) + \frac{c_+}{\kappa+1}\Delta_n(\delta/2)^{\kappa+1} \right). \quad (82)$$

Proof of Theorem 3.8. Fix $h\in\mathcal{H}$ and abbreviate $\theta^\star := \theta^\star(h)$, $\widehat\theta := \widehat\theta_n(h)$, $X := X_h$, $\Phi(\theta) := \Phi(h,\theta)$, $\widehat\Phi(\theta) := \widehat\Phi_n(h,\theta)$. By definition (68) and the fact that $\theta^\star$ and $\widehat\theta$ are minimizers, $R_\alpha(h) = \Phi(\theta^\star)$ and $\widehat R_{\alpha,n}(h) = \widehat\Phi(\widehat\theta)$. We begin from the algebraic decomposition
$$\widehat R_{\alpha,n}(h) - R_\alpha(h) = \widehat\Phi(\widehat\theta) - \Phi(\theta^\star) = \underbrace{\left( \widehat\Phi(\theta^\star) - \Phi(\theta^\star) \right)}_{(A)} + \underbrace{\left( \Phi(\widehat\theta) - \Phi(\theta^\star) \right)}_{(C)} + \underbrace{\left[ \widehat\Phi(\widehat\theta) - \widehat\Phi(\theta^\star) \right] - \left[ \Phi(\widehat\theta) - \Phi(\theta^\star) \right]}_{(B)}; \quad (83)$$
adding the three brackets indeed telescopes to $\widehat\Phi(\widehat\theta) - \Phi(\theta^\star)$.

Term (A): the leading empirical-process term. From (66)-(67),
$$(A) = \widehat\Phi(\theta^\star) - \Phi(\theta^\star) = \frac{1}{\alpha}(P_n - P)\left[ (X-\theta^\star)_+ \right]. \quad (84)$$

A hinge integral identity. For every $\theta\in\mathbb{R}$,
$$(X-\theta)_+ = \int_\theta^\infty \mathbf{1}\{X>s\}\,ds. \quad (85)$$
Consequently, for any $a, b\in\mathbb{R}$,
$$(X-a)_+ - (X-b)_+ = \int_a^b \mathbf{1}\{X>s\}\,ds, \quad (86)$$
where the integral is oriented, i.e., $\int_a^b = -\int_b^a$ when $a > b$. Applying (86) in (66) and (67) yields
$$\Phi(\widehat\theta) - \Phi(\theta^\star) = (\widehat\theta - \theta^\star) - \frac{1}{\alpha}\int_{\theta^\star}^{\widehat\theta} P(X>s)\,ds, \quad (87)$$
$$\widehat\Phi(\widehat\theta) - \widehat\Phi(\theta^\star) = (\widehat\theta - \theta^\star) - \frac{1}{\alpha}\int_{\theta^\star}^{\widehat\theta} P_n(X>s)\,ds. \quad (88)$$
Subtracting (87) from (88) gives the exact coupling identity
$$(B) = -\frac{1}{\alpha}\int_{\theta^\star}^{\widehat\theta}\left( P_n(X>s) - P(X>s) \right) ds. \quad (89)$$

Linearization of (C) and the endogenous-threshold correction. Write $T(s) := P(X>s)$. Starting from (87), add and subtract $T(\theta^\star)$ inside the integral:
$$\Phi(\widehat\theta) - \Phi(\theta^\star) = (\widehat\theta-\theta^\star) - \frac{1}{\alpha}\int_{\theta^\star}^{\widehat\theta}\left[ T(\theta^\star) + \left( T(s)-T(\theta^\star) \right) \right] ds = \frac{1}{\alpha}(\widehat\theta-\theta^\star)\left( \alpha - T(\theta^\star) \right) \underbrace{-\ \frac{1}{\alpha}\int_{\theta^\star}^{\widehat\theta}\left( T(s)-T(\theta^\star) \right) ds}_{=:\ R_{\mathrm{pop}}(h)}. \quad (90)$$
The first term in (90) is precisely the endogenous-threshold correction in (80).

Bounding the coupling remainder (B). Work on the event $\mathcal{E}_1(\delta/2)$ from (73). Then $\sup_{h,\theta}|\widehat T_{h,n}(\theta) - T_h(\theta)| \le \varepsilon_n(\delta/2)$, and from (89),
$$|(B)| \le \frac{1}{\alpha}\left| \int_{\theta^\star}^{\widehat\theta} \varepsilon_n(\delta/2)\,ds \right| = \frac{1}{\alpha}\,\varepsilon_n(\delta/2)\,|\widehat\theta-\theta^\star|. \quad (91)$$

Bounding the population remainder $R_{\mathrm{pop}}(h)$. Assume $\varepsilon_n(\delta/2) \le (c_-/2)u_0^\kappa$ and work on $\mathcal{E}_1(\delta/2)$. By Theorem C.1,
$$\sup_{h\in\mathcal{H}}|\widehat\theta_n(h) - \theta^\star(h)| \le \Delta_n(\delta/2) \le u_0. \quad (92)$$
Fix $h$ and consider any $s$ between $\theta^\star$ and $\widehat\theta$. By (74)-(75), the upper side of the margin condition gives $|T(s) - T(\theta^\star)| \le c_+|s-\theta^\star|^\kappa$. Therefore
$$|R_{\mathrm{pop}}(h)| \le \frac{1}{\alpha}\left| \int_{\theta^\star}^{\widehat\theta} c_+|s-\theta^\star|^\kappa\,ds \right| = \frac{c_+}{\alpha(\kappa+1)}\,|\widehat\theta-\theta^\star|^{\kappa+1}. \quad (93)$$

Assembling the expansion. Plug (84), (90), (91), and (93) into (83).
On $\mathcal{E}_1(\delta/2)$ we obtain
$$\left| \widehat R_{\alpha,n}(h) - R_\alpha(h) - \frac{1}{\alpha}(P_n-P)\left[ (X-\theta^\star)_+ \right] - \frac{1}{\alpha}(\widehat\theta-\theta^\star)\left( \alpha - P(X>\theta^\star) \right) \right| \le \frac{1}{\alpha}\left( \varepsilon_n(\delta/2)\,|\widehat\theta-\theta^\star| + \frac{c_+}{\kappa+1}|\widehat\theta-\theta^\star|^{\kappa+1} \right). \quad (94)$$
Uniformization and conversion to an $\varepsilon_n$-rate. Take $\sup_{h\in\mathcal{H}}$ in (94) and use (92):
$$\sup_{h\in\mathcal{H}}\left| \cdots \right| \le \frac{1}{\alpha}\left( \varepsilon_n(\delta/2)\Delta_n(\delta/2) + \frac{c_+}{\kappa+1}\Delta_n(\delta/2)^{\kappa+1} \right) \quad \text{on } \mathcal{E}_1(\delta/2).$$
Now substitute $\Delta_n(\delta/2) = (2\varepsilon_n(\delta/2)/c_-)^{1/\kappa}$:
$$\varepsilon_n(\delta/2)\Delta_n(\delta/2) = \left( \frac{2}{c_-} \right)^{1/\kappa}\varepsilon_n(\delta/2)^{\frac{\kappa+1}{\kappa}}, \qquad \Delta_n(\delta/2)^{\kappa+1} = \left( \frac{2}{c_-} \right)^{\frac{\kappa+1}{\kappa}}\varepsilon_n(\delta/2)^{\frac{\kappa+1}{\kappa}}.$$
This yields (80) with the constant $C_1$ in (81).

Deriving the "in particular" bound and the probability statement. Work on the intersection $\mathcal{E}_1(\delta/2)\cap\mathcal{E}_2(\delta/2)$. By (72), (76), and a union bound, $\mathbb{P}(\mathcal{E}_1(\delta/2)\cap\mathcal{E}_2(\delta/2)) \ge 1-\delta$. On $\mathcal{E}_2(\delta/2)$,
$$\sup_{h\in\mathcal{H}}\left| \frac{1}{\alpha}(P_n-P)\left[ (X_h-\theta^\star(h))_+ \right] \right| \le \frac{1}{\alpha}\eta_n(\delta/2).$$
Combining this with the uniform expansion bound gives (82).

D Functional Robustness

Proof of Proposition 4.1.

Wasserstein bound with moment control. Assume $(\mathcal{Z},d)$ is a metric space and that $z\mapsto\ell(h,z)$ is $\beta$-Hölder with constant $L_\beta$, uniformly in $h$. We use the Rockafellar-Uryasev representation
$$\mathrm{CVaR}_\alpha(X) = \inf_{\theta\in\mathbb{R}}\left\{ \theta + \frac{1}{\alpha}\mathbb{E}[(X-\theta)_+] \right\}.$$
Fix $h\in\mathcal{H}$ and $\theta\in\mathbb{R}$ and define $\varphi_{h,\theta}(z) := (\ell(h,z)-\theta)_+$. Since $x\mapsto(x-t)_+$ is $1$-Lipschitz on $\mathbb{R}$, the Hölder condition implies
$$|\varphi_{h,\theta}(Z) - \varphi_{h,\theta}(Z')| \le |\ell(h,Z)-\ell(h,Z')| \le L_\beta\,d(Z,Z')^\beta,$$
so $\varphi_{h,\theta}$ is $\beta$-Hölder with constant $L_\beta$ (uniformly in $h,\theta$). Let $\pi$ be any coupling of $(P,Q)$. Then
$$\left| \mathbb{E}_P[\varphi_{h,\theta}(Z)] - \mathbb{E}_Q[\varphi_{h,\theta}(Z')] \right| = \left| \mathbb{E}_\pi[\varphi_{h,\theta}(Z) - \varphi_{h,\theta}(Z')] \right| \le L_\beta\,\mathbb{E}_\pi[d(Z,Z')^\beta].$$
Take the infimum over couplings. For any $r\ge\beta$, Jensen's inequality yields
$$\inf_\pi \mathbb{E}_\pi[d(Z,Z')^\beta] \le \inf_\pi\left( \mathbb{E}_\pi[d(Z,Z')^r] \right)^{\beta/r} = W_r(P,Q)^\beta.$$
Hence, for every $\theta$,
$$\left| \left( \theta + \tfrac{1}{\alpha}\mathbb{E}_P[\varphi_{h,\theta}] \right) - \left( \theta + \tfrac{1}{\alpha}\mathbb{E}_Q[\varphi_{h,\theta}] \right) \right| \le \frac{L_\beta}{\alpha}\,W_r(P,Q)^\beta.$$
Taking the infimum over $\theta$ on both sides of the RU representation yields the claim.

Lévy-Prokhorov bound with moment control. Assume $\mathcal{Z} = \mathbb{R}^d$ with the Euclidean metric and $\pi_{\mathrm{LP}}(P,Q) \le \varepsilon$. By Strassen's theorem for the Prokhorov metric, there exists a coupling $(Z,\widetilde Z)$ with marginals $P, Q$ such that
$$\mathbb{P}\left( \|Z-\widetilde Z\| > \varepsilon \right) \le \varepsilon.$$
Let $G := \{\|Z-\widetilde Z\| \le \varepsilon\}$ and $B := G^c$, so $\mathbb{P}(B) \le \varepsilon$. Fix $\theta\in\mathbb{R}$ and write $X = \ell(h,Z)$, $Y = \ell(h,\widetilde Z)$. Using again that the positive part is $1$-Lipschitz, $|(X-\theta)_+ - (Y-\theta)_+| \le |X-Y|$. Split on $G$ and $B$:
$$\mathbb{E}|X-Y| \le \mathbb{E}[|X-Y|\mathbf{1}_G] + \mathbb{E}[|X-Y|\mathbf{1}_B].$$
On $G$, Lipschitzness (the $\beta=1$ case, with constant $L_\ell$) gives $|X-Y| \le L_\ell\|Z-\widetilde Z\| \le L_\ell\varepsilon$, so $\mathbb{E}[|X-Y|\mathbf{1}_G] \le L_\ell\varepsilon$. On $B$, use $|X-Y| \le |X|+|Y|$ and Hölder with $p = 1+\lambda$:
$$\mathbb{E}[|X|\mathbf{1}_B] \le \left( \mathbb{E}|X|^p \right)^{1/p}\mathbb{P}(B)^{1-1/p} \le M_p^{1/p}\varepsilon^{1-1/p},$$
and similarly for $Y$. Thus $\mathbb{E}|X-Y| \le L_\ell\varepsilon + 2M_p^{1/p}\varepsilon^{1-1/p}$. Consequently, for each fixed $\theta$,
$$\left| \left( \theta + \tfrac{1}{\alpha}\mathbb{E}_P[(X-\theta)_+] \right) - \left( \theta + \tfrac{1}{\alpha}\mathbb{E}_Q[(Y-\theta)_+] \right) \right| \le \frac{1}{\alpha}\left( L_\ell\varepsilon + 2M_p^{1/p}\varepsilon^{1-1/p} \right).$$
Taking the infimum over $\theta$ yields
$$\left| \mathrm{CVaR}_\alpha(\mathcal{L}_P) - \mathrm{CVaR}_\alpha(\mathcal{L}_Q) \right| \le \frac{1}{\alpha}\left( L_\ell\varepsilon + 2M_p^{1/(\lambda+1)}\varepsilon^{\lambda/(\lambda+1)} \right).$$
Replacing $\varepsilon$ by $2\varepsilon$ and symmetrizing (a standard padding) gives
$$\left| \mathrm{CVaR}_\alpha(\mathcal{L}_P) - \mathrm{CVaR}_\alpha(\mathcal{L}_Q) \right| \le \frac{1}{\alpha}\left( 2L_\ell\varepsilon + 2M_p^{1/p}\varepsilon^{1-1/p} \right).$$
For the matching lower bound, use the two-point construction as in the TV lower bound: $P = \delta_0$ and $Q = (1-\varepsilon)\delta_0 + \varepsilon\delta_b$ with $b = (M_p/\varepsilon)^{1/p}$ and $\varepsilon \le \alpha$. For every Borel set $A\subset\mathbb{R}$ we have
$$Q(A) \le P(A^\varepsilon) + \varepsilon \qquad\text{and}\qquad P(A) \le Q(A^\varepsilon) + \varepsilon,$$
because the only mass unmatched within an $\varepsilon$-enlargement is the $\varepsilon$ mass at $b$, which is absorbed by the additive $\varepsilon$ slack. Hence $\pi_{\mathrm{LP}}(P,Q) \le \varepsilon$. As computed above,
$$\left| \mathrm{CVaR}_\alpha(Q) - \mathrm{CVaR}_\alpha(P) \right| = \frac{1}{\alpha}M_p^{1/p}\varepsilon^{1-1/p},$$
which shows the bound is tight.

E Estimator Robustness

E.1 Truncated CVaR-Loss ERM is Optimal

Proof of Theorem 4.2 (upper bound). First we decompose the CVaR difference. Fix $h\in\mathcal{H}$ and let $\theta_Q^*$ minimize $R_\alpha^Q(h)$, so that
$$R_\alpha^Q(h) = \theta_Q^* + \frac{1}{\alpha}\mathbb{E}_Q[(\ell(h,Z)-\theta_Q^*)_+]. \quad (95)$$
By definition of the infimum,
$$R_\alpha^P(h) \le \theta_Q^* + \frac{1}{\alpha}\mathbb{E}_P[(\ell(h,Z)-\theta_Q^*)_+]. \quad (96)$$
Subtracting gives
$$R_\alpha^P(h) - R_\alpha^Q(h) \le \frac{1}{\alpha}\left( \mathbb{E}_P[(\ell(h,Z)-\theta_Q^*)_+] - \mathbb{E}_Q[(\ell(h,Z)-\theta_Q^*)_+] \right). \quad (97)$$
Analogously, using $\theta_P^*$,
$$R_\alpha^Q(h) - R_\alpha^P(h) \le \frac{1}{\alpha}\left( \mathbb{E}_Q[(\ell(h,Z)-\theta_P^*)_+] - \mathbb{E}_P[(\ell(h,Z)-\theta_P^*)_+] \right). \quad (98)$$
Thus
$$|R_\alpha^P(h) - R_\alpha^Q(h)| \le \frac{1}{\alpha}\max\left\{ \left| \mathbb{E}_P[(\ell-\theta_Q^*)_+] - \mathbb{E}_Q[(\ell-\theta_Q^*)_+] \right|,\ \left| \mathbb{E}_P[(\ell-\theta_P^*)_+] - \mathbb{E}_Q[(\ell-\theta_P^*)_+] \right| \right\}. \quad (99)$$
Define $f_\theta(z) = (\ell(h,z)-\theta)_+$. For a truncation level $T>0$, split
$$f_\theta(z) = \min(f_\theta(z), T) + (f_\theta(z) - T)_+. \quad (100)$$
Thus
$$\Delta(h,\theta) := \left| \mathbb{E}_P[f_\theta(Z)] - \mathbb{E}_Q[f_\theta(Z)] \right| \le \left| \mathbb{E}_P[\min(f_\theta,T)] - \mathbb{E}_Q[\min(f_\theta,T)] \right| + \mathbb{E}_P[(f_\theta-T)_+] + \mathbb{E}_Q[(f_\theta-T)_+]. \quad (101)$$
Since $0 \le \min(f_\theta(z),T) \le T$, the total-variation inequality gives
$$\left| \mathbb{E}_P[\min(f_\theta,T)] - \mathbb{E}_Q[\min(f_\theta,T)] \right| \le T\cdot d_{TV}(P,Q). \quad (102)$$
We have $f_\theta(z) \le \ell(h,z)$ from Corollary G.7. This step is crucial for the rest of the proof, and a detailed justification is given after the present proof. The motivation: because CVaR's inner optimization never benefits from a negative threshold when losses are nonnegative (Proposition G.6), the optimal threshold obeys $\theta_P^* \ge 0$. In this natural regime, the shifted loss $f_\theta = (\ell-\theta)_+$ is pointwise dominated by the raw loss $\ell$ (Lemma G.5). This dominance legitimizes replacing the tail of $f_\theta$ by the tail of $\ell$ in expectation bounds, letting one control the truncation remainder via the $(1+\lambda)$-moment (Corollary G.7). Since $f_\theta(z) \le \ell(h,z)$, we have $(f_\theta - T)_+ \le (\ell - T)_+$, so
$$\mathbb{E}_P[(f_\theta-T)_+] \le \mathbb{E}_P[(\ell(h,Z)-T)_+]. \quad (103)$$
Using Markov's inequality,
$$\mathbb{P}(\ell(h,Z) > T) \le \frac{\mathbb{E}_P[\ell(h,Z)^{1+\lambda}]}{T^{1+\lambda}} \le \frac{M}{T^{1+\lambda}}. \quad (104)$$
Therefore, by the tail-integral identity,
$$\mathbb{E}_P[(\ell(h,Z)-T)_+] = \int_T^\infty \mathbb{P}(\ell(h,Z)>t)\,dt \le \int_T^\infty \frac{M}{t^{1+\lambda}}\,dt = \frac{M}{\lambda T^\lambda}. \quad (105)\text{-}(107)$$
Similarly,
$$\mathbb{E}_Q[(\ell(h,Z)-T)_+] \le \frac{M}{\lambda T^\lambda}, \quad (108)$$
so
$$\mathbb{E}_P[(f_\theta-T)_+] + \mathbb{E}_Q[(f_\theta-T)_+] \le \frac{2M}{\lambda T^\lambda}. \quad (109)$$
Thus
$$\Delta(h,\theta) \le T\cdot d_{TV}(P,Q) + \frac{2M}{\lambda T^\lambda}. \quad (110)$$
Optimize over $T$. Let $\delta = d_{TV}(P,Q)$ and define
$$g(T) = T\delta + \frac{2M}{\lambda T^\lambda}. \quad (111)$$
The derivative is
$$g'(T) = \delta - \frac{2M}{T^{1+\lambda}}. \quad (112)$$
Setting $g'(T) = 0$,
$$T^* = \left( \frac{2M}{\delta} \right)^{1/(1+\lambda)}. \quad (113)$$
At $T^*$,
$$g(T^*) = (2M)^{1/(1+\lambda)}\,\delta^{\lambda/(1+\lambda)}\left( 1 + \frac{1}{\lambda} \right). \quad (114)$$
Therefore,
$$|R_\alpha^P(h) - R_\alpha^Q(h)| \le \frac{(2M)^{1/(1+\lambda)}\left( 1+\frac{1}{\lambda} \right)}{\alpha}\,d_{TV}(P,Q)^{\lambda/(1+\lambda)}. \quad (115)$$
Since the bound is uniform over $h\in\mathcal{H}$, the theorem holds.
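The truncation trade-off (111)-(114) can be checked directly. The following sketch (names are ours) evaluates $g(T) = T\,d_{TV} + 2M/(\lambda T^\lambda)$ at the optimizer $T^* = (2M/d_{TV})^{1/(1+\lambda)}$ and exhibits the $d_{TV}^{\lambda/(1+\lambda)}$ scaling of (115).

```python
def optimal_truncation_tv(M, lam, tv):
    """g(T) = T * d_TV + 2 M / (lam * T^lam) is minimized at
    T* = (2 M / d_TV)^{1/(1+lam)}, cf. equations (111)-(114)."""
    T = (2.0 * M / tv) ** (1.0 / (1.0 + lam))
    return T, T * tv + 2.0 * M / (lam * T ** lam)

for tv in (1e-1, 1e-2, 1e-3):
    print(tv, optimal_truncation_tv(M=1.0, lam=0.5, tv=tv)[1])
```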
Proof of the lower bound in Theorem 4.2. We construct an explicit pair of distributions achieving the claimed scaling. Fix $\varepsilon\in(0,\alpha)$ and consider a scalar loss $Z$. Let $P$ be the Dirac distribution at zero, $P(Z=0)=1$; clearly $R_\alpha^P(h) = 0$. Next, define $Q$ by shifting an $\varepsilon$-fraction of mass to a positive value $z>0$:
$$Q(Z=z) = \varepsilon, \qquad Q(Z=0) = 1-\varepsilon.$$
It is immediate that $d_{TV}(P,Q) = \varepsilon$. To satisfy Assumption 2.1 we require $\mathbb{E}_Q[|Z|^{1+\lambda}] = \varepsilon z^{1+\lambda} \le M$; we choose the largest admissible value,
$$z = \left( \frac{M}{\varepsilon} \right)^{\frac{1}{1+\lambda}}.$$
We now compute the CVaR of $Q$. Since $\varepsilon < \alpha$, the worst $\alpha$-fraction of outcomes consists of the entire $\varepsilon$-mass at $z$ together with an additional $(\alpha-\varepsilon)$-mass at $0$. The $(1-\alpha)$-quantile of $Q$ is therefore zero, and the CVaR reduces to the average loss over this tail:
$$R_\alpha^Q(h) = \frac{1}{\alpha}\left( \varepsilon z + (\alpha-\varepsilon)\cdot 0 \right) = \frac{\varepsilon}{\alpha}\left( \frac{M}{\varepsilon} \right)^{\frac{1}{1+\lambda}} = \frac{M^{\frac{1}{1+\lambda}}}{\alpha}\,\varepsilon^{\frac{\lambda}{1+\lambda}}.$$
Since $R_\alpha^P(h) = 0$, we obtain
$$|R_\alpha^P(h) - R_\alpha^Q(h)| = \frac{M^{\frac{1}{1+\lambda}}}{\alpha}\,\varepsilon^{\frac{\lambda}{1+\lambda}}.$$
This pair $(P,Q)$ is feasible for the supremum in Theorem 4.2, which proves the minimax lower bound.

E.2 Robust Estimation under Adversarial Contamination with Oblivious Adversaries

Proof. The proof proceeds in four main steps: establishing boundedness of the auxiliary variable $\theta$; decomposing the error into bias and estimation terms; bounding the error within a single block via uniform concentration and corruption control; and aggregating the block estimates via the median.

Applying Theorem G.1 pointwise with $l = \ell(h,Z)$, we may restrict the optimization to the compact domain
$$\Theta := [0, R], \qquad R = \frac{M^{1/(1+\lambda)}}{\alpha}.$$
Moment bound for the variational loss. Define the variational loss
$$\phi(Z; h,\theta) = \theta + \frac{1}{\alpha}(\ell(h,Z)-\theta)_+, \qquad \theta\in\Theta.$$

Lemma E.1 (Moment inflation bound). Under Assumption 2.1 and Theorem G.1, there exists a constant $C_\lambda > 0$ such that
$$\sup_{h\in\mathcal{H},\,\theta\in\Theta}\mathbb{E}\left[ |\phi(Z;h,\theta)|^{1+\lambda} \right] \le C_\lambda\left( \frac{M}{\alpha^{1+\lambda}} + R^{1+\lambda} \right) =: M_\phi.$$
In particular, since $R = M^{1/(1+\lambda)}/\alpha$, we obtain $M_\phi \lesssim_\lambda \frac{M}{\alpha^{1+\lambda}}$.

Proof. Using $(a+b)^{1+\lambda} \le 2^\lambda(a^{1+\lambda} + b^{1+\lambda})$ for $a,b\ge 0$ and $\phi(Z;h,\theta) \le \theta + \ell(h,Z)/\alpha$, we have
$$|\phi(Z;h,\theta)|^{1+\lambda} \le C_\lambda\left( |\theta|^{1+\lambda} + \alpha^{-(1+\lambda)}\ell(h,Z)^{1+\lambda} \right).$$
Taking expectations and using $\theta\in[0,R]$ and Assumption 2.1 yields the result.

Error decomposition. Let $R^B(h,\theta) = \mathbb{E}[\min(\phi(Z;h,\theta), B)]$ be the expected truncated risk. We decompose the total error:
$$\sup_h|\widehat R_\alpha(h) - R_\alpha(h)| \le \sup_{h,\theta}|\widehat R_\alpha(h,\theta) - R(h,\theta)| \le \underbrace{\sup_{h,\theta}|R(h,\theta) - R^B(h,\theta)|}_{\text{truncation bias}} + \underbrace{\sup_{h,\theta}|\widehat R_\alpha(h,\theta) - R^B(h,\theta)|}_{\text{estimation error}}.$$
Bias control: since $\phi\ge 0$, $|R - R^B| \le \mathbb{E}[\phi\,\mathbf{1}_{\phi>B}]$. Using Hölder's inequality, Markov's inequality, and the moment bound $M_\phi$,
$$\mathbb{E}[\phi\,\mathbf{1}_{\phi>B}] \le \left( \mathbb{E}\phi^{1+\lambda} \right)^{\frac{1}{1+\lambda}}\left( \mathbb{P}(\phi>B) \right)^{\frac{\lambda}{1+\lambda}} \le M_\phi B^{-\lambda}.$$

Analysis of a single block. Fix a block index $j\in\{1,\dots,K\}$. Recall that
$$\widehat\mu_j(h,\theta) = \frac{1}{m}\sum_{i\in B_j}\phi_B(Z_i; h,\theta), \qquad \phi_B = \min(\phi, B),$$
and define the truncated population risk $R^B(h,\theta) = \mathbb{E}[\phi_B(Z;h,\theta)]$. Let $S_{\mathrm{clean}}\subset\{1,\dots,n\}$ denote the indices of uncorrupted points, and define
$$B_j^{\mathrm{clean}} = B_j\cap S_{\mathrm{clean}}, \qquad N_j = |B_j\setminus B_j^{\mathrm{clean}}|,$$
the set of clean indices and the number of outliers in block $j$, respectively.
We decompose:
$$\widehat\mu_j(h,\theta) - R^B(h,\theta) = \frac{1}{m}\sum_{i\in B_j^{\mathrm{clean}}}\left( \phi_B(Z_i;h,\theta) - R^B(h,\theta) \right) + \frac{1}{m}\sum_{i\in B_j\setminus B_j^{\mathrm{clean}}}\left( \phi_B(Z_i';h,\theta) - R^B(h,\theta) \right) =: \xi_j(h,\theta) + \Delta_j(h,\theta),$$
where $Z_i'$ denotes the (possibly adversarially) corrupted values. Thus $\xi_j$ captures the sampling fluctuation of the clean data and $\Delta_j$ captures the corruption bias.

(a) Uniform concentration of the clean part. Conditional on the corruption pattern and the random permutation, the set $B_j^{\mathrm{clean}}$ consists of points drawn without replacement from the clean sample. By Hoeffding's reduction principle, concentration inequalities for sampling without replacement are dominated by those for i.i.d. sampling, so it suffices to analyze the i.i.d. case.

Lemma E.2 (Uniform concentration on a single clean block). There exists a universal constant $C>0$ such that, conditional on $B_j^{\mathrm{clean}}$, with probability at least $1 - 0.1$,
$$\sup_{h\in\mathcal{H},\,\theta\in\Theta}|\xi_j(h,\theta)| \le C\left( \sqrt{\frac{M_\phi B^{1-\lambda}\,d\log m}{m}} + \frac{B\,d\log m}{m} \right) =: E_{\mathrm{stat}}(B).$$

Proof. The function class $\mathcal{F}_B$ is uniformly bounded by $B$. Moreover, by Lemma G.10, $\sup_{f\in\mathcal{F}_B}\mathrm{Var}(f(Z)) \le M_\phi B^{1-\lambda}$. By Assumption 2.4, the $L_2(Q)$ covering numbers of $\mathcal{F}_B$ satisfy $\log\mathcal{N}(\mathcal{F}_B, \|\cdot\|_{L_2(Q)}, u) \le d\log(C_0 B/u)$. Therefore, by Bousquet's version of Talagrand's inequality combined with standard entropy-integral bounds, for i.i.d. samples we obtain
$$\sup_{f\in\mathcal{F}_B}\left| \frac{1}{m}\sum_{i=1}^m\left( f(Z_i) - \mathbb{E}f \right) \right| \le C\left( \sqrt{\frac{M_\phi B^{1-\lambda}\,d\log m}{m}} + \frac{B\,d\log m}{m} \right)$$
with probability at least $0.9$. By Hoeffding's reduction principle, the same bound holds for sampling without replacement from the clean data.

(b) Control of the corruption level. Since the adversary is oblivious and the learner shuffles the data uniformly at random, the number of corrupted points in block $j$ satisfies $N_j \sim \mathrm{Hypergeo}(n, \epsilon n, m)$.

Lemma E.3 (Outlier proportion in a block). Assume $\epsilon \le 1/2 - \gamma$. Then for all sufficiently large $m$,
$$\mathbb{P}\left( \frac{N_j}{m} \le \epsilon + \frac{\gamma}{4} \right) \ge 1 - 0.1.$$

Proof. By Chvátal's hypergeometric tail bound, $\mathbb{P}(N_j \ge (\epsilon+\gamma/4)m) \le \exp(-2m(\gamma/4)^2)$. For $m \ge C/\gamma^2$, the right-hand side is at most $0.1$.

(c) Corruption bias bound. On the event of Lemma E.3, since $0\le\phi_B\le B$,
$$\sup_{h,\theta}|\Delta_j(h,\theta)| \le \frac{N_j}{m}B \le \left( \epsilon + \frac{\gamma}{4} \right)B.$$

(d) Conclusion for a single block. Combining Lemmas E.2 and E.3, with probability at least $0.8$ a block simultaneously satisfies
$$\sup_{h,\theta}|\widehat\mu_j(h,\theta) - R^B(h,\theta)| \le E_{\mathrm{stat}}(B) + \left( \epsilon+\frac{\gamma}{4} \right)B.$$
Such a block will be called Good.

Robust aggregation via the median. Recall from the previous step that for each block $j$ we defined the events
$$\mathcal{E}_j^{\mathrm{stat}} = \left\{ \sup_{h,\theta}|\xi_j(h,\theta)| \le E_{\mathrm{stat}}(B) \right\}, \qquad \mathcal{E}_j^{\mathrm{corr}} = \left\{ \frac{N_j}{m} \le \epsilon+\frac{\gamma}{4} \right\}.$$
A block $j$ is called Good if $\mathcal{E}_j^{\mathrm{stat}}\cap\mathcal{E}_j^{\mathrm{corr}}$ holds. From Lemmas E.2 and E.3, by the union bound,
$$\mathbb{P}(\text{block } j \text{ is Good}) \ge 1 - (0.1+0.1) = 0.8.$$
Define the indicator variables $I_j = \mathbf{1}\{\text{block } j \text{ is Good}\}$, $j = 1,\dots,K$.

Lemma E.4 (A majority of blocks are good). Assume $\epsilon \le 1/2-\gamma$. There exists a universal constant $c>0$ such that if $K \ge \frac{8}{\gamma^2}\log\left( \frac{4}{\delta} \right)$, then with probability at least $1-\delta$,
$$\sum_{j=1}^K I_j > \frac{K}{2}.$$

Proof. The indicators $I_j$ are functions of a random partition of a finite population; they form a negatively associated family.
For negatively associated Bernoulli random variables, Chernoff-Hoeffding inequalities hold in the same form as for independent variables (see Dubhashi and Ranjan [1998]). Since $\mathbb{E}[I_j] \ge 0.8$, Hoeffding's inequality implies
$$\mathbb{P}\left( \sum_{j=1}^K I_j \le \frac{K}{2} \right) \le \exp\left( -2K(0.8-0.5)^2 \right) \le \exp\left( -\frac{\gamma^2 K}{2} \right),$$
where the last step uses $\gamma \le 1/2$, so that $\gamma^2/2 \le 1/8 < 2(0.3)^2$. Choosing $K \ge \frac{8}{\gamma^2}\log(4/\delta)$ ensures the right-hand side is at most $\delta$.

On the event of Lemma E.4, strictly more than half the blocks are Good. For any Good block $j$ we simultaneously have
$$\sup_{h,\theta}|\xi_j(h,\theta)| \le E_{\mathrm{stat}}(B), \qquad \sup_{h,\theta}|\Delta_j(h,\theta)| \le \left( \epsilon+\frac{\gamma}{4} \right)B,$$
hence
$$\sup_{h,\theta}|\widehat\mu_j(h,\theta) - R^B(h,\theta)| \le E_{\mathrm{stat}}(B) + \left( \epsilon+\frac{\gamma}{4} \right)B.$$
Since the median of $K$ numbers lies between the minimum and maximum of any subset of more than $K/2$ elements, it follows deterministically that
$$\sup_{h,\theta}|\widehat R_\alpha(h,\theta) - R^B(h,\theta)| \le \max_{j: I_j = 1}|\widehat\mu_j(h,\theta) - R^B(h,\theta)| \le E_{\mathrm{stat}}(B) + \left( \epsilon+\frac{\gamma}{4} \right)B.$$

Final error bound and balancing. Combining the previous step with the truncation-bias bound from Lemma G.10, we obtain that with probability at least $1-\delta$,
$$\sup_{h,\theta}|\widehat R_\alpha(h,\theta) - R(h,\theta)| \le M_\phi B^{-\lambda} + E_{\mathrm{stat}}(B) + \left( \epsilon+\frac{\gamma}{4} \right)B.$$
Recalling the definition of $E_{\mathrm{stat}}(B)$ and using $m\asymp n/K$ with $K = O(\gamma^{-2}\log(1/\delta))$, we may rewrite the bound (absorbing constants) as
$$\mathrm{Err}(B) \lesssim M_\phi B^{-\lambda} + \sqrt{\frac{M_\phi B^{1-\lambda}\,d\log n}{n}} + \frac{B\,d\log n}{n} + \epsilon B.$$
Assuming $n\gtrsim d\log n$, the linear term $\frac{B\,d\log n}{n}$ is of smaller order than the variance term under the optimal choice of $B$ and may be absorbed. We therefore balance the remaining three dominant terms.
(i) Statistical regime. Balancing bias and variance,
$$M_\phi B^{-\lambda} \asymp \sqrt{\frac{M_\phi B^{1-\lambda}d}{n}} \implies B_{\mathrm{stat}} \asymp \left( \frac{M_\phi n}{d} \right)^{\frac{1}{1+\lambda}},$$
and substituting yields $\mathrm{Err} \lesssim M_\phi^{\frac{1}{1+\lambda}}\left( \frac{d}{n} \right)^{\frac{\lambda}{1+\lambda}}$.
(ii) Adversarial regime. Balancing bias and corruption,
$$M_\phi B^{-\lambda} \asymp \epsilon B \implies B_{\mathrm{adv}} \asymp \left( \frac{M_\phi}{\epsilon} \right)^{\frac{1}{1+\lambda}},$$
and substituting yields $\mathrm{Err} \lesssim M_\phi^{\frac{1}{1+\lambda}}\epsilon^{\frac{\lambda}{1+\lambda}}$.
Taking $B = \min(B_{\mathrm{stat}}, B_{\mathrm{adv}})$ and recalling that $\sup_h|\widehat R_\alpha(h) - R_\alpha(h)| \le \sup_{h,\theta}|\widehat R_\alpha(h,\theta) - R(h,\theta)|$, we conclude the proof of Theorem 4.3.
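For concreteness, here is a minimal Python sketch of the truncated median-of-means CVaR estimator analyzed above, for a single fixed hypothesis (so only the $\theta$-search remains). The truncation level, block count, grid resolution, and contamination in the demo are placeholder choices, and the grid search stands in for the exact minimization over $\Theta$.

```python
import numpy as np

def mom_truncated_cvar(losses, alpha, B, K, n_grid=200, seed=0):
    """Truncate losses at B, shuffle into K equal blocks, average the RU
    objective theta + (loss_B - theta)_+ / alpha per block, aggregate the
    K block values by a median, and minimize over a theta grid in [0, B]."""
    x = np.minimum(np.asarray(losses, dtype=float), B)
    x = np.random.default_rng(seed).permutation(x)  # random blocks vs. oblivious adversary
    m = len(x) // K
    blocks = x[: m * K].reshape(K, m)
    thetas = np.linspace(0.0, B, n_grid)
    hinge = np.maximum(blocks[:, :, None] - thetas[None, None, :], 0.0)  # (K, m, grid)
    block_obj = thetas[None, :] + hinge.mean(axis=1) / alpha             # (K, grid)
    return float(np.min(np.median(block_obj, axis=0)))

# A few gross outliers barely move the estimate on heavy-tailed data.
rng = np.random.default_rng(3)
sample = np.concatenate([rng.pareto(2.0, size=5000), np.full(50, 1e6)])
print(mom_truncated_cvar(sample, alpha=0.1, B=50.0, K=25))
```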
F Decision Robustness

F.1 Tail-Scarcity Instability for CVaR-ERM under a Finite $p$-th Moment

Throughout, $\alpha\in(0,1)$ is fixed and
$$\mathrm{CVaR}_\alpha(X; Q) = \inf_{\theta\in\mathbb{R}}\left\{ \theta + \frac{1}{\alpha}\mathbb{E}_Q[(X-\theta)_+] \right\}.$$
Empirical CVaR objective. Given i.i.d. samples $Z_1,\dots,Z_n\sim P$, let $P_n := \frac{1}{n}\sum_{i=1}^n\delta_{Z_i}$. For $h\in\mathcal{H}$ define the empirical RU functional and empirical CVaR
$$\widehat\Phi_n(h,\theta) := \theta + \frac{1}{\alpha}\mathbb{E}_{P_n}\left[ (\ell(h,Z)-\theta)_+ \right] = \theta + \frac{1}{\alpha n}\sum_{i=1}^n(\ell(h,Z_i)-\theta)_+, \quad \theta\in\mathbb{R}, \quad (116)$$
$$\widehat R_n(h) := \inf_{\theta\in\mathbb{R}}\widehat\Phi_n(h,\theta) = \mathrm{CVaR}_\alpha(\ell(h,Z); P_n). \quad (117)$$
The population objective is $R_P(h) := \mathrm{CVaR}_\alpha(\ell(h,Z); P)$.

Lemma F.1 (RU minimizer at $\theta=0$; scaled-mean identity). Let $X\ge 0$ be integrable and suppose $Q(X=0) \ge 1-\alpha$. Then $\theta^\star = 0$ minimizes $\theta\mapsto\theta+\frac{1}{\alpha}\mathbb{E}_Q[(X-\theta)_+]$ and $\mathrm{CVaR}_\alpha(X;Q) = \frac{1}{\alpha}\mathbb{E}_Q[X]$. In particular, if $x_1,\dots,x_n\ge 0$ satisfy $\#\{i: x_i = 0\} \ge (1-\alpha)n$, then the empirical CVaR satisfies
$$\mathrm{CVaR}_\alpha(x_1,\dots,x_n) = \frac{1}{\alpha n}\sum_{i=1}^n x_i.$$

Proof. Define $\phi(\theta) := \theta + \frac{1}{\alpha}\mathbb{E}_Q[(X-\theta)_+]$. For $\theta<0$, $\phi'(\theta) = 1 - \frac{1}{\alpha}Q(X>\theta) = 1-\frac{1}{\alpha} < 0$, so no minimizer lies below $0$. For $\theta\ge0$, the right derivative equals $\phi_+'(\theta) = 1 - \frac{1}{\alpha}Q(X>\theta) \ge 1-\frac{1}{\alpha}Q(X>0) \ge 0$ because $Q(X>0)\le\alpha$. Hence $\phi$ is minimized at $\theta=0$ with $\phi(0) = \frac{1}{\alpha}\mathbb{E}_Q[X]$. The empirical statement is the special case $Q = P_n$.

Theorem (Restatement of Theorem 4.11, detailed). Fix $\alpha\in(0,1)$ and $1<p<2$. There exist a fixed distribution $P$ on $\mathcal{Z} = \mathbb{R}_+$ and constants $\varepsilon\in(0,\alpha/4)$, $\gamma>0$, and $C\in(0,1)$ with $C > \alpha\gamma$ such that the following holds. For every sufficiently large $n$, define $\mathcal{H} = \{h_A, h_B\}$ and the (sample-size-dependent) losses
$$\ell_n(h_A, z) := z, \qquad \ell_n(h_B, z) := \begin{cases} 0, & z = 0, \\ z + \gamma - Cn\,\mathbf{1}\{z\in(n,2n]\}, & z>0. \end{cases} \quad (118)$$
Let $R_P^{(n)}(h) := \mathrm{CVaR}_\alpha(\ell_n(h,Z); P)$ and $\widehat R_n(h) := \mathrm{CVaR}_\alpha(\ell_n(h,Z); P_n)$. Then the following hold:
1. ($p$-th moment only.) For each fixed $n$ and $h\in\mathcal{H}$, $\mathbb{E}_P|\ell_n(h,Z)|^p < \infty$ and $\mathbb{E}_P|\ell_n(h,Z)|^q = \infty$ for all $q>p$.
2. (Strict population optimality, uniformly for large $n$.) There exist $\gamma_0>0$ and $n_0$ such that for all $n\ge n_0$, $R_P^{(n)}(h_B) - R_P^{(n)}(h_A) \ge \gamma_0$, so $S(P) = \{h_A\}$.
3. (One-point flip with sharp probability.) There exist constants $c>0$ and $n_0$ such that for all $n\ge n_0$,
$$\Pr_{D\sim P^{\otimes n}}\left( \exists D' \text{ differing from } D \text{ in exactly one sample s.t. } \arg\min_{\mathcal{H}}\widehat R_n(\cdot; D) = \{h_B\},\ \arg\min_{\mathcal{H}}\widehat R_n(\cdot; D') = \{h_A\} \right) \ge \frac{c\,n^{1-p}}{(\log n)^2}. \quad (119)$$
4. (Sharpness.) Under $\sup_h\mathbb{E}_P|\ell(h,Z)|^p < \infty$ and a fixed population margin, any such flip probability is $O(n^{1-p})$ up to logarithmic factors.

Proof. Let $Y$ satisfy the tail law
$$\mathbb{P}(Y>y) = \frac{1}{y^p(\log y)^2}, \qquad y\ge e. \quad (120)$$
Then $\mathbb{E}Y^p < \infty$ and $\mathbb{E}Y^q = \infty$ for all $q>p$ (tail-integral test). Define $Z\sim P$ by the mixture
$$Z = \begin{cases} 0, & \text{with probability } 1-\alpha+\varepsilon, \\ Y, & \text{with probability } \alpha-\varepsilon. \end{cases} \quad (121)$$
Then $P(Z=0) = 1-\alpha+\varepsilon > 1-\alpha$, and $Z$ inherits $\mathbb{E}Z^p < \infty$ and $\mathbb{E}Z^q = \infty$ for all $q>p$.

For $h_A$, $\ell_n(h_A,Z) = Z$, so $\mathbb{E}|\ell_n(h_A,Z)|^p < \infty$ and $\mathbb{E}|\ell_n(h_A,Z)|^q = \infty$ for all $q>p$. For $h_B$, note that $\ell_n(h_B,0) = 0$; for $z>0$ with $z\notin(n,2n]$ we have $\ell_n(h_B,z) = z+\gamma > 0$, while for $z\in(n,2n]$ we have $\ell_n(h_B,z) = z+\gamma-Cn \ge (1-C)n+\gamma > 0$ because $C<1$. Also $\ell_n(h_B,z) \le z+\gamma$, hence $\mathbb{E}|\ell_n(h_B,Z)|^p < \infty$. Finally, on $\{Z>0,\ Z\notin(n,2n]\}$ we have $\ell_n(h_B,Z) = Z+\gamma \ge Z$, and since $P(Z>0) = \alpha-\varepsilon > 0$ and $\mathbb{E}Z^q = \infty$ for all $q>p$, it follows that $\mathbb{E}|\ell_n(h_B,Z)|^q = \infty$ for all $q>p$. This proves (1).

Because $P(\ell_n(h_A,Z) = 0) = P(Z=0) = 1-\alpha+\varepsilon > 1-\alpha$ and also $P(\ell_n(h_B,Z)=0) \ge 1-\alpha+\varepsilon$, and because the losses are nonnegative, Lemma F.1 applies to both actions and yields $R_P^{(n)}(h) = \frac{1}{\alpha}\mathbb{E}[\ell_n(h,Z)]$. Therefore
$$R_P^{(n)}(h_A) = \frac{1}{\alpha}\mathbb{E}[Z] = \frac{\alpha-\varepsilon}{\alpha}\mathbb{E}[Y], \quad (122)$$
$$R_P^{(n)}(h_B) = \frac{\alpha-\varepsilon}{\alpha}\mathbb{E}\left[ Y+\gamma-Cn\,\mathbf{1}\{Y\in(n,2n]\} \right]. \quad (123)$$
Hence
$$R_P^{(n)}(h_B) - R_P^{(n)}(h_A) = \frac{\alpha-\varepsilon}{\alpha}\left( \gamma - Cn\,\mathbb{P}(Y\in(n,2n]) \right). \quad (124)$$
Using (120), $\mathbb{P}(Y\in(n,2n]) = \mathbb{P}(Y>n) - \mathbb{P}(Y>2n) \le \mathbb{P}(Y>n) = \frac{1}{n^p(\log n)^2}$. Thus
$$Cn\,\mathbb{P}(Y\in(n,2n]) \le \frac{C}{n^{p-1}(\log n)^2}\to 0.$$
Fix $\gamma>0$. Then there exists $n_0$ such that for all $n\ge n_0$, $Cn\,\mathbb{P}(Y\in(n,2n]) \le \gamma/2$. Plugging into (124) yields
$$R_P^{(n)}(h_B) - R_P^{(n)}(h_A) \ge \frac{\alpha-\varepsilon}{\alpha}\cdot\frac{\gamma}{2} =: \gamma_0 > 0,$$
proving (2).

Let $D = (Z_1,\dots,Z_n)\sim P^{\otimes n}$ and define $N_0 := \sum_{i=1}^n\mathbf{1}\{Z_i = 0\}$ and $N_+ := \sum_{i=1}^n\mathbf{1}\{Z_i>0\}$. Since $\mathbb{E}[N_+] = (\alpha-\varepsilon)n$, Hoeffding's inequality implies $\mathbb{P}(N_+\ge 2) \ge 1-e^{-c_+n}$ for all large $n$; similarly, $\mathbb{P}(N_0 \ge (1-\alpha)n+1) \ge 1-e^{-c_0 n}$. Define the tail-bin event
$$G_n := \left\{ \exists!\,i : Z_i\in(n,2n] \right\}\cap\left\{ \max_{1\le i\le n}Z_i \le 2n \right\}. \quad (125)$$
Let $q_n := \mathbb{P}(Z\in(n,2n]) = (\alpha-\varepsilon)\mathbb{P}(Y\in(n,2n])$ and $r_n := \mathbb{P}(Z>2n) = (\alpha-\varepsilon)\mathbb{P}(Y>2n)$. Independence gives
$$\Pr(G_n) = n\,q_n(1-q_n-r_n)^{n-1}.$$
Since $q_n + r_n = O(n^{-p}(\log n)^{-2})$, we have $n(q_n+r_n)\to0$; hence for all large $n$, $(1-q_n-r_n)^{n-1} \ge 1/2$ and therefore
$$\Pr(G_n) \ge \frac{1}{2}n q_n. \quad (126)$$
Moreover, for all large $n$,
$$\mathbb{P}(Y\in(n,2n]) = \mathbb{P}(Y>n) - \mathbb{P}(Y>2n) \ge \frac{1}{n^p(\log n)^2} - \frac{1}{(2n)^p(\log(2n))^2} \ge \left( 1-2^{-p} \right)\frac{1}{n^p(\log(2n))^2} \ge \frac{c_1}{n^p(\log n)^2},$$
so $q_n \ge (\alpha-\varepsilon)c_1 n^{-p}(\log n)^{-2}$. Combining with (126) yields
$$\mathbb{P}(G_n) \ge \frac{c\,n^{1-p}}{(\log n)^2}. \quad (127)$$
We work on the event $\mathcal{E}_n := G_n\cap\{N_0\ge(1-\alpha)n+1\}\cap\{N_+\ge2\}$. By (127) and the exponential tails for $N_0, N_+$, we still have $\mathbb{P}(\mathcal{E}_n) \ge \frac{c\,n^{1-p}}{(\log n)^2}$ for all large $n$ (possibly with a smaller $c$). On $\{N_0\ge(1-\alpha)n\}$, Lemma F.1 implies
$$\widehat R_n(h) = \frac{1}{\alpha n}\sum_{i=1}^n\ell_n(h,Z_i), \qquad h\in\{h_A,h_B\}. \quad (128)$$
Hence, on $\mathcal{E}_n$,
$$\widehat R_n(h_B) - \widehat R_n(h_A) = \frac{1}{\alpha n}\sum_{i=1}^n\left( \ell_n(h_B,Z_i) - \ell_n(h_A,Z_i) \right) = \frac{1}{\alpha n}\sum_{i:\,Z_i>0}\left( \gamma - Cn\,\mathbf{1}\{Z_i\in(n,2n]\} \right). \quad (129)$$
On $G_n$ there is exactly one index $i^\star$ with $Z_{i^\star}\in(n,2n]$, so
$$\widehat R_n(h_B) - \widehat R_n(h_A) = \frac{1}{\alpha n}\left( N_+\gamma - Cn \right) \le \gamma - \frac{C}{\alpha} < 0,$$
because $N_+ \le \alpha n$ on $\mathcal{E}_n$ and $C > \alpha\gamma$. Thus, on $\mathcal{E}_n$, $\arg\min_{\mathcal{H}}\widehat R_n(\cdot;D) = \{h_B\}$.

Define $D'$ by replacing the unique bin point by $0$: set $Z'_{i^\star} := 0$ and $Z'_i := Z_i$ for $i\ne i^\star$. Since $N_0\ge(1-\alpha)n+1$ on $\mathcal{E}_n$, after the replacement we still have $N'_0\ge(1-\alpha)n$, so (128) holds for $D'$. Moreover, under $D'$ there are no samples in $(n,2n]$, hence
$$\widehat R_n(h_B; D') - \widehat R_n(h_A; D') = \frac{1}{\alpha n}\sum_{i:\,Z'_i>0}\gamma = \frac{N'_+}{\alpha n}\gamma.$$
On $\mathcal{E}_n$ we have $N_+\ge2$, so after replacing one positive point $N'_+\ge1$; the difference is strictly positive and therefore $\arg\min_{\mathcal{H}}\widehat R_n(\cdot;D') = \{h_A\}$. This proves (119).

For the sharpness claim (4): under $\sup_h\mathbb{E}_P|\ell(h,Z)|^p < \infty$ and a fixed population margin, flipping the empirical minimizer by changing one sample requires an observation of magnitude $\Omega(n)$, since the CVaR objective changes by at most $|\ell|/(\alpha n)$ per sample. By Markov's inequality and a union bound, $\mathbb{P}(\max_i|\ell_i| \gtrsim n) = O(n^{1-p})$, proving the upper bound up to logarithms.
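A toy numerical instance of the construction (118) makes the one-point flip tangible. All specific values below ($\alpha$, $\gamma$, $C$, $n$, and the planted samples) are illustrative choices satisfying $C\in(\alpha\gamma,1)$ and $N_0\ge(1-\alpha)n$, under which Lemma F.1 reduces the empirical CVaR to a scaled mean.

```python
import numpy as np

def cvar_scaled_mean(losses, alpha):
    # Lemma F.1: with at least (1 - alpha) n zeros, empirical CVaR = mean / alpha.
    return np.mean(losses) / alpha

alpha, gamma, C, n = 0.2, 0.5, 0.9, 1000            # C > alpha * gamma, C < 1
z = np.zeros(n); z[:3] = [1.5 * n, 7.0, 9.0]        # one tail-bin point in (n, 2n]

def loss_B(z):
    return np.where(z > 0, z + gamma - C * n * ((z > n) & (z <= 2 * n)), 0.0)

print(cvar_scaled_mean(loss_B(z), alpha) - cvar_scaled_mean(z, alpha))   # < 0: h_B wins
z2 = z.copy(); z2[0] = 0.0                          # change exactly one sample
print(cvar_scaled_mean(loss_B(z2), alpha) - cvar_scaled_mean(z2, alpha)) # > 0: h_A wins
```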
G Auxiliary Results

G.1 Boundedness of the CVaR Minimizer under a Bounded Moment

Theorem G.1 (Boundedness of the CVaR minimizer under a bounded moment). Let $l\ge0$ be a nonnegative random variable on a probability space $(\Omega,\mathcal{F},\mathbb{P})$ satisfying $\mathbb{E}[l^{1+\varepsilon}] \le M < \infty$ for some $\varepsilon\in(0,1]$ and $M>0$. Fix $\alpha\in(0,1)$ and define, for $\theta\in\mathbb{R}$,
$$F(\theta) := \theta + \frac{1}{\alpha}\mathbb{E}\left[ (l-\theta)_+ \right], \qquad (x)_+ := \max\{x,0\}.$$
Then $F$ attains its minimum on $\mathbb{R}$, and every minimizer $\theta^*\in\arg\min_{\theta\in\mathbb{R}}F(\theta)$ satisfies
$$0\le\theta^*\le R, \qquad \text{where one may take } R = \frac{M^{1/(1+\varepsilon)}}{\alpha}.$$
In particular, the set of minimizers is nonempty and bounded, contained in $[0,R]$.

Proof. We divide the argument into steps: finiteness and continuity of $F$, behavior as $\theta\to\pm\infty$ (coercivity), nonnegativity of minimizers, and the explicit upper bound.

(A) $F$ is finite and convex. For any fixed $\theta\in\mathbb{R}$ we have the pointwise inequality $(l-\theta)_+ \le l+|\theta|$. Since $\mathbb{E}[l] \le (\mathbb{E}[l^{1+\varepsilon}])^{1/(1+\varepsilon)} \le M^{1/(1+\varepsilon)} < \infty$ (see step (E) below), it follows that $\mathbb{E}[(l-\theta)_+] \le \mathbb{E}[l]+|\theta| < \infty$, so $F(\theta)$ is finite for every $\theta$. For each fixed $\omega\in\Omega$ the map $\theta\mapsto(l(\omega)-\theta)_+$ is convex (the positive part of an affine function), and expectation preserves convexity; therefore $F$ is convex on $\mathbb{R}$. A convex function that is finite on all of $\mathbb{R}$ is continuous (hence lower semicontinuous). Thus $F$ is continuous and finite-valued on $\mathbb{R}$.

(B) Coercivity: $F(\theta)\to+\infty$ as $|\theta|\to\infty$. First, for every $\theta\in\mathbb{R}$, $F(\theta) = \theta + \frac{1}{\alpha}\mathbb{E}[(l-\theta)_+] \ge \theta$ because $(l-\theta)_+\ge0$; hence $F(\theta)\to+\infty$ as $\theta\to+\infty$. Next, for $\theta\le0$ we have $l-\theta\ge l\ge0$ (since $l\ge0$), so $(l-\theta)_+ = l-\theta$, and thus for $\theta\le0$
$$F(\theta) = \theta + \frac{1}{\alpha}\mathbb{E}[l-\theta] = \frac{\mathbb{E}[l]}{\alpha} + \theta\left( 1-\frac{1}{\alpha} \right).$$
Because $\alpha\in(0,1)$, the coefficient $1-1/\alpha$ is negative, so as $\theta\to-\infty$ the term $\theta(1-1/\alpha)\to+\infty$; hence $F(\theta)\to+\infty$ as $\theta\to-\infty$. Combining the two directions shows $F$ is coercive.

(C) Existence of a minimizer. Since $F$ is continuous on $\mathbb{R}$ and coercive (tends to $+\infty$ at $\pm\infty$), it attains its minimum on $\mathbb{R}$; thus $\arg\min_{\theta\in\mathbb{R}} F(\theta)$ is nonempty (lower semicontinuity and coercivity imply existence of a minimizer).

(D) No minimizer is negative. For $\theta\le0$ we have the formula $F(\theta) = \frac{\mathbb{E}[l]}{\alpha} + \theta\left( 1-\frac{1}{\alpha} \right)$. Evaluating at $\theta=0$ gives $F(0) = \mathbb{E}[l]/\alpha$. For any $\theta<0$,
$$F(\theta)-F(0) = \theta\left( 1-\frac{1}{\alpha} \right) > 0,$$
since $1-1/\alpha<0$ and $\theta<0$; hence $F(\theta)>F(0)$. Thus no $\theta<0$ can be a global minimizer, and every minimizer satisfies $\theta^*\ge0$.

(E) Upper bound on $\mathbb{E}[l]$. By Hölder's inequality (or the monotonicity of $L_p$-norms on a probability space) with $p = 1+\varepsilon>1$,
$$\mathbb{E}[l] \le \left( \mathbb{E}[l^{1+\varepsilon}] \right)^{1/(1+\varepsilon)} \le M^{1/(1+\varepsilon)}.$$
Consequently,
$$F(0) = \frac{\mathbb{E}[l]}{\alpha} \le \frac{M^{1/(1+\varepsilon)}}{\alpha}. \quad (*)$$

(F) Upper bound on minimizers. For any $\theta\ge0$ we have $F(\theta)\ge\theta$. If $\theta > F(0)$, then $F(\theta)\ge\theta>F(0)$, so such a $\theta$ cannot be a minimizer. Therefore every minimizer $\theta^*$ satisfies $\theta^*\le F(0)$, and combining with $(*)$ yields
$$0 \le \theta^* \le F(0) \le \frac{M^{1/(1+\varepsilon)}}{\alpha}.$$
Hence one may take $R = M^{1/(1+\varepsilon)}/\alpha$. We have shown that $F$ is finite, continuous, and coercive, so it attains a minimum; moreover, every minimizer satisfies $0\le\theta^*\le R$ with $R = M^{1/(1+\varepsilon)}/\alpha$. This completes the boundedness-of-the-minimizer proof.

G.2 Concentration Bound for Heavy-Tailed Random Variables

Proposition G.2. Let $X_1,\dots,X_n$ be independent and identically distributed (i.i.d.) nonnegative random variables satisfying the moment condition $\mathbb{E}[X_i^{1+\lambda}] \le M$ for some constants $\lambda\in(0,1)$ and $M>0$. Then, with probability at least $1-\delta$, the sample mean satisfies
$$\frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}[X_i] \le 2M^{\frac{1}{1+\lambda}}\left( \frac{\log(2/\delta)}{n} \right)^{\frac{\lambda}{1+\lambda}}.$$
This result is adapted from Brownlees et al. [2015], which provides concentration bounds for empirical risk minimization under heavy-tailed losses.

Proof. We use a truncation approach combined with concentration inequalities for bounded random variables. The key idea is to control the large values of the random variables by truncating them at a carefully chosen level and then applying standard concentration bounds to the truncated variables while accounting for the probability of truncation. Since the $X_i$ are nonnegative and potentially heavy-tailed, their large values can dominate the behavior of the sample mean.
To handle this, we introduce a truncation level $B>0$ and define the truncated random variables
$$X_i^B = \min(X_i, B) = X_i\,\mathbf{1}\{X_i\le B\} + B\,\mathbf{1}\{X_i>B\}.$$
Thus $X_i^B$ equals $X_i$ when $X_i\le B$ and equals $B$ otherwise; note $0\le X_i^B\le B$, so $X_i^B$ is bounded. Let $S_n = \sum_{i=1}^n X_i$ and let $\mu = \mathbb{E}[X_i]$ be the common expected value. We can write the sample mean as
$$\frac{S_n}{n} = \frac{1}{n}\sum_{i=1}^n X_i = \frac{1}{n}\sum_{i=1}^n X_i^B + \frac{1}{n}\sum_{i=1}^n(X_i - X_i^B). \quad (130)$$
Since $X_i - X_i^B = (X_i-B)\,\mathbf{1}\{X_i>B\}\ge0$, it follows that
$$\frac{S_n}{n}-\mu \le \left( \frac{1}{n}\sum_{i=1}^n X_i^B - \mathbb{E}[X_i^B] \right) + \left( \mathbb{E}[X_i^B]-\mu \right) + \frac{1}{n}\sum_{i=1}^n(X_i-X_i^B). \quad (131)$$
Since $\mathbb{E}[X_i^B]\le\mu$, the term $\mathbb{E}[X_i^B]-\mu\le0$ can be dropped:
$$\frac{S_n}{n}-\mu \le \left( \frac{1}{n}\sum_{i=1}^n X_i^B - \mathbb{E}[X_i^B] \right) + \frac{1}{n}\sum_{i=1}^n(X_i-X_i^B). \quad (132)$$
We aim to bound $\mathbb{P}(S_n/n - \mu > t)$ for some $t>0$. Using the decomposition above and the nonnegativity of $\frac{1}{n}\sum_i(X_i-X_i^B)$,
$$\mathbb{P}\left( \frac{S_n}{n}-\mu > t \right) \le \mathbb{P}\left( \frac{1}{n}\sum_{i=1}^n\left( X_i^B - \mathbb{E}[X_i^B] \right) > t - \frac{1}{n}\sum_{i=1}^n(X_i-X_i^B) \right). \quad (133)$$
Let $A = \{\max_{i=1,\dots,n}X_i\le B\}$. On $A$ we have $X_i = X_i^B$ for all $i$, so $\frac{1}{n}\sum_i(X_i-X_i^B) = 0$; on $A^c$, at least one $X_i>B$. Therefore,
$$\mathbb{P}\left( \frac{S_n}{n}-\mu>t \right) \le \mathbb{P}\left( \frac{1}{n}\sum_{i=1}^n\left( X_i^B-\mathbb{E}[X_i^B] \right) > t,\ A \right) + \mathbb{P}(A^c) \le \mathbb{P}\left( \frac{1}{n}\sum_{i=1}^n\left( X_i^B-\mathbb{E}[X_i^B] \right) > t \right) + \mathbb{P}(A^c). \quad (134)\text{-}(135)$$
Bounding $\mathbb{P}(A^c)$:
$$\mathbb{P}(A^c) = \mathbb{P}\left( \bigcup_{i=1}^n\{X_i>B\} \right) \le n\,\mathbb{P}(X_1>B). \quad (136)$$
Using Markov's inequality and the moment condition,
$$\mathbb{P}(X_1>B) = \mathbb{P}\left( X_1^{1+\lambda}>B^{1+\lambda} \right) \le \frac{\mathbb{E}[X_1^{1+\lambda}]}{B^{1+\lambda}} \le \frac{M}{B^{1+\lambda}}, \quad (137)$$
so $\mathbb{P}(A^c)\le nM/B^{1+\lambda}$. Choose $B$ such that $nM/B^{1+\lambda} = \delta/2$, i.e.,
$$B = \left( \frac{2nM}{\delta} \right)^{\frac{1}{1+\lambda}}, \quad (138)$$
so that $\mathbb{P}(A^c)\le\delta/2$. Next, apply Hoeffding's inequality to the i.i.d. bounded variables $X_i^B$:
$$\mathbb{P}\left( \frac{1}{n}\sum_{i=1}^n\left( X_i^B-\mathbb{E}[X_i^B] \right) > s \right) \le \exp\left( -\frac{2ns^2}{B^2} \right). \quad (139)$$
Requiring this probability to be at most $\delta/2$ forces
$$s \ge B\sqrt{\frac{\log(2/\delta)}{2n}}. \quad (140)$$
Substituting $B$ from (138) gives
$$s \ge \left( \frac{2nM}{\delta} \right)^{\frac{1}{1+\lambda}}\sqrt{\frac{\log(2/\delta)}{2n}} = (2M)^{\frac{1}{1+\lambda}}\, n^{\frac{1}{1+\lambda}-\frac{1}{2}}\,\delta^{-\frac{1}{1+\lambda}}\left( \frac{\log(2/\delta)}{2} \right)^{\frac{1}{2}}. \quad (141)\text{-}(142)$$
Combining the two events and simplifying (absorbing the polynomial dependence on $\delta$ into the stated constant, as in Brownlees et al. [2015]), known results for i.i.d. nonnegative random variables with $\mathbb{E}[X_i^{1+\lambda}]\le M$ yield
$$\mathbb{P}\left( \frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}[X_i] > 2M^{\frac{1}{1+\lambda}}\left( \frac{\log(2/\delta)}{n} \right)^{\frac{\lambda}{1+\lambda}} \right) \le \delta.$$
Thus, with probability at least $1-\delta$,
$$\frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}[X_i] \le 2M^{\frac{1}{1+\lambda}}\left( \frac{\log(2/\delta)}{n} \right)^{\frac{\lambda}{1+\lambda}}. \quad (143)$$
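Proposition G.2's estimator is just a clipped mean with a $\delta$-dependent clipping level; a minimal sketch (the function name is ours) follows.

```python
import numpy as np

def truncated_mean(x, M, lam, delta):
    """Clip at B = (2 n M / delta)^{1/(1+lam)} as in (138), so that
    P(max_i X_i > B) <= delta / 2, then average the clipped sample."""
    x = np.asarray(x, dtype=float)
    B = (2.0 * len(x) * M / delta) ** (1.0 / (1.0 + lam))
    return float(np.minimum(x, B).mean())
```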
Lemma G.3 (Reduction to an empirical process). For any hypothesis $h\in\mathcal{H}$ and any risk level $\alpha\in(0,1)$, the deviation between the population CVaR risk $R_\alpha(h)$ and the empirical CVaR risk $\widehat R_\alpha(h)$ satisfies
$$|R_\alpha(h) - \widehat R_\alpha(h)| \le \frac{1}{\alpha}\sup_{\theta\in\mathbb{R}}|(\mathbb{E}-P_n)\,g_{h,\theta}|, \qquad g_{h,\theta}(z) := (\ell(h,z)-\theta)_+.$$

Proof. Recall the variational definitions $R_\alpha(h) = \inf_{\theta\in\mathbb{R}} G(\theta)$ and $\widehat R_\alpha(h) = \inf_{\theta\in\mathbb{R}}\widehat G(\theta)$, where
$$G(\theta) := \theta + \frac{1}{\alpha}\mathbb{E}[g_{h,\theta}(Z)], \qquad \widehat G(\theta) := \theta+\frac{1}{\alpha}P_n[g_{h,\theta}(Z)].$$
We use the elementary property that the difference between the infima of two functions is bounded by the supremum of their difference: for any two functions $f, g : \mathbb{R}\to\mathbb{R}$,
$$\left| \inf_\theta f(\theta) - \inf_\theta g(\theta) \right| \le \sup_\theta|f(\theta)-g(\theta)|.$$
Applying this to $G$ and $\widehat G$, $|R_\alpha(h)-\widehat R_\alpha(h)| \le \sup_{\theta\in\mathbb{R}}|G(\theta)-\widehat G(\theta)|$. Substituting the definitions of $G(\theta)$ and $\widehat G(\theta)$, the linear term $\theta$ cancels:
$$|G(\theta)-\widehat G(\theta)| = \left| \left( \theta + \tfrac{1}{\alpha}\mathbb{E}[g_{h,\theta}] \right) - \left( \theta + \tfrac{1}{\alpha}P_n[g_{h,\theta}] \right) \right| = \frac{1}{\alpha}|(\mathbb{E}-P_n)\,g_{h,\theta}|.$$
Taking the supremum over $\theta$ yields the result.

Lemma G.4 (Lipschitz continuity in $\theta$). Fix any hypothesis $h\in\mathcal{H}$ and define the empirical-process deviation map $\Phi_h:\mathbb{R}\to\mathbb{R}$ by $\Phi_h(\theta) := (\mathbb{E}-P_n)g_{h,\theta}$, where $g_{h,\theta}(z) := (\ell(h,z)-\theta)_+$. Then $\Phi_h$ is $2$-Lipschitz with respect to $\theta$: for all $\theta_1,\theta_2\in\mathbb{R}$,
$$|\Phi_h(\theta_1)-\Phi_h(\theta_2)| \le 2|\theta_1-\theta_2|.$$

Proof. The function $x\mapsto(x)_+ = \max(0,x)$ is $1$-Lipschitz. Therefore, for any fixed $z\in\mathcal{Z}$ and fixed $h\in\mathcal{H}$, the map $\theta\mapsto g_{h,\theta}(z)$ is $1$-Lipschitz:
$$|g_{h,\theta_1}(z) - g_{h,\theta_2}(z)| = |(\ell(h,z)-\theta_1)_+ - (\ell(h,z)-\theta_2)_+| \le |(\ell(h,z)-\theta_1)-(\ell(h,z)-\theta_2)| = |\theta_1-\theta_2|.$$
Now consider the deviation term. By the triangle inequality,
$$|\Phi_h(\theta_1)-\Phi_h(\theta_2)| \le |\mathbb{E}[g_{h,\theta_1}-g_{h,\theta_2}]| + |P_n[g_{h,\theta_1}-g_{h,\theta_2}]|.$$
Using the pointwise Lipschitz property derived above:
1. Population term: $|\mathbb{E}[g_{h,\theta_1}-g_{h,\theta_2}]| \le \mathbb{E}|g_{h,\theta_1}(Z)-g_{h,\theta_2}(Z)| \le |\theta_1-\theta_2|$.
2. Empirical term: $|P_n[g_{h,\theta_1}-g_{h,\theta_2}]| \le \frac{1}{n}\sum_{i=1}^n|g_{h,\theta_1}(Z_i)-g_{h,\theta_2}(Z_i)| \le |\theta_1-\theta_2|$.
Summing these bounds yields $|\Phi_h(\theta_1)-\Phi_h(\theta_2)| \le 2|\theta_1-\theta_2|$.

Lemma G.5 (Pointwise dominance of the shifted loss). Let $\ell(h,z)\ge0$ and $\theta\ge0$, and define $f_\theta(z) := (\ell(h,z)-\theta)_+$. Then $0\le f_\theta(z)\le\ell(h,z)$ for all $z$.

Proof. If $\ell(h,z)\le\theta$, then $f_\theta(z) = 0\le\ell(h,z)$. If $\ell(h,z)>\theta$, then $f_\theta(z) = \ell(h,z)-\theta\le\ell(h,z)$ since $\theta\ge0$. In both cases the claim holds.

Proposition G.6 (Nonnegativity of the optimal CVaR threshold). Let $L = \ell(h,Z)\ge0$ and $\alpha\in(0,1)$. For a probability measure $P$, define
$$F(\theta;P) := \theta+\frac{1}{\alpha}\mathbb{E}_P[(L-\theta)_+], \qquad R_\alpha^P(h) := \inf_{\theta\in\mathbb{R}}F(\theta;P).$$
Any minimizer $\theta_P^*\in\arg\min_\theta F(\theta;P)$ satisfies $\theta_P^*\ge0$.

Proof. Follows directly from Theorem G.1.

Corollary G.7 (Justification of the truncation comparison). With $\theta = \theta_P^*\ge0$ as in Proposition G.6, Lemma G.5 gives $f_{\theta_P^*}(z) = (\ell(h,z)-\theta_P^*)_+\le\ell(h,z)$. Consequently, for any $T>0$,
$$\mathbb{E}_P\left[ f_{\theta_P^*}(Z)\,\mathbf{1}\{f_{\theta_P^*}(Z)>T\} \right] \le \mathbb{E}_P\left[ \ell(h,Z)\,\mathbf{1}\{\ell(h,Z)>T\} \right],$$
which is the key step enabling tail control by the $(1+\lambda)$-moment bound.

Lemma G.8 (Lipschitz continuity of CVaR). For any random variables $X, Y\ge0$ and $\alpha\in(0,1)$,
$$|R_\alpha(X)-R_\alpha(Y)| \le \frac{1}{\alpha}\mathbb{E}|X-Y|.$$

Proof. Using the dual representation $R_\alpha(X) = \inf_\theta\{\theta+\frac{1}{\alpha}\mathbb{E}[(X-\theta)_+]\}$, let $\theta_Y^*$ be optimal for $Y$. Then
$$R_\alpha(X)-R_\alpha(Y) \le \theta_Y^*+\frac{1}{\alpha}\mathbb{E}[(X-\theta_Y^*)_+] - \theta_Y^*-\frac{1}{\alpha}\mathbb{E}[(Y-\theta_Y^*)_+] = \frac{1}{\alpha}\mathbb{E}\left[ (X-\theta_Y^*)_+-(Y-\theta_Y^*)_+ \right] \le \frac{1}{\alpha}\mathbb{E}|X-Y|.$$
Symmetrically, $R_\alpha(Y)-R_\alpha(X) \le \frac{1}{\alpha}\mathbb{E}|X-Y|$.
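Lemma G.8 holds for any coupling, in particular for paired samples under the empirical measure, so it can be sanity-checked directly; the perturbation below is an arbitrary illustrative choice.

```python
import numpy as np

def cvar(x, alpha):
    t = np.quantile(x, 1.0 - alpha)               # an RU minimizer (Lemma G.3 setup)
    return t + np.mean(np.maximum(x - t, 0.0)) / alpha

rng = np.random.default_rng(7)
x = rng.pareto(2.5, size=20_000)
y = x + rng.normal(0.0, 0.05, size=x.size) ** 2   # small nonnegative perturbation
alpha = 0.1
lhs = abs(cvar(x, alpha) - cvar(y, alpha))
rhs = np.mean(np.abs(x - y)) / alpha
print(lhs <= rhs)  # True: |R_a(X) - R_a(Y)| <= E|X - Y| / alpha
```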
Lemma G.9 (Pseudo-Dimension of Truncated CVaR Class). Let $\mathcal{H}$ be a hypothesis class with $\mathrm{Pdim}(\mathcal{H}) \le d$. For $B > 0$, define
$$\mathcal{F}_B = \Big\{ z \mapsto \min\big((\ell(h,z) - \theta)_+,\, B\big) : h \in \mathcal{H},\ \theta \in [0, R] \Big\},$$
where $R = M^{1/(1+\lambda)}/\alpha$. Then there exists an absolute constant $C > 0$ such that $\mathrm{Pdim}(\mathcal{F}_B) \le C(d+1)$.

Proof. Define the base class $\mathcal{L} := \{z \mapsto \ell(h,z) : h \in \mathcal{H}\}$, so that $\mathrm{Pdim}(\mathcal{L}) \le d$. We construct $\mathcal{F}_B$ from $\mathcal{L}$ by a finite sequence of operations, each of which increases the pseudo-dimension by at most a constant factor. Consider the augmented class
$$\mathcal{G}_1 := \{(z, t) \mapsto \ell(h,z) - t : h \in \mathcal{H},\ t \in [0, R]\}.$$
By standard results on pseudo-dimension under the addition of a real parameter, $\mathrm{Pdim}(\mathcal{G}_1) \le d + 1$. Next define
$$\mathcal{G}_2 := \{(z, t) \mapsto (\ell(h,z) - t)_+ : h \in \mathcal{H},\ t \in [0, R]\}.$$
The map $x \mapsto x_+ = \max\{x, 0\}$ is the maximum of two affine functions; by closure of the pseudo-dimension under finite maxima, $\mathrm{Pdim}(\mathcal{G}_2) \le C_1(d+1)$ for a universal constant $C_1$. Finally, applying the standard truncation argument used before,
$$\mathcal{F}_B = \{(z, t) \mapsto \min\{g(z, t), B\} : g \in \mathcal{G}_2\}.$$
Since $x \mapsto \min\{x, B\}$ is the minimum of $x$ and a constant function,
$$\mathrm{Pdim}(\mathcal{F}_B) \le C_2\, \mathrm{Pdim}(\mathcal{G}_2) \le C(d+1)$$
for an absolute constant $C$. Combining the steps completes the proof. $\square$

Lemma G.10 (Bias and variance of truncation). Let $X$ be a nonnegative real-valued random variable such that $\mathbb{E}[|X|^{1+\lambda}] \le M'$ for some $\lambda \in (0, 1]$ and $M' > 0$. Define the truncated variable $X^B := \min\{X, B\}$ for $B > 0$. Then:
(i) (Bias) $\big|\mathbb{E}[X^B] - \mathbb{E}[X]\big| \le M' B^{-\lambda}$.
(ii) (Variance) $\mathrm{Var}(X^B) \le M' B^{1-\lambda}$.

Proof. We treat the two claims separately. Observe that
$$\mathbb{E}[X^B] - \mathbb{E}[X] = \mathbb{E}\big[(X^B - X)\,\mathbf{1}\{X > B\}\big] = -\mathbb{E}\big[(X - B)\,\mathbf{1}\{X > B\}\big].$$
Hence $\big|\mathbb{E}[X^B] - \mathbb{E}[X]\big| \le \mathbb{E}\big[|X|\,\mathbf{1}\{X > B\}\big]$. On the event $\{X > B\}$ we have $1 \le (X/B)^\lambda$, and therefore
$$|X| = |X|^{1+\lambda}\,|X|^{-\lambda} \le |X|^{1+\lambda} B^{-\lambda}.$$
Substituting this inequality yields
$$\mathbb{E}\big[|X|\,\mathbf{1}\{X > B\}\big] \le B^{-\lambda}\,\mathbb{E}[|X|^{1+\lambda}] \le M' B^{-\lambda}.$$
For the variance, since $\mathrm{Var}(X^B) \le \mathbb{E}[(X^B)^2]$, it suffices to bound the second moment. Write $(X^B)^2 = |X^B|^{1-\lambda}\,|X^B|^{1+\lambda}$. Using $|X^B| \le B$ and $|X^B| \le |X|$, we obtain $|X^B|^{1-\lambda} \le B^{1-\lambda}$ and $|X^B|^{1+\lambda} \le |X|^{1+\lambda}$. Thus
$$\mathbb{E}[(X^B)^2] \le B^{1-\lambda}\,\mathbb{E}[|X|^{1+\lambda}] \le M' B^{1-\lambda}. \qquad \square$$
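Lemma G.10 quantifies the bias-variance tradeoff in the truncation level $B$: raising $B$ shrinks the bias at rate $B^{-\lambda}$ while the variance bound grows like $B^{1-\lambda}$. The following small Monte Carlo sanity check is our illustration (not the paper's experiment); the Pareto tail index and the empirical estimate of $M'$ are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sanity check of Lemma G.10. The tail index 1 + lam + 0.5 ensures that
# E[X^{1+lam}] is finite; M_prime below is a Monte Carlo estimate of it.
lam = 0.5
X = rng.pareto(1.0 + lam + 0.5, size=10**6) + 1.0   # classical Pareto on [1, inf)
M_prime = np.mean(X ** (1.0 + lam))                 # estimate of E[X^{1+lam}]

for B in (5.0, 20.0, 100.0):
    XB = np.minimum(X, B)                           # truncated variable X^B
    bias = abs(XB.mean() - X.mean())                # observed truncation bias
    print(f"B={B:6.1f}  bias={bias:.4f} (bound {M_prime * B**(-lam):.4f})  "
          f"var={XB.var():.3f} (bound {M_prime * B**(1.0 - lam):.3f})")
```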
Lemma G.11 (Uniform MoM concentration under contamination). Let $\mathcal{G}$ be a class of functions mapping into $[0, B]$ such that $\sup_{g \in \mathcal{G}} \mathrm{Var}(g(Z)) \le \sigma^2$, and assume its covering numbers satisfy $\log N(\mathcal{G}, \|\cdot\|_\infty, \eta) \le d \log(C/\eta)$. Consider the $\epsilon$-contamination model. Partition the sample into $K$ blocks of size $m = n/K$, with $K \ge 8\log(1/\delta)$ and $m \ge C_0 d$ for a sufficiently large constant $C_0$. Then, with probability at least $1 - \delta$,
$$\sup_{g \in \mathcal{G}} \Big| \operatorname*{Median}_{j \in [K]}\, \widehat{\mu}_j(g) - \mathbb{E}[g] \Big| \le C\Big(\sigma\sqrt{\frac{d}{m}} + \frac{Bd}{m} + \epsilon B\Big),$$
where $\widehat{\mu}_j(g) = \frac{1}{m}\sum_{i \in \mathcal{B}_j} g(z_i)$.

Proof. Let $N_j$ denote the number of corrupted points in block $\mathcal{B}_j$. Since the adversary corrupts exactly $\epsilon n = \epsilon K m$ samples and the data are randomly permuted, $(N_1, \dots, N_K)$ follows a multivariate hypergeometric distribution, and deterministically
$$\sum_{j=1}^K N_j = \epsilon K m.$$
Hence, by a counting argument, at most $0.1K$ blocks can satisfy $N_j > 10\epsilon m$. Define $\mathcal{J}_{\mathrm{low}} := \{j : N_j \le 10\epsilon m\}$, so that $|\mathcal{J}_{\mathrm{low}}| \ge 0.9K$. For any $j \in \mathcal{J}_{\mathrm{low}}$ and any $g \in \mathcal{G}$, since $g \in [0, B]$,
$$\Big| \frac{1}{m}\sum_{i \in \mathcal{B}_j} g(z_i) - \frac{1}{m}\sum_{i \in \mathcal{B}_j \cap \mathrm{clean}} g(z_i) \Big| \le \frac{N_j}{m}\, B \le 10\epsilon B.$$
Thus, corruption introduces a deterministic bias of at most $10\epsilon B$ on these blocks.

For uniform concentration on clean data, fix a block $j$ and consider only its clean samples. By Bernstein's inequality combined with a union bound over an $\eta$-net of $\mathcal{G}$ and standard chaining, there exists a constant $C$ such that if $m \ge C_0 d$, then with probability at least $0.9$,
$$\sup_{g \in \mathcal{G}} \Big| \frac{1}{m}\sum_{i \in \mathcal{B}_j \cap \mathrm{clean}} g(z_i) - \mathbb{E}[g] \Big| \le C\Big(\sigma\sqrt{\frac{d}{m}} + \frac{Bd}{m}\Big).$$
Call this event $E_j$. The random permutation ensures approximate independence across blocks, and a Chernoff bound yields that, with probability at least $1 - \delta$, at least $0.8K$ blocks satisfy $E_j$. Let $\mathcal{J}_{\mathrm{valid}} := \mathcal{J}_{\mathrm{low}} \cap \{j : E_j \text{ holds}\}$. With probability at least $1 - \delta$, $|\mathcal{J}_{\mathrm{valid}}| \ge 0.7K > K/2$. For any $j \in \mathcal{J}_{\mathrm{valid}}$ and any $g \in \mathcal{G}$,
$$|\widehat{\mu}_j(g) - \mathbb{E}[g]| \le C\Big(\sigma\sqrt{\frac{d}{m}} + \frac{Bd}{m}\Big) + 10\epsilon B.$$
Since a strict majority of blocks satisfies this bound, the median must lie within the same range. Therefore, after absorbing constants,
$$\sup_{g \in \mathcal{G}} \Big| \operatorname*{Median}_{j}\, \widehat{\mu}_j(g) - \mathbb{E}[g] \Big| \le C\Big(\sigma\sqrt{\frac{d}{m}} + \frac{Bd}{m} + \epsilon B\Big). \qquad \square$$

H On the Algorithmic Aspects of the Robust CVaR-ERM Estimator

H.1 $\eta$-cover based Algorithm

We discretize both $\mathcal{H}$ and the $\theta$-range.

Finite $\eta$-net for $\mathcal{H}$. For $\eta_h > 0$, let $N_h(\eta_h)$ be any finite set such that for every $h \in \mathcal{H}$ there exists $h' \in N_h(\eta_h)$ with $\|h - h'\|_2 \le \eta_h$. Since $\mathcal{H}$ is contained in the Euclidean ball of radius $R$, one can take $N_h(\eta_h)$ with cardinality bounded by
$$|N_h(\eta_h)| \le \Big(1 + \frac{2R}{\eta_h}\Big)^{d}.$$
(For example, take a lattice grid of mesh $\eta_h/\sqrt{d}$ and intersect it with the ball.)

$\eta_\theta$-grid for $\theta$. For $\eta_\theta > 0$, define
$$N_\theta(\eta_\theta) := \{0, \eta_\theta, 2\eta_\theta, \dots\} \cap [0, T], \qquad |N_\theta(\eta_\theta)| \le 1 + \frac{T}{\eta_\theta}.$$
Throughout, $\varphi_B(h, \theta; Z)$ denotes the truncated CVaR loss for the sample $Z$ at the parameters $(h, \theta)$ with truncation threshold $B$.

Algorithm 1: Discretized MOM-CVaR-ERM with truncation (implementable)
1: Input: data $Z_1, \dots, Z_n$; $\alpha \in (0,1)$; truncation level $B$; block count $K \ge 8$ with $n = Km$; net radii $\eta_h, \eta_\theta$; known bounds $R, T$.
2: Draw a uniform random permutation $\pi$ and form blocks $\mathcal{B}_1, \dots, \mathcal{B}_K$ of size $m$.
3: Construct a finite $\eta_h$-net $N_h(\eta_h)$ of $\mathcal{H}$ inside the ball of radius $R$.
4: Construct the grid $N_\theta(\eta_\theta)$ of $[0, T]$.
5: for each $h \in N_h(\eta_h)$ do
6:  for each $\theta \in N_\theta(\eta_\theta)$ do
7:   Compute the block risks $\widehat{R}_j(h, \theta) = \frac{1}{m}\sum_{i \in \mathcal{B}_j} \varphi_B(h, \theta; Z_i)$ for $j = 1, \dots, K$.
8:   Compute $\widehat{R}_{\mathrm{MOM}}(h, \theta) = \mathrm{median}\big(\widehat{R}_1, \dots, \widehat{R}_K\big)$.
9:  end for
10: end for
11: Output $(\widehat{h}, \widehat{\theta}) \in \arg\min_{h \in N_h(\eta_h),\, \theta \in N_\theta(\eta_\theta)} \widehat{R}_{\mathrm{MOM}}(h, \theta)$.

Implementability. The search set is finite, hence the algorithm terminates in finite time and returns an exact minimizer over the discretization.
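To make the discretized procedure concrete, here is a minimal sketch of Algorithm 1 in NumPy. The function name, the interface, and in particular the concrete form $\varphi_B(h, \theta; z) = \theta + \frac{1}{\alpha}\min\big((\ell(h,z) - \theta)_+, B\big)$ are our assumptions, chosen to be consistent with the truncated class $\mathcal{F}_B$ of Lemma G.9; the pseudocode above fixes $\varphi_B$ only abstractly.

```python
import numpy as np

def mom_cvar_erm(Z, losses_fn, H_net, theta_grid, alpha, B, K, seed=0):
    """Sketch of Algorithm 1 (discretized MOM-CVaR-ERM with truncation).
    Assumed loss: phi_B(h, theta; z) = theta + min((ell(h,z) - theta)_+, B) / alpha.
    `H_net` is a finite eta_h-net of the hypothesis class, `theta_grid` the
    eta_theta-grid of [0, T]; `losses_fn(h, Z)` returns the n per-sample losses."""
    n = len(Z)
    m = n // K
    rng = np.random.default_rng(seed)
    blocks = rng.permutation(n)[: K * m].reshape(K, m)    # random blocks B_1..B_K

    best = (None, None, np.inf)
    for h in H_net:                                       # loop over the eta_h-net
        ell = np.asarray(losses_fn(h, Z), dtype=float)    # ell(h, z_i), i = 1..n
        for theta in theta_grid:                          # loop over the theta-grid
            phi = theta + np.minimum(np.maximum(ell - theta, 0.0), B) / alpha
            r_mom = np.median(phi[blocks].mean(axis=1))   # median of K block risks
            if r_mom < best[2]:
                best = (h, theta, r_mom)
    return best                                           # (h_hat, theta_hat, MOM risk)

# Hypothetical usage with a scalar quadratic loss on heavy-tailed data:
# rng = np.random.default_rng(3)
# Z = rng.standard_t(df=2, size=500)
# h_hat, theta_hat, risk = mom_cvar_erm(
#     Z, lambda h, Z: (Z - h) ** 2, np.linspace(-1, 1, 21),
#     np.linspace(0.0, 10.0, 50), alpha=0.1, B=50.0, K=10)
```

The double loop makes the finite-search structure of Algorithm 1 explicit; its cost is $|N_h(\eta_h)| \cdot |N_\theta(\eta_\theta)| \cdot O(n)$, matching the implementability remark above.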
