Statistical Guarantees for Distributionally Robust Optimization with Optimal Transport and OT-Regularized Divergences
We study finite-sample statistical performance guarantees for distributionally robust optimization (DRO) with optimal transport (OT) and OT-regularized divergence model neighborhoods. Specifically, we derive concentration inequalities for supervised …
Authors: Jeremiah Birrell, Xiaoxi Shen
Statistical Guarantees for Distributionally Robust Optimization with Optimal Transport and OT-Regularized Divergences

A Preprint

Jeremiah Birrell
Department of Mathematics, Texas State University, San Marcos, TX, USA
jbirrell@txstate.edu

Xiaoxi Shen
Department of Mathematics, Texas State University, San Marcos, TX, USA
rcd67@txstate.edu

March 31, 2026

Abstract

We study finite-sample statistical performance guarantees for distributionally robust optimization (DRO) with optimal transport (OT) and OT-regularized divergence model neighborhoods. Specifically, we derive concentration inequalities for supervised learning via DRO-based adversarial training, as commonly employed to enhance the adversarial robustness of machine learning models. Our results apply to a wide range of OT cost functions, beyond the p-Wasserstein case studied by previous authors. In particular, our results are the first to: 1) cover soft-constraint norm-ball OT cost functions; soft-constraint costs have been shown empirically to enhance robustness when used in adversarial training, and 2) apply to the combination of adversarial sample generation and adversarial reweighting that is induced by using OT-regularized f-divergence model neighborhoods; the added reweighting mechanism has also been shown empirically to further improve performance. In addition, even in the p-Wasserstein case, our bounds exhibit better behavior as a function of the DRO neighborhood size than previous results when applied to the adversarial setting.

Keywords: Adversarial Robustness · Optimal Transport · Information Divergence · Distributionally Robust Optimization · Concentration Inequality

1 Introduction

Distributionally robust optimization (DRO) is a technique for regularizing a stochastic optimization problem, $\inf_\theta E_P[\mathcal{L}_\theta]$, by replacing the $P$-expectation with the worst-case expected value over some neighborhood of 'nearby' models, $\mathcal{U}(P)$, also known as the ambiguity set or uncertainty region in some of the literature. This results in the minimax problem
\[
\inf_\theta \sup_{Q\in\mathcal{U}(P)} E_Q[\mathcal{L}_\theta]. \tag{1}
\]
A variety of model neighborhoods have been studied and employed in prior work, including maximum mean discrepancy (MMD) [44], f-divergence neighborhoods [6, 1, 30, 8, 32], (conditional) moment constraints [27, 18, 50, 13], smoothed f-divergences [36], Sinkhorn divergence [48], Wasserstein neighborhoods [38, 43, 51, 52, 26], and general optimal-transport (OT) neighborhoods [14, 5]. DRO has been used effectively in a number of applications; see, e.g., [31] for an overview. Our focus is on adversarial robustness in machine learning; vulnerability to adversarial samples is a well-known weakness of machine learning models (especially deep learning) [41, 28]. Adversarial samples are inputs that are intentionally modified by an adversary to mislead the model, e.g., causing a harmful misclassification of the input. Adversarial training is a popular class of techniques for mitigating this issue; in adversarial training, adversarial samples are constructed in a way that mimics the efforts of an attacker and are then employed during training. Having such a simulated attacker compete with the model during training results in a more robust trained model [40, 37, 29, 49, 53, 54, 20, 42, 17, 19, 12].
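To make the DRO problem (1) concrete, the following toy sketch (Python/NumPy; an illustration constructed for this exposition, not an example from the paper) computes the worst-case expected loss over a KL-divergence ball around a two-point distribution by a direct search over reweightings, and compares it to the plain expectation.

```python
import numpy as np

# Toy illustration of (1): worst-case expected loss over a KL-divergence ball
# around a two-point distribution, found by direct search over reweightings.
L = np.array([1.0, 3.0])   # losses of the two outcomes
P = np.array([0.8, 0.2])   # base model
r = 0.1                    # neighborhood size

qs = np.linspace(1e-6, 1 - 1e-6, 20001)
Qs = np.stack([qs, 1 - qs], axis=1)            # candidate models Q = (q, 1 - q)
kls = np.sum(Qs * np.log(Qs / P), axis=1)      # KL(Q || P) for each candidate
worst_case = np.max(Qs[kls <= r] @ L)

print("E_P[L]                   =", float(P @ L))        # 1.4
print("sup_{KL(Q||P)<=r} E_Q[L] =", float(worst_case))   # larger: the adversary upweights the high-loss outcome
```

Increasing the neighborhood size $r$ enlarges $\mathcal{U}(P)$ and hence the gap between the robust and non-robust objectives; the OT and OT-regularized divergence neighborhoods studied below additionally allow the adversary to move samples, not only reweight them.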
Many commonly used adversarial training methods can be formulated as DRO problems (1). Here we focus on adversarial training based on DRO with OT and OT-regularized divergence neighborhoods. OT-regularized divergences were recently introduced in [12] for the purpose of enhancing adversarial robustness in deep learning. Specifically, we consider OT-regularized f-divergences, defined via an infimal convolution of an OT cost, $C$, and an f-divergence, $D_f$, as follows:
\[
D_f^c(\nu\|\mu) := \inf_{\eta\in\mathcal{P}(Z)} \{ D_f(\eta\|\mu) + C(\eta,\nu) \}, \tag{2}
\]
where $\mathcal{P}(Z)$ denotes the space of probability measures on $Z$. In [12] it was shown that DRO with OT-regularized f-divergence neighborhoods corresponds to adversarial sample generation (due to the OT cost) along with adversarial sample reweighting (due to the f-divergence); combining these two mechanisms enhances previous DRO-based approaches to adversarial training, such as PGD [37], TRADES [53], MART [49], and UDR [17].

In this work we derive finite-sample statistical guarantees, in the form of concentration inequalities, for DRO with OT and OT-regularized f-divergences, as applied to supervised learning. Our main contributions are the following:

1. We derive concentration inequalities for OT-DRO that apply to a much more general class of OT cost functions than considered in previous works, in both the classification and regression settings; see Theorems 3.8 and 3.10. The increased generality facilitates applications to adversarial training methods which were not covered by prior approaches. In addition, even in the p-Wasserstein case studied in previous works, our results have better dependence on the neighborhood size than other methods that treat the robust training setting; see Remark 3.12.

2. We provide the first analysis of statistical guarantees for OT-regularized f-divergence DRO, as used to enhance the adversarial robustness of classification models; see Theorems 4.3 and 4.9. This recent class of DRO methods combines sample reweighting with adversarial sample generation and has been shown empirically to improve adversarial training performance.

1.1 Comparison with Related Works

Statistical guarantees for DRO with model neighborhoods of various types have been studied previously by a number of authors. First we note the methods based on Wasserstein-metric neighborhoods [34, 43, 2, 16, 24, 4, 25, 3]; see also the tutorial article [15]. Of these, [34] focuses on a fixed neighborhood size $r$, which is the appropriate setting for adversarial training; [43, 16, 24, 3] focus on a neighborhood size that shrinks to 0 as the number of samples $n\to\infty$, as is appropriate when DRO is used as a regularization term to protect against overfitting; and [2, 4, 25] cover both cases. However, to the authors' knowledge, there are no existing works showing convergence of the empirical OT-DRO problem to the population problem which cover more general OT cost functions beyond the p-Wasserstein family. In this work we consider a class of OT cost functions that simultaneously generalizes p-Wasserstein and norm-ball constraint costs, and we focus on bounds that behave well for a fixed, small neighborhood size; both factors facilitate applications to adversarial robustness. Another stream of work focuses on information-theoretic model neighborhoods [7, 33, 9, 21, 2, 22, 23].
Of these, the closest to the present work is [23], which obtained finite-sample statistical guarantees for DRO with Cressie-Read divergences, a family of f-divergences closely related to the α-divergences (8), as a method for addressing distributional shifts. However, information-theoretic neighborhoods cannot account for a change in the support of the distribution, as is required for applications to adversarial robustness. To unify the OT and information-theoretic approaches, our statistical guarantees apply to the OT-regularized f-divergences (2), developed in [12] for application to adversarial robustness; these divergences interpolate between the OT and information-theoretic approaches, thereby allowing for adversarial sample generation along with sample reweighting. Our simultaneous study of more general OT costs and general f-divergences requires several innovative techniques. In particular, the Cressie-Read divergence case allows for an analytical simplification that is not possible for the more general class of f-divergences covered by our results.

Finally, we mention several other recent methods that combine OT and information-theoretic ingredients in DRO. The approach of [4] studies Wasserstein DRO with a KL-divergence penalty between the transport plan and a Gaussian mixture centered on the samples; this is a very different approach from the two-stage transport-reweighting mechanism inherent in our OT-regularized divergences, both conceptually and computationally. Closer to our approach are those of [13, 36]. While distinct in general, the approach of [13] overlaps with the OT-regularized divergence framework [12] studied here in the f-divergence case, but [13] does not consider the statistical properties of the method. The frameworks of [36] and [12] are distinct when applied to f-divergences, and the former derives statistical guarantees that apply to smoothing via 1-Wasserstein and Lévy-Prokhorov metrics (see their Section 5.2); specifically, they show that their empirical DRO problem upper bounds the population DRO problem with high probability. This conclusion is weaker than what is produced by the results in this paper, which show convergence of the empirical problem to the population problem as $n\to\infty$; the latter type of result is more meaningful in the adversarial training setting. Moreover, we obtain results for more general OT cost functions. The OT-regularized divergence approach studied here has already been shown to be a practical computational tool for enhancing the adversarial robustness of machine learning models, further motivating the study of its statistical properties in settings adapted to that application.

2 Background and Main Results

In this section we provide the necessary background on DRO with OT and OT-regularized f-divergences and summarize our main results. Details regarding the required technical assumptions, along with the proofs, can be found in the subsequent sections.

2.1 Background

We let $Z$ denote a Polish space (i.e., a complete separable metric space) and let $\mathcal{P}(Z)$ denote the space of Borel probability measures on $Z$. We will use the term cost function to refer to a lower semicontinuous (LSC) function $c : Z\times Z \to [0,\infty]$.
The associated optimal-transport (OT) cost is defined by $C : \mathcal{P}(Z)\times\mathcal{P}(Z)\to[0,\infty]$,
\[
C(\mu,\nu) := \inf_{\pi\in\mathcal{P}(Z\times Z):\,\pi_1=\mu,\,\pi_2=\nu} \int c\, d\pi, \tag{3}
\]
where $\pi_1, \pi_2$ denote the marginal distributions; see, e.g., [46] for background on optimal transport. The following result, adapted from [14, 26, 55], provides an important dual formulation of the OT-DRO problem.

Proposition 2.1. Let $c$ be a cost function that satisfies $c(z,z)=0$ for all $z\in Z$, let $\mathcal{L} : Z\to\mathbb{R}$ be measurable and bounded below, and let $P\in\mathcal{P}(Z)$. Then for all $r>0$ we have
\[
\sup_{Q : C(P,Q)\le r} E_Q[\mathcal{L}] = \inf_{\lambda>0}\{\lambda r + E_P[\mathcal{L}^c_\lambda]\}, \tag{4}
\]
where
\[
\mathcal{L}^c_\lambda(z) := \sup_{\tilde z\in Z}\{\mathcal{L}(\tilde z) - \lambda c(z,\tilde z)\} \tag{5}
\]
and we employ the convention $\infty-\infty := -\infty$.

Remark 2.2. Note that $\mathcal{L}^c_\lambda$ is universally measurable; we do not distinguish between $P$ and its completion in our notation. $\mathcal{L}^c_\lambda$ is known as the $c$-transform in the optimal transport literature; see Definition 5.2 in [46].

In practice, the supremum over $\tilde z$ in (5) represents the generation of an adversarial sample, $\tilde z$, paired with the original sample, $z$, i.e., the action of the simulated attacker.

The OT-regularized divergences, (2), mix an OT cost with an f-divergence, with the latter defined as follows. For $a,b\in[-\infty,\infty]$ satisfying $-\infty\le a<1<b\le\infty$ we define $\mathcal{F}_1(a,b)$ to be the set of convex functions $f : (a,b)\to\mathbb{R}$ with $f(1)=0$. Given $f\in\mathcal{F}_1(a,b)$, the corresponding f-divergence between $\nu,\mu\in\mathcal{P}(Z)$ is defined by
\[
D_f(\nu\|\mu) = \begin{cases} E_\mu[f(d\nu/d\mu)], & \nu\ll\mu, \\ \infty, & \nu\not\ll\mu, \end{cases} \tag{6}
\]
where the definition of $f$ in (6) is extended to $[a,b]$ by continuity and is set to $\infty$ on $[a,b]^c$. The key result from [12] that we will require is the following dual formulation of the OT-regularized f-divergence DRO problem.

Proposition 2.3. Suppose we have the following:
1. A measurable function $\mathcal{L} : Z\to(-\infty,\infty]$ that is bounded below.
2. $P\in\mathcal{P}(Z)$.
3. $f\in\mathcal{F}_1(a,b)$, where $a\ge 0$.
4. A cost function, $c$, that satisfies $c(z,z)=0$ for all $z\in Z$.
Then for $r>0$ we have
\[
\sup_{Q : D^c_f(Q\|P)\le r} E_Q[\mathcal{L}] = \inf_{\lambda>0,\,\rho\in\mathbb{R}}\{\lambda r + \rho + \lambda E_P[f^*((\mathcal{L}^c_\lambda - \rho)/\lambda)]\}, \tag{7}
\]
where $\mathcal{L}^c_\lambda$ was defined in (5) and we employ the convention $f^*(\infty) := \infty$.

Important examples of OT-regularized f-divergences include those constructed using the α-divergences, defined in terms of
\[
f_\alpha(t) = \frac{t^\alpha - 1}{\alpha(\alpha-1)}, \quad \alpha>1, \tag{8}
\]
which has Legendre transform
\[
f^*_\alpha(t) = \alpha^{-1}(\alpha-1)^{\alpha/(\alpha-1)}\max\{t,0\}^{\alpha/(\alpha-1)} + \frac{1}{\alpha(\alpha-1)}, \quad \alpha>1, \tag{9}
\]
along with the KL divergence case, defined using $f_{KL}(t) = t\log(t)$, for which one has the simplification
\[
\sup_{Q : KL^c(Q\|P)\le r} E_Q[\mathcal{L}] = \inf_{\lambda>0}\{\lambda r + \lambda\log E_P[\exp(\lambda^{-1}\mathcal{L}^c_\lambda)]\}. \tag{10}
\]
Practical implementations of DRO for adversarial training utilize the dual formulations, i.e., the right-hand sides of (4), (7), and (10); these will therefore be our focus for the remainder of this paper. A minimal numerical sketch of how these dual objectives can be evaluated is given below.
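The following sketch (Python; illustrative only, not the authors' implementation) evaluates the empirical dual objectives on the right-hand sides of (4) and (10). It assumes access to a hypothetical oracle `robust_loss(z, lam)` that approximates the c-transform $\mathcal{L}^c_\lambda(z)$ from (5), e.g., via projected gradient ascent over adversarial perturbations of $z$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ot_dro_dual(robust_loss, z_samples, r, lam_max=100.0):
    """Empirical OT-DRO dual objective (4): inf_{lam > 0} { lam*r + (1/n) sum_i L^c_lam(z_i) }.

    `robust_loss(z, lam)` is an oracle approximating the c-transform (5), e.g., by
    projected gradient ascent over adversarial perturbations of the sample z.
    """
    def objective(lam):
        return lam * r + np.mean([robust_loss(z, lam) for z in z_samples])
    res = minimize_scalar(objective, bounds=(1e-6, lam_max), method="bounded")
    return res.fun, res.x

def kl_ot_dro_dual(robust_loss, z_samples, r, lam_max=100.0):
    """Empirical OT-regularized KL dual objective (10):
    inf_{lam > 0} { lam*r + lam*log( (1/n) sum_i exp(L^c_lam(z_i)/lam) ) }."""
    def objective(lam):
        vals = np.array([robust_loss(z, lam) for z in z_samples]) / lam
        m = vals.max()  # stabilized log-mean-exp
        return lam * r + lam * (m + np.log(np.mean(np.exp(vals - m))))
    res = minimize_scalar(objective, bounds=(1e-6, lam_max), method="bounded")
    return res.fun, res.x
```

In an adversarial-training loop the oracle would be realized by the attack step, and the outer minimization over $\lambda$ (and, for (7), over $\rho$) would typically be folded into the stochastic-gradient updates; the restriction of $\lambda$ to a bounded bracket above mirrors the technical issue, addressed in Section 3, of controlling the non-compact $\lambda$-domain.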
2.2 Summary of the Main Results

Next we summarize the main results of this paper; details regarding the required assumptions and proofs can be found in Sections 3-4. Given our focus on supervised learning, samples $z\in Z$ have the form $z=(x,y)$, where $x\in\mathbb{R}^d$ is the predictor and $y$ the response (i.e., label). For OT-DRO we consider a general $y$ (i.e., either regression or classification), while for OT-regularized f-divergence DRO our approach is applicable to discrete $y$ (classification) only.

First, in Theorem 3.8, we prove that the optimal value of the $n$-sample empirical OT-DRO problem is close to the corresponding population value with high probability:
\[
P^n\!\left( \pm\!\left( \inf_{\theta\in\Theta}\sup_{Q: C(P,Q)\le r} E_Q[\mathcal{L}_\theta] - \inf_{\theta\in\Theta}\sup_{Q: C(P_n,Q)\le r} E_Q[\mathcal{L}_\theta] \right) \ge 2D_n + \epsilon \right) \le \exp\!\left(-\frac{2\epsilon^2 n}{\beta^2}\right) \tag{11}
\]
for all $n\in\mathbb{Z}^+$, $\epsilon>0$, $r>0$, where $P_n$ denotes the empirical distribution corresponding to the samples $z_i\sim P$, $i=1,\dots,n$, $P^n$ denotes the product measure (i.e., the samples are i.i.d.), $\beta$ is an upper bound on $\mathcal{L}_\theta$, $\theta\in\Theta$, and $D_n$, defined in (117), is a measure of the complexity of the function class $\{\mathcal{L}_\theta : \theta\in\Theta\}$; under appropriate assumptions, $D_n$ approaches 0 as $n\to\infty$. We build on this result in Theorem 3.10 by showing that a solution to the empirical risk minimization (ERM) OT-DRO problem is also an approximate solution to the population DRO problem with high probability.

Our Theorems 3.8 and 3.10 apply to a wide range of OT cost functions, beyond the p-Wasserstein case that has been studied in previous works [34, 43, 2, 16, 24, 4, 25, 3]. In particular, the increased generality of our results facilitates applications to the adversarial training methods from [37, 53, 49, 17]. The aforementioned adversarial training methods are either based on the PGD cost function
\[
c_\delta((x,y),(\tilde x,\tilde y)) := \infty\cdot 1_{\|x-\tilde x\|_{\mathcal{X}}>\delta} + \infty\cdot 1_{y\neq\tilde y}, \quad \delta\ge 0, \tag{12}
\]
or on soft-constraint relaxations of it. In particular, [17] proposed a general class of soft-constraint relaxations of (12); in practice, the soft constraint allows the adversarial sample, $\tilde x$, to leave the $\delta$-ball centered at $x$, but only if there is an especially effective adversarial sample nearby. In [17] it was observed empirically that using such relaxations for adversarial training leads to improved adversarial robustness. In this paper we will analyze relaxations of (12) that have the form
\[
c_{\psi,\delta}((x,y),(\tilde x,\tilde y)) := \phi_{\psi,\delta}(\|\tilde x - x\|_{\mathcal{X}}) + \infty\cdot 1_{y\neq\tilde y}, \qquad \phi_{\psi,\delta}(t) := \psi(t-\delta)1_{t\ge\delta}, \tag{13}
\]
for $\delta>0$ and appropriate choices of $\psi$. The function $\psi$ determines how much the adversarial sample is penalized when it leaves the $\delta$-ball; see Section 3.2 for specific examples covered by our results, and see the sketch below for a schematic implementation of the corresponding inner maximization. We note that our framework can be applied to methods which require maintaining an unchanged copy of the original $x$ in addition to the adversarial $\tilde x$, e.g., TRADES [53] and MART [49], by considering the unchanged copy of $x$ to be part of the response, $y$. We emphasize that cost functions of the form (13) are not covered by previous statistical performance guarantees for OT-DRO; addressing this limitation is one of the main motivations for our analysis.
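The following sketch (Python/NumPy; hypothetical helper names, not the authors' code) illustrates the soft-constraint penalty $\phi_{\psi,\delta}$ from (13) with $\psi(t)=\alpha t^q$ and the corresponding inner maximization $\mathcal{L}^c_\lambda(z) = \sup_{\tilde x}\{\mathcal{L}(\tilde x,y) - \lambda\phi_{\psi,\delta}(\|\tilde x - x\|)\}$ from (5), approximated by gradient ascent; a real adversarial-training loop would use automatic differentiation rather than the finite-difference gradients used here for self-containedness.

```python
import numpy as np

def phi(t, psi, delta):
    """Soft-constraint penalty from (13): psi(t - delta) for t >= delta, else 0."""
    return psi(t - delta) if t >= delta else 0.0

def c_transform(loss, x, y, lam, psi, delta, step=0.05, n_steps=100, fd_eps=1e-4):
    """Approximate the c-transform (5) under the cost (13):
    L^c_lam((x, y)) = sup_{x_adv} { loss(x_adv, y) - lam * phi(||x_adv - x||) }.

    Plain gradient ascent; gradients of the penalized objective are taken by
    central finite differences so the sketch has no autodiff dependency.
    """
    def objective(x_adv):
        return loss(x_adv, y) - lam * phi(np.linalg.norm(x_adv - x), psi, delta)

    def grad(x_adv):
        g = np.zeros_like(x_adv)
        for i in range(x_adv.size):
            e = np.zeros_like(x_adv)
            e[i] = fd_eps
            g[i] = (objective(x_adv + e) - objective(x_adv - e)) / (2 * fd_eps)
        return g

    x_adv = x + 0.01 * np.random.randn(*x.shape)  # small random start, as in randomized PGD
    for _ in range(n_steps):
        x_adv = x_adv + step * grad(x_adv)
    return objective(x_adv)

# Toy example: linear score with label y in {-1, +1}; psi(t) = alpha * t**2
# (example 1 of Section 3.2 with q = 2), so excursions beyond the delta-ball are penalized quadratically.
w = np.array([1.0, -1.0, 0.5])
loss = lambda x, y: -y * float(w @ x)          # larger when x crosses the decision boundary
val = c_transform(loss, x=np.zeros(3), y=1, lam=1.0,
                  psi=lambda t: 5.0 * t ** 2, delta=0.1)
```

For the hard-constraint PGD cost (12), the penalty term is replaced by a projection of the adversarial sample onto the $\delta$-ball, recovering standard PGD-style adversarial sample generation.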
In [12] it was further demonstrated that combining the OT costs used by [17] with sample reweighting, in the form of OT-regularized f-divergence DRO, further improves performance; the reweighting mechanism causes the training to focus more on the most troublesome adversarial samples, leading to increased adversarial robustness. This motivates our study of the corresponding statistical guarantees; in Theorem 4.3 we obtain the concentration inequality
\[
P^n\!\left( \pm\!\left( \inf_{\theta\in\Theta}\sup_{Q: D^c_f(Q\|P)\le r} E_Q[\mathcal{L}_\theta] - \inf_{\theta\in\Theta}\sup_{Q: D^c_f(Q\|P_n)\le r} E_Q[\mathcal{L}_\theta] \right) \ge \max\{R_n,\widetilde R_n\} + \epsilon \right) \tag{14}
\]
\[
\le \exp\!\left(-\frac{2n\epsilon^2}{\beta^2}\right) + \exp\!\left(-\frac{2n\epsilon^2}{(\beta(f^*)'_+(-\tilde\nu))^2}\right) + \sum_{y\in\mathcal{Y}} e^{-2n(p_y-p_0)^2},
\]
where $p_y$ is the probability of the class $y\in\mathcal{Y}$, $p_0$ and $\tilde\nu$ are appropriately chosen constants, and $R_n$ and $\widetilde R_n$ depend on the complexity of the objective function class and approach 0 as $n\to\infty$ under appropriate assumptions; see (88) and (111) respectively for details. Similarly to the OT-DRO case, in Theorem 4.9 we also show that the corresponding ERM solution is an approximate solution to the population DRO problem with high probability. These results apply to OT cost functions of the form (13) and to a general class of f-divergences, including the oft-used KL and α-divergences.

3 Concentration Inequalities for OT-DRO

We now proceed to the detailed proofs of our results. We start by considering OT-DRO, without mixing with an f-divergence, focusing on OT cost functions of the form (13) which are not covered by previous works; in addition to being of independent interest, this will provide an important tool for our subsequent study of OT-regularized f-divergences. In the following, we codify our assumptions regarding the objective and OT cost functions.

Assumption 3.1. Assume the following:
1. $Z = \mathcal{X}\times\mathcal{Y}$, where $\mathcal{Y}$ is Polish and $\mathcal{X}\subset\mathbb{R}^d$ is convex and Polish (e.g., either closed or open); note that this makes $Z$ a Polish space.
2. $\|\cdot\|_{\mathcal{X}}$ is a norm on $\mathbb{R}^d$.
3. $\mathcal{L}_\theta : \mathcal{X}\times\mathcal{Y}\to\mathbb{R}$, $\theta\in\Theta$, are measurable and we have $L_{\mathcal{X}}\in(0,\infty)$ such that $x\mapsto\mathcal{L}_\theta(x,y)$ is $L_{\mathcal{X}}$-Lipschitz with respect to $\|\cdot\|_{\mathcal{X}}$ for all $y$, $\theta$.
4. $c$ is a cost function on $Z$ and we have $\delta\ge 0$ and $\psi : [0,\infty)\to[0,\infty]$ with $\psi(0)=0$ such that
\[
c_{\psi,\delta} \le c \le c_\delta, \tag{15}
\]
where $c_\delta$ is the cost function defined in (12) and $c_{\psi,\delta}$ was defined in (13).

Remark 3.2. Note that $c_\delta$, (12), corresponds to the choice $\psi(t) = \infty\cdot 1_{t>0}$.

A key challenge in the study of statistical guarantees for the OT-DRO problem via its dual formulation (4) is to effectively handle the non-compact interval in the optimization over $\lambda$. We start by deriving a bound that can be used to show that the $c$-transformed loss has a well-behaved limit as $\lambda\to\infty$.

Lemma 3.3. Under Assumption 3.1 we have
\[
\|\mathcal{L}^c_{\theta,\lambda} - \mathcal{L}^{c_\delta}_\theta\|_\infty \le \lambda\psi^*(L_{\mathcal{X}}/\lambda) \tag{16}
\]
for all $\lambda>0$, $\theta\in\Theta$, where
\[
\psi^*(s) := \sup_{t\ge 0}\{st - \psi(t)\} \tag{17}
\]
is the Legendre transform of $\psi$.

Remark 3.4. Note that the $c_\delta$-transformed loss $\mathcal{L}^{c_\delta}_{\theta,\lambda}$ does not depend on $\lambda$, and if $\delta=0$ then $\phi_{\psi,\delta} = \psi$ and $\mathcal{L}^{c_\delta}_\theta = \mathcal{L}_\theta$. Also note that if $\psi$ is LSC then $c_{\psi,\delta}$ is LSC and hence is a valid OT cost function, in which case this result can be applied with $c = c_{\psi,\delta}$.

Proof. In the following we suppress the $\theta$ dependence of the optimization objective, as it is not relevant to the computations. For $z=(x,y)\in Z$, first compute
\[
\mathcal{L}^c_\lambda(z) = \sup_{\tilde z\in Z}\{\mathcal{L}(\tilde z) - \lambda c(z,\tilde z)\} \le \sup_{\tilde z\in Z}\{\mathcal{L}(\tilde z) - \lambda c_{\psi,\delta}(z,\tilde z)\} \tag{18}
\]
\[
= \max\left\{ \sup_{\tilde x\in\mathcal{X}: \|\tilde x - x\|_{\mathcal{X}}\le\delta}\{\mathcal{L}(\tilde x,y) - \lambda\phi_{\psi,\delta}(\|\tilde x - x\|_{\mathcal{X}})\},\ \sup_{\tilde x\in\mathcal{X}: \|\tilde x - x\|_{\mathcal{X}}>\delta}\{\mathcal{L}(\tilde x,y) - \lambda\phi_{\psi,\delta}(\|\tilde x - x\|_{\mathcal{X}})\} \right\}
\]
\[
= \max\left\{ \mathcal{L}^{c_\delta}(z),\ \sup_{\tilde x\in\mathcal{X}: \|\tilde x - x\|_{\mathcal{X}}>\delta}\{\mathcal{L}(\tilde x,y) - \lambda\phi_{\psi,\delta}(\|\tilde x - x\|_{\mathcal{X}})\} \right\}.
\]
The assumption that L ( x, y ) is L X -Lipsc hitz in x implies that L c δ ( z ) is finite. T ogether with (18) and the b ound c ≤ c δ , this allows us to obtain 0 ≤ L c λ ( z ) − L c δ ( z ) ≤ max ( 0 , sup ˜ x ∈X : ∥ ˜ x − x ∥ X >δ {L ( ˜ x, y ) − λϕ ψ ,δ ( ∥ ˜ x − x ∥ X ) } − L c δ ( z ) ) . By conv exity of X , for any ˜ x ∈ X satisfying ∥ ˜ x − x ∥ X > δ there exists x ′ ∈ X that satisfies ∥ x ′ − x ∥ X = δ and ∥ ˜ x − x ∥ X = δ + ∥ ˜ x − x ′ ∥ X . Therefore we can compute L ( ˜ x, y ) − λϕ ψ ,δ ( ∥ ˜ x − x ∥ X ) − L c δ ( x, y ) ≤L ( ˜ x, y ) − λϕ ( ∥ ˜ x − x ∥ X ) − L ( x ′ , y ) (19) ≤ L X ∥ ˜ x − x ′ ∥ X − λϕ ( δ + ∥ ˜ x − x ′ ∥ X ) = L X ∥ ˜ x − x ′ ∥ X − λψ ( ∥ ˜ x − x ′ ∥ X ) ≤ sup t ≥ 0 { L X t − λψ ( t ) } = λψ ∗ ( L X /λ ) . Maximizing ov er { ˜ x ∈ X : ∥ ˜ x − x ∥ X > δ } w e see that 0 ≤ L c λ ( z ) − L c δ ( z ) ≤ max { 0 , λψ ∗ ( L X /λ ) } = λψ ∗ ( L X /λ ) , (20) where the equality follows from the fact that ψ (0) = 0 . Maximizing ov er z ∈ Z completes the pro of. Next w e use Lemma 3.3 to derive a cov ering num b er b ound that is able to handle the unbounded domain for λ . T o do so, we make the follo wing assumptions: Assumption 3.5. In addition to A ssumption 3.1, supp ose that: 1. Ther e exists M ∈ [0 , ∞ ) such that for al l ( x, y ) ∈ Z we have sup ˜ x ∈ D x,y c (( x, y ) , ( ˜ x, y )) ≤ M , D x,y : = { ˜ x ∈ X : c (( x, y ) , ( ˜ x, y )) < ∞} , (21) 2. ψ ∗ ( t ) < ∞ for al l t > 0 , 3. ψ ∗ ( t ) = o ( t ) as t → 0 + . 6 A preprint - March 31, 2026 Lemma 3.6. Supp ose that A ssumption 3.5 is satisfie d. Define G : = {L θ : θ ∈ Θ } , G c : = {L c θ,λ : θ ∈ Θ , λ ∈ (0 , ∞ ) } ∪ {L c δ θ : θ ∈ Θ } . (22) Then for al l ϵ 1 , ϵ 2 > 0 we have the fol lowing r elation b etwe en c overing numb ers in the supr emum norm: N ( ϵ 1 + ϵ 2 , G c , ∥ · ∥ ∞ ) ≤ M λ ∗ ( ϵ 2 ) 2 ϵ 2 + 1 N ( ϵ 1 , G , ∥ · ∥ ∞ ) , (23) wher e λ ∗ ( ϵ 2 ) : = inf { λ > 0 : λψ ∗ ( L X /λ ) ≤ ϵ 2 } . (24) W e also note that λ ∗ ( ϵ 2 ) is finite. Pr o of. If N ( ϵ 1 , G , ∥ · ∥ ∞ ) = ∞ the result is trivial, so supp ose it is finite. Lemma 3.3 implies that for all θ ∈ Θ , λ > 0 w e ha ve ∥L c θ,λ − L c δ θ ∥ ∞ ≤ λψ ∗ ( L X /λ ) . (25) Com bined with the assumption that ψ ∗ ( t ) < ∞ for all t > 0 we see that L c θ,λ is real-v alued (recall that the Lipsc hitz assumption on L θ implies that L c δ θ is real-v alued). The assumption (15) implies L c θ,λ ( x, y ) = sup ˜ x ∈ D x,y {L θ ( ˜ x, y ) − λc (( x, y ) , ( ˜ x, y )) } , (26) therefore ∥L c θ,λ 1 − L c θ,λ 2 ∥ ∞ ≤ sup ( x,y ) ∈Z sup ˜ x ∈ D x,y | λ 2 c (( x, y ) , ( ˜ x, y )) − λ 1 c (( x, y ) , ( ˜ x, y )) | ≤ M | λ 2 − λ 1 | , (27) and, by similar computations, ∥L c θ 1 ,λ − L c θ 2 ,λ ∥ ∞ ≤ ∥L θ 1 − L θ 2 ∥ ∞ , ∥L c δ θ 1 − L c δ θ 2 ∥ ∞ ≤ ∥L θ 1 − L θ 2 ∥ ∞ . (28) With λ ∗ defined by (24) , we note that the assumption ψ ∗ ( t ) = o ( t ) as t → 0 + implies that λ ∗ ( ϵ 2 ) < ∞ for all ϵ 2 > 0 . ψ ∗ is con vex and, b y assumption, it is real-v alued on (0 , ∞ ) , therefore ψ ∗ is con tinuous on (0 , ∞ ) . Hence either λ ∗ ( ϵ 2 ) = 0 or λ ∗ ( ϵ 2 ) ψ ∗ ( L X /λ ∗ ( ϵ 2 )) ≤ ϵ 2 . Define N ( ϵ 2 ) : = M λ ∗ ( ϵ 2 ) 2 ϵ 2 ∈ Z 0 . (29) First consider the case where λ ∗ ( ϵ 2 ) > 0 (and hence N ( ϵ 2 ) > 0 ) and define λ j : = ( j − 1 / 2) λ ∗ ( ϵ 2 ) N ( ϵ 2 ) , j = 1 , ..., N ( ϵ 2 ) . (30) Let L θ i , i = 1 , ..., N ( ϵ 1 , G , ∥ · ∥ ∞ ) be a minimal ϵ 1 -co ver of G . 
W e will sho w that L c λ j ,θ i , L c δ θ i , where i ∈ { 1 , ..., N ( ϵ 1 , G , ∥ · ∥ ∞ ) } , j ∈ { 1 , ..., N ( ϵ 2 ) } , is an ϵ 1 + ϵ 2 -co ver of G c : Giv en θ ∈ Θ , λ > 0 , there exists i suc h that ∥L θ − L θ i ∥ ∞ ≤ ϵ 1 (31) and hence (28) implies ∥L c δ θ − L c δ θ i ∥ ∞ ≤ ∥L θ − L θ i ∥ ∞ ≤ ϵ 1 . (32) If λ ≥ λ ∗ ( ϵ 2 ) then, using the fact that λ 7→ λψ ∗ ( L X /λ ) is non-increasing along with (25), we can compute ∥L c θ,λ − L c δ θ i ∥ ∞ ≤∥L c θ,λ − L c δ θ ∥ ∞ + ∥L c δ θ − L c δ θ i ∥ ∞ (33) ≤ λψ ∗ ( L X /λ ) + ϵ 1 ≤ λ ∗ ( ϵ 2 ) ψ ∗ ( L X /λ ∗ ( ϵ 2 )) + ϵ 1 ≤ ϵ 2 + ϵ 1 . If λ < λ ∗ ( ϵ 2 ) then there exists j ∈ { 1 , ..., N ( ϵ 2 ) } such that λ ∈ [( j − 1) λ ∗ ( ϵ 2 ) / N ( ϵ 2 ) , j λ ∗ ( ϵ 2 ) / N ( ϵ 2 )] (34) 7 A preprint - March 31, 2026 and so we can use (27) and (28) to compute ∥L c θ,λ − L c θ i ,λ j ∥ ∞ ≤∥L c θ,λ − L c θ i ,λ ∥ ∞ + ∥L c θ i ,λ − L c θ i ,λ j ∥ ∞ (35) ≤∥L θ − L θ i ∥ ∞ + M | λ − λ j | ≤ ϵ 1 + M λ ∗ ( ϵ 2 ) 2 N ( ϵ 2 ) ≤ ϵ 1 + ϵ 2 . This pro ves that L c λ j ,θ i , L c δ θ i , i = 1 , ..., N ( ϵ 1 , G , ∥ · ∥ ∞ ) , j = 1 , ..., N ( ϵ 2 ) is an ϵ 1 + ϵ 2 -co ver of G c and therefore N ( ϵ 1 + ϵ 2 , G c , ∥ · ∥ ∞ ) ≤ ( N ( ϵ 2 ) + 1) N ( ϵ 1 , G , ∥ · ∥ ∞ ) as claimed. Th us we hav e prov en the lemma under the assumption that λ ∗ ( ϵ 2 ) > 0 . In the case where λ ∗ ( ϵ 2 ) = 0 , by making the obvious mo difications to the ab o ve argument, one sees that L c δ θ i , i = 1 , ..., N ( ϵ 1 , G , ∥ · ∥ ∞ ) pro vides the desired cov er. This completes the pro of. Using the ab o ve lemmas, w e no w derive concen tration inequalities for the OT-DRO problem. The follo wing lists the additional conditions we will require. Assumption 3.7. In addition to A ssumption 3.5, assume the fol lowing: 1. Ther e exists β ∈ (0 , ∞ ) such that 0 ≤ L θ ≤ β for al l θ ∈ Θ . 2. P ∈ P ( Z ) . 3. N ( ϵ 1 , G , ∥ · ∥ ∞ ) < ∞ for al l ϵ 1 > 0 , wher e G : = {L θ : θ ∈ Θ } . 4. h i : (0 , ∞ ) → (0 , ∞ ) , i = 1 , 2 , ar e me asur able and h 1 (˜ ϵ ) + h 2 (˜ ϵ ) = ˜ ϵ for al l ˜ ϵ > 0 . Theorem 3.8. Under A ssumption 3.7, for n ∈ Z + , ϵ > 0 , r > 0 , we have P n ± inf θ ∈ Θ sup Q : C ( P,Q ) ≤ r E Q [ L θ ] − inf θ ∈ Θ sup Q : C ( P n ,Q ) ≤ r E Q [ L θ ] ! ≥ 2 D n + ϵ ! (36) ≤ exp − 2 ϵ 2 n β 2 , wher e P n : = 1 n P n i =1 δ z i denotes the empiric al me asur e c orr esp onding to z ∈ Z n and D n : =12 n − 1 / 2 Z β 0 s log M λ ∗ ( h 2 (˜ ϵ )) 2 h 2 (˜ ϵ ) + 1 N ( h 1 (˜ ϵ ) , G , ∥ · ∥ ∞ ) d ˜ ϵ . (37) Remark 3.9. T o obtain this r esult with D n given by (117) we use Dud ley’s entr opy inte gr al over (0 , ∞ ) . W e r estrict our attention to this c ase for simplicity of the pr esentation, though one c an e asily obtain a mor e gener al r esult using the entr opy inte gr al over ( ˜ ϵ 0 , ∞ ) , ˜ ϵ 0 > 0 , as in, e.g., The or em 5.22 of [ 47 ], in or der to hand le families of obje ctive functions for which the entr opy inte gr al over (0 , β ) diver ges. Pr o of. The assumed b ound c ≤ c δ implies c is zero on the diagonal. X and Y w ere assumed to b e Polish, hence Z is P olish. Therefore Prop osition 2.1 implies sup Q : C ( P,Q ) ≤ r E Q [ L θ ] = inf λ> 0 { λr + E P [ L c θ,λ ] } , (38) sup Q : C ( P n ,Q ) ≤ r E Q [ L θ ] = inf λ> 0 { λr + E P n [ L c θ,λ ] } . (39) The assumed b ounds on L θ imply 0 ≤ L c θ,λ ≤ β for all θ , λ . In particular, L c θ,λ ∈ L 1 ( P ) ∩ L 1 ( P n ) and b oth (38) and (39) are finite, as are their infima o ver θ ∈ Θ . Therefore w e can compute ± inf θ ∈ Θ sup Q : C ( P,Q ) ≤ r E Q [ L θ ] − inf θ ∈ Θ sup Q : C ( P n ,Q ) ≤ r E Q [ L θ ] ! 
(40) = ± inf θ ∈ Θ ,λ> 0 { λr + E P [ L c θ,λ ] } − inf θ ∈ Θ ,λ> 0 { λr + E P n [ L c θ,λ ] } ≤ sup g ∈G c ( ± E P [ g ] − 1 n n X i =1 g ( z i ) !) : = ϕ ± ( z ) , 8 A preprint - March 31, 2026 where G c w as defined in (22) . Note that G c consists of universally measurable functions, 0 ≤ g ≤ β for all g ∈ G c , and Lemma 3.6 implies that the follo wing relation b et ween cov ering num bers of G and G c in the suprem um norm for all ˜ ϵ > 0 : N ( ˜ ϵ, G c , ∥ · ∥ ∞ ) ≤ M λ ∗ ( h 2 (˜ ϵ )) 2 h 2 (˜ ϵ ) + 1 N ( h 1 (˜ ϵ ) , G , ∥ · ∥ ∞ ) < ∞ . (41) No w apply the uniform law of large num bers result from, e.g., Theorem 3.3, Eq. (3.8) - (3.13) in [ 39 ] (this reference assumes [0 , 1] -v alued functions but the result can b e shifted and scaled to apply to any set of uniformly b ounded functions), to obtain E P n [ ϕ ± ] ≤ 2 R G c ,P,n (42) for all n ∈ Z + . The Rademacher complexity , R G c ,P,n , can b e b ounded using the Dudley en tropy integral; see, e.g., Corollary 5.25 in [45]: R G c ,P,n ≤ 12 n − 1 / 2 Z β 0 p log N (˜ ϵ, G c , ∥ · ∥ ∞ ) d ˜ ϵ . (43) Com bining this with the co vering num ber b ound (41) w e obtain E P n [ ϕ ± ] ≤ 2 D n . (44) Finally , applying McDiarmid’s inequality (see, e.g., Theorem D.8 in [ 39 ]) to ϕ ± and com bining this with (40) and (44) we can compute P n ± inf θ ∈ Θ sup Q : C ( P,Q ) ≤ r E Q [ L θ ] − inf θ ∈ Θ sup Q : C ( P n ,Q ) ≤ r E Q [ L θ ] ! ≥ 2 D n + ϵ ! (45) ≤ P n ( ϕ ± − E P n [ ϕ ± ] ≥ ϵ ) ≤ exp − 2 ϵ 2 n β 2 . 3.1 OT-DR O: ERM Bound By a similar argumen t, w e can also show that a solution to the empirical OT-DR O problem (i.e., the ERM solution; see (46) b elo w) is also an approximate solution to the p opulation DR O problem with high probability (also allowing for an optimization error tolerance, ϵ opt n , and optimization failure probability , δ opt n ). Theorem 3.10. Under A ssumption 3.7, supp ose we have r > 0 , θ ∗ ,n : Z n → Θ , ϵ opt n ≥ 0 , δ opt n ∈ [0 , 1] , and E n ⊂ Z n such that P n ( E c n ) ≤ δ opt n and sup Q : C ( P n ,Q ) ≤ r E Q [ L θ ∗ ,n ] ≤ inf θ ∈ Θ sup Q : C ( P n ,Q ) ≤ r E Q [ L θ ] + ϵ opt n (46) on E n , wher e P n : = 1 n P n i =1 δ z i . Then for n ∈ Z + , ϵ > 0 , and with D n as define d in (117) , we have P n sup Q : C ( P,Q ) ≤ r E Q [ L θ ∗ ,n ] ≥ inf θ ∈ Θ sup Q : C ( P,Q ) ≤ r E Q [ L θ ] + 4 D n + ϵ opt n + ϵ ! (47) ≤ exp − ϵ 2 n 2 β 2 + δ opt n , Remark 3.11. In the absenc e of sufficient me asur ability assumptions, P n in (47) and in the fol lowing pr o of should b e interpr ete d as the outer pr ob ability. Pr o of. On E n w e ha ve sup Q : C ( P,Q ) ≤ r E Q [ L θ ∗ ,n ] − inf θ ∈ Θ sup Q : C ( P,Q ) ≤ r E Q [ L θ ] (48) ≤ sup Q : C ( P,Q ) ≤ r E Q [ L θ ∗ ,n ] − sup Q : C ( P n ,Q ) ≤ r E Q [ L θ ∗ ,n ] + inf θ ∈ Θ sup Q : C ( P n ,Q ) ≤ r E Q [ L θ ] + ϵ opt n − inf θ ∈ Θ sup Q : C ( P,Q ) ≤ r E Q [ L θ ] ≤ sup θ ∈ Θ { F θ − F n,θ } + sup θ ∈ Θ { F n,θ − F θ } + ϵ opt n , 9 A preprint - March 31, 2026 where F n,θ : = sup Q : C ( P n ,Q ) ≤ r E Q [ L θ ] = inf λ> 0 { λr + E P n [ L c θ,λ ] } , (49) F θ : = sup Q : C ( P,Q ) ≤ r E Q [ L θ ] = inf λ> 0 { λr + E P [ L c θ,λ ] } . Therefore P n sup Q : C ( P,Q ) ≤ r E Q [ L θ ∗ ,n ] ≥ inf θ ∈ Θ sup Q : C ( P,Q ) ≤ r E Q [ L θ ] + 4 D n + ϵ opt n + ϵ ! (50) ≤ δ opt n + P n sup θ ∈ Θ { F θ − F n,θ } + sup θ ∈ Θ { F n,θ − F θ } ≥ 4 D n + ϵ ≤ δ opt n + P n ( ϕ ≥ 4 D n + ϵ ) , where ϕ : = sup g ∈G c ( E P [ g ] − 1 n n X i =1 g ( z i ) ) + sup g ∈G c ( − E P [ g ] − 1 n n X i =1 g ( z i ) !) (51) and G c w as defined in (22). 
By the same argument that led to (44) in the proof of Theorem 3.8 we have
\[
E_{P^n}[\phi] \le 4D_n, \tag{52}
\]
hence we can use McDiarmid's inequality to compute
\[
P^n\!\left( \sup_{Q: C(P,Q)\le r} E_Q[\mathcal{L}_{\theta_{*,n}}] \ge \inf_{\theta\in\Theta}\sup_{Q: C(P,Q)\le r} E_Q[\mathcal{L}_\theta] + 4D_n + \epsilon^{opt}_n + \epsilon \right) \tag{53}
\]
\[
\le \delta^{opt}_n + P^n(\phi - E_{P^n}[\phi] \ge \epsilon) \le \delta^{opt}_n + \exp\!\left(-\frac{\epsilon^2 n}{2\beta^2}\right).
\]

Remark 3.12. Note that our approach yields bounds that do not depend on the neighborhood size, $r$; this is well suited for applications to adversarial training, where $r$ is small but fixed (i.e., does not depend on $n$). Our results have several advantages over those obtained by previous approaches that considered the adversarial training setting [34, 2, 4, 25] (note that these only considered p-Wasserstein costs). Specifically, compare (47) with Theorem 2 in [34]; the latter has (in our notation) an $r^{-(p-1)}$ factor in one of the error terms, and thus it behaves poorly when $r$ is small. The bounds in [2, 25] do not approach 0 as $n\to\infty$ due to the effect of the fixed neighborhood size; see Theorems 3 and 4 in [2] and Theorem 3 in [25]. Finally, we note that the results in [4] require more restrictive assumptions on the objective function (see their Assumption 5) when $r$ is not decaying with $n$.

3.2 Examples

Next we provide several cases of interest where $\psi(0)=0$, $\psi^*$ is finite, and $\psi^*(t)=o(t)$ as $t\to 0^+$. We focus on soft-constraint relaxations, i.e., (13) with real-valued $\psi$, but we also show that our method easily handles the PGD cost (12) as a special case; a short numerical check of the resulting transforms is given after the list.

1. $\psi(t)=\alpha t^q$ for some $\alpha>0$, $q>1$: In this case, for $s>0$ we have
\[
\psi^*(s) = \alpha(q-1)\left(\frac{s}{\alpha q}\right)^{q/(q-1)} \tag{54}
\]
and for $\epsilon_2>0$ we have
\[
\lambda^*(\epsilon_2) = \alpha^{-1}(L_{\mathcal{X}}/q)^q (q-1)^{q-1}\epsilon_2^{-(q-1)}. \tag{55}
\]
Substituting this into the covering number bound (23), we see that the contribution from the $\lambda$ parameter is a factor that scales like $O(\epsilon_2^{-q})$; this can be thought of as an effectively $q$-dimensional contribution to the complexity of the function space.

2. $\psi(t)=\alpha t^q+\beta t$ for some $\alpha,\beta>0$, $q>1$: In this case, for $s>0$ we have
\[
\psi^*(s) = \begin{cases} \alpha(q-1)\left(\frac{s-\beta}{\alpha q}\right)^{q/(q-1)}, & s>\beta, \\ 0, & 0<s\le\beta. \end{cases} \tag{56}
\]
The fact that $\psi^*$ vanishes on $(0,\beta]$ implies the bound $\lambda^*(\epsilon_2)\le L_{\mathcal{X}}/\beta$. Hence Lemma 3.3 gives $\mathcal{L}^c_{\theta,\lambda} = \mathcal{L}^{c_\delta}_\theta$ for all $\lambda\ge L_{\mathcal{X}}/\beta$. The corresponding contribution of $\lambda$ to the covering number bound (23) scales like $O(\epsilon_2^{-1})$ and hence is effectively one-dimensional; this matches the fact that in such cases one can restrict $\lambda$ to a compact interval.

3. $\psi(t)=\alpha(e^{qt}-1)$ for some $\alpha,q>0$: In this case, for $s>0$ we have
\[
\psi^*(s) = \begin{cases} \frac{s}{q}\log\!\left(\frac{s}{\alpha q}\right) - \alpha\left(\frac{s}{\alpha q}-1\right), & s>\alpha q, \\ 0, & 0<s\le\alpha q. \end{cases} \tag{57}
\]
This example exhibits similar quantitative behavior to the previous case, making an essentially one-dimensional contribution to the covering number bound.

4. $\psi(t)=\infty\cdot 1_{t>0}$: This corresponds to the PGD case, (12). Here we have $\psi^*(s)=0$ for all $s$, and hence $\lambda^*(\epsilon_2)=0$ for all $\epsilon_2>0$. Thus the covering number bound (23) gives
\[
N(\epsilon_1+\epsilon_2, \mathcal{G}^c, \|\cdot\|_\infty) \le N(\epsilon_1, \mathcal{G}, \|\cdot\|_\infty) \tag{58}
\]
for all $\epsilon_2>0$, consistent with the fact that the minimization over $\lambda$ in (4) can be evaluated explicitly to give
\[
\inf_{\lambda>0}\{\lambda r + E_P[\mathcal{L}^c_\lambda]\} = E_{(x,y)\sim P}\!\left[ \sup_{\tilde x\in\mathcal{X}: \|x-\tilde x\|_{\mathcal{X}}\le\delta} \mathcal{L}(\tilde x, y) \right]. \tag{59}
\]
Thus our approach gracefully handles the classical PGD OT cost as a special case.
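As a sanity check on the closed forms (54)-(55), the following sketch (Python; an illustration, not part of the paper) computes $\psi^*(s)=\sup_{t\ge 0}\{st-\psi(t)\}$ and $\lambda^*(\epsilon_2)=\inf\{\lambda>0 : \lambda\psi^*(L_{\mathcal{X}}/\lambda)\le\epsilon_2\}$ numerically on grids for $\psi(t)=\alpha t^q$ and compares them to the stated formulas; the grid bounds and resolutions are arbitrary choices made for the example.

```python
import numpy as np

alpha, q, L_X = 2.0, 3.0, 1.5

def psi(t):
    return alpha * t ** q

def psi_star(s, t_grid=np.linspace(0.0, 50.0, 20001)):
    # Legendre transform (17), restricted to a (sufficiently large) grid over t >= 0
    return float(np.max(s * t_grid - psi(t_grid)))

def lam_star(eps2, lam_grid=np.logspace(-4, 4, 2001)):
    # lambda*(eps_2) from (24): the smallest lambda with lambda * psi*(L_X / lambda) <= eps_2
    vals = np.array([lam * psi_star(L_X / lam) for lam in lam_grid])
    feasible = lam_grid[vals <= eps2]
    return float(feasible.min()) if feasible.size else np.inf

# Closed forms (54) and (55)
psi_star_cf = lambda s: alpha * (q - 1) * (s / (alpha * q)) ** (q / (q - 1))
lam_star_cf = lambda e2: (L_X / q) ** q * (q - 1) ** (q - 1) / (alpha * e2 ** (q - 1))

print(psi_star(0.7), psi_star_cf(0.7))    # both roughly 0.159
print(lam_star(0.05), lam_star_cf(0.05))  # both roughly 100
```

Analogous checks can be run for examples 2 and 3 by swapping in the corresponding $\psi$; in those cases $\lambda^*(\epsilon_2)$ stays bounded, reflecting the effectively one-dimensional contribution noted above.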
The above examples all result in a modest increase in the covering number bound, apart from case (1) with very large $q$. Thus, moving from non-robust to robust model training does not lead to a significant weakening of the statistical error bounds. Corresponding explicit bounds on $D_n$, (117), are given in Appendix A.

4 Concentration Inequalities for DRO with OT-Regularized f-Divergences

In this section we prove the concentration inequality (14) for DRO with OT-regularized f-divergences, working under the following assumptions; in particular, we now restrict our attention to classification problems (i.e., discrete $\mathcal{Y}$).

Assumption 4.1. In addition to Assumption 3.7, assume the following:
1. $\mathcal{Y}$ is a finite set with cardinality $K\in\mathbb{Z}^+$, $K\ge 2$ (the label categories); we equip $\mathcal{Y}$ with the discrete metric, which makes $Z := \mathcal{X}\times\mathcal{Y}$ a Polish space.
2. The $\mathcal{Y}$-marginal distribution, $P_{\mathcal{Y}}$, satisfies $p_y := P_{\mathcal{Y}}(y) > 0$ for all $y\in\mathcal{Y}$.
3. $\{\tilde z\in Z : c(z,\tilde z)<\infty\} = \mathcal{X}\times\{y\}$ for all $z=(x,y)\in Z$.
4. We have $h_i : (0,\infty)\to(0,\infty)$, $i=1,2,3$, that are measurable and satisfy $h_1(\tilde\epsilon)+h_2(\tilde\epsilon)+h_3(\tilde\epsilon)=\tilde\epsilon$ for all $\tilde\epsilon>0$.
5. We have $f\in\mathcal{F}_1(a,b)$, where $0\le a<1<b\le\infty$, and $f$ is strictly convex on a neighborhood of 1.
6. $s_0 := f'_+(1) \in \{f^*<\infty\}^o$.
7. $f^*$ is bounded below.
8. $\lim_{s\to\infty}(f^*)'_+(s) = \infty$.
9. $s(f^*)'_+(-s)$ is bounded on $s\in[\tilde\nu,\infty)$ for all $\tilde\nu\in\mathbb{R}$.

Remark 4.2. The assumption on $P_{\mathcal{Y}}$ is trivial in practice, as one simply removes unnecessary classes. The assumption regarding the set where $c$ is finite holds for the OT cost functions $c_{\psi,\delta}$, (13), with real-valued $\psi$. The assumptions on $f$ hold for both KL and the α-divergences for all $\alpha>1$; for details, see Section 4.3.

Specifically, we will prove the following result.

Theorem 4.3. Under Assumption 4.1, let $p_0\in(0,\min_y p_y)$ and $\tilde\nu\in\mathbb{R}$ be such that
\[
(f^*)'_+(-M-\tilde\nu) \ge 1/p_0. \tag{60}
\]
Then for all $\epsilon>0$, $n\in\mathbb{Z}^+$, $\lambda_n>0$ we have
\[
P^n\!\left( \pm\!\left( \inf_{\theta\in\Theta}\sup_{Q: D^c_f(Q\|P)\le r} E_Q[\mathcal{L}_\theta] - \inf_{\theta\in\Theta}\sup_{Q: D^c_f(Q\|P_n)\le r} E_Q[\mathcal{L}_\theta] \right) \ge \max\{R_n,\widetilde R_n\} + \epsilon \right) \tag{61}
\]
\[
\le \exp\!\left(-\frac{2n\epsilon^2}{\beta^2}\right) + \exp\!\left(-\frac{2n\epsilon^2}{(\beta(f^*)'_+(-\tilde\nu))^2}\right) + \sum_{y\in\mathcal{Y}} e^{-2n(p_y-p_0)^2},
\]
where $R_n$ and $\widetilde R_n$ are defined below in (88) and (111) respectively.

Remark 4.4. For a discussion of the order in $n$ of the error terms $R_n$ and $\widetilde R_n$, see Section 4.3.

4.1 Proof of Theorem 4.3

The proof of Theorem 4.3 is somewhat lengthy, and so we spread it over the following subsections.

4.1.1 Proof of Theorem 4.3: Error Decomposition

The primary new complication in the analysis of DRO with OT-regularized f-divergences (as compared to the OT-DRO case) is the $1/\lambda$ factor on the right-hand side of (7). We note that in the Cressie-Read divergence case (without OT), the optimization over $\lambda$ can be simplified analytically; this technique was used in [23] but is not applicable to the more general class of f-divergences studied in this work. Thus the following analysis requires new techniques.

To address the aforementioned complication, we start by decomposing the problem into two terms, covering different $\lambda$-domains, which we treat by different methods. First, by a similar calculation to that in the OT case, we have the error bound
\[
\pm\!\left( \inf_{\theta\in\Theta}\sup_{Q: D^c_f(Q\|P)\le r} E_Q[\mathcal{L}_\theta] - \inf_{\theta\in\Theta}\sup_{Q: D^c_f(Q\|P_n)\le r} E_Q[\mathcal{L}_\theta] \right)
\]
(62) ≤ sup θ ∈ Θ ,λ> 0 ± inf ρ ∈ R { ρ + λE P [ f ∗ (( L c θ,λ − ρ ) /λ )] } − inf ρ ∈ R { ρ + λE P n [ f ∗ (( L c θ,λ − ρ ) /λ )] } . No w, for all e D n ≥ 0 and all λ n > 0 (to b e c hosen later), we can use a union b ound to obtain P n ± inf θ ∈ Θ sup Q : D c f ( Q ∥ P ) ≤ r E Q [ L θ ] − inf θ ∈ Θ sup Q : D c f ( Q ∥ P n ) ≤ r E Q [ L θ ] ! ≥ e D n ! (63) ≤ P n sup θ ∈ Θ ,λ ≥ λ n n ± λ Λ P f [ L c θ,λ /λ ] − λ Λ P n f [ L c θ,λ /λ ] o ≥ e D n ! (64) + P n sup θ ∈ Θ ,λ ∈ (0 ,λ n ) n ± λ Λ P f [ L c θ,λ /λ ] − λ Λ P n f [ L c θ,λ /λ ] o ≥ e D n ! , (65) where we changed v ariables in the inner infima to ν = ρ/λ and introduced the notation Λ Q f [ ϕ ] : = inf ν ∈ R { ν + E Q [ f ∗ ( ϕ − ν )] , (66) whic h is defined so long as ϕ − ∈ L 1 ( Q ) . Note that w e ha ve the identit y Λ Q f [ ϕ + γ ] = γ + Λ Q f [ ϕ ] (67) 12 A preprint - March 31, 2026 for all γ ∈ R . In the KL case, Λ Q K L [ ϕ ] = log E Q [ e ϕ ] and so Λ Q f [ ϕ ] can b e though t of as a generalization of the cum ulant generating function. The term (64) , which inv olves the suprem um ov er λ ≥ λ n , can be handled in a manner reminiscent of the OT case, as we show in the next subsection. The supremum ov er λ ∈ (0 , λ n ) in (65) m ust con tend with the 1 /λ singularit y in the arguments to Λ P f and Λ P n f ; this necessitates an alternative pro of strategy . The key new to ol is the follo wing lemma, which pro vides conditions under which the optimization ov er ν in (66) can b e restricted to a b ounded domain. As this result is of in terest for applications b ey ond the curren t work, we pro ve it under more general conditions that those of Assumption 4.1. Lemma 4.5. L et (Ω , M , Q ) b e a pr ob ability sp ac e and ϕ : Ω → R b e me asur able with ϕ − ∈ L 1 ( Q ) and ϕ ≤ β for some β ∈ R . L et f ∈ F 1 ( a, b ) with a ≥ 0 and supp ose f is strictly c onvex in a neighb orho o d of 1 and s 0 ∈ { f ∗ < ∞} o ( s 0 as define d in p art 6 of A ssumption 4.1). Supp ose we have ˜ α ∈ R , p ˜ α ∈ (0 , 1) , and ν ˜ α ∈ R such that ˜ α − ν ˜ α ∈ { f ∗ < ∞} , Q ( ϕ ≥ ˜ α ) ≥ p ˜ α , and ( f ∗ ) ′ + ( ˜ α − ν ˜ α ) ≥ 1 /p ˜ α . Then ν ˜ α < β − s 0 and Λ Q f [ ϕ ] = inf ν ∈ [ ν ˜ α ,β − s 0 ] { ν + E Q [ f ∗ ( ϕ − ν )] } . (68) Mor e over, if α ≤ ϕ ≤ β then we have the simpler r esult Λ Q f [ ϕ ] = inf ν ∈ [ α − s 0 ,β − s 0 ] { ν + E Q [ f ∗ ( ϕ − ν )] } . (69) Remark 4.6. The simpler distribution-indep endent c ase (69) was pr eviously obtaine d in L emma 2.1 of [ 10 ]. The distribution-dep endent c ase (68) , which we b elieve to b e new, wil l b e key for obtaining uniform (in λ ) b ounds on the terms in (65) , thus mitigating the app ar ent difficulty stemming fr om the 1 /λ singularity. Pr o of. Standard results in conv ex analysis imply that f ∗ is con tinuous on { f ∗ < ∞} and the right-deriv ative ( f ∗ ) ′ + is non-decreasing (and hence its definition can b e naturally extended to { f ∗ < ∞} ). Also note that the assumption a ≥ 0 implies f ∗ is non-decreasing and hence ( f ∗ ) ′ + ≥ 0 . Let ν < ν ˜ α . First we sho w that ν + f ∗ ( ϕ − ν ) ≥ ν + f ∗ ( ϕ − ν ˜ α ) + ( f ∗ ) ′ + ( ˜ α − ν ˜ α )( ν ˜ α − ν )1 ϕ ≥ ˜ α . (70) Note that the claim is trivial if f ∗ ( ϕ − ν ) = ∞ . Therefore we supp ose ϕ − ν ∈ { f ∗ < ∞} . 
F or n ∈ Z + large enough we hav e ϕ − ν ˜ α < ϕ − ν − 1 /n , ϕ − ν ˜ α , ϕ − ν − 1 /n ∈ { f ∗ < ∞} o and hence we can use absolute con tinuit y of f ∗ compute ν + f ∗ ( ϕ − ν − 1 /n ) = ν + f ∗ ( ϕ − ν ˜ α ) + Z ϕ − ν − 1 /n ϕ − ν ˜ α ( f ∗ ) ′ + ( s ) ds (71) ≥ ν + f ∗ ( ϕ − ν ˜ α ) + ( f ∗ ) ′ + ( ϕ − ν ˜ α )( ν ˜ α − ν − 1 /n ) ≥ ν + f ∗ ( ϕ − ν ˜ α ) + ( f ∗ ) ′ + ( ˜ α − ν ˜ α )( ν ˜ α − ν − 1 /n )1 ϕ ≥ ˜ α . T aking n → ∞ and using contin uit y of f ∗ on { f ∗ < ∞} we can conclude the claimed bound. T aking the exp ectation of b oth sides of (70) , which exist in ( −∞ , ∞ ] due to the assumption ϕ − ∈ L 1 ( Q ) and the fact that f ∗ ( t ) ≥ t , we obtain ν + E Q [ f ∗ ( ϕ − ν )] ≥ ν + E Q [ f ∗ ( ϕ − ν ˜ α )] + ( f ∗ ) ′ + ( ˜ α − ν ˜ α )( ν ˜ α − ν ) Q ( ϕ ≥ ˜ α ) (72) ≥ ν + E Q [ f ∗ ( ϕ − ν ˜ α )] + p − 1 ˜ α ( ν ˜ α − ν ) Q ( ϕ ≥ ˜ α ) ≥ ν ˜ α + E Q [ f ∗ ( ϕ − ν ˜ α )] . Th us, for all ν < ν ˜ α , we hav e pro v en ν + E Q [ f ∗ ( ϕ − ν )] ≥ ν ˜ α + E Q [ f ∗ ( ϕ − ν ˜ α )] . (73) 13 A preprint - March 31, 2026 No w assume that f is strictly conv ex in a neighborho od of 1 and s 0 : = f ′ + (1) ∈ { f ∗ < ∞} o . These assumptions imply f ∗ ( s 0 ) = s 0 and ( f ∗ ) ′ + ( s 0 ) = 1 (see Lemma A.9 in [11]). The bound ( f ∗ ) ′ + ( ˜ α − ν ˜ α ) ≥ 1 /p ˜ α > 1 = ( f ∗ ) ′ + ( s 0 ) (74) implies ˜ α − ν ˜ α > s 0 . Also assuming that ϕ ≤ β , w e hav e ˜ α ≤ β (otherwise Q ( ϕ ≥ ˜ α ) = 0 ) and therefore β − ν ˜ α ≥ ˜ α − ν ˜ α > s 0 . No w let ν > β − s 0 . Noting that ( −∞ , s 0 ] ⊂ { f ∗ < ∞} o w e can compute f ∗ ( ϕ − ( β − s 0 )) = f ∗ ( ϕ − ν ) + Z ϕ − ( β − s 0 ) ϕ − ν ( f ∗ ) ′ + ( s ) ds (75) ≤ f ∗ ( ϕ − ν ) + ( f ∗ ) ′ + ( s 0 )( ν − ( β − s 0 )) = f ∗ ( ϕ − ν ) + ν − ( β − s 0 ) . Therefore ν + E Q [ f ∗ ( ϕ − ν )] ≥ β − s 0 + E Q [ f ∗ ( ϕ − ( β − s 0 ))] (76) for all ν > β − s 0 . Combining this with (73) w e arriv e at (68) . The simpler case (69) follo ws similarly , and w as also prov en previously in Lemma 2.1 of [10]. 4.1.2 Pro of of Theorem 4.3: λ ≥ λ n T erm T o b ound (64), we start b y sho wing that lim λ →∞ λ Λ P f [ L c θ,λ /λ ] = E P [ L c δ θ ] (77) (and similarly with P n replacing P ), with explicit error b ounds. The pro of will utilize Lemma 3.3, but it will also require new ingredients in order to handle the contribution from f ∗ and the infimum ov er ν . Note that w e do not in vok e the entiret y of Assumption 4.1 in the follo wing lemma, as pro ving (77) only requires a subset of those conditions. Lemma 4.7. In addition to A ssumption 3.1, p art 1 of A ssumption 3.7, and p arts 5 - 6 of A ssumption 4.1, assume that λ 0 > 0 satisfies β /λ 0 + s 0 ∈ { f ∗ < ∞} o ( β fr om p art 1 of A ssumption 3.7 and s 0 fr om p art 6 of A ssumption 4.1). Then for al l P ∈ P ( Z ) , λ ≥ λ 0 , θ ∈ Θ we have λ Λ P f [ L c θ,λ /λ ] − E P [ L c δ θ ] ≤ λψ ∗ ( L X /λ ) + β ( f ∗ ) ′ + ( β /λ + s 0 ) − ( f ∗ ) ′ + ( s 0 ) . (78) Remark 4.8. A l l sufficiently lar ge λ 0 ’s satisfy the r e quir e d c ondition, due to the assumption that s 0 ∈ { f ∗ < ∞} o . Also note that, as ( f ∗ ) ′ + is right c ontinuous, if ψ ∗ ( t ) = o ( t ) as t → 0 + then the upp er b ound in (78) c onver ges to 0 as λ → ∞ . Pr o of. Again we suppress the θ dep endence, as it is not relev ant to the computations. First use Lemma 4.5 (sp ecifically , the simpler case (69)) along a c hange of v ariables to rewrite λ Λ P f [ L c λ /λ ] = inf η ∈ [0 ,β ] { η − λs 0 + λE P [ f ∗ (( L c λ − η ) /λ + s 0 )] } . (79) F or λ ≥ λ 0 and η ∈ [0 , β ] we ha ve ( L c λ − η ) /λ ) + s 0 ≤ β /λ 0 + s 0 . 
As f ∗ is non-decreasing, this implies ( L c λ − η ) /λ ) + s 0 ∈ { f ∗ < ∞} o . Therefore we can use T a ylor’s form ula for conv ex functions (see Theorem 1 in [ 35 ]) together with the iden tities f ∗ ( s 0 ) = s 0 and ( f ∗ ) ′ + ( s 0 ) = 1 (again, see Lemma A.9 in [ 11 ]) to obtain f ∗ (( L c λ − η ) /λ + s 0 ) = s 0 + ( L c λ − η ) /λ + R f ∗ ( s 0 , ( L c λ − η ) /λ + s 0 ) (80) for all η ∈ [0 , β ] , where R f ∗ denotes the remainder term in the expansion. Using this w e obtain λ inf ν ∈ R { ν + E P [ f ∗ ( L c λ /λ − ν ] } = E P [ L c λ ] + inf η ∈ [0 ,β ] E P [ λR f ∗ ( s 0 , ( L c λ − η ) /λ + s 0 )] . Next, we recall that the remainder term is non-decreasing in t ∈ [ s, ∞ ) and satisfies 0 ≤ R f ∗ ( s, t ) ≤ | t − s || ( f ∗ ) ′ + ( t ) − ( f ∗ ) ′ + ( s ) | . (81) 14 A preprint - March 31, 2026 This allows us to compute 0 ≤ inf η ∈ [0 ,β ] E P [ λR f ∗ ( s 0 , ( L c λ − η ) /λ + s 0 )] ≤ E P [ λR f ∗ ( s 0 , L c λ /λ + s 0 )] (82) ≤ λR f ∗ ( s 0 , β /λ + s 0 ) ≤ β ( f ∗ ) ′ + ( β /λ + s 0 ) − ( f ∗ ) ′ + ( s 0 ) . Using the ab o v e together with Lemma 3.3 we obtain λ Λ P f [ L c λ /λ ] − E P [ L c δ ] (83) ≤| E P [ L c λ ] − E P [ L c δ ] | + inf η ∈ [0 ,β ] E P [ λR f ∗ ( s 0 , ( L c λ − η ) /λ + s 0 )] ≤ λψ ∗ ( L X /λ ) + β ( f ∗ ) ′ + ( β /λ + s 0 ) − ( f ∗ ) ′ + ( s 0 ) as claimed. W e are now ready to b ound the term (64) . First apply Lemma 4.7 with λ 0 replaced by λ n (increasing in n , but with precise dep endence on n to b e c hosen later) to obtain sup θ ∈ Θ ,λ ≥ λ n n ± λ Λ P f [ L c θ,λ /λ ] − λ Λ P n f [ L c θ,λ /λ ] o (84) ≤ sup θ ∈ Θ {± ( E P [ L c δ θ ] − E P n [ L c δ θ ]) } + 2 sup λ ≥ λ n { λψ ∗ ( L X /λ ) + β ( f ∗ ) ′ + ( β /λ + s 0 ) − ( f ∗ ) ′ + ( s 0 ) } ≤ sup θ ∈ Θ {± ( E P [ L c δ θ ] − E P n [ L c δ θ ]) } + 2 λ n ψ ∗ ( L X /λ n ) + 2 β ( f ∗ ) ′ + ( β /λ n + s 0 ) − ( f ∗ ) ′ + ( s 0 ) . A straightforw ard application of McDiarmid’s inequalit y giv es P n sup θ ∈ Θ {± ( E P [ L c δ θ ] − E P n [ L c δ θ ]) } ≥ E P n sup θ ∈ Θ {± ( E P [ L c δ θ ] − E P n [ L c δ θ ]) } + ϵ (85) ≤ exp − 2 nϵ 2 β 2 and, using Dudley’s entrop y in tegral together with the b ound ∥L c δ θ 1 − L c δ θ 2 ∥ ∞ ≤ ∥L θ 1 − L θ 2 ∥ , w e can compute E P n sup θ ∈ Θ {± ( E P [ L c δ θ ] − E P n [ L c δ θ ]) } ≤ 24 n − 1 / 2 Z β 0 p log N (˜ ϵ, G , ∥ · ∥ ∞ ) d ˜ ϵ , (86) where G is as defined in (22). Com bining these we find P n sup θ ∈ Θ ,λ ≥ λ n n ± λ Λ P f [ L c θ,λ /λ ] − λ Λ P n f [ L c θ,λ /λ ] o ≥ R n + ϵ ! ≤ exp − 2 nϵ 2 β 2 , (87) where R n : =2 λ n ψ ∗ ( L X /λ n ) + 2 β ( f ∗ ) ′ + ( β /λ n + s 0 ) − ( f ∗ ) ′ + ( s 0 ) (88) + 24 n − 1 / 2 Z β 0 p log N (˜ ϵ, G , ∥ · ∥ ∞ ) d ˜ ϵ . 4.1.3 Pro of of Theorem 4.3: λ < λ n T erm Next we b ound the term (65) , which includes the λ − 1 singularit y . Our approac h will b e to show that, with high-probabilit y , one can restrict the optimization ov er ν to a compact subset and that this results in a Lipsc hitz dep endence on λ , despite the apparent singularity as λ → 0 + . Start by combining part 1 of Assumption 3.5 with part 3 of Assumption 4.1 to see that sup ˜ c : = sup z , ˜ z : c ( z, ˜ z ) < ∞ c ( z , ˜ z ) ≤ M < ∞ (89) 15 A preprint - March 31, 2026 and L c θ,λ ( z ) − sup ˜ z : c ( z, ˜ z ) < ∞ L θ ( ˜ z ) ≤ λ sup ˜ c , (90) where sup ˜ z : c ( z, ˜ z ) < ∞ L θ ( ˜ z ) = sup ˜ x L θ ( ˜ x, y ) . Define ∆ L c θ,λ = sup Z L θ − L c θ,λ (note that this inv olv es the supremum ov er all of Z , not just ov er { ˜ z : c ( z , ˜ z ) < ∞} ). 
As sup Z L θ do es not dep end on z , (67) implies λ Λ P f [ L c θ,λ /λ ] = sup Z L θ + λ Λ P f [ − ∆ L c θ,λ /λ ] , (91) where − ∆ L c θ,λ /λ ≤ 0 . T o connect this with the bound (90) , note that for all z = ( x, y ) suc h that y ∈ argmax y sup ˜ x L θ ( ˜ x, y ) we hav e − ∆ L c θ,λ ( z ) = L c θ,λ ( z ) − sup ˜ z : c ( z, ˜ z ) < ∞ L θ ( ˜ z ) ≤ λ sup ˜ c , (92) and therefore P ( − ∆ L c θ,λ /λ ≥ − sup ˜ c ) ≥ P Y ( argmax y sup ˜ x L θ ( ˜ x, y )) ≥ min y P Y ( y ) . (93) No w fix p 0 ∈ (0 , min y P Y ( y )) . By part 8 of Assumption 4.1, there exists ˜ ν ∈ R suc h that ( f ∗ ) ′ + ( − M − ˜ ν ) ≥ 1 /p 0 , and hence also ( f ∗ ) ′ + ( − sup ˜ c − ˜ ν ) ≥ 1 /p 0 ; see (89). Lemma 4.5 then implies λ Λ P f [ − ∆ L c θ,λ /λ ] = inf ν ∈ [ ˜ ν , − s 0 ] { λν + λE P [ f ∗ ( − ∆ L c θ,λ /λ − ν )] } . (94) Th us w e hav e sho wn that the infimum ov er ν can b e restricted to a compact interv al, uniformly in λ ; this will b e key for proving finite Rademacher complexity b ounds on the corresponding family of functions. W e cannot immediately use the ab ov e argument to restrict the domain of the infim um o ver ν in the formula for λ Λ P n f [ − ∆ L c θ,λ /λ ] in a wa y that is uniform in λ for all z ∈ Z n , as it is p ossible that z con tains no sample whose lab el maximizes sup ˜ x L θ ( ˜ x, y ) ; this case presen ts a problem when λ → 0 , as do cases where there are not enough maximizing samples. Ho wev er, the set of suc h z ’s has probability that is exponentially decaying in n and hence we can use a union b ound to handle the low-probabilit y exceptional set where the argument leading to (94) fails for P n . T o that end, for y ∈ Y , ξ y ∈ (0 , 1) define ˜ Y y ,ξ y ,n : = { ˜ y ∈ Y n : |{ i : ˜ y i = y }| ≥ (1 − ξ y ) p y n } , (95) where p y : = P Y ( y ) , and define ˜ Y ξ,n : = ∩ y ∈Y ˜ Y y ,ξ y ,n . (96) A straightforw ard application of McDiarmid’s inequalit y to the co ordinate maps Y i : Y n → Y implies P n Y ( ˜ Y c y ,ξ y ,n ) = P n Y 1 n n X i =1 1 y ( Y i ) − p y < − ξ y p y ! (97) ≤ e − 2 nξ 2 y p 2 y and therefore, letting ξ y = 1 − p 0 /p y for all y , we obtain P n ( Y n ∈ ˜ Y c ξ,n ) ≤ X y ∈Y e − 2 n ( p y − p 0 ) 2 . (98) On the even t Y n ∈ ˜ Y ξ,n , by a similar argumen t to the p opulation case, w e then hav e P n ( − ∆ L c θ,λ /λ ≥ − sup ˜ c ) ≥ P n ( argmax y sup ˜ x L θ ) ≥ min y 1 n |{ i : Y i = y }| ≥ min y (1 − ξ y ) p y = p 0 , (99) and hence (67) and Lemma 4.5 together imply λ Λ P n f [ L c θ,λ /λ ] = sup Z L θ + inf ν ∈ [ ˜ ν , − s 0 ] { λν + λE P n [ f ∗ ( − ∆ L c θ,λ /λ − ν )] } . (100) 16 A preprint - March 31, 2026 By combining the ab o v e results, we obtain the following b ound on (65): P n sup θ ∈ Θ ,λ ∈ (0 ,λ n ) n ± λ Λ P f [ L c θ,λ /λ ] − λ Λ P n f [ L c θ,λ /λ ] o ≥ D ! (101) ≤ P n sup θ ∈ Θ ,λ ∈ (0 ,λ n ) n ± λ Λ P f [ L c θ,λ /λ ] − λ Λ P n f [ L c θ,λ /λ ] o ≥ D , Y n ∈ ˜ Y ξ,n ! + X y ∈Y e − 2 n ( p y − p 0 ) 2 ≤ P n ( ϕ ± ≥ D ) + X y ∈Y e − 2 n ( p y − p 0 ) 2 , where ϕ ± : = sup θ ∈ Θ ,λ ∈ (0 ,λ n ) ,ν ∈ [ ˜ ν , − s 0 ] ± E P [ λf ∗ ( − ∆ L c θ,λ /λ − ν )] − E P n [ λf ∗ ( − ∆ L c θ,λ /λ − ν )] (102) = sup g ∈G c,f {± ( E P n [ g ] − E P [ g ]) } , G c,f : = { g θ,λ,ν : θ ∈ Θ , λ ∈ (0 , λ n ) , ν ∈ [ ˜ ν , − s 0 ] } , g θ,λ,ν : = λ ( f ∗ ( − ν ) − f ∗ ( − ∆ L c θ,λ /λ − ν )) . 
(103) Supp osing z , z ′ ∈ Z n differ only at index j , we hav e the b ounded difference property | ϕ ± ( z ) − ϕ ± ( z ′ ) | ≤ sup θ ∈ Θ ,λ ∈ (0 ,λ n ) ,ν ∈ [ ˜ ν , − s 0 ] λ n f ∗ ( − ∆ L c θ,λ ( z j ) /λ − ν )] − f ∗ ( − ∆ L c θ,λ ( z ′ j ) /λ − ν ) (104) ≤ 1 n ( f ∗ ) ′ + ( − ˜ ν ) sup θ ∈ Θ ,λ ∈ (0 ,λ n ) | ∆ L c θ,λ ( z j ) − ∆ L c θ,λ ( z ′ j ) | ≤ β n ( f ∗ ) ′ + ( − ˜ ν ) . Therefore McDiarmid’s inequality combined with the standard b ound on the mean in terms of the Rademacher complexit y yields P n ϕ ± ≥ 2 R G c,f ,P,n + ϵ ≤ exp − 2 nϵ 2 ( β ( f ∗ ) ′ + ( − ˜ ν )) 2 . (105) T o bound the Rademac her complexity , w e note the following uniform and Lipschitz b ounds on the functions g θ,λ,ν for all θ ∈ Θ , λ ∈ (0 , λ n ) , ν ∈ [ ˜ ν , − s 0 ] ; for further details, see App endix B: 0 ≤ g θ,λ,ν ≤ β ( f ∗ ) ′ + ( − ˜ ν ) , (106) | g θ 1 ,λ,ν − g θ 2 ,λ,ν | ≤ ( f ∗ ) ′ + ( − ˜ ν ) ∥L θ 1 − L θ 2 ∥ ∞ , (107) | g θ,λ,ν 1 − g θ,λ,ν 2 | ≤ 2 λ n ( f ∗ ) ′ + ( − ˜ ν ) | ν 1 − ν 2 | , (108) | ∂ λ g θ,λ,ν | ≤ f ∗ ( − ˜ ν ) − inf f ∗ + sup t ≥ ˜ ν | t ( f ∗ ) ′ + ( − t ) | + ( f ∗ ) ′ + ( − ˜ ν )(max {− s 0 , − ˜ ν } + sup ˜ c ) . (109) Regarding the deriv ativ e in λ , note that f ∗ is Lipschitz on ( −∞ , d ) for all d ∈ R and − ∆ L c θ,λ /λ − ν is b ounded ab o v e and is absolutely con tinuous when λ is restricted to a compact interv al due to the conv exit y of L c θ,λ in λ . Therefore g θ,λ,ν is absolutely contin uous in λ when restricted to compact interv als. Hence the ab o v e a.s. b ound on the deriv ative implies a corresponding Lipschitz b ound on (0 , λ n ) . Using the b ounds (106)-(109) w e obtain the following cov ering num b er b ound for all ϵ 1 , ϵ 2 , ϵ 3 > 0 : N ( ϵ 1 + ϵ 2 + ϵ 3 , G c,f , ∥ · ∥ ∞ ) ≤ λ n C 1 2 ϵ 1 ( − s 0 − ˜ ν ) λ n C 2 ϵ 2 N ( ϵ 3 /C 2 , G , ∥ · ∥ ∞ ) , (110) C 1 : = f ∗ ( − ˜ ν ) − inf f ∗ + sup t ≥ ˜ ν | t ( f ∗ ) ′ + ( − t ) | + ( f ∗ ) ′ + ( − ˜ ν )(max {− s 0 , − ˜ ν } + sup ˜ c ) , C 2 : = ( f ∗ ) ′ + ( − ˜ ν ) . Com bining this with the uniform bound (106) , we obtain a b ound on the Rademac her complexit y via the Dudley entrop y in tegral: 2 R G c,f ,P,n ≤ 24 n − 1 / 2 Z β C 2 0 s log λ n C 1 2 h 1 (˜ ϵ ) ( − s 0 − ˜ ν ) λ n C 2 h 2 (˜ ϵ ) N ( h 3 (˜ ϵ ) /C 2 , G , ∥ · ∥ ∞ ) d ˜ ϵ : = e R n . (111) T ogether, the results in Sections 4.1.3-4.1.1 complete the pro of of Theorem 4.3. 17 A preprint - March 31, 2026 4.2 DR O with OT-Regularized f -Div ergences: ERM Bound Similarly to Theorem 3.10, one can also sho w that a solution to the empirical OT-regularized f -div ergence DR O problem (see (113) b elo w) is an appro ximate solution to the p opulation DR O problem with high probabilit y (again allowing for an optimization error tolerance, ϵ opt n , and optimization failure probabilit y , δ opt n ). Theorem 4.9. Under A ssumption 4.1, let p 0 ∈ (0 , min y p y ) and ˜ ν ∈ R such that ( f ∗ ) ′ + ( − M − ˜ ν ) ≥ 1 /p 0 . (112) Supp ose we have r > 0 , θ ∗ ,n : Z n → Θ , ϵ opt n ≥ 0 , δ opt n ∈ [ 0 , 1] , and E n ⊂ Z n such that P n ( E c n ) ≤ δ opt n and sup Q : D c f ( Q ∥ P n ) ≤ r E Q [ L θ ∗ ,n ] ≤ inf θ ∈ Θ sup Q : D c f ( Q ∥ P n ) ≤ r E Q [ L θ ] + ϵ opt n (113) on E n , wher e P n : = 1 n P n i =1 δ z i . Then for n ∈ Z + , ϵ > 0 we have P n sup Q : D c f ( Q ∥ P ) ≤ r E Q [ L θ ∗ ,n ] ≥ inf θ ∈ Θ sup Q : D c f ( Q ∥ P ) ≤ r E Q [ L θ ] + 2 max { R n , e R n } + ϵ opt n + ϵ ! 
\[
\le \delta^{opt}_n + 2\exp\!\left(-\frac{n\epsilon^2}{2\beta^2}\right) + 2\exp\!\left(-\frac{n\epsilon^2}{2(\beta(f^*)'_+(-\tilde\nu))^2}\right) + 2\sum_{y\in\mathcal{Y}} e^{-2n(p_y-p_0)^2}, \tag{114}
\]
where $R_n$ was defined in (88) and $\widetilde R_n$ was defined in (111).

The proof builds on the proof of Theorem 4.3, similarly to how Theorem 3.10 built on Theorem 3.8; for details, see Appendix C.

4.3 OT-Regularized f-Divergence Examples

Here we show that the KL divergence and α-divergence cases are covered by our results on OT-regularized f-divergence DRO; see the conditions in Assumption 4.1.

1. KL divergence: In this case we have $f_{KL}(t)=t\log(t)$ and $f^*_{KL}(t)=e^{t-1}$. It is straightforward to verify that $f_{KL}$ is strictly convex, $s_0 = 1 \in \{f^*_{KL}<\infty\}$, $\inf f^*_{KL}=0$, $\lim_{s\to\infty}(f^*_{KL})'(s)=\infty$, and
\[
\sup_{s\in[\tilde\nu,\infty)} |s(f^*_{KL})'(-s)| \le \max\{e^{-2}, -\tilde\nu e^{-\tilde\nu-1}\}. \tag{115}
\]
Thus we see that $f_{KL}$ satisfies the requirements of Assumption 4.1, and hence Theorems 4.3 and 4.9 hold for any $p_0\in(0,\min_y p_y)$ and any $\tilde\nu \le -1 - M - \log(1/p_0)$.

2. α-divergence, $\alpha>1$: Here, $f_\alpha$ is given by (8) and $f^*_\alpha$ by (9). It is straightforward to verify that $f_\alpha$ is strictly convex, $s_0 = \frac{1}{\alpha-1} \in \{f^*_\alpha<\infty\} = \mathbb{R}$, $\inf f^*_\alpha = \frac{1}{\alpha(\alpha-1)}$, $\lim_{s\to\infty}(f^*_\alpha)'(s)=\infty$, and
\[
\sup_{s\in[\tilde\nu,\infty)} |s(f^*_\alpha)'(-s)| = (\alpha-1)^{1/(\alpha-1)}\max\{-\tilde\nu,0\}^{\alpha/(\alpha-1)}. \tag{116}
\]
Thus we see that for all $\alpha>1$, $f_\alpha$ satisfies the requirements of Assumption 4.1, and hence Theorems 4.3 and 4.9 hold for all $\tilde\nu \le -M - (\alpha-1)^{-1}p_0^{-(\alpha-1)}$ and all $p_0\in(0,\min_y p_y)$.

Similarly to the examples in Section 3.2, in both the KL and α-divergence cases, if we consider $\psi$ satisfying $\psi^*(t) = O(t^{q/(q-1)})$ as $t\to 0$ for some $q>1$ (note that the cases (56) and (57) satisfy this for all $q>1$) and let $\lambda_n = Cn^{\kappa}$ with $\kappa = \max\{(q-1)/2, 1/2\}$, then $R_n = O(n^{-1/2})$ and $\widetilde R_n = O(\sqrt{\log(n)/n})$ (provided the entropy integrals are finite, e.g., in the cases considered in Appendix A).

5 Conclusions

We derived statistical performance guarantees, in the form of concentration inequalities, for DRO with OT and OT-regularized f-divergence neighborhoods. In the OT case, our results apply to both classification and regression, and they improve on prior studies of OT-DRO in that they apply to a wider range of OT cost functions, beyond the p-Wasserstein case, and have improved dependence on the neighborhood-size parameter. The increased generality of our results facilitates applications to a broader range of adversarial training methods. We also provide the first study of statistical guarantees for OT-regularized f-divergence DRO, a recent class of methods that combines sample reweighting with adversarial sample generation. A drawback of our approach to OT-regularized f-divergence DRO is that our results only apply to a discrete label space (i.e., classification problems). Addressing this limitation is a direction we intend to explore in future work.

A Example Details

In this appendix we provide bounds on the Dudley entropy integral term
\[
D_n := 12 n^{-1/2}\int_0^\beta \sqrt{\log\!\left(\left(\left\lceil \frac{M\lambda^*(h_2(\tilde\epsilon))}{2h_2(\tilde\epsilon)}\right\rceil + 1\right) N(h_1(\tilde\epsilon), \mathcal{G}, \|\cdot\|_\infty)\right)}\, d\tilde\epsilon, \tag{117}
\]
corresponding to each of the examples in Section 3.2. We consider the case where $\Theta$ is the unit ball in $\mathbb{R}^k$ with respect to some norm $\|\cdot\|_\Theta$ and we assume we have $L_\Theta\in(0,\infty)$ such that $\theta\mapsto\mathcal{L}_\theta(z)$ is $L_\Theta$-Lipschitz under this norm for all $z\in Z$.
5 Conclusions

We derived statistical performance guarantees, in the form of concentration inequalities, for DRO with OT and OT-regularized $f$-divergence neighborhoods. In the OT case, our results apply to both classification and regression, and they improve on prior studies of OT-DRO in that they apply to a wider range of OT cost functions, beyond the $p$-Wasserstein case, and have improved dependence on the neighborhood-size parameter. The increased generality of our results facilitates applications to a broader range of adversarial training methods. We also provide the first study of statistical guarantees for OT-regularized $f$-divergence DRO, a recent class of methods that combine sample reweighting with adversarial sample generation. A drawback of our approach to OT-regularized $f$-divergence DRO is that our results only apply to a discrete label space (i.e., classification problems). Addressing this limitation is a direction we intend to explore in future work.

A Example Details

In this appendix we provide bounds on the Dudley entropy integral term
$$D_n := 12 n^{-1/2} \int_0^\beta \sqrt{\log\!\left(\left(\frac{M \lambda^*(h_2(\tilde{\epsilon}))}{2 h_2(\tilde{\epsilon})} + 1\right) N(h_1(\tilde{\epsilon}), \mathcal{G}, \|\cdot\|_\infty)\right)} \, d\tilde{\epsilon}, \qquad (117)$$
corresponding to each of the examples in Section 3.2. We consider the case where $\Theta$ is the unit ball in $\mathbb{R}^k$ with respect to some norm $\|\cdot\|_\Theta$, and we assume we have $L_\Theta \in (0,\infty)$ such that $\theta \mapsto \mathcal{L}_\theta(z)$ is $L_\Theta$-Lipschitz under this norm for all $z \in \mathcal{Z}$. In such cases we have the covering number bounds
$$N(\epsilon_1, \mathcal{G}, \|\cdot\|_\infty) \le N(\epsilon_1/L_\Theta, \Theta, \|\cdot\|_\Theta) \le (1 + 2L_\Theta/\epsilon_1)^k, \qquad (118)$$
where we used, e.g., the result from Example 5.8 in [47]. Therefore, letting $h_1(\tilde{\epsilon}) = \gamma\tilde{\epsilon}$, $h_2(\tilde{\epsilon}) = (1-\gamma)\tilde{\epsilon}$ for $\gamma \in (0,1)$, we obtain the following bounds on $D_n$.

1. $\psi(t) = \alpha t^q$ for some $\alpha > 0$, $q > 1$: Using (55) and (118) we obtain
$$D_n \le 12 n^{-1/2} \int_0^\beta \sqrt{\log\!\left(\left(\frac{M (L_X/q)^q (q-1)^{q-1}}{2\alpha ((1-\gamma)\tilde{\epsilon})^q} + 1\right)\left(1 + \frac{2L_\Theta}{\gamma\tilde{\epsilon}}\right)^{k}\right)} \, d\tilde{\epsilon} \qquad (119)$$
$$\le 12 n^{-1/2} \beta \int_0^1 \sqrt{\log\!\left(\frac{M (L_X/q)^q (q-1)^{q-1}}{2\alpha \beta^q ((1-\gamma)u)^q} + 1\right) + \frac{2 k L_\Theta}{\beta\gamma u}} \, du$$
$$\le 12 n^{-1/2} \beta \int_0^1 \sqrt{\log\!\left(\left(\frac{M (L_X/q)^q (q-1)^{q-1}}{2\alpha \beta^q} + 1\right)((1-\gamma)u)^{-q} + 1\right) + \frac{2 k L_\Theta}{\beta\gamma u}} \, du$$
$$\le 24 n^{-1/2} \beta \sqrt{q\left(\frac{M (L_X/q)^q (q-1)^{q-1}}{2\alpha \beta^q} + 1\right)(1-\gamma)^{-1} + \frac{2 k L_\Theta}{\beta}\gamma^{-1}},$$
where we changed variables to $u = \tilde{\epsilon}/\beta$ and used that $\log(1+A) \le A$, $\lceil A/s \rceil \le (A+1)/s$ for all $s \in (0,1)$, $A \ge 0$, and $(1 + Bt^{-q}) \le (1 + B/t)^q$ for all $B \ge 1$, $t > 0$. Minimizing over $\gamma \in (0,1)$ and using the result
$$\min_{\gamma \in (0,1)} \left\{ A(1-\gamma)^{-1} + B\gamma^{-1} \right\} = \left(\sqrt{A} + \sqrt{B}\right)^2 \ \text{ for all } A, B > 0 \qquad (120)$$
we obtain
$$D_n \le 24 n^{-1/2} \beta \left( q^{1/2}\left(\frac{M (L_X/q)^q (q-1)^{q-1}}{2\alpha \beta^q} + 1\right)^{1/2} + \left(\frac{2 k L_\Theta}{\beta}\right)^{1/2} \right). \qquad (121)$$

2. $\psi(t) = \alpha t^q + \eta t$ for some $\alpha, \eta > 0$, $q > 1$: Using the bound $\lambda^*(\epsilon_2) \le L_X/\eta$ along with (118) we can compute
$$D_n \le 12 n^{-1/2} \int_0^\beta \sqrt{\log\!\left(\left(\frac{M L_X}{2\eta(1-\gamma)\tilde{\epsilon}} + 1\right)\left(1 + \frac{2L_\Theta}{\gamma\tilde{\epsilon}}\right)^{k}\right)} \, d\tilde{\epsilon} \qquad (122)$$
$$\le 12 n^{-1/2} \int_0^\beta \sqrt{\frac{M L_X}{2\eta(1-\gamma)\tilde{\epsilon}} + \frac{2 k L_\Theta}{\gamma\tilde{\epsilon}}} \, d\tilde{\epsilon} \le 24 n^{-1/2} \beta \sqrt{1 + \frac{M L_X}{2\eta(1-\gamma)\beta} + \frac{2 k L_\Theta}{\gamma\beta}}.$$
Minimizing over $\gamma \in (0,1)$ we obtain
$$D_n \le 24 n^{-1/2} \beta \sqrt{1 + \frac{M L_X}{2\eta\beta} + \frac{2 k L_\Theta}{\beta} + \frac{2}{\beta}\sqrt{\frac{k M L_X L_\Theta}{\eta}}}. \qquad (123)$$

3. $\psi(t) = \alpha(e^{qt} - 1)$ for some $\alpha, q > 0$: Mirroring the computation in the previous case, we obtain the bound (123) except with $\eta$ replaced by $\alpha q$.

4. $\psi(t) = \infty\, 1_{t>0}$: Using the fact that $\lambda^* = 0$ in this case, we have
$$D_n = 12 n^{-1/2} \int_0^\beta \sqrt{\log\!\left(N(h_1(\tilde{\epsilon}), \mathcal{G}, \|\cdot\|_\infty)\right)} \, d\tilde{\epsilon} \qquad (124)$$
$$\le 12 (k/n)^{1/2} \int_0^\beta \sqrt{\log(1 + 2L_\Theta/h_1(\tilde{\epsilon}))} \, d\tilde{\epsilon} \le 12 (2L_\Theta k/n)^{1/2} \int_0^\beta h_1(\tilde{\epsilon})^{-1/2} \, d\tilde{\epsilon}.$$
Letting $h_1(\tilde{\epsilon}) = \gamma\tilde{\epsilon}$ and taking $\gamma \to 1^-$ we find
$$D_n \le 24 (2 L_\Theta \beta k/n)^{1/2}. \qquad (125)$$
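The minimization over $\gamma$ in the bounds above uses the elementary identity (120); the optimum is attained at $\gamma = \sqrt{B}/(\sqrt{A}+\sqrt{B})$. As an illustration only (the values of $A$ and $B$ below are arbitrary test values, not constants from the paper), the following sketch compares a grid minimum with the closed form.

```python
import numpy as np

# Illustrative grid check of identity (120):
# min_{gamma in (0,1)} { A/(1-gamma) + B/gamma } = (sqrt(A) + sqrt(B))^2,
# attained at gamma = sqrt(B)/(sqrt(A) + sqrt(B)).  A, B are arbitrary test values.

rng = np.random.default_rng(0)
gammas = np.linspace(1e-6, 1.0 - 1e-6, 1_000_000)

for _ in range(5):
    A, B = rng.uniform(0.1, 10.0, size=2)
    grid_min = np.min(A / (1.0 - gammas) + B / gammas)
    closed_form = (np.sqrt(A) + np.sqrt(B)) ** 2
    print(f"A={A:.3f}, B={B:.3f}: grid min={grid_min:.6f}, closed form={closed_form:.6f}")
```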
B Properties of the Functions g_{θ,λ,ν}

In this appendix we provide detailed derivations of the properties of the functions
$$g_{\theta,\lambda,\nu} := \lambda\left(f^*(-\nu) - f^*\!\left(-\Delta \mathcal{L}^c_{\theta,\lambda}/\lambda - \nu\right)\right), \qquad \theta \in \Theta,\ \lambda \in (0,\lambda_n),\ \nu \in [\tilde{\nu}, -s_0], \qquad (126)$$
that were stated and used in Section 4.1.3 as part of the proof of Theorem 4.3; we work under Assumption 4.1.

First note that we have the following uniform bound on the functions in $\mathcal{G}_{c,f}$, which follows from the bounds $0 \le \Delta \mathcal{L}^c_{\theta,\lambda} \le \beta$ and $-\Delta \mathcal{L}^c_{\theta,\lambda}/\lambda - \nu \le -\nu \le -\tilde{\nu}$ together with $f^*$ being non-decreasing and $(f^*)'_+(-\tilde{\nu})$-Lipschitz on $(-\infty, -\tilde{\nu}]$:
$$0 \le \lambda\left(f^*(-\nu) - f^*(-\Delta \mathcal{L}^c_{\theta,\lambda}/\lambda - \nu)\right) \le \beta (f^*)'_+(-\tilde{\nu}) \qquad (127)$$
for all $\theta \in \Theta$, $\lambda \in (0,\lambda_n)$, $\nu \in [\tilde{\nu}, -s_0]$. Using these properties of $f^*$ we also obtain Lipschitz bounds in $\theta$,
$$\left|\lambda\left(f^*(-\nu) - f^*(-\Delta \mathcal{L}^c_{\theta_1,\lambda}/\lambda - \nu)\right) - \lambda\left(f^*(-\nu) - f^*(-\Delta \mathcal{L}^c_{\theta_2,\lambda}/\lambda - \nu)\right)\right| \qquad (128)$$
$$\le (f^*)'_+(-\tilde{\nu})\left|\mathcal{L}^c_{\theta_1,\lambda} - \mathcal{L}^c_{\theta_2,\lambda}\right| \le (f^*)'_+(-\tilde{\nu})\|\mathcal{L}_{\theta_1} - \mathcal{L}_{\theta_2}\|_\infty,$$
and in $\nu$,
$$\left|\lambda\left(f^*(-\nu_1) - f^*(-\Delta \mathcal{L}^c_{\theta,\lambda}/\lambda - \nu_1)\right) - \lambda\left(f^*(-\nu_2) - f^*(-\Delta \mathcal{L}^c_{\theta,\lambda}/\lambda - \nu_2)\right)\right| \le 2\lambda_n (f^*)'_+(-\tilde{\nu})|\nu_1 - \nu_2|. \qquad (129)$$
Lastly, we consider the dependence on $\lambda$. Start by noting that $f^*$ is Lipschitz on $(-\infty, d)$ for all $d \in \mathbb{R}$ and $-\Delta \mathcal{L}^c_{\theta,\lambda}/\lambda - \nu$ is bounded above and is absolutely continuous when $\lambda$ is restricted to a compact interval, due to the convexity of $\mathcal{L}^c_{\theta,\lambda}$ in $\lambda$. Therefore $\lambda \mapsto \lambda(f^*(-\nu) - f^*(-\Delta \mathcal{L}^c_{\theta,\lambda}/\lambda - \nu))$ is absolutely continuous when restricted to compact intervals, and so it is differentiable a.e. We bound the derivative as follows. First compute
$$\partial_\lambda\left[\lambda\left(f^*(-\nu) - f^*(-\Delta \mathcal{L}^c_{\theta,\lambda}/\lambda - \nu)\right)\right] = f^*(-\nu) - f^*(-\Delta \mathcal{L}^c_{\theta,\lambda}/\lambda - \nu) - (f^*)'_+(-\Delta \mathcal{L}^c_{\theta,\lambda}/\lambda - \nu)\left(-\partial_\lambda \Delta \mathcal{L}^c_{\theta,\lambda} + \Delta \mathcal{L}^c_{\theta,\lambda}/\lambda\right). \qquad (130)$$
By assumption, $f^*$ is bounded below, and we also have that $f^*$ is non-decreasing, hence
$$0 \le f^*(-\nu) - f^*(-\Delta \mathcal{L}^c_{\theta,\lambda}/\lambda - \nu) \le f^*(-\tilde{\nu}) - \inf f^*. \qquad (131)$$
Using the bound
$$\frac{\mathcal{L}^c_{\theta,\lambda+\Delta\lambda} - \mathcal{L}^c_{\theta,\lambda}}{\Delta\lambda} \le \sup \tilde{c}, \qquad (132)$$
together with the assumption that $s (f^*)'_+(-s)$ is bounded on $s \in [\tilde{\nu}, \infty)$ for all $\tilde{\nu} \in \mathbb{R}$, we can then compute
$$\left|(f^*)'_+(-\Delta \mathcal{L}^c_{\theta,\lambda}/\lambda - \nu)\left(-\partial_\lambda \Delta \mathcal{L}^c_{\theta,\lambda} + \Delta \mathcal{L}^c_{\theta,\lambda}/\lambda\right)\right| \qquad (133)$$
$$\le \left|(f^*)'_+(-\Delta \mathcal{L}^c_{\theta,\lambda}/\lambda - \nu)\left(\Delta \mathcal{L}^c_{\theta,\lambda}/\lambda + \nu\right)\right| + \left|(f^*)'_+(-\Delta \mathcal{L}^c_{\theta,\lambda}/\lambda - \nu)\left(\nu + \partial_\lambda \Delta \mathcal{L}^c_{\theta,\lambda}\right)\right|$$
$$\le \sup_{t \ge \tilde{\nu}} |t (f^*)'_+(-t)| + (f^*)'_+(-\tilde{\nu})\left(\max\{-s_0, -\tilde{\nu}\} + \sup \tilde{c}\right).$$
Therefore we obtain
$$\left|\partial_\lambda\left[\lambda\left(f^*(-\nu) - f^*(-\Delta \mathcal{L}^c_{\theta,\lambda}/\lambda - \nu)\right)\right]\right| \le f^*(-\tilde{\nu}) - \inf f^* + \sup_{t \ge \tilde{\nu}} |t (f^*)'_+(-t)| + (f^*)'_+(-\tilde{\nu})\left(\max\{-s_0, -\tilde{\nu}\} + \sup \tilde{c}\right). \qquad (134)$$
This completes the proof of the properties of the functions $g_{\theta,\lambda,\nu}$ that were used in the main text.

C Proof of ERM Bound for DRO with OT-Regularized f-Divergences

In this appendix we prove Theorem 4.9 from the main text, which is repeated below.

Theorem C.1. Under Assumption 4.1, let $p_0 \in (0, \min_y p_y)$ and $\tilde{\nu} \in \mathbb{R}$ be such that
$$(f^*)'_+(-M - \tilde{\nu}) \ge 1/p_0. \qquad (135)$$
Suppose we have $r > 0$, $\theta_{*,n}: \mathcal{Z}^n \to \Theta$, $\epsilon^{opt}_n \ge 0$, $\delta^{opt}_n \in [0,1]$, and $E_n \subset \mathcal{Z}^n$ such that $P^n(E_n^c) \le \delta^{opt}_n$ and
$$\sup_{Q: D^c_f(Q\|P_n) \le r} E_Q[\mathcal{L}_{\theta_{*,n}}] \le \inf_{\theta \in \Theta} \sup_{Q: D^c_f(Q\|P_n) \le r} E_Q[\mathcal{L}_\theta] + \epsilon^{opt}_n \qquad (136)$$
on $E_n$, where $P_n := \frac{1}{n}\sum_{i=1}^n \delta_{z_i}$. Then for $n \in \mathbb{Z}^+$, $\epsilon > 0$ we have
$$P^n\!\left(\sup_{Q: D^c_f(Q\|P) \le r} E_Q[\mathcal{L}_{\theta_{*,n}}] \ge \inf_{\theta \in \Theta} \sup_{Q: D^c_f(Q\|P) \le r} E_Q[\mathcal{L}_\theta] + 2\max\{R_n, \widetilde{R}_n\} + \epsilon^{opt}_n + \epsilon\right) \qquad (137)$$
$$\le \delta^{opt}_n + 2\exp\!\left(-\frac{n\epsilon^2}{2\beta^2}\right) + 2\exp\!\left(-\frac{n\epsilon^2}{2(\beta (f^*)'_+(-\tilde{\nu}))^2}\right) + 2\sum_{y \in \mathcal{Y}} e^{-2n(p_y - p_0)^2},$$
where $R_n$ was defined in (88) and $\widetilde{R}_n$ was defined in (111).

Proof. Using (136), on $E_n$ one can compute the error decomposition bound
$$\sup_{Q: D^c_f(Q\|P) \le r} E_Q[\mathcal{L}_{\theta_{*,n}}] - \inf_{\theta \in \Theta} \sup_{Q: D^c_f(Q\|P) \le r} E_Q[\mathcal{L}_\theta] \qquad (138)$$
$$\le \sup_{Q: D^c_f(Q\|P) \le r} E_Q[\mathcal{L}_{\theta_{*,n}}] - \sup_{Q: D^c_f(Q\|P_n) \le r} E_Q[\mathcal{L}_{\theta_{*,n}}] + \inf_{\theta \in \Theta} \sup_{Q: D^c_f(Q\|P_n) \le r} E_Q[\mathcal{L}_\theta] - \inf_{\theta \in \Theta} \sup_{Q: D^c_f(Q\|P) \le r} E_Q[\mathcal{L}_\theta] + \epsilon^{opt}_n$$
$$\le \sup_{\theta \in \Theta}\left( \sup_{Q: D^c_f(Q\|P) \le r} E_Q[\mathcal{L}_\theta] - \sup_{Q: D^c_f(Q\|P_n) \le r} E_Q[\mathcal{L}_\theta] \right) + \sup_{\theta \in \Theta}\left( -\left( \sup_{Q: D^c_f(Q\|P) \le r} E_Q[\mathcal{L}_\theta] - \sup_{Q: D^c_f(Q\|P_n) \le r} E_Q[\mathcal{L}_\theta] \right) \right) + \epsilon^{opt}_n.$$
The claimed result then follows from a union bound followed by essentially the same computations as in the proof of Theorem 4.3.
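To make the form of the guarantee concrete, the following minimal sketch simply evaluates the right-hand side of the failure-probability bound (137) for assumed placeholder constants; none of these numbers are computed from the paper's assumptions or examples.

```python
import numpy as np

# Illustration only: evaluate the right-hand side of the bound (137),
#   delta_opt + 2 exp(-n eps^2 / (2 beta^2))
#             + 2 exp(-n eps^2 / (2 (beta * (f*)'_+(-nu_tilde))^2))
#             + 2 sum_y exp(-2 n (p_y - p0)^2),
# for assumed placeholder values of the constants.

def erm_failure_bound(n, eps, beta, fstar_prime, p_y, p0, delta_opt):
    term1 = 2.0 * np.exp(-n * eps**2 / (2.0 * beta**2))
    term2 = 2.0 * np.exp(-n * eps**2 / (2.0 * (beta * fstar_prime) ** 2))
    term3 = 2.0 * np.sum(np.exp(-2.0 * n * (np.asarray(p_y) - p0) ** 2))
    return delta_opt + term1 + term2 + term3

if __name__ == "__main__":
    # assumed: three labels with probabilities p_y and a floor p0 < min_y p_y
    print(erm_failure_bound(n=50_000, eps=0.05, beta=1.0, fstar_prime=2.0,
                            p_y=[0.5, 0.3, 0.2], p0=0.1, delta_opt=0.0))
```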
References

[1] A. Ahmadi-Javid, Entropic value-at-risk: A new coherent risk measure, Journal of Optimization Theory and Applications, 155 (2012), pp. 1105–1123.
[2] Y. An and R. Gao, Generalization bounds for (Wasserstein) robust optimization, Advances in Neural Information Processing Systems, 34 (2021), pp. 10382–10392.
[3] L. Aolaritei, S. Shafiee, and F. Dörfler, Wasserstein distributionally robust estimation in high dimensions: performance analysis and optimal hyperparameter tuning, Mathematical Programming, (2026), pp. 1–85.
[4] W. Azizian, F. Iutzeler, and J. Malick, Exact generalization guarantees for (regularized) Wasserstein distributionally robust models, Advances in Neural Information Processing Systems, 36 (2023), pp. 14584–14596.
[5] W. Azizian, F. Iutzeler, and J. Malick, Regularization for Wasserstein distributionally robust optimization, ESAIM: Control, Optimisation and Calculus of Variations, 29 (2023), p. 33.
[6] A. Ben-Tal, D. Bertsimas, and D. B. Brown, A soft robust model for optimization under ambiguity, Operations Research, 58 (2010), pp. 1220–1234, https://doi.org/10.1287/opre.1100.0821.
[7] A. Ben-Tal, D. Den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen, Robust solutions of optimization problems affected by uncertain probabilities, Management Science, 59 (2013), pp. 341–357.
[8] A. Ben-Tal, D. den Hertog, A. D. Waegenaere, B. Melenberg, and G. Rennen, Robust solutions of optimization problems affected by uncertain probabilities, Management Science, 59 (2013), pp. 341–357, http://www.jstor.org/stable/23359484.
[9] D. Bertsimas, V. Gupta, and N. Kallus, Data-driven robust optimization, Mathematical Programming, 167 (2018), pp. 235–292.
[10] J. Birrell, Statistical error bounds for GANs with nonlinear objective functionals, Transactions on Machine Learning Research, (2025), https://openreview.net/forum?id=ZgjhykPSdU.
[11] J. Birrell, P. Dupuis, M. A. Katsoulakis, Y. Pantazis, and L. Rey-Bellet, (f, Γ)-Divergences: Interpolating between f-divergences and integral probability metrics, Journal of Machine Learning Research, 23 (2022), pp. 1–70, http://jmlr.org/papers/v23/21-0100.html.
[12] J. Birrell and R. Ebrahimi, Optimal transport regularized divergences: Application to adversarial robustness, SIAM Journal on Mathematics of Data Science, 7 (2025), pp. 1801–1827.
[13] J. Blanchet, D. Kuhn, J. Li, and B. Taskesen, Unifying distributionally robust optimization via optimal transport theory, arXiv e-prints, (2023), https://arxiv.org/abs/2308.05414.
[14] J. Blanchet and K. Murthy, Quantifying distributional model risk via optimal transport, Mathematics of Operations Research, 44 (2019), pp. 565–600, https://doi.org/10.1287/moor.2018.0936.
[15] J. Blanchet, K. Murthy, and V. A. Nguyen, Statistical analysis of Wasserstein distributionally robust estimators, in Tutorials in Operations Research: Emerging Optimization Methods and Modeling Techniques with Applications, INFORMS, 2021, pp. 227–254.
[16] J. Blanchet, K. Murthy, and N. Si, Confidence regions in Wasserstein distributionally robust estimation, Biometrika, 109 (2022), pp. 295–315.
[17] T. A. Bui, T. Le, Q. H. Tran, H. Zhao, and D. Phung, A unified Wasserstein distributional robustness framework for adversarial training, in International Conference on Learning Representations, 2022, https://openreview.net/pdf?id=Dzpe9C1mpiv.
[18] E. Delage and Y. Ye, Distributionally robust optimization under moment uncertainty with application to data-driven problems, Operations Research, 58 (2010), pp. 595–612, https://doi.org/10.1287/opre.1090.0741.
[19] J. Dong, L. Yang, Y. Wang, X. Xie, and J. Lai, Towards intrinsic adversarial robustness through probabilistic training, IEEE Transactions on Image Processing, (2023).
[20] Y. Dong, Z. Deng, T. Pang, J. Zhu, and H. Su, Adversarial distributional training for robust deep learning, Advances in Neural Information Processing Systems, 33 (2020), pp. 8270–8283.
[21] J. Duchi and H. Namkoong, Variance-based regularization with convex objectives, Journal of Machine Learning Research, 20 (2019), pp. 1–55.
[22] J. C. Duchi, P. W. Glynn, and H. Namkoong, Statistics of robust optimization: A generalized empirical likelihood approach, Mathematics of Operations Research, 46 (2021), pp. 946–969.
[23] J. C. Duchi and H. Namkoong, Learning models with uniform performance via distributionally robust optimization, The Annals of Statistics, 49 (2021), pp. 1378–1406.
[24] R. Gao, Finite-sample guarantees for Wasserstein distributionally robust optimization: Breaking the curse of dimensionality, Operations Research, 71 (2023), pp. 2291–2306.
[25] R. Gao, X. Chen, and A. J. Kleywegt, Wasserstein distributionally robust optimization and variation regularization, Operations Research, 72 (2024), pp. 1177–1191.
[26] R. Gao and A. Kleywegt, Distributionally robust stochastic optimization with Wasserstein distance, Mathematics of Operations Research, 48 (2023), pp. 603–655, https://doi.org/10.1287/moor.2022.1275.
[27] J. Goh and M. Sim, Distributionally robust optimization and its tractable approximations, Operations Research, 58 (2010), pp. 902–917, https://doi.org/10.1287/opre.1090.0795.
[28] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and harnessing adversarial examples, arXiv preprint arXiv:1412.6572, (2014).
[29] W. Hu, G. Niu, I. Sato, and M. Sugiyama, Does distributionally robust supervised learning give robust classifiers?, in International Conference on Machine Learning, PMLR, 2018, pp. 2029–2037.
[30] Z. Hu and L. J. Hong, Kullback-Leibler divergence constrained distributionally robust optimization, Available at Optimization Online, 1 (2013), p. 9.
[31] D. Kuhn, S. Shafiee, and W. Wiesemann, Distributionally robust optimization, Acta Numerica, 34 (2025), pp. 579–804.
[32] H. Lam, Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization, Operations Research, 67 (2019), pp. 1090–1105, https://doi.org/10.1287/opre.2018.1786.
[33] H. Lam and E. Zhou, The empirical likelihood approach to quantifying uncertainty in sample average approximation, Operations Research Letters, 45 (2017), pp. 301–307.
[34] J. Lee and M. Raginsky, Minimax statistical learning with Wasserstein distances, Advances in Neural Information Processing Systems, 31 (2018).
[35] F. Liese and I. Vajda, On divergences and informations in statistics and information theory, IEEE Transactions on Information Theory, 52 (2006), pp. 4394–4412.
[36] Z. Liu, B. P. Van Parys, and H. Lam, Smoothed f-divergence distributionally robust optimization: Exponential rate efficiency and complexity-free calibration, arXiv preprint arXiv:2306.14041, (2023).
[37] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, Towards deep learning models resistant to adversarial attacks, in International Conference on Learning Representations, 2018, https://openreview.net/forum?id=rJzIBfZAb.
[38] P. Mohajerin Esfahani and D. Kuhn, Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulation, Mathematical Programming, 171 (2018), pp. 115–166, https://doi.org/10.1007/s10107-017-1172-1.
[39] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, second edition, Adaptive Computation and Machine Learning series, MIT Press, 2018, https://books.google.com/books?id=V2B9DwAAQBAJ.
[40] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, Practical black-box attacks against machine learning, in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, 2017, pp. 506–519.
[41] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, The limitations of deep learning in adversarial settings, in 2016 IEEE European Symposium on Security and Privacy (EuroS&P), IEEE, 2016, pp. 372–387.
[42] C. Regniez, G. Gidel, and H. Berard, A distributional robustness perspective on adversarial training with the ∞-Wasserstein distance, (2021), https://openreview.net/forum?id=z7DAilcTx7.
[43] S. Shafieezadeh-Abadeh, D. Kuhn, and P. M. Esfahani, Regularization via mass transportation, Journal of Machine Learning Research, 20 (2019), pp. 1–68.
[44] M. Staib and S. Jegelka, Distributionally robust optimization and generalization in kernel methods, in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds., vol. 32, Curran Associates, Inc., 2019, https://proceedings.neurips.cc/paper_files/paper/2019/file/1770ae9e1b6bc9f5fd2841f141557ffb-Paper.pdf.
[45] R. van Handel, Probability in High Dimension, APC 550 Lecture Notes, Princeton University, 2016.
[46] C. Villani, Optimal Transport: Old and New, Grundlehren der mathematischen Wissenschaften, Springer Berlin Heidelberg, 2008, https://books.google.com/books?id=hV8o5R7_5tkC.
[47] M. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, 2019, https://books.google.com/books?id=IluHDwAAQBAJ.
[48] J. Wang, R. Gao, and Y. Xie, Sinkhorn distributionally robust optimization, Operations Research, (2025).
[49] Y. Wang, D. Zou, J. Yi, J. Bailey, X. Ma, and Q. Gu, Improving adversarial robustness requires revisiting misclassified examples, in International Conference on Learning Representations, 2020, https://api.semanticscholar.org/CorpusID:211548864.
[50] W. Wiesemann, D. Kuhn, and M. Sim, Distributionally robust convex optimization, Operations Research, 62 (2014), pp. 1358–1376, https://doi.org/10.1287/opre.2014.1314.
[51] Q. Wu, J. Yu-Meng Li, and T. Mao, On generalization and regularization via Wasserstein distributionally robust optimization, arXiv e-prints, (2022), https://arxiv.org/abs/2212.05716.
[52] J. Yu-Meng Li and T. Mao, A general Wasserstein framework for data-driven distributionally robust optimization: Tractability and applications, arXiv e-prints, (2022), https://arxiv.org/abs/2207.09403.
[53] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan, Theoretically principled trade-off between robustness and accuracy, in International Conference on Machine Learning, PMLR, 2019, pp. 7472–7482.
[54] J. Zhang, J. Zhu, G. Niu, B. Han, M. Sugiyama, and M. Kankanhalli, Geometry-aware instance-reweighted adversarial training, in International Conference on Learning Representations, 2020.
[55] L. Zhang, J. Yang, and R. Gao, A short and general duality proof for Wasserstein distributionally robust optimization, Operations Research, 73 (2025), pp. 2146–2155.