Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback
Andi Nika (MPI-SWS), Debmalya Mandal (University of Warwick), Parameswaran Kamalaruban (Visa), Adish Singla (MPI-SWS), Goran Radanović (MPI-SWS)

Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026, Tangier, Morocco. PMLR: Volume 300. Copyright 2026 by the author(s).

Abstract

We consider robustness against data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) under a strong-contamination model: given a dataset $D$ of trajectory-preference tuples (each preference being an $n$-dimensional binary label vector representing each of the $n$ agents' preferences), an $\epsilon$-fraction of the samples may be arbitrarily corrupted. We model the problem using the framework of linear Markov games. First, under a uniform coverage assumption, where every policy of interest is sufficiently represented in the clean (prior to corruption) data, we introduce a robust estimator that guarantees an $O(\epsilon^{1-o(1)})$ bound on the Nash-equilibrium gap. Next, we move to the more challenging unilateral coverage setting, in which only a Nash equilibrium and its single-player deviations are covered: here our proposed algorithm achieves an $O(\sqrt{\epsilon})$ Nash-gap bound. Both of these procedures, however, are computationally intractable. To address this, we relax our solution concept to coarse correlated equilibria (CCE). Under the same unilateral-coverage regime, we then derive a quasi-polynomial-time algorithm whose CCE gap scales as $O(\sqrt{\epsilon})$. To the best of our knowledge, this is the first systematic treatment of adversarial data corruption in offline MARLHF.

1 INTRODUCTION

Reinforcement Learning from Human Feedback (RLHF) [Christiano et al., 2017] has surged in popularity as a straightforward, efficient means of fine-tuning large language models (LLMs) [Ouyang et al., 2022].
While most existing work focuses on single-agent settings, many real-world applications involve multiple interacting decision-makers, such as autonomous vehicles, distributed systems, or strategic marketplaces. Extending RLHF to such settings leads to multi-agent RLHF (MARLHF), where the goal is to learn aligned joint policies from preference data. Yet, despite its importance, research on MARLHF remains remarkably limited [Zhang et al., 2024].

At the same time, a critical shortcoming of all learning algorithms is their susceptibility to data-poisoning attacks, and RLHF is no exception [Wang et al., 2023a, Shi et al., 2023, Rando and Tramèr, 2023, Nika et al., 2025, Baumgärtner et al., 2024]. By injecting malicious feedback or subtly corrupting preference labels, adversaries can steer LLMs toward biased or harmful outputs, an especially serious risk as these models are increasingly deployed in safety-critical settings. Mandal et al. [2025] have proposed robust methods against data corruption for single-agent RLHF. However, no prior work has tackled the robustness of MARLHF systems against data-poisoning attacks. The added strategic complexity of MARLHF can amplify the impact of such attacks. It is therefore unclear whether single-agent methods extend directly to this setting.

Motivated by the above, we initiate the study of corruption-robust offline MARLHF. Our setting assumes access to preference data $D = \{(\tau, \tau', o)\}$ of size $m$, where $\tau, \tau'$ are two sample trajectories and $o$ is an $n$-dimensional vector of binary entries, each denoting a preference over the trajectory pair, corresponding to one of the $n$ agents. We assume that an $\epsilon$-fraction of $D$ is arbitrarily corrupted by an attacker. Using the standard Bradley-Terry (BT) [Bradley and Terry, 1952] preference model, we cast our problem as an instance of a linear Markov game.
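As a concrete illustration of the data-generating process just described, the following sketch samples the $n$-dimensional preference label vector from per-agent trajectory returns under the Bradley-Terry model. The function names and inputs are illustrative only, not part of the paper's formal development:

```python
import numpy as np

def bt_preference_probs(returns_a, returns_b):
    """Per-agent P(trajectory a preferred to trajectory b) under Bradley-Terry:
    a sigmoid of the difference of the agents' cumulative rewards."""
    diff = np.asarray(returns_a, dtype=float) - np.asarray(returns_b, dtype=float)
    return 1.0 / (1.0 + np.exp(-diff))

def sample_preference_labels(returns_a, returns_b, rng):
    """Draw the n-dimensional label vector o with entries in {-1, +1},
    one independent Bernoulli draw per agent."""
    p = bt_preference_probs(returns_a, returns_b)
    return np.where(rng.random(p.shape) < p, 1, -1)

# illustrative usage: two agents with opposite return differences
rng = np.random.default_rng(0)
labels = sample_preference_labels([5.0, -5.0], [0.0, 0.0], rng)
```

Note that each agent's label is drawn independently, which is exactly why a single corrupted tuple can poison all $n$ coordinates of $o$ at once.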
Previous work has already established that the optimal theoretical guarantees in corruption-robust offline RL and two-player zero-sum Markov games, under linear function approximation, exhibit a linear dependence on $\epsilon$. However, such dependence is achieved only when the data covers all possible directions (i.e., uniform coverage). When the data covers only a Nash policy and its unilateral deviations (i.e., unilateral coverage), prior work in RLHF (also RL and two-player zero-sum MGs) achieves $\sqrt{\epsilon}$ bounds. An immediate question is whether we can attain the same $\epsilon$-dependent robustness rates in MARLHF as those of RLHF. Optimal dependence can be expected under uniform coverage. However, weaker coverage assumptions (e.g., unilateral coverage) imply a more challenging setting.

Challenges. The main challenge stems from uncertainty over multiple reward models: deriving worst-case guarantees means selecting a policy that performs well under every reward model in a confidence set obtained from a robust reward estimation, but computing that policy's worst-case performance is challenging, since it requires optimizing over all candidate reward models, even though our data-coverage assumption only holds with respect to the true (unknown) reward. In the single-agent case, Mandal et al. [2025] resort to subgradient methods (yielding an $O(\epsilon^{1/4})$ rate) and then a primal-dual approach to recover $O(\sqrt{\epsilon})$. No analogous primal-dual theory exists for Markov games. Thus, we take a different approach.

Our approach. In multi-agent environments, the ultimate objective is to identify equilibrium policies, and minimizing the Nash gap provides a natural surrogate for that goal.
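To make the Nash-gap objective concrete in the simplest case, the following sketch computes the gap of a product policy in a one-shot ($H = 1$) $n$-agent game with hypothetical payoff tensors; the general, horizon-$H$ definition is given in Section 2:

```python
import numpy as np

def expected_payoff(R, policies):
    """E_{a ~ pi}[R(a)] of a product policy over an n-dimensional payoff tensor."""
    v = np.asarray(R, dtype=float)
    for p in policies:
        # contracting axis 0 repeatedly averages out one agent at a time
        v = np.tensordot(np.asarray(p, dtype=float), v, axes=(0, 0))
    return float(v)

def best_response_value(R, policies, i):
    """max_{a_i} E_{a_{-i} ~ pi_{-i}}[R(a_i, a_{-i})]."""
    v = np.moveaxis(np.asarray(R, dtype=float), i, 0)  # agent i's axis first
    for j, p in enumerate(policies):
        if j != i:
            # axis 1 always holds the next not-yet-averaged opponent
            v = np.tensordot(np.asarray(p, dtype=float), v, axes=(0, 1))
    return float(np.max(v))

def nash_gap(payoffs, policies):
    """Gap(pi) = sum_i [ V_i^{dagger, pi_{-i}} - V_i^{pi} ] for a one-shot game."""
    return sum(
        best_response_value(payoffs[i], policies, i) - expected_payoff(payoffs[i], policies)
        for i in range(len(payoffs))
    )
```

For example, in matching pennies the uniform product policy has zero gap (it is the Nash equilibrium), while any pure product policy has a strictly positive gap.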
Our key insight is that, for any reward model in the confidence set obtained by a robust reward estimation, the gap at a true Nash equilibrium policy $\pi^*$ must lie within a small margin of the minimal gap over all policies, up to the reward-estimation error induced by the confidence set. Thus, if $\tilde{\pi}$ nearly minimizes the gap for a candidate reward model, then the gradient of $\pi^*$ with respect to rewards can serve as a biased, but usable, proxy for the gradient at $\tilde{\pi}$. To estimate that proxy, we make use of our unilateral coverage (i.e., $D$ sufficiently covers $\pi^*$ and its unilateral deviations). In the linear Markov-game setting, the empirical feature differences between data-generating policies $\mu$ and $\mu_{\mathrm{ref}}$ then furnish a tractable approximation to the desired gradient. Plugging this into a first-order optimizer allows us to obtain the desired $O(\sqrt{\epsilon})$ guarantee on the Nash gap. In particular, we make the following contributions. A summary of our results is given in Table 1.

• First, assuming that $D$ has uniform coverage, we design an algorithm that approximately computes a NE solely from corrupted preference data. Our algorithm first robustly estimates each reward function. Then, it runs a value-based backward-induction procedure to compute a policy that minimizes an estimate of the gap. We prove that our algorithm incurs $O(n\epsilon^{1-o(1)} + n/\sqrt{m})$ bounds on the Nash gap.

• Next, we relax the coverage assumption on our data and assume only unilateral coverage. We run projected gradient ascent (PGA) with a biased estimate of the gradient of the true gap for $T_1$ steps to compute the worst-case reward parameter. Using that parameter, we run the same procedure as in our previous method. We finally prove that our algorithm incurs $O(n\sqrt{\epsilon} + n/\sqrt{m} + n/\sqrt{T_1})$ bounds on the Nash gap.

• Our final contribution is on computational tractability. It is well known that NE computation is intractable for general-sum Markov games.
We thus relax the NE notion into that of coarse correlated equilibrium (CCE). This allows us to frame each stage game as a saddle-point problem with a convex-concave objective. We then utilize Optimistic Hedge to learn the CCE of each stage game. This yields an $O(n\sqrt{\epsilon} + n/\sqrt{m} + n/\sqrt{T_1} + n/T_2)$ bound on the CCE gap, where $T_2$ is the number of steps for which we run Optimistic Hedge.

2 PRELIMINARIES

This section contains the background technical material to be used throughout the paper.

2.1 Markov Games

A Markov game of finite horizon $H$ between $n$ agents is defined by the tuple $G = (S, \{A_i\}_{i=1}^{n}, \{P_h\}_{h=0}^{H-1}, \{R_{i,h}\}_{h=0}^{H-1}, s_0)$, where $S$ is the state space, $A_i$ is the action set of agent $i$, and $P_h : S \times A_1 \times \cdots \times A_n \to \Delta(S)$ is the state transition kernel at time-step $h$; the map $R_{i,h} : S \times A_1 \times \cdots \times A_n \to \Delta(\mathbb{R})$ denotes the random reward of agent $i$ at time-step $h$, with $R_{i,h}(s, a) := \mathbb{E}[R_{i,h}(s, a) \mid s, a]$; finally, $s_0 \in S$ denotes the initial state.

Policies and value functions. Given agent $i$, a Markov policy $\pi_i = (\pi_{i,0}, \ldots, \pi_{i,H-1})$ denotes the tuple containing the decision-making strategies of agent $i$, where, for each $h \in [H-1] := \{0, 1, \ldots, H-1\}$, $\pi_{i,h} : S \to \Delta(A_i)$ maps states to probability simplices over actions. A joint product policy is defined as the tuple $\pi = (\pi_1, \ldots, \pi_n)$ over all agents. We denote by $\Pi^{\mathrm{PP}} = \Pi^{\mathrm{PP}}_1 \times \cdots \times \Pi^{\mathrm{PP}}_n$ the overall product policy class and write $\pi = (\pi_i, \pi_{-i})$, where $\pi_{-i} = (\pi_1, \ldots, \pi_{i-1}, \pi_{i+1}, \ldots, \pi_n)$. Given joint policy $\pi$ and state $s$ at time-step $h$, the value function with respect to $\pi$, $s$, and $h$ is defined as
$$V^{\pi}_{i,h}(s) = \mathbb{E}\left[\sum_{t=h}^{H-1} R_{i,t}(s_t, a_t) \,\Big|\, s_h = s, \pi_t, P_t\right],$$
where $a = (a_1, \ldots, a_n)$. Moreover, for a given $a$, the action-value function is defined as
$$Q^{\pi}_{i,h}(s, a) = \mathbb{E}\left[\sum_{t=h}^{H-1} R_{i,t}(s_t, a_t) \,\Big|\, s_h = s, a_h = a, \pi_t, P_t\right].$$

Nash equilibria.
A product policy $\pi^*$ is said to be an $\alpha$-Nash equilibrium if there exists $\alpha \geq 0$ such that, for every agent $i$ and state $s$, we have $V^{\pi^*}_{i,0}(s) \geq V^{\pi'_i, \pi^*_{-i}}_{i,0}(s) - \alpha$, for every $\pi'_i \in \Pi^{\mathrm{PP}}_i$. If $\alpha = 0$, then $\pi^*$ is said to be a Nash equilibrium. We also define the notion of optimality of a given policy profile $\pi$. The Nash gap [Cui and Du, 2022] of $\pi$ is defined as $\mathrm{Gap}(\pi) = \sum_{i \in [n]} V^{\dagger, \pi_{-i}}_{i,0}(s_0) - V^{\pi}_{i,0}(s_0)$, where $V^{\dagger, \pi_{-i}}_{i,0}(s_0) = \max_{\pi'_i} V^{\pi'_i, \pi_{-i}}_{i,0}(s_0)$. Note that, by definition, any Nash equilibrium has $0$ Nash gap, and any $\alpha$-Nash equilibrium has at most $\alpha$ Nash gap.

Table 1: Summary of our bounds for: (i) NE gap minimization under uniform coverage, (ii) NE gap minimization under unilateral coverage, and (iii) CCE gap minimization under unilateral coverage.

  NE & Unif. Cov.:   $\tilde{O}\big( \big( \frac{1}{\xi_R} + \frac{1}{\xi_P} \big) H n \epsilon^{1-o(1)} + \frac{H^2 n \sqrt{\mathrm{poly}(d)}}{\xi_P \sqrt{m}} \big)$
  NE & Unil. Cov.:   $\tilde{O}\big( \big( \frac{1}{\sqrt{C_R}} + \frac{1}{\sqrt{C_P}} + \frac{1}{\sqrt{T_1}} \big) \big( H^{5/2} n d^{3/4} \sqrt{\epsilon} + \frac{H^2 n \sqrt{\mathrm{poly}(d)}}{\sqrt{m}} \big) \big)$
  CCE & Unil. Cov.:  $\tilde{O}\big( \big( \frac{1}{\sqrt{C_R}} + \frac{1}{\sqrt{C_P}} + \frac{1}{\sqrt{T_1}} \big) \big( H^{5/2} n d^{3/4} \sqrt{\epsilon} + \frac{H^2 n \sqrt{\mathrm{poly}(d)}}{\sqrt{m}} \big) + \frac{H n^2}{T_2} \big)$

Here, $\tilde{O}$ hides any poly-logarithmic factors, $n$ denotes the number of agents, $H$ denotes the horizon, and $d$ denotes the dimension. Moreover, $\xi_R$ and $\xi_P$ denote the uniform coverage coefficients, while $C_R$ and $C_P$ denote the unilateral coverage coefficients; $\epsilon$ denotes the corruption parameter, $m$ is the data size, $T_1$ is the number of gradient steps, while $T_2$ is the number of steps for which Optimistic Hedge is run. It is worth mentioning that our bounds have optimal dependence on $\epsilon$ in the uniform coverage setting, while maintaining the same dependence as the single-agent (and two-player zero-sum Markov game) settings under non-uniform coverage. Moreover, note that the algorithm for CCE approximation is also computationally efficient.

Linear Markov games.
In this paper, we consider linear Markov games [Zhong et al., 2022]. Formally, $G$ is said to be a linear Markov game with feature map $\phi : S \times A \to \mathbb{R}^d$, for some $d \in \mathbb{N}$, if we have
$$P_h(s_{h+1} \mid s_h, a_h) = \langle \phi(s_h, a_h), \xi_h(s_{h+1}) \rangle \quad \text{and} \quad R_{i,h}(s_h, a_h) = \langle \phi(s_h, a_h), \theta^*_{i,h} \rangle + \zeta_{i,h},$$
for all $(s_h, a_h, i, h) \in S \times A \times [n] \times [H-1]$, where $\xi_h$ and $\theta^*_{i,h}$ are unknown parameters and $\zeta_{i,h}$ is zero-mean $\gamma^2$-subGaussian noise. Here, $\|\phi(s,a)\|_2 \leq 1$ for all state-action tuples $(s,a) \in S \times A$, and $\max\{\|\theta^*_{i,h}\|_2, \|\xi_h(s)\|_2\} \leq \sqrt{d}$, for all $i \in [n]$ and $h \in [H-1]$. Let $\Theta$ denote the set of all feasible $\theta$ as defined here.

Remark 1. There are two main reasons why we consider linear Markov games to model our problem. First, the corruption-robust offline RL literature in the general function approximation setting [Ye et al., 2023] considers a corruption model which is defined in terms of Bellman residuals. Since we only assume access to preference data, this type of corruption model is not well-defined for our setting. Second, relaxing the linearity of rewards would then require corruption-robust maximum likelihood estimation procedures beyond generalized linear models, which, to the best of our knowledge, are not present in the current literature.

2.2 Preference Data

Following the formulation in [Zhang et al., 2024], we denote by $\tilde{D} = \{(\tilde{\tau}_i, \tilde{\tau}'_i, \tilde{o}_i)\}_{i=1}^{m}$ the clean preference dataset, where $\tilde{\tau} = (\tilde{s}_0, \tilde{a}_{1,0}, \tilde{a}_{2,0}, \ldots, \tilde{a}_{n,0}, \tilde{s}_1, \ldots, \tilde{s}_{H-1})$ and $\tilde{\tau}' = (\tilde{s}'_0, \tilde{a}'_{1,0}, \tilde{a}'_{2,0}, \ldots, \tilde{a}'_{n,0}, \tilde{s}'_1, \ldots, \tilde{s}'_{H-1})$ denote sampled trajectories from behavior policies $\mu$ and $\mu_{\mathrm{ref}}$, respectively, and $\tilde{o}_i = (\tilde{o}_{i,1}, \ldots, \tilde{o}_{i,n})$, with $\tilde{o}_{i,j} \in \{-1, +1\}$ for all $j \in [n]$, gives information about the individual preferences of the agents for each pair of trajectories: $\tilde{o}_{i,j} = 1$ implies that $\tilde{\tau}_i$ is preferred to $\tilde{\tau}'_i$ by agent $j$. We assume that the preferences are generated according to the Bradley-Terry (BT) model [Bradley and Terry, 1952]: for each agent $j$, we assume that
$$\mathbb{P}(\tilde{o}_{i,j} = 1 \mid \tilde{\tau}_i, \tilde{\tau}'_i) = \sigma\!\left( \sum_{h=0}^{H-1} R_{j,h}(\tilde{s}_h, \tilde{a}_h) - \sum_{h=0}^{H-1} R_{j,h}(\tilde{s}'_h, \tilde{a}'_h) \right),$$
with $\sigma(x) = 1/(1 + \exp(-x))$ being the sigmoid function.

2.3 Corruption Model

Following the $\epsilon$-corruption model in offline RL(HF) [Zhang et al., 2022, Mandal et al., 2025], we assume that there exists an attacker that has full access to the dataset $\tilde{D}$ and arbitrarily perturbs an $\epsilon$-fraction of it. That is, given $\epsilon \in [0, 1/2)$, we assume that the attacker inspects $\tilde{D}$ and modifies up to $\epsilon \cdot m$ samples in $\tilde{D}$. We denote by $D$ the poisoned dataset. In other words, there are at most $\epsilon \cdot m$ data samples in $D$ such that $(\tau, \tau', o) \neq (\tilde{\tau}, \tilde{\tau}', \tilde{o})$.¹

2.4 Data Coverage

Offline learning problems necessitate access to a dataset that contains, at least to some extent, "good" samples, in the sense that they are traversed by the policies we are trying to approximate. This condition is usually described by the notion of data coverage. In linear Markov games (MG), data coverage is captured by the feature covariance matrix. Formally, for every $h \in [H-1]$, we define
$$\Sigma_{\mu}(h) = \mathbb{E}_{\mu}\left[ \phi(s_h, a_h) \phi(s_h, a_h)^\top \right]$$
and
$$\Sigma^{-}_{\mu, \mu_{\mathrm{ref}}} = \mathbb{E}_{\mu, \mu_{\mathrm{ref}}}\left[ (\phi(\tau) - \phi(\tau')) (\phi(\tau) - \phi(\tau'))^\top \right]$$
as the feature covariance matrices that will determine coverage of our given data.

¹Following prior literature on corruption-robust RL [Zhang et al., 2022], we assume that, for each $h \in [H-1]$, the subset of $D$ containing only the $h$-th steps of the samples is also consistent with the $\epsilon$-corruption model.
Here $\phi(\tau) = \sum_{h=0}^{H-1} \phi(s_h, a_h)$. Note that $\Sigma_{\mu}(h)$ is the standard covariance matrix used in the offline RL literature, while $\Sigma^{-}_{\mu, \mu_{\mathrm{ref}}}$ is the difference covariance matrix which has been previously used in the RLHF literature [Zhan et al., 2023].

3 ROBUST NE LEARNING UNDER UNIFORM COVERAGE

We start by considering the case when the preference dataset has uniform coverage, that is, all basis directions of the feature space are sufficiently covered. We state this assumption below.

Assumption 1 (Uniform Coverage). Let $\mu$ and $\mu_{\mathrm{ref}}$ be the behavior policies that were used to generate the trajectories present in the data $\tilde{D}$. We assume that $\Sigma^{-}_{\mu, \mu_{\mathrm{ref}}} \succeq \frac{\xi_R}{H} \cdot I$ and $\Sigma_{\mu}(h) \succeq \xi_P \cdot I$, for all $h \in [H-1]$, where $\xi_R$ and $\xi_P$ are strictly positive constants and $A \succeq B$ means that $x^\top (A - B) x \geq 0$ for all vectors $x \neq 0$.

Remark 2. Note that we require coverage with respect to two covariance matrices. The first one, $\Sigma^{-}_{\mu, \mu_{\mathrm{ref}}}$, captures coverage of the rewards, since rewards are estimated in terms of feature differences. The second one, $\Sigma_{\mu}(h)$, captures coverage of transitions for each step $h$. In standard RL, the second condition is enough to provide guarantees. However, in preference-based RL, the first condition is necessary due to the lack of a reward signal in the given data [Zhan et al., 2023].

On a high level, all our algorithms are based on the following pipeline. They first use the preference data to robustly estimate each agent's reward parameters. Then, using those rewards, they proceed to compute robust pessimistic and optimistic estimates of the individual Q-functions for policies of interest. Finally, they estimate the gap and output a joint policy that minimizes it. We will instantiate different versions of this pipeline under different coverage assumptions and notions of equilibria.
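As an empirical sanity check of Assumption 1, the smallest eigenvalue of the empirical feature covariance matrix indicates how well a finite sample covers all directions; a minimal sketch (the sample quantity only approximates the population matrix $\Sigma_{\mu}(h)$, and the feature array here is a hypothetical input):

```python
import numpy as np

def uniform_coverage_coeff(features):
    """Smallest eigenvalue of the empirical feature covariance (1/m) sum phi phi^T.
    Assumption 1 asks the population analogue to be at least xi_P > 0;
    a (near-)zero value signals an uncovered direction."""
    X = np.asarray(features, dtype=float)
    cov = X.T @ X / X.shape[0]
    return float(np.linalg.eigvalsh(cov)[0])  # eigvalsh returns ascending eigenvalues
```

For instance, features concentrated on a single direction yield a zero smallest eigenvalue, in which case no uniform-coverage guarantee applies and the unilateral-coverage machinery of Section 4 is needed.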
Robust Reward Estimation ⇒ Robust Q-function Estimation ⇒ Estimated Gap Minimization

3.1 Algorithm

The main idea of our proposed algorithm is as follows. First, note that our overall objective is to find a joint policy that minimizes the Nash gap with respect to the ground-truth reward functions. However, we have access neither to these functions, nor to the real environment. We are only given a finite preference dataset $D$, an $\epsilon$-fraction of which is arbitrarily corrupted. Therefore, our first step is to compute robust estimates of the reward parameters of each agent. Note that, for linear rewards, maximum likelihood estimation becomes standard logistic regression. And for such an objective, it is known [Awasthi et al., 2022] that we can recover the true parameter of interest from $\epsilon$-corrupted data with $O(\epsilon^{1-o(1)})$ accuracy via a robust method called TrimmedMLE (for pseudocode, see Algorithm 6 in Appendix E). Thus, for each agent $i$, we let $\tilde{\theta}_i = \mathrm{TrimmedMLE}(D, \epsilon, \nu)$, where we denote by $\theta_i$ the $Hd$-dimensional result of the concatenation of $\theta_{i,h}$, for $h \in [H-1]$, and $\nu$ denotes a granularity hyperparameter.

Algorithm 1 Corruption-robust Equilibrium Learning from Human Feedback with Uniform Coverage
Require: Preference dataset $D$; confidence parameter $\delta$.
1: Split $D$ into equal $D_1$ and $D_2$.
2: ▷ Reward Estimation via Trimmed MLE using $D_1$.
3: for $i \in [n]$ do
4:   Compute $\tilde{\theta}_i = \mathrm{TrimmedMLE}(D_1, \epsilon, \nu)$.
5:   Set the optimistic and pessimistic rewards $\overline{R}_{i,h}(\cdot,\cdot)$ and $\underline{R}_{i,h}(\cdot,\cdot)$ as in Equations (2) and (3), respectively.
6:   ▷ Robust Value Function Estimation Phase using $D_2$.
7:   for $\pi \in \Pi^{\mathrm{PP}}$ do
8:     Apply Algorithm 7 on $D_2$ using only the preferred trajectories generated by $\mu$, with input $\pi$, $i$, $\overline{R}_i$, and $\underline{R}_i$, and bonus function $\Gamma(\cdot,\cdot) = 0$, to obtain $\overline{V}^{\dagger,\pi_{-i}}_{i,h}(\cdot)$ and $\underline{V}^{\pi}_{i,h}(\cdot)$, for all $h \in [H-1]$.
9:   end for
10: end for
11: ▷ Nash Gap Estimation Phase
12: for every policy $\pi \in \Pi^{\mathrm{PP}}$ do
13:   Compute the estimated gap $\widehat{\mathrm{Gap}}(\pi)$.
14: end for
15: return $\tilde{\pi} \in \arg\min_{\pi \in \Pi^{\mathrm{PP}}} \widehat{\mathrm{Gap}}(\pi)$.

Next, since we are in the offline setting, the most reasonable approach is to apply pessimism with respect to the recovered parameters. To that end, we form confidence sets, for each agent $i$, based on the TrimmedMLE guarantees:
$$\Theta_{\mathrm{Unif}}(\tilde{\theta}_i) = \left\{ \theta \in \Theta : \big\|\tilde{\theta}_i - \theta\big\|_2 \leq O\!\left( \frac{\epsilon^{1-o(1)}}{\xi_R} \right) \right\}, \qquad (1)$$
where $\delta > 0$ is a randomness parameter. Once we have access to the confidence set, we compute the boundary parameters
$$\overline{R}_{i,h}(s,a) = \max_{\theta_i \in \Theta_{\mathrm{Unif}}(\tilde{\theta}_i)} \theta_{i,h}^\top \phi(s,a), \qquad (2)$$
$$\underline{R}_{i,h}(s,a) = \min_{\theta_i \in \Theta_{\mathrm{Unif}}(\tilde{\theta}_i)} \theta_{i,h}^\top \phi(s,a). \qquad (3)$$
Note that the inner problems are convex programs that have closed-form solutions. Now that we have access to our estimated reward functions, we need to minimize the Nash gap with respect to these rewards. In order to do that, we apply backward induction. First, we initialize the value function estimates $\underline{V}^{\pi}_H(\cdot) = \overline{V}^{\dagger,\pi_{-i}}_H(\cdot) = 0$, for every joint policy $\pi$. Then, for every step $h$ down to $0$, we apply a robust estimation algorithm Rob-Q (see Algorithm 2) for the Q-values. Essentially, the procedure first robustly estimates the parameters of the Bellman operator using a RobEst oracle, which is guaranteed to return an $O(\epsilon)$-close value parameter under uniform coverage [Zhang et al., 2022]. Then, it computes the estimated Q-values by properly clipping the bonus-inflated (deflated) estimates so that they remain in $[-H\sqrt{d}, H\sqrt{d}]$.

Algorithm 2 Robust Estimation of Q-Functions (Rob-Q)
Require: Dataset $D$; corruption level $\epsilon$; policy $\pi$; agent $i$; reward functions $\underline{R}_{i,h}$ and $\overline{R}_{i,h}$; step $h$; next-step value estimates $\underline{V}^{\pi}_{i,h+1}(\cdot)$, $\overline{V}^{\dagger,\pi_{-i}}_{i,h+1}(\cdot)$; bonus $\Gamma(\cdot,\cdot)$.
1: Set $\underline{\omega}^{\pi}_{i,h}$ as $\mathrm{RobEst}\big( \phi(s_h, a_h),\ \underline{R}_{i,h}(s_h, a_h) + \underline{V}^{\pi}_{i,h+1}(s_{h+1}) \big)$.
2: Set $\overline{\omega}^{\dagger,\pi_{-i}}_{i,h}$ as $\mathrm{RobEst}\big( \phi(s_h, a_h),\ \overline{R}_{i,h}(s_h, a_h) + \overline{V}^{\dagger,\pi_{-i}}_{i,h+1}(s_{h+1}) \big)$.
3: Set $\underline{Q}^{\pi}_{i,h}(\cdot,\cdot)$ as $\mathrm{Clip}_{[-(H-h)\sqrt{d},\,(H-h)\sqrt{d}]}\big( \phi(\cdot,\cdot)^\top \underline{\omega}^{\pi}_{i,h} - \Gamma(\cdot,\cdot) \big)$.
4: Set $\overline{Q}^{\dagger,\pi_{-i}}_{i,h}(\cdot,\cdot)$ as $\mathrm{Clip}_{[-(H-h)\sqrt{d},\,(H-h)\sqrt{d}]}\big( \phi(\cdot,\cdot)^\top \overline{\omega}^{\dagger,\pi_{-i}}_{i,h} + \Gamma(\cdot,\cdot) \big)$.
5: return Q-functions $\underline{Q}^{\pi}_{i,h}(\cdot,\cdot)$ and $\overline{Q}^{\dagger,\pi_{-i}}_{i,h}(\cdot,\cdot)$.

Once we have the Q-functions, the value function estimates $\underline{V}^{\pi}_{i,h}(\cdot)$ (and $\overline{V}^{\dagger,\pi_{-i}}_{i,h}(\cdot)$) for all steps, defined with respect to the estimated reward parameters, are then computed by taking expectations over the given policies (and taking the max over actions for player $i$).² Once we do this for every policy, we then return the policy $\tilde{\pi}$ that minimizes the estimated gap with respect to $(\overline{R}, \underline{R}) = (\overline{R}_1, \underline{R}_1, \ldots, \overline{R}_n, \underline{R}_n)$:
$$\arg\min_{\pi} \widehat{\mathrm{Gap}}\big(\pi, \overline{R}, \underline{R}\big) := \sum_{i \in [n]} \overline{V}^{\dagger,\pi_{-i}}_{i,0}(s_0) - \underline{V}^{\pi}_{i,0}(s_0). \qquad (4)$$
As shown in the Appendix (see Lemma A.6), our optimistic and pessimistic value estimates are high-probability approximations of the true value function, which implies that the estimated gap is a high-probability upper bound on the actual Nash gap. Minimizing this surrogate gap therefore serves as a proxy for minimizing the true Nash gap, and, as the gap approaches zero, the resulting joint policy correspondingly approaches a Nash equilibrium. Algorithm 1 provides the pseudocode for the full procedure.

²Note that Algorithm 7 runs Algorithm 2 for $H$ steps and finally returns the estimates of the value functions. We have used Algorithm 7 to present Algorithm 1 for ease of presentation. However, Algorithm 2 will be necessary in the following sections.

3.2 Theoretical guarantees

In this section, we state the theoretical guarantees on the convergence of Algorithm 1. Proofs can be found in Appendix A.

Theorem 3.1. Let $\epsilon \in [0, 1/2)$, $\delta > 0$, and $\Gamma(\cdot,\cdot) = 0$.
Furthermore, assume that $m \geq \Omega\big( (H^{3/2}/\epsilon^2)(d + \log(n/\delta)) \big)$. Then, under Assumption 1 with $\xi_R \geq 5\epsilon$, there exist robust algorithms TrimmedMLE and RobEst such that, with probability at least $1 - \delta$, the output $\tilde{\pi}$ of Algorithm 1 satisfies
$$\mathrm{Gap}(\tilde{\pi}) \leq O\!\left( H n \left( \frac{\exp(H) + \sqrt{\log(n/(2\delta\epsilon))}}{\xi_R} + \frac{H\sqrt{d} + \gamma}{\xi_P} \right) \cdot \epsilon + H n \sqrt{\frac{(H\sqrt{d} + \gamma)^2\,\mathrm{poly}(d)}{\xi_P^2\, m}} \right).$$

Remark 3. Note that the bounds of Theorem 3.1 have a quasi-linear dependence on $\epsilon$, which is known to be optimal in the single-agent setting [Zhang et al., 2022] and the two-agent zero-sum setting [Nika et al., 2024b]. This is due to the strong coverage assumption on the data. In practice, the data may cover only some directions of interest, in which case a different approach is needed.

4 ROBUST NE LEARNING UNDER UNILATERAL COVERAGE

In the previous section, we proposed an algorithm that returns an $O(n\epsilon + n/\sqrt{m})$-approximate Nash equilibrium under uniform coverage. However, such coverage is rarely possible in practice. The purpose of this section is to solve the Nash gap minimization problem under a more relaxed notion of coverage, namely unilateral coverage, which simply requires coverage of a Nash policy and all its unilateral deviations for each agent. We extend the notion of low relative uncertainty of [Zhong et al., 2022] to the MARLHF setting.

Assumption 2 (Unilateral Coverage). Given a Nash equilibrium $\pi^*$, we assume that there exist positive constants $C_R$ and $C_P$ such that, for all $h \in [H-1]$ and $i \in \{1, \ldots, n\}$,
$$\Sigma^{-}_{\rho, \rho'} \succeq C_R \cdot \Sigma^{-}_{(\pi_i, \pi^*_{-i}), \rho'} \ \ \text{for } \rho, \rho' \in \{\mu, \mu_{\mathrm{ref}}\}, \quad \text{and} \quad \Sigma_{\mu}(h) \succeq C_P \cdot \Sigma_{\pi_i, \pi^*_{-i}}(h).$$
The first condition simply says that the behavior policies $\mu$ and $\mu_{\mathrm{ref}}$ sufficiently cover a Nash equilibrium and its unilateral deviations in the feature space.
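Numerically, a relative-coverage condition of the form $\Sigma_1 \succeq C \cdot \Sigma_2$ can be probed for positive-definite matrices: the largest such $C$ is the smallest generalized eigenvalue of the pair, i.e., the minimum Rayleigh-quotient ratio. A minimal sketch (assuming both covariance matrices are already estimated and positive definite; this is an illustration, not part of the algorithms):

```python
import numpy as np

def unilateral_coverage_coeff(cov_behavior, cov_target):
    """Largest C with cov_behavior >= C * cov_target in the PSD order, i.e.
    min_x (x^T cov_behavior x) / (x^T cov_target x). For positive-definite
    inputs this equals the smallest eigenvalue of cov_target^{-1} cov_behavior."""
    A = np.asarray(cov_behavior, dtype=float)
    B = np.asarray(cov_target, dtype=float)
    M = np.linalg.solve(B, A)  # B^{-1} A; eigenvalues are real for PD pairs
    return float(np.min(np.linalg.eigvals(M).real))
```

A large coefficient means the behavior data dominates the target (deviation) policy's feature distribution, which is exactly what the $C_R$ and $C_P$ constants quantify.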
Different from single-agent RL, where single-policy concentrability is enough to provide theoretical guarantees, its extension to Markov games in which only Nash policies are covered does not allow for any guarantees. Unilateral coverage is in fact necessary and sufficient to provide any meaningful guarantees in zero-sum [Zhong et al., 2022] and general-sum Markov games [Zhang et al., 2023].

4.1 Algorithm

For the uniform coverage setting, we applied TrimmedMLE to obtain estimates of the ground-truth reward parameters. The benefit of such an approach is that, under such coverage, it comes with bounds on the $\ell_2$-norm of the error, which then allows for defining our confidence set in terms of such error bounds. This, in turn, allows us to directly upper bound the difference between value functions and their estimates in terms of the $\ell_2$-difference of their respective reward parameters. When we do not have uniform coverage, the final estimate is not guaranteed to remain close to the true parameter in the $\ell_2$ sense. In this case, as shown in the Appendix, the parameters are close in the log-sigmoid sense. Given the output $\tilde{\theta}_i$ of TrimmedMLE, we define the confidence set for the unilateral coverage setting as
$$\Theta_{\mathrm{Unil}}(\tilde{\theta}_i) = \left\{ \theta \in \Theta : \frac{2}{m} \sum_{(\tau, \tau', o) \in D} \log \frac{\sigma\big( o \cdot \tilde{\theta}_i^\top (\phi(\tau) - \phi(\tau')) \big)}{\sigma\big( o \cdot \theta^\top (\phi(\tau) - \phi(\tau')) \big)} \leq \kappa \right\}, \qquad (5)$$
where $\kappa = 6\epsilon H\sqrt{d} + (2d/m) \cdot \log(Hm/\delta)$ controls the 'radius' of the confidence set. We can provide theoretical guarantees that $\Theta_{\mathrm{Unil}}(\tilde{\theta}_i)$ contains the ground-truth parameter $\theta^*_i$ with high probability.

Algorithm 3 Reward Parameter Estimation (RewardEst)
Require: Dataset $D$; corruption level $\epsilon$; confidence parameter $\delta$; learning rate $\eta$; slackness parameter $\nu$; number of steps $T$.
1: for $i \in [n]$ do
2:   Let $\tilde{\theta}_i = \mathrm{TrimmedMLE}(D, \epsilon, \nu)$.
3:   Initialize $\hat{\theta}^{(0)}_i$ uniformly at random in $\Theta_{\mathrm{Unil}}(\tilde{\theta}_i)$ (defined in Equation (5)).
4:   for $t = 0, 1, \ldots, T-1$ do
5:     Take the gradient step $\hat{\theta}^{(t+1)}_i = P_{\Theta_{\mathrm{Unil}}(\tilde{\theta}_i)}\big( \hat{\theta}^{(t)}_i + \eta \tilde{\nabla}_{\theta_i} \mathrm{Gap}(\pi^*, \hat{\theta}^{(t)}) \big)$.
6:   end for
7:   Set $\hat{\theta}_i = (1/T) \sum_{t=1}^{T} \hat{\theta}^{(t)}_i$.
8: end for
9: return $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_n)$.

Algorithm 4 Corruption-robust Nash Equilibrium Learning from Human Feedback
Require: Dataset $D$; corruption level $\epsilon$; regularization parameter $\lambda$; confidence parameter $\delta$; learning rate $\eta_1$; bonus functions $\Gamma(\cdot,\cdot)$; slackness parameter $\nu$; number of gradient steps $T_1$.
1: Split $D$ into equal $D_1$ and $D_2$.
2: Compute $\hat{\theta} = \mathrm{RewardEst}(D_1, \epsilon, \delta, \eta_1, \nu, T_1)$.
3: for $i \in [n]$ do
4:   for $\pi \in \Pi^{\mathrm{PP}}$ do
5:     Initialize $\underline{V}^{\pi}_{i,H}(\cdot) = \overline{V}^{\dagger,\pi_{-i}}_{i,H}(\cdot) = 0$.
6:     for $h = H-1, \ldots, 0$ do
7:       Compute $\hat{R}_{i,h}(\cdot,\cdot) = (\hat{\theta}_{i,h})^\top \phi(\cdot,\cdot)$ to be the estimated reward.
8:       Compute $\big( \underline{Q}^{\pi}_{i,h}(\cdot,\cdot), \overline{Q}^{\dagger,\pi_{-i}}_{i,h}(\cdot,\cdot) \big) = \text{Rob-Q}\big( D_{2,h}, \pi, \epsilon, \hat{R}_{i,h}, \underline{V}^{\pi}_{i,h+1}, \overline{V}^{\dagger,\pi_{-i}}_{i,h+1}, \Gamma \big)$.
9:       Set $\underline{V}^{\pi}_{i,h}(\cdot) = \mathbb{E}_{a \sim \pi_h}\big[ \underline{Q}^{\pi}_{i,h}(\cdot, a) \big]$ and $\overline{V}^{\dagger,\pi_{-i}}_{i,h}(\cdot) = \max_{a_i} \mathbb{E}_{a_{-i} \sim \pi_{-i,h}}\big[ \overline{Q}^{\dagger,\pi_{-i}}_{i,h}(\cdot, a) \big]$.
10:     end for
11:   end for
12: end for
13: return $\tilde{\pi} \in \arg\min_{\pi \in \Pi^{\mathrm{PP}}} \widehat{\mathrm{Gap}}(\pi, \hat{\theta})$.

Unfortunately though, the same analysis does not go through just by choosing reward estimates that maximize (minimize) over the confidence set, due to this more complicated notion of closeness. Hence, we follow a different approach in this section. Intuitively, if we can find $\theta$ in our confidence set that maximizes the gap of our output policy, and, similarly, find a policy $\pi$ that minimizes the gap with respect to this choice of $\theta$, we can finally bound the true gap on the ground-truth reward. This intuition is based on the observation that the Nash gaps of policies computed with $\theta$ parameters in the confidence set should be close to each other.
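Membership in the confidence set of Equation (5) is straightforward to test given the feature differences and labels; a minimal sketch using a numerically stable log-sigmoid (the function names and inputs are illustrative):

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -np.asarray(x, dtype=float))

def in_unilateral_set(theta_cand, theta_tilde, feat_diffs, labels, kappa):
    """Test membership of theta_cand in the log-likelihood-ratio set of Eq. (5).

    feat_diffs: (m, d) array of rows phi(tau) - phi(tau');
    labels: length-m array with entries in {-1, +1};
    kappa: radius, e.g. 6*eps*H*sqrt(d) + (2d/m)*log(H*m/delta)."""
    z_tilde = labels * (feat_diffs @ theta_tilde)
    z_cand = labels * (feat_diffs @ theta_cand)
    stat = (2.0 / len(labels)) * np.sum(log_sigmoid(z_tilde) - log_sigmoid(z_cand))
    return bool(stat <= kappa)
```

The statistic is zero at $\theta = \tilde{\theta}_i$ and grows as the candidate's log-likelihood on $D$ falls below that of the trimmed-MLE estimate, matching the 'radius' interpretation of $\kappa$.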
Based on the above discussion, we will utilize projected gradient ascent (PGA) to update our estimates of $\theta$. However, we have access neither to the true gap given parameter $\theta$, nor to its gradient with respect to $\theta$. We thus resort to using biased estimates of it. As we show in the Appendix (see Lemma B.5), for any $\theta := (\theta_1, \ldots, \theta_n) \in \Theta_{\mathrm{Unil}}(\tilde{\theta}_1) \times \cdots \times \Theta_{\mathrm{Unil}}(\tilde{\theta}_n)$, any policy $\pi$ that is the minimizer of the estimated gap computed via Rob-Q on $\theta$ satisfies
$$|\mathrm{Gap}(\pi, \theta) - \mathrm{Gap}(\pi^*, \theta)| \leq O(n\sqrt{\epsilon} + n/\sqrt{m}),$$
for a Nash equilibrium policy $\pi^*$ which is covered by $D$, where $\mathrm{Gap}(\pi, \theta)$ here denotes the true Nash gap of $\pi$ under the reward function parameterized by $\theta$. Therefore, we optimize $\mathrm{Gap}(\pi^*, \theta)$ as a surrogate objective, and later transfer guarantees to $\mathrm{Gap}(\pi, \theta)$ using Lemma B.5. However, we do not have access to $\nabla_{\theta} \mathrm{Gap}(\pi^*, \theta)$. Here, we use the following observation: in the linear setting, the (sub)gradient of the gap becomes the average feature difference over convex combinations of occupancy measures. Thus, we can use $\nabla_{\theta} \sum_{i \in [n]} \big( V^{\mu}_{i,0}(s_0, \theta_i) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0, \theta_i) \big)$ as an estimate of $\nabla_{\theta} \mathrm{Gap}(\pi^*, \theta)$, since $\mu$ and $\mu_{\mathrm{ref}}$ already cover $\pi^*$ and its unilateral deviations. Here, $V^{\mu}_{i,0}(s_0, \theta_i)$ denotes the value function of $\mu$ with respect to the reward function parameterized by $\theta_i$. To estimate $\nabla_{\theta_i} \big( V^{\mu}_{i,0}(s_0, \theta_i) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0, \theta_i) \big)$, we use a robust mean oracle RobMean that takes as input corrupted feature differences and returns an $O(\sqrt{\epsilon})$-approximate estimate of their true mean, which in our case is exactly $\nabla_{\theta_i} \big( V^{\mu}_{i,0}(s_0, \theta_i) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0, \theta_i) \big)$.
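A robust mean oracle such as RobMean can be instantiated in several ways; a simple coordinate-wise trimmed mean already conveys the idea, although stronger estimators are needed for dimension-free $O(\sqrt{\epsilon})$ guarantees in high dimensions. A minimal sketch (a stand-in, not the paper's specific oracle):

```python
import numpy as np

def trimmed_mean(samples, eps):
    """Coordinate-wise trimmed mean: in each coordinate, drop the eps-fraction
    largest and the eps-fraction smallest values, then average the remainder.
    Robust to an eps-fraction of arbitrarily corrupted rows (per coordinate)."""
    X = np.sort(np.asarray(samples, dtype=float), axis=0)  # sort each column
    m = X.shape[0]
    k = int(np.ceil(eps * m))
    if 2 * k >= m:
        raise ValueError("eps too large for the sample size")
    return X[k:m - k].mean(axis=0)
```

Applied to the per-step feature vectors, such an oracle keeps a single planted outlier from dragging the gradient estimate arbitrarily far, which is precisely what a plain empirical mean cannot guarantee.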
W e thus define our gradient estimate e ∇ θ i Gap ( π ∗ , θ ) with respect to θ i as H − 1 X h =0 RobMean D µ h,ϕ − H − 1 X h =0 RobMean D µ ref h,ϕ , where D µ h,ϕ and D µ ref h,ϕ partition each h -lev el of D and store only the features of trajectories generated by µ and µ ref , respectiv ely . After running PGA for T 1 steps, we compute the empirical a verage of the iterates and use that to compute our estimated reward function. The reward estimation procedure RewardEst is described in Algorithm 3 . Once we hav e access to this rew ard, we can run Rob-Q on it and obtain estimated gaps for each policy . Howe ver , lack of uniform coverage implies weaker guaran- tees on Rob-Q . Thus, we need to properly define a bonus term that accounts for corruption and lack of coverage. First, let us define a scaled sample covariance matrix with respect to the preferred trajectories in the corrupted data, using regularization parameter λ ≥ 0 (to be specified later) as Λ h = 3 5 1 m m X j =1 ϕ ( s j h , a j h ) ϕ ( s j h , a j h ) ⊤ + ( ϵ + λ ) I . Using Λ h , we no w define the bonus term to be used in this section as follows. F or any ( s, a ) , let Γ( s, a ) = E ( d, m, δ , ϵ ) · ∥ ϕ ( s, a ) ∥ Λ − 1 h , where E ( d, m, δ , ϵ ) = O ( √ ϵ + 1 / √ m ) (see Appendix B for detailed definition). W e run Rob-Q with bonus set as Γ and obtain estimated gaps for ev ery joint policy . Finally , we return a joint policy that minimizes estimated gap. The full procedure is described in Algorithm 4 . 4.2 Theoretical guarantees In this section, we state the theoretical guarantee on the con vergence of Algorithm 4 . Theorem 4.1. Let ϵ ∈ [0 , 1 / 2) , λ ≥ Ω( dH log( m/δ ) /m ) , and δ > 0 . Set Θ Unil ( · ) as in Equation ( 5 ) and Γ( s, a ) = E ( d, m, δ , ϵ ) · ∥ ϕ ( s, a ) ∥ Λ − 1 h . Suppose Assumption 2 is satisfied and PGA is run for T 1 steps with learning r ate η = O (1 / √ T 1 ) . 
Then, there exist robust subroutines RobEst, TrimmedMLE, and RobMean such that, with probability at least 1 − δ, the output π̃ of Algorithm 4 with subroutines RobEst, TrimmedMLE, RobMean, and RewardEst satisfies

  Gap(π̃) ≤ Õ( (1/√C_R + 1/√C_P + 1/√T_1) · ( H^{5/2} n d^{3/4} √ϵ + H^2 n √(poly(d)/m) ) ).

Remark 4. Note that the order of ϵ in the above bound is 1/2. This deterioration comes from the relaxation of uniform coverage. The dependence is identical to that in single-agent RL [Zhang et al., 2022], two-player zero-sum Markov games [Nika et al., 2024b], and single-agent RLHF [Mandal et al., 2025] under data corruption. The currently established linear lower bounds hold under uniform coverage. It remains an open question whether weaker coverage implies tighter lower bounds.

5 ROBUST CCE LEARNING UNDER UNILATERAL COVERAGE

In the previous section, we provided an algorithm designed to compute an approximate NE from corrupted preference data under the minimal unilateral coverage assumption. However, a key bottleneck of Algorithm 4 is the intractability of the gap-minimization step. It is well known that even normal-form general-sum games suffer from the curse of multi-agents: computation time scales exponentially with the number of agents (actions) [Foster et al., 2023]. To address this issue, previous work has considered relaxations of the NE, such as correlated equilibria and coarse correlated equilibria (CCE) [Cui et al., 2023, Zhang et al., 2023, Ma et al., 2023, Song et al., 2021], the latter of which can be approximated using no-regret learning algorithms. A general correlated policy is defined as a set of H maps π := {π_h : Ω × (S × A)^{h−1} × S → Δ(A)}_{h∈[H−1]}, where the first argument ω ∈ Ω is sampled from some underlying distribution. A crucial difference from Markov policies is that information about prior states is given as input.
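For intuition, the extra expressiveness of a general correlated policy comes from conditioning on both a shared random draw ω and the full history; a toy sketch (all names and the dictionary-based policy representation are ours, not the paper's):

```python
import random

def sample_joint_action(omega_seed, history, policy_maps, h):
    """Draw a joint action from a general correlated policy at step h.
    policy_maps[h] maps (omega, history) -> distribution over joint actions,
    so all agents' actions may be correlated through the shared draw omega
    and through past states/actions -- unlike a Markov product policy,
    which sees only the current state. Toy illustration."""
    omega = random.Random(omega_seed).random()   # shared randomness ω ~ Ω
    dist = policy_maps[h](omega, history)        # dict: joint action -> prob
    r, acc = random.random(), 0.0
    for joint_action, p in dist.items():         # inverse-CDF sampling
        acc += p
        if r <= acc:
            return joint_action
    return joint_action
```

Because every agent's action is drawn from the same correlated distribution, the resulting joint policy need not factor into a product of per-agent policies.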
We denote by Π_GCP the space of all general correlated policies, and by Π_GCP^i the set of general correlated policies of agent i. Then, a policy π* is said to be an α-CCE, for α ≥ 0, if, for every agent i and state s, we have

  V^{π*}_{i,0}(s) ≥ V^{π′_i, π*_{−i}}_{i,0}(s) − α,  for every π′_i ∈ Π_GCP^i.

If α = 0, then π* is said to be a CCE. Note that the only difference between NEs and CCEs is that an NE is restricted to be a product policy, while a CCE can be an arbitrary combination of individual policies in the joint action space simplex. Hereafter, we overload notation and use Gap(π) to denote the CCE gap of a joint policy π.

There has been a lot of interest in efficiently computing approximate CCEs in Markov games using V-learning-type algorithms [Jin et al., 2021, Wang et al., 2023b, Cui et al., 2023]. However, all of these works consider the online setting, where the learner can explore the environment and gather increasingly more relevant data. We only have at our disposal offline corrupted preference data.

5.1 Algorithm

In this section, we propose an offline-learning algorithm for computing an approximate CCE in linear Markov games. First, we again assume unilateral coverage of our data (Assumption 2). Given preference data D, we again run the RewardEst procedure to obtain the reward estimates θ̂. At this point, differently from the previous section, we take another approach to the estimated gap-minimization problem. First, for a given joint policy π, agent i, state s, step h, and actions a† and a′, we define the stage loss L^s_i(a†, a′) as

  E_{a_{−i} ∼ π_{−i,h}(·|s)} [ Q̄^{†,π_{−i}}_{i,h}(s, a†, a_{−i}) − Q̲^π_{i,h}(s, a′, a_{−i}) ],

where the optimistic and pessimistic Q-function estimates are computed using θ̂.
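As a schematic two-agent sketch (array layout and function names are ours, assuming tabular optimistic/pessimistic Q-tables at a fixed state and step), the stage loss is a mixed-strategy average, and the stage game is then handed to a multiplicative-weights-style no-regret learner:

```python
import numpy as np

def stage_loss(q_opt_i, q_pess_i, pi_minus_i, a_dev, a_cur):
    """L_i^s(a_dev, a_cur): expected advantage of deviating to a_dev over
    playing a_cur, averaged over the other agent's mixed action pi_minus_i.
    q_opt_i / q_pess_i: (|A_i| x |A_-i|) optimistic / pessimistic Q-tables
    at state s, step h. Two-agent illustration of the loss in the text."""
    return float(pi_minus_i @ (q_opt_i[a_dev] - q_pess_i[a_cur]))

def hedge_step(weights, losses, eta):
    """One multiplicative-weights update on a player's mixed strategy --
    the core step of the (optimistic) Hedge subroutine used per stage."""
    w = weights * np.exp(-eta * losses)
    return w / w.sum()
```

Running hedge_step repeatedly on each player's realized stage losses shifts probability mass away from high-loss actions, which is what drives the average iterate toward an approximate CCE of the stage game.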
Using this loss function, we can now express the estimated gap-minimization problem at stage h as

  min_{π′_h} Σ_{i∈[n]} max_{π†_{i,h}} E_{a†_i ∼ π†_{i,h}(·|s), a′_i ∼ π′_{i,h}(·|s)} [ L^s_i(a†, a′) ].

Note that this objective can be framed as a normal-form game at stage h. To solve the stage game, we use OptimisticHedge [Daskalakis et al., 2021], a no-regret learning algorithm that returns an Õ(1/T_2)-approximate CCE of the game when run for T_2 iterations. Each player essentially solves a max–min problem at stage h and updates its policy via a multiplicative-weights-style update. We run the algorithm for T_2 iterations and return the average joint policy. The pseudo-code for OptimisticHedge applied to our setting is given in Algorithm 8 (see Appendix E). Once we have computed our joint policy at stage h, we compute the optimistic and pessimistic values by taking expectations of the Q-function estimates over the newly computed policies. We use these value estimates to run the next iteration h − 1 of our algorithm. Finally, we return the joint policy π̃, which is the composition of the policies returned by OptimisticHedge at each stage h. The full procedure is given in Algorithm 5.

5.2 Theoretical guarantees

In this section, we provide upper bounds on the CCE gap of the output of Algorithm 5.

Theorem 5.1. Let ϵ ∈ [0, 1/2), λ ≥ Ω(dH log(m/δ)/m), and δ > 0. Set Θ_Unil(·) as in Equation (5) and Γ(s, a) = E(d, m, δ, ϵ) · ∥ϕ(s, a)∥_{Λ_h^{−1}}. Suppose Assumption 2 is satisfied, PGA is run for T_1 steps with learning rate

Algorithm 5 Corruption-robust CCE Learning from Human Feedback
Require: Preference dataset D; regularization parameter λ; confidence parameter δ; learning rates η_1, η_2; bonus function Γ(·,·); slackness parameter ν; number of gradient steps T_1; number of optimization steps T_2.
1: Split D into equal halves D_1 and D_2.
2: Compute θ̂ = RewardEst(D_1, ϵ, δ, η_1, ν, T_1).
3: Initialize π̃ uniformly at random and V̄^{π̃}_{i,H}(·) = V̲^{†,π̃_{−i}}_{i,H}(·) = 0, for all i ∈ {1, ..., n}.
4: for h = H − 1, ..., 0 do
5:   for i = 1, ..., n do
6:     Compute R̂_{i,h}(·,·) = (θ̂_{i,h})^⊤ ϕ(·,·), the estimated reward.
7:     Compute (Q̲^{π̃}_{i,h}(·,·), Q̄^{†,π̃_{−i}}_{i,h}(·,·)) = Rob-Q(D_{2,h}, π̃, ϵ, R̂_{i,h}, V̲^{π̃}_{i,h+1}, V̄^{†,π̃_{−i}}_{i,h+1}, Γ).
8:     Compute the loss L^s_i for all states s ∈ S.
9:   end for
10:  Compute π̃_h(·|s) = OptimisticHedge(L^s_1, ..., L^s_n, η_2, T_2).
11:  Set V̲^{π̃}_{i,h}(·) = E_{a∼π̃_h}[Q̲^{π̃}_{i,h}(·, a)] and V̄^{†,π̃_{−i}}_{i,h}(·) = max_{a_i} E_{a_{−i}∼π̃_{−i,h}}[Q̄^{†,π̃_{−i}}_{i,h}(·, a)], for i ∈ {1, ..., n}.
12: end for
13: return π̃ = (π̃_0, ..., π̃_{H−1}).

η_1 = O(1/√T_1), and OptimisticHedge is run for T_2 steps with learning rate η_2 = O(1/(n log^4 T_2)). Then, there exist robust subroutines RobEst, TrimmedMLE, and RobMean such that, with probability at least 1 − δ, the output π̃ of Algorithm 5 with subroutines RobEst, TrimmedMLE, RobMean, and OptimisticHedge satisfies

  Gap(π̃) ≤ Õ( (1/√C_R + 1/√C_P + 1/√T_1) · ( H^{5/2} n d^{3/4} √ϵ + H^2 n √(poly(d))/√m ) + H n^2 / T_2 ).

Remark 5. Note that we only incur an additional O(1/T_2) term in the CCE gap, which comes from applying the no-regret subroutine OptimisticHedge. The benefit of this procedure is that it can be run in quasi-polynomial time in the dataset size and feature dimension. The computational complexity of Algorithm 5 is

  O( nd·log(1/ϵ) + (T_1 + nH) · ( poly(m, d, 1/ϵ) + H·T_2·max_i |A_i| ) ).

6 RELATED WORK

Reinforcement Learning from Human Feedback (RLHF). RLHF has substantially grown in popularity in recent years, largely due to LLMs [Ziegler et al., 2019, Nakano et al., 2021, Wu et al., 2021, Ouyang et al.
, 2022, Stiennon et al., 2020, Glaese et al., 2022, Ramamurthy et al., 2023, Menick et al., 2022, Ganguli et al., 2022, Bai et al., 2022, Gao et al., 2023]. Yet RLHF's utility extends far beyond LLMs, encompassing diverse applications, from game playing [Christiano et al., 2017, Warnell et al., 2018, Knox and Stone, 2008, MacGlashan et al., 2017] to robotic control [Shin et al., 2023, Brown et al., 2019]. Our work relates to recent theoretical studies on (MA)RLHF [Zhan et al., 2023, Zhu et al., 2023, Zhang et al., 2024, Li et al., 2023, Xiong et al., 2023, Nika et al., 2024a]. In particular, we consider data corruption in MARLHF. In the single-agent setting, Nika et al. [2025] propose a general data-poisoning framework for RLHF, while Mandal et al. [2025] propose robust algorithms trained on ϵ-corrupted data. The latter is the work most closely related to ours. While we share the preference-based model and the data-corruption model, our setting is a generalization of the single-agent RLHF setting considered in [Mandal et al., 2025]. This introduces a new layer of complexity: instead of maximizing value functions over single policies, our goal is to minimize the Nash gap over joint policies, which requires a different style of analysis. Algorithmically, our methods diverge in two key ways. First, instead of relying on zeroth-order oracle calls to estimate gradients, we directly approximate the gradient of the Nash gap with respect to each agent's strategy via a biased gradient with respect to a Nash policy. This allows us to maintain the O(√ϵ) bounds on the gap. Second, we incorporate a quasi-polynomial-time subroutine that computes an approximate coarse correlated equilibrium (CCE) of the induced game.

Corruption-robust Offline Reinforcement Learning (RL). There has been a substantial body of research on adversarial attacks in (MA)RL [Huang et al., 2017, Lin et al.
, 2017, Wu et al., 2023, Rakhsha et al., 2021, Rangi et al., 2022, Nika et al., 2024b, Ma et al., 2023, Gleave et al., 2020]. Our research relates to a specific type of adversarial attack, namely data corruption [Mei and Zhu, 2015, Xiao et al., 2015, Rakhsha et al., 2021]. Our focus is on designing robust algorithms trained on corrupted data generated via the ϵ-corruption model (a.k.a. the strong-contamination model [Diakonikolas et al., 2019]). In this line of work, Zhang et al. [2022] first consider corruption-robust RL in linear Markov decision processes (MDPs), which was later extended to linear zero-sum Markov games (MGs) [Nika et al., 2024b]. Using a different contamination model, Ye et al. [2023] study corruption-robustness in RL with general function approximation. Our work diverges from the above in that we study strong data corruption in multi-agent reinforcement learning from human feedback, which, due to its dependence on preference data, introduces additional layers of complexity in providing robustness guarantees.

Offline Markov Games (MGs). Our work also relates to the literature on learning in MGs [Tian et al., 2021, Vrancx et al., 2008, Littman, 1994, 2001]. We model the underlying environment from which the data is generated as a linear MG [Zhong et al., 2022], and we are interested in approximating notions of optimal joint policies from corrupted preference data. The primary notion of optimality in MGs is the Nash equilibrium (NE) [Nash Jr, 1950]. Due to its computational intractability in general-sum MGs, prior work has considered relaxed versions of it such as CCEs [Cui et al., 2023, Zhang et al., 2023, Ma et al., 2023, Song et al., 2021], and has designed computationally efficient methods to compute them in the online setting [Jin et al., 2021, Wang et al., 2023b, Cui et al., 2023].
We depart from this line of work and consider the CCE computation problem in the offline setting, where we compute the CCE of each stage game via no-regret methods [Daskalakis et al., 2021].

7 DISCUSSION

In this paper, we studied the problem of data corruption in offline MARLHF. We proposed provably robust algorithms under both uniform and unilateral coverage assumptions. Finally, we proposed a computationally efficient algorithm that robustly approximates a coarse correlated equilibrium of the underlying Markov game.

A key technical contribution of our work is a new way to optimize the Nash gap without access to the true reward functions or their gradients. Prior single-agent RLHF approaches rely on primal-dual methods or unbiased gradients, which do not extend to general-sum Markov games due to strategic coupling. We instead introduce a biased but tractable gradient surrogate: by leveraging the linear structure of the underlying Markov game, we approximate the gradient at a Nash equilibrium using feature expectations induced by behavior policies. Under unilateral coverage, these policies capture the occupancy measures of the equilibrium and its unilateral deviations, so their feature differences act as a proxy for the true gradient direction. Despite the bias, this estimate is accurate enough to guide projected gradient ascent over the reward confidence set, yielding O(√ϵ) robustness guarantees. This idea—leveraging equilibrium structure to construct usable gradient surrogates from corrupted offline preference data—appears to be new and may be of independent interest.

Several interesting directions are worth pursuing. First, it is not clear how to formulate the data-corruption problem in MARLHF with general function approximation, and then how to design robust algorithms in that setting. Second, it would be interesting to address the open question of whether the O(√ϵ) bound under non-uniform coverage is tight.
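To make the surrogate concrete: for linear rewards, value functions are linear in the reward parameters, so gradients reduce to feature expectations. The following restates the construction of Section 4 in our notation (the hatted expectations denote the robust empirical estimates):

```latex
% Linear rewards R_{i,h}(s,a) = \langle \phi(s,a), \theta_{i,h}\rangle give
% V^{\mu}_{i,0}(s_0,\theta_i)
%   = \sum_{h=0}^{H-1} \mathbb{E}_{\mu}\big[\langle \phi(s_h,a_h), \theta_{i,h}\rangle\big],
% hence the gradient is a feature expectation:
\nabla_{\theta_{i,h}} V^{\mu}_{i,0}(s_0,\theta_i) = \mathbb{E}_{\mu}\big[\phi(s_h,a_h)\big],
% and the surrogate gradient is a robustly estimated feature-expectation gap:
\widetilde{\nabla}_{\theta_i}\mathrm{Gap}(\pi^*,\theta)
  = \sum_{h=0}^{H-1}\Big(\widehat{\mathbb{E}}_{\mu}\big[\phi(s_h,a_h)\big]
  - \widehat{\mathbb{E}}_{\mu_{\mathrm{ref}}}\big[\phi(s_h,a_h)\big]\Big).
```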
Finally, implementing the proposed algorithms and experimentally testing them on MARL environments is another exciting direction for future work.

Acknowledgements

The work of Andi Nika and Goran Radanovic was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 467367360.

References

Pranjal Awasthi, Abhimanyu Das, Weihao Kong, and Rajat Sen. Trimmed Maximum Likelihood Estimation for Robust Generalized Linear Model. NeurIPS, 2022.

Yuntao Bai et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR, abs/2204.05862, 2022.

Tim Baumgärtner, Yang Gao, Dana Alon, and Donald Metzler. Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data. CoRR, abs/2404.05530, 2024.

Ralph Allan Bradley and Milton E Terry. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 39(3/4), 1952.

Daniel S. Brown, Wonjoon Goo, Prabhat Nagarajan, and Scott Niekum. Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations. In ICML, 2019.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep Reinforcement Learning from Human Preferences. In NeurIPS, 2017.

Qiwen Cui and Simon S Du. Provably Efficient Offline Multi-agent Reinforcement Learning via Strategy-wise Bonus. NeurIPS, 2022.

Qiwen Cui, Kaiqing Zhang, and Simon Du. Breaking the Curse of Multiagents in a Large State Space: RL in Markov Games with Independent Linear Function Approximation. In COLT, 2023.

Constantinos Daskalakis, Maxwell Fishelson, and Noah Golowich. Near-optimal No-regret Learning in General Games. In NeurIPS, 2021.

Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust Estimators in High-dimensions without the Computational Intractability. SIAM Journal on Computing, 48(2):742–864, 2019.
Ilias Diakonikolas, Daniel M Kane, and Ankit Pensia. Outlier Robust Mean Estimation with Subgaussian Rates via Stability. In NeurIPS, 2020.

Ilias Diakonikolas, Samuel B Hopkins, Ankit Pensia, and Stefan Tiegel. SoS Certifiability of Subgaussian Distributions and its Algorithmic Applications. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, pages 1689–1700, 2025.

Yihe Dong, Samuel Hopkins, and Jerry Li. Quantum Entropy Scoring for Fast Robust Mean Estimation and Improved Outlier Detection. Advances in Neural Information Processing Systems, 32, 2019.

Dylan J Foster, Noah Golowich, and Sham M Kakade. Hardness of Independent Learning and Sparse Equilibrium Computation in Markov Games. In ICML, 2023.

D Ganguli et al. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. CoRR, abs/2209.07858, 2022.

Leo Gao, John Schulman, and Jacob Hilton. Scaling Laws for Reward Model Overoptimization. In ICML, 2023.

Amelia Glaese et al. Improving Alignment of Dialogue Agents via Targeted Human Judgements. CoRR, abs/2209.14375, 2022.

Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, and Stuart Russell. Adversarial Policies: Attacking Deep Reinforcement Learning. In ICLR, 2020.

Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial Attacks on Neural Network Policies. CoRR, abs/1702.02284, 2017.

Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. V-learning: A Simple, Efficient, Decentralized Algorithm for Multiagent RL. arXiv preprint arXiv:2110.14555, 2021.

W Bradley Knox and Peter Stone. TAMER: Training an Agent Manually via Evaluative Reinforcement. In ICDL, 2008.

Zihao Li, Zhuoran Yang, and Mengdi Wang. Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism. arXiv preprint arXiv:2305.18438, 2023.
Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, and Min Sun. Tactics of Adversarial Attack on Deep Reinforcement Learning Agents. In IJCAI, 2017.

Michael L. Littman. Markov Games as a Framework for Multi-agent Reinforcement Learning. In ICML, 1994.

Michael L Littman. Value-function Reinforcement Learning in Markov Games. Cognitive Systems Research, 2001.

Shaocong Ma, Ziyi Chen, Shaofeng Zou, and Yi Zhou. Decentralized Robust V-Learning for Solving Markov Games with Model Uncertainty. JMLR, 2023.

James MacGlashan, Mark K Ho, Robert Loftin, Bei Peng, Guan Wang, David L Roberts, Matthew E Taylor, and Michael L Littman. Interactive Learning from Policy-dependent Human Feedback. In ICML, 2017.

Debmalya Mandal, Andi Nika, Parameswaran Kamalaruban, Adish Singla, and Goran Radanović. Corruption Robust Offline Reinforcement Learning with Human Feedback. In AISTATS, 2025.

Shike Mei and Xiaojin Zhu. Using Machine Teaching to Identify Optimal Training-set Attacks on Machine Learners. In AAAI, 2015.

Jacob Menick et al. Teaching Language Models to Support Answers with Verified Quotes. CoRR, abs/2203.11147, 2022.

Reiichiro Nakano et al. WebGPT: Browser-assisted Question-answering with Human Feedback. CoRR, abs/2112.09332, 2021.

John F Nash Jr. Equilibrium Points in n-person Games. Proceedings of the National Academy of Sciences, 1950.

Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban, Georgios Tzannetos, Goran Radanović, and Adish Singla. Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences. In ICML, 2024a.

Andi Nika, Debmalya Mandal, Adish Singla, and Goran Radanovic. Corruption-Robust Offline Two-player Zero-sum Markov Games. In AISTATS, 2024b.

Andi Nika, Jonathan Nöther, Debmalya Mandal, Parameswaran Kamalaruban, Adish Singla, and Goran Radanović.
Policy Teaching via Data Poisoning in Learning from Human Preferences. In AISTATS, 2025.

Long Ouyang et al. Training Language Models to Follow Instructions with Human Feedback. In NeurIPS, 2022.

Amin Rakhsha, Goran Radanovic, Rati Devidze, Xiaojin Zhu, and Adish Singla. Policy Teaching in Reinforcement Learning via Environment Poisoning Attacks. JMLR, 2021.

Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, and Yejin Choi. Is Reinforcement Learning (not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization. In ICLR, 2023.

Javier Rando and Florian Tramèr. Universal Jailbreak Backdoors from Poisoned Human Feedback. In ICLR, 2023.

Anshuka Rangi, Haifeng Xu, Long Tran-Thanh, and Massimo Franceschetti. Understanding the Limits of Poisoning Attacks in Episodic Reinforcement Learning. CoRR, abs/2208.13663, 2022.

Jiawen Shi, Yixin Liu, Pan Zhou, and Lichao Sun. BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT. CoRR, abs/2304.12298, 2023.

Daniel Shin, Anca D. Dragan, and Daniel S. Brown. Benchmarks and Algorithms for Offline Preference-Based Reward Learning. Transactions on Machine Learning Research, 2023.

Ziang Song, Song Mei, and Yu Bai. When Can We Learn General-sum Markov Games with a Large Number of Players Sample-Efficiently? arXiv preprint arXiv:2110.04184, 2021.

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to Summarize with Human Feedback. In NeurIPS, 2020.

Yi Tian, Yuanhao Wang, Tiancheng Yu, and Suvrit Sra. Online Learning in Unknown Markov Games. In ICML, 2021.

Peter Vrancx, Katja Verbeeck, and Ann Nowé. Decentralized Learning in Markov Games.
IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38, 2008.

Jiongxiao Wang, Junlin Wu, Muhao Chen, Yevgeniy Vorobeychik, and Chaowei Xiao. On the Exploitability of Reinforcement Learning with Human Feedback for Large Language Models. CoRR, abs/2311.09641, 2023a.

Yuanhao Wang, Qinghua Liu, Yu Bai, and Chi Jin. Breaking the Curse of Multiagency: Provably Efficient Decentralized Multi-agent RL with Function Approximation. In COLT, 2023b.

Garrett Warnell, Nicholas Waytowich, Vernon Lawhern, and Peter Stone. Deep TAMER: Interactive Agent Shaping in High-dimensional State Spaces. In AAAI, 2018.

Jeff Wu, Long Ouyang, Daniel M Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. Recursively Summarizing Books with Human Feedback. CoRR, abs/2109.10862, 2021.

Young Wu, Jeremy McMahan, Xiaojin Zhu, and Qiaomin Xie. Reward Poisoning Attacks on Offline Multi-agent Reinforcement Learning. In AAAI, 2023.

Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, and Fabio Roli. Is Feature Selection Secure Against Training Data Poisoning? In ICML, 2015.

Wei Xiong, Hanze Dong, Chenlu Ye, Han Zhong, Nan Jiang, and Tong Zhang. Gibbs Sampling from Human Feedback: A Provable KL-constrained Framework for RLHF. CoRR, 2023.

Chenlu Ye, Rui Yang, Quanquan Gu, and Tong Zhang. Corruption-robust Offline Reinforcement Learning with General Function Approximation. In NeurIPS, 2023.

Andrea Zanette, Ching-An Cheng, and Alekh Agarwal. Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation. In COLT, 2021.

Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D Lee, and Wen Sun. Provable Offline Reinforcement Learning with Human Feedback. CoRR, abs/2305.14816, 2023.

Natalia Zhang, Xinqi Wang, Qiwen Cui, Runlong Zhou, Sham M Kakade, and Simon S Du.
Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques. arXiv preprint arXiv:2409.00717, 2024.

Xuezhou Zhang, Yiding Chen, Xiaojin Zhu, and Wen Sun. Corruption-robust Offline Reinforcement Learning. In AISTATS, 2022.

Yuheng Zhang, Yu Bai, and Nan Jiang. Offline Learning in Markov Games with General Function Approximation. In ICML, 2023.

Han Zhong, Wei Xiong, Jiyuan Tan, Liwei Wang, Tong Zhang, Zhaoran Wang, and Zhuoran Yang. Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets. In ICML, 2022.

Banghua Zhu, Michael I. Jordan, and Jiantao Jiao. Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons. In ICML, 2023.

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning Language Models from Human Preferences. CoRR, abs/1909.08593, 2019.

Corruption-robust Offline Multi-agent Reinforcement Learning from Human Feedback: Appendix

Table of Contents
A Proof of Theorem 3.1
B Proof of Theorem 4.1
C Proof of Theorem 5.1
D Technical Results
E Additional Algorithm Pseudocodes

A Proof of Theorem 3.1

In this section, we provide the full proof of Theorem 3.1.

Lemma A.1 (Lemma A.1 of Mandal et al. [2025]). Let Assumption 1 hold with ξ_R ≥ cϵ for some positive constant c, and let m ≥ Ω( (H^{3/2}/ϵ^2)(d + log(n/δ)) ). Then, for every i, Algorithm 6 returns an estimator θ̃_i such that, with probability at least 1 − δ/2,

  ∥θ̃_i − θ*_i∥_2 ≤ O( (ϵ/ξ_R) · exp(H + √(log(n/(2δϵ)))) ),

where θ̃_i denotes the Hd-dimensional vector with sub-vectors θ̃_{i,h} for every h.

Proof. The result is an immediate application of Lemma A.1 of Mandal et al. [2025] to the multi-agent setting, applying a union bound over the n agents.
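The TrimmedMLE subroutine behind this lemma (following Awasthi et al. [2022]) alternates between fitting the preference model on the currently kept samples and discarding the worst-fitting ϵ-fraction. A minimal sketch for one-dimensional Bradley–Terry-style data (function name, learning rate, and iteration count are ours; the real subroutine differs in details):

```python
import numpy as np

def trimmed_logistic_mle(x, y, eps, steps=200, lr=0.5):
    """Sketch of trimmed MLE for Bradley-Terry preferences: repeatedly fit a
    logistic model by gradient descent on the kept samples, then drop the
    eps-fraction of samples with the largest negative log-likelihood.
    x: (m, d) feature differences; y: (m,) binary preference labels."""
    m, d = x.shape
    keep = np.arange(m)
    theta = np.zeros(d)
    k = int(np.ceil(eps * m))
    for _ in range(steps):
        xs, ys = x[keep], y[keep]
        p = 1.0 / (1.0 + np.exp(-xs @ theta))        # sigmoid predictions
        theta -= lr * xs.T @ (p - ys) / len(keep)    # logistic gradient step
        q = 1.0 / (1.0 + np.exp(-x @ theta))         # per-sample NLL on all data
        nll = -(y * np.log(q + 1e-12) + (1 - y) * np.log(1 - q + 1e-12))
        keep = np.argsort(nll)[: m - k]              # trim the worst eps*m points
    return theta
```

On mostly consistent data with a few flipped labels, the flipped samples accumulate the largest losses and are trimmed, so the fit is driven by the clean majority.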
This upper bound provides a provable threshold function for our confidence sets. Next, we will make use of the following result.

Theorem A.1 (Zhang et al. [2022]). Given an ϵ-corrupted dataset D = {x_i, y_i}_{i∈[m]}, where the clean data is generated as x̃_i ∼ β with P(∥x̃_i∥ ≤ 1) = 1 and ỹ_i = x̃_i^⊤ ω* + ζ_i, with ζ_i zero-mean σ²-variance sub-Gaussian noise, a robust least-squares estimator returns ω̄ such that:

• If E_β[xx^⊤] ⪰ ξ I, then, with probability at least 1 − δ/2,
  ∥ω* − ω̄∥_2 ≤ c_1(δ) · ( √( σ² poly(d) / (ξ² m) ) + (σ/ξ)·ϵ );

• With probability at least 1 − δ/2,
  E_β[ ( x̃^⊤(ω* − ω̄) )² ] ≤ c_2(δ) · ( σ² poly(d)/m + σ² ϵ ),

where c_1 and c_2 hide constants and polylog(1/δ) terms.

Applying this to our setting means treating the corrupted Bellman-operator samples from our data as signals generated from an unknown underlying distribution. We thus define, for every (i, h, s, a) ∈ [n] × [H−1] × S × A, the Bellman operator

  B_{i,h} V_{i,h+1}(s, a) = R_{i,h}(s, a) + Σ_{s′∈S} P(s′|s, a) V_{i,h+1}(s′).

We also define the Bellman operators with respect to the estimated rewards as

  B̄_{i,h} V̄^π_{i,h+1}(s, a) = R̄_{i,h}(s, a) + Σ_{s′∈S} P(s′|s, a) V̄^π_{i,h+1}(s′),   (6)

and

  B̲_{i,h} V̲^{†,π_{−i}}_{i,h+1}(s, a) = R̲_{i,h}(s, a) + Σ_{s′∈S} P(s′|s, a) V̲^{†,π_{−i}}_{i,h+1}(s′).   (7)

We then have the following result.

Lemma A.2. For every tuple (s_h, a_h, s_{h+1}) in D, we have

  Var( R_{i,h}(s_h, a_h) + V_{i,h+1}(s_{h+1}) − B_{i,h} V_{i,h+1}(s_h, a_h) | s_h, a_h ) ≤ (H√d + γ)²,

and the analogous bound holds for the estimated-reward Bellman operator B̄_{i,h}.

Proof.
Note that we have

  Var( R_{i,h}(s_h, a_h) + V_{i,h+1}(s_{h+1}) − B_{i,h} V_{i,h+1}(s_h, a_h) | s_h, a_h )
  = E[ ( R_{i,h}(s_h, a_h) + V_{i,h+1}(s_{h+1}) − E[ R_{i,h}(s_h, a_h) + V_{i,h+1}(s_{h+1}) ] )² | s_h, a_h ]
  ≤ Var( R_{i,h}(s_h, a_h) ) + Var( V_{i,h+1}(s_{h+1}) )
  ≤ (H√d + γ)²,

since both H and γ are nonnegative. The proof of the second statement is similar.

Using the above, we define the following error term in short-hand notation, for ease of presentation:

  E_1(d, m, δ, ϵ) = c_1(δ) · ( √( (H√d + γ)² poly(d) / (ξ_P² m) ) + ((H√d + γ)/ξ_P)·ϵ ).   (8)

Next, we prove upper bounds on the maximum and minimum values of the estimated reward functions in terms of the ground-truth rewards.

Lemma A.3. With probability at least 1 − δ/2, we have

  R_{i,h}(s, a) − C_1 ϵ·exp(H + √(log(n/(2δϵ))))/ξ_R ≤ R̲_{i,h}(s, a) ≤ R_{i,h}(s, a) ≤ R̄_{i,h}(s, a) ≤ R_{i,h}(s, a) + C_1 ϵ·exp(H + √(log(n/(2δϵ))))/ξ_R.

Proof. Let θ̄_{i,h} be the parameter that corresponds to R̄_{i,h}, and let θ̲_{i,h} be defined similarly. Observe that, for every agent i and time-step h, we have

  R̄_{i,h}(s, a) − R_{i,h}(s, a) = ⟨ϕ(s, a), θ̄_{i,h}⟩ − ⟨ϕ(s, a), θ*_{i,h}⟩
  = ⟨ϕ(s, a), θ̄_{i,h}⟩ − ⟨ϕ(s, a), θ̃_{i,h}⟩ + ⟨ϕ(s, a), θ̃_{i,h}⟩ − ⟨ϕ(s, a), θ*_{i,h}⟩
  = ⟨ϕ(s, a), θ̄_{i,h} − θ̃_{i,h}⟩ + ⟨ϕ(s, a), θ̃_{i,h} − θ*_{i,h}⟩
  ≤ ∥ϕ(s, a)∥_2 ∥θ̃_{i,h} − θ̄_{i,h}∥_2 + ∥ϕ(s, a)∥_2 ∥θ̃_{i,h} − θ*_{i,h}∥_2
  ≤ ∥θ̃_{i,h} − θ̄_{i,h}∥_2 + ∥θ̃_{i,h} − θ*_{i,h}∥_2
  ≤ C_1 ϵ·exp(H + √(log(n/(2δϵ))))/ξ_R,

where the first equality follows by definition and the fact that the true expected rewards already lie in [−√d, √d]; we have used Cauchy–Schwarz for the first inequality, the assumption ∥ϕ(s, a)∥_2 ≤ 1 for the second, and Lemma A.1 for the final inequality. Similarly, we have

  R_{i,h}(s, a) − R̲_{i,h}(s, a) ≤ C_1 ϵ·exp(H + √(log(n/(2δϵ))))/ξ_R.
Next, we have the following result, which bounds the Bellman errors.

Lemma A.4. With probability at least 1 − δ/2, we have, for every i, h, s, a and policy profile π,

  −E_1(d, m, δ, ϵ) ≤ B̄_{i,h} V̄^π_{i,h+1}(s, a) − Q̄^π_{i,h}(s, a) ≤ E_1(d, m, δ, ϵ),
  −E_1(d, m, δ, ϵ) ≤ B̲_{i,h} V̲^{†,π_{−i}}_{i,h+1}(s, a) − Q̲^{†,π_{−i}}_{i,h}(s, a) ≤ E_1(d, m, δ, ϵ).

Proof. First, as noted in [Zhong et al., 2022], in linear MDPs the value functions are also linear in the features. Thus, denoting by ω^{π,*}_{i,h} the parameter of the Bellman transform of V̄^π_{i,h} and defining ω^{†,π_{−i},*}_{i,h} similarly, we have

  ϕ(s, a)^⊤ ω^π_{i,h} − B̄_{i,h} V̄^π_{i,h+1}(s, a) = ⟨ϕ(s, a), ω^π_{i,h} − ω^{π,*}_{i,h}⟩
  ≤ ∥ϕ(s, a)∥_2 ∥ω^{π,*}_{i,h} − ω^π_{i,h}∥_2 ≤ E_1(d, m, δ, ϵ),

where the penultimate step uses Cauchy–Schwarz and the final step uses the feature-norm assumption and Theorem A.1.

Next, we prove a similar result for the estimated value functions and best responses.

Lemma A.5. Under the event of Lemma A.4, we have, for every agent i, state s, step h, and policy π:

  V̄^π_{i,h}(s) ≤ V^π_{i,h}(s) + E_1(d, m, δ, ϵ) + C_1 ϵ·exp(H + √(log(n/(2δϵ))))/ξ_R,

and

  V̲^{†,π_{−i}}_{i,h}(s) ≥ V^{†,π_{−i}}_{i,h}(s) − E_1(d, m, δ, ϵ) − C_1 ϵ·exp(H + √(log(n/(2δϵ))))/ξ_R.

Proof. We prove the result by induction. Since V̄^π_{i,H}(·) = V^π_{i,H}(·) = 0, the result holds at step H. Suppose the result holds for step h + 1.
Then, for step h, we have

  V̄^π_{i,h}(s) = E_{a∼π_h}[ Q̄^π_{i,h}(s, a) ]                                          (9)
  ≤ E_{a∼π_h}[ B̄_{i,h} V̄^π_{i,h+1}(s, a) ] + E_1(d, m, δ, ϵ)                           (10)
  ≤ E_{a∼π_h}[ B̄_{i,h} V^π_{i,h+1}(s, a) ] + E_1(d, m, δ, ϵ)                           (11)
  = E_{a∼π_h}[ R̄_{i,h}(s, a) + Σ_{s′} P(s′|s, a) V^π_{i,h+1}(s′) ] + E_1(d, m, δ, ϵ)
  ≤ E_{a∼π_h}[ R_{i,h}(s, a) + Σ_{s′} P(s′|s, a) V^π_{i,h+1}(s′) ] + C_1 ϵ·exp(H + √(log(n/(2δϵ))))/ξ_R + E_1(d, m, δ, ϵ)   (12)
  = E_{a∼π_h}[ B_{i,h} V^π_{i,h+1}(s, a) ] + C_1 ϵ·exp(H + √(log(n/(2δϵ))))/ξ_R + E_1(d, m, δ, ϵ)
  = V^π_{i,h}(s) + C_1 ϵ·exp(H + √(log(n/(2δϵ))))/ξ_R + E_1(d, m, δ, ϵ),               (13)

where Equation (9) follows by definition; Equation (10) follows from Lemma A.4; Equation (11) follows by the inductive assumption; Equation (12) follows from Lemma A.3; and Equation (13) follows by definition. Similarly,

  V̲^{†,π_{−i}}_{i,h}(s) = max_{a_i∈A_i} E_{a_{−i}∼π_{−i,h}(·|s)}[ Q̲^{†,π_{−i}}_{i,h}(s, a) ]                        (14)
  ≥ max_{a_i∈A_i} E_{a_{−i}∼π_{−i,h}(·|s)}[ B̲_{i,h} V̲^{†,π_{−i}}_{i,h+1}(s, a) ] − E_1(d, m, δ, ϵ)                  (15)
  ≥ max_{a_i∈A_i} E_{a_{−i}∼π_{−i,h}(·|s)}[ B̲_{i,h} V^{†,π_{−i}}_{i,h+1}(s, a) ] − E_1(d, m, δ, ϵ)                  (16)
  ≥ max_{a_i∈A_i} E_{a_{−i}∼π_{−i,h}(·|s)}[ B_{i,h} V^{†,π_{−i}}_{i,h+1}(s, a) ] − C_1 ϵ·exp(H + √(log(n/(2δϵ))))/ξ_R − E_1(d, m, δ, ϵ)   (17)
  = V^{†,π_{−i}}_{i,h}(s) − C_1 ϵ·exp(H + √(log(n/(2δϵ))))/ξ_R − E_1(d, m, δ, ϵ),       (18)

where again Equation (14) follows by definition; Equation (15) follows from Lemma A.4; Equation (16) follows from the linearity and monotonicity of B̲_{i,h}; Equation (17) follows by the inductive assumption and Lemma A.3; and Equation (18) follows by definition.

Next, we provide bounds on the difference between the estimated and true expected returns.

Lemma A.6.
Under the event of Lemma A.4 we have, for any $\pi\in\Pi_{PP}$,
\[
V^{\pi}_{i,0}(s_0) - \widehat V^{\pi}_{i,0}(s_0) \le H\, C_1\epsilon\cdot\frac{\exp\!\left(H+\sqrt{\log(n/(2\delta\epsilon))}\right)}{\xi_R} + H\, E_1(d,m,\delta,\epsilon),
\]
\[
\widehat V^{\pi}_{i,0}(s_0) - V^{\pi}_{i,0}(s_0) \le H\, C_1\epsilon\cdot\frac{\exp\!\left(H+\sqrt{\log(n/(2\delta\epsilon))}\right)}{\xi_R} + H\, E_1(d,m,\delta,\epsilon).
\]

Proof. Note that we have
\begin{align*}
V^{\pi}_{i,0}(s_0) - \widehat V^{\pi}_{i,0}(s_0) &= \mathbb{E}_{a_0\sim\pi_0(\cdot\mid s_0)}\left[Q^{\pi}_{i,0}(s_0,a_0)\right] - \mathbb{E}_{a_0\sim\pi_0(\cdot\mid s_0)}\left[\widehat Q^{\pi}_{i,0}(s_0,a_0)\right]\\
&\le \mathbb{E}_{a_0\sim\pi_0(\cdot\mid s_0)}\left[Q^{\pi}_{i,0}(s_0,a_0) - \widehat B_{i,0}\widehat V^{\pi}_{i,1}(s_0,a_0)\right] + E_1(d,m,\delta,\epsilon) \tag{19}\\
&= \mathbb{E}_{a_0\sim\pi_0(\cdot\mid s_0)}\left[B_{i,0} V^{\pi}_{i,1}(s_0,a_0) - \widehat B_{i,0}\widehat V^{\pi}_{i,1}(s_0,a_0)\right] + E_1(d,m,\delta,\epsilon) \tag{20}\\
&= \mathbb{E}_{a_0\sim\pi_0(\cdot\mid s_0)}\left[R_{i,0}(s_0,a_0) - \widehat R_{i,0}(s_0,a_0) + \mathbb{E}_{s_1\sim P(\cdot\mid s_0,a_0)}\left[V^{\pi}_{i,1}(s_1) - \widehat V^{\pi}_{i,1}(s_1)\right]\right] + E_1(d,m,\delta,\epsilon) \tag{21}\\
&\le \mathbb{E}_{a_0\sim\pi_0(\cdot\mid s_0)}\left[\mathbb{E}_{s_1\sim P(\cdot\mid s_0,a_0)}\left[V^{\pi}_{i,1}(s_1) - \widehat V^{\pi}_{i,1}(s_1)\right]\right] + C_1\epsilon\cdot\frac{\exp\!\left(H+\sqrt{\log(n/(2\delta\epsilon))}\right)}{\xi_R} + E_1(d,m,\delta,\epsilon) \tag{22}\\
&= C_1\epsilon\cdot\frac{\exp\!\left(H+\sqrt{\log(n/(2\delta\epsilon))}\right)}{\xi_R} + E_1(d,m,\delta,\epsilon) + \mathbb{E}_{a_0\sim\pi_0(\cdot\mid s_0),\, s_1\sim P(\cdot\mid s_0,a_0)}\left[V^{\pi}_{i,1}(s_1) - \widehat V^{\pi}_{i,1}(s_1)\right]\\
&= \cdots \le H\left(C_1\epsilon\cdot\frac{\exp\!\left(H+\sqrt{\log(n/(2\delta\epsilon))}\right)}{\xi_R} + E_1(d,m,\delta,\epsilon)\right), \tag{23}
\end{align*}
where Equation (19) follows from Lemma A.4; Equation (20) follows from the fact that the true action-value function has zero error with respect to the Bellman operator; Equation (21) follows from expanding the Bellman operator for both value functions; Equation (22) follows from Lemma A.3; and, finally, Equation (23) follows from applying the same bounds for $H$ steps.
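As an aside, the $H$-step telescoping used in Equation (23) — a per-step reward error propagating linearly through backward induction — can be checked numerically. The sketch below is purely illustrative (a small hypothetical tabular chain, not the linear Markov game of the paper), with all constants chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
S, H, err = 5, 10, 0.01

# Random Markov chain (a fixed policy is already folded in), H-step horizon.
P = rng.dirichlet(np.ones(S), size=S)              # P[s, s'] = transition probability
r = rng.uniform(size=(H, S))                       # true per-step rewards
r_hat = r + rng.uniform(-err, err, size=(H, S))    # estimated rewards, per-step error <= err

def evaluate(rew):
    """Backward-induction policy evaluation over the H-step horizon."""
    v = np.zeros(S)
    for h in reversed(range(H)):
        v = rew[h] + P @ v
    return v

# A per-step reward error of at most `err` telescopes to at most H * err at step 0.
gap = np.max(np.abs(evaluate(r_hat) - evaluate(r)))
assert gap <= H * err + 1e-10
```

The assertion reflects exactly the proof's logic: each backup adds at most one per-step error, so the error at the initial step is bounded by $H$ times the per-step bound.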
Following similar arguments, for the best-response value gap we have:
\begin{align*}
V^{\dagger,\pi_{-i}}_{i,0}(s_0) - \widehat V^{\dagger,\pi_{-i}}_{i,0}(s_0) &= \max_{a_{i,0}\in A_i}\mathbb{E}_{a_{-i,0}\sim\pi_{-i,0}(\cdot\mid s_0)}\left[Q^{\dagger,\pi_{-i}}_{i,0}(s_0,a_0) - \widehat Q^{\dagger,\pi_{-i}}_{i,0}(s_0,a_0)\right]\\
&\le \max_{a_{i,0}\in A_i}\mathbb{E}_{a_{-i,0}\sim\pi_{-i,0}(\cdot\mid s_0)}\left[B_{i,0} V^{\dagger,\pi_{-i}}_{i,1}(s_0,a_0) - \widehat B_{i,0}\widehat V^{\dagger,\pi_{-i}}_{i,1}(s_0,a_0) + E_1(d,m,\delta,\epsilon)\right]\\
&\le C_1\epsilon\cdot\frac{\exp\!\left(H+\sqrt{\log(n/(2\delta\epsilon))}\right)}{\xi_R} + E_1(d,m,\delta,\epsilon) + \max_{a_{i,0}\in A_i}\mathbb{E}_{a_{-i,0}\sim\pi_{-i,0}(\cdot\mid s_0),\, s_1\sim P(\cdot\mid s_0,a_0)}\left[V^{\dagger,\pi_{-i}}_{i,1}(s_1) - \widehat V^{\dagger,\pi_{-i}}_{i,1}(s_1)\right]\\
&\le H\left(C_1\epsilon\cdot\frac{\exp\!\left(H+\sqrt{\log(n/(2\delta\epsilon))}\right)}{\xi_R} + E_1(d,m,\delta,\epsilon)\right).
\end{align*}

Next, we state a result that provides an upper bound on the Nash gap in terms of the estimated value functions.

Lemma A.7. Under the event of Lemma A.5, we have, for some $C_1 > 0$,
\[
\mathrm{Gap}(\widetilde\pi) \le \min_{\pi}\sum_{i\in[n]}\left(\widehat V^{\dagger,\pi_{-i}}_{i,0}(s_0) - \widehat V^{\pi}_{i,0}(s_0)\right) + 2nE_1(d,m,\delta,\epsilon) + 2nC_1\epsilon\cdot\frac{\exp\!\left(H+\sqrt{\log(n/(2\delta\epsilon))}\right)}{\xi_R}.
\]

Proof. Note that, by definition of the Nash gap and Lemma A.5, we have
\begin{align*}
\mathrm{Gap}(\widetilde\pi) &= \sum_{i\in[n]}\left(V^{\dagger,\widetilde\pi_{-i}}_{i,0}(s_0) - V^{\widetilde\pi}_{i,0}(s_0)\right)\\
&\le \sum_{i\in[n]}\left(\widehat V^{\dagger,\widetilde\pi_{-i}}_{i,0}(s_0) - \widehat V^{\widetilde\pi}_{i,0}(s_0)\right) + 2nE_1(d,m,\delta,\epsilon) + 2nC_1\epsilon\cdot\frac{\exp\!\left(H+\sqrt{\log(n/(2\delta\epsilon))}\right)}{\xi_R}\\
&= \min_{\pi}\sum_{i\in[n]}\left(\widehat V^{\dagger,\pi_{-i}}_{i,0}(s_0) - \widehat V^{\pi}_{i,0}(s_0)\right) + 2nE_1(d,m,\delta,\epsilon) + 2nC_1\epsilon\cdot\frac{\exp\!\left(H+\sqrt{\log(n/(2\delta\epsilon))}\right)}{\xi_R},
\end{align*}
where the last step uses the fact that $\widetilde\pi$ minimizes the quantity within the summation, as defined in Algorithm 1.

Now we are ready to finalize the proof of the main theorem of Section 3. We restate it for convenience.

Theorem A.2. Let $\epsilon\in[0,1/2)$, $\delta>0$ and $\Gamma(\cdot,\cdot)=0$. Furthermore, assume that $m \ge \Omega\big((H^{3/2}/\epsilon^2)(d+\log(n/\delta))\big)$.
Then, under Assumption 1 with $\xi_R \ge 5\epsilon$, for some positive constant $c$, there exist robust algorithms TrimmedMLE and RobEst such that, with probability at least $1-\delta$, the output $\widetilde\pi$ of Objective (4) satisfies
\[
\mathrm{Gap}(\widetilde\pi) \le O\left(Hn\left(\frac{\exp\!\left(H+\sqrt{\log(n/(2\delta\epsilon))}\right)}{\xi_R} + \frac{H\sqrt d + \gamma}{\xi_P}\right)\epsilon + Hn\sqrt{\frac{(H\sqrt d+\gamma)^2\,\mathrm{poly}(d)}{\xi_P^2\, m}}\right).
\]

Proof. Let $\pi^*$ be a Nash equilibrium. We have
\begin{align*}
\mathrm{Gap}(\widetilde\pi) &\le \min_{\pi}\sum_{i\in[n]}\left(\widehat V^{\dagger,\pi_{-i}}_{i,0}(s_0) - \widehat V^{\pi}_{i,0}(s_0)\right) + 2nE_1(d,m,\delta,\epsilon) + 2n\epsilon C_1\frac{\exp\!\left(H+\sqrt{\log(n/(2\delta\epsilon))}\right)}{\xi_R} \tag{24}\\
&\le \sum_{i\in[n]}\left(\widehat V^{\dagger,\pi^*_{-i}}_{i,0}(s_0) - V^{\dagger,\pi^*_{-i}}_{i,0}(s_0)\right) + \sum_{i\in[n]}\left(V^{\pi^*}_{i,0}(s_0) - \widehat V^{\pi^*}_{i,0}(s_0)\right) + \sum_{i\in[n]}\left(V^{\dagger,\pi^*_{-i}}_{i,0}(s_0) - V^{\pi^*}_{i,0}(s_0)\right)\\
&\qquad + 2nE_1(d,m,\delta,\epsilon) + 2n\epsilon C_1\frac{\exp\!\left(H+\sqrt{\log(n/(2\delta\epsilon))}\right)}{\xi_R} \tag{25}\\
&\le 2Hn\left(2\epsilon C_1\frac{\exp\!\left(H+\sqrt{\log(n/(2\delta\epsilon))}\right)}{\xi_R} + 2E_1(d,m,\delta,\epsilon)\right) + \sum_{i\in[n]}\left(V^{\dagger,\pi^*_{-i}}_{i,0}(s_0) - V^{\pi^*}_{i,0}(s_0)\right) \tag{26}\\
&\le 2Hn\left(2\epsilon C_1\frac{\exp\!\left(H+\sqrt{\log(n/(2\delta\epsilon))}\right)}{\xi_R} + 2E_1(d,m,\delta,\epsilon)\right), \tag{27}
\end{align*}
where Equation (24) follows from Lemma A.7; for Equation (25), we pick a Nash equilibrium $\pi^*$ and use the fact that $\widetilde\pi$ is the minimizer of the estimated gap; Equation (26) follows from Lemma A.6; and, finally, Equation (27) follows from the fact that $\pi^*$ is a Nash equilibrium, so any unilateral deviation yields a smaller value for agent $i$.

B Proof of Theorem 4.1

Note that the above bounds characterize the confidence set that we use throughout Section 4. As a robust estimation technique, we again make use of RobEst, based on the second part of Theorem A.1, which does not require uniform coverage. Lemma A.2 and Theorem A.1 give us the following guarantee for the output $\widetilde\omega$ of the robust estimate, where, without loss of generality, we use the behavior policy $\mu$:
\[
\mathbb{E}_{\mu}\left[\left(\phi(s,a)^{\top}(\omega^* - \widetilde\omega)\right)^2\right] \le c_2(\delta)\left(\frac{(H\sqrt d+\gamma)^2\,\mathrm{poly}(d)}{m} + (H\sqrt d+\gamma)^2\,\epsilon\right). \tag{28}
\]
Note that, using the above, we can equivalently write
\[
\|\omega^* - \widetilde\omega\|^2_{\Sigma_\mu(h)} \le c_2(\delta)\left(\frac{(H\sqrt d+\gamma)^2\,\mathrm{poly}(d)}{m} + (H\sqrt d+\gamma)^2\,\epsilon\right),
\]
which implies that
\[
\|\omega^* - \widetilde\omega\|^2_{\Sigma_\mu(h)+(2\epsilon+\lambda)I} \le c_2(\delta)\left(\frac{(H\sqrt d+\gamma)^2\,\mathrm{poly}(d)}{m} + (H\sqrt d+\gamma)^2\,\epsilon + (2\epsilon+\lambda)H\sqrt d\right),
\]
since $\|\omega^*\| \le H\sqrt d$ (Lemma A.1 of [Zhang et al., 2022]). Let us define
\[
E(d,m,\delta,\epsilon) := \sqrt{c_2(\delta)\left(\frac{(H\sqrt d+\gamma)^2\,\mathrm{poly}(d)}{m} + (H\sqrt d+\gamma)^2\,\epsilon + (2\epsilon+\lambda)H\sqrt d\right)}. \tag{29}
\]
This term will be useful in defining our bonus for this section. Recall that, for each step $h$, we have defined the scaled sample covariance matrix with respect to the corrupted data as
\[
\Lambda_h = \frac{3}{5}\left(\frac{1}{m}\sum_{j=1}^{m}\phi(s_h^j,a_h^j)\phi(s_h^j,a_h^j)^{\top} + (\epsilon+\lambda)I\right), \tag{30}
\]
while the bonus has been defined as
\[
\Gamma(s,a) = E(d,m,\delta,\epsilon)\cdot\|\phi(s,a)\|_{\Lambda_h^{-1}}. \tag{31}
\]
In the absence of bounds on the norm of the parameter, we cannot bound the difference in rewards directly, as we did in Lemma A.3; we therefore follow a different approach. First, similar to the previous section, we have the following result.

Lemma B.1. Let $\lambda \ge \Omega(dH\log(m/\delta)/m)$ and let $\Gamma$ be defined as in Equation (31). Then, with probability at least $1-\delta/2$, we have, for every $i, h, s, a$, and policy $\pi$,
\[
0 \le B_{i,h}\widehat V^{\pi}_{i,h+1}(s,a) - \widehat Q^{\pi}_{i,h}(s,a) \le 2\Gamma(s,a),
\qquad
0 \ge B_{i,h}\widehat V^{\dagger,\pi_{-i}}_{i,h+1}(s,a) - \widehat Q^{\dagger,\pi_{-i}}_{i,h}(s,a) \ge -2\Gamma(s,a).
\]
Proof.
Following a similar approach as in the proof of Lemma A.4, we have
\begin{align*}
\left|\phi(s,a)^{\top}\omega^*_{i,h} - B_{i,h}\widehat V^{\pi}_{i,h+1}(s,a)\right|
&= \left|\left\langle\phi(s,a),\,\omega^*_{i,h} - \omega^{\pi}_{i,h}\right\rangle\right|\\
&\le \left\|\omega^*_{i,h} - \omega^{\pi}_{i,h}\right\|_{\Sigma_\mu(h)+(2\epsilon+\lambda)I}\,\|\phi(s,a)\|_{(\Sigma_\mu(h)+(2\epsilon+\lambda)I)^{-1}}\\
&\le E(d,m,\delta,\epsilon)\cdot\|\phi(s,a)\|_{(\Sigma_\mu(h)+(2\epsilon+\lambda)I)^{-1}}\\
&\le E(d,m,\delta,\epsilon)\cdot\|\phi(s,a)\|_{\Lambda_h^{-1}} = \Gamma(s,a),
\end{align*}
where the penultimate inequality uses the fact that $\|\omega^*_{i,h}\|_2 \le H\sqrt d$, and the final inequality follows from the observation
\[
\Lambda_h = \frac{3}{5}\left(\frac{1}{m}\sum_{j=1}^{m}\phi(s_h^j,a_h^j)\phi(s_h^j,a_h^j)^{\top} + (\epsilon+\lambda)I\right)
\preceq \frac{3}{5}\left(\frac{1}{m}\sum_{j=1}^{m}\phi(\widetilde s_h^j,\widetilde a_h^j)\phi(\widetilde s_h^j,\widetilde a_h^j)^{\top} + (2\epsilon+\lambda)I\right)
\preceq \Sigma_\mu(h) + (2\epsilon+\lambda)I,
\]
where the second step uses the fact that $\|\phi(s,a)\|_2 \le 1$ and that only $\epsilon m$ samples are corrupted, while the last step uses Lemma D.1 and the fact that $m(2\epsilon+\lambda) \ge \Omega(d\log(m/\delta))$, due to our choice of $\lambda$ and the fact that $\epsilon \ge 0$. Thus, we obtain
\[
B_{i,h}\widehat V^{\pi}_{i,h+1}(s,a) - \Gamma(s,a) \le \phi(s,a)^{\top}\omega^*_{i,h} \le B_{i,h}\widehat V^{\pi}_{i,h+1}(s,a) + \Gamma(s,a),
\]
which, subtracting $\Gamma(s,a)$ throughout, further implies that
\[
B_{i,h}\widehat V^{\pi}_{i,h+1}(s,a) - 2\Gamma(s,a) \le \phi(s,a)^{\top}\omega^*_{i,h} - \Gamma(s,a) \le B_{i,h}\widehat V^{\pi}_{i,h+1}(s,a).
\]
Now, since $B_{i,h}\widehat V^{\pi}_{i,h+1}(s,a)\in[-(H-h)\sqrt d,\,(H-h)\sqrt d]$, and since the clipping operator is monotone, we have
\[
B_{i,h}\widehat V^{\pi}_{i,h+1}(s,a) - 2\Gamma(s,a)
\le \mathrm{Clip}_{[-(H-h)\sqrt d,\,(H-h)\sqrt d]}\left(B_{i,h}\widehat V^{\pi}_{i,h+1}(s,a) - 2\Gamma(s,a)\right)
\le \mathrm{Clip}_{[-(H-h)\sqrt d,\,(H-h)\sqrt d]}\left(\phi(s,a)^{\top}\omega^*_{i,h} - \Gamma(s,a)\right)
= \widehat Q^{\pi}_{i,h}(s,a)
\le B_{i,h}\widehat V^{\pi}_{i,h+1}(s,a).
\]
This finally implies that $0 \le B_{i,h}\widehat V^{\pi}_{i,h+1}(s,a) - \widehat Q^{\pi}_{i,h}(s,a) \le 2\Gamma(s,a)$. For the optimistic estimates, we argue symmetrically; thus, we omit the proof.

Next, we state a result which is the analogue of Lemma A.5 for this section.
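Before proceeding, the Loewner-ordering step used in the proof above — replacing an $\epsilon$-fraction of unit-norm features shifts the empirical covariance by at most $\epsilon$ in the positive-semidefinite order — admits a quick numerical sanity check. The sketch below is purely illustrative, with arbitrary hypothetical dimensions and corruption level.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, eps = 4, 1000, 0.1

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

clean = unit(rng.normal(size=(m, d)))        # clean features with ||phi||_2 = 1
corrupt = clean.copy()
k = int(eps * m)
corrupt[:k] = unit(rng.normal(size=(k, d)))  # adversary replaces an eps-fraction

cov = lambda X: X.T @ X / m
# Each replaced rank-one term phi phi^T satisfies 0 <= phi phi^T <= I, so
# cov(corrupt) <= cov(clean) + eps * I in the Loewner order.
diff = cov(clean) + eps * np.eye(d) - cov(corrupt)
min_eig = np.linalg.eigvalsh(diff).min()
assert min_eig >= -1e-12
```

The same reasoning, applied with the regularizers $(\epsilon+\lambda)I$ and $(2\epsilon+\lambda)I$, yields the matrix comparison used for $\Lambda_h$ above.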
We define $\widehat V^{\pi}_{i,h}(s,\widehat\theta_i)$ and $\widehat V^{\dagger,\pi_{-i}}_{i,h}(s,\widehat\theta_i)$ to be the lower and upper estimates of the value functions of a given policy $\pi$ with respect to the parameter $\widehat\theta_i$. We use similar notation for the $Q$-function estimates.

Lemma B.2. Let $\widehat\theta_i\in\Theta_{\mathrm{Unil}}(\widetilde\theta_i)$ be a parameter used by the robust subroutine. Then, under the event of Lemma B.1, we have, for every agent $i$, state $s$, step $h$ and policy $\pi$:
\[
\widehat V^{\pi}_{i,h}(s,\widehat\theta_i) \le V^{\pi}_{i,h}(s,\widehat\theta_i),
\qquad\text{and}\qquad
\widehat V^{\dagger,\pi_{-i}}_{i,h}(s,\widehat\theta_i) \ge V^{\dagger,\pi_{-i}}_{i,h}(s,\widehat\theta_i).
\]

Proof. Similar to Lemma A.5, we again apply induction. Note that the result holds at step $H$, where all value estimates are $0$, since the bonus term is non-negative. Suppose the statement holds for step $h+1$. Then, for step $h$, we have
\[
\widehat V^{\pi}_{i,h}(s,\widehat\theta_i) = \mathbb{E}_{a\sim\pi_h}\left[\widehat Q^{\pi}_{i,h}(s,a,\widehat\theta_i)\right]
\le \mathbb{E}_{a\sim\pi_h}\left[B_{i,h}\widehat V^{\pi}_{i,h+1}(s,a,\widehat\theta_i)\right]
\le \mathbb{E}_{a\sim\pi_h}\left[B_{i,h} V^{\pi}_{i,h+1}(s,a,\widehat\theta_i)\right]
= V^{\pi}_{i,h}(s,\widehat\theta_i).
\]
For $\widehat V^{\dagger,\pi_{-i}}_{i,h}(s,\widehat\theta_i)$ we have
\[
\widehat V^{\dagger,\pi_{-i}}_{i,h}(s,\widehat\theta_i) = \max_{a_i\in A_i}\mathbb{E}_{a_{-i}\sim\pi_{-i,h}(\cdot\mid s)}\left[\widehat Q^{\dagger,\pi_{-i}}_{i,h}(s,a,\widehat\theta_i)\right]
\ge \max_{a_i\in A_i}\mathbb{E}_{a_{-i}\sim\pi_{-i,h}(\cdot\mid s)}\left[B_{i,h}\widehat V^{\dagger,\pi_{-i}}_{i,h+1}(s,a,\widehat\theta_i)\right]
\ge \max_{a_i\in A_i}\mathbb{E}_{a_{-i}\sim\pi_{-i,h}(\cdot\mid s)}\left[B_{i,h} V^{\dagger,\pi_{-i}}_{i,h+1}(s,a,\widehat\theta_i)\right]
= V^{\dagger,\pi_{-i}}_{i,h}(s,\widehat\theta_i).
\]

Next, we prove an upper bound on the expected sum of bonuses.

Lemma B.3. Let $\pi^*$ be a Nash equilibrium which is covered by $D$. Then, for every agent $i$, we have
\[
\mathbb{E}_{\pi^*}\left[\sum_{h=0}^{H-1}\Gamma(s_h,a_h)\right] \le H\cdot E(d,m,\delta,\epsilon)\cdot\sqrt{\frac{5d}{C_P}}.
\]

Proof. Using the definition of the bonus in Equation (31), we have
\[
\mathbb{E}_{\pi^*}\left[\sum_{h=0}^{H-1}\Gamma(s_h,a_h)\right] = E(d,m,\delta,\epsilon)\cdot\mathbb{E}_{\pi^*}\left[\sum_{h=0}^{H-1}\|\phi(s_h,a_h)\|_{\Lambda_h^{-1}}\right].
\]
We bound the last factor on the right-hand side of the equation above.
Using the definition of $\Lambda_h$ in Equation (30), we have, for every $h\in[H-1]$:
\begin{align*}
\mathbb{E}_{\pi^*}\left[\|\phi(s_h,a_h)\|_{\Lambda_h^{-1}}\right]
&\le \sqrt 5\,\mathbb{E}_{\pi^*}\left[\|\phi(s_h,a_h)\|_{(\Sigma_\mu(h)+\lambda I)^{-1}}\right] \tag{32}\\
&= \sqrt 5\,\mathbb{E}_{\pi^*}\left[\sqrt{\phi(s_h,a_h)^{\top}(\Sigma_\mu(h)+\lambda I)^{-1}\phi(s_h,a_h)}\right]\\
&\le \sqrt 5\,\sqrt{\mathbb{E}_{\pi^*}\left[\phi(s_h,a_h)^{\top}(\Sigma_\mu(h)+\lambda I)^{-1}\phi(s_h,a_h)\right]} \tag{33}\\
&= \sqrt 5\,\sqrt{\mathrm{Tr}\left(\mathbb{E}_{\pi^*}\left[\phi(s_h,a_h)\phi(s_h,a_h)^{\top}\right](\Sigma_\mu(h)+\lambda I)^{-1}\right)} \tag{34}\\
&\le \sqrt{\frac{5}{C_P}}\,\sqrt{\mathrm{Tr}\left(\mathbb{E}_{\mu_h}\left[\phi(s_h,a_h)\phi(s_h,a_h)^{\top}\right](\Sigma_\mu(h)+\lambda I)^{-1}\right)} \tag{35}\\
&= \sqrt{\frac{5}{C_P}}\,\sqrt{\mathrm{Tr}\left(\Sigma_\mu(h)(\Sigma_\mu(h)+\lambda I)^{-1}\right)}\\
&\le \sqrt{\frac{5}{C_P}}\,\sqrt{\sum_{j=1}^{d}\frac{\sigma_j}{\sigma_j+\lambda}} \tag{36}\\
&\le \sqrt{\frac{5d}{C_P}}, \tag{37}
\end{align*}
where $\mathrm{Tr}(M)$ denotes the trace of a matrix $M$ and $\sigma_j$ denote the eigenvalues of the covariance matrix $\Sigma_\mu(h)$. Above, Equation (32) uses the observation
\[
\Lambda_h^{-1} = \left(\frac35\left(\frac1m\sum_{j=1}^{m}\phi(s_h^j,a_h^j)\phi(s_h^j,a_h^j)^{\top} + (\epsilon+\lambda)I\right)\right)^{-1}
\preceq \left(\frac35\left(\frac1m\sum_{j=1}^{m}\phi(\widetilde s_h^j,\widetilde a_h^j)\phi(\widetilde s_h^j,\widetilde a_h^j)^{\top} + \lambda I\right)\right)^{-1}
\preceq 5\left(\Sigma_\mu(h)+\lambda I\right)^{-1},
\]
which follows from Lemma D.1 and similar arguments as in the proof of Lemma B.1; Equation (33) uses Jensen's inequality; Equation (34) uses the cyclic property of the trace operator, $x^{\top}Mx = \mathrm{Tr}(Mxx^{\top})$; Equation (35) uses the transition-coverage part of Assumption 2; finally, Equation (36) uses the fact that the eigenvalues of $\Sigma_\mu(h)$ and $\lambda$ are nonnegative real numbers.

Next, we provide an upper bound on the difference in preference probabilities between the ground-truth and estimated parameters.

Lemma B.4. For any agent $i$ and $\theta_i\in\Theta_{\mathrm{Unil}}(\widetilde\theta_i)$, with probability at least $1-\delta/2$, we have
\[
\mathbb{E}_{\tau\sim d^{\mu},\,\tau'\sim d^{\mu_{\mathrm{ref}}}}\left[\left\|P(\cdot\mid\tau,\tau',\theta^*_i) - P(\cdot\mid\tau,\tau',\theta_i)\right\|_1^2\right]
\le 8H\sqrt d\,\epsilon + c\cdot\frac{d}{m}\log\left(\frac{2Hnm\sqrt d}{\delta}\right),
\]
where $c$ is an absolute constant. Proof.
Lemma D.2 gives us the following bound:
\[
\mathbb{E}_{\tau\sim d^{\mu},\,\tau'\sim d^{\mu_{\mathrm{ref}}}}\left[\left\|P(\cdot\mid\tau,\tau',\theta^*_i) - P(\cdot\mid\tau,\tau',\theta)\right\|_1^2\right]
\le \frac{c_1}{m}\left(\sum_{j=1}^{m}\log\left(\frac{P(\widetilde o_j\mid\widetilde\tau_j,\widetilde\tau'_j,\theta)}{P(\widetilde o_j\mid\widetilde\tau_j,\widetilde\tau'_j,\theta^*_i)}\right) + \log(d\log(2n/\delta))\right).
\]
Let us deal with the first term of the bound. Note that the bound depends on clean samples. Let $\widehat D$ be the given dataset and $S$ denote the set of corrupted trajectories in $\widehat D$. We can write
\begin{align*}
\sum_{j=1}^{m}\log\left(\frac{P(\widetilde o_j\mid\widetilde\tau_j,\widetilde\tau'_j,\theta)}{P(\widetilde o_j\mid\widetilde\tau_j,\widetilde\tau'_j,\theta^*_i)}\right)
&= \sum_{(\tau,\tau',o)\in S}\log\left(\frac{P(o\mid\tau,\tau',\theta)}{P(o\mid\tau,\tau',\theta^*_i)}\right) + \sum_{(\tau,\tau',o)\notin S}\log\left(\frac{P(o\mid\tau,\tau',\theta)}{P(o\mid\tau,\tau',\theta^*_i)}\right)\\
&\le \sum_{(\tau,\tau',o)\in\widehat D}\log\left(\frac{P(o\mid\tau,\tau',\theta)}{P(o\mid\tau,\tau',\theta^*_i)}\right) + \epsilon m\cdot\log\left(\frac{1+\exp(H\sqrt d)}{1+\exp(-H\sqrt d)}\right)\\
&\le \sum_{(\tau,\tau',o)\in\widehat D}\log\left(\frac{P(o\mid\tau,\tau',\widetilde\theta_i)}{P(o\mid\tau,\tau',\theta^*_i)}\right) + 2\epsilon m\, H\sqrt d\\
&\le 8H\sqrt d\,\epsilon m + c_2\cdot d\log\left(\frac{2Hmn}{\delta}\right),
\end{align*}
where the first inequality uses the fact that the corrupted subset comprises an $\epsilon$-fraction of the whole dataset, together with $|\phi^{\top}\theta| \le H\sqrt d$ from the linear-MDP assumption; the second inequality uses the fact that $\widetilde\theta_i$ maximizes the log-likelihood with respect to the corrupted data, and that $H$ and $d$ are natural numbers, so we can bound the log expression directly in terms of the bounds on the rewards; finally, the last inequality uses Lemma ?? with some constant $c_2$. Putting things together (dividing by $m$ and combining with the bound above), we obtain the stated bound.

Next, we prove bounds on the gaps with respect to any chosen reward parameters from the confidence set.

Lemma B.5. Let $\widehat\theta = (\widehat\theta_1,\dots,\widehat\theta_n)$, where $\widehat\theta_i\in\Theta_{\mathrm{Unil}}(\widetilde\theta_i)$ for every $i\in[n]$, and let $\pi^*$ be a Nash equilibrium covered by $D$.
Then, if $\widetilde\pi\in\arg\min_{\pi}\widetilde{\mathrm{Gap}}(\pi,\widehat\theta)$, we have, with probability at least $1-\delta$:
\[
0 \le \mathrm{Gap}(\widetilde\pi,\widehat\theta) \le \widetilde O\left(n\left(\frac{1}{\sqrt{C_R}}+\frac{1}{\sqrt{C_P}}\right)\left(H^{5/2}d^{3/4}\sqrt\epsilon + H^2\,\mathrm{poly}(d)\,\frac{1}{\sqrt m}\right)\right),
\]
\[
0 \le \mathrm{Gap}(\pi^*,\widehat\theta) \le \widetilde O\left(n\left(\frac{1}{\sqrt{C_R}}+\frac{1}{\sqrt{C_P}}\right)\left(H^{5/2}d^{3/4}\sqrt\epsilon + H^2\,\mathrm{poly}(d)\,\frac{1}{\sqrt m}\right)\right).
\]

Proof. First, note that both the true gap and the estimated gap are, by definition, non-negative. Now, given $\widehat\theta$ and $\widetilde\pi$ as specified in the statement, we have
\begin{align*}
\mathrm{Gap}(\widetilde\pi,\widehat\theta) &= \sum_{i\in[n]}\left(V^{\dagger,\widetilde\pi_{-i}}_{i,0}(s_0,\widehat\theta_i) - V^{\widetilde\pi}_{i,0}(s_0,\widehat\theta_i)\right) \tag{38}\\
&\le \sum_{i\in[n]}\left(\widehat V^{\dagger,\widetilde\pi_{-i}}_{i,0}(s_0,\widehat\theta_i) - \widehat V^{\widetilde\pi}_{i,0}(s_0,\widehat\theta_i)\right) \tag{39}\\
&\le \min_{\pi}\sum_{i\in[n]}\left(\widehat V^{\dagger,\pi_{-i}}_{i,0}(s_0,\widehat\theta_i) - \widehat V^{\pi}_{i,0}(s_0,\widehat\theta_i)\right) \tag{40}\\
&\le \sum_{i\in[n]}\left[\left(V^{\dagger,\pi^*_{-i}}_{i,0}(s_0,\widehat\theta_i) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i)\right) - \left(V^{\pi^*}_{i,0}(s_0,\widehat\theta_i) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i)\right)\right] \tag{41}\\
&= \sum_{i\in[n]}\left[\left(V^{\dagger,\pi^*_{-i}}_{i,0}(s_0,\widehat\theta_i) + V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\theta^*) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i)\right) - \left(V^{\pi^*}_{i,0}(s_0,\widehat\theta_i) + V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\theta^*) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i)\right)\right] \tag{42}\\
&= \sum_{i\in[n]}\Bigg[\left(V^{\dagger,\pi^*_{-i}}_{i,0}(s_0,\widehat\theta_i) + V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\theta^*) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i) - V^{\dagger,\pi^*_{-i}}_{i,0}(s_0,\theta^*)\right)\\
&\qquad\quad + \left(V^{\pi^*}_{i,0}(s_0,\theta^*) - V^{\pi^*}_{i,0}(s_0,\widehat\theta_i) + V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\theta^*) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i)\right)

\;-\; 2\left(V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\theta^*) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i)\right)

+ \underbrace{\left(V^{\dagger,\pi^*_{-i}}_{i,0}(s_0,\theta^*) - V^{\pi^*}_{i,0}(s_0,\theta^*)\right)}_{=0}\Bigg] \tag{43}\\
&= \sum_{i\in[n]}\Bigg[\underbrace{V^{\dagger,\pi^*_{-i}}_{i,0}(s_0,\widehat\theta_i) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i) + V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\theta^*) - V^{\dagger,\pi^*_{-i}}_{i,0}(s_0,\theta^*)}_{:=\,Z_1}\\
&\qquad\quad + \underbrace{V^{\pi^*}_{i,0}(s_0,\theta^*) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\theta^*) - \left(V^{\pi^*}_{i,0}(s_0,\widehat\theta_i) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i)\right)}_{:=\,Z_2}\Bigg], \tag{44}
\end{align*}
where Equation (38) follows by definition; Equation (39) follows from Lemma B.2; Equation (40) follows by design of Algorithm 4; in Equation (41) we substitute $\pi^*$ and add and subtract the same term; in Equations (42) and (43) we again add and subtract identical terms; and Equation (44) follows from the fact that $\pi^*$ is a NE under $\theta^*$.
Now, we deal with the two terms above, $Z_1$ and $Z_2$, separately. First, consider the term $Z_2$. For every $i$, we have
\begin{align*}
&V^{\pi^*}_{i,0}(s_0,\theta^*_i) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\theta^*_i) - \left(V^{\pi^*}_{i,0}(s_0,\widehat\theta_i) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i)\right)\\
&= \mathbb{E}_{a_0\sim\pi^*_0(\cdot\mid s_0)}\left[Q^{\pi^*}_{i,0}(s_0,a_0) - Q^{\pi^*}_{i,0}(s_0,a_0,\widehat\theta_i)\right] - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\theta^*_i) + V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i)\\
&\le \mathbb{E}_{a_0\sim\pi^*_0(\cdot\mid s_0)}\left[B_{i,0}V^{\pi^*}_{i,1}(s_0,a_0) - B_{i,0}V^{\pi^*}_{i,1}(s_0,a_0,\widehat\theta_i) + 2\Gamma(s_0,a_0)\right] - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\theta^*_i) + V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i) \tag{45}\\
&= \mathbb{E}_{a_0\sim\pi^*_0(\cdot\mid s_0)}\left[R_{i,0}(s_0,a_0,\theta^*_i) - R_{i,0}(s_0,a_0,\widehat\theta_i) + \mathbb{E}_{s_1\sim P_1(\cdot\mid s_0,a_0)}\left[V^{\pi^*}_{i,1}(s_1) - V^{\pi^*}_{i,1}(s_1,\widehat\theta_i)\right] + 2\Gamma(s_0,a_0)\right]\\
&\qquad - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\theta^*_i) + V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i) \tag{46}\\
&\le \mathbb{E}_{a_0\sim\pi^*_0(\cdot\mid s_0),\, s_1\sim P_1(\cdot\mid s_0,a_0),\, a_1\sim\pi^*_1(\cdot\mid s_1)}\left[\sum_{h=0}^{1}\left(R_{i,h}(s_h,a_h,\theta^*_i) - R_{i,h}(s_h,a_h,\widehat\theta_i) + 2\Gamma(s_h,a_h)\right)\right]\\
&\qquad + \mathbb{E}_{a_0\sim\pi^*_0,\, s_1\sim P_1,\, a_1\sim\pi^*_1,\, s_2\sim P_2(\cdot\mid s_1,a_1)}\left[V^{\pi^*}_{i,2}(s_2) - V^{\pi^*}_{i,2}(s_2,\widehat\theta_i)\right] - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\theta^*_i) + V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i)\\
&\le \cdots
\end{align*}
\begin{align*}
&\le \mathbb{E}_{\pi^*}\left[\sum_{h=0}^{H-1}\left(R_{i,h}(s_h,a_h,\theta^*_i) - R_{i,h}(s_h,a_h,\widehat\theta_i) + 2\Gamma(s_h,a_h)\right)\right] - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\theta^*_i) + V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i)\\
&= \mathbb{E}_{\tau\sim d^{\pi^*}}\left[\phi(\tau)^{\top}\theta^*_i - \phi(\tau)^{\top}\widehat\theta_i\right] - \mathbb{E}_{\tau\sim d^{\mu_{\mathrm{ref}}}}\left[\phi(\tau)^{\top}\theta^*_i - \phi(\tau)^{\top}\widehat\theta_i\right] + 2\mathbb{E}_{\pi^*}\left[\sum_{h=0}^{H-1}\Gamma(s_h,a_h)\right] \tag{47}\\
&= \mathbb{E}_{\tau\sim d^{\pi^*},\,\tau'\sim d^{\mu_{\mathrm{ref}}}}\left[(\phi(\tau)-\phi(\tau'))^{\top}\left(\theta^*_i - \widehat\theta_i\right)\right] + 2\mathbb{E}_{\pi^*}\left[\sum_{h=0}^{H-1}\Gamma(s_h,a_h)\right]\\
&\le \sqrt{\mathbb{E}_{\tau\sim d^{\pi^*},\,\tau'\sim d^{\mu_{\mathrm{ref}}}}\left[\left((\phi(\tau)-\phi(\tau'))^{\top}\left(\theta^*_i-\widehat\theta_i\right)\right)^2\right]} + 2\mathbb{E}_{\pi^*}\left[\sum_{h=0}^{H-1}\Gamma(s_h,a_h)\right] \tag{48}\\
&= \sqrt{\left(\theta^*_i-\widehat\theta_i\right)^{\top}\Sigma^{-}_{\pi^*,\mu_{\mathrm{ref}}}\left(\theta^*_i-\widehat\theta_i\right)} + 2\mathbb{E}_{\pi^*}\left[\sum_{h=0}^{H-1}\Gamma(s_h,a_h)\right] \tag{49}\\
&\le \sqrt{\frac{1}{C_R}}\sqrt{\left(\theta^*_i-\widehat\theta_i\right)^{\top}\Sigma^{-}_{\mu,\mu_{\mathrm{ref}}}\left(\theta^*_i-\widehat\theta_i\right)} + 2\mathbb{E}_{\pi^*}\left[\sum_{h=0}^{H-1}\Gamma(s_h,a_h)\right] \tag{50}\\
&= \sqrt{\frac{1}{C_R}}\sqrt{\mathbb{E}_{\tau\sim d^{\mu},\,\tau'\sim d^{\mu_{\mathrm{ref}}}}\left[\left((\phi(\tau)-\phi(\tau'))^{\top}\theta^*_i - (\phi(\tau)-\phi(\tau'))^{\top}\widehat\theta_i\right)^2\right]} + 2\mathbb{E}_{\pi^*}\left[\sum_{h=0}^{H-1}\Gamma(s_h,a_h)\right]\\
&= \sqrt{\frac{1}{C_R}}\sqrt{\mathbb{E}_{\tau\sim d^{\mu},\,\tau'\sim d^{\mu_{\mathrm{ref}}}}\left[\left(\sigma^{-1}(P(o=1\mid\tau,\tau',\theta^*_i)) - \sigma^{-1}\big(P(o=1\mid\tau,\tau',\widehat\theta_i)\big)\right)^2\right]} + 2\mathbb{E}_{\pi^*}\left[\sum_{h=0}^{H-1}\Gamma(s_h,a_h)\right] \tag{51}\\
&\le \sqrt{\frac{\iota^2}{C_R}}\sqrt{\mathbb{E}_{\tau\sim d^{\mu},\,\tau'\sim d^{\mu_{\mathrm{ref}}}}\left[\left(P(o=1\mid\tau,\tau',\theta^*_i) - P(o=1\mid\tau,\tau',\widehat\theta_i)\right)^2\right]} + 2\mathbb{E}_{\pi^*}\left[\sum_{h=0}^{H-1}\Gamma(s_h,a_h)\right] \tag{52}\\
&\le \sqrt{\frac{\iota^2}{2C_R}}\sqrt{\mathbb{E}_{\tau\sim d^{\mu},\,\tau'\sim d^{\mu_{\mathrm{ref}}}}\left[\left\|P(\cdot\mid\tau,\tau',\theta^*_i) - P(\cdot\mid\tau,\tau',\widehat\theta_i)\right\|_1^2\right]} + 2H\,E(d,m,\delta,\epsilon)\sqrt{\frac{5d}{C_P}} \tag{53}\\
&\le \sqrt{\frac{\iota^2}{2C_R}}\sqrt{8H\sqrt d\,\epsilon + c\cdot\frac{d}{m}\log\frac{nm}{\delta}} + 2H\,E(d,m,\delta,\epsilon)\sqrt{\frac{5d}{C_P}}, \tag{54}
\end{align*}
where Equation (45) follows from the definition of the $Q$-functions and the Bellman operator, together with Lemma B.1; Equation (46) follows from the definition of the Bellman operator; Equation (47) uses the trajectory-based definition of the return; Equation (48) uses Jensen's inequality; Equation (49) uses the definition of the difference covariance matrix with respect to $\pi^*$ and $\mu_{\mathrm{ref}}$; Equation (50) uses the first part of Assumption 2; Equation (51) uses the definition of the link function and the preference data-generation assumption; Equation (52) uses Lemma D.3; finally, Equation (53) uses Lemma B.3.

Now, denote $\pi^{\dagger}_i \in \arg\max_{\pi'} V^{\pi',\pi^*_{-i}}_{i,0}(s_0,\widehat\theta_i)$. For the remaining term, $Z_1$, we similarly have
\begin{align*}
&V^{\dagger,\pi^*_{-i}}_{i,0}(s_0,\widehat\theta_i) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i) - \left(V^{\dagger,\pi^*_{-i}}_{i,0}(s_0,\theta^*_i) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\theta^*_i)\right)\\
&\le \mathbb{E}_{\tau\sim d^{(\pi^{\dagger}_i,\pi^*_{-i})},\,\tau'\sim d^{\mu_{\mathrm{ref}}}}\left[(\phi(\tau)-\phi(\tau'))^{\top}\left(\widehat\theta_i - \theta^*_i\right)\right] + 2\mathbb{E}_{(\pi^{\dagger}_i,\pi^*_{-i})}\left[\sum_{h=0}^{H-1}\Gamma(s_h,a_h)\right]\\
&\le \sqrt{\frac{1}{C_R}}\sqrt{\left(\widehat\theta_i-\theta^*_i\right)^{\top}\Sigma^{-}_{\mu,\mu_{\mathrm{ref}}}\left(\widehat\theta_i-\theta^*_i\right)} + 2H\,E(d,m,\delta,\epsilon)\sqrt{\frac{5d}{C_P}}\\
&\le \sqrt{\frac{\iota^2}{2C_R}}\sqrt{\mathbb{E}_{\tau\sim d^{\mu},\,\tau'\sim d^{\mu_{\mathrm{ref}}}}\left[\left\|P(\cdot\mid\tau,\tau',\theta^*_i) - P(\cdot\mid\tau,\tau',\widehat\theta_i)\right\|_1^2\right]} + 2H\,E(d,m,\delta,\epsilon)\sqrt{\frac{5d}{C_P}}\\
&\le \sqrt{\frac{\iota^2}{2C_R}}\sqrt{8H\sqrt d\,\epsilon + c\cdot\frac{d}{m}\log\frac{nm}{\delta}} + 2H\,E(d,m,\delta,\epsilon)\sqrt{\frac{5d}{C_P}},
\end{align*}
where we have used similar arguments as above (note that this is the part where low relative uncertainty is needed, as opposed to coverage of only a Nash equilibrium). Putting everything together, and using the definition of $E(d,m,\delta,\epsilon)$ from Equation (29), we obtain
\begin{align*}
\mathrm{Gap}(\widetilde\pi,\widehat\theta_1,\dots,\widehat\theta_n)
&\le 2n\sqrt{\frac{\iota^2}{2C_R}}\sqrt{8H\sqrt d\,\epsilon + c\cdot\frac{d}{m}\log\frac{nm}{\delta}} + 4nH\sqrt{c_2(\delta)\left(\frac{(H\sqrt d+\gamma)^2\,\mathrm{poly}(d)}{m} + (H\sqrt d+\gamma)^2\,\epsilon + (2\epsilon+\lambda)H\sqrt d\right)}\cdot\sqrt{\frac{5d}{C_P}}\\
&\le \widetilde O\left(n\left(\frac{1}{\sqrt{C_R}}+\frac{1}{\sqrt{C_P}}\right)\left(H^{5/2}d^{3/4}\sqrt\epsilon + H^2\,\mathrm{poly}(d)\,\frac{1}{\sqrt m}\right)\right),
\end{align*}
where we have also used our choice of $\lambda$. Finally, for the third statement, note that
\begin{align*}
\mathrm{Gap}(\pi^*,\widehat\theta_1,\dots,\widehat\theta_n) &= \sum_{i\in[n]}\left(V^{\dagger,\pi^*_{-i}}_{i,0}(s_0,\widehat\theta_i) - V^{\pi^*}_{i,0}(s_0,\widehat\theta_i)\right)\\
&\le \sum_{i\in[n]}\left[\left(V^{\dagger,\pi^*_{-i}}_{i,0}(s_0,\widehat\theta_i) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i)\right) - \left(V^{\pi^*}_{i,0}(s_0,\widehat\theta_i) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i)\right)\right]\\
&\le \widetilde O\left(n\left(\frac{1}{\sqrt{C_R}}+\frac{1}{\sqrt{C_P}}\right)\left(H^{5/2}d^{3/4}\sqrt\epsilon + H^2\,\mathrm{poly}(d)\,\frac{1}{\sqrt m}\right)\right),
\end{align*}
where the first inequality uses Lemma B.2 as if the robust subroutine were applied on $\pi^*$, while the second inequality follows from noting that we already have a bound on the previous quantity from Equation (41).

Before we proceed, we need to provide guarantees on the output of the PGA methods used in Algorithm 4.

Proposition B.1. Let $\eta_1 = 1/\sqrt{T_1}$ and, for every agent $i$, let $\widehat\theta^*_i\in\arg\max_{\theta_i\in\Theta_{\mathrm{Unil}}(\widetilde\theta_i)}\mathrm{Gap}_i(\pi^*,\theta_i)$, where
\[
\widehat\theta_i := \frac{1}{T_1}\sum_{t=1}^{T_1}\widehat\theta^{(t)}_i,
\]
and the iterates $\widehat\theta^{(t)}_i$ are generated by
\[
\widehat\theta^{(t+1)}_i = \mathcal{P}_{\Theta_{\mathrm{Unil}}(\widetilde\theta_i)}\left(\widehat\theta^{(t)}_i + \eta_1\widetilde\nabla_{\theta}\mathrm{Gap}_i\big(\pi^*,\widehat\theta^{(t)}_i\big)\right)
\]
for $T_1$ steps. Then,
\[
\mathrm{Gap}_i(\pi^*,\widehat\theta^*_i) - \mathrm{Gap}_i(\pi^*,\widehat\theta_i)
\le \widetilde O\left(\left(\frac{1}{\sqrt{C_R}}+\frac{1}{\sqrt{C_P}}\right)\left(H^{5/2}d^{3/4}\sqrt\epsilon + \frac{H^2\,\mathrm{poly}(d)}{\sqrt m}\right) + \frac{H^2\sqrt{\mathrm{poly}(d)}}{\sqrt{T_1}}\left(\sqrt\epsilon + \frac{1}{\sqrt m}\right)\right).
\]

Proof. First, note that, for any agent $i$ and parameter $\theta_i$, we have
\[
\mathrm{Gap}_i(\pi^*,\theta_i) = V^{\dagger,\pi^*_{-i}}_{i,0}(s_0,\theta_i) - V^{\pi^*}_{i,0}(s_0,\theta_i) = \max_{\pi'_i} V^{\pi'_i,\pi^*_{-i}}_{i,0}(s_0,\theta_i) - V^{\pi^*}_{i,0}(s_0,\theta_i).
\]
Now, define
\[
\Pi^{PP,\dagger}_i(\theta_i) = \left\{\pi^{\dagger}_i\in\Pi^{PP}_i : V^{\pi^{\dagger}_i,\pi^*_{-i}}_{i,0}(s_0,\theta_i) = \max_{\pi'_i} V^{\pi'_i,\pi^*_{-i}}_{i,0}(s_0,\theta_i)\right\}
\]
as the set of unilateral maximizer policies for player $i$ at $\theta_i$. Let $\pi^{\dagger}_i\in\Pi^{PP,\dagger}_i(\widehat\theta^*_i)$ be any unilateral maximizer at $\widehat\theta^*_i$, and define
\[
g^*_i := \nabla_{\theta_i}\left(V^{\pi^{\dagger}_i,\pi^*_{-i}}_{i,0}(s_0,\theta_i) - V^{\pi^*}_{i,0}(s_0,\theta_i)\right).
\]
Since the value function is linear in $\theta_i$, $g^*_i$ does not depend on $\theta_i$. Moreover, Danskin's subdifferential formula implies that $g^*_i\in\partial_{\theta_i}\mathrm{Gap}_i(\pi^*,\widehat\theta^*_i)$.
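The Danskin-type fact just invoked — for a pointwise maximum of linear functions, the gradient of an active branch is a subgradient of the maximum — can be illustrated with a small numerical check. The sketch below is purely illustrative, with hypothetical random branches unrelated to the paper's value functions.

```python
import numpy as np

rng = np.random.default_rng(2)
K, dim = 8, 3
A = rng.normal(size=(K, dim))        # gradients a_k of the linear "branches"

gap = lambda th: np.max(A @ th)      # pointwise max of linear functions

theta_hat = rng.normal(size=dim)
g = A[np.argmax(A @ theta_hat)]      # gradient of the branch active at theta_hat

# Subgradient inequality: gap(t') >= gap(theta_hat) + <g, t' - theta_hat> for all t'.
for _ in range(100):
    theta_p = rng.normal(size=dim)
    assert gap(theta_p) >= gap(theta_hat) + g @ (theta_p - theta_hat) - 1e-12
```

This is exactly the inequality used below in Equation (57), with the active branch playing the role of $g^*_i$.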
By linearity in $\theta_i$, for any $\theta'_i$ we have
\[
(\theta'_i)^{\top}g^*_i = V^{\pi^{\dagger}_i,\pi^*_{-i}}_{i,0}(s_0,\theta'_i) - V^{\pi^*}_{i,0}(s_0,\theta'_i). \tag{56}
\]
In particular, since $\pi^{\dagger}_i$ is active at $\widehat\theta^*_i$, we have $\mathrm{Gap}_i(\pi^*,\widehat\theta^*_i) = (\widehat\theta^*_i)^{\top}g^*_i$, while for any $\theta_i$, $\mathrm{Gap}_i(\pi^*,\theta_i) \ge \theta_i^{\top}g^*_i$. Thus, we can write:
\[
\mathrm{Gap}_i(\pi^*,\widehat\theta^*_i) - \mathrm{Gap}_i(\pi^*,\theta_i) \le \left\langle\widehat\theta^*_i - \theta_i,\, g^*_i\right\rangle. \tag{57}
\]
Now, since $\mu$ and $\mu_{\mathrm{ref}}$ cover $\pi^*$ and its unilateral deviations, we can estimate the gradient of the value function at $\pi^*$, or at its unilateral deviations, from the gradient of the value function at $\mu$ or $\mu_{\mathrm{ref}}$. Let $\widehat\theta_i = (1/T_1)\sum_{t=1}^{T_1}\widehat\theta^{(t)}_i$. Note that, since $\widehat\theta_i\in\Theta_{\mathrm{Unil}}(\widetilde\theta_i)$ and $\widehat\theta^*_i\in\Theta_{\mathrm{Unil}}(\widetilde\theta_i)$, we have
\begin{align*}
&\left\langle\widehat\theta^*_i - \widehat\theta_i,\; g^*_i - \nabla_{\theta}V^{\mu}_{i,0}(s_0,\widehat\theta_i) + \nabla_{\theta}V^{\pi^*}_{i,0}(s_0,\widehat\theta_i)\right\rangle\\
&= V^{\pi^{\dagger}_i,\pi^*_{-i}}_{i,0}(s_0,\widehat\theta^*_i) - V^{\mu}_{i,0}(s_0,\widehat\theta^*_i) - \left(V^{\pi^{\dagger}_i,\pi^*_{-i}}_{i,0}(s_0,\widehat\theta_i) - V^{\mu}_{i,0}(s_0,\widehat\theta_i)\right)\\
&= \mathbb{E}_{\tau\sim d^{(\pi^{\dagger}_i,\pi^*_{-i})},\,\tau'\sim d^{\mu}}\left[(\phi(\tau)-\phi(\tau'))^{\top}\left(\widehat\theta^*_i - \widehat\theta_i\right)\right]\\
&\le \mathbb{E}_{\tau\sim d^{(\pi^{\dagger}_i,\pi^*_{-i})},\,\tau'\sim d^{\mu}}\left[(\phi(\tau)-\phi(\tau'))^{\top}\left(\widehat\theta^*_i - \theta^*_i\right)\right] + \mathbb{E}_{\tau\sim d^{(\pi^{\dagger}_i,\pi^*_{-i})},\,\tau'\sim d^{\mu}}\left[(\phi(\tau)-\phi(\tau'))^{\top}\left(\theta^*_i - \widehat\theta_i\right)\right]\\
&\le 2\sqrt{\frac{\iota^2}{2C_R}}\sqrt{8H\sqrt d\,\epsilon + c\cdot\frac{d}{m}\log\frac{nm}{\delta}}, \tag{58}
\end{align*}
where the first equality follows from Equation (56) and the linearity of the value function in $\theta$; the rest follows the same argument as in the proof of Lemma B.4. Similarly, we have
\[
\left\langle\widehat\theta^*_i - \widehat\theta_i,\; \nabla_{\theta_i}V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i) - \nabla_{\theta_i}V^{\pi^*}_{i,0}(s_0,\widehat\theta_i)\right\rangle \le 2\sqrt{\frac{\iota^2}{2C_R}}\sqrt{8H\sqrt d\,\epsilon + c\cdot\frac{d}{m}\log\frac{nm}{\delta}}.
\]
Combining, we get
\[
\left\langle\widehat\theta^*_i - \widehat\theta_i,\; g^*_i - \nabla_{\theta_i}\mathcal{R}(\widehat\theta_i)\right\rangle \le 4\sqrt{\frac{\iota^2}{2C_R}}\sqrt{8H\sqrt d\,\epsilon + c\cdot\frac{d}{m}\log\frac{nm}{\delta}}, \tag{59}
\]
where $\mathcal{R}(\widehat\theta_i) = V^{\mu}_{i,0}(s_0,\widehat\theta_i) - V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i)$. So, along the direction of $\widehat\theta^*_i - \widehat\theta_i$, the gradient of the difference of values at $\mu$ and $\mu_{\mathrm{ref}}$ is a good approximation of the active linear branch of the gap at $\widehat\theta^*_i$.
Now, in order to approximate the gradient at $\mu$ and $\mu_{\mathrm{ref}}$, we use the fact that
\[
\nabla_{\theta_i}V^{\mu}_{i,0}(s_0,\widehat\theta_i) = \sum_{h=0}^{H-1}(d^{\mu}_h)^{\top}\Phi,
\]
together with the fact that we already have access to $\epsilon$-corrupted features from $d^{\mu}$. Thus, we can define robust estimates of the above as
\[
\widetilde\nabla_{\theta_i}V^{\mu}_{i,0}(s_0,\widehat\theta_i) = \sum_{h=0}^{H-1}\mathrm{RobMean}\left(D^{\mu}_{h,\phi}\right),
\qquad\text{and}\qquad
\widetilde\nabla_{\theta_i}V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i) = \sum_{h=0}^{H-1}\mathrm{RobMean}\left(D^{\mu_{\mathrm{ref}}}_{h,\phi}\right).
\]
Corollary 2 gives us bounds $f(\epsilon)$ on the $\ell_2$ error:
\[
\left\|\widetilde\nabla_{\theta}V^{\mu}_{i,0}(s_0,\theta) - \nabla_{\theta}V^{\mu}_{i,0}(s_0,\theta)\right\| \le O(Hf(\epsilon)),
\qquad\text{and}\qquad
\left\|\widetilde\nabla_{\theta}V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\theta) - \nabla_{\theta}V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\theta)\right\| \le O(Hf(\epsilon)),
\]
where
\[
f(\epsilon) = \sqrt{\frac{d\log(\mathrm{poly}(d))}{m}} + \sqrt d\,\epsilon + \sqrt{\frac{d\log(1/\delta)}{m}}.
\]
Let us now define
\[
\widetilde\nabla_{\theta_i}\mathrm{Gap}_i\big(\pi^*,\widehat\theta_i\big) = \widetilde\nabla_{\theta_i}V^{\mu}_{i,0}(s_0,\widehat\theta_i) - \widetilde\nabla_{\theta_i}V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i).
\]
Note that we have
\[
\left\|\mathbb{E}\left[\widetilde\nabla_{\theta}\mathrm{Gap}_i\big(\pi^*,\widehat\theta_i\big)\right]\right\| \le 4H + O(Hf(\epsilon)),
\]
due to the guarantees of RobMean and the feature-norm bounds. Therefore, we are in the conditions of Lemma D.4. Recall that $\widehat\theta^*_i$ is the true maximizer of $\mathrm{Gap}_i(\pi^*,\theta_i)$ over $\Theta_{\mathrm{Unil}}(\widetilde\theta_i)$.
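The RobMean subroutine is not pinned down here; one simple, standard choice with the right qualitative behavior is a coordinate-wise winsorized mean (stronger dimension-free guarantees require more sophisticated estimators such as iterative filtering). The sketch below is purely illustrative, with hypothetical data and a trim level chosen slightly above the corruption fraction.

```python
import numpy as np

def rob_mean(X, trim):
    """Coordinate-wise winsorized mean: clamp each coordinate to its
    [trim, 1 - trim] empirical quantiles before averaging."""
    lo = np.quantile(X, trim, axis=0)
    hi = np.quantile(X, 1.0 - trim, axis=0)
    return np.clip(X, lo, hi).mean(axis=0)

rng = np.random.default_rng(3)
m, d, eps = 2000, 5, 0.05
X = rng.normal(size=(m, d))          # clean samples with true mean 0
X[: int(eps * m)] += 100.0           # adversary shifts an eps-fraction far away

plain_err = np.linalg.norm(X.mean(axis=0))
robust_err = np.linalg.norm(rob_mean(X, 2 * eps))
# The plain mean is dragged ~eps*100 per coordinate; winsorizing suppresses this.
assert robust_err < 1.0 < plain_err
```

Under $\epsilon$-contamination, the plain mean incurs error proportional to the outlier magnitude, while the winsorized mean's error stays bounded in terms of $\epsilon$ and the clean distribution's tails, which is the kind of $f(\epsilon)$-type guarantee assumed of RobMean above.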
Then, using the active branch $g^*_i$ selected above, we have
\begin{align*}
\mathrm{Gap}_i\big(\pi^*,\widehat\theta^*_i\big) - \frac{1}{T_1}\sum_{t=1}^{T_1}\mathrm{Gap}_i\big(\pi^*,\widehat\theta^{(t)}_i\big)
&\le \frac{1}{T_1}\sum_{t=1}^{T_1}\left\langle\widehat\theta^*_i - \widehat\theta^{(t)}_i,\, g^*_i\right\rangle \tag{60}\\
&= \frac{1}{T_1}\sum_{t=1}^{T_1}\left\langle\widehat\theta^*_i - \widehat\theta^{(t)}_i,\,\widetilde\nabla_{\theta}\mathrm{Gap}_i\big(\pi^*,\widehat\theta^{(t)}_i\big)\right\rangle + \frac{1}{T_1}\sum_{t=1}^{T_1}\left\langle\widehat\theta^*_i - \widehat\theta^{(t)}_i,\, g^*_i - \widetilde\nabla_{\theta}\mathrm{Gap}_i\big(\pi^*,\widehat\theta^{(t)}_i\big)\right\rangle \tag{61}\\
&\le \frac{\left\|\widehat\theta^*_i - \widehat\theta^{(1)}_i\right\|^2}{2\eta T_1} + \frac{\eta\left(4H + O(Hf(\epsilon))\right)^2}{2}
+ \frac{1}{T_1}\sum_{t=1}^{T_1}\left\langle\widehat\theta^*_i - \widehat\theta^{(t)}_i,\, g^*_i - \nabla_{\theta}V^{\mu}_{i,0}\big(s_0,\widehat\theta^{(t)}_i\big) + \nabla_{\theta}V^{\mu_{\mathrm{ref}}}_{i,0}\big(s_0,\widehat\theta^{(t)}_i\big)\right\rangle\\
&\qquad + \frac{1}{T_1}\sum_{t=1}^{T_1}\left\langle\widehat\theta^*_i - \widehat\theta^{(t)}_i,\, \nabla_{\theta}V^{\mu}_{i,0}\big(s_0,\widehat\theta^{(t)}_i\big) - \nabla_{\theta}V^{\mu_{\mathrm{ref}}}_{i,0}\big(s_0,\widehat\theta^{(t)}_i\big) - \widetilde\nabla_{\theta}\mathrm{Gap}_i\big(\pi^*,\widehat\theta^{(t)}_i\big)\right\rangle \tag{62}\\
&\le O\left(\frac{H^2 + f(\epsilon)}{\sqrt{T_1}}\right) + 4\sqrt{\frac{\iota^2}{2C_R}}\sqrt{8H\sqrt d\,\epsilon + c\cdot\frac{d}{m}\log\frac{nm}{\delta}}\\
&\qquad + \frac{1}{T_1}\sum_{t=1}^{T_1}\left\|\widehat\theta^*_i - \widehat\theta^{(t)}_i\right\|\left\|\nabla_{\theta}V^{\mu}_{i,0}\big(s_0,\widehat\theta^{(t)}_i\big) - \nabla_{\theta}V^{\mu_{\mathrm{ref}}}_{i,0}\big(s_0,\widehat\theta^{(t)}_i\big) - \widetilde\nabla_{\theta}\mathrm{Gap}_i\big(\pi^*,\widehat\theta^{(t)}_i\big)\right\| \tag{63}\\
&\le O\left(\frac{H^2 + f(\epsilon)}{\sqrt{T_1}}\right) + 4\sqrt{\frac{\iota^2}{2C_R}}\sqrt{8H\sqrt d\,\epsilon + c\cdot\frac{d}{m}\log\frac{nm}{\delta}} + 2H^2\sqrt d\, f(\epsilon)\\
&\le \widetilde O\left(\left(\frac{1}{\sqrt{C_R}}+\frac{1}{\sqrt{C_P}}\right)\left(H^{5/2}d^{3/4}\sqrt\epsilon + H^2\,\mathrm{poly}(d)\,\frac{1}{\sqrt m}\right) + \frac{H^2\sqrt{\mathrm{poly}(d)}}{\sqrt{T_1}}\left(\sqrt\epsilon + \frac{1}{\sqrt m}\right)\right),
\end{align*}
where Equation (60) follows from Equation (57); Equation (61) adds and subtracts the same term; Equation (62) follows from Lemma D.4; Equation (63) follows from Cauchy–Schwarz and Equation (59); and the last inequality follows from Corollary 2.

Theorem B.1. Let $\epsilon\in[0,1/2)$, $\lambda\ge\Omega(dH\log(m/\delta)/m)$, and $\delta>0$. Set $\Theta_{\mathrm{Unil}}(\cdot)$ as in Equation (5) and $\Gamma(s,a) = E(d,m,\delta,\epsilon)\cdot\|\phi(s,a)\|_{\Lambda_h^{-1}}$. Suppose Assumption 2 is satisfied and PGA is run for $T_1$ steps with learning rate $\eta = O(1/\sqrt{T_1})$. Then, there exist robust subroutines RobEst, TrimmedMLE, and RobMean such that, with probability at least $1-\delta$, the output $\widetilde\pi$ of Algorithm 4 with subroutines RobEst, TrimmedMLE, RobMean and RewardEst satisfies
\[
\mathrm{Gap}(\widetilde\pi) \le \widetilde O\left(\left(\frac{1}{\sqrt{C_R}}+\frac{1}{\sqrt{C_P}}+\frac{1}{\sqrt{T_1}}\right)\left(H^{5/2}nd^{3/4}\sqrt\epsilon + H^2 n\sqrt{\frac{\mathrm{poly}(d)}{m}}\right)\right).
\]

Proof. Define $\theta^* = (\theta^*_1,\dots,\theta^*_n)$ and let
\[
f(\epsilon) = \widetilde O\left(\left(\frac{1}{\sqrt{C_R}}+\frac{1}{\sqrt{C_P}}\right)\left(H^{5/2}d^{3/4}\sqrt\epsilon + H^2\,\mathrm{poly}(d)\,\frac{1}{\sqrt m}\right) + \frac{H^2\sqrt{\mathrm{poly}(d)}}{\sqrt{T_1}}\left(\sqrt\epsilon + \frac{1}{\sqrt m}\right)\right).
\]
Also define
\[
\Delta := \widetilde O\left(n\left(\frac{1}{\sqrt{C_R}}+\frac{1}{\sqrt{C_P}}\right)\left(H^{5/2}d^{3/4}\sqrt\epsilon + H^2\,\mathrm{poly}(d)\,\frac{1}{\sqrt m}\right)\right).
\]
Moreover, let $\widehat\theta^*$ denote the maximizer of the gap with respect to $\pi^*$ over the unilateral confidence set, i.e.,
\[
\widehat\theta^* \in \arg\max_{\theta\in\Theta_{\mathrm{Unil}}(\widetilde\theta_1)\times\cdots\times\Theta_{\mathrm{Unil}}(\widetilde\theta_n)}\mathrm{Gap}(\pi^*,\theta).
\]
We have
\begin{align*}
\mathrm{Gap}(\widetilde\pi,\theta^*) &\le \mathrm{Gap}(\pi^*,\theta^*) + \Delta \tag{64}\\
&\le \mathrm{Gap}\big(\pi^*,\widehat\theta^*\big) + \Delta \tag{65}\\
&\le \mathrm{Gap}\big(\pi^*,\widehat\theta\big) + n f(\epsilon) + \Delta \tag{66}\\
&\le \mathrm{Gap}\big(\widetilde\pi,\widehat\theta\big) + n f(\epsilon) + 2\Delta \tag{67}\\
&\le \widetilde{\mathrm{Gap}}\big(\widetilde\pi,\widehat\theta\big) + n f(\epsilon) + 2\Delta \tag{68}\\
&\le \widetilde{\mathrm{Gap}}\big(\pi^*,\widehat\theta\big) + n f(\epsilon) + 2\Delta \tag{69}\\
&= \sum_{i\in[n]}\left(\widehat V^{\dagger,\pi^*_{-i}}_{i,0}(s_0,\widehat\theta_i) - \widehat V^{\pi^*}_{i,0}(s_0,\widehat\theta_i)\right) + n f(\epsilon) + 2\Delta \tag{70}\\
&\le \sum_{i\in[n]}\left[\left(\widehat V^{\dagger,\pi^*_{-i}}_{i,0}(s_0,\widehat\theta_i) - \widehat V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i)\right) - \left(\widehat V^{\pi^*}_{i,0}(s_0,\widehat\theta_i) - \widehat V^{\mu_{\mathrm{ref}}}_{i,0}(s_0,\widehat\theta_i)\right)\right] + n f(\epsilon) + 2\Delta \tag{71}\\
&\le \widetilde O\left(n\left(\frac{1}{\sqrt{C_R}}+\frac{1}{\sqrt{C_P}}\right)\left(H^{5/2}d^{3/4}\sqrt\epsilon + H^2\,\mathrm{poly}(d)\,\frac{1}{\sqrt m}\right)\right) + n f(\epsilon) + 2\Delta \tag{72}\\
&\le \widetilde O\left(n\left(\frac{1}{\sqrt{C_R}}+\frac{1}{\sqrt{C_P}}\right)\left(H^{5/2}d^{3/4}\sqrt\epsilon + H^2\,\mathrm{poly}(d)\,\frac{1}{\sqrt m}\right) + \frac{H^2 n\sqrt{\mathrm{poly}(d)}}{\sqrt{T_1}}\left(\sqrt\epsilon + \frac{1}{\sqrt m}\right)\right).
\end{align*}
Equation (64) follows from Lemma B.5. Equation (65) follows from the definition of $\widehat\theta^*$. Equation (66) follows from Proposition B.1, applied coordinate-wise at $\pi^*$. Equation (67) follows again from Lemma B.5. Equation (68) follows from Lemma B.2. Equation (69) follows by design of Algorithm 4, since $\widetilde\pi$ minimizes the estimated gap at $\widehat\theta$. Equation (70) follows by definition of the estimated gap. In Equation (71) we add and subtract the same term. Equation (72) follows from Lemma B.5. The final inequality follows from the definitions of $f(\epsilon)$ and $\Delta$.

C Proof of Theorem 5.1

This section includes the full proof of Theorem 5.1. We begin by establishing regret guarantees for the Optimistic Hedge algorithm applied to our setting.

Lemma C.1.
Denote by $\widetilde\pi$ the joint policy returned by Optimistic Hedge run for $T_2$ rounds. For each $i\in[n]$, let
\begin{align*}
\mathrm{Reg}_{i,h,T_2} &= \max_{\pi^{\dagger}_{i,h}}\mathbb{E}_{a^{\dagger}_i\sim\pi^{\dagger}_{i,h}(\cdot\mid s),\, a'_i\sim\widetilde\pi_{i,h}(\cdot\mid s)}\mathbb{E}_{a_{-i}\sim\widetilde\pi_{-i,h}(\cdot\mid s)}\left[Q^{\pi^{\dagger}_i,\widetilde\pi_{-i}}_{i,h}(s,a^{\dagger}_i,a_{-i}) - Q^{\widetilde\pi}_{i,h}(s,a'_i,a_{-i})\right]\\
&\quad - \min_{\pi'_{i,h}}\max_{\pi^{\dagger}_{i,h}}\mathbb{E}_{a^{\dagger}_i\sim\pi^{\dagger}_{i,h}(\cdot\mid s),\, a'_i\sim\pi'_{i,h}(\cdot\mid s)}\mathbb{E}_{a_{-i}\sim\widetilde\pi_{-i,h}(\cdot\mid s)}\left[Q^{\pi^{\dagger}_i,\widetilde\pi_{-i}}_{i,h}(s,a^{\dagger}_i,a_{-i}) - Q^{\pi'_i,\widetilde\pi_{-i}}_{i,h}(s,a'_i,a_{-i})\right].
\end{align*}
For every $i\in[n]$ and $h\in[H-1]$, we have
\[
\sum_{i\in[n]}\mathrm{Reg}_{i,h,T_2} \le O\left(\frac{n^2 H\cdot\log|A|\cdot\log^4 T_2}{T_2}\right).
\]

Proof. Recall that the loss we use for Optimistic Hedge is defined as
\[
L^s_i(a^{\dagger},a') = \mathbb{E}_{a_{-i}\sim\widetilde\pi_{-i,h}(\cdot\mid s)}\left[Q^{\dagger,\widetilde\pi_{-i}}_{i,h}(s,a^{\dagger}_i,a_{-i}) - Q^{\widetilde\pi}_{i,h}(s,a',a_{-i})\right].
\]
Now, let $\pi^{\dagger,(t)}_i$ denote the iterates of player $i$ with respect to the max problem. Note that, for every agent $i$ and state $s$, after $T_2$ steps, we have
\begin{align*}
&\max_{\pi^{\dagger}_i}\sum_{t=1}^{T_2}\mathbb{E}_{a^{\dagger}\sim\pi^{\dagger}_i,\, a'\sim\widetilde\pi^{(t)}_i}\left[L^s_i(a^{\dagger},a')\right] - \min_{\pi_i}\max_{\pi^{\dagger}_i}\sum_{t=1}^{T_2}\mathbb{E}_{a^{\dagger}\sim\pi^{\dagger}_i,\, a'\sim\pi_i}\left[L^s_i(a^{\dagger},a')\right]\\
&= \underbrace{\max_{\pi^{\dagger}_i}\sum_{t=1}^{T_2}\mathbb{E}_{a^{\dagger}\sim\pi^{\dagger}_i,\, a'\sim\widetilde\pi^{(t)}_i}\left[L^s_i(a^{\dagger},a')\right] - \sum_{t=1}^{T_2}\mathbb{E}_{a^{\dagger}\sim\pi^{\dagger,(t)}_i,\, a'\sim\widetilde\pi^{(t)}_i}\left[L^s_i(a^{\dagger},a')\right]}_{\text{regret with respect to }\pi^{\dagger,(t)}_i}\\
&\quad + \underbrace{\sum_{t=1}^{T_2}\mathbb{E}_{a^{\dagger}\sim\pi^{\dagger,(t)}_i,\, a'\sim\widetilde\pi^{(t)}_i}\left[L^s_i(a^{\dagger},a')\right] - \min_{\pi_i}\sum_{t=1}^{T_2}\mathbb{E}_{a^{\dagger}\sim\pi^{\dagger,(t)}_i,\, a'\sim\pi_i}\left[L^s_i(a^{\dagger},a')\right]}_{\text{regret with respect to }\widetilde\pi^{(t)}_i}\\
&\quad + \underbrace{\min_{\pi_i}\sum_{t=1}^{T_2}\mathbb{E}_{a^{\dagger}\sim\pi^{\dagger,(t)}_i,\, a'\sim\pi_i}\left[L^s_i(a^{\dagger},a')\right] - \min_{\pi_i}\max_{\pi^{\dagger}_i}\sum_{t=1}^{T_2}\mathbb{E}_{a^{\dagger}\sim\pi^{\dagger}_i,\, a'\sim\pi_i}\left[L^s_i(a^{\dagger},a')\right]}_{\le 0}\\
&\le O\left(nH\log|A_i|\log^4 T_2\right),
\end{align*}
where for the inequality we have used Theorem D.2, applied to the regrets with respect to the max and min players, and the fact that $\min_x f(x,y)\le\min_x\max_y f(x,y)$, due to the monotonicity of the min operator.
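The regret decomposition above rests on Optimistic Hedge's fast-rate guarantee. As a purely illustrative aside (a two-player zero-sum matrix game, not the Markov-game losses $L^s_i$ used in the lemma, with all constants hypothetical), the update and the resulting small duality gap of the averaged play can be sketched as follows.

```python
import numpy as np

# Two-player zero-sum game; M is the row player's loss matrix (rock-paper-scissors).
M = np.array([[0.0, 1.0, -1.0], [-1.0, 0.0, 1.0], [1.0, -1.0, 0.0]])
T, eta = 2000, 0.1

def hedge_step(w, loss, prev_loss):
    # Optimistic Hedge: respond to the predicted loss 2*l_t - l_{t-1}.
    w = w * np.exp(-eta * (2.0 * loss - prev_loss))
    return w / w.sum()

x = np.ones(3) / 3                      # row strategy (minimizes x^T M y)
y = np.ones(3) / 3                      # column strategy (maximizes x^T M y)
lx_prev, ly_prev = np.zeros(3), np.zeros(3)
x_bar, y_bar = np.zeros(3), np.zeros(3)
for _ in range(T):
    lx, ly = M @ y, -M.T @ x            # simultaneous loss feedback
    x, y = hedge_step(x, lx, lx_prev), hedge_step(y, ly, ly_prev)
    lx_prev, ly_prev = lx, ly
    x_bar += x / T
    y_bar += y / T

# Duality gap of the averaged strategies; zero exactly at the equilibrium.
gap = np.max(M.T @ x_bar) - np.min(M @ y_bar)
assert 0.0 <= gap < 0.05
```

In self-play the prediction term makes the players' regrets cancel, so the total regret stays polylogarithmic in $T$ and the averaged strategies' gap decays much faster than the $O(1/\sqrt T)$ rate of plain Hedge.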
This implies that the empirical distribution of the sequence of policies up to time step $T_2$ — which equals the returned policy $\widetilde\pi$ — satisfies, for any player $i$ and state $s$,
\[
\max_{\pi^{\dagger}_i}\mathbb{E}_{a^{\dagger}\sim\pi^{\dagger}_i,\, a'\sim\widetilde\pi_i}\left[L^s_i(a^{\dagger},a')\right] - \min_{\pi_i}\max_{\pi^{\dagger}_i}\mathbb{E}_{a^{\dagger}\sim\pi^{\dagger}_i,\, a'\sim\pi_i}\left[L^s_i(a^{\dagger},a')\right] \le O\left(\frac{n\cdot\log|A|\cdot\log^4 T_2}{T_2}\right).
\]
In particular, we have
\begin{align*}
&\sum_{i\in[n]}\max_{\pi^{\dagger}_{i,h}}\mathbb{E}_{a^{\dagger}_i\sim\pi^{\dagger}_{i,h}(\cdot\mid s),\, a'_i\sim\widetilde\pi_{i,h}(\cdot\mid s)}\left[\mathbb{E}_{a_{-i}\sim\widetilde\pi_{-i,h}(\cdot\mid s)}\left[Q^{\dagger,\widetilde\pi_{-i}}_{i,h}(s,a^{\dagger}_i,a_{-i}) - Q^{\widetilde\pi}_{i,h}(s,a'_i,a_{-i})\right]\right]\\
&\quad - \min_{\pi'_h}\sum_{i\in[n]}\max_{\pi^{\dagger}_{i,h}}\mathbb{E}_{a^{\dagger}_i\sim\pi^{\dagger}_{i,h}(\cdot\mid s),\, a'_i\sim\pi'_{i,h}(\cdot\mid s)}\left[\mathbb{E}_{a_{-i}\sim\pi'_{-i,h}(\cdot\mid s)}\left[Q^{\dagger,\widetilde\pi_{-i}}_{i,h}(s,a^{\dagger}_i,a_{-i}) - Q^{\widetilde\pi}_{i,h}(s,a'_i,a_{-i})\right]\right]\\
&\le O\left(\frac{n^2 H\cdot\log|A|\cdot\log^4 T_2}{T_2}\right),
\end{align*}
where we have used the fact that the minimum of the sum is larger than the sum of the individual minima.

Finally, we are ready to state and prove Theorem 5.1. We restate it here for convenience.

Theorem C.1. Let $\epsilon\in[0,1/2)$, $\lambda\ge\Omega(dH\log(m/\delta)/m)$, and $\delta>0$. Set $\Theta_{\mathrm{Unil}}(\cdot)$ as in Equation (5) and $\Gamma(s,a) = E(d,m,\delta,\epsilon)\cdot\|\phi(s,a)\|_{\Lambda_h^{-1}}$. Suppose Assumption 2 is satisfied, PGA is run for $T_1$ steps with learning rate $\eta_1 = O(1/\sqrt{T_1})$, and OptimisticHedge is run for $T_2$ steps with learning rate $\eta_2 = O(1/(n\log^4 T_2))$. Then, there exist robust subroutines RobEst, TrimmedMLE, and RobMean such that, with probability at least $1-\delta$, the output $\widetilde\pi$ of Algorithm 5 with subroutines RobEst, TrimmedMLE, RobMean and OptimisticHedge satisfies
\[
\mathrm{Gap}(\widetilde\pi) \le \widetilde O\left(\left(\frac{1}{\sqrt{C_R}}+\frac{1}{\sqrt{C_P}}+\frac{1}{\sqrt{T_1}}\right)\left(H^{5/2}nd^{3/4}\sqrt\epsilon + \frac{H^2 n\sqrt{\mathrm{poly}(d)}}{\sqrt m}\right) + \frac{Hn^2}{T_2}\right).
\]
Proof.
Similar to the proof of Theorem 4.1, we can write
\begin{align*}
\mathrm{Gap}(\widetilde{\pi}, \theta^*) &\le \mathrm{Gap}\big(\widetilde{\pi}, \widehat{\theta}^*\big) \le \mathrm{Gap}\big(\widetilde{\pi}, \widehat{\theta}\big) + n \cdot f(\epsilon) \le \widetilde{\mathrm{Gap}}\big(\widetilde{\pi}, \widehat{\theta}\big) + n \cdot f(\epsilon) \\
&\le \min_{\pi} \widetilde{\mathrm{Gap}}\big(\pi, \widehat{\theta}\big) + O\!\left( \frac{n^2 H \cdot \log|A| \cdot \log^4 T_2}{T_2} \right) + n \cdot f(\epsilon) \tag{73} \\
&\le \widetilde{\mathrm{Gap}}\big(\pi^*, \widehat{\theta}\big) + O\!\left( \frac{n^2 H \cdot \log|A| \cdot \log^4 T_2}{T_2} \right) + n \cdot f(\epsilon) \\
&= \sum_{i \in [n]} \left[ V^{\dagger, \pi^*_{-i}}_{i,0}\big(s_0, \widehat{\theta}_i\big) - V^{\pi^*}_{i,0}\big(s_0, \widehat{\theta}_i\big) \right] + O\!\left( \frac{n^2 H \cdot \log|A| \cdot \log^4 T_2}{T_2} \right) + n \cdot f(\epsilon) \\
&\le \widetilde{O}\!\left( n \cdot \left( \frac{1}{\sqrt{C_R}} + \frac{1}{\sqrt{C_P}} + \frac{1}{\sqrt{T_1}} \right) \cdot \left( H^{5/2} d^{3/4} \sqrt{\epsilon} + H^2 \,\mathrm{poly}(d)\, \frac{1}{\sqrt{m}} \right) + \frac{n^2 H}{T_2} \right), \tag{74}
\end{align*}
where Equation (73) follows from Lemma C.1 and Equation (74) follows from Theorem 4.1.

D Technical Results

This section includes various miscellaneous technical results that are used throughout the proofs in the paper.

Lemma D.1 (Zanette et al. [2021]). Let $\{\phi_i\}_{i=1}^m \subset \mathbb{R}^d$ be i.i.d. samples from an underlying bounded distribution $\mu_{\mathrm{ref}}$, with $\|\phi_i\|_2 \le 1$ and covariance $\Sigma_{\mu_{\mathrm{ref}}}$. Define
\[
\Lambda = \sum_{i=1}^m \phi_i \phi_i^{\top} + \lambda I,
\]
for some $\lambda \ge \Omega(d \log(m/\delta))$. Then, with probability at least $1 - \delta$, we have
\[
\tfrac{1}{3} \left( m \Sigma_{\mu_{\mathrm{ref}}} + \lambda I \right) \preceq \Lambda \preceq \tfrac{5}{3} \left( m \Sigma_{\mu_{\mathrm{ref}}} + \lambda I \right).
\]

Next, we state a result that bounds the difference in log probabilities of parameters in a given space.

Lemma D.2 (Lemma 2 of [Zhan et al., 2023] for the linear setting). With probability at least $1 - \delta$, we have, for every agent $i$ and $\theta \in \Theta_{\mathrm{Unil}}(\widetilde{\theta}_i)$:
\[
\mathbb{E}_{\tau \sim d^{\mu},\, \tau' \sim d^{\mu_{\mathrm{ref}}}} \left[ \big\| P(\cdot|\tau, \tau', \theta^*_i) - P(\cdot|\tau, \tau', \theta) \big\|_1^2 \right] \le \frac{c}{m} \left( \sum_{j=1}^m \log \frac{P\big(\widetilde{o}_j \,\big|\, \widetilde{\tau}_j, \widetilde{\tau}'_j, \theta^*_i\big)}{P\big(\widetilde{o}_j \,\big|\, \widetilde{\tau}_j, \widetilde{\tau}'_j, \theta\big)} + \log\big( d \log(n/\delta) \big) \right),
\]
where $c > 0$ is an absolute constant.

Proof. This is an application of Proposition 1 of [Zhan et al., 2023] to the linear setting, which yields the result above after a union bound over all agents.

Next, we show that the inverse of the sigmoid link function is Lipschitz on a bounded domain.

Lemma D.3.
Let $-\infty < a < b < \infty$ be two real numbers and let $\sigma(x) = 1/(1 + \exp(-x))$ be defined on the domain $x \in [a, b]$. Then there exists a positive number $\iota < \infty$ such that the inverse $\sigma^{-1}$ of $\sigma$ is Lipschitz with constant $\iota$, that is,
\[
\sup_{p = \sigma(x) :\, x \in [a, b]} \left| \frac{\partial \sigma^{-1}(p)}{\partial p} \right| \le \iota.
\]
As a consequence, if the rewards are bounded, then the inverse of the preference link function is Lipschitz continuous for some $\iota > 0$.

Proof. First, note that the derivative of the inverse of the sigmoid can be written as
\[
\frac{\partial \sigma^{-1}(p)}{\partial p} = \frac{\partial}{\partial p} \log \frac{p}{1-p} = \frac{1}{p(1-p)}.
\]
Now, since $x \in [a, b]$ and $\sigma$ is continuous and increasing on $\mathbb{R}$, there exist $0 < a' \le b' < 1$ such that $p = \sigma(x) \in [a', b']$ for every $x \in [a, b]$. Moreover, since the function $p(1-p)$ is continuous and strictly positive on $[a', b']$, there exists $a'' > 0$ such that $p(1-p) \ge a''$ for all $p \in [a', b']$. Thus, setting $\iota = 1/a''$, we have
\[
\frac{1}{p(1-p)} \le \iota,
\]
for all $p = \sigma(x)$ with $x \in [a, b]$. This implies that the function $\sigma^{-1}$ is Lipschitz on the range of $\sigma$ restricted to $[a, b]$.

For the final statement of the result, note that the preference function uses differences in expected rewards that are individually bounded in $[-\sqrt{d}, \sqrt{d}]$. Thus, we have
\[
\sum_{h=0}^{H-1} \left( R(s_h, a_h) - R'(s_h, a_h) \right) \in \big[ -2H\sqrt{d},\, 2H\sqrt{d} \big].
\]
Hence, the sigmoid link function used for our preference model has a bounded domain, which implies that its inverse is Lipschitz continuous.

Corollary 1. The function $f(\theta) = \mathbb{E}_{\tau \sim d^{\mu_{\mathrm{ref}}}} \left[ \phi(\tau)^{\top} \theta \right]$ is $H$-Lipschitz and convex. Moreover, the set $\Theta_{\mathrm{Unil}}(\theta')$ is a convex set, for any $\theta'$ and $\lambda > 0$.

Proof. Note that we have
\[
\|\nabla_{\theta} f(\theta)\| = \big\| (d^{\mu_{\mathrm{ref}}})^{\top} \Phi \big\| \le \big\| d^{\mu_{\mathrm{ref}}} \big\|_1 \| \Phi \|_{\infty} \le \max_{\tau} \|\phi(\tau)\| \le H \max_{(s,a)} \|\phi(s, a)\|_2 \le H.
\]
Convexity follows from the direct observation that $f$ is linear in $\theta$, so that $\nabla^2_{\theta} f(\theta) = 0$.
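As a small numerical illustration of Lemma D.3 (not part of the formal development), the sketch below computes an explicit Lipschitz constant for $\sigma^{-1}$ on $[a, b]$: since $\partial \sigma^{-1}(p)/\partial p = 1/(p(1-p))$ and $p(1-p)$ is smallest at the endpoint of $[a, b]$ with the largest $|x|$, one may take $\iota = \max_{x \in \{a, b\}} 1/(\sigma(x)(1 - \sigma(x)))$. The interval endpoints below are chosen arbitrarily for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_inv(p):
    # the logit function: inverse of the sigmoid on (0, 1)
    return math.log(p / (1.0 - p))

def lipschitz_const(a, b):
    # sup of d/dp sigma^{-1}(p) = 1/(p(1-p)) over p in [sigma(a), sigma(b)];
    # p(1-p) is minimized at whichever endpoint of [a, b] has the largest |x|
    return max(1.0 / (sigmoid(x) * (1.0 - sigmoid(x))) for x in (a, b))

a, b = -4.0, 4.0
iota = lipschitz_const(a, b)

# empirical check of |sigma^{-1}(p) - sigma^{-1}(q)| <= iota * |p - q|
# on a grid of x-values in [a, b] (note sigmoid_inv(sigmoid(x)) == x)
xs = [a + k * (b - a) / 200 for k in range(201)]
ok = all(
    abs(x - y) <= iota * abs(sigmoid(x) - sigmoid(y)) + 1e-9
    for x in xs for y in xs
)
```

The check passes because, by the mean value theorem, $|x - y| = |p - q| / (c(1-c))$ for some $c$ between $\sigma(x)$ and $\sigma(y)$, and $1/(c(1-c)) \le \iota$ on this range.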
The convexity of $\Theta_{\mathrm{Unil}}(\theta')$ is shown in [Mandal et al., 2025].

Next, we provide an upper bound on the error of the estimate returned by the RobMean algorithm.

Theorem D.1 (Proposition 1.5 of [Diakonikolas et al., 2020]). Let $T$ be an $\epsilon$-corrupted set of $m$ samples from a distribution in $\mathbb{R}^d$ with mean $\rho$ and covariance $\Sigma$. Let $\epsilon'$ be of the order of $(\log(1/\delta)/m + \epsilon) \le c$, for a constant $c > 0$. Then any stability-based algorithm, on input $T$ and $\epsilon'$, efficiently computes $\widetilde{\rho}$ such that, with probability at least $1 - \delta$, we have
\[
\|\rho - \widetilde{\rho}\| = O\!\left( \sqrt{\frac{\mathrm{Tr}(\Sigma) \log\big( \mathrm{Tr}(\Sigma)/\|\Sigma\| \big)}{m}} + \sqrt{\|\Sigma\| \epsilon} + \sqrt{\frac{\|\Sigma\| \log(1/\delta)}{m}} \right).
\]

Corollary 2. The RobMean algorithm returns a gradient estimate that satisfies
\[
\left\| \widetilde{\nabla}_{\theta} V^{\mu_{\mathrm{ref}}}_{i,0}(s_0) - \nabla_{\theta} V^{\mu_{\mathrm{ref}}}_{i,0}(s_0) \right\| \le O\!\left( \sqrt{\frac{d \log(\mathrm{poly}(d))}{m}} + \sqrt{d\epsilon} + \sqrt{\frac{d \log(1/\delta)}{m}} \right).
\]

Proof. Note that, since $\|\phi(s, a)\| \le 1$ for all state-action tuples, we have $\|\Sigma\| \le d$ and $\mathrm{Tr}(\Sigma) \le d$; the claim then follows from Theorem D.1.

Next, we state an upper bound on the individual regret of each agent when playing Optimistic Hedge.

Theorem D.2 (Theorem 1.1 of [Daskalakis et al., 2021]). There are constants $C, C' > 1$ so that the following holds. Suppose a time horizon $T \in \mathbb{N}$ and a game $G$ with $n$ players and $|A_i|$ actions for each player $i \in [n]$ are given. Suppose all players play according to Optimistic Hedge with step size $\eta = 1/(Cn \log^4 T)$. Then, for any player $i \in [n]$, the regret of player $i$ satisfies
\[
\mathrm{Reg}_{i,T} \le O\!\left( n \cdot \log|A| \cdot \log^4 T \right).
\]

Lemma D.4 (Lemma E.6 of [Mandal et al., 2025]). Let $y_1 \in W$ and $\eta > 0$. Define the sequences $y_2, \ldots, y_{n+1}$ and $h_1, \ldots, h_n$ such that, for $k = 1, \ldots, n$,
\[
y_{k+1} = P_W\!\left( y_k - \eta \widehat{h}_k \right),
\]
where $\widehat{h}_k$ satisfies $\mathbb{E}\big[ \widehat{h}_k \,\big|\, \mathcal{F}_{k-1} \big] = h_k$ and $\mathbb{E}\big[ \|\widehat{h}_k\|^2 \,\big|\, \mathcal{F}_{k-1} \big] \le G^2$, and $\mathcal{F}_k$ are the $\sigma$-algebras on which the variables up to step $k$ are defined. Then, for any $y^* \in W$, we have
\[
\mathbb{E}\!\left[ \sum_{k=1}^n \langle y^* - y_k, h_k \rangle \right] \le \frac{\|y_1 - y^*\|^2}{2\eta} + \frac{\eta n G^2}{2}.
\]
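Theorem D.1 concerns stability-based estimators, which are beyond a short sketch; the stand-in below is a coordinate-wise trimmed mean, a much simpler robust estimator with weaker dimension dependence, shown only to illustrate the input/output contract RobMean is assumed to satisfy (an $\epsilon$-corrupted sample in, a mean estimate out). The function name and the toy data are illustrative, not from the paper.

```python
def trimmed_mean(samples, eps):
    """Coordinate-wise trimmed mean: drop the eps-tails in each coordinate.

    An illustrative stand-in for RobMean; the stability-based estimator of
    Theorem D.1 achieves stronger guarantees in high dimensions.
    """
    m = len(samples)
    d = len(samples[0])
    k = int(eps * m)  # number of points trimmed from each tail, per coordinate
    est = []
    for j in range(d):
        col = sorted(s[j] for s in samples)
        kept = col[k:m - k] if k > 0 else col
        est.append(sum(kept) / len(kept))
    return est

# clean 2-d samples clustered around (1.0, -2.0), plus a gross outlier
clean = [[1.0 + 0.01 * i, -2.0 - 0.01 * i] for i in range(-5, 5)]
outliers = [[1000.0, 1000.0]]
est = trimmed_mean(clean + outliers, eps=0.2)
```

With 20% trimming per tail, the outlier is discarded in both coordinates and the estimate lands near the true center $(1.0, -2.0)$ despite the corruption.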
Algorithm 6 Alternating Minimization (TrimmedMLE) for full sample corruption
Require: Corrupted data $D$; corruption parameter $\epsilon$; slackness parameter $\nu$.
1: Split $D$ into equally-sized $D_1$ and $D_2$, uniformly at random.
2: Use $D_1$ to build a robust estimate $\widehat{\Sigma}$ of the covariance $\Sigma_{\mu, \mu_{\mathrm{ref}}}$ [Diakonikolas et al., 2025].
3: Whiten the covariates using $\widehat{\Sigma}$, i.e., form $\widetilde{D} = \big\{ \widehat{\Sigma}^{-1/2} (\phi(\tau) - \phi(\tau')) \,\big|\, (\tau, \tau') \in D_2 \big\}$.
4: Let $\widehat{D} \leftarrow \mathrm{Filtering}(\widetilde{D}, \epsilon)$ (Algorithm 4 of [Dong et al., 2019]).
5: Define $L_{\theta}(\tau, \tau', o) = \log \sigma\big( o \cdot \theta^{\top} \widehat{\Sigma}^{1/2} (\phi(\tau) - \phi(\tau')) \big)$, for $\tau \in \widehat{D}$.
6: Set $\widetilde{\theta}_0 = 0$.
7: for $t = 1, 2, \ldots$ do
8:   $\widetilde{S}_t = \arg\max_{S \subset \widehat{D} : |S| = (1-\epsilon)m} \sum_{(\tau, \tau', o) \in S} L_{\widetilde{\theta}_t}(\tau, \tau', o)$.
9:   $\widetilde{\theta}_{t+1} = \arg\max_{\theta : \|\theta\| \le H\sqrt{d}} \sum_{(\tau, \tau', o) \in \widetilde{S}_t} L_{\theta}(\tau, \tau', o)$.
10:  if $\sum_{(\tau, \tau', o) \in \widetilde{S}_t} L_{\widetilde{\theta}_{t+1}}(\tau, \tau', o) \le \sum_{(\tau, \tau', o) \in \widetilde{S}_t} L_{\widetilde{\theta}_t}(\tau, \tau', o) + \nu$ then
11:    Return $\widetilde{\theta}_{t+1}$.
12:  end if
13: end for

Algorithm 7 Robust Estimation of Value Functions
Require: Dataset $D$, policy $\pi$, agent $i$, reward functions $R_i$ and $\bar{R}_i$, bonus function $\Gamma(\cdot, \cdot)$.
1: Initialize $V^{\pi}_{i,H}(\cdot) = V^{\dagger, \pi_{-i}}_{i,H}(\cdot) = 0$, for all agents $i \in [n]$.
2: for $h = H-1, \ldots, 0$ do
3:   $\omega^{\pi}_{i,h} = \mathrm{RobEst}\big( \phi(s_h, a_h),\, R_{i,h}(s_h, a_h) + V^{\pi}_{i,h+1}(s_{h+1}) \big)$.
4:   $\omega^{\dagger, \pi_{-i}}_{i,h} = \mathrm{RobEst}\big( \phi(s_h, a_h),\, \bar{R}_{i,h}(s_h, a_h) + V^{\dagger, \pi_{-i}}_{i,h+1}(s_{h+1}) \big)$.
5:   $Q^{\pi}_{i,h}(\cdot, \cdot) = \mathrm{Clip}_{[-(H-h)\sqrt{d},\, (H-h)\sqrt{d}]}\big( \phi(\cdot, \cdot)^{\top} \omega^{\pi}_{i,h} - \Gamma(\cdot, \cdot) \big)$.
6:   $Q^{\dagger, \pi_{-i}}_{i,h}(\cdot, \cdot) = \mathrm{Clip}_{[-(H-h)\sqrt{d},\, (H-h)\sqrt{d}]}\big( \phi(\cdot, \cdot)^{\top} \omega^{\dagger, \pi_{-i}}_{i,h} + \Gamma(\cdot, \cdot) \big)$.
7:   $V^{\pi}_{i,h}(s) = \mathbb{E}_{a \sim \pi_h}\big[ Q^{\pi}_{i,h}(s, a) \big]$.
8:   $V^{\dagger, \pi_{-i}}_{i,h}(s) = \max_{a_i \in A_i} \mathbb{E}_{a_{-i} \sim \pi_{-i}}\big[ Q^{\dagger, \pi_{-i}}_{i,h}(s, a) \big]$.
9: end for
10: return Value functions $V^{\dagger, \pi_{-i}}_{i,h}(\cdot)$ and $V^{\pi}_{i,h}(\cdot)$, for all $h \in [H-1]$.
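The following is a minimal one-dimensional sketch of the alternating loop of Algorithm 6 (steps 7 to 13), assuming the whitening and filtering preprocessing of steps 2 to 4 has already been applied. The inner argmax of step 9 is approximated here by projected gradient ascent on the (concave) logistic log-likelihood; the function names (`fit_theta`, `trimmed_mle`), the synthetic data, and the norm bound are all illustrative assumptions, not the paper's construction.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_lik(theta, x, o):
    # L_theta(x, o) = log sigma(o * theta * x), in one dimension
    return math.log(sigmoid(o * theta * x))

def fit_theta(theta, batch, steps=300, lr=0.1, bound=5.0):
    # approximate inner argmax (step 9) by projected gradient ascent;
    # d/dtheta log sigma(o * theta * x) = o * x * sigma(-o * theta * x)
    for _ in range(steps):
        grad = sum(o * x * sigmoid(-o * theta * x) for x, o in batch) / len(batch)
        theta = max(-bound, min(bound, theta + lr * grad))
    return theta

def trimmed_mle(data, eps, nu=1e-6, max_iter=50):
    m = len(data)
    keep = m - int(eps * m)
    theta, sel = 0.0, data
    for _ in range(max_iter):
        # step 8: keep the (1 - eps) * m samples with highest likelihood
        sel = sorted(data, key=lambda p: log_lik(theta, *p), reverse=True)[:keep]
        new_theta = fit_theta(theta, sel)  # step 9
        old = sum(log_lik(theta, *p) for p in sel)
        new = sum(log_lik(new_theta, *p) for p in sel)
        if new <= old + nu:  # step 10: improvement below slackness, stop
            return new_theta, sel
        theta = new_theta
    return theta, sel

# clean samples labelled consistently with a positive parameter,
# plus two adversarially flipped labels at an extreme covariate
clean = [(x / 2.0, 1 if x > 0 else -1) for x in range(-10, 11) if x != 0]
corrupt = [(3.0, -1), (3.0, -1)]
theta_hat, selected = trimmed_mle(clean + corrupt, eps=0.1)
```

Once the iterate becomes positive, the flipped samples have the lowest likelihood and are trimmed by step 8, so the final fit is driven by the clean data.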
E Additional Algorithm Pseudocodes

Algorithm 8 Optimistic Hedge for $n$ min-max Games (OptimisticHedge)
Require: Loss functions $L_1(\cdot), \ldots, L_n(\cdot)$; step size $\nu$; number of steps $T$.
1: Initialize policies $\pi^{(0)}_i, \pi^{\dagger,(0)}_i$ uniformly at random, for every $i \in \{1, \ldots, n\}$.
2: for $t = 0, \ldots, T-1$ do
3:   for $i \in \{1, \ldots, n\}$ do
4:     Let $u^{(t)}_i(a^{\dagger}) = \mathbb{E}_{a' \sim \pi^{(t)}_i}\big[ L_i(a^{\dagger}, a') \big]$, for every individual action $a^{\dagger}$ of player $i$.
5:     Let $\ell^{(t)}_i(a') = \mathbb{E}_{a^{\dagger} \sim \pi^{\dagger,(t)}_i}\big[ L_i(a^{\dagger}, a') \big]$, for every individual action $a'$ of player $i$.
6:     for $a_i \in A_i$ do
7:       Update for the max player:
\[
\pi^{\dagger,(t+1)}_i(a^{\dagger}) = \frac{\pi^{\dagger,(t)}_i(a^{\dagger}) \cdot \exp\big( \nu \cdot u^{(t)}_i(a^{\dagger}) \big)}{\sum_{a^{\ddagger}} \pi^{\dagger,(t)}_i(a^{\ddagger}) \cdot \exp\big( \nu \cdot u^{(t)}_i(a^{\ddagger}) \big)}.
\]
8:       Update for the min player:
\[
\pi^{(t+1)}_i(a') = \frac{\pi^{(t)}_i(a') \cdot \exp\big( -\nu \cdot \ell^{(t)}_i(a') \big)}{\sum_{a''} \pi^{(t)}_i(a'') \cdot \exp\big( -\nu \cdot \ell^{(t)}_i(a'') \big)}.
\]
9:     end for
10:   end for
11: end for
12: return Average policy profile over $T$ rounds, $\frac{1}{T} \sum_{t=1}^{T} \pi^{(t)}$.
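To make the exponential-weights updates of steps 7 and 8 concrete, here is a minimal sketch for a single max/min pair on a $2 \times 2$ zero-sum game. The payoff matrix and its equilibrium are assumed purely for illustration, and the sketch mirrors the updates exactly as written in Algorithm 8 (a plain multiplicative-weights step; the additional prediction term that makes Hedge "optimistic" in Daskalakis et al. [2021] is not shown in the pseudocode and is omitted here too).

```python
import math

# Zero-sum payoff matrix for the max (row) player; the unique equilibrium
# is p* = (1/2, 1/2) for the rows and q* = (1/4, 3/4) for the columns,
# with game value 1.5.
A = [[3.0, 1.0], [0.0, 2.0]]
T = 40000
nu = 0.01  # step size, as in Algorithm 8

p = [0.5, 0.5]   # max player's mixed strategy over rows
q = [0.5, 0.5]   # min player's mixed strategy over columns
p_avg = [0.0, 0.0]
q_avg = [0.0, 0.0]

for _ in range(T):
    # steps 4-5: expected payoffs against the opponent's current policy
    u = [sum(A[a][b] * q[b] for b in range(2)) for a in range(2)]
    l = [sum(p[a] * A[a][b] for a in range(2)) for b in range(2)]
    # steps 7-8: exponential-weights updates (gain for max, loss for min)
    pw = [p[a] * math.exp(nu * u[a]) for a in range(2)]
    qw = [q[b] * math.exp(-nu * l[b]) for b in range(2)]
    zp, zq = sum(pw), sum(qw)
    p = [w / zp for w in pw]
    q = [w / zq for w in qw]
    # step 12: running average of the policy profile over the T rounds
    for a in range(2):
        p_avg[a] += p[a] / T
        q_avg[a] += q[a] / T
```

As the standard no-regret analysis suggests, the averaged strategies approach the equilibrium of the game even though the per-round iterates may oscillate.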