Achieving Logarithmic Regret in KL-Regularized Zero-Sum Markov Games
Reverse Kullback-Leibler (KL) divergence-based regularization with respect to a fixed reference policy is widely used in modern reinforcement learning to preserve the desired traits of the reference policy and sometimes to promote exploration (with a uniform reference policy, this is known as entropy regularization). Beyond serving as a mere anchor, the reference policy can also be interpreted as encoding prior knowledge about good actions in the environment. In the context of alignment, recent game-theoretic approaches have leveraged KL regularization with pretrained language models as reference policies, achieving notable empirical success in self-play methods. Despite these advances, the theoretical benefits of KL regularization in game-theoretic settings remain poorly understood. In this work, we develop and analyze algorithms that provably achieve improved sample efficiency under KL regularization. We study both two-player zero-sum matrix games and Markov games: for matrix games, we propose OMG, an algorithm based on best response sampling with optimistic bonuses, and extend this idea to Markov games through the algorithm SOMG, which also uses best response sampling and a novel concept of superoptimistic bonuses. Both algorithms achieve logarithmic regret in $T$ that scales inversely with the KL regularization strength $\beta$, in addition to the traditional $\widetilde{\mathcal{O}}(\sqrt{T})$ regret without the $\beta^{-1}$ dependence.
💡 Research Summary
The paper investigates the theoretical benefits of reverse Kullback‑Leibler (KL) regularization in two‑player zero‑sum games, both in the static matrix setting and in finite‑horizon Markov games. While KL regularization (or entropy regularization, when the reference policy is uniform) is widely used in modern reinforcement learning to preserve desirable traits of a reference policy and to encourage exploration, its impact on sample efficiency in game‑theoretic environments has remained largely unexplored. The authors address this gap by designing two algorithms—OMG (Optimistic Matrix Game) for matrix games and SOMG (Super‑Optimistic Markov Game) for Markov games—that exploit the closed‑form Gibbs‑type optimal response induced by KL regularization.
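Concretely, in standard notation (assumed here rather than taken from the paper's exact symbols)—payoff matrix $A$, mixed strategies $x, y$, reference policies $\mu, \nu$, and regularization strength $\beta$—the KL‑regularized zero‑sum matrix game objective takes the form

```latex
\max_{x \in \Delta(\mathcal{A})} \; \min_{y \in \Delta(\mathcal{B})} \;
  x^{\top} A y \;-\; \beta\,\mathrm{KL}(x \,\|\, \mu) \;+\; \beta\,\mathrm{KL}(y \,\|\, \nu)
```

and the max‑player's best response to a fixed opponent strategy $y$ is a Gibbs tilt of the reference policy, $x^{\star}(a) \propto \mu(a)\,\exp\!\bigl((A y)_a / \beta\bigr)$, which is the closed form the algorithms sample from.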
Key technical ideas
- Closed‑form best response: Under KL regularization, the best response to a fixed opponent strategy is a Gibbs distribution whose logits depend linearly on the payoffs and inversely on the regularization strength β. This allows the learner to sample best‑response actions directly without solving a costly optimization problem.
- Optimistic bonuses: For matrix games, OMG adds β‑scaled optimistic bonuses to each row and column estimate, yielding a high‑probability regret bound of O(β⁻¹ d² log²(T/δ)) where d is the feature dimension and T the number of rounds. The algorithm also retains the classical O(d√T log(T/δ)) regret that is independent of β.
- Super‑optimistic bonuses: Extending to Markov games, SOMG introduces a novel “super‑optimistic” bonus that exceeds the standard optimism term. By inflating the Q‑function estimates in a β‑dependent way, the algorithm can treat each stage as a regularized matrix game and achieve a logarithmic‑in‑T regret of O(β⁻¹ d³ H⁷ log²(dT/δ)), where H is the horizon length. A complementary √T‑type bound of O(d^{3/2} H³ √T) is also proved.
- Sample complexity: Translating regret to ε‑Nash equilibrium (NE) learning, the authors show that both OMG and SOMG require only O(β⁻¹ d²/ε) and O(β⁻¹ d³ H⁷/ε) samples respectively, i.e., linear dependence on 1/ε and inverse dependence on the regularization strength. This improves over prior works that achieve O(1/ε²) or lack any β‑dependent acceleration.
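The closed‑form best response above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the toy payoffs, and the uniform reference policy are all assumptions for the example, and the optimistic bonus would be added to the payoff estimates before this step.

```python
import numpy as np

def gibbs_best_response(payoffs, ref_policy, beta):
    """KL-regularized best response in closed form (a Gibbs distribution).

    Maximizes  p . payoffs - beta * KL(p || ref_policy)  over the simplex;
    the maximizer tilts the reference policy by exp(payoffs / beta).
    """
    logits = np.log(ref_policy) + payoffs / beta
    logits -= logits.max()  # shift for numerical stability; leaves result unchanged
    p = np.exp(logits)
    return p / p.sum()

# Toy 3-action example with a uniform reference policy (assumed for illustration).
ref = np.ones(3) / 3
payoffs = np.array([1.0, 0.5, 0.0])

for beta in (0.1, 1.0, 10.0):
    p = gibbs_best_response(payoffs, ref, beta)
    # Small beta -> nearly greedy on the payoffs; large beta -> close to ref.
    print(beta, np.round(p, 3))
```

The β‑dependence is visible directly: as β shrinks, the sampled response concentrates on the highest‑payoff action, while a large β keeps it anchored near the reference policy.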
Theoretical contributions
- The paper provides the first logarithmic‑regret guarantees for KL‑regularized zero‑sum games under bandit feedback, matching recent single‑agent results but extending them to competitive multi‑agent settings.
- It introduces the concept of super‑optimism, a new tool for handling the additional complexity of state transitions while preserving the benefits of KL regularization.
- The analysis is carried out under linear function approximation (feature vectors), with extensions to more general function classes discussed in the appendix.
Empirical validation
A modest experimental suite on synthetic payoff matrices and small‑scale Markov games demonstrates that OMG and SOMG converge faster than baseline UCB‑type or K‑Learning algorithms, and that the empirical regret curves align with the predicted logarithmic behavior when β is moderate to large.
Limitations and future directions
The results rely on a coverage assumption: the reference policy must assign non‑negligible probability to all actions, which may be restrictive for real‑world language‑model alignment where the reference policy is highly concentrated. Extending the framework to general‑sum or multi‑player games, relaxing the coverage condition, and scaling the methods to large language models are identified as promising avenues for future work.
Overall significance
By rigorously showing that KL regularization can convert the usual √T regret in zero‑sum games into a β‑dependent logarithmic regret, the paper bridges a crucial gap between practical RLHF/LLM alignment techniques and their theoretical underpinnings. The proposed algorithms and analytical tools provide a solid foundation for designing sample‑efficient, regularized self‑play systems in high‑dimensional, multi‑step environments.