Actor-Dual-Critic Dynamics for Zero-sum and Identical-Interest Stochastic Games

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

We propose a novel independent and payoff-based learning framework for stochastic games that is model-free, game-agnostic, and gradient-free. The learning dynamics follow a best-response-type actor-critic architecture, where agents update their strategies (actors) using feedback from two distinct critics: a fast critic that intuitively responds to observed payoffs under limited information, and a slow critic that deliberatively approximates the solution to the underlying dynamic programming problem. Crucially, the learning process relies on non-equilibrium adaptation through smoothed best responses to observed payoffs. We establish convergence to (approximate) equilibria in two-agent zero-sum and multi-agent identical-interest stochastic games over an infinite horizon. This provides one of the first payoff-based and fully decentralized learning algorithms with theoretical guarantees in both settings. Empirical results further validate the robustness and effectiveness of the proposed approach across both classes of games.


💡 Research Summary

This paper introduces a novel independent, payoff‑based learning framework for stochastic (Markov) games that operates without any model of the environment, opponent actions, or gradient information. The core of the approach is an actor‑dual‑critic architecture in which each agent maintains three objects: a fast critic estimating local Q‑values, a slow critic estimating state values, and an actor that updates its policy via an ε‑smoothed best response to the fast critic.

The fast critic updates the Q‑value for the state‑action pair actually taken at time k‑1 using the observed immediate reward, the next state, and the current estimate of the value function. The update is normalized by the probability with which the action was selected, ensuring that all actions receive, in expectation, the same learning rate despite asynchronous updates. The slow critic updates the value function as a moving average of the expected Q‑values under the current policy, using a stepsize that decays faster than the actor’s stepsize. This two‑timescale separation mirrors the “System 1 / System 2” dual‑process theory from psychology: the fast critic reacts quickly to new observations, while the slow critic integrates information over a longer horizon, thereby mitigating non‑stationarity caused by other agents’ learning.
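In the tabular setting, the two critic updates described above can be sketched as follows. This is a minimal illustration, not the paper's notation: the function names, the stepsize clipping, and the discount factor are our own assumptions.

```python
def fast_critic_update(Q, V, s, a, reward, s_next, pi, beta, gamma=0.95):
    """Fast critic: update Q[s][a] for the state-action pair actually taken,
    using the observed immediate reward and the slow critic's value of the
    next state. The stepsize is normalized by pi[s][a], the probability with
    which action a was selected, so that all actions receive the same
    learning rate in expectation despite asynchronous updates."""
    target = reward + gamma * V[s_next]
    eff_step = min(beta / pi[s][a], 1.0)  # clipped for numerical stability
    Q[s][a] += eff_step * (target - Q[s][a])

def slow_critic_update(V, Q, pi, tau):
    """Slow critic: move V toward the expected Q-value under the current
    policy, a moving average whose stepsize tau decays faster than the
    actor's (the two-timescale separation described above)."""
    for s in range(len(V)):
        expected_q = sum(pi[s][a] * Q[s][a] for a in range(len(Q[s])))
        V[s] += tau * (expected_q - V[s])
```

Note the division by `pi[s][a]`: an action chosen with probability 0.5 gets twice the raw stepsize of an action chosen with probability 1, which is exactly what equalizes the expected learning rate across actions.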

The actor does not rely on policy gradients. Instead, at each state it computes the set of actions that maximize the current Q‑estimate, applies an exploration kernel that mixes the pure best response with uniform noise (parameter ε), and moves the policy a small amount (stepsize α_k) toward this ε‑smoothed best response. Consequently, the update target takes the form (1‑ε)·μ + ε·Uniform, where μ is the pure best response, which guarantees persistent exploration.
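A single actor step at one state might look like the sketch below; the function name and the uniform tie-breaking over maximizing actions are illustrative assumptions.

```python
def actor_update(pi_s, Q_s, alpha, eps):
    """One gradient-free actor step at a single state: find the set of
    Q-maximizing actions, mix the pure best response with uniform noise
    (exploration parameter eps), and move the policy by stepsize alpha
    toward this eps-smoothed best response."""
    n = len(Q_s)
    best = max(Q_s)
    argmax = [a for a in range(n) if Q_s[a] == best]
    # target = (1 - eps) * (pure best response) + eps * Uniform
    smoothed = [(1 - eps) * (1.0 / len(argmax) if a in argmax else 0.0)
                + eps / n for a in range(n)]
    # convex step toward the smoothed best response keeps pi_s a distribution
    return [(1 - alpha) * pi_s[a] + alpha * smoothed[a] for a in range(n)]
```

Because the update is a convex combination of two probability distributions, the policy remains a valid distribution at every step, and each action retains probability at least ε/n in the limit.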

The authors prove convergence to ε‑Nash equilibria in two important classes of stochastic games:

  1. Two‑agent zero‑sum games – where a unique game value exists. By choosing stepsizes such that the fast critic converges quickly, the actor tracks a smoothed best response, and the slow critic provides a stable value estimate, the joint process converges almost surely to an ε‑approximate Nash equilibrium.

  2. Multi‑agent identical‑interest games – where all agents share the same payoff function. Here the value iteration operator is not a contraction, so the authors employ recent quasi‑monotonicity techniques to show that the coupled updates still converge to an ε‑Nash equilibrium, despite the presence of multiple equilibria.

The convergence error scales linearly with the exploration rate ε, and the proofs rely on stochastic approximation theory, two‑timescale analysis, and a careful decomposition of the actor update into a deterministic drift plus a martingale noise term.

Algorithm 1 details the complete procedure: (i) observe the current state, (ii) update the fast Q‑estimate for the previously visited state‑action pair, (iii) adjust the policy via the ε‑best response, (iv) update the slow value estimate, and (v) sample the next action. All updates are performed online, without episodic resets or centralized coordination.
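Putting the pieces together, the online loop of Algorithm 1 might be sketched as follows. The environment interface, the stepsize schedules, and all names are hypothetical; the paper's precise stepsize conditions are not reproduced here.

```python
import random

def run_agent(env_step, n_states, n_actions, T, gamma=0.95, eps=0.1):
    """Online actor-dual-critic loop for one agent. env_step is a
    hypothetical interface: env_step(a) -> (reward, next_state).
    Updates are fully online: no episodic resets, no coordination."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    V = [0.0] * n_states
    pi = [[1.0 / n_actions] * n_actions for _ in range(n_states)]
    s, a = 0, random.randrange(n_actions)
    for k in range(1, T + 1):
        # illustrative two-timescale schedules: fast critic > actor > slow critic
        beta, alpha, tau = k ** -0.55, k ** -0.7, k ** -0.9
        reward, s_next = env_step(a)                       # (i) observe
        # (ii) fast critic: normalized update at the pair just visited
        target = reward + gamma * V[s_next]
        Q[s][a] += min(beta / pi[s][a], 1.0) * (target - Q[s][a])
        # (iii) actor: step toward the eps-smoothed best response
        best = max(Q[s])
        am = [x for x in range(n_actions) if Q[s][x] == best]
        for x in range(n_actions):
            br = (1 - eps) * (1.0 / len(am) if x in am else 0.0) + eps / n_actions
            pi[s][x] = (1 - alpha) * pi[s][x] + alpha * br
        # (iv) slow critic: moving average of expected Q under pi
        V[s] += tau * (sum(pi[s][x] * Q[s][x] for x in range(n_actions)) - V[s])
        # (v) sample the next action from the updated policy
        s = s_next
        a = random.choices(range(n_actions), weights=pi[s])[0]
    return pi, Q, V
```

In a multi-agent deployment each agent would run this loop independently, observing only its own payoff and the shared state, which is what makes the scheme payoff-based and fully decentralized.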

Empirical evaluation on grid‑world zero‑sum games, cooperative navigation tasks, and mixed competition‑cooperation scenarios demonstrates that the proposed method converges faster and more robustly than prior independent Q‑learning, independent policy‑gradient, and recent actor‑critic extensions for stochastic games. The experiments confirm that the algorithm remains stable under fully decentralized operation and that the ε‑exploration can be set small without sacrificing convergence to near‑optimal joint policies.

In comparison with related work, the paper distinguishes itself by (a) being completely model‑free and payoff‑only, (b) using a gradient‑free best‑response actor, (c) maintaining continuous slow‑critic updates rather than batch‑wise value iteration, and (d) providing rigorous convergence guarantees for both zero‑sum and identical‑interest stochastic games—settings where most existing guarantees either require strong assumptions (e.g., irreducibility, coordinated sampling) or apply only to potential games.

The authors conclude that the actor‑dual‑critic dynamics offer a principled, psychologically inspired mechanism for decentralized learning in complex multi‑agent environments. Future directions include extending the analysis to general‑sum games, incorporating function approximation (e.g., deep neural networks) for large state spaces, and studying robustness under communication delays or partial observability.

