The Bounds of Algorithmic Collusion: $Q$-learning, Gradient Learning, and the Folk Theorem


We explore the behaviour emerging from learning agents repeatedly interacting strategically for a wide range of learning dynamics, including $Q$-learning, projected gradient, replicator and log-barrier dynamics. Going beyond the better understood classes of potential games and zero-sum games, we consider the setting of a general repeated game with finite recall under different forms of monitoring. We obtain a Folk Theorem-style result and characterise the set of payoff vectors that can be obtained by these dynamics, discovering a wide range of possibilities for the emergence of algorithmic collusion. Achieving this requires a novel technical approach, which, to the best of our knowledge, yields the first convergence result for multi-agent $Q$-learning algorithms in repeated games.


💡 Research Summary

The paper “The Bounds of Algorithmic Collusion: Q‑learning, Gradient Learning, and the Folk Theorem” investigates the long‑run outcomes that can emerge when multiple artificial‑intelligence agents repeatedly interact in a strategic environment, focusing on a broad class of learning dynamics: Q‑learning, projected gradient learning, replicator dynamics, and log‑barrier dynamics. The authors move beyond the well‑studied subclasses of potential games and zero‑sum games, and instead consider a general repeated game with finite recall (ℓ‑recall) under both perfect and imperfect monitoring structures. Their central contribution is a Folk‑Theorem‑style characterization of the set of payoff vectors that can be realized by these learning processes, together with the first convergence proof for multi‑agent Q‑learning in such repeated games.

Model and Setting
A stage game G = (N, A, (R_i)_{i∈N}) with finitely many players and actions is repeated infinitely with discount factor δ ∈ (0,1). After each stage, each player receives a private signal drawn from a joint distribution q(a) that depends on the action profile a. This formulation captures both public monitoring (signals reveal the action profile) and imperfect monitoring (signals are noisy or partially observed). Players are restricted to strategies that depend only on the most recent ℓ periods of their own action‑signal histories, reflecting the memory constraints of reinforcement‑learning agents. The set of ℓ‑recall mixed strategies for player i is denoted Π_i^ℓ = Δ(A_i^{Ĥ_i^ℓ}).
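To make the ℓ‑recall restriction concrete, here is a minimal sketch of how a player's state could be maintained: only the last ℓ (own action, private signal) pairs are kept, and the strategy conditions on that truncated history alone. The class and method names (`RecallHistory`, `observe`, `state`) are illustrative assumptions, not from the paper.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical sketch of a player's ell-recall action-signal history.
@dataclass
class RecallHistory:
    ell: int                                      # recall length in periods
    pairs: deque = field(default_factory=deque)   # (own action, private signal)

    def observe(self, action, signal):
        """Record this period's (action, signal) pair, dropping the oldest
        pair once more than ell periods are stored."""
        self.pairs.append((action, signal))
        if len(self.pairs) > self.ell:
            self.pairs.popleft()

    def state(self):
        """The state an ell-recall strategy conditions on: a tuple of at
        most ell past (action, signal) pairs."""
        return tuple(self.pairs)

h = RecallHistory(ell=2)
h.observe("C", "high")
h.observe("D", "low")
h.observe("C", "high")
print(h.state())  # only the most recent 2 periods survive
```

A strategy in Π_i^ℓ then maps each such state to a mixed action, which is exactly the memory constraint the paper imposes on the learning agents.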

Learning Algorithms

  1. Q‑learning: Each agent maintains a Q‑table Q(s,a), where the state s is the current ℓ‑recall history. When the agent observes the transition (s, a, r, s′), it updates
    Q_{t+1}(s,a) = Q_t(s,a) + γ_t [r + δ max_{a′} Q_t(s′,a′) − Q_t(s,a)],
    where γ_t is the learning rate and δ the discount factor of the repeated game.
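The tabular Q‑learning update described above can be sketched as follows, with the ℓ‑recall history serving as the state s. This is a minimal illustration, not the paper's implementation; the function name and the numeric values for γ_t and δ are assumptions.

```python
# One-step tabular Q-learning update:
#   Q(s,a) <- Q(s,a) + gamma_t * (r + delta * max_a' Q(s',a') - Q(s,a))
# where delta is the discount factor and gamma_t the learning rate.
def q_update(Q, s, a, r, s_next, actions, gamma_t, delta):
    # Unvisited (state, action) pairs default to a Q-value of 0.
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + gamma_t * (r + delta * best_next - old)
    return Q[(s, a)]

Q = {}                     # the Q-table, keyed by (state, action)
actions = ["C", "D"]       # illustrative action set
q_update(Q, s="h0", a="C", r=1.0, s_next="h1",
         actions=actions, gamma_t=0.1, delta=0.9)
print(Q[("h0", "C")])  # 0.1 * (1.0 + 0.9 * 0 - 0) = 0.1
```

In the paper's setting, s and s′ would be the agent's ℓ‑recall action‑signal histories before and after the stage, so the table has one row per recallable history rather than per full game history.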
