Individual Regret in Cooperative Stochastic Multi-Armed Bandits


We study the regret in stochastic Multi-Armed Bandits (MAB) with multiple agents that communicate over an arbitrary connected communication graph. We analyze a variant of the Cooperative Successive Elimination algorithm, COOP-SE, and show an individual regret bound of $O(R/m + A^2 + A \sqrt{\log T})$ and a nearly matching lower bound. Here $A$ is the number of actions, $T$ the time horizon, $m$ the number of agents, and $R = \sum_{\Delta_i > 0}\log(T)/\Delta_i$ is the optimal single-agent regret, where $\Delta_i$ is the sub-optimality gap of action $i$. Our work is the first to show an individual regret bound in cooperative stochastic MAB that is independent of the graph's diameter. When considering communication networks there are additional considerations beyond regret, such as message size and number of communication rounds. First, we show that our regret bound holds even if we restrict the messages to be of logarithmic size. Second, for a logarithmic number of communication rounds, we obtain a regret bound of $O(R/m + A \log T)$.


💡 Research Summary

The paper addresses the problem of stochastic multi‑armed bandits (MAB) in a cooperative multi‑agent setting where agents are placed on an arbitrary connected communication graph. Each agent repeatedly selects an arm, receives a stochastic reward, and exchanges information with its immediate neighbors. The central performance metric is the individual pseudo‑regret of each agent, i.e., the expected cumulative loss relative to the optimal arm.
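This metric can be written in the standard pseudo-regret form (the notation $a_t^v$ for agent $v$'s arm at round $t$ and $\mu^*$ for the optimal mean is ours, not necessarily the paper's):

\[
\mathrm{Reg}_T^{v} \;=\; T\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t^{v}}\right]
\;=\; \mathbb{E}\!\left[\sum_{t=1}^{T} \Delta_{a_t^{v}}\right],
\]

where $\Delta_a = \mu^* - \mu_a$ is the sub-optimality gap of arm $a$, matching the gaps $\Delta_i$ used in the regret bound above.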

Algorithmic contribution. The authors propose Cooperative Successive Elimination (Coop‑SE), a decentralized variant of the classic Successive Elimination (SE) algorithm. In Coop‑SE every agent maintains a local set of active arms. At each round the agent (1) processes elimination messages received from neighbors, (2) aggregates all observed rewards (its own and those forwarded by neighbors), (3) computes empirical means and confidence bounds (UCB/LCB) for each active arm, (4) eliminates any arm whose UCB is strictly below the LCB of some other arm, and (5) selects the next arm in a round‑robin fashion from the remaining active set. The eliminated arms and newly observed rewards are broadcast to all neighbors, allowing information to propagate through the network without any central coordinator.
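The five per-round steps above can be sketched as follows. This is a minimal single-agent view under stated assumptions: the dictionary-based state, the message tuples, and the confidence radius $\sqrt{2\log T / n}$ are illustrative choices, not the paper's exact specification.

```python
import math

def coop_se_round(agent, inbox, t, T):
    """One round of a Coop-SE-style update for a single agent (sketch).

    agent  -- dict with "active" (set of arm ids), "sums" and "counts"
              (per-arm reward statistics, including forwarded observations)
    inbox  -- messages from neighbors: ("eliminate", arm) or
              ("reward", arm, value); the message format is an assumption
    t, T   -- current round and time horizon
    """
    # (1)+(2) merge neighbors' eliminations and forwarded observations
    for msg in inbox:
        if msg[0] == "eliminate":
            agent["active"].discard(msg[1])
        else:
            _, arm, value = msg
            agent["sums"][arm] += value
            agent["counts"][arm] += 1

    # (3) empirical means with UCB/LCB confidence bounds
    def bounds(a):
        n = max(agent["counts"][a], 1)
        mean = agent["sums"][a] / n
        radius = math.sqrt(2 * math.log(T) / n)  # assumed radius
        return mean + radius, mean - radius      # (UCB, LCB)

    ub = {a: bounds(a) for a in agent["active"]}

    # (4) eliminate any arm whose UCB is below some other arm's LCB
    best_lcb = max(lcb for _, lcb in ub.values())
    eliminated = {a for a, (ucb, _) in ub.items() if ucb < best_lcb}
    agent["active"] -= eliminated

    # (5) round-robin over the surviving active arms
    arms = sorted(agent["active"])
    chosen = arms[t % len(arms)]
    return chosen, eliminated
```

In the full algorithm each agent would then broadcast `eliminated` and its new observation to all neighbors, which is how information spreads without a central coordinator.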

Regret analysis. The main theoretical result is an individual regret upper bound of

\[
O\!\left(\frac{R}{m} + A^{2} + A\sqrt{\log T}\right),
\qquad
R = \sum_{\Delta_i > 0} \frac{\log T}{\Delta_i},
\]

which holds for every individual agent and, notably, does not depend on the diameter of the communication graph.
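To see how the leading $R/m$ term scales, one can compute $R = \sum_{\Delta_i > 0}\log(T)/\Delta_i$ directly for a given gap profile. The helper below is illustrative; the hidden constant `c` is an assumption, as the bound is stated only up to $O(\cdot)$.

```python
import math

def single_agent_R(gaps, T):
    """R = sum over suboptimal arms of log(T)/Delta_i (from the abstract)."""
    return sum(math.log(T) / d for d in gaps if d > 0)

def individual_regret_bound(gaps, T, m, A, c=1.0):
    """Illustrative evaluation of O(R/m + A^2 + A*sqrt(log T));
    the constant c is assumed, not taken from the paper."""
    R = single_agent_R(gaps, T)
    return c * (R / m + A ** 2 + A * math.sqrt(math.log(T)))
```

For example, with gaps $(0.1, 0.2, 0.5)$ and $T = 10^6$, $R \approx 234.9$; increasing the number of agents $m$ shrinks only the $R/m$ term, while the $A^2 + A\sqrt{\log T}$ additive terms are unaffected by cooperation.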

