Bandit Social Learning with Exploration Episodes


We study a stylized social learning dynamics where self-interested agents collectively follow a simple multi-armed bandit protocol. Each agent controls an “episode”: a short sequence of consecutive decisions. Motivating applications include users repeatedly interacting with an AI, or repeatedly shopping at a marketplace. While agents are incentivized to explore within their respective episodes, we show that the aggregate exploration fails: e.g., its Bayesian regret grows linearly over time. In fact, such failure is a (very) typical case, not just a worst-case scenario. This conclusion persists even if an agent’s per-episode utility is some fixed function of the per-round outcomes: e.g., $\min$ or $\max$, not just the sum. Thus, externally driven exploration is needed even when some amount of exploration happens organically.


💡 Research Summary

The paper introduces a novel framework called Episodic Bandit Social Learning (EpiBSL), which models a setting where self‑interested agents arrive sequentially, each controlling a short “episode” of m ≥ 2 consecutive bandit rounds. The underlying bandit problem has two stochastic arms with unknown Bernoulli reward means drawn from independent Beta priors, plus a “skip” arm that yields zero reward at no cost. In each episode an agent observes the full history, forms a posterior over the arm means, and selects a deterministic policy that maximizes its Bayesian‑expected utility.
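The agent's decision rule described above can be sketched in a few lines. This is an illustrative simplification, not the paper's construction: `BetaPosterior` and `greedy_choice` are our names, and a real agent in the model optimizes over full m-round policies rather than acting myopically round by round.

```python
from dataclasses import dataclass

@dataclass
class BetaPosterior:
    """Beta(alpha, beta) posterior over a Bernoulli arm's unknown mean."""
    alpha: float = 1.0
    beta: float = 1.0

    def mean(self) -> float:
        # Posterior mean of a Beta(alpha, beta) distribution.
        return self.alpha / (self.alpha + self.beta)

    def update(self, reward: int) -> None:
        # Conjugate update after observing one binary reward.
        if reward:
            self.alpha += 1
        else:
            self.beta += 1

def greedy_choice(posteriors, c_expl: float):
    """Myopic choice: pull the arm with the highest posterior mean,
    or return None (the zero-reward 'skip' arm) if even the best arm's
    posterior mean does not cover the exploration cost."""
    best = max(range(len(posteriors)), key=lambda i: posteriors[i].mean())
    return best if posteriors[best].mean() > c_expl else None
```

With independent Beta(1, 1) priors, two successes on arm 0 and one failure on arm 1 already make arm 0 the greedy choice unless the exploration cost is prohibitively high.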

Utility is defined as a function f of the binary reward vector observed during the episode, minus a small exploration cost c_expl paid each time a non‑skip arm is pulled. The function f is required only to be coordinate‑wise non‑decreasing, non‑constant, and to satisfy f(0,…,0)=0; typical choices are the sum, the maximum, or the minimum of the episode rewards. The authors assume f is symmetric (permutations of coordinates do not change the value), and they later extend some results to non‑symmetric f when m=2.
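The per-episode utility described above can be made concrete with a small sketch. The function name `episode_utility` and the convention of marking skipped rounds with `None` are our assumptions; note that `sum`, `max`, and `min` all satisfy the paper's conditions (coordinate-wise non-decreasing, non-constant, f(0, …, 0) = 0, symmetric).

```python
def episode_utility(rewards, f, c_expl):
    """Utility of one episode: f applied to the realized reward vector,
    minus c_expl per non-skip pull. 'None' marks a skipped round;
    the skip arm yields reward 0 and incurs no exploration cost."""
    pulls = [r for r in rewards if r is not None]
    vec = [0 if r is None else r for r in rewards]
    return f(vec) - c_expl * len(pulls)

# Typical choices of f from the paper, all meeting its conditions:
f_sum = sum  # total reward
f_max = max  # "at least one success"
f_min = min  # "every round must succeed"
```

For example, with rewards (1, 0) and one skipped round, the sum-utility is 1 minus two exploration charges, while an all-skip episode always has utility exactly 0.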

The central question is whether the presence of intra‑episode exploration can prevent the classic “learning failure” observed in pure‑exploitation dynamics (the greedy algorithm). The authors show that it does not. They define a failure event FAIL(c,N) that requires (i) both arm means to lie in (c, 1‑c), (ii) the better arm to be at least c higher than the other, and (iii) the better arm to be pulled at most N times overall. They prove that for any fixed c and N, there exists a positive‑probability set of problem instances where FAIL(c,N) occurs.
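The three conditions defining FAIL(c, N) translate directly into a predicate. This is a minimal sketch under our own naming; it checks a realized instance and pull count against the event's definition, nothing more.

```python
def fail_event(mu_better, mu_worse, pulls_better, c, N):
    """Check the FAIL(c, N) event from the summary:
    (i) both arm means lie in the open interval (c, 1 - c),
    (ii) the better arm exceeds the other by at least c,
    (iii) the better arm is pulled at most N times overall."""
    in_interior = c < mu_better < 1 - c and c < mu_worse < 1 - c
    gap_large = mu_better - mu_worse >= c
    under_explored = pulls_better <= N
    return in_interior and gap_large and under_explored
```

The paper's negative result says that for any fixed c and N this event has positive probability under the prior, so no choice of constants rescues the dynamics.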

From FAIL they derive a stronger event StrongFail(c,N) in which at most N episodes ever “consider” the better arm (i.e., assign it positive probability under their chosen policy). When StrongFail holds, the cumulative Bayesian regret after T episodes, defined as BReg(T) = E[mT·max(μ₁, μ₂) − Σ over all mT rounds of the realized reward], grows linearly in T: all but at most N·m rounds pull the worse arm or skip, and each such round loses at least c in expectation.
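The linear-regret consequence admits a one-line lower bound. This is our own back-of-the-envelope sketch, not the paper's proof: it assumes every round not spent on the better arm loses at least the gap (true here, since skipping forgoes the better arm's mean, which exceeds the gap).

```python
def regret_lower_bound_under_fail(T, m, gap, N):
    """Crude regret lower bound when the better arm is pulled at most
    N times over T episodes of m rounds: each of the remaining rounds
    loses at least 'gap', so regret scales linearly with T."""
    total_rounds = T * m
    return max(0, total_rounds - N) * gap
```

Doubling the horizon (roughly) doubles the bound, which is the linear-regret failure mode the abstract refers to; in contrast, a well-explored bandit would accrue regret only sublinearly in T.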

