Representative Action Selection for Large Action Space Bandit Families
We study the problem of selecting a subset from a large action space shared by a family of bandits, with the goal of nearly matching the performance attainable with the full action space. In many natural settings, while the nominal set of actions may be large, the rewards of different actions are significantly correlated. We propose an algorithm that substantially reduces the action space when such correlations are present, without needing to know the correlation structure a priori. We provide theoretical guarantees on the algorithm's performance and demonstrate its practical effectiveness through empirical comparisons with Thompson Sampling and Upper Confidence Bound methods.
💡 Research Summary
The paper introduces a novel problem called “representative action selection” for families of multi‑armed bandits that share a common, potentially huge action set A_full. Instead of solving each bandit instance separately, the goal is to find a small subset A ⊂ A_full that works well for almost all bandit instances drawn from a distribution P. The performance metric is the expected regret defined as the gap between the best possible reward using the full action set and the best reward achievable with the subset A, averaged over the distribution of bandit instances.
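This metric can be written out explicitly. The formalization below follows the verbal description above (the notation Regret(A) is ours; μ_a(θ) denotes the expected reward of action a under instance θ, as in the summary):

```latex
\mathrm{Regret}(A) \;=\; \mathbb{E}_{\theta \sim P}\!\left[\,\max_{a \in A_{\mathrm{full}}} \mu_a(\theta) \;-\; \max_{a \in A} \mu_a(\theta)\right]
```

That is, for each sampled instance θ the subset A is charged the gap between the best full-space reward and the best reward it can still reach, and the charge is averaged over P.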
To address this, the authors propose a simple yet powerful algorithm (Algorithm 1, the ε‑net algorithm). The algorithm repeatedly samples a bandit instance θ from P (the distribution is unknown but can be sampled), solves the instance to obtain its optimal action a*(θ) (using any oracle, e.g., Thompson Sampling or UCB), and adds this action to a growing set A. After K repetitions, the set A consists of K optimal actions from K randomly drawn instances. The key insight is that these sampled optimal actions form an ε‑net of the original action space under the metric induced by the reward process, even though the algorithm never explicitly computes distances or clusters.
The theoretical analysis assumes that the reward expectations {μ_a(θ)} form a Gaussian process. By representing each action a as a vector in ℝⁿ and letting μ_a(θ)=⟨a,θ⟩ with θ∼N(0,I), the L₂ distance between reward functions coincides with the Euclidean distance between vectors. Under this representation, the authors define two notions of ε‑nets: a geometric ε‑net (every action is within ε of some representative) and a measure‑theoretic ε‑net (every cluster with probability mass larger than ε contains at least one representative). Lemma 3.3 shows that with K = O((1/ε)·log(1/ε)) random samples, the output set A is a measure‑theoretic ε‑net with high probability for any fixed partition of A_full.
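The claimed coincidence of distances is a one-line calculation: for θ ∼ N(0, I), the second moment of the linear functional ⟨v, θ⟩ is ‖v‖², so

```latex
\mathbb{E}_{\theta \sim N(0, I)}\big[(\mu_a(\theta) - \mu_b(\theta))^2\big]
\;=\; \mathbb{E}\big[\langle a - b, \theta\rangle^2\big]
\;=\; \|a - b\|^2 .
```

That is, the L₂ distance between the reward functions μ_a and μ_b equals the Euclidean distance between the corresponding action vectors.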
The main regret bound (Theorem 4.5) relates the expected regret to the covering number N(A_full, ε) of the action space: roughly, the expected regret of the selected subset decays with ε, with the number of samples K required governed by the covering number at scale ε.