Bandit Allocational Instability


When multi-armed bandit (MAB) algorithms allocate pulls among competing arms, the resulting allocation can exhibit huge variation. This is particularly harmful in modern applications such as learning-enhanced platform operations and post-bandit statistical inference. Thus motivated, we introduce a new performance metric of MAB algorithms termed allocation variability, which is the largest (over arms) standard deviation of an arm's number of pulls. We establish a fundamental trade-off between allocation variability and regret, the canonical performance metric of reward maximization. In particular, for any algorithm, the worst-case regret $R_T$ and worst-case allocation variability $S_T$ must satisfy $R_T \cdot S_T = \Omega(T^{3/2})$ as $T \rightarrow \infty$, as long as $R_T = o(T)$. This indicates that any minimax regret-optimal algorithm must incur worst-case allocation variability $\Theta(T)$, the largest possible scale; while any algorithm with sublinear worst-case regret must necessarily incur $S_T = \omega(\sqrt{T})$. We further show that this lower bound is essentially tight, and that any point on the Pareto frontier $R_T \cdot S_T = \tilde{\Theta}(T^{3/2})$ can be achieved by a simple tunable algorithm UCB-f, a generalization of the classic UCB1. Finally, we discuss implications for platform operations and for statistical inference, when bandit algorithms are used. As a byproduct of our result, we resolve an open question of Praharaj and Khamaru (2025).
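The abstract describes UCB-f only as a tunable generalization of UCB1 and does not give its index formula, so the following is a hypothetical sketch: an exploration function `f(t)` scales the confidence bonus, with `f(t) = 2 ln t` recovering the classic UCB1 index. The paper's actual UCB-f definition may differ.

```python
import math
import random

def ucb_f(means, T, f=lambda t: 2 * math.log(t)):
    """Hypothetical UCB-f sketch on a Gaussian bandit: at each round,
    pull the arm maximizing empirical mean + sqrt(f(t) / n_a).
    Choosing f(t) = 2 ln t recovers the classic UCB1 index.
    Returns the final pull counts per arm."""
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, T + 1):
        if t <= k:  # initialization: pull each arm once
            arm = t - 1
        else:
            arm = max(range(k),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(f(t) / counts[a]))
        reward = random.gauss(means[arm], 1.0)  # unit-variance Gaussian rewards
        counts[arm] += 1
        sums[arm] += reward
    return counts

random.seed(0)
print(ucb_f([0.5, 0.0], 2000))  # the better arm receives most pulls
```

Tuning `f` larger forces more exploration, which intuitively spreads pulls more evenly across runs (lower allocation variability) at the cost of higher regret; this is the knob behind the Pareto frontier discussed above.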


💡 Research Summary

The paper introduces a novel performance metric for stochastic multi‑armed bandit (MAB) algorithms called allocation variability (denoted $S_T$), defined as the maximum standard deviation across arms of the number of pulls each arm receives by time $T$. While the classic metric is expected (pseudo‑)regret $R_T$, the authors argue that in many modern applications—such as content‑sharing platforms, e‑commerce marketplaces, and post‑bandit statistical inference—high variability in how traffic is allocated to arms can be detrimental, leading to unfairness, increased churn, or invalid inference.

The core theoretical contribution is a fundamental trade‑off between regret and allocation variability. For Gaussian bandit instances, any algorithm that learns (i.e., achieves $R_T = o(T)$) must satisfy
$$R_T \cdot S_T = \Omega\!\left(T^{3/2}\right) \quad \text{as } T \rightarrow \infty.$$

