Bayesian Online Model Selection

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

Online model selection in Bayesian bandits raises a fundamental exploration challenge: When an environment instance is sampled from a prior distribution, how can we design an adaptive strategy that explores multiple bandit learners and competes with the best one in hindsight? We address this problem by introducing a new Bayesian algorithm for online model selection in stochastic bandits. We prove an oracle-style guarantee of $O\left( d^* M \sqrt{T} + \sqrt{MT} \right)$ on the Bayesian regret, where $M$ is the number of base learners, $d^*$ is the regret coefficient of the optimal base learner, and $T$ is the time horizon. We also validate our method empirically across a range of stochastic bandit settings, demonstrating performance that is competitive with the best base learner. Additionally, we study the effect of sharing data among base learners and its role in mitigating prior mis-specification.


💡 Research Summary

This paper tackles the problem of online model selection in Bayesian stochastic bandits, where an environment is drawn once from a known prior and then remains fixed for a horizon of T rounds. The learner has access to a finite collection of M base bandit algorithms (base learners), each potentially well‑suited to different types of environments (e.g., sparse linear, low‑dimensional generalized linear). The meta‑learner must decide at each round which base learner to query, let that learner choose an action, observe the reward, and then update both the base learner and its own belief about the environment. The goal is to achieve Bayesian regret that is competitive with an oracle that, after seeing the realized environment, would commit to the single best base learner for that instance.
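The interaction protocol described above can be sketched in code. Everything below is illustrative rather than the paper's method: the base learners are stand-in epsilon-greedy agents, the meta-rule is a placeholder round-robin (the paper uses a posterior-based selection rule), and all names are invented for this sketch. The loop does show the key structural points: one base learner acts per round, and the resulting observation is shared with every base learner.

```python
import random

class EpsGreedyLearner:
    """Stand-in base learner: epsilon-greedy over K arms (illustrative only)."""
    def __init__(self, n_arms, eps):
        self.eps = eps
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms  # running empirical mean reward per arm

    def choose(self):
        # Explore uniformly with probability eps (or before any data exists).
        if random.random() < self.eps or all(c == 0 for c in self.counts):
            return random.randrange(len(self.means))
        return max(range(len(self.means)), key=lambda a: self.means[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

def run_protocol(true_means, learners, horizon, rng):
    """Meta-loop: pick a base learner, let it act, share the observation."""
    total = 0.0
    for t in range(horizon):
        i = t % len(learners)          # placeholder meta-rule (round-robin)
        arm = learners[i].choose()
        reward = true_means[arm] + rng.gauss(0, 0.1)
        total += reward
        for learner in learners:       # data sharing among base learners
            learner.update(arm, reward)
    return total

random.seed(0)
true_means = [0.2, 0.5, 0.8]
learners = [EpsGreedyLearner(3, 0.1), EpsGreedyLearner(3, 0.3)]
cum_reward = run_protocol(true_means, learners, 2000, random)
```

Because every base learner sees every (action, reward) pair, even a base learner that is rarely queried keeps an up-to-date model; this is the data-sharing mechanism the paper later studies for mitigating prior mis-specification.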

Key contributions

  1. Bayesian Online Model Selection (B‑MS) algorithm – The authors propose a novel meta‑learning procedure that maintains a global Bayesian posterior over the reward means of all actions, updated with every (action, reward) pair regardless of which base learner generated it. At each round the algorithm draws a sample $\tilde\mu_t$ from the current posterior, computes the sampled optimal mean $\tilde\mu_t^{\star} = \max_a \tilde\mu_t(a)$, and for each base learner $i$ evaluates a balancing potential that governs which base learner is queried.

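The posterior-sampling step in item 1 can be sketched as follows, under assumptions not stated in the summary: a conjugate Normal–Normal model with known noise variance, and a balancing potential that is a purely hypothetical placeholder (the sampled optimality gap of each base learner's proposed action), since the paper's exact potential is not reproduced above. All function and variable names are invented for this sketch.

```python
import math
import random

def gaussian_posterior_sample(sum_r, counts, prior_mean, prior_var, noise_var, rng):
    """Draw one sample mu_tilde of the mean-reward vector from per-arm
    Gaussian posteriors (conjugate Normal prior, Normal rewards with known
    noise variance). Statistics are pooled over ALL (action, reward) pairs,
    regardless of which base learner generated them.
    """
    sample = []
    for s, n in zip(sum_r, counts):
        precision = 1.0 / prior_var + n / noise_var
        post_var = 1.0 / precision
        post_mean = post_var * (prior_mean / prior_var + s / noise_var)
        sample.append(rng.gauss(post_mean, math.sqrt(post_var)))
    return sample

rng = random.Random(0)
sum_r = [20.0, 80.0]      # cumulative reward per arm
counts = [100, 100]       # pulls per arm, pooled across base learners
mu_tilde = gaussian_posterior_sample(sum_r, counts, 0.0, 1.0, 0.25, rng)
mu_star = max(mu_tilde)   # sampled optimal mean

# Hypothetical balancing potential (NOT the paper's formula): the sampled
# optimality gap of the action each base learner would propose this round.
proposed = [0, 1]         # action proposed by each of the M = 2 base learners
phi = [mu_star - mu_tilde[a] for a in proposed]
chosen = min(range(len(phi)), key=phi.__getitem__)  # query lowest-potential learner
```

With enough pooled observations the posterior concentrates near the empirical means, so the sampled optimum tracks the truly best action; the choice of potential is the part this sketch deliberately leaves as a placeholder.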