Bayesian Inference of Contextual Bandit Policies via Empirical Likelihood
Policy inference plays an essential role in the contextual bandit problem. In this paper, we use empirical likelihood to develop a Bayesian inference method for the joint analysis of multiple contextual bandit policies in finite-sample regimes. The proposed method is robust to small sample sizes and provides accurate uncertainty quantification for policy value evaluation. In addition, it allows flexible inference for policy comparisons with full uncertainty quantification. We demonstrate its effectiveness through Monte Carlo simulations and an application to an adolescent body mass index data set.
💡 Research Summary
The paper addresses a fundamental challenge in contextual bandits: how to infer the value of one or more policies when only off‑policy data are available and the sample size may be modest. Traditional off‑policy estimators—direct regression, importance sampling (IS), self‑normalized IS (SNIS), and doubly robust (DR)—each suffer from bias, high variance, or reliance on correctly specified reward models. Moreover, frequentist confidence intervals based on empirical likelihood (EL) are only asymptotically valid; with small or medium samples they are poorly calibrated, and they do not naturally extend to comparisons among several correlated policies.
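The IS and SNIS estimators mentioned above can be sketched in a few lines. This is a minimal illustration on hypothetical synthetic logged data (the logging policy, target policy, and reward model here are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged bandit data: actions from a uniform logging policy,
# Bernoulli rewards whose mean depends on the action. (Contexts are omitted
# because the toy policies below do not use them.)
n, n_actions = 500, 3
a = rng.integers(0, n_actions, size=n)          # logged actions
r = rng.binomial(1, 0.3 + 0.1 * a)              # rewards in {0, 1}
logging_prob = np.full(n, 1.0 / n_actions)      # p(a_i | x_i)

# Target policy to evaluate: always play action 2 (illustrative choice).
target_prob = (a == 2).astype(float)            # pi(a_i | x_i)

w = target_prob / logging_prob                  # importance weights

v_is = np.mean(w * r)               # IS: unbiased but can have high variance
v_snis = np.sum(w * r) / np.sum(w)  # SNIS: slightly biased, lower variance
```

For this deterministic target policy, SNIS reduces to the average reward over the logged rounds where the target action was actually played, which makes the bias/variance trade-off against plain IS easy to see.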
To overcome these limitations, the authors propose a Bayesian inference framework that treats the EL as a pseudo‑likelihood. They first formulate a joint EL for a vector of policy values (v = (v_1,\dots,v_\ell)) using the estimating equations (v_k = \mathbb{E}_{x\sim D,\, a\sim p_x}\big[\tfrac{\pi_k(a\mid x)}{p_x(a)}\, r\big]) for (k = 1,\dots,\ell), and then sample from the resulting pseudo‑posterior, which yields joint uncertainty quantification for the policy values and for comparisons among them.
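For a single policy value, the pseudo-likelihood idea can be sketched as follows (an assumed setup, not the paper's implementation): given precomputed weighted rewards (w_i r_i), the profile EL of a candidate value (v) follows from the moment condition (\mathbb{E}[w r - v] = 0) via its Lagrange dual, and a random-walk Metropolis sampler targets the pseudo-posterior under a flat prior on ([0, 3]) (an illustrative choice):

```python
import numpy as np
from scipy.optimize import brentq

def log_el(v, wr):
    """Profile empirical log-likelihood ratio for the moment E[w*r - v] = 0."""
    g = wr - v
    if g.max() <= 0 or g.min() >= 0:   # v outside the convex hull: EL is zero
        return -np.inf
    # Solve the dual: find the Lagrange multiplier with sum g/(1 + lam*g) = 0
    lo = -1.0 / g.max() + 1e-10
    hi = -1.0 / g.min() - 1e-10
    lam = brentq(lambda l: np.sum(g / (1.0 + l * g)), lo, hi)
    return -np.sum(np.log1p(lam * g))

# Hypothetical weighted rewards w_i * r_i (toy data, not from the paper).
rng = np.random.default_rng(1)
wr = rng.binomial(1, 0.5, size=300) * 3.0

# Random-walk Metropolis on the pseudo-posterior p(v) ∝ EL(v) * prior(v),
# with a flat prior on [0, 3].
v, samples = float(np.mean(wr)), []
cur = log_el(v, wr)
for _ in range(2000):
    prop = v + 0.05 * rng.normal()
    lp = log_el(prop, wr) if 0.0 <= prop <= 3.0 else -np.inf
    if np.log(rng.uniform()) < lp - cur:
        v, cur = prop, lp
    samples.append(v)
```

Because the EL assigns zero likelihood outside the convex hull of the observed (w_i r_i), the pseudo-posterior is automatically confined to plausible values even at small sample sizes; the joint, multi-policy version in the paper extends the same construction to a vector of moment conditions.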