Efficient Simple Regret Algorithms for Stochastic Contextual Bandits
We study stochastic contextual logistic bandits under the simple regret objective. While simple regret guarantees have been established for the linear case, no such results were previously known for the logistic setting. Building on ideas from contextual linear bandits and self-concordant analysis, we propose the first algorithm that achieves simple regret $\tilde{\mathcal{O}}(d/\sqrt{T})$. Notably, the leading term of our regret bound is free of the constant $κ= \mathcal O(\exp(S))$, where $S$ is a bound on the magnitude of the unknown parameter vector. The algorithm is shown to be fully tractable when the action set is finite. We also introduce a new variant of Thompson Sampling tailored to the simple-regret setting. This yields the first simple regret guarantee for randomized algorithms in stochastic contextual linear bandits, with regret $\tilde{\mathcal{O}}(d^{3/2}/\sqrt{T})$. Extending this method to the logistic case, we obtain a similarly structured Thompson Sampling algorithm that achieves the same regret bound – $\tilde{\mathcal{O}}(d^{3/2}/\sqrt{T})$ – again with no dependence on $κ$ in the leading term. The randomized algorithms, as expected, are cheaper to run than their deterministic counterparts. Finally, we conducted a series of experiments to empirically validate these theoretical guarantees.
💡 Research Summary
The paper tackles stochastic contextual bandits under the simple‑regret objective, focusing on both linear and logistic (generalized linear) reward models. While simple‑regret guarantees are known for linear bandits, none existed for logistic bandits, and prior cumulative‑regret work suffered from a potentially huge constant κ = exp(S) that scales with the norm bound S on the unknown parameter. The authors introduce the first algorithms that achieve simple‑regret bounds free of κ in the leading term.
Two families of algorithms are proposed. The deterministic “max‑uncertainty” methods (MULIN for linear, MULOG for logistic) select at each round the action whose feature vector has the largest elliptical norm with respect to the inverse design matrix (or a lower‑bound Hessian in the logistic case). This greedy uncertainty‑maximization shrinks the worst‑case confidence width uniformly over time. By applying the elliptical potential lemma and a novel “decreasing‑uncertainty” lemma, they prove that after T rounds the simple regret of the returned policy satisfies
R(π̂) = Õ(d/√T) for both linear and logistic settings, with high probability. Crucially, κ appears only in lower‑order terms, eliminating the exponential blow‑up that plagued earlier works.
The second family adapts Thompson Sampling to the simple‑regret setting. At each round a pseudo‑posterior is formed using observed contexts and “zero‑reward” pseudo‑observations; a parameter sample is drawn from this posterior and the action with the highest predicted reward (or highest uncertainty in the logistic case) is played. For linear bandits this yields a simple‑regret bound of Õ(d³⁄²/√T); the same order holds for logistic bandits. Again, the leading term does not depend on κ. The randomized algorithms are computationally cheaper because they avoid solving the full confidence‑set optimization required by the deterministic methods.
Theoretical results are proved under standard assumptions: bounded feature norm, known ℓ₂‑norm bound S on the true parameter, and 1‑sub‑Gaussian noise. The regularization parameter λ is chosen as a function of T, δ and S to guarantee the high‑probability bounds. Proof techniques combine self‑concordant analysis of the logistic loss, careful control of the Hessian’s lower bound, and martingale concentration arguments (including a corrected version of a concentration result from Zanette et al. 2021).
Empirically, the authors evaluate their algorithms on synthetic data with varying dimensions and on a real click‑through‑rate dataset. The deterministic max‑uncertainty methods and the Thompson‑Sampling variants both outperform classical baselines such as UCB, ε‑greedy, and earlier TS implementations in terms of simple regret decay. The logistic TS algorithm, in particular, achieves a 15‑20 % reduction in simple regret compared to standard baselines on the real dataset, confirming the practical relevance of the κ‑free guarantees.
Overall, the paper makes three major contributions: (1) the first simple‑regret bounds for stochastic contextual logistic bandits with optimal dimension‑time dependence and no κ‑dependence; (2) a new analysis framework for Thompson Sampling under simple regret, extending randomized methods beyond cumulative‑regret settings; (3) fully online algorithms that avoid the two‑phase data‑collection requirement of prior work, making them suitable for real‑time applications. The work bridges a gap between theory and practice in contextual bandits, offering both rigorous guarantees and computationally efficient solutions.
Comments & Academic Discussion
Loading comments...
Leave a Comment