Exploration in the Limit
In fixed-confidence best arm identification (BAI), the objective is to quickly identify the optimal option while controlling the probability of error below a desired threshold. Despite the plethora of BAI algorithms, existing methods typically fall short in practical settings, as stringent exact error control requires using loose tail inequalities and/or parametric restrictions. To overcome these limitations, we introduce a relaxed formulation that requires valid error control asymptotically with respect to a minimum sample size. This aligns with many real-world settings that often involve weak signals, high desired significance, and post-experiment inference requirements, all of which necessitate long horizons. This allows us to achieve tighter optimality, while better handling flexible nonparametric outcome distributions and fully leveraging individual-level contexts. We develop novel asymptotic anytime-valid confidence sequences over arm indices, and we use them to design a new BAI algorithm for our asymptotic framework. Our method flexibly incorporates covariates for variance reduction and ensures approximate error control in fully nonparametric settings. Under mild convergence assumptions, we provide asymptotic bounds on the sample complexity and show the worst-case sample complexity of our approach matches the best-case sample complexity of Gaussian BAI under exact error guarantees and known variances. Experiments suggest our approach reduces average sample complexities while maintaining error control.
💡 Research Summary
The paper tackles fixed‑confidence best‑arm identification (BAI), a pure‑exploration problem where a learner must sequentially allocate samples to K arms until it can declare the arm with the highest mean with probability at least 1 − α. Existing BAI methods achieve optimal sample‑complexity only under restrictive parametric assumptions (e.g., exponential families with known variances) or require strong moment‑generating‑function bounds to guarantee exact error control. In practice, experiments often involve weak signals, very small α, long horizons, and rich contextual covariates, making those assumptions unrealistic and leading to overly conservative sampling.
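The interaction protocol just described can be sketched in a few lines. This is our own minimal skeleton, not the paper's algorithm: the uniform allocation, the Bernoulli arms, and the fixed budget are all placeholders standing in for the adaptive policy and stopping rule discussed later.

```python
import numpy as np

# Skeleton of the fixed-confidence BAI protocol (illustrative only): sample
# arms sequentially, then recommend the arm with the highest empirical mean.
rng = np.random.default_rng(0)
means = [0.2, 0.5, 0.8]                  # hypothetical Bernoulli arm means
counts = np.zeros(3)
sums = np.zeros(3)

for t in range(3000):
    arm = t % 3                          # placeholder: uniform allocation
    reward = rng.binomial(1, means[arm]) # pull the arm, observe an outcome
    counts[arm] += 1
    sums[arm] += reward

recommended = int(np.argmax(sums / counts))
print(recommended)
```

A real fixed-confidence algorithm replaces the round-robin line with an adaptive sampling policy and the fixed budget with a data-dependent stopping rule, which is exactly what the components below supply.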
Key contribution – asymptotic relaxation of the error constraint.
Instead of demanding that the error probability be bounded by α at every time, the authors introduce a “burn‑in” parameter t₀ and require only that, for all t ≥ t₀, the error probability be bounded by α in the limit as t₀ grows. This asymptotic guarantee is natural when the experiment is long enough for central‑limit‑type approximations to be accurate. It allows the use of nonparametric outcome distributions and eliminates the need for explicit moment‑generating‑function (MGF) bounds.
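The burn‑in requirement can be illustrated with a simple simulation (our own construction, not the paper's): a Studentized‑mean confidence interval for a skewed reward distribution undercovers at small sample sizes but approaches its nominal level as the sample grows, which is precisely the regime the t₀ parameter targets.

```python
import numpy as np

rng = np.random.default_rng(0)
z = 1.959964  # standard normal quantile for nominal 95% two-sided coverage

def coverage(n, trials=20000):
    """Empirical coverage of the CLT interval for Exp(1) rewards (true mean 1)."""
    x = rng.exponential(1.0, size=(trials, n))
    m = x.mean(axis=1)
    s = x.std(axis=1, ddof=1)
    half = z * s / np.sqrt(n)
    return np.mean((m - half <= 1.0) & (1.0 <= m + half))

small, large = coverage(5), coverage(500)
print(small, large)  # coverage improves toward 0.95 as the sample size grows
```

The under-coverage at n = 5 is the failure mode an anytime-valid guarantee must survive; deferring validity until after a burn-in sidesteps it without resorting to conservative tail bounds.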
Asymptotic anytime‑valid confidence sequences.
The core technical tool is a new class of confidence sequences that are valid only after the burn‑in period. They are built from weighted sums of unbiased scoring functions φ(x,a,y). The weights are chosen to maximize the signal‑to‑noise ratio (SNR) of the resulting test statistic. This weight‑optimization problem is a convex fractional program; its solution can be obtained by projected sub‑gradient descent. When no covariates are present, the optimal weights coincide with a Kullback–Leibler (KL) projection, linking the method to the KL‑UCB family.
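A toy version of the weight‑optimization step can be written down directly. The objective below, SNR(w) = (w·m) / sqrt(Σᵢ wᵢ²vᵢ) with m and v standing in for estimated per‑observation signals and variances, is our simplified stand‑in for the paper's fractional program; the projected‑gradient loop mirrors the projected sub‑gradient scheme described above. For this particular objective the maximizer over the simplex is proportional to m/v, which gives a sanity check.

```python
import numpy as np

def project_simplex(w):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(w)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / (np.arange(len(w)) + 1))[0][-1]
    return np.maximum(w - css[rho] / (rho + 1), 0.0)

def snr(w, m, v):
    return (w @ m) / np.sqrt(np.sum(w**2 * v))

def optimize_weights(m, v, steps=4000, lr=0.05):
    """Projected gradient ascent on the (quasiconcave) SNR objective."""
    w = np.full(len(m), 1.0 / len(m))
    for _ in range(steps):
        denom = np.sqrt(np.sum(w**2 * v))
        grad = m / denom - (w @ m) * w * v / denom**3  # gradient of SNR
        w = project_simplex(w + lr * grad)
    return w

m = np.array([1.0, 0.5, 0.2])   # hypothetical signals
v = np.array([1.0, 0.25, 0.04]) # hypothetical variances
w_star = optimize_weights(m, v)
print(np.round(w_star, 3))  # should approach m/v, normalized
```

By Cauchy–Schwarz the optimal SNR here is sqrt(Σᵢ mᵢ²/vᵢ); the iterate should come close to it, confirming that the projected scheme recovers inverse‑variance‑style weighting.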
Algorithmic framework.
A BAI algorithm consists of three components: (i) a sampling policy π that, at each round, uses the current weighted scores to allocate future pulls so as to minimize the asymptotic expected stopping time, (ii) a stopping rule ξ that halts when the lower confidence bound of a candidate arm exceeds the upper bounds of all others, and (iii) a recommendation rule â that returns the arm with the highest lower bound. The confidence sequences guarantee that, after t₀, the probability of incorrectly stopping or recommending is asymptotically ≤ α.
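The stop/recommend logic of components (ii) and (iii) reduces to a comparison of confidence bounds. A minimal sketch (the function name and interface are ours, not the paper's), taking per‑arm lower and upper bounds as input:

```python
import numpy as np

def stop_and_recommend(lcb, ucb):
    """Return (should_stop, candidate_arm): stop once the candidate arm's
    lower confidence bound clears every other arm's upper confidence bound."""
    lcb, ucb = np.asarray(lcb), np.asarray(ucb)
    cand = int(np.argmax(lcb))          # recommendation rule: highest LCB
    others = np.delete(ucb, cand)       # UCBs of all competing arms
    return bool(lcb[cand] > others.max()), cand

print(stop_and_recommend([0.5, 0.1], [0.9, 0.4]))  # → (True, 0)
print(stop_and_recommend([0.5, 0.1], [0.9, 0.6]))  # → (False, 0)
```

The second call keeps sampling because arm 1's upper bound (0.6) still overlaps arm 0's lower bound (0.5); the sampling policy π decides which arm to pull next in that case.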
Theoretical results.
Under mild assumptions (unique optimal arm, positive conditional variances, bounded outcomes), the authors prove:
- Asymptotic α‑correctness: for any fixed α, the algorithm’s error probability converges to ≤ α as t₀ → ∞.
- Sample‑complexity upper bound: the worst‑case asymptotic expected stopping time of the proposed method is no larger than the optimal sample‑complexity for Gaussian BAI with known variances (the classic lower bound). Thus, even in a fully non‑parametric setting, the method is not harder than the best parametric case.
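The Gaussian benchmark invoked here is usually stated through a characteristic time T*(μ). A standard form under known variances σₐ² (the notation below is ours and may differ from the paper's) is:

```latex
% Any alpha-correct strategy for Gaussian arms with known variances satisfies
\liminf_{\alpha \to 0} \frac{\mathbb{E}[\tau_\alpha]}{\log(1/\alpha)} \;\ge\; T^*(\mu),
\quad\text{where}\quad
\frac{1}{T^*(\mu)}
= \sup_{w \in \Delta_K}\;
  \inf_{\lambda:\ \arg\max_a \lambda_a \ne \arg\max_a \mu_a}\;
  \sum_{a=1}^{K} w_a \,\frac{(\mu_a - \lambda_a)^2}{2\sigma_a^2}.
```

The claim above says the proposed nonparametric method's worst-case asymptotic stopping time matches this parametric benchmark.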
- Contextual gain: when covariates are available, the SNR‑maximizing weights exploit the conditional mean and variance functions g(x,a), v(x,a), yielding strictly smaller asymptotic complexity than the non‑contextual Gaussian benchmark.
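The source of the contextual gain is the law of total variance, Var(Y) = E[Var(Y | X)] + Var(E[Y | X]): centering each outcome at the conditional mean g(x, a) removes the between‑context component, leaving only the conditional noise v(x, a). A quick numerical illustration (our own construction, not the paper's estimator):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)                   # observed context
g = 2.0 * x                              # assumed conditional mean for one arm
y = g + rng.normal(scale=0.5, size=n)    # outcome: mean g(x), noise sd 0.5

raw_var = y.var()                        # ≈ Var(g(X)) + 0.25 = 4.25
adjusted_var = (y - g).var()             # ≈ 0.25: only conditional noise remains
print(raw_var, adjusted_var)
```

A 17-fold variance reduction like this translates directly into fewer samples needed to separate the arms, which is why the contextual weights can beat the non-contextual Gaussian benchmark.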
Empirical evaluation.
Synthetic experiments mirroring prior BAI benchmarks and a realistic ad‑optimization simulation demonstrate that the new algorithm reduces average sample usage by 20‑33 % relative to state‑of‑the‑art methods (Track‑and‑Stop, KL‑LUCB, existing anytime‑valid approaches) while still satisfying the prescribed α‑level error control. The gains are larger when contextual information is incorporated.
Implications and future work.
By relaxing exact error control to an asymptotic guarantee, the paper opens a practical pathway for BAI in long‑horizon, weak‑signal, and highly contextual environments such as clinical trials, network resource allocation, and online experimentation. The combination of asymptotic anytime‑valid confidence sequences, SNR‑optimal weighting, and projected sub‑gradient sampling design constitutes a novel, theoretically grounded, and empirically effective framework. Future directions include extending the method to multi‑objective settings, handling adaptive burn‑in schedules, and validating on real‑world clinical or industrial datasets.