Learning to Explore with Lagrangians for Bandits under Unknown Linear Constraints

Notice: This research summary and analysis were generated automatically using AI. For accuracy, please refer to the original arXiv source.

Pure exploration in bandits formalises multiple real-world problems, such as tuning hyper-parameters or conducting user studies to test a set of items, where different safety, resource, and fairness constraints on the decision space naturally appear. We study these problems as pure exploration in multi-armed bandits with unknown linear constraints, where the aim is to identify an $r$-optimal and feasible policy as fast as possible with a given level of confidence. First, we propose a Lagrangian relaxation of the sample complexity lower bound for pure exploration under constraints. Second, we leverage properties of convex optimisation in the Lagrangian lower bound to propose two computationally efficient extensions of Track-and-Stop and Gamified Explorer, namely LATS and LAGEX. Then, we propose a constraint-adaptive stopping rule and, while tracking the lower bound, use an optimistic estimate of the feasible set at each step. We show that LAGEX achieves an asymptotically optimal sample complexity upper bound, while LATS is asymptotically optimal up to novel constraint-dependent constants. Finally, we conduct numerical experiments with different reward distributions and constraints that validate the efficient performance of LATS and LAGEX.


💡 Research Summary

This paper tackles the problem of pure exploration in multi‑armed bandits when the decision policy must satisfy a set of linear constraints that are not known a priori. The goal is to identify an r‑optimal feasible policy—i.e., a policy whose expected reward is within r of the true optimal feasible policy—with confidence 1 − δ while using as few samples as possible. The authors first revisit the information‑theoretic lower bound on sample complexity for constrained pure‑exploration (originally expressed as a minimax KL‑divergence problem) and propose a novel Lagrangian relaxation of this bound. By introducing Lagrange multipliers λ that penalise constraint violations, the lower‑bound problem becomes a convex saddle‑point problem over policies π and multipliers λ. Strong duality holds under a Slater‑type condition (non‑zero slack), guaranteeing that the relaxed bound retains the same hardness as the original bound.
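The saddle-point structure described above can be illustrated on a toy instance. The sketch below runs projected gradient descent-ascent with iterate averaging on the Lagrangian L(π, λ) = π·μ − λ·(Aπ − b) for a three-arm problem with one linear constraint; all numbers, step sizes, and the multiplier cap are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Toy instance (illustrative values, not from the paper):
# maximise pi . mu over the simplex subject to A pi <= b.
mu = np.array([1.0, 0.5, 0.2])       # mean rewards
A = np.array([[1.0, 0.0, 0.0]])      # one constraint: pi[0] <= 0.4
b = np.array([0.4])

pi = np.ones(3) / 3                  # primal iterate on the simplex
lam = np.zeros(1)                    # non-negative Lagrange multiplier
eta, T = 0.005, 100_000
avg_pi = np.zeros(3)

for _ in range(T):
    # ascent on pi via exponentiated gradient (keeps pi on the simplex)
    pi = pi * np.exp(eta * (mu - A.T @ lam))
    pi /= pi.sum()
    # ascent on the constraint violation for lam, projected to [0, 2]
    # (the upper cap is only a numerical safeguard for this sketch)
    lam = np.clip(lam + eta * (A @ pi - b), 0.0, 2.0)
    avg_pi += pi

avg_pi /= T  # averaged iterates approximate the saddle point
```

In this instance the constrained optimum places mass 0.4 on the best arm and the remaining mass on the second-best arm, for a value of 0.7; the averaged policy lands near that point, which is the standard ergodic-averaging guarantee for convex-concave saddle problems.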

Because the constraint matrix A is unknown, the algorithm must estimate it online. At each round t the agent observes a noisy cost vector cₜ = A_{aₜ} + ηₜ, where A_{aₜ} is the column of A associated with the chosen arm aₜ and ηₜ is sub‑Gaussian noise. Using all past observations, the algorithm builds an empirical estimate Âₜ and a confidence ellipsoid Cₜ around it. The optimistic feasible set F̂ₜ is defined as the set of policies that satisfy the most favorable (i.e., least restrictive) constraint matrix inside Cₜ. With probability at least 1 − δ, the true feasible set F is contained in F̂ₜ, ensuring that the optimal feasible policy never gets excluded during exploration.
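The optimistic feasible set can be made concrete in the simplest finite-arm case, where each constraint row is estimated by per-arm sample means. The sketch below uses a Hoeffding-style radius for a known noise scale and ignores the union bounds over arms and rounds that the paper's confidence ellipsoid handles; the cost vector, noise level, and round-robin sampling are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
a_true = np.array([0.8, 0.3, 0.1])   # unknown per-arm costs (toy values)
b = 0.4                              # constraint budget: a . pi <= b
delta = 0.05
sigma = 0.1                          # assumed sub-Gaussian noise scale

counts = np.zeros(K)
cost_sums = np.zeros(K)

def observe(k):
    # noisy cost observation c_t = a[k] + eta_t (Gaussian noise here)
    return a_true[k] + sigma * rng.standard_normal()

for t in range(3000):
    k = t % K                        # round-robin sampling, for illustration
    counts[k] += 1
    cost_sums[k] += observe(k)

a_hat = cost_sums / counts
# Hoeffding-style per-arm radius (no union bound over arms/rounds here)
rad = sigma * np.sqrt(2 * np.log(1 / delta) / counts)
a_opt = a_hat - rad                  # optimistic (least restrictive) costs

def optimistically_feasible(pi):
    # membership test for the optimistic feasible set: with high
    # probability every truly feasible policy passes this check
    return a_opt @ pi <= b
```

Because `a_opt` underestimates the true costs with high probability, a truly feasible policy is accepted, while policies that clearly overspend the budget are still rejected once the estimates are accurate enough.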

Building on this Lagrangian lower bound and the optimistic feasible set, the authors design two algorithms. LATS (Lagrangian Track‑and‑Stop) extends the classic Track‑and‑Stop (TS) method: at each step it solves the Lagrangian lower‑bound problem over the current F̂ₜ and updates the allocation policy ωₜ accordingly. A new stopping rule checks whether the empirical means and the optimistic constraints are simultaneously tight enough to guarantee (1 − δ)‑correctness and (1 − δ)‑feasibility. LAGEX (Lagrangian Gamified Explorer) adapts the Gamified Explorer (GEX) framework, embedding the Lagrange multipliers into the "game score" that balances reward information gain against constraint satisfaction. Both algorithms maintain a running estimate of the Lagrange multipliers, which act as shadow prices for each constraint.
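A minimal version of the tracking loop shared by Track-and-Stop-style methods can be sketched as follows. The allocation oracle here is only a placeholder (a softmax over empirical means); in LATS/LAGEX it would instead solve the Lagrangian lower-bound problem on the optimistic feasible set. The means, noise level, horizon, and forced-exploration rule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3

def target_allocation(mu_hat):
    # placeholder oracle: favour arms with higher empirical means.
    # LATS/LAGEX would return the optimal allocation of the Lagrangian
    # lower-bound problem over the optimistic feasible set instead.
    w = np.exp(mu_hat)
    return w / w.sum()

mu_true = np.array([0.9, 0.6, 0.3])  # toy mean rewards
counts = np.zeros(K)
means = np.zeros(K)

for t in range(1, 2001):
    w = target_allocation(means)
    # D-tracking with forced exploration: pull an undersampled arm if
    # any count has fallen below sqrt(t) - K/2, else track t * w
    if counts.min() < np.sqrt(t) - K / 2:
        k = int(counts.argmin())
    else:
        k = int((t * w - counts).argmax())
    r = mu_true[k] + 0.2 * rng.standard_normal()
    counts[k] += 1
    means[k] += (r - means[k]) / counts[k]   # running-mean update
```

Forced exploration keeps every empirical mean (and, in the constrained setting, every constraint estimate) consistent, while tracking drives the empirical sampling proportions toward the oracle allocation.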

The theoretical contributions are threefold. First, the authors prove that both algorithms are (1 − δ)‑correct and (1 − δ)‑feasible. Second, they derive finite‑time upper bounds on the expected stopping time. For LAGEX the bound matches the asymptotic lower bound, establishing optimality. For LATS the bound is larger by a factor s (the “shadow‑price ratio”, i.e., the ratio between the maximum and minimum optimal Lagrange multipliers). This factor quantifies how sensitive the algorithm’s sample complexity is to the geometry of the constraints. Third, they provide a novel concentration inequality for the estimated constraint matrix, which is essential for controlling the size of the optimistic feasible set and for the stopping rule.
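The stopping-rule ingredients can be illustrated with the standard Gaussian generalized-likelihood-ratio (GLR) statistic and a threshold of the usual log-log shape. Both expressions below are textbook forms given for intuition only, not the paper's exact constraint-adaptive rule, which must also account for the optimistic constraint estimates:

```python
import numpy as np

def glr_gap(m1, n1, m2, n2, sigma=1.0):
    # Gaussian GLR statistic for rejecting "mean 2 >= mean 1" given
    # empirical means m1, m2 from n1, n2 samples with noise scale sigma
    if m1 <= m2:
        return 0.0
    return (n1 * n2) / (n1 + n2) * (m1 - m2) ** 2 / (2.0 * sigma ** 2)

def threshold(t, delta):
    # a common threshold shape beta(t, delta) = log((1 + log t) / delta);
    # the constraint-adaptive rule would add constraint-dependent terms
    return float(np.log((1.0 + np.log(t)) / delta))
```

Sampling stops once the smallest GLR over all plausible alternatives exceeds β(t, δ): with well-separated empirical means the statistic grows linearly in the sample counts and quickly dwarfs the slowly growing threshold, while near-ties keep the algorithm sampling.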

Empirical evaluation is performed on synthetic problems with Gaussian and Bernoulli rewards and with 2–5 linear constraints of varying difficulty, as well as on a real‑world recommendation dataset. The baselines include the original constrained Track‑and‑Stop, Safe‑BAI, and unconstrained TS. Results show that LAGEX consistently requires the fewest samples to achieve the target r‑optimality while keeping constraint violations virtually zero. LATS performs competitively but incurs a modest sample overhead proportional to s, especially when constraints are tight or nearly active. Additional experiments measure the actual constraint‑violation probability during exploration, confirming the theoretical guarantees.

In summary, the paper makes four major contributions: (1) a Lagrangian‑based reformulation of the pure‑exploration lower bound that accommodates unknown linear constraints; (2) two computationally efficient algorithms (LATS and LAGEX) that track both the reward information and the constraint estimates; (3) rigorous finite‑time analysis including a new concentration result for constraint estimation; and (4) extensive experiments validating the practical advantage of the proposed methods. The work opens several avenues for future research, such as extending to non‑linear constraints, handling non‑sub‑Gaussian noise, and integrating online constraint learning with contextual bandits.

