Combinatorial Rising Bandits
Combinatorial online learning is a fundamental task for selecting the optimal action (or super arm) as a combination of base arms in sequential interactions with systems providing stochastic rewards. It is applicable to diverse domains such as robotics, social advertising, network routing, and recommendation systems. In many real-world scenarios, we often encounter rising rewards, where playing a base arm not only provides an instantaneous reward but also contributes to the enhancement of future rewards, e.g., robots improving through practice and social influence strengthening in the history of successful recommendations. Crucially, these enhancements may propagate to multiple super arms that share the same base arms, introducing dependencies beyond the scope of existing bandit models. To address this gap, we introduce the Combinatorial Rising Bandit (CRB) framework and propose a provably efficient and empirically effective algorithm, Combinatorial Rising Upper Confidence Bound (CRUCB). We empirically demonstrate the effectiveness of CRUCB in realistic deep reinforcement learning environments and synthetic settings, while our theoretical analysis establishes tight regret bounds. Together, they underscore the practical impact and theoretical rigor of our approach. Our code is available at https://github.com/ml-postech/Combinatorial-Rising-Bandits.
💡 Research Summary
Motivation and Problem Setting
Combinatorial online learning selects a super arm (a set of base arms) at each round and receives stochastic rewards. In many real-world systems (robotic skill acquisition, social advertising, network routing, recommendation), pulling a base arm not only yields an immediate reward but also improves the arm's future performance. Existing combinatorial bandit models assume static rewards, while rising-bandit models handle only single-arm dynamics. Neither captures the situation where a base arm is shared across multiple super arms, causing partially shared enhancement. The authors therefore introduce the Combinatorial Rising Bandit (CRB) framework, which models (i) the selection of a super arm and (ii) a rested rising process: the expected outcome µ_i(n) of base arm i after n pulls is non-decreasing, and the incremental gain γ_i(n) = µ_i(n+1) − µ_i(n) is non-negative and non-increasing (γ_i(n) ≥ γ_i(n+1)), so µ_i grows concavely in the number of pulls. The reward function r(S, µ) is monotone in the mean outcomes of the selected arms, covering additive and k-MAX utilities.
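The rested rising process above can be sketched in a few lines. This is a minimal simulation, not the paper's code: the concrete curve µ(n) = c·(1 − e^(−ρn)) and the parameter names are illustrative choices that satisfy the stated assumptions (µ non-decreasing, γ non-negative and non-increasing).

```python
import math

class RisingArm:
    """Rested rising base arm: the mean outcome mu(n) is non-decreasing in
    the number of pulls n, and the increment gamma(n) = mu(n+1) - mu(n) is
    non-negative and non-increasing. The saturating-exponential form below
    is an illustrative choice, not taken from the paper."""

    def __init__(self, c=1.0, rho=0.1):
        self.c, self.rho = c, rho
        self.pulls = 0

    def mean(self, n=None):
        n = self.pulls if n is None else n
        return self.c * (1.0 - math.exp(-self.rho * n))

    def pull(self):
        # Rested dynamics: the mean advances only when this arm is played.
        reward = self.mean()  # deterministic mean; add noise for stochastic outcomes
        self.pulls += 1
        return reward

arm = RisingArm()
gammas = [arm.mean(n + 1) - arm.mean(n) for n in range(5)]
assert all(g >= 0 for g in gammas)                                  # non-negative
assert all(gammas[i] >= gammas[i + 1] for i in range(len(gammas) - 1))  # non-increasing
```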
Optimal Policy Structure
In non-combinatorial rising bandits the optimal policy plays a single constant arm. In CRB, however, the optimal policy can be more intricate because shared arms create dependencies between super arms, and the authors prove (Theorem 1) that a constant policy need not be optimal in general. Nevertheless, when the reward function is bounded above and below by additive functions (B_L ∑µ_i ≤ r ≤ B_U ∑µ_i), the best constant policy is within a factor B_U/B_L of the optimal cumulative reward (Theorem 2). For truly additive rewards (B_U = B_L) the constant policy is exactly optimal (Corollary 1).
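As a concrete instance of the sandwich condition (our illustration, not a result stated in the summary): for the k-MAX utility over a super arm S of size k with non-negative means,

```latex
r(S,\mu) = \max_{i \in S} \mu_i,
\qquad
\frac{1}{k}\sum_{i \in S} \mu_i \;\le\; \max_{i \in S} \mu_i \;\le\; \sum_{i \in S} \mu_i,
```

so one may take B_L = 1/k and B_U = 1, and Theorem 2 then bounds the suboptimality of the best constant policy by a factor B_U/B_L = k.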
Algorithm: Combinatorial Rising UCB (CRUCB)
CRUCB proceeds in two stages at each round:
- Future-UCB Index Computation – For each base arm i, a sliding window of size h_i = ε·N_{i,t−1} is maintained. The index µ̂_i(t) consists of three components:
  - the recent average of the last h_i observations;
  - a predicted improvement: the finite-difference slope (X_i(l) − X_i(l − h_i))/h_i multiplied by the remaining number of pulls (t − l), which is optimistic under the concavity assumption;
  - an exploration bonus σ·√(log t / (t − N_{i,t−1} + h_i − 1)), larger than in stationary bandits because the environment is non-stationary and rising.
- Solver – The index vector µ̂(t) is fed to a combinatorial optimization oracle that solves S_t = arg max_{S∈𝒮} r(S, µ̂(t)). The oracle can be any problem-specific algorithm (e.g., Dijkstra for shortest path, max-flow for network design), preserving the combinatorial structure while incorporating the rising estimates.
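The two stages can be sketched as follows. This is a simplified reading of the description above, assuming an additive reward with a top-k oracle; the handling of arms with too few pulls, the variable names, and the toy histories are our own choices, not the paper's implementation.

```python
import math

def future_ucb_index(obs, t, eps=0.5, sigma=1.0):
    """Future-UCB index for one base arm.
    obs: past observed outcomes of this arm, in pull order.
    t:   current round. Sketch of the three components described above;
    arms with too few pulls get an infinite index to force exploration."""
    n = len(obs)              # N_{i,t-1}, number of past pulls
    h = max(1, int(eps * n))  # sliding-window size h_i
    if n < 2 * h:
        return float("inf")   # not enough data for the window slope
    recent = sum(obs[-h:]) / h                   # recent average
    slope = (obs[-1] - obs[-1 - h]) / h          # finite-difference slope
    predicted = max(0.0, slope) * (t - n)        # optimistic future gain
    bonus = sigma * math.sqrt(math.log(t) / (t - n + h - 1))
    return recent + predicted + bonus

def solver_top_k(indices, k):
    """Oracle for additive rewards: pick the k arms with the largest indices."""
    return sorted(range(len(indices)), key=lambda i: indices[i], reverse=True)[:k]

# One round: compute per-arm indices, then call the oracle.
histories = [[0.1, 0.2, 0.3, 0.35], [0.5, 0.5, 0.5, 0.5], [0.0, 0.0]]
idx = [future_ucb_index(h, t=11) for h in histories]
super_arm = solver_top_k(idx, k=2)
```

For a shortest-path instance, `solver_top_k` would be replaced by a Dijkstra call on edge weights derived from the indices, leaving the index computation untouched.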
Theoretical Regret Analysis
The difficulty of a CRB instance is quantified by the cumulative increment
Υ(M,q) = Σ_{l=0}^{M−1} max_i γ_i(l)^q,
where q∈
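The quantity Υ(M, q) is straightforward to compute for a given instance. A minimal sketch, where the two increment functions γ_i are our own illustrative choices (non-negative and non-increasing, as the CRB model requires):

```python
def upsilon(gammas, M, q):
    """Cumulative increment Υ(M, q) = Σ_{l=0}^{M-1} max_i γ_i(l)^q.
    gammas: list of per-arm increment functions γ_i(l)."""
    return sum(max(g(l) for g in gammas) ** q for l in range(M))

# Illustrative increments (our choice, not from the paper).
gammas = [lambda l: 1.0 / (l + 1) ** 2,   # fast-saturating arm
          lambda l: 0.5 / (l + 1)]        # slow-saturating arm
val = upsilon(gammas, M=100, q=1)
```

An instance whose increments decay quickly has small Υ, i.e., it is closer to a stationary combinatorial bandit and easier to learn.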