Provably Adaptive Average Reward Reinforcement Learning for Metric Spaces
We study infinite-horizon average-reward reinforcement learning (RL) for Lipschitz MDPs, a broad class that subsumes several important function-approximation frameworks such as linear and RKHS MDPs. We develop an adaptive algorithm $\text{ZoRL}$ with regret bounded as $\mathcal{O}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}}= 2d_\mathcal{S} + d_z + 3$, $d_\mathcal{S}$ is the dimension of the state space, and $d_z$ is the zooming dimension. In contrast, algorithms with fixed discretization yield $d_{\text{eff.}} = 2(d_\mathcal{S} + d_\mathcal{A}) + 2$, $d_\mathcal{A}$ being the dimension of the action space. $\text{ZoRL}$ achieves this by discretizing the state-action space adaptively and zooming into ‘‘promising regions’’ of the state-action space. Since $d_z$ is a problem-dependent quantity bounded by the dimension of the state-action space, the regret of $\text{ZoRL}$ is small whenever the MDP is benign. The zooming dimension and $\text{ZoRL}$ are truly adaptive, i.e., the current work shows how to capture adaptivity gains for infinite-horizon average-reward RL. $\text{ZoRL}$ outperforms other state-of-the-art algorithms in experiments, thereby demonstrating the gains arising due to adaptivity.
💡 Research Summary
The paper tackles the problem of infinite‑horizon average‑reward reinforcement learning (RL) in continuous state‑action spaces that satisfy a Lipschitz smoothness condition. While prior work on Lipschitz MDPs has focused mainly on episodic settings or on fixed discretizations, the authors observe that those approaches either fail to exploit problem structure in the average‑reward regime or suffer from a “zoom dimension” that collapses to the ambient dimension, eliminating any adaptivity gains.
To address this gap, the authors introduce ZoRL (Zoom‑based Reinforcement Learning), an algorithm that adaptively partitions the joint state‑action space into dyadic cells. The key technical contributions are:
- **Zoom dimension** ($d_z$) – defined via covering numbers of the set of near-optimal state-action pairs $Z_\beta$. This problem-dependent quantity satisfies $d_z \le d = d_{\mathcal S}+d_{\mathcal A}$ and captures how many cells are needed to cover the region where the sub-optimality gap is at most $\beta$.
- **Key cells** – for any sub-optimal policy played by the algorithm, there exists at least one cell that (i) has not been visited enough times and (ii) carries a large stationary probability under that policy. Lemma 4.1 establishes a quantitative link between a policy's regret and the gaps of the state-action pairs it traverses, guaranteeing the existence of such key cells.
- **Adaptive activation rule** – each cell $\zeta$ has a lower and an upper visitation threshold, $N_{\min}(\zeta)$ and $N_{\max}(\zeta)$, that depend on the cell's diameter, a confidence parameter $\delta$, and a constant $c_a>1$. A cell is "active" when its visitation count lies between these thresholds. When a cell's count exceeds the upper threshold, the cell is split into its dyadic children, thereby refining the discretization only where needed.
- **Episode length selection** – unlike prior work that ends an episode when any cell's count doubles, ZoRL determines the episode horizon from a "proxy diameter" of the current policy. This ensures that, within each episode, every key cell receives at least $N_{\min}$ visits with high probability, which is essential for achieving the claimed regret bound.
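The adaptive activation-and-splitting rule above can be sketched in code. The following is a minimal 1-D illustration, not the paper's implementation: the threshold formulas `n_min`/`n_max`, the constants, and the binary split are hypothetical stand-ins for the diameter-dependent thresholds $N_{\min}(\zeta)$, $N_{\max}(\zeta)$ described above.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Cell:
    """A dyadic cell of the (rescaled) state-action space [0, 1]."""
    center: tuple
    level: int                  # dyadic level; diameter = 2**(-level)
    visits: int = 0
    children: list = field(default_factory=list)

    @property
    def diameter(self) -> float:
        return 2.0 ** (-self.level)

def n_min(cell: Cell, delta: float = 0.05, c_a: float = 2.0) -> int:
    # Lower visitation threshold: smaller cells require more samples.
    # The exact dependence on the diameter and on delta in the paper
    # differs; this polynomial form is an illustrative stand-in.
    return math.ceil(c_a * math.log(1.0 / delta) / cell.diameter ** 2)

def n_max(cell: Cell, delta: float = 0.05, c_a: float = 2.0) -> int:
    # Upper threshold at which the cell is split into dyadic children.
    return 2 * n_min(cell, delta, c_a)

def record_visit(cell: Cell, delta: float = 0.05, c_a: float = 2.0) -> list:
    """Count a visit; split into dyadic children once the upper
    threshold is reached (2 children in 1-D, 2**d in d dimensions)."""
    cell.visits += 1
    if not cell.children and cell.visits >= n_max(cell, delta, c_a):
        offset = cell.diameter / 4.0
        cell.children = [
            Cell(center=(cell.center[0] - offset,), level=cell.level + 1),
            Cell(center=(cell.center[0] + offset,), level=cell.level + 1),
        ]
    return cell.children
```

Because cells split only after accumulating enough visits, the resulting dyadic tree is refined exactly where the algorithm spends time, which is what lets the effective dimension depend on $d_z$ rather than on the ambient dimension.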
The theoretical analysis proceeds in two stages. First, the authors bound the total number of times a set of $\beta$-sub-optimal policies $\Phi(\beta)$ can be selected by relating it to the $\beta$-covering number of the corresponding state-action pairs. This step avoids the explosion of covering numbers in policy space that plagued earlier works. Second, using the definition of the zoom dimension, they translate the covering bound into a regret bound of $\mathcal{O}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}} = 2d_{\mathcal{S}} + d_z + 3$.
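The zooming dimension invoked in this bound can be formalized via covering numbers of the near-optimal set $Z_\beta$. One standard way to write this definition (the paper's exact constants and normalization may differ) is:

```latex
% Z_beta: state-action pairs whose sub-optimality gap is at most beta
% N_beta(Z_beta): beta-covering number of Z_beta in the product metric
\[
  d_z \;=\; \inf\Big\{ d > 0 \;:\;
    N_{\beta}\big(Z_\beta\big) \,\le\, c\,\beta^{-d}
    \;\text{ for all } \beta \in (0, 1] \Big\}.
\]
% Since Z_beta is a subset of S x A, one always has d_z <= d_S + d_A;
% for benign MDPs, Z_beta concentrates near the optimal policy and
% d_z can be much smaller than the ambient dimension.
```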