A Two-Timescale Primal-Dual Framework for Reinforcement Learning via Online Dual Variable Guidance

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We study reinforcement learning by combining recent advances in regularized linear programming formulations with the classical theory of stochastic approximation. Motivated by the challenge of designing algorithms that leverage off-policy data while maintaining on-policy exploration, we propose PGDA-RL, a novel primal-dual Projected Gradient Descent-Ascent algorithm for solving regularized Markov Decision Processes (MDPs). PGDA-RL integrates experience replay-based gradient estimation with a two-timescale decomposition of the underlying nested optimization problem. The algorithm operates asynchronously, interacts with the environment through a single trajectory of correlated data, and updates its policy online in response to the dual variable associated with the occupancy measure of the underlying MDP. We prove that PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP. Our convergence analysis relies on tools from stochastic approximation theory and holds under weaker assumptions than those required by existing primal-dual RL approaches, notably removing the need for a simulator or a fixed behavioral policy. Under a strengthened ergodicity assumption on the underlying Markov chain, we establish a last-iterate finite-time guarantee with $\tilde{O} (k^{-2/3})$ mean-square convergence, aligning with the best-known rates for two-timescale stochastic approximation methods under Markovian sampling and biased gradient estimates.


💡 Research Summary

This paper introduces a novel two‑timescale primal‑dual algorithm, PGDA‑RL (Projected Gradient Descent‑Ascent for Reinforcement Learning), for solving regularized Markov Decision Processes (MDPs) formulated as linear programs. The authors start from the regularized LP representation of an MDP, where the primal variables correspond to the value function and the dual variables correspond to the discounted occupancy measure of a policy. By constructing the Lagrangian of this LP, they obtain a saddle‑point problem that can be tackled with gradient‑based methods.
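As a hedged sketch of that saddle-point formulation (the paper's exact regularizer, scaling, and constraint set may differ; μ₀, τ, and Ω are assumed notation here for the initial state distribution, the regularization weight, and a strongly convex regularizer such as negative entropy):

```latex
\max_{\rho \ge 0} \; \min_{V} \;
\mathcal{L}(V, \rho)
  = (1-\gamma)\,\langle \mu_0, V \rangle
  + \sum_{s,a} \rho(s,a)\,\Big( r(s,a)
      + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[V(s')\big]
      - V(s) \Big)
  - \tau\, \Omega(\rho)
```

The inner term is the expected Bellman residual weighted by the occupancy measure ρ, so at the saddle point ρ concentrates on state-action pairs consistent with an optimal policy while V solves the regularized Bellman equations.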

The key methodological contribution is the use of a two‑timescale stochastic approximation scheme. The fast timescale (step size βₖ) updates the dual variable using projected gradient ascent, while the slow timescale (step size αₖ) updates the primal value function via projected gradient descent. The stepsizes satisfy the classic SA conditions (∑αₖ = ∞, ∑αₖ² < ∞, ∑βₖ = ∞, ∑βₖ² < ∞, and βₖ ≫ αₖ). Gradient estimates are obtained from an experience replay buffer, which introduces bias because samples are drawn from a single, correlated trajectory rather than an i.i.d. generator.

PGDA‑RL operates asynchronously: after each dual update, the current dual variable defines a stochastic policy π_{ρₖ} that is immediately deployed in the environment. Consequently, the algorithm maintains on‑policy exploration while still leveraging off‑policy data stored in the buffer. This design eliminates the need for a simulator or a fixed behavioral policy, a limitation present in many prior LP‑based primal‑dual RL methods.
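The standard occupancy-to-policy map that this deployment step relies on can be sketched as follows (a minimal illustration: the normalization and the uniform fallback for zero-mass states are common conventions, and the paper's exact construction of π_{ρₖ} may differ in detail):

```python
import numpy as np

def policy_from_dual(rho, eps=1e-12):
    """Extract a stochastic policy from a dual (occupancy-measure) iterate.

    rho: nonnegative array of shape (num_states, num_actions).
    Returns pi with pi[s, a] = rho[s, a] / sum_a' rho[s, a'];
    states with (numerically) zero mass fall back to a uniform policy.
    """
    rho = np.maximum(rho, 0.0)                      # clip stray negatives
    mass = rho.sum(axis=1, keepdims=True)           # per-state total mass
    uniform = np.full_like(rho, 1.0 / rho.shape[1])
    return np.where(mass > eps, rho / np.maximum(mass, eps), uniform)
```

Because each dual update changes ρₖ, the deployed policy πₖ shifts continuously with the dual iterate, which is what keeps exploration on-policy even though the replay buffer mixes data from earlier policies.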

The convergence analysis proceeds in two stages. First, under a generative‑model assumption, the authors prove almost‑sure convergence of both primal and dual iterates using the ordinary‑differential‑equation (ODE) method for stochastic approximation. Second, they extend the result to the realistic setting where data come from a single Markov trajectory. By assuming a strengthened ergodicity condition—specifically, that every state‑action pair is visited with a probability that grows linearly after a burn‑in period (event G_δ)—they control the Markovian noise and the bias introduced by replay‑buffer sampling. Under this condition, they establish a finite‑time bound for the dual iterate:

$$\mathbb{E}\big[\,\|\rho_k - \rho^\star\|^2\,\big] = \tilde{O}\big(k^{-2/3}\big),$$

which matches the $\tilde{O}(k^{-2/3})$ mean-square convergence rate stated in the abstract.

