Periodic Regularized Q-Learning


In reinforcement learning (RL), Q-learning is a fundamental algorithm whose convergence is guaranteed in the tabular setting. However, this convergence guarantee does not hold under linear function approximation. To overcome this limitation, a significant line of research has introduced regularization techniques to ensure stable convergence under function approximation. In this work, we propose a new algorithm, periodic regularized Q-learning (PRQ). We first introduce regularization at the level of the projection operator and explicitly construct a regularized projected value iteration (RP-VI), subsequently extending it to a sample-based RL algorithm. By appropriately regularizing the projection operator, the resulting projected value iteration becomes a contraction. By extending this regularized projection into the stochastic setting, we establish the PRQ algorithm and provide a rigorous theoretical analysis that proves finite-time convergence guarantees for PRQ under linear function approximation.


💡 Research Summary

This paper addresses the long‑standing instability of Q‑learning when combined with linear function approximation. While tabular Q‑learning enjoys guaranteed convergence, the “deadly triad” of off‑policy learning, bootstrapping, and function approximation can cause divergence or oscillations. Existing remedies—such as target networks, truncation, projection onto a ball, or additional regularization—typically rely on strong assumptions or introduce extra complexity.

The authors propose a two‑pronged solution: (1) a regularized projection operator and (2) periodic parameter updates. The standard weighted Euclidean projection Γ = Φ(ΦᵀDΦ)⁻¹ΦᵀD is augmented with a ridge term ηI, yielding Γ_η = Φ(ΦᵀDΦ + ηI)⁻¹ΦᵀD. Lemma 3.1 shows that as η → ∞, Γ_η collapses to the zero operator, while η → 0 recovers the original projection. Crucially, for sufficiently large η (e.g., η > 2 under ‖Φ‖∞ ≤ 1), the composition Γ_η T becomes a contraction provided γ‖Γ_η‖∞ < 1 (Lemma 3.2). This guarantees existence and uniqueness of the fixed point of the regularized projected Bellman equation (RP‑BE): Φθ = Γ_η T Φθ.
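To make the effect of the ridge term concrete, here is a small NumPy sketch of Γ_η on a randomly generated toy problem (the feature matrix, the weighting D, and all dimensions are illustrative assumptions, not taken from the paper): small η behaves like the ordinary weighted projection, while large η shrinks the operator toward zero, which is the mechanism behind the contraction in Lemma 3.2.

```python
import numpy as np

# Hypothetical toy problem: n state-action pairs, d features (illustrative sizes).
rng = np.random.default_rng(0)
n, d = 8, 3
Phi = rng.uniform(-1.0, 1.0, size=(n, d))   # feature matrix with entries in [-1, 1]
D = np.diag(rng.dirichlet(np.ones(n)))      # diagonal state-action weighting, trace 1

def regularized_projection(Phi, D, eta):
    """Gamma_eta = Phi (Phi^T D Phi + eta I)^{-1} Phi^T D."""
    A = Phi.T @ D @ Phi + eta * np.eye(Phi.shape[1])
    return Phi @ np.linalg.solve(A, Phi.T @ D)

# Small eta ~ ordinary weighted projection (leaves the span of Phi fixed);
# large eta shrinks Gamma_eta toward the zero operator (Lemma 3.1).
for eta in (1e-8, 0.5, 2.0, 10.0):
    G = regularized_projection(Phi, D, eta)
    print(f"eta={eta:g}  ||Gamma_eta||_inf = {np.abs(G).sum(axis=1).max():.4f}")
```

The printed ∞-norms shrink as η grows, so for large enough η the condition γ‖Γ_η‖∞ < 1 holds regardless of γ < 1.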

Building on this, the authors define regularized projected value iteration (RP‑VI):
 Φθ_{k+1} = Γ_η T Φθ_k   (10)
or equivalently, in parameter space,
 θ_{k+1} = (ΦᵀDΦ + ηI)⁻¹ΦᵀD(R + γPΠΦθ_k)   (11)
They show (Lemma 4.1) that if the unique solution θ*_η exists, the iterates converge geometrically at rate γ‖Γ_η‖∞, without requiring any additional projection or truncation.
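The parameter-space update (11) can be sketched directly on a toy problem. The following NumPy snippet assumes a hypothetical random MDP (all sizes, the uniform distribution D, and the constants are illustrative choices, not the paper's setup); with η large enough, the iterates settle geometrically onto the fixed point θ*_η:

```python
import numpy as np

# Hypothetical random MDP: S states, A actions, d features (illustrative only).
rng = np.random.default_rng(1)
S, A, d = 4, 2, 3
n = S * A                                    # state-action pairs, row i = s * A + a
gamma, eta = 0.9, 3.0                        # discount factor and ridge coefficient

Phi = rng.uniform(-1.0, 1.0, size=(n, d))    # features, entries in [-1, 1]
P = rng.dirichlet(np.ones(S), size=n)        # P[i, s'] = Pr(s' | state-action i)
R = rng.uniform(0.0, 1.0, size=n)            # expected rewards
D = np.diag(np.full(n, 1.0 / n))             # uniform sampling distribution

def rpvi_step(theta):
    """One RP-VI step in parameter space, eq. (11)."""
    q_next = (Phi @ theta).reshape(S, A).max(axis=1)   # greedy selection Pi
    target = R + gamma * (P @ q_next)                  # Bellman operator T on Phi theta
    A_eta = Phi.T @ D @ Phi + eta * np.eye(d)
    return np.linalg.solve(A_eta, Phi.T @ D @ target)

theta = np.zeros(d)
for k in range(200):
    theta = rpvi_step(theta)
print("fixed-point residual:", np.linalg.norm(rpvi_step(theta) - theta))
```

Note that no extra projection or truncation step appears in the loop; the ridge term ηI inside the solve is what keeps the iteration contractive.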

To make RP‑VI practical in a model‑free setting, the paper introduces Periodic Regularized Q‑learning (PRQ). The key idea is to separate the current parameter θ from a target parameter θ′ that is held fixed for a predetermined number of steps τ and then updated to θ. The update rule is derived from the convex surrogate loss L_η(θ, θ′) = ½‖Γ(R + γPΠΦθ′) – Φθ‖_D² + ½η‖θ‖², whose minimizer over θ reproduces the RP‑VI update (11). On a sampled transition (s, a, r, s′), the resulting stochastic gradient step is
 θ ← θ – α(ηθ – φ(s, a)(r + γ max_{a′} θ′ᵀφ(s′, a′) – θᵀφ(s, a))),
i.e., a TD update toward the frozen target θ′ combined with ridge shrinkage. Every τ steps the target is refreshed, θ′ ← θ, so that one target period mirrors one step of RP‑VI.
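A minimal sample-based sketch of the PRQ loop, assuming a hypothetical toy MDP with uniform state-action sampling; the step size, period τ, and environment are illustrative choices, not the paper's experimental setup:

```python
import numpy as np

# Hypothetical toy MDP and features (all sizes and constants are illustrative).
rng = np.random.default_rng(2)
S, A, d = 4, 2, 3
n = S * A                                    # state-action pairs, row i = s * A + a
gamma, eta, alpha, tau = 0.9, 3.0, 0.1, 50   # discount, ridge, step size, target period

Phi = rng.uniform(-1.0, 1.0, size=(n, d))    # feature rows phi(s, a)
P = rng.dirichlet(np.ones(S), size=n)        # transition kernel over next states
R = rng.uniform(0.0, 1.0, size=n)            # rewards

theta = np.zeros(d)
theta_prime = theta.copy()                   # target parameter, frozen for tau steps

for t in range(5000):
    i = rng.integers(n)                                       # sample (s, a) uniformly
    s_next = rng.choice(S, p=P[i])                            # sample s' ~ P(. | s, a)
    q_next = max(Phi[s_next * A + a] @ theta_prime for a in range(A))
    delta = R[i] + gamma * q_next - Phi[i] @ theta            # TD error w.r.t. target
    theta = theta + alpha * (delta * Phi[i] - eta * theta)    # regularized SGD step
    if (t + 1) % tau == 0:
        theta_prime = theta.copy()                            # periodic target update

print("final parameters:", theta)
```

The ηθ shrinkage term keeps the iterates bounded without any explicit projection or truncation, which is the practical payoff of the regularized projection.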

