PIQL: Projective Implicit Q-Learning with Support Constraint for Offline Reinforcement Learning
Offline Reinforcement Learning (RL) faces the fundamental challenge of extrapolation error caused by out-of-distribution (OOD) actions. Implicit Q-Learning (IQL) employs expectile regression to achieve in-sample learning. However, IQL relies on a fixed expectile hyperparameter and a density-based policy improvement method, both of which limit its adaptability and performance. In this paper, we propose Projective IQL (PIQL), a projective variant of IQL enhanced with a support constraint. In the policy evaluation stage, PIQL replaces the fixed expectile hyperparameter with a projection-based parameter and extends the one-step value estimation to a multi-step formulation. In the policy improvement stage, PIQL adopts a support constraint instead of a density constraint, ensuring closer alignment with the policy evaluation. Theoretically, we show that PIQL preserves the expectile-regression, in-sample learning framework, guarantees monotonic policy improvement, and introduces a progressively stricter criterion for advantageous actions. Experiments on the D4RL and NeoRL2 benchmarks demonstrate robust gains across diverse domains, achieving state-of-the-art performance overall.
💡 Research Summary
Offline reinforcement learning (RL) suffers from severe extrapolation errors when the learned policy selects actions that are out‑of‑distribution (OOD) with respect to the static dataset. Implicit Q‑Learning (IQL) mitigates this problem by using expectile regression to perform in‑sample learning, but it has two major drawbacks: (1) it relies on a fixed expectile hyperparameter τ that must be manually tuned for each dataset, and (2) its policy improvement step is essentially a one‑step method that draws actions directly from the behavior policy, limiting the potential for substantial policy improvement.
The paper introduces Projective IQL (PIQL), a novel variant that addresses both issues. In the policy‑evaluation stage, PIQL replaces the static τ with an adaptive, projection‑based parameter τ_proj(a|s). This parameter is computed as the projection of the behavior policy vector π_β(a|s) onto the current policy vector π_φ(a|s):
τ_proj(a|s) = (π_β(a|s)·π_φ(a|s)) / ‖π_φ(a|s)‖² · π_φ(a|s).
When π_φ aligns closely with π_β, the projection is large, yielding a higher τ and a more optimistic value estimate; when the policies diverge, τ_proj shrinks, producing a conservative estimate. Consequently, τ is no longer a hand‑tuned constant but a data‑driven quantity that automatically balances optimism and conservatism, eliminating costly hyper‑parameter search.
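This adaptive behavior can be illustrated with a small sketch. The function below computes the scalar coefficient of the projection of π_β onto π_φ and clips it to (0, 1) so it can serve as an expectile; the clipping and the function name are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def tau_proj(pi_beta, pi_phi, eps=1e-8):
    """Illustrative projection-based expectile parameter (a sketch,
    not the paper's exact formulation).

    Returns the coefficient of the vector projection of the behavior
    policy pi_beta onto the current policy pi_phi, clipped to [0, 1]
    so it is a valid expectile. The clipping is an assumption made
    here for illustration.
    """
    coef = np.dot(pi_beta, pi_phi) / (np.dot(pi_phi, pi_phi) + eps)
    return float(np.clip(coef, 0.0, 1.0))

# Aligned policies yield a large coefficient (optimistic tau);
# diverged policies yield a small one (conservative tau).
pi_beta = np.array([0.7, 0.2, 0.1])
aligned = np.array([0.6, 0.3, 0.1])
diverged = np.array([0.1, 0.2, 0.7])
print(tau_proj(pi_beta, aligned) > tau_proj(pi_beta, diverged))  # True
```

The comparison prints `True` because the aligned pair has a much larger inner product with π_β, matching the optimism/conservatism trade-off described above.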
PIQL also extends the one‑step value update of IQL to a multi‑step formulation. The expectile loss for the value network V_ψ becomes
L_V(ψ) = E_{(s,a)∼D}
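The asymmetric squared (expectile) loss at the heart of this update, L₂^τ(u) = |τ − 𝟙(u < 0)| u², can be sketched as follows; the numpy version here is a minimal illustration, not the paper's implementation.

```python
import numpy as np

def expectile_loss(u, tau):
    """Asymmetric squared loss: L2^tau(u) = |tau - 1(u < 0)| * u**2.

    u is the residual Q(s, a) - V(s); tau in (0, 1) controls the
    asymmetry of the penalty.
    """
    weight = np.where(u < 0, 1.0 - tau, tau)
    return weight * u ** 2

# With tau = 0.9, a positive residual (Q above V) is weighted 9x more
# than a negative one, so minimizing the loss pushes V_psi toward an
# upper expectile of the Q distribution.
residuals = np.array([1.0, -1.0])
print(expectile_loss(residuals, 0.9))  # [0.9 0.1]
```

Setting τ = 0.5 recovers the ordinary squared loss (up to a constant factor), which is why larger τ corresponds to more optimistic value estimates.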