General Flexible $f$-divergence for Challenging Offline RL Datasets with Low Stochasticity and Diverse Behavior Policies
Offline RL algorithms aim to improve upon the behavior policy that produced the collected data while constraining the learned policy to stay within the support of the dataset. However, practical offline datasets often exhibit little diversity or limited exploration of the environment, and may be collected from multiple behavior policies with diverse expertise levels. Limited exploration can impair an offline RL algorithm's ability to estimate $Q$ or $V$ values, while constraining towards diverse behavior policies can be overly conservative. Such datasets call for a balance between the RL objective and behavior-policy constraints. We first identify the connection between the $f$-divergence and the optimization constraint on the Bellman residual through a more general linear programming (LP) form of RL and the convex conjugate. We then introduce a general flexible function formulation of the $f$-divergence that adapts the constraint on the algorithm's learning objective to the offline training dataset. Experiments on the MuJoCo, Fetch, and AdroitHand environments confirm the correctness of the proposed LP form and demonstrate the potential of the flexible $f$-divergence to improve performance when learning from a challenging dataset with a compatible constrained-optimization algorithm.
💡 Research Summary
Offline reinforcement learning (RL) has become a promising paradigm for leveraging static datasets to train agents without further environment interaction. However, many real‑world offline datasets suffer from two intertwined challenges: (1) low stochasticity or limited exploration, which hampers accurate estimation of Q‑functions or value functions, and (2) a mixture of behavior policies with varying levels of expertise, which makes a single, overly conservative constraint inappropriate. Existing offline RL methods typically enforce a pessimistic constraint that forces the learned policy to stay within the support of the dataset, often by adding a regularization term based on a fixed f‑divergence (e.g., KL or χ²). While effective on benchmark suites such as D4RL, these approaches can be too restrictive when the data are near‑deterministic or when multiple policies with disparate performance coexist.
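To make the fixed-regularizer setup above concrete, here is a minimal sketch (not code from the paper) of how a KL or χ² divergence between the learned occupancy d and the dataset occupancy d_D would be estimated from density ratios on samples drawn under d_D, using the standard generator functions f(x) = x log x and f(x) = (x − 1)²:

```python
import math

# Generator functions for two common f-divergences; f(1) = 0 in both cases.
# These are the standard textbook definitions, not the paper's code.
F_GENERATORS = {
    "kl": lambda x: x * math.log(x),       # KL(d || d_D)
    "chi2": lambda x: (x - 1.0) ** 2,      # chi-squared divergence
}

def f_divergence(ratios, weights, kind="kl"):
    """Estimate D_f(d || d_D) = E_{d_D}[f(d/d_D)] from samples drawn
    under d_D, given density ratios d(s,a)/d_D(s,a) and sample weights."""
    f = F_GENERATORS[kind]
    return sum(w * f(r) for r, w in zip(ratios, weights))

# If the learned occupancy matches the data occupancy (all ratios 1),
# both divergences vanish, so the regularizer exerts no pressure.
uniform = [0.25] * 4
print(f_divergence([1.0, 1.0, 1.0, 1.0], uniform, "kl"))    # 0.0
print(f_divergence([1.0, 1.0, 1.0, 1.0], uniform, "chi2"))  # 0.0
```

A fixed choice of `kind` is exactly the rigidity the paper targets: the penalty grows at the same rate for every dataset, whether the data come from one near-deterministic expert or a mixture of policies.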
The paper first revisits the classic linear programming (LP) formulation of RL. In the primal LP, the value function V is minimized subject to Bellman inequality constraints; the dual LP maximizes the expected return over occupancy measures d while enforcing the Bellman flow constraints. By introducing a Lagrange multiplier ζ that represents the density ratio between the learned occupancy and the dataset occupancy, ζ(s,a) = d(s,a)/d_D(s,a), the authors show that the dual objective naturally contains an f‑divergence term D_f(d‖d_D) = E_{(s,a)∼d_D}[f(ζ(s,a))], which the convex conjugate f* then connects to a constraint on the Bellman residual.
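The convex-conjugate step used in this dual derivation can be checked numerically. The sketch below (an illustration under standard definitions, not the paper's implementation) takes the χ² generator f(x) = (x − 1)², whose conjugate is f*(y) = sup_x [xy − f(x)] = y + y²/4, and verifies both the closed form and the biconjugation f** = f that makes swapping between the divergence and its conjugate lossless for convex f:

```python
# Numerical check that the convex conjugate of the chi-squared generator
# f(x) = (x - 1)^2 is f*(y) = y + y^2/4, and that the biconjugate f**
# recovers f. Biconjugation is what lets the Lagrangian dual replace the
# divergence term with a pointwise maximization over the Bellman residual.

def f(x):
    return (x - 1.0) ** 2

def f_star(y):
    # Closed form: sup_x [x*y - f(x)], attained at x = 1 + y/2.
    return y + y * y / 4.0

def _grid(lo, hi, steps):
    return (lo + (hi - lo) * i / (steps - 1) for i in range(steps))

def f_star_numeric(y, lo=-10.0, hi=10.0, steps=100_001):
    # Brute-force the supremum over a fine grid of x values.
    return max(x * y - f(x) for x in _grid(lo, hi, steps))

def f_biconjugate(x, lo=-10.0, hi=10.0, steps=100_001):
    # f**(x) = sup_y [x*y - f*(y)]; equals f(x) because f is convex.
    return max(y * x - f_star(y) for y in _grid(lo, hi, steps))

for y in [-2.0, 0.0, 1.5]:
    assert abs(f_star(y) - f_star_numeric(y)) < 1e-4
for x in [0.5, 1.0, 3.0]:
    assert abs(f(x) - f_biconjugate(x)) < 1e-4
print("conjugate checks passed")
```

The same check applies to any convex generator, which is what makes the paper's flexible $f$ formulation possible: changing $f$ changes the conjugate, and hence the shape of the induced constraint, without altering the LP machinery.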