Hierarchical Subspaces of Policies for Continual Offline Reinforcement Learning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We consider a Continual Reinforcement Learning setup in which a learning agent must continuously adapt to new tasks while retaining previously acquired skills, with a focus on avoiding the forgetting of previously gathered knowledge and on ensuring scalability as the number of tasks grows. Such issues prevail in autonomous robotics and video game simulations, notably for navigation tasks prone to topological or kinematic changes. To address them, we introduce HiSPO, a novel hierarchical framework designed specifically for continual learning in navigation settings from offline data. Our method leverages distinct policy subspaces of neural networks to enable flexible and efficient adaptation to new tasks while preserving existing knowledge. Through a careful experimental study, we demonstrate the effectiveness of our method in both classical MuJoCo maze environments and complex video-game-like navigation simulations, showing competitive performance and satisfactory adaptability with respect to standard continual learning metrics, in particular memory usage and efficiency.


💡 Research Summary

The paper introduces HiSPO (Hierarchical Subspaces of Policies), a novel framework for continual offline reinforcement learning (CORL) focused on goal‑conditioned navigation tasks. Traditional continual reinforcement learning (CRL) methods largely operate in an online setting and rely on replay buffers, regularization, or network expansion to mitigate catastrophic forgetting. These approaches face challenges in offline scenarios where data storage is limited, privacy concerns arise, and environments may change over time. HiSPO addresses these issues by combining two key ideas: (1) representing policies as points within low‑dimensional subspaces spanned by a set of anchor parameters, and (2) structuring the policy hierarchically into a high‑level planner and a low‑level controller, each with its own subspace.
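The convex-combination idea above can be made concrete with a short sketch. This is an illustrative toy (flat NumPy parameter vectors standing in for network weights), not the paper's implementation; the helper name `combine_anchors` is hypothetical.

```python
import numpy as np

def combine_anchors(anchors, alpha):
    """Form a policy parameter vector theta as a convex combination of
    anchor vectors theta_i: theta = sum_i alpha_i * theta_i,
    with alpha_i >= 0 and sum_i alpha_i = 1."""
    alpha = np.asarray(alpha, dtype=float)
    assert np.all(alpha >= 0) and np.isclose(alpha.sum(), 1.0)
    return sum(a * theta for a, theta in zip(alpha, anchors))

# Two toy anchors in a 3-dimensional parameter space.
anchors = [np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 0.0])]
theta = combine_anchors(anchors, [0.5, 0.5])  # midpoint of the segment
```

Any policy in the subspace is thus identified by a low-dimensional weight vector α rather than a full set of network parameters, which is what makes adaptation cheap.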

In the subspace formulation, a policy parameter vector θ is expressed as a convex combination of anchors θ_i with weights α_i (α_i ≥ 0, Σα_i = 1). This reduces the effective dimensionality of adaptation and enables efficient reuse of previously learned knowledge. When a new task arrives, HiSPO either expands the subspace by adding a new anchor (initialized via Low‑Rank Adaptation, LoRA) or explores the existing subspace by sampling α from a Dirichlet distribution and selecting the configuration that minimizes the loss on a small batch of the new task’s offline dataset. A simple criterion based on a tolerance ε decides whether the new anchor should be kept (if it yields a significantly lower loss) or pruned (if the previous subspace performs comparably). This “extend‑explore‑prune” loop repeats for each task, ensuring that the number of parameters grows only linearly with the number of anchors, while preserving past performance.
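The extend-explore-prune loop can be sketched as follows. This is a simplified sketch under stated assumptions: a flat Dirichlet prior over anchor weights, a generic `loss_fn` evaluated on a small batch of the new task's offline data, and a plain loss-gap test against the tolerance ε; the function names are hypothetical and the paper's actual procedure (including the LoRA-based anchor initialization) may differ in detail.

```python
import numpy as np

def best_alpha_in_subspace(anchors, loss_fn, n_samples=64, rng=None):
    """Explore: sample mixture weights alpha from a flat Dirichlet over
    the anchors and keep the combination with the lowest batch loss."""
    rng = rng or np.random.default_rng(0)
    best_alpha, best_loss = None, np.inf
    for _ in range(n_samples):
        alpha = rng.dirichlet(np.ones(len(anchors)))
        theta = sum(a * t for a, t in zip(alpha, anchors))
        loss = loss_fn(theta)
        if loss < best_loss:
            best_alpha, best_loss = alpha, loss
    return best_alpha, best_loss

def extend_explore_prune(anchors, new_anchor, loss_fn, eps=0.05):
    """Extend the subspace with a candidate anchor, then keep it only if
    it improves the best loss by more than the tolerance eps; otherwise
    prune it and reuse the existing subspace."""
    _, old_loss = best_alpha_in_subspace(anchors, loss_fn)
    alpha_new, new_loss = best_alpha_in_subspace(anchors + [new_anchor], loss_fn)
    if old_loss - new_loss > eps:
        return anchors + [new_anchor], alpha_new  # keep the new anchor
    return anchors, None                          # prune: old subspace suffices
```

Because anchors are only kept when they pay for themselves, the parameter count grows at most linearly in the number of retained anchors.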

The hierarchical aspect separates long‑term planning from short‑term execution. The high‑level policy π_h receives the current state s_t and goal g, and predicts a sub‑goal ϕ(s_{t+k}) k steps ahead. The low‑level policy π_l receives (s_t, sub‑goal) and outputs the action a_t that moves the agent toward the sub‑goal. Both policies are trained offline using hierarchical imitation learning on pre‑collected expert trajectories, augmented with Hindsight Experience Replay (HER) to relabel goals and increase data efficiency. This decomposition is particularly suited to navigation problems where topological changes (affecting planning) and kinematic changes (affecting control) can be addressed by the respective subspaces.
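A minimal sketch of this decomposition and of hindsight relabeling is given below. It assumes the policies are plain callables and that HER relabels each transition with the trajectory's final state as the goal and the state k steps ahead as the sub-goal target; the paper's exact relabeling scheme and sub-goal representation ϕ may differ, and both function names are hypothetical.

```python
import numpy as np

def hierarchical_step(pi_h, pi_l, state, goal):
    """One decision step: the high-level policy pi_h proposes a sub-goal
    (a representation of the state k steps ahead), and the low-level
    policy pi_l outputs the action that moves the agent toward it."""
    sub_goal = pi_h(state, goal)   # stands in for phi(s_{t+k})
    action = pi_l(state, sub_goal)
    return action, sub_goal

def her_relabel(trajectory, k=5):
    """Hindsight relabeling: treat states actually reached later in the
    trajectory as goals, so every offline trajectory yields successful
    goal-reaching supervision. Returns (state, goal, sub_goal, action)
    tuples for hierarchical imitation learning."""
    states, actions = trajectory["states"], trajectory["actions"]
    samples = []
    for t in range(len(actions)):
        goal = states[-1]                              # achieved final state
        sub_goal = states[min(t + k, len(states) - 1)]  # state k steps ahead
        samples.append((states[t], goal, sub_goal, actions[t]))
    return samples
```

The split matters for continual learning: a topological change mostly perturbs the high-level subspace, while a kinematic change mostly perturbs the low-level one.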

Experimental evaluation is conducted on two benchmark suites: (a) classic MuJoCo maze environments that involve substantial topological modifications, and (b) a complex video‑game‑style navigation simulator built from human‑authored datasets, featuring diverse kinematic dynamics. The authors assess performance using standard continual learning metrics: average success rate (PER), backward transfer (BWT), forward transfer (FWT), and relative memory usage (MEM). HiSPO achieves competitive or superior PER while dramatically reducing MEM compared to replay‑based methods and progressive neural networks. BWT and FWT results show that HiSPO mitigates forgetting and facilitates knowledge transfer across tasks, thanks to its subspace reuse and hierarchical separation.
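For reference, these metrics can be computed from a task-by-task performance matrix. The sketch below uses the common definitions (in the style of Lopez-Paz and Ranzato's GEM metrics) with success rates in place of accuracies; the paper's exact formulas, and its MEM metric in particular, may be defined differently.

```python
import numpy as np

def crl_metrics(R, b):
    """Continual-learning metrics from a performance matrix R, where
    R[i, j] is the success rate on task j after training on task i,
    and b[j] is the success rate of a freshly initialized policy on
    task j (the forward-transfer baseline).

    Returns (PER, BWT, FWT):
      PER: average final success rate over all tasks,
      BWT: backward transfer (negative values indicate forgetting),
      FWT: forward transfer (zero-shot gain on yet-unseen tasks)."""
    T = R.shape[0]
    per = R[-1].mean()
    bwt = np.mean([R[-1, j] - R[j, j] for j in range(T - 1)])
    fwt = np.mean([R[j - 1, j] - b[j] for j in range(1, T)])
    return per, bwt, fwt
```

With this bookkeeping, a method that forgets shows BWT well below zero, while subspace reuse should push FWT above the from-scratch baseline.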

Key contributions of the work include: (1) a practical offline continual learning algorithm that leverages hierarchical policy subspaces for scalable adaptation; (2) a new benchmark comprising both robotic and video‑game navigation tasks with publicly released datasets; (3) extensive empirical comparisons demonstrating that HiSPO outperforms state‑of‑the‑art CRL baselines in memory efficiency, adaptability, and overall task performance. The paper also discusses limitations such as the stochastic nature of Dirichlet‑based weight exploration and the computational overhead of maintaining separate high‑ and low‑level networks. Future directions suggested involve Bayesian optimization for anchor weight selection, shared base networks with overlapping subspaces, and extending the approach to more complex, non‑goal‑conditioned domains. Overall, HiSPO represents a significant step toward memory‑efficient, scalable continual learning in offline reinforcement learning settings.

