Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL
Offline goal-conditioned reinforcement learning remains challenging for long-horizon tasks. While hierarchical approaches mitigate this issue by decomposing tasks, most existing methods rely on separate high- and low-level networks and generate only a single intermediate subgoal, making them inadequate for complex tasks that require coordinating multiple intermediate decisions. To address this limitation, we draw inspiration from the chain-of-thought paradigm and propose the Chain-of-Goals Hierarchical Policy (CoGHP), a novel framework that reformulates hierarchical decision-making as autoregressive sequence modeling within a unified architecture. Given a state and a final goal, CoGHP autoregressively generates a sequence of latent subgoals followed by the primitive action, where each latent subgoal acts as a reasoning step that conditions subsequent predictions. To implement this efficiently, we pioneer the use of an MLP-Mixer backbone, which supports cross-token communication and captures structural relationships among state, goal, latent subgoals, and action. Across challenging navigation and manipulation benchmarks, CoGHP consistently outperforms strong offline baselines, demonstrating improved performance on long-horizon tasks.
💡 Research Summary
This paper tackles the long‑horizon challenge of offline goal‑conditioned reinforcement learning (GCRL) by rethinking hierarchical decision‑making as an autoregressive sequence‑generation problem. The authors introduce the Chain‑of‑Goals Hierarchical Policy (CoGHP), a unified model that, given the current state and a final goal, generates a series of latent subgoals followed by a primitive action. Each latent subgoal acts as a reasoning step that conditions subsequent predictions, mirroring the chain‑of‑thought paradigm popular in large language models.
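To make the generation process concrete, here is a minimal sketch of the chain-of-goals rollout described above. The `model` callable and its `role` argument are hypothetical stand-ins for CoGHP's unified network (the paper's actual interface may differ); the point is that subgoals and the action come from one model, each conditioned on all tokens generated so far.

```python
import numpy as np

def chain_of_goals_rollout(model, state, goal, num_subgoals):
    """Autoregressively generate latent subgoals, then the primitive action.

    `model(tokens, role)` is a hypothetical single-network call that maps the
    token sequence produced so far to the next token (a latent subgoal or,
    at the final step, the primitive action).
    """
    tokens = [state, goal]           # sequence is seeded with state and final goal
    subgoals = []
    for _ in range(num_subgoals):    # each new subgoal conditions later predictions
        subgoal = model(tokens, role="subgoal")
        tokens.append(subgoal)
        subgoals.append(subgoal)
    action = model(tokens, role="action")  # last token is the primitive action
    return subgoals, action
```

A dummy `model` (e.g. one that averages the tokens seen so far) is enough to exercise the loop and check that every prediction depends only on the prefix before it.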
To implement this, the authors adopt an MLP‑Mixer backbone rather than a transformer. The Mixer alternates token‑mixing and channel‑mixing MLP layers, enabling efficient cross‑token communication while respecting the fixed semantic role of each token (state, goal, subgoals, action). A causal mixer is added so that each token can only attend to previously generated tokens, preserving the autoregressive nature of the process. Subgoals are generated in reverse order—from the farthest from the agent to the nearest—ensuring that the subgoal closest to the current state receives the richest contextual information.
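The causal constraint on the token-mixing step can be illustrated with a small sketch. This is not the paper's implementation, just the standard trick of masking the token-mixing weight matrix to its lower triangle so that token i can only combine tokens 0..i:

```python
import numpy as np

def causal_token_mix(x, w):
    """Causal token-mixing step of an MLP-Mixer layer (illustrative sketch).

    x: (num_tokens, channels) token embeddings, ordered as
       [state, goal, subgoal_H, ..., subgoal_1, action].
    w: (num_tokens, num_tokens) learned token-mixing weights.

    Zeroing the upper triangle of w removes contributions from future tokens,
    preserving the autoregressive order of the sequence.
    """
    causal_w = np.tril(w)   # keep only weights from current and earlier tokens
    return causal_w @ x     # mix information across the token axis
```

Changing a later token should leave every earlier token's output untouched, which is exactly the property the causal mixer is meant to guarantee.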
Training leverages a goal‑conditioned Implicit Q‑Learning (IQL) value function Vψ(s,g) shared across all sequence elements. Both subgoals and the final action are optimized with an advantage‑weighted regression loss that uses the same advantage estimate derived from Vψ. This shared objective enables end‑to‑end gradient flow across the entire hierarchy, eliminating the fragmented optimization that plagues traditional two‑level hierarchical RL.
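A hedged sketch of the shared advantage weight: here the advantage of a dataset subgoal `sg` is estimated as V_ψ(sg, g) − V_ψ(s, g), one common choice in hierarchical IQL-style methods; the paper's exact estimator may differ. The same weight then multiplies the regression loss toward the dataset subgoal or action.

```python
import numpy as np

def awr_weight(v, s, sg, g, beta=1.0, clip=100.0):
    """Advantage weight shared by the subgoal and action losses (sketch).

    v(s, g) is a goal-conditioned IQL value function. The advantage of
    moving to sg en route to g is approximated as v(sg, g) - v(s, g)
    (an assumed estimator, not necessarily the paper's). The resulting
    exp(A / beta) weight is clipped for numerical stability.
    """
    advantage = v(sg, g) - v(s, g)
    return min(np.exp(advantage / beta), clip)
```

With a toy value function such as negative distance-to-goal, subgoals that move closer to the goal receive larger weights, and a subgoal that makes no progress gets weight 1.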
Empirical evaluation spans challenging navigation tasks (e.g., Ant‑Maze, Mini‑Room) and manipulation suites that require multiple intermediate decisions. CoGHP consistently outperforms strong offline baselines such as HIRO‑IQL and HAC‑IQL in terms of success rate, return, and sample efficiency. Ablation studies show that (1) replacing the Mixer with a transformer degrades stability, (2) removing the causal mixer harms the sequential dependency, and (3) generating subgoals in forward order reduces performance. The method also scales well, showing little performance loss as the number of subgoals H increases.
In summary, CoGHP demonstrates that importing chain‑of‑thought reasoning into offline GCRL yields a single, scalable architecture capable of handling multiple intermediate subgoals, preserving final‑goal awareness, and supporting end‑to‑end learning. This work opens a new direction for long‑horizon robotic control and complex decision‑making in settings where interaction with the environment is limited or unsafe.