CRL-VLA: Continual Vision-Language-Action Learning


Lifelong learning is critical for embodied agents in open-world environments, where reinforcement learning fine-tuning has emerged as an important paradigm to enable Vision-Language-Action (VLA) models to master dexterous manipulation through environmental interaction. Thus, Continual Reinforcement Learning (CRL) is a promising pathway for deploying VLA models in lifelong robotic scenarios, yet balancing stability (retaining old skills) and plasticity (learning new ones) remains a formidable challenge for existing methods. We introduce CRL-VLA, a framework for continual post-training of VLA models with rigorous theoretical bounds. We derive a unified performance bound linking the stability-plasticity trade-off to goal-conditioned advantage magnitude, scaled by policy divergence. CRL-VLA resolves this dilemma via asymmetric regulation: constraining advantage magnitudes on prior tasks while enabling controlled growth on new tasks. This is realized through a simple but effective dual-critic architecture with novel Goal-Conditioned Value Formulation (GCVF), where a frozen critic anchors semantic consistency and a trainable estimator drives adaptation. Experiments on the LIBERO benchmark demonstrate that CRL-VLA effectively harmonizes these conflicting objectives, outperforming baselines in both anti-forgetting and forward adaptation.


💡 Research Summary

The paper introduces CRL‑VLA, a continual learning framework tailored for Vision‑Language‑Action (VLA) models that must adapt to a stream of robotic manipulation tasks. The authors first identify the core cause of catastrophic forgetting in continual VLA learning: the magnitude of the goal‑conditioned advantage (M_g). By defining M_g as the maximum absolute advantage of an anchored (usually previous) policy over the state‑action distribution of a new policy, they link performance change directly to both M_g and the KL‑divergence between policies.
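As a rough illustration of this definition (not the paper's implementation), M_g can be estimated empirically by sampling state–action pairs from the new policy's distribution and taking the maximum absolute advantage under the anchored policy. The advantage table and sampling below are hypothetical placeholders:

```python
import numpy as np

def estimate_Mg(advantage_fn, state_action_samples):
    """Empirical estimate of M_g = max |A_anchor(s, a)| over state-action
    pairs drawn from the new policy's distribution."""
    return max(abs(advantage_fn(s, a)) for s, a in state_action_samples)

# Toy example: a hypothetical tabular advantage function for the
# anchored (previous) policy, enumerated exhaustively.
rng = np.random.default_rng(0)
A = {(s, a): rng.normal() for s in range(4) for a in range(2)}
samples = [(s, a) for s in range(4) for a in range(2)]
M_g = estimate_Mg(lambda s, a: A[(s, a)], samples)
```

Since the samples enumerate every state–action pair here, the estimate equals the true maximum; with finite samples from a real policy it would be a lower bound.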

Theoretical contributions include Theorem 4.1, which provides unified bounds for both stability (old‑task performance degradation) and plasticity (new‑task performance improvement):

  • J_old(π_new) – J_old(π_old) ≤ 2γ/(1‑γ)² · M_old · D_old
  • J_new(π_new) – J_new(π_old) ≤ 1/(1‑γ) · M_new · D_new

Here D_old and D_new are expected KL divergences under the old‑task and new‑task state distributions, respectively. The bounds reveal that to preserve old skills we must keep M_old small (by limiting value‑approximation error), while to learn new skills we need a sufficiently large M_new (naturally bounded by the environment’s return range) together with controlled D_new.
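To make the trade-off concrete, here is a small numeric sketch that evaluates the right-hand sides of both bounds; the values of γ, M, and D are illustrative choices, not numbers from the paper:

```python
gamma = 0.99  # illustrative discount factor

def stability_bound(M_old, D_old, gamma):
    """Right-hand side of the stability bound: 2γ/(1-γ)² · M_old · D_old."""
    return 2 * gamma / (1 - gamma) ** 2 * M_old * D_old

def plasticity_bound(M_new, D_new, gamma):
    """Right-hand side of the plasticity bound: 1/(1-γ) · M_new · D_new."""
    return M_new * D_new / (1 - gamma)

# At equal divergence, the stability bound's (1-γ)² denominator makes it
# far more sensitive to M_old than the plasticity bound is to M_new,
# motivating the asymmetric regulation of advantage magnitudes.
s = stability_bound(M_old=0.1, D_old=0.01, gamma=gamma)
p = plasticity_bound(M_new=1.0, D_new=0.01, gamma=gamma)
```

With these illustrative numbers the stability bound evaluates to 19.8 while the plasticity bound is 1.0, which is why keeping M_old small matters so much more than constraining M_new.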

Guided by this analysis, the authors propose two orthogonal control mechanisms:

  1. V‑only path – a frozen “anchor” critic that provides a stable value estimate for previous tasks. By minimizing the critic’s approximation error ε_V on replayed old‑task data, M_old can be tightly bounded (Corollary 4.1).
  2. Natural boundedness path – Monte‑Carlo (MC) return estimates on new tasks, which inherently limit M_new within the environment’s return range.
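The two paths can be sketched as a dual-critic setup. Everything below (tabular critics, the update rule, the toy task data) is a hypothetical simplification for illustration, not the paper's GCVF architecture:

```python
import numpy as np

class DualCritic:
    """Toy tabular dual-critic: a frozen anchor preserves old-task value
    estimates, while a trainable estimator tracks Monte-Carlo returns."""

    def __init__(self, n_states, lr=0.5):
        self.anchor = np.zeros(n_states)     # frozen after old-task training
        self.trainable = np.zeros(n_states)  # adapts to new tasks
        self.lr = lr

    def freeze_anchor(self):
        # V-only path: snapshot the trained critic as the stability anchor,
        # keeping the value-approximation error (and hence M_old) small.
        self.anchor = self.trainable.copy()

    def update_new_task(self, state, mc_return):
        # Natural boundedness path: MC targets live inside the environment's
        # return range, so the trainable estimate (and M_new) stays bounded.
        self.trainable[state] += self.lr * (mc_return - self.trainable[state])

    def value(self, state, old_task):
        # Old tasks read the frozen anchor; new tasks read the live estimator.
        return self.anchor[state] if old_task else self.trainable[state]

critic = DualCritic(n_states=3)
for _ in range(50):
    critic.update_new_task(0, mc_return=1.0)   # train on the "old" task
critic.freeze_anchor()
for _ in range(50):
    critic.update_new_task(0, mc_return=-1.0)  # adapt to the "new" task
```

After adaptation, the frozen anchor still reports the old-task value (≈1.0) while the trainable estimator has moved to the new-task return (≈−1.0), which is the stability–plasticity separation the framework aims for.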
