Nipping the Drift in the Bud: Retrospective Rectification for Robust Vision-Language Navigation

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Vision-Language Navigation (VLN) requires embodied agents to interpret natural language instructions and navigate through complex continuous 3D environments. However, the dominant imitation learning paradigm suffers from exposure bias, where minor deviations during inference lead to compounding errors. While DAgger-style approaches attempt to mitigate this by correcting error states, we identify a critical limitation: Instruction-State Misalignment. Forcing an agent to learn recovery actions from off-track states often creates supervision signals that semantically conflict with the original instruction. In response to these challenges, we introduce BudVLN, an online framework that learns from on-policy rollouts by constructing supervision to match the current state distribution. BudVLN performs retrospective rectification via counterfactual re-anchoring and decision-conditioned supervision synthesis, using a geodesic oracle to synthesize corrective trajectories that originate from valid historical states, ensuring semantic consistency. Experiments on the standard R2R-CE and RxR-CE benchmarks demonstrate that BudVLN consistently mitigates distribution shift and achieves state-of-the-art performance in both Success Rate and SPL.


💡 Research Summary

Vision‑Language Navigation (VLN) in continuous 3‑D environments requires an embodied agent to follow natural‑language instructions while perceiving raw RGB images. The dominant training paradigm—imitation learning with teacher‑forcing—suffers from exposure bias: during training the policy sees only expert states, but at test time it must act from its own, potentially out‑of‑distribution, states. Interactive imitation learning methods such as DAgger mitigate this by collecting on‑policy data and providing corrective actions from error states. However, the authors identify a critical flaw they call “Instruction‑State Misalignment”: the recovery trajectories required to bring the agent back to the reference path often contradict the original instruction (e.g., turning back when the instruction says “walk straight”), thereby confusing the language grounding process.

To address both exposure bias and instruction‑state misalignment, the paper introduces BudVLN, an online training framework that learns directly from on‑policy rollouts and constructs supervision that matches the current state distribution. BudVLN operates in iterative rounds, each consisting of a low‑cost greedy probe followed by a dynamic routing decision:

  1. Proficiency Pathway (GRPO) – If the greedy probe succeeds (no failure triggers), the sample is deemed proficient. The agent then generates additional stochastic rollouts (group size G) and optimizes a Group Relative Policy Optimization (GRPO) objective. GRPO computes a scalar return for each trajectory that combines a success bonus, a weighted SPL term, and a penalty proportional to the remaining geodesic distance. Relative advantage is estimated using group statistics (mean and std) rather than a learned value function. The policy is updated with a PPO‑style clipped surrogate loss, augmented with a KL‑regularization term that keeps the new policy close to a reference policy. This encourages the discovery of shorter, more efficient paths while preserving stability.

  2. Rectification Pathway (SFT) – If the greedy probe triggers any of four failure conditions (off‑track, progress stall, premature stop, forced stop), the sample is classified as hard. Instead of exploring further, BudVLN performs “Retrospective Rectification”. It rolls back to the latest valid progress point on the reference trajectory, queries a geodesic oracle to synthesize a counterfactual corrective trajectory that proceeds forward from that point to the goal, and guarantees semantic consistency with the original instruction. This synthesized demonstration is used to compute a weighted Supervised Fine‑Tuning (SFT) loss, which directly updates the policy. By re‑anchoring to a valid historical state, the method avoids teaching the agent contradictory recovery actions.
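The GRPO objective in step 1 can be sketched in a few lines. This is a minimal illustration of the described return, group-relative advantage, and clipped surrogate, not the paper's implementation; the reward weights, the clipping range, and all function names here are assumptions:

```python
import math

def trajectory_return(success, spl, geo_dist_to_goal,
                      spl_weight=1.0, dist_weight=0.1):
    # Scalar return as described in the summary: a success bonus,
    # a weighted SPL term, and a penalty proportional to the
    # remaining geodesic distance. Weights are illustrative.
    return (1.0 if success else 0.0) + spl_weight * spl - dist_weight * geo_dist_to_goal

def group_relative_advantages(returns, eps=1e-8):
    # GRPO-style advantage estimation: normalize each trajectory's
    # return by the group mean and std instead of a learned critic.
    n = len(returns)
    mean = sum(returns) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in returns) / n)
    return [(r - mean) / (std + eps) for r in returns]

def clipped_surrogate(ratio, advantage, clip=0.2):
    # PPO-style clipped objective for a single action; the full loss
    # would average this over the group's trajectories and subtract a
    # KL penalty keeping the policy near a reference policy.
    clipped_ratio = max(min(ratio, 1 + clip), 1 - clip)
    return min(ratio * advantage, clipped_ratio * advantage)
```

Because the advantage is standardized within each group of G rollouts, trajectories that beat the group average push the policy toward shorter, more successful paths without requiring a value network.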

An “Adaptive Mutual Exclusion Strategy” ensures that GRPO and SFT are mutually exclusive for any given sample, preventing simultaneous gradient signals that could destabilize training. Consequently, BudVLN achieves high sample efficiency: it reaches performance comparable to or better than traditional DAgger pipelines while using roughly 25 % of the training budget.
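The per-sample routing between the two pathways, together with the rollback step that anchors rectification, can be sketched as follows. This is a hedged illustration: the failure flags mirror the four conditions named above, but the dictionary interface, the distance threshold, and the function names are all assumptions, and the geodesic-oracle planning step is left as a stub:

```python
import math

# The four failure conditions named in the summary.
FAILURE_FLAGS = ("off_track", "progress_stall", "premature_stop", "forced_stop")

def route_sample(probe_result):
    # Mutual-exclusion routing: each sample takes exactly one pathway
    # per round, so GRPO and SFT gradients never mix on one sample.
    if any(probe_result.get(flag, False) for flag in FAILURE_FLAGS):
        return "rectification_sft"   # hard sample -> retrospective rectification
    return "proficiency_grpo"        # proficient sample -> GRPO exploration

def latest_valid_progress_point(agent_positions, reference_path, radius=0.5):
    # Retrospective rectification rolls back to the most recent agent
    # state still within `radius` of the reference trajectory; from
    # there a geodesic oracle (not shown) plans forward to the goal,
    # keeping the synthesized demonstration instruction-consistent.
    for t in range(len(agent_positions) - 1, -1, -1):
        if min(math.dist(agent_positions[t], w) for w in reference_path) <= radius:
            return t
    return 0  # fall back to the start if the agent never left the path's vicinity
```

Because the corrective demonstration begins at a state that was still on-instruction and only moves forward toward the goal, the SFT targets never encode the contradictory "turn back" actions that plague naive DAgger recovery.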

The authors evaluate BudVLN on the standard continuous VLN benchmarks R2R‑CE and RxR‑CE. BudVLN consistently outperforms prior state‑of‑the‑art methods, achieving notable gains in both Success Rate (SR) and SPL. For example, on R2R‑CE it attains SR = 73.2 % and SPL = 61.4 %, and on RxR‑CE SR = 68.5 % and SPL = 57.2 %. Qualitative analyses show that the retrospective rectification trajectories remain aligned with the linguistic instructions, effectively preventing error propagation.

In summary, BudVLN makes three key contributions: (1) an online training loop that directly mitigates exposure bias, (2) a novel retrospective rectification mechanism that resolves instruction‑state misalignment by synthesizing forward‑looking, semantically consistent supervision, and (3) a synergistic combination of GRPO for efficient exploration and SFT for targeted correction, delivering state‑of‑the‑art performance on continuous VLN tasks.

