Bridging the Indoor-Outdoor Gap: Vision-Centric Instruction-Guided Embodied Navigation for the Last Meters
Embodied navigation holds significant promise for real-world applications such as last-mile delivery. However, most existing approaches are confined to either indoor or outdoor environments and rely heavily on strong assumptions, such as access to precise coordinate systems. While current outdoor methods can guide agents to the vicinity of a target using coarse-grained localization, they fail to enable fine-grained entry through specific building entrances, critically limiting their utility in practical deployment scenarios that require seamless outdoor-to-indoor transitions. To bridge this gap, we introduce a novel task: out-to-in prior-free instruction-driven embodied navigation. This formulation explicitly eliminates reliance on accurate external priors, requiring agents to navigate solely based on egocentric visual observations guided by instructions. To tackle this task, we propose a vision-centric embodied navigation framework that leverages image-based prompts to drive decision-making. Additionally, we present the first open-source dataset for this task, featuring a pipeline that integrates trajectory-conditioned video synthesis into the data generation process. Through extensive experiments, we demonstrate that our proposed method consistently outperforms state-of-the-art baselines across key metrics including success rate and path efficiency.
💡 Research Summary
The paper tackles a largely overlooked problem in embodied navigation: the “out‑to‑in” transition, where an autonomous agent must move from an outdoor environment into a specific indoor destination without relying on any external priors such as GPS coordinates, semantic maps, or detailed textual descriptions. Existing VLN methods built on indoor benchmarks such as R2R and Matterport3D assume a fully indoor setting, while outdoor point‑to‑point approaches depend heavily on precise positional data and can only bring the agent to the vicinity of a target. This gap is critical for real‑world applications like last‑mile delivery, where a robot must locate a storefront entrance and enter it to complete a task.
To address this, the authors introduce a new task—out‑to‑in prior‑free instruction‑driven embodied navigation—and propose a complete solution called BridgeNav together with the first open‑source dataset for this scenario, BridgeNavDataset. The dataset is generated through a novel trajectory‑conditioned video synthesis pipeline that produces coherent street‑view sequences, automatically annotates entrance locations and signage, and provides A*‑derived waypoint trajectories, all without any manual mapping or GPS data.
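The summary says the dataset's waypoint trajectories are A*‑derived. As a minimal sketch of what that planning step could look like, here is A* on a toy 2‑D occupancy grid; the grid layout, function name, 4‑connectivity, and Manhattan heuristic are illustrative assumptions, not details from the paper:

```python
import heapq
from itertools import count

def astar_waypoints(grid, start, goal):
    """Plan waypoints on a 2-D occupancy grid (0 = free, 1 = blocked) with A*.
    Returns a list of (row, col) cells from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    tie = count()                       # tie-breaker so the heap never compares nodes
    open_set = [(h(start), 0, next(tie), start, None)]
    parents, g_cost = {}, {start: 0}
    while open_set:
        _, g, _, node, parent = heapq.heappop(open_set)
        if node in parents:             # already expanded via a cheaper route
            continue
        parents[node] = parent
        if node == goal:                # walk parent links back to the start
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] == 0
                    and g + 1 < g_cost.get(nxt, float("inf"))):
                g_cost[nxt] = g + 1
                heapq.heappush(open_set, (g + 1 + h(nxt), g + 1, next(tie), nxt, node))
    return None

# toy street-block grid: the wall in row 1 forces a detour through (1, 2)
grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
path = astar_waypoints(grid, (0, 0), (2, 0))
```

In the actual pipeline the "grid" would come from the synthesized street‑view geometry rather than a hand‑written map, but the waypoint output has the same shape: an ordered list of positions for the agent to follow.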
BridgeNav Architecture
The system consists of four tightly integrated modules:
- Vision and Language Encoders – An RGB frame is processed by a Vision Transformer (ViT) to obtain visual tokens; the short instruction (e.g., “go to Starbucks”) is embedded via a learnable word embedding layer.
- Latent Intention Inference – Recognizing that visual relevance changes with distance, this module uses stacked Transformers and a regression head to predict a set of “intent tokens” that highlight the most salient regions in the current view. When far from the goal, the model looks for the target building; at medium range it focuses on signage; when near, it attends to the entrance itself.
- Optical‑Flow‑Guided Dynamic Perception – A lightweight RAFT optical‑flow estimator computes the flow field between the current frame and the next predicted frame. The top‑k pixels with the largest flow magnitude are masked, and a decoder reconstructs only these regions, effectively teaching the agent to “imagine” which parts of the scene will change as a result of its motion. This future‑state imagination bridges perception and planning.
- Multimodal Large Language Model (Qwen2.5‑VL‑3B) – The visual tokens, intent tokens, and instruction embeddings are fused via cross‑attention in a large vision‑language model. Learnable initial tokens are inserted to serve as trajectory anchors. The final decoder outputs a sequence of future waypoints (typically five steps) in Euclidean space.
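The flow‑guided masking step in the dynamic‑perception module can be illustrated in a few lines of numpy. This is a sketch under stated assumptions: a random array stands in for the RAFT flow field, and the value of k is arbitrary; only the "select the top‑k highest‑magnitude pixels and reconstruct just those regions" idea comes from the summary above:

```python
import numpy as np

def topk_flow_mask(flow, k):
    """Return a boolean mask selecting the k pixels with the largest
    optical-flow magnitude -- the regions most likely to change under motion.
    flow: (H, W, 2) array of per-pixel (dx, dy) displacements."""
    magnitude = np.linalg.norm(flow, axis=-1)          # (H, W) flow magnitude
    flat_idx = np.argsort(magnitude, axis=None)[-k:]   # indices of the k largest
    mask = np.zeros(magnitude.size, dtype=bool)
    mask[flat_idx] = True
    return mask.reshape(magnitude.shape)

rng = np.random.default_rng(0)
flow = rng.normal(size=(8, 8, 2))       # stand-in for a RAFT flow estimate
mask = topk_flow_mask(flow, k=10)
# the decoder would reconstruct only the masked, high-motion regions
masked_target = np.where(mask[..., None], flow, 0.0)
```

In training, the reconstruction loss would be evaluated only inside `mask`, which is what focuses the model on the parts of the scene its own motion is about to change.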
Training Procedure
Training proceeds in two stages. First, the latent intention module is trained using a bounding‑box regression loss derived from the dataset’s entrance and signage annotations. Second, the intention branch is frozen; the model is then trained jointly on waypoint prediction loss (supervising the trajectory) and a dynamic perception loss that penalizes reconstruction error on the high‑flow masked regions. This staged approach ensures that intention inference and navigation planning improve without interfering with each other.
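The stage‑two objective described above can be sketched as a weighted sum of a waypoint regression term and a masked reconstruction term. Everything concrete here is an assumption for illustration: the L2 form of both losses, the weighting `lam`, and the toy shapes; the summary only specifies that the two terms are trained jointly and that reconstruction is restricted to the high‑flow mask:

```python
import numpy as np

def stage2_loss(pred_wp, gt_wp, recon, target, mask, lam=1.0):
    """Joint stage-2 objective: waypoint regression plus dynamic-perception
    reconstruction, the latter computed only on the high-flow masked pixels.
    lam is an assumed weighting; the paper's exact loss terms may differ."""
    wp_loss = np.mean((pred_wp - gt_wp) ** 2)                # trajectory supervision
    recon_loss = np.mean((recon[mask] - target[mask]) ** 2)  # masked reconstruction
    return wp_loss + lam * recon_loss

pred_wp = np.zeros((5, 2)); gt_wp = np.ones((5, 2))          # five 2-D waypoints
recon = np.zeros((8, 8)); target = np.ones((8, 8))           # toy frame features
mask = np.zeros((8, 8), dtype=bool); mask[:2, :2] = True     # top-k flow mask
loss = stage2_loss(pred_wp, gt_wp, recon, target, mask)
```

Freezing the intention branch before this stage means gradients from `stage2_loss` only update the perception and planning weights, which is what keeps the two training signals from interfering.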
Experimental Evaluation
The authors evaluate BridgeNav on BridgeNavDataset and compare against several strong baselines adapted from indoor VLN (e.g., ORAR, HAMT) and outdoor navigation (e.g., UrbanNav, GOAT‑Bench). Metrics include Success Rate (SR), SPL (Success weighted by Path Length), and Entrance Accuracy (the proportion of episodes where the agent correctly passes through the target entrance). BridgeNav consistently outperforms baselines: SR improves by ~15 percentage points, SPL by ~0.17, and Entrance Accuracy by ~20 percentage points. Ablation studies reveal that removing the optical‑flow module drops performance by roughly 10 percentage points, confirming the importance of future‑state imagination. Qualitative visualizations show the model’s attention shifting from building silhouettes to signage and finally to doorways as the agent approaches the goal.
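SR and SPL presumably follow the standard embodied‑navigation definitions (SPL averages S_i · l_i / max(p_i, l_i) over episodes, where l_i is the shortest‑path length and p_i the path actually taken); whether BridgeNav uses exactly this formulation is an assumption. A small sketch of both metrics:

```python
import numpy as np

def success_rate_and_spl(success, shortest, taken):
    """Standard navigation metrics:
    SR  = fraction of successful episodes;
    SPL = mean over episodes of S_i * l_i / max(p_i, l_i),
    which discounts successes achieved via inefficient paths."""
    success = np.asarray(success, dtype=float)
    shortest = np.asarray(shortest, dtype=float)
    taken = np.asarray(taken, dtype=float)
    sr = success.mean()
    spl = np.mean(success * shortest / np.maximum(taken, shortest))
    return sr, spl

# three episodes: success with a 2x detour, a failure, success on the optimal path
sr, spl = success_rate_and_spl([1, 0, 1], [10.0, 8.0, 5.0], [20.0, 8.0, 5.0])
```

Entrance Accuracy, by contrast, is a binary per‑episode check (did the agent pass through the correct entrance?), so it reduces to a plain success‑rate computation over that event.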
Contributions and Impact
- Definition of a novel, realistic navigation task that bridges indoor and outdoor domains without external priors.
- Introduction of a vision‑centric framework that dynamically reallocates visual attention and predicts future visual changes via optical flow.
- Creation of a large‑scale, automatically generated dataset that includes coherent video, trajectory annotations, and fine‑grained entrance labels.
- Demonstration of state‑of‑the‑art performance across multiple metrics, establishing a new benchmark for out‑to‑in navigation.
Limitations and Future Work
The current evaluation is confined to simulated street‑view environments; real‑world deployment would need to handle sensor noise, dynamic obstacles, and lighting variations. Extending the model to incorporate depth or LiDAR data could improve entrance localization under occlusion. Moreover, integrating a reinforcement‑learning fine‑tuning stage might further enhance robustness in highly dynamic urban settings.
In summary, the paper delivers a comprehensive solution to the “last meters” problem in embodied navigation, offering both a methodological advance (dynamic intention inference and flow‑guided perception) and a valuable data resource. Its emphasis on prior‑free, vision‑only operation makes it especially relevant for real‑world robotic delivery, autonomous vehicles, and augmented‑reality navigation applications.