NaviDriveVLM: Decoupling High-Level Reasoning and Motion Planning for Autonomous Driving
Vision-language models (VLMs) have emerged as a promising direction for end-to-end autonomous driving (AD) by jointly modeling visual observations, driving context, and language-based reasoning. However, existing VLM-based systems face a trade-off between high-level reasoning and motion planning: large models offer strong semantic understanding but are costly to adapt for precise control, whereas small VLMs can be fine-tuned efficiently but often exhibit weaker reasoning. We propose NaviDriveVLM, a decoupled framework that separates reasoning from action generation using a large-scale Navigator and a lightweight trainable Driver. This design preserves reasoning ability, reduces training cost, and provides an explicit, interpretable intermediate representation for downstream planning. Experiments on the nuScenes benchmark show that NaviDriveVLM outperforms large VLM baselines in end-to-end motion planning.
💡 Research Summary
Title: NaviDriveVLM: Decoupling High‑Level Reasoning and Motion Planning for Autonomous Driving
Problem Statement:
Recent vision‑language models (VLMs) have shown promise for end‑to‑end autonomous driving by jointly processing images, vehicle state, and language prompts. However, a fundamental trade‑off exists: large VLMs excel at semantic understanding and high‑level reasoning but are prohibitively expensive to fine‑tune for precise control, while small VLMs can be efficiently adapted for waypoint prediction but lose much of the reasoning capability. Existing works typically use a single model for both tasks, sacrificing reasoning quality, adaptation efficiency, or planning accuracy.
Proposed Solution – NaviDriveVLM:
The authors introduce a two‑module architecture that explicitly separates reasoning from action generation:
- Navigator (large‑scale VLM, frozen):
- Uses a pretrained large VLM (e.g., Qwen3‑VL‑8B).
- Inputs: multi‑view surround images, ego‑vehicle state (velocity, yaw rate, acceleration), past waypoints, and a high‑level command (Hard Left, Slight Right, Keep Straight, Decelerate Stop, etc.).
- Outputs three textual tokens: a scene description, a recommended action, and a natural‑language reasoning explanation.
- The model remains frozen during training, preserving its rich semantic knowledge without incurring costly gradient updates.
- Driver (lightweight VLM, trainable):
- Based on a smaller VLM (Qwen3‑VL‑2B, or 8B with LoRA).
- Receives visual tokens, ego‑state tokens, command tokens, and the reasoning tokens generated by the Navigator.
- Predicts a sequence of future waypoints (six points covering a 6‑second horizon) using an autoregressive decoder.
- Trained with supervised fine‑tuning (SFT) by minimizing the negative log‑likelihood of ground‑truth waypoints.
The reasoning tokens act as an explicit, interpretable intermediate representation, bridging perception and planning. This design allows the system to keep the high‑level understanding of a massive VLM while leveraging a compact, efficiently trainable model for precise motion planning.
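The two-stage flow above can be sketched as plain Python. This is a minimal illustrative skeleton, not the paper's implementation: the class and function names are assumptions, and both models are stubbed (the real Navigator is a pretrained Qwen3‑VL‑8B and the real Driver autoregressively decodes waypoint tokens).

```python
from dataclasses import dataclass

@dataclass
class DrivingContext:
    """Inputs shared by both modules (field names are illustrative)."""
    images: list          # multi-view surround images
    ego_state: dict       # velocity, yaw rate, acceleration
    past_waypoints: list  # 2 s of trajectory history as (x, y) points
    command: str          # high-level command, e.g. "Keep Straight"

def navigator(ctx: DrivingContext) -> str:
    """Frozen large VLM: emits scene description, recommended action,
    and reasoning as text. Stubbed with a fixed template here."""
    return (f"Scene: urban road. Action: {ctx.command}. "
            f"Reasoning: the path ahead is clear, follow the command.")

def driver(ctx: DrivingContext, reasoning: str) -> list:
    """Trainable small VLM: consumes visual/ego/command tokens plus the
    Navigator's reasoning tokens and predicts six waypoints over a 6 s
    horizon. Stubbed with a straight-line rollout from the last waypoint."""
    x, y = ctx.past_waypoints[-1]
    return [(x, y + (i + 1) * 1.0) for i in range(6)]

ctx = DrivingContext(images=[], ego_state={"velocity": 5.0},
                     past_waypoints=[(0.0, 0.0)], command="Keep Straight")
reasoning = navigator(ctx)          # explicit, human-readable intermediate text
waypoints = driver(ctx, reasoning)  # six (x, y) points, one per second
```

The key design point this sketch captures is the interface: the only coupling between the two models is a short text string, so the Navigator can stay frozen while only the Driver receives gradients.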
Dataset Construction – nuScenes‑Reason:
The authors start from the public nuScenes dataset (850 scenes, 2 Hz sampling). They extract 8‑second clips (2 s history, 6 s prediction) resulting in 16.5 k training and 3.6 k test samples. For each clip, the frozen Navigator is run once to generate textual reasoning; these outputs are stored, forming the nuScenes‑Reason dataset. This pre‑computation eliminates repeated expensive inference of the large VLM during Driver training.
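The pre-computation step described above amounts to running the frozen Navigator once per clip and caching its text. A minimal sketch, assuming a JSONL cache format and stub field names (the released nuScenes‑Reason schema may differ):

```python
import json

def build_reason_cache(clips, navigator_fn, out_path="nuscenes_reason.jsonl"):
    """Run the frozen Navigator once per training clip and persist its
    textual output, so Driver training never re-invokes the large VLM."""
    with open(out_path, "w") as f:
        for clip in clips:
            record = {"clip_id": clip["id"], "reasoning": navigator_fn(clip)}
            f.write(json.dumps(record) + "\n")
    return out_path

# Toy usage with a stub Navigator standing in for the 8B model:
clips = [{"id": i} for i in range(3)]
path = build_reason_cache(clips, lambda c: f"reasoning for clip {c['id']}")
cached = [json.loads(line) for line in open(path)]
```

During Driver training, each batch then reads its reasoning tokens from the cache instead of invoking the large model, which is what makes single-GPU fine-tuning practical.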
Training Details:
- Optimizer: AdamW (weight decay 0.01), learning rate 1e‑5 with cosine decay.
- Batch size 1, gradient accumulation over 16 steps.
- For the 8‑B model, 8‑bit quantization and LoRA (rank 64, alpha 128, dropout 0.05) are applied.
- All experiments run on a single NVIDIA RTX 4090.
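The hyperparameters above can be collected into a single config; the cosine schedule below is a standard warmup-free cosine decay and is an assumption, since the summary does not specify warmup or a minimum learning rate.

```python
import math

# Hyperparameters as reported; LoRA/quantization fields apply to the 8B variant.
config = {
    "optimizer": "AdamW", "weight_decay": 0.01,
    "lr": 1e-5, "schedule": "cosine",
    "batch_size": 1, "grad_accum_steps": 16,   # effective batch size of 16
    "lora": {"rank": 64, "alpha": 128, "dropout": 0.05},
    "quantization": "8-bit",
}

def cosine_lr(step, total_steps, base_lr=config["lr"]):
    """Cosine decay from base_lr down to 0 over training (no warmup assumed)."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))
```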
Evaluation Metrics:
The primary metric is open‑loop waypoint prediction error (L2 distance) at 1 s, 2 s, 3 s, and 6 s horizons. The best of six candidate waypoint sequences (minimum average L2 at 6 s) is selected for reporting. Reasoning quality is assessed qualitatively, as no standard quantitative metric exists.
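The metric and the best-of-six selection rule can be made concrete with a short sketch. Function names are illustrative; the protocol follows the description above (average Euclidean distance over the horizon, minimized across candidates).

```python
import math

def avg_l2(pred, gt):
    """Mean Euclidean (L2) distance between predicted and ground-truth waypoints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def select_best(candidates, gt):
    """Pick the candidate trajectory with minimum average L2 over the full
    6 s horizon, mirroring the best-of-six reporting protocol."""
    return min(candidates, key=lambda c: avg_l2(c, gt))

gt = [(0.0, i + 1.0) for i in range(6)]  # straight-line ground truth, 1 m/s
# Six candidates with increasing lateral offset; k = 0 matches gt exactly.
cands = [[(0.1 * k, i + 1.0) for i in range(6)] for k in range(6)]
best = select_best(cands, gt)
```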
Results – Quantitative:
| Model | L2 (1 s) | L2 (2 s) | L2 (3 s) | L2 (6 s) | Avg L2 |
|---|---|---|---|---|---|
| OpenEMMA | 1.45 | 3.21 | 3.76 | 2.81 | – |
| UniAD | – | 0.44 | 0.67 | 0.96 | 0.69 |
| Driver‑VLM (Qwen3‑VL‑8B) | 0.24 | 0.65 | 1.25 | 0.60 | 0.60 |
| NaviDriveVLM (Qwen3‑VL‑2B) | 0.20 | 0.50 | 0.93 | 0.46 | 0.46 |
NaviDriveVLM outperforms all baselines, including the large‑scale VLM fine‑tuned directly for waypoints (which yields higher errors). An ablation removing the Navigator’s reasoning tokens degrades performance by ~0.07 m average L2, confirming the usefulness of the explicit reasoning signal.
Results – Qualitative:
Figures illustrate three scenarios: stopping at a stop sign, yielding to pedestrians, and proceeding through a green light. The frozen large VLM (without fine‑tuning) produces accurate scene descriptions but wildly inaccurate waypoints. The small VLM after fine‑tuning predicts accurate waypoints but generates poor or missing textual explanations. NaviDriveVLM combines both: the Navigator supplies correct natural‑language reasoning (e.g., “The traffic light is green, proceed straight”), and the Driver generates waypoints that closely follow the ground‑truth trajectory. Additional examples show the system handling red‑light waiting, following a lead vehicle, and emergency braking.
Discussion:
- Interpretability: The reasoning tokens are human‑readable, offering transparent justification for each driving decision—a crucial property for safety‑critical systems and regulatory acceptance.
- Scalability: Freezing the Navigator eliminates the need to retrain massive models for each new domain, but it also means that novel environments (e.g., severe weather, new traffic signs) may still require updating the Navigator or employing domain‑adaptation techniques.
- Real‑time Constraints: In deployment, the Navigator would need to run at inference time; the authors mitigate this by pre‑computing reasoning for training, but a production system would need an efficient inference pipeline (e.g., caching, incremental updates).
- Future Work: Efficient summarization of reasoning to stay within token limits, dynamic prompting strategies, and extending the framework to multi‑modal inputs (LiDAR, radar) are promising directions.
Conclusion:
NaviDriveVLM demonstrates that decoupling high‑level semantic reasoning from low‑level motion planning yields a system that retains the rich understanding of large VLMs while achieving precise, low‑cost waypoint prediction with a lightweight model. The explicit intermediate representation improves both performance and explainability, offering a compelling blueprint for future VLM‑based autonomous driving architectures.