Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving
Autonomous driving is a safety-critical task, and recent advances in LLMs/VLMs have opened new possibilities for reasoning and planning in this domain. However, large models demand substantial GPU memory and exhibit high inference latency, while conventional supervised fine-tuning (SFT) often struggles to close the capability gap between small and large models. To address these limitations, we propose Drive-KD, a framework that decomposes autonomous driving into a “perception-reasoning-planning” triad and transfers these capabilities via knowledge distillation. We identify layer-specific attention as the distillation signal and use it to construct capability-specific single-teacher models that outperform baselines. We then unify these single-teacher settings into a multi-teacher distillation framework and introduce asymmetric gradient projection to mitigate cross-capability gradient conflicts. Extensive evaluations validate the generalization of our method across diverse model families and scales. Experiments show that our distilled InternVL3-1B model, with ~42× less GPU memory and ~11.4× higher throughput, achieves better overall performance than the pretrained 78B model from the same family on DriveBench, and surpasses GPT-5.1 on the planning dimension, offering insights toward efficient autonomous-driving VLMs.
💡 Research Summary
Drive‑KD proposes a systematic knowledge‑distillation framework that enables small vision‑language models (VLMs) to perform autonomous‑driving tasks with efficiency comparable to, or even surpassing, large‑scale counterparts. The authors first decompose the driving problem into three sequential capabilities—perception, reasoning, and planning—mirroring the human thought process. They then conduct a thorough analysis to identify the most informative transformer layers for each capability. Using two complementary criteria—representation change (adjacent‑layer cosine similarity and vision‑text cosine similarity) and capability‑wise intra‑consistency—they find that layer 1 captures early perception cues, intermediate layers retain stable reasoning features, and the penultimate layer best encodes planning information.
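The representation-change criterion can be sketched as follows. This is an illustrative reconstruction, not the paper's exact procedure: the function names and the mean-pooling of token states into per-layer vectors are assumptions. It computes cosine similarity between adjacent layers' pooled hidden states; layers where this similarity shifts sharply mark transitions between capability regimes.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def adjacent_layer_similarity(hidden_states):
    """Adjacent-layer cosine similarity over a list of per-layer,
    mean-pooled hidden-state vectors (one entry per transformer layer)."""
    return [cosine(hidden_states[i], hidden_states[i + 1])
            for i in range(len(hidden_states) - 1)]
```

The same `cosine` helper would serve the vision-text criterion, applied to pooled vision-token and text-token states at each layer instead of adjacent layers.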
For the distillation signal, the paper evaluates dispersion of hidden states versus attention maps across questions belonging to the same image. Attention consistently shows lower dispersion, indicating higher stability across tasks, and is therefore selected as the primary supervision signal. Output‑distribution alignment (KL divergence) is deliberately omitted because autonomous‑driving outputs are less confident and more diffuse than generic multimodal QA outputs, making this signal noisy.
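The dispersion comparison can be sketched as below, under the assumption (not stated in the summary) that dispersion is measured as the mean pairwise distance between flattened signals gathered from different questions on the same image; the signal with the lower value (attention, per the paper's finding) is the more stable supervision target.

```python
import numpy as np

def dispersion(signals):
    """Mean pairwise L2 distance between flattened signals (hidden states
    or attention maps) from different questions about the same image."""
    flat = [np.ravel(s) for s in signals]
    pairs = [np.linalg.norm(flat[i] - flat[j])
             for i in range(len(flat)) for j in range(i + 1, len(flat))]
    return float(np.mean(pairs))
```

Comparing `dispersion(attn_maps)` against `dispersion(hidden_states)` per image then reproduces the selection logic described above.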
Single‑teacher distillation is built on a hard‑label supervised loss combined with attention‑matching losses specific to each capability: cross‑modal attention at layer 1 for perception, grouped‑matching full attention across intermediate layers for reasoning, and cross‑modal attention at the penultimate layer for planning. This tailored approach ensures that each teacher transfers the most relevant knowledge without over‑constraining the student.
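The single-teacher objective above can be sketched as a hard-label loss plus weighted attention-matching terms. This is a minimal sketch: the MSE form of the matching loss and the `lam` weighting are assumptions, and the grouped matching of intermediate layers is collapsed into a simple list of per-layer terms.

```python
import numpy as np

def attn_match_loss(student_attn, teacher_attn):
    """MSE between student and teacher attention maps (queries x keys).
    Illustrative; the paper's exact matching loss may differ."""
    return float(np.mean((student_attn - teacher_attn) ** 2))

def single_teacher_loss(ce_loss, attn_losses, lam=1.0):
    """Hard-label supervised loss plus capability-specific attention
    matching terms (e.g. layer 1 for perception, intermediate layers
    for reasoning, penultimate layer for planning)."""
    return ce_loss + lam * sum(attn_losses)
```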
When multiple teachers are combined, gradient conflicts arise because each capability optimizes a different objective. To resolve this, the authors introduce Asymmetric Gradient Projection (AGP), which normalizes gradients and projects lower‑priority gradients onto a subspace that does not interfere with higher‑priority ones. AGP effectively mitigates conflict and yields a student model that outperforms any single‑teacher version.
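The projection step can be sketched as follows, assuming a PCGrad-style rule made asymmetric by priority order: gradients are normalized, and only a lower-priority gradient is modified, by removing its component along any higher-priority gradient it conflicts with (negative dot product). The function name and the exact normalization order are assumptions, not the paper's specification.

```python
import numpy as np

def agp_combine(grads):
    """Combine per-capability gradients ordered from highest to lowest
    priority. Lower-priority gradients are projected off the directions
    of higher-priority ones whenever they conflict."""
    processed = []
    for g in grads:
        g = g / (np.linalg.norm(g) + 1e-12)         # normalize magnitudes
        for h in processed:                          # higher-priority gradients
            dot = np.dot(g, h)
            if dot < 0:                              # conflict detected
                g = g - dot * h / (np.dot(h, h) + 1e-12)  # remove interference
        processed.append(g)
    return sum(processed)
```

The asymmetry is the key design choice: a high-priority capability's update direction is never altered by a lower-priority one, which an ordinary symmetric projection would not guarantee.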
Extensive experiments are conducted on the DriveBench benchmark. The distilled InternVL3‑1B model consumes roughly 42× less GPU memory and achieves about 11.4× higher throughput than the 78B‑parameter pretrained InternVL3 model, yet it attains higher overall scores. Notably, on the planning dimension it surpasses the state‑of‑the‑art GPT‑5.1. The framework’s generality is validated across different model families (e.g., LLaVA‑V2) and scales (1B to 8B parameters), consistently delivering strong efficiency‑performance trade‑offs.
In summary, Drive‑KD makes four key contributions: (1) a principled study of layer and signal selection for autonomous‑driving VLMs, (2) capability‑specific single‑teacher distillation recipes that improve over pretrained and supervised‑fine‑tuned baselines, (3) a multi‑teacher distillation framework with AGP to resolve gradient conflicts, and (4) empirical evidence of broad applicability across model families and sizes. The work demonstrates that carefully designed knowledge distillation can bridge the gap between large, resource‑heavy VLMs and lightweight models suitable for real‑time deployment in safety‑critical autonomous‑driving systems. Future directions include extending the approach to streaming sensor modalities (LiDAR, radar) and long‑duration on‑vehicle evaluations.