Driving on Registers
We present DrivoR, a simple and efficient transformer-based architecture for end-to-end autonomous driving. Our approach builds on pretrained Vision Transformers (ViTs) and introduces camera-aware register tokens that compress multi-camera features into a compact scene representation, significantly reducing downstream computation without sacrificing accuracy. These tokens drive two lightweight transformer decoders that generate and then score candidate trajectories. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores representing aspects such as safety, comfort, and efficiency, enabling behavior-conditioned driving at inference. Despite its minimal design, DrivoR outperforms or matches strong contemporary baselines across NAVSIM-v1, NAVSIM-v2, and the photorealistic closed-loop HUGSIM benchmark. Our results show that a pure-transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive end-to-end driving. Code and checkpoints will be made available via the project page.
💡 Research Summary
The paper introduces DrivoR, a compact yet powerful transformer‑based architecture for end‑to‑end autonomous driving. The core idea is to replace the massive token streams produced by Vision Transformers (ViTs) with a small set of learnable “camera‑aware register tokens”. For each of the N cameras (four in the experiments), R registers are appended to the ViT input and fine‑tuned using LoRA on a pretrained DINOv2‑ViT‑S backbone. After the final ViT layer, only the R tokens per camera are extracted, yielding N × R “scene tokens” that summarize the visual context while preserving the distinction between different viewpoints.
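The append-then-extract mechanics can be sketched as follows. This is a minimal illustration under assumed shapes (DINOv2-ViT-S width 384, 14×14 patches); the class and method names are hypothetical, not the authors' code:

```python
import torch
import torch.nn as nn

class CameraRegisterPool(nn.Module):
    """Sketch of camera-aware register tokens (illustrative names).
    For each of N cameras, R learnable registers are appended to that
    camera's patch tokens before the ViT; after the final layer only
    the registers are kept as scene tokens."""

    def __init__(self, num_cameras=4, num_registers=16, dim=384):
        super().__init__()
        self.num_registers = num_registers
        # One learnable register bank per camera -> "camera-aware"
        self.registers = nn.Parameter(
            torch.zeros(num_cameras, num_registers, dim))
        nn.init.trunc_normal_(self.registers, std=0.02)

    def append(self, patch_tokens):
        # patch_tokens: (B, N, P, D) -- B scenes, N cameras, P patches
        B = patch_tokens.shape[0]
        regs = self.registers.unsqueeze(0).expand(B, -1, -1, -1)
        return torch.cat([patch_tokens, regs], dim=2)  # (B, N, P+R, D)

    def extract(self, vit_out):
        # Keep only the trailing R registers per camera and flatten
        # across cameras: (B, N, P+R, D) -> (B, N*R, D) scene tokens
        regs = vit_out[:, :, -self.num_registers:, :]
        return regs.flatten(1, 2)

pool = CameraRegisterPool()
x = torch.randn(2, 4, 196, 384)   # 2 scenes, 4 cameras, 14x14 patches
tokens = pool.append(x)           # (2, 4, 212, 384)
scene = pool.extract(tokens)      # (2, 64, 384) -- 4 cameras x 16 registers
```

Note how the downstream decoders then attend to 64 scene tokens instead of 4 × 196 patch tokens, which is where the computational savings come from.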
These scene tokens feed two lightweight transformer decoders. The first decoder receives a set of learned trajectory queries and, through self‑attention followed by cross‑attention to the scene tokens, produces |Q_traj| candidate trajectories. Each trajectory is a sequence of (x, y, θ) poses decoded by a final MLP. Training uses a Winner‑Takes‑All (or Minimum‑over‑n) loss that supervises only the closest candidate to the human reference, encouraging diversity among the proposals. An optional second target with a longer horizon can be added to push predictions toward farther waypoints.
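The Winner-Takes-All supervision described above can be written compactly. The sketch below uses a plain L2 distance over poses as the matching cost, which is an assumption; the paper may use a different per-pose cost:

```python
import torch

def winner_takes_all_loss(candidates, target):
    """Minimum-over-n / WTA loss sketch (illustrative, not the paper's
    exact code): only the candidate closest to the human reference
    receives gradient, which encourages diversity among proposals.

    candidates: (B, Q, T, 3) -- Q trajectories of T (x, y, theta) poses
    target:     (B, T, 3)    -- human reference trajectory
    """
    # Per-candidate squared distance to the reference, averaged over poses
    dists = (candidates - target.unsqueeze(1)).pow(2).sum(-1).mean(-1)  # (B, Q)
    best = dists.argmin(dim=1)                                          # (B,)
    # Supervise only the winning candidate of each batch element
    return dists[torch.arange(candidates.size(0)), best].mean()

cands = torch.randn(2, 20, 8, 3, requires_grad=True)
ref = torch.randn(2, 8, 3)
loss = winner_takes_all_loss(cands, ref)
loss.backward()  # gradient flows only through the winning candidates
```

Because the non-winning candidates receive no gradient, they are free to cover other plausible maneuvers rather than collapsing onto the single human reference.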
The second decoder is a scoring module. It first embeds each generated trajectory into a D‑dimensional score query via an MLP, then cross‑attends to the same scene tokens. Crucially, gradients from the scoring decoder are blocked from flowing back to the trajectory decoder, ensuring that generation and evaluation remain disentangled. The scoring head predicts six sub‑scores (safety, comfort, efficiency, etc.) originally defined in the Predictive Driver Model Score (PDMS) used by the NAVSIM benchmark. These sub‑scores are learned with binary cross‑entropy against oracle scores, and at inference time they are linearly combined with user‑specified weights λ_c. By adjusting λ_c, a single trained model can realize different driving styles—e.g., safety‑first, comfort‑oriented, or progress‑maximizing—without any retraining.
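The two key mechanics here, the stop-gradient between decoders and the weighted combination of sub-scores at inference, can be sketched as below. Layer sizes, sub-score names, and the single-attention-layer scorer are simplifying assumptions for illustration:

```python
import torch
import torch.nn as nn

# Illustrative sub-score names; the paper defines six PDMS sub-scores
SUB_SCORES = ["safety", "comfort", "efficiency", "progress", "s4", "s5"]

class TrajectoryScorer(nn.Module):
    """Sketch of the scoring decoder (hypothetical layer sizes).
    Trajectories are detached so scoring gradients never reach the
    trajectory decoder, keeping generation and evaluation disentangled."""

    def __init__(self, dim=256, horizon=8):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(horizon * 3, dim), nn.ReLU(),
                                   nn.Linear(dim, dim))
        self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.head = nn.Linear(dim, len(SUB_SCORES))

    def forward(self, trajectories, scene_tokens):
        # trajectories: (B, Q, T, 3); scene_tokens: (B, S, dim)
        q = self.embed(trajectories.detach().flatten(2))  # stop-gradient
        q, _ = self.attn(q, scene_tokens, scene_tokens)   # cross-attention
        return self.head(q).sigmoid()                     # (B, Q, 6) in [0, 1]

def select_trajectory(sub_scores, weights):
    # Behavior conditioning: combine sub-scores with user weights lambda_c
    total = (sub_scores * weights).sum(-1)                # (B, Q)
    return total.argmax(dim=1)

scorer = TrajectoryScorer()
trajs = torch.randn(2, 20, 8, 3)
scene = torch.randn(2, 64, 256)
scores = scorer(trajs, scene)
safety_first = torch.tensor([3.0, 0.5, 0.5, 1.0, 1.0, 1.0])  # example weights
best = select_trajectory(scores, safety_first)               # (B,) indices
```

Swapping `safety_first` for a progress-heavy weight vector changes which trajectory is selected without retraining anything, which is the behavior-conditioning property the paper highlights.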
Implementation details: four cameras (front, front‑left, front‑right, rear), 16 registers per camera (64 scene tokens in total), 4‑layer decoders with inner dimension 256, a feed‑forward expansion factor of 4, and LoRA rank 32 for the ViT. The entire model contains roughly 40 M parameters, far fewer than competing ViT‑based planners. Training runs for 10 epochs on the NAVTRAIN split with a cosine‑annealed learning rate of 2 × 10⁻⁴ on four NVIDIA A100 GPUs.
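Collected as a config sketch, with values taken from the implementation details above (field names are illustrative, not the authors' actual config schema):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DrivoRConfig:
    # Values from the paper's implementation details; names are hypothetical
    cameras: tuple = ("front", "front_left", "front_right", "rear")
    registers_per_camera: int = 16   # 4 cameras x 16 = 64 scene tokens
    decoder_layers: int = 4
    decoder_dim: int = 256
    ffn_expansion: int = 4           # FFN hidden width = 4 x 256 = 1024
    lora_rank: int = 32
    epochs: int = 10
    lr: float = 2e-4                 # cosine-annealed

cfg = DrivoRConfig()
num_scene_tokens = len(cfg.cameras) * cfg.registers_per_camera  # 64
```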
Evaluation on the NAVSIM‑v1 and NAVSIM‑v2 datasets, as well as the photorealistic closed‑loop HUGSIM benchmark, shows that DrivoR matches or exceeds state‑of‑the‑art baselines on the primary PDMS metric, collision count, progress, and comfort. Despite the aggressive token reduction, the model retains the critical planning‑relevant information, demonstrating that a pure‑transformer pipeline without BEV projections or large trajectory dictionaries can be both accurate and computationally efficient.
In summary, DrivoR contributes three key advances: (1) a register‑token compression scheme that dramatically reduces visual token length while preserving multi‑camera context; (2) a disentangled generation‑and‑scoring architecture that yields strong trajectory proposals and interpretable scores; and (3) a behavior‑conditioned scoring mechanism that enables flexible, user‑driven driving policies from a single model. The work opens avenues for further research on optimal register configurations, multimodal sensor fusion with registers, and real‑world deployment of lightweight transformer planners.