Video generation models, as one form of world models, have emerged as one of the most exciting frontiers in AI, promising agents the ability to imagine the future by modeling the temporal evolution of complex scenes. In autonomous driving, this vision gives rise to driving world models: generative simulators that imagine ego and agent futures, enabling scalable simulation, safe testing of corner cases, and rich synthetic data generation. Yet, despite fast-growing research activity, the field lacks a rigorous benchmark to measure progress and guide priorities. Existing evaluations remain limited: generic video metrics overlook safety-critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent-level consistency is neglected; and controllability with respect to ego conditioning is ignored. Moreover, current datasets fail to cover the diversity of conditions required for real-world deployment. To address these gaps, we present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset curated from both driving datasets and internet-scale video sources, spanning varied weather, time of day, geographic regions, and complex maneuvers, with a suite of new metrics that jointly assess visual realism, trajectory plausibility, temporal coherence, and controllability. Benchmarking 14 state-of-the-art models reveals clear trade-offs: general models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality. DrivingGen offers a unified evaluation framework to foster reliable, controllable, and deployable driving world models, enabling scalable simulation, planning, and data-driven decision-making.
Driven by scalable learning techniques, generative video models have made remarkable progress in recent years, enabling the synthesis of high-fidelity videos across diverse scenes and motions. These models suggest a promising path toward "world models": predictive simulators capable of imagining the future, which can support planning, simulation, and decision-making in complex, dynamic environments. Inspired by this vision, there has been an accelerating surge in developing driving world models: generative models specialized for predicting future driving scenarios. Given an initial scene and optional conditions (e.g., text prompts, driving actions), a driving world model predicts both the ego-vehicle's future movements and the evolution of surrounding agents' trajectories. Such models enable closed-loop simulation and synthetic data generation, reducing reliance on real-world data and offering a promising means to explore out-of-distribution scenarios safely (Gao et al., 2024; Hassan et al., 2024; Mousakhan et al., 2025; Li et al., 2025d; Wang et al., 2025; Zhou et al., 2025). Driving world models are also tightly coupled with end-to-end autonomous driving systems.

Table 1: Comparison of existing video benchmarks, driving world models, and driving video benchmarks. "✗" indicates missing metrics, and "✓" signifies that the evaluation is comprehensive. "Visual", "Agent" and "Traj." represent evaluation of images or videos, surrounding agents, and vehicles' trajectories, respectively.
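To make the conditioning/prediction interface described above concrete, the following is a minimal sketch assuming NumPy arrays for frames and bird's-eye-view (BEV) trajectories. The class, method, and field names (DrivingWorldModel, WorldModelOutput, predict) are illustrative assumptions, not an API from DrivingGen or any of the cited models.

```python
# Minimal sketch of a driving world model's prediction interface.
# All names here are illustrative assumptions for exposition only.
from dataclasses import dataclass, field
from typing import Dict, Optional

import numpy as np


@dataclass
class WorldModelOutput:
    video: np.ndarray            # (T, H, W, 3) generated future frames
    ego_trajectory: np.ndarray   # (T, 2) predicted ego positions (BEV, metres)
    agent_tracks: Dict[int, np.ndarray] = field(default_factory=dict)  # id -> (T, 2)


class DrivingWorldModel:
    """Predicts future frames and trajectories from an initial scene."""

    def predict(
        self,
        context_frames: np.ndarray,                # (T_ctx, H, W, 3) observed frames
        text_prompt: Optional[str] = None,         # e.g. "turn left at the junction"
        ego_actions: Optional[np.ndarray] = None,  # (T, 2) conditioning ego trajectory
        horizon: int = 16,                         # number of future frames to roll out
    ) -> WorldModelOutput:
        raise NotImplementedError  # concrete models implement the rollout here
```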
While a vibrant exploration of a wide range of approaches for driving world models is underway, a well-designed benchmark, one that not only measures progress but also guides research priorities and shapes the trajectory of the entire field, has not yet emerged. Current evaluations fail to fully capture the unique requirements of the driving domain, and are limited in several ways. 1) Visual Fidelity. First, most benchmarks rely on distribution-level metrics such as Fréchet Video Distance (FVD) to assess video realism, and some adopt human-preference-aligned models (e.g., vision-language models) to score visual quality or semantic consistency. However, driving imposes unique constraints on imaging: sensor artifacts, glare, or other corruptions can have critical safety implications that general video metrics fail to capture. 2) Trajectory Plausibility. Second, the ego-motion trajectories underlying the generated videos are crucial. High-quality video generation in driving must produce trajectories that are natural, dynamically feasible, interaction-aware, and safe: properties that go beyond mere visual realism. 3) Temporal and Agent-Level Consistency. Third, temporal consistency is crucial for driving, where surrounding objects directly impact safety and decision-making. Prior benchmarks often focus on scene-level consistency but neglect agent-level consistency, such as abrupt appearance changes or abnormal disappearances of agents, imperfections that can severely compromise the realism and reliability of driving simulations. 4) Motion Controllability. Finally, for ego-conditioned video generation, it is critical to assess whether the generated motion faithfully follows the conditioning trajectory. This aspect of controllability is largely overlooked in existing benchmarks, yet it is essential for safe planning and reliable closed-loop driving, where misalignment can lead to catastrophic consequences.
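As one concrete illustration of axes 3) and 4), the sketch below shows how agent persistence and ego-motion controllability could be scored, assuming BEV ego trajectories and per-agent visibility spans are available (e.g., from an off-the-shelf tracker). The function names and the use of plain ADE/FDE are illustrative assumptions; DrivingGen's actual metric definitions may differ.

```python
# Hedged sketch (not DrivingGen's implementation) of two of the checks above:
# ego-motion controllability and agent-level persistence.
from typing import Dict, Tuple

import numpy as np


def controllability_error(cond_traj: np.ndarray, gen_traj: np.ndarray) -> Dict[str, float]:
    """How closely the ego motion recovered from a generated video follows the
    conditioning trajectory. Both inputs are (T, 2) BEV positions in metres,
    assumed time-aligned and expressed in the same ego-centric frame."""
    assert cond_traj.shape == gen_traj.shape
    per_step = np.linalg.norm(cond_traj - gen_traj, axis=-1)  # (T,) per-frame error
    return {"ADE": float(per_step.mean()),  # average displacement error
            "FDE": float(per_step[-1])}     # final displacement error


def agent_persistence(visible_spans: Dict[int, Tuple[int, int]], horizon: int) -> float:
    """Crude proxy for abrupt agent disappearance: the fraction of agents whose
    visibility span, once started, lasts to the end of the clip. visible_spans
    maps agent_id -> (first_frame, last_frame). Agents that legitimately leave
    the field of view would need extra handling in a real metric."""
    if not visible_spans:
        return 1.0
    kept = sum(1 for _, last in visible_spans.values() if last == horizon - 1)
    return kept / len(visible_spans)
```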
Another major limitation in existing benchmarks for driving world models is the lack of diversity along crucial dimensions essential for real-world deployment. 1) First, Weather and Time of Day coverage is heavily skewed: datasets like nuScenes (Caesar et al., 2020) are dominated by clear-weather, daytime driving, leaving rare but safety-critical conditions (night, snow, fog) underrepresented. 2) Second, Geographic Coverage is limited, often confined to a few cities or countries, which restricts evaluation across varied scene appearances and local traffic rules. 3) Third, Driving Maneuvers and Interactions rarely capture the full diversity of agent behaviors and complex multi-agent dynamics, such as pedestrians waiting at crosswalks, aggressive driver cut-ins, or dense traffic scenarios (Wang et al., 2021). This lack of diversity makes it difficult to assess whether generative models can handle the wide range of scenarios encountered in real-world driving, undermining their reliability and safety for large-scale deployment.
To address the above gaps, this work proposes DrivingGen, a comprehensive benchmark for generative world models in the driving domain with a diverse data distribution and novel evaluation metrics. DrivingGen evaluates models from both a visual perspective (the realism and overall quality of generated videos) and a robotics perspective (the physical plausibility, consistency, and accuracy of generated trajectories). Our benchmark makes the following key contributions: Diverse Driving Dataset. We present a new evaluation dataset curated from both driving datasets and internet-scale video sources, spanning varied weather, time of day, geographic regions, and complex maneuvers.