Deterministic World Models for Verification of Closed-loop Vision-based Systems
Verifying closed-loop vision-based control systems remains a fundamental challenge due to the high dimensionality of images and the difficulty of modeling visual environments. While generative models are increasingly used as camera surrogates in verification, their reliance on stochastic latent variables introduces unnecessary overapproximation error. To address this bottleneck, we propose a Deterministic World Model (DWM) that maps system states directly to generated images, eliminating uninterpretable latent variables and thereby enabling precise input bounds. The DWM is trained with a dual-objective loss function that combines pixel-level reconstruction accuracy with a control-difference loss to maintain behavioral consistency with the real system. We integrate the DWM into a verification pipeline built on Star-based reachability analysis (StarV) and employ conformal prediction to derive rigorous statistical bounds on the trajectory deviation between the world model and the actual vision-based system. Experiments on standard benchmarks show that our approach yields significantly tighter reachable sets and better verification performance than a latent-variable baseline.
💡 Research Summary
The paper tackles the long‑standing challenge of formally verifying closed‑loop vision‑based control systems, whose high‑dimensional image inputs and opaque camera models have made rigorous safety analysis practically infeasible. Existing verification pipelines often replace the camera with stochastic generative models such as conditional GANs (cGANs). While cGANs can produce realistic images conditioned on the system state, they also rely on latent variables that have no physical interpretation. This makes it extremely difficult to define sound input bounds for reachability analysis; widening latent bounds leads to exploding reachable sets and overly conservative safety guarantees.
To overcome these limitations, the authors introduce a Deterministic World Model (DWM). The DWM is a state‑to‑image decoder gθ that maps a low‑dimensional physical state s directly to a high‑resolution synthetic image Ĩ without any stochastic latent code. Training is driven by a dual‑objective loss: (i) a weighted pixel‑wise reconstruction loss L_rec that emphasizes control‑relevant regions (e.g., dark objects) via per‑pixel weights, and (ii) a controller‑difference loss L_ctrl = ‖C(Ĩ) − C(I)‖² that penalizes deviations in the control action produced by the reconstructed image versus the ground‑truth image. The overall loss L = L_rec + λ L_ctrl balances visual fidelity against behavioral consistency.
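The dual-objective loss described above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the authors' implementation: the function and argument names (`dual_objective_loss`, `W`, `lam`) are hypothetical, and the per-pixel weighting scheme is reduced to an arbitrary weight array.

```python
import numpy as np

def dual_objective_loss(I_hat, I, C, W, lam=0.1):
    """Sketch of the paper's dual-objective loss L = L_rec + lambda * L_ctrl.

    I_hat : image reconstructed by the DWM decoder g_theta(s)
    I     : ground-truth camera image (same shape as I_hat)
    C     : image-based controller, C(image) -> control-action vector
    W     : per-pixel weights emphasizing control-relevant regions (assumed form)
    lam   : trade-off coefficient lambda (value here is illustrative)
    """
    # (i) weighted pixel-wise reconstruction loss L_rec
    L_rec = np.mean(W * (I_hat - I) ** 2)
    # (ii) controller-difference loss L_ctrl = ||C(I_hat) - C(I)||^2
    L_ctrl = np.sum((C(I_hat) - C(I)) ** 2)
    return L_rec + lam * L_ctrl
```

In a real training loop this would be written against an autodiff framework so gradients flow through both gθ and the (frozen) controller C; the numpy version only conveys the structure of the objective.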
Once trained, the DWM is integrated into a verification pipeline that employs Star‑set (or ImageStar) representations for sets of physical states. By propagating these Star sets through the DWM and the image‑based CNN controller using layer‑wise linear abstractions, the method avoids the combinatorial explosion that would arise from directly handling high‑dimensional image sets. This yields tight over‑approximations of the reachable set of the surrogate closed‑loop system.
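The key primitive in this pipeline is that affine layers transform a Star set exactly, by mapping its center and basis. The minimal sketch below uses a box-constrained predicate for simplicity (StarV supports general linear predicates Pα ≤ d, plus tighter ReLU handling); the class and method names are illustrative, not StarV's API.

```python
import numpy as np

class Star:
    """Minimal Star set { c + V @ alpha : lb <= alpha <= ub }.

    A box predicate on alpha is assumed here for brevity; real Star/ImageStar
    sets carry general linear constraints on alpha.
    """
    def __init__(self, c, V, lb, ub):
        self.c, self.V, self.lb, self.ub = c, V, lb, ub

    def affine(self, A, b):
        # An affine layer y = A x + b maps the set exactly:
        # new center A c + b, new basis A V, same predicate on alpha.
        return Star(A @ self.c + b, A @ self.V, self.lb, self.ub)

    def box(self):
        # Tight interval hull of the set, split by sign of the basis entries.
        pos = np.clip(self.V, 0, None)
        neg = np.clip(self.V, None, 0)
        lo = self.c + pos @ self.lb + neg @ self.ub
        hi = self.c + pos @ self.ub + neg @ self.lb
        return lo, hi
```

Because the generator and a large part of the CNN controller are compositions of such affine maps (with nonlinearities handled by layer-wise linear abstractions), the state set can be pushed through the whole surrogate loop without ever enumerating concrete images.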
Recognizing that the DWM is still an approximation of the true camera, the authors apply conformal prediction (CP) to quantify the statistical discrepancy between trajectories generated by the surrogate and those of the real system. A calibration dataset of real trajectories provides non‑conformity scores; the (1 − α) quantile of these scores serves as a rigorous upper bound on trajectory deviation. By inflating the DWM‑derived reachable tube with this CP bound, they obtain a stochastic reachability guarantee: with probability at least 1 − α, the true system’s state remains inside the inflated tube, and consequently inside the goal set if the tube is contained therein.
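The CP bound above is the standard split-conformal quantile, which can be sketched directly. The finite-sample rank ⌈(n+1)(1−α)⌉ (rather than a naive empirical quantile) is what makes the coverage guarantee hold; the function name and the exact non-conformity score used by the authors are assumptions here.

```python
import numpy as np

def cp_bound(scores, alpha=0.05):
    """Split-conformal upper bound on trajectory deviation.

    scores : non-conformity scores from a calibration set of real
             trajectories (e.g., max deviation from the DWM surrogate)
    alpha  : miscoverage level; the bound holds with prob. >= 1 - alpha

    Returns the ceil((n+1)(1-alpha))-th smallest score, the standard
    finite-sample-corrected (1 - alpha) quantile.
    """
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:  # too few calibration points for this alpha: bound is vacuous
        return np.inf
    return scores[k - 1]
```

Inflating every cross-section of the DWM-derived reachable tube by this radius then yields the stated guarantee: with probability at least 1 − α, a fresh real trajectory stays inside the inflated tube.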
The approach is evaluated on three OpenAI Gym benchmarks—CartPole, MountainCar, and Pendulum—each equipped with an end‑to‑end vision controller. Compared to a baseline that uses a cGAN surrogate, the DWM‑based pipeline produces reachable sets that are 30‑45 % smaller on average, leading to significantly tighter safety certificates. Verification accuracy measured by F1‑score exceeds 0.92, demonstrating that eliminating latent variables dramatically reduces over‑approximation error. The CP‑derived statistical bound consistently holds at the chosen confidence level (e.g., 95 %).
Key contributions are: (1) a deterministic, state‑conditioned world model that serves as a verifiable camera surrogate; (2) the first application of Star‑set reachability analysis to closed‑loop vision‑based systems, enabling scalable propagation of image‑based controllers; (3) a novel dual‑loss training scheme that preserves control‑relevant visual features; and (4) the integration of conformal prediction to transfer surrogate‑based safety guarantees to the real system with provable confidence. This work opens a practical pathway for formal verification of cyber‑physical systems that rely on high‑dimensional visual perception, with immediate relevance to autonomous driving, robotics, and other safety‑critical domains.