CARLA2Real: a tool for reducing the sim2real appearance gap in CARLA simulator

Simulators are indispensable for research in autonomous systems such as self-driving cars, autonomous robots, and drones. Despite significant progress in various simulation aspects, such as graphical realism, an evident gap persists between the virtual and real-world environments. Since the ultimate goal is to deploy the autonomous systems in the real world, reducing the sim2real gap is of utmost importance. In this paper, we employ a state-of-the-art approach to enhance the photorealism of simulated data, aligning them with the visual characteristics of real-world datasets. Based on this, we developed CARLA2Real, an easy-to-use, publicly available tool (plug-in) for the widely used and open-source CARLA simulator. This tool enhances the output of CARLA in near real-time, achieving a frame rate of 13 FPS, translating it to the visual style and realism of real-world datasets such as Cityscapes, KITTI, and Mapillary Vistas. By employing the proposed tool, we generated synthetic datasets from both the simulator and the enhancement model outputs, including their corresponding ground truth annotations for tasks related to autonomous driving. Then, we performed a number of experiments to evaluate the impact of the proposed approach on feature extraction and semantic segmentation methods when trained on the enhanced synthetic data. The results demonstrate that the sim2real appearance gap is significant and can indeed be reduced by the introduced approach. Comparisons with a state-of-the-art image-to-image translation approach are also provided. The tool, pre-trained models, and associated data for this work are available for download at: https://github.com/stefanos50/CARLA2Real.


💡 Research Summary

The paper presents “CARLA2Real,” an open‑source plug‑in for the CARLA autonomous‑driving simulator that narrows the visual (appearance) gap between synthetic and real‑world images in near real‑time. The authors build upon the state‑of‑the‑art Enhancing Photorealism Enhancement (EPE) framework, which uniquely exploits Geometry Buffers (G‑Buffers) emitted by the Unreal Engine rendering pipeline (depth, normals, albedo, roughness, etc.). By feeding these intermediate buffers together with one‑hot semantic masks into a multi‑stream encoder, the network preserves class‑specific geometry, material, and lighting cues, thereby avoiding the common artifacts seen in conventional GAN‑based image‑to‑image translation methods that operate only on RGB images.
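The one-hot semantic masks mentioned above can be sketched as follows. This is a minimal, stdlib-only illustration (not the authors' code): it converts a per-pixel map of class IDs into one binary mask per class, which would then be stacked channel-wise with the G-buffers (depth, normals, albedo, etc.) to form the multi-stream encoder input. The class IDs in the example are hypothetical.

```python
def one_hot_masks(label_map, num_classes):
    """Convert an HxW map of integer class IDs into num_classes binary masks."""
    h = len(label_map)
    w = len(label_map[0])
    masks = [[[0.0] * w for _ in range(h)] for _ in range(num_classes)]
    for y in range(h):
        for x in range(w):
            masks[label_map[y][x]][y][x] = 1.0
    return masks

# Toy 2x3 label map with 3 classes (0 = road, 1 = car, 2 = sky; hypothetical IDs)
labels = [[0, 0, 2],
          [1, 1, 2]]
masks = one_hot_masks(labels, num_classes=3)
# Each G-buffer channel would be concatenated with these masks along the
# channel axis, so the encoder sees geometry/material cues per class.
```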

Training combines a Learned Perceptual Image Patch Similarity (LPIPS) loss with a perceptual discriminator that leverages VGG‑16 features, encouraging both structural fidelity and photorealism. Because the real‑world datasets lack ground‑truth semantic labels, a pre‑trained MSeg (multi‑domain semantic segmentation) network generates consistent pseudo‑labels, ensuring that the discriminator receives comparable semantic information across domains.
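In outline, the combined objective can be written as below; the symbols and the weighting factor are illustrative, not taken from the paper:

\[
\mathcal{L}(\theta) = \mathcal{L}_{\text{LPIPS}}\big(G_\theta(x),\, x\big) + \lambda\, \mathcal{L}_{\text{adv}}\big(G_\theta(x)\big)
\]

where \(G_\theta(x)\) is the enhanced image produced from the rendered frame \(x\), the LPIPS term penalizes structural deviation from the input, and the adversarial term scores realism using the VGG‑16 feature discriminator.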

Implementation-wise, CARLA’s Python API extracts the required G‑Buffers each frame, which are handed off to a dedicated worker thread running the EPE model on CUDA. A multithreaded synchronization pipeline and selective activation of only the most informative G‑Buffer streams enable an average throughput of ~13 FPS (≈77 ms per frame), a substantial improvement over prior works that typically achieve 1–2 FPS.
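The hand-off described above is a standard producer/consumer pattern. The following stdlib-only sketch mirrors the idea: the simulator loop pushes captured frames (RGB plus G-buffers) into a small bounded queue, and a dedicated worker thread runs the enhancement step. The `enhance` function here is a stand-in for the CUDA-side EPE forward pass, which is an assumption for illustration.

```python
import queue
import threading

frame_queue = queue.Queue(maxsize=2)  # small buffer keeps latency bounded
results = []

def enhance(frame):
    # Placeholder for the EPE model inference on the GPU.
    return {"id": frame["id"], "enhanced": frame["rgb"]}

def worker():
    while True:
        frame = frame_queue.get()
        if frame is None:          # sentinel: shut down cleanly
            break
        results.append(enhance(frame))
        frame_queue.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()

# Simulator loop: each tick extracts the RGB frame and the G-buffers.
for i in range(3):
    frame_queue.put({"id": i, "rgb": f"frame-{i}", "gbuffers": {}})

frame_queue.put(None)  # signal completion
t.join()
```

The bounded queue is the key design choice: if the enhancement worker falls behind, the simulator loop blocks instead of accumulating stale frames, which keeps the rendered and enhanced streams loosely synchronized.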

The authors evaluate the impact of the tool from two perspectives:

  1. Feature Extraction – Pre‑trained vision backbones (ResNet‑152, VGG‑19, EfficientNet‑B7) are used to extract feature vectors from raw CARLA images, CARLA2Real‑enhanced images, and real datasets (Cityscapes, KITTI). Cosine similarity and centroid distance analyses show that the enhanced images are significantly closer to real‑world features (≈12 % reduction in feature distance), especially for critical classes such as road, building, and pedestrians.

  2. Semantic Segmentation – DeepLabV3+ is trained separately on (a) raw synthetic data and (b) CARLA2Real‑enhanced synthetic data. When evaluated on real‑world validation sets, the model trained on enhanced data achieves higher mean Intersection‑over‑Union (mIoU) scores (an improvement of ≈4.3 percentage points on KITTI and ≈3.8 on Cityscapes). Gains are most pronounced for small or thin objects (e.g., pedestrians, traffic signs), indicating that the translation preserves fine‑grained details.
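For reference, the mIoU metric used in the segmentation comparison above can be computed as in this minimal sketch; inputs are flat lists of per-pixel class IDs (ground truth vs. prediction), and the toy values are purely illustrative.

```python
def mean_iou(gt, pred, num_classes):
    """Mean Intersection-over-Union across the classes present in either map."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for g, p in zip(gt, pred) if g == c and p == c)
        union = sum(1 for g, p in zip(gt, pred) if g == c or p == c)
        if union:                      # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Toy example with 3 classes:
gt   = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
print(mean_iou(gt, pred, 3))  # → 0.5
```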

For comparison, the authors also implement VSAI‑T, a recent image‑to‑image translation method that does not use G‑Buffers. Under identical training conditions, VSAI‑T yields smaller mIoU gains (≈2.7 percentage points) and exhibits more visual artifacts (color bleeding, distorted edges), confirming the advantage of leveraging intermediate rendering information.

Limitations are acknowledged: the current implementation only supports RGB cameras, does not handle LiDAR or radar modalities, and relies on the availability of G‑Buffers, which restricts applicability to engines that expose such data (e.g., Unity would need additional engineering). The model size (hundreds of megabytes) may also limit deployment on low‑end GPUs, where frame rates drop.

Future work suggested includes (i) designing lightweight translation networks for embedded platforms, (ii) extending the pipeline to multi‑sensor fusion (e.g., depth, point clouds), and (iii) integrating self‑supervised domain adaptation techniques to further reduce reliance on paired real data.

In summary, CARLA2Real demonstrates that real‑time, G‑Buffer‑guided image translation can effectively shrink the sim‑to‑real appearance gap, leading to measurable improvements in downstream perception tasks. By releasing the plug‑in, pretrained models, and the generated enhanced datasets, the authors provide the autonomous‑driving research community with a practical tool to accelerate the transition from simulation to real‑world deployment.

