Synesthesia of Vehicles: Tactile Data Synthesis from Visual Inputs

Autonomous vehicles (AVs) rely on multi-modal fusion for safety, but current visual and optical sensors cannot detect the road-induced excitations that are critical for vehicle dynamic control. Inspired by human synesthesia, we propose Synesthesia of Vehicles (SoV), a novel framework for predicting tactile excitations from visual inputs in autonomous vehicles. We develop a cross-modal spatiotemporal alignment method to address temporal and spatial disparities between the two modalities. Furthermore, we propose a visual-tactile synesthetic (VTSyn) generative model based on latent diffusion for unsupervised, high-quality tactile data synthesis. A real-vehicle perception system collected a multi-modal dataset across diverse road and lighting conditions. Extensive experiments show that VTSyn outperforms existing models in temporal, frequency, and classification performance, enhancing AV safety through proactive tactile perception.


💡 Research Summary

The paper introduces “Synesthesia of Vehicles” (SoV), a novel framework that enables autonomous vehicles (AVs) to anticipate road‑induced tactile excitations—such as vibrations, slip, and dynamic loads—using only forward‑looking visual inputs. Recognizing that conventional cameras and LiDARs excel at detecting visual cues but cannot capture the physical interactions between tires and road surfaces, the authors draw inspiration from human synesthesia, where one sensory modality evokes another, to create a visual‑to‑tactile mapping that eliminates the spatiotemporal gap between perception and control.

Data acquisition is performed on a Geely Geometry E equipped with a high‑resolution ZED 2 stereo camera (30 fps) and an intelligent tire instrumented with a three‑axis ADXL375 accelerometer (500 Hz). An RTK module records precise vehicle position and speed at 20 Hz, allowing the system to align each camera frame with the road segment that will be traversed in the next 0.6–20 m. The alignment pipeline consists of keyframe extraction, target road segment marking, spatial indexing via speed integration, and temporal‑to‑spatial conversion of the accelerometer stream using interpolation. This yields spatially aligned visual‑tactile sample pairs for training the generative model.
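The core of the alignment pipeline is the temporal‑to‑spatial conversion: integrating the RTK speed log gives the distance traveled at each accelerometer timestamp, and the time‑domain vibration signal can then be resampled onto a uniform spatial grid so it can be matched to road segments seen in the camera frames. A minimal sketch of that step, assuming simple linear interpolation and illustrative function and variable names (not from the paper):

```python
import numpy as np

def temporal_to_spatial(t_acc, acc_z, t_speed, speed_mps, step_m=0.01):
    """Resample a time-domain accelerometer stream onto a uniform
    spatial grid via cumulative distance from integrated speed.

    This is an assumed, simplified formulation; the paper's exact
    integration and interpolation scheme may differ.
    """
    # Interpolate the slower (20 Hz) RTK speed log onto the 500 Hz
    # accelerometer timestamps.
    v = np.interp(t_acc, t_speed, speed_mps)
    # Integrate speed over time -> distance traveled at each sample.
    dt = np.diff(t_acc, prepend=t_acc[0])
    s = np.cumsum(v * dt)
    # Build a uniform spatial grid and interpolate acceleration onto it.
    s_grid = np.arange(s[0], s[-1], step_m)
    acc_spatial = np.interp(s_grid, s, acc_z)
    return s_grid, acc_spatial

# Example with synthetic data: 500 Hz accelerometer, 20 Hz speed log,
# constant 10 m/s over 2 s (~20 m of road).
t_acc = np.arange(0.0, 2.0, 1 / 500)
acc_z = np.sin(2 * np.pi * 5 * t_acc)        # synthetic vibration signal
t_speed = np.arange(0.0, 2.0, 1 / 20)
speed = np.full_like(t_speed, 10.0)
s_grid, acc_s = temporal_to_spatial(t_acc, acc_z, t_speed, speed)
```

With the signal indexed by distance rather than time, a tactile segment can be cut out for any road interval (e.g. the 0.6–20 m window ahead of a keyframe) regardless of how fast the vehicle was driving when it crossed it.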

