UL-VIO: Ultra-lightweight Visual-Inertial Odometry with Noise Robust Test-time Adaptation


Data-driven visual-inertial odometry (VIO) has attracted attention for its performance, since VIO is a crucial component of autonomous robots. However, deploying these networks on resource-constrained devices is non-trivial, because their large parameter counts must fit in device memory. Furthermore, the networks may risk failure post-deployment due to environmental distribution shifts at test time. In light of this, we propose UL-VIO – an ultra-lightweight (<1M parameters) VIO network capable of test-time adaptation (TTA) based on visual-inertial consistency. Specifically, we compress the network while preserving the low-level encoder part, including all BatchNorm parameters, for resource-efficient test-time adaptation. The result is a network 36× smaller than the state of the art with only a minute increase in error – 1% on the KITTI dataset. For test-time adaptation, we propose to use the inertia-referred network outputs as pseudo-labels and to update only the BatchNorm parameters for lightweight yet effective adaptation. To the best of our knowledge, this is the first work to perform noise-robust TTA on VIO. Experimental results on the KITTI, EuRoC, and Marulan datasets demonstrate the effectiveness of our resource-efficient adaptation method under diverse TTA scenarios with dynamic domain shifts.


💡 Research Summary

UL‑VIO introduces an ultra‑lightweight visual‑inertial odometry (VIO) framework that simultaneously addresses two critical challenges for deployment on resource‑constrained platforms: model size and robustness to distribution shifts at test time. The authors start from a state‑of‑the‑art NAS‑VIO architecture and aggressively compress it while deliberately preserving all BatchNorm (BN) parameters in the visual encoder. Compression techniques include inserting an average‑pooling layer after the last convolutional block of the visual encoder (yielding a 117× reduction), shrinking channel dimensions in both visual and inertial encoders (8× reduction), and replacing the recurrent LSTM decoder with a fully‑connected layer (161× reduction). After these steps the total parameter count falls below one million, a 36× reduction compared with the original model, yet the increase in pose error on the clean KITTI benchmark is only about 1%.
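To illustrate the first compression step, the sketch below shows how a global average pool after the last convolutional block shrinks the input to the following fully-connected layer, and hence that layer's parameter count, by a factor of H × W. The tensor dimensions and hidden width here are hypothetical, not the paper's exact architecture.

```python
import numpy as np

# Hypothetical feature map from the last conv block of the visual encoder
# (illustrative dimensions, not the paper's exact ones)
C, H, W = 256, 8, 26                 # channels / height / width
feat = np.random.randn(C, H, W)

flat_dim = C * H * W                 # decoder input size without pooling
pooled = feat.mean(axis=(1, 2))      # global average pool: one value per channel
pooled_dim = pooled.shape[0]         # decoder input size with pooling

# The first fully-connected decoder layer (hidden width D) shrinks by H * W,
# since its weight matrix goes from (flat_dim x D) to (pooled_dim x D)
D = 128
reduction = (flat_dim * D) // (pooled_dim * D)
print(reduction)  # 208, i.e. H * W
```

The same arithmetic explains why the reported reduction factors depend only on the spatial resolution being pooled away, not on the decoder width.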

The second contribution is a novel test‑time adaptation (TTA) scheme that leverages the inherent multimodal nature of VIO. Inertial measurements are far less affected than camera images by visual degradations such as rain, snow, blur, or shadows. Consequently, the pose estimated by the inertial‑only decoder is used as a pseudo‑label for the fused visual‑inertial decoder. The adaptation loss is a simple L2 distance between the fused pose and the inertial pose (covering both translation and rotation components). Crucially, only the BatchNorm (BN) parameters of the visual encoder are updated during TTA, which makes the adaptation computationally cheap and memory‑efficient. To decide when to adapt, the system extracts domain‑distinctive features (ddfs) from early visual layers and matches them against a small dictionary of pre‑learned ddfs for known noise types. If a mismatch with the current domain is detected, the corresponding BN parameter set is loaded and a few gradient steps are taken on the TTA loss. This "BN‑only" update incurs a negligible 0.18% parameter overhead per noise type and can be performed online on a per‑frame basis.
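A minimal numpy sketch of the BN-only adaptation idea follows. The feature dimension, the frozen linear head standing in for the fused decoder, and the learning rate are all illustrative assumptions; a real implementation would backpropagate the same L2 consistency loss through the full network while keeping everything except the BN parameters frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
F, P = 16, 6                          # feature dim, pose dim (3 trans + 3 rot) - illustrative

# Frozen pieces: BN-normalized visual features for one frame, and a toy
# linear head standing in for the fused visual-inertial decoder
x = rng.standard_normal(F)
W = rng.standard_normal((P, F)) * 0.1

# Only the BN affine parameters adapt at test time
gamma = np.ones(F)
beta = np.zeros(F)

# Pose from the inertial-only branch, used as the pseudo-label
y_imu = rng.standard_normal(P) * 0.5

def fused_pose(gamma, beta):
    return W @ (gamma * x + beta)     # BN affine transform, then frozen head

lr = 0.05
losses = []
for _ in range(50):
    err = fused_pose(gamma, beta) - y_imu
    losses.append(float(err @ err))   # L2 visual-inertial consistency loss
    g = W.T @ (2 * err)               # gradient through the linear head
    gamma -= lr * g * x               # dL/dgamma (elementwise, chain rule via x)
    beta -= lr * g                    # dL/dbeta
```

Because only `gamma` and `beta` receive updates, the per-step memory and compute cost stays tiny, which is the point of restricting adaptation to the BN parameters.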

Experiments are conducted on three public datasets—KITTI (including the corrupted KITTI‑C suite), EuRoC, and Marulan—covering a wide range of visual disturbances (blur, rain, snow, shadows, brightness changes). Results show that the compressed model retains near‑state‑of‑the‑art accuracy on clean data while being dramatically smaller. Under dynamic noise shifts in KITTI‑C, UL‑VIO achieves an average 18% reduction in translation RMSE and up to 45% reduction in the worst cases after adaptation. Similar gains are reported on EuRoC and Marulan, confirming that the inertial‑based pseudo‑label remains reliable across different motion dynamics and sensor configurations. Memory usage fits comfortably within the on‑chip SRAM of modern mobile SoCs (e.g., Apple A16, Qualcomm Snapdragon), and power consumption is reduced by roughly 30% compared with the uncompressed baseline.

The paper also discusses limitations: the approach assumes reasonably accurate inertial data; severe IMU drift could degrade the pseudo‑label quality. BN‑only adaptation may be slower to respond to abrupt visual changes, suggesting future work on meta‑learning initializations or multi‑BN switching strategies. Additionally, the current system relies on a pre‑defined set of noise categories; extending it to fully unsupervised domain discovery is an open research direction.

In summary, UL‑VIO delivers a practical VIO solution for edge devices by (1) compressing a high‑performance VIO network to under one million parameters, (2) enabling lightweight, online test‑time adaptation through BN updates guided by inertial‑visual consistency, and (3) demonstrating robust performance across diverse real‑world visual degradations. This combination of extreme model compactness and noise‑robust adaptability makes UL‑VIO a strong candidate for deployment in autonomous drones, mobile robots, and self‑driving cars where memory, compute, and energy budgets are tightly constrained.

