Match Stereo Videos via Bidirectional Alignment

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv paper.

Video stereo matching is the task of estimating consistent disparity maps from rectified stereo videos. There is considerable scope for improvement in both datasets and methods in this area. Recent learning-based methods often optimize performance on independent stereo pairs, leading to temporal inconsistencies in videos. Existing video methods typically employ a sliding-window operation over the time dimension, which can result in low-frequency oscillations at the scale of the window size. To address these challenges, we propose a bidirectional alignment mechanism for adjacent frames as a fundamental operation. Building on this, we introduce a novel video processing framework, BiDAStereo, and a plugin stabilizer network, BiDAStabilizer, compatible with general image-based methods. Regarding datasets, current synthetic object-based and indoor datasets are commonly used for training and benchmarking, leaving a lack of outdoor natural scenarios. To bridge this gap, we present a realistic synthetic dataset and benchmark focused on natural scenes, along with a real-world dataset captured by a stereo camera in diverse urban scenes for qualitative evaluation. Extensive in-domain, out-of-domain, and robustness evaluations demonstrate the contribution of our methods and datasets, showcasing improvements in prediction quality and state-of-the-art results on various commonly used benchmarks. The project page, demos, code, and datasets are available at: https://tomtomtommi.github.io/BiDAVideo/.


💡 Research Summary

The paper tackles two major challenges in stereo video matching: achieving temporal consistency across frames and providing realistic training data for outdoor scenarios. Traditional deep learning‑based stereo methods excel on single image pairs but, when applied frame‑by‑frame to video, produce flickering disparity maps because they ignore temporal information. Recent video‑oriented approaches such as CODD and DynamicStereo mitigate this by using past frames or sliding‑window transformers, yet they remain limited to a short temporal receptive field and suffer from low‑frequency oscillations tied to the window size.

To overcome these limitations, the authors introduce a bidirectional alignment operation that explicitly aligns feature maps of adjacent frames in both forward and backward directions before any cost computation. This alignment is performed using optical‑flow‑based warping, ensuring that corresponding points stay spatially consistent despite camera or object motion.
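The warping step can be pictured with a short NumPy sketch of flow-guided backward warping. The function name and the explicit bilinear sampling are illustrative assumptions, not taken from the paper's code (a real implementation would more likely use `torch.nn.functional.grid_sample` on GPU tensors):

```python
import numpy as np

def warp_with_flow(feat, flow):
    """Backward-warp a neighboring frame's feature map toward the reference frame.

    feat: (H, W, C) features of the neighboring frame.
    flow: (H, W, 2) optical flow from the reference frame to the neighbor,
          as (dx, dy) pixel offsets.
    Returns features bilinearly sampled at the flow-displaced locations.
    """
    H, W, _ = feat.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
    # Sampling coordinates in the neighbor frame, clamped to the image border.
    sx = np.clip(xs + flow[..., 0], 0, W - 1)
    sy = np.clip(ys + flow[..., 1], 0, H - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = (sx - x0)[..., None], (sy - y0)[..., None]
    # Bilinear interpolation of the four surrounding feature vectors.
    return ((1 - wy) * ((1 - wx) * feat[y0, x0] + wx * feat[y0, x1])
            + wy * ((1 - wx) * feat[y1, x0] + wx * feat[y1, x1]))
```

Applying this in both temporal directions (previous→current and next→current) yields the bidirectionally aligned features that the later stages consume.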

Building on this operation, two systems are proposed:

  1. BiDAStereo – a full video‑stereo framework.

    • Local stage: a triple‑frame correlation layer constructs cost volumes by fusing the current frame with its temporally aligned past and future neighbors.
    • Global stage: a Motion‑Propagation Recurrent Unit (MPRU) receives bidirectionally aligned features from both directions and recurrently propagates them through the entire sequence. Because the MPRU updates the central frame using information from the whole video, the temporal receptive field is effectively unbounded, eliminating the sliding‑window induced oscillations.
  2. BiDAStabilizer – a lightweight plug‑in that can be attached to any existing image‑based stereo network (e.g., PSMNet, RAFT-Stereo). It takes the disparity output of the base model, aligns it toward the central frame, and refines it using the same bidirectional alignment mechanism. No retraining of the original network is required, making the stabilizer a practical tool for improving the temporal consistency of legacy models.
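
As a rough illustration of the triple-frame correlation idea in the local stage, the sketch below builds a per-frame correlation cost volume and averages the volumes of three temporally aligned frames. The simple dot-product correlation and all function names are assumptions for illustration; the paper's actual layer operates on learned, bidirectionally aligned deep features:

```python
import numpy as np

def correlation_volume(fl, fr, max_disp):
    """Cost volume for one stereo pair: correlation of left features with
    right features shifted by each candidate disparity.

    fl, fr: (H, W, C) left/right feature maps; returns (max_disp, H, W)."""
    H, W, C = fl.shape
    vol = np.zeros((max_disp, H, W), dtype=np.float32)
    for d in range(max_disp):
        # Right features shifted d pixels rightward; out-of-range columns stay 0.
        shifted = np.zeros_like(fr)
        shifted[:, d:] = fr[:, : W - d]
        vol[d] = (fl * shifted).sum(-1) / np.sqrt(C)
    return vol

def triple_frame_volume(frames_l, frames_r, max_disp):
    """Average the correlation volumes of the (already aligned) previous,
    current, and next frames -- a stand-in for triple-frame fusion."""
    vols = [correlation_volume(fl, fr, max_disp)
            for fl, fr in zip(frames_l, frames_r)]
    return np.mean(vols, axis=0)
```

Because the three volumes describe the same (aligned) scene, averaging them suppresses per-frame noise before the recurrent global stage propagates information across the whole sequence.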

In addition to algorithmic contributions, the authors release two new datasets:

  • Infinigen SV – a large‑scale synthetic dataset generated with the Infinigen engine, focusing on natural outdoor scenes (mountains, plains, forests). It contains 1,848 video sequences (16,800 frames in total) at 1280 × 720 resolution with dense ground‑truth depth.

  • SouthKen SV – a real‑world collection captured with a calibrated stereo rig in diverse indoor and outdoor urban environments, covering various weather and lighting conditions. It comprises 107,821 frames (pseudo‑labeled) at the same resolution, providing a valuable benchmark for qualitative evaluation and domain‑generalization studies.

The experimental protocol covers three evaluation axes:

  • In‑domain – training and testing on the newly introduced datasets.
  • Out‑of‑domain – testing on established benchmarks such as SceneFlow, KITTI, and FlyingThings.
  • Robustness – adding synthetic noise, illumination changes, and motion blur to assess stability.

Metrics include average endpoint error (EPE), D1‑error, and a custom flicker score derived from the temporal variance of disparity. Results show that BiDAStereo reduces EPE by roughly 12 % and D1‑error by 9 % compared to the best prior video‑stereo methods, while BiDAStabilizer reduces temporal flicker by over 30 % without degrading spatial accuracy. The methods also achieve state‑of‑the‑art performance on the public benchmarks evaluated, confirming the effectiveness of bidirectional alignment.
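A flicker score of this kind can be pictured as a temporal-variance statistic over a disparity sequence. Below is a minimal sketch of one plausible form; the paper's exact definition may differ, and this version assumes a static scene or disparity maps already aligned to a common frame:

```python
import numpy as np

def flicker_score(disp_seq):
    """Temporal-consistency proxy: mean per-pixel standard deviation of
    disparity across frames. Lower is steadier.

    disp_seq: (T, H, W) array of per-frame disparity maps."""
    return float(np.std(disp_seq, axis=0).mean())
```

A perfectly stable prediction scores 0, while frame-to-frame oscillation raises the score in proportion to its amplitude.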

The paper discusses limitations: reliance on accurate optical flow for alignment can degrade performance under extreme motion or abrupt lighting changes, and the current pipeline handles only stereo pairs. Future work may explore learned alignment modules, multi‑scale warping, and extensions to multi‑camera or LiDAR‑fusion setups.

Overall, the work presents a principled, architecture‑agnostic solution for temporally consistent stereo video, enriches the community with realistic outdoor datasets, and demonstrates substantial empirical gains across a broad set of evaluations.

