DenVisCoM: Dense Vision Correspondence Mamba for Efficient and Real-time Optical Flow and Stereo Estimation
In this work, we propose DenVisCoM, a novel Mamba block, together with a novel hybrid architecture tailored for accurate, real-time estimation of optical flow and disparity. Since such multi-view geometry and motion tasks are fundamentally related, we propose a unified architecture that tackles them jointly. Specifically, the proposed hybrid architecture combines DenVisCoM with a Transformer-based attention block, simultaneously addressing real-time inference, memory footprint, and accuracy for joint estimation of motion and dense 3D perception tasks. We extensively analyze the trade-off between accuracy and real-time processing on a large number of benchmark datasets. Our experimental results and the accompanying analysis show that the proposed model estimates optical flow and disparity accurately and in real time. All models and associated code are available at https://github.com/vimstereo/DenVisCoM.
💡 Research Summary
DenVisCoM introduces a novel hybrid architecture that jointly tackles optical flow and stereo disparity estimation in a single, unified model. The core contribution is a redesigned Mamba block—named DenVisCoM—that integrates dense visual correspondence learning directly into the state‑space model (SSM) sequence transformation. The block consists of three parallel pathways: two symmetric Conv1D‑SiLU branches that process left and right image features independently, preserving local information, and a “Scan” branch that feeds a fused representation of both images into the SSM, enabling joint long‑range modeling of the pair. The outputs of these pathways are concatenated to form a correspondence‑aware feature tensor.
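The three-pathway structure can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: the depthwise Conv1D taps, the SiLU gate, and especially `toy_scan` (a plain linear recurrence standing in for the selective SSM scan) are simplifying assumptions, as is the exact concatenation order of the pathway outputs.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def conv1d_branch(tokens, kernel):
    """Toy depthwise causal 1-D convolution over the token axis, then SiLU.

    tokens: (N, C) feature tokens; kernel: (K, C) per-channel taps.
    """
    K = kernel.shape[0]
    padded = np.pad(tokens, ((K - 1, 0), (0, 0)))  # causal left-padding
    out = np.zeros_like(tokens)
    for k in range(K):
        out += padded[k : k + tokens.shape[0]] * kernel[k]
    return silu(out)

def toy_scan(tokens, decay=0.9):
    """Linear recurrence h_t = decay * h_{t-1} + x_t.

    A stand-in for the SSM scan: each output mixes information from
    all earlier tokens in the fused left+right sequence.
    """
    h = np.zeros(tokens.shape[1])
    out = np.empty_like(tokens)
    for t, x in enumerate(tokens):
        h = decay * h + x
        out[t] = h
    return out

def denviscom_block(left, right, kernel):
    """Three parallel pathways, concatenated channel-wise.

    Two symmetric Conv1D-SiLU branches keep per-image local detail;
    the scan branch processes the fused pair for joint long-range modeling.
    """
    local_l = conv1d_branch(left, kernel)   # left local branch
    local_r = conv1d_branch(right, kernel)  # right local branch
    fused = toy_scan(np.concatenate([left, right], axis=0))  # joint scan
    scan_l, scan_r = np.split(fused, 2, axis=0)
    return np.concatenate([local_l, scan_l, local_r, scan_r], axis=1)
```

The point of the sketch is the data flow: the two local branches never see the other image, while the scan branch sees both, so the concatenated output carries both intra-image detail and cross-image context.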
The overall network begins with two separate ResNet‑18 encoders that extract 8× down‑sampled feature maps (128 × H/8 × W/8) from each input image (or consecutive video frames). These maps are reshaped into 14 × 14 patches (196 tokens) and fed into a stack of DenVisCoM blocks. After each DenVisCoM block, a lightweight Transformer‑style attention module is applied, containing both self‑attention (within each image sequence) and cross‑attention (between the two sequences). This dual‑attention scheme enriches the representation with both intra‑image global context and inter‑image matching cues while keeping computational cost manageable by scaling the number of heads across stages.
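The dual-attention module described above can be sketched on the 196-token sequences as follows. This is an assumed simplification: single-head scaled dot-product attention with no learned projections or layer norms, and a particular residual/ordering choice (self-attention in each sequence first, then symmetric cross-attention) that the summary does not pin down.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention over token sequences of shape (N, C)."""
    s = q @ k.T / np.sqrt(q.shape[1])       # (N, N) similarity scores
    s -= s.max(axis=1, keepdims=True)       # numerical stability
    w = np.exp(s)
    w /= w.sum(axis=1, keepdims=True)       # row-wise softmax
    return w @ v

def dual_attention(tokens_a, tokens_b):
    """Self-attention within each sequence, then cross-attention between them.

    Self-attention injects intra-image global context; cross-attention
    injects inter-image matching cues. Residual connections keep the
    original tokens in the mix.
    """
    a = tokens_a + attention(tokens_a, tokens_a, tokens_a)
    b = tokens_b + attention(tokens_b, tokens_b, tokens_b)
    a = a + attention(a, b, b)  # queries from A attend to B
    b = b + attention(b, a, a)  # queries from B attend to (updated) A
    return a, b
```

In the real model this runs with multiple heads whose count scales across stages; the single-head version keeps the sketch short while preserving the self-then-cross structure.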
The architecture alternates “Mamba‑Attention” modules n times, progressively reducing the patch size and finally reconstructing the feature maps to the original spatial resolution. A parameter‑free matching layer then computes dense correspondences via matrix multiplication followed by softmax normalization. For optical flow, a 2‑D coordinate difference yields the flow field; for stereo, a 1‑D horizontal disparity is extracted after masking invalid matches. This matching step leverages high‑dimensional features directly, avoiding the expensive cost‑volume construction typical of traditional methods.
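The parameter-free matching step is simple enough to sketch end to end. Assumptions in this numpy version: a temperature of sqrt(C) on the correlation, soft-argmax (expected matched coordinate under the softmax) rather than hard argmax, and clipping negative disparities as a crude stand-in for the invalid-match masking the paper uses.

```python
import numpy as np

def soft_correspondence(feat_a, feat_b):
    """Dense matching via matrix multiplication + softmax normalization.

    feat_a, feat_b: (N, C) feature tokens (N = H*W). Row i of the result
    is a distribution over positions in image B for position i of image A.
    """
    scores = feat_a @ feat_b.T / np.sqrt(feat_a.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum(axis=1, keepdims=True)

def flow_from_matching(match, H, W):
    """Optical flow as the 2-D difference between matched and source coords."""
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    matched = match @ coords          # expected matched coordinate per pixel
    return (matched - coords).reshape(H, W, 2)

def disparity_from_matching(match, H, W):
    """Stereo disparity as a 1-D horizontal coordinate difference."""
    xs = np.tile(np.arange(W, dtype=np.float64), H)
    matched_x = match @ xs            # expected matched x in the right image
    disp = xs - matched_x             # left x minus right x
    return np.clip(disp, 0.0, None).reshape(H, W)  # crude invalid-match mask
```

Because the correlation is a single matmul over the feature tokens, no 4-D cost volume is ever materialized, which is the memory saving the summary points to.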
Training follows the MemFlow protocol on the large‑scale SceneFlow dataset (100 k steps, batch size 8, AdamW optimizer). Zero‑shot evaluation on KITTI‑2015, Sintel, VKITTI1, and other benchmarks demonstrates substantial gains over state‑of‑the‑art methods. On KITTI optical flow, DenVisCoM achieves an End‑Point Error (EPE) of 1.34 and F1‑all of 2.52, outperforming RAFT (EPE 2.45, F1‑all 7.9), Unimatch (2.25, 7.2), and MemFlow (3.38, 12.8). Notably, the model excels in medium and large motion ranges (S 10‑40 and S 40+), where it records EPEs of 0.41 and 2.00 respectively, far below competing approaches. Stereo results show comparable improvements in D1 error and memory footprint. The system runs at over 40 FPS on an RTX A6000 GPU with a modest memory budget, confirming its suitability for real‑time deployment on resource‑constrained platforms.
In summary, DenVisCoM successfully merges the linear‑time efficiency of SSMs with the expressive power of Transformer attention, delivering a compact yet highly accurate solution for dense correspondence tasks. The paper’s extensive ablations confirm that each component—the symmetric convolutional branches, the fused SSM scan, and the combined self‑ and cross‑attention—contributes meaningfully to the overall performance. Future work may explore lighter backbones, multi‑scale feature fusion, or multimodal extensions (e.g., LiDAR‑camera fusion) to further broaden the applicability of this unified dense perception framework.