A low-complexity method for efficient depth-guided image deblurring
Image deblurring is a challenging problem in imaging due to its highly ill-posed nature. Deep learning models have shown great success in tackling this problem, but the quest for the best image quality has driven up their computational complexity, making them impractical on anything but powerful servers. Meanwhile, recent works have shown that mobile LiDARs can provide complementary information in the form of depth maps that enhance deblurring quality. In this paper, we introduce a novel low-complexity neural network for depth-guided image deblurring. We show that using the wavelet transform to separate structural details and reduce spatial redundancy, together with efficient feature conditioning on the depth information, are essential ingredients in developing a low-complexity model. Experimental results show competitive image quality against recent state-of-the-art models while reducing complexity by up to two orders of magnitude.
💡 Research Summary
The paper introduces EDIBNet, a lightweight depth‑guided image deblurring network designed for edge devices such as smartphones and embedded AI modules. The authors observe that state‑of‑the‑art deep deblurring models achieve impressive PSNR/SSIM scores but require hundreds of billions of FLOPs and large memory footprints, making them unsuitable for real‑time deployment on resource‑constrained hardware. To address this, the authors combine two complementary ideas: (1) a discrete wavelet transform (DWT)‑based representation that isolates most of the useful structural information in the low‑frequency sub‑band, and (2) an efficient adapter module that fuses real‑world depth maps obtained from mobile LiDAR sensors into the visual feature stream.
Wavelet‑based decomposition.
The input blurred RGB image is processed with a two‑level Haar DWT, yielding four sub‑bands per level (LL, LH, HL, HH). The second‑level low‑frequency component LL(2) has one‑quarter of the original spatial resolution along each dimension (one‑sixteenth of the pixels) but retains the dominant structural cues (edges, contours) essential for deblurring. The first‑level high‑frequency sub‑bands (LH, HL, HH) largely bypass the neural processing; they are simply fed to the inverse DWT (iWT) after the low‑frequency content has been reconstructed. By restricting the network's computation to LL(2) (concatenated with its three second‑level high‑frequency siblings as channels), the authors achieve a drastic reduction in spatial size and, consequently, in FLOPs.
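To make the decomposition concrete, the two‑level Haar analysis/synthesis described above can be sketched in a few lines of NumPy. This is a minimal single‑channel sketch (the actual model operates on multi‑channel RGB tensors); the filter normalization is the standard orthonormal Haar convention, giving perfect reconstruction:

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2-D Haar DWT: returns (LL, LH, HL, HH),
    each at half the spatial resolution of the input."""
    a = x[0::2, 0::2]  # top-left of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0
    lh = (a + b - c - d) / 2.0
    hl = (a - b + c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2 (perfect reconstruction)."""
    a = (ll + lh + hl + hh) / 2.0
    b = (ll + lh - hl - hh) / 2.0
    c = (ll - lh + hl - hh) / 2.0
    d = (ll - lh - hl + hh) / 2.0
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    out[0::2, 0::2] = a
    out[0::2, 1::2] = b
    out[1::2, 0::2] = c
    out[1::2, 1::2] = d
    return out

img = np.random.rand(256, 256)
ll1, lh1, hl1, hh1 = haar_dwt2(img)   # level 1: four 128x128 sub-bands
ll2, lh2, hl2, hh2 = haar_dwt2(ll1)   # level 2: four 64x64 sub-bands
# LL(2) is 1/4 of the original resolution per side -> 1/16 of the pixels.
rec1 = haar_idwt2(ll2, lh2, hl2, hh2)  # reconstruct LL(1)
rec = haar_idwt2(rec1, lh1, hl1, hh1)  # reconstruct the full image
```

In the paper's pipeline, the network would process `ll2` (with its level‑2 siblings) and the two `haar_idwt2` calls would map the restored low‑frequency content back to full resolution.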
Efficient encoder‑decoder backbone.
The low‑frequency tensor is fed into a compact U‑Net‑style encoder‑decoder. The encoder consists of three hierarchical stages with channel widths 16, 32, and 64, each employing strided convolutions for down‑sampling and residual blocks for feature refinement. The decoder mirrors this structure with up‑sampling layers, skip connections, and additional residual blocks. All convolutions use 3×3 kernels and SiLU activation, keeping the parameter count low.
Depth‑guided adapters.
Depth maps from the device’s LiDAR are first normalized and passed through lightweight bias‑adjustment layers. They are then concatenated with the image features and processed by a “chunking‑spatial‑conditioning” mechanism that generates depth‑guided prompts. These prompts modulate the image features via element‑wise multiplication, followed by a lightweight 1×1 convolution and a channel‑attention block (inspired by NAFNet). This design injects geometric priors (object boundaries, layout) without adding significant computational overhead.
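The modulation path can be sketched in NumPy. Everything below is a simplified stand‑in: the per‑map scale/shift mimics the learned bias‑adjustment layers, the channel‑mixing matrix plays the role of the 1×1 convolution, and a squeeze‑and‑excite‑style gate approximates the NAFNet‑inspired channel attention:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 16, 32, 32

feats = rng.standard_normal((C, H, W))  # image feature tensor
depth = rng.uniform(0.2, 5.0, (H, W))   # LiDAR depth map (metres)

# Normalize depth, then apply a hypothetical lightweight
# bias adjustment (learned scale and shift in the real model).
d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
gamma, beta = 1.0, 0.0
prompt = np.tile(gamma * d + beta, (C, 1, 1))  # depth-guided prompt

# Element-wise modulation of the features by the depth prompt,
# followed by a 1x1 convolution (channel-mixing matrix).
modulated = feats * prompt
w1x1 = rng.standard_normal((C, C)) / np.sqrt(C)
out = np.einsum('oc,chw->ohw', w1x1, modulated)

# Channel attention: per-channel sigmoid gate computed from
# global average pooling, applied multiplicatively.
gate = 1.0 / (1.0 + np.exp(-out.mean(axis=(1, 2))))
out = out * gate[:, None, None]
```

Note that every operation here is either element‑wise or a per‑pixel channel mix, which is why the adapter adds so little to the overall FLOP count.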
Training and dataset.
Experiments use a subset of the ARKitScenes dataset, containing 29,264 RGB‑D pairs for training and 500 pairs for validation. Synthetic motion blur is applied using standard benchmark kernels. The loss combines L1 reconstruction error with a cosine similarity term, optimized for 400 epochs on 256×256 patches using Adam (lr = 1e‑4, β1 = 0.9, β2 = 0.999) with cosine annealing.
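A minimal sketch of the training objective, assuming the cosine term penalizes directional mismatch between flattened prediction and target (the weighting factor `lam` is a hypothetical choice, not taken from the paper):

```python
import numpy as np

def deblur_loss(pred, target, lam=0.1):
    """L1 reconstruction error plus a cosine-similarity penalty.
    lam balances the two terms (value here is illustrative)."""
    l1 = np.abs(pred - target).mean()
    p, t = pred.ravel(), target.ravel()
    cos = p @ t / (np.linalg.norm(p) * np.linalg.norm(t) + 1e-8)
    return l1 + lam * (1.0 - cos)

x = np.random.rand(3, 256, 256)  # a 256x256 RGB patch, as in training
loss_identical = deblur_loss(x, x)  # near zero for a perfect prediction
```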
Results.
EDIBNet (with 32 channels) runs in ~0.2 seconds per 720p frame on an NVIDIA Jetson Orin Nano, achieving a PSNR of 30.1 dB and SSIM of 0.92. Compared to recent SOTA models such as Restormer, IPT, and Depth‑NAFNet, EDIBNet reduces FLOPs and memory consumption by roughly two orders of magnitude while incurring less than 0.3 dB PSNR loss. Ablation studies confirm that skipping the high‑frequency sub‑bands has negligible impact on visual quality, and that the depth adapters contribute a measurable boost over the depth‑free baseline.
Contributions and impact.
- Demonstrates that wavelet‑domain processing can isolate the most informative components for deblurring, enabling massive computational savings.
- Introduces a novel, ultra‑lightweight depth‑adapter that effectively fuses multimodal LiDAR data into a CNN pipeline.
- Provides a practical, real‑time deblurring solution for edge hardware without sacrificing perceptual quality, opening avenues for on‑device photography enhancement, AR/VR streaming, and autonomous‑vehicle perception where both speed and accuracy are critical.
Future work may explore further compression of the adapter (e.g., quantization or pruning), robustness to noisy or sparse LiDAR measurements, and extension to video deblurring with temporal consistency constraints.