Robust Shape from Focus via Multiscale Directional Dilated Laplacian and Recurrent Network
Shape-from-Focus (SFF) is a passive depth estimation technique that infers scene depth by analyzing focus variations in a focal stack. Most recent deep learning-based SFF methods typically operate in two stages: first, they extract focus volumes (a per pixel representation of focus likelihood across the focal stack) using heavy feature encoders; then, they estimate depth via a simple one-step aggregation technique that often introduces artifacts and amplifies noise in the depth map. To address these issues, we propose a hybrid framework. Our method computes multi-scale focus volumes traditionally using handcrafted Directional Dilated Laplacian (DDL) kernels, which capture long-range and directional focus variations to form robust focus volumes. These focus volumes are then fed into a lightweight, multi-scale GRU-based depth extraction module that iteratively refines an initial depth estimate at a lower resolution for computational efficiency. Finally, a learned convex upsampling module within our recurrent network reconstructs high-resolution depth maps while preserving fine scene details and sharp boundaries. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach outperforms state-of-the-art deep learning and traditional methods, achieving superior accuracy and generalization across diverse focal conditions.
💡 Research Summary
This paper introduces a novel hybrid framework for Shape‑from‑Focus (SFF) that addresses two major drawbacks of existing methods: the computational burden and noise sensitivity of deep feature encoders, and the loss of spatial context when collapsing a 3‑D focus volume into a depth map in a single step. The authors propose to compute multi‑scale focus volumes using handcrafted Directional Dilated Laplacian (DDL) kernels. DDL extends the classic Laplacian operator with directional filters (0°, 45°, 90°, 135°) and multiple dilation rates, enabling the capture of long‑range and anisotropic focus variations across the focal stack. Because DDL kernels are applied via simple 2‑D convolutions, focus volumes are generated rapidly without any learnable parameters, making the process robust to limited training data and noise.
The second component is a lightweight, multi‑scale Gated Recurrent Unit (GRU) network that takes the DDL‑derived focus volumes together with auxiliary context features as input. The GRU iteratively refines an initial low‑resolution depth estimate, integrating local focus cues and global scene context at each iteration. By operating in a coarse‑to‑fine manner, the network preserves large‑scale structure while progressively sharpening fine details, all with a modest parameter count (≈2–3 M).
To recover high‑resolution depth maps, the authors embed a learned Convex Upsampling module. This module predicts convex combination weights that blend the low‑resolution depth with a high‑resolution guide image (the all‑in‑focus reconstruction), thereby preserving sharp depth discontinuities and avoiding the blurring typical of bilinear or nearest‑neighbor upsampling. Intermediate depth predictions are supervised, encouraging consistent refinement across iterations.
Extensive experiments on both synthetic (Light‑Field‑Dataset) and real‑world focal stacks demonstrate that the proposed method outperforms recent deep learning SFF approaches (e.g., N‑H. Wang et al., 2021; Won & Jeon, 2022) and traditional focus‑measure pipelines. Quantitatively, the method reduces MAE and RMSE by 15–20 % and improves SSIM by 0.03–0.05 across diverse scenes, especially under low‑light and high‑noise conditions. Ablation studies confirm that (1) replacing DDL with a standard Laplacian degrades performance, (2) substituting the GRU with a single‑step 1×1 convolution leads to noticeable artifacts, and (3) removing the Convex Upsampling results in blurred depth edges.
The paper acknowledges limitations: the handcrafted DDL kernels may not be optimal for specialized domains (e.g., medical or satellite imagery), and increasing the number of GRU refinement steps raises inference latency. Future work is suggested in automated kernel design (e.g., neural architecture search) and more efficient recurrent architectures such as ConvLSTM or transformer‑based recurrence, as well as multimodal integration of color, depth, and focus cues.
In summary, this work successfully merges the robustness of traditional focus measures with the learning capacity of modern recurrent networks, delivering a computationally efficient, high‑accuracy SFF solution suitable for real‑time 3‑D measurement and lightweight embedded applications.
Comments & Academic Discussion
Loading comments...
Leave a Comment