Perspective-aware fusion of incomplete depth maps and surface normals for accurate 3D reconstruction


We address the problem of reconstructing 3D surfaces from depth and surface normal maps acquired by a sensor system based on a single perspective camera. Depth and normal maps can be obtained through techniques such as structured-light scanning and photometric stereo, respectively. We propose a perspective-aware log-depth fusion approach that extends existing orthographic gradient-based depth-normals fusion methods by explicitly accounting for perspective projection, leading to metrically accurate 3D reconstructions. Additionally, the method handles missing depth measurements by leveraging available surface normal information to inpaint gaps. Experiments on the DiLiGenT-MV dataset demonstrate the effectiveness of our approach and highlight the importance of perspective-aware depth-normals fusion.


💡 Research Summary

The paper tackles the problem of fusing a depth map and a surface‑normal map that are captured by a single perspective camera, a scenario common in systems that combine structured‑light scanning (for depth) with photometric stereo (for normals). Existing depth‑normal fusion methods assume orthographic projection, which leads to metric distortions when the true camera model is perspective. The authors propose a perspective‑aware fusion framework based on a log‑depth substitution. By defining ℓ(u,v)=ln d(u,v) they derive a linear relationship between the surface normal vector and the gradient of the log‑depth (Equation 14), which holds under perspective projection and explicitly incorporates the focal length and principal point.
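Equation 14 itself is not reproduced in this summary, but the relationship can be reconstructed from the perspective back-projection S(u,v) = d(u,v)/f · (u−c_u, v−c_v, f): requiring the normal n = (n₁, n₂, n₃) to be orthogonal to both tangent vectors S_u and S_v gives ∇ℓ = −(n₁, n₂) / ((u−c_u)n₁ + (v−c_v)n₂ + f·n₃). A minimal NumPy sketch of this conversion, under my own sign convention (camera looking along +z); the exact convention used in the paper may differ:

```python
import numpy as np

def normals_to_log_depth_gradient(n, f, cu, cv):
    """Convert a unit-normal map n of shape (H, W, 3) to the gradient of
    log-depth ell = ln d under perspective projection with focal length f
    and principal point (cu, cv).

    Reconstructed from the tangent-plane condition for the perspective
    surface S(u, v) = d(u, v)/f * (u - cu, v - cv, f); note the result is
    invariant to the overall sign and scale of n.
    """
    h, w, _ = n.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    n1, n2, n3 = n[..., 0], n[..., 1], n[..., 2]
    denom = (u - cu) * n1 + (v - cv) * n2 + f * n3
    denom = np.where(np.abs(denom) < 1e-9, 1e-9, denom)  # guard grazing angles
    ell_u = -n1 / denom  # d ell / du
    ell_v = -n2 / denom  # d ell / dv
    return ell_u, ell_v
```

As a sanity check, a surface with log-linear depth ℓ = a·u + b·v + c has normals proportional to (f·a, f·b, −1 − (u−c_u)a − (v−c_v)b), and the function recovers (a, b) exactly from those normals.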

With this relationship, any orthographic gradient‑based fusion method can be reused unchanged: the observed depth values are transformed to log‑depth, the observed normals are converted to a log‑depth gradient field using the derived formula, and the original quadratic objective (depth‑consistency term weighted by a confidence map κ and gradient‑consistency term) is minimized with respect to ℓ. Because the objective remains convex, standard linear solvers can be applied efficiently. The confidence map allows the method to ignore missing depth measurements (κ=0) and rely solely on the normal‑derived gradient, thus performing principled in‑painting of depth gaps without any ad‑hoc interpolation.
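Because the objective is quadratic, the minimization reduces to a sparse linear least-squares problem. A hedged sketch of such a solver, assuming a forward-difference stencil, a single gradient weight β, and SciPy's `lsqr` — all implementation choices of mine, not details from the paper:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def fuse_log_depth(ell_obs, kappa, g_u, g_v, beta=1.0):
    """Minimise sum kappa * (ell - ell_obs)^2 + beta * |grad ell - g|^2
    over the log-depth image ell. Pixels with kappa = 0 (missing depth)
    are constrained only by the normal-derived gradients (g_u, g_v),
    which is what performs the in-painting."""
    h, w = ell_obs.shape
    # forward-difference operators on the row-major flattened image
    Dw = sp.diags([-np.ones(w - 1), np.ones(w - 1)], [0, 1], shape=(w - 1, w))
    Dh = sp.diags([-np.ones(h - 1), np.ones(h - 1)], [0, 1], shape=(h - 1, h))
    Du = sp.kron(sp.eye(h), Dw)   # horizontal differences
    Dv = sp.kron(Dh, sp.eye(w))   # vertical differences
    K = sp.diags(np.sqrt(kappa).ravel())
    A = sp.vstack([K, np.sqrt(beta) * Du, np.sqrt(beta) * Dv]).tocsr()
    b = np.concatenate([
        (np.sqrt(kappa) * ell_obs).ravel(),
        np.sqrt(beta) * g_u[:, :-1].ravel(),
        np.sqrt(beta) * g_v[:-1, :].ravel(),
    ])
    ell = lsqr(A, b, atol=1e-10, btol=1e-10, iter_lim=5000)[0]
    return ell.reshape(h, w)
```

With exact, consistent inputs and κ = 0 on roughly half the pixels, the solver still recovers the full log-depth image, illustrating the gap-filling behaviour described above.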

The authors also extend the total generalized variation (TGV) formulation to the log‑depth domain, preserving its ability to smooth while avoiding stair‑casing artifacts. The complete algorithm consists of four steps: (1) log‑transform the observed depth, (2) convert normals to log‑depth gradients via Equation 14, (3) run any orthographic depth‑normal fusion solver on these transformed inputs, and (4) exponentiate the fused log‑depth to obtain metric depth. No additional parameters beyond the original α, β, and κ are required, and existing codebases can be reused with minimal changes.

Experimental validation is performed on the DiLiGenT‑MV dataset, which provides ground‑truth meshes, depth, and normal maps for five objects. The authors simulate realistic structured‑light depth data by down‑sampling, adding Gaussian noise (σ=1 mm), and masking approximately 25 % of the pixels with Perlin‑noise‑generated gaps to emulate occlusions and pattern‑reflection failures. The normal maps are kept at full resolution with only small Gaussian noise (σ=0.1 rad). The fused results are evaluated using root‑mean‑square error (RMSE) on depth and mean angular error (MAE) on normals.
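The two error measures are straightforward to compute; a small sketch (function names are hypothetical, and the paper may aggregate errors differently, e.g. per object):

```python
import numpy as np

def depth_rmse(d_est, d_gt, mask):
    """Root-mean-square depth error over the valid-pixel mask."""
    return np.sqrt(np.mean((d_est[mask] - d_gt[mask]) ** 2))

def normal_mae_deg(n_est, n_gt, mask):
    """Mean angular error in degrees between unit-normal maps (H, W, 3)."""
    cosang = np.clip(np.sum(n_est * n_gt, axis=-1), -1.0, 1.0)
    return np.degrees(np.mean(np.arccos(cosang[mask])))
```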

Results show that the naive approach—reprojecting the perspective depth to an orthographic plane before fusion—suffers from interpolation errors and cannot handle missing depth. Pure orthographic fusion applied directly to perspective data also yields large errors due to projection mismatch. In contrast, the proposed log‑depth perspective‑aware fusion achieves the lowest RMSE and MAE across all objects. The TGV‑based variant further improves surface smoothness and eliminates stair‑casing, demonstrating that the log‑depth formulation integrates seamlessly with higher‑order regularizers.

Key contributions are: (1) a principled log‑depth formulation that makes perspective projection compatible with any orthographic gradient‑based fusion method, (2) an inherent mechanism for depth‑gap in‑painting using only normal information, and (3) empirical evidence of superior accuracy over state‑of‑the‑art orthographic and naive perspective methods. The approach is computationally lightweight, analytically transparent, and suitable for real‑time single‑view applications such as robotic perception, augmented reality, and low‑cost 3D scanning. Future work may explore extensions to distorted camera models, integration with deep learning pipelines, and deployment on embedded hardware.

