Long-Range depth estimation using learning based Hybrid Distortion Model for CCTV cameras

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Accurate camera models are essential for photogrammetry applications such as 3D mapping and object localization, particularly at long distances. Various stereo-camera-based 3D localization methods are available, but they are limited to a range of a few hundred metres. This is mainly due to the limitations of the distortion models assumed for the non-linearities present in the camera lens. This paper presents a framework for building a distortion model suitable for localizing objects at longer distances. Neural networks are well known as an alternative for modelling a highly complex non-linear lens distortion function; on the contrary, it is observed that a direct application of neural networks to distortion modelling fails to converge when estimating the camera parameters. To resolve this, a hybrid approach is presented in which conventional distortion models are first extended to incorporate higher-order terms and then enhanced with a neural-network-based residual correction model. This hybrid approach substantially improves long-range localization performance and can estimate the 3D position of objects at distances of up to 5 kilometres. The estimated 3D coordinates are transformed to GIS coordinates and plotted on a GIS map for visualization. Experimental validation demonstrates the robustness and effectiveness of the proposed framework, offering a practical solution for calibrating CCTV cameras for long-range photogrammetry applications.


💡 Research Summary

The paper addresses a fundamental limitation in long‑range photogrammetry using conventional CCTV cameras: the distortion models traditionally employed (e.g., Brown‑Conrady, Kannala‑Brandt, rational models) contain only a handful of parameters and cannot capture the higher‑order lens non‑linearities that become critical when estimating depth at distances of several kilometres. While neural‑network‑only approaches have been proposed for end‑to‑end distortion correction, the authors demonstrate that such models fail to converge when the calibration task involves very long baselines and large scene depths.

To overcome these issues, the authors propose a hybrid distortion model that combines an extended physics‑based model with a learning‑based residual correction. The physics‑based component augments the classic radial (k₁‑k₃) and tangential (p₁‑p₂) terms with higher‑order radial coefficients (k₄‑k₆), thin‑prism parameters (s₁‑s₄), and sensor‑tilt parameters (τₓ, τᵧ). This richer parameter set is estimated using the standard Zhang planar‑pattern calibration pipeline, solved via Levenberg‑Marquardt optimization. Because the extended model still cannot perfectly describe the true lens behaviour, the remaining systematic error (the residual) is modeled by a lightweight neural network (a few fully‑connected layers). The network receives normalized image coordinates and the initial distortion parameters as inputs and outputs a small correction vector that is added to the distorted points before triangulation.
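The two-stage correction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the distortion function follows the standard OpenCV-style rational model with thin-prism terms (the sensor-tilt terms τₓ, τᵧ are omitted for brevity), and the residual network is a hypothetical one-hidden-layer MLP whose weight shapes are assumptions.

```python
import numpy as np

def distort(xy, k, p, s):
    """Extended radial/tangential/thin-prism model applied to normalized
    image coordinates xy of shape (N, 2). k = (k1..k6) radial terms,
    p = (p1, p2) tangential terms, s = (s1..s4) thin-prism terms;
    coefficient layout follows the OpenCV rational model convention."""
    x, y = xy[:, 0], xy[:, 1]
    r2 = x * x + y * y
    r4, r6 = r2 * r2, r2 * r2 * r2
    # Rational radial factor: numerator uses k1-k3, denominator k4-k6.
    radial = (1 + k[0] * r2 + k[1] * r4 + k[2] * r6) / \
             (1 + k[3] * r2 + k[4] * r4 + k[5] * r6)
    xd = x * radial + 2 * p[0] * x * y + p[1] * (r2 + 2 * x * x) \
         + s[0] * r2 + s[1] * r4
    yd = y * radial + p[0] * (r2 + 2 * y * y) + 2 * p[1] * x * y \
         + s[2] * r2 + s[3] * r4
    return np.stack([xd, yd], axis=1)

def residual_mlp(xy, W1, b1, W2, b2):
    """Hypothetical lightweight residual network: one tanh hidden layer
    mapping normalized coordinates to a small 2-D correction vector
    that is added to the physics-based prediction."""
    h = np.tanh(xy @ W1 + b1)
    return h @ W2 + b2
```

A hybrid prediction would then be `distort(xy, k, p, s) + residual_mlp(xy, ...)`, with the MLP weights fitted to the reprojection residuals left over after the Levenberg-Marquardt stage.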

The experimental setup consists of two fixed surveillance cameras with a 10 m baseline, typical of medium‑quality CCTV installations. A checkerboard pattern placed at various distances (up to 5 km) provides the calibration observations. After calibrating each camera with the hybrid model, the authors perform stereo triangulation to recover 3D positions of test objects. The recovered points are transformed into geographic latitude/longitude using a known GPS reference for the rig and visualized on a GIS map.
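The stereo triangulation step can be sketched with the standard linear (DLT) method: given the two calibrated projection matrices and a matched pixel pair, the 3D point is the null vector of a small homogeneous system. This is a generic sketch of two-view triangulation, not the paper's exact pipeline; the camera intrinsics and 10 m baseline in the usage test are illustrative values.

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one 3D point observed by two
    cameras with 3x4 projection matrices P1, P2 and (already
    undistorted) pixel observations uv1, uv2. Each observation
    contributes two rows of the homogeneous system A X = 0; the
    solution is the right singular vector of the smallest singular
    value, dehomogenized."""
    A = np.vstack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

With a narrow 10 m baseline, depth error grows roughly quadratically with range, which is why the sub-pixel accuracy of the hybrid distortion model matters so much at kilometre scales.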

Results show a dramatic improvement over a baseline 5‑parameter distortion model. The average reprojection error drops from ~0.8 pixel to ~0.12 pixel, corresponding to a depth error reduction from several metres to less than 0.5 m at 1 km and under 2 m even at the maximum 5 km range. GIS‑converted positions exhibit less than 1 % positional error, confirming the practical utility of the method. Importantly, the hybrid approach converges reliably during calibration, whereas the neural‑network‑only variant diverges or yields unstable parameters.
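The GIS conversion mentioned above can be sketched with a flat-earth approximation: local east/north offsets from the rig's GPS reference are divided by the (latitude-scaled) metres-per-degree factor. This is a simplified stand-in for a proper geodetic (e.g. ENU-to-WGS-84) transform, adequate for illustration at the few-kilometre ranges discussed; the radius constant and function name are assumptions, not from the paper.

```python
import math

def enu_to_latlon(lat0, lon0, east, north):
    """Convert local east/north offsets (metres) relative to a GPS
    reference (lat0, lon0 in degrees) to latitude/longitude using a
    flat-earth approximation. R is the WGS-84 equatorial radius; the
    cos(lat0) factor scales longitude degrees, which shrink away from
    the equator."""
    R = 6378137.0  # metres
    dlat = north / R
    dlon = east / (R * math.cos(math.radians(lat0)))
    return lat0 + math.degrees(dlat), lon0 + math.degrees(dlon)
```

At 5 km range the approximation error of this linearization is centimetre-level, well below the stated sub-1 % positional error of the method itself.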

The authors discuss computational considerations: the added high‑order terms increase the size of the Jacobian but remain tractable on modern CPUs; the neural‑network correction is lightweight enough to run in real time on an edge GPU, making the solution feasible for on‑site deployment. Limitations include the need for large calibration targets at long distances (which may be logistically challenging) and the dependence on good initial estimates for the extended model to avoid local minima.

In conclusion, by integrating a physically grounded distortion model with a data‑driven residual learner, the paper delivers a robust calibration framework that extends the effective depth range of ordinary CCTV stereo rigs from a few hundred metres to several kilometres. This hybrid methodology opens the door for cost‑effective, passive‑sensor‑based long‑range surveillance, maritime monitoring, and aerial object tracking, while also suggesting future work on further model compression, automated target acquisition, and robustness under adverse weather or lighting conditions.

