4D Monocular Surgical Reconstruction under Arbitrary Camera Motions
Reconstructing deformable surgical scenes from endoscopic videos is challenging and clinically important. Recent state-of-the-art methods based on implicit neural representations or 3D Gaussian splatting have made notable progress. However, most are designed for deformable scenes with fixed endoscope viewpoints and rely on stereo depth priors or accurate structure-from-motion for initialization and optimization, limiting their ability to handle monocular sequences with large camera motion in real clinical settings. To address this, we propose Local-EndoGS, a high-quality 4D reconstruction framework for monocular endoscopic sequences with arbitrary camera motion. Local-EndoGS introduces a progressive, window-based global representation that allocates local deformable scene models to each observed window, enabling scalability to long sequences with substantial motion. To overcome unreliable initialization without stereo depth or accurate structure-from-motion, we design a coarse-to-fine strategy integrating multi-view geometry, cross-window information, and monocular depth priors, providing a robust foundation for optimization. We further incorporate long-range 2D pixel trajectory constraints and physical motion priors to improve deformation plausibility. Experiments on three public endoscopic datasets with deformable scenes and varying camera motions show that Local-EndoGS consistently outperforms state-of-the-art methods in appearance quality and geometry. Ablation studies validate the effectiveness of our key designs. Code will be released upon acceptance at: https://github.com/IRMVLab/Local-EndoGS.
💡 Research Summary
This paper tackles the challenging problem of reconstructing deformable surgical scenes from monocular endoscopic video streams that involve arbitrary camera motion—a scenario common in real‑world minimally invasive procedures but largely unsupported by existing 4D reconstruction methods. State‑of‑the‑art approaches based on implicit neural representations (INR) or 3D Gaussian Splatting (3DGS) typically assume a fixed endoscope viewpoint and rely on stereo depth priors or accurate structure‑from‑motion (SfM) pipelines (e.g., COLMAP) for initializing a single canonical space and a deformation field. Such assumptions break down when the camera moves significantly, because new tissue regions continuously enter the field of view and the canonical space can no longer represent all observations. Moreover, monocular endoscopes lack reliable depth cues, making stereo‑based or SfM‑based initialization unstable.
The authors propose Local‑EndoGS, a novel framework that enables high‑quality 4D reconstruction from monocular sequences with unrestricted camera motion. The core contributions are threefold:
- Progressive window‑based global scene representation – The long video is automatically segmented into temporally coherent windows based on motion dynamics. Each window is assigned its own local canonical space (represented by 3D Gaussian splats) and a local deformation field. The parameters of a new window are initialized from the previously optimized window, allowing a progressive, memory‑efficient optimization that scales to long sequences and large motions. This design solves the “single canonical space” bottleneck by locally adapting to newly observed anatomy.
- Coarse‑to‑fine initialization without stereo depth or accurate SfM – For each window, the authors first recover a rough 3D point cloud using multi‑view geometric constraints (epipolar consistency) and cross‑window correspondences. A monocular depth network supplies dense depth priors; the depth scale is calibrated across windows to enforce consistency. This hybrid initialization yields a stable, scale‑consistent canonical geometry even when depth is ambiguous, eliminating the need for stereo rigs or COLMAP.
- Long‑range 2D pixel trajectory and physical motion priors – During optimization, the method enforces that tracked 2D pixel trajectories across many frames are consistent with the predicted deformation field, providing strong temporal regularization especially under rapid camera motion. Additionally, a physics‑based motion prior (e.g., an elastic‑viscous tissue model) penalizes implausible deformations, encouraging smooth, realistic tissue dynamics.
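The motion-based window segmentation can be sketched roughly as follows. The summary only states that windows are formed from motion dynamics, so the greedy motion-budget rule and the `motion_budget` / `min_len` parameters below are illustrative assumptions, not the paper's exact criterion:

```python
# Hedged sketch: split a monocular sequence into temporally coherent
# windows using precomputed per-frame motion scores (e.g., mean
# optical-flow magnitude between consecutive frames). A new window is
# opened once the accumulated motion exceeds a budget -- an assumed
# policy standing in for the paper's segmentation rule.

def segment_windows(motion_scores, motion_budget=5.0, min_len=4):
    """Return half-open frame ranges [start, end) covering the sequence."""
    windows, start, accum = [], 0, 0.0
    for i, score in enumerate(motion_scores):
        accum += score
        if accum >= motion_budget and (i + 1 - start) >= min_len:
            windows.append((start, i + 1))
            start, accum = i + 1, 0.0
    if start < len(motion_scores):          # close the trailing window
        windows.append((start, len(motion_scores)))
    return windows
```

Each returned range would then receive its own local canonical Gaussians and deformation field, initialized from the previous window's optimized parameters.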
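The cross-window depth-scale calibration can be illustrated with a standard closed-form scale-and-shift alignment of a monocular depth prediction to sparse triangulated anchor depths. This least-squares form is a common choice for resolving the scale ambiguity of monocular depth networks and is an assumption here, not confirmed as the paper's exact formulation:

```python
import numpy as np

def align_depth(mono_depth, sparse_depth):
    """Least-squares scale s and shift t so that
    s * mono_depth + t approximates sparse_depth at anchor pixels.

    mono_depth:   (N,) monocular depth samples at anchor pixels
    sparse_depth: (N,) triangulated depths from multi-view geometry
    """
    d = np.asarray(mono_depth, dtype=float)
    z = np.asarray(sparse_depth, dtype=float)
    A = np.stack([d, np.ones_like(d)], axis=1)   # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, z, rcond=None)
    return s, t
```

Applying the recovered `(s, t)` to the dense depth map of each window would keep depth scales consistent across window boundaries.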
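The long-range trajectory constraint can be sketched as a reprojection-consistency term: deformed 3D points, pushed through the camera intrinsics, should land on the tracked 2D pixel trajectories. The pinhole projection and mean-error form below are illustrative assumptions, not the paper's exact loss:

```python
import numpy as np

def trajectory_loss(pts3d, tracks2d, K):
    """Mean reprojection error between deformed 3D points and 2D tracks.

    pts3d:    (T, N, 3) deformed point positions in camera coordinates
    tracks2d: (T, N, 2) tracked pixel trajectories over T frames
    K:        (3, 3) pinhole intrinsics matrix
    """
    proj = pts3d @ K.T                       # apply intrinsics: (T, N, 3)
    uv = proj[..., :2] / proj[..., 2:3]      # perspective divide to pixels
    return float(np.mean(np.linalg.norm(uv - tracks2d, axis=-1)))
```

Minimizing such a term over many frames ties the deformation field to long-range observations, which is what regularizes the motion under rapid camera movement.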
The optimization combines a photometric reconstruction loss, the aforementioned trajectory constraints, and the physical prior, all rendered efficiently thanks to the 3DGS representation. Experiments on three public endoscopic datasets (including EndoVis and Hamlyn) cover three camera motion regimes: fixed, rotating around tissue, and forward motion. Quantitatively, Local‑EndoGS outperforms prior INR‑based methods (e.g., EndoNeRF, UW‑DNeRF) and 3DGS‑based methods (e.g., Deform3DGS, Endo‑4DGS), achieving 2–3 dB higher PSNR, higher SSIM, and up to 30% lower Chamfer Distance. Visual results show markedly fewer artifacts and more faithful geometry when the camera traverses large distances. Ablation studies confirm that removing any of the three key components (windowed representation, coarse‑to‑fine initialization, trajectory/physics priors) leads to substantial performance drops, validating their necessity.
In summary, Local‑EndoGS introduces a scalable, window‑wise 4D reconstruction pipeline that works with monocular endoscopic video under arbitrary camera motion, without requiring stereo depth or high‑quality SfM. By integrating multi‑view geometry, cross‑window information, monocular depth priors, long‑range pixel trajectories, and physical tissue constraints, it achieves superior appearance and geometric fidelity while remaining computationally efficient. The authors will release code and pretrained models upon acceptance, paving the way for practical deployment in surgical navigation, training simulators, and intra‑operative augmented reality.