Fully Convolutional Networks for Monocular Retinal Depth Estimation and Optic Disc-Cup Segmentation


Glaucoma is a serious ocular disorder whose screening and diagnosis rely on examination of the optic nerve head (ONH). The color fundus image (CFI) is the most common modality used for ocular screening.


💡 Research Summary

This paper presents an end‑to‑end deep learning framework that first estimates a retinal depth map from a single color fundus photograph and then uses that depth map as an auxiliary guide to segment the optic disc (OD) and optic cup (OC). The motivation stems from the fact that glaucoma diagnosis relies heavily on the cup‑to‑disc ratio (CDR) and on the three‑dimensional shape of the optic nerve head, yet conventional fundus imaging provides only a two‑dimensional projection. Acquiring depth with OCT or stereo cameras is costly and unsuitable for large‑scale screening, so the authors aim to infer depth directly from monocular images.

Depth Estimation
The authors introduce a novel self‑supervised pre‑training scheme called “pseudo‑depth reconstruction”. They observe that the inverted green channel of a fundus image, after in‑painting the retinal vessels, visually resembles the true OCT‑derived depth map. During pre‑training, the network receives the raw RGB image and learns to reconstruct this pseudo‑depth image, thereby learning features that are more relevant to depth estimation than those learned by conventional denoising auto‑encoders. After pre‑training, the network is fine‑tuned on the INSPIRE‑stereo dataset (30 images with OCT‑based depth ground truth).
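The pseudo‑depth target described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's exact pipeline: the vessel in‑painting here is a simple local‑mean fill over a precomputed vessel mask, and the window size is an assumption.

```python
import numpy as np

def pseudo_depth(rgb, vessel_mask, window=7):
    """Build a pseudo-depth target from a fundus image.

    rgb         : (H, W, 3) float array in [0, 1]
    vessel_mask : (H, W) bool array, True where vessels were detected
    Returns an (H, W) map: the inverted green channel with vessel
    pixels replaced by the local mean of non-vessel neighbours.
    """
    g = 1.0 - rgb[..., 1]                      # inverted green channel
    out = g.copy()
    h, w = g.shape
    r = window // 2
    ys, xs = np.where(vessel_mask)
    for y, x in zip(ys, xs):                   # naive in-painting of vessels
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        x0, x1 = max(0, x - r), min(w, x + r + 1)
        patch = g[y0:y1, x0:x1]
        keep = ~vessel_mask[y0:y1, x0:x1]
        if keep.any():                         # average non-vessel pixels only
            out[y, x] = patch[keep].mean()
    return out
```

During pre‑training, the network would regress this map from the raw RGB input before being fine‑tuned on OCT‑derived ground truth.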

The depth estimation network follows an encoder‑decoder architecture similar to U‑Net but replaces standard convolutional blocks with Dilated Residual Inception (DRI) modules. Each DRI block combines parallel convolutions of different kernel sizes, dilated convolutions to enlarge the receptive field without increasing parameters, and residual connections to ease gradient flow. The encoder uses 4×4 strided convolutions (instead of max‑pooling) to preserve spatial continuity, while the decoder mirrors this with transposed convolutions and skip connections. The final layer applies a 1×1 convolution and tanh activation to produce a normalized depth map. Loss functions explored include L2, L1, and the reverse Huber (berHu) loss; berHu yields the best quantitative results.
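A DRI block as described could look like the following PyTorch sketch. The branch widths, dilation rate, and activation placement are assumptions; the paper's exact configuration may differ.

```python
import torch
import torch.nn as nn

class DRIBlock(nn.Module):
    """Dilated Residual Inception block (sketch).

    Parallel branches -- a 3x3 conv, a 5x5 conv, and a 3x3 dilated conv
    (dilation 2) -- are concatenated, projected back to the input width
    with a 1x1 conv, and added to the input via a residual connection.
    """
    def __init__(self, channels):
        super().__init__()
        b = channels // 2
        self.branch3 = nn.Conv2d(channels, b, 3, padding=1)
        self.branch5 = nn.Conv2d(channels, b, 5, padding=2)
        self.branch_d = nn.Conv2d(channels, b, 3, padding=2, dilation=2)
        self.project = nn.Conv2d(3 * b, channels, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = torch.cat([self.branch3(x),
                       self.branch5(x),
                       self.branch_d(x)], dim=1)
        return self.act(x + self.project(y))
```

Because padding matches each kernel/dilation combination, every branch preserves spatial resolution, so the residual addition is shape‑compatible by construction.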

Segmentation with Depth Guidance
For OD‑OC segmentation, the authors design a guided fully convolutional network that processes the RGB image and the estimated depth map through two parallel branches. Each branch extracts features using either simple residual blocks or the same DRI blocks. After two successive blocks, the depth branch output is added element‑wise to the image branch output (sparse fusion) and passed through a 3×3 Conv‑BatchNorm‑ReLU layer, forming a multimodal feature fusion block.
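The fusion step maps directly to code: element‑wise addition of the two branches followed by Conv‑BatchNorm‑ReLU. A minimal PyTorch sketch, with the channel count left as a parameter:

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Sparse multimodal fusion: add depth-branch features to
    image-branch features, then apply 3x3 Conv-BatchNorm-ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.post = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, img_feat, depth_feat):
        # element-wise addition keeps the fusion cheap ("sparse fusion")
        return self.post(img_feat + depth_feat)
```

Addition (rather than concatenation) keeps the channel count, and hence the downstream parameter budget, unchanged, which is why the authors can fuse at several encoder levels without a large overhead.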

The main segmentation backbone resembles the architecture proposed in prior work (e.g., a residual U‑Net) but incorporates an additional depth encoder with six levels (versus eight for the RGB encoder). Features from alternating levels of both encoders are fused via the multimodal block before being down‑sampled further. Only the fused features are propagated through the main (RGB) branch; the depth branch remains separate, reducing computational overhead. The decoder receives only the main branch features together with long skip connections from the encoder. Training uses a multiclass cross‑entropy loss for three classes (background, OD, OC).

To refine the segmentation boundaries, a Conditional Random Field (CRF) is optionally applied, leveraging both intensity and depth cues to enforce spatial consistency.

Experimental Evaluation
Depth estimation is evaluated on the INSPIRE‑stereo dataset. The pseudo‑depth pre‑training combined with DRI blocks reduces the root‑mean‑square error (RMSE) by roughly 12 % compared with a denoising‑auto‑encoder baseline and improves the δ < 1.25 accuracy to 0.89.
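The δ < 1.25 accuracy reported above is the standard depth‑estimation threshold metric: the fraction of pixels whose prediction/ground‑truth ratio (in either direction) is below 1.25. A small numpy implementation:

```python
import numpy as np

def delta_accuracy(pred, gt, thresh=1.25, eps=1e-8):
    """Fraction of pixels with max(pred/gt, gt/pred) < thresh."""
    pred = np.maximum(pred, eps)   # guard against division by zero
    gt = np.maximum(gt, eps)
    ratio = np.maximum(pred / gt, gt / pred)
    return float((ratio < thresh).mean())
```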

Segmentation performance is tested on three publicly available datasets that contain pixel‑wise OD‑OC annotations: ORIGA, RIM‑ONE r3, and DRISHTI‑GS. The proposed guided network achieves average Dice scores of 0.94 (OD) / 0.88 (OC) on ORIGA, 0.92 / 0.86 on RIM‑ONE r3, and 0.95 / 0.89 on DRISHTI‑GS. These results surpass recent state‑of‑the‑art methods based on template matching, level‑set, conventional CNNs, and polar‑transform U‑Nets, typically by 2–4 % absolute Dice improvement. Adding the depth guide consistently boosts OC Dice by about 3.5 % relative to an RGB‑only baseline, and reduces the mean absolute error of the cup‑to‑disc ratio by 0.02. The optional CRF post‑processing yields an additional 1–2 % gain in Dice.
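The two evaluation quantities used here, Dice overlap and the vertical cup‑to‑disc ratio, are straightforward to compute from binary masks. A numpy sketch (the vertical CDR convention is an assumption; the paper may use area‑based CDR):

```python
import numpy as np

def dice(pred_mask, gt_mask, eps=1e-8):
    """Dice overlap between two binary masks: 2|A∩B| / (|A|+|B|)."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    return 2.0 * inter / (pred_mask.sum() + gt_mask.sum() + eps)

def vertical_cdr(cup_mask, disc_mask):
    """Vertical cup-to-disc ratio: ratio of the vertical extents
    (in rows) of the cup and disc masks."""
    def height(m):
        rows = np.where(m.any(axis=1))[0]
        return rows[-1] - rows[0] + 1 if rows.size else 0
    d = height(disc_mask)
    return height(cup_mask) / d if d else 0.0
```

The reported CDR mean absolute error is then simply `abs(vertical_cdr(pred_cup, pred_disc) - vertical_cdr(gt_cup, gt_disc))` averaged over the test set.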

Runtime measurements on an NVIDIA GTX 1080Ti indicate that depth prediction takes ~45 ms per image and segmentation ~30 ms, making the whole pipeline suitable for real‑time screening scenarios.

Contributions and Limitations
The paper’s main contributions are: (1) a pseudo‑depth self‑supervised pre‑training strategy that better aligns feature learning with the depth estimation task; (2) the Dilated Residual Inception block for efficient multi‑scale feature extraction; (3) a guided multimodal segmentation architecture that fuses depth and color information; (4) extensive evaluation across multiple datasets showing competitive or superior performance; and (5) an exploration of CRF‑based refinement.

Limitations include the small size of the depth‑ground‑truth dataset (only 30 images), potential sensitivity of the pseudo‑depth generation to illumination variations or color balance, and the added computational cost of CRF post‑processing.

Future Directions
The authors suggest expanding the depth‑training corpus with larger paired fundus‑OCT datasets, improving robustness of pseudo‑depth generation (e.g., by incorporating illumination normalization or learning the pseudo‑depth transformation), developing lightweight mobile‑friendly models for point‑of‑care devices, and extending the multimodal framework to incorporate additional cues such as vessel morphology or retinal texture for a more comprehensive glaucoma risk assessment.
