Fully Convolutional Networks for Monocular Retinal Depth Estimation and Optic Disc-Cup Segmentation
Glaucoma is a serious ocular disorder whose screening and diagnosis are carried out by examining the optic nerve head (ONH). The color fundus image (CFI) is the most common modality used for ocular screening.
Research Summary
This paper presents an end-to-end deep learning framework that first estimates a retinal depth map from a single color fundus photograph and then uses that depth map as an auxiliary guide to segment the optic disc (OD) and optic cup (OC). The motivation stems from the fact that glaucoma diagnosis relies heavily on the cup-to-disc ratio (CDR) and on the three-dimensional shape of the optic nerve head, yet conventional fundus imaging provides only a two-dimensional projection. Acquiring depth with OCT or stereo cameras is costly and unsuitable for large-scale screening, so the authors aim to infer depth directly from monocular images.
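Since the CDR drives the diagnosis, it is worth making concrete how it can be read off a pair of binary OD/OC masks. Below is a minimal NumPy sketch of the vertical CDR (cup height over disc height); `vertical_cdr` is an illustrative helper, not code from the paper.

```python
import numpy as np

def vertical_cdr(od_mask: np.ndarray, oc_mask: np.ndarray) -> float:
    """Vertical cup-to-disc ratio: ratio of the cup's vertical extent
    to the disc's vertical extent, from binary segmentation masks."""
    od_rows = np.where(od_mask.any(axis=1))[0]
    oc_rows = np.where(oc_mask.any(axis=1))[0]
    if od_rows.size == 0 or oc_rows.size == 0:
        return 0.0
    disc_height = od_rows[-1] - od_rows[0] + 1
    cup_height = oc_rows[-1] - oc_rows[0] + 1
    return cup_height / disc_height

# Toy masks: disc spans rows 2..9 (height 8), cup rows 4..7 (height 4)
od = np.zeros((12, 12), dtype=bool); od[2:10, 3:9] = True
oc = np.zeros((12, 12), dtype=bool); oc[4:8, 4:8] = True
print(vertical_cdr(od, oc))  # -> 0.5
```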
Depth Estimation
The authors introduce a novel self-supervised pre-training scheme called "pseudo-depth reconstruction". They observe that the inverted green channel of a fundus image, after in-painting the retinal vessels, visually resembles the true OCT-derived depth map. During pre-training, the network receives the raw RGB image and learns to reconstruct this pseudo-depth image, thereby learning features that are more relevant to depth estimation than those learned by conventional denoising auto-encoders. After pre-training, the network is fine-tuned on the INSPIRE-stereo dataset (30 images with OCT-based depth ground truth).
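The pseudo-depth target described above can be sketched in a few lines: invert the normalized green channel, then fill vessel pixels from their neighbours. The paper's exact in-painting method is not detailed here, so the iterative neighbour-averaging below is a hedged stand-in; `pseudo_depth` and its arguments are illustrative.

```python
import numpy as np

def pseudo_depth(rgb: np.ndarray, vessel_mask: np.ndarray,
                 iters: int = 50) -> np.ndarray:
    """Pseudo-depth target: normalise and invert the green channel,
    then fill vessel pixels by iterative neighbour averaging (a
    stand-in for the paper's vessel in-painting step)."""
    g = rgb[..., 1].astype(np.float64)
    g = (g - g.min()) / (np.ptp(g) + 1e-8)
    filled = 1.0 - g                        # inverted green channel
    for _ in range(iters):
        # Average of the four neighbours (np.roll wraps at borders,
        # acceptable for this sketch when vessels are interior).
        avg = (np.roll(filled, 1, 0) + np.roll(filled, -1, 0) +
               np.roll(filled, 1, 1) + np.roll(filled, -1, 1)) / 4.0
        filled[vessel_mask] = avg[vessel_mask]
    return filled
```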
The depth estimation network follows an encoder-decoder architecture similar to U-Net but replaces standard convolutional blocks with Dilated Residual Inception (DRI) modules. Each DRI block combines parallel convolutions of different kernel sizes, dilated convolutions to enlarge the receptive field without increasing parameters, and residual connections to ease gradient flow. The encoder uses 4×4 strided convolutions (instead of max-pooling) to preserve spatial continuity, while the decoder mirrors this with transposed convolutions and skip connections. The final layer applies a 1×1 convolution and tanh activation to produce a normalized depth map. Loss functions explored include L2, L1, and the reverse Huber (berHu) loss; berHu yields the best quantitative results.
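The reverse Huber (berHu) loss behaves like L1 for small residuals and like a scaled L2 beyond a threshold c, which makes it sensitive to both small and large depth errors. A common choice, assumed here (not necessarily the paper's exact setting), is c equal to one fifth of the maximum absolute residual per batch:

```python
import numpy as np

def berhu_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Reverse Huber (berHu): |e| below threshold c, (e^2 + c^2)/(2c)
    above it. c = 0.2 * max|e| is a common convention and an
    assumption here."""
    err = np.abs(pred - target)
    c = 0.2 * err.max()
    if c == 0.0:
        return 0.0                      # perfect prediction
    return float(np.mean(np.where(err <= c, err,
                                  (err ** 2 + c ** 2) / (2.0 * c))))
```

Note that the two branches meet smoothly at |e| = c, where both evaluate to c.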
Segmentation with Depth Guidance
For OD-OC segmentation, the authors design a guided fully convolutional network that processes the RGB image and the estimated depth map through two parallel branches. Each branch extracts features using either simple residual blocks or the same DRI blocks. After two successive blocks, the depth branch output is added element-wise to the image branch output (sparse fusion) and passed through a 3×3 Conv-BatchNorm-ReLU layer, forming a multimodal feature fusion block.
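The fusion step (element-wise addition followed by 3×3 Conv-BN-ReLU) can be sketched for a single-channel feature map as follows. The naive convolution and the per-map normalisation standing in for BatchNorm are simplifications for illustration, not the paper's implementation:

```python
import numpy as np

def conv3x3(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Naive 'same'-padded 3x3 convolution for a single-channel map."""
    p = np.pad(x, 1)
    out = np.empty_like(x, dtype=np.float64)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * w)
    return out

def fuse(img_feat: np.ndarray, depth_feat: np.ndarray,
         w: np.ndarray) -> np.ndarray:
    """Sparse fusion: element-wise sum of the two branches, a 3x3
    convolution, a per-map normalisation (toy BatchNorm), then ReLU."""
    s = img_feat + depth_feat                  # element-wise fusion
    y = conv3x3(s, w)
    y = (y - y.mean()) / (y.std() + 1e-8)      # stand-in for BatchNorm
    return np.maximum(y, 0.0)                  # ReLU

# Identity kernel keeps the example easy to follow.
w = np.zeros((3, 3)); w[1, 1] = 1.0
out = fuse(np.arange(16.0).reshape(4, 4), np.zeros((4, 4)), w)
```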
The main segmentation backbone resembles the architecture proposed in prior work (e.g., a residual U-Net) but incorporates an additional depth encoder with six levels (versus eight for the RGB encoder). Features from alternating levels of both encoders are fused via the multimodal block before being down-sampled further. Only the fused features are propagated through the main (RGB) branch; the depth branch remains separate, reducing computational overhead. The decoder receives only the main branch features together with long skip connections from the encoder. Training uses a multiclass cross-entropy loss for three classes (background, OD, OC).
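The training objective for the three classes is an ordinary pixel-wise cross-entropy. A numerically stable NumPy version, written here for illustration with (H, W, 3) logits:

```python
import numpy as np

def pixel_cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean pixel-wise cross-entropy over 3 classes
    (0 = background, 1 = optic disc, 2 = optic cup).
    logits: (H, W, 3) raw scores; labels: (H, W) integer class ids."""
    z = logits - logits.max(axis=-1, keepdims=True)   # stability shift
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    # Pick the log-probability of the true class at every pixel.
    picked = log_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return float(-picked.mean())
```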
To refine the segmentation boundaries, a Conditional Random Field (CRF) is optionally applied, leveraging both intensity and depth cues to enforce spatial consistency.
Experimental Evaluation
Depth estimation is evaluated on the INSPIRE-stereo dataset. The pseudo-depth pre-training combined with DRI blocks reduces the root-mean-square error (RMSE) by roughly 12% compared with a denoising-auto-encoder baseline and improves the δ < 1.25 accuracy to 0.89.
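For reference, the two depth metrics quoted above, RMSE and the δ < 1.25 threshold accuracy (the fraction of pixels whose predicted/true depth ratio stays below 1.25 in both directions), can be computed as:

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Return (RMSE, delta < 1.25 accuracy) for a predicted depth map."""
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    # Ratio in both directions guards against systematic over/under-scaling.
    ratio = np.maximum(pred / (gt + eps), gt / (pred + eps))
    delta1 = float(np.mean(ratio < 1.25))
    return rmse, delta1
```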
Segmentation performance is tested on three publicly available datasets that contain pixel-wise OD-OC annotations: ORIGA, RIM-ONE r3, and Drishti-GS. The proposed guided network achieves average Dice scores of 0.94 (OD) / 0.88 (OC) on ORIGA, 0.92 / 0.86 on RIM-ONE r3, and 0.95 / 0.89 on Drishti-GS. These results surpass recent state-of-the-art methods based on template matching, level-set, conventional CNNs, and polar-transform U-Nets, typically by 2-4% absolute Dice improvement. Adding the depth guide consistently boosts OC Dice by about 3.5% relative to an RGB-only baseline, and reduces the mean absolute error of the cup-to-disc ratio by 0.02. The optional CRF post-processing yields an additional 1-2% gain in Dice.
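The Dice scores reported here compare binary masks; a minimal single-class implementation:

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice coefficient between two binary masks:
    2 * |intersection| / (|pred| + |gt|)."""
    inter = np.logical_and(pred, gt).sum()
    return float(2.0 * inter / (pred.sum() + gt.sum() + eps))
```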
Runtime measurements on an NVIDIA GTX 1080 Ti indicate that depth prediction takes ~45 ms per image and segmentation ~30 ms, making the whole pipeline suitable for real-time screening scenarios.
Contributions and Limitations
The paper's main contributions are: (1) a pseudo-depth self-supervised pre-training strategy that better aligns feature learning with the depth estimation task; (2) the Dilated Residual Inception block for efficient multi-scale feature extraction; (3) a guided multimodal segmentation architecture that fuses depth and color information; (4) extensive evaluation across multiple datasets showing competitive or superior performance; and (5) an exploration of CRF-based refinement.
Limitations include the small size of the depth-ground-truth dataset (only 30 images), potential sensitivity of the pseudo-depth generation to illumination variations or color balance, and the added computational cost of CRF post-processing.
Future Directions
The authors suggest expanding the depth-training corpus with larger paired fundus-OCT datasets, improving robustness of pseudo-depth generation (e.g., by incorporating illumination normalization or learning the pseudo-depth transformation), developing lightweight mobile-friendly models for point-of-care devices, and extending the multimodal framework to incorporate additional cues such as vessel morphology or retinal texture for a more comprehensive glaucoma risk assessment.