UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision
Specular highlights distort appearance, obscure texture, and hinder geometric reasoning in both natural and surgical imagery. We present UnReflectAnything, an RGB-only framework that removes highlights from a single image by predicting a highlight map together with a reflection-free diffuse reconstruction. The model uses a frozen vision transformer encoder to extract multi-scale features, a lightweight head to localize specular regions, and a token-level inpainting module that restores corrupted feature patches before producing the final diffuse image. To overcome the lack of paired supervision, we introduce a Virtual Highlight Synthesis pipeline that renders physically plausible specularities using monocular geometry, Fresnel-aware shading, and randomized lighting, enabling training on arbitrary RGB images with correct geometric structure. UnReflectAnything generalizes across natural and surgical domains, where non-Lambertian surfaces and non-uniform lighting create severe highlights, and it achieves performance competitive with the state of the art on several benchmarks. Project Page: https://alberto-rota.github.io/UnReflectAnything/
💡 Research Summary
The paper “UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision” presents a deep learning framework that removes specular highlights from a single RGB image, applicable to both natural and challenging surgical (endoscopic) imagery. It addresses two core problems: the ill-posed nature of separating diffuse and specular components without specialized hardware such as polarization cameras, and the scarcity of large-scale, perfectly paired training data.
The key innovation is a “Virtual Highlight Synthesis” pipeline that generates its own supervision signal, eliminating the need for real paired data. For any input RGB image, the pipeline first estimates monocular geometry (depth, surface normals, camera intrinsics) using an off-the-shelf network. It then constructs a 3D point cloud, samples a random virtual light source in camera coordinates, and renders a physically plausible specular highlight map using a Blinn-Phong reflectance model modulated by a Schlick-Fresnel term. This synthetic highlight is composited onto the original image, creating an artificial “highlight-corrupted” version. The original image (before synthesis) serves as the ground-truth “highlight-free” target, enabling self-supervised training on virtually any RGB image.
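The synthesis step above can be sketched in code. This is a minimal illustration, not the paper's implementation: the function names, the additive compositing, and parameters such as `shininess` and `f0` are assumptions; the paper's randomized lighting and exact Blinn-Phong/Schlick-Fresnel formulation may differ in detail.

```python
import numpy as np

def render_virtual_highlight(points, normals, light_pos,
                             shininess=64.0, f0=0.04, intensity=1.0):
    """Render a Blinn-Phong specular map modulated by a Schlick-Fresnel term.

    points:    (H, W, 3) 3D positions in camera coordinates (from monocular depth)
    normals:   (H, W, 3) unit surface normals
    light_pos: (3,) virtual point light sampled in camera coordinates
    """
    # View direction: from each surface point toward the camera at the origin.
    view = -points / (np.linalg.norm(points, axis=-1, keepdims=True) + 1e-8)
    # Light direction: from each surface point toward the virtual light.
    light = light_pos - points
    light /= np.linalg.norm(light, axis=-1, keepdims=True) + 1e-8
    # Blinn-Phong half vector between view and light directions.
    half = view + light
    half /= np.linalg.norm(half, axis=-1, keepdims=True) + 1e-8

    n_dot_h = np.clip((normals * half).sum(-1), 0.0, 1.0)
    v_dot_h = np.clip((view * half).sum(-1), 0.0, 1.0)

    # Schlick's approximation to the Fresnel reflectance.
    fresnel = f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5
    return np.clip(intensity * fresnel * n_dot_h ** shininess, 0.0, 1.0)

def composite_highlight(rgb, spec):
    """Composite a specular map onto an RGB image in [0, 1] (additive sketch)."""
    return np.clip(rgb + spec[..., None], 0.0, 1.0)
```

The corrupted composite becomes the network input, while the original `rgb` is the highlight-free target, so any RGB image yields a training pair.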
The model architecture is built for efficiency and effectiveness. It employs a frozen, pretrained DINOv3-Large Vision Transformer as a feature encoder to extract rich, multi-scale patch tokens. A lightweight highlight predictor head processes these features to generate a soft, pixel-level highlight probability map. The heart of the method is a “token-level inpainting” module. Patches identified as highlights (synthetic or pre-existing) are masked in the feature token space. These masked tokens are replaced with a blend of a learnable mask token and a local mean prior computed from neighboring visible tokens. A small stack of Transformer blocks then refines these seed tokens by attending to the full context, effectively “inpainting” the corrupted features in the semantic token space. Finally, a decoder transforms these restored multi-scale features into the final reflection-free diffuse RGB image.
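The seeding of masked tokens described above can be sketched as follows. This is a simplified NumPy illustration under stated assumptions: the function name, the 3x3 neighbourhood, and the blend weight `alpha` are hypothetical, and the learnable mask token is represented by a fixed vector; the actual module operates on DINOv3 patch tokens and is followed by Transformer refinement blocks, which are omitted here.

```python
import numpy as np

def seed_masked_tokens(tokens, mask, mask_token, alpha=0.5):
    """Replace highlight-masked patch tokens with a blend of a (learnable)
    mask token and a local mean prior over visible neighbouring tokens.

    tokens:     (H, W, D) grid of patch tokens
    mask:       (H, W) boolean, True where the patch is highlight-corrupted
    mask_token: (D,) embedding standing in for the learnable mask token
    alpha:      blend weight between mask token and local mean prior
    """
    H, W, D = tokens.shape
    seeded = tokens.copy()
    visible = ~mask
    for i, j in zip(*np.nonzero(mask)):
        # Gather visible tokens in the 3x3 neighbourhood of the masked patch.
        i0, i1 = max(i - 1, 0), min(i + 2, H)
        j0, j1 = max(j - 1, 0), min(j + 2, W)
        nb_visible = visible[i0:i1, j0:j1]
        if nb_visible.any():
            local_mean = tokens[i0:i1, j0:j1][nb_visible].mean(0)
        elif visible.any():
            # Fall back to the global mean when the whole neighbourhood is masked.
            local_mean = tokens[visible].mean(0)
        else:
            local_mean = np.zeros(D)
        seeded[i, j] = alpha * mask_token + (1.0 - alpha) * local_mean
    return seeded
```

The seeded grid is then flattened back into a token sequence and refined by the small Transformer stack, which attends to the full visible context before decoding.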
Training utilizes a hybrid supervision scheme. The highlight predictor is trained on the synthetic highlight maps using a combination of Dice, L1, and Total Variation losses. The inpainting module is trained to reconstruct the original, clean feature tokens for patches where synthetic highlights were added. A crucial detail is the handling of “dataset highlights”—real highlights already present in the original training images (common in endoscopy). Since these regions are saturated and lack reliable ground truth, they are excluded from the inpainting loss calculation, even though the model must still inpaint them. This forces the model to learn a general inpainting function for highlight regions without being misled by corrupted supervision.
Experiments demonstrate that UnReflectAnything achieves performance competitive with state-of-the-art methods on natural image benchmarks (e.g., the Specular Highlights Removal dataset) and generalizes effectively to surgical domains. It removes or attenuates severe specularities while preserving underlying texture, without requiring polarization input or large paired datasets. The work stands out for its combination of physically based synthetic data generation, efficient use of frozen foundation-model features, and token-space reasoning to solve a classic computer vision problem in a practical, RGB-only framework.