From SAM to DINOv2: Towards Distilling Foundation Models to Lightweight Baselines for Generalized Polyp Segmentation
Accurate polyp segmentation during colonoscopy is critical for the early detection of colorectal cancer, yet it remains challenging due to significant variations in polyp size, shape, and color, and the camouflaged nature of polyps. While lightweight baseline models such as U-Net, U-Net++, and PraNet offer advantages in terms of easy deployment and low computational cost, they struggle with these issues, leading to limited segmentation performance. In contrast, large-scale vision foundation models such as SAM, DINOv2, OneFormer, and Mask2Former have exhibited impressive generalization performance across natural image domains. However, their direct transfer to medical imaging tasks (e.g., colonoscopic polyp segmentation) is not straightforward, primarily due to the scarcity of large-scale datasets and the lack of domain-specific knowledge. To bridge this gap, we propose a novel distillation framework, Polyp-DiFoM, that transfers the rich representations of foundation models into lightweight segmentation baselines, allowing efficient and accurate deployment in clinical settings. In particular, we infuse semantic priors from the foundation models into canonical architectures such as U-Net and U-Net++ and further perform frequency-domain encoding for enhanced distillation, corroborating their generalization capability. Extensive experiments are performed across five benchmark datasets, namely Kvasir-SEG, CVC-ClinicDB, ETIS, ColonDB, and CVC-300. Notably, Polyp-DiFoM consistently and significantly outperforms its respective baseline models, as well as the state-of-the-art model, with nearly 9 times lower computational overhead. The code is available at https://github.com/lostinrepo/PolypDiFoM.
💡 Research Summary
This paper introduces “Polyp-DiFoM,” a novel knowledge distillation framework designed to bridge the gap between large, powerful vision foundation models and lightweight, deployable architectures for the challenging task of colonoscopic polyp segmentation. Accurate polyp segmentation is crucial for early colorectal cancer detection but remains difficult due to significant variations in polyp size, shape, and appearance, as well as their camouflaged nature.
The authors identify a key dilemma: lightweight baseline models like U-Net and U-Net++ are efficient and easy to deploy but struggle with robustness and generalization across diverse clinical datasets. In contrast, large-scale vision foundation models (FMs) such as SAM, DINOv2, OneFormer, and Mask2Former exhibit impressive generalization capabilities learned from vast natural image datasets. However, their direct application to medical imaging is hindered by the domain gap between natural and endoscopic images, scarcity of large-scale medical data, and their substantial computational cost, making them unsuitable for resource-constrained clinical settings.
To address this, Polyp-DiFoM proposes a modular distillation framework that transfers the rich, general-purpose visual representations from multiple frozen FMs into a trainable lightweight segmentation backbone (e.g., U-Net). The core innovation lies in its “Semantic High-Low Distillation” module. Instead of directly mimicking FM features, the framework first extracts and fuses semantic embeddings from different FMs. It then applies a 2D Fast Fourier Transform (FFT) to this unified representation, analyzing it in the frequency domain. This allows the separation of features into Low-Frequency Components (LFC), which capture global semantic context and shape information, and High-Frequency Components (HFC), which encode fine-grained structural details like edges and textures.
These distilled components are then aligned with specific latent vectors within the redesigned baseline encoder through two distinct loss functions: an L1-distillation loss aligns the LFC with a latent vector responsible for global semantics, while an L2-distillation loss aligns the HFC with a vector for local structural details. This targeted approach ensures the student model learns both the “what” (context) and the “where” (precise boundaries) of polyps from the teachers. A foundational feature-aware decoder then integrates these distilled priors with multi-scale encoder features to generate the final segmentation mask.
The framework is evaluated extensively on five public polyp segmentation benchmarks: Kvasir-SEG, CVC-ClinicDB, ETIS, ColonDB, and CVC-300. The results demonstrate that Polyp-DiFoM consistently and significantly outperforms its vanilla baseline counterparts (U-Net/U-Net++) across all datasets. Moreover, it achieves superior generalization performance on unseen data, even surpassing contemporary state-of-the-art polyp segmentation models. A critical practical advantage is efficiency: Polyp-DiFoM provides this performance boost with nearly 9 times lower computational overhead compared to running the full foundation models, making it highly suitable for real-time clinical deployment. The work effectively demonstrates how to leverage the power of foundation models to create accurate, robust, and efficient specialized tools for medical image analysis.