Generalization of Self-Supervised Vision Transformers for Protein Localization Across Microscopy Domains
Task-specific microscopy datasets are often too small to train deep learning models that learn robust feature representations. Self-supervised learning (SSL) can mitigate this by pretraining on large unlabeled datasets, but it remains unclear how well such representations transfer across microscopy domains with different staining protocols and channel configurations. We investigate the cross-domain transferability of DINO-pretrained Vision Transformers for protein localization on the OpenCell dataset. We generate image embeddings using three DINO backbones pretrained on ImageNet-1k, the Human Protein Atlas (HPA), and OpenCell, and evaluate them by training a supervised classification head on OpenCell labels. All pretrained models transfer well, with the microscopy-specific HPA-pretrained model achieving the best performance (mean macro $F_1$-score = 0.8221 $\pm$ 0.0062), slightly outperforming a DINO model trained directly on OpenCell (0.8057 $\pm$ 0.0090). These results highlight the value of large-scale pretraining and indicate that domain-relevant SSL representations can generalize effectively to related but distinct microscopy datasets, enabling strong downstream performance even when task-specific labeled data are limited.
💡 Research Summary
This paper investigates the cross‑domain transferability of self‑supervised Vision Transformers (ViTs) trained with the DINO framework for protein‑localization tasks in microscopy. Three DINO backbones are considered: one pretrained on the natural‑image benchmark ImageNet‑1k, a second pretrained on the Human Protein Atlas (HPA) fluorescence‑microscopy dataset, and a third trained directly on the OpenCell CRISPR‑tagged protein dataset. The authors generate embeddings for the OpenCell images using each backbone and evaluate downstream performance by training a simple supervised classification head to predict 17 subcellular compartment labels.
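The evaluation protocol above can be sketched as follows. The arrays here are random stand-ins for the real DINO embeddings and OpenCell labels, the task is treated as multi-label over the 17 compartments, and the one-vs-rest logistic-regression head is one plausible reading of "simple supervised classification head" rather than the paper's exact implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Stand-ins for precomputed DINO embeddings (e.g. 384-d ViT-S features)
# and multi-label targets over 17 subcellular compartments.
n_samples, dim, n_labels = 200, 384, 17
X = rng.normal(size=(n_samples, dim))
Y = rng.integers(0, 2, size=(n_samples, n_labels))

# A simple supervised head on frozen embeddings: one linear classifier
# per compartment (one-vs-rest logistic regression). The backbone is
# never fine-tuned; only this head sees the OpenCell labels.
head = OneVsRestClassifier(LogisticRegression(max_iter=1000))
head.fit(X[:150], Y[:150])

# Evaluate with the macro F1-score reported in the paper: the F1 is
# computed per compartment and averaged, weighting all classes equally.
pred = head.predict(X[150:])
macro_f1 = f1_score(Y[150:], pred, average="macro", zero_division=0)
print(f"macro F1: {macro_f1:.4f}")
```

Because the macro average weights each of the 17 compartments equally, it penalizes poor performance on rare classes, which matters for imbalanced localization datasets.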
A central technical challenge is the mismatch in channel composition across datasets. OpenCell provides two channels (protein of interest and nucleus), HPA supplies four channels (protein, microtubules, nucleus, endoplasmic reticulum), and ImageNet consists of three RGB channels. To reconcile these differences the authors explore two strategies: (1) Channel replication, where each channel is processed independently by the pretrained DINO model and the resulting feature vectors are concatenated; this requires no additional training but linearly increases feature dimensionality and computational cost. (2) Channel mapping, where semantically corresponding channels are aligned and missing channels are padded with zeros (e.g., mapping OpenCell protein to HPA protein, nucleus to HPA nucleus, and filling the microtubule and ER slots with zeros). For ImageNet, protein is mapped to the red channel and nucleus to green.
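The two channel-handling strategies reduce to simple array operations. In this sketch, images are `(C, H, W)` NumPy arrays and `embed_fn` is a placeholder for the frozen DINO forward pass; tiling a single channel to three so an RGB-pretrained model can consume it is an assumption about the preprocessing, not a detail confirmed by the paper:

```python
import numpy as np

def replicate_channels(img, embed_fn):
    """Channel replication: embed each channel independently with the
    frozen backbone and concatenate the feature vectors.  Each grayscale
    channel is tiled to 3 channels (assumed preprocessing for an
    RGB-pretrained model).  Feature dimensionality grows linearly in C."""
    feats = [embed_fn(np.repeat(img[c:c + 1], 3, axis=0))
             for c in range(img.shape[0])]
    return np.concatenate(feats)

def map_channels_to_hpa(img):
    """Channel mapping for an HPA-pretrained backbone: place OpenCell's
    (protein, nucleus) channels into the HPA layout
    (protein, microtubules, nucleus, ER), zero-padding the missing
    microtubule and ER slots."""
    protein, nucleus = img[0], img[1]
    zeros = np.zeros_like(protein)
    return np.stack([protein, zeros, nucleus, zeros])

def map_channels_to_rgb(img):
    """Channel mapping for an ImageNet-pretrained backbone:
    protein -> red, nucleus -> green, blue left at zero."""
    protein, nucleus = img[0], img[1]
    return np.stack([protein, nucleus, np.zeros_like(protein)])

# Toy usage: a mean-pooling stand-in for the real embedding function.
img = np.random.default_rng(1).normal(size=(2, 8, 8))
rep_feats = replicate_channels(img, lambda x: x.mean(axis=(1, 2)))
hpa_input = map_channels_to_hpa(img)
```

With two OpenCell channels, replication doubles both the number of forward passes and the embedding dimensionality, while mapping keeps a single forward pass at the cost of zero-filled inputs, matching the trade-off described above.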
Experimental results show that all three backbones transfer well to the OpenCell task. The HPA‑pretrained model achieves the highest mean macro F1‑score of 0.8221 ± 0.0062, slightly surpassing the model trained directly on OpenCell (0.8057 ± 0.0090) and markedly outperforming the ImageNet‑pretrained model (≈0.78). This suggests that domain‑specific microscopy data provide richer structural cues for self‑supervised learning than generic natural images, and that pretraining on a large related dataset can even outweigh pretraining directly on the smaller target dataset. The performance gap between the channel‑replication and channel‑mapping strategies is minimal, indicating that either approach is viable in practice depending on computational constraints.
Key contributions include: (i) empirical evidence that large‑scale, domain‑relevant SSL pretraining (HPA) yields representations that generalize effectively to a distinct but related microscopy dataset (OpenCell); (ii) a practical workflow for handling differing channel configurations via simple replication or mapping, enabling the reuse of pretrained DINO models without extensive architectural changes; (iii) confirmation that self‑supervised ViTs can serve as powerful feature extractors for small‑sample microscopy studies, reducing the need for extensive manual annotation. The authors suggest future directions such as incorporating channel‑aware normalization, domain‑adaptation loss functions, and strategies for handling severe class imbalance, which could further boost cross‑domain performance. Overall, the study underscores the promise of self‑supervised vision transformers as foundation models for bio‑imaging, facilitating robust downstream analysis even when labeled data are scarce.