Transcending the Annotation Bottleneck: AI-Powered Discovery in Biology and Medicine
Authors: Soumick Chatterjee
Soumick Chatterjee 1,2 [0000-0001-7594-1188]
1 Human Technopole, Milan, Italy
2 Faculty of Computer Science, Otto von Guericke University Magdeburg, Magdeburg, Germany
contact@soumick.com

Abstract. The dependence on expert annotation has long constituted the primary rate-limiting step in the application of artificial intelligence to biomedicine. While supervised learning drove the initial wave of clinical algorithms, a paradigm shift towards unsupervised and self-supervised learning (SSL) is currently unlocking the latent potential of biobank-scale datasets. By learning directly from the intrinsic structure of data (whether pixels in a magnetic resonance image (MRI), voxels in a volumetric scan, or tokens in a genomic sequence), these methods facilitate the discovery of novel phenotypes, the linkage of morphology to genetics, and the detection of anomalies without human bias. This article synthesises seminal and recent advances in "learning without labels," highlighting how unsupervised frameworks can derive heritable cardiac traits, predict spatial gene expression in histology, and detect pathologies with performance that rivals or exceeds supervised counterparts.

Keywords: Unsupervised Learning · Medical Imaging · Phenotype Discovery · Anomaly Detection · Genomics

1 Introduction: The Annotation Bottleneck and the Unsupervised Solution

For the past decade, the standard workflow in biomedical data analysis has necessitated the curation of datasets, the manual annotation of regions of interest (e.g., tumours, lesions, or anatomical structures), and the training of supervised models to replicate these human labels.
While effective for specific, narrow tasks, this approach is fundamentally constrained by the scarcity of high-quality labels, the inherent bias of human knowledge, and the high cost of expert time. Furthermore, supervised approaches typically discard the vast majority of information contained within high-dimensional data, focusing only on features relevant to the pre-defined label.

To overcome these limitations, the field has increasingly turned to unsupervised and self-supervised learning. A common critique of these methods is that they sacrifice accuracy for flexibility. However, recent evidence suggests this trade-off is vanishing. In an investigation of voxel-wise segmentation for additive manufacturing, Iuso et al. [11] compared sophisticated supervised models (such as UNet++) against unsupervised VAE-based approaches for porosity detection. Remarkably, they found that unsupervised models, particularly when post-processed, could achieve performance metrics (Average Precision 0.830) that rivalled or even exceeded their supervised counterparts (Average Precision 0.751) in challenging testing scenarios. This finding challenges the orthodoxy that supervised learning is invariably superior, suggesting that for complex, highly variable targets, a model that comprehends the fundamental data distribution may be more robust than one trained to mimic a limited set of human labels.

These techniques learn robust representations by solving "pretext" tasks, such as contrasting similar views of an image or reconstructing masked portions of data, rather than predicting extrinsic labels.
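The contrastive pretext task just described can be made concrete with a minimal NumPy sketch of the NT-Xent objective popularised by SimCLR-style methods. The batch size, embedding width, and temperature of 0.5 below are arbitrary illustrative choices, not values taken from any cited work:

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent (normalised temperature-scaled cross-entropy) loss.

    z_a, z_b: (N, D) embeddings of two augmented views of the same N samples.
    The positive pair for sample i in one view is sample i in the other view;
    every other embedding in the combined batch acts as a negative.
    """
    z = np.concatenate([z_a, z_b], axis=0)             # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalise rows
    sim = z @ z.T / temperature                        # scaled cosine similarities
    n = len(z_a)
    np.fill_diagonal(sim, -np.inf)                     # a sample cannot match itself
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Row-wise log-softmax, then pick out the log-probability of each positive.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), targets].mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = nt_xent_loss(z, z)                           # views agree: low loss
mismatched = nt_xent_loss(z, rng.normal(size=(8, 16)))  # unrelated views: higher loss
```

In practice the two views would be random augmentations of the same image batch, and the loss would be minimised by backpropagating through an encoder network; the sketch only shows that matched views score a lower loss than mismatched ones.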
Seminal works in computer vision established the efficacy of this approach: SimCLR [8] demonstrated that contrastive learning could produce visual representations comparable to supervised methods, while DINO [5] utilised self-distillation with Vision Transformers (ViT) to capture semantic segmentation properties without explicit supervision. In the medical domain, these principles were adapted to address data heterogeneity by Azizi et al. [1], paving the way for "foundation models" capable of discovering biological signals that may elude human observers.

2 Unsupervised Learning in Medical Imaging

The application of unsupervised learning in medical imaging has matured from simple dimensionality reduction to complex tasks involving phenotype discovery, anomaly detection, and image registration.

2.1 Phenotype Discovery and Genetic Linkage

A principal advantage of data-driven discovery is the capacity to define quantitative phenotypes that bridge the gap between macroscopic imaging and microscopic genetics. In the realm of multimodal learning, Taleb et al. [19] introduced the "ContIG" framework, demonstrating that self-supervised contrastive learning could effectively integrate medical imaging with genetic data to improve disease prediction.

Building upon this, Radhakrishnan et al. [17] utilised cross-modal autoencoders to learn holistic representations of the cardiovascular state. Expanding this frontier, Ometto et al. [16] recently developed a 3D diffusion autoencoder (3DDiffAE) to analyse temporal cardiac MRIs from the UK Biobank. Unlike traditional methods relying on fixed parameters such as ejection fraction, this unsupervised model learnt a "latent space" of 182 phenotypes describing complex cardiac wall motion and structure. Crucially, Ometto et al. demonstrated that these latent phenotypes shared a genetic architecture with established cardiac diseases, revealing 89 significant genomic loci.
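The latent-phenotype workflow can be illustrated end to end with synthetic data: compress high-dimensional "imaging" features into a few unsupervised latents, then scan each latent for association with a genotype. In this sketch, PCA stands in for a trained autoencoder, and all data (subjects, features, the single SNP, and the planted effect size) are simulated; it conveys the shape of the workflow, not the 3DDiffAE pipeline itself:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins: 500 "subjects", 64 imaging-derived features, one SNP (0/1/2).
n_subjects, n_features, n_latents = 500, 64, 8
genotype = rng.integers(0, 3, size=n_subjects).astype(float)
images = rng.normal(size=(n_subjects, n_features))
images[:, 0] += 2.0 * genotype   # plant a genetic effect in one feature

# Unsupervised compression: PCA as a linear stand-in for an autoencoder latent space.
centred = images - images.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
latents = centred @ vt[:n_latents].T          # (subjects, latent phenotypes)

def corr(x, y):
    """Pearson correlation between two 1-D arrays."""
    x, y = x - x.mean(), y - y.mean()
    return (x @ y) / np.sqrt((x @ x) * (y @ y))

# Association scan: correlate each latent phenotype with the genotype.
r = np.array([abs(corr(latents[:, k], genotype)) for k in range(n_latents)])
top = int(r.argmax())   # latent dimension most associated with the SNP
```

A real analysis would replace the correlation scan with a genome-wide association study over millions of variants, with covariate adjustment and multiple-testing correction, but the logic (unsupervised latents treated as quantitative traits) is the same.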
This principle extends to the microscopic scale. In computational pathology, Cisternino et al. [9] utilised self-supervised Vision Transformers (ViT) trained on over 1.7 million histology tiles from the Genotype-Tissue Expression (GTEx) project. Their model, RNAPath, utilises these self-supervised features to predict spatial RNA expression levels directly from H&E-stained slides, effectively bridging the gap between tissue morphology and transcriptomics without the need for expensive spatial transcriptomics assays.

2.2 Robust Anomaly Detection

One of the most immediate clinical applications of learning without labels is anomaly detection: identifying pathology as a deviation from the normative distribution. StRegA [7], a pipeline utilising a context-encoding Variational Autoencoder (VAE), addressed this in neuroimaging. By learning the distribution of healthy brain anatomy, StRegA identifies regions that the model cannot accurately reconstruct, successfully localising brain tumours and other anomalies without ever observing a labelled tumour during training.

Building upon these foundational VAE-based approaches, recent advancements have introduced more sophisticated generative models. Li et al. [13] introduced Scale-Aware Contrastive Reverse Distillation (SCAD), a novel framework that enhances anomaly detection by leveraging multi-scale feature representations and contrastive learning. SCAD addresses the challenge of scale variance in medical anomalies by distilling knowledge from a pre-trained teacher network to a student network in a reverse manner, effectively capturing anomalies across different resolutions. Furthermore, Bercea et al. [4] critically evaluated normative representation learning in generative AI for robust anomaly detection in brain imaging.
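The normative-modelling recipe behind these reconstruction-based detectors can be sketched in a few lines: fit a model on healthy data only, reconstruct an unseen sample, and flag locations with high reconstruction error. In the sketch below, PCA stands in for a trained VAE, the "scans" are simulated low-rank vectors, and the lesion and threshold factor are arbitrary; none of this reproduces the published pipelines:

```python
import numpy as np

rng = np.random.default_rng(7)

# "Healthy" training data: low-rank structure plus noise (stand-in for brain scans).
basis = rng.normal(size=(5, 100))                      # 5 latent modes, 100 "voxels"
healthy = rng.normal(size=(300, 5)) @ basis + 0.1 * rng.normal(size=(300, 100))

# Learn the normative subspace (PCA as a stand-in for a VAE trained on healthy data).
mean = healthy.mean(axis=0)
_, _, vt = np.linalg.svd(healthy - mean, full_matrices=False)
components = vt[:5]                                    # normative latent directions

def anomaly_map(x):
    """Per-voxel squared reconstruction error: large where x leaves the healthy modes."""
    recon = mean + (x - mean) @ components.T @ components
    return (x - recon) ** 2

# A "lesion": a healthy-like sample with an out-of-distribution bump in voxels 40-49.
sample = rng.normal(size=5) @ basis + 0.1 * rng.normal(size=100)
sample[40:50] += 4.0
errors = anomaly_map(sample)
detected = errors[40:50].mean() > 5 * errors[:40].mean()
```

Because the lesion does not lie in the subspace learnt from healthy data, the model cannot reconstruct it, and the reconstruction error localises it; this is the same intuition, in linear miniature, that VAE- and diffusion-based detectors exploit.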
Bercea et al.'s work highlights the importance of robust normative learning to ensure that generative models accurately capture the variability of healthy brain anatomy, thereby improving the sensitivity and specificity of anomaly detection.

Addressing the limitations of standard diffusion models in preserving fine details, Beizaee et al. [3] introduced MAD-AD, a masked diffusion framework for unsupervised brain anomaly detection. By incorporating masking strategies into the diffusion process, MAD-AD effectively mitigates the issue of noise accumulation during image reconstruction, leading to more precise anomaly localisation compared to traditional diffusion-based methods.

The field continues to diversify with the emergence of State Space Models (SSMs) like Mamba, which have led to new architectures such as MAAT (Mamba Adaptive Anomaly Transformer) [20]. This model efficiently captures long-range dependencies in physiological data, offering a computationally efficient alternative to traditional Transformers. Furthermore, foundation models are being adapted for this task; Seeböck et al. [18] recently utilised self-supervised learning to guide segmentation in retinal OCT scans, effectively using anomaly detection as a weak supervision signal.

2.3 Image Registration

Deformable image registration has traditionally been computationally expensive. Unsupervised deep learning methods like VoxelMorph [2] learn to predict deformation fields by optimising image similarity metrics, achieving state-of-the-art accuracy with significantly faster inference times. Building upon these foundations, MICDIR [6] introduced a multi-scale inverse-consistent framework incorporating a self-constructing graph latent.
By explicitly encoding global dependencies and enforcing cycle consistency, MICDIR demonstrated statistically significant improvements over VoxelMorph in both intramodal and intermodal brain MRI registration tasks.

3 Deciphering the Molecular Code

Beyond imaging, unsupervised learning is revolutionising genomics and molecular biology by treating biological sequences as a "language" of life.

3.1 Genomic Sequence Modelling

Just as large language models learn the structure of text, genomic models can learn the grammar of regulatory elements and gene expression without explicit labels. Seminal works like DNABERT [12] applied the BERT architecture to k-mer sequences of DNA, demonstrating that attention mechanisms could capture global and local genomic context. More recently, the Nucleotide Transformer [10] scaled this approach to billions of parameters, training on multispecies genomes to predict molecular phenotypes and variant effects.

3.2 Single-Cell Analysis

The advent of single-cell RNA sequencing (scRNA-seq) has provided high-resolution views of cellular heterogeneity, but the data is inherently high-dimensional, sparse, and noisy. Deep generative models like scVI (Single-cell Variational Inference) [15] use variational inference to approximate the underlying probability distributions of gene expression. By learning a low-dimensional latent representation of each cell, scVI can correct for batch effects, impute missing values, and cluster cell types without reliance on pre-defined markers.

4 Clinical and Therapeutic Frontiers

The utility of unsupervised learning extends into the translation of biological insights into clinical practice and therapeutic development.

4.1 Computational Phenotyping from Electronic Health Records

Electronic Health Records (EHR) contain rich, longitudinal data on patient health.
Unsupervised learning allows for "computational phenotyping": the discovery of clinical patterns without manual cohort definition. Inspired by natural language processing, models like BEHRT [14] treat patient medical histories as sequences of events and use Transformer architectures to learn robust patient representations. These self-supervised embeddings can predict future disease risks and stratify patients into novel subtypes, effectively enabling precision medicine at the population scale.

5 Concluding Remarks

The transition from supervised to unsupervised learning marks a decisive maturation in biomedical AI, effectively circumventing the "annotation bottleneck" that has long stifled progress. No longer compromising on accuracy, these self-supervised frameworks now rival supervised counterparts and drive genuine discovery, from defining novel cardiac phenotypes to decoding the genomic "language" of life. By leveraging the intrinsic structure of data, the field is moving towards a holistic view of biology where insights are derived from the data itself rather than human bias.

Future research must focus on the convergence of these modalities into unified "foundation models" capable of reasoning across imaging, genomics, and electronic health records simultaneously. Additionally, the exploration of computationally efficient architectures, such as State Space Models (e.g. Mamba), offers a promising avenue for modelling long-range biological dependencies that traditional Transformers struggle to capture. Ultimately, the priority remains bridging the gap between these high-dimensional latent representations and interpretable, clinically actionable biomarkers.

References

1. Azizi, S., Mustafa, B., Ryan, F., Beaver, Z., Freyberg, J., Deaton, J., Loh, A., Karthikesalingam, A., Kornblith, S., Chen, T., et al.: Big self-supervised models advance medical image classification.
In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3478–3488 (2021)
2. Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: VoxelMorph: a learning framework for deformable medical image registration. IEEE Transactions on Medical Imaging 38(8), 1788–1800 (2019)
3. Beizaee, F., Lodygensky, G., Desrosiers, C., Dolz, J.: MAD-AD: masked diffusion for unsupervised brain anomaly detection. In: International Conference on Information Processing in Medical Imaging. pp. 139–153. Springer (2025)
4. Bercea, C.I., Wiestler, B., Rueckert, D., Schnabel, J.A.: Evaluating normative representation learning in generative AI for robust anomaly detection in brain imaging. Nature Communications 16(1), 1624 (2025)
5. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9650–9660 (2021)
6. Chatterjee, S., Bajaj, H., Siddiquee, I.H., Subbarayappa, N.B., Simon, S., Shashidhar, S.B., Speck, O., Nürnberger, A.: MICDIR: Multi-scale inverse-consistent deformable image registration using UNetMSS with self-constructing graph latent. Computerized Medical Imaging and Graphics 108, 102267 (2023)
7. Chatterjee, S., Sciarra, A., Dünnwald, M., Tummala, P., Agrawal, S.K., Jauhari, A., Kalra, A., Oeltze-Jafra, S., Speck, O., Nürnberger, A.: StRegA: Unsupervised anomaly detection in brain MRIs using a compact context-encoding variational autoencoder. Computers in Biology and Medicine 149, 106093 (2022)
8. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning. pp. 1597–1607. PMLR (2020)
9. Cisternino, F., Ometto, S., Chatterjee, S., Giacopuzzi, E., Levine, A.P., Glastonbury, C.A.: Self-supervised learning for characterising histomorphological diversity and spatial RNA expression prediction across 23 human tissue types. Nature Communications 15(1), 5906 (2024)
10. Dalla-Torre, H., Gonzalez, L., Mendoza-Revilla, J., Lopez Carranza, N., Grzywaczewski, A.H., Oteri, F., Dallago, C., Trop, E., de Almeida, B.P., Sirelkhatim, H., et al.: Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nature Methods 22(2), 287–297 (2025)
11. Iuso, D., Chatterjee, S., Cornelissen, S., Verhees, D., Beenhouwer, J.D., Sijbers, J.: Voxel-wise segmentation for porosity investigation of additive manufactured parts with 3D unsupervised and (deeply) supervised neural networks. Applied Intelligence 54(24), 13160–13177 (2024)
12. Ji, Y., Zhou, Z., Liu, H., Davuluri, R.V.: DNABERT: pre-trained bidirectional encoder representations from Transformers model for DNA-language in genome. Bioinformatics 37(15), 2112–2120 (2021)
13. Li, C., Shi, Y., Hu, J., Zhu, X.X., Mou, L.: Scale-aware contrastive reverse distillation for unsupervised medical anomaly detection. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=HNOo4UNPBF
14. Li, Y., Rao, S., Solares, J.R.A., Hassaine, A., Ramakrishnan, R., Canoy, D., Zhu, Y., Rahimi, K., Salimi-Khorshidi, G.: BEHRT: transformer for electronic health records. Scientific Reports 10(1), 7155 (2020)
15. Lopez, R., Regier, J., Cole, M.B., Jordan, M.I., Yosef, N.: Deep generative modeling for single-cell transcriptomics. Nature Methods 15(12), 1053–1058 (2018)
16. Ometto, S., Chatterjee, S., Vergani, A.M., Landini, A., Sharapov, S., Giacopuzzi, E., Visconti, A., Bianchi, E., Santonastaso, F., Soda, E.M., et al.: Hundreds of cardiac MRI traits derived using 3D diffusion autoencoders share a common genetic architecture. medRxiv pp. 2024–11 (2024)
17. Radhakrishnan, A., Friedman, S.F., Khurshid, S., Ng, K., Batra, P., Lubitz, S.A., Philippakis, A.A., Uhler, C.: Cross-modal autoencoder framework learns holistic representations of cardiovascular state. Nature Communications 14(1), 2436 (2023)
18. Seeböck, P., Orlando, J.I., Michl, M., Mai, J., Schmidt-Erfurth, U., Bogunović, H.: Anomaly guided segmentation: Introducing semantic context for lesion segmentation in retinal OCT using weak context supervision from anomaly detection. Medical Image Analysis 93, 103104 (2024)
19. Taleb, A., Kirchler, M., Monti, R., Lippert, C.: ContIG: Self-supervised multimodal contrastive learning for medical imaging with genetics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20908–20921 (2022)
20. Zakaria Sellam, A., Benaissa, I., Taleb-Ahmed, A., Patrono, L., Distante, C.: MAAT: Mamba adaptive anomaly transformer with association discrepancy for time series. arXiv e-prints pp. arXiv–2502 (2025)