Decipher-MR: A Vision-Language Foundation Model for 3D MRI Representations

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Magnetic Resonance Imaging is a critical imaging modality in clinical diagnosis and research, yet its complexity and heterogeneity hinder scalable, generalizable machine learning. Although foundation models have revolutionized language and vision tasks, their application to MRI remains constrained by data scarcity and narrow anatomical focus. We present Decipher-MR, a 3D MRI-specific vision-language foundation model trained on 200,000 MRI series from over 22,000 studies spanning diverse anatomical regions, sequences, and pathologies. Decipher-MR integrates self-supervised vision learning with report-guided text supervision to build robust representations for broad applications. For efficient downstream use, Decipher-MR adopts a modular design in which lightweight, task-specific decoders are tuned on top of a frozen pretrained encoder. Under this setting, we evaluate Decipher-MR across disease classification, demographic prediction, anatomical localization, and cross-modal retrieval, demonstrating consistent improvements over existing foundation models and task-specific approaches. These results position Decipher-MR as a versatile foundation for MRI-based AI in clinical and research settings.


💡 Research Summary

Decipher‑MR introduces a large‑scale vision‑language foundation model specifically designed for three‑dimensional magnetic resonance imaging (MRI). The authors assembled a diverse pre‑training corpus comprising 203,233 MRI series from 22,594 studies, together with radiology reports for 20,658 of those studies. The dataset spans the full age range (0‑90 years), both sexes, multiple body regions (brain, spine, abdomen, pelvis, heart, etc.), a variety of pulse sequences (T1‑weighted, T2‑weighted, FLAIR, diffusion, contrast‑enhanced, etc.), and scanners from several manufacturers (GE, Siemens, Philips, Toshiba). This breadth is intended to mitigate the well‑known domain shift problems that plague medical AI models trained on narrow, homogeneous data.

Training proceeds in two stages. In the first stage, the visual encoder (a 3‑D Vision Transformer) is trained with a student‑teacher contrastive framework, while the text encoder (a Transformer‑based masked language model) learns from the radiology reports. This self‑supervised pre‑training yields robust modality‑specific embeddings without any label dependence. In the second stage, the authors perform image‑report contrastive learning, aligning the visual and textual latent spaces. By jointly optimizing a contrastive loss over paired MRI volumes and their corresponding reports, the model learns to associate fine‑grained anatomical and pathological descriptors with imaging patterns, enabling zero‑shot cross‑modal retrieval and improving downstream visual representation quality.
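The summary does not spell out the alignment objective, but contrastive image-report training of this kind typically uses a symmetric InfoNCE loss over a batch of matched pairs. A minimal pure-Python sketch under that assumption (function names and the temperature value are illustrative, not the paper's implementation):

```python
import math

def info_nce_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over an image-report similarity matrix.

    sim[i][j] is the similarity between image i and report j;
    matched image-report pairs sit on the diagonal.
    """
    scaled = [[s / temperature for s in row] for row in sim]

    def cross_entropy(rows):
        # Mean of -log softmax probability assigned to the matched entry.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # stabilize the log-sum-exp
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    i2t = cross_entropy(scaled)                         # image -> report
    t2i = cross_entropy(list(map(list, zip(*scaled))))  # report -> image
    return 0.5 * (i2t + t2i)

# Diagonal-dominant similarities (good alignment) yield a lower loss
# than mismatched ones.
aligned = [[0.9, 0.1], [0.2, 0.8]]
shuffled = [[0.1, 0.9], [0.8, 0.2]]
print(info_nce_loss(aligned) < info_nce_loss(shuffled))  # True
```

Minimizing this loss pulls each volume's embedding toward its own report and away from the other reports in the batch, which is what enables the zero-shot retrieval described later.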

A key architectural decision is to freeze the pretrained encoders after the two‑stage pre‑training and attach lightweight task‑specific decoders (e.g., a three‑layer MLP for classification, a small U‑Net for segmentation). This modular design dramatically reduces the number of trainable parameters for each downstream task, allowing rapid fine‑tuning on limited data while preserving the rich knowledge encoded in the large backbone.
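As a rough illustration of this frozen-encoder pattern (not the paper's ViT or decoder code; every name and dimension below is a made-up stand-in), the sketch freezes a toy encoder and trains only a single logistic unit on its embeddings:

```python
import math
import random

random.seed(0)

# Stand-in for the frozen pretrained backbone: a fixed random projection.
# (Hypothetical; the actual model uses a 3-D Vision Transformer.)
DIM_IN, DIM_EMB = 8, 4
W_FROZEN = [[random.gauss(0, 1) for _ in range(DIM_IN)] for _ in range(DIM_EMB)]

def encode(x):
    # Frozen forward pass: these weights are never updated.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W_FROZEN]

# Lightweight task-specific decoder: one trainable logistic unit.
w_head = [0.0] * DIM_EMB
b_head = 0.0

def predict(emb):
    z = sum(w * e for w, e in zip(w_head, emb)) + b_head
    return 1 / (1 + math.exp(-z))

def train_step(x, y, lr=0.1):
    global b_head
    emb = encode(x)          # no gradient ever reaches W_FROZEN
    err = predict(emb) - y   # derivative of log loss w.r.t. the logit
    for j in range(DIM_EMB):
        w_head[j] -= lr * err * emb[j]
    b_head -= lr * err

# Toy probing task: label depends on the first embedding coordinate.
data = [[random.gauss(0, 1) for _ in range(DIM_IN)] for _ in range(64)]
labels = [1 if encode(x)[0] > 0 else 0 for x in data]

def mean_loss():
    eps = 1e-9
    return -sum(y * math.log(predict(encode(x)) + eps)
                + (1 - y) * math.log(1 - predict(encode(x)) + eps)
                for x, y in zip(data, labels)) / len(data)

before = mean_loss()
for _ in range(20):
    for x, y in zip(data, labels):
        train_step(x, y)
print(mean_loss() < before)  # only the tiny head learned; the backbone is untouched
```

Only `w_head` and `b_head` receive updates, which is why probing a frozen backbone is cheap enough to repeat for each downstream task.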

The authors evaluate Decipher‑MR across a wide spectrum of MRI‑related tasks: disease classification (e.g., brain tumors, cardiac disease, prostate lesions), demographic prediction (age, sex), body‑region and sequence identification, imaging attribute detection (contrast presence), organ and lesion localization, and both text‑to‑image and image‑to‑text retrieval. In a probing setup where only the lightweight decoder is trained, Decipher‑MR consistently outperforms state‑of‑the‑art foundation models such as DINOv2, BiomedCLIP, and MedImageInsight. Reported gains include an average increase of 2.9 % in AUC for disease classification, 3.0 % for demographic prediction, and modest but consistent improvements in attribute detection. The advantage widens in low‑data regimes, with up to 5 % absolute AUC gains when only 10 % of the labeled data are available.

Ablation studies dissect the contributions of textual supervision and data diversity. Models trained without report contrastive alignment, or limited to head‑and‑neck scans, or restricted to T2‑weighted images, all perform worse than the full Decipher‑MR. Image‑report contrastive learning adds 1.3 %‑5.0 % absolute improvement across tasks, with the most pronounced effects in cardiac disease classification (+5.0 %) and prostate lesion detection (+2.4 %). Even when evaluated on tasks that focus on a single anatomical region (e.g., Alzheimer’s disease classification), the model trained on the full heterogeneous dataset outperforms a head‑only model, underscoring the value of broad anatomical coverage.

Cross‑modal retrieval experiments demonstrate zero‑shot capability. On an in‑domain test set of ~25,000 MRI volumes, Decipher‑MR retrieves the correct scan within the top‑10 results for 26 % of textual queries (full reports), far surpassing MedImageInsight’s 5.1 % baseline. On an out‑of‑domain “Source1” dataset (body‑region retrieval), the model achieves a top‑3 success rate of 91.4 % with full reports and 78.8 % with concise organ‑level descriptions, again beating competing methods. Mean average precision (mAP) scores are similarly higher, indicating better overall ranking quality.
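For reference, the two retrieval metrics quoted above (top-k success rate and mean average precision) can be computed as follows. This is a generic sketch of the standard definitions, not the paper's evaluation code:

```python
def top_k_hit_rate(correct_ranks, k):
    """Fraction of queries whose ground-truth item lands in the top-k.

    correct_ranks[q] is the 0-based rank of the correct item for query q.
    """
    return sum(1 for r in correct_ranks if r < k) / len(correct_ranks)

def mean_average_precision(relevance_lists):
    """mAP over ranked result lists: relevance_lists[q][i] is 1 if the
    i-th result returned for query q is relevant, else 0."""
    ap_total = 0.0
    for rel in relevance_lists:
        hits, precisions = 0, []
        for i, r in enumerate(rel):
            if r:
                hits += 1
                precisions.append(hits / (i + 1))  # precision at this hit
        ap_total += sum(precisions) / max(hits, 1)
    return ap_total / len(relevance_lists)

# Three queries whose correct items are ranked 0th, 4th, and 12th.
print(round(top_k_hit_rate([0, 4, 12], k=10), 4))     # 0.6667
# One query with relevant results at ranks 0 and 2.
print(round(mean_average_precision([[1, 0, 1]]), 4))  # 0.8333
```

Top-k rewards only whether the correct item appears at all, while mAP also rewards placing relevant items earlier, which is why the paper reports both.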

Bias analysis reveals that performance is highest when training and testing are performed within the same sex, but Decipher‑MR maintains a 5.5 % advantage over MedImageInsight even when evaluated across sexes, suggesting improved robustness to demographic variation.

In summary, Decipher‑MR delivers four major contributions: (1) a massive, diverse 3‑D MRI‑report dataset enabling comprehensive self‑supervised and vision‑language pre‑training; (2) a two‑stage training pipeline that first builds strong modality‑specific embeddings and then aligns them across modalities; (3) a modular frozen‑encoder architecture that supports lightweight, task‑specific fine‑tuning with minimal computational overhead; and (4) extensive empirical validation showing superior performance across classification, regression, localization, segmentation, and zero‑shot retrieval tasks, especially under limited‑label conditions. The work positions Decipher‑MR as a versatile foundation model that can accelerate the development of MRI‑based AI applications in both clinical practice and biomedical research.

