Robust Pre-Training of Medical Vision-and-Language Models with Domain-Invariant Multi-Modal Masked Reconstruction


Medical vision-language models show strong potential for joint reasoning over medical images and clinical text, but their performance often degrades under domain shift caused by variations in imaging devices, acquisition protocols, and reporting styles. Existing multi-modal pre-training methods largely overlook robustness, treating it as a downstream adaptation problem. In this work, we propose Robust Multi-Modal Masked Reconstruction (Robust-MMR), a self-supervised pre-training framework that explicitly incorporates robustness objectives into masked vision-language learning. Robust-MMR integrates asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints to encourage domain-invariant representations. We evaluate Robust-MMR on multiple medical vision-language benchmarks, including medical visual question answering (VQA-RAD, SLAKE, VQA-2019), cross-domain image-text classification (MELINDA), and robust image-caption retrieval (ROCO). Robust-MMR achieves 78.9% cross-domain accuracy on VQA-RAD, outperforming the strongest baseline by 3.8 percentage points, and reaches 74.6% and 77.0% accuracy on SLAKE and VQA-2019, respectively. Under perturbed evaluation, Robust-MMR improves VQA-RAD accuracy from 69.1% to 75.6%. For image-text classification, cross-domain MELINDA accuracy increases from 70.3% to 75.2%, while retrieval experiments show a reduction in mean rank degradation from over 16 to 4.1 under perturbation. Qualitative results further demonstrate improved clinical reasoning for disease detection and structural abnormality assessment. These findings show that explicitly modeling robustness during pre-training leads to more reliable and transferable medical vision-language representations for real-world deployment.


💡 Research Summary

The paper addresses a critical gap in medical vision‑language (V‑L) modeling: the lack of robustness to domain shifts that arise from variations in imaging devices, acquisition protocols, institutional practices, and reporting styles. While recent self‑supervised pre‑training approaches—particularly masked autoencoding—have demonstrated strong in‑domain performance, they typically rely on random masking and reconstruction objectives that do not explicitly encourage invariance across domains. Consequently, models often suffer steep performance drops when deployed in real‑world clinical settings where data distributions differ from the training set.

To remedy this, the authors propose Robust Multi‑Modal Masked Reconstruction (Robust‑MMR), a self‑supervised pre‑training framework that embeds robustness objectives directly into the masked V‑L learning process. Robust‑MMR consists of three complementary components:

  1. Asymmetric Perturbation‑Aware Masking – Instead of applying a uniform random mask to both modalities, the method uses different masking ratios for images and text and deliberately injects simulated perturbations (e.g., Gaussian blur, down‑sampling, scanner‑specific filters for images; abbreviation substitution, token shuffling, partial deletion for text). This forces the encoder to learn representations that can tolerate corrupted or missing information.

  2. Domain‑Consistency Regularization – For each patient case, pairs from multiple domains (e.g., different hospitals or scanners) are presented simultaneously. A consistency loss (L2 distance or contrastive term) is applied to the joint embeddings, encouraging the model to map domain‑variant inputs to a shared, domain‑invariant latent space.

  3. Modality‑Resilience Constraints – When one modality is fully masked, the other modality must still enable accurate reconstruction of the missing content. Cross‑modal reconstruction losses are added so that the image encoder can infer textual semantics and vice versa, promoting redundancy and resilience across modalities.
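To make component 1 concrete, here is a minimal sketch of asymmetric perturbation‑aware masking. The masking ratios (0.75 for image patches, 0.30 for text tokens), the zero‑vector stand‑in for a [MASK] embedding, and the Gaussian‑noise perturbation are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tokens(tokens, ratio):
    """Randomly mask a fraction of patch/token embeddings.

    Returns the corrupted copy and a boolean mask marking masked positions.
    """
    n = tokens.shape[0]
    n_mask = int(round(ratio * n))
    idx = rng.choice(n, size=n_mask, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    corrupted = tokens.copy()
    corrupted[mask] = 0.0  # zero vector as a stand-in [MASK] embedding
    return corrupted, mask

def perturb_image(patches, sigma=0.1):
    """Simulated acquisition noise (stand-in for blur / scanner-specific filters)."""
    return patches + rng.normal(0.0, sigma, size=patches.shape)

# Asymmetric ratios: heavier masking on the image stream than on the text stream.
img = rng.normal(size=(196, 768))  # 14x14 ViT patch embeddings (illustrative shapes)
txt = rng.normal(size=(64, 768))   # report token embeddings

img_in, img_mask = mask_tokens(perturb_image(img), ratio=0.75)
txt_in, txt_mask = mask_tokens(txt, ratio=0.30)
```

The encoder is then trained to reconstruct the original, unperturbed embeddings from `img_in` and `txt_in`, so it must tolerate both missing and corrupted content.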

The overall training objective combines the standard masked‑reconstruction losses for vision and text with weighted domain‑consistency (λ_dc) and modality‑resilience (λ_mr) terms:

L_total = L_rec(image) + L_rec(text) + λ_dc · L_dc + λ_mr · L_mr
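As a numerical illustration, the combined objective can be sketched with simple L2 losses. The helper names, the choice of mean‑squared error for every term, and the 0.5 loss weights are assumptions made for the sketch, not the paper's implementation:

```python
import numpy as np

def mse(a, b):
    """Mean-squared error between two embedding arrays."""
    return float(np.mean((a - b) ** 2))

def total_loss(rec_img, img, rec_txt, txt,   # masked reconstructions vs. targets
               z_dom_a, z_dom_b,             # joint embeddings of one case from two domains
               xrec_img, xrec_txt,           # cross-modal reconstructions (text->image, image->text)
               lam_dc=0.5, lam_mr=0.5):      # illustrative weights, not from the paper
    l_rec = mse(rec_img, img) + mse(rec_txt, txt)   # standard reconstruction terms
    l_dc = mse(z_dom_a, z_dom_b)                    # L2 domain-consistency term
    l_mr = mse(xrec_img, img) + mse(xrec_txt, txt)  # modality-resilience terms
    return l_rec + lam_dc * l_dc + lam_mr * l_mr
```

In practice the two λ weights trade off in‑domain reconstruction quality against cross‑domain invariance, so they would be tuned on held‑out validation domains.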

