A Geometric Multimodal Foundation Model Integrating Bp-MRI and Clinical Reports in Prostate Cancer Classification

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Prostate cancer (PCa) is one of the most common cancers in men worldwide. Bi-parametric MRI (bp-MRI) and clinical variables are crucial for PCa identification and for improving treatment decisions, but their interpretation is subject to variability across experts. Furthermore, most existing computer-aided diagnosis methods rely on imaging-based models alone, overlooking the clinical context and suffering from data scarcity, which limits their ability to learn robust representations. We propose a geometric multimodal Foundation Model (FM), named MFM-Geom, that learns joint representations from bp-MRI and clinical reports, encoding both visual findings and the context provided by clinical variables. In its classification head, the approach leverages symmetric positive definite (SPD) matrices and Riemannian deep learning to integrate image-text representations from a biomedical multimodal FM. Using only 10% of the training data, MFM-Geom outperformed a baseline class-token-embedding classifier (+8.3%, AUC-PR of 90.67). Generalization to an external dataset confirmed the robustness of fine-tuning the biomedical FM, achieving an AUC-PR of 90.6.


💡 Research Summary

Prostate cancer (PCa) remains one of the most prevalent malignancies in men, and accurate discrimination of clinically significant disease (csPCa) is essential for guiding treatment decisions. While biparametric MRI (bp‑MRI) provides valuable visual cues, clinical variables such as age, PSA level, PSA density, and prostate volume add crucial contextual information. Existing computer‑aided diagnosis systems largely rely on unimodal imaging models, ignoring the synergistic potential of clinical data and suffering from limited training samples, which hampers robust representation learning.

In this work the authors introduce MFM‑Geom, a geometric multimodal foundation model that jointly processes bp‑MRI volumes and structured clinical reports. The backbone is BiomedCLIP, a multimodal foundation model pretrained on 15 million biomedical image‑text pairs. The image encoder adapts a Vision Transformer (ViT‑B/16) to 3‑D inputs via a weight‑inflation strategy, extracting patch embeddings from each MRI sequence (T2W, ADC, DWI). The text encoder uses PubMedBERT to embed a “fill‑in‑the‑blank” report that encodes the same clinical variables used in the PI‑CAI challenge. Both encoders output a sequence of token embeddings together with a class token after L transformer layers.
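The "fill-in-the-blank" report can be pictured as a fixed natural-language template whose slots are filled with the structured clinical variables before being passed to the text encoder. The exact wording below is an assumption (the paper's template is not reproduced here); the variables follow those listed for the PI-CAI challenge:

```python
def build_report(age, psa, psa_density, prostate_volume):
    """Render structured clinical variables as a natural-language report.

    Hypothetical template: the real phrasing used by the authors may differ,
    but the idea is the same -- one sentence slot per clinical variable,
    so PubMedBERT sees them in a textual context it was pretrained on.
    """
    return (
        f"Patient aged {age} years. PSA level: {psa:.1f} ng/mL. "
        f"PSA density: {psa_density:.2f} ng/mL/cc. "
        f"Prostate volume: {prostate_volume:.1f} cc."
    )

report = build_report(age=67, psa=8.4, psa_density=0.21, prostate_volume=40.0)
```

Encoding variables as text (rather than as a raw numeric vector) lets the model reuse the language prior of the pretrained text encoder.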

To fuse the two modalities, the authors construct a symmetric positive‑definite (SPD) matrix from the concatenated image‑patch and text‑token embeddings: S₀ = (1/d²) MMᵀ, where M is the N × d matrix of all embeddings. Because SPD matrices lie on a Riemannian manifold, a geometry‑aware network (SPDNet) processes S₀ through BiMap (bilinear mapping) and ReEig (eigenvalue rectification) layers, preserving positivity and manifold structure. After a LogEig operation maps the result to Euclidean space, a lightweight MLP produces the final csPCa probability.
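A minimal numpy sketch of this geometric pipeline is shown below (the dimensions and the random Stiefel projection are illustrative assumptions, not the paper's trained parameters). Note that when the token count N exceeds the embedding dimension d, S₀ is only positive semi-definite; the eigenvalue rectification in ReEig restores strict positivity downstream:

```python
import numpy as np

def bimap(S, W):
    """BiMap layer: bilinear mapping W^T S W reduces dimension while staying
    symmetric, assuming W has orthonormal columns (a Stiefel-manifold point)."""
    return W.T @ S @ W

def reeig(S, eps=1e-4):
    """ReEig layer: clamp eigenvalues below eps (the SPD analogue of ReLU)."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(np.maximum(vals, eps)) @ vecs.T

def logeig(S):
    """LogEig layer: matrix logarithm maps the SPD manifold into a flat
    (Euclidean) space, so the result can feed an ordinary MLP."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(np.log(vals)) @ vecs.T

rng = np.random.default_rng(0)
N, d, k = 12, 8, 4                          # illustrative sizes only
M = rng.standard_normal((N, d))             # concatenated image+text token embeddings
S0 = (M @ M.T) / d**2                       # S0 = (1/d^2) M M^T, symmetric PSD
W = np.linalg.qr(rng.standard_normal((N, k)))[0]  # random column-orthonormal W
feat = logeig(reeig(bimap(S0, W)))          # Euclidean features for the MLP head
```

In SPDNet the BiMap weights are optimized on the Stiefel manifold; here a fixed random orthonormal matrix stands in for the learned projection.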

Training combines a binary cross‑entropy loss with an InfoNCE contrastive loss that aligns image and text class embeddings in a shared latent space, encouraging multimodal coherence. The model is fine‑tuned on the PI‑CAI dataset (415 csPCa lesions, 847 non‑csPCa studies) and evaluated with 5‑fold cross‑validation. Three binary classification tasks are considered: (1) non‑csPCa vs. {intermediate, high‑grade}, (2) non‑csPCa vs. intermediate, and (3) non‑csPCa vs. high‑grade. Performance metrics focus on area under the precision‑recall curve (AUC‑PR) and false‑positive rate at 95 % true‑positive rate (FPR95), which are robust to class imbalance.
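The contrastive term can be sketched as a standard CLIP-style symmetric InfoNCE over a batch of image/text class embeddings, where matched pairs are positives and all other pairs are negatives. The temperature value and the weighting against the BCE term below are assumptions, not the paper's reported hyperparameters:

```python
import numpy as np

def info_nce(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Rows of img_emb and txt_emb are assumed to be matched pairs; the loss
    pulls each matched pair together and pushes apart all cross-pairs.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau              # cosine similarities / temperature
    n = logits.shape[0]

    def xent(L):
        # cross-entropy with the diagonal (matched pair) as the target class
        L = L - L.max(axis=1, keepdims=True)
        logp = L - np.log(np.exp(L).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    return 0.5 * (xent(logits) + xent(logits.T))  # image->text and text->image

aligned = info_nce(np.eye(4), np.eye(4))            # perfectly aligned pairs
shuffled = info_nce(np.eye(4), np.roll(np.eye(4), 1, axis=0))  # misaligned
```

The total training objective would then combine this with the classification loss, e.g. `loss = bce + lam * info_nce(...)` for some weighting `lam` (hypothetical name).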

Results show that even with only 10 % of the training data, MFM‑Geom achieves an AUC‑PR of 90.67 ± 1.17 on the most challenging task, outperforming a baseline class‑token classifier by 8.3 percentage points and reducing FPR95 by 37.1 %. When the full training set is used, the model reaches 97.2 % AUC‑ROC, surpassing recent CNN‑based approaches (94.1–96.5 %). External validation on the PROSTATE158 dataset confirms that the geometric head improves generalization, with the unimodal geometric variant (UFM‑Geom) outperforming standard baselines despite the absence of clinical variables in the external set.
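The FPR95 metric reported above measures how many false positives must be tolerated to keep sensitivity at 95%. A minimal sketch of its computation (a simple threshold sweep, not the authors' evaluation code) follows:

```python
import numpy as np

def fpr_at_tpr(scores, labels, target_tpr=0.95):
    """FPR at a given TPR: false-positive rate at the loosest threshold
    whose true-positive rate first reaches the target (default 95%)."""
    order = np.argsort(-np.asarray(scores))       # descending score order
    labels = np.asarray(labels)[order]
    tps = np.cumsum(labels == 1)                  # true positives per threshold
    fps = np.cumsum(labels == 0)                  # false positives per threshold
    tpr = tps / max(tps[-1], 1)
    fpr = fps / max(fps[-1], 1)
    idx = np.searchsorted(tpr, target_tpr)        # first threshold with tpr >= target
    return float(fpr[min(idx, len(fpr) - 1)])

perfect = fpr_at_tpr([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], [1, 1, 1, 0, 0, 0])
overlap = fpr_at_tpr([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

With perfectly separated scores FPR95 is 0; interleaved scores force false positives before 95% sensitivity is reached.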

Attention visualizations reveal that the image encoder focuses on the lesion region while the text encoder highlights key clinical factors such as PSA density, prostate volume, lesion extension beyond the gland, and zonal location, indicating that the model learns clinically meaningful cross‑modal relationships.

The authors acknowledge limited interpretability inherent to large foundation models and note that the external dataset lacked clinical reports, preventing a full multimodal assessment. Future work will explore the structure of the SPD latent space, assess its ability to stratify different Gleason grades, and further reduce computational overhead.

In summary, MFM‑Geom demonstrates that coupling a biomedical foundation model with a geometry‑preserving multimodal fusion strategy yields robust, data‑efficient prostate cancer classification. By leveraging both imaging and structured clinical information, the approach particularly excels at distinguishing intermediate‑grade lesions, offering a promising tool for active‑surveillance decision support and personalized treatment planning.

