Handling Missing Modalities in Multimodal Survival Prediction for Non-Small Cell Lung Cancer
Accurate survival prediction in Non-Small Cell Lung Cancer (NSCLC) requires the integration of heterogeneous clinical, radiological, and histopathological information. While Multimodal Deep Learning (MDL) offers a promises for precision prognosis and survival prediction, its clinical applicability is severely limited by small cohort sizes and the presence of missing modalities, often forcing complete-case filtering or aggressive imputation. In this work, we present a missing-aware multimodal survival framework that integrates Computed Tomography (CT), Whole-Slide Histopathology (WSI) Images, and structured clinical variables for overall survival modeling in unresectable stage II-III NSCLC. By leveraging Foundation Models (FM) for modality-specific feature extraction and a missing-aware encoding strategy, the proposed approach enables intermediate multimodal fusion under naturally incomplete modality profiles. The proposed architecture is resilient to missing modalities by design, allowing the model to utilize all available data without being forced to drop patients during training or inference. Experimental results demonstrate that intermediate fusion consistently outperforms unimodal baselines as well as early and late fusion strategies, with the strongest performance achieved by the fusion of WSI and clinical modalities (73.30 C-index). Further analyses of modality importance reveal an adaptive behavior in which less informative modalities, i.e., CT modality, are automatically down-weighted and contribute less to the final survival prediction.
💡 Research Summary
This paper addresses the challenge of predicting overall survival (OS) in patients with unresectable stage II–III non‑small‑cell lung cancer (NSCLC) by integrating three heterogeneous data sources—computed tomography (CT) scans, whole‑slide pathology images (WSI), and structured clinical variables—while explicitly handling missing modalities. The authors first extract high‑level representations for each modality using pretrained foundation models: a 3‑D Swin‑Transformer for CT volumes, a Vision Transformer (ViT) for WSI patches, and a shallow multilayer perceptron for the tabular clinical data. To make the system robust to absent inputs, they introduce a “missing‑aware” representation learning block based on a transformer architecture that inserts learnable mask tokens for any missing modality. During attention computation these mask tokens receive near‑zero weights, effectively allowing the network to ignore unavailable data without requiring imputation.
The modality‑specific embeddings are then concatenated (intermediate fusion) and fed into an Oblivious Differentiable Decision Tree (ODDT) head. The ODDT retains the interpretability of classic decision trees while being fully differentiable, enabling direct optimization of the Cox partial‑likelihood loss and yielding a continuous hazard estimate for each patient.
Experiments were conducted on a curated cohort of 179 NSCLC patients collected at the Campus Bio‑Medico in Rome. The dataset reflects real‑world clinical practice: some patients lack CT, others lack WSI, and a few have both missing. Performance was measured with the concordance index (C‑index) under five‑fold cross‑validation. Unimodal baselines achieved C‑indices of 68.5 (WSI), 66.2 (clinical) and 60.1 (CT). All multimodal configurations outperformed the single‑modality models, with the best result (C‑index = 73.30) obtained by fusing WSI and clinical data. Adding CT to the fusion yielded a modest drop (≈ 72.5), and a full three‑modality model achieved 72.5 as well, indicating that CT contributes less predictive information in this setting.
A modality‑importance analysis, based on learned mask weights and SHAP values, confirmed that the model automatically down‑weights the CT stream when it is less informative, while WSI and clinical features dominate the risk prediction. Risk scores derived from the ODDT were used to stratify patients into high‑ and low‑risk groups; log‑rank tests showed statistically significant separation (p < 0.001), demonstrating clinical relevance beyond a mere C‑index improvement.
The authors emphasize several contributions: (1) leveraging foundation models to obtain robust features from limited data, (2) a novel missing‑aware transformer that eliminates the need for ad‑hoc imputation, (3) an intermediate fusion strategy coupled with an interpretable ODDT head, and (4) a publicly released, anonymized dataset and code to foster reproducibility. Limitations include a relatively small single‑center cohort and a simplistic CT preprocessing pipeline; future work will explore larger multi‑institutional datasets, incorporation of genomic/transcriptomic data, and more sophisticated visualization of model explanations.
In summary, the proposed framework demonstrates that a carefully designed multimodal deep learning pipeline can achieve state‑of‑the‑art survival prediction in NSCLC while gracefully handling missing modalities, thereby moving closer to real‑world clinical deployment.
Comments & Academic Discussion
Loading comments...
Leave a Comment