Diff4MMLiTS: Advanced Multimodal Liver Tumor Segmentation via Diffusion-Based Image Synthesis and Alignment

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Multimodal learning has been demonstrated to enhance performance across various clinical tasks, owing to the diverse perspectives offered by different modalities of data. However, existing multimodal segmentation methods rely on well-registered multimodal data, which is unrealistic for real-world clinical images, particularly for indistinct and diffuse regions such as liver tumors. In this paper, we introduce Diff4MMLiTS, a four-stage multimodal liver tumor segmentation pipeline: pre-registration of the target organs in multimodal CTs; dilation of the annotated modality’s mask, which is then used in inpainting to obtain tumor-free multimodal normal CTs; synthesis of strictly aligned multimodal CTs with tumors using a latent diffusion model conditioned on multimodal CT features and randomly generated tumor masks; and finally, training of the segmentation model, thus eliminating the need for strictly aligned multimodal data. Extensive experiments on public and internal datasets demonstrate the superiority of Diff4MMLiTS over other state-of-the-art multimodal segmentation methods.


💡 Research Summary

Diff4MMLiTS introduces a four‑stage pipeline that tackles two major challenges in multimodal liver tumor segmentation: (1) the lack of perfectly registered multimodal CT scans in clinical practice, and (2) the scarcity of annotated tumor data. The first stage performs a coarse organ‑level registration across the four CT phases (NC, AP, PVP, DELAY). Because tumor regions remain misaligned, the authors dilate the radiologist‑provided PVP tumor mask (using a 5×5 morphological closing followed by a 3×3 dilation) to guarantee coverage across all phases. This dilated mask is fed into a Fast Fourier Convolution (FFC) based inpainting network (the Normal CT Generator, NCG) which removes tumor voxels and produces “normal” CT volumes for each phase. FFC combines a local convolution branch with a global FFT branch, preserving fine‑scale texture while capturing long‑range context, enabling high‑quality inpainting with modest computational cost.
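The mask-preparation step above (a 5×5 morphological closing followed by a 3×3 dilation) can be sketched in plain NumPy. This is an illustrative 2D-slice version, not the authors' implementation: the actual pipeline operates on 3D CT volumes, and any details beyond the kernel sizes stated in the summary are assumptions.

```python
import numpy as np

def _shifted_windows(mask, k):
    """Yield every k x k-shifted view of a zero-padded binary mask."""
    r = k // 2
    h, w = mask.shape
    padded = np.pad(mask, r)  # pad with False so borders erode/dilate safely
    for dy in range(k):
        for dx in range(k):
            yield padded[dy:dy + h, dx:dx + w]

def binary_dilate(mask, k):
    """Dilation with a k x k square structuring element (OR over shifts)."""
    out = np.zeros_like(mask)
    for win in _shifted_windows(mask, k):
        out |= win
    return out

def binary_erode(mask, k):
    """Erosion with a k x k square structuring element (AND over shifts)."""
    out = np.ones_like(mask)
    for win in _shifted_windows(mask, k):
        out &= win
    return out

def prepare_inpainting_mask(tumor_mask):
    """5x5 closing (dilate then erode) to fill small gaps,
    then 3x3 dilation so the mask covers the tumor in every phase."""
    closed = binary_erode(binary_dilate(tumor_mask, 5), 5)
    return binary_dilate(closed, 3)
```

The dilation margin is what makes the subsequent inpainting robust: even if the tumor is slightly shifted in the NC, AP, or DELAY phase after coarse registration, the enlarged mask still blankets it, so the inpainting network never leaks residual tumor texture into the "normal" volumes.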

In the second stage, the Multimodal CT Synthesizer (MCS) creates perfectly aligned multimodal tumor CTs. A pre‑trained VQGAN auto‑encoder first compresses each normal CT into a latent representation. A latent diffusion model (LDM) then performs forward diffusion (adding Gaussian noise) and reverse denoising conditioned on a synthetic tumor mask. The mask is generated by randomly selecting a liver‑centric point, sampling ellipsoidal semi‑axes, and applying elastic deformation to mimic realistic tumor shapes. During reverse diffusion, the denoising network predicts the noise component given the current latent, the mask, and the timestep, gradually reconstructing a latent that encodes a tumor. The decoder maps this latent back to image space, yielding four phase CTs that share the exact tumor location but retain phase‑specific intensity distributions. This approach produces large numbers of fully aligned multimodal tumor examples without any manual annotation.
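The synthetic tumor masks that condition the diffusion model can be sketched as follows. This is a simplified stand-in under stated assumptions: the summary names the three ingredients (random liver-centric point, ellipsoidal semi-axes, elastic deformation) but not their parameters, so the ranges below and the coarse-grid displacement field used in place of a full elastic deformation are illustrative choices, shown here in 2D.

```python
import numpy as np

def synth_tumor_mask(shape, rng):
    """Random ellipse + coarse random warp, mimicking the paper's
    ellipsoid-plus-elastic-deformation mask generator (2D sketch)."""
    h, w = shape
    # Random center, kept away from the border (stand-in for "inside the liver").
    cy = rng.integers(h // 4, 3 * h // 4)
    cx = rng.integers(w // 4, 3 * w // 4)
    # Random semi-axes; the range is an assumption, not from the paper.
    a, b = rng.uniform(3, 8, size=2)
    yy, xx = np.mgrid[:h, :w]
    ellipse = ((yy - cy) / a) ** 2 + ((xx - cx) / b) ** 2 <= 1.0
    # Crude "elastic" deformation: a coarse random displacement field,
    # nearest-neighbor upsampled to full resolution via np.kron.
    coarse = rng.uniform(-2, 2, size=(2, 4, 4))
    disp = np.kron(coarse, np.ones((h // 4, w // 4)))  # -> (2, h, w)
    ys = np.clip((yy + disp[0]).round().astype(int), 0, h - 1)
    xs = np.clip((xx + disp[1]).round().astype(int), 0, w - 1)
    return ellipse[ys, xs]
```

Because the same mask conditions the reverse diffusion of every phase, the four synthesized CTs share the tumor's exact location and extent by construction, which is precisely the alignment guarantee that real acquisitions lack.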

The third stage, Multimodal Segmentation (MS), trains a 3D U‑Net (or any backbone) on a hybrid dataset composed of real, roughly aligned CTs and the synthetic, perfectly aligned CTs. The loss combines Dice and cross‑entropy terms (γ = 0.5). During training, the model sees both data sources equally, learning to be robust to registration noise while exploiting the clean synthetic signal. At inference, missing modalities are duplicated to satisfy the network’s input channel requirement, allowing the model to operate even when some phases are unavailable.
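The training objective can be sketched in NumPy. The summary states only that Dice and cross-entropy are combined with γ = 0.5; the convex combination below (γ·Dice + (1−γ)·CE) is an assumed form, and `eps` is a standard numerical-stability term not taken from the paper.

```python
import numpy as np

def dice_ce_loss(pred_probs, target, gamma=0.5, eps=1e-6):
    """Combined Dice + binary cross-entropy loss (assumed weighting form).

    pred_probs: foreground probabilities in [0, 1]
    target:     binary ground-truth mask (same shape)
    """
    inter = (pred_probs * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (pred_probs.sum() + target.sum() + eps)
    ce = -np.mean(
        target * np.log(pred_probs + eps)
        + (1.0 - target) * np.log(1.0 - pred_probs + eps)
    )
    return gamma * dice + (1.0 - gamma) * ce
```

Pairing the two terms is the usual rationale: Dice directly optimizes region overlap and is robust to the extreme foreground/background imbalance of small tumors, while cross-entropy supplies well-behaved per-voxel gradients early in training.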

Experiments were conducted on an in‑house mmLiTS dataset (45 patients, four‑phase CT, manual PVP annotations) and the public LiTS dataset (single‑phase CT). On mmLiTS, Diff4MMLiTS achieved a Dice score of 79.02 % versus 76.34 % for the strong nnUNet baseline, with improvements also seen in Jaccard, sensitivity, and precision. More strikingly, when trained on mmLiTS and tested on LiTS (out‑of‑distribution), the baseline nnUNet reached only 41.63 % Dice, whereas Diff4MMLiTS obtained 57.75 % Dice—an absolute gain of 16.12 points, demonstrating superior generalization. Ablation studies showed that integrating Diff4MMLiTS with various backbones (U‑Net, Attention‑UNet, Swin‑UNETR) consistently raised Dice by 1.5–6.5 points. The authors also compared using only a fraction of real diseased CTs plus the corresponding normal CTs, confirming that the synthetic tumor generation is the primary driver of performance gains.

Key contributions are: (1) a novel use of mask dilation and FFC‑based inpainting to generate modality‑specific normal CTs from misaligned data; (2) a latent diffusion framework that efficiently synthesizes perfectly aligned multimodal tumor CTs, mitigating both registration errors and data scarcity; (3) a hybrid training strategy that leverages synthetic data to improve segmentation robustness across backbones and datasets. The work suggests that diffusion‑driven multimodal synthesis can be a general solution for other organ systems where multimodal acquisition is routine but perfect registration is impractical.

