A Hybrid Mamba-SAM Architecture for Efficient 3D Medical Image Segmentation

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Accurate segmentation of 3D medical images such as MRI and CT is essential for clinical diagnosis and treatment planning. Foundation models like the Segment Anything Model (SAM) provide powerful general-purpose representations but struggle in medical imaging due to domain shift, their inherently 2D design, and the high computational cost of fine-tuning. To address these challenges, we propose Mamba-SAM, a novel and efficient hybrid architecture that combines a frozen SAM encoder with the linear-time efficiency and long-range modeling capabilities of Mamba-based State Space Models (SSMs). We investigate two parameter-efficient adaptation strategies. The first is a dual-branch architecture that explicitly fuses general features from a frozen SAM encoder with domain-specific representations learned by a trainable VMamba encoder using cross-attention. The second is an adapter-based approach that injects lightweight, 3D-aware Tri-Plane Mamba (TP-Mamba) modules into the frozen SAM ViT encoder to implicitly model volumetric context. Within this framework, we introduce Multi-Frequency Gated Convolution (MFGC), which enhances feature representation by jointly analyzing spatial and frequency-domain information via 3D discrete cosine transforms and adaptive gating. Extensive experiments on the ACDC cardiac MRI dataset demonstrate the effectiveness of the proposed methods. The dual-branch Mamba-SAM-Base model achieves a mean Dice score of 0.906, comparable to UNet++ (0.907), while outperforming all baselines on Myocardium (0.910) and Left Ventricle (0.971) segmentation. The adapter-based TP-MFGC variant offers superior inference speed (4.77 FPS) with strong accuracy (0.880 Dice). These results show that hybridizing foundation models with efficient SSM-based architectures provides a practical and effective solution for 3D medical image segmentation.


💡 Research Summary

This paper addresses the challenge of adapting the large‑scale, 2‑D‑oriented Segment Anything Model (SAM) to 3‑D medical image segmentation, where domain shift, lack of inter‑slice context, and high computational cost hinder direct use. The authors propose a hybrid architecture, “Mamba‑SAM,” that keeps the SAM encoder frozen and augments it with efficient state‑space model (SSM) components based on the Mamba family. Two parameter‑efficient adaptation strategies are explored.

The first, a dual‑branch design, runs a frozen SAM ViT‑B encoder in parallel with a trainable VMamba encoder. The SAM branch supplies generalist features, while the VMamba branch learns domain‑specific representations from the same 2‑D slice. A Cross‑Branch Attention (CBA) module treats VMamba outputs as queries and SAM outputs as keys/values, allowing the specialist branch to guide retrieval of relevant general knowledge. The attention output is added residually to the SAM features, and a decoder (either a conventional CNN up‑sampler or an implicit feature alignment decoder) produces the final 3‑D mask. This explicit separation of concerns yields high Dice scores (mean 0.906) comparable to state‑of‑the‑art UNet++ (0.907) and superior performance on the Myocardium and Left Ventricle classes.
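The CBA module described above is, at its core, standard cross-attention with the branch roles fixed: the specialist (VMamba) branch supplies queries, the frozen generalist (SAM) branch supplies keys and values, and the retrieved features are added residually onto the SAM output. The snippet below is a minimal single-head NumPy sketch of that mechanism; the function name and random projection matrices are illustrative stand-ins for the paper's learned weights, not its actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_branch_attention(vmamba_feat, sam_feat, d_k=64, seed=0):
    """Single-head sketch of Cross-Branch Attention (CBA).

    vmamba_feat: (N, C) domain-specific tokens -> queries
    sam_feat:    (N, C) frozen SAM tokens      -> keys/values
    Random matrices stand in for the learned projections.
    """
    rng = np.random.default_rng(seed)
    C = vmamba_feat.shape[-1]
    Wq = rng.standard_normal((C, d_k)) / np.sqrt(C)
    Wk = rng.standard_normal((C, d_k)) / np.sqrt(C)
    Wv = rng.standard_normal((C, C)) / np.sqrt(C)

    Q = vmamba_feat @ Wq              # specialist branch asks the questions
    K = sam_feat @ Wk
    V = sam_feat @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    retrieved = attn @ V              # general knowledge selected by the specialist
    return sam_feat + retrieved       # residual addition onto the SAM features
```

For a ViT-B patch grid, `N` would be 64×64 tokens per slice; the sketch works for any `(N, C)` pair.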

The second strategy inserts lightweight 3‑D‑aware Tri‑Plane Mamba (TP‑Mamba) adapters into every multi‑head self‑attention (MSA) and MLP block of the frozen SAM encoder, following a Parameter‑Efficient Fine‑Tuning (PEFT) paradigm. Each adapter projects the 2‑D token sequence to a lower‑dimensional space, reshapes it into a pseudo‑volume, and processes it through two parallel paths: a local 3‑D convolutional path and a global path that slices the volume along axial, coronal, and sagittal planes and feeds each slice into a Mamba block. The two paths are fused and projected back to the original SAM dimension, then added residually. An optional LoRA can be applied to the Q/K/V projections of the frozen MSA layers for even finer control.
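The adapter's data flow (down-project, reshape to a pseudo-volume, a local 3-D convolutional path plus a tri-plane scan path, fuse, up-project, residual) can be sketched at the shape level as follows. This is a hedged illustration, not the paper's implementation: the Mamba block is replaced by a simple causal running-mean scan (any linear-time sequence model would occupy that slot), the projections are random stand-ins, and all names are hypothetical.

```python
import numpy as np
from scipy.ndimage import convolve

def scan_stub(seq):
    """Linear-time causal scan along axis 0; a stand-in for a Mamba block."""
    csum = np.cumsum(seq, axis=0)
    counts = np.arange(1, seq.shape[0] + 1).reshape(-1, 1)
    return csum / counts

def tp_mamba_adapter(tokens, vol_shape=(8, 8, 8), d_low=4, seed=0):
    """Shape-level sketch of a Tri-Plane Mamba adapter.

    tokens: (N, C) frozen-SAM token sequence, with N = D*H*W of the
    pseudo-volume. Down-project -> pseudo-volume -> local 3D conv path
    + tri-plane scan path -> fuse -> up-project -> residual.
    """
    rng = np.random.default_rng(seed)
    N, C = tokens.shape
    W_down = rng.standard_normal((C, d_low)) / np.sqrt(C)
    W_up = rng.standard_normal((d_low, C)) / np.sqrt(d_low)

    low = tokens @ W_down                      # (N, d_low)
    vol = low.reshape(*vol_shape, d_low)       # pseudo-volume (D, H, W, d_low)

    # Local path: small 3D convolution applied per channel.
    kernel = np.ones((3, 3, 3)) / 27.0
    local = np.stack([convolve(vol[..., c], kernel, mode="nearest")
                      for c in range(d_low)], axis=-1)

    # Global path: scan along each of the three anatomical axes
    # (axial, coronal, sagittal), then average the three results.
    glob = np.zeros_like(vol)
    for axis in range(3):
        moved = np.moveaxis(vol, axis, 0)      # bring scan axis to front
        scanned = scan_stub(moved.reshape(moved.shape[0], -1))
        glob += np.moveaxis(scanned.reshape(moved.shape), 0, axis)

    fused = (local + glob / 3.0).reshape(N, d_low)
    return tokens + fused @ W_up               # residual back at SAM width
```

In the real adapter the same skeleton is inserted after every MSA and MLP block, so the frozen 2-D encoder accumulates volumetric context layer by layer.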

A novel component, Multi‑Frequency Gated Convolution (MFGC), is integrated into the TP‑Mamba adapters. MFGC applies a 3‑D discrete cosine transform (DCT) to extract frequency‑domain features, then combines them with spatial features through an adaptive gating mechanism. This joint spatial‑frequency analysis suppresses high‑frequency noise while emphasizing low‑frequency anatomical structures, improving robustness on noisy medical scans.
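The core idea of MFGC can be sketched with a few lines of SciPy: a 3-D DCT separates low-frequency anatomy from high-frequency noise, and a sigmoid gate mixes the filtered volume back with the spatial-domain input. This is a simplified illustration under stated assumptions: the low-pass mask is hand-rolled, and the gate is a random stand-in where the real layer would learn its convolution and gating weights.

```python
import numpy as np
from scipy.fft import dctn, idctn

def mfgc(vol, keep_frac=0.25, seed=0):
    """Simplified sketch of Multi-Frequency Gated Convolution (MFGC).

    vol: (D, H, W) volume. A 3D DCT exposes the frequency content; a
    low-pass mask keeps coarse anatomical structure; a gate blends the
    filtered volume with the original spatial features.
    """
    rng = np.random.default_rng(seed)
    freq = dctn(vol, type=2, norm="ortho")       # 3D frequency representation

    # Keep only the low-frequency corner (coarse anatomical structure).
    mask = np.zeros_like(freq)
    d, h, w = (max(1, int(s * keep_frac)) for s in vol.shape)
    mask[:d, :h, :w] = 1.0
    low_pass = idctn(freq * mask, type=2, norm="ortho")

    # Adaptive gate: learned in the real layer; random sigmoid stand-in here.
    gate = 1.0 / (1.0 + np.exp(-rng.standard_normal(vol.shape)))
    return gate * low_pass + (1.0 - gate) * vol
```

On a constant (noise-free) volume the low-pass branch is lossless, so the gate has nothing to trade off; on noisy scans the gate decides per voxel how much denoised structure to trust.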

Experiments are conducted on the ACDC cardiac MRI dataset, which is annotated with three cardiac structures (right ventricle, myocardium, and left ventricle). The dual‑branch Mamba‑SAM‑Base achieves a mean Dice of 0.906, matching UNet++ and outperforming it on myocardium (0.910) and left ventricle (0.971). The adapter‑based TP‑MFGC variant runs at 4.77 frames per second (FPS) with a mean Dice of 0.880, demonstrating a favorable speed‑accuracy trade‑off. Compared against a range of baselines, including 3‑D U‑Net, Swin‑UNet, MedSAM, and recent Mamba‑based segmentation models, the proposed methods consistently deliver comparable or superior performance while using far fewer trainable parameters and requiring less GPU memory.

Key contributions are: (1) a hybrid framework that leverages SAM’s massive pre‑trained knowledge without fine‑tuning its parameters; (2) two distinct PEFT strategies—dual‑branch cross‑attention and internal TP‑Mamba adapters—that enable efficient 3‑D context modeling; (3) the introduction of MFGC for joint spatial‑frequency feature enhancement; and (4) thorough empirical validation showing that the hybrid approach attains state‑of‑the‑art accuracy with markedly reduced computational overhead. The work opens avenues for extending foundation‑model adaptation to other volumetric modalities (e.g., CT, PET) and for integrating such lightweight, high‑performance segmenters into real‑time clinical workflows.

