Mamba Goes HoME: Hierarchical Soft Mixture-of-Experts for 3D Medical Image Segmentation
In recent years, artificial intelligence has significantly advanced medical image segmentation. Nonetheless, challenges remain, including efficient 3D medical image processing across diverse modalities and handling data variability. In this work, we introduce Hierarchical Soft Mixture-of-Experts (HoME), a two-level token-routing layer for efficient long-context modeling, specifically designed for 3D medical image segmentation. Built on the Mamba Selective State Space Model (SSM) backbone, HoME enhances sequential modeling through adaptive expert routing. In the first level, a Soft Mixture-of-Experts (SMoE) layer partitions input sequences into local groups, routing tokens to specialized per-group experts for localized feature extraction. The second level aggregates these outputs through a global SMoE layer, enabling cross-group information fusion and global context refinement. This hierarchical design, combining local expert routing with global expert refinement, enhances generalizability and segmentation performance, surpassing state-of-the-art results across datasets from the three most widely used 3D medical imaging modalities and varying data qualities. The code is publicly available at https://github.com/gmum/MambaHoME.
💡 Research Summary
The paper introduces Mamba‑HoME, a novel architecture for three‑dimensional medical image segmentation that synergistically combines the linear‑complexity Selective State‑Space Model (Mamba) with a Hierarchical Soft Mixture‑of‑Experts (HoME) routing mechanism. The authors identify two fundamental challenges in 3D medical segmentation: (1) the prohibitive quadratic cost of full‑attention Transformers on volumetric data, and (2) the need for specialized processing of heterogeneous local anatomical patterns. Mamba provides an efficient backbone that captures long‑range dependencies with O(N·d) complexity by using input‑dependent linear recurrences. However, a plain SSM treats every token uniformly and cannot adapt to diverse local structures.
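The input-dependent linear recurrence that gives the SSM its O(N·d) cost can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the real Mamba layer also expands the hidden state, discretizes continuous-time parameters, and runs a hardware-aware parallel scan, all of which are omitted here.

```python
import numpy as np

def selective_scan(x, a, b, c):
    """Input-dependent linear recurrence at the heart of a selective SSM:
        h_t = a_t * h_{t-1} + b_t * x_t,   y_t = c_t * h_t
    x, a, b, c are all (N, d); one pass touches each of the N*d entries once,
    giving the linear O(N*d) cost (state expansion omitted for clarity)."""
    h = np.zeros_like(x[0])
    ys = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a[t] * h + b[t] * x[t]   # gates a_t, b_t are computed from the input
        ys[t] = c[t] * h             # input-dependent readout
    return ys
```

Because a, b, and c are functions of the input rather than fixed matrices, the recurrence can selectively retain or forget information per token, which is what lets a plain linear scan approximate attention-like context selection.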
HoME addresses this limitation through a two‑level soft routing scheme. First, the input token sequence is divided into Gᵢ groups of Kᵢ tokens each. Within each group, tokens are softly assigned to a set of local experts (E₁ᵢ) via learned slot embeddings; the assignment is performed per‑group, dramatically reducing peak memory while preserving locality. Each local expert is a small feed‑forward network that extracts group‑specific features. Second, the outputs of all groups are concatenated into a single long sequence and routed to a second set of global experts (E₂ᵢ). These global experts enable inter‑group communication and refine the representation with a broader context. The final token representation is reconstructed by an attention‑based weighted sum that respects padding masks.
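The two-level routing described above can be sketched as follows. This is a simplified NumPy illustration under several assumptions: slot embeddings are shared across groups, padding masks are ignored, the experts are stand-in functions rather than trained feed-forward networks, and the final reconstruction uses the standard Soft-MoE combine weights in place of the paper's masked attention-based weighted sum.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(tokens, slot_emb, experts):
    """One Soft-MoE layer: every token is softly dispatched to every expert
    slot, so routing stays fully differentiable (no hard top-k selection)."""
    logits = tokens @ slot_emb                 # (n_tokens, n_slots)
    dispatch = softmax(logits, axis=0)         # weights over tokens, per slot
    combine = softmax(logits, axis=1)          # weights over slots, per token
    slots = dispatch.T @ tokens                # (n_slots, d) expert inputs
    per = slots.shape[0] // len(experts)       # slots handled by each expert
    outs = np.concatenate([f(slots[i * per:(i + 1) * per])
                           for i, f in enumerate(experts)])
    return combine @ outs                      # (n_tokens, d)

def hierarchical_soft_moe(x, group_size, local_emb, local_experts,
                          global_emb, global_experts):
    """Level 1: local routing within each group of `group_size` tokens.
    Level 2: global routing over the re-concatenated sequence,
    which fuses information across groups."""
    groups = x.reshape(-1, group_size, x.shape[-1])
    local = np.concatenate([soft_moe(g, local_emb, local_experts)
                            for g in groups])
    return soft_moe(local, global_emb, global_experts)
```

Routing per group keeps the dispatch/combine matrices at size Kᵢ × slots instead of N × slots, which is where the peak-memory reduction noted above comes from.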
The Mamba‑HoME block integrates three components: (i) Gated Spatial Convolution (GSC) to inject early spatial priors, (ii) the Mamba layer for linear‑time long‑range modeling, and (iii) the HoME layer for hierarchical expert routing. Normalization is performed with Dynamic Tanh (DyT), a lightweight alternative to LayerNorm that stabilizes gradients without costly statistics. The block is placed inside a U‑shaped encoder‑decoder network, extending the SegMamba backbone. Across encoder stages, the number of first‑level experts E₁ᵢ increases while the group size Kᵢ decreases, yielding progressively finer specialization; the second‑level expert count scales as E₂ᵢ = 2·E₁ᵢ to ensure sufficient global capacity.
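The DyT normalization substitute mentioned above is a simple pointwise map. A minimal sketch, assuming the formulation from the DyT literature (a learned scalar α plus per-channel scale γ and shift β; the exact parameterization used in Mamba‑HoME may differ):

```python
import numpy as np

def dyt(x, alpha, gamma, beta):
    """Dynamic Tanh: gamma * tanh(alpha * x) + beta, applied element-wise.
    Unlike LayerNorm, no per-token mean/variance statistics are computed,
    so the op is a cheap bounded nonlinearity with learnable parameters."""
    return gamma * np.tanh(alpha * x) + beta
```

Since tanh saturates, the output is bounded by β ± γ regardless of the input scale, which is the mechanism by which DyT tames activation magnitudes without normalization statistics.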
Extensive experiments were conducted on four public datasets—PANORAMA (CT), AMOS (MRI), FeTA 2022 (MRI), MVSeg (Ultrasound)—and an in‑house CT collection, covering the three most common 3D modalities. Metrics include the Dice coefficient, 95th‑percentile Hausdorff distance (HD95), and mean IoU. Mamba‑HoME consistently outperforms state‑of‑the‑art models such as 3D Swin‑UNet, SegFormer‑3D, and SegMamba, achieving higher Dice scores while using 30‑45 % less GPU memory and reducing inference time by 20‑35 %. Notably, the model retains strong performance on low‑quality data (added noise, reduced resolution), demonstrating robust generalization.
The authors acknowledge limitations: the hierarchical routing hyper‑parameters (expert counts, group sizes) may need tuning for different resolutions, and processing an entire volume at once may be infeasible for extremely large scans, suggesting the need for sliding‑window or streaming strategies. Future work includes automated hyper‑parameter search, multi‑GPU distributed training, and extension to multimodal settings (e.g., image‑text).
In summary, Mamba‑HoME is the first architecture that fuses a linear‑complexity SSM with hierarchical soft Mixture‑of‑Experts, delivering both global context modeling and localized expert specialization for 3D medical segmentation. It achieves superior accuracy, efficiency, and modality‑agnostic robustness, marking a significant step forward in scalable volumetric deep learning.