ECMNet: Lightweight Semantic Segmentation with Efficient CNN-Mamba Network
In the past decade, Convolutional Neural Networks (CNNs) and Transformers have been widely applied to semantic segmentation tasks. Although hybrid CNN-Transformer models greatly improve performance, their global context modeling remains inadequate. Recently, Mamba has shown great potential in vision tasks, demonstrating its advantages in modeling long-range dependencies. In this paper, we propose a lightweight Efficient CNN-Mamba Network for semantic segmentation, dubbed ECMNet. ECMNet skillfully combines CNN with Mamba in a capsule-based framework to address their complementary weaknesses. Specifically, we design an Enhanced Dual-Attention Block (EDAB) as a lightweight bottleneck. To improve the representation ability of features, we devise a Multi-Scale Attention Unit (MSAU) that integrates multi-scale feature aggregation, spatial aggregation, and channel aggregation. Moreover, a Mamba-enhanced Feature Fusion Module (FFM) merges features from different levels, significantly improving segmentation accuracy. Extensive experiments on two representative datasets demonstrate that the proposed model achieves an excellent balance of accuracy and efficiency, reaching 70.6% mIoU on the Cityscapes and 73.6% mIoU on the CamVid test datasets, with 0.87M parameters and 8.27G FLOPs on a single RTX 3090 GPU.
💡 Research Summary
The paper introduces ECMNet, a lightweight semantic segmentation network that synergistically combines convolutional neural networks (CNNs) with the recently proposed Mamba architecture, a state‑space model (SSM) based sequence processor with linear computational complexity. While CNN‑Transformer hybrids have demonstrated strong performance, their self‑attention mechanisms suffer from quadratic complexity, limiting applicability to high‑resolution images and resource‑constrained devices. Mamba, by contrast, offers long‑range dependency modeling with O(N) cost, making it attractive for vision tasks that require global context without excessive computation.
ECMNet adopts a classic U‑shape encoder‑decoder backbone but replaces the conventional bottleneck blocks with three novel modules:
- Enhanced Dual‑Attention Block (EDAB) – The input feature map is first compressed via a 1×1 convolution to halve channel dimensionality, reducing both FLOPs and parameters. A two‑branch design then processes the compressed features: one branch uses a 3×1 followed by a 1×3 depth‑wise convolution to capture local, short‑range patterns; the parallel branch employs atrous (dilated) convolutions to aggregate broader context. Each branch is equipped with a distinct attention mechanism—Dual‑Direction Attention (DDA) and Channel Attention (CA)—which generate complementary attention matrices. The outputs are merged, passed through a point‑wise 1×1 convolution, and finally shuffled across channels to restore the original dimensionality while encouraging inter‑channel interaction. This design balances receptive‑field expansion with a minimal parameter budget.
- Multi‑Scale Attention Unit (MSAU) – MSAU consists of a spatial‑scale path and a channel‑aggregation path. In the spatial path, the feature map is first reduced to C/2 channels, then processed in parallel by depth‑separable convolutions of sizes 3×3, 5×5, and 7×7, yielding multi‑scale representations. These are fused, pooled adaptively to a single spatial descriptor, and passed through a 7×7 depth‑separable convolution followed by a sigmoid to produce a spatial attention map, which is multiplied back onto the fused feature. The channel path computes both average‑pooled and max‑pooled descriptors, each projected by a 1×1 convolution, and combines them to form a channel‑wise attention vector. The spatial and channel attentions are multiplied element‑wise and added residually to the original input, effectively enriching low‑level detail with high‑level semantics.
- Feature Fusion Module (FFM) – To merge encoder outputs with the two MSAU streams, the authors concatenate the three tensors and feed them into a 2D‑Selective‑Scan (SS2D) block, a Mamba‑derived component that treats the concatenated features as a sequence, applies linear transformations, and performs 2D convolutions in a selective‑scan fashion. This captures global dependencies with very few parameters. A subsequent Feed‑Forward Network (FFN) introduces non‑linearity and re‑weights channels, after which a residual connection adds back the original encoder feature.
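The EDAB bottleneck described above can be sketched in PyTorch as follows. This is a minimal illustration of the compress → two-branch → expand → channel-shuffle structure only: the DDA and CA attention mechanisms are omitted, and the exact channel widths, dilation rate, and residual placement are assumptions, not the paper's configuration.

```python
# Minimal EDAB-style bottleneck sketch (attention branches omitted).
import torch
import torch.nn as nn


def channel_shuffle(x, groups):
    # Interleave channels across groups to encourage inter-channel interaction.
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)


class EDAB(nn.Module):
    def __init__(self, channels, dilation=2):
        super().__init__()
        mid = channels // 2  # 1x1 compression halves the channel dimension
        self.compress = nn.Conv2d(channels, mid, 1, bias=False)
        # Local branch: factorized 3x1 then 1x3 depth-wise convolutions
        self.local = nn.Sequential(
            nn.Conv2d(mid, mid, (3, 1), padding=(1, 0), groups=mid, bias=False),
            nn.Conv2d(mid, mid, (1, 3), padding=(0, 1), groups=mid, bias=False),
        )
        # Context branch: dilated depth-wise convolution for a wider receptive field
        self.context = nn.Conv2d(mid, mid, 3, padding=dilation,
                                 dilation=dilation, groups=mid, bias=False)
        # Point-wise projection back to the original width
        self.expand = nn.Conv2d(mid, channels, 1, bias=False)

    def forward(self, x):
        y = self.compress(x)
        y = self.local(y) + self.context(y)   # merge the two branches
        y = self.expand(y)
        return channel_shuffle(y, groups=2) + x  # shuffle + residual
```

Because both branches use depth-wise convolutions on the halved channel count, the parameter cost of the block stays far below that of a standard 3×3 convolution at full width.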
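The wiring of the FFM can be sketched in the same spirit. A faithful SS2D selective scan is non-trivial, so a depth-wise 3×3 convolution stands in for it here purely to make the concat → global mixing → FFN → residual structure visible; this stand-in, the 1×1 fusion layer, and the FFN expansion ratio are all illustrative assumptions.

```python
# FFM wiring sketch; a depth-wise conv stands in for the SS2D block
# (it is NOT the paper's selective scan).
import torch
import torch.nn as nn


class FFM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(3 * channels, channels, 1, bias=False)  # merge 3 streams
        # Stand-in for the SS2D block (placeholder for the selective scan)
        self.mix = nn.Conv2d(channels, channels, 3, padding=1,
                             groups=channels, bias=False)
        self.ffn = nn.Sequential(  # feed-forward re-weighting with non-linearity
            nn.Conv2d(channels, 2 * channels, 1), nn.GELU(),
            nn.Conv2d(2 * channels, channels, 1),
        )

    def forward(self, enc, msau_a, msau_b):
        x = self.fuse(torch.cat([enc, msau_a, msau_b], dim=1))
        x = self.ffn(self.mix(x))
        return x + enc  # residual connection back to the encoder feature
```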
Three long‑range skip connections link encoder and decoder stages; each skip is enhanced by an MSAU, ensuring that both fine‑grained spatial cues and abstract semantic cues are propagated throughout the network.
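An MSAU-style unit, as used on these skip connections, might look like the sketch below. The 3×3/5×5/7×7 depth-separable branches and the avg/max channel descriptors follow the description above, but the exact pooling and fusion order in the spatial path is an assumption.

```python
# Minimal MSAU-style unit sketch (exact spatial-path pooling is assumed).
import torch
import torch.nn as nn


def dw_sep(cin, cout, k):
    # Depth-separable convolution: depth-wise k x k followed by point-wise 1x1.
    return nn.Sequential(
        nn.Conv2d(cin, cin, k, padding=k // 2, groups=cin, bias=False),
        nn.Conv2d(cin, cout, 1, bias=False),
    )


class MSAU(nn.Module):
    def __init__(self, channels):
        super().__init__()
        mid = channels // 2
        self.reduce = nn.Conv2d(channels, mid, 1, bias=False)
        # Parallel multi-scale depth-separable branches
        self.branches = nn.ModuleList([dw_sep(mid, mid, k) for k in (3, 5, 7)])
        # 7x7 depth-separable conv + sigmoid -> spatial attention map
        self.spatial_attn = nn.Sequential(dw_sep(mid, channels, 7), nn.Sigmoid())
        # Channel path: avg- and max-pooled descriptors, each projected by 1x1
        self.proj_avg = nn.Conv2d(channels, channels, 1, bias=False)
        self.proj_max = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        y = self.reduce(x)
        fused = sum(b(y) for b in self.branches)          # multi-scale fusion
        sa = self.spatial_attn(fused)                     # spatial attention map
        avg = self.proj_avg(x.mean((2, 3), keepdim=True))
        mx = self.proj_max(x.amax((2, 3), keepdim=True))
        ca = torch.sigmoid(avg + mx)                      # channel attention vector
        return x * sa * ca + x                            # joint attention + residual
```

Because the unit preserves spatial resolution and channel count, it can be dropped onto any skip connection without reshaping the surrounding encoder or decoder features.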
Experimental validation:
- Ablation studies on CamVid show that the baseline U‑Net (plain encoder‑decoder) achieves 69.92% mIoU. Adding long skip connections yields +0.61 points, integrating MSAU adds another +0.92 points, and incorporating FFM contributes +1.11 points, culminating in 73.62% mIoU (+3.70 points over the baseline). Parameter growth across all additions is modest (0.87M parameters and 8.27G FLOPs in total).
- Comparison with the state of the art: On Cityscapes, ECMNet reaches 70.6% mIoU with far fewer parameters than heavier models (e.g., LBN‑AA with 6.2M parameters). On CamVid, it attains 73.6% mIoU, surpassing many recent lightweight approaches (LEDNet, CGNet, CFPNet) while maintaining a comparable computational footprint.
The results demonstrate that ECMNet successfully reconciles the classic trade‑off between accuracy and efficiency. By leveraging Mamba’s linear‑complexity global modeling together with carefully engineered CNN‑based attention blocks, the network delivers high‑quality segmentation suitable for real‑time deployment on embedded GPUs or mobile platforms. Potential applications include autonomous driving perception, robotic navigation, and augmented‑reality scene understanding, where both low latency and precise pixel‑wise labeling are critical.
In summary, ECMNet contributes a novel hybrid architecture that:
- Introduces a lightweight dual‑attention bottleneck (EDAB) to capture local and dilated context with minimal overhead.
- Provides a multi‑scale attention unit (MSAU) that fuses spatial and channel information across scales.
- Employs a Mamba‑based feature fusion module (FFM) to aggregate multi‑level features efficiently.
- Achieves state‑of‑the‑art performance on benchmark segmentation datasets with under 1M parameters and under 10G FLOPs, establishing a new baseline for ultra‑lightweight semantic segmentation.