CATFA-Net: A Trans-Convolutional Approach for Accurate Medical Image Segmentation
Convolutional blocks have played a crucial role in advancing medical image segmentation by excelling in dense prediction tasks. However, their inability to effectively capture long-range dependencies has limited their performance. Transformer-based architectures, leveraging attention mechanisms, address this limitation by modeling global context and creating expressive feature representations. Recent research has explored this potential by introducing hybrid frameworks that combine transformer encoders with convolutional decoders. Despite their advantages, these approaches face challenges such as limited inductive bias, high computational cost, and reduced robustness to data variability. To overcome these issues, this study introduces CATFA-Net, a novel and efficient segmentation framework designed to produce high-quality segmentation masks while reducing computational costs and increasing inference speed. CATFA-Net employs a hierarchical hybrid encoder architecture with a lightweight convolutional decoder backbone. Its transformer-based encoder uses a new Context Addition Attention mechanism that captures inter-image dependencies without the quadratic complexity of standard attention mechanisms. Features from the transformer branch are fused with those from the convolutional branch through a proposed Cross-Channel Attention mechanism, which helps retain spatial and channel information during downsampling. Additionally, a Spatial Fusion Attention mechanism in the decoder refines features while reducing background noise ambiguity. Extensive evaluations on five publicly available datasets show that CATFA-Net outperforms existing methods in accuracy and efficiency. The framework sets new state-of-the-art Dice scores on GLaS (94.48%) and ISIC 2018 (91.55%). Robustness tests and external validation further demonstrate its strong ability to generalize in binary segmentation tasks.
💡 Research Summary
CATFA‑Net is a novel hybrid architecture designed to address the long‑range dependency limitation of pure convolutional networks while keeping computational demands low. The model consists of two parallel encoder branches: a lightweight ConvNeXt‑based convolutional branch that preserves strong inductive bias, and a Hierarchical Context‑Addition Transformer (H‑CAT) branch that replaces the standard Swin‑Transformer multi‑head self‑attention with a Context‑Addition Attention (CAP) module. CAP enriches key representations by concatenating queries and keys and applying 1×1 convolutions to capture inter‑image similarity, then uses a spatial‑reduction block to lower the quadratic attention complexity from O(N²) to O(N²/R).
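The general pattern behind CAP can be illustrated with a minimal NumPy sketch: keys and values are spatially reduced by a factor R before attention (cutting the cost from O(N²) to O(N²/R)), and keys are enriched with query context via concatenation followed by a 1×1-convolution-equivalent projection. The exact tensor layout, the pooling used for spatial reduction, and the way query context is summarized are assumptions for illustration, not the paper's precise design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_addition_attention(x, Wq, Wk, Wv, Wc, reduction=4):
    """Sketch of spatial-reduction attention with context-enriched keys.

    x: (N, C) sequence of N tokens with C channels.
    Wq, Wk, Wv: (C, C) query/key/value projections.
    Wc: (2C, C) 1x1-conv equivalent fusing [query context, keys].
    reduction: spatial reduction ratio R (keys/values shrink to N/R tokens).
    """
    N, C = x.shape
    q = x @ Wq                                              # (N, C) queries

    # Spatial reduction: average-pool tokens by a factor R before the
    # key/value projections, so attention runs over N/R tokens.
    xr = x.reshape(N // reduction, reduction, C).mean(axis=1)  # (N/R, C)
    k = xr @ Wk                                             # (N/R, C)
    v = xr @ Wv                                             # (N/R, C)

    # Context addition: concatenate a global query summary with the keys
    # and project back to C channels (the 1x1-convolution step).
    ctx = np.broadcast_to(q.mean(axis=0, keepdims=True), k.shape)
    k = np.concatenate([ctx, k], axis=-1) @ Wc              # (N/R, C)

    attn = softmax(q @ k.T / np.sqrt(C))                    # (N, N/R) weights
    return attn @ v                                         # (N, C) output

rng = np.random.default_rng(0)
N, C = 16, 8
x = rng.standard_normal((N, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
Wc = rng.standard_normal((2 * C, C)) * 0.1
out = context_addition_attention(x, Wq, Wk, Wv, Wc, reduction=4)
print(out.shape)  # (16, 8)
```

The attention matrix here is N × N/R rather than N × N, which is where the O(N²/R) saving comes from; the output still has one row per query token.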
Features from both branches are fused through a Cross‑Channel Trans‑Convolutional Fusion Attention (CCTFA) mechanism, which maintains spatial resolution while enhancing channel‑wise interactions during down‑sampling. The decoder employs Conv‑G‑NeXt blocks and transposed convolutions for up‑sampling, and a Spatial Fusion Attention (SFA) gate that suppresses ambiguous background noise and highlights salient structures. The final pixel‑wise mask is produced by a 1×1 convolution and bilinear up‑sampling.
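The SFA gate's role, suppressing ambiguous background while highlighting salient structures, follows the familiar additive-attention-gate pattern: skip and decoder features are projected, fused, and squashed into a per-pixel mask that re-weights the skip connection. The sketch below shows that pattern in NumPy; the projection sizes, ReLU/sigmoid choices, and function names are illustrative assumptions rather than CATFA-Net's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_fusion_attention(skip, gate, W_s, W_g, psi):
    """Sketch of an additive spatial attention gate on a skip connection.

    skip: (H, W, C) encoder skip features.
    gate: (H, W, C) decoder (gating) features at the same resolution.
    W_s, W_g: (C, F) channel projections; psi: (F, 1) maps to a 1-channel mask.
    Returns skip features re-weighted by a per-pixel [0, 1] attention map.
    """
    # Project both inputs, fuse additively, and squash to a spatial mask.
    a = np.maximum(skip @ W_s + gate @ W_g, 0.0)   # (H, W, F) ReLU fusion
    mask = sigmoid(a @ psi)                        # (H, W, 1) spatial attention
    return skip * mask                             # background regions damped

rng = np.random.default_rng(1)
H, W, C, F = 8, 8, 4, 8
skip = rng.standard_normal((H, W, C))
gate = rng.standard_normal((H, W, C))
out = spatial_fusion_attention(skip, gate,
                               rng.standard_normal((C, F)) * 0.5,
                               rng.standard_normal((C, F)) * 0.5,
                               rng.standard_normal((F, 1)))
print(out.shape)  # (8, 8, 4)
```

Because the mask is bounded in [0, 1] and applied multiplicatively, the gate can only attenuate skip features, which is how low-confidence background responses are suppressed before fusion in the decoder.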
Extensive experiments on five public datasets—GLaS, DS Bowl 2018, REFUGE, CVC‑ClinicDB, and ISIC 2018—show that CATFA‑Net consistently outperforms state‑of‑the‑art methods such as U‑Net++, TransUNet, and Swin‑UNet across Dice, IoU, sensitivity, and specificity metrics. Notably, it achieves Dice scores of 94.48 % on GLaS and 91.55 % on ISIC 2018, setting new benchmarks. Ablation studies confirm that each proposed component (CAP, CCTFA, SFA) contributes 2–4 % performance gains. Compared with existing hybrid models, CATFA‑Net reduces parameter count and FLOPs by over 30 % and doubles inference speed.
In summary, CATFA‑Net delivers high‑quality medical image segmentation by efficiently modeling global context and preserving fine‑grained local details, offering a compelling solution for resource‑constrained clinical applications and providing a solid foundation for future extensions to multi‑class and 3D segmentation tasks.