TauFlow: Dynamic Causal Constraint for Complexity-Adaptive Lightweight Segmentation
Deploying lightweight medical image segmentation models on edge devices presents two major challenges: 1) efficiently handling the stark contrast between lesion boundaries and background regions, and 2) the sharp drop in accuracy that occurs when pur…
Authors: Zidong Chen, Fadratul Hafinaz Hassan
T auFlow: Dynamic Causal Constrain t f or Complexity-Adaptive Lightw eight Segmenta tion Zidong Chen 1 School of Compu ter Sciences, Universi ti Sains Malaysia Gelugor, Penang, Malaysia, 11 800 chenzidon g@studen t.usm.my Fadratul Hafin az H assan 2, ∗ School of Compu ter Sciences, Universi ti Sain s Malaysia Gelugor, Penang, Malaysia, 11 800 fadratul@usm .my Abstract: Deploying lightweight medic al image segmentation models on edge devices presents two m ajor challenges: 1) efficiently handling th e stark contras t between lesion boundarie s and background regions, and 2) the sharp d rop in accuracy that occurs when pursuing extremely lightweight designs ( e. g., < 0.5M paramet ers). T o address these problems, this paper proposes T auFlow , a novel lightweight segmentation model. The cor e of T auFlow is a dynamic f eature re sponse stra tegy inspired by brain-like mechanisms. This is ac hieved through two ke y in novations: the Convolutional Long-Time Constant Cell (ConvL TC), which dynamic ally regulates the featur e update rat e to “slowly” process low-frequency b ackgrounds and “ quickly ” respond to high-frequency boundaries ; and the STDP Self-Organizing Module, which signific antly mitigat es featur e conflicts b etween the encoder and decoder , reducing the conflict ra te from approximat ely 35%-40% to 8%-10%. Keywords: Medical Image Segmen tation, Li ghtweight Mode l, E dge Computing, Br ain-Inspired Computing , Dynam ic Syst ems, ConvL TC, ST DP 1 Introduction 1.1 Background and Signi ficance Medical image segme ntation is a core supporting technology for computer-aided diagnosis (CAD) and p recision medicine. Its ta sk is to automatically and accurately delineate target regions from complex medica l images (such as computed tomography (CT), magnetic resonance imagin g (MR I), digital pathology slid es, etc.), including tum ors, specific organs (such as cardiac structures), or microscopic tissues (such as glan ds). Accurate segmentation results can provide critical quantitative evidence for earl y disease screening , disease progression mon itoring, surgical planning , and r adiotherapy target delineation[ 1], greatly enhanc ing the objectivity , reproducibility , and efficiency of clinic al diagnosis. However , the task of m edical image segmentation itself is f ull of challenges. The data characteristics of diffe rent modalities vary greatly: CT images are often a ccompan ied by low contras t and artifacts; MRI has multi-modal characteristics but im aging parameters are highly variable; p athology slides, on the other hand, are extremely large in scale and o ften face issue s such as gland a dhesion a nd cellular m orphological diversity . Therefore, developing segmen tation models that simultaneously possess high ac curacy , high ef ficiency , and strong generaliza tion ability has always been a research focu s in this field. 1.2 Existing Methods, Li ghtweight T rends, and Limitations 1.2.1 U-Net and Its Mainstream Improv ements Since the proposal of U-Net[ 2], its symmetric encoder-decoder ar chitecture and innovative skip connection design have successfully addressed the fusion problem of semantic information and spatial details in se gmentation tasks, r apidly becomin g a benchmark framework in the field of medical image segmentation. The encoder agg regate s high-level semantic information through layer-by-lay er downsampling , the decoder gradually restores spatia l resolution through upsampling, and the skip connection s transmit shallo w detail featur es from the encoder to the d ecoder , eff ectively alleviating the “ semantic gap ” caused by network depth. Based on this paradigm, num erous studies have focused on optim izing U-Net 's comp onents and connection methods. For example: UNet++[3] reduces the sema ntic gap between encoder and decoder featur e maps by designing nested dense skip p aths and also support s model prun ing for lightweight implementation; MultiResUNet[4] in troduces multi-scale residual paths to replace traditional residual units, enhancing the m odel's ability to represent multi-m odal features without significantly increasing paramet ers; U -Net v2[5] improves segmentation accu racy by redesigning a b idirectional featur e enh ancement mechanism in the skip connections and demonstrates its potential as a plug-and- play mod ule adaptable to various encoder-deco der structures. Thes e im provements have all been validated for their effectivenes s in specific medical tasks. 1.2.2 Edge Compu ting-Driven Lightweight T rend With the rapid development of edge intelligence, mobile health care (mHealth), and portable diagnostic devices, the deployment scenarios of models are shifting from cloud servers to resource-constrained edge devices (suc h as handheld ultrasound device s, smart stethoscopes, and embedde d analysis box es). This requires segme ntation models to simultaneously meet the dual standards of “ extreme ligh tweight ” (low parameter count, low memory footprint) and “ high accuracy ” (meet ing clinical diagnostic requirements). In the general vision dom ain, lightweight architectures have achieved significant p rogress. For example, through hardware-awar e dyna mic pruning and m ixed-precision quantization strat egies[ 6], m odels c an maintain high performan ce unde r extremely low reso urce consumpt ion. However , the com plexity of medical images-suc h as the blurred boundarie s of skin lesions [9], dynamic deformations i n cardiac MRI [10], and the intricate textures of pathology slides -poses far more demanding requirements for fine feature m odeling in lightweight models than conv entional images. 1.2.3 Main T e chnical Approaches and Limitations of Lightweight Segmentati on T o balance lightweight design and ac curacy , researchers have explored three mainstream technical paths: 1. Effic ient Convolution -Based Path: Base d on conc epts such as MobileNet and ShuffleNet, lightweight U-sha ped networks are constructed usin g depthwise separable convolutions, grouped convolutions, and other methods. Although these approaches have low pa ramete r counts, their limited receptive fields restrict long-range depen dency modeling, mak ing it difficult to capture the global structural infor mation of organs. 2. T rans former Lightweight Path: Vision Tr ansf ormer (ViT) and its variants (such a s Swin T r ansfo rmer) have been introduced into the segme ntation field (e. g., T r ansUNet) due to their powerful glob al modeling capabilities. However , the quadratic computational complexity of the self- atte ntion m echanism makes deployment on edge d evices challenging. T o address this, researchers have proposed solutions such as EViT- Unet[7], attempting to adopt efficient vision T r ansfo rmer structures suitable for mobile devices. 3. State Space Model (SSM ) Path: Recently , SSM s represented by Mamba have emerged as a new hotspot in lig htweight research d ue to their linear complexity and strong lo ng-range dependency modeling capabilities. Re searchers have quickly combined them with U-Net, proposing mod els such as VM-UNet [11], Mam ba-UNet [12], an d MSVM-UNet [13]. The se models use Mamba blocks to replace T r ansfo rmer blocks or certain convolutional blo cks, significantly reducing com putational cost wh ile m aintaining lon g-range modeling capa bility . For example, UltraLight VM-UNet[8] employs parallel visual Mamb a blocks and achieves excelle nt ligh tweight p erformance in skin lesion segmentation. This paper further validates through ac tual edge device deployment (such as the Allwinner H61 8 and Rockchip RK3588 platforms) that T auFlow demo nstrat es real-time performance and low-power advantages in resource-constrained scenar ios (see Sect ion 4.5 for details). 1.2.4 Overlooke d Exploration: Brain-Inspired Co mputing In ad dition, there exists an other explorat ory path-brain-inspired com puting-which attempts to fundamentally simulate the computational prin ciples o f biological neural syste ms. This mainly includes two directions: Models B ased on Continuous-Time Dynamic Sys tems: For example, Liq uid Time- Constant (L TC) networks or the generalized Liquid State Machine . These m odels atte mpt to simulate the continuous dynamic responses of neurons through differential equations and , in theory, posse ss the capability to hand le complex temporal d ynamics[39] . Models Based on Sp iking Neural Networks (SNNs): SNNs transmit information using biologically more realistic event-driven spike signals. They are of ten com bined with Spike-Timing-Depe ndent Plasticity ( STDP) as a learning rule. STDP is a lo cal unsup ervised mechanism that sim ulates biological synaptic plasticity , automatically adjusting synaptic weights ba sed on the precise timing differ ences of spike firings. In theory , it is highly suita ble for unsupervised self-organization of featur es and pattern discov ery[47] . However , both of these br ain-inspired computing paths f ace severe challenges in current deep learning applications. Models based on continuous- time dynam ic syst ems of ten incur enor mous computational cos ts, re lying on com plex ordinary diffe rential equ ation (ODE) s olvers, which results in low training and infer ence efficienc y . The main bottleneck of SNNs combined with STDP lies in training difficulty: on one hand, STDP as a local unsup ervised rule is difficult to directly optimize for complex, p ixel-leve l segmentation tasks (which require global superv ision signals); on the other hand, the disc rete and non -differe ntiable n ature of spikes makes it hard to efficiently integr ate with mature gradient-based global backpropagation (B P) algorithm s (the “ surrogat e gradient ” problem), resulting in challengin g model optim ization, slow conve rgence, and limited accuracy . Therefor e, although brain-inspired compu ting (whether L TC or STDP) provides a high ly attractive theoretical fr amework for dy namic modeling and fe ature self-organization, its application is currently largely confined to laboratory exploration due to the a for ementioned com putational overhead and training convergence c hallenges, and it is far from meeting practical standards for real-time, high-ac curacy segmentation on edge devices. 1.3 Core Existing Issues Although the afore mentioned ligh tweight m odels, particular ly those based on State Spac e Models (SSM), have made progress, o ur r esearch rev eals tha t existing m odels exhibit two cor e bottlenecks when com pressed to an extreme parameter count (e.g., <0. 5M). The se bottlenecks create a fu ndamental trade-off between accuracy and efficiency: Lack of Rate Adaptation in Static Feature Processing Mechanisms: Medical image content is high ly heterogeneous. For ex ample, lesion boundarie s represent high-frequency info rmation, requiring the model to respond qu ickly to capture fine edges, whereas larg e areas of background or tissue interiors are low-frequency info rmation, requiring smooth p rocessing to suppress noise. However , whether using CNNs (fixed convolution kernels), T r ansfo rmers (fixed attention patterns), or Mamba (unified scanning m echanism) , the u nderlying featur e p rocessing rules are essential ly static. They apply a “ one-size-fits-all ” stra tegy to all regions (high-f requency bound aries or low-frequency backgrounds). This sta tic m echanism causes models under lightweight constraints to be unable to dynamically adapt feature response rat es: fine details are easily lost when processing complex boundaries, an d artifacts are easily introduced when processing hom ogeneous re gions. For example, on the Synapse dataset, such lightweight models achieve Dic e scores that are on avera ge 3%-5% lower than those of high-accu racy heav yweight m odels. “ Poor Modal Fusion Consistency ” under Lightweight Constraints: The performance of U- shaped architectures heavily relies on skip connections, which b ridge shallow featur es from the enc oder (rich in local textures and spatial details) with deep features from the decoder (rich in global s emantics a nd positiona l information). These two type s of featur es diffe r significantly in properties (i.e ., “ modal ” differ ences). Heavyweight models (such as DA-T r ansUNet[14][15]) have sufficient parameter capacity to “ align ” these features using complex multi-head att ention, spatia l-channel dual calibration, or multi-sc ale Tr a nsformer fusion mechanisms. Under lightweight constr aints, however , models lac k sufficient capacity for such complex operations, and sim ple concatenation o r shallow attention often leads to severe feature conflicts. Regar ding the semantic gap problem in skip conn ections of U-Net- type models, Wa ng et al[16] c onducted an in- depth ana lysis. The ir study sh owed that in lightweig ht T ra nsformer U-Nets, directly fusing encoder and decoder featur es reduced the cosine similarity of fea tures b y approximat ely 25%- 30%. This clearly r eflects the mismatch and int erfer ence betw een loca l fine featur es and high- level semantic features, representing a ke y factor limiting segme ntation accuracy . 1.4 Proposed Method: T auFlow Figure 1: Ef ficiency compa rison of T auFlow with main stream segmentation models (X-axis: mo del parameter count, in millions (M); Y-ax is: computational cost, in G-FLOPs; both axes are in logarithmic scale. The lower-le ft corner indicates higher ef ficiency). The pr oposed T auFlow model (star ma rker) de monstrates significant advantages over other main stream and lightweight models in both para meter count and computational cost. T o a ddress the above two c ore bottlenecks, this paper draws insp iration from b rain-inspired computing and bypasses the com plex path of traditional differe ntial equation solvers, proposing an efficient lightweight seg mentation model based on continuous-time dynamic syste ms-T a uFlow . The core i dea of T auFlow is to in troduce an inp ut-adaptive “ time c onstant ” , using to dynamically regulate th e rat e of fea ture updates and the strategy for cross-moda l featur e fu sion, thereby achieving fine-grained dynam ic feature m odeling under extremely low parameter counts. T o realize this idea, we designed two core lightweight mo dules and introduced a unique fine-tuning m echanism: 1. Convolutional Long- Time Constant Cell (ConvL T C): T o address the problem o f “insufficient rat e adaptation” . This module d raw s on the Liquid Time-Constant (L T C) m echanism, dynamically generating a sequence for each spatial position via convolutional projection. The directly controls the update rate of the recurrent un it: smaller values accelerate featur e upda tes in lesio n boundary (high -frequency) regions, while larger values smoo th noise interfer ence in bac kground (low-frequency) regions. This achieves efficient “ spatial + temporal ” dual-dim ensional dynamic m odeling. 2. T auFlowSequenc e Dynamic Groupi ng Module: T o overcome t he challenge of “poor moda l consistency ” . This module first dynamically allocates computational resources b ased on local image com plexity (e.g ., 5 fine-compu tation groups in high- complexity regions using gradients). Simultaneously , it employs a -guided atte ntion mecha nism du ring skip-connection fusion, pref er entially selecting c ross-modal k ey featur es that are highly consistent with the decoder . Specifically , T auFlow innovatively incorporates a Spiking Neural Network (SNN)-inspired Spike-Timing-Dependent Plasticity (STDP) mechani sm (STDPModule ). This mechanism acts as a forward weight ad juster , dynamically self-organizing the strengthening or suppression of ke y module we ights (in a manner sim ilar to L TP /L TD) based on the temporal variations of fea tures. The STDP mechanism works in synergy with the mechanism, jointly a chieving multi-dimensio nal adaptive feature reorganiz ation un der extremely low pa ramete r counts, greatly reducing fea ture conflicts. Through the collaboration of these two modules, T auFlow strictly maintains a total parameter count below 0.33 Mb withou t an y kno wledge d istillation te chniques. After processing with the proposed m ethod, th e p roportion of conflict regions betwe en encoder and decoder feat ures is significantly reduced from 35%-40% in conve ntional lightweight methods to 8%-10%. 1.5 Summary of Contributions The main contributions of this paper are as follows: 1. Architectural and Mec hanism Innovation - Introduction of a n Eff icient Dynamic Mechanism: We pr opose a novel ligh tweight segmentation architecture, T auFlow . For the first time, the input-adaptive time constant mechanism from continuous-time dynamic sys tems (L TC) is efficiently and differe ntiably integrated into a U-sh aped network via the ConvL TC modu le. This mechanism enables pixel-level dynamic contr ol of featur e processing r ate s, addressing the sta tic f eature processing bottleneck comm only present in existing lightweight models. 2. Fusion and Self-Organization Breakthrough - -STDP-Driven Dynamic Grouping : We designed the T auFlowSequence dynamic grouping m odule and inn ovatively in corporated th e SNN-inspired Spike-Timing-Dependent Plasticity (STDP) mechanism (STDPModule ). STDP , as a forward weight adjuster , dynamica lly self-organizes the strengthening or suppression of grouping p atte rns based on temporal variations of feature s (in a ma nner sim ilar to L TP /L TD). Leveraging guidance and STDP self-organization ca pabilities, this m odule achieves d ual dynamic control of computational resources (dynamic num ber o f groups) and featur e fu sion (cross-modal consistency), successfully reducing the proportion of conflict regions b etween encoder and decoder fe atures f rom 35%- 40% in conv entional lightweight method s to 8%-10%. 3. SOT A Performance a nd Generalization Va lidation - High Accuracy under Extreme Lightweight Constraints: T a uFlow achieves superior performance comp ared to existing lightweight SOT A methods across three public medical datasets (cover ing gland segme ntation, nuclear segmentation, and multi-organ segmentation), while kee ping the total parameter count at 0.33M. For example, on th e Gla S datase t, the D ice score reaches 92.12% (an improvement of 1.09% over UD T r ansNet); on the MoNuSeg dataset, the Dic e score reaches 80.97% (an improvement of 1.11 % over MSVM-UNet); on the Synapse dataset, the aver age Di ce score reaches 90.85% (an improvement of 1.57 % over UDT r ansNet), with particular ly notable improvements for comp lex organs such as the pan creas (88.1% D ice, +2.3%). Furthermore, on the non-m edical C ityscapes dataset, T auFlow achieves 79.26 % mIoU, demonstrating its cross-domain generalization capability . 4. Actual Deployment and Edge Va lidation: We quan tified T auFlow's infere nce speed, memo ry usage, and power con sumption on various embedde d platforms, demon stra ting its practicality for mo bile medica l devices. 2 Related Work Research o n image segm entation has continu ously evolved around two core objective s: “ accuracy improvement ” and “ efficiency optimization, ” giving rise to three main technical branches: lightweight network design, d ynamic mod eling and featur e fusion optimization, and brain-inspired computing with dynamic lear ning paradigms. This section systematically r eviews the technical pa ths, advanta ges, a nd existing limitations of each branch in light of the latest research, clarify ing the differentia ted positioni ng of the proposed method relative to existing work. 2.1 Lightweight Medical Image Segmenta tion Networks T o meet the deployment requirements of edge devices (such as portable imaging analyzer s and intraopera tive navigation syste ms), lightweight network design focuses on “ parameter compression ” and “ computational efficiency improvement. ” This is mainly ac hieved through three strat egies: structu ral simplification, convolutional optimization, and novel operator replacement. In recent years, a rich variety of variants has emerged based on the U-Net architecture. In the field of pu re convolutional ligh tweight architectures, L FT-UNet[18] , as a conci se U-Net variant, aband ons complex attention or T r ansformer modules, an d designs a pu re convolutional encoder-decoder structure only by optimizing the convolution kernel size and chann el ratio, maintaining basic segmentation per formance while kee ping the p arameter count b elow 100k , suitable for scenarios with extremely lim ited computing power . GA-UNet[19] introduces Ghost modules and attention mechan isms to achieve a lightweight design with only 2.18M pa ramete rs in m edical image segmentation, achieving exc ellent Dice scores on the ISIC 2018 dataset while maintaining high infer ence speed . For 3D medical im ages (such as brain tu mor MRI), LA TUP-Net [20] proposes a lightweight 3D atte ntion U-Net, using p arallel grouped convolutions to replace traditional 3D convolutions, reducing redundant com putation while maintaining the receptive field, ach ieving a Dice score of 89.2 % on the BraTS 2023 dataset with only 1/5 of the pa ramete rs of the traditional 3D U-Net. In addition, Gh ost convolution, as an efficient convolution paradigm, has also been widely applied: a lightweight U-Net variant based on GhostNet[21] generates “ base fea tures + ghost features ” th rough Ghost modules, reducing GFLOPs b y 58% compared with U-Net in brain tissue segmentation tasks in n euro-robotics, while ma intaining 90.5% boundary segme ntation accuracy . In the field of Mamba-based lightweight models, with the r ise of str uctured stat e sp ace models (SSM), Visio n Mamba, d ue to its linear complexity and lon g-range depend ency modeling capability , has become a core operator for lightweight segmentation. U-Mamba [22] first embedded the Mam ba mechanism into the U-Net encod er bottleneck, creating a linear-com plexity global m odeler , raising the Dice score of nnU- Net from 79.3% to 86.4% in 3D abdominal segmentation (CT /MR). VM- UNet[23] integra tes the Mamba mechanism into a lightweight U-Net by replacing some convolutional blocks in the enc oder with streamlined Mamba units, achieving linear-time infer ence in the S ynapse abdominal multi-o rgan segmentation task. When subsequently extended to the multimedia doma in[24], cross-modal featur e calibration modules were introduced to improve segme ntation robustness across differ ent i maging moda lities (CT /MRI). T o add ress th e limitations of Mamba in fine structure segmentation, researchers fu rther optimized operat or design: H-VM UNet[25] introduces high-order Vision Mamb a to model com plex lesions (such as adherent glands) through multi-order feature interactions. LKM-UN et[26] extends the Mamba kernel size from the default 4 × 4 to 40 × 40, achieving an overall Dice improvement of 0.97% and an NSD impr ovement of 0.53% in Abdo men MR multi-organ segm entation comp ared to standard U-Mamba. MSVM-U Net [27] com bines multi- scale convolutions with Mamba, capturing multi- receptive-field d etails at the same featur e level through pa rallel convolutions of differ ent ker nel sizes (1 × 1, 3 × 3, 5 × 5), achieving a DSC of 85.00% and HD95 of 14.75 mm in Synapse m ulti-organ CT segmentation. Furtherm ore, VMAXL-UNet[28] integra tes Vision Mamba with xLSTM, f urther reducing computational redundancy through efficient SS2D scanning and mLSTM mechanism s, decreasing the parameter count by 15% compared to VM-UNet (22.5M vs. 26.5M). In the field of lightweight models combining attention an d co nv olution, split a tt ention mechanisms have become ke y to balancing accuracy an d efficiency . Att E-UNet[29] proposes a lightweight edge-attention enhanceme nt branch, combinin g multi-stage Canny filters with learnable gat ed f usion (GF) , achieving F1=76.3% and IoU=6 5.4% o n the PanNuke dataset, with a parameter c ount of o nly 0.54 8 Mb, suitabl e for resource-constrained medical edge devices. ES-UNet[30] improves the fu ll-scale sk ip co nnection paths of U-Net witho ut significantly increasing parameter count, by serially integrating lightweight c hannel attention modules alon g each encod er-to-decoder path, enhancin g multi-scale feature fusion in high- resolution 3D medical volumes. On the MICC AI H ECKTOR head and neck tumor dat aset, its DSC impro ved by 4.37% compared to the b aseline 3D U-Net (76. 87% vs 72.50%), and IoU im proved by approximat ely 4.3%. Although th ese lightweight models have made progress in paramete r compression, they of ten face lim itations in receptive field and fea ture representa tion when handling the heterogeneity of medical images (such as blurred bound aries and multi-m odal noise), resulting in accuracy degradation in extreme lightweight scenarios. 2.2 Dynamic Modeling and Feat ure Fusi on Optimization The dynami c characteristics of m edical images (such a s blurred l esion boundarie s an d heterogeneous tissue te xtures) require models to have adaptive feature processing capabilities. Related research mainly focuses on “ dynamic operator design ” and “ featur e fusion mechani sm optimization ” to mitigat e the generaliza tion limitations of traditional static models. In the a rea o f dyna mic operator s and network architectures, dynamic large kernels and dynamic branches have become research hotspot s. D-Net [31] proposes dyna mic large-kernel convolution, constructing a n equivalent 23 × 23 × 23 sup er receptive field by cascad ing 5 × 5 × 5 a nd 7 × 7 × 7 depthwise separable large ker nels, and using a ch annel-space dual dynam ic selection mechanism guided by global pooling to adaptively weight multi-scale fe atures. In liver vessel-tumor CT segmentation tasks, th e recall of tubular f ine vessels im proved by 8.3% compared to fixed-k ernel models.E2 ENet[32] designs a “ multi-stage bidirectional sparse feature flow + restricted depth-shift c onvolution. ” The former uses a Dynamic Sparse Feature Fusion (DSFF) mecha nism to adaptively select and f use multi-s cale information from three directi ons (up, down , front-back ), filtering redundant co nnections; the latter divides inpu t c hannels into three groups, shifts them along the depth axis by {-1, 0, +1}, and uses (1 × 3 × 3) convolutions to capture 3D spa tial relationships. Its pa rameter count is only one-third of standard 3D convolutions, ac hieving an mDice of 90.3% on the AM OS-CT multi-organ segm entation c hallenge.DAFNet[33] combines dual-branch feature decom position with domain-ada ptive fusion , dynamically alignin g global infrared and visible-ligh t feature distribu tions during encodin g via an MK-MMD mixed-ker nel function, reducing detail loss by 35% compa red to traditional fixed fusion strategies in infrared-visible image fusion tasks.Further more, Fu sionMamba [34] proposes a D VSS + DFFM dual-driven U-shaped network. Dynamic convolution (LD C) and differ ential perception gating dynamic ally enhance the texture expressiven ess a nd modal ity differe ntiatio n of encod er features, while the CMFM cross-m odal Mamb a module achieves long- range interaction fusion. This approach maintains SOT A perf ormance across multiple medical imaging m odalities (such as CT-MRI, PET-M RI), with fusion robustness significantly improved over traditional U-Net. In terms of featur e fusion strat egies, multi-scale fusion and ad aptive f usion hav e b ecome mainstream d irections. TB SFF-UNet [35] achieves an average Dice improvement of 3.1% over U-Net++ on the GlaS and MoNuSeg datasets (90 .56 vs 87.46; 79.07 vs 76.02), with a parameter count only 43% of U-Net++, ach ieving a bal ance between accuracy and efficiency . DF A-UNet [36], targe ting single-shot u ltrasound elastography (BUE), uses a dual-stream encoder (ConvNeXt + lightweight ViT) to extract local and global feature s, performing fea ture c oncatenation and fusion before the bottleneck. Its IoU reaches 77.41 %, improving 2.68% over the baseline U-Net and 0.86% over the SOT A model ACE-Net. In the research on generalization a nd adaptivity in medica l image segmentation, Zhu et a l. [37] proposed the Uncertainty and Shape-Aware CTT A framework at MICCAI 2023. It incorporates shape priors through dom ain-generalized training to improve i nitial robustness and c ombines uncertainty-weighted pseudo-labels with shape-aware consistency to ac hieve continual test-time adaptation. A random weight reset mechanism prevents overfitting, enabling this method to significantly outper form existing CT T A approaches across three cross-domain segmentation tasks. Subsequently , D ong et al.[38] proposed Shape-Intensity-Guide d U-Net (SIG-UNet), which adds a symmetric reconstruction decoder to U- Net to reconstruct cla ss-aver age images (containing only shape and intensity informa tion, witho ut texture) and fuses them with the segme ntation decoder via skip con nections. This design guides the encoder to focus more on stable shape-in tensity featur es, reducing bias toward te xture. Although these dynamic m ethods enh ance model adaptivity, their computational cost is high, creating efficiency bottlenecks under ligh tweight con str aints, and they still insuf ficiently optimize cross-modal consistency in featur e fusion. 2.3 Brain-Inspired Computing and Dynamic Learning Par adigms Brain-inspired computin g, particularly Liquid Neural Networks (LNN), has emerged a s a novel technology for han dling temporal depe ndencies and dynam ic fe atures in medical images due to its continuous-time dynam ic characteristics. Related research mainly focuses on innovations in dynamic mod eling and learnin g paradigms. In the medica l applications of Liquid Neural Networks, Hasani et al. proposed the Liqu id Time-Constant (L TC) n etwork, which systematically in troduces continuous-time recurrent units with variable time c onstan ts to enha nce modeling of non-stationary temporal signals, demonstrating its expressive p ower and stability in time-se ries prediction ta sks[39]. This approach provides a p otential pathw ay , b oth theoretically and methodologically , for applying dynamic temp oral m odeling concepts to medical im age segmentation (e. g., in tegra ting serialized images, mul ti-modal time series, or real-time intraoperat ive video stre ams), and, together with published imp lementations of liquid/reservoir- like mod els on neuromorphic or embedded hardware, i ndicates the feas ibility of low-power , real-time deploym ent[40]. However , to da te, only a few peer-reviewed publications report verifiabl e quantitative performance of LNN series directly on mainstream medical imaging segmentation benchmarks (such as BraTS, CheXpert). For example, on the BraTS dataset, a hybrid model combinin g LNN with ResNet-50 (LNN -P A-R esNet50) reported 99.4% a ccuracy[41], and LNN m odels applied to radiographic images (e.g., knee X-rays) for osteoarthritis detection have shown potential advantage s[42]. The refor e, while this paper cites the methodolo gical potentia l of L T C as a research motivation, it maintains caution reg arding the generalizability of task-specific performance, making on ly limited assertions based on existing evidence. Although brain-inspired computing provides a new p erspective for dynam ic m odeling, its computational com plexity and training challenges limit its wide spread application in lightweight medical segmentation. T auFlow ad dresses these limitations by introducing an efficient ada ptive mechanism in synergy with STDP , achieving high-accu racy segmentation un der e xtreme lightweight constr aints. 3. Method 3.1 Over all Framework and Data Flow Overvi ew Figure 2: Overall Structure of T auFlow Figure 2 illustra tes the over all a rchitecture of the prop osed T auFlow network. The network adopts a lightweight encod er-decoder structure, where the encoder is responsibl e for multi-sc ale spatial featur e extraction. The Flow Interface injects explicit position al encodin g and generates initial temporal state s. Dynamic G rouping per forms dynamic grouping, -Att ention conduc ts lightweight temporal mod eling guided by transmitted , an d Flow Cell extracts an d fu ses featur e maps based on the temporal modeling of dynamic gr oups. T ogether , the Flow Interface, Dynamic Grouping, -Att ention, and Flow Cell form the core Flow Sequence module of this study . F inally , the deco der outputs the segmented featur es. The overall design targ ets efficient represent ation under parameter constraints , achieving effe ctive mo deling of cross-region coherence a nd fine structural details in medical images through p osition awareness, com plexity-adaptive processing , and -guided grouped temporal mechanisms. The overall forw ard process is as follows. First, the input image ( ) is passed through the encoder to obtain multi-sc ale fea tures , , and . Subsequently , the Flow Interface (Fig. 3) generates positional embedding s on and constructs concate nated featur es , while global pooling and mapping produce the initial hidden state for ConvL TC. Importantly , does not d irectly enter the temporal unit; it is f irst processed by Dynamic Grouping (Fig. 4) to generate gr oup masks and grouped featur es (U), which a re then fused with statistics via -Att ention (Fig. 5 ) to obtain the weighted group representation . Finally , the Flow Cell (Fig. 6 ) evolves the groups temporally , and the outputs, af ter mask-weighted fusion, are fed into the decoder for upsamp ling and skip connections, producing the main segmen tation and auxiliary segmen tation outputs. T o fu rther stabilize dyna mic grouping a nd group-level temporal evolution, the T auFlowSeque nce introduces the concept of sp ike-timing-dependent pla sticity (STDP) f rom brain-inspired lear ning as a lightweight regularization term. Its core objective is to encourage temporal causal consistency: when input a ctivations within a group p recede hidden state (or prediction) activations, a rew ard is ap plied; otherwise , a penalty is imposed, thereby suppressing updates where “ effe ct precedes cause. ” Specificall y , an STDP regularization loss is computed during th e T auFlowSequenc e forward pass and add ed with a very small weight to the total loss during training, without incurring any extr a overhead during i nfer ence. This mechanism compleme nts complexity-adaptive processing, -guided attention, and flow-smoothing regularization, jointly enhancing the stability of dynamic masks and inter-group specialization. The entire process can be represented as: Flow I nterf ace → Dynamic Grouping → -Att ention → Flow Cell → Decoder . Figures 3-1 to 3-5 illustr ate, respectively , the over all architecture, i nterf ace details, groupin g module, -guided attention, and grouped temporal unit. The remainder of this chapter sequentia lly details the design a nd implemen tation of the Flow Interface (Section 3.2) , followed by a focused descrip tion of Dynamic Groupin g, -Att ention, and Flow Cell within the T auFlowSequence (Section 3.3), and concludes with the design of the loss functions (Sec tion 3.4) an d the unified training strat egy (Section 3.5) . 3.2 T a uFlow Interf ace: Po sitional Embedding and Initial Stat e Generation Figure 3: Computation Flow of T auFlow Interface The T auFlow Interface module (Fig . 3) serves as the interface between the encoder a nd temporal modeling, performing two key tasks: first, injecting explicit spa tial positiona l information into deep features; and seco nd, generating the initial hidden s tate for ConvL T C b ased on global context, providing a stable and meaningful starting point for subsequ ent grouping a nd temporal evolution. The positional embedding is genera ted using a learnable convolution. Given the dee p encoder featur e , a constant cha nnel tensor (all ones) is first constructed and mapp ed through a convolution to obtain the positional embedding , as shown in Equation (1). (1) which is then co ncatenat ed with to form Equation (2) . (2) This learnable positional em bedding a llows the mod el to adaptively encod e ta sk-relevant positional inf ormation during tra ining, making it m ore suitable for capturing local structural featur es in medical images compared to fixed sinusoidal positional enc odings. The initial hidden sta te is generat ed through global average pooling followed by a linear mapping: the final enc oder feature m ap is first processed with adaptive average poolin g a nd flattened, then mapped to the hidden dimension via a linear lay er and activated with . The result is broadcasted to m atch th e spatial dimensio ns of the feature map, yield ing, as shown in Equation (3). (3) This contains global semantic information and effe ctively mitigates training instability ca used by random initialization. The T auFlow Interface outpu ts and , where serves a s the input to D ynamic Grouping to generate group m asks and grouped fea tures, while is reused as the initial sta te for each valid group w ithin the T auFlow Cell (Fig. 6). In th is way , the in terf ace orga nically combines positional infor mation and global context, providing the nece ssary inputs and initialization conditions for subsequent grouping and temporal modeling. 3.3 T auFlowSequence: Core of Dynamic Grouping and T e mporal Modeli ng T auFlowSequenc e is the core module connecting sp atial features with tem poral reasoning. Its computation is com posed sequentially of three submod ules: Dynamic Grouping (dynam ic grouping), T au-At tention (g roup-level -guided attention), a nd ConvL TC cells (the minimal unit for temporal modeling) . This module implements a closed-loop pr ocess f rom complexity-aware grouping to group-level weighting and per -group temporal evolution, balan cing efficiency and fine-detail representation und er constrained paramete r budgets. 3.3.1 Dy namic Grouping: Group Generation B ased on Complexity and T au Gradien ts Figure 4: Detailed Computation Flow of D ynamic Grouping Dynamic Grou ping (Fig. 4) starts from and first projects it to the hidden d imension via a convolution to compute the r aw time constants , as shown in Equation (4). (4) which are then passed through a Softplus activation and numerically clamped to ensu re positivity and stability , as shown in Equation (5). (5) These values n ot only contr ol the update rates in subsequent ConvL TC cells but also participate in the generation of group masks and atte ntion weighting. Complexity assess ment is per formed by com bining global statistics of (via glob al pooli ng) with ima ge edge densit y . This f used representation is input to a lightweight MLP to produce a n ormalized com plexity score , which is then mapped to the actu al number of groups . Subsequently , the Pat tern Generator , taking as input, passes through several convolution and norm alization layer s to generate the raw group mask , which is then normalized along the group dimension via Softmax to obtain the initial ma sk . T o enhance mask robustness, Dynamic Group ing employs a multi-step itera tive mechanism max_flow_steps : starting from the initial ma sk, a lightweight fast segmentation head q uickly evaluates the grouped features and adju sts split/m erge scales u sing Dice-based rewar ds. Simultaneously , the gradient of with respect to the in put is computed, and regions with large gradient magnitudes are treated as key areas in the dynamic feature set, guiding adjustments to the m ask weights. This results in the -gradient-adjusted d ynamic masks . Finally , the grouped feature tensor is formed by expanding a nd weighting according to the masks, as shown in Equation (6). (6) where only the first G groups are valid, and the remaining groups are ze roed out in subsequ ent processing. 3.3.2 T au-A ttention: Group-Level T au -Guided Lightweight Attention T au-At tention (Fig. 5) is a group-level lightweight attention mech anism designed to inject mask information an d statistics into the weighting of group ed features with minima l com putational cost. Unlike spatial self-attention, which computes large-scale dot products fo r each pixel, T au-At tention first performs spatial averaging within each group to obtain a low-di mensional representation, which is then used to comp ute gr oup-level similarit y and importance. Figure 5: Detailed Computation Flow of T au -Att ention Specifically , U is first projected via a convolution to obtain and , which are then spa tially average d to form vectors . Elemen t-wise multiplication followed by a linear ma pping produ ces the base scores. The pooled values of the masks and the mask-weighted means are th en added linearly to obt ain the final score s, which are pa ssed through a Sigmoid to generat e the group-level attention weights, as shown in Equation (7). (7) Finally , th e weights are broadcast and app lied to (U) to o btain the weighted group representation . This design allows multiple groups to be emp hasized simultaneously (non -competitive normalization) and uses learnable p arameter s to balanc e contributions from the QK interaction, mask, and . T au-At tention coup les group -level spatial importance with -driven tem poral sensitivity at minimal com putational cost, providing more discrim inative in puts for the per-group temporal modeling in ConvL T C. 3.3.3 T auFlow Cell and Per-Grou p T emporal Ev olution Figure 6: Detailed Computation Flow of T auFlow Cell Figure 6 illustrat es th e detailed computation of the T auFlow Cell. The module takes the group-weighted featur e and the initial state as inpu ts. First, a convolution ( cell) c omputes the group-specific time c onstan t , which is then passed through S oftplus and clamped to main tain a s table temporal rang e. Next, the hidden sta te s a nd the group inpu t are separately map ped via depthwise separable convolution and pointwise convolution branches , an d their outputs are summ ed and passed through a to form the nonlinear mapping . This mapp ing captures the dynamic interactions within the group featur es and provides the driv ing force for subsequent stat e evolution, as shown in Equation (8). Based on this, the Flow Cell updates the stat e acco rding to the firs t-order differential equation using exp licit Eule r discretiza tion to obtain . The updated featur es are then passed through GroupNorm and a convolution to p roduce the group output , which is subsequently reduced in dimension via O utProj and f used with mask weighting to generate the final combined feature. The overall process corresponds to four stages f rom top to bottom-time constant estimation → (8) stat e mapping → temp oral upda te → ou tput fusion -realizing parallel per-group temporal modeling and in formation i ntegra tion within dynamic groups. 3.3.4 STDP-Enhan ced Routing Consistency Regu larization T o enforce temporal causality within groups and improve th e stability of dynamic masks, we design an STDP-enhanced routing consistency regularization in T auFlowSeque nce. L et the input of the -th group at time be and the corresponding hid den stat e be . We use a differentiable event app ro ximation to model the “ activation occ urrence ” as a binarized pr ocess. Specifically , a Sigmoid with te mperature (\kappa) appro ximates thresholding on normalized activations, as shown in Equation (9, 10). (9) (10) where are channel-normal ized values, and are learnable or fixed thresholds. T o enco urage “ cause precedes effect, ” a first-or der temporal neighborhood asymmetric ke rnel is applied, which maximizes forward-causal co-activation and m inimizes backward-causal co-activation, resulting in the STDP regularization, as shown in Equation (11). (11 ) where controls the penalty streng th for backward causality . Consid ering the modulation of group dynamics by , the loss can be reweight ed by normalized time-constant weights to emphasize temporally sensitive regions, as shown in Equation (12). (12) T o further enforce superv isory consistency , a teacher-forcing term is introduced: the target mask downsampled to group-level resolution, , is combined with the postsynaptic event using a coefficient , as shown in Equation (13). (13) In practice, a single-step temporal n eighborhood is sufficient for stable gains. This STDP regularization is returned internally as ` ` during training to guide dyna mic grouping and per-group temporal upd ates tow ard causal c onsistency . It is disabled during infer ence, incurring no extra computational cost. 3.4 Lightweight Loss Functio n Design The loss function is designed to generat e optimization signals based on the grouped features learned at the T auFlowSequence stage. This section focuses on how to integrat e multi-scale supervisio n with regularization terms, which is cruc ial for enhancing model generalization. T o alleviate discrepancies when combining the main loss with auxiliary losses, we adopt a composite loss fun ction. It leverages the auxil iary head to g uide the enhan cement signa l for the main loss a long the scale dimension, thereby eliminating optimization inco nsistencies. Consequen tly , the comp osite loss can be regar ded as a signal calibrator , capable of automatically fusing multi- task guidance and ad aptable to scenarios prone to overfitting. From a mathematical p erspective, the loss f unction takes the m ain segmentation output and the au xiliary output as inputs. First, guided by , the main loss is com puted using a Dice-Focal mechanism, as shown in Equation (14). (14) The Dice loss (with for numerical stability) is defined as, as shown in Equation (15). (15) where is the binary targ et mask. The Fo cal loss (balancing po sitive/negative samples an d ha rd examples) is, as shown in Equation (16). (16) where , , and is the total number of pixels. On the basis of the ab ove composite loss, we further incorporate flow-smoothing and STDP regularization to enhance the spatial and temporal consistency of the dynamic masks. The flow-smoothing loss imposes an L1 constraint on the gradients of neighbori ng pixels in the mask, as shown in Equation (17). (17) where is the stacke d dynam ic groupin g mask. The STDP term uses the teacher-forcing version described in the previous section. Thus, the total loss is updated as, as shown in Equation (18). (18) where is th e auxiliary loss (Dice- Focal loss com puted on , weight 0.4) ; is th e complexity loss (MSE loss, weight 0.1); is th e diversity reward (group number diversity / 5, weight 0.05, subtracted from the total loss to encourage group ing diversity). (consistent with im plementation), (light weight, serving as brain-inspired regula rization). This combination injects spatiotempo ral priors at minim al cost while ensuring that the m ain supervision dom inates the optimization, thereby imp roving the interpretability and sta bility of dynamic routing. 3.5 T r aining Str at egy and I mplementation Deta ils The training stra tegy is designed to generate stable gradients based on the optim ization signa ls learned fr om th e loss function. T o ensure fair evaluation du ring test ing and repr oducibility, the following tra ining protocol and details are specified. From a mathematical p erspective, the strat egy takes a learn ing rate and batch size (B = 8) as inputs. First, the AdamW optimizer is used (weight decay with an ini tial learning ra te of , while a CosineAn nealingWarmResta rts schedu ler dynamically adju sts the learning rate. For data augm entation, during training, random horizont al a nd vertical flips are applied with probability 0.5, and input images are res ized using bilinear inter polation to . Du ring testing, only resizing is applied. T o ensure repr oducibility , the global random seed is fixed at 42. No strict upper limit is set for training epochs, but early stopping is applied if the Dice score d oes not improve within 200 epochs. Gradient accumulation is used to stabilize updat es, ultimately producing a stable tr ained model. T o support edge deployme nt, the model can be exported in ONNX format and optimized with post-training quantization (PT Q) for FP32 precision, with quantization error c ontr olled within 0.5% Dice. 4 Experiment s 4.1 Datase ts We selected four public d atasets to evaluate the p roposed model, i ncluding three medical im age segmentation datasets (GlaS, MoN uSeg, and Synapse) and one a utonomous d riving scene segmentation dataset (Cityscapes). GlaS dataset for gla nd segmentation The GlaS dataset (K. Sirinuk unwat tana et al., 201 6) was used in th e “ MICCA I 2015 Gland Segmentation Challenge ” on colorectal histology images. It contains 85 training sam ples and 80 test samples, all derived from 16 HE (Hematoxylin and Eosin) stained colorectal adenocarcinoma tissue slides. MoNuSeg dataset for nuclear segmentation The MoNuSeg dataset (N. Kumar et al., 2017) was used in the “ MIC CAI 2018 Multi-Organ Nuclei Segmentation Challenge. ” It includes 30 training im ages and 14 test images, with 21,623 manually annotat ed nu clei boundarie s in the training set. E ach im age was individually sampled from H E-stained whole slide ima ges from dif fer ent orga ns in the Cancer Genome Atlas (TC GA) database. Synapse dataset fo r multi-organ abdominal segmen tation The Synapse multi-organ segm entation dataset (B. A. Landma n et al., 2015) was used in the “ MICCAI 2015 Mult i-Atlas Abdomen Labeling Challe nge. ” It cont ains 12 training cases (2,2 11 axial CT slices) and 12 test cases (1,568 slices) , requiring segmen tation of 8 abdominal organs such as the aorta, gallbladder , and spleen. Cityscapes dataset for autonom ous driving scene segmentation The Cityscapes dataset (M. Cordts et al., 2016) is a widely used semantic segm entation benchmark in au tonomous driving, focusing on u rban str eet scene und erst anding. It includ es 5,000 finely annotat ed images ( 2,975 train, 500 validation, 1,525 test) and 20,000 c oarsely annotated images. Each image has a resolution of 2048 × 1024 and 19 semantic classes (e. g., road, sid ewalk, vehicle, pedestrian, traffic sign ), mainly captured from 50 cities in Ger many an d neighboring c ountries. During training, we ad opt five-fold cross-validation. At testing, the fina l per formance is reported as the aver age score across the five folds to ensure more stable and reliable experimental results. 4.2 Implementation a nd E v aluation Deta ils During training, the overall op timization objective follows the comp osite loss function defined in Section 3.4. The implementation environment and hardware configuration are as described in Section 3.5 and a re not repeated h ere. During validation and testing, the main evaluation metrics are Dice coef ficient (Dice), 95% Hausdorff distance (HD95), and Intersection over Union (IoU) . The calculation procedure for HD95 is as follows: 1. Normalize th e predicted and ground truth masks to the same spatial domain ; 2. Compute the bidirectional E uclidea n dis tances betw een the boundary point sets of the prediction and ground truth; 3. Remove isolat ed noise regions with an area smaller than 3 pixels; 4. T ake the 95t h percentile of the resulting distance distribution as the final HD95 value (in pixels). This metric reduces the influence of local outliers whi le more accurately measu ring boundary quality . T o improve robustness and statistical reliability , f ive-fold cross-validation is employed, r eporting both the mean and standard deviation (st d) across folds. All ba seline and c omparison m ethods are trained u nder the same data splits, training epochs, and optim ization strat egies to ensure fair and reproducible evaluation. T able 1: Summ ary of Key Module Configurations and Hyperparameters Module Details TauFlow Interface , Dynamic Grouping , , Tau-Attention TauFlow Cell , , , Optimizer Learning Rate Scheduler CosineAnnealing WarmRestarts ( T₀=10, ) Early Stoppin g , T o improve the r obustness and s tatis tical reliability of th e results, this s tudy employed five-fold cross-validation and reported the mean and standard deviation (std) of each fold. All performance comp arisons with the b aselines were tested for statistical significance using pa ired t-tests, with a significance level of α = 0.05 . Results with p < 0.05 are marked as *, and those with p < 0.01 are marked as **. Consi dering the issue o f multiple co mparisons, we applied Bonferr oni correction to th e main comparison methods in T ables 2-5 (corrected α ' = 0.05 /6 ≈ 0.0083). All compared methods we re evaluated un der the same data spl its, num ber of tr aining epochs, and optim ization str ategies to ensure fairness and repr oducibility of the results. 4.3 Comparison with Existing St ate-of-the-Art Methods T o validate the performanc e advantage s of T auFlow , we con ducted a comprehensive comparison with two categor ies of existing advanced segmen tation methods, including one CN N-based method (UNet++), one algorithm c ombining T r ansfo rmer and UNet (UDT ransNet), two Mamba-based methods (V M-UNet, MSVM-UNet), and two dynamic-operat or-based methods (D-Net, FusionMam ba). T o ensure a fair comparison, the origin al publicly released cod es a nd settings of these m ethods were directly used in this experiment. 4.3.1 Quantitative R esults The quantitative results on the four datasets are reported in T ables 2, 3, 4, and 5, with the best results highlighted in bold. The experimental results demo nstrate tha t our method consistently outperforms existing stat e-of-the-art app roaches. Compared with CNN-base d algorithms, T auFlow achieves cross-scale improvements in metrics such as Dice and IoU. Compared with T r ansformer -base d algorithms, T auFlow also shows performance gains, while significantly r educing both parameter count and infer ence latency . Over all, T auFlow can b e seen as striking a sub tle balance between CNN and T r ansfo rmer app roaches, leveraging its dynamic a rchitecture, refined dynam ic group ing, and per-group temporal modeling modu les to enhance featur e processing capabilities . Due to the inh erent locality of convolution operations, CNN-based methods generally strug gle to explicitly model long-range dependenc ies, whic h ca n limit segme ntation performance. T r ansfo rmer-based methods, on th e oth er hand, capture glob al context via self-attention mechanisms, of ten a chieving superior results on complex medical image segmentation tasks. However , T rans former a rchitectures usually come w ith higher computational and parameter costs. Mam ba-based methods show potential in sequence mod eling bu t still have room for improvement in multi-scale sp atial feature fusion. Dynamic- operat or-based methods attempt to increase model flexibility through adap tive mechanism s, but of ten lack fine-grained c omplexity evaluation and resource allocation. On the GlaS da taset (T able 2), T auFlow achieved a Dice score of 92.12% and an IoU of 85.3 9%, improving over the second-best m ethod UDT ransNet by 1.09% and 1.85%, respectively . Meanwhile, the HD95 metric d ropped to 5.38 mm, indicating that T auFlow provides more precise boundary d elineation in glan d segmentation tasks. T able 2: Quantitative R esults on the GlaS Dataset Model Dice (%) HD95 (mm) IoU (%) p-value† UNet++ 87.56 ± 1.17 12.72 ± 1.76 78.39 - UDTransNet 91.03 ± 0.56 6.71 ± 0.99 83.54 0.0231 VM-UNet 89.35 ± 0.68 7.45 ± 1.12 80.75 0.0089 MSVM-UNet 90.12 ± 0.51 6.95 ± 0.88 82.02 0.0124 D-Net 89.72 ± 0.83 7.84 ± 1.24 81.36 0.0156 FusionMamba 86.45 ± 1.94 13.21 ± 2.05 76.13 0.0043 TauFlow (Ours) 92.12 ± 0.42** 5.38 ± 0.76** 85.39** - † : p-value from paired t-test compa ring T auFlow vs. each b aseline (n=5 folds). **: Significant improvement over all baselines (p<0.01 , Bonferroni-corrected α '=0.0083). Bold: Best performance. On the MoNuSeg dataset (T able 3), T auFlow a chieved a Dice score of 80.97% and an IoU of 68.16%, improving over the second-best method MSVM-UNet by 1.11% and 1.69%, respectively . The HD95 m etric reached 1.95 mm, signi ficantly outperforming a ll compared methods. T his indicates th at T auFlow has a c lear ad vanta ge in tasks requiring precise bounda ry localization, such as nucleus segme ntation. T able 3: Quantitative R esults on the MoNuSeg Dataset Model Dice (%) HD95 (mm) IoU (%) UNet++ 77.01 ± 2.10 4.18 ± 1.29 62.61 UDTransNet 79.47 ± 0.80 2.73 ± 0.64 65.93 VM-UNet 78.92 ± 0.72 2.95 ± 0.58 65.18 MSVM-UNet 79.86 ± 0.69 2.58 ± 0.49 66.47 D-Net 79.41 ± 0.95 2.84 ± 0.61 65.85 FusionMamba 76.25 ± 1.84 4.62 ± 1.48 61.62 TauFlow (Ours) 80.97 ± 0.67** 1.95 ± 0.28** 68.16** On th e Synapse multi-organ segmentation da taset (T able 4) , T auFlow demonstrated even more significant performance gains. The av erage Dice score reached 90.85% , improving over the second-best method UDT r ansNet by 1.57 percentage points, while the HD95 metric decreased to 16.41 mm, a 33.4% reduction compared to UDT r ansNet . Examinin g per-organ results, T auFlow achieved competitive performance o n 7 out of 8 organs, with particular ly strong results on morpholog ically complex and boundary-a mbiguous organs such as the right ki dney (RK), liver (LI), pancreas (P A ), and splee n (SP). Notably , in the most challenging pancreas segme ntation task, T auFlow reached a Di ce score of 88.1% , improving over UDT r ansNet a nd MSVM-UN et b y 2.3 percentage p oints each. This strongly validates the effective ness of the dynamic groupin g mechanism a nd temporal modeling modu les in handling com plex anatomical structures. T able 4: Quantitative R esults on the Synapse Dataset Abbreviations in the tables: AO (Aorta), GA (Gallbladder), LK (Left Kidney), RK (Right Kidney), LI (Liver), P A (Pan creas), SP (Spleen), ST (Stomach). Furtherm ore, T able 5 p resents the quantita tive re sults on the Citys capes dataset based on semantic grouping. For readability , the 19 c lasses are grouped into f ive semantic categories according to the official definition: Flat (Road, Sidewalk), Construction (Building , Wall, Fence, V eget ation, T errain, Sky), Object (Pole, Tr af fic Light, T r af fic Sign), Human (Person, Rider), and V ehicle (Car , T ruck, Bus, Tr a in, Motorcycle, Bicycle). The results show that T auFlow a chieves the b est performance a cross all groups, with an overall mIoU of 79.2 6%, improving 2.12 p ercentag e poin ts over the second- best method UDTr ansNet (77.14%), while the averag e Dice score reach es 87.69 %. The advantages are particularly notable in the Ob ject (+1.49 %), Vehicle (+ 1.62%), an d Construction (+1.16%) groups, demonstrating that T auFlow's dynamic g rouping and temporal modeling mechanisms are highly effectiv e in capturing complex structural re lationships, small ob jects, and instances with m otion blur . Methods Average DSC HD95 AO GA LK RK LI PA SP ST U-Net++ 86.11±1.08 35.88± 2.15 88.4 76.6 88.1 88.1 89.6 85.6 90.4 87.1 UDTransNet 89.28±0.58 24.63± 1.69 89.4 82.6 92.8 ** 92.1 91.2 85.8 93.3 87.1 VM-UNet 87.05±0.61 20.64± 1.58 90.1 78.9 90.5 90.3 90.8 78.1 91.1 86.6 MSVM-UNet 88.72±0.42 19.07± 1.61 88.3 84.3 89.1 91.4 90.2 85.8 93.1 87.6 D-Net 88.46±1.21 26.49± 2.85 94.3 ** 82.6 88.9 92 90.6 85.9 93.5 88 Fusion Mamba 85.02±2.34 37.04± 2.52 87.9 79.1 86.2 85.4 84.2 86.4 89.4 84.8 TauFlow (Our) 90.85± 0.37** 16.41± 1.04** 93.2 84.5 ** 91.5 93.9 ** 92.7 ** 88.1 ** 93.9 ** 89** Method Average DSC mIoU Flat Constructio n Object Human Vehicle T able 5: Segmentation performance comparison on the Cityscapes dataset grouped by semantic categories. Although T auFlow was originally designe d for medica l image segmentation, the experimental results indicate that the mod el a lso performs rema rkably well on complex urban street scene semantic segmen tation tasks. This demonstrates that T auFlow's core mech anisms-dynam ic grouping and T au-Att ention temporal m odeling-p ossess strong c ross-domain generalization capabilities. Overall, T auFlow ach ieves superior performance with the lowest parameter count among all compared methods, main ly due to two factors: first, the dynamic grouping modu le adaptively allocates computational resources ba sed on image comp lexity; second , the temporal modeling enabled by T au-Att ention and T auFlow Cell effe ctively captures c ross-region dynamic dependencie s, achieving a synergistic optimization of spatial and temporal information. UNet++ 84.09 73.37 91.67 78.84 80.9 83.65 85.39 UDTransNet 86.6 77.14 92.86 81.44 83.79 87.15 87.74 VM-UNet 85.49 75.59 92.35 80.43 82.72 85.05 86.9 MSVM-UNet 86.04 76.38 92.72 80.97 83.39 85.76 87.35 D-Net 85.79 76.08 92.55 80.7 83.04 85.45 87.23 FusionMamba 82.45 70.89 90.98 77.13 78.79 81.67 83.69 TauFlow (Ours) 87.69** 79.26** 93.77** 82.6** 85.28** 87.42** 89.36** 4.3.2 Qualitative R esults Figure 7: MoNuSeg qualitative results, with red circles highlighting regions where T auFlow outperforms other method s. Figure 7 presents visual c omparisons of segmentation results across the MoN uSeg, Synapse, and GlaS datasets using diffe rent method s. The results indicate that our model demonstra tes clear advantage s in handlin g multi-scale targets, complex boundaries, and detail preservation. On the MoNu Seg dataset (rows 1-3), T auFlow performs exce ptionally well in cell nu cleus segmentation tasks. The first row shows nuclei with in glandula r tissue , where T auFlow accurately segments densely pa cked nuclei with notice ably clearer bounda ries compared to other methods. The second row depicts a m ore densely distributed n uclei scenario, in which methods lik e At tn-UNet and MultiResUNet suffer from obviou s over-segm entation or under-se gmentation. T auFlow , leveraging its dynamic grouping mechanism, adap tively adjusts the receptive field to correctly distinguish adjac ent nuclei. The third row illustrates morphologically diverse nuclei, where T a uFlow maintains strong segmentation integrity for irregularly shaped nuclei, benefiting from the T au-Att ention mechanism tha t eff ectively integr ates local a nd global features. On the Synapse dataset (rows 4-5), the multi-organ segme ntation task imposes higher d emands on the model 's multi-sca le representa tion capab ility . In the f ourth row , the axial slice h ighlights small structures su ch as the a orta (orange region), which are partially missed or incompletely segmented by method s like Attn-UNet, M ultiResUNet, and T r ansUNet. In c ontrast, T auFlow , leveraging the T auFlowSequen ce module for dynamic groupin g and multi-m odal fusion, accurately ca ptures these fine-scale anatomical structures. This dem onstrat es that propaga ting rich spatial a nd temporal information from the encoder to the decoder , combined with low-level spatial details, h elps the model id entify finer target s. The fif th row shows sagittal mu lti-organ segmentation, including organs such as the liver , spleen, and stomach. UDT r ansNet and D-Net can segment the major organs but p roduce jagged artifacts alon g o rgan boundaries and fail to clearly delineate the s tomach (yellow region) an d spleen (cyan-green r egion). In comparison, T auFlow produces segme ntation masks tha t closely match the ground truth, with smooth and natural organ bo undaries an d clearly defined separations between organs. This advanta ge stems from the temporal modeling capability of dynamic grouping , which preserves spa tial coherence du ring featur e propagation and mitigates information loss commonly observed in conventional encoder-decoder upsa mpling pipeline s. On the GlaS dataset (rows 6 -7), the gla nd segme ntation task requires the model to p recisely capture com plex gland structures an d bounda ries. In these rows, it is clearly observed that U-Net++, VM-UNet , and D- Net exhib it significa nt merging artifacts a long the thin gland branches, with U-Net++ additionally producing some false-positive regions. Although U DT r ansNet shows improvements, gland edges remain somewhat blurred. In contrast, T a uFlow fully preserve s the gland topology , accurately handles the connectivity of fine branches, demonstrating the effe ctiveness of the dynamic groupin g module in adaptively a llocating computational resources across regions of varying complexity . T a uFlow su ccessfully segments each individual gland unit, with precise boun dary localization and intact internal segmen tation, free of holes or fragmentation. Overall, T auFlow exhibits su perior segm entation performance across diverse and challenging scenarios, particularly excelling in small-obj ect detection, precise bound ary localization, multi-scale target handli ng, a nd d ense-object differ entiation. These qualitative results corroborat e the previously reported quantitative experiments, f ully validating the effectiveness of the proposed dynamic groupin g mechanism a nd temporal modeling modu les. 4.3.3 Computation al Complexity For the vast majority of T ransf ormer-based segme ntation method s, high computational complexity is a comm on drawback. However , a s shown in T able 6, our model demonstrates superior p arameter efficiency compa red to existing stat e-of-the-art T r ansfo rmer-based segmentation methods. Notab ly , T auFlow has a parameter c ount of only approximat ely 1/ 100 that of ad vanced segmentation methods such a s UDT r ansNet, VMUNet, and MSVMUNet, while achieving signific ant per formance improvements. It is particularly emph asized that the lightweight nature of T auFlow is f ully achieved through the innovative mechanism of ‘ dynamic grouping + temporal reuse', without emp loying any distillation techniq ues (none of the compared models use d istillation eithe r). As sh own in T able 7, its parameter cou nt is only 0.33 M, which is 1% of UDT r ansNet (33.90 M), 0.7% of VMUNet (44.27 M), 0.6% of MSVMUNet (54. 69M), and even lower than the CNN-ba sed U-Net++ (9.16M). It is the only method am ong current SOT A models of its kind that achieves ‘ 0.33M parameters + cross-dataset SOT A' without distillation. T able 6: Compariso n T able of Computational Complexity Model Parameters FLOPs U-Net++ 9.16M 26.72G UDTransNet 33.90M 26.51G VMUNet 44.27M 7.56G MSVMUNet 54.69M 22.86G D-Net 45.90M 39.80G TauFlow(Max Group=5) 0.33M 4.19G TauFlow(Max Group=7) 0.45M 6.60G It should be noted tha t U DT ransNet integra tes the m ulti-head self- attention (MSA) m odule into the encoder-deco der framework, whereas T auFlow incorporates a meticul ously designe d dynamic temporal module into the connection part (between the encod er and decoder)-this differ ence highlights th e importance of the design ap proach an d deployment position of the temporal modeling module in the U- shaped a rchitecture. Additionally , the paramet er count of T auFlow is even lower than that of some CNN-based methods (e .g., U-Net++ ); compared to our previous m odels, T auFlow significantly reduces the parameter co unt while substantially improving segmentation performance. Except for th e original U-Net and th e ResNet34-based U-Net, wh ich have fewer parameters (but limited performance), ou r T auFlow ach ieves superior segmentation per formance while featuring fewe r parameters, lower floating-p oint operations (FL OPS), and shorter in fer ence time. Considering its excellent segmentation performance and reduced parameter c onsumption, it can be concluded that our m odel achieves a favor able balance between effectiveness a nd efficiency . 4.4 Ablation Experiments T o validate the i mpact of each co mponent and hyperparameter of T auFlow on performance, we conducted sys tematic ablation experiments on the GlaS dataset. All results a re reported as m ean ± standard deviation based on five-fold cross-validation. The primar y evaluation m etrics are the Dice coefficient and 95% Hausdor ff distance (HD95), wit h model p arameter count (P arameter s, M) and com putational complexity (FLOPs, G) recorded to assess cost-eff ectiveness. The Baseline is a standard Encoder -Decoder a rchitecture (without the T auFlowSequ ence mecha nism). 4.4.1 Ablation Re sults Distin guishing the Presence o r Absence of the STDP M echanism T o validate the independe nt contribution of each proposed module, we conduct a component-wise ab lation study on the GlaS d ataset. Starting from a standard Encoder-Decoder baseline, we progressively add: (1) ConvL TC cells for dynamic -driven fe ature update, ( 2) T au-At tention for group- level weighting, (3) Dynamic Grouping for complexity-adaptive computation, and (4) STDP mechanism for temporal consistency regularization. All experiments use base_ channel=32, max_groups=5, and the sam e tr aining p rotocol (Section 3.5). Results are aver aged over 5- fold cross-validation with statistic al significa nce marked (* p<0.05, ** p <0.01 via paired t-test against baseline). 表 7-A: Compone nt-wise Ablation Stud y on GlaS Dataset ✓ : Module enab led; ✗ : Module di sabled. *: p<0.05, **: p<0.0 1 (paired t-test vs. baseline, n=5 folds). Δ Dice from baseline: +1.98% (ConvL TC), +3.11% (T au-Attn), +3.81% (Grouping), +4.56% (Full) . STDP adds negligible parameter s but improves Dice by +0.75% through temporal regularization. As sh own in T able 7-A, s tarting fr om a standard Encoder-Decoder , replacing cells with ConvL TC (+dynamic ) improves Dice f rom 87.56 % to 89.54% and reduces HD95 from 12.72 mm to 10.38 mm, high lighting the benefit of temporal modeling. Adding T au-A tt ention raises D ice to 90.6 7%, showing -guided feature weighting reduces redundancy . Dynami c Grouping f urther boosts Dice to 91.37% and IoU to 84.21 % with only 0.05 M more pa ramete rs, validating adap tive computation in complex regions. Finally , STDP regularization increases Dice to 92.1 2% and lowers HD95 to 5.38 m m witho ut extra parameters, enhanc ing in ter-group synergy via causality-consistent training. All gains are stat istically sign ificant (p < 0.05). ConvL TC alone gives +1.98% Dice, wh ile the full T auFlow a ccumulates +4.56%, demonstrating str ong module synergy . T able 7: T auFlow Ablation Results on the GlaS Dataset (without STDP Mechanis m) Base Channel Max Groups Dice (%) HD95 (mm) Parameters (M) FLOPs (G) Baseline - 87.56 ± 1.17 12.72 ± 1.76 0.23 4.08 16 3 89.14 ± 0.71 9.85 ± 1.32 0.11 1.06 5 89.78 ± 0.63 8.96 ± 1.19 0.15 1.08 7 90.03 ± 0.59 8.67 ± 1.14 0.2 1.1 13 89.91 ± 0.66 8.79 ± 1.21 0.39 1.15 32 3 90.42 ± 0.55 7.58 ± 0.98 0.28 4.13 5 91.37 ± 0.48 6.44 ± 0.89 0.33 4.19 7 91.59 ± 0.45 6.21 ± 0.85 0.39 4.25 13 91.46 ± 0.51 6.33 ± 0.91 0.62 4.43 64 3 91.03 ± 0.52 6.88 ± 0.93 0.87 16.33 5 91.82 ± 0.41 5.94 ± 0.79 0.94 16.54 7 91.95 ± 0.39 5.77 ± 0.76 1.03 16.74 13 91.87 ± 0.43 5.83 ± 0.81 1.35 17.37 128 3 91.44 ± 0.47 6.35 ± 0.87 0.87 16.33 5 91.97 ± 0.38 5.62 ± 0.73 0.94 16.54 7 92.05 ± 0.36 5.51 ± 0.71 1.03 16.74 13 91.99 ± 0.40 5.57 ± 0.75 1.35 17.37 T able 7 presents the p erformance of T auFlow under different base channel se ttings (16, 32, 64, 128) an d max groups (3, 5, 7, 13) (without introducing the STDP mecha nism). As the nu mber of max gr oups increases, model perf ormance first im proves and then declines, with 5 groups achieving a favor able trade-off . Under the sam e max groups configuration, the model with b ase channel=32 demonstra tes the best cost-effective ness between performance a nd compu tational overhead. O verall, T auFlow significantly outp erforms the Basel ine a cross all configurations, with improved Dice scores an d substantially reduced HD95, indicating that T a uFlow markedly enhances feature extr action and segmentation accuracy . T able 8: T auFlow Ablation Results on the GlaS Dataset (with STDP Mechanism ) Base Channel Max Groups Dice (%) HD95 (mm) Parameter s (M) FLOPs (G) 16 3 89.92 ± 0.65 8.72 ± 1.21 0.11 1.06 5 90.46 ± 0.57 7.89 ± 1.10 0.15 1.08 7 90.71 ± 0.53 7.64 ± 1.05 0.2 1.1 13 90.59 ± 0.60 7.76 ± 1.12 0.39 1.15 32 3 91.10 ± 0.49 6.65 ± 0.89 0.28 4.13 5 92.12 ± 0.42 5.38 ± 0.76 0.33 4.19 7 92.07 ± 0.39 5.88 ± 0.76 0.39 4.25 13 92.14 ± 0.45 6.40 ± 0.82 0.62 4.43 64 3 91.71 ± 0.46 5.95 ± 0.84 0.87 16.33 5 91.50 ± 0.35 5.98 ± 0.70 0.94 16.54 7 92.03 ± 0.33 5.84 ± 0.67 1.03 16.74 13 91.55 ± 0.37 5.90 ± 0.72 1.35 17.37 128 3 90.12 ± 0.41 5.42 ± 0.78 0.87 16.33 5 90.65 ± 0.32 5.69 ± 0.64 0.94 16.54 7 91.23 ± 0.30 5.58 ± 0.62 1.03 16.74 13 91.67 ± 0.34 5.64 ± 0.66 1.35 17.37 T able 8 presents the exp erimental results after introducing the inter-group STDP mecha nism under the sam e configurations. With the incorporation o f STDP , performance consistently improves across nearly a ll configurations, with an averag e Dice increase of approx imately 0.68% and an average HD95 reduc tion of about 0.93 mm. Notab ly , the u se of STDP introduces virtuall y no additional paramete rs or computational overhead, ma intaining the system's high ef ficiency . Among all configurations, th e optimal setting of base channel=32 and max groups=5 achieves a Dice of 92.12% and HD95 of 5.38 mm, validating the critical role of STDP in enha ncing i nter-group competition and feature disc riminabilit y . 4.4.2 Qualitative Analysis of the Dynamic Grouping Module For this model, the d ynamic groupin g a pproach is undoubtedly the c ore of the entire in novation. T o ensure that dynamic grouping is indeed eff ective an d not compensated by other b ypass mechanisms, qua litative analysis is c onducted through visualization of this module, as shown in Figure 8. Figure 8 Visualization of the Dynamic Grouping Process Here, an image from the GlaS dataset is sele cted for illustr ation. This image feature s c lear glandular structures but exhibits slight blurri ng and noise interf erence in the edge regions, representing a typical challenging scenar io. As shown in Figu re 8, from left to right, the original image (Original ), Ground T ruth (Label), model-predicted mask (Prediction), an d the core intermediate processes of the dynam ic grouping module are seque ntially disp layed, inc luding the group mask (Group Mask ) a nd mappin g results (Mapping). First, in the original image, the gland s appear in a purplish-red tone with int act internal cavity structures, but the loc al boundarie s exhibit noticeable gradients, making them susc eptible to noise in terf erence. The corresponding Ground T ruth (Label) marks the glandular core regions with white irregular elliptical areas, precisely deline ating the target inst ances. The dynamic grouping m odule first generates group masks (Group Mask) based on the predicted mask (or initial segmen tation result), as sh own by the four sm all b locks in th e figure (in practice, 7 Group Masks are produced, with only 5 used for illustration). These ma sks employ a dark brown speckle pattern to d ynamically cluster adjac ent pixels into multipl e compact groups (clusters) based on spatia l neighborhood, simila rity in intensity , and texture features, effe ctively separating independent glandular instances while suppressing noise points (such as isolated speckles). The key to this step lies in adaptive thresholding and graph-cut optimization, ensuring that group boundaries are high ly consistent with th e glandu lar anatomical structures and avoiding the over-segmentation or under-se gmentation issues common in traditional fixed-threshold methods. As shown in the map ping (Mapping) stage, this phase projects the group masks back into the original prediction spa ce, gener ating a color overlay map (yellow indicating high-confidence groups, purple indi cating transition zones). It ca n be observed that the mapping results accurately capture the continuity of glandular cavities an d wall thickness, with noise regions effe ctively filtered out, retaining on ly structured groups. Ultimately , the predicted mask ( Prediction) after int egrating the mapping constr aints highly overlaps with the Groun d T ruth, with a signif icantly impro ved IoU metric (reaching 0.92 in this example), smooth boundaries, a nd complete instance separation. Through this visualization, the independe nt contribution of the dynamic grouping module can be intuitively verified: if this module is re moved and the original prediction is used d irectly , noise diffusion woul d lead to in stance adh esion; a fter grouping, the model output robustness is greatly enhanced, proving that this innovat ion is not compensated by other compon ent bypasses and genuinely improves the accuracy and generalization ability of glandular instance segmentation. 4.5 Edg e Device Deplo yment and Real-W orld V alid ation T o evaluate T auFlow's deployability and practical value in resource-constr ained environments, we conducted comprehensive edge tests on three represent ative embedde d platforms, covering low-, m id-, and high-com pute scenarios. The experiments m easured infer ence speed (FPS), p eak memory usage, and energy consumption per inf ere nce (mJ) on the GlaS dataset for gland segmentation (inp ut resolution 224 × 224), simula ting real-time applications on mob ile medical devices such as portabl e pa thology a nalyzer s. All tests used the same batch size (1) a nd optimization settings to ensure comparability . We selected three system-on-chip (SoC ) p latforms for evaluation: the entry-level Allwinn er H61 8 (quad-core ARM Cortex-A53, up to 1.5 GHz, Mali- G31 MP2 GPU, no NPU, 2 GB RAM) representing ultra-low-power scena rios su ch as hand held dia gnostic device s; th e m id-range Rockchip RK3 399 (dual-core Cortex-A72 + qua d-core A53, up to 2.0 GHz, 4 GB RAM) simulating mod erat e-compute edge devices like embedd ed medic al imaging box es; and the high-end Rockc hip RK3588 (quad-core Cortex-A76 + quad-core A55 , up to 2.4 GHz, 6 TOPS NPU, 8 GB RAM) targe ting high-performanc e edge applications such as surgical naviga tion syste ms. The deployme nt workflow involved exporting T auFlow f rom PyT orch to ONNX (ONNX Run time 1.15) and applying FP 32 post-training quantization ( PTQ), ke eping Dice loss degradation und er 0.5%. Runtime in tegra tion used RKNN T oolkit for Rockchip and T engine for Allwi nner , including NPU delegation a nd mem ory reuse optimizations. For each p latform, we performed 1000 infer ence runs, recording averag e FPS, p eak m emory (via htop), and energy con sumption (measured with an external power meter). Inputs were randomly generated synthetic medical images to simulate real deployment conditions. Fig. 9: Edge D evice D eployments and Real-Time Segmen tation Demonstrations of T auFlow . (a) Allwinner H61 8 development board, (b) Rockchip RK3399 development board, (c) Rockchip RK3588 edge box, (d) Hikvision 2K autofocus lens, (e) 300,000-pixel OV7670 cam era module, (f) Real-time segmentation on GlaS dataset, (g) Real-time segmentation on Synapse dataset, (h) Real-time segmentation on MoNuSeg dataset. T able 9 su mmarizes model complexity , infe re nce performance, and efficiency acr oss platforms, comparing q uantized FP32 versions of T auFlow , UNet+ +, and UDT ransNet. T auFlow achieves h igh accuracy with only 0.33 M p arameter s and 4.19 G FLOPs, outperforming the baselines. On the H618, T auFlow reaches 1.14 FPS on CPU with 877 ms late ncy and 2.19 J/image, fa r exc eeding UNet++ (0.34 FPS, 2,941 ms, 7.35 J) and UDT r ansNet (0.22 FPS, 4, 545 ms, 11.36 J). On RK339 9, it achieves 1.97 FPS, 44 ms la tency , and 0.44 J/image, demonstrat ing mid-range acceleration potential. On RK3 588, leveraging the NPU, T auFlow attains 27.79 FPS, 36 ms la tency , and 0.36 J/image-approximately 5-6 × fast er than UNet++ (4.12 FPS, 243 m s, 2.43 J) and UDT r ansNet (5.4 FPS, 185 ms, 1.85 J). This efficienc y stems from T auFlow's dynamic grouping, activating extra computation only in high -complexity regions (average 3.2 gr oups), reducing ~40% o f redundant operations. Power consumption is constant across platforms (H618 2.5 W , RK339 9 10 W , RK358 8 10 W), and memory usage is lower for T auFlow (435 MB) compared with baseline s (1,15 6-2,210 MB). T able 9 Comparison o f Model Inference Speed and Energy Consumptio n on the RK358 8 Platform In practical tests on the RK358 8, we connected a Hikvision 2K auto- focus camera and an OV7670 module ( 224 × 224 resolution, 30 FPS). T auFlow a chieved end-to- end gla nd segme ntation with latency under 100 ms, an d a Di ce score comparable to d esktop performance (91.8%). In real-tim e demonstrations on the GlaS, Synapse, a nd MoNuSeg d atasets, T auFlow m aintained precise boundaries a nd low-no ise outp uts, demon stra ting robu stness on live video streams. Thes e results indicate that T auFlow not only ex cels on laboratory benchmarks bu t also shows strong potential for edge deployment, meeting efficiency requirements for m obile medic al applications in emergency scenarios or resource-limited regions. 5. Conclusio n and Fut ure Works 5.1 Summary of Ke y Findin gs This work addresses two key b ottlenecks in lightweight medical image segmentation: “ insufficient speed ad aptation due to sta tic processing ” and “ inconsisten t modal fu sion under lightweight constraints, ” respond ing to the demand for efficient edge intelligence in precision healthcare. We propose T auFlow , which innovatively integrates ConvL TC cells to adaptively control pixe l-level feature up date rates via the tim e con stan t , enabling rapid response to high-frequency b oundaries while smooth ing low-frequency background noise. The T auFlowSequenc e modu le further combines -gradient-based dynamic grouping with STDP-inspired feature self-organization, reduc ing encoder-de coder featur e conflict from 35-40% to 8-10% while keeping paramete rs strictly at 0.33 M. On GlaS gland segm entation, T auFlow achieves 92.12% Dice, 85 .39% IoU, and 5.38 mm HD95, improving Dice by 1.09% over UDT r ansNet; on MoNuS eg nuc lear segme ntation, 80.97% Dice, 68.16% I oU, and 1.95 mm HD95, surpass ing MSVM- UNet b y 1.11% Dice; on Synapse m ulti-organ segmentation, an a ver age of 90.8 5% D ice and 16.4 1 mm HD95 , with notable gains on complex organs like the pancreas (88 .1% Dice, +2.3 %). E ven on the n on-medical Cityscapes d ataset, T auFlow maintains superior b oundary ha ndling. In terms of compu tational efficiency , T auFlow requires only 4.19 GFLOPs, achieving a fav orable parameter-FL OPs tradeoff com pared with all baselin es. Edge deployment on the RK358 8 demonstrates real-tim e infer ence at 27.79 FPS, low energy consu mption of 0.36 J/image, and memory usage of ju st 435 MB, c onfirming its su itability for portable medical dev ices. Ablation studies show that dynamic grouping improves Dic e by ~2.56%, and STDP adds another 0.68%, validating the effe ctiveness of -STDP-driven dynamic m odeling an d fusion. 5.2. Limitations Despite its excellent p erformance, T auFlow still has several limitations. First, the model is based on 2D ima ge processing, mak ing it difficu lt to directly extend to 3D volumetric data (such as CT/ MRI). Alth ough it can c apture dynamics within a slice, it ignores inter-slice d ependencie s, which m ay limit its performance in time-serie s t asks like cardiac MRI. Second, th e experiments covered pathology , CT , a nd u rban scenes, but did no t test emerging m odalities like PET or ultrasound. Its robu stness to their specific noise or deformation (such as the cardiac dynamics mentioned in the introduction) remains to be validated. Third, while dynamic grouping is efficient, the inference time fluctuates slightly with image complexity (a m aximum of 5 groups increases parameters by ), which could am plify latency on ultra-low-power devices. Furthe rmore, STDP only ad justs in the forwar d direction, no t fully sim ulating bidirectional synaptic pla sticity , resulting in slightly lower c onve rgence stability in extreme noisy data (H D95 standard deviation on GlaS). 5.3. Future Enh anceme nts and Research D irections Building on the foundation of T auFlow , future work can expand in multiple dimensions to overcome limitations and deepen its impact. The primary improvement is 3D extension, introducing inter -slice propagation or hybrid Mam ba-ConvL T C blocks, aiming to reach the 85.5% Dice b enchmark on the ACDC dataset while k eeping parameter s under . Multi-modal fusion is also a key focus, utilizin g -guided attention to align CT/ MRI and PET features, reducin g heterogeneous modality conflicts. T o mitigate infe rence fluctuations, h ardware -aware adaptive group prunin g o r qua ntization-aware training can be introduced, with further attempts a t quantization and deployment to edge devices like th e ESP32-S3. Research directions include: Unsu pervise d pre-training combined with STDP to achieve few-shot domain a daptation for segm enting rare lesions; exploring a hybrid paradigm of L T C and reservoir computing to enhanc e long-range modeling for ultra-large pathologic al sli des; and validating real-world efficacy through federa ted learning and prospective clinical trials , ensuring T auFlow aligns with the vision of a sca lable CAD sy stem m entioned in the introduction. Ultimately , T auFlow can evolve into a general lightweight d ynamic-aware framework, facilitat ing real-time precise diagnosis in m obile healthcare and intraoperativ e naviga tion. Reference [1] Litjens G, Kooi T , Bejnordi BE, Setio AAA , Ciompi F , Ghafoorian M, van der Laak JA WM, van Ginneken B, Sánchez CI. A sur vey on deep learning in medical image analysis. Med Image Anal. 201 7 D ec;42:60-88 . doi : 10.1016 /j.media.20 17.07.0 05. Epub 2017 J ul 26. PMID : 28778026. [2] Ronneberger O, Fischer P , Brox T . U-Net: Convolutional Networks for Biome dical Im age Segmentation[C]//Medica l Ima ge Computin g and Computer-Assisted Intervention (MIC CAI). Springer , Cham, 2015: 234 -241. [3] Zhou Z, Sidd iquee M MR, T ajbakhsh N , et al. UNet++: A Nested U-Net Architecture for Medical Image S egmentation[J]. IEEE T r ansactions on Me dical Ima ging, 2020, 39(6): 1856-1867. [4] Ibtehaz N, Rahman MSA . Multi ResUNet: Rethinking the U-Net A rchitecture for Multim odal Biomedical Image Segme ntation[J]. Neural Networks, 2020, 121: 74-87. [5] Peng Y , Sonka M, Chen DZ. U- Net v2: Rethinking the Skip Conne ctions o f U-Net for Medica l Image Segmentation[J]. arXiv preprint arXiv:2311.17791 , 2023. [6] H. Kwon et al., "HARD: Hardware-awar e lightweight real-tim e semantic segmentation model deployable," Proc. ACCV , 2024. [7] Li X, Zhu W , Dong X, et al. EViT-Un et: U-Net Like E fficient Vision T r ansfo rmer for Medical Image Segmentation on Mobile an d Edge Devices[J]. arXiv preprint arXiv:2410 .15036, 2024. [8] Wu R, Liu Y , Jiang Z, et al. UltraLight VM-UNet: Parallel Vision Mamba Sign ificantly Re duces Para meters for Skin Lesion Segme ntation[J]. Pat terns, 2025, 6(6): 1009 74. [9] Y uan Y , Cheng Y . Medical image segme ntation with UNet-base d m ulti-scale conte xt fu sion[J]. Scientific Reports, 2024, 14: 15687. [10] Don g W , Du B, Xu Y . Shape-intensity-guided U-net for medical im age segmentation[J]. Neurocomputing , 2024, 610: 1285 34. [11] Ruan J, Li J, Xian g S, et al. VM-U Net: Vision Mamba UNet for Med ical Image Segme ntation[J]. arXiv preprint arXiv:2402. 02491, 202 4. [12] Wang Z, Zheng J- Q , Zh ang Y , et al. Mamba-UNet: UNet-Like Pure Visual Mamba for Medical Image Segmentation[J]. arXiv preprint arXiv:2402.05079 , 2024. [13] Chen C, Yu L, Min S, Wa ng S, et al. MSVM-UN et: Mult i-Scale Vision Mam ba UNet for Medical Image Segmentation[C]//IEEE Inte rnational Confer ence on B ioinformatics and Biome dicine (BIBM). IEEE, 2024: 3111-311 4. [14] Sun G, Pan Y , Kong W , et al. DA-T ransUNet: Inte grating spatial and channel dual atte ntion with T ra nsformer for medic al image segmentation[J]. Frontiers in Bioenginee ring and Biotechnology , 202 4, 12: 1398237 . [15] Wang Z , Li M, Chen H, et al. Enha ncing m edical image segmentation with a multi-transformer U-Net[J]. PeerJ, 202 4, 12: e17005 . [16] H. Wang, P . Cao, J. Y ang, an d O. Zaiane, “ Narrowing the semantic gaps in U-Net with learnable skip connections: The case of m edical image segmentation, ” Neural Netw ., vol. 178, p. 106546, Oct. 2024, doi: 10.1 016/j.neun et.2024.10 6546. [17] Liu J, Y ang H, Zhou H-Y , et al. Swin-UMamba: Mamba- based UNet with Im ageNet-based Pretraining[C]//Medical Image Computing and Com puter-Assisted Inte rvention ( MICCAI). Springer , Cham, 2024: 615 -625. [18] Z. Dong, H. W ang, Z. Xu, K. Wu, S. L i, and T . Li, “ LFT-UNet: A lightweight fusion of transfo rmer and U-Net architectures for embedde d medical image segm entation, ” Biomedical Signa l Processing and Control, vol. 112, part B, p. 108565, 202 6, doi: 10.1016/j .bspc.2025.10 8565. [19] Pang B, Chen L, T ao Q , Wang E, Yu Y . GA-UNet: A Lightweight Ghost and At tention U-Net for Medical Image Segmentation. J Imaging Inform Me d. 2024 Aug;37(4) :1874-1888 . doi: 10.1007/s10278-024 -01070-5. Epub 2024 Mar 13. PMID: 3847 8188; PMCID : PMC1130077 7. [20] E. J . Alwadee, X. Sun, Y . Qin, and F . C. Lan gbein, “ LA TUP-Net: A lightweight 3D attention U-Net with p arallel conv olutions for brain tumor segmentation, ” Comp uters in Biology an d Medicine, vol. 184, p. 109353, 2025 , doi: 10.1016 /j.comp biomed.2024.109 353. [21] Y . Zheng and Q. He, “ Research and comparison of lightweight U- Net based on GhostNets for medical and remote sensi ng images, ” in Proc. 8th Int. Conf . Inte ll. Comput. Signal Process. (ICSP), Xi'an, Chin a, 2023, pp. 2082-2086, doi : 10.1109/IC SP58490.2 023.102 48608. [22] J. Ma, F . L i, and B. Wang, " U-Mamba: Enhancin g long-range dependenc y for biom edical image segmentation," arXiv:2401.04 722 [cs.CV], 2024. [23] J. Ruan, J. Li, and S. Xian g, “ VM- UNet: Vision Mamba UNet for medical image segmentation, ” arXiv preprint arXiv:2402.02 491, 2024. [Onlin e]. Available: https://ar xiv .org/ abs/2402.02491 . [24] R. Wu, Y . Liu, G. Ning, P . Liang, and Q. Chang , “ UltraLight VM-UNet: Para llel Vision Mamba significantly reduces parameters for skin lesion segmentation, ” Patt erns, p. 101298, 2025, doi: 10.1016/ j.patter .2025.101298. [25] R. Wu, Y . L iu, P . Liang , and Q. Chang, “ H-VM UNet: High-order Vision Mamba UNet for medical image segm entation, ” Neu rocomputing, vol. 624, p. 129447, 202 5, doi: 10.1016/j .neucom.2025.12 9447. [26] J. Wa ng, J. Chen, D. Z . Chen, a nd J. Wu, “ LKM-UNet: Large Kernel Vision Mamba UNet for medical image segmentation, ” in Proc. Med. Image Comput. Comput. Assist. Interv . (MICCAI), LNCS, vol. 15008, Springer Nature Switzerland, Oct. 2024 , pp. 360-370. [27] C. Chen, L. Yu, S. Min and S. Wang, "MSVM-UNet: Multi-Scale Vision Mamba UNet for Medical Image Segmentation," 2024 IEEE International Confer ence on Bioinformatics and Biomedicine (BIBM), Lisbon, Portugal, 2024, pp. 3111 -3114, doi: 10.1109/BIB M62325.2 024.10 821761. [28] X. Zhong, G. Lu, and H. L i, “ Vision Mamba a nd xLSTM-UN et for medical im age segmentation, ” Scientific Reports, vol. 15, no. 1, p. 816 3, 2025, doi: 10.1038/s41598-025 -88967-5. [29] S. Xia, Q . Sun, Y . Zhou, Z. Wa ng, C. Y ou, K. Ma, and M. Liu, “ A lightweight neural network for cell segmentation based on attention enha ncement, ” Information, vol. 16, no. 4, p. 295, 2025, doi: 10.3390 /info16040295. [30] M. Park, S. O h, J. Park, T . Jeong, and S. Y u, “ ES-UNet: Efficient 3D medical im age segmentation with enhanced skip connection s in 3D UNet, ” BMC Med ical Imaging , vol. 25, no. 1, p. 327, 2025, doi: 10.1186/s12880 -025-0185 7-0. [31] J. Y ang, P . Qiu, Y . Zhang, D. S. Marcus, and A . Sotiras, “ D- Net: Dynamic large kernel with dynamic feature fusion for volumetric medical image segmentation, ” Biomedic al Signal Processing and Control, vol. 113, part B, p. 108837, 2026, doi: 10.1016 /j.bspc.2025 .108837. [32] B. Wu , Q. Xiao, S. Liu, L. Yin, M. Pechenizkiy , D. C. Mocanu, M. V an Keulen, and E. Mocanu, “ E2 ENet: Dynamic spar se fea ture fu sion f or ac curat e and efficient 3D medic al image segmentation, ” arXiv preprint a rXiv:2312.04 727, 2025. [Online]. Available: https://ar xiv .org/ abs/2312.04727 . [33] Z. Cai, Y . Chen, J. Wa ng, X. He, Z. Pei, X. Lei, and C. L u, “ DAFNet: A novel D ynamic Adaptive Fusion Network for medic al image classif ication, ” Information Fusion, vol. 126, pa rt A, p . 103507, 2026, doi: 10.1016 /j.inffus.2025.1035 07. [34] X. Xie, Y . Cui, T . T an, X. Zheng, and Z. Yu , “ Fu sionMamba : Dynamic feature enh ancemen t for multimodal imag e fusion with Mamb a, ” Visua l Intelligence, vol. 2, n o. 1, p. 37, 2024, doi: 10.1007/s44267-024 -00072-9. [35] Y . Yu an and Y . Che ng, “ Medical ima ge segmen tation with UNet-base d m ulti-scale c ontext fusion, ” Scientific Reports, vol. 14, no. 1, p. 15687, 202 4, doi: 10.1038 /s41598-024-6658 5-x. [36] Zha ng Y , Li H, Wang J, et al. DF A-UNet: dua l-stream featur e-fusion attention U-Net for lymph node segmentation in ultrasound images[J] . Frontiers in Neu roscience, 2024, 18: 144 8294. [37] J. Zhu, B. Bolsterlee, B. V . Y . Chow , Y . Song, and E. Meij ering, “ Uncertainty and sha pe-aware continual test-time adaptation for cr oss-domain segm entation of medical images, ” in Proc. MICCAI, V ancouver , BC, Canada, 2023, pp. 659-669, doi : 10.1007/978 -3-031-4389 8-1_63. [38] W . Dong, B. Du , and Y . Xu, “ Shape-in tensity-guided U-n et for medic al image segmen tation, ” Neurocomputing , vol. 610, Art. no. 128534 , 2024. doi : 10.1016/j.n eucom.2024. 128534. [39] R. Hasani, M. Lechn er , A. Amini, D. Rus, and M. Frosio, “ L iquid time-con stant networks, ” in Proc. AAAI Conf . Artif . Int ell., 2021, vol. 35, no . 9, pp. 7657-766 6. [40] A. S. Raj, S. R. Sanna si Chakravarth y , and S. S. Kumar , “ Liqu id neural networks on the edge: A comparative s tudy with CNN, ” in Proc. 2nd Int. Conf . Recent Adv . Eng. Sci. (ICRA ES-2K25), Chennai, In dia, 2025, pp. 1-6. [41] S. Berlin Shaheema, S. D. K., and N. B. Muppalan eni, “ An explainab le Liquid Neural Network combined with p ath ag gregation residual network for an accurat e brain tumor diagnosis, ” Comput. Elect. Eng., vol. 122, Dec. 2025, Art. no. 109999, doi: 10.1016/j .compeleceng.202 4.1099 99. [42] A. Abhishek, R. Sin gh and N. K aur , "Radiographic k nee osteoarthritis det ection using liquid neural networks," 2025 4th International Conference on D istributed Computing an d Electrical Circuits and Electronic s (ICDCECE), Ballari, Indi a, 202 5, pp. 1-6, doi: 10.1109/I CDCECE65353.202 5.11034 969. [43] Sirinuk unwat tana, K., Pluim, J.P . W ., Che n, H., Qi, X., Heng , P .A., G uo, Y .B., Wang, L. Y ., Matuszewski, B. J., Bruni, E., Sanchez, U., B öhm, A ., Ronneberger , O ., Cheikh , B.B., Ra coceanu, D., Kainz, P ., Pfeiff er , M., Urschler , M., Snead, D.R.J., Rajpoot, N.M. , 201 6. Gland segmentation in colon histology ima ges: The GlaS ch allenge contest. Me d. Image Anal. 35, 489-502. [44] Kumar , N., Ver ma, R., Sharma , S., Bhargav a, S., V ahadane, A. , Sethi, A ., 201 7. A dataset and a technique for generalized nuclear segme ntation for com putational p athology . IEEE T r ansactions on Medical Im aging 36, 1550-156 0. [45] La ndman, B., Xu, Z., Igelsias, J., Styner , M., Lan gera k, T ., Klein, A., 201 5. 2015 m iccai multi-atlas labeling beyond the cranial vault-workshop and challenge. 10. 730 3/syn3193805. [46] M. Cordts, M. Omran, S. Ram os, T . Rehfeld, M. Enzweiler , R. Be nenson, U . F rank e, S. Roth, and B. Schiele, “ The Cityscapes Dataset for Sema ntic Urban Scene U nderstanding, ” in Proc. of the IEEE Conference on Computer Vision and Pa tter n Recognition (CVPR), 2016. [47] A. T av anaei, M. Ghodrati, S. R. Kheradpisheh, T . Masq uelier , and A. Maida, "Deep learnin g in spiking neural networks," *Neural Networks*, vol. 111, pp . 47 – 63, 2019, doi: 10.1016/j .neunet.2018.12 .002.
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment