TwinMixing: A Shuffle-Aware Feature Interaction Model for Multi-Task Segmentation

Minh-Khoi Do a,b,∗, Huy Che a,b,∗, Dinh-Duy Phan a,b, Duc-Khai Lam a,b,∗∗ and Duc-Lung Vu a,b
a University of Information Technology, Ho Chi Minh City, Vietnam
b Vietnam National University, Ho Chi Minh City, Vietnam

Keywords: Drivable-area segmentation; Lane segmentation; Autonomous car; Light-weight model

Abstract: Accurate and efficient perception is essential for autonomous driving, where segmentation tasks such as drivable-area and lane segmentation provide critical cues for motion planning and control. However, achieving high segmentation accuracy while maintaining real-time performance on low-cost hardware remains a challenging problem. To address this issue, we introduce TwinMixing, a lightweight multi-task segmentation model designed explicitly for drivable-area and lane segmentation. The proposed network features a shared encoder and task-specific decoders, enabling both feature sharing and task specialization. Within the encoder, we propose an Efficient Pyramid Mixing (EPM) module that enhances multi-scale feature extraction through a combination of grouped convolutions, depthwise dilated convolutions, and channel shuffle operations, effectively expanding the receptive field while minimizing computational cost. Each decoder adopts a Dual-Branch Upsampling (DBU) Block composed of a learnable transposed convolution–based Fine detailed branch and a parameter-free bilinear interpolation–based Coarse grained branch, achieving detailed yet spatially consistent feature reconstruction. Extensive experiments on the BDD100K dataset validate the effectiveness of TwinMixing across three configurations: tiny, base, and large.
Among them, the base configuration achieves the best trade-off between accuracy and computational efficiency, reaching 92.0% mIoU for drivable-area segmentation and 32.3% IoU for lane segmentation with only 0.43M parameters and 3.95 GFLOPs. Moreover, TwinMixing consistently outperforms existing segmentation models on the same tasks, as illustrated in Figure 1. Thanks to its compact and modular design, TwinMixing demonstrates strong potential for real-time deployment in autonomous driving and embedded perception systems. The source code is available at https://github.com/Jun0se7en/TwinMixing.

∗ Equal contribution. ∗∗ Corresponding author.
khoidm.19@grad.uit.edu.vn (M. Do); huycq@uit.edu.vn (H. Che); duypd@uit.edu.vn (D. Phan); khaild@uit.edu.vn (D. Lam); lungvd@uit.edu.vn (D. Vu)
ORCID(s): 0009-0007-7477-4702 (H. Che); 0000-0003-0228-9648 (D. Phan); 0000-0003-2711-1408 (D. Lam); 0000-0002-0045-4657 (D. Vu)

M-K Do et al.: Preprint submitted to Elsevier. Page 1 of 15.

Figure 1: The horizontal axis represents FLOPs, the vertical axis denotes mIoU, and the circle radius corresponds to IoU. (The plot compares the TwinMixing tiny, base, and large configurations against SegFormer, YOLOP, TwinLiteNet, TwinLiteNet+, and DFFM.)

1. Introduction

In autonomous driving systems, image segmentation plays a vital role in understanding the driving environment by identifying key elements such as road surfaces, vehicles, pedestrians, and lane markings. However, constructing highly detailed multi-class semantic maps [1, 2, 3, 4] is often unnecessary for practical advanced driver-assistance systems (ADAS) [5, 6, 7]. A critical requirement of ADAS is real-time perception, where even minor delays in processing can compromise driving safety. Instead of exhaustively segmenting all objects in the scene, ADAS applications primarily focus on extracting task-relevant regions, notably drivable areas and lane markings, which directly support navigation, lane keeping, and trajectory planning. Consequently, segmentation approaches that emphasize task-specific utility and real-time efficiency are more desirable than those that solely focus on fine-grained semantic labeling.

In contrast to general semantic scene understanding tasks [1, 2, 3, 4, 8] that focus on categorizing all visible objects within a scene, drivable area segmentation and lane segmentation [8] serve a more safety-critical function in autonomous driving perception. As illustrated in Fig. 1, these tasks provide crucial information for identifying navigable regions, maintaining stable lane positioning, and enabling trajectory planning and collision avoidance in complex and dynamic traffic environments. The primary aim of segmentation research in this context is to create accurate mappings between sensory inputs, such as camera images, LiDAR, or other modalities, and their corresponding segmentation maps. Nevertheless, achieving high accuracy alone is not sufficient for real-world deployment; models must also deliver real-time inference performance to ensure safety and responsiveness in autonomous driving systems.

Figure 2: Visual comparison of semantic scene understanding versus drivable area and lane segmentation, highlighting the focus on safety-critical and navigable regions for autonomous driving.

This requirement underscores the need for segmentation algorithms that strike a balance between strong representational capability and computational efficiency, while maintaining high processing throughput on resource-constrained onboard hardware.
Consequently, balancing accuracy and efficiency emerges as a key challenge in designing segmentation systems for practical autonomous driving applications.

Although recent segmentation studies have achieved remarkable accuracy [9, 10, 11, 12], most are evaluated on high-end GPUs, which differ substantially from the computational constraints of in-vehicle systems. To balance accuracy and efficiency, several lightweight architectures [13, 14, 15] have been developed. Among them, dilated convolution–based models [13, 14] effectively enlarge the receptive field without increasing the kernel size, thereby enhancing contextual perception at a moderate cost. However, these approaches still suffer from redundant computation and limited channel interaction. To overcome these limitations, we introduce an efficient feature extraction strategy that combines dilated depthwise convolutions with channel shuffle operations. Our proposed model takes a different perspective by emphasizing inter-feature interaction and cross-scale information integration within the encoder. Instead of relying solely on structural efficiency, TwinMixing enhances representational richness through efficient pyramid mixing and channel-level fusion. This design strengthens cross-channel communication while substantially reducing parameters and FLOPs, resulting in a more compact yet expressive encoder suitable for embedded deployment.

Building upon these principles, we propose the TwinMixing Model, a lightweight multi-task segmentation architecture designed for drivable area and lane segmentation. The model employs a shared encoder that extracts multi-scale features through the proposed Efficient Pyramid Mixing (EPM) modules, followed by two task-specific decoders equipped with Dual Branch Upsampling Blocks (DBU) to reconstruct segmentation masks for each task.
The integration of EPM Units enables efficient multi-scale representation learning, while the dual-branch decoder balances detailed reconstruction with spatial smoothness through the fusion of transposed convolution and bilinear upsampling. Our main contributions can be summarized as follows:

• We propose TwinMixing, a lightweight multi-task segmentation model that jointly performs drivable-area and lane-line segmentation. The shared encoder integrates dilated depthwise separable convolutions, channel shuffle operations, and the proposed Efficient Pyramid Mixing (EPM) module to achieve rich multi-scale representations with minimal computational overhead.

• We design the Dual Branch Upsampling (DBU) Block, which fuses a learnable transposed convolution–based Fine Branch with a parameter-free bilinear upsampling–based Coarse Branch. This dual-path strategy enhances spatial detail recovery and stability while avoiding checkerboard artifacts.

• TwinMixing is developed in three scalable configurations (tiny, base, and large), ranging from 0.10M / 1.08 GFLOPs to 1.50M / 14.25 GFLOPs, supporting flexible deployment from embedded systems to edge devices.

• Extensive experiments on the BDD100K dataset demonstrate that TwinMixing large achieves 92.8% mIoU for drivable-area segmentation and 34.2% IoU for lane segmentation, outperforming recent lightweight baselines such as TwinLiteNet+ [14], DFFM [16], and IALaneNet [9] with significantly lower computational cost.

2. Related Work

Semantic segmentation is a fundamental task in computer vision, in which each pixel in an image is assigned a semantic label. In autonomous driving, this task plays a crucial role in environmental perception and scene understanding [1, 2, 3, 4, 8], supporting downstream tasks such as motion planning, navigation, and collision avoidance.
Pioneering works such as DeepLab [17], SegFormer [18], and Mask2Former [19] have achieved high accuracy through deep feature extraction and strong contextual representation. However, these models often rely on heavy backbones with high computational costs, making them unsuitable for real-time deployment on embedded or in-vehicle systems. To improve computational efficiency, several lightweight architectures [20, 21, 22] have been proposed, which employ simplified encoder–decoder designs to reduce complexity while maintaining reasonable segmentation performance. In addition, channel mixing mechanisms such as grouped convolution and channel shuffle, introduced in ShuffleNet [23] and MobileNet [24], further enhance feature diversity without significantly increasing the parameter count.

Figure 3: The architecture of TwinMixing. The model consists of a shared encoder and two task-specific decoders. The encoder integrates the proposed Efficient Pyramid Mixing (EPM) modules to enhance multi-scale feature extraction and contextual representation. Each decoder adopts a Dual Branch Upsampling Block (DBU) composed of a Fine detailed branch and a Coarse grained branch. The two decoders independently generate segmentation masks for lane lines and drivable areas, respectively.

In practical autonomous driving systems, however, segmenting a wide range of complex semantic categories (e.g., road, building, vegetation, pedestrian, or traffic sign) is not always necessary for real-time decision-making. Instead, advanced driver assistance systems (ADAS) often focus on perception tasks that directly support vehicle control, such as drivable area segmentation and lane line detection.
Consequently, recent studies have developed task-specific models [9, 16, 13, 14, 25, 20, 26, 27, 28] for these purposes, following either single-task [20, 26, 27, 28] or multi-task paradigms [9, 16, 13, 14, 25]. Among them, multi-task models have gained more attention due to their ability to share features across tasks, reduce redundant computation, and improve scalability. Typically, these models adopt a shared encoder to learn general representations, followed by task-specific decoders to specialize for each output. Despite these advances, most existing multi-task architectures [13, 14, 16] rely on dilated convolutions to enlarge the receptive field but remain limited in inter-channel interaction, which constrains feature diversity and contextual representation capacity.

From these observations, two significant challenges remain insufficiently addressed in prior research: (1) how to enhance inter-channel information exchange while maintaining effective multi-scale contextual representation under low computational cost; and (2) how to achieve stable spatial detail reconstruction during decoding without introducing boundary noise or checkerboard artifacts. To address these challenges, we propose TwinMixing, a lightweight multi-task segmentation model designed explicitly for drivable area and lane line segmentation. The model incorporates an Efficient Pyramid Mixing (EPM) module within the encoder to expand the receptive field via dilated depthwise convolutions while facilitating channel interactions through grouped convolutions and channel shuffle. In the decoding stage, a Dual Branch Upsampling (DBU) module with two parallel branches ensures smooth and stable spatial reconstruction.
Through this design, TwinMixing achieves an optimal balance among computational efficiency, accuracy, and training stability, making it well-suited for real-time perception in autonomous driving systems.

3. TwinMixing Model

3.1. Overall architecture

The TwinMixing model is a multi-task segmentation architecture designed to simultaneously perform two related tasks: lane line segmentation and drivable area segmentation. The model takes an RGB image as input and extracts feature representations through a shared encoder. These shared features are then processed by two task-specific decoders, each dedicated to reconstructing a segmentation mask for its corresponding task.

The encoder of TwinMixing is built upon a combination of convolutional layers, the proposed Efficient Pyramid Mixing (EPM) modules, and the PCAA module [29]. This combination enhances multi-scale feature representation while maintaining high computational efficiency. Specifically, the EPM modules are designed to expand the receptive field and facilitate channel interactions at a low computational cost through multiple EPM Units, enabling the model to capture contextual information across different spatial levels. Meanwhile, the PCAA module serves as an attention mechanism that amplifies semantically important regions for each target class. During encoding, the spatial resolution of feature maps is gradually reduced from H × W to H/8 × W/8 before being passed to the decoding stage.

In contrast to the shared encoder, the decoders in TwinMixing are task-specific, allowing the model to learn specialized feature representations optimized for each segmentation objective. Each decoder is constructed using the proposed Dual Branch Upsampling Block (DBU), which consists of two complementary branches: the Fine detailed branch and the Coarse grained branch.
The outputs of these two branches are fused through element-wise addition, enabling the model to simultaneously leverage the detail reconstruction capability of transposed convolution and the spatial smoothness and stability provided by bilinear interpolation. As a result, TwinMixing achieves more accurate spatial feature reconstruction and improved training stability.

During inference, the encoder processes the input image $\mathcal{F}_{rgb}$ to produce an intermediate feature representation $\mathcal{F}_e$. This shared feature is then passed through two separate decoders to generate the final segmentation masks: $\mathcal{F}_{lane}$ for lane line segmentation and $\mathcal{F}_{drivable}$ for drivable area segmentation. The overall architecture of TwinMixing is illustrated in Figure 3.

Figure 4: Overview of the encoder in TwinMixing. The encoder extracts hierarchical multi-scale features from the input image through a combination of standard convolutional layers, Efficient Pyramid Mixing (EPM) modules, and Partial Class Activation Attention (PCAA) [29], producing the shared representation $\mathcal{F}_e$ for subsequent decoding.

3.2. Encoder

In the TwinMixing segmentation model, the encoding stage is implemented using a shared encoder that extracts and represents spatial and semantic features from the input image. The encoder is constructed as a sequence of CNN layers that combine Shuffle Units [23] with the proposed Efficient Pyramid Mixing (EPM) modules. The EPM modules include both EPM and Stride EPM variants. While the standard EPM focuses on expanding the receptive field to enhance multi-scale feature representation, the Stride EPM further enables the encoder to simultaneously enlarge the receptive field and perform progressive downsampling across layers.
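The overall H × W → H/8 × W/8 reduction is easy to verify with shape bookkeeping. The sketch below is an illustrative, framework-free check assuming three stride-2 stages (which is consistent with the 1/8 overall reduction stated in the text); it uses the 640 × 384 training resolution from Section 3.4:

```python
def encoder_output_size(h, w, num_stride_stages=3):
    """Spatial size of the shared feature after the encoder's stride-2 stages.

    Each stride-2 stage halves both dimensions (integer division, as in a
    strided convolution), so 3 stages give the paper's H/8 x W/8 reduction.
    """
    for _ in range(num_stride_stages):
        h, w = h // 2, w // 2
    return h, w

# At the 640 x 384 training resolution, the shared feature F_e is 1/8 scale:
print(encoder_output_size(384, 640))  # (48, 80)
```

This also makes explicit why the decoders need exactly three 2× upsampling steps to restore the input resolution.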
During the encoding process, the feature maps are gradually reduced in spatial resolution from H × W to H/8 × W/8, followed by a PCAA attention module [29] before being passed to the decoding stage. The encoder design is illustrated in Figure 4.

Within the encoder, the proposed Efficient Pyramid Mixing (EPM) module plays a central role in enhancing multi-scale feature extraction while maintaining computational efficiency. EPM follows the reduce–split–transform–merge principle, enabling the model to process features across multiple spatial scale representations effectively. Unlike prior approaches [17] that directly split the input features into multiple branches, the proposed EPM first performs dimensionality reduction through an EPM Unit with a 1 × 1 kernel, effectively lowering the computational cost of subsequent multi-branch transformations while preserving essential representational information. By performing the reduction before branching, the subsequent transformations in each parallel path operate on lower-dimensional feature maps, thereby significantly decreasing the overall computational burden of multi-branch processing. The reduced feature is then split into multiple parallel branches, each transformed by an EPM Unit with a different dilation rate, enabling the model to capture contextual information at multiple spatial scales. The outputs from these branches are then fused via a Hierarchical Feature Fusion (HFF) mechanism [22], which mitigates gridding artifacts. The overall architecture of the EPM module is illustrated in Figure 5.

Figure 5: Illustration of the proposed Efficient Pyramid Mixing (EPM) module. The design is inspired by the ESP [22], where the EPM performs a reduction step using an EPM Unit with a 1 × 1 kernel before splitting features into multiple parallel branches. Each branch transforms the reduced feature through an EPM Unit with a different dilation rate (the branches shown use dilation rates 1, 2, 4, 8, and 16) to capture multi-scale spatial information. The outputs of all branches are then merged through the Hierarchical Feature Fusion (HFF) mechanism [22]. In the Stride EPM variant, the reduction step employs an EPM Unit with a 1 × 1 kernel and a stride of 2 to achieve downsampling.

Within the EPM module, each EPM Unit serves as the core transformation process, responsible for capturing and encoding multi-scale spatial information. The EPM Unit is structurally optimized to achieve a balance between computational efficiency and representational capacity, combining three essential operations: (i) a grouped 1 × 1 convolution for feature projection; (ii) a 3 × 3 depthwise dilated convolution for spatial transformation and receptive field expansion without incurring significant computational overhead; and (iii) a channel shuffle operation to restore inter-group information flow and mitigate the isolation caused by grouped convolutions. This combination enables an effective balance between representation capability and efficiency, making the EPM Unit particularly suitable for real-time segmentation tasks that demand both high accuracy and fast inference.

In the original Efficient Spatial Pyramid (ESP) design [22], only dilated convolutions (D-Conv) were employed for feature transformation, as illustrated in Figure 6a.
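As a quick sanity check on why the parallel dilated branches cover progressively wider context: a single 3 × 3 convolution with dilation d spreads its taps d pixels apart, so it spans 2d + 1 pixels per side. A minimal sketch (the dilation rates follow Figure 5; the function itself is illustrative):

```python
def dilated_rf(kernel=3, dilation=1):
    """Receptive field (pixels per side) of one dilated convolution:
    the k taps are spread `dilation` pixels apart."""
    return dilation * (kernel - 1) + 1

# Parallel branches with growing dilation rates see progressively wider context:
for d in (1, 2, 4, 8, 16):
    print(d, dilated_rf(3, d))  # 1->3, 2->5, 4->9, 8->17, 16->33
```

Because every branch still uses only nine taps, the widest branch reaches a 33-pixel span at the cost of a plain 3 × 3 convolution, which is the motivation for dilation in the pyramid.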
The proposed Efficient Pyramid Mixing (EPM) Unit draws inspiration from the Depthwise Dilated Separable Convolution (DDS-Conv) structure introduced in previous studies [14, 30], depicted in Figure 6b.

Figure 6: Evolution of transformation designs in approaches derived from the Efficient Spatial Pyramid (ESP) framework [22]. (a) Dilated Convolution (D-Conv) in ESPNet [22]; (b) Depth-wise Dilated Separable Convolution (DDS-Conv) in [14, 30]; (c) the proposed EPM Unit; (d) the Stride EPM Unit variant for downsampling.

DDS-Conv effectively enlarges the receptive field by applying a Depthwise Dilated Convolution (DD-Conv) to capture spatial context, followed by a Pointwise Convolution (PW-Conv) to adjust the channel dimensionality and restore feature compactness. However, since both operations process channels independently, DDS-Conv lacks inter-channel interaction, resulting in limited feature diversity and weak contextual representation. To overcome this limitation, the EPM Unit replaces standard PW-Convs with grouped 1 × 1 convolutions, followed by a channel shuffle operation to promote cross-group information exchange. The shuffled features are then transformed by a DD-Conv, whose dilation rate is adaptively adjusted to control the receptive field based on network depth. Finally, a second grouped 1 × 1 convolution restores the output channel dimension. When the number of input and output channels is identical, a shortcut connection with element-wise addition is applied to stabilize training and facilitate efficient information propagation across layers, as shown in Figure 6c.
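The EPM Unit's two ingredients can be illustrated without any deep-learning framework: grouped convolutions divide the weight count by the group factor, and channel shuffle is a parameter-free reshape–transpose permutation that re-mixes the groups. The sketch below is illustrative only; the 64-channel width and 4 groups are assumptions for the example, not configurations from the paper:

```python
def conv_weights(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution (bias omitted); a grouped
    convolution sees only c_in // groups input channels per filter."""
    return (c_in // groups) * c_out * k * k

def channel_shuffle(channels, groups):
    """ShuffleNet-style shuffle: view the channel list as (groups, per_group),
    transpose to (per_group, groups), then flatten."""
    per_group = len(channels) // groups
    return [channels[g * per_group + c]
            for c in range(per_group)
            for g in range(groups)]

c = 64  # illustrative width
d_conv   = conv_weights(c, c, 3)                                    # plain (dilated) 3x3, as in ESP
dds_conv = conv_weights(c, c, 3, groups=c) + conv_weights(c, c, 1)  # DW 3x3 + PW 1x1
epm_unit = (conv_weights(c, c, 1, groups=4)    # grouped 1x1 projection
            + conv_weights(c, c, 3, groups=c)  # depthwise dilated 3x3
            + conv_weights(c, c, 1, groups=4)) # grouped 1x1 restoration (shuffle adds no weights)
print(d_conv, dds_conv, epm_unit)              # 36864 4672 2624

# The shuffle interleaves groups so the second grouped conv sees every group:
print(channel_shuffle(list(range(6)), 2))      # [0, 3, 1, 4, 2, 5]
```

Since the shuffle is a fixed permutation with no parameters, the unit recovers cross-group mixing without giving back the savings of the grouped 1 × 1 convolutions.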
For the Stride EPM variant, used in the reduction stages of the EPM hierarchy, a 1 × 1 convolution with stride 2 is employed, as illustrated in Figure 6d. In addition to performing the DD-Conv with a stride of 2, we introduce an additional 3 × 3 average pooling layer in the shortcut branch to downsample spatially, and replace element-wise addition with channel concatenation, allowing the network to increase its output channel capacity with minimal computational overhead. This design maintains a balance between spatial efficiency and information preservation, ensuring that the downsampling process does not compromise the geometric structure of the learned feature representations.

3.3. Decoder

In the TwinMixing Model, we design separate decoders for each segmentation task to enable the model to learn task-specific feature representations. Each decoder is responsible for transforming the encoder output features ($\mathcal{F}_e$) into a segmentation mask that has the same spatial resolution as the input image, as illustrated in Figure 7. The decoders are constructed from a combination of convolutional operations and upsampling methods, jointly extracting spatial features and restoring resolution. In particular, we propose the Dual Branch Upsampling Block (DBU), a dual-path upsampling structure composed of two complementary branches: the Fine detailed branch and the Coarse grained branch.

The Fine detailed branch is designed to recover fine spatial details that are often lost during the encoder's downsampling process. Specifically, the Fine detailed branch employs transposed convolution for learnable upsampling, enabling the model to directly learn how to reconstruct fine geometric features from the training data.
In the early upsampling stages, the upsampled features are concatenated with low-level feature maps from the encoder via skip connections at the same spatial resolution, thereby preserving spatial continuity and local context. After concatenation, the fused features are further refined by an additional convolutional layer, which enhances the extraction of combined features and enriches the spatial representation. Consequently, the Fine detailed branch plays an essential role in reconstructing object boundaries, ensuring sharpness and precision in the segmentation results.

In contrast, the Coarse grained branch focuses on preserving the overall structure and spatial continuity during the resolution recovery process. This branch first applies a 1 × 1 convolution to reduce computational cost and adjust the channel dimension of the input features, followed by bilinear interpolation, a parameter-free interpolation method that enlarges feature maps by a factor of 2× without producing the checkerboard artifacts commonly observed in transposed convolution. As a result, the Coarse grained branch provides stable feature representations that maintain the global layout of objects, serving as a structural foundation to be integrated with the detailed features from the Fine detailed branch.

Figure 7: Architecture of the proposed model with two parallel decoders sharing the same structural design for different segmentation tasks.

Figure 8: Illustration of the proposed Dual Branch Upsampling Block (DBU).
It consists of two complementary upsampling paths: the Fine detailed branch and the Coarse grained branch.

The outputs from the two branches are fused through element-wise addition, allowing the model to jointly leverage the fine-detail learning capability of the transposed convolution branch and the spatial stability of the bilinear interpolation branch. This combination improves boundary accuracy and maintains a consistent spatial structure in the final segmentation results. The detailed design of the DBU is illustrated in Figure 8, where Figure 8a shows the case in which the DBU incorporates downsampled features through skip connections, and Figure 8b depicts the configuration in which the block operates independently without skip connections.

3.4. Training strategies

During the training process, all input images are resized from their original resolution of 1280 × 720 to 640 × 384 to ensure computational efficiency and facilitate a fair comparison with previous works that follow the same setting. To enhance generalization, we apply photometric augmentations (random hue, saturation, and value shifts) and geometric transformations (random translation, cropping, and horizontal flipping) to improve robustness and spatial diversity. We adopt the AdamW optimizer [31], and the learning rate follows a cosine annealing schedule, gradually decaying over 100 epochs. The batch size is set to 16.

To address challenges in pixel-wise classification for autonomous driving, particularly class imbalance and fine-structure sensitivity, we design a hybrid loss combining Focal Loss [32] and Tversky Loss [33]. Each loss is applied independently to the drivable-area and lane-segmentation outputs, ensuring task-specific optimization.
Focal Loss focuses on hard-to-classify pixels by down-weighting easy examples using a modulating factor $(1 - p_i(c))^{\gamma}$, which is effective in mitigating severe class imbalance where background pixels dominate. Tversky Loss extends Dice Loss [34] by introducing weighting factors $\alpha$ and $\beta$ to balance false positives and false negatives, which is crucial for thin and elongated lane structures. The total objective is expressed as:

$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{drivable area}} + \mathcal{L}_{\text{lane}}$  (1)

Empirically, we set $\alpha = 0.7$, $\beta = 0.3$ for drivable-area segmentation, and $\alpha = 0.9$, $\beta = 0.1$ for lane segmentation, while using $\alpha_t = 0.25$ and $\gamma = 2$ in the Focal Loss formulation. These hyperparameters are chosen to emphasize recall over precision, ensuring safety-critical sensitivity to small or narrow structures such as lane markings. All experiments are conducted on an NVIDIA RTX 4090 GPU, with the model trained jointly for both tasks in an end-to-end multi-task setting.

4. Experimental

4.1. Dataset and evaluation metrics

The BDD100K dataset is a large-scale, diverse dataset of driving scenes developed for research in autonomous driving perception. It comprises 100,000 video clips collected from over 50,000 driving sessions across multiple regions in the United States, encompassing a wide range of environments, weather conditions, and times of day. The dataset is divided into training, validation, and test subsets containing 70K, 10K, and 20K images, respectively. Following prior studies [13, 14, 9, 16], since the ground-truth annotations for the test set are not publicly available, all evaluations in this work are conducted on the validation set consisting of 10,000 images. Owing to its large scale and diverse environmental conditions, BDD100K serves as a comprehensive benchmark for evaluating segmentation models under realistic driving conditions.
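The two loss terms from Section 3.4 reduce to a few lines of arithmetic. The sketch below is a plain-Python illustration (the 5-pixel toy masks are invented for the example) using the paper's hyperparameters $\alpha_t = 0.25$, $\gamma = 2$ and, for the lane head, $\alpha = 0.9$, $\beta = 0.1$:

```python
import math

def focal_term(p_t, alpha_t=0.25, gamma=2.0):
    """Per-pixel Focal Loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def tversky_loss(pred, gt, alpha, beta):
    """Tversky Loss over binary masks: 1 - TP / (TP + alpha*FP + beta*FN)."""
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    return 1.0 - tp / (tp + alpha * fp + beta * fn)

# The modulating factor (1 - p_t)^gamma down-weights easy pixels:
print(focal_term(0.9) < focal_term(0.1))  # True: hard pixels dominate the gradient

# Toy lane mask with 2 TP, 1 FP, 1 FN under the lane-head weights:
print(round(tversky_loss([1, 1, 0, 1, 0], [1, 1, 1, 0, 0], alpha=0.9, beta=0.1), 3))  # 0.333
```

In this formulation $\alpha$ weights false positives and $\beta$ weights false negatives, so the two knobs trade precision against recall per task, as described in the text.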
For the segmentation evaluation, consistent with prior works [13, 14, 9, 16], drivable-area segmentation performance is quantified using the mean Intersection over Union (mIoU) metric. For the lane segmentation task, both Accuracy (Acc) and Intersection over Union (IoU) are employed to provide a comprehensive assessment. However, due to the substantial class imbalance between lane markings and the background, a balanced accuracy metric [35, 14] is additionally adopted to yield a more reliable evaluation of model performance. To measure the computational efficiency of our approach, we adopt two commonly used indicators: the number of parameters and the FLOPs count. Following standard practice in recent studies [14, 36, 22], FLOPs are defined as the total number of multiplication and addition operations required during inference.

4.2. Main results

4.2.1. Inference throughput and model complexity

The runtime characteristics and scalability of the TwinMixing family are summarized in Table 1, detailing an evaluation of the tiny, base, and large configurations at batch sizes of 1, 4, and 16. Inference FPS is computed over 500 independent runs and reported as the mean value with standard deviation.

Table 1: Throughput and complexity comparison among configurations at various batch sizes.

Config | FPS (batch=1) ↑ | FPS (batch=4) ↑ | FPS (batch=16) ↑ | Params | FLOPs
Tiny   | 83 ± 0.89       | 335 ± 3.37      | 1302 ± 12.71     | 0.10M  | 1.08G
Base   | 67 ± 0.47       | 255 ± 2.68      | 724 ± 0.94       | 0.43M  | 3.95G
Large  | 56 ± 0.51       | 218 ± 2.42      | 301 ± 0.64       | 1.50M  | 14.25G

Table 2: Comparison of per-epoch training time across TwinLiteNet, TwinLiteNet+, and TwinMixing.

Model               | Parameters | FLOPs | Training time
TwinLiteNet         | 0.44M      | 3.9G  | 567.1 s
TwinLiteNet+ Medium | 0.48M      | 4.63G | 799.8 s
TwinMixing base     | 0.43M      | 3.95G | 1070.2 s
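The headline accuracy metrics are straightforward set arithmetic over binary masks. A minimal, framework-free sketch (the 4-pixel masks are toy data, not BDD100K annotations):

```python
def iou(pred, gt):
    """Intersection over Union of two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    union = sum(1 for p, g in zip(pred, gt) if p == 1 or g == 1)
    return inter / union if union else 1.0

def mean_iou(preds, gts):
    """mIoU: the per-class IoU averaged over classes, as reported for
    drivable-area segmentation."""
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)

# Two toy classes (e.g. background and drivable area) on a 4-pixel image:
pred_bg, gt_bg = [1, 0, 0, 1], [1, 0, 1, 1]   # IoU = 2/3
pred_dr, gt_dr = [0, 1, 1, 0], [0, 1, 0, 0]   # IoU = 1/2
print(round(mean_iou([pred_bg, pred_dr], [gt_bg, gt_dr]), 3))  # 0.583
```

Lane IoU in the tables is simply the single-class `iou` applied to the lane mask, which is why class imbalance motivates the additional balanced-accuracy metric.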
The results reveal a clear trade-off between computational complexity and inference latency, demonstrating effective throughput scaling across the architecture variants. The model family spans from TwinMixing tiny, tailored for ultra-lightweight edge deployments, to TwinMixing large, which targets high-capacity scenarios. The TwinMixing tiny variant demonstrates excellent hardware utilization, achieving 83 FPS at batch size 1 and scaling near-linearly to 1302 FPS at batch size 16. This indicates that the model is predominantly compute-bound and well suited to resource-constrained applications. For a balanced latency–capacity profile, TwinMixing base provides substantially greater model capacity than the tiny variant while maintaining robust real-time throughput, delivering 67 FPS at batch size 1 and 724 FPS at batch size 16. The large variant is designed for high-end GPU deployments where model capacity and accuracy are prioritized. Despite incurring the highest computational cost, it sustains real-time operation at 56 FPS at batch size 1 and scales to 301 FPS at batch size 16. These results confirm the effective scalability of the TwinMixing architecture, whose hierarchical variants offer favorable throughput–complexity trade-offs for diverse computational budgets.

4.2.2. Training time comparison

We compare the training time of TwinLiteNet [13], TwinLiteNet+ [14], and TwinMixing on an RTX 4090 GPU. Table 2 highlights a clear difference between theoretical computational efficiency (FLOPs and parameter count) and practical training cost (time per epoch) across the three models. Although TwinMixing base has the smallest parameter count (0.43M) and FLOPs (3.95G) comparable to the lightest baseline, its training time per epoch increases substantially to 1070.2 s.
This gap is primarily attributed to differences in the core operators and the resulting computational-graph complexity. Specifically, TwinLiteNet relies on dilated convolutions as its primary feature-extraction operator, while TwinLiteNet+ adopts dilated depthwise convolutions to improve efficiency. TwinMixing, in contrast, combines grouped convolutions, depthwise dilated convolutions, and channel shuffle operations within its EPM module; this more intricate computation graph increases the cost of backpropagation and gradient computation. Moreover, although grouped convolution and channel shuffle improve cross-channel interaction without notably increasing FLOPs, they can introduce additional overhead from memory access and frequent tensor-layout transformations on the GPU, further prolonging training. Finally, the Dual Branch Upsampling (DBU) decoder, which processes two parallel branches to recover fine details, increases intermediate activation volume and thus adds training-time overhead.

Despite its more complex architecture, which aims to enhance feature extraction and reconstruction, TwinMixing keeps FLOPs and parameter count low through width scaling, i.e., carefully tuning channel width across layers. This design choice reduces both the number of parameters and FLOPs relative to the compared baselines, improving deployment efficiency at the expense of higher training cost from multi-branch computation and memory-related overhead.

4.2.3. Quantitative results

The compared methods include both general-purpose segmentation networks and multi-task perception models, ranging from standard baselines [17, 18, 37, 39] to recent lightweight architectures [13, 25, 15]. We also include scalable models with multiple configurations [16, 9, 14], analogous to our TwinMixing, for a fair comparison across different scales. The quantitative results of all models are summarized in Table 3.
On the BDD100K dataset, the proposed TwinMixing achieves high segmentation accuracy while maintaining low computational cost. For the lane segmentation task, TwinMixing large reaches 82.4% accuracy and 34.2% IoU, outperforming competitive models such as TwinLiteNet+ Large, DFFM Large, and IALaneNet ConvNeXt-small. Despite achieving higher accuracy, our model requires significantly fewer resources, highlighting the effectiveness of the proposed architecture in modeling fine-grained structural cues with minimal overhead. For the drivable-area segmentation task, TwinMixing large achieves 92.8% mIoU, only 0.1% lower than TwinLiteNet+ Large, but with a substantially lower computational burden, saving 3.33 GFLOPs and 0.44M parameters. This demonstrates that the large configuration of TwinMixing offers a superior balance between accuracy and efficiency across both segmentation tasks compared to prior multi-task networks such as TwinLiteNet+, DFFM, and IALaneNet.

Table 3
Quantitative comparison of models on the drivable-area and lane-segmentation tasks. Results are reported as mIoU (%) for drivable-area segmentation, and as Acc (%) and IoU (%) for lane segmentation, together with model complexity measured by FLOPs and parameter count.

Model | mIoU (%) | Acc (%) | IoU (%) | FLOPs | Parameters
DeepLabV3+ [17]              | 90.9     | –        | 29.8     | 30.7G       | 15.4M
SegFormer [18]               | 92.3 (4) | –        | 31.7     | 12.1G       | 7.2M
R-CNNP [37]                  | 90.2     | –        | 24.0     | –           | –
YOLOP [37]                   | 91.6     | –        | 26.5     | 8.11G       | 5.53M
YOLOv8 (multi-seg) [38]      | 84.2     | 81.7 (3) | 24.3     | –           | –
Sparse U-PDP [39]            | 91.5     | –        | 31.2     | –           | –
BILane [15]                  | 91.2     | –        | 31.3     | –           | 1.4M
EdgeUNet [40]                | 89.9     | –        | –        | –           | –
MobiP [25]                   | 90.3     | –        | 31.2     | 3.6G        | 5.8M
TwinLiteNet [13]             | 91.3     | 77.8     | 31.1     | 3.9G        | 0.44M
IALaneNet ResNet-18 [9]      | 90.5     | –        | 30.4     | 89.83G      | 17.05M
IALaneNet ResNet-34 [9]      | 90.6     | –        | 30.5     | 139.46G     | 27.16M
IALaneNet ConvNeXt-tiny [9]  | 91.3     | –        | 31.5     | 96.52G      | 18.35M
IALaneNet ConvNeXt-small [9] | 91.7     | –        | 32.5 (3) | 200.07G (6) | 39.97M (6)
DFFM Nano [16]               | 88.7     | –        | 25.3     | 0.72G       | 0.03M
DFFM Small [16]              | 90.8     | –        | 29.3     | 2.56G       | 0.13M
DFFM Medium [16]             | 92.0     | –        | 31.6     | 10.17G      | 0.5M
DFFM Large [16]              | 92.1 (5) | –        | 32.1 (5) | 39.79G (5)  | 2.2M (5)
TwinLiteNet+ Nano [14]       | 87.3     | 70.2     | 23.3     | 0.57G       | 0.03M
TwinLiteNet+ Small [14]      | 90.6     | 75.8     | 29.3     | 1.40G       | 0.12M
TwinLiteNet+ Medium [14]     | 92.0     | 79.1 (5) | 32.3 (4) | 4.63G       | 0.48M
TwinLiteNet+ Large [14]      | 92.9     | 81.9     | 34.2     | 17.58G (4)  | 1.94M (4)
TwinMixing tiny (ours)       | 91.1     | 76.6     | 29.8     | 1.08G       | 0.10M
TwinMixing base (ours)       | 92.4 (3) | 80.7 (4) | 33.2     | 3.95G       | 0.43M
TwinMixing large (ours)      | 92.8     | 82.4     | 34.2     | 14.25G (3)  | 1.50M (3)

The best and second-best results are marked in bold and underline in the typeset version; the 3rd-, 4th-, and 5th-ranked results are indicated here with (3), (4), and (5), respectively. Note that the ranking of FLOPs and Parameters is based only on the models with the highest combined mIoU and IoU scores, rather than all models in the table.

Among the top five models with the highest overall performance, measured by the sum of mIoU for drivable-area segmentation and IoU for lane segmentation, TwinMixing base demonstrates the best computational efficiency, requiring only 3.95 GFLOPs and 0.43M parameters while maintaining competitive accuracy (92.4% mIoU, 33.2% IoU). Its performance ranks just below TwinMixing large and TwinLiteNet+ Large, yet at significantly lower computational cost, highlighting its strong accuracy–efficiency balance. Furthermore, the TwinMixing tiny variant, tailored for ultra-lightweight embedded deployment, achieves 91.1% mIoU and 29.8% IoU with only 1.08 GFLOPs and 0.10M parameters, demonstrating excellent scalability under limited computational budgets. These results demonstrate that TwinMixing achieves an outstanding balance between segmentation accuracy and computational cost compared to existing multi-task segmentation models.
As illustrated in Figure 2, the proposed models consistently lie in the optimal region of the accuracy–efficiency curve, showing favorable trade-offs across configurations. This confirms the scalability of TwinMixing, whose hierarchical variants (tiny, base, large) maintain competitive performance under varying computational budgets.

Figure 9: Qualitative comparison of segmentation results under normal driving conditions (clear, overcast, and partly cloudy weather across highway, city-street, and residential scenes in daytime and dawn/dusk).

4.3. Qualitative results

4.3.1. Qualitative analysis on BDD100K

For a comprehensive evaluation, we also compare the TwinMixing base configuration with TwinLiteNet and TwinLiteNet+ Medium using their public pretrained checkpoints on BDD100K [8] under standardized configurations. These two models have parameter counts comparable to TwinMixing, making them appropriate baselines for comparison. Evaluation is conducted on the BDD100K validation split, with subsets stratified by time of day, scene type, and weather. Figure 9 presents a visual comparison across highway, city-street, and residential settings in benign weather under daylight and dawn/dusk conditions. TwinMixing produces cleaner drivable-area masks with tighter lane boundaries than TwinLiteNet and TwinLiteNet+. In residential and highway scenes under daytime lighting, it preserves lane geometry and road–horizon alignment, reducing the spillover into non-drivable regions frequently observed in the baselines. The model also remains reliable under illumination changes in highway scenes at dawn/dusk and in city-street scenes in both clear and overcast daytime conditions. TwinMixing maintains accurate lane delineation while suppressing spurious activations near vehicles and roadside structures.
Beyond benign settings, we qualitatively assess robustness in adverse environments (Figure 10), spanning nighttime lighting with city-street, residential, highway, and tunnel scenes under snow, rain, and fog. TwinMixing offers cleaner drivable-area estimates with tighter lane demarcation, outperforming TwinLiteNet and TwinLiteNet+. In a city-street scene under snowy weather at dawn, TwinMixing captures lane boundaries missed by the baselines and provides superior delineation. Across tunnel, residential, and highway scenes under these conditions, TwinMixing maintains stable drivable-area masks and sharper lane boundaries; in particular, in residential scenes at night and in rain, it yields a cleaner drivable-area mask than TwinLiteNet. These observations are consistent with the quantitative improvements in boundary quality and the reduction in false positives.

Figure 10: Qualitative comparison of segmentation results under challenging driving conditions (night, snowy, rainy, and foggy weather across city-street, residential, highway, and tunnel scenes).

4.3.2. Cross-dataset generalization analysis

Beyond in-domain evaluation on BDD100K, we further assess the cross-dataset generalization ability of TwinMixing on Cityscapes [1] and ACDC [3]. The Cityscapes dataset consists of urban scenes captured across different European cities, while ACDC focuses on diverse and challenging environmental conditions. Qualitative results of TwinMixing large across these datasets are visualized in Figure 11. The results indicate that TwinMixing generalizes well to unseen datasets, successfully recognizing drivable areas and lane regions in scenes captured in different geographic locations and under adverse environmental conditions, including nighttime, rain, fog, and snow. Nevertheless, owing to domain shift and the absence of training on these datasets, the model does not consistently achieve optimal accuracy in specific scenarios, highlighting remaining challenges in cross-domain robustness.

Figure 11: Qualitative visualization of cross-dataset generalization on Cityscapes and ACDC.

4.4. Ablation study

4.4.1. Multi-task and single-task models

Table 4 contrasts TwinMixing trained as two separate single-task models with a single multi-task model for drivable-area and lane segmentation.

Table 4
Comparative evaluation of the TwinMixing model under multi-task and single-task learning settings, reporting segmentation accuracy and computational efficiency metrics.

Method | mIoU (%) | Acc (%) | IoU (%) | Parameters | FLOPs
Single-task (drivable) | 92.3        | ✘            | ✘            | 0.41M          | 3.50G
Single-task (lane)     | ✘           | 81.1         | 33.5         | 0.41M          | 3.50G
Multi-task             | 92.4 (↑0.1) | 80.7 (↓0.4)  | 33.2 (↓0.3)  | 0.43M (↑0.02M) | 3.95G (↑0.45G)

The multi-task configuration attains a slightly higher drivable-area mIoU (92.4% vs. 92.3%), a marginally lower lane accuracy (80.7% vs. 81.1%), and a slightly lower lane IoU (33.2% vs. 33.5%). Notably, a single multi-task model requires only 0.434M parameters and 3.950G FLOPs to produce both outputs, whereas deploying the two single-task models together requires about 0.823M parameters and 6.998G FLOPs. Relative to a single-task model, the multi-task variant adds only ∼0.023M parameters and ∼0.451G FLOPs, yet replaces two models with one, delivering near-parity accuracy at roughly half the total compute and parameter budget.
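The "roughly half" claim follows directly from the figures reported in Table 4; a quick arithmetic check:

```python
# Figures reported in Table 4 and the accompanying text.
multi_params, multi_flops = 0.434, 3.950   # one multi-task model (M, GFLOPs)
single_params, single_flops = 0.41, 3.50   # per single-task model

# Deploying two single-task models vs. one multi-task model.
two_params = 2 * single_params             # ~0.82 M parameters
two_flops = 2 * single_flops               # ~7.00 GFLOPs

param_saving = 1 - multi_params / two_params  # ~0.47, i.e. roughly half
flops_saving = 1 - multi_flops / two_flops    # ~0.44
```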
These results indicate that TwinMixing offers a balanced and efficient trade-off between performance and complexity for joint drivable-area and lane segmentation in multi-task settings.

4.4.2. Results of TwinMixing in different conditions

Table 5 presents the performance of TwinMixing base across a range of environmental conditions in BDD100K. We evaluate the model under favorable conditions with clear illumination and simple scene layouts, such as daytime, highway, and residential scenes, before moving to more challenging scenarios, including nighttime, rainy, and tunnel environments. The results indicate that TwinMixing performs strongly under normal conditions, maintaining high segmentation accuracy, while its performance degrades moderately in adverse environments such as rainy and tunnel conditions. Notably, despite the challenging lighting in night scenes, the model still achieves competitive results (92.5% mIoU for drivable area and 32.9% IoU for lane segmentation), which can be attributed to its exposure to a large number of nighttime images (~28k of 70k) during training. Adverse conditions primarily affect lane segmentation, as reflective surfaces, motion blur, and low-contrast lane markings reduce model confidence. Among all settings, tunnel scenes pose the greatest challenge, with lane IoU dropping significantly below the overall performance. These findings demonstrate that TwinMixing maintains robust, consistent accuracy in drivable-area segmentation, whereas lane segmentation remains more sensitive to challenging conditions such as foggy, snowy, and tunnel environments.
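The stratified evaluation behind Table 5 can be sketched as follows. This is a simplified single-class illustration with hypothetical per-image pixel counts; the actual mIoU additionally averages IoU over classes.

```python
from collections import defaultdict

def iou_by_condition(records):
    """Aggregate intersection/union pixel counts within each condition
    subset, then report IoU per condition (single-class simplification)."""
    inter = defaultdict(float)
    union = defaultdict(float)
    for condition, i, u in records:
        inter[condition] += i
        union[condition] += u
    return {c: inter[c] / union[c] for c in union}

# Hypothetical per-image (condition, intersection, union) pixel counts.
records = [
    ("daytime", 930.0, 1000.0),
    ("daytime", 900.0, 1000.0),
    ("tunnel", 600.0, 1000.0),
]
```

Summing intersections and unions before dividing (rather than averaging per-image IoUs) keeps the metric consistent with how IoU is normally aggregated over a subset.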
Table 5
Performance of TwinMixing base across environmental conditions on both drivable-area segmentation and lane segmentation.

Condition | Drivable area mIoU (%) | Lane IoU (%)
Daytime     | 93.2 | 35.1
Night       | 92.5 | 32.9
Snowy       | 91.6 | 31.6
Rainy       | 90.2 | 32.0
Foggy       | 92.4 | 29.6
Highway     | 93.3 | 33.8
Residential | 92.8 | 34.4
Tunnel      | 89.8 | 28.9
Overall (TwinMixing base) | 92.4 | 33.2

4.4.3. Ablation study on the Dual Branch Upsampling module

To evaluate the contribution of each component within the proposed Dual Branch Upsampling (DBU) module, we conduct an ablation analysis by selectively removing the fine-detailed branch and the coarse-grained branch. The results are summarized in Table 6. When either the fine-detailed branch (transposed-convolution based) or the coarse-grained branch (bilinear-interpolation based) is omitted, a consistent performance degradation is observed on both the drivable-area and lane segmentation tasks. Specifically, removing the fine-detailed branch results in a 0.2% drop in mIoU and a 0.3% reduction in IoU, indicating its importance in restoring spatial details. Similarly, eliminating the coarse-grained branch yields slightly lower accuracy and IoU, confirming its role in providing spatial smoothness and stability. These findings demonstrate that the complementary design of the two branches enables DBU to achieve a better balance between fine-grained reconstruction and smooth upsampling, thereby improving overall segmentation performance.

Table 6
Ablation of the Dual Branch Upsampling (DBU) module. Removing either branch degrades performance, confirming their complementary effect on segmentation accuracy.

Method | mIoU (%) | Acc (%) | IoU (%)
DBU                 | 92.4        | 80.7        | 33.2
w/o fine-detailed   | 92.2 (↓0.2) | 80.3 (↓0.4) | 32.9 (↓0.3)
w/o coarse-grained  | 92.3 (↓0.1) | 80.3 (↓0.4) | 33.0 (↓0.2)

4.4.4. Sensitivity analysis of dilation rate and group size in the EPM module

In this section, we analyze the sensitivity of the proposed model to the dilation-rate and group-size configurations of the EPM module. All experiments use the TwinMixing tiny configuration, trained for 50 epochs under identical settings; each dilation-rate and group-size variant is compared against the default TwinMixing tiny configuration, also trained for 50 epochs, to ensure a fair comparison.

To evaluate the effect of dilation rates, we replace the default multi-dilation setting in the EPM module with fixed dilation rates of 1, 4, and 16, while keeping all other components unchanged. As reported in Table 7, the default multi-dilation configuration achieves the best performance, attaining 90.9% mIoU for drivable-area segmentation and 29.4% IoU for lane segmentation. In contrast, using a single fixed dilation rate consistently degrades performance, with more pronounced drops for lane segmentation. These results indicate that the multi-dilation design is more effective at capturing multi-scale contextual information and mitigating gridding artifacts than fixed-dilation alternatives.

We further examine sensitivity to the group size of the grouped convolutions. While the input and output channel dimensions determine the default group size, we override this rule and set the number of groups to {1, 2, 4, 8, 16, 32}. As shown in Table 8, varying the group size primarily affects model complexity, as measured by the number of parameters and FLOPs, while segmentation accuracy remains largely stable across configurations. Smaller group sizes yield marginal accuracy improvements at the expense of increased computation, whereas larger group sizes reduce the number of parameters and FLOPs with negligible performance degradation.
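The parameter trend in Table 8 follows directly from how grouped convolution partitions channels; a generic sketch (with hypothetical channel counts, not the exact EPM dimensions):

```python
def grouped_conv_params(c_in, c_out, k, groups, bias=False):
    """Parameter count of a 2-D grouped convolution: each of the `groups`
    groups maps c_in/groups input channels to c_out/groups output channels,
    so weights shrink by a factor of `groups` relative to a dense conv."""
    assert c_in % groups == 0 and c_out % groups == 0
    params = groups * (c_in // groups) * (c_out // groups) * k * k
    return params + (c_out if bias else 0)
```

For example, a hypothetical 64-to-64 channel 3×3 layer has 36,864 weights when dense (groups=1) but only 4,608 with groups=8, mirroring how larger group sizes shrink parameters and FLOPs in Table 8.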
Overall, the results indicate a clear efficiency–accuracy trade-off and confirm that TwinMixing is not highly sensitive to the specific group-size choice.

Table 7
Sensitivity analysis of dilation-rate configurations in the EPM module.

Dilation rate | Drivable mIoU (%) | Lane Acc (%) | Lane IoU (%)
★ (default, multi-dilation) | 90.9 | 76.4 | 29.4
1  | 89.6 | 75.4 | 28.2
4  | 90.2 | 74.8 | 27.7
16 | 89.8 | 74.8 | 27.7

★ indicates the default setting.

Table 8
Sensitivity analysis of group-size configurations in the EPM module.

Groups | Parameters | FLOPs | Drivable mIoU (%) | Lane IoU (%)
★ (default) | 98.2K  | 1.084G | 90.9 | 29.4
1           | 115.0K | 1.222G | 91.3 | 30.4
2           | 104.5K | 1.118G | 91.2 | 29.5
4           | 100.7K | 1.094G | 90.8 | 29.3
8           | 99.2K  | 1.088G | 90.9 | 29.3
16          | 98.5K  | 1.086G | 90.9 | 29.2
32          | 98.2K  | 1.084G | 90.9 | 29.3

★ indicates the default setting.

4.5. Quantization and deployment

Table 9 presents the quantization performance of TwinMixing under three numerical precisions: FP32, FP16, and INT8. For the INT8 configuration, we employ Quantization-Aware Training (QAT), which integrates quantization operations directly into the training process. Instead of retraining the model from scratch, we fine-tune the pre-trained FP32 model for 10 additional epochs with QAT. As shown, quantization has a negligible impact on segmentation accuracy while significantly reducing computational cost and memory usage, a crucial advantage for real-time deployment on embedded hardware. In particular, the FP16 configuration achieves accuracy nearly identical to the FP32 baseline. Transitioning to INT8 introduces only a marginal performance degradation (< 1%) while reducing model size by up to 4× and substantially lowering inference latency across various hardware platforms. These results confirm the robustness of TwinMixing to precision scaling, highlighting its suitability for efficient inference on edge and low-power devices without compromising segmentation quality.
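The up-to-4× size reduction follows from storing weights at lower precision; a minimal sketch using the base configuration's 0.43M parameters (weight storage only, ignoring quantization metadata such as scales and zero-points):

```python
def model_size_mb(n_params, bits_per_weight):
    """Approximate weight-storage footprint in megabytes."""
    return n_params * bits_per_weight / 8 / 1e6

n_params = 0.43e6                    # TwinMixing base parameter count
fp32 = model_size_mb(n_params, 32)   # ~1.72 MB
fp16 = model_size_mb(n_params, 16)   # ~0.86 MB
int8 = model_size_mb(n_params, 8)    # ~0.43 MB, a 4x reduction vs. FP32
```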
Table 9
Quantization results of TwinMixing across configurations (FP32, FP16, and INT8). Results are presented as mIoU (drivable-area segmentation) / IoU (lane segmentation).

Config | FP32 | FP16 | INT8 (QAT)
Tiny  | 91.1 / 29.8 | 91.1 / 29.8 | 90.3 / 29.2
Base  | 92.4 / 33.2 | 92.4 / 33.2 | 92.2 / 32.7
Large | 92.8 / 34.2 | 92.8 / 34.2 | 92.6 / 33.9

To further evaluate the real-time inference capability of TwinMixing on embedded hardware, we measure the latency of the tiny configuration across multiple NVIDIA Jetson platforms: AGX Orin, Xavier, Orin Nano, and TX2. Inference is executed with TensorRT at FP16 precision to leverage hardware acceleration and optimize runtime efficiency. The results, presented in Table 10, demonstrate that TwinMixing maintains consistently low inference latency across all devices, at 21.96 ms on AGX Orin and about 27 ms on Xavier and Orin Nano, while remaining under 60.81 ms even on the older TX2 board. These results indicate that TwinMixing is well suited for real-time deployment on a wide range of embedded systems with varying computational capacities.

Table 10
Inference latency (in milliseconds) of TwinMixing tiny on various NVIDIA Jetson devices, reported as mean and standard deviation over 500 independent runs.

Device  | AGX Orin     | Xavier       | Orin Nano    | TX2
Latency | 21.96 ± 0.23 | 26.78 ± 0.38 | 27.53 ± 0.08 | 60.81 ± 0.13

5. Discussion and Conclusion

5.1. Limitations

Although TwinMixing demonstrates strong overall performance, several limitations remain and merit further investigation. First, despite its robustness across diverse driving environments, the model's drivable-area and lane segmentation accuracy degrades under adverse conditions such as snow, rain, or tunnels, as shown in Table 5. This suggests that TwinMixing's reliability under varying illumination and visibility could be enhanced through more advanced data augmentation or domain-adaptation techniques. Second, while the proposed Efficient Pyramid Mixing (EPM) and Dual-Branch Upsampling (DBU) modules achieve an excellent balance between accuracy and efficiency, their configurations still depend on manually tuned architectural hyperparameters (e.g., dilation rates, grouping factors, and repetition depth). Incorporating neural architecture search (NAS) or adaptive parameterization strategies could further optimize performance and generalizability. Finally, TwinMixing currently focuses solely on segmentation tasks. Extending it to support additional perception tasks such as object detection, depth estimation, or panoptic segmentation could form a unified perception backbone for broader autonomous-driving applications.

We additionally present representative failure cases of TwinMixing (large configuration) in Figure 12. While the model demonstrates competitive performance across the conditions shown in Figures 9 and 10, it still struggles in low-light nighttime scenes and under adverse weather such as rain or snow (Figure 12). In these scenarios, strong illumination contrast, reflections on wet road surfaces, and increased visual noise hinder effective feature extraction, resulting in inaccurate drivable-area boundaries and incomplete or fragmented lane segmentation. These observations highlight the model's current limitations under extreme environmental conditions and suggest that further robustness enhancements, such as illumination-aware training strategies or multimodal cues, could be beneficial.

Figure 12: Visualization of challenging failure cases for TwinMixing large.

5.2.
Conclusion

In this work, we present TwinMixing, a shuffle-aware, lightweight multi-task segmentation model designed explicitly for drivable-area and lane segmentation in autonomous driving. The proposed Efficient Pyramid Mixing (EPM) module enhances multi-scale feature extraction, while the Dual Branch Upsampling (DBU) block improves decoding stability by combining fine-detailed and coarse-grained spatial reconstruction. Comprehensive experiments on the BDD100K dataset demonstrate that TwinMixing achieves a superior trade-off between accuracy and efficiency, outperforming state-of-the-art lightweight models such as TwinLiteNet+ and DFFM while requiring substantially fewer parameters and FLOPs. Moreover, its consistent real-time inference speed across NVIDIA Jetson devices confirms its potential for embedded and edge deployment. Future work will focus on improving model robustness under extreme weather and lighting conditions, exploring self-supervised pretraining for better generalization, and extending TwinMixing to broader panoptic perception tasks in autonomous systems.

References

[1] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3213–3223.
[2] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, "The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3234–3243.
[3] C. Sakaridis, D. Dai, and L.
Van Gool, "ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding," in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 10745–10755.
[4] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, "Playing for data: Ground truth from computer games," in Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 102–118.
[5] R. Koteczki and B. E. Balassa, "Systematic literature review of user acceptance factors of advanced driver assistance systems across different social groups," Transportation Research Interdisciplinary Perspectives, vol. 31, p. 101486, 2025. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2590198225001654
[6] T. Neumann, "Analysis of advanced driver-assistance systems for safe and comfortable driving of motor vehicles," Sensors, vol. 24, no. 19, 2024. [Online]. Available: https://www.mdpi.com/1424-8220/24/19/6223
[7] M. W. Khattak, K. Brijs, T. M. Tran, T. A. Trinh, A. T. Vu, and T. Brijs, "Acceptance towards advanced driver assistance systems (ADAS): A validation of the unified model of driver acceptance (UMDA) using structural equation modelling," Transportation Research Part F: Traffic Psychology and Behaviour, vol. 105, pp. 284–305, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1369847824001803
[8] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell, "BDD100K: A diverse driving dataset for heterogeneous multitask learning," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2633–2642.
[9] W. Tian, X. Yu, and H. Hu, "Interactive attention learning on detection of lane and lane marking on the road by monocular camera image," Sensors, vol. 23, no. 14, 2023.
[10] C. Han, Q. Zhao, S. Zhang, Y.
Chen, Z. Zhang, and J. Yuan, "YOLOPv2: Better, faster, stronger for panoptic driving perception," 2022.
[11] J. Zhan, J. Liu, Y. Wu, and C. Guo, "Multi-task visual perception for object detection and semantic segmentation in intelligent driving," Remote Sensing, vol. 16, no. 10, 2024. [Online]. Available: https://www.mdpi.com/2072-4292/16/10/1774
[12] J. Zhan, Y. Luo, C. Guo, Y. Wu, J. Meng, and J. Liu, "YOLOPX: Anchor-free multi-task learning network for panoptic driving perception," Pattern Recognition, vol. 148, p. 110152, 2024.
[13] Q.-H. Che, D.-P. Nguyen, M.-Q. Pham, and D.-K. Lam, "TwinLiteNet: An efficient and lightweight model for driveable area and lane segmentation in self-driving cars," in 2023 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), 2023, pp. 1–6.
[14] Q.-H. Che, D.-T. Le, M.-Q. Pham, V.-T. Nguyen, and D.-K. Lam, "TwinLiteNet+: An enhanced multi-task segmentation model for autonomous driving," Computers and Electrical Engineering, vol. 128, p. 110694, 2025. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0045790625006378
[15] Z. Hu and Y. Shen, "Lane detection based on boundary feature enhancement and information interaction," Academic Journal of Computing & Information Science, vol. 8, no. 1, pp. 57–63, 2025. [Online]. Available: https://doi.org/10.25236/AJCIS.2025.080108
[16] I. Papadeas, L. Tsochatzidis, and I. Pratikakis, "Dual-task learning for real-time semantic segmentation in autonomous driving," IEEE Transactions on Intelligent Vehicles, pp. 1–10, 2025.
[17] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham: Springer International Publishing, 2018, pp. 833–851.
[18] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M.
Alvarez, and P. Luo, "SegFormer: Simple and efficient design for semantic segmentation with transformers," in Neural Information Processing Systems (NeurIPS), 2021.
[19] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, "Masked-attention mask transformer for universal image segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 1290–1299.
[20] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, "ENet: A deep neural network architecture for real-time semantic segmentation," arXiv, vol. abs/1606.02147, 2016.
[21] R. P. K. Poudel, S. Liwicki, and R. Cipolla, "Fast-SCNN: Fast semantic segmentation network," arXiv, vol. abs/1902.04502, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:60441195
[22] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi, "ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation," in Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham: Springer International Publishing, 2018, pp. 561–580.
[23] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6848–6856.
[24] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017.
[25] M. Ye and J. Zhang, "MobiP: A lightweight model for driving perception using MobileNet," Frontiers in Neurorobotics, vol. 17, 2023. [Online]. Available: https://www.frontiersin.org/journals/neurorobotics/articles/10.3389/fnbot.2023.1291875
[26] Y. Hou, Z. Ma, C. Liu, and C. C.
Loy , “Learning lightweight lane detection cnns by self attention distillation, ” 2019 IEEE/CVF Interna- tional Confer ence on Computer Vision (ICCV) , pp. 1013–1021, 2019. [27] Y . Qian, J. M. Dolan, and M. Y ang, “Dlt-net: Joint detection of drivable areas, lane lines, and traffic objects, ” IEEE T ransactions on Intellig ent T r anspor tation Systems , vol. 21, no. 11, pp. 4670–4679, 2020. [28] H. Zhao, J. Shi, X. Qi, X. W ang, and J. Jia, “Pyramid scene parsing netw ork,” in Proceedings of the IEEE Confer ence on Computer Vision and Patt er n Recognition (CVPR) , July 2017. [29] S.-A . Liu, H. Xie, H. Xu, Y . Zhang, and Q. T ian, “Partial class activation attention for semantic segmentation,” in 2022 IEEE/CVF Confer ence on Computer Vision and P attern Recognition (CVPR) , 2022, pp. 16 815–16 824. [30] Q.-H. Che and D.-K. Lam, “Trilitenet: Lightweight model for multi- task visual perception, ” IEEE Access , vol. 13, pp. 50 152–50 166, 2025. [31] I. Loshchilo v and F . Hutter, “Decoupled weight deca y regularization, ” in International Conference on Lear ning Representations , 2017. [32] T .- Y . Lin, P . Goyal, R. Girshick, K. He, and P . Dollar , “Focal loss f or dense object detection, ” in Proceedings of the IEEE International Confer ence on Com puter Vision (ICCV) , Oct 2017. [33] S. S. M. Salehi, D. Erdogmus, and A . Gholipour , “T versky loss function f or image segment ation using 3d fully conv olutional deep netw orks,” in Machine Learning in Medical Imaging , Q. W ang, Y . Shi, H.-I. Suk, and K. Suzuki, Eds. Cham: Spr inger Inter national Publishing, 2017, pp. 379–387. M-K Do et-al.: Pr epr int submitted to Elsevier P age 14 of 15 T winMixing Model [34] C. H. Sudre, W . Li, T . V ercauteren, S. Ourselin, and M. Jorge Cardoso, “Generalised dice ov erlap as a deep learning loss function for highly unbalanced segmentations, ” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support . 
Cham: Spr inger Inter national Publishing, 2017, pp. 240–248. [35] J. W ang, Q. M. Jonathan Wu, and N. Zhang, “Y ou only look at once f or real-time and generic multi-t ask, ” IEEE T r ansactions on V ehicular T ec hnology , vol. 73, no. 9, pp. 12 625–12 637, 2024. [36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual lear ning for image recognition, ” in 2016 IEEE Conf erence on Computer Vision and Patt er n Recognition (CVPR) , 2016, pp. 770–778. [37] D. Wu, M.-W . Liao, W .- T . Zhang, X. W ang, X. Bai, W . Cheng, and W .- Y . Liu, “Y olop: Y ou only look once for panoptic dr iving perception, ” Machine Intellig ence Researc h , vol. 19, pp. 550 – 562, 2021. [38] G. Jocher , A. Chaurasia, and J. Qiu, “Ultralytics Y OLO, ” 2023. [Online]. A vailable: https://github.com/ultralytics/ultralytics [39] H. W ang, M. Qiu, Y . Cai, L. Chen, and Y . Li, “Sparse u-pdp: A unified multi-task framew ork for panoptic dr iving perception, ” IEEE T r ansactions on Intelligent T r anspor tation Systems , vol. 24, no. 10, pp. 11 308–11 320, 2023. [40] X. Sheng, J.-Z. Zhang, Z. W ang, and Z.- T . Duan, “Edgeunet: Edge- guided multi-loss networ k for dr ivable area and lane segment ation in autonomous vehicles, ” IEEE T ransactions on Intelligent T r anspor ta- tion Systems , vol. 26, no. 2, pp. 2530–2542, 2025. M-K Do et-al.: Pr epr int submitted to Elsevier P age 15 of 15
