AASeg: Attention Aware Network for Real Time Semantic Segmentation
Semantic segmentation is a fundamental task in computer vision that involves dense pixel-wise classification for scene understanding. Despite significant progress, achieving high accuracy while maintaining real-time performance remains a challenging trade-off, particularly for deployment in resource-constrained or latency-sensitive applications. In this paper, we propose AASeg, a novel Attention-Aware Network for real-time semantic segmentation. AASeg effectively captures both spatial and channel-wise dependencies through lightweight Spatial Attention (SA) and Channel Attention (CA) modules, enabling enhanced feature discrimination without incurring significant computational overhead. To enrich contextual representation, we introduce a Multi-Scale Context (MSC) module that aggregates dense local features across multiple receptive fields. The outputs from the attention and context modules are adaptively fused to produce high-resolution segmentation maps. Extensive experiments on Cityscapes, ADE20K, and CamVid demonstrate that AASeg achieves a compelling trade-off between accuracy and efficiency, outperforming prior real-time methods.
Authors: Abhinav Sagar
Abhinav Sagar, University of Maryland, College Park, Maryland. asagar@umd.edu

1. Introduction

Semantic segmentation is a core problem in computer vision, where the goal is to assign a semantic label to every pixel in an image. This task plays a vital role in various real-world applications such as autonomous driving, robotic navigation, augmented reality, and medical imaging. However, despite recent advancements in deep learning-based segmentation models, achieving an optimal balance between segmentation accuracy and inference speed remains a significant challenge, particularly in resource-constrained or latency-critical environments.
Early work such as the Fully Convolutional Network (FCN) [24] demonstrated that convolutional neural networks could be extended to pixel-level prediction tasks. While FCNs achieved strong semantic representation, they struggled with accurately capturing fine-grained details, especially around object boundaries. To address this, later works employed atrous (dilated) convolutions [2, 40] to expand the receptive field without downsampling, thereby improving context aggregation. However, this enhancement often came with a substantial increase in computational overhead, making such methods less suitable for real-time deployment.

In response to the growing demand for efficient segmentation models, several lightweight architectures have been proposed. SegNet [1] introduced an encoder-decoder structure with skip connections for improved inference speed. Networks like ENet [28], ESPNet [25], and Fast-SCNN [29] further pushed the boundary of real-time semantic segmentation through architectural simplification, depth-wise separable convolutions, and efficient encoder-decoder designs. While these models improve throughput (measured in frames per second, FPS), they often do so at the cost of degraded segmentation accuracy, particularly in challenging scenarios involving small objects or complex spatial layouts.

To address these limitations, we propose AASeg (Attention-Aware Segmentation Network), a real-time semantic segmentation framework that explicitly enhances both spatial and channel-wise feature representations using attention mechanisms. Our network introduces three key components: (1) a Spatial Attention (SA) module to emphasize salient regions in the spatial domain, (2) a Channel Attention (CA) module to adaptively reweight informative channels, and (3) a Multi-Scale Context (MSC) module to aggregate dense local context across varying receptive fields without significant computational burden.
By integrating these modules into a lightweight yet expressive architecture, AASeg achieves high segmentation accuracy while maintaining fast inference speeds suitable for real-time applications.

We evaluate AASeg on standard benchmarks including Cityscapes, ADE20K, and CamVid, and show that it outperforms most existing real-time methods. Specifically, AASeg achieves 74.4% mIoU on the Cityscapes test set while operating at 202.7 FPS, setting a new benchmark for high-speed semantic segmentation. The speed-accuracy trade-off of state-of-the-art semantic segmentation methods is illustrated in Figure 1, which presents results on the Cityscapes test set. Our proposed method, AASeg, achieves a higher mean Intersection-over-Union (mIoU) while maintaining superior inference speed (FPS), outperforming most existing real-time approaches.

Figure 1. Comparison of speed vs. accuracy on the Cityscapes test set. AASeg achieves a compelling balance, delivering high segmentation accuracy while maintaining real-time performance.

2. Related Work

Real-time semantic segmentation has gained significant attention due to its crucial role in safety-critical applications such as autonomous driving and robotic perception. Recent advances aim to optimize the trade-off between segmentation accuracy and computational efficiency. We categorize related work into four key areas: attention mechanisms, lightweight backbones, multi-resolution and multi-path designs, and contextual feature aggregation.

2.1. Attention Mechanisms

Attention modules have proven effective in improving feature representation by modeling long-range dependencies. The Dual Attention Network (DANet) [7] introduces spatial and channel attention modules to adaptively enhance the receptive field, leading to more discriminative features.
Similarly, Squeeze-and-Excitation Networks [11] apply channel-wise recalibration to focus on the most informative feature channels, improving both accuracy and efficiency. These attention-based designs motivate our use of spatial and channel attention to enhance feature encoding in a lightweight framework.

2.2. Lightweight Backbones and Efficient Convolutions

To reduce computation, several approaches design compact network architectures. ENet [28] aggressively prunes convolutional filters and applies early downsampling to minimize latency. ESPNet [25] introduces an Efficient Spatial Pyramid (ESP) module to decompose standard convolutions into parallel dilated convolutions, enabling fast and accurate segmentation. DFANet [18] employs a lightweight backbone and depth-wise separable convolutions along with multi-scale feature fusion to balance performance and speed.

2.3. Multi-Resolution and Multi-Path Architectures

Multi-branch architectures are effective in capturing both global semantics and fine spatial details. ICNet [44] proposes a cascade framework with high-, medium-, and low-resolution branches to progressively refine segmentation. BiSeNet [37] introduces a two-path structure consisting of a Spatial Path (SP) for preserving spatial resolution and a Context Path (CP) for enlarging the receptive field. BiSeNetV2 [39] further refines this concept using a multi-path fusion strategy to improve both accuracy and efficiency.

2.4. Contextual Feature Aggregation

Several works leverage contextual information to improve prediction quality. SwiftNet [27] uses a ResNet-based encoder with lateral connections to recover spatial details at high speed. RefineNet [22] employs multi-path refinement to progressively enhance feature maps, but relies heavily on deep backbones and neglects global context efficiency.
Our main contributions are summarized as follows:
• We propose AASeg, a novel Attention-Aware Network designed specifically for real-time semantic segmentation, targeting the optimal balance between accuracy and efficiency.
• AASeg introduces a modular architecture consisting of three core components: a Spatial Attention (SA) module that models spatial dependencies to emphasize informative regions, a Channel Attention (CA) module that adaptively reweights feature channels to capture discriminative semantic features, and a Multi-Scale Context (MSC) module that aggregates contextual information across multiple receptive fields to enhance hierarchical feature representation.
• We conduct extensive experiments and ablation studies on three benchmark datasets (Cityscapes, CamVid, and ADE20K), demonstrating that AASeg not only improves segmentation accuracy but also significantly boosts inference speed compared to most other real-time methods.

3. Background

3.1. Spatial Information

Preserving spatial information is critical in semantic segmentation, as it directly impacts the ability to accurately delineate object boundaries and fine-grained structures. Modern segmentation frameworks incorporate spatial encoding mechanisms to retain this information throughout the network. For instance, dilated convolutions, as used in DUC [35], PSPNet [43], and DeepLabv2 [2], allow for enlarging the receptive field without reducing the spatial resolution of feature maps. This approach helps preserve detailed spatial cues crucial for high-precision predictions.

3.2. Context Information

In addition to spatial details, capturing global and local context is essential for understanding object relationships and scene structure. Contextual information enables the network to differentiate between visually similar regions based on surrounding cues. To this end, several methods adopt multi-scale feature extraction strategies.
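To make the dilation effect concrete: the effective spatial extent of a k × k kernel with dilation rate d is k + (k − 1)(d − 1), so the receptive field grows without any downsampling and without adding parameters. A minimal sketch (the helper name is ours, not from the paper):

```python
def effective_kernel(k: int, d: int) -> int:
    """Effective spatial extent of a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

# A 3x3 kernel at increasing dilation rates covers an ever-wider window
# while keeping the same 9 weights and the same output resolution.
for d in (1, 2, 4, 8):
    print(d, effective_kernel(3, d))  # d=1 -> 3, d=2 -> 5, d=4 -> 9, d=8 -> 17
```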
For example, ASPP (Atrous Spatial Pyramid Pooling) in DeepLab [2] captures features at multiple scales using dilated convolutions with varying dilation rates. Similarly, PSPNet [43] introduces a Pyramid Pooling Module (PPM) that performs average pooling at different scales to aggregate global context. These methods enhance the model's ability to recognize objects under varying spatial extents and scene complexities.

3.3. Attention Mechanisms

Attention mechanisms have been increasingly adopted in segmentation networks to selectively enhance relevant features while suppressing redundant ones. Squeeze-and-Excitation Networks [11] pioneered the use of channel-wise attention, significantly improving performance in image recognition tasks. For semantic segmentation, non-local attention and contextual attention approaches have proven effective. For example, [38] proposed an attention module that refines feature representations by modeling global dependencies.

3.4. Feature Fusion

Feature fusion plays a pivotal role in combining different sources of information, particularly the integration of low-level spatial details and high-level semantic context. Architectures such as BiSeNet utilize a dual-path design to separately capture spatial and contextual features, which are then fused to produce the final output. Effective fusion strategies have also been applied to related tasks, including classification, detection, and instance segmentation. In the context of semantic segmentation, proper fusion allows the network to benefit from both sharp localization and robust semantic understanding.

4. Proposed Method

4.1. Datasets

The following datasets have been used to benchmark our results:
• Cityscapes: Used for urban street-scene segmentation. The 5000 annotated images are used in our experiments, divided into 2975, 500, and 1525 images for training, validation, and testing, respectively.
• ADE20K: This dataset contains labels for 150 object categories. It includes 20k, 2k, and 3k images for training, validation, and testing, respectively.
• CamVid: This dataset is used for semantic segmentation in autonomous-driving scenarios. It is composed of 701 densely annotated images.

4.2. Network Architecture

The primary goal of semantic segmentation is to map an input RGB image X ∈ R^(H×W×3) to a semantic label map Y ∈ R^(H×W×C), where H and W are the image dimensions and C is the number of semantic classes. The input image X is transformed into hierarchical feature representations {F_l}_{l=1}^{3}, where F_l ∈ R^(H_l×W_l×C_l) is the feature map from the l-th stage. Unlike many existing segmentation frameworks that rely on pre-trained backbones (e.g., ResNet, MobileNet), our proposed architecture, AASeg, is designed from scratch with lightweight modules optimized for real-time performance. The input image is initially passed through a convolutional block consisting of a convolutional layer, batch normalization, and ReLU activation.

We define a convolutional operation W_n(x) as:

W_n(x) = W_{n×n} ⊙ x + b   (1)

where ⊙ denotes convolution, W_{n×n} is an n × n kernel, x is the input feature map, and b is the bias term.

4.2.1. Multi-Scale Context (MSC) Module

To capture rich local and global context, we introduce a Multi-Scale Context (MSC) module. The input feature map is processed using convolutions of varying kernel sizes: 1×1, 3×3, and 5×5. These outputs are fused and then passed through a 1×1 convolution to reduce dimensionality from 2048 to 256, producing an output of size H × W × n_c. Subsequently, a series of dilated convolutions with increasing dilation rates (d = 3, 6, 12) are applied. The input at each stage is a concatenation of the original feature map and all preceding outputs.
Finally, the three dilated outputs are concatenated with the original feature map to enhance the context representation.

4.2.2. Spatial Attention (SA) Module

The Spatial Attention (SA) module captures spatial dependencies by focusing on relevant regions of the feature map. It is defined as:

f_SA(x) = σ(W_2(ReLU(W_1(x))))   (2)

where W_1 and W_2 are 1×1 convolution layers, σ denotes the sigmoid function, and ReLU is the activation function. This module generates a spatial attention mask that enhances informative locations in the feature map, as shown in Figure 2.

Figure 2. Details of our Spatial Attention (SA) module.

4.2.3. Channel Attention (CA) Module

To capture inter-channel dependencies and semantic saliency, we employ a Channel Attention (CA) module. It is defined as:

f_CA(x) = σ(W_2(ReLU(W_1(AvgPool(x)))))   (3)

Here, AvgPool denotes global average pooling, and W_1, W_2 are 1×1 convolution layers. The resulting attention map is used to modulate the input feature map channel-wise, as shown in Figure 3.

Figure 3. Details of our Channel Attention (CA) module.

4.2.4. Feature Aggregation

To integrate the outputs of the SA, CA, and MSC modules, we define a unified aggregation operation. Let ⊕ denote the concatenation operator. The intermediate concatenated feature x_concat is defined as:

x_concat = x_1 ⊕ x_2 ⊕ x_3   (4)

The output of the AASeg module is then computed as:

x_AASeg = (f_SA(x_concat) ⊗ x_concat) ⊕ (f_CA(x_concat) ⊗ x_concat) ⊕ (f_MSC(x_concat) ⊗ x_concat)   (5)

Here, ⊗ represents element-wise multiplication.

4.2.5. Convolutional Blocks and Output

We denote the output of the i-th convolutional block as:

x_i = ConvBlock_i(x_{i−1}, k_i)   (6)

where each ConvBlock_i consists of a convolutional layer with kernel size k_i, batch normalization, and ReLU activation.
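The attention gating of Eqs. (2), (3), and (5) can be sketched in NumPy. This is a toy illustration, not the paper's implementation: the 1×1 convolutions are reduced to per-pixel linear maps with random weights, the SA head is assumed to emit a one-channel mask, and the MSC branch of Eq. (5) is omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1x1(x, w):
    """A 1x1 convolution is a per-pixel linear map over channels.
    x: (H, W, C_in), w: (C_in, C_out)."""
    return np.einsum("hwc,cd->hwd", x, w)

def spatial_attention(x, w1, w2):
    """Eq. (2): sigmoid(W2(ReLU(W1(x)))) -> one mask value per location."""
    return sigmoid(conv1x1(np.maximum(conv1x1(x, w1), 0.0), w2))

def channel_attention(x, w1, w2):
    """Eq. (3): global average pool, then the 1x1 bottleneck -> per-channel mask."""
    pooled = x.mean(axis=(0, 1), keepdims=True)  # (1, 1, C)
    return sigmoid(conv1x1(np.maximum(conv1x1(pooled, w1), 0.0), w2))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 8))    # toy feature map x_concat
w1 = rng.standard_normal((8, 4))      # shared bottleneck weights (illustrative)
w2_sa = rng.standard_normal((4, 1))   # SA head: 1 output channel
w2_ca = rng.standard_normal((4, 8))   # CA head: C output channels

sa = spatial_attention(x, w1, w2_sa)  # (4, 4, 1), values in (0, 1)
ca = channel_attention(x, w1, w2_ca)  # (1, 1, 8), values in (0, 1)

# Eq. (5) gates the shared feature map with each mask (element-wise, with
# broadcasting) and concatenates the gated results along the channel axis.
fused = np.concatenate([sa * x, ca * x], axis=-1)  # (4, 4, 16)
```

Note how broadcasting realizes the two modulation patterns: the SA mask varies over locations and is shared across channels, while the CA mask varies over channels and is shared across locations.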
To produce the final output, we apply a fusion operation that integrates all hierarchical outputs:

x_output = F(x_1, x_2, ..., x_n)   (7)

4.2.6. Overall Architecture

The full AASeg architecture is depicted in Figure 4, illustrating the flow from input through the spatial, channel, and context attention modules to the final segmentation prediction.

Figure 4. Overview of the proposed AASeg architecture. "c" denotes concatenation.

4.3. Loss Functions

We use the cross-entropy loss to measure the difference between the network's forward-propagation result and the ground truth of the samples. The cross-entropy loss is calculated as:

L_ce = −(1/N) Σ_{n=1}^{N} [y_n log ŷ_n + (1 − y_n) log(1 − ŷ_n)]   (8)

where N denotes the total number of samples, y_n is the ground-truth label, and ŷ_n is the predicted probability that the forward-propagation result is true.

We also use an auxiliary supervision loss L_aux to improve the model's performance and ease optimization. The auxiliary loss is defined as:

L_aux = −(1/(B·N)) Σ_{i=1}^{B} Σ_{j=1}^{N} Σ_{k=1}^{K} I(g_{ij} = k) · log( exp(p_{ij,k}) / Σ_{m=1}^{K} exp(p_{ij,m}) )   (9)

I(g_{ij} = k) = 1 if g_{ij} = k, and 0 otherwise   (10)

where B is the mini-batch size, N is the number of pixels in every batch, K is the number of categories, p_{ij,k} is the prediction of the j-th pixel in the i-th sample for the k-th class, and I(g_{ij} = k) is the indicator function defined in Equation 10.

The class attention loss L_cls from the channel attention module is also used. The class attention loss is defined as follows:

L_cls = −(1/(B·N)) Σ_{i=1}^{B} Σ_{j=1}^{N} Σ_{k=1}^{K} I(g_{ij} = k) · log( exp(a_{ij,k}) / Σ_{m=1}^{K} exp(a_{ij,m}) )   (11)

where a_{ij,k} is the value generated by the class attention map for the j-th pixel in the i-th sample and the k-th class.
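The per-pixel softmax cross-entropy of Eq. (9) can be written compactly in NumPy; the indicator I(g_{ij} = k) simply selects the log-probability of the true class at each pixel (a minimal sketch with a toy batch; the function name is ours):

```python
import numpy as np

def aux_loss(logits, labels):
    """Per-pixel softmax cross-entropy, as in Eq. (9).
    logits: (B, N, K) raw scores p_{ij,k}; labels: (B, N) integer classes g_{ij}."""
    # Numerically stable log-softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_softmax = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    B, N, _ = logits.shape
    # The indicator I(g_ij = k) picks out the true class's log-probability.
    picked = log_softmax[np.arange(B)[:, None], np.arange(N)[None, :], labels]
    return -picked.sum() / (B * N)

logits = np.array([[[2.0, 0.0, 0.0],
                    [0.0, 3.0, 0.0]]])   # B=1, N=2 pixels, K=3 classes
labels = np.array([[0, 1]])              # both pixels correctly favored
loss = aux_loss(logits, labels)          # small positive value
```

Eq. (11) has exactly the same form with the class-attention values a_{ij,k} in place of the logits p_{ij,k}.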
We combine the three terms to form the final loss:

L = λ_1 L_ce + λ_2 L_cls + λ_3 L_aux   (12)

where λ_1, λ_2, and λ_3 are set to 1, 0.5, and 0.5, respectively, to balance the loss terms.

4.4. Implementation Details

The PyTorch deep learning framework is used to carry out our experiments. We use stochastic gradient descent (SGD) as the optimizer, with a batch size of 16, momentum of 0.9, and weight decay of 5e−4. The network is trained for 20K iterations with an initial learning rate of 0.01. The "poly" learning rate policy is used to decay the initial learning rate during training to reduce over-fitting. Data augmentation operations include random horizontal flipping, random resizing with a scale range of [1.0, 2.0], and random cropping with a crop size of 1024 × 1024.

4.5. Evaluation Metrics

For quantitative evaluation of our network's performance, we employ multiple standard metrics to comprehensively assess both accuracy and efficiency. The primary accuracy metric is the mean Intersection-over-Union (mIoU), which calculates the average class-wise overlap between the predicted segmentation and the ground truth, providing a robust measure of segmentation quality. To evaluate computational efficiency, we report the number of floating-point operations (FLOPs), reflecting the computational complexity of the model, and frames per second (FPS), indicating the real-time processing speed of the network during inference. Together, these metrics offer a balanced benchmarking framework that captures both the effectiveness and practicality of our method.

5. Results and Discussion

We present the segmentation accuracy and inference speed of our proposed Attention Aware Network (AASeg) on the Cityscapes validation and test sets in Table 1. The model is trained using both the Cityscapes training and validation sets before submitting predictions to the official Cityscapes online evaluation server.
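The mIoU metric reported throughout can be computed per class as intersection over union of the predicted and ground-truth masks, then averaged. A self-contained sketch on a toy label map (the function name is ours; real evaluations accumulate a confusion matrix over the whole dataset):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union: average over classes of
    |pred ∩ gt| / |pred ∪ gt|, skipping classes absent from both maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

gt = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1]])
pred = np.array([[0, 0, 1, 0],   # one pixel of class 1 mislabeled as 0
                 [0, 0, 1, 1]])
miou = mean_iou(pred, gt, num_classes=2)  # class IoUs 4/5 and 3/4 -> 0.775
```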
Table 1 compares AASeg with several recent state-of-the-art semantic segmentation methods in terms of mean Intersection-over-Union (mIoU) on both validation and test sets, input resolution, backbone architecture, and inference speed measured in frames per second (FPS). Our AASeg model achieves a competitive mIoU of 74.8% on the validation set and 74.4% on the test set without relying on any backbone network, demonstrating that our design effectively extracts robust features despite the absence of a large pretrained encoder. In terms of speed, AASeg processes images at 202.7 FPS at a resolution of 512 × 1024, outperforming many existing models with backbones, including BiSeNetV2 and the STDC variants, thereby offering a superior speed-accuracy trade-off for real-time applications. Compared to lightweight models like BiSeNetV2 and FasterSeg, AASeg provides higher segmentation accuracy while maintaining competitive inference speed. Although some models, such as STDC1-Seg50, achieve higher FPS, they exhibit lower mIoU, indicating that AASeg offers a more balanced performance for practical deployment. The results highlight the effectiveness of our spatial-channel attention modules combined with the multi-scale context module in enhancing feature representation and segmentation precision without incurring additional computational overhead. The qualitative results on the Cityscapes validation set are presented in Figure 5.

We evaluate the performance of our proposed AASeg model on the CamVid dataset. The comparative results with recent state-of-the-art methods are summarized in Table 2. Our experiments use an input resolution of 720 × 960 pixels, consistent with previous works to ensure a fair comparison. As shown in Table 2, AASeg achieves a high mean Intersection-over-Union (mIoU) score of 73.5%, which is competitive with other leading methods such as BiSeNetV2-L and the SFNet variants that employ pretrained backbones.
Notably, our model does not rely on any backbone network, yet it manages to deliver comparable or superior accuracy. In terms of inference speed, AASeg processes frames at 188.7 FPS, demonstrating its suitability for real-time applications in urban scene understanding and autonomous driving scenarios. This speed outperforms most backbone-based models, including SFNet and BiSeNetV1 with ResNet18, while closely approaching the fastest model, STDC1-Seg, which reaches 197.6 FPS but with slightly lower mIoU.

To further validate the effectiveness of our proposed AASeg model, Table 3 presents a performance comparison between AASeg and several recent state-of-the-art semantic segmentation methods evaluated on the ADE20K validation set. Our model achieves a mean Intersection-over-Union (mIoU) of 46.29%, outperforming well-established approaches such as PSPNet and SFNet by a noticeable margin. Specifically, AASeg surpasses PSPNet101 by nearly 3% and also improves upon the SFNet variants, demonstrating the strong representational power of our attention-aware architecture. Moreover, AASeg requires only 80.26 GFLOPs, which is significantly lower than PSPNet50 and PSPNet101 and only slightly above the most efficient SFNet variant at 75.7 GFLOPs. This balance of high accuracy and moderate computational complexity highlights AASeg's potential for real-world applications, especially in resource-constrained environments.

6. Ablation Study

6.1. Ablation Study on Upsampling Methods

We investigate the impact of different upsampling operations on the performance of our AASeg network. Specifically, we compare three commonly used techniques: bilinear upsampling, deconvolution (transposed convolution), and nearest-neighbor upsampling. The results of this ablation study on the Cityscapes validation set are summarized in Table 4.
Our experiments show that bilinear upsampling achieves the best performance with an mIoU of 79.2%, slightly outperforming both deconvolution (78.5%) and nearest-neighbor upsampling (78.3%). Although the differences are relatively small, bilinear upsampling provides a good trade-off between accuracy and computational simplicity. Deconvolution, while being a learnable upsampling method, does not significantly improve segmentation accuracy in our setup, possibly due to added complexity or training instability. Nearest-neighbor upsampling, the simplest and fastest method, also performs competitively but slightly trails the other two approaches. Based on these findings, bilinear upsampling is chosen as the default upsampling method in our network, as it balances accuracy, smoothness in feature maps, and computational efficiency. This insight reinforces the notion that simple interpolation-based upsampling methods can be effective for dense prediction tasks when combined with strong attention mechanisms.

6.2. Ablation Study on Kernel Size

To further optimize our network's performance and efficiency, we evaluate the effect of different convolutional kernel sizes within the attention modules. We experiment with kernel sizes of 1×1, 3×3, 5×5, and 7×7 on the Cityscapes validation set, with the results summarized in Table 5. The results demonstrate that increasing the kernel size generally leads to marginal improvements in segmentation accuracy, with both 5×5 and 7×7 kernels achieving the highest mean Intersection-over-Union (mIoU) of 79.5%. However, the computational cost, measured in GFLOPs, increases notably with larger kernels, from 118.2 GFLOPs for 1×1 kernels up to 136.1 GFLOPs for 7×7 kernels. While the 7×7 kernel offers the best accuracy, the gain over the 5×5 kernel is minimal, suggesting diminishing returns as kernel size increases.
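The cost trend above follows from the fact that a standard k × k convolution performs roughly k² · C_in · C_out · H · W multiply-accumulates, so its cost grows quadratically in k. A quick illustration (the shapes are illustrative, not the paper's; the network-wide GFLOPs in Table 5 grow far more slowly because only the attention-module convolutions change size):

```python
def conv_macs(h, w, c_in, c_out, k):
    """Approximate multiply-accumulates of a k x k convolution
    with 'same' padding and stride 1: k^2 * C_in * C_out * H * W."""
    return k * k * c_in * c_out * h * w

base = conv_macs(64, 128, 256, 256, 1)
for k in (1, 3, 5, 7):
    macs = conv_macs(64, 128, 256, 256, k)
    print(f"k={k}: {macs / 1e9:.2f} GMACs ({macs // base}x the 1x1 cost)")
```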
On the other hand, the 1×1 kernel provides the lowest computational overhead but with slightly reduced accuracy. Balancing accuracy and efficiency, the 3×3 kernel size is a reasonable compromise, delivering near-peak performance (79.4% mIoU) with moderate computational demands (120.8 GFLOPs). These observations suggest that moderate kernel sizes are preferable in practice, especially for real-time segmentation tasks where computational resources are constrained.

Table 1. Comparisons with other state-of-the-art methods on the Cityscapes dataset. "no" indicates the method does not use any backbone for training and testing. Best results are highlighted in bold.

Model | Resolution | Backbone | mIoU val (%) | mIoU test (%) | FPS
DFANet B | 1024×1024 | Xception B | - | 67.1 | 120
DFANet A | 1024×1024 | Xception A | - | 71.3 | 100
BiSeNetV1 | 768×1536 | Xception39 | 69.0 | 68.4 | 105.8
BiSeNetV1 | 768×1536 | ResNet18 | 74.8 | 74.7 | 65.5
SFNet | 1024×2048 | DF1 | - | 74.5 | 121
BiSeNetV2 | 512×1024 | no | 73.4 | 72.6 | 156
BiSeNetV2-L | 512×1024 | no | 75.8 | 75.3 | 47.3
FasterSeg | 1024×2048 | no | 73.1 | 71.5 | 163.9
STDC1-Seg50 | 512×1024 | STDC1 | 72.2 | 71.9 | 250.4
STDC2-Seg50 | 512×1024 | STDC2 | 74.2 | 73.4 | 188.6
STDC1-Seg75 | 768×1536 | STDC1 | 74.5 | 75.3 | 126.7
STDC2-Seg75 | 768×1536 | STDC2 | 77.0 | 76.8 | 97.0
AASeg | 512×1024 | no | 74.8 | 74.4 | 202.7

Figure 5. Visualized segmentation results on the Cityscapes validation set. The three columns, left to right, show the input image, ground truth, and the prediction from our network.

6.3. Ablation Study on Dilation Rates

In our network, we employ an increasing sequence of dilation rates {0, 1, 2, 3} within the convolutional layers to capture multi-scale contextual information effectively. To investigate the impact of different dilation rates on the network's performance, speed, and parameter count, we conducted a detailed ablation study on the Cityscapes validation set. The results are summarized in Table 6.
As shown in the table, incorporating larger dilation rates progressively improves segmentation accuracy, with the highest mean Intersection-over-Union (mIoU) of 80.2% achieved using a dilation rate of 3. This improvement is attributed to the enlarged receptive field, which allows the model to better capture global context and finer spatial details. However, the increase in dilation rate also leads to a rise in the number of model parameters, from 0.76 million for the baseline model up to 0.90 million for the model with dilation rate 3, and a corresponding decrease in inference speed (FPS). Specifically, the model with dilation rate 3 achieves 88.6 FPS, which is lower than the 118.2 FPS of the baseline model. Interestingly, removing dilation altogether reduces the mIoU to 77.4% but improves the FPS slightly to 121.3, indicating a trade-off between accuracy and speed. Overall, these results highlight the balance between capturing richer contextual information via dilation and maintaining real-time performance. The dilation rate of 2 offers a favorable compromise with an mIoU of 80.1%, 102.7 FPS, and 0.84 million parameters.

Table 2. Comparisons with other state-of-the-art methods on the CamVid dataset. "no" indicates the method does not use a backbone for training and testing. Best results are highlighted in bold.

Model | Resolution | Backbone | mIoU (%) | FPS
DFANet A | 720×960 | no | 64.7 | 120
DFANet B | 720×960 | no | 59.3 | 160
BiSeNetV1 | 720×960 | Xception39 | 65.6 | 175
BiSeNetV1 | 720×960 | ResNet18 | 68.7 | 116.3
BiSeNetV2 | 720×960 | no | 72.4 | 124.5
BiSeNetV2-L | 720×960 | no | 73.2 | 32.7
SFNet | 720×960 | DF2 | 70.4 | 134.1
SFNet | 720×960 | ResNet-18 | 73.8 | 35.5
STDC1-Seg | 720×960 | STDC1 | 73.0 | 197.6
STDC2-Seg | 720×960 | STDC2 | 73.9 | 152.2
AASeg | 720×960 | no | 73.5 | 188.7

Table 3. Performance comparison of results reported on the ADE20K validation set. Best results are highlighted in bold.
Method | Mean IoU (%) | GFLOPs
PSPNet50 | 42.78 | 167.6
PSPNet101 | 43.29 | 238.4
SFNet | 42.81 | 75.7
SFNet | 44.67 | 94.0
DCANet | 45.49 | -
AASeg | 46.29 | 80.26

Table 4. Ablation study on the upsampling operation in our network using the Cityscapes validation set. Best results are highlighted in bold.

Method | mIoU (%)
bilinear upsampling | 79.2
deconvolution | 78.5
nearest neighbor | 78.3

Table 5. Ablation study on kernel size k in our network using the Cityscapes validation set. Best results are highlighted in bold.

Method | mIoU (%) | GFLOPs
k = 1 | 79.2 | 118.2
k = 3 | 79.4 | 120.8
k = 5 | 79.5 | 127.5
k = 7 | 79.5 | 136.1

Table 6. Ablation study results on the Cityscapes validation set using different dilation rates. FPS is estimated for an input resolution of 512×1024. Best results are highlighted in bold.

Model | mIoU (%) | FPS | Parameters (M)
AASeg-baseline | 79.2 | 118.2 | 0.76
AASeg-w/o dilation | 77.4 | 121.3 | 0.80
AASeg (r=1) | 79.8 | 115.5 | 0.79
AASeg (r=2) | 80.1 | 102.7 | 0.84
AASeg (r=3) | 80.2 | 88.6 | 0.90

7. Conclusions

In this paper, we introduced the Attention Aware Network (AASeg), a novel architecture designed for real-time semantic segmentation. By integrating both Spatial Attention (SA) and Channel Attention (CA) modules, our network effectively enhances feature representations of objects without incurring additional computational overhead. The combined spatial-channel attention, together with the Multi-Scale Context (MSC) module, enables robust and discriminative feature extraction for both targets and backgrounds. We extensively evaluated AASeg on multiple challenging benchmarks, including the Cityscapes, ADE20K, and CamVid datasets. The results demonstrate that our approach achieves a strong balance between inference speed and segmentation accuracy, competitive with existing state-of-the-art methods.
For future work, we plan to extend the multi-scale attention mechanisms within AASeg to tackle more complex tasks such as instance segmentation, aiming to further improve the precision and versatility of the model.

References

[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
[2] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[3] Wuyang Chen, Xinyu Gong, Xianming Liu, Qian Zhang, Yuan Li, and Zhangyang Wang. FasterSeg: Searching for faster real-time semantic segmentation. arXiv preprint arXiv:1912.10917, 2019.
[4] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[6] Mingyuan Fan, Shenqi Lai, Junshi Huang, Xiaoming Wei, Zhenhua Chai, Junfeng Luo, and Xiaolin Wei. Rethinking BiSeNet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9716–9725, 2021.
[7] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3146–3154, 2019.
2 [8] Kai Han, Y unhe W ang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. Ghostnet: More features from cheap operations. In Pr oceedings of the IEEE/CVF Conference on Computer V ision and P attern Recognition , pages 1580– 1589, 2020. [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Pr oceed- ings of the IEEE confer ence on computer vision and pattern r ecognition , pages 770–778, 2016. [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Eur opean confer ence on computer vision , pages 630–645. Springer, 2016. [11] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation net- works. In Proceedings of the IEEE conference on computer vision and pattern r ecognition , pages 7132–7141, 2018. 3 [12] Ping Hu, Fabian Caba, Oliv er W ang, Zhe Lin, Stan Sclaroff, and Federico Perazzi. T emporally distributed networks for fast video semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer V ision and P attern Recognition , pages 8818–8827, 2020. [13] Gao Huang, Zhuang Liu, Laurens V an Der Maaten, and Kil- ian Q W einberger . Densely connected con volutional net- works. In Proceedings of the IEEE conference on computer vision and pattern r ecognition , pages 4700–4708, 2017. [14] Zilong Huang, Xinggang W ang, Lichao Huang, Chang Huang, Y unchao W ei, and W enyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In Pr oceedings of the IEEE/CVF International Conference on Computer V ision , pages 603–612, 2019. [15] Forrest N Iandola, Song Han, Matthe w W Moskewicz, Khalid Ashraf, W illiam J Dally , and Kurt K eutzer . Squeezenet: Alexnet-le vel accuracy with 50x fewer pa- rameters and¡ 0.5 mb model size. arXiv preprint arXiv:1602.07360 , 2016. 2 [16] Longlong Jing, Y ucheng Chen, and Y ingli Tian. Coarse-to- fine semantic segmentation from image-level labels. IEEE T ransactions on Image Pr ocessing , 29:225–236, 2019. 
[17] Gen Li, Inyoung Y un, Jonghyun Kim, and Joongkyu Kim. Dabnet: Depth-wise asymmetric bottleneck for real-time semantic segmentation. arXiv preprint , 2019. [18] Hanchao Li, Pengfei Xiong, Haoqiang Fan, and Jian Sun. Dfanet: Deep feature aggregation for real-time semantic se g- mentation. In Pr oceedings of the IEEE/CVF Confer ence on Computer V ision and P attern Recognition , pages 9522– 9531, 2019. 2 [19] Peike Li, Xuanyi Dong, Xin Y u, and Y i Y ang. When hu- mans meet machines: T ow ards efficient segmentation net- works. In Pr oceedings of the British Machine V ision Confer- ence (BMVC) , 2020. [20] Xiangtai Li, Ansheng Y ou, Zhen Zhu, Houlong Zhao, Maoke Y ang, Kuiyuan Y ang, Shaohua T an, and Y unhai T ong. Se- mantic flo w for fast and accurate scene parsing. In Eur opean Confer ence on Computer V ision , pages 775–793. Springer, 2020. [21] Chih-Y ang Lin, Y i-Cheng Chiu, Hui-Fuang Ng, Timothy K Shih, and Kuan-Hung Lin. Global-and-local context network for semantic segmentation of street vie w images. Sensors , 20 (10):2907, 2020. [22] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks for high- resolution semantic segmentation. In Pr oceedings of the IEEE conference on computer vision and pattern reco gni- tion , pages 1925–1934, 2017. 2 [23] Y ifu Liu, Chenfeng Xu, and Xinyu Jin. Dcanet: Dense context-a ware network for semantic segmentation. arXiv pr eprint arXiv:2104.02533 , 2021. [24] Jonathan Long, Evan Shelhamer , and Tre vor Darrell. Fully con volutional networks for semantic segmentation. In Pro- ceedings of the IEEE confer ence on computer vision and pat- tern r ecognition , pages 3431–3440, 2015. 1 [25] Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh Hajishirzi. Espnet: Efficient spatial pyramid of dilated conv olutions for semantic segmentation. In Proceedings of the european confer ence on computer vi- sion (ECCV) , pages 552–568, 2018. 
1 , 2 [26] Ruigang Niu, Xian Sun, Y u T ian, W enhui Diao, Kaiqiang Chen, and Kun Fu. Hybrid multiple attention network for semantic segmentation in aerial images. IEEE T ransactions on Geoscience and Remote Sensing , 2021. 9 [27] Marin Orsic, Ivan Kreso, Petra Be vandic, and Sinisa Se gvic. In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images. In Pr oceed- ings of the IEEE/CVF Conference on Computer V ision and P attern Recognition , pages 12607–12616, 2019. 2 [28] Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eu- genio Culurciello. Enet: A deep neural network architec- ture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147 , 2016. 1 , 2 [29] Rudra PK Poudel, Stephan Liwicki, and Roberto Cipolla. Fast-scnn: fast semantic segmentation network. arXiv pr eprint arXiv:1902.04502 , 2019. 1 [30] Abhinav Sagar . Aa3dnet: Attention augmented real time 3d object detection. arXiv preprint , 2021. [31] Abhinav Sagar . Dmsanet: Dual multi scale attention net- work. arXiv preprint , 2021. [32] Abhinav Sagar and RajKumar Soundrapandiyan. Semantic segmentation with multi scale spatial attention for self driv- ing cars. arXiv preprint , 2020. [33] Mingxing T an and Quoc Le. Ef ficientnet: Rethinking model scaling for conv olutional neural networks. In International Confer ence on Machine Learning , pages 6105–6114. PMLR, 2019. [34] Quan T ang, Fagui Liu, T ong Zhang, Jun Jiang, and Y u Zhang. Attention-guided chained context aggregation for semantic segmentation. arXiv preprint , 2020. [35] Panqu W ang, Pengfei Chen, Y e Y uan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison Cottrell. Understanding con volution for semantic segmentation. In 2018 IEEE win- ter conference on applications of computer vision (W ACV) , pages 1451–1460. IEEE, 2018. 3 [36] Xuan Y ang, Shanshan Li, Zhengchao Chen, Jocelyn Chanus- sot, Xiuping Jia, Bing Zhang, Baipeng Li, and P an Chen. 
An attention-fused network for semantic segmentation of very- high-resolution remote sensing imagery . ISPRS Journal of Photogrammetry and Remote Sensing , 177:238–262, 2021. [37] Changqian Y u, Jingbo W ang, Chao Peng, Changxin Gao, Gang Y u, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Pr oceed- ings of the Eur opean confer ence on computer vision (ECCV) , pages 325–341, 2018. 2 [38] Changqian Y u, Jingbo W ang, Chao Peng, Changxin Gao, Gang Y u, and Nong Sang. Learning a discriminativ e fea- ture network for semantic segmentation. In Pr oceedings of the IEEE conference on computer vision and pattern reco g- nition , pages 1857–1866, 2018. 3 [39] Changqian Y u, Changxin Gao, Jingbo W ang, Gang Y u, Chunhua Shen, and Nong Sang. Bisenet v2: Bilateral net- work with guided aggregation for real-time semantic seg- mentation. arXiv preprint , 2020. 2 [40] Fisher Y u and Vladlen Koltun. Multi-scale context aggregation by dilated con volutions. arXiv pr eprint arXiv:1511.07122 , 2015. 1 [41] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang W ang, Ambrish T yagi, and Amit Agrawal. Con- text encoding for semantic segmentation. In Proceedings of the IEEE confer ence on Computer V ision and P attern Recog- nition , pages 7151–7160, 2018. [42] Y iheng Zhang, Zhaofan Qiu, Jingen Liu, Ting Y ao, Dong Liu, and T ao Mei. Customizable architecture search for se- mantic se gmentation. In Pr oceedings of the IEEE/CVF Con- fer ence on Computer V ision and P attern Recognition , pages 11641–11650, 2019. [43] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang W ang, and Jiaya Jia. Pyramid scene parsing network. In Pr oceedings of the IEEE conference on computer vision and pattern r ecognition , pages 2881–2890, 2017. 3 [44] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. Icnet for real-time semantic segmenta- tion on high-resolution images. 
In Proceedings of the Eu- r opean confer ence on computer vision (ECCV) , pages 405– 420, 2018. 2 [45] Hengshuang Zhao, Y i Zhang, Shu Liu, Jianping Shi, Chen Change Loy , Dahua Lin, and Jiaya Jia. Psanet: Point- wise spatial attention network for scene parsing. In Pro- ceedings of the Eur opean Confer ence on Computer V ision (ECCV) , pages 267–283, 2018. 10