MS-SCANet: A Multiscale Transformer-Based Architecture with Dual Attention for No-Reference Image Quality Assessment
We present the Multi-Scale Spatial Channel Attention Network (MS-SCANet), a transformer-based architecture designed for no-reference image quality assessment (IQA). MS-SCANet features a dual-branch structure that processes images at multiple scales, effectively capturing both fine and coarse details, an improvement over traditional single-scale methods. By integrating tailored spatial and channel attention mechanisms, our model emphasizes essential features while minimizing computational complexity. A key component of MS-SCANet is its cross-branch attention mechanism, which enhances the integration of features across different scales, addressing limitations in previous approaches. We also introduce two new consistency loss functions, Cross-Branch Consistency Loss and Adaptive Pooling Consistency Loss, which maintain spatial integrity during feature scaling, outperforming conventional linear and bilinear resizing techniques. Extensive evaluations on the KonIQ-10k, LIVE, LIVE Challenge, and CSIQ datasets show that MS-SCANet consistently surpasses state-of-the-art methods, offering a robust framework that correlates more strongly with subjective human scores.
💡 Research Summary
The paper introduces MS‑SCANet, a novel no‑reference image quality assessment (NR‑IQA) framework that leverages a dual‑branch, multi‑scale transformer architecture combined with spatial and channel attention mechanisms. Each branch processes the input image at a different patch size (16×16 and 32×32), enabling simultaneous capture of fine‑grained and coarse‑grained distortions. Within each branch, window‑based self‑attention (similar to Swin‑Transformer) is applied, reducing computational complexity from the O(N²·d) of global self‑attention to O(N·w²·d), where w is the window side length, N the number of tokens, and d the embedding dimension.
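The complexity reduction from windowing can be seen in a minimal NumPy sketch of window-partitioned self-attention. This is an illustrative toy, not the paper's implementation: tokens are partitioned into 1-D windows of `w` tokens each (a 2-D w×w window would contain w² tokens), and projections are identities rather than learned matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(tokens, w):
    """Self-attention restricted to non-overlapping windows of w tokens.

    tokens: (N, d) array; N must be divisible by w.
    Each window costs O(w^2 * d) and there are N/w windows, so the total
    is O(N * w * d) rather than O(N^2 * d) for global attention.
    """
    N, d = tokens.shape
    out = np.empty_like(tokens)
    for start in range(0, N, w):
        x = tokens[start:start + w]            # (w, d) window
        # Identity Q/K/V projections for illustration only.
        q, k, v = x, x, x
        scores = q @ k.T / np.sqrt(d)          # (w, w) scores within the window
        out[start:start + w] = softmax(scores) @ v
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
y = window_attention(x, w=4)
print(y.shape)  # (16, 8)
```

Because no token ever attends outside its window, the quadratic term in N disappears; Swin-style models recover cross-window communication by shifting the partition between layers.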
Spatial attention operates inside each window to model relationships among patches, while channel attention employs a Squeeze‑and‑Excitation block that re‑weights feature channels based on global average pooled statistics. The two attentions run in parallel, allowing the network to emphasize both local texture details and global structural cues without excessive overhead.
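The channel-attention path described above follows the standard Squeeze-and-Excitation recipe: global average pooling, a bottleneck of two fully connected layers, and sigmoid gates that re-weight channels. A minimal NumPy sketch (random weights and a reduction ratio of 2 are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(feat, w1, w2):
    """Squeeze-and-Excitation channel re-weighting.

    feat: (C, H, W) feature map.
    w1: (C//r, C) and w2: (C, C//r) are the bottleneck projections
    (learned in a real model; random here).
    """
    s = feat.mean(axis=(1, 2))           # squeeze: global average pool -> (C,)
    z = np.maximum(w1 @ s, 0.0)          # excitation: FC -> ReLU
    gate = sigmoid(w2 @ z)               # FC -> sigmoid, per-channel gate in (0, 1)
    return feat * gate[:, None, None]    # re-weight each channel

rng = np.random.default_rng(1)
feat = rng.standard_normal((8, 4, 4))
r = 2                                    # assumed reduction ratio
w1 = rng.standard_normal((8 // r, 8)) * 0.1
w2 = rng.standard_normal((8, 8 // r)) * 0.1
out = squeeze_excite(feat, w1, w2)
print(out.shape)  # (8, 4, 4)
```

Because the gates depend only on pooled statistics, the extra cost is tiny relative to the attention layers, which is why running it in parallel with spatial attention adds little overhead.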
The most distinctive contribution is the cross‑branch attention module. Unlike prior works such as CrossViT, which exchange information only through each branch's class token attending to the other branch's patch tokens, MS‑SCANet directly computes attention between the patch tokens of the two scales. Queries, keys, and values are derived separately from the fine‑scale and coarse‑scale branches, and the bidirectional attention scores are summed, producing a fused representation that integrates information across scales. This design is particularly suited for IQA, where distortions can appear at multiple spatial frequencies.
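One plausible reading of "bidirectional attention scores are summed" is that the fine→coarse score matrix and the transpose of the coarse→fine score matrix are added before the softmax. The NumPy sketch below follows that reading with identity Q/K/V projections; the exact fusion in the paper may differ, so treat this purely as an illustration of the token-to-token (rather than class-token) exchange.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_branch_attention(fine, coarse):
    """Summed bidirectional cross-attention between two token sets.

    fine:   (Nf, d) patch tokens, e.g. from the 16x16-patch branch.
    coarse: (Nc, d) patch tokens, e.g. from the 32x32-patch branch.
    Identity Q/K/V projections for illustration; real models learn them.
    """
    d = fine.shape[1]
    s_fc = fine @ coarse.T / np.sqrt(d)    # (Nf, Nc) fine -> coarse scores
    s_cf = coarse @ fine.T / np.sqrt(d)    # (Nc, Nf) coarse -> fine scores
    scores = s_fc + s_cf.T                 # (Nf, Nc) summed bidirectional scores
    fused = softmax(scores) @ coarse       # (Nf, d) fine tokens enriched with coarse info
    return fused

rng = np.random.default_rng(2)
fine = rng.standard_normal((16, 8))        # more tokens at the fine scale
coarse = rng.standard_normal((4, 8))
fused = cross_branch_attention(fine, coarse)
print(fused.shape)  # (16, 8)
```

The key contrast with CrossViT is visible in the shapes: every fine-scale token interacts with every coarse-scale token, rather than all exchange being funneled through a single class token.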
To further stabilize multi‑scale feature fusion, two consistency loss terms are introduced. Cross‑Branch Consistency Loss (CB‑Loss) minimizes the mean‑squared error between the feature maps of the two branches, encouraging scale‑invariant representations. Adaptive Pooling Consistency Loss (AP‑Loss) penalizes discrepancies between original feature maps and those after adaptive pooling, preserving spatial relationships that are often distorted by linear or bilinear resizing. The total training objective combines the standard L1 regression loss with CB‑Loss and AP‑Loss, weighted by α = β = 0.5.
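The combined objective can be sketched as follows. The shapes, the pooling-to-common-resolution step for CB-Loss, and the pool-then-nearest-upsample comparison for AP-Loss are illustrative assumptions; the paper only specifies that CB-Loss is an MSE between branch features, that AP-Loss compares features before and after adaptive pooling, and that α = β = 0.5.

```python
import numpy as np

def adaptive_avg_pool(feat, out_hw):
    """Adaptive average pooling of a (C, H, W) map to (C, oh, ow)."""
    C, H, W = feat.shape
    oh, ow = out_hw
    out = np.empty((C, oh, ow))
    for i in range(oh):
        for j in range(ow):
            h0, h1 = i * H // oh, (i + 1) * H // oh
            w0, w1 = j * W // ow, (j + 1) * W // ow
            out[:, i, j] = feat[:, h0:h1, w0:w1].mean(axis=(1, 2))
    return out

def total_loss(pred, target, feat_fine, feat_coarse, alpha=0.5, beta=0.5):
    """L1 regression loss + alpha * CB-Loss + beta * AP-Loss (illustrative)."""
    l1 = np.abs(pred - target).mean()
    # CB-Loss: MSE between the two branches' features, with the fine map
    # adaptively pooled to the coarse map's resolution (assumed alignment).
    cb = ((adaptive_avg_pool(feat_fine, feat_coarse.shape[1:]) - feat_coarse) ** 2).mean()
    # AP-Loss (one reading): MSE between the fine map and its adaptively
    # pooled version upsampled back by nearest-neighbour repetition.
    C, H, W = feat_fine.shape
    pooled = adaptive_avg_pool(feat_fine, (H // 2, W // 2))
    up = pooled.repeat(2, axis=1).repeat(2, axis=2)
    ap = ((feat_fine - up) ** 2).mean()
    return l1 + alpha * cb + beta * ap

rng = np.random.default_rng(3)
pred, target = np.array([0.62]), np.array([0.70])   # hypothetical quality scores
f_fine = rng.standard_normal((4, 8, 8))
f_coarse = rng.standard_normal((4, 4, 4))
loss = total_loss(pred, target, f_fine, f_coarse)
print(loss >= 0.0)
```

All three terms are non-negative, so the consistency losses act as regularizers on top of the L1 regression target rather than competing objectives.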
Extensive experiments were conducted on four widely used IQA benchmarks: KonIQ‑10k, LIVE, LIVE‑Challenge, and CSIQ. Using Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank‑Order Correlation Coefficient (SROCC) as evaluation metrics, MS‑SCANet consistently ranks among the top three methods across all datasets. Notably, it achieves the highest SROCC (0.923) on the challenging LIVE‑Challenge set, demonstrating robustness to real‑world distortions.
Ablation studies reveal that (1) the dual‑branch, dual‑attention configuration outperforms single‑branch or single‑attention variants, confirming the importance of multi‑scale feature interaction; (2) each consistency loss individually improves performance, and their combination yields the best results, indicating synergistic benefits for feature alignment and down‑sampling stability.
In terms of efficiency, MS‑SCANet requires approximately 14.7 M FLOPs per token, substantially lower than Swin‑Transformer (71.8 M), TRIQ (92.9 M), and vanilla ViT (185.7 M). This efficiency stems from the windowed attention, reduced number of patches, and a modest embedding dimension of 256. Consequently, the model is suitable for high‑resolution image processing with near‑real‑time inference.
The authors conclude that MS‑SCANet advances NR‑IQA by (i) integrating multi‑scale transformer processing with low computational cost, (ii) employing parallel spatial and channel attention to highlight perceptually relevant cues, (iii) introducing a cross‑branch attention mechanism that fuses fine and coarse features effectively, and (iv) adding two novel consistency losses that preserve spatial and scale integrity during training. Future work will explore further computational optimizations, deployment on mobile platforms, and extension to video quality assessment tasks.