Beyond Global Scanning: Adaptive Visual State Space Modeling for Salient Object Detection in Optical Remote Sensing Images
Salient object detection (SOD) in optical remote sensing images (ORSIs) faces numerous challenges, including significant variations in target scale and low contrast between targets and the background. Existing methods based on vision transformer (ViT) and convolutional neural network (CNN) architectures aim to leverage both global and local features, but the difficulty of effectively integrating these heterogeneous features limits their overall performance. To overcome these limitations, we propose an adaptive state space context network (ASCNet), which builds upon the state space model mechanism to simultaneously capture long-range dependencies and enhance regional feature representation. Specifically, we employ a visual state space encoder to extract multi-scale features. To provide deeper guidance and enhancement of these features, we design a Multi-Level Context Module (MLCM), which strengthens cross-layer interaction between features of different scales while enhancing the model's structural perception, allowing it to distinguish foreground from background more effectively. We then design the Adaptive Patchwise Visual State Space (APVSS) block as the decoder of ASCNet, which integrates our proposed Dynamic Adaptive Granularity Scan (DAGS) and Granularity-aware Propagation Module (GPM). It performs adaptive patch scanning on feature maps enhanced by local perception, thereby capturing rich local region information and strengthening the state space model's local modeling capability. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance, validating its effectiveness and superiority.
💡 Research Summary
This paper addresses the challenging task of Salient Object Detection (SOD) in optical remote sensing images (ORSIs), where objects often exhibit low contrast, large scale variations, and complex backgrounds. Existing approaches that combine Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) struggle to fuse global and local features effectively. To overcome these limitations, the authors propose the Adaptive State-Space Context Network (ASCNet), a novel encoder-decoder architecture built upon visual state-space (VSS) models. The encoder extracts multi-scale features using a state-space block that captures long-range dependencies via a selective scan mechanism. Between encoder and decoder, a Multi-Level Context Module (MLCM) employs graph-neural-network-based topology-aware attention to strengthen cross-layer interactions and model spatial relationships among features of different scales.
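The selective-scan recurrence at the heart of such visual state-space blocks can be illustrated with a minimal NumPy sketch. This is a simplified, single-sequence stand-in, not the paper's exact formulation: the input-dependent projections (`B_proj`, `C_proj`, `dt_proj`) and the softplus discretization below are generic Mamba-style choices assumed for illustration.

```python
import numpy as np

def selective_scan(x, A, B_proj, C_proj, dt_proj):
    """Minimal 1-D selective scan sketch (illustrative, not ASCNet's exact block).

    x: (L, D) token sequence; A: (D, N) negative state-decay matrix;
    B_proj, C_proj: (D, N) and dt_proj: (D, D) make the input/output
    projections and the step size input-dependent ("selective").
    """
    L, D = x.shape
    h = np.zeros((D, A.shape[1]))                    # hidden state per channel
    y = np.empty_like(x)
    for t in range(L):
        dt = np.log1p(np.exp(x[t] @ dt_proj))        # softplus step size, (D,)
        B = x[t] @ B_proj                            # input projection, (N,)
        C = x[t] @ C_proj                            # output projection, (N,)
        # discretized state update: decay old state, inject current input
        h = np.exp(dt[:, None] * A) * h + dt[:, None] * np.outer(x[t], B)
        y[t] = h @ C                                 # read out, (D,)
    return y
```

Because `A` is negative, each state decays over time, so distant tokens contribute with exponentially diminishing weight while the input-dependent `dt` lets the model modulate how much history each token retains.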
In the decoder, the Adaptive Patchwise Visual State‑Space (APVSS) block integrates two key components: the Granularity‑aware Propagation Module (GPM) and the Dynamic Adaptive Granularity Scan (DAGS). GPM introduces a global token that conditionally gates local tokens, providing global semantic conditioning for enhanced local perception. DAGS replaces uniform scanning with a resolution‑aware partitioning of feature maps into spatial blocks; each block is scanned in multiple directions (forward, backward, left, right) with content‑adaptive weighting, thereby preserving fine‑grained spatial dependencies and improving boundary precision.
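The global-token gating idea behind GPM can be sketched as follows; the pooling and sigmoid-gate form here is an assumption for illustration, and the paper's exact module may differ.

```python
import numpy as np

def gpm_gate(tokens):
    """Hypothetical sketch of GPM-style global conditioning: pool a global
    token from all local tokens, then use it to gate each local token
    channel-wise, so local features are modulated by global semantics.

    tokens: (L, D) local token sequence.
    """
    g = tokens.mean(axis=0)                       # global token, (D,)
    gate = 1.0 / (1.0 + np.exp(-(tokens * g)))    # per-token, per-channel sigmoid gate
    return tokens * gate                          # globally conditioned local tokens
```

Local tokens whose channels agree with the global token receive gates above 0.5 and are amplified, while conflicting responses are suppressed, which matches the summary's description of global semantic conditioning for enhanced local perception.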
The overall pipeline proceeds as follows: (1) input image → VSS encoder → multi‑scale features; (2) MLCM fuses and refines these features; (3) APVSS decodes the refined features, first applying GPM for global‑local integration and then DAGS for adaptive local scanning; (4) final saliency map is produced.
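Step (3)'s adaptive local scanning could be sketched as below. This is an illustrative toy: the block partitioning, the four scan orders, and the softmax fusion mirror the summary's description of DAGS, but the cumulative-mean "scan" is a cheap stand-in for the actual state-space recurrence, and all names and the weighting scheme are assumptions.

```python
import numpy as np

def _cum_mean(seq):
    """Cumulative mean along the sequence; a cheap stand-in for an SSM scan."""
    return np.cumsum(seq, axis=0) / np.arange(1, seq.shape[0] + 1)[:, None]

def _scan_patch(patch, order):
    """Scan a square (b, b, C) patch in one of four orders and map the
    result back to spatial layout."""
    b, _, C = patch.shape
    if order == "fwd":                             # row-major, forward
        return _cum_mean(patch.reshape(-1, C)).reshape(b, b, C)
    if order == "bwd":                             # row-major, backward
        return _cum_mean(patch.reshape(-1, C)[::-1])[::-1].reshape(b, b, C)
    cols = patch.transpose(1, 0, 2).reshape(-1, C) # column-major ordering
    if order == "down":
        return _cum_mean(cols).reshape(b, b, C).transpose(1, 0, 2)
    return _cum_mean(cols[::-1])[::-1].reshape(b, b, C).transpose(1, 0, 2)

def dags_scan(fmap, block=4):
    """Hypothetical DAGS-style sketch: partition the (H, W, C) map into
    spatial blocks, scan each block in four directions, and fuse the
    directional results with content-adaptive softmax weights.
    Assumes H and W are multiples of `block`."""
    H, W, C = fmap.shape
    out = np.zeros_like(fmap)
    for i in range(0, H, block):
        for j in range(0, W, block):
            patch = fmap[i:i+block, j:j+block]
            scans = [_scan_patch(patch, o) for o in ("fwd", "bwd", "down", "up")]
            logits = np.array([s.mean() for s in scans])   # content-adaptive
            w = np.exp(logits - logits.max()); w /= w.sum()
            out[i:i+block, j:j+block] = sum(wk * sk for wk, sk in zip(w, scans))
    return out
```

Scanning within small blocks rather than over the full flattened image keeps spatially adjacent pixels adjacent in the scan sequence, which is the property the summary credits for preserving fine-grained spatial dependencies and boundary precision.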
Extensive experiments on two widely used ORSI‑SOD benchmarks (ORSSD and EORSSD) demonstrate that ASCNet consistently outperforms recent state‑of‑the‑art methods such as HFANet, ADSTNet, MRBINet, and BCARNet across multiple metrics (F‑measure, MAE, E‑measure, S‑measure). Notably, the model excels at delineating object boundaries and detecting small or elongated objects, which are common failure cases for pure CNN or ViT models. Ablation studies confirm the individual contributions of MLCM, GPM, and DAGS: removing MLCM degrades global context modeling; omitting GPM harms local detail preservation; excluding DAGS reduces directional sensitivity and leads to blurred edges.
The paper’s contributions are fourfold: (1) introduction of a state‑space‑based network that balances global context and local detail for ORSI‑SOD; (2) design of MLCM with topology‑aware attention to enhance multi‑scale interaction; (3) development of APVSS, combining GPM and DAGS to improve local region modeling while retaining global awareness; (4) comprehensive empirical validation showing superior performance on benchmark datasets.
Limitations include the relatively high computational and memory cost of the state‑space blocks, which may hinder deployment on very high‑resolution remote sensing imagery or resource‑constrained platforms. The current framework is also limited to single‑modal optical images; extending it to multi‑spectral, SAR, or temporal remote sensing data remains an open direction. Future work will explore lightweight state‑space designs, multimodal fusion strategies, and hardware‑aware optimizations to enable real‑time, large‑scale ORSI‑SOD applications.