DynamicVis: Dynamic Visual Perception for Efficient Remote Sensing Foundation Models
The advancement of remote sensing (RS) technology has enabled high-resolution Earth observation; however, interpreting these images with modern vision foundation models (VFMs) remains a significant challenge. Unlike object-centric natural images, RS imagery is fundamentally characterized by extreme target sparsity and massive spatial redundancy. Key objects of interest (e.g., ships, vehicles) often occupy less than 1% of the spatial extent, surrounded by vast, target-free backgrounds. Existing VFMs predominantly rely on uniform dense processing (e.g., ViTs) and pixel-reconstruction pre-training paradigms (e.g., MAE). These approaches inherently waste substantial computational capacity on modeling redundant backgrounds and inadvertently dilute the feature representations of small, sparse targets. To bridge this structural misalignment, we propose DynamicVis, a visual foundation model explicitly tailored to the sparse nature of RS imagery. Architecturally, DynamicVis introduces a Dynamic Region-Aware State Space Model (SSM) that bypasses uniform computation: it adaptively routes and incrementally models only task-relevant, high-salience tokens while employing parameter-free integration for background context, drastically reducing the cost of processing ultra-long 2D token sequences ($\sim$100,000 tokens). Crucially, to equip the network with robust spatial-selection capabilities, we propose a novel Region-Level Meta-Embedding Multi-Instance Learning (MIL) pre-training paradigm. Trained on a million-scale dataset, this paradigm explicitly disentangles sparse foreground instances from dense backgrounds in the latent semantic space, overcoming the semantic ambiguity of conventional pixel-reconstruction methods. Extensive evaluations across nine diverse downstream tasks show that DynamicVis is exceptionally effective, particularly on sparse-target and instance-level perception tasks (e.g., small object detection and change detection).
💡 Research Summary
The paper addresses a fundamental mismatch between modern vision foundation models (VFMs) and the unique characteristics of high‑resolution remote sensing (RS) imagery. In RS data, objects of interest such as ships, vehicles, or scattered buildings typically occupy less than one percent of the total pixel area, while the vast majority of the scene consists of repetitive, information‑poor background (ocean, desert, clouds). Conventional VFMs—most notably Vision Transformers (ViTs) trained with Masked AutoEncoder (MAE) objectives—process every token uniformly and allocate substantial capacity to reconstructing background textures. This leads to quadratic computational cost, massive memory consumption, and diluted feature representations for the sparse targets.
DynamicVis is proposed as a purpose‑built foundation model that explicitly incorporates "sparse spatial perception" as an inductive bias. Its core architecture is a Dynamic Region‑Aware State Space Model (SSM). After patch tokenization, a learnable importance scorer assigns a saliency score to each token. Tokens whose scores exceed a learned threshold are routed to a dual‑path SSM (forward and backward scans) for deep contextual modeling, while the remaining background tokens bypass the heavy computation and are merged back via a parameter‑free residual connection. This selective routing reduces the effective sequence length from ~100 k tokens to a few thousand, cutting the computational complexity from the O(N²) of self‑attention to O(K·L), where K ≪ N is the number of routed tokens and L the per‑token cost of the linear SSM scan.
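The routing mechanism described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: `heavy_fn` stands in for the dual-path SSM, top-k selection stands in for the learned threshold, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamic_route(tokens: np.ndarray, scores: np.ndarray, k: int,
                  heavy_fn) -> np.ndarray:
    """Route the k most salient tokens through heavy_fn (a stand-in for the
    dual-path SSM); every other token passes through unchanged, mimicking
    the parameter-free residual path for background context."""
    routed = np.argsort(scores)[-k:]      # indices of the k highest-saliency tokens
    out = tokens.copy()                   # background tokens: identity pass-through
    out[routed] = heavy_fn(tokens[routed])  # foreground tokens: deep modeling
    return out

# Toy example: N = 8 tokens of dim 4; the "heavy" path simply doubles its input.
N, D, K = 8, 4, 3
tokens = rng.standard_normal((N, D))
scores = rng.random(N)
out = dynamic_route(tokens, scores, K, heavy_fn=lambda x: 2.0 * x)

routed = np.argsort(scores)[-K:]
mask = np.ones(N, dtype=bool)
mask[routed] = False
assert np.allclose(out[routed], 2.0 * tokens[routed])  # selected tokens transformed
assert np.allclose(out[mask], tokens[mask])            # background left untouched
```

The key efficiency point is that `heavy_fn` only ever sees K tokens, so its cost is independent of the full sequence length N.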
However, dynamic routing requires the model to reliably distinguish foreground from background. To provide this capability, the authors introduce a Region‑Level Meta‑Embedding Multi‑Instance Learning (MIL) pre‑training paradigm. Using a million‑scale dataset with weak region annotations (e.g., fMoW), each image is treated as a “bag” of instances. Regional visual embeddings are contrasted against categorical meta‑embeddings in a shared latent space, forcing foreground instances to align closely with their class prototypes while background patches remain distant. This contrastive MIL objective equips the backbone with robust spatial‑selection semantics, enabling the importance scorer to focus on truly informative regions during downstream fine‑tuning.
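To make the bag-of-instances objective concrete, here is a minimal NumPy sketch of one plausible form of the loss: per-region cosine similarities to each class meta-embedding are max-pooled over the bag (the most salient instance speaks for the image), then a softmax cross-entropy pulls the bag toward its class prototype. The pooling choice, temperature, and all names are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def mil_meta_embedding_loss(region_emb: np.ndarray, meta_emb: np.ndarray,
                            label: int, tau: float = 0.07) -> float:
    """Bag-level MIL loss for one image.

    region_emb: (M, D) regional visual embeddings (the "bag" of instances).
    meta_emb:   (C, D) categorical meta-embeddings (class prototypes).
    label:      index of the image's weak (bag-level) class label.
    """
    # Cosine similarity between every region and every class prototype.
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    m = meta_emb / np.linalg.norm(meta_emb, axis=1, keepdims=True)
    sim = r @ m.T                        # (M, C)
    # MIL max-pooling over instances: the best-matching region scores the bag.
    bag_logits = sim.max(axis=0) / tau   # (C,)
    # Numerically stable log-softmax cross-entropy against the bag label.
    z = bag_logits - bag_logits.max()
    logp = z - np.log(np.exp(z).sum())
    return float(-logp[label])

# Toy example: 2 classes in D=2; the first region is a clear class-0 foreground
# instance, the others are near-background clutter.
meta = np.array([[1.0, 0.0], [0.0, 1.0]])
regions = np.array([[0.9, 0.1], [0.05, 0.0], [0.1, 0.05]])
loss_correct = mil_meta_embedding_loss(regions, meta, label=0)
loss_wrong = mil_meta_embedding_loss(regions, meta, label=1)
assert loss_correct < loss_wrong  # foreground instance aligns the bag with class 0
```

Because only the maximum similarity contributes, background regions that sit far from every prototype receive no gradient pull toward the class, which is the disentanglement the summary describes.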
Extensive experiments cover nine downstream tasks spanning scene classification, tiny ship detection, building extraction, image retrieval, region classification, SAR and optical instance segmentation, road segmentation, and bi‑temporal change detection. DynamicVis consistently outperforms strong baselines such as ViT‑base, SatMAE, RingMo, and other recent RS foundation models, especially on tasks where target sparsity is pronounced (small‑object detection, instance segmentation, change detection). Notably, processing a 2048 × 2048 image requires only 97 ms latency and 833 MB GPU memory—approximately 6 % of the latency and 3 % of the memory of a comparable ViT‑base—without any specialized acceleration techniques.
The contributions can be summarized as: (1) redefining RS foundation modeling through a spatial‑sparsity bias; (2) integrating adaptive token routing with linear‑complexity SSMs to achieve ultra‑scalable encoding; (3) devising a region‑level meta‑embedding MIL pre‑training that supplies precise foreground‑background guidance; and (4) demonstrating that the resulting model delivers both high efficiency and state‑of‑the‑art accuracy across a broad spectrum of RS applications. The work opens avenues for real‑time, large‑scale satellite and aerial image analysis, disaster monitoring, and GIS pipelines where computational resources are limited but high‑resolution detail is essential.