A Contrastive Learning Foundation Model Based on Perfectly Aligned Sample Pairs for Remote Sensing Images
Self-Supervised Learning (SSL) enables us to pre-train foundation models without costly labeled data. Among SSL methods, Contrastive Learning (CL) methods are better at obtaining accurate semantic representations under noise interference. However, while CL methods have achieved great success in many computer vision tasks, the significant domain gap means they still require specific adaptation for Remote Sensing (RS) images. To this end, we present a novel self-supervised method called PerA, which produces all-purpose RS features through semantically Perfectly Aligned sample pairs. Specifically, PerA obtains features from sampled views by applying spatially disjoint masks to augmented images rather than random cropping. Our framework provides high-quality features by ensuring consistency between teacher and student and by predicting learnable mask tokens. Compared to previous contrastive methods, our method is more memory-efficient and can be trained with larger batches thanks to its sparse inputs. Additionally, the proposed method adapts well to uncurated RS data and reduces the impact of potential semantic inconsistency. We also collect an unlabeled pre-training dataset containing about 5 million RS images. We conducted experiments on multiple downstream task datasets and achieved performance comparable to previous state-of-the-art methods at a limited model scale, demonstrating the effectiveness of our approach. We hope this work will contribute to practical remote sensing interpretation.
💡 Research Summary
The paper introduces PerA, a self‑supervised learning (SSL) framework specifically designed for remote sensing (RS) imagery. While contrastive learning (CL) has achieved impressive results on natural images, its standard practice of generating positive pairs through random cropping often leads to severe semantic inconsistencies in RS data, where objects are small and scattered and scenes are captured from a bird's‑eye view. PerA addresses this problem by replacing random crops with two spatially disjoint masks applied to the same augmented image. The masks are composed of small patches (e.g., 16 × 16 pixels); when the patch size is smaller than the smallest object in the scene, both masked views retain identical semantic content, producing "perfectly aligned" sample pairs.
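The disjoint-masking idea can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: it assumes a simple scheme in which the patch grid is randomly permuted and split into two complementary halves, so the two views share no patch yet together cover the whole image.

```python
import numpy as np

def disjoint_patch_masks(img_size=224, patch_size=16, rng=None):
    """Split an image's patch grid into two spatially disjoint masks.

    Returns two boolean arrays over the patch grid; True means the patch
    is visible in that view. The views never share a patch, so when each
    patch is smaller than the smallest object in the scene, both views
    sample the same semantic content without any spatial overlap.
    (Illustrative sketch; the 50/50 split is an assumption.)
    """
    rng = rng or np.random.default_rng()
    n_side = img_size // patch_size           # patches per side, e.g. 14
    n_patches = n_side * n_side               # total patches, e.g. 196
    perm = rng.permutation(n_patches)         # random patch ordering
    half = n_patches // 2
    mask_a = np.zeros(n_patches, dtype=bool)
    mask_b = np.zeros(n_patches, dtype=bool)
    mask_a[perm[:half]] = True                # first half  -> view A
    mask_b[perm[half:]] = True                # second half -> view B
    return mask_a.reshape(n_side, n_side), mask_b.reshape(n_side, n_side)

a, b = disjoint_patch_masks()
assert not (a & b).any()                      # disjoint views
assert (a | b).all()                          # full coverage together
```

Because each view keeps only about half of the patches, the encoder can skip the hidden ones entirely, which is where the memory savings described below come from.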
The training architecture follows a teacher‑student paradigm similar to BYOL. The teacher encoder is updated with a momentum average, providing stable target representations, while the student encoder learns to match the teacher’s output for the masked views. In addition to the contrastive alignment loss, PerA incorporates a masked image modeling (MIM) objective that predicts the content of the masked tokens, thereby encouraging the network to reconstruct missing information from sparse inputs. Because the inputs are heavily masked, they become sparse, which dramatically reduces memory consumption and enables the use of much larger batch sizes than traditional CL methods that rely on dense images and large negative banks.
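The momentum (exponential moving average) teacher update common to BYOL-style methods can be sketched as below. This is a minimal illustration with plain NumPy weight dictionaries; the momentum value 0.996 and the dictionary layout are assumptions for the sketch, not details from the paper.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.996):
    """Momentum update: teacher <- m * teacher + (1 - m) * student.

    The teacher receives no gradients; it tracks a slow moving average
    of the student's weights, which keeps its target representations
    stable while the student is trained to match them.
    """
    for name in teacher:
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student[name]
    return teacher

# Toy example: the teacher drifts smoothly toward the (fixed) student.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
for _ in range(100):
    ema_update(teacher, student)
# After 100 steps the teacher has moved a fraction 1 - 0.996**100 of the way.
```

In a real framework the same loop would run over `model.parameters()` under `no_grad`, but the arithmetic is exactly this weighted average.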
To evaluate the approach, the authors built a massive unlabeled RS dataset (RSRSD‑5m) containing roughly 5 million 512 × 512 RGB tiles. The dataset was collected automatically via Google Earth Engine, sampling points across six continents with a balanced representation of urban, agricultural, forest, water, desert, ice, grassland, and cloud categories. The spatial resolution of the tiles ranges from 0.3 m to 10 m, making the collection representative of real‑world monitoring scenarios.
Experiments were conducted on three widely used RS benchmarks: (1) AID for scene classification, (2) ISPRS Potsdam for semantic segmentation, and (3) LEVIR‑CD for change detection. PerA‑pre‑trained models, even at relatively modest model sizes, achieved performance comparable to or surpassing state‑of‑the‑art CL methods such as MoCo, SimCLR, and BYOL. Notably, the change detection task, which requires fine‑grained pixel‑level understanding, benefited from the MIM component, showing higher F1 scores than pure CL baselines. The authors also report that the memory footprint of PerA is substantially lower, allowing batch sizes of up to 8k images on a single GPU, which shortens overall training time.
Key contributions are: (1) an automated pipeline that creates one of the largest publicly available unlabeled RS datasets; (2) the PerA method that leverages disjoint masking to generate perfectly aligned positive pairs, combining the strengths of CL and MIM while being memory‑efficient; (3) extensive validation across classification, segmentation, and change detection tasks, demonstrating that PerA matches or exceeds existing SSL approaches. Limitations include sensitivity to mask patch size selection and the current focus on RGB data; future work will explore multimodal extensions (e.g., SAR, hyperspectral) and adaptive mask generation strategies. Overall, PerA offers a practical, scalable SSL solution for a wide range of remote sensing applications.