SkyCap: Bitemporal VHR Optical-SAR Quartets for Amplitude Change Detection and Foundation-Model Evaluation
Change detection for linear infrastructure monitoring requires reliable high-resolution data and a regular acquisition cadence. Optical very-high-resolution (VHR) imagery is interpretable and straightforward to label, but clouds break this cadence. Synthetic Aperture Radar (SAR) enables all-weather acquisitions, yet is difficult to annotate. We introduce SkyCap, a bitemporal VHR optical-SAR dataset constructed by archive matching and co-registration of SkySat (optical) and Capella Space (SAR) scenes. We use optical-to-SAR label transfer to obtain SAR amplitude change detection (ACD) labels without requiring SAR-expert annotations. We perform continued pretraining of SARATR-X on our SAR data and benchmark the resulting SAR-specific foundation models (FMs), together with SARATR-X itself, against optical FMs on SkyCap under different preprocessing choices. Among the evaluated models, the optical FM MTP (ViT-B+RVSA) with dB+Z-score preprocessing attains the best result (F1$_c$ = 45.06), outperforming SAR-specific FMs further pretrained directly on Capella data. We observe strong sensitivity to the alignment of preprocessing with pretraining statistics, and the ranking of optical models on optical change detection does not transfer one-to-one to SAR ACD. To our knowledge, this is the first evaluation of foundation models on VHR SAR ACD.
💡 Research Summary
The paper introduces SkyCap, a novel bitemporal dataset that pairs very‑high‑resolution (VHR) optical imagery from Planet SkySat (0.5 m GSD, RGB+NIR) with co‑registered VHR X‑band Spotlight SAR from Capella Space (0.5 m GSD, HH polarization). By matching archival scenes across 19 geographically diverse sites (Eastern Europe, the Middle East, and Asia) and carefully co‑registering them, the authors create 19 “quartets” (time‑step 1 optical + SAR, time‑step 2 optical + SAR), yielding 3,484 annotated change pairs. Change annotations are produced solely on the interpretable optical pairs by an experienced human team and then transferred to the SAR pairs via the precise registration, eliminating the need for SAR‑expert labeling.
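The label-transfer step relies on the fact that, after co-registration, the optical and SAR scenes of a quartet share one pixel grid, so an optical change mask indexes the SAR pair directly without any warping. A minimal sketch of this idea, with illustrative function and argument names not taken from the paper:

```python
import numpy as np

def make_sar_acd_sample(optical_change_mask, sar_t1, sar_t2):
    """Turn an optical change annotation into a SAR ACD training sample.

    Assumes all arrays are co-registered onto the same (H, W) pixel grid,
    so pixel (i, j) names the same ground location in every modality at
    both time steps. Names here are illustrative, not from the paper.
    """
    assert optical_change_mask.shape == sar_t1.shape == sar_t2.shape
    return {
        "x1": sar_t1,                               # SAR amplitude, time step 1
        "x2": sar_t2,                               # SAR amplitude, time step 2
        "y": optical_change_mask.astype(np.uint8),  # transferred change label
    }
```

This is why the precise co-registration matters: any residual misalignment between modalities shows up directly as label noise in the SAR pair.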
The authors explore three research questions: (1) how to obtain reliable SAR change labels without SAR experts, (2) how SAR‑specific foundation models compare to optical foundation models when applied to SAR, and (3) which preprocessing pipeline best bridges the modality gap. To answer (2), they continue pretraining the SAR‑specific foundation model SARATR‑X (which builds on a HiViT‑B backbone and replaces the MAE reconstruction target with Multi‑Scale Gradient Features, MSGF) on the SkyCap SAR data. They generate three SAR‑pretrained variants: CapellaX (trained only on Capella data), ALOS‑X (trained on L‑band ALOS‑2 data), and CapALOS‑X (a 50/50 mix of both). All models are kept at the Base size (~90 M parameters) for a fair comparison.
Six encoders are evaluated in a Siamese‑U‑Net change‑detection architecture: three optical (HiViT, MTP‑ViT‑B+RVSA, DINOv3) and three SAR‑pretrained (SARATR‑X, CapellaX, CapALOS‑X). The authors test three preprocessing schemes for SAR inputs: (1) linear – percentile clipping and scaling of the amplitude to a fixed range, (2) dB – conversion of amplitude to decibels, and (3) dB+Z‑score – dB conversion followed by Z‑score normalization, the scheme that yields the best overall result.
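The three preprocessing schemes can be sketched as follows. This is a minimal illustration under common conventions (2nd/98th percentile clipping, 20·log10 amplitude-to-dB conversion); the paper's exact parameters are not specified here, and passing dataset-wide statistics to `db_zscore` instead of per-image ones is one way to align inputs with a model's pretraining distribution, the sensitivity the abstract highlights.

```python
import numpy as np

def linear_stretch(amp, p_lo=2, p_hi=98):
    """Scheme 1: percentile clipping, then min-max scaling to [0, 1]."""
    lo, hi = np.percentile(amp, [p_lo, p_hi])
    clipped = np.clip(amp, lo, hi)
    return (clipped - lo) / (hi - lo + 1e-12)

def to_db(amp, eps=1e-6):
    """Scheme 2: log-scale SAR amplitude to decibels."""
    return 20.0 * np.log10(np.maximum(amp, eps))

def db_zscore(amp, mean=None, std=None):
    """Scheme 3: dB conversion followed by Z-score normalization.

    If mean/std are None, per-image statistics are used; supplying
    fixed (e.g. pretraining) statistics instead changes how well the
    inputs match what the encoder saw during pretraining.
    """
    db = to_db(amp)
    mean = db.mean() if mean is None else mean
    std = db.std() if std is None else std
    return (db - mean) / (std + 1e-12)
```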