Assessing the value of Geo-Foundational Models for Flood Inundation Mapping: Benchmarking models for Sentinel-1, Sentinel-2, and PlanetScope for end-users
Geo-Foundational Models (GFMs) enable fast and reliable extraction of spatiotemporal information from satellite imagery, improving flood inundation mapping by leveraging location and time embeddings. Despite their potential, it remains unclear whether GFMs outperform traditional models like U-Net. A systematic comparison across sensors and data availability scenarios is still lacking, which is an essential step to guide end-users in model selection. To address this, we evaluate three GFMs (Prithvi 2.0, Clay V1.5, and DOFA) and UViT (a Prithvi variant) against TransNorm, U-Net, and Attention U-Net using PlanetScope, Sentinel-1, and Sentinel-2. We observe competitive performance among all GFMs, with only 2-5% variation between the best and worst models across sensors. Clay outperforms others on PlanetScope (0.79 mIoU) and Sentinel-2 (0.70), while Prithvi leads on Sentinel-1 (0.57). In leave-one-region-out cross-validation across five regions, Clay shows slightly better performance across all sensors (mIoU: 0.72(0.04), 0.66(0.07), 0.51(0.08)) compared to Prithvi (0.70(0.05), 0.64(0.09), 0.49(0.13)) and DOFA (0.67(0.07), 0.64(0.04), 0.49(0.09)) for PlanetScope, Sentinel-2, and Sentinel-1, respectively. Across all 19 sites, leave-one-region-out cross-validation reveals a 4% improvement by Clay compared to U-Net. Visual inspection highlights Clay’s superior ability to retain fine details. Few-shot experiments show Clay achieves 0.64 mIoU on PlanetScope with just five training images, outperforming Prithvi (0.24) and DOFA (0.35). In terms of computational time, Clay is a better choice due to its smaller model size (26M parameters), making it ~3x faster than Prithvi (650M) and 2x faster than DOFA (410M). Contrary to previous findings, our results suggest GFMs offer small to moderate improvements in flood mapping accuracy at lower computational cost and labeling effort compared to traditional U-Net.
💡 Research Summary
This paper presents a comprehensive benchmark of Geo‑Foundational Models (GFMs) for flood inundation mapping across three satellite sensors—Sentinel‑1 (SAR), Sentinel‑2 (optical), and PlanetScope (high‑resolution commercial optical). The authors evaluate three state‑of‑the‑art GFMs—Prithvi 2.0, Clay V1.5, and DOFA—along with a Prithvi‑derived variant (UViT) and several conventional deep‑learning baselines (U‑Net, Attention U‑Net, DeepLabv3+, TransNorm). The study is motivated by the need to understand whether large, self‑supervised pretrained encoders truly outperform traditional convolutional networks in a practical, disaster‑response context, especially when considering computational resources and labeling effort.
To enable a fair comparison, the authors introduce the FloodPlanet dataset, which comprises 19 globally distributed flood events (2017‑2020) with co‑registered imagery from all three sensors and high‑quality manual inundation masks. PlanetScope images are 1024 × 1024 pixels (≈3 m resolution), while Sentinel‑1 and Sentinel‑2 are resized to 320 × 320 and 224 × 224 respectively, ensuring consistent patch sizes across models. The dataset contains 366 PlanetScope tiles, 362 Sentinel‑1 tiles, and 298 Sentinel‑2 tiles, providing a robust testbed for cross‑sensor analysis.
Model selection criteria included (1) the scale of pre‑training data (number of image chips), (2) multi‑sensor support, and (3) prior performance on GEO‑Bench and PANGEA benchmarks. Consequently, Prithvi 2.0 (≈4.2 M pre‑training images), DOFA (≈8 M), and Clay V1.5 (≈70 M) were chosen as representative GFMs. All models were fine‑tuned with a unified decoder architecture: a Feature Pyramid Network‑based UPerNet, which facilitates multi‑scale feature aggregation while keeping inference cost low.
The experimental protocol consists of three parts:
- Full‑dataset evaluation – Using an 80/20 train/test split, mean Intersection‑over‑Union (mIoU) is reported for each sensor. All models achieve comparable performance within a 2‑5 % band. Clay leads on PlanetScope (0.79 mIoU) and Sentinel‑2 (0.70), while Prithvi attains the highest score on Sentinel‑1 (0.57). Traditional U‑Net baselines lag by roughly 4 % on average.
- Leave‑one‑region‑out cross‑validation – To assess generalization, each of the five geographic regions is held out as a test set in turn. Clay consistently outperforms the other GFMs across all sensors (PlanetScope 0.72 ± 0.04, Sentinel‑2 0.66 ± 0.07, Sentinel‑1 0.51 ± 0.08). The improvement over U‑Net across the 19 sites is about 4 % in mIoU, confirming that GFMs can provide modest but consistent gains when applied to unseen regions.
- Few‑shot (label‑efficiency) experiments – The authors train each model with only five annotated images per sensor. Clay achieves 0.64 mIoU on PlanetScope, dramatically surpassing Prithvi (0.24) and DOFA (0.35). This result highlights Clay’s superior transfer learning capability and suggests that GFMs can be effective even when labeled data are scarce—a common scenario in rapid disaster response.
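Both the full‑dataset and leave‑one‑region‑out evaluations score models by mean IoU over the water/non‑water classes. A minimal NumPy sketch of the metric and the hold‑out loop is below; the region labels and the commented training step are placeholders for illustration, not details from the paper:

```python
import numpy as np

def miou(pred, truth, n_classes=2):
    """Mean Intersection-over-Union over per-pixel class maps."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, truth == c).sum()
        union = np.logical_or(pred == c, truth == c).sum()
        if union > 0:                        # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x2 example: class 1 IoU = 1/2, class 0 IoU = 2/3 -> mIoU ~ 0.58
pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [0, 0]])
print(f"{miou(pred, truth):.2f}")            # -> 0.58

# Leave-one-region-out: hold out each region in turn, train on the rest,
# then report mean +/- std of the held-out scores (as in the summary above).
regions = ["R1", "R2", "R3", "R4", "R5"]     # hypothetical region labels
for held_out in regions:
    train = [r for r in regions if r != held_out]
    # model = fine_tune(gfm, train)                          # placeholder
    # score = miou(model.predict(held_out), labels[held_out])
```

Reporting scores per held‑out region, rather than one pooled number, is what makes the ± spreads in the cross‑validation results meaningful.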
Computational efficiency is also examined. Clay’s model size is 26 M parameters, roughly 25× smaller than Prithvi (650 M) and 16× smaller than DOFA (410 M). Consequently, inference is approximately three times faster than Prithvi and twice as fast as DOFA, making Clay a practical choice for time‑critical applications and for deployment on hardware with limited memory.
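Note that the parameter-count gap is much larger than the measured speedup, since runtime does not scale linearly with model size. A quick check of the ratios from the reported figures:

```python
# Parameter counts as reported in the abstract (in parameters).
params = {"Clay": 26e6, "Prithvi": 650e6, "DOFA": 410e6}

for name in ("Prithvi", "DOFA"):
    ratio = params[name] / params["Clay"]
    print(f"{name} is {ratio:.0f}x larger than Clay")
# Prithvi: 25x larger, DOFA: ~16x larger -- yet measured inference is only
# ~3x and ~2x slower than Clay, respectively.
```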
The authors conclude that GFMs do not revolutionize flood mapping accuracy—gains are modest (2‑5 % over strong CNN baselines)—but they do offer tangible benefits in terms of reduced labeling effort and faster inference. Clay, in particular, emerges as the most balanced model, delivering the highest accuracy across sensors, excellent few‑shot performance, and a lightweight footprint. The paper underscores the importance of evaluating GFMs on high‑resolution commercial data and in realistic, multi‑sensor settings, rather than relying solely on benchmark datasets with frozen encoders.
Future work suggested includes expanding the geographic and seasonal diversity of test data, exploring alternative self‑supervised pre‑training objectives (e.g., contrastive learning, masked autoencoders), and integrating the best‑performing GFM into end‑to‑end flood‑response pipelines that combine real‑time satellite ingestion, rapid model inference, and downstream decision support for emergency managers.