OceanForecastBench: A Benchmark Dataset for Data-Driven Global Ocean Forecasting
Global ocean forecasting aims to predict key ocean variables such as temperature, salinity, and currents, which is essential for understanding and describing oceanic phenomena. In recent years, data-driven deep learning-based ocean forecast models, such as XiHe, WenHai, LangYa and AI-GOMS, have demonstrated significant potential in capturing complex ocean dynamics and improving forecasting efficiency. Despite these advancements, the absence of open-source, standardized benchmarks has led to inconsistent data usage and evaluation methods. This gap hinders efficient model development, impedes fair performance comparison, and constrains interdisciplinary collaboration. To address this challenge, we propose OceanForecastBench, a benchmark offering three core contributions: (1) A high-quality global ocean reanalysis dataset spanning 28 years for model training, including 4 ocean variables across 23 depth levels and 4 sea surface variables. (2) High-reliability satellite and in-situ observations for model evaluation, covering approximately 100 million locations across the global ocean. (3) An evaluation pipeline and a comprehensive benchmark with 6 typical baseline models, leveraging observations to evaluate model performance from multiple perspectives. OceanForecastBench represents the most comprehensive benchmarking framework currently available for data-driven ocean forecasting, offering an open-source platform for model development, evaluation, and comparison. The dataset and code are publicly available at: https://github.com/Ocean-Intelligent-Forecasting/OceanForecastBench.
💡 Research Summary
The paper introduces OceanForecastBench, a comprehensive open‑source benchmark designed to standardize data‑driven global ocean forecasting. Recognizing the lack of publicly available, standardized datasets that hampers reproducibility and fair model comparison, the authors assemble a training corpus spanning 28 years (1993‑2021) of high‑resolution global ocean reanalysis (GLORYS12) together with surface forcing data from ERA5 (10 m wind) and SST from OSTIA. The training set includes four ocean variables (temperature, salinity, zonal and meridional currents) across 23 depth levels, plus four surface variables (SST, sea‑level anomaly, and the two wind components), all resampled to a uniform 1.40625° spatial grid (≈150 km) to balance data fidelity with typical GPU memory constraints.
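The variable and grid counts above imply a fixed input-tensor layout for the models. The sketch below is a hypothetical illustration of that layout, assuming one channel per variable-depth pair; the variable names and array shapes are assumptions, not the benchmark's actual code.

```python
import numpy as np

RESOLUTION = 1.40625                       # degrees per grid cell, as stated above
N_LAT = round(180 / RESOLUTION)            # 128 latitude rows
N_LON = round(360 / RESOLUTION)            # 256 longitude columns

DEPTH_VARS = ["thetao", "so", "uo", "vo"]  # temperature, salinity, zonal/meridional currents
N_DEPTHS = 23
SURFACE_VARS = ["sst", "sla", "u10", "v10"]

# 4 depth-resolved variables x 23 levels, plus 4 surface variables
n_channels = len(DEPTH_VARS) * N_DEPTHS + len(SURFACE_VARS)

# One daily global snapshot: (channels, lat, lon)
snapshot = np.zeros((n_channels, N_LAT, N_LON), dtype=np.float32)
print(snapshot.shape)  # (96, 128, 256)
```

At roughly 96 × 128 × 256 float32 values per day, a single snapshot is about 12 MB, which is consistent with the summary's point that the 1.40625° resolution keeps training within typical GPU memory budgets.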
For evaluation, the benchmark aggregates three independent observation sources: EN4 (quality‑controlled in‑situ temperature and salinity from Argo and other platforms), the Global Drifter Program (6‑hourly SST and surface currents), and CMEMS L3 (sea‑level anomaly from satellite altimetry). Together these provide roughly 100 million observation points, covering the full globe and offering a robust ground‑truth reference. The evaluation pipeline aligns model forecasts with discrete observations in both space and time, applies quality filters, and computes a suite of metrics (RMSE, MAE, correlation, SSIM) on a per‑variable and per‑depth basis.
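The forecast-to-observation matching step described above can be sketched as a nearest-grid-cell lookup followed by pointwise error metrics. This is a minimal illustration under assumed conventions (latitude from -90°, longitude wrapped to [0°, 360°)); the function names and indexing scheme are not taken from the benchmark's code.

```python
import numpy as np

RES = 1.40625  # grid resolution in degrees

def to_index(lat, lon):
    """Map observation coordinates to the nearest grid-cell indices."""
    i = np.clip(((lat + 90.0) / RES).astype(int), 0, int(180 / RES) - 1)
    j = ((lon % 360.0) / RES).astype(int) % int(360 / RES)
    return i, j

def score(forecast, obs_lat, obs_lon, obs_val):
    """RMSE and MAE of a gridded forecast against point observations."""
    i, j = to_index(obs_lat, obs_lon)
    err = forecast[i, j] - obs_val
    return float(np.sqrt(np.mean(err ** 2))), float(np.mean(np.abs(err)))

# Toy example: a constant 20 degC SST field vs. three in-situ readings.
field = np.full((128, 256), 20.0)
rmse, mae = score(field,
                  np.array([0.0, 10.0, -30.0]),    # latitudes
                  np.array([150.0, -40.0, 5.0]),   # longitudes
                  np.array([21.0, 19.0, 20.0]))    # observed values
```

In the benchmark itself this matching would additionally be done per depth level and per forecast lead time, with quality flags applied before scoring.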
Six baseline models are evaluated under identical training‑validation splits (1993‑2017 for training, 2018‑2020 for validation). The baselines comprise a conventional numerical weather prediction (NWP) system and five state‑of‑the‑art deep‑learning architectures, including Swin‑Transformer variants, ConvLSTM, U‑Net, and a graph‑neural‑network approach. All models are trained with the same loss functions and hyper‑parameter search budgets, ensuring a fair comparison.
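The shared chronological split above can be expressed as a small helper; the year boundaries come from the summary, while the function itself is an illustrative assumption.

```python
def split_by_year(years):
    """Partition years into the benchmark's train/validation ranges."""
    train = [y for y in years if 1993 <= y <= 2017]
    val = [y for y in years if 2018 <= y <= 2020]
    return train, val

# All baselines see the same 25 training years and 3 validation years.
train_years, val_years = split_by_year(list(range(1993, 2021)))
```

Splitting by whole years, rather than randomly by sample, avoids leaking temporally correlated ocean states from the validation period into training.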
Results show that deep‑learning models outperform the traditional NWP baseline in short‑range forecasts (1‑3 days), achieving 15‑20 % lower RMSE for sea‑surface temperature and surface currents. However, performance gains diminish for medium‑ to long‑range forecasts (7‑30 days), where error accumulation and the lack of explicit physical constraints limit the deep‑learning models' skill. Depth‑wise analysis reveals that predictions are most accurate in the upper 100 m, while errors increase sharply below 500 m due to sparse observations and more complex dynamics. The study also highlights the importance of surface wind forcing and SST as key drivers of model skill, whereas mean dynamic topography contributes less directly.
Beyond the empirical findings, the benchmark’s major contribution lies in its end‑to‑end pipeline: standardized data extraction, uniform preprocessing, a clear train‑validation split, and reproducible evaluation scripts—all released under an open‑source license. This infrastructure lowers the barrier for researchers from diverse fields (e.g., oceanography, machine learning, climate science) to develop, test, and compare new algorithms on a common footing.
The authors discuss future directions, including hybrid models that combine high‑resolution regional forecasts with global deep‑learning frameworks, physics‑informed neural networks that embed conservation laws, Bayesian approaches to quantify observation uncertainty, and multi‑scale temporal modeling to improve long‑range skill. By providing a robust, publicly accessible benchmark, OceanForecastBench is poised to accelerate progress in AI‑for‑Earth, fostering more accurate, efficient, and collaborative ocean forecasting efforts.