Exploring Novel Data Storage Approaches for Large-Scale Numerical Weather Prediction
Driven by scientific and industry ambition, HPC and AI applications such as operational Numerical Weather Prediction (NWP) require processing and storing ever-increasing data volumes as fast as possible. While POSIX distributed file systems backed by NVMe SSDs are currently a common HPC storage configuration providing I/O to applications, new storage solutions have proliferated or gained traction over the last decade, with the potential to address the performance limitations POSIX file systems exhibit at scale for certain I/O workloads. This work primarily assesses the suitability and performance of two object storage systems, namely DAOS and Ceph, for ECMWF's operational NWP as well as for HPC and AI applications in general. New software-level adapters have been developed to enable ECMWF's NWP workflow to leverage these systems, and extensive I/O benchmarking has been conducted on a few computer systems, comparing the performance of the evaluated object stores against equivalent Lustre file system deployments on the same hardware. The challenges of porting to object storage, and its benefits relative to the traditional POSIX I/O approach, are discussed, and, where possible, domain-agnostic performance analysis is conducted, yielding insights also of relevance to I/O practitioners and the broader HPC community. DAOS and Ceph both demonstrated excellent performance, but DAOS stood out relative to Ceph and Lustre, providing superior scalability and flexibility for applications to perform I/O at scale as desired. This sets a promising outlook for DAOS and object storage, which may see greater adoption at HPC centres in the years to come, although not necessarily implying a shift away from POSIX-like I/O.
💡 Research Summary
The paper investigates modern storage alternatives for high‑performance computing (HPC) and artificial intelligence (AI) workloads that generate ever‑increasing data volumes, focusing on the operational Numerical Weather Prediction (NWP) workflow at the European Centre for Medium‑Range Weather Forecasts (ECMWF). Traditionally, ECMWF and similar large‑scale scientific facilities rely on POSIX‑based distributed file systems such as Lustre combined with NVMe SSDs. While this architecture has served well, the POSIX abstraction introduces metadata bottlenecks, lock contention, and limited scalability when faced with massive concurrent I/O typical of NWP field archiving, post‑processing, and AI‑driven analytics.
To address these limitations, the authors evaluate two open-source object-storage systems: Distributed Asynchronous Object Storage (DAOS) and Ceph (via its RADOS layer). They develop software adapters that replace the existing FDB (Forecast DataBase) POSIX back-ends with object-storage-aware back-ends, exposing the same high-level API to the NWP code while internally issuing libdaos and librados calls. DAOS is tightly coupled with storage-class memory (SCM) and NVMe, offering low-latency, high-bandwidth object operations and a two-stage archive-and-flush mechanism to guarantee consistency. Ceph provides flexible replication and erasure-coding options, but its file-system layer (CephFS) depends on a metadata server (MDS) that can become a hotspot under high concurrency.
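The back-end swap described above can be pictured as a small abstraction layer: the NWP code calls one archive/retrieve API, and interchangeable back-ends map FDB-style metadata keys either to POSIX paths or to flat object names. The following Python sketch is purely illustrative; the class and method names are hypothetical and not the real FDB interface, and an in-memory dictionary stands in for a librados/libdaos-backed store.

```python
# Hypothetical sketch of a pluggable FDB-style back-end (names are
# illustrative, not the actual FDB API).
from abc import ABC, abstractmethod


class FDBBackend(ABC):
    """Common interface seen by the NWP code, regardless of storage target."""

    @abstractmethod
    def archive(self, key: dict, data: bytes) -> None: ...

    @abstractmethod
    def retrieve(self, key: dict) -> bytes: ...


def object_name(key: dict) -> str:
    # Flatten a metadata key into a deterministic flat object name,
    # e.g. {"date": "20240101", "param": "t2m"} -> "date=20240101/param=t2m".
    return "/".join(f"{k}={key[k]}" for k in sorted(key))


class InMemoryObjectBackend(FDBBackend):
    """Stand-in for an object-store back-end (librados/libdaos in the paper)."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def archive(self, key: dict, data: bytes) -> None:
        self._objects[object_name(key)] = data

    def retrieve(self, key: dict) -> bytes:
        return self._objects[object_name(key)]


fdb: FDBBackend = InMemoryObjectBackend()
fdb.archive({"date": "20240101", "param": "t2m", "step": "0"}, b"\x00" * 1024)
field = fdb.retrieve({"date": "20240101", "param": "t2m", "step": "0"})
print(len(field))  # -> 1024
```

The key point is that object names replace directory hierarchies: no path lookup or directory locking is needed, which is where the metadata-bottleneck relief comes from.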
Benchmarking is performed on two platforms: (1) the NEXTGenIO testbed, which integrates DRAM-NVMe hybrid storage representing a near-future exascale node, and (2) a Google Cloud Platform (GCP) cluster equipped with NVMe SSDs, representing a production-grade cloud environment. In each case, the same hardware configuration is used to compare DAOS, Ceph, and Lustre. Four representative I/O patterns are exercised: (a) IOR micro-benchmarks for sequential and random reads/writes, (b) the fdb-hammer benchmark that mimics real ECMWF database access (mixed reads/writes with multiple writers/readers), (c) a realistic NWP field-archive workflow (weather fields of roughly 1 MiB each, written by many processes per node, followed by a flush and a post-processing "PGEN" job), and (d) small-object tests (4 KB–64 KB) to assess overhead.
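A detail worth keeping in mind when reading the bandwidth figures that follow: IOR-style benchmarks typically report aggregate bandwidth as total bytes moved divided by the elapsed time of the slowest rank, so a single straggler caps the reported number. A minimal sketch of that reduction (illustrative, not the actual IOR source):

```python
# Illustrative aggregate-bandwidth reduction as used by IOR-style benchmarks:
# total bytes divided by the slowest rank's elapsed wall time.
def aggregate_bandwidth(bytes_per_rank, seconds_per_rank):
    total_bytes = sum(bytes_per_rank)
    wall_time = max(seconds_per_rank)  # run ends when the last rank finishes
    return total_bytes / wall_time


# Example: 4 ranks each writing 1 GiB; one rank is twice as slow as the rest.
gib = 1 << 30
bw = aggregate_bandwidth([gib] * 4, [1.0, 1.0, 1.0, 2.0])
print(bw / gib)  # -> 2.0 (GiB/s), not 4.0
```

This is why tail latency and metadata contention, not just raw device speed, dominate the comparisons between DAOS, Ceph, and Lustre.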
Results show that DAOS consistently outperforms both Ceph and Lustre across all metrics. In the IOR tests, DAOS achieves sustained aggregate bandwidths above 2.5 GB/s on the SCM platform, scaling almost linearly when the number of compute nodes is quadrupled. Ceph reaches roughly 1.6 GB/s, while Lustre lags at about 1.2 GB/s under the same conditions. In the fdb‑hammer workload with up to 24 writer and 24 reader nodes (48 processes per node) and 100 time steps, DAOS delivers 30 % higher throughput than Ceph and exhibits negligible metadata latency, whereas Ceph’s MDS becomes a bottleneck and Lustre suffers from file‑creation contention. Small‑object benchmarks reveal that DAOS maintains over 1.8 GB/s even for 4 KB objects, whereas Ceph drops to under 1 GB/s, confirming DAOS’s lower per‑object overhead.
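The "almost linear" scaling claim above can be quantified with the usual scaling-efficiency metric: measured speedup divided by the increase in node count. The numbers below are made up for illustration only; they are not results from the paper.

```python
# Scaling efficiency = (bandwidth ratio) / (node-count ratio).
# A value of 1.0 means perfectly linear scaling.
def scaling_efficiency(bw_small, bw_large, nodes_small, nodes_large):
    return (bw_large / bw_small) / (nodes_large / nodes_small)


# Hypothetical figures: quadrupling nodes takes bandwidth from 2.5 to 9.5 GB/s.
print(round(scaling_efficiency(2.5, 9.5, 1, 4), 2))  # -> 0.95
```

An efficiency near 1.0 across a 4x node increase is what distinguishes DAOS in these tests from Ceph and Lustre, whose efficiency degrades under metadata and file-creation contention.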
From a data‑redundancy perspective, DAOS’s native erasure‑coding on SCM reduces storage overhead by more than 30 % compared with traditional 3‑way replication, while Ceph’s replication scheme incurs higher cost. The authors also discuss software migration challenges: existing NWP pipelines are deeply rooted in file‑directory semantics, so moving to an object model requires changes not only in the FDB back‑ends but also in catalog management, metadata handling, and long‑term data‑retention policies. Nevertheless, the study demonstrates that object storage can fundamentally alleviate the metadata bottleneck inherent to POSIX systems, offering superior scalability, bandwidth, and storage efficiency for data‑intensive scientific applications.
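The redundancy-overhead comparison above follows directly from the arithmetic of replication versus erasure coding. The summary does not state the erasure-coding parameters used, so the k=8 data + m=2 parity configuration below is an assumed, commonly deployed choice:

```python
# Back-of-envelope storage overhead: raw bytes stored per byte of usable data.
# EC parameters (k=8, m=2) are assumed for illustration; the summary does not
# specify which configuration was evaluated.
def raw_per_usable(replicas=None, k=None, m=None):
    if replicas is not None:
        return float(replicas)          # n-way replication stores n copies
    return (k + m) / k                  # erasure coding stores k+m shards per k

rep3 = raw_per_usable(replicas=3)       # 3-way replication -> 3.0x raw
ec82 = raw_per_usable(k=8, m=2)         # EC 8+2 -> 1.25x raw
saving = 1 - ec82 / rep3
print(f"{saving:.0%}")  # -> 58%
```

Even far more conservative parity ratios comfortably clear the "more than 30 %" saving reported, which is why erasure coding is attractive at archive scale despite its higher CPU cost on writes and rebuilds.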
In conclusion, DAOS emerges as the most promising storage solution for large-scale NWP and similar HPC workloads, delivering near-linear scalability, higher aggregate throughput, and better storage efficiency than both Ceph and Lustre. Ceph remains a viable alternative, especially at sites that already operate Ceph, but its performance is limited by metadata server contention and higher per-object overhead. The paper suggests future work on hybrid architectures that combine the best of POSIX and object storage, automated data-management policies tailored to object stores, and deeper integration of DAOS into production NWP pipelines.