Performance report and optimized implementations of Weather & Climate dwarfs on multi-node systems
This document is one of the deliverable reports created for the ESCAPE project. ESCAPE stands for Energy-efficient Scalable Algorithms for Weather Prediction at Exascale. The project develops world-class, extreme-scale computing capabilities for European operational numerical weather prediction and future climate models. This is done by identifying Weather & Climate dwarfs which are key patterns in terms of computation and communication (in the spirit of the Berkeley dwarfs). These dwarfs are then optimised for different hardware architectures (single and multi-node) and alternative algorithms are explored. Performance portability is addressed through the use of domain specific languages. Here we summarize the work performed on optimizations of the dwarfs focusing on CPU multi-nodes and multi-GPUs. We limit ourselves to a subset of the dwarf configurations chosen by the consortium. Intra-node optimizations of the dwarfs and energy-specific optimizations have been described in Deliverable D3.3. To cover the important algorithmic motifs we picked dwarfs related to the dynamical core as well as column physics. Specifically, we focused on the formulation relevant to spectral codes like ECMWF’s IFS code. The main findings of this report are: (a) Up-to 30% performance gain with CPU based multi-node systems compared to optimized version of dwarfs from task 3.3 (see D3.3), (b) up to 10X performance gain on multiple GPUs from optimizations to keep data resident on the GPU and enable fast inter-GPU communication mechanisms, and (c) multi-GPU systems which feature a high-bandwidth all-to-all interconnect topology with NVLink/NVSwitch hardware are particularly well suited to the algorithms.
💡 Research Summary
The ESCAPE project’s Deliverable D3.4 presents a comprehensive performance study and optimization of two fundamental “dwarfs” (algorithmic building blocks) used in numerical weather prediction (NWP) and climate modelling: the Spectral Transform (Spherical Harmonics) dwarf and the MPDATA dwarf. The work focuses on scaling these kernels on modern multi‑node CPU clusters and multi‑GPU systems, evaluates the impact of hardware‑specific interconnects, and quantifies both speed‑up and energy‑to‑solution improvements.
Background and Scope
ESCAPE (Energy‑efficient Scalable Algorithms for Weather Prediction at Exascale) aims to extract, port, and optimise core computational patterns from operational codes such as ECMWF’s Integrated Forecasting System (IFS). This deliverable builds on earlier intra‑node work (D3.3) and extends the optimisation to distributed memory environments. The selected dwarfs were chosen because they dominate the computational cost of spectral‑based dynamical cores and of advection‑diffusion schemes (MPDATA).
Spectral Transform (Spherical Harmonics) Optimisation
The spectral transform requires an all‑to‑all data exchange at every time step. In a naïve multi‑GPU implementation, CUDA‑aware MPI still routes data through host memory, incurring large latency and bandwidth penalties. The authors analyse two NVIDIA platforms: DGX‑1 (four‑GPU “islands” connected by NVLink) and DGX‑2 (sixteen GPUs linked through an NVSwitch). While DGX‑1 provides 50 GB/s per NVLink link, the island topology limits full all‑to‑all connectivity; the NVSwitch in DGX‑2 creates a fully connected fabric with up to 300 GB/s per GPU pair (2.4 TB/s aggregate).
To exploit this hardware, a custom communication layer was built using CUDA Inter‑Process Communication (IPC) and CUDA streams. Data stays resident on the device; the kernel launches and communication overlap via asynchronous streams. Profiling with nvprof shows the communication phase shrinks from ~70 % of total runtime (CUDA‑aware MPI) to <15 % with the custom layer. Performance results demonstrate a 2.5× speed‑up on a four‑GPU DGX‑1V and an 8‑10× speed‑up on a sixteen‑GPU DGX‑2, with near‑linear scaling as GPUs increase. Power measurements using nvidia‑smi reveal that while instantaneous GPU power remains similar, the reduced runtime cuts the energy‑to‑solution by more than 70 %.
MPDATA Optimisation
MPDATA is a high‑order advection‑diffusion scheme whose performance is limited by memory bandwidth and halo‑exchange communication. The authors first performed a roof‑line analysis on a single V100 GPU, identifying low arithmetic intensity kernels. By reorganising memory accesses, employing shared memory, and using texture caches, they raised effective bandwidth utilisation to ~85 % of the V100 STREAM limit, yielding a 2.3× speed‑up for a 512×512 test case.
For multi‑GPU scaling, the traditional approach used the ATLAS library to move halo data to host memory before MPI exchange, which caused a host‑GPU round‑trip each step. The new CUDA‑aware halo exchange replaces host staging with direct GPU‑to‑GPU IPC transfers and overlaps communication with computation via streams. This reduces halo‑exchange overhead by a factor of 5‑6. Scaling tests on DGX‑1V and DGX‑2 show almost linear performance growth from 4 to 16 GPUs, with 9.8× speed‑up on 8 GPUs and 14.5× on 16 GPUs for the 1024‑grid case.
On the CPU side, the authors targeted Intel Xeon Gold 6148 (2‑socket, 24 cores) nodes using an MPI+OpenMP hybrid model. Profiling identified that communication and memory copies accounted for ~40 % of runtime. By aggregating messages, employing non‑blocking MPI (Isend/Irecv), and reducing synchronization points, they achieved a 30 % performance gain on a 16‑core node configuration.
Energy measurements with a Zimmer power meter and software counters show a 20 % reduction in energy consumption for the CPU‑only optimised runs, while the GPU‑only runs cut energy‑to‑solution by over 70 % thanks to the dramatic runtime reduction. The DGX‑2 system, with its NVSwitch fabric, achieved the highest performance‑per‑watt metric, confirming that high‑bandwidth, fully connected interconnects are essential for communication‑heavy weather kernels.
Key Findings
- Communication‑centric optimisation – Both dwarfs are limited by all‑to‑all or halo exchanges; eliminating host‑memory staging and using device‑direct pathways (CUDA‑IPC, NVLink/NVSwitch) yields the largest gains.
- Hardware‑aware software design – Tailoring code to the topology of NVLink islands (DGX‑1) versus a full NVSwitch fabric (DGX‑2) is necessary for optimal scaling.
- Energy‑to‑solution as a primary metric – Speed‑up alone is insufficient; the study demonstrates that the same optimisations also dramatically lower energy consumption, an essential consideration for Exascale NWP.
- Portability considerations – Current implementations rely heavily on CUDA and NVIDIA‑specific interconnects. Future work should explore domain‑specific languages or performance‑portable frameworks (Kokkos, SYCL) to retain benefits on emerging architectures.
- Hybrid CPU‑GPU strategies – Combining CPU‑side preprocessing and I/O with GPU‑centric compute/communication can maximise utilisation of heterogeneous systems.
Conclusions and Outlook
The deliverable proves that, for the most communication‑intensive components of modern weather and climate models, a co‑design approach—optimising algorithms, data layouts, and exploiting the latest GPU interconnect technology—delivers up to 10× speed‑up on multi‑GPU systems and up to 30 % improvement on multi‑node CPU clusters, while simultaneously reducing energy consumption. These results provide concrete guidelines for the next generation of Exascale‑ready NWP codes: prioritize device‑direct communication, adopt performance‑portable abstractions for future hardware, and evaluate both performance and power in tandem. The work sets a solid foundation for scaling full‑physics weather models to forthcoming exascale platforms.
Comments & Academic Discussion
Loading comments...
Leave a Comment