A SYCL plug-in to run AMReX apps on AMD/Nvidia GPUs
Nuno Nobre, Alex Grant, Karthikeyan Chockalingam and Xiaohu Guo
SYCL unlocks single-source development for hardware accelerators by leveraging C++ templated functions, greatly easing the otherwise laborious task of porting C++ code to heterogeneous architectures. The aim of this work is to show that state-of-the-art scientific applications such as AMReX can be written solely in SYCL while still preserving their performance portability features.
We demonstrate how little extra development effort is required using AMReX's ElectromagneticPIC tutorial. We chose this tutorial for its relevance to plasma fusion, but we note that all AMReX applications should be able to benefit. Here is an illustration of a PIC simulation you can carry out using this code:
Current-driven Langmuir oscillations at the plasma frequency on a 32 x 32 x 32 grid with 1 electron per cell. The mesh is coloured by the amplitude of the oscillating but spatially uniform electric field: from blue (-) to red (+).
The plug-in consists of a build script and code patches which extend AMReX's SYCL capability beyond Intel GPUs. We support two open-source SYCL compiler and runtime frameworks:
- DPC++ (proprietary solutions, e.g. the Intel oneAPI DPC++/C++ Compiler and its plugins, should also work);
- Open SYCL (formerly known as hipSYCL).
The plug-in has been tested on all the high-performance computing GPUs generally available at the beginning of 2023:
- Nvidia: V100 and A100;
- AMD: MI100, MI210 and MI250.
Since AMReX also includes native support for both the Nvidia CUDA and the AMD HIP programming models, a direct comparison against them is straightforward. This plot shows that the SYCL implementation is competitive with those vendor alternatives.
SYCL vs CUDA and HIP. Performance comparison for a Langmuir oscillations simulation on a 128 x 128 x 128 grid with 64 electrons per cell and 100 time steps. The AMD MI100 is an order of magnitude slower due to the lack of support for FP64 atomics on that GPU. The PIC loop is always faster on the SYCL implementation, but the particle initialisation routine is notably slower. With so few iterations, this results in longer total execution times for SYCL on some GPUs, such as the A100. For a detailed per-routine comparison for each GPU, see here.
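To give a feel for why missing FP64 atomics can be so costly, here is a minimal host-side sketch in standard C++ of one common fallback: emulating an atomic double-precision add with a 64-bit compare-and-swap (CAS) loop. The helper names `atomic_add_f64_cas`, `load_f64` and `accumulate_demo` are hypothetical and are not part of AMReX, DPC++ or Open SYCL. Whether the SYCL runtimes fall back to exactly this scheme on the MI100 is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: atomically add `value` to a double stored as raw
// 64-bit bits, using only an integer compare-and-swap. Returns the value
// held before the add. This is an illustrative fallback, not the code
// used by any particular SYCL runtime.
inline double atomic_add_f64_cas(std::atomic<std::uint64_t>& target, double value) {
    std::uint64_t old_bits = target.load(std::memory_order_relaxed);
    for (;;) {
        double old_val;
        std::memcpy(&old_val, &old_bits, sizeof old_val);   // bits -> double
        const double new_val = old_val + value;
        std::uint64_t new_bits;
        std::memcpy(&new_bits, &new_val, sizeof new_bits);  // double -> bits
        // On failure, compare_exchange_weak refreshes old_bits with the
        // current contents, so the loop retries with up-to-date data.
        if (target.compare_exchange_weak(old_bits, new_bits,
                                         std::memory_order_relaxed)) {
            return old_val;
        }
    }
}

// Hypothetical helper: read the stored bits back as a double.
inline double load_f64(const std::atomic<std::uint64_t>& target) {
    const std::uint64_t bits = target.load(std::memory_order_relaxed);
    double val;
    std::memcpy(&val, &bits, sizeof val);
    return val;
}

// Hypothetical demo: deposit `count` contributions of `weight` into one cell.
inline double accumulate_demo(int count, double weight) {
    assert(count >= 0);
    std::atomic<std::uint64_t> cell{0};  // all-zero bits represent +0.0
    for (int i = 0; i < count; ++i) {
        atomic_add_f64_cas(cell, weight);
    }
    return load_f64(cell);
}
```

The demo runs on a single thread for simplicity, but the helper is safe to call concurrently. When many threads target the same cell, failed CAS attempts force retries, which is one plausible reason an emulated atomic is much slower than a native one in a deposition-heavy PIC loop.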
To learn how to install and use the plug-in, continue reading here.
Acknowledgments
- Joe Todd (@joeatodd), Rod Burns (@rodburns), and the Codeplay developer team (@codeplaysoftware) for helping to identify an issue with slow FP atomics on GPUs, and for developing the Nvidia and AMD support plugins for DPC++;
- Aksel Alpay (@illuhad), and the Open SYCL developer team (@opensycl) for cultivating a welcoming community, for reviewing and suggesting improvements to our contributions to Open SYCL, and for developing Open SYCL;
- Andrew Myers (@atmyers), Axel Huebl (@ax3l), Weiqun Zhang (@weiqunzhang), and the AMReX developer team (@amrex-codes) for welcoming and reviewing our contributions to AMReX and its tutorials, and for developing AMReX;
- UKAEA (@ukaea), and the FARSCAPE project (@farscape-project) for funding this work.