Awesome
Introduction
The directories in this repository contain code examples for the course of OpenMP GPU-offloading at Paderborn Center for Parallel Computing (PC²), Paderborn University. The sub-directories are generally organized as:
- src: source code
- docs: documentation
- tests: some tests
Some highlights of the codes in this repository:
-
The performance of our
saxpy
implemented by using OpenMP GPU-offloading is as good ascublasSaxpy
in CUBLAS. Seecase 7
in05_saxpy/src/asaxpy.c
for details. -
The GPU shared memory has not been standardized in OpenMP API Specification (Version 5.0 Nov. 2018). To optimize the performance of matrix multiplication by using OpenMP GPU-offloading, i)
case 6
in10_matMul/src/matMulAB.c
implements a register blocking algorithm and ii)case 8
in the same source code file implements a common GPU-based tiled algorithm by blocking the local shared memory in a very tricky manner and the OpenMP code resembles CUDA.
List of Projects
-
00_build_OpenMP_offload
Documentation and scripts for building GCC as well as Clang/LLVM with OpenMP support for Nvidia GPU offloading.
-
01_accelQuery
accelQuery
searches accelerator(s) on a heterogeneous computer. Accelerator(s), if found, will be enumerated with some basic info. -
02_dataTransRate
dataTransRate
gives the data transfer rate (in MB/sec) fromsrc
todst
.The possible situations are:
- h2h:
src
= host anddst
= host - h2a:
src
= host anddst
= accel - a2a:
src
= accel anddst
= accel
NOTE:
- A bug in Clang 9.0.1 has been fixed in Clang 11.
- The data transfer rata for
a2a
is still lower than our expectation.
- h2h:
-
03_taskwait
taskwait
checks thetaskwait
construct for the deferred target task.NOTE:
- Asynchronous offloading hasn't been implemented in the GCC 9.2 compiler.
- Asynchronous offloading is available in Clang 11.
-
04_scalarAddition
scalarAddition
adds two integers on host and accelerator, and also compares the performance. -
05_saxpy
saxpy
performs thesaxpy
operation on host as well as accelerator. The performance (in MB/s) for different implementations is also compared. -
08_distThreads
distThreads
demonstrates the organization of threads and teams in a league on GPU. -
09_matAdd
matAdd
performs matrix addition (A +=B) in single-precision on GPU. The performance (in GB/s) for different implementations is compared and the numerical results are also verified. -
10_matMul
matMul
performs matrix multiplication in single-precision on GPU. The performance (in GFLOPS) for different implementations is compared and the numerical results are also verified.