Awesome
Sparkler
Overview
The Sparkler miniapp computes a specialized dense matrix-matrix product C = A^T A for small integer elements of the matrix A. This operation mimics the matrix product used to compute the Custom Correlation Coefficient (CCC) in the CoMet computational genomics code.
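As a toy illustration of the operation only (this is not how Sparkler computes it on the GPU), the product C = A^T A can be sketched in plain Python. The function name `gram` and the 3 x 2 matrix are made up for this example.

```python
def gram(a):
    """Return C = A^T A for a matrix A given as a list of rows."""
    n_rows, n_cols = len(a), len(a[0])
    # C[i][j] is the dot product of column i and column j of A.
    return [[sum(a[k][i] * a[k][j] for k in range(n_rows))
             for j in range(n_cols)]
            for i in range(n_cols)]

# Hypothetical input: 3 rows, 2 columns, small integer entries.
A = [[1, 0],
     [2, 1],
     [0, 2]]
C = gram(A)
print(C)
```

Because C[i][j] and C[j][i] are the same dot product, C is always symmetric.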
Sparkler is licensed under the CoMet license; see https://github.com/wdj/comet.
Building
The build requires MPI and make. The default build requires CUDA 9.2 or higher for NVIDIA GPUs. An alternative build path for CPU-only execution requires an installed BLAS library, preferably multithreaded if the runs use more than one core per MPI rank.
To build for a cluster, modify the Makefile to reflect your MPI and CUDA installs and then type "make" (GPU case) or "env USE_GPU=NO make" (CPU-only case).
Running
Running the GPU executable requires one or more NVIDIA GPUs. Volta V100 or later (compute capability 7.0 or higher) GPUs are preferred; older GPUs will run much slower due to lack of tensor core hardware.
A run is composed of a series of iterations, each representing a global dense matrix-matrix product. A single iteration is composed of steps, each corresponding to a single GEMM executed on each GPU.
Command-line options:
--num_vector - number of vectors (half the number of columns of matrix A)
--num_field - number of fields (the number of rows of A)
--num_iterations - number of (global) matrix products done
Example:
mpirun -n 2 ./exec.gpu --num_vector 1000 --num_field 2000 --num_iterations 2
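To make the option descriptions concrete, the sketch below derives the global matrix dimensions implied by them: A has num_field rows and 2 * num_vector columns, so C = A^T A is square with 2 * num_vector rows and columns. The helper name `matrix_shapes` is hypothetical and not part of Sparkler.

```python
def matrix_shapes(num_vector, num_field):
    """Return the (rows, columns) shapes of A and of C = A^T A."""
    num_cols = 2 * num_vector          # num_vector is half the columns of A
    a_shape = (num_field, num_cols)    # num_field is the number of rows of A
    c_shape = (num_cols, num_cols)
    return a_shape, c_shape

# Values taken from the example command above.
a_shape, c_shape = matrix_shapes(num_vector=1000, num_field=2000)
print("A:", a_shape, "C:", c_shape)
```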
Reported values are:
TF - total number of GEMM floating point operations, in units of 10^12 operations
GEMM sec - total time spent in GPU GEMM operations
GEMM TF/sec - GEMM teraflop rate, ratio of TF to GEMM sec
total sec - total runtime
hash - a hash of the results computed, for evaluating correctness
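A minimal sketch of pulling these values out of the final summary line, assuming the line has the layout "TF ... GEMM sec ... GEMM TF/sec ... total sec ... hash ..." seen in the representative outputs. The helper `parse_summary` is hypothetical, and the numbers in the sample line are made up for illustration; they are not results from any real run.

```python
import re

# Hypothetical summary line with made-up values (not real results).
line = "TF 100.000 GEMM sec 4.000 GEMM TF/sec 25.000 total sec 5.000 hash 123456789"

def parse_summary(text):
    """Extract the reported values from a summary line as a dict."""
    pattern = (r"TF\s+(\S+)\s+GEMM sec\s+(\S+)\s+GEMM TF/sec\s+(\S+)\s+"
               r"total sec\s+(\S+)\s+hash\s+(\S+)")
    m = re.search(pattern, text)
    if m is None:
        raise ValueError("not a summary line")
    tf, gemm_sec, rate, total_sec, hash_value = m.groups()
    return {
        "tf": float(tf),
        "gemm_sec": float(gemm_sec),
        "gemm_tf_per_sec": float(rate),
        "total_sec": float(total_sec),
        "hash": hash_value,            # kept as text to avoid precision loss
    }

values = parse_summary(line)
# GEMM TF/sec is the ratio of TF to GEMM sec.
recomputed_rate = values["tf"] / values["gemm_sec"]
print(values["hash"], recomputed_rate)
```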
Competition Test Cases:
The four competition test cases can be run with
./run_test_case.sh <i>
where <i> = 1, 2, 3 or 4. Higher test case numbers correspond to more GPUs (1, 2, 3 or 6) and longer runtime. Note that test case 1 can run on smaller-memory GPUs, but cases 2, 3 and 4 require GPUs with at least 16 GB of memory each.
The script run_test_case.sh may need to be modified for your specific CUDA and MPI installations. The execution mode is one MPI rank per GPU.
The values to be reported are (1) the hash, to validate correctness, and (2) the GEMM TF/sec value, to measure performance. Note that, due to load balancing issues, the best per-GPU performance is achieved for odd numbers of GPUs.
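One way to see the load balancing effect is sketched below. It rests on an assumption inferred from the representative outputs rather than confirmed from the Sparkler source: because C = A^T A is symmetric, only the blocks on or above the block diagonal are computed, one GEMM per block, and they are spread over the GPUs in whole steps. Under that assumption the step counts come out as 1, 2, 2 and 4 for 1, 2, 3 and 6 GPUs, matching the "step ... of ..." lines in the outputs. The function `load_balance` is hypothetical.

```python
def load_balance(num_proc):
    """Return (blocks, steps, utilization) for num_proc GPUs.

    Assumption: only the num_proc * (num_proc + 1) / 2 blocks on or above
    the block diagonal of C are computed, one GEMM per block.
    """
    blocks = num_proc * (num_proc + 1) // 2
    steps = -(-blocks // num_proc)               # ceiling division
    utilization = blocks / (num_proc * steps)    # busy share of GPU-steps
    return blocks, steps, utilization

for p in (1, 2, 3, 6):
    blocks, steps, util = load_balance(p)
    print(f"{p} GPUs: {blocks} GEMMs in {steps} steps, utilization {util}")
```

For an odd number of GPUs the blocks divide evenly across the GPUs, so every GPU has a GEMM in every step; for an even number, some GPUs sit idle in one step.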
Representative outputs are shown below, from test runs on the Summit architecture using the Volta V100 tensor cores. The hash values from your runs should match those shown. Values marked here by "XXXXXX" will appear as actual numbers in your runs.
summit-batch4$ ./run_test_case.sh 1
num_vector 4000 num_field 90000 num_iterations 400 num_proc 1
Iteration 1 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 4 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 8 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 16 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 32 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 64 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 128 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 256 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 400 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
TF 4608.000 GEMM sec XXXXXX GEMM TF/sec XXXXXX total sec XXXXXX hash 435999930709XXXXXX
summit-batch4$ ./run_test_case.sh 2
num_vector 18000 num_field 90000 num_iterations 350 num_proc 2
Iteration 1 of 350, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1 of 350, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2 of 350, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2 of 350, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 4 of 350, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 4 of 350, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 8 of 350, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 8 of 350, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 16 of 350, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 16 of 350, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 32 of 350, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 32 of 350, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 64 of 350, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 64 of 350, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 128 of 350, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 128 of 350, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 256 of 350, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 256 of 350, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 350 of 350, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 350 of 350, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
TF 61236.000 GEMM sec XXXXXX GEMM TF/sec XXXXXX total sec XXXXXX hash 2775866192702XXXXXX
summit-batch4$ ./run_test_case.sh 3
num_vector 27000 num_field 90000 num_iterations 1600 num_proc 3
Iteration 1 of 1600, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1 of 1600, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2 of 1600, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2 of 1600, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 4 of 1600, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 4 of 1600, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 8 of 1600, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 8 of 1600, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 16 of 1600, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 16 of 1600, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 32 of 1600, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 32 of 1600, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 64 of 1600, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 64 of 1600, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 128 of 1600, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 128 of 1600, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 256 of 1600, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 256 of 1600, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 512 of 1600, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 512 of 1600, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 768 of 1600, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 768 of 1600, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1024 of 1600, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1024 of 1600, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1280 of 1600, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1280 of 1600, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1536 of 1600, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1536 of 1600, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1600 of 1600, step 1 of 2, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1600 of 1600, step 2 of 2, elapsed sec XXXXXX: setup... GEMM... check...
TF 559872.000 GEMM sec XXXXXX GEMM TF/sec XXXXXX total sec XXXXXX hash 3719610844656XXXXXX
peak-login1$ ./run_test_case.sh 4
num_vector 54000 num_field 90000 num_iterations 3000 num_proc 6
Iteration 1 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 4 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 4 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 4 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 4 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 8 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 8 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 8 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 8 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 16 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 16 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 16 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 16 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 32 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 32 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 32 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 32 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 64 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 64 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 64 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 64 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 128 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 128 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 128 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 128 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 256 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 256 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 256 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 256 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 512 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 512 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 512 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 512 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 768 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 768 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 768 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 768 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1024 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1024 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1024 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1024 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1280 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1280 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1280 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1280 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1536 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1536 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1536 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1536 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1792 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1792 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1792 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 1792 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2048 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2048 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2048 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2048 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2304 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2304 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2304 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2304 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2560 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2560 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2560 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2560 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2816 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2816 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2816 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2816 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 3000 of 3000, step 1 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 3000 of 3000, step 2 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 3000 of 3000, step 3 of 4, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 3000 of 3000, step 4 of 4, elapsed sec XXXXXX: setup... GEMM... check...
TF 3674160.000 GEMM sec XXXXXX GEMM TF/sec XXXXXX total sec XXXXXX hash 4137762059954XXXXXX
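The TF values in the outputs above can be reproduced with a short calculation. This is a sketch under assumptions inferred from those outputs, not taken from the Sparkler source: each GPU owns 2 * num_vector / num_proc columns of A, each GEMM multiplies two such column blocks over all num_field rows (2 operations per multiply-add), and each iteration performs one GEMM per block on or above the block diagonal of C. The function name `expected_teraflops` is hypothetical.

```python
def expected_teraflops(num_vector, num_field, num_iterations, num_proc):
    """Estimate the reported TF value from the run parameters."""
    cols_per_rank = 2 * num_vector // num_proc
    # One GEMM: (cols x num_field) times (num_field x cols), 2 flops per term.
    flops_per_gemm = 2 * cols_per_rank ** 2 * num_field
    # Assumed: one GEMM per block of the upper block triangle of C.
    gemms_per_iteration = num_proc * (num_proc + 1) // 2
    total_flops = num_iterations * gemms_per_iteration * flops_per_gemm
    return total_flops / 10 ** 12

# Parameters of the four test cases, as printed in the outputs above.
cases = [
    (4000, 90000, 400, 1),
    (18000, 90000, 350, 2),
    (27000, 90000, 1600, 3),
    (54000, 90000, 3000, 6),
]
for case in cases:
    print(case, f"TF {expected_teraflops(*case):.3f}")
```

If your run reports a different TF value for the same parameters, the run configuration probably differs from the test case definition.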