Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors
Code repo for Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors.
TODO: Provide better instructions and explanations
Setup
Add the CUDA path to your environment, e.g.:
export CUDA_PATH=/usr/local/cuda-11.0
export PATH=$CUDA_PATH/bin:$PATH
export CUDACXX=$CUDA_PATH/bin/nvcc
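After exporting these variables, running `nvcc --version` should report the toolkit installed under `$CUDA_PATH` (a quick sanity check; not part of the repo's scripts).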
Configure the compiler target architecture (SM):
export TargetSM=80   # for A100
export TargetSM=70   # for V100
export TargetSM=75   # for Turing
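If you are unsure which value applies to your GPU, a small standalone CUDA program can report the compute capability (a minimal sketch, not part of this repo; the file name `query_sm.cu` is just an example):

```cuda
// query_sm.cu -- minimal sketch, not part of the repo.
// Build: nvcc query_sm.cu -o query_sm
// Prints the compute capability of device 0, e.g. 8.0 -> TargetSM=80.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("%s: compute capability %d.%d -> export TargetSM=%d%d\n",
                prop.name, prop.major, prop.minor, prop.major, prop.minor);
    return 0;
}
```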
Run the scripts
cd microbench
sh run_all.sh
You should get a set of xxx-ILPx.log files.
Note: the scripts will print static_assert error messages, because some kernels use static_assert() to reject larger ILP configurations. These error messages can be safely ignored; a sketch of why they appear follows below.
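For context, the messages come from compile-time guards of roughly the following shape (a hypothetical sketch, not the repo's actual code; the kernel name and ILP limit are made up):

```cuda
// ilp_guard_sketch.cu -- hypothetical illustration, not the repo's code.
// The benchmarks are instantiated for several ILP values; variants whose ILP
// exceeds a compile-time limit are rejected by static_assert, so the build
// prints an error for those variants and the scripts simply move on.
#include <cstdio>
#include <cuda_runtime.h>

template <int ILP>
__global__ void bench_kernel() {
    static_assert(ILP <= 4, "ILP too large for this benchmark variant");
    // ... benchmark body would go here ...
}

int main() {
    bench_kernel<4><<<1, 1>>>();    // within the limit: compiles and runs
    // bench_kernel<8><<<1, 1>>>(); // would trigger the static_assert at compile time
    cudaDeviceSynchronize();
    std::printf("done\n");
    return 0;
}
```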
References
Some code is borrowed from Accel-Sim.
Citation
@ARTICLE{9931992,
author={Sun, Wei and Li, Ang and Geng, Tong and Stuijk, Sander and Corporaal, Henk},
journal={IEEE Transactions on Parallel and Distributed Systems},
title={Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors},
year={2023},
volume={34},
number={1},
pages={246-261},
doi={10.1109/TPDS.2022.3217824}}