📖 Awesome CUDA Learn Notes

<div align='center'> <img src="https://img.shields.io/badge/Language-CUDA-brightgreen.svg" > <img src="https://img.shields.io/github/watchers/DefTruth/cuda-learn-note?color=9cc" > <img src="https://img.shields.io/github/forks/DefTruth/cuda-learn-note.svg?style=social" > <img src="https://img.shields.io/github/stars/DefTruth/cuda-learn-note.svg?style=social" > <img src="https://img.shields.io/badge/Release-v2.5-brightgreen.svg" > <img src="https://img.shields.io/badge/License-GPLv3.0-turquoise.svg" > </div>

🎉 Modern CUDA Learn Notes with PyTorch for Beginners: fp32/tf32, fp16/bf16, fp8/int8, Tensor/CUDA Cores, flash_attn, rope, embedding, sgemm, sgemv, hgemm, hgemv, warp/block reduce, dot prod, elementwise, sigmoid, relu, gelu, softmax, layernorm, rmsnorm, hist, and some CUDA optimization techniques (pack LDST, cp.async, warp gemv, sliced_k/split_k/pipeline gemm, bank conflicts reduce, WMMA/MMA, block/warp swizzle, etc.).
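For instance, the warp/block reduce item above boils down to a few register shuffles. Below is a minimal warp-sum sketch; the kernel name and launch shape are illustrative placeholders, not taken from this repo's kernels:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Warp-level sum reduction via register shuffles (no shared memory).
// Each of the 32 lanes contributes one value; after log2(32) = 5 steps,
// every lane holds the full warp sum.
__device__ __forceinline__ float warp_reduce_sum(float val) {
  #pragma unroll
  for (int mask = 16; mask > 0; mask >>= 1) {
    val += __shfl_xor_sync(0xffffffff, val, mask);
  }
  return val;
}

// Illustrative kernel: one warp reduces 32 floats.
__global__ void warp_reduce_kernel(const float* in, float* out) {
  float sum = warp_reduce_sum(in[threadIdx.x]);
  if (threadIdx.x == 0) *out = sum;
}

int main() {
  float h_in[32], h_out = 0.0f;
  for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;  // expected sum: 32
  float *d_in, *d_out;
  cudaMalloc(&d_in, 32 * sizeof(float));
  cudaMalloc(&d_out, sizeof(float));
  cudaMemcpy(d_in, h_in, 32 * sizeof(float), cudaMemcpyHostToDevice);
  warp_reduce_kernel<<<1, 32>>>(d_in, d_out);
  cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
  printf("warp sum = %f\n", h_out);
  cudaFree(d_in); cudaFree(d_out);
  return 0;
}
```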
📖 HGEMM/SGEMM Supported Matrix
| CUDA Cores | Sliced K (Loop over K) | Tile Block | Tile Thread |
|:---:|:---:|:---:|:---:|
| ✔️ | ✔️ | ✔️ | ✔️ |
| WMMA (m16n16k16) | MMA (m16n8k16) | Pack LDST (128 bits) | SMEM Padding |
| ✔️ | ✔️ | ✔️ | ✔️ |
| Copy Async | Tile MMA (More Threads) | Tile Warp (More Values) | Multi Stages |
| ✔️ | ✔️ | ✔️ | ✔️ |
| Reg Double Buffers | Block Swizzle | Warp Swizzle | Collective Store (Shfl) |
| ✔️ | ✔️ | ✔️ | ✔️ |
| Row Major (NN) | Col Major (TN) | SGEMM TF32 | SMEM Swizzle (Permute) |
| ✔️ | ✔️ | ✔️ | ❔ |
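To make the WMMA (m16n16k16) row concrete, here is a minimal single-tile sketch using the `nvcuda::wmma` API (half inputs, float accumulate, row-major NN layout). It is a simplified illustration under those assumptions, not this repo's tuned kernels, which layer tiling, cp.async, swizzling, and multi-stage pipelines on top:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile of C = A * B, with half inputs
// and float accumulation. Assumes M = N = K = 16, row-major A/B/C.
// Requires -arch=sm_70 or higher.
__global__ void wmma_hgemm_16x16x16(const half* A, const half* B, float* C) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);
  wmma::load_matrix_sync(a_frag, A, 16);  // lda = 16
  wmma::load_matrix_sync(b_frag, B, 16);  // ldb = 16
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
  wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

// Launch with exactly one warp:
// wmma_hgemm_16x16x16<<<1, 32>>>(dA, dB, dC);
```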
Currently, on NVIDIA L20, RTX 4090, and RTX 3090 Laptop GPUs, the HGEMM (WMMA and MMA) kernels implemented in this repo achieve approximately 95%~98% of the performance of cuBLAS's default Tensor Cores math algorithm, `CUBLAS_GEMM_DEFAULT_TENSOR_OP`. Please check the hgemm benchmark for more details.
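For reference, that cuBLAS baseline can be invoked roughly as follows via `cublasGemmEx`; error checking is omitted, and the actual benchmark harness in this repo may differ:

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Baseline HGEMM via cuBLAS Tensor Cores (cuBLAS column-major convention).
// dA: M x K, dB: K x N, dC: M x N, all half; leading dims = row counts.
void cublas_hgemm_tensor_op(cublasHandle_t handle, int M, int N, int K,
                            const half* dA, const half* dB, half* dC) {
  const half alpha = __float2half(1.0f);
  const half beta  = __float2half(0.0f);
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
               &alpha,
               dA, CUDA_R_16F, M,
               dB, CUDA_R_16F, K,
               &beta,
               dC, CUDA_R_16F, M,
               CUBLAS_COMPUTE_16F,              // half accumulate
               CUBLAS_GEMM_DEFAULT_TENSOR_OP);  // the baseline algorithm
}
```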
📖 CUDA Kernel Catalog (Common Interview Questions)
- / = not supported now.
- ✔️ = known to work and already supported.
- ❔ = planned, but not coming soon; maybe a few weeks later.
- workflow: custom CUDA kernel impl -> PyTorch python binding -> run tests (see the binding sketch after this list).
- How to contribute? Please check 🌤🌤 Kernel Trace & Goals & Code Style & Acknowledgements 🎉🎉.
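Here is a minimal sketch of that workflow, binding a placeholder elementwise kernel through `torch/extension.h`; the kernel and all names are hypothetical, not this repo's code:

```cuda
#include <torch/extension.h>
#include <cuda_runtime.h>

// Illustrative elementwise ReLU kernel (placeholder implementation).
__global__ void relu_kernel(const float* x, float* y, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) y[idx] = fmaxf(x[idx], 0.0f);
}

// PyTorch binding: takes a CUDA fp32 tensor, returns ReLU(x).
torch::Tensor relu_forward(torch::Tensor x) {
  TORCH_CHECK(x.is_cuda() && x.dtype() == torch::kFloat32,
              "expected a CUDA fp32 tensor");
  auto y = torch::empty_like(x);
  const int n = x.numel();
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  relu_kernel<<<blocks, threads>>>(x.data_ptr<float>(), y.data_ptr<float>(), n);
  return y;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("relu_forward", &relu_forward, "custom ReLU (CUDA)");
}
```

Compiled with `torch.utils.cpp_extension.load(...)`, the function becomes callable from Python, so its output can be tested directly against `torch.relu`.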
👉 TIPS: * means the kernel uses Tensor Cores (MMA/WMMA); otherwise, it uses CUDA Cores by default.
📖 Blog Catalog
📖 LLMs | Multimodal | Diffusion | Inference Optimization (written by the repo author)
📖 CV Inference & Deployment | C++ | Algorithms | Tech Notes (written by the repo author)
📖 CUTLASS | CuTe | NCCL | CUDA | Recommended Articles (by other authors)
💡 Note: the articles written by these experts are truly excellent, and I have learned a lot from them. Everyone is welcome to submit PRs recommending more great articles!
©️License
GNU General Public License v3.0
🎉Contribute
You are welcome to 🌟👆🏻 star this repo and submit a PR!
<div align='center'> <a href="https://star-history.com/#DefTruth/CUDA-Learn-Notes&Date"> <picture align='center'> <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=DefTruth/CUDA-Learn-Notes&type=Date&theme=dark" /> <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=DefTruth/CUDA-Learn-Notes&type=Date" /> <img width=450 height=300 alt="Star History Chart" src="https://api.star-history.com/svg?repos=DefTruth/CUDA-Learn-Notes&type=Date" /> </picture> </a> </div> <details> <summary>📖 References </summary>