<div align='left'> <img src='https://github.com/user-attachments/assets/b2578723-b7a7-4d8f-bcd1-5008947b808a' > </div> <div align='center'> <img src='https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg' > <img src='https://img.shields.io/badge/Language-CUDA-brightgreen.svg' > <img src='https://img.shields.io/github/watchers/DefTruth/cuda-learn-note?color=9cc' > <img src='https://img.shields.io/github/forks/DefTruth/cuda-learn-note.svg?style=social' > <img src='https://img.shields.io/github/stars/DefTruth/cuda-learn-note.svg?style=social' > <img src='https://img.shields.io/badge/Release-v2.6-brightgreen.svg' > <img src='https://img.shields.io/badge/License-GPLv3.0-turquoise.svg' > </div> <div id="contents"></div>📚 Modern CUDA Learn Notes with PyTorch for Beginners: covers Tensor/CUDA Cores, TF32/FP16/BF16/FP8, 📖150+ CUDA Kernels🔥🔥 with PyTorch bindings, 📖30+ LLM/VLM🔥, 📖40+ CV/C++🔥 and 📖50+ CUDA/CuTe🔥 blogs, plus fully optimized 📖HGEMM/SGEMM🔥🔥 implementations; check the 📖HGEMM/SGEMM Supported Matrix👇 for more details. Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
<div id="hgemm-sgemm"></div> <div align='left'> <img src='https://github.com/user-attachments/assets/71927ac9-72b3-4ce9-b0e2-788b5885bc99' height="225px" width="403px"> <img src='https://github.com/user-attachments/assets/05ef4f5e-d999-48ea-b58e-782cffb24e85' height="225px" width="403px"> </div>Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, the HGEMM (WMMA/MMA) implemented in this repo (blue 🔵) can achieve 95%~99% of the performance of cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP` (orange 🟠). Please check the hgemm benchmark for more details.
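For reference, the cuBLAS baseline used in this comparison can be invoked roughly as below. This is a minimal sketch, not the repo's benchmark code: the wrapper name and the choice of `CUBLAS_COMPUTE_16F` are illustrative, and error handling is omitted.

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch: FP16 GEMM via cuBLAS with the default Tensor Cores algorithm
// (the orange 🟠 baseline above). A, B, C are row-major device pointers.
void cublas_hgemm_baseline(cublasHandle_t handle, int M, int N, int K,
                           const half* A, const half* B, half* C) {
  const half alpha = __float2half(1.0f);
  const half beta  = __float2half(0.0f);
  // cuBLAS assumes column-major storage; computing C^T = B^T * A^T
  // in column-major yields row-major C = A * B.
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
               N, M, K,
               &alpha,
               B, CUDA_R_16F, N,   // leading dimension of B is N
               A, CUDA_R_16F, K,   // leading dimension of A is K
               &beta,
               C, CUDA_R_16F, N,
               CUBLAS_COMPUTE_16F,
               CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```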
|CUDA Cores|Sliced K (Loop over K)|Tile Block|Tile Thread|
|:---:|:---:|:---:|:---:|
|✔️|✔️|✔️|✔️|
|WMMA (m16n16k16)|MMA (m16n8k16)|Pack LDST (128 bits)|SMEM Padding|
|✔️|✔️|✔️|✔️|
|Copy Async|Tile MMA (More Threads)|Tile Warp (More Values)|Multi Stages|
|✔️|✔️|✔️|✔️|
|Reg Double Buffers|Block Swizzle|Warp Swizzle|Collective Store (Warp Shfl)|
|✔️|✔️|✔️|✔️|
|Row Major (NN)|Col Major (TN)|SGEMM TF32|SMEM Swizzle (CuTe)|
|✔️|✔️|✔️|✔️|
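To make the building blocks in the matrix above concrete, here is a deliberately naive WMMA (m16n16k16) HGEMM sketch: one warp per 16x16 output tile, with a Sliced-K loop over the K dimension. The kernel name is illustrative; the repo's optimized kernels add the other techniques listed (tiling, copy async, multi-stage pipelining, swizzling, etc.). Assumes M, N, K are multiples of 16 and both matrices are row-major.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Naive WMMA HGEMM sketch: C[M,N] = A[M,K] * B[K,N], all row-major half.
// Launch with one warp per block: <<<dim3(N/16, M/16), 32>>>.
__global__ void wmma_hgemm_naive(const half* A, const half* B, half* C,
                                 int M, int N, int K) {
  int tile_m = blockIdx.y;  // 16-row tile index of C
  int tile_n = blockIdx.x;  // 16-col tile index of C
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;
  wmma::fill_fragment(c_frag, __float2half(0.0f));
  for (int k = 0; k < K; k += 16) {  // Sliced K: loop over K in 16-wide slices
    wmma::load_matrix_sync(a_frag, A + tile_m * 16 * K + k, K);
    wmma::load_matrix_sync(b_frag, B + k * N + tile_n * 16, N);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // Tensor Cores MMA
  }
  wmma::store_matrix_sync(C + tile_m * 16 * N + tile_n * 16, c_frag, N,
                          wmma::mem_row_major);
}
```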
📖 150+ CUDA Kernels 🔥🔥 (frequently asked in interviews) (©️back👆🏻)
Workflow: custom CUDA kernel impl -> PyTorch Python bindings -> Run tests. 👉TIPS: `*` = Tensor Cores (WMMA/MMA), otherwise CUDA Cores; `/` = not supported; ✔️ = supported; ❔ = planned.
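The kernel -> binding -> test workflow can be sketched with a single `.cu` file; the file, kernel, and function names here are illustrative, not taken from the repo.

```cuda
// elementwise_add.cu -- illustrative example of the repo's workflow:
// 1) write a CUDA kernel, 2) expose it via a PyTorch binding, 3) test it.
#include <torch/extension.h>

__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
}

torch::Tensor elementwise_add(torch::Tensor a, torch::Tensor b) {
  auto c = torch::empty_like(a);
  int n = a.numel();
  add_kernel<<<(n + 255) / 256, 256>>>(
      a.data_ptr<float>(), b.data_ptr<float>(), c.data_ptr<float>(), n);
  return c;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("elementwise_add", &elementwise_add, "elementwise add (CUDA)");
}
```

From Python, such a file can be JIT-compiled with `torch.utils.cpp_extension.load(name=..., sources=["elementwise_add.cu"])` and checked against `torch.add` in a test script.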
📖 Blog Index
<div id="my-blogs-part-1"></div>📖 LLM|Multimodal|Diffusion|Inference Optimization (my own posts) (©️back👆🏻)
📖 CV Inference & Deployment|C++|Algorithms|Tech Notes (my own posts) (©️back👆🏻)
<div id="my-blogs-part-2"></div>📖 CUTLASS|CuTe|NCCL|CUDA|Recommended Articles (by other authors) (©️back👆🏻)
<div id="other-blogs"></div>💡Note: This section collects some of my favorite articles. PRs recommending more excellent articles are welcome!
©️License (©️back👆🏻)
<div id="License"></div>GNU General Public License v3.0
🎉Contribute (©️back👆🏻)
<div id="Contribute"></div>How to contribute? Please check 🌤🌤CONTRIBUTE🎉🎉.
<div align='center'> <a href="https://star-history.com/#DefTruth/CUDA-Learn-Notes&Date"> <picture align='center'> <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=DefTruth/CUDA-Learn-Notes&type=Date&theme=dark" /> <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=DefTruth/CUDA-Learn-Notes&type=Date" /> <img width=450 height=300 alt="Star History Chart" src="https://api.star-history.com/svg?repos=DefTruth/CUDA-Learn-Notes&type=Date" /> </picture> </a> </div>