GPU MODE Resource Stream

Here you will find a collection of CUDA-related material (books, papers, blog posts, YouTube videos, tweets, implementations, etc.). We also collect information on higher-level tools for performance optimization and kernel development, such as Triton and torch.compile() ... whatever makes the GPUs go brrrr.

Know a great resource we should add? Please see How to contribute.

Lectures / Reading Group Live Sessions

You can find a list of upcoming lectures under Events in the channel list (sidebar) of our Discord server.

Recordings of the weekly lectures are published on our YouTube channel. Material (code, slides) for the individual lectures can be found in the lectures repository.

1st Contact with CUDA

An Easy Introduction to CUDA C and C++
An Even Easier Introduction to CUDA
CUDA Toolkit Documentation
Basic terminology: Thread block, Warp, Streaming Multiprocessor: Wiki: Thread Block, A tour of CUDA
GPU Performance Background User's Guide
OLCF NVIDIA CUDA Training Series, talk recordings can be found under the presentation footer for each lecture; exercises
GTC 2022 - CUDA: New Features and Beyond - Stephen Jones
Intro video: Writing Code That Runs FAST on a GPU

2nd Contact

CUDA Refresher

Hazy Research

The MLSys-oriented research group at Stanford led by Chris Re, with alumni Tri Dao, Dan Fu, and many others. A goldmine.

Building Blocks for AI Systems: Their collection of resources similar to this one, many great links
Data-Centric AI: An older such collection
Blog
ThunderKittens: (May 2024) A DSL within CUDA, this blog post has good background on getting good H100 performance
Systems for Foundation Models, and Foundation Models for Systems: Chris Re's keynote from NeurIPS Dec 2023

Papers, Case Studies

Books

Programming Massively Parallel Processors: A Hands-on Approach
CUDA by Example: An Introduction to General-Purpose GPU Programming; code
The CUDA Handbook
The Book of Shaders: a guide through the abstract and complex universe of fragment shaders (not CUDA, but GPU related)
Art of HPC: four books on HPC more generally; they do not specifically cover GPUs, but the lessons broadly apply

CUDA Courses

HetSys: Programming Heterogeneous Computing Systems with GPUs and other Accelerators
Heterogeneous Parallel Programming Class (YouTube playlist) Prof. Wen-mei Hwu, University of Illinois
Official YouTube channel for "Programming Massively Parallel Processors: A Hands-on Approach", course playlist: Applied Parallel Programming
Programming Parallel Computers; covers both CUDA and CPU-parallelism. Use Open Course Version and you can even submit your own solutions to the exercises for testing and benchmarking.

CUDA Grandmasters

Practice

Adnan Aziz and Anupam Bhatnagar: GPU Puzzlers

PyTorch Performance Optimization

PyTorch Internals & Debugging

TorchDynamo Deep Dive
PyTorch Compiler Troubleshooting
PyTorch internals
PyTorch 2 internals
Understanding GPU memory: 1: Visualizing All Allocations over Time, 2: Finding and Removing Reference Cycles
Debugging memory using snapshots: Debugging PyTorch memory use with snapshots
CUDA caching allocator: https://zdevito.github.io/2022/08/04/cuda-caching-allocator.html
Trace Analyzer: PyTorch Trace Analysis for the Masses
Holistic Trace Analysis (HTA), gh: facebookresearch/HolisticTraceAnalysis

Code / Libs

NVIDIA/cutlass

Essentials

Triton compiler tutorials
CUDA C++ Programming Guide
PyTorch: Custom C++ and CUDA Extensions, Code: pytorch/extension-cpp
PyTorch C++ API
pybind11 documentation
NVIDIA Tensor Core Programming
GPU Programming: When, Why and How?
How GPU Computing Works | GTC 2021 (more basic than the 2022 version)
How CUDA Programming Works | GTC 2022
CUDA Kernel optimization: Part 1, Part 2
PTX and ISA Programming Guide (V8.3)
Compiler Explorer: Inspect PTX: div 256 -> shr 8 example
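The "div 256 -> shr 8" entry refers to strength reduction: a compiler can replace division by a power of two with a right shift. Below is a minimal sketch of why that is safe, written in plain Python rather than CUDA so it runs anywhere. It assumes the linked example concerns unsigned operands; the helper `c_style_div` is a hypothetical name introduced here to mimic the truncating division of C and CUDA.

```python
def c_style_div(x, d):
    """Integer division that truncates toward zero, as C and CUDA do."""
    q = abs(x) // abs(d)
    return q if (x >= 0) == (d >= 0) else -q


# For non-negative values, dividing by 256 and shifting right by 8 agree.
samples = [0, 1, 255, 256, 257, 65535, 2**31, 2**32 - 1]
unsigned_ok = all(c_style_div(x, 256) == (x >> 8) for x in samples)

# For negative values they differ: truncating division rounds toward zero,
# while an arithmetic shift rounds toward negative infinity.
signed_div = c_style_div(-1, 256)
signed_shift = -1 >> 8

print(unsigned_ok, signed_div, signed_shift)  # True 0 -1
```

The mismatch for negative inputs is why a plain shift is only a drop-in replacement for unsigned division; for signed operands a compiler has to emit extra correction instructions around the shift.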

Profiling

Nsight Compute Profiling Guide
mcarilli/nsight.sh - Favorite nsight systems profiling commands for PyTorch scripts
Profiling GPU Applications with Nsight Systems

Python GPU Computing

PyTorch
Triton, github: openai/triton
numba @cuda.jit
Apache TVM
JAX Pallas
CuPy: NumPy-compatible GPU computing
NVIDIA Fuser
Codon @gpu.kernel, github: exaloop/codon
Mojo (part of the commercial MAX Platform by Modular)
NVIDIA Python Bindings: CUDA Python (calling NVRTC to compile kernels, malloc, copy, launching kernels, ..), cuDNN FrontEnd(FE) API, CUTLASS Python Interface

Advanced Topics, Research, Compilers

TACO: The Tensor Algebra Compiler, gh: tensor-compiler/taco
Mosaic compiler: C++ DSL for sparse and dense tensor algebra (built on top of TACO), paper, presentation

News

SemiAnalysis

Technical Blog Posts

Cooperative Groups: Flexible CUDA Thread Programming (Oct 04, 2017)
A friendly introduction to machine learning compilers and optimizers (Sep 7, 2021)

Hardware Architecture

GPU-MODE Community Projects

ring-attention

see our ring-attention repo

pscan

GPU Gems: Parallel Prefix Sum (Scan) with CUDA, PDF version (2007), impl: Stack Overflow, nicer impl: mattdean1/cuda
Accelerating Reduction and Scan Using Tensor Core Units
Thrust: Prefix Sums, Reference: scan variants
CUB, part of cccl: NVIDIA/cccl/tree/main/cub
SAM Algorithm: Higher-Order and Tuple-Based Massively-Parallel Prefix Sums (licensed for non-commercial use only)
CUB Algorithm: Single-pass Parallel Prefix Scan with Decoupled Look-back
Group Experiments: johnryan465/pscan, andreaskoepf/pscan_kernel
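For orientation before diving into the links above, here is a minimal CPU-side sketch of the two classic scan formulations, the naive Hillis-Steele scan and the work-efficient Blelloch scan. It is plain Python that only simulates the parallel steps; it is not taken from any of the linked implementations, and the function names are hypothetical.

```python
from itertools import accumulate


def hillis_steele_inclusive_scan(values):
    """Naive scan: about log2(n) steps, O(n log n) additions in total."""
    data = list(values)
    offset = 1
    while offset < len(data):
        # On a GPU each index i would be one thread. Reading only from the
        # previous buffer models the double buffering a kernel would need.
        previous = data
        data = [
            previous[i] + previous[i - offset] if i >= offset else previous[i]
            for i in range(len(previous))
        ]
        offset *= 2
    return data


def blelloch_exclusive_scan(values):
    """Work-efficient scan: up-sweep then down-sweep, O(n) additions."""
    n = len(values)
    if n == 0:
        return []
    size = 1
    while size < n:  # pad to a power of two with the identity element 0
        size *= 2
    tree = list(values) + [0] * (size - n)

    stride = 1
    while stride < size:  # up-sweep: build partial sums in place
        for i in range(2 * stride - 1, size, 2 * stride):
            tree[i] += tree[i - stride]
        stride *= 2

    tree[size - 1] = 0  # clear the root, then push prefixes back down
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = tree[i - stride]
            tree[i - stride] = tree[i]
            tree[i] += left
        stride //= 2
    return tree[:n]


values = [3, 1, 7, 0, 4, 1, 6, 3]
inclusive = hillis_steele_inclusive_scan(values)
exclusive = blelloch_exclusive_scan(values)
print(inclusive)  # [3, 4, 11, 11, 15, 16, 22, 25]
print(exclusive)  # [0, 3, 4, 11, 11, 15, 16, 22]
```

In each inner loop the iterations are independent of one another, which is what lets a GPU run them as parallel threads with a barrier between strides. Single-pass approaches such as decoupled look-back are organized differently and are not covered by this sketch.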

Triton Kernels / Examples

unsloth, which implements custom kernels in Triton for faster QLoRA training
Custom implementation of relative position attention (link)
Tri Dao's Triton implementation of Flash Attention: flash_attn_triton.py
YouTube playlist: Triton Conference 2023
LightLLM, with different Triton kernels for different LLMs

How to contribute

To share interesting CUDA-related links, please create a pull request for this file. See editing files in the GitHub documentation.

Or contact us on the GPU MODE Discord server: https://discord.gg/gpumode