Supplementary Material for Lectures
The PMPP Book: Programming Massively Parallel Processors: A Hands-on Approach (Amazon link)
Lecture 1: Profiling and Integrating CUDA kernels in PyTorch
- Speaker: Mark Saroufim
- Notebook and slides in the lecture_001 folder
Lecture 2: Recap Ch. 1-3 from the PMPP book
- Speaker: Andreas Koepf
- Slides: The PowerPoint file lecture_002/cuda_mode_lecture2.pptx is in this repository. Alternatively, it is available here as a Google Docs presentation.
Lecture 3: Getting Started With CUDA
- Speaker: Jeremy Howard
- Notebook: See the lecture_003 folder, or run the Colab version
Lecture 4: Intro to Compute and Memory Architecture
- Speaker: Thomas Viehmann
- Notebook and slides in the lecture_004 folder.
Lecture 5: Going Further with CUDA for Python Programmers
- Speaker: Jeremy Howard
- Notebook in the lecture_005 folder.
Lecture 6: Optimizing PyTorch Optimizers
Lecture 7: Advanced Quantization
- Speaker: Charles Hernandez
- Slides
Lecture 8: CUDA Performance Checklist
- Speaker: Mark Saroufim
- Code in the lecture_008 folder
- Slides
Lecture 9: Reductions
- Speaker: Mark Saroufim
- Code in the lecture_009 folder
- Slides
Lecture 10: Build a Prod Ready CUDA Library
- Speaker: Oscar Amoros Huguet
- Slides
Lecture 11: Sparsity
Lecture 12: Flash Attention
- Speaker: Thomas Viehmann
Lecture 13: Ring Attention
- Speaker: Andreas Koepf
- Slides
Lecture 14: Practitioner's Guide to Triton
Lecture 15: CUTLASS
- Speaker: Eric Auld
Lecture 16: Hands-On Profiling
- Speaker: Taylor Robbie
Bonus Lecture: CUDA C++ llm.cpp
- Speaker: Jake Hemstad & Georgii Evtushenko
- Slides
Lecture 17: GPU Collective Communication (NCCL)
- Speaker: Dan Johnson
- Code in the lecture_017 folder
Lecture 18: Fused Kernels
- Speaker: Kapil Sharma
- Code in the lecture_018 folder
Lecture 19: Data Processing on GPUs
- Speaker: Devavret Makkar
Lecture 20: Scan Algorithm
- Speaker: Izzat El Hajj
- Slides
Lecture 21: Scan Algorithm Part 2
- Speaker: Izzat El Hajj
- Slides
Lecture 22: Hacker's Guide to Speculative Decoding in vLLM
- Speaker: Cade Daniel
- Slides
Lecture 23: Tensor Cores
- Speaker: Vijay Thakkar & Pradeep Ramani
- Slides
Lecture 24: Scan at the Speed of Light
- Speaker: Jake Hemstad & Georgii Evtushenko
Lecture 25: Speaking Composable Kernel
- Speaker: Haocong Wang
- Slides
Lecture 26: SYCL MODE (Intel GPU)
- Speaker: Patric Zhao
- Slides
Lecture 27: gpu.cpp
- Speaker: Austin Huang
- Slides
Lecture 28: Liger Kernel
Lecture 29: Triton Internals
- Speaker: Kapil Sharma
- Code/presentation in the lecture_029 folder
Lecture 30: Quantized Training
- Speaker: Thien Tran
- Code/presentation in the lecture_030 folder
Lecture 31: Beginner's Guide to Metal Kernels
- Speaker: Nikita Shulga
- Code/presentation in the lecture_031 folder
Lecture 32: Unsloth - LLM Systems Engineering
- Speaker: Daniel Han
- Slides
Lecture 33: BitBLAS
- Speaker: Wang Lei
- Code/presentation in the lecture_033 folder
Lecture 34: Low Bit Triton Kernels
- Speaker: Hicham Badri
- Slides
Lecture 35: SGLang Performance Optimization
- Speaker: Yineng Zhang
- Slides
Lecture 36: CUTLASS and Flash Attention 3
Lecture 37: Introduction to SASS & GPU Microarchitecture
- Speaker: Arun Demeure
- Slides