Supplementary Material for Lectures

YouTube Channel

The PMPP Book: Programming Massively Parallel Processors: A Hands-on Approach (Amazon link)

Lecture 1: Profiling and Integrating CUDA kernels in PyTorch

Speaker: Mark Saroufim
Notebook and slides in the lecture_001 folder

Lecture 2: Recap Ch. 1-3 from the PMPP book

Speaker: Andreas Koepf
Slides: The PowerPoint file is at lecture_002/cuda_mode_lecture2.pptx, relative to the root directory of this repository. Alternatively, it is available here as a Google Docs presentation.

Lecture 3: Getting Started With CUDA

Speaker: Jeremy Howard
Notebook: See the lecture_003 folder, or run the Colab version

Lecture 4: Intro to Compute and Memory Architecture

Speaker: Thomas Viehmann
Notebook and slides in the lecture_004 folder.

Lecture 5: Going Further with CUDA for Python Programmers

Speaker: Jeremy Howard
Notebook in the lecture_005 folder.

Lecture 6: Optimizing PyTorch Optimizers

Speaker: Jane Xu
Slides

Lecture 7: Advanced Quantization

Speaker: Charles Hernandez
Slides

Lecture 8: CUDA Performance Checklist

Speaker: Mark Saroufim
Code in the lecture_008 folder
Slides

Lecture 9: Reductions

Speaker: Mark Saroufim
Code in the lecture_009 folder
Slides

Lecture 10: Build a Prod Ready CUDA Library

Speaker: Oscar Amoros Huguet
Slides

Lecture 11: Sparsity

Speaker: Jesse Cai
Slides

Lecture 12: Flash Attention

Speaker: Thomas Viehmann

Lecture 13: Ring Attention

Speaker: Andreas Koepf
Slides

Lecture 14: Practitioner's Guide to Triton

Speaker: Umer Adil
Date: 2024-04-13
Notebook

Lecture 15: CUTLASS

Speaker: Eric Auld

Lecture 16: Hands-On Profiling

Speaker: Taylor Robbie

Bonus Lecture: CUDA C++ llm.cpp

Speakers: Jake Hemstad & Georgii Evtushenko
Slides

Lecture 17: GPU Collective Communication (NCCL)

Speaker: Dan Johnson
Code in the lecture_017 folder

Lecture 18: Fused Kernels

Speaker: Kapil Sharma
Code in the lecture_018 folder

Lecture 19: Data Processing on GPUs

Speaker: Devavret Makkar

Lecture 20: Scan Algorithm

Speaker: Izzat El Hajj
Slides

Lecture 21: Scan Algorithm Part 2

Speaker: Izzat El Hajj
Slides

Lecture 22: Hacker's Guide to Speculative Decoding in vLLM

Speaker: Cade Daniel
Slides

Lecture 23: Tensor Cores

Speakers: Vijay Thakkar & Pradeep Ramani
Slides

Lecture 24: Scan at the Speed of Light

Speakers: Jake Hemstad & Georgii Evtushenko

Lecture 25: Speaking Composable Kernel

Speaker: Haocong Wang
Slides

Lecture 26: SYCL MODE (Intel GPU)

Speaker: Patric Zhao
Slides

Lecture 27: gpu.cpp

Speaker: Austin Huang
Slides

Lecture 28: Liger Kernel

Lecture 29: Triton Internals

Speaker: Kapil Sharma
Code/presentation in the lecture_029 folder

Lecture 30: Quantized Training

Speaker: Thien Tran
Code/presentation in the lecture_030 folder

Lecture 31: Beginners Guide to Metal Kernels

Speaker: Nikita Shulga
Code/presentation in the lecture_031 folder

Lecture 32: Unsloth - LLM Systems Engineering

Speaker: Daniel Han
Slides

Lecture 33: BitBLAS

Speaker: Wang Lei
Code/presentation in the lecture_033 folder

Lecture 34: Low Bit Triton Kernels

Speaker: Hicham Badri
Slides

Lecture 35: SGLang Performance Optimization

Speaker: Yineng Zhang
Slides

Lecture 36: CUTLASS and Flash Attention 3

Speaker: Jay Shah
Slides

Lecture 37: Introduction to SASS & GPU Microarchitecture

Speaker: Arun Demeure
Slides

Lecture 38: Low Bit Kernels for ARM CPU

Speaker: Scott Roy
Slides