Awesome LLM Systems Papers
A curated list of academic papers, articles, tutorials, slides, and projects on Large Language Model systems. Star this repository to keep up with the latest developments in this fast-moving research field.
LLM Systems
Pre-Training
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- Reducing Activation Recomputation in Large Transformer Models
- Optimized Network Architectures for Large Language Model Training with Billions of Parameters | MIT
- Carbon Emissions and Large Neural Network Training | Google, UCB
- Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates | SOSP 23
- GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
- Perseus: Removing Energy Bloat from Large Model Training | SOSP' 24
- MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | ByteDance
- DISTMM: Accelerating distributed multimodal model training | NSDI' 24
- A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters
- Pipeline Parallelism with Controllable Memory | Sea AI Lab
- Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
- Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training | ICML 24
- Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
- Alibaba HPN: A Data Center Network for Large Language Model Training
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- ByteCheckpoint: A Unified Checkpointing System for LLM Development
- The Llama 3 Herd of Models (Section 3)
- HybridFlow: A Flexible and Efficient RLHF Framework
- Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization
- FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training
- Enabling Parallelism Hot Switching for Efficient Training of Large Language Models | SOSP' 24
- Revisiting Reliability in Large-Scale Machine Learning Research Clusters
- ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation | SOSP' 24
- ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling | EuroSys '24
- DynaPipe : Optimizing Multi-task Training through Dynamic Pipelines | EuroSys '24
- HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis | EuroSys'24
- Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences | PKU
- RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion | NSDI'25
- Improving training time and GPU utilization in geo-distributed language model training
Serving
- Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI'22
- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline | NUS
- Efficiently Scaling Transformer Inference | MLSys' 23
- Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
- TurboTransformers: An Efficient GPU Serving System For Transformer Models
- MPCFormer : fast, performant, and private transformer inference with MPC | ICLR'23
- POLCA: Power Oversubscription in LLM Cloud Providers | Microsoft
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills | Microsoft
- FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU | ICML' 23
- AttMemo: Accelerating Self-Attention with Memoization on Big Memory Systems
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | SOSP' 23
- Tabi: An Efficient Multi-Level Inference System for Large Language Models | EuroSys' 23
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | VLDB' 24
- AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation | Microsoft
- FlashDecoding++: Faster Large Language Model Inference on GPUs | Tsinghua
- DeepSpeed-MII: Model Implementations for Inference (MII) | Microsoft
- Punica: Multi-Tenant LoRA Serving
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters
- STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining | ASPLOS 23
- SpotServe: Serving Generative Large Language Models on Preemptible Instances | CMU
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory | Apple
- SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
- Fairness in Serving Large Language Models | OSDI' 24
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | OSDI' 24
- Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
- APIServe: Efficient API Support for Large-Language Model Inferencing
- FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning
- DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
- Optimizing LLM Queries in Relational Workloads | UCB
- AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving | NUS
- MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving
- LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism | SOSP' 24
- RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation | PKU
- Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services | Umich
- BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
- Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs | CMU
- Eloquent: A More Robust Transmission Scheme for LLM Token Streaming | NAIC' 24
- Optimizing Speculative Decoding for Serving Large Language Models Using Goodput | UCB
- Enabling Elastic Model Serving with MultiWorld | Cisco Research
- ALTO: An Efficient Network Orchestrator for Compound AI Systems | Stanford & UCB
- Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models
- NanoFlow: Towards Optimal Large Language Model Serving Throughput
- Responsive ML inference in multi-tenanted environments using AQUA
- One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving
- MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
- dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving | OSDI' 24
- Parrot: Efficient Serving of LLM-based Applications with Semantic Variable | OSDI' 24
- Llumnix: Dynamic Scheduling for Large Language Model Serving | OSDI' 24
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | OSDI' 24
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
- ServerlessLLM: Low-Latency Serverless Inference for Large Language Models | OSDI' 24
- Preble: Efficient Distributed Prompt Scheduling for LLM Serving
- Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
- ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | SOSP' 24
- Context Parallelism for Scalable Million-Token Inference
- xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
- Pie: Pooling CPU Memory for LLM Inference
- NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
- FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving
- Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference
Fine-tuning Systems
- Ymir: A Scheduler for Foundation Model Fine-tuning Workloads in Datacenters | ICS' 24
Multi-Modal Systems
- MOSEL: Inference Serving Using Dynamic Modality Selection
- DISTMM: Accelerating distributed multimodal model training | NSDI' 24
- Approximate Caching for Efficiently Serving Diffusion Models | Adobe Research
- DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models | MIT
- Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
- Addressing Model and Data Heterogeneity in Multimodal Large Language Model Training | PKU
LLM for Systems
- Large Language Models for Compiler Optimization
- The Hitchhiker's Guide to Program Analysis: A Journey with Large Language Models
- LLM-Assisted Code Cleaning For Training Accurate Code Generators | UCB
- Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management
- If At First You Don’t Succeed, Try, Try, Again...? | SOSP' 24
- Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation | EuroSys '24
- GMorph: Accelerating Multi-DNN Inference via Model Fusion | EuroSys '24
- Automatic Root Cause Analysis via Large Language Models for Cloud Incidents | EuroSys '24
System Efficiency Optimization
- Fast Distributed Inference Serving for Large Language Models | PKU
- FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance | Stanford
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | ICML ES-FoMo Workshop 2023
- Inference with Reference: Lossless Acceleration of Large Language Models
- SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference
- Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
- Knowledge-preserving Pruning for Pre-trained Language Models without Retraining | SNU
- Accelerating LLM Inference with Staged Speculative Decoding | ICML' 23
- SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification | CMU
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML' 23
- S3: Increasing GPU Utilization during Generative Inference for Higher Throughput | Harvard
- LLMCad: Fast and Scalable On-device Large Language Model Inference
- Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding | THU
- LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery | Microsoft
- Ring Attention with Blockwise Transformers for Near-Infinite Context | UCB
- Learned Best-Effort LLM Serving | UCB
- Star Attention: Efficient LLM Inference over Long Sequences | NVIDIA
ML Systems
- INFaaS: Automated Model-less Inference Serving | ATC’ 21
- Alpa : Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI' 22
- Pathways : Asynchronous Distributed Dataflow for ML | MLSys' 22
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale | ICML' 22
- ZeRO-Offload: Democratizing Billion-Scale Model Training
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Band: Coordinated Multi-DNN Inference on Heterogeneous Mobile Processors | MobiSys ’22
- Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing | ATC'22
- Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access | Eurosys'23
- Cocktail: A Multidimensional Optimization for Model Serving in Cloud | NSDI'22
- Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models
- SHEPHERD : Serving DNNs in the Wild
- Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning
- AutoScratch: ML-Optimized Cache Management for Inference-Oriented GPUs
- ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
- Channel Permutations for N:M Sparsity | MLSys' 23
- Welder : Scheduling Deep Learning Memory Access via Tile-graph | OSDI' 23
- Optimizing Dynamic Neural Networks with Brainstorm | OSDI'23
- ModelKeeper: Accelerating DNN Training via Automated Training Warmup | NSDI'23
- Breadth-First Pipeline Parallelism | MLSys' 23
- MGG : Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms | OSDI' 23
- Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters | OSDI' 23
- Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning | OSDI' 23
- BPipe: Memory-Balanced Pipeline Parallelism for Training Large Language Models
- Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
- Revisiting Reliability in Large-Scale Machine Learning Research Clusters
- Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications | EuroSys '24
- Optimus: Warming Serverless ML Inference via Inter-Function Model Transformation | EuroSys '24
- Model Selection for Latency-Critical Inference Serving | EuroSys '24
- Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving | SOSP' 24
Survey Paper
- Efficient Large Language Models: A Survey
- Challenges and Applications of Large Language Models
- Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
LLM Benchmark / Leaderboard / Traces
- LLM Energy Leaderboard | Umich
- LLM-Perf Leaderboard | HuggingFace
- Aviary Explorer | Anyscale
- Open LLM Leaderboard | HuggingFace
- HELM | Stanford
- LMSYS | UCB
- Towards Efficient and Reliable LLM Serving: A Real-World Workload Study
LLM Frameworks
- DeepSpeed: a deep learning optimization library that makes distributed training and inference easy, efficient, and effective | Microsoft
- TensorRT-LLM | Nvidia
- Accelerate | Hugging Face
- Ray-LLM | Ray
- LLaVA
- Megatron | Nvidia
- NeMo | Nvidia
- torchtitan | PyTorch
- vLLM | UCB
- SGLang | UCB
- TGI | Hugging Face
- OpenRLHF
Related ML Readings
- Large Transformer Model Inference Optimization
- Transformer Inference Arithmetic
- The Transformer Family Version 2.0
- Full Stack Optimization of Transformer Inference: a Survey | UCB
MLSys Courses
- Systems for Machine Learning | [Stanford](https://cs229s.stanford.edu/fall2023/)
- Systems for Generative AI | [Umich](https://github.com/mosharaf/eecs598/tree/w24-genai)
- Systems for AI - LLMs | [GT](https://cs8803-sp24.anand-iyer.com/)
Other Reading
- A curated list of Large Language Model
- AI systems paper list
- A baseline repository of Auto-Parallelism in Training Neural Networks
- Numbers every LLM Developer should know
- 100,000 H100 Clusters: Power, Network Topology, Ethernet vs InfiniBand, Reliability, Failures, Checkpointing
- OpenAI Keynote on Building Scalable AI Infrastructure