Awesome AI System

This repo is inspired by awesome-tensor-compilers.

Contents

Paper-Code

Researcher

| Name | University | Homepage |
| --- | --- | --- |
| Ion Stoica | UC Berkeley | Website |
| Joseph E. Gonzalez | UC Berkeley | Website |
| Matei Zaharia | UC Berkeley | Website |
| Zhihao Jia | CMU | Website |
| Tianqi Chen | CMU | Website |
| Xingda Wei | SJTU | Website |
| Xin Jin | PKU | Website |
| Harry Xu | UCLA | Website |
| Ravi Netravali | Princeton | Website |
| Christos Kozyrakis | Stanford | Website |
| Christopher Ré | Stanford | Website |
| Tri Dao | Princeton | Website |
| Mosharaf Chowdhury | UMich | Website |
| Shivaram Venkataraman | Wisc | Website |
| Hao Zhang | UCSD | Website |
| Ana Klimovic | ETH | Website |
| Fan Lai | UIUC | Website |
| Lianmin Zheng | UC Berkeley | Website |
| Ying Sheng | Stanford | Website |
| Zhuohan Li | UC Berkeley | Website |
| Woosuk Kwon | UC Berkeley | Website |
| Zihao Ye | University of Washington | Website |

LLM Serving Framework

| Title | GitHub |
| --- | --- |
| MLC LLM | Star |
| TensorRT-LLM | Star |
| xFasterTransformer | Star |
| CTranslate2 (low latency) | Star |
| llama2.c | Star |

LLM Evaluation Platform

| Title | GitHub | Website |
| --- | --- | --- |
| FastChat | Star | Website |

LLM Inference (System Side)

| Title | arXiv | GitHub | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| NanoFlow: Towards Optimal Large Language Model Serving Throughput | arXiv | Star | - | arXiv'24 |
| LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism | arXiv | Star | - | SOSP'24 |
| Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | arXiv | Star | - | MLSys'24 |
| LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference | arXiv | Star | - | ISCA'24 |
| Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference | arXiv | Star | - | ISCA'24 |
| Prompt Cache: Modular Attention Reuse for Low-Latency Inference | arXiv | Star | - | MLSys'24 |
| Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | arXiv | Star | - | OSDI'24 |
| Llumnix: Dynamic Scheduling for Large Language Model Serving | arXiv | Star | - | OSDI'24 |
| CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | arXiv | Star | - | arXiv'24 |
| Parrot: Efficient Serving of LLM-based Applications with Semantic Variables | arXiv | Star | - | OSDI'24 |
| CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming | arXiv | Star | - | SIGCOMM'24 |
| Efficiently Programming Large Language Models using SGLang | arXiv | Star | - | Dec, 2023 |
| Efficient Memory Management for Large Language Model Serving with PagedAttention | arXiv | Star | - | SOSP'23 |
| SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification | arXiv | Star | - | Dec, 2023 |
| Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference | - | Star | - | PPoPP'24 |
| Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | arXiv | Star | - | VLDB'24 |
| PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | arXiv | Star | - | Dec, 2023 |
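
Several serving papers above (PagedAttention, Prompt Cache, CacheGen, CacheBlend) center on how the KV cache is laid out and reused. The sketch below is a minimal, framework-free illustration of block-based (paged) KV-cache bookkeeping under toy numpy assumptions; all class and variable names are hypothetical and it is not taken from any of the listed codebases.

```python
# Minimal sketch of block-based (paged) KV-cache bookkeeping, in the spirit of
# PagedAttention. All names are hypothetical; real systems are far more
# involved (GPU kernels, copy-on-write, preemption, eviction, ...).
import numpy as np

class BlockAllocator:
    """Hands out fixed-size KV blocks from a shared pool and tracks free blocks."""
    def __init__(self, num_blocks: int, block_size: int, head_dim: int):
        self.block_size = block_size
        # One pooled tensor for keys and one for values: [num_blocks, block_size, head_dim]
        self.k_pool = np.zeros((num_blocks, block_size, head_dim), dtype=np.float32)
        self.v_pool = np.zeros_like(self.k_pool)
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)

class Sequence:
    """Per-request block table: logical token position -> (block id, offset)."""
    def __init__(self, allocator: BlockAllocator):
        self.alloc = allocator
        self.block_table = []   # physical block ids, in logical order
        self.length = 0

    def append_kv(self, k: np.ndarray, v: np.ndarray) -> None:
        offset = self.length % self.alloc.block_size
        if offset == 0:                      # first token or current block full: grab a new block
            self.block_table.append(self.alloc.allocate())
        block = self.block_table[-1]
        self.alloc.k_pool[block, offset] = k
        self.alloc.v_pool[block, offset] = v
        self.length += 1

    def gather_keys(self) -> np.ndarray:
        """Re-assemble the logical K sequence from scattered physical blocks."""
        parts = [self.alloc.k_pool[b] for b in self.block_table]
        return np.concatenate(parts, axis=0)[: self.length]

# Usage: a sequence grows block by block without pre-reserving its maximum length.
alloc = BlockAllocator(num_blocks=8, block_size=4, head_dim=2)
seq = Sequence(alloc)
for t in range(6):
    seq.append_kv(np.full(2, t, dtype=np.float32), np.zeros(2, dtype=np.float32))
print(seq.block_table, seq.gather_keys().shape)   # [7, 6] (6, 2)
```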

LLM Inference (AI Side)

| Title | arXiv | GitHub | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| InferCept: Efficient Intercept Support for Augmented Large Language Model Inference | arXiv | Star | - | ICML'24 |
| Online Speculative Decoding | arXiv | Star | - | ICML'24 |
| MuxServe: Flexible Spatial-Temporal Multiplexing for LLM Serving | arXiv | Star | - | ICML'24 |
| BitDelta: Your Fine-Tune May Only Be Worth One Bit | arXiv | Star | - | Feb, 2024 |
| Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads | arXiv | Star | - | Jan, 2024 |
| LLMCompiler: An LLM Compiler for Parallel Function Calling | arXiv | Star | - | Dec, 2023 |
| Mamba: Linear-Time Sequence Modeling with Selective State Spaces | arXiv | Star | - | Dec, 2023 |
| Teaching LLMs memory management for unbounded context | arXiv | Star | - | Oct, 2023 |
| Break the Sequential Dependency of LLM Inference Using Lookahead Decoding | arXiv | Star | - | Feb, 2024 |
| EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation | arXiv | Star | - | Jan, 2024 |
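
Several entries above (Online Speculative Decoding, Medusa, Lookahead Decoding, EAGLE) accelerate decoding by drafting multiple tokens cheaply and verifying them with the target model in a single pass. Below is a minimal sketch of the greedy verify-and-accept loop only, with hypothetical `draft_next` / `target_argmax` stand-ins for real model calls; the papers build sampling-based acceptance rules and tree/parallel verification on top of this idea.

```python
# Minimal sketch of greedy speculative decoding: a cheap draft model proposes k
# tokens, the target model verifies them in one pass, and the longest agreeing
# prefix is accepted. `draft_next` / `target_argmax` are hypothetical stand-ins.
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_argmax: Callable[[List[int]], List[int]],
                     k: int = 4) -> List[int]:
    # 1. Draft k tokens autoregressively with the small model.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. One target pass scores every drafted position: target_argmax returns,
    #    for each of the k+1 positions, the token the target model would emit.
    target = target_argmax(prefix + draft)           # length k + 1

    # 3. Accept the longest prefix where draft and target agree; at the first
    #    disagreement take the target's own token, otherwise keep its bonus token.
    accepted = []
    for i in range(k):
        if draft[i] == target[i]:
            accepted.append(draft[i])
        else:
            accepted.append(target[i])
            return prefix + accepted
    accepted.append(target[k])                       # all k accepted: free extra token
    return prefix + accepted

# Toy usage (assumes k == 4): both dummy "models" predict token = previous + 1,
# so every drafted token is accepted plus one bonus token from the target.
draft_next = lambda ctx: ctx[-1] + 1
target_argmax = lambda ctx: [ctx[len(ctx) - 5 + i] + 1 for i in range(5)]
print(speculative_step([0, 1, 2, 3], draft_next, target_argmax))
# -> [0, 1, 2, 3, 4, 5, 6, 7, 8]
```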

LLM MoE

| Title | arXiv | GitHub | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference | arXiv | Star | - | ISCA'24 |
| SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models | arXiv | Star | - | MLSys'24 |
| ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling | arXiv | Star | - | EuroSys'24 |
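
For context on what these systems schedule: an MoE layer routes each token to its top-k experts and mixes their outputs by the renormalized gate probabilities. The sketch below shows only that routing math in numpy, with illustrative names and shapes; it does not reflect any specific implementation listed above.

```python
# Minimal sketch of top-k expert routing in a Mixture-of-Experts layer.
# All shapes and names are illustrative.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, gate_w, experts, top_k=2):
    """tokens: [n, d]; gate_w: [d, num_experts]; experts: list of callables [d] -> [d]."""
    probs = softmax(tokens @ gate_w, axis=-1)         # [n, num_experts] gate probabilities
    top = np.argsort(-probs, axis=-1)[:, :top_k]      # chosen expert ids per token
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        weights = probs[i, top[i]]
        weights = weights / weights.sum()             # renormalize over the selected experts
        for w, e_id in zip(weights, top[i]):
            out[i] += w * experts[e_id](tok)          # weighted sum of selected expert outputs
    return out

# Toy usage: 4 "experts" that just scale their input by different factors.
rng = np.random.default_rng(0)
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
tokens = rng.normal(size=(5, 8)).astype(np.float32)
gate_w = rng.normal(size=(8, 4)).astype(np.float32)
print(moe_layer(tokens, gate_w, experts).shape)       # (5, 8)
```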

LoRA

| Title | arXiv | GitHub | Website | Pub. & Date |
| --- | --- | --- | --- | --- |
| S-LoRA: Serving Thousands of Concurrent LoRA Adapters | arXiv | Star | - | Nov, 2023 |
| Punica: Serving multiple LoRA finetuned LLMs as one | arXiv | Star | - | Oct, 2023 |
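
Both systems exploit the fact that a LoRA adapter is a low-rank update, y = xW + (alpha/r)(xA)B, which can be applied without materializing a full weight delta; that is what makes batching many adapters over one frozen base model practical. Below is a minimal numpy sketch of the per-adapter math only; the names and the scaling convention are illustrative assumptions, not code from either project.

```python
# Minimal sketch of a LoRA adapter applied to a frozen weight matrix:
# y = x @ W + (alpha / r) * (x @ A) @ B, with A: [d, r], B: [r, k], r << d.
import numpy as np

def lora_forward(x, w, a, b, alpha=16.0):
    """x: [n, d]; w: frozen base weight [d, k]; a: [d, r]; b: [r, k]."""
    scale = alpha / a.shape[1]              # common LoRA scaling convention (assumed here)
    return x @ w + scale * (x @ a) @ b      # low-rank update, no [d, k] delta materialized

rng = np.random.default_rng(0)
n, d, k, r = 4, 64, 64, 8
x = rng.normal(size=(n, d))
w = rng.normal(size=(d, k))                 # shared, frozen base weight
a = rng.normal(size=(d, r)) * 0.01          # per-adapter low-rank factor A
b = np.zeros((r, k))                        # B starts at zero, so the adapter is initially a no-op
assert np.allclose(lora_forward(x, w, a, b), x @ w)
print(lora_forward(x, w, a, b).shape)       # (4, 64)
```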

Framework

Parallelism Training

Training

Communication

Serving-Inference

MoE

GPU Cluster Management

Schedule and Resource Management

Optimization

GNN

Fine-Tune

Energy

Misc

Contribute

We encourage all contributions to this repository. Open an issue or send a pull request.