Awesome Tensor Compilers
A list of awesome compiler projects and papers for tensor computation and deep learning.
Contents
Open Source Projects
- TVM: An End-to-End Machine Learning Compiler Framework
- MLIR: Multi-Level Intermediate Representation
- XLA: Optimizing Compiler for Machine Learning
- Halide: A Language for Fast, Portable Computation on Images and Tensors
- Glow: Compiler for Neural Network Hardware Accelerators
- nnfusion: A Flexible and Efficient Deep Neural Network Compiler
- Hummingbird: Compiling Trained ML Models into Tensor Computation
- Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations (see the minimal kernel sketch after this list)
- AITemplate: A Python framework that renders neural networks into high-performance CUDA/HIP C++ code
- Hidet: A Compilation-based Deep Learning Framework
- Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code
- TensorComprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
- PlaidML: A Platform for Making Deep Learning Work Everywhere
- BladeDISC: An End-to-End DynamIc Shape Compiler for Machine Learning Workloads
- TACO: The Tensor Algebra Compiler
- Nebulgym: Easy-to-use Library to Accelerate AI Training
- Speedster: Automatically apply SOTA optimization techniques to achieve the maximum inference speed-up on your hardware
- NN-512: A Compiler That Generates C99 Code for Neural Net Inference
- DaCeML: A Data-Centric Compiler for Machine Learning
- Mirage: A Multi-level Superoptimizer for Tensor Algebra
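To give a feel for what programming against one of these compilers looks like, below is a minimal sketch of a vector-addition kernel written with Triton's Python JIT API. It assumes a recent Triton release and a supported GPU; the `BLOCK_SIZE` value and the `add` wrapper are illustrative choices rather than part of any project's canonical example.

```python
# Minimal sketch: vector addition with Triton's Python JIT API.
# Assumes a recent Triton release and a CUDA/ROCm-capable GPU.
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                            # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # element indices for this block
    mask = offsets < n_elements                            # guard the final partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    # 1-D launch grid: enough blocks to cover every element.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

The launcher computes a one-dimensional grid with `triton.cdiv` so every element is covered, and the mask keeps the last, possibly partial, block from reading or writing out of bounds.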
Papers
Survey
- The Deep Learning Compiler: A Comprehensive Survey by Mingzhen Li et al., TPDS 2020
- An In-depth Comparison of Compilers for Deep Neural Networks on Hardware by Yu Xing et al., ICESS 2019
Compiler and IR Design
- (De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms by Ari Rasch, TOPLAS 2024
- BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach by Zhen Zheng et al., SIGMOD 2024
- Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs by Yaoyao Ding et al., ASPLOS 2023
- TensorIR: An Abstraction for Automatic Tensorized Program Optimization by Siyuan Feng, Bohan Hou et al., ASPLOS 2023
- Exocompilation for Productive Programming of Hardware Accelerators by Yuka Ikarashi, Gilbert Louis Bernstein et al., PLDI 2022
- DaCeML: A Data-Centric Compiler for Machine Learning by Oliver Rausch et al., ICS 2022
- FreeTensor: A Free-Form DSL with Holistic Optimizations for Irregular Tensor Programs by Shizhi Tang et al., PLDI 2022
- Roller: Fast and Efficient Tensor Compilation for Deep Learning by Hongyu Zhu et al., OSDI 2022
- AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures by Zhen Zheng et al., ASPLOS 2022
- Composable and Modular Code Generation in MLIR: A Structured and Retargetable Approach to Tensor Compiler Construction by Nicolas Vasilache et al., arXiv 2022
- PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections by Haojie Wang et al., OSDI 2021
- MLIR: Scaling Compiler Infrastructure for Domain Specific Computation by Chris Lattner et al., CGO 2021
- A Tensor Compiler for Unified Machine Learning Prediction Serving by Supun Nakandala et al., OSDI 2020
- Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks by Lingxiao Ma et al., OSDI 2020
- Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures by Tal Ben-Nun et al., SC 2019
- TASO: The Tensor Algebra SuperOptimizer for Deep Learning by Zhihao Jia et al., SOSP 2019
- Tiramisu: A polyhedral compiler for expressing fast and portable code by Riyadh Baghdadi et al., CGO 2019
- Triton: an intermediate language and compiler for tiled neural network computations by Philippe Tillet et al., MAPL 2019
- Relay: A High-Level Compiler for Deep Learning by Jared Roesch et al., arXiv 2019
- TVM: An Automated End-to-End Optimizing Compiler for Deep Learning by Tianqi Chen et al., OSDI 2018
- Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions by Nicolas Vasilache et al., arXiv 2018
- Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning by Scott Cyphers et al., arXiv 2018
- Glow: Graph Lowering Compiler Techniques for Neural Networks by Nadav Rotem et al., arXiv 2018
- DLVM: A modern compiler infrastructure for deep learning systems by Richard Wei et al., arXiv 2018
- Diesel: DSL for linear algebra and neural net computations on GPUs by Venmugil Elango et al., MAPL 2018
- The Tensor Algebra Compiler by Fredrik Kjolstad et al., OOPSLA 2017
- Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines by Jonathan Ragan-Kelley et al., PLDI 2013
Auto-tuning and Auto-scheduling
- Accelerated Auto-Tuning of GPU Kernels for Tensor Computations by Chendi Li, Yufan Xu et al., ICS 2024
- Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning by Yi Zhai et al., OSDI 2024
- The Droplet Search Algorithm for Kernel Scheduling by Michael Canesche et al., ACM TACO 2024
- Tensor Program Optimization with Probabilistic Programs by Junru Shao et al., NeurIPS 2022
- One-shot tuner for deep learning compilers by Jaehun Ryu et al., CC 2022
- Autoscheduling for sparse tensor algebra with an asymptotic cost model by Peter Ahrens et al., PLDI 2022
- Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance by Jiarong Xing et al., MLSys 2022
- A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators by Dan Zhang et al., ASPLOS 2022
- Efficient Automatic Scheduling of Imaging and Vision Pipelines for the GPU by Luke Anderson et al., OOPSLA 2021
- Lorien: Efficient Deep Learning Workloads Delivery by Cody Hao Yu et al., SoCC 2021
- Value Learning for Throughput Optimization of Deep Neural Networks by Benoit Steiner et al., MLSys 2021
- A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers by Phitchaya Mangpo Phothilimthana et al., PACT 2021
- Ansor: Generating High-Performance Tensor Programs for Deep Learning by Lianmin Zheng et al., OSDI 2020
- Schedule Synthesis for Halide Pipelines on GPUs by Savvas Sioutas et al., TACO 2020
- FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System by Size Zheng et al., ASPLOS 2020
- ProTuner: Tuning Programs with Monte Carlo Tree Search by Ameer Haj-Ali et al., arXiv 2020
- AdaTune: Adaptive tensor program compilation made efficient by Menghao Li et al., NeurIPS 2020
- Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data by Jie Zhao et al., MICRO 2020
- Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation by Byung Hoon Ahn et al., ICLR 2020
- A Sparse Iteration Space Transformation Framework for Sparse Tensor Algebra by Ryan Senanayake et al., OOPSLA 2020
- Learning to Optimize Halide with Tree Search and Random Programs by Andrew Adams et al., SIGGRAPH 2019
- Learning to Optimize Tensor Programs by Tianqi Chen et al., NeurIPS 2018
- Automatically Scheduling Halide Image Processing Pipelines by Ravi Teja Mullapudi et al., SIGGRAPH 2016
Cost Model
- TLP: A Deep Learning-based Cost Model for Tensor Program Tuning by Yi Zhai et al., ASPLOS 2023
- An Asymptotic Cost Model for Autoscheduling Sparse Tensor Programs by Peter Ahrens et al., PLDI 2022
- TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers by Lianmin Zheng et al., NeurIPS 2021
- A Deep Learning Based Cost Model for Automatic Code Optimization by Riyadh Baghdadi et al., MLSys 2021
- A Learned Performance Model for the Tensor Processing Unit by Samuel J. Kaufman et al., MLSys 2021
- DYNATUNE: Dynamic Tensor Program Optimization in Deep Neural Network Compilation by Minjia Zhang et al., ICLR 2021
- MetaTune: Meta-Learning Based Cost Model for Fast and Efficient Auto-tuning Frameworks by Jaehun Ryu et al., arXiv 2021
- Expedited Tensor Program Compilation Based on LightGBM by Gonghan Liu et al., JPCS 2021
CPU and GPU Optimization
- DeepCuts: A deep learning optimization framework for versatile GPU workloads by Wookeun Jung et al., PLDI 2021
- Analytical characterization and design space exploration for optimization of CNNs by Rui Li et al., ASPLOS 2021
- UNIT: Unifying Tensorized Instruction Compilation by Jian Weng et al., CGO 2021
- PolyDL: Polyhedral Optimizations for Creation of High-Performance DL Primitives by Sanket Tavarageri et al., arXiv 2020
- Fireiron: A Data-Movement-Aware Scheduling Language for GPUs by Bastian Hagedorn et al., PACT 2020
- Automatic Kernel Generation for Volta Tensor Cores by Somashekaracharya G. Bhaskaracharya et al., arXiv 2020
- Swizzle Inventor: Data Movement Synthesis for GPU Kernels by Phitchaya Mangpo Phothilimthana et al., ASPLOS 2019
- Optimizing CNN Model Inference on CPUs by Yizhi Liu et al., ATC 2019
- Analytical cache modeling and tilesize optimization for tensor contractions by Rui Li et al., SC 2019
NPU Optimization
- Heron: Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators by Jun Bi et al., ASPLOS 2023
- AMOS: Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction by Size Zheng et al., ISCA 2022
- Towards the Co-design of Neural Networks and Accelerators by Yanqi Zhou et al., MLSys 2022
- AKG: Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations by Jie Zhao et al., PLDI 2021
Graph-level Optimization
- POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging by Shishir G. Patil et al., ICML 2022
- Collage: Seamless Integration of Deep Learning Backends with Automatic Placement by Byungsoo Jeon et al., PACT 2022
- Apollo: Automatic Partition-based Operator Fusion through Layer by Layer Optimization by Jie Zhao et al., MLSys 2022
- Equality Saturation for Tensor Graph Superoptimization by Yichen Yang et al., MLSys 2021
- IOS: An Inter-Operator Scheduler for CNN Acceleration by Yaoyao Ding et al., MLSys 2021
- Optimizing DNN Computation Graph using Graph Substitutions by Jingzhi Fang et al., VLDB 2020
- Transferable Graph Optimizers for ML Compilers by Yanqi Zhou et al., NeurIPS 2020
- FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads by Zhen Zheng et al., arXiv 2020
- Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning by Woosuk Kwon et al., NeurIPS 2020
Dynamic Model
- Axon: A Language for Dynamic Shapes in Deep Learning Graphs by Alexander Collins et al., arXiv 2022
- DietCode: Automatic Optimization for Dynamic Tensor Programs by Bojian Zheng et al., MLSys 2022
- The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding by Pratik Fegade et al., MLSys 2022
- Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference by Haichen Shen et al., MLSys 2021
- DISC: A Dynamic Shape Compiler for Machine Learning Workloads by Kai Zhu et al., EuroMLSys 2021
- Cortex: A Compiler for Recursive Deep Learning Models by Pratik Fegade et al., MLSys 2021
Graph Neural Networks
- Graphiler: Optimizing Graph Neural Networks with Message Passing Data Flow Graph by Zhiqiang Xie et al., MLSys 2022
- Seastar: Vertex-centric Programming for Graph Neural Networks by Yidi Wu et al., EuroSys 2021
- FeatGraph: A Flexible and Efficient Backend for Graph Neural Network Systems by Yuwei Hu et al., SC 2020
Distributed Computing
- SpDISTAL: Compiling Distributed Sparse Tensor Computations by Rohan Yadav et al., SC 2022
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning by Lianmin Zheng, Zhuohan Li, Hao Zhang et al., OSDI 2022
- Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization by Colin Unger, Zhihao Jia, et al., OSDI 2022
- Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems for Deep Learning by Ningning Xie, Tamara Norman, Dominik Grewe, Dimitrios Vytiniotis et al., MLSys 2022
- DISTAL: The Distributed Tensor Algebra Compiler by Rohan Yadav et al., PLDI 2022
- GSPMD: General and Scalable Parallelization for ML Computation Graphs by Yuanzhong Xu et al., arXiv 2021
- Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads by Abhinav Jangda et al., ASPLOS 2022
- OneFlow: Redesign the Distributed Deep Learning Framework from Scratch by Jinhui Yuan et al., arXiv 2021
- Beyond Data and Model Parallelism for Deep Neural Networks by Zhihao Jia et al., MLSys 2019
- Supporting Very Large Models using Automatic Dataflow Graph Partitioning by Minjie Wang et al., EuroSys 2019
- Distributed Halide by Tyler Denniston et al., PPoPP 2016
Quantization
- Automated Backend-Aware Post-Training Quantization by Ziheng Jiang et al., arXiv 2021
- Efficient Execution of Quantized Deep Learning Models: A Compiler Approach by Animesh Jain et al., arXiv 2020
- Automatic Generation of High-Performance Quantized Machine Learning Kernels by Meghan Cowan et al., CGO 2020
Sparse
- The Sparse Abstract Machine by Olivia Hsu et al., ASPLOS 2023
- SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning by Zihao Ye et al., ASPLOS 2023
- WACO: Learning Workload-Aware Co-optimization of the Format and Schedule of a Sparse Tensor Program by Jaeyeon Won et al., ASPLOS 2023
- Looplets: A Language For Structured Coiteration by Willow Ahrens et al., CGO 2023
- Code Synthesis for Sparse Tensor Format Conversion and Optimization by Tobi Popoola et al., CGO 2023
- Stardust: Compiling Sparse Tensor Algebra to a Reconfigurable Dataflow Architecture by Olivia Hsu et al., arXiv 2022
- Unified Compilation for Lossless Compression and Sparse Computing by Daniel Donenfeld et al., CGO 2022
- SparseLNR: Accelerating Sparse Tensor Computations Using Loop Nest Restructuring by Adhitha Dias et al., ICS 2022
- SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute by Ningxin Zheng et al., OSDI 2022
- Compiler Support for Sparse Tensor Computations in MLIR by Aart J.C. Bik et al., TACO 2022
- Compilation of Sparse Array Programming Models by Rawn Henry and Olivia Hsu et al., OOPSLA 2021
- A High Performance Sparse Tensor Algebra Compiler in MLIR by Ruiqin Tian et al., LLVM-HPC 2021
- Dynamic Sparse Tensor Algebra Compilation by Stephen Chou et al., arXiv 2021
- Automatic Generation of Efficient Sparse Tensor Format Conversion Routines by Stephen Chou et al., PLDI 2020
- TIRAMISU: A Polyhedral Compiler for Dense and Sparse Deep Learning by Riyadh Baghdadi et al., arXiv 2020
- Tensor Algebra Compilation with Workspaces by Fredrik Kjolstad et al., CGO 2019
- Sparse Computation Data Dependence Simplification for Efficient Compiler-Generated Inspectors by Mahdi Soltan Mohammadi et al., PLDI 2019
- Taichi: A Language for High-Performance Computation on Spatially Sparse Data Structures by Yuanming Hu et al., ACM ToG 2019
- The Sparse Polyhedral Framework: Composing Compiler-Generated Inspector-Executor Code by Michelle Mills Strout et al., Proceedings of the IEEE 2018
- Format Abstraction for Sparse Tensor Algebra Compilers by Stephen Chou et al., OOPSLA 2018
- ParSy: Inspection and Transformation of Sparse Matrix Computations for Parallelism by Kazem Cheshmi et al., SC 2018
- Sympiler: Transforming Sparse Matrix Codes by Decoupling Symbolic Analysis by Kazem Cheshmi et al., SC 2017
- The Tensor Algebra Compiler by Fredrik Kjolstad et al., OOPSLA 2017
- Next-generation Generic Programming and its Application to Sparse Matrix Computations by Nikolay Mateev et al., ICS 2000
- A Framework for Sparse Matrix Code Synthesis from High-level Specifications by Nawaaz Ahmed et al., SC 2000
- Automatic Nonzero Structure Analysis by Aart Bik et al., SIAM Journal on Computing 1999
- SIPR: A New Framework for Generating Efficient Code for Sparse Matrix Computations by William Pugh et al., LCPC 1998
- Automatic Data Structure Selection and Transformation for Sparse Matrix Computations by Aart Bik et al., TPDS 1996
- Compilation Techniques for Sparse Matrix Computations by Aart Bik et al., ICS 1993
Program Rewriting
- Verified tensor-program optimization via high-level scheduling rewrites by Amanda Liu et al., POPL 2022
- Pure Tensor Program Rewriting via Access Patterns (Representation Pearl) by Gus Smith et al., MAPL 2021
- Equality Saturation for Tensor Graph Superoptimization by Yichen Yang et al., MLSys 2021
Verification and Testing
- NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers by Jiawei Liu et al., ASPLOS 2023
- Coverage-guided tensor compiler fuzzing with joint IR-pass mutation by Jiawei Liu et al., OOPSLA 2022
- End-to-End Translation Validation for the Halide Language by Basile Clément et al., OOPSLA 2022
- A comprehensive study of deep learning compiler bugs by Qingchao Shen et al., ESEC/FSE 2021
- Verifying and Improving Halide’s Term Rewriting System with Program Synthesis by Julie L. Newcomb et al., OOPSLA 2020
Tutorials
Contribute
We encourage all contributions to this repository. Open an issue or send a pull request.
Notes on the Link Format
We prefer links that point to a more informative page rather than a single PDF. For example, for arXiv papers, we prefer https://arxiv.org/abs/1802.04799 over https://arxiv.org/pdf/1802.04799.pdf. For USENIX papers (OSDI/ATC), we prefer https://www.usenix.org/conference/osdi18/presentation/chen over https://www.usenix.org/system/files/osdi18-chen.pdf. For ACM papers (ASPLOS/PLDI/EuroSys), we prefer https://dl.acm.org/doi/abs/10.1145/3519939.3523446 over https://dl.acm.org/doi/pdf/10.1145/3519939.3523446.