Awesome Tensor Compilers
A list of awesome compiler projects and papers for tensor computation and deep learning.
Contents
Open Source Projects
- TVM: An End-to-End Machine Learning Compiler Framework
- MLIR: Multi-Level Intermediate Representation
- XLA: Optimizing Compiler for Machine Learning
- Halide: A Language for Fast, Portable Computation on Images and Tensors
- Glow: Compiler for Neural Network Hardware Accelerators
- nnfusion: A Flexible and Efficient Deep Neural Network Compiler
- Hummingbird: Compiling Trained ML Models into Tensor Computation
- Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations (see the minimal kernel sketch after this list)
- AITemplate: A Python framework that renders neural networks into high-performance CUDA/HIP C++ code
- Hidet: A Compilation-based Deep Learning Framework
- Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code
- TensorComprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
- PlaidML: A Platform for Making Deep Learning Work Everywhere
- BladeDISC: An End-to-End DynamIc Shape Compiler for Machine Learning Workloads
- TACO: The Tensor Algebra Compiler
- Nebulgym: Easy-to-use Library to Accelerate AI Training
- Speedster: Automatically apply SOTA optimization techniques to achieve the maximum inference speed-up on your hardware
- NN-512: A Compiler That Generates C99 Code for Neural Net Inference
- DaCeML: A Data-Centric Compiler for Machine Learning
- Mirage: A Multi-level Superoptimizer for Tensor Algebra
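To give a feel for what programming against one of these compilers looks like, below is a minimal sketch of a vector-addition kernel written with Triton's Python JIT API. It assumes a recent Triton release and a supported GPU; the `BLOCK_SIZE` value and the `add` wrapper are illustrative choices rather than part of any project's canonical example.

```python
# Minimal sketch: vector addition with Triton's Python JIT API.
# Assumes a recent Triton release and a CUDA/ROCm-capable GPU.
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                            # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # element indices for this block
    mask = offsets < n_elements                            # guard the final partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    # 1-D launch grid: enough blocks to cover every element.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

The launcher computes a one-dimensional grid with `triton.cdiv` so every element is covered, and the mask keeps the last, possibly partial, block from reading or writing out of bounds.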
Papers
Survey
- The Deep Learning Compiler: A Comprehensive Survey by Mingzhen Li et al., TPDS 2020
- An In-depth Comparison of Compilers for Deep Neural Networks on Hardware by Yu Xing et al., ICESS 2019
Compiler and IR Design
- (De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms by Ari Rasch, TOPLAS 2024
- BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach by Zhen Zheng et al., SIGMOD 2024
- Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs by Yaoyao Ding et al., ASPLOS 2023
- TensorIR: An Abstraction for Automatic Tensorized Program Optimization by Siyuan Feng, Bohan Hou et al., ASPLOS 2023
- Exocompilation for Productive Programming of Hardware Accelerators by Yuka Ikarashi, Gilbert Louis Bernstein et al., PLDI 2022
- DaCeML: A Data-Centric Compiler for Machine Learning by Oliver Rausch et al., ICS 2022
- FreeTensor: A Free-Form DSL with Holistic Optimizations for Irregular Tensor Programs by Shizhi Tang et al., PLDI 2022
- Roller: Fast and Efficient Tensor Compilation for Deep Learning by Hongyu Zhu et al., OSDI 2022
- AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures by Zhen Zheng et al., ASPLOS 2022
- Composable and Modular Code Generation in MLIR: A Structured and Retargetable Approach to Tensor Compiler Construction by Nicolas Vasilache et al., arXiv 2022
- PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections by Haojie Wang et al., OSDI 2021
- MLIR: Scaling Compiler Infrastructure for Domain Specific Computation by Chris Lattner et al., CGO 2021
- A Tensor Compiler for Unified Machine Learning Prediction Serving by Supun Nakandala et al., OSDI 2020
- Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks by Lingxiao Ma et al., OSDI 2020
- Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures by Tal Ben-Nun et al., SC 2019
- TASO: The Tensor Algebra SuperOptimizer for Deep Learning by Zhihao Jia et al., SOSP 2019
- Tiramisu: A polyhedral compiler for expressing fast and portable code by Riyadh Baghdadi et al., CGO 2019
- Triton: an intermediate language and compiler for tiled neural network computations by Philippe Tillet et al., MAPL 2019
- Relay: A High-Level Compiler for Deep Learning by Jared Roesch et al., arXiv 2019
- TVM: An Automated End-to-End Optimizing Compiler for Deep Learning by Tianqi Chen et al., OSDI 2018
- Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions by Nicolas Vasilache et al., arXiv 2018
- Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning by Scott Cyphers et al., arXiv 2018
- Glow: Graph Lowering Compiler Techniques for Neural Networks by Nadav Rotem et al., arXiv 2018
- DLVM: A modern compiler infrastructure for deep learning systems by Richard Wei et al., arXiv 2018
- Diesel: DSL for linear algebra and neural net computations on GPUs by Venmugil Elango et al., MAPL 2018
- The Tensor Algebra Compiler by Fredrik Kjolstad et al., OOPSLA 2017
- Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines by Jonathan Ragan-Kelley et al., PLDI 2013
Auto-tuning and Auto-scheduling
- Accelerated Auto-Tuning of GPU Kernels for Tensor Computations by Chendi Li, Yufan Xu et al., ICS 2024
- Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning by Yi Zhai et al., OSDI 2024
- The Droplet Search Algorithm for Kernel Scheduling by Michael Canesche et al., ACM TACO 2024
- Tensor Program Optimization with Probabilistic Programs by Junru Shao et al., NeurIPS 2022
- One-shot tuner for deep learning compilers by Jaehun Ryu et al., CC 2022
- Autoscheduling for sparse tensor algebra with an asymptotic cost model by Peter Ahrens et al., PLDI 2022
- Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance by Jiarong Xing et al., MLSys 2022
- A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators by Dan Zhang et al., ASPLOS 2022
- Efficient Automatic Scheduling of Imaging and Vision Pipelines for the GPU by Luke Anderson et al., OOPSLA 2021
- Lorien: Efficient Deep Learning Workloads Delivery by Cody Hao Yu et al., SoCC 2021
- Value Learning for Throughput Optimization of Deep Neural Networks by Benoit Steiner et al., MLSys 2021
- A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers by Phitchaya Mangpo Phothilimthana et al., PACT 2021
- Ansor: Generating High-Performance Tensor Programs for Deep Learning by Lianmin Zheng et al., OSDI 2020
- Schedule Synthesis for Halide Pipelines on GPUs by Savvas Sioutas et al., TACO 2020
- FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System by Size Zheng et al., ASPLOS 2020
- ProTuner: Tuning Programs with Monte Carlo Tree Search by Ameer Haj-Ali et al., arXiv 2020
- AdaTune: Adaptive tensor program compilation made efficient by Menghao Li et al., NeurIPS 2020
- Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data by Jie Zhao et al., MICRO 2020
- Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation by Byung Hoon Ahn et al., ICLR 2020
- A Sparse Iteration Space Transformation Framework for Sparse Tensor Algebra by Ryan Senanayake et al., OOPSLA 2020
- Learning to Optimize Halide with Tree Search and Random Programs by Andrew Adams et al., SIGGRAPH 2019
- Learning to Optimize Tensor Programs by Tianqi Chen et al., NeurIPS 2018
- Automatically Scheduling Halide Image Processing Pipelines by Ravi Teja Mullapudi et al., SIGGRAPH 2016
Cost Model
- TLP: A Deep Learning-based Cost Model for Tensor Program Tuning by Yi Zhai et al., ASPLOS 2023
- An Asymptotic Cost Model for Autoscheduling Sparse Tensor Programs by Peter Ahrens et al., PLDI 2022
- TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers by Lianmin Zheng et al., NeurIPS 2021
- A Deep Learning Based Cost Model for Automatic Code Optimization by Riyadh Baghdadi et al., MLSys 2021
- A Learned Performance Model for the Tensor Processing Unit by Samuel J. Kaufman et al., MLSys 2021
- DYNATUNE: Dynamic Tensor Program Optimization in Deep Neural Network Compilation by Minjia Zhang et al., ICLR 2021
- MetaTune: Meta-Learning Based Cost Model for Fast and Efficient Auto-tuning Frameworks by Jaehun Ryu et al., arXiv 2021
- Expedited Tensor Program Compilation Based on LightGBM by Gonghan Liu et al., JPCS 2021
CPU and GPU Optimization
- DeepCuts: A deep learning optimization framework for versatile GPU workloads by Wookeun Jung et al., PLDI 2021
- Analytical characterization and design space exploration for optimization of CNNs by Rui Li et al., ASPLOS 2021
- UNIT: Unifying Tensorized Instruction Compilation by Jian Weng et al., CGO 2021
- PolyDL: Polyhedral Optimizations for Creation of High-Performance DL Primitives by Sanket Tavarageri et al., arXiv 2020
- Fireiron: A Data-Movement-Aware Scheduling Language for GPUs by Bastian Hagedorn et al., PACT 2020
- Automatic Kernel Generation for Volta Tensor Cores by Somashekaracharya G. Bhaskaracharya et al., arXiv 2020
- Swizzle Inventor: Data Movement Synthesis for GPU Kernels by Phitchaya Mangpo Phothilimthana et al., ASPLOS 2019
- Optimizing CNN Model Inference on CPUs by Yizhi Liu et al., ATC 2019
- Analytical cache modeling and tilesize optimization for tensor contractions by Rui Li et al., SC 2019
NPU Optimization
- Heron: Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators by Jun Bi et al., ASPLOS 2023
- AMOS: Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction by Size Zheng et al., ISCA 2022
- Towards the Co-design of Neural Networks and Accelerators by Yanqi Zhou et al., MLSys 2022
- AKG: Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations by Jie Zhao et al., PLDI 2021
Graph-level Optimization
- POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging by Shishir G. Patil et al., ICML 2022
- Collage: Seamless Integration of Deep Learning Backends with Automatic Placement by Byungsoo Jeon et al., PACT 2022
- Apollo: Automatic Partition-based Operator Fusion through Layer by Layer Optimization by Jie Zhao et al., MLSys 2022
- Equality Saturation for Tensor Graph Superoptimization by Yichen Yang et al., MLSys 2021
- IOS: An Inter-Operator Scheduler for CNN Acceleration by Yaoyao Ding et al., MLSys 2021
- Optimizing DNN Computation Graph using Graph Substitutions by Jingzhi Fang et al., VLDB 2020
- Transferable Graph Optimizers for ML Compilers by Yanqi Zhou et al., NeurIPS 2020
- FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads by Zhen Zheng et al., arXiv 2020
- Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning by Woosuk Kwon et al., NeurIPS 2020
Dynamic Model
- Axon: A Language for Dynamic Shapes in Deep Learning Graphs by Alexander Collins et al., arXiv 2022
- DietCode: Automatic Optimization for Dynamic Tensor Programs by Bojian Zheng et al., MLSys 2022
- The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding by Pratik Fegade et al., MLSys 2022
- Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference by Haichen Shen et al., MLSys 2021
- DISC: A Dynamic Shape Compiler for Machine Learning Workloads by Kai Zhu et al., EuroMLSys 2021
- Cortex: A Compiler for Recursive Deep Learning Models by Pratik Fegade et al., MLSys 2021
Graph Neural Networks
- Graphiler: Optimizing Graph Neural Networks with Message Passing Data Flow Graph by Zhiqiang Xie et al., MLSys 2022
- Seastar: Vertex-centric Programming for Graph Neural Networks by Yidi Wu et al., EuroSys 2021
- FeatGraph: A Flexible and Efficient Backend for Graph Neural Network Systems by Yuwei Hu et al., SC 2020
Distributed Computing
- SpDISTAL: Compiling Distributed Sparse Tensor Computations by Rohan Yadav et al., SC 2022
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning by Lianmin Zheng, Zhuohan Li, Hao Zhang et al., OSDI 2022
- Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization by Colin Unger, Zhihao Jia, et al., OSDI 2022
- Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems for Deep Learning by Ningning Xie, Tamara Norman, Dominik Grewe, Dimitrios Vytiniotis et al., MLSys 2022
- DISTAL: The Distributed Tensor Algebra Compiler by Rohan Yadav et al., PLDI 2022
- GSPMD: General and Scalable Parallelization for ML Computation Graphs by Yuanzhong Xu et al., arXiv 2021
- Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads by Abhinav Jangda et al., ASPLOS 2022
- OneFlow: Redesign the Distributed Deep Learning Framework from Scratch by Jinhui Yuan et al., arXiv 2021
- Beyond Data and Model Parallelism for Deep Neural Networks by Zhihao Jia et al., MLSys 2019
- Supporting Very Large Models using Automatic Dataflow Graph Partitioning by Minjie Wang et al., EuroSys 2019
- Distributed Halide by Tyler Denniston et al., PPoPP 2016
Quantization
- Automated Backend-Aware Post-Training Quantization by Ziheng Jiang et al., arXiv 2021
- Efficient Execution of Quantized Deep Learning Models: A Compiler Approach by Animesh Jain et al., arXiv 2020
- Automatic Generation of High-Performance Quantized Machine Learning Kernels by Meghan Cowan et al., CGO 2020
Sparse
- The Sparse Abstract Machine by Olivia Hsu et al., ASPLOS 2023
- SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning by Zihao Ye et al., ASPLOS 2023
- WACO: Learning Workload-Aware Co-optimization of the Format and Schedule of a Sparse Tensor Program by Jaeyeon Won et al., ASPLOS 2023
- Looplets: A Language For Structured Coiteration by Willow Ahrens et al., CGO 2023
- Code Synthesis for Sparse Tensor Format Conversion and Optimization by Tobi Popoola et al., CGO 2023
- Stardust: Compiling Sparse Tensor Algebra to a Reconfigurable Dataflow Architecture by Olivia Hsu et al., arXiv 2022
- Unified Compilation for Lossless Compression and Sparse Computing by Daniel Donenfeld et al., CGO 2022
- SparseLNR: Accelerating Sparse Tensor Computations Using Loop Nest Restructuring by Adhitha Dias et al., ICS 2022
- SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute by Ningxin Zheng et al., OSDI 2022
- Compiler Support for Sparse Tensor Computations in MLIR by Aart J.C. Bik et al., TACO 2022
- Compilation of Sparse Array Programming Models by Rawn Henry and Olivia Hsu et al., OOPSLA 2021
- A High Performance Sparse Tensor Algebra Compiler in MLIR by Ruiqin Tian et al., LLVM-HPC 2021
- Dynamic Sparse Tensor Algebra Compilation by Stephen Chou et al., arXiv 2021
- Automatic Generation of Efficient Sparse Tensor Format Conversion Routines by Stephen Chou et al., PLDI 2020
- TIRAMISU: A Polyhedral Compiler for Dense and Sparse Deep Learning by Riyadh Baghdadi et al., arXiv 2020
- Tensor Algebra Compilation with Workspaces by Fredrik Kjolstad et al., CGO 2019
- Sparse Computation Data Dependence Simplification for Efficient Compiler-Generated Inspectors by Mahdi Soltan Mohammadi et al., PLDI 2019
- Taichi: A Language for High-Performance Computation on Spatially Sparse Data Structures by Yuanming Hu et al., ACM ToG 2019
- The Sparse Polyhedral Framework: Composing Compiler-Generated Inspector-Executor Code by Michelle Mills Strout et al., Proceedings of the IEEE 2018
- Format Abstraction for Sparse Tensor Algebra Compilers by Stephen Chou et al., OOPSLA 2018
- ParSy: Inspection and Transformation of Sparse Matrix Computations for Parallelism by Kazem Cheshmi et al., SC 2018
- Sympiler: Transforming Sparse Matrix Codes by Decoupling Symbolic Analysis by Kazem Cheshmi et al., SC 2017
- The Tensor Algebra Compiler by Fredrik Kjolstad et al., OOPSLA 2017
- Next-generation Generic Programming and its Application to Sparse Matrix Computations by Nikolay Mateev et al., ICS 2000
- A Framework for Sparse Matrix Code Synthesis from High-level Specifications by Nawaaz Ahmed et al., SC 2000
- Automatic Nonzero Structure Analysis by Aart Bik et al., SIAM Journal on Computing 1999
- SIPR: A New Framework for Generating Efficient Code for Sparse Matrix Computations by William Pugh et al., LCPC 1998
- Automatic Data Structure Selection and Transformation for Sparse Matrix Computations by Aart Bik et al., TPDS 1996
- Compilation Techniques for Sparse Matrix Computations by Aart Bik et al., ICS 1993
Program Rewriting
- Verified tensor-program optimization via high-level scheduling rewrites by Amanda Liu et al., POPL 2022
- Pure Tensor Program Rewriting via Access Patterns (Representation Pearl) by Gus Smith et al., MAPL 2021
- Equality Saturation for Tensor Graph Superoptimization by Yichen Yang et al., MLSys 2021
Verification and Testing
- NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers by Jiawei Liu et al., ASPLOS 2023
- Coverage-guided tensor compiler fuzzing with joint IR-pass mutation by Jiawei Liu et al., OOPSLA 2022
- End-to-End Translation Validation for the Halide Language by Basile Clément et al., OOPSLA 2022
- A comprehensive study of deep learning compiler bugs by Qingchao Shen et al., ESEC/FSE 2021
- Verifying and Improving Halide’s Term Rewriting System with Program Synthesis by Julie L. Newcomb et al., OOPSLA 2020
Tutorials
Contribute
We encourage all contributions to this repository. Open an issue or send a pull request.
Notes on the Link Format
We prefer links that point to a more informative page rather than a single PDF. For example, for arXiv papers, we prefer https://arxiv.org/abs/1802.04799 over https://arxiv.org/pdf/1802.04799.pdf. For USENIX papers (OSDI/ATC), we prefer https://www.usenix.org/conference/osdi18/presentation/chen over https://www.usenix.org/system/files/osdi18-chen.pdf. For ACM papers (ASPLOS/PLDI/EuroSys), we prefer https://dl.acm.org/doi/abs/10.1145/3519939.3523446 over https://dl.acm.org/doi/pdf/10.1145/3519939.3523446.