Awesome
Concept Explanation
Data Parallelism (DP)
Model Parallelism
Model parallelism comes in two types: inter-layer and intra-layer. We denote inter-layer model parallelism as MP and intra-layer model parallelism as TP (tensor parallelism).
Some researchers also call TP parameter parallelism or intra-layer model parallelism.
Popular intra-layer model parallelism methods include 2D, 2.5D, and 3D model parallelism, as well as Megatron (1D). There is still only a little work on 2D, 2.5D, and 3D parallelism (currently only Colossal-AI).
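To make the 1D (Megatron-style) intra-layer split concrete, here is a minimal NumPy sketch (my own illustration, not code from any system listed here): the first weight matrix of a toy MLP is split by columns across two simulated devices, the second by rows, and a plain sum stands in for the all-reduce. The shapes and the two-way split are illustrative assumptions.

```python
import numpy as np

# Toy Megatron-style (1D) tensor parallelism for an MLP: Y = relu(X @ A) @ B.
# A is split by columns, B by rows, across two simulated "devices".
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))      # batch of 4, hidden size 8
A = rng.standard_normal((8, 16))     # first linear layer
B = rng.standard_normal((16, 8))     # second linear layer

# Serial reference.
Y_ref = np.maximum(X @ A, 0) @ B

# Column-parallel first layer: each device holds half of A's columns.
A0, A1 = np.split(A, 2, axis=1)
# Row-parallel second layer: each device holds the matching half of B's rows.
B0, B1 = np.split(B, 2, axis=0)

# Each device computes its partial output independently...
partial0 = np.maximum(X @ A0, 0) @ B0
partial1 = np.maximum(X @ A1, 0) @ B1
# ...and an all-reduce (here: a plain sum) combines the partial results.
Y_tp = partial0 + partial1

print(np.allclose(Y_ref, Y_tp))      # True: the sharded computation matches
```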
Pipeline Parallelism
PP and MP partition the model in similar ways, but their execution behaviors differ. Broadly, pipeline parallelism has two families: the PipeDream family and the GPipe family.
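To make the GPipe family concrete, the toy sketch below (my own illustration, not code from GPipe) prints a GPipe-style schedule: the mini-batch is cut into micro-batches, every stage finishes all forward passes before any backward pass starts, and idle `.` slots show the pipeline bubble. The stage and micro-batch counts are made-up parameters, and PipeDream's 1F1B schedule, which interleaves one forward with one backward per stage, is not reproduced here.

```python
# Toy GPipe-style pipeline schedule: all micro-batch forwards, then all backwards.
# Stage and micro-batch counts are illustrative; real systems schedule these
# differently (e.g. PipeDream's 1F1B interleaves forwards and backwards).
STAGES, MICRO_BATCHES = 4, 6

def gpipe_schedule(stages, micro_batches):
    """Return a per-stage timeline of 'F{m}', 'B{m}', or '.' (idle) cells."""
    total_steps = 2 * (stages + micro_batches - 1)
    grid = [["."] * total_steps for _ in range(stages)]
    for m in range(micro_batches):
        for s in range(stages):
            grid[s][s + m] = f"F{m}"                       # forward wave
            back_start = stages + micro_batches - 1        # after the last forward
            grid[s][back_start + (stages - 1 - s) + m] = f"B{m}"  # backward wave
    return grid

for s, row in enumerate(gpipe_schedule(STAGES, MICRO_BATCHES)):
    print(f"stage {s}: " + " ".join(f"{c:>3}" for c in row))
# The idle '.' cells at the start and end of each row are the pipeline bubble.
```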
Published auto-parallelism methods are listed below.
I classify the methods according to how they partition the model.
Pipeline Parallelism or Inter-layer Model Parallelism only:
Name | Description | Organization or author | Paper | Framework | Year | Auto Methods |
---|---|---|---|---|---|---|
ColocRL (REINFORCE) | Uses reinforcement learning to discover model partitions | Google Brain | mlr.press | Tensorflow | PMLR 70, 2017 | Reinforce |
A hierarchical model for device placement (HDP) | Uses Scotch to do graph partitioning | Google | link | Tensorflow | ICLR 2018 | Reinforce LSTM |
GPipe | No official implementation; see torchgpipe | Google | arxiv | None | 2018 on arxiv, NIPS 2019 | Even partition or manual partition |
torchgpipe | A GPipe implementation in PyTorch | Kakao Brain | arxiv | PyTorch | 2020 on arxiv | Balance stages by profiling |
GDP | A general deep RL method for automating device placement on arbitrary graphs. Orthogonal to DP, MP, and PP | Google | arxiv | Unknown | 2019 on arxiv | Reinforce Transformer |
Pesto | Partitions the model based on inter-layer model parallelism | Stony Brook University | acm | Tensorflow | Middleware '21 | Integer linear programming |
vPipe | A pipeline-only system designed for NAS networks. Complementary to hybrid parallelism | HKU | ieee | PyTorch | TPDS vol.33 no.3 2022 | Swap, Recompute, Partition (SRP) planner; P: Kernighan-Lin algorithm |
Data Parallelism + Pipeline Parallelism (or Inter-layer Model Parallelism):
Name | Description | Organization or author | Paper | Framework | Year | Auto Methods |
---|---|---|---|---|---|---|
Spotlight | Model device placement as a Markov decision process (MDP). | University of Toronto | mlr.press | Unknown | PMLR 80, 2018 | Reinforce LSTM |
Placeto | Similar to Spotlight's MDP formulation, but with a different policy. | MIT | nips | Tensorflow | NIPS 2019 | Reinforce |
REGAL | A deep reinforcement learning approach to minimizing the execution cost of neural network computation graphs in an optimizing compiler. | DeepMind | openreview | Unknown | ICLR 2020 | RL with Genetic Algorithm |
PipeDream | The repository contains the source code of PipeDream and PipeDream-2BW | Microsoft Fiddle | arxiv | PyTorch | 2018 on arxiv, SOSP 2019 | Dynamic programming with profiling (see the sketch after this table) |
PipeDream-2BW | See PipeDream above | Microsoft | arxiv, mlr.press | PyTorch | PMLR 139, 2021 | Dynamic programming with profiling |
DNN-partitioning | Proof-of-concept implementation of the DNN partitioning algorithms published at NeurIPS 2020. | Microsoft Fiddle | arxiv | proof-of-concept implementation | NIPS 2020 | Dynamic Programming and Integer Programming |
HetPipe | Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism | UNIST | usenix | PyTorch (not open sourced) | USENIX 2020 | use CPLEX to solve linear programming problem |
DAPPLE | An efficient pipelined data-parallel approach for training large models. A successor to GPipe | Alibaba | arxiv | DAPPLE | 2020 on arxiv; PPoPP 21 | Dynamic Programming |
PipeTransformer | Automated Elastic Pipelining for Distributed Training of Transformers | University of Southern California | arxiv | PyTorch | ICML 21 | Dynamic Programming |
Chimera | Efficiently training large-scale neural networks with bidirectional pipelines | Department of Computer Science, ETH Zurich Switzerland | dl.acm | PyTorch | SC 2021 | Performance model with brute force |
TAPP | Uses an attention-based Seq2Seq model to predict the pipeline stage for each layer. | Hohai University | mdpi | Unknown | Appl. Sci. 2021, 11 | Reinforce Seq2Seq based on attention |
RaNNC | RaNNC is an automatic parallelization middleware used to train very large-scale neural networks. | DIRECT and University of Tokyo | arxiv | PyTorch | IPDPS 2021 | dynamic programming |
HeterPS | Distributed deep learning with RL-based scheduling in heterogeneous environments. | Baidu | arxiv | Paddle | 2021 | Reinforcement learning based |
FTPipe | FTPipe can automatically transform a sequential implementation into a multi-GPU one. | Technion-Israel Institute of Technology | usenix | PyTorch | 2021 | Multiprocessor scheduling problem with profiling |
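Several planners above (PipeDream, DAPPLE, RaNNC, among others) report dynamic programming over profiled layer costs as their search method. The sketch below is a deliberately simplified, hypothetical version of that idea, not code from any of those systems: given per-layer forward+backward times, it splits the layer sequence into contiguous pipeline stages so that the slowest stage is as fast as possible. It ignores communication cost, memory limits, and data-parallel replication, all of which the real planners model.

```python
from functools import lru_cache

# Hypothetical profiled per-layer times (ms); real planners measure these.
layer_times = [4.0, 7.0, 3.0, 9.0, 2.0, 6.0, 5.0, 8.0]
num_stages = 4

def partition(times, stages):
    """Split `times` into `stages` contiguous groups minimizing the max group sum."""
    prefix = [0.0]
    for t in times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(i, k):
        # Minimal achievable bottleneck when layers[i:] are split into k stages.
        if k == 1:
            return prefix[len(times)] - prefix[i], [len(times)]
        result = (float("inf"), [])
        for j in range(i + 1, len(times) - k + 2):   # first stage = layers[i:j]
            stage_cost = prefix[j] - prefix[i]
            rest_cost, rest_cuts = best(j, k - 1)
            bottleneck = max(stage_cost, rest_cost)
            if bottleneck < result[0]:
                result = (bottleneck, [j] + rest_cuts)
        return result

    bottleneck, cuts = best(0, stages)
    starts = [0] + cuts[:-1]
    return bottleneck, [list(range(a, b)) for a, b in zip(starts, cuts)]

bottleneck, stage_layers = partition(layer_times, num_stages)
print(f"slowest stage: {bottleneck} ms, stages: {stage_layers}")
```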
Data Parallelism + Intra-layer Model Parallelism (or Tensor Parallelism):
Name | Description | Organization or author | Paper | Framework | Year | Auto Methods |
---|---|---|---|---|---|---|
OptCNN | Auto-parallelism method for CNNs | Zhihao Jia | mlr.press | FlexFlow | PMLR 80, 2018 | Dynamic programming based graph search algorithm |
FlexFlow | a deep learning framework that accelerates distributed DNN training by automatically searching for efficient parallelization strategies | Zhihao Jia | stanford | FlexFlow, compatible with PyTorch, Keras | SysML 2019 | MCMC |
Tofu | Supporting Very Large Models using Automatic Dataflow Graph Partitioning | New York University | dl.acm | Not open sourced | Euro-Sys 2019 | Same as OptCNN |
AccPar | Tensor partitioning for heterogeneous deep learning accelerators. | Linghao Song from USC | usc.edu | Need Manually Deploy | 2019 on arxiv, HPCA 2020 | Dynamic Programming |
TensorOpt | Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism | CUHK & Huawei | arxiv | MindSpore | 2020 on arxiv | Dynamic Programming based graph search algorithm |
ROC | Another paper from Zhihao Jia, designed for GNNs | Zhihao Jia | mlsys | On top of FlexFlow | MLSys 2020 | Uses a novel online linear regression model to achieve efficient graph partitioning, and introduces a dynamic programming algorithm to minimize data transfer cost. |
Double Recursive | A Double recursive algorithm to search strategies | Huawei | link | MindSpore | Euro-Par 2021 | Double Recursive |
PaSE | PaSE uses a dynamic programming based approach to find an efficient strategy within a reasonable time. | Baidu Research | ieee | prototype | IPDPS 2021 | Dynamic Programming |
P^2 | Offers a syntax-guided program synthesis framework that decomposes reductions over one or more parallelism axes into sequences of collectives in a hierarchy- and mapping-aware way | University of Cambridge & DeepMind | arxiv | Simulation Experiment | 2021 on arxiv, MLSys 2022 | Synthesis tool with simulation |
AutoMap | Uses search and learning to find Megatron-like strategies | DeepMind | arxiv | JAX python API, XLA backend | 2021 on arxiv, NIPS 2021 | Search: Monte Carlo Tree Search; Learn: Interactive Network |
Data Parallelism + Model Parallelism (or Tensor Parallelism) + Pipeline Parallelism:
Name | Description | Organization or author | Paper | Framework | Year | Auto Methods |
---|---|---|---|---|---|---|
Auto-MAP | Works on HLO IR. Uses linkage groups to prune the search space and DQN RL to search DP, MP, and PP strategies. | Alibaba | arxiv | RAINBOW DQN | 2020 | Reinforcement learning |
Piper | Algorithms (proof-of-concept implementation) and input files (profiled DNN models / workloads) from the paper "Piper: Multidimensional Planner for DNN Parallelization" published at NeurIPS 2021. An extension of DNN-partitioning | Microsoft Fiddle | link | proof-of-concept implementation | NIPS 2021 | Two-level dynamic programming |
GSPMD | A system that uses simple tensor sharding annotations to achieve different parallelism paradigms in a unified way | Google | arxiv | Tensorflow XLA | 2021 | Sharding propagation |
DistIR | Horizontal TP. An intermediate representation and simulator for efficient neural network distribution | Stanford University & Microsoft Fiddle | arxiv | PyTorch | MLSys 2021 | Grid-Search Simulator |
Neo | A software-hardware co-designed system for high-performance distributed training of large-scale DLRMs. | Facebook | arxiv | PyTorch | 2021 | 1. Greedy 2. Karmarkar-Karp algorithm |
Adaptive Paddle | Elastic training, fault tolerance, and cost-model-based sharding propagation | Baidu | arxiv | Paddle | 2021 | Cost-model based; details not given. |
Alpa | Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | UC Berkeley, Google, etc. | arxiv | Jax, XLA | 2022 | Integer linear programming for intra-op, dynamic programming for inter-op |
Other Interesting automatic work
Name | Description | Organization or author | Paper | Framework | Year | Auto Methods |
---|---|---|---|---|---|---|
TASO | Automatically optimizes DNN computation with graph substitutions | Zhihao Jia | | | SOSP 2019 | |
Classification into Machine-Learning-Based Methods and Classic-Algorithm-Based Methods
Machine-Learning Based Methods
Name | Method Type | Parallelism | Year |
---|---|---|---|
ColocRL | Reinforcement | MP | 2017 |
HDP | Reinforcement | MP | 2018 |
GDP | Reinforcement | MP | 2019 |
REGAL | Reinforcement | MP | 2020 |
TAPP | Reinforcement | DP+PP | 2021 |
Spotlight | Reinforcement | DP+MP | 2018 |
Placeto | Reinforcement | DP+MP | 2019 |
HeterPS | Reinforcement | DP+PP | 2021 |
AutoMap | Deep Learning to predict rank | DP+TP | 2021 |
Auto-MAP | Reinforcement | DP or TP or PP | 2020 |
FlexFlow | MCMC | DP+TP | 2019 |
ROC | Online linear regression + dynamic programming | DP+TP | 2020 |
Classic Algorithm Based Methods
Name | Method Type | Parallelism | Year |
---|---|---|---|
Pesto | integer linear | MP | 2021 |
vPipe | SRP planner + Kernighan-Lin | PP | 2022 |
PipeDream | dynamic programming | DP+PP | 2019 |
DNN-partitioning | dynamic programming + integer programming | DP+PP | 2020 |
PipeDream-2BW | dynamic programming | DP+PP | 2021 |
HetPipe | dynamic programming | DP+PP | 2020 |
DAPPLE | dynamic programming | DP+PP | 2021 |
PipeTransformer | dynamic programming | DP+PP | 2021 |
Chimera | Grid-Search | DP+PP | 2021 |
RaNNC | dynamic programming | DP+PP | 2021 |
FTPipe | Multiprocessor scheduling problem with profiling | DP+PP | 2021 |
OptCNN | dynamic programming | DP+TP | 2018 |
Tofu | dynamic programming | DP+TP | 2019 |
AccPar | dynamic programming | DP+TP | 2020 |
TensorOpt | dynamic programming | DP+TP | 2020 |
Double Recursive | Double recursive | DP+TP | 2021 |
PaSE | dynamic programming | DP+TP | 2021 |
P^2 | Synthesize tool with simulation | DP+TP | 2021 |
Piper | two-level dynamic programming | DP+TP+PP | 2021 |
GSPMD | Heuristic sharding propagation | DP+TP+PP | 2021 |
DistIR | grid search | DP+TP+PP | 2021 |
Neo | Greedy + Karmarkar-Karp algorithm | DP+TP+PP | 2021 |
Alpa | Integer programming + Dynamic Programming | DP+TP+PP | 2022 |
Pictures
REINFORCE
Spotlight
GPipe
GDP
Placeto
REGAL
News
2021.12.9 DeepMind proposes Gopher, a 280-billion-parameter transformer language model, trained on 4096 16GB TPUv3 chips. link
2021.12.8 Baidu and Peng Cheng Laboratory propose Wenxin (文心), a 260-billion-parameter knowledge-enhanced pretrained model (a.k.a. ERNIE 3.0 Titan), trained with the Adaptive Paddle listed in the table above.
2021.10.26 Inspur formally announces a 245.7-billion-parameter model at AICC 2021.