Awesome Auto-Parallelism

Concept Explanation

Data Parallelism (DP)
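
Data parallelism keeps a full copy of the model on every device and splits each training batch across the copies, synchronizing gradients with an all-reduce after every step. The sketch below is only a minimal illustration using PyTorch's DistributedDataParallel; the model, the training loop, and the assumption that a launcher such as torchrun sets LOCAL_RANK are placeholders, not taken from any work listed here.

```python
# Minimal data-parallelism sketch (assumption: torchrun starts one process per
# GPU and sets RANK/LOCAL_RANK/WORLD_SIZE; model and data are placeholders).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")               # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()    # placeholder model replica
    model = DDP(model, device_ids=[local_rank])   # wrap: grads are all-reduced
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):                           # placeholder training loop
        x = torch.randn(32, 1024, device="cuda")  # each rank sees its own shard
        loss = model(x).square().mean()
        loss.backward()                           # DDP all-reduces gradients here
        opt.step()
        opt.zero_grad()

if __name__ == "__main__":
    main()
```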

Model Parallelism

Model parallelism comes in two types: inter-layer and intra-layer. We denote inter-layer model parallelism as MP, and intra-layer model parallelism as TP (tensor parallelism).

Some researchers also call TP parameter parallelism or intra-layer model parallelism.

Popular intra-layer model parallelism methods include 1D (Megatron-LM) as well as 2D, 2.5D, and 3D tensor parallelism. Only a few works cover the 2D, 2.5D, and 3D variants so far (mainly Colossal-AI).
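
As a rough single-process sketch of the idea behind 1D Megatron-style tensor parallelism (the shapes and the two-way split are arbitrary assumptions, and plain concatenation/summation stands in for the all-gather/all-reduce a real implementation would use), a Linear layer's weight can be sharded by columns or by rows:

```python
# 1D tensor parallelism in miniature: shard a Linear layer's weight across
# "devices" (here just tensor chunks), compute partial results, and combine.
import torch

torch.manual_seed(0)
x = torch.randn(8, 16)            # activations: (batch, in_features)
w = torch.randn(16, 32)           # full weight of a Linear(16, 32)

# Column parallelism: each device holds a slice of the output columns.
w_cols = w.chunk(2, dim=1)        # two shards of shape (16, 16)
y_col = torch.cat([x @ shard for shard in w_cols], dim=1)   # "all-gather"

# Row parallelism: each device holds a slice of the weight rows; partial
# products are summed, which corresponds to an all-reduce.
x_parts = x.chunk(2, dim=1)       # split activations along in_features
w_rows = w.chunk(2, dim=0)        # matching slices of the weight rows
y_row = sum(xp @ wp for xp, wp in zip(x_parts, w_rows))      # "all-reduce"

assert torch.allclose(y_col, x @ w, atol=1e-5)
assert torch.allclose(y_row, x @ w, atol=1e-5)
```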

Pipeline Parallelism

PP partitions the model across layers just like MP, but the execution behavior differs: stages process micro-batches in a pipelined fashion. Pipeline parallelism broadly falls into two families: the GPipe family (synchronous) and the PipeDream family (asynchronous).
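
A GPipe-style schedule can be pictured with a minimal single-process sketch: one mini-batch is chopped into micro-batches that flow through the stages, and the parameters are updated only after the whole mini-batch has been flushed. The two-stage split, layer sizes, and loss below are assumptions for illustration only.

```python
# GPipe-style pipeline parallelism in miniature: two stages, each of which would
# live on its own device, fed with micro-batches split from one mini-batch.
# This single-process sketch only illustrates the schedule.
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(16, 64), nn.ReLU())  # would live on device 0
stage1 = nn.Sequential(nn.Linear(64, 4))               # would live on device 1

batch = torch.randn(32, 16)
micro_batches = batch.chunk(4)      # micro-batches keep both stages busy

outputs = []
for mb in micro_batches:
    a = stage0(mb)                  # forward on stage 0
    # In a real pipeline, `a` is sent to device 1 while stage 0 already
    # starts processing the next micro-batch.
    outputs.append(stage1(a))       # forward on stage 1

loss = torch.cat(outputs).square().mean()
loss.backward()                     # GPipe is synchronous: one parameter update
                                    # per mini-batch, after all micro-batches finish
```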

Published auto-parallelism methods are listed below. I classify them according to how they partition the model.

Pipeline Parallelism or Inter-layer Model Parallelism only:

| Name | Description | Organization or author | Paper | Framework | Year | Auto Methods |
| --- | --- | --- | --- | --- | --- | --- |
| ColocRL (REINFORCE) | Uses reinforcement learning to discover model partitions | Google Brain | mlr.press | TensorFlow | PMLR 70, 2017 | REINFORCE |
| A hierarchical model for device placement (HDP) | Uses Scotch to do graph partitioning | Google | link | TensorFlow | ICLR 2018 | REINFORCE + LSTM |
| GPipe | No implementation; see torchgpipe | Google | arxiv | None | 2018 on arXiv, NIPS 2019 | even partitioning or manual partitioning |
| torchgpipe | A GPipe implementation in PyTorch | UNIST | arxiv | PyTorch | 2020 on arXiv | balances stages by profiling |
| GDP | A general deep RL method for automating device placement on arbitrary graphs; orthogonal to DP, MP, PP | Google | arxiv | Unknown | 2019 on arXiv | REINFORCE + Transformer |
| Pesto | Partitions the model based on inter-layer model parallelism | Stony Brook University | acm | TensorFlow | Middleware '21 | integer linear programming |
| vPipe | A pipeline-only system designed for NAS networks; complementary to hybrid parallelism | HKU | ieee | PyTorch | TPDS vol. 33, no. 3, 2022 | Swap, Recompute, Partition (SRP) planner; P: Kernighan-Lin algorithm |

Data Parallelism + Pipeline Parallelism (or Inter-layer Model Parallelism); a generic sketch of the profile-driven dynamic programming that many of these entries rely on follows the table:

| Name | Description | Organization or author | Paper | Framework | Year | Auto Methods |
| --- | --- | --- | --- | --- | --- | --- |
| Spotlight | Models device placement as a Markov decision process (MDP) | University of Toronto | mlr.press | Unknown | PMLR 80, 2018 | REINFORCE + LSTM |
| Placeto | Similar to Spotlight's MDP formulation, but with a different policy | MIT | nips | TensorFlow | NIPS 2019 | REINFORCE |
| REGAL | A deep reinforcement learning approach to minimizing the execution cost of neural network computation graphs in an optimizing compiler | Google | openreview | Unknown | ICLR 2020 | RL with a genetic algorithm |
| PipeDream | The repository contains the source code implementation of PipeDream and PipeDream-2BW | Microsoft Fiddle | arxiv | PyTorch | 2018 on arXiv, SOSP 2019 | dynamic programming with profiling |
| PipeDream-2BW | See PipeDream above | Microsoft | arxiv, mlr.press | PyTorch | PMLR 139, 2021 | dynamic programming with profiling |
| DNN-partitioning | Published at NeurIPS 2020 | Microsoft Fiddle | arxiv | proof-of-concept implementation | NIPS 2020 | dynamic programming and integer programming |
| HetPipe | Enables large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism | UNIST | usenix | PyTorch (not open sourced) | USENIX 2020 | uses CPLEX to solve a linear programming problem |
| DAPPLE | An efficient pipelined data-parallel approach for training large models; successor of GPipe | Alibaba | arxiv | DAPPLE | 2020 on arXiv; PPoPP 21 | dynamic programming |
| PipeTransformer | Automated elastic pipelining for distributed training of Transformers | University of Southern California | arxiv | PyTorch | ICML 21 | dynamic programming |
| Chimera | Efficiently trains large-scale neural networks with bidirectional pipelines | Department of Computer Science, ETH Zurich, Switzerland | dl.acm | PyTorch | SC 2021 | performance model with brute-force search |
| TAPP | Uses an attention-based Seq2Seq model to predict the stage for each layer | Hohai University | mdpi | Unknown | Appl. Sci. 2021, 11 | REINFORCE + attention-based Seq2Seq |
| RaNNC | Automatic parallelization middleware used to train very large-scale neural networks | DIRECT and University of Tokyo | arxiv | PyTorch | IPDPS 2021 | dynamic programming |
| HeterPS | Distributed deep learning with RL-based scheduling in heterogeneous environments | Baidu | arxiv | Paddle | 2021 | reinforcement-learning based |
| FTPipe | Automatically transforms a sequential implementation into a multi-GPU one | Technion - Israel Institute of Technology | usenix | PyTorch | 2021 | multiprocessor scheduling problem with profiling |
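
Many of the entries above report "dynamic programming with profiling". A generic way to picture it, using the textbook linear-partition DP rather than the exact algorithm of PipeDream or any other paper here, and with made-up layer times, is: profile each layer, then split the layer chain into contiguous stages so that the slowest stage (the pipeline bottleneck) is as fast as possible.

```python
# Sketch of profile-driven stage partitioning: split a chain of profiled layers
# into `num_stages` contiguous stages, minimizing the slowest (bottleneck) stage.
# Classic linear-partition dynamic programming; layer times below are invented.

def partition_stages(layer_times, num_stages):
    n = len(layer_times)
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    def seg(i, j):                     # total profiled time of layers i..j-1
        return prefix[j] - prefix[i]

    INF = float("inf")
    # best[s][j]: minimal bottleneck when the first j layers form s stages
    best = [[INF] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for s in range(1, num_stages + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):  # layers i..j-1 become stage s
                cand = max(best[s - 1][i], seg(i, j))
                if cand < best[s][j]:
                    best[s][j], cut[s][j] = cand, i

    # Walk the cut table backwards to recover the stage boundaries.
    bounds, j = [], n
    for s in range(num_stages, 0, -1):
        i = cut[s][j]
        bounds.append((i, j))
        j = i
    return best[num_stages][n], list(reversed(bounds))

profiled = [2.0, 1.0, 3.0, 1.5, 2.5, 1.0, 4.0, 0.5]   # fake per-layer times (ms)
bottleneck, stages = partition_stages(profiled, num_stages=3)
print(bottleneck, stages)              # -> 6.0 [(0, 3), (3, 5), (5, 8)]
```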

Data Parallelism + Intra-layer Model Parallelism (or Tensor Parallelism); a toy cost-model comparison in the spirit of these strategy searches follows the table:

| Name | Description | Organization or author | Paper | Framework | Year | Auto Methods |
| --- | --- | --- | --- | --- | --- | --- |
| OptCNN | Auto-parallelism method for CNNs | Zhihao Jia | mlr.press | FlexFlow | PMLR 80, 2018 | dynamic-programming-based graph search algorithm |
| FlexFlow | A deep learning framework that accelerates distributed DNN training by automatically searching for efficient parallelization strategies | Zhihao Jia | stanford | FlexFlow, compatible with PyTorch and Keras | SysML 2019 | MCMC |
| Tofu | Supports very large models using automatic dataflow graph partitioning | New York University | dl.acm | not open sourced | EuroSys 2019 | same as OptCNN |
| AccPar | Tensor partitioning for heterogeneous deep learning accelerators | Linghao Song from USC | usc.edu | needs manual deployment | 2019 on arXiv, HPCA 2020 | dynamic programming |
| TensorOpt | Explores the tradeoffs in distributed DNN training with auto-parallelism | CUHK & Huawei | arxiv | MindSpore | 2020 on arXiv | dynamic-programming-based graph search algorithm |
| ROC | Another paper from Zhihao Jia, designed for GNNs | Zhihao Jia | mlsys | on top of FlexFlow | MLSys 2020 | uses a novel online linear regression model for efficient graph partitioning, plus a dynamic programming algorithm to minimize data-transfer cost |
| Double Recursive | A double recursive algorithm to search strategies | Huawei | link | MindSpore | Euro-Par 2021 | double recursive search |
| PaSE | Uses a dynamic-programming-based approach to find an efficient strategy within a reasonable time | Baidu Research | ieee | prototype | IPDPS 2021 | dynamic programming |
| P^2 | A novel syntax-guided program synthesis framework able to decompose reductions over one or more parallelism axes into sequences of collectives in a hierarchy- and mapping-aware way | University of Cambridge & DeepMind | arxiv | simulation experiments | 2021 on arXiv, MLSys 2022 | synthesis tool with simulation |
| AutoMap | Uses search and learning to find Megatron-like strategies | DeepMind | arxiv | JAX Python API, XLA backend | 2021 on arXiv, NIPS 2021 | search: Monte Carlo tree search; learn: interactive network |
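
The intra-layer searches above generally score candidate per-operator strategies with an analytic cost model and pick the cheapest combination. The toy example below shows the flavor of such a comparison for a single Linear layer; the cost formulas, bandwidth/FLOPs constants, and the two candidate strategies are invented for illustration and are not taken from OptCNN, FlexFlow, or any other entry.

```python
# Toy cost-model comparison for one Linear layer on n_dev devices: data-parallel
# (split the batch, all-reduce weight grads) vs. tensor-parallel (split output
# columns, all-gather activations). All constants and formulas are invented
# approximations purely for illustration.

def strategy_costs(batch, d_in, d_out, n_dev,
                   flops_per_sec=1e12, bytes_per_sec=1e10, bytes_per_elem=2):
    # Compute time is the same for both: the matmul FLOPs are evenly sharded.
    compute = 2.0 * batch * d_in * d_out / n_dev / flops_per_sec

    # Data parallel: ring all-reduce of the weight gradient (~2x its size).
    dp_comm = 2.0 * d_in * d_out * bytes_per_elem / bytes_per_sec
    # Tensor parallel: all-gather of the sharded output activations.
    tp_comm = batch * d_out * bytes_per_elem / bytes_per_sec

    return {"data_parallel": compute + dp_comm,
            "tensor_parallel": compute + tp_comm}

for batch in (512, 32768):
    costs = strategy_costs(batch, d_in=8192, d_out=8192, n_dev=8)
    best = min(costs, key=costs.get)
    print(f"batch={batch}: choose {best} ({costs})")
# Small batches favor the tensor-parallel split (activations are tiny relative
# to the weight), while very large batches amortize the weight all-reduce and
# tip the balance back toward data parallelism.
```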

Data Parallelism + Intra-layer Model Parallelism (or Tensor Parallelism) + Pipeline Parallelism:

| Name | Description | Organization or author | Paper | Framework | Year | Auto Methods |
| --- | --- | --- | --- | --- | --- | --- |
| Auto-MAP | Works on HLO IR; uses linkage groups to prune the search space and DQN-based RL to search DP, MP, and PP strategies | Alibaba | arxiv | RAINBOW DQN | 2020 | reinforcement learning |
| Piper | Contains algorithms (proof-of-concept implementation) and input files (profiled DNN models/workloads) from the paper "Piper: Multidimensional Planner for DNN Parallelization" published at NeurIPS 2021; an extension of DNN-partitioning | Microsoft Fiddle | link | proof-of-concept implementation | NIPS 2021 | two-level dynamic programming |
| GSPMD | A system that uses simple tensor sharding annotations to achieve different parallelism paradigms in a unified way | Google | arxiv | TensorFlow XLA | 2021 | sharding propagation |
| DistIR | Horizontal TP; an intermediate representation and simulator for efficient neural network distribution | Stanford University & Microsoft Fiddle | arxiv | PyTorch | MLSys 2021 | grid-search simulator |
| Neo | A software-hardware co-designed system for high-performance distributed training of large-scale DLRMs | Facebook | arxiv | PyTorch | 2021 | 1. greedy; 2. Karmarkar-Karp algorithm |
| Adaptive Paddle | Elastic training, fault tolerance, cost-model-based sharding propagation | Baidu | arxiv | Paddle | 2021 | cost-model based; details not given |
| Alpa | Automates inter- and intra-operator parallelism for distributed deep learning | UC Berkeley, Google, etc. | arxiv | JAX, XLA | 2022 | integer linear programming for intra-op, dynamic programming for inter-op |

Other interesting automatic work:

| Name | Description | Organization or author | Paper | Framework | Year | Auto Methods |
| --- | --- | --- | --- | --- | --- | --- |
| TASO | Automatically optimizes DNN computation with graph substitutions | Zhihao Jia | | | | |

Classification: Machine-Learning-Based Methods vs. Classic-Algorithm-Based Methods

Machine-Learning Based Methods

| Name | Method Type | Parallelism | Year |
| --- | --- | --- | --- |
| ColocRL | Reinforcement | MP | 2017 |
| HDP | Reinforcement | MP | 2018 |
| GDP | Reinforcement | MP | 2019 |
| REGAL | Reinforcement | MP | 2020 |
| TAPP | Reinforcement | DP+PP | 2021 |
| Spotlight | Reinforcement | DP+MP | 2018 |
| Placeto | Reinforcement | DP+MP | 2019 |
| HeterPS | Reinforcement | DP+PP | 2021 |
| AutoMap | Deep learning to predict rank | DP+TP | 2021 |
| Auto-MAP | Reinforcement | DP or TP or PP | 2020 |
| FlexFlow | MCMC | DP+TP | 2019 |
| ROC | Online linear regression model for efficient graph partitioning, plus dynamic programming to minimize data-transfer cost | DP+TP | 2020 |

Classic Algorithm Based Methods

| Name | Method Type | Parallelism | Year |
| --- | --- | --- | --- |
| Pesto | integer linear programming | MP | 2021 |
| vPipe | SRP algorithm + KL (DP) | PP | 2022 |
| PipeDream | dynamic programming | DP+PP | 2019 |
| DNN-partitioning | dynamic programming + integer programming | DP+PP | 2020 |
| PipeDream-2BW | dynamic programming | DP+PP | 2021 |
| HetPipe | linear programming (CPLEX) | DP+PP | 2020 |
| DAPPLE | dynamic programming | DP+PP | 2021 |
| PipeTransformer | dynamic programming | DP+PP | 2021 |
| Chimera | grid search | DP+PP | 2021 |
| RaNNC | dynamic programming | DP+PP | 2021 |
| FTPipe | multiprocessor scheduling problem with profiling | DP+PP | 2021 |
| OptCNN | dynamic programming | DP+TP | 2018 |
| Tofu | dynamic programming | DP+TP | 2019 |
| AccPar | dynamic programming | DP+TP | 2020 |
| TensorOpt | dynamic programming | DP+TP | 2020 |
| Double Recursive | double recursive search | DP+TP | 2021 |
| PaSE | dynamic programming | DP+TP | 2021 |
| P^2 | synthesis tool with simulation | DP+TP | 2021 |
| Piper | two-level dynamic programming | DP+TP+PP | 2021 |
| GSPMD | heuristic sharding propagation | DP+TP+PP | 2021 |
| DistIR | grid search | DP+TP+PP | 2021 |
| Neo | greedy + Karmarkar-Karp algorithm | DP+TP+PP | 2021 |
| Alpa | integer programming + dynamic programming | DP+TP+PP | 2022 |

Pictures

Figure placeholders for: REINFORCE (ColocRL), Spotlight, GPipe, GDP, Placeto, REGAL (images not included).

News

2021.12.9 DeepMind proposes Gopher, a 280-billion-parameter Transformer language model, trained on 4096 16 GB TPUv3 chips. link

2021.12.8 Baidu and Peng Cheng Laboratory propose Wenxin (文心), a 260-billion-parameter knowledge-aware pretrained model (a.k.a. ERNIE 3.0 Titan), trained with Adaptive Paddle (see the table above).

2021.10.26 Inspur formally announces a 245.7-billion-parameter model at AICC 2021.