Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference

This is the repository for the ISCA'24 artifact evaluation (AE) of Pre-gated MoE.

For more details about this project, please refer to our ISCA 2024 paper, "Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference."

Disclaimer

This open-source project is for proof of concept purposes only and should not be used in production environments. The code has not been officially vetted for security vulnerabilities and provides no guarantees of correctness or security. Users should carefully review the code and conduct their own security assessments before using the software.

ISCA'24 Artifact Evaluation

Setup

# Starting from the official container
docker run -ti --gpus all --shm-size 5g --name pregated -v ${DATA_PATH}:/data nvcr.io/nvidia/pytorch:22.09-py3 bash
git clone --recursive https://github.com/ranggihwang/Pregated_MoE.git FasterTransformer

# build on A100
mkdir -p FasterTransformer/build
cd FasterTransformer/build
cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..
make -j

Set the -DSM flag to match the compute capability of your GPU:

| GPU  | Compute capability |
|------|--------------------|
| P40  | 60                 |
| P4   | 61                 |
| V100 | 70                 |
| T4   | 75                 |
| A100 | 80                 |
| A30  | 80                 |
| A10  | 86                 |
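
For example, building on a T4 instead of an A100 would use -DSM=75 from the table above; the following is simply the same configure step with a different SM value, shown as a sketch rather than a separately verified configuration:

```bash
# Sketch: same build, targeting a T4 (compute capability 75) instead of an A100
cmake -DSM=75 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..
make -j
```
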
# Python dependencies
pip install -r ../examples/pytorch/t5/requirement.txt

Prepare models

mkdir /data/ft
cd /workspace/FasterTransformer/
./scripts/convert.sh
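
After the conversion finishes, the converted checkpoints are expected to land under the /data/ft directory created above; the exact sub-directory layout is determined by scripts/convert.sh, so the command below is only an illustrative sanity check:

```bash
# Illustrative check: converted model files are expected somewhere under /data/ft
ls -R /data/ft | head
```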

Evaluation

cd /workspace/FasterTransformer/
# logs will be output here
mkdir logs/
python scripts/eval_all.py

Check block_lats.csv, throughputs.csv, and peak_mems.csv for block latencies, throughputs, and peak memory usage, respectively.
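
If it helps, the result CSVs can be inspected with pandas; the snippet below is a minimal sketch that only prints the first rows of each file and makes no assumptions about the exact column schema produced by eval_all.py:

```python
# Sketch: quick look at the result CSVs written by scripts/eval_all.py
import pandas as pd

for name in ["block_lats.csv", "throughputs.csv", "peak_mems.csv"]:
    df = pd.read_csv(name)
    print(f"== {name}: {df.shape[0]} rows ==")
    print(df.head())
```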

You can modify the following lists in scripts/eval_all.py to choose which configurations are evaluated (an example edit is sketched after the listing):

models = [
    "switch-base-8",
    "switch-base-64",
    "switch-base-128",
    "switch-large-128",
]
batch_sizes = [
    1,
    # 2,
    # 4,
    # 8,
    # 16,
]
methods = [
    "GPU-only",
    "Pre-gated",
    "DeepSpeed",  # This is MoE-OnDemand
    "SE-MoE",     # This is MoE-Prefetch
]
metrics = [
    "block_lat",
    "throughput",
    "peak_mem",
    # "max_active_expert",
    # "cache_hit_rate",
]
forced_num_experts = [
    0,
    # 1,
    # 2,
    # 4,
    # 8,
    # 16,
]
cache_ratios = [
    0,
    # 0.01,
    # 0.03,
    # 0.05,
    # 0.1,
    # 0.2,
    # 0.4,
    # 0.8,
]
disk_offloads = [
    0,
    # 1,
]
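
For example, a quick smoke-test sweep could keep a single model and a couple of batch sizes; the values below are all drawn from the options already listed above:

```python
# Example edit of the configuration lists in scripts/eval_all.py
models = ["switch-base-8"]
batch_sizes = [1, 8]
methods = ["GPU-only", "Pre-gated"]
metrics = ["block_lat", "peak_mem"]
forced_num_experts = [0]
cache_ratios = [0]
disk_offloads = [0]
```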

Citation

Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, and Mao Yang, "Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference," The 51st IEEE/ACM International Symposium on Computer Architecture (ISCA-51), Buenos Aires, Argentina, June 2024
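
For convenience, a BibTeX entry assembled from the citation above might look like the following (the entry key and field formatting are our own; please verify against the official ISCA proceedings):

```bibtex
@inproceedings{hwang2024pregated,
  author    = {Hwang, Ranggi and Wei, Jianyu and Cao, Shijie and Hwang, Changho and
               Tang, Xiaohu and Cao, Ting and Yang, Mao},
  title     = {Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable
               Mixture-of-Expert Inference},
  booktitle = {Proceedings of the 51st IEEE/ACM International Symposium on Computer
               Architecture (ISCA)},
  year      = {2024}
}
```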


The remainder of this document is the original FasterTransformer README.

FasterTransformer

This repository provides a script and recipe to run the highly optimized transformer-based encoder and decoder component, and it is tested and maintained by NVIDIA.

Model overview

In NLP, the encoder and decoder are two important components, with the transformer layer becoming a popular architecture for both. FasterTransformer implements a highly optimized transformer layer for both the encoder and decoder for inference. On Volta, Turing, and Ampere GPUs, the computing power of Tensor Cores is used automatically when the precision of the data and weights is FP16.

FasterTransformer is built on top of CUDA, cuBLAS, cuBLASLt, and C++. We provide APIs for at least the following frameworks: TensorFlow, PyTorch, and the Triton backend, so users can integrate FasterTransformer into these frameworks directly. For the supported frameworks, we also provide example code that demonstrates how to use FasterTransformer and shows its performance on each framework.
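
As a rough illustration of the PyTorch integration path, the custom ops built under build/lib can be loaded into a Python process as TorchScript classes; the shared-library filename below is an assumption that varies across FasterTransformer versions, and the scripts under examples/pytorch show the complete, supported usage:

```python
# Minimal sketch of loading the FasterTransformer PyTorch extension.
# The .so name and the op classes it registers depend on the
# FasterTransformer version and build configuration (path below is assumed).
import torch

torch.classes.load_library("build/lib/libth_transformer.so")
```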

Support matrix

| Models | Framework | FP16 | INT8 (after Turing) | Sparsity (after Ampere) | Tensor parallel | Pipeline parallel | FP8 (after Hopper) |
|---|---|---|---|---|---|---|---|
| BERT | TensorFlow | Yes | Yes | - | - | - | - |
| BERT | PyTorch | Yes | Yes | Yes | Yes | Yes | - |
| BERT | Triton backend | Yes | - | - | Yes | Yes | - |
| BERT | C++ | Yes | Yes | - | - | - | Yes |
| XLNet | C++ | Yes | - | - | - | - | - |
| Encoder | TensorFlow | Yes | Yes | - | - | - | - |
| Encoder | PyTorch | Yes | Yes | Yes | - | - | - |
| Decoder | TensorFlow | Yes | - | - | - | - | - |
| Decoder | PyTorch | Yes | - | - | - | - | - |
| Decoding | TensorFlow | Yes | - | - | - | - | - |
| Decoding | PyTorch | Yes | - | - | - | - | - |
| GPT | TensorFlow | Yes | - | - | - | - | - |
| GPT/OPT | PyTorch | Yes | - | - | Yes | Yes | Yes |
| GPT/OPT | Triton backend | Yes | - | - | Yes | Yes | - |
| GPT-MoE | PyTorch | Yes | - | - | Yes | Yes | - |
| BLOOM | PyTorch | Yes | - | - | Yes | Yes | - |
| BLOOM | Triton backend | Yes | - | - | Yes | Yes | - |
| GPT-J | Triton backend | Yes | - | - | Yes | Yes | - |
| Longformer | PyTorch | Yes | - | - | - | - | - |
| T5/UL2 | PyTorch | Yes | - | - | Yes | Yes | - |
| T5 | TensorFlow 2 | Yes | - | - | - | - | - |
| T5/UL2 | Triton backend | Yes | - | - | Yes | Yes | - |
| T5 | TensorRT | Yes | - | - | Yes | Yes | - |
| T5-MoE | PyTorch | Yes | - | - | Yes | Yes | - |
| Swin Transformer | PyTorch | Yes | Yes | - | - | - | - |
| Swin Transformer | TensorRT | Yes | Yes | - | - | - | - |
| ViT | PyTorch | Yes | Yes | - | - | - | - |
| ViT | TensorRT | Yes | Yes | - | - | - | - |
| GPT-NeoX | PyTorch | Yes | - | - | Yes | Yes | - |
| GPT-NeoX | Triton backend | Yes | - | - | Yes | Yes | - |
| BART/mBART | PyTorch | Yes | - | - | Yes | Yes | - |
| WeNet | C++ | Yes | - | - | - | - | - |
| DeBERTa | TensorFlow 2 | Yes | - | - | On-going | On-going | - |
| DeBERTa | PyTorch | Yes | - | - | On-going | On-going | - |

More details on specific models can be found in xxx_guide.md under docs/, where xxx is the model name. Answers to some common questions are collected in docs/QAList.md. Note that the Encoder and BERT models are similar, so both are explained together in bert_guide.md.

Advanced

The following listing shows the directory structure of FasterTransformer:

/src/fastertransformer: source code of FasterTransformer
    |--/cutlass_extensions: Implementation of CUTLASS GEMMs and kernels.
    |--/kernels: CUDA kernels for different models/layers and operations, like addBiasResidual.
    |--/layers: Implementation of layer modules, like the attention layer and FFN layer.
    |--/models: Implementation of different models, like BERT and GPT.
    |--/tensorrt_plugin: Encapsulates FasterTransformer as a TensorRT plugin.
    |--/tf_op: Custom TensorFlow OP implementation.
    |--/th_op: Custom PyTorch OP implementation.
    |--/triton_backend: Custom Triton backend implementation.
    |--/utils: Common CUDA utilities, like cublasMMWrapper and memory_utils.
/examples: C++, tensorflow and pytorch interface examples
    |--/cpp: C++ interface examples
    |--/pytorch: PyTorch OP examples
    |--/tensorflow: TensorFlow OP examples
    |--/tensorrt: TensorRT examples
/docs: Documents to explain the details of implementation of different models, and show the benchmark
/benchmark: Contains the scripts to run the benchmarks of different models
/tests: Unit tests
/templates: Documents to explain how to add a new model/example into FasterTransformer repo

Note that many folders contain sub-folders for the different models. The quantization tools have been moved to examples, e.g. examples/tensorflow/bert/bert-quantization/ and examples/pytorch/bert/bert-quantization-sparsity/.

Global Environment

FasterTransformer provides some convenient environment variables for debugging and testing.

  1. FT_LOG_LEVEL: This environment variable controls the log level of debug messages. More details are in src/fastertransformer/utils/logger.h. Note that the program prints a lot of messages when the level is lower than DEBUG, which makes it very slow.
  2. FT_NVTX: If it is set to ON, as in FT_NVTX=ON ./bin/gpt_example, the program inserts NVTX tags to help with profiling.
  3. FT_DEBUG_LEVEL: If it is set to DEBUG, the program runs cudaDeviceSynchronize() after every kernel; otherwise kernels are executed asynchronously by default. This is helpful for locating the exact point of failure during debugging, but it affects performance significantly, so it should be used only for debugging. A combined usage example is shown after this list.
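
For example, using the gpt_example binary mentioned above, all three variables can be combined on one command line:

```bash
# Run with verbose logging, NVTX ranges, and per-kernel synchronization
# enabled. DEBUG-level logging and FT_DEBUG_LEVEL=DEBUG both slow the
# program down noticeably, so use them only while debugging.
FT_LOG_LEVEL=DEBUG FT_NVTX=ON FT_DEBUG_LEVEL=DEBUG ./bin/gpt_example
```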

Performance

Hardware settings:

To run the following benchmarks, install the Unix calculator utility bc:

apt-get install bc

BERT base performance

The FP16 results of TensorFlow were obtained by running benchmarks/bert/tf_benchmark.sh.

The INT8 results of TensorFlow were obtained by running benchmarks/bert/tf_int8_benchmark.sh.

The FP16 results of PyTorch were obtained by running benchmarks/bert/pyt_benchmark.sh.

The INT8 results of PyTorch were obtained by running benchmarks/bert/pyt_int8_benchmark.sh.
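
Assuming they are invoked from the repository root of a built container, the four scripts above can be run directly, e.g.:

```bash
# Reproduce the BERT base FP16 / INT8 numbers (script names as listed above)
bash benchmarks/bert/tf_benchmark.sh
bash benchmarks/bert/tf_int8_benchmark.sh
bash benchmarks/bert/pyt_benchmark.sh
bash benchmarks/bert/pyt_int8_benchmark.sh
```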

More benchmarks are put in docs/bert_guide.md.

BERT base performance of new FasterTransformer features

The following figure compares the performance of the new FasterTransformer features against baseline FasterTransformer under FP16 on T4.

For large batch sizes and sequence lengths, both EFF-FT and FT-INT8-v2 bring about 2x speedup. Using Effective FasterTransformer and INT8-v2 at the same time can bring about 3.5x speedup compared to FasterTransformer FP16 for large cases.

<div align=center><img width=80% src ="docs/images/FT_Encoder_T4.png"/></div>

BERT base performance on TensorFlow

The following figure compares the performance of different FasterTransformer features against TensorFlow XLA under FP16 on T4.

For small batch size and sequence length, using FasterTransformer can bring about 3x speedup.

For large batch size and sequence length, using Effective FasterTransformer with INT8-v2 quantization can bring about 5x speedup.

<div align=center><img width=80% src ="docs/images/TF_Encoder_T4.png"/></div>

BERT base performance on PyTorch

The following figure compares the performance of different FasterTransformer features against PyTorch TorchScript under FP16 on T4.

For small batch size and sequence length, using FasterTransformer CustomExt can bring about 4x ~ 6x speedup.

For large batch size and sequence length, using Effective FasterTransformer with INT8-v2 quantization can bring about 5x speedup.

<div align=center><img width=80% src ="docs/images/Py_Encoder_T4.png"/></div>

Decoding and Decoder performance

The results of TensorFlow were obtained by running benchmarks/decoding/tf_decoding_beamsearch_benchmark.sh and benchmarks/decoding/tf_decoding_sampling_benchmark.sh.

The results of PyTorch were obtained by running benchmarks/decoding/pyt_decoding_beamsearch_benchmark.sh.

In the experiments of decoding, we updated the following parameters:

More benchmarks are put in docs/decoder_guide.md.

Decoder and Decoding end-to-end translation performance on TensorFlow

The following figure shows the speedup of the FT-Decoder and FT-Decoding ops compared to TensorFlow under FP16 on T4. Here, we use the throughput of translating a whole test set, because the total number of generated tokens can differ between methods. Compared to TensorFlow, FT-Decoder provides 1.5x ~ 3x speedup, while FT-Decoding provides 4x ~ 18x speedup.

<div align=center><img width=80% src ="docs/images/TF_Decoder_T4.png"/></div>

Decoder and Decoding end-to-end translation performance on PyTorch

The following figure shows the speedup of the FT-Decoder and FT-Decoding ops compared to PyTorch under FP16 on T4. Here, we use the throughput of translating a whole test set, because the total number of generated tokens can differ between methods. Compared to PyTorch, FT-Decoder provides 1.2x ~ 3x speedup, while FT-Decoding provides 3.8x ~ 13x speedup.

<div align=center><img width=80% src ="docs/images/Py_Decoder_T4.png"/></div>

GPT performance

The following figure compares the performance of Megatron and FasterTransformer under FP16 on A100.

In the experiments of decoding, we updated the following parameters:

<div align=center><img width=80% src ="docs/images/FT_GPT_A100.png"/></div>

Release notes

Changelog

January 2023

Dec 2022

Nov 2022

Oct 2022

Sep 2022

Aug 2022

July 2022

June 2022

May 2022

April 2022

March 2022

February 2022

December 2021

November 2021

August 2021

June 2021

April 2021

Dec 2020

Nov 2020

Sep 2020

Aug 2020

June 2020

May 2020

April 2020

March 2020

February 2020

July 2019

Known issues