<div align="center"> <h1>LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training</h1> <img src="docs/imgs/title-favicon.png" width="200" alt="LLaMA-MoE favicon" style="border-radius: 5%;"><br /> <span style="color:red">📢 <strong><i>A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!!</i></strong></span> <div> <a href="https://huggingface.co/llama-moe" target="_blank">🤗 Model Weights</a> | <a href="#quick-start">🚀 Quick Start</a> | <a href="#installation">⚙️ Installation Guide</a> | <a href="#expert-construction">🚧 Expert Construction</a> | <a href="#continual-pretraining">🚅 Continual Pre-training</a> | <a href="#evaluation">💎 Evaluation</a> | <a href="#sft">💬 Supervised Fine-Tuning (SFT)</a> </div> <a href="docs/LLaMA_MoE.pdf" target="_blank"><strong>📃 Technical Report</strong></a> </div> <h2 id="llama-moe">🎉 Introduction</h2>LLaMA-MoE is a series of open-sourced Mixture-of-Expert (MoE) models based on LLaMA and SlimPajama. We build LLaMA-MoE with the following two steps:
- Partition LLaMA's FFNs into sparse experts and insert a top-K gate into each layer of experts.
- Continually pre-train the initialized MoE model with optimized data sampling weights from Sheared LLaMA and filtered datasets from SlimPajama.
The key features of LLaMA-MoE include:

- Lightweight Models: the number of activated parameters is only 3.0~3.5B, which is friendly for deployment and research usage.
- Multiple Expert Construction Methods:
  - Neuron-Independent: Random, Clustering, Co-activation Graph, Gradient (Zhang et al., 2022; Zuo et al., 2022)
  - Neuron-Sharing: Inner, Inter (residual)
- Multiple MoE Gating Strategies (see the gating sketch after this list):
  - TopK Noisy Gate (Shazeer et al., 2017)
  - Switch Gating (Fedus et al., 2022)
- Fast Continual Pre-training:
  - FlashAttention-v2 integrated (Dao, 2023)
  - Fast streaming dataset loading
- Abundant Monitor Items:
  - Gate load, gate importance
  - Loss on steps, loss on tokens, balance loss
  - TGS (tokens/GPU/second), MFU (model FLOPs utilization)
  - Other visualization utilities
- Dynamic Weight Sampling:
  - Self-defined static sampling weights
  - Sheared LLaMA's dynamic batch loading (Xia et al., 2023)
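
To make the TopK Noisy Gate concrete, here is a minimal PyTorch sketch of the idea from Shazeer et al. (2017): add input-dependent Gaussian noise to the router logits during training, keep only the top-K experts per token, and renormalize their scores. The class name, shapes, and the omitted load-balancing loss are illustrative assumptions, not the actual `smoe` implementation.

```python
# A minimal sketch of TopK Noisy Gating (Shazeer et al., 2017); simplified, not the repo's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyTopKGate(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.w_gate = nn.Linear(hidden_size, num_experts, bias=False)   # clean routing logits
        self.w_noise = nn.Linear(hidden_size, num_experts, bias=False)  # per-expert noise scale

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_size)
        clean_logits = self.w_gate(x)
        if self.training:
            noise_std = F.softplus(self.w_noise(x))
            logits = clean_logits + torch.randn_like(clean_logits) * noise_std
        else:
            logits = clean_logits
        # keep only the top-k experts per token and renormalize their weights
        top_val, top_idx = logits.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(logits).scatter(-1, top_idx, F.softmax(top_val, dim=-1))
        return gates, top_idx  # routing weights and selected expert ids


# toy usage: route 4 tokens of width 8 across 8 experts, activating 2 experts per token
gate = NoisyTopKGate(hidden_size=8, num_experts=8, top_k=2)
weights, expert_ids = gate(torch.randn(4, 8))
print(weights.shape, expert_ids.shape)  # torch.Size([4, 8]) torch.Size([4, 2])
```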
<h2 id="quick-start">🚀 Quick Start</h2>

```python
# python>=3.10
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.to("cuda:0")

input_text = "Suzhou is famous of"
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda:0")

pred = model.generate(**inputs, max_length=50, temperature=0.0)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# Suzhou is famous of its beautiful gardens. The most famous one is the Humble Administrator's Garden. It is a classical Chinese garden with a history of more than 600 years. The garden is divided into three
```
<h2 id="installation">⚙️ Installation</h2>
- Prepare the conda environment: `conda create -n smoe python=3.11` (If your environment name is not `smoe`, you may need to change the environment name in the launching scripts.)
- Add the correct environment variables in `~/.bashrc` (`gcc` is set to a newer version for installing `flash-attn`), e.g.:
  ```bash
  export PATH=/mnt/petrelfs/share/cuda-11.8/bin:$PATH
  export LD_LIBRARY_PATH=/mnt/petrelfs/share/cuda-11.8/lib64:$LD_LIBRARY_PATH
  export PATH=/mnt/petrelfs/share/gcc-10.1.0/bin:$PATH
  export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc-10.1.0/lib64:$LD_LIBRARY_PATH
  ```
- Take the variables into effect: `source ~/.bashrc`
- Install PyTorch (CUDA 11.8): `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`
- Install dependencies: `pip install -r requirements.txt`
- Install `flash-attn`: `pip install flash-attn==2.0.1 --no-build-isolation`. You may need to follow the flash-attn installation instructions to avoid some errors.
- Install the latest Git: `conda install git`
- Clone the repo: `git clone git@github.com:pjlab-sys4nlp/llama-moe.git` (If you haven't set up an SSH key for GitHub, you may not be able to clone over SSH. Check the docs about it.)
- Change the current directory: `cd llama-moe`
- Install `smoe` in editable mode: `pip install -e .[dev]`
- Set up `pre-commit` hooks: `pre-commit install`
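
After installation, an optional sanity check like the one below (this snippet is not part of the repo's tooling) helps confirm that CUDA-enabled PyTorch and `flash-attn` were built correctly:

```python
# Optional environment sanity check (not part of the repo).
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # should import cleanly if flash-attn was built against your CUDA/gcc
    print("flash-attn:", flash_attn.__version__)
except ImportError as err:
    print("flash-attn is not installed correctly:", err)
```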
Released LLaMA-MoE models:

| Model | #Activated Experts | #Experts | #Activated Params | Foundation Model | SFT Model |
| :---- | :----------------: | :------: | :---------------: | :--------------: | :-------: |
| LLaMA-MoE-3.0B | 2 | 16 | 3.0B | 🤗 base | 🤗 SFT |
| LLaMA-MoE-3.5B (4/16) | 4 | 16 | 3.5B | 🤗 base | 🤗 SFT |
| LLaMA-MoE-3.5B (2/8) | 2 | 8 | 3.5B | 🤗 base | 🤗 SFT |
- Foundation models (the number in parentheses after a task name is its few-shot setting):

| Model | Average | SciQ | PIQA | WinoGrande | ARC-e | ARC-c (25) | HellaSwag (10) | LogiQA | BoolQ (32) | LAMBADA | NQ (32) | MMLU (5) |
| :---- | :-----: | :--: | :--: | :--------: | :---: | :--------: | :------------: | :----: | :--------: | :-----: | :-----: | :------: |
| OPT-2.7B | 50.3 | 78.9 | 74.8 | 60.8 | 54.4 | 34.0 | 61.4 | 25.8 | 63.3 | 63.6 | 10.7 | 25.8 |
| Pythia-2.8B | 51.5 | 83.2 | 73.6 | 59.6 | 58.8 | 36.7 | 60.7 | 28.1 | 65.9 | 64.6 | 8.7 | 26.8 |
| INCITE-BASE-3B | 53.7 | 85.6 | 73.9 | 63.5 | 61.7 | 40.3 | 64.7 | 27.5 | 65.8 | 65.4 | 15.2 | 27.2 |
| Open-LLaMA-3B-v2 | 55.6 | 88.0 | 77.9 | 63.1 | 63.3 | 40.1 | 71.4 | 28.1 | 69.2 | 67.4 | 16.0 | 26.8 |
| Sheared-LLaMA-2.7B | 56.4 | 87.5 | 76.9 | 65.0 | 63.3 | 41.6 | 71.0 | 28.3 | 73.6 | 68.3 | 17.6 | 27.3 |
| LLaMA-MoE-3.0B | 55.5 | 84.2 | 77.5 | 63.6 | 60.2 | 40.9 | 70.8 | 30.6 | 71.9 | 66.6 | 17.0 | 26.8 |
| LLaMA-MoE-3.5B (4/16) | 57.7 | 87.6 | 77.9 | 65.5 | 65.6 | 44.2 | 73.3 | 29.7 | 75.0 | 69.5 | 20.3 | 26.8 |
| LLaMA-MoE-3.5B (2/8) | 57.6 | 88.4 | 77.6 | 66.7 | 65.3 | 43.1 | 73.3 | 29.6 | 73.9 | 69.4 | 19.8 | 27.0 |
- SFT models:

| Model | MMLU | ARC-c | HellaSwag | TruthfulQA | MT-Bench |
| :---- | :--: | :---: | :-------: | :--------: | :------: |
| Sheared LLaMA-2.7B ShareGPT | 28.41 | 41.04 | 71.21 | 47.65 | 3.79 |
| Sheared LLaMA-2.7B Deita6K (Our Impl.) | 25.24 | 43.69 | 71.70 | 49.00 | 4.06 |
| LLaMA-MoE-v1-3.0B (2/16) | 23.61 | 43.43 | 72.28 | 44.24 | 4.15 |
| LLaMA-MoE-v1-3.5B (4/16) | 26.49 | 48.29 | 75.10 | 45.91 | 4.60 |
| LLaMA-MoE-v1-3.5B (2/8) | 25.53 | 45.99 | 74.95 | 44.39 | 4.72 |
<h2 id="expert-construction">🚧 Expert Construction</h2>

- Neuron-Independent
  - Independent<sub>Random</sub>: `bash ./scripts/expert_construction/split/run_split_random.sh`
  - Independent<sub>Clustering</sub>: `bash ./scripts/expert_construction/split/run_split_clustering.sh`
- Neuron-Sharing
  - Sharing<sub>Inner</sub>: `bash ./scripts/expert_construction/split/run_split_gradient.sh`
  - Sharing<sub>Inter</sub>: `bash ./scripts/expert_construction/split/run_split_gradient_residual.sh`

A sketch of the neuron-independent random split is given below. For more information, please refer to the Expert Construction docs.
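
To make the Independent<sub>Random</sub> split concrete, the sketch below shows the core idea under simplifying assumptions: shuffle the indices of an FFN's intermediate neurons and cut them into equal-sized, disjoint groups, one per expert. The function and variable names are illustrative; the actual logic lives behind `run_split_random.sh`.

```python
# Illustrative sketch of a neuron-independent random split (not the repo's code):
# shuffle the FFN intermediate-neuron indices and cut them into equal, disjoint groups.
import torch


def random_split_neurons(intermediate_size: int, num_experts: int, seed: int = 0):
    assert intermediate_size % num_experts == 0, "neurons must divide evenly across experts"
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(intermediate_size, generator=g)
    # each expert owns intermediate_size // num_experts neuron indices
    return perm.chunk(num_experts)


# e.g. a LLaMA FFN of width 11008 split into 8 experts of 1376 neurons each
experts = random_split_neurons(11008, 8)
print(len(experts), experts[0].shape)  # 8 torch.Size([1376])
```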
<h2 id="continual-pretraining">🚅 Continual Pre-training</h2>Tokenization
Download SlimPajama into `/path_to_data` and put data from different domains into separate folders:

- `/path_to_data/en_arxiv`
- `/path_to_data/en_book`
- `/path_to_data/en_c4`
- `/path_to_data/en_cc`
- `/path_to_data/en_stack`
- `/path_to_data/en_wikipedia`
- `/path_to_data/github`

Each file should end with `*.jsonl`, and each line looks like:

```json
{"id": "id-info", "content": "raw text to be tokenized"}
```
Run the following command to tokenize the data in each folder:

```bash
python -m smoe.utils.tokenize \
  -f jsonl \
  -t /path_to_tokenizer \
  -i /path_to_data/en_arxiv \
  -o /path_to_data_tokenized/en_arxiv
```
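
For intuition, the tokenization step conceptually does something like the following simplified sketch (not the actual `smoe.utils.tokenize` module, and the output schema here is an assumption): read each JSONL record, tokenize its `content` field, and write the token ids back out.

```python
# Simplified sketch of the tokenization step (NOT smoe.utils.tokenize itself):
# read JSONL records, tokenize the "content" field, dump token ids back to JSONL.
import json

from transformers import AutoTokenizer


def tokenize_jsonl(in_path: str, out_path: str, tokenizer_path: str) -> None:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            record = json.loads(line)
            ids = tokenizer(record["content"], add_special_tokens=True)["input_ids"]
            # output schema is an assumption for illustration only
            fout.write(json.dumps({"id": record["id"], "input_ids": ids}) + "\n")


# hypothetical paths mirroring the command above
tokenize_jsonl(
    "/path_to_data/en_arxiv/shard-000.jsonl",
    "/path_to_data_tokenized/en_arxiv/shard-000.jsonl",
    "/path_to_tokenizer",
)
```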
<h3>Continual Pre-training (CPT)</h3>

- NOTICE: Please create the `logs/` folder manually: `mkdir -p logs`
- To run the continual pre-training, please check the CPT docs.
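
For intuition on the dynamic weight sampling mentioned in the feature list, here is a simplified sketch of Sheared LLaMA-style dynamic batch loading (Xia et al., 2023): every few steps, domains whose current loss is farther above their reference loss get their sampling weight increased. The function, the temperature knob, and the starting proportions are illustrative assumptions, not the repo's actual sampler.

```python
# Simplified sketch of Sheared LLaMA-style dynamic batch loading (Xia et al., 2023).
# Domains lagging their reference loss are upweighted; this mirrors the idea only.
import math


def update_sampling_weights(weights: dict, current_loss: dict, reference_loss: dict,
                            temperature: float = 1.0) -> dict:
    # upweight domains whose current loss exceeds their reference loss
    scaled = {
        d: weights[d] * math.exp(max(current_loss[d] - reference_loss[d], 0.0) / temperature)
        for d in weights
    }
    total = sum(scaled.values())
    return {d: v / total for d, v in scaled.items()}


# illustrative starting proportions over the SlimPajama domains listed above
weights = {"en_cc": 0.67, "en_c4": 0.15, "github": 0.045, "en_book": 0.045,
           "en_arxiv": 0.025, "en_wikipedia": 0.045, "en_stack": 0.02}
current = {d: 2.1 for d in weights}
reference = {d: 2.0 for d in weights}
reference["github"] = 2.3  # github is already below its reference -> not upweighted
print(update_sampling_weights(weights, current, reference))
```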
<h2 id="evaluation">💎 Evaluation</h2>

- For evaluation on Natural Questions (NQ), please refer to opencompass.
- For other tasks, please refer to lm-eval-harness.
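
As an illustration of the lm-eval-harness route (assuming lm-eval >= 0.4 and its `simple_evaluate` API; the task list and shot counts below are examples, not the exact protocol behind the tables above):

```python
# Hedged example of evaluating a LLaMA-MoE checkpoint with lm-eval-harness (>= 0.4).
# Tasks and few-shot settings here are illustrative only.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=llama-moe/LLaMA-MoE-v1-3_5B-2_8,trust_remote_code=True,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag"],
    num_fewshot=25,
    batch_size=8,
)
print(results["results"])
```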
<h2 id="sft">💬 Supervised Fine-Tuning (SFT)</h2>

We provide simple examples of SFT to build chatbots.
Please refer to the SFT docs and `scripts/sft` for more details.
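
A minimal way to chat with one of the SFT checkpoints might look like the sketch below; the model id is a placeholder for the released SFT weights, and it assumes the SFT tokenizer ships a chat template (otherwise format the prompt as described in the SFT docs).

```python
# Minimal chat sketch for an SFT checkpoint. The model id is a placeholder and the
# availability of a chat template is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

sft_model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(sft_model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    sft_model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0").eval()

messages = [{"role": "user", "content": "Introduce the gardens of Suzhou."}]
prompt_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda:0")
output = model.generate(prompt_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][prompt_ids.shape[1]:], skip_special_tokens=True))
```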
<h2 id="citation">📑 Citation</h2>

```bibtex
@article{llama-moe,
  title={LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training},
  author={Tong Zhu and Xiaoye Qu and Daize Dong and Jiacheng Ruan and Jingqi Tong and Conghui He and Yu Cheng},
  journal={arXiv preprint arXiv:2406.16554},
  year={2024},
  url={https://arxiv.org/abs/2406.16554},
}
```
<hr>
<p align="center">LLaMA-MoE Team w/ ❤️</p>