<div align="center">
  <h1>LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training</h1>
  <img src="docs/imgs/title-favicon.png" width="200" alt="LLaMA-MoE favicon" style="border-radius: 5%;"><br />
  <span style="color:red">📢 <strong><i>A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!!</i></strong></span>
  <div>
    <a href="https://huggingface.co/llama-moe" target="_blank">🤗 Model Weights</a> |
    <a href="#quick-start">🚀 Quick Start</a> |
    <a href="#installation">⚙️ Installation Guide</a> |
    <a href="#expert-construction">🚧 Expert Construction</a> |
    <a href="#continual-pretraining">🚅 Continual Pre-training</a> |
    <a href="#evaluation">💎 Evaluation</a> |
    <a href="#sft">💬 Supervised Fine-Tuning (SFT)</a>
  </div>
  <a href="docs/LLaMA_MoE.pdf" target="_blank"><strong>📃 Technical Report</strong></a>
</div>

<h2 id="llama-moe">🎉 Introduction</h2>

LLaMA-MoE is a series of open-source Mixture-of-Experts (MoE) models based on LLaMA and SlimPajama. We build LLaMA-MoE in two steps:

  1. Partition LLaMA's FFNs into sparse experts and insert a top-K gate for each layer of experts (see the sketch below).
  2. Continually pre-train the initialized MoE model with optimized data sampling weights from Sheared LLaMA and filtered datasets from SlimPajama.

*(Figure: MoE routing)*
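To make step 1 concrete, here is a minimal, hypothetical PyTorch sketch of an MoE feed-forward layer with top-K routing. The class, tensor names, and routing details are illustrative and do not mirror the actual `smoe` implementation; the sketch only shows the idea of replacing one dense FFN with several smaller experts that a gate selects per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoEFFN(nn.Module):
    """Illustrative MoE FFN: a gate picks the top-k experts for each token."""

    def __init__(self, hidden_size: int, expert_inter_size: int, num_experts: int, k: int):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        # Each expert is a small FFN (the real LLaMA FFN uses SwiGLU; simplified here).
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, expert_inter_size),
                nn.SiLU(),
                nn.Linear(expert_inter_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        scores = F.softmax(self.gate(x), dim=-1)              # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)   # (tokens, k)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]
            weight = topk_scores[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # Route the selected tokens through expert e, weighted by the gate score.
                    out[mask] += weight[mask] * expert(x[mask])
        return out

# Shapes loosely matching LLaMA-MoE-3.5B (2/8): 8 experts, 2 activated per token.
layer = TopKMoEFFN(hidden_size=4096, expert_inter_size=11008 // 8, num_experts=8, k=2)
tokens = torch.randn(4, 4096)
print(layer(tokens).shape)  # torch.Size([4, 4096])
```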

<h2 id="features">🔥 Features</h2>

  1. Lightweight Models: the number of activated model parameters is only 3.0~3.5B, which is friendly for deployment and research usage.
  2. Multiple Expert Construction Methods:
     - Neuron-Independent: Random, Clustering, Co-activation Graph, Gradient (Zhang et al., 2022; Zuo et al., 2022)
     - Neuron-Sharing: Inner, Inter (residual)
  3. Multiple MoE Gating Strategies:
     - TopK Noisy Gate (Shazeer et al., 2017)
     - Switch Gating (Fedus et al., 2022)
  4. Fast Continual Pre-training:
     - FlashAttention-v2 integrated (Dao, 2023)
     - Fast streaming dataset loading
  5. Abundant Monitor Items:
     - Gate load, gate importance
     - Loss on steps, loss on tokens, balance loss
     - TGS (tokens/GPU/second), MFU (model FLOPs utilization)
     - Other visualization utilities
  6. Dynamic Weight Sampling (see the sketch below):
     - Self-defined static sampling weights
     - Sheared LLaMA's dynamic batch loading (Xia et al., 2023)
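As a rough illustration of the dynamic weight sampling feature, the sketch below shifts per-domain sampling weights toward domains whose current loss sits above a reference loss, in the spirit of Sheared LLaMA's dynamic batch loading. The function name, the exact update rule, the domain names, and the loss values are all assumptions for illustration, not the actual `smoe` implementation.

```python
import numpy as np

def update_sampling_weights(weights, current_loss, reference_loss, temperature=1.0):
    """Increase sampling weight for domains whose loss exceeds its reference
    (illustrative; inspired by Sheared LLaMA's dynamic batch loading)."""
    excess = np.maximum(np.array(current_loss) - np.array(reference_loss), 0.0)
    logits = np.log(np.array(weights) + 1e-8) + excess / temperature
    new_weights = np.exp(logits - logits.max())
    return new_weights / new_weights.sum()

# Made-up domains and losses, just to show the mechanics.
domains = ["en_cc", "en_c4", "github", "en_book", "en_arxiv", "en_wikipedia", "en_stack"]
weights = np.full(len(domains), 1 / len(domains))        # start from uniform sampling
reference = [1.95, 2.05, 0.95, 2.10, 1.80, 1.70, 1.60]   # reference (target) losses
current = [2.10, 2.07, 0.96, 2.30, 1.82, 1.71, 1.65]     # current evaluation losses
weights = update_sampling_weights(weights, current, reference)
for d, w in zip(domains, weights):
    print(f"{d:14s} {w:.3f}")   # domains lagging their reference get more weight
```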
<h2 id="quick-start">🚀 QuickStart</h2>

```python
# python>=3.10

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.to("cuda:0")

input_text = "Suzhou is famous of"
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda:0")

pred = model.generate(**inputs, max_length=50, temperature=0.0)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# Suzhou is famous of its beautiful gardens. The most famous one is the Humble Administrator's Garden. It is a classical Chinese garden with a history of more than 600 years. The garden is divided into three
```
<h2 id="installation">⚙️ Installation</h2>

  1. Prepare the conda environment: `conda create -n smoe python=3.11` (if your environment name is not `smoe`, you may need to change it in the launching scripts)
  2. Add the correct environment variables to `~/.bashrc` (`gcc` is set to a newer version for installing `flash-attn`), e.g.:

     ```bash
     export PATH=/mnt/petrelfs/share/cuda-11.8/bin:$PATH
     export LD_LIBRARY_PATH=/mnt/petrelfs/share/cuda-11.8/lib64:$LD_LIBRARY_PATH
     export PATH=/mnt/petrelfs/share/gcc-10.1.0/bin:$PATH
     export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc-10.1.0/lib64:$LD_LIBRARY_PATH
     ```

  3. Apply the variables: `source ~/.bashrc`
  4. Install PyTorch (CUDA 11.8): `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`
  5. Install dependencies: `pip install -r requirements.txt`
  6. Install `flash-attn`: `pip install flash-attn==2.0.1 --no-build-isolation`. You may need to follow the flash-attn installation instructions to avoid some errors.
  7. Install the latest Git: `conda install git`
  8. Clone the repo: `git clone git@github.com:pjlab-sys4nlp/llama-moe.git` (if you haven't set up an SSH key for GitHub, you may not be able to clone over SSH; check the GitHub docs about it)
  9. Change the current directory: `cd llama-moe`
  10. Install `smoe` in editable mode: `pip install -e .[dev]`
  11. Set up pre-commit hooks: `pre-commit install`
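After the steps above, an optional sanity check (these commands only assume the packages just installed) confirms that CUDA, `flash-attn`, and `smoe` are importable:

```bash
# Check that PyTorch sees the GPU and that the key packages import cleanly.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"
python -c "import smoe; print('smoe installed')"
```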
<h2 id="performance">📊 Model Performance</h2>

| Model                 | #Activated Experts | #Experts | #Activated Params | Foundation Model | SFT Model |
| :-------------------- | :----------------: | :------: | :---------------: | :--------------: | :-------: |
| LLaMA-MoE-3.0B        | 2                  | 16       | 3.0B              | 🤗 base          | 🤗 SFT    |
| LLaMA-MoE-3.5B (4/16) | 4                  | 16       | 3.5B              | 🤗 base          | 🤗 SFT    |
| LLaMA-MoE-3.5B (2/8)  | 2                  | 8        | 3.5B              | 🤗 base          | 🤗 SFT    |

| Model                 | Average | SciQ | PIQA | WinoGrande | ARC-e | ARC-c (25) | HellaSwag (10) | LogiQA | BoolQ (32) | LAMBADA | NQ (32) | MMLU (5) |
| :-------------------- | :-----: | :--: | :--: | :--------: | :---: | :--------: | :------------: | :----: | :--------: | :-----: | :-----: | :------: |
| OPT-2.7B              | 50.3    | 78.9 | 74.8 | 60.8       | 54.4  | 34.0       | 61.4           | 25.8   | 63.3       | 63.6    | 10.7    | 25.8     |
| Pythia-2.8B           | 51.5    | 83.2 | 73.6 | 59.6       | 58.8  | 36.7       | 60.7           | 28.1   | 65.9       | 64.6    | 8.7     | 26.8     |
| INCITE-BASE-3B        | 53.7    | 85.6 | 73.9 | 63.5       | 61.7  | 40.3       | 64.7           | 27.5   | 65.8       | 65.4    | 15.2    | 27.2     |
| Open-LLaMA-3B-v2      | 55.6    | 88.0 | 77.9 | 63.1       | 63.3  | 40.1       | 71.4           | 28.1   | 69.2       | 67.4    | 16.0    | 26.8     |
| Sheared-LLaMA-2.7B    | 56.4    | 87.5 | 76.9 | 65.0       | 63.3  | 41.6       | 71.0           | 28.3   | 73.6       | 68.3    | 17.6    | 27.3     |
| LLaMA-MoE-3.0B        | 55.5    | 84.2 | 77.5 | 63.6       | 60.2  | 40.9       | 70.8           | 30.6   | 71.9       | 66.6    | 17.0    | 26.8     |
| LLaMA-MoE-3.5B (4/16) | 57.7    | 87.6 | 77.9 | 65.5       | 65.6  | 44.2       | 73.3           | 29.7   | 75.0       | 69.5    | 20.3    | 26.8     |
| LLaMA-MoE-3.5B (2/8)  | 57.6    | 88.4 | 77.6 | 66.7       | 65.3  | 43.1       | 73.3           | 29.6   | 73.9       | 69.4    | 19.8    | 27.0     |

| Model                                  | MMLU  | ARC-c | HellaSwag | TruthfulQA | MT-Bench |
| :------------------------------------- | :---: | :---: | :-------: | :--------: | :------: |
| Sheared LLaMA-2.7B ShareGPT            | 28.41 | 41.04 | 71.21     | 47.65      | 3.79     |
| Sheared LLaMA-2.7B Deita6K (Our Impl.) | 25.24 | 43.69 | 71.70     | 49.00      | 4.06     |
| LLaMA-MoE-v1-3.0B (2/16)               | 23.61 | 43.43 | 72.28     | 44.24      | 4.15     |
| LLaMA-MoE-v1-3.5B (4/16)               | 26.49 | 48.29 | 75.10     | 45.91      | 4.60     |
| LLaMA-MoE-v1-3.5B (2/8)                | 25.53 | 45.99 | 74.95     | 44.39      | 4.72     |
<h2 id="expert-construction">🚧 Expert Construction</h2>

For more information, please refer to the Expert Construction docs. A minimal illustration of the idea is sketched below.
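The sketch below splits a dense LLaMA-style FFN into `num_experts` smaller experts by randomly partitioning its intermediate neurons, i.e. the Neuron-Independent "Random" method from the feature list. The function name, argument layout, and toy sizes are assumptions for illustration, not the actual `smoe` API.

```python
import torch

def random_split_ffn(gate_proj, up_proj, down_proj, num_experts, seed=0):
    """Partition a dense SwiGLU FFN (gate/up: [inter, hidden], down: [hidden, inter])
    into `num_experts` experts by randomly assigning intermediate neurons."""
    inter_size = gate_proj.shape[0]
    assert inter_size % num_experts == 0
    perm = torch.randperm(inter_size, generator=torch.Generator().manual_seed(seed))
    groups = perm.chunk(num_experts)
    experts = []
    for idx in groups:
        experts.append({
            "gate_proj": gate_proj[idx, :].clone(),   # rows of the gate projection
            "up_proj": up_proj[idx, :].clone(),       # rows of the up projection
            "down_proj": down_proj[:, idx].clone(),   # matching columns of the down projection
        })
    return experts

# Toy sizes; LLaMA-2-7B uses hidden=4096 and intermediate=11008.
hidden, inter, n_exp = 8, 16, 4
experts = random_split_ffn(torch.randn(inter, hidden), torch.randn(inter, hidden),
                           torch.randn(hidden, inter), n_exp)
print(len(experts), experts[0]["gate_proj"].shape)  # 4 torch.Size([4, 8])
```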

<h2 id="continual-pretraining">🚅 Continual Pre-training</h2>

<h3>Tokenization</h3>

Download SlimPajama into `/path_to_data` and put data from different domains into separate folders:
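For example, a layout like the following works (the folder names below are illustrative, one per SlimPajama domain; the only requirement is one folder per domain):

```
/path_to_data
├── en_arxiv
│   ├── 0.jsonl
│   └── 1.jsonl
├── en_book
├── en_c4
├── en_cc
├── en_stack
├── en_wikipedia
└── github
```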

Each file should end with `*.jsonl`, and each line looks like:

{"id": "id-info", "content": "raw text to be tokenized"}

Run the following command to tokenize the data in each folder:

```bash
python -m smoe.utils.tokenize \
  -f jsonl \
  -t /path_to_tokenizer \
  -i /path_to_data/en_arxiv \
  -o /path_to_data_tokenized/en_arxiv
```

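To tokenize every domain in one go, a simple loop over the folders works; the domain names here are assumptions matching the illustrative layout above, so adjust them to whatever folders you actually created:

```bash
for domain in en_arxiv en_book en_c4 en_cc en_stack en_wikipedia github; do
  python -m smoe.utils.tokenize \
    -f jsonl \
    -t /path_to_tokenizer \
    -i /path_to_data/${domain} \
    -o /path_to_data_tokenized/${domain}
done
```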
<h3>Continual Pre-training (CPT)</h3>

<h2 id="evaluation">💎 Evaluation</h2>

<h2 id="sft">💬 Supervised Fine-Tuning (SFT)</h2>

We provide simple examples of SFT to build chatbots. Please refer to the SFT docs and `/mnt/petrelfs/zhutong/smoe/scripts/sft` for more details.
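A minimal way to try one of the SFT chatbots is sketched below, assuming the SFT weights live under the same Hugging Face organization as the base models. The repository id and the plain instruction-style prompt are assumptions; check the SFT model card for the exact id and prompt template.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed repository id for an SFT checkpoint; confirm it at https://huggingface.co/llama-moe
model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval().to("cuda:0")

# Plain instruction-style prompt; the real template may differ (see the model card).
prompt = "Give me a three-day travel plan for Suzhou."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```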

<h2 id="citation">📑 Citation</h2>

```bibtex
@article{llama-moe,
  title={LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training},
  author={Tong Zhu and Xiaoye Qu and Daize Dong and Jiacheng Ruan and Jingqi Tong and Conghui He and Yu Cheng},
  journal={arXiv preprint arXiv:2406.16554},
  year={2024},
  url={https://arxiv.org/abs/2406.16554},
}
```
<hr>
<p align="center">LLaMA-MoE Team w/ ❤️</p>