
<div align=center> <img src="figures/fig1.png" width="280px"> </div>

<h2 align="center"> <a href="https://arxiv.org/abs/2410.07348">MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts</a></h2>

<h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest updates.</h5>

<details open><summary>💡 I also have other projects that may interest you ✨. </summary><p>

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding <br> Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, Li Yuan <br>

MoH: Multi-Head Attention as Mixture-of-Head Attention <br> Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan <br>


</p></details>

📣 News

⚡ Overview

We introduce three types of zero-computation experts: the zero expert, copy expert, and constant expert, which correspond to discard, skip, and replace operations, respectively. Moreover, we leverage gating residuals, enabling each token to consider the pathway taken in the previous layer when selecting the appropriate experts.

<div align=center> <img src="figures/fig2.png" width="800px"> </div> <div align=center> <img src="figures/fig3.png" width="800px"> </div>

Download URL

<div align=center>

| Model | HuggingFace |
|:---:|:---:|
| MoE++7B-Base | 🤗 MoE++7B-Base |
| MoE++7B-Chat | 😊 Coming Soon |

</div>

😮 Highlights

💡 Low Computing Overhead

The computational complexity of an MoE++ model is always lower than that of an MoE model with the same number of parameters.

<div align=center> <img src="figures/fig4.png" width="800px"> </div>

🔥 High Performance & High Throughput

Extensive experimental results demonstrate that MoE++ achieves better performance while delivering 1.1x to 2.1x the expert forward throughput of a vanilla MoE model of the same size, laying a solid foundation for developing advanced and efficient MoE-related models.

<div align=center> <img src="figures/fig7.png" width="800px"> </div>

🤗 Deployment Friendly

Since zero-computation experts have negligible parameters, we can deploy all of them on every GPU. This eliminates the significant communication overhead and expert load imbalance that arise when FFN experts are distributed across different GPUs.
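A minimal sketch of such a placement plan, assuming hypothetical expert counts and GPU ranks (none of the names below come from the repository):

# Shard FFN experts across GPUs, but replicate the (nearly parameter-free)
# zero-computation experts on every GPU.
def build_placement(num_ffn_experts, num_zero_compute_experts, num_ranks):
    placement = {rank: [] for rank in range(num_ranks)}
    # FFN experts are split round-robin across ranks (expert parallelism).
    for e in range(num_ffn_experts):
        placement[e % num_ranks].append(f"ffn_expert_{e}")
    # Zero-computation experts live on every rank, so tokens routed to them
    # never have to be sent to another device.
    for rank in range(num_ranks):
        placement[rank] += [f"zero_compute_expert_{z}" for z in range(num_zero_compute_experts)]
    return placement

for rank, experts in build_placement(8, 4, 4).items():
    print(rank, experts)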

🚀 Main Results

Comparisons between MoE++ and Vanilla MoE Models

<div align=center> <img src="figures/fig9.png" width="800px"> </div>

Comparisons to LLMs of Equivalent Activated Parameters

<div align=center> <img src="figures/fig10.png" width="800px"> </div>

😍 Why is MoE++ better than MoE?

Flexible Computation Allocation

MoE++ allows simple tokens to utilize fewer FFN experts, freeing up more FFN experts to focus on challenging tokens. This results in both Reduced Computation and Enhanced Performance.

<div align=center> <img src="figures/fig5.png" width="800px"> </div>

The token-level statistics above confirm this behavior: MoE++ assigns fewer FFN experts to simple tokens, leaving more FFN capacity for the challenging ones.
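As a toy illustration with made-up routing outputs, the number of FFN experts a token actually uses can be read directly off its top-k expert indices; indices that fall on zero-computation experts add no FFN computation:

import torch

# Toy setup: 8 experts in total, of which the first 4 are FFN experts and the
# last 4 are zero-computation experts. Counts and indices are illustrative.
num_ffn_experts = 4

topk_idx = torch.tensor([[0, 1],   # hard token: 2 FFN experts
                         [2, 5],   # 1 FFN expert + 1 zero-computation expert
                         [6, 7]])  # simple token: no FFN expert at all

ffn_experts_per_token = (topk_idx < num_ffn_experts).sum(dim=-1)
print(ffn_experts_per_token)       # tensor([2, 1, 0]) -> computation varies per token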

Stable Routing

Gating residuals effectively establish connections between different MoE++ layers and reduce the variance of the routing scores, while leaving their mean and range unchanged. Consequently, gating residuals contribute to the stable routing of the heterogeneous expert architectures in MoE++.

<div align=center> <img src="figures/fig6.png" width="800px"> </div>

🤖 API for Model Inference

If you want to load the model from the Hugging Face Hub or from a local path, you can use the following code snippets.

Base Model Inference

from transformers import AutoModelForCausalLM, AutoTokenizer

question = "Hello!"

model = AutoModelForCausalLM.from_pretrained("Chat-UniVi/MoE-Plus-Plus-7B", trust_remote_code=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained("Chat-UniVi/MoE-Plus-Plus-7B", trust_remote_code=True)

inputs = tokenizer(question, return_tensors='pt').to(model.device)
response = model.generate(inputs.input_ids, max_length=128)
print(tokenizer.decode(response.cpu()[0], skip_special_tokens=True))

Chat Model Inference

Coming soon...

🗝️ Training & Validating

# For example, test MoE++ on winogrande

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--main_process_port 2004 -m lm_eval --model hf \
--model_args pretrained=Chat-UniVi/MoE-Plus-Plus-7B \
--tasks winogrande \
--batch_size 1 \
--output_path Results/winogrande

👍 Acknowledgement

🤝 Related Projects

🔒 License

✏️ Citation

If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:

@article{jin2024moe,
  title={MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts},
  author={Jin, Peng and Zhu, Bo and Yuan, Li and Yan, Shuicheng},
  journal={arXiv preprint arXiv:2410.07348},
  year={2024}
}

✨ Contributors

<a href="https://github.com/SkyworkAI/MoE-plus-plus/graphs/contributors"> <img src="https://contrib.rocks/image?repo=SkyworkAI/MoE-plus-plus" /> </a>