<div align=center> <img src="figures/fig1.png" width="280px"> </div> <h2 align="center"> <a href="https://arxiv.org/abs/2410.11842">MoH: Multi-Head Attention as Mixture-of-Head Attention

</a></h2>

<h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest update.</h5> <h5 align=center> <!-- [![Demo](https://img.shields.io/badge/⚡-Hugging%20Face%20Demo-yellow.svg)](https://huggingface.co/spaces/Chat-UniVi/Chat-UniVi) -->

hf arXiv License Hits GitHub issues GitHub closed issues

</h5> <details open><summary>💡 I also have other projects that may interest you ✨. </summary><p> <!-- may -->

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding <br> Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, Li Yuan <br>

MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts <br> Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan <br>

</p></details>

📣 News

⚡ Overview

We propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, improving inference efficiency without sacrificing accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility into the attention mechanism and unlocking additional performance potential.

<div align=center> <img src="figures/fig2.png" width="800px"> </div>

😮 Highlights

💡 General Framework

We evaluate our proposed MoH across various popular model frameworks, including Vision Transformers (ViT) for image classification, Diffusion models with Transformers (DiT) for class-conditional image generation, and Large Language Models (LLMs) for language tasks.

| Code | HuggingFace Model |
|:----:|:------------------|
| MoH-ViT | 🤗 MoH-ViT-B-75, MoH-ViT-B-50, MoH-ViT-S-80, MoH-ViT-S-75 |
| MoH-DiT | 😊 MoH-DiT-90 |
| MoH-LLaMA3-8B | 😊 MoH-LLaMA3-8B |

🔥 High Performance

Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention while using only 50%-90% of the attention heads.

🤗 Supports Continue-Tuning from Pre-Trained Multi-Head Attention Models

We demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while utilizing only 75% of the attention heads.

<div align=center> <img src="figures/fig3.png" width="800px"> </div>

The MoH model quickly recovers more than 95% of the original model's performance within a training budget of 10B tokens, and performance then continues to improve gradually as the number of training tokens grows.
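To make the continue-tuning setup concrete, below is a hedged sketch (not the official recipe) of how a pre-trained attention layer could seed the illustrative `MoHAttention` module from the Overview: the projections are copied, and only the newly added router starts from scratch. Note that LLaMA3 keeps separate q/k/v projections, so a real conversion would copy those weights individually rather than assume a fused `qkv`.

```python
# Hypothetical conversion helper (not the official recipe): start MoH from a
# pre-trained multi-head attention layer, then continue-tune the whole model.
import torch.nn as nn

def init_moh_from_mha(moh: "MoHAttention", mha_qkv: nn.Linear, mha_proj: nn.Linear) -> "MoHAttention":
    # Reuse the pre-trained projections so the converted layer starts out
    # numerically close to the original multi-head attention layer.
    moh.qkv.load_state_dict(mha_qkv.state_dict())
    moh.proj.load_state_dict(mha_proj.state_dict())
    # The router is the only new component: a near-zero init keeps early routing
    # scores almost uniform, which helps performance recover quickly during tuning.
    nn.init.normal_(moh.router.weight, std=1e-3)
    nn.init.zeros_(moh.router.bias)
    return moh
```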

🚀 Main Results

ViT for ImageNet-1K Classification

<div align=center> <img src="figures/fig4.png" width="800px"> </div>

DiT for Class-Conditional Image Generation (ImageNet-1K)

<div align=center> <img src="figures/fig9.png" width="800px"> </div> <div align=center> <img src="figures/fig5.png" width="800px"> </div>

Training LLMs from Scratch

<div align=center> <img src="figures/fig6.png" width="800px"> </div>

Continue-Tuning LLaMA3-8B

<div align=center> <img src="figures/fig7.png" width="800px"> </div>

😍 Why is MoH better than Multi-Head Attention?

Flexible Head Assignment Patterns

We observe significant variation in attention head assignments across different categories and task topics, indicating that the MoH model adapts to diverse tasks by employing distinct head assignment patterns. This allows different attention heads to specialize in different types of tasks, making parameter utilization more efficient than in standard multi-head attention.

<div align=center> <img src="figures/fig8.png" width="800px"> </div>

Weighted Summation of Heads

By replacing the standard summation in multi-head attention with a weighted summation, MoH enhances the flexibility of the attention mechanism and increases the performance potential.
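In notation (a sketch; symbols are illustrative): standard multi-head attention concatenates the head outputs, which is equivalent to summing per-head output projections with equal weight, whereas MoH applies router-produced weights $g_i$ that are nonzero only for the activated heads.

$$
\mathrm{MHA}(X)=\sum_{i=1}^{h} H^{i} W_{o}^{i}
\quad\longrightarrow\quad
\mathrm{MoH}(X)=\sum_{i=1}^{h} g_{i}\, H^{i} W_{o}^{i}
$$

Here $H^{i}$ is the output of the $i$-th head, $W_{o}^{i}$ is the corresponding slice of the output projection, and $g_{i}=0$ for heads the router does not select for the current token.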

🗝️ Training & Validating

👍 Acknowledgement

🤝 Related Projects

🔒 License

✏️ Citation

If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:

@article{jin2024moh,
  title={MoH: Multi-Head Attention as Mixture-of-Head Attention},
  author={Jin, Peng and Zhu, Bo and Yuan, Li and Yan, Shuicheng},
  journal={arXiv preprint arXiv:2410.11842},
  year={2024}
}

✨ Contributors

<a href="https://github.com/SkyworkAI/MoH/graphs/contributors"> <img src="https://contrib.rocks/image?repo=SkyworkAI/MoH" /> </a>