<p align="center"> <img src="https://s11.ax1x.com/2023/12/28/piqvDMV.png" width="250" style="margin-bottom: 0.2;"/> </p> <h2 align="center"> <a href="https://arxiv.org/abs/2401.15947">MoE-LLaVA: Mixture of Experts for Large Vision-Language Models</a></h2> <h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest updates. </h5>


<details open><summary>💡 I also have other vision-language projects that may interest you ✨.</summary><p>

Open-Sora-Plan <br>

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection <br> Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan <br>

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment <br> Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, Li Yuan <br>

</p></details>

📣 News

😮 Highlights

MoE-LLaVA shows excellent performance in multi-modal learning.

🔥 High performance, but with fewer parameters

<p align="center"> <img src="assets/intro0.jpg" width=55%> </p>

🚀 Simple baseline, learning multi-modal interactions with sparse pathways.

<p align="center"> <img src="assets/intro.jpg" width=65%> </p>
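
To make the sparse-pathway idea concrete, here is a minimal, illustrative sketch of a top-2 gated mixture-of-experts FFN layer in PyTorch. It only demonstrates the routing mechanism in general; it is not the implementation used in this repository, and all names (`SimpleTop2MoE`, `num_experts`, `hidden_dim`) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleTop2MoE(nn.Module):
    """Illustrative top-2 gated mixture-of-experts FFN; not the repo's actual implementation."""

    def __init__(self, dim, hidden_dim, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_experts, bias=False)  # router: scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                # x: (batch, seq, dim)
        logits = self.gate(x)                            # (batch, seq, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)             # renormalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    moe = SimpleTop2MoE(dim=256, hidden_dim=1024)
    print(moe(torch.randn(2, 8, 256)).shape)  # torch.Size([2, 8, 256])
```

Each token is processed only by its top-2 experts, so the number of activated parameters stays far below the total parameter count.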

🤗 Demo

Gradio Web UI <a href='https://github.com/gradio-app/gradio'><img src='https://img.shields.io/github/stars/gradio-app/gradio'></a>

We highly recommend trying our web demo with the following command, which incorporates all features currently supported by MoE-LLaVA. We also provide an online demo on Hugging Face Spaces.

# use phi2
deepspeed --include localhost:0 moellava/serve/gradio_web_server.py --model-path "LanguageBind/MoE-LLaVA-Phi2-2.7B-4e" 
# use qwen
deepspeed --include localhost:0 moellava/serve/gradio_web_server.py --model-path "LanguageBind/MoE-LLaVA-Qwen-1.8B-4e" 
# use stablelm
deepspeed --include localhost:0 moellava/serve/gradio_web_server.py --model-path "LanguageBind/MoE-LLaVA-StableLM-1.6B-4e" 

https://github.com/PKU-YuanGroup/MoE-LLaVA/assets/62638829/8541aac6-9ef6-4fde-aa94-80d0375b9bdb

CLI Inference

# use phi2
deepspeed --include localhost:0 moellava/serve/cli.py --model-path "LanguageBind/MoE-LLaVA-Phi2-2.7B-4e"  --image-file "image.jpg"
# use qwen
deepspeed --include localhost:0 moellava/serve/cli.py --model-path "LanguageBind/MoE-LLaVA-Qwen-1.8B-4e"  --image-file "image.jpg"
# use stablelm
deepspeed --include localhost:0 moellava/serve/cli.py --model-path "LanguageBind/MoE-LLaVA-StableLM-1.6B-4e"  --image-file "image.jpg"
<img src="assets/imagecli.gif" />

🐳 Model Zoo

| Model | Activated Param | Transformers (HF) | ModelScope | Avg | VQAv2 | GQA | VizWiz | SQA-IMG | T-VQA | POPE | MME | MM-Bench | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MoE-LLaVA-1.6B×4-Top2 | 2.0B | 🤗 LanguageBind/MoE-LLaVA-StableLM-1.6B-4e | <img src="https://github.com/PKU-YuanGroup/MoE-LLaVA/raw/main/assets/modelscope_logo.png" width="20px" style="max-width: 100%;"> PKU-YuanLab/MoE-LLaVA-StableLM-1.6B-4e | 57.3 | 76.7 | 60.3 | 36.2 | 62.6 | 50.1 | 85.7 | 1318.1 | 60.2 | 26.9 |
| MoE-LLaVA-1.8B×4-Top2 | 2.2B | 🤗 LanguageBind/MoE-LLaVA-Qwen-1.8B-4e | <img src="https://github.com/PKU-YuanGroup/MoE-LLaVA/raw/main/assets/modelscope_logo.png" width="20px" style="max-width: 100%;"> PKU-YuanLab/MoE-LLaVA-Qwen-1.8B-4e | 56.7 | 76.2 | 61.5 | 32.6 | 63.1 | 48.0 | 87.0 | 1291.6 | 59.6 | 25.3 |
| MoE-LLaVA-2.7B×4-Top2 | 3.6B | 🤗 LanguageBind/MoE-LLaVA-Phi2-2.7B-4e | <img src="https://github.com/PKU-YuanGroup/MoE-LLaVA/raw/main/assets/modelscope_logo.png" width="20px" style="max-width: 100%;"> PKU-YuanLab/MoE-LLaVA-Phi2-2.7B-4e | 61.1 | 77.6 | 61.4 | 43.9 | 68.5 | 51.4 | 86.3 | 1423.0 | 65.2 | 34.3 |
| MoE-LLaVA-1.6B×4-Top2-384 | 2.0B | 🤗 LanguageBind/MoE-LLaVA-StableLM-1.6B-4e-384 | <img src="https://github.com/PKU-YuanGroup/MoE-LLaVA/raw/main/assets/modelscope_logo.png" width="20px" style="max-width: 100%;"> PKU-YuanLab/MoE-LLaVA-StableLM-1.6B-4e-384 | 60.0 | 78.6 | 61.5 | 40.5 | 63.9 | 54.3 | 85.9 | 1335.7 | 63.3 | 32.3 |
| MoE-LLaVA-2.7B×4-Top2-384 | 3.6B | 🤗 LanguageBind/MoE-LLaVA-Phi2-2.7B-4e-384 | <img src="https://github.com/PKU-YuanGroup/MoE-LLaVA/raw/main/assets/modelscope_logo.png" width="20px" style="max-width: 100%;"> PKU-YuanLab/MoE-LLaVA-Phi2-2.7B-4e-384 | 62.9 | 79.9 | 62.6 | 43.7 | 70.3 | 57.0 | 85.7 | 1431.3 | 68.0 | 35.9 |
| LLaVA-1.5 | 7B | 🤗 liuhaotian/llava-v1.5-7b | - | 62.0 | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 64.3 | 30.5 |
<!-- | LLaVA-1.5 | 13B | [liuhaotian/llava-v1.5-13b](https://huggingface.co/liuhaotian/llava-v1.5-13b) | 64.9 | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 35.4 | -->

🚨 Please see https://github.com/PKU-YuanGroup/MoE-LLaVA/issues/27 for a known issue.

<details>
<summary>Stage2 Model</summary>

| Model | Checkpoint |
|---|---|
| MoE-LLaVA-1.6B×4-Top2 | LanguageBind/MoE-LLaVA-StableLM-Stage2 |
| MoE-LLaVA-1.6B×4-Top2-384 | LanguageBind/MoE-LLaVA-StableLM-Stage2-384 |
| MoE-LLaVA-1.8B×4-Top2 | LanguageBind/MoE-LLaVA-Qwen-Stage2 |
| MoE-LLaVA-2.7B×4-Top2 | LanguageBind/MoE-LLaVA-Phi2-Stage2 |
| MoE-LLaVA-2.7B×4-Top2-384 | LanguageBind/MoE-LLaVA-Phi2-Stage2-384 |
</details>

<details>
<summary>Pretrain Model</summary>

| Model | Checkpoint |
|---|---|
| MoE-LLaVA-1.6B×4-Top2 | LanguageBind/MoE-LLaVA-StableLM-Pretrain |
| MoE-LLaVA-1.6B×4-Top2-384 | LanguageBind/MoE-LLaVA-StableLM-384-Pretrain |
| MoE-LLaVA-1.8B×4-Top2 | LanguageBind/MoE-LLaVA-Qwen-Pretrain |
| MoE-LLaVA-2.7B×4-Top2 | LanguageBind/MoE-LLaVA-Phi2-Pretrain |
| MoE-LLaVA-2.7B×4-Top2-384 | LanguageBind/MoE-LLaVA-Phi2-384-Pretrain |

</details>

⚙️ Requirements and Installation

We recommend the following environment setup.

git clone https://github.com/PKU-YuanGroup/MoE-LLaVA
cd MoE-LLaVA
conda create -n moellava python=3.10 -y
conda activate moellava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

# Optional: the steps below are only needed for the Qwen model.
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# Optional: installing the following might be slow.
# pip install csrc/layer_norm
# If the flash-attn version is higher than 2.1.1, the following is not needed.
# pip install csrc/rotary
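
Optionally, as a quick sanity check (our own suggestion, not part of the official instructions), you can confirm that the package imports cleanly and that PyTorch sees your GPU:

```python
# Optional post-install sanity check (our suggestion, not from the official instructions).
import torch
import moellava  # should import without errors after `pip install -e .`

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # expect True on a GPU machine
```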

> [!WARNING]
> 🚨 We find that using Flash Attention 2 degrades performance.

🗝️ Training & Validating

The training and validation instructions are in TRAIN.md and EVAL.md.

💡 Customizing your MoE-LLaVA

The instructions are in CUSTOM.md.

😍 Visualization

The instructions are in VISUALIZATION.md.

🤖 API

We open-source all code. If you want to load the model (e.g., LanguageBind/MoE-LLaVA-Phi2-2.7B-4e) locally, you can use the following code snippet.

Use the following command to run the code; the Python snippet below it is the content of predict.py.

deepspeed --include localhost:0 predict.py
import torch
from PIL import Image
from moellava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from moellava.conversation import conv_templates, SeparatorStyle
from moellava.model.builder import load_pretrained_model
from moellava.utils import disable_torch_init
from moellava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria

def main():
    disable_torch_init()
    image = 'moellava/serve/examples/extreme_ironing.jpg'
    inp = 'What is unusual about this image?'
    model_path = 'LanguageBind/MoE-LLaVA-Phi2-2.7B-4e'  # LanguageBind/MoE-LLaVA-Qwen-1.8B-4e or LanguageBind/MoE-LLaVA-StableLM-1.6B-4e
    device = 'cuda'
    load_4bit, load_8bit = False, False  # FIXME: Deepspeed support 4bit or 8bit?
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device)
    image_processor = processor['image']
    conv_mode = "phi"  # qwen or stablelm
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles
    image_tensor = image_processor.preprocess(Image.open(image).convert('RGB'), return_tensors='pt')['pixel_values'].to(model.device, dtype=torch.float16)

    print(f"{roles[1]}: {inp}")
    inp = DEFAULT_IMAGE_TOKEN + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=image_tensor,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])

    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True).strip()
    print(outputs)

if __name__ == '__main__':
    main()

🙌 Related Projects

👍 Acknowledgement

🔒 License

✏️ Citation

If you find our paper and code useful in your research, please consider giving us a star :star: and a citation :pencil:.

@article{lin2024moe,
  title={MoE-LLaVA: Mixture of Experts for Large Vision-Language Models},
  author={Lin, Bin and Tang, Zhenyu and Ye, Yang and Cui, Jiaxi and Zhu, Bin and Jin, Peng and Zhang, Junwu and Ning, Munan and Yuan, Li},
  journal={arXiv preprint arXiv:2401.15947},
  year={2024}
}
@article{lin2023video,
  title={Video-LLaVA: Learning United Visual Representation by Alignment Before Projection},
  author={Lin, Bin and Zhu, Bin and Ye, Yang and Ning, Munan and Jin, Peng and Yuan, Li},
  journal={arXiv preprint arXiv:2311.10122},
  year={2023}
}

✨ Star History


🤝 Contributors

<a href="https://github.com/PKU-YuanGroup/MoE-LLaVA/graphs/contributors"> <img src="https://contrib.rocks/image?repo=PKU-YuanGroup/MoE-LLaVA" /> </a>