
<div style="display: flex; align-items: center;"> <a href="https://arxiv.org/abs/2403.06199"> <h1>LLaVA-Phi & Mipha: Towards Multimodal Small Language Models</h1> </a> </div> <div align="center"> <img src="docs/mipha.jpg" width="20%"> </div>


Model Zoo

Mipha & LLaVA-Phi

| Model | LLM | VQAv2 | GQA | SQA<sup>I</sup> | VQA<sup>T</sup> | POPE | MME<sup>P</sup> | MMB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-Phi-3B | Phi-2-2.7B | 71.4 | - | 68.4 | 48.6 | 85.0 | 1335.1 | 59.8 |
| Mipha-1.6B | Phi-1.5-1.3B | 77.5 | 62.7 | 58.3 | 45.6 | 86.9 | 1203.1 | 57.7 |
| Mipha-2.4B | Gemma-2B | 79.5 | 63.3 | 65.3 | 52.4 | 86.6 | 1397.1 | 59.4 |
| Mipha-3B | Phi-2-2.7B | 81.3 | 63.9 | 70.9 | 56.6 | 86.7 | 1488.9 | 69.7 |

Contents

- Install
- Mipha Weights
- Train
- Evaluation
- CLI Inference Guide

Install

  1. Clone this repository and navigate to the Mipha folder:

```bash
git clone https://github.com/zhuyiche/Mipha.git
cd Mipha
```

  2. Install the package:

```bash
conda create -n mipha python=3.10 -y
conda activate mipha
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
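Once installed, you can sanity-check the environment (a minimal sketch; it assumes the editable install exposes the package under the `mipha` name and that you intend to run on GPU):

```python
# Sanity-check the fresh environment.
# Assumption: `pip install -e .` exposes the package as `mipha`.
import torch
import mipha  # import should succeed after installation

print(f"PyTorch {torch.__version__}; CUDA available: {torch.cuda.is_available()}")
```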

Mipha Weights

Download the Mipha-3B weights from Hugging Face.
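If you prefer to script the download, `huggingface_hub` can fetch the whole snapshot. The repo id below is a placeholder (the actual id is the one linked above), so substitute it before running:

```python
# Fetch the Mipha-3B checkpoint from the Hugging Face Hub.
# NOTE: "<org>/Mipha-3B" is a placeholder repo id -- use the actual one.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="<org>/Mipha-3B")
print(f"Weights downloaded to {local_dir}")
```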

Train

Mipha training consists of two stages: (1) feature alignment stage: use the LLaVA-1.5 558K subset of the LAION-CC-SBU dataset to connect a frozen pretrained vision encoder to a frozen LLM; (2) visual instruction tuning stage: use 150K GPT-generated multimodal instruction-following data, plus around 515K VQA data from academic-oriented tasks, to teach the model to follow multimodal instructions.

Hyperparameters

The hyperparameters used in pretraining and finetuning are provided below.

  1. Pretraining

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | --- | --- | --- | --- | --- |
| Mipha | 256 | 1e-3 | 1 | 2048 | 0 |

  2. Finetuning

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | --- | --- | --- | --- | --- |
| Mipha | 128 | 2e-5 | 2 | 2048 | 0 |
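The global batch size in both tables is the product of per-device batch size, gradient-accumulation steps, and GPU count, so adjust those knobs together when your hardware differs. A small sanity-check helper (illustrative only; the names are not from the training scripts):

```python
def global_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Effective batch size the optimizer sees per update step."""
    return per_device_batch * grad_accum_steps * num_gpus

# Pretraining target of 256, e.g. 8 GPUs x per-device batch 16 x 2 accumulation steps:
assert global_batch_size(16, 2, 8) == 256
# Finetuning target of 128, e.g. 8 GPUs x per-device batch 16 x 1 accumulation step:
assert global_batch_size(16, 1, 8) == 128
```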

Download base checkpoints

Our base model is Phi-2. You should download the weights from here and change `--model_name_or_path` in `get_base_model.sh` accordingly. <br> Our vision encoder is SigLIP-SO (0.4B). You should download the weights from here.

Integrate the model

Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions from here.

Then, you should integrate Phi-2 and SigLIP-SO into a single model by running the following script:

```bash
bash ./scripts/mipha/get_base_model.sh
```

Pretrain (feature alignment)

```bash
bash ./scripts/mipha/pretrain.sh
```

Visual Instruction Tuning

Please refer here to prepare the instruction tuning data.

Training script with DeepSpeed ZeRO-3: `finetune.sh`.

```bash
bash ./scripts/mipha/finetune.sh
```

Evaluation

To ensure reproducibility, we evaluate the models with greedy decoding.
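In Hugging Face `transformers` terms, greedy decoding just means sampling is turned off and beam search is disabled, which makes generation deterministic for a fixed prompt. A minimal sketch of the corresponding generation settings:

```python
# Greedy decoding: no sampling, a single beam; deterministic outputs
# for a fixed prompt, which is what reproducible evaluation needs.
from transformers import GenerationConfig

greedy = GenerationConfig(do_sample=False, num_beams=1, max_new_tokens=128)
# Then pass it to model.generate(..., generation_config=greedy).
print(greedy)
```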

See Evaluation.md.

CLI Inference Guide

You can chat about images using Mipha without the Gradio interface. Here is an example command:

```bash
python -m mipha.serve.cli \
    --model-path /path/to/mipha-3B \
    --image-file "mipha/serve/examples/extreme_ironing.jpg" \
    --conv-mode phi
```

Citation

If you find LLaVA-Phi or Mipha useful in your research or applications, please consider giving a star ⭐ and citing using the following BibTeX:

```bibtex
@misc{zhu2024llavaphi,
      title={LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model},
      author={Yichen Zhu and Minjie Zhu and Ning Liu and Zhicai Ou and Xiaofeng Mou and Jian Tang},
      year={2024},
      eprint={2401.02330},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@article{zhu2024comprehensive,
  title={A Comprehensive Overhaul of Multimodal Assistant with Small Language Models},
  author={Zhu, Minjie and Zhu, Yichen and Liu, Xin and Liu, Ning and Xu, Zhiyuan and Shen, Chaomin and Peng, Yaxin and Ou, Zhicai and Feng, Feifei and Tang, Jian},
  journal={arXiv preprint arXiv:2403.06199},
  year={2024}
}
```

Acknowledgement

We build our project based on LLaVA: Large Language and Vision Assistant.