Open-LLaVA-NeXT

An open-source implementation of LLaVA-NeXT series for facilitating the large multi-modal model community.

Resources: [🤗HuggingFace]

💡 Highlights

🤖 Model Zoo

See more details in ModelZoo.md.

| Name | ViT | LLM | Weights | MME | SEED | SQA | MMB | MMB-CN | TextVQA | GQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llava-next-vicuna-7b | CLIP-L-336 | Vicuna-7B | SFT | 1519 | 70.2 | 70.1 | 67.4 | 60.6 | 64.9 | 64.2 |
| open-llava-next-vicuna-7b | CLIP-L-336 | Vicuna-7B | PT, SFT | 1540 | 71.1 | 70.7 | 68.5 | 60.7 | 67.2 | 64.3 |
| llava-next-llama3-8b | CLIP-L-336 | LLaMA3-8B | SFT | 1591 | 72.7 | 73.4 | 72.6 | 69.0 | 65.0 | 65.5 |
| open-llava-next-llama3-8b | CLIP-L-336 | LLaMA3-8B | PT, SFT | 1552 | 74.4 | 77.3 | 74.4 | 70.4 | 69.8 | 65.9 |

👨‍💻 ToDo

🔧 Install

  1. Clone this repository and navigate to the Open-LLaVA-NeXT folder
git clone https://github.com/xiaoachen98/Open-LLaVA-NeXT.git
cd Open-LLaVA-NeXT
  2. Install the package
conda create -n llava-next python=3.10 -y
conda activate llava-next
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
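
As a quick sanity check after installation, the imports below should succeed. This is a sketch that assumes the package installs under the llava namespace, as in the upstream LLaVA codebase this repository builds on.

```bash
# Sanity check (sketch): confirm the editable install and flash-attn are importable.
# The `llava` package name is assumed from the upstream LLaVA codebase.
python -c "import llava; print('llava ok')"
python -c "import flash_attn; print('flash-attn ok')"
```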

Data Preparation

Please follow the instructions in Data.md to prepare and manage the training datasets.

Training Overview

Open-LLaVA-NeXT training consists of two stages: (1) feature alignment stage: use a 558K subset of the LAION-CC-SBU dataset to connect a frozen pretrained vision encoder to a frozen LLM; (2) visual instruction tuning stage: finetune the entire model on 1M samples of fully open-source data. Detailed data statistics are provided in Visual Instruction Tuning. We take the Vicuna-v1.5-7B variant as an example to present the training and evaluation details.

The Open-LLaVA-NeXT series is trained on A100 GPUs with 80GB of memory. To train on fewer GPUs, reduce per_device_train_batch_size and increase gradient_accumulation_steps accordingly; using DeepSpeed ZeRO-3 can further reduce memory requirements. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus. A worked scaling example is sketched below.
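
As a minimal sketch of that rule, using the finetuning global batch size of 128 from the tables below and the standard HuggingFace Trainer flags exposed by LLaVA-style training scripts (the per-device value of 8 is illustrative, not necessarily what the released scripts use):

```bash
# Global batch size = per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
# Keep it fixed at 128 (finetuning) when moving from 16 GPUs to 8 GPUs.

# 16 x A100: 16 GPUs x 8 per device x 1 accumulation step = 128
#   --per_device_train_batch_size 8 --gradient_accumulation_steps 1

# 8 x A100: 8 GPUs x 8 per device x 2 accumulation steps = 128
#   --per_device_train_batch_size 8 --gradient_accumulation_steps 2
```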

Hyperparameters

We use the same set of hyperparameters as LLaVA in finetuning. The hyperparameters used in both pretraining and finetuning are provided below; an illustrative mapping to trainer flags follows the tables.

  1. Pretraining

| Hyperparameter | Global Batch Size | Projector lr | Epochs | Max length | Weight decay |
| --- | --- | --- | --- | --- | --- |
| Open-LLaVA-NeXT-7B | 256 | 1e-3 | 1 | 4096 | 0 |

  2. Finetuning

| Hyperparameter | Global Batch Size | LLM lr | Projector lr | Vision Tower lr | Epochs | Max length | Weight decay |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Open-LLaVA-NeXT-7B | 128 | 2e-5 | 2e-5 | 2e-6 | 1 | 4096 | 0 |
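
For orientation, the finetuning row above maps roughly onto the following trainer arguments. This is a sketch, not the released script: --learning_rate, --mm_projector_lr, --num_train_epochs, --model_max_length, and --weight_decay follow the upstream LLaVA trainer, while --mm_vision_tower_lr is an assumed name for the vision-tower learning rate; verify against finetune.sh.

```bash
# Finetuning hyperparameters expressed as trainer flags (illustrative sketch).
#   --learning_rate 2e-5          # LLM lr
#   --mm_projector_lr 2e-5        # projector lr
#   --mm_vision_tower_lr 2e-6     # vision tower lr (flag name assumed; check finetune.sh)
#   --num_train_epochs 1
#   --model_max_length 4096
#   --weight_decay 0.
```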

Pretrain

Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions here.
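
If you prefer pulling the files directly from the Hugging Face Hub, the snippet below is one way to do it. It assumes the upstream LLaVA release of this subset (dataset repo liuhaotian/LLaVA-Pretrain) and an example target directory; place the files wherever Data.md expects them.

```bash
# Sketch: download the 558K BLIP-captioned LAION-CC-SBU subset from the Hugging Face Hub.
# The dataset repo id is assumed from the upstream LLaVA release; the local directory is an example.
pip install -U "huggingface_hub[cli]"
huggingface-cli download liuhaotian/LLaVA-Pretrain \
  --repo-type dataset \
  --local-dir data/llava-pretrain
```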

Pretraining takes around 5 hours for Open-LLaVA-NeXT-7B on 16 x A100 (80G).

Training script with DeepSpeed ZeRO-2: pretrain.sh.

Visual Instruction Tuning

  1. Prepare data: follow the instructions for data preparation in Data.md.
  2. Prepare MLP projectors: download our pretrained projectors from the Model Zoo, or specify your own MLP projector after pretraining (a flag sketch follows the training-script line below).
  3. Start training: visual instruction tuning takes around 20 hours for Open-LLaVA-NeXT-7B on 16 x A100 (80G).

Training script with DeepSpeed ZeRO-2: finetune.sh.
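
If you trained your own projector in the pretraining stage (step 2 above), it is typically passed to the finetuning script through the LLaVA-style --pretrain_mm_mlp_adapter argument; the checkpoint path below is purely illustrative.

```bash
# Sketch: point visual instruction tuning at a projector produced by pretrain.sh.
# --pretrain_mm_mlp_adapter is the LLaVA-style flag for loading a pretrained projector;
# the path is an example only -- substitute your own pretraining output directory.
#   --pretrain_mm_mlp_adapter ./checkpoints/open-llava-next-vicuna-7b-pretrain/mm_projector.bin
```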

New options to note:

Evaluation

See Evaluation.md.

Citation

If you find this project useful in your research, please consider citing:

@misc{chen2024open,
  title={Open-LLaVA-NeXT: An open-source implementation of LLaVA-NeXT series for facilitating the large multi-modal model community.},
  author={Chen, Lin and Xing, Long},
  howpublished = {\url{https://github.com/xiaoachen98/Open-LLaVA-NeXT}},
  year={2024},
  doi={10.5281/zenodo.13935471}
}

❤️ Acknowledgments