<!-- # SliME -->

Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

Multi-Modal <a href='https://arxiv.org/abs/2406.08487'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/collections/yifanzhang114/slime-665bcb2d0d71762b86fdbd2d'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a> <a href='https://huggingface.co/datasets/yifanzhang114/SMR'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data-green'></a>

<p align="center"> <img src="images/title.png" width="100%" height="100%"> </p>

<font size=7><div align='center' > [📖 arXiv Paper] [📊 Dataset] [🏆 Models] </div></font>

🔥 Update

👀 Contents

🔮 Install

Please follow the instructions below to install the required packages.

1. Clone this repository

```bash
git clone https://github.com/yfzhang114/SliME.git
```

2. Install the package

```bash
conda create -n slime python=3.10 -y
conda activate slime
cd SliME
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

3. Install additional packages for training

```bash
pip install -e ".[train]"
pip install ninja
pip install datasets
pip install flash-attn --no-build-isolation
```
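
As an optional sanity check (a minimal sketch, not part of the official setup), you can verify that PyTorch detects your GPU and that flash-attn built correctly:

```bash
# Optional sanity check: confirm that PyTorch sees a GPU and flash-attn imports cleanly.
python -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available())"
```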

๐Ÿ” Model

We provide all our fully finetuned models on Stage 1/2 and 3 data for SliME:

| Model | Base LLM | Vision Encoder | Finetuning Data | Finetuning schedule | Download |
|---|---|---|---|---|---|
| SliME-7B | Vicuna-7B-v1.5 | CLIP-L | SharedGPT+SMR | full_ft | ckpt |
| SliME-8B | Llama-3-8B-Instruct | CLIP-L | SharedGPT+SMR | full_ft | ckpt |
| SliME-13B | Vicuna-13B-v1.5 | CLIP-L | SharedGPT+SMR | full_ft | ckpt |
| SliME-70B | Llama-3-70B-Instruct | CLIP-L | SharedGPT+SMR | LoRA | ckpt |

Here are the pretrained weights on Stage 1/2 data only:

| Model | Base LLM | Vision Encoder | Pretrain Data | Finetuning schedule | Download |
|---|---|---|---|---|---|
| SliME-7B | Vicuna-7B-v1.5 | CLIP-L | LLaVA-Pretrain | 1e | ckpt |
| SliME-8B | Llama-3-8B-Instruct | CLIP-L | LLaVA-Pretrain | 1e | ckpt |
| SliME-13B | Vicuna-13B-v1.5 | CLIP-L | LLaVA-Pretrain | 1e | ckpt |
| SliME-70B | Llama-3-70B-Instruct | CLIP-L | LLaVA-Pretrain | 1e | ckpt |
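
To fetch a checkpoint locally, one option is the `huggingface-cli download` command from a recent `huggingface_hub`; the repo id and target directory below are placeholders, so copy the exact id of the checkpoint you want from the collection linked above:

```bash
# Sketch: download a released SliME checkpoint from Hugging Face.
# The repo id is a placeholder -- use the id listed on the model card you want.
pip install -U "huggingface_hub[cli]"
huggingface-cli download yifanzhang114/SliME-vicuna-7B --local-dir checkpoints/SliME-7B
```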

🔮 Preparation

Dataset

Please follow LLaVA and SharedGPT4V to prepare the corresponding images and data.
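
The SMR annotations themselves are hosted on Hugging Face (see the dataset badge above); here is a minimal sketch for pulling them with `huggingface-cli` (the target directory is only an example):

```bash
# Sketch: download the SMR annotation files; adjust --local-dir to your data root.
huggingface-cli download yifanzhang114/SMR --repo-type dataset --local-dir data/SMR
```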

SMR data structure

```
data
├── arxivqa
│   └── images
├── DVQA
│   └── images
├── Geometry3K
│   └── 0-2400 dirs
├── ChartQA
│   └── train_images
├── GeoQA3
│   ├── image
│   └── json
├── mathvision
├── scienceqa
├── tabmwp
├── GeoQA3
│   ├── train
│   ├── test
│   └── val
├── ai2d
│   ├── abc_images
│   └── images
└── geoqa+
    └── images
```

You can find the pre-processing code at this URL. If you have any questions about file names or image paths, please refer to the pre-processing code.
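
Before training, it can help to confirm that the folders above are in place. The following is a small, purely illustrative check of the expected top-level directories (adjust the `data` root to your setup):

```bash
# Report any expected SMR image folders that are missing under ./data.
for d in arxivqa/images DVQA/images Geometry3K ChartQA/train_images GeoQA3/image \
         mathvision scienceqa tabmwp ai2d/images "geoqa+/images"; do
  [ -d "data/$d" ] || echo "missing: data/$d"
done
```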

1. ArxivQA

Download images using this download URL, then run:

```bash
python playground/data/process_arxivqa.py
```

2. DVQA

Download images using this URL.

3. ChartQA

Clone this repo.

Extract all the training images from ChartQA_Dataset/train/png into ChartQA.

4. Geometry3K

Download images using this URL.

The image path in our JSON file is `os.path.join(f'Geometry3K/{i}', 'img_diagram.png')`.

5. GeoQA3

Download images using this URL.

Extract all the training images into GeoQA3/image.

6. MathVision

Download images using this URL.

Our data does not include the images from the test-mini split.

7. ScienceQA

```bash
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/train.zip
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/val.zip
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/test.zip

unzip -q train.zip
unzip -q val.zip
unzip -q test.zip

rm train.zip
rm val.zip
rm test.zip
```
8. TabMWP

Download images using this URL.

9. TextbookQA

Download images using this URL.

10. AI2D

Download images using this URL.

11. GeoQA+

Download images using this URL.

📈 Train

<div align='center' > <details> <summary> Click to see the detailed model structure</summary> <p align="center"> <img width="100%" src="images/teaser.png"/> </p> </details> </div>

SliME training consists of three stages: (1) training only the global projector and attention adapter; (2) training the local compression layer; and (3) training the full model.

SliME is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
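
As an illustrative example (these numbers are not the released hyperparameters), halving the GPU count while doubling gradient accumulation keeps the global batch size unchanged:

```bash
# Illustrative only -- check the released scripts for the actual values.
# 8 GPUs: per_device_train_batch_size=16, gradient_accumulation_steps=1  -> 16 * 1 * 8 = 128
# 4 GPUs: per_device_train_batch_size=16, gradient_accumulation_steps=2  -> 16 * 2 * 4 = 128
```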

Please make sure you download and organize the data following Preparation before training.

If you want to train and finetune SliME, please run the following commands for SliME-7B with image size 336:

```bash
bash scripts/vicuna/vicuna_7b_pt.sh
bash scripts/vicuna/vicuna_7b_sft.sh
```

or for SliME-8B with image size 336:

```bash
bash scripts/llama/llama3_8b_pt.sh
bash scripts/llama/llama3_8b_sft.sh
```

Because we reuse the pre-trained projector weights from SliME-7B, you can run the stage-3 instruction tuning (SFT) command directly by changing PROJECTOR_DIR:

```bash
bash scripts/llama/llama3_8b_sft.sh
```
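
As a sketch only (check how the SFT script actually defines PROJECTOR_DIR, whether as a variable near the top of the script or a value read from the environment), pointing it at the stage-1/2 projector checkpoint might look like:

```bash
# Hypothetical path -- replace with the directory holding the SliME-7B stage-1/2 projector weights,
# and adapt if scripts/llama/llama3_8b_sft.sh does not read PROJECTOR_DIR from the environment.
PROJECTOR_DIR=./checkpoints/slime-7b-pretrain bash scripts/llama/llama3_8b_sft.sh
```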

Please find more training scripts in scripts/.

📈 Evaluation

We perform evaluation on several image-based benchmarks. Please see Evaluation for the details.

<div align=center> <img width="100%" src="images/exps.png"/> </div>

If you want to evaluate the model on image-based benchmarks, please use the scripts in scripts/MODEL_PATH/eval. For example, run the following command for TextVQA evaluation with SliME-7B:

```bash
bash scripts/llama/eval/textvqa.sh
```

Please find more evaluation scripts in scripts/MODEL_PATH.
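
To run every image benchmark in one pass, a simple loop over the eval scripts works (a sketch; individual benchmarks may need extra answer files or different GPU settings):

```bash
# Run all evaluation scripts for the Llama-3-based model back to back.
for s in scripts/llama/eval/*.sh; do
  echo "Running $s"
  bash "$s"
done
```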

The evaluation code and needed files can be found here.

👀 Examples

We provide some examples in this section. More examples can be found on our project page.

Hi-Resolution Understanding

<div align=center> <img width="98%" src="images/hr1.png"/> </div> <div align='center' > <details> <summary> Click to expand more examples</summary> <p align="center"> <img src="images/hr2.png" width="100%" height="100%"> <img src="images/code_generation.png" width="100%" height="100%"> <img src="images/story.png" width="100%" height="100%"> </p> </details> </div>

Citation

If you find this repo useful for your research, please consider citing the paper:

```bibtex
@article{zhang2024beyond,
  title={Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models},
  author={Zhang, Yi-Fan and Wen, Qingsong and Fu, Chaoyou and Wang, Xue and Zhang, Zhang and Wang, Liang and Jin, Rong},
  journal={arXiv preprint arXiv:2406.08487},
  year={2024}
}
```

Acknowledgement

We would like to thank the following repos for their great work:

License

The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaVA, LLaMA, Vicuna, and GPT-4. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset should not be used outside of research purposes.