<!-- # SliME -->
# Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
<a href='https://arxiv.org/abs/2406.08487'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/collections/yifanzhang114/slime-665bcb2d0d71762b86fdbd2d'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a> <a href='https://huggingface.co/datasets/yifanzhang114/SMR'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data-green'></a>
<p align="center"> <img src="images/title.png" width="100%" height="100%"> </p><font size=7><div align='center' > [๐ arXiv Paper] [๐ Dataset][๐ Models] </div></font>
## 🔥 Update
- [10/26] 🔥 SliME-8B achieves better high-resolution understanding performance on MME-RealWorld than Mini-Gemini and LLaVA-NeXT.
- [09/26] 🔥 SliME is supported by VLMEvalKit. Feel free to use it without hesitation!
- [07/16] 🔥 The SliME strategy demonstrates exceptional versatility, extending seamlessly to video analysis (see Slime_video.md). Remarkably, even though the model has never been trained on video data, it can process up to 8 frames. On the Video-MME benchmark, it surpasses numerous 7B/8B baselines that were trained on video datasets.
- [06/11] 🔥 SliME is coming! We release the paper, code, models, and data for SliME!
- [06/11] 🔥 SliME-70B will be released soon.
## Contents

- [Install](#install)
- [Model](#model)
- [Preparation](#preparation)
- [Train](#train)
- [Evaluation](#evaluation)
- [Examples](#examples)
- [Citation](#citation)
- [Acknowledgement](#acknowledgement)
- [License](#license)
## Install
Please follow the instructions below to install the required packages.
- Clone this repository:

```bash
git clone https://github.com/yfzhang114/SliME.git
```
- Install the package:

```bash
conda create -n slime python=3.10 -y
conda activate slime
cd SliME
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
- Install additional packages for training:

```bash
pip install -e ".[train]"
pip install ninja
pip install datasets
pip install flash-attn --no-build-isolation
```
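After installation, a quick sanity check can confirm that the key training dependencies import cleanly. This is only a suggested check, not part of the official setup:

```bash
# Optional sanity check: verify PyTorch sees the GPU and flash-attn built correctly.
python -c "import torch; print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())"
python -c "import flash_attn; print('flash-attn OK')"
```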
## Model
We provide all our fully fine-tuned SliME models trained on Stage 1/2 and Stage 3 data:
| Model | Base LLM | Vision Encoder | Finetuning Data | Finetuning schedule | Download |
|---|---|---|---|---|---|
| SliME-7B | Vicuna-7B-v1.5 | CLIP-L | ShareGPT4V+SMR | full_ft | ckpt |
| SliME-8B | Llama-3-8B-Instruct | CLIP-L | ShareGPT4V+SMR | full_ft | ckpt |
| SliME-13B | Vicuna-13B-v1.5 | CLIP-L | ShareGPT4V+SMR | full_ft | ckpt |
| SliME-70B | Llama-3-70B-Instruct | CLIP-L | ShareGPT4V+SMR | LoRA | ckpt |
Here are the pretrained weights on Stage 1/2 data only:
| Model | Base LLM | Vision Encoder | Pretrain Data | Pretraining schedule | Download |
|---|---|---|---|---|---|
| SliME-7B | Vicuna-7B-v1.5 | CLIP-L | LLaVA-Pretrain | 1e | ckpt |
| SliME-8B | Llama-3-8B-Instruct | CLIP-L | LLaVA-Pretrain | 1e | ckpt |
| SliME-13B | Vicuna-13B-v1.5 | CLIP-L | LLaVA-Pretrain | 1e | ckpt |
| SliME-70B | Llama-3-70B-Instruct | CLIP-L | LLaVA-Pretrain | 1e | ckpt |
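The ckpt links above point to the Hugging Face collection linked at the top of this page. As a rough sketch, a checkpoint can be fetched with `huggingface-cli`; the repo ID below is an assumption, so check the collection for the exact name:

```bash
pip install -U "huggingface_hub[cli]"
# The repo ID is a placeholder -- replace it with the exact name from the SliME collection.
huggingface-cli download yifanzhang114/SliME-vicuna-7B --local-dir checkpoints/SliME-7B
```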
## Preparation
### Dataset
Please follow LLaVA and ShareGPT4V to prepare the corresponding images and data.
SMR data structure:

```
data
├── arxivqa
│   └── images
├── DVQA
│   └── images
├── Geometry3K
│   └── 0-2400 dirs
├── ChartQA
│   └── train_images
├── GeoQA3
│   ├── image
│   └── json
├── mathvision
├── scienceqa
├── tabmwp
├── GeoQA3
│   ├── train
│   ├── test
│   └── val
├── ai2d
│   ├── abc_images
│   └── images
└── geoqa+
    └── images
```
You can find the pre-processing code at this URL. If you have any questions about file names or image paths, please refer to it.
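Before training, it can help to verify that the folders above actually exist under `data/`. The loop below is only a convenience sketch based on the tree shown above, not part of the official pipeline:

```bash
# Convenience check: confirm the expected SMR image folders are in place under ./data.
for d in arxivqa/images DVQA/images Geometry3K ChartQA/train_images GeoQA3/image \
         mathvision scienceqa tabmwp ai2d/images geoqa+/images; do
  [ -d "data/$d" ] && echo "ok      data/$d" || echo "MISSING data/$d"
done
```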
- ArxivQA: Download images using this download url, then run the pre-processing script:

```bash
python playground/data/process_arxivqa.py
```
- DVQA: Download images using this url.
- ChartQA: Clone this repo and extract all the training images from `ChartQA_Dataset/train/png` into `ChartQA`.
- Geometry3K: Download images using this url. The image path in our JSON file is constructed as `os.path.join(f'Geometry3K/{i}', 'img_diagram.png')`.
- GeoQA3: Download images using this url and extract all the training images into `GeoQA3/image`.
- MathVision: Download images using this url. Images from the test-mini split are excluded from our data automatically.
- ScienceQA: Download and unzip the images:

```bash
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/train.zip
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/val.zip
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/test.zip
unzip -q train.zip
unzip -q val.zip
unzip -q test.zip
rm train.zip val.zip test.zip
```
- Tabmwp: Download images using this url.
- TextbookQA: Download images using this url.
- AI2D: Download images using this url.
- GeoQA+: Download images using this url.
## Train
<div align='center'> <details> <summary>Click to see the detailed model structure</summary> <p align="center"> <img width="100%" src="images/teaser.png"/> </p> </details> </div>

SliME training consists of three stages: (1) training only the global projector and attention adapter; (2) training the local compression layer; and (3) training the full model.
SliME is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce `per_device_train_batch_size` and increase `gradient_accumulation_steps` accordingly, always keeping the global batch size the same: `per_device_train_batch_size` × `gradient_accumulation_steps` × `num_gpus`.
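For example (the numbers below are purely illustrative; check the released scripts for the actual defaults), halving the GPU count while doubling the accumulation steps leaves the global batch size unchanged:

```bash
# global batch = per_device_train_batch_size x gradient_accumulation_steps x num_gpus
echo $((16 * 1 * 8))   # 8 GPUs, batch 16, no accumulation      -> 128
echo $((16 * 2 * 4))   # 4 GPUs, batch 16, 2 accumulation steps -> 128 (same global batch)
```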
Please make sure you download and organize the data following Preparation before training.
If you want to train and finetune SliME, please run the following commands for SliME-7B with image size 336:
```bash
bash scripts/vicuna/vicuna_7b_pt.sh
bash scripts/vicuna/vicuna_7b_sft.sh
```
or for SliME-8B with image size 336:
```bash
bash scripts/llama/llama3_8b_pt.sh
bash scripts/llama/llama3_8b_sft.sh
```
Because we reuse the pre-trained projector weights from SliME-7B, you can go directly to stage-3 instruction tuning with the SFT command by changing `PROJECTOR_DIR`:

```bash
bash scripts/llama/llama3_8b_sft.sh
```
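For instance, assuming the script reads `PROJECTOR_DIR` from the environment (if it is hard-coded instead, edit the variable inside the script), the stage-1/2 projector path could be supplied like this; the path below is hypothetical:

```bash
# Hypothetical projector path -- point this at your stage-1/2 pretraining output.
PROJECTOR_DIR=checkpoints/slime-8b-pretrain/mm_projector bash scripts/llama/llama3_8b_sft.sh
```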
Please find more training scripts in `scripts/`.
## Evaluation
We perform evaluation on several image-based benchmarks. Please see Evaluation for the details.
<div align=center> <img width="100%" src="images/exps.png"/> </div>

If you want to evaluate the model on image-based benchmarks, please use the scripts in `scripts/MODEL_PATH/eval`.
For example, run the following command for TextVQA evaluation with SliME-7B:
```bash
bash scripts/llama/eval/textvqa.sh
```
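If you want to sweep several benchmarks in one go, a simple loop works; the script names other than `textvqa.sh` are assumptions patterned after the example above, so check `scripts/llama/eval/` for the ones that actually exist:

```bash
# Benchmark names besides textvqa are assumptions -- list scripts/llama/eval/ to see what is available.
for task in textvqa mme mmbench; do
  bash scripts/llama/eval/${task}.sh
done
```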
Please find more evaluation scripts in `scripts/MODEL_PATH`.
The evaluation code and needed files can be found here.
## Examples
We provide some examples in this section. More examples can be found on our project page.
### High-Resolution Understanding
<div align=center> <img width="98%" src="images/hr1.png"/> </div>

<div align='center'> <details> <summary>Click to expand more examples</summary> <p align="center"> <img src="images/hr2.png" width="100%" height="100%"> <img src="images/code_generation.png" width="100%" height="100%"> <img src="images/story.png" width="100%" height="100%"> </p> </details> </div>

## Citation
If you find this repo useful for your research, please consider citing our paper:
```bibtex
@article{zhang2024beyond,
  title={Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models},
  author={Zhang, Yi-Fan and Wen, Qingsong and Fu, Chaoyou and Wang, Xue and Zhang, Zhang and Wang, Liang and Jin, Rong},
  journal={arXiv preprint arXiv:2406.08487},
  year={2024}
}
```
## Acknowledgement
We would like to thank the following repos for their great work:
## License
The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaVA, LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset must not be used outside of research purposes.