Awesome
<img src="assets/genixer_logo.png" alt="Alt text for the image" width="30"> Genixer: Empowering Multimodal Large Language Model as a Powerful Data Generator
<p align="left"> <a href="https://scholar.google.com/citations?user=QLSk-6IAAAAJ&hl=zh-CN"><strong>Henry Hengyuan Zhao</strong></a> ยท <a href="https://panzhous.github.io/"><strong>Pan Zhou</strong></a> ยท <a href="https://sites.google.com/view/showlab"><strong>Mike Zheng Shou</strong></a> <br> <br> <a href="https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/03355.pdf"><img src='https://img.shields.io/badge/Paper-Genixer-blue' alt='Paper PDF'></a> <a href="https://arxiv.org/abs/2312.06731"><img src='https://img.shields.io/badge/arXiv-Genixer-red' alt='arxiv'></a> <!-- <a href='https://github.com/zhaohengyuan1/Genixer'><img src='https://img.shields.io/badge/Project_Page-Genixer-green' alt='Project Page'></a> --> <a href='https://huggingface.co/Anonymous-G'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models/Datasets-yellow' alt='Hugging Face'></a> <br> <b>Show Lab, National University of Singapore | Singapore Management University </b> </p>If you find this repository helpful, we would greatly appreciate it if you could give it a star.
๐ Key Contributions
-
Two Data Generators - Genixer$_L$ and Genixer$_S$. Genixer$_L$ can generate various types of data, including Common VQA, MC-VQA, and MD data, etc.Genixer$_S$ is specialized in generating grounding-like VQA data.
-
Two Synthetic Datasets - 915K VQA-like data and 350K REC-like data are two synthetic datasets for pretraining stage.
-
8K Synthetic Dataset - llava_mix665k_synthetic_8k.jsonl is the mixed jsonl file for finetuning stage. Take it for enhancing your own MLLM.
๐ Findings
- Medium level synthetic VQA with Fuyu probability (0.5-0.7) would be benefit the model training. We consider that the data with higher probability primarily belong to the simple samples which would contribute less for finally model understanding.
Usage and License Notices: The data, and code is intended and licensed for research use only. They are also restricted to uses that follow the license agreement of LLaMA, Vicuna. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.
๐ An Automatic Multimodal Data Generation Pipeline
-
Instruction Data Collection: We systematically collected 9 primary vision-language (VL) tasks, categorized into two main groups: Generic and Grounding multimodal tasks.
-
Instruction Template Design: We developed a two-level instruction template tailored separately for task-agnostic and task-specific data generation.
-
Empowering MLLMs: We obtained two data generators.
-
Data Filtering: We use Fuyu-8B as a third-party MLLM for free data quality assessment. Logistic probabilities are calculated to evaluate the correctness of each data sample. Additionally, we found that data with medium-level probabilities tend to contribute more to the final model performance.
๐ท Instruction Data Collection
In accordance with the prevalence and practical relevance of real-world multi-modal tasks, we have carefully selected 9 representative multimodal tasks as listed in the following table for corresponding data generation. We categorize the VL tasks into two groups: 4 Generic tasks and 5 Grounding tasks.
<p align="center"> <img src="assets/datacollect.png" width="100%"> </p>๐ Data Filtering
The illustration of proposed Fuyu-driven data filtering framework. The outputs of the framework compose a probability and a direct answer.
<p align="center"> <img src="assets/fuyufiltering.png" width="100%"> </p>๐ Inference Modes
In an automatic data generation context, where image content is agnostic, preemptively determining the specific task type becomes particularly daunting, especially when it involves large-scale data creation purposes. Hence, we consider two key modes for visual instruction data generation: 1) task-agnostic data generation and 2) task-specific data generation.
<p align="center"> <img src="assets/inference_modes.png" width="100%"></a> </p>๐งธ Samples of Generated Data
Selected examples generated from $\text{Genixer}_L$ and $\text{Genixer}_S$. The examples include Common VQA, Adv VQA, MC VQA, MD, and five grounding tasks.
<p align="center"> <img src="https://github.com/sail-sg/Genixer/blob/main/assets/samplesofdata.png" width="100%"> </p>58 Handwritten Generic Instructions
For the generic instructions used in training Genixer, please refer to the path Genixer_Shikra/config/_base_/dataset/template/GenQA_general_instructions.json
for the details.
Genixer with LLaVA
Install
cd Genixer_LLaVA
conda create -n genixerL python=3.10 -y
conda activate genixerL
pip install --upgrade pip
pip install -e .
Model Weights
Model Name | Checkpoints | Description |
---|---|---|
Genixer-llava-v1.5-7b | Model weights | Data Generator |
llava-Genixer-915K-FT-8K-v1.5-7b | Model weights | Trained Model |
Image Datasets
Please download the images from constituting datasets:
- COCO: train2014
- GQA: images
- OCR-VQA: download script, we save all files as
.jpg
- AOKVQA: download script
- TextVQA: train_val_images
- VisualGenome: part1, part2
- LLaVA-CC3M-Pretrain-595K: huggingface
- LLaVA-Pretrain: huggingface
- LLaVA-Instruct: huggingface
- Flickr30K: Kaggle
Training data for $\text{Genixer}_L$
TrainDataforGenixerLLaVA.jsonl: 1M instruction tuning data for training the $\text{Genixer}_L$ with the capability of generating diverse data types.
Synthetic Data
Genixer_915K.jsonl: This is the synthetic instruction tuning data generated by our trained $\text{Genixer}_L$.
Moreover, we provide additional two synthetic pretraining datasets mentioned in ablation study for your preference:
Evaluation for $\text{Genixer}_L$
- Download model weight Genixer-llava-v1.5-7b under the folder
checkpoints
. - Run evaluation on Flickr30K unannotated images with generic data type, please refer to the script
scripts/eval_genixer/generic_generation.sh
.
CHUNKS=8
CKPT=Genixer-llava-v1.5-7b
qfile=data/flickr30k_imagequery.jsonl
imgdir=/yourpath/flickr30k/flickr30k_images/flickr30k_images
datatype=flickr30k_tem0.2
tasktype=generic
for IDX in $(seq 0 $((CHUNKS-1))); do
CUDA_VISIBLE_DEVICES=$IDX python -m model_genixer_eval \
--model-path checkpoints/$CKPT \
--question-file $qfile \
--image-folder $imgdir \
--answers-file ./playground/data/genixer_eval/$datatype/$tasktype/answers/$CKPT/${CHUNKS}_${IDX}.jsonl \
--task-type $tasktype \
--num-chunks $CHUNKS \
--chunk-idx $IDX \
--temperature 0.2 \
--conv-mode vicuna_v1 &
done
wait
output_file=./playground/data/genixer_eval/$datatype/$tasktype/answers/$CKPT/merge.jsonl
> "$output_file"
for IDX in $(seq 0 $((CHUNKS-1))); do
cat ./playground/data/genixer_eval/$datatype/$tasktype/answers/$CKPT/${CHUNKS}_${IDX}.jsonl >> "$output_file"
done
More evaluation scripts can be found in scripts/eval_genixer
.
Training for $\text{Genixer}_L$
-
Download the model weight clip-vit-large-patch14-336 under the folder
checkpoints
. -
Download the model weight llava-v1.5-7b under the folder
checkpoints
. -
Preparing the TrainDataforGenixerLLaVA.jsonl under the folder
data
. -
Run the training script
bash scripts/train_genixer.sh
#!/bin/bash
outputdir=exp/llava-v1.5-7b-Genixer
deepspeed llava/train/train_mem.py \
--deepspeed ./scripts/zero3.json \
--model_name_or_path checkpoints/llava-v1.5-7b \
--version v1 \
--data_path ./data/TrainDataforGenixerLLaVA.jsonl \
--image_folder ./data \
--vision_tower checkpoints/clip-vit-large-patch14-336 \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
--output_dir $outputdir \
--num_train_epochs 1 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
Training LLaVA1.5 with 915K synthetic data
-
Download the model weight clip-vit-large-patch14-336 under the folder
checkpoints
. -
Download the model weight vicuna-7b-v1.5 under the folder
checkpoints
. -
Download the synthetic pretraining data Genixer_915K.jsonl under the folder
data
. -
Download the mixture finetuning data llava_mix665k_synthetic_8k.jsonl under the folder
data
. -
Run the pretraining script.
bash scripts/pretrain.sh
- Run the finetuing script.
bash scripts/finetune.sh
Evaluation on 12 Multimodal Benchmarks
-
Download llava-Genixer-915K-FT-8K-v1.5-7b under the folder
checkpoints
. -
Following the data preparation steps from here.
Take VizWiz as an example, you just need to set the modelname of downloaded model and ensure the correctness of the path of image folder.
modelname=llava-Genixer-915K-FT-8K-v1.5-7b
python -m llava.eval.model_vqa_loader \
--model-path exp/$modelname \
--question-file ./playground/data/eval/vizwiz/llava_test.jsonl \
--image-folder /dataset/lavis/vizwiz/test/ \
--answers-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
--temperature 0 \
--conv-mode vicuna_v1
python scripts/convert_vizwiz_for_submission.py \
--annotation-file ./playground/data/eval/vizwiz/llava_test.jsonl \
--result-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
--result-upload-file ./playground/data/eval/vizwiz/answers_upload/$modelname.json
Genixer with Shikra
Install
cd Genixer_Shikra
conda create -n GenixerS python=3.10
conda activate GenixerS
pip install -r requirements.txt
Model Weights
Model Name | Checkpoints | Description |
---|---|---|
Genixer-shikra-7b | Model weights | Data Generator |
shikra-Genixer-350K-7b | Model weights | Trained Model |
Image Datasets
- COCO: train2014
- GQA: images
- VisualGenome: part1, part2
- LLaVA-CC3M-Pretrain-595K: huggingface
- LLaVA-Pretrain: huggingface
- LLaVA-Instruct: huggingface
- Flickr30K: Kaggle
- SBU: images
Training Data
Download the original annotation data from here and put it under data
.
Please refer to the file Genixer_Shikra/config/_base_/dataset/DEFAULT_TRAIN_DATASET.py
to replace yourpath
with the exact folder path on your machine.
genrecdata=dict(
type='GenRECDataset',
filename=r'{{fileDirname}}/../../../data/REC_ref3_train.jsonl',
image_folder=r'/yourpath/coco2014/train2014',
template_file=r"{{fileDirname}}/template/GenQA_general_instructions.json",
),
Synthetic Data
We use $\text{Genixer}_S$ to generate two REC-like datasets syn_lcs_filtered60.jsonl, syn_sbu_filtered60.jsonl with a total of 350K samples.
Evaluation for $\text{Genixer}_S$
-
Download the model weight of Genixer-shikra-7b under the folder
checkpoints
. -
Download the vision encoder clip-vit-large-patch14 under the folder
checkpoints
. -
Run the script
run_eval_genixer.sh
.
accelerate launch --num_processes 8 \
--main_process_port 23782 \
mllm/pipeline/finetune.py \
config/genixer_eval_GenQA.py \
--cfg-options model_args.model_name_or_path=checkpoints/Genixer-shikra-7b \
training_args.output_dir=results/Genixer-shikra-7b
Training for $\text{Genixer}_S$
-
Download the vision encoder clip-vit-large-patch14 under the folder
checkpoints
. -
Download the LLM model weight vicuna-7b-v1.1 under the folder
checkpoints
. -
Download the delta model shikra-7b-delta-v1 of Shikra.
-
Transform the delta model to
shikra-7b-v1.1
with the commandbash model_transform.sh
.
python mllm/models/models/apply_delta.py \
--base /yourpath/vicuna-7b-v1.1 \
--target checkpoints/shikra-7b-v1.1 \
--delta checkpoints/shikra-7b-delta-v1
- Run the stage-1 training script.
bash run_genixer_stage1.sh
- Run the stage-2 training script.
bash run_genixer_stage2.sh
Training Shikra with 350K Synthetic Data
-
Download the vision encoder clip-vit-large-patch14 under the folder
checkpoints
. -
Download the LLM model weight vicuna-7b-v1.1 under the folder
checkpoints
. -
Run the script for the stage-0 pretraining.
bash run_genixer_shikra_stage0.sh
- Run the script for the stage-1 pretraining.
bash run_genixer_shikra_stage1.sh
- Run the script for the stage-2 pretraining.
bash run_genixer_shikra_stage2.sh
Evaluation on REC Tasks
-
Download the model shikra-Genixer-350K-7b under the folder
checkpoints
. -
Download the vision encoder clip-vit-large-patch14 under the folder
checkpoints
. -
Run the script
bash run_eval_rec.sh
.
accelerate launch --num_processes 8 \
--main_process_port 23782 \
mllm/pipeline/finetune.py \
config/eval_multi_rec.py \
--cfg-options model_args.model_name_or_path=checkpoints/shikra-Genixer-350K-7b \
training_args.output_dir=results/shikra-Genixer-350K-7b
Fuyu-Driven Data Filtering
We prepare the code of using Fuyu-8B as the data filtering in the file Genixer_LLaVA/fuyudatafiltering/GenQA_filtering_mp.py
Run the following command for multi-GPU data filtering.
bash scripts/fuyudatafilter.sh
CLIP-Driven REC Data Filtering
We run the CLIP-Driven REC data filtering with this script multiprocess_evalclipscore.py
.
bash Genixer_Shikra/multiprocess_evalclipscore.py
๐ Acknowledgement
๐ Citation
If you find Genixer useful, please cite using this BibTeX:
@misc{zhao2024genixerempoweringmultimodallarge,
title={Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator},
author={Henry Hengyuan Zhao and Pan Zhou and Mike Zheng Shou},
year={2024},
eprint={2312.06731},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2312.06731},
}