<img src="assets/genixer_logo.png" alt="Alt text for the image" width="30"> Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator

<p align="left"> <a href="https://scholar.google.com/citations?user=QLSk-6IAAAAJ&hl=zh-CN"><strong>Henry Hengyuan Zhao</strong></a> · <a href="https://panzhous.github.io/"><strong>Pan Zhou</strong></a> · <a href="https://sites.google.com/view/showlab"><strong>Mike Zheng Shou</strong></a> <br> <br> <a href="https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/03355.pdf"><img src='https://img.shields.io/badge/Paper-Genixer-blue' alt='Paper PDF'></a> <a href="https://arxiv.org/abs/2312.06731"><img src='https://img.shields.io/badge/arXiv-Genixer-red' alt='arxiv'></a> <!-- <a href='https://github.com/zhaohengyuan1/Genixer'><img src='https://img.shields.io/badge/Project_Page-Genixer-green' alt='Project Page'></a> --> <a href='https://huggingface.co/Anonymous-G'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models/Datasets-yellow' alt='Hugging Face'></a> <br> <b>Show Lab, National University of Singapore &nbsp; | &nbsp; Singapore Management University </b> </p>

If you find this repository helpful, we would greatly appreciate it if you could give it a star.

🔎 Key Contributions

👀 Findings

Usage and License Notices: The data and code are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA and Vicuna. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

😊 An Automatic Multimodal Data Generation Pipeline

<p align="center"><img src="./assets/genixer_4step.png" alt="pipeline"/></p>

📷 Instruction Data Collection

In accordance with the prevalence and practical relevance of real-world multi-modal tasks, we have carefully selected 9 representative multimodal tasks as listed in the following table for corresponding data generation. We categorize the VL tasks into two groups: 4 Generic tasks and 5 Grounding tasks.

<p align="center"> <img src="assets/datacollect.png" width="100%"> </p>

๐Ÿ„ Data Filtering

Illustration of the proposed Fuyu-driven data filtering framework. The framework outputs a probability and a direct answer.

<p align="center"> <img src="assets/fuyufiltering.png" width="100%"> </p>
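
As a rough illustration of how the two outputs could be combined, the sketch below keeps a generated sample only when the direct answer is affirmative and the probability clears a threshold. The field names (`answer`, `probability`), the accepted answer string `"yes"`, and the 0.7 threshold are assumptions made for this example, not values taken from the Genixer code.

```python
from typing import Iterable


def keep_sample(record: dict, threshold: float = 0.7) -> bool:
    """Return True if a generated sample passes the (hypothetical) filter rule."""
    # Assumed fields: 'answer' is the filter model's direct answer and
    # 'probability' is the probability attached to that judgement.
    answer_ok = str(record.get("answer", "")).strip().lower() == "yes"
    prob_ok = float(record.get("probability", 0.0)) >= threshold
    return answer_ok and prob_ok


def filter_samples(records: Iterable[dict], threshold: float = 0.7) -> list[dict]:
    """Keep only the records that satisfy keep_sample."""
    return [r for r in records if keep_sample(r, threshold)]


if __name__ == "__main__":
    demo = [
        {"id": 1, "answer": "Yes", "probability": 0.92},
        {"id": 2, "answer": "No", "probability": 0.98},
        {"id": 3, "answer": "Yes", "probability": 0.41},
    ]
    print([r["id"] for r in filter_samples(demo)])
```

Requiring both conditions is one possible design; using the probability alone, or a different threshold per task, would be equally easy to express.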

๐Ÿ— Inference Modes

In automatic data generation, the content of each image is unknown in advance, so deciding on a specific task type beforehand is difficult, especially for large-scale data creation. We therefore consider two modes for visual instruction data generation: 1) task-agnostic data generation and 2) task-specific data generation.

<p align="center"> <img src="assets/inference_modes.png" width="100%"> </p>

🧸 Samples of Generated Data

Selected examples generated from $\text{Genixer}_L$ and $\text{Genixer}_S$. The examples include Common VQA, Adv VQA, MC VQA, MD, and five grounding tasks.

<p align="center"> <img src="https://github.com/sail-sg/Genixer/blob/main/assets/samplesofdata.png" width="100%"> </p>

58 Handwritten Generic Instructions

The generic instructions used in training Genixer are in Genixer_Shikra/config/_base_/dataset/template/GenQA_general_instructions.json.

Genixer with LLaVA

Install

cd Genixer_LLaVA
conda create -n genixerL python=3.10 -y
conda activate genixerL
pip install --upgrade pip
pip install -e .

Model Weights

| Model Name | Checkpoints | Description |
| --- | --- | --- |
| Genixer-llava-v1.5-7b | Model weights | Data Generator |
| llava-Genixer-915K-FT-8K-v1.5-7b | Model weights | Trained Model |

Image Datasets

Please download the images from the constituent datasets:

<!-- [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json): 665K instruction tuning data from original LLaVA1.5. -->

Training data for $\text{Genixer}_L$

TrainDataforGenixerLLaVA.jsonl: 1M instruction-tuning samples for training $\text{Genixer}_L$ to generate diverse data types.

Synthetic Data

Genixer_915K.jsonl: This is the synthetic instruction tuning data generated by our trained $\text{Genixer}_L$.

We also provide the two additional synthetic pretraining datasets used in the ablation study:

Genixer_300K.jsonl

Genixer_610K.jsonl

Evaluation for $\text{Genixer}_L$

  1. Download the model weight Genixer-llava-v1.5-7b under the folder checkpoints.
  2. Run evaluation on unannotated Flickr30K images with the generic data type; see the script scripts/eval_genixer/generic_generation.sh.
CHUNKS=8
CKPT=Genixer-llava-v1.5-7b

qfile=data/flickr30k_imagequery.jsonl
imgdir=/yourpath/flickr30k/flickr30k_images/flickr30k_images
datatype=flickr30k_tem0.2
tasktype=generic

for IDX in $(seq 0 $((CHUNKS-1))); do
    CUDA_VISIBLE_DEVICES=$IDX python -m model_genixer_eval \
        --model-path checkpoints/$CKPT \
        --question-file $qfile \
        --image-folder $imgdir \
        --answers-file ./playground/data/genixer_eval/$datatype/$tasktype/answers/$CKPT/${CHUNKS}_${IDX}.jsonl \
        --task-type $tasktype \
        --num-chunks $CHUNKS \
        --chunk-idx $IDX \
        --temperature 0.2 \
        --conv-mode vicuna_v1 &
done

wait

output_file=./playground/data/genixer_eval/$datatype/$tasktype/answers/$CKPT/merge.jsonl
> "$output_file"

for IDX in $(seq 0 $((CHUNKS-1))); do
    cat ./playground/data/genixer_eval/$datatype/$tasktype/answers/$CKPT/${CHUNKS}_${IDX}.jsonl >> "$output_file"
done

More evaluation scripts can be found in scripts/eval_genixer.
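
The generic-generation script shards the question file across `CHUNKS` GPUs through `--num-chunks` and `--chunk-idx`, then concatenates the per-shard answer files. A minimal sketch of such a partition is shown below, assuming contiguous, ceil-sized chunks; the actual splitting logic lives in `model_genixer_eval` and may differ.

```python
import math


def split_list(items: list, n: int) -> list[list]:
    """Split items into at most n contiguous chunks of near-equal size."""
    # max(1, ...) avoids a zero step when the input list is empty.
    chunk_size = max(1, math.ceil(len(items) / n))
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def get_chunk(items: list, n: int, k: int) -> list:
    """Return the k-th chunk, or an empty list if fewer than k+1 chunks exist."""
    chunks = split_list(items, n)
    return chunks[k] if k < len(chunks) else []


if __name__ == "__main__":
    questions = [f"q{i}" for i in range(10)]
    merged = []
    for idx in range(8):  # mirrors the loop over CHUNKS in the shell script
        merged.extend(get_chunk(questions, 8, idx))
    # Concatenating the shards in index order restores the original order.
    print(merged == questions)
```

With this scheme, concatenating the shards in index order (as the final `cat` loop does) reproduces the original question order, and some trailing shards may be empty when the number of questions is small.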

Training for $\text{Genixer}_L$

  1. Download the model weight clip-vit-large-patch14-336 under the folder checkpoints.

  2. Download the model weight llava-v1.5-7b under the folder checkpoints.

  3. Prepare TrainDataforGenixerLLaVA.jsonl under the folder data.

  4. Run the training script bash scripts/train_genixer.sh

#!/bin/bash
outputdir=exp/llava-v1.5-7b-Genixer

deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path checkpoints/llava-v1.5-7b \
    --version v1 \
    --data_path ./data/TrainDataforGenixerLLaVA.jsonl \
    --image_folder ./data \
    --vision_tower checkpoints/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir $outputdir \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
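
To see what the batch-size and warmup flags in the training script imply, the sketch below does the arithmetic. The number of GPUs (8) and the exact sample count (1,000,000, from the "1M" figure) are assumptions for illustration; the exact step count also depends on how the trainer handles the last partial batch.

```python
import math


def training_schedule(
    num_samples: int,
    num_gpus: int,
    per_device_batch: int = 8,   # --per_device_train_batch_size
    grad_accum: int = 2,         # --gradient_accumulation_steps
    warmup_ratio: float = 0.03,  # --warmup_ratio
    epochs: int = 1,             # --num_train_epochs
) -> dict:
    """Approximate optimizer-step counts implied by the training flags."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


if __name__ == "__main__":
    # Assumed setup: 8 GPUs and 1,000,000 training samples.
    print(training_schedule(num_samples=1_000_000, num_gpus=8))
```

Under these assumptions the effective batch size is 8 x 2 x 8 = 128, so changing the GPU count without adjusting `--gradient_accumulation_steps` changes the effective batch size.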

Training LLaVA1.5 with 915K synthetic data

  1. Download the model weight clip-vit-large-patch14-336 under the folder checkpoints.

  2. Download the model weight vicuna-7b-v1.5 under the folder checkpoints.

  3. Download the synthetic pretraining data Genixer_915K.jsonl under the folder data.

  4. Download the mixture finetuning data llava_mix665k_synthetic_8k.jsonl under the folder data.

  5. Run the pretraining script.

bash scripts/pretrain.sh
  6. Run the finetuning script.
bash scripts/finetune.sh

Evaluation on 12 Multimodal Benchmarks

  1. Download llava-Genixer-915K-FT-8K-v1.5-7b under the folder checkpoints.

  2. Follow the data preparation steps from here.

Taking VizWiz as an example, set modelname to the name of the downloaded model and make sure that --model-path and the image folder path point to the correct locations on your machine.

modelname=llava-Genixer-915K-FT-8K-v1.5-7b

python -m llava.eval.model_vqa_loader \
    --model-path exp/$modelname \
    --question-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --image-folder /dataset/lavis/vizwiz/test/ \
    --answers-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1

python scripts/convert_vizwiz_for_submission.py \
    --annotation-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --result-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --result-upload-file ./playground/data/eval/vizwiz/answers_upload/$modelname.json

Genixer with Shikra

Install

cd Genixer_Shikra
conda create -n GenixerS python=3.10
conda activate GenixerS
pip install -r requirements.txt

Model Weights

| Model Name | Checkpoints | Description |
| --- | --- | --- |
| Genixer-shikra-7b | Model weights | Data Generator |
| shikra-Genixer-350K-7b | Model weights | Trained Model |

Image Datasets

Training Data

Download the original annotation data from here and put it under data.

Please refer to the file Genixer_Shikra/config/_base_/dataset/DEFAULT_TRAIN_DATASET.py to replace yourpath with the exact folder path on your machine.

genrecdata=dict(
        type='GenRECDataset',
        filename=r'{{fileDirname}}/../../../data/REC_ref3_train.jsonl',
        image_folder=r'/yourpath/coco2014/train2014',
        template_file=r"{{fileDirname}}/template/GenQA_general_instructions.json",
    ),
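
Because the dataset config contains several path placeholders, a quick check for leftover ones can save a failed run. The helper below is a hypothetical convenience, not part of the repository; it assumes the placeholder is the literal string `yourpath`, as in the snippet above.

```python
def find_placeholders(config_text: str, marker: str = "yourpath") -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for lines that still contain the marker."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        if marker in line:
            hits.append((lineno, line.strip()))
    return hits


if __name__ == "__main__":
    sample = (
        "genrecdata=dict(\n"
        "    type='GenRECDataset',\n"
        "    image_folder=r'/yourpath/coco2014/train2014',\n"
        ")\n"
    )
    for lineno, line in find_placeholders(sample):
        print(f"line {lineno}: {line}")
```

To check the real file, read it with `pathlib.Path(...).read_text()` and pass the result to `find_placeholders`.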

Synthetic Data

We use $\text{Genixer}_S$ to generate two REC-like datasets, syn_lcs_filtered60.jsonl and syn_sbu_filtered60.jsonl, with a total of 350K samples.

Evaluation for $\text{Genixer}_S$

  1. Download the model weight of Genixer-shikra-7b under the folder checkpoints.

  2. Download the vision encoder clip-vit-large-patch14 under the folder checkpoints.

  3. Run the script run_eval_genixer.sh.

accelerate launch --num_processes 8 \
    --main_process_port 23782 \
    mllm/pipeline/finetune.py \
    config/genixer_eval_GenQA.py \
    --cfg-options model_args.model_name_or_path=checkpoints/Genixer-shikra-7b \
    training_args.output_dir=results/Genixer-shikra-7b

Training for $\text{Genixer}_S$

  1. Download the vision encoder clip-vit-large-patch14 under the folder checkpoints.

  2. Download the LLM model weight vicuna-7b-v1.1 under the folder checkpoints.

  3. Download the delta model shikra-7b-delta-v1 of Shikra.

  4. Transform the delta model to shikra-7b-v1.1 with the command bash model_transform.sh.

python mllm/models/models/apply_delta.py \
    --base /yourpath/vicuna-7b-v1.1 \
    --target checkpoints/shikra-7b-v1.1 \
    --delta checkpoints/shikra-7b-delta-v1
  5. Run the stage-1 training script.
bash run_genixer_stage1.sh
  6. Run the stage-2 training script.
bash run_genixer_stage2.sh

Training Shikra with 350K Synthetic Data

  1. Download the vision encoder clip-vit-large-patch14 under the folder checkpoints.

  2. Download the LLM model weight vicuna-7b-v1.1 under the folder checkpoints.

  3. Run the script for the stage-0 pretraining.

bash run_genixer_shikra_stage0.sh
  4. Run the script for the stage-1 pretraining.
bash run_genixer_shikra_stage1.sh
  5. Run the script for the stage-2 pretraining.
bash run_genixer_shikra_stage2.sh

Evaluation on REC Tasks

  1. Download the model shikra-Genixer-350K-7b under the folder checkpoints.

  2. Download the vision encoder clip-vit-large-patch14 under the folder checkpoints.

  3. Run the script bash run_eval_rec.sh.

accelerate launch --num_processes 8 \
    --main_process_port 23782 \
    mllm/pipeline/finetune.py \
    config/eval_multi_rec.py \
    --cfg-options model_args.model_name_or_path=checkpoints/shikra-Genixer-350K-7b \
    training_args.output_dir=results/shikra-Genixer-350K-7b


Fuyu-Driven Data Filtering

The code that uses Fuyu-8B for data filtering is in the file Genixer_LLaVA/fuyudatafiltering/GenQA_filtering_mp.py.

Run the following command for multi-GPU data filtering.

bash scripts/fuyudatafilter.sh

CLIP-Driven REC Data Filtering

CLIP-driven REC data filtering is performed by the script multiprocess_evalclipscore.py. Since it is a Python file, run it with python rather than bash:

python Genixer_Shikra/multiprocess_evalclipscore.py
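
The general idea behind score-based filtering can be sketched as follows: compute a similarity score between an image-region embedding and a text embedding, then keep the samples whose score clears a threshold. The sketch assumes the embeddings were already computed (for instance by a CLIP model) and uses plain cosine similarity. The record layout, the function names, and the 0.6 threshold are illustrative assumptions, not the logic of `multiprocess_evalclipscore.py`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_score(samples: list[dict], threshold: float = 0.6) -> list[dict]:
    """Keep samples whose region/text similarity is at least the threshold."""
    kept = []
    for sample in samples:
        # Assumed fields: precomputed embeddings for the cropped region
        # and for the referring expression.
        score = cosine_similarity(sample["region_embedding"], sample["text_embedding"])
        if score >= threshold:
            kept.append({**sample, "score": score})
    return kept


if __name__ == "__main__":
    demo = [
        {"id": "a", "region_embedding": [1.0, 0.0], "text_embedding": [1.0, 0.0]},
        {"id": "b", "region_embedding": [1.0, 0.0], "text_embedding": [0.0, 1.0]},
    ]
    print([s["id"] for s in filter_by_score(demo)])
```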

๐Ÿ™ Acknowledgement

🎓 Citation

If you find Genixer useful, please cite using this BibTeX:

@misc{zhao2024genixerempoweringmultimodallarge,
      title={Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator}, 
      author={Henry Hengyuan Zhao and Pan Zhou and Mike Zheng Shou},
      year={2024},
      eprint={2312.06731},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2312.06731}, 
}