SQ-LLaVA: Self-questioning for Vision-Language Assistant

<em> In broad real-world scenarios, proactively asking a question requires more understanding and background knowledge than answering.</em>

<strong> SQ-LLaVA: Self-questioning for Vision-Language Assistant </strong> [paper]

Guohao Sun, Can Qin, Jiamian Wang, Zeyuan Chen, Ran Xu, Zhiqiang Tao

<p align="center"> <img src="./images/1-1.png" width="500px"> <br> A high-level comparison between visual instruction tuning and visual self-questioning (ours) for a vision-language assistant. </p>

🔥 News

Contents

Install

  1. Install Package
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
cd SQ-LLaVA
pip install -e .
  2. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
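
A quick, optional sanity check, assuming the editable install exposes the package under the name llava as in upstream LLaVA:

# Optional check (assumption: the package is named `llava`, as in upstream LLaVA)
python -c "import llava"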

Demo (visual self-questioning)

To test visual self-questioning, please run run_sq.sh with the following settings.

  CUDA_VISIBLE_DEVICES=0 python visual_questioning.py \
--model_path path/to/sqllava-v1.7-7b-lora-gpt4v-cluster-sq-vloraPTonly \
  --model_base Lin-Chen/ShareGPT4V-7B_Pretrained_vit-large336-l12_vicuna-7b-v1.5 \
  --conv-mode="v1_sq" \
  --lora_pretrain path/to/sqllava-v1.7-7b-lora-gpt4v-cluster-sq-vloraPTonly \
  --n_shot 3

Data

Data file name | Size
--- | ---
sharegpt4v_instruct_gpt4-vision_cap100k.json | 134 MB
share-captioner_coco_lcs_sam_1246k_1107.json | 1.5 GB
sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json | 1.2 GB
LLaVA | 400 MB

Prepare Images

For your convenience, please follow download_data.sh for data preparation.

Then, organize the data as follows in ./mixTraindata:

Visual-self-qa
├── ...
├── mixTraindata
│   ├── llava
│   │   ├── llava_pretrain
│   │   │   ├── images
│   ├── coco
│   │   ├── train2017
│   ├── sam
│   │   ├── images
│   ├── gqa
│   │   ├── images
│   ├── ocr_vqa
│   │   ├── images
│   ├── textvqa
│   │   ├── train_images
│   ├── vg
│   │   ├── VG_100K
│   │   ├── VG_100K_2
│   ├── share_textvqa
│   │   ├── images
│   ├── web-celebrity
│   │   ├── images
│   ├── web-landmark
│   │   ├── images
│   ├── wikiart
│   │   ├── images
│   ├── share-captioner_coco_lcs_sam_1246k_1107.json
│   ├── sharegpt4v_instruct_gpt4-vision_cap100k.json
│   ├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
│   ├── blip_laion_cc_sbu_558k.json
│   ├── llava_v1_5_mix665k.json
├── ...

Train

Training consists of two stages: (1) a feature alignment stage; and (2) a visual self-questioning instruction tuning stage, which teaches the model to ask questions and follow multimodal instructions.

To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
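
For example (illustrative values, not the exact flags in the released scripts), the pretraining global batch size of 256 can be kept fixed when moving from 8 GPUs to 4 GPUs by doubling the accumulation steps:

# 8 GPUs: 32 per device x 1 accumulation step  x 8 GPUs = 256
# 4 GPUs: 32 per device x 2 accumulation steps x 4 GPUs = 256
--per_device_train_batch_size 32 \
--gradient_accumulation_steps 2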

Hyperparameters

The hyperparameters used in both pretraining and finetuning are provided below.

  1. Pretraining

Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay
--- | --- | --- | --- | --- | ---
SQ-LLaVA | 256 | 1e-3 | 1 | 2048 | 0

  2. Finetuning

Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay
--- | --- | --- | --- | --- | ---
SQ-LLaVA | 128 | 2e-4 | 1 | 2048 | 0
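
As a rough guide, these table values correspond to the standard Hugging Face TrainingArguments-style flags used by LLaVA-family training code; the released scripts (pretrain.sh, finetune_lora_clu_sq.sh) remain the authoritative reference. A minimal sketch for the pretraining stage, assuming that interface:

# Illustrative pretraining flags mirroring the table above (assumed LLaVA-style interface)
--learning_rate 1e-3 \
--num_train_epochs 1 \
--model_max_length 2048 \
--weight_decay 0.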

Pretraining (feature alignment)

Training script with DeepSpeed ZeRO-2: pretrain.sh.

Visual Instruction Tuning

Instruction tuning: training script with DeepSpeed ZeRO-3 and LoRA: finetune_lora_clu_sq.sh.

Evaluation

Prepare data

Please download the raw images of the datasets (COCO, Flickr, nocaps, conceptual) used for the image captioning tasks.

  1. Evaluate models on image captioning over the 4 datasets above; see captioning.sh.
  2. Evaluate models on a diverse set of 12 benchmarks. To ensure reproducibility, we evaluate with greedy decoding rather than beam search, keeping inference consistent with the real-time outputs of the chat demo (illustrative decoding flags are sketched below).
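
LLaVA-style evaluation scripts usually control decoding with temperature and beam-count flags; assuming SQ-LLaVA keeps that interface (check the individual scripts referenced in Evaluation.md), greedy decoding corresponds to:

# Illustrative decoding flags for greedy evaluation (assumed LLaVA-style interface)
--temperature 0 \
--num_beams 1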

See Evaluation.md.

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{sun2024sq,
  title={SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant},
  author={Sun, Guohao and Qin, Can and Wang, Jiamian and Chen, Zeyuan and Xu, Ran and Tao, Zhiqiang},
  year = {2024},
  booktitle = {ECCV},
}

Acknowledgement