# VLFeedback
A GPT-4V annotated preference dataset for large vision language models.
[Project Page] [Datasets] [Silkie Model] [Paper]
## Annotation Framework
<img src="imgs/annotate_framework.png" width="800px">Multimodal Instruciton Source
The instructions are sampled from various domains to cover different capabilities of LVLMs.
<img src="imgs/instruction_source.png" width="800px">Model Pool
We construct a model pool consisting of 12 LVLMs:

- GPT-4V
- LLaVA series
  - LLaVA-v1.5-7B
  - LLaVA-v1.5-13B
  - LLaVA-RLHF-7b-v1.5-224
  - LLaVA-RLHF-13b-v1.5-336
- Qwen-VL-7B
- IDEFICS-9b-Instruct
- Fuyu-8B
- InstructBLIP series
  - InstructBLIP-Vicuna-7B
  - InstructBLIP-Vicuna-13B
- VisualGLM-6B
- MMICL-Vicuna-13B
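
To construct preference data, each multimodal instruction is answered by models drawn from this pool, and GPT-4V rates the responses along aspects such as helpfulness and visual faithfulness. The snippet below is a hypothetical sketch of what one such annotation could look like; the field names and aspect keys are illustrative assumptions, not the released schema.

```python
# Hypothetical sketch of one VLFeedback-style annotation, built from responses
# sampled from the model pool above and rated by GPT-4V. Field names and the
# rating aspects shown here are illustrative assumptions, not the released schema.
example = {
    "id": "vlfeedback-000001",
    "prompt": "Describe what is unusual about this image.",
    "image": "images/000001.jpg",
    "completions": [
        {
            "model": "llava-v1.5-13b",
            "response": "A man is ironing clothes on a board attached to the roof of a moving taxi.",
            "scores": {"helpfulness": 5, "visual_faithfulness": 5, "ethics": 5},
        },
        {
            "model": "fuyu-8b",
            "response": "A man is standing next to a yellow taxi on a busy street.",
            "scores": {"helpfulness": 2, "visual_faithfulness": 2, "ethics": 5},
        },
    ],
}
```

For preference learning, the higher- and lower-rated responses to a prompt can then be paired into (chosen, rejected) examples.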
## Silkie
We select Qwen-VL-Chat as the backbone model and perform direct preference optimization (DPO) on our dataset.
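
For context, DPO fine-tunes the policy directly on preference pairs, without training a separate reward model. The snippet below is a minimal PyTorch sketch of the DPO objective, assuming sequence-level log-probabilities have already been computed for the chosen and rejected responses under the policy and a frozen reference model; it is illustrative only and not the trainer code in this repository (see the Training section below for the actual scripts).

```python
# Minimal sketch of the DPO objective on sequence-level log-probabilities;
# illustrative only, not this repository's trainer implementation.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss for a batch of preference pairs."""
    # Implicit rewards: log-ratio of the policy to the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage the policy to rank the preferred response above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```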
<div align="center"> <img src="imgs/silkie.png" alt="Silkie Logo" width="128px"> <p>Generated by <a href="https://openai.com/dall-e-3">DALL·E 3</a></p> </div>The resulting model, Silkie, achieves comprehensive improvements on various benchmarks
<img src="imgs/silkie_ret.png" width="800px">Installation
To run our training scripts, create a virtual environment and install the dependencies first:

```bash
conda create -n silkie python=3.10 && conda activate silkie
pip install -r requirements.txt
```
## Training
Our training scripts support both single-node and multi-node training.
We provide a `launch_dpo.py` script that handles both cases. If you want to launch a job locally, you can use:

```bash
python launch_dpo.py --config dpo_config/example.yaml --working $WORKING_DIR
```
If you want to launch a job on a Slurm cluster, specify `GPUS_PER_NODE` in `launch_dpo.py` and run:

```bash
python launch_dpo.py --config dpo_config/example.yaml --working $WORKING_DIR --gpus $NUM_GPUS
```
## Citations
```bibtex
@article{2023vlfeedback,
  author    = {Lei Li and Zhihui Xie and Mukai Li and Shunian Chen and Peiyi Wang and Liang Chen and Yazheng Yang and Benyou Wang and Lingpeng Kong},
  title     = {Silkie: Preference Distillation for Large Visual Language Models},
  publisher = {arXiv:2312.10665},
  year      = {2023}
}
```
## Acknowledgements
We would like to thank the authors of trl and Qwen-VL for their great work.