VLFeedback

A GPT-4V annotated preference dataset for large vision language models.

[Project Page] [Datasets] [Silkie Model] [Paper]
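For a quick look at the preference data, you can browse it with the Hugging Face datasets library. The snippet below is only an illustration: the dataset ID and split name are assumptions, so check the [Datasets] link above for the exact repository and schema.

```python
# Illustrative only: browse the VLFeedback preference data with Hugging Face `datasets`.
# The dataset ID and split name below are assumptions; see the [Datasets] link for the
# exact repository and record schema.
from datasets import load_dataset

ds = load_dataset("MMInstruction/VLFeedback", split="train")  # assumed ID / split
print(ds)        # number of records and column names
print(ds[0])     # one preference record with its GPT-4V annotations
```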

Annotation Framework

<img src="imgs/annotate_framework.png" width="800px">

Multimodal Instruction Source

The instructions are sampled from various domains to cover the different capabilities of LVLMs.

<img src="imgs/instruction_source.png" width="800px">

Model Pool

We construct a model pool consisting of 12 LVLMs.

Silkie

We select Qwen-VL-Chat as the backbone model and perform DPO on our dataset.

<div align="center"> <img src="imgs/silkie.png" alt="Silkie Logo" width="128px"> <p>Generated by <a href="https://openai.com/dall-e-3">DALL·E 3</a></p> </div>
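For reference, the core of DPO training is a simple contrastive objective over the preferred and rejected responses of each preference pair. Below is a minimal, self-contained sketch of that loss in PyTorch; it illustrates the objective only and is not the repository's exact implementation (which builds on trl).

```python
# Minimal sketch of the DPO objective on preference pairs (chosen vs. rejected).
# Illustration only; the actual training code in this repo builds on trl.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is a tensor of summed log-probabilities of the chosen /
    rejected responses under the policy or the frozen reference model."""
    # Implicit rewards: log-ratio of policy vs. reference for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss that widens the margin between chosen and rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with dummy log-probabilities:
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-14.9]))
```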

The resulting model, Silkie, achieves comprehensive improvements across various benchmarks:

<img src="imgs/silkie_ret.png" width="800px">

Installation

To run our training scripts, create a virtual environment and install the dependencies first.

conda create -n silkie python=3.10  && conda activate silkie
pip install -r requirements.txt

Training

Our training scripts support both single-node and multi-node training. We provide a launch_dpo.py script that handles both cases. If you want to launch a job locally, you can use:

python launch_dpo.py --config dpo_config/example.yaml --working $WORKING_DIR

If you want to launch a job on a Slurm cluster, specify GPUS_PER_NODE in launch_dpo.py and run:

python launch_dpo.py --config dpo_config/example.yaml --working $WORKING_DIR --gpus $NUM_GPUS
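For intuition, the snippet below sketches how a launcher like this typically dispatches either a local torchrun command or a Slurm srun job based on the `--gpus` flag. It is a simplified illustration with assumed details (the `train_dpo.py` entry point is hypothetical); refer to launch_dpo.py itself for the actual behavior.

```python
# Illustrative sketch only: how a launcher such as launch_dpo.py might dispatch
# local vs. Slurm jobs. The entry point, flags, and layout are assumptions.
import argparse
import subprocess

GPUS_PER_NODE = 8  # set this to match the nodes on your cluster

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    parser.add_argument("--working", required=True)
    parser.add_argument("--gpus", type=int, default=0, help="total GPUs for a Slurm job")
    args = parser.parse_args()

    if args.gpus == 0:
        # Local single-node run: torchrun spawns one process per GPU on this machine.
        cmd = ["torchrun", f"--nproc_per_node={GPUS_PER_NODE}",
               "train_dpo.py", "--config", args.config, "--output_dir", args.working]
    else:
        # Slurm multi-node run: request enough nodes and let srun place the tasks.
        nodes = max(1, args.gpus // GPUS_PER_NODE)
        cmd = ["srun", f"--nodes={nodes}", f"--gres=gpu:{GPUS_PER_NODE}",
               "python", "train_dpo.py", "--config", args.config,
               "--output_dir", args.working]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    main()
```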

Citations

@article{2023vlfeedback,
  author  = {Lei Li and Zhihui Xie and Mukai Li and Shunian Chen and Peiyi Wang and Liang Chen and Yazheng Yang and Benyou Wang and Lingpeng Kong},
  title   = {Silkie: Preference Distillation for Large Visual Language Models},
  journal = {arXiv preprint arXiv:2312.10665},
  year    = {2023}
}

Acknowledgements

We would like to thank the authors of trl and Qwen-VL for their great work.