


A GPT-4V annotated preference dataset for large vision language models.

Annotation Framework

<img src="imgs/annotate_framework.png" width="800px">

Multimodal Instruction Source

The instructions are sampled from various domains to cover different capabilities of LVLMs.

<img src="imgs/instruction_source.png" width="800px">

Model Pool

We construct a model pool consisting of 12 LVLMs.


We select Qwen-VL-Chat as the backbone model and perform DPO on our dataset.
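
For readers unfamiliar with DPO (Direct Preference Optimization), the per-pair objective can be sketched in a few lines of plain Python. This is a minimal illustration of the standard DPO loss, not the code used by our training scripts. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example only.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an assumed value, not one taken from this repository.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) computed stably as softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen response's log-probability lowers the loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

The loss decreases as the policy assigns relatively more probability to the preferred response than the reference model does, which is the signal the preference annotations provide.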

The resulting model, Silkie, achieves comprehensive improvements on various benchmarks.

<img src="imgs/silkie_ret.png" width="800px">


To run our training scripts, create a virtual environment and install the dependencies first.

conda create -n silkie python=3.10 && conda activate silkie
pip install -r requirements.txt


Our training scripts support both single-node and multi-node training. We provide a launch_dpo.py script that handles both cases. If you want to launch a job locally, you can use:

python launch_dpo.py --config dpo_config/example.yaml --working $WORKING_DIR

If you want to launch a job on a Slurm cluster, specify GPUS_PER_NODE in launch_dpo.py and run:

python launch_dpo.py --config dpo_config/example.yaml --working $WORKING_DIR --gpus $NUM_GPUS


  author      = {Lei Li and Zhihui Xie and Mukai Li and Shunian Chen and Peiyi Wang and Liang Chen and Yazheng Yang and Benyou Wang and Lingpeng Kong},
  title       = {Silkie: Preference Distillation for Large Visual Language Models},
  publisher   = {arXiv:2312.10665},
  year        = {2023}


We would like to thank the authors of trl and Qwen-VL for their great work.