Home

Awesome

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

<div align="center"> <img width="20%" src="images/logo_plus.png"> </div>

Yan Zeng*, Hanbo Zhang*, Jiani Zheng*, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, Tao Kong
*Equal Contribution

Project Paper

update

Lynx (8B parameters):

<div align="center"> <img width="70%" src="images/lynx.png"> </div>

results on Open-VQA image testsets

<div align="center"> <img width="70%" src="images/open_vqa_image_result.png"> </div>

results on Open-VQA video testsets && OwlEval human eval && MME benchmark

<div align="center"> <img width="70%" src="images/result_other.png"> </div>

ablation result

<div align="center"> <img width="70%" src="images/ablation.png"> </div>

Quick Start

environment

conda env create -f environment.yml
conda activate lynx

prepare data

step 1: prepare annotation file

Open-VQA annotations file is under the path data/Open_VQA_images.jsonl and data/Open_VQA_videos.jsonl, there is an example:

{
  "dataset": "Open_VQA_images", # the dataset name of your data
  "question": "What is in the image?", 
  "answer": ["platform or tunnel"], # list
  "index": 1, 
  "image": "images/places365/val_256/Places365_val_00000698.jpg", # relative path of image
  "origin_dataset": "places365", 
  "class": "Place", # eight image VQA types and two video VQA types correspond to the open_VQA dataset
}

You can also convert your own data in jsonl format, the keys origin_dataset and class are optional.

step 2: prepare images

Download raw images from corresponding websites: Places365(256x256), VQAv2, OCRVQA, Something-Something-v.2, MSVD-QA, NeXT-QA and MSRVTT-QA.

step 3: modify the default setting in the code

You need the check some import settings in the configs configs/LYNX.yaml, for example:

# change this prompt for different task, this is the default prompt
prompt: "User: {question}\nBot:"
# the key must match the vision key in test_files
# if you test Open_VQA_videos.jsonl, need to change to "video"
vision_prompt_dict: "image"
output_prompt_dict: "answer"

prepare checkpoint

organize the files like this:

lynx-llm/
    data/
        Open_VQA_images.jsonl
        Open_VQA_videos.jsonl
        eva_vit_g.pth
        vicuna-7b/
        finetune_lynx.pt
        pretrain_lynx.pt
    images/
        vqav2/val2014/*.jpg
        places365/val_256/*.jpg
        ocrvqa/images/*.jpg
        sthsthv2cap/val/*.mp4
        msvdqa/test/*.mp4
        nextqa/*.mp4
        msrvttqa/*.mp4

infer

sh generate.sh

Citation

If you find this repository useful, please considering giving ⭐ or citing:

@article{zeng2023matters,
  title={What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?},
  author={Zeng, Yan and Zhang, Hanbo and Zheng, Jiani and Xia, Jiangnan and Wei, Guoqiang and Wei, Yang and Zhang, Yuchen and Kong, Tao},
  journal={arXiv preprint arXiv:2307.02469},
  year={2023}
}

Contact

For issues using this code, please submit a GitHub issue.

License

This project is licensed under the Apache-2.0 License.