LVIS-INSTRUCT4V

Introduction

We introduce a fine-grained visual instruction dataset, LVIS-INSTRUCT4V, which contains 220K visually aligned and context-aware instructions produced by prompting the powerful GPT-4V with images from LVIS. Please refer to our arXiv paper for more details.

Usage

Please follow the LLaVA repository to set up the code and environment.

LVIS-INSTRUCT4V is available at LVIS-INSTRUCT4V. To achieve better results on the QA benchmarks, we follow LLaVA-1.5 and mix LVIS-INSTRUCT4V with academic-task-related data (see Tables 1 and 7 in the LLaVA-1.5 paper); the mixed data can be found at LVIS-INSTRUCT4V-Nodetail-mix619k, LVIS-INSTRUCT4V-mix730k, and LVIS-Instruct4V-LLaVA-Instruct-mix880k.
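
After downloading, the instruction data can be sanity-checked with a few lines of Python. This is a minimal sketch that assumes the release is a single JSON file in the LLaVA conversation format (a list of records with `image` and `conversations` fields); the filename below is a placeholder, so substitute the actual file from the download.

```python
import json

# Placeholder filename; use the actual file from the LVIS-INSTRUCT4V release.
with open("lvis_instruct4v_220k.json") as f:
    data = json.load(f)

print(f"{len(data)} instruction samples")

# Each record is expected to follow the LLaVA conversation format:
# an image path plus alternating human/gpt turns.
sample = data[0]
print(sample["image"])
for turn in sample["conversations"]:
    print(f"{turn['from']}: {turn['value'][:80]}")
```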

Model Zoo

| Version | Data | Size | Schedule | Checkpoint | VQAv2 | GQA | VizWiz | SQA | T-VQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5 | LVIS-Instruct4V-Nodetail-mix619k | 7B | full_ft-1e | LVIS-Instruct4V-Nodetail-mix619k-7b | 79.2 | 62.6 | 52.5 | 68.4 | 57.6 | 84.0 | 1472.9 | 67.1 | 60.0 | 60.8 | 70.4 | 34.6 |
| LLaVA-1.5 | LVIS-Instruct4V-mix730k | 7B | full_ft-1e | LVIS-Instruct4V-mix730k-7b | 79.4 | 62.6 | 52.6 | 68.9 | 58.4 | 85.1 | 1495.3 | 66.6 | 59.6 | 60.5 | 67.1 | 33.3 |
| LLaVA-1.5 | LVIS-Instruct4V-LLaVA-Instruct-mix880k | 7B | full_ft-1e | LVIS-Instruct4V-LLaVA-Instruct-mix880k-7b | 79.6 | 62.6 | 51.8 | 68.3 | 58.7 | 86.0 | 1528.2 | 66.2 | 60.4 | 60.6 | 67.0 | 31.5 |
| LLaVA-1.5¹ | LVIS-Instruct4V-Nodetail-mix619k | 13B | full_ft-1e | LVIS-Instruct4V-Nodetail-mix619k-13b | 80.1 | 63.8 | 51.4 | 69.0 | 62.1 | 85.3 | 1572.0 | 67.8 | 61.0 | 62.5 | 76.7 | 40.2 |
| LLaVA-1.5 | LVIS-Instruct4V-LLaVA-Instruct-mix880k | 13B | full_ft-1e | LVIS-Instruct4V-LLaVA-Instruct-mix880k-13b | 80.7 | 63.6 | 57.2 | 70.6 | 62.5 | 86.0 | 1574.9 | 68.0 | 61.1 | 61.6 | 71.3 | 37.4 |
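
Since the released checkpoints are standard LLaVA-1.5 models, they can be queried programmatically with the LLaVA codebase. The sketch below follows the inference example from the LLaVA README; the model path and image file are placeholders, so substitute the checkpoint link from the table above and your own image.

```python
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

# Placeholder path; substitute a checkpoint from the Model Zoo table.
model_path = "path/to/LVIS-Instruct4V-Nodetail-mix619k-7b"

# Argument container expected by LLaVA's evaluation helper.
args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": "Describe the objects in this image.",
    "conv_mode": None,
    "image_file": "view.jpg",  # placeholder image path
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,  # the footnote below uses beam = 3 for 13B models on TextVQA
    "max_new_tokens": 512,
})()

eval_model(args)
```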

Reference

If you find our work useful for your research or applications, please cite using this BibTeX:

```bibtex
@article{wang2023instruct4v,
  title={To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning},
  author={Wang, Junke and Meng, Lingchen and Weng, Zejia and He, Bo and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2311.07574},
  year={2023}
}
```

Acknowledgement

We thank the authors of LLaVA for their contribution to the open-source community.

Footnotes

  1. We find TextVQA is sensitive to the number of beams, so for the 13B models we use beam = 3 on TextVQA.