IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models.

Haoxuan You*, Rui Sun*, Zhecan Wang*, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang

[*: equal contribution]

Paper

Demo

Contents

- Installation
- Dataset
- Run
- Evaluation
- Cite

Installation

Clone our repository and create a new Python environment via the following commands:

git clone https://github.com/Hxyou/IdealGPT.git
cd IdealGPT
conda env create -f environment.yml
conda activate idealgpt
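
To quickly verify the environment, you can check that the OpenAI client library imports cleanly (a minimal sanity check; it assumes the openai package is listed in environment.yml):

# Sanity check: confirm the OpenAI client library was installed with the environment.
python -c "import openai; print(openai.__version__)"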

If you would like to use LLaVA or MiniGPT-4 to answer sub-questions, please install them as described in their respective repositories.
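
For reference, a minimal sketch of fetching the two upstream repositories (follow each project's own README for the actual installation steps):

# Clone the upstream repositories; see their READMEs for setup details.
git clone https://github.com/haotian-liu/LLaVA.git
git clone https://github.com/Vision-CAIR/MiniGPT-4.git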

Dataset

In our paper, we conduct experiments on SNLI-VE and VCR. Please refer to their websites for instructions on how to download the data.
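
For orientation, here is a hypothetical layout of the downloaded data under your dataset path (the directory and file names below are assumptions; adjust them to match how you unpack each dataset):

/your/dataset/path/
    vcr1images/     (VCR images; name is an assumption)
    val.jsonl       (VCR validation annotations; name is an assumption)
    snli_ve/        (SNLI-VE images and annotations; name is an assumption)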

Run

NOTE:
  1. If you would like to run our code, please replace the file paths with your own.
  2. You need an OpenAI API key to call the OpenAI API. More details can be found on the OpenAI platform.
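
For example, you can keep the key in an environment variable and pass it through on the command line (the variable name is just a convention; blip_gpt_main.py reads the key from its --openai_key argument):

# Store the key once per shell session, then pass it to the script.
export OPENAI_API_KEY=<your_openai_key>
python blip_gpt_main.py --openai_key=$OPENAI_API_KEY ...  # remaining arguments as shown below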

To save cost and running time, you can first randomly sample 500 examples from the val/dev split of VCR or SNLI-VE (--dataset can be vcr_val or ve_dev):

cd misc
python sample_data.py --dataset=vcr_val
cd ..
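
The same script covers SNLI-VE, for example:

cd misc
python sample_data.py --dataset=ve_dev
cd ..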

Then, you can use IdealGPT to run inference. Here is an example of zero-shot VCR:

python blip_gpt_main.py  \
    --data_root=/your/dataset/path \
    --exp_tag=vcr_05241522 \
    --dataset=vcr_val \
    --device_id=0 \
    --prompt_setting=v1a \
    --data_partition=0_499 \
    --vqa_model=blip2_t5_xl  \
    --temp_gpt=0.0  \
    --data_subset=/your/selected/data/subset/path \
    --openai_key=<your_openai_key>

You can replace vcr_val with ve_dev to obtain SNLI-VE results.
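
For example, the SNLI-VE counterpart of the command above would look like this (the exp_tag is arbitrary; the remaining flags are unchanged):

python blip_gpt_main.py  \
    --data_root=/your/dataset/path \
    --exp_tag=ve_05241522 \
    --dataset=ve_dev \
    --device_id=0 \
    --prompt_setting=v1a \
    --data_partition=0_499 \
    --vqa_model=blip2_t5_xl  \
    --temp_gpt=0.0  \
    --data_subset=/your/selected/data/subset/path \
    --openai_key=<your_openai_key>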

Evaluation

We employ accuracy to evaluate zero-shot performance on VCR and SNLI-VE.

  1. VCR
python vcr_eval.py --result=[path of saved VCR result folder, named by exp_tag in run]
  2. SNLI-VE
python ve_eval.py --result=[path of saved SNLI-VE result folder, named by exp_tag in run]
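
For instance, using the exp_tag from the run example above (the result/ parent directory is an assumption; point --result at wherever blip_gpt_main.py saved its outputs):

# Evaluate the VCR run tagged vcr_05241522 (folder location is an assumption).
python vcr_eval.py --result=result/vcr_05241522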

Cite

If you are interested in our work, please cite the paper as follows:

@misc{you2023idealgpt,
      title={IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models}, 
      author={Haoxuan You and Rui Sun and Zhecan Wang and Long Chen and Gengyu Wang and Hammad A. Ayyubi and Kai-Wei Chang and Shih-Fu Chang},
      year={2023},
      eprint={2305.14985},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}