Analyzing Modular Approaches for Visual Question Decomposition

Apoorv Khandelwal, Ellie Pavlick, and Chen Sun

EMNLP 2023

[arxiv] [anthology]


Abstract

Modular neural networks without additional training have recently been shown to surpass end-to-end neural networks on challenging vision-language tasks. The latest such methods simultaneously introduce LLM-based code generation to build programs and a number of skill-specific, task-oriented modules to execute them. In this paper, we focus on ViperGPT and ask where its additional performance comes from and how much is due to the (state-of-the-art, end-to-end) BLIP-2 model it subsumes vs. additional symbolic components. To do so, we conduct a controlled study (comparing end-to-end, modular, and prompting-based methods across several VQA benchmarks). We find that ViperGPT's reported gains over BLIP-2 can be attributed to its selection of task-specific modules, and when we run ViperGPT with a more task-agnostic selection of modules, these gains disappear. Additionally, ViperGPT retains much of its performance even under prominent alterations to its selection of modules: e.g., removing BLIP-2 or retaining only BLIP-2. Finally, we compare ViperGPT against a prompting-based decomposition strategy and find that, on some benchmarks, modular approaches benefit significantly from representing subtasks in natural language rather than code.


Installation

Hardware requirements

Setup

You must run the following commands on your GPU machine, as certain dependencies require CUDA compilation. We highly recommend using the much faster micromamba as a nearly drop-in replacement for conda (see the sketch after the commands below).

conda env create -f conda-lock.yml --prefix ./.venv
conda activate ./.venv
pdm install
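
If you opt for micromamba, the equivalent invocation should look roughly like the following (a sketch, assuming micromamba is installed and its shell hook is initialized; recent micromamba versions accept conda-lock files, but check yours):

micromamba create -f conda-lock.yml --prefix ./.venv
micromamba activate ./.venv
pdm install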

Environment Variables

You can adjust the environment variables in .env. If you make changes, run conda activate ./.venv again to reload these variables.
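
For orientation, such a file might set cache paths and credentials as in the illustrative sketch below (values are placeholders, and the actual .env in this repo may define a different set of variables):

# illustrative .env contents; adjust paths to your machine
TORCH_HOME=~/.cache/torch       # Viper models download to $TORCH_HOME/hub/viper
HF_HOME=~/.cache/huggingface    # ScienceQA is cached under $HF_HOME/datasets
OPENAI_API_KEY=...              # required for GPT-based evaluation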

Download Viper models

# download models to `$TORCH_HOME/hub/viper` (usually `~/.cache/torch/hub/viper`)
python -m viper.download_models

Download datasets

# download all datasets
python -m src.data.download

# download a specific dataset
python -m src.data.download --dataset {vqav2,gqa,okvqa,aokvqa,coco,scienceqa}

## coco is required for vqav2, okvqa, and aokvqa
## scienceqa is saved to $HF_HOME/datasets/derek-thomas___science_qa

Running experiments

Don't forget to conda activate ./.venv first.

Run the core experiments (with default settings from our paper):

python experiments/vqa.py \
dataset:{vqav2,gqa,okvqa,aokvqa,scienceqa} \
method:{e2e,viper,successive}

This repo uses AI2 Tango for experiment tracking and caching.
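
For orientation, a cacheable Tango step generally looks like the sketch below (generic usage of Tango's documented Step API, not code from this repo). Tango caches each step's result, so re-running an experiment with identical inputs reuses prior outputs instead of recomputing them:

from tango import Step

@Step.register("uppercase")
class UppercaseStep(Step):
    DETERMINISTIC = True  # identical inputs always produce identical outputs
    CACHEABLE = True      # so Tango may reuse cached results across runs

    def run(self, text: str) -> str:
        return text.upper()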

Additional Settings

Explore the --help menus for additional settings!

# For LLM evaluation options
python experiments/vqa.py --help

# For dataset arguments
python experiments/vqa.py dataset:<...> --help

# For method arguments
python experiments/vqa.py dataset:<...> method:<...> --help

Example:

python experiments/vqa.py --gpt-eval-model text-davinci-003 dataset:vqav2 --dataset.split val2014 --dataset.n 5 method:e2e --method.model-type blip2-flan-t5-xxl

# Output

┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ instructgpt_acc ┃ vqav2_acc ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ 80.0            │ 80.0      │
└─────────────────┴───────────┘

Deprecated OpenAI Models

Unfortunately, the default GPT models (code-davinci-002, text-davinci-002, text-davinci-003) used in ViperGPT and our paper are (or will shortly be) deprecated. Moreover, the legacy Completions API is critical to several functions of this repository. You may work around these restrictions by specifying different GPT models and adjusting the prompts appropriately (e.g. see this chat prompt for ViperGPT), but your mileage may vary. For reproducibility and best practices, we strongly recommend using open-source LLMs in your future research.
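
As one illustration, here is a minimal sketch of querying a chat model through the current OpenAI Python client (openai>=1.0). The model name is a placeholder, and you would still need to adapt this repo's Completions-style prompts to the chat format:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder: any non-deprecated chat model
    messages=[{"role": "user", "content": "Q: What color is the sky? A:"}],
)
print(response.choices[0].message.content)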

Citation

@inproceedings{khandelwal2023:vqd,
    title        = {Analyzing Modular Approaches for Visual Question Decomposition},
    author       = {Apoorv Khandelwal and Ellie Pavlick and Chen Sun},
    year         = {2023},
    month        = {December},
    booktitle    = {Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
    pages        = {2590--2603}
}