[CVPR 2024 CVinW] This is the official PyTorch implementation of the paper "Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering".

Key idea: What if a large foundation model fails at VQA? Should we fine-tune it on our VQA dataset or an object detection dataset? No, we should use tools, and tools are experts in their fields.

This work explores the zero-shot capabilities of foundation models in Visual Question Answering (VQA) tasks. We propose an adaptive multi-agent system, named Multi-Agent VQA, to overcome the limitations of foundation models in object detection and counting by using specialized agents as tools.

Existing approaches rely heavily on fine-tuning their models on specific VQA datasets with a vocabulary of size 3k. Our study instead evaluates the system without fine-tuning it on any specific VQA dataset, making it more practical and robust in the open world. We present preliminary experimental results under zero-shot scenarios and highlight some failure cases, offering new directions for future research. A full paper will be released soon.

<p align="center"> <img src=pipeline.png /> </p>

Disclaimer

This README covers all the functionalities mentioned in the paper, and they should work well. However, this repository is still under development, and we currently support only GPT-4V and Gemini Pro Vision as our large vision-language models. Although the repository contains code for other models or functionalities, that code is either incomplete or has not been thoroughly tested yet. Feel free to submit an issue.

TODOs

Citation

If you believe our work has inspired your research, please kindly cite our work. Thank you!

  @inproceedings{jiang2024multi,
    title={Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering},
    author={Jiang, Bowen and Zhuang, Zhijun and Shivakumar, Shreyas S and Roth, Dan and Taylor, Camillo J},
    booktitle={arXiv preprint arXiv:2403.14783},
    year={2024}
  }

Environment

There are two options for setting up the required environment.

Dataset

Due to the cost and time requirements of the GPT-4V API, we have to evaluate on a subset of the data. The test set of VQA-v2 is not publicly available and requires exact matches of the answers, which makes open-world answers and LLM-based graders inapplicable. We instead adopt the VQA-v2 rest-val dataset, the validation split used in BEiT-3 and VLMo that was never used for training. It contains 5228 unique image-question pairs. For GQA, we test on the same 1000 validation samples used in ELEGANT.
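To illustrate why exact matching penalizes open-world answers, here is a toy sketch. It uses the simplified form of the commonly cited VQA accuracy, min(number of matching human answers / 3, 1), which is an assumption here and not the official evaluation script; the question and the human answers are made up for the example.

```python
from collections import Counter


def exact_match_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA-style accuracy: min(#matching human answers / 3, 1)."""
    counts = Counter(a.strip().lower() for a in human_answers)
    return min(counts[prediction.strip().lower()] / 3.0, 1.0)


# Ten hypothetical human annotations for "How many dogs are in the picture?"
human_answers = ["2"] * 8 + ["two"] * 2

# A short answer from the fixed vocabulary matches most annotators.
print(exact_match_accuracy("2", human_answers))                    # 1.0
# A correct but free-form answer matches none of them.
print(exact_match_accuracy("There are two dogs.", human_answers))  # 0.0
```

The second answer is semantically correct but scores zero, which is the kind of case an LLM-based grader is meant to handle.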

According to the instructions, you need to modify the source code and generate the index JSON files for the dataset, so we provide the modified code in this forked repository. Make sure you obtain the file vqa.rest_val.jsonl.

Our code accepts the data formats of v2_OpenEnded_mscoco_train2014_questions.json (the question file) and v2_mscoco_train2014_annotations.json (the annotation file), so we provide the script utils_func/find_matched_rest_val.py to convert vqa.rest_val.jsonl into v2_OpenEnded_mscoco_rest_val2014_questions and v2_mscoco_rest_val2014_annotations.json. You can also download them directly by clicking on their names here.
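The conversion described above can be pictured as filtering the official VQA-v2 question and annotation files by the question ids listed in vqa.rest_val.jsonl. The following is a minimal sketch of that idea, not the contents of utils_func/find_matched_rest_val.py: the `qid` field name in the JSON-lines file and both function names are assumptions made for illustration.

```python
import json


def load_rest_val_qids(jsonl_path):
    """Collect question ids from a JSON-lines index file (one JSON object per line)."""
    qids = set()
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if line:
                qids.add(json.loads(line)["qid"])  # "qid" is an assumed field name
    return qids


def filter_by_qids(questions, annotations, qids):
    """Keep only the questions and annotations whose question_id is in qids."""
    kept_q = dict(questions)  # shallow copy so other top-level keys are preserved
    kept_q["questions"] = [q for q in questions["questions"] if q["question_id"] in qids]
    kept_a = dict(annotations)
    kept_a["annotations"] = [a for a in annotations["annotations"] if a["question_id"] in qids]
    return kept_q, kept_a


# Tiny in-memory example in the VQA-v2 question/annotation layout.
questions = {"questions": [
    {"question_id": 1, "image_id": 42, "question": "What color is the cat?"},
    {"question_id": 2, "image_id": 42, "question": "How many cats are there?"},
]}
annotations = {"annotations": [
    {"question_id": 1, "image_id": 42, "multiple_choice_answer": "black"},
    {"question_id": 2, "image_id": 42, "multiple_choice_answer": "1"},
]}
rest_q, rest_a = filter_by_qids(questions, annotations, {2})
print(len(rest_q["questions"]), len(rest_a["annotations"]))
```

In practice the set of ids would come from `load_rest_val_qids` applied to vqa.rest_val.jsonl, and the two filtered dictionaries would be written out with `json.dump`.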

In the end, organize the dataset in the following structure. Note that we do not use any of the training or testing splits.

datasets/
    coco/
        train2014/            
            COCO_train2014_000000000009.jpg                
            ...
        val2014/              
            COCO_val2014_000000000042.jpg
            ...  
        test2015/              
            COCO_test2015_000000000001.jpg
            ...
        answer2label.txt
        vqa.train.jsonl
        vqa.val.jsonl
        vqa.trainable_val.jsonl
        vqa.rest_val.jsonl
        vqa.test.jsonl
        vqa.test-dev.jsonl      
        vqa/
            v2_OpenEnded_mscoco_train2014_questions.json
            v2_OpenEnded_mscoco_val2014_questions.json
            v2_OpenEnded_mscoco_test2015_questions.json
            v2_OpenEnded_mscoco_test-dev2015_questions.json
            v2_OpenEnded_mscoco_rest_val2014_questions
            v2_mscoco_train2014_annotations.json
            v2_mscoco_val2014_annotations.json
            v2_mscoco_rest_val2014_annotations.json
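Before running anything, it can help to confirm that the files are where the structure above says they should be. The sketch below is one way to do that. The `EXPECTED` list is an assumption about which entries the rest-val evaluation needs (names copied from the tree above), and `missing_paths` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path

# Entries (relative to datasets/) assumed to be needed for the rest-val evaluation.
EXPECTED = [
    "coco/val2014",
    "coco/vqa.rest_val.jsonl",
    "coco/vqa/v2_OpenEnded_mscoco_rest_val2014_questions",
    "coco/vqa/v2_mscoco_rest_val2014_annotations.json",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected entries that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths("datasets")
    print("missing:", missing if missing else "none")
```

Adjust the root argument if your datasets/ folder lives elsewhere or is reached through a soft link.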

As we did in our config.yaml, you can add a soft link to your own datasets/ folder:

cd ~/tmp
ln -s /path/to/your/datasets/ .
    

Otherwise, please remove the /tmp/ prefix from all paths in the provided config.yaml.
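If you prefer to remove the /tmp/ prefix programmatically rather than by hand, a text-level rewrite along these lines may be enough. This is only a sketch: `strip_path_prefix` is a hypothetical helper, the keys in the sample are made up and do not reflect the real config.yaml, and the result is a path relative to the working directory.

```python
import re


def strip_path_prefix(config_text: str, prefix: str = "/tmp/") -> str:
    """Remove prefix where it starts a value, i.e. right after 'key:' and an optional quote."""
    pattern = re.compile(r"(:\s*['\"]?)" + re.escape(prefix))
    return pattern.sub(r"\1", config_text)


# Hypothetical config fragment; the real keys in config.yaml may differ.
sample = (
    "datasets:\n"
    "  images_dir: /tmp/datasets/coco/val2014\n"
    "  name: vqa\n"
)
print(strip_path_prefix(sample))
```

Review the output before overwriting config.yaml, since a plain text substitution does not understand YAML structure.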

For GQA, organize the dataset in the following structure.

datasets/
    gqa/
        images/
            1000.jpg
            ...
        gqasubset1000.json

Quick Start