


An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

by Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang

The 36th AAAI Conference on Artificial Intelligence (AAAI), 2022, Oral


Can GPT-3 benefit multimodal tasks? We provide an empirical study of GPT-3 for knowledge-based VQA, named PICa. We show that prompting GPT-3 with image captions and only 16 in-context examples surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset (from 39.4 to 48.0).

<p align="center"> <img src="https://zyang-ur.github.io//pica/intro.jpg" width="75%"/> </p>


  title={An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA},
  author={Yang, Zhengyuan and Gan, Zhe and Wang, Jianfeng and Hu, Xiaowei and Lu, Yumao and Liu, Zicheng and Wang, Lijuan},



  1. Clone the repository

    git clone https://github.com/microsoft/PICa.git
  2. Prepare the data. The cached files for the converted OK-VQA data, predicted text representations, and similarity features are in the coco_annotations, input_text, and coco_clip_new folders, respectively.
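
Before running anything, it can help to confirm that the three cached folders named above are present. The following is a minimal sketch, not part of the repository: the helper name `missing_data_folders` is hypothetical, and it assumes the folders sit directly under the repository root.

```python
from pathlib import Path

# Folder names come from the data preparation step above.
# Assumption: they live directly under the repository root.
EXPECTED_FOLDERS = ("coco_annotations", "input_text", "coco_clip_new")


def missing_data_folders(repo_root="."):
    """Return the expected data folders that are absent under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_data_folders(".")
    if missing:
        print("Missing data folders:", ", ".join(missing))
    else:
        print("All cached data folders found.")
```

Run it from the cloned PICa directory; an empty result means all three folders were found.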


  1. We experimented with the older engine davinci instead of the current default text-davinci-001, which is tuned to follow instructions; see more discussion here.
    python gpt3_api_okvqa.py --apikey xxx --output_path output
    ## for example
    python gpt3_api_okvqa.py --apikey xxx --output_path output --engine davinci --similarity_metric random --n_ensemble 1 --n_shot 16
    python gpt3_api_okvqa.py --apikey xxx --output_path output --engine davinci --similarity_metric imagequestion --n_ensemble 5 --n_shot 16
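
The flags above control how many in-context examples are used (`--n_shot`) and how they are chosen (`--similarity_metric`). The sketch below illustrates the general idea of similarity-based example selection followed by caption-based prompt construction. It is not the code in gpt3_api_okvqa.py. The function names, the dictionary keys, the single feature vector per example, and the exact prompt wording are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_examples(query_feat, pool, n_shot):
    """Pick the n_shot pool examples whose features are closest to the query.

    Assumption: each example carries one precomputed vector under "feat".
    """
    ranked = sorted(pool, key=lambda ex: cosine(query_feat, ex["feat"]), reverse=True)
    return ranked[:n_shot]


def build_prompt(examples, caption, question):
    """Assemble a few-shot text prompt from captioned examples.

    The header and separators are illustrative, not necessarily the exact
    strings used by the repository.
    """
    parts = ["Please answer the question according to the above context."]
    for ex in examples:
        parts.append(
            f"===\nContext: {ex['caption']}\n===\nQ: {ex['question']}\nA: {ex['answer']}"
        )
    # The final block leaves the answer open for the language model to complete.
    parts.append(f"===\nContext: {caption}\n===\nQ: {question}\nA:")
    return "\n".join(parts)


if __name__ == "__main__":
    pool = [
        {"feat": [1.0, 0.0], "caption": "A man riding a wave.",
         "question": "What sport is this?", "answer": "surfing"},
        {"feat": [0.0, 1.0], "caption": "A plate of pasta.",
         "question": "What cuisine is this?", "answer": "italian"},
    ]
    shots = select_examples([0.9, 0.1], pool, n_shot=1)
    print(build_prompt(shots, "A person on a snowy slope.", "What sport is this?"))
```

With `--similarity_metric random`, the selection step would instead draw examples at random, which makes a useful baseline for judging how much similarity-based selection helps.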


  1. Outputs will be saved to the format_answer and prompt_answer folders. format_answer is used for final evaluation and follows the VQAv2 format. prompt_answer contains the input prompts for human inspection.

  2. output_saved provides the cached predictions.
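
For reference, the VQAv2 results format is a JSON list of objects, each with a `question_id` and an `answer`. The sketch below writes predictions in that layout. The helper names are hypothetical, and whether the files in format_answer match this layout exactly is an assumption.

```python
import json


def to_vqav2_results(predictions):
    """Convert a {question_id: answer} mapping into a VQAv2-style results list."""
    items = sorted(predictions.items(), key=lambda kv: int(kv[0]))
    return [{"question_id": int(qid), "answer": str(ans)} for qid, ans in items]


def save_results(predictions, path):
    """Write the results list to path as JSON."""
    with open(path, "w") as f:
        json.dump(to_vqav2_results(predictions), f)


if __name__ == "__main__":
    # Toy predictions for illustration only, not real model outputs.
    print(to_vqav2_results({"2": "cat", 1: "dog"}))
```

Sorting by question id is a convenience for diffing files between runs and is not required by the format itself.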