High-Order Attention for Visual Question Answering

High-order attention models are a strong tool for tasks whose inputs span several data modalities. This code targets multiple-choice Visual Question Answering and performs 3-modality attention over the question, the image and the multiple-choice answers. It achieves 69.4 on Multiple-Choice VQA.

Requirements

You also need to install the following package in order to successfully run the code.

Training

Download Dataset

The first thing you need to do is to download the data and do some preprocessing. Head over to the data/ folder and run

$ python vqa_preprocess.py --download 1 --split 1

This step generates two files under the data folder: vqa_raw_train.json and vqa_raw_test.json.
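Before moving on, it can help to confirm that both raw files exist and parse as JSON. The snippet below is a minimal sketch, not part of this repository: the function name `summarize_raw_files` is hypothetical, and it makes no assumption about the schema beyond the top level being a JSON list or object.

```python
import json
from pathlib import Path

# File names as produced by the download step described above.
RAW_FILES = ("vqa_raw_train.json", "vqa_raw_test.json")


def summarize_raw_files(data_dir="data"):
    """Map each raw file name to its number of top-level entries.

    A missing file maps to None. Hypothetical helper for a quick sanity
    check; it does not validate the contents of the entries.
    """
    summary = {}
    for name in RAW_FILES:
        path = Path(data_dir) / name
        if not path.is_file():
            summary[name] = None
            continue
        with path.open(encoding="utf-8") as f:
            summary[name] = len(json.load(f))
    return summary
```

Calling `summarize_raw_files("data")` from the main folder returns a dictionary such as `{"vqa_raw_train.json": <count>, "vqa_raw_test.json": <count>}`, with `None` marking a file that was not generated.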

Preprocess Image/Question Features

$ python prepro_vqa.py --input_train_json ./vqa_raw_train.json --input_test_json ./vqa_raw_test.json --num_ans 3000 --max_length 15 --test 1

This will generate two files in the data/ folder, vqa_data_prepro.h5 and vqa_data_prepro.json. You will also see some question and answer statistics in the terminal output.
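The --num_ans and --max_length flags suggest that preprocessing keeps a fixed number of frequent answers and a fixed question length. The sketch below illustrates what such a step commonly looks like. It is an assumption, not the code in prepro_vqa.py: the function names, the right-padding, and the `pad_id`/`unk_id` values are all hypothetical.

```python
from collections import Counter


def top_answers(answers, num_ans):
    """Return the num_ans most frequent answers, most frequent first."""
    return [ans for ans, _ in Counter(answers).most_common(num_ans)]


def encode_question(tokens, vocab, max_length, pad_id=0, unk_id=1):
    """Map tokens to ids, truncate to max_length, then right-pad with pad_id.

    Tokens missing from vocab map to unk_id. Padding side and id values
    are illustrative assumptions.
    """
    ids = [vocab.get(tok, unk_id) for tok in tokens[:max_length]]
    return ids + [pad_id] * (max_length - len(ids))
```

With `--num_ans 3000 --max_length 15`, this would correspond to `top_answers(all_answers, 3000)` and `encode_question(tokens, vocab, 15)`.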

Download Image Model

Here we use the VGG_ILSVRC_19_layers model and the Deep Residual Network model implemented by Facebook.

Head over to the image_model folder and run

$ python download_model.py --download 'VGG' 

This will download the VGG_ILSVRC_19_layers model into the image_model folder. To download the Deep Residual model instead, change VGG to Residual in the command.

We are now ready to extract the image features. Head back to the data folder and run one of the commands below. You can change -gpuid, -backend and -batch_size to suit your GPU.

For VGG image feature:

$ th prepro_img_vgg.lua -input_json ./vqa_data_prepro.json -image_root XXXX -cnn_proto ../image_model/VGG_ILSVRC_19_layers_deploy.prototxt -cnn_model ../image_model/VGG_ILSVRC_19_layers.caffemodel

This will generate two output files, vqa_data_img_vgg_train.h5 and vqa_data_img_vgg_test.h5, in the directory given by -out_path.

For Deep Residual image feature:

$ th prepro_img_residule.lua -input_json ./data/vqa_data_prepro.json  -image_root XXXX -residule_path ../image_model/resnet-200.t7

This will generate two output files, cocoqa_data_img_residule_train.h5 and cocoqa_data_img_residule_test.h5, in the directory given by -out_path.
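Feature extraction can take a long time, so it is worth checking that the expected files exist before starting training. The following is a minimal sketch: the helper `missing_feature_files` is hypothetical, and the file names are simply the ones listed above for each feature type.

```python
from pathlib import Path

# Output file names as listed above for each feature type.
EXPECTED_FEATURES = {
    "VGG": ("vqa_data_img_vgg_train.h5", "vqa_data_img_vgg_test.h5"),
    "Residual": (
        "cocoqa_data_img_residule_train.h5",
        "cocoqa_data_img_residule_test.h5",
    ),
}


def missing_feature_files(out_path, feature_type):
    """Return the expected feature files that are absent from out_path."""
    out_dir = Path(out_path)
    return [
        name
        for name in EXPECTED_FEATURES[feature_type]
        if not (out_dir / name).is_file()
    ]
```

An empty list from `missing_feature_files(out_path, "Residual")` means both Residual feature files are present in that directory.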

Train the model

Go back to the main folder and run:

th train.lua  \
        -id XX \
        -start_from 0 \
        -dropout 5 \
        -save_checkpoint_every 3000 \
        -eval 0 \
        -feature_type Residual \
        -hidden_size 512 \
        -hidden_last_size 8192 \
        -hidden_combine_size 8192 \
        -batch_size 250 \
        -losses_log_every 100  \
        -learning_rate 4e-4 \
        -output_size 3001 \
        -learning_rate_decay_every 1200 \
        -input_img_train_h5 XX/cocoqa_data_img_residule_train.h5 \
        -input_ques_h5 data/vqa_data_prepro.h5 \
        -input_json data/vqa_data_prepro.json 
    
Note

Evaluation

th eval.lua -id XX -start_from XX \
        -feature_type Residual \
        -input_img_test_h5 XX/cocoqa_data_img_residule_test.h5 \
        -input_ques_h5 data/vqa_data_prepro.h5 \
        -input_json data/vqa_data_prepro.json \
        -MC 1
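With -MC 1 the evaluation presumably runs in the multiple-choice setting, where the prediction is one of the candidate answers supplied with the question. The sketch below shows only that final selection step, assuming the model has already produced one score per candidate. The function names are hypothetical and this is not the logic in eval.lua.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_choice(choices, scores):
    """Return (best_choice, probability) for one multiple-choice question.

    choices and scores are parallel lists; the scores are assumed to come
    from a trained model, which is outside this sketch.
    """
    probs = softmax(scores)
    best = max(range(len(choices)), key=probs.__getitem__)
    return choices[best], probs[best]
```

For example, `predict_choice(["red", "blue", "green"], [0.1, 2.0, -1.0])` selects "blue", the candidate with the highest score.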

Evaluate using Pre-trained Model

You can find the pre-trained model here (make sure you also provide the mcb files): https://www.dropbox.com/sh/u6s47tay8yx7p3i/AACndwMd6E_k_WPNS-Cc26Ega?dl=0

VQA on Single Image with Free Form Question

Soon

Attention Visualization

Soon

Reference

If you use this code as part of any published research, please acknowledge the following paper:

@inproceedings{schwartz2017high,
  title={High-Order Attention Models for Visual Question Answering},
  author={Schwartz, Idan and Schwing, Alexander and Hazan, Tamir},
  booktitle={Advances in Neural Information Processing Systems},
  pages={3665--3675},
  year={2017}
}