neural-vqa

Join the chat at https://gitter.im/abhshkdz/neural-vqa

This is an experimental Torch implementation of the VIS+LSTM visual question answering model from the paper Exploring Models and Data for Image Question Answering by Mengye Ren, Ryan Kiros and Richard Zemel.

Model architecture
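
In VIS+LSTM, the image is treated as the first "word" of the question: the 4096-dimensional VGG-19 fc7 feature is linearly projected into the word-embedding space, prepended to the embedded question tokens, and the LSTM's final output is classified into one of the candidate answers. A minimal sketch of that wiring (not this repo's actual code; it assumes the Element-Research rnn package for nn.LSTM, and the sizes are illustrative):

-- Minimal VIS+LSTM sketch (not the repo's actual code); assumes the
-- Element-Research 'rnn' package for nn.LSTM. Sizes are illustrative.
require 'nn'
require 'rnn'

local vocab_size, embed_dim, num_answers = 10000, 512, 1000

-- Project the 4096-d VGG-19 fc7 feature into word-embedding space so the
-- image can be fed to the LSTM as if it were the first question token.
local image_embed = nn.Linear(4096, embed_dim)
local word_embed  = nn.LookupTable(vocab_size, embed_dim)

-- A single LSTM reads [image, w1, ..., wT]; its final output is classified
-- into one of the candidate answers.
local lstm = nn.LSTM(embed_dim, embed_dim)
local classifier = nn.Sequential()
  :add(nn.Linear(embed_dim, num_answers))
  :add(nn.LogSoftMax())

-- Forward pass for one example. fc7: 1x4096 tensor; question: 1-D
-- LongTensor of word ids.
local function predict(fc7, question)
  lstm:forget()                                     -- reset recurrent state
  local h = lstm:forward(image_embed:forward(fc7))  -- image as "word 0"
  for t = 1, question:size(1) do
    h = lstm:forward(word_embed:forward(question:narrow(1, t, 1)))
  end
  return classifier:forward(h)                      -- 1 x num_answers log-probs
end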

Setup

Requirements:

- Torch
- loadcaffe, for loading the VGG-19 Caffe model (see models/download_models.sh)

Download the MSCOCO train+val images and VQA data by running sh data/download_data.sh, then extract all the downloaded zip files inside the data folder.

unzip Annotations_Train_mscoco.zip
unzip Questions_Train_mscoco.zip
unzip train2014.zip

unzip Annotations_Val_mscoco.zip
unzip Questions_Val_mscoco.zip
unzip val2014.zip

If you have already downloaded them, copy the train2014 and val2014 image folders and the VQA JSON files into the data folder.

Download the VGG-19 Caffe model and prototxt using sh models/download_models.sh.

Known issues

Usage

Extract image features

th extract_fc7.lua -split train
th extract_fc7.lua -split val
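
This runs every image through VGG-19 once and saves the 4096-dimensional fc7 activations so that later stages can reuse them. The gist of the extraction, sketched below (not extract_fc7.lua's actual code; the prototxt/caffemodel file names are assumptions based on the standard VGG-19 Caffe release):

-- Sketch of the fc7-extraction idea (not extract_fc7.lua's actual code).
-- The file names below are assumptions based on the standard VGG-19
-- Caffe release, not verified against models/download_models.sh.
require 'nn'
require 'loadcaffe'

local net = loadcaffe.load(
  'models/VGG_ILSVRC_19_layers_deploy.prototxt',
  'models/VGG_ILSVRC_19_layers.caffemodel', 'nn')

-- In the standard VGG-19 deploy net the last three modules are drop7, fc8
-- and the final softmax; removing them leaves relu7 on top, so :forward()
-- returns the 4096-d fc7 activation instead of ImageNet class scores.
for _ = 1, 3 do net:remove() end  -- nn.Sequential:remove() pops the last module
net:evaluate()

-- img must be 3x224x224, BGR channel order, with the Caffe mean subtracted.
-- local fc7 = net:forward(img)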

Options

Training

th train.lua
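
With the fc7 features precomputed, training reduces to classification over the answer vocabulary: the LSTM encoding of [image, question] goes through a linear layer and a LogSoftMax, and the negative log-likelihood of the ground-truth answer is minimized. A self-contained toy illustration of that objective (not train.lua's actual code; the sizes and learning rate are made up):

-- Toy illustration of the training objective, not train.lua itself.
require 'nn'

local embed_dim, num_answers = 512, 1000
local head = nn.Sequential()
  :add(nn.Linear(embed_dim, num_answers))
  :add(nn.LogSoftMax())
local criterion = nn.ClassNLLCriterion()  -- pairs with LogSoftMax output

local h = torch.randn(embed_dim)  -- stand-in for the LSTM's encoding of [image, question]
local answer_id = 42              -- index of the ground-truth answer

local logprobs = head:forward(h)
local loss = criterion:forward(logprobs, answer_id)

head:zeroGradParameters()
head:backward(h, criterion:backward(logprobs, answer_id))
head:updateParameters(1e-2)       -- one plain SGD step
print(string.format('loss: %.4f', loss))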

Options

Testing

th predict.lua -checkpoint_file checkpoints/vqa_epoch23.26_0.4610.t7 -input_image_path data/train2014/COCO_train2014_000000405541.jpg -question 'What is the cat on?'

Options

Sample predictions

Randomly sampled image-question pairs from the VQA test set, with answers predicted by the VIS+LSTM model.

Q: What animals are those? A: Sheep

Q: What color is the frisbee that's upside down? A: Red

Q: What is flying in the sky? A: Kite

Q: What color is court? A: Blue

Q: What is in the standing person's hands? A: Bat

Q: Are they riding horses both the same color? A: No

Q: What shape is the plate? A: Round

Q: Is the man wearing socks? A: Yes

Q: What is over the woman's left shoulder? A: Fork

Q: Where are the pink flowers? A: On wall

Implementation Details

Pretrained model and data files

To reproduce the results shown on this page or to try your own image-question pairs, download the following and run predict.lua with the appropriate paths.

References

Exploring Models and Data for Image Question Answering, Mengye Ren, Ryan Kiros, Richard Zemel, NIPS 2015
VQA: Visual Question Answering, Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, ICCV 2015

License

MIT