Attention on Attention: Architectures for Visual Question Answering (VQA)
This is the code for our paper of the same name (linked in the title).
This project was done for Stanford's CS 224N and CS 230.
Our model architecture is inspired by the winning entry of the 2017 VQA Challenge, which follows the VQA system described in "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering" and "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge".
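For readers new to that system, the following is a minimal, illustrative PyTorch-style sketch of the top-down attention step described in those papers (a question embedding attends over precomputed bottom-up region features). It is not the code in this repository; the layer sizes and the ReLU nonlinearity are placeholder choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Question-guided ("top-down") attention over bottom-up region features.

    Illustrative sketch only: hidden sizes and the choice of nonlinearity are
    placeholders, not the exact layers used in this repository.
    """
    def __init__(self, v_dim=2048, q_dim=512, hid_dim=512):
        super(TopDownAttention, self).__init__()
        self.proj = nn.Linear(v_dim + q_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, v, q):
        # v: [batch, k, v_dim] bottom-up features for k image regions
        # q: [batch, q_dim]    question embedding (e.g. the final GRU state)
        k = v.size(1)
        q_rep = q.unsqueeze(1).expand(-1, k, -1)           # [batch, k, q_dim]
        joint = torch.cat([v, q_rep], dim=2)               # [batch, k, v_dim + q_dim]
        logits = self.score(torch.relu(self.proj(joint)))  # [batch, k, 1]
        alpha = F.softmax(logits, dim=1)                   # attention weights over regions
        return (alpha * v).sum(dim=1)                      # [batch, v_dim] attended feature
```

In the full system the attended image feature is fused with the question embedding and fed to a classifier over candidate answers; our variants build on this attention mechanism, and the exact architectures we searched over are described in the paper.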
License
MIT
Our Architectures
This project uses code provided here.
We used the preprocessing and base code provided at the above link and then performed an extensive architecture and hyperparameter search.
Results
| Model | Validation Accuracy (%) | Training Time |
| --- | --- | --- |
| Reported Model | 63.15 | 12-18 hours (Tesla K40) |
| Our A3x2 Model | 64.78 | 4 hours (AWS g3.8xlarge, 2x M60) |
The accuracy was calculated using the VQA evaluation metric.
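For reference, the VQA metric scores a single predicted answer as min(n / 3, 1), where n is the number of the ten human annotators who gave that answer; the official evaluation script additionally normalizes answers and averages the score over all subsets of nine annotators. A simplified sketch of the core rule:

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy for one question.

    `human_answers` is the list of 10 ground-truth answers. The official
    evaluation script also normalizes answers (case, punctuation, articles)
    and averages over all 9-annotator subsets; this sketch keeps only the
    core min(#matches / 3, 1) rule.
    """
    matches = sum(ans == predicted for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Example: four of ten annotators answered "2", so predicting "2" scores 1.0.
print(vqa_accuracy("2", ["2", "2", "2", "2", "two", "3", "two", "3", "1", "two"]))
```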
Implementation Details
Check out our paper for the full implementation details and hyperparameter search. arXiv link coming soon.
Hyperparameter Search
Dual Attention Visualization
Usage
Prerequisites
Make sure you are on a machine with an NVIDIA GPU and Python 2.7+, with about 70 GB of free disk space.
Data Setup
All data should be downloaded to a `data/` directory in the root of this repository.
The easiest way to download the data is to run the provided script `tools/download.sh` from the repository root. If the script does not work, examine it and adapt the steps it outlines to your needs. Then run `tools/process.sh` from the repository root to process the data into the correct format.
Training
Simply run `python main.py` to start training. The default model is the best-performing A3x2; other model variations can be selected with the models flag. Training and validation scores are printed every epoch, and the best model is saved under the `saved_models` directory. The default flags should reproduce the result reported in the table above.
Pre-Trained Models
Certain pre-trained models are available upon request.
Our Paper
Citation:
Please use the BibTeX entry found at:
http://dblp.uni-trier.de/rec/bibtex/journals/corr/abs-1803-07724