Multi-modal Factorized Bilinear Pooling (MFB) for VQA

This is an unofficial PyTorch implementation of Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering (MFB) and Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering (MFH).
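For orientation, the core MFB operation described in the first paper is: project both modalities into a shared (out_dim × factor)-dimensional space, take their element-wise product, sum-pool over non-overlapping windows of size factor, and apply power and L2 normalization. The block below is a minimal sketch of that idea, not this repository's code; every name and dimension (img_dim, ques_dim, out_dim, factor) is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MFB(nn.Module):
    """Minimal sketch of Multi-modal Factorized Bilinear pooling.

    All names and dimensions (img_dim, ques_dim, out_dim, factor) are
    illustrative assumptions, not values taken from this repository.
    """

    def __init__(self, img_dim=2048, ques_dim=1024, out_dim=1000, factor=5):
        super(MFB, self).__init__()
        self.out_dim = out_dim
        self.factor = factor
        # U and V from the paper, realized as projections to out_dim * factor
        self.proj_img = nn.Linear(img_dim, out_dim * factor)
        self.proj_ques = nn.Linear(ques_dim, out_dim * factor)
        self.dropout = nn.Dropout(p=0.1)

    def forward(self, img_feat, ques_feat):
        # Expand stage: element-wise product of the two projections
        joint = self.dropout(self.proj_img(img_feat) * self.proj_ques(ques_feat))
        # Squeeze stage: sum-pool over non-overlapping windows of size `factor`
        joint = joint.view(-1, self.out_dim, self.factor).sum(dim=2)
        # Power normalization (signed square root), then L2 normalization
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8)
        return F.normalize(joint, p=2, dim=1)
```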

Figure 1: The MFB+CoAtt Network architecture for VQA.
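MFH, from the second paper, generalizes MFB into a high-order cascade: each block's expanded (pre-pooling) output is multiplied element-wise into the next block's expansion, and the squeezed outputs of all blocks are concatenated. The sketch below builds on the MFB sketch above and is likewise a hedged illustration; `order` stands in for the paper's p, and all other names and dimensions are assumptions.

```python
class MFH(nn.Module):
    """Sketch of MFH: a cascade of MFB-style blocks. The expanded
    (pre-pooling) output of block i-1 is multiplied into block i's
    expansion, and the squeezed outputs of all blocks are concatenated.
    Every name and dimension here is an illustrative assumption."""

    def __init__(self, img_dim=2048, ques_dim=1024, out_dim=1000,
                 factor=5, order=2):
        super(MFH, self).__init__()
        self.out_dim = out_dim
        self.factor = factor
        self.order = order
        self.proj_img = nn.ModuleList(
            [nn.Linear(img_dim, out_dim * factor) for _ in range(order)])
        self.proj_ques = nn.ModuleList(
            [nn.Linear(ques_dim, out_dim * factor) for _ in range(order)])
        self.dropout = nn.Dropout(p=0.1)

    def forward(self, img_feat, ques_feat):
        outputs = []
        prev_exp = 1.0  # neutral factor for the first block's product
        for i in range(self.order):
            exp = self.dropout(self.proj_img[i](img_feat)
                               * self.proj_ques[i](ques_feat) * prev_exp)
            prev_exp = exp
            # Squeeze each block as in MFB: sum-pool, power norm, L2 norm
            z = exp.view(-1, self.out_dim, self.factor).sum(dim=2)
            z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-8)
            outputs.append(F.normalize(z, p=2, dim=1))
        # Joint feature: concatenation of all squeezed block outputs
        return torch.cat(outputs, dim=1)
```

With the defaults above, `MFH()(torch.randn(8, 2048), torch.randn(8, 1024))` returns an (8, 2000) joint feature, i.e. out_dim concatenated order times.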

The results of MFB-baseline and MFH-baseline can be replicated. (I was not able to replicate the MFH-coatt-glove result; the devil may be hidden in some detail.)

The authors helped me a lot when I was trying to replicate the results. Many thanks to them.

The official implementation, based on pycaffe, is available here.

Requirements

Python 2.7, PyTorch 0.2, torchvision 0.1.9, tensorboardX

Result

Datasets \ Models | MFB    | MFH    | MFH+CoAtt+GloVe (FRCN img features)
------------------|--------|--------|-------------------------------------
VQA-1.0           | 58.75% | 59.15% | 68.78%

Figure 2: MFB-baseline result

Figure 3: MFH-baseline result

Training from Scratch

$ python train_*.py

Citation

If you find this implementation helpful, please consider citing:

@inproceedings{yu2017mfb,
  title={Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering},
  author={Yu, Zhou and Yu, Jun and Fan, Jianping and Tao, Dacheng},
  booktitle={IEEE International Conference on Computer Vision (ICCV)},
  year={2017}
}

@article{yu2017beyond,
  title={Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering},
  author={Yu, Zhou and Yu, Jun and Xiang, Chenchao and Fan, Jianping and Tao, Dacheng},
  journal={arXiv preprint arXiv:1708.03619},
  year={2017}
}