Multi-modal Factorized Bilinear Pooling (MFB) for VQA

This is an unofficial PyTorch implementation of Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering (MFB) and Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering (MFH).
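For orientation, the core MFB operation described in the first paper is: project both modalities into a shared (out_dim × factor)-dimensional space, take their element-wise product, sum-pool over non-overlapping windows of size factor, and apply power and L2 normalization. The block below is a minimal sketch of that idea, not this repository's code; every name and dimension (img_dim, ques_dim, out_dim, factor) is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MFB(nn.Module):
    """Minimal sketch of Multi-modal Factorized Bilinear pooling.

    All names and dimensions (img_dim, ques_dim, out_dim, factor) are
    illustrative assumptions, not values taken from this repository.
    """

    def __init__(self, img_dim=2048, ques_dim=1024, out_dim=1000, factor=5):
        super(MFB, self).__init__()
        self.out_dim = out_dim
        self.factor = factor
        # U and V from the paper, realized as projections to out_dim * factor
        self.proj_img = nn.Linear(img_dim, out_dim * factor)
        self.proj_ques = nn.Linear(ques_dim, out_dim * factor)
        self.dropout = nn.Dropout(p=0.1)

    def forward(self, img_feat, ques_feat):
        # Expand stage: element-wise product of the two projections
        joint = self.dropout(self.proj_img(img_feat) * self.proj_ques(ques_feat))
        # Squeeze stage: sum-pool over non-overlapping windows of size `factor`
        joint = joint.view(-1, self.out_dim, self.factor).sum(dim=2)
        # Power normalization (signed square root), then L2 normalization
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8)
        return F.normalize(joint, p=2, dim=1)
```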

Figure 1: The MFB+CoAtt Network architecture for VQA.
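MFH, from the second paper, generalizes MFB into a high-order cascade: each block's expanded (pre-pooling) output is multiplied element-wise into the next block's expansion, and the squeezed outputs of all blocks are concatenated. The sketch below builds on the MFB sketch above and is likewise a hedged illustration; `order` stands in for the paper's p, and all other names and dimensions are assumptions.

```python
class MFH(nn.Module):
    """Sketch of MFH: a cascade of MFB-style blocks. The expanded
    (pre-pooling) output of block i-1 is multiplied into block i's
    expansion, and the squeezed outputs of all blocks are concatenated.
    Every name and dimension here is an illustrative assumption."""

    def __init__(self, img_dim=2048, ques_dim=1024, out_dim=1000,
                 factor=5, order=2):
        super(MFH, self).__init__()
        self.out_dim = out_dim
        self.factor = factor
        self.order = order
        self.proj_img = nn.ModuleList(
            [nn.Linear(img_dim, out_dim * factor) for _ in range(order)])
        self.proj_ques = nn.ModuleList(
            [nn.Linear(ques_dim, out_dim * factor) for _ in range(order)])
        self.dropout = nn.Dropout(p=0.1)

    def forward(self, img_feat, ques_feat):
        outputs = []
        prev_exp = 1.0  # neutral factor for the first block's product
        for i in range(self.order):
            exp = self.dropout(self.proj_img[i](img_feat)
                               * self.proj_ques[i](ques_feat) * prev_exp)
            prev_exp = exp
            # Squeeze each block as in MFB: sum-pool, power norm, L2 norm
            z = exp.view(-1, self.out_dim, self.factor).sum(dim=2)
            z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-8)
            outputs.append(F.normalize(z, p=2, dim=1))
        # Joint feature: concatenation of all squeezed block outputs
        return torch.cat(outputs, dim=1)
```

With the defaults above, `MFH()(torch.randn(8, 2048), torch.randn(8, 1024))` returns an (8, 2000) joint feature, i.e. out_dim concatenated order times.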

The results of MFB-baseline and MFH-baseline can be replicated. (I was not able to replicate the MFH-coatt-glove result; the devil may be hidden in some detail.)

The authors helped me a lot when I was trying to replicate the results. Many thanks to them.

The official implementation, based on pycaffe, is available here.

Requirements

Python 2.7, PyTorch 0.2, torchvision 0.1.9, tensorboardX

Result

Datasets \ Models | MFB    | MFH    | MFH+CoAtt+GloVe (FRCN img features)
------------------|--------|--------|-------------------------------------
VQA-1.0           | 58.75% | 59.15% | 68.78%

Figure 2: MFB-baseline result

Figure 3: MFH-baseline result

Training from Scratch

$ python train_*.py

Citation

If you find this implementation helpful, please consider citing:

@inproceedings{yu2017mfb,
  title={Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering},
  author={Yu, Zhou and Yu, Jun and Fan, Jianping and Tao, Dacheng},
  booktitle={IEEE International Conference on Computer Vision (ICCV)},
  year={2017}
}

@article{yu2017beyond,
  title={Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering},
  author={Yu, Zhou and Yu, Jun and Xiang, Chenchao and Fan, Jianping and Tao, Dacheng},
  journal={arXiv preprint arXiv:1708.03619},
  year={2017}
}