MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering
This code implements a multimodal knowledge extraction model. The model generates output features corresponding to knowledge triplets. These knowledge features can be used in the following potential application scenarios:
- Model-based knowledge search. MuKEA is capable of retrieving relevant knowledge for multimodal input.
- Knowledge-based vision-language tasks, such as image captioning, referring expression comprehension, and vision-language navigation.
- Explainable deep learning, especially in the legal and medical fields.
This approach achieved state-of-the-art knowledge-based visual question answering performance on OKVQA (42.59% overall accuracy) and KRVQA (27.38% overall accuracy), as described in the paper cited in the Bibtex below.
Requirements
PyTorch == 1.6.0
transformers == 3.5.0
Training
- Create model_save_dir
$ mkdir model_save_dir
- Preprocessing
$ mkdir data
$ cd data
Download the annotations from
We reorganized the storage structure of the image features as follows:
vqa_img_feature_train.pickle:
{
    "image_id": {'feats': features, 'sp_feats': spatial features}
}
The pre-trained LXMERT model expects these spatial features to be normalized bounding boxes on a scale of 0 to 1.
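For reference, here is a minimal sketch of building and sanity-checking such a pickle file. The image id, feature shapes, and file path are hypothetical placeholders; only the {'feats', 'sp_feats'} layout and the 0-1 box normalization follow the description above.

```python
import pickle
import numpy as np

def normalize_boxes(boxes, img_w, img_h):
    """Scale (x1, y1, x2, y2) pixel boxes into the 0-1 range expected by LXMERT."""
    boxes = boxes.astype(np.float32).copy()
    boxes[:, (0, 2)] /= img_w   # x coordinates
    boxes[:, (1, 3)] /= img_h   # y coordinates
    return boxes

# Hypothetical example: one image with 36 regions and 2048-d features.
feats = np.random.rand(36, 2048).astype(np.float32)
pixel_boxes = np.tile([10.0, 20.0, 200.0, 300.0], (36, 1))
sp_feats = normalize_boxes(pixel_boxes, img_w=640, img_h=480)

img_features = {"123456": {"feats": feats, "sp_feats": sp_feats}}

with open("data/vqa_img_feature_train.pickle", "wb") as f:
    pickle.dump(img_features, f)

# Reload and verify that the spatial features stay within [0, 1].
with open("data/vqa_img_feature_train.pickle", "rb") as f:
    loaded = pickle.load(f)
assert loaded["123456"]["sp_feats"].max() <= 1.0
```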
The image features are provided by and can be downloaded from the original bottom-up attention repo; then run the following script to process them:
python tsv2feature.py
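The bottom-up-attention features are distributed as TSV files whose rows hold base64-encoded numpy arrays. The sketch below shows the general shape of that conversion under the standard field names of that release; the actual tsv2feature.py in this repo may differ in its details, and the input/output paths here are placeholders.

```python
import base64
import csv
import pickle
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)

# Standard field names of the bottom-up-attention TSV release (assumed here).
FIELDNAMES = ["image_id", "image_w", "image_h", "num_boxes", "boxes", "features"]

def tsv_to_pickle(tsv_path, out_path):
    img_features = {}
    with open(tsv_path) as f:
        reader = csv.DictReader(f, delimiter="\t", fieldnames=FIELDNAMES)
        for row in reader:
            num_boxes = int(row["num_boxes"])
            w, h = float(row["image_w"]), float(row["image_h"])
            feats = np.frombuffer(base64.b64decode(row["features"]),
                                  dtype=np.float32).reshape(num_boxes, -1)
            boxes = np.frombuffer(base64.b64decode(row["boxes"]),
                                  dtype=np.float32).reshape(num_boxes, 4).copy()
            boxes[:, (0, 2)] /= w   # normalize x coordinates to [0, 1]
            boxes[:, (1, 3)] /= h   # normalize y coordinates to [0, 1]
            img_features[row["image_id"]] = {"feats": feats, "sp_feats": boxes}
    with open(out_path, "wb") as f:
        pickle.dump(img_features, f)

if __name__ == "__main__":
    # Placeholder file names; substitute the downloaded TSV and desired output path.
    tsv_to_pickle("data/trainval_36.tsv", "data/vqa_img_feature_train.pickle")
```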
Optional download link
The image features with object labels are provided by and can be downloaded from the original LXMERT repo; then run the following script to process them:
python tsv2feature_objects.py
Image features for KRVQA
The image features for KRVQA are generated based on the code in this repo and can be downloaded from
Unzip the file and put it under /data/kr-vqa.
Pre-training on VQAv2
python train.py --embedding --model_dir model_save_dir --dataset finetune-dataset/okvqa/krvqa/vqav2 --pretrain --accumulate --validate
note: the --dataset parameter specifies the dataset used for fine-tuning.
The default learning rate is 1e-4, which leads to faster convergence. If training is unstable, manually set the learning rate to 1e-5 for the pre-training stage.
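If you do need the lower rate, the change amounts to constructing the optimizer with lr=1e-5; the snippet below is only an illustration (the real optimizer setup lives in train.py and may differ).

```python
import torch

model = torch.nn.Linear(768, 768)  # placeholder for the actual MuKEA model
# 1e-4 converges faster; fall back to 1e-5 if pre-training is unstable.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```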
Fine-tuning
python train.py --embedding --model_dir model_save_dir --dataset okvqa/krvqa/vqav2 --load_pthpath model_save_dir/checkpoint --accumulate --validate
w/o pre-training
python train.py --embedding --model_dir model_save_dir --dataset okvqa/krvqa --validate
Models
Bibtex
@inproceedings{Ding2022mukea,
title={MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering},
author={Yang Ding and Jing Yu and Bang Liu and Yue Hu and Mingxin Cui and Qi Wu},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}
}