## The Most Important Thing
Our code is developed based on LXMERT: Learning Cross-Modality Encoder Representations from Transformers (https://github.com/airsplay/lxmert). If you find our work useful, please also cite their work!
## Introduction
PyTorch code for the CVPR 2021 paper "Causal Attention for Vision-Language Tasks". Slides of the LXMERT EMNLP 2019 talk are available in their repository. For experimental settings such as the PyTorch version and GPU configuration, please refer to LXMERT (https://github.com/airsplay/lxmert).
## Results (36-RoI version)
Split | VQA | GQA | NLVR2 |
---|---|---|---|
Local Validation | 70.40% | 60.90% | 76.40% |
Test-Dev | 72.81% | 60.84% | 76.40% (Test-P) |
Test-Standard | 73.04% | 61.17% | 76.00% (Test-U) |
## Results (64-RoI version)
Extracting more RoI visual features per image substantially improves performance!
Split | VQA | GQA | NLVR2 |
---|---|---|---|
Test-Dev | 73.54% | 61.87% | 77.27% (Test-P) |
Test-Standard | 73.63% | 62.07% | 77.23% (Test-U) |
## Pre-training
Note that this part is the same as in LXMERT (https://github.com/airsplay/lxmert); we reproduce it here to keep the instructions self-contained.
- Download the aggregated LXMERT dataset from MS COCO, Visual Genome, VQA, and GQA (around 700MB in total). The joint answer labels are saved in `data/lxmert/all_ans.json`.

  ```bash
  mkdir -p data/lxmert
  wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/lxmert/mscoco_train.json -P data/lxmert/
  wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/lxmert/mscoco_nominival.json -P data/lxmert/
  wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/lxmert/vgnococo.json -P data/lxmert/
  wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/lxmert/mscoco_minival.json -P data/lxmert/
  ```
- Download the detection features for MS COCO images from LXMERT (a sketch for reading these TSV files is given after this list).

  ```bash
  mkdir -p data/mscoco_imgfeat
  wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/train2014_obj36.zip -P data/mscoco_imgfeat
  unzip data/mscoco_imgfeat/train2014_obj36.zip -d data/mscoco_imgfeat && rm data/mscoco_imgfeat/train2014_obj36.zip
  wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/val2014_obj36.zip -P data/mscoco_imgfeat
  unzip data/mscoco_imgfeat/val2014_obj36.zip -d data && rm data/mscoco_imgfeat/val2014_obj36.zip
  ```
- Download the detection features for Visual Genome images.

  ```bash
  mkdir -p data/vg_gqa_imgfeat
  wget --no-check-certificate https://nlp1.cs.unc.edu/data/lxmert_data/vg_gqa_imgfeat/vg_gqa_obj36.zip -P data/vg_gqa_imgfeat
  unzip data/vg_gqa_imgfeat/vg_gqa_obj36.zip -d data && rm data/vg_gqa_imgfeat/vg_gqa_obj36.zip
  ```
- Test on a small split of the MS COCO + Visual Genome datasets:

  ```bash
  bash run/lxmert_pretrain.bash 0,1,2,3 --multiGPU --tiny
  ```
- Run on the whole MS COCO and Visual Genome related datasets (i.e., VQA, GQA, COCO caption, VG Caption, VG QA).
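The extracted `*_obj36.tsv` files follow the bottom-up-attention TSV layout that LXMERT's loader expects (see the reference loader in LXMERT's `src/utils.py`). Below is a minimal reading sketch assuming that layout and an example path of `data/mscoco_imgfeat/val2014_obj36.tsv` (the exact path depends on where the archive extracted); it decodes the boxes and features of the first image only:

```python
import base64
import csv

import numpy as np

# Assumed field order of the bottom-up-attention TSVs used by LXMERT
# (see the reference loader in LXMERT's src/utils.py).
FIELDNAMES = ["img_id", "img_h", "img_w", "objects_id", "objects_conf",
              "attrs_id", "attrs_conf", "num_boxes", "boxes", "features"]

csv.field_size_limit(10 ** 9)  # the base64-encoded feature column is very long

# Path is an assumption; adjust to wherever the archive actually extracted.
with open("data/mscoco_imgfeat/val2014_obj36.tsv") as f:
    reader = csv.DictReader(f, fieldnames=FIELDNAMES, delimiter="\t")
    row = next(reader)  # read only the first image

num_boxes = int(row["num_boxes"])
boxes = np.frombuffer(base64.b64decode(row["boxes"]),
                      dtype=np.float32).reshape(num_boxes, 4)
feats = np.frombuffer(base64.b64decode(row["features"]),
                      dtype=np.float32).reshape(num_boxes, -1)
print(row["img_id"], boxes.shape, feats.shape)  # e.g. 36 boxes, 2048-d features
```

The same layout should apply to the VG/GQA features.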
This part is ours. The pre-training command is:

```bash
bash run/fsb2.bash 0,1,2,3 --multiGPU
```
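Before launching the full pre-training run, it can be worth checking that the data from the steps above is in place. The snippet below is only a hypothetical pre-flight check; the exact `.tsv` paths depend on how the archives extracted, so adjust them if needed:

```python
import os

# Hypothetical pre-flight check: the JSON paths come from the download steps
# above; the .tsv paths assume the archives extract to these locations.
required = [
    "data/lxmert/mscoco_train.json",
    "data/lxmert/mscoco_nominival.json",
    "data/lxmert/mscoco_minival.json",
    "data/lxmert/vgnococo.json",
    "data/mscoco_imgfeat/train2014_obj36.tsv",
    "data/mscoco_imgfeat/val2014_obj36.tsv",
    "data/vg_gqa_imgfeat/vg_gqa_obj36.tsv",
]
missing = [p for p in required if not os.path.exists(p)]
print("all data files found" if not missing else f"missing: {missing}")
```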
After pre-training, the fine-tuning commands for VQA, GQA, and NLVR2 are:

```bash
bash run/vqa_finetuneft.bash 0 0.00004 0.00004
bash run/gqa_finetuneft.bash 0 0.000001 0.000001
bash run/nlvr2_ft.bash 0 0.00003 0.00003
```
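To run the three fine-tuning jobs back to back, a hypothetical launcher like the one below simply replays the commands above in sequence (no extra options are assumed):

```python
import subprocess

# Hypothetical helper: replay the three fine-tuning commands from this README
# one after another; arguments are copied verbatim from the examples above.
runs = [
    ["bash", "run/vqa_finetuneft.bash", "0", "0.00004", "0.00004"],
    ["bash", "run/gqa_finetuneft.bash", "0", "0.000001", "0.000001"],
    ["bash", "run/nlvr2_ft.bash", "0", "0.00003", "0.00003"],
]
for cmd in runs:
    subprocess.run(cmd, check=True)  # stop if any job fails
```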