X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers (EMNLP 2020)
- Authors: Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, and Ani Kembhavi
- Paper
- Blog
- Demo
- Slideslive Presentation
Summary
Recent multi-modal transformers have achieved state-of-the-art performance on a variety of multimodal discriminative tasks like visual question answering and generative tasks like image captioning. This raises an interesting question: Can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this model family, LXMERT, finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension to LXMERT with training refinements. X-LXMERT's image generation capabilities rival state-of-the-art generative models, while its question answering and captioning abilities remain comparable to LXMERT.
Demo
Try out AI2 Computer Vision Explorer Demo!
<img src="./assets/x-lxmert-demo.gif">
Install
- Python packages
conda create -n xlxmert python=3.7
conda activate xlxmert
cd x-lxmert
pip install -r ./requirements.txt
- Mask-RCNN-benchmark (for feature extraction)
  - Please follow the original installation guide.
- Faiss (for K-means clustering)
  - Please follow the original installation guide.
Code structure
# Store images, features, and annotations
./datasets
    COCO/
        images/
        features/
    VG/
        images/
        features/
    GQA/
        images/
        features/
    nlvr2/
        images/
        features/
    data/ <= Store text annotations (*.json) for each split
        lxmert/
        vqa/
        gqa/
        nlvr2/
# Run feature extraction and k-means clustering
./feature_extraction
# Train image generator
./image_generator
    snap/ <= Store image generator checkpoints
    scripts/ <= Bash scripts for training image generator
# Train X-LXMERT
./x-lxmert
    src/
        lxrt/ <= X-LXMERT model class implementation (inherits huggingface transformers' LXMERT class)
        pretrain/ <= X-LXMERT Pretraining
        tasks/ <= Fine-tuning on downstream tasks (VQA, GQA, NLVR2, Image generation)
    snap/ <= Store X-LXMERT checkpoints
    scripts/ <= Bash scripts for pretraining, fine-tuning, and image generation
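If you are setting up the dataset folders by hand, the layout above can be created in one pass. This is a minimal sketch, not part of the repository: `ROOT` is a hypothetical variable introduced here, and it assumes that `data/` sits under `./datasets` and that you run it from the repository root.

```shell
# Sketch: create the dataset layout listed under ./datasets.
# ROOT is a hypothetical variable for this example; override it to use another location.
ROOT="${ROOT:-./datasets}"

# One images/ and one features/ folder per dataset
for name in COCO VG GQA nlvr2; do
    mkdir -p "$ROOT/$name/images" "$ROOT/$name/features"
done

# Text annotations (*.json) for each split
for task in lxmert vqa gqa nlvr2; do
    mkdir -p "$ROOT/data/$task"
done
```

`mkdir -p` is safe to rerun: it leaves existing folders and their contents untouched.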
Feature extraction
Please check out ./feature_extraction to download pre-extracted features and for more details.
cd ./feature_extraction
# For Pretraining / VQA
python coco_extract_grid_feature.py --split train
python coco_extract_grid_feature.py --split valid
python coco_extract_grid_feature.py --split test
# For Pretraining
python VG_extract_grid_feature.py
# For GQA
python GQA_extract_grid_feature.py
# For NLVR2
python nlvr2_extract_grid_feature.py --split train
python nlvr2_extract_grid_feature.py --split valid
python nlvr2_extract_grid_feature.py --split test
# K-Means clustering
python run_kmeans.py --src mscoco_train --tgt mscoco_train mscoco_valid vg
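The per-split extraction commands above can also be driven from a loop. The sketch below is one possible wrapper, not a script shipped with the repository: `DRY_RUN`, `run`, and `extract_all` are hypothetical names introduced here. By default it only prints the commands, so you can review them before running anything; it assumes you are inside `./feature_extraction`.

```shell
# Sketch: run the extraction scripts listed above from a loop.
# DRY_RUN, run, and extract_all are hypothetical helpers for this example.
# With the default DRY_RUN=1 the commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

extract_all() {
    # For Pretraining / VQA
    for split in train valid test; do
        run python coco_extract_grid_feature.py --split "$split"
    done
    # For Pretraining and GQA (no --split argument in the commands above)
    run python VG_extract_grid_feature.py
    run python GQA_extract_grid_feature.py
    # For NLVR2
    for split in train valid test; do
        run python nlvr2_extract_grid_feature.py --split "$split"
    done
}

extract_all
```

Set `DRY_RUN=0` in the environment to execute the commands instead of printing them.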
Pretraining
Pretrain on LXMERT Pretraining data
cd ./x-lxmert/
bash scripts/pretrain.bash
or download pretrained checkpoint
wget -O x-lxmert/snap/pretrained/x_lxmert/Epoch20_LXRT.pth https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/x-lxmert/Epoch20_LXRT.pth
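Note that `wget -O` writes to the given path but does not create missing parent directories, so the download fails on a fresh clone if the target folder is absent. A minimal sketch of the preparation step, assuming you run it from the repository root (the `ckpt_dir` variable is introduced here for illustration):

```shell
# Sketch: make sure the checkpoint folder exists before running wget -O.
# ckpt_dir is a hypothetical variable; the path is the one used in the wget command.
ckpt_dir="x-lxmert/snap/pretrained/x_lxmert"
mkdir -p "$ckpt_dir"
echo "checkpoint directory ready: $ckpt_dir"
```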
Finetuning
VQA
cd ./x-lxmert/
bash scripts/finetune_vqa.bash
bash scripts/test_vqa.bash
GQA
cd ./x-lxmert/
bash scripts/finetune_gqa.bash
bash scripts/test_gqa.bash
NLVR2
cd ./x-lxmert/
bash scripts/finetune_nlvr2.bash
bash scripts/test_nlvr2.bash
Image generation
Train image generator on MS COCO
cd ./image_generator/
bash scripts/train_generator.bash
or download pretrained checkpoints
wget -O image_generator/snap/pretrained/G_60.pth https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/image_generator/G_60.pth
Sample images
cd ./x-lxmert/
bash scripts/sample_image.bash
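Before sampling, it can help to confirm that the checkpoints are where the download commands put them. The sketch below is a hypothetical pre-flight check, not part of the repository: `check_checkpoints` is a name introduced here, and the assumption that sampling needs both the X-LXMERT and the image generator checkpoint at these paths is ours and may not match the script's configuration. Run it from the repository root.

```shell
# Sketch: report which checkpoint files are present.
# check_checkpoints is a hypothetical helper; it returns the number of missing files.
check_checkpoints() {
    missing=0
    for path in "$@"; do
        if [ -f "$path" ]; then
            echo "ok       $path"
        else
            echo "missing  $path"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Paths taken from the wget commands in this README
check_checkpoints \
    x-lxmert/snap/pretrained/x_lxmert/Epoch20_LXRT.pth \
    image_generator/snap/pretrained/G_60.pth \
    || echo "download or train the missing checkpoints before sampling"
```

If you trained your own models, pass the paths of your checkpoints to `check_checkpoints` instead.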
Reference
@inproceedings{Cho2020XLXMERT,
title={X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers},
author={Cho, Jaemin and Lu, Jiasen and Schwenk, Dustin and Hajishirzi, Hannaneh and Kembhavi, Aniruddha},
booktitle={EMNLP},
year={2020}
}