BridgeTower
This repo is the official PyTorch implementation of "BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning".
Updates
- Feb. 2023: BridgeTower has been integrated into Hugging Face Transformers; the Model Hub, code, and documentation are available (see the usage sketch after this list).
- Thanks to Anahita Bhiwandiwalla, Tiep Le, and Shaoyen Tseng from Intel Labs for their great work!
- Nov. 2022: BridgeTower was accepted at AAAI'23. Code and checkpoints are released.
- Jun. 2022: We released the preprint on arXiv.
- May 2022: BridgeTower (single model, pre-trained on 4M images) achieved 78.73% (base) and 81.15% (large) on the VQAv2 Challenge test-std set.
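For reference, here is a minimal sketch of scoring image-text pairs with the Transformers integration. The class names follow the BridgeTower API in Transformers; the checkpoint name `BridgeTower/bridgetower-base-itm-mlm` and the example image URL are assumptions that may need to be adjusted.

```python
# Minimal sketch: image-text matching with the Transformers BridgeTower integration.
# Assumes the BridgeTower/bridgetower-base-itm-mlm checkpoint is available on the Hub.
import requests
import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example COCO image
image = Image.open(requests.get(url, stream=True).raw)
texts = ["two cats sleeping on a couch", "a football player scoring a goal"]

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")

with torch.no_grad():
    for text in texts:
        encoding = processor(image, text, return_tensors="pt")
        outputs = model(**encoding)
        # logits[0, 1] is the image-text matching score
        print(f"{text}: {outputs.logits[0, 1].item():.3f}")
```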
Abstract
Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BridgeTower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BridgeTower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, BridgeTower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets.
Architecture
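The core idea described in the abstract, bridge layers that connect the top uni-modal encoder layers to each layer of the cross-modal encoder, can be illustrated with a small PyTorch sketch. This is only an illustration of the mechanism, not the repo's actual implementation; the add-and-LayerNorm combination and the tensor shapes are assumptions.

```python
# Illustrative sketch of the bridge-layer idea: each cross-modal layer fuses the
# previous cross-modal state with the output of a corresponding top uni-modal
# encoder layer through a lightweight bridge (here: add + LayerNorm).
# This is NOT the repo's actual implementation.
import torch
import torch.nn as nn

class BridgeLayer(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, cross_modal_state: torch.Tensor, uni_modal_state: torch.Tensor) -> torch.Tensor:
        # Combine the running cross-modal representation with a uni-modal layer output.
        return self.norm(cross_modal_state + uni_modal_state)

# Hypothetical usage with dummy tensors (batch=2, seq_len=16, hidden=768):
bridge = BridgeLayer(768)
cross_state = torch.randn(2, 16, 768)   # output of the previous cross-modal layer
uni_state = torch.randn(2, 16, 768)     # output of one of the top uni-modal layers
fused = bridge(cross_state, uni_state)  # fed into the next cross-modal layer
```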
Main Results
Deployment
- Run `setup.sh` to set up the environment.
- [Optional] We use wandb to track experiments! Please remember to run `wandb login` and paste your token before running the scripts (a Python alternative is sketched after this list).
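If you prefer to authenticate from Python instead of the CLI, here is a minimal sketch using the wandb package; the API key is a placeholder.

```python
# Sketch: programmatic wandb login, equivalent to running `wandb login` in the shell.
# Replace the placeholder with your own API key (or set the WANDB_API_KEY env var).
import wandb

wandb.login(key="YOUR_WANDB_API_KEY")
```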
Dataset Preparation
- We follow ViLT and use pyarrow to serialize the datasets; see here for details (a sketch for inspecting the serialized files follows this list).
- For SNLI-VE dataset, we follow here.
- For the VG-QA dataset, in addition to the VG image-text pairs obtained from here, the image metadata, question-answer data, and COCO split information also need to be downloaded.
- The final file structure of the datasets is shown in `setup.sh`.
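Here is a minimal sketch for inspecting one of the serialized `.arrow` files with pyarrow. The file name and `data/` directory below are only examples and depend on which datasets you prepared.

```python
# Sketch: inspect a dataset file serialized with the ViLT-style pyarrow pipeline.
# The path below is an example; adjust it to the dataset you actually prepared.
import pyarrow as pa

path = "data/coco_caption_karpathy_train.arrow"  # assumed location/name
table = pa.ipc.open_file(pa.memory_map(path, "r")).read_all()

print(table.num_rows)      # number of serialized records
print(table.column_names)  # exact schema depends on the dataset
```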
Checkpoints
- Fine-tuned checkpoints for:
  - Visual Question Answering on VQAv2: BASE, BASE (w/ VGQA), LARGE, LARGE (w/ VGQA)
  - Image-Text Retrieval on Flickr30k: BASE
  - Visual Entailment on SNLI-VE: BASE
  - Visual Reasoning on NLVR$^2$: BASE
  - Image-Text Retrieval on MSCOCO: BASE
- Here is an example of downloading a checkpoint:
# download azcopy
wget https://aka.ms/downloadazcopy-v10-linux
tar -xvf downloadazcopy-v10-linux
sudo cp ./azcopy_linux_amd64_*/azcopy /usr/bin/
sudo chmod -R 777 /usr/bin/azcopy

# azcopy copy [remote path] [local path]
azcopy copy "https://chenfei.blob.core.windows.net/data/G/LCI/best_checkpoints/BridgeTower_pt_base.ckpt?sv=2020-10-02&st=2022-11-24T12%3A18%3A49Z&se=2027-11-25T12%3A18%3A00Z&sr=b&sp=r&sig=BJigddAMHfNUtQuTGH8bJUrzAO3LfaeSm48AXUqZngY%3D" "./BridgeTower_pt_base.ckpt"
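After downloading, a quick way to sanity-check a checkpoint is to inspect it with PyTorch. This is a minimal sketch; the `state_dict` key is an assumption based on the usual PyTorch Lightning checkpoint layout.

```python
# Sketch: inspect a downloaded checkpoint.
# Assumes the usual PyTorch Lightning layout with a "state_dict" entry.
import torch

ckpt = torch.load("BridgeTower_pt_base.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # fall back to a raw state dict

print(len(state_dict), "tensors")
for name in list(state_dict)[:5]:
    print(name, tuple(state_dict[name].shape))
```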
Pre-training on Image-Text Datasets
# Pre-train BridgeTower Base Model
bash scripts/pre_train.sh
# Pre-train BridgeTower Large Model
bash scripts/pre_train_large.sh
Fine-tuning on Downstream VL Tasks
- For VQAv2 evaluation, submit the generated `json` file in the `logs/` directory to the eval.ai evaluation server to get the test-dev and/or test-std scores (a sketch of the expected submission format follows).
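For reference, the VQAv2 results file expected by the eval.ai server is a JSON list of question-id/answer pairs; the entries below are placeholders, not real predictions.

```python
# Sketch: the VQAv2 submission format is a JSON list of
# {"question_id": int, "answer": str} entries. Placeholder values only.
import json

results = [
    {"question_id": 262148000, "answer": "yes"},
    {"question_id": 262148001, "answer": "2"},
]

with open("vqa_submission.json", "w") as f:
    json.dump(results, f)
```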
# Base Model on VQAv2 without VLP
bash scripts/ftfs_base_vqa.sh
# Large Model on VQAv2 without VLP
bash scripts/ftfs_large_vqa.sh
# Base Model on VQAv2 with VLP
bash scripts/ftfpt_base_vqa.sh
# Large Model on VQAv2 with VLP
bash scripts/ftfpt_large_vqa.sh
# Base Model on IRTR-Flickr30K with VLP (directly use ITM with multiple false texts)
bash scripts/ftfpt_base_irtr_f30k.sh
# Base Model on IRTR-Flickr30K with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_irtr_itm_itc_f30k.sh
# Base Model on SNLI-VE with VLP
bash scripts/ftfpt_base_snlive.sh
# Base Model on NLVR^2 with VLP
bash scripts/ftfpt_base_nlvr2.sh
# Base Model on IRTR-MSCOCO with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_irtr_itm_itc_coco.sh
Fine-tuning on Uni-Modal Tasks
# Base Model on CIFAR with VLP
bash scripts/ftfpt_base_cifar.sh
# Base Model on GLUE with VLP
bash scripts/ftfpt_base_glue.sh
Citation
@article{xu2022bridge,
title={BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning},
author={Xu, Xiao and Wu, Chenfei and Rosenman, Shachar and Lal, Vasudev and Che, Wanxiang and Duan, Nan},
journal={arXiv preprint arXiv:2206.08657},
year={2022}
}
Acknowledgement
We are grateful to the authors of the following papers for releasing their public code, on which our code is partly based: