BridgeTower

This repo is the official PyTorch implementation of "BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning".

Updates

Abstract

Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BridgeTower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BridgeTower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, BridgeTower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets.

Architecture

[Figure: BridgeTower architecture]
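The bridge-layer wiring described in the abstract can be illustrated with a minimal, framework-free sketch. It is not the repository's implementation. It assumes a bridge that adds the cross-modal and uni-modal representations and then applies a layer normalization without learned parameters, and it assumes illustrative sizes of 12 uni-modal layers and 6 cross-modal layers. The names `layer_norm`, `bridge`, and `bridge_schedule` are hypothetical.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and (near) unit variance.

    Assumption: no learned scale or shift, unlike a trained LayerNorm.
    """
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def bridge(cross_state, uni_state):
    """Assumed bridge form: element-wise add, then normalize."""
    if len(cross_state) != len(uni_state):
        raise ValueError("both representations must have the same width")
    return layer_norm([c + u for c, u in zip(cross_state, uni_state)])


def bridge_schedule(num_uni_layers, num_cross_layers):
    """Map each cross-modal layer (0-based) to the uni-modal layer feeding it.

    Only the top `num_cross_layers` uni-modal layers are connected, in
    bottom-up order, one per cross-modal layer.
    """
    if num_cross_layers > num_uni_layers:
        raise ValueError("cannot bridge more cross-modal than uni-modal layers")
    first = num_uni_layers - num_cross_layers
    return {k: first + k for k in range(num_cross_layers)}


if __name__ == "__main__":
    # Illustrative sizes only (an assumption, not read from the repo config).
    for cross_idx, uni_idx in bridge_schedule(12, 6).items():
        print(f"cross-modal layer {cross_idx} <- uni-modal layer {uni_idx}")
```

With these assumed sizes, `bridge_schedule(12, 6)` pairs cross-modal layers 0 to 5 with uni-modal layers 6 to 11, so each cross-modal layer receives a representation from a different semantic level rather than only the last uni-modal layer.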

Main Results

[Figure: Result 1]

[Figure: Result 2]

Deployment

Dataset Preparation

Checkpoints

Pre-training on Image-Text Datasets

# Pre-train BridgeTower Base Model
bash scripts/pre_train.sh
# Pre-train BridgeTower Large Model
bash scripts/pre_train_large.sh

Fine-tuning on Downstream VL Tasks

# Base Model on VQAv2 without VLP
bash scripts/ftfs_base_vqa.sh

# Large Model on VQAv2 without VLP
bash scripts/ftfs_large_vqa.sh

# Base Model on VQAv2 with VLP
bash scripts/ftfpt_base_vqa.sh

# Large Model on VQAv2 with VLP
bash scripts/ftfpt_large_vqa.sh

# Base Model on IRTR-Flickr30K with VLP (directly use ITM with multiple false texts)
bash scripts/ftfpt_base_irtr_f30k.sh

# Base Model on IRTR-Flickr30K with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_irtr_itm_itc_f30k.sh

# Base Model on SNLI-VE with VLP
bash scripts/ftfpt_base_snlive.sh

# Base Model on NLVR^2 with VLP
bash scripts/ftfpt_base_nlvr2.sh

# Base Model on IRTR-MSCOCO with VLP (follow ALBEF to use ITC to sample hard negatives for ITM)
bash scripts/ftfpt_base_irtr_itm_itc_coco.sh
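
When several of the runs above are launched from a Python driver, a small lookup table can help avoid mistyped script names. The sketch below is only a convenience wrapper under stated assumptions: the script paths are copied from the list above, while the task keys, `FINETUNE_SCRIPTS`, `command_for`, and `run` are hypothetical names that do not exist in the repository.

```python
import subprocess

# (task, model size, with VLP) -> script path, copied from the commands above.
# The task keys are hypothetical labels chosen for this sketch.
FINETUNE_SCRIPTS = {
    ("vqa", "base", False): "scripts/ftfs_base_vqa.sh",
    ("vqa", "large", False): "scripts/ftfs_large_vqa.sh",
    ("vqa", "base", True): "scripts/ftfpt_base_vqa.sh",
    ("vqa", "large", True): "scripts/ftfpt_large_vqa.sh",
    ("irtr_f30k", "base", True): "scripts/ftfpt_base_irtr_f30k.sh",
    ("irtr_itm_itc_f30k", "base", True): "scripts/ftfpt_base_irtr_itm_itc_f30k.sh",
    ("snlive", "base", True): "scripts/ftfpt_base_snlive.sh",
    ("nlvr2", "base", True): "scripts/ftfpt_base_nlvr2.sh",
    ("irtr_itm_itc_coco", "base", True): "scripts/ftfpt_base_irtr_itm_itc_coco.sh",
}


def command_for(task, size="base", with_vlp=True):
    """Return the argv list for a listed fine-tuning run, or raise ValueError."""
    key = (task, size, with_vlp)
    if key not in FINETUNE_SCRIPTS:
        raise ValueError(f"no script listed for {key}")
    return ["bash", FINETUNE_SCRIPTS[key]]


def run(task, size="base", with_vlp=True):
    """Launch the run from the repository root; fails loudly on a non-zero exit."""
    subprocess.run(command_for(task, size, with_vlp), check=True)
```

For example, `command_for("vqa", "large", with_vlp=False)` returns `["bash", "scripts/ftfs_large_vqa.sh"]`, and a combination that is not listed above raises `ValueError` instead of guessing a path.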

Fine-tuning on Uni-Modal Tasks

# Base Model on CIFAR with VLP
bash scripts/ftfpt_base_cifar.sh

# Base Model on GLUE with VLP
bash scripts/ftfpt_base_glue.sh

Citation

@article{xu2022bridge,
  title={BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning},
  author={Xu, Xiao and Wu, Chenfei and Rosenman, Shachar and Lal, Vasudev and Che, Wanxiang and Duan, Nan},
  journal={arXiv preprint arXiv:2206.08657},
  year={2022}
}

Acknowledgement

We are grateful for the public code of the following papers, on which our code is partly based: