GRIT-VLP: GRouped mIni-baTch sampling for Efficient Vision-Language Pre-training
This is the official PyTorch implementation of "GRIT-VLP: GRouped mIni-baTch sampling for Efficient Vision-Language Pre-training" (accepted to ECCV 2022).
This repository contains the code for pre-training and fine-tuning GRIT-VLP.
<img src="img.png" width="600">

Pre-training Dataset Download:
Downstream-task Datasets:
Json Files:
- Use the same json files as ALBEF.
- Change the image paths in the json files to match the location of your downloaded images. Some images in CC3M and SBU can no longer be crawled, so account for these missing images when creating the json files.
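The path rewrite described above can be sketched as follows. This is a minimal sketch, not part of the released code: the function names, the `image_root` parameter, and the assumption that each json file is a list of records with an `"image"` key are all illustrative, so check them against the json files you actually downloaded. It also assumes your images sit in one flat directory, so only the file name of each original path is kept.

```python
import json
import os


def rewrite_image_paths(entries, image_root, image_key="image", exists=os.path.exists):
    """Point each entry at image_root and drop entries whose image is missing.

    Assumption: `entries` is a list of dicts that store the image path
    under `image_key`. Adjust the key if your json files differ.
    """
    kept = []
    for entry in entries:
        # Keep only the file name; the original directory prefix is machine-specific.
        file_name = os.path.basename(entry[image_key])
        new_path = os.path.join(image_root, file_name)
        # Skip images that could not be crawled (common for CC3M and SBU).
        if exists(new_path):
            kept.append({**entry, image_key: new_path})
    return kept


def convert_file(src_json, dst_json, image_root):
    """Rewrite one json file and return (original count, kept count)."""
    with open(src_json) as f:
        entries = json.load(f)
    kept = rewrite_image_paths(entries, image_root)
    with open(dst_json, "w") as f:
        json.dump(kept, f)
    return len(entries), len(kept)


# Example call (hypothetical paths):
# total, kept = convert_file("cc3m_train.json", "cc3m_train_local.json", "/data/cc3m/images")
```

Comparing the two counts returned by `convert_file` gives a quick check of how many images are missing from your local copy.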
Requirements:
- pytorch 1.8.0
- transformers 4.8.1
- timm 0.4.9
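To confirm that an environment matches the versions listed above, a small check like the one below may help. It is only a sketch: `REQUIRED` and `find_mismatches` are illustrative names, and the only outside knowledge it relies on is that PyTorch is distributed on PyPI as `torch`. It needs Python 3.8 or newer for `importlib.metadata`.

```python
from importlib import metadata

# Pinned versions from the requirements list; "torch" is the PyPI name for PyTorch.
REQUIRED = {"torch": "1.8.0", "transformers": "4.8.1", "timm": "0.4.9"}


def find_mismatches(required, get_version=metadata.version):
    """Return {package: installed version or None} for packages that differ from the pins."""
    mismatches = {}
    for name, wanted in required.items():
        try:
            # Drop a local build suffix such as "+cu111" before comparing.
            installed = get_version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            installed = None  # Package is not installed at all.
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


# Example call: print(find_mismatches(REQUIRED))
```

An empty dictionary means all three packages match the pinned versions.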
Pre-training:
- Pre-train the model using 4 A100 GPUs:
Downstream tasks:
- IRTR (MS-COCO) using 4 A100 GPUs:
- IRTR (Flickr) using 4 A100 GPUs:
- NLVR using 4 A100 GPUs:
- VQA using 4 A100 GPUs:
If you have any questions or run into problems with this code, please email wotjr3868@snu.ac.kr or gxq9106@gmail.com. Thank you!
Acknowledgement:
Our implementation borrows heavily from ALBEF, since our method is built on top of it. We thank the original authors for sharing their code.