# Training Vision-language Transformers from Captions Alone
This is a PyTorch/GPU implementation of the paper VLC: *Training Vision-language Transformers from Captions Alone*. Our work builds on MAE and the pioneering work ViLT.
## Install

```bash
pip install -r requirements.txt
pip install -e .
```
## Pre-trained models

| Task | Base set (4M) | Large set (5.6M) |
|---|---|---|
| Pre-training | vlc_baseset.ckpt | vlc_largeset.ckpt |
| VQA | vlc_baseset_vqa_submission | vlc_largeset_vqa_submission |
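The pre-training checkpoints are ordinary PyTorch files. A minimal sketch for inspecting one, assuming the weights live under a `state_dict` key (the usual PyTorch Lightning layout; not confirmed by this README):

```python
import torch

# Load the checkpoint on CPU; the file name is the one from the table above.
ckpt = torch.load("vlc_baseset.ckpt", map_location="cpu")

# Lightning-style checkpoints keep the weights under "state_dict";
# fall back to the raw dict otherwise (an assumption about the format).
state_dict = ckpt.get("state_dict", ckpt)
print(f"{len(state_dict)} tensors, e.g. {next(iter(state_dict))}")
```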
## Dataset Preparation

We follow ViLT and use pyarrow to serialize the datasets. See this link for details.
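For orientation, here is a minimal sketch of ViLT-style pyarrow serialization. The field names (`image`, `caption`, `image_id`) and file paths are hypothetical; the linked ViLT scripts define the actual schema:

```python
import pandas as pd
import pyarrow as pa

# One row per image: raw JPEG bytes plus its captions
# (hypothetical schema; the ViLT data-preparation scripts define the real one).
df = pd.DataFrame(
    [{"image": open(path, "rb").read(), "caption": ["a caption"], "image_id": path}
     for path in ["0.jpg", "1.jpg"]]  # hypothetical image paths
)

table = pa.Table.from_pandas(df)
with pa.OSFile("dataset.arrow", "wb") as sink:
    with pa.RecordBatchFileWriter(sink, table.schema) as writer:
        writer.write_table(table)
```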
## Pre-training

Google Conceptual Captions contains some corrupted images, so we remove any image that cannot be loaded by PIL. See check_valid_images.py in the data_process folder.
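The check amounts to attempting a full PIL decode and discarding files that raise. A minimal sketch of that idea (the shipped script is check_valid_images.py; the folder name below is hypothetical):

```python
from pathlib import Path

from PIL import Image

def is_valid_image(path):
    """Return True if PIL can fully decode the image."""
    try:
        with Image.open(path) as img:
            img.convert("RGB")  # force a full decode, not just the header read
        return True
    except Exception:
        return False

# Example: list corrupted files under a (hypothetical) image folder.
bad = [p for p in Path("cc_images").glob("*.jpg") if not is_valid_image(p)]
```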
```bash
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_mlm_itm_mae per_gpu_batchsize=<BS_FITS_YOUR_GPU> whole_word_masking=True step25k image_size=384 pretrain_path=<PRETRAIN_PATH> log_dir=<LOG_FOLDER> mae_weight=1.0
```
## Fine-tuning on Downstream Tasks
### VQAv2
Following ALBEF and UNITER, we also use VG-VQA data during VQAv2 fine-tuning. We keep a VG-VQA question-answer pair only if 1) the corresponding image is in the VQAv2 training or validation split, and 2) the answer appears in the VQAv2 answer set. See map_vg_mscoco.py and write_valid_vgqa.py in the data_process folder; a sketch of the filtering logic follows.
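The filter reduces to two set-membership checks. A minimal sketch with hypothetical field names (`coco_id`, `answer`); the real logic lives in write_valid_vgqa.py:

```python
def filter_vg_qa(vg_qa_pairs, vqa_image_ids, vqa_answers):
    """Keep VG-VQA pairs whose image is in VQAv2 train/val and whose
    answer is in the VQAv2 answer set (hypothetical field names)."""
    return [
        qa
        for qa in vg_qa_pairs
        if qa["coco_id"] in vqa_image_ids and qa["answer"] in vqa_answers
    ]

# Toy example with made-up data:
pairs = [{"coco_id": 1, "answer": "red"}, {"coco_id": 2, "answer": "xyz"}]
print(filter_vg_qa(pairs, vqa_image_ids={1, 2}, vqa_answers={"red"}))
# -> [{'coco_id': 1, 'answer': 'red'}]
```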
```bash
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_finetune_vqa_mae_randaug per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> log_dir=<LOG_FOLDER> image_size=576 learning_rate=5e-4
```
### NLVR2
```bash
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_finetune_nlvr2_mae_randaug per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> log_dir=<LOG_FOLDER> image_size=384 learning_rate=5e-4
```
### COCO IR/TR
```bash
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_finetune_irtr_coco_mae_randaug per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> log_dir=<LOG_FOLDER> image_size=384 learning_rate=5e-4
```
### Flickr30K IR/TR
```bash
python run.py with data_root=<ARROW_ROOT> num_gpus=<NUM_GPUS> num_nodes=<NUM_NODES> task_finetune_irtr_f30k_mae_randaug per_gpu_batchsize=<BS_FITS_YOUR_GPU> load_path=<PRETRAINED_MODEL> log_dir=<LOG_FOLDER> image_size=384 learning_rate=5e-4
```
### Image classification on ImageNet-1K
```bash
python -m torch.distributed.launch --nnodes=2 --nproc_per_node=16 --master_port 44875 main_finetune.py \
    --batch_size 32 \
    --model vit_base_patch16 \
    --finetune <PRETRAINED_MODEL> \
    --epochs 100 \
    --input_size 384 \
    --blr 5e-4 \
    --layer_decay 0.65 \
    --weight_decay 0.05 \
    --drop_path 0.1 \
    --reprob 0.25 \
    --mixup 0.8 \
    --cutmix 1.0 \
    --dist_eval \
    --data_path <ImageNet-1K ROOT> \
    --output_dir <DIR to SAVE CHECKPOINTS>
```
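For reference, this launch gives an effective batch size of 2 nodes × 16 GPUs × 32 per GPU = 1024. Assuming MAE's usual linear learning-rate scaling (lr = blr × effective batch size / 256), the actual learning rate is 5e-4 × 1024 / 256 = 2e-3.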
## Acknowledgements

The code is based on ViLT, licensed under Apache 2.0, and MAE, licensed under CC-BY-NC 4.0.