Awesome
Image Difference Captioning with Pre-training and Contrastive Learning
This repository is the official implementation of Image Difference Captioning with Pre-training and Contrastive Learning in AAAI2022.
The Image Difference Captioning(IDC) task aims to describe the visual differences between two similar images with natural language. In this work, we propose a new framework following the pre-training and fine-tuning paradigm for IDC. Specifically, we design three self-supervised tasks with contrastive learning strategies to align visual differences and text descriptions at a fine-grained level. Moreover, we propose a data expansion strategy to utilize extra cross-task supervision information, such as data for fine-grained image classification, to alleviate the limitation of available supervised IDC data.
Installation
conda create --name IDC python=3.6
conda activate IDC
pip install torch==1.9.0+cu102 torchvision==0.10.0+cu102 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
Data Download
We provide the pre-processed image features (by pre-trained ResNet101) , the annotations and the constructed negative data samples of CLEVR-Change and Birds-to-Words dataset in baiduyun password:6zv0 .
You should put the files under the corresponding./clver
or ./bird
folder as follows:
clver
├── dataset_clver
bird
├── dataset
├── bird
├── cub
└── nabirds
CLEVR-Change dataset
cd ./clver
Pre-training
python3.6 pretrain.py --dataset clver --gpu_id 3 \
--exp_name pretrain_clver_neg_tfidf6_t1.0 \
--config ./config/pretrain_clver.json \
--total_train_steps 250000 \
--tmp 1.0
[Note] All settable parameters are explained in para.py
(Optional) View logs via tensorboard
tensorboard --logdir=./experiments/pretrain_clver_neg_tfidf6_t1.0/log --host=0.0.0.0 --port=8080
Fine-tuning
python3.6 finetune.py --mode train --dataset clver --gpu_id 0 \
--exp_name finetune_clver_neg_tfidf6_t1.0 \
--config ./config/finetune_clver.json \
--restore ./experiments/pretrain_clver_neg_tfidf6_t1.0/checkpoint/checkpoint_250000.pt
Inference & Evaluation
python3.6 finetune.py --mode test --dataset clver --gpu_id 0 \
--exp_name finetune_clver_neg_tfidf6_t1.0 \
--config ./config/finetune_clver.json
cd ../eval
python3.6 eval_models.py --dataset clevr \
--testfile ../clver/experiments/finetune_clver_neg_tfidf6_t1.0/results.json \
--gtfile ../clver/dataset_clver/test.json
We also provide the pre-trained and fine-tuned checkpoints at baidu yun (password: 0b07). The reported results on CLEVR-Change dataset are as follows:
Dataset | BLEU4 | METEOR | ROUGE-L | CIDEr |
---|---|---|---|---|
CLEVR-Change | 51.2 | 36.2 | 71.7 | 128.9 |
Birds-to-Words dataset
cd ./bird
Pre-training
We adopt cross-task data expansion strategy on Birds-to-Words dataset to provide additional in-domain knowledge. Specifically, we utilize extra data from general image captioning (GIC), that is the CUB dataset, and Fine-grained visual classification (FGVC), that is the NABirds dataset.
# Stage 1: training with CUB dataset
python3.6 pretrain_cub.py --dataset cub --exp_name pretrain_cub --gpu_id 0 --config ./config/pretrain_cub.json
# Stage 2: training with Birds-to-Words and NABirds dataset alternately
python3.6 pretrain.py --dataset bird --exp_name pretrain_cub_nabirds_bird --gpu_id 3 --config ./config/pretrain_bird_nabirds.json --restore ./experiments/pretrain_cub/checkpoint/checkpoint_60000.pt
Fine-tuning
python3.6 finetune.py --dataset bird --exp_name finetune_bird \
--mode train --gpu_id 3 --config ./config/finetune_bird.json \
--restore experiments/pretrain_cub_nabirds_bird/checkpoint/checkpoint_60000.pt --batch_size 32
Inference & Evaluation
python3.6 finetune.py --mode test --dataset bird --gpu_id 0 \
--exp_name finetune_bird \
--config ./config/finetune_bird.json
cd ../eval
python3.6 eval_models.py --dataset bird \
--testfile ../bird/experiments/finetune_bird/result.json \
--gtfile ../bird/dataset/bird/test_self.json
We also provide the pre-trained and fine-tuned checkpoints at baidu yun (password:to5a). The reported results on Birds-to-Words dataset are as follows:
Dataset | BLEU4 | METEOR | CIDEr-D | ROUGE-L |
---|---|---|---|---|
Birds-to-Words | 31.0 | 23.4 | 25.3 | 49.1 |
Citation
@article{Yao_Wang_Jin_2022,
title={Image Difference Captioning with Pre-training and Contrastive Learning},
volume={36},
url={https://ojs.aaai.org/index.php/AAAI/article/view/20218}, DOI={10.1609/aaai.v36i3.20218},
number={3},
journal={Proceedings of the AAAI Conference on Artificial Intelligence},
author={Yao, Linli and Wang, Weiying and Jin, Qin},
year={2022},
month={Jun.},
pages={3108-3116}
}