Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning
This package contains the accompanying code for the following paper:
Tu, Yunbin, et al. "Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning", which appeared as a regular paper at ECCV 2024. The arXiv version is here.
We describe the training and testing details as follows:
Installation
- Make a virtual environment with miniconda: `conda create -n card python=3.8`
- Install requirements: `pip install -r requirements.txt`
- Set up the COCO caption eval tools (github). Since that repo only supports Python 2.7, either create a separate virtual environment with Python 2.7 or modify the code to be compatible with Python 3.
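For reference, here is a minimal setup sketch for the eval tools, assuming the commonly used tylin/coco-caption repository and a separate conda environment (the repository URL and environment name are assumptions; adapt them to whichever fork you use):

```
# clone the COCO caption evaluation tools (assumption: the tylin/coco-caption repository)
git clone https://github.com/tylin/coco-caption.git
# optional: a separate Python 2.7 environment, since the original tools target Python 2.7
conda create -n coco-eval python=2.7
```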
Data
- Download the image data from here: Viewpoint-Agnostic Change Captioning with Cycle Consistency (ICCV'21).
- Split them into before-change and after-change images, and put the two kinds of images into two directories named `images` and `sc_images` (see the layout sketch at the end of this section). I have also uploaded my downloaded images to the Baidu drive as clevr-dc.zip, where the extraction code is `dc24`.
- After obtaining the image pairs and captions, rename them first with the following commands:
```
# rename image pairs
python pad_img.py
# rename captions
python rename_dc_caption.py
```
- Preprocess data
Extract visual features using an ImageNet-pretrained ResNet-101:
```
# process default (before-change) images
python scripts/extract_features.py --input_image_dir ./clevr_dc/images --output_dir ./clevr_dc/features --batch_size 128
# process semantically changed images
python scripts/extract_features.py --input_image_dir ./clevr_dc/sc_images --output_dir ./clevr_dc/sc_features --batch_size 128
```
- Build vocab and label files using caption annotations:
```
python scripts/preprocess_captions_dc.py
```
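Putting the data steps together, the commands above assume a layout along these lines (a sketch based on the paths used in this README; `features` and `sc_features` are produced by the feature-extraction step):

```
# expected layout under ./clevr_dc (a sketch; adjust paths to your setup)
mkdir -p clevr_dc/images       # before-change images
mkdir -p clevr_dc/sc_images    # after-change (semantically changed) images
# ./clevr_dc/features and ./clevr_dc/sc_features hold the extracted ResNet-101 features
```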
Training
To train the proposed method, run the following commands:
```
# create a directory or a symlink to save the experiment logs/snapshots etc.
mkdir experiments
# OR
ln -s $PATH_TO_DIR$ experiments

# start the visdom server for logging
# run the server in a tmux session, since it needs to stay up during training (see the note below)
python -m visdom.server

# start training
python train.py --cfg configs/dynamic/transformer_dc.yaml
```
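If you have not used visdom together with tmux before, a minimal workflow looks like this (the session name is arbitrary; by default visdom serves its dashboard at http://localhost:8097):

```
# keep the logging server alive in a detachable tmux session
tmux new -s visdom
python -m visdom.server        # dashboard at http://localhost:8097 by default
# detach with Ctrl-b d, and reattach later with:
tmux attach -t visdom
```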
Testing/Inference
To test/run inference on the test set, run the following command:
```
python test.py --cfg configs/dynamic/transformer_dc.yaml --snapshot 12000 --gpu 1
```
The command above will take the model snapshot at the 12000th iteration and run inference using GPU ID 1.
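The generated captions for the test set are written under the experiment directory; with the paths used in the Evaluation section below, you can inspect them as follows (the experiment name DIRL+CCR is taken from that section and may differ depending on your config):

```
# list the captions generated on the test set (path taken from the Evaluation section below)
ls ./experiments/DIRL+CCR/test_output/captions
```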
Evaluation
- Caption evaluation
To evaluate captions, we first need to reformat the caption annotations into the COCO eval tool format (this only needs to be run once). After setting up the COCO caption eval tools (github), make sure to modify `utils/eval_utils.py` so that the `COCO_PATH` variable points to the COCO eval tool repository. Then, run the following command:
```
python utils/eval_utils_dc.py
```
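The `COCO_PATH` edit mentioned above is typically a one-line assignment pointing at your local clone of the eval tools; a quick sanity check (only the file and variable names already used in this README are assumed):

```
# confirm COCO_PATH points at your clone of the COCO caption eval tools
grep -n "COCO_PATH" utils/eval_utils.py
```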
Once the reformatted annotations are ready, run evaluation with the following command:
```
# This will run evaluation on the results generated from the validation set and print the best results
python evaluate_dc.py --results_dir ./experiments/DIRL+CCR/eval_sents --anno ./clevr_dc/change_captions_reformat.json
```
Once the best model is found on the validation set, you can evaluate its outputs on the test set:
```
python evaluate_dc.py --results_dir ./experiments/DIRL+CCR/test_output/captions --anno ./clevr_dc/change_captions_reformat.json
```
The results are saved in `./experiments/DIRL+CCR/test_output/captions/eval_results.txt`.
If you find this helpful for your research, please consider citing:
```
@inproceedings{tu2024distractors,
  title={Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning},
  author={Tu, Yunbin and Li, Liang and Su, Li and Yan, Chenggang and Huang, Qingming},
  booktitle={ECCV},
  pages={311--328},
  year={2024},
}
```
Contact
My email is tuyunbin1995@foxmail.com
Any discussions and suggestions are welcome!