CARE
PyTorch Implementation of our TIP paper:
Concept-Aware Video Captioning: Describing Videos With Effective Prior Information
Bang Yang, Meng Cao and Yuexian Zou.
Update Notes
[2023-10-22] We release the code and data.
Environment
Clone and enter the repo:
git clone https://github.com/yangbang18/CARE.git
cd CARE
We have refactored the code and tested it with:
- Python 3.9
- torch 1.13.1
- cuda 11.7
Other versions may also work, e.g., Python 3.7, torch 1.7.1, and cuda 10.1.
Please adjust the torch and cuda versions according to your hardware.
conda create -n CARE python==3.9
conda activate CARE
# Install a proper version of torch, e.g.:
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1+cu117 -f https://download.pytorch.org/whl/cu117/torch_stable.html
# Note that `torch < 1.7.1` is incompatible with the package `clip`
pip install -r requirement.txt
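After installation, you can run a quick sanity check (a minimal sketch, not part of this repo) to confirm that torch and CUDA are visible:
# sanity check — a minimal sketch; adjust expected versions to whatever you installed
import torch
print("torch:", torch.__version__)              # e.g., 1.13.1+cu117
print("cuda available:", torch.cuda.is_available())
print("cuda version:", torch.version.cuda)      # e.g., 11.7
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))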
Running
Overview
1. Supported datasets (please follow README_DATA.md to prepare data)
MSVD
MSRVTT
VATEX
2. Supported methods, whose configurations can be found in config/methods.yaml
- Transformer: our baseline (autoregressive)
- TopDown: same encoder as Transformer, with a two-layer LSTM decoder (autoregressive)
- ARB: a slightly different encoder compared with Transformer (autoregressive)
- NACF: same encoder as ARB, with a different decoding algorithm (non-autoregressive)
- ...
3. Supported feats, whose configurations can be found in config/feats.yaml
4. Supported modality combinations: any combination of a (audio), m (motion) and i (image).
5. Supported architectures, whose configurations can be found in config/archs.yaml
- base: used in MSVD, MSRVTT (model dimension: 512)
- median: used in VATEX (model dimension: 768)
- large: used in VATEX (model dimension: 1,024)
6. Supported tasks, whose configurations can be found in config/tasks.yaml.
- Base: takes a specified combination of features for decoding.
- CABase: visual-driven concept detection and "Cross -> Semantic" local semantic guidance.
- CARE: multimodal-driven concept detection (MCD) and global-local semantic guidance (G-LSG).
- Concept: the basic implementation of CABase and CARE; it supports modifying the hyper-parameters related to concept detection.
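To see which methods, feats, archs, and tasks are defined in the YAML files above without opening each one, a sketch like the following can list their top-level keys (this assumes each config file is a mapping keyed by name, which we have not verified against the repo):
# a hedged sketch for listing available configurations; assumes name-keyed YAML mappings
import yaml
for path in ["config/methods.yaml", "config/feats.yaml", "config/archs.yaml", "config/tasks.yaml"]:
    with open(path) as f:
        cfg = yaml.safe_load(f)
    print(path, "->", sorted(cfg.keys()))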
Training
Command Format:
For the Base
task:
python train.py \
--dataset $dataset_name \
--method $method_name \
--feats $feats_name \
--modality $modality_combination \
--arch $arch_name \
--task $task_name
For the CABase, CARE, and Concept tasks:
python train.py \
--dataset $dataset_name \
--method $method_name \
--feats $feats_name \
--decoder_modality_flags $flags1 \
--predictor_modality_flags $flags2 \
--arch $arch_name \
--task $task_name
Note: the only difference is that we specify decoder_modality_flags and predictor_modality_flags rather than modality, because the modalities used for concept detection and for decoding can differ. Here are some mappings between flags and modalities (refer to flag2modality in config/Constants.py):
- I (image): i
- V (vision): mi
- A (audio): a
- VA (vision + audio): ami
- VAT (vision + audio + text): amir
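For reference, the mapping above behaves like the dictionary sketched below; the authoritative definition is flag2modality in config/Constants.py, which may contain more entries than shown here:
# a sketch mirroring the flag-to-modality mapping listed above (see config/Constants.py for the real one)
flag2modality = {
    "I": "i",       # image
    "V": "mi",      # vision (motion + image)
    "A": "a",       # audio
    "VA": "ami",    # vision + audio
    "VAT": "amir",  # vision + audio + text
}
# e.g., --predictor_modality_flags VAT lets the concept predictor use modalities "amir"
print(flag2modality["VAT"])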
Examples:
python train.py \
--dataset MSRVTT \
--method Transformer \
--arch base \
--task Base \
--feats ViT \
--modality ami
python train.py \
--dataset MSRVTT \
--method Transformer \
--arch base \
--task CARE \
--feats ViT \
--decoder_modality_flags VA \
--predictor_modality_flags VAT
Testing
ckpt=/path/to/checkpoint
python translate.py --checkpoint_paths $ckpt
# evaluate on the validation set
python translate.py --checkpoint_paths $ckpt --mode validate
# evaluate on the validation set & save results to a csv file (same directory as the checkpoint)
python translate.py --checkpoint_paths $ckpt --mode validate --save_csv --csv_name val_result.csv
# evaluate on the validation set & save results to a csv file
python translate.py --checkpoint_paths $ckpt --mode validate --save_csv --csv_path ./results/csv --csv_name dummy.csv
# save caption predictions
python translate.py --checkpoint_paths $ckpt --json_path ./results/predictions --json_name dummy.json
# save detailed per-sample scores
python translate.py --checkpoint_paths $ckpt --save_detailed_scores_path ./results/detailed_scores/dummy.json
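After saving predictions or per-sample scores, you can peek at the resulting JSON files with a few lines of Python (a minimal sketch; the exact layout of these files is not documented here, so inspect one entry before writing analysis code):
# a hedged sketch for inspecting saved outputs; the JSON structure is an assumption
import json
with open("./results/predictions/dummy.json") as f:
    preds = json.load(f)
print(type(preds), len(preds))
# look at one entry to learn the actual structure
first = next(iter(preds.items())) if isinstance(preds, dict) else preds[0]
print(first)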
Show Results
You can run the following command to gather results; it reports the mean metric scores and their standard deviations across multiple runs.
python misc/merge_csv.py --dataset MSVD --average --output_path ./results --output_name MSVD.csv
Reproducibility
Main Experiments (Compared with SOTA)
bash scripts/exp_main_MSVD.sh
bash scripts/exp_main_MSRVTT.sh
bash scripts/exp_main_VATEX.sh
Ablation Study
Note: each script is self-contained, i.e., a model variant may be included in N different scripts, each of which trains the model with K different seeds (seeds in [0, K); K=5 by default), resulting in N x K runs. Just be careful.
- Main Ablation Study
bash scripts/exp_ablation_main.sh
- Ablation on GSG and LSG variants
bash scripts/exp_ablation_GLSG.sh
- Versatility of CARE on Various Encoder-Decoder Networks
bash scripts/exp_versatility_of_CARE.sh
Note: To reproduce SwinBERT, we do not train the model in an end-to-end manner. Instead, we conduct experiments on features extracted from fine-tuned SwinBERT checkpoints. Therefore, if you want to follow our implementation, please refer to yangbang18/Video-Swin-Transformer for feature extraction first (we do not share the feature files due to their large size). Otherwise, comment out the SwinBERT-related commands in scripts/exp_versatility_of_CARE.sh.
- Measuring retrieval performance
python pretreatment/clip_retrieval.py --dataset MSRVTT --n_frames 28 --eval --arch RN50
python pretreatment/clip_retrieval.py --dataset MSRVTT --n_frames 28 --eval --arch RN50x16
python pretreatment/clip_retrieval.py --dataset MSRVTT --n_frames 28 --eval --arch ViT-B/32
python pretreatment/clip_retrieval.py --dataset MSRVTT --n_frames 28 --eval --arch ViT-B/32 --postfix ft
python pretreatment/clip_retrieval.py --dataset MSRVTT --n_frames 28 --eval --arch ViT-B/16 --postfix ft
Note:
- Image features for the first 3 commands are provided in our data, but the corresponding text features should be extracted by yourself; please refer to step 3 in README_DATA.md.
- Image and text features for the last 2 commands are provided in our data; they are extracted from CLIP fine-tuned on MSRVTT. Our fine-tuning is done with open_clip.
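If you extract the text features yourself (needed for the first 3 commands), a minimal sketch with the official clip package looks like the following; the captions used and the output format expected by this repo are assumptions, so follow README_DATA.md for the exact procedure:
# a hedged sketch of CLIP text feature extraction; caption source and storage format are assumptions
import torch
import clip
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)   # or RN50 / RN50x16 to match the image features
captions = ["a man is playing a guitar"]          # replace with the dataset's captions
tokens = clip.tokenize(captions).to(device)
with torch.no_grad():
    text_feats = model.encode_text(tokens)        # (num_captions, feature_dim)
print(text_feats.shape)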
Analysis
Please refer to notebooks.
Citation
Please star (★) this repo and cite the following paper if you find our code and data useful for your research:
@ARTICLE{yang2023CARE,
author={Yang, Bang and Cao, Meng and Zou, Yuexian},
journal={IEEE Transactions on Image Processing},
title={Concept-Aware Video Captioning: Describing Videos With Effective Prior Information},
year={2023},
volume={32},
number={},
pages={5366-5378},
doi={10.1109/TIP.2023.3307969}
}
Acknowledgement
- This codebase is built upon our previous one, i.e., CLIP-Captioner.