Non-Autoregressive Coarse-to-Fine Video Captioning

PyTorch Implementation of the paper:

Non-Autoregressive Coarse-to-Fine Video Captioning (AAAI2021)

Bang Yang, Yuexian Zou*, Fenglin Liu and Can Zhang.

[arXiv] or [aaai.org]

Updates

[22 Oct 2023] This repository is no longer maintained. If you want to reproduce the proposed NACF method in a more modern framework (i.e., PyTorch Lightning) or with more advanced video features as inputs (e.g., CLIP features), please refer to our latest repository.

[30 Aug 2021] Update the out-of-date links.

[16 Jun 2021] Add detailed instructions for extracting 3D features of videos.

[12 Mar 2021] We have released the codebase, preprocessed data and pre-trained models.

Main Contribution

  1. The first non-autoregressive decoding-based method for video captioning.
  2. A part-of-speech-specific generation task to alleviate the insufficient training of meaningful words.
  3. Visual word-driven flexible decoding algorithms for caption generation.

Content

Environment

We recommend using Anaconda to create a new environment:

conda create -n cap python==3.7
conda activate cap
pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
pip install tqdm psutil h5py PyYaml wget
pip install tensorboard==2.2.2 tensorboardX==2.1

Here we use torch 1.6.0 built on CUDA 10.1; other versions of torch may also work.
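
After installation, a quick sanity check (a minimal optional sketch, not part of the repo) confirms that torch imports correctly and sees the GPU:

import torch

print(torch.__version__)           # expect 1.6.0
print(torch.cuda.is_available())   # expect True on a machine with CUDA 10.1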

Basic Information

  1. supported datasets
  2. supported methods, whose configurations can be found in config/methods.yaml (a quick way to inspect them is shown below)
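
Since the method configurations live in config/methods.yaml, one way to inspect them is to load the file with PyYaml (installed above). The assumption that top-level keys are method names is ours, not documented by the repo:

import yaml

with open('config/methods.yaml') as f:
    methods = yaml.safe_load(f)

# Assuming the top-level keys are method names (e.g., ARB, NACF)
print(list(methods.keys()))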

Corpora/Feature Preparation

Preprocessed corpora and extracted features can be downloaded from the VC_data folder on GoogleDrive or PKU Yun.

Please remember to modify base_data_path in config/Constants.py
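
For example, if you place the downloaded VC_data folder under /home/user/data (an illustrative path, not a repo default), the entry in config/Constants.py would look roughly like:

# config/Constants.py -- point this at wherever you placed VC_data
base_data_path = '/home/user/data/VC_data'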

Alternatively, you can prepare data on your own (Note: some dependencies should be installed, e.g., nltk, pretrainedmodels).

  1. Preprocessing corpora:
    python prepare_corpora.py --dataset Youtube2Text --sort_vocab
    python prepare_corpora.py --dataset MSRVTT --sort_vocab
    
  2. Feature extraction:

Pretrained Models

We provide captioning models pre-trained on Youtube2Text (MSVD) and MSRVTT. Please refer to the experiments folder on GoogleDrive or BaiduYun (extraction code: lkmu).

Please remember to modify base_checkpoint_path in config/Constants.py
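
Similarly, assuming the experiments folder was downloaded to /home/user (again an illustrative path), the corresponding entry in config/Constants.py would be along the lines of:

# config/Constants.py -- point this at the downloaded pre-trained models
base_checkpoint_path = '/home/user/experiments'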

Training

python train.py --default --dataset `dataset_name` --method `method_name`

Keypoints:

Examples:

python train.py --default --dataset MSRVTT --method ARB
python train.py --default --dataset MSRVTT --method NACF

Testing

python translate.py --default --dataset `dataset_name` --method `method_name`

Examples:

# NACF w/o ct
python translate.py --default --dataset MSRVTT --method NACF

# NACF w/ ct
python translate.py --default --dataset MSRVTT --method NACF --use_ct

# NACF using different decoding paradigms
python translate.py --default --dataset MSRVTT --method NACF --use_ct --paradigm mp
python translate.py --default --dataset MSRVTT --method NACF --use_ct --paradigm ef
python translate.py --default --dataset MSRVTT --method NACF --use_ct --paradigm l2r

Citation

Please [★star] this repo and [cite] the following paper if you find our code or models useful for your research:

@inproceedings{yang2021NACF,
  title={Non-Autoregressive Coarse-to-Fine Video Captioning}, 
  author={Yang, Bang and Zou, Yuexian and Liu, Fenglin and Zhang, Can},     
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={35},
  number={4},
  pages={3119--3127},
  year={2021}
}

Acknowledgements

Code of the decoding part is based on facebookresearch/Mask-Predict.