# Adversarial Inference for Multi-Sentence Video Descriptions
This is the implementation of *Adversarial Inference for Multi-Sentence Video Descriptions* (CVPR 2019).

This repository is based on self-critical.pytorch. Thank you Ruotian for the code! The modifications are:
- Training of the Multimodal Generator and Hybrid Discriminator in `models/`.
- Adversarial Inference in `eval_utils.py`.
## Requirements

Clone the repository recursively:

```bash
git clone --recursive https://github.com/jamespark3922/adv-inf
```
- Python 2.7 (because there is no coco-caption version for Python 3)
- PyTorch 0.4 (along with torchvision)
- densevid_eval (for ActivityNet evaluation)
- Java (to run the METEOR `meteor.jar` file)
## Training on ActivityNet Dense Captions

### Download ActivityNet captions and preprocess them
We share the input labels and features in this folder. (Scripts to preprocess the labels will be available soon.)
#### Features

- resnext101-64f (126GB), extracted using the r3d repository
- resnet152 (14GB), extracted from 100 frames per video
- bottom-up labels (16GB) with confidence scores, extracted from 3 frames per clip
After downloading them all, unzip them to your preferred feature directory.
Note that the mean-pooling operations are performed when loading the data in `dataloader.py`.
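For reference, here is a minimal sketch of what that mean-pooling step could look like (the file format, array shape, and function name are assumptions for illustration, not the exact code in `dataloader.py`):

```python
import numpy as np

def load_clip_feature(path, pool=True):
    """Load a per-frame feature array of shape (num_frames, feat_dim) and
    optionally mean-pool it over the temporal axis into a single vector."""
    feats = np.load(path)           # assumed .npy file with shape (T, D)
    if pool:
        feats = feats.mean(axis=0)  # mean-pool over frames -> shape (D,)
    return feats
```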
### Training

```bash
python train.py --caption_model video --input_json activity_net/inputs/video_data_dense.json --input_fc_dir activity_net/feats/resnext101-64f/ --input_img_dir activity_net/feats/resnet152/ --input_box_dir activity_net/feats/bottomup/ --input_label_h5 activity_net/inputs/video_data_dense_label.h5 --glove_npy activity_net/inputs/glove.npy --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path video_ckpt --val_videos_use -1 --losses_print_every 10 --batch_size 16 --language_eval 1
```
Context: the generator model uses the hidden state of the previous sentence as "context", starting at epoch `--g_context_epoch`.
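As a rough illustration of that context mechanism (the module and variable names below are hypothetical, not the actual classes in `models/`), the final hidden state of sentence i-1 is fed as an extra input when decoding sentence i:

```python
import torch

def decode_paragraph(sentence_decoder, clip_feats, hidden_dim=512):
    """Decode a paragraph clip by clip, carrying the previous sentence's
    last hidden state along as "context" for the next sentence."""
    context = torch.zeros(1, hidden_dim)    # no context before the first sentence
    sentences = []
    for feats in clip_feats:
        # sentence_decoder is a hypothetical module returning (tokens, last_hidden)
        tokens, last_hidden = sentence_decoder(feats, context)
        sentences.append(tokens)
        context = last_hidden               # becomes the context for the next sentence
    return sentences
```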
### Evaluation

After training is done, evaluate the captions at the paragraph level. Note that the evaluation is done on the val1 set.

Normal inference using greedy (max) decoding or beam search can be run with the following command:

```bash
python eval.py --g_model_path video_ckpt/gen_best.pth --infos_path video_ckpt/infos.pkl --d_model_path video_ckpt/dis_best.pth --sample_max 1 --id $id --beam_size $beam_size
```
The generated captions will be saved in `densevid_eval/caption_$id.json`. You can also omit `--d_model_path` if you do not wish to score and evaluate the captions with the discriminator.
### Adversarial Inference

Sampling `$num_samples` sentences and choosing the best one with the discriminator can be run with:

```bash
python eval.py --g_model_path video_ckpt/gen_best.pth --infos_path video_ckpt/infos.pkl --d_model_path video_ckpt/dis_best.pth --sample_max 0 --num_samples $num_samples --temperature $temperature --id $id
```
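Conceptually, adversarial inference is a sample-and-rerank loop: draw several candidate sentences from the generator and keep the one the discriminator scores highest. A minimal sketch under assumed interfaces (`generator.sample` and the callable `discriminator` are hypothetical; the actual logic lives in `eval_utils.py`):

```python
def adversarial_inference(generator, discriminator, video_feats, context,
                          num_samples=100, temperature=1.0):
    """Sample candidate sentences and keep the one the hybrid discriminator
    scores highest."""
    best_sentence, best_score = None, float('-inf')
    for _ in range(num_samples):
        # hypothetical stochastic decoding call (corresponds to --sample_max 0)
        sentence = generator.sample(video_feats, context, temperature=temperature)
        # hypothetical scoring call: how plausible the sentence is for this video/context
        score = discriminator(video_feats, context, sentence)
        if score > best_score:
            best_sentence, best_score = sentence, score
    return best_sentence
```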
### Generated Captions

You can run the language metrics to reproduce the results:

```bash
python para-evaluate.py -s $submission_file --verbose
```
and the diversity metrics (Div-N, Re-N) reported in the paper:

```bash
python evaluateCaptionsDiversity.py $submission_file
```
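As intuition for the diversity metrics, Div-N is based on the fraction of distinct n-grams in a generated paragraph. A rough, self-contained approximation (for illustration only; the exact definitions are in the paper and `evaluateCaptionsDiversity.py`):

```python
def div_n(tokens, n=2):
    """Approximate Div-N: ratio of distinct n-grams to total n-grams in a
    generated paragraph (higher means more diverse)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / float(len(ngrams))

# toy usage
tokens = "a man is seen speaking to the camera and a man is seen running".split()
print(div_n(tokens, n=2))
```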
## Reference

```
@inproceedings{park2019advinf,
  title={Adversarial Inference for Multi-Sentence Video Descriptions},
  author={Park, Jae Sung and Rohrbach, Marcus and Darrell, Trevor and Rohrbach, Anna},
  booktitle={CVPR},
  year={2019}
}
```