Dense Video Captioning Using Unsupervised Semantic Information

Summary

This repository contains an implementation of the GloVe NLP method adapted to visual features, as described in the paper "Dense Video Captioning Using Unsupervised Semantic Information" (arXiv: https://arxiv.org/abs/2112.08455).

Visual GloVe provides a dense representation that encodes co-occurrence similarity, yielding a semantic space in which short video clips with similar visual content are projected near each other. The figure below illustrates this representation.


Figure. Examples of visual similarities. (a) Two video fragments of about 28 seconds each from YouTube (vdBNZf90PLJ0 and vj3QSVhAhDc). They share some visually similar short clips. (b) A 2D t-SNE representation of the whole visual vocabulary. Some shared fragments are highlighted in red.
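
For intuition only, the sketch below is a minimal illustration (not the repository code; all names are ours) of how a GloVe-style co-occurrence matrix can be accumulated over sequences of visual words, i.e., cluster IDs assigned to short clips, using a symmetric context window like the --vg_window_size used later in this README.

import numpy as np

def cooccurrence_matrix(id_sequences, vocab_size, window=5):
    # Count how often two visual words appear within `window` clips of each
    # other, weighting each pair by the inverse of its distance (as in GloVe).
    X = np.zeros((vocab_size, vocab_size), dtype=np.float64)
    for seq in id_sequences:
        for i, w in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    X[w, seq[j]] += 1.0 / abs(i - j)
    return X

# Toy example: two "videos" already quantized into visual-word IDs.
X = cooccurrence_matrix([[3, 7, 7, 12, 3, 5], [7, 12, 12, 3, 9]], vocab_size=16)
print(X.shape)  # (16, 16); GloVe then fits embeddings to the log of these counts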

We employ Visual GloVe to replace the audio signal in the BMT model (see references) when learning an event proposal model for the Dense Video Captioning (DVC) task. Additionally, we concatenate Visual GloVe with I3D features and feed the result to the MDVC model (see references) to generate a caption for each learned proposal. Using this descriptor, we achieve state-of-the-art performance on DVC with visual features only (i.e., a single-modality approach), as shown in the tables below.

(See Tables 2 and 4 in the paper for the full results.)
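
As an illustration of the I3D + Visual GloVe concatenation described above, the snippet below is a small sketch (the file names are placeholders, not files shipped with this repository) that fuses 1024-d I3D stack features with 128-d Visual GloVe embeddings, yielding the 1152-d input used later for CustomMDVC.

import numpy as np

# Placeholder file names; the real features live in the directories created
# in the steps below.
i3d = np.load("v_example_i3d.npy")                # expected shape: (num_stacks, 1024)
visual_glove = np.load("v_example_visglove.npy")  # expected shape: (num_stacks, 128)

assert i3d.shape[0] == visual_glove.shape[0], "one Visual GloVe vector per stack"
fused = np.concatenate([i3d, visual_glove], axis=1)  # (num_stacks, 1152)
np.save("v_example_i3dvisglove.npy", fused)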

The code was tested on Ubuntu 20.10 with an NVIDIA Titan Xp GPU. Using other software/hardware might require adapting the conda environment files.

Visual GloVe Computation

Step 1: clone this repository

git clone --recursive git@github.com:valterlej/visualglove.git

Step 2: create the conda environment

conda env create -f ./conda_env.yml
conda activate visualglove

Step 3: get visual features

a. download the annotations and visual features:

annotations (.json): https://cs.stanford.edu/people/ranjaykrishna/densevid/captions.zip

features (.npy): https://1drv.ms/u/s!Atd3eVywQZMJgnFmEqYoFg3fq8w9?e=SUOrbE

b. or compute your own visual features using the code from https://github.com/v-iashin/video_features. All the instructions you need are provided in that repository.

In our experiments, all features and configuration files are saved in a shared dataset directory. We create this directory in the home directory and refer to it as ~/dataset.
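
It can help to sanity-check one of the downloaded feature files before training. The snippet below is a small sketch (the file name is a placeholder) assuming each .npy stores one 1024-d I3D vector per 64-frame stack, consistent with the --d_vid 1024 used later.

import os
import numpy as np

# Placeholder file name; pick any .npy from the feature directory.
path = os.path.expanduser("~/dataset/i3d_25fps_stack64step64_2stream_npy/v_example.npy")
features = np.load(path)
print(features.shape, features.dtype)  # expected: (num_stacks, 1024)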

Step 4: train the model with the main.py script

python main.py --procedure training \
    --train_ids_file ~/dataset/train.json \
    --visual_features_dir ~/dataset/i3d_25fps_stack64step64_2stream_npy \
    --vocabulary_size 1500 \
    --vg_window_size 5 \
    --vg_emb_dim 128

This command learns the clustering and Visual GloVe models.
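
Roughly, the clustering stage quantizes every I3D stack into one of --vocabulary_size visual words. The sketch below is an illustrative approximation (not the training code) using the mini-batch k-means referenced at the end of this README, with synthetic features standing in for the real corpus.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for the I3D stack features of the whole training corpus.
stack_features = np.random.rand(20000, 1024).astype(np.float32)

# Quantize stacks into a 1500-word visual vocabulary (--vocabulary_size 1500).
kmeans = MiniBatchKMeans(n_clusters=1500, batch_size=1024, random_state=0)
visual_words = kmeans.fit_predict(stack_features)

# Each video then becomes a sequence of visual words, which is what the GloVe
# co-occurrence stage (see the sketch near the top of this README) consumes.
print(visual_words[:10])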

You can extract cluster predictions for all video stacks in visual_features_dir using the pre-trained cluster model with the command:

python main.py --procedure cluster_predictions \
    --output_cluster_predictions_dir <directory path>
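
The output layout is not documented here, so purely as an assumption for illustration, the snippet below supposes the predictions are written as one .npy array of integer cluster IDs per video inside --output_cluster_predictions_dir.

import numpy as np

# Assumption for illustration only: one .npy per video holding the cluster ID
# (visual word) of each 64-frame stack.
ids = np.load("cluster_predictions/v_example.npy")  # placeholder path
print(ids[:20])  # a sequence of visual words, one per stack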

Extract the visual embeddings using the pre-trained cluster and Visual GloVe models with the command:

python main.py --procedure visual_glove_embeddings \
    --output_embedding_dir ~/datasets/visglove_acnet_w5_c1500 \
    --output_concatenated_stack_embedding ~/datasets/i3dvisglove_acnet_w5_c1500

Training the Proposal Module

Step 1: clone the repository

git clone --recursive https://github.com/valterlej/CustomBMT.git

Step 2: create and activate the conda environment

conda env create -f ./conda_env.yml
conda activate bmt

Step 3: install the spacy module

python -m spacy download en

Step 4: get or compute the features

You can download all necessary features using the links provided in the 'Pre-computed features' section, or compute them following the instructions in the 'Visual GloVe Computation' section.

Step 5: train the captioning model using CustomBMT (bi-modal transformer)

Training

python main.py --device_ids 0 \
    --video_features_path ~/datasets/i3d_25fps_stack64step64_2stream_npy \
    --audio_features_path ./data/visglove_acnet_w5_c1500 \
    --procedure train_cap \
    --epoch_num 100 \
    --early_stop_after 25 \
    --dout_p 0.1 \
    --d_vid 1024 \
    --d_model 1024 \
    --d_aud 128

Step 6: train the event proposal generation module using CustomBMT (bi-modal transformer)

Training

python main.py --device_ids 0 \
    --video_features_path ~/datasets/i3d_25fps_stack64step64_2stream_npy \
    --audio_features_path ~/datasets/visglove_acnet_w5_c1500 \
    --procedure train_prop \
    --epoch_num 80  \
    --early_stop_after 20 \
    --pretrained_cap_model_path ./log/train_cap/1203220645/best_cap_model.pt

Evaluating

python main.py --device_ids 0 \
    --video_features_path ~/datasets/i3d_25fps_stack64step64_2stream_npy \
    --audio_features_path ~/datasets/visglove_acnet_w5_c1500 \
    --procedure evaluate \
    --pretrained_cap_model_path ./log/train_cap/<dataset_log>/best_cap_model.pt \
    --prop_pred_path ./log/train_prop/<dataset_log>/submissions/<json_submission_file>

Training and Evaluating Captioning - CustomMDVC

Step 1: clone the repository

git clone https://github.com/valterlej/CustomMDVC.git

Step 2: create and activate the conda environment

conda env create -f ./conda_env.yml
conda activate mdvc

Step 3: install the spacy module

python -m spacy download en

Step 4: get or compute the features

Skip this step if you are retraining the proposal generation module; otherwise, download the Visual GloVe features (see the 'Pre-computed features' section).

Step 5: train the captioning model with CustomMDVC (vanilla transformer)

Download the file containing our pre-trained BMT proposals (.zip), provided in the same format adopted by MDVC. Then, run the following command:

python main.py --device_ids 0 \
    --d_vid 1152 \
    --epoch_num 80 \
    --early_stop_after 20 \
    --video_features_path ./datasets/i3dvisglove_acnet_w5_c1500 \
    --model_type visual \
    --val_prop_meta_path ./datasets/bmt_prop_results_val_1_e26_maxprop100_mdvc.csv

Pre-computed features

Pre-trained models

Main References

Mini-batch k-means

GloVe

MDVC

BMT

For the complete list, see the paper.

Citation

Our paper is available on arXiv. Please use the following BibTeX entry if you would like to cite our work.

@article{estevam:2021,
  author  = {Estevam, V. and Laroca, R. and Pedrini, H. and Menotti, D.},
  title   = {Dense Video Captioning Using Unsupervised Semantic Information},
  journal = {CoRR},
  year    = {2021},
  url     = {https://arxiv.org/abs/2112.08455}
}