Video ReCap: Recursive Captioning of Hour-Long Videos
Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius
Accepted by CVPR 2024
[Website] [Paper] [Dataset] [Hugging Face] [Demo]

Video ReCap is a recursive video captioning model that can process very long videos (e.g., hours long) and output captions at multiple hierarchy levels: short-range clip captions, mid-range segment descriptions, and long-range video summaries. First, the model generates captions for short video clips of a few seconds. As we move up the hierarchy, the model uses sparsely sampled video features and the captions generated at the previous hierarchy level as inputs to produce the captions for the current level.

<img src="assets/framework.png">
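
To make the recursion concrete, below is a minimal, self-contained sketch of the three-level procedure described above. Every name in it (recursive_captions, the stub captioner, the sampling strides) is an illustrative placeholder, not the actual Video ReCap implementation.

```python
# Minimal sketch of the three-level recursion; names and strides are
# illustrative placeholders only, not the actual Video ReCap code.
from typing import Callable, List, Tuple

def recursive_captions(
    features: List[float],                      # one feature per second of video (toy stand-in)
    captioner: Callable[[List[float], List[str]], str],
    clip_len: int = 4,                          # seconds per short clip
    segment_len: int = 180,                     # seconds per mid-range segment
) -> Tuple[List[str], List[str], str]:
    # Level 1: caption every short clip from densely sampled features, no text input.
    clip_caps = [
        captioner(features[i:i + clip_len], [])
        for i in range(0, len(features), clip_len)
    ]
    # Level 2: describe each segment from sparsely sampled features plus the
    # clip captions that fall inside that segment.
    seg_caps = []
    for s in range(0, len(features), segment_len):
        sparse = features[s:s + segment_len:16]
        caps_inside = clip_caps[s // clip_len:(s + segment_len) // clip_len]
        seg_caps.append(captioner(sparse, caps_inside))
    # Level 3: summarize the whole video from sparse features plus all segment descriptions.
    summary = captioner(features[::64], seg_caps)
    return clip_caps, seg_caps, summary

# Toy usage with a stub captioner.
stub = lambda feats, texts: f"caption from {len(feats)} features and {len(texts)} texts"
clips, segments, summary = recursive_captions(list(range(3600)), stub)
print(len(clips), len(segments), summary)   # 900 clip captions, 20 segment descriptions, one summary
```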

Installation

See installation.md to install this code.

Ego4D-HCap Dataset

See datasets.md for the details and download the Ego4D-HCap Dataset.

Demo Notebook

First, download the pretrained models from this link. Then you can extract three levels of hierarchical captions from any video (e.g., assets/example.mp4) using our pretrained models, as shown in the demo.ipynb notebook.

Download or extract features

We utilize the video encoder of the pretrained Dual-Encoder from LaViLa to extract features.
You can directly download the extracted features (~30 GB) from this link (coming soon).
Alternatively, you may extract the features on your own using the following steps.

  1. Download the pretrained video encoder using the following command.
wget https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_base.baseline.ep_0003.pth
  2. Extract segment features.
bash scripts/extract_features_segments.sh
  3. Extract video features.
bash scripts/extract_features_videos.sh
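
Conceptually, both extraction scripts slide the pretrained video encoder over fixed-length frame windows and store one feature vector per window. The sketch below illustrates only that pattern; the encoder is a stub stand-in, since loading the actual LaViLa TimeSformer encoder from the checkpoint is handled by the scripts above.

```python
# Rough sketch of the feature-extraction loop; the encoder here is a stub
# (a mean-pool over pixels) standing in for the LaViLa video encoder.
import torch

@torch.no_grad()
def extract_window_features(frames: torch.Tensor, encoder, window: int = 4,
                            stride: int = 4) -> torch.Tensor:
    """frames: (T, C, H, W) video frames; returns one feature per window of frames."""
    feats = []
    for start in range(0, frames.shape[0] - window + 1, stride):
        clip = frames[start:start + window].unsqueeze(0)    # (1, window, C, H, W)
        feats.append(encoder(clip))
    return torch.cat(feats, dim=0)                          # (num_windows, dim)

# Toy usage: 60 frames of a 224x224 RGB video and a stub encoder.
frames = torch.rand(60, 3, 224, 224)
stub_encoder = lambda clip: clip.mean(dim=(1, 3, 4))        # (1, 3) "feature"
print(extract_window_features(frames, stub_encoder).shape)  # torch.Size([15, 3])
```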

Evaluate Pretrained Models

We provide our best models for both Video ReCap and Video ReCap-U.
Download the pretrained models from this link.

  1. Evaluate Video ReCap.
bash scripts/eval_video_recap.sh
  2. Evaluate Video ReCap-U.
bash scripts/eval_video_recap_u.sh

You should get the following numbers.

| Model | Clip Caption<br>(C/ R/ M) | Segment Description<br>(C/ R/ M) | Video Summary<br>(C/ R/ M) | Checkpoint |
| :--- | :--- | :--- | :--- | :--- |
| Video ReCap | 98.35/ 48.77/ 28.28 | 46.88/ 39.73/ 18.55 | 29.34/ 32.64/ 14.45 | download |
| Video ReCap-U | 92.67/ 47.90/ 28.08 | 45.60/ 39.33/ 18.17 | 31.06/ 33.32/ 14.16 | download |
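
Here C/ R/ M denote CIDEr, ROUGE-L, and METEOR. The eval scripts compute these end to end; as an illustration only (not the repo's exact evaluation code), the standard pycocoevalcap package can score reference/prediction caption pairs as follows.

```python
# Illustrative metric computation with the standard pycocoevalcap package;
# not the repo's exact evaluation code. Both dicts map a sample id to a list
# of caption strings (METEOR additionally requires Java).
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

refs = {"clip_0": ["c opens the fridge"], "clip_1": ["c cuts the onion"]}
hyps = {"clip_0": ["c opens a fridge"],   "clip_1": ["c chops an onion"]}

for name, scorer in [("CIDEr", Cider()), ("ROUGE-L", Rouge()), ("METEOR", Meteor())]:
    score, _ = scorer.compute_score(refs, hyps)
    print(f"{name}: {score:.4f}")
```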

Train Video ReCap Model

We train our model on 8 V100 GPUs (32GB memory).
Video ReCap is a recursive model for hierarchical video captioning that uses the captions generated at the previous hierarchy level as input for the current level. We train Video ReCap using the following curriculum learning strategy.

  1. Download the pretrained Dual-Encoder from LaViLa using the following commands.
mkdir pretrained_models
cd pretrained_models
wget https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_base.baseline.ep_0003.pth
cd ..
  2. First, train for 5 epochs using the clip captions data.
bash scripts/run_videorecap_clip.sh
  3. Then extract captions at 4-second intervals for the whole video using the trained clip captioning model from step 2. Replace the 'captions_pred' field of the train and val metadata with the generated captions from the appropriate time windows (see datasets.md for more details).
bash scripts/extract_captions.sh
  4. Initialize from the Video ReCap clip checkpoint and train for 10 epochs using the segment descriptions.
bash scripts/run_videorecap_segment.sh
  5. Extract segment descriptions at 180-second intervals for the whole video using the trained segment description model from step 4. Replace the 'segment_descriptions_pred' field of the train and val metadata with the generated descriptions from the appropriate time windows (see datasets.md for more details).
bash scripts/extract_segment_descriptions.sh
  6. Finally, initialize from the Video ReCap segment checkpoint and train for 10 epochs using the video summaries.
bash scripts/run_videorecap_video.sh
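
Schematically, the curriculum amounts to the following. Every helper in this snippet (fit, regenerate_text) is a hypothetical stand-in for the shell scripts listed above, shown only to make the training flow explicit.

```python
# Schematic of the three-stage curriculum; all helpers are hypothetical
# stand-ins for the shell scripts above.
def fit(model, data, epochs):
    print(f"train {model} on {data} for {epochs} epochs")
    return model

def regenerate_text(model, level, interval_s):
    print(f"use {model} to generate {level} text every {interval_s} s of video")

model = "video_recap"
model = fit(model, "clip captions", epochs=5)                   # stage 1: run_videorecap_clip.sh
regenerate_text(model, "clip caption", interval_s=4)            # extract_captions.sh -> captions_pred
model = fit(model, "segment descriptions", epochs=10)           # stage 2: run_videorecap_segment.sh
regenerate_text(model, "segment description", interval_s=180)   # extract_segment_descriptions.sh
model = fit(model, "video summaries", epochs=10)                # stage 3: run_videorecap_video.sh
```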

Train Video ReCap-U Model

We train our model on 8 V100 GPUs (32GB memory).
While Video ReCap trains three different sets of trainable parameters, one for each hierarchy level, Video ReCap-U trains a single set of trainable parameters. The following curriculum learning scheme, combined with an alternate batching technique, allows us to train a unified model and avoid catastrophic forgetting.

  1. Download the pretrained Dual-Encoder from LaViLa using the following commands.
mkdir pretrained_models
cd pretrained_models
wget https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_base.baseline.ep_0003.pth
cd ..
  2. The first stage is the same as for the Video ReCap model: train for 5 epochs using the clip captions data.
bash scripts/run_videorecap_clip.sh
  3. Then extract captions at 4-second intervals for the whole video using the trained clip captioning model from step 2. Replace the 'captions_pred' field of the train and val metadata with the generated captions from the appropriate time windows (see datasets.md for more details).
bash scripts/extract_captions.sh
  4. Next, initialize from the Video ReCap clip checkpoint and train for 10 epochs using the segment descriptions together with some clip captions data, sampling clip captions and segment descriptions alternately at each batch.
bash scripts/run_videorecap_clip.sh
  5. Extract segment descriptions at 180-second intervals for the whole video using the model trained in step 4.
bash scripts/extract_segment_descriptions.sh
  6. Finally, initialize from the Video ReCap segment checkpoint and train for 10 epochs using the video summaries together with some segment descriptions and clip captions data, sampling from all three hierarchies alternately at each batch.
bash scripts/run_videorecap_clip.sh
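
The alternate batching used in steps 4 and 6 can be pictured as a simple round-robin over per-hierarchy data loaders. The snippet below is only a toy sketch of that idea (names are hypothetical), not the sampler implemented in the repo.

```python
# Toy sketch of alternate batching across hierarchy levels: one shared model
# keeps seeing batches from every level, which counters catastrophic forgetting.
from itertools import cycle

def alternate_batches(loaders, num_steps):
    """Yield batches round-robin from one data loader per hierarchy level."""
    iters = [cycle(loader) for loader in loaders]
    for step in range(num_steps):
        yield next(iters[step % len(iters)])

# Toy usage: three "loaders", one per hierarchy level.
clip_batches    = ["clip_batch_0", "clip_batch_1"]
segment_batches = ["segment_batch_0"]
video_batches   = ["video_batch_0"]
for batch in alternate_batches([clip_batches, segment_batches, video_batches], 6):
    print(batch)   # clip, segment, video, clip, segment, video
```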

Additional Supervision using LLMs

Coming soon!

BibTex

@article{islam2024video,
  title={Video ReCap: Recursive Captioning of Hour-Long Videos},
  author={Islam, Md Mohaiminul and Ho, Ngan and Yang, Xitong and Nagarajan, Tushar and
  Torresani, Lorenzo and Bertasius, Gedas},
  journal={arXiv preprint arXiv:2402.13250},
  year={2024}
}