Ad-hoc Video Search

We provide frame-level CNN features for the following datasets, which were used by our winning entry for the TRECVID 2018 Ad-hoc Video Search (AVS) task. Code for feature extraction is available in the video-cnn-feat project.

  1. The IACC.3 dataset, which has been the test set for the TRECVID Ad-hoc Video Search (AVS) task since 2016. The dataset contains 4,593 Internet Archive videos (144 GB, 600 hours) under Creative Commons licenses, in MPEG-4/H.264 format, with durations ranging from 6.5 to 9.5 minutes and a mean duration of almost 7.8 minutes. Automated shot boundary detection has been performed, yielding 335,944 shots in total. From each shot we sampled frames uniformly (see the sampling sketch after this list), obtaining 3,845,221 frames in total.
  2. The MSR-VTT dataset, providing 10K web video clips and 200K natural sentences describing the visual content of the clips, an average of 20 sentences per clip. From each clip we sampled frames uniformly, obtaining 305,462 frames in total.
  3. The TGIF dataset, containing 100K animated GIFs and 120K sentences describing the visual content of the animated GIFs. From each GIF we sampled frames uniformly, obtaining 1,045,268 frames in total.
  4. The TRECVID 2016 VTT training set, containing 200 videos (Vine URLs) and 400 sentences.
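
The frame counts above come from uniform temporal sampling. The snippet below is only a minimal sketch of such sampling with OpenCV; the function name and its parameters are our own for illustration and are not part of the video-cnn-feat API, nor the exact code used to build these datasets.

```python
import cv2

def sample_frames_uniformly(video_path, num_frames):
    """Return up to `num_frames` frames sampled at evenly spaced positions."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        # Seek to an evenly spaced frame index across the whole video/shot.
        index = int(i * total / num_frames)
        cap.set(cv2.CAP_PROP_POS_FRAMES, index)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```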

In addition, we provide frame-level CNN features for the following dataset, used by our winning entry for the TRECVID 2018 Video-to-Text (VTT) Matching and Ranking task.

  1. The Microsoft Video Description dataset (MSVD).

Downloads

Video-level features

Frame-level features

CNN feature    Dimensionality   Downloads
ResNeXt-101    2,048            IACC.3 (27 GB), MSR-VTT (2 GB), TGIF (7 GB), MSVD (288 MB), TV2016VTT-train (42 MB)
ResNet-152     2,048            IACC.3 (26 GB), MSR-VTT (2 GB), TGIF (7 GB), MSVD (283 MB), TV2016VTT-train (42 MB)
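
Video-level features can be obtained from the frame-level features above by pooling over frames. Below is a minimal sketch of mean pooling with NumPy; the file name and the .npy layout are hypothetical stand-ins for illustration, not the packaging of the downloads above (see the video-cnn-feat project for the actual format).

```python
import numpy as np

# Hypothetical file: frame vectors of one video stacked row-wise.
frame_feats = np.load('video0001_frames.npy')      # shape: (num_frames, 2048)
video_feat = frame_feats.mean(axis=0)              # shape: (2048,)
video_feat /= np.linalg.norm(video_feat) + 1e-12   # optional L2 normalization
```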

Sentences

Citations

If you find the feature data useful, please consider citing our work.

Acknowledgments