Video Timeline Tags (ViTT)

This repo provides the Video Timeline Tags (ViTT) dataset introduced in Multimodal Pretraining for Dense Video Captioning (arXiv | presentation | slides).

If you find the data or paper useful for your own work, please consider citing:

@inproceedings{huang2020multimodal,
  title={Multimodal Pretraining for Dense Video Captioning},
  author={Huang, Gabriel and Pang, Bo and Zhu, Zhenhai and Rivera, Clara and Soricut, Radu},
  booktitle={AACL-IJCNLP 2020},
  year={2020}
}

Dataset Description

Data files for this dataset can be downloaded via the following links:

The ViTT dataset consists of human-produced segment-level annotations for 8,169 videos. Of these, 5,840 videos have been annotated once, and the rest of the videos have been annotated twice or more. A total of 12,461 sets of annotations are released in ViTT-annotations.json. Below is an example set of annotations from the dataset:

{
  "id": "FmTp",
  "annotations": [
    {
      "timestamp": 260,
      "tag": "Opening"
    },
    {
      "timestamp": 16000,
      "tag": "Displaying technique"
    },
    {
      "timestamp": 23990,
      "tag": "Showing foot positioning"
    },
    {
      "timestamp": 55530,
      "tag": "Demonstrating crossover"
    },
    {
      "timestamp": 114100,
      "tag": "Closing"
    }
  ]
}

Data fields:

- id: a unique identifier for the annotated video.
- annotations: the list of timeline tags for the video, in temporal order (as in the example above). Each entry contains:
  - timestamp: the start time of the segment, in milliseconds.
  - tag: a short free-text description of the segment.
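
As a rough sketch of how these records might be consumed, the Python snippet below loads the file and prints one timeline. It assumes ViTT-annotations.json stores one JSON object per line (JSON Lines) and that timestamps are in milliseconds; if the file is instead a single JSON array, replace the list comprehension with json.load.

import json
from collections import Counter

# Minimal sketch: one JSON object per line (an assumption; use
# json.load(f) instead if the file is a single JSON array).
with open("ViTT-annotations.json", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records), "sets of annotations")  # expected: 12,461

# If repeated annotations of the same video share its id (an assumption,
# not confirmed here), this counts annotation sets per video:
counts = Counter(record["id"] for record in records)
print(len(counts), "distinct video ids")

# Print the timeline of the first record, converting the (assumed)
# millisecond timestamps to seconds for readability.
for segment in records[0]["annotations"]:
    print(f'{segment["timestamp"] / 1000.0:8.2f}s  {segment["tag"]}')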

For experiments described in the paper, we additionally applied several preprocessing steps to this data; see the paper for details.

Please refer to Appendix A.1 in the paper for details on the dataset construction and guidelines for human annotation.