VIOLIN: A Large-Scale Dataset for Video-and-Language Inference

Data and code for CVPR 2020 paper: "VIOLIN: A Large-Scale Dataset for Video-and-Language Inference"

(Figure: an example from the Violin dataset)

We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.

We also present a new large-scale dataset for this task, named Violin (VIdeO-and-Language INference), which consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video (YouTube and TV shows). To address this new multimodal inference task, a model must possess sophisticated reasoning skills, from surface-level grounding (e.g., identifying objects and characters in the video) to in-depth commonsense reasoning (e.g., inferring causal relations of events in the video).
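Concretely, each Violin example pairs a (video, subtitles) premise with one written statement and a binary entailment label. The sketch below only illustrates that structure; the field names and values are placeholders, not the schema of the released annotation files.

```python
# Illustrative only: field names and values are placeholders, not the actual
# schema of the released Violin annotation files.
example = {
    "clip_id": "friends_s01e01_clip_00",   # hypothetical identifier
    "subtitles": "...aligned subtitle text for the clip...",
    "statement": "A natural language statement about the clip.",
    "label": 1,                            # 1 = entailed, 0 = contradicted
}

def infer(video_features, subtitles, statement):
    """A Violin model consumes the video (features), the aligned subtitles,
    and the statement, and predicts whether the statement is entailed (1)
    or contradicted (0) by the clip."""
    raise NotImplementedError
```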

News

Violin Dataset

| source | #episodes | #clips | avg clip len | avg pos. statement len | avg neg. statement len | avg subtitle len |
| --- | --- | --- | --- | --- | --- | --- |
| Friends | 234 | 2,676 | 32.89s | 17.94 | 17.85 | 72.80 |
| Desperate Housewives | 180 | 3,466 | 32.56s | 17.79 | 17.81 | 69.19 |
| How I Met Your Mother | 207 | 1,944 | 31.64s | 18.08 | 18.06 | 76.78 |
| Modern Family | 210 | 1,917 | 32.04s | 18.52 | 18.20 | 98.50 |
| MovieClips | 5,885 | 5,885 | 40.00s | 17.79 | 17.81 | 69.20 |
| All | 6,716 | 15,887 | 35.20s | 18.10 | 18.04 | 76.40 |

Baseline Models

Requirements

Usage

  1. Download the video features, subtitles, and statements, and put them in your feature directory (the path passed as --feat_dir below).

  2. Finetune BERT-base on Violin's training statements, or download our finetuned BERT model (a fine-tuning sketch follows these steps).

  3. Training

    Using only subtitles

    python main.py --feat_dir [feat dir] --bert_dir [bert dir] --input_streams sub
    

    Using both subtitles and video ResNet features (pass --feat c3d instead to use C3D features)

    python main.py --feat_dir [feat dir] --bert_dir [bert dir] --input_streams sub vid --feat resnet
    
  4. Testing

    Testing a specific model

    python main.py --test --feat_dir [feat dir] --bert_dir [bert dir] --input_streams sub vid --feat c3d --model_path [model path]
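For step 2, the sketch below shows one plausible way to domain-adapt bert-base-uncased on the training statements with a masked-LM objective using HuggingFace transformers. The objective, the file name train_statements.txt, and the hyperparameters are assumptions for illustration and may differ from the procedure used to produce the provided checkpoint.

```python
# Hedged sketch: masked-LM fine-tuning of bert-base-uncased on Violin training
# statements. File names and hyperparameters are assumptions, not the repo's
# verified setup.
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# "train_statements.txt" is a hypothetical one-statement-per-line dump of the
# training split; replace with however you export the Violin statements.
dataset = load_dataset("text", data_files={"train": "train_statements.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=64),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert_violin", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
trainer.save_model("bert_violin")
```

When launching training or testing, point --bert_dir at the resulting directory (bert_violin in this sketch) or at the downloaded finetuned model.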