Exposing the Limits of Video-Text Models through Contrast Sets

Repository for Exposing the Limits of Video-Text Models through Contrast Sets (NAACL Short 2022).

Updates

To Do:

Data

We share the verb-phrase-based contrast sets for MSRVTT and LSMDC-ID in this repository. Code to generate the contrast sets will be released soon.

| Dataset | Link | Contrast Set Type |
| --- | --- | --- |
| MSRVTT | Data | Verb |
| LSMDC | TBD | TBD |

MSRVTT

Download

The verb-based contrast sets generated by a language model (Verb<sub>LM</sub> MC) and by humans (Verb<sub>Human</sub> MC) can be found in msrvtt/. You can find the video and annotation data following this link. The sets are generated for the full test set of MSRVTT, which has 2990 videos.

You can additionally refer to the download script in CLIPBert to get the original MSRVTT split data. To get only the train and val data, you can run the download script:

bash msrvtt/download_msrvtt_train_val.sh

Example data:

| clip | ending0 | ending1 | ending2 | ending3 | ending4 | label |
| --- | --- | --- | --- | --- | --- | --- |
| video9770 | the boy is trying to fix the problem | the boy is trying to exacerbate the problem | two men on wave runner in ocean rescuing a surfer | asian man discusses technology in the younger generations | a group is dancing | 0 |
| video9771 | a woman is putting items into a miniature toy oven | a child is running around on a mat | a woman pushing a stroller | a child is rolling around on a mat | a game show host hosting a game | 1 |
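The multiple-choice format above can be scored by picking the highest-scoring ending for each clip and comparing it with `label`. The following is a minimal sketch, not the evaluation code used in the paper: it assumes each row is available as a dict with the column names shown in the example, and `toy_score` (with its keyword table) is a hypothetical stand-in for a real video-text similarity model.

```python
from typing import Callable, Dict, List

# Column names taken from the example data above.
ENDING_KEYS = [f"ending{i}" for i in range(5)]

Row = Dict[str, object]
Scorer = Callable[[str, str], float]


def predict(row: Row, score: Scorer) -> int:
    """Return the index of the ending with the highest score for this clip."""
    scores = [score(str(row["clip"]), str(row[key])) for key in ENDING_KEYS]
    # On ties, max() keeps the first (lowest-index) ending.
    return max(range(len(scores)), key=scores.__getitem__)


def mc_accuracy(rows: List[Row], score: Scorer) -> float:
    """Fraction of rows whose predicted ending index equals the label."""
    correct = sum(predict(row, score) == int(row["label"]) for row in rows)
    return correct / len(rows)


rows = [
    {
        "clip": "video9770",
        "ending0": "the boy is trying to fix the problem",
        "ending1": "the boy is trying to exacerbate the problem",
        "ending2": "two men on wave runner in ocean rescuing a surfer",
        "ending3": "asian man discusses technology in the younger generations",
        "ending4": "a group is dancing",
        "label": 0,
    },
    {
        "clip": "video9771",
        "ending0": "a woman is putting items into a miniature toy oven",
        "ending1": "a child is running around on a mat",
        "ending2": "a woman pushing a stroller",
        "ending3": "a child is rolling around on a mat",
        "ending4": "a game show host hosting a game",
        "label": 1,
    },
]


def toy_score(clip: str, caption: str) -> float:
    """Hypothetical scorer: counts hand-picked keywords found in the caption.

    A real model would embed the video and the caption and return their
    similarity; this keyword table exists only to make the sketch runnable.
    """
    keywords = {"video9770": {"fix"}, "video9771": {"running"}}
    return float(len(keywords[clip] & set(caption.split())))


if __name__ == "__main__":
    print(f"MC accuracy: {mc_accuracy(rows, toy_score):.2f}")
```

Note that in both example rows the correct ending and one distractor differ only in the verb ("fix" vs. "exacerbate", "running" vs. "rolling"), which is the pair a verb contrast set is meant to separate.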

Results

train-9k is the 9k training split and test-1k-A is the 1k test split proposed by JSFusion [Yu et al.].

train-7k is the 7k training split and test-full is the full set of test videos in the original MSRVTT. We use the 7k training videos from the CLIP4CLIP repo.

| MSRVTT-train-7K | V to T (R@1)<br>test-1k-A | T to V (R@1)<br>test-1k-A | Random MC<br>test-full | Gender MC<br>test-full | Verb<sub>LM</sub> MC<br>test-full | Verb<sub>Human</sub> MC<br>test-full |
| --- | --- | --- | --- | --- | --- | --- |
| CLIP-Straight | 27.2 | 31.2 | 94.1 | 69.6 | 65.4 | 65.1 |
| MMT | 24.8 | 25.5 | 92.4 | 75.5 | 72.8 | 71.3 |
| MMT with CLIP features | 30.5 | 30.3 | 95.0 | 80.1 | 73.8 | 71.4 |
| CLIP4CLIP<sub>meanP</sub> | 43.0 | 42.1 | 96.2 | 76.7 | 76.2 | 73.7 |

| MSRVTT-train-9K | V to T (R@1)<br>test-1k-A | T to V (R@1)<br>test-1k-A | Random MC<br>test-1k-A | Gender MC<br>test-1k-A | Verb<sub>LM</sub> MC<br>test-1k-A | Verb<sub>Human</sub> MC<br>test-1k-A |
| --- | --- | --- | --- | --- | --- | --- |
| CLIP-Straight | 27.2 | 31.2 | 91.2 | 71.4 | 64.9 | 63.5 |
| MMT | 27.0 | 26.6 | 93.5 | 75.9 | 75.2 | 72.9 |
| MMT with CLIP features | 33.9 | 34.0 | 95.6 | 80.9 | 77.7 | 73.3 |
| CLIP4CLIP<sub>meanP</sub> | 43.1 | 43.1 | 96.3 | 79.1 | 76.8 | 75.4 |
| CLIP2Video | 43.5 | 45.6 | 97.0 | 76.0 | 76.8 | 74.3 |
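For readers unfamiliar with the R@1 columns, the sketch below shows the standard recall-at-1 definition on a toy similarity matrix. It is an illustration only, not the evaluation code behind the numbers above: it assumes a one-to-one pairing in which video `i` matches caption `i`, and the matrix values are made up.

```python
from typing import List, Sequence


def recall_at_1(sim: Sequence[Sequence[float]]) -> float:
    """Percentage of queries whose top-ranked candidate is the true match.

    sim[i][j] is the similarity between query i and candidate j, and the
    ground-truth match for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return 100.0 * hits / len(sim)


def transpose(sim: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap the query and candidate roles (rows become columns)."""
    return [list(col) for col in zip(*sim)]


# Made-up video-by-text similarities: rows are videos, columns are captions.
video_text_sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.4, 0.8],
    [0.2, 0.1, 0.7],
]

if __name__ == "__main__":
    # Video-to-text: each video retrieves its best caption.
    print(f"V to T R@1: {recall_at_1(video_text_sim):.1f}")
    # Text-to-video: each caption retrieves its best video.
    print(f"T to V R@1: {recall_at_1(transpose(video_text_sim)):.1f}")
```

Retrieval ranks the correct caption against all other captions in the test split, whereas the MC columns rank it against only four alternatives, so the two kinds of numbers are not directly comparable.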

LSMDC

TBD

Citation

@inproceedings{park-etal-2022-exposing,
    title = "Exposing the Limits of Video-Text Models through Contrast Sets",
    author = "Park, Jae Sung  and
      Shen, Sheng  and
      Farhadi, Ali  and
      Darrell, Trevor  and
      Choi, Yejin  and
      Rohrbach, Anna",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.261",
    pages = "3574--3586",
}

Please email jspark96@cs.washington.edu for more information about the dataset.