# Exposing the Limits of Video-Text Models through Contrast Sets

Repository for *Exposing the Limits of Video-Text Models through Contrast Sets* (NAACL 2022, short paper).
## Updates

- 2/28/2023: Added results for the 7k/9k train splits of MSRVTT.
- 6/24/2022: Released contrast sets for MSRVTT.
## To Do

- Release contrast sets for LSMDC
- Release code to generate contrast sets automatically
## Data

We share the verb-phrase-based contrast sets for MSRVTT and LSMDC in this repository. Code to generate the contrast sets automatically will be released soon; a rough sketch of the idea follows the table below.
Dataset | Link | Contrast Set Type |
---|---|---|
MSRVTT | Data | Verb |
LSMDC | TBD | TBD |
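
Until the generation code is released, here is a minimal sketch of the Verb<sub>LM</sub> idea: mask a caption's verb and let a masked language model propose replacements. The model choice (`roberta-base`) and the `propose_verb_swaps` helper are illustrative assumptions, not the actual generation pipeline.

```python
# Sketch only: propose verb replacements for a caption with a masked LM.
# Assumes HuggingFace `transformers`; roberta-base is an arbitrary choice,
# not necessarily the model used in the paper.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

def propose_verb_swaps(caption: str, verb: str, top_k: int = 5):
    """Mask `verb` in `caption` and return the LM's top replacement candidates."""
    masked = caption.replace(verb, fill_mask.tokenizer.mask_token, 1)
    candidates = fill_mask(masked, top_k=top_k)
    # Keep only proposals that actually change the verb.
    return [c["token_str"].strip() for c in candidates
            if c["token_str"].strip().lower() != verb.lower()]

print(propose_verb_swaps("the boy is trying to fix the problem", "fix"))
```

In the paper's setting, such candidates would still need filtering (e.g., for fluency and for actually contradicting the video) before becoming contrast-set endings.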
## MSRVTT

### Download
The verb-based contrast sets generated by a language model (Verb<sub>LM</sub> MC) and by humans (Verb<sub>Human</sub> MC) can be found in `msrvtt/`.

You can find the video and annotation data by following this link. The contrast sets are generated for the full test set of MSRVTT, which contains 2,990 videos.
You can additionally refer to the download script in ClipBERT to get the original MSRVTT split data. To fetch just the train and val splits, run:
```bash
bash msrvtt/download_msrvtt_train_val.sh
```
Example data:
clip | ending0 | ending1 | ending2 | ending3 | ending4 | label |
---|---|---|---|---|---|---|
video9770 | the boy is trying to fix the problem | the boy is trying to exacerbate the problem | two men on wave runner in ocean rescuing a surfer | asian man discusses technology in the younger generations | a group is dancing | 0 |
video9771 | a woman is putting items into a miniature toy oven | a child is running around on a mat | a woman pushing a stroller | a child is rolling around on a mat | a game show host hosting a game | 1 |
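
Each row pairs a clip with five candidate endings, and `label` gives the index of the correct one. A minimal loading sketch, assuming a CSV file with exactly the columns shown above (the filename is a placeholder, not a guaranteed path):

```python
# Sketch only: load a verb contrast set with the columns shown above.
# "msrvtt/verb_lm_mc.csv" is a placeholder path.
import pandas as pd

df = pd.read_csv("msrvtt/verb_lm_mc.csv")
endings = [f"ending{i}" for i in range(5)]

for _, row in df.head(2).iterrows():
    print(row["clip"], "->", row[endings[row["label"]]])  # the correct caption
```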
### Results

`train-9k` is the 9k-video training split and `test-1k-A` is the 1k-video test split proposed by JSFusion [Yu et al.]. `train-7k` is the 7k-video training split and `test-full` is the full set of test videos in the original MSRVTT; we use the 7k training videos from the CLIP4Clip repo.
MSRVTT-train-7k | V to T (R@1)<br>test-1k-A | T to V (R@1)<br>test-1k-A | Random MC<br>test-full | Gender MC<br>test-full | Verb<sub>LM</sub> MC<br>test-full | Verb<sub>Human</sub> MC<br>test-full |
---|---|---|---|---|---|---|
CLIP-Straight | 27.2 | 31.2 | 94.1 | 69.6 | 65.4 | 65.1 |
MMT | 24.8 | 25.5 | 92.4 | 75.5 | 72.8 | 71.3 |
MMT with CLIP features | 30.5 | 30.3 | 95.0 | 80.1 | 73.8 | 71.4 |
CLIP4Clip<sub>meanP</sub> | 43.0 | 42.1 | 96.2 | 76.7 | 76.2 | 73.7 |
MSRVTT-train-9k | V to T (R@1)<br>test-1k-A | T to V (R@1)<br>test-1k-A | Random MC<br>test-1k-A | Gender MC<br>test-1k-A | Verb<sub>LM</sub> MC<br>test-1k-A | Verb<sub>Human</sub> MC<br>test-1k-A |
---|---|---|---|---|---|---|
CLIP-Straight | 27.2 | 31.2 | 91.2 | 71.4 | 64.9 | 63.5 |
MMT | 27.0 | 26.6 | 93.5 | 75.9 | 75.2 | 72.9 |
MMT with CLIP features | 33.9 | 34.0 | 95.6 | 80.9 | 77.7 | 73.3 |
CLIP4Clip<sub>meanP</sub> | 43.1 | 43.1 | 96.3 | 79.1 | 76.8 | 75.4 |
CLIP2Video | 43.5 | 45.6 | 97.0 | 76.0 | 76.8 | 74.3 |
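
Multiple-choice accuracy in the tables above is the fraction of clips for which the model ranks the ground-truth ending highest among the five candidates. A minimal sketch, assuming hypothetical `embed_video` / `embed_text` functions that map clips and captions into the retrieval model's shared embedding space (e.g., mean-pooled CLIP frame features):

```python
# Sketch only: multiple-choice accuracy for a video-text retrieval model.
# embed_video / embed_text are hypothetical stand-ins for whichever model
# is being evaluated; any shared video-text embedding space would work.
import numpy as np

def mc_accuracy(rows, embed_video, embed_text):
    correct = 0
    for row in rows:
        v = embed_video(row["clip"])  # (d,) video embedding
        e = np.stack([embed_text(row[f"ending{i}"]) for i in range(5)])  # (5, d)
        # Cosine similarity between the clip and each candidate ending.
        sims = e @ v / (np.linalg.norm(e, axis=1) * np.linalg.norm(v) + 1e-8)
        correct += int(np.argmax(sims) == row["label"])
    return correct / len(rows)
```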
## LSMDC
TBD
## Citation
```bibtex
@inproceedings{park-etal-2022-exposing,
    title = "Exposing the Limits of Video-Text Models through Contrast Sets",
    author = "Park, Jae Sung and
      Shen, Sheng and
      Farhadi, Ali and
      Darrell, Trevor and
      Choi, Yejin and
      Rohrbach, Anna",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.261",
    pages = "3574--3586",
}
```
Please email jspark96@cs.washington.edu for more information about the dataset.