VITATECS

VITATECS is a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding.

VITATECS is also available on Hugging Face.

News

[2024/07] Our paper has been accepted to ECCV 2024.

[2024/07] Our dataset is now supported by lmms-eval, an evaluation suite for large multimodal models.

[2023/11] Our paper is now available on arXiv.

[2023/11] We have released a new version of VITATECS generated using ChatGPT. The previous version, generated by OPT-175B, has been moved to the data_opt folder.

Data

This repo contains six jsonl files under the data folder, each corresponding to one aspect of temporal concepts: Direction, Intensity, Sequence, Localization, Compositionality, and Type.

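Each line of a jsonl file is a JSON object with the following fields:

src_dataset: the source dataset of the video (VATEX or MSRVTT)
video_name: the file name of the video in the source dataset
caption: the original caption describing the video
counterfactual: a modified version of the caption that alters the temporal concept named in aspect
aspect: the aspect of temporal concepts that the example tests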

Example (indented for better presentation):

{
    "src_dataset": "VATEX", 
    "video_name": "i0ccSYMl0vo_000027_000037.mp4", 
    "caption": "A woman is placing a waxing strip on a man's leg.", 
    "counterfactual": "A woman is removing a waxing strip from a man's leg.",
    "aspect": "Direction"
}
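As an illustration, a minimal Python sketch for reading the data, assuming the jsonl files sit in a local data folder as described above. The helper names group_by_aspect and load_vitatecs are hypothetical and not part of this repo.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_by_aspect(lines):
    """Parse JSON lines and group the resulting examples by their "aspect" field."""
    by_aspect = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines, e.g. a trailing newline
            continue
        example = json.loads(line)
        by_aspect[example["aspect"]].append(example)
    return dict(by_aspect)


def load_vitatecs(data_dir="data"):
    """Read every *.jsonl file under data_dir (hypothetical helper) and group all examples by aspect."""
    lines = []
    for path in sorted(Path(data_dir).glob("*.jsonl")):
        with open(path, encoding="utf-8") as f:
            lines.extend(f)  # iterating a file yields its lines
    return group_by_aspect(lines)
```

For example, load_vitatecs("data")["Direction"] would hold the Direction examples, each a dict with the fields shown in the example above.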

Evaluation

Data Preparation

Place the videos from the source datasets (MSRVTT and VATEX) in the following directory structure:

videos
    MSRVTT
        video0.mp4
        video1.mp4
        ...
    VATEX
        _0ZBlXUcaOk_000013_000023.mp4
        _1qp63Hh6Xk_000015_000025.mp4
        ...
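A small sketch for locating each example's video under this layout, assuming the folder names match the src_dataset values (MSRVTT and VATEX) and that the root folder is called videos. The names VIDEO_ROOT, video_path, and missing_videos are hypothetical.

```python
from pathlib import Path

# Assumed layout (see the tree above): <video_root>/<src_dataset>/<video_name>
VIDEO_ROOT = Path("videos")  # hypothetical location; adjust to where the videos were placed


def video_path(example, video_root=VIDEO_ROOT):
    """Map a VITATECS example to the expected location of its video file."""
    return Path(video_root) / example["src_dataset"] / example["video_name"]


def missing_videos(examples, video_root=VIDEO_ROOT):
    """Return the expected paths of videos that are not present on disk."""
    paths = (video_path(e, video_root) for e in examples)
    return [p for p in paths if not p.is_file()]
```

Running missing_videos over all examples before evaluation is a quick way to catch videos that were not downloaded or were placed in the wrong folder.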

ALPRO

The evaluation of ALPRO is implemented based on the LAVIS library.

To evaluate ALPRO on VITATECS, run the following commands:

cd alpro
python eval_alpro.py alpro_pretrain.yaml
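As a model-agnostic sketch, one way to turn model outputs into per-aspect scores is pairwise accuracy: the fraction of examples where the model scores the true caption above its counterfactual. This criterion is an assumption here, not necessarily what eval_alpro.py computes; the scores stand for whatever video-text similarity the model produces, and pairwise_accuracy is a hypothetical helper.

```python
from collections import defaultdict


def pairwise_accuracy(results):
    """Compute per-aspect accuracy from (aspect, caption_score, counterfactual_score) triples.

    A pair counts as correct when the model scores the true caption strictly
    higher than its counterfactual (assumed criterion; ties count as wrong).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for aspect, caption_score, counterfactual_score in results:
        total[aspect] += 1
        if caption_score > counterfactual_score:
            correct[aspect] += 1
    return {aspect: correct[aspect] / total[aspect] for aspect in total}
```

Counting ties as wrong keeps a model that cannot tell the two captions apart from scoring above zero by chance.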

X-CLIP/CLIP4Clip

The evaluation of X-CLIP/CLIP4Clip is implemented based on the X-CLIP repository.

To evaluate X-CLIP/CLIP4Clip on VITATECS:

License

This dataset is released under the CC BY 4.0 license.