Causal-VidQA
News
- [2024.07.11] We have released the answers for the test set. You can download them and put them into `data/QA` to use them.
Introduction
The Causal-VidQA dataset contains 107,600 QA pairs over 26,900 video clips. The dataset aims to facilitate deeper video understanding towards video reasoning. In detail, we present the task of Causal-VidQA, which includes four types of questions ranging from scene description (description) to evidence reasoning (explanation) and commonsense reasoning (prediction and counterfactual). For commonsense reasoning, we set up a two-step solution: answering the question and providing a proper reason.
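To make the four question types and the two-step setting concrete, here is a purely illustrative sketch of one clip's questions. The field names and values are our own illustration, not the actual annotation schema; please refer to the released QA files for the real format.

```python
# Illustrative only: NOT the actual annotation schema of the released QA files.
example_clip = {
    "video_id": "example_clip",
    "description":    {"question": "What is the man doing?",
                       "answer": "He is riding a skateboard down a ramp."},
    "explanation":    {"question": "Why does the man bend his knees?",
                       "answer": "To keep his balance while landing."},
    # Commonsense reasoning questions are answered in two steps: answer + reason.
    "prediction":     {"question": "What will the man most likely do next?",
                       "answer": "He will continue skating along the ramp.",
                       "reason": "He lands steadily and keeps his momentum."},
    "counterfactual": {"question": "What would happen if the ramp were wet?",
                       "answer": "He would probably slip and fall.",
                       "reason": "A wet surface reduces the friction needed to stay upright."},
}
```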
Here is an example from our dataset and the comparison between our dataset and other VisualQA datasets.
<div align=center><img src="./fig/example.png"/></div>
<div align=center><strong>Example from our Causal-VidQA Dataset</strong></div>

Dataset | Visual Type | Visual Source | Annotation | Description | Explanation | Prediction | Counterfactual | #Video/Image | #QA | Video Length (s) |
---|---|---|---|---|---|---|---|---|---|---|
Motivation | Image | MS COCO | Man | ✔ | ✔ | ✔ | $\times$ | 10,191 | - | - |
VCR | Image | Movie Clip | Man | ✔ | ✔ | ✔ | $\times$ | 110,000 | 290,000 | - |
MovieQA | Video | Movie Stories | Auto | ✔ | ✔ | $\times$ | $\times$ | 548 | 21,406 | 200 |
TVQA | Video | TV Show | Man | ✔ | ✔ | $\times$ | $\times$ | 21,793 | 152,545 | 76 |
TGIF-QA | Video | TGIF | Auto | ✔ | $\times$ | $\times$ | $\times$ | 71,741 | 165,165 | 3 |
ActivityNet-QA | Video | ActivityNet | Man | ✔ | ✔ | $\times$ | $\times$ | 5,800 | 58,000 | 180 |
Social-IQ | Video | YouTube | Man | ✔ | ✔ | $\times$ | $\times$ | 1,250 | 7,500 | 60 |
CLEVRER | Video | Game Engine | Man | ✔ | ✔ | ✔ | ✔ | 20,000 | 305,280 | 5 |
V2C | Video | MSR-VTT | Man | ✔ | ✔ | $\times$ | $\times$ | 10,000 | 115,312 | 30 |
NExT-QA | Video | YFCC-100M | Man | ✔ | ✔ | $\times$ | $\times$ | 5,440 | 52,044 | 44 |
Causal-VidQA | Video | Kinetics-700 | Man | ✔ | ✔ | ✔ | ✔ | 26,900 | 107,600 | 9 |
On this page, you can find the code of several SOTA VideoQA methods and the dataset for our CVPR paper:
- Jiangtong Li, Li Niu and Liqing Zhang. From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering. CVPR, 2022. [paper link]
Download
Install
Please create an environment for this project using Miniconda (install Miniconda first):
>conda create -n causal-vidqa python==3.6.12
>conda activate causal-vidqa
>git clone https://github.com/bcmi/Causal-VidQA
>cd Causal-VidQA
>pip install -r requirement.txt
Data Preparation
Please download the pre-computed features and QA annotations from Download 1-4, and place them in `data/visual_feature`, `data/text_feature`, `data/split` and `data/QA`. Note that the text annotation is packaged as QA.tar; you need to unpack it before placing it in `data/QA`.
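A minimal sketch of unpacking the annotations and checking that the expected folders are in place. The paths follow this section; adjust the target directory if the archive already contains a top-level QA folder.

```python
# Sketch: unpack QA.tar into data/QA and verify the expected data folders exist.
import os
import tarfile

os.makedirs("data/QA", exist_ok=True)
with tarfile.open("QA.tar") as tar:
    tar.extractall("data/QA")   # adjust if QA.tar already contains a QA/ folder

for folder in ["data/visual_feature", "data/text_feature", "data/split", "data/QA"]:
    status = "ok" if os.path.isdir(folder) else "MISSING"
    print(f"{folder}: {status}")
```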
If you want to extract different video and text features from our Causal-VidQA dataset, you can download the original data from Download 5 and extract features however you want.
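For example, a common starting point for custom visual features is to uniformly sample frames from each clip and feed them to your own backbone. The sketch below uses OpenCV; the paths and frame count are assumptions, and this is just one possible pipeline, not the one used to produce the released features.

```python
# Sketch: uniformly sample RGB frames from a clip (a possible starting point
# for custom feature extraction; paths and frame count are assumptions).
import cv2
import numpy as np

def sample_frames(video_path, num_frames=16):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)

# frames = sample_frames("path/to/clip.mp4")  # then run your feature extractor on frames
```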
Usage
Once the data is ready, you can easily run the code. First, to run these models with GloVe features, you can directly train the B2A model by:
>sh bash/train_glove.sh
Note that if you want to train the model with BERT features, we suggest you first load the BERT features into shared memory (SharedArray) by:
>python dataset/load.py
and then train the B2A with BERT feature by:
>sh bash/train_bert.sh
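For reference, the idea behind dataset/load.py is to publish the pre-computed BERT features to shared memory so that training workers can attach to them instead of reloading them from disk. Below is a minimal sketch using the SharedArray package; the file name and shared-memory name are assumptions, and the actual script may differ.

```python
# Sketch: place pre-computed BERT features in shared memory with SharedArray.
# (File name and shared-memory name are assumptions; dataset/load.py may differ.)
import numpy as np
import SharedArray as sa

features = np.load("data/text_feature/bert_features.npy")  # assumed file name

name = "shm://causal_vidqa_bert"   # illustrative shared-memory name
try:
    sa.delete(name)                # remove a stale copy if one already exists
except OSError:
    pass
shared = sa.create(name, features.shape, dtype=features.dtype)
shared[:] = features

# A data loader can later attach without reloading from disk:
# bert_features = sa.attach("shm://causal_vidqa_bert")
```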
After the training script finishes, you can find the prediction file under `results/model_name/model_prefix.json`
and you can evaluate the prediction results by:
>python eval_mc.py
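Conceptually, the multiple-choice evaluation boils down to comparing predicted answer indices against the ground truth, as in the sketch below. The file locations and JSON layout here are assumptions; eval_mc.py is the authoritative implementation and also reports the per-type metrics used in the paper.

```python
# Sketch: multiple-choice accuracy, assuming both JSON files map question ids
# to answer indices; the real format used by eval_mc.py may differ.
import json

with open("results/model_name/model_prefix.json") as f:
    predictions = json.load(f)
with open("data/QA/answers.json") as f:   # hypothetical ground-truth file
    answers = json.load(f)

correct = sum(int(predictions[qid] == answers[qid]) for qid in answers)
print(f"Accuracy: {correct / len(answers):.4f}")
```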
You can also obtain the prediction by running:
>sh bash/eval.sh
The command above will load the model from `experiment/model_name/model_prefix/model/best.pkl`
and generate the prediction file.
Hint: we have released a trained model for the B2A method. Please place the trained weights in `experiment/B2A/B2A/model/best.pkl` and then make predictions by running:
>sh bash/eval.sh
(The results may be slightly different depending on the environments and random seeds.)
(For comparison, please refer to the results in our paper.)
Test set Evaluation
In the released dataset, we hide the correct answer IDs for the questions in our test set. If you want to evaluate results on the test set, please participate in the competition on CodaLab.
Hint:
- Each participant can submit inference results at most 10 times in total and 2 times per day.
- Please zip the inference JSON file before submitting it to CodaLab (see the sketch after this list).
- If your registration request is not approved within one day, please email me.
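A minimal sketch of zipping a prediction file for submission; the file name "prediction.json" is a placeholder for your model's actual output file.

```python
# Sketch: zip a prediction file for the CodaLab submission.
import zipfile

# "prediction.json" is a placeholder; use your own output file name.
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("prediction.json")
```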
Citation
@InProceedings{li2022from,
author = {Li, Jiangtong and Niu, Li and Zhang, Liqing},
title = {From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022}
}
Acknowledgement
Our reproduction of the methods is mainly based on NExT-QA and the other respective official repositories; we thank the authors for releasing their code. If you use the related parts, please cite the corresponding papers commented in the code.