ICCV2023 - IntentQA: Context-aware Video Intent Reasoning
Introduction
The project is described in our paper IntentQA: Context-aware Video Intent Reasoning (ICCV2023, Oral).
Among the recent flourishing studies on cross-modal vision-language understanding, video question answering (VideoQA) is one of the most prominent tasks for supporting interactive AI that can understand and communicate about dynamic visual scenarios via natural language. Despite its popularity, VideoQA remains quite challenging, because it demands that models comprehensively understand videos to correctly answer questions, which include not only factual but also inferential ones. The former directly ask about visual facts (e.g., humans, objects, actions), while the latter (inference VideoQA) require logical reasoning over latent variables (e.g., the spatial, temporal, and causal relationships among entities, mental states, etc.) beyond the observed visual facts. A future trend for AI is to study inference VideoQA beyond factoid VideoQA, which requires reasoning ability beyond mere recognition. In this paper, we propose a new task called IntentQA, i.e., a special kind of inference VideoQA that focuses on intent reasoning.
Dataset
Please download the pre-computed features and original videos from here.
There are 3 folders:
- Videos: This directory contains all the original videos of the dataset, named with video_id. All videos are in MP4 format.
- region_feat_n: This folder contains the pre-computed bounding box features.
- frame_feat: This folder contains the pre-computed frame features.
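To sanity-check the download, a minimal Python sketch along these lines can verify the expected layout; the local root path data/intentqa is an assumption, so point it at wherever you extracted the archives.

```python
from pathlib import Path

# Assumed extraction root (not an official path); adjust to your setup.
root = Path("data/intentqa")

# The three folders described above.
for folder in ["Videos", "region_feat_n", "frame_feat"]:
    path = root / folder
    print(f"{folder}: exists={path.exists()}, entries={len(list(path.glob('*')))}")

# Videos are MP4 files named by video_id.
mp4s = sorted((root / "Videos").glob("*.mp4"))
print(f"{len(mp4s)} videos found, e.g. {[p.stem for p in mp4s[:3]]}")
```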
Please download the QA annotations from here. There are 3 files (train.csv, val.csv, test.csv):

In each annotation file, the initial columns follow the same format as in NExT-QA. Building upon the NExT-QA foundation, we have introduced additional annotations, adding extra columns to the dataset.
- action, lemma, and lemma_id: Specifically, we have annotated action, lemma, and lemma_id. These columns highlight the actions in the current QA that trigger intentions (either one's own or others'), along with the lemmatized forms of these actions and their corresponding IDs after categorizing them into synonymous groups.
- id, pos_id, and neg_id: Furthermore, in the train.csv file, we have also added id, pos_id, and neg_id annotations. The id column denotes the row number of the data, while the pos_id and neg_id columns indicate the row numbers (id) of the train-set data that form positive and negative cases, respectively, in relation to the current row's data (see the sketch below).
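As an illustration of how these extra columns can be used, here is a minimal sketch (not an official loading script) that pulls a positive and a negative case for one training question with pandas; the local path data/train.csv is an assumption, and it assumes pos_id / neg_id each hold a single id — if they store several, split them first.

```python
import pandas as pd

# Assumed local path to the downloaded annotations; adjust as needed.
train = pd.read_csv("data/train.csv")

# Index rows by their id column so pos_id / neg_id lookups are direct.
by_id = train.set_index("id")

row = train.iloc[0]
positive = by_id.loc[row["pos_id"]]  # positive case for this question
negative = by_id.loc[row["neg_id"]]  # negative case for this question

print("anchor  :", row["action"], "->", row["lemma"], f"(lemma_id={row['lemma_id']})")
print("positive:", positive["action"], "->", positive["lemma"])
print("negative:", negative["action"], "->", negative["lemma"])
```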
Results
Accuracy (%) by question type:

Model | Text Rep. | CW | CH | TP&TN | Total | Result File |
---|---|---|---|---|---|---|
EVQA | GloVe | 25.92 | 34.54 | 25.52 | 27.27 | |
CoMem | GloVe | 30.00 | 28.69 | 28.95 | 29.52 | |
HGA | GloVe | 32.00 | 30.64 | 31.05 | 31.54 | |
HME | GloVe | 34.40 | 34.26 | 29.14 | 33.08 | |
HQGA | GloVe | 33.20 | 34.26 | 36.57 | 34.21 | |
CoMem | BERT | 47.68 | 54.87 | 39.05 | 46.77 | |
HGA | BERT | 44.88 | 50.97 | 39.62 | 44.61 | |
HME | BERT | 46.08 | 54.32 | 40.76 | 46.16 | |
HQGA | BERT | 48.24 | 54.32 | 41.71 | 47.66 | |
VGT | BERT | 51.44 | 55.99 | 47.62 | 51.27 | |
Blind GPT | BERT | 52.16 | 61.28 | 43.43 | 51.55 | Here |
Ours w/o GPT | BERT | 55.28 | 61.56 | 47.81 | 54.50 | Here |
Ours | BERT | 58.40 | 65.46 | 50.48 | 57.64 | Here |
Human | - | 77.76 | 80.22 | 79.05 | 78.49 | Here |
Demo
Here is a demo that briefly summarizes our work.
Install
conda create -n intentqa python==3.8.8
conda activate intentqa
git clone https://github.com/sail-sg/VGT.git
cd VGT
pip install -r requirements.txt
Inference and Evaluation
./shell/intentqa_test.sh 0
python eval_intentqa.py --folder your_work_dir --mode test
Using GPT
Add the following to intentqa_test.sh:
--GPT_result='../data/save_models/intentqa/Your_GPT_result_DIR/test-res.json'
You can also use the result files linked in the Results section above.
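If you want to inspect a result file before passing it to the script, the following non-authoritative sketch simply prints its structure; it reuses the placeholder path from the flag above and does not assume any particular JSON schema.

```python
import json

# Placeholder path copied from the --GPT_result flag above; replace with your own.
result_path = "../data/save_models/intentqa/Your_GPT_result_DIR/test-res.json"

with open(result_path) as f:
    results = json.load(f)

# Peek at the structure before relying on any particular schema.
print(type(results).__name__, "with", len(results), "entries")
sample = list(results.items())[:3] if isinstance(results, dict) else results[:3]
for item in sample:
    print(item)
```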
Citation
This repository is developed based on VGT; we sincerely thank the authors for their outstanding work. If you find this project useful, please consider citing our paper as well as VGT:
@InProceedings{Li_2023_ICCV,
author = {Li, Jiapeng and Wei, Ping and Han, Wenjuan and Fan, Lifeng},
title = {IntentQA: Context-aware Video Intent Reasoning},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {11963--11974}
}
@inproceedings{xiao2022video,
title={Video Graph Transformer for Video Question Answering},
author={Xiao, Junbin and Zhou, Pan and Chua, Tat-Seng and Yan, Shuicheng},
booktitle={European Conference on Computer Vision},
pages={39--58},
year={2022},
organization={Springer}
}