Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
This repository contains the PyTorch implementation for the ICCV 2023 paper Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [arXiv].
We propose Tem-Adapter, which enables the learning of temporal dynamics and complex semantics with a visual Temporal Aligner and a textual Semantic Aligner. Tem-Adapter introduces a language-guided autoregressive task to guide the learning of temporal dependency and thus reduce the temporal gap between image-based pre-training and video-based QA tasks.
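For intuition only, here is a minimal PyTorch sketch of how a temporal aligner and a semantic aligner could sit on top of frozen CLIP features. The module structure, dimensions, and layer choices below are illustrative assumptions, not the exact implementation in this repository (see model/ for the real code).

# Hypothetical sketch, NOT the repository's exact modules: a Temporal Aligner that
# models frame-to-frame dependency over frozen CLIP frame features, and a Semantic
# Aligner that refines text embeddings by attending to the aligned video features.
import torch
import torch.nn as nn

class TemporalAligner(nn.Module):
    def __init__(self, dim=512, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):              # frame_feats: (T, B, dim) CLIP frame features
        return self.encoder(frame_feats)         # temporally aligned video features

class SemanticAligner(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads)

    def forward(self, text_feats, video_feats):  # text: (L, B, dim), video: (T, B, dim)
        refined, _ = self.attn(text_feats, video_feats, video_feats)
        return refined                           # text features refined by video context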
Environment Setup
- Install PyTorch (install Miniconda first if conda is not available):
conda create --name myenv python=3.7
conda activate myenv
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch
- Install dependencies (a quick sanity check follows this list):
conda install -c conda-forge ffmpeg
conda install -c conda-forge scikit-video
pip install ftfy regex tqdm
pip install timm
pip install jsonlines
pip install git+https://github.com/openai/CLIP.git
pip install -r requirements.txt
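Optionally, you can verify the environment with a short Python snippet like the one below. It is not part of the repository; it only loads CLIP ViT-B/32 and encodes a dummy image to confirm that PyTorch, CUDA, and the CLIP package work together.

# Optional environment sanity check (not part of the repository).
import torch
import clip  # installed via: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
with torch.no_grad():
    feat = model.encode_image(torch.zeros(1, 3, 224, 224, device=device))
print("PyTorch", torch.__version__, "| CUDA:", torch.cuda.is_available(),
      "| CLIP feature shape:", tuple(feat.shape))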
Download the dataset and pre-process it with the CLIP visual encoder
The SUTD-TrafficQA dataset is publicly released. You can download the original videos and text annotations from https://sutdcv.github.io/SUTD-TrafficQA/#/explore
You can use OpenAI's CLIP (ViT-B/32) as the pre-trained image encoder. Follow the instructions below.
- Create a folder ./data/ in the current directory, such as:
Tem-adapter/
|–– configs/
|–– data/
|–– model/
|–– ...
- Unzip the downloaded video file 'raw_videos.zip' into 'data' as ./data/raw_videos/.
- Put the downloaded annotation file 'R3_all.jsonl' into 'data' as ./data/annotation_file/R3_all.jsonl.
The directory should have the following structure:
Tem-adapter/
|–– configs/
|–– data/
|   |–– raw_videos/
|   |   |–– b_1a4411B7sb_clip_005.mp4
|   |   |–– b_1a4411B7sb_clip_006.mp4
|   |   |–– ...
|   |–– annotation_file/
|   |   |–– R3_all.jsonl
|–– model/
|–– ...
- Run the following command to extract features with the CLIP visual encoder:
python preprocess/preprocess_features.py --gpu_id 0 --dataset sutd-traffic --model clip_image
This creates a new folder ./data/sutd-traffic/ under the current path.
- Download the texts (QA pairs) from here and put them under the path ./data/sutd-traffic/.
The dataset directory should then have the following structure (a quick sanity-check sketch for these files follows the listing):
Tem-adapter/
|–– configs/
|–– data/
|   |–– raw_videos/
|   |   |–– b_1a4411B7sb_clip_005.mp4
|   |   |–– b_1a4411B7sb_clip_006.mp4
|   |   |–– ...
|   |–– annotation_file/
|   |   |–– R3_all.jsonl
|   |–– sutd-traffic/
|   |   |–– sutd-traffic_transition_appearance_feat.h5
|   |   |–– output_file_train.jsonl
|   |   |–– output_file_test.jsonl
|–– model/
|–– ...
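As an optional sanity check on the preprocessed data (not part of the repository), you can list the objects stored in the HDF5 feature file and peek at the first QA record. This assumes h5py is available (it is typically installed via requirements.txt) and makes no assumption about the exact field or dataset names.

# Optional sanity check on the preprocessed data (not part of the repository).
import h5py
import jsonlines

# List every object in the extracted CLIP feature file, with shapes for datasets.
with h5py.File("data/sutd-traffic/sutd-traffic_transition_appearance_feat.h5", "r") as f:
    f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))

# Peek at the first training QA record to see the available fields.
with jsonlines.open("data/sutd-traffic/output_file_train.jsonl") as reader:
    print(next(iter(reader)))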
Evaluate the trained model
- Create a new folder 'pretrained' under the path 'Tem-adapter/'.
- Download the trained checkpoints from this link and put them under the path 'Tem-adapter/pretrained/' (a quick load check is sketched after the directory listing below).
The directory should have the following structure:
Tem-adapter/
|–– configs/
|–– data/
|   |–– raw_videos/
|   |   |–– b_1a4411B7sb_clip_005.mp4
|   |   |–– b_1a4411B7sb_clip_006.mp4
|   |   |–– ...
|   |–– annotation_file/
|   |   |–– R3_all.jsonl
|   |–– sutd-traffic/
|   |   |–– sutd-traffic_transition_appearance_feat.h5
|   |   |–– output_file_train.jsonl
|   |   |–– output_file_test.jsonl
|–– model/
|–– pretrained/
|   |–– semanticaligner_49.pt
|   |–– tempaligner_49.pt
|–– ...
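Optionally, you can confirm that the downloaded checkpoints deserialize correctly before running evaluation. The snippet below is not part of the repository; it only loads the files on CPU and prints a few top-level keys, assuming nothing about their internal structure.

# Optional check that the downloaded checkpoints load (not part of the repository).
import torch

for path in ["pretrained/semanticaligner_49.pt", "pretrained/tempaligner_49.pt"]:
    ckpt = torch.load(path, map_location="cpu")
    keys = list(ckpt.keys())[:5] if isinstance(ckpt, dict) else type(ckpt).__name__
    print(path, "->", keys)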
- Uncomment the related lines in 'validate.py' (see the file for details).
- To evaluate the trained model, run the following command:
python validate.py --cfg configs/sutd-traffic_transition.yml
Training
Specify the config file 'configs/sutd-traffic_transition.yml' and run the following command:
python train.py --cfg configs/sutd-traffic_transition.yml
Evaluation
To evaluate the trained model, run the following:
python validate.py --cfg configs/sutd-traffic_transition.yml
License
MIT License
Citation
If you find our work useful in your research, please consider citing:
@inproceedings{chen2023tem,
title={Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer},
author={Chen, Guangyi and Liu, Xiao and Wang, Guangrun and Zhang, Kun and Torr, Philip HS and Zhang, Xiao-Ping and Tang, Yansong},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={13945--13955},
year={2023}
}
Acknowledgement
Our implementation is mainly based on SUTD-TrafficQA and HCRN-VideoQA; we thank the authors for releasing their code.