# Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
NeurIPS 2022, Spotlight Presentation, [[arXiv](https://arxiv.org/abs/2209.13306)] [[BibTeX](#Citing)]
## Introduction
We propose STCAT, a new one-stage spatio-temporal video grounding method that achieves state-of-the-art performance on the VidSTG and HC-STVG benchmarks. This repository provides the PyTorch implementation for model training and evaluation. For more details, please refer to our paper.
<div align="center"> <img src="figs/framework.png"/> </div><br/>

## Dataset Preparation
The used datasets should be placed in the `data` folder with the following structure.
```
data
|_ vidstg
| |_ videos
| | |_ [video name 0].mp4
| | |_ [video name 1].mp4
| | |_ ...
| |_ vstg_annos
| | |_ train.json
| | |_ ...
| |_ sent_annos
| | |_ train_annotations.json
| | |_ ...
| |_ data_cache
| | |_ ...
|_ hc-stvg
| |_ v1_video
| | |_ [video name 0].mp4
| | |_ [video name 1].mp4
| | |_ ...
| |_ annos
| | |_ hcstvg_v1
| | | |_ train.json
| | | |_ test.json
| |_ data_cache
| | |_ ...
```
You can prepare this structure with the following steps (a small sketch for creating the folder skeleton up front is given after the HC-STVG steps):
### VidSTG
- Download the videos for VidSTG from VidOR and put them into `data/vidstg/videos`. The original video download url given by the VidOR dataset provider is broken; you can download the VidSTG videos from this link instead.
- Download the text and temporal annotations from the VidSTG Repo and put them into `data/vidstg/sent_annos`.
- Download the bounding-box annotations from here and put them into `data/vidstg/vstg_annos`.
- For loading efficiency, we provide a dataset cache for VidSTG here. You can download it and put it into `data/vidstg/data_cache`.
### HC-STVG
- Download version 1 of the HC-STVG videos and annotations from HC-STVG. Then put them into `data/hc-stvg/v1_video` and `data/hc-stvg/annos/hcstvg_v1`.
- For loading efficiency, we provide a dataset cache for HC-STVG here. You can download it and put it into `data/hc-stvg/data_cache`.
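
If you want to create the folder skeleton up front before placing the downloads, the following is a minimal, optional sketch; the paths simply mirror the tree above:

```python
from pathlib import Path

# Folder skeleton mirroring the directory tree above. The video, annotation,
# and cache files still need to be downloaded into these folders as described
# in the steps.
folders = [
    "data/vidstg/videos",
    "data/vidstg/vstg_annos",
    "data/vidstg/sent_annos",
    "data/vidstg/data_cache",
    "data/hc-stvg/v1_video",
    "data/hc-stvg/annos/hcstvg_v1",
    "data/hc-stvg/data_cache",
]

for folder in folders:
    Path(folder).mkdir(parents=True, exist_ok=True)
```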
## Setup
### Requirements
The code is tested with PyTorch 1.10.0; other versions may be compatible as well. You can install the requirements with the following commands:
```bash
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
```
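
To confirm the installed versions match the tested setup, a quick sanity check such as the following can be used (this is only a convenience snippet, not part of the codebase):

```python
import torch
import torchvision

# The install command above targets PyTorch 1.10.0 / torchvision 0.11.0 with CUDA 11.3.
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```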
Then, download FFmpeg 4.1.9 and add it to the `PATH` environment variable for video loading.
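
Once FFmpeg is on `PATH`, you can verify it is picked up with a small check like this (again just a convenience snippet):

```python
import shutil
import subprocess

# Video loading relies on the ffmpeg binary being discoverable on PATH.
ffmpeg = shutil.which("ffmpeg")
print("ffmpeg found at:", ffmpeg)
if ffmpeg:
    # The version banner should report 4.1.9, the tested version.
    subprocess.run([ffmpeg, "-version"], check=True)
```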
### Pretrained Checkpoints
Our model uses a ResNet-101 pretrained by MDETR as the vision backbone. Please download the pretrained weights from here and put them at `data/pretrained/pretrained_resnet101_checkpoint.pth`.
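
To check that the downloaded weights are readable, a quick CPU load can be done as below; the key layout inside the checkpoint is whatever MDETR saved, so the printed keys are only informational:

```python
import torch

# Load on CPU only to verify the file is intact; the training code performs
# the actual backbone initialization.
ckpt = torch.load(
    "data/pretrained/pretrained_resnet101_checkpoint.pth", map_location="cpu"
)
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys())[:5])
```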
## Usage
Note: use one video per GPU during training and evaluation; more than one video per GPU has not been tested and may cause bugs.
### Training
For training on an 8-GPU node, you can use the following script:
```bash
# run for VidSTG
python3 -m torch.distributed.launch \
    --nproc_per_node=8 \
    scripts/train_net.py \
    --config-file "experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml" \
    --use-seed \
    OUTPUT_DIR data/vidstg/checkpoints/output \
    TENSORBOARD_DIR data/vidstg/checkpoints/output/tensorboard \
    INPUT.RESOLUTION 448
```

```bash
# run for HC-STVG
python3 -m torch.distributed.launch \
    --nproc_per_node=8 \
    scripts/train_net.py \
    --config-file "experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml" \
    --use-seed \
    OUTPUT_DIR data/hc-stvg/checkpoints/output \
    TENSORBOARD_DIR data/hc-stvg/checkpoints/output/tensorboard \
    INPUT.RESOLUTION 448
```
For more training options (such as other hyper-parameters), please modify the configurations in `experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml` and `experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml`.
### Evaluation
To evaluate the trained STCAT models, please run the following scripts:
```bash
# run for VidSTG
python3 -m torch.distributed.launch \
    --nproc_per_node=8 \
    scripts/test_net.py \
    --config-file "experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml" \
    --use-seed \
    MODEL.WEIGHT data/vidstg/checkpoints/stcat_res448/vidstg_res448.pth \
    OUTPUT_DIR data/vidstg/checkpoints/output \
    INPUT.RESOLUTION 448
```

```bash
# run for HC-STVG
python3 -m torch.distributed.launch \
    --nproc_per_node=8 \
    scripts/test_net.py \
    --config-file "experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml" \
    --use-seed \
    MODEL.WEIGHT data/hc-stvg/checkpoints/stcat_res448/hcstvg_res448.pth \
    OUTPUT_DIR data/hc-stvg/checkpoints/output \
    INPUT.RESOLUTION 448
```
## Model Zoo
We provide our trained checkpoints with the ResNet-101 backbone for reproducibility of the results.
| Dataset | Resolution | URL | Declarative (m_vIoU/vIoU@0.3/vIoU@0.5) | Interrogative (m_vIoU/vIoU@0.3/vIoU@0.5) | Size |
|---|---|---|---|---|---|
| VidSTG | 416 | Model | 32.94/46.07/32.32 | 27.87/38.89/26.07 | 3.1GB |
| VidSTG | 448 | Model | 33.14/46.20/32.58 | 28.22/39.24/26.63 | 3.1GB |
| Dataset | Resolution | URL | m_vIoU/vIoU@0.3/vIoU@0.5 | Size |
|---|---|---|---|---|
| HC-STVG | 416 | Model | 34.93/56.64/31.03 | 3.1GB |
| HC-STVG | 448 | Model | 35.09/57.67/30.09 | 3.1GB |
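
For reference, the reported numbers follow the standard spatio-temporal grounding metrics: vIoU averages the per-frame box IoU over the temporal union of the predicted and ground-truth tubes, m_vIoU is the mean vIoU over all test samples, and vIoU@R is the fraction of samples whose vIoU exceeds R. The sketch below only illustrates these definitions; it is not the repository's evaluation code (which follows TubeDETR), and the `Box`/`viou`/`summarize` names are purely illustrative:

```python
from typing import Dict, Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def box_iou(a: Box, b: Box) -> float:
    """IoU of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def viou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """vIoU: sum of per-frame IoU over the temporal intersection of the two
    tubes, divided by the size of their temporal union (frame index -> box)."""
    inter_frames = set(pred) & set(gt)
    union_frames = set(pred) | set(gt)
    if not union_frames:
        return 0.0
    return sum(box_iou(pred[t], gt[t]) for t in inter_frames) / len(union_frames)

def summarize(pairs: Iterable[Tuple[Dict[int, Box], Dict[int, Box]]],
              thresholds=(0.3, 0.5)):
    """m_vIoU and vIoU@R over a collection of (predicted tube, GT tube) pairs."""
    vals = [viou(p, g) for p, g in pairs]
    n = max(len(vals), 1)
    m_viou = sum(vals) / n
    viou_at_r = {r: sum(v > r for v in vals) / n for r in thresholds}
    return m_viou, viou_at_r
```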
## Acknowledgement
This repo is partly based on the open-source releases of MDETR, DAB-DETR and MaskRCNN-Benchmark. The evaluation metric implementation is borrowed from TubeDETR for a fair comparison.
## License
`STCAT` is released under the MIT license.
## <a name="Citing"></a>Citation
Consider giving this repository a star and citing it in your publications if it helps your research.
```bibtex
@article{jin2022embracing,
  title={Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding},
  author={Jin, Yang and Li, Yongzhi and Yuan, Zehuan and Mu, Yadong},
  journal={arXiv preprint arXiv:2209.13306},
  year={2022}
}
```