Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding

NeurIPS 2022, Spotlight Presentation, [arXiv] [BibTeX]

Introduction

We propose STCAT, a new one-stage spatio-temporal video grounding method that achieves state-of-the-art performance on the VidSTG and HC-STVG benchmarks. This repository provides the PyTorch implementation for model training and evaluation. For more details, please refer to our paper.

<div align="center"> <img src="figs/framework.png"/> </div><br/>

Dataset Preparation

The datasets are placed in the data folder with the following structure.

data
|_ vidstg
|  |_ videos
|  |  |_ [video name 0].mp4
|  |  |_ [video name 1].mp4
|  |  |_ ...
|  |_ vstg_annos
|  |  |_ train.json
|  |  |_ ...
|  |_ sent_annos
|  |  |_ train_annotations.json
|  |  |_ ...
|  |_ data_cache
|  |  |_ ...
|_ hc-stvg
|  |_ v1_video
|  |  |_ [video name 0].mp4
|  |  |_ [video name 1].mp4
|  |  |_ ...
|  |_ annos
|  |  |_ hcstvg_v1
|  |  |  |_ train.json
|  |  |  |_ test.json
|  |_ data_cache
|  |  |_ ...
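For reference, here is a minimal sketch for creating this folder skeleton up front (the mkdir paths simply mirror the tree above; the videos and annotation files themselves must be obtained as described below):

# create the expected folder layout (videos and annotation files are downloaded separately)
mkdir -p data/vidstg/videos data/vidstg/vstg_annos data/vidstg/sent_annos data/vidstg/data_cache
mkdir -p data/hc-stvg/v1_video data/hc-stvg/annos/hcstvg_v1 data/hc-stvg/data_cache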

You can prepare this structure with the following steps:

VidSTG

HC-STVG

Setup

Requirements

The code is tested with PyTorch 1.10.0. Other versions may be compatible as well. You can install the requirements with the following commands:

conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
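
As an optional sanity check of the environment (not part of the official setup; it only verifies the installed versions and that PyTorch sees a GPU):

# should print 1.10.0, 0.11.0 and True on a correctly configured GPU machine
python3 -c "import torch, torchvision; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"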

Then, download FFMPEG 4.1.9 and add it to the PATH environment variable so that videos can be loaded.
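
For example, assuming the FFMPEG 4.1.9 binaries are unpacked to ~/ffmpeg-4.1.9 (the path is only illustrative):

# make the ffmpeg binary visible to the video loading code
export PATH="$HOME/ffmpeg-4.1.9:$PATH"
ffmpeg -version   # should report version 4.1.9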

Pretrained Checkpoints

Our model leverages the ResNet-101 pretrained by MDETR as the vision backbone. Please download the pretrained weights from here and place them at data/pretrained/pretrained_resnet101_checkpoint.pth.
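
For example (the checkpoint is obtained from the link above; the source path below is a placeholder):

# place the MDETR-pretrained ResNet-101 checkpoint where the configs expect it
mkdir -p data/pretrained
mv /path/to/pretrained_resnet101_checkpoint.pth data/pretrained/pretrained_resnet101_checkpoint.pth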

Usage

Note: Use one video per GPU during training and evaluation; more than one video per GPU has not been tested and may cause bugs.

Training

For training on an 8-GPU node, you can use the following script:

# run for VidSTG
python3 -m torch.distributed.launch \
 --nproc_per_node=8 \
 scripts/train_net.py \
 --config-file "experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml" \
 --use-seed \
 OUTPUT_DIR data/vidstg/checkpoints/output \
 TENSORBOARD_DIR data/vidstg/checkpoints/output/tensorboard \
 INPUT.RESOLUTION 448

# run for HC-STVG
python3 -m torch.distributed.launch \
 --nproc_per_node=8 \
 scripts/train_net.py \
 --config-file "experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml" \
 --use-seed \
 OUTPUT_DIR data/hc-stvg/checkpoints/output \
 TENSORBOARD_DIR data/hc-stvg/checkpoints/output/tensorboard \
 INPUT.RESOLUTION 448

For more training options (such as other hyper-parameters), please modify the configuration files experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml and experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml.
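
Config options can also be overridden from the command line in the same KEY VALUE style used above. As an illustrative example that only reuses options already shown in this README (other keys depend on the contents of the YAML files):

# e.g. train on VidSTG with 4 GPUs, a 416 input resolution, and a separate output directory
python3 -m torch.distributed.launch \
 --nproc_per_node=4 \
 scripts/train_net.py \
 --config-file "experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml" \
 --use-seed \
 OUTPUT_DIR data/vidstg/checkpoints/output_res416 \
 TENSORBOARD_DIR data/vidstg/checkpoints/output_res416/tensorboard \
 INPUT.RESOLUTION 416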

Evaluation

To evaluate the trained STCAT models, please run the following scripts:

# run for VidSTG
python3 -m torch.distributed.launch \
 --nproc_per_node=8 \
 scripts/test_net.py \
 --config-file "experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml" \
 --use-seed \
 MODEL.WEIGHT data/vidstg/checkpoints/stcat_res448/vidstg_res448.pth \
 OUTPUT_DIR data/vidstg/checkpoints/output \
 INPUT.RESOLUTION 448

# run for HC-STVG
python3 -m torch.distributed.launch \
 --nproc_per_node=8 \
 scripts/test_net.py \
 --config-file "experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml" \
 --use-seed \
 MODEL.WEIGHT data/hc-stvg/checkpoints/stcat_res448/hcstvg_res448.pth \
 OUTPUT_DIR data/hc-stvg/checkpoints/output \
 INPUT.RESOLUTION 448

Model Zoo

We provide our trained checkpoints with the ResNet-101 backbone for reproducing the results.

| Dataset | Resolution | URL | Declarative (m_vIoU / vIoU@0.3 / vIoU@0.5) | Interrogative (m_vIoU / vIoU@0.3 / vIoU@0.5) | Size |
|---|---|---|---|---|---|
| VidSTG | 416 | Model | 32.94 / 46.07 / 32.32 | 27.87 / 38.89 / 26.07 | 3.1GB |
| VidSTG | 448 | Model | 33.14 / 46.20 / 32.58 | 28.22 / 39.24 / 26.63 | 3.1GB |

| Dataset | Resolution | URL | m_vIoU / vIoU@0.3 / vIoU@0.5 | Size |
|---|---|---|---|---|
| HC-STVG | 416 | Model | 34.93 / 56.64 / 31.03 | 3.1GB |
| HC-STVG | 448 | Model | 35.09 / 57.67 / 30.09 | 3.1GB |
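
To evaluate one of the released checkpoints with the scripts above, place it at the path expected by MODEL.WEIGHT (the source path below is a placeholder; the target path matches the evaluation command for VidSTG at resolution 448):

mkdir -p data/vidstg/checkpoints/stcat_res448
mv /path/to/vidstg_res448.pth data/vidstg/checkpoints/stcat_res448/vidstg_res448.pth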

Acknowledgement

This repo is partly based on the open-source releases of MDETR, DAB-DETR and MaskRCNN-Benchmark. The evaluation metric implementation is borrowed from TubeDETR for a fair comparison.

License

STCAT is released under the MIT license.

<a name="Citing"></a>Citation

Consider giving this repository a star and citing it in your publications if it helps your research.

@article{jin2022embracing,
  title={Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding},
  author={Jin, Yang and Li, Yongzhi and Yuan, Zehuan and Mu, Yadong},
  journal={arXiv preprint arXiv:2209.13306},
  year={2022}
}