Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding

NeurIPS 2022, Spotlight Presentation, [arXiv] [BibTeX]

Introduction

We propose STCAT, a new one-stage spatio-temporal video grounding method that achieves state-of-the-art performance on the VidSTG and HC-STVG benchmarks. This repository provides the PyTorch implementation for model training and evaluation. For more details, please refer to our paper.

<div align="center"> <img src="figs/framework.png"/> </div><br/>

Dataset Preparation

The datasets are placed in the data folder with the following structure.

data
|_ vidstg
|  |_ videos
|  |  |_ [video name 0].mp4
|  |  |_ [video name 1].mp4
|  |  |_ ...
|  |_ vstg_annos
|  |  |_ train.json
|  |  |_ ...
|  |_ sent_annos
|  |  |_ train_annotations.json
|  |  |_ ...
|  |_ data_cache
|  |  |_ ...
|_ hc-stvg
|  |_ v1_video
|  |  |_ [video name 0].mp4
|  |  |_ [video name 1].mp4
|  |  |_ ...
|  |_ annos
|  |  |_ hcstvg_v1
|  |  |  |_ train.json
|  |  |  |_ test.json
|  |_ data_cache
|  |  |_ ...
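For reference, here is a minimal sketch for creating this folder skeleton up front (the mkdir paths simply mirror the tree above; the videos and annotation files themselves must be obtained as described below):

# create the expected folder layout (videos and annotation files are downloaded separately)
mkdir -p data/vidstg/videos data/vidstg/vstg_annos data/vidstg/sent_annos data/vidstg/data_cache
mkdir -p data/hc-stvg/v1_video data/hc-stvg/annos/hcstvg_v1 data/hc-stvg/data_cache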

You can prepare this structure with the following steps:

VidSTG

HC-STVG

Setup

Requirements

The code is tested with PyTorch 1.10.0. Other versions may be compatible as well. You can install the requirements with the following commands:

conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
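
As an optional sanity check of the environment (not part of the official setup; it only verifies the installed versions and that PyTorch sees a GPU):

# should print 1.10.0, 0.11.0 and True on a correctly configured GPU machine
python3 -c "import torch, torchvision; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"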

Then, download FFMPEG 4.1.9 and add it to the PATH environment variable so that videos can be loaded.
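
For example, assuming the FFMPEG 4.1.9 binaries are unpacked to ~/ffmpeg-4.1.9 (the path is only illustrative):

# make the ffmpeg binary visible to the video loading code
export PATH="$HOME/ffmpeg-4.1.9:$PATH"
ffmpeg -version   # should report version 4.1.9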

Pretrained Checkpoints

Our model leverages the ResNet-101 pretrained by MDETR as the vision backbone. Please download the pretrained weights from here and place them at data/pretrained/pretrained_resnet101_checkpoint.pth.
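
For example (the checkpoint is obtained from the link above; the source path below is a placeholder):

# place the MDETR-pretrained ResNet-101 checkpoint where the configs expect it
mkdir -p data/pretrained
mv /path/to/pretrained_resnet101_checkpoint.pth data/pretrained/pretrained_resnet101_checkpoint.pth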

Usage

Note: Use one video per GPU during training and evaluation; more than one video per GPU has not been tested and may cause bugs.

Training

For training on an 8-GPU node, you can use the following script:

# run for VidSTG
python3 -m torch.distributed.launch \
 --nproc_per_node=8 \
 scripts/train_net.py \
 --config-file "experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml" \
 --use-seed \
 OUTPUT_DIR data/vidstg/checkpoints/output \
 TENSORBOARD_DIR data/vidstg/checkpoints/output/tensorboard \
 INPUT.RESOLUTION 448

# run for HC-STVG
python3 -m torch.distributed.launch \
 --nproc_per_node=8 \
 scripts/train_net.py \
 --config-file "experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml" \
 --use-seed \
 OUTPUT_DIR data/hc-stvg/checkpoints/output \
 TENSORBOARD_DIR data/hc-stvg/checkpoints/output/tensorboard \
 INPUT.RESOLUTION 448

For more training options (such as other hyper-parameters), please modify the configuration files experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml and experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml.
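
Config options can also be overridden from the command line in the same KEY VALUE style used above. As an illustrative example that only reuses options already shown in this README (other keys depend on the contents of the YAML files):

# e.g. train on VidSTG with 4 GPUs, a 416 input resolution, and a separate output directory
python3 -m torch.distributed.launch \
 --nproc_per_node=4 \
 scripts/train_net.py \
 --config-file "experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml" \
 --use-seed \
 OUTPUT_DIR data/vidstg/checkpoints/output_res416 \
 TENSORBOARD_DIR data/vidstg/checkpoints/output_res416/tensorboard \
 INPUT.RESOLUTION 416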

Evaluation

To evaluate the trained STCAT models, please run the following scripts:

# run for VidSTG
python3 -m torch.distributed.launch \
 --nproc_per_node=8 \
 scripts/test_net.py \
 --config-file "experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml" \
 --use-seed \
 MODEL.WEIGHT data/vidstg/checkpoints/stcat_res448/vidstg_res448.pth \
 OUTPUT_DIR data/vidstg/checkpoints/output \
 INPUT.RESOLUTION 448

# run for HC-STVG
python3 -m torch.distributed.launch \
 --nproc_per_node=8 \
 scripts/test_net.py \
 --config-file "experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml" \
 --use-seed \
 MODEL.WEIGHT data/hc-stvg/checkpoints/stcat_res448/hcstvg_res448.pth \
 OUTPUT_DIR data/hc-stvg/checkpoints/output \
 INPUT.RESOLUTION 448

Model Zoo

We provide our trained checkpoints with the ResNet-101 backbone for reproducing the results.

| Dataset | Resolution | URL | Declarative (m_vIoU / vIoU@0.3 / vIoU@0.5) | Interrogative (m_vIoU / vIoU@0.3 / vIoU@0.5) | Size |
|---|---|---|---|---|---|
| VidSTG | 416 | Model | 32.94 / 46.07 / 32.32 | 27.87 / 38.89 / 26.07 | 3.1GB |
| VidSTG | 448 | Model | 33.14 / 46.20 / 32.58 | 28.22 / 39.24 / 26.63 | 3.1GB |

| Dataset | Resolution | URL | m_vIoU / vIoU@0.3 / vIoU@0.5 | Size |
|---|---|---|---|---|
| HC-STVG | 416 | Model | 34.93 / 56.64 / 31.03 | 3.1GB |
| HC-STVG | 448 | Model | 35.09 / 57.67 / 30.09 | 3.1GB |
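
To evaluate one of the released checkpoints with the scripts above, place it at the path expected by MODEL.WEIGHT (the source path below is a placeholder; the target path matches the evaluation command for VidSTG at resolution 448):

mkdir -p data/vidstg/checkpoints/stcat_res448
mv /path/to/vidstg_res448.pth data/vidstg/checkpoints/stcat_res448/vidstg_res448.pth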

Acknowledgement

This repo is partly based on the open-source releases of MDETR, DAB-DETR and MaskRCNN-Benchmark. The evaluation metric implementation is borrowed from TubeDETR for a fair comparison.

License

STCAT is released under the MIT license.

<a name="Citing"></a>Citation

Consider giving this repository a star and citing it in your publications if it helps your research.

@article{jin2022embracing,
  title={Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding},
  author={Jin, Yang and Li, Yongzhi and Yuan, Zehuan and Mu, Yadong},
  journal={arXiv preprint arXiv:2209.13306},
  year={2022}
}