SnAG: Scalable and Accurate Video Grounding (CVPR 2024)

Introduction

This code repo implements SnAG, a scalable and accurate model for long-form video grounding, i.e., localizing moments within an untrimmed long video based on text descriptions. SnAG features a minimalist, late-fusion design for scalable inference, while supporting video-centric sampling for scalable training. Without bells and whistles, SnAG achieves 44.86% R1@0.5 and 70.66% R5@0.5 on TACoS, outperforming the previous state of the art by 8.53 and 12.75 absolute percentage points, respectively. Further, SnAG demonstrates strong results on Ego4D-NLQ (13.57% mean R1 and 32.92% mean R5) and on the more challenging MAD dataset (5.55% R1@0.5 and 13.75% R5@0.5). Our paper has been accepted to CVPR 2024; an arXiv version can be found at this link.

Related projects:

ActionFormer: Localizing Moments of Actions with Transformers <br> Chenlin Zhang, Jianxin Wu, Yin Li <br> ECCV 2022

Visualization

We provide visualizations of localized moments in Ego4D-NLQ videos.

Note that the ground-truth moments come from human annotations and may contain errors.

<img src="https://media.githubusercontent.com/media/fmu2/snag_release/main/viz/085f7a8b-e1e5-4e7b-a83d-5ea650edd9fe.gif" width="720"/> <img src="https://media.githubusercontent.com/media/fmu2/snag_release/main/viz/0aca0078-b6ab-41fb-9dc5-a70b8ad137b2.gif" width="720"/> <img src="https://media.githubusercontent.com/media/fmu2/snag_release/main/viz/0ca4506c-962d-4cf1-aa6d-f8222f53dee6.gif" width="720"/>

Changelog

Code Overview

The structure of this code repo is heavily inspired by ActionFormer. Some of the main components are the training and evaluation scripts (./train.py and ./eval.py), the library code under ./libs, the configuration files (e.g., video_centric/tacos.yaml), and the ./experiments folder that stores checkpoints, logs, and TensorBoard summaries for each run.

Installation

To Reproduce Our Results on TACoS

Download Features and Annotations

Details: The features are extracted using the C3D model pretrained on Sports1M, given clips of 16 frames with a frame rate of ~30 fps and a stride of 4 frames. This gives one feature vector per 4/30 ~= 0.1333 seconds. In practice, SnAG uses 4x-subsampled C3D features (i.e., the effective stride is 16 frames) for fair comparison with baselines.
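For concreteness, here is a minimal Python sketch of the timestamp-to-feature-index arithmetic implied above (16-frame clips at roughly 30 fps, a raw stride of 4 frames, 4x subsampling). The helper and the example timestamps are made up for illustration and are not part of this repo.

```python
# Hypothetical helper, not part of this repo: map a timestamp in seconds to an
# index on the 4x-subsampled C3D feature grid described above.
def time_to_feature_index(t_sec, fps=30.0, stride=4, subsample=4):
    # one raw feature every stride / fps seconds; 4x subsampling makes the
    # effective stride 4 * 4 = 16 frames, i.e., 16 / 30 ~= 0.533 s per feature
    seconds_per_feature = (stride * subsample) / fps
    return int(round(t_sec / seconds_per_feature))

# e.g., a moment annotated as [12.4 s, 31.9 s] lands on feature indices [23, 60]
print(time_to_feature_index(12.4), time_to_feature_index(31.9))
```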

Unpack Features and Annotations

This folder
│   README.md
│   ...  
│
└───data/
│    └───tacos/
│    │	 └───annotations
│    │	 └───c3d_features   
│    └───...
│
└───libs
│
│   ...

Training and Evaluation

Train SnAG on TACoS, monitor training with TensorBoard, and evaluate the last checkpoint:

python ./train.py --opt video_centric/tacos.yaml --name tacos_reproduce
tensorboard --logdir=./experiments/tacos_reproduce/tensorboard
python ./eval.py --name tacos_reproduce --ckpt last

[Optional] Evaluating Our Pre-trained Model

We also provide a pre-trained model for TACoS. The model with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.

This folder
│   README.md
│   ...  
│
└───experiments/
│    └───tacos_reproduce/
│    │	 └───eval_last.txt
│    │	 └───log.txt
│    │   └───...    
│    └───...
│
└───libs
│
│   ...

Run evaluation on the downloaded checkpoint. The expected results are listed in the table below.

python ./eval.py --name tacos_reproduce --ckpt last

| Method | R1@0.3 | R1@0.5 | R5@0.3 | R5@0.5 |
|--------|--------|--------|--------|--------|
| SnAG   | 55.51  | 45.14  | 81.58  | 70.31  |

To Reproduce Our Results on Ego4D-NLQ

Download Features and Annotations

Details: We use the official SlowFast features from here. They are extracted using the SlowFast model pretrained on Kinetics 400, given clips of 32 frames with a frame rate of 30 fps and a stride of 16 frames. This gives one feature vector per 16/30 ~= 0.533 seconds. The EgoVLP features are extracted using the EgoVLP model checkpoint, given clips of 32 frames with a frame rate of 30 fps and a stride of 8 frames. This gives one feature vector per 8/30 ~= 0.267 seconds. In practice, SnAG uses 2x-subsampled EgoVLP features (i.e., the effective stride is 16 frames) for fair comparison with baselines.
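As a concrete illustration of the subsampling step, the sketch below drops every other EgoVLP feature so that the effective stride matches the SlowFast features. This is a toy example, not repo code; the array shape and dtype are placeholders.

```python
import numpy as np

# Toy illustration, not from this repo: 2x-subsample EgoVLP features so that
# the effective stride becomes 16 frames, matching the SlowFast features.
egovlp = np.random.randn(7200, 256).astype(np.float32)  # placeholder (num_features, dim)
egovlp_2x = egovlp[::2]                                  # keep every other feature

print(egovlp_2x.shape)   # -> (3600, 256)
print(2 * 8 / 30)        # effective seconds per feature, ~0.533
```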

Unpack Features and Annotations

This folder
│   README.md
│   ...  
│
└───data/
│    └───ego4d_slowfast_bert/
│    │	 └───annotations
│    │	 └───slowfast_features
│    │	 └───bert_features
│    └───ego4d_egovlp/
│    │	 └───annotations
│    │	 └───egovlp_features
│    └───...
│
└───libs
│
│   ...

Training and Evaluation

Train SnAG on Ego4D-NLQ with SlowFast + BERT features or with EgoVLP features, monitor training with TensorBoard, and evaluate the last checkpoints:

python ./train.py --opt video_centric/ego4d_slowfast_bert.yaml --name ego4d_slowfast_bert_reproduce
python ./train.py --opt video_centric/ego4d_egovlp.yaml --name ego4d_egovlp_reproduce
tensorboard --logdir=./experiments/ego4d_slowfast_bert_reproduce/tensorboard
tensorboard --logdir=./experiments/ego4d_egovlp_reproduce/tensorboard
python ./eval.py --name ego4d_slowfast_bert_reproduce --ckpt last
python ./eval.py --name ego4d_egovlp_reproduce --ckpt last

[Optional] Evaluating Our Pre-trained Model

We also provide pre-trained models for Ego4D-NLQ. The model using SlowFast + BERT features, along with all training logs, can be downloaded from this Google Drive link. The model using EgoVLP features, along with all training logs, can be downloaded from this Google Drive link. To evaluate the pre-trained models, please follow the steps listed below.

This folder
│   README.md
│   ...  
│
└───experiments/
│    └───ego4d_slowfast_bert_reproduce/
│    │	 └───eval_last.txt
│    │	 └───log.txt
│    │   └───...
│    └───ego4d_egovlp_reproduce/
│    │	 └───eval_last.txt
│    │	 └───log.txt
│    │   └───...  
│    └───...
│
└───libs
│
│   ...

Run evaluation on the downloaded checkpoints. The expected results are listed in the table below.

python ./eval.py --name ego4d_slowfast_bert_reproduce --ckpt last
python ./eval.py --name ego4d_egovlp_reproduce --ckpt last

| Method                 | R1@0.3 | R1@0.5 | mean R1 | R5@0.3 | R5@0.5 | mean R5 |
|------------------------|--------|--------|---------|--------|--------|---------|
| SnAG (SlowFast + BERT) | 9.75   | 6.40   | 8.08    | 28.10  | 19.47  | 23.79   |
| SnAG (EgoVLP)          | 15.53  | 10.94  | 13.24   | 38.40  | 27.70  | 33.10   |

To Reproduce Our Results on MAD

Download Features and Annotations

Details: We use the official CLIP features from here. The features are extracted using CLIP ViT-L/14 with a frame rate of 5 fps. This gives one feature vector every 0.2 seconds.
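To give a sense of scale, the back-of-the-envelope sketch below counts the CLIP features for a 2-hour movie at this feature rate. The duration is a made-up example, not a MAD statistic.

```python
# Back-of-the-envelope sketch: number of CLIP features for a 2-hour movie
# at 5 features per second (one feature every 0.2 seconds).
movie_duration_sec = 2 * 60 * 60              # a 2-hour movie, for illustration
features_per_sec = 5
print(movie_duration_sec * features_per_sec)  # -> 36000 features
```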

Unpack Features and Annotations

This folder
│   README.md
│   ...  
│
└───data/
│    └───mad/
│    │	 └───annotations
│    │	 └───clip_features
│    └───...
│
└───libs
│
│   ...

Training and Evaluation

Train SnAG on MAD, monitor training with TensorBoard, and evaluate the last checkpoint:

python ./train.py --opt video_centric/mad.yaml --name mad_reproduce
tensorboard --logdir=./experiments/mad_reproduce/tensorboard
python ./eval.py --name mad_reproduce --ckpt last

[Optional] Evaluating Our Pre-trained Model

We also provide a pre-trained model for MAD. The model with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.

This folder
│   README.md
│   ...  
│
└───experiments/
│    └───mad_reproduce/
│    │	 └───eval_last.txt
│    │	 └───log.txt
│    │   └───...    
│    └───...
│
└───libs
│
│   ...

Run evaluation on the downloaded checkpoint. The expected results are listed in the table below.

python ./eval.py --name mad_reproduce --ckpt last

| Method | R1@0.1 | R1@0.3 | R1@0.5 | R5@0.1 | R5@0.3 | R5@0.5 |
|--------|--------|--------|--------|--------|--------|--------|
| SnAG   | 10.35  | 8.51   | 5.47   | 24.40  | 20.30  | 13.41  |

To Reproduce Our Results on Charades-STA

Download Features and Annotations

Details: The C3D features are extracted using the C3D model pretrained on Sports1M, given clips of 16 frames with a frame rate of 24 fps and a stride of 4 frames. This gives one feature vector per 4/24 ~= 0.167 seconds. The I3D features are extracted using the I3D model pretrained on Kinetics 400, given clips of 16 frames with a frame rate of 24 fps and a stride of 4 frames. This gives one feature vector per 4/24 ~= 0.167 seconds.
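Complementing the timestamp-to-index sketch in the TACoS section, the hypothetical helper below recovers the time span covered by the k-th feature under the clip length, frame rate, and stride stated above. It is for illustration only and is not part of this repo.

```python
# Hypothetical helper, not from this repo: time span covered by the k-th
# feature, for 16-frame clips at 24 fps with a stride of 4 frames.
def feature_index_to_span(k, fps=24.0, stride=4, clip_len=16):
    start = k * stride / fps        # clip start in seconds
    end = start + clip_len / fps    # each clip spans 16 / 24 ~= 0.667 s
    return start, end

print(feature_index_to_span(0))     # -> (0.0, 0.666...)
print(feature_index_to_span(10))    # -> (1.666..., 2.333...)
```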

Unpack Features and Annotations

This folder
│   README.md
│   ...  
│
└───data/
│    └───charades_sta_c3d/
│    │	 └───annotations
│    │	 └───c3d_features
│    └───charades_sta_i3d/
│    │	 └───annotations
│    │	 └───i3d_features
│    │    │   └───charades   # not used
│    │    │   └───kinetics
│    └───...
│
└───libs
│
│   ...

Training and Evaluation

Train SnAG on Charades-STA with C3D or I3D features, monitor training with TensorBoard, and evaluate the last checkpoints:

python ./train.py --opt video_centric/charades_sta_c3d.yaml --name charades_sta_c3d_reproduce
python ./train.py --opt video_centric/charades_sta_i3d.yaml --name charades_sta_i3d_reproduce
tensorboard --logdir=./experiments/charades_sta_c3d_reproduce/tensorboard
tensorboard --logdir=./experiments/charades_sta_i3d_reproduce/tensorboard
python ./eval.py --name charades_sta_c3d_reproduce --ckpt last
python ./eval.py --name charades_sta_i3d_reproduce --ckpt last

[Optional] Evaluating Our Pre-trained Model

We also provide pre-trained models for Charades-STA. The model using C3D features, along with all training logs, can be downloaded from this Google Drive link. The model using I3D features, along with all training logs, can be downloaded from this Google Drive link. To evaluate the pre-trained models, please follow the steps listed below.

This folder
│   README.md
│   ...  
│
└───experiments/
│    └───charades_sta_c3d_reproduce/
│    │	 └───eval_last.txt
│    │	 └───log.txt
│    │   └───...
│    └───charades_sta_i3d_reproduce/
│    │	 └───eval_last.txt
│    │	 └───log.txt
│    │   └───...  
│    └───...
│
└───libs
│
│   ...

Run evaluation on the downloaded checkpoints. The expected results are listed in the table below.

python ./eval.py --name charades_sta_c3d_reproduce --ckpt last
python ./eval.py --name charades_sta_i3d_reproduce --ckpt last

| Method     | R1@0.5 | R1@0.7 | R5@0.5 | R5@0.7 |
|------------|--------|--------|--------|--------|
| SnAG (C3D) | 51.75  | 33.33  | 90.83  | 65.56  |
| SnAG (I3D) | 65.19  | 46.32  | 93.04  | 73.12  |

To Reproduce Our Results on ActivityNet-Captions

Download Features and Annotations

Details: We use the official C3D features from here. The features are extracted using the C3D model pretrained on Sports1M, given clips of 16 frames and a stride of 8 frames. The frame rate is unknown. The feature dimension has been reduced from 4096 to 500 using PCA.
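The PCA step mentioned above can be mimicked as follows. This is a toy illustration with random data and scikit-learn, not the official preprocessing pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy illustration, not the official pipeline: reduce 4096-d C3D features
# to 500 dimensions with PCA, as described above.
feats = np.random.randn(1000, 4096).astype(np.float32)  # placeholder features
feats_500 = PCA(n_components=500).fit_transform(feats)
print(feats_500.shape)                                   # -> (1000, 500)
```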

Unpack Features and Annotations

This folder
│   README.md
│   ...  
│
└───data/
│    └───anet_1.3/
│    │	 └───annotations
│    │	 └───c3d_features   
│    └───...
│
└───libs
│
│   ...

Training and Evaluation

Train SnAG on ActivityNet-Captions, monitor training with TensorBoard, and evaluate the last checkpoint:

python ./train.py --opt video_centric/anet_1.3.yaml --name anet_1.3_reproduce
tensorboard --logdir=./experiments/anet_1.3_reproduce/tensorboard
python ./eval.py --name anet_1.3_reproduce --ckpt last

[Optional] Evaluating Our Pre-trained Model

We also provide a pre-trained model for ActivityNet-Captions. The model with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.

This folder
│   README.md
│   ...  
│
└───experiments/
│    └───anet_1.3_reproduce/
│    │	 └───eval_last.txt
│    │	 └───log.txt
│    │   └───...    
│    └───...
│
└───libs
│
│   ...

Run evaluation on the downloaded checkpoint. The expected results are listed in the table below.

python ./eval.py --name anet_1.3_reproduce --ckpt last

| Method | R1@0.5 | R1@0.7 | R5@0.5 | R5@0.7 |
|--------|--------|--------|--------|--------|
| SnAG   | 47.44  | 29.89  | 82.60  | 63.29  |

Backup Links

Contact

Fangzhou Mu (fmu2@wisc.edu)

Reference

If you are using our code, please consider citing our paper.

@inproceedings{mu2024snag,
  title={{SnAG}: Scalable and Accurate Video Grounding},
  author={Mu, Fangzhou and Mo, Sicheng and Li, Yin},
  booktitle={CVPR},
  year={2024}
}