SnAG: Scalable and Accurate Video Grounding (CVPR 2024)
Introduction
This code repo implements SnAG, a scalable and accurate model for long-form video grounding, i.e., localizing moments within an untrimmed long video based on text descriptions. SnAG features a minimalist, late-fusion design for scalable inference, while supporting video-centric sampling for scalable training. Without bells and whistles, SnAG achieves 44.86% R1@0.5 and 70.66% R5@0.5 on TACoS, outperforming the previous state of the art by 8.53 and 12.75 absolute percentage points, respectively. Further, SnAG demonstrates strong results on Ego4D-NLQ (13.57% mean R1 and 32.92% mean R5) and the more challenging MAD dataset (5.55% R1@0.5 and 13.75% R5@0.5). Our paper is accepted to CVPR 2024 and an arXiv version can be found at this link.
Related projects:
ActionFormer: Localizing Moments of Actions with Transformers <br> Chenlin Zhang, Jianxin Wu, Yin Li <br> ECCV 2022 <br> <br>
Visualization
We provide visualizations of localized moments in Ego4D-NLQ videos.
- Left: Current text query is highlighted.
- Right: Video is slowed down if an event is detected by our method.
- Green border: correct prediction.
- Red border: incorrect prediction.
Note that the ground-truth moments come from human annotations and may contain errors.
<img src="https://media.githubusercontent.com/media/fmu2/snag_release/main/viz/085f7a8b-e1e5-4e7b-a83d-5ea650edd9fe.gif" width="720"/> <img src="https://media.githubusercontent.com/media/fmu2/snag_release/main/viz/0aca0078-b6ab-41fb-9dc5-a70b8ad137b2.gif" width="720"/> <img src="https://media.githubusercontent.com/media/fmu2/snag_release/main/viz/0ca4506c-962d-4cf1-aa6d-f8222f53dee6.gif" width="720"/>Changelog
- 04/21/2024: Ego4D result visualization.
- 04/03/2024: Initial code release.
- 02/26/2024: The paper is accepted to CVPR 2024.
Code Overview
The structure of this code repo is heavily inspired by ActionFormer. Some of the main components are
- ./libs/core: Parameter configuration module.
- ./libs/data: Data loader and IO module.
- ./libs/modeling: Our main model with all its building blocks.
- ./libs/worker.py: Training and evaluation loops.
Installation
- Follow INSTALL.md for installing necessary dependencies and compiling the code.
To Reproduce Our Results on TACoS
Download Features and Annotations
- Download tacos.tar.gz (md5sum: a96537114a930038ab8ddb64a17df6e0) from this Google Drive link. The file includes C3D features in npy format and annotations in json format.
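To verify that the download is intact before unpacking, recompute the checksum (e.g., run md5sum tacos.tar.gz in a shell). The equivalent Python sketch below streams the archive through hashlib; the file name follows the archive above.

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file by streaming it in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Expected checksum listed above.
assert md5sum("tacos.tar.gz") == "a96537114a930038ab8ddb64a17df6e0", "corrupted download"
```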
Details: The features are extracted using the C3D model pretrained on Sports1M, given clips of 16 frames with a frame rate of ~30 fps and a stride of 4 frames. This gives one feature vector per 4/30 ~= 0.1333 seconds. In practice, SnAG uses 4x-subsampled C3D features (i.e., an effective stride of 16 frames) for fair comparison with baselines.
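To make the subsampling concrete, here is a minimal numpy sketch of 4x temporal subsampling. It assumes each file under ./data/tacos/c3d_features stores a (num_clips, feat_dim) array with one row per C3D clip; the file name is hypothetical, and the provided configs already take care of this step, so the snippet only illustrates the arithmetic.

```python
import numpy as np

# Assumption: one .npy per video with shape (num_clips, feat_dim),
# one row per C3D clip extracted at a stride of 4 frames (~0.133 s apart).
feats = np.load("./data/tacos/c3d_features/some_video.npy")  # hypothetical file name

# 4x temporal subsampling: keep every 4th clip, i.e., an effective stride
# of 16 frames (~0.533 s between consecutive feature vectors).
feats_sub = feats[::4]
print(feats.shape, "->", feats_sub.shape)
```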
Unpack Features and Annotations
- Unpack the file under ./data (or elsewhere and link to ./data).
- The folder structure should look like
This folder
│ README.md
│ ...
│
└───data/
│ └───tacos/
│ │ └───annotations
│ │ └───c3d_features
│ └───...
|
└───libs
│
│ ...
Training and Evaluation
- Train our SnAG with C3D features. This will create an experiment folder under ./experiments that stores training config, logs, and checkpoints.
python ./train.py --opt video_centric/tacos.yaml --name tacos_reproduce
- [Optional] Monitor the training using TensorBoard
tensorboard --logdir=./experiments/tacos_reproduce/tensorboard
- Evaluate the trained model. The expected R1@0.5 and R5@0.5 should be around 45.0(%) and 70.5(%).
python ./eval.py --name tacos_reproduce --ckpt last
- Training and inference on TACoS requires ~7.4 GB and ~1.2 GB of GPU memory, respectively. We recommend using a GPU with at least 8 GB of memory. Please set microbatch_size in the config file for gradient accumulation over micro-batches when training with less GPU memory (e.g., set microbatch_size to half of batch_size for GPUs with 6 GB of memory).
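For intuition, microbatch_size corresponds to the standard gradient accumulation pattern sketched below: each batch is split into micro-batches, per-micro-batch losses are scaled, and the optimizer steps once per full batch. This is a generic PyTorch illustration under simplified assumptions, not the repo's actual training loop.

```python
import torch

def train_step(model, optimizer, batch, microbatch_size):
    """One optimizer step on the full batch, computed over micro-batches so that
    at most `microbatch_size` samples are processed at a time."""
    optimizer.zero_grad()
    for i in range(0, len(batch), microbatch_size):
        micro = batch[i:i + microbatch_size]
        loss = model(micro).mean()                    # placeholder loss
        (loss * len(micro) / len(batch)).backward()   # scale so gradients match the full-batch mean
    optimizer.step()

# Toy usage: a batch of 8 samples processed as two micro-batches of 4.
model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
train_step(model, opt, torch.randn(8, 16), microbatch_size=4)
```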
[Optional] Evaluating Our Pre-trained Model
We also provide a pre-trained model for TACoS. The model with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.
- Unpack the file under ./experiments (or elsewhere and link to ./experiments).
- The folder structure should look like
This folder
│ README.md
│ ...
│
└───experiments/
│ └───tacos_reproduce/
│ │ └───eval_last.txt
│ │ └───log.txt
│ │ └───...
│ └───...
|
└───libs
│
│ ...
- The training config is recorded in ./experiments/tacos_reproduce/opt.yaml.
- The training log is located at ./experiments/tacos_reproduce/log.txt and also ./experiments/tacos_reproduce/tensorboard.
- The pre-trained model is ./experiments/tacos_reproduce/models/last.pth.
- Evaluate the pre-trained model.
python ./eval.py --name tacos_reproduce --ckpt last
- The results (Recall at tIoUs) should be
Method | R1@0.3 | R1@0.5 | R5@0.3 | R5@0.5 |
---|---|---|---|---|
SnAG | 55.51 | 45.14 | 81.58 | 70.31 |
- Training your own model will yield slightly different results due to randomness, yet the results should be close to what we report in the paper.
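For reference, the recall numbers above follow the standard definition: a query counts as recalled at, e.g., R5@0.5 if any of its top-5 predicted moments overlaps the ground-truth moment with temporal IoU of at least 0.5. A minimal reference implementation of the metric (not the repo's evaluation code) is sketched below.

```python
def temporal_iou(pred, gt):
    """IoU between two [start, end] moments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(preds_per_query, gts, k=5, tiou=0.5):
    """preds_per_query: one list of score-sorted [start, end] moments per query;
    gts: one ground-truth [start, end] moment per query."""
    hits = sum(
        any(temporal_iou(p, gt) >= tiou for p in preds[:k])
        for preds, gt in zip(preds_per_query, gts)
    )
    return 100.0 * hits / len(gts)

# Toy example: the first query is recalled at tIoU 0.5 with K=1, the second is not.
preds = [[[10.0, 20.0]], [[5.0, 6.0]]]
gts = [[12.0, 22.0], [30.0, 40.0]]
print(recall_at_k(preds, gts, k=1, tiou=0.5))  # 50.0
```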
To Reproduce Our Results on Ego4D-NLQ
Download Features and Annotations
- Download ego4d_slowfast_bert.tar.gz (md5sum: d57d03737493e4c7aae39dd3d3d5597b) from this Google Drive link. The file includes SlowFast and BERT features in npy format and annotations in json format.
- Download ego4d_egovlp.tar.gz (md5sum: 44e013aa5c4dcbc4d474fdba5c172804) from this Google Drive link. The file includes EgoVLP video and text features in npy format and annotations in json format.
Details: We use the official SlowFast features from here. They are extracted using the SlowFast model pretrained on Kinetics 400, given clips of 32 frames with a frame rate of 30 fps and a stride of 16 frames. This gives one feature vector per 16/30 ~= 0.533 seconds. The EgoVLP features are extracted using the EgoVLP model checkpoint, given clips of 32 frames with a frame rate of 30 fps and a stride of 8 frames. This gives one feature vector per 8/30 ~= 0.267 seconds. In practice, SnAG uses 2x-subsampled EgoVLP features (i.e., an effective stride of 16 frames) for fair comparison with baselines.
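To relate predictions on the feature grid back to seconds, only the effective frame stride and frame rate quoted above are needed. The sketch below shows this arithmetic using the Ego4D numbers; the repo handles the conversion internally, so this is purely illustrative.

```python
def feature_index_to_seconds(idx, stride_frames, fps, clip_frames=32, center=True):
    """Map a feature index to a timestamp in seconds.
    With center=True the timestamp is the clip center, otherwise the clip start."""
    start = idx * stride_frames / fps
    return start + (clip_frames / (2 * fps) if center else 0.0)

# Effective stride quoted above: 16 frames at 30 fps for both SlowFast features and
# 2x-subsampled EgoVLP features, i.e., one feature vector per ~0.533 s.
print(feature_index_to_seconds(0, 16, 30))    # ~0.53 s (center of the first clip)
print(feature_index_to_seconds(100, 16, 30))  # ~53.9 s
```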
Unpack Features and Annotations
- Unpack the file under ./data (or elsewhere and link to ./data).
- The folder structure should look like
This folder
│ README.md
│ ...
│
└───data/
│ └───ego4d_slowfast_bert/
│ │ └───annotations
│ │ └───slowfast_features
│ │ └───bert_features
│ └───ego4d_egovlp/
│ │ └───annotations
│ │ └───egovlp_features
│ └───...
|
└───libs
│
│ ...
Training and Evaluation
- Train our SnAG with SlowFast + BERT or EgoVLP features. This will create an experiment folder under ./experiments that stores training config, logs, and checkpoints.
python ./train.py --opt video_centric/ego4d_slowfast_bert.yaml --name ego4d_slowfast_bert_reproduce
python ./train.py --opt video_centric/ego4d_egovlp.yaml --name ego4d_egovlp_reproduce
- [Optional] Monitor the training using TensorBoard
tensorboard --logdir=./experiments/ego4d_slowfast_bert_reproduce/tensorboard
tensorboard --logdir=./experiments/ego4d_egovlp_reproduce/tensorboard
- Evaluate the trained model. The expected mean R1 and R5 should be around 8.3(%) and 23.6(%) with SlowFast + BERT features, and 13.5(%) and 33.0(%) with EgoVLP features.
python ./eval.py --name ego4d_slowfast_bert_reproduce --ckpt last
python ./eval.py --name ego4d_egovlp_reproduce --ckpt last
- Training and inference on Ego4D-NLQ requires ~3.0 GB and ~1.8 GB of GPU memory, respectively. We recommend using a GPU with at least 4 GB of memory.
[Optional] Evaluating Our Pre-trained Model
We also provide pre-trained models for Ego4D-NLQ. The model using SlowFast + BERT features with all training logs can be downloaded from this Google Drive link. The model using EgoVLP features with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.
- Unpack the file under ./experiments (or elsewhere and link to ./experiments).
- The folder structure should look like
This folder
│ README.md
│ ...
│
└───experiments/
│ └───ego4d_slowfast_bert_reproduce/
│ │ └───eval_last.txt
│ │ └───log.txt
│ │ └───...
│ └───ego4d_egovlp_reproduce/
│ │ └───eval_last.txt
│ │ └───log.txt
│ │ └───...
│ └───...
|
└───libs
│
│ ...
- The training config is recorded in ./experiments/ego4d_..._reproduce/opt.yaml.
- The training log is located at ./experiments/ego4d_..._reproduce/log.txt and also ./experiments/ego4d_..._reproduce/tensorboard.
- The pre-trained model is ./experiments/ego4d_..._reproduce/models/last.pth.
- Evaluate the pre-trained model.
python ./eval.py --name ego4d_slowfast_bert_reproduce --ckpt last
python ./eval.py --name ego4d_egovlp_reproduce --ckpt last
- The results (Recall at tIoUs) should be
Method | R1@0.3 | R1@0.5 | mean R1 | R5@0.3 | R5@0.5 | mean R5 |
---|---|---|---|---|---|---|
SnAG (SlowFast + BERT) | 9.75 | 6.40 | 8.08 | 28.10 | 19.47 | 23.79 |
SnAG (EgoVLP) | 15.53 | 10.94 | 13.24 | 38.40 | 27.70 | 33.10 |
- Training your own model will yield slightly different results due to randomness, yet the results should be close to what we report in the paper.
To Reproduce Our Results on MAD
Download Features and Annotations
- Download mad.tar.gz (md5sum: dd4fc6f8e2297eb10a1c82d405b03658). Note: the MAD dataset owner has requested us to remove the download link. Please obtain the features from the official MAD repository instead; we will upload a preprocessing script shortly. The file includes CLIP features in npy format and annotations in json format.
Details: We use the official CLIP features from here. The features are extracted using CLIP ViT-L/14 with a frame rate of 5 fps. This gives one feature vector every 0.2 seconds.
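Until the preprocessing script is released, converting the official MAD features into per-movie npy files is mostly a matter of splitting and saving arrays. The sketch below is assumption-heavy: it assumes the official release ships the 5-fps CLIP frame features in a single HDF5 file keyed by movie id, and that this repo expects one (num_frames, feat_dim) .npy per movie under ./data/mad/clip_features. File and key names are placeholders; check the official MAD repository for the actual format.

```python
import os
import h5py
import numpy as np

SRC = "CLIP_frames_features_5fps.h5"   # placeholder name for the official feature file
DST = "./data/mad/clip_features"

os.makedirs(DST, exist_ok=True)
with h5py.File(SRC, "r") as f:
    for movie_id in f.keys():
        feats = np.asarray(f[movie_id])                       # assumed shape: (num_frames, feat_dim)
        np.save(os.path.join(DST, f"{movie_id}.npy"), feats)  # one .npy per movie
```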
Unpack Features and Annotations
- Unpack the file under ./data (or elsewhere and link to ./data).
- The folder structure should look like
This folder
│ README.md
│ ...
│
└───data/
│ └───mad/
│ │ └───annotations
│ │ └───clip_features
│ └───...
|
└───libs
│
│ ...
Training and Evaluation
- Train our SnAG with CLIP features. This will create an experiment folder under ./experiments that stores training config, logs, and checkpoints.
python ./train.py --opt video_centric/mad.yaml --name mad_reproduce
- [Optional] Monitor the training using TensorBoard
tensorboard --logdir=./experiments/mad_reproduce/tensorboard
- Evaluate the trained model. The expected R1@0.5 and R5@0.5 should be around 5.5(%) and 13.5(%).
python ./eval.py --name mad_reproduce --ckpt last
- Training and inference on MAD requires ~20 GB and ~2.2 GB of GPU memory, respectively. We recommend using a GPU with at least 24 GB of memory. When running out of GPU memory, try increasing batch_size and reducing max_num_text while keeping their product unchanged (e.g., batch_size: 8, max_num_text: 4), and experiment with different microbatch_size values (e.g., microbatch_size: 2). This reduces the memory footprint at the cost of longer training time.
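When editing the config, a quick sanity check is to keep the batch_size x max_num_text product fixed while trading one off against the other, as suggested above. The helper below (with illustrative starting values, not the shipped defaults) captures that invariant.

```python
def rebalance(batch_size, max_num_text, new_batch_size):
    """Return a (batch_size, max_num_text) pair with the same product."""
    product = batch_size * max_num_text
    assert product % new_batch_size == 0, "product must remain divisible"
    return new_batch_size, product // new_batch_size

# Hypothetical starting values; (8, 4) matches the example settings above.
print(rebalance(4, 8, new_batch_size=8))  # (8, 4)
```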
[Optional] Evaluating Our Pre-trained Model
We also provide a pre-trained model for MAD. The model with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.
- Unpack the file under ./experiments (or elsewhere and link to ./experiments).
- The folder structure should look like
This folder
│ README.md
│ ...
│
└───experiments/
│ └───mad_reproduce/
│ │ └───eval_last.txt
│ │ └───log.txt
│ │ └───...
│ └───...
|
└───libs
│
│ ...
- The training config is recorded in ./experiments/mad_reproduce/opt.yaml.
- The training log is located at ./experiments/mad_reproduce/log.txt and also ./experiments/mad_reproduce/tensorboard.
- The pre-trained model is ./experiments/mad_reproduce/models/last.pth.
- Evaluate the pre-trained model.
python ./eval.py --name mad_reproduce --ckpt last
- The results (Recall at tIoUs) should be
Method | R1@0.1 | R1@0.3 | R1@0.5 | R5@0.1 | R5@0.3 | R5@0.5 |
---|---|---|---|---|---|---|
SnAG | 10.35 | 8.51 | 5.47 | 24.40 | 20.30 | 13.41 |
- Training your own model will yield slightly different results due to randomness, yet the results should be close to what we report in the paper.
To Reproduce Our Results on Charades-STA
Download Features and Annotations
- Download charades_sta_c3d.tar.gz (md5sum: 10300461e5f713dffcc038506c73aec7) from this Google Drive link. The file includes C3D features in npy format and annotations in json format.
- Download charades_sta_i3d.tar.gz (md5sum: 57ad93a548dc5428c284e3fc5852136d) from this Google Drive link. The file includes I3D features in npy format and annotations in json format.
Details: The C3D features are extracted using the C3D model pretrained on Sports1M, given clips of 16 frames with a frame rate of 24 fps and a stride of 4 frames. This gives one feature vector per 4/24 ~= 0.167 seconds. The I3D features are extracted using the I3D model pretrained on Kinetics 400, given clips of 16 frames with a frame rate of 24 fps and a stride of 4 frames. This also gives one feature vector per 4/24 ~= 0.167 seconds.
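Going the other way from the index-to-seconds conversion shown for Ego4D, annotated moments given in seconds map onto the Charades-STA feature grid with the same stride/fps arithmetic (4 frames at 24 fps, i.e., ~0.167 s per feature). A small sketch of that mapping, again purely illustrative:

```python
def seconds_to_feature_span(start_s, end_s, num_feats, stride_frames=4, fps=24):
    """Map a [start, end] moment in seconds to an inclusive index range on the
    feature grid, clipped to the number of available features."""
    sec_per_feat = stride_frames / fps            # ~0.167 s for Charades-STA C3D/I3D
    lo = max(0, int(start_s / sec_per_feat))
    hi = min(num_feats - 1, int(round(end_s / sec_per_feat)))
    return lo, hi

print(seconds_to_feature_span(2.5, 7.2, num_feats=180))  # (15, 43)
```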
Unpack Features and Annotations
- Unpack the file under ./data (or elsewhere and link to ./data).
- The folder structure should look like
This folder
│ README.md
│ ...
│
└───data/
│ └───charades_sta_c3d/
│ │ └───annotations
│ │ └───c3d_features
│ └───charades_sta_i3d/
│ │ └───annotations
│ │ └───i3d_features
│ │ | └───charades # not used
| | | └───kinetics
│ └───...
|
└───libs
│
│ ...
Training and Evaluation
- Train our SnAG with C3D or I3D features. This will create an experiment folder under ./experiments that stores training config, logs, and checkpoints.
python ./train.py --opt video_centric/charades_sta_c3d.yaml --name charades_sta_c3d_reproduce
python ./train.py --opt video_centric/charades_sta_i3d.yaml --name charades_sta_i3d_reproduce
- [Optional] Monitor the training using TensorBoard
tensorboard --logdir=./experiments/charades_sta_c3d_reproduce/tensorboard
tensorboard --logdir=./experiments/charades_sta_i3d_reproduce/tensorboard
- Evaluate the trained model. The expected R1@0.7 and R5@0.7 should be around 33.5(%) and 65.5(%) with C3D features, and 46.5(%) and 73.0(%) with I3D features.
python ./eval.py --name charades_sta_c3d_reproduce --ckpt last
python ./eval.py --name charades_sta_i3d_reproduce --ckpt last
- Training and inference on Charades-STA requires ~2.7 GB and ~1.2 GB of GPU memory, respectively. We recommend using a GPU with at least 4 GB of memory.
[Optional] Evaluating Our Pre-trained Model
We also provide pre-trained models for Charades-STA. The model using C3D features with all training logs can be downloaded from this Google Drive link. The model using I3D features with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.
- Unpack the file under ./experiments (or elsewhere and link to ./experiments).
- The folder structure should look like
This folder
│ README.md
│ ...
│
└───experiments/
│ └───charades_sta_c3d_reproduce/
│ │ └───eval_last.txt
│ │ └───log.txt
│ │ └───...
│ └───charades_sta_i3d_reproduce/
│ │ └───eval_last.txt
│ │ └───log.txt
│ │ └───...
│ └───...
|
└───libs
│
│ ...
- The training config is recorded in ./experiments/charades_sta_..._reproduce/opt.yaml.
- The training log is located at ./experiments/charades_sta_..._reproduce/log.txt and also ./experiments/charades_sta_..._reproduce/tensorboard.
- The pre-trained model is ./experiments/charades_sta_..._reproduce/models/last.pth.
- Evaluate the pre-trained model.
python ./eval.py --name charades_sta_c3d_reproduce --ckpt last
python ./eval.py --name charades_sta_i3d_reproduce --ckpt last
- The results (Recall at tIoUs) should be
Method | R1@0.5 | R1@0.7 | R5@0.5 | R5@0.7 |
---|---|---|---|---|
SnAG (C3D) | 51.75 | 33.33 | 90.83 | 65.56 |
SnAG (I3D) | 65.19 | 46.32 | 93.04 | 73.12 |
- Training your own model will yield slightly different results due to randomness, yet the results should be close to what we report in the paper. With our latest implementation, R5@0.7 improves by 1.5 points over the paper results with both C3D and I3D features, while R5@0.5 drops by 1.5 points with C3D features. We consider this a favorable tradeoff, as the predictions become more accurate.
To Reproduce Our Results on ActivityNet-Captions
Download Features and Annotations
- Download anet_1.3.tar.gz (md5sum: 4da7bbd46ebf43906cb44e696a4a1852) from this Google Drive link. The file includes C3D features in npy format and annotations in json format.
Details: We use the official C3D features from here. The features are extracted using the C3D model pretrained on Sports1M, given clips of 16 frames and a stride of 8 frames. The frame rate is unknown. The feature dimension has been reduced from 4096 to 500 using PCA.
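A quick way to confirm the features match what the config expects is to load one file and check its dimensionality. The sketch below assumes each .npy under ./data/anet_1.3/c3d_features stores a (num_clips, 500) array; the file name is a placeholder.

```python
import numpy as np

feats = np.load("./data/anet_1.3/c3d_features/v_xxxxxx.npy")  # placeholder file name
num_clips, feat_dim = feats.shape
assert feat_dim == 500, f"expected 500-dim PCA-reduced features, got {feat_dim}"
print(f"{num_clips} feature vectors of dimension {feat_dim}")
```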
Unpack Features and Annotations
- Unpack the file under ./data (or elsewhere and link to ./data).
- The folder structure should look like
This folder
│ README.md
│ ...
│
└───data/
│ └───anet_1.3/
│ │ └───annotations
│ │ └───c3d_features
│ └───...
|
└───libs
│
│ ...
Training and Evaluation
- Train our SnAG with C3D features. This will create an experiment folder under ./experiments that stores training config, logs, and checkpoints.
python ./train.py --opt video_centric/anet_1.3.yaml --name anet_1.3_reproduce
- [Optional] Monitor the training using TensorBoard
tensorboard --logdir=./experiments/anet_1.3_reproduce/tensorboard
- Evaluate the trained model. The expected R1@0.7 and R5@0.7 should be around 30.0(%) and 63.0(%).
python ./eval.py --name anet_1.3_reproduce --ckpt last
- Training and inference on ActivityNet-Captions requires ~2.0 GB and ~1.2 GB of GPU memory. We recommend using a GPU with at least 4 GB of memory.
[Optional] Evaluating Our Pre-trained Model
We also provide a pre-trained model for ActivityNet-Captions. The model with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.
- Unpack the file under ./experiments (or elsewhere and link to ./experiments).
- The folder structure should look like
This folder
│ README.md
│ ...
│
└───experiments/
│ └───anet_1.3_reproduce/
│ │ └───eval_last.txt
│ │ └───log.txt
│ │ └───...
│ └───...
|
└───libs
│
│ ...
- The training config is recorded in ./experiments/anet_1.3_reproduce/opt.yaml.
- The training log is located at ./experiments/anet_1.3_reproduce/log.txt and also ./experiments/anet_1.3_reproduce/tensorboard.
- The pre-trained model is ./experiments/anet_1.3_reproduce/models/last.pth.
- Evaluate the pre-trained model.
python ./eval.py --name anet_1.3_reproduce --ckpt last
- The results (Recall at tIoUs) should be
Method | R1@0.5 | R1@0.7 | R5@0.5 | R5@0.7 |
---|---|---|---|---|
SnAG | 47.44 | 29.89 | 82.60 | 63.29 |
- Training your own model will yield slightly different results due to randomness, yet the results should be close to what we report in the paper.
Backup Links
- 百度云 (Baidu Cloud; access code: 2024)
Contact
Fangzhou Mu (fmu2@wisc.edu)
Reference
If you are using our code, please consider citing our paper.
@inproceedings{mu2024snag,
title={{SnAG}: Scalable and Accurate Video Grounding},
author={Mu, Fangzhou and Mo, Sicheng and Li, Yin},
booktitle={CVPR},
year={2024}
}