# DyFADet: Dynamic Feature Aggregation for Temporal Action Detection (ECCV 2024)

This repository contains the implementation of the paper "DyFADet: Dynamic Feature Aggregation for Temporal Action Detection".
<div align=center><img width="900" height="280" src="https://github.com/yangle15/DyFADet-pytorch/blob/main/pics/fig1.png"/></div>

## Installation

- Please ensure that you have installed PyTorch and CUDA. (We use PyTorch 1.13.0 and CUDA 11.6 in our experiments.)
- After downloading the repo, install the required packages by running:
  ```shell
  pip install -r requirements.txt
  ```
- Install NMS:
  ```shell
  cd ./libs/utils
  python setup.py install --user
  cd ../..
  ```
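
For reference, a minimal environment setup matching the versions above might look like the following; the environment name, Python version, and conda channels here are illustrative choices rather than requirements of this repo.

```shell
# Illustrative setup for PyTorch 1.13.0 + CUDA 11.6 (adjust to your own system).
conda create -n dyfadet python=3.8 -y
conda activate dyfadet
conda install pytorch==1.13.0 torchvision==0.14.0 pytorch-cuda=11.6 -c pytorch -c nvidia -y

# Project dependencies and the NMS extension, as described above.
pip install -r requirements.txt
cd ./libs/utils && python setup.py install --user && cd ../..
```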
## Data Preparation

### HACS (SF features)

- For the SlowFast features of the HACS dataset, please refer to here for details about downloading and using the features.
- Unpack the SlowFast features into `/YOUR_DATA_PATH/`. The processed annotation JSON files for the SlowFast features are provided in this repo in the `./data/hacs/annotations` folder. A rough sketch of this step is shown below.
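
As a rough sketch of this step (the archive name below is a placeholder for whatever file you downloaded):

```shell
# Unpack the downloaded SlowFast features into your data directory.
mkdir -p /YOUR_DATA_PATH/
tar -xzf hacs_slowfast_features.tar.gz -C /YOUR_DATA_PATH/   # placeholder archive name

# The matching annotation JSON files already ship with this repo.
ls ./data/hacs/annotations
```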
### HACS (VideoMAEv2-g features)

- This repo provides pre-extracted HACS features obtained with VideoMAEv2; you can download the features and unpack them into `/YOUR_DATA_PATH/`.
- The annotation JSON files are the same ones in the `./data/hacs/annotations` folder.
### THUMOS14 (I3D features)

- Following the procedure from the ActionFormer repo, download the features (`thumos.tar.gz`) from this Box link, this Google Drive link, or this BaiduYun link. The file includes I3D features, action annotations in JSON format, and external classification scores.
- Unpack the features and annotations into `/YOUR_DATA_PATH/` and `/YOUR_ANNOTATION_PATH/`, respectively (see the sketch after this list).
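
For example, assuming `thumos.tar.gz` sits in the current directory (the sub-folder names below are assumptions; inspect the archive to confirm its actual layout):

```shell
# Check the archive layout, then unpack it.
tar -tzf thumos.tar.gz | head
tar -xzf thumos.tar.gz

# Move features and annotations to the paths your config will point at
# (assumed sub-folder names; adapt to what the archive actually contains).
mkdir -p /YOUR_DATA_PATH/ /YOUR_ANNOTATION_PATH/
mv thumos/i3d_features/* /YOUR_DATA_PATH/
mv thumos/annotations/*  /YOUR_ANNOTATION_PATH/
```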
### THUMOS14 (VideoMAEv2-g features)

- You can extract the features with the pre-trained VideoMAEv2 model as described here. The experiments in our paper use the features extracted by VideoMAEv2-g. We would like to express our GREAT gratitude to Shuming Liu for his help extracting the features!!!
- For the SlowFast features of the HACS dataset, please refer to here for more details about downloading and using the features. Instructions for the VideoMAEv2 features of the HACS dataset can be found here.
### ActivityNet 1.3

- Following the procedure from the ActionFormer repo, download `anet_1.3.tar.gz` from this Box link, this Google Drive link, or this BaiduYun link. The file includes TSP features, action annotations in JSON format (similar to the ActivityNet annotation format), and external classification scores.
- Unpack the features and annotations into `/YOUR_DATA_PATH/` and `/YOUR_ANNOTATION_PATH/`, respectively.
- The external classification scores used in our experiments are in `./data/hacs/annotations/`.
### FineAction

- The pre-extracted features using VideoMAEv2-g can be downloaded here. Please refer to the original VideoMAEv2 repository for more details.
- Unpack the features and annotations into `/YOUR_DATA_PATH/` and `/YOUR_ANNOTATION_PATH/`, respectively.
- The external classification scores used in our experiments are in `./data/hacs/annotations/`.
## Training

You can train your own model with the provided config files. The training command is:

```shell
CUDA_VISIBLE_DEVICES=0 python train.py ./configs/CONFIG_FILE --output OUTPUT_PATH
```

Select the config file corresponding to your dataset. In the chosen config file, change the `json_file` variable to the path of your annotation file and the `feat_folder` variable to the path of the downloaded features.

All models can be trained on a single NVIDIA RTX 4090 GPU (24 GB).
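
To see where these two fields live before launching training, a quick check like the following can help; `thumos_i3d.yaml` is just one of the provided config files (also listed in the results table below), and the paths in the comments are placeholders.

```shell
# Locate the dataset paths that need editing in your chosen config.
grep -nE "json_file|feat_folder" ./configs/thumos_i3d.yaml

# Edit those lines so that, for example:
#   json_file   -> /YOUR_ANNOTATION_PATH/<your_annotation>.json
#   feat_folder -> /YOUR_DATA_PATH/
# then launch training with the command above.
```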
## Evaluation

After training, you can test the obtained model with the following command:

```shell
CUDA_VISIBLE_DEVICES=0 python eval.py ./configs/CONFIG_FILE PATH_TO_CHECKPOINT
```

The mean average precision (mAP) results with the pre-trained models (BaiduYun Link) are:
| Dataset | 0.3 / 0.5 / 0.1 | 0.7 / 0.95 / 0.5 | Avg | Config |
|---|---|---|---|---|
| THUMOS14-I3D | 84.0 | 47.9 | 69.2 | thumos_i3d.yaml |
| THUMOS14-VM2-g | 84.3 | 50.2 | 70.5 | thumos_mae.yaml |
| ActivityNet-TSP | 58.1 | 8.4 | 38.5 | anet_tsp.yaml |
| HACS-SF | 57.8 | 11.8 | 39.2 | hacs_slowfast.yaml |
| HACS-VM2-g | 64.0 | 14.1 | 44.3 | hacs_mae.yaml |
| FineAction-VM2-g | 37.1 | 5.9 | 23.8 | fineaction.yaml |
| EPIC-KITCHEN-n | 28.0 | 20.8 | 25.0 | epic_slowfast_noun.yaml |
| EPIC-KITCHEN-v | 26.8 | 18.5 | 23.4 | epic_slowfast_verb.yaml |
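
As a concrete example, evaluating a THUMOS14 I3D model could look like this; the checkpoint path is a placeholder for wherever you saved the downloaded or self-trained weights.

```shell
# Evaluate a checkpoint with its matching config (placeholder checkpoint path).
CUDA_VISIBLE_DEVICES=0 python eval.py ./configs/thumos_i3d.yaml ./ckpt/thumos_i3d/
```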
## Citation

If you find this work useful or use our code in your own research, please use the following BibTeX:

```
@inproceedings{yang2024dyfadet,
  title={DyFADet: Dynamic Feature Aggregation for Temporal Action Detection},
  author={Yang, Le and Zheng, Ziwei and Han, Yizeng and Cheng, Hao and Song, Shiji and Huang, Gao and Li, Fan},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
```
## Contact

If you have any questions, please feel free to contact the authors:

- Ziwei Zheng: ziwei.zheng@stu.xjtu.edu.cn
- Le Yang: yangle15@xjtu.edu.cn
## Acknowledgments

Our code is built upon the codebases of ActionFormer, TriDet, Detectron2, and many other great repos; we would like to express our gratitude for their outstanding work.