

TadTR: End-to-end Temporal Action Detection with Transformer


By Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Shiwei Zhang, Song Bai, Xiang Bai.

This repo holds the code for TadTR, described in the paper End-to-end temporal action detection with Transformer published in IEEE Transactions on Image Processing (TIP) 2022.


We have also explored fully end-to-end training from RGB images with TadTR. See our CVPR 2022 work E2E-TAD.


TadTR is an end-to-end Temporal Action Detection TRansformer. It has the following advantages over previous methods:


[2023.2.19] Fix a bug in loss calculation (issue #21). Thanks to @zachpvin for raising this issue!

[2022.8.7] Add support for training/testing on THUMOS14!

[2022.7.4] Glad to share that this paper will appear in IEEE Transactions on Image Processing (TIP). Although I am still busy with my thesis, I will try to make the code accessible soon. Thanks for your patience.

[2022.6] Update the technical report of this work on arxiv (now v3).

[2022.3] Our new work E2E-TAD based on TadTR is accepted to CVPR 2022. It supports fully end-to-end training from RGB images.

[2021.9.15] Update the performance on THUMOS14.

[2021.9.1] Add demo code.

[2021.7] Our revised paper was submitted to IEEE Transactions on Image Processing.

[2021.6] Our revised paper was uploaded to arxiv.

[2021.1.21] Our paper was submitted to IJCAI 2021.


Main Results

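| Method | Feature | mAP@0.5 | mAP@0.75 | mAP@0.95 | Avg. mAP |
|--------|---------|---------|----------|----------|----------|
| TadTR | I3D RGB | 47.14 | 32.11 | 10.94 | 32.09 |

| Method | Feature | mAP@0.3 | mAP@0.4 | mAP@0.5 | mAP@0.6 | mAP@0.7 | Avg. mAP |
|--------|---------|---------|---------|---------|---------|---------|----------|
| TadTR | I3D 2stream | 74.8 | 69.1 | 60.1 | 46.6 | 32.8 | 56.7 |

| Method | Feature | mAP@0.5 | mAP@0.75 | mAP@0.95 | Avg. mAP |
|--------|---------|---------|----------|----------|----------|
| TadTR | TSN 2stream | 51.29 | 34.99 | 9.49 | 34.64 |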
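For readers new to the metric: in temporal action detection, mAP@t usually means mean average precision when a predicted segment counts as correct only if its temporal IoU (tIoU) with a ground-truth segment is at least t. The sketch below is an illustration, not code from this repo; the function names `temporal_iou` and `average_map` are hypothetical. It also checks that the Avg. mAP of the I3D 2stream row matches the mean of its five per-threshold values.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds, with start <= end


def temporal_iou(a: Segment, b: Segment) -> float:
    """Temporal IoU of two 1D segments (hypothetical helper, not the repo's code)."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def average_map(per_threshold_map: Sequence[float]) -> float:
    """Arithmetic mean of the per-threshold mAP values."""
    return sum(per_threshold_map) / len(per_threshold_map)


# Values copied from the I3D 2stream row (mAP@0.3 to mAP@0.7).
i3d_2stream_row = [74.8, 69.1, 60.1, 46.6, 32.8]
print(round(average_map(i3d_2stream_row), 1))
```

For the rows that list only mAP@0.5, mAP@0.75, and mAP@0.95, the average cannot be recomputed from the three displayed values, since it is presumably taken over a denser set of thresholds.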



Compiling CUDA extensions (Optional)

The RoIAlign operator is implemented as a CUDA extension. If your machine has an NVIDIA GPU with CUDA support, you can run this step. Otherwise, please set disable_cuda=True in opts.py.

cd model/ops

# If multiple CUDA Toolkit versions are installed, prefix the command below with
# CUDA_HOME=<your_cuda_toolkit_path> to select the correct one.
python setup.py build_ext --inplace
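
If you are unsure whether a CUDA Toolkit is installed, a small check like the one below may help before you build. It is a sketch under assumptions: the helper `find_nvcc` is hypothetical and not part of this repo, it assumes a POSIX-style layout with the compiler at `$CUDA_HOME/bin/nvcc`, and finding `nvcc` does not by itself prove that a usable GPU is present.

```python
import os
import shutil
from typing import Mapping, Optional


def find_nvcc(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a path to nvcc if one can be located, else None.

    Hypothetical helper: looks in $CUDA_HOME/bin first, then on PATH.
    """
    env = os.environ if env is None else env
    cuda_home = env.get("CUDA_HOME")
    if cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        if os.path.isfile(candidate):
            return candidate
    # Fall back to a PATH lookup (uses the process PATH if env has none).
    return shutil.which("nvcc", path=env.get("PATH"))


if __name__ == "__main__":
    nvcc = find_nvcc()
    if nvcc is None:
        print("nvcc not found: consider setting disable_cuda=True in opts.py")
    else:
        print(f"nvcc found at {nvcc}")
```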

Run a quick test

python demo.py

1. Data Preparation

Currently we only support THUMOS14.


Download all data from [BaiduDrive(code: adTR)] or [OneDrive].

After the download finishes, extract the archived feature files in place by running cd data; tar -xf I3D_2stream_Pth.tar. Then put the features, the annotations, and the model under the data/thumos14 directory. We expect the following structure in the root folder.

- data
  - thumos14
    - I3D_2stream_Pth
      - xxxxx
      - xxxxx
    - th14_annotations_with_fps_duration.json
    - th14_i3d2s_ft_info.json
    - thumos14_tadtr_reference.pth
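
To confirm that the layout matches the structure listed above, a quick check such as the following can be run from the root folder. This is a minimal sketch; `missing_entries` is a hypothetical helper, not part of the codebase, and it only tests that the four expected entries exist, not that their contents are valid.

```python
from pathlib import Path
from typing import List

# Entries expected directly under data/thumos14, as listed in the layout.
EXPECTED_ENTRIES = [
    "I3D_2stream_Pth",
    "th14_annotations_with_fps_duration.json",
    "th14_i3d2s_ft_info.json",
    "thumos14_tadtr_reference.pth",
]


def missing_entries(root: str = "data/thumos14") -> List[str]:
    """Return the expected entries that are absent under `root`."""
    base = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (base / name).exists()]


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing under data/thumos14:", ", ".join(missing))
    else:
        print("Data layout looks complete.")
```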

2. Testing Pre-trained Models


python main.py --cfg CFG_PATH --eval --resume CKPT_PATH

CFG_PATH is the path to the YAML-format config file that defines the experimental setting, for example configs/thumos14_i3d2s_tadtr.yml. CKPT_PATH is the path to the pre-trained model. Alternatively, you can run the shell script bash scripts/test_reference_models.sh thumos14 for simplicity.

3. Training by Yourself

Run the following command

python main.py --cfg CFG_PATH

This codebase supports running on both CPU and GPU.

During training, the code automatically runs testing every N epochs (N is the test_interval in opts.py). Training takes 6 to 10 minutes on THUMOS14 with a modern GPU (e.g. TITAN Xp). You can also monitor the training process with TensorBoard (set cfg.tensorboard=True in opts.py). The TensorBoard record and the checkpoint are saved to output_dir (which can be changed in the config file).
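
As an illustration of the "test every N epochs" schedule, the sketch below lists the epochs at which testing would run. It is not the repo's training loop: the helper names are hypothetical, and it assumes 0-indexed epochs with testing after every N-th completed epoch, which may differ from how main.py counts.

```python
from typing import List


def is_evaluation_epoch(epoch: int, test_interval: int) -> bool:
    """True if testing runs after this 0-indexed epoch (assumed convention)."""
    if test_interval <= 0:
        raise ValueError("test_interval must be positive")
    return (epoch + 1) % test_interval == 0


def evaluation_epochs(num_epochs: int, test_interval: int) -> List[int]:
    """All 0-indexed epochs after which testing would run."""
    return [e for e in range(num_epochs) if is_evaluation_epoch(e, test_interval)]


# With 10 epochs and an interval of 3, testing follows epochs 2, 5, and 8.
print(evaluation_epochs(10, 3))
```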

After training is done, you can also test your trained model by running

python main.py --cfg CFG_PATH --eval

It will automatically use the best model checkpoint. If you want to manually specify the model checkpoint, run

python main.py --cfg CFG_PATH --eval --resume CKPT_PATH
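
If you evaluate several configs or checkpoints from a script, the two command forms above can be assembled programmatically. The following is a sketch; `build_eval_command` is a hypothetical helper, and it uses only the flags shown in this README (--cfg, --eval, --resume).

```python
from typing import List, Optional


def build_eval_command(cfg_path: str, ckpt_path: Optional[str] = None) -> List[str]:
    """Build the argv list for evaluation.

    Without ckpt_path, the command relies on the best checkpoint being
    picked automatically; with it, --resume selects the checkpoint explicitly.
    """
    cmd = ["python", "main.py", "--cfg", cfg_path, "--eval"]
    if ckpt_path is not None:
        cmd += ["--resume", ckpt_path]
    return cmd


if __name__ == "__main__":
    cmd = build_eval_command(
        "configs/thumos14_i3d2s_tadtr.yml",
        "data/thumos14/thumos14_tadtr_reference.pth",
    )
    print(" ".join(cmd))
    # To actually launch it from the repo root:
    # import subprocess; subprocess.run(cmd, check=True)
```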

Note that the performance of a model you train yourself may differ from that of the reference model, even though all seeds are fixed. The reason is that TadTR uses the grid_sample operator, whose gradient computation involves the non-deterministic atomicAdd operation. Please refer to ref1 ref2 ref3 (Chinese) for details.
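
The underlying cause can be shown without a GPU: floating-point addition is not associative, so accumulating the same values in a different order can change the result in the last bits. Atomic adds from parallel threads do not guarantee an accumulation order, which is why repeated runs can drift apart. The snippet below is a plain-Python illustration of that effect, not code from this repo; `accumulate` is a hypothetical helper.

```python
from typing import Iterable


def accumulate(values: Iterable[float]) -> float:
    """Sum values strictly left to right, like one thread adding at a time."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same three numbers, accumulated in two different orders.
forward = accumulate([0.1, 0.2, 0.3])
backward = accumulate([0.3, 0.2, 0.1])

print(forward == backward)  # False: the totals differ in the last bits
print(forward, backward)
```

Differences of this size in each gradient step are tiny, but they can compound over training and lead to slightly different final models.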


The code is based on DETR and Deformable DETR. We also borrow the RoIAlign1D implementation from G-TAD. Thanks to the authors for their great work.


  title={End-to-end Temporal Action Detection with Transformer},
  author={Liu, Xiaolong and Wang, Qimeng and Hu, Yao and Tang, Xu and Zhang, Shiwei and Bai, Song and Bai, Xiang},
  journal={IEEE Transactions on Image Processing (TIP)},


For questions and suggestions, please contact Xiaolong Liu by email ("liuxl at hust dot edu dot cn").