3D-RetinaNet for ROAD and UCF-24 datasets

This repository contains the code for 3D-RetinaNet, a novel single-stage action detection network proposed along with the ROAD dataset. Our TPAMI paper contains a detailed description of 3D-RetinaNet and the ROAD dataset. This code supports training and evaluation on the ROAD and UCF-24 datasets.

Table of Contents

- Requirements
- Training 3D-RetinaNet
- Testing and Building Tubes
- Performance
- Download pre-trained weights
- Citation

Requirements

We need three things to get started with training: the datasets, Kinetics pre-trained weights, and PyTorch with torchvision and tensorboardX.

Dataset download and pre-processing
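
The commands in this README assume that the extracted dataset and the pre-trained weights sit side by side under a common base directory. A rough sketch of the assumed layout (the paths are only an example; adjust them to your machine):

# assumed layout; /home/user/ is used as the base directory throughout this README
mkdir -p /home/user/road /home/user/kinetics-pt
# extract the ROAD dataset under /home/user/road/ (for UCF-24, presumably /home/user/ucf24/)
# and place the Kinetics pre-trained weights in /home/user/kinetics-pt/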

PyTorch and weights
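
A minimal environment sketch, assuming a recent CUDA-enabled PyTorch build (pick the exact torch/torchvision versions that match your CUDA setup):

pip install torch torchvision tensorboardX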

Training 3D-RetinaNet

You will need 4 GPUs (each with at least 10GB VRAM) to run training.

Let's assume that you extracted the dataset in /home/user/road/ and the weights in /home/user/kinetics-pt/. Then your training command from the root directory of this repo is going to be:

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py /home/user/ /home/user/  /home/user/kinetics-pt/ --MODE=train --ARCH=resnet50 --MODEL_TYPE=I3D --DATASET=road --TRAIN_SUBSETS=train_3 --SEQ_LEN=8 --TEST_SEQ_LEN=8 --BATCH_SIZE=4 --LR=0.0041

The second instance of /home/user/ in the above command specifies where checkpoints and logs are going to be stored. In this case, checkpoints and logs will be in /home/user/road/cache/<experiment-name>/.
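
Since logging goes through tensorboardX, training can be monitored with TensorBoard. A hedged example, assuming the event files end up under the cache directory above:

tensorboard --logdir /home/user/road/cache/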

Different parameters in main.py will result in different performance. For ROAD, the validation split is selected automatically based on the training split number; e.g., --TRAIN_SUBSETS=train_3 is validated on split #3.

You can train on the ucf24 dataset by changing a few command-line parameters, as the training schedule and learning rate differ from ROAD training.

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py /home/user/ /home/user/  /home/user/kinetics-pt/ --MODE=train --ARCH=resnet50 --MODEL_TYPE=I3D --DATASET=ucf24 --TRAIN_SUBSETS=train --VAL_SUBSETS=val --SEQ_LEN=8 --TEST_SEQ_LEN=8 --BATCH_SIZE=4 --LR=0.00245 --MILESTONES=6,8 --MAX_EPOCHS=10

Testing and Building Tubes

To generate and evaluate tubes, you first need frame-level detections, which are then linked over time. This is pretty simple in our case. Similar to the training command, you can run the following commands. These can run on a single GPU.

There are various MODEs in main.py, and you can run each step independently or together. At the moment, gen_dets mode generates and evaluates frame-wise detections and finally performs tube building and evaluation.

For the ROAD dataset, run the following command:

python main.py /home/user/ /home/user/  /home/user/kinetics-pt/ --MODE=gen_dets --MODEL_TYPE=I3D --TEST_SEQ_LEN=8 --TRAIN_SUBSETS=train_3 --SEQ_LEN=8 --BATCH_SIZE=4 --LR=0.0041 

and for UCF-24:

python main.py /home/user/ /home/user/  /home/user/kinetics-pt/ --MODE=gen_dets --ARCH=resnet50 --MODEL_TYPE=I3D --DATASET=ucf24 --TRAIN_SUBSETS=train --VAL_SUBSETS=val --SEQ_LEN=8 --TEST_SEQ_LEN=8 --BATCH_SIZE=4 --LR=0.00245 --EVAL_EPOCHS=10 --GEN_NMS=80 --TOPK=20 --PATHS_IOUTH=0.25 --TRIM_METHOD=indiv

Performance

Here you will find results reproduced from our paper. We used training split #3 for the reproduction, on different machines than the ones used to generate the results in the paper. Below are the test results on validation split #3, which is closer to the test set than the other splits in terms of environmental conditions. The learning rate differs slightly here, so the results are a little different from the paper. Also, there are six tasks in the ROAD dataset, which makes it difficult to balance learning among the tasks.

The model is I3D with a ResNet50 backbone. Kinetics pre-trained weights are used for ResNet50-I3D; the download link is given above in the <a href="#requirements">Requirements</a> section. Results on split #3 with a test-sequence length of 8 are reported as <frame-AP@0.5>/<video-mAP@0.2>.

<table style="width:100%">
  <tr> <td>Model</td> <td>I3D</td> </tr>
  <tr> <td align="left">Agentness</td> <td>54.7/--</td> </tr>
  <tr> <td align="left">Agent</td> <td>31.1/26.0</td> </tr>
  <tr> <td align="left">Action</td> <td>22.0/16.1</td> </tr>
  <tr> <td align="left">Location</td> <td>27.3/24.2</td> </tr>
  <tr> <td align="left">Duplexes</td> <td>23.7/19.5</td> </tr>
  <tr> <td align="left">Events/triplets</td> <td>13.9/15.5</td> </tr>
  <tr> <td align="left">AV-action</td> <td>44.8/--</td> </tr>
  <tr> <td align="left">UCF24 results</td> <td></td> </tr>
  <tr> <td align="left">Actionness</td> <td>--</td> </tr>
  <tr> <td align="left">Action detection</td> <td>--</td> </tr>
  <tr> <td align="left">ActionNess-framewise</td> <td>--</td> </tr>
</table>

Download pre-trained weights

Citation

If this work has been helpful in your research, please cite the following articles:

@ARTICLE{singh2022road,
  author = {Singh, Gurkirt and Akrigg, Stephen and Di Maio, Manuele and Fontana, Valentina and Alitappeh, Reza Javanmard and Saha, Suman and Jeddisaravi, Kossar and Yousefi, Farzad and Culley, Jacob and Nicholson, Tom and others},
  journal = {IEEE Transactions on Pattern Analysis & Machine Intelligence},
  title = {ROAD: The ROad event Awareness Dataset for autonomous Driving},
  year = {5555},
  volume = {},
  number = {01},
  issn = {1939-3539},
  pages = {1-1},
  keywords = {roads;autonomous vehicles;task analysis;videos;benchmark testing;decision making;vehicle dynamics},
  doi = {10.1109/TPAMI.2022.3150906},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
  month = {feb}
}


@inproceedings{singh2017online,
  title={Online real-time multiple spatiotemporal action localisation and prediction},
  author={Singh, Gurkirt and Saha, Suman and Sapienza, Michael and Torr, Philip HS and Cuzzolin, Fabio},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision},
  pages={3637--3646},
  year={2017}
}

@article{maddern20171,
  title={1 year, 1000 km: The Oxford RobotCar dataset},
  author={Maddern, Will and Pascoe, Geoffrey and Linegar, Chris and Newman, Paul},
  journal={The International Journal of Robotics Research},
  volume={36},
  number={1},
  pages={3--15},
  year={2017},
  publisher={SAGE Publications Sage UK: London, England}
}