Spatial Temporal Transformer Network

Introduction

This repository contains the implementation of the model presented in the following papers:

Spatial Temporal Transformer Network for Skeleton-Based Action Recognition, Chiara Plizzari, Marco Cannici, Matteo Matteucci, arXiv

Spatial Temporal Transformer Network for Skeleton-Based Action Recognition, Chiara Plizzari, Marco Cannici, Matteo Matteucci, Pattern Recognition. ICPR International Workshops and Challenges, 2021, Proceedings

Skeleton-based action recognition via spatial and temporal transformer networks, Chiara Plizzari, Marco Cannici, Matteo Matteucci, Computer Vision and Image Understanding, Volumes 208-209, 2021, 103219, ISSN 1077-3142, CVIU


Visualizations of Spatial Transformer logits

The heatmaps are 25 x 25 matrices, where each row and each column represents a body joint. An element in position (i, j) represents the correlation between joint i and joint j, resulting from self-attention.
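As a rough illustration (not the repository's actual code), the sketch below shows how a 25 x 25 map of this kind arises from self-attention over the joint dimension; the tensor names and sizes are assumptions made for the example.

<pre><code>import torch
import torch.nn.functional as F

num_joints, d_model = 25, 64          # 25 body joints, assumed embedding size

# Hypothetical per-joint embeddings for a single frame: (joints, channels)
x = torch.randn(num_joints, d_model)

# Linear projections to queries and keys (random stand-ins for learned weights)
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
q, k = x @ W_q, x @ W_k

# Scaled dot-product logits: entry (i, j) scores the correlation of joint i with joint j
logits = (q @ k.t()) / d_model ** 0.5        # shape (25, 25)
attention_map = F.softmax(logits, dim=-1)    # rows sum to 1; this is what the heatmaps visualize
</code></pre>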


Prerequisites

Run mode

<pre><code>python3 main.py</code></pre>

Training: set the training option in <code>/config/st_gcn/nturgbd/train.yaml</code>.

Testing: set the testing option in the same file, <code>/config/st_gcn/nturgbd/train.yaml</code> (a hedged example of the relevant entry is sketched below).
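For illustration only (the key name below is an assumption, not copied from the repository; the authoritative options are documented inside <code>train.yaml</code> itself), the relevant entry might look like:

<pre><code># hypothetical excerpt of /config/st_gcn/nturgbd/train.yaml
training: True    # switch to False to run in testing mode
</code></pre>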

Data generation

We performed our experiments on three datasets: NTU-RGB+D 60, NTU-RGB+D 120 and Kinetics.

NTU-RGB+D

The data can be downloaded from the NTU-RGB+D website. You only need the 3D skeleton data (5.8 GB for NTU-60 plus 4.5 GB for NTU-120). Once downloaded, generate the joint data for NTU-60 with:

<pre><code>python3 ntu_gendata.py</code></pre>

If you want to generate the data and preprocess it in a single step, run directly:

<pre><code>python3 preprocess.py</code></pre>

To generate bone data, run:

<pre><code>python3 ntu_gen_bones.py</code></pre>
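Conceptually, bone features are the vectors between connected joints (second order information). The sketch below is not the repository's actual script; the array layout and the truncated edge list are assumptions used only to show the idea.

<pre><code>import numpy as np

# Assumed layout of the generated joint data: (N, C, T, V, M)
# = (samples, xyz coordinates, frames, 25 joints, bodies)
joints = np.random.randn(4, 3, 300, 25, 2).astype(np.float32)

# Hypothetical (child, parent) pairs along the NTU skeleton; the full edge list lives in the repo
edges = [(1, 0), (2, 20), (3, 2), (4, 20)]

bones = np.zeros_like(joints)
for child, parent in edges:
    # A bone is the coordinate difference between a joint and the joint it is attached to
    bones[:, :, :, child, :] = joints[:, :, :, child, :] - joints[:, :, :, parent, :]
</code></pre>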

Joint and bone information can then be merged with:

<pre><code>python3 ntu_merge_joint_bones.py</code></pre>
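The merge simply stacks the two modalities along the channel axis, so each joint carries both its coordinates and its bone vector. A minimal sketch with assumed shapes (not the repository's actual I/O code):

<pre><code>import numpy as np

# Assumed shapes (N, C, T, V, M)
joints = np.random.randn(4, 3, 300, 25, 2).astype(np.float32)
bones = np.random.randn(4, 3, 300, 25, 2).astype(np.float32)

# Concatenate along the channel dimension: 3 joint channels + 3 bone channels = 6
joint_bone = np.concatenate([joints, bones], axis=1)   # shape (4, 6, 300, 25, 2)
</code></pre>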

For NTU-120, the samples are split between training and testing differently, so you need to run:

<pre><code>python3 ntu120_gendata.py</code></pre>

If you want to generate and preprocess the NTU-120 data in a single step, use:

<pre><code>python3 preprocess_120.py</code></pre>

Kinetics

Kinetics is a dataset for video action recognition that provides raw video data only. The corresponding skeletons, extracted with OpenPose, are available for download from Google Drive (7.5 GB). From the raw skeletons, generate the dataset by running:

<pre><code>python3 kinetics_gendata.py</code></pre>

Spatial Transformer Stream

The Spatial Transformer implementation is in <code>ST-TR/code/st_gcn/net/spatial_transformer.py</code>. To run the spatial transformer stream (S-TR stream), enable it in <code>/config/st_gcn/nturgbd/train.yaml</code> (a hedged example of the relevant entries is sketched below).
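As a hedged illustration only (the flag names are assumptions; the authoritative ones are documented in the config file itself), enabling only spatial self-attention might look like:

<pre><code># hypothetical excerpt of /config/st_gcn/nturgbd/train.yaml for the S-TR stream
attention: True        # spatial self-attention on
tcn_attention: False   # temporal self-attention off
</code></pre>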

Temporal Transformer Stream

The Temporal Transformer implementation is in <code>ST-TR/code/st_gcn/net/temporal_transformer.py</code>. To run the temporal transformer stream (T-TR stream), enable it in <code>/config/st_gcn/nturgbd/train.yaml</code> (again, a hedged example follows).
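Again purely as an illustration with assumed flag names, the temporal-only counterpart would be:

<pre><code># hypothetical excerpt of /config/st_gcn/nturgbd/train.yaml for the T-TR stream
attention: False       # spatial self-attention off
tcn_attention: True    # temporal self-attention on
</code></pre>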

To merge S-TR and T-TR (ST-TR)

The scores produced by the S-TR stream and the T-TR stream are combined into the final ST-TR score by running:

<pre><code>python3 ensemble.py</code></pre>
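Conceptually, this is a score-level fusion: the per-class scores produced by the two streams are summed before taking the argmax. A minimal sketch of the idea, with assumed score arrays rather than the repository's actual pickled outputs:

<pre><code>import numpy as np

num_samples, num_classes = 8, 60   # e.g. NTU-60; sizes assumed for the example

# Hypothetical per-class scores produced by each stream
scores_s_tr = np.random.randn(num_samples, num_classes)
scores_t_tr = np.random.randn(num_samples, num_classes)

# Score-level fusion: add the two streams, then pick the top class per sample
scores_st_tr = scores_s_tr + scores_t_tr
predictions = scores_st_tr.argmax(axis=1)
</code></pre>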

Adaptive Configuration (AGCN)

To run the T-TR-agcn and ST-TR-agcn configurations, set <code>agcn: True</code> in the configuration file.

Different ST-TR configurations

The different ST-TR configurations are selected in <code>/config/st_gcn/nturgbd/train.yaml</code>. The block dimensions of the windowed version of the Temporal Transformer are set in the same file (a hedged, illustrative example follows).
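Purely for illustration (these key names and values are guesses, not taken from the repository), the window sizes used by the windowed Temporal Transformer blocks might be exposed as:

<pre><code># hypothetical excerpt of /config/st_gcn/nturgbd/train.yaml
window_size_block1: 10   # illustrative values only
window_size_block2: 30
window_size_block3: 75
</code></pre>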

Second order information

To train with second order information (bones), set the corresponding options in <code>/config/st_gcn/nturgbd/train.yaml</code> (a hedged example follows).
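As a hedged example (the keys are assumptions), training on second order information typically means pointing the data paths at the merged joint+bone files and doubling the input channels from 3 to 6:

<pre><code># hypothetical excerpt of /config/st_gcn/nturgbd/train.yaml
channels: 6        # 3 joint coordinates + 3 bone coordinates per joint
# data_path / label_path entries pointing to the merged joint+bone files generated above
</code></pre>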

Pre-trained Models

Pre-trained models for the configurations presented in the paper are provided in the <code>checkpoint_ST-TR</code> folder. Please note that the *bones*.pth checkpoints correspond to models trained with joint+bone information, while the others are trained with joints only.
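A minimal sketch of restoring one of these checkpoints, assuming they are standard PyTorch state dicts; the path and model class are placeholders, not names taken from the repository:

<pre><code>import torch

# Hypothetical checkpoint path; substitute the file you need from checkpoint_ST-TR
checkpoint_path = "checkpoint_ST-TR/some_model_bones.pth"

state = torch.load(checkpoint_path, map_location="cpu")
# Some checkpoints wrap the weights under a key such as "state_dict";
# fall back to the object itself otherwise.
state_dict = state.get("state_dict", state) if isinstance(state, dict) else state

# model = SpatialTemporalTransformer(...)   # instantiate the network from st_gcn/net
# model.load_state_dict(state_dict)
# model.eval()
</code></pre>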

Citation

Please cite one of the following papers if you use this code in your research:

<pre><code>@article{plizzari2021skeleton,
  title={Skeleton-based action recognition via spatial and temporal transformer networks},
  author={Plizzari, Chiara and Cannici, Marco and Matteucci, Matteo},
  journal={Computer Vision and Image Understanding},
  volume={208},
  pages={103219},
  year={2021},
  publisher={Elsevier}
}
</code></pre>

<pre><code>@inproceedings{plizzari2021spatial,
  title={Spatial temporal transformer network for skeleton-based action recognition},
  author={Plizzari, Chiara and Cannici, Marco and Matteucci, Matteo},
  booktitle={Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10--15, 2021, Proceedings, Part III},
  pages={694--701},
  year={2021},
  organization={Springer}
}
</code></pre>

Contact :pushpin:

If you have any questions, do not hesitate to contact me at <code>chiara.plizzari@mail.polimi.it</code>. I will be glad to clarify your doubts!

<sub> Note: we include LICENSE, LICENSE_1 and LICENSE_2 in this repository since part of the code has been derived respectively from https://github.com/yysijie/st-gcn, https://github.com/leaderj1001/Attention-Augmented-Conv2d and https://github.com/kenziyuliu/Unofficial-DGNN-PyTorch/blob/master/README.md </sub>