Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition

<a href='https://ieeexplore.ieee.org/document/10342472'> <img src='https://img.shields.io/badge/Paper-IROS23-blue?style=flat&logo=ieee' alt='IEEE PDF'> </a> <a href='https://arxiv.org/abs/2307.07469'> <img src='https://img.shields.io/badge/Paper-arXiv-green?style=flat&logo=arxiv' alt='arXiv PDF'> </a> <a href='https://necolizer.github.io/ISTA-Net/'> <img src='https://img.shields.io/badge/Project-Page-orange?style=flat&logo=googlechrome&logoColor=orange' alt='project page'> </a> <a href='https://github.com/Necolizer/ISTA-Net/blob/main/LICENSE'> <img src='https://img.shields.io/badge/License-MIT-yellow?style=flat' alt='license'> </a>

This repository is the official implementation of Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition (IROS 2023).

0. Table of Contents

1. Change Log
2. Prerequisites
3. Prepare the Datasets
4. Run the Code
5. Acknowledgement
6. Citation

1. Change Log

2. Prerequisites

To clone the main branch only (for code) and exclude the gh-pages branch (for project website), use the following git command:

git clone -b main --single-branch https://github.com/Necolizer/ISTA-Net.git
cd ISTA-Net
pip install -r requirements.txt
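
After installation, you can optionally check that PyTorch (which the training code depends on and which requirements.txt is expected to install) is importable and can see your GPU. This is only a sanity check, not part of the official setup:

# optional sanity check: confirm PyTorch is installed and CUDA is visible
import torch
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())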

3. Prepare the Datasets

3.1 NTU RGB+D 120 / NTU Mutual

Please refer to CTR-GCN and follow the instructions in its Data Preparation section to prepare the NTU RGB+D 120 dataset.

For your convenience, here is an excerpt of those instructions:

Download

  1. Request dataset here: https://rose1.ntu.edu.sg/dataset/actionRecognition
  2. Download the skeleton-only datasets:
    1. nturgbd_skeletons_s001_to_s017.zip (NTU RGB+D 60)
    2. nturgbd_skeletons_s018_to_s032.zip (NTU RGB+D 120)
    3. Extract the above files to ./data/nturgbd_raw

Directory Structure

Put the downloaded data into the following directory structure:

- data/
  - ntu/
  - ntu120/
  - nturgbd_raw/
    - nturgb+d_skeletons/     # from `nturgbd_skeletons_s001_to_s017.zip`
      ...
    - nturgb+d_skeletons120/  # from `nturgbd_skeletons_s018_to_s032.zip`
      ...

Generating Data

Generate the NTU RGB+D 120 dataset:

cd ./data/ntu120
# Get skeleton of each performer
python get_raw_skes_data.py
# Remove the bad skeleton 
python get_raw_denoised_data.py
# Transform the skeleton to the center of the first frame
python seq_transformation.py
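
To verify the generated data, you can load the resulting arrays with NumPy. The file name and key names below follow the CTR-GCN preprocessing convention and are assumptions; adjust them to whatever seq_transformation.py actually wrote on your machine:

# hedged sanity check of the generated NTU RGB+D 120 cross-subject split
import numpy as np
data = np.load('./NTU120_CSub.npz')                   # file name assumed from the CTR-GCN pipeline
print(data.files)                                     # expected keys like 'x_train', 'y_train', 'x_test', 'y_test'
print(data['x_train'].shape, data['y_train'].shape)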

3.2 SBU-Kinect-Interaction

Download

Download the dataset directly in your browser using the links in the SBU Readme, or use the provided script ./data/sbu/download_sbu.py:

cd ./data/sbu
python download_sbu.py --version clean --savedir ./SBU-Kinect-Interaction/Clean
python download_sbu.py --version noisy --savedir ./SBU-Kinect-Interaction/Noisy

Go to the save directory and unzip all the downloaded zip files with unzip '*.zip'.

Directory Structure

path/to/your/SBU-Kinect-Interaction
├── Clean
│   ├── s01s02
│   │   ├── 01
│   │   │   └── 001
│   │   │       ├── depth_000055.png
│   │   │       ├── ...
│   │   │       ├── rgb_000055.png
│   │   │       ├── ...
│   │   │       └── skeleton_pos.txt
│   │   ├── 02
│   │   ├── ...
│   │   └── 08
│   ├── s01s03
│   ├── ...
│   └── s07s03
└── Noisy
    ├── ...

Generating Data

cd ./data/sbu
python getSBU.py --rootdir ./SBU-Kinect-Interaction/Clean --savedir ./SBU-Kinect-Interaction-Skeleton/Clean
python getSBU.py --rootdir ./SBU-Kinect-Interaction/Noisy --savedir ./SBU-Kinect-Interaction-Skeleton/Noisy
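
Each skeleton_pos.txt stores one frame per line: a frame index followed by the 3D coordinates of 15 joints for each of the two persons (90 comma-separated values). If you want to inspect a raw sequence without getSBU.py, a minimal parsing sketch is shown below; the (persons, joints, xyz) arrangement is just one convenient layout, not necessarily the one used by the provided script:

# minimal sketch: read one SBU skeleton_pos.txt into a (frames, 2, 15, 3) array
import numpy as np

def read_sbu_skeleton(path):
    frames = []
    with open(path) as f:
        for line in f:
            values = [float(v) for v in line.strip().rstrip(',').split(',') if v]
            frames.append(np.array(values[1:]).reshape(2, 15, 3))  # drop the frame index
    return np.stack(frames)

# example (hypothetical path):
# skel = read_sbu_skeleton('./SBU-Kinect-Interaction/Clean/s01s02/01/001/skeleton_pos.txt')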

3.3 H2O

Download

  1. Request the dataset here: https://h2odataset.ethz.ch/. You can get the username and password from the download page.
  2. Download the dataset directly from the download page, or use download_script.py from the h2odataset repo (included in this repo as ./data/h2o/download_script.py):
    cd ./data/h2o
    python download_script.py --username "username" --password "password" --mode pose --dest "dest folder path"
    
    Select the pose mode to download only the pose data (hand, object, egocentric view) without the RGB-D images.
  3. Extract the downloaded files.

Directory Structure

path/to/your/extracted/files
├── label_split
├── subject1
│   ├── h1
│   │   ├── 0
│   │   │   └── cam4
│   │   │       ├── cam_pose
│   │   │       ├── hand_pose
│   │   │       ├── hand_pose_MANO
│   │   │       ├── obj_pose
│   │   │       ├── obj_pose_RT
│   │   │       ├── action_label
│   │   │       └── verb_label
│   │   ├── 1
│   │   ├── 2
│   │   ├── 3
│   │   └── ...
│   ├── h2
│   ├── k1
│   ├── k2
│   ├── o1
│   └── o2
├── subject2
├── subject3
├── subject4
└── object

Generating Data

Generate H2O pth files using ./data/h2o/generate_h2o.py.

cd ./data/h2o
python generate_h2o.py --root path/to/your/extracted/files --dest ./h2o_pth --frames 120
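
The --frames argument fixes the temporal length of every saved sequence. generate_h2o.py handles this internally; the sketch below only illustrates the general idea of padding or uniformly sampling a variable-length pose sequence to a fixed number of frames, and is not the actual logic of the script:

# illustrative only: resize a (T, ...) pose sequence to a fixed temporal length
import numpy as np

def resize_sequence(seq, num_frames=120):
    t = seq.shape[0]
    if t >= num_frames:                                    # uniformly sample num_frames indices
        idx = np.linspace(0, t - 1, num_frames).round().astype(int)
        return seq[idx]
    pad = np.repeat(seq[-1:], num_frames - t, axis=0)      # otherwise repeat the last frame
    return np.concatenate([seq, pad], axis=0)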

3.4 Assembly101

Download

  1. Submit an access request with your Google account on Google Drive. Download poses_60fps directly, or use the scripts in assembly101-download-scripts.
  2. Download test_challenge.csv from GoogleDrive/fine-grained-annotations.
  3. Download the 3 CSV files from the asb101 repo.

Directory Structure

path/to/your/download/root
├── fine-grained-annotations
│   ├── test_challenge.csv  (@30fps)   [downloaded from Google Drive]
│   ├── actions.csv                    [downloaded from the asb101 repo]
│   ├── train.csv           (@60fps)   [downloaded from the asb101 repo]
│   └── validation.csv      (@60fps)   [downloaded from the asb101 repo]
└── poses_60fps
    ├── nusar-2021_action_both_9011-a01_9011_user_id_2021-02-01_153724.json
    ├── nusar-2021_action_both_9011-b06b_9011_user_id_2021-02-01_154253.json
    ├── ...

Generating Data

cd ./data/asb

# Train & Validation Set
# Step 1:
python ./Preprocess/1_generate_pose_data.py --rootdir path/to/your/download/root/poses_60fps --csvdir path/to/your/download/root/fine-grained-annotations --savedir ./RAW_contex25_thresh0
# Step 2:
# Action (mandatory)
python ./Preprocess/2_get_final_dataset.py --data_path ./RAW_contex25_thresh0 --type action
# Verb (optional)
python ./Preprocess/2_get_final_dataset.py --data_path ./RAW_contex25_thresh0 --type verb
# Object (optional)
python ./Preprocess/2_get_final_dataset.py --data_path ./RAW_contex25_thresh0 --type noun

# Test Set
# Step 1:
python ./PreprocessTest/1_generate_pose_data.py --rootdir path/to/your/download/root/poses_60fps --csvdir path/to/your/download/root/fine-grained-annotations --savedir ./RAW_contex25_thresh0
# Step 2:
# Action (mandatory)
python ./PreprocessTest/2_get_final_dataset.py --data_path ./RAW_contex25_thresh0 --type action
# Verb (optional)
python ./PreprocessTest/2_get_final_dataset.py --data_path ./RAW_contex25_thresh0 --type verb
# Object (optional)
python ./PreprocessTest/2_get_final_dataset.py --data_path ./RAW_contex25_thresh0 --type noun

The test set has fewer valid samples than the provided test_challenge.csv. The 1018 invalid test samples (about 5%) have no pose data and therefore cannot be predicted. This may lead to lower accuracy on the CodaLab challenge page. More information can be found in the discussion in assembly101 Issue#4.

4. Run the Code

4.1 NTU Mutual

The cross-subject (X-Sub) and cross-set (X-Set) criteria are employed, using only the joint modality to ensure fair comparisons without fusing multiple modalities.

X-Sub

python main.py --config config/ntu/ntu26_xsub_joint.yaml

X-Set

python main.py --config config/ntu/ntu26_xset_joint.yaml

4.2 SBU-Kinect-Interaction

The 5-fold cross-validation protocol suggested in SBU is adopted. To get the accuracy of each fold, set the arg fold to 0, 1, 2, 3 or 4 in sbu_noisy_joint.yaml and sbu_clean_joint.yaml. Run each command 5 times with a different fold and average the test results.

Noisy

python main.py --config config/sbu/sbu_noisy_joint.yaml

Clean

python main.py --config config/sbu/sbu_clean_joint.yaml
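
If you prefer to script the five runs instead of editing the YAML by hand, one possible approach is to rewrite the fold value and invoke main.py in a loop. This sketch assumes PyYAML is available and that fold is a top-level key in the config; check the actual config layout before using it:

# hypothetical helper: run all 5 SBU folds for one config (main.py logs the accuracy of each run)
import subprocess
import yaml

base_config = 'config/sbu/sbu_clean_joint.yaml'
for fold in range(5):
    cfg = yaml.safe_load(open(base_config))
    cfg['fold'] = fold                                   # assumption: 'fold' is a top-level key
    tmp_config = f'/tmp/sbu_clean_fold{fold}.yaml'
    yaml.safe_dump(cfg, open(tmp_config, 'w'))
    subprocess.run(['python', 'main.py', '--config', tmp_config], check=True)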

4.3 H2O

Train & Validate

python main.py --config config/h2o/h2o.yaml

Generate JSON File for Test Submission

python main.py --config config/h2o/h2o_get_test_results.yaml --weights path/to/your/checkpoint

Submit the zipped JSON file action_labels.json to the CodaLab challenge H2O - Action to get the test results.
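
If you want to create the zip archive programmatically, Python's built-in zipfile module is enough (the output name is your choice; only action_labels.json needs to be inside):

# compress action_labels.json for submission using the standard library
import zipfile
with zipfile.ZipFile('action_labels.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.write('action_labels.json')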

4.4 Assembly101

Train & Validate

# Action (mandatory): 1380 classes
python main.py --config config/asb/asb_action.yaml
# Verb (optional): 24 classes
python main.py --config config/asb/asb_verb.yaml
# Object (optional): 90 classes
python main.py --config config/asb/asb_noun.yaml

Generate JSON File for Test Submission

# Action (mandatory): 1380 classes
python main.py --config config/asb/asb_action_get_test_results.yaml --weights path/to/your/action/checkpoint
# Verb (optional): 24 classes
python main.py --config config/asb/asb_verb_get_test_results.yaml --weights path/to/your/verb/checkpoint
# Object (optional): 90 classes
python main.py --config config/asb/asb_noun_get_test_results.yaml --weights path/to/your/noun/checkpoint

Submit the zipped JSON file preds.json to the CodaLab challenge Assembly101 3D Action Recognition to get the test results.

You can generate a fused JSON file for action+verb+object with the following script, but you must specify the path args inside the script first:

# You should specify the paths in asb_fuse_json_files.py FIRST
python tools/asb_fuse_json_files.py
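
Conceptually, the fusion amounts to merging the three per-task prediction files on their shared sample ids. The sketch below is only an illustration under the assumption that each file maps a sample id to its prediction; the exact key layout expected by the challenge is defined in tools/asb_fuse_json_files.py, which you should use instead:

# conceptual sketch only: merge per-task prediction files keyed by sample id (paths are placeholders)
import json

action = json.load(open('path/to/action/preds.json'))
verb = json.load(open('path/to/verb/preds.json'))
noun = json.load(open('path/to/noun/preds.json'))

fused = {k: {'action': action[k], 'verb': verb[k], 'noun': noun[k]} for k in action}
json.dump(fused, open('preds.json', 'w'))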

ATTENTION: preds.json is about 673 MB before compression for action only, and about 727 MB before compression for action+verb+object.

4.5 Dataset Sample Visualizations

We provide scripts in tools/dataset_viz to visualize dataset samples (PNGs or GIFs) for the above 4 datasets. Specify the args in those scripts and start visualizing general interactive actions!
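
If you only want a quick look at a single frame without the provided scripts, a minimal matplotlib sketch for two-person joint data is shown below. It assumes an array of shape (persons, joints, 3) and draws only the joints, not the bone connections that the scripts in tools/dataset_viz render:

# minimal sketch: scatter-plot the joints of one frame for two interacting persons
import numpy as np
import matplotlib.pyplot as plt

frame = np.random.rand(2, 15, 3)          # placeholder; substitute a real (persons, joints, 3) frame
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
for person in frame:
    ax.scatter(person[:, 0], person[:, 1], person[:, 2])
plt.show()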

5. Acknowledgement

We are grateful to the collaborators and maintainers of the STTFormer, CTR-GCN, MS-G3D, h2odataset and Assembly101 repositories. Thanks to the authors for their great work.

6. Citation

If you find this work or code helpful in your research, please consider citing:

@INPROCEEDINGS{wen2023interactive,
  author={Wen, Yuhang and Tang, Zixuan and Pang, Yunsheng and Ding, Beichen and Liu, Mengyuan},
  booktitle={2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, 
  title={Interactive Spatiotemporal Token Attention Network for Skeleton-Based General Interactive Action Recognition}, 
  year={2023},
  volume={},
  number={},
  pages={7886-7892},
  doi={10.1109/IROS55552.2023.10342472}}