# Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living
This is the official code for the CVPR 2024 paper titled "Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living".
## Installation
First, create a conda environment and activate it:
```
conda create -n pivit python=3.7 -y
source activate pivit
```
Then, install the following packages:

- torch & torchvision:
  ```
  pip install torch===1.8.1+cu111 torchvision===0.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
  ```
- fvcore:
  ```
  pip install 'git+https://github.com/facebookresearch/fvcore'
  ```
- PyAV:
  ```
  conda install av -c conda-forge
  ```
- misc:
  ```
  pip install simplejson einops timm psutil scikit-learn opencv-python tensorboard
  ```
Lastly, build the codebase by running:
```
git clone https://github.com/dominickrei/pi-vit
cd pi-vit
python setup.py build develop
```
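To confirm the environment is set up correctly, a quick sanity check like the following can help (a minimal sketch; the expected versions follow the install commands above):

```python
# Optional sanity check: confirm torch/torchvision versions and CUDA availability
import torch
import torchvision

print(torch.__version__)          # expected: 1.8.1+cu111
print(torchvision.__version__)    # expected: 0.9.1+cu111
print(torch.cuda.is_available())  # True if the CUDA 11.1 build can see a GPU
```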
## Data preparation
We make use of the following action recognition datasets for evaluation: Toyota Smarthome, NTU RGB+D, and NTU RGB+D 120. Download the datasets from their respective sources and structure their directories in the following formats.
### Smarthome
```
├── Smarthome
    ├── mp4
        ├── Cook.Cleandishes_p02_r00_v02_c03.mp4
        ├── Cook.Cleandishes_p02_r00_v14_c03.mp4
        ├── ...
    ├── skeletonv12
        ├── Cook.Cleandishes_p02_r00_v02_c03_pose3d.json
        ├── Cook.Cleandishes_p02_r00_v14_c03_pose3d.json
        ├── ...
```
### NTU RGB+D
```
├── NTU
    ├── rgb
        ├── S001C001P001R001A001_rgb.avi
        ├── S001C001P001R001A002_rgb.avi
        ├── ...
    ├── skeletons
        ├── S001C001P001R001A001.skeleton.npy
        ├── S001C001P001R001A002.skeleton.npy
        ├── ...
```
- By default, the NTU skeletons are in MATLAB format. We convert them to NumPy format using the code provided at https://github.com/shahroudy/NTURGB-D/tree/master/Python
### Preparing CSVs
After downloading and preparing the datasets, prepare the CSVs for the training, testing, and validation splits as `train.csv`, `test.csv`, and `val.csv`. The format of each CSV is:
```
path_to_video_1,path_to_video_1_skeleton,label_1
path_to_video_2,path_to_video_2_skeleton,label_2
...
path_to_video_N,path_to_video_N_skeleton,label_N
```
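For illustration, a script along these lines can generate the CSVs. This is a minimal sketch in which the dataset paths and the activity-to-label mapping are placeholders, not files provided by this repository:

```python
import csv
import glob
import os

# Placeholder locations; point these at your prepared Smarthome directories
VIDEO_DIR = '/path/to/Smarthome/mp4'
SKELETON_DIR = '/path/to/Smarthome/skeletonv12'
# Hypothetical activity-to-label mapping; build this from the full class list
label_map = {'Cook.Cleandishes': 0}

with open('train.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for video_path in sorted(glob.glob(os.path.join(VIDEO_DIR, '*.mp4'))):
        name = os.path.splitext(os.path.basename(video_path))[0]
        # Smarthome skeletons follow the <video_name>_pose3d.json convention
        skeleton_path = os.path.join(SKELETON_DIR, name + '_pose3d.json')
        activity = name.split('_')[0]  # e.g. 'Cook.Cleandishes'
        writer.writerow([video_path, skeleton_path, label_map[activity]])
```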
## Usage
We provide configs for training $\pi$-ViT on Smarthome and NTU in `configs/`. Please update the paths in the configs to match those on your machine before use.
### Training
Download the necessary pretrained models (Kinetics-400 for Smarthome and SSv2 for NTU) from this link and update `TRAIN.CHECKPOINT_FILE_PATH` to point to the downloaded model.
For example, to train $\pi$-ViT on Smarthome using 8 GPUs, run the following command:

```
python tools/run_net.py --cfg configs/Smarthome/PIViT_Smarthome.yaml NUM_GPUS 8
```
### Testing
| Model | Dataset | mCA (%) | Top-1 (%) | Downloads |
|---|---|---|---|---|
| $\pi$-ViT | Smarthome CS | 72.9 | - | HuggingFace |
| $\pi$-ViT | Smarthome CV2 | 64.8 | - | HuggingFace |
| $\pi$-ViT | NTU-120 CS | - | 91.9 | HuggingFace |
| $\pi$-ViT | NTU-120 CSetup | - | 92.9 | HuggingFace |
| $\pi$-ViT | NTU-60 CS | - | 94.0 | HuggingFace |
| $\pi$-ViT | NTU-60 CV | - | 97.9 | HuggingFace |
After downloading a pretrained model, evaluate it using the command:
```
python tools/run_net.py --cfg configs/Smarthome/PIViT_Smarthome.yaml NUM_GPUS 8 TEST.CHECKPOINT_FILE_PATH /path/to/downloaded/model TRAIN.ENABLE False
```
## Setting up skeleton features for $\pi$-ViT
During training, the 3D-SIM module in $\pi$-ViT requires features extracted from a pre-trained skeleton action recognition model. This means that every video in the training set must have a corresponding feature vector. The features should be stored in the directory indicated by the config option `EXPERIMENTAL.HYPERFORMER_FEATURES_PATH`.
$\pi$-ViT expects a directory containing a single HDF5 file for each video in the training dataset. For example, the directory structure for Smarthome should look like this:
```
├── /path/to/hyperformer_features
    ├── Cook.Cleandishes_p02_r00_v02_c03.h5
    ├── Cook.Cleandishes_p02_r00_v14_c03.h5
    ├── ...
```
Here, `Cook.Cleandishes_p02_r00_v02_c03.h5` is an HDF5 file containing a single dataset named `data` with a shape of 400x216. We provide a minimal example to demonstrate saving a feature vector in the format $\pi$-ViT expects:
```python
import h5py
import numpy as np

# (num_frames x feature_dim) features from the pre-trained skeleton model
skeleton_features = np.random.rand(400, 216)
with h5py.File('random_tensor.h5', 'w') as f:
    f.create_dataset('data', data=skeleton_features)
```
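To check that a saved file matches the expected layout, it can be read back (following the example above):

```python
import h5py

# Read back the feature file and confirm the dataset name and shape
with h5py.File('random_tensor.h5', 'r') as f:
    feats = f['data'][:]
print(feats.shape)  # expected: (400, 216)
```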
Due to the large size of the skeleton feature datasets, we do not upload them here. Instead, we provide the Hyperformer models pre-trained on Toyota Smarthome in `hyperformer_models/`. NTU-trained models, and details for running the Hyperformer model, are available here.
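Before launching training, it can also be worth verifying that every video listed in the training CSV has a matching feature file; the sketch below assumes the placeholder paths used earlier:

```python
import csv
import os

# EXPERIMENTAL.HYPERFORMER_FEATURES_PATH in your config (placeholder here)
FEATURES_DIR = '/path/to/hyperformer_features'

# Each video in train.csv should have a <video_name>.h5 feature file
with open('train.csv') as f:
    for video_path, _skeleton_path, _label in csv.reader(f):
        name = os.path.splitext(os.path.basename(video_path))[0]
        if not os.path.isfile(os.path.join(FEATURES_DIR, name + '.h5')):
            print(f'Missing features for {name}')
```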
## Citation & Acknowledgement
```
@inproceedings{reilly2024pivit,
  title={Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living},
  author={Dominick Reilly and Srijan Das},
  booktitle={Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}
```
Our primary contributions can be found in:
This repository is built on top of TimeSformer.