
Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living.

[Paper] [Pretrained models]


This is the official code for the CVPR 2024 paper titled "Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living"


Installation

First, create a conda environment and activate it:

conda create -n pivit python=3.7 -y
source activate pivit

Then, install the following packages:

Lastly, build the codebase by running:

git clone https://github.com/dominickrei/pi-vit
cd pi-vit
python setup.py build develop

Data preparation

We make use of the following action recognition datasets for evaluation: Toyota Smarthome, NTU RGB+D, and NTU RGB+D 120. Download the datasets from their respective sources and structure their directories in the following formats.

Smarthome

├── Smarthome
    ├── mp4
        ├── Cook.Cleandishes_p02_r00_v02_c03.mp4
        ├── Cook.Cleandishes_p02_r00_v14_c03.mp4
        ├── ...
    ├── skeletonv12
        ├── Cook.Cleandishes_p02_r00_v02_c03_pose3d.json
        ├── Cook.Cleandishes_p02_r00_v14_c03_pose3d.json
        ├── ...

NTU RGB+D

├── NTU
    ├── rgb
        ├── S001C001P001R001A001_rgb.avi
        ├── S001C001P001R001A002_rgb.avi
        ├── ...
    ├── skeletons
        ├── S001C001P001R001A001.skeleton.npy
        ├── S001C001P001R001A002.skeleton.npy
        ├── ...

Preparing CSVs

After downloading and preparing the datasets, prepare the CSVs for training, testing, and validation splits as train.csv, test.csv, and val.csv. The format of each CSV is:

path_to_video_1,path_to_video_1_skeleton,label_1
path_to_video_2,path_to_video_2_skeleton,label_2
...
path_to_video_N,path_to_video_N_skeleton,label_N
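
As a minimal sketch of producing one of these CSVs for Smarthome (assuming the directory layout above; the paths and the label lookup below are illustrative, not part of the codebase):

import csv
import glob
import os

# Illustrative paths and label lookup; adjust to your setup
video_dir = '/path/to/Smarthome/mp4'
skeleton_dir = '/path/to/Smarthome/skeletonv12'
label_map = {'Cook.Cleandishes': 0}  # hypothetical activity-name-to-label mapping

rows = []
for video_path in sorted(glob.glob(os.path.join(video_dir, '*.mp4'))):
    clip = os.path.splitext(os.path.basename(video_path))[0]
    skeleton_path = os.path.join(skeleton_dir, clip + '_pose3d.json')
    activity = clip.split('_')[0]  # e.g. 'Cook.Cleandishes' from 'Cook.Cleandishes_p02_r00_v02_c03'
    rows.append([video_path, skeleton_path, label_map[activity]])

with open('train.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)  # one "video,skeleton,label" line per clip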

Usage

We provide configs for training $\pi$-ViT on Smarthome and NTU in configs/. Please update the paths in the configs to match the paths on your machine before use.

Training

Download the necessary pretrained models (Kinetics-400 for Smarthome and SSv2 for NTU) from this link and update TRAIN.CHECKPOINT_FILE_PATH to point to the downloaded model.

For example, to train $\pi$-ViT on Smarthome using 8 GPUs, run the following command:

python tools/run_net.py --cfg configs/Smarthome/PIViT_Smarthome.yaml NUM_GPUS 8

Testing

| Model | Dataset | mCA | Top-1 | Downloads |
| --- | --- | --- | --- | --- |
| $\pi$-ViT | Smarthome CS | 72.9 | - | HuggingFace |
| $\pi$-ViT | Smarthome CV2 | 64.8 | - | HuggingFace |
| $\pi$-ViT | NTU-120 CS | - | 91.9 | HuggingFace |
| $\pi$-ViT | NTU-120 CSetup | - | 92.9 | HuggingFace |
| $\pi$-ViT | NTU-60 CS | - | 94.0 | HuggingFace |
| $\pi$-ViT | NTU-60 CV | - | 97.9 | HuggingFace |

After downloading a pretrained model, evaluate it using the command:

python tools/run_net.py --cfg configs/Smarthome/PIViT_Smarthome.yaml NUM_GPUS 8 TEST.CHECKPOINT_FILE_PATH /path/to/downloaded/model TRAIN.ENABLE False

Setting up skeleton features for $\pi$-ViT

During training, the 3D-SIM module in $\pi$-ViT requires features extracted from a pre-trained skeleton action recognition model. This means that every video in the training set must have a corresponding feature vector. The features should be stored in the directory indicated by the config option EXPERIMENTAL.HYPERFORMER_FEATURES_PATH.

$\pi$-ViT expects a directory containing a single HDF5 file for each video in the training dataset. For example, the directory structure for Smarthome should look like this:

├── /path/to/hyperformer_features
        ├── Cook.Cleandishes_p02_r00_v02_c03.h5
        ├── Cook.Cleandishes_p02_r00_v14_c03.h5
        ├── ...

Here, Cook.Cleandishes_p02_r00_v02_c03.h5 is an HDF5 file containing a single dataset named data with shape 400x216. We provide a minimal example demonstrating how to save a feature vector in the format $\pi$-ViT expects:

import h5py
import numpy as np

# Stand-in for real skeleton features extracted for one video (400 x 216)
skeleton_features = np.random.rand(400, 216)

# Save the features to an HDF5 file named after the video, under a single dataset called 'data'
with h5py.File('Cook.Cleandishes_p02_r00_v02_c03.h5', 'w') as f:
    f.create_dataset('data', data=skeleton_features)
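
To sanity-check a saved file, it can be read back and its shape verified (a minimal sketch; the file name is illustrative):

import h5py

# Illustrative file name; point this at one of the saved feature files
with h5py.File('Cook.Cleandishes_p02_r00_v02_c03.h5', 'r') as f:
    features = f['data'][:]

print(features.shape)  # expected: (400, 216)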

Due to the large size of the skeleton feature datasets, we do not upload them here. Instead, we provide Hyperformer models pre-trained on Toyota Smarthome in hyperformer_models/. NTU-trained models, along with details for running the Hyperformer model, are available here.

Citation & Acknowledgement

@inproceedings{reilly2024pivit,
    title={Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living},
    author={Dominick Reilly and Srijan Das},
    booktitle={Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2024}
}

Our primary contributions can be found in:

This repository is built on top of TimeSformer.