This is the official implementation of ⏳ Hourglass Tokenizer (🔥HoT🔥), the approach described in the paper:

⏳ Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation,
Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Jialun Cai, Nicu Sebe
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

😮 Highlights

🚀 Both high efficiency and estimation accuracy

✨ Simple baseline, general-purpose efficient transformer-based framework

💡 Installation

🔥HoT🔥 is tested on Ubuntu 18 with PyTorch 1.7.1 and Python 3.9.

🐳 Download pretrained models

🔥HoT🔥's pretrained models can be found here. Please download them and put them in the './checkpoint/pretrained' directory.

🤖 Dataset setup

Please download the dataset from the Human3.6M website and refer to VideoPose3D to set up the Human3.6M dataset (in the './dataset' directory). Alternatively, you can download the processed data from here.

|-- dataset
|   |-- data_3d_h36m.npz
|   |-- data_2d_h36m_gt.npz
|   |-- data_2d_h36m_cpn_ft_h36m_dbb.npz
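
Before training or testing, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_dataset_files` is hypothetical, and the only assumption is that the three files listed above sit directly inside './dataset'.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = (
    "data_3d_h36m.npz",
    "data_2d_h36m_gt.npz",
    "data_2d_h36m_cpn_ft_h36m_dbb.npz",
)


def missing_dataset_files(root="./dataset"):
    """Return the expected dataset files that are absent from `root`.

    Hypothetical helper for illustration; it only checks that the files
    exist and does not validate their contents.
    """
    root = Path(root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_dataset_files()
    if missing:
        print("Missing dataset files:", ", ".join(missing))
    else:
        print("All expected dataset files are present.")
```

Run it from the repository root so that the relative './dataset' path resolves correctly.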

🚅 Test the model

You can obtain the results of Table 6 of our paper, including those of the MixSTE, HoT w. MixSTE, TPC w. MixSTE, MHFormer, and TPC w. MHFormer models.

## MixSTE
python main_mixste.py --batch_size 4 --test --frames 243 --stride 243 --model mixste.mixste --previous_dir 'checkpoint/pretrained/mixste' 

## HoT w. MixSTE
python main_mixste.py --batch_size 4 --test --frames 243 --stride 243 --model mixste.hot_mixste --token_num 81 --layer_index 3 --previous_dir 'checkpoint/pretrained/hot_mixste' 

## TPC w. MixSTE
python main_mixste_tpc.py --batch_size 4 --test --frames 243 --stride 1 --model mixste.tpc_mixste --token_num 61 --layer_index 7 --previous_dir 'checkpoint/pretrained/tpc_mixste' 

## MHFormer
python main_mhformer.py --batch_size 256 --test --frames 351 --stride 1 --model mhformer.mhformer --previous_dir 'checkpoint/pretrained/mhformer'

## TPC w. MHFormer
python main_mhformer_tpc.py --batch_size 256 --test --frames 351 --stride 1 --model mhformer.tpc_mhformer --token_num 117 --layer_index 1 --previous_dir 'checkpoint/pretrained/tpc_mhformer' 
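
If you want to run all five evaluations in one go, the commands above can be generated from a small table. This is only a sketch under stated assumptions: the `TEST_CONFIGS` dictionary and the `build_test_command` helper are hypothetical and not part of the repository, the flag values are copied from the commands above, and the flags are emitted in a slightly different order, which we assume the scripts' argument parsing does not depend on.

```python
import shlex

# Evaluation settings copied from the commands above. The table layout and
# names below are illustrative and do not exist in the repository.
TEST_CONFIGS = {
    "MixSTE": ("main_mixste.py", {
        "batch_size": 4, "frames": 243, "stride": 243,
        "model": "mixste.mixste",
        "previous_dir": "checkpoint/pretrained/mixste",
    }),
    "HoT w. MixSTE": ("main_mixste.py", {
        "batch_size": 4, "frames": 243, "stride": 243,
        "model": "mixste.hot_mixste", "token_num": 81, "layer_index": 3,
        "previous_dir": "checkpoint/pretrained/hot_mixste",
    }),
    "TPC w. MixSTE": ("main_mixste_tpc.py", {
        "batch_size": 4, "frames": 243, "stride": 1,
        "model": "mixste.tpc_mixste", "token_num": 61, "layer_index": 7,
        "previous_dir": "checkpoint/pretrained/tpc_mixste",
    }),
    "MHFormer": ("main_mhformer.py", {
        "batch_size": 256, "frames": 351, "stride": 1,
        "model": "mhformer.mhformer",
        "previous_dir": "checkpoint/pretrained/mhformer",
    }),
    "TPC w. MHFormer": ("main_mhformer_tpc.py", {
        "batch_size": 256, "frames": 351, "stride": 1,
        "model": "mhformer.tpc_mhformer", "token_num": 117, "layer_index": 1,
        "previous_dir": "checkpoint/pretrained/tpc_mhformer",
    }),
}


def build_test_command(name):
    """Return the evaluation command for `name` as an argument list."""
    script, options = TEST_CONFIGS[name]
    command = ["python", script, "--test"]
    for flag, value in options.items():
        command += [f"--{flag}", str(value)]
    return command


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run to execute it.
    for name in TEST_CONFIGS:
        print(f"## {name}")
        print(shlex.join(build_test_command(name)))
```

Printing the commands first makes it easy to compare them against the ones listed above before executing anything.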

⚡ Train the model

To train the MixSTE, HoT w. MixSTE, TPC w. MixSTE, MHFormer, and TPC w. MHFormer models on Human3.6M:

## MixSTE
python main_mixste.py --batch_size 4 --frames 243 --stride 243 --model mixste.mixste

## HoT w. MixSTE
python main_mixste.py --batch_size 4 --frames 243 --stride 243 --model mixste.hot_mixste --token_num 81 --layer_index 3 

## TPC w. MixSTE
python main_mixste_tpc.py --batch_size 4 --frames 243 --stride 243 --model mixste.tpc_mixste --token_num 61 --layer_index 7

## MHFormer
python main_mhformer.py --batch_size 128 --nepoch 20 --lr 1e-3 --lr_decay_epoch 5 --lr_decay 0.95 --frames 351 --stride 1 --model mhformer.mhformer

## TPC w. MHFormer
python main_mhformer_tpc.py --batch_size 210 --nepoch 20 --lr 1e-3 --lr_decay_epoch 5 --lr_decay 0.95 --frames 351 --stride 1 --model mhformer.tpc_mhformer --token_num 117 --layer_index 1

🤗 Demo

First, download the YOLOv3 and HRNet pretrained models here and put them in the './demo/lib/checkpoint' directory. Then, put your in-the-wild videos in the './demo/video' directory.

Run the command below:

python demo/vis.py --video sample_video.mp4

Sample demo output:

✏️ Citation

If you find our work useful in your research, please consider citing:

  title={Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation},
  author={Li, Wenhao and Liu, Mengyuan and Liu, Hong and Wang, Pichao and Cai, Jialun and Sebe, Nicu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},

👍 Acknowledgement

Our code is extended from the following repositories. We thank the authors for releasing their code.

🔒 Licence

This project is licensed under the terms of the MIT license.

🤝 Contributors

