A Dual-Masked Auto-Encoder for Robust Motion Capture with Spatial-Temporal Skeletal Token Completion

<p align="center"><img src="assets/framework.jpg" width="90%" alt="" /></p>

This repository contains the source code for our ACM Multimedia 2022 paper on multi-view, multi-person 3D pose estimation. The preprint is available on arXiv (arXiv:2207.07381). The project webpage and the dataset presented in the paper are linked here; please refer to them for more details.

Dependencies

The code is tested on Windows with

pytorch                   1.10.2
torchvision               0.11.3
CUDA                      11.3.1

We suggest maintaining the project in a virtual environment created with an easy-to-use package/environment manager such as conda.

conda create -n dmaeMocap python=3.6
conda activate dmaeMocap
# install pytorch
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
# install the rest of the dependencies
pip install -r requirements.txt
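
After installation, a quick check such as the one below (a minimal sketch, not part of the repository; the expected versions are the ones listed above and may differ slightly on your machine) confirms that PyTorch and the CUDA toolkit are visible from the environment.

# sanity check for the environment described above (hypothetical helper)
import torch
import torchvision

print("pytorch    :", torch.__version__)         # expected: 1.10.2
print("torchvision:", torchvision.__version__)   # expected: 0.11.3
print("CUDA build :", torch.version.cuda)        # expected: 11.3
print("GPU usable :", torch.cuda.is_available())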

Data preparation

Follow the instructions to prepare the necessary data:

Or, generate the 2D poses on your own; we provide the instructions at util/gizmo/data_makeup.

Data should be organized as follows:

ROOT/
    └── data/
        └── shelf/
            └── sequences/
                └── img_0/
                └── .../
                └── img_4/
            └── camera_params.npy
            └── checkpoint-best.pth
            └── shelf_eval_2d_detection_dict.npy
    └── ...
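
Before running inference, you may want to sanity-check the prepared files. The snippet below is a minimal sketch that only loads and inspects them with NumPy; the internal layout of camera_params.npy and shelf_eval_2d_detection_dict.npy is not documented here, so the printed keys and shapes depend on your download.

# hypothetical inspection helper, not part of the repository
import numpy as np

cams = np.load("data/shelf/camera_params.npy", allow_pickle=True)
dets = np.load("data/shelf/shelf_eval_2d_detection_dict.npy", allow_pickle=True)

# dicts saved with np.save come back as 0-d object arrays; .item() unwraps them
for name, arr in [("camera_params", cams), ("2d_detection_dict", dets)]:
    if arr.dtype == object and arr.shape == ():
        print(name, "-> dict with keys:", list(arr.item().keys())[:5])
    else:
        print(name, "-> array with shape:", arr.shape, "dtype:", arr.dtype)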

Inference

We provide the following script to reconstruct and complete 3D skeletons from multi-view RGB video sequences.

python inference.py

The triangulation configuration can be found and modified in util/config.py. The script visualizes the reconstruction results when self.snapshot_flag = True at Line 18; we set self.snapshot_flag = False by default.

You can run python inference.py --no-dmae to disable the D-MAE motion completion, and add --snapshot to enable the snapshot visualization.
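
For reference, the two flags behave like standard boolean toggles. The sketch below only illustrates the documented command-line interface; the actual argument handling inside inference.py may differ.

# illustrative parsing of the flags documented above (not the repository's actual code)
import argparse

parser = argparse.ArgumentParser(description="Reconstruct and complete 3D skeletons from multi-view sequences")
parser.add_argument("--no-dmae", action="store_true",
                    help="skip the D-MAE motion-completion stage (triangulation only)")
parser.add_argument("--snapshot", action="store_true",
                    help="visualize/save reconstruction snapshots (overrides the config default)")
args = parser.parse_args()

use_dmae = not args.no_dmae   # argparse maps --no-dmae to args.no_dmae
save_snapshots = args.snapshot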

Evaluate

python evaluate.py

Similar to Inference, the evaluation script is configured via util/config.py. By default, we visualize the inference results and the ground truth in the data/shelf/output/eval_snapshot directory. The metrics are printed to the console and also saved to data/shelf/output/eval.log. If you want to evaluate the framework without D-MAE, add --no-dmae to the end of the command line, i.e. python evaluate.py --no-dmae.

Overall, output data would be organized as follows:

ROOT/
    └── data/
        └── shelf/
            └── output/
                └── eval_snapshot/
                └── npy/
                └── eval.log
            └── ...
    └── ...
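
If you want to post-process the saved reconstructions, the sketch below lists and loads whatever arrays end up under output/npy/. It assumes standard NumPy files; their exact naming and contents are not specified in this guide.

# hypothetical post-processing sketch for the output directory above
import glob
import numpy as np

for path in sorted(glob.glob("data/shelf/output/npy/*.npy")):
    arr = np.load(path, allow_pickle=True)
    print(f"{path}: shape={getattr(arr, 'shape', None)}, dtype={arr.dtype}")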
<!-- ### Test on your dataset --> <!-- TBD -->

Train the D-MAE

In this short guide, we focus on human pose estimation (HPE) reconstruction and completion with the pretrained model. If you want to reproduce the pretrained model by training D-MAE yourself, please refer to training/README.md.

Bibtex

If you use our code/models in your research, please cite our paper:

@inproceedings{jiang2022dmae,
  title={A Dual-Masked Auto-Encoder for Robust Motion Capture with Spatial-Temporal Skeletal Token Completion},
  author={Jiang, Junkun and Chen, Jie and Guo, Yike},
  booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
  year={2022}
}

Acknowledgement

Many thanks to the following open-source repositories for their help in developing D-MAE.