<p align="center"> <img src="demo/logo.png" width="200" height="100"> </p>

LightTrack: A Generic Framework for Online Top-Down Human Pose Tracking

Update 4/19/2020:

Paper will appear in CVPR 2020 Workshop on Towards Human-Centric Image/Video Synthesis and the 4th Look Into Person (LIP) Challenge.

Update 5/16/2019: Add Camera Demo

[Project Page] [Paper] [Github]

With the provided code, you can easily:

Real-life Application Scenarios:

Table of Contents

Overview

LightTrack is an effective and light-weight framework for human pose tracking that is truly online and generic to top-down pose tracking. The code released with the paper includes the LightTrack framework as well as its replaceable component modules (detector, pose estimator, and matcher), much of which is borrowed or adapted from Cascaded Pyramid Network [1], PyTorch-YOLOv3, st-gcn, and OpenSVAI [3].

Overview

In contrast to Visual Object Tracking (VOT) methods, in which visual features are implicitly represented by kernels or CNN feature maps, we track each human pose by recursively updating the bounding box and its corresponding pose in an explicit manner. The bounding box region of a target is inferred from the explicit features, i.e., the human keypoints. Human keypoints can be regarded as a special set of visual features. The advantages of using pose as explicit features include:

Single Pose Tracking (SPT) and Single Visual Object Tracking (VOT) are thus incorporated into one unified functioning entity, easily implemented by a replaceable single-person human pose estimation module. Below is a simple step-by-step explanation of how the LightTrack framework works.

Example 1

(1). Detection only at the 1st frame. Blue bboxes indicate tracklets inferred from keypoints.

Example 0

(2). Detection every 10 frames. The red bbox indicates a keyframe detection.

Example 2

(3). Detection every 10 frames for multiple people:

For more technical details, please refer to our arXiv paper.
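To make the loop above concrete, here is a minimal pseudocode-style sketch of a top-down tracking cycle of this kind. The helper names (`detector`, `pose_estimator`, `matcher`, `keypoints_to_bbox`) are hypothetical stand-ins rather than the actual interfaces in this repository; treat it as an illustration of the idea, not the implementation.

```python
# Minimal sketch of a LightTrack-style online loop (hypothetical helper names).
KEYFRAME_INTERVAL = 10  # re-run the detector every N frames; otherwise reuse tracked bboxes


def keypoints_to_bbox(keypoints, margin=0.2):
    """Infer the next search region from the current keypoints (the explicit features)."""
    xs = [x for x, _, score in keypoints if score > 0]
    ys = [y for _, y, score in keypoints if score > 0]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    return (min(xs) - margin * w, min(ys) - margin * h,
            max(xs) + margin * w, max(ys) + margin * h)


def track(frames, detector, pose_estimator, matcher):
    """Yield the list of tracklets for every frame."""
    tracklets = []  # each tracklet: {"id": ..., "bbox": ..., "keypoints": ...}
    for t, frame in enumerate(frames):
        if t % KEYFRAME_INTERVAL == 0 or not tracklets:
            # Keyframe: detect people and re-associate identities via the matcher.
            tracklets = [{"id": matcher(frame, bbox, tracklets), "bbox": bbox}
                         for bbox in detector(frame)]
        # Every frame: single-person pose estimation inside each tracked bbox,
        # then update the bbox from the estimated keypoints for the next frame.
        for trk in tracklets:
            trk["keypoints"] = pose_estimator(frame, trk["bbox"])
            trk["bbox"] = keypoints_to_bbox(trk["keypoints"])
        yield tracklets
```

The key point is that between keyframes no detector runs: the bounding box for the next frame is inferred directly from the current keypoints, which is what keeps the framework overhead small.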

Prerequisites

(Optional: set up the environment on your own)

Getting Started

Demo on Live Camera

| Pose Tracking Framework | Keyframe Detector | Keyframe ReID Module | Pose Estimator | FPS |
| --- | --- | --- | --- | --- |
| LightTrack | YOLOv3 | Siamese GCN | MobileNetv1-Deconv | 220* / 15 |

Demo on Arbitrary Videos

| Pose Tracking Framework | Keyframe Detector | Keyframe ReID Module | Pose Estimator | FPS |
| --- | --- | --- | --- | --- |
| LightTrack | YOLOv3 | Siamese GCN | MobileNetv1-Deconv | 220* / 15 |
total_time_ALL: 19.99s
total_time_DET: 1.32s
total_time_POSE: 18.63s
total_time_LIGHTTRACK: 0.04s
total_num_FRAMES: 300
total_num_PERSONS: 600

Average FPS: 15.01fps
Average FPS excluding Pose Estimation: 220.08fps
Average FPS excluding Detection: 16.07fps
Average FPS for framework only: 7261.90fps
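These averages follow directly from the logged totals; a quick arithmetic check (values copied from the log above):

```python
# Sanity-check the averaged FPS figures from the logged totals above.
total_time_all = 19.99         # seconds, end to end
total_time_det = 1.32          # detection only
total_time_pose = 18.63        # pose estimation only
total_time_lighttrack = 0.04   # framework overhead (association, bbox bookkeeping)
num_frames = 300

print(num_frames / total_time_all)                       # ~15.01 fps overall
print(num_frames / (total_time_all - total_time_pose))   # ~220 fps, excluding pose estimation
print(num_frames / (total_time_all - total_time_det))    # ~16.07 fps, excluding detection
print(num_frames / total_time_lighttrack)                # ~7500 fps, framework only
                                                         # (the logged 7261.90 uses the unrounded time)
```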

You can replace the demo video with your own for fun. You can also try different detectors or pose estimators.

Validate on PoseTrack 2018

Pose estimation models are provided. They should have been downloaded to the ./weights folder when running the ./download_weights.sh script. We provide two alternatives, CPN101 and MSRA152, pre-trained with ResNet-101 and ResNet-152 backbones, respectively.

| Image Size | Pose Estimator | Weights |
| --- | --- | --- |
| 384x288 | CPN101 [1] | CPN_snapshot_293.ckpt |
| 384x288 | MSRA152 [2] | MSRA_snapshot_285.ckpt |

Detections for the PoseTrack'18 validation set have been pre-computed. We use the same detections as [3] in our experiments. Two options are available, deformable versions of FPN and R-FCN, as illustrated in the paper; the FPN detections yield higher performance.

| Detector | JSONs |
| --- | --- |
| ResNet101_Deformable_FPN_RCNN [6] | DeformConv_FPN_RCNN_detect.zip |
| ResNet101_Deformable_RFCN [6] | DeformConv_RFCN_detect.zip |
| Ground Truth Locations | GT_detect.zip |

Evaluation on PoseTrack 2018

For mAP, two values are given: the mean average precision before and after keypoint dropping. For FPS, * means excluding pose inference time. Our LightTrack in true online mode runs at an average of 0.8 fps on the PoseTrack'18 validation set.

[LightTrack_CPN101] and [LightTrack_MSRA152] are both trained with the [COCO + PoseTrack'17] dataset; [LightTrack_MSRA152 + auxiliary] is trained with the [COCO + PoseTrack'18 + ChallengerAI] dataset.

| Methods | Det Mode | FPS | mAP | MOTA | MOTP |
| --- | --- | --- | --- | --- | --- |
| LightTrack_CPN101 | online-DET-2F | 47* / 0.8 | 76.0 / 70.3 | 61.3 | 85.2 |
| LightTrack_MSRA152 | online-DET-2F | 48* / 0.7 | 77.2 / 72.4 | 64.6 | 85.3 |
| LightTrack_MSRA152 + auxiliary | online-DET-2F | 48* / 0.7 | 77.7 / 72.7 | 65.4 | 85.1 |

| Methods | Det Mode | FPS | mAP | MOTA | MOTP |
| --- | --- | --- | --- | --- | --- |
| LightTrack_CPN101 | online-GT-2F | 47* / 0.8 | - / 70.1 | 73.5 | 94.7 |
| LightTrack_MSRA152 | online-GT-2F | 48* / 0.7 | - / 73.1 | 78.0 | 94.8 |
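The keypoint dropping mentioned above refers to discarding low-confidence keypoint predictions before evaluation, which costs a little mAP but reduces false positives and therefore helps MOTA. Below is a minimal sketch of that idea, assuming each keypoint carries a confidence score; the threshold value is illustrative only, not the setting used in the paper.

```python
# Illustrative keypoint dropping: discard keypoints whose confidence falls below a threshold
# before evaluation. The 0.5 value is an example, not the paper's setting.
DROP_THRESHOLD = 0.5


def drop_low_confidence_keypoints(pose):
    """pose: list of (x, y, score) tuples; dropped keypoints are marked as not predicted."""
    return [(x, y, s) if s >= DROP_THRESHOLD else None for (x, y, s) in pose]
```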

Qualitative Results

Some gifs exhibiting qualitative results:

| Pose Tracking Framework | Keyframe Detector | Keyframe ReID Module | Pose Estimator |
| --- | --- | --- | --- |
| LightTrack | Deformable FPN (heavy) | Siamese GCN | MSRA152 (heavy) |

Demo 1

| Pose Tracking Framework | Keyframe Detector | Keyframe ReID Module | Pose Estimator |
| --- | --- | --- | --- |
| LightTrack | YOLOv3 (light) | Siamese GCN | MobileNetv1-Deconv (light) |

Demo 2 Demo 3

Quantitative Results on PoseTrack

Performance on PoseTrack 2017 Benchmark (Test Set)

Challenge 3: Multi-Person Pose Tracking

| Methods | Mode | FPS | mAP | MOTA |
| --- | --- | --- | --- | --- |
| LightTrack (offline-ensemble) | batch | - | 66.65 | 58.01 |
| HRNet [4], CVPR'19 | batch | - | 74.95 | 57.93 |
| FlowTrack [2], ECCV'18 | batch | - | 74.57 | 57.81 |
| LightTrack (online-3F) | online | 47* / 0.8 | 66.55 | 55.15 |
| PoseFlow [5], BMVC'18 | online | 10* / - | 62.95 | 50.98 |

For FPS, * means excluding pose inference time and - means not applicable. Our LightTrack in true online mode runs at an average of 0.8 fps on the PoseTrack'18 validation set. (In total, 57,928 persons are encountered, an average of 6.54 people tracked per frame.)

Models are trained with the [COCO + PoseTrack'17] dataset.

Training

1) Pose Estimation Module

# Train with COCO+PoseTrack'17
python train_PoseTrack_COCO_17_CPN101.py -d 0-3 -c   # Train CPN-101
# or
python train_PoseTrack_COCO_17_MSRA152.py -d 0-3 -c  # Train MSRA-152
# or
python train_PoseTrack_COCO_17_mobile_deconv.py -d 0-3 -c  # Train MobileNetv1-Deconv

2) Pose Matching Module

# Download training and validation data
cd graph/unit_test;
bash download_data.sh;
cd -;

# Train the siamese graph convolutional network
cd graph;
python main.py processor_siamese_gcn -c config/train.yaml
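At inference time the trained Siamese GCN scores pose pairs, and a threshold on that score decides whether a keyframe detection keeps an existing track ID or starts a new one. The sketch below only illustrates that decision; `gcn_distance`, `assign_id`, and `POSE_MATCH_THRESHOLD` are hypothetical names, not the repository's actual interfaces.

```python
# Illustrative identity assignment at a keyframe (hypothetical names throughout).
# A pose-pair distance below the threshold keeps the existing track ID; a threshold <= 0
# never matches anything, which effectively disables the pose matching module.
POSE_MATCH_THRESHOLD = 1.0


def assign_id(new_pose, tracklets, gcn_distance, next_id):
    """Return (track_id, updated_next_id) for a pose detected at a keyframe."""
    best = min(tracklets,
               key=lambda trk: gcn_distance(new_pose, trk["keypoints"]),
               default=None)
    if best is not None and gcn_distance(new_pose, best["keypoints"]) < POSE_MATCH_THRESHOLD:
        return best["id"], next_id           # re-identified: keep the existing ID
    return next_id, next_id + 1              # no match: start a new identity
```

Under this reading, a threshold no greater than zero means no pair can ever match, which is the effect exploited in the ablation below.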

The simplest way to perform ablation studies on the pose matching module, without modifying existing code, is to set the pose matching threshold to a value smaller than zero, which nullifies the pose matching module. The performance on the PoseTrack'18 validation set then deteriorates, as shown below.

| Methods | Det Mode | Pose Match (thresh) | mAP | MOTA | MOTP |
| --- | --- | --- | --- | --- | --- |
| LightTrack_MSRA152 | online DET | No (0) | 77.2 / 72.4 | 63.3 | 85.3 |
| LightTrack_MSRA152 | online DET | Yes (1.0) | 77.2 / 72.4 | 64.6 | 85.3 |
| LightTrack_CPN101 | online DET | No (0) | 76.0 / 70.3 | 60.0 | 85.2 |
| LightTrack_CPN101 | online DET | Yes (1.0) | 76.0 / 70.3 | 61.3 | 85.2 |

Limitations

Currently, the LightTrack framework does not handle identity switches and losses well in occlusion scenarios, for several reasons: (1) only one frame of history is considered during data association; (2) only skeleton-based features are used. However, these problems are not inherent drawbacks of the LightTrack framework. In future research, spatiotemporal pose matching can be further explored to mitigate the occlusion problem. A longer history of poses might improve performance, and combining visual features with skeleton-based features may further improve the robustness of the data association module.

Citation

If you find LightTrack helpful or use this framework in your work, please consider citing:

@article{ning2019lighttrack,
  author    = {Ning, Guanghan and Huang, Heng},
  title     = {LightTrack: A Generic Framework for Online Top-Down Human Pose Tracking},
  journal   = {Proceedings of CVPRW 2020 on Towards Human-Centric Image/Video Synthesis and the 4th Look Into Person (LIP) Challenge},
  year      = {2020},
}

Also consider citing the following works if you use CPN101/MSRA152 models:

@inproceedings{xiao2018simple,
    author={Xiao, Bin and Wu, Haiping and Wei, Yichen},
    title={Simple Baselines for Human Pose Estimation and Tracking},
    booktitle = {ECCV},
    year = {2018}
}
@article{Chen2018CPN,
    Author = {Chen, Yilun and Wang, Zhicheng and Peng, Yuxiang and Zhang, Zhiqiang and Yu, Gang and Sun, Jian},
    Title = {{Cascaded Pyramid Network for Multi-Person Pose Estimation}},
    Conference = {CVPR},
    Year = {2018}
}

Reference

[1] Chen, Yilun, et al. "Cascaded pyramid network for multi-person pose estimation." CVPR (2018).

[2] Xiao, Bin, Haiping Wu, and Yichen Wei. "Simple baselines for human pose estimation and tracking." ECCV (2018).

[3] Ning, Guanghan, et al. "A top-down approach to articulated human pose estimation and tracking." ECCVW (2018).

[4] Sun, Ke, et al. "Deep High-Resolution Representation Learning for Human Pose Estimation." CVPR (2019).

[5] Xiu, Yuliang, et al. "Pose flow: efficient online pose tracking." BMVC (2018).

[6] Dai, Jifeng, et al. "Deformable convolutional networks." ICCV (2017).

Contact

For questions about our paper or code, please contact Guanghan Ning.

Credits

LOGO by: Hogen