Hand tracking with DepthAI

Running Google Mediapipe Hand Tracking models on Luxonis DepthAI hardware (OAK-D, OAK-D lite, OAK-1,...)

<p align="center"><img src="img/hand_tracker.gif" alt="Demo" /></p>

What's new?

Introduction

There are a few concepts to know before choosing which options best fit your application. Some of the concepts introduced here are detailed further in the following sections.

Hand Tracking from Mediapipe is a two-stage pipeline. First, the hand detection stage detects where the hands are in the whole image. For each detected hand, a Region of Interest (ROI) around the hand is computed and fed to the second stage, which infers the landmarks. If the confidence in the landmarks is high enough (score above a threshold), the landmarks can be used to directly determine the ROI for the next frame, without running the first stage again. This means that the global speed and the FPS can vary: they directly depend on the maximal number of hands we want to detect (Solo vs Duo mode) and on the actual number of hands in the image. For instance, tracking one hand in Solo mode is faster than tracking one hand in Duo mode.
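
As a rough illustration of this logic, here is a minimal sketch in Python; get_next_frame, detect_palms, infer_landmarks and roi_from_landmarks are hypothetical helpers, not the actual API of this repository:

# Conceptual sketch of the 2-stage pipeline (hypothetical helpers, not this repo's API)
LM_SCORE_THRESHOLD = 0.5           # assumed value
rois = []                          # ROIs carried over from the previous frame
while True:
    frame = get_next_frame()       # hypothetical frame source
    if not rois:
        # Stage 1: palm detection on the whole image (the slower stage)
        rois = detect_palms(frame)
    hands, next_rois = [], []
    for roi in rois:
        # Stage 2: landmark regression on the hand ROI (the faster stage)
        landmarks, score = infer_landmarks(frame, roi)
        if score > LM_SCORE_THRESHOLD:
            hands.append(landmarks)
            # The landmarks give the ROI for the next frame,
            # so stage 1 is skipped as long as the hand is tracked
            next_rois.append(roi_from_landmarks(landmarks))
    rois = next_rois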

<p align="center"><img src="img/schema_hand_tracking.png" alt="Schema" width="500"/></p>

We can roughly consider 2 types of application, depending on the user's position relative to the camera:

  1. The user is close to the camera (< 1.5 m); typically the user and the camera are at fixed positions (for instance, the user sits at a desk with the camera pointing at them). Here the tracker does a good job of tracking the hands even if they move relatively fast. In the example below, hand poses are used to emulate keystrokes;
<p align="center"><img src="img/tetris.gif" alt="Example" width="330"/></p>
  2. The user can stand anywhere in the room, so the user-camera distance can vary a lot (say between 1 m and 5 m). A good example of such an application is the use of hand poses to remotely control connected devices. In the example below, the user switches a light on/off by closing his hand; he is seated in an armchair, but we want the app to also work if he stands near the door, 2 meters further back. Because the Mediapipe hand detection model was not designed to deal with long distances, we use a mechanism called Body Pre Focusing to help locate the hands. It is important to note that the longer the distance, the more difficult the tracking, because fast moves appear blurrier in the image. So for this type of application, it is highly recommended to keep the arm still once the hand pose is taken.
<p align="center"><img src="img/toggle_light.gif" alt="Output: no BPF" width="330"/></p>

Solo mode vs Duo mode

Host mode vs Edge mode

Two modes are available: the Host mode, in which the post-processing runs on the host, and the Edge mode (enabled with the -e option), in which the post-processing runs directly on the device.

Body Pre Focusing

Body Pre Focusing is an optional mechanism meant to help the hand detection when the person is far from the camera.

<p align="center"><img src="img/body_pre_focusing_5m.gif" alt="Distance person-camera = 5m" /></p>

In the video above, the person is 5 m away from the camera. The big yellow square represents the smart cropping zone of the body pose estimator (Movenet). The small green square that appears briefly represents the focus zone on which the palm detector is run. In this example, hands_up_only=True, so the green square appears only when the hand is raised. Once a hand is detected, only the landmark model runs, as long as it can keep track of the hand (the small yellow rotated square drawn around the hand).

The palm detector model from Google Mediapipe was trained to detect hands that are less than 2 meters away from the camera. So if the person stands further away, their hands may not be detected, and the padding used to make the image square before feeding the network makes the problem even worse. To improve the detection, a body pose estimator can help focus on a zone of the image that contains only the hands: instead of the whole image, we feed the palm detector with a crop of the zone around the hands.
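
To make the idea concrete, here is a minimal sketch of the cropping step, assuming a square focus zone (x1, y1, x2, y2) has already been derived from the body keypoints; frame, zone and palm_detector are placeholders, not the actual code of this repository:

import cv2

# Sketch: run the palm detector on a square focus zone instead of the whole image.
# 'frame', 'zone' and 'palm_detector' are placeholders.
x1, y1, x2, y2 = zone                  # square focus zone estimated from the body keypoints
crop = frame[y1:y2, x1:x2]
crop = cv2.resize(crop, (128, 128))    # palm detector input resolution (128x128 here)
detections = palm_detector(crop)       # detections are then mapped back to full-frame coordinates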

A possible and natural body pose estimator is Blazepose, as it is the model used in the Mediapipe Holistic solution, but here we chose Movenet Single Pose because of its simpler architecture (Blazepose would imply a more complex pipeline, with 2 more neural networks running on the MyriadX).

Movenet gives the wrist keypoints, which are used as the centers of the zones we are looking for. Several options are available, selected by the body_pre_focusing parameter (illustrated in the table below).

By setting the hands_up_only option, we take into consideration only the hands whose wrist keypoint is above the elbow keypoint, which in practice means that the hand is raised. Indeed, when we want to recognize hand gestures, the arm is generally folded and the hand up.
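
A minimal sketch of that filter, assuming the Movenet wrist and elbow keypoints are available as (x, y) pairs in image coordinates (variable names are hypothetical):

# Sketch of the "hands up only" filter (hypothetical keypoint variables).
# In image coordinates, y grows downward, so "wrist above elbow" means wrist_y < elbow_y.
def hand_is_up(wrist, elbow):
    return wrist[1] < elbow[1]

focus_centers = []
if hand_is_up(right_wrist, right_elbow):
    focus_centers.append(right_wrist)   # the wrist keypoint is used as the center of the focus zone
if hand_is_up(left_wrist, left_elbow):
    focus_centers.append(left_wrist)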

Recommendations:

Remarks:

|Arguments|Palm detection input|Hand tracker output|Remarks|
|---|---|---|---|
|No BPF|<img src="img/pd_input_no_bpf.jpg" alt="PD input: no BPF" width="128"/>|<img src="img/output_no_bpf.jpg" alt="Output: no BPF" width="350"/>|Because of the padding, hands get very small and palm detection gives a poor result (right hand not detected, left hand detection inaccurate)|
|No BPF<br>crop=True|<img src="img/pd_input_no_bpf_crop.jpg" alt="PD input: no BPF, crop" width="128"/>|<img src="img/output_no_bpf_crop.jpg" alt="Output: no BPF, crop" width="200"/>|Cropping the image along its shortest side is an easy and inexpensive way to improve the detection, but only if the person stays in the center of the image|
|body_pre_focusing=group<br>hands_up_only=False|<img src="img/pd_input_bpf_group_all_hands.jpg" alt="PD input: bpf=group, all hands" width="128"/>|<img src="img/output_bpf_group_all_hands.jpg" alt="Output: bpf=group, all hands" width="350"/>|The BPF algorithm finds a zone that contains both hands, which are correctly detected|
|body_pre_focusing=group<br>hands_up_only=True|<img src="img/pd_input_bpf_right.jpg" alt="PD input: bpf=group, hands up only" width="128"/>|<img src="img/output_bpf_group.jpg" alt="Output: bpf=group, all hands" width="350"/>|With "hands_up_only" set to True, the left hand is not taken into consideration since its wrist keypoint is below the elbow keypoint|
|body_pre_focusing=right|<img src="img/pd_input_bpf_right.jpg" alt="PD input: bpf=right" width="128"/>|<img src="img/output_bpf_group.jpg" alt="Output: bpf=right" width="350"/>|The right hand is correctly detected, whatever the value of "hands_up_only"|
|body_pre_focusing=left<br>hands_up_only=False|<img src="img/pd_input_bpf_left_all_hands.jpg" alt="PD input: bpf=left, all hands" width="128"/>|<img src="img/output_bpf_left_all_hands.jpg" alt="Output: bpf=left, all hands" width="350"/>|The left hand is correctly detected|
|body_pre_focusing=left<br>hands_up_only=True|<img src="img/pd_input_no_bpf.jpg" alt="PD input: bpf=left, hands up only" width="128"/>|<img src="img/output_bpf_left.jpg" alt="Output: bpf=left, hands up only" width="350"/>|Because the left hand is not raised, it is not taken into consideration, so we fall back to the case where BPF is not used|
|body_pre_focusing=higher|<img src="img/pd_input_bpf_right.jpg" alt="PD input: bpf=higher" width="128"/>|<img src="img/output_bpf_higher.jpg" alt="Output: bpf=higher" width="350"/>|Here, same result as for "body_pre_focusing=right", whatever the value of "hands_up_only"|

Frames per second (FPS)

You will quickly notice that the FPS can vary a lot. Of course, it depends on the modes chosen:

Also, the FPS highly depends on the number of hands currently in the image. It may sound counter-intuitive, but the FPS is significantly higher when a hand is present than when there is none. Why is that? Because the palm detection inference is slower than the landmark regression inference. When no hand is visible, the palm detection (or the body detection when using Body Pre Focusing) runs on every frame until a hand is found. Once a hand is detected, only the landmark model runs on the following frames, until it loses track of the hand. In the best-case scenario, the palm detection is actually run only once: on the first frame!

Important recommendation: tune the internal FPS! By default, the internal camera FPS is set to a value that depends on the chosen modes and on the use of depth ("-xyz"). These default values are based on my own observations. When starting the demo, you will see a line like the one below:

Internal camera FPS set to: 36

Please don't hesitate to play with the internal_fps parameter (via the --internal_fps argument in the demos) to find the optimal value for your use case. If the observed FPS is well below the default value, you should lower the camera FPS with this parameter until the set FPS is just above the observed FPS.
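
For instance, if the default internal FPS is 36 but the observed FPS stays around 20, you could try something like (illustrative value, to be adjusted for your setup):

python3 demo.py --internal_fps 22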

Install

Install the python packages (depthai, opencv) with the following command:

python3 -m pip install -r requirements.txt

Run

Usage:

Use demo.py or demo_bpf.py depending on whether or not you need Body Pre Focusing. demo_bpf.py has the same arguments as demo.py, plus 2 additional arguments related to BPF: --body_pre_focusing and --all_hands.

->./demo_bpf.py -h
usage: demo_bpf.py [-h] [-e] [-i INPUT] [--pd_model PD_MODEL] [--no_lm]
                   [--lm_model LM_MODEL] [--use_world_landmarks] [-s] [-xyz]
                   [-g] [-c] [-f INTERNAL_FPS] [-r {full,ultra}]
                   [--internal_frame_height INTERNAL_FRAME_HEIGHT]
                   [-bpf {right,left,group,higher}] [-ah]
                   [--single_hand_tolerance_thresh SINGLE_HAND_TOLERANCE_THRESH]
                   [--dont_force_same_image] [-lmt {1,2}] [-t [TRACE]]
                   [-o OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -e, --edge            Use Edge mode (postprocessing runs on the device)

Tracker arguments:
  -i INPUT, --input INPUT
                        Path to video or image file to use as input (if not
                        specified, use OAK color camera)
  --pd_model PD_MODEL   Path to a blob file for palm detection model
  --no_lm               Only the palm detection model is run (no hand landmark
                        model)
  --lm_model LM_MODEL   Landmark model 'full', 'lite', 'sparse' or path to a
                        blob file
  --use_world_landmarks
                        Fetch landmark 3D coordinates in meter
  -s, --solo            Solo mode: detect one hand max. If not used, detect 2
                        hands max (Duo mode)
  -xyz, --xyz           Enable spatial location measure of palm centers
  -g, --gesture         Enable gesture recognition
  -c, --crop            Center crop frames to a square shape
  -f INTERNAL_FPS, --internal_fps INTERNAL_FPS
                        Fps of internal color camera. Too high value lower NN
                        fps (default= depends on the model)
  -r {full,ultra}, --resolution {full,ultra}
                        Sensor resolution: 'full' (1920x1080) or 'ultra'
                        (3840x2160) (default=full)
  --internal_frame_height INTERNAL_FRAME_HEIGHT
                        Internal color camera frame height in pixels
  -bpf {right,left,group,higher}, --body_pre_focusing {right,left,group,higher}
                        Enable Body Pre Focusing
  -ah, --all_hands      In Body Pre Focusing mode, consider all hands (not
                        only the hands up)
  --single_hand_tolerance_thresh SINGLE_HAND_TOLERANCE_THRESH
                        (Duo mode only) Number of frames after only one hand
                        is detected before calling palm detection (default=10)
  --dont_force_same_image
                        (Edge Duo mode only) Don't force the use the same
                        image when inferring the landmarks of the 2 hands
                        (slower but skeleton less shifted)
  -lmt {1,2}, --lm_nb_threads {1,2}
                        Number of the landmark model inference threads
                        (default=2)
  -t [TRACE], --trace [TRACE]
                        Print some debug infos. The type of info depends on
                        the optional argument.

Renderer arguments:
  -o OUTPUT, --output OUTPUT
                        Path to output video file

Some examples:

Whenever you see demo.py, you can replace it with demo_bpf.py.
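
For instance (illustrative invocations, using flags described in the usage above):

./demo.py -e -g            # Edge mode with gesture recognition
./demo.py -s -xyz          # Solo mode with spatial location of the palm centers
./demo_bpf.py -bpf group   # Body Pre Focusing with a focus zone containing both hands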

|Keypress|Function|
|---|---|
|Esc|Exit|
|space|Pause|
|1|Show/hide the palm bounding box (only in non-solo mode)|
|2|Show/hide the palm detection keypoints (only in non-solo mode)|
|3|Show/hide the rotated bounding box around the hand|
|4|Show/hide landmarks|
|5|Show/hide handedness (several display modes are available)|
|6|Show/hide scores|
|7|Show/hide recognized gestures (-g or --gesture)|
|8|Show/hide hand spatial location (-xyz)|
|9|Show/hide the zone used to measure the spatial location (-xyz)|
|f|Show/hide FPS|
|b|Show/hide body keypoints, smart cropping zone and focus zone if Body Pre Focusing is used (only in Host mode)|

Mediapipe models

You can find the models palm_detector.blob and hand_landmark_*.blob under the 'models' directory, but below I describe how to get the files.

  1. Clone this github repository in a local directory (DEST_DIR)
  2. In the DEST_DIR/models directory, download the source tflite models from this archive. The archive contains:
  3. Install PINTO's amazing tflite2tensorflow tool. Use the docker installation, which includes many packages, among them a recent version of OpenVINO.
  4. From DEST_DIR, run the tflite2tensorflow container: ./docker_tflite2tensorflow.sh
  5. From the running container:
cd models
./convert_models.sh

The convert_models.sh script converts the tflite models into TensorFlow format (.pb), then converts the .pb files into OpenVINO IR format (.xml and .bin), and finally converts the IR files into MyriadX format (.blob).

By default, the number of SHAVES associated with the blob files is 4. In case you want to generate new blobs with a different number of shaves, you can use the script gen_blob_shave.sh:

# Example: to generate blobs for 6 shaves
./gen_blob_shave.sh -m palm_detection.xml -n 6   # will generate palm_detection_sh6.blob
./gen_blob_shave.sh -m hand_landmark_full.xml -n 6   # will generate hand_landmark_full_sh6.blob

Explanation about the Model Optimizer params:

Blob models vs tflite models: the palm detection blob does not give exactly the same results as the tflite version, because the tflite ResizeBilinear operation is converted into the IR Interpolate-1 operation. Yet the difference is almost imperceptible, thanks to the great help of PINTO (see issue).

Movenet models: the 'lightning' and 'thunder' Movenet models come from the repository geaxgx/depthai_movenet.

Custom model

The custom_models directory contains the code to build the custom model PDPostProcessing_top2_sh1.blob. This model processes the outputs of the palm detection network (a 1x896x1 tensor for the scores and a 1x896x18 tensor for the regressors) and yields the 2 best detections. For more details, please read this.
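
For illustration only, the selection performed by this model is roughly equivalent to the following host-side numpy code (an assumption about the decoding, not the actual on-device implementation):

import numpy as np

# Rough host-side equivalent of the top-2 selection (illustrative only).
# scores: raw palm detection scores of shape (1, 896, 1)
# regressors: raw box/keypoint regressors of shape (1, 896, 18)
def top2_detections(scores, regressors):
    scores = 1.0 / (1.0 + np.exp(-scores.reshape(896)))    # sigmoid on the raw scores (assumed)
    best2 = np.argsort(scores)[-2:][::-1]                  # indexes of the 2 best anchors
    # The corresponding regressors still need to be decoded with the SSD anchor parameters
    return scores[best2], regressors.reshape(896, 18)[best2]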

Code

There are 2 classes: HandTracker, which runs the hand tracking pipeline and returns, for each video frame, the frame itself together with the list of detected hands; and HandTrackerRenderer, which draws the hands and related information on the frame.

The files demo.py and demo_bpf.py are representative examples of how to use these classes.

from HandTrackerRenderer import HandTrackerRenderer
from HandTrackerEdge import HandTracker

tracker = HandTracker(
        # Your own arguments
        ...
        )

renderer = HandTrackerRenderer(tracker=tracker)

while True:
    # Run hand tracker on next frame
    # 'bag' is some information common to the frame and to the hands 
    frame, hands, bag = tracker.next_frame()
    if frame is None: break
    # Draw hands
    frame = renderer.draw(frame, hands, bag)
    key = renderer.waitKey(delay=1)
    if key == 27 or key == ord('q'):
        break

renderer.exit()
tracker.exit()

hands, returned by tracker.next_frame(), is a list of HandRegion objects.

For more information on:

Landmarks

When accessing individual landmarks in the arrays hand.landmarks or hand.norm_landmarks, the following schema (source) shows the landmark indexes:

<p align="center"><img src="img/hand_landmarks.png" alt="Hand landmarks" /></p>

Examples

|Example|Preview|
|---|---|
|Pseudo-3D visualization with Open3d + smoothing filtering|<img src="examples/3d_visualization/medias/3d_visualization.gif" alt="3D visualization" width="200"/>|
|Remote control with hand poses|<img src="examples/remote_control/medias/toggle_light.gif" alt="Remote control with hand poses" width="200"/>|

Credits