
[Repost] MediaPipe Hands: On-device Real-time Hand Tracking (paper reading notes)



0. Abstract

We present a real-time on-device hand tracking solution that predicts a hand skeleton of a human from a single RGB camera for AR/VR applications. Our pipeline consists of two models: 1) a palm detector, that is providing a bounding box of a hand to, 2) a hand landmark model, that is predicting the hand skeleton. It is implemented via MediaPipe[12], a framework for building cross-platform ML solutions. The proposed model and pipeline architecture demonstrate real-time inference speed on mobile GPUs with high prediction quality. MediaPipe Hands is open sourced at


It achieves high real-time inference speed and prediction quality on mobile GPUs. For the open-source code, see MediaPipe Hands.

1. Introduction

Hand tracking is a vital component to provide a natural way for interaction and communication in AR/VR, and has been an active research topic in the industry. Vision-based hand pose estimation has been studied for many years. A large portion of previous work requires specialized hardware, e.g. depth sensors. Other solutions are not lightweight enough to run in real time on commodity mobile devices and thus are limited to platforms equipped with powerful processors. In this paper, we propose a novel solution that does not require any additional hardware and performs in real time on mobile devices. Our main contributions are:
• An efficient two-stage hand tracking pipeline that can track multiple hands in real-time on mobile devices.
• A hand pose estimation model that is capable of predicting 2.5D hand pose with only RGB input.
• An open source hand tracking pipeline as a ready-to-go solution on a variety of platforms, including Android, iOS, Web (Tensorflow.js) and desktop PCs.




2. Architecture

Our hand tracking solution utilizes an ML pipeline consisting of two models working together:
• A palm detector that operates on a full input image and locates palms via an oriented hand bounding box.
• A hand landmark model that operates on the cropped hand bounding box provided by the palm detector and returns high-fidelity 2.5D landmarks.
Providing the accurately cropped palm image to the hand landmark model drastically reduces the need for data augmentation (e.g. rotations, translation and scale) and allows the network to dedicate most of its capacity towards landmark localization accuracy. In a real-time tracking scenario, we derive a bounding box from the landmark prediction of the previous frame as input for the current frame, thus avoiding applying the detector on every frame. Instead, the detector is only applied on the first frame or when the hand prediction indicates that the hand is lost.


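The hand landmark model receives an accurately cropped palm image, which greatly reduces the need for data augmentation (e.g. rotation, translation and scaling) and lets the model spend its capacity on landmark localization accuracy. In real-time tracking, the hand bounding box for the current frame is derived from the landmarks predicted on the previous frame, so the palm detector does not have to run on every frame. The palm detector runs only on the first frame or when the hand is lost.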
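The detector-gating logic described above can be sketched roughly as follows. This is a minimal illustration, not the MediaPipe implementation: `detect_palm` and `predict_landmarks` are hypothetical callables standing in for the two models, and the `margin` and `presence_threshold` values are assumptions chosen only for the example.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Box = Tuple[float, float, float, float]        # (x_min, y_min, x_max, y_max)
Landmarks = List[Tuple[float, float, float]]   # 21 points of (x, y, relative depth)


def box_from_landmarks(landmarks: Landmarks, margin: float = 0.25) -> Box:
    """Derive the next frame's crop from the current landmarks.

    The margin (fraction of the box size) is an assumed value.
    """
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    return (min(xs) - margin * w, min(ys) - margin * h,
            max(xs) + margin * w, max(ys) + margin * h)


def track(
    frames: Iterable,
    detect_palm: Callable[[object], Optional[Box]],              # hypothetical
    predict_landmarks: Callable[[object, Box], Tuple[Landmarks, float]],  # hypothetical
    presence_threshold: float = 0.5,                             # assumed value
):
    """Run the detector only on the first frame or after the hand is lost."""
    box: Optional[Box] = None
    detector_calls = 0
    results = []
    for frame in frames:
        if box is None:
            box = detect_palm(frame)
            detector_calls += 1
            if box is None:              # no palm found in this frame
                results.append(None)
                continue
        landmarks, presence = predict_landmarks(frame, box)
        if presence < presence_threshold:
            box = None                   # hand lost: re-detect on the next frame
            results.append(None)
        else:
            box = box_from_landmarks(landmarks)   # reuse for the next frame
            results.append(landmarks)
    return results, detector_calls
```

In this sketch a lost hand costs one frame without output, because the detector is only re-applied on the following frame, which matches the behaviour the paper describes.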

2.1 Hand Detector

To detect initial hand locations, we employ a single-shot detector model optimized for mobile real-time application similar to BlazeFace, which is also available in MediaPipe. Detecting hands is a decidedly complex task: our model has to work across a variety of hand sizes with a large scale span (~20x) and be able to detect occluded and self-occluded hands. Whereas faces have high contrast patterns, e.g., around the eye and mouth region, the lack of such features in hands makes it comparatively difficult to detect them reliably from their visual features alone.
Our solution addresses the above challenges using different strategies.
First, we train a palm detector instead of a hand detector, since estimating bounding boxes of rigid objects like palms and fists is significantly simpler than detecting hands with articulated fingers. In addition, as palms are smaller objects, the non-maximum suppression algorithm works well even for the two-hand self-occlusion cases, like handshakes. Moreover, palms can be modelled using only square bounding boxes, ignoring other aspect ratios, and therefore reducing the number of anchors by a factor of 3~5.
Second, we use an encoder-decoder feature extractor similar to FPN for a larger scene-context awareness even for small objects.
Lastly, we minimize the focal loss during training to support a large amount of anchors resulting from the high scale variance. High-level palm detector architecture is shown in Figure 2. We present an ablation study of our design elements in Table 1.





(7) The ablation study shows that focal loss performs better than cross-entropy loss.
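To make the comparison concrete, here is a minimal sketch of the standard binary focal loss next to plain cross-entropy for a single anchor. The paper does not state its hyperparameters; `gamma=2.0` and `alpha=0.25` below are the commonly used defaults from the original focal loss formulation and are assumptions here.

```python
import math


def binary_cross_entropy(p: float, y: int) -> float:
    """Cross-entropy for one anchor; p is the predicted palm probability."""
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def binary_focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    gamma and alpha are assumed defaults, not values from the paper.
    """
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

The `(1 - p_t)**gamma` factor shrinks the loss of easy, well-classified anchors (mostly background), so the many anchors produced by the large scale variance do not dominate training.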


2.2 Hand Landmark Model

After running palm detection over the whole image, our subsequent hand landmark model performs precise landmark localization of 21 2.5D coordinates inside the detected hand regions via regression. The model learns a consistent internal hand pose representation and is robust even to partially visible hands and self-occlusions. The model has three outputs (see Figure 3):

• 21 hand landmarks consisting of x, y, and relative depth.
• A hand flag indicating the probability of hand presence in the input image.
• A binary classification of handedness, e.g. left or right hand.
We use the same topology as [14] for the 21 landmarks. The 2D coordinates are learned from both real-world images as well as synthetic datasets as discussed below, with the relative depth w.r.t. the wrist point being learned only from synthetic images. To recover from tracking failure, we developed another output of the model similar to [8] for producing the probability of the event that a reasonably aligned hand is indeed present in the provided crop. If the score is lower than a threshold then the detector is triggered to reset tracking. Handedness is another important attribute for effective interaction using hands in AR/VR. This is especially useful for some applications where each hand is associated with a unique functionality. Thus we developed a binary classification head to predict whether the input hand is the left or right hand. Our setup targets real-time mobile GPU inference, but we have also designed lighter and heavier versions of the model to address, respectively, CPU inference on mobile devices lacking proper GPU support and higher accuracy requirements on desktop.


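(1) Hand landmark coordinates x, y and relative depth

The 2D coordinates x and y are learned from both real-world and synthetic images.
The depth coordinate z, relative to the wrist, is learned from synthetic data only.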

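Although our initial target is inference on mobile GPUs, we also designed lighter and heavier versions of the model. The lighter model can run on mobile devices that lack GPU support, and the heavier model can be used when higher accuracy is needed. (For the parameter counts, run times and test devices of the different models, see Table 3 below.)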
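As a rough illustration of the three outputs, the sketch below groups them into one structure. The flat 63-value landmark layout, the field names, and the convention that `handedness` is the probability of a right hand are all assumptions made for this example; the paper does not specify the tensor layout.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

NUM_LANDMARKS = 21


@dataclass
class HandPrediction:
    landmarks: List[Tuple[float, float, float]]  # (x, y, depth relative to the wrist)
    presence: float                              # probability that a hand is in the crop
    handedness: float                            # assumed: probability of a right hand


def parse_outputs(raw_landmarks: Sequence[float], presence: float,
                  handedness: float) -> HandPrediction:
    """Group a flat [x0, y0, z0, x1, ...] vector (assumed layout) into 21 points."""
    if len(raw_landmarks) != NUM_LANDMARKS * 3:
        raise ValueError("expected 21 landmarks with 3 values each")
    points = [tuple(raw_landmarks[i:i + 3]) for i in range(0, len(raw_landmarks), 3)]
    return HandPrediction(points, presence, handedness)


def hand_label(pred: HandPrediction) -> str:
    """Turn the binary handedness score into a label (0.5 cut-off is assumed)."""
    return "Right" if pred.handedness >= 0.5 else "Left"
```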


3. Datasets and Annotation
To obtain ground truth data, we created the following datasets addressing different aspects of the problem:
• In-the-wild dataset: This dataset contains 6K images of large variety, e.g. geographical diversity, various lighting conditions and hand appearance. The limitation of this dataset is that it doesn’t contain complex articulation of hands.
• In-house collected gesture dataset: This dataset contains 10K images that cover various angles of all physically possible hand gestures. The limitation of this dataset is that it’s collected from only 30 people with limited variation in background. The in-the-wild and in-house dataset are great complements to each other to improve robustness.
• Synthetic dataset: To even better cover the possible hand poses and provide additional supervision for depth, we render a high-quality synthetic hand model over various backgrounds and map it to the corresponding 3D coordinates. We use a commercial 3D hand model that is rigged with 24 bones and includes 36 blendshapes, which control fingers and palm thickness. The model also provides 5 textures with different skin tones. We created video sequences of transformation between hand poses and sampled 100K images from the videos. We rendered each pose with a random high-dynamic-range lighting environment and three different cameras. See Figure 4 for examples.
For the palm detector, we only use the in-the-wild dataset, which is sufficient for localizing hands and offers the highest variety in appearance. However, all datasets are used for training the hand landmark model. We annotate the real-world images with 21 landmarks and use projected ground-truth 3D joints for synthetic images. For hand presence, we select a subset of real-world images as positive examples and sample on the region excluding annotated hand regions as negative examples. For handedness, we annotate a subset of real-world images with handedness to provide such data.
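One way to read "sample on the region excluding annotated hand regions" is rejection sampling of crops that do not overlap any annotated hand box. The sketch below follows that reading; the sampling strategy, the square crop, and all parameter names are assumptions, since the paper does not describe the procedure in detail.

```python
import random
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if the two axis-aligned boxes share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def sample_negative_crop(image_w: float, image_h: float, hand_boxes: List[Box],
                         crop_size: float, rng: random.Random,
                         max_tries: int = 100) -> Optional[Box]:
    """Draw a square crop that avoids every annotated hand box, or None on failure."""
    for _ in range(max_tries):
        x = rng.uniform(0, image_w - crop_size)
        y = rng.uniform(0, image_h - crop_size)
        crop = (x, y, x + crop_size, y + crop_size)
        if not any(boxes_overlap(crop, hb) for hb in hand_boxes):
            return crop
    return None
```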


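We render each pose with a random high-dynamic-range lighting environment and three different cameras. See Figure 4 below for examples.
The four images in the top row are annotated real-world images, and the four images in the bottom row are annotated synthetic images.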

4. Results

For the hand landmark model, our experiments show that the combination of real-world and synthetic datasets provides the best results. See Table 2 for details. We evaluate only on real-world images. Beyond the quality improvement, training with a large synthetic dataset leads to less jitter visually across frames. This observation leads us to believe that our real-world dataset can be enlarged for better generalization.
Our target is to achieve real-time performance on mobile devices. We experimented with different model sizes and found that the “Full” model (see Table 3) provides a good trade-off between quality and speed. Increasing model capacity further introduces only minor improvements in quality but decreases speed significantly (see Table 3 for details). We use the TensorFlow Lite GPU backend for on-device inference.


Experimental setup: see Table 2 below (only real-world images are used for model evaluation).


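Our goal is real-time performance on mobile devices, so we ran experiments with different model sizes, which make different trade-offs between quality and speed. See Table 3 for details.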

Tips: the experiments above use the TensorFlow Lite GPU backend for real-time inference.

5. Implementation in MediaPipe

With MediaPipe[12], our hand tracking pipeline can be built as a directed graph of modular components, called Calculators. MediaPipe comes with an extensible set of Calculators to solve tasks like model inference, media processing, and data transformations across a wide variety of devices and platforms. Individual Calculators like cropping, rendering and neural network computations are further optimized to utilize GPU acceleration. For example, we employ TFLite GPU inference on most modern phones.
Our MediaPipe graph for hand tracking is shown in Figure 5. The graph consists of two subgraphs: one for hand detection and another for landmarks computation. One key optimization MediaPipe provides is that the palm detector only runs as needed (fairly infrequently), saving significant computation. We achieve this by deriving the hand location in the current video frame from the computed hand landmarks in the previous frame, eliminating the need to apply the palm detector on every frame. For robustness, the hand tracker model also outputs an additional scalar capturing the confidence that a hand is present and reasonably aligned in the input crop. Only when the confidence falls below a certain threshold is the hand detection model reapplied to the next frame.

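In MediaPipe, our hand tracking pipeline is a directed graph of modular components called Calculators. MediaPipe ships with an extensible set of Calculators that handle tasks such as model inference, media processing and data transformation across a variety of devices and platforms. Individual Calculators such as cropping, rendering and neural network computation can be optimized with GPU acceleration. For example, we use TFLite GPU inference on mobile devices.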
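The idea of a pipeline as a directed graph of components can be illustrated with a toy scheduler. This is not the MediaPipe API (real graphs are declared in configuration files and Calculators are framework classes); `run_graph` and the three toy calculators below are hypothetical stand-ins that only show how outputs flow along graph edges in dependency order.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_graph(calculators, inputs, packets):
    """Run each calculator once, after all of its upstream streams are ready.

    calculators: name -> function producing that node's output
    inputs:      name -> list of upstream stream names
    packets:     initial stream values (e.g. the camera frame)
    """
    packets = dict(packets)  # do not mutate the caller's dictionary
    for name in TopologicalSorter(inputs).static_order():
        if name in calculators:
            args = [packets[src] for src in inputs[name]]
            packets[name] = calculators[name](*args)
    return packets


# Toy stand-ins: a "frame" is a list, a "box" is a slice range.
calculators = {
    "crop": lambda frame, box: frame[box[0]:box[1]],
    "landmarks": lambda crop: sum(crop),
    "render": lambda frame, landmarks: (len(frame), landmarks),
}
inputs = {
    "crop": ["frame", "box"],
    "landmarks": ["crop"],
    "render": ["frame", "landmarks"],
}
result = run_graph(calculators, inputs, {"frame": list(range(10)), "box": (2, 5)})
```

Here `result["render"]` depends on `landmarks`, which depends on `crop`, so the scheduler runs them in that order regardless of how the dictionaries are written.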





6. Application Examples

Our hand tracking solution can readily be used in many applications such as gesture recognition and AR effects. On top of the predicted hand skeleton, we employ a simple algorithm to compute gestures, see Figure 6. First, the state of each finger, e.g. bent or straight, is determined via the accumulated angles of joints. Then, we map the set of finger states to a set of predefined gestures. This straightforward, yet effective technique allows us to estimate basic static gestures with reasonable quality. Beyond static gesture recognition, it is also possible to use a sequence of landmarks to predict dynamic gestures. Another application is to apply AR effects on top of the skeleton. Hand based AR effects currently enjoy high popularity. In Figure 7, we show an example AR rendering of the hand skeleton in neon light style.


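First, the bending angles of the joints determine the state of each finger (bent or straight). Then we map the set of finger states to a set of predefined gestures. This simple yet effective technique lets us estimate basic static gestures with reasonable quality. The existing pipeline supports counting gestures from multiple cultures (e.g. American, European and Chinese) as well as various hand signs, including "Thumb up", closed fist, "OK", "Rock" and "Spiderman".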
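A minimal sketch of this two-step approach is shown below. It assumes the 21-landmark ordering used by MediaPipe Hands (0 = wrist, then four points per finger from thumb to pinky) and uses only x and y. The 60-degree threshold and the gesture table are assumptions for illustration; the actual rules used by the authors are not given in the paper.

```python
import math

# Landmark indices per finger (wrist is index 0).
FINGERS = {
    "thumb": [1, 2, 3, 4],
    "index": [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring": [13, 14, 15, 16],
    "pinky": [17, 18, 19, 20],
}

# Hypothetical mapping from (thumb, index, middle, ring, pinky) states to gestures.
S, B = "straight", "bent"
GESTURES = {
    (S, S, S, S, S): "open palm",
    (B, B, B, B, B): "fist",
    (S, B, B, B, B): "thumb up",
    (B, S, B, B, S): "rock",
    (S, S, B, B, S): "spiderman",
}


def bend_angle(a, b, c):
    """Degrees by which the 2D chain a -> b -> c deviates from a straight line."""
    v1 = (b[0] - a[0], b[1] - a[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def finger_states(landmarks, threshold_deg=60.0):
    """Classify each finger by the accumulated bend along wrist -> ... -> tip."""
    states = {}
    for name, idx in FINGERS.items():
        chain = [landmarks[0]] + [landmarks[i] for i in idx]
        total = sum(bend_angle(chain[i], chain[i + 1], chain[i + 2])
                    for i in range(len(chain) - 2))
        states[name] = B if total > threshold_deg else S
    return states


def recognize_gesture(landmarks):
    """Map the tuple of finger states to a predefined gesture name."""
    states = finger_states(landmarks)
    key = tuple(states[name] for name in FINGERS)
    return GESTURES.get(key, "unknown")
```

A gesture such as "OK" cannot be expressed with finger states alone, because it also depends on the thumb and index fingertips touching, so a real system would need extra rules beyond this table.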


7. Conclusion

In this paper, we proposed MediaPipe Hands, an end-to-end hand tracking solution that achieves real-time performance on multiple platforms. Our pipeline predicts 2.5D landmarks without any specialized hardware and thus can be easily deployed to commodity devices. We open sourced the pipeline to encourage researchers and engineers to build gesture control and creative AR/VR applications with our pipeline.

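In this paper we presented MediaPipe Hands, an end-to-end hand tracking solution that achieves real-time performance on multiple platforms. Our pipeline predicts 2.5D landmark coordinates without any specialized hardware and can be easily deployed to commodity devices. We open sourced the pipeline to encourage researchers and engineers to use it to build gesture control and creative AR/VR applications.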

[1]. On-device real-time hand tracking with MediaPipe

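Copyright notice: This is an original article by CSDN blogger 「炼丹狮」, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.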



