[Archive] Paper notes - VNect: real-time human pose estimation with a single RGB camera

Link to this post: https://blog.csdn.net/github_28260175/article/details/89475971

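Archived post [gitpage => csdn]
These are paper notes I wrote early on. The opinions are naive and not always accurate, so take them with a grain of salt.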

Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. 2017. VNect: real-time 3D human pose estimation with a single RGB camera. ACM Trans. Graph. 36, 4, Article 44 (July 2017), 14 pages. DOI: https://doi.org/10.1145/3072959.3073596

In terms of accuracy, the method is quantitatively on par with the best offline[3] 3D monocular RGB pose estimation methods (notably for end effector positions). Compared with monocular RGB-D (RGB plus depth[1]) approaches, the results are comparable, and sometimes better.

Contributions:

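Simple hardware. An ordinary camera is enough (ordinary or even low-resolution video streams as input); such cameras are cheaper than 3D cameras and offer higher resolution.
Good results. Comparable to RGB-D methods and to offline 3D monocular RGB pose estimation (especially for end effector positions).
Widely applicable. Works outdoors (RGB-D sensors are affected by outdoor sunlight and degrade), works on low-resolution input, and does not need tightly cropped input images.
Real-time (30 Hz[4]) and accurate. Speed comes from a relatively shallow 50-layer CNN (ResNet50-based); accuracy comes from a new fully convolutional formulation[2].
? monocular pose reconstruction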

Algorithm:

1. CNN Pose Regression:

The pose is regressed using convolutional neural networks (CNNs) [Mehta et al. 2016; Pavlakos et al. 2016]. The network produces 2D and 3D joint positions at the same time, and drops the expensive bounding box computations.

2. Kinematic Skeleton Fitting:

Model-based kinematic skeleton fitting refines, in real time, the joint positions predicted by the CNN, so that the reconstructed motion stays consistent over time.
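To make "consistent with a kinematic skeleton" concrete, here is a minimal sketch of one ingredient: forcing predicted joints onto fixed bone lengths while keeping each bone's predicted direction. It is a simplified stand-in, not the paper's full fitting step, which these notes do not detail. The function name, the parent-index convention, and the toy numbers are all assumptions made for illustration.

```python
import math

def retarget_bone_lengths(joints, parents, bone_lengths):
    """Rebuild joint positions so every bone has a fixed length.

    joints:       list of (x, y, z) predictions, one per joint
    parents:      parent index per joint, -1 for the root
                  (assumption: a parent always has a smaller index than its child)
    bone_lengths: target length of the bone ending at each joint (root entry ignored)
    """
    out = [None] * len(joints)
    for j, p in enumerate(parents):
        if p < 0:
            out[j] = tuple(joints[j])  # the root is kept where it was predicted
            continue
        # Direction of the predicted bone from parent to child.
        d = [joints[j][k] - joints[p][k] for k in range(3)]
        norm = math.sqrt(sum(v * v for v in d))
        if norm == 0.0:
            out[j] = out[p]  # degenerate prediction: collapse onto the parent
            continue
        scale = bone_lengths[j] / norm
        # Walk from the already corrected parent along the predicted direction.
        out[j] = tuple(out[p][k] + d[k] * scale for k in range(3))
    return out

# Toy chain: root -> joint 1 -> joint 2 (made-up values).
predicted = [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0), (3.0, 2.0, 0.0)]
parents = [-1, 0, 1]
lengths = [0.0, 1.0, 1.5]
fitted = retarget_bone_lengths(predicted, parents, lengths)
```

Because the bone lengths never change between frames, limbs cannot stretch or shrink from one frame to the next, which is one reason a skeleton-based correction gives steadier motion than raw per-frame predictions.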

3. Skeleton Initialization (Optional):

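The subject's height can be supplied in advance as a scale reference, which reduces the ambiguity of reconstructing 3D from an image. The kinematic skeleton itself is initialized by averaging the CNN predictions over an initial stretch of frames (average CNN predictions for a few frames at the beginning).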
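A rough sketch of what such an initialization could look like: average each bone's length over the first few frames, then rescale with the known height. How the paper applies the height exactly is not covered in these notes, so `scale_to_height` and its `skeleton_height` argument are assumptions, as are the function names and the toy data.

```python
import math

def average_bone_lengths(frames, parents):
    """Average the length of every bone over a list of frames.

    frames:  list of frames, each a list of (x, y, z) joint predictions
    parents: parent index per joint, -1 for the root
    """
    sums = [0.0] * len(parents)
    for joints in frames:
        for j, p in enumerate(parents):
            if p >= 0:
                sums[j] += math.dist(joints[j], joints[p])
    return [s / len(frames) for s in sums]

def scale_to_height(bone_lengths, skeleton_height, subject_height):
    """Rescale bone lengths so the skeleton matches a known body height.

    skeleton_height is the height implied by the unscaled skeleton
    (hypothetical input; how to measure it is left open here).
    """
    factor = subject_height / skeleton_height
    return [b * factor for b in bone_lengths]

# Two made-up initial frames of a two-joint skeleton.
first_frames = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 3.0, 0.0)],
]
mean_lengths = average_bone_lengths(first_frames, parents=[-1, 0])
scaled_lengths = scale_to_height(mean_lengths, skeleton_height=2.0, subject_height=1.8)
```

Averaging over several frames damps the noise of any single prediction, and the height fixes the overall scale that a single RGB image cannot determine on its own.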

Challenges:

Being real-time and accurate at once.

1. Real-time:

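Existing methods are "harder to run in real-time, partly due to additional preprocessing steps such as bounding box extraction". They need tightly cropped images (tight crops at a fixed resolution), for example produced by a bounding box algorithm, which is time-consuming.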

2. Accuracy:

Existing methods cannot accurately predict the extent of articulation of the body's joints.

Detailed steps:

1. CNN Pose Regression

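The paper's solution to the problems above, and its main novelty, is to lift the 2D heatmaps to 3D. In addition to its 2D heatmap, each joint j gets three location-maps Xj / Yj / Zj, which capture the root-relative (relative to the pelvis) 3D position xj / yj / zj of j. Reading the location-maps at the position of the heatmap maximum gives the joint's 3D coordinates, which completes the conversion from 2D to 3D.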

Network: based on the ResNet50 network architecture of He et al. [2016].

Output: 2D heatmaps and 3D location-maps.
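The readout from these outputs can be sketched for a single joint as follows: find the pixel where the 2D heatmap peaks, then read the three location-maps at that same pixel. The function name and the tiny 3x3 maps are made up for illustration; real network outputs are much larger and the values below are not from the paper.

```python
def read_joint_from_maps(heatmap, x_map, y_map, z_map):
    """Return ((row, col), (x, y, z)) for one joint.

    (row, col) is the pixel with the highest heatmap response (the 2D position);
    (x, y, z) is the root-relative 3D position stored at that pixel.
    """
    best_rc = (0, 0)
    best_val = heatmap[0][0]
    for r, row in enumerate(heatmap):
        for c, val in enumerate(row):
            if val > best_val:
                best_val = val
                best_rc = (r, c)
    r, c = best_rc
    return best_rc, (x_map[r][c], y_map[r][c], z_map[r][c])

# Toy maps for one joint; location-map values are in made-up millimetres.
heatmap = [[0.0, 0.1, 0.0],
           [0.2, 0.9, 0.1],
           [0.0, 0.3, 0.0]]
x_map = [[0.0, 0.0, 0.0], [0.0, 120.0, 0.0], [0.0, 0.0, 0.0]]
y_map = [[0.0, 0.0, 0.0], [0.0, -40.0, 0.0], [0.0, 0.0, 0.0]]
z_map = [[0.0, 0.0, 0.0], [0.0, 35.0, 0.0], [0.0, 0.0, 0.0]]

pixel, xyz = read_joint_from_maps(heatmap, x_map, y_map, z_map)
```

Here `pixel` is the joint's 2D location and `xyz` its position relative to the pelvis. Since both come from the same image position, the 3D estimate is tied to where the joint is actually seen in the image.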

Training data: for 2D pose estimation (heatmaps), MPII [Andriluka et al. 2014] and LSP [Johnson and Everingham 2010, 2011]; for 3D pose, MPI-INF-3DHP [Mehta et al. 2016] and Human3.6m [Ionescu et al. 2014b].

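No tight cropping is needed, thanks to the Bounding Box Tracker: "The BB tracker starts with (slow) multi-scale predictions on the full image for the first few frames, and hones in on the person in the image making use of the BB-agnostic predictions from the fully convolutional network." In other words, at the start the algorithm finds a box that contains the person; afterwards the box is moved and enlarged frame by frame by a set ratio. No separate bounding box algorithm runs in the later steps, which speeds up the whole pipeline enough to reach real-time.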
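A minimal sketch of such a frame-to-frame box update, assuming the box is derived from the previous frame's 2D joint predictions: take the tight box around the keypoints, pad it by a fixed ratio, and blend it with the previous box to avoid jitter. The function name, the padding ratios, the way padding is split between the two sides, and the momentum value are illustrative assumptions, not values verified against the paper.

```python
def update_box(keypoints, prev_box=None, pad_w=0.4, pad_h=0.2, momentum=0.75):
    """Compute the crop box for the next frame from 2D keypoints.

    keypoints: list of (x, y) joint positions in full-image coordinates
    prev_box:  previous box as (x0, y0, x1, y1), or None on the first update
    pad_w / pad_h: extra width / height as a fraction of the tight box
                   (assumed values, split evenly between both sides)
    momentum:  weight given to the previous box when smoothing (assumed value)
    """
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    w, h = x1 - x0, y1 - y0
    box = (x0 - pad_w * w / 2, y0 - pad_h * h / 2,
           x1 + pad_w * w / 2, y1 + pad_h * h / 2)
    if prev_box is None:
        return box
    # Weighted average with the previous box keeps the crop from jumping around.
    return tuple(momentum * p + (1 - momentum) * b for p, b in zip(prev_box, box))

# Made-up keypoints and previous box.
first_box = update_box([(10.0, 20.0), (30.0, 60.0)])
next_box = update_box([(10.0, 20.0), (30.0, 60.0)], prev_box=(0.0, 0.0, 40.0, 80.0))
```

The update only needs a min/max over the joints and a weighted average, so its cost is negligible next to running a separate person detector on every frame.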