【DeepPose】《DeepPose：Human Pose Estimation via Deep Neural Networks》

bryant_meng

Article link: https://blog.csdn.net/bryant_meng/article/details/108865162

CVPR-2014

The pioneering work on applying deep learning to Human Pose Estimation

Contents

1 Background and Motivation
2 Advantages / Contributions
3 Method
- 3.1 Pose Estimation as DNN-based Regression
- 3.2 Cascade of Pose Regressors
4 Experiments
- 4.1 Datasets
- 4.2 Results and Discussion
5 Conclusion（own） / Future work

1 Background and Motivation

Human pose estimation is the problem of localizing human joints. The main difficulties are:

strong articulations (the joints can take a wide variety of configurations)
small and barely visible joints
occlusions and the need to capture the context

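The mainstream approach at the time was part-based models, but these are a bit like the blind men and the elephant: inefficient, and prone to judging the whole from a single part. Reasoning about pose in a holistic manner seems more sensible, yet existing holistic methods had shown only limited success on real-world data.

In this paper the authors ride the popularity of Deep Neural Networks (DNNs), which were already performing well on classification and localization tasks, and use a DNN to regress joint coordinates directly, giving a holistic approach to human pose estimation. The advantages are:

capture the full context of each body joint
simpler than graphical-model methods (no need to design the model topology or the interactions between joints)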

2 Advantages / Contributions

The first work to use Deep Neural Networks (DNNs) for human pose estimation. It regresses joint coordinates directly and, combined with a cascade, reaches SOTA on 4 academic datasets.

3 Method

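A labeled example is $(x,\overrightarrow{y})$, where $x$ is the image data and $\overrightarrow{y}$ is the ground-truth (GT) pose vector

$\overrightarrow{y} = (...,\overrightarrow{y}_i^T,...)^T$ , $i \in\{1,...,k\}$

$\overrightarrow{y}_i$ holds the x and y coordinates of the $i^{th}$ joint

The person bounding box (which may be the whole image) is written $b = (b_c,b_w,b_h)$, where $b_c$ is the box center and $b_w$, $b_h$ are its width and height

The joint coordinates $\overrightarrow{y}_i$ normalized with respect to the bounding box are

$N(\overrightarrow{y}_i;b) = \begin{pmatrix} 1/b_w & 0 \\ 0 & 1/b_h \end{pmatrix}(\overrightarrow{y}_i - b_c)$ (1)

That is, subtract the bbox center from the coordinates, then divide the x coordinate by the bbox width and the y coordinate by the bbox height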

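Applied to all joints, this gives

$N(\overrightarrow{y};b) =(...,N(\overrightarrow{y}_i;b)^T,...)^T$

With a slight abuse of notation, $N (x; b)$ denotes the crop of image $x$ by the box $b$, and $N(\cdot)$ is shorthand for normalization with respect to the full-image box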
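The normalization and its inverse can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `normalize_pose` and `denormalize_pose` and the toy box are assumptions made for this example.

```python
import numpy as np

def normalize_pose(y, b_c, b_w, b_h):
    """Map absolute joint coordinates of shape (k, 2) into box-relative coordinates."""
    y = np.asarray(y, dtype=float)
    # Subtract the box center, then divide x by the width and y by the height.
    return (y - np.asarray(b_c, dtype=float)) / np.array([b_w, b_h], dtype=float)

def denormalize_pose(y_norm, b_c, b_w, b_h):
    """Inverse of normalize_pose: map box-relative coordinates back to the image."""
    y_norm = np.asarray(y_norm, dtype=float)
    return y_norm * np.array([b_w, b_h], dtype=float) + np.asarray(b_c, dtype=float)

# Toy example (hypothetical numbers): a 200x400 box centered at (100, 200), two joints.
joints = np.array([[100.0, 200.0], [150.0, 100.0]])
normed = normalize_pose(joints, b_c=(100.0, 200.0), b_w=200.0, b_h=400.0)
restored = denormalize_pose(normed, b_c=(100.0, 200.0), b_w=200.0, b_h=400.0)
```

A joint at the box center maps to the origin, and the round trip through both functions returns the original coordinates.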

3.1 Pose Estimation as DNN-based Regression

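The network is AlexNet (see the post 【Keras-AlexNet】CIFAR-10). Its output is $\psi(x;\theta) \in \mathbb{R}^{2k}$ , i.e. the 2k coordinate values of the k joints

The final predicted joint coordinates in the original image are

$\overrightarrow{y}^* = N^{-1}(\psi(N(x);\theta))$ (2)

Eq. (2) reads as follows: the normalized image data is fed to AlexNet, the network predicts the joint coordinates in normalized space, and the inverse normalization maps them back onto the original image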

Fig.2.

The "parameter-free layers" are the LRN (local response normalization) layers and P (pooling) layers

In symbols, the network is:

$C (55 \times 55 \times 96) \rightarrow LRN \rightarrow P \rightarrow C (27 \times 27 \times 256) \rightarrow LRN \rightarrow P \rightarrow C (13 \times 13 \times 384) \rightarrow C (13 \times 13 \times 384) \rightarrow C (13 \times 13 \times 256) \rightarrow P \rightarrow F (4096) \rightarrow F (4096)$

where C is a convolutional layer and F is a fully connected layer
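The spatial sizes 55, 27 and 13 in the layer list can be checked with the usual convolution output-size formula. The sketch below assumes the standard AlexNet hyperparameters (227×227 input, 11×11 stride-4 first convolution without padding, 3×3 stride-2 pooling, size-preserving padding on the later convolutions). These values are not stated in this post, so treat them as assumptions.

```python
def out_size(size, kernel, stride=1, pad=0):
    """Output side length of a convolution or pooling layer (floor division)."""
    return (size + 2 * pad - kernel) // stride + 1

# Assumed standard AlexNet hyperparameters, not taken from this post.
c1 = out_size(227, kernel=11, stride=4)   # first convolution
p1 = out_size(c1, kernel=3, stride=2)     # pooling after LRN
c2 = out_size(p1, kernel=5, pad=2)        # second convolution
p2 = out_size(c2, kernel=3, stride=2)     # pooling after LRN
c3 = out_size(p2, kernel=3, pad=1)        # third convolution
c4 = out_size(c3, kernel=3, pad=1)        # fourth convolution
c5 = out_size(c4, kernel=3, pad=1)        # fifth convolution
p5 = out_size(c5, kernel=3, stride=2)     # final pooling before the F layers

sizes = [c1, c2, c3, c4, c5]
```

Under these assumptions the convolution outputs come out as 55, 27, 13, 13, 13, matching the C layers listed above.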

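During training, the input is the normalized dataset

$D_N = \{(N(x),N(\overrightarrow{y})) \mid (x,\overrightarrow{y}) \in D\}$

in which both the images and the pose vectors have been normalized

The loss is

$\arg\min_{\theta} \sum_{(x,\overrightarrow{y}) \in D_N} \sum_{i=1}^{k} \|\overrightarrow{y}_i - \psi_i(x;\theta)\|_2^2$

i.e. the L2 distance between the prediction and the true pose vector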
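As a small illustration of this L2 objective, the NumPy sketch below sums the squared distance over joints. The name `pose_l2_loss` is made up for this example, and averaging over the batch (rather than summing over the whole dataset) is an assumption chosen for convenience.

```python
import numpy as np

def pose_l2_loss(pred, target):
    """Squared L2 distance summed over joints, averaged over the batch.

    pred, target: arrays of shape (batch, k, 2) holding normalized coordinates.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    per_joint = np.sum((pred - target) ** 2, axis=-1)   # shape (batch, k)
    return float(np.mean(np.sum(per_joint, axis=-1)))

# Toy batch (hypothetical): 2 samples, 3 joints, one joint off by a distance of 0.5.
target = np.zeros((2, 3, 2))
pred = target.copy()
pred[0, 0] = [0.3, 0.4]
loss = pose_l2_loss(pred, target)
```

Only one joint in one of the two samples is wrong, so the loss is the squared error 0.25 averaged over two samples.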

The cascade idea is borrowed from "Deep convolutional network cascade for facial point detection"

3.2 Cascade of Pose Regressors

Fig 2

The input to each later stage is a sub-image cropped from the input of the earlier stage

subsequent pose regressors see higher resolution images and thus learn features for finer scales which ultimately leads to higher precision

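All stages use the same network architecture $\psi$ , but with different parameters $\theta$ . The regressor of stage $s$ is written $\psi(x;\theta_s)$ , where $s \in \{1,...,S\}$ indexes the stage. In the experiments $S$ is 3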

Stage 1：

$\overrightarrow{y^1} \leftarrow N^{-1}(\psi(N(x;b^0);\theta_1);b^0)$

where $b^0$ is the initial bounding box (the full image or a person detection)

Stage $s$ :

$\overrightarrow{y_i^s} \leftarrow \overrightarrow{y_i^{(s-1)}} + N^{-1}(\psi_i(N(x;b_i^{(s-1)});\theta_s);b_i^{(s-1)})$

where $b_i^s$ is updated as follows

$b_i^s \leftarrow (\overrightarrow{y_i^s},\sigma diam(\overrightarrow{y^s}),\sigma diam(\overrightarrow{y^s}))$

This crops a square sub-image centered at $\overrightarrow{y_i^s}$ with side length $\sigma diam(\overrightarrow{y^s})$ as the input to the next stage. Here $\sigma$ is a scaling factor and $diam$ is the diameter of the pose (it depends on the concrete pose definition and dataset). In this paper it is defined as the distance between a shoulder and hip from opposing sides
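The cascade at inference time can be sketched as a loop. This is a hedged illustration under several assumptions: `stage1` and the entries of `refiners` are hypothetical callables standing in for trained networks, the refinement output is treated as a displacement in box-normalized units (so undoing the normalization only rescales it), and the toy regressors simply know the answer.

```python
import numpy as np

def diam(y, shoulder, hip):
    """Pose diameter: distance between a shoulder and the hip on the opposite side."""
    return float(np.linalg.norm(y[shoulder] - y[hip]))

def cascade_predict(x, b0, stage1, refiners, sigma, shoulder, hip):
    b_c, b_w, b_h = b0
    # Stage 1: regress all joints inside the initial box, then undo the normalization.
    y = stage1(x, b0) * np.array([b_w, b_h], dtype=float) + np.asarray(b_c, dtype=float)
    for refine in refiners:
        side = sigma * diam(y, shoulder, hip)      # side of every joint box at this stage
        y_next = np.empty_like(y)
        for i in range(len(y)):
            b_i = (y[i], side, side)               # box centered on the current estimate
            # Assumption: the refiner returns a normalized displacement for joint i.
            y_next[i] = y[i] + refine(x, b_i, i) * side
        y = y_next
    return y

# Toy setup (all hypothetical): three joints and regressors that "know" the true pose.
TRUE_POSE = np.array([[40.0, 30.0], [60.0, 90.0], [50.0, 10.0]])

def toy_stage1(x, b):
    # A coarse guess in box-normalized coordinates.
    return np.array([[-0.05, -0.25], [0.15, 0.35], [0.0, -0.3]])

def toy_refiner(x, b_i, i):
    center, w, h = b_i
    return (TRUE_POSE[i] - center) / np.array([w, h])

image = np.zeros((100, 100))
coarse = cascade_predict(image, ((50.0, 50.0), 100.0, 100.0), toy_stage1, [], 1.0, 0, 1)
refined = cascade_predict(image, ((50.0, 50.0), 100.0, 100.0), toy_stage1, [toy_refiner], 1.0, 0, 1)
```

With an empty `refiners` list the function returns the stage 1 estimate. With the toy refiner, one refinement step moves every joint onto `TRUE_POSE`.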

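When training the cascade stages, simulated predictions are used: the GT joint locations are perturbed by a displacement sampled from a Gaussian distribution, and the perturbed locations stand in for the previous stage's predictions. The mean and variance of the Gaussian are computed from the displacements $\overrightarrow{y_i^{(s-1)}} - \overrightarrow{y_i}$ between the previous-stage predictions and the GT observed on the training set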

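Written as a formula, the simulated predictions give the augmented dataset

$D_A^s = \{(N(x;b),N(\overrightarrow{y_i};b)) \mid (x,\overrightarrow{y_i}) \sim D,\ \delta \sim \mathcal{N}_i^{(s-1)},\ b = (\overrightarrow{y_i}+\delta,\sigma diam(\overrightarrow{y}))\}$

The dataset $D$ thus becomes $D_A^s$ , and stage $s$ is trained on it with the same L2 loss

$\theta_s = \arg\min_{\theta} \sum_{(x,\overrightarrow{y_i}) \in D_A^s} \|\overrightarrow{y_i} - \psi_i(x;\theta)\|_2^2$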
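The simulated-prediction step can be illustrated with NumPy. This is a rough sketch under assumptions: the helper names are invented here, the displacement is taken as previous-stage prediction minus ground truth, and the "previous-stage predictions" in the example are random stand-ins rather than real network outputs.

```python
import numpy as np

def fit_displacement_gaussian(prev_pred, gt):
    """Mean and covariance of (prev_pred - gt) for one joint; inputs have shape (n, 2)."""
    d = np.asarray(prev_pred, dtype=float) - np.asarray(gt, dtype=float)
    return d.mean(axis=0), np.cov(d, rowvar=False)

def simulate_boxes(gt_joint, mean, cov, side, n_samples, rng):
    """Sample box centers around a GT joint and the matching regression targets."""
    delta = rng.multivariate_normal(mean, cov, size=n_samples)   # shape (n_samples, 2)
    centers = gt_joint + delta                # simulated previous-stage predictions
    targets = (gt_joint - centers) / side     # GT joint normalized by each square box
    return centers, targets

rng = np.random.default_rng(0)
gt = rng.uniform(0.0, 100.0, size=(50, 2))             # hypothetical GT joints
prev_pred = gt + rng.normal(0.0, 3.0, size=(50, 2))    # stand-in for stage s-1 output
mean, cov = fit_displacement_gaussian(prev_pred, gt)
centers, targets = simulate_boxes(np.array([40.0, 60.0]), mean, cov, 30.0, 5, rng)
```

Each sampled center defines one training crop, and the target is the offset the refinement network should predict to return to the true joint.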

4 Experiments

4.1 Datasets

1）Frames Labeled In Cinema (FLIC)

Official site: https://bensapp.github.io/flic-dataset.html

A face detector is used to get a rough person box (the face box is enlarged), and the joints are then detected inside it

4000 training and 1000 test images obtained from popular Hollywood movies

For the joint definitions, see the post 姿态估计数据集可视化【附代码】 (pose estimation dataset visualization, with code)

2）Leeds Sports Pose Dataset（LSP）

Official site: https://sam.johnson.io/research/lsp.html

The whole image is used as the initial input

11000 training and 1000 test images, 14 joints

The joints are listed below. For visualization examples, see the post 姿态估计数据集可视化【附代码】 (pose estimation dataset visualization, with code)

Right ankle
Right knee
Right hip
Left hip
Left knee
Left ankle
Right wrist
Right elbow
Right shoulder
Left shoulder
Left elbow
Left wrist
Neck
Head top

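That is the standard 12 joints plus the neck and head top. In visualizations, the neck is connected to the midpoint of the left and right hips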

3）ImageParse dataset

4）Buffy dataset

The evaluation metrics are

Percentage of Correct Parts（PCP）
Percentage of Detected Joints（PDJ）
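Both metrics can be sketched briefly. The definitions used here are the common ones and are assumptions relative to this post: PDJ counts a joint as detected when its distance to the true joint is within a given fraction of the torso diameter, and PCP (in its strict form) counts a limb as correct when both endpoints are within half the true limb length. The function names and toy data are hypothetical.

```python
import numpy as np

def pdj(pred, gt, torso_diam, fraction):
    """Per-joint detection rate. pred, gt: (n, k, 2); torso_diam: (n,)."""
    dist = np.linalg.norm(pred - gt, axis=-1)                    # shape (n, k)
    return np.mean(dist <= fraction * torso_diam[:, None], axis=0)

def pcp(pred, gt, limbs):
    """Per-limb correctness rate. limbs: list of (a, b) joint index pairs."""
    rates = []
    for a, b in limbs:
        limb_len = np.linalg.norm(gt[:, a] - gt[:, b], axis=-1)  # true limb length
        err_a = np.linalg.norm(pred[:, a] - gt[:, a], axis=-1)
        err_b = np.linalg.norm(pred[:, b] - gt[:, b], axis=-1)
        # Strict variant: both endpoints must be within half the limb length.
        ok = (err_a <= 0.5 * limb_len) & (err_b <= 0.5 * limb_len)
        rates.append(ok.mean())
    return np.array(rates)

# Toy data (hypothetical): 2 images, 2 joints forming one limb of length 10.
gt = np.array([[[0.0, 0.0], [0.0, 10.0]], [[0.0, 0.0], [0.0, 10.0]]])
pred = np.array([[[1.0, 0.0], [0.0, 10.0]], [[6.0, 0.0], [0.0, 10.0]]])
pdj_rates = pdj(pred, gt, torso_diam=np.array([10.0, 10.0]), fraction=0.2)
pcp_rates = pcp(pred, gt, limbs=[(0, 1)])
```

In the toy data the first joint is 1 pixel off in one image and 6 pixels off in the other, so it is detected in half of the images under both metrics.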

4.2 Results and Discussion

1）PCP metric on LSP dataset

Percentage of Correct Parts（PCP）

2）PDJ metric on FLIC and LSP dataset

Percentage of Detected Joints（PDJ）

PDJ on FLIC

2 joints, compared against 4 other methods. DeepPose uses stage 2

PDJ on LSP

4 joints, compared only against the method of reference 13

3）Effects of cascade-based refinement

The gains from the cascade stages fall mainly in the [0.15,0.2] interval

Cascading stages lets the network look at higher resolution inputs, but the later stages have more limited context

4）Cross-dataset Generalization

Trained on the LSP dataset, tested on the ImageParse dataset

Trained on the FLIC dataset, tested on the Buffy dataset

5 Conclusion（own） / Future work

In most of the cases, when the estimated pose is not precise, it still has a correct shape
Further, we show that using a generic convolutional neural network, which was originally designed for classification tasks, can be applied to the different task of localization.
In future, we plan to investigate novel architectures which could be potentially better tailored towards localization problems in general, and in pose estimation in particular