[Paper Study] Recurrent Human Pose Estimation - CSDN Blog

Article link: https://blog.csdn.net/m0_37644085/article/details/88855148

Recurrent Human Pose Estimation (builds a simpler architecture)

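The main points of interest are:

1. Using a recurrent module essentially increases the effective receptive field without introducing additional parameters.

2. The relationship between a larger receptive field and keypoint localization accuracy.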
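To make point 1 concrete, here is a minimal sketch (not the paper's network) that applies one shared 3×3 kernel repeatedly to a unit impulse and measures how far the response spreads. The helper names `conv2d_same` and `support_width`, the averaging kernel, and the 31×31 grid are assumptions chosen for illustration; nonlinearities and channels are left out.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Stride-1 2D convolution with zero padding; output keeps the input shape."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad)  # zero padding on every side
    out = np.zeros_like(image, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += kernel[dy, dx] * padded[dy:dy + image.shape[0],
                                           dx:dx + image.shape[1]]
    return out

def support_width(kernel, iterations, size=31):
    """Width of the region reached by one input pixel after `iterations` passes."""
    x = np.zeros((size, size))
    x[size // 2, size // 2] = 1.0  # unit impulse at the centre
    for _ in range(iterations):
        x = conv2d_same(x, kernel)  # the same kernel is reused on every pass
    cols = np.flatnonzero(x.sum(axis=0) > 0)
    return int(cols[-1] - cols[0] + 1)

# One shared 3x3 kernel: 9 weights, no matter how many passes are run.
kernel = np.ones((3, 3)) / 9.0
for t in (1, 2, 3, 4):
    print(t, support_width(kernel, t), kernel.size)
# prints widths 3, 5, 7, 9 while the weight count stays at 9
```

The spread grows by `kernel_size - 1` per pass, while the number of weights does not change, which is the sense in which a recurrent module enlarges the effective receptive field "for free".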

Abstract

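The paper proposes a ConvNet model for predicting 2D human pose in an image. The model regresses a heatmap representation for each body keypoint, and can learn and represent both part appearance and the context of the part configuration.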

We make the following three contributions: (i) an architecture combining a feed forward module with a recurrent module, where the recurrent module can be run iteratively to improve the performance (by increasing the effective receptive field of the network); (ii) the model can be trained end-to-end and from scratch, with auxiliary losses incorporated to improve performance; (iii) we investigate whether keypoint visibility can also be predicted, a preliminary investigation into improving occlusion prediction in human pose estimation.

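In summary, the three contributions are: (1) an architecture that combines a feed-forward module with a recurrent module, where the recurrent module can be run iteratively to enlarge the network's effective receptive field (from a point to a region) and thereby improve performance; (2) the model can be trained end-to-end and from scratch, with auxiliary losses added to improve performance; (3) an investigation of whether keypoint visibility can also be predicted, i.e. a preliminary study on improving occlusion prediction in human pose estimation.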

The model is evaluated on two benchmark datasets. The result is a simple architecture that achieves performance on par with the state of the art, but without the complexity of a graphical model stage (or layers).

I. INTRODUCTION

For now, a record of the cited papers and methods that are worth following up:

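The authors mainly build on [27], Flowing ConvNets for Human Pose Estimation in Videos, which uses optical flow to strengthen heatmaps in video. [9], Human Pose Estimation with Iterative Error Feedback, uses feedback to correct errors, mainly errors made in early predictions. Its comparison experiments are worth a look: predicting the correction in a single step versus iterating repeatedly; iterating on the error versus iteratively predicting the target position directly; and correcting with a fixed step size versus an unconstrained one. [9] is an interesting hybrid: it switches between regressing pose coordinates directly (as the output of the iterative module) and using heatmaps as input (to the iterative module).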
In addition, our model shares with Convolutional Pose Machines [43] and the hourglass model [25] the motivation of using large convolution kernels to capture more context (originally proposed by Pfister et al. [27]).

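The model shares its motivation with Convolutional Pose Machines [43] and the hourglass model [25]: using large convolution kernels to capture more context (originally proposed by Pfister et al. [27]). A note on [43] and the role of receptive field size: there are two ways to enlarge the receptive field. Pooling sacrifices localization precision; enlarging the kernel size increases the number of parameters and brings a risk of vanishing gradients during training.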

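Unlike these methods, the authors use a recurrent convolutional neural network to enlarge the receptive field, which cuts the number of trained parameters by orders of magnitude. Including the recurrent module multiple times is analogous to stacking more hourglass modules in [25].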

Early methods using ConvNets predicted pose coordinates of human keypoints directly (as 2D coordinates) [39]. An alternative, which it turns out might be better suited to ConvNets, is an indirect prediction by first regressing a heatmap over the image for each keypoint, and then obtaining the keypoint position as a mode in this heatmap [20], [27], [38], [37]. The advantage of the heatmap over direct prediction is threefold: it mostly avoids problems with ConvNets predicting real values; it can handle multiple instances in the image (e.g. if there are several hands present and consequently several corresponding hand keypoints); and it can represent uncertainty by multiple modes.

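In short: early ConvNet methods regressed the 2D keypoint coordinates directly [39]. The alternative, which seems to suit ConvNets better, is indirect: regress one heatmap per keypoint over the image, then take the keypoint position as a mode of that heatmap [20], [27], [38], [37]. Heatmaps have three advantages over direct prediction: they mostly avoid the difficulty ConvNets have with predicting real-valued outputs; they can handle multiple instances in one image (for example several hands, and therefore several hand keypoints); and they can express uncertainty through multiple modes.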
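The "position as a mode of the heatmap" step can be sketched as follows. This is a minimal illustration, not the paper's code: the synthetic Gaussian heatmaps, the 64×64 resolution, and the function names `gaussian_heatmap` and `keypoints_from_heatmaps` are all assumptions, and only the single strongest mode per keypoint is read out.

```python
import numpy as np

def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Synthetic heatmap: a Gaussian bump centred on pixel (cx, cy)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def keypoints_from_heatmaps(heatmaps):
    """Map a (K, H, W) stack of heatmaps to one (x, y, score) tuple per keypoint."""
    results = []
    for hm in heatmaps:
        # argmax works on the flattened map; unravel_index restores (row, col)
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        results.append((int(x), int(y), float(hm[y, x])))
    return results

# Two hypothetical keypoints standing in for a network's output.
heatmaps = np.stack([
    gaussian_heatmap(64, 64, cx=20, cy=30),
    gaussian_heatmap(64, 64, cx=45, cy=10),
])
print(keypoints_from_heatmaps(heatmaps))
# → [(20, 30, 1.0), (45, 10, 1.0)]
```

The score at the peak can serve as a rough confidence value; handling several instances or ambiguous predictions would require extracting more than one local maximum per heatmap.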

Furthermore, combining heatmaps with large convolutional kernels and deeper models [8], [23], [25], [27], [43] improves performance, since the effective receptive field, and consequently the context captured, is increased. For example, Pfister et al. [27] added several large convolutions (e.g. 13×13 kernels). However, a disadvantage is that this increases the number of parameters and makes the optimization more difficult. In our model, we employ a recurrent module that essentially increases the effective receptive field without introducing additional parameters.

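In short: combining heatmaps with large convolution kernels and deeper models [8], [23], [25], [27], [43] improves performance, because the effective receptive field, and with it the captured context, grows (for example, [27] adds several convolutions with 13×13 kernels). The drawback is more parameters and a harder optimization problem. The model in this paper instead uses a recurrent module, which essentially enlarges the effective receptive field without introducing additional parameters.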
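A back-of-the-envelope comparison helps to see the trade-off. The sketch below assumes stride-1 convolutions with a hypothetical channel width of 64 and a 3×3 shared kernel; these numbers are illustrative and are not the layer configuration used in the paper. The helpers `conv_params` and `receptive_field` are likewise made up for this example.

```python
def conv_params(kernel_size, in_channels, out_channels, bias=True):
    """Number of learnable parameters in one 2D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)

def receptive_field(kernel_size, num_passes):
    """Receptive field of `num_passes` stride-1 convolutions applied in sequence."""
    return 1 + num_passes * (kernel_size - 1)

C = 64  # hypothetical channel width, not taken from the paper

# Three ways to reach a 13x13 receptive field:
one_large = conv_params(13, C, C)        # a single 13x13 layer
six_distinct = 6 * conv_params(3, C, C)  # six separate 3x3 layers
shared_small = conv_params(3, C, C)      # one 3x3 layer run six times

print(receptive_field(13, 1), one_large)    # → 13 692288
print(receptive_field(3, 6), six_distinct)  # → 13 221568
print(receptive_field(3, 6), shared_small)  # → 13 36928
```

Under these assumptions, reusing one small layer reaches the same receptive field with roughly 19 times fewer parameters than the single large kernel, at the cost of running the layer several times.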