Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

EntropyPlus

Categories: Computer Vision, Literature Reading

Article link: https://blog.csdn.net/u012759262/article/details/110819864


1 Summary

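To be filled in last, once the notes are finished: an overview of the paper, to be read first when revisiting these notes. Note: the summary must come from your own thinking and be written in your own words. Do not copy the original text with Ctrl + C.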

2 Research Objective(s)

Use a bottom-up approach to estimate the skeleton keypoints of every person in an image.

3 Problem Statement

An unknown number of people can appear in the image, at any position or scale.
Interactions between people induce complex spatial interference.
Runtime complexity tends to grow with the number of people in the image.

4 Method(s)

Input: a color image of size $w \times h$.
Output: the anatomical keypoints of each person.
Intermediate step: a feed-forward network predicts a set of 2D confidence maps $S$ of body part locations (Fig. 2b) and a set of 2D vector fields $L$ of part affinities, which encode the degree of association between parts. Each confidence map covers the locations of all body parts of one type, for example all elbows in the image.

There are $J$ confidence maps, one per keypoint type: $\bold{S}_j \in \mathbb{R}^{w \times h}$ gives the location confidence of the $j$-th keypoint type. There are $C$ vector fields, one per limb: $\bold{L}_c \in \mathbb{R}^{w \times h \times 2}$ stores, for the $c$-th limb (for example, an arm), a 2D vector at every image location, encoding position and orientation.

Q1: Can an "image location" here be understood as a pixel?

4.2.1. Simultaneous Detection and Association

The network is split into two branches:

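the top branch ----> predicts the confidence maps. Each keypoint type corresponds to one feature map produced by the network, so with $J$ keypoint types there are $J$ feature maps in total;
the bottom branch ----> predicts the affinity fields, which measure how strongly two keypoints are associated (a pair that is actually connected has a strong affinity). Note that each PAF stores a 2D vector (x and y components) at every location.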

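In the network architecture, $\bold{F}$ is the feature map produced by the first 10 layers of VGG-19 after fine-tuning; $\bold{S}^1=\rho^1(\bold{F})$ are the confidence maps of the first stage; $\bold{L}^1=\phi^1(\bold{F})$ are the part affinity fields of the first stage; $\rho^t(\cdot), \phi^t(\cdot)$ are the CNNs of stage $t$.
Each later stage takes the concatenation of the feature map $\bold{F}$, the detection confidence maps $\bold{S}^{t-1}$ and the part affinity fields $\bold{L}^{t-1}$ of the previous stage:

$$\bold{S}^t=\rho^t(\bold{F}, \bold{S}^{t-1}, \bold{L}^{t-1}), \quad \forall t \ge 2$$

$$\bold{L}^t=\phi^t(\bold{F}, \bold{S}^{t-1}, \bold{L}^{t-1}), \quad \forall t \ge 2$$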
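
To make the stage-wise concatenation concrete, here is a minimal NumPy sketch. It is only an illustration of the data flow: the real stages are multi-layer CNNs, whereas `make_branch` below is a hypothetical stand-in (one random 1x1 convolution), and the sizes `h`, `w`, `F_ch`, `J`, `C`, `T` are made-up values. The $C$ vector fields are stored here as $2C$ channels.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w = 8, 8              # spatial size of the feature maps (illustrative)
F_ch, J, C = 16, 3, 2    # feature channels, keypoint types, limb types (illustrative)

def make_branch(in_ch, out_ch):
    """Hypothetical stand-in for a stage CNN: one random 1x1 convolution."""
    weight = rng.normal(scale=0.1, size=(in_ch, out_ch))
    return lambda x: x @ weight          # (h, w, in_ch) -> (h, w, out_ch)

F = rng.normal(size=(h, w, F_ch))        # stand-in for the VGG-19 features

# Stage 1 only sees F.
rho_1, phi_1 = make_branch(F_ch, J), make_branch(F_ch, 2 * C)
S, L = rho_1(F), phi_1(F)

# Stages t >= 2 see F concatenated with the previous stage's S and L.
T = 3
for t in range(2, T + 1):
    x = np.concatenate([F, S, L], axis=-1)   # (h, w, F_ch + J + 2C)
    rho_t = make_branch(x.shape[-1], J)
    phi_t = make_branch(x.shape[-1], 2 * C)
    S, L = rho_t(x), phi_t(x)
```

Every stage outputs maps of the same shape, which is what allows the same ground truth to supervise each stage.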

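The loss functions of the two branches at stage $t$ can then be written as:

$$f_{\bold{S}}^t = \sum_{j=1}^{J} \sum_{\bold{p}} \bold{W}(\bold{p}) \cdot \|\bold{S}_j^t(\bold{p}) - \bold{S}_j^*(\bold{p})\|_2^2$$

$$f_{\bold{L}}^t = \sum_{c=1}^{C} \sum_{\bold{p}} \bold{W}(\bold{p}) \cdot \|\bold{L}_c^t(\bold{p}) - \bold{L}_c^*(\bold{p})\|_2^2$$

In these formulas, $\bold{S}^*_j$ is the ground-truth part confidence map, $\bold{L}^*_c$ is the ground-truth part affinity vector field, and $\bold{W}$ is a binary mask with $\bold{W}(\bold{p}) = 0$ when the annotation is missing at image location $\bold{p}$.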

The mask is used to avoid penalizing the true positive predictions during training.
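The loss over the whole training process sums the losses of all $T$ stages:

$$f = \sum_{t=1}^{T} \left(f_{\bold{S}}^t + f_{\bold{L}}^t\right)$$

where $f_{\bold{S}}^t$ and $f_{\bold{L}}^t$ are the stage-$t$ losses of the confidence-map branch and the PAF branch.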
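
A small NumPy sketch of the masked L2 loss may help here. The function names `stage_loss` and `total_loss` and the array layouts (`(J, h, w)` for confidence maps, `(C, h, w, 2)` for PAFs, `(h, w)` for the mask) are my own assumptions for illustration, not the authors' code.

```python
import numpy as np

def stage_loss(S_pred, S_gt, L_pred, L_gt, W):
    """Masked L2 loss of one stage.

    S_pred, S_gt: (J, h, w) confidence maps.
    L_pred, L_gt: (C, h, w, 2) part affinity fields.
    W: (h, w) binary mask, 0 where the annotation is missing.
    """
    f_S = np.sum(W[None] * (S_pred - S_gt) ** 2)
    # The squared L2 norm of a 2D vector is the sum of its squared components.
    f_L = np.sum(W[None, :, :, None] * (L_pred - L_gt) ** 2)
    return f_S + f_L

def total_loss(stage_preds, S_gt, L_gt, W):
    """Sum the stage losses; stage_preds is a list of (S_pred, L_pred) pairs."""
    return sum(stage_loss(S, S_gt, L, L_gt, W) for S, L in stage_preds)
```

Because every stage contributes its own term, each stage receives a direct supervision signal instead of relying only on gradients flowing back from the last stage.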

Q2: How are the ground-truth part confidence maps and the ground-truth part affinity vector fields (position and orientation) computed?

4.2.2 Confidence Maps for Part Detection

This part explains how to compute $\bold{S}^*$.

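$\bold{S}_{j,k}^*$ is the confidence map for the location of the $j$-th part of the $k$-th person.
It is defined as a Gaussian peak centered on the annotated position $\bold{x}_{j,k}$, where $\sigma$ controls the spread of the peak:

$$\bold{S}_{j,k}^*(\bold{p}) = \exp\left(-\frac{\|\bold{p} - \bold{x}_{j,k}\|_2^2}{\sigma^2}\right)$$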
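For multiple keypoints in one image (red and blue dots), the gray lines in panel (a) of the figure show all candidate connections. The correct connections are then drawn as black lines and the incorrect ones as green lines, as in panel (b), and finally the PAFs are applied (the yellow arrows in panel (c)).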

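The final map $\bold{S}_j^*$ keeps, at every location, the "clearest" (largest) response among the maps $\bold{S}_{j,k}^*$ of all people:

$$\bold{S}_j^*(\bold{p}) = \max_k \bold{S}_{j,k}^*(\bold{p})$$

As for why the max is used instead of the mean, the authors include a dedicated figure: with the max, peaks that lie close together remain distinct.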
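
The construction of the ground-truth confidence map can be sketched in NumPy as follows. This is a minimal sketch under my own assumptions: the function name `part_confidence_map`, the `(x, y)` ordering of positions, and the default `sigma` are illustrative choices, not values from the paper.

```python
import numpy as np

def part_confidence_map(h, w, positions, sigma=2.0):
    """Ground-truth confidence map for one keypoint type.

    positions: list of (x, y) annotated positions, one per person.
    Returns an (h, w) array indexed as [y, x].
    """
    ys, xs = np.mgrid[0:h, 0:w]
    S = np.zeros((h, w))
    for x, y in positions:
        dist_sq = (xs - x) ** 2 + (ys - y) ** 2
        # Per-person Gaussian peak, merged with a pixel-wise max.
        S = np.maximum(S, np.exp(-dist_sq / sigma ** 2))
    return S

# Two people whose keypoints of this type are 5 pixels apart.
S = part_confidence_map(10, 12, [(3, 4), (8, 4)])
```

With the max, each annotated position keeps a peak of exactly 1, whereas averaging would flatten two nearby peaks toward each other.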

4.2.3 Part Affinity Fields for Part Association

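This section addresses the question of which key points belong to the same person. A common idea is to detect the midpoint between two key points and use it as the association cue, but this does not work in crowded scenes, for the following reasons: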

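midpoints only encode position and carry no orientation information
taking an arm as an example, the whole arm is reduced to a single point, so the information carried by the rest of that region is wasted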

So the authors use part affinity fields to preserve both position and orientation information. This comes down to how to compute $\bold{L}^*_c$, as illustrated in the figure below:
[Figure]
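Let $\bold{x}_{j1,k}$ and $\bold{x}_{j2,k}$ be the two endpoints ($j_1, j_2$) of arm $c$ of the $k$-th person. If a point $\bold{p}$ lies on the arm, then $\bold{L}^*_{c,k}(\bold{p})$ is the unit vector pointing from $j_1$ to $j_2$; for every other point the vector is 0. The expression for $\bold{L}^*_{c,k}(\bold{p})$ is:

$$\bold{L}^*_{c,k}(\bold{p}) = \begin{cases} \bold{v} & \text{if } \bold{p} \text{ is on limb } c, k \\ \bold{0} & \text{otherwise} \end{cases}, \qquad \bold{v} = \frac{\bold{x}_{j2,k} - \bold{x}_{j1,k}}{\|\bold{x}_{j2,k} - \bold{x}_{j1,k}\|_2}$$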
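Decomposing $\bold{p} - \bold{x}_{j1,k}$ into components along and perpendicular to the limb gives the conditions for $\bold{p}$ to count as being on the limb:

$$0 \le \bold{v} \cdot (\bold{p} - \bold{x}_{j1,k}) \le l_{c,k} \quad \text{and} \quad |\bold{v}_{\perp} \cdot (\bold{p} - \bold{x}_{j1,k})| \le \sigma_l$$

where $l_{c,k} = \|\bold{x}_{j2,k} - \bold{x}_{j1,k}\|_2$ is the limb length, $\sigma_l$ is the limb width, and $\bold{v}_{\perp}$ is a unit vector perpendicular to $\bold{v}$.

The ground-truth part affinity field between the two key points, over all people, is then:

$$\bold{L}^*_c(\bold{p}) = \frac{1}{n_c(\bold{p})} \sum_k \bold{L}^*_{c,k}(\bold{p})$$

where $n_c(\bold{p})$ is the number of non-zero vectors at point $\bold{p}$ across all $k$ people. (In plain words: where the limbs of different people overlap, averaging the vectors at $\bold{p}$ still gives a direction for the connection between the two key points. The field assumes that a limb runs straight from one key point to the other, for example from the elbow to the wrist, so I suspect it would fail badly on someone with a fractured arm.)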
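
Here is a minimal NumPy sketch of how such a ground-truth PAF could be rasterized for one limb type. The function name `limb_paf`, the `(x, y)` point convention, and the default limb width `sigma_l` are assumptions of mine for illustration.

```python
import numpy as np

def limb_paf(h, w, limbs, sigma_l=1.0):
    """Ground-truth PAF for one limb type.

    limbs: list of (x_j1, x_j2) pairs, one per person; each point is (x, y).
    Returns an (h, w, 2) array indexed as [y, x], holding (x, y) components.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    P = np.stack([xs, ys], axis=-1).astype(float)   # every pixel as p = (x, y)
    total = np.zeros((h, w, 2))
    count = np.zeros((h, w))                        # n_c(p)
    for x1, x2 in limbs:
        x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
        length = np.linalg.norm(x2 - x1)
        if length == 0:
            continue                                # degenerate limb, no direction
        v = (x2 - x1) / length                      # unit vector j1 -> j2
        v_perp = np.array([-v[1], v[0]])
        along = (P - x1) @ v
        across = np.abs((P - x1) @ v_perp)
        on_limb = (along >= 0) & (along <= length) & (across <= sigma_l)
        total[on_limb] += v
        count += on_limb
    # Average over the people whose limb covers each pixel; zero elsewhere.
    return total / np.maximum(count, 1)[..., None]

# One horizontal limb and one vertical limb crossing at pixel (x=5, y=5).
paf = limb_paf(10, 12, [((2, 5), (8, 5)), ((5, 2), (5, 8))])
```

At the crossing pixel the two unit vectors $(1, 0)$ and $(0, 1)$ are averaged to $(0.5, 0.5)$, which is the $1/n_c(\bold{p})$ normalization at work.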

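To measure the association between candidate part detections at test time, the authors compute a line integral between the candidate points (in plain words: they measure the alignment of the predicted PAF with the candidate limb that would be formed by connecting the detected body parts).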

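Specifically, for two candidate part locations $\bold{d}_{j1}$ and $\bold{d}_{j2}$, the authors sample the predicted PAF $\bold{L}_c$ along the segment connecting the two key points:

$$E = \int_{u=0}^{u=1} \bold{L}_c(\bold{p}(u)) \cdot \frac{\bold{d}_{j2}-\bold{d}_{j1}}{||\bold{d}_{j2}-\bold{d}_{j1}||_2} \, du$$

where: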

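$\frac{\bold{d}_{j2}-\bold{d}_{j1}}{||\bold{d}_{j2}-\bold{d}_{j1}||}$ is the unit vector between the two points
$\bold{p}(u)=(1-u)\bold{d}_{j1}+u\bold{d}_{j2}$ means the sample point can be any point between $\bold{d}_{j1}$ and $\bold{d}_{j2}$
The dot product of these two vectors measures how similar they are, much as a correlation does in signal processing. In practice, the integral is approximated by sampling and summing uniformly spaced values of $u$.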
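
The sampled approximation of the line integral can be sketched as follows. The function name `association_score`, the nearest-pixel lookup, and the default `num_samples` are illustrative assumptions; the sketch averages the sampled dot products, which rescales the sum but preserves the ranking of candidate pairs.

```python
import numpy as np

def association_score(paf, d1, d2, num_samples=10):
    """Approximate E for one candidate limb.

    paf: (h, w, 2) predicted field L_c, indexed [y, x], holding (x, y) components.
    d1, d2: candidate part locations as (x, y).
    """
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    length = np.linalg.norm(d2 - d1)
    if length == 0:
        return 0.0                                   # no direction to compare against
    direction = (d2 - d1) / length
    us = np.linspace(0.0, 1.0, num_samples)
    points = (1 - us)[:, None] * d1 + us[:, None] * d2   # p(u) for each u
    # Nearest-pixel lookup, clipped to the image bounds.
    xs = np.clip(np.rint(points[:, 0]).astype(int), 0, paf.shape[1] - 1)
    ys = np.clip(np.rint(points[:, 1]).astype(int), 0, paf.shape[0] - 1)
    return float(np.mean(paf[ys, xs] @ direction))

# A field pointing in +x everywhere: aligned, reversed and perpendicular limbs.
field = np.zeros((8, 10, 2))
field[..., 0] = 1.0
aligned = association_score(field, (1, 3), (7, 3))
reversed_limb = association_score(field, (7, 3), (1, 3))
perpendicular = association_score(field, (3, 1), (3, 6))
```

A candidate limb that runs along the field scores close to 1, one that runs against it scores close to -1, and one perpendicular to it scores close to 0.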

4.2.4. Multi-Person Parsing using PAFs

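Since finding the optimal parse is an NP-hard problem, the authors use a greedy relaxation to find high-quality matches. They speculate that this works because the pair-wise association scores implicitly encode global context (these scores come out of the large receptive field of the PAF network).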

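Step one: detect the candidate key points of all people, $D_J=\{\bold{d}^m_j : j \in \{1 \ldots J\}, m \in \{1 \ldots N_j\}\}$, where $N_j$ is the number of candidates for part $j$ (in plain words: if $j$ is the head, $N_j$ is the number of head detections in the image, whoever they belong to) and $\bold{d}_j^m$ is the location of the $m$-th candidate of part $j$. Put simply, the $m$-th head candidate can be written as $\bold{d}_{\text{head}}^m$. The next task is to find which candidate of the neighboring part it connects to, for example its own neck rather than someone else's neck.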

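Define a variable $z_{j_1j_2}^{mn}\in \{0, 1\}$ that indicates whether $\bold{d}_{j1}^m$ and $\bold{d}_{j2}^n$ are connected. The final goal is to find all connected pairs, that is, the best assignment for the set $Z$ of all possible connections. Taking the right hip and the neck in the figure below as an example, a graph is built whose nodes are the candidates in $D_{j1}$ and $D_{j2}$ and whose edges are all possible connections. The goal is a matching in this bipartite graph in which no two chosen edges share a node, because a shared node would mean that some point is connected incorrectly. The objective is to find a matching with maximum weight for the chosen edges.
[Figure]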

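The formulas alone are too abstract, so here is a concrete case. The figure above shows 2 people, labeled $a, b$ from left to right; $J1$ denotes the orange points, $J2$ the blue points and $J3$ the green points. Consider panel (c), which is the tree structure the authors adopt.
For person $a$, the orange point is written $j_1^a$, the blue point $j_2^a$ and the green point $j_3^a$;
For person $b$, the orange point is written $j_1^b$, the blue point $j_2^b$ and the green point $j_3^b$;
Therefore: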

$D_{J1}=\{ \bold{d}_1^a, \bold{d}_1^b\}$, where $N_{j1}=2$ is the number of candidates for $j_1$; likewise, $D_{J3}=\{ \bold{d}_3^a, \bold{d}_3^b\}$, where $N_{j3}=2$ is the number of candidates for $j_3$.
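$\bold{d}_1^a$ is the coordinate of the orange point (the shoulder, $j_1$) of person $a$
$z_{j_1j_2}^{ab}=0$ means that $j^a_1$ and $j^b_2$ are not connected; in the figure, $z_{j_2j_3}^{aa}=1$ means that $j^a_2$ and $j^a_3$ are connected.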

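The final objective is to find a matching with maximum weight for the chosen edges:

$$\max_{Z_c} E_c = \max_{Z_c} \sum_{m \in D_{j1}} \sum_{n \in D_{j2}} E_{mn} \cdot z_{j_1j_2}^{mn}$$

$$\text{s.t.} \quad \forall m \in D_{j1}, \ \sum_{n \in D_{j2}} z_{j_1j_2}^{mn} \le 1, \qquad \forall n \in D_{j2}, \ \sum_{m \in D_{j1}} z_{j_1j_2}^{mn} \le 1$$

where $E_{mn}$ is the line-integral score between the candidates $\bold{d}_{j1}^m$ and $\bold{d}_{j2}^n$, and $Z_c$ is the subset of $Z$ for limb $c$.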

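The constraints are easy to understand: pairs must be one-to-one. A node $j_1^m$ can be matched to at most one of the nodes $j_2^n$, and the same holds the other way round for $j_2$.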
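
To illustrate the one-to-one constraint, here is a simple greedy matcher in NumPy. Note that this is a simplification I am substituting for illustration: the paper reports using the Hungarian algorithm to obtain the optimal bipartite matching, whereas this sketch just accepts pairs in descending score order. The function name `greedy_match` and the rule of discarding non-positive scores are my own assumptions.

```python
import numpy as np

def greedy_match(E):
    """Greedy approximation of the maximum-weight bipartite matching.

    E: (N1, N2) array, E[m, n] = association score between candidate m
       of part j1 and candidate n of part j2.
    Returns a list of (m, n) pairs in which no candidate appears twice.
    """
    E = np.asarray(E, float)
    used_m, used_n, pairs = set(), set(), []
    # Visit candidate pairs from the highest score to the lowest.
    for flat in np.argsort(-E, axis=None):
        m, n = (int(i) for i in np.unravel_index(flat, E.shape))
        if E[m, n] <= 0:
            break                      # remaining pairs have no positive support
        if m in used_m or n in used_n:
            continue                   # would violate the one-to-one constraint
        used_m.add(m)
        used_n.add(n)
        pairs.append((m, n))
    return pairs

pairs = greedy_match([[0.9, 0.1],
                      [0.2, 0.8]])
```

Greedy selection is not always optimal. For the scores `[[0.9, 0.8], [0.7, -0.5]]` it keeps only the pair `(0, 0)` for a total of 0.9, while the assignment `(0, 1)`, `(1, 0)` reaches 1.5, which is why an exact solver is needed when the optimum matters.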

5 Evaluation

6 Conclusion

7 Notes(optional)

8 References(optional)

List the most relevant literature here so that it can be tracked later.
