Keywords: human pose estimation, convolutional neural network, 2D-3D joint optimization
Abstract
We tackle the 3D human pose estimation task with end-to-end learning using CNNs. Relative 3D positions between one joint and the other joints are learned via CNNs.
Two main contributions:
(1) We add 2D pose information when estimating a 3D pose from an image, by concatenating the 2D pose estimation result with the features extracted from the image.
(2) We found that more accurate 3D poses are obtained by combining information on relative positions with respect to multiple joints, instead of just one root joint.
Introduction
Gives an overall introduction to human pose estimation: 2D CNN methods lead into 3D CNN methods, extending the advantages of CNNs to 3D. Summarizes the drawbacks of the CNN networks in 【5】【6】【7】 and describes the advantage of adding 2D information (from 2D pose information, undesirable 3D joint positions which generate an unnatural human pose may be discarded).
Framework
We propose a simple yet powerful 3D human pose estimation framework based on the regression of joint positions using CNNs. We introduce two strategies to improve the regression results over the baseline CNNs.
(1) Not only the image features but also the 2D joint classification results are used as input features for 3D pose estimation; this scheme successfully incorporates the correlation between 2D and 3D poses.
(2) The relative position of each joint is estimated with respect to multiple root joints rather than a single root joint; this scheme effectively reduces the error of the joints that are far from the root joint.
Related Work
Mainly reviews CNN-based 2D and 3D human pose estimation (see the original paper for details).
The method proposed in this paper aims to provide an end-to-end learning framework to estimate the 3D structure of a human body from a single image. Similar to 【5】, 3D and 2D pose information are jointly learned in a single CNN. Unlike the previous works, we directly propagate the 2D classification results to the 3D pose regressors inside the CNN.
3D-2D Joint Estimation of Human Body Using CNN
The key idea of our method is to train a CNN that performs 3D pose estimation using both image features from the input image and 2D pose information retrieved from the same CNN. In other words, the proposed CNN is trained for both the 2D joint classification and 3D joint regression tasks simultaneously.
Structure of the Baseline CNN
The CNN used in this experiment consists of five convolutional layers, three pooling layers, two parallel sets of two fully connected layers, and loss layers for 2D and 3D pose estimation tasks. The CNN accepts a 225 × 225 sized image as an input. The sizes and the numbers of filters as well as the strides are specified in Figure 1. The filter sizes of convolutional and pooling layers are the same as those of ZFnet 【21】, but we reduced the number of feature maps to make the network smaller.
We divide the input image into $N_g \times N_g$ grids and treat each grid as a separate class, which results in $N_g^2$ classes per joint. The target probability for the $i$th grid $g_i$ of the $j$th joint is inversely proportional to the distance from the ground-truth position:
$$\hat p_j(g_i)=\frac{d^{-1}(\hat y_j,c_i)\,I(g_i)}{\sum_{k=1}^{N_g^2} d^{-1}(\hat y_j,c_k)\,I(g_k)} \quad (1)$$
where $d^{-1}(x,y)$ is the inverse of the Euclidean distance between the points $x$ and $y$ in the 2D pixel space, $\hat y_j$ is the ground-truth position of the $j$th joint in the image, and $c_i$ is the center of the grid $g_i$. $I(g_i)$ is an indicator function that is equal to 1 if the grid $g_i$ is one of the four nearest neighbors of the ground-truth position:
$$I(g_i)=\begin{cases}1 & \text{if } d(\hat y_j,c_i)<\omega_g\\ 0 & \text{otherwise,}\end{cases} \quad (2)$$
where $\omega_g$ is the grid width. Hence, a higher probability is assigned to grids closer to the ground-truth joint position, and $\hat p_j(g_i)$ is normalized so that the sum of the class probabilities is equal to 1. Finally, the objective of the 2D classification task for the $j$th joint is to minimize the following cross-entropy loss function:
$$L_{2D}(j)=-\sum_{i=1}^{N_g^2}\hat p_j(g_i)\log p_j(g_i), \quad (3)$$
where $p_j(g_i)$ is the probability that comes from the softmax output of the CNN.
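The construction of the soft 2D target distribution in (1)–(2) and the cross-entropy loss (3) can be sketched as follows; a minimal NumPy sketch, assuming a 225-pixel input and a hypothetical grid resolution `Ng=16` (the paper's actual setting may differ), with the cell width used as the threshold $\omega_g$:

```python
import numpy as np

def target_probs(y_hat, Ng=16, cell_w=225 / 16):
    """Soft 2D target distribution of Eqs. (1)-(2) for one joint.

    y_hat  : (2,) ground-truth joint position in pixels.
    Ng     : grid resolution (hypothetical value).
    cell_w : width of one grid cell, used as the threshold w_g.
    Returns a flattened (Ng*Ng,) probability vector summing to 1.
    """
    # centers c_i of all Ng x Ng grid cells
    ticks = (np.arange(Ng) + 0.5) * cell_w
    cx, cy = np.meshgrid(ticks, ticks)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)  # (Ng^2, 2)

    d = np.linalg.norm(centers - y_hat, axis=1)      # d(y_hat, c_i)
    indicator = (d < cell_w).astype(float)           # I(g_i), Eq. (2)
    weights = indicator / np.maximum(d, 1e-8)        # numerator of Eq. (1)
    return weights / weights.sum()                   # normalize to sum to 1

def cross_entropy_2d(p_hat, p_softmax):
    """2D classification loss of Eq. (3) for one joint."""
    return -np.sum(p_hat * np.log(p_softmax + 1e-12))
```

Only the few grid cells within $\omega_g$ of the ground-truth position receive nonzero probability, with more mass on closer cells.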
Estimating the 3D position of joints is formulated as a regression task. Since the search space is much larger than in the 2D case, it is undesirable to solve 3D pose estimation as a classification task. The 3D loss function is designed as the squared Euclidean distance between the prediction and the ground truth. We estimate the 3D position of each joint relative to the root node. The loss function for the $j$th joint when the root node is the $r$th joint becomes
$$L_{3D}(j,r)=\|R_j-(\hat J_j-\hat J_r)\|^2 \quad (4)$$
where $R_j$ is the predicted relative 3D position of the $j$th joint from the root node, $\hat J_j$ is the ground-truth 3D position of the $j$th joint, and $\hat J_r$ is that of the root node. The overall cost function of the CNN combines (3) and (4) with weights:
$$L_{all}=\lambda_{2D}\sum_{j=1}^{N_j}L_{2D}(j)+\lambda_{3D}\sum_{j\neq r}L_{3D}(j,r) \quad (5)$$
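The relative 3D regression loss (4) and the weighted total loss (5) can be written directly; a minimal sketch, where the weights `lam2d`/`lam3d` and the per-joint 2D loss values are hypothetical placeholders:

```python
import numpy as np

def l3d(R_pred, J_gt, j, r):
    """Relative 3D regression loss of Eq. (4) for joint j w.r.t. root r.

    R_pred : (Nj, 3) predicted positions relative to the root node.
    J_gt   : (Nj, 3) ground-truth absolute 3D joint positions.
    """
    return np.sum((R_pred[j] - (J_gt[j] - J_gt[r])) ** 2)

def total_loss(l2d_terms, R_pred, J_gt, r, lam2d=1.0, lam3d=1.0):
    """Weighted sum of Eq. (5): 2D classification + 3D regression losses."""
    loss_2d = sum(l2d_terms)  # precomputed L_2D(j) for each joint
    loss_3d = sum(l3d(R_pred, J_gt, j, r)
                  for j in range(len(l2d_terms)) if j != r)
    return lam2d * loss_2d + lam3d * loss_3d
```

A perfect relative prediction zeroes the 3D term, leaving only the weighted 2D classification losses.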
3D Joint Regression with 2D Classification Features
See the original paper for details. The joint locations in an image are usually a strong cue for guessing the 3D pose. To exploit the 2D classification result for 3D pose estimation, we concatenate the outputs of the softmax in the 2D classification task with the outputs of the fully connected layers in the 3D loss part. The proposed structure after the last pooling layer is shown in Figure 2. First, the 2D classification results are concatenated (probs 2D layer in Figure 2) and passed through a fully connected layer (fc probs 2D). Then, the feature vectors from the 2D and 3D parts are concatenated (fc 2D-3D), which is used for the 3D pose estimation task. Note that the error from the fc probs 2D layer is not back-propagated to the probs 2D layer, to ensure that the layers used for 2D classification are trained only by the 2D loss part. 【3】 repeatedly uses the 2D classification result as an input by concatenating it with feature maps from the CNN. We simply vectorize the softmax result to produce an $N_g \times N_g \times N_j$ feature vector rather than convolving the probability map with features in the convolutional layers.
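The vectorization step above amounts to flattening the per-joint softmax maps and concatenating them with the image feature vector; a minimal sketch with hypothetical shapes:

```python
import numpy as np

def build_3d_input(img_feat, probs_2d):
    """Concatenate image features with flattened 2D softmax maps.

    img_feat : (F,) feature vector from the 3D branch's fully connected layer.
    probs_2d : (Nj, Ng, Ng) per-joint softmax probability maps.
    The maps are simply vectorized into an Ng*Ng*Nj vector, rather than
    convolved with the feature maps as in 【3】.
    """
    return np.concatenate([img_feat, probs_2d.ravel()])
```

In an autodiff framework such as PyTorch, the stop-gradient described above would correspond to calling `.detach()` on the probability maps before the fc probs 2D layer, so the 3D loss cannot alter the 2D classification layers.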
Multiple 3D Pose Regression from Different Root Nodes
![figure](https://img-blog.csdnimg.cn/20200217145706581.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQzNDUyMTU2,size_16,color_FFFFFF,t_70)
Reviews the baseline framework and its drawback, then the remedy proposed in 【5】 and its drawback, and finally presents this paper's method: we estimate the relative position over multiple joints. (The baseline framework computes each joint's position relative to a single root joint; its drawback is that accuracy drops as the distance from the root increases. 【5】 proposes computing the relative position of each joint with respect to its parent joint, but the errors of intermediate joints accumulate along the chain.) Let $N_r$ denote the number of selected root nodes. In the experiments, setting $N_r=6$ makes most joints either a root node or a neighbor of a root node, as visualized in Figure 3(b). The six 3D regression losses are shown in Figure 2. The overall loss is
$$L_{all}=\lambda_{2D}\sum_{j=1}^{N_j}L_{2D}(j)+\lambda_{3D}\sum_{r\in R}\sum_{j\neq r}L_{3D}(j,r) \quad (6)$$
where $R$ is the set containing the joint indices that are used as root nodes. When the 3D losses share the same fully connected layers, the trained model outputs the same pose estimation results across all joints. To break this symmetry, we put separate fully connected layers before each 3D loss (fc2 layers in Figure 2).
At test time, all the pose estimation results are translated so that the mean of each pose becomes zero. The final prediction is generated by averaging the translated results. In other words, the 3D position of the $j$th joint, $X_j$, is calculated as

$$X_j=\frac{\sum_{r\in R}X_j^{(r)}}{N_r} \quad (7)$$
where $X_j^{(r)}$ is the predicted 3D position of the $j$th joint when the $r$th joint is set as the root node.
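The zero-mean translation and averaging of Eq. (7) can be sketched over an array of root-specific poses; a minimal NumPy sketch with a hypothetical `(Nr, Nj, 3)` layout:

```python
import numpy as np

def fuse_predictions(preds):
    """Average root-specific 3D poses after zero-mean translation, Eq. (7).

    preds : (Nr, Nj, 3) array; preds[r] is the full 3D pose predicted
            with the r-th selected joint as the root node.
    """
    # translate each pose so its joint-wise mean becomes zero
    centered = preds - preds.mean(axis=1, keepdims=True)
    # average the aligned poses over all root nodes (divide by N_r)
    return centered.mean(axis=0)
```

Because each pose is centered first, predictions that differ only by a global translation fuse without loss; residual disagreements between root nodes are averaged out.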
Implementation Details
See the original paper for details.
Conclusions
We expect that the performance can be further improved by incorporating temporal information into the CNN by applying the concepts of recurrent neural networks or 3D convolution 【26】. Also, an efficient aligning method for multiple regression results may boost the accuracy of pose estimation.