Day 2: Deep High-Resolution Representation Learning for Human Pose Estimation

I've been working on pose-estimation-related things recently. I'd actually read this paper before and taken some notes, but never organized them in one place. I don't have much time today, so I'm just jotting down the important points here.

The main contributions of this paper are:

  • Previous pose estimation models all follow a high-resolution-to-low-resolution and then low-resolution-to-high-resolution process, and the whole pipeline is serial, so some spatial information is inevitably lost. The HRNet proposed in this paper uses a parallel design instead and keeps high-resolution feature maps throughout the entire process.
  • In addition, the model uses multi-scale fusion: the high- and low-resolution representations of the previous stage are fused before being passed on to the next stage.

The notes below are in English.


Core Ideas and Contributions

  1. Maintains high-resolution representations through the whole process.
  2. Conducts repeated multi-scale fusions so that each of the high-to-low resolution representations receives information from the other parallel representations over and over again, leading to rich high-resolution representations.

Methods and Approaches

General Description

  • We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions by exchanging information across the parallel multi-resolution subnetworks over and over through the whole process.
  • We perform repeated multi-scale fusions to boost the high-resolution representations with the help of the low-resolution representations of the same depth and similar level, and vice versa, so that the resulting high-resolution representations are also rich enough for pose estimation.


Detailed Description

  • With an input of size $W \times H \times 3$, the problem is transformed into estimating $K$ heatmaps of size $W' \times H'$, $\{H_1, H_2, \ldots, H_K\}$, where each heatmap $H_k$ indicates the location confidence of the $k$-th keypoint.
  • The network is composed of a stem consisting of two strided convolutions that decrease the resolution, a main body that outputs feature maps with the same resolution as its input feature maps, and a regressor that estimates the heatmaps, from which the keypoint positions are chosen and transformed back to the full resolution.
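As a rough illustration of this structure, here is a minimal PyTorch sketch of a stem with two strided convolutions and a heatmap regressor. The channel widths and the BatchNorm/ReLU placement are illustrative assumptions, not values taken from the paper's released code.

```python
import torch
import torch.nn as nn

class Stem(nn.Module):
    """Two strided 3x3 convolutions that reduce the input resolution by 4x.
    Channel widths here are illustrative, not the exact paper values."""
    def __init__(self, out_channels=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, out_channels, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):          # x: (N, 3, H, W)
        return self.layers(x)      # (N, C, H/4, W/4)

class HeatmapHead(nn.Module):
    """Regresses K heatmaps, one per keypoint, from the final high-resolution features."""
    def __init__(self, in_channels=32, num_keypoints=17):
        super().__init__()
        self.head = nn.Conv2d(in_channels, num_keypoints, kernel_size=1)

    def forward(self, feats):      # feats: (N, C, H', W')
        return self.head(feats)    # (N, K, H', W')
```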

Sequential multi-resolution subnetworks

  • Let $\mathcal{N}_{sr}$ be the subnetwork in the $s$-th stage, with $r$ the resolution index (its resolution is $\frac{1}{2^{r-1}}$ of the resolution of the first subnetwork). The high-to-low network with $S$ (e.g., 4) stages can be denoted as:

$$\mathcal{N}_{11} \rightarrow \mathcal{N}_{22} \rightarrow \mathcal{N}_{33} \rightarrow \mathcal{N}_{44}$$
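For intuition, a minimal sketch of this sequential high-to-low layout, with placeholder convolution blocks standing in for the paper's actual units (channel widths are assumptions):

```python
import torch.nn as nn

def conv_stage(in_ch, out_ch, downsample):
    """One placeholder stage: an optional 2x downsampling conv followed by BatchNorm and ReLU."""
    stride = 2 if downsample else 1
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# N_11 -> N_22 -> N_33 -> N_44: each later stage runs at half the previous resolution.
high_to_low = nn.Sequential(
    conv_stage(64, 64, downsample=False),   # N_11, resolution 1
    conv_stage(64, 128, downsample=True),   # N_22, resolution 1/2
    conv_stage(128, 256, downsample=True),  # N_33, resolution 1/4
    conv_stage(256, 512, downsample=True),  # N_44, resolution 1/8
)
```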

Parallel multi-resolution subnetworks

  • Start from a high-resolution subnetwork as the first stage.
  • Gradually add high-to-low resolution subnetworks one by one, forming new stages, and connect the multi-resolution subnetworks in parallel.
  • The resolutions of the parallel subnetworks in a later stage consist of the resolutions from the previous stage plus an extra lower one, as in the diagram and sketch below.

$$
\begin{array}{llll}
\mathcal{N}_{11} & \rightarrow \mathcal{N}_{21} & \rightarrow \mathcal{N}_{31} & \rightarrow \mathcal{N}_{41} \\
 & \searrow \mathcal{N}_{22} & \rightarrow \mathcal{N}_{32} & \rightarrow \mathcal{N}_{42} \\
 & & \searrow \mathcal{N}_{33} & \rightarrow \mathcal{N}_{43} \\
 & & & \searrow \mathcal{N}_{44}.
\end{array}
$$
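A minimal sketch of the parallel layout under these assumptions: each branch keeps its own resolution, and when a new stage starts an extra branch at half the lowest existing resolution is created with a strided convolution. The helper names `branch_block` and `new_branch` and all channel widths are illustrative, not from the paper's code.

```python
import torch
import torch.nn as nn

def branch_block(channels):
    """Placeholder per-branch unit; the paper uses residual units here."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )

def new_branch(in_channels, out_channels):
    """Creates the extra, 2x-lower-resolution branch when a new stage starts."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )

# Stage 2 example: two parallel branches (widths are illustrative).
branches = nn.ModuleList([branch_block(32), branch_block(64)])

def run_stage(xs):                     # xs: one feature map per resolution
    return [b(x) for b, x in zip(branches, xs)]

# Moving to stage 3: keep the existing resolutions and add an extra lower one.
to_lower = new_branch(64, 128)

xs = [torch.randn(1, 32, 64, 48), torch.randn(1, 64, 32, 24)]
ys = run_stage(xs)
ys.append(to_lower(ys[-1]))            # now three parallel resolutions
```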

Repeated multi-scale fusion

  • We introduce exchange units across parallel subnetworks such that each subnetwork repeatedly receives information from the other parallel subnetworks.
  • We divide the third stage into several (e.g., 3) exchange blocks, where each block is composed of 3 parallel convolution units with an exchange unit across the parallel units, which is given as follows:

$$
\begin{array}{llllll}
\mathcal{C}_{31}^{1} & \searrow & \nearrow \mathcal{C}_{31}^{2} & \searrow & \nearrow \mathcal{C}_{31}^{3} & \searrow \\
\mathcal{C}_{32}^{1} & \rightarrow \mathcal{E}_{3}^{1} & \rightarrow \mathcal{C}_{32}^{2} & \rightarrow \mathcal{E}_{3}^{2} & \rightarrow \mathcal{C}_{32}^{3} & \rightarrow \mathcal{E}_{3}^{3} \\
\mathcal{C}_{33}^{1} & \nearrow & \searrow \mathcal{C}_{33}^{2} & \nearrow & \searrow \mathcal{C}_{33}^{3} & \nearrow
\end{array}
$$


  • $\mathcal{C}_{sr}^{b}$ represents the convolution unit in the $r$-th resolution of the $b$-th block in the $s$-th stage, and $\mathcal{E}_{s}^{b}$ is the corresponding exchange unit.
  • The inputs are $s$ response maps $\{\mathbf{X}_{1}, \mathbf{X}_{2}, \ldots, \mathbf{X}_{s}\}$, and the outputs are $s$ response maps $\{\mathbf{Y}_{1}, \mathbf{Y}_{2}, \ldots, \mathbf{Y}_{s}\}$ whose resolutions and widths are the same as the inputs. Each output is an aggregation of the input maps, $\mathbf{Y}_{k}=\sum_{i=1}^{s} a\left(\mathbf{X}_{i}, k\right)$. The exchange unit across stages has an extra output map $\mathbf{Y}_{s+1}=a\left(\mathbf{Y}_{s}, s+1\right)$.
  • The function $a\left(\mathbf{X}_{i}, k\right)$ consists of upsampling or downsampling $\mathbf{X}_{i}$ from resolution $i$ to resolution $k$. We adopt strided 3×3 convolutions for downsampling: for instance, one strided 3×3 convolution with stride 2 for 2× downsampling, and two consecutive strided 3×3 convolutions with stride 2 for 4× downsampling. For upsampling, we adopt simple nearest-neighbor upsampling following a $1 \times 1$ convolution for aligning the number of channels. If $i=k$, $a(\cdot, \cdot)$ is just an identity connection: $a\left(\mathbf{X}_{i}, k\right)=\mathbf{X}_{i}$.
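A minimal sketch of the transform $a(\mathbf{X}_i, k)$ and the resulting exchange unit, assuming PyTorch; the channel widths, the ReLU after aggregation, and the BatchNorm placement are assumptions rather than details confirmed against the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_transform(in_ch, out_ch, i, k):
    """Builds a(X_i, k): identity if i == k, strided 3x3 convs to go down,
    1x1 conv + nearest-neighbor upsampling to go up."""
    if i == k:
        return nn.Identity()
    if k > i:   # downsample by 2^(k - i): one strided 3x3 conv per factor of 2
        layers, ch = [], in_ch
        for step in range(k - i):
            nxt = out_ch if step == k - i - 1 else ch
            layers += [nn.Conv2d(ch, nxt, 3, stride=2, padding=1, bias=False),
                       nn.BatchNorm2d(nxt)]
            ch = nxt
        return nn.Sequential(*layers)
    # upsample: 1x1 conv to align channels, then nearest-neighbor upsampling by 2^(i - k)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.Upsample(scale_factor=2 ** (i - k), mode="nearest"),
    )

class ExchangeUnit(nn.Module):
    """Y_k = sum_i a(X_i, k): every output resolution aggregates all input resolutions."""
    def __init__(self, widths):                     # widths[r]: channels at resolution index r
        super().__init__()
        n = len(widths)
        self.transforms = nn.ModuleList(
            nn.ModuleList(make_transform(widths[i], widths[k], i, k) for i in range(n))
            for k in range(n)
        )

    def forward(self, xs):                          # xs: feature maps, high to low resolution
        return [F.relu(sum(t(x) for t, x in zip(row, xs))) for row in self.transforms]

# Usage: three parallel resolutions with illustrative widths.
unit = ExchangeUnit([32, 64, 128])
xs = [torch.randn(1, 32, 64, 48), torch.randn(1, 64, 32, 24), torch.randn(1, 128, 16, 12)]
ys = unit(xs)   # same resolutions and widths as the inputs
```

Because the summed outputs keep the same resolutions and widths as the inputs, exchange units of this form can be stacked repeatedly, which is what makes the repeated multi-scale fusion possible.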

Network instantiation

  • This part is not clear enough from the paper alone, so I will look into the code and add a link here later with notes describing the implementation.