I've been working on pose estimation recently. I read this paper a while ago and took some notes, but never organized them in one place. I don't have much time today, so I'm just writing down the important points here first.
The main contributions of this paper are:
- Previous pose estimation models all follow a high-to-low resolution process and then recover from low to high resolution, with the whole pipeline connected in series, so some spatial information is inevitably lost. The HRNet proposed here instead adopts a parallel design that keeps a high-resolution feature map throughout the entire process.
- In addition, the model uses multi-scale fusion: the high- and low-resolution representations of the previous stage are fused before being passed on to the next stage.
Core Ideas and Contributions
- Maintains high-resolution representations through the whole process.
- Conducts repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from the other parallel representations over and over, leading to rich high-resolution representations.
Methods and Approaches
General Description
- We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions by exchanging information across the parallel multi-resolution subnetworks over and over through the whole process.
- We perform repeated multi-scale fusions to boost the high-resolution representations with the help of the low-resolution representations of the same depth and similar level, and vice versa, so that the high-resolution representations are also rich enough for pose estimation.
Detailed Description
- With an input of size $W \times H \times 3$, the problem is transformed into estimating $K$ heatmaps $\{H_1, H_2, ..., H_K\}$ of size $W' \times H'$, where each heatmap $H_k$ indicates the location confidence of the $k$th keypoint.
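The heatmap target above can be sketched in a few lines of numpy: each keypoint gets a 2D Gaussian centred on its ground-truth position. The std `sigma` is an arbitrary choice for this sketch, not a value taken from the paper.

```python
import numpy as np

def keypoint_heatmap(w, h, cx, cy, sigma=2.0):
    """Target heatmap H_k: a 2D Gaussian centred on keypoint k at (cx, cy).
    sigma is a free choice here, not a value from the paper."""
    xs = np.arange(w)[None, :]   # shape (1, w)
    ys = np.arange(h)[:, None]   # shape (h, 1) -> broadcasts to (h, w)
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

hm = keypoint_heatmap(64, 48, cx=20, cy=10)
print(hm.shape)              # (48, 64)
print(hm.argmax() == 10 * 64 + 20)  # peak sits exactly on the keypoint
```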
- The network is composed of a stem consisting of two strided convolutions decreasing the resolution, a main body outputting feature maps with the same resolution as its input feature maps, and a regressor estimating the heatmaps, from which the keypoint positions are chosen and transformed to the full resolution.
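As a quick sanity check on the stem: assuming each strided convolution is a 3×3 with stride 2 and padding 1 (the usual choice; the exact kernel/padding is an assumption of this sketch), the two convolutions together reduce the resolution by 4×.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a single convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def stem_out(w, h):
    """Two strided 3x3 convolutions: each one halves the resolution."""
    for _ in range(2):
        w, h = conv_out(w), conv_out(h)
    return w, h

print(stem_out(256, 192))  # (64, 48): a 4x overall reduction
```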
Sequential multi-resolution subnetworks
- Let $\mathcal{N}_{sr}$ be the subnetwork in the $s$th stage, with $r$ the resolution index (its resolution is $\frac{1}{2^{r-1}}$ of the resolution of the first subnetwork). The high-to-low network with $S$ (e.g., 4) stages can be denoted as:
$$\mathcal{N}_{11} \rightarrow \mathcal{N}_{22} \rightarrow \mathcal{N}_{33} \rightarrow \mathcal{N}_{44}$$
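The resolution index is easy to check with a tiny sketch, assuming a 256-pixel side length for the first subnetwork:

```python
def branch_resolution(base, r):
    """Spatial size of the subnetwork with resolution index r:
    1 / 2**(r - 1) of the first subnetwork's resolution."""
    return base // (2 ** (r - 1))

sizes = [branch_resolution(256, r) for r in range(1, 5)]
print(sizes)  # [256, 128, 64, 32]
```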
Parallel multi-resolution subnetworks
- start from a high-resolution subnetwork as the first stage.
- gradually add high-to-low subnetworks one by one, forming new stages, and connect the multi-resolution subnetworks in parallel.
- the resolutions of the parallel subnetworks of a later stage consist of the resolutions from the previous stage plus an extra lower one
$$\begin{array}{llll}
\mathcal{N}_{11} & \rightarrow \mathcal{N}_{21} & \rightarrow \mathcal{N}_{31} & \rightarrow \mathcal{N}_{41} \\
 & \searrow \mathcal{N}_{22} & \rightarrow \mathcal{N}_{32} & \rightarrow \mathcal{N}_{42} \\
 & & \searrow \mathcal{N}_{33} & \rightarrow \mathcal{N}_{43} \\
 & & & \searrow \mathcal{N}_{44}.
\end{array}$$
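A small sketch of how the branch sets grow stage by stage; this is only bookkeeping of the resolution indices $r$ of the subnetworks $\mathcal{N}_{sr}$, not a model:

```python
def stage_resolution_indices(num_stages):
    """Resolution indices r present in each stage s: stage s keeps the
    indices of stage s-1 and appends one extra lower resolution."""
    stages = []
    indices = []
    for s in range(1, num_stages + 1):
        indices = indices + [s]  # previous resolutions plus a lower one
        stages.append(list(indices))
    return stages

print(stage_resolution_indices(4))
# [[1], [1, 2], [1, 2, 3], [1, 2, 3, 4]]
```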
Repeated multi-scale fusion
- We introduce exchange units across parallel subnetworks such that each subnetwork repeatedly receives the information from other parallel subnetworks
- We divide the third stage into several (e.g., 3) exchange blocks; each block is composed of 3 parallel convolution units with an exchange unit across the parallel units, given as follows
$$\begin{array}{llllll}
\mathcal{C}_{31}^{1} & \searrow & \nearrow \mathcal{C}_{31}^{2} & \searrow & \nearrow \mathcal{C}_{31}^{3} & \searrow \\
\mathcal{C}_{32}^{1} & \rightarrow \mathcal{E}_{3}^{1} & \rightarrow \mathcal{C}_{32}^{2} & \rightarrow \mathcal{E}_{3}^{2} & \rightarrow \mathcal{C}_{32}^{3} & \rightarrow \mathcal{E}_{3}^{3} \\
\mathcal{C}_{33}^{1} & \nearrow & \searrow \mathcal{C}_{33}^{2} & \nearrow & \searrow \mathcal{C}_{33}^{3} & \nearrow
\end{array}$$
- $\mathcal{C}_{sr}^{b}$ represents the convolution unit in the $r$th resolution of the $b$th block in the $s$th stage, and $\mathcal{E}_{s}^{b}$ is the corresponding exchange unit.
- The inputs are $s$ response maps $\{\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_s\}$. The outputs are $s$ response maps $\{\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_s\}$, whose resolutions and widths are the same as those of the inputs. Each output is an aggregation of the input maps, $\mathbf{Y}_k = \sum_{i=1}^{s} a(\mathbf{X}_i, k)$. The exchange unit across stages has an extra output map $\mathbf{Y}_{s+1} = a(\mathbf{Y}_s, s+1)$.
- The function $a(\mathbf{X}_i, k)$ upsamples or downsamples $\mathbf{X}_i$ from resolution $i$ to resolution $k$. Strided $3 \times 3$ convolutions are used for downsampling: one $3 \times 3$ convolution with stride 2 for 2× downsampling, and two consecutive such convolutions for 4× downsampling. For upsampling, simple nearest-neighbor sampling following a $1 \times 1$ convolution (for aligning the number of channels) is used. If $i = k$, $a(\cdot, \cdot)$ is just an identity connection: $a(\mathbf{X}_i, k) = \mathbf{X}_i$.
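A shape-level sketch of the exchange unit in numpy. Nearest-neighbour repetition and plain subsampling stand in for the paper's learned $1 \times 1$ and strided $3 \times 3$ convolutions, so this only illustrates the aggregation $\mathbf{Y}_k = \sum_i a(\mathbf{X}_i, k)$, not the real layers.

```python
import numpy as np

def resize(x, i, k):
    """a(X_i, k): bring a map from resolution index i to index k.
    Nearest-neighbour repeat / plain subsampling are stand-ins for the
    learned up/downsampling convolutions of the paper."""
    if i == k:
        return x  # identity connection
    if i > k:  # upsample by 2**(i - k)
        f = 2 ** (i - k)
        return np.repeat(np.repeat(x, f, axis=0), f, axis=1)
    f = 2 ** (k - i)  # downsample by 2**(k - i)
    return x[::f, ::f]

def exchange_unit(xs):
    """Y_k = sum_i a(X_i, k); xs[i] holds resolution index i + 1."""
    s = len(xs)
    return [sum(resize(x, i + 1, k + 1) for i, x in enumerate(xs))
            for k in range(s)]

xs = [np.ones((8, 8)), np.ones((4, 4)), np.ones((2, 2))]
ys = exchange_unit(xs)
print([y.shape for y in ys])  # [(8, 8), (4, 4), (2, 2)]
```

Each output keeps its branch's resolution while summing contributions from all three branches, which is exactly the "each subnetwork repeatedly receives information from the others" behaviour described above.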
Network instantiation
- This part is not clear enough from the paper alone, so I will look into the code and add a link here describing it later.