I've been working on pose estimation recently. I read this paper a while ago and took some notes, but never organized them in one place. I don't have much time today, so I'm just writing down the important points here first.
The main contributions of this paper are:
- Previous pose estimation models all follow a high-to-low resolution process and then recover from low to high resolution, with the whole pipeline connected in series, so some spatial information is inevitably lost. The HRNet proposed here instead adopts a parallel design that keeps a high-resolution feature map throughout the entire process.
- In addition, the model uses multi-scale fusion: the high- and low-resolution representations of the previous stage are fused before being passed on to the next stage.
Core Ideas and Contributions
- Maintains high-resolution representations through the whole process.
- Conducts repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from the other parallel representations over and over, leading to rich high-resolution representations.
Methods and Approaches
General Description
- We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions by exchanging information across the parallel multi-resolution subnetworks over and over through the whole process.
- We perform repeated multi-scale fusions to boost the high-resolution representations with the help of the low-resolution representations of the same depth and similar level, and vice versa, so that the high-resolution representations are also rich enough for pose estimation.
Detailed Description
- With an input of size $W \times H \times 3$, the problem is transformed into estimating $K$ heatmaps $\{H_1, H_2, ..., H_K\}$ of size $W' \times H'$, where each heatmap $H_k$ indicates the location confidence of the $k$th keypoint.
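The heatmap target above can be sketched in a few lines of numpy: each keypoint gets a 2D Gaussian centred on its ground-truth position. The std `sigma` is an arbitrary choice for this sketch, not a value taken from the paper.

```python
import numpy as np

def keypoint_heatmap(w, h, cx, cy, sigma=2.0):
    """Target heatmap H_k: a 2D Gaussian centred on keypoint k at (cx, cy).
    sigma is a free choice here, not a value from the paper."""
    xs = np.arange(w)[None, :]   # shape (1, w)
    ys = np.arange(h)[:, None]   # shape (h, 1) -> broadcasts to (h, w)
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

hm = keypoint_heatmap(64, 48, cx=20, cy=10)
print(hm.shape)              # (48, 64)
print(hm.argmax() == 10 * 64 + 20)  # peak sits exactly on the keypoint
```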
- The network is composed of a stem consisting of two strided convolutions decreasing the resolution, a main body outputting feature maps with the same resolution as its input feature maps, and a regressor estimating the heatmaps, from which the keypoint positions are chosen and transformed to the full resolution.
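As a quick sanity check on the stem: assuming each strided convolution is a 3×3 with stride 2 and padding 1 (the usual choice; the exact kernel/padding is an assumption of this sketch), the two convolutions together reduce the resolution by 4×.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a single convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def stem_out(w, h):
    """Two strided 3x3 convolutions: each one halves the resolution."""
    for _ in range(2):
        w, h = conv_out(w), conv_out(h)
    return w, h

print(stem_out(256, 192))  # (64, 48): a 4x overall reduction
```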
Sequential multi-resolution subnetworks
- Let $\mathcal{N}_{sr}$ be the subnetwork in the $s$th stage, with $r$ the resolution index (its resolution is $\frac{1}{2^{r-1}}$ of the resolution of the first subnetwork). The high-to-low network with $S$ (e.g., 4) stages can be denoted as:
$$\mathcal{N}_{11} \rightarrow \mathcal{N}_{22} \rightarrow \mathcal{N}_{33} \rightarrow \mathcal{N}_{44}$$
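The resolution index is easy to check with a tiny sketch, assuming a 256-pixel side length for the first subnetwork:

```python
def branch_resolution(base, r):
    """Spatial size of the subnetwork with resolution index r:
    1 / 2**(r - 1) of the first subnetwork's resolution."""
    return base // (2 ** (r - 1))

sizes = [branch_resolution(256, r) for r in range(1, 5)]
print(sizes)  # [256, 128, 64, 32]
```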
Parallel multi-resolution subnetworks
- start from a high-resolution subnetwork as the first stage.
- gradually add high-to-low subnetworks one by one, forming new stages, and connect the multi-resolution subnetworks in parallel.
- the resolutions of the parallel subnetworks of a later stage consist of the resolutions from the previous stage plus an extra lower one
$$\begin{array}{llll}
\mathcal{N}_{11} & \rightarrow \mathcal{N}_{21} & \rightarrow \mathcal{N}_{31} & \rightarrow \mathcal{N}_{41} \\
 & \searrow \mathcal{N}_{22} & \rightarrow \mathcal{N}_{32} & \rightarrow \mathcal{N}_{42} \\
 & & \searrow \mathcal{N}_{33} & \rightarrow \mathcal{N}_{43} \\
 & & & \searrow \mathcal{N}_{44}.
\end{array}$$
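A small sketch of how the branch sets grow stage by stage; this is only bookkeeping of the resolution indices $r$ of the subnetworks $\mathcal{N}_{sr}$, not a model:

```python
def stage_resolution_indices(num_stages):
    """Resolution indices r present in each stage s: stage s keeps the
    indices of stage s-1 and appends one extra lower resolution."""
    stages = []
    indices = []
    for s in range(1, num_stages + 1):
        indices = indices + [s]  # previous resolutions plus a lower one
        stages.append(list(indices))
    return stages

print(stage_resolution_indices(4))
# [[1], [1, 2], [1, 2, 3], [1, 2, 3, 4]]
```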
Repeated multi-scale fusion
- We introduce exchange units across parallel subnetworks such that each subnetwork repeatedly receives the information from other parallel subnetworks
- We divide the third stage into several (e.g., 3) exchange blocks; each block is composed of 3 parallel convolution units with an exchange unit across the parallel units, given as follows
$$\begin{array}{llllll}
\mathcal{C}_{31}^{1} & \searrow & \nearrow \mathcal{C}_{31}^{2} & \searrow & \nearrow \mathcal{C}_{31}^{3} & \searrow \\
\mathcal{C}_{32}^{1} & \rightarrow \mathcal{E}_{3}^{1} & \rightarrow \mathcal{C}_{32}^{2} & \rightarrow \mathcal{E}_{3}^{2} & \rightarrow \mathcal{C}_{32}^{3} & \rightarrow \mathcal{E}_{3}^{3} \\
\mathcal{C}_{33}^{1} & \nearrow & \searrow \mathcal{C}_{33}^{2} & \nearrow & \searrow \mathcal{C}_{33}^{3} & \nearrow
\end{array}$$
- $\mathcal{C}_{sr}^{b}$ represents the convolution unit in the $r$th resolution of the $b$th block in the $s$th stage, and $\mathcal{E}_{s}^{b}$ is the corresponding exchange unit.
- The inputs are $s$ response maps $\{\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_s\}$. The outputs are $s$ response maps $\{\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_s\}$, whose resolutions and widths are the same as those of the inputs. Each output is an aggregation of the input maps, $\mathbf{Y}_k = \sum_{i=1}^{s} a(\mathbf{X}_i, k)$. The exchange unit across stages has an extra output map $\mathbf{Y}_{s+1} = a(\mathbf{Y}_s, s+1)$.
- The function $a(\mathbf{X}_i, k)$ upsamples or downsamples $\mathbf{X}_i$ from resolution $i$ to resolution $k$. Strided $3 \times 3$ convolutions are used for downsampling: one $3 \times 3$ convolution with stride 2 for 2× downsampling, and two consecutive such convolutions for 4× downsampling. For upsampling, simple nearest-neighbor sampling following a $1 \times 1$ convolution (for aligning the number of channels) is used. If $i = k$, $a(\cdot, \cdot)$ is just an identity connection: $a(\mathbf{X}_i, k) = \mathbf{X}_i$.
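A shape-level sketch of the exchange unit in numpy. Nearest-neighbour repetition and plain subsampling stand in for the paper's learned $1 \times 1$ and strided $3 \times 3$ convolutions, so this only illustrates the aggregation $\mathbf{Y}_k = \sum_i a(\mathbf{X}_i, k)$, not the real layers.

```python
import numpy as np

def resize(x, i, k):
    """a(X_i, k): bring a map from resolution index i to index k.
    Nearest-neighbour repeat / plain subsampling are stand-ins for the
    learned up/downsampling convolutions of the paper."""
    if i == k:
        return x  # identity connection
    if i > k:  # upsample by 2**(i - k)
        f = 2 ** (i - k)
        return np.repeat(np.repeat(x, f, axis=0), f, axis=1)
    f = 2 ** (k - i)  # downsample by 2**(k - i)
    return x[::f, ::f]

def exchange_unit(xs):
    """Y_k = sum_i a(X_i, k); xs[i] holds resolution index i + 1."""
    s = len(xs)
    return [sum(resize(x, i + 1, k + 1) for i, x in enumerate(xs))
            for k in range(s)]

xs = [np.ones((8, 8)), np.ones((4, 4)), np.ones((2, 2))]
ys = exchange_unit(xs)
print([y.shape for y in ys])  # [(8, 8), (4, 4), (2, 2)]
```

Each output keeps its branch's resolution while summing contributions from all three branches, which is exactly the "each subnetwork repeatedly receives information from the others" behaviour described above.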
Network instantiation
- This part is not clear enough from the paper alone, so I will look into the code and add a link here describing it later.