【KAPAO】《Rethinking Keypoint Representations：Modeling Keypoints and Poses as Objects for XXX》

bryant_meng

Last modified 2023-02-09 10:45:42

First published 2023-01-11 14:52:11

Article link: https://blog.csdn.net/bryant_meng/article/details/128387787

《Rethinking Keypoint Representations: Modeling Keypoints and Poses as Objects for Multi-Person Human Pose Estimation》

ECCV-2022

1 Background and Motivation

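Related reading: a roundup of classic Human Pose Estimation methods

Mainstream human pose estimation methods are mostly based on heatmap regression.

Drawbacks of heatmap-based methods: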

suffer from quantization error
require excessive computation to generate and post-process
when two keypoints of the same type appear in close proximity to one another, the overlapping heatmap signals may be mistaken for a single keypoint.
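
To make the first drawback concrete, here is a minimal NumPy sketch (not from the paper) of how plain argmax decoding on a downsampled heatmap quantizes a keypoint location. The stride of 4, the Gaussian width, and the helper name `decode_argmax` are illustrative assumptions.

```python
import numpy as np

def decode_argmax(heatmap: np.ndarray, stride: int) -> tuple:
    """Return the (x, y) image coordinate of the heatmap peak, with no sub-pixel refinement."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return float(col * stride), float(row * stride)

stride = 4                           # assumed heatmap downsampling factor
true_xy = np.array([101.0, 59.0])    # ground-truth keypoint in image pixels

# Render a Gaussian heatmap at 1/stride resolution around the true location.
ys, xs = np.mgrid[0:32, 0:32]
cx, cy = true_xy / stride            # 25.25, 14.75 in heatmap cells
heatmap = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * 1.5 ** 2))

decoded = decode_argmax(heatmap, stride)
error = np.abs(np.array(decoded) - true_xy)   # per-axis quantization error in pixels
```

The decoded point snaps to the nearest heatmap cell, so the per-axis error can be as large as half the stride unless extra refinement is added in post-processing.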

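The authors propose KAPAO (Keypoints And Poses As Objects), a single-stage multi-person human pose estimation method that simultaneously detects human pose objects and keypoint objects and fuses the detections to exploit the strengths of both object representations.

[Figure: pose objects (blue) and keypoint objects (red)]

pose objects: the blue parts of the figure above, i.e. a person bounding box together with that person's keypoints (this alone is already enough to predict keypoints)
keypoint objects: the red parts of the figure above, i.e. a small box around each keypoint (introduced in this paper and predicted jointly with the pose objects; the box centers are fused with the pose objects' keypoints according to a fusion strategy)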

2 Related Work

Heatmap-free keypoint detection
Single-stage human pose estimation
generally less accurate, but usually perform better in crowded scenes
Extending object detectors for human pose estimation

3 Advantages / Contributions

Proposes KAPAO, a heatmap-free human pose estimation method built on keypoint objects and pose objects. It is significantly faster and more accurate than state-of-the-art heatmap-based methods when not using TTA.

4 KAPAO：Keypoints and Poses as Objects

$\mathbf{\hat{O}} = \mathbf{\hat{O}}^k \cup \mathbf{\hat{O}}^p$

a set of keypoint objects $\{\hat{\mathcal{O}^k} \in \mathbf{\hat{O}}^k\}$

a set of pose objects $\{\hat{\mathcal{O}^p} \in \mathbf{\hat{O}}^p\}$

Each keypoint object $\mathcal{O}^k$ has a small bounding box $\mathbf{b} = (b_x, b_y, b_w, b_h)$; the hyperparameter $b_s = b_w = b_h$ controls the size of $\mathbf{b}$.

characterized by strong local features
carry no information regarding the concept of a person or pose
keypoint objects exist in a subspace of a pose object

Each pose object $\mathcal{O}^p$ has a bounding box of class "person" and a set of keypoints $\mathbf{z} = \{(x_k, y_k)\}_{k=1}^{K}$.
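
The two object types defined above can be sketched as plain data built from one annotated person. This is only an illustration: the function name `make_objects`, the dictionary layout, the class indexing (0 for "person", 1..K for keypoint types) and the choice to skip invisible keypoints are all assumptions, not details taken from the paper.

```python
import numpy as np

def make_objects(keypoints, visible, person_box, b_s):
    """Build one pose object and its keypoint objects.

    keypoints:  (K, 2) array of (x, y) locations
    visible:    (K,) boolean flags
    person_box: (x, y, w, h) box of the person
    b_s:        side length of every keypoint box (b_w = b_h = b_s)
    """
    keypoints = np.asarray(keypoints, dtype=float)
    pose_object = {"cls": 0, "box": tuple(person_box), "keypoints": keypoints.copy()}

    keypoint_objects = []
    for k, (x, y) in enumerate(keypoints):
        if visible[k]:
            # The box is centered on the keypoint; it carries no person identity.
            keypoint_objects.append({"cls": k + 1, "box": (float(x), float(y), b_s, b_s)})
    return pose_object, keypoint_objects

pose, kp_objs = make_objects(
    keypoints=[[10, 20], [30, 40], [50, 60]],
    visible=[True, False, True],
    person_box=(30, 40, 60, 80),
    b_s=8.0,
)
```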

4.1 Architectural Details

[Figure: network architecture]
$\mathcal{N}(\mathbf{I}) = \mathbf{\hat{G}}$

input image $\mathbf{I} \in \mathbb{R}^{h \times w \times 3}$

network $\mathcal{N}$

output grid $\mathbf{\hat{G}} = \{\hat{\mathcal{G}}^s| s \in \{8, 16, 32, 64\}\}$ ， $\hat{\mathcal{G}}^s \in \mathbb{R}^{\frac{h}{s} \times \frac{w}{s} \times N_a \times N_o}$

$N_a$ is the number of anchor channels
$N_o = 3K+6$ is the number of output channels for each object
- the objectness $\hat{p}_o$ ，1
- the intermediate bounding boxes $\hat{\mathbf{t}}'= (\hat{t}'_x, \hat{t}'_y, \hat{t}'_w, \hat{t}'_h)$ ，4
- the object class scores $\hat{\mathbf{c}}=(\hat{c}_1,...,\hat{c}_{K+1})$ ，K+1
- the intermediate keypoints $\hat{\mathbf{v}'} = \{\hat{v}'_{xk}, \hat{v}'_{yk}\}_{k=1}^K$ ，2K
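
The channel count can be checked by adding the four parts: $1 + 4 + (K+1) + 2K = 3K + 6$. A minimal sketch of slicing one output tensor into those parts follows; it assumes the channels are stored in the order listed above (the actual memory layout is not stated here) and uses $K = 17$ as in COCO. The name `split_output` is hypothetical.

```python
import numpy as np

K = 17             # number of keypoint types (COCO)
N_o = 3 * K + 6    # 1 objectness + 4 box + (K + 1) classes + 2K keypoint values

def split_output(o: np.ndarray):
    """Split the last axis of an output tensor into its four parts (assumed order)."""
    assert o.shape[-1] == N_o
    objectness = o[..., 0]
    box = o[..., 1:5]
    class_scores = o[..., 5:5 + K + 1]
    keypoints = o[..., 5 + K + 1:].reshape(*o.shape[:-1], K, 2)
    return objectness, box, class_scores, keypoints

# Example: a tiny 2 x 3 grid with a single anchor channel squeezed out.
p_o, t, c, v = split_output(np.zeros((2, 3, N_o)))
```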

Additional detection redundancy is provided by also allowing the neighbouring grid cells $\hat{\mathcal{G}}_{i\pm1, j}^s$ and $\hat{\mathcal{G}}_{i, j\pm1}^s$ to detect an object in image patch $\mathbf{I}_p = \mathbf{I}_{si:s(i+1), sj:s(j+1)}$

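[Equation images: bounding box and keypoint computation]
Bounding box computation ($x$ and $y$ are predicted as offsets)

Keypoint computation (offsets)

Here $A_w$ and $A_h$ denote the anchor width and height.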
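
The equation images did not survive in this post, so the sketch below is an assumption rather than a transcription: it uses YOLOv5-style sigmoid decoding, which is my recollection of the form used in the paper. The constants ($2\sigma - 0.5$, $(2\sigma)^2$, $4\sigma - 2$) and the function names should be checked against the paper before relying on them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t_raw, anchor_wh, s):
    """Map raw (t'_x, t'_y, t'_w, t'_h) to grid-relative (t_x, t_y, t_w, t_h)."""
    A_w, A_h = anchor_wh
    sx, sy, sw, sh = sigmoid(np.asarray(t_raw, dtype=float))
    return np.array([
        2 * sx - 0.5,                # x offset within the grid cell (assumed form)
        2 * sy - 0.5,                # y offset within the grid cell (assumed form)
        A_w / s * (2 * sw) ** 2,     # width, scaled by the anchor width
        A_h / s * (2 * sh) ** 2,     # height, scaled by the anchor height
    ])

def decode_keypoints(v_raw, anchor_wh, s):
    """Map raw (K, 2) keypoint outputs to grid-relative offsets (assumed form)."""
    A_w, A_h = anchor_wh
    sv = sigmoid(np.asarray(v_raw, dtype=float))
    return np.stack([A_w / s * (4 * sv[:, 0] - 2),
                     A_h / s * (4 * sv[:, 1] - 2)], axis=1)

t = decode_box([0, 0, 0, 0], anchor_wh=(32, 64), s=8)
v = decode_keypoints(np.zeros((17, 2)), anchor_wh=(32, 64), s=8)
```

With all-zero raw outputs the box sits at the cell center with exactly the anchor's size, and every keypoint offset is zero, which is the point of scaling by $A_w$ and $A_h$.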

4.2 Loss Function

[Equation image: loss function]

$w_s$ is the grid weighting
IoU refers to CIoU
$v_k$ are the keypoint visibility flags
$N_b$ is the batch size
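
The loss equation itself is missing here, so the following is not the paper's full loss. It only illustrates the role of the visibility flags $v_k$: keypoints that are not labeled visible contribute nothing to the keypoint term. The per-keypoint Euclidean distance, the sum reduction, and the name `keypoint_loss` are assumptions.

```python
import numpy as np

def keypoint_loss(pred, target, vis):
    """Sum of Euclidean distances over keypoints whose visibility flag is > 0."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mask = np.asarray(vis) > 0
    if not mask.any():
        return 0.0
    dist = np.linalg.norm(pred - target, axis=-1)   # (K,) distance per keypoint
    return float(dist[mask].sum())

loss = keypoint_loss(
    pred=[[0, 0], [3, 4], [10, 10]],
    target=[[0, 0], [0, 0], [0, 0]],
    vis=[1, 1, 0],          # the third keypoint is unlabeled and is ignored
)
```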

4.3 Inference

The predicted intermediate bounding boxes $\hat{\mathbf{t}}$ and keypoints $\hat{\mathbf{v}}$ are mapped back to the original image coordinates using the following transformation:

$\hat{\mathbf{b}} = s(\hat{\mathbf{t}}+[i,j,0,0])$

$\hat{\mathbf{z}}_k = s(\hat{\mathbf{v}}_k+[i,j])$
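
The two transformations above translate directly into code. This is a minimal sketch that follows the formulas as written in this post; the function name `to_image_coords` is hypothetical.

```python
import numpy as np

def to_image_coords(t, v, i, j, s):
    """Map grid-relative box t (4,) and keypoints v (K, 2) of cell (i, j) to image pixels."""
    b = s * (np.asarray(t, dtype=float) + np.array([i, j, 0, 0]))   # add cell index to x, y only
    z = s * (np.asarray(v, dtype=float) + np.array([i, j]))         # same shift for every keypoint
    return b, z

b, z = to_image_coords(t=[0.5, 0.5, 4.0, 8.0], v=[[0, 0], [1, -1]], i=3, j=5, s=8)
```

The width and height are only multiplied by the stride $s$, while the center and the keypoints are first shifted by the grid cell index.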

Non-maximum suppression (NMS) is then applied:

$\hat{\mathbf{O}}^{p'} = NMS(\hat{\mathbf{O}}^{p}, \tau_{bp})$

$\hat{\mathbf{O}}^{k'} = NMS(\hat{\mathbf{O}}^{k}, \tau_{bk})$

$\tau_{bp}$ and $\tau_{bk}$ are the IoU thresholds.
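
For reference, here is a minimal greedy NMS in NumPy. It is a generic sketch, not KAPAO's implementation: it assumes corner-format boxes `(x1, y1, x2, y2)` and ignores classes, whereas the method runs NMS separately on pose objects (threshold $\tau_{bp}$) and keypoint objects (threshold $\tau_{bk}$).

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thresh):
    """Return indices of kept boxes, highest score first."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Drop every remaining box that overlaps the kept box too much.
        order = rest[iou_one_to_many(boxes[best], boxes[rest]) <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores, iou_thresh=0.5)
```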

The results are then fused:

$\hat{\mathbf{P}} = \varphi(\hat{\mathbf{O}}^{p'}, \hat{\mathbf{O}}^{k'}, \tau_{fd}, \tau_{fc})$

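The fusion algorithm is as follows:
[Figure: fusion algorithm]

The walkthrough above is based on the post "Single-Stage Multi-Person 2D Human Pose Estimation Algorithm: KAPAO".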
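
Since the algorithm figure is missing, here is a hedged sketch of how such a fusion $\varphi$ could work, based on my reading of the method rather than on the original pseudocode: keypoint objects below the confidence threshold $\tau_{fc}$ are ignored, each remaining one is matched to the nearest pose keypoint of the same type, and it replaces that keypoint if the distance is below $\tau_{fd}$ and its confidence beats whatever was fused there before. The data layout and the name `fuse` are assumptions.

```python
import numpy as np

def fuse(poses, kp_objects, tau_fd, tau_fc):
    """poses: (P, K, 2) keypoints from pose objects.
    kp_objects: iterable of (k, x, y, conf), with k the zero-based keypoint type.
    Returns fused poses (P, K, 2) and per-keypoint confidences (P, K)."""
    original = np.asarray(poses, dtype=float)
    fused = original.copy()
    conf = np.zeros(fused.shape[:2])      # pose objects carry no keypoint confidences
    if fused.shape[0] == 0:
        return fused, conf
    for k, x, y, c in kp_objects:
        if c < tau_fc:
            continue                      # too uncertain to use
        d = np.linalg.norm(original[:, k] - np.array([x, y]), axis=1)
        m = int(np.argmin(d))             # closest pose for this keypoint type
        if d[m] < tau_fd and c > conf[m, k]:
            fused[m, k] = (x, y)
            conf[m, k] = c
    return fused, conf

poses = [[[10, 10], [50, 50]], [[100, 100], [150, 150]]]
kp_objects = [(0, 12, 11, 0.9),      # close to pose 0, keypoint 0: fused
              (1, 80, 80, 0.9),      # too far from any keypoint of type 1: ignored
              (0, 101, 100, 0.1)]    # below the confidence threshold: ignored
fused, conf = fuse(poses, kp_objects, tau_fd=5.0, tau_fc=0.2)
```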

4.4 Limitations

pose objects do not include individual keypoint confidences; only keypoint objects have them
training requires a considerable amount of time and GPU memory due to the large input size used

5 Experiments

5.1 Datasets and Metrics

COCO
CrowdPose

5.2 Microsoft COCO Keypoints

On the val set:
[Image: results on the COCO val set]
The post-processing time of KAPAO depends less on the input size so it only increases by approximately 1 ms when using TTA.

Results on the test set:
[Image: results on the COCO test set]

5.3 CrowdPose

[Image: results on CrowdPose]

5.4 Ablation Studies

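[Figure: ablation studies, two panels]
Left: the size of the keypoint bounding box used for keypoint objects.

Right: the fusion rate for each keypoint type. Keypoints with distinct local image features (e.g., the eyes, ears, and nose) have higher fusion rates as they are detected more precisely as keypoint objects than as pose objects.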

Next, the accuracy gain from fusion and its cost in time:
[Image: accuracy and timing with and without fusion]

A Supplementary Material

A.1 Hyperparameters

[Image: hyperparameters]

A.2 Influence of input size on accuracy and speed

[Image: accuracy and speed versus input size]
Input sizes evaluated: {640, 768, 896, 1024, 1152, 1280}

reducing the input size to 1152 had a negligible effect on the accuracy but provided a meaningful latency reduction.

A.3 Video Inference Demos

[Image: video inference demo]

A.4 Error Analysis

KAPAO consistently provides higher AR than previous single-stage methods and higher AP at lower OKS thresholds

four error categories:

Background False Positives
False Negatives
Scoring
due to sub-optimal confidence score assignment; they occur when two detections are in the proximity of a ground-truth annotation and the one with the highest confidence has the lowest OKS
Localization
- Jitter： $0.5 \leq \exp(-d_i^2 / 2s^2k_i^2) < 0.85$
- Miss： $\exp(-d_i^2 / 2s^2k_i^2) < 0.5$
- Inversion：left-right keypoint flipping within an instance;
- Swap：keypoint swapping between instances
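
The Jitter and Miss thresholds are applied to the per-keypoint similarity $\exp(-d_i^2 / 2s^2k_i^2)$. A small sketch of that computation follows; the label "good" for similarities of 0.85 and above is my own naming, and Inversion and Swap are not covered because they require checking which keypoint or person the prediction actually matches.

```python
import math

def keypoint_similarity(d: float, s: float, k: float) -> float:
    """exp(-d^2 / (2 s^2 k^2)) for distance d, object scale s, per-keypoint constant k."""
    return math.exp(-d ** 2 / (2 * s ** 2 * k ** 2))

def localization_category(ks: float) -> str:
    """Classify by similarity only (ignores Inversion and Swap)."""
    if ks >= 0.85:
        return "good"
    if ks >= 0.5:
        return "jitter"
    return "miss"

# With s * k = 10 pixels: a 10 px error is a jitter, a 20 px error is a miss.
cat_10px = localization_category(keypoint_similarity(d=10, s=100, k=0.1))
cat_20px = localization_category(keypoint_similarity(d=20, s=100, k=0.1))
```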

KAPAO-L is less prone to Swap and Inversion errors

A.5 Qualitative Comparisons

The comparisons show the top-20 scoring pose detections for each model, because all the COCO keypoint metrics are computed using the 20 top-scoring detections.
[Images: qualitative comparisons]
Swap error is a common failure case for CenterGroup but an uncommon failure case for KAPAO due to its detection of holistic pose objects

7 Conclusion (own)

Inspired by:
《DeepDarts: Modeling Keypoints as Objects for Automatic Scorekeeping in Darts using a Single Camera》（CVPR-2021）
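TTA (Test Time Augmentation)

Test-time augmentation means that, at inference (prediction) time, the original image is augmented with horizontal flips, vertical flips, diagonal flips, rotations and similar operations to produce several images. Each image is run through the model separately, and the individual results are then combined into the final output.
Test-Time Augmentation: TTA (Test time augmentation)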

qubvel / ttach
Test Time Augmentation (TTA) and how to perform it with Keras
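
A minimal sketch of horizontal-flip TTA for keypoints, to make the description above concrete. It is a generic illustration, not how KAPAO or the linked libraries implement TTA: `model` is a hypothetical callable that returns a `(K, 2)` array of `(x, y)` keypoints, and `flip_pairs` is an assumed list of left/right keypoint index pairs.

```python
import numpy as np

def hflip_tta(model, image, flip_pairs):
    """Average predictions from the image and its horizontal mirror."""
    w = image.shape[1]
    kps = model(image)
    kps_flipped = model(image[:, ::-1]).copy()
    kps_flipped[:, 0] = (w - 1) - kps_flipped[:, 0]   # map x back to the original image
    for left, right in flip_pairs:
        # A mirrored left keypoint is predicted as a right keypoint, so swap back.
        kps_flipped[[left, right]] = kps_flipped[[right, left]]
    return (kps + kps_flipped) / 2

def toy_model(image):
    """Toy stand-in: keypoint 0 is the leftmost marked pixel, keypoint 1 the rightmost."""
    ys, xs = np.nonzero(image)
    order = np.argsort(xs)
    return np.stack([xs[order], ys[order]], axis=1).astype(float)

image = np.zeros((8, 10))
image[2, 1] = 1.0
image[5, 7] = 1.0
result = hflip_tta(toy_model, image, flip_pairs=[(0, 1)])
```

With this toy model both passes agree after un-flipping, so the average equals the single-pass prediction; with a real model the two passes differ slightly and averaging reduces the noise.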
Fall Datasets
Head Detector
- https://github.com/aditya-vora/FCHD-Fully-Convolutional-Head-Detector