【KAPAO】《Rethinking Keypoint Representations:Modeling Keypoints and Poses as Objects for XXX》

在这里插入图片描述

《Rethinking Keypoint Representations: Modeling Keypoints and Poses as Objects for Multi-Person Human Pose Estimation》

ECCV-2022



1 Background and Motivation

在这里插入图片描述
人体姿态估计(Human Pose Estimation)经典方法整理

主流的人体姿态估计多基于 heatmap-based regression 方法

缺点

  • suffer from quantization error
  • require excessive computation to generate and post-process
  • when two keypoints of the same type appear in close proximity to one another, the overlapping heatmap signals may be mistaken for a single keypoint.

本文作者直接提出了 single-stage multi-person human pose estimation by simultaneously detecting human pose and keypoint objects and fusing the detections to exploit the strengths of both object representations——Keypoints And Poses As Objects

在这里插入图片描述

  • pose objects,上图蓝色部分,也即人体框和人体关键点(这个就能预测关键点了)

  • keypoint objects,上图红色部分,也即关键点框(作者本文提出来的,同 pose objects 一起预测出来,框的中心点会和 pose objects 的关键点按一定策略进行融合)

2 Related Work

  • Heatmap-free keypoint detection
  • Single-stage human pose estimation
    generally less accurate, but usually perform better in crowded scenes
  • Extending object detectors for human pose estimation

3 Advantages / Contributions

提出 KAPAO 人体姿态估计方法(keypoint objects,pose objects),heatmap-free,significantly faster and more accurate than state-of-the-art heatmap-based methods when not using TTA

4 KAPAO:Keypoints and Poses as Objects

O ^ = O ^ k ∪ O ^ p \mathbf{\hat{O}} = \mathbf{\hat{O}}^k \cup \mathbf{\hat{O}}^p O^=O^kO^p

a set of keypoint objects { O k ^ ∈ O ^ k } \{\hat{\mathcal{O}^k} \in \mathbf{\hat{O}}^k\} {Ok^O^k}

a set of pose objects { O p ^ ∈ O ^ p } \{\hat{\mathcal{O}^p} \in \mathbf{\hat{O}}^p\} {Op^O^p}

对于每个 keypoint object O k \mathcal{O}^k Ok,有 a small bounding box b = ( b x , b y , b w , b h ) \mathbf{b} = (b_x, b_y, b_w, b_h) b=(bx,by,bw,bh),超参数 b s = b w = b h b_s = b_w = b_h bs=bw=bh 控制 b \mathbf{b} b 的大小

  • characterized by strong local features
  • carry no information regarding the concept of a person or pose
  • keypoint objects exist in a subspace of a pose objects

对于每个 pose object O p \mathcal{O}^p Op,有 a bounding box of class “person,” and a set of keypoints z = { ( x k , y k ) } k = 1 K \mathbf{z} = \{(x_k, y_k)\}_{k=1}^{K} z={(xk,yk)}k=1K

4.1 Architectural Details

在这里插入图片描述
N ( I ) = G ^ \mathcal{N}(\mathbf{I}) = \mathbf{\hat{G}} N(I)=G^

input image I ∈ R h × w × 3 \mathbf{I} \in \mathbb{R}^{h \times w \times 3} IRh×w×3

network N \mathcal{N} N

output grid G ^ = { G ^ s ∣ s ∈ { 8 , 16 , 32 , 64 } } \mathbf{\hat{G}} = \{\hat{\mathcal{G}}^s| s \in \{8, 16, 32, 64\}\} G^={G^ss{8,16,32,64}} G ^ s ∈ R h s × w s × N a × N o \hat{\mathcal{G}}^s \in \mathbb{R}^{\frac{h}{s} \times \frac{w}{s} \times N_a \times N_o} G^sRsh×sw×Na×No

  • N a N_a Na is the number of anchor channels

  • N o = 3 K + 6 N_o = 3K+6 No=3K+6 is the number of output channels for each object

    • the objectness p ^ o \hat{p}_o p^o,1
    • the intermediate bounding boxes t ^ ′ = ( t ^ x ′ , t ^ y ′ , t ^ w ′ , t ^ h ′ ) \hat{\mathbf{t}}'= (\hat{t}'_x, \hat{t}'_y, \hat{t}'_w, \hat{t}'_h) t^=(t^x,t^y,t^w,t^h),4
    • the object class scores c ^ = ( c ^ 1 , . . . , c ^ K + 1 ) \hat{\mathbf{c}}=(\hat{c}_1,...,\hat{c}_{K+1}) c^=(c^1,...,c^K+1),K+1
    • the intermediate keypoints v ′ ^ = { v ^ x k ′ , v ^ y k ′ } k = 1 K \hat{\mathbf{v}'} = \{\hat{v}'_{xk}, \hat{v}'_{yk}\}_{k=1}^K v^={v^xk,v^yk}k=1K,2K

Additional detection redundancy is provided by also allowing the neighbouring grid cells G ^ i ± 1 , j s \hat{\mathcal{G}}_{i\pm1, j}^s G^i±1,js and G ^ i , j ± 1 s \hat{\mathcal{G}}_{i, j\pm1}^s G^i,j±1s to detect an object in image patch I p = I s i : s ( i + 1 ) , s j : s ( j + 1 ) \mathbf{I}_p = \mathbf{I}_{si:s(i+1), sj:s(j+1)} Ip=Isi:s(i+1),sj:s(j+1)

在这里插入图片描述
bbox 的计算(x,y 计算的是偏移量)
在这里插入图片描述
关键点的计算(偏移量)
在这里插入图片描述
其中 A w A_w Aw A h A_h Ah 表示 anchor 的宽高

4.2 Loss Function

在这里插入图片描述

  • w s w_s ws is grid weighting
  • IoU 指的是 CIoU
  • v k v_k vk 是 keypoint visibility flags
  • N b N_b Nb 是 batch size

在这里插入图片描述

4.3 Inference

The predicted intermediate bounding boxes t ^ \hat{\mathbf{t}} t^ and keypoints v ^ \hat{\mathbf{v}} v^ are mapped back to the original image coordinates using the following transformation:

b ^ = s ( t ^ + [ i , j , 0 , 0 ] ) \hat{\mathbf{b}} = s(\hat{\mathbf{t}}+[i,j,0,0]) b^=s(t^+[i,j,0,0])

z ^ k = s ( v ^ k + [ i , j ] ) \hat{\mathbf{z}}_k = s(\hat{\mathbf{v}}_k+[i,j]) z^k=s(v^k+[i,j])

NMS 处理

O ^ p ′ = N M S ( O ^ p , τ b p ) \hat{\mathbf{O}}^{p'} = NMS(\hat{\mathbf{O}}^{p}, \tau_{bp}) O^p=NMS(O^p,τbp)

O ^ k ′ = N M S ( O ^ k , τ b k ) \hat{\mathbf{O}}^{k'} = NMS(\hat{\mathbf{O}}^{k}, \tau_{bk}) O^k=NMS(O^k,τbk)

τ b p \tau_{bp} τbp τ b k \tau_{bk} τbk 为 IoU thresholds

结果进行融合

P ^ = φ ( O ^ p ′ , O ^ k ′ , τ f d , τ f c ) \hat{\mathbf{P}} = \varphi(\hat{\mathbf{O}}^{p'}, \hat{\mathbf{O}}^{k'}, \tau_{fd}, \tau_{fc}) P^=φ(O^p,O^k,τfd,τfc)

融合算法如下:
在这里插入图片描述

在这里插入图片描述
上述解读来自 单阶段多人 2D 人体估计算法——KAPAO

4.4 Limitations

  • pose objects do not include individual keypoint confidences,只有 keypoint objects 有

  • training requires a considerable amount of time and GPU memory due to the large input size used

5 Experiments

5.1 Datasets and Metrics

  • COCO
  • CrowdPose

5.2 Microsoft COCO Keypoints

val 集上
在这里插入图片描述
The post-processing time of KAPAO depends less on the input size so it only increases by approximately 1 ms when using TTA.

看看 test 集上的结果
在这里插入图片描述

5.3 CrowdPose

在这里插入图片描述

5.4 Ablation Studies

在这里插入图片描述
左图,keypoint objects 中,关键点框的大小

右图,不同关键点的融合率,distinct local image features (e.g., the eyes, ears, and nose) have higher fusion rates as they are detected more precisely as keypoint objects than as pose objects.

下面看看融合涨点情况以及耗时情况
在这里插入图片描述

A Supplementary Material

A.1 Hyperparameters

在这里插入图片描述

A.2 Influence of input size on accuracy and speed

在这里插入图片描述
{640, 768, 896, 1024, 1152, 1280}

reducing the input size to 1152 had a negligible effect on the accuracy but provided a meaningful latency reduction.

A.3 Video Inference Demos

在这里插入图片描述
在这里插入图片描述

A.4 Error Analysis

KAPAO consistently provides higher AR than previous single-stage methods and higher AP at lower OKS thresholds

four error categories:

  • Background False Positives

  • False Negatives

  • Scoring
    due to sub-optimal confidence score assignment; they occur when two detections are in the proximity of a ground-truth annotation and the one with the highest confidence has the lowest OKS

  • Localization

    • Jitter: 0.5 ≤ e x p ( − d i 2 / 2 s 2 k i 2 ) < 0.85 0.5 \leq exp(-d_i^2 / 2s^2k_i^2)<0.85 0.5exp(di2/2s2ki2)<0.85
    • Miss: e x p ( − d i 2 / 2 s 2 k i 2 ) < 0.5 exp(-d_i^2 / 2s^2k_i^2)<0.5 exp(di2/2s2ki2)<0.5
    • Inversion:left-right keypoint flipping within an instance;
    • Swap:keypoint swapping between instances
      在这里插入图片描述

KAPAO-L is less prone to Swap and Inversion errors

A.5 Qualitative Comparisons

the top-20 scoring pose detections for each model.

Because all the COCO keypoint metrics are computed using the 20 top-scoring detections
在这里插入图片描述
在这里插入图片描述

在这里插入图片描述
Swap error is a common failure case for CenterGroup but an uncommon failure case for KAPAO due to its detection of holistic pose objects
在这里插入图片描述

7 Conclusion(own)

  • 1
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值