【Point-Set】《Point-Set Anchors for Object Detection, Instance Segmentation and Pose Estimation》

Latest recommended article published on 2022-12-24 23:27:50

bryant_meng

Category: CNN / Transformer  Tags: anchor

Article link: https://blog.csdn.net/bryant_meng/article/details/109203717


ECCV-2020

The authors' own blog post: Unifying object detection, instance segmentation, and human pose estimation with Point-set Anchor

1 Background and Motivation

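【CornerNet】《CornerNet: Detecting Objects as Paired Keypoints》 locates the top-left and bottom-right corners of all bboxes from the predicted heatmaps and offsets, then uses predicted embeddings to pair the top-left and bottom-right corners of the same object into a bbox.

【CenterNet】《Objects as Points》 determines center points from the predicted heatmap and offsets, then uses the predicted size (width and height) to locate the bbox.

Of these two pioneers of anchor-free object detection, one relies on features at two points (top-left and bottom-right) and the other on features at a single point (the center), and that is all they need to predict a bbox. The authors feel a single point may not carry enough information, and open the paper with the following passage: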

we argue that the image features extracted at a central point contain limited information for predicting distant keypoints or bounding box boundaries, due to object deformation and scale/orientation variation.

If one point is not enough, use more points. Borrowing the idea of anchor-based object detection, the authors propose the point-set anchor and report competitive results on object detection, instance segmentation, and human pose estimation.

[image]

2 Related Work

object representation
- rectangular anchor (anchor-based):
  e.g., Faster R-CNN, FPN, focal loss (RetinaNet)
- point representation (anchor-free):
  e.g., center points, corner points, extreme points, octagon points, point set, radial points
instance segmentation
- two-stage (top-down): object detection first, then semantic segmentation
- single-stage (bottom-up): instance segmentation directly
pose estimation
- part-based heatmap learning, the most common approach: keypoints are predicted from heatmaps (learned locally, one point at a time)
  Drawbacks:
  1) not end-to-end training
  2) need for high resolution
  3) separate steps for joint localization and association (association means grouping the keypoints of the same person; top-down methods, which detect people first and then keypoints, do not have this drawback, while bottom-up methods, which detect all keypoints first and then group them into people, do)
- holistic shape regression (predict the whole shape from a single point), e.g., CenterNet
  Drawback:
  However, the holistic shape regression is generally more difficult than part based heat map learning due to optimization on a large, high-dimensional vector space.

3 Advantages / Contributions

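Building on anchor-free methods and borrowing from anchor-based ones, the authors propose the point-set anchor (just a few more points, please)
Based on RetinaNet, they propose the PointSetNet network to work with the point-set anchor mechanism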

4 Method

4.1 Point-Set Anchors

1) pose point-set anchor

The COCO dataset has 17 keypoints; see "MS COCO object detection and human keypoint detection evaluation metrics"

"keypoints": [
    "nose",
    "left_eye",
    "right_eye",
    "left_ear",
    "right_ear",
    "left_shoulder",
    "right_shoulder",
    "left_elbow",
    "right_elbow",
    "left_wrist",
    "right_wrist",
    "left_hip",
    "right_hip",
    "left_knee",
    "right_knee",
    "left_ankle",
    "right_ankle"
]

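[Figure: one example pose, i.e., one anchor]
The authors represent a pose point-set anchor as a 34-dimensional vector (17 keypoints × 2 coordinates), as in the figure above, which shows one pose, i.e., one anchor.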

We initialize the point-set anchors as the most frequent poses in the training set

Specifically, k-means clustering on the training set yields the most common poses, which serve as the pose point-set anchors.
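The clustering step can be sketched with plain Lloyd's k-means on flattened poses. This is only an illustration: the function name `kmeans_pose_anchors`, the random initialization, and the assumption that poses have already been normalized (e.g., centered and scaled per person) are mine, not details from the paper.

```python
import numpy as np

def kmeans_pose_anchors(poses, k=3, iters=50, seed=0):
    """Cluster poses of shape (N, 17, 2) and return k centers of shape (k, 17, 2).

    Assumes the poses were normalized beforehand (hypothetical preprocessing).
    """
    x = poses.reshape(len(poses), -1).astype(float)      # (N, 34)
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every pose to every center.
        d = ((x[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        new_centers = np.array([
            x[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers.reshape(k, -1, 2)
```

Each returned center is a 34-dimensional pose reshaped to (17, 2) and can serve as one pose point-set anchor.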

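2) instance mask point-set anchor

[Figure: instance mask point-set anchor]
The center point together with an implicit bbox forms the instance mask point-set anchor. As in Faster R-CNN, each spatial location of the feature map has 9 instance mask point-set anchors (3 aspect ratios × 3 scales).

In this paper, each point-set anchor has 36 points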
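As a rough sketch of how such an anchor could be built, the snippet below places 36 ordered points on the boundary of an implicit box. Spacing the points evenly by arc length and starting at the top-left corner are assumptions for illustration; `box_boundary_points` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def box_boundary_points(cx, cy, w, h, n=36):
    """Return n ordered points evenly spaced along the perimeter of a box.

    Order: start at the top-left corner and walk top, right, bottom, left
    edges (clockwise in image coordinates, where y grows downward).
    """
    x0, y0, x1, y1 = cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
    corners = np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]], dtype=float)
    edge_len = np.array([w, h, w, h], dtype=float)
    starts = np.concatenate([[0.0], np.cumsum(edge_len)[:-1]])
    perimeter = edge_len.sum()
    pts = []
    for d in np.arange(n) * perimeter / n:
        e = np.searchsorted(starts, d, side="right") - 1   # edge containing d
        t = (d - starts[e]) / edge_len[e]                  # position on edge
        a, b = corners[e], corners[(e + 1) % 4]
        pts.append(a + t * (b - a))
    return np.array(pts)
```

Calling this once per (scale, aspect ratio) pair gives the 9 point-set anchors at a location, all sharing the same center.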

4.2 Shape Regression

Regress the offsets $\Delta T$ from the point-set anchor $T$ to the ground-truth (GT) shape $S$.

$S = \{S_i\}_{i=1}^{n_s}$, where $S_i$ denotes

the $i$-th polygon vertex for an instance mask
the $i$-th corner vertex for a bounding box
the $i$-th keypoint for pose estimation

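1) Offsets for pose estimation

$\Delta T = S - T$

That is, the GT points minus the anchor points

Earlier methods usually predict keypoints from heatmaps; the authors instead have the network regress offsets relative to the point-set anchor to locate the keypoints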
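A minimal sketch of this encode/decode relation, with hypothetical helper names. Any normalization of the offsets (for example by anchor size) is left out because the text does not describe one.

```python
import numpy as np

def encode_offsets(gt_shape, anchor):
    """Regression target: Delta T = S - T, computed point by point."""
    return np.asarray(gt_shape, dtype=float) - np.asarray(anchor, dtype=float)

def decode_shape(anchor, offsets):
    """Inverse step at inference: predicted shape = T + predicted Delta T."""
    return np.asarray(anchor, dtype=float) + np.asarray(offsets, dtype=float)
```

For pose estimation both arrays would have shape (17, 2), one row per COCO keypoint.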

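2) Offsets for instance segmentation

For $S$, the number of points might be different for different object instances.

The point-set anchor $T$, however, is fixed, so how should the $\Delta T$ that the network learns be defined? The authors introduce matching points $T^*$ to approximate $S$, and compute $\Delta T$ as

$\Delta T = T^* - T$

Different matching schemes give different $T^*$; the authors describe the following three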

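[Figure 2: the three matching strategies, (a) nearest point, (b) nearest line, (c) corner point with projection]

Nearest point, Figure 2(a): yellow points are the GT and green points are the point-set anchor. For each green point, the nearest yellow point (L1 distance) becomes its new GT, $T^*$. The problem is that many green points may map to the same yellow point (the green points then look redundant, i.e., the point-set anchor has too many points).
Nearest line, Figure 2(b): treat the GT yellow points as line segments by connecting each pair of adjacent yellow points. Each green point is projected perpendicularly onto the yellow segments, and the projection with the shortest distance (the intersection of the perpendicular with the yellow line) becomes its new GT, $T^*$. Several green points can still map to the same $T^*$ (at a bend in the figure, the vertex shared by two yellow segments easily becomes the $T^*$ of several green points), but this is much less likely.
Corner point with projection, Figure 2(c): first use the Nearest point rule to find the yellow points closest to the four corner green points. Using these four yellow points as references, together with the implicit bounding box and the polygon contour formed by the yellow points, divide the area into four regions: top, bottom, right, and left (shown in red in the figure below). Green points in the top region project downward, those in the bottom region project upward, and those in the left and right regions project rightward and leftward respectively; the intersection with a segment formed by yellow points is the network's new GT, $T^*$. Green points whose projection lands in another region are invalid anchor points and do not take part in training.

[Figure: the top, bottom, right, and left regions, marked in red]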

In the experiments the authors use the Corner point with projection scheme
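The first two matching strategies are simple enough to sketch in NumPy. The function names are hypothetical, and two details are assumptions: the GT polygon is treated as closed, and nearest-line projections are clamped to the segment and ranked by Euclidean distance. The corner-point-with-projection scheme is not covered here.

```python
import numpy as np

def nearest_point_match(anchor_pts, gt_pts):
    """For each anchor point, pick the GT vertex with the smallest L1 distance."""
    d = np.abs(anchor_pts[:, None, :] - gt_pts[None, :, :]).sum(-1)  # (A, G)
    return gt_pts[d.argmin(axis=1)]

def nearest_line_match(anchor_pts, gt_pts):
    """For each anchor point, pick the closest point on any GT polygon edge."""
    a = gt_pts
    b = np.roll(gt_pts, -1, axis=0)              # next vertex (closed polygon)
    ab = b - a                                   # (G, 2) edge vectors
    ap = anchor_pts[:, None, :] - a[None]        # (A, G, 2)
    denom = np.maximum((ab * ab).sum(-1), 1e-12)
    # Projection parameter along each edge, clamped to stay on the segment.
    t = np.clip((ap * ab[None]).sum(-1) / denom, 0.0, 1.0)
    proj = a[None] + t[..., None] * ab[None]     # (A, G, 2)
    dist = np.linalg.norm(anchor_pts[:, None, :] - proj, axis=-1)
    idx = dist.argmin(axis=1)
    return proj[np.arange(len(anchor_pts)), idx]
```

With either function, the regression target is the returned $T^*$ minus the anchor points. In the nearest-point version, duplicates in the output show directly how several anchor points collapse onto one GT vertex.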

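3) Offsets for object detection

The anchor for object detection is defined by the top-left and bottom-right points; the $\Delta T$ that the network regresses is the distance from the top-left and bottom-right points of the point-set anchor to the top-left and bottom-right points of the GT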

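4) Positive and negative samples

For object detection and instance segmentation, anchors with IoU > 0.6 are positive and those with IoU < 0.4 are negative

Note that in practice instance segmentation also uses the bbox IoU (between the GT and the point-set anchor combined with its offsets) rather than the mask IoU, to reduce computation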
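A small sketch of this IoU-based assignment. The 0.6 / 0.4 thresholds come from the text; marking anchors that fall between the two thresholds as ignored (-1) follows the usual RetinaNet convention and is an assumption here, as are the function names.

```python
import numpy as np

def box_iou(boxes, gt):
    """IoU between boxes of shape (N, 4) and one GT box (4,), as x0, y0, x1, y1."""
    ix0 = np.maximum(boxes[:, 0], gt[0])
    iy0 = np.maximum(boxes[:, 1], gt[1])
    ix1 = np.minimum(boxes[:, 2], gt[2])
    iy1 = np.minimum(boxes[:, 3], gt[3])
    inter = np.clip(ix1 - ix0, 0, None) * np.clip(iy1 - iy0, 0, None)
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_b + area_g - inter)

def assign_labels(iou, pos_thr=0.6, neg_thr=0.4):
    """Return 1 for positives, 0 for negatives, -1 for anchors in between."""
    labels = np.full(iou.shape, -1, dtype=int)
    labels[iou < neg_thr] = 0
    labels[iou > pos_thr] = 1
    return labels
```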

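For human keypoint detection, anchors with OKS > 0.5 are positive and those with OKS < 0.4 are negative. For the definition of OKS, see "MS COCO object detection and human keypoint detection evaluation metrics"; it plays the same role as IoU in object detection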
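For reference, OKS can be sketched as follows, using the standard COCO definition and per-keypoint sigmas in the keypoint order listed earlier. How the paper computes `area` for an anchor is not stated in these notes, so the `area` argument here is simply whatever object area the caller supplies.

```python
import numpy as np

# Per-keypoint constants from the COCO keypoint evaluation (17 keypoints).
COCO_SIGMAS = np.array([.26, .25, .25, .35, .35, .79, .79, .72, .72,
                        .62, .62, 1.07, 1.07, .87, .87, .89, .89]) / 10.0

def oks(pred, gt, visible, area):
    """Object keypoint similarity between two (17, 2) poses.

    visible: boolean mask (17,) of labeled GT keypoints; area: GT object area.
    """
    d2 = ((pred - gt) ** 2).sum(-1)              # squared distance per keypoint
    k2 = (2 * COCO_SIGMAS) ** 2
    e = d2 / (2 * k2 * (area + np.spacing(1)))
    if not visible.any():
        return 0.0
    return float(np.exp(-e)[visible].mean())
```

An anchor pose would then be labeled positive when `oks(anchor, gt, visible, area) > 0.5` and negative when it is below 0.4.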

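5) Mask construction

When the matching method is Nearest point or Nearest line, take any point as the starting point and connect the points in order to form the mask

When the matching method is Corner point with projection, connect only the valid points in order to form the mask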
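Connecting ordered points into a mask amounts to rasterizing a polygon. The sketch below uses an even-odd ray-casting test on pixel centers; the rasterization rule and the helper name `polygon_to_mask` are illustrative assumptions, since the notes do not say how the paper fills the polygon.

```python
import numpy as np

def polygon_to_mask(points, valid, height, width):
    """Rasterize the polygon through the valid points (kept in order).

    points: (N, 2) as (x, y); valid: boolean (N,). Returns a bool (H, W) mask.
    """
    poly = points[valid]
    ys, xs = np.mgrid[0:height, 0:width]
    px, py = xs + 0.5, ys + 0.5                  # pixel centers
    inside = np.zeros((height, width), dtype=bool)
    x0, y0 = poly[:, 0], poly[:, 1]
    x1, y1 = np.roll(x0, -1), np.roll(y0, -1)    # next vertex, closing the loop
    for ax, ay, bx, by in zip(x0, y0, x1, y1):
        # Does a horizontal ray from the pixel center cross this edge?
        crosses = (ay > py) != (by > py)
        with np.errstate(divide="ignore", invalid="ignore"):
            xi = ax + (py - ay) * (bx - ax) / (by - ay)
        inside ^= crosses & (px < xi)            # even-odd rule
    return inside
```

Passing an all-True `valid` array covers the Nearest point and Nearest line cases, where every point is used.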

4.3 PointSetNet

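[Figure 3: PointSetNet architecture]

1) Architecture

The network is a modified RetinaNet. The FPN levels are $\{P_3, P_4, P_5, P_6, P_7\}$, and the head has three branches: classification, object detection, and mask / pose regression. The dimensions of the three branches are as follows

[Table: output dimensions of the three branches]
For pose estimation, the 2 in classification indicates person or not, and the 2 in shape regression stands for the x, y coordinates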

2) Point-set anchor density

For object detection and instance segmentation, the implicit bounding boxes are

3 scales $2^{k/3}$ ($k \le 3$) and 3 aspect ratios $[0.5, 1, 2]$ per location on each of the feature maps
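As a worked example, the 9 implicit box shapes at one location can be enumerated as below. Taking $k \in \{0, 1, 2\}$ for the three scales and using a per-level base size (such as 32 on $P_3$, as in RetinaNet) are assumptions carried over from RetinaNet, and the aspect ratio is interpreted as height / width.

```python
import numpy as np

def implicit_box_shapes(base_size,
                        scales=tuple(2 ** (k / 3) for k in range(3)),
                        ratios=(0.5, 1.0, 2.0)):
    """Return the (w, h) of all scale x ratio combinations, shape (9, 2)."""
    shapes = []
    for s in scales:
        area = (base_size * s) ** 2      # every ratio keeps the same area
        for r in ratios:                 # r = h / w
            w = np.sqrt(area / r)
            shapes.append((w, w * r))
    return np.array(shapes)
```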

For keypoint detection, the point-set anchors are

3 poses produced by k-means clustering, combined with 3 scales and 3 rotations, giving 27 anchors per location
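The 3 × 3 × 3 = 27 combination can be sketched like this. The concrete scale factors and rotation angles in the usage below are placeholders chosen for illustration; the notes do not list the values used in the paper, and rotating about the pose centroid is also an assumption.

```python
import numpy as np

def expand_pose_anchors(base_poses, scales, angles_deg):
    """Combine each base pose with every scale and rotation.

    base_poses: (P, 17, 2). Returns (P * len(scales) * len(angles_deg), 17, 2),
    each anchor centered at the origin so it can be shifted to any location.
    """
    anchors = []
    for pose in base_poses:
        centered = pose - pose.mean(axis=0)      # rotate about the centroid
        for s in scales:
            for a in np.deg2rad(angles_deg):
                rot = np.array([[np.cos(a), -np.sin(a)],
                                [np.sin(a),  np.cos(a)]])
                anchors.append(centered @ rot.T * s)
    return np.stack(anchors)

# Hypothetical values: 3 poses, 3 scales, 3 rotations -> 27 anchors.
demo_anchors = expand_pose_anchors(
    np.random.default_rng(0).normal(size=(3, 17, 2)),
    scales=(0.8, 1.0, 1.25), angles_deg=(-30.0, 0.0, 30.0))
```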

3) Loss function

[Equation image: the overall training loss]

$L_{cls}$: focal loss
$L_{reg}$: L1 loss
$c_{x,y}^*$: classification targets
$t_{x,y}^*$: regression targets
$N_{pos}$: the number of positive samples
$\lambda$: 0.1 and 10.0 for instance segmentation and pose estimation, respectively

The loss is calculated over all locations and all feature maps.
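To make the symbols concrete, here is a sketch that assumes the loss has the usual form $L = \frac{1}{N_{pos}}\sum_{x,y} L_{cls}(c_{x,y}, c_{x,y}^*) + \frac{\lambda}{N_{pos}}\sum_{x,y} \mathbb{1}\{c_{x,y}^* > 0\}\, L_{reg}(t_{x,y}, t_{x,y}^*)$. This form is reconstructed from the symbol list above rather than copied from the paper, and the focal-loss parameters `alpha=0.25`, `gamma=2.0` are the RetinaNet defaults, assumed here.

```python
import numpy as np

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss per element; p is the predicted probability."""
    p_t = np.where(target == 1, p, 1 - p)
    alpha_t = np.where(target == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(np.clip(p_t, 1e-12, 1.0))

def total_loss(cls_prob, cls_target, reg_pred, reg_target, lam):
    """Classification over all samples plus L1 regression over positives only."""
    pos = cls_target > 0
    n_pos = max(int(pos.sum()), 1)               # avoid dividing by zero
    l_cls = focal_loss(cls_prob, cls_target).sum() / n_pos
    l_reg = np.abs(reg_pred[pos] - reg_target[pos]).sum() / n_pos
    return l_cls + lam * l_reg
```

Here `cls_prob` and `cls_target` are flattened over all locations and feature maps, and `reg_pred` / `reg_target` have one row of offsets per sample.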

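4) Elements specific to pose estimation

[Figure: the pose-specific module from Figure 3]
This module in Figure 3 is specific to the human keypoint detection task

Deep shape indexed features

For the theory, see shape-indexed features

A DCN is used to aggregate specific features, providing better features for classification and regression than a simple single-point center feature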

we replace the learnable offset in DCN（Deformable convolution network） with the location of points in a point-set anchor.
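The sampling half of this idea can be illustrated with bilinear interpolation: features are read at the anchor's point locations instead of at learned offsets. This sketch only gathers and concatenates the sampled features; a real deformable convolution would also apply learned kernel weights to them. Function names are hypothetical.

```python
import numpy as np

def bilinear_sample(feat, pts):
    """Sample feat (C, H, W) at fractional (x, y) points (N, 2). Returns (N, C)."""
    _, h, w = feat.shape
    x = np.clip(pts[:, 0], 0, w - 1)
    y = np.clip(pts[:, 1], 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx   # (C, N)
    bot = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return (top * (1 - wy) + bot * wy).T

def shape_indexed_feature(feat, anchor_pts):
    """Concatenate the features sampled at every point of a point-set anchor."""
    return bilinear_sample(feat, anchor_pts).reshape(-1)
```

For a 17-keypoint pose anchor and a C-channel feature map, the result is a vector of length 17 × C describing the whole shape rather than only its center.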

Multi-stage refinement
Holistic shape regression is generally more difficult than part based heat map learning. A common remedy is a sequence of weak regressors (the boosting idea), and the authors adopt this approach for pose estimation as well: Hence, we use one-step refinement for simplicity and efficiency

5 Experiments

5.1 Datasets

MS-COCO

5.2 Experiments on Instance Segmentation

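1) Mask matching strategies
[image]
The Corner Point with Projection GT matching scheme works best

2) Effect of point-set anchors

The more points in a point-set anchor, the better the result, with saturation at 72 points

[image]

Predicting from the center point feels like predicting an entire mask from a single point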

3) Comparison with state-of-the-art methods

[image]

5.3 Experiments on Pose Estimation

[image]

1) Effect of point-set anchors

Table 6 shows that the Mean Pose anchor method works best!

Clustering into 5 anchors works best

5 scales work best

5 rotations work best

2) Effect of deep shape indexed feature

[image]

3) Effect of multi-stage refinement

[image]

4) Effect of stronger backbone network and multi-scale testing

[image]

5) Comparison with state-of-the-art methods

[image]

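6 Conclusion (own) / Future work

Shape-indexed features are one of the traditional hand-crafted features, in the same family as SIFT and HoG. The feature aggregation module that the authors build on shape-indexed features, specifically for human pose estimation, is something I did not fully understand; it involves both shape-indexed features and deformable convolution!
The object detection part is still basically the same as anchor-based methods. The instance segmentation idea is very good: regress from anchors. The keypoint idea is also nice (each point has several human-shaped anchors, but in practice human shapes vary widely and some people are incomplete, such as a half body or just a face, so anchor templates are not easy to choose, although data augmentation can help with this). The unifying idea is worth borrowing, but the results on all three tasks are fairly ordinary. I have to say it from the heart: 【Simple Baselines】《Simple Baselines for Human Pose Estimation and Tracking》 shattered my worldview!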
