CornerNet: Object Detection Without Anchor Boxes

Latest recommended article published on 2022-08-02 19:07:13

田神



Article link: https://blog.csdn.net/StreamRock/article/details/100115681


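This article takes a close look at deep-learning-based anchor-free object detection. It focuses on how CornerNet works, covering the ConvNet, Corner Pooling, Heatmaps, Offsets, and Embeddings. It also introduces CenterNet, the successor to CornerNet, and analyzes how it improves detection through three-keypoint localization, Center Pooling, and Cascade Corner Pooling. Along the way it shows how anchor-free methods remove the computational burden and the positive/negative imbalance that anchor boxes introduce.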


I. Introduction

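Object detection is an important area of image recognition. I recently read a survey from August 2019, "Recent Advances in Deep Learning for Object Detection" [1], and realized I had fallen behind again: objects can now be localized without anchor boxes. Here is the development timeline of object detection given in the survey:
Figure 1. Development timeline of deep-learning-based object detection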

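The figure shows two trends: one is anchor free, the other is AutoML. Let us start with the anchor-free approach, which is the focus of this article.
Anchor boxes (which I render in Chinese as "锚矩形", "anchor rectangles", rather than "锚盒", because "rectangle" seems closer to how they are used) are the reference for object localization. Their positions in the image are fixed. The bounding-box coordinates regressed by the convolutional network are generally normalized positions relative to an anchor box, so combining the relative position of a bounding box with the absolute position of its anchor box localizes the object.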
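To make "relative position plus anchor position" concrete, here is a minimal sketch of decoding one prediction. The post does not say which parameterization is meant, so this assumes the common Faster R-CNN-style encoding (center offsets scaled by the anchor size, log-scale width and height); `decode_box` and its argument layout are hypothetical names chosen for illustration.

```python
import math

def decode_box(anchor, deltas):
    """Turn anchor-relative predictions into absolute corner coordinates.

    anchor: (cx, cy, w, h) of a fixed anchor box, in pixels.
    deltas: (tx, ty, tw, th) regressed by the network (assumed
            Faster R-CNN-style encoding, not taken from the post).
    Returns (x1, y1, x2, y2) in pixels.
    """
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = deltas
    cx = ax + tx * aw          # center offset is normalized by anchor width
    cy = ay + ty * ah          # ... and by anchor height
    w = aw * math.exp(tw)      # size is predicted as a log-scale factor
    h = ah * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# An anchor centered at (100, 100) of size 40 x 20, shifted half an anchor
# width to the right and half an anchor height up, with unchanged size.
print(decode_box((100, 100, 40, 20), (0.5, -0.5, 0.0, 0.0)))
```

The network only ever predicts the small `deltas`; the absolute position comes entirely from the fixed anchor, which is why every location needs its own set of anchors.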
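Detection models before CornerNet, whether one-stage or two-stage, all rely on anchor boxes. However, the anchor-box mechanism has two major problems:
1. Any model that uses anchor boxes defines a very large number of them over the image. This increases computation, and it also creates an imbalance between positive and negative examples, which degrades training.
2. Anchor boxes have to be designed. This adds many hyperparameters that must be set by hand, and because objects at different scales are extracted from different feature maps, it also makes the network more complex.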
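A quick back-of-the-envelope count illustrates the first problem. The pyramid sizes, the 9 anchors per location, and the number of positives below are all made-up values for illustration, not the configuration of any particular detector.

```python
def count_anchors(feature_map_sizes, anchors_per_location):
    """Total number of anchors when every cell of every feature map
    carries the same number of anchor boxes."""
    return sum(h * w * anchors_per_location for h, w in feature_map_sizes)

# Hypothetical 5-level pyramid for a 512 x 512 input (strides 8 to 128).
sizes = [(64, 64), (32, 32), (16, 16), (8, 8), (4, 4)]
total = count_anchors(sizes, anchors_per_location=9)

# Assume 10 objects in the image, each matched by about 10 anchors.
positives = 10 * 10
print(total, positives / total)
```

Under these assumptions only a fraction of a percent of the anchors are positive; the rest are negatives that still have to be scored and trained on.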
To eliminate these two drawbacks of anchor boxes, CornerNet [2] proposed an anchor-free approach. The rest of this article walks through how CornerNet is implemented.

II. How Anchor Free Works

The architecture of CornerNet is shown below:
Figure 2. Block diagram of CornerNet

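The pipeline has three parts:
1. ConvNet: a convolutional network that extracts features.
2. Prediction module: after Corner Pooling it predicts three things, Heatmaps, Offsets, and Embeddings, all of which are used to compute object locations.
3. Loss: several loss terms are summed into a total loss, and the network parameters are trained with Adam.
Next, we look at how each part is implemented.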

2.1 ConvNet as the backbone

The convolutional network CornerNet uses is called Hourglass. [2] describes it as follows:
The hourglass network was first introduced for the human pose estimation task. It is a fully convolutional neural network that consists of one or more hourglass modules. An hourglass module first down samples the input features by a series of convolution and max pooling layers. It then up samples the features back to the original resolution by a series of upsampling and convolution layers. Since details are lost in the max pooling layers, skip layers are added to bring back the details to the upsampled features. The hourglass module captures both global and local features in a single unified structure. When multiple hourglass modules are stacked in the network, the hourglass modules can reprocess the features to capture higher-level of information. These properties make the hourglass network an ideal choice for object detection as well.
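In short: an hourglass module has two halves. The first half uses convolution and max pooling to shrink the feature maps layer by layer. The second half uses upsampling and convolution to restore the feature maps to their original size. To limit the information lost to max pooling, skip layers bring the lost details back. Hourglass modules can be stacked to form a detection network. Because the feature maps first shrink and then expand, like an hourglass, the network is named accordingly.
Figure 3. Structure of the hourglass network [3]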

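The CornerNet backbone is built by stacking two hourglass modules; Figure 2 simply draws them as two hourglasses lying on their sides. To simplify the implementation, CornerNet also replaces max pooling with stride 2 for reducing resolution. The remaining details can be found in the code [4].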
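The shrink, expand, and skip pattern described in this section can be sketched on a plain 2-D grid. This is only a structural illustration under simplifying assumptions: all convolutions are omitted, downsampling uses 2 x 2 max pooling (as in the quoted description, not CornerNet's stride-2 variant), upsampling is nearest neighbor, and the skip layer is an elementwise addition. The function names are hypothetical.

```python
def max_pool2x2(x):
    """2 x 2 max pooling with stride 2; height and width must be even."""
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]

def upsample2x(x):
    """Nearest-neighbor upsampling: every value becomes a 2 x 2 block."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def add(a, b):
    """Elementwise sum of two grids of equal size."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def hourglass(x, depth):
    """One hourglass module with `depth` down/up levels (convs omitted)."""
    if depth == 0:
        return x                          # bottleneck: lowest resolution
    skip = x                              # skip layer keeps the details
    low = hourglass(max_pool2x2(x), depth - 1)
    return add(upsample2x(low), skip)     # restore size, add details back

features = [[1, 2], [3, 4]]
print(hourglass(features, depth=1))       # same 2 x 2 size as the input
```

The output always has the input's resolution, and the skip branch is what lets position-specific values survive the pooling step.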

2.2 The prediction part of the network

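As Figure 2 shows, the hourglass modules are followed by two prediction modules, one predicting the top-left corner of the bounding box and the other the bottom-right corner. This is the key to CornerNet. The feature maps output by the hourglass pass through Corner Pooling, and the module finally outputs three sets of predictions: Heatmaps, Embeddings, and Offsets. The final object locations are obtained from these by a post-processing algorithm.
Below, we first look at what these three sets of predictions are, and then at how Corner Pooling works.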

2.2.1 Heatmaps

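According to [2], the Heatmaps form a C × H × W tensor, where H is the height and W the width, reflecting the size of the map, and C is the number of channels, which equals the number of object categories. A point $p_{cij}$ on the heatmap is the score that location (i, j) in the image is a corner (top-left or bottom-right) of category c. It is a number between 0 and 1 and can be read as a probability.
Each ground-truth bounding box has exactly one top-left corner (and exactly one bottom-right corner). At a corner location $y_{cij}$ is 1, and everywhere else $y_{cij}$ is 0. This gives a binary cross-entropy loss: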
$L_{det} = -\frac{1}{N} \sum_{c=1}^{C} \sum_{i=1}^{W} \sum_{j=1}^{H} \left\{ \begin{array}{ll} \log(p_{cij}) & \text{if } y_{cij}=1 \\ \log(1-p_{cij}) & \text{otherwise} \end{array} \right. \qquad (1)$
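As a rough sketch of Eq. (1), the snippet below builds the 0/1 ground-truth tensor for one corner type and evaluates the loss with nested lists. The helper names, the `(class, row, col)` indexing into the C × H × W tensor, and the toy probabilities are assumptions made for illustration; `num_objects` plays the role of N.

```python
import math

def corner_targets(corners, num_classes, height, width):
    """Build y as a C x H x W nested list: 1 at each corner, 0 elsewhere.

    corners: list of (class_id, row, col) for one corner type,
             e.g. all top-left corners of the ground-truth boxes.
    """
    y = [[[0] * width for _ in range(height)] for _ in range(num_classes)]
    for c, i, j in corners:
        y[c][i][j] = 1
    return y

def bce_detection_loss(p, y, num_objects):
    """Eq. (1): binary cross-entropy summed over all classes and positions."""
    total = 0.0
    for pc, yc in zip(p, y):
        for prow, yrow in zip(pc, yc):
            for pij, yij in zip(prow, yrow):
                total += math.log(pij) if yij == 1 else math.log(1.0 - pij)
    return -total / num_objects

# Toy case: 1 class, a 2 x 2 heatmap, one top-left corner at row 0, col 0.
y = corner_targets([(0, 0, 0)], num_classes=1, height=2, width=2)
p = [[[0.9, 0.1],
      [0.1, 0.1]]]           # made-up predicted scores
print(bce_detection_loss(p, y, num_objects=1))
```

Every position contributes one term, so on a realistic heatmap the sum is made up almost entirely of non-corner positions.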
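Training the network with the loss above works poorly because positive and negative cases are heavily imbalanced. [5] proposes a way to rebalance such training examples: focal loss. Quoting the original:
1. Easily classified negatives comprise the majority of the loss and dominate the gradient.
In other words, easy negatives make up most of the examples, so they dominate the gradient computation.
2. We propose to add a modulating factor $(1-p_t)^{\gamma}$ to the cross entropy loss, with tunable focusing parameter $\gamma \ge 0$ . We define the focal loss as:
$-(1-p_t)^{\gamma} \log(p_t)$
Here $p_t$ is the probability the model assigns to the true class. Multiplying the cross entropy by the decay factor $(1-p_t)^{\gamma}$ attenuates examples with large $p_t$ (easy to classify) strongly and examples with small $p_t$ (hard to classify) only slightly. Typically $\gamma \in [0,5]$ .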
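A few lines of code show how strongly the modulating factor separates easy from hard examples. This is a minimal scalar sketch; the default γ = 2 follows the value reported as working best in [5], and the probabilities 0.9 and 0.1 are illustrative.

```python
import math

def cross_entropy(p_t):
    """Plain cross entropy for one example, given p_t."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t)."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

for p_t in (0.9, 0.1):       # an easy example and a hard example
    ce = cross_entropy(p_t)
    fl = focal_loss(p_t)
    # fl / ce equals the modulating factor (1 - p_t)^gamma
    print(f"p_t={p_t}: CE={ce:.4f}  FL={fl:.4f}  FL/CE={fl / ce:.2f}")
```

With γ = 2 the easy example keeps about 1% of its cross-entropy loss while the hard one keeps about 81%; setting γ = 0 recovers plain cross entropy.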
Applying this idea to Eq. (1) gives:
$L_{det} = -\frac{1}{N} \sum_{c=1}^{C} \sum_{i=1}^{W} \sum_{j=1}^{H} \left\{ \begin{array}{ll} (1-p_{cij})^{\alpha} \log(p_{cij}) & \text{if } y_{cij}=1 \\ (p_{cij})^{\alpha} \log(1-p_{cij}) & \text{otherwise} \end{array} \right. \qquad (2)$
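To see what Eq. (2) changes in practice, the sketch below measures how much of the loss comes from negatives on a toy heatmap, first with α = 0 (which reduces Eq. (2) to Eq. (1)) and then with α = 2. The heatmap size, the scores, and the choice α = 2 are illustrative assumptions, and the function names are hypothetical.

```python
import math

def detection_loss_terms(p, y, alpha):
    """Positive and negative parts of the (un-normalized) sum in Eq. (2)."""
    pos = neg = 0.0
    for pc, yc in zip(p, y):
        for prow, yrow in zip(pc, yc):
            for pij, yij in zip(prow, yrow):
                if yij == 1:
                    pos += (1.0 - pij) ** alpha * math.log(pij)
                else:
                    neg += pij ** alpha * math.log(1.0 - pij)
    return -pos, -neg

def focal_detection_loss(p, y, num_objects, alpha=2.0):
    """Eq. (2): the two parts added and divided by N."""
    pos, neg = detection_loss_terms(p, y, alpha)
    return (pos + neg) / num_objects

# Toy heatmap: 1 class, 10 x 10, one corner scored 0.5 and
# 99 easy negatives scored 0.05 (all values made up).
H = W = 10
y = [[[0] * W for _ in range(H)]]
p = [[[0.05] * W for _ in range(H)]]
y[0][3][4] = 1
p[0][3][4] = 0.5

for alpha in (0.0, 2.0):
    pos, neg = detection_loss_terms(p, y, alpha)
    share = neg / (pos + neg)
    print(f"alpha={alpha}: negatives contribute {share:.1%} of the loss")
```

In this toy setting the easy negatives dominate the loss when α = 0 and become a small minority once the modulating factor is applied, which is the rebalancing effect the text describes.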