[Paper Excerpts] Focal Loss for Dense Object Detection

Excerpted in full from:
https://arxiv.org/pdf/1708.02002v2.pdf

0. Abstract

1. Introduction

2. Related Work

3. Focal Loss

4. RetinaNet Detector

Figure 3. The one-stage RetinaNet network architecture uses a Feature Pyramid Network (FPN) [20] backbone on top of a feedforward ResNet architecture [16] (a) to generate a rich, multi-scale convolutional feature pyramid (b). To this backbone RetinaNet attaches two subnetworks, one for classifying anchor boxes (c) and one for regressing from anchor boxes to ground-truth object boxes (d). The network design is intentionally simple, which enables this work to focus on a novel focal loss function that eliminates the accuracy gap between our one-stage detector and state-of-the-art two-stage detectors like Faster R-CNN with FPN [20] while running at faster speeds.

RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks.


The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-shelf convolutional network. ("Off-the-shelf" means a standard, ready-made network, here ResNet, used as-is rather than designed for this task.)


The first subnet performs convolutional object classification on the backbone’s output; the second subnet performs convolutional bounding box regression.


The two subnetworks feature a simple design that we propose specifically for one-stage, dense detection, see Figure 3.


While there are many possible choices for the details of these components, most design parameters are not particularly sensitive to exact values as shown in the experiments.


We describe each component of RetinaNet next.

In short, this whole passage says one thing: RetinaNet is simple, and its architecture is shown in Figure 3.

1. Feature Pyramid Network Backbone

We adopt the Feature Pyramid Network (FPN) from [20] as the backbone network for RetinaNet.


In brief, FPN augments a standard convolutional network with a top-down pathway and lateral connections so the network efficiently constructs a rich, multi-scale feature pyramid from a single resolution input image, see Figure 3(a)-(b).


Each level of the pyramid can be used for detecting objects at a different scale.


FPN improves multi-scale predictions from fully convolutional networks (FCN) [23], as shown by its gains for RPN [28] and DeepMask-style proposals [24], as well as at two-stage detectors such as Fast R-CNN [10] or Mask R-CNN [14].

Following [20], we build FPN on top of the ResNet architecture [16].
We construct a pyramid with levels $P_3$ through $P_7$, where $l$ indicates pyramid level ($P_l$ has resolution $2^l$ lower than the input).
In other words, each level is downsampled by a factor of $2^l$ relative to the input image: for a 640×640 input, $P_3$ is 80×80 ($640/2^3$) and $P_7$ is 5×5 ($640/2^7$).


As in [20] all pyramid levels have $C = 256$ channels. Details of the pyramid generally follow [20] with a few modest differences.


While many design choices are not crucial, we emphasize the use of the FPN backbone is; preliminary experiments using features from only the final ResNet layer yielded low AP.

There is a footnote here:

RetinaNet uses feature pyramid levels $P_3$ to $P_7$, where $P_3$ to $P_5$ are computed from the output of the corresponding ResNet residual stage ($C_3$ through $C_5$) using top-down and lateral connections just as in [20], $P_6$ is obtained via a $3{\times}3$ stride-2 conv on $C_5$, and $P_7$ is computed by applying ReLU followed by a $3{\times}3$ stride-2 conv on $P_6$.


This differs slightly from [20]:
(1) we don't use the high-resolution pyramid level $P_2$ for computational reasons,
(2) $P_6$ is computed by strided convolution instead of downsampling, and
(3) we include $P_7$ to improve large object detection.
These minor modifications improve speed while maintaining accuracy.
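To make this concrete, here is a minimal PyTorch sketch (not the authors' implementation) of building $P_3$–$P_7$ from ResNet stage outputs $C_3$–$C_5$ as the footnote describes; the channel counts assume ResNet-50 and all module names are illustrative:

```python
# A minimal sketch, not the authors' code: P3-P7 from ResNet C3-C5.
# Channel counts assume ResNet-50 (C3=512, C4=1024, C5=2048).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetinaNetFPN(nn.Module):
    def __init__(self, c3_ch=512, c4_ch=1024, c5_ch=2048, out_ch=256):
        super().__init__()
        # 1x1 lateral convs project C3-C5 to the common C=256 channels
        self.lat3 = nn.Conv2d(c3_ch, out_ch, 1)
        self.lat4 = nn.Conv2d(c4_ch, out_ch, 1)
        self.lat5 = nn.Conv2d(c5_ch, out_ch, 1)
        # 3x3 convs smooth the merged top-down features, as in FPN [20]
        self.out3 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.out4 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.out5 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        # P6: 3x3 stride-2 conv on C5; P7: ReLU then 3x3 stride-2 conv on P6
        self.conv6 = nn.Conv2d(c5_ch, out_ch, 3, stride=2, padding=1)
        self.conv7 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        # top-down pathway: upsample coarser level, add lateral connection
        p5 = self.lat5(c5)
        p4 = self.lat4(c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lat3(c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p6 = self.conv6(c5)
        p7 = self.conv7(F.relu(p6))
        return self.out3(p3), self.out4(p4), self.out5(p5), p6, p7
```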

2. Anchors

We use translation-invariant anchor boxes similar to those in the RPN variant in [20].


The anchors have areas of $32^2$ to $512^2$ on pyramid levels $P_3$ to $P_7$, respectively.


As in [20], at each pyramid level we use anchors at three aspect ratios $\{1{:}2, 1{:}1, 2{:}1\}$.


For denser scale coverage than in [20], at each level we add anchors of sizes $\{2^0, 2^{1/3}, 2^{2/3}\}$ of the original set of 3 aspect ratio anchors.
This means each of the 3 aspect-ratio anchors is replicated at 3 scale multipliers: e.g. on $P_3$ (base size 32) the anchor scales are $32 \cdot 2^0 = 32$, $32 \cdot 2^{1/3} \approx 40.3$, and $32 \cdot 2^{2/3} \approx 50.8$.


This improves AP in our setting.


In total there are $A = 9$ anchors per level and across levels they cover the scale range 32–813 pixels with respect to the network's input image.
The 32–813 range follows directly from the scales above: the smallest anchor scale is $32 \cdot 2^0 = 32$ on $P_3$ and the largest is $512 \cdot 2^{2/3} \approx 813$ on $P_7$.
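As a sanity check on this anchor layout, the sketch below (an illustrative helper, not from the paper's code) enumerates the 9 per-level anchor shapes and recovers the 32–813 scale range:

```python
# Enumerate RetinaNet anchor shapes per pyramid level:
# 3 scale multipliers x 3 aspect ratios = A = 9 anchors per location.
import math

base_sizes = {3: 32, 4: 64, 5: 128, 6: 256, 7: 512}  # base size on P3..P7
scales = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]
aspect_ratios = [0.5, 1.0, 2.0]                       # h/w for 1:2, 1:1, 2:1

def level_anchor_shapes(level):
    """Return the 9 (width, height) anchor shapes for one pyramid level."""
    shapes = []
    for s in scales:
        area = (base_sizes[level] * s) ** 2           # anchor area at this scale
        for ar in aspect_ratios:
            w = math.sqrt(area / ar)                  # solve w*h = area, h/w = ar
            shapes.append((w, w * ar))
    return shapes

# Scale range across levels: 32 * 2^0 = 32 (P3) up to 512 * 2^(2/3) ~ 813 (P7).
print(base_sizes[3] * scales[0], round(base_sizes[7] * scales[-1]))  # 32 813
```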

Each anchor is assigned a length $K$ one-hot vector of classification targets, where $K$ is the number of object classes, and a 4-vector of box regression targets.


We use the assignment rule from RPN [28] but modified for multiclass detection and with adjusted thresholds.


Specifically, anchors are assigned to ground-truth object boxes using an intersection-over-union (IoU) threshold of 0.5; and to background if their IoU is in $[0, 0.4)$. As each anchor is assigned to at most one object box, we set the corresponding entry in its length $K$ label vector to 1 and all other entries to 0.


If an anchor is unassigned, which may happen with overlap in [0.4, 0.5), it is ignored during training.


Box regression targets are computed as the offset between each anchor and its assigned object box, or omitted if there is no assignment.
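The whole assignment rule fits in a few lines. Below is a minimal sketch under assumed inputs: an IoU matrix of shape (num_anchors, num_gt) with at least one ground-truth box, and per-box class indices in [0, K); the function name is illustrative:

```python
# A minimal sketch of the anchor assignment rule, not the authors' code.
import torch

def assign_anchors(iou, gt_classes, K):
    """Per-anchor one-hot class targets following the thresholds above."""
    max_iou, matched_gt = iou.max(dim=1)           # best ground truth per anchor
    targets = torch.zeros(iou.shape[0], K)
    pos = max_iou >= 0.5                           # assigned to an object box
    ignore = (max_iou >= 0.4) & (max_iou < 0.5)    # unassigned: ignored in training
    targets[pos, gt_classes[matched_gt[pos]]] = 1.0  # one-hot for positives
    targets[ignore] = -1.0                         # mark rows to skip in the loss
    return targets, matched_gt, pos                # pos also gates box regression
```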

3. Classification Subnet

The classification subnet predicts the probability of object presence at each spatial position for each of the $A$ anchors and $K$ object classes.


This subnet is a small FCN attached to each FPN level; parameters of this subnet are shared across all pyramid levels.
Note: the parameters of this subnet are shared across all pyramid levels, since it is fully convolutional.


Its design is simple.


Taking an input feature map with $C$ channels from a given pyramid level, the subnet applies four $3{\times}3$ conv layers, each with $C$ filters and each followed by ReLU activations, followed by a $3{\times}3$ conv layer with $KA$ filters.
In other words: the FPN level outputs a $C$-channel feature map, then come four blocks of ($3{\times}3$ conv + ReLU, channel count unchanged), and finally a $3{\times}3$ conv with $KA$ output channels.


Finally sigmoid activations are attached to output the $KA$ binary predictions per spatial location, see Figure 3 (c).


We use $C = 256$ and $A = 9$ in most experiments.
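A minimal PyTorch sketch of this head follows (illustrative, not the authors' code; $K = 80$ assumes the COCO class count):

```python
# Classification head sketch: four (3x3 conv, C filters, ReLU) blocks,
# then a 3x3 conv with K*A filters, with sigmoid applied to the output.
import torch
import torch.nn as nn

def make_cls_subnet(C=256, K=80, A=9):
    layers = []
    for _ in range(4):
        layers += [nn.Conv2d(C, C, 3, padding=1), nn.ReLU()]
    layers.append(nn.Conv2d(C, K * A, 3, padding=1))  # K*A logits per location
    return nn.Sequential(*layers)

cls_subnet = make_cls_subnet()
p3 = torch.randn(1, 256, 80, 80)        # e.g. the P3 map of a 640x640 input
probs = torch.sigmoid(cls_subnet(p3))   # (1, K*A, 80, 80) binary predictions
```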

4. Box Regression Subnet

In parallel with the object classification subnet, we attach another small FCN to each pyramid level for the purpose of regressing the offset from each anchor box to a nearby ground-truth object, if one exists.


The design of the box regression subnet is identical to the classification subnet except that it terminates in $4A$ linear outputs per spatial location, see Figure 3 (d).
The two subnets share the same design; only the output channel count differs, $4A$ for regression versus $KA$ for classification.


For each of the $A$ anchors per spatial location, these 4 outputs predict the relative offset between the anchor and the ground-truth box (we use the standard box parameterization from R-CNN [11]).
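For reference, the standard parameterization from R-CNN [11] (reproduced from that line of work, not quoted from this paper) encodes a box with center $(x, y)$, width $w$, and height $h$ relative to an anchor $(x_a, y_a, w_a, h_a)$ as

$$t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w / w_a), \quad t_h = \log(h / h_a).$$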


We note that unlike most recent work, we use a class-agnostic bounding box regressor which uses fewer parameters and we found to be equally effective.


The object classification subnet and the box regression subnet, though sharing a common structure, use separate parameters.
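Mirroring the classification sketch above, a minimal box head (illustrative names) is identical except for its $4A$-channel output and the lack of a sigmoid; instantiating a separate module keeps its parameters independent of the classification head's:

```python
# Box regression head sketch: same structure as the classification head,
# but ending in 4*A linear outputs and instantiated separately, so the
# two subnets share a structure without sharing parameters.
import torch.nn as nn

def make_box_subnet(C=256, A=9):
    layers = []
    for _ in range(4):
        layers += [nn.Conv2d(C, C, 3, padding=1), nn.ReLU()]
    layers.append(nn.Conv2d(C, 4 * A, 3, padding=1))  # (tx, ty, tw, th) per anchor
    return nn.Sequential(*layers)

box_subnet = make_box_subnet()  # separate parameters from cls_subnet
```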

4.1. Inference and Training

5. Experiments

6. Conclusion
