[Translation] Focal Loss for Dense Object Detection (RetinaNet)



Paper: Focal Loss for Dense Object Detection

Abstract

​ The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.

1. Introduction

Current state-of-the-art object detectors are based on a two-stage, proposal-driven mechanism. As popularized in the R-CNN framework [11], the first stage generates a sparse set of candidate object locations and the second stage classifies each candidate location as one of the foreground classes or as background using a convolutional neural network. Through a sequence of advances [10, 27, 19, 13], this two-stage framework consistently achieves top accuracy on the challenging COCO benchmark [20].
  Despite the success of two-stage detectors, a natural question to ask is: could a simple one-stage detector achieve similar accuracy? One-stage detectors are applied over a regular, dense sampling of object locations, scales, and aspect ratios. Recent work on one-stage detectors, such as YOLO [25, 26] and SSD [21, 9], demonstrates promising results, yielding faster detectors with accuracy within 10-40% relative to state-of-the-art two-stage methods.
  This paper pushes the envelope further: we present a one-stage object detector that, for the first time, matches the state-of-the-art COCO AP of more complex two-stage detectors, such as the Feature Pyramid Network (FPN) [19] or Mask R-CNN [13] variants of Faster R-CNN [27]. To achieve this result, we identify class imbalance during training as the main obstacle impeding one-stage detectors from achieving state-of-the-art accuracy and propose a new loss function that eliminates this barrier.
  Class imbalance is addressed in R-CNN-like detectors by a two-stage cascade and sampling heuristics. The proposal stage (e.g., Selective Search [34], EdgeBoxes [37], DeepMask [23, 24], RPN [27]) rapidly narrows down the number of candidate object locations to a small number (e.g., 1-2k), filtering out most background samples. In the second classification stage, sampling heuristics, such as a fixed foreground-to-background ratio (1:3), or online hard example mining (OHEM) [30], are performed to maintain a manageable balance between foreground and background.
  In contrast, a one-stage detector must process a much larger set of candidate object locations regularly sampled across an image. In practice this often amounts to enumerating ∼100k locations that densely cover spatial positions, scales, and aspect ratios. While similar sampling heuristics may also be applied, they are inefficient as the training procedure is still dominated by easily classified background examples. This inefficiency is a classic problem in object detection that is typically addressed via techniques such as bootstrapping [32, 28] or hard example mining [36, 8, 30].
  In this paper, we propose a new loss function that acts as a more effective alternative to previous approaches for dealing with class imbalance. The loss function is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases, see Figure 1. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples. Experiments show that our proposed Focal Loss enables us to train a high-accuracy, one-stage detector that significantly outperforms the alternatives of training with the sampling heuristics or hard example mining, the previous state-of-the-art techniques for training one-stage detectors. Finally, we note that the exact form of the focal loss is not crucial, and we show other instantiations can achieve similar results.

Figure 1. We propose a novel loss we term the Focal Loss that adds a factor $(1 - p_t)^\gamma$ to the standard cross entropy criterion. Setting $\gamma > 0$ reduces the relative loss for well-classified examples ($p_t > .5$), putting more focus on hard, misclassified examples. As our experiments will demonstrate, the proposed focal loss enables training highly accurate dense object detectors in the presence of vast numbers of easy background examples.

  To demonstrate the effectiveness of the proposed focal loss, we design a simple one-stage object detector called RetinaNet, named for its dense sampling of object locations in an input image. Its design features an efficient in-network feature pyramid and use of anchor boxes. It draws on a variety of recent ideas from [21, 6, 27, 19]. RetinaNet is efficient and accurate; our best model, based on a ResNet-101-FPN backbone, achieves a COCO test-dev AP of 39.1 while running at 5 fps, surpassing the previously best published single-model results from both one- and two-stage detectors, see Figure 2.
 
Figure 2. Speed (ms) versus accuracy (AP) on COCO test-dev. Enabled by the focal loss, our simple one-stage RetinaNet detector outperforms all previous one-stage and two-stage detectors, including the best reported Faster R-CNN [28] system from [20]. We show variants of RetinaNet with ResNet-50-FPN (blue circles) and ResNet-101-FPN (orange diamonds) at five scales (400-800 pixels). Ignoring the low-accuracy regime (AP < 25), RetinaNet forms an upper envelope of all current detectors, and an improved variant (not shown) achieves 40.8 AP. Details are given in Section 5.

2. Related Work

Classic Object Detectors: The sliding-window paradigm, in which a classifier is applied on a dense image grid, has a long and rich history. One of the earliest successes is the classic work of LeCun et al. who applied convolutional neural networks to handwritten digit recognition [18, 35]. Viola and Jones [36] used boosted object detectors for face detection, leading to widespread adoption of such models. The introduction of HOG [4] and integral channel features [5] gave rise to effective methods for pedestrian detection. DPMs [8] helped extend dense detectors to more general object categories and had top results on PASCAL [7] for many years. While the sliding-window approach was the leading detection paradigm in classic computer vision, with the resurgence of deep learning [17], two-stage detectors, described next, quickly came to dominate object detection.
  Two-stage Detectors: The dominant paradigm in modern object detection is based on a two-stage approach. As pioneered in the Selective Search work [34], the first stage generates a sparse set of candidate proposals that should contain all objects while filtering out the majority of negative locations, and the second stage classifies the proposals into foreground classes / background. R-CNN [11] upgraded the second-stage classifier to a convolutional network yielding large gains in accuracy and ushering in the modern era of object detection. R-CNN was improved over the years, both in terms of speed [14, 10] and by using learned object proposals [6, 23, 27]. Region Proposal Networks (RPN) integrated proposal generation with the second-stage classifier into a single convolution network, forming the Faster R-CNN framework [27]. Numerous extensions to this framework have been proposed, e.g. [19, 30, 31, 15, 13].
  One-stage Detectors: OverFeat [29] was one of the first modern one-stage object detectors based on deep networks. More recently SSD [21, 9] and YOLO [25, 26] have renewed interest in one-stage methods. These detectors have been tuned for speed but their accuracy trails that of two-stage methods. SSD has a 10-20% lower AP, while YOLO focuses on an even more extreme speed/accuracy trade-off. See Figure 2. Recent work showed that two-stage detectors can be made fast simply by reducing input image resolution and the number of proposals, but one-stage methods trailed in accuracy even with a larger compute budget [16]. In contrast, the aim of this work is to understand if one-stage detectors can match or surpass the accuracy of two-stage detectors while running at similar or faster speeds.
  The design of our RetinaNet detector shares many similarities with previous dense detectors, in particular the concept of ‘anchors’ introduced by RPN [27] and use of feature pyramids as in SSD [21] and FPN [19]. We emphasize that our simple detector achieves top results not based on innovations in network design but due to our novel loss.
  Class Imbalance: Both classic one-stage object detection methods, like boosted detectors [36, 5] and DPMs [8], and more recent methods, like SSD [21], face a large class imbalance during training. These detectors evaluate $10^4$-$10^5$ candidate locations per image but only a few locations contain objects. This imbalance causes two problems: (1) training is inefficient as most locations are easy negatives that contribute no useful learning signal; (2) en masse, the easy negatives can overwhelm training and lead to degenerate models. A common solution is to perform some form of hard negative mining [32, 36, 8, 30, 21] that samples hard examples during training or more complex sampling/reweighing schemes [2]. In contrast, we show that our proposed focal loss naturally handles the class imbalance faced by a one-stage detector and allows us to efficiently train on all examples without sampling and without easy negatives overwhelming the loss and computed gradients.
  Robust Estimation: There has been much interest in designing robust loss functions (e.g., Huber loss [12]) that reduce the contribution of outliers by down-weighting the loss of examples with large errors (hard examples). In contrast, rather than addressing outliers, our focal loss is designed to address class imbalance by down-weighting inliers (easy examples) such that their contribution to the total loss is small even if their number is large. In other words, the focal loss performs the opposite role of a robust loss: it focuses training on a sparse set of hard examples.

3. Focal Loss

The Focal Loss is designed to address the one-stage object detection scenario in which there is an extreme imbalance between foreground and background classes during training (e.g., 1:1000). We introduce the focal loss starting from the cross entropy (CE) loss for binary classification:
$$CE(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1-p) & \text{otherwise} \end{cases}$$
  In the above $y \in \{\pm 1\}$ specifies the ground-truth class and $p \in [0, 1]$ is the model’s estimated probability for the class with label $y = 1$. For notational convenience, we define $p_t$:
$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1-p & \text{otherwise} \end{cases}$$
and rewrite $CE(p, y) = CE(p_t) = -\log(p_t)$.
  The CE loss can be seen as the blue (top) curve in Figure 1. One notable property of this loss, which can be easily seen in its plot, is that even examples that are easily classified ($p_t \gg .5$) incur a loss with non-trivial magnitude. When summed over a large number of easy examples, these small loss values can overwhelm the rare class.
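
As a quick numerical illustration of this point (a sketch of ours, not from the paper), the snippet below evaluates $-\log(p_t)$ at a few confidence values; even a confidently correct prediction such as $p_t = 0.9$ still contributes a loss of roughly 0.1, which adds up quickly over tens of thousands of easy negatives.

```python
import math

def ce(p_t: float) -> float:
    # Binary cross entropy written in terms of p_t = p if y = 1, else 1 - p.
    return -math.log(p_t)

for p_t in [0.5, 0.9, 0.99, 0.999]:
    print(f"p_t = {p_t:>5}: CE = {ce(p_t):.4f}")
# Even at p_t = 0.99 each easy example still contributes ~0.01 loss, so
# ~100k easy negatives can dominate the loss of a handful of hard positives.
```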

3.1. Balanced Cross Entropy

A common method for addressing class imbalance is to introduce a weighting factor α ∈ [0, 1] for class 1 and 1 − α for class −1. In practice α may be set by inverse class frequency or treated as a hyperparameter to set by cross validation. For notational convenience, we define $\alpha_t$ analogously to how we defined $p_t$. We write the α-balanced CE loss as:
$$CE(p_t) = -\alpha_t \log(p_t)$$
  This loss is a simple extension to CE that we consider as an experimental baseline for our proposed focal loss.

3.2. Focal Loss Definition

As our experiments will show, the large class imbalance encountered during training of dense detectors overwhelms the cross entropy loss. Easily classified negatives comprise the majority of the loss and dominate the gradient. While α balances the importance of positive/negative examples, it does not differentiate between easy/hard examples. Instead, we propose to reshape the loss function to down-weight easy examples and thus focus training on hard negatives.
  More formally, we propose to add a modulating factor $(1 - p_t)^\gamma$ to the cross entropy loss, with tunable focusing parameter $\gamma \ge 0$. We define the focal loss as:
$$FL(p_t) = -(1 - p_t)^\gamma \log(p_t)$$
  The focal loss is visualized for several values of γ ∈ [0, 5] in Figure 1. We note two properties of the focal loss. (1) When an example is misclassified and $p_t$ is small, the modulating factor is near 1 and the loss is unaffected. As $p_t \to 1$, the factor goes to 0 and the loss for well-classified examples is down-weighted. (2) The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted. When γ = 0, FL is equivalent to CE, and as γ is increased the effect of the modulating factor is likewise increased (we found γ = 2 to work best in our experiments).
  Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range in which an example receives low loss. For instance, with γ = 2, an example classified with $p_t = 0.9$ would have 100× lower loss compared with CE and with $p_t \approx 0.968$ it would have 1000× lower loss. This in turn increases the importance of correcting misclassified examples (whose loss is scaled down by at most 4× for $p_t \le .5$ and γ = 2).
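
These ratios are easy to check numerically; the short sketch below (our illustration, not the authors' code) compares CE and FL with γ = 2 at the values of $p_t$ quoted above.

```python
import math

def ce(p_t: float) -> float:
    return -math.log(p_t)

def fl(p_t: float, gamma: float = 2.0) -> float:
    # Focal loss without the alpha term: (1 - p_t)^gamma * CE(p_t).
    return (1.0 - p_t) ** gamma * ce(p_t)

for p_t in [0.5, 0.9, 0.968]:
    print(f"p_t = {p_t}: CE/FL ratio = {ce(p_t) / fl(p_t):.0f}x")
# p_t = 0.5   ->    4x (hard-ish examples are barely affected)
# p_t = 0.9   ->  100x
# p_t = 0.968 ->  977x (roughly the 1000x quoted above)
```
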
  In practice we use an α-balanced variant of the focal loss:
$$FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$
  We adopt this form in our experiments as it yields slightly improved accuracy over the non-α-balanced form. Finally, we note that the implementation of the loss layer combines the sigmoid operation for computing p with the loss computation, resulting in greater numerical stability.
   While in our main experimental results we use the focal loss definition above, its precise form is not crucial. In the online appendix we consider other instantiations of the focal loss and demonstrate that these can be equally effective.
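
The paper gives no reference code at this point, but a minimal PyTorch-style sketch of the α-balanced focal loss, computed from logits for numerical stability as noted above, might look as follows (the function and argument names are our own assumptions):

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits: torch.Tensor,
                       targets: torch.Tensor,
                       alpha: float = 0.25,
                       gamma: float = 2.0) -> torch.Tensor:
    """Per-element alpha-balanced focal loss from raw logits.

    logits and targets have the same shape; targets are 0/1 (float) labels.
    """
    p = torch.sigmoid(logits)
    # Computing CE from logits keeps the loss numerically stable.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1.0 - p) * (1.0 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    return alpha_t * (1.0 - p_t) ** gamma * ce
```

Setting γ = 0 and α = 0.5 recovers ordinary cross entropy up to a constant factor of 0.5, which is a convenient sanity check.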

3.3. Class Imbalance and Model Initialization

Binary classification models are by default initialized to have equal probability of outputting either y = −1 or 1. Under such an initialization, in the presence of class imbalance, the loss due to the frequent class can dominate total loss and cause instability in early training. To counter this, we introduce the concept of a ‘prior’ for the value of p estimated by the model for the rare class (foreground) at the start of training. We denote the prior by π and set it so that the model’s estimated p for examples of the rare class is low, e.g. 0.01. We note that this is a change in model initialization (see §4.1) and not of the loss function. We found this to improve training stability for both the cross entropy and focal loss in the case of heavy class imbalance.
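
To make the role of the prior concrete (our reading, anticipating the bias initialization of §4.1): if the foreground score is a sigmoid over a logit with bias $b$, choosing

$$\sigma(b) = \frac{1}{1 + e^{-b}} = \pi \quad\Longrightarrow\quad b = -\log\frac{1-\pi}{\pi} \approx -4.6 \ \text{for}\ \pi = 0.01$$

makes every anchor start out predicting foreground with probability of roughly π, so the background-dominated loss stays small in the first iterations.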

3.4. Class Imbalance and Two-stage Detectors

Two-stage detectors are often trained with the cross entropy loss without use of α-balancing or our proposed loss. Instead, they address class imbalance through two mechanisms: (1) a two-stage cascade and (2) biased minibatch sampling. The first cascade stage is an object proposal mechanism [34, 23, 27] that reduces the nearly infinite set of possible object locations down to one or two thousand. Importantly, the selected proposals are not random, but are likely to correspond to true object locations, which removes the vast majority of easy negatives. When training the second stage, biased sampling is typically used to construct minibatches that contain, for instance, a 1:3 ratio of positive to negative examples. This ratio is like an implicit α-balancing factor that is implemented via sampling. Our proposed focal loss is designed to address these mechanisms in a one-stage detection system directly via the loss function.

4. RetinaNet Detector

Figure 3. The one-stage RetinaNet network architecture uses a Feature Pyramid Network (FPN) [19] backbone on top of a feedforward ResNet architecture [15] (a) to generate a rich, multi-scale convolutional feature pyramid (b). To this backbone RetinaNet attaches two subnetworks, one for classifying anchor boxes (c) and one for regressing from anchor boxes to ground-truth object boxes (d). The network design is intentionally simple, which enables this work to focus on a novel focal loss function that eliminates the accuracy gap between our one-stage detector and state-of-the-art two-stage detectors like Faster R-CNN with FPN [19] while running at faster speeds.

RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-shelf convolutional network. The first subnet performs convolutional object classification on the backbone’s output; the second subnet performs convolutional bounding box regression. The two subnetworks feature a simple design that we propose specifically for one-stage, dense detection, see Figure 3. While there are many possible choices for the details of these components, most design parameters are not particularly sensitive to exact values as shown in the experiments. We describe each component of RetinaNet next.
  Feature Pyramid Network Backbone: We adopt the Feature Pyramid Network (FPN) from [19] as the backbone network for RetinaNet. In brief, FPN augments a standard convolutional network with a top-down pathway and lateral connections so the network efficiently constructs a rich, multi-scale feature pyramid from a single resolution input image, see Figure 3(a)-(b). Each level of the pyramid can be used for detecting objects at a different scale. FPN improves multi-scale predictions from fully convolutional networks (FCN) [22], as shown by its gains for RPN [27] and DeepMask-style proposals [23], as well at two-stage detectors such as Fast R-CNN [10] or Mask R-CNN [13].
  Following [19], we build FPN on top of the ResNet architecture [15]. We construct a pyramid with levels $P_3$ through $P_7$, where $l$ indicates pyramid level ($P_l$ has resolution $2^l$ lower than the input). As in [19] all pyramid levels have C = 256 channels. Details of the pyramid generally follow [19] with a few modest differences. While many design choices are not crucial, we emphasize the use of the FPN backbone is; preliminary experiments using features from only the final ResNet layer yielded low AP.
  Anchors: We use translation-invariant anchor boxes similar to those in the RPN variant in [19]. The anchors have areas of $32^2$ to $512^2$ on pyramid levels $P_3$ to $P_7$, respectively. As in [19], at each pyramid level we use anchors at three aspect ratios {1:2, 1:1, 2:1}. For denser scale coverage than in [19], at each level we add anchors of sizes $\{2^0, 2^{1/3}, 2^{2/3}\}$ of the original set of 3 aspect ratio anchors. This improves AP in our setting. In total there are A = 9 anchors per level and across levels they cover the scale range 32-813 pixels with respect to the network’s input image.
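
As an illustration (our own layout, not the authors' code), the 9 anchor shapes used at every position of a level can be enumerated from these areas, scales, and aspect ratios as follows:

```python
# Anchor areas 32^2 .. 512^2 on pyramid levels P3 .. P7; three sub-octave
# scales and three aspect ratios per level give A = 9 anchors per location.
base_sizes = {level: 32 * 2 ** (level - 3) for level in range(3, 8)}   # P3..P7
scales = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]
aspect_ratios = [0.5, 1.0, 2.0]                                        # 1:2, 1:1, 2:1

def anchor_shapes(level: int):
    """Return the 9 (width, height) pairs used at every position of one level."""
    shapes = []
    for scale in scales:
        area = (base_sizes[level] * scale) ** 2
        for ratio in aspect_ratios:
            w = (area / ratio) ** 0.5          # w * h = area, h / w = ratio
            shapes.append((w, w * ratio))
    return shapes

print(anchor_shapes(3))   # the smallest anchors, around 32x32 and its variants
```
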
  Each anchor is assigned a length K one-hot vector of classification targets, where K is the number of object classes, and a 4-vector of box regression targets. We use the assignment rule from RPN [27] but modified for multiclass detection and with adjusted thresholds. Specifically, anchors are assigned to ground-truth object boxes using an intersection-over-union (IoU) threshold of 0.5; and to background if their IoU is in [0, 0.4). As each anchor is assigned to at most one object box, we set the corresponding entry in its length K label vector to 1 and all other entries to 0. If an anchor is unassigned, which may happen with overlap in [0.4, 0.5), it is ignored during training. Box regression targets are computed as the offset between each anchor and its assigned object box, or omitted if there is no assignment.
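
This assignment rule reduces to a few lines; the sketch below assumes a precomputed pairwise IoU matrix and uses our own label convention (-1 for background, -2 for ignored anchors), so it is an illustration rather than the authors' implementation:

```python
import torch

def assign_anchors(iou: torch.Tensor, fg_thresh: float = 0.5, bg_thresh: float = 0.4):
    """iou: [num_anchors, num_gt] pairwise IoU between anchors and ground-truth boxes.

    Returns, for every anchor, the index of its matched ground-truth box,
    or -1 for background (IoU < 0.4) and -2 for ignored anchors (IoU in [0.4, 0.5)).
    """
    max_iou, matched_gt = iou.max(dim=1)           # best ground-truth box per anchor
    labels = torch.full_like(matched_gt, -1)       # default: background
    labels[max_iou >= fg_thresh] = matched_gt[max_iou >= fg_thresh]    # foreground
    labels[(max_iou >= bg_thresh) & (max_iou < fg_thresh)] = -2        # ignored
    return labels
```
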
  Classification Subnet: The classification subnet predicts the probability of object presence at each spatial position for each of the A anchors and K object classes. This subnet is a small FCN attached to each FPN level; parameters of this subnet are shared across all pyramid levels. Its design is simple. Taking an input feature map with C channels from a given pyramid level, the subnet applies four 3×3 conv layers, each with C filters and each followed by ReLU activations, followed by a 3×3 conv layer with KA filters. Finally sigmoid activations are attached to output the KA binary predictions per spatial location, see Figure 3 (c). We use C = 256 and A = 9 in most experiments.
  In contrast to RPN [27], our object classification subnet is deeper, uses only 3×3 convs, and does not share parameters with the box regression subnet (described next). We found these higher-level design decisions to be more important than specific values of hyperparameters.
  Box Regression Subnet: In parallel with the object classification subnet, we attach another small FCN to each pyramid level for the purpose of regressing the offset from each anchor box to a nearby ground-truth object, if one exists. The design of the box regression subnet is identical to the classification subnet except that it terminates in 4A linear outputs per spatial location, see Figure 3 (d). For each of the A anchors per spatial location, these 4 outputs predict the relative offset between the anchor and the ground-truth box (we use the standard box parameterization from R-CNN [11]). We note that unlike most recent work, we use a class-agnostic bounding box regressor which uses fewer parameters and we found to be equally effective. The object classification subnet and the box regression subnet, though sharing a common structure, use separate parameters.
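
Both heads translate almost directly into a short stack of convolutions; below is a hedged PyTorch-style sketch (the module layout and names are our assumptions, not the released implementation):

```python
import torch.nn as nn

def head(c_in: int, c_out_last: int, c_mid: int = 256) -> nn.Sequential:
    """Four 3x3 convs with ReLU, then a final 3x3 conv producing c_out_last channels."""
    layers = []
    for _ in range(4):
        layers += [nn.Conv2d(c_mid if layers else c_in, c_mid, 3, padding=1), nn.ReLU()]
    layers.append(nn.Conv2d(c_mid, c_out_last, 3, padding=1))
    return nn.Sequential(*layers)

K, A, C = 80, 9, 256            # object classes, anchors per location, FPN channels
cls_subnet = head(C, K * A)     # KA binary predictions per spatial location (sigmoid applied later)
box_subnet = head(C, 4 * A)     # 4A class-agnostic box offsets per spatial location
# The same two subnets are applied to every FPN level (their weights are shared across
# levels), but cls_subnet and box_subnet do not share parameters with each other.
```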

4.1. Inference and Training

Inference: RetinaNet forms a single FCN comprised of a ResNet-FPN backbone, a classification subnet, and a box regression subnet, see Figure 3. As such, inference involves simply forwarding an image through the network. To improve speed, we only decode box predictions from at most 1k top-scoring predictions per FPN level, after thresholding detector confidence at 0.05. The top predictions from all levels are merged and non-maximum suppression with a threshold of 0.5 is applied to yield the final detections.
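
A rough sketch of this decoding step (score threshold 0.05, at most 1k top-scoring boxes per level, NMS at 0.5); the per-level score and box tensors are assumed to be given, so this is an illustration rather than the authors' code:

```python
import torch
from torchvision.ops import nms

def decode_detections(per_level_scores, per_level_boxes,
                      score_thresh=0.05, topk=1000, nms_thresh=0.5):
    """per_level_scores[i]: [N_i] confidences; per_level_boxes[i]: [N_i, 4] boxes."""
    boxes, scores = [], []
    for lvl_scores, lvl_boxes in zip(per_level_scores, per_level_boxes):
        keep = lvl_scores > score_thresh                 # drop obvious background
        lvl_scores, lvl_boxes = lvl_scores[keep], lvl_boxes[keep]
        lvl_scores, idx = lvl_scores.topk(min(topk, lvl_scores.numel()))
        boxes.append(lvl_boxes[idx])
        scores.append(lvl_scores)
    boxes, scores = torch.cat(boxes), torch.cat(scores)  # merge all pyramid levels
    keep = nms(boxes, scores, nms_thresh)                # final non-maximum suppression
    return boxes[keep], scores[keep]
```
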
  Focal Loss: We use the focal loss introduced in this work as the loss on the output of the classification subnet. As we will show in §5, we find that γ = 2 works well in practice and the RetinaNet is relatively robust to γ ∈ [0.5, 5]. We emphasize that when training RetinaNet, the focal loss is applied to all ∼100k anchors in each sampled image. This stands in contrast to common practice of using heuristic sampling (RPN) or hard example mining (OHEM, SSD) to select a small set of anchors (e.g., 256) for each minibatch. The total focal loss of an image is computed as the sum of the focal loss over all ∼100k anchors, normalized by the number of anchors assigned to a ground-truth box. We perform the normalization by the number of assigned anchors, not total anchors, since the vast majority of anchors are easy negatives and receive negligible loss values under the focal loss. Finally we note that α, the weight assigned to the rare class, also has a stable range, but it interacts with γ making it necessary to select the two together (see Tables 1a and 1b). In general α should be decreased slightly as γ is increased (for γ = 2, α = 0.25 works best).
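
The normalization described here is a single extra line; a sketch assuming `cls_loss` holds the per-anchor focal loss of one image and `labels` follows the assignment convention sketched in §4 (values ≥ 0 mark anchors matched to a ground-truth box):

```python
# cls_loss: per-anchor focal loss over all ~100k anchors of one image.
# labels >= 0 marks anchors assigned to a ground-truth box (foreground).
num_foreground = (labels >= 0).sum().clamp(min=1)    # avoid dividing by zero
total_cls_loss = cls_loss.sum() / num_foreground
```
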
  Initialization: We experiment with ResNet-50-FPN and ResNet-101-FPN backbones [19]. The base ResNet-50 and ResNet-101 models are pre-trained on ImageNet1k; we use the models released by [15]. New layers added for FPN are initialized as in [19]. All new conv layers except the final one in the RetinaNet subnets are initialized with bias b = 0 and a Gaussian weight fill with σ = 0.01. For the final conv layer of the classification subnet, we set the bias initialization to b = − log((1 − π)/π), where π specifies that at the start of training every anchor should be labeled as foreground with confidence of ∼π. We use π = .01 in all experiments, although results are robust to the exact value. As explained in §3.4, this initialization prevents the large number of background anchors from generating a large, destabilizing loss value in the first iteration of training.
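
In PyTorch terms this initialization might look roughly as follows (a sketch reusing the `cls_subnet`/`box_subnet` modules from the head sketch above; π is written as `prior`):

```python
import math
import torch.nn as nn

prior = 0.01                                     # pi: expected initial foreground probability
for subnet in (cls_subnet, box_subnet):          # heads as sketched in Section 4
    for m in subnet.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.normal_(m.weight, std=0.01)  # Gaussian weight fill, sigma = 0.01
            nn.init.constant_(m.bias, 0.0)
# Final conv of the classification subnet: bias = -log((1 - pi) / pi), so every
# anchor starts out predicting foreground with probability ~pi (see Section 3.3).
nn.init.constant_(cls_subnet[-1].bias, -math.log((1.0 - prior) / prior))
```
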
  Optimization: RetinaNet is trained with stochastic gradient descent (SGD). We use synchronized SGD over 8 GPUs with a total of 16 images per minibatch (2 images per GPU). Unless otherwise specified, all models are trained for 90k iterations with an initial learning rate of 0.01, which is then divided by 10 at 60k and again at 80k iterations. We use horizontal image flipping as the only form of data augmentation unless otherwise noted. Weight decay of 0.0001 and momentum of 0.9 are used. The training loss is the sum of the focal loss and the standard smooth L1 loss used for box regression [10]. Training time ranges between 10 and 35 hours for the models in Table 1e.
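
For reference, this schedule corresponds to roughly the following PyTorch setup (a sketch with an assumed `model` variable, stepping the scheduler once per iteration rather than per epoch):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0001)
# Divide the learning rate by 10 at 60k and again at 80k of the 90k iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[60_000, 80_000], gamma=0.1)
```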

5. Experiments

We present experimental results on the bounding box detection track of the challenging COCO benchmark [20]. For training, we follow common practice [1, 19] and use the COCO trainval35k split (union of 80k images from train and a random 35k subset of images from the 40k image val split). We report lesion and sensitivity studies by evaluating on the minival split (the remaining 5k images from val). For our main results, we report COCO AP on the test-dev split, which has no public labels and requires use of the evaluation server.

5.1. Training Dense Detection

We run numerous experiments to analyze the behavior of the loss function for dense detection along with various optimization strategies. For all experiments we use depth 50 or 101 ResNets [15] with a Feature Pyramid Network (FPN) [19] constructed on top. For all ablation studies we use an image scale of 600 pixels for training and testing.
  Network Initialization: Our first attempt to train RetinaNet uses standard cross entropy (CE) loss without any modifications to the initialization or learning strategy. This fails quickly, with the network diverging during training. However, simply initializing the last layer of our model such that the prior probability of detecting an object is π = .01 (see §4.1) enables effective learning. Training RetinaNet with ResNet-50 and this initialization already yields a respectable AP of 30.2 on COCO. Results are insensitive to the exact value of π so we use π = .01 for all experiments.

Table 1. Ablation experiments for RetinaNet and Focal Loss (FL). All models are trained on trainval35k and tested on minival unless noted. If not specified, default values are: γ = 2; anchors for 3 scales and 3 aspect ratios; ResNet-50-FPN backbone; and a 600 pixel train and test image scale. (a) RetinaNet with α-balanced CE achieves at most 31.1 AP. (b) In contrast, using FL with the same exact network gives a 2.9 AP gain and is fairly robust to exact γ/α settings. (c) Using 2-3 scale and 3 aspect ratio anchors yields good results, after which point performance saturates. (d) FL outperforms the best variants of online hard example mining (OHEM) [30, 21] by over 3 points AP. (e) Accuracy/speed trade-off of RetinaNet on test-dev for various network depths and image scales (see also Figure 2).

Balanced Cross Entropy: Our next attempt to improve learning involved using the α-balanced CE loss described in §3.1. Results for various α are shown in Table 1a. Setting α = .75 gives a gain of 0.9 points AP.
  Focal Loss: Results using our proposed focal loss are shown in Table 1b. The focal loss introduces one new hyperparameter, the focusing parameter γ, that controls the strength of the modulating term. When γ = 0, our loss is equivalent to the CE loss. As γ increases, the shape of the loss changes so that “easy” examples with low loss get further discounted, see Figure 1. FL shows large gains over CE as γ is increased. With γ = 2, FL yields a 2.9 AP improvement over the α-balanced CE loss.
  For the experiments in Table 1b, for a fair comparison we find the best α for each γ. We observe that lower α’s are selected for higher γ’s (as easy negatives are down-weighted, less emphasis needs to be placed on the positives). Overall, however, the benefit of changing γ is much larger, and indeed the best α’s ranged in just [.25,.75] (we tested α ∈ [.01, .999]). We use γ = 2.0 with α = .25 for all experiments but α = .5 works nearly as well (.4 AP lower).
  Analysis of the Focal Loss: To understand the focal loss better, we analyze the empirical distribution of the loss of a converged model. For this, we take our default ResNet-101 600-pixel model trained with γ = 2 (which has 36.0 AP). We apply this model to a large number of random images and sample the predicted probability for $\sim 10^7$ negative windows and $\sim 10^5$ positive windows. Next, separately for positives and negatives, we compute FL for these samples, and normalize the loss such that it sums to one. Given the normalized loss, we can sort the loss from lowest to highest and plot its cumulative distribution function (CDF) for both positive and negative samples and for different settings for γ (even though the model was trained with γ = 2).

Figure 4. Cumulative distribution functions of the normalized loss for positive and negative samples for different values of γ for a converged model. The effect of changing γ on the loss distribution for positive examples is minor. For negatives, however, increasing γ concentrates the loss on hard examples, focusing nearly all attention away from easy negatives.

Cumulative distribution functions for positive and negative samples are shown in Figure 4. If we observe the positive samples, we see that the CDF looks fairly similar for different values of γ. For example, approximately 20% of the hardest positive samples account for roughly half of the positive loss, as γ increases more of the loss gets concentrated in the top 20% of examples, but the effect is minor.
  The effect of γ on negative samples is dramatically different. For γ = 0, the positive and negative CDFs are quite similar. However, as γ increases, substantially more weight becomes concentrated on the hard negative examples. In fact, with γ = 2 (our default setting), the vast majority of the loss comes from a small fraction of samples. As can be seen, FL can effectively discount the effect of easy negatives, focusing all attention on the hard negative examples.
  Online Hard Example Mining (OHEM): [30] proposed to improve training of two-stage detectors by constructing minibatches using high-loss examples. Specifically, in OHEM each example is scored by its loss, non-maximum suppression (nms) is then applied, and a minibatch is constructed with the highest-loss examples. The nms threshold and batch size are tunable parameters. Like the focal loss, OHEM puts more emphasis on misclassified examples, but unlike FL, OHEM completely discards easy examples. We also implement a variant of OHEM used in SSD [21]: after applying nms to all examples, the minibatch is constructed to enforce a 1:3 ratio between positives and negatives to help ensure each minibatch has enough positives.
  We test both OHEM variants in our setting of one-stage detection which has large class imbalance. Results for the original OHEM strategy and the ‘OHEM 1:3’ strategy for selected batch sizes and nms thresholds are shown in Table 1d. These results use ResNet-101, our baseline trained with FL achieves 36.0 AP for this setting. In contrast, the best setting for OHEM (no 1:3 ratio, batch size 128, nms of .5) achieves 32.8 AP. This is a gap of 3.2 AP, showing FL is more effective than OHEM for training dense detectors. We note that we tried other parameter setting and variants for OHEM but did not achieve better results.
  Hinge Loss: Finally, in early experiments, we attempted to train with the hinge loss [12] on $p_t$, which sets loss to 0 above a certain value of $p_t$. However, this was unstable and we did not manage to obtain meaningful results. Results exploring alternate loss functions are in the online appendix.

5.2. Model Architecture Design

Anchor Density: One of the most important design factors in a one-stage detection system is how densely it covers the space of possible image boxes. Two-stage detectors can classify boxes at any position, scale, and aspect ratio using a region pooling operation [10]. In contrast, as one-stage detectors use a fixed sampling grid, a popular approach for achieving high coverage of boxes in these approaches is to use multiple ‘anchors’ [27] at each spatial position to cover boxes of various scales and aspect ratios.
  We sweep over the number of scale and aspect ratio anchors used at each spatial position and each pyramid level in FPN. We consider cases from a single square anchor at each location to 12 anchors per location spanning 4 sub-octave scales ($2^{k/4}$, for $k \le 3$) and 3 aspect ratios [0.5, 1, 2]. Results using ResNet-50 are shown in Table 1c. A surprisingly good AP (30.3) is achieved using just one square anchor. However, the AP can be improved by nearly 4 points (to 34.0) when using 3 scales and 3 aspect ratios per location. We used this setting for all other experiments in this work.
  Finally, we note that increasing beyond 6-9 anchors did not show further gains. Thus while two-stage systems can classify arbitrary boxes in an image, the saturation of performance w.r.t. density implies the higher potential density of two-stage systems may not offer an advantage.
  Speed versus Accuracy: Larger backbone networks yield higher accuracy, but also slower inference speeds. Likewise for input image scale (defined by the shorter image side). We show the impact of these two factors in Table 1e. In Figure 2 we plot the speed/accuracy trade-off curve for RetinaNet and compare it to recent methods using public numbers on COCO test-dev. The plot reveals that RetinaNet, enabled by our focal loss, forms an upper envelope over all existing methods, discounting the low-accuracy regime. Remarkably, RetinaNet with ResNet-101-FPN and a 600 pixel image scale (which we denote by RetinaNet-101-600 for simplicity) matches the accuracy of the recently published ResNet-101-FPN Faster R-CNN [19], while running in 122 ms per image compared to 172 ms (both measured on an Nvidia M40 GPU). Using larger image sizes allows RetinaNet to surpass the accuracy of all two-stage approaches, while still being faster. For faster runtimes, there is only one operating point (500 pixel input) at which using ResNet-50-FPN improves over ResNet-101-FPN. Addressing the high frame rate regime will likely require special network design, as in [26], rather than use of an off-the-shelf model and is beyond the scope of this work.

5.3. Comparison to State of the Art

 
Table 2. Object detection single-model results (bounding box AP) vs. state of the art on COCO test-dev. We show results for our RetinaNet-101-800 model, trained with scale jitter and for 1.5× longer than the same model from Table 1e. Our model achieves top results, outperforming both one-stage and two-stage models. For a detailed breakdown of speed versus accuracy see Table 1e and Figure 2.

  We evaluate RetinaNet on the bounding box detection task of the challenging COCO dataset and compare test-dev results to recent state-of-the-art methods including both one-stage and two-stage models. Results are presented in Table 2 for our RetinaNet-101-800 model trained using scale jitter and for 1.5× longer than the models in Table 1e (giving a 1.3 AP gain). Compared to existing one-stage methods, our approach achieves a healthy 5.9 point AP gap (39.1 vs. 33.2) with the closest competitor, DSSD [9], while also being faster, see Figure 2. Compared to recent two-stage methods, RetinaNet achieves a 2.3 point gap above the top-performing Faster R-CNN model based on Inception-ResNet-v2-TDM [31].

6. Conclusion

In this work, we identify class imbalance as the primary obstacle preventing one-stage object detectors from surpassing top-performing, two-stage methods, such as Faster R-CNN variants. To address this, we propose the focal loss which applies a modulating term to the cross entropy loss in order to focus learning on hard examples and down-weight the numerous easy negatives. Our approach is simple and highly effective. We demonstrate its efficacy by designing a fully convolutional one-stage detector and report extensive experimental analysis showing that it achieves state-of-the-art accuracy and run time on the challenging COCO dataset.
