Focal Loss for Dense Object Detection (RetinaNet): Reading Notes

As usual, the paper text comes first, followed by my notes.

Focal Loss for Dense Object Detection

Abstract

The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron

1. Introduction

Current state-of-the-art object detectors are based on a two-stage, proposal-driven mechanism. As popularized in the R-CNN framework [11], the first stage generates a sparse set of candidate object locations and the second stage classifies each candidate location as one of the foreground classes or as background using a convolutional neural network. Through a sequence of advances [10, 28, 20, 14], this two-stage framework consistently achieves top accuracy on the challenging COCO benchmark [21].

Despite the success of two-stage detectors, a natural question to ask is: could a simple one-stage detector achieve similar accuracy? One-stage detectors are applied over a regular, dense sampling of object locations, scales, and aspect ratios. Recent work on one-stage detectors, such as YOLO [26, 27] and SSD [22, 9], demonstrates promising results, yielding faster detectors with accuracy within 10-40% relative to state-of-the-art two-stage methods.

This paper pushes the envelope further: we present a one-stage object detector that, for the first time, matches the state-of-the-art COCO AP of more complex two-stage detectors, such as the Feature Pyramid Network (FPN) [20] or Mask R-CNN [14] variants of Faster R-CNN [28]. To achieve this result, we identify class imbalance during training as the main obstacle impeding one-stage detectors from achieving state-of-the-art accuracy and propose a new loss function that eliminates this barrier.

Class imbalance is addressed in R-CNN-like detectors by a two-stage cascade and sampling heuristics. The proposal stage (e.g., Selective Search [35], EdgeBoxes [39], DeepMask [24, 25], RPN [28]) rapidly narrows down the number of candidate object locations to a small number (e.g., 1-2k), filtering out most background samples. In the second classification stage, sampling heuristics, such as a fixed foreground-to-background ratio (1:3), or online hard example mining (OHEM) [31], are performed to maintain a manageable balance between foreground and background.

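To make the heuristic concrete, here is a minimal NumPy sketch of fixed-ratio minibatch sampling in the spirit of second-stage training; the function name and defaults (128 ROIs, 25% foreground, i.e. a 1:3 ratio) are illustrative assumptions, not taken from a specific codebase.

```python
import numpy as np

def sample_minibatch(labels, batch_size=128, fg_fraction=0.25, rng=None):
    """Sample ROI indices with a fixed foreground:background ratio.

    labels: per-ROI class labels, 0 = background, >0 = foreground.
    With fg_fraction = 0.25 the sampled batch is at most 1:3 fg:bg.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    fg = np.flatnonzero(labels > 0)
    bg = np.flatnonzero(labels == 0)
    n_fg = min(int(batch_size * fg_fraction), len(fg))   # cap by available fg
    n_bg = min(batch_size - n_fg, len(bg))               # fill the rest with bg
    return np.concatenate([rng.choice(fg, n_fg, replace=False),
                           rng.choice(bg, n_bg, replace=False)])
```

Given, say, 10 foreground and 1000 background ROIs, this keeps all 10 foreground samples plus 118 background samples, rather than letting the 100:1 raw imbalance carry into the minibatch.
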
In contrast, a one-stage detector must process a much larger set of candidate object locations regularly sampled across an image. In practice this often amounts to enumerating 100k locations that densely cover spatial positions, scales, and aspect ratios. While similar sampling heuristics may also be applied, they are inefficient as the training procedure is still dominated by easily classified background examples. This inefficiency is a classic problem in object detection that is typically addressed via techniques such as bootstrapping [33, 29] or hard example mining [37, 8, 31].

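A back-of-the-envelope count shows how quickly dense sampling reaches roughly 100k candidates. The image size, pyramid strides, and 9 anchors per location below are illustrative assumptions in the spirit of FPN-style dense detectors, not numbers quoted from the paper.

```python
def num_candidate_locations(image_size=800, strides=(8, 16, 32, 64, 128),
                            scales_per_level=3, aspect_ratios=3):
    """Count densely sampled anchor locations over a feature pyramid."""
    total = 0
    for stride in strides:
        fmap = image_size // stride                 # feature-map side length
        total += fmap * fmap * scales_per_level * aspect_ratios
    return total

print(num_candidate_locations())  # on the order of 100k
```

Only a handful of these locations contain objects, which is why easy background examples dominate the loss unless they are down-weighted.
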
In this paper, we propose a new loss function that acts as a more effective alternative to previous approaches for dealing with class imbalance. The loss function is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases, see Figure 1. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples. Experiments show that our proposed Focal Loss enables us to train a high-accuracy, one-stage detector that significantly outperforms the alternatives of training with the sampling heuristics or hard example mining, the previous state-of-the-art techniques for training one-stage detectors. Finally, we note that the exact form of the focal loss is not crucial, and we show other instantiations can achieve similar results.

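Concretely, the focal loss defined later in the paper takes the form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), which reduces to alpha-weighted cross entropy at gamma = 0. A minimal NumPy sketch of the binary case:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p: predicted foreground probability, y: labels in {0, 1}.
    gamma = 0 recovers (alpha-weighted) cross entropy.
    """
    p_t = np.where(y == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

With gamma = 2, an example already classified with p_t = 0.9 is scaled by (1 - p_t)^2 = 0.01, so it contributes 100x less loss than under plain cross entropy; this is what keeps the huge pool of easy negatives from dominating the gradient.
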
To demonstrate the effectiveness of the proposed focal loss, we design a simple one-stage object detector called RetinaNet, named for its dense sampling of object locations in an input image. Its design features an efficient in-network feature pyramid and use of anchor boxes. It draws on a variety of recent ideas from [22, 6, 28, 20]. RetinaNet is efficient and accurate; our best model, based on a ResNet-101-FPN backbone, achieves a COCO test-dev AP of 39.1 while running at 5 fps, surpassing the previously best published single-model results from both one and two-stage detectors, see Figure 2.

2. Related Work

Classic Object Detectors: The sliding-window paradigm, in which a classifier is applied on a dense image grid, has a long and rich history. One of the earliest successes is the classic work of LeCun et al. who applied convolutional neural networks to handwritten digit recognition [19, 36]. Viola and Jones [37] used boosted object detectors for face detection, leading to widespread adoption of such models. The introduction of HOG [4] and integral channel features [5] gave rise to effective methods for pedestrian detection. DPMs [8] helped extend dense detectors to more general object categories and had top results on PASCAL [7] for many years. While the sliding-window approach was the leading detection paradigm in classic computer vision, with the resurgence of deep learning [18], two-stage detectors, described next, quickly came to dominate object detection.

Two-stage Detectors: The dominant paradigm in modern object detection is based on a two-stage approach. As pioneered in the Selective Search work [35], the first stage generates a sparse set of candidate proposals that should contain all objects while filtering out the majority of negative locations, and the second stage classifies the proposals into foreground classes / background. R-CNN [11] upgraded the second-stage classifier to a convolutional network yielding large gains in accuracy and ushering in the modern era of object detection. R-CNN was improved over the years, both in terms of speed [15, 10] and by using learned object proposals [6, 24, 28]. Region Proposal Networks (RPN) integrated proposal generation with the second-stage classifier into a single convolution network, forming the Faster RCNN framework [28]. Numerous extensions to this framework have been proposed, e.g. [20, 31, 32, 16, 14].

One-stage Detectors: OverFeat [30] was one of the first modern one-stage object detectors based on deep networks. More recently SSD [22, 9] and YOLO [26, 27] have renewed interest in one-stage methods. These detectors have been tuned for speed but their accuracy trails that of two-stage methods. SSD has a 10-20% lower AP, while YOLO focuses on an even more extreme speed/accuracy trade-off. See Figure 2. Recent work showed that two-stage detectors can be made fast simply by reducing input image resolution and the number of proposals, but one-stage methods trailed in accuracy even with a larger compute budget [17]. In contrast, the aim of this work is to understand if one-stage detectors can match or surpass the accuracy of two-stage detectors while running at similar or faster speeds.

The design of our RetinaNet detector shares many similarities with previous dense detectors, in particular the concept of ‘anchors’ introduced by RPN [28] and the use of feature pyramids as in SSD [22] and FPN [20]. We emphasize that our simple detector achieves top results not based on innovations in network design but due to our novel loss.

Class Imbalance: Both classic one-stage object detection methods, like boosted detectors [37, 5] and DPMs [8], and more recent methods, like SSD [22], face a large class imbalance during training.
