【翻译与个人理解】Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection
通过自适应训练样本选择弥合基于锚点和无锚点检测之间的差距
摘要
Object detection has been dominated by anchor-based detectors for several years. Recently, anchor-free detectors have become popular due to the proposal of FPN and Focal Loss. In this paper, we first point out that the essential difference between anchor-based and anchor-free detection is actually how to define positive and negative training samples, which leads to the performance gap between them. If they adopt the same definition of positive and negative samples during training, there is no obvious difference in the final performance, no matter regressing from a box or a point. This shows that how to select positive and negative training samples is important for current object detectors. Then, we propose an Adaptive Training Sample Selection (ATSS) to automatically select positive and negative samples according to statistical characteristics of object. It significantly improves the performance of anchor-based and anchor-free detectors and bridges the gap between them. Finally, we discuss the necessity of tiling multiple anchors per location on the image to detect objects. Extensive experiments conducted on MS COCO support our aforementioned analysis and conclusions. With the newly introduced ATSS, we improve state-of-the-art detectors by a large margin to 50.7% AP without introducing any overhead. The code is available at https://github.com/sfzhang15/ATSS.
目标检测多年来一直由基于锚的检测器主导。最近,由于FPN和Focal Loss的提出,anchor-free检测器变得流行起来。在本文中,我们首先指出anchor-based和anchor-free检测的本质区别实际上是如何定义正负训练样本,这导致了它们之间的性能差距。如果它们在训练时采用相同的正负样本定义,无论是从一个框回归还是从一个点回归,最终的表现都没有明显差异。这表明如何选择正负训练样本对于当前的目标检测器很重要。然后,我们提出了一种自适应训练样本选择(ATSS)来根据目标的统计特征自动选择正样本和负样本。它显著提高了基于锚和无锚检测器的性能,并弥合了它们之间的差距。最后,我们讨论了在图像上的每个位置平铺多个锚点以检测目标的必要性。在 MS COCO 上进行的大量实验支持我们上述的分析和结论。借助新引入的ATSS,我们将最先进的检测器大幅提高到 50.7% 的 AP,而不会引入任何开销。该代码可在https://github.com/sfzhang15/ATSS 获得。
1. 介绍
Object detection is a long-standing topic in the field of computer vision, aiming to detect objects of predefined categories. Accurate object detection would have far reaching impact on various applications including image recognition and video surveillance. In recent years, with the development of convolutional neural network (CNN), object detection has been dominated by anchor-based detectors, which can be generally divided into one-stage methods [36, 33] and two-stage methods [47, 9]. Both of them first tile a large number of preset anchors on the image, then predict the category and refine the coordinates of these anchors by one or several times, finally output these refined anchors as detection results. Because two-stage methods refine anchors several times more than one-stage methods, the former one has more accurate results while the latter one has higher computational efficiency. State-of-the-art results on common detection benchmarks are still held by anchor-based detectors.
目标检测是计算机视觉领域一个由来已久的话题,旨在检测预定义类别的目标。准确的目标检测将对包括图像识别和视频监控在内的各种应用产生深远的影响。近年来,随着卷积神经网络(CNN)的发展,目标检测一直以基于锚的检测器为主,一般可分为一阶段方法[36, 33]和二阶段方法[47, 9]。两者都是先在图像上平铺大量预设的anchor,然后预测这些anchor的类别并对其坐标进行一次或多次细化,最后将细化后的anchor作为检测结果输出。由于二阶段方法对anchor的细化次数多于一阶段方法,因此前者结果更准确,而后者计算效率更高。在常见的检测基准上,最先进的结果仍由基于锚的检测器保持。
【个人理解:在同等trick的情况下,因为二阶段检测器在RPN的时候就已经适当的对正负样本进行了均衡,然后重点对一些参数进行分类训练。训练的分类难度会比一阶段目标检测直接做混合分类和预测框回归要来的简单很多。】
Recent academic attention has been geared toward anchor-free detectors due to the emergence of FPN [32] and Focal Loss [33]. Anchor-free detectors directly find objects without preset anchors in two different ways. One way is to first locate several pre-defined or self-learned keypoints and then bound the spatial extent of objects. We call this type of anchor-free detectors as keypoint-based methods [26, 71]. Another way is to use the center point or region of objects to define positives and then predict the four distances from positives to the object boundary. We call this kind of anchor-free detectors as center-based methods [56, 23]. These anchor-free detectors are able to eliminate those hyperparameters related to anchors and have achieved similar performance with anchor-based detectors, making them more potential in terms of generalization ability.
由于 FPN [32] 和 Focal Loss [33] 的出现,最近的学术关注已经转向无锚检测器。无锚检测器不依赖预设锚框,以两种不同的方式直接检测物体。一种方法是首先定位几个预定义或自学习的关键点,然后据此框定目标的空间范围。我们将这种类型的无锚检测器称为基于关键点的方法 [26, 71]。另一种方法是使用目标的中心点或中心区域来定义正样本,然后预测从正样本到目标边界的四个距离。我们将这种无锚检测器称为基于中心的方法 [56, 23]。这些无锚检测器能够消除那些与锚相关的超参数,并取得了与基于锚的检测器相似的性能,使其在泛化能力方面更具潜力。
Among these two types of anchor-free detectors, keypoint-based methods follow the standard keypoint estimation pipeline that is different from anchor-based detectors. However, center-based detectors are similar to anchor-based detectors, which treat points as preset samples instead of anchor boxes. Take the one-stage anchor-based detector RetinaNet [33] and the center-based anchor-free detector FCOS [56] as an example, there are three main differences between them: (1) The number of anchors tiled per location. RetinaNet tiles several anchor boxes per location, while FCOS tiles one anchor point per location. (2) The definition of positive and negative samples. RetinaNet resorts to the Intersection over Union (IoU) for positives and negatives, while FCOS utilizes spatial and scale constraints to select samples. (3) The regression starting status. RetinaNet regresses the object bounding box from the preset anchor box, while FCOS locates the object from the anchor point. As reported in [56], the anchor-free FCOS achieves much better performance than the anchor-based RetinaNet, so it is worth studying which of these three differences are essential factors for the performance gap.
在这两种无锚检测器中,基于关键点的方法遵循与基于锚的检测器不同的标准关键点估计流程。然而,基于中心的检测器类似于基于锚的检测器,它将点视为预设样本而不是锚框。以一阶段anchor-based检测器RetinaNet [33]和基于中心的anchor-free检测器FCOS [56]为例,它们之间存在三个主要区别:(1)每个位置平铺的anchor数量。RetinaNet 每个位置平铺几个锚框,而 FCOS 每个位置平铺一个锚点。(2)正负样本的定义。RetinaNet 利用交并比 (IoU) 来判断正负样本,而 FCOS 利用空间和尺度约束来选择样本。(3)回归起始状态。RetinaNet 从预设的锚框回归目标边界框,而 FCOS 从锚点定位目标。正如 [56] 中所报道的,无锚的 FCOS 比基于锚的 RetinaNet 实现了好得多的性能,因此值得研究这三个差异中哪些才是造成性能差距的关键因素。
In this paper, we investigate the differences between anchor-based and anchor-free methods in a fair way by strictly ruling out all the implementation inconsistencies between them. It can be concluded from experiment results that the essential difference between these two kinds of methods is the definition of positive and negative training samples, which results in the performance gap between them. If they select the same positive and negative samples during training, there is no obvious gap in the final performance, no matter regressing from a box or a point. Therefore, how to select positive and negative training samples deserves further study. Inspired by that, we propose a new Adaptive Training Sample Selection (ATSS) to automatically select positive and negative samples based on object characteristics. It bridges the gap between anchor-based and anchor-free detectors. Besides, through a series of experiments, a conclusion can be drawn that tiling multiple anchors per location on the image to detect objects is not necessary. Extensive experiments on the MS COCO [34] dataset support our analysis and conclusions. State-of-the-art AP 50.7% is achieved by applying the newly introduced ATSS without introducing any overhead. The main contributions of this work can be summarized as:
• Indicating the essential difference between anchor-based and anchor-free detectors is actually how to define positive and negative training samples.
• Proposing an adaptive training sample selection to automatically select positive and negative training samples according to statistical characteristics of object.
• Demonstrating that tiling multiple anchors per location on the image to detect objects is a useless operation.
• Achieving state-of-the-art performance on MS COCO without introducing any additional overhead.
在本文中,我们通过严格排除它们之间的所有实现不一致,以公平的方式调查基于锚和无锚方法之间的差异。从实验结果可以得出,这两种方法的本质区别在于正负训练样本的定义,从而导致了它们之间的性能差距。如果它们在训练时选择了相同的正负样本,无论是从一个框回归还是从一个点回归,最终的表现都没有明显的差距。因此,如何选择正负训练样本值得进一步研究。受此启发,我们提出了一种新的自适应训练样本选择(ATSS)来根据目标特征自动选择正样本和负样本。它弥合了基于锚和无锚检测器之间的差距。此外,通过一系列实验,可以得出结论,不需要在图像上的每个位置平铺多个锚点来检测目标。MS COCO [34] 数据集上的大量实验支持我们的分析和结论。在不引入任何开销的情况下,应用新引入的 ATSS 即可实现最先进的 50.7% AP。这项工作的主要贡献可以概括为:
• 指出anchor-based 和anchor-free 检测器的本质区别实际上是如何定义正负训练样本。
• 提出自适应训练样本选择,根据目标的统计特征自动选择正负训练样本。
• 证明在图像上的每个位置平铺多个锚点以检测目标是无用的操作。
• 在不引入任何额外开销的情况下,在 MS COCO 上实现最先进的性能。
2. 相关工作
Current CNN-based object detection consists of anchor-based and anchor-free detectors. The former one can be divided into two-stage and one-stage methods, while the latter one falls into keypoint-based and center-based methods.
当前基于 CNN 的目标检测包括基于锚点和无锚点的检测器。前者可分为两阶段和一阶段方法,而后者则分为基于关键点和基于中心的方法。
2.1. 基于Anchor的检测器
Two-stage method. The emergence of Faster R-CNN [47] establishes the dominant position of two-stage anchor-based detectors. Faster R-CNN consists of a separate region proposal network (RPN) and a region-wise prediction network (R-CNN) [14, 13] to detect objects. After that, lots of algorithms are proposed to improve its performance, including architecture redesign and reform [4, 9, 5, 28, 30], context and attention mechanism [2, 51, 38, 7, 44], multi-scale training and testing [54, 41], training strategy and loss function [40, 52, 61, 17], feature fusion and enhancement [25, 32], better proposal and balance [55, 43]. Nowadays, state-of-the-art results are still held by two-stage anchor-based methods on standard detection benchmarks.
两阶段方法。 Faster R-CNN [47]的出现确立了两阶段基于锚的检测器的主导地位。Faster R-CNN 由一个单独的区域提议网络 (RPN) 和一个区域预测网络 (R-CNN) [14, 13] 组成,用于检测目标。之后,提出了许多算法来提高其性能,包括架构重新设计和改革 [4, 9, 5, 28, 30],上下文和注意力机制 [2, 51, 38, 7, 44],多尺度训练和测试 [54, 41],训练策略和损失函数 [40, 52, 61, 17],特征融合和增强 [25, 32],更好的提议和平衡 [55, 43]。如今,在标准检测基准上,最先进的结果仍由两阶段基于锚的方法保持。
One-stage method. With the advent of SSD [36], one-stage anchor-based detectors have attracted much attention because of their high computational efficiency. SSD spreads out anchor boxes on multi-scale layers within a ConvNet to directly predict object category and anchor box offsets. Thereafter, plenty of works are presented to boost its performance in different aspects, such as fusing context information from different layers [24, 12, 69], training from scratch [50, 73], introducing new loss function [33, 6], anchor refinement and matching [66, 67], architecture redesign [21, 22], feature enrichment and alignment [35, 68, 60, 42, 29]. At present, one-stage anchor-based methods can achieve very close performance with two-stage anchor-based methods at a faster inference speed.
单阶段方法。随着 SSD [36] 的出现,基于锚点的单阶段检测器因其高计算效率而备受关注。SSD 在 ConvNet 内的多尺度层上展开锚框,以直接预测目标类别和锚框偏移。此后,提出了大量工作以提高其在不同方面的性能,例如融合来自不同层的上下文信息 [24, 12, 69],从头开始训练 [50, 73],引入新的损失函数 [33, 6], 锚细化和匹配 [66, 67],架构重新设计 [21, 22],特征丰富和对齐 [35, 68, 60, 42, 29]。目前,一阶段基于锚的方法可以在更快的推理速度下实现与两阶段基于锚的方法非常接近的性能。
2.2. Anchor-free检测器
Keypoint-based method. This type of anchor-free method first locates several pre-defined or self-learned keypoints, and then generates bounding boxes to detect objects. CornerNet [26] detects an object bounding box as a pair of keypoints (top-left corner and bottom-right corner) and CornerNet-Lite [27] introduces CornerNet-Saccade and CornerNet-Squeeze to improve its speed. The second stage of Grid R-CNN [39] locates objects via predicting grid points with the position sensitive merits of FCN and then determining the bounding box guided by the grid. ExtremeNet [71] detects four extreme points (top-most, left-most, bottom-most, right-most) and one center point to generate the object bounding box. Zhu et al. [70] use keypoint estimation to find center point of objects and regress to all other properties including size, 3D location, orientation and pose. CenterNet [11] extends CornerNet as a triplet rather than a pair of keypoints to improve both precision and recall. RepPoints [65] represents objects as a set of sample points and learns to arrange themselves in a manner that bounds the spatial extent of an object and indicates semantically significant local areas.
基于关键点的方法。这种类型的无锚方法首先定位几个预定义或自学习的关键点,然后生成边界框来检测目标。CornerNet [26] 将目标边界框检测为一对关键点(左上角和右下角),CornerNet-Lite [27] 引入了 CornerNet-Saccade 和 CornerNet-Squeeze 以提高其速度。Grid R-CNN [39] 的第二阶段借助 FCN 对位置敏感的优点预测网格点,再由网格引导确定边界框来定位目标。ExtremeNet [71] 检测四个极值点(最顶部、最左侧、最底部、最右侧)和一个中心点以生成目标边界框。Zhu 等人 [70] 使用关键点估计来找到目标的中心点,并回归出大小、3D 位置、方向和姿势等所有其他属性。CenterNet [11] 将 CornerNet 的一对关键点扩展为三元组,以同时提高精度和召回率。RepPoints [65] 将目标表示为一组样本点,并学习以能够框定目标空间范围并指示语义上重要的局部区域的方式来排列这些点。
Center-based method. This kind of anchor-free method regards the center (e.g., the center point or part) of object as foreground to define positives, and then predicts the distances from positives to the four sides of the object bounding box for detection. YOLO [45] divides the image into an S × S grid, and the grid cell that contains the center of an object is responsible for detecting this object. DenseBox [20] uses a filled circle located in the center of the object to define positives and then predicts the four distances from positives to the bound of the object bounding box for location. GA-RPN [59] defines the pixels in the center region of the object as positives to predict the location, width and height of object proposals for Faster R-CNN. FSAF [72] attaches an anchor-free branch with online feature selection to RetinaNet. The newly added branch defines the center region of the object as positives to locate it via predicting four distances to its bounds. FCOS [56] regards all the locations inside the object bounding box as positives with four distances and a novel centerness score to detect objects. CSP [37] only defines the center point of the object box as positives to detect pedestrians with fixed aspect ratio. FoveaBox [23] regards the locations in the middle part of object as positives with four distances to perform detection.
基于中心的方法。 这种anchor-free方法将物体的中心(例如中心点或中心区域)视为前景来定义正样本,然后预测正样本到物体边界框四条边的距离以进行检测。YOLO [45] 将图像划分为 S×S 的网格,包含物体中心的网格单元负责检测这个物体。DenseBox [20] 使用位于目标中心的实心圆来定义正样本,然后预测正样本到目标边界框四条边的距离以进行定位。GA-RPN [59] 将目标中心区域的像素定义为正样本,为 Faster R-CNN 预测候选框的位置、宽度和高度。FSAF [72] 在 RetinaNet 上附加了一个带有在线特征选择的无锚分支,新添加的分支将目标的中心区域定义为正样本,并通过预测到边界的四个距离来定位目标。FCOS [56] 将目标边界框内的所有位置都视为正样本,利用四个距离和一个新的中心度(centerness)分数来检测目标。CSP [37] 仅将目标框的中心点定义为正样本,用于检测具有固定纵横比的行人。FoveaBox [23] 将目标中间部分的位置视为正样本,并用四个距离来执行检测。
3. 基于锚点和无锚点检测的差异分析
Without loss of generality, the representative anchorbased RetinaNet [33] and anchor-free FCOS [56] are adopted to dissect their differences. In this section, we focus on the last two differences: the positive/negative sample definition and the regression starting status. The remaining one difference: the number of anchors tiled per location, will be discussed in subsequent section. Thus, we just tile one square anchor per location for RetinaNet, which is quite similar to FCOS. In the remaining part, we first introduce the experiment settings, then rule out all the implementation inconsistencies, finally point out the essential difference between anchor-based and anchor-free detectors.
不失一般性,采用具有代表性的anchor-based RetinaNet [33]和anchor-free FCOS [56]来剖析它们的差异。在本节中,我们关注最后两个差异:正/负样本定义和回归开始状态。剩下的一个区别:每个位置平铺的锚点数量,将在下一节中讨论。因此,我们只为 RetinaNet 的每个位置平铺一个方形锚,这与 FCOS 非常相似。在剩下的部分中,我们首先介绍了实验设置,然后排除了所有的实现不一致,最后指出了基于锚的检测器和无锚检测器的本质区别。
3.1. 实验设置
Dataset. All experiments are conducted on the challenging MS COCO [34] dataset that includes 80 object classes. Following the common practice [33, 56], all 115K images in the trainval35k split are used for training, and all 5K images in the minival split are used as validation for analysis study. We also submit our main results to the evaluation server for the final performance on the test-dev split.
数据集。 所有实验都是在具有挑战性的 MS COCO [34] 数据集上进行的,该数据集包括 80 个目标类。按照惯例 [33, 56],trainval35k 划分中的所有 115K 图像都用于训练,而 minival 划分中的所有 5K 图像都用作分析研究的验证集。我们还将主要结果提交给评估服务器,以获得在 test-dev 划分上的最终性能。
Training Detail. We use the ImageNet [49] pretrained ResNet-50 [16] with 5-level feature pyramid structure as the backbone. The newly added layers are initialized in the same way as in [33]. For RetinaNet, each layer in the 5-level feature pyramid is associated with one square anchor with 8S scale, where S is the total stride size. During training, we resize the input images to keep their shorter side being 800 and their longer side less or equal to 1,333. The whole network is trained using the Stochastic Gradient Descent (SGD) algorithm for 90K iterations with 0.9 momentum, 0.0001 weight decay and 16 batch size. We set the initial learning rate as 0.01 and decay it by 0.1 at iteration 60K and 80K, respectively. Unless otherwise stated, the aforementioned training details are used in the experiments.
训练细节。 我们使用在 ImageNet [49] 上预训练的 ResNet-50 [16] 并配合 5 级特征金字塔结构作为主干。新添加的层的初始化方式与[33]中的相同。对于 RetinaNet,5 级特征金字塔中的每一层都关联一个 8S 尺度的方形锚框,其中 S 是该层的总步幅。在训练期间,我们调整输入图像的大小,使其短边为 800,长边小于或等于 1333。整个网络使用随机梯度下降 (SGD) 算法训练 90K 次迭代,动量为 0.9,权重衰减为 0.0001,批量大小为 16。我们将初始学习率设置为 0.01,并分别在第 60K 和 80K 次迭代时将其衰减为原来的 0.1 倍。除非另有说明,实验均采用上述训练细节。
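【个人理解:上面这段训练设置可以整理成如下配置示意,便于复现时核对。注意这只是按上文描述整理的草稿,字段名为本文自拟,并非官方代码中的配置项:】

```python
# 3.1 节训练设置整理示意;键名为本文自拟,数值取自上文描述
TRAIN_CONFIG = {
    "backbone": "ResNet-50 (ImageNet 预训练) + 5 级特征金字塔",
    "anchors_per_location": 1,        # RetinaNet (#A=1):每个位置一个方形锚框
    "anchor_scale": "8S",             # S 为该金字塔层级的总步幅
    "resize_shorter_side": 800,       # 短边缩放到 800
    "resize_longer_side_max": 1333,   # 长边不超过 1333
    "optimizer": "SGD",
    "total_iterations": 90_000,
    "momentum": 0.9,
    "weight_decay": 1e-4,
    "batch_size": 16,
    "base_lr": 0.01,
    "lr_decay_factor": 0.1,
    "lr_decay_iterations": [60_000, 80_000],
}
```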
Inference Detail. During the inference phase, we resize the input image in the same way as in the training phase, and then forward it through the whole network to output the predicted bounding boxes with a predicted class. After that, we use the preset score 0.05 to filter out plenty of background bounding boxes, and then output the top 1000 detections per feature pyramid. Finally, the Non-Maximum Suppression (NMS) is applied with the IoU threshold 0.6 per class to generate final top 100 confident detections per image.
推理细节。 在推理阶段,我们以与训练阶段相同的方式调整输入图像的大小,然后将其转发到整个网络以输出具有预测类别的预测边界框。之后,我们使用预设分数 0.05 过滤掉大量背景边界框,然后输出每个特征金字塔的前 1000 个检测。最后,应用非最大抑制 (NMS),每个类别的 IoU 阈值为 0.6,以生成每个图像的最终前 100 个置信检测。
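【个人理解:推理阶段的后处理大致可以写成下面的示意代码(简化版本,非官方实现;postprocess 的函数名和输入形状是本文为说明而假设的,batched_nms 来自 torchvision.ops,用于按类别做 NMS):】

```python
import torch
from torchvision.ops import batched_nms

def postprocess(boxes_per_level, scores_per_level, labels_per_level,
                score_thr=0.05, pre_nms_topk=1000, nms_iou=0.6, max_dets=100):
    """推理后处理示意:逐层过滤低分框并取 top-1000,再按类别做 NMS,最后保留前 100 个检测。
    boxes_per_level: list[Tensor(N_i, 4)];scores/labels_per_level: list[Tensor(N_i,)]。"""
    kept_boxes, kept_scores, kept_labels = [], [], []
    for boxes, scores, labels in zip(boxes_per_level, scores_per_level, labels_per_level):
        mask = scores > score_thr                      # 用 0.05 的预设分数过滤背景框
        boxes, scores, labels = boxes[mask], scores[mask], labels[mask]
        if scores.numel() > pre_nms_topk:              # 每个特征金字塔层级最多保留 1000 个
            scores, idx = scores.topk(pre_nms_topk)
            boxes, labels = boxes[idx], labels[idx]
        kept_boxes.append(boxes); kept_scores.append(scores); kept_labels.append(labels)

    boxes = torch.cat(kept_boxes); scores = torch.cat(kept_scores); labels = torch.cat(kept_labels)
    keep = batched_nms(boxes, scores, labels, nms_iou)  # 按类别做 IoU 阈值为 0.6 的 NMS
    keep = keep[:max_dets]                              # 每张图最终输出前 100 个高置信度检测
    return boxes[keep], scores[keep], labels[keep]
```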
3.2. 不一致的消除
We mark the anchor-based detector RetinaNet with only one square anchor box per location as RetinaNet (#A=1), which is almost the same as the anchor-free detector FCOS. However, as reported in [56], FCOS outperforms RetinaNet (#A=1) by a large margin in AP performance on the MS COCO minival subset, i.e., 37.1% vs. 32.5%. Furthermore, some new improvements have been made for FCOS including moving centerness to regression branch, using GIoU loss function and normalizing regression targets by corresponding strides. These improvements boost the AP performance of FCOS from 37.1% to 37.8%, making the gap even bigger. However, part of the AP gap between the anchor-based detector (32.5%) and the anchor-free detector (37.8%) results from some universal improvements that are proposed or used in FCOS, such as adding GroupNorm [62] in heads, using the GIoU [48] regression loss function, limiting positive samples in the ground-truth box [56], introducing the centerness branch [56] and adding a trainable scalar [56] for each level feature pyramid. These improvements can also be applied to anchor-based detectors, therefore they are not the essential differences between anchor-based and anchor-free methods. We apply them to RetinaNet (#A=1) one by one so as to rule out these implementation inconsistencies. As listed in Table 1, these irrelevant differences improve the anchor-based RetinaNet to 37.0%, which still has a gap of 0.8% to the anchor-free FCOS. By now, after removing all the irrelevant differences, we can explore the essential differences between anchor-based and anchor-free detectors in a quite fair way.
我们将每个位置只有一个方形锚框的基于锚的检测器 RetinaNet 标记为 RetinaNet (#A=1),这与无锚检测器 FCOS 几乎相同。然而,正如 [56] 中所报道的,FCOS 在 MS COCO minival 子集上的 AP 性能大幅优于 RetinaNet (#A=1),即 37.1% 对 32.5%。此外,FCOS 还引入了一些新的改进,包括将 centerness 移动到回归分支、使用 GIoU 损失函数以及用相应的步幅对回归目标进行归一化。这些改进将 FCOS 的 AP 性能从 37.1% 提升到了 37.8%,使差距进一步拉大。然而,基于锚的检测器(32.5%)和无锚检测器(37.8%)之间的部分 AP 差距来自 FCOS 中提出或使用的一些通用改进,例如在检测头中添加 GroupNorm [62]、使用 GIoU [48] 回归损失函数、将正样本限制在 ground-truth 框内 [56]、引入 centerness 分支 [56] 以及为每个层级的特征金字塔添加一个可训练的标量 [56]。这些改进也可以应用于基于锚的检测器,因此它们不是基于锚和无锚方法之间的本质区别。我们将它们一一应用到 RetinaNet (#A=1),以排除这些实现上的不一致。如表 1 所列,这些不相关的差异将基于锚的 RetinaNet 提高到 37.0%,与无锚的 FCOS 仍有 0.8% 的差距。至此,在消除了所有不相关的差异之后,我们可以以相当公平的方式探索基于锚和无锚检测器之间的本质区别。
3.3. 本质区别
After applying those universal improvements, there are only two differences between the anchor-based RetinaNet (#A=1) and the anchor-free FCOS. One is about the classification sub-task in detection, i.e., the way to define positive and negative samples. Another one is about the regression sub-task, i.e., the regression starting from an anchor box or an anchor point.
在应用了这些通用改进之后,基于锚的 RetinaNet (#A=1) 和无锚的 FCOS 之间只剩下两个区别。一是关于检测中的分类子任务,即定义正负样本的方式;二是关于回归子任务,即回归是从锚框还是从锚点开始。
Classification. As shown in Figure 1(a), RetinaNet utilizes IoU to divide the anchor boxes from different pyramid levels into positives and negatives. It first labels the best anchor box of each object and the anchor boxes with $IoU > \theta_p$ as positives, then regards the anchor boxes with $IoU < \theta_n$ as negatives, finally other anchor boxes are ignored during training. As shown in Figure 1(b), FCOS uses spatial and scale constraints to divide the anchor points from different pyramid levels. It first considers the anchor points within the ground-truth box as candidate positive samples, then selects the final positive samples from candidates based on the scale range defined for each pyramid level, finally those unselected anchor points are negative samples.
分类。 如图 1(a) 所示,RetinaNet 利用 IoU 将不同金字塔级别的锚框划分为正负样本。它首先将每个目标的最佳锚框以及 $IoU > \theta_p$ 的锚框标记为正样本,然后将 $IoU < \theta_n$ 的锚框视为负样本,最后在训练过程中忽略其余锚框。如图 1(b) 所示,FCOS 使用空间和尺度约束来划分不同金字塔级别的锚点。它首先将 ground-truth 框内的锚点视为候选正样本,然后根据为每个金字塔级别定义的尺度范围从候选中选择最终的正样本,最后那些未被选择的锚点为负样本。
As shown in Figure 1, FCOS first uses the spatial constraint to find candidate positives in the spatial dimension, then uses the scale constraint to select final positives in the scale dimension. In contrast, RetinaNet utilizes IoU to directly select the final positives in the spatial and scale dimension simultaneously. These two different sample selection strategies produce different positive and negative samples. As listed in the first column of Table 2 for RetinaNet (#A=1), using the spatial and scale constraint strategy instead of the IoU strategy improves the AP performance from 37.0% to 37.8%. As for FCOS, if it uses the IoU strategy to select positive samples, the AP performance decreases from 37.8% to 36.9% as listed in the second column of Table 2. These results demonstrate that the definition of positive and negative samples is an essential difference between anchor-based and anchor-free detectors.
如图1所示,FCOS首先利用空间约束在空间维度上寻找候选正样本,然后利用尺度约束在尺度维度上选择最终正样本。相比之下,RetinaNet 利用 IoU 同时在空间和尺度维度上直接选择最终正样本。这两种不同的样本选择策略会产生不同的正负样本。如表 2 中 RetinaNet (#A=1) 的第一列所示,使用空间和尺度约束策略代替 IoU 策略将 AP 性能从 37.0% 提高到 37.8%。对于 FCOS,如果使用 IoU 策略选择正样本,AP 性能从 37.8% 下降到 36.9%,如表 2 第二列所示。这些结果表明,基于锚和无锚检测器之间的本质区别在于正负样本的定义。
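【个人理解:把两种正负样本定义的核心判定条件写成示意代码,更容易看出差别。下面只是简化示意,retinanet_assign、fcos_assign、iou_xyxy 均为本文自拟的函数名;θ_p=0.5、θ_n=0.4 取 RetinaNet 常用的默认阈值,scale_range 对应 FCOS 为每个金字塔层级预先定义的尺度范围:】

```python
# 简化示意:RetinaNet 直接用 IoU 同时在空间和尺度维度上划分正负样本
def iou_xyxy(a, b):
    """两个 [x1, y1, x2, y2] 框的 IoU。"""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def retinanet_assign(anchor, gt_box, theta_p=0.5, theta_n=0.4):
    iou = iou_xyxy(anchor, gt_box)
    if iou >= theta_p:
        return "positive"
    if iou < theta_n:
        return "negative"
    return "ignore"          # IoU 介于 θ_n 与 θ_p 之间的锚框在训练中被忽略

# 简化示意:FCOS 先用空间约束(锚点在 gt 框内)取候选,再用该层级的尺度范围筛选
def fcos_assign(point, gt_box, scale_range):
    x, y = point
    x1, y1, x2, y2 = gt_box
    if not (x1 <= x <= x2 and y1 <= y <= y2):          # 空间约束
        return "negative"
    max_dist = max(x - x1, y - y1, x2 - x, y2 - y)     # 到四条边的最大回归距离
    lo, hi = scale_range                               # 该金字塔层级允许的尺度范围
    return "positive" if lo <= max_dist <= hi else "negative"
```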
Regression. After positive and negative samples are determined, the location of object is regressed from positive samples as shown in Figure 2(a). RetinaNet regresses from the anchor box with four offsets between the anchor box and the object box as shown in Figure 2(b), while FCOS regresses from the anchor point with four distances to the bound of object as shown in Figure 2(c). It means that for a positive sample, the regression starting status of RetinaNet is a box while FCOS is a point. However, as shown in the first and second rows of Table 2, when RetinaNet and FCOS adopt the same sample selection strategy to have consistent positive/negative samples, there is no obvious difference in final performance, no matter regressing starting from a point or a box, i.e., 37.0% vs. 36.9% and 37.8% vs. 37.8%. These results indicate that the regression starting status is an irrelevant difference rather than an essential difference.
回归。 确定正负样本后,从正样本回归目标的位置,如图 2(a) 所示。如图 2(b) 所示,RetinaNet 从锚框出发,回归锚框和目标框之间的四个偏移量;如图 2(c) 所示,FCOS 从锚点出发,回归该点到目标边界的四个距离。这意味着对于一个正样本,RetinaNet 的回归起始状态是一个框,而 FCOS 是一个点。但是,如表 2 的第一行和第二行所示,当 RetinaNet 和 FCOS 采用相同的样本选择策略以得到一致的正负样本时,无论是从一个点还是从一个框开始回归,最终性能都没有明显差异,即 37.0% 对 36.9% 和 37.8% 对 37.8%。这些结果表明,回归起始状态是一个不相关的差异,而不是本质的差异。
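【个人理解:两种回归起点的差别只体现在回归目标的参数化方式上,可以用下面的示意代码对比(函数名为本文自拟;RetinaNet 采用 Faster R-CNN 式的 (dx, dy, dw, dh) 偏移量编码,FCOS 直接回归点到四条边的距离 (l, t, r, b)):】

```python
import math

# RetinaNet:从预设锚框出发,回归锚框与目标框之间的四个偏移量 (dx, dy, dw, dh)
def box_regression_target(anchor, gt_box):
    ax, ay = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    gx, gy = (gt_box[0] + gt_box[2]) / 2, (gt_box[1] + gt_box[3]) / 2
    gw, gh = gt_box[2] - gt_box[0], gt_box[3] - gt_box[1]
    return (gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah)

# FCOS:从锚点出发,回归该点到目标框四条边的距离 (l, t, r, b)
def point_regression_target(point, gt_box):
    x, y = point
    x1, y1, x2, y2 = gt_box
    return x - x1, y - y1, x2 - x, y2 - y
```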
Conclusion. According to these experiments conducted in a fair way, we indicate that the essential difference between one-stage anchor-based detectors and center-based anchor-free detectors is actually how to define positive and negative training samples, which is important for current object detection and deserves further study.
结论。 根据这些以公平方式进行的实验,我们指出,一阶段基于锚的检测器和基于中心的无锚检测器之间的本质区别实际上是如何定义正负训练样本,这对于当前的目标检测很重要,值得进一步研究。
4.自适应训练样本选择
When training an object detector, we first need to define positive and negative samples for classification, and then use positive samples for regression. According to the previous analysis, the former one is crucial and the anchorfree detector FCOS improves this step. It introduces a new way to define positives and negatives, which achieves better performance than the traditional IoU-based strategy. Inspired by this, we delve into the most basic issue in object detection: how to define positive and negative training samples, and propose an Adaptive Training Sample Selection (ATSS). Compared with these traditional strategies, our method almost has no hyperparameters and is robust to different settings.
在训练目标检测器时,我们首先需要定义正负样本进行分类,然后使用正样本进行回归。根据前面的分析,前一个是关键,anchor-free检测器FCOS改进了这一步。它引入了一种定义正负样本的新方法,比传统的基于 IoU 的策略实现了更好的性能。受此启发,我们深入研究了目标检测中最基本的问题:如何定义正负训练样本,并提出自适应训练样本选择(ATSS)。与这些传统策略相比,我们的方法几乎没有超参数,并且对不同的设置具有鲁棒性。
4.1. 描述
Previous sample selection strategies have some sensitive hyperparameters, such as IoU thresholds in anchor-based detectors and scale ranges in anchor-free detectors. After these hyperparameters are set, all ground-truth boxes must select their positive samples based on the fixed rules, which are suitable for most objects, but some outer objects will be neglected. Thus, different settings of these hyperparameters will have very different results.
以前的样本选择策略具有一些敏感的超参数,例如基于锚的检测器中的 IoU 阈值和无锚检测器中的尺度范围。设置好这些超参数后,所有的ground-truth box都必须根据固定的规则选择它们的正样本,这适用于大多数物体,但会忽略一些外部物体。因此,这些超参数的不同设置将产生非常不同的结果。
To this end, we propose the ATSS method that automatically divides positive and negative samples according to statistical characteristics of object almost without any hyperparameter. Algorithm 1 describes how the proposed method works for an input image. For each ground-truth box $g$ on the image, we first find out its candidate positive samples. As described in Line 3 to 6, on each pyramid level, we select $k$ anchor boxes whose center are closest to the center of $g$ based on L2 distance. Supposing there are $\mathcal{L}$ feature pyramid levels, the ground-truth box $g$ will have $k \times \mathcal{L}$ candidate positive samples. After that, we compute the IoU between these candidates and the ground-truth $g$ as $\mathcal{D}_g$ in Line 7, whose mean and standard deviation are computed as $m_g$ and $v_g$ in Line 8 and Line 9. With these statistics, the IoU threshold for this ground-truth $g$ is obtained as $t_g = m_g + v_g$ in Line 10. Finally, we select these candidates whose IoU are greater than or equal to the threshold $t_g$ as final positive samples in Line 11 to 15. Notably, we also limit the positive samples' center to the ground-truth box as shown in Line 12. Besides, if an anchor box is assigned to multiple ground-truth boxes, the one with the highest IoU will be selected. The rest are negative samples. Some motivations behind our method are explained as follows.
为此,我们提出了 ATSS 方法,它根据目标的统计特征自动划分正样本和负样本,几乎没有任何超参数。算法 1 描述了所提出的方法如何处理一张输入图像。对于图像上的每个 ground-truth 框 $g$,我们首先找出它的候选正样本。如第 3 至 6 行所述,在每个金字塔层上,我们根据 L2 距离选择 $k$ 个中心离 $g$ 中心最近的锚框。假设有 $\mathcal{L}$ 个特征金字塔层级,ground-truth 框 $g$ 将有 $k \times \mathcal{L}$ 个候选正样本。然后,我们在第 7 行计算这些候选框与 ground-truth 框 $g$ 之间的 IoU,记为 $\mathcal{D}_g$,并在第 8 行和第 9 行分别计算其均值 $m_g$ 和标准差 $v_g$。根据这些统计量,在第 10 行得到该 ground-truth 框 $g$ 的 IoU 阈值 $t_g = m_g + v_g$。最后,在第 11 至 15 行,我们选择 IoU 大于或等于阈值 $t_g$ 的候选框作为最终的正样本。值得注意的是,如第 12 行所示,我们还要求正样本的中心位于 ground-truth 框内。此外,如果一个锚框被分配给多个 ground-truth 框,则选择 IoU 最高的那个。其余的均为负样本。我们的方法背后的一些动机解释如下。
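【个人理解:按照上面算法 1 的流程,ATSS 的正负样本划分可以写成如下简化的 Python 示意(非官方实现;atss_assign、iou_with_gt 等函数名与输入形状均为本文为说明而假设):对每个 ground-truth,先在每个金字塔层级按中心 L2 距离取 k 个候选锚框,再用候选 IoU 的均值加标准差作为该目标的自适应阈值 t_g,同时要求正样本中心落在 ground-truth 框内。】

```python
import numpy as np

def iou_with_gt(boxes, gt):
    """一组 [x1, y1, x2, y2] 框与单个 gt 框的 IoU。"""
    x1 = np.maximum(boxes[:, 0], gt[0]); y1 = np.maximum(boxes[:, 1], gt[1])
    x2 = np.minimum(boxes[:, 2], gt[2]); y2 = np.minimum(boxes[:, 3], gt[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_boxes + area_gt - inter + 1e-9)

def atss_assign(anchors, anchor_levels, gt_boxes, k=9):
    """简化版 ATSS 正负样本划分。anchors: (N, 4);anchor_levels: (N,) 每个锚框所在层级;
    gt_boxes: (M, 4)。返回 labels: (N,),-1 表示负样本,否则为匹配到的 gt 索引。"""
    labels = np.full(len(anchors), -1, dtype=np.int64)
    best_iou = np.zeros(len(anchors))          # 处理一个锚框同时满足多个 gt 的情况
    cx = (anchors[:, 0] + anchors[:, 2]) / 2
    cy = (anchors[:, 1] + anchors[:, 3]) / 2

    for g, (gx1, gy1, gx2, gy2) in enumerate(gt_boxes):
        gcx, gcy = (gx1 + gx2) / 2, (gy1 + gy2) / 2
        dist = np.sqrt((cx - gcx) ** 2 + (cy - gcy) ** 2)

        # 1) 每个金字塔层级上,按中心 L2 距离选出 k 个候选锚框(对应算法第 3-6 行)
        candidates = []
        for lvl in np.unique(anchor_levels):
            idx = np.where(anchor_levels == lvl)[0]
            candidates.extend(idx[np.argsort(dist[idx])[:k]])
        candidates = np.array(candidates)

        # 2) 候选框与该 gt 的 IoU 的均值 + 标准差 作为自适应阈值 t_g(第 7-10 行)
        ious = iou_with_gt(anchors[candidates], gt_boxes[g])
        t_g = ious.mean() + ious.std()

        # 3) IoU >= t_g 且中心在 gt 框内的候选为正样本;一个锚框只保留 IoU 最大的 gt(第 11-15 行)
        for i, a in enumerate(candidates):
            center_in_gt = gx1 <= cx[a] <= gx2 and gy1 <= cy[a] <= gy2
            if ious[i] >= t_g and center_in_gt and ious[i] > best_iou[a]:
                labels[a], best_iou[a] = g, ious[i]
    return labels
```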
Selecting candidates based on the center distance between anchor box and object. For RetinaNet, the IoU is larger when the center of anchor box is closer to the center of object. For FCOS, the closer anchor point to the center of object will produce higher-quality detections. Thus, the closer anchor to the center of object is the better candidate.
根据锚框与目标之间的中心距离选择候选样本。 对于 RetinaNet,锚框的中心越靠近目标的中心,IoU 越大。对于 FCOS,离目标中心越近的锚点会产生更高质量的检测。因此,离目标中心越近的锚是更好的候选。
Using the sum of mean and standard deviation as the IoU threshold. The IoU mean $m_g$ of an object is a measure of the suitability of the preset anchors for this object. A high $m_g$ as shown in Figure 3(a) indicates it has high-quality candidates and the IoU threshold is supposed to be high. A low $m_g$ as shown in Figure 3(b) indicates that most of its candidates are low-quality and the IoU threshold should be low. Besides, the IoU standard deviation $v_g$ of an object is a measure of which layers are suitable to detect this object. A high $v_g$ as shown in Figure 3(a) means there is a pyramid level specifically suitable for this object, adding $v_g$ to $m_g$ obtains a high threshold to select positives only from that level. A low $v_g$ as shown in Figure 3(b) means that there are several pyramid levels suitable for this object, adding $v_g$ to $m_g$ obtains a low threshold to select appropriate positives from these levels. Using the sum of mean $m_g$ and standard deviation $v_g$ as the IoU threshold $t_g$ can adaptively select enough positives for each object from appropriate pyramid levels in accordance of statistical characteristics of object.
使用均值和标准差之和作为 IoU 阈值。 一个目标的 IoU 均值 $m_g$ 衡量预设锚框对该目标的适配程度。如图 3(a) 所示,高 $m_g$ 表示它具有高质量的候选框,此时 IoU 阈值应该较高;如图 3(b) 所示,低 $m_g$ 表示其大多数候选框质量较低,IoU 阈值应该较低。此外,目标的 IoU 标准差 $v_g$ 衡量哪些层级适合检测该目标。如图 3(a) 所示,高 $v_g$ 意味着存在一个特别适合该目标的金字塔层级,将 $v_g$ 加到 $m_g$ 上得到一个较高的阈值,只从该层级选择正样本;如图 3(b) 所示,低 $v_g$ 意味着有多个金字塔层级都适合该目标,将 $v_g$ 加到 $m_g$ 上得到一个较低的阈值,从这些层级中选择合适的正样本。使用均值 $m_g$ 与标准差 $v_g$ 之和作为 IoU 阈值 $t_g$,可以根据目标的统计特征,从合适的金字塔层级中自适应地为每个目标选择足够多的正样本。
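【个人理解:用一个小数值例子直观感受自适应阈值 $t_g = m_g + v_g$。下面两组候选 IoU 分别模拟图 3(a) 和图 3(b) 的情形,数值是本文随手假设的,仅作说明:】

```python
import numpy as np

# 情形 (a):只有某一层级给出高质量候选,IoU 均值和标准差都较大 -> 阈值自适应升高
ious_a = np.array([0.62, 0.58, 0.55, 0.12, 0.10, 0.08, 0.06, 0.05, 0.04])
print(ious_a.mean() + ious_a.std())   # 约 0.49,只有前三个高 IoU 候选会被选为正样本

# 情形 (b):多个层级的候选质量都一般,IoU 均值和标准差都较小 -> 阈值自适应降低
ious_b = np.array([0.31, 0.30, 0.29, 0.28, 0.27, 0.26, 0.25, 0.24, 0.23])
print(ious_b.mean() + ious_b.std())   # 约 0.30,中等 IoU 的候选也能成为正样本
```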
Limiting the positive samples’ center to object. The anchor with a center outside object is a poor candidate and will be predicted by the features outside the object, which is not conducive to training and should be excluded.
将正样本的中心限制为目标。 中心在目标外的anchor是较差的候选,会被目标外的特征预测,不利于训练,应该排除。
Maintaining fairness between different objects. According to the statistical theory, about 16% of samples are in the confidence interval $[m_g + v_g, 1]$ in theory. Although the IoU of candidates is not a standard normal distribution, the statistical results show that each object has about $0.2 \times k\mathcal{L}$ positive samples, which is invariant to its scale, aspect ratio and location. In contrast, strategies of RetinaNet and FCOS tend to have much more positive samples for larger objects, leading to unfairness between different objects.
维护不同目标之间的公平性。 根据统计理论,理论上大约 16% 的样本落在置信区间 $[m_g + v_g, 1]$ 内。虽然候选框的 IoU 并不服从标准正态分布,但统计结果表明每个目标大约有 $0.2 \times k\mathcal{L}$ 个正样本,且该数量与目标的尺度、纵横比和位置无关。相比之下,RetinaNet 和 FCOS 的策略往往为较大的目标分配更多的正样本,导致不同目标之间的不公平。
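【个人理解:这里的 16% 可以用正态分布近似粗略验证(只是一个近似推算,候选 IoU 实际并非正态分布)。若候选 IoU 近似服从均值 $m_g$、标准差 $v_g$ 的正态分布,记 $\Phi$ 为标准正态分布的累积分布函数,则】

$$
P\big(\mathrm{IoU} \ge m_g + v_g\big) = 1 - \Phi(1) \approx 0.16 .
$$

按论文默认的 $k=9$、5 级特征金字塔计算,候选总数为 $k\mathcal{L} = 45$,于是每个目标大约得到 $0.2 \times 45 = 9$ 个左右的正样本,与论文统计到的约 $0.2 \times k\mathcal{L}$ 个正样本的量级一致,并且与目标的大小无关。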
Keeping almost hyperparameter-free. Our method only has one hyperparameter k. Subsequent experiments prove that it is quite insensitive to the variations of k and the proposed ATSS can be considered almost hyperparameter-free.
保持几乎无超参数。 我们的方法只有一个超参数 k。随后的实验证明它对 k 的变化非常不敏感,并且所提出的 ATSS 可以被认为几乎是无超参数的。
4.2. 确认
Anchor-based RetinaNet. To verify the effectiveness of our adaptive training sample selection for anchor-based detectors, we use it to replace the traditional strategy in the improved RetinaNet (#A=1). As shown in Table 3, it consistently boosts the performance by 2.3% on AP, 2.4% on $AP_{50}$, 2.9% for $AP_{75}$, 2.9% for $AP_{S}$, 2.1% for $AP_{M}$ and 2.7% for $AP_{L}$. These improvements are mainly due to the adaptive selection of positive samples for each ground-truth based on its statistical characteristics. Since our method only redefines positive and negative samples without incurring any additional overhead, these improvements can be considered cost-free.
基于锚点的 RetinaNet。 为了验证我们的自适应训练样本选择对基于锚的检测器的有效性,我们用它来替换改进的 RetinaNet (#A=1) 中的传统策略。如表 3 所示,它持续将 AP 提升 2.3%,$AP_{50}$ 提升 2.4%,$AP_{75}$ 提升 2.9%,$AP_{S}$ 提升 2.9%,$AP_{M}$ 提升 2.1%,$AP_{L}$ 提升 2.7%。这些改进主要得益于根据每个 ground-truth 的统计特征自适应地选择正样本。由于我们的方法只是重新定义正负样本,不会产生任何额外的开销,因此这些改进可以被认为是免费的。
Anchor-free FCOS. The proposed method can also be applied to the anchor-free FCOS in two different versions: the lite and full version. For the lite version, we apply some ideas of the proposed ATSS to FCOS, i.e., replacing its way to select candidate positives with the way in our method. FCOS considers anchor points in the object box as candidates, which results in plenty of low-quality positives. In contrast, our method selects top k = 9 candidates per pyramid level for each ground-truth. The lite version of our method has been merged to the official code of FCOS as the center sampling, which improves FCOS from 37.8% to 38.6% on AP as listed in Table 3. However, the hyperparameters of scale ranges still exist in the lite version.
无锚 FCOS。 所提出的方法也可以以两种不同的版本应用于无锚 FCOS:精简版和完整版。对于精简版,我们将所提出的 ATSS 的一些想法应用于 FCOS,即用我们方法中的方式替换其选择候选正样本的方式。FCOS 将目标框内的所有锚点都视为候选,这会导致大量低质量的正样本。相比之下,我们的方法为每个 ground-truth 在每个金字塔层级选择前 k = 9 个候选。我们方法的精简版已经作为中心采样(center sampling)合并到 FCOS 官方代码中,将 FCOS 的 AP 从 37.8% 提高到 38.6%,如表 3 所示。但是,精简版中仍然存在尺度范围这一超参数。
For the full version, we let the anchor point in FCOS become the anchor box with 8S scale to define positive and negative samples, then still regress these positive samples to objects from the anchor point like FCOS. As shown in Table 3, it significantly increases the performance by 1.4% for AP, by 1.7% for $AP_{50}$, by 1.7% for $AP_{75}$, by 0.6% for $AP_{S}$, by 1.3% for $AP_{M}$ and by 2.7% for $AP_{L}$. Notably, these two versions have the same candidates selected in the spatial dimension, but different ways to select final positives from candidates along the scale dimension. As listed in the last two rows of Table 3, the full version (ATSS) outperforms the lite version (center sampling) across different metrics by a large margin. These results indicate that the adaptive way in our method is better than the fixed way in FCOS to select positives from candidates along the scale dimension.
对于完整版,我们让 FCOS 中的锚点变为 8S 尺度的锚框来定义正负样本,然后仍然像 FCOS 一样从锚点将这些正样本回归到目标。如表 3 所示,它将 AP 显著提高了 1.4%,$AP_{50}$ 提高了 1.7%,$AP_{75}$ 提高了 1.7%,$AP_{S}$ 提高了 0.6%,$AP_{M}$ 提高了 1.3%,$AP_{L}$ 提高了 2.7%。值得注意的是,这两个版本在空间维度上选择了相同的候选框,但在尺度维度上从候选框中选择最终正样本的方式不同。如表 3 的最后两行所示,完整版 (ATSS) 在各项指标上都大幅优于精简版(中心采样)。这些结果表明,在沿尺度维度从候选框中选择正样本时,我们方法中的自适应方式优于 FCOS 中的固定方式。
4.3. 分析
Training an object detector with the proposed adaptive training sample selection only involves one hyperparameter k and one related setting of anchor boxes. This subsection analyzes them one after another.
使用所提出的自适应训练样本选择来训练目标检测器只涉及一个超参数 k 和一个相关的锚框设置。本小节逐一分析。
Hyperparameter k. We conduct several experiments to study the robustness of the hyperparameter k, which is used to select the candidate positive samples from each pyramid level. As shown in Table 4, different values of k in [3, 5, 7, 9, 11, 13, 15, 17, 19] are used to train the detector. We observe that the proposed method is quite insensitive to the variations of k from 7 to 17. Too large k (e.g., 19) will result in too many low-quality candidates that slightly decreases the performance. Too small k (e.g., 3) causes a noticeable drop in accuracy, because too few candidate positive samples will cause statistical instability. Overall, the only hyperparameter k is quite robust and the proposed ATSS can be nearly regarded as hyperparameter-free.
超参数 k。 我们进行了几次实验来研究超参数 k 的鲁棒性,该参数用于从每个金字塔级别中选择候选正样本。如表 4 所示,使用 [3, 5, 7, 9, 11, 13, 15, 17, 19] 中的不同 k 值来训练检测器。我们观察到,所提出的方法对 k 从 7 到 17 的变化非常不敏感。太大的 k(例如,19)会导致太多低质量的候选者,这会稍微降低性能。k 太小(例如 3)会导致准确率明显下降,因为候选正样本太少会导致统计不稳定。总体而言,唯一的超参数 k 非常稳健,并且所提出的 ATSS 几乎可以被视为无超参数。
Anchor Size. The introduced method resorts to the anchor boxes to define positives and we also study the effect of the anchor size. In the previous experiments, one square anchor with 8S (S indicates the total stride size of the pyramid level) is tiled per location. As shown in Table 5, we conduct some experiments with different scales of the square anchor in [5, 6, 7, 8, 9] and the performances are quite stable. Besides, several experiments with different aspect ratios of the 8S anchor box are performed as shown in Table 6. The performances are also insensitive to this variation. These results indicate that the proposed method is robust to different anchor settings.
锚框尺寸。 所提出的方法借助锚框来定义正样本,我们也研究了锚框尺寸的影响。在之前的实验中,每个位置平铺一个 8S 尺度的方形锚框(S 表示该金字塔层级的总步幅)。如表 5 所示,我们对 [5, 6, 7, 8, 9] 中不同尺度的方形锚框进行了一些实验,性能相当稳定。此外,如表 6 所示,还对 8S 锚框的不同纵横比进行了几次实验,性能对这种变化同样不敏感。这些结果表明,所提出的方法对不同的锚框设置具有鲁棒性。
4.4. 比较
We compare our final models on the MS COCO test-dev subset in Table 8 with other state-of-the-art object detectors. Following previous works [33, 56], the multiscale training strategy is adopted for these experiments, i.e., randomly selecting a scale between 640 to 800 to resize the shorter side of images during training. Besides, we double the total number of iterations to 180K and the learning rate reduction points to 120K and 160K correspondingly. Other settings are consistent with those mentioned before.
我们在表 8 中将我们的最终模型与其他最先进的目标检测器在 MS COCO test-dev 子集上进行比较。遵循之前的工作 [33, 56],这些实验采用了多尺度训练策略,即在训练期间随机选择 640 到 800 之间的尺度来调整图像短边的大小。此外,我们将总迭代次数翻倍至 180K,学习率下降点相应地调整为 120K 和 160K。其他设置与前面提到的一致。
As shown in Table 8, our method with ResNet-101 achieves 43.6% AP without any bells and whistles, which is better than all the methods with the same backbone including Cascade R-CNN [5] (42.8% AP), C-Mask RCNN [7] (42.0% AP), RetinaNet [33] (39.1% AP) and RefineDet [66] (36.4% AP). We can further improve the AP accuracy of the proposed method to 45.1% and 45.6% by using larger backbone networks ResNeXt-32x8d-101 and ResNeXt-64x4d-101 [63], respectively. The 45.6% AP result surpasses all the anchor-free and anchor-based detectors except only 0.1% lower than SNIP [54] (45.7% AP), which introduces the improved multi-scale training and testing strategy. Since our method is about the definition of positive and negative samples, it is compatible and complementary to most of current technologies. We further use the Deformable Convolutional Networks (DCN) [10] to the ResNet and ResNeXt backbones as well as the last layer of detector towers. DCN consistently improves the AP performances to 46.3% for ResNet-101, 47.7% for ResNeXt-32x8d-101 and 47.7% for ResNeXt-64x4d-101, respectively. The best result 47.7% is achieved with single-model and single-scale testing, outperforming all the previous detectors by a large margin. Finally, with the multi-scale testing strategy, our best model achieves 50.7% AP.
如表 8 所示,我们使用 ResNet-101 的方法在没有任何花里胡哨的情况下实现了 43.6% 的 AP,这优于所有具有相同主干的方法,包括 Cascade R-CNN [5] (42.8% AP)、C-Mask RCNN [7] (42.0% AP)、RetinaNet [33] (39.1% AP) 和 RefineDet [66] (36.4% AP)。通过使用更大的骨干网络 ResNeXt-32x8d-101 和 ResNeXt-64x4d-101 [63],我们可以进一步将所提出方法的 AP 准确率提高到 45.1% 和 45.6%。45.6% 的 AP 结果超过了所有无锚和基于锚的检测器,仅比引入了改进的多尺度训练和测试策略的 SNIP [54] (45.7% AP) 低 0.1%。由于我们的方法关注的是正负样本的定义,因此它与大多数现有技术兼容且互补。我们进一步将可变形卷积网络 (DCN) [10] 用于 ResNet 和 ResNeXt 主干以及检测头(detector towers)的最后一层。DCN 将 AP 性能持续提高到 ResNet-101 的 46.3%、ResNeXt-32x8d-101 的 47.7% 和 ResNeXt-64x4d-101 的 47.7%。最佳结果 47.7% 是在单模型、单尺度测试下获得的,大幅超过了之前的所有检测器。最后,通过多尺度测试策略,我们的最佳模型达到了 50.7% 的 AP。
4.5. 讨论
Previous experiments are based on RetinaNet with only one anchor per location. There is still a difference between anchor-based and anchor-free detectors that is not explored: the number of anchors tiled per location. Actually, the original RetinaNet tiles 9 anchors (3 scales × 3 aspect ratios) per location (marked as RetinaNet (#A=9)) that achieves 36.3% AP as listed in the first row of Table 7. In addition, those universal improvements in Table 1 can also be applied to RetinaNet (#A=9), boosting the AP performance from 36.3% to 38.4%. Without using the proposed ATSS, the improved RetinaNet (#A=9) has better performance than RetinaNet (#A=1), i.e., 38.4% in Table 7 vs. 37.0% in Table 1. These results indicate that under the traditional IoU-based sample selection strategy, tiling more anchor boxes per location is effective.
以前的实验基于每个位置只有一个锚框的 RetinaNet。anchor-based 检测器和 anchor-free 检测器之间仍有一个尚未探索的差异:每个位置平铺的锚框数量。实际上,原始 RetinaNet 每个位置平铺 9 个锚框(3 种尺度 × 3 种纵横比,标记为 RetinaNet (#A=9)),达到 36.3% AP,如表 7 第一行所列。此外,表 1 中的那些通用改进同样可以用于 RetinaNet (#A=9),将其 AP 性能从 36.3% 提升到 38.4%。在不使用所提出的 ATSS 的情况下,改进后的 RetinaNet (#A=9) 比 RetinaNet (#A=1) 性能更好,即表 7 中的 38.4% 对表 1 中的 37.0%。这些结果表明,在传统的基于 IoU 的样本选择策略下,每个位置平铺更多锚框是有效的。
However, after using our proposed method, the opposite conclusion will be drawn. To be specific, the proposed ATSS also improves RetinaNet (#A=9) by 0.8% on AP, 1.4% on AP50 and 1.1% on AP75, achieving similar performances to RetinaNet (#A=1) as listed in the third and sixth rows of Table 7. Besides, when we change the number of anchor scales or aspect ratios from 3 to 1, the results are almost unchanged as listed in the fourth and fifth rows of Table 7. In other words, as long as the positive samples are selected appropriately, no matter how many anchors are tiled at each location, the results are the same. We argue that tiling multiple anchors per location is a useless operation under our proposed method and it needs further study to discover its right role.
然而,在使用我们提出的方法后,会得出相反的结论。具体来说,所提出的 ATSS 也将 RetinaNet (#A=9) 的 AP 提高了 0.8%,AP50 提高了 1.4%,AP75 提高了 1.1%,达到了与 RetinaNet (#A=1) 相似的性能,如表 7 第三行和第六行所列。此外,当我们将锚框尺度或纵横比的数量从 3 改为 1 时,结果几乎没有变化,如表 7 第四行和第五行所列。换句话说,只要正样本选择得当,无论每个位置平铺多少个锚框,结果都是一样的。我们认为,在我们提出的方法下,每个位置平铺多个锚框是一种无用的操作,需要进一步研究以发现其真正的作用。
5. 结论
In this work, we point out that the essential difference between one-stage anchor-based and center-based anchor-free detectors is actually the definition of positive and negative training samples. It indicates that how to select positive and negative samples during object detection training is critical. Inspired by that, we delve into this basic issue and propose the adaptive training sample selection, which automatically divides positive and negative training samples according to statistical characteristics of object, hence bridging the gap between anchor-based and anchor-free detectors. We also discuss the necessity of tiling multiple anchors per location and show that it may not be a so useful operation under current situations. Extensive experiments on the challenging benchmarks MS COCO illustrate that the proposed method can achieve state-of-the-art performances without introducing any additional overhead.
在这项工作中,我们指出,一级anchor-based和center-based anchor-free检测器的本质区别实际上是正负训练样本的定义。这表明在目标检测训练过程中如何选择正样本和负样本是至关重要的。受此启发,我们深入研究了这个基本问题并提出了自适应训练样本选择,它根据目标的统计特征自动划分正负训练样本,从而弥合了基于锚点和无锚点检测器之间的差距。我们还讨论了每个位置平铺多个锚点的必要性,并表明在当前情况下它可能不是一个那么有用的操作。在具有挑战性的基准 MS COCO 上进行的大量实验表明,所提出的方法可以在不引入任何额外开销的情况下实现最先进的性能。