[CV-Paper 16] Object Detection 03: R-CNN-2014


Venue: CVPR 2014

Rich feature hierarchies for accurate object detection and semantic segmentation


Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.


1. Introduction

Features matter. The last decade of progress on various visual recognition tasks has been based considerably on the use of SIFT [29] and HOG [7]. But if we look at performance on the canonical visual recognition task, PASCAL VOC object detection [15], it is generally acknowledged that progress has been slow during 2010-2012, with small gains obtained by building ensemble systems and employing minor variants of successful methods.


SIFT and HOG are blockwise orientation histograms, a representation we could associate roughly with complex cells in V1, the first cortical area in the primate visual pathway. But we also know that recognition occurs several stages downstream, which suggests that there might be hierarchical, multi-stage processes for computing features that are even more informative for visual recognition.


Fukushima’s “neocognitron” [19], a biologically inspired hierarchical and shift-invariant model for pattern recognition, was an early attempt at just such a process. The neocognitron, however, lacked a supervised training algorithm. Building on Rumelhart et al. [33], LeCun et al. [26] showed that stochastic gradient descent via backpropagation was effective for training convolutional neural networks (CNNs), a class of models that extend the neocognitron.


CNNs saw heavy use in the 1990s (e.g., [27]), but then fell out of fashion with the rise of support vector machines. In 2012, Krizhevsky et al. [25] rekindled interest in CNNs by showing substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [9, 10]. Their success resulted from training a large CNN on 1.2 million labeled images, together with a few twists on LeCun’s CNN (e.g., max(x,0) rectifying non-linearities and “dropout” regularization).


The significance of the ImageNet result was vigorously debated during the ILSVRC 2012 workshop. The central issue can be distilled to the following: To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?


We answer this question by bridging the gap between image classification and object detection. This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared to systems based on simpler HOG-like features. To achieve this result, we focused on two problems: localizing objects with a deep network and training a high-capacity model with only a small quantity of annotated detection data.


Unlike image classification, detection requires localizing (likely many) objects within an image. One approach frames localization as a regression problem. However, work from Szegedy et al. [38], concurrent with our own, indicates that this strategy may not fare well in practice (they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by our method). An alternative is to build a sliding-window detector. CNNs have been used in this way for at least two decades, typically on constrained object categories, such as faces [32, 40] and pedestrians [35]. In order to maintain high spatial resolution, these CNNs typically only have two convolutional and pooling layers. We also considered adopting a sliding-window approach. However, units high up in our network, which has five convolutional layers, have very large receptive fields (195 × 195 pixels) and strides (32×32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.


Instead, we solve the CNN localization problem by operating within the “recognition using regions” paradigm [21], which has been successful for both object detection [39] and semantic segmentation [5]. At test time, our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. We use a simple technique (affine image warping) to compute a fixed-size CNN input from each region proposal, regardless of the region’s shape. Figure 1 presents an overview of our method and highlights some of our results. Since our system combines region proposals with CNNs, we dub the method R-CNN: Regions with CNN features.



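Figure 1: Object detection system overview. Our system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. For comparison, [39] reports 35.1% mAP using the same region proposals, but with a spatial pyramid and bag-of-visual-words approach. The popular deformable part models perform at 33.4%. On the 200-class ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement over OverFeat [34], which had the previous best result at 24.3%.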

In this updated version of this paper, we provide a head-to-head comparison of R-CNN and the recently proposed OverFeat [34] detection system by running R-CNN on the 200-class ILSVRC2013 detection dataset. OverFeat uses a sliding-window CNN for detection and until now was the best performing method on ILSVRC2013 detection. We show that R-CNN significantly outperforms OverFeat, with a mAP of 31.4% versus 24.3%.


A second challenge faced in detection is that labeled data is scarce and the amount currently available is insufficient for training a large CNN. The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning (e.g., [35]). The second principal contribution of this paper is to show that supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL), is an effective paradigm for learning high-capacity CNNs when data is scarce. In our experiments, fine-tuning for detection improves mAP performance by 8 percentage points. After fine-tuning, our system achieves a mAP of 54% on VOC 2010 compared to 33% for the highly-tuned, HOG-based deformable part model (DPM) [17, 20]. We also point readers to contemporaneous work by Donahue et al. [12], who show that Krizhevsky’s CNN can be used (without fine-tuning) as a black-box feature extractor, yielding excellent performance on several recognition tasks including scene classification, fine-grained sub-categorization, and domain adaptation.


Our system is also quite efficient. The only class-specific computations are a reasonably small matrix-vector product and greedy non-maximum suppression. This computational property follows from features that are shared across all categories and that are also two orders of magnitude lower-dimensional than previously used region features (cf. [39]).


Understanding the failure modes of our approach is also critical for improving it, and so we report results from the detection analysis tool of Hoiem et al. [23]. As an immediate consequence of this analysis, we demonstrate that a simple bounding-box regression method significantly reduces mislocalizations, which are the dominant error mode.


Before developing technical details, we note that because R-CNN operates on regions it is natural to extend it to the task of semantic segmentation. With minor modifications, we also achieve competitive results on the PASCAL VOC segmentation task, with an average segmentation accuracy of 47.9% on the VOC 2011 test set.


2. Object detection with R-CNN

Our object detection system consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class-specific linear SVMs. In this section, we present our design decisions for each module, describe their test-time usage, detail how their parameters are learned, and show detection results on PASCAL VOC 2010-12 and on ILSVRC2013.

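The three modules can be wired together roughly as follows. This is only a minimal sketch of the data flow, not the authors' implementation: `rcnn_scores`, `propose_regions` and `extract_features` are hypothetical names introduced here, standing in for the region proposal method, the CNN, and the per-class linear SVMs.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def rcnn_scores(
    image: object,
    propose_regions: Callable[[object], List[Box]],
    extract_features: Callable[[object, Box], Sequence[float]],
    svm_weights: Sequence[Sequence[float]],
    svm_biases: Sequence[float],
) -> List[Tuple[Box, List[float]]]:
    """Score every proposal against every class.

    All callables are hypothetical stand-ins: `propose_regions` plays the
    role of module 1, `extract_features` the role of module 2 (the CNN),
    and the weights/biases are module 3 (one linear SVM per class).
    """
    results = []
    for box in propose_regions(image):
        feat = extract_features(image, box)  # fixed-length feature vector
        scores = [
            sum(w_i * f_i for w_i, f_i in zip(w, feat)) + b
            for w, b in zip(svm_weights, svm_biases)
        ]
        results.append((box, scores))
    return results


# Toy usage with stub modules (two proposals, 2-d features, two classes).
toy = rcnn_scores(
    image=None,
    propose_regions=lambda img: [(0, 0, 10, 10), (5, 5, 20, 20)],
    extract_features=lambda img, box: [float(box[2] - box[0]), 1.0],
    svm_weights=[[1.0, 0.0], [0.0, -1.0]],
    svm_biases=[0.0, 0.5],
)
```

Only the last step depends on the class, which is why the feature extraction cost can be shared across all categories.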

2.1. Module design

Region proposals. A variety of recent papers offer methods for generating category-independent region proposals. Examples include: objectness [1], selective search [39], category-independent object proposals [14], constrained parametric min-cuts (CPMC) [5], multi-scale combinatorial grouping [3], and Cireşan et al. [6], who detect mitotic cells by applying a CNN to regularly-spaced square crops, which are a special case of region proposals. While R-CNN is agnostic to the particular region proposal method, we use selective search to enable a controlled comparison with prior detection work (e.g., [39, 41]).


Feature extraction. We extract a 4096-dimensional feature vector from each region proposal using the Caffe [24] implementation of the CNN described by Krizhevsky et al. [25]. Features are computed by forward propagating a mean-subtracted 227×227 RGB image through five convolutional layers and two fully connected layers. We refer readers to [24, 25] for more network architecture details.


In order to compute features for a region proposal, we must first convert the image data in that region into a form that is compatible with the CNN (its architecture requires inputs of a fixed 227 × 227 pixel size). Of the many possible transformations of our arbitrary-shaped regions, we opt for the simplest. Regardless of the size or aspect ratio of the candidate region, we warp all pixels in a tight bounding box around it to the required size. Prior to warping, we dilate the tight bounding box so that at the warped size there are exactly p pixels of warped image context around the original box (we use p = 16). Figure 2 shows a random sampling of warped training regions. Alternatives to warping are discussed in Appendix A.

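The context padding can be worked out with a little arithmetic. If the warped crop is 227 pixels wide and p = 16 of those pixels on each side are context, the original box must land on the central 227 - 2p = 195 pixels. The sketch below computes the dilated box in image coordinates under that reading; `dilate_box` is a hypothetical helper, and clipping at the image border is deliberately left out.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels


def dilate_box(box: Box, p: int = 16, warped_size: int = 227) -> Box:
    """Grow a tight box so that, after warping to warped_size x warped_size,
    exactly p warped pixels of context surround the original box."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    inner = warped_size - 2 * p   # warped pixels left for the original box
    pad_x = p * w / inner         # padding measured in image pixels
    pad_y = p * h / inner
    return (x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y)


# A 195 x 195 box needs no rescaling, so the padding is exactly p pixels.
square = dilate_box((100.0, 100.0, 295.0, 295.0))
# A box twice as wide gets twice the horizontal padding before warping.
wide = dilate_box((0.0, 0.0, 390.0, 195.0))
```

Because width and height are padded independently, the subsequent warp to 227 × 227 is anisotropic, which matches the "regardless of the size or aspect ratio" wording above.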

2.2. Test-time detection

At test time, we run selective search on the test image to extract around 2000 region proposals (we use selective search’s “fast mode” in all experiments). We warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, we score each extracted feature vector using the SVM trained for that class. Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher scoring selected region larger than a learned threshold.
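Greedy non-maximum suppression as described above can be sketched in a few lines. The text only says the IoU threshold is learned, so the default of 0.3 below is a placeholder assumption, and `iou` and `greedy_nms` are names chosen here for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float],
               threshold: float = 0.3) -> List[int]:
    """Return indices of kept boxes for ONE class, highest score first.

    `threshold` is a placeholder value; the paper learns it.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Reject i if it overlaps an already selected, higher scoring box.
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = greedy_nms(boxes, scores=[0.9, 0.8, 0.7])
```

In the toy call the second box overlaps the first with IoU 81/119, so it is suppressed while the distant third box survives. The function would be run once per class on that class's scores.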


Run-time analysis. Two properties make detection efficient. First, all CNN parameters are shared across all categories. Second, the feature vectors computed by the CNN are low-dimensional when compared to other common approaches, such as spatial pyramids with bag-of-visual-word encodings. The features used in the UVA detection system [39], for example, are two orders of magnitude larger than ours (360k vs. 4k-dimensional).


The result of such sharing is that the time spent computing region proposals and features (13s/image on a GPU or 53s/image on a CPU) is amortized over all classes. The only class-specific computations are dot products between features and SVM weights and non-maximum suppression. In practice, all dot products for an image are batched into a single matrix-matrix product. The feature matrix is typically 2000×4096 and the SVM weight matrix is 4096×N, where N is the number of classes.


This analysis shows that R-CNN can scale to thousands of object classes without resorting to approximate techniques, such as hashing. Even if there were 100k classes, the resulting matrix multiplication takes only 10 seconds on a modern multi-core CPU. This efficiency is not merely the result of using region proposals and shared features. The UVA system, due to its high-dimensional features, would be two orders of magnitude slower while requiring 134GB of memory just to store 100k linear predictors, compared to just 1.5GB for our lower-dimensional features.
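The memory figures quoted above can be reproduced with simple arithmetic, assuming each predictor weight is stored as a 4-byte float and "GB" means 2^30 bytes (both assumptions, since the text does not spell them out). The feature sizes of 4096 and 360k come from the run-time analysis earlier in this section.

```python
BYTES_PER_WEIGHT = 4   # assumption: single-precision floats
GIB = 1024 ** 3        # assumption: "GB" in the text means 2**30 bytes


def predictor_memory_gib(num_classes: int, feature_dim: int) -> float:
    """Memory needed to store one linear predictor per class."""
    return num_classes * feature_dim * BYTES_PER_WEIGHT / GIB


rcnn_gib = predictor_memory_gib(100_000, 4096)      # R-CNN features
uva_gib = predictor_memory_gib(100_000, 360_000)    # UVA features

# Multiply-accumulates for scoring 2000 proposals against 100k classes.
macs = 2000 * 4096 * 100_000
```

Under these assumptions the two values come out at roughly 1.5 and 134, consistent with the numbers in the paragraph above.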


It is also interesting to contrast R-CNN with the recent work from Dean et al. on scalable detection using DPMs and hashing [8]. They report a mAP of around 16% on VOC 2007 at a run-time of 5 minutes per image when introducing 10k distractor classes. With our approach, 10k detectors can run in about a minute on a CPU, and because no approximations are made mAP would remain at 59% (Section 3.2).


2.3. Training

Supervised pre-training. We discriminatively pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding-box labels are not available for this data). Pre-training was performed using the open source Caffe CNN library [24]. In brief, our CNN nearly matches the performance of Krizhevsky et al. [25], obtaining a top-1 error rate 2.2 percentage points higher on the ILSVRC2012 classification validation set. This discrepancy is due to simplifications in the training process.


Domain-specific fine-tuning. To adapt our CNN to the new task (detection) and the new domain (warped proposal windows), we continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals. Aside from replacing the CNN’s ImageNet-specific 1000-way classification layer with a randomly initialized (N + 1)-way classification layer (where N is the number of object classes, plus 1 for background), the CNN architecture is unchanged. For VOC, N = 20 and for ILSVRC2013, N = 200. We treat all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives for that box’s class and the rest as negatives. We start SGD at a learning rate of 0.001 (1/10th of the initial pre-training rate), which allows fine-tuning to make progress while not clobbering the initialization. In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128. We bias the sampling towards positive windows because they are extremely rare compared to background.

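The labeling rule and the 32 + 96 mini-batch construction can be sketched as below. The function names are hypothetical, sampling without replacement is an assumption (the text only says "uniformly sample"), and class index 0 is used here for background.

```python
import random
from typing import List, Optional, Sequence


def finetune_label(max_iou: float, gt_class: int) -> int:
    """(N + 1)-way label: the class (1..N) of the best-overlapping
    ground-truth box if IoU >= 0.5, otherwise 0 for background."""
    return gt_class if max_iou >= 0.5 else 0


def sample_minibatch(positives: Sequence, backgrounds: Sequence,
                     n_pos: int = 32, n_bg: int = 96,
                     rng: Optional[random.Random] = None) -> List:
    """Draw 32 positive and 96 background windows (128 in total).

    Assumes each pool holds at least as many windows as requested.
    """
    if rng is None:
        rng = random.Random()
    return rng.sample(list(positives), n_pos) + rng.sample(list(backgrounds), n_bg)


# Toy pools: integers stand in for warped proposal windows.
batch = sample_minibatch(range(100), range(1000, 5000), rng=random.Random(0))
```

Fixing the positive share at a quarter of every batch is what biases training towards the rare positive windows.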

Object category classifiers. Consider training a binary classifier to detect cars. It’s clear that an image region tightly enclosing a car should be a positive example. Similarly, it’s clear that a background region, which has nothing to do with cars, should be a negative example. Less clear is how to label a region that partially overlaps a car. We resolve this issue with an IoU overlap threshold, below which regions are defined as negatives. The overlap threshold, 0.3, was selected by a grid search over {0, 0.1, ..., 0.5} on a validation set. We found that selecting this threshold carefully is important. Setting it to 0.5, as in [39], decreased mAP by 5 points. Similarly, setting it to 0 decreased mAP by 4 points. Positive examples are defined simply to be the ground-truth bounding boxes for each class.

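For one class, the SVM labeling rule can be written as a small function. The text defines positives (ground-truth boxes) and negatives (IoU below 0.3); treating every other proposal as ignored is an assumption made here, and `svm_training_label` is a hypothetical name.

```python
from typing import Optional


def svm_training_label(is_ground_truth_box: bool,
                       max_iou_with_class: float,
                       neg_threshold: float = 0.3) -> Optional[int]:
    """Label a region for ONE class's binary SVM.

    +1   : the region is a ground-truth box of this class
    -1   : the region overlaps every instance of this class by < neg_threshold
    None : neither (assumed to be left out of SVM training)
    """
    if is_ground_truth_box:
        return 1
    if max_iou_with_class < neg_threshold:
        return -1
    return None


labels = [
    svm_training_label(True, 1.0),    # a ground-truth car box
    svm_training_label(False, 0.1),   # mostly background
    svm_training_label(False, 0.45),  # partial overlap with a car
]
```

Note the contrast with fine-tuning, where a proposal with IoU of 0.5 or more counts as a positive; the paper defers the reasons for this difference to its Appendix B.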

Once features are extracted and training labels are applied, we optimize one linear SVM per class. Since the training data is too large to fit in memory, we adopt the standard hard negative mining method [17, 37]. Hard negative mining converges quickly and in practice mAP stops increasing after only a single pass over all images.


In Appendix B we discuss why the positive and negative examples are defined differently in fine-tuning versus SVM training. We also discuss the trade-offs involved in training detection SVMs rather than simply using the outputs from the final softmax layer of the fine-tuned CNN.


2.4. Results on PASCAL VOC 2010-12

Following the PASCAL VOC best practices [15], we validated all design decisions and hyperparameters on the VOC 2007 dataset (Section 3.2). For final results on the VOC 2010-12 datasets, we fine-tuned the CNN on VOC 2012 train and optimized our detection SVMs on VOC 2012 trainval. We submitted test results to the evaluation server only once for each of the two major algorithm variants (with and without bounding-box regression).


Table 1 shows complete results on VOC 2010. We compare our method against four strong baselines, including SegDPM [18], which combines DPM detectors with the output of a semantic segmentation system [4] and uses additional inter-detector context and image-classifier rescoring. The most germane comparison is to the UVA system from Uijlings et al. [39], since our systems use the same region proposal algorithm. To classify regions, their method builds a four-level spatial pyramid and populates it with densely sampled SIFT, Extended OpponentSIFT, and RGBSIFT descriptors, each vector quantized with 4000-word codebooks. Classification is performed with a histogram intersection kernel SVM. Compared to their multi-feature, non-linear kernel SVM approach, we achieve a large improvement in mAP, from 35.1% to 53.7% mAP, while also being much faster (Section 2.2). Our method achieves similar performance (53.3% mAP) on VOC 2011/12 test.


2.5. Results on ILSVRC2013 detection

We ran R-CNN on the 200-class ILSVRC2013 detection dataset using the same system hyperparameters that we used for PASCAL VOC. We followed the same protocol of submitting test results to the ILSVRC2013 evaluation server only twice, once with and once without bounding-box regression.


Figure 3 compares R-CNN to the entries in the ILSVRC 2013 competition and to the post-competition OverFeat result [34]. R-CNN achieves a mAP of 31.4%, which is significantly ahead of the second-best result of 24.3% from OverFeat. To give a sense of the AP distribution over classes, box plots are also presented and a table of per-class APs follows at the end of the paper in Table 8. Most of the competing submissions (OverFeat, NEC-MU, UvA-Euvision, Toronto A, and UIUC-IFP) used convolutional neural networks, indicating that there is significant nuance in how CNNs can be applied to object detection, leading to greatly varying outcomes.


In Section 4, we give an overview of the ILSVRC2013 detection dataset and provide details about choices that we made when running R-CNN on it.


3. Visualization, ablation, and modes of error

3.1. Visualizing learned features

First-layer filters can be visualized directly and are easy to understand [25]. They capture oriented edges and opponent colors. Understanding the subsequent layers is more challenging. Zeiler and Fergus present a visually attractive deconvolutional approach in [42]. We propose a simple (and complementary) non-parametric method that directly shows what the network learned.


The idea is to single out a particular unit (feature) in the network and use it as if it were an object detector in its own right. That is, we compute the unit’s activations on a large set of held-out region proposals (about 10 million), sort the proposals from highest to lowest activation, perform non-maximum suppression, and then display the top-scoring regions. Our method lets the selected unit “speak for itself” by showing exactly which inputs it fires on. We avoid averaging in order to see different visual modes and gain insight into the invariances computed by the unit.

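Table 1: Detection average precision (%) on VOC 2010 test. R-CNN is most directly comparable to UVA and Regionlets since all methods use selective search region proposals. Bounding-box regression (BB) is described in Section C. At publication time, SegDPM was the top performer on the PASCAL VOC leaderboard. †DPM and SegDPM use context rescoring not used by the other methods.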

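Figure 3: (Left) Mean average precision on the ILSVRC2013 detection test set. Methods preceded by * use outside training data (images and labels from the ILSVRC classification dataset in all cases). (Right) Box plots for the 200 average precision values per method. A box plot for the post-competition OverFeat result is not shown because per-class APs are not yet available (per-class APs for R-CNN are in Table 8 and also included in the tech report source uploaded to arXiv.org; see R-CNN-ILSVRC2013-APs.txt). The red line marks the median AP, and the box bottom and top are the 25th and 75th percentiles. The whiskers extend to the min and max AP of each method. Each AP is plotted as a green dot over the whiskers (best viewed digitally with zoom).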


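Table 2: Detection average precision (%) on VOC 2007 test. Rows 1-3 show R-CNN performance without fine-tuning. Rows 4-6 show results for the CNN pre-trained on ILSVRC 2012 and then fine-tuned (FT) on VOC 2007. Row 7 includes a simple bounding-box regression (BB) stage that reduces localization errors (Section C). Rows 8-10 present DPM methods as a strong baseline. The first uses only HOG, while the next two use different feature learning approaches to augment or replace HOG.
Table 3: Detection average precision (%) on VOC 2007 test for two different CNN architectures. The first two rows are results from Table 2 using Krizhevsky et al.’s architecture (T-Net). Rows three and four use the recently proposed 16-layer architecture from Simonyan and Zisserman (O-Net) [43].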

We visualize units from layer pool5, which is the max-pooled output of the network’s fifth and final convolutional layer. The pool5 feature map is 6 × 6 × 256 = 9216-dimensional. Ignoring boundary effects, each pool5 unit has a receptive field of 195×195 pixels in the original 227×227 pixel input. A central unit has a nearly global view, while one near the edge has a smaller, clipped support.

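The 6 × 6 map size, the 195-pixel receptive field and the 32-pixel stride can all be derived from the layer hyperparameters. The kernel, stride and padding values below are not given in this article; they are assumed from the Caffe reference implementation of the Krizhevsky et al. network, so treat this as a sanity check rather than a specification.

```python
# (name, kernel, stride, padding): assumed values, taken from the Caffe
# reference version of the Krizhevsky et al. architecture.
LAYERS = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]


def trace(input_size: int = 227):
    """Return (output size, receptive field, stride) at pool5, in pixels."""
    size, rf, jump = input_size, 1, 1
    for _name, k, s, p in LAYERS:
        size = (size + 2 * p - k) // s + 1  # spatial size after this layer
        rf += (k - 1) * jump                # receptive field grows by (k-1)*jump
        jump *= s                           # distance between adjacent units
    return size, rf, jump


pool5_size, pool5_rf, pool5_stride = trace()
pool5_dim = pool5_size * pool5_size * 256   # 256 channels at pool5
```

With these assumed hyperparameters the trace gives a 6 × 6 map, a 195-pixel receptive field and a stride of 32 pixels, matching the figures quoted in the text.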

Each row in Figure 4 displays the top 16 activations for a pool5 unit from a CNN that we fine-tuned on VOC 2007 trainval. Six of the 256 functionally unique units are visualized (Appendix D includes more). These units were selected to show a representative sample of what the network learns. In the second row, we see a unit that fires on dog faces and dot arrays. The unit corresponding to the third row is a red blob detector. There are also detectors for human faces and more abstract patterns such as text and triangular structures with windows. The network appears to learn a representation that combines a small number of class-tuned features together with a distributed representation of shape, texture, color, and material properties. The subsequent fully connected layer fc6 has the ability to model a large set of compositions of these rich features.


3.2. Ablation studies

Performance layer-by-layer, without fine-tuning. To understand which layers are critical for detection performance, we analyzed results on the VOC 2007 dataset for each of the CNN’s last three layers. Layer pool5 was briefly described in Section 3.1. The final two layers are summarized below.


Layer fc6 is fully connected to pool5. To compute features, it multiplies a 4096×9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases. This intermediate vector is component-wise half-wave rectified (x ← max(0, x)).


Layer fc7 is the final layer of the network. It is implemented by multiplying the features computed by fc6 by a 4096 × 4096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification.
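Both layers compute the same operation, a matrix-vector product plus bias followed by half-wave rectification, and their parameter counts follow directly from the matrix shapes given above. A minimal pure-Python sketch (the function names are illustrative, and a real implementation would use a BLAS-backed library):

```python
from typing import List, Sequence


def fc_relu(weights: Sequence[Sequence[float]], biases: Sequence[float],
            x: Sequence[float]) -> List[float]:
    """y = max(0, W x + b), applied component-wise."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def fc_param_count(n_out: int, n_in: int) -> int:
    """Number of weights plus biases in a fully connected layer."""
    return n_out * n_in + n_out


FC6_PARAMS = fc_param_count(4096, 9216)   # 4096 x 9216 weights + 4096 biases
FC7_PARAMS = fc_param_count(4096, 4096)   # 4096 x 4096 weights + 4096 biases

# Tiny example: two output units, two inputs.
y = fc_relu([[1.0, 1.0], [-1.0, 0.0]], [0.5, 0.0], [2.0, 3.0])
```

The fc7 count comes to roughly 16.8 million parameters, which is the figure the ablation discussion refers to when it talks about removing that layer.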


We start by looking at results from the CNN without fine-tuning on PASCAL, i.e. all CNN parameters were pre-trained on ILSVRC 2012 only. Analyzing performance layer-by-layer (Table 2 rows 1-3) reveals that features from fc7 generalize worse than features from fc6. This means that 29%, or about 16.8 million, of the CNN’s parameters can be removed without degrading mAP. More surprising is that removing both fc7 and fc6 produces quite good results even though pool5 features are computed using only 6% of the CNN’s parameters. Much of the CNN’s representational power comes from its convolutional layers, rather than from the much larger densely connected layers. This finding suggests potential utility in computing a dense feature map, in the sense of HOG, of an arbitrary-sized image by using only the convolutional layers of the CNN. This representation would enable experimentation with sliding-window detectors, including DPM, on top of pool5 features.


Performance layer-by-layer, with fine-tuning. We now look at results from our CNN after having fine-tuned its parameters on VOC 2007 trainval. The improvement is striking (Table 2 rows 4-6): fine-tuning increases mAP by 8.0 percentage points to 54.2%. The boost from fine-tuning is much larger for fc6 and fc7 than for pool5, which suggests that the pool5 features learned from ImageNet are general and that most of the improvement is gained from learning domain-specific non-linear classifiers on top of them.


Comparison to recent feature learning methods. Relatively few feature learning methods have been tried on PASCAL VOC detection. We look at two recent approaches that build on deformable part models. For reference, we also include results for the standard HOG-based DPM [20].


The first DPM feature learning method, DPM ST [28], augments HOG features with histograms of “sketch token” probabilities. Intuitively, a sketch token is a tight distribution of contours passing through the center of an image patch. Sketch token probabilities are computed at each pixel by a random forest that was trained to classify 35×35 pixel patches into one of 150 sketch tokens or background.


The second method, DPM HSC [31], replaces HOG with histograms of sparse codes (HSC). To compute an HSC, sparse code activations are solved for at each pixel using a learned dictionary of 100 7 × 7 pixel (grayscale) atoms. The resulting activations are rectified in three ways (full and both half-waves), spatially pooled, unit ℓ2 normalized, and then power transformed (x ← sign(x)|x|^α).

第二种方法是DPM HSC [31],用稀疏码(HSC)的直方图代替HOG。为了计算HSC,使用学习的100 7×7像素(灰度)原子字典,为每个像素求解稀疏代码激活。产生的激活通过三种方式(全波和半波)进行校正,在空间上合并,对单位’2进行归一化,然后进行幂变换(x←sign(x)| x |α)。
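The post-processing steps named above (three-way rectification, unit L2 normalization, signed power transform) can be sketched roughly as follows. This is only an illustration on plain Python lists: the function names are hypothetical, spatial pooling is omitted, and the exponent `alpha=0.5` is an assumed placeholder because the text does not give its value.

```python
import math


def rectify_three_ways(activations):
    """Return the full-wave and the two half-wave rectifications of a vector."""
    full = [abs(a) for a in activations]            # full-wave: |a|
    positive = [max(a, 0.0) for a in activations]   # half-wave: max(a, 0)
    negative = [max(-a, 0.0) for a in activations]  # half-wave: max(-a, 0)
    return full, positive, negative


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to (approximately) unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / (norm + eps) for x in vector]


def power_transform(vector, alpha=0.5):
    """Apply x <- sign(x) * |x|**alpha element-wise (alpha is an assumed value)."""
    return [math.copysign(abs(x) ** alpha, x) for x in vector]


full, pos, neg = rectify_three_ways([-1.0, 2.0, -3.0])
features = power_transform(l2_normalize(full + pos + neg))
```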

3.3. Network architectures

Most results in this paper use the network architecture from Krizhevsky et al. [25]. However, we have found that the choice of architecture has a large effect on R-CNN detection performance. In Table 3 we show results on VOC 2007 test using the 16-layer deep network recently proposed by Simonyan and Zisserman [43]. This network was one of the top performers in the recent ILSVRC 2014 classification challenge. The network has a homogeneous structure consisting of 13 layers of 3 × 3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers. We refer to this network as “O-Net” for OxfordNet and the baseline as “T-Net” for TorontoNet.

本文的大多数结果使用Krizhevsky等人[25]的网络结构。但是,我们发现架构的选择对R-CNN检测性能有很大影响。在表3中,我们显示了Simonyan和Zisserman最近提出的使用16层深度网络进行VOC 2007测试的结果[43]。在最近的ILSVRC 2014分类挑战中,该网络是表现最好的网络之一。该网络具有一个均匀的结构,该结构由13层3×3的卷积核组成,其中散布了五个最大池化层,并在其上放置了三个完全连接的层。对于牛津网络,我们将此网络称为“ O-Net”;对于多伦多网络,我们将其基准称为“ T-Net”。

To use O-Net in R-CNN, we downloaded the publicly available pre-trained network weights for the VGG_ILSVRC_16_layers model from the Caffe Model Zoo. We then fine-tuned the network using the same protocol as we used for T-Net. The only difference was to use smaller minibatches (24 examples) as required in order to fit within GPU memory. The results in Table 3 show that R-CNN with O-Net substantially outperforms R-CNN with T-Net, increasing mAP from 58.5% to 66.0%. However, there is a considerable drawback in terms of compute time, with the forward pass of O-Net taking roughly 7 times longer than T-Net.

为了在R-CNN中使用O-Net,我们从Caffe Model Zoo下载了VGG ILSVRC 16层模型的公开可用的预训练网络权重。1然后,我们使用与T-CNN相同的协议对网络进行了微调净。唯一的区别是根据需要使用较小的微型批处理(24个示例),以适合GPU内存。表3中的结果表明,带有O-Net的RCNN明显优于带有TNet的R-CNN,将mAP从58.5%提高到66.0%。但是,在计算时间方面存在很大的缺陷,O-Net的前向传递比T-Net花费大约7倍的时间。

3.4. Detection error analysis

We applied the excellent detection analysis tool from Hoiem et al. [23] in order to reveal our method’s error modes, understand how fine-tuning changes them, and to see how our error types compare with DPM. A full summary of the analysis tool is beyond the scope of this paper and we encourage readers to consult [23] to understand some finer details (such as “normalized AP”). Since the analysis is best absorbed in the context of the associated plots, we present the discussion within the captions of Figure 5 and Figure 6.

我们应用了Hoiem等人[23]的出色的检测分析工具。 为了揭示我们方法的错误模式,了解微调如何更改它们,并查看我们的错误类型与DPM的比较。分析工具的完整摘要超出了本文的范围,我们鼓励读者参考[23]以了解一些更详细的信息(例如“规范化AP”)。由于分析最好在相关图的上下文中进行,因此我们在图5和图6的标题内进行了讨论。

3.5. Bounding-box regression

Based on the error analysis, we implemented a simple method to reduce localization errors. Inspired by the bounding-box regression employed in DPM [17], we train a linear regression model to predict a new detection window given the pool5 features for a selective search region proposal. Full details are given in Appendix C. Results in Table 1, Table 2, and Figure 5 show that this simple approach fixes a large number of mislocalized detections, boosting mAP by 3 to 4 points.

基于误差分析,我们实现了一种减少定位误差的简单方法。受DPM [17]中使用的包围盒回归的启发,我们针对给定的选择性搜索区域建议的pool5特征,训练了线性回归模型来预测新的检测窗口。附录C中提供了完整的详细信息。表1,表2和图5中的结果表明,这种简单的方法可以修复大量的错误定位检测,从而将mAP提高3-4点。

3.6. Qualitative results

Qualitative detection results on ILSVRC2013 are presented in Figure 8 and Figure 9 at the end of the paper. Each image was sampled randomly from the val2 set and all detections from all detectors with a precision greater than 0.5 are shown. Note that these are not curated and give a realistic impression of the detectors in action. More qualitative results are presented in Figure 10 and Figure 11, but these have been curated. We selected each image because it contained interesting, surprising, or amusing results. Here, also, all detections at precision greater than 0.5 are shown.


4. The ILSVRC2013 detection dataset

In Section 2 we presented results on the ILSVRC2013 detection dataset. This dataset is less homogeneous than PASCAL VOC, requiring choices about how to use it. Since these decisions are non-trivial, we cover them in this section.

在第2节中,我们介绍了ILSVRC2013检测数据集上的结果。该数据集的同质性不如PASCAL VOC,需要选择使用方式。由于这些决定是不平凡的,因此我们将在本节中介绍它们。
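Figure 6: Sensitivity to object characteristics. Each plot shows the mean (over classes) normalized AP (see [23]) for the highest and lowest performing subsets within six different object characteristics (occlusion, truncation, bounding-box area, aspect ratio, viewpoint, part visibility). We show plots for our method (R-CNN) with and without fine-tuning (FT) and bounding-box regression (BB), as well as for DPM voc-release5. Overall, fine-tuning does not reduce sensitivity (the difference between max and min), but it does substantially improve both the highest and lowest performing subsets for nearly all characteristics. This indicates that fine-tuning does more than simply improve the lowest performing subsets for aspect ratio and bounding-box area, as one might conjecture based on how we warp network inputs. Instead, fine-tuning improves robustness for all characteristics, including occlusion, truncation, viewpoint, and part visibility.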

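Figure 5: Distribution of top-ranked false positive (FP) types. Each plot shows the evolving distribution of FP types as more FPs are considered in order of decreasing score. Each FP is categorized into 1 of 4 types: Loc, poor localization (a detection with an IoU overlap with the correct class between 0.1 and 0.5, or a duplicate); Sim, confusion with a similar category; Oth, confusion with a dissimilar object category; BG, a FP that fired on background. Compared with DPM (see [23]), significantly more of our errors result from poor localization rather than from confusion with background or other object classes, indicating that the CNN features are much more discriminative than HOG. Loose localization likely results from our use of bottom-up region proposals and from the positional invariance learned when pre-training the CNN for whole-image classification. Column three shows how our simple bounding-box regression method fixes many localization errors.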

4.1. Dataset overview

The ILSVRC2013 detection dataset is split into three sets: train (395,918), val (20,121), and test (40,152), where the number of images in each set is in parentheses. The val and test splits are drawn from the same image distribution. These images are scene-like and similar in complexity (number of objects, amount of clutter, pose variability, etc.) to PASCAL VOC images. The val and test splits are exhaustively annotated, meaning that in each image all instances from all 200 classes are labeled with bounding boxes. The train set, in contrast, is drawn from the ILSVRC2013 classification image distribution. These images have more variable complexity with a skew towards images of a single centered object. Unlike val and test, the train images (due to their large number) are not exhaustively annotated. In any given train image, instances from the 200 classes may or may not be labeled. In addition to these image sets, each class has an extra set of negative images. Negative images are manually checked to validate that they do not contain any instances of their associated class. The negative image sets were not used in this work. More information on how ILSVRC was collected and annotated can be found in [11, 36].

ILSVRC2013检测数据集分为三组:训练(395,918),val(20,121)和测试(40,152),其中每组中的图像数都用括号括起来。 val和测试分割来自相同的图像分布。这些图像类似于场景,并且在复杂性(对象数量,杂波数量,姿势可变性等)方面与PASCAL VOC图像相似。 val和测试拆分均进行了详尽的注释,这意味着在每个图像中,所有200类的所有实例都用边界框标记。相反,列车组是从ILSVRC2013分类图像分布中得出的。这些图像具有更大的可变复杂性,并且偏向单个居中对象的图像。与val和test不同,训练集图像(由于数量众多)没有详尽标注。在任何给定的训练集图像中,可以标记200个类别的实例,也可以不标记。除了这些图像集外,每个类别还具有一组负图像。手动检查负片图像以确认它们不包含其关联类的任何实例。负像集未在这项工作中使用。有关如何收集和注释ILSVRC的更多信息,请参见[11,36]。

The nature of these splits presents a number of choices for training R-CNN. The train images cannot be used for hard negative mining, because annotations are not exhaustive. Where should negative examples come from? Also, the train images have different statistics than val and test. Should the train images be used at all, and if so, to what extent? While we have not thoroughly evaluated a large number of choices, we present what seemed like the most obvious path based on previous experience.


Our general strategy is to rely heavily on the val set and use some of the train images as an auxiliary source of positive examples. To use val for both training and validation, we split it into roughly equally sized “val1” and “val2” sets. Since some classes have very few examples in val (the smallest has only 31 and half have fewer than 110), it is important to produce an approximately class-balanced partition. To do this, a large number of candidate splits were generated and the one with the smallest maximum relative class imbalance was selected. Each candidate split was generated by clustering val images using their class counts as features, followed by a randomized local search that may improve the split balance. The particular split used here has a maximum relative imbalance of about 11% and a median relative imbalance of 4%. The val1/val2 split and code used to produce them will be publicly available to allow other researchers to compare their methods on the val splits used in this report.

我们的总体策略是严重依赖val集,并使用一些训练图像作为正例的辅助来源。为了将val用于训练和验证,我们将其分为大小大致相等的“ val1”和“ val2”集。由于某些类的val样本很少(最小的样本只有31个,而一半的样本少于110个),因此产生近似类平衡的分区非常重要。为此,生成了大量候选分割,并选择了最大的相对8类不平衡最小的分割。2通过将val图像以其类计数为特征进行聚类,生成每个候选分割,然后进行随机局部搜索改善分配平衡。此处使用的特定拆分的最大相对不平衡约为11%,中值相对不平衡为4%。 val1 / val2 划分和用于生成它们的代码将公开可用,以使其他研究人员可以比较本报告中使用的val 划分的方法。
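The split-selection idea can be sketched as below. This is a simplified illustration, not the authors' released code: it starts each candidate from a random equal-sized split instead of clustering images by class counts, and it assumes relative imbalance for a class is |a − b| / (a + b), where a and b are that class's counts in the two halves. All function names are hypothetical.

```python
import random


def max_relative_imbalance(counts, in_val1):
    """counts[i][c] = number of class-c objects in image i; in_val1[i] = split flag."""
    worst = 0.0
    for c in range(len(counts[0])):
        a = sum(img[c] for img, flag in zip(counts, in_val1) if flag)
        b = sum(img[c] for img, flag in zip(counts, in_val1) if not flag)
        if a + b > 0:
            # Assumed definition of relative imbalance for one class.
            worst = max(worst, abs(a - b) / (a + b))
    return worst


def local_search_split(counts, steps=2000, seed=0):
    """Randomized local search: swap image pairs across halves, keep non-worsening swaps."""
    rng = random.Random(seed)
    n = len(counts)
    in_val1 = [i % 2 == 0 for i in range(n)]  # equal-sized halves
    rng.shuffle(in_val1)                      # random start (clustering step skipped)
    best = max_relative_imbalance(counts, in_val1)
    for _ in range(steps):
        i, j = rng.randrange(n), rng.randrange(n)
        if in_val1[i] == in_val1[j]:
            continue
        in_val1[i], in_val1[j] = in_val1[j], in_val1[i]      # swap keeps sizes fixed
        score = max_relative_imbalance(counts, in_val1)
        if score <= best:
            best = score
        else:
            in_val1[i], in_val1[j] = in_val1[j], in_val1[i]  # revert a worsening swap
    return in_val1, best


def best_of_candidates(counts, num_candidates=5):
    """Generate several candidate splits and keep the most balanced one."""
    return min((local_search_split(counts, seed=s) for s in range(num_candidates)),
               key=lambda result: result[1])


toy_counts = [[1, 0], [1, 0], [0, 1], [0, 1]]
split, imbalance = best_of_candidates(toy_counts)
```

Recomputing the imbalance from scratch after every swap is wasteful for 200 classes and roughly 20k images; an incremental update of per-class counts would be the natural optimization.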

4.2. Region proposals

We followed the same region proposal approach that was used for detection on PASCAL. Selective search [39] was run in “fast mode” on each image in val1, val2, and test (but not on images in train). One minor modification was required to deal with the fact that selective search is not scale invariant and so the number of regions produced depends on the image resolution. ILSVRC image sizes range from very small to a few that are several mega-pixels, and so we resized each image to a fixed width (500 pixels) before running selective search. On val, selective search resulted in an average of 2403 region proposals per image with a 91.6% recall of all ground-truth bounding boxes (at 0.5 IoU threshold). This recall is notably lower than in PASCAL, where it is approximately 98%, indicating significant room for improvement in the region proposal stage.

我们遵循了用于PASCAL检测的相同区域建议方法。选择性搜索[39]在“快速模式”下对val1,val2和test中的每个图像进行了运行(但不在训练集的图像上)。需要进行一次较小的修改来处理选择性搜索不是尺度不变的事实,因此产生的区域数量取决于图像分辨率。 ILSVRC图像的大小从很小到几百万像素不等,因此在运行选择性搜索之前,我们将每个图像调整为固定宽度(500像素)。 val上,选择性搜索平均每幅图像产生2403个区域建议,所有标注框的召回率达到91.6%(阈值为0.5 IoU)。召回率明显低于PASCAL,后者约为98%,表明在区域建议阶段仍有很大的改进空间。
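As a rough sketch of the quantities in this paragraph, the following computes IoU between two boxes, proposal recall at a 0.5 IoU threshold, and the scale factor for resizing to a fixed width. The assumptions are mine: boxes are `(x1, y1, x2, y2)` in continuous coordinates (no "+1" pixel convention), and the resize preserves aspect ratio, which the text does not state explicitly. Function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall(gt_boxes, proposals, thresh=0.5):
    """Fraction of ground-truth boxes covered by at least one proposal."""
    if not gt_boxes:
        return 0.0
    hits = sum(1 for g in gt_boxes
               if any(iou(g, p) >= thresh for p in proposals))
    return hits / len(gt_boxes)


def resize_to_width(width, height, target_width=500):
    """New (width, height, scale) after resizing to a fixed width (aspect ratio kept)."""
    scale = target_width / width
    return target_width, round(height * scale), scale


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
props = [(0, 0, 10, 5)]
print(recall(gt, props))  # one of the two ground-truth boxes is covered
```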

4.3. Training data

For training data, we formed a set of images and boxes that includes all selective search and ground-truth boxes from val1 together with up to N ground-truth boxes per class from train (if a class has fewer than N ground-truth boxes in train, then we take all of them). We’ll call this dataset of images and boxes val1+trainN. In an ablation study, we show mAP on val2 for N ∈ {0, 500, 1000} (Section 4.5).

对于训练数据,我们形成了一组图像和框,其中包括val1中的所有选择性搜索框和标注框,以及训练集上每个类最多N个标注框(如果一个类的训练集中少于N个标注框,那么我们将它们全部接受)。我们将此图像和框的数据集称为val1 + trainN。在消融研究中,我们在val2上针对N∈{0,500,1000}(第4.5节)显示了mAP。
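A minimal sketch of how a val1+trainN box set could be assembled is shown below. The record layout and the function name are hypothetical, and choosing the N boxes by random sampling is an assumption, since the text does not say how the N boxes per class are picked.

```python
import random
from collections import defaultdict


def build_val1_plus_train_n(val1_boxes, train_gt_boxes, n, seed=0):
    """val1_boxes: all val1 box records (always kept).
    train_gt_boxes: list of (class_name, box_record) ground-truth boxes from train.
    Returns the val1 boxes plus up to n train ground-truth boxes per class."""
    by_class = defaultdict(list)
    for cls, box in train_gt_boxes:
        by_class[cls].append(box)

    rng = random.Random(seed)
    selected = []
    for cls in sorted(by_class):
        boxes = by_class[cls]
        if len(boxes) > n:
            boxes = rng.sample(boxes, n)  # assumed selection rule: random subset
        # Classes with fewer than n boxes contribute all of them.
        selected.extend((cls, box) for box in boxes)
    return list(val1_boxes) + selected


train = [("dog", i) for i in range(5)] + [("cat", 0)]
dataset = build_val1_plus_train_n([("val1", "box")], train, n=2)
```

With `n=0` the function returns only the val1 boxes, which corresponds to the N = 0 setting of the ablation.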

Training data is required for three procedures in R-CNN: (1) CNN fine-tuning, (2) detector SVM training, and (3) bounding-box regressor training. CNN fine-tuning was run for 50k SGD iterations on val1+trainN using the exact same settings as were used for PASCAL. Fine-tuning on a single NVIDIA Tesla K20 took 13 hours using Caffe. For SVM training, all ground-truth boxes from val1+trainN were used as positive examples for their respective classes. Hard negative mining was performed on a randomly selected subset of 5000 images from val1. An initial experiment indicated that mining negatives from this 5000-image subset (roughly half of val1), rather than from all of val1, resulted in only a 0.5 percentage point drop in mAP while cutting SVM training time in half. No negative examples were taken from train because the annotations are not exhaustive. The extra sets of verified negative images were not used. The bounding-box regressors were trained on val1.

R-CNN中的三个过程需要训练数据:(1)CNN微调,(2)检测器SVM训练和(3)边界框回归器训练。使用与PASCAL完全相同的设置,在val1 + trainN上对CNN微调进行了50k SGD迭代。使用Caffe对单个NVIDIA Tesla K20进行微调需要13个小时。对于SVM训练,来自val1 + trainN的所有标注框均用作其各自类别的正例。对来自val1的5000张图像的随机选择子集进行了难负例挖掘。最初的实验表明,从val1的全部中提取负例,而不是5000个图像子集(大约占一半),导致mAP下降仅0.5个百分点,同时将SVM训练时间缩短了一半。由于标注不够详尽,因此没有从训练集上提取任何负例。未使用额外的经过验证的负例图像。边界框回归器在val1上进行了训练。

4.4. Validation and evaluation

Before submitting results to the evaluation server, we validated data usage choices and the effect of fine-tuning and bounding-box regression on the val2 set using the training data described above. All system hyperparameters (e.g., SVM C hyperparameters, padding used in region warping, NMS thresholds, bounding-box regression hyperparameters) were fixed at the same values used for PASCAL. Undoubtedly some of these hyperparameter choices are slightly suboptimal for ILSVRC; however, the goal of this work was to produce a preliminary R-CNN result on ILSVRC without extensive dataset tuning. After selecting the best choices on val2, we submitted exactly two result files to the ILSVRC2013 evaluation server. The first submission was without bounding-box regression and the second submission was with bounding-box regression. For these submissions, we expanded the SVM and bounding-box regressor training sets to use val+train1k and val, respectively. We used the CNN that was fine-tuned on val1+train1k to avoid re-running fine-tuning and feature computation.

在将结果提交给评估服务器之前,我们使用上述训练数据验证了数据使用选择以及val2set的微调和边界框回归的影响。所有系统超参数(例如SVM C超参数,区域变形中使用的填充,NMS阈值,边界框回归超参数)都固定为用于PASCAL的相同值。无疑,对于ILSVRC,这些超参数选择中的某些选择次优,但是,这项工作的目标是在ILSVRC上产生初步的R-CNN结果,而无需进行大量的数据集调整。在val2上选择最佳选择之后,我们将两个结果文件恰好提交给ILSVRC2013评估服务器。第一个提交没有边界框回归,第二个提交有边界框回归。对于这些提交,我们将SVM和boundingbox回归器训练集扩展为分别使用val + train1k和val。我们使用在val1 + train1k上进行了微调的CNN来避免重新运行微调和特征计算。

4.5. Ablation study

Table 4 shows an ablation study of the effects of different amounts of training data, fine-tuning, and bounding-box regression. A first observation is that mAP on val2 matches mAP on test very closely. This gives us confidence that mAP on val2 is a good indicator of test set performance. The first result, 20.9%, is what R-CNN achieves using a CNN pre-trained on the ILSVRC2012 classification dataset (no fine-tuning) and given access to the small amount of training data in val1 (recall that half of the classes in val1 have between 15 and 55 examples). Expanding the training set to val1+trainN improves performance to 24.1%, with essentially no difference between N = 500 and N = 1000. Fine-tuning the CNN using examples from just val1 gives a modest improvement to 26.5%; however, there is likely significant overfitting due to the small number of positive training examples. Expanding the fine-tuning set to val1+train1k, which adds up to 1000 positive examples per class from the train set, helps significantly, boosting mAP to 29.7%. Bounding-box regression improves results to 31.0%, which is a smaller relative gain than what was observed in PASCAL.

表4显示了对不同数量的训练数据,微调和边界框回归的影响的消融研究。第一个观察结果是val2上的mAP与测试中的mAP非常接近。这使我们相信val2上的mAP是测试集性能的良好指标。第一个结果为20.9%,是R-CNN使用在ILSVRC2012分类数据集上进行预训练的CNN所获得的结果(无微调),并且可以访问val1中的少量训练数据(回想一下, val1有15到55个示例)。将训练集扩展到val1 + trainN可以将性能提高到24.1%,而N = 500和N = 1000之间基本上没有差异。使用仅val1的示例对CNN进行微调,可以将其适度提高到26.5%,但是由于过度拟合,可能会产生重大影响少数正例。将微调设置扩展到val1 + train1k,可以从训练集中为每个类增加多达1000个正例,这将显着帮助mAP提升至29.7%。边界框回归将结果提高到31.0%,相对增益比在PASCAL中观察到的要小。

4.6. Relationship to OverFeat

There is an interesting relationship between R-CNN and OverFeat: OverFeat can be seen (roughly) as a special case of R-CNN. If one were to replace selective search region proposals with a multi-scale pyramid of regular square regions and change the per-class bounding-box regressors to a single bounding-box regressor, then the systems would be very similar (modulo some potentially significant differences in how they are trained: CNN detection fine-tuning, using SVMs, etc.). It is worth noting that OverFeat has a significant speed advantage over R-CNN: it is about 9x faster, based on a figure of 2 seconds per image quoted from [34]. This speed comes from the fact that OverFeat’s sliding windows (i.e., region proposals) are not warped at the image level and therefore computation can be easily shared between overlapping windows. Sharing is implemented by running the entire network in a convolutional fashion over arbitrary-sized inputs. Speeding up R-CNN should be possible in a variety of ways and remains as future work.


5. Semantic segmentation

Region classification is a standard technique for semantic segmentation, allowing us to easily apply R-CNN to the PASCAL VOC segmentation challenge. To facilitate a direct comparison with the current leading semantic segmentation system (called O2P for “second-order pooling”) [4], we work within their open source framework. O2P uses CPMC to generate 150 region proposals per image and then predicts the quality of each region, for each class, using support vector regression (SVR). The high performance of their approach is due to the quality of the CPMC regions and the powerful second-order pooling of multiple feature types (enriched variants of SIFT and LBP). We also note that Farabet et al. [16] recently demonstrated good results on several dense scene labeling datasets (not including PASCAL) using a CNN as a multi-scale per-pixel classifier.

区域分类是语义分割的一种标准技术,使我们能够轻松地将R-CNN应用于PASCAL VOC分割挑战。为了促进与当前领先的语义分割系统(称为“二阶合并”的O2P)[4]的直接比较,我们在其开源框架内开展工作。 O2P使用CPMC为每个图像生成150个区域建议,然后使用支持向量回归(SVR)预测每个类别的每个区域的质量。其方法的高性能归因于CPMC区域的质量以及多种要素类型(SIFT和LBP的丰富变体)的强大二阶合并。我们还注意到,Farabet等[16]最近在使用CNN作为多尺度每像素分类器的几个密集场景标记数据集(不包括PASCAL)上展示了良好的结果。

We follow [2, 4] and extend the PASCAL segmentation training set to include the extra annotations made available by Hariharan et al. [22]. Design decisions and hyperparameters were cross-validated on the VOC 2011 validation set. Final test results were evaluated only once.

我们遵循[2,4]并扩展了PASCAL分段训练集,以包括Hariharan等人提供的额外注释。 [22]。设计决策和超参数在VOC 2011验证集中进行了交叉验证。最终测试结果仅评估一次。

CNN features for segmentation. We evaluate three strategies for computing features on CPMC regions, all of which begin by warping the rectangular window around the region to 227 × 227. The first strategy (full) ignores the region’s shape and computes CNN features directly on the warped window, exactly as we did for detection. However, these features ignore the non-rectangular shape of the region. Two regions might have very similar bounding boxes while having very little overlap. Therefore, the second strategy (fg) computes CNN features only on a region’s foreground mask. We replace the background with the mean input so that background regions are zero after mean subtraction. The third strategy (full+fg) simply concatenates the full and fg features; our experiments validate their complementarity.

CNN的细分功能。我们评估了三种在CPMC区域上计算特征的策略,所有这些策略均始于将区域周围的矩形窗口变形为227×227。第一个策略(完全)忽略区域的形状,并直接在变形窗口上计算CNN特征,具体如下我们做了检测。但是,这些特征忽略了该区域的非矩形形状。两个区域可能具有非常相似的边界框,而几乎没有重叠。因此,第二种策略(fg)仅在区域的前景蒙版上计算CNN特征。我们用均值输入替换背景,以便均值减后背景区域为零。第三种策略(full + fg)简单地将full和fg功能串联起来;我们的实验证明了它们的互补性。
Table 5: Segmentation mean accuracy (%) on VOC 2011 validation. Column 1 presents O2P; columns 2-7 use our CNN pre-trained on ILSVRC 2012.
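The fg strategy's background handling can be illustrated with a small sketch: background pixels are set to the mean input so that they become zero once the mean is subtracted. For brevity this assumes a single-channel window stored as nested lists and a scalar mean; the real input would be a warped 227 × 227 color window with a per-pixel or per-channel mean. Function names are hypothetical.

```python
def mask_background_with_mean(window, fg_mask, mean_value):
    """Replace every background pixel (fg_mask False) with the mean input value."""
    return [[px if fg else mean_value for px, fg in zip(row, mask_row)]
            for row, mask_row in zip(window, fg_mask)]


def subtract_mean(window, mean_value):
    """Standard mean subtraction applied before the CNN forward pass."""
    return [[px - mean_value for px in row] for row in window]


window = [[10, 20], [30, 40]]
fg_mask = [[True, False], [False, True]]
mean_value = 25

fg_input = subtract_mean(mask_background_with_mean(window, fg_mask, mean_value),
                         mean_value)
# Background positions are now exactly zero; foreground keeps its centered value.
```

Under the same simplification, the full+fg strategy would concatenate the features computed from the unmasked window with those computed from `fg_input`.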

Results on VOC 2011. Table 5 shows a summary of our results on the VOC 2011 validation set compared with O2P. (See Appendix E for complete per-category results.) Within each feature computation strategy, layer fc6 always outperforms fc7 and the following discussion refers to the fc6 features. The fg strategy slightly outperforms full, indicating that the masked region shape provides a stronger signal, matching our intuition. However, full+fg achieves an average accuracy of 47.9%, our best result by a margin of 4.2% (also modestly outperforming O2P), indicating that the context provided by the full features is highly informative even given the fg features. Notably, training the 20 SVRs on our full+fg features takes an hour on a single core, compared to 10+ hours for training on O2P features.

VOC 2011的结果。表5汇总了我们与V2P相比在VOC 2011验证集中的结果。 (有关完整的每个类别的结果,请参阅附录E。)在每种特征计算策略中,fc6层始终胜过fc7,下面的讨论针对fc6功能。 fg策略略胜于完整策略,表明被遮罩的区域形状提供了更强的信号,与我们的直觉相符。但是,full + fg的平均准确度达到47.9%,我们的最佳结果为4.2%(也略微胜过O2P),这表明即使使用fg功能,完整功能所提供的上下文也非常有用。值得注意的是,在我们的full + fg功能上训练20个SVR在单个内核上花费一个小时,而在O2P功能上训练则需要10多个小时。

In Table 6 we present results on the VOC 2011 test set, comparing our best-performing method, fc6 (full+fg), against two strong baselines. Our method achieves the highest segmentation accuracy for 11 out of 21 categories, and the highest overall segmentation accuracy of 47.9%, averaged across categories (but likely ties with the O2P result under any reasonable margin of error). Still better performance could likely be achieved by fine-tuning.

在表6中,我们介绍了VOC 2011测试集的结果,将我们表现最佳的方法fc6(full + fg)与两个强基准进行了比较。我们的方法在21个类别中的11个类别中实现了最高的分割精度,在各个类别之间平均达到了最高的整体细分精度47.9%(但在任何合理的误差范围内,可能与O2P结果相关)。通过微调仍可能获得更好的性能。


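Table 6: Segmentation accuracy (%) on VOC 2011 test. We compare against two strong baselines: the “Regions and Parts” (R&P) method of [2] and the second-order pooling (O2P) method of [4]. Without any fine-tuning, our CNN achieves top segmentation performance, outperforming R&P and roughly matching O2P.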

6. Conclusion

In recent years, object detection performance had stagnated. The best performing systems were complex ensembles combining multiple low-level image features with high-level context from object detectors and scene classifiers. This paper presents a simple and scalable object detection algorithm that gives a 30% relative improvement over the best previous results on PASCAL VOC 2012.

近年来,物体检测性能停滞不前。表现最好的系统是复杂的集合体,将来自对象检测器和场景分类器的多个低层图像特征与高层上下文结合在一起。本文提出了一种简单且可扩展的对象检测算法,与PASCAL VOC 2012上的最佳以往结果相比,相对改进了30%。

We achieved this performance through two insights. The first is to apply high-capacity convolutional neural networks to bottom-up region proposals in order to localize and segment objects. The second is a paradigm for training large CNNs when labeled training data is scarce. We show that it is highly effective to pre-train the network, with supervision, for an auxiliary task with abundant data (image classification) and then to fine-tune the network for the target task where data is scarce (detection). We conjecture that the “supervised pre-training/domain-specific fine-tuning” paradigm will be highly effective for a variety of data-scarce vision problems.


We conclude by noting that it is significant that we achieved these results by using a combination of classical tools from computer vision and deep learning (bottom-up region proposals and convolutional neural networks). Rather than opposing lines of scientific inquiry, the two are natural and inevitable partners.



A. Object proposal transformations

The convolutional neural network used in this work requires a fixed-size input of 227 × 227 pixels. For detection, we consider object proposals that are arbitrary image rectangles. We evaluated two approaches for transforming object proposals into valid CNN inputs.

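Figure 7: Different object proposal transformations. (A) the original object proposal at its actual scale relative to the transformed CNN inputs; (B) tightest square with context; (C) tightest square without context; (D) warp. Within each column and example proposal, the top row corresponds to p = 0 pixels of context padding while the bottom row has p = 16 pixels of context padding.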

The first method (“tightest square with context”) encloses each object proposal inside the tightest square and then scales (isotropically) the image contained in that square to the CNN input size. Figure 7 column (B) shows this transformation. A variant on this method (“tightest square without context”) excludes the image content that surrounds the original object proposal. Figure 7 column (C) shows this transformation. The second method (“warp”) anisotropically scales each object proposal to the CNN input size. Figure 7 column (D) shows the warp transformation.


For each of these transformations, we also consider including additional image context around the original object proposal. The amount of context padding (p) is defined as a border size around the original object proposal in the transformed input coordinate frame. Figure 7 shows p = 0 pixels in the top row of each example and p = 16 pixels in the bottom row. In all methods, if the source rectangle extends beyond the image, the missing data is replaced with the image mean (which is then subtracted before inputting the image into the CNN). A pilot set of experiments showed that warping with context padding (p = 16 pixels) outperformed the alternatives by a large margin (3-5 mAP points). Obviously more alternatives are possible, including using replication instead of mean padding. Exhaustive evaluation of these alternatives is left as future work.

对于这些转换中的每一个,我们还考虑在原始对象建议周围包括其他图像上下文。上下文填充量(p)定义为转换后的输入坐标系中原始对象建议周围的边框大小。图7在每个示例的顶行显示p = 0像素,在底行显示p = 16像素。在所有方法中,如果源矩形超出图像,则将丢失的数据替换为图像均值(然后将其减去,然后再将图像输入到CNN中)。一组试验性实验显示,使用上下文填充(p = 16像素)的变形在很大程度上优于替代方法(3-5 mAP点)。显然,更多替代方法是可行的,包括使用复制代替均值填充。这些替代方案的详尽评估留待以后的工作。
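Because p is measured in the transformed (227 × 227) frame, the proposal itself occupies only the inner 227 − 2p pixels after warping. One way to read this, offered here as an interpretation rather than the authors' exact code, is that the source rectangle in the original image must be enlarged by p · w / (227 − 2p) horizontally and p · h / (227 − 2p) vertically. The helper names below are hypothetical.

```python
def padded_source_rect(box, p=16, input_size=227):
    """Expand an (x1, y1, x2, y2) proposal so that, after anisotropic warping to
    input_size x input_size, the original box is surrounded by a p-pixel border."""
    x1, y1, x2, y2 = box
    inner = input_size - 2 * p          # pixels the proposal occupies after warping
    pad_x = p * (x2 - x1) / inner       # horizontal border in source-image pixels
    pad_y = p * (y2 - y1) / inner       # vertical border in source-image pixels
    return (x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y)


def clip_to_image(rect, width, height):
    """Part of the source rectangle that lies inside the image; anything outside
    it would be filled with the image mean before warping."""
    x1, y1, x2, y2 = rect
    return (max(0.0, x1), max(0.0, y1), min(float(width), x2), min(float(height), y2))


rect = padded_source_rect((100, 100, 295, 295), p=16)
visible = clip_to_image(rect, width=300, height=300)
```

For a 195 × 195 proposal the scale factor is exactly 1, so the padded rectangle is 227 × 227 and the border is 16 source pixels on each side.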

B. Positive vs. negative examples and softmax

Two design choices warrant further discussion. The first is: Why are positive and negative examples defined differently for fine-tuning the CNN versus training the object detection SVMs? To review the definitions briefly, for fine-tuning we map each object proposal to the ground-truth instance with which it has maximum IoU overlap (if any) and label it as a positive for the matched ground-truth class if the IoU is at least 0.5. All other proposals are labeled “background” (i.e., negative examples for all classes). For training SVMs, in contrast, we take only the ground-truth boxes as positive examples for their respective classes and label proposals with less than 0.3 IoU overlap with all instances of a class as a negative for that class. Proposals that fall into the grey zone (more than 0.3 IoU overlap, but are not ground truth) are ignored.

有两种设计选择值得进一步讨论。第一个是:为什么在微调CNN和训练对象检测SVM方面对正例和负例定义不同?为了简短地回顾定义,为了进行微调,我们将每个对象建议映射到最大IoU重叠(如果有)的地面实例,并在IoU至少为0.5时将其标记为匹配地面实例的肯定对象。所有其他建议都标记为“背景”(即,所有类别的否定示例)。相比之下,在训练SVM时,我们仅将真相框作为其各自类别的正面示例,并将与该类别的所有实例重叠的IoU小于0.3 IU的标签建议视为该类别的负面内容。落入灰色区域(重叠超过0.3 IoU,但不是基本事实)的建议将被忽略。
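The two labeling rules can be placed side by side in a short sketch. It assumes boxes in `(x1, y1, x2, y2)` form with continuous coordinates and a caller-supplied flag saying whether a box is itself a ground-truth box of the class; function names and the treatment of an overlap of exactly 0.3 (ignored here) are my own choices.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def finetune_label(proposal, ground_truth):
    """Fine-tuning rule: class of the max-IoU ground-truth box if IoU >= 0.5,
    otherwise 'background'. ground_truth is a list of (class_name, box)."""
    best_cls, best_iou = "background", 0.0
    for cls, box in ground_truth:
        overlap = iou(proposal, box)
        if overlap > best_iou:
            best_cls, best_iou = cls, overlap
    return best_cls if best_iou >= 0.5 else "background"


def svm_label(proposal, ground_truth, cls, is_gt_of_cls=False):
    """SVM rule for one class: +1 only for ground-truth boxes of that class,
    -1 if IoU < 0.3 with every instance of the class, None (ignored) otherwise."""
    if is_gt_of_cls:
        return 1
    overlaps = [iou(proposal, box) for c, box in ground_truth if c == cls]
    if all(o < 0.3 for o in overlaps):
        return -1
    return None  # grey zone: neither positive nor negative


gt = [("dog", (0, 0, 10, 10))]
print(finetune_label((0, 0, 10, 6), gt), svm_label((0, 0, 10, 6), gt, "dog"))
```

The example proposal overlaps the dog box with IoU 0.6, so it is a fine-tuning positive but is ignored when training the dog SVM.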

Historically speaking, we arrived at these definitions because we started by training SVMs on features computed by the ImageNet pre-trained CNN, and so fine-tuning was not a consideration at that point in time. In that setup, we found that our particular label definition for training SVMs was optimal within the set of options we evaluated (which included the setting we now use for fine-tuning). When we started using fine-tuning, we initially used the same positive and negative example definition as we were using for SVM training. However, we found that results were much worse than those obtained using our current definition of positives and negatives.


Our hypothesis is that this difference in how positives and negatives are defined is not fundamentally important and arises from the fact that fine-tuning data is limited. Our current scheme introduces many “jittered” examples (those proposals with overlap between 0.5 and 1, but not ground truth), which expands the number of positive examples by approximately 30x. We conjecture that this large set is needed when fine-tuning the entire network to avoid overfitting. However, we also note that using these jittered examples is likely suboptimal because the network is not being fine-tuned for precise localization.


This leads to the second issue: Why, after fine-tuning, train SVMs at all? It would be cleaner to simply apply the last layer of the fine-tuned network, which is a 21-way softmax regression classifier, as the object detector. We tried this and found that performance on VOC 2007 dropped from 54.2% to 50.9% mAP. This performance drop likely arises from a combination of several factors including that the definition of positive examples used in fine-tuning does not emphasize precise localization and the softmax classifier was trained on randomly sampled negative examples rather than on the subset of “hard negatives” used for SVM training.

这导致了第二个问题:为什么在进行微调之后才训练SVM?只需应用微调网络的最后一层(即21向softmax回归分类器)作为对象检测器,会更清洁。我们对此进行了尝试,发现VOC 2007的性能从54.2%下降到50.9%。这种性能下降可能是由多种因素共同导致的,其中包括微调中使用的阳性示例的定义不强调精确的定位,并且softmax分类器是针对随机采样的阴性示例而非使用的“硬性阴性”子集进行训练的用于SVM培训。

This result shows that it’s possible to obtain close to the same level of performance without training SVMs after fine-tuning. We conjecture that with some additional tweaks to fine-tuning the remaining performance gap may be closed. If true, this would simplify and speed up R-CNN training with no loss in detection performance.


C. Bounding-box regression

We use a simple bounding-box regression stage to improve localization performance. After scoring each selective search proposal with a class-specific detection SVM, we predict a new bounding box for the detection using a class-specific bounding-box regressor. This is similar in spirit to the bounding-box regression used in deformable part models [17]. The primary difference between the two approaches is that here we regress from features computed by the CNN, rather than from geometric features computed on the inferred DPM part locations.


The input to our training algorithm is a set of N training pairs $\{(P^i, G^i)\}_{i=1,\ldots,N}$, where $P^i = (P^i_x, P^i_y, P^i_w, P^i_h)$ specifies the pixel coordinates of the center of proposal $P^i$’s bounding box together with $P^i$’s width and height in pixels. Henceforth, we drop the superscript $i$ unless it is needed. Each ground-truth bounding box G is specified in the same way: $G = (G_x, G_y, G_w, G_h)$. Our goal is to learn a transformation that maps a proposed box P to a ground-truth box G.

我们的训练算法的输入是一组N个训练对 ( P i , G i ) i = 1 , . . . , N {(P^i, G^i)}_{i=1,...,N} (Pi,Gi)i=1,...,N,其中 P i = ( P x i , P y i , P w i , P h i ) Pi= (P^i_x, P^i_y, P^i_w, P^i_h) Pi=(Pxi,Pyi,Pwi,Phi)指定像素坐标Pi边界框的中心位置以及Pi的宽度和高度(以像素为单位)。因此,除非需要,否则我们将删除上标i。以相同的方式指定每个标注框G: G = ( G x , G y , G w , G h ) G = (G_x, G_y, G_w, G_h) G=(Gx,Gy,Gw,Gh)。我们的目标是学习将建议的盒子P映射到真实的盒子G的变换。

We parameterize the transformation in terms of four functions $d_x(P)$, $d_y(P)$, $d_w(P)$, and $d_h(P)$. The first two specify a scale-invariant translation of the center of P’s bounding box, while the second two specify log-space translations of the width and height of P’s bounding box. After learning these functions, we can transform an input proposal P into a predicted ground-truth box $\hat{G}$ by applying the transformation

$$
\begin{aligned}
\hat{G}_x &= P_w d_x(P) + P_x \\
\hat{G}_y &= P_h d_y(P) + P_y \\
\hat{G}_w &= P_w \exp(d_w(P)) \\
\hat{G}_h &= P_h \exp(d_h(P)).
\end{aligned}
$$


Each function $d_*(P)$ (where $*$ is one of x, y, h, w) is modeled as a linear function of the pool5 features of proposal P, denoted by $\phi_5(P)$. (The dependence of $\phi_5(P)$ on the image data is implicitly assumed.) Thus we have $d_*(P) = w_*^T \phi_5(P)$, where $w_*$ is a vector of learnable model parameters. We learn $w_*$ by optimizing the regularized least squares objective (ridge regression):

$$
w_* = \underset{\hat{w}_*}{\operatorname{argmin}} \sum_{i=1}^{N} \left( t_*^i - \hat{w}_*^T \phi_5(P^i) \right)^2 + \lambda \left\| \hat{w}_* \right\|^2.
$$

将每个函数 d ∗ ( P ) d_*(P) d(P)(其中*是x,y,h,w中的一个)建模为提案P的库特征的线性函数,用φ5(P)表示。 (隐式假设φ5(P)对图像数据的依赖性。)因此,我们有 d ∗ ( P ) = w ∗ T φ 5 ( P ) d_*(P)=w^T_*φ_5(P) d(P)=wTφ5(P),其中w≥是可学习的模型参数的向量。我们通过优化正则化最小二乘目标(岭回归)来学习w:
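The regression targets $t_*$ for the training pair (P, G) are defined as

$$
\begin{aligned}
t_x &= (G_x - P_x) / P_w \\
t_y &= (G_y - P_y) / P_h \\
t_w &= \log(G_w / P_w) \\
t_h &= \log(G_h / P_h).
\end{aligned}
$$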

As a standard regularized least squares problem, this can be solved efficiently in closed form.
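For reference, the closed form can be written out explicitly. The stacked notation is introduced here for illustration only: $\Phi$ is assumed to be the $N \times D$ matrix whose $i$-th row is the pool5 feature vector $\phi_5(P^i)^T$, and $\mathbf{t}_*$ the length-$N$ vector of targets $t_*^i$. Minimizing the sum of squared residuals plus $\lambda$ times the squared norm of the weights, and setting the gradient to zero, gives:

```latex
\begin{aligned}
J(\hat{w}_*) &= \left\| \mathbf{t}_* - \Phi \hat{w}_* \right\|^2 + \lambda \left\| \hat{w}_* \right\|^2 \\
\nabla J(\hat{w}_*) &= -2\,\Phi^T \left( \mathbf{t}_* - \Phi \hat{w}_* \right) + 2 \lambda \hat{w}_* = 0 \\
\left( \Phi^T \Phi + \lambda I \right) w_* &= \Phi^T \mathbf{t}_* \\
w_* &= \left( \Phi^T \Phi + \lambda I \right)^{-1} \Phi^T \mathbf{t}_*
\end{aligned}
```

For $\lambda > 0$ the matrix $\Phi^T \Phi + \lambda I$ is positive definite, so the inverse always exists; in practice one would solve the linear system rather than form the inverse.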


We found two subtle issues while implementing bounding-box regression. The first is that regularization is important: we set λ = 1000 based on a validation set. The second issue is that care must be taken when selecting which training pairs (P, G) to use. Intuitively, if P is far from all ground-truth boxes, then the task of transforming P to a ground-truth box G does not make sense. Using examples like P would lead to a hopeless learning problem. Therefore, we only learn from a proposal P if it is nearby at least one ground-truth box. We implement “nearness” by assigning P to the ground-truth box G with which it has maximum IoU overlap (in case it overlaps more than one) if and only if the overlap is greater than a threshold (which we set to 0.6 using a validation set). All unassigned proposals are discarded. We do this once for each object class in order to learn a set of class-specific bounding-box regressors.

在执行边界框回归时,我们发现了两个细微的问题。首先,正则化很重要:我们根据验证集设置λ= 1000。第二个问题是在选择要使用的训练对(P,G)时必须小心。直观地讲,如果P远离所有标注框,那么将P转换为标注框G的任务就没有意义了。使用类似P的示例将导致无望的学习问题。因此,我们仅从建议P中获悉,如果建议P至少在一个标注框附近。当且仅当重叠量大于阈值(我们使用a将其设置为0.6)时,我们才通过将P分配给最大IoU重叠量(如果重叠量大于1)的标注框G来实现“接近”。验证集)。所有未分配的投标都将被丢弃。我们为每个对象类执行一次此操作,以了解一组特定于类的包围盒回归器。
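The pair-selection rule and the regression parameterization can be sketched together as follows. The sketch assumes the usual R-CNN parameterization (center offsets scaled by proposal size, log-space width and height ratios), uses corner-format boxes for IoU and `(cx, cy, w, h)` boxes for regression, and leaves out the feature extraction and ridge fitting. All function names are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def select_training_pairs(proposals, gt_boxes, thresh=0.6):
    """Pair each proposal with its max-IoU ground-truth box of one class,
    keeping the pair only if the overlap exceeds thresh."""
    pairs = []
    for p in proposals:
        best = max(gt_boxes, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) > thresh:
            pairs.append((p, best))
    return pairs  # unassigned proposals are simply dropped


def regression_targets(P, G):
    """Targets (t_x, t_y, t_w, t_h) for boxes given as (cx, cy, w, h)."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def apply_transform(P, d):
    """Map proposal P through predicted offsets d = (d_x, d_y, d_w, d_h)."""
    px, py, pw, ph = P
    dx, dy, dw, dh = d
    return (pw * dx + px, ph * dy + py, pw * math.exp(dw), ph * math.exp(dh))


P = (50.0, 50.0, 20.0, 40.0)
G = (55.0, 60.0, 40.0, 20.0)
recovered = apply_transform(P, regression_targets(P, G))  # approximately equals G
```

Applying the transform to the exact targets recovers the ground-truth box, which is a quick sanity check that the two sets of formulas are inverses of each other.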

At test time, we score each proposal and predict its new detection window only once. In principle, we could iterate this procedure (i.e., re-score the newly predicted bounding box, and then predict a new bounding box from it, and so on). However, we found that iterating does not improve results.


D. Additional feature visualizations

Figure 12 shows additional visualizations for 20 pool5 units. For each unit, we show the 24 region proposals that maximally activate that unit out of the full set of approximately 10 million regions in all of VOC 2007 test. We label each unit by its (y, x, channel) position in the 6 × 6 × 256 dimensional pool5 feature map. Within each channel, the CNN computes exactly the same function of the input region, with the (y, x) position changing only the receptive field.

E. Per-category segmentation results

In Table 7 we show the per-category segmentation accuracy on VOC 2011 val for each of our six segmentation methods in addition to the O2P method [4]. These results show which methods are strongest across each of the 20 PASCAL classes, plus the background class.

F. Analysis of cross-dataset redundancy

One concern when training on an auxiliary dataset is that there might be redundancy between it and the test set. Even though the tasks of object detection and whole-image classification are substantially different, making such cross-set redundancy much less worrisome, we still conducted a thorough investigation that quantifies the extent to which PASCAL test images are contained within the ILSVRC 2012 training and validation sets. Our findings may be useful to researchers who are interested in using ILSVRC 2012 as training data for the PASCAL image classification task.

We performed two checks for duplicate (and near-duplicate) images. The first test is based on exact matches of flickr image IDs, which are included in the VOC 2007 test annotations (these IDs are intentionally kept secret for subsequent PASCAL test sets). All PASCAL images, and about half of ILSVRC, were collected from flickr.com. This check turned up 31 matches out of 4952 (0.63%).
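
The exact-ID check amounts to a set intersection followed by a simple ratio. The sketch below uses hypothetical helper names and made-up IDs in the usage note; only the counts 31 and 4952 are taken from the text.

```python
from typing import Iterable, Set


def exact_id_matches(test_ids: Iterable[str], train_ids: Iterable[str]) -> Set[str]:
    """IDs that appear in both the test set and the auxiliary training set."""
    return set(test_ids) & set(train_ids)


def match_rate_percent(n_matches: int, n_test: int) -> float:
    """Share of test images that were matched, as a percentage."""
    return 100.0 * n_matches / n_test


# Reported figures: 31 matches among 4952 VOC 2007 test images.
reported_rate = round(match_rate_percent(31, 4952), 2)  # 0.63
```

For instance, `exact_id_matches(["id1", "id2"], ["id2", "id3"])` returns `{"id2"}`.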

The second check uses GIST [30] descriptor matching, which was shown in [13] to have excellent performance at near-duplicate image detection in large (> 1 million) image collections. Following [13], we computed GIST descriptors on warped 32 × 32 pixel versions of all ILSVRC 2012 trainval and PASCAL 2007 test images.

Euclidean distance nearest-neighbor matching of GIST descriptors revealed 38 near-duplicate images (including all 31 found by flickr ID matching). The matches tend to vary slightly in JPEG compression level and resolution, and to a lesser extent cropping. These findings show that the overlap is small, less than 1%. For VOC 2012, because flickr IDs are not available, we used the GIST matching method only. Based on GIST matches, 1.5% of VOC 2012 test images are in ILSVRC 2012 trainval. The slightly higher rate for VOC 2012 is likely due to the fact that the two datasets were collected closer together in time than VOC 2007 and ILSVRC 2012 were.
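
The nearest-neighbor step can be sketched as a brute-force search over precomputed descriptors. This is an assumption-laden illustration: it does not compute GIST descriptors, the `max_distance` cutoff is a hypothetical parameter (the text does not state how a match was declared), and a collection of over a million images would need a batched or approximate search rather than this double loop.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def nearest_neighbor(
    query: Sequence[float], database: Dict[str, Sequence[float]]
) -> Tuple[Optional[str], float]:
    """ID and Euclidean distance of the database descriptor closest to query."""
    best_id, best_dist = None, math.inf
    for image_id, desc in database.items():
        dist = math.dist(query, desc)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_id, best_dist = image_id, dist
    return best_id, best_dist


def find_near_duplicates(
    test_desc: Dict[str, Sequence[float]],
    train_desc: Dict[str, Sequence[float]],
    max_distance: float,
) -> Dict[str, Tuple[str, float]]:
    """Map each test image to its nearest training image when close enough."""
    matches = {}
    for test_id, query in test_desc.items():
        train_id, dist = nearest_neighbor(query, train_desc)
        if train_id is not None and dist <= max_distance:
            matches[test_id] = (train_id, dist)
    return matches
```

With the reported 38 near-duplicates among the 4952 VOC 2007 test images, the overlap is 100 × 38 / 4952 ≈ 0.77%, consistent with the "less than 1%" stated above.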

