OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks (Paper Translation)


Abstract

We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learned simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and obtained very competitive results for the detection and classification tasks. In post-competition work, we establish a new state of the art for the detection task. Finally, we release a feature extractor from our best model, called OverFeat.

1 Introduction

Recognizing the category of the dominant object in an image is a task to which Convolutional Networks (ConvNets) [17] have been applied for many years, whether the objects were handwritten characters [16], house numbers [24], textureless toys [18], traffic signs [3, 26], objects from the Caltech-101 dataset [14], or objects from the 1000-category ImageNet dataset [15]. The accuracy of ConvNets on small datasets such as Caltech-101, while decent, has not been record-breaking. However, the advent of larger datasets has enabled ConvNets to significantly advance the state of the art on datasets such as the 1000-category ImageNet [5].

The main advantage of ConvNets for many such tasks is that the entire system is trained end to end, from raw pixels to ultimate categories, thereby alleviating the requirement to manually design a suitable feature extractor. The main disadvantage is their ravenous appetite for labeled training samples.
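To make the end-to-end idea concrete, here is a minimal, hypothetical sketch (not the paper's architecture; the layer sizes and names are invented for illustration) of a ConvNet that maps raw pixels directly to class scores, with the feature extractor learned from data rather than hand-designed:

```python
# A minimal sketch of an end-to-end ConvNet: raw pixels in, class
# scores out, no hand-designed feature extractor. Illustrative only.
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Convolutional feature extractor, learned jointly with the classifier.
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):              # x: (N, 3, H, W) raw pixels
        h = self.features(x).flatten(1)
        return self.classifier(h)      # class scores, trained end to end

scores = TinyConvNet()(torch.randn(1, 3, 224, 224))
```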

The main point of this paper is to show that training a convolutional network to simultaneously classify, locate and detect objects in images can boost the classification accuracy as well as the detection and localization accuracy of all tasks. The paper proposes a new integrated approach to object detection, recognition, and localization with a single ConvNet. We also introduce a novel method for localization and detection by accumulating predicted bounding boxes. We suggest that by combining many localization predictions, detection can be performed without training on background samples, and that it is possible to avoid the time-consuming and complicated bootstrapping training passes. Not training on background also lets the network focus solely on positive classes for higher accuracy. Experiments are conducted on the ImageNet ILSVRC 2012 and 2013 datasets, and establish state-of-the-art results on the ILSVRC 2013 localization and detection tasks.
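As an illustration of accumulation versus suppression, the following simplified sketch merges overlapping predicted boxes and sums their confidences, so that many agreeing window predictions reinforce one detection rather than competing; the paper's actual greedy merging criterion is more elaborate, and the IoU threshold here is an assumption:

```python
# Simplified sketch: accumulate (merge) boxes instead of suppressing
# them. Boxes are (x1, y1, x2, y2); conf is a per-box class confidence.

def iou(a, b):
    """Intersection-over-union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter + 1e-9)

def accumulate(boxes, confs, thresh=0.5):
    """Greedily merge overlapping boxes, confidence-weighting their
    coordinates and summing confidences as accumulated evidence."""
    merged = []
    for box, c in sorted(zip(boxes, confs), key=lambda t: -t[1]):
        for m in merged:
            if iou(box, m["box"]) > thresh:
                w = m["conf"] / (m["conf"] + c)
                m["box"] = tuple(w * mb + (1 - w) * bb
                                 for mb, bb in zip(m["box"], box))
                m["conf"] += c   # evidence accumulated, not suppressed
                break
        else:
            merged.append({"box": tuple(box), "conf": c})
    return merged
```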

While images from the ImageNet classification dataset are largely chosen to contain a roughly-centered object that fills much of the image, objects of interest sometimes vary significantly in size and position within the image. The first idea in addressing this is to apply a ConvNet at multiple locations in the image, in a sliding window fashion, and over multiple scales. Even with this, however, many viewing windows may contain a perfectly identifiable portion of the object (say, the head of a dog), but not the entire object, nor even the center of the object. This leads to decent classification but poor localization and detection. Thus, the second idea is to train the system to not only produce a distribution over categories for each window, but also to produce a prediction of the location and size of the bounding box containing the object relative to the window. The third idea is to accumulate the evidence for each category at each location and size.
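These three ideas can be schematized as follows. In the paper the sliding window is computed efficiently within a single ConvNet pass; this sketch loops explicitly for clarity, and `net_cls` and `net_box` are hypothetical stand-ins for the classification and box-regression heads:

```python
# Schematic multi-scale sliding window: at each window, predict both a
# class distribution and a window-relative box; keep every prediction
# for later accumulation. Illustrative only.
import numpy as np

def multiscale_sliding_predictions(image, net_cls, net_box,
                                   win=224, stride=32,
                                   scales=(1.0, 1.4, 2.0)):
    preds = []
    for s in scales:
        H, W = int(image.shape[0] * s), int(image.shape[1] * s)
        # Nearest-neighbour rescale, sufficient for a sketch.
        ys = (np.arange(H) / s).astype(int).clip(0, image.shape[0] - 1)
        xs = (np.arange(W) / s).astype(int).clip(0, image.shape[1] - 1)
        img = image[ys][:, xs]
        for y in range(0, H - win + 1, stride):
            for x in range(0, W - win + 1, stride):
                patch = img[y:y + win, x:x + win]
                probs = net_cls(patch)           # distribution over categories
                bx, by, bw, bh = net_box(patch)  # box relative to the window
                # Map the window-relative box back to original coordinates.
                preds.append(((x + bx) / s, (y + by) / s, bw / s, bh / s,
                              int(probs.argmax()), float(probs.max())))
    return preds
```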

Many authors have proposed to use ConvNets for detection and localization with a sliding window over multiple scales, going back to the early 1990s for multi-character strings [20], faces [30], and hands [22]. More recently, ConvNets have been shown to yield state-of-the-art performance on text detection in natural images [4], face detection [8, 23] and pedestrian detection [25].

Several authors have also proposed to train ConvNets to directly predict the instantiation parameters of the objects to be located, such as the position relative to the viewing window, or the pose of the object. For example, Osadchy et al. [23] describe a ConvNet for simultaneous face detection and pose estimation. Faces are represented by a 3D manifold in the nine-dimensional output space. Positions on the manifold indicate the pose (pitch, yaw, and roll). When the training image is a face, the network is trained to produce a point on the manifold at the location of the known pose. If the image is not a face, the output is pushed away from the manifold. At test time, the distance to the manifold indicates whether the image contains a face, and the position of the closest point on the manifold indicates the pose. Taylor et al. [27, 28] use a ConvNet to estimate the location of body parts (hands, head, etc.) so as to derive the human body pose. They use a metric learning criterion to train the network to produce points on a body-pose manifold. Hinton et al. have also proposed to train networks to compute explicit instantiation parameters of features as part of a recognition process [12].
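The manifold test at inference time can be sketched as follows. Note that the parameterization of the manifold below is invented purely for illustration; it is not the actual embedding used by Osadchy et al. [23]:

```python
# Hedged sketch of manifold-based detection: the network maps an image
# to a 9-D point; faces lie on a pose-parameterized manifold.
import numpy as np

def manifold_point(pitch, yaw, roll):
    """Hypothetical embedding of a pose onto a manifold in R^9."""
    angles = np.array([pitch, yaw, roll])
    return np.concatenate([np.cos(angles), np.sin(angles), angles / np.pi])

def detect_and_estimate(net_output, pose_grid, tau=0.5):
    """Face if the network output lies near the manifold; the closest
    manifold point gives the pose estimate (pitch, yaw, roll)."""
    dists = [np.linalg.norm(net_output - manifold_point(*p))
             for p in pose_grid]
    i = int(np.argmin(dists))
    return (dists[i] < tau), pose_grid[i]
```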

Other authors have proposed to perform object localization via ConvNet-based segmentation. The simplest approach consists in training the ConvNet to classify the central pixel (or voxel, for volumetric images) of its viewing window as a boundary between regions or not [13]. But when the regions must be categorized, it is preferable to perform semantic segmentation. The main idea is to train the ConvNet to classify the central pixel of the viewing window with the category of the object it belongs to, using the window as context for the decision. Applications range from biological image analysis [21], to obstacle tagging for mobile robots [10], to tagging of photos [7]. The advantage of this approach is that the bounding contours need not be rectangles, and the regions need not be well-circumscribed objects. The disadvantage is that it requires dense pixel-level labels for training. This segmentation pre-processing or object-proposal step has recently gained popularity in traditional computer vision to reduce the search space of position, scale and aspect ratio for detection [19, 2, 6, 29]. Hence an expensive classification method can be applied at the optimal location in the search space, thus increasing recognition accuracy. Additionally, [29, 1] suggest that these methods improve accuracy by drastically reducing unlikely object regions, hence reducing potential false positives. Our dense sliding window method, however, is able to outperform object-proposal methods on the ILSVRC13 detection dataset.
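The window-centered labeling idea amounts to the following minimal sketch, where `net` is a hypothetical per-patch classifier and dense pixel-level labels would be needed to train it:

```python
# Minimal sketch of semantic segmentation by classifying the central
# pixel of each viewing window, using the window as context.
import numpy as np

def label_pixels(image, net, win=65):
    h, w = image.shape[:2]
    r = win // 2
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    labels = np.zeros((h, w), dtype=int)
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + win, x:x + win]
            labels[y, x] = int(net(patch).argmax())  # category of center pixel
    return labels
```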

Krizhevsky et al. [15] recently demonstrated impressive classification performance using a large ConvNet. The authors also entered the ImageNet 2012 competition, winning both the classification and localization challenges. Although they demonstrated an impressive localization performance, there has been no published work describing how their approach works. Our paper is thus the first to provide a clear explanation of how ConvNets can be used for localization and detection for ImageNet data.

In this paper, we use the terms localization and detection in a manner consistent with their use in the ImageNet 2013 competition.
