Paper Reading Notes (13): OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learned simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and obtained very competitive results for the detection and classification tasks. In post-competition work, we establish a new state of the art for the detection task. Finally, we release a feature extractor from our best model, called OverFeat.

Recognizing the category of the dominant object in an image is a task to which Convolutional Networks (ConvNets) [17] have been applied for many years, whether the objects were handwritten characters [16], house numbers [24], textureless toys [18], traffic signs [3, 26], objects from the Caltech-101 dataset [14], or objects from the 1000-category ImageNet dataset [15]. The accuracy of ConvNets on small datasets such as Caltech-101, while decent, has not been record-breaking. However, the advent of larger datasets has enabled ConvNets to significantly advance the state of the art on datasets such as the 1000-category ImageNet [5].

The main advantage of ConvNets for many such tasks is that the entire system is trained end to end, from raw pixels to ultimate categories, thereby alleviating the requirement to manually design a suitable feature extractor. The main disadvantage is their ravenous appetite for labeled training samples.

The main point of this paper is to show that training a convolutional network to simultaneously classify, locate and detect objects in images can boost the classification accuracy and the detection and localization accuracy of all tasks. The paper proposes a new integrated approach to object detection, recognition, and localization with a single ConvNet. We also introduce a novel method for localization and detection by accumulating predicted bounding boxes. We suggest that by combining many localization predictions, detection can be performed without training on background samples and that it is possible to avoid the time-consuming and complicated bootstrapping training passes. Not training on background also lets the network focus solely on positive classes for higher accuracy.

Experiments are conducted on the ImageNet ILSVRC 2012 and 2013 datasets, and establish state-of-the-art results on the ILSVRC 2013 localization and detection tasks.

While images from the ImageNet classification dataset are largely chosen to contain a roughly-centered object that fills much of the image, objects of interest sometimes vary significantly in size and position within the image. The first idea in addressing this is to apply a ConvNet at multiple locations in the image, in a sliding window fashion, and over multiple scales. Even with this, however, many viewing windows may contain a perfectly identifiable portion of the object (say, the head of a dog), but not the entire object, nor even the center of the object. This leads to decent classification but poor localization and detection. Thus, the second idea is to train the system to not only produce a distribution over categories for each window, but also to produce a prediction of the location and size of the bounding box containing the object, relative to the window. The third idea is to accumulate the evidence for each category at each location and size.

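These three ideas can be made concrete with a short sketch. The following is a minimal numpy illustration, not the paper's implementation: the window size, stride, scale set, the 10-class stub predictor, and the crude cropping stand-in for rescaling are all assumptions made for brevity.

```python
import numpy as np

def predict_window(window):
    """Stand-in for the trained ConvNet: returns (class_scores, box).

    The box is (x, y, w, h) relative to the viewing window, matching
    the idea of predicting location and size relative to each window.
    """
    rng = np.random.default_rng(abs(hash(window.tobytes())) % (2**32))
    scores = rng.random(10)   # 10 hypothetical classes
    box = rng.random(4)       # normalized window-relative coordinates
    return scores, box

def multiscale_predictions(image, scales=(1.0, 0.75, 0.5),
                           win=64, stride=32):
    """Slide a window over several rescaled copies of the image and
    accumulate (class, score, box-in-image-coordinates) evidence."""
    evidence = []
    for s in scales:
        h, w = int(image.shape[0] * s), int(image.shape[1] * s)
        scaled = image[:h, :w]   # crude stand-in for actual resizing
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                scores, (bx, by, bw, bh) = predict_window(
                    scaled[y:y + win, x:x + win])
                c = int(scores.argmax())
                # Map the window-relative box back to original image
                # coordinates, undoing the scale factor.
                box = ((x + bx * win) / s, (y + by * win) / s,
                       bw * win / s, bh * win / s)
                evidence.append((c, float(scores[c]), box))
    return evidence

preds = multiscale_predictions(np.zeros((256, 256)))
print(len(preds), "accumulated box predictions")
```

The key point of the accumulation step is that overlapping windows vote for the same object instead of competing, which is what lets detection confidence grow with evidence rather than being suppressed.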

Many authors have proposed to use ConvNets for detection and localization with a sliding window over multiple scales, going back to the early 1990s for multi-character strings [20], faces [30], and hands [22]. More recently, ConvNets have been shown to yield state of the art performance on text detection in natural images [4], face detection [8, 23] and pedestrian detection [25].

Several authors have also proposed to train ConvNets to directly predict the instantiation parameters of the objects to be located, such as the position relative to the viewing window, or the pose of the object. For example, Osadchy et al. [23] describe a ConvNet for simultaneous face detection and pose estimation. Faces are represented by a 3D manifold in the nine-dimensional output space. Positions on the manifold indicate the pose (pitch, yaw, and roll). When the training image is a face, the network is trained to produce a point on the manifold at the location of the known pose. If the image is not a face, the output is pushed away from the manifold. At test time, the distance to the manifold indicates whether the image contains a face, and the position of the closest point on the manifold indicates the pose. Taylor et al. [27, 28] use a ConvNet to estimate the location of body parts (hands, head, etc.) so as to derive the human body pose. They use a metric learning criterion to train the network to produce points on a body pose manifold. Hinton et al. have also proposed to train networks to compute explicit instantiation parameters of features as part of a recognition process [12].

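To make the distance-to-manifold idea tangible, here is a toy Python sketch. The 9-D embedding below is invented purely for illustration (it is not the parameterization actually used by Osadchy et al.), and a real system would train the network output toward the manifold rather than score by sampling it.

```python
import numpy as np

def manifold_point(pitch, yaw, roll):
    """Map a pose to a point in a 9-D output space. This embedding
    (sin/cos pairs plus the raw angles) is a hypothetical example."""
    angles = np.array([pitch, yaw, roll])
    return np.concatenate([np.sin(angles), np.cos(angles), angles])

def distance_to_manifold(output, n_samples=1000):
    """Approximate the distance from a network output to the pose
    manifold by sampling it; the closest sample's pose is the estimate."""
    rng = np.random.default_rng(0)
    poses = rng.uniform(-np.pi / 2, np.pi / 2, size=(n_samples, 3))
    points = np.array([manifold_point(*p) for p in poses])
    d = np.linalg.norm(points - output, axis=1)
    i = int(d.argmin())
    return d[i], poses[i]   # small distance suggests "face", plus a pose

out = manifold_point(0.1, -0.2, 0.0) + 0.05   # output near the manifold
dist, pose = distance_to_manifold(out)
print(f"distance={dist:.3f}, pose estimate={pose.round(2)}")
```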

Other authors have proposed to perform object localization via ConvNet-based segmentation. The simplest approach consists in training the ConvNet to classify the central pixel (or voxel for volumetric images) of its viewing window as a boundary between regions or not [13]. But when the regions must be categorized, it is preferable to perform semantic segmentation. The main idea is to train the ConvNet to classify the central pixel of the viewing window with the category of the object it belongs to, using the window as context for the decision. Applications range from biological image analysis [21] to obstacle tagging for mobile robots [10] to the tagging of photos [7]. The advantage of this approach is that the bounding contours need not be rectangles, and the regions need not be well-circumscribed objects. The disadvantage is that it requires dense pixel-level labels for training. This segmentation pre-processing or object proposal step has recently gained popularity in traditional computer vision to reduce the search space of position, scale and aspect ratio for detection [19, 2, 6, 29]. Hence an expensive classification method can be applied at the optimal location in the search space, thus increasing recognition accuracy. Additionally, [29, 1] suggest that these methods improve accuracy by drastically reducing unlikely object regions, hence reducing potential false positives. Our dense sliding window method, however, is able to outperform object proposal methods on the ILSVRC13 detection dataset.

Krizhevsky et al. [15] recently demonstrated impressive classification performance using a large ConvNet. The authors also entered the ImageNet 2012 competition, winning both the classification and localization challenges. Although they demonstrated an impressive localization performance, there has been no published work describing how their approach works. Our paper is thus the first to provide a clear explanation of how ConvNets can be used for localization and detection on ImageNet data.

In this paper we use the terms localization and detection in a way that is consistent with their use in the ImageNet 2013 competition, namely that the only difference is the evaluation criterion used, and that both involve predicting the bounding box for each object in the image.

In this paper, we explore three computer vision tasks in increasing order of difficulty: (i) classification, (ii) localization, and (iii) detection. Each task is a sub-task of the next. While all tasks are addressed using a single framework and a shared feature learning base, we will describe them separately in the following sections.

Throughout the paper, we report results on the 2013 ImageNet Large Scale Visual Recognition Challenge (ILSVRC2013). In the classification task of this challenge, each image is assigned a single label corresponding to the main object in the image. Five guesses are allowed to find the correct answer (this is because images can also contain multiple unlabeled objects). The localization task is similar in that five guesses are allowed per image, but in addition, a bounding box for the predicted object must be returned with each guess. To be considered correct, the predicted box must match the ground truth by at least 50% (using the PASCAL criterion of intersection over union), as well as be labeled with the correct class (i.e. each prediction is a label and bounding box that are associated together). The detection task differs from localization in that there can be any number of objects in each image (including zero), and false positives are penalized by the mean average precision (mAP) measure. The localization task is a convenient intermediate step between classification and detection, and allows us to evaluate our localization method independently of challenges specific to detection (such as learning a background class). In Fig. 1, we show examples of images with our localization/detection predictions as well as the corresponding ground truth. Note that classification and localization share the same dataset, while detection also has additional data where objects can be smaller. The detection data also contain a set of images where certain objects are absent. This can be used for bootstrapping, but we have not made use of it in this work.

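The 50% matching criterion is easy to state in code. Here is a small sketch, assuming the common (x1, y1, x2, y2) corner convention for boxes:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

pred, gt = (10, 10, 60, 60), (20, 20, 70, 70)
# inter = 40*40 = 1600; union = 2500 + 2500 - 1600 = 3400
# iou = 1600/3400 ~= 0.47, so this prediction would NOT count as correct
print(iou(pred, gt) >= 0.5)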

3 Classification
Our classification architecture is similar to the best ILSVRC12 architecture by Krizhevsky et al. [15]. However, we improve on the network design and the inference step. Because of time constraints, some of the training features in Krizhevsky’s model were not explored, and so we expect our results can be improved even further.

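As a rough illustration of what "similar to the best ILSVRC12 architecture" means in practice, here is a hedged PyTorch sketch of an AlexNet-style trunk in the spirit of the OverFeat fast model. The filter counts, strides and pooling sizes below are approximations for illustration only; the authoritative layer table is in the paper itself.

```python
import torch
import torch.nn as nn

# Approximate AlexNet/OverFeat-style feature trunk (placeholder sizes).
features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(96, 256, kernel_size=5), nn.ReLU(inplace=True),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(512, 1024, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(1024, 1024, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(2, 2),
)
# The "fully connected" layers are written as convolutions (6x6, then
# 1x1), which keeps the network fully convolutional so the classifier
# can slide over larger inputs at inference time.
classifier = nn.Sequential(
    nn.Conv2d(1024, 3072, kernel_size=6), nn.ReLU(inplace=True),
    nn.Conv2d(3072, 4096, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(4096, 1000, kernel_size=1),
)

x = torch.randn(1, 3, 231, 231)        # OverFeat's training crop size
print(classifier(features(x)).shape)   # torch.Size([1, 1000, 1, 1])
```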

4 Localization
Starting from our classification-trained network, we replace the classifier layers by a regression network and train it to predict object bounding boxes at each spatial location and scale. We then combine the regression predictions together, along with the classification results at each location, as we now describe.

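A hypothetical sketch of this swap, continuing the Section 3 code (it assumes the `features` and `classifier` modules defined there). The regressor's layer widths are placeholders, and for simplicity it predicts a single class-agnostic box per location rather than per-class boxes:

```python
import torch
import torch.nn as nn

# Placeholder regression head replacing the classifier layers.
regressor = nn.Sequential(
    nn.Conv2d(1024, 4096, kernel_size=6), nn.ReLU(inplace=True),
    nn.Conv2d(4096, 1024, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(1024, 4, kernel_size=1),   # one (x1, y1, x2, y2) per cell
)

x = torch.randn(1, 3, 281, 317)   # a larger, test-time-style input
feats = features(x)               # shared trunk from Section 3
boxes = regressor(feats)          # [1, 4, H', W']: box map
scores = classifier(feats)        # [1, 1000, H', W']: class map
print(boxes.shape, scores.shape)
# Each spatial cell now pairs a class distribution with a predicted box;
# these predictions are then merged/accumulated across locations and scales.
```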

5 Detection
Detection training is similar to classification training, but in a spatial manner: multiple locations of an image may be trained on simultaneously. Since the model is convolutional, all weights are shared among all locations. The main difference from the localization task is the need to predict a background class when no object is present. Traditionally, negative examples are initially taken at random for training; the most offending negative errors are then added to the training set in bootstrapping passes. Independent bootstrapping passes make training complicated and risk mismatches between the negative-example collection and the training schedule. Additionally, the size of the bootstrapping passes needs to be tuned to make sure training does not overfit to a small set. To circumvent all these problems, we perform negative training on the fly, by selecting a few interesting negative examples per image, such as random ones or the most offending ones. This approach is more computationally expensive, but renders the procedure much simpler. And since the feature extractor is initially trained on the classification task, the detection fine-tuning does not take long anyway.

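The on-the-fly negative selection might look like the following sketch. The function name, the 2+2 split between hard and random negatives, and the use of a background score as the hardness signal are all assumptions for illustration; it also assumes windows overlapping ground-truth objects have already been filtered out.

```python
import numpy as np

def select_negatives(bg_scores, n_random=2, n_offending=2, rng=None):
    """Pick a few negatives per image: the most offending ones (lowest
    background score, i.e. the model is most wrong about them) plus a
    few random ones."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = np.arange(len(bg_scores))
    offending = idx[np.argsort(bg_scores)][:n_offending]  # hardest first
    rest = np.setdiff1d(idx, offending)
    random_picks = rng.choice(rest, size=min(n_random, len(rest)),
                              replace=False)
    return np.concatenate([offending, random_picks])

# bg_scores[i] = current background probability for negative window i
bg = np.array([0.95, 0.10, 0.80, 0.05, 0.60, 0.99])
print(select_negatives(bg))   # e.g. [3 1 ...]: hardest negatives first
```

Because selection happens inside the training loop, the set of negatives automatically tracks the current model, which is exactly what bootstrapping passes try to achieve with far more machinery.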

In Fig. 11, we report the results of the ILSVRC 2013 competition, where our detection system ranked 3rd with 19.4% mean average precision (mAP). We later established a new detection state of the art with 24.3% mAP. Note that there is a large gap between the top 3 methods and the other teams (the 4th-place method yields 11.5% mAP). Additionally, our approach is considerably different from the other top-2 systems, which use an initial segmentation step to reduce candidate windows from approximately 200,000 to 2,000. This technique speeds up inference and substantially reduces the number of potential false positives. [29, 1] suggest that detection accuracy drops when using dense sliding windows as opposed to selective search, which discards unlikely object locations and hence reduces false positives. Combined with our method, we may observe improvements similar to those seen here between traditional dense methods and segmentation-based methods. It should also be noted that we did not fine-tune on the detection validation set as NEC and UvA did. The validation and test set distributions differ significantly enough from the training set that this alone improves results by approximately 1 point. The improvement between the two OverFeat results in Fig. 11 is due to longer training times and the use of context, i.e. each scale also uses lower-resolution scales as input.

在图11中, 我们报告了 ILSVRC 2013 竞争的结果, 我们的检测系统排名第三, 平均精度为 19.4% (mAP)。我们后来建立了一个新的检测状态的技术与24.3% 映射。请注意, 前3方法和其他团队之间存在很大的差距 (第四种方法生成11.5% 映射)。此外, 我们的方法与前2个使用初始分割步骤将候选窗口从大约20万减少到2000的其他系统截然不同。这种技术加快了推理速度, 大大减少了潜在误报的数量。[29, 1] 建议在使用密集滑动窗口时检测精度下降, 而不是选择性搜索, 从而丢弃了不太可能的对象位置, 从而减少了误报。结合我们的方法, 我们可以看到类似的改进, 如这里所见的传统密集方法和基于分割的方法。还应该指出的是, 我们没有微调的检测验证集的 NEC 和 UvA。验证和测试集分布与训练集的差异很大, 仅此一项可以提高大约1点的结果。两个 OverFeat 结果之间的改善是由于训练时间较长和上下文的使用, 即每个刻度也使用较低的分辨率刻度作为输入。

6 Discussion
We have presented a multi-scale, sliding-window approach that can be used for classification, localization and detection. We applied it to the ILSVRC 2013 datasets, and it currently ranks 4th in classification, 1st in localization and 1st in detection. A second important contribution of our paper is explaining how ConvNets can be effectively used for detection and localization tasks. These were never addressed in [15], and thus we are the first to explain how this can be done in the context of ImageNet 2012. The scheme we propose involves substantial modifications to networks designed for classification, but clearly demonstrates that ConvNets are capable of these more challenging tasks. Our localization approach won the 2013 ILSVRC competition and significantly outperformed all 2012 and 2013 approaches. The detection model was among the top performers during the competition, and ranks first in post-competition results. We have proposed an integrated pipeline that can perform different tasks while sharing a common feature extraction base, learned entirely and directly from the pixels.


Our approach might still be improved in several ways. (i) For localization, we are not currently back-propagating through the whole network; doing so is likely to improve performance. (ii) We are using an L2 loss, rather than directly optimizing the intersection-over-union (IoU) criterion on which performance is measured. Swapping the loss to IoU should be possible, since IoU is still differentiable provided there is some overlap. (iii) Alternate parameterizations of the bounding box may help to decorrelate the outputs, which will aid network training.

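Point (ii) can be illustrated with a toy PyTorch comparison. This IoU-loss formulation is an assumption for illustration, not something the paper specifies (the authors only note that IoU remains differentiable given some overlap):

```python
import torch

def l2_loss(pred, gt):
    """The currently used loss: squared error on box coordinates."""
    return ((pred - gt) ** 2).sum(dim=-1)

def iou_loss(pred, gt, eps=1e-6):
    """1 - IoU for (x1, y1, x2, y2) boxes; differentiable wherever
    the predicted and ground-truth boxes overlap."""
    ix1 = torch.max(pred[..., 0], gt[..., 0])
    iy1 = torch.max(pred[..., 1], gt[..., 1])
    ix2 = torch.min(pred[..., 2], gt[..., 2])
    iy2 = torch.min(pred[..., 3], gt[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_g = (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1])
    iou = inter / (area_p + area_g - inter + eps)
    return 1.0 - iou

pred = torch.tensor([12., 12., 58., 58.], requires_grad=True)
gt = torch.tensor([10., 10., 60., 60.])
loss = iou_loss(pred, gt)
loss.backward()
print(float(loss), pred.grad)   # nonzero gradient, since the boxes overlap
```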
