Paper Reading Notes (14): Spatial pyramid pooling in deep convolutional networks for visual recognition (SPP)

Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224×224) input image. This requirement is “artificial” and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning.

The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method is 24-102× faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007.

In ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvement made for this competition.

We are witnessing a rapid, revolutionary change in our vision community, mainly caused by deep convolutional neural networks (CNNs) [1] and the availability of large scale training data [2]. Deep-networks-based approaches have recently been substantially improving upon the state of the art in image classification [3], [4], [5], [6], object detection [7], [8], [5], many other recognition tasks [9], [10], [11], [12], and even non-recognition tasks.

However, there is a technical issue in the training and testing of the CNNs: the prevalent CNNs require a fixed input image size (e.g., 224×224), which limits both the aspect ratio and the scale of the input image. When applied to images of arbitrary sizes, current methods mostly fit the input image to the fixed size, either via cropping [3], [4] or via warping [13], [7], as shown in Figure 1 (top). But the cropped region may not contain the entire object, while the warped content may result in unwanted geometric distortion. Recognition accuracy can be compromised due to the content loss or distortion. Besides, a pre-defined scale may not be suitable when object scales vary. Fixing input sizes overlooks the issues involving scales.

So why do CNNs require a fixed input size? A CNN mainly consists of two parts: convolutional layers, and fully-connected layers that follow. The convolutional layers operate in a sliding-window manner and output feature maps which represent the spatial arrangement of the activations (Figure 2).

In fact, convolutional layers do not require a fixed image size and can generate feature maps of any size. On the other hand, the fully-connected layers need to have fixed-size/length input by their definition. Hence, the fixed-size constraint comes only from the fully-connected layers, which exist at a deeper stage of the network.
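
To make the point about the fully-connected layers concrete, here is a small arithmetic sketch. The layer stack `LAYERS` and the channel count are made-up values for illustration, not an architecture from the paper; the only thing the sketch relies on is the standard output-size formula for a strided window.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Output size along one axis for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_length(height, width, channels, layers):
    """Length of the flattened feature map after a stack of layers.

    `layers` is a list of (kernel, stride, padding) tuples.
    """
    for kernel, stride, padding in layers:
        height = conv_output_size(height, kernel, stride, padding)
        width = conv_output_size(width, kernel, stride, padding)
    return height * width * channels


# Hypothetical two-layer stack (assumption, not taken from the paper).
LAYERS = [(3, 2, 1), (3, 2, 1)]

len_224 = flattened_length(224, 224, 256, LAYERS)  # 56 * 56 * 256
len_180 = flattened_length(180, 180, 256, LAYERS)  # 45 * 45 * 256
print(len_224, len_180)
```

The convolutional part runs on both input sizes without complaint, but the flattened length differs, so a fully-connected layer with a fixed weight matrix could accept only one of them.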

In this paper, we introduce a spatial pyramid pooling (SPP) [14], [15] layer to remove the fixed-size constraint of the network. Specifically, we add an SPP layer on top of the last convolutional layer. The SPP layer pools the features and generates fixed-length outputs, which are then fed into the fully-connected layers (or other classifiers). In other words, we perform some information “aggregation” at a deeper stage of the network hierarchy (between convolutional layers and fully-connected layers) to avoid the need for cropping or warping at the beginning. Figure 1 (bottom) shows the change of the network architecture by introducing the SPP layer. We call the new network structure SPP-net.

Spatial pyramid pooling [14], [15] (popularly known as spatial pyramid matching or SPM [15]), as an extension of the Bag-of-Words (BoW) model [16], is one of the most successful methods in computer vision. It partitions the image into divisions from finer to coarser levels, and aggregates local features in them. SPP has long been a key component in the leading and competition-winning systems for classification (e.g., [17], [18], [19]) and detection (e.g., [20]) before the recent prevalence of CNNs. Nevertheless, SPP has not been considered in the context of CNNs. We note that SPP has several remarkable properties for deep CNNs: 1) SPP is able to generate a fixed-length output regardless of the input size, while the sliding window pooling used in the previous deep networks [3] cannot; 2) SPP uses multi-level spatial bins, while the sliding window pooling uses only a single window size. Multi-level pooling has been shown to be robust to object deformations [15]; 3) SPP can pool features extracted at variable scales thanks to the flexibility of input scales. Through experiments we show that all these factors elevate the recognition accuracy of deep networks.
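
A minimal pure-Python sketch of properties 1) and 2) may help. It max-pools each channel into an n×n grid of bins for every pyramid level and concatenates the results. The level set `(4, 2, 1)`, the bin-boundary rounding, and the helper `make_feature_map` are assumptions chosen for illustration, not the exact configuration used in the paper.

```python
def spatial_pyramid_pool(feature_map, levels=(4, 2, 1)):
    """Max-pool a channels x height x width nested list into a flat vector.

    For each level n, every channel is split into n x n bins whose size
    scales with the input, so the output length is independent of
    height and width. Neighbouring bins may overlap by one row/column.
    """
    output = []
    for n in levels:
        for channel in feature_map:
            height, width = len(channel), len(channel[0])
            for i in range(n):
                r0 = (i * height) // n                 # floor
                r1 = ((i + 1) * height + n - 1) // n   # ceil
                for j in range(n):
                    c0 = (j * width) // n
                    c1 = ((j + 1) * width + n - 1) // n
                    output.append(max(channel[r][c]
                                      for r in range(r0, r1)
                                      for c in range(c0, c1)))
    return output


def make_feature_map(channels, height, width):
    """Toy feature map (hypothetical helper): values grow with position."""
    return [[[(k + 1) * (r * width + c) for c in range(width)]
             for r in range(height)]
            for k in range(channels)]


small = spatial_pyramid_pool(make_feature_map(3, 13, 13))
wide = spatial_pyramid_pool(make_feature_map(3, 10, 17))
print(len(small), len(wide))  # both are 3 * (16 + 4 + 1) = 63
```

Because the bin size is derived from the input size rather than fixed in advance, both feature maps yield a vector of the same length, which is what a following fully-connected layer needs.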

SPP-net not only makes it possible to generate representations from arbitrarily sized images/windows for testing, but also allows us to feed images with varying sizes or scales during training. Training with variable-size images increases scale-invariance and reduces over-fitting. We develop a simple multi-size training method. For a single network to accept variable input sizes, we approximate it by multiple networks that share all parameters, while each of these networks is trained using a fixed input size. In each epoch we train the network with a given input size, and switch to another input size for the next epoch. Experiments show that this multi-size training converges just as the traditional single-size training, and leads to better testing accuracy.
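
The epoch-wise switching described above can be sketched as a simple schedule. The sizes `(224, 180)` and the callback `train_one_epoch` are placeholders assumed for illustration; the real training loop, data resizing, and optimizer are not shown.

```python
def multi_size_schedule(num_epochs, sizes=(224, 180)):
    """Return the input size used in each epoch, cycling through `sizes`."""
    return [sizes[epoch % len(sizes)] for epoch in range(num_epochs)]


def train(num_epochs, train_one_epoch, sizes=(224, 180)):
    """Call `train_one_epoch(epoch, input_size)` once per epoch.

    The callback is assumed to update one shared set of weights, so the
    "multiple networks" differ only in the input size they are fed.
    """
    for epoch, size in enumerate(multi_size_schedule(num_epochs, sizes)):
        train_one_epoch(epoch, size)


# Usage with a stand-in callback that only records what it was asked to do.
log = []
train(3, lambda epoch, size: log.append((epoch, size)))
print(log)
```

Keeping the size fixed within an epoch means every batch in that epoch has one shape, which is why the paper describes this as multiple fixed-size networks sharing all parameters.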

The advantages of SPP are orthogonal to the specific CNN designs. In a series of controlled experiments on the ImageNet 2012 dataset, we demonstrate that SPP improves four different CNN architectures in existing publications [3], [4], [5] (or their modifications), over the no-SPP counterparts. These architectures have various filter numbers/sizes, strides, depths, or other designs. It is thus reasonable for us to conjecture that SPP should improve more sophisticated (deeper and larger) convolutional architectures. SPP-net also shows state-of-the-art classification results on Caltech101 [21] and Pascal VOC 2007 [22] using only a single full-image representation and no fine-tuning.

SPP-net also shows great strength in object detection. In the leading object detection method R-CNN [7], the features from candidate windows are extracted via deep convolutional networks. This method shows remarkable detection accuracy on both the VOC and ImageNet datasets. But the feature computation in R-CNN is time-consuming, because it repeatedly applies the deep convolutional networks to the raw pixels of thousands of warped regions per image. In this paper, we show that we can run the convolutional layers only once on the entire image (regardless of the number of windows), and then extract features by SPP-net on the feature maps. This method yields a speedup of over one hundred times over R-CNN. Note that training/running a detector on the feature maps (rather than image regions) is actually a more popular idea [23], [24], [20], [5]. But SPP-net inherits the power of the deep CNN feature maps and also the flexibility of SPP on arbitrary window sizes, which leads to outstanding accuracy and efficiency. In our experiment, the SPP-net-based system (built upon the R-CNN pipeline) computes features 24-102× faster than R-CNN, while achieving better or comparable accuracy. With the recent fast proposal method of EdgeBoxes [25], our system takes 0.5 seconds to process an image (including all steps). This makes our method practical for real-world applications.
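
The "convolve once, pool per window" idea can be sketched as follows. This is a simplified illustration: the stride of 16, the floor/ceil mapping from image coordinates to feature-map cells, and the single 2×2 pooling level are all assumptions, and the paper's exact window-to-feature-map mapping is not reproduced here.

```python
import math


def window_to_feature_coords(box, stride=16):
    """Map an image-space box (x0, y0, x1, y1) to feature-map index ranges.

    Assumes the feature map is a plain `stride`-times downsampling of
    the image and that the box lies inside the image.
    """
    x0, y0, x1, y1 = box
    fx0, fy0 = x0 // stride, y0 // stride
    fx1 = max(fx0 + 1, math.ceil(x1 / stride))
    fy1 = max(fy0 + 1, math.ceil(y1 / stride))
    return fx0, fy0, fx1, fy1


def pool_window(channel, box, n=2, stride=16):
    """Max-pool one channel of the shared feature map inside `box`.

    Returns n * n values regardless of the window size.
    """
    fx0, fy0, fx1, fy1 = window_to_feature_coords(box, stride)
    fx1, fy1 = min(fx1, len(channel[0])), min(fy1, len(channel))
    height, width = fy1 - fy0, fx1 - fx0
    pooled = []
    for i in range(n):
        r0 = fy0 + (i * height) // n
        r1 = fy0 + ((i + 1) * height + n - 1) // n
        for j in range(n):
            c0 = fx0 + (j * width) // n
            c1 = fx0 + ((j + 1) * width + n - 1) // n
            pooled.append(max(channel[r][c]
                              for r in range(r0, r1)
                              for c in range(c0, c1)))
    return pooled


# One 8x8 channel standing in for the feature map of a 128x128 image.
# It is computed once and then reused for every candidate window.
feature_channel = [[r * 8 + c for c in range(8)] for r in range(8)]
square = pool_window(feature_channel, (0, 0, 64, 64))
wide = pool_window(feature_channel, (32, 16, 128, 80))
print(square, wide)
```

The two windows have different sizes and aspect ratios, yet each produces four values from the same precomputed feature map, so the cost per window is only the pooling, not another pass through the convolutional layers.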

A preliminary version of this manuscript has been published in ECCV 2014. Based on this work, we attended the competition of ILSVRC 2014 [26], and ranked #2 in object detection and #3 in image classification (both are provided-data-only tracks) among all 38 teams. There are a few modifications made for ILSVRC 2014. We show that the SPP-nets can boost various networks that are deeper and larger (Sec. 3.1.2-3.1.4) over the no-SPP counterparts. Further, driven by our detection framework, we find that multi-view testing on feature maps with flexibly located/sized windows (Sec. 3.1.5) can increase the classification accuracy. This manuscript also provides the details of these modifications.

SPP is a flexible solution for handling different scales, sizes, and aspect ratios. These issues are important in visual recognition, but received little consideration in the context of deep networks. We have suggested a solution to train a deep network with a spatial pyramid pooling layer. The resulting SPP-net shows outstanding accuracy in classification/detection tasks and greatly accelerates DNN-based detection. Our studies also show that many time-proven techniques/insights in computer vision can still play important roles in deep-networks-based recognition.
