Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun

Abstract—Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224×224) input image. This requirement is “artificial” and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning.

The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method is 24-102× faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007.

In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvements made for this competition.

Index Terms—Convolutional Neural Networks, Spatial Pyramid Pooling, Image Classification, Object Detection




We are witnessing a rapid, revolutionary change in our vision community, mainly caused by deep convolutional neural networks (CNNs) [1] and the availability of large-scale training data [2]. Deep-network-based approaches have recently been substantially improving upon the state of the art in image classification [3], [4], [5], [6], object detection [7], [8], [5], many other recognition tasks [9], [10], [11], [12], and even non-recognition tasks.



However, there is a technical issue in the training and testing of the CNNs: the prevalent CNNs require a fixed input image size (e.g., 224×224), which limits both the aspect ratio and the scale of the input image. When applied to images of arbitrary sizes, current methods mostly fit the input image to the fixed size, either via cropping [3], [4] or via warping [13], [7], as shown in Figure 1 (top). But the cropped region may not contain the entire object, while the warped content may result in unwanted geometric distortion. Recognition accuracy can be compromised due to the content loss or distortion. Besides, a pre-defined scale may not be suitable when object scales vary. Fixing input sizes overlooks the issues involving scales.


So why do CNNs require a fixed input size? A CNN mainly consists of two parts: convolutional layers, and fully-connected layers that follow. The convolutional layers operate in a sliding-window manner and output feature maps which represent the spatial arrangement of the activations (Figure 2). In fact, convolutional layers do not require a fixed image size and can generate feature maps of any size. On the other hand, the fully-connected layers need to have fixed-size/length input by their definition. Hence, the fixed-size constraint comes only from the fully-connected layers, which exist at a deeper stage of the network.


In this paper, we introduce a spatial pyramid pooling (SPP) [14], [15] layer to remove the fixed-size constraint of the network. Specifically, we add an SPP layer on top of the last convolutional layer. The SPP layer pools the features and generates fixed-length outputs, which are then fed into the fully-connected layers (or other classifiers). In other words, we perform some information “aggregation” at a deeper stage of the network hierarchy (between convolutional layers and fully-connected layers) to avoid the need for cropping or warping at the beginning. Figure 1 (bottom) shows the change of the network architecture by introducing the SPP layer. We call the new network structure SPP-net.


Spatial pyramid pooling [14], [15] (popularly known as spatial pyramid matching or SPM [15]), as an extension of the Bag-of-Words (BoW) model [16], is one of the most successful methods in computer vision. It partitions the image into divisions from finer to coarser levels, and aggregates local features in them. SPP has long been a key component in the leading and competition-winning systems for classification (e.g., [17], [18], [19]) and detection (e.g., [20]) before the recent prevalence of CNNs. Nevertheless, SPP has not been considered in the context of CNNs. We note that SPP has several remarkable properties for deep CNNs: 1) SPP is able to generate a fixed-length output regardless of the input size, while the sliding window pooling used in the previous deep networks [3] cannot; 2) SPP uses multi-level spatial bins, while the sliding window pooling uses only a single window size. Multi-level pooling has been shown to be robust to object deformations [15]; 3) SPP can pool features extracted at variable scales thanks to the flexibility of input scales. Through experiments we show that all these factors elevate the recognition accuracy of deep networks.


SPP-net not only makes it possible to generate representations from arbitrarily sized images/windows for testing, but also allows us to feed images with varying sizes or scales during training. Training with variable-size images increases scale-invariance and reduces over-fitting. We develop a simple multi-size training method. For a single network to accept variable input sizes, we approximate it by multiple networks that share all parameters, while each of these networks is trained using a fixed input size. In each epoch we train the network with a given input size, and switch to another input size for the next epoch. Experiments show that this multi-size training converges just as the traditional single-size training does, and leads to better testing accuracy.
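The epoch-wise switching described above can be illustrated with a small sketch. The function name `size_schedule` and the two sizes used in the example are illustrative assumptions, and the loop body only prints what a real training loop would do with shared weights.

```python
from itertools import cycle, islice

def size_schedule(sizes, num_epochs):
    """Return the input size used in each epoch, cycling through `sizes`."""
    return list(islice(cycle(sizes), num_epochs))

# Hypothetical example: alternate between two input sizes for four epochs.
schedule = size_schedule([224, 180], 4)
for epoch, size in enumerate(schedule):
    # A real loop would train one full epoch at this size,
    # reusing the same shared parameters for every size.
    print(f"epoch {epoch}: train on {size}x{size} inputs")
```

Because every size-specific network shares all parameters, switching sizes between epochs only changes the input resolution, not the set of weights being updated.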


The advantages of SPP are orthogonal to the specific CNN designs. In a series of controlled experiments on the ImageNet 2012 dataset, we demonstrate that SPP improves four different CNN architectures in existing publications [3], [4], [5] (or their modifications), over the no-SPP counterparts. These architectures have various filter numbers/sizes, strides, depths, or other designs. It is thus reasonable for us to conjecture that SPP should improve more sophisticated (deeper and larger) convolutional architectures. SPP-net also shows state-of-the-art classification results on Caltech101 [21] and Pascal VOC 2007 [22] using only a single full-image representation and no fine-tuning.


SPP-net also shows great strength in object detection. In the leading object detection method R-CNN [7], the features from candidate windows are extracted via deep convolutional networks. This method shows remarkable detection accuracy on both the VOC and ImageNet datasets. But the feature computation in R-CNN is time-consuming, because it repeatedly applies the deep convolutional networks to the raw pixels of thousands of warped regions per image. In this paper, we show that we can run the convolutional layers only once on the entire image (regardless of the number of windows), and then extract features by SPP-net on the feature maps. This method yields a speedup of over one hundred times over R-CNN. Note that training/running a detector on the feature maps (rather than image regions) is actually a more popular idea [23], [24], [20], [5]. But SPP-net inherits the power of the deep CNN feature maps and also the flexibility of SPP on arbitrary window sizes, which leads to outstanding accuracy and efficiency. In our experiment, the SPP-net-based system (built upon the R-CNN pipeline) computes features 24-102× faster than R-CNN, while achieving better or comparable accuracy. With the recent fast proposal method of EdgeBoxes [25], our system takes 0.5 seconds to process an image (including all steps). This makes our method practical for real-world applications.


A preliminary version of this manuscript has been published in ECCV 2014. Based on this work, we entered the ILSVRC 2014 competition [26], and ranked #2 in object detection and #3 in image classification (both are provided-data-only tracks) among all 38 teams. There are a few modifications made for ILSVRC 2014. We show that the SPP-nets can boost various networks that are deeper and larger (Sec. 3.1.2-3.1.4) over the no-SPP counterparts. Further, driven by our detection framework, we find that multi-view testing on feature maps with flexibly located/sized windows (Sec. 3.1.5) can increase the classification accuracy. This manuscript also provides the details of these modifications.

We have released the code to facilitate future research (http://research.microsoft.com/en-us/um/people/kahe/).




2 Deep Networks with Spatial Pyramid Pooling

2.1 Convolutional Layers and Feature Maps

Consider the popular seven-layer architectures [3], [4]. The first five layers are convolutional, some of which are followed by pooling layers. These pooling layers can also be considered as “convolutional”, in the sense that they are using sliding windows. The last two layers are fully connected, with an N-way softmax as the output, where N is the number of categories.


The deep network described above needs a fixed image size. However, we notice that the requirement of fixed sizes is only due to the fully-connected layers that demand fixed-length vectors as inputs. On the other hand, the convolutional layers accept inputs of arbitrary sizes. The convolutional layers use sliding filters, and their outputs have roughly the same aspect ratio as the inputs. These outputs are known as feature maps [1]; they involve not only the strength of the responses, but also their spatial positions.


In Figure 2, we visualize some feature maps. They are generated by some filters of the conv layer. Figure 2(c) shows the strongest activated images of these filters in the ImageNet dataset. We see a filter can be activated by some semantic content. For example, the 55-th filter (Figure 2, bottom left) is most activated by a circle shape; the 66-th filter (Figure 2, top right) is most activated by a ∧-shape; and the 118-th filter (Figure 2, bottom right) is most activated by a ∨-shape. These shapes in the input images (Figure 2(a)) activate the feature maps at the corresponding positions (the arrows in Figure 2).


It is worth noticing that we generate the feature maps in Figure 2 without fixing the input size. These feature maps generated by deep convolutional layers are analogous to the feature maps in traditional methods [27], [28]. In those methods, SIFT vectors [29] or image patches [28] are densely extracted and then encoded, e.g., by vector quantization [16], [15], [30], sparse coding [17], [18], or Fisher kernels [19]. These encoded features consist of the feature maps, and are then pooled by Bag-of-Words (BoW) [16] or spatial pyramids [14], [15]. Analogously, the deep convolutional features can be pooled in a similar way.


2.2 The Spatial Pyramid Pooling Layer

The convolutional layers accept arbitrary input sizes, but they produce outputs of variable sizes. The classifiers (SVM/softmax) or fully-connected layers require fixed-length vectors. Such vectors can be generated by the Bag-of-Words (BoW) approach [16] that pools the features together. Spatial pyramid pooling [14], [15] improves BoW in that it can maintain spatial information by pooling in local spatial bins. These spatial bins have sizes proportional to the image size, so the number of bins is fixed regardless of the image size. This is in contrast to the sliding window pooling of the previous deep networks [3], where the number of sliding windows depends on the input size.
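The contrast between the two pooling schemes can be checked with simple arithmetic. The sketch below assumes a 3×3 sliding window with stride 2 and no padding, and a three-level pyramid of 3×3, 2×2 and 1×1 bins; these particular values are illustrative assumptions rather than settings stated in this passage.

```python
def sliding_window_outputs(a, win, stride):
    """Number of pooled outputs per side for an a x a map (no padding)."""
    return (a - win) // stride + 1

def pyramid_bins(levels):
    """Total number of spatial bins; independent of the feature map size."""
    return sum(n * n for n in levels)

for a in (13, 10):
    per_side = sliding_window_outputs(a, win=3, stride=2)
    print(f"{a}x{a} map: sliding-window outputs = {per_side * per_side}, "
          f"pyramid bins = {pyramid_bins((3, 2, 1))}")
```

The sliding-window count changes with the map size, while the bin count stays the same, which is what allows a fixed-length vector to be passed to the fully-connected layers.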


To adopt the deep network for images of arbitrary sizes, we replace the last pooling layer (e.g., pool5, after the last convolutional layer) with a spatial pyramid pooling layer. Figure 3 illustrates our method. In each spatial bin, we pool the responses of each filter (throughout this paper we use max pooling). The outputs of the spatial pyramid pooling are kM-dimensional vectors with the number of bins denoted as M (k is the number of filters in the last convolutional layer). The fixed-dimensional vectors are the input to the fully-connected layer.
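A minimal pure-Python sketch of such a layer is given below. The function name `spp_max`, the list-of-lists data layout, and the floor/ceil rule used for bin boundaries are assumptions made for illustration; the text above only specifies max pooling per bin and a kM-dimensional output.

```python
def spp_max(feature_maps, levels=(3, 2, 1)):
    """Max-pool each channel into n x n bins per level and concatenate.

    feature_maps: list of k channels, each an h x w list of lists,
                  with h and w at least as large as the largest level.
    Returns a flat list of length k * sum(n * n for n in levels).
    """
    out = []
    for n in levels:
        for channel in feature_maps:
            h, w = len(channel), len(channel[0])
            for i in range(n):
                # Assumed bin edges: floor for the start, ceil for the end.
                r0, r1 = (i * h) // n, -(-(i + 1) * h // n)
                for j in range(n):
                    c0, c1 = (j * w) // n, -(-(j + 1) * w // n)
                    out.append(max(channel[r][c]
                                   for r in range(r0, r1)
                                   for c in range(c0, c1)))
    return out

def make_map(h, w):
    """Toy channel whose values increase row by row."""
    return [[r * w + c for c in range(w)] for r in range(h)]

small = spp_max([make_map(10, 7), make_map(10, 7)])    # k = 2 channels
large = spp_max([make_map(13, 13), make_map(13, 13)])
print(len(small), len(large))  # both equal k * M = 2 * 14
```

Inputs of different sizes and aspect ratios produce vectors of the same length, here k = 2 channels times M = 14 bins.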


With spatial pyramid pooling, the input image can be of any size. This not only allows arbitrary aspect ratios, but also allows arbitrary scales. We can resize the input image to any scale (e.g., min(w, h) = 180, 224, ...) and apply the same deep network. When the input image is at different scales, the network (with the same filter sizes) will extract features at different scales. The scales play important roles in traditional methods, e.g., the SIFT vectors are often extracted at multiple scales [29], [27] (determined by the sizes of the patches and Gaussian filters). We will show that the scales are also important for the accuracy of deep networks.


Interestingly, the coarsest pyramid level has a single bin that covers the entire image. This is in fact a “global pooling” operation, which is also investigated in several concurrent works. In [31], [32] a global average pooling is used to reduce the model size and also reduce overfitting; in [33], a global average pooling is used at the testing stage after all fc layers to improve accuracy; in [34], a global max pooling is used for weakly supervised object recognition. The global pooling operation corresponds to the traditional Bag-of-Words method.


2.3 Training the Network

Theoretically, the above network structure can be trained with standard back-propagation [1], regardless of the input image size. But in practice the GPU implementations (such as cuda-convnet [3] and Caffe [35]) are preferably run on fixed input images. Next we describe our training solution that takes advantage of these GPU implementations while still preserving the spatial pyramid pooling behaviors.



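As in previous work, we first consider a network that takes a fixed-size input (224×224) cropped from images. The cropping is for the purpose of data augmentation. For an image of a given size, we can pre-compute the bin sizes needed for spatial pyramid pooling. Consider the feature maps after conv5 that have a size of a×a (e.g., 13×13). For a pyramid level of n×n bins, we implement this pooling level as a sliding window pooling, where the window size is win = ⌈a/n⌉ and the stride is str = ⌊a/n⌋. For an l-level pyramid, we implement l such layers. The outputs of the l layers are then concatenated and fed to the fully-connected layer. Figure 4 shows an example configuration of a 3-level pyramid (3×3, 2×2, 1×1) in the cuda-convnet style.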
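The window and stride formulas can be evaluated directly. The sketch below uses a = 13 and the 3-level pyramid (3×3, 2×2, 1×1) from the text; the function name `pyramid_level_params` is a hypothetical helper, and the output count assumes sliding-window pooling without padding.

```python
import math

def pyramid_level_params(a, n):
    """Window size and stride that pool an a x a map into n x n bins."""
    win = math.ceil(a / n)
    stride = math.floor(a / n)
    return win, stride

a = 13  # conv5 feature map side for a 224x224 input, as in the text
for n in (3, 2, 1):
    win, stride = pyramid_level_params(a, n)
    outputs = (a - win) // stride + 1  # pooled outputs per side
    print(f"{n}x{n} level: win={win}, str={stride}, outputs per side={outputs}")
```

For these three levels, the number of pooled outputs per side equals n, so each level yields exactly n×n bins.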

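Our network with SPP is expected to be applied to images of any size. To address the issue of varying image sizes in training, we consider a set of pre-defined sizes. We currently consider two sizes: 180×180 in addition to 224×224. Rather than cropping a smaller 180×180 region, we resize the aforementioned 224×224 region to 180×180. So the regions at both scales differ only in resolution, not in content or layout. For the network to accept 180×180 inputs, we implement another fixed-size-input network. The feature map size after conv5 is a×a = 10×10 in this case. We still use win = ⌈a/n⌉ and str = ⌊a/n⌋ to implement each pyramid pooling level. The output of the spatial pyramid pooling layer of this 180-network then has the same size as that of the 224-network.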
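The claim that both networks produce outputs of the same length can be verified numerically. This sketch assumes the 3-level pyramid (3×3, 2×2, 1×1) from the Figure 4 example and k = 256 conv5 filters; `spp_output_length` is a hypothetical helper that counts outputs of unpadded sliding-window pooling.

```python
import math

def spp_output_length(a, levels, k):
    """Length of the concatenated SPP output for an a x a conv5 map."""
    total = 0
    for n in levels:
        win, stride = math.ceil(a / n), math.floor(a / n)
        per_side = (a - win) // stride + 1
        total += per_side * per_side
    return k * total

levels = (3, 2, 1)
len_224 = spp_output_length(13, levels, k=256)  # 224x224 input, a = 13
len_180 = spp_output_length(10, levels, k=256)  # 180x180 input, a = 10
print(len_224, len_180)
```

Both networks therefore feed vectors of identical length into the shared fully-connected layers, which is what allows them to share all parameters.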

3 SPP-net for Image Classification

3.1 Experiments on ImageNet 2012 Classification

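Dropout [3] is used on the last two fully-connected layers. The learning rate starts from 0.01 and is divided by 10 when the error plateaus. Our implementation is based on the publicly available code of cuda-convnet [3] and Caffe [35]. All networks were trained on a single GeForce GTX Titan GPU (6 GB memory), taking two to four weeks.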
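3.1.1 Baseline Network Architectures
– ZF-5: this architecture is based on the “fast” model of Zeiler and Fergus [4]. The number 5 indicates five convolutional layers.

– Convnet*-5: a modification of the network of Krizhevsky et al. [3]. We put the two pooling layers after conv2 and conv3 (instead of after conv1 and conv2). As a result, the feature maps after each layer have the same size as in ZF-5.
– Overfeat-5/7: this architecture is based on the Overfeat paper [5], with some modifications as in [6]. In contrast to ZF-5/Convnet*-5, it produces a larger feature map (18×18 instead of 13×13) before the last pooling layer. A larger number of filters (512) is used in conv3 and the following convolutional layers. We also investigate a deeper architecture with 7 convolutional layers, where conv3 to conv7 have the same structure.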

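3.1.2 Multi-level Pooling Improves Accuracy
In Table 2 (b) we show the results using single-size training. The training and testing sizes are both 224×224. In these networks, the convolutional layers have the same structures as in the corresponding baseline models, whereas the pooling layer after the final convolutional layer is replaced with the SPP layer. For the results in Table 2, we use a 4-level pyramid, {6×6, 3×3, 2×2, 1×1} (50 bins in total). For a fair comparison, we still use the standard 10-view prediction, with each view being a 224×224 crop. The results in Table 2 (b) show a considerable improvement. Interestingly, the largest gain (1.65% in top-1 error) is given by the most accurate architecture. Since we are still using the same 10 cropped views, these gains can only come from multi-level pooling.
It is worth noticing that the gain of multi-level pooling is not simply due to more parameters; rather, it is because multi-level pooling is more robust to the variance in object deformations and spatial layout [15]. To show this, we train another ZF-5 network with a different 4-level pyramid, {4×4, 3×3, 2×2, 1×1} (30 bins in total). This network has fewer parameters, because its fully-connected layer fc6 has 30×256-d inputs instead of 36×256-d. The top-1/top-5 errors of this network are 35.06/14.04, which is similar to the 50-bin pyramid network and considerably better than the no-SPP baseline (35.99/14.76).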
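The bin counts and fc6 input sizes quoted above follow from simple arithmetic, reproduced in the short check below. It assumes 256 conv5 filters (the factor that appears in the text) and reads the 36×256 figure as a single 6×6 pooled map, which is an interpretation rather than something stated explicitly here.

```python
def total_bins(levels):
    """Total number of spatial bins over all pyramid levels."""
    return sum(n * n for n in levels)

K = 256  # number of conv5 filters, as implied by the "x256" factor in the text

bins_50 = total_bins((6, 3, 2, 1))
bins_30 = total_bins((4, 3, 2, 1))
bins_single = total_bins((6,))  # assumed 6x6 single-level pooling

for name, m in (("4-level {6,3,2,1}", bins_50),
                ("4-level {4,3,2,1}", bins_30),
                ("single 6x6", bins_single)):
    print(f"{name}: M = {m} bins, fc6 input = {K * m}-d")
```

The 30-bin pyramid gives fc6 fewer inputs than the 36-bin single-level case, which is why its better accuracy cannot be attributed to extra parameters.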
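3.1.3 Multi-size Training Improves Accuracy
Table 2 (c) shows our results using multi-size training. The training sizes are 224 and 180, while the testing size is still 224. We still use the standard 10-view prediction. The top-1/top-5 errors of all architectures drop further. The top-1 error of SPP-net (Overfeat-7) drops to 29.68%, which is 2.33% better than its no-SPP counterpart and 0.68% better than its single-size trained counterpart.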
3.1.4 Full-image Representations Improve Accuracy

3.1.5 Multi-view Testing on Feature Maps

3.2 Experiments on VOC 2007 Classification


3.3 Experiments on Caltech101


4 SPP-net for Object Detection


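We apply SPP-net to object detection. We extract the feature maps from the entire image only once. Then we apply spatial pyramid pooling on each candidate window of the feature maps to pool a fixed-length representation of this window (see Figure 5). Because the convolutional network is applied only once, our method runs much faster. Our method extracts window-wise features directly from regions of the feature maps, while R-CNN extracts them from image regions. In previous works, the Deformable Part Model (DPM) extracts features from windows in HOG [24] feature maps, and the Selective Search method [20] extracts features from windows in encoded SIFT feature maps. The Overfeat detection method also extracts features from windows of convolutional feature maps, but needs a pre-defined window size. In contrast, our method enables feature extraction in arbitrary windows of the deep convolutional feature maps.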

4.1 Detection Algorithm

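We use the “fast” mode of Selective Search [20] to generate about 2,000 candidate windows per image. Then we resize the image such that min(w, h) = s, and extract the feature maps from the entire image. For the time being, we use the SPP-net model of ZF-5 (single-size trained). In each candidate window, we use a 4-level spatial pyramid (1×1, 2×2, 3×3, 6×6, 50 bins in total) to pool the features. This generates a 12,800-d (256×50) representation for each window. These representations are provided to the fully-connected layers of the network. Then we train a binary linear SVM classifier for each category on these features. Our implementation of the SVM training follows [20], [7]. We use the ground-truth windows to generate the positive samples. The negative samples are those overlapping a positive window by at most 30% (measured by the intersection-over-union (IoU) ratio).
Any negative sample is removed if it overlaps another negative sample by more than 70%. We apply standard hard negative mining [23] to train the SVM. This step is iterated once. It takes less than 1 hour to train SVMs for all 20 categories. In testing, the classifier is used to score the candidate windows. Then we use non-maximum suppression [23] (threshold of 30%) on the scored windows.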
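The overlap-based rules above (negatives at most 30% IoU with a positive window, and non-maximum suppression at a 30% threshold) can be sketched as follows. The box format `(x1, y1, x2, y2)`, the function names, and the greedy suppression order are assumptions for illustration, not the implementation used by the cited systems.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2), x2 > x1, y2 > y1."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def is_negative(window, positives, max_iou=0.3):
    """A window is a negative sample if it overlaps every positive by <= max_iou."""
    return all(iou(window, p) <= max_iou for p in positives)

def nms(boxes, scores, threshold=0.3):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box too much.
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

In the example, the second box overlaps the first by far more than 30% and is suppressed, while the distant third box is kept.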
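Our method can be improved by multi-scale feature extraction. We resize the image such that min(w, h) = s ∈ S = {480, 576, 688, 864, 1200}, and compute the conv5 feature maps for each scale. One strategy for combining the features from these scales is to pool them channel by channel. But we empirically find that another strategy gives better results. For each candidate window, we choose a single scale s ∈ S such that the scaled candidate window has a number of pixels closest to 224×224. Then we only use the feature maps extracted at this scale to compute the feature of this window. If the pre-defined scales are dense enough and the window is approximately square, our method is roughly equivalent to resizing the window to 224×224 and then extracting features from it. Nevertheless, our method only computes the feature maps once at each scale, regardless of the number of candidate windows.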
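The per-window scale selection rule can be written as a short function. The sketch below assumes window and image sizes are given in pixels of the original image, and `select_scale` is a hypothetical name; only the scale set S and the 224×224 target come from the text.

```python
SCALES = (480, 576, 688, 864, 1200)  # the set S from the text

def select_scale(image_w, image_h, win_w, win_h,
                 scales=SCALES, target=224 * 224):
    """Pick s in `scales` so the resized window's pixel count is closest to `target`."""
    shorter = min(image_w, image_h)

    def scaled_area(s):
        ratio = s / shorter  # resizing makes min(w, h) equal to s
        return (win_w * ratio) * (win_h * ratio)

    return min(scales, key=lambda s: abs(scaled_area(s) - target))

# Hypothetical 500x375 image with a small and a large candidate window.
print(select_scale(500, 375, 100, 100))
print(select_scale(500, 375, 300, 300))
```

Small windows are assigned to larger scales and large windows to smaller ones, so every window is pooled from a feature map where it covers roughly the same number of pixels.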
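For fine-tuning, the data layer accepts the fixed-length pooled features after conv5, followed by the fc6 and fc7 layers and a new 21-way fc8 layer (with one extra negative category). The fc8 weights are initialized with a Gaussian distribution of σ = 0.01. We fix all the learning rates to 1e-4 and then adjust them to 1e-5 for all three layers. During fine-tuning, the positive samples are those overlapping a ground-truth window by [0.5, 1], and the negative samples are those overlapping by [0.1, 0.5). In each mini-batch, 25% of the samples are positive. We train 250k mini-batches using the learning rate 1e-4, and then 50k mini-batches using 1e-5.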
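The fine-tuning settings above can be collected into a few small helpers. The function names and the batch size of 128 used in the example are assumptions; the overlap ranges, the 25% positive fraction, and the two learning-rate stages are taken from the text.

```python
def finetune_label(max_iou):
    """Label a window by its best IoU with any ground-truth window."""
    if 0.5 <= max_iou <= 1.0:
        return "pos"
    if 0.1 <= max_iou < 0.5:
        return "neg"
    return None  # windows outside both ranges are not used

def learning_rate(iteration):
    """1e-4 for the first 250k mini-batches, then 1e-5 for the next 50k."""
    return 1e-4 if iteration < 250_000 else 1e-5

def batch_composition(batch_size, pos_fraction=0.25):
    """Number of positive and negative samples in one mini-batch."""
    n_pos = int(batch_size * pos_fraction)
    return n_pos, batch_size - n_pos

# Hypothetical batch size; the text does not state one.
print(batch_composition(128))
print(finetune_label(0.62), finetune_label(0.3), finetune_label(0.05))
```

Keeping the positive fraction fixed per mini-batch counteracts the fact that negative windows greatly outnumber positive ones among the candidates.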

4.2 Detection Results

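We evaluate our method on the detection task of the Pascal VOC 2007 dataset. Table 9 shows our results on various layers, using 1-scale (s = 688) or 5-scale. The R-CNN results are as reported in [7], using AlexNet [3] with 5 convolutional layers. Using the pool5 layer, our result is 44.9%, compared with 44.2% for R-CNN. But using the non-fine-tuned fc6 layer, our results are inferior. An explanation is that our fc layers are pre-trained using image regions, while in the detection case they are used on feature map regions. The feature map regions can have strong activations near the window boundaries, while the image regions may not. This difference in usage can be addressed by fine-tuning. Using the fine-tuned fc layers, our results are slightly better than those of R-CNN. After bounding box regression, our 5-scale result (59.2%) is 0.7% better than R-CNN (58.5%), while our 1-scale result (58.0%) is 0.5% worse.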


4.3 Complexity and Running Time


4.4 Model Combination for Detection


4.5 ILSVRC 2014 Detection


5 Conclusion






