Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition


In this work, we equip the networks with another pooling strategy, “spatial pyramid pooling”. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale.

The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors.

INTRODUCTION

a conventional CNN:
[image] → [crop/warp] → [conv] → [layers] → [fc layers] → [output]

spatial pyramid pooling network structure:
[image] → [conv layers] → [spatial pyramid pooling] → [fc layers] → [output]

The fixed-size constraint comes only from the fully-connected layers, which exist at a deeper stage of the network.

The SPP layer pools the features and generates fixed-length outputs, which are then fed into the fully-connected layers (or other classifiers). Spatial pyramid pooling can be viewed as an extension of the Bag-of-Words (BoW) model.

We note that SPP has several remarkable properties for deep CNNs:

  1. SPP is able to generate a fixed-length output regardless of the input size, while the sliding window pooling used in previous deep networks cannot.
  2. SPP uses multi-level spatial bins, while the sliding window pooling uses only a single window size. Multi-level pooling has been shown to be robust to object deformations.
  3. SPP can pool features extracted at variable scales thanks to the flexibility of input scales.

SPP-net also allows us to feed images with varying sizes or scales during training.

DEEP NETWORKS WITH SPATIAL PYRAMID POOLING

Convolutional Layers and Feature Maps

These pooling layers can also be considered “convolutional”, in the sense that they use sliding windows.

These outputs are known as feature maps; they involve not only the strength of the responses, but also their spatial positions.

The Spatial Pyramid Pooling Layer


Spatial pyramid pooling improves BoW in that it can maintain spatial information by pooling in local spatial bins. These spatial bins have sizes proportional to the image size, so the number of bins is fixed regardless of the image size.

In each spatial bin, we pool the responses of each filter (throughout this paper we use max pooling). The outputs of the spatial pyramid pooling are kM-dimensional vectors, where M is the number of bins and k is the number of filters in the last convolutional layer. These fixed-dimensional vectors are the input to the fully-connected layer.

The coarsest pyramid level has a single bin that covers the entire image. This global pooling operation corresponds to the traditional Bag-of-Words method.
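The layer described above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: it assumes a single k × h × w feature map and uses the floor/ceil bin boundaries given in Appendix A; the pyramid levels `(1, 2, 4)` are an example choice.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (k, h, w) conv feature map into a fixed-length vector.

    Each level n contributes an n x n grid of bins, so the output has
    k * M entries with M = sum of n^2 over the levels, independent of
    the spatial size (h, w) of the input feature map.
    """
    k, h, w = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(1, n + 1):          # bin column index (width)
            x0 = int(np.floor((i - 1) / n * w))
            x1 = int(np.ceil(i / n * w))
            for j in range(1, n + 1):      # bin row index (height)
                y0 = int(np.floor((j - 1) / n * h))
                y1 = int(np.ceil(j / n * h))
                # max-pool this spatial bin, one value per filter channel
                pooled.append(feature_map[:, y0:y1, x0:x1].max(axis=(1, 2)))
    return np.concatenate(pooled)
```

Feature maps of different spatial sizes yield vectors of identical length, which is exactly what lets the fully-connected layers accept arbitrary input sizes.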

Training the Network

Single-size training

we implement this pooling level as a sliding window pooling, where the window size win = ⌈a/n⌉ and stride str = ⌊a/n⌋, with ⌈·⌉ and ⌊·⌋ denoting the ceiling and floor operations. With an l-level pyramid, we implement l such layers. The next fully-connected layer (fc6) concatenates the l outputs.
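The window/stride formulas above are simple to compute; a small helper makes the single-size configuration concrete (the 13 × 13 example below assumes a conv feature map of size a = 13, as in the paper's illustration):

```python
import math

def spp_window_stride(a, n):
    """Single-size training: for an a x a conv output and a pyramid level
    with n x n bins, emulate SPP with a sliding pooling window of size
    win = ceil(a/n) and stride str = floor(a/n)."""
    return math.ceil(a / n), math.floor(a / n)
```

For a 13 × 13 map, a 3 × 3 level uses a 5 × 5 window with stride 4, and a 6 × 6 level uses a 3 × 3 window with stride 2; the 1 × 1 level degenerates to global pooling.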

Multi-size training

we train each full epoch on one network, and then switch to the other one (keeping all weights) for the next full epoch.

The main purpose of our multi-size training is to simulate the varying input sizes while still leveraging the existing well-optimized fixed-size implementations.

Note that the above single/multi-size solutions are for training only. At the testing stage, it is straightforward to apply SPP-net on images of any sizes.

SPP-NET FOR IMAGE CLASSIFICATION

Experiments on ImageNet 2012 Classification

Baseline Network Architectures

  • ZF-5
  • Convnet*-5
  • Overfeat-5/7

Multi-level Pooling Improves Accuracy

It is worth noticing that the gain of multi-level pooling is not simply due to more parameters; rather, it is because the multi-level pooling is robust to the variance in object deformations and spatial layout.

Multi-size Training Improves Accuracy

Full-image Representations Improve Accuracy

First, we empirically find that even for the combination of dozens of views, the additional two full-image views (with flipping) can still boost the accuracy by about 0.2%.

Second, the full-image view is methodologically consistent with the traditional methods, where the encoded SIFT vectors of the entire image are pooled together.

Third, in other applications such as image retrieval, an image representation, rather than a classification score, is required for similarity ranking. A full-image representation can be preferred.

Multi-view Testing on Feature Maps

Summary and Results for ILSVRC 2014

Experiments on VOC 2007 Classification

Experiments on Caltech101

SPP-NET FOR OBJECT DETECTION

We extract the feature maps from the entire image only once (possibly at multiple scales). Then we apply the spatial pyramid pooling on each candidate window of the feature maps to pool a fixed-length representation of this window. Because the time-consuming convolutions are only applied once, our method can run orders of magnitude faster.
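The share-then-pool idea can be sketched as follows. This is a simplified illustration, not the paper's detection pipeline: it assumes window coordinates are already mapped onto the conv feature map, and uses an example two-level pyramid.

```python
import numpy as np

def pool_region(conv_features, x0, y0, x1, y1, levels=(1, 2)):
    """Pool one candidate window of a precomputed (k, h, w) conv feature
    map into a fixed-length vector. The expensive convolutions run once
    per image; each proposal only costs this cheap pooling step."""
    region = conv_features[:, y0:y1, x0:x1]
    k, h, w = region.shape
    out = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                ys = slice(int(np.floor(j / n * h)), int(np.ceil((j + 1) / n * h)))
                xs = slice(int(np.floor(i / n * w)), int(np.ceil((i + 1) / n * w)))
                out.append(region[:, ys, xs].max(axis=(1, 2)))
    return np.concatenate(out)
```

Windows of different sizes produce representations of the same length, so a single detector head can score every proposal against the one shared feature map.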

Detection Algorithm

Detection Results

Complexity and Running Time

Model Combination for Detection

We further find that the complementarity is mainly because of the convolutional layers.

ILSVRC 2014 Detection

CONCLUSION

APPENDIX A

Mean Subtraction.

Implementation of Pooling Bins.

For a pyramid level with n × n bins, the (i, j)-th bin is in the range of [⌊(i−1)·w/n⌋, ⌈i·w/n⌉] × [⌊(j−1)·h/n⌋, ⌈j·h/n⌉].
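A direct transcription of this formula (with 1-indexed bins, as in the paper's notation):

```python
import math

def bin_range(i, j, n, w, h):
    """Boundaries of the (i, j)-th bin (1-indexed) of an n x n pyramid
    level over a w x h feature map, using the floor/ceil formula above.
    Returns ((x0, x1), (y0, y1)) as half-open-style pixel extents."""
    x = (math.floor((i - 1) / n * w), math.ceil(i / n * w))
    y = (math.floor((j - 1) / n * h), math.ceil(j / n * h))
    return x, y
```

When w or h is not divisible by n, the floor/ceil rounding makes neighboring bins overlap slightly rather than leave pixels uncovered, so every bin is non-empty and the whole map is tiled.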

Mapping a Window to Feature Maps.
